Error Analysis and Diagnosis
This section addresses error analysis and diagnosis practices.
ECC Replacement Guidelines
The Solaris OS can report three categories of ECC memory errors: intermittent, persistent, and sticky. These errors are reported to the console and to the system messages file. In addition, the Sun Fire 15K server detects ECC errors in hardware and takes a snapshot of the hardware state. These hardware state dumps, known as record stops, are stored in the SC directory structure for the affected domain. ECC DIMM errors relate to main-memory modules, not to the level-two (L2) cache. For more information, refer to the Sun BluePrints OnLine article "Solaris Operating System Availability Features."
The ECC DIMM memory errors are categorized as follows:
Intermittent: The error was not detected on a reread of the affected memory location.
Persistent: The error was detected again on a reread of the affected memory location, but the scrub operation corrected it.
Sticky: The error still exists in memory, even after the scrub operation.
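The three categories can be read as a two-step decision on the reread and scrub outcomes. The following Python fragment is only an illustrative sketch of that decision. The names EccCategory, classify_ecc_error, seen_on_reread, and corrected_by_scrub are hypothetical and do not correspond to any Solaris OS or SMS interface.

```python
from enum import Enum


class EccCategory(Enum):
    """The three ECC DIMM error categories described in the text."""
    INTERMITTENT = "intermittent"
    PERSISTENT = "persistent"
    STICKY = "sticky"


def classify_ecc_error(seen_on_reread: bool, corrected_by_scrub: bool) -> EccCategory:
    """Map the reread and scrub outcomes to a category (hypothetical helper)."""
    if not seen_on_reread:
        # The error did not reappear when the location was reread.
        return EccCategory.INTERMITTENT
    if corrected_by_scrub:
        # The error reappeared on reread, but the scrub operation fixed it.
        return EccCategory.PERSISTENT
    # The error is still present even after the scrub operation.
    return EccCategory.STICKY
```

In this sketch the scrub outcome is ignored when the error is not seen on the reread, because the definition of an intermittent error depends only on the reread.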
TIP
If the ECC error is intermittent, review the error reports.
If the ECC error is persistent, replace the DIMM if three or more errors occur in a 24-hour period on the same DIMM.
If the ECC error is sticky, replace the DIMM on first occurrence.
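The replacement guidelines in this tip can be expressed as a small rule check. The Python sketch below is illustrative only and makes several assumptions: the function name should_replace_dimm and its arguments are hypothetical, the error timestamps are assumed to have already been extracted for a single DIMM and a single category, and "in a 24-hour period" is interpreted as a sliding window in which the first and last of three errors are at most 24 hours apart.

```python
from datetime import datetime, timedelta
from typing import Iterable

PERSISTENT_THRESHOLD = 3          # three or more persistent errors ...
WINDOW = timedelta(hours=24)      # ... within a 24-hour period (assumed sliding window)


def should_replace_dimm(category: str, error_times: Iterable[datetime]) -> bool:
    """Apply the replacement guidelines to one DIMM (hypothetical helper).

    category    -- "intermittent", "persistent", or "sticky"
    error_times -- timestamps of errors of that category on the same DIMM
    """
    times = sorted(error_times)
    if category == "sticky":
        # Replace on first occurrence.
        return len(times) >= 1
    if category == "persistent":
        # Look for any run of three errors that fits inside the window.
        for i in range(len(times) - PERSISTENT_THRESHOLD + 1):
            if times[i + PERSISTENT_THRESHOLD - 1] - times[i] <= WINDOW:
                return True
        return False
    if category == "intermittent":
        # No replacement rule is given; the reports should be reviewed instead.
        return False
    raise ValueError(f"unknown ECC error category: {category!r}")
```

For example, three persistent errors logged at 01:00, 09:00, and 20:00 on the same day meet the threshold, while three persistent errors spread over three days do not.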
Maintain Current Explorer Data
Sun Fire server systems are designed with significant diagnostic capabilities. In the event of a system fault, the system should provide information for both hardware and software failures, which can be used to help determine the source of the fault. Errors can be reported and logged to several places depending on the type of error. Explorer software is the utility of choice for gathering the state of SCs and domains at the time a failure occurs. Be sure to use the most current release of Explorer software to capture all of the appropriate data.
For more information about Explorer, refer to:
http://sunsolve.Sun.COM/pub-cgi/show.pl?target=explorer/explorer
TIP
Use the latest release of Explorer software.
Fault Isolation
One of the most interesting features of the Sun Fire 15K server is the ability to reconfigure the platform and the domain through the software command line interface (CLI). In fact, the centerplane can be placed into a degraded mode of operation without shutting down the running domains. In most cases, this provides an opportunity to isolate a fault in a field replaceable unit (FRU) before handling any hardware and risking damage to single points of failure (SPOFs) such as the centerplane. For any hardware-detected error, first try to isolate the failing components using the information provided by the software (hpost, redx, dsmd dumps, log files). Before attempting hardware replacement, use the SMS command disablecomponent to deconfigure and isolate the failing hardware.
TIP
Before hardware replacement, use the SMS command disablecomponent to deconfigure and isolate the faulty hardware.
SMS 1.4 software adds Auto Diagnosis and Recovery functionality that helps detect system failures when they occur, deconfigure faulty components from the system, and automatically restore the system. These functions help customers minimize both planned and unplanned downtime. The new features are as follows:
Automatically deconfigure the faulty components out of a system, reducing unplanned downtime (component health status).
Provide detailed error messages for faster problem resolution and faster time to service (auto diagnosis).
Detect potential CPU cache failures and offline affected CPUs, keeping the system up and applications available (CPU taken offline).
Automatically restore domains, reducing the impact of faulty components on system availability (auto recovery).
Automatically generate email notice informing designated recipients of domain events when they occur (email event notification).
For more information, refer to the Sun BluePrints OnLine article titled "Sun Fire 15K/12K Auto Diagnosis and Recovery."
TIP
Also check the event notifications when deconfiguring and isolating faulty hardware.