- Recommendations for Applying Preferred Practices
- Principles of Mission-Critical Implementations
- Physical Environment
- Internal Network Planning
- External Network Planning
- System Controller Configuration
- Platform and Domain Administration
- Security
- Error Analysis and Diagnosis
- Platform and Domain Configuration
- Dynamic Reconfiguration
- References
- Related Resources
Error Analysis and Diagnosis
The Sun Fire 15K/12K servers have many tools and utilities designed to help monitor, isolate, and report faults and errors.
Detecting ECC Errors
Occasionally, Sun Fire 15K/12K processors detect a correctable memory error, reported as an error correction code (ECC) error. When the server detects an ECC error, it tries to correct the error, logs the error to its fault status register, and continues operation. Additionally, the Solaris OE logs these errors to the error log and to the system console as part of the error reporting process. Several errors might be logged in this process for just one correctable error event. Recognizing and understanding these types of errors can be complex and is best left to Sun Support Services engineers.
For information about memory error handling and a summary of asynchronous fault tags (AFTs), refer to the Sun white paper "Soft Memory Errors and Their Effect on Sun Fire System," available at http://www.sun.com/products-n-solutions/hardware/docs/pdf/816-5053-10.pdf
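Because one correctable error event can produce several log entries, a raw count of entries overstates how many events occurred. The following is a minimal sketch of one rough heuristic, assuming that entries belonging to the same event arrive within a few seconds of each other. The `(timestamp, message)` tuple format, the `group_events` function, the five-second window, and the sample messages are all hypothetical; they do not reproduce actual Solaris OE log output, and the sketch is no substitute for analysis by Sun Support Services.

```python
from datetime import datetime, timedelta


def group_events(entries, window_seconds=5):
    """Collapse log entries that are close together in time into events.

    entries: iterable of (timestamp, message) tuples (hypothetical format).
    Returns a list of events, each a list of the entries grouped into it.
    """
    window = timedelta(seconds=window_seconds)
    events = []
    for stamp, message in sorted(entries):
        # Join the previous event if this entry follows its last entry closely.
        if events and stamp - events[-1][-1][0] <= window:
            events[-1].append((stamp, message))
        else:
            events.append([(stamp, message)])
    return events


# Placeholder messages, not real log lines.
sample = [
    (datetime(2003, 1, 1, 12, 0, 0), "correctable error (error log)"),
    (datetime(2003, 1, 1, 12, 0, 1), "correctable error (console)"),
    (datetime(2003, 1, 1, 12, 0, 2), "correctable error (detail)"),
    (datetime(2003, 1, 1, 14, 30, 0), "correctable error (error log)"),
]

events = group_events(sample)
print(len(sample), "entries grouped into", len(events), "events")
```

A time window is only an approximation: two unrelated errors that occur close together would be merged, so treat the grouped count as a lower bound.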
Selecting Diagnostic Tools
To capture information used in support and problem resolution, implement a consistent process and set of tools on all Sun Fire 15K/12K server domains and system controllers. We recommend that you install the Sun Explorer Data Collection tool on all domains and run it after every major change event. It is best to run the Sun Explorer scripts when the system is fully up, but not under a heavy load. Optionally, you can install the Sun Explorer tool and run it from an NFS server. The Sun Explorer tool changes frequently, so keep your installed version current. The tool is available for download from the Sun support portal at http://sunsolve.sun.com. You can also use the output data as an optional follow-up check of the installation using the Sun Services RAS profile.
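The advice to collect data when the system is up but not heavily loaded can be turned into a simple pre-check in a wrapper script. The sketch below is one possible approach, not part of the Sun Explorer tool: the `light_enough` and `current_load_ok` helpers and the threshold of 0.5 load per CPU are assumptions chosen for illustration. The collection command itself is left out because its path and options depend on the installed version.

```python
import os


def light_enough(load_1min, cpu_count, max_load_per_cpu=0.5):
    """Return True when the 1-minute load per CPU is at or below the threshold."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return (load_1min / cpu_count) <= max_load_per_cpu


def current_load_ok(max_load_per_cpu=0.5):
    """Check the local system's 1-minute load average (Unix-like systems only)."""
    load_1min, _, _ = os.getloadavg()
    return light_enough(load_1min, os.cpu_count() or 1, max_load_per_cpu)


if __name__ == "__main__":
    if current_load_ok():
        print("Load is light; run the data collection now.")
    else:
        print("Load is heavy; defer the data collection.")
```

Keeping the threshold logic in a function separate from the load query makes it easy to tune the threshold for each domain.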
Isolating Faulty Components
The Sun Fire 15K/12K servers have several mechanisms for identifying and isolating faulty components. These mechanisms include redx, POST, dsmd dumps, automatic system recovery (ASR), and log files. The redx program (sometimes referred to as "red cross" or "red ex") is used for debugging and maintenance purposes. It is normally reserved for Sun Services engineers performing low-level hardware and firmware diagnostics. You can run the redx program offline on a separate Solaris workstation to examine Dstop (domain stop) dump files when investigating fatal errors, such as CPU internal failures.
The dsmd (domain status monitoring daemon) dumps are files created by the SMS domain status monitoring daemon. If the daemon detects a component fault, contact Sun Services before attempting any physical hardware replacement. You can use the SMS software commands to isolate components and take them offline until Sun Services can replace them.
In some situations, the ASR feature of Sun Fire 15K/12K servers automatically detects a fault in a component, and the SMS process places the component in the blacklist file. POST then deconfigures the failed component from the system on the next reboot.
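The important point in this sequence is timing: blacklisting a component records the fault, but the component is removed from the configuration only at the next reboot. The toy model below illustrates that ordering. It is a conceptual sketch only; the component names and the `post_configure` function are hypothetical and do not represent SMS commands, the blacklist file format, or actual POST behavior.

```python
def post_configure(installed, blacklist):
    """Return the components a boot-time self-test would configure, in order."""
    return [name for name in installed if name not in blacklist]


# Hypothetical component names, for illustration only.
installed = ["cpu0", "cpu1", "mem0", "mem1"]
blacklist = set()

# Initial boot: nothing is blacklisted, so every component is configured.
configured = post_configure(installed, blacklist)

# A fault is detected in cpu1 and recorded in the blacklist.
# The running configuration does not change yet.
blacklist.add("cpu1")
still_configured = list(configured)

# On the next reboot, the self-test skips the blacklisted component.
after_reboot = post_configure(installed, blacklist)

print("before reboot:", still_configured)
print("after reboot: ", after_reboot)
```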