Solaris OE Enhancements
Kernel updates for the Solaris 8 OE and Solaris 9 OE on UltraSPARC III systems enhance the handling of correctable errors (CEs) on L2_SRAM modules. Multiple CEs on accesses to an L2_SRAM module indicate a higher probability of a subsequent uncorrectable error (UE). To prevent a fatal UE, the Solaris OE attempts to take the affected CPUs offline. Domain availability increases because the Solaris OE no longer accesses L2_SRAM modules that have an increased failure probability.
The enhanced Solaris OE kernels can communicate hardware failures to the SC. If the system is running the appropriate kernel update, either the Solaris 8 OE (2/02) with patches 115831-01, 115829-01, and 108528-27 or the Solaris 9 OE (12/03) with patch 116009-01, a message is sent to the SC when the Solaris OE identifies and isolates a faulty L2_SRAM module. Because the system controller has recorded the component as faulty in its CHS, the failed L2_SRAM module is not reconfigured into a domain on future domain reboots or on setkeyswitch off and setkeyswitch on operations.
Similar to memory page retirement, the Solaris OE keeps track of the number of ECC errors over time on an L2_SRAM module (see FIGURE 4). Two types of ECC errors are considered here: nonfatal multibit errors (UCU, CPU, WDU, EDU) and nonfatal single-bit correctable errors (UCC, CPC, WDC, EDC). If an L2_SRAM module experiences one nonfatal multibit error or three single-bit correctable errors in a 24-hour window, the L2_SRAM module is diagnosed as having an increased probability of suffering a fatal failure in the future. In this scenario, the enhanced Solaris OE automatically attempts to take the affected CPU module offline. The offline attempt might not succeed, because processes might be bound to the CPU.
FIGURE 4 Solaris OE L2_SRAM Error Handling
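The threshold rule described above can be illustrated with a short sliding-window sketch. This is not the Solaris kernel implementation; the class name, method name, and timestamp handling are hypothetical, and timestamps are assumed to arrive in nondecreasing order. The sketch treats the third single-bit error within the window as the trigger, following the text; the kernel message wording ("more than 3") suggests that the actual comparison might differ, so the threshold is left as a parameter.

```python
from collections import deque

# Error type names are taken from the text above.
MULTIBIT = {"UCU", "CPU", "WDU", "EDU"}    # nonfatal multibit errors
SINGLE_BIT = {"UCC", "CPC", "WDC", "EDC"}  # nonfatal single-bit correctable errors

WINDOW_SECONDS = 24 * 60 * 60  # 24-hour window


class L2SramErrorTracker:
    """Hypothetical per-module tracker (illustrative only)."""

    def __init__(self, ce_threshold=3, window=WINDOW_SECONDS):
        self.ce_threshold = ce_threshold  # assumed comparison: count >= threshold
        self.window = window
        self.ce_times = deque()           # timestamps of recent single-bit errors

    def record(self, error_type, timestamp):
        """Record one error; return True if a CPU offline should be attempted."""
        if error_type in MULTIBIT:
            # A single nonfatal multibit error is enough to trigger.
            return True
        if error_type not in SINGLE_BIT:
            raise ValueError(f"unknown error type: {error_type}")
        self.ce_times.append(timestamp)
        # Discard single-bit errors that have aged out of the window.
        while self.ce_times and timestamp - self.ce_times[0] >= self.window:
            self.ce_times.popleft()
        return len(self.ce_times) >= self.ce_threshold
```

For example, three single-bit errors recorded one hour apart cause the third call to record() to return True, whereas a third error arriving more than 24 hours after the first two does not, because the earlier events have aged out of the window.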
The following example shows the message that is displayed after the Solaris OE successfully takes offline a CPU that experienced more than three CE events in a 24-hour period.
Feb 3 06:38:40 doma SUNW,UltraSPARC-III: NOTICE: [AFT1] CPU6 offline due to more than 3 xxC Events in 24:00:00 (hh:mm:ss)
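A monitoring script could watch the system log for this notice. The following is a minimal sketch that assumes the message always has the layout of the single example shown above; the regular expression, function name, and returned field names are hypothetical, and other kernel releases might word the message differently.

```python
import re

# Pattern derived only from the one sample message above (an assumption).
AFT1_OFFLINE = re.compile(
    r"\[AFT1\] CPU(?P<cpu>\d+) offline due to more than "
    r"(?P<count>\d+) (?P<kind>\w+) Events in "
    r"(?P<h>\d+):(?P<m>\d+):(?P<s>\d+)"
)


def parse_cpu_offline(line):
    """Return details of a CPU offline notice, or None if the line does not match."""
    match = AFT1_OFFLINE.search(line)
    if match is None:
        return None
    # Convert the hh:mm:ss window into seconds.
    window = int(match["h"]) * 3600 + int(match["m"]) * 60 + int(match["s"])
    return {
        "cpu": int(match["cpu"]),
        "threshold": int(match["count"]),
        "event_kind": match["kind"],
        "window_seconds": window,
    }
```

Applied to the sample message, the function reports CPU 6, a threshold of 3, the event kind xxC, and a window of 86400 seconds.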
Once a CPU is taken offline, the Solaris OE sends a message to the SC. The SC updates the CHS of the affected FRU so that the faulty CPU is not configured into a domain on future reboots or setkeyswitch off and setkeyswitch on events.
Taking offline the CPUs associated with L2_SRAM modules that have a higher probability of experiencing fatal errors increases the availability of the Solaris OE. Communication between the Solaris OE and the SC, which allows the CHS to be stored persistently, further increases availability and makes the system easier to diagnose and service. CPU/Memory boards that are dynamically reconfigured can be replaced with minimal impact on the Solaris OE and user applications.