Solaris OE Enhancements
Kernel updates for Solaris 8 OE and Solaris 9 OE on UltraSPARC III systems enhance the handling of correctable errors (CEs) on DIMMs and L2_SRAM modules. Multiple CEs on a DIMM or L2_SRAM module indicate a higher probability of a subsequent uncorrectable error (UE). To prevent a fatal UE, the affected memory pages are retired and the affected CPUs are automatically off-lined. Domain availability increases because the Solaris OE no longer accesses pages or L2_SRAM modules that have an increased failure probability. The L2_SRAM enhancements are provided with KU 108528-20 for Solaris 8 OE and KU 112233-07 for Solaris 9 OE. For the upcoming KU version that supports the DIMM CE enhancements, refer to http://sunsolve.sun.com.
Virtual Memory Page Retirement
The Solaris OE keeps track of the number of CEs that occur on each DIMM over time. If more than three errors occur on the same DIMM within a 24-hour window, the domain automatically schedules retirement of the affected memory page (FIGURE 6). A memory page can be retired once all processes have released it. Retired pages are not used by the domain. On a reboot, retired pages are again used by the domain.
FIGURE 6 Solaris OE Memory ECC Handling
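The counting scheme described above can be sketched as a sliding-window counter. This is only an illustration of the idea, not the kernel's implementation: the class name, method names, and data layout are hypothetical. The threshold is left as a parameter because the prose says "more than three" errors while the message in TABLE 4 reports 3 soft errors; the default of 3 is an assumption taken from that message.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 24 * 60 * 60   # 24-hour window from the text
CE_THRESHOLD = 3                # assumption: value reported in the TABLE 4 message


class CeTracker:
    """Hypothetical sketch: tracks CE timestamps per DIMM in a sliding window."""

    def __init__(self, threshold=CE_THRESHOLD, window=WINDOW_SECONDS):
        self.threshold = threshold
        self.window = window
        self.events = defaultdict(deque)   # DIMM label -> recent CE timestamps
        self.retired_pages = set()

    def record_ce(self, dimm, page, now):
        """Record one CE; return True if the page is scheduled for retirement."""
        stamps = self.events[dimm]
        stamps.append(now)
        # Drop events that have aged out of the window.
        while stamps and now - stamps[0] >= self.window:
            stamps.popleft()
        if len(stamps) >= self.threshold:
            self.retired_pages.add(page)
            return True
        return False

    def reboot(self):
        """Retired pages return to service after a reboot, as the text states."""
        self.events.clear()
        self.retired_pages.clear()
```

Timestamps here are plain seconds, so the sketch can be exercised without a real clock. A caller that prefers the stricter reading of the prose could construct the tracker with `threshold=4`.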
TABLE 4, "Example 4," shows the messages output when a memory page is retired.
TABLE 4 Example 4
Jan 7 04:14:07 doma unix: [ID 596940 kern.warning] WARNING: [AFT0] 3 soft errors in less than 24:00 (hh:mm) detected from Memory Module Board 4 J3801
Jan 7 04:14:07 doma unix: [ID 618185 kern.notice] NOTICE: Scheduling removal of page 0x00000001.2bf6c000
Jan 7 04:14:12 doma unix: [ID 693633 kern.notice] NOTICE: Page 0x00000001.2bf6c000 removed from service
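An administrator who wants to follow page retirement in the system log could match the three messages shown in TABLE 4. The sketch below is a hedged example: the patterns are derived only from the sample lines above, the function names are hypothetical, and the exact message wording may differ between kernel updates.

```python
import re

# Patterns inferred from the TABLE 4 sample lines only (assumption).
SOFT = re.compile(
    r"(\d+) soft errors in less than (\d+:\d+) \(hh:mm\) "
    r"detected from (.*?)\s*\|?\s*$"
)
SCHEDULED = re.compile(r"Scheduling removal of page (0x[0-9a-fA-F.]+)")
REMOVED = re.compile(r"Page (0x[0-9a-fA-F.]+) removed from service")


def classify(line):
    """Return (event, detail) for a recognized message, else None."""
    m = SOFT.search(line)
    if m:
        return ("threshold", m.group(3))
    m = SCHEDULED.search(line)
    if m:
        return ("scheduled", m.group(1))
    m = REMOVED.search(line)
    if m:
        return ("removed", m.group(1))
    return None


def pending_pages(lines):
    """Pages scheduled for removal that have not yet been removed from service."""
    pending = set()
    for line in lines:
        event = classify(line)
        if event is None:
            continue
        kind, detail = event
        if kind == "scheduled":
            pending.add(detail)
        elif kind == "removed":
            pending.discard(detail)
    return pending
```

A page stays in the pending set for as long as some process still holds it, which matches the gap between the "Scheduling removal" and "removed from service" messages in the example.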
CPU Off-lining
Similar to memory page retirement, the Solaris OE keeps track of the number of ECC errors over time on an L2_SRAM module (FIGURE 7). Two types of ECC errors are considered here: nonfatal multibit errors (UCU, CPU, WDU, EDU) and nonfatal single-bit correctable errors (UCC, CPC, WDC, EDC). If an L2_SRAM module experiences one nonfatal multibit error or three single-bit correctable errors in a 24-hour window, the module is diagnosed as having an increased probability of suffering a fatal failure in the future. In this scenario, the enhanced Solaris OE automatically attempts to off-line the affected CPU module. The off-line attempt might not succeed if processes are bound to that CPU.
FIGURE 7 Solaris OE L2_SRAM Error Handling
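The off-lining rule can be expressed as a small decision function. This is a minimal sketch under stated assumptions: the function name and the event representation (a list of timestamp and error-type pairs for one L2_SRAM module) are hypothetical, and only the error-type names and thresholds come from the text.

```python
WINDOW_SECONDS = 24 * 60 * 60                # 24-hour window from the text

MULTIBIT = {"UCU", "CPU", "WDU", "EDU"}      # nonfatal multibit errors
SINGLE_BIT = {"UCC", "CPC", "WDC", "EDC"}    # nonfatal single-bit correctable errors


def should_offline(events, now, window=WINDOW_SECONDS):
    """Decide whether the CPU owning this L2_SRAM module should be off-lined.

    events: iterable of (timestamp_seconds, error_type) for one module.
    Returns True for one multibit error or three single-bit errors
    inside the window ending at `now`.
    """
    recent = [kind for ts, kind in events if 0 <= now - ts < window]
    multibit = sum(1 for kind in recent if kind in MULTIBIT)
    single = sum(1 for kind in recent if kind in SINGLE_BIT)
    return multibit >= 1 or single >= 3
```

The function only models the diagnosis. As the text notes, the actual off-line attempt can still fail when processes are bound to the CPU, so a caller would need to treat a True result as a request rather than a guarantee.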
TABLE 5, "Example 5," shows the message logged when a CPU that experienced more than two CE events in a 24-hour window is successfully off-lined.
TABLE 5 Example 5
Feb 3 06:38:40 doma SUNW,UltraSPARC-III: NOTICE: [AFT1] CPU6 offlined due to more than 2 xxC Events in 24:00:00 (hh:mm:ss)
On a reboot, the CPU is on-lined again. Off-lining a CPU whose L2_SRAM module has a higher probability of experiencing a fatal error increases the availability of the Solaris OE. Dynamically reconfigured CPU/Memory boards can be replaced with minimal impact to the Solaris OE and user applications.