Solaris Operating System Availability Features
- Processor Offlining for L2 Cache Events
- Page Retirement
- About the Author
- Acknowledgements
- References
The Solaris™ Operating System (OS) provides enhanced availability through features that help the system react better to certain error conditions that can occur during normal operation. Solaris 8 Kernel Update patch 108528-20 and Solaris 9 Kernel Update patch 112233-06 introduce an enhanced L2 Cache error-handling technique called Processor Offlining. Solaris 8 Kernel Update patch 108528-24 introduces an enhanced memory DIMM error-handling technique called Page Retirement. Subsequent kernel update patches have modified, and will continue to modify, the behavior of the initial implementations. This article discusses the availability features implemented as of the release of Solaris 8 Kernel Update patch 108528-24 and Solaris 9 Kernel Update patch 112233-08. Where necessary, notes indicate where features of prior revisions differ.
The article addresses the following topics:
"Processor Offlining for L2 Cache Events"
"Page Retirement"
This article is targeted at IT professionals interested in detailed technical information regarding the covered topics. Basic knowledge of memory and processor architecture is assumed.
Processor Offlining for L2 Cache Events
Processor offlining is implemented via bug IDs 4740766 and 4740769. The behavior is further modified via bug IDs 4832104, 4836134, 4846476, and 4833032. When an L2 Cache error checking and correcting (ECC) event is logged, the event specifics are also examined to see if the event meets the criteria for offlining the processor. Only UltraSPARC® III based systems see these changes, since the implementation is contained within the kernel code specific to those processors. For all UltraSPARC III based systems except UltraSPARC IIIi systems, this feature is turned on by default. In either case, the default can be changed via entries in the /etc/system file. Qualifying ECC events fall into three major categories:
Category A: Single-bit correctable L2 Cache events
Category B: UCC event with ME bit set
Category C: Uncorrectable L2 Cache events
Category A: Single-Bit Correctable L2 Cache Events
There are four L2 Cache correctable events in this category:
UCC: Software-correctable L2 Cache ECC error for instruction fetch or data access other than block load.
CPC: Hardware-corrected L2 Cache ECC error for copyout (snoop request).
WDC: Hardware-corrected L2 Cache ECC error for writeback.
EDC: Hardware-corrected L2 Cache ECC error for store merge or block load. For UltraSPARC III Cu systems, a hardware-corrected L2 Cache ECC error for software or hardware prefetch access also generates EDC.
For each of these events the error is corrected, allowing the thread taking the trap to be restarted. However, repeated events make the CPU and its associated L2 Cache candidates for replacement. The event timestamp is fed into a Soft Error Rate Discrimination (SERD) algorithm that detects when three distinct events have occurred on the same processor in a 24-hour period. Upon detection of the third qualifying event, the processor becomes a candidate for offlining.
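The windowed counting described above can be illustrated with a minimal sketch. It is not the kernel's SERD implementation: the class name, the per-CPU dictionary, and the exact boundary handling (an event exactly 24 hours old is treated as expired) are assumptions made for illustration only.

```python
from collections import deque

class SerdEngine:
    """Hypothetical sliding-window counter: fires when `threshold`
    events fall inside `window_seconds` (defaults: 3 events, 24 hours)."""

    def __init__(self, threshold=3, window_seconds=24 * 60 * 60):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self.timestamps = deque()

    def record(self, timestamp):
        """Record one event time in seconds; return True when the
        processor would become a candidate for offlining."""
        self.timestamps.append(timestamp)
        # Drop events that fall outside the window ending at the newest event.
        while timestamp - self.timestamps[0] >= self.window_seconds:
            self.timestamps.popleft()
        return len(self.timestamps) >= self.threshold


# One engine per processor, keyed by a CPU id (assumed bookkeeping).
engines = {}

def log_correctable_event(cpu_id, timestamp):
    """Feed one Category A event for cpu_id into that CPU's engine."""
    return engines.setdefault(cpu_id, SerdEngine()).record(timestamp)
```

Feeding three event times that lie within the same 24-hour window for one CPU returns True on the third call, while events on different CPUs are counted separately.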
Category B: UCC Event With ME Bit Set
The special combination of a UCC event with the multiple error (ME) bit set is treated as if three distinct UCC events as described above have occurred in very rapid succession. In this case, the SERD algorithm is short-circuited and the processor immediately becomes a candidate for offlining. Careful checking is done to make certain the ME could only have been the result of the UCC and not a result of any other event.
Category C: Uncorrectable L2 Cache Events
There are four L2 Cache uncorrectable events in this category:
UCU: Uncorrectable L2 Cache ECC error for instruction fetch or data access other than block load.
CPU: Uncorrectable L2 Cache ECC error for copyout (snoop request).
WDU: Uncorrectable L2 Cache ECC error for writeback (victimization).
EDU: Uncorrectable L2 Cache ECC error for store merge or block load. For UltraSPARC III Cu systems, an uncorrectable L2 Cache ECC error detected during a software or hardware prefetch access also generates EDU.
For each of the above events, the thread taking the trap can be restarted, but the processor becomes an immediate candidate for offlining. However, there are special syndromes which, depending upon the CPU type, can result in not offlining the processor. In these cases the CPU that has discovered the error is not necessarily the offending CPU. For UltraSPARC IIIi systems, there is just one special syndrome: 0x3. For other UltraSPARC III based systems the special syndromes are 0x3, 0x71, and 0x11c.
Note that Solaris 8 Kernel Update patches prior to 108528-24 and Solaris 9 Kernel Update patches 112233-08 and earlier do not consider the WDU and EDU events candidates for processor offlining. (Track bug ID 4832104 for further details on implementation in Solaris 9.) Therefore, no processor offlining is performed for WDU and EDU events in the specified kernel update releases. It is also possible in these kernel updates for CPUs to be offlined incorrectly due to memory UE and DUE errors. If you are experiencing this condition, disable the Category C class of processor offlining events via tunables in the /etc/system file or, in the case of Solaris 8, upgrade to Solaris 8 Kernel Update patch 108528-24. Track bug ID 4836134 for further details on implementation in Solaris 9.
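As a rough summary of the three categories, the following sketch maps an event to Category A, B, or C. It is a simplification under stated assumptions: the function name and the cpu_type strings are hypothetical, the special syndromes are treated as always suppressing offlining (the text says only that they can), and the checks that the ME bit resulted solely from the UCC are not modeled. Patch-level differences such as the WDU and EDU handling noted above are also ignored.

```python
CORRECTABLE = {"UCC", "CPC", "WDC", "EDC"}    # Category A events
UNCORRECTABLE = {"UCU", "CPU", "WDU", "EDU"}  # Category C events

# Syndromes for which the detecting CPU is not necessarily the offending CPU.
# The dictionary keys are illustrative labels, not kernel identifiers.
SPECIAL_SYNDROMES = {
    "UltraSPARC-IIIi": {0x3},
    "default": {0x3, 0x71, 0x11C},
}

def classify(event, me_bit=False, syndrome=None, cpu_type="default"):
    """Return 'A', 'B', 'C', or None if the event does not qualify."""
    if event == "UCC" and me_bit:
        return "B"  # treated like three UCC events in rapid succession
    if event in CORRECTABLE:
        return "A"  # still subject to the three-in-24-hours check
    if event in UNCORRECTABLE:
        special = SPECIAL_SYNDROMES.get(cpu_type, SPECIAL_SYNDROMES["default"])
        return None if syndrome in special else "C"
    return None
```

A Category A result only makes the event eligible for counting; Category B and C results make the processor an immediate candidate.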
Processor Offlining Method
After a processor becomes a candidate for offlining, attempts are made to take the processor offline using the common OS interface cpu_offline(). This is similar in effect to running the command psradm and is subject to the same restrictions as to when and if it will be successful. If the error logging (and therefore this code) happens to be executing on the candidate processor, an attempt is made to cause a different processor to perform the offlining, if one is available. If the offlining attempt fails, the attempt is repeated after waiting for an interval of time. (For instance, an attempt might fail because the processor has threads running on it that cannot be moved immediately to another processor, or because it is the only CPU in a processor set.) If interrupts are assigned to this processor which cannot be moved to another, the processor may still run interrupt handler code and therefore be subject to future events. If the number of offlining attempts exceeds a set limit, the algorithm stops trying and the processor is not offlined at that time. It remains eligible for future offlining consideration should additional qualifying L2 Cache ECC events occur. Both the maximum number of attempts and the interval between them are tunables that may be set in the /etc/system file. If a system administrator should put an offlined processor back online (for example, using psradm), it would again become subject to this algorithm, just as any other processor in the system.
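The retry behavior can be sketched as a simple loop. This is an illustration, not the kernel code: try_offline is a hypothetical callable standing in for the offlining attempt, and the counting (one initial attempt followed by up to retry_attempts retries) is an assumption chosen to match the two tunables, not a documented guarantee.

```python
import time

def offline_with_retry(try_offline, retry_seconds, retry_attempts,
                       sleep=time.sleep):
    """Attempt to offline a processor, retrying on failure.

    try_offline    -- hypothetical callable; returns True on success
    retry_seconds  -- wait between attempts (cf. cpu_remove_retry_seconds)
    retry_attempts -- retries allowed (cf. cpu_remove_retry_attempts)
    Returns the sequence of outcomes, one per attempt.
    """
    outcomes = []
    retries_left = retry_attempts
    while True:
        if try_offline():
            outcomes.append("offlined")
            return outcomes
        if retries_left == 0:
            # Limit reached: stop trying; the CPU stays eligible for
            # future consideration if new qualifying events occur.
            outcomes.append("giving up")
            return outcomes
        outcomes.append("will try again")
        retries_left -= 1
        sleep(retry_seconds)
```

Passing a no-op sleep function makes the loop easy to exercise without waiting for real time to elapse.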
Once a processor is offlined via these algorithms, a message (processor indictment) is sent to the system controller on Sun Fire™-class systems with the intention of allowing it to remove this processor from the configuration at the next POST/reboot. Firmware 5.15.3 has been released via patch 112884-04 for Sun Fire 6800/4810/4800/3800 systems to provide support for this feature and others. Refer to the Sun Blueprints™ Online article entitled "Sun Fire™ 6800/4810/4800/3800 System Auto Diagnosis and Recovery Enhancements" for more information. At this time, software with equivalent functionality for the Sun Fire™ 12K/15K system controller is not yet released. On these systems, and others without the appropriate system controller software, the system controller ignores the processor indictment message. The resultant behavior is that the offlined processor will be part of the configuration again at the next POST/reboot, assuming no errors are encountered during that process.
NOTE
Solaris 8 Kernel Update patches prior to 108528-24 and Solaris 9 Kernel Update patches 112233-08 and earlier do not send processor indictments to the system controller. Track bug ID 4833032 for further details on implementation in Solaris 9.
Processor Offlining and Dynamic Reconfiguration
Processor offlining does not utilize dynamic reconfiguration (DR). Automatic DR of an entire system board is not attempted when a single processor, or even all processors, are offlined on a board. If a system administrator uses DR to move a system board containing offlined processors out of a domain and then into the same or another domain, the processors become active again if the POST process is successful.
Processor Offlining and Capacity on Demand
Processor offlining has no special interaction with Capacity on Demand (COD). COD behaves the same as it would if the system administrator had manually offlined a processor using the psradm command.
Example Messaging
The offlining algorithm has three possible scenarios. It could:
Successfully offline a processor
Initially fail but try again
Give up after some number of failures (described in "Processor Offlining Method")
A distinct message is output to the console and the system messages log file for each of these scenarios. All messages identify both the affected processor and the general category that prompted the attempt to offline it. This information is subsequently useful to the service personnel who are eventually called to replace the offlined hardware.
For each of the three categories defined above (A, B, and C), the following examples show the resulting messages from the offlining algorithm.
Category A Messages
Category A messages are demonstrated below.
Oct 23 11:37:05 sf15k-domc SUNW,UltraSPARC-III: [ID 962502 kern.info] NOTICE: [AFT0] WDC Event detected by CPU64 at TL=0, errID 0x000000e3.2774aaf0
Oct 23 11:37:05 sf15k-domc AFSR 0x00000040<WDC>.000001f0 AFAR 0x000000a1.f06b4000
Oct 23 11:37:05 sf15k-domc Fault_PC 0x10152880 Esynd 0x01f0 SB2/P0/E0 J4400
Oct 23 11:37:05 sf15k-domc SUNW,UltraSPARC-III: [ID 189302 kern.info] [AFT0] errID 0x000000e3.2774aaf0 Data Bit 111 was in error and corrected
Oct 23 11:37:05 sf15k-domc SUNW,UltraSPARC-III: [ID 860712 kern.info] [AFT2] errID 0x000000e3.2774aaf0 E$tag PA=0x000000a0.00eb4000 does not match AFAR=0x000000a1.f06b4000
Oct 23 11:37:05 sf15k-domc SUNW,UltraSPARC-III: [ID 260100 kern.info] [AFT2] errID 0x000000e3.2774aaf0 PA=0x000000a0.00eb4000
Oct 23 11:37:05 sf15k-domc E$tag 0x00000140.01000001 E$state_0 Shared
Oct 23 11:37:05 sf15k-domc SUNW,UltraSPARC-III: [ID 895151 kern.info] [AFT2] E$Data (0x00) 0x00000000.00000000 0x00000000.00000000 ECC 0x000
Oct 23 11:37:05 sf15k-domc SUNW,UltraSPARC-III: [ID 895151 kern.info] [AFT2] E$Data (0x10) 0x00000000.00000000 0x00000000.00000000 ECC 0x000
Oct 23 11:37:05 sf15k-domc SUNW,UltraSPARC-III: [ID 895151 kern.info] [AFT2] E$Data (0x20) 0x00000000.00000000 0x00000000.00000000 ECC 0x000
Oct 23 11:37:05 sf15k-domc SUNW,UltraSPARC-III: [ID 895151 kern.info] [AFT2] E$Data (0x30) 0x00000000.00000000 0x00000000.00000000 ECC 0x000
Oct 23 11:37:05 sf15k-domc SUNW,UltraSPARC-III: [ID 929717 kern.info] [AFT2] D$ data not available
Oct 23 11:37:05 sf15k-domc SUNW,UltraSPARC-III: [ID 335345 kern.info] [AFT2] I$ data not available
Upon encountering a second WDC event, messages are logged similar to those seen during the first WDC event.
Oct 23 11:38:05 sf15k-domc SUNW,UltraSPARC-III: [ID 137784 kern.info] NOTICE: [AFT0] WDC Event detected by CPU64 at TL=0, errID 0x000000f1.217a63c1
. . .
Upon encountering the third WDC event, you see the normal messages just as in the first and second events, along with the following additional messages.
Oct 23 11:39:37 sf15k-domc SUNW,UltraSPARC-III: [ID 709559 kern.info] NOTICE: [AFT0] WDC Event detected by CPU64 at TL=0, errID 0x00000106.a62214d3
. . .
Oct 23 11:39:38 sf15k-domc SUNW,UltraSPARC-III: [ID 732650 kern.notice] NOTICE: [AFT1] Failed to offline CPU64 due to more than 2 xxC Events in 24:00:00 (hh:mm:ss), will try again
In this case, the CPU could not be immediately offlined. The system retries after the retry interval; here, the next attempt is logged 30 seconds later.
Oct 23 11:40:08 sf15k-domc SUNW,UltraSPARC-III: [ID 915404 kern.notice] NOTICE: [AFT1] CPU64 offlined due to more than 2 xxC Events in 24:00:00 (hh:mm:ss)
Now the CPU has been successfully offlined.
Category B Messages
Category B messages are demonstrated below.
Oct 23 18:58:20 sf68-doma SUNW,UltraSPARC-III+: [ID 135428 kern.info] NOTICE: [AFT0] First Error UCC Event detected by CPU5 in Privileged mode at TL=0, errID 0x0001487b.327ee8b0
Oct 23 18:58:20 sf68-doma AFSR 0x00100400<PRIV,UCC>.00000031 AFAR 0x00000000.02c4e8f0
Oct 23 18:58:20 sf68-doma Fault_PC 0x104e860 Esynd 0x0031 /N0/ SB1/P1/E1 J5300
Oct 23 18:58:20 sf68-doma SUNW,UltraSPARC-III+: [ID 782357 kern.info] [AFT0] errID 0x0001487b.327ee8b0 Data Bit 40 was in error and corrected
Oct 23 18:58:20 sf68-doma SUNW,UltraSPARC-III+: [ID 669499 kern.info] [AFT2] errID 0x0001487b.327ee8b0 PA=0x00000000.02c4e8c0
Oct 23 18:58:20 sf68-doma E$tag 0x00000000.0b249249 E$state_3 Shared
Oct 23 18:58:20 sf68-doma SUNW,UltraSPARC-III+: [ID 895151 kern.info] [AFT2] E$Data (0x00) 0x80a0a000.32680024 0xc45fa7f7.7ffff353 ECC 0x067
Oct 23 18:58:20 sf68-doma SUNW,UltraSPARC-III+: [ID 895151 kern.info] [AFT2] E$Data (0x10) 0x90100003.c85fa7ef 0x1080001b.c45fa7f7 ECC 0x1e8
Oct 23 18:58:20 sf68-doma SUNW,UltraSPARC-III+: [ID 895151 kern.info] [AFT2] E$Data (0x20) 0xc45fa7f7.86100000 0x90100004.92100012 ECC 0x12b
Oct 23 18:58:20 sf68-doma SUNW,UltraSPARC-III+: [ID 895151 kern.info] [AFT2] E$Data (0x30) 0x94100011.8778b401 0x9938e000.7ffff1fb ECC 0x066
Oct 23 18:58:20 sf68-doma SUNW,UltraSPARC-III+: [ID 929717 kern.info] [AFT2] D$ data not available
Oct 23 18:58:20 sf68-doma SUNW,UltraSPARC-III+: [ID 335345 kern.info] [AFT2] I$ data not available
Oct 23 18:58:31 sf68-doma SUNW,UltraSPARC-III+: [ID 391224 kern.info] NOTICE: [AFT0] UCC Event detected by CPU5 in Privileged mode at TL=0, errID 0x0001487b.327ee8b0
Oct 23 18:58:31 sf68-doma AFSR 0x00300400<ME,PRIV,UCC>.00000031 AFAR 0x00000000.02c4e8f0
Oct 23 18:58:31 sf68-doma Fault_PC 0x104e860 Esynd 0x0031 /N0/ SB1/P1/E1 J5300
Oct 23 18:58:31 sf68-doma SUNW,UltraSPARC-III+: [ID 782357 kern.info] [AFT0] errID 0x0001487b.327ee8b0 Data Bit 40 was in error and corrected
Oct 23 18:58:31 sf68-doma SUNW,UltraSPARC-III+: [ID 669499 kern.info] [AFT2] errID 0x0001487b.327ee8b0 PA=0x00000000.02c4e8c0
Oct 23 18:58:31 sf68-doma E$tag 0x00000000.0b249249 E$state_3 Shared
Oct 23 18:58:31 sf68-doma SUNW,UltraSPARC-III+: [ID 895151 kern.info] [AFT2] E$Data (0x00) 0x80a0a000.32680024 0xc45fa7f7.7ffff353 ECC 0x067
Oct 23 18:58:31 sf68-doma SUNW,UltraSPARC-III+: [ID 895151 kern.info] [AFT2] E$Data (0x10) 0x90100003.c85fa7ef 0x1080001b.c45fa7f7 ECC 0x1e8
Oct 23 18:58:31 sf68-doma SUNW,UltraSPARC-III+: [ID 895151 kern.info] [AFT2] E$Data (0x20) 0xc45fa7f7.86100000 0x90100004.92100012 ECC 0x12b
Oct 23 18:58:31 sf68-doma SUNW,UltraSPARC-III+: [ID 895151 kern.info] [AFT2] E$Data (0x30) 0x94100011.8778b401 0x9938e000.7ffff1fb ECC 0x066
Oct 23 18:58:31 sf68-doma SUNW,UltraSPARC-III+: [ID 929717 kern.info] [AFT2] D$ data not available
Oct 23 18:58:31 sf68-doma SUNW,UltraSPARC-III+: [ID 335345 kern.info] [AFT2] I$ data not available
Oct 23 18:58:32 sf68-doma SUNW,UltraSPARC-III+: [ID 489146 kern.notice] NOTICE: [AFT1] CPU5 offlined due to UCC Event with ME set
Category C Messages
Category C messages are demonstrated below.
Oct 23 11:42:31 sf15k-domc SUNW,UltraSPARC-III: [ID 798832 kern.warning] WARNING: [AFT1] WDU Event detected by CPU65 at TL=0, errID 0x0000012f.1b1a7f3a
Oct 23 11:42:31 sf15k-domc AFSR 0x00000020<WDU>.0000017a AFAR 0x000000a1.d9698000
Oct 23 11:42:31 sf15k-domc Fault_PC 0x10152880 Esynd 0x017a SB2/P1/E0 J5400
Oct 23 11:42:31 sf15k-domc SUNW,UltraSPARC-III: [ID 990173 kern.notice] [AFT1] errID 0x0000012f.1b1a7f3a Two Bits were in error
Oct 23 11:42:31 sf15k-domc SUNW,UltraSPARC-III: [ID 974137 kern.info] [AFT2] errID 0x0000012f.1b1a7f3a E$tag PA=0x000000a0.00e98000 does not match AFAR=0x000000a1.d9698000
Oct 23 11:42:31 sf15k-domc SUNW,UltraSPARC-III: [ID 264775 kern.info] [AFT2] errID 0x0000012f.1b1a7f3a PA=0x000000a0.00e98000
Oct 23 11:42:31 sf15k-domc E$tag 0x00000140.01000001 E$state_0 Shared
Oct 23 11:42:31 sf15k-domc SUNW,UltraSPARC-III: [ID 895151 kern.info] [AFT2] E$Data (0x00) 0x00000000.00000000 0x00000000.00000000 ECC 0x000
Oct 23 11:42:31 sf15k-domc SUNW,UltraSPARC-III: [ID 895151 kern.info] [AFT2] E$Data (0x10) 0x00000000.00000000 0x00000000.00000000 ECC 0x000
Oct 23 11:42:31 sf15k-domc SUNW,UltraSPARC-III: [ID 895151 kern.info] [AFT2] E$Data (0x20) 0x00000000.00000000 0x00000000.00000000 ECC 0x000
Oct 23 11:42:31 sf15k-domc SUNW,UltraSPARC-III: [ID 895151 kern.info] [AFT2] E$Data (0x30) 0x00000000.00000000 0x00000000.00000000 ECC 0x000
Oct 23 11:42:31 sf15k-domc SUNW,UltraSPARC-III: [ID 929717 kern.info] [AFT2] D$ data not available
Oct 23 11:42:31 sf15k-domc SUNW,UltraSPARC-III: [ID 335345 kern.info] [AFT2] I$ data not available
Oct 23 11:42:31 sf15k-domc unix: [ID 321153 kern.notice] NOTICE: Scheduling clearing of error on page 0x000000a1.d9698000
Oct 23 11:42:32 sf15k-domc SUNW,UltraSPARC-III: [ID 277554 kern.notice] NOTICE: [AFT1] Failed to offline CPU65 due to xxU Event, will try again
Oct 23 11:42:37 sf15k-domc unix: [ID 221039 kern.notice] NOTICE: Previously reported error on page 0x000000a1.d9698000 cleared
In this case, the CPU could not be immediately offlined. The system retries after the retry interval; here, the next attempt is logged 30 seconds later.
Oct 23 11:43:02 sf15k-domc SUNW,UltraSPARC-III: [ID 966792 kern.notice] NOTICE: [AFT1] CPU65 offlined due to xxU Event
Now the CPU has been successfully offlined.
Due to reasons stated previously, it is possible that a CPU cannot be offlined at all. The following messages depict this situation.
Oct 23 15:41:17 sf15k-domc SUNW,UltraSPARC-III: [ID 277554 kern.notice] NOTICE: [AFT1] Failed to offline CPU64 due to xxU Event, will try again
Oct 23 15:41:23 sf15k-domc SUNW,UltraSPARC-III: [ID 277554 kern.notice] NOTICE: [AFT1] Failed to offline CPU64 due to xxU Event, will try again
Oct 23 15:41:53 sf15k-domc last message repeated 6 times
Oct 23 15:41:58 sf15k-domc SUNW,UltraSPARC-III: [ID 277554 kern.notice] NOTICE: [AFT1] Failed to offline CPU64 due to xxU Event, will try again
Oct 23 15:43:14 sf15k-domc last message repeated 15 times
Oct 23 15:43:19 sf15k-domc SUNW,UltraSPARC-III: [ID 324082 kern.notice] NOTICE: [AFT1] Failed to offline CPU64 due to xxU Event, giving up
This system was configured to attempt to offline the processor 24 times before giving up.
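When reviewing the system messages log, it can help to pull out just the three offlining outcomes. The sketch below does this with two regular expressions derived only from the sample messages in this article; the function name and the outcome labels are hypothetical, and other kernel revisions may word the messages differently. Syslog "last message repeated N times" lines are not counted.

```python
import re

# Patterns based solely on the example [AFT1] messages shown in this article.
FAILED = re.compile(
    r"\[AFT1\] Failed to offline CPU(\d+) due to (.+), "
    r"(will try again|giving up)$"
)
OFFLINED = re.compile(r"\[AFT1\] CPU(\d+) offlined due to (.+)$")

def parse_offline_message(line):
    """Return (cpu_id, outcome, reason) for an offlining message, else None.

    outcome is one of 'offlined', 'retry', or 'gave_up' (labels assumed here).
    """
    line = line.rstrip()
    match = FAILED.search(line)
    if match:
        outcome = "retry" if match.group(3) == "will try again" else "gave_up"
        return int(match.group(1)), outcome, match.group(2)
    match = OFFLINED.search(line)
    if match:
        return int(match.group(1)), "offlined", match.group(2)
    return None
```

Applied line by line to a messages file, the function returns None for everything except the offlining notices, so the results can be filtered directly.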
Tunables
The following /etc/system variables and their possible values are listed here for reference only. Changing the values to other than their defaults should only be done under the guidance of an authorized Sun Microsystems service provider.
TABLE 1 Processor Offlining Variables and Values
Variables | Values
set automatic_cpu_removal=0 | Disables processor offlining.
set automatic_cpu_removal=1 | Enables only Category A offlining.
set automatic_cpu_removal=2 | Enables only Category B offlining.
set automatic_cpu_removal=4 | Enables only Category C offlining.
set automatic_cpu_removal=3 | Enables both Category A and B but not Category C offlining. Note: The values are expressed in decimal format, but if you consider the variable value in binary format, it is formed by ORing together three bits. A value of 1 in binary is 001, 2 is 010, and 4 is 100. To get both Category A and B, set the bit positions for both 1 and 2, which gives 011 in binary or 3 in decimal.
set automatic_cpu_removal=7 | Enables all three categories of offlining. If processor offlining is turned on for a specific processor type, this is the default.
set cpu_remove_retry_seconds=30 and set cpu_remove_retry_attempts=2400 | When processor offlining is unsuccessful, retry after cpu_remove_retry_seconds seconds, and keep trying for up to cpu_remove_retry_attempts attempts. Note: These are the default values for Solaris 8.
set cpu_remove_retry_seconds=5 and set cpu_remove_retry_attempts=24 | When processor offlining is unsuccessful, retry after cpu_remove_retry_seconds seconds, and keep trying for up to cpu_remove_retry_attempts attempts. Note: These are the default values for Solaris 9.
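The bit arithmetic behind automatic_cpu_removal can be checked with a short sketch. The helper names below are hypothetical; only the variable name and the bit values 1, 2, and 4 come from the table above. The sketch is for understanding the encoding and is not a recommendation to change the defaults.

```python
# Bit values for each category, as listed in TABLE 1.
CATEGORY_BITS = {"A": 1, "B": 2, "C": 4}

def enabled_categories(value):
    """Decode an automatic_cpu_removal value into a list of category letters."""
    return [name for name, bit in CATEGORY_BITS.items() if value & bit]

def system_line(categories):
    """Build the /etc/system entry that enables the given categories."""
    value = 0
    for name in categories:
        value |= CATEGORY_BITS[name]  # OR the category bits together
    return f"set automatic_cpu_removal={value}"
```

For example, decoding 3 yields Categories A and B, and building the entry for all three categories gives the value 7.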