Sun Fire RAS Features
CPU and System Interconnect
The most important reliability feature in any system is the protection of data integrity.
The UltraSPARC® III processor has parity protection for all major internal caches and ECC protection for transactions to and from the external caches.
The data interconnect has ECC and parity protection throughout the system. The address interconnect has parity protection.
CPU Error Protection
The CPU corrects errors detected in the internal data and instruction cache SRAMs. When an internal error is detected, the CPU invalidates the cache and retries the data or instruction load.
Two SRAM modules per CPU reside on the CPU/Memory system board. These modules contain the external cache (Ecache). The CPU corrects all parity errors detected during data cache or instruction cache accesses by invalidating and flushing the affected cache line. The hardware corrects, on the fly, all single-bit errors detected by ECC during data transfers. ECC also detects any uncorrectable (multibit) errors and reports them to the Solaris OE.
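The difference between correcting a single-bit error and only detecting a multibit error can be illustrated with a toy code. The following is a minimal sketch that uses an 8-bit extended Hamming code over 4 data bits. It is an assumption made for illustration only: the code width, the bit layout, and the function names `encode` and `decode` are hypothetical and do not describe the ECC actually implemented in the UltraSPARC III hardware.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit codeword (list of bits).

    Index 0 holds the overall parity bit; indices 1..7 form a Hamming(7,4)
    codeword with check bits at positions 1, 2, and 4.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2
    return code


def decode(code):
    """Return (nibble, status); status is 'ok', 'corrected', or 'uncorrectable'."""
    code = list(code)
    # The syndrome is the XOR of the positions of all set bits in 1..7.
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    overall = sum(code) % 2  # 0 when the overall parity is consistent
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Exactly one bit flipped: the syndrome names its position
        # (position 0 means the overall parity bit itself).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Nonzero syndrome with consistent overall parity: two bits flipped.
        return None, "uncorrectable"
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status
```

In this sketch, flipping any one bit of a codeword still decodes to the original data with status `corrected`, while flipping any two bits yields `uncorrectable`, which corresponds to the case the hardware can only report.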
System Interconnect Error Protection
The system has three types of error protection: data interconnect, address interconnect, and error isolation.
Data Interconnect
The system protects all data interconnect pathways by using ECC and parity protection. It generates ECC and parity bits for all data blocks sourced by processors and PCI I/O controllers (system devices). All data switches in the path for each transfer check ECC and parity. The receiving system device checks and corrects ECC.
Address Interconnect
The system has parity protection for all address interconnect pathways and checks parity between all system devices and address repeaters. On the Sun Fire™ 15K/12K systems, there is also ECC protection on the address and address response crossbars for transactions across the centerplane.
Error Isolation
Because each of the data switches in the path for every data transfer checks ECC, the source of ECC errors can be identified in most cases. However, some types of ECC errors are difficult to isolate. For example, when a CPU writes bad ECC into memory, finding the source is difficult because multiple system devices may read and report the bad ECC. In such cases, the ECC error usually can be isolated to a dual CPU data switch or its pair of processors. This is an improvement over the previous architecture, in which it was more difficult to isolate these types of ECC errors.
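The isolation idea can be sketched in a few lines: if every checker along a transfer path reports whether it saw bad ECC, the first checker that reports an error bounds the fault to the segment just before it. This is a simplified illustration under stated assumptions. The function `isolate`, the hop names, and the report format are hypothetical and are not taken from the Sun Fire diagnostic software.

```python
def isolate(path, reports):
    """Bound an ECC fault to one segment of a transfer path.

    path:    ordered list of checker names, from the source toward the receiver
    reports: set of checker names that flagged an ECC error on this transfer
    Returns (previous_hop, first_reporting_hop), or None if nothing was flagged.
    """
    previous = "source"
    for hop in path:
        if hop in reports:
            # Everything upstream of this hop passed the data along cleanly,
            # so the fault lies between the previous hop and this one.
            return (previous, hop)
        previous = hop
    return None


# Hypothetical path: every name below is illustrative only.
path = ["dual_cpu_data_switch", "board_data_switch", "repeater", "receiver"]
suspect = isolate(path, {"repeater", "receiver"})
```

When the very first checker on the path reports the error, the result is `("source", first_hop)`, which mirrors the case described above where the fault can be narrowed only to a dual CPU data switch or its pair of processors.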
FIGURE 1-1 CPU Board Block Diagram
FIGURE 1-2 Sun Fire 6800/4810/4800/3800 Systems Interconnection Diagram
FIGURE 1-3 Sun Fire 15K/12K Systems Interconnection Diagram
System Controller
A small system called the system controller (SC) manages the Sun Fire systems. The SC is responsible for all of the functions required to test, configure, and boot domains. It supplies all system clocks and virtual time-of-day (TOD) clocks for the domains and monitors power and environmental status. It also provides domain control with virtual key switches and console connections for each domain.
Testing and Configuration
Upon power on or a reset, the SC runs a self test called SCPOST. It then starts the platform management software.
The SC configures and coordinates the initialization, testing, and boot processes. After a system failure, or when the virtual key switch of a domain is turned on, the SC powers on the system components associated with the domain and runs SPOST for the domain. SPOST controls the running of LPOST and IPOST. LPOST tests the CPU/Memory boards, I/O boards, and the system interconnect. IPOST tests the PCI controllers. The SC then configures the domain based on the components that pass the tests and starts the boot sequence.
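The test-then-configure flow described above can be sketched as a filter: each component is routed to the appropriate test, and only the components that pass are configured into the domain. This is a minimal sketch, not the SC firmware. The function `configure_domain`, the component names, the `kind` labels, and the `lpost` and `ipost` callables are hypothetical stand-ins.

```python
def configure_domain(components, lpost, ipost):
    """Return the names of the components that passed their tests.

    components: list of (name, kind) tuples; kind is "cpu_mem", "io",
                "interconnect", or "pci_controller" (illustrative labels)
    lpost, ipost: callables that take a name and return True on a pass
    """
    passed = []
    for name, kind in components:
        # PCI controllers go to IPOST; boards and interconnect go to LPOST.
        test = ipost if kind == "pci_controller" else lpost
        if test(name):
            passed.append(name)
    return passed


# Hypothetical domain in which one CPU/Memory board fails its test.
components = [
    ("SB0", "cpu_mem"),
    ("SB2", "cpu_mem"),
    ("IB6", "io"),
    ("IB6/pci0", "pci_controller"),
]
configured = configure_domain(
    components,
    lpost=lambda name: name != "SB2",
    ipost=lambda name: True,
)
```

Here the domain would be configured without the failed board, and the boot sequence would start on the remaining components.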
Environmental Monitoring
The SC monitors the following conditions:
Voltage, current, and temperatures for power supplies
Voltage and temperatures for all system boards and processors
Temperatures of ASICs
Fan status and speed
If safe thresholds are exceeded, the SC shuts down components to prevent damage to the system.
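The monitoring rule can be sketched as a comparison of sensor readings against safe limits. The sketch below is illustrative only. The function `components_to_shut_down`, the sensor names, and every numeric limit are assumptions, not values from the Sun Fire hardware, and it models only upper limits (a real monitor would also need lower bounds, for example for fan speed).

```python
def components_to_shut_down(readings, limits):
    """List the components with at least one reading above its safe limit.

    readings: {component: {sensor: value}}
    limits:   {sensor: maximum safe value}; sensors without a limit are ignored
    """
    return sorted(
        name
        for name, sensors in readings.items()
        if any(
            value > limits[sensor]
            for sensor, value in sensors.items()
            if sensor in limits
        )
    )


# Hypothetical readings and limits.
readings = {
    "PS0": {"temp_c": 45, "volts": 48.1},
    "SB2": {"temp_c": 92},
    "FT1": {"temp_c": 30},
}
limits = {"temp_c": 85, "volts": 52.0}
to_stop = components_to_shut_down(readings, limits)
```

With these made-up numbers, only the board whose temperature exceeds the limit is selected for shutdown.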
System Administration and Maintenance
The SC provides access for platform and domain administration. Access to the SC is by the included RS-232 serial connection or by network connection.
Tasks performed at the platform level are:
SC setup and configuration
Allocation of system resources for domains
Domain creation
Status display of all domains
Power control for all system components
Logical enable and disable of components
Individual CPU/Memory board tests
Component and environmental status display
Platform error message administration
Platform password setup (Sun Fire 6800/4810/4800/3800 systems)
Platform and Domain Security configuration (Sun Fire 15K/12K systems)
Tasks performed at the domain level are:
Power and boot control
Domain status display
Logical domain component enable and disable
Individual CPU/Memory board tests
Domain error message administration
Domain password setup (Sun Fire 6800/4810/4800/3800 systems)
Redundant System Components
All systems in the Sun Fire product line can be configured with redundant components. The ability to run with a subset of the configured components increases the availability of the system. As long as one processor with memory and one I/O module are functional, the system can run. If the system is configured with redundant connections to storage and network, access can be maintained through these alternate paths.
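The minimum condition for continued operation stated above (one functional processor with memory and one functional I/O module) reduces to a simple predicate. The function name `system_can_run` and the boolean-list representation of component health are assumptions made for this sketch.

```python
def system_can_run(cpus_with_memory_ok, io_modules_ok):
    """True if at least one CPU with memory and one I/O module are functional.

    Each argument is an iterable of booleans, one per configured component.
    """
    return any(cpus_with_memory_ok) and any(io_modules_ok)
```

For example, `system_can_run([False, True, False], [True, False])` evaluates to `True`, while a configuration with no functional I/O module evaluates to `False`.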
Redundant components include:
CPU/Memory boards
I/O modules
PCI cards
System Controller boards
Repeater boards (Sun Fire 6800/4810/4800/3800 systems)
Sun™ Fireplane interconnect (Sun Fire 15K/12K systems)
Fan trays
Power supplies
CPU/Memory Boards
Depending on the model, Sun Fire systems can support up to 18 CPU/Memory boards. Each board contains two or four CPUs and is capable of running independently or together with other boards in a larger domain.
I/O Modules
Depending on the model, Sun Fire systems can support up to 18 I/O modules:
The Sun Fire 15K system supports up to 18 I/O modules with four PCI slots each.
The Sun Fire 12K system supports up to nine I/O modules with four PCI slots each.
The Sun Fire 6800 system can be configured with a combination of four I/O modules with eight PCI or four Compact PCI slots each.
The Sun Fire 4810/4800 systems can be configured with two I/O modules with eight PCI or four Compact PCI slots each.
The Sun Fire 3800 system is configured with two I/O modules with six Compact PCI slots each.
TABLE 1-1 summarizes these configurations.
TABLE 1-1 I/O Module Configurations
System | I/O Modules | PCI Slots | Compact PCI Slots
Sun Fire 15K | 18 | 4 each | 0
Sun Fire 12K | 9 | 4 each | 0
Sun Fire 6800 | 4 | 8 each | or 4 each
Sun Fire 4810/4800 | 2 | 8 each | or 4 each
Sun Fire 3800 | 2 | 0 | 6 each
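The per-module figures in TABLE 1-1 determine the maximum slot count for each system. The sketch below encodes the table as data and multiplies it out. The dictionary layout and the helper `max_slots` are illustrative assumptions. On the midrange systems each module provides either PCI or Compact PCI slots, so the two totals are alternatives for a given module rather than a sum.

```python
# Module counts and slots per module, copied from TABLE 1-1.
IO_CONFIGS = {
    "Sun Fire 15K": {"modules": 18, "pci": 4, "cpci": 0},
    "Sun Fire 12K": {"modules": 9, "pci": 4, "cpci": 0},
    "Sun Fire 6800": {"modules": 4, "pci": 8, "cpci": 4},
    "Sun Fire 4810/4800": {"modules": 2, "pci": 8, "cpci": 4},
    "Sun Fire 3800": {"modules": 2, "pci": 0, "cpci": 6},
}


def max_slots(system, kind):
    """Maximum slots of one kind ("pci" or "cpci") if every module is that kind."""
    config = IO_CONFIGS[system]
    return config["modules"] * config[kind]
```

For instance, a fully populated Sun Fire 15K yields 18 × 4 = 72 PCI slots, and a Sun Fire 3800 yields 2 × 6 = 12 Compact PCI slots.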
PCI Cards
Redundant PCI cards can be configured to provide alternate paths to all peripheral connections.
System Controller Boards
With two SC boards configured in the system, a failure of the primary SC does not cause a domain interrupt. The system clocks, virtual TOD clocks, and all other SC functions fail over to the secondary SC without causing a domain failure.
Repeater Boards
The Sun Fire 6800/4810/4800 systems contain pairs of Repeater boards. The Repeater boards contain data and address repeater switches. The Sun Fire 6800 system contains two pairs of Repeater boards. If one board fails, the system can run in degraded mode on the other pair of boards. The Sun Fire 4810/4800 systems are configured with one pair of Repeater boards. If one board fails, the system can run in degraded mode on the other board. The Repeater boards can be replaced while the system is running. The Sun Fire 3800 system has the same ability to run in degraded mode after a failure, but the Repeater board functionality is built into the centerplane and cannot be replaced without interrupting the system. TABLE 1-2 summarizes these configurations.
TABLE 1-2 Repeater Board Configurations
System | Repeater Boards
Sun Fire 3800 | Functionality is built into the centerplane
Sun Fire 4810/4800 | 1 pair
Sun Fire 6800 | 2 pairs
Sun implements the Sun Fire 15K/12K interconnect differently from the midrange systems. The interconnect consists of three independent 18-way crossbars: one for data, one for addresses, and one for address responses. Part of the interconnect consists of multiple ASICs that reside on the centerplane and cannot be replaced while the system is running. If one of the crossbars fails, the system can run in degraded mode while the other crossbars continue with full bandwidth. Depending on the failure, the degraded crossbar may be configured to affect only a single domain.
Fan Trays
All Sun Fire systems can be configured with redundant fan trays. All fan trays can be replaced while the system is running.
Power Input and Supplies
The Sun Fire 6800/4810/4800/3800 systems can be configured with up to four separate AC input connections. The systems and peripherals can be connected to two different AC input power grids using two Redundant Transfer Units (RTU), each with two redundant transfer switches (RTS).
The Sun Fire 15K/12K systems are configured with six dual AC-DC power supplies, each with two AC input connections. The power supplies convert the AC voltage to 48 VDC. The 48 VDC is supplied to all of the system modules. Each system module contains its own on-board DC-DC converter to supply the lower DC voltages needed by the logic components local to each module. A failure of a DC-DC converter only affects that board.
Domains and Partitions
The definitions of these features are:
Domain: The ability to create multiple logically independent sections within a partition, with each domain running its own operating system. The Sun Fire 6800 system can have up to four domains. The Sun Fire 4810/4800/3800 systems can each have up to two domains. Each instance of the Solaris Operating Environment (Solaris OE) runs in its own domain. Domains do not depend on each other and do not interact with each other.
Partition: A partition differs from a domain in the level of isolation. The Repeater boards are logically isolated from each other so the system functions as two separate servers. On a Sun Fire 6800 system, segments can be configured to reside completely within a single internal Sun Fire 6800 power grid. The Sun Fire 15K/12K system interconnect does not use Repeater boards and does not support multiple partitions.
A Sun Fire system can be logically divided into multiple domains. Because each domain consists of one or more system boards, a domain can have anywhere from two to 106 processors (on the Sun Fire 15K system). Each Sun Fire system has at least one domain to support the main functionality of the system.
Additional domains can be used for:
Testing new applications
Operating system updates
Configuring several domains to support separate departments
Each domain runs its own instance of the operating system and has its own peripherals and network connections. Domains can be configured without interrupting the operation of other domains on the same system.
While production work continues on the remaining (and usually larger) domain, there is no adverse interaction between the domains. You can gain confidence in the stability of applications or upgrades without disturbing production work. When the testing work is complete, the system can be rejoined logically without rebooting (there are no physical changes when you use domains). Thus, if problems occur, the rest of your system is not affected.
The Sun Fire 15K and 12K systems can be configured with up to 18 and 9 domains, respectively. The Sun Fire 15K/12K systems do not use partitions. The Expander boards are responsible for domain separation.
FIGURE 1-4 Sun Fire 6800/4810/4800/3800 Domain and Partition Allocations
Mechanical Serviceability
Connectors are keyed so that boards cannot be installed upside down. Special tools are not required to access the inside of the system because all voltages within the cabinet are considered extra-low voltages (ELVs) as defined by applicable safety agencies.
No jumpers are required for configuration of the Sun Fire system, which makes it much easier to install new or upgraded system components. There are no slot dependencies other than the special slots required for the SC and Repeater boards.
The Sun Fire cooling system design consists of redundant, hot-swappable modules. Standard, proven parts and components are used wherever possible. Sun designs the field-replaceable units (FRUs) and subassemblies for quick and easy replacement with minimal use of tools.