Introduction to Sun Fire Systems
Most companies want the best solution for their needs, especially when they are purchasing a computer system. However, designing a reliable system that performs well takes careful consideration and planning.
Consider a widely used analogy: automobiles. A vast range of different types of vehicles, including sedans, sports cars, convertibles, sport-utility vehicles, trucks, and hybrids, is available. Each of these types of vehicles has advantages and disadvantages. When trying to decide what type of vehicle to purchase, you must consider your needs. How many people must it be able to carry? How much power does it need? How much cargo must it hold? Is size a concern? And so forth.
You would probably be making a mistake if you purchased a two-seater sports car when you had a family of five that regularly attended soccer games. You would probably be making a similar mistake if you bought a sedan with the intent of using it to haul your trailer and dirt bikes out to the desert on the weekends. For a purchase to be worthwhile, you must ensure that it properly fits your specific requirements.
This analogy may seem trite, but many people do not put nearly as much thought into purchasing a high-end server as they do a car. Instead, a system is often purchased based on its processor speed or size alone. Little, if any, consideration is given to concerns such as how to lay out the I/O, how much memory to get, and how much expansion capacity is really needed.
As with the car analogy, you must ensure that the design of your server configuration meets your company's needs. This design process requires time and effort, but it pays off in better reliability, availability, and serviceability (RAS) and in better performance.
This chapter summarizes the RAS and performance features of the Sun Fire system hardware. The definitions of the RAS features are:
Reliability: The ability of the system to run without interruption, to continue to operate when correctable errors are detected, and to prevent data corruption.
Availability: The percentage of time the customer's system is able to do productive work. This includes the ability to always recover after a failure by testing and bypassing failed components.
Serviceability: The system ensures that repair time (downtime) is minimized.
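The availability definition above is a percentage, so it can be computed directly from observed uptime and downtime. The following is a minimal sketch, assuming availability is measured as uptime divided by total observed time; the function name and the sample figures are illustrative and do not come from Sun Fire documentation.

```python
def availability_percent(uptime_hours: float, downtime_hours: float) -> float:
    """Return availability as a percentage of total observed time.

    Assumption: availability = uptime / (uptime + downtime).
    """
    total = uptime_hours + downtime_hours
    if total <= 0:
        raise ValueError("total observed time must be positive")
    return 100.0 * uptime_hours / total


if __name__ == "__main__":
    # Illustrative figures only: 99 hours of productive work, 1 hour of repair.
    print(f"{availability_percent(99, 1):.1f}%")
```

Under this measure, shortening repair time (serviceability) raises availability just as reducing the number of failures (reliability) does, which is why the three RAS properties are treated together.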
This chapter covers these topics in two sections: RAS and Performance.
RAS
The RAS goals for the Sun Fire system are to protect the integrity of the customer's data and to maximize availability. The focus is on three areas:
Problem detection and isolation: knowing what went wrong and ensuring that the problem is not propagated
Tolerance and recovery: absorbing abnormal system behavior and fixing or dynamically circumventing it
Redundancy: replicating critical components
To ensure data integrity at the hardware level, all data is protected by error correction code (ECC), and the address and control buses are protected by parity checks. These checks ensure that errors are contained.
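To make the difference between the two mechanisms concrete, the sketch below contrasts a parity check, which can only detect a single flipped bit, with an error correction code, which can also locate and repair it. It uses a textbook Hamming(7,4) code purely as an illustration; this is an assumption for teaching purposes and not the ECC scheme that Sun Fire hardware implements, which operates on much wider data words.

```python
def even_parity_bit(bits):
    """Parity bit that makes the total number of 1s even."""
    return sum(bits) % 2


def parity_ok(bits, parity):
    """Detects any single-bit error, but cannot say which bit flipped."""
    return (sum(bits) + parity) % 2 == 0


def hamming74_encode(data_bits):
    """Encode 4 data bits into a 7-bit codeword (positions 1..7)."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_correct(codeword):
    """Return (data_bits, syndrome); a nonzero syndrome is the repaired position."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4
    if syndrome:
        c[syndrome - 1] ^= 1  # flip the single faulty bit back
    return [c[2], c[4], c[5], c[6]], syndrome


if __name__ == "__main__":
    data = [1, 0, 1, 1]
    word = hamming74_encode(data)
    word[4] ^= 1  # simulate a single-bit fault in transit
    recovered, position = hamming74_correct(word)
    print(recovered == data, position)
```

Parity is cheap and is enough to stop a bad address or control signal from propagating, while ECC costs more check bits but lets the system keep running through a correctable data error.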
For tolerance of errors, resilience capabilities are designed into the Sun Fire system to ensure that the system continues to operate, even in a degraded mode. The Sun Fire system can function with one or more processors disabled. In recovering from a problem, the system is checked quickly to determine the fault and to ensure minimum downtime. To reduce downtime, redundant hardware can be configured into the system.
Reliability
Sun Fire systems have five categories of reliability capabilities:
Reducing the probability of errors
Detecting and correcting errors using error correction code (ECC)
Detecting uncorrectable errors with ECC and parity checking
Redundant power and cooling
Environmental sensing
Availability
Availability is the ability of a system to be continually accessible and useful to the customer. Sun Fire systems have many features that contribute to this quality, including the ability to:
Test, identify, and de-configure failed components following a system interrupt
Configure and boot a usable configuration with a subset of the original configuration
Change the configuration without interrupts using dynamic reconfiguration (DR)
For higher levels of availability, Sun Fire systems can be clustered.
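The first two capabilities in the list above (de-configuring failed components, then booting with what remains) can be pictured with a small model. This is only a sketch: the `Component` class, the component names, and the rule that a bootable configuration needs at least one working CPU, memory bank, and I/O device are all assumptions made for illustration, not a description of the actual Sun Fire firmware.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str          # hypothetical label, for example "cpu0"
    kind: str          # "cpu", "memory", or "io"
    passed_test: bool  # result of a (hypothetical) power-on self-test


def usable_configuration(components):
    """Split components into (kept, deconfigured) based on test results."""
    kept = [c for c in components if c.passed_test]
    deconfigured = [c for c in components if not c.passed_test]
    return kept, deconfigured


def can_boot(kept):
    """Assumed minimum: at least one CPU, one memory bank, and one I/O device."""
    kinds = {c.kind for c in kept}
    return {"cpu", "memory", "io"} <= kinds


if __name__ == "__main__":
    system = [
        Component("cpu0", "cpu", True),
        Component("cpu1", "cpu", False),  # simulated failed processor
        Component("mem0", "memory", True),
        Component("io0", "io", True),
    ]
    kept, deconfigured = usable_configuration(system)
    print([c.name for c in deconfigured], can_boot(kept))
```

In this model the system with one failed processor still boots, in a degraded configuration, because a working component of each required kind remains.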
Serviceability
To reduce repair time, the Sun Fire systems are designed with a number of maintenance capabilities and aids. These are used by the Sun Fire system administrator and by the service provider.
Failing components are listed in the failure logs in such a way that the field-replaceable unit (FRU) is clearly identified. You can remove and replace most system components in a properly configured system during system operation without scheduled downtime. If properly configured, CPU/Memory boards, I/O boards, I/O controllers, fans and power supplies can all be replaced while the system is running.