Availability
Availability is the proportion of time during which a system provides the service or services it is intended to provide, typically expressed as a percentage.
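Traditionally, availability is measured as:
$$\text{availability} = \frac{\text{period} - \text{downtime}}{\text{period}} \times 100\%$$
where period is the measurement period in hours, and downtime is the total number of hours the system is down during the measurement period. For the typical measurement period of one year (8,760 hours) and a total of eight hours of downtime for the year, the derived availability number is as follows:
$$\text{availability} = \frac{8760 - 8}{8760} \times 100\% \approx 99.91\%$$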
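The calculation described above can be sketched in a few lines of Python. The function name `availability_percent` and the 365-day year (8,760 hours) are assumptions made for illustration; the text does not say how the year is counted.

```python
def availability_percent(period_hours: float, downtime_hours: float) -> float:
    """Return availability as a percentage of the measurement period."""
    if period_hours <= 0:
        raise ValueError("period_hours must be positive")
    if not 0 <= downtime_hours <= period_hours:
        raise ValueError("downtime_hours must lie between 0 and period_hours")
    return (period_hours - downtime_hours) / period_hours * 100.0


HOURS_PER_YEAR = 365 * 24  # assumption: a non-leap year, 8,760 hours

# Eight hours of downtime over one year.
print(f"{availability_percent(HOURS_PER_YEAR, 8):.2f}%")  # prints 99.91%
```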
This conventional method provides a simple way of expressing availability, yet it reveals little about the system or service being measured: the figure does not show how often outages occur or how long each one lasts. Establishing availability and planning to meet uptime requirements require an understanding of a system's attributes and of the events that might cause an outage.
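To illustrate why the percentage alone says little, the sketch below compares two hypothetical outage histories with the same total downtime. The outage figures and the helper name `summarize` are invented for illustration, and the year is assumed to be 8,760 hours.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: a non-leap year


def summarize(outage_hours: list[float]) -> tuple[float, int, float]:
    """Return (availability %, number of outages, longest outage in hours)."""
    downtime = sum(outage_hours)
    availability = (HOURS_PER_YEAR - downtime) / HOURS_PER_YEAR * 100.0
    return availability, len(outage_hours), max(outage_hours, default=0.0)


# Hypothetical year A: a single eight-hour outage.
single_long = [8.0]
# Hypothetical year B: 32 outages of 15 minutes each, also eight hours in total.
many_short = [0.25] * 32

for name, outages in (("single long", single_long), ("many short", many_short)):
    pct, count, longest = summarize(outages)
    print(f"{name}: {pct:.2f}% available, {count} outage(s), longest {longest} h")
```

Both histories yield the same availability figure, yet they differ sharply in how often events occur and in how long each recovery takes.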
In the white paper titled "R-Cubed (R³): Rate, Robustness and Recovery - An Availability Benchmark Framework," the three key components of availability are defined as rate, robustness, and recovery.
Rate is the rate of fault and maintenance events.
Robustness is a system's ability to continue providing service in the face of fault and maintenance events.
Recovery is the speed with which a system returns to service following an outage.
The rate of fault events encompasses system hardware and software faults, as well as external faults (for example, HVAC or power) that result in a system outage. Methods for calculating fault rates are under investigation by vendors and academia. Maintenance event rates have their own inherent complexities, including processes and procedures that are specific to the datacenter environment maintaining the service. Although we use fault injection in the recovery experiments described in "Recovery and Performance Measurements" on page 17, a discussion of calculating or predicting fault and maintenance rates is beyond the scope of this article.
Robustness is a difficult attribute to measure. We examine the reliability, availability, and serviceability (RAS) features of the platform under consideration to determine which features allow a system to continue operating when components fail. For example, Sun mid-range and high-end servers have redundant power and cooling hardware to minimize the possibility of a system outage when a power or cooling subsystem fails. Other server hardware features, together with fault and error handling support in Solaris, are designed to improve the robustness of Sun systems.
Our focus in this article is the recovery aspect of availability. The premise of this work is simple: faults happen, and even in redundant configurations (such as clusters) some level of recovery is required. Our focus is twofold:
Duration: how much time does the recovery process require, and where is that time spent?
Configuration: what can be done to reduce recovery time, and what are the performance or other trade-offs in such configuration choices?
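One way to approach the duration question is to timestamp the stages of a recovery and take the differences between them. The sketch below assumes a hypothetical list of (event name, seconds since fault) pairs; the event names, the timings, and the helper `phase_durations` are invented for illustration and are not measurements or tooling from this article.

```python
def phase_durations(events: list[tuple[str, float]]) -> list[tuple[str, float]]:
    """Return the time spent between each pair of consecutive events."""
    ordered = sorted(events, key=lambda e: e[1])
    return [
        (f"{start} -> {end}", t_end - t_start)
        for (start, t_start), (end, t_end) in zip(ordered, ordered[1:])
    ]


# Hypothetical timeline, in seconds after an injected fault.
timeline = [
    ("fault injected", 0.0),
    ("reboot started", 12.0),
    ("operating system up", 95.0),
    ("service restored", 140.0),
]

phases = phase_durations(timeline)
total = sum(seconds for _, seconds in phases)
for name, seconds in phases:
    print(f"{name}: {seconds:.0f} s ({seconds / total:.0%} of recovery)")
print(f"total recovery time: {total:.0f} s")
```

A breakdown of this kind shows which stage dominates the recovery, which in turn suggests where a configuration change is most likely to pay off.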