Designing Highly Available Architectures: A Methodology
- Statistical Nature of Availability
- Establishing Expected System Availability
- Creating a Complete Failure Impact Specification
- Acknowledgements
This Sun BluePrints™ OnLine article presents a methodology for discussing availability requirements for Information Technology (IT) systems. This methodology focuses on the interaction between system vendors and customers at the early stage of a project and defines the minimum information that should be exchanged to design an architecture that will satisfy the availability requirements of the future owner of the system.
Sun Microsystems™ provides a complete portfolio of products and services that help increase system and application availability. The following examples are just a few of these offerings:
Solaris™ Volume Manager software for data protection
Sun™ Cluster software for failover in case of system disruption
Sun Storage Availability Suite for on-disk data snapshots
Sun Fire™ systems for fully redundant servers
In this article, we use the term "architecture" to denote a specifically designed combination of these types of products and services.
Customers and vendors strive to design optimal architectures. In this context, optimal means that a design comprises only those components that are required by the criticality of the application that the architecture will support. Efforts to quantify criticality at this level often center on availability, or uptime ratio. In this case, availability is expressed as a single value (for example, 99.995 percent); therefore, at first sight, availability is a simple and attractive basis for decisions.
In this article, we show that availability is a statistical variable, not a property of a system. We demonstrate that availability is, in fact, difficult to express correctly.
As an alternative, we propose an approach where you create an inventory of failure scenarios for a proposed system architecture, and answer the following five questions for each scenario:
What is the time to recover (TTR) from this failure?
What is the back to nominal time (BTN), also thought of as the time to repair?
Is recovery from this failure automated?
Can the broken component be replaced on-line?
What is the service level degradation?
For this set of five metrics we use the term "failure impact."
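To make such an inventory concrete, the following minimal Python sketch (our own illustration, not part of the original methodology; the field names and sample scenarios are invented) shows one way to record these five failure-impact metrics for each failure scenario:

```python
# Minimal sketch of a failure-impact record; field names are illustrative only.
from dataclasses import dataclass

@dataclass
class FailureImpact:
    scenario: str               # e.g., "single disk failure"
    ttr_minutes: float          # time to recover (TTR) the service
    btn_minutes: float          # back to nominal (BTN), i.e., time to repair
    automated_recovery: bool    # True if recovery needs no operator action
    online_replaceable: bool    # True if the broken part can be swapped on-line
    degradation_percent: float  # service level degradation while degraded

# A hypothetical two-entry inventory of failure scenarios.
inventory = [
    FailureImpact("single disk failure", ttr_minutes=0, btn_minutes=240,
                  automated_recovery=True, online_replaceable=True,
                  degradation_percent=5.0),
    FailureImpact("server node failure", ttr_minutes=3, btn_minutes=480,
                  automated_recovery=True, online_replaceable=False,
                  degradation_percent=50.0),
]
```

Filling in one such record per scenario is all that the failure impact specification asks for.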
It may seem strange that our model does not include the frequency or probability of failures. This omission is deliberate, and it is the foundation of this article.
Expected availability of an architecture has two contributing factors: the expected frequency of failures, and their impact.
In this article, we use the term "failure" for any event leading to unscheduled downtime. It is not limited to improper functioning of electronic components, but can include external factors such as destruction of systems or loss of data through administrator errors. The frequency of failures is a shared responsibility between the vendor (who assures component reliability) and the customer (who controls environmental parameters, and the skill level and discipline of the system administrators). This reliability factor is important, but our argument is that it is not usable as a design specification because it is outside the control of the designer of an architecture.
What the designer of an architecture does control is the impact of failures, by deploying the appropriate techniques. Therefore, we use an analysis of the impact of failures as the initial design specification.
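The short Python sketch below illustrates this split under stated assumptions: the failure frequencies are estimates that lie outside the designer's control, while the time-to-recover values are exactly what the architecture determines. All scenario names and numbers are invented for the example.

```python
# Sketch only: expected monthly downtime as (assumed frequency) x (designed TTR).
MINUTES_PER_MONTH = 30 * 24 * 60

# (scenario, expected failures per month, time to recover in minutes)
scenarios = [
    ("single disk failure", 0.05, 0),    # hot spare rebuilds; no service outage
    ("server node failure", 0.01, 3),    # cluster failover time
    ("administrator error", 0.02, 45),   # restore from an on-disk snapshot
]

expected_downtime = sum(freq * ttr for _, freq, ttr in scenarios)
expected_availability = 1 - expected_downtime / MINUTES_PER_MONTH
print(f"Expected monthly availability: {expected_availability:.5%}")
```

The designer can shrink the time-to-recover column by adding redundancy or automation; the frequency column remains an estimate.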
This article addresses the following topics:
"Statistical Nature of Availability"
"Establishing Expected System Availability"
"Accounting for the Impact of Redundant Components on Expected Availability"
"Employing Online Serviceability to Control Expected Availability"
"Considering the Impact of Service Degradation"
Statistical Nature of Availability
The components in any physical system have a limited lifetime and may stop functioning at an unpredictable point in time. The ability of a system to function in spite of this fact is usually expressed as availability. More precisely, availability is the fraction of time that a system is functional over a predefined period of time. In this section, we use a time period of one month.
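Written as a formula (our notation, consistent with the definition above), with T_total the length of the measurement period and T_down the accumulated unscheduled downtime within it:

```latex
A_{\text{month}} = \frac{T_{\text{up}}}{T_{\text{total}}}
                 = 1 - \frac{T_{\text{down}}}{T_{\text{total}}}
```

For example, with illustrative numbers, 43 minutes of downtime in a 30-day month (43,200 minutes) gives A = 1 - 43/43200, or roughly 99.9 percent.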
Monthly availability can be measured over the lifetime of the system. Due to the unpredictable nature of failures, each measurement of availability provides a different result (for example, availability may be 98.7 percent in January, 100 percent in February, and 99.5 percent in March).
You can only capture this variation by sampling multiple intervals. These measurement values are represented in the following histogram:
FIGURE 1 Estimating Availability Using Multiple Measurements
This histogram shows, for instance, three availability measurements between 70 percent and 80 percent. When sufficient measurements are taken, we can smooth this histogram into a curve and normalize it (multiply it by a factor that scales the area under the curve to 1). The resulting curve is the probability distribution for monthly availability, denoted as p(A). This curve might appear as shown in the following graphic:
FIGURE 2 Monthly Availability Probability Distribution
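The normalization step can be sketched in a few lines of Python (the monthly measurements below are made up; numpy's density option performs the scaling that makes the area under the histogram equal to 1, yielding an empirical estimate of p(A)):

```python
# Sketch: turning repeated monthly availability measurements into a
# normalized histogram, i.e., an empirical estimate of p(A).
import numpy as np

# Hypothetical monthly availability measurements (fractions, one per month).
samples = np.array([0.987, 1.000, 0.995, 0.75, 0.78, 0.72,
                    0.91, 0.96, 0.999, 0.93, 0.98, 0.88])

# density=True rescales bin heights so the total area under the histogram is 1.
density, bin_edges = np.histogram(samples, bins=10, range=(0.7, 1.0), density=True)

for lo, hi, d in zip(bin_edges[:-1], bin_edges[1:], density):
    print(f"A in [{lo:.2f}, {hi:.2f}): estimated density {d:.2f}")
```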
The probability distribution is the most accurate characterization that can be made of the availability of a system. It allows you to calculate the actual probability that the measured availability will fall between two arbitrary limits. This probability is the area under the curve between the two points. For example, the following graphic shows that the probability that availability will be between 90 percent and 97 percent is 0.7.
FIGURE 3 Probability for Availability Between Two Limits
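As a purely numerical illustration, the sketch below assumes a Beta-shaped distribution for monthly availability (our own choice, so the resulting numbers will not match the 0.7 in the figure) and reads off two such areas: the probability that availability falls between two limits, and the probability that it meets or exceeds a target value.

```python
# Sketch: probabilities as areas under an assumed availability distribution p(A).
from scipy import stats

# Hypothetical distribution for monthly availability, chosen only for concreteness.
p_A = stats.beta(a=40, b=2)

# P(0.90 <= A <= 0.97) is the area under p(A) between the two limits.
prob_between = p_A.cdf(0.97) - p_A.cdf(0.90)

# P(A >= 0.9995): the kind of probabilistic target a designer can aim for.
prob_at_least = 1 - p_A.cdf(0.9995)

print(f"P(90% <= A <= 97%): {prob_between:.2f}")
print(f"P(A >= 99.95%):     {prob_at_least:.4f}")
```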
It is beyond the scope of this article to provide introductory and background material on probability and statistics. For more information, refer to "Introductory Statistics," by Wonnacott and Wonnacott. The point we want to make is that the correct description of a variable that varies randomly with each measurement is a probability distribution curve, not a single value. In other words, a 99.995 percent available system does not exist. What may exist is a system with a 0.9 probability that availability is at least 99.95 percent. Put differently, designers of a system cannot guarantee availability; at most, they can design to reach a certain probability that, during a sample interval, the measured availability will equal or exceed a certain value.
IT architects face this problem with almost every project where availability is a primary concern. The future owner of the system requires a certain minimum level of availability. However, you can hardly expect a user who requires system availability above 99.5 percent to appreciate being presented with two alternative architectures: one for which the probability of achieving 99.5 percent availability is 0.93, and another, more expensive one, for which this probability is 0.96.
Availability is a primary concern of the owner of a system, and IT architects must fully recognize this. In this section, we have explained why we believe that a single availability value is not a sufficient design specification for an architecture. Our fundamental objection is that the components of a system only prove their reliability over the life of the system, and there is no way for IT architects to verifiably demonstrate that their designs will, for instance, achieve 99.99 percent availability. As a result, a solution may be sought in a legal contract between vendor and customer, with a premium and penalty for measured availability above or below a certain value.
We propose an alternative angle from which to discuss the availability aspect of architectures. To this end, we develop in the following sections a simple model for availability and try to single out the probabilistic component and the deterministic component. This deterministic factor is what we term "failure impact." Our suggestion is that in many circumstances, failure impact can be used as a specification at the start of an IT project, when the availability requirements of the customer need to be established.