- Statistical Nature of Availability
- Establishing Expected System Availability
- Creating a Complete Failure Impact Specification
- Acknowledgements
Creating a Complete Failure Impact Specification
In the preceding sections, we identified a number of metrics and showed how each of them has a direct impact on the expected availability of the entire system. The following list summarizes these metrics:
- Time to recover, TTR [hours]
- Automatic recovery [0/1]
- Back to nominal time, BTN [hours]
- Online repair [0/1]
- Performance degradation [percentage]
- Data regression [ad hoc]
A failure impact analysis of an entire system lists all possible failure scenarios (or failure modes) that can occur in the system. The level of detail is not essential, but the list must cover everything for which the system architect is responsible. As a next step, you consider every failure scenario and specify the preceding six metrics for each of them.
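As an illustration only, the six metrics can be captured per failure mode in a small record. The following Python sketch shows one possible representation; the type, field names, and sample values are our own and are not part of the specification method itself.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FailureModeSpec:
    """The six failure impact metrics, specified for one failure mode."""
    name: str                        # failure scenario, for example "CPU/memory"
    ttr_hours: float                 # time to recover (TTR)
    automatic_recovery: bool         # is fault detection/recovery automated? [0/1]
    online_repair: bool              # can the broken part be replaced online? [0/1]
    btn_hours: Optional[float]       # back to nominal time (BTN); None if not applicable
    performance_degradation: float   # fraction of capacity lost until back to nominal
    data_regression_hours: float     # amount of data (expressed in hours) that may be lost

# Hypothetical example: a failure that is detected automatically, recovered
# in 15 minutes, and repaired online within 4 hours at 50% capacity.
example = FailureModeSpec("CPU/memory", 0.25, True, True, 4.0, 0.50, 0.0)
```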
The failure modes of an architecture depend on the components that are deployed. This makes it difficult to present a generally applicable set of failure modes. In the following paragraphs, we show a generic set of failure modes for a mission-critical environment that deploys both a server consolidation platform and a storage consolidation platform. It should be clear that the following failure scenarios are only provided as an example. A failure analysis of another architecture can yield a different list. You may want to drill down to lower-level components if their behavior is unique (for example, if they are much more difficult to service).
CPU/memory subsystem failure (non-correctable). Covers the computational core of the system. It includes all server-internal components whose failure interrupts a system's operation (crashes the system). It does not include the common platform components in the case of a multidomain system. These elements are considered in a separate failure category because they impact multiple servers.
Disk I/O channel failure. Covers the physical elements in the path between a server and its disks, except for those that are common to multiple servers. This includes host bus adapter, connectors, cables, and fibre channel switches.
Network I/O channel failure. Covers the physical elements in the path between server and network backbone: network interface card, cable, and patch panels.
Storage platform level failure. Covers any failures affecting the entire storage platform, excluding hard drives.
Disk failures. Covers the failure of an individual hard drive.
Platform level failure. Applicable only in the case of consolidated servers, this covers any failures affecting the entire server platform (for example, interconnect failure), excluding CPU/Memory boards and I/O boards.
Data loss or corruption. Groups all events (irrespective of the cause, which may be human or software related) that lead to the loss or the irreversible corruption of data used by the application.
Environmental problem. Covers all events that interrupt all systems in the computer room or data center. Examples include complete power failure, air conditioning failure, fire, or other events leading to a forced shutdown of all systems.
The proposed specification model takes the form of a two-dimensional matrix as follows:
TABLE 1 Specification Model Matrix
| Failure Mode | Time to Recover | Automated Fault Detection | Online Serviceability | Back to Nominal | Performance Degradation | Data Regression |
|---|---|---|---|---|---|---|
| CPU/memory | 15 minutes | yes | yes | 4 hours | 50% | 0 |
| Disk I/O channel | 1 minute | yes | yes | 4 hours | 50% | 0 |
| Network I/O channel | 1 minute | yes | yes | 4 hours | 0% | 0 |
| Storage platform | 24 hours | no | | | | 0 |
| Server platform | 24 hours | no | | | | 0 |
| Disk drive | 1 minute | yes | yes | 4 hours | 0% | 0 |
| Data loss or corruption | 12 hours | no | | | | 1 hour |
| Environment | 24 hours | no | yes | N/A | | 1 hour |
Note that the Online Serviceability entry indicates whether redundancy is employed in the architecture. In the absence of redundancy, a broken component implies a total outage, and there is no value in this column.
A complete matrix like the one shown in the preceding table can only result from a joint effort between the system architect and the customer. You may have to conduct multiple interviews to reach this detailed specification, discussing the consequences (cost, complexity, and the like) of an aggressive specification. In the following two sections, we explain why the effort is well spent.
Creating a Design From the Specification
A specification like that shown in the preceding table enumerates the failure scenarios that must be accounted for and precisely defines the scope of the architecture and the elements for which the architect is responsible. The absence of a failure mode exempts the designer from responsibility for the consequences of such a failure.
For example, TABLE 1 specifies that an environmental failure must be recovered from within 24 hours. This requirement translates into a need for a second computer room at a sufficient distance, with independent power and air conditioning. This room must be equipped with spare equipment that is either permanently present or can be made available at short notice. It is up to the architect to provide a solution that meets this requirement.
As another example, TABLE 1 also specifies recovery from data loss or corruption, implying that the architect must provide a complete backup solution.
In addition to defining the responsibilities of the architect, the values in this sort of table largely determine the implementation. For example, the requirement of recovery from data loss in one hour drives the choice of a network-based or a SAN-based backup solution. Furthermore, for a certain volume of data, on-disk point-in-time copies might be the only method for achieving the required data restore time.
Further, you might consider how back-to-nominal times impose a maximum time to replace broken components. The architect is responsible for including the appropriate level of maintenance in the total solution, along with any other measures that may be required to achieve repair within the specified time frame (for example, the requirement to have spare CPU/memory boards permanently present in the system). In TABLE 1, the absence of a specified value for BTN in the case of an environmental failure expresses that repair times for power circuits and air conditioning are outside the scope of the sample project.
It is not our intention to provide guidelines for implementation. We count on the experience, product knowledge, and creativity of presales engineers to provide them. With the preceding examples, we hope to substantiate our claim that a specification like the one shown in TABLE 1 is a good investment. It considerably increases the chances of a first-time-right proposal, and, in the longer term, it is a basis for an objective agreement and understanding between customer and vendor.
Assessing Expected Availability From the Specification
TABLE 1 has another interesting aspect. In the first part of this article, we developed expressions for expected availability. Now, we can do the reverse: we can estimate the expected availability of the architecture from the specification.
Suppose that we want to get an idea of the expected value of monthly availability, based on the preceding table. The table does not provide all the information we need. As discussed earlier, all the parameters that impact expected availability but are not under the architect's control are logically left out of the specification. To assess Ae, we have to provide estimates for the following missing parameters:
- Failure probability [p]
- Downtime following multiple failures (for example, both nodes in a cluster) [MTTR2]
- Failure detection time when fault detection is not automated [TTR+]
- Delay in repair due to the lack of online serviceability, that is, when the replacement of a redundant broken component must be deferred until a maintenance slot can be scheduled [BTN+]
Under the assumption that Ae refers to a one-month period, the failure probability can be derived as follows: suppose that we estimate the failure to happen, on average, once every two years. The probability of failure during a given month is then 1/24, or roughly 0.04.
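Stated more generally (our own restatement of this rule), for an availability window T that is short compared to the mean time between failures (MTBF), with both expressed in the same unit:

$$p \approx \frac{T}{\mathrm{MTBF}} = \frac{1\ \text{month}}{24\ \text{months}} \approx 0.04$$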
The following example of estimates corresponds to TABLE 1. Note that the values are fictitious and do not represent actual reliability estimates for existing hardware.
TABLE 2 Estimates for Parameters That Impact Ae
| Failure Mode | p | MTTR2 | TTR+ | BTN+ |
|---|---|---|---|---|
| CPU/memory | 0.03 | 1 week | | 0 |
| Disk I/O channel | 0.05 | 1 day | | 0 |
| Network I/O channel | 0.05 | 1 day | | 0 |
| Storage platform | 0.02 | 1 week | 1 hour | |
| Server platform | 0.02 | 1 week | 1 hour | |
| Disk drive | 0.10 | 1 day | | 0 |
| Data loss or corruption | 0.15 | | 1 hour | |
| Environment | 0.08 | 2 weeks | 1 hour | |
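As an illustration only, the following Python sketch combines the parameters of TABLE 1 and TABLE 2 under a deliberately simplified model: it assumes that the expected monthly downtime of each failure mode is its failure probability multiplied by its effective recovery time (TTR, plus TTR+ where fault detection is not automated), and that Ae is one minus the total expected downtime divided by the length of the month. The expressions developed in the first part of this article also account for double failures (MTTR2), back-to-nominal behavior, and performance degradation; the function and variable names below are our own.

```python
# Illustrative sketch only: a simplified estimate of expected monthly
# availability (Ae) from the specification (TABLE 1) and the parameter
# estimates (TABLE 2). Assumed model: expected downtime per failure mode
# equals p * (TTR + TTR+); double failures (MTTR2), back-to-nominal time,
# and performance degradation are ignored here.

HOURS_PER_MONTH = 730.0  # average length of a month in hours

# (failure mode, p, TTR in hours, TTR+ in hours)
failure_modes = [
    ("CPU/memory",              0.03, 0.25,   0.0),
    ("Disk I/O channel",        0.05, 1 / 60, 0.0),
    ("Network I/O channel",     0.05, 1 / 60, 0.0),
    ("Storage platform",        0.02, 24.0,   1.0),
    ("Server platform",         0.02, 24.0,   1.0),
    ("Disk drive",              0.10, 1 / 60, 0.0),
    ("Data loss or corruption", 0.15, 12.0,   1.0),
    ("Environment",             0.08, 24.0,   1.0),
]

def expected_monthly_availability(modes):
    """Return Ae = 1 - (expected downtime in hours / hours in a month)."""
    expected_downtime = sum(p * (ttr + ttr_plus) for _, p, ttr, ttr_plus in modes)
    return 1.0 - expected_downtime / HOURS_PER_MONTH

if __name__ == "__main__":
    print(f"Estimated expected monthly availability: "
          f"{expected_monthly_availability(failure_modes):.5f}")
```

Even under these simplifications, such an estimate gives the architect and the customer a shared, quantitative view of what the specified failure behavior implies for availability.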