- Transport Network Failures and Their Impacts
- Survivability Principles from the Ground Up
- Physical Layer Survivability Measures
- Survivability at the Transmission System Layer
- Logical Layer Survivability Schemes
- Service Layer Survivability Schemes
- Comparative Advantages of Different Layers for Survivability
- Measures of Outage and Survivability Performance
- Measures of Network Survivability
- Restorability
- Reliability
- Availability
- Network Reliability
- Expected Loss of Traffic and of Connectivity
3.11 Reliability
In ordinary English, "reliable" is a qualitative description, meaning that something or someone is predictable, usually available when needed, follows through on promises, etc. But the technical meaning of reliability is quantitative and much more narrowly defined [BiAl92], [OCon91]:
- Reliability is the probability of a device (or system) performing its purpose adequately for the period of time intended under the operating conditions intended.
In other words, reliability is the probability of a system or device staying in the operating state, or providing its intended service uninterrupted, as a function of time since the system started in a fully operating condition at t=0. A reliability value can also be thought of as answering a mission-oriented question of the form: If the device or system was started in perfect working order at time t=0, and the mission takes until t=T (with no opportunity for external intervention or repair), then what is the probability that the system will work failure-free at least until the end of the mission?
Reliability is thus always a non-increasing function of time with R(0) = 1 and R(∞) = 0. When someone says the reliability of a system or device is a specific number, 0.8 say, there is usually some understood interval of time that is implicitly assumed. They are really saying that the probability of the device or system working that long without any failure is 0.8. More formally, the reliability function, also called the survivor function, is:

$$R(t) = \Pr\{\text{no failure occurs in } [0, t]\}$$
and the cumulative failure distribution is its probabilistic complement:

$$Q(t) = 1 - R(t)$$
Another way to think of the reliability function is as the complementary cumulative distribution function (CCDF) for the random variable that gives the time to failure. That is:

$$R(t) = 1 - \int_0^t f(\tau)\, d\tau = \int_t^\infty f(\tau)\, d\tau$$

where f(t) is the failure density function, which is the probability density function of the time to failure from a known-good starting point.
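To make the survivor function concrete, the following is a minimal Python sketch (an illustration, not from the original text) that estimates R(t) empirically as the fraction of a simulated population of identical units still working at time t. The exponential failure model and the RATE and N_UNITS values are assumptions chosen only for illustration, anticipating the constant hazard rate case discussed below.

```python
import random

# Illustrative assumptions: exponential time-to-failure model with an
# average of one failure per 8760 hours, over a simulated population.
RATE = 1.0 / 8760.0      # assumed failure rate, per hour
N_UNITS = 100_000        # assumed population size

random.seed(1)
failure_times = [random.expovariate(RATE) for _ in range(N_UNITS)]

def R_empirical(t):
    """Fraction of units whose failure time exceeds t: an estimate of R(t)."""
    return sum(1 for tf in failure_times if tf > t) / N_UNITS

for t in (0.0, 4380.0, 8760.0, 17520.0):
    print(f"t = {t:8.0f} h   R(t) ~ {R_empirical(t):.3f}")
# R(0) = 1 and R(t) is non-increasing toward 0, as the text requires.
```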
Also useful is the "age-specific failure rate," otherwise known in reliability as the "hazard rate," λ. Given a population of systems or components that may fail, the hazard rate is the rate of failure per member of the group, given that the member has already survived this long. This may itself be a function of time, λ(t). An example is the classical "bathtub curve" of infant mortality, useful life, and wear-out phases for devices, which reflects age-specific failure rates. Much useful work is, however, based on the assumption of a constant hazard rate λ, which reflects systems during their useful-life phase. The terms "failure rate" and "hazard rate" are then equivalent and both are often used for λ. But more generally λ is the age-specific failure rate per unit, i.e.:

$$\lambda(t) = \frac{\Pr\{\text{failure in } (t, t + \Delta t] \mid \text{survival to time } t\}}{\Delta t}$$
where Δt is a unit of elapsed time. Thus the hazard rate is strictly the same as the failure rate only if there is no age-dependency. When one is considering a single unit or system, it follows that the hazard rate is the derivative of Q(t) (which is f(t)), because as soon as there is one failure there is no remaining pool from which to generate more failures. If a group of items is being observed, however, we have to reflect the conditional nature of the probability: for a failure to arise at time t ± Δt/2, the sample of elements being considered contains only those units that have already survived until time t, the probability of which is by definition R(t). Therefore, the hazard rate (or age-specific failure rate per unit) is in general:

$$\lambda(t) = \frac{f(t)}{R(t)} = -\frac{1}{R(t)} \frac{dR(t)}{dt}$$
which is a simple differential equation from which it follows that:

$$R(t) = \exp\left(-\int_0^t \lambda(\tau)\, d\tau\right)$$
and this applies for any hazard rate function λ(t). Also, because R(t) is the probability of a unit surviving to time t (i.e., not failing in [0, t]), then over a population of items, or a succession of trials where one item is repeatedly repaired and allowed to run again to failure, it is meaningful to think about the expected time between failures or mean time to failure (MTTF). This will be the expected value of the failure density function:

$$\text{MTTF} = E[t] = \int_0^\infty \tau\, f(\tau)\, d\tau$$
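The last two results lend themselves to direct numeric evaluation for any assumed hazard rate curve. The sketch below (again an illustration, not from the text) computes R(t) = exp(−∫λ) and the MTTF by simple trapezoidal integration for an assumed Weibull-form hazard rate, i.e., a unit in its wear-out phase; the β and η values are arbitrary. It uses the standard identity, obtainable by integration by parts, that the MTTF also equals ∫₀^∞ R(t) dt.

```python
import math

def hazard(t):
    """Assumed Weibull-form hazard: λ(t) = (β/η)·(t/η)^(β−1), increasing with age."""
    beta, eta = 2.0, 10000.0   # arbitrary illustrative shape and scale values
    return (beta / eta) * (t / eta) ** (beta - 1)

def reliability(t, steps=500):
    """R(t) = exp(−∫[0,t] λ(τ) dτ), by trapezoidal integration."""
    if t == 0.0:
        return 1.0
    dt = t / steps
    integral = sum(0.5 * (hazard(i * dt) + hazard((i + 1) * dt)) * dt
                   for i in range(steps))
    return math.exp(-integral)

def mttf(t_max=60000.0, steps=1200):
    """MTTF = ∫[0,∞) R(t) dt (equal to ∫ t·f(t) dt), truncated at t_max."""
    dt = t_max / steps
    return sum(0.5 * (reliability(i * dt) + reliability((i + 1) * dt)) * dt
               for i in range(steps))

print(f"R(5000 h) = {reliability(5000.0):.4f}")   # exp(-0.25) ~ 0.7788
print(f"MTTF      = {mttf():.0f} h")              # analytic: η·Γ(1+1/β) ~ 8862 h
```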
Much practical analysis of network or equipment reliability assumes a constant failure rate for equipment items in service. This is not necessarily accurate, but it is an accepted practice for characterizing failures in service paths that arise from a large number of possible independent failures, over a large pool of operating equipment, and from independent external events, each with individually low probability per unit time. Early-life stress testing of critical components such as lasers helps eliminate the "infant mortality" portion of the non-constant hazard rate, improving the validity of the assumption somewhat. In addition, if external hazard mechanisms such as cable dig-ups are assumed to be unsynchronized with the equipment deployment, the overall hazard rate from cable cuts can reasonably be modeled as constant on average. A practical justification is also that while mathematical methods do exist to take non-constant hazard rate curves into account for each piece of equipment, doing so in real network calculations would imply tracking the exact type, installation date, and every maintenance date in the life of each individual piece of equipment in each specific network path. Finally, there is recognition that what a planner is often doing with reliability or availability methods in the first place is making comparative assessments of alternate networking strategies, or broad technology-assessment studies of adopting new equipment or operating policies. In these contexts it is seldom the absolute numbers that matter, but the relative ranking of alternatives, and this ranking is unaffected by the idealization of a constant failure rate. Thus, we have a special practical interest in the case where λ(t) = λ0 (a constant), for which we get the special results from above that:

$$R(t) = e^{-\lambda_0 t}, \qquad Q(t) = 1 - e^{-\lambda_0 t}, \qquad f(t) = \lambda_0 e^{-\lambda_0 t}, \qquad \text{MTTF} = \frac{1}{\lambda_0}$$

and, with repair or replacement after each failure, the probability of observing exactly k failures in [0, t] is:

$$P(k \text{ failures in } [0, t]) = \frac{(\lambda_0 t)^k}{k!}\, e^{-\lambda_0 t}$$
The last result is otherwise recognized as the Poisson distribution.
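As a worked illustration of the constant hazard rate results (the failure rate and mission time below are assumed values, not from the text), the following sketch answers the mission-oriented question posed earlier: the Poisson probability of zero failures over the mission is exactly R(T).

```python
import math

FITS = 2000.0             # assumed failure rate in FITs (failures per 10^9 hours)
lam0 = FITS * 1e-9        # per-hour failure rate λ0
T = 5 * 8760.0            # assumed five-year mission, in hours

R = math.exp(-lam0 * T)   # R(T) = e^(−λ0·T)
mttf = 1.0 / lam0         # MTTF = 1/λ0

def p_k_failures(k, t):
    """Poisson probability of exactly k failures in [0, t] at rate λ0."""
    mu = lam0 * t
    return (mu ** k) * math.exp(-mu) / math.factorial(k)

print(f"R(T)          = {R:.4f}")                      # ~ 0.9161
print(f"MTTF          = {mttf:.3g} h")                 # 5e+05 h
print(f"P(0 failures) = {p_k_failures(0, T):.4f}")     # equals R(T)
print(f"P(2 failures) = {p_k_failures(2, T):.4f}")
```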
The relationships among the reliability function R(t), its probabilistic complement the cumulative failure distribution Q(t), and the failure density function f(t) are illustrated in Figure 3-16. The dashed arrows linking function values on Q(t) and R(t) to areas under f(t) show the integral relationships involved. In effect, the fundamental function is the failure density f(t): the cumulative failure distribution Q(t) is its integral, and the reliability R(t) is just the probabilistic complement of Q(t).
Figure 3-16. Reliability R(t), cumulative failure Q(t), and failure density curve f(t) relationships for a constant hazard rate, λ0.
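As a numeric cross-check of the Figure 3-16 relationships for the constant hazard rate case (the λ0 value below is assumed for illustration), the short sketch confirms that integrating f(t) over [0, t] reproduces Q(t), and that R(t) and Q(t) sum to one.

```python
import math

lam0 = 1e-4                                    # assumed constant hazard rate, per hour

f = lambda t: lam0 * math.exp(-lam0 * t)       # failure density f(t)
Q = lambda t: 1.0 - math.exp(-lam0 * t)        # cumulative failure distribution Q(t)
R = lambda t: math.exp(-lam0 * t)              # reliability (survivor) function R(t)

def integral_f(t, steps=10000):
    """Numerically integrate f over [0, t]; should reproduce Q(t)."""
    dt = t / steps
    return sum(0.5 * (f(i * dt) + f((i + 1) * dt)) * dt for i in range(steps))

t = 20000.0
print(f"Q(t)         = {Q(t):.6f}")            # 1 − e^(−2) ~ 0.864665
print(f"integral f   = {integral_f(t):.6f}")   # matches Q(t)
print(f"R(t) + Q(t)  = {R(t) + Q(t):.6f}")     # complements sum to 1
```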