3.8 Measures of Outage and Survivability Performance
Let us now introduce various quantitative measures of failure impact, given that a failure occurs, and of intrinsic survivability performance, meaning the ability to resist failures in the first place. Given the impact of failures, there is growing regulatory interest in quantifying the magnitude of the failures that occur. The notion is that hurricanes, tornadoes and earthquakes each have a system for classifying their severity, so why not network failures too? Network operators are also interested in such standardized measures for quality improvement and competitive processes. A second sense of "measuring survivability" is to ask about those intrinsic properties of a network that by design make it less likely to sustain an outage in the face of failures within itself. These are the basic notions of reliability and availability, and what we define as restorability. Let us touch on these in sequence.
3.8.1 McDonald's ULE
McDonald [McDo94] was perhaps the first to advocate development of a quantifiable measure of network outages. McDonald's argument was that any drive toward a standard method for quantifying network failures would focus attention on the issue and inevitably lead to improvements in avoiding outages. His proposed measure is the "User-Lost Erlangs" (ULE), defined as:

$$\mathrm{ULE} = \log_{10}(E \cdot H)$$
where E = average historical traffic intensity (in Erlangs) through the outage period and H = outage duration in hours. The measure is logarithmic, like the Richter scale. For example, 10 Erlangs blocked for one hour is 1 ULE, and 2 ULE is equivalent to an hour-long outage affecting 100 Erlangs of normally offered traffic, or 6 minutes of outage on 1000 Erlangs, etc. The logarithmic nature is key to its utility. McDonald argues that a logarithmic measure discriminates well between events of major and minor consequence. It also reflects a plausible belief that the overall societal impact somehow scales with the order of magnitude (the exponent) of the total outage. We would add that it is also appropriate for avoiding false precision: the data going into a ULE calculation will at best be estimates, so what is important is indeed the order of magnitude, not linear differences.
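The arithmetic behind these examples can be checked with a short Python sketch. It assumes the measure is the base-10 logarithm of the Erlang-hours lost, which is what the worked values in the text imply; the function name `user_lost_erlangs` and the input validation are illustrative choices, not part of McDonald's proposal.

```python
import math


def user_lost_erlangs(erlangs: float, hours: float) -> float:
    """Return ULE, assumed here to be log10(E * H).

    erlangs: average historical traffic intensity over the outage period.
    hours:   outage duration in hours.
    """
    if erlangs <= 0 or hours <= 0:
        # The logarithm is undefined for a zero or negative product.
        raise ValueError("erlangs and hours must both be positive")
    return math.log10(erlangs * hours)


# The three cases discussed in the text.
for e, h in [(10, 1.0), (100, 1.0), (1000, 6 / 60)]:
    print(f"E={e:>5} Erlangs, H={h:.2f} h -> ULE = {user_lost_erlangs(e, h):.2f}")
```

Because the inputs are estimates, rounding the result to one decimal place (or even to an integer) is probably the most that the data can support.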
3.8.2 The (U,D,E) Framework for Quantifying Service Outages
The ULE notion was developed further with an eventual aim toward standardization in [T1A193]. In this framework the impact of a failure is assessed in terms of: Unservability (U), Duration (D) and Extent (E), called a (U, D, E) triple. The three parameters of the (U, D, E) framework are:
Unservability (U) is defined in terms of a basic capability and unit of usage appropriate to the application. For example, in a circuit switched network, this would be the ability to establish connections with acceptable blocking and transmission performance. The unit of usage is a call attempt and the unservability is the percentage of call attempts that fail. In a packet network, the unit of usage is a packet and the unservability is the percentage of packets that were not delivered within a stipulated delay. In a leased line network, unservability is defined as the percentage of DS-0, DS-1 or DS-3 leased signals that are not available.
Duration (D) is the elapsed time interval during which performance falls below the threshold for defining unservability.
Extent (E) reflects the geographic area, population affected, traffic volumes, and customer traffic patterns, in which the unservability exceeds a given threshold.
The idea is not to boil the U, D, E values down further into a single measure, but to preserve them as a three-dimensional characterization of any outage event. A (U, D, E) triple can thus be plotted in the corresponding 3-space and the event classified as catastrophic, major, or minor, depending on which predefined "volume shell" the (U, D, E) vector enters. Figure 3-15 illustrates the concept. It seems reasonable that some vector weighting scheme might also be agreed upon for defining the qualifying regions. Put another way, a general (U, D, E) classification model need not use simple spherical shells to define its classes unless the intent is to give strictly equal weight to U, D and E.
Figure 3-15. (U, D, E) concept for classification of network outages (source: [T1A193]).
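One way to picture the shell idea is the following minimal Python sketch. Everything numeric in it is hypothetical: the normalizing scales, the weights, and the shell radii are placeholders chosen for illustration and are not taken from [T1A193]. Equal weights give spherical shells; unequal weights stretch them into ellipsoids.

```python
import math
from typing import NamedTuple


class Outage(NamedTuple):
    u: float  # unservability, percent
    d: float  # duration, minutes
    e: float  # extent, e.g. subscriber lines affected


# Hypothetical values, for illustration only.
SCALE = Outage(u=100.0, d=60.0, e=10_000.0)   # puts the axes on comparable units
WEIGHTS = Outage(u=1.0, d=1.0, e=1.0)         # equal weights -> spherical shells
SHELLS = [(0.5, "minor"), (1.0, "major")]     # outer radius of each shell


def weighted_magnitude(o: Outage) -> float:
    """Weighted Euclidean length of the normalized (U, D, E) vector."""
    return math.sqrt(
        sum(w * (x / s) ** 2 for x, w, s in zip(o, WEIGHTS, SCALE))
    )


def classify(o: Outage) -> str:
    """Return the label of the innermost shell that contains the event."""
    r = weighted_magnitude(o)
    for radius, label in SHELLS:
        if r < radius:
            return label
    return "catastrophic"


print(classify(Outage(u=10, d=5, e=100)))
print(classify(Outage(u=50, d=30, e=5_000)))
print(classify(Outage(u=100, d=120, e=20_000)))
```

Note that the triple itself is kept intact; the scalar magnitude is used only to decide which shell the vector falls in.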
(U, D, E) shells can be used both to categorize events and to lay down prescriptive policy for what constitutes an event requiring a company review of an incident or of its methods. For example, a Local Switch failure may be defined to have occurred whenever 500 subscriber lines (the extent, E) are totally isolated (the definition of unservable, U) for 2 minutes or more (the duration, D), i.e., (U, D, E) = (100, 2, 500), with U in percent, D in minutes and E in subscriber lines.
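A prescriptive rule of this kind could be expressed as a simple component-wise threshold test. The sketch below is an assumption about how such a policy might be coded, not a standardized procedure: the class `UDE`, the function `meets_threshold`, and the choice to treat every component as "at or above the threshold" are all illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UDE:
    unservability_pct: float  # U, percent of usage units that fail
    duration_min: float       # D, minutes
    extent_lines: int         # E, subscriber lines affected


# Threshold from the Local Switch failure example in the text.
LOCAL_SWITCH_FAILURE = UDE(unservability_pct=100.0, duration_min=2.0, extent_lines=500)


def meets_threshold(event: UDE, threshold: UDE) -> bool:
    """True if the event reaches the threshold on all three axes.

    Assumption: each component is compared with >=, so larger or longer
    events than the stated minimum also qualify.
    """
    return (
        event.unservability_pct >= threshold.unservability_pct
        and event.duration_min >= threshold.duration_min
        and event.extent_lines >= threshold.extent_lines
    )


# 800 lines fully isolated for 3.5 minutes: qualifies for review.
print(meets_threshold(UDE(100.0, 3.5, 800), LOCAL_SWITCH_FAILURE))
# Same extent, but restored within one minute: does not qualify.
print(meets_threshold(UDE(100.0, 1.0, 800), LOCAL_SWITCH_FAILURE))
```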
Outage Index
Later work in the T1A1.2 committee that produced [T1A193] considers an approach leading to a single Outage Index. It is conceptually equivalent to forming a weighted vector magnitude of the (U, D, E) triple, but involves predefined nonlinear weighting curves for D and E and discrete multipliers for time of day and type of trunk affected (inter- or intra-LATA, 911, etc.). Such weightings are ultimately arbitrary but can nonetheless be fully detailed in a standardized method and then be of valuable service when applied industry-wide.