3.13 Network Reliability
The field of "network reliability" is concerned with whether graph connectivity is retained in networks whose edges each have a non-zero probability of being in a failed state. The central question sounds simple, but it is in fact quite difficult to compute exactly the probability that a graph remains connected as a whole, or that a path remains between specific nodes or sets of nodes, when the edges are unreliable. Specific measures studied in this field are "{s,t}" or "two-terminal" reliability, k-terminal reliability, and all-terminal reliability. These are all measures of the purely topology-dependent probability that the graph becomes disconnected between specified pairs or sets of nodes. Rai and Agrawal [RaAg90] provide a complete survey of this field. Here we try only to extract those basic ideas of network reliability that form part of a grounding for work on transport network survivability and that feed into the problem of availability block diagram reduction.
Figure 3-21 illustrates the basic orientation for the network reliability problem. Four randomly generated, equally likely states are drawn for an assumed link failure probability p_link = 0.32 (i.e., of the 28 links present we expect about 9 to be down at any one time on average). A solid line is a working link; a dashed line is a failed link. If we pick the node pair 0-11, we see that in (a) through (c), despite the failures, there is always still a route between them. Inspection shows, in fact, that none of the randomly generated states (a) through (c) contributes any two-terminal unreliability: there is still at least one topologically feasible route between every node pair. Equivalently, none of these failure combinations has disconnected the graph. Case (d), however, is an equally likely state but has a dramatically different effect. Four of the nine failed links form a cut of the graph across edges (14-19), (14-9), (6-13), and (13-5). The two-terminal reliability of every node pair separated by the cut is thus affected by this failure state. This not only illustrates how abrupt and discontinuous network behavior is in general, but also conveys why numerical enumeration of all link-state combinations, followed by tests for graph connectivity, is not feasible for this type of problem on a large network.
Figure 3-21. Network reliability: How likely is it that at least one route exists between nodes? In this example there are 2^28 link-state combinations to consider.
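To make the enumeration problem concrete: with 28 links there are 2^28 = 268,435,456 link-state combinations, each of which would need a connectivity test. A common alternative is to draw link states at random, as in Figure 3-21, and count how often a route between the chosen terminals survives. The sketch below is a minimal Monte Carlo estimator of two-terminal reliability, assuming independent link failures with a common failure probability `p_fail`. The 6-node ring used to exercise it is a hypothetical stand-in, not the network of Figure 3-21; it is chosen because its exact answer, 1 - (1 - 0.68^3)^2, is easy to check.

```python
import random
from collections import deque


def st_connected(n_nodes, edges, up, s, t):
    """Breadth-first search using only the links whose flag in `up` is True."""
    adj = [[] for _ in range(n_nodes)]
    for (u, v), working in zip(edges, up):
        if working:
            adj[u].append(v)
            adj[v].append(u)
    seen = {s}
    queue = deque([s])
    while queue:
        node = queue.popleft()
        if node == t:
            return True
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def estimate_two_terminal(n_nodes, edges, s, t, p_fail, trials, seed=1):
    """Fraction of randomly drawn link states in which s still reaches t."""
    rng = random.Random(seed)
    survived = 0
    for _ in range(trials):
        # Each link is independently down with probability p_fail.
        up = [rng.random() >= p_fail for _ in edges]
        if st_connected(n_nodes, edges, up, s, t):
            survived += 1
    return survived / trials


# Hypothetical 6-node ring: nodes 0 and 3 are joined by two disjoint 3-hop routes.
ring = [(i, (i + 1) % 6) for i in range(6)]
estimate = estimate_two_terminal(6, ring, 0, 3, p_fail=0.32, trials=20000)
exact = 1 - (1 - 0.68 ** 3) ** 2
print(round(estimate, 3), round(exact, 3))
```

Sampling avoids the 2^m blow-up, but its error shrinks only as 1/sqrt(trials), and rare disconnecting states (such as case (d)) are seldom drawn when links are highly reliable.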
Of course, in a real network there may also be outage due to finite-capacity effects in Figure 3-21(a) through (c), but this is not in the scope of the basic "network reliability" problem. Basic network reliability (in the sense of [Colb87], [HaKr95], [Shei91]) presumes that there are no routing or capacity constraints on the network graph: if at least one route exists topologically between {s,t}, it is assumed that the signal (or packet train) will discover and use it. With this limitation understood, its methods and concepts can still provide tools for more encompassing availability analysis. The problem of most relevance to the availability of a service path through a network is that of two-terminal reliability.
3.13.1 Two-Terminal Network Reliability
Two-terminal reliability is the complement of the probability that every distinct path between {s,t} contains at least one failed (or blocked) link. Exact computation of two-terminal reliability is #P-complete (and hence NP-hard) for general graphs, even when the link failure probabilities are known. The computational cost of enumerating all network states and inspecting each for connectivity between nodes {s,t} has led to more computationally efficient bounding approaches. A widely known general form is called the reliability polynomial:

$$R(G,\{s,t\},p) = \sum_{i=0}^{m} N_i(G,\{s,t\})\, p^{i}\, (1-p)^{m-i} \qquad (3.31)$$

where G = (V, E) is the network graph, m = |E| is the number of edges in the graph, {s,t} is a specific terminal pair, and p is the link operating probability.
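For a very small graph the coefficients N_i can be counted exactly by brute force, which makes the structure of the reliability polynomial R = sum over i of N_i p^i (1-p)^(m-i) easy to see. The sketch below enumerates all 2^m edge states, tallies N_i by the number i of working edges in each connected state, and evaluates the polynomial for a common p. The five-edge "bridge" graph (terminals 0 and 3, cross-coupling edge (1, 2)) is a hypothetical example chosen for illustration; its known closed form is 2p^2 + 2p^3 - 5p^4 + 2p^5.

```python
from collections import deque
from itertools import product


def st_connected(edges, up, s, t):
    """True if s reaches t using only the edges flagged as working in `up`."""
    adj = {}
    for (u, v), working in zip(edges, up):
        if working:
            adj.setdefault(u, []).append(v)
            adj.setdefault(v, []).append(u)
    seen, queue = {s}, deque([s])
    while queue:
        node = queue.popleft()
        if node == t:
            return True
        for nxt in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def reliability_coefficients(edges, s, t):
    """N[i] = number of edge-state combinations with exactly i working edges
    (i.e., m - i failed edges) in which s and t remain connected."""
    m = len(edges)
    N = [0] * (m + 1)
    for up in product((False, True), repeat=m):  # all 2**m states
        if st_connected(edges, up, s, t):
            N[sum(up)] += 1
    return N


def reliability_polynomial(N, p):
    """Evaluate sum_i N[i] * p**i * (1 - p)**(m - i) for a common link up-probability p."""
    m = len(N) - 1
    return sum(N[i] * p ** i * (1 - p) ** (m - i) for i in range(m + 1))


# Hypothetical "bridge" graph: s = 0, t = 3, with a cross-coupling edge (1, 2).
bridge = [(0, 1), (0, 2), (1, 3), (2, 3), (1, 2)]
N = reliability_coefficients(bridge, 0, 3)
print(N, reliability_polynomial(N, 0.9))
```

Once the N_i are known, the polynomial can be re-evaluated for any p at negligible cost; the expensive part is obtaining the N_i, which is exactly the counting problem the bounds below try to sidestep.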
This form yields either exact or approximate (bounding) estimates of R(·), depending on how N_i(·) is obtained. In its exact form, N_i(·) is the number of subgraphs of G in which there are exactly (m-i) failed links but the remaining graph still contains a route between nodes {s,t}. Of course, this just defers the problem of calculating R(·) to that of counting or estimating N_i(·). Two simple bounds are conceptually evident at this stage. One is to enumerate, for each i ∊ 1...m, only those combinations of failed links that constitute cuts of the graph between {s,t}. A cut-finding program can thus enumerate a large number of cuts and their associated weights (in terms of number of edges) for insertion into Equation 3.31. For p ≈ 1 the smallest cuts are the most likely and hence the numerically dominant contributors to R(·). Assuming not all of the highest-order cuts are enumerated, the result will be an upper (i.e., optimistic) bound on the exact R(·), i.e.,

$$R(G,\{s,t\},p) \le 1 - \sum_{i=c}^{m} C_i(G,\{s,t\})\, (1-p)^{i}\, p^{m-i} \qquad (3.32)$$

where c is the size (number of edges) of the minimum cut of the graph between {s,t} and C_i(·) is the number of {s,t} cutsets found comprising exactly i edges. The exact reliability will be lower than this because network states involving i failures that contain a cutset of fewer than i edges are connectivity-failure states that are not counted.
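The cut-based bound can be sketched directly: find the minimal {s,t} cutsets (failure sets that separate s and t, where restoring any single member reconnects them), tally C_i by size, and subtract sum C_i (1-p)^i p^(m-i) from 1. The minimality test and the `max_size` truncation parameter are assumptions of this sketch, used to mimic a cut-finding program that stops at some cut order; the bridge graph is the same hypothetical example as before, whose exact reliability at p = 0.9 is 0.97848.

```python
from collections import deque
from itertools import combinations


def st_connected_after_failures(edges, failed, s, t):
    """True if s still reaches t once the edge indices in `failed` are removed."""
    adj = {}
    for idx, (u, v) in enumerate(edges):
        if idx not in failed:
            adj.setdefault(u, []).append(v)
            adj.setdefault(v, []).append(u)
    seen, queue = {s}, deque([s])
    while queue:
        node = queue.popleft()
        if node == t:
            return True
        for nxt in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def is_minimal_cutset(edges, failed, s, t):
    """Failing every edge in `failed` separates s and t, but restoring any one reconnects them."""
    if st_connected_after_failures(edges, failed, s, t):
        return False
    return all(st_connected_after_failures(edges, failed - {e}, s, t) for e in failed)


def cut_upper_bound(edges, s, t, p, max_size):
    """Optimistic bound: 1 - sum_i C_i q**i p**(m-i) over minimal cutsets of <= max_size edges."""
    m, q = len(edges), 1 - p
    unreliability = 0.0
    for i in range(1, max_size + 1):
        c_i = sum(
            1
            for failed in combinations(range(m), i)
            if is_minimal_cutset(edges, set(failed), s, t)
        )
        unreliability += c_i * q ** i * p ** (m - i)
    return 1 - unreliability


bridge = [(0, 1), (0, 2), (1, 3), (2, 3), (1, 2)]
for k in range(2, 6):
    print(k, round(cut_upper_bound(bridge, 0, 3, 0.9, k), 5))
```

Stopping at the two-edge cuts already captures most of the unreliability at p = 0.9, and adding the three-edge cuts tightens the bound; even with every minimal cutset included, the result stays above the exact value because failure states that merely contain a smaller cut are never counted.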
A converse viewpoint for assessing N_i(·) is from the standpoint of network states that contain at least one working route among the set of all distinct routes between {s,t}. (The two viewpoints are conceptually the same as the notion of "cuts and ties" in more advanced analysis of system availability block diagrams.) Here, all k successively longer distinct (non-looping) routes on the graph between {s,t} are generated, and each is recorded with its length (the number of edges in series en route). A simple upper (i.e., optimistic) bound on {s,t} reliability is then:

$$R(G,\{s,t\},p) \le 1 - \prod_{i=1}^{k} \left(1 - p^{L_i}\right) \qquad (3.33)$$

where L_i is the length of the ith distinct route between {s,t}. Figure 3-22(a) portrays the basic notion of {s,t} reliability viewed in Equation 3.33 as the probability that not every possible route is blocked, which implicitly treats the routes as independent entities. In contrast, Figure 3-22(b) shows how several distinct routes may actually share single link failures in common, illustrating why Equation 3.33 is an optimistic bound.
Figure 3-22. Orientations to the network reliability calculation: (a) failures that together create an (s,t) cut set, (b) failures that defeat all routes between (s,t).
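The route-based bound needs only the lengths of the distinct loop-free routes. The sketch below enumerates them by depth-first search and evaluates 1 - prod(1 - p^L_i). On the hypothetical bridge graph (terminals 0 and 3, cross-coupling edge (1, 2)) the two 3-hop routes reuse edges of the 2-hop routes, so treating the routes as independent overstates reliability relative to the exact 0.97848 at p = 0.9.

```python
def simple_routes(edges, s, t):
    """All loop-free s-t routes, each returned as a list of edge indices (DFS)."""
    adj = {}
    for idx, (u, v) in enumerate(edges):
        adj.setdefault(u, []).append((v, idx))
        adj.setdefault(v, []).append((u, idx))
    routes = []

    def extend(node, visited, path):
        if node == t:
            routes.append(list(path))
            return
        for nxt, idx in adj.get(node, []):
            if nxt not in visited:
                visited.add(nxt)
                path.append(idx)
                extend(nxt, visited, path)
                path.pop()
                visited.remove(nxt)

    extend(s, {s}, [])
    return routes


def route_upper_bound(route_lengths, p):
    """1 - prod(1 - p**L): treats every route as if it failed independently."""
    all_blocked = 1.0
    for length in route_lengths:
        all_blocked *= 1 - p ** length
    return 1 - all_blocked


bridge = [(0, 1), (0, 2), (1, 3), (2, 3), (1, 2)]
lengths = sorted(len(r) for r in simple_routes(bridge, 0, 3))
print(lengths, route_upper_bound(lengths, 0.9))
```

The number of distinct routes grows very quickly with network size, so in practice one would stop after the k shortest routes, which further loosens (raises) the bound's dependence on the omitted routes in the opposite direction; the sketch keeps all of them only because the graph is tiny.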
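More precisely, the route-based formulation depends on the probability of the union of the events E_i, where E_i is the event that all edges in route i are operating, i.e.,

$$R(G,\{s,t\},p) = \Pr\left\{ \bigcup_{i=1}^{k} E_i \right\}$$

which calls for application of the inclusion-exclusion principle for the union of non-disjoint sets [GaTa92] (p.90). Denoting by Pr{E_i} the probability that all links in the ith route are operating,

$$R(G,\{s,t\},p) = \sum_{i=1}^{k} \Pr\{E_i\} - \sum_{i<j} \Pr\{E_i \cap E_j\} + \sum_{i<j<l} \Pr\{E_i \cap E_j \cap E_l\} - \cdots + (-1)^{k+1} \Pr\{E_1 \cap E_2 \cap \cdots \cap E_k\}$$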
In [Shei91] the application of the inclusion-exclusion principle for probability union is treated further, showing that there are always certain cancellation effects between terms of the inclusion-exclusion series that give further insights (the concept of irrelevant edges) and that can be exploited to simplify the expansion process.
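With independent edges, the probability that a group of routes is simultaneously working is p raised to the number of distinct edges in the union of those routes, so the inclusion-exclusion expansion over route-working events can be written down mechanically. The sketch below does exactly that, assuming a common operating probability p and using a naive expansion over all 2^k - 1 route groups (none of the cancellation shortcuts from [Shei91]). On the hypothetical bridge graph it reproduces the exact value 2p^2 + 2p^3 - 5p^4 + 2p^5.

```python
from itertools import combinations


def simple_routes(edges, s, t):
    """All loop-free s-t routes, each returned as a list of edge indices (DFS)."""
    adj = {}
    for idx, (u, v) in enumerate(edges):
        adj.setdefault(u, []).append((v, idx))
        adj.setdefault(v, []).append((u, idx))
    routes = []

    def extend(node, visited, path):
        if node == t:
            routes.append(list(path))
            return
        for nxt, idx in adj.get(node, []):
            if nxt not in visited:
                visited.add(nxt)
                path.append(idx)
                extend(nxt, visited, path)
                path.pop()
                visited.remove(nxt)

    extend(s, {s}, [])
    return routes


def inclusion_exclusion(routes, p):
    """Exact Pr{at least one route works}; a group of routes all work with
    probability p ** (number of distinct edges in their union)."""
    route_sets = [frozenset(r) for r in routes]
    total = 0.0
    for size in range(1, len(route_sets) + 1):
        sign = 1 if size % 2 == 1 else -1  # +, -, +, ... by group size
        for group in combinations(route_sets, size):
            union = frozenset().union(*group)
            total += sign * p ** len(union)
    return total


bridge = [(0, 1), (0, 2), (1, 3), (2, 3), (1, 2)]
routes = simple_routes(bridge, 0, 3)
print(len(routes), inclusion_exclusion(routes, 0.9))
```

Many terms share the same union size and cancel (here the three-route and four-route groups all cover every edge), which is the kind of cancellation that can be exploited to shorten the expansion.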
3.13.2 Factoring the Graph and Conditional Decomposition
Let us now return to the problem of calculating system availability in cases where basic series and parallel relationships do not completely reduce the model. This is where the link to network reliability arises. If a network is completely reducible between nodes {s,t} into a single equivalent link by repeated application of simple reductions, the network is said to be two-terminal series-parallel. In such a case the probability of the resulting single reduced edge is R(G, {s,t}, p). But many realistic cases are not two-terminal series-parallel in nature because of some edge that cross-couples the remaining relationships in a way that halts further application of the series-parallel reductions. In the approach that follows, which is also based in network reliability, such an edge is used as a pivot point on which the problem is split into two conditional subproblems: one in which that edge is assumed to be available, and one in which it is assumed to be down.
Figure 3-23 summarizes the basic series-parallel reduction rules in a canonical form on the edge probabilities (probabilities of the link being up, equivalent to the elemental availability). Cases (a) and (b) are the previous basic parallel and series relationships, to which case (c), called a "two-neighbor reduction," is added. When applied to either a network graph or an availability block diagram, these transformations are exact or, in the language of network reliability, "reliability preserving." To use these reductions, element failures must be statistically independent, and in cases (b) and (c) in Figure 3-23 node b must have no other arcs incident upon it. Node b also cannot be either the source or the target. While single arcs are shown, the rules apply to any block that is similarly reducible to a single probability expression, so that, for instance, p1 in Figure 3-23(a) may already be the result of a prior set of series-parallel reductions.
Figure 3-23. Reliability-preserving graph reduction rules: (a) parallel, (b) series, (c) two-neighbor reduction.
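Rules (a) and (b) translate directly into two small functions on up-probabilities, assuming statistically independent elements. The two-neighbor rule (c) is not sketched here because its expression is given only in the figure. Because a block's result is itself just a probability, the functions nest, mirroring the remark that p1 may already be the output of earlier reductions.

```python
from math import prod


def series(*probs):
    """Rule (b): a series chain is up only if every element is up."""
    return prod(probs)


def parallel(*probs):
    """Rule (a): a parallel group is down only if every element is down."""
    return 1 - prod(1 - p for p in probs)


# Two parallel branches, each a two-element series chain (each element 0.9).
branch = series(0.9, 0.9)
print(branch, parallel(branch, branch))
```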
In general, the series-parallel reduction rules will be exhausted before the original network is completely reduced. This usually manifests itself through some edge that cross-couples the remaining subgraphs, i.e., one or more nodes will be like node b in Figure 3-23(b) but with more than two arcs incident, so that a further series reduction is not possible. At this stage the graph can be "factored" to continue the reductions. Graph factoring is based on Moskowitz's pivotal decomposition formula [Shei91] (p.10). The key idea is that:

$$R(G) = p\, R(G|e) + (1-p)\, R(G-e)$$

where p is the probability that edge e is available, G|e means graph "G given e," and G-e is graph G without edge e. Thus the whole is treated as the conditional probability decomposition over the two states that the confounding edge e may be in, with probabilities p and (1-p) respectively. G|e is represented by graph G in which edge e is contracted or "short circuited." The probability-weighted sum of the two conditional terms is the two-terminal graph reliability. In practice the idea is to recognize a key edge e that will decouple the two resulting conditional subgraphs in a way that allows another round of series-parallel reductions. A complete graph may thus be decomposed through alternating steps: series-parallel reductions, a split into two conditional subgraphs, series-parallel reductions on each, further splits as needed, and so on. The real computational advantage of the decomposition step is in overcoming situations where no further series-parallel reductions are possible. Were it not for this use of decomposition to link between subproblems that are further series-parallel reducible, it would be of little practical value, because by itself it is equivalent to state-space enumeration by building a binary tree of all two-state edge combinations. More detailed treatments can be found in [Colb87] (p.77) and [HaKr95].
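The pivotal decomposition R(G) = p R(G|e) + (1-p) R(G-e) can be sketched as a short recursion: delete the pivot edge for the "down" branch, contract it for the "up" branch, and stop when the terminals have been merged (reliability 1) or can no longer be joined even with every edge up (reliability 0). This is a minimal sketch under stated simplifications: it picks the first edge as the pivot and deliberately omits the series-parallel reductions between splits, so on its own it behaves much like the binary-tree enumeration described above. Edge probabilities are per edge, given as (u, v, p) tuples; the bridge graph is a hypothetical test case.

```python
from collections import deque


def st_connected(edges, s, t):
    """Connectivity test treating every listed edge as present."""
    adj = {}
    for u, v, _ in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    seen, queue = {s}, deque([s])
    while queue:
        node = queue.popleft()
        if node == t:
            return True
        for nxt in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def factored_reliability(edges, s, t):
    """Two-terminal reliability by pivotal decomposition on one edge at a time."""
    if s == t:  # terminals merged by earlier contractions
        return 1.0
    if not st_connected(edges, s, t):  # no route even with every edge up
        return 0.0
    (u, v, p), rest = edges[0], edges[1:]

    # G - e: the pivot edge is down, so it is simply deleted.
    r_down = factored_reliability(rest, s, t)

    # G|e: the pivot edge is up, so it is contracted by relabelling v as u.
    def relabel(x):
        return u if x == v else x

    # Edges that become self-loops after contraction are dropped.
    contracted = [(relabel(a), relabel(b), q) for a, b, q in rest
                  if relabel(a) != relabel(b)]
    r_up = factored_reliability(contracted, relabel(s), relabel(t))
    return p * r_up + (1 - p) * r_down


bridge = [(u, v, 0.9) for u, v in [(0, 1), (0, 2), (1, 3), (2, 3), (1, 2)]]
print(factored_reliability(bridge, 0, 3))
```

A practical implementation would apply the parallel and series rules to each subgraph before choosing the next pivot, and would choose the pivot so that the resulting subgraphs become series-parallel reducible.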
To illustrate the application to availability problems, consider the availability block diagram in Figure 3-24(a). Because of the "diagonal" element, it is not amenable to series-parallel reduction. We do, however, obtain two subgraphs that are each easily analyzed if we presuppose the state of the diagonal element. In (b) we presume it has failed; in (c) we presume it to be in a working state. The resulting subgraphs thus give conditional estimates of the system availability. To get the overall availability, we weight the result for each subgraph by the probability of the decomposed link state that led to that subgraph. Therefore, for this example:
Figure 3-24. Example illustrating conditional decomposition of an availability block diagram.
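Because the element labels of Figure 3-24 are not reproduced here, the sketch below assumes a hypothetical labelling of a bridge-type diagram: a1 from s to x, a2 from s to y, a3 from x to t, a4 from y to t, and a5 as the diagonal between x and y. Under that labelling, conditioning on the diagonal gives A = a5 (a1 || a2)(a3 || a4) + (1 - a5) ((a1 a3) || (a2 a4)), where || denotes the parallel combination 1 - (1 - x)(1 - y). A brute-force sum over all 32 element states is included as a cross-check.

```python
from itertools import product
from math import prod


def series(*a):
    return prod(a)


def parallel(*a):
    return 1 - prod(1 - x for x in a)


def bridge_availability(a1, a2, a3, a4, a5):
    """Conditional decomposition on the diagonal element a5 (hypothetical layout)."""
    # Diagonal working: x and y are effectively joined, as in subgraph (c).
    given_up = series(parallel(a1, a2), parallel(a3, a4))
    # Diagonal failed: two series paths in parallel, as in subgraph (b).
    given_down = parallel(series(a1, a3), series(a2, a4))
    return a5 * given_up + (1 - a5) * given_down


def bridge_by_enumeration(a1, a2, a3, a4, a5):
    """Cross-check: add up the probabilities of all element states that leave a route."""
    probs = (a1, a2, a3, a4, a5)
    total = 0.0
    for states in product((False, True), repeat=5):
        x1, x2, x3, x4, x5 = states
        weight = prod(p if up else 1 - p for p, up in zip(probs, states))
        # Routes: s-x-t, s-y-t, s-x-y-t, s-y-x-t.
        if (x1 and x3) or (x2 and x4) or (x1 and x5 and x4) or (x2 and x5 and x3):
            total += weight
    return total


print(bridge_availability(0.9, 0.9, 0.9, 0.9, 0.9))
```

The decomposition needs only two series-parallel evaluations instead of 32 state evaluations, and the saving grows rapidly when the conditional subgraphs are themselves large but series-parallel reducible.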