3.5 Logical Layer Survivability Schemes
System layer protection schemes all rely on essentially fixed transmission and/or protection structures. An advantage of this is that once installed and tested, such systems are discrete identifiable network substructures, their operation is relatively simple and self-contained (i.e., they don't involve a highly general reaction over the network at large), and the restoration path taken for any failure is clearly known in advance. The difference (relative to mesh protection at the logical layer) was once explained to the author this way: "transmission people are comfortable with self-healing systems, but self-healing networks are (perceived to be) too general and unpredictable for their liking."
On the other hand, system layer implementations of protection are essentially static, and this can be more to the dislike of the services, planning, and business people within the same companies. If the configuration of as-built systems turns out not to match forecast demand well, it is not at all easy to change the configuration because it is essentially determined by the hardware installation. In addition, if a first-failure has occurred, there is nothing that can be done during the period of physical repair to particularly enhance the readiness of the network to withstand a possible second failure. Fixed system layer protection schemes also do not easily support differentiated quality of protection classes: when whole systems switch at the line rate to protect failures, everyone gets the same class of protection.
This brings us to consider logical layer protection (or restoration) schemes. The flexible ability of the logical layer to create paths on demand between desired end points, out of a general inventory of uncommitted channels on transmission systems, makes it the natural domain of a number of survivability schemes with features that are not provided by the ring or APS system layer schemes. Foremost among these considerations is the higher capacity efficiency that can be achieved by "mesh" restoration schemes, which permit extensive sharing of protection capacity over non-simultaneous failure scenarios. Capacity efficiency arises not only through sharing of spare capacity but also because cross-connects in the logical layer manage capacity at a finer granularity than in the system layer. As an example, a SONET DCS might manipulate STS-1 and STS-3c signals whereas an OC-192 BLSR manipulates the entire 10 Gb/s line signal as a unit for protection purposes. Similarly, in an optical network the logical layer OXC will manipulate single lightpaths for routing or restoration, whereas protection actions in the system layer (and capacity allocations to go with those actions) will probably be based on whole-fiber or large waveband levels of manipulation.
3.5.1 Concepts of Protection, Restoration and Distributed Preplanning
System layer schemes are inherently all of the class we will define as protection, while logical layer schemes can be either restoration or protection schemes. Let us now make the distinction. The term protection derives from its origin in APS systems. In a 1+1 DP APS system, the switching actions are completely predefined and the protection system is fully connected between its end nodes and in a pre-tested, ready-to-use state. The working system is said to be protected, as opposed to restorable, in these circumstances. If a 1:1 APS is involved, then signaling is required to request the head-end bridge and to bump any "extra traffic" off the spare span. In addition, the protection system has to be tested on the fly for correct transmission of the bridged signal. However, the protection route is completely pre-defined and no cross-connections are needed to create the signal path. It follows that UPSR is the same as 1+1 DP APS and BLSR is the same as 1:1 DP APS in these regards. The term protection is generally used for all these as a category. The main difference in restoration is really only that the replacement paths that will reroute the payload signals may have to be found and/or cross-connected in real-time when the failure occurs. Thus, one can say that in a pure protection scheme, the backup paths are completely dedicated and ready to bear rerouted working demand flow. In a pure restoration scheme, all redundant resources are held in a shared pool until configured on demand for restoration against a specific failure that arises.
Having identified these general distinctions, it is important to stress that logical layer implementations of mesh-based survivability can be either protection schemes or restoration schemes. Possibly for competitive reasons, the classification of schemes as either "protection" or "restoration" has become rather over-emphasized and over-simplified, and coupled with an almost axiomatic assertion that "protection is fast and restoration is slow" and that the most efficient OXC-based mesh-survivability schemes are all "restoration" schemes. All of these points need to be sorted out. We hope to convey in this overview of the issues, and further in Chapters 5 and 8, that these views are overly simplified and dogmatic. In fact, the best possible arrangement for survivability may be a distributed restoration mechanism embedded in the logical layer which self-generates efficient mesh network protection preplans to withstand any first-failure and then executes directly, providing best-efforts state-adaptive restoration against a second failure, should it arise.
The basic assumption that needs to be challenged is that a process of finding paths in real-time is always slow and that if the replacement paths are known in advance they will always be fast. Neither is necessarily true as a generalization, especially with distributed preplanning to identify paths with a restoration mechanism in advance of any actual failure. Moreover, there are really at least three basic categories of scheme to consider, of which pure protection and pure restoration are only the extremes. The two extremes and intermediate possibilities are detailed in Table 3-4. A somewhat similar categorization appears in [ElBo03], which also surveys a taxonomy of variations between pure protection and pure restoration. The intermediate schemes are in some ways the most promising in terms of combining efficiency and speed. How fast or slow these schemes are depends on whether path finding or path cross-connection time dominates, not on whether they are classified as protection or restoration. In the intermediate category are schemes where the restoration paths are fully known before a failure, but spare channels are not cross-connected until a specific failure arises. In this regard neither span restoration using distributed preplanning (SR-DPP) nor shared backup path protection (SBPP) can be classified as simply a protection or a restoration scheme.
Table 3-4. Three Fundamental Classes of Survivability Scheme
| Type | Description | Examples | Generic Term |
|---|---|---|---|
| (a) | Pure Protection: Protection routes are known in advance and cross-connection is not required to use them: spare capacity is preconnected and needs only be accessed at end-points. | 1+1 APS, SNCP ring, UPSR, OPPR, BLSR, OSPR, p-cycles | Protection |
| (b) | Pure Restoration: Restoration routes are found adaptively based on the failure and the state of the network at the time of failure; connections to assemble the restoration paths are also made in real-time. | Distributed or centralized adaptive restoration algorithms or distributed restoration algorithms (DRA). | Restoration |
| (c) | Intermediate: Replacement routes are known in advance and cross-connection maps for fast local action are in place at all nodes but cross-connection is required to assemble the restoration path-set in real-time. | Distributed preplanning with Span restoration (SR-DPP), ATM Backup VP, shared backup path protection (SBPP). | Preplanned Restoration |
SR-DPP is an especially powerful technique in that, even if path finding is slow, distributed preplanning (DPP) can create (and frequently update) protection preplans that are already in place in the nodes in advance of failure. DPP works by running a series of mock-failure trials, each responded to by a distributed restoration algorithm (DRA) or other restoration protocol embedded in the network. For each "dress rehearsal," nodes simply record the local set of cross-connections that constituted their participation (if any) in the assembly of restoration paths for that failure trial. The concept is described more fully in [Grov94] or [Grov97] and in Chapter 5. It is a simple technique that retains all of the generality and database freedom of a distributed restoration algorithm, but provides a "protection" scheme of the intermediate type in Table 3-4. This always-present relationship between restoration and a corresponding preplanned "protection" scheme, derivable through DPP, must be kept in mind. Moreover, it is fundamental that if one solves any variety of spare capacity planning problem for different classes of mesh survivability schemes, that spare capacity can either be accessed adaptively by a restoration algorithm in the real network (which gives certain extra tolerances for error) or the same capacity planning solution for the restorable network can be used to provide a set of preplanned protection arrangements to be used in each node. Finally, regardless of whether a mesh-based survivability scheme operates in real-time with preplanned protection reactions or with an adaptive restoration algorithm, there is no difference at all in the capacity required or in the definition of the capacity-planning problem (assuming the restoration algorithm is fully capable in the required path finding role).
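The mock-failure idea can be illustrated with a small sketch. This is not the DRA of [Grov94]; it assumes a toy network, uses a plain breadth-first search as a stand-in for the embedded restoration protocol, finds only one restoration path per trial, and records only the tandem-node cross-connections. The names `shortest_path` and `distributed_preplan` are hypothetical.

```python
from collections import deque

def shortest_path(adj, src, dst, failed):
    """Fewest-hop route from src to dst that avoids the failed span (BFS)."""
    prev = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            break
        for v in adj[u]:
            if {u, v} == failed or v in prev:
                continue
            prev[v] = u
            queue.append(v)
    if dst not in prev:
        return None
    path, node = [], dst
    while node is not None:
        path.append(node)
        node = prev[node]
    return path[::-1]

def distributed_preplan(spans):
    """Run one mock-failure trial per span; each tandem node stores its action."""
    adj = {}
    for a, b in spans:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    # preplan[node][failed_span] -> list of (from_neighbor, to_neighbor)
    preplan = {node: {} for node in adj}
    for a, b in spans:
        # Assumption: BFS stands in for the response of a real DRA.
        route = shortest_path(adj, a, b, {a, b})
        if route is None:
            continue  # this span is not restorable in the toy network
        for i in range(1, len(route) - 1):
            preplan[route[i]][(a, b)] = [(route[i - 1], route[i + 1])]
    return preplan

# Toy four-node ring: a failure of A-B is rerouted A-D-C-B.
spans = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "A")]
plan = distributed_preplan(spans)
print(plan["D"][("A", "B")])  # cross-connection node D would assert
```

After the trials, each node holds a table keyed by failure identity, so the real-time reaction reduces to a local lookup once the failed span is known.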
The relative speed of the intermediate schemes (following a failure) depends on what dominates the real-time performance: path-finding time or cross-connecting time. Schemes of type (c) can approach the speed of pure protection if OXC cross-connection is fast, occurs in parallel at all nodes, and, through distributed preplanning (DPP), the protection routes and all local switching actions are completely known in advance. Upon failure, real-time is consumed only for failure notification. All nodes put their most recently preplanned actions into effect, in parallel. In span restoration with distributed preplanning on fast OXC nodes, the dominant time delay could be the simple dissemination of fault notification. As soon as nodes learn the failure identity they put an already known, locally stored, spare capacity cross-connection map into effect. This happens in parallel at all network nodes as soon as notification arrives. In another type of intermediate scheme, route-finding can take significant time but "assembly" of the path is virtual and takes essentially zero time. An example is CR-LDP "redial": once label distribution is complete, there is essentially zero subsequent path establishment delay per se.
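One rough way to picture this argument is to decompose the real-time delay into its components. The decomposition below is only an illustrative sketch: the symbols are assumptions introduced here, not notation used elsewhere in this book, and detection time is folded into the notification term.

```latex
% Illustrative only. T_notify: time to disseminate the failure identity;
% T_find: real-time path-finding; T_xc(n): cross-connection time at node n.
\begin{align*}
T_{\text{pure restoration}} &\approx T_{\text{notify}} + T_{\text{find}} + T_{\text{xc}} \\
T_{\text{preplanned, type (c)}} &\approx T_{\text{notify}} + \max_{n} T_{\text{xc}}(n)
\end{align*}
```

With preplanning, the path-finding term drops out of the real-time budget, and because nodes act in parallel the cross-connection term is a maximum over nodes rather than a sum. When that term is small, the notification term dominates, which is the case described above for fast OXC nodes. In the CR-LDP "redial" case the situation is reversed: the path-finding term remains and the cross-connection term is close to zero.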
Thus, we need to appreciate the range of possibilities between pure protection and pure restoration, but avoid the oversimplifications associated with these categorizations, particularly regarding speed and availability. SR-DPP and SBPP in particular are important intermediate schemes for which no categorical statement about relative speed is really justified unless it is based on a detailed implementation study. Depending on the relative speed of path finding to cross-connection, either scheme may even approach the speed of pure protection in the same network. A final point related to this discussion is how to refer to spare capacity designed into a network for either protection or restoration purposes. For brevity we will make no further distinction and refer simply to spare capacity whether used for protection or restoration.
Let us now return to our overview of restoration or protection schemes that operate in the logical layer. To guide the overview we introduce Table 3-5. Because the book itself is devoted to in-depth treatment of the mesh-based survivability schemes, we will not go into the same depth introducing them here as we did for rings, OCDCs and GLBNs, which we will not be covering further.
Table 3-5. Overview of Logical Layer Mesh-type Survivability Schemes
| Scheme or Principle | Short Description or Equivalences | Notes |
|---|---|---|
| Span restoration (SR) | Dynamic k-shortest paths | Uses a "DRA" (Chapter 5) or [Grov94] |
| Span protection (SP) | Shared-protection routes preplanned | Centrally controlled or self-organized by distributed preplanning with a DRA |
| Meta-mesh (MM) | SR in meta-mesh graph, loopback in chain subnets | A hybrid between span and path restoration (Chapter 5) |
| λ-based p-cycles | Like a BLSR that also protects straddling spans | ADM-like system level or OXC-managed p-cycles (Chapter 10) |
| Path restoration (PR) (with stub release) | MCMF with limited commodity requirements | Theoretically most efficient possible scheme (Chapter 6) |
| λ-based SBPP | 1:N APS sharing arranged over disjoint failures | (Chapter 6) |
| λ-based SLSP | SBPP on redefined sub-path segments | "Short leap" shared backup protection: overlapped SBPP sub-path setups |
| GMPLS: OSPF-TE / CR-LDP | Independent path reprovisioning attempts by all affected pairs | No assured speed or recovery level |
3.5.2 Span Restoration or Span Protection
In span restoration, restoration paths (or preplanned protection paths) reroute locally around the break, between the nodes of the failed span. In pure span restoration the paths are both found and connected in real-time. Span protection refers to a network operating either as outlined above with DPP-based protection preplans or through centrally computed and downloaded preplans. This type of scheme is sometimes also called link protection or line restoration.
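A minimal sketch of the rerouting idea follows, in the spirit of the "dynamic k-shortest paths" entry in Table 3-5. It is a centralized toy, not a distributed protocol: it assumes hop count as the path length, seizes one spare channel per span for each restoration path, and uses a made-up four-node network. The function names are hypothetical.

```python
from collections import deque

def bfs_path(spare, src, dst):
    """Fewest-hop route using only spans that still have spare capacity."""
    prev = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            break
        for v, cap in spare.get(u, {}).items():
            if cap > 0 and v not in prev:
                prev[v] = u
                queue.append(v)
    if dst not in prev:
        return None
    path, node = [], dst
    while node is not None:
        path.append(node)
        node = prev[node]
    return path[::-1]

def span_restore(spare_caps, failed, working):
    """Find up to `working` restoration paths between the failed span's end nodes."""
    spare = {}
    for (a, b), cap in spare_caps.items():
        if {a, b} == set(failed):
            continue  # the failed span contributes no capacity
        spare.setdefault(a, {})[b] = cap
        spare.setdefault(b, {})[a] = cap
    src, dst = failed
    paths = []
    while len(paths) < working:
        path = bfs_path(spare, src, dst)
        if path is None:
            break  # spare capacity exhausted: restoration is partial
        for u, v in zip(path, path[1:]):
            spare[u][v] -= 1  # seize one spare channel on each span used
            spare[v][u] -= 1
        paths.append(path)
    return paths

# Toy spare-capacity assignment (channels per span); span A-B carries 3 working channels.
spare_caps = {("A", "C"): 2, ("C", "B"): 1, ("A", "D"): 1,
              ("D", "B"): 2, ("C", "D"): 1, ("A", "B"): 0}
paths = span_restore(spare_caps, ("A", "B"), working=3)
print(paths)
```

In this example all three working channels are restored, two over two-hop routes and one over a three-hop route. Storing the same result per failure in advance would turn the scheme into span protection in the sense described above.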
3.5.3 Meta-mesh
Meta-mesh is a variation on span restoration that enhances the spare capacity efficiency of span restoration in sparse network graphs. It involves a combination of ring-like ADM loopback within subnetworks that are chains of degree-2 nodes and mesh-like planning of capacity for restoration flows over the logical higher-degree skeleton of a network containing many chain subnetworks. It represents a specific partial step toward path restoration. Chapter 5 is devoted to in-depth treatment of span restoration including meta-mesh.
3.5.4 p-Cycles
p-Cycles were introduced as a system layer technique where they would use modular-capacity nodal elements similar to an ADM and be implemented at the whole-fiber or waveband level. However, because the p-cycle concept separates the routing of working flows from the configuration of protection structures (not locking these two together as in rings), p-cycle based protection is also amenable to logical layer implementations. In this context the OXC nodes can set up and take down service paths as demand requires and separately configure and maintain a set of span-protecting p-cycles. Such p-cycles are established and managed at the logical channel level, rather than the system level, and can be easily changed to adapt to shifting demand patterns. Multi-service priority schemes for access to p-cycles for protection can also be fairly easily implemented in a logical layer implementation of p-cycles but not in the system layer. Chapter 10 covers p-cycles.
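The way a p-cycle protects both its own spans and straddling spans (Table 3-5) can be sketched as follows. This is a simplified illustration for one unit-capacity p-cycle: a failed on-cycle span is protected by the surviving remainder of the cycle, while a straddling span, whose end nodes are on the cycle but which is not itself part of it, can use both sides of the cycle. The function name and the five-node cycle are hypothetical, and the sketch assumes at most one span between any node pair.

```python
def pcycle_protection_paths(cycle, failed_span):
    """Protection paths that one unit-capacity p-cycle offers a failed span.

    `cycle` lists the nodes in cycle order. End nodes that are adjacent on the
    cycle are treated as an on-cycle span (assumes no parallel spans).
    """
    n = len(cycle)
    a, b = failed_span
    if a not in cycle or b not in cycle:
        return []  # this p-cycle offers the span no protection
    i, j = cycle.index(a), cycle.index(b)
    # Walk from a to b around the cycle in each direction.
    clockwise = [cycle[(i + k) % n] for k in range((j - i) % n + 1)]
    counter = [cycle[(i - k) % n] for k in range((i - j) % n + 1)]
    if len(clockwise) == 2:
        return [counter]    # on-cycle span: only the long way round survives
    if len(counter) == 2:
        return [clockwise]
    return [clockwise, counter]  # straddling span: both sides are intact

cycle = ["A", "B", "C", "D", "E"]
print(pcycle_protection_paths(cycle, ("A", "B")))  # on-cycle span
print(pcycle_protection_paths(cycle, ("A", "C")))  # straddling span
```

The second case is what distinguishes a p-cycle from a ring: the straddling span requires no capacity on the cycle for its own working flow, yet obtains two protection paths per unit of p-cycle capacity.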
3.5.5 Path Restoration
In path restoration (PR) the capacity design and corresponding rerouting problems are posed as multi-commodity maximum flow-like rerouting problems to replace affected paths end-to-end following removal of the failed span from the graph. This may or may not involve conversion of surviving working capacity of failed paths into capacity that is available for use in restoration, an aspect called "stub release." This type of path restoration with stub release is of special theoretical significance because it represents the most efficient possible class of survivable network.
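Stub release can be made concrete with a small bookkeeping sketch. It assumes each working path carries one unit of capacity and is given as a node sequence; the function name and the example paths are hypothetical. The sketch only counts the capacity that becomes reusable and does not perform the rerouting itself.

```python
def stub_release(working_paths, failed_span):
    """Capacity freed on surviving spans by paths that crossed the failed span."""
    failed = frozenset(failed_span)
    released = {}
    for path in working_paths:
        spans = [frozenset(s) for s in zip(path, path[1:])]
        if failed not in spans:
            continue  # path is unaffected and keeps its capacity
        for span in spans:
            if span != failed:
                # The surviving "stub" of a failed path becomes spare capacity.
                released[span] = released.get(span, 0) + 1
    return released

working_paths = [["A", "B", "C", "D"], ["B", "C"], ["A", "B", "E"]]
released = stub_release(working_paths, ("B", "C"))
print(released)
```

Here the failure of span B-C affects the first two paths. The first releases one unit on each of A-B and C-D; the second has no surviving stubs; the third is untouched.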
3.5.6 Shared-Backup Path Protection (SBPP)
A related method that is particularly amenable to IP-centric control of optical networks is called shared-backup path protection (SBPP). A prior scheme for ATM VP-based transport networking works in the same logical manner and can also be used for MPLS path protection. The approach in SBPP is simplified relative to path restoration by defining a single fully disjoint backup path for each working or "primary" path. In effect, a 1:1 DP APS arrangement is established at the level of each service path. This simplifies real-time operation as the protection response is independent of where the failure occurs on the corresponding working path (whereas the PR response is failure-specific). Relatively high efficiency is still achieved even though a 1-for-1 APS setup exists because spare capacity is shared over failure-disjoint backup paths. Chapter 6 is devoted to path restoration and SBPP schemes. Chapter 7 treats the application of SBPP to the MPLS layer or ATM VP layer transport where oversubscription-based planning of protection capacity is possible to considerably reduce overall capacity requirements.
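The effect of sharing spare capacity over failure-disjoint backup paths can be sketched with a simple count. The sketch assumes unit-capacity demands, single-span failures only, and primary and backup routes that are already chosen and mutually disjoint; it does not select the routes. The function name and the two example demands are hypothetical.

```python
def sbpp_spare_capacity(demands):
    """Spare channels needed per span with sharing, and with dedicated backups.

    `demands` is a list of (primary_spans, backup_spans), one unit each.
    With sharing, a span needs only as many spare channels as the largest
    number of backups that any single span failure activates on it.
    """
    triggered = {}   # triggered[backup_span][failed_span] -> backups activated
    dedicated = {}   # one spare channel per backup path crossing the span
    for primary, backup in demands:
        for s in backup:
            span = frozenset(s)
            dedicated[span] = dedicated.get(span, 0) + 1
            by_failure = triggered.setdefault(span, {})
            for f in primary:
                failed = frozenset(f)
                by_failure[failed] = by_failure.get(failed, 0) + 1
    shared = {span: max(by_failure.values())
              for span, by_failure in triggered.items()}
    return shared, dedicated

demands = [
    ([("A", "B")], [("A", "C"), ("C", "B")]),   # primary A-B, backup A-C-B
    ([("C", "D"), ("D", "B")], [("C", "B")]),   # primary C-D-B, backup C-B
]
shared, dedicated = sbpp_spare_capacity(demands)
print(sum(shared.values()), sum(dedicated.values()))
```

Both backups cross span C-B, but the two primaries have no span in common, so no single failure activates both backups and one spare channel on C-B serves both demands.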
3.5.7 Segmented or Short-Leap Shared Protection (SLSP)
This is a variation on SBPP in which SBPP-like shared backup protection paths are set up over several segments or sections of the path, rather than end-to-end over the entire length of the working path. This accommodates a working path that may need to travel through several pre-defined protection domains. More generally, division of any working path into segments for protection produces a family of options between the extremes of pure span protection and SBPP. When a primary is protected with segment-wise disjoint paths, not end-to-end with a single backup path, availability is improved and protection arrangements can be managed locally within each domain the primary crosses between entry and egress nodes of that domain. By further defining protection domains to overlap, single point exposures to node failure are avoided where the SBPP segments would be otherwise connected in tandem through a single node. The idea of segmented interlaced backup paths was introduced in [KrPr00] and later applied to lightpath protection in [HoMo02b], [SaMu02], [GuPr03]. The concept of segment-based rerouting was also studied in [JoSa98] where it was similarly recognized that with ATM backup-VP protection, availability would degrade for long service paths. The methods for SBPP design in Chapter 6 cover SLSP when used on a transformation of the initial demand matrix which converts end-to-end path requirements into apparent demand between the designated segment protection nodes.
3.5.8 GMPLS Automatic Reprovisioning as a Restoration Mechanism
For completeness we should also recognize that GMPLS is being viewed by some as offering a network restoration mechanism in addition to its primary role of provisioning transport paths. The thinking is that since OSPF-TE will eventually produce an updated global view of the topology and available capacity following a failure, then each node will be in a position to begin re-establishing those paths which it lost in the failure using GMPLS to simply "redial" each of their lost connections, over the shortest route following the failure.
It is important to note in this regard that SBPP uses GMPLS to establish a working path, and a corresponding disjoint backup path, but it does so for each path as it is provisioned and ahead of the failure. It thus effects a preplanned protection arrangement that is cognizant of the physical spare capacity present and of failure-coordinated contention or sharing relationships that have been established on each unit of spare capacity. This is significantly different from direct reliance on OSPF-TE/CR-LDP in real-time following a failure to attempt simultaneous re-establishment of all failed paths. Direct use of OSPF-TE/CR-LDP following a failure involves no preplanned reservation or sharing arrangements for capacity for a backup path. There are therefore considerable drawbacks to direct reliance on GMPLS auto reprovisioning for restoration. All other schemes involve considerations to coordinate or preplan the access to spare capacity for an effective and fast recovery following failure. By effective we mean that guarantees about the restorability level can be made by design. While GMPLS auto reprovisioning would usually be an effective response to isolated failure of a single lightpath, relying on the same method for recovery from a cable cut would seem almost irresponsible. A recovery response of some form would result, but essentially no control can be exercised, and no assurances given, about the duration and effectiveness of the overall recovery pattern that results.
When a cable is cut, a large number of independent asynchronous instances of the topology update and path redialing protocol will be triggered. Recovery actions must first wait until the OSPF-TE global view is synchronized in each node. As soon as it is, there will be a mass onset of CR-LDP signaling instances as individual end-node pairs attempt to re-establish their failed paths. Each acts without coordination with the others doing so concurrently. OSPF-TE updates will continue to be generated as the available capacity state changes on each link as CR-LDP seizures of capacity occur. This causes other CR-LDP instances to fail, or to be initiated with out-of-date resource information and thus destined to crank back. The overall dynamics of possibly thousands of concurrently activated signaling, capacity seizure, and update dissemination protocol instances, how they will interact to allocate the available capacity for protection, and how fast the whole process would settle down are quite uncertain. Even without considering signaling contention and fall-back dynamics, it is theoretically impossible to say what restoration level will be achieved because a finite-capacity multi-commodity rerouting problem is being attempted by a greedy (and mutually interfering) set of routing instances. The theoretical issue, treated in Chapter 4 and further in Chapter 6, is that of "mutual capacity," where one path-finding instance with several choices may, when acting independently, take capacity that makes paths for many other pairs infeasible. This can occur even when there is sufficient capacity for full restoration under more coordinated routing.
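The mutual capacity effect can be reproduced in a toy setting. The sketch below is not a model of OSPF-TE or CR-LDP signaling; it assumes a made-up six-node network with one spare channel per span and two node pairs that each "redial" by seizing the fewest-hop route available at the moment they act. The function names are hypothetical.

```python
from collections import deque

def seize_shortest(spare, src, dst):
    """Seize one channel on the fewest-hop route with spare capacity, if any."""
    prev = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in spare.get(u, {}):
            if spare[u][v] > 0 and v not in prev:
                prev[v] = u
                queue.append(v)
    if dst not in prev:
        return None  # no route left: this pair is not restored
    path, node = [], dst
    while node is not None:
        path.append(node)
        node = prev[node]
    path.reverse()
    for u, v in zip(path, path[1:]):
        spare[u][v] -= 1
        spare[v][u] -= 1
    return path

def greedy_redial(spans, pairs):
    """Each pair acts in turn, with no knowledge of the other pairs' needs."""
    spare = {}
    for a, b in spans:
        spare.setdefault(a, {})[b] = 1   # assumption: one spare channel per span
        spare.setdefault(b, {})[a] = 1
    return [seize_shortest(spare, s, t) for s, t in pairs]

spans = [("S1", "X"), ("X", "T1"), ("S1", "Y"),
         ("Y", "Z"), ("Z", "T1"), ("P", "X")]
first = greedy_redial(spans, [("S1", "T1"), ("P", "T1")])
second = greedy_redial(spans, [("P", "T1"), ("S1", "T1")])
print(first)
print(second)
```

When the pair S1-T1 acts first, it takes its shortest route through X and thereby consumes the only channel on X-T1 that the pair P-T1 could have used, so only one of the two paths is restored. When the order is reversed, S1-T1 falls back to its longer route and both are restored, even though the spare capacity is identical in the two runs.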
Thus, GMPLS auto reprovisioning may provide a useful built-in reaction for isolated path failures, but because there can be no assurances about the overall level or distribution of the recovery pattern for the set of paths that fail simultaneously under a cable cut, we do not treat it as a restoration method for use in the logical layer. This technique is more suited to use in the service layer.