Enterprise Network Design Patterns: High Availability
- Physical Network Topology and Availability
- Layer 2 Availability: Trunking —802.3ad—Link Aggregation
- Layer 2 Availability: Spanning Tree Protocol
- Layer 3—VRRP Router Redundancy
- Layer 3—IPMP—Host Network Interface Redundancy
- Layer 3—Integrated VRRP and IPMP
- Layer 3—OSPF Network Redundancy— Rapid Convergence
- Layer 3—RIP Network Redundancy
- Conclusion
Availability has always been an important design goal for network architectures. As enterprise customers increasingly deploy mission-critical web-based services, they require a deeper understanding of how to design optimal network availability solutions. There are several approaches to implementing high-availability network solutions. This article provides an overview of these approaches and describes where it makes sense to apply each one.
FIGURE 1 provides a high-level overview of a typical corporate customer's network. This integrated network can be divided into the following sectors to create a logical partitioning, which can be helpful in understanding the motivation for the protocols that provide resiliency.
FIGURE 1 Networking Features to Increase Availability
Access network: This sector connects the enterprise's private network to the service provider. This network is generally controlled by a network service provider, usually called an Internet service provider (ISP) because it provides connectivity to the Internet. The term access network is used by carriers because this is the point where end users and enterprises access the carrier networks. Depending on the configuration, there may be a static route from the enterprise to the ISP, or there may be an exterior routing protocol such as Border Gateway Protocol version 4 (BGP4). BGP4 is the more resilient option: if a particular route is down, an alternate route may be available.
Enterprise network: This network is the enterprise's internal network, which is always partitioned and segregated from the external network, primarily for security reasons. This network is the focus of our paper. Several methods provide network resiliency, which we investigate further in this article.
Corporate WAN: These networks provide connectivity over long distances to remote enterprise sites. There are varying degrees of connectivity: campus networks that interconnect enterprise buildings within a certain distance, metropolitan area networks (MANs) that interconnect enterprise offices located within one local provider's MAN, and wide area networks (WANs) that connect enterprise branch offices that may span thousands of miles. WAN connectivity generally requires the services of a Tier 1 service provider. Modern WAN providers may provide an IP tunnel for the enterprise to connect remote offices over a shared network.
In this paper, we briefly discuss how Multiprotocol Label Switching (MPLS) can be used for resiliency. MPLS has gained wide industry acceptance in core networks.
The scope of this article is limited to interior routing protocols and enterprise network technologies for availability purposes.
Physical Network Topology and Availability
One of the first items to consider for network availability is the physical topology from an implementation perspective. In general, the topology has a direct impact on the mean time between failures (MTBF) calculation. Serial components reduce availability and parallel components increase availability.
There are three topology aspects impacting network availability:
Component failure: This aspect is the probability of the device failing. It is measured statistically as the average time the device works divided by the sum of the average time the device works and the average time it is failed. This article refers to this value as the MTBF, although strictly speaking the ratio is the device's availability and the MTBF is only the average working time. In calculating this value, components that are connected serially drastically reduce it, while components that are connected in parallel increase it.
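The serial and parallel rules described above can be sketched in a few lines of Python. This is a minimal illustration, not a sizing tool: the function names and the uptime figures are hypothetical, and the sketch assumes that components fail independently of one another.

```python
from math import prod

def availability(uptime_hours, downtime_hours):
    """Fraction of time a component works: uptime / (uptime + downtime)."""
    return uptime_hours / (uptime_hours + downtime_hours)

def series(*parts):
    """Every component must work, so the availabilities multiply."""
    return prod(parts)

def parallel(*parts):
    """The group fails only if every redundant component fails at once."""
    return 1.0 - prod(1.0 - a for a in parts)

# Hypothetical figure: a switch that works 9,990 hours for every 10 hours failed.
switch = availability(9990, 10)

two_in_series = series(switch, switch)
four_in_series = series(switch, switch, switch, switch)
two_in_parallel = parallel(switch, switch)

print(f"single switch:   {switch:.6f}")
print(f"two in series:   {two_in_series:.6f}")
print(f"four in series:  {four_in_series:.6f}")
print(f"two in parallel: {two_in_parallel:.6f}")
```

Under these assumptions, each additional serial component lowers the result, while a redundant parallel component raises it above that of a single device.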
FIGURE 2 shows two network designs. In both designs, Layer 2 switches simply provide physical connectivity for one virtual local area network (VLAN) domain. Layer 2-7 switches are multilayer devices providing routing, load balancing, and other IP services, in addition to physical connectivity.
FIGURE 2 Network Topologies and Impact on Availability
Design A shows a flat architecture, often seen with multilayer chassis-based switches such as Extreme Networks Black Diamond, Foundry Networks BigIron, or Cisco switches. The switch can be partitioned into VLANs, isolating traffic in one segment from another, while still providing a much better solution overall. In this approach, the availability will be relatively high, because there are two parallel paths from the ingress to each server and only two serial components that a packet must traverse in order to reach the target server.
In Design B, the architecture provides the same functionality, but across many small switches. From an availability perspective, this solution will have a relatively lower MTBF because there are more serial components that a packet must traverse in order to reach a target server. Other disadvantages of this approach include reduced manageability, scalability, and performance. However, one can argue that this approach may offer increased security, which for some customers outweighs all other factors. In Design B, multiple switches need to be compromised to control the network, whereas in Design A, only one switch needs to be compromised to bring down the entire network.
System failure: This aspect captures failures that are caused by external factors, such as a technician accidentally pulling out a cable. The number of components that are potential candidates for failure is directly proportional to the complexity of the design, so more components result in a higher system failure probability. Design B, in FIGURE 2, has more components that can go wrong, which contributes to its increased probability of failure.
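The effect of component count on system failure probability can be illustrated with a short sketch. The per-component probability used here is a hypothetical figure, the function name is illustrative, and the calculation assumes that accidental failures are independent.

```python
def p_any_failure(n_components, p_each):
    """Probability that at least one of n independent components fails."""
    return 1.0 - (1.0 - p_each) ** n_components

# Hypothetical: each component has a 1% chance of an accidental failure
# (for example, a pulled cable) during some fixed period.
for n in (2, 4, 8):
    print(n, "components:", round(p_any_failure(n, 0.01), 4))
```

The probability of at least one failure rises with every component added, which is the sense in which a design with more components is more exposed to system failures.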
Single points of failure: This aspect captures the number of devices that can fail while the system continues to function. Neither design has a single point of failure, so the two are equal in this regard. However, Design B is somewhat more resilient because if a network interface card (NIC) fails, that failure is isolated by the Layer 2 switch and does not impact the rest of the architecture. This is a trade-off to consider, where availability is sacrificed for increased resiliency and isolation of failures.
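One way to check a topology for single points of failure is to remove each device in turn and test whether the server is still reachable. The sketch below assumes a simple adjacency-list model. The two topologies and all device names are hypothetical and are not taken from FIGURE 2.

```python
from collections import deque

def reachable(links, start, removed):
    """Return the set of nodes reachable from start when `removed` is down."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in links[node]:
            if neighbor != removed and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen

def single_points_of_failure(links, source, target):
    """List devices whose failure alone cuts source off from target."""
    return [
        device for device in links
        if device not in (source, target)
        and target not in reachable(links, source, device)
    ]

# Hypothetical topology with two redundant switches between ingress and server.
redundant = {
    "ingress": ["sw1", "sw2"],
    "sw1": ["ingress", "server"],
    "sw2": ["ingress", "server"],
    "server": ["sw1", "sw2"],
}

# The same path with the redundant switch removed.
chain = {
    "ingress": ["sw1"],
    "sw1": ["ingress", "server"],
    "server": ["sw1"],
}

print(single_points_of_failure(redundant, "ingress", "server"))
print(single_points_of_failure(chain, "ingress", "server"))
```

The redundant topology yields an empty list, while the chain reports its only switch. This brute-force check is adequate for small diagrams. Larger networks would likely call for a dedicated graph algorithm.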