Quality of Service, Part 2 of 2: Managing Enterprise QoS
Consolidation of networks into IP-based cores is occurring in enterprises as well as service providers. The IP core is becoming a service platform, the resources of which are shared by a growing range of legacy and real-time applications. Downtime equates to revenue loss, and this fact is reflected in the appearance of financial rebate service-level agreements. So, how can the elements of QoS be managed effectively? To an extent, the tools are good if you stick with one vendor, but for multivendor networks the problem is more difficult.
As I described in the first article in this series, QoS management breaks down into specific elements: throughput, availability, delay, delay variation, and loss. The management system user aims to provide the best possible mix of these attributes.
Elements of Enterprise QoS
The previous article looked at voice-over-IP (VoIP) and the sources of delay that occur as a natural part of that technology. As far as possible, the designer tries to minimize the overall delay, which is the sum of the following delay types: coder, packetization, serialization, output queuing, WAN, and dejitter delay.
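The delay sum described above can be sketched as a simple budget check. The component values below are illustrative placeholders rather than measurements, the function names are hypothetical, and the 150 ms target is the commonly cited one-way guideline from ITU-T G.114. Only the serialization term is computed, since it follows directly from packet size and link rate.

```python
# Illustrative one-way VoIP delay budget.
# All component values are assumed placeholders, not measurements.

def serialization_delay_ms(packet_bytes, link_bps):
    """Time needed to clock one packet onto a link, in milliseconds."""
    return packet_bytes * 8 / link_bps * 1000


def total_delay_ms(components):
    """Overall one-way delay is the sum of the individual delay types."""
    return sum(components.values())


budget = {
    "coder": 15.0,           # assumed
    "packetization": 20.0,   # assumed
    # 60-byte packet on a 128 kbit/s access link (assumed sizes)
    "serialization": serialization_delay_ms(60, 128_000),
    "output_queuing": 10.0,  # assumed
    "wan": 40.0,             # assumed
    "dejitter": 40.0,        # assumed
}

TARGET_MS = 150.0  # one-way guideline from ITU-T G.114

total = total_delay_ms(budget)
print(f"total one-way delay: {total:.2f} ms (target {TARGET_MS} ms)")
print("within budget" if total <= TARGET_MS else "over budget")
```

A budget like this makes it easier to see which component to attack first; on slow access links, for instance, serialization delay grows quickly with packet size.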
Voice service straddles both the analog and digital worlds; that is, traffic starts off in analog form, is made digital, is transmitted through the network, and then finishes out once again in analog form at the receiving end. Voice service typically (although not necessarily) employs a range of equipment types including legacy PBX/telephone/TDM elements. The provision of QoS for such a complex service is a challenging undertaking. Given the real-time nature of voice, one of the key QoS elements is delay. However, that's just one aspect of managing this class of network service. Figure 1 illustrates some of the other elements of QoS.
Figure 1 An IP/MPLS virtual private network (VPN).
Figure 1 illustrates an IP VPN with two sites interconnected by a VPN service provider. A production system would have many more VPNs and sites, possibly hundreds of thousands of each. The network management system (NMS) is used to provision, monitor, and update the network. Each site has one or more CE boxes (customer edge routers) that connect to the adjacent service provider PE (provider edge) box. The PE devices are in turn connected to core P devices (provider nodes that deliver a traffic transit service). In many instances, all three device types are owned and managed by the service provider. This setup frees the enterprise from concerns about the technology; it just uses the service and pays a fixed monthly fee for it.
There are at least four main elements to delivering enterprise QoS successfully:
Network engineering
Traffic engineering (TE)
QoS technologies
Network operation
An important point about these steps is that they are generally iterative. An example is step 2 (traffic engineering), when a network operator wants to partition the network into specific paths. For example, LSP A in Figure 1 is for a real-time service, and LSP B is for a non-real-time service. The operator may simply begin with best-effort service for both and employ traffic analysis before deciding on an appropriate TE setup. Once the paths have been arrived at and provisioned in the network, the associated QoS resources (DiffServ markings, bandwidth, queuing assignments, scheduling details, and so on) are assigned in step 3 (QoS technologies). Again, the traffic may be monitored to determine whether links have sufficient resources for the imposed traffic. This might necessitate further path definitions and QoS assignments; that is, revisiting steps 2 and 3.
Finally, the network goes into service (step 4) and, as in the test phase, the earlier steps may have to be revisited. And so it goes. Let's take a closer look at the steps involved.
Step 1: Network Engineering
Closely tied with planning, network engineering consists of putting together a network, that is, physically connecting the network devices: servers, client machines, routers, switches, hubs, service provider interconnects, and so on. Software and configuration data must be loaded into the devices. Then the interfaces must be configured, along with the associated signaling and routing protocols. Network engineering can be said to bring the network into a raw state of operational readiness. Many small LANs go straight into operation at this point. For larger networks, more work is needed (steps 2 through 4).
Step 2: Traffic Engineering
If network engineering puts the bandwidth into the network, traffic engineering is the process of "putting the traffic where the bandwidth is"; this means directing network traffic into devices and links that have sufficient resources to handle it.
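One common way to "put the traffic where the bandwidth is" is constraint-based path selection: prune the links that can't carry the requested bandwidth, then run a shortest-path search over what remains. The sketch below assumes a hypothetical four-node topology with made-up link costs and available bandwidth. It is not taken from Figure 1 or from any vendor's implementation.

```python
import heapq

# Hypothetical topology: (node_a, node_b, cost, available_mbps)
LINKS = [
    ("PE1", "P1", 1, 100),
    ("P1", "PE2", 1, 20),
    ("PE1", "P2", 2, 500),
    ("P2", "PE2", 2, 500),
]


def constrained_shortest_path(links, src, dst, required_mbps):
    """Return the lowest-cost path whose links all have enough bandwidth."""
    # Step 1: keep only links that can carry the demand.
    graph = {}
    for a, b, cost, available in links:
        if available >= required_mbps:
            graph.setdefault(a, []).append((b, cost))
            graph.setdefault(b, []).append((a, cost))

    # Step 2: Dijkstra's algorithm over the pruned graph.
    queue = [(0, src, [src])]
    visited = set()
    while queue:
        dist, node, path = heapq.heappop(queue)
        if node == dst:
            return path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, cost in graph.get(node, []):
            if neighbor not in visited:
                heapq.heappush(queue, (dist + cost, neighbor, path + [neighbor]))
    return None  # no path satisfies the constraint


low_demand_path = constrained_shortest_path(LINKS, "PE1", "PE2", 10)
high_demand_path = constrained_shortest_path(LINKS, "PE1", "PE2", 50)
print(low_demand_path)
print(high_demand_path)
```

With the small demand, the cheaper path through P1 is chosen. With the larger demand, the P1 to PE2 link is pruned, so the search falls back to the costlier path through P2.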
Different technologies have their own unique TE facilities. For example, multiprotocol label switching (MPLS) provides what are called explicit route objects (ERO) to facilitate TE by allowing the operator to define a set of MPLS node hops. [1] The ERO is a little like a map for directing the traffic along the specified path.
An LSP (as in Figure 1) can then be created to follow the path specified by the ERO. So, to get from PE1 to PE2 in Figure 1, incoming traffic from VPN A, Site 1 is pushed into the explicitly routed LSP A. This control of routing paths gives MPLS great TE power and allows it to compete with legacy features of ATM (designated transit lists, or DTLs, are the ATM equivalent of MPLS EROs). This is also a motivator for adopting MPLS.
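Conceptually, an explicit route is just an ordered list of hops that the LSP must traverse. The sketch below models that idea with plain Python data structures and checks that each hop is directly connected to the one before it, which is what a strictly specified route requires. The node names and adjacency table are hypothetical, and this is a conceptual model only, not the signaling-protocol encoding of an ERO.

```python
# Hypothetical adjacency table for a small provider core.
ADJACENT = {
    "PE1": {"P1", "P2"},
    "P1": {"PE1", "P2", "PE2"},
    "P2": {"PE1", "P1", "PE2"},
    "PE2": {"P1", "P2"},
}


def validate_strict_route(ingress, hops, adjacency):
    """Return True if every hop is a direct neighbor of the previous node."""
    previous = ingress
    for hop in hops:
        if hop not in adjacency.get(previous, set()):
            return False
        previous = hop
    return True


# Explicit hops for a hypothetical LSP A that starts at PE1.
lsp_a_route = ["P1", "PE2"]

print(validate_strict_route("PE1", lsp_a_route, ADJACENT))
print(validate_strict_route("PE1", ["PE2"], ADJACENT))
```

The second call fails because PE2 is not directly attached to PE1 in this assumed topology; an NMS could run a check of this kind before attempting to provision the path.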
TE provides the means for carving up a network core into a set of distinct paths, each one for a different type of traffic: real-time VoIP, non-real-time email, and so on. TE can be thought of as a coarse-grained mechanism. The individual node treatment of traffic provides a finer level of granularity, as shown in the next section.
Step 3: QoS Technologies
The area of QoS is almost a science in its own right. The MPLS-OPS mailing list frequently has spirited discussions that touch on this field. These conversations get so complex that they generally are taken offline after just a few contributions!
QoS has a large body of active research of great interest to network equipment vendors; for example, witness Cisco's acquisition of PARC Technologies. The latter has expertise in the area of network-path placement with respect to a set of constraints (such as bandwidth). An allied area is the detection of different traffic types as close as possible to the edge of the network. If it's possible to separate the traffic types at the network edge, each traffic type can be allocated its own specific QoS treatment, such as pushing it into a given LSP.
From the end user's perspective, QoS is the final result of many careful network operator TE and device-level configurations. As we saw in the previous article, queues are a key element in QoS provisioning. If a given traffic type (such as VoIP) can skip the queue ahead of a traffic type with less stringent QoS needs (such as email), we have a useful scheme for providing definable QoS levels. This is the way in which DiffServ can be used: Packets are marked in the IP header DiffServ field (a six-bit field) and this value then dictates the downstream treatment. In Figure 1, the DiffServ markings could be applied on link A or B by the CE routers. With respect to the CE devices, the downstream routers are the adjacent PEs (and core P nodes). The PEs (followed by the Ps) then provide the next step in the QoS process.
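To make the marking step concrete, here is a minimal sketch of how a six-bit DiffServ code point maps onto the byte carried in the IPv4 header. The code point occupies the upper six bits of that byte, so the value is shifted left by two. The Expedited Forwarding code point (46) is the one commonly used for voice. The function names are hypothetical, and the sketch marks traffic from a host socket purely for illustration; in the scenario of Figure 1 the CE routers would do the marking.

```python
import socket

DSCP_EF = 46       # Expedited Forwarding, commonly used for voice
DSCP_DEFAULT = 0   # best effort


def dscp_to_tos_byte(dscp):
    """Place a six-bit DSCP value in the upper six bits of the header byte."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must fit in six bits (0-63)")
    return dscp << 2


def tos_byte_to_dscp(tos):
    """Recover the DSCP value from the header byte."""
    return (tos >> 2) & 0x3F


def mark_socket(sock, dscp):
    """Ask the OS to mark outgoing IPv4 packets on this socket.

    Support for this option varies by platform, so treat this as a sketch.
    """
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp_to_tos_byte(dscp))


print(hex(dscp_to_tos_byte(DSCP_EF)))
```

Downstream devices only need to read those six bits to decide which queue and scheduling treatment a packet receives.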
Priority queues in supermarkets (or other stores) can produce happy customers; the same analogy obtains for network devices. If one type of traffic can get through a queue ahead of the others, there's a good chance that the originator and receiver of that traffic will enjoy a decent QoS level.
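The supermarket idea can be sketched as a strict-priority queue: the scheduler always serves the highest-priority class that has packets waiting, and keeps first-in, first-out order within a class. The class names and priority values below are assumptions for illustration. Real devices usually combine priority queuing with policing or weighted scheduling so that lower classes are not starved.

```python
import heapq
import itertools

# Assumed traffic classes: a lower number is served first.
PRIORITY = {"voip": 0, "email": 1}


class StrictPriorityQueue:
    def __init__(self):
        self._heap = []
        # Sequence numbers preserve FIFO order within a class.
        self._seq = itertools.count()

    def enqueue(self, traffic_class, packet):
        entry = (PRIORITY[traffic_class], next(self._seq), packet)
        heapq.heappush(self._heap, entry)

    def dequeue(self):
        """Return the next packet to transmit, or None if the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


q = StrictPriorityQueue()
q.enqueue("email", "email-1")
q.enqueue("voip", "voip-1")
q.enqueue("email", "email-2")
q.enqueue("voip", "voip-2")

order = [q.dequeue() for _ in range(4)]
print(order)
```

Both voice packets leave before either email packet, even though an email packet arrived first.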
It's no coincidence that PE devices tend to be complex, high-performance, multi-technology switch routers that implement options for IP, MPLS, ATM, frame relay, SONET/SDH, etc.
Device Simplicity
I'm often struck by the conceptual simplicity of network devices such as those in Figure 1. Traffic arrives on a given interface, some technology-specific (IP, MPLS, ATM, etc.) lookup/manipulation occurs, and the traffic is then pushed out an egress interface toward an adjacent device. The same process occurs all the way to the destination. Clearly, many other details [1] are required for a complete picture, such as DiffServ/IntServ, policing, rate limiting, shaping, scheduling classes, per-hop behaviors, and so forth, but the big picture is essentially one of simplicity.
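That simplicity shows up clearly in a toy model of label swapping: look up the incoming interface and label, swap the label, and send the packet out the egress interface. The table contents and interface names below are hypothetical, and the model deliberately ignores everything else a real node does (policing, queuing, scheduling, and so on).

```python
# Hypothetical forwarding table for one node:
# (ingress interface, incoming label) -> (egress interface, outgoing label)
FORWARDING_TABLE = {
    ("if0", 100): ("if1", 200),
    ("if0", 101): ("if2", 300),
}


def forward(ingress_if, label, payload, table):
    """Swap the label and pick the egress interface, or drop if no entry."""
    entry = table.get((ingress_if, label))
    if entry is None:
        return None  # no matching entry: the packet is dropped
    egress_if, out_label = entry
    return egress_if, out_label, payload


print(forward("if0", 100, "voice frame", FORWARDING_TABLE))
print(forward("if0", 999, "unknown", FORWARDING_TABLE))
```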
Step 4: Network Operation
The final step we've identified is network operation, where the service provider actually carries real customer traffic and attempts to supply the required QoS. At this point, the service-level agreement (SLA) comes into play. This is the contractual specification of how well the service provider will attempt to handle the customer service. As usual, network management plays a key role, with the NMS handling these processes (among others):
Fault monitoring and repair
Performance monitoring
If persistent problems are occurring (for example, if LSP A in Figure 1 is dropping packets due to congestion), there may be a need for extra resources. On the other hand, LSP A may be correctly forwarding all its traffic, but not quickly enough; in other words, the performance may not be as required. Both situations may warrant SLA rebates to customers, so the service provider may have to use the information gleaned from the operational network to reengineer the LSP or the links it traverses.
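The two situations just described, loss caused by congestion and traffic that arrives too slowly, can be sketched as a simple check of measured counters against SLA thresholds. The thresholds, counter values, and function name below are hypothetical; a real SLA would define its own metrics and measurement intervals.

```python
from statistics import mean

# Hypothetical SLA thresholds for one LSP.
SLA = {"max_loss_ratio": 0.01, "max_mean_delay_ms": 50.0}


def evaluate_lsp(sent, received, delay_samples_ms, sla):
    """Compare measured loss and delay against the SLA thresholds."""
    loss_ratio = (sent - received) / sent if sent else 0.0
    mean_delay = mean(delay_samples_ms) if delay_samples_ms else 0.0

    violations = []
    if loss_ratio > sla["max_loss_ratio"]:
        violations.append("loss")
    if mean_delay > sla["max_mean_delay_ms"]:
        violations.append("delay")

    return {
        "loss_ratio": loss_ratio,
        "mean_delay_ms": mean_delay,
        "violations": violations,
    }


# Case 1: packets dropped due to congestion, delay acceptable.
congested = evaluate_lsp(10_000, 9_700, [20, 25, 30], SLA)
# Case 2: everything delivered, but too slowly.
slow = evaluate_lsp(10_000, 10_000, [60, 70, 80], SLA)

print(congested["violations"])
print(slow["violations"])
```

Either kind of violation could feed the decision to reengineer the LSP or the links it traverses.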
The important point is that the process is iterative. Traditionally, the areas of fault analysis and performance management have been seen as the baseline. As networks are now mission-critical enterprise elements, the network management system is an indispensable business-supporting technology. Ideally, the NMS should provide a network technology-independent facility for provisioning and operating a network. Essentially, the operator specifies the required policies; QoS, TE, and the NMS should take care of the rest. The low-level details should be handled by the NMS, freeing the network operator to focus on higher value-added issues, such as these:
Overbooking resources to improve utilization
Backup resources to protect critical services: tunnel instances, fast reroute
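Overbooking, the first item above, can be illustrated with a small admission check: the operator allows reservations to exceed physical link capacity by some factor, on the assumption that not every customer sends at full rate at the same time. The capacity, factor, and reservation figures below are made up, and the function name is hypothetical.

```python
def can_admit(link_capacity_mbps, overbooking_factor, reservations_mbps, request_mbps):
    """Admit a new reservation if the total stays within the bookable capacity."""
    bookable = link_capacity_mbps * overbooking_factor
    return sum(reservations_mbps) + request_mbps <= bookable


existing = [400, 300, 250]  # Mbit/s already reserved on a 1000 Mbit/s link

no_overbooking = can_admit(1000, 1.0, existing, 100)
with_overbooking = can_admit(1000, 1.2, existing, 100)

print(no_overbooking)
print(with_overbooking)
```

The same request is refused without overbooking and accepted with a factor of 1.2. The trade-off is higher utilization against a greater risk of congestion if traffic peaks coincide.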
The reality of commercial network management products is not so rosy. Many devices still have to be configured manually. Keeping some NMS applications and associated devices up and running in complex networks is the equivalent of a tightrope act without a safety net. Much progress is needed in this area.