Managing Large Networks: Problems and Solutions
- Bringing the Managed Data to the Code
- Scalability: Today's Network Is Tomorrow's NE
- MIB Note: Scalability
- Light Reading Trials
- Large NEs
- Expensive (and Scarce) Development Skill Sets
- Linked Overviews
- Elements of NMS Development
- Expensive (and Scarce) Operational Skill SetsElements of NMS Development
- MPLS: Second Chunk
- MPLS and Scalability
- Summary
Having looked at some of the nuts and bolts of network management technology, we now consider some of the problems of managing large networks. In many respects the large enterprise networks of today are reminiscent of the islands of automation that were common in manufacturing during the 1980s and 1990s. The challenge facing manufacturers was in linking together the islands of microprocessor-based controllers, PCs, minicomputers, and other components to allow end-to-end actions such as aggregated order entries leading to automated production runs. The hope was that the islands of automation could be joined so that the previously isolated intelligence could be leveraged to manufacture better products. Similar problems beset network operators at the beginning of the 21st century as traffic types and volumes continue to grow. In parallel with this, the range of deployed NMS is also growing. Multiple NMS adds to operational expense.
There is a strong need to reduce the cost of ownership and improve the return on investment (ROI) for network equipment. This is true not just during periods of economic downturn, but has become the norm as SLAs are applied to both enterprise and SP networks. NMS technology provides the network operator with some increasingly useful capabilities. One of these is a move away from tedious, error-prone, manually intensive operations to software-assisted, automated end-to-end operations.
Network operators must be able to execute automated end-to-end management operations on their networks [Telcordia]. An example of this is VLAN management in which an NMS GUI provides a visual picture—such as a cloud—of VLAN members (ports, MAC addresses, VLAN IDs). The NMS can also provide the ability to easily add, delete, and modify VLAN members as well as indicate any faults (e.g., link failures, warm starts) as and when they occur. Another example is enterprise WAN management in which ATM or FR virtual circuits are used to carry the traffic from branch offices into central sites. In this case, the enterprise network manager wants to be able to easily create, delete, modify, and view any faults on the virtual circuits (and the underlying nodes, links, and interfaces) to the remote sites. Other examples include storage (including SANs) management and video/audio conferencing equipment management. As we saw in Chapter 1, “Large Enterprise Networks,” the range of enterprise network services is growing all the time and so also is the associated management overhead.
The benefit of this type of end-to-end capability is a large reduction in the cost of managing enterprise networks by SLA fulfillment, less need for arcane NE know-how, smooth enterprise business processes, and happy end users. Open, vendor-independent NMS are needed for this, and later we look at ways in which software layering helps in designing and building such systems. Simple ideas such as always using default MIB values (seen in Chapter 1), pragmatic database design (matching default database and MIB values) and technology-sensitive menus also play an important part in providing NMS vendor-independence. The issue of presenting menu options appropriate to a given selected NE provides abstraction; for example, if the user wants to add a given NE interface to an IEEE 802.1Q VLAN, then (in order for the operation to be meaningful) that device must support this frame-tagging technology. The NMS should be able to figure this out and present the option only if the underlying hardware supports it. By presenting only appropriate options (rather than all possible options), the NMS reduces the amount of data the user must sift through to actually execute network management actions.
Automated, flow-through actions are required for as many network management operations as possible, including the following FCAPS areas:
-
Provisioning
-
Detecting faults
-
Checking (and verifying) performance
-
Billing/accounting
-
Initiating repairs or network upgrades
-
Maintaining the network inventory
Provisioning is a general term that relates to configuring network-resident objects, such as VLANs, VPNs, and virtual connections. It resolves down to the act of modifying agent MIB object instances, that is, SNMP setRequests. Provisioning usually involves both sets and gets. Later in this chapter we see this when we want to add a new entry to the MPLS tunnel table. We must read the instance value of the object mplsTunnelIndexNext before sending a setRequest to actually create the tunnel. Many NMS do not permit provisioning for a variety of reasons:
-
Provisioning code is hard to implement because of the issue of timeouts (i.e., when many set messages are sent, one or more may time out).
-
NE security settings are required to prevent unauthorized actions.
-
There is a lack of support for transactions that span multiple SNMP sets (i.e., SNMP does not provide rollback, a mechanism for use when failure occurs in one of a related sequence of SNMP sets. The burden of providing lengthy transactions and/or rollback is on the NMS).
-
Provisioning actions can alter network dynamics (i.e., pushing a lot of sets into the network adds traffic and may also affect the performance of the local agents).
If the NMS does not allow provisioning, then some other means must be found; usually, this is the EMS/CLI. SNMPv3 provides adequate security for NMS provisioning operations.
Fault detection is a crucial element of network management. NMS fault detection is most effective when it provides an end-to-end view; for example, if a VLAN link to the backbone network is broken (as in VLAN 2 in Chapter 1, Figure 1-4), then that VLAN GUI element (e.g., a network cloud) should change color instantly. The NMS user should then be able to drill down via the GUI to determine the exact nature of the problem. The NMS should give an indication of the problem as well as a possible resolution (as we've seen, this is often called root-cause analysis). The NMS should also cater to the case where the user is not looking at the NMS topology and should provide some other means of announcing the problem, for instance, by email, mobile phone short text message, or pager.
Performance management is increasingly important to enterprises that use service level agreements (SLAs). These are contractual specifications between IT and the enterprise users for service uptime, downtime, bandwidth/system/network availability, and so on.
Billing is important for those services that directly cost the enterprise money, such as the PSTN. It is important for appropriate billing to be generated for such services. Billing may even be applied to incoming calls because they consume enterprise network resources. Other elements of billing include departmental charges for remote logins to the network (external SP connections may be needed, for example, for remote-access VPN service) and other uses of the network, such as conference bridges. An important element of billing is verifying that network resources, such as congested PSTN/WAN trunks, are dimensioned correctly. In Chapter 1, we mentioned that branch offices are sometimes charged a flat rate for centralized corporate services (e.g., voice, LAN/WAN support). This is accounting rather than billing. In billing, money tends to be paid to some external organization, whereas in accounting, money may be merely transferred from one part of an organization to another. Many service providers offer services that are billed using a flat-rate model—for example, x dollars per month for an ATM link with bandwidth of y Mbps. Usage-based billing is increasingly attractive to customers because it allows for a pay-for-use or pay-as-you-grow model. It is likely that usage-based billing/accounting will increasingly be needed in enterprise NMS applications. This is particularly true as SLAs are adopted in enterprises.
Networks are dynamic entities, and repairs and upgrades are a constant concern for most enterprises. Any NE can become faulty, and switch/router interfaces can become congested. Repairs and upgrades need to be carried out and recorded, and the NMS is an effective means of achieving this.
All of the FCAPS applications combine to preserve and maintain the network inventory. An important aspect of any NMS is that the FCAPS applications are often inextricably interwoven; for example, a fault may be due to a specific link becoming congested, and this in turn may affect the performance of part of the network. We look at the important area of mediation in Chapter 6, “Network Management Software Components.”
It is usually difficult to efficiently create NMS FCAPS applications without a base of high-quality EMS facilities. This base takes the form of a well-implemented SNMP agent software with the standard MIB and (if necessary) well-designed private MIB extensions. Private MIB extensions are needed for cases where vendors have added additional features that differentiate their NEs from the competition.
All these sophisticated NMS features come at a price: NMS software is expensive and is often priced on a per-node basis, increasing the network cost base. Clearly, the bigger the network, the bigger the NMS price tag (however, the ratio of cost/bit may go down).
This chapter focuses on the following major issues and their proposed solutions:
-
Bringing the managed data to the code
-
Scalability
-
The shortage of development skills for creating management systems
-
The shortage of operational skills for running networks
Bringing the Managed Data to the Code
Bringing data and code together is a fundamental computing concept. It is central to the area of network management, and current trends in NE development bring it to center stage. Loading a locally hosted text file into an editor like Microsoft Notepad is a simple example: The editor is the code and the text file is the data. In this case, the code and data reside on the same machine, and bringing them together is a trivial task. Getting SNMP agent data to the manager code is not a trivial task in the distributed data model of network management because:
-
Managed objects reside on many SNMP agent hosts.
-
Copies of managed objects reside on SNMP management systems.
-
Changes in agent data may have to be regularly reconciled with the management system copy.
Agent-hosted managed objects change in tandem with the dynamics of the host machine and the underlying network—for example, the ipInReceives object from Chapter 1, which changes value every time an IP packet is received. This and many other managed objects change value constantly, providing a means for modeling the underlying system and its place in the network. The same is true of all managed NEs. MIBs provide a foundation for the management data model. The management system must keep track of relevant object value changes and apply new changes as and when they are required. As mentioned in Chapter 1, the management system keeps track of the NEs by a combination of polling, issuing set messages, and listening for notifications. This is a classic problem of storing the same data in two different places and is illustrated in Figure 3-1, where a management system tracks the objects in a managed network using the SNMP messages we saw in Chapter 2, “SNMPv3 and Network Management.”
Figure 3-1. Components of an NMS.
Figure 3-1 illustrates a managed network, a central NMS server, a relational database, and several client users. The clients access the FCAPS services exported by the NMS, for example, viewing faults, provisioning, and security configuration. The NMS strives to keep up with changes in the NEs and to reflect these in the clients.
Even though SNMP agents form a major part of the management system infrastructure, they are physically remote from the management system. Agent data is created and maintained in a computational execution space removed from that of the management system. For example, the ipInReceives object is mapped into the tables maintained by the host TCP/IP protocol suite, and from there it gets its value.1 Therefore, get or set messages sent from a manager to an agent result in computation on the agent host. The manager merely collects the results of the agent response. The manager-agent interaction can be seen as a loose type of message-based remote procedure call (RPC). The merit of not using a true RPC mechanism is the lack of associated overhead.
This is at once the strength and the weakness of SNMP. The important point is that the problem of getting the agent data to the manager is always present, particularly as networks grow in size and complexity. (This problem is not restricted to SNMP. Web site authors have a similar problem when they want to embed Java or JavaScript in their pages. The Java code must be downloaded along with the HTML in an effort to marry the browser with the Web site code and data. Interestingly, in network management the process is reversed: The data is brought to the code.) So, should the management system simply request all of the agent data? This is possibly acceptable on small networks but not on heavily loaded, mission-critical enterprise and SP networks. For this reason, the management system struggles to maintain an accurate picture of the ever-changing network. This is a key network management concept.
If an ATM network operator prefers not to use signaled virtual circuits, then an extra monitoring burden is placed on the NMS. This is so because unsignaled connections do not recover from intermediate link or node failures. Such failures give rise to a race between the operator fixing the problem and the user noticing a service loss. These considerations lead us to an important principle concerning NMS technology: The quality of an NMS is inversely proportional to the gap between its picture of the network and the actual state of the underlying network—the smaller the gap, the better the NMS. An ideal NMS responds to network changes instantaneously. Real systems will always experience delays in updating themselves, and it is the goal of the designers and developers to minimize them.
As managed NEs become more complex, an extra burden is placed on the management system. The scale of this burden is explored in the next section.