Recommendations
Our experience shows that a good approach is to start with conservative promises in the SLA that have a high probability of being met by the initial infrastructure. This approach allows time to tune the SLM.
The technical manager must consider the design of the SLM system during the planning and architecture phases. Decisions about overhead, network availability, agent definition, and so forth can only be made effectively during these early stages. Deferred until later, SLM becomes a bolt-on solution with the agility problems inherent in that approach.
We consider 5 to 10 percent CPU utilization overhead acceptable for a robust SLM solution. Naturally, business requirements will influence what overhead is actually acceptable. Make sure that infrastructure capacity planning sessions always include the SLM overhead.
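As a back-of-the-envelope illustration, the sizing arithmetic might look like the sketch below; the workload figures and headroom target are assumptions for the example, not recommendations.

```python
# Hypothetical sizing example: reserve headroom for SLM agent overhead
# when planning CPU capacity. All figures are illustrative assumptions.

peak_app_utilization = 0.60   # fraction of CPU the application needs at peak
slm_overhead = 0.10           # worst-case SLM agent/collector overhead (5-10 percent)
target_headroom = 0.20        # spare capacity kept for growth and failover

required = peak_app_utilization + slm_overhead + target_headroom
print(f"Planned peak utilization including SLM overhead: {required:.0%}")
# If 'required' approaches or exceeds 100 percent, the platform must be
# resized before the SLM solution is deployed, not after.
```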
Our experience shows that if you need more than four or five tools to build the SLM infrastructure, you should seriously revisit the tool selection. The biggest risk associated with an SLM system is over-instrumenting the infrastructure. Typically there is one tool per layer: network, system, and application. The application layer frequently requires custom agents because of the uniqueness of most applications. In addition, an aggregator tool should collect cross-layer statistics and correlate the agent information with the SLA definitions. Each tool should have a modeler/trend-analysis and reporting component.
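A minimal sketch of the correlation step the aggregator performs is shown below; the layer names, thresholds, and agent readings are illustrative assumptions rather than any specific product's interface.

```python
# Hypothetical aggregator pass: correlate per-layer agent measurements
# against the thresholds derived from the SLA definitions.

sla_thresholds = {                 # assumed per-layer commitments
    "network":     {"availability": 0.9995},
    "system":      {"availability": 0.9990},
    "application": {"availability": 0.9990},
}

agent_readings = {                 # assumed data pushed by the per-layer tools
    "network":     {"availability": 0.9997},
    "system":      {"availability": 0.9984},
    "application": {"availability": 0.9992},
}

for layer, limits in sla_thresholds.items():
    for metric, threshold in limits.items():
        measured = agent_readings[layer][metric]
        status = "OK" if measured >= threshold else "BREACH"
        print(f"{layer:12s} {metric}: measured {measured:.4%} "
              f"vs {threshold:.4%} -> {status}")
```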
For security, availability, and performance reasons, it is imperative that the SLM system have its own network infrastructure. As a rule of thumb, no public traffic should be mixed with this private management traffic. Aside from the security implications, performance can be affected by data collection and analysis. In addition, it is good to have an alternate path to the system in case the primary path to the service is down.
For similar reasons, all systems should write log information to a separate server or servers. This makes the logs easier to secure, because a potential intruder must break the security of one additional server to erase their traces. It also offloads SLM overhead from the production systems when log analysis is performed.
Make sure that the key performance indicators (KPIs) in the SLA can be translated easily into measurable metrics for the individual components. This translation is easier when the SLA is defined as specifically as possible. More details can be found in the article on SLAs.
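For example, an end-to-end response-time KPI can only be monitored per component once it has been broken down into per-component budgets. The sketch below uses assumed component names and numbers purely to show the idea.

```python
# Hypothetical decomposition of one SLA KPI ("95th-percentile page response
# time under 2.0 seconds") into measurable per-component budgets.
# Component names and budgets are illustrative assumptions.

kpi_target_s = 2.0
component_budgets_s = {
    "network round trips": 0.4,
    "web tier":            0.3,
    "application tier":    0.6,
    "database tier":       0.5,
}

total = sum(component_budgets_s.values())
assert total <= kpi_target_s, "budgets must not exceed the SLA target"
print(f"Allocated {total:.1f}s of the {kpi_target_s:.1f}s KPI target; "
      f"{kpi_target_s - total:.1f}s remains as margin")
```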
When defining the SLA metrics per component in the IDC, it is important to realize that the availability of the service is the product of the availabilities of all components that comprise the service (for example, 99.9 percent x 99.9 percent is approximately 99.8 percent). This means that each component must have a higher availability than the level defined in the SLA for the overall service.
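The compounding effect is easy to underestimate, so it is worth computing explicitly. The sketch below assumes a simple serial chain of components; the component count and targets are illustrative assumptions.

```python
# Compound availability of serially dependent components, and the
# per-component availability needed to meet an overall SLA target.

from math import prod

component_availabilities = [0.999, 0.999, 0.999, 0.999]   # four components at 99.9%
service_availability = prod(component_availabilities)
print(f"Service availability: {service_availability:.4%}")   # roughly 99.6%

# Per-component availability required so that n components still meet the SLA:
sla_target = 0.999
n_components = 4
per_component = sla_target ** (1 / n_components)
print(f"Each of {n_components} components needs >= {per_component:.5%}")
```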
Most components have no documented or published availability numbers, so the only way to obtain them is by measuring availability after the fact. For most hardware, however, some statistics are available under non-disclosure agreements, and these numbers can be used as a guideline for the initial availability commitments.
TABLE 1 lists the aspects that should be considered when looking for management tools. Each aspect needs to be considered at every layer: the lower platform (facilities, network, systems, storage, and so forth), the upper platform (application infrastructure services such as LDAP, domain name service (DNS), and network file system (NFS)), and the actual service or application layer.
Depending on the requirements, some aspects are more relevant than others. The table is based on the ISO-defined fault, configuration, accounting, performance, and security (FCAPS) model and encompasses the operational management aspects.
Using TABLE 1, you can assess which aspects at which layers should be addressed, based on the known business requirements; a minimal sketch of how such an assessment might be recorded follows the table.
TABLE 1   Management Tool Aspects

| | Fault | Configuration | Accounting | Performance | Security |
|---|---|---|---|---|---|
| Service/Business application | | | | | |
| Application Infrastructure | | | | | |
| Compute and Storage platform | | | | | |
| Network | | | | | |
| Facilities | | | | | |
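One simple way to record the outcome of such an assessment is a layer-by-aspect matrix like the sketch below; which cells matter, and the tools named, are assumptions for illustration only.

```python
# Hypothetical coverage matrix: for each layer and FCAPS aspect, record the
# tool (or agent) that covers it, and report the gaps that remain.

layers = ["service/business application", "application infrastructure",
          "compute and storage platform", "network", "facilities"]
aspects = ["fault", "configuration", "accounting", "performance", "security"]

coverage = {(layer, aspect): None for layer in layers for aspect in aspects}
coverage[("network", "fault")] = "network management tool"                   # assumed
coverage[("compute and storage platform", "performance")] = "system agent"   # assumed

gaps = [cell for cell, tool in coverage.items() if tool is None]
print(f"{len(gaps)} of {len(coverage)} layer/aspect cells still need "
      f"a tool or an explicit waiver")
```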