Deployment Considerations for Data Center Management Tools
Introduction
This article describes some of the main aspects to consider when deploying a data center management tools infrastructure (DCMTI). It also includes considerations to keep in mind when complementing this environment with a process management tool to facilitate the integration with other external processes such as, but not limited to, a help desk function.
This article is a prelude to a follow-on article that will describe an actual implementation of such a management architecture in one of Sun's iForce(SM) Ready Center programs.
The topics in this article are:
- Main Considerations
- Architecture
- Other Considerations
The main considerations when designing and implementing a DCMTI are:
- Create visibility at all layers for all aspects. The "FCAPS" section describes these aspects (fault, configuration, accounting, performance, and security).
- Create a process management environment to facilitate interaction with other organizations and service request control.
Considering these aspects results in a management architecture that has five major components:
- Agents
- Management servers and consoles
- Correlation and framework server
- Consoles
- Process management tool
The physical distribution within the management architecture can vary based on specific requirements. However, the natural separation points are:
- Between the agents and the server
- Between the management servers and the framework server
- Between the framework server and the process server
We recommend a separate management network for performance, visibility and security reasons.
After reading this document, you should have a good understanding of some of the main aspects to consider when building a DCMTI, and you will be ready to begin deploying a DCMTI. A follow-on article will describe the details of a deployment that incorporates the suggestions in this article.
Main Considerations
A good DCMTI provides the information to support several different views into the managed environment. These views are often organized by layer: facilities, network, compute and storage, application infrastructure (Lightweight Directory Access Protocol (LDAP), domain name service (DNS), relational database management system (RDBMS), Network Time Protocol (NTP), and so forth), and at the top, the business application.
In addition to these views, there should also be a Service Level Management (SLM) view. The main objective of this view is to show how the service provided measures against a predefined Service Level Agreement (SLA) and its associated Service Level Objectives (SLOs). The articles "Service Level Management in the Data Center" and "Building a Service Level Agreement in the Data Center" describe the main concepts of SLAs and SLOs, so no additional details are included here.
The views by layer must provide information on all aspects that the operations staff deems important to keep the systems up and running. The International Organization for Standardization (ISO) has defined five areas (FCAPS) that completely address this requirement.
FCAPS
The FCAPS aspects are:
- Fault
- Configuration
- Accounting
- Performance
- Security
Fault
This aspect looks at the status of the components and whether they are performing within set thresholds. It is event based. Broken disks and dead processes are examples of events.
Configuration
This aspect manages the configuration of the IT components. It tracks the parameters and values of the IT components. Preferably, a history of configurations is maintained so that a bad change can be backed out easily.
Accounting
This aspect is an older concept that stems from the mainframe world. It is the ability to track usage of system resources and relate that to business units and/or customers to enable billing. An interesting side note is that, with the emerging ASP business models, accounting has received renewed interest.
Performance
This aspect manages the challenging task of monitoring how fast or slow a system responds and processes transactions. Key processes in this area are performance tuning and capacity planning, where historical data is analyzed to discover trends or to model anticipated changes in the environment.
Security
This aspect manages the complete infrastructure from an authentication, authorization and access perspective. Security is very pervasive and should be addressed early in the architecture design and deployment phases.
As mentioned earlier, all of these aspects should be managed at all layers in the infrastructure. TABLE 1 shows that concept. An advantage of this representation is that it enables a quick overview to assess and identify areas that are candidates to be addressed by the management infrastructure.
TABLE 1 FCAPS Overview
| Layer | Fault | Configuration | Accounting | Performance | Security |
|---|---|---|---|---|---|
| Business application | 5 | 2 | 2 | 3 | 2 |
| Application infrastructure (RDBMS, LDAP and so forth) | 5 | 2 | 1 | 1 | 1 |
| Compute and storage platform | 5 | 3 | 1 | 3 | 2 |
| Network | 5 | 2 | 1 | 3 | 3 |
| Facilities | 5 | 2 | 2 | 1 | 3 |
The numbers in this example indicate a level of compliance. Five means "well covered" and zero means "not covered." The same table can be used to describe the requirements for a DCMTI. In that case, five could mean "important requirement" and zero could mean "no requirement."
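The quick assessment that TABLE 1 enables can also be done programmatically. The following is a minimal sketch, assuming the scores are kept in a simple dictionary; the names `COVERAGE` and `coverage_gaps` are hypothetical and chosen only for illustration. It lists the layer and aspect combinations whose score is at or below a chosen threshold, which are the candidates to be addressed by the management infrastructure.

```python
# Scores copied from TABLE 1 (0 = not covered, 5 = well covered).
ASPECTS = ["Fault", "Configuration", "Accounting", "Performance", "Security"]

COVERAGE = {
    "Business application": [5, 2, 2, 3, 2],
    "Application infrastructure": [5, 2, 1, 1, 1],
    "Compute and storage platform": [5, 3, 1, 3, 2],
    "Network": [5, 2, 1, 3, 3],
    "Facilities": [5, 2, 2, 1, 3],
}


def coverage_gaps(coverage, threshold):
    """Return (layer, aspect, score) tuples whose score is <= threshold."""
    gaps = []
    for layer, scores in coverage.items():
        for aspect, score in zip(ASPECTS, scores):
            if score <= threshold:
                gaps.append((layer, aspect, score))
    return gaps
```

Calling `coverage_gaps(COVERAGE, threshold=1)` returns the weakest cells of the example table, such as accounting at the network layer. The threshold itself is a judgment call that depends on the requirements of the environment.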
Interaction With Other Organizations
In addition to the views that represent the appropriate aspects organized by layer, a process management tool is a very important consideration.
A process management tool facilitates the transition of activities into other processes, and it facilitates the following main aspects:
- Service Request Control
  - Status update (new, latest event and so on)
  - Progress enforcement (escalation, if needed)
  - Qualification and routing (where next?)
  - Closure (quality control surveys and so on)
- Reporting
  - Periodic reports
  - Management
  - Service Performance
  - Exception reports
These functions are often provided by a help desk or customer care desk. However, in the context of this document, the management infrastructure is assumed to be capable of generating requests based on predefined rules. The rules that determine when to create a request are implemented and enforced at the alert consolidation and correlation layer in the management infrastructure. The "Architecture" section details this process.
Service Request Process
FIGURE 1 is a high-level process view of how the process management tool would handle a ticket. The intent is to highlight key steps that you must consider when building such a process and mapping it to the tool's ticket.
FIGURE 1 Sample Request Process View of Ticket Handling
It is important to realize that there are multiple sources for action requests in the IT management environment. Four sources are given here as an example; other sources exist, depending on specific situations. Before the request enters the process it should be prioritized, localized (in case of multiple locations of activities) and categorized. Based on that information it will be qualified and assigned.
Typically, the assignee should be a generic name or group (not a person's name) to avoid constant updating of the configuration files that link this information. Depending on priority, location and category, the ticket starts to follow a distinct process that tracks progress and key information for service performance and management reporting purposes.
Essential considerations for prioritizing and routing a request are:
- P: Priority
- S: Skills needed, which determine the routing
- A: Action(s) that represent a distinct process
TABLE 2 Service and Management Reporting
| Function | Driver | Examples |
|---|---|---|
| Priority -> | Cost of downtime; number of users affected | P1: More than 10 users affected and/or a business-critical system is down during production hours |
| | System function (business critical) | P2: Fewer than 10 users affected and/or not a business-critical system, at any time of the day |
| | Time of day | P3: Request for enhancement. Not business critical. No time pressure. |
| | Service Level Agreement | P4: Specific rules as per the agreement |
| | | ... and so on. |
| Routing -> | Skills needed | |
| | What type of technology? | S1: Computer, Sun, hardware, disk, fault, P1 |
| | What type of alert (FCAPS)? | S2: Computer, IBM, operating system, kernel, performance, P2 |
| | What priority? | S3: Network, Cisco, hardware, router, configuration, P3 |
| Process -> | Action needed | |
| | Resolution time | A1: Must be resolved ASAP (S3, P1) |
| | Skills needed | A2: Should be resolved within 4 hours (S1, P2) |
| | Priority of request | A3: Should be resolved within 2 hours (S2, P4) |
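The priority examples in TABLE 2 can be expressed as a small rule function. The following is only a sketch under stated assumptions: the `Request` fields and the `classify_priority` name are hypothetical, the order in which the rules are checked is an assumption (the table does not define one), a request affecting exactly 10 users is treated as P2 because the table leaves that boundary open, and the SLA-specific P4 rules are omitted because they depend on the agreement.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request record; field names are illustrative only."""
    users_affected: int
    business_critical: bool
    production_hours: bool
    enhancement: bool = False


def classify_priority(req: Request) -> str:
    # Assumption: enhancement requests are checked first and are always P3.
    if req.enhancement:
        return "P3"
    # P1 example from TABLE 2: more than 10 users affected and/or a
    # business-critical system is down during production hours.
    if req.users_affected > 10 or (req.business_critical and req.production_hours):
        return "P1"
    # Assumption: every remaining request falls through to P2.
    return "P2"
```

In practice these rules would be derived from the cost of downtime and the applicable SLA, so the thresholds shown here should be read as placeholders.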
When all three functions have been defined, you can create a matrix that relates the priority of a request and the skills needed to the appropriate process. This typically identifies which group is assigned. Based on the preceding table, TABLE 3 shows this priority request matrix.
TABLE 3 Priority Request Matrix
| Skills | P1 | P2 | P3 | P4 |
|---|---|---|---|---|
| S1 | A1 | A1 | A6 | A10 |
| S2 | A2 | A4 | A7 | A10 |
| S3 | A3 | A5 | A8 | A10 |
| S4 | A3 | A5 | A9 | A10 |
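A matrix such as TABLE 3 maps naturally onto a lookup table. The following minimal sketch copies the values from TABLE 3; the names `ACTION_MATRIX` and `select_action` are hypothetical and do not refer to any particular process management tool.

```python
# Values copied from TABLE 3: skill set -> priority -> action (process).
ACTION_MATRIX = {
    "S1": {"P1": "A1", "P2": "A1", "P3": "A6", "P4": "A10"},
    "S2": {"P1": "A2", "P2": "A4", "P3": "A7", "P4": "A10"},
    "S3": {"P1": "A3", "P2": "A5", "P3": "A8", "P4": "A10"},
    "S4": {"P1": "A3", "P2": "A5", "P3": "A9", "P4": "A10"},
}


def select_action(skill: str, priority: str) -> str:
    """Return the action for a skill set and priority, or raise ValueError."""
    try:
        return ACTION_MATRIX[skill][priority]
    except KeyError:
        raise ValueError(
            f"no action defined for skill={skill!r}, priority={priority!r}"
        ) from None
```

For example, `select_action("S2", "P3")` returns `"A7"`. Rejecting unknown combinations explicitly, rather than falling back to a default action, keeps unqualified requests from silently entering the wrong process.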
It is important to realize that service request priorities do not influence the priorities or criticality at the system agent layer. The health of a system is independent of its impact on the business. The former is addressed in the DCMTI, the latter in process management.
Each specific process has a rule to allow for escalation and re-assignment. When all goes well, the request is fulfilled and the ticket is closed. The closing process can include activities like informing users, updating databases, and sometimes even initiating clearing of alarms in the DCMTI.
FIGURE 2 shows some key aspects to consider in the specialized resolution process of a trouble ticket. It illustrates the preceding considerations with more detail.
FIGURE 2 Sample Trouble Ticket Resolution Process
Most notable is the update of the process database at key steps in the process. Also, in the decision tree towards the end, there is an interesting example of how escalation can be achieved. Generally, an automated approach to escalation is not recommended because it would automatically reassign a ticket. The most common approach is to run daily reports or create alerts for supervisors, who then decide on the best next step, and to generate ad hoc reports (email, text page and so on) for high-priority events that require immediate attention.
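The report-driven approach to escalation described above can be sketched as follows. This is an illustration only: the `Ticket` fields and the function names are hypothetical, and the sketch assumes each ticket carries an agreed resolution time. It lists overdue tickets for a supervisor and deliberately does not reassign anything.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Ticket:
    """Hypothetical ticket record; field names are illustrative only."""
    ticket_id: str
    group: str          # generic group name, not a person's name
    priority: str       # "P1" .. "P4"
    opened: datetime
    target: timedelta   # agreed resolution time
    closed: bool = False


def overdue_tickets(tickets, now):
    """Return open tickets past their target, highest priority first."""
    overdue = [t for t in tickets if not t.closed and now - t.opened > t.target]
    return sorted(overdue, key=lambda t: (t.priority, t.opened))


def supervisor_report(tickets, now):
    """Build report lines for a supervisor; no ticket is reassigned here."""
    lines = []
    for t in overdue_tickets(tickets, now):
        late_by = now - t.opened - t.target
        lines.append(f"{t.ticket_id} [{t.priority}] group={t.group} overdue by {late_by}")
    return lines
```

A scheduled job could run `supervisor_report` daily, while a separate alert path handles high-priority events that need immediate attention. The decision about the next step stays with the supervisor.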