An Information Technology Management Reference Architecture
Introduction
This article is the fourth in a series by Edward Wustenhoff on the implementation of a data center management infrastructure. It focuses on the underlying infrastructure concepts and on the additional requirements that drove the implementation of the management infrastructure.
It is a follow-up to the "Management Framework Considerations" article published earlier by Edward Wustenhoff and the Sun BluePrints group. It describes the results of the process used to create an IT management reference architecture in the iForce Ready Center (iFRC) that showcases an Internet data center (IDC) Mail and Messaging Architecture. The iFRC is a Sun program that provides reference implementations and proofs of concept to help customers avoid common pitfalls.
A fifth article, to be published in July 2002, will describe the details of the actual implementation.
Quality of Service (QoS) is an important competitive advantage in the Internet service provider (ISP) and large IDC market, so Sun decided to develop a management infrastructure that highlights how such a service can be managed.
This management infrastructure provides six views into the data center:
A Network Operations Center (NOC) view that shows all alerts in all components of the architecture. Each component has its own view.
A Service Level Management (SLM) view that reports the status of the messaging application against a predefined Service Level Agreement (SLA).
An Applications Infrastructure view that reports on the status of the software components that support the application. Lightweight Directory Access Protocol (LDAP), Network Time Protocol (NTP), and Domain Name Service (DNS) are included in this view.
A System Administration view that reports on the status and enables the administration of all computer system components.
A Network Administration view that reports on the status and enables the administration of all network components.
A Performance view that monitors the status of the system and networks and their ability to handle the processing loads.
It is important to realize that the implementation of this architecture was done within the capabilities of the iFRC; the developers worked with the systems that were available. The result is a proof of concept that emphasizes functionality. The availability and sizing of the management platform were not the main focus. The selection of management tools and their functions was made within similar constraints: the developers implemented tools that were known to provide the desired functionality and were available in the laboratory.
However, the implemented architecture follows Sun's tools framework design considerations as much as possible. The intention is to build out the functionality and its associated tools in subsequent releases of this architecture.
"Concepts" describes briefly the underlying design considerations and essential concepts. Its main purpose is to position what the developers think about when addressing data center management, and to create an overview that positions where this proof of concept fits.
"Concepts" addresses two main ideasService Level Management (SLM) and the IT management framework. Each concept drives certain implementation requirements, which are listed as key considerations.
The section on "Implementation Requirements" on page 9 contains a detailed description of additional requirements that drove the actual implementation of the management infrastructure.
A follow-on article will describe the details of the actual implementation.
Concepts
One of the main concepts to address is the principle of SLM. The following section briefly explains how the developers look at this process from an instrumentation perspective.
SLM Process
FIGURE 1 is a graphical representation of the main SLM process.
FIGURE 1 SLM Process Flow Diagram
This process shows that the SLA definition sets the key performance indicators (KPIs), which in turn drive the measurement data, or metrics. The metrics drive the choice of data collectors (or agents), which, in turn, influences the choice of data analysis tool. The process produces a report that measures the infrastructure's performance against the SLA. If the SLA objectives are not met, a choice remains: improve the performance or, if necessary, adjust the SLA.
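
To make this flow concrete, the following minimal Python sketch models the chain from SLA definition to KPIs to metrics to a violation report. The class and field names are illustrative assumptions made for this article, not interfaces of any tool deployed in the iFRC.

from dataclasses import dataclass
from typing import Dict, List

@dataclass
class KeyPerformanceIndicator:
    name: str                  # for example, "Send email using SMTP"
    metric: str                # name of the measurement data the KPI requires
    threshold_seconds: float   # response-time objective taken from the SLA

@dataclass
class ServiceLevelAgreement:
    availability_target: float            # for example, 0.9999
    kpis: List[KeyPerformanceIndicator]   # KPIs derived from the SLA definition

def report_against_sla(sla: ServiceLevelAgreement,
                       measured: Dict[str, float]) -> List[str]:
    """Compare collected metrics against the SLA and list any violations."""
    violations = []
    for kpi in sla.kpis:
        if measured.get(kpi.metric, 0.0) > kpi.threshold_seconds:
            violations.append(f"{kpi.name} exceeded {kpi.threshold_seconds} seconds")
    # An empty report means the SLA objectives were met; otherwise the choice
    # remains to improve performance or adjust the SLA.
    return violations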
Key Considerations for Service Level Management
This section describes the main SLM considerations that drove the management and operations (M&O) architecture. It points out the main considerations in the design process, not the details of SLM.
The description in this section follows the core SLM process.
SLA Definition
The following SLA sets the requirements for the SLM objectives. It is based on the nextslm.org sample template. The modifications mainly reflect an email-focused ISP environment, because that is what the Sun IDC Mail and Messaging Architecture represents.
The Sun IDC Mail and Messaging Architecture is used by simulated email users to send and receive email. TABLE 1 lists the SLA objectives that the design of this architecture guarantees to meet.
TABLE 1 Service Level Agreement
The send and receive email capability will be available 99.99 percent of the time during normal hours of operation, as described in the Test Plans. Any individual outage in excess of 1 minute, or a sum of outages exceeding 4 minutes per month, constitutes a violation.

99.99 percent of send and receive email transactions (using the SMTP, IMAP, and POP protocols) will exhibit a response time of 10 seconds or less. Response time is defined as the interval from the time the emulated user sends a transaction to the time the Micromuse probe receives a visual confirmation of transaction completion. Missing this metric constitutes a violation.

The Internet Data Center Customer Care team will respond to service incidents that affect multiple users within 1 hour, resolve the problem within 72 hours, and update status every 4 hours. Missing any of these metrics on an incident constitutes a violation. [Note: not currently implemented, because these statistics will come from a process management tool such as Remedy. Deployment is planned for the next phase.]

The Internet Data Center Customer Care team will respond to service incidents that affect individual users within 24 hours, resolve the problem within 3 business days, and update status every day. Missing any of these metrics on an incident constitutes a violation. [Note: not currently implemented, because these statistics will come from a process management tool such as Remedy. Deployment is planned for the next phase.]

The Internet Data Center Customer Care team will respond to non-critical inquiries within 24 hours, deliver an answer within 5 business days, and update status every other business day. Missing any of these metrics on an incident constitutes a violation. [Note: not currently implemented, because these statistics will come from a process management tool such as Remedy. Deployment is planned for the next phase.]

The following shows the number of violations and the associated penalty on a monthly basis.

    Number of violations    Penalty
    1 to 5                  1 day of free service
    5 to 10                 A free month of service
    More than 10            2 free months of service, plus a corrective action plan that details the steps taken to correct the issues

As services and technologies change, the SLA can change to reflect the improvements and/or changes. This SLA will be reviewed every 12 months and updated as necessary. When updates are deemed necessary, the customer will be asked to review and approve the changes.
TABLE 1 contains a "short-form" SLA that illustrates the essential agreements between a consumer and the service provider in an IDC context. Internal SLAs, for example between operations support groups in the IDC, are different and often contain more details and specifications.
TABLE 2 describes how the percentage relates to actual unavailability times. This table shows the impact of measuring against a "number of nines" of availability.
NOTE
One year equals 365.00 days.
TABLE 2 Actual Availability and Percentage of Downtime
Availability      Uptime        Downtime      Downtime       Downtime         Downtime          Downtime         Downtime
                  (days/year)   (days/year)   (hours/year)   (minutes/year)   (minutes/month)   (minutes/week)   (minutes/day)
99.000 percent    361.35000     3.65000       87.60000       5256.00000       438.00000         101.07692        14.43956
99.900 percent    364.63500     0.36500       8.76000        525.60000        43.80000          10.10769         1.44396
99.990 percent    364.96350     0.03650      0.87600        52.56000         4.38000           1.01077          0.14440
99.999 percent    364.99635     0.00365       0.08760        5.25600          0.43800           0.10108          0.01444
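
The figures in TABLE 2 follow directly from the availability percentage and the 365.00-day year stated in the note. As an illustration added for this article (not part of the deployed tooling), the following short Python sketch reproduces the downtime columns.

# Reproduces the TABLE 2 downtime figures from an availability percentage,
# assuming a 365.00-day year as stated in the note above the table.
MINUTES_PER_YEAR = 365.00 * 24 * 60   # 525,600 minutes

def downtime(availability_percent: float) -> dict:
    """Return downtime figures for a given availability percentage."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    minutes_per_year = MINUTES_PER_YEAR * unavailable_fraction
    return {
        "days/year": minutes_per_year / (24 * 60),
        "hours/year": minutes_per_year / 60,
        "minutes/year": minutes_per_year,
        "minutes/month": minutes_per_year / 12,
        "minutes/week": minutes_per_year / 52,
        "minutes/day": minutes_per_year / 365.00,
    }

# For 99.99 percent availability this yields roughly 4.38 minutes of downtime
# per month and 1.01 minutes per week, matching TABLE 2.
print(downtime(99.99))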
Key Performance Indicators
Based on the SLA, the following parameters are the essential indicators that define the performance of the email service:
Sending an email using the SMTP protocol
Receiving an email using the POP3 protocol
Receiving an email using the IMAP4 protocol
The system is deemed unavailable if the response time of any of the KPIs exceeds 1 minute. If the service is unavailable for more than 4.38 minutes per month or 1.01077 minutes per week, the SLA has not been met.
A response time larger than 10 seconds for any of the KPIs will constitute a service performance violation.
As a result, the service is tested by using synthetic transactions that simulate these parameters: one message is delivered using SMTP and subsequently retrieved using IMAP4, and another message is delivered using SMTP and subsequently retrieved using POP3. The results of these transactions are measured against the metrics defined above. The KPIs are measured and reported on a monthly basis.
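
For illustration only, a synthetic transaction of this kind could be scripted along the following lines. The host names, account, and password are placeholders, and the deployed architecture uses a Micromuse probe rather than this Python sketch; the sketch retrieves the message with POP3, and a second probe would do the same with IMAP4.

import poplib
import smtplib
import time
from email.mime.text import MIMEText

SMTP_HOST = "mail.example.com"    # placeholder host names, not iFRC systems
POP3_HOST = "pop.example.com"
USER, PASSWORD = "probe-user", "probe-password"
RESPONSE_LIMIT = 10    # seconds; performance objective from TABLE 1
OUTAGE_LIMIT = 60      # seconds; availability threshold from the KPI definition

def probe_send_receive() -> float:
    """Time one synthetic send (SMTP) and receive (POP3) transaction."""
    start = time.time()
    message = MIMEText("synthetic SLM probe")
    message["From"] = message["To"] = USER
    message["Subject"] = "slm-probe"
    with smtplib.SMTP(SMTP_HOST) as smtp:
        smtp.send_message(message)
    mailbox = poplib.POP3(POP3_HOST)
    mailbox.user(USER)
    mailbox.pass_(PASSWORD)
    mailbox.retr(len(mailbox.list()[1]))   # retrieve the most recent message
    mailbox.quit()
    return time.time() - start

elapsed = probe_send_receive()
if elapsed > OUTAGE_LIMIT:
    print("availability violation: transaction took", elapsed, "seconds")
elif elapsed > RESPONSE_LIMIT:
    print("performance violation: transaction took", elapsed, "seconds")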
IT Management Framework
This section contains a brief overview of what we define as management and how it relates to architecture and business practices. Its purpose is to show how things fit together.
FIGURE 2 shows the Enterprise Stack (E-Stack) reference model recently developed by SunPS Americas services. This figure shows how the business architecture environment, the physical execution architecture (the middle box), and the M&O architecture environment relate to each other. Its main purpose is to show that both the business and physical architectures drive what happens in the IT environment.
FIGURE 2 E-stack Reference Architecture
FIGURE 3 shows the details of the IT management architecture infrastructure.
FIGURE 3 IT Management Architecture
The IT management infrastructure consists of three dimensions: people, process, and tools. The M&O architecture is based on this cube and focuses on the tools view of the IT management activities. The following paragraphs briefly describe each of the faces of this cube.
People
This dimension deals with all aspects of people management and organization. This face of the cube shows four core aspects: business, function, technology, and the employee. The M&O architecture does not intend to address this area. It is mentioned here to emphasize that it is part of the complete IT management challenge.
Process
Good processes are essential to good IT management. This face of the cube identifies the core processes. The M&O architecture tools are well equipped to support event management (as part of problem management) and SLA management (as part of the account management process). The M&O architecture also implements some configuration management capabilities.
The M&O architecture design enables the deployment of any tool that supports any of the preceding processes as they affect the IDC infrastructure, at minimal overhead cost.
Tools
The major objective of this management architecture is to show a coherent set of tools. FIGURE 4 shows the structure of the deployed architecture.
FIGURE 4 M&O Architecture Tools Dimension
The instrumentation layer is implemented as the agents that run on the monitored systems. The resource and element managers are the management servers that control the agents and the associated elements. This is where the primary automation activities between element and server take place. An example is the automated adjustment of swap space by the Sun Management Center (SunMC) software server process when a threshold is reached.
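
The general agent-to-server pattern behind such automation is a threshold check followed by a corrective action. The sketch below illustrates only that pattern; it is not the SunMC interface, and the metric source, threshold, and action are placeholder assumptions.

import subprocess
import time
from typing import Callable

def monitor(read_metric: Callable[[], float], threshold: float,
            action: Callable[[], None], interval_seconds: int = 60) -> None:
    """Poll a metric and invoke the corrective action when the threshold is crossed."""
    while True:
        if read_metric() > threshold:
            action()
        time.sleep(interval_seconds)

def swap_utilization() -> float:
    """Placeholder metric reader; a real agent would query the operating system."""
    return 0.75

def add_swap_space() -> None:
    """Placeholder corrective action; here it only logs a message via syslog."""
    subprocess.run(["logger", "swap threshold exceeded; corrective action triggered"])

if __name__ == "__main__":
    monitor(swap_utilization, threshold=0.80, action=add_swap_space)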
The event and information managers take input from the resource and element managers to correlate and consolidate the data into useful management information. This is also the conduit into the process flow management application.
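
To give a feel for what correlation and consolidation mean at this layer, the following sketch deduplicates raw events from element managers and rolls them up per service. The event fields and severity scale are assumptions made for this example, not the schema of any deployed tool.

from collections import defaultdict
from typing import Dict, List

def consolidate(events: List[Dict]) -> Dict[str, Dict]:
    """Drop duplicate events and keep a count and highest severity per service."""
    seen = set()
    summary = defaultdict(lambda: {"count": 0, "max_severity": 0})
    for event in events:
        key = (event["service"], event["element"], event["text"])
        if key in seen:    # suppress duplicate reports of the same condition
            continue
        seen.add(key)
        entry = summary[event["service"]]
        entry["count"] += 1
        entry["max_severity"] = max(entry["max_severity"], event["severity"])
    return dict(summary)

raw_events = [
    {"service": "mail", "element": "smtp-1", "severity": 3, "text": "queue backlog"},
    {"service": "mail", "element": "smtp-1", "severity": 3, "text": "queue backlog"},
    {"service": "mail", "element": "pop-2",  "severity": 5, "text": "daemon down"},
]

print(consolidate(raw_events))   # {'mail': {'count': 2, 'max_severity': 5}}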
The SLM (see"SLM Process" on page 3) operates at another level of abstraction. It tests the user experience and includes all components of the infrastructure that comprise the service. The developers defined the implementation requirements for this implementation in the SLA described earlier in this document.
The process flow manager facilitates the integration with the Process dimension of the cube. A good example is the Remedy (now Peregrine) tool that supports the Helpdesk processes. The management portal represents the interface to management reporting. Its function is to display the information in the management data repository from any desired angle. In the deployed M&O infrastructure, the process flow manager and the management portal have been included in the architectural considerations, but have not yet been implemented by tools.