- Management Reference Guide
- Improving High Availability — Part Two: Definitions and Terms
In the first piece of this four-part series on improving high availability, I offered and explained a definition of availability and presented some key terms used with this process. In this second part I compare and contrast additional terms associated with this process, discuss desirable traits of an availability process owner, and present some effective methods for measuring availability.
Differentiating Slow Response from Downtime
Slow response can infuriate users and frustrate infrastructure specialists. The growth of a database, traffic on the network, contention for disk volumes, or the disabling of processors or portions of main memory in servers can all contribute to response time slowdowns. Each of these conditions requires analysis and resolution by infrastructure professionals. Users are understandably unaware of these root causes and sometimes interpret extremely slow response as downtime. The threshold of time at which this interpretation occurs varies from user to user. It does not matter to users whether the problem is due to slowly responding software (slow response) or malfunctioning hardware (downtime). What does matter to them is that transactions are slow or non-responsive when they expect quick, consistent response times.
But slow response is different from downtime, and the root cause of these problems does matter a great deal to infrastructure analysts and administrators. They are charged with identifying, correcting, and permanently resolving the root causes of these service disruptions. Knowing which type of problem is at hand determines the course of action taken to resolve it. Slow response is usually a performance and tuning issue involving different personnel, different processes, and different process owners than those involved with downtime, which is an availability issue.
Differentiating Availability from High Availability
The primary difference between availability and high availability is that the latter is designed to tolerate virtually no downtime. All online computer systems are intended to maximize availability, or to minimize downtime, as much as possible. In high-availability environments, a number of design considerations are employed to make online systems as fault tolerant as possible. I refer to these considerations as the seven Rs of high availability and discuss them in part three of this series.
Desired Traits of an Availability Process Owner
As I mentioned previously, the most robust infrastructures select a single individual to be the process owner of availability. Some shops refer to this person as the availability manager. In some instances it is the operations manager; in others it is a strong technical lead in technical support. Regardless of who these individuals are, or to whom they report, they should be knowledgeable in a variety of areas, including systems, networks, databases, and facilities, and they must be able to think and act tactically. A slightly less critical, but still desirable, trait of an ideal candidate for availability process owner is knowledge of software and hardware configurations, backup systems, and desktop hardware and software.
Methods for Measuring Availability
The percentage of system availability is a very common measurement. It is found in almost all service level agreements and is calculated by dividing the amount of actual time a system was available by the total time it was scheduled to be up. For example, suppose an online system is scheduled to be up from 6:00 a.m. to midnight Monday through Friday and from 7:00 a.m. to 5:00 p.m. on Saturday. The total time it is scheduled to be up in hours is (18 x 5) + 10 = 100 hours. When online systems first began being used for critical business processing in the 1970s, online availability rates between 90% and 95% were common, expected, and reluctantly accepted. In our example, that would mean the system was up 90–95 hours per week or, more significantly, down for 5–10 hours per week and 20–40 hours per month.
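The calculation above can be sketched in a few lines of Python. This is only an illustration of the arithmetic; the function names `scheduled_hours` and `availability_percent` are my own and do not come from any particular tool.

```python
def scheduled_hours(weekday_hours, weekdays, weekend_hours=0):
    """Total hours per week the system is scheduled to be up."""
    return weekday_hours * weekdays + weekend_hours


def availability_percent(scheduled, downtime):
    """Percentage of scheduled time the system was actually available."""
    if scheduled <= 0:
        raise ValueError("scheduled hours must be positive")
    if not 0 <= downtime <= scheduled:
        raise ValueError("downtime must be between 0 and the scheduled hours")
    return (scheduled - downtime) * 100 / scheduled


# Schedule from the example: 18 hours on each of 5 weekdays plus 10 hours on Saturday.
weekly = scheduled_hours(weekday_hours=18, weekdays=5, weekend_hours=10)
print(weekly)                            # 100
print(availability_percent(weekly, 5))   # 95.0
print(availability_percent(weekly, 10))  # 90.0
```

Note that the divisor is the scheduled time, not the 168 hours in a calendar week, so planned off-hours never count against the percentage.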
Customers quickly realized that 10 hours a week of downtime was unacceptable and began negotiating service levels of 98% and even 99% guaranteed availability. As companies expanded worldwide and 24/7 systems became prevalent, the 99% level was questioned. Systems needing to operate around the clock were scheduled for 168 hours of uptime per week. At 99% availability, these systems were down, on average, approximately 1.7 hours per week. Infrastructure groups began targeting 99.9% uptime as their goal for availability for critical business systems. This target allowed for just over 10 minutes of downtime per week, but even this was not acceptable for systems such as worldwide email or an e-commerce Web site.
So the question becomes: Is the percentage of scheduled service delivered really the best measure of quality and of availability? An incident at Federal Express several years ago involving the measurement of service delivery will illustrate some points that could apply to the IT industry. FedEx had built its reputation on guaranteed overnight delivery. For many years its principal slogan was:
"When it positively, absolutely has to be there overnight, Federal Express."
FedEx guaranteed a package or letter would arrive on time, at the correct address, and in the proper condition. One of its key measurements of service delivery was the percentage of time that this guarantee was met. Early on, the initial goals of 99% and later 99.9% were easily met. The number of letters and packages it handled on a nightly basis was steadily growing from a few thousand to over 10,000, and fewer than 10 items were lost or delivered improperly.
A funny thing happened as the company's growth started to explode in the 1980s. The target goal of 99.9% was not adjusted as the number of items handled daily started approaching one million. This meant that 1,000 packages or letters could be lost or improperly delivered every night and the service metric would still be met. One proposal to address this was to increase the target goal to 99.99%, but this goal could have been met while still allowing 100 items a night to be mishandled. A new set of deceptively simple measurements was established in which the number of items lost, damaged, delivered late, and delivered to the wrong address was tracked nightly regardless of the total number of items handled.
The new set of measurements offered several benefits. Because the measurements were not tied to percentages, they gave more visibility to the actual number of delivery errors occurring nightly. This helped in planning for anticipated customer calls, recovery efforts, and adjustments to revenue. Because incidents were broken into subcategories, each type of incident could be tracked separately as well as in total. Finally, by analyzing trends, patterns, and relationships, managers could pinpoint problem areas and recommend corrective actions.
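The contrast between a percentage target and nightly counts can be shown with a short Python sketch. The helper name `max_mishandled` and the nightly incident figures are hypothetical, chosen only to illustrate the point; the volumes and targets are the ones from the FedEx example.

```python
from collections import Counter
from fractions import Fraction


def max_mishandled(items_handled, target_percent):
    """Most items that can be mishandled while still meeting the percentage target.

    target_percent is passed as a string (for example "99.9") so that the
    arithmetic is exact rather than subject to floating-point rounding.
    """
    allowed_fraction = 1 - Fraction(target_percent) / 100
    return int(items_handled * allowed_fraction)


# Percentage view: the same target hides very different absolute numbers.
print(max_mishandled(10_000, "99.9"))       # 10
print(max_mishandled(1_000_000, "99.9"))    # 1000
print(max_mishandled(1_000_000, "99.99"))   # 100

# Count view: one night's incidents by category (hypothetical figures).
nightly_incidents = Counter(lost=3, damaged=5, late=40, wrong_address=12)
total_incidents = sum(nightly_incidents.values())
print(total_incidents)                      # 60
for category, count in nightly_incidents.most_common():
    print(f"{category}: {count}")
```

The count view reports the same number of affected customers no matter how much total volume grows, which is what makes it useful for planning recovery efforts.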
In many ways, this experience with service delivery metrics at Federal Express relates closely to availability metrics in IT infrastructures. A small start-up shop may initially offer online services for only 10 hours a day on weekdays and target 99% availability. The 1% against the 50 scheduled hours allows for 0.5 hour of downtime per week. If the company grows to the point of offering similar online services 24/7 with 99% availability, the allowable downtime grows to approximately 1.7 hours per week.
A better approach is to track the quantity of downtime occurring on a daily, weekly, and monthly basis. As was the case with FedEx, infrastructure personnel can pinpoint and proactively correct problem areas by analyzing the trends, patterns, and relationships of these downtimes. Robust infrastructures also track several of the major components comprising an online system. The areas most commonly measured are the server environment, the disk storage environment, databases, and networks.
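One way to track downtime as a quantity rather than a percentage is sketched below in Python. The outage log, its dates, and the helper name `total_by` are all hypothetical; a real shop would pull these records from its problem management or monitoring system.

```python
from collections import defaultdict
from datetime import date

# Hypothetical outage log: (date, component, minutes of downtime).
OUTAGES = [
    (date(2024, 3, 4), "server", 12),
    (date(2024, 3, 4), "network", 5),
    (date(2024, 3, 6), "database", 30),
    (date(2024, 3, 12), "network", 8),
    (date(2024, 3, 14), "disk storage", 20),
    (date(2024, 3, 14), "network", 4),
]


def total_by(outages, key):
    """Sum downtime minutes, grouped by whatever key(day, component) returns."""
    totals = defaultdict(int)
    for day, component, minutes in outages:
        totals[key(day, component)] += minutes
    return dict(totals)


daily = total_by(OUTAGES, lambda day, component: day)
weekly = total_by(OUTAGES, lambda day, component: tuple(day.isocalendar()[:2]))  # (ISO year, ISO week)
monthly = total_by(OUTAGES, lambda day, component: (day.year, day.month))
by_component = total_by(OUTAGES, lambda day, component: component)

print(weekly)
print(monthly)
print(by_component)
```

Grouping the same records by day, week, month, and component makes recurring trouble spots stand out; in this made-up log the network accounts for three of the six outages even though its total minutes are modest.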
The tendency of many service suppliers to measure their availability in percentages of uptime is sometimes referred to as the rule of nines. Nines are continually added to the target availability goal as shown in Table 1. The table shows how the weekly hours and minutes of allowable downtime change in our example of the online system scheduled for 100 hours per week, and how the number of allowable undelivered items changes in our FedEx example.
Table 1. Rule of Nines Availability Percentage

| Number of Nines | Percentage of Availability | Weekly Hours Down | Weekly Minutes Down | Daily Packages Mishandled (of 10,000) | Daily Packages Mishandled (of 1,000,000) |
|---|---|---|---|---|---|
| 1 | 90.000% | 10.000 | 600.00 | 1,000.0 | 100,000.0 |
| 2 | 99.000% | 1.000 | 60.00 | 100.0 | 10,000.0 |
| 3 | 99.900% | 0.100 | 6.00 | 10.0 | 1,000.0 |
| 4 | 99.990% | 0.010 | 0.60 | 1.0 | 100.0 |
| 5 | 99.999% | 0.001 | 0.06 | 0.1 | 10.0 |
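The figures in Table 1 follow directly from the number of nines, as this Python sketch shows. The function name `rule_of_nines_row` is my own, and the defaults of 100 scheduled hours per week and one million daily items are the ones used in the examples in this article.

```python
from fractions import Fraction


def rule_of_nines_row(nines, weekly_hours=100, daily_items=1_000_000):
    """Return (availability %, weekly downtime minutes, daily items mishandled)."""
    unavailable = Fraction(1, 10 ** nines)   # 1 nine -> 10%, 2 nines -> 1%, ...
    availability = (1 - unavailable) * 100
    downtime_minutes = weekly_hours * 60 * unavailable
    mishandled_items = daily_items * unavailable
    return float(availability), float(downtime_minutes), float(mishandled_items)


for nines in range(1, 6):
    pct, minutes, mishandled = rule_of_nines_row(nines)
    print(f"{nines}  {pct:.3f}%  {minutes:8.2f} min/week  {mishandled:>11,.1f} items/day")
```

Calling `rule_of_nines_row(3, weekly_hours=168)` gives 10.08 minutes per week, which matches the "just over 10 minutes" quoted earlier for a 24/7 system at 99.9%.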
In part three of this series I will discuss characteristics that I feel are essential for obtaining maximum availability. Coincidentally, they all happen to start with the same letter, which leads me to refer to them as the Seven R's of High Availability.
References
Schiesser, Rich, IT Systems Management, Prentice Hall, 2002