- Management Reference Guide
- Case Study: Recovery Reactions to a Renegade Rodent
This article describes my most recent experiences dealing with the aftermath of a sustained computer center outage. During the past twenty years I have been involved with a number of major disasters. Each of these events disrupted the data centers and IT services for which I was responsible.
While managing the IT infrastructure for a major defense contractor in Southern California during the late 1980s and early 1990s, I experienced two major earthquakes. The Whittier Narrows earthquake occurred the morning of October 1, 1987, just after 7:00am. Its epicenter was only five miles from the company's sprawling facility, which housed one of the largest computer centers in the country. The temblor registered 5.9 on the Richter scale and did substantial damage to our computer room and tape library.
Just over six years later the Northridge earthquake rumbled everyone awake at 4:30am on January 17, 1994. This quake was much stronger, with a magnitude of 6.7, but resulted in much less damage to our computer center. This was due to several precautions we had undertaken as a result of the previous major shaker, and to the fact that its epicenter was over 25 miles away.
In April of 1995 I had just begun a new job as the Director of Computer Operations for a major motion picture company. Things got off to a shocking start. During my first week, a distribution transformer that provided electrical power to our computer center exploded. The explosion sent a huge electrical spike down the line, damaging two mission-critical computers beyond use. There was no disaster recovery plan in place, and only through the diligence of some very dedicated suppliers was an all-out catastrophe avoided.
Over the years I have experienced a number of other major mishaps including overhead sprinkler heads bursting, hazardous material spills and small electrical fires. As an IT consultant my work in recent years has taken me to numerous clients. But at these client sites both the frequency and severity of these kinds of disasters seemed to have diminished. The most recent of these types of events happened to a client of mine just a few months ago. The lessons learned from the aftermath provide an interesting view of how many companies are dealing with disaster recovery today.
Background on the Company and the Event
This case study concerns a motion picture production and distribution company in Southern California. The firm produces motion pictures outright or acquires the rights to distribute films already in production. The company has also acquired rights to extensive movie video libraries, and almost half of its revenue comes from the sales of films on digital video disks (DVDs). The firm employs approximately 300 full-time staff and roughly 200 contractors, most of whom are housed in a large office building. The facility also houses the two small computer rooms from which all company IT services are provided.
One day in late 2006 at around 5:00am, all power to the building and the surrounding four-block area went out. Utility crews traced the problem to a rat that had chewed through high-voltage cables coming out of a distribution transformer. Some residents and office workers were left without power for up to eight hours. But they fared much better than the rodent in question, which was electrocuted.
Aftermath of the Event
The distribution transformer that failed had supplied electrical power to the two computer rooms of the entertainment company. The rest of the building was fed from another transformer that was not affected. As a result, nearly 500 employees who relied heavily on IT services had little to do while they waited for power to the computer rooms to resume and for their IT services to be restored.
By noon it was apparent the utility crew would need another hour or two to restore power, and that IT would need at least an hour or two after that to check out all of the equipment and restore all services. Senior management decided to send home those employees who could do little work without email and the Internet. Surprisingly, a large number of staff were able to conduct business over the phone because of the nature of their work with suppliers and distributors. IT finally restored all services by 5:00pm.
Within days of the outage the Executive Audit Committee directed the Compliance Department and IT to jointly investigate the company's ability to recover from a major disaster. One of the first findings was that there were no formal disaster recovery or business continuity plans. I was brought in and assisted with a mini-business impact analysis (BIA) and risk assessment. Several vulnerabilities were uncovered, including a critical data warehouse that was outsourced to a company that had not yet begun a formal backup program for it.
After weeks of interviews, analysis, and proposals, the joint Compliance and IT team presented their findings to the Senior Audit Committee. Among the recommendations were the installation of either an onsite or portable backup generator, a recovery facility to host critical applications and databases, a backup program for the critical data warehouse and a formal disaster recovery program for IT.
Lessons Learned
The following were among the key lessons learned from this event.
- Assess your threats and vulnerabilities on a regular basis – This company had never conducted a risk assessment. This event pointed out clearly the necessity and value of doing such an assessment.
- Prioritize your business processes in terms of time-dependent impact – Not all business processes have the same degree of urgency in terms of how quickly they need to be restored. Even a mini-BIA can help prioritize the sequence of restorations of business processes.
- Identify mitigation plans to minimize impacts in the short term – This company discovered through its risk assessment that it had several exposures and single points of failure. As a result, several mitigation plans were proposed and implemented.
- Propose long-term strategies to permanently reduce or eliminate risk – While mitigation plans were intended to address the current environment, long-term strategies were proposed to permanently eliminate many of these exposures.
- Ensure executive management is fully behind these efforts – None of the above activities could have been successfully implemented without the full support of senior management. Executive support is a primary requirement for any effort in risk assessment, business continuity or disaster recovery to be effective.
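The second lesson above, prioritizing business processes by time-dependent impact, can be illustrated with a small sketch. This is only one possible way to express a mini-BIA ranking, not the method used at the company. The process names, the two fields and every number are hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BusinessProcess:
    name: str
    # Hypothetical figure that BIA interviews might produce.
    max_tolerable_downtime_hours: float
    # Hypothetical estimate of loss per hour of outage.
    hourly_impact: float


def restoration_order(processes):
    """Return processes sorted so the most time-critical come first.

    Ties on tolerable downtime are broken by the larger hourly impact.
    """
    return sorted(
        processes,
        key=lambda p: (p.max_tolerable_downtime_hours, -p.hourly_impact),
    )


# Illustrative values only; none of these figures come from the case study.
processes = [
    BusinessProcess("Data warehouse reporting", 72, 500.0),
    BusinessProcess("Email", 4, 2000.0),
    BusinessProcess("Distribution order entry", 4, 8000.0),
    BusinessProcess("Payroll", 48, 1500.0),
]

for rank, process in enumerate(restoration_order(processes), start=1):
    print(rank, process.name)
```

Sorting first on tolerable downtime and only then on hourly impact reflects the point that urgency, not size alone, drives the restoration sequence. A real BIA would weigh more factors than these two.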
Summary
This article described a case study of how one company responded to a sustained power outage and its aftermath. While the overall impact was relatively minor, the event did highlight the fact that the company was vulnerable to other types of risks. A thorough risk assessment was performed and documented. Several important lessons were identified and acted upon. Not least among these was the need to thoroughly assess one's environment on a routine basis for weaknesses, single points of failure, threats and vulnerabilities.