
IT Management Reference Guide

Jul 9, 2004



This is the third part of a four-part case study of an actual disaster recovery exercise conducted in May 2006. In this segment I describe the results of the exercise by sharing key portions of its final report. I begin with the executive summary, follow with the scenario used, and end with the estimated and actual time frames of the 29 recovery tasks. Where appropriate, I have replaced the names of individuals, departments, applications, and locations with fictitious versions.

Executive Summary

This report summarizes the results of the disaster recovery exercise conducted on May 13, 2006. The overall purpose of this exercise was to identify and resolve issues associated with the recovery of key applications at the xyz data center. By all accounts the exercise was a success, and it provided much useful information that employees can use to learn about, build upon, and improve our overall recovery strategies.

The following are among the highlights of this exercise:

All 13 specific objectives of this exercise were met (100%)
11 of the 17 applications were tested successfully by QA (65%)
11 of the 17 applications were tested successfully by Users (65%)
Total of 22 individuals from 10 departments took part in the exercise
Business customers from business unit xx and business unit yy participated in the exercise

This report consists of seven sections and ten appendices (not all of which are shown here). Following the executive summary are the status of the objectives and their methods of verification, and the final version of the assumptions used in this exercise. Next are the observations and issues documented during the exercise, followed by the lessons learned and their resulting post-exercise action items to implement improvement suggestions. In the lessons learned section (which will be in Part Four of this series), the responses are listed in priority order based on the voting by the respondents. The distribution of the voting is also shown.

It is expected that another similar recovery exercise of these applications will be conducted in mid-October 2006, and that many of these improvement suggestions will be implemented by that time.

The appendices include such items as the lists of participants, the scenario used, the pre-exercise action items, meeting attendee roster, and the variety of recovery tasks performed along with estimated and actual duration times.

Scenario Used

At approximately 7:25am on Saturday, May 13, 2006, a small fire is reported at the location xx facility of company yy. The fire stems from faulty wiring in a server cabinet in the location xx campus data center. For reasons unknown the fire suppression system does not activate and the fire quickly spreads to several other cabinets. By this time the fire department has been notified of the incident and has trucks rolling to the site. At 7:37am the first truck arrives, and by 7:53am the fire is extinguished.

Employee A of the Enterprise Storage group of IT is notified of the fire by the Network Operations Center (NOC) engineers at 7:32am. The engineers assess the damage and find it is limited to servers supporting application system qq, and the large scale storage array that houses all of its data. Employee A contacts his manager at 7:48am and they agree that the application system qq needs to be recovered immediately to the location xyz recovery data center. At approximately 8:00am, employee A contacts employee B and employee C at the recovery data center and advises them to initiate recovery actions for the application system qq.

Recovery Tasks

This section describes the 29 tasks required to recover the critical application system involved in this exercise. Table 1 shows the description of each task, the dependent tasks associated with each one (sometimes referred to as upstream/downstream or input/output), the person responsible for performing each task, the estimated and actual times, the variance (delta) between estimated and actual times, and any comments. As the comments show, some of the databases initially failed to come up because they were designated as 'suspect'. The problem was eventually traced to a script that had failed, undetected, during database shutdowns the night before.

The table also shows that the majority of tasks were completed ahead of schedule. This was important for two reasons. One is that the overall recovery time is a key measure of business continuity and relates to many factors concerning true business impact. The second is that it helps us estimate more accurately in future exercises. The overall expected recovery time was 3 hours 40 minutes. If we subtract out the time spent on unexpected troubleshooting (not likely to occur again), the actual recovery time was 3 hours 20 minutes.
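For readers who want to track the same variance column in their own exercises, the arithmetic can be sketched in a few lines of Python. This is only an illustration, not the tooling used in the exercise: the function names `variance` and `format_delta` are hypothetical, and the sign convention (actual minus estimated, so a negative value means ahead of schedule) is an assumption. The only figures used are the overall expected and actual recovery times quoted above.

```python
from datetime import timedelta


def variance(estimated: timedelta, actual: timedelta) -> timedelta:
    """Return actual minus estimated (assumed convention).

    A negative result means the work finished ahead of schedule.
    """
    return actual - estimated


def format_delta(delta: timedelta) -> str:
    """Render a signed duration as hours and minutes, e.g. '-0h 20m'."""
    total_minutes = int(delta.total_seconds() // 60)
    sign = "-" if total_minutes < 0 else "+"
    hours, minutes = divmod(abs(total_minutes), 60)
    return f"{sign}{hours}h {minutes:02d}m"


# Overall figures from the report: 3h 40m expected, 3h 20m actual
# (after subtracting the unexpected troubleshooting time).
expected_total = timedelta(hours=3, minutes=40)
actual_total = timedelta(hours=3, minutes=20)

overall = variance(expected_total, actual_total)
print(format_delta(overall))  # -0h 20m
```

The same two functions could be applied to each of the 29 tasks in turn, which makes it easy to see which estimates need adjusting before the next exercise.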

In Part Four I discuss the lessons learned from this exercise and the follow-up actions that resulted from them.

Table 1 Recovery Tasks (1 of 2)

Table 1 Recovery Tasks (2 of 2)

