
IT Management Reference Guide

Jul 9, 2004



With corporations depending more and more on their IT organizations to conduct business, gain a competitive advantage, and streamline costs, the development of viable disaster recovery plans is more important than ever. But developing such a plan is merely one of the initial steps (after performing a business impact analysis and deciding on your recovery strategies) in an overall process to enact a robust business continuity program. One of the key activities of any business continuity program follows the development of the plan: exercising it with one of three common methods.

The first method of exercising a disaster recovery plan, called a validation test, involves verifying that all of the updatable information in the plan, such as contact personnel, phone numbers, and software versions, is accurate. In the case of call trees, telephone calls are actually made to ensure correct numbers are in use. The second method, called a simulation test, is sometimes referred to as a table-top or walk-through exercise because it consists of bringing all of the key players together around a table, simulating the sequence of events that would occur in a disaster, and comparing those events to what the recovery plan says to do. The third method, called an operational test, involves using the recovery plan to actually bring up selected systems at a recovery site.
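Part of a validation test can be automated by tracking when each updatable item in the plan was last confirmed. The following is a minimal sketch, not a prescribed procedure: the `PlanEntry` record, the `stale_entries` helper, the sample data, and the 90-day threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class PlanEntry:
    """One updatable item in the recovery plan (hypothetical structure)."""
    kind: str            # e.g. "phone", "contact", "software_version"
    value: str
    last_verified: date  # when someone last confirmed the value


def stale_entries(entries, today, max_age_days=90):
    """Return entries not verified within the last max_age_days days."""
    cutoff = today - timedelta(days=max_age_days)
    return [e for e in entries if e.last_verified < cutoff]


# Illustrative data only; the 90-day default is an assumed policy.
plan = [
    PlanEntry("phone", "555-0100", date(2004, 6, 1)),
    PlanEntry("software_version", "backup-agent 4.2", date(2003, 11, 15)),
]

overdue = stale_entries(plan, today=date(2004, 7, 9))
for entry in overdue:
    print(f"re-verify {entry.kind}: {entry.value} "
          f"(last checked {entry.last_verified})")
```

A report like this only shows which entries are due for checking. For call trees, someone still has to make the actual telephone calls.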

More often than not, shops that have disaster recovery plans do not keep them up to date, or fail to exercise them at regular intervals, or both. This can result in a host of problems when it comes time to actually use the recovery plan. I refer to these types of events as nightmare incidents.

During my 20 years of managing and consulting on IT infrastructures, I have experienced directly, or indirectly through individuals with whom I have worked, a number of nightmarish incidents involving disaster recovery. Some are humorous, some are head-scratching, and some are just plain bizarre. In all cases, they totally undermined what would have been a successful recovery from either a real or simulated disaster. Fortunately, no single client or employer with whom I was associated ever experienced more than two of these, but in their eyes even one was more than was acceptable. These incidents (listed in Figure 1) illustrate how critical the planning, preparation, and performance of the disaster recovery plan really are.

The first four incidents all involve the handling of the backup tapes required to restore copies of data rendered inaccessible or damaged by a disaster. While many shops today replicate their critical data on disk, many still use tape as their primary means of backup. Verifying that the backup and, more important, the restore processes are completing successfully should be one of the first requirements of any disaster recovery program. While most shops verify the backup portion of the process, more than a handful do not test that the restore process also works. Labels and locations can also cause problems when tapes are marked or stored improperly.
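One way to test the restore side, and not just the backup side, is to restore to a scratch location and compare checksums against the original data. The sketch below assumes the data can be read as ordinary files. `digest_tree` and `verify_restore` are hypothetical helpers, and a plain directory copy stands in for the real backup and restore tooling.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def digest_tree(root: Path) -> dict:
    """Map each file's path (relative to root) to the SHA-256 of its contents."""
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_restore(source: Path, restored: Path) -> list:
    """Return relative paths that are missing or differ after a restore."""
    want, got = digest_tree(source), digest_tree(restored)
    return sorted(name for name in want if got.get(name) != want[name])


# Demonstration with throwaway directories (file names are illustrative).
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "source"
    restored = Path(tmp) / "restored"
    source.mkdir()
    (source / "ledger.dat").write_bytes(b"critical data")
    (source / "orders.dat").write_bytes(b"more data")

    # Stand-in for a real backup followed by a restore.
    shutil.copytree(source, restored)
    # Simulate a backup that captured no data for one file.
    (restored / "orders.dat").write_bytes(b"")

    problems = verify_restore(source, restored)
    print(problems)
```

An empty result means every source file came back intact. Any listed path is a file that the restore lost or corrupted, and it is far cheaper to find during an exercise than during a real recovery.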

Although such cases are rare, I did know of a client who was denied retrieval of a tape because the offsite tape storage supplier had not been paid in months. Fortunately, it was not during a critical recovery. Communicating the proper recovery procedures to all shifts, documenting them, and training every shift on them are necessities. Third-shift graveyard operators often receive the least of these due to their off hours and higher-than-normal turnover. These operators especially need to know who to call and how to contact offsite recovery services.

Classified environments can present their own brand of recovery nightmares. One of my classified clients had applied for a security clearance for its offsite tape storage supplier and had begun using the service prior to the clearance being granted. When the client's military customer found out, the tapes were confiscated. In a related issue, a separate defense contractor cleared its offsite vendor for a secured program but failed to clear the one individual who worked nights when a tape was requested for retrieval. The uncleared worker could not retrieve the classified tape that night, delaying the retrieval of the tape and the restoration of the data for at least a day.

The last two incidents involve tape canisters used during a full dry-run test of restoring and running critical applications at a remote hot site 3,000 miles away. The airline in question had just changed its policy on carry-on baggage, preventing the canisters from staying with the recovery team. Making matters worse, the canisters were mislabeled, causing over six hours of restore time to be lost. The lessons-learned debriefing had much to talk about during its marathon postmortem session.

Backup tapes have no data on them.
Restore process has never been tested and is found not to work.
Restore tapes are mislabeled.
Restore tapes cannot be found.
Offsite tape supplier has not been paid and cannot retrieve tapes.
Graveyard-shift operator does not know how to contact recovery service.
Recovery service to a classified defense program is not cleared.
Recovery service to a classified defense program is cleared, but individual personnel are not cleared.
Operator cannot fit tape canister onto the plane.
Tape canisters are mislabeled.

Figure 1 Nightmare Incidents with Disaster Recovery

4.20.2 References

Schiesser, Rich, IT Systems Management, Prentice Hall, 2002

