Last month I had the opportunity to lead a cross-functional team at a financial services company in planning and conducting a major disaster recovery exercise. Knowing how popular case studies are with readers of the IT Management Guide, I decided to chronicle the key activities of this effort in this four-part series. In this initial segment I discuss how the project came about, how I assembled the team, and how we developed the objectives and assumptions for the exercise. In subsequent segments I will discuss the planning meetings and simulated walk-through, the results of the exercise itself, and the many lessons learned from the activity. Where appropriate I have replaced the names of individuals, departments, applications and locations with fictitious versions.
The project was born out of IT's desire to test the restoration of one of the company's most critical application systems at the firm's newly built out-of-state recovery facility. IT managers asked me to coordinate what we described as an operational disaster recovery exercise. We would simulate the primary data center being completely disabled, and proceed to restore business operations supported by this critical application system. The system consisted of dozens of separate applications and databases.
The first decision was how many applications to include in this exercise. After careful review we decided on 17 of the most critical applications. Managers wisely chose not to include all of them, both to keep the scope of the project reasonable enough to complete within six weeks and because this was the first attempt at a recovery exercise of this magnitude.
The second decision was to include several business users to confirm that the team had functionally restored all applications and that the applications could support the users' business processes. This decision reaped huge benefits later on in terms of building a spirit of trust, credibility and shared responsibility between the IT and business departments.
The third decision was how to realistically simulate the primary data center being down without adversely impacting the 24-hour production workload that had to keep running. The technical teams accomplished this by revising access control lists in routers and using modified host files to route traffic to the recovery site. Together these decisions helped with the next one, which was determining who would participate in the exercise and what their roles would be.
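To illustrate the host-file part of that approach, here is a minimal sketch in Python. It is not the script the technical teams used; the hostnames and the IP addresses (taken from the reserved documentation ranges) are hypothetical. It shows one way to produce a modified hosts file in which selected application names resolve to recovery-site addresses while every other entry is left alone.

```python
def redirect_hosts(hosts_text: str, overrides: dict[str, str]) -> str:
    """Return hosts-file text in which each hostname in `overrides`
    resolves to the given recovery-site IP address.

    `overrides` maps hostname -> recovery IP. All names and addresses
    used with this function here are hypothetical examples.
    """
    wanted = {name.lower(): ip for name, ip in overrides.items()}
    out = []
    for line in hosts_text.splitlines():
        body = line.partition("#")[0]          # ignore trailing comments
        fields = body.split()
        if len(fields) < 2:
            out.append(line)                   # blank or comment-only line
            continue
        # Keep only the names on this entry that are NOT being redirected.
        kept = [n for n in fields[1:] if n.lower() not in wanted]
        if len(kept) == len(fields) - 1:
            out.append(line)                   # entry untouched
        elif kept:
            out.append(" ".join([fields[0]] + kept))
        # If no names remain, the old entry is dropped entirely.
    # Append the recovery-site mappings at the end of the file.
    for name, ip in sorted(wanted.items()):
        out.append(f"{ip}\t{name}")
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    production = (
        "127.0.0.1 localhost\n"
        "192.0.2.10 app01.example.internal app01\n"
    )
    print(redirect_hosts(production, {"app01.example.internal": "198.51.100.10"}))
```

A file generated this way would be distributed only to the machines taking part in the exercise, so production users continue to reach the primary data center. The router access control list changes are a separate step and are not shown here.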
Objectives
The overall purpose of this exercise was to flush out issues associated with the recovery of a key application system at an out-of-state recovery data center. It was not designed to be a pass-or-fail test. It was a recovery exercise intended to help the staff learn about, build upon, and improve our overall recovery strategies. The team identified 13 specific objectives for this recovery exercise, listed in Figure 1.
- Determine to what degree the 17 key applications can be recovered to the location xyz recovery data center.
- Determine the minimum time needed to recover the entire system (recovery time objective, or RTO).
- Determine the minimum amount of data that cannot be recovered (recovery point objective, or RPO).
- Improve the partnering and teambuilding among appropriate individuals from the business units, IT and business continuity by collaborating on the development and execution of a successful recovery plan.
- Conduct a simulation walk-through one week prior to the recovery exercise to validate the sequence, thoroughness, dependencies and estimated timeframes of required recovery activities.
- Demonstrate that two separate branch offices can have all of their processing done using only the location xyz recovery data center.
- Demonstrate that a branch office can access web applications.
- Verify that there is no single point of failure associated with the primary data center in recovering systems at the location xyz recovery data center.
- Evaluate the development and execution process of the exercise by analyzing the results of surveys submitted by all participants.
- Conduct a lessons learned session to identify and prioritize mistakes to avoid and improvements to implement.
- Develop action plans to implement improvements.
- Compile a final report on the results of the exercise.
- Determine the priority and sequence of bringing up applications.
Figure 1 Recovery Exercise Objectives
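The recovery time and recovery point objectives in Figure 1 come down to simple timestamp arithmetic once the exercise has been run. The following Python sketch is only an illustration: the function name, the timestamps and the target values are assumptions, not figures from this exercise. It computes the achieved recovery time (outage start to service restored) and the achieved recovery point (last good replication to outage start) and compares them with targets.

```python
from datetime import datetime, timedelta


def recovery_metrics(last_replication: datetime,
                     outage_start: datetime,
                     service_restored: datetime) -> tuple[timedelta, timedelta]:
    """Return (achieved recovery time, achieved recovery point).

    Recovery time  = how long the business was without the system.
    Recovery point = how much recent data was lost, expressed as time
                     between the last good replication and the outage.
    """
    if not last_replication <= outage_start <= service_restored:
        raise ValueError("timestamps must be in chronological order")
    recovery_time = service_restored - outage_start
    recovery_point = outage_start - last_replication
    return recovery_time, recovery_point


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    rto_target = timedelta(hours=8)
    rpo_target = timedelta(minutes=30)

    achieved_rto, achieved_rpo = recovery_metrics(
        last_replication=datetime(2024, 5, 18, 7, 45),
        outage_start=datetime(2024, 5, 18, 8, 0),
        service_restored=datetime(2024, 5, 18, 14, 30),
    )
    print("Recovery time:", achieved_rto, "within target:", achieved_rto <= rto_target)
    print("Recovery point:", achieved_rpo, "within target:", achieved_rpo <= rpo_target)
```

Recording these three timestamps for each application during the exercise makes it straightforward to report the results per application as well as for the system as a whole.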
Assumptions
The sponsoring IT managers and I developed an initial set of assumptions for this exercise. I discussed these with the entire cross-functional team during subsequent planning meetings. Figure 2 lists our final set of assumptions.
- Critical application system xx will be tested for recovery and will include 17 specific applications (names removed here for proprietary reasons).
- Specific application qq6 was not initially available at the recovery data center during preparations for this exercise, but portions of it were available by May 13th.
- This exercise will also test for the recovery of any support applications that are needed to run the primary applications listed above. These support applications include, but are not limited to, the following:
  - Internet Explorer
  - Outlook/Exchange
- Citrix will be used in this exercise by loading software onto the Citrix servers at the recovery data center.
- This exercise will simulate access by an outside party to the appropriate website and will include a test submission from an outside partner. QA will test the appropriate applications as part of this.
- Replication between the primary and recovery data centers will be stopped for this exercise.
- Network connections will be severed between the participating branch offices and the primary data center.
- The branch offices will be used in this exercise to process test data.
- The test data will be created using all normal processes up to, but not including, final processing; the DBA and QA groups will be involved to ensure the test data is backed out from the system.
- The scenario selected for this exercise will be viable, practical and meaningful; the scenario will be described in sufficient detail so as to simulate an actual event.
- Branch office xyz will be used in this exercise to test application qq2.
- The Marketing and Business Development departments will not be included in this exercise.
- The routing of data from branch office aaa to the recovery data center will be pre-tested prior to May 1st.
- The loading of the software on the recovery data center Citrix servers will be pre-tested prior to May 1st.
- Full database consistency checks (DBCC) will be performed on all 13 database servers to ensure data integrity.
- If any of the DBCCs fail, the data will not be re-replicated.
- Citrix will be loaded on the recovery data center desktops prior to the exercise.
- Employee A will shadow the activities of the users at the branch offices from her desktop at the primary data center.
- The QA rep will be using a laptop at home for this exercise.
- We do not need to block traffic from the location xyz sales office to the primary data center.
- There will be interruptions to the Exchange servers at the recovery data center and to email services; the Help Desk will issue advisories about these interruptions.
Figure 2 Recovery Exercise Assumptions
Next Steps
In part two of this series, I discuss the weekly planning meetings I conducted in preparation for the exercise, and the simulated walk-through we performed to verify the sequence, thoroughness, dependencies and estimated timeframes of required recovery activities.