- 1.1 Building a Site from Scratch
- 1.2 Growing a Small Site
- 1.3 Going Global
- 1.4 Replacing Services
- 1.5 Moving a Data Center
- 1.6 Moving to/Opening a New Building
- 1.7 Handling a High Rate of Office Moves
- 1.8 Assessing a Site (Due Diligence)
- 1.9 Dealing with Mergers and Acquisitions
- 1.10 Coping with Frequent Machine Crashes
- 1.11 Surviving a Major Outage or Work Stoppage
- 1.12 What Tools Should Every SA Team Member Have?
- 1.13 Ensuring the Return of Tools
- 1.14 Why Document Systems and Procedures?
- 1.15 Why Document Policies?
- 1.16 Identifying the Fundamental Problems in the Environment
- 1.17 Getting More Money for Projects
- 1.18 Getting Projects Done
- 1.19 Keeping Customers Happy
- 1.20 Keeping Management Happy
- 1.21 Keeping SAs Happy
- 1.22 Keeping Systems from Being Too Slow
- 1.23 Coping with a Big Influx of Computers
- 1.24 Coping with a Big Influx of New Users
- 1.25 Coping with a Big Influx of New SAs
- 1.26 Handling a High SA Team Attrition Rate
- 1.27 Handling a High User-Base Attrition Rate
- 1.28 Being New to a Group
- 1.29 Being the New Manager of a Group
- 1.30 Looking for a New Job
- 1.31 Hiring Many New SAs Quickly
- 1.32 Increasing Total System Reliability
- 1.33 Decreasing Costs
- 1.34 Adding Features
- 1.35 Stopping the Hurt When Doing This
- 1.36 Building Customer Confidence
- 1.37 Building the Team's Self-Confidence
- 1.38 Improving the Team's Follow-Through
- 1.39 Handling an Unethical or Worrisome Request
- 1.40 My Dishwasher Leaves Spots on My Glasses
- 1.41 Protecting Your Job
- 1.42 Getting More Training
- 1.43 Setting Your Priorities
- 1.44 Getting All the Work Done
- 1.45 Avoiding Stress
- 1.46 What Should SAs Expect from Their Managers?
- 1.47 What Should SA Managers Expect from Their SAs?
- 1.48 What Should SA Managers Provide to Their Boss?
1.11 Surviving a Major Outage or Work Stoppage
- Consider modeling your outage response on the Incident Command System (ICS). This ad hoc emergency response system has been refined over many years by public safety departments to create a flexible response to adverse situations. Defining escalation procedures before an issue arises is the best strategy.
- Notify customers that you are aware of the problem on the communication channels they would use to contact you: the “outages” section of the intranet help desk, the outgoing message on the SA phone line, and so on.
- Form a “tiger team” of SAs, management, and key stakeholders; have a brief 15- to 30-minute meeting to establish the specific goals of a solution, such as “get developers working again,” “restore customer access to support site,” and so on. Make sure that you are working toward a goal, not simply replicating functionality whose value is non-specific.
- Establish the costs of a workaround or fallback position versus downtime owing to the problem, and let the businesspeople and stakeholders determine how much time is worth spending on attempting a fix. If information is insufficient to estimate this, do not end the meeting without setting the time for the next attempt.
- Spend no more than an hour gathering information. Then hold a team meeting to present management and key stakeholders with options. The team should update the passive notification message with the current status every hour.
- If the team chooses fix or workaround attempts, specify an order in which fixes are to be applied, and get assistance from stakeholders on verifying that each procedure did or did not work. Document this, even in brief, to prevent duplication of effort if you are still working on the issue hours or days from now.
- Implement fix or workaround attempts in small blocks of two or three, with each block taking no more than an hour in total to implement. Collect any error messages or log data that may be relevant, and report on them in the next meeting.
- Don’t allow a team member, even a highly skilled one, to go off to try to pull a rabbit out of his or her hat. Since you can’t predict the length of the outage, you must apply a strict process in order to keep everyone in the loop.
- Appoint a team member who will ensure that meals are brought in, notes taken, and people gently but firmly disengaged from the problem if they become too tired or upset to work.
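The bookkeeping in the checklist above (an agreed order of fixes, blocks of two or three attempts, verified results, hourly status updates) can be kept in something as simple as a small script. The following is a minimal sketch, not a prescribed tool: the names `OutageLog`, `FixAttempt`, `next_block`, and `status_message` are hypothetical and were chosen only for illustration, and the block size of three and the one-hour interval are taken from the checklist.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional

# Assumptions drawn from the checklist: hourly updates, blocks of at most three.
UPDATE_INTERVAL = timedelta(hours=1)
MAX_BLOCK_SIZE = 3


@dataclass
class FixAttempt:
    """One planned fix or workaround (hypothetical structure)."""
    description: str
    worked: Optional[bool] = None  # None until a stakeholder verifies the result
    notes: str = ""                # relevant error messages or log excerpts


@dataclass
class OutageLog:
    """Brief written record of the outage response (hypothetical structure)."""
    goal: str
    attempts: List[FixAttempt] = field(default_factory=list)
    last_update: Optional[datetime] = None

    def plan(self, descriptions: List[str]) -> None:
        """Record the agreed order in which fixes are to be applied."""
        self.attempts.extend(FixAttempt(d) for d in descriptions)

    def next_block(self) -> List[FixAttempt]:
        """Return the next untried attempts, at most MAX_BLOCK_SIZE of them."""
        pending = [a for a in self.attempts if a.worked is None]
        return pending[:MAX_BLOCK_SIZE]

    def record(self, attempt: FixAttempt, worked: bool, notes: str = "") -> None:
        """Store the verified outcome so nobody repeats the attempt later."""
        attempt.worked = worked
        attempt.notes = notes

    def update_due(self, now: datetime) -> bool:
        """True if no status update has been sent within the last hour."""
        return self.last_update is None or now - self.last_update >= UPDATE_INTERVAL

    def status_message(self, now: datetime) -> str:
        """Build the text for the passive notification and mark it as sent."""
        self.last_update = now
        tried = [a for a in self.attempts if a.worked is not None]
        lines = [f"[{now:%H:%M}] Goal: {self.goal}"]
        for a in tried:
            lines.append(f"  {'OK' if a.worked else 'FAILED'}: {a.description}")
        lines.append(f"  {len(self.attempts) - len(tried)} attempt(s) still planned")
        return "\n".join(lines)
```

In use, the team would call `plan` once after the options meeting, work through `next_block`, call `record` as stakeholders verify each result, and post `status_message` whenever `update_due` returns true. A shared text file or ticket would serve the same purpose; the point is only that the order, the outcomes, and the update times are written down.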