Incident Management
Incident Management refers to the process responsible for managing the life cycle of all Incidents, where an Incident is an unplanned interruption to an IT service or a reduction in the quality of an IT service. Failure of a CI (any component that needs to be managed to deliver an IT service, including IT services, hardware, software, buildings, people, and formal documentation such as process documentation and SLAs) that has not yet impacted service is also an Incident. The primary objective of Incident Management is to return the IT service to users as quickly as possible. For more information, see Chapter 10, which covers Incident Management in detail.
Incident Management is the process of managing deviations from normal service, restoring normal service operation quickly with minimum business disruption, and getting individual Users back up and running. Incident Management utilizes Configuration Management data to enable efficient and effective resolution of Incidents and to identify where Change releases have caused Incidents.
The goal of Incident Management is to restore normal service operation as quickly as possible with minimum disruption to the business and to ensure that the best achievable levels of availability and service are maintained.
Objectives of Incident Management include the following:
- Restoring normal service as quickly as possible.
- Minimizing the negative impact of Incidents on the business.
- Ensuring that Incidents are processed consistently and that none are lost.
- Directing support resources where most required.
- Providing information that allows support processes to be optimized, the number of Incidents to be reduced, and management planning to be carried out.
Table 3.5 presents Incident Management key terminology.
Table 3.5. Key Terminology in Incident Management
Term | Explanation |
Incident | Any event that is not part of the standard operation of a service and that causes, or may cause, an interruption to, or a reduction in, the quality of that service. |
Problem | An unknown error, the underlying cause of one or more Incidents. |
Known Error | A Problem for which a root cause and permanent fix or workaround are identified but where the fix has not been implemented. |
Workaround | A temporary fix or technique that eliminates the customer's reliance on the faulty service component. |
Impact | The likely effect on the business service, often equal to the extent of distortion of agreed or expected service levels. |
Urgency | The speed of resolution required, based on impact and the business needs of the customer. |
Priority | The relative sequence in which Incidents must be resolved, calculated from impact and urgency and adjusted for other relevant factors such as resource availability. |
Escalation | The mechanism that assists with timely resolution of an Incident. There are two types: functional escalation (transfer of an Incident between n-tier support departments) and hierarchical escalation (calling on management to assist in handling of an Incident). |
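The Priority definition in Table 3.5 can be illustrated with a small lookup matrix. The following is a minimal sketch only: the three-level scale, the `Level` and `calculate_priority` names, and the resulting 1 to 5 priority values are assumptions made for illustration, not the values Service Manager ships with or that ITIL prescribes. Each organization defines its own matrix.

```python
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical three-step scale used for both impact and urgency."""
    HIGH = 1
    MEDIUM = 2
    LOW = 3


# Hypothetical priority matrix: 1 is the most pressing, 5 the least.
# Each cell is derived as impact + urgency - 1; an organization would
# normally fill this table in by hand to match its own agreements.
PRIORITY_MATRIX = {
    (impact, urgency): int(impact) + int(urgency) - 1
    for impact in Level
    for urgency in Level
}


def calculate_priority(impact: Level, urgency: Level) -> int:
    """Look up the priority for a given impact and urgency."""
    return PRIORITY_MATRIX[(impact, urgency)]


if __name__ == "__main__":
    for impact in Level:
        row = [calculate_priority(impact, urgency) for urgency in Level]
        print(f"impact={impact.name:<6} priorities by urgency: {row}")
```

Keeping the matrix as data rather than as branching logic makes it easy to adjust when other factors, such as resource availability, lead the organization to revise its prioritization scheme.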
Incident Management helps IT professionals, teams, and organizations achieve a critical outcome: minimizing the business disruption of Incidents by getting individual users (the "hands on the keyboard") back up and running and restoring service as quickly as possible.
Because resources are to be allocated to the Incident Management process, the value of that process to the business has to be determined so that the resources allocated can be justified. To determine the value an organization places on Incident Management, consider the following:
- What mechanisms are in place to reduce the business disruption of Incidents and help ensure user satisfaction, especially when multiple Incidents arrive at the same time?
- Who handles Incidents when they arrive, and what does good handling look like as measured by impact on user satisfaction? What are the expected service levels and resolution times, and how do you ensure performance is satisfactory?
- How do you ensure quick, consistent resolution of Incidents and keep Incidents from getting lost?
- How can you organize around Incidents in a way that fosters the productivity of both users and IT analysts?
- How do you minimize the impact of Incidents on service quality, either by preventing them in the first place or minimizing their impact when they do occur?
- Who are the stakeholders of Incident Management? What is their stake?
The value of Incident Management should drive all further discussions and decisions on scope, priority, resources allocated to, and automation of the Incident Management process with Service Manager.
Reporting is a means of understanding and managing the performance of the Incident Management process. Although Service Manager includes out-of-the-box reporting functionality for Incident Management, you can look to MOF and ITIL for further guidance on what to report, when, and why, including the KPIs that are important to Incident Management. These KPIs include first-call fix rate, the number of Incidents raised as a result of a Change, the number of escalated Incidents, the number of Incidents not meeting SLA targets, and the ratio of Operations Manager alerts to Incident tickets.
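To make the KPIs listed above concrete, the following sketch computes them from a list of Incident records. It is an illustration under stated assumptions: the `IncidentRecord` fields and the `incident_kpis` function are hypothetical names, not part of Service Manager's data model or reporting API, and the alert count is assumed to be supplied separately (for example, exported from Operations Manager).

```python
from dataclasses import dataclass


@dataclass
class IncidentRecord:
    """Hypothetical, simplified view of one Incident for reporting."""
    first_call_fix: bool     # resolved during the first contact
    caused_by_change: bool   # raised as a result of a Change
    escalated: bool          # functionally or hierarchically escalated
    met_sla: bool            # resolved within the SLA target


def incident_kpis(incidents: list[IncidentRecord], alert_count: int) -> dict:
    """Compute the example KPIs for a reporting period."""
    total = len(incidents)
    if total == 0:
        raise ValueError("at least one Incident is required")
    return {
        "first_call_fix_rate": sum(i.first_call_fix for i in incidents) / total,
        "incidents_from_change": sum(i.caused_by_change for i in incidents),
        "escalated_incidents": sum(i.escalated for i in incidents),
        "sla_breaches": sum(not i.met_sla for i in incidents),
        # Alerts received per Incident ticket raised in the same period.
        "alert_to_incident_ratio": alert_count / total,
    }


if __name__ == "__main__":
    sample = [
        IncidentRecord(True, False, False, True),
        IncidentRecord(True, True, False, True),
        IncidentRecord(False, False, True, False),
        IncidentRecord(False, True, True, True),
    ]
    for name, value in incident_kpis(sample, alert_count=10).items():
        print(f"{name}: {value}")
```

Rates are returned as fractions and counts as integers so that a report can format them as needed.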
Figure 3.1 shows the activities in the Incident Management process.
Figure 3.1 Incident Management process activities.
Incident Management roles include the following:
- The incident manager, who owns the results of the Incident Management process
- The service desk manager, who owns the results of the service desk function
- IT managers and analysts in first-, second-, and third-tier support groups, including specialist support groups and external suppliers
- The problem manager, for major Incident handling
Table 3.6 shows inputs and outputs of the Incident Management process.
Table 3.6. Inputs and Outputs in Incident Management
Input | Output |
Incident details (from the service desk, networks, or computer operations) | Updated Incident records, including resolution/workarounds. |
Configuration details from the CMDB | RFC for Incident resolution. |
Response from Incident matching against Problems and Known Errors | Updated, resolved, and closed Incidents. Communication to customers. |
Resolution details | Management information (reports): service reports, incident statistics, audit reports. |
Response or result of RFC to effect resolution for Incident(s) | Updated, resolved, and closed Incidents. Communication to customers. |
The following key questions must be answered to drive decisions when implementing the Incident Management process with Service Manager:
- What is the value of Incident Management to the business?
- Which Incidents are within scope for the process, and what target resolution times have you identified?
- What values should be assigned to Incident record fields/drop-down (enumeration) list values?
- What are your Incident escalation procedures, and how do they relate to the Escalated tick box and Assigned to field in the Incident form?
- What Incident prioritization scheme will you use?
- How will you use the Incident process in conjunction with Problem, Change, and Configuration Management? What are the expected interfaces?
- What roles and responsibilities will be assigned for the Incident Management process, and to whom?
- Will auto-ticketing be used (for example, for events trapped by Configuration Manager's Desired Configuration Management [DCM] or Operations Manager alerts)?
- What requirements do you have for automatic escalation or flexible routing of Incidents?
- Will the Self-Service portal and email ticketing be used to reduce inbound call volume?
- What requirements do you have for automated, rule-based Incident notification?
- Which metrics will you track, and which reports will you produce as a basis for managing performance? Will custom reports be required?
- Who needs to be informed and when throughout the life cycle of an Incident?
- What role will announcements and knowledge articles play in the Incident Management process?