Problem Management
Problem Management refers to the process by which Problems, which are the root cause of one or more Incidents, are identified and by which a workaround or a permanent fix is found, enabling the organization to reduce the number and impact of Incidents over time. Chapter 11 covers the Problem Management process in detail.
The goal of Problem Management is twofold—reactive and proactive:
- Being reactive minimizes the adverse effect on the business of Incidents and Problems caused by errors in the infrastructure, including supporting Incident Management, identifying and diagnosing Problems, escalating Problems, and monitoring Known Errors through the Change process.
- Being proactive preempts the occurrence of Incidents and Problems, including identifying potential Problems, initiating Change so that Problems don't (re)occur, and tracking Problems and analyzing trends.
Here are the objectives of Problem Management:
- Minimize the negative impact of problems on the business
- Identify and correct the root cause of problems
Table 3.7 lists the key terminology of Problem Management.
Table 3.7. Key Terminology in Problem Management
Term | Explanation
Incident | Any event that is not part of the standard operation of a service and that causes, or may cause, an interruption to, or a reduction in, the quality of that service.
Problem | An unknown error, the underlying cause of one or more Incidents.
Known Error | A Problem for which a root cause and permanent fix or workaround are identified but where the fix has not been implemented.
Work-around | A temporary fix or technique that eliminates the customer's reliance on the faulty service component.
It is important to understand that a Problem is not the same as an Incident. A Problem is the root cause of one or more Incidents. Problems are unknown errors; once the cause is known, they are flagged as Known Errors.
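The relationship between these terms can be sketched as a small data model. This is only an illustrative Python sketch, not Service Manager's actual schema: the class and field names (`Incident`, `Problem`, `root_cause`, `permanent_fix`, `fix_implemented`) are hypothetical, and the Known Error rule follows the definition in Table 3.7.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Incident:
    """A single interruption to, or reduction in quality of, a service."""
    incident_id: str
    summary: str


@dataclass
class Problem:
    """The underlying cause of one or more Incidents (hypothetical model)."""
    problem_id: str
    incidents: List[Incident] = field(default_factory=list)
    root_cause: Optional[str] = None      # unknown until diagnosed
    workaround: Optional[str] = None      # temporary technique, if any
    permanent_fix: Optional[str] = None   # identified fix, if any
    fix_implemented: bool = False

    @property
    def is_known_error(self) -> bool:
        # Table 3.7: root cause identified, a permanent fix or workaround
        # identified, and the fix not yet implemented.
        has_remedy = self.workaround is not None or self.permanent_fix is not None
        return self.root_cause is not None and has_remedy and not self.fix_implemented
```

A Problem created from several related Incidents starts with `is_known_error` false. It becomes a Known Error once both a root cause and a workaround (or fix) are recorded, and stops being one when the fix is implemented.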
Similarly, the Problem Management process is related to but distinct from the Incident Management process. MOF and ITIL recommend against combining Incident and Problem Management in the same function because the two have conflicting interests. The imperative of Incident Management is to get the service and the user back up and running, whatever it takes; it is not to go after the root cause of multiple Incidents. Problem Management, on the other hand, zooms out and focuses on the root cause of multiple Incidents, seeking to eliminate or minimize their negative business impact by going after that root cause and by sharing information about Problems, Known Errors, and workarounds. These imperatives conflict because, for example, eliminating a Problem typically results in a lower first-call closure rate: the whole set of Incidents caused by that Problem, many of which the service desk could close on first contact, no longer occurs.
Problem Management helps IT professionals, teams, and organizations achieve a critical outcome: getting to the root cause of Problems and knowing and articulating what the top Problems are, what has been done to advance them, and what will be done next.
Because resources must be allocated to the Problem Management process, the value of that process to the business has to be determined so that the allocation can be justified. To determine the value an organization places on Problem Management, consider these questions:
- What mechanisms are in place to reduce the impact of chronic Problems on service availability and reliability?
- What mechanisms are in place to reduce Incident volume and resolution time and the negative impact on the business for related sets of Incidents?
- What mechanisms are in place to reduce Change volume as driven by the need to address chronic Problems and the associated negative impact on the business?
- How can you achieve a good balance between reactive root-cause analysis efforts and proactive efforts to preempt Problems in the first place?
- How can service availability be guaranteed if there are outstanding Problems left unresolved? What are the potential issues with having Problems bypassed and not actually resolved?
- How can you help ensure that the time spent on investigation and diagnosis of multiple related Incidents and their root cause is as productive as possible for IT analysts?
- What mechanisms are in place to permanently solve chronic or recurring issues and reduce their number and negative impact on the business over time?
- What mechanisms are in place to ensure organizational learning occurs so that each day isn't another "Groundhog Day"? What means do you have to feed the organization with historical data to identify trends, root-cause resolutions, and workarounds to prevent and reduce Problems?
- What means do you have of ensuring quicker, more consistent Incident resolution and a better first-time fix rate at the service desk? Who is going to make sure the workarounds, Known Error records, and knowledge articles required for this exist, are current, and are made available to the service desk when it makes a difference for them?
- Who are the stakeholders of Problem Management? What is their stake?
The value of Problem Management should drive all further discussions and decisions on the scope, priority, resourcing, and automation of the Problem Management process with Service Manager.
Reporting is a means of understanding and managing the performance of the Problem Management process. Although Service Manager includes out-of-the-box reporting functionality for Problem Management, you can look to MOF and ITIL for further guidance on what to report, when, and why. This can include the top Problems, what has been done to advance them so far, and what will be done next; the percentage reduction of repeat Incidents; and the percentage reduction in missed SLA targets that are attributable to Problems.
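The percentage-reduction figures mentioned above are simple period-over-period calculations. The following minimal Python sketch shows the arithmetic only. The function name and the sample counts are hypothetical, and in practice the counts would come from your own reporting data rather than hard-coded values.

```python
def percentage_reduction(previous: float, current: float) -> float:
    """Return the percentage reduction from `previous` to `current`.

    A positive result means the count went down; a negative result
    means it went up.
    """
    if previous <= 0:
        raise ValueError("previous must be positive to compute a reduction")
    return (previous - current) / previous * 100.0


# Hypothetical counts for two consecutive reporting periods.
repeat_incident_reduction = percentage_reduction(previous=120, current=90)
sla_miss_reduction = percentage_reduction(previous=40, current=30)

print(f"Repeat Incidents reduced by {repeat_incident_reduction:.1f}%")
print(f"Problem-related SLA misses reduced by {sla_miss_reduction:.1f}%")
```

Reporting the reduction as a percentage rather than a raw count makes periods with different Incident volumes comparable.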
Figure 3.2 shows the activities in the Problem Management process.
Figure 3.2 Problem Management process activities.
It is vital to identify who does what relative to Problem Management. Otherwise, there is no ownership, no one who can be held accountable for its results, and no or unclear responsibility for carrying out the process activities. Problem Management roles that you should make sure are identified and assigned in your organization include the following:
- The problem manager, who owns the results of the Problem Management process, assigns Problems, and handles major Problems
- Support groups, which are second- and third-line support groups, specialist support groups, and external suppliers who own normal Problems, progress Problems through resolution, assign Problems to resolver groups, create teams to resolve Problems, and monitor and track Problems and ensure resolution
- Problem resolvers, who are the IT analysts who investigate and diagnose Problems
Table 3.8 shows Problem Management inputs and outputs.
Table 3.8. Inputs and Outputs in Problem Management
Input | Output
Incident details | Known Errors
Configuration details | Fix/Workaround
Failed Change details | Management reports
Defined workarounds | Major Problem reviews
Potential Problem reported | RFC
Trends reported | Updated and closed Problem and Known Error records
Annual survey results | Improvements for procedures, documentation, training needs
Anecdotal evidence from users |
Several key questions must be answered to drive decisions when implementing the Problem Management process with Service Manager. Without answers to these questions, you and the organization will get stuck somewhere along the way (in design, deployment, or operation of Service Manager) as you seek to accomplish the objectives set out by the business:
- What is the value of Problem Management to the business?
- Which Problems are within scope for the process, and what target resolution times have you identified?
- What values should be assigned to Problem record fields/drop-down (enumeration) list values?
- What are your policy and procedure for ticking the Known Error box in the Problem Form, which indicates that the root cause of the Problem has been identified? In other words, who can do this, and what else do they need to do in conjunction with it (for example, update knowledge articles, make announcements, and close related tickets)?
- What provisions have you made to ensure that bugs that come out of development are transferred into Service Manager as Problems and Known Errors along with any associated workarounds and knowledge articles when the systems are moved to production?
- What problem prioritization scheme will you use?
- How will you use the Problem process in conjunction with Incident, Change, and Configuration Management? What are the expected interfaces?
- What roles and responsibilities will be assigned for the Problem Management process, and to whom?
- What requirements do you have for correlation of multiple Incidents to Problems and related Workarounds, Changes, and knowledge articles, and what procedures will you adopt for resolving and closing Incidents when the related Problem is resolved or closed?
- Which metrics will you track, and which reports will you produce as a basis for managing performance? Will custom reports be required?
- What provisions have you made to ensure that a post-implementation review is made after major Problems?
- Who needs to be informed and when throughout the life cycle of a Problem?
- What role will announcements and knowledge articles play in the Problem Management process?
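To make the prioritization question above concrete, here is a minimal Python sketch of one possible scheme: a priority derived from impact and urgency. This is an assumption for illustration, not a scheme prescribed by this chapter or built into Service Manager. The `Level` enumeration, the `problem_priority` function, and the 1-to-5 scale are all hypothetical, and your organization's own values would replace them.

```python
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical three-step scale used for both impact and urgency."""
    HIGH = 1
    MEDIUM = 2
    LOW = 3


def problem_priority(impact: Level, urgency: Level) -> int:
    """Map impact and urgency to a priority from 1 (highest) to 5 (lowest).

    Assumed rule: priority = impact + urgency - 1, so HIGH/HIGH gives 1
    and LOW/LOW gives 5.
    """
    return int(impact) + int(urgency) - 1


# Example: a Problem with high impact but low urgency.
print(problem_priority(Level.HIGH, Level.LOW))
```

Whatever scheme you choose, the resulting values also need to be reflected in the Problem record's drop-down (enumeration) lists so that analysts classify Problems consistently.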