Problem management is one aspect of ITIL implementation that is a headache for many organizations. The difficulty lies in the similarity between incident management and problem management. The two processes are so closely aligned that differentiating their activities can be difficult for ITIL neophytes. The aim of incident management is to restore service to the user as quickly as possible, often through a workaround. Finding a permanent solution to the underlying cause is the aim of problem management.
Problem management has reactive and proactive aspects:
Reactive – solving problems after one or more incidents have occurred.
Proactive – identifying and solving problems and known errors before incidents occur in the first place.
The Problem Management process works in conjunction with Incident and Change Management to provide value to the business in a variety of ways. The primary goal of Problem Management is to minimize the impact of problems on the business and prevent recurrence. When successful, downtime and disruptions are reduced. Additional benefits include:
- Increased service availability
- Improved service quality
- Decreased Problem resolution time
- Reduction of the number of Incidents
- Increased productivity
- Reduced costs
- Improved customer satisfaction
Adopting and implementing ITIL processes and technology can reduce the chaos that IT organizations face amid a rapidly changing technology landscape. Although Problem Management is its own process, it depends on an effective Incident Management process and on the proper tools. Those tools should provide a common interface, access to available knowledge, configuration management information and interaction with other related ITIL processes.
The ITIL problem management process has many steps, and each is vitally important to the success of the process and the quality of service delivered.
1) Problem Detection
Problems can be detected in a variety of ways, including an Incident report, ongoing Incident analysis, automated detection by an event management tool, or supplier notification. A Problem is commonly detected when the cause of one or more Incidents reported to the service desk is unknown. The service desk may have resolved the Incident but, being unsure of the underlying root cause, expects that it may occur again and therefore creates a Problem record. In other cases, it may be clear to the service desk that a reported Incident is associated with a Problem. If that Problem has already been recorded (a Known Problem), the Incident can be linked to the existing Problem record. If the Problem has not been recorded, a Problem record should be created immediately to help assure service performance.
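The link-or-create decision described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `ProblemRegister`, `ProblemRecord`, the `PRB-` identifier format and matching on a normalized symptom string are all hypothetical, and a real ITSM tool would match Incidents to Problems in its own way.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ProblemRecord:
    problem_id: str
    symptom: str  # hypothetical matching key (normalized symptom text)
    incident_ids: List[str] = field(default_factory=list)


class ProblemRegister:
    """In-memory stand-in for the Problem records held in an ITSM tool."""

    def __init__(self) -> None:
        self._by_symptom: Dict[str, ProblemRecord] = {}

    def detect(self, incident_id: str, symptom: str) -> ProblemRecord:
        """Link the Incident to a known Problem, or open a new Problem record."""
        key = symptom.strip().lower()
        problem = self._by_symptom.get(key)
        if problem is None:
            # No known Problem for this symptom: create the record immediately.
            problem = ProblemRecord(f"PRB-{len(self._by_symptom) + 1:04d}", key)
            self._by_symptom[key] = problem
        if incident_id not in problem.incident_ids:
            problem.incident_ids.append(incident_id)
        return problem
```

Calling `detect` twice with the same symptom returns the same record with both Incidents attached, which is the "Known Problem" case.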
2) Problem Logging
To maintain a complete historical record, all Problems, regardless of how they were identified and reported to the service desk, must be logged with all relevant details, including date/time, user information, description, related Configuration Item from the CMDB, associated Incidents, resolution details and closure information.
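As a rough sketch, the fields listed above could be captured in a record like the following. The class name `ProblemLogEntry`, the field names and the `missing_fields` helper are assumptions made for illustration and do not reflect the schema of any particular tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class ProblemLogEntry:
    description: str
    reported_by: str          # user information
    configuration_item: str   # identifier of the related CI in the CMDB
    logged_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
    incident_ids: List[str] = field(default_factory=list)
    resolution: Optional[str] = None       # filled in once resolved
    closed_at: Optional[datetime] = None   # filled in at closure

    def missing_fields(self) -> List[str]:
        """Return the names of required text fields that are still empty."""
        required = {
            "description": self.description,
            "reported_by": self.reported_by,
            "configuration_item": self.configuration_item,
        }
        return [name for name, value in required.items() if not value.strip()]
```

A check such as `missing_fields()` is one way to reject incomplete records at logging time, so that the log is still usable when the Problem is reviewed later.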
Categorization – Once the Problem is logged, all appropriate categories must be selected so that it can be properly assigned and escalated, and so that Problem frequencies and trends can be monitored.
Prioritization – Assigning a priority is critical in determining how and when staff will handle the Problem. Priority is determined by impact and urgency. Impact can be gauged from the number of associated Incidents, which gives insight into the number of affected users or the effect on the business. Urgency reflects how quickly a resolution is required.
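One common way to combine impact and urgency is a simple matrix. The sketch below assumes three levels for each and a five-step priority scale; the incident-count thresholds in `impact_from_incidents` are purely illustrative, and each organization would set its own.

```python
from enum import IntEnum


class Level(IntEnum):
    HIGH = 1
    MEDIUM = 2
    LOW = 3


def impact_from_incidents(incident_count: int) -> Level:
    """Estimate impact from the number of linked Incidents (example thresholds)."""
    if incident_count >= 10:
        return Level.HIGH
    if incident_count >= 3:
        return Level.MEDIUM
    return Level.LOW


def priority(impact: Level, urgency: Level) -> int:
    """Return a priority from 1 (handle first) to 5 (handle last)."""
    # Equivalent to a 3x3 impact/urgency matrix: HIGH/HIGH -> 1, LOW/LOW -> 5.
    return int(impact) + int(urgency) - 1
```

For example, a Problem with four linked Incidents (medium impact under these thresholds) and high urgency gets priority 2.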
3) Investigation and Diagnosis
An investigation into the root cause of the Problem will take place, with its speed and depth based on the impact, severity and urgency of the Problem in question. Common investigation techniques include reviewing the Known Error Database (KEDB) to find matching Problems and resolutions, and recreating the failure to determine the cause.
4) Workaround
In some situations, it is possible to provide a temporary fix or workaround to the user experiencing the Incident related to the Problem. However, it’s important to keep seeking a permanent resolution to the underlying error detected by Problem Management.
5) Create Known Error Record
Once the investigation and diagnosis are complete, it’s important to create a Known Error record. If future Incidents or Problems arise, the investigating service desk technician can identify them and provide a resolution more quickly by using the Known Error Database (KEDB) and the associated workaround(s).
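A minimal sketch of how a Known Error record and a KEDB lookup might fit together is shown below. The `KnownError` fields and the keyword-matching rule are assumptions chosen for brevity; real KEDB tools usually offer much richer search.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class KnownError:
    error_id: str
    keywords: Tuple[str, ...]  # hypothetical lower-case search keywords
    root_cause: str
    workaround: str


class KnownErrorDatabase:
    """Toy in-memory KEDB with naive keyword matching."""

    def __init__(self) -> None:
        self._records: List[KnownError] = []

    def add(self, record: KnownError) -> None:
        self._records.append(record)

    def search(self, incident_description: str) -> List[KnownError]:
        """Return records whose keywords all appear in the description."""
        words = set(incident_description.lower().split())
        return [r for r in self._records if set(r.keywords) <= words]
```

A technician handling a new Incident would call `search` with the Incident description and, if a record comes back, apply its `workaround` rather than diagnosing from scratch. Note that this naive matcher splits on whitespace only, so punctuation attached to a word would prevent a match.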
6) Resolution
Once a resolution has been identified, it can be implemented using the standard change procedure and tested to confirm service recovery. However, if a normal change is required, an associated Request For Change (RFC) must be raised and approved before the resolution is applied to the Problem.
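The gate between a standard change and a normal change can be expressed as a small check. The names `ChangeType`, `Fix` and `can_implement` are hypothetical, and the sketch deliberately ignores other change types such as emergency changes.

```python
from dataclasses import dataclass
from enum import Enum


class ChangeType(Enum):
    STANDARD = "standard"  # follows the standard change procedure
    NORMAL = "normal"      # requires an approved RFC first


@dataclass
class Fix:
    summary: str
    change_type: ChangeType
    rfc_approved: bool = False


def can_implement(fix: Fix) -> bool:
    """Return True when the fix may be applied to the Problem."""
    if fix.change_type is ChangeType.STANDARD:
        return True
    # A normal change is blocked until its RFC has been approved.
    return fix.rfc_approved
```

After implementation, the fix should still be tested to confirm service recovery; that step is outside the scope of this sketch.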
7) Closure
Following confirmation that the Error has been resolved, the Problem and any associated Incidents can be closed. Before closing, the service desk technician should check that the initial classification details are accurate, since they will be used for future reference and reporting.
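Closure can be sketched as a single operation that refuses to run until the fix is confirmed, corrects the classification, and then closes the linked Incidents along with the Problem. The `Problem` and `Incident` classes and their status strings are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Incident:
    incident_id: str
    status: str = "open"


@dataclass
class Problem:
    problem_id: str
    category: str
    incidents: List[Incident] = field(default_factory=list)
    status: str = "open"


def close_problem(problem: Problem, fix_confirmed: bool, confirmed_category: str) -> bool:
    """Close the Problem and its Incidents once the fix has been confirmed."""
    if not fix_confirmed:
        return False  # nothing is closed until recovery is confirmed
    # Record the verified classification for future reporting.
    problem.category = confirmed_category
    for incident in problem.incidents:
        incident.status = "closed"
    problem.status = "closed"
    return True
```

Passing the confirmed category explicitly forces the technician to revisit the initial classification at closing time instead of leaving it unchecked.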
8) Problem Review
This is also known as a major problem review. The major problem review is an organizational activity that helps prevent future problems. During the review, the problem management team evaluates the problem documentation and identifies what happened and why. Lessons learned, such as process bottlenecks, what went wrong, and what helped, should be discussed. This is where a complete problem log helps, because it is far more reliable than trying to pull the details from memory. The review should result in improved processes, staff training, or more complete documentation.
Problem management is integral to long-term service delivery success and should not be ignored when designing a robust IT service, whether that service is internally or externally facing.