Incident management is a key component of your business's success, and it never hurts to go back to basics. Successful incident management starts with a process: a repeatable sequence of steps and procedures. Such a process may include four broad categories of steps: detection, diagnosis, repair, and recovery.
1 – Detection
Identification: Problems can be identified with different tools. End-user experience tools mimic user behavior and reveal problems from the user’s point of view, such as slow response times and service unavailability. Domain-specific tools detect problems within specific environments or applications, such as a database or an ERP system.
Users, for their part, can help you detect unknown problems that infrastructure and user-behavior monitoring tools do not report.
So which method should you use? It depends on your environment, but a combination of several methods and tools is usually the best solution, because no single tool will detect every problem.
Logging events is an important component of this procedure, as it lets you trace them at any point and use that history to improve your process.
Classification of events lets you categorize data for reporting and analysis purposes, so you know whether an event relates to hardware, software, service, etc. It is recommended to have no more than 5 levels of classification; otherwise it can get very confusing. You can start the top level with something like Hardware / Software / Service, or Problem / Service request.
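As a rough illustration of the classification guideline above, here is a minimal Python sketch that builds a classification path and rejects anything deeper than five levels. The function name classify, the constant MAX_LEVELS, and the sample categories are assumptions made for this example, not part of any particular ticketing tool.

```python
MAX_LEVELS = 5  # depth limit recommended in the text above


def classify(*levels: str) -> str:
    """Build a classification path such as 'Hardware / Server / Disk'."""
    # Drop empty entries and surrounding whitespace.
    cleaned = [level.strip() for level in levels if level.strip()]
    if not cleaned:
        raise ValueError("at least one classification level is required")
    if len(cleaned) > MAX_LEVELS:
        raise ValueError(f"too many levels: {len(cleaned)} > {MAX_LEVELS}")
    return " / ".join(cleaned)
```

A call such as classify("Hardware", "Server", "Disk") yields a single string that can be stored with the event and later used for grouping in reports.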
Prioritization lets you determine the order in which events should be handled and how to assign your resources. Prioritizing events deserves a longer discussion, but at a minimum you need to consider impact, urgency, and risk. Consider the impact critical when a large group of users is unable to use a specific service. Consider the urgency high when the affected service is critical and any downtime affects the business itself. The third factor, risk, applies when the incident has not yet occurred but is very likely to, for example when the data center’s temperature is rising quickly because of an air conditioning malfunction. A data center crash would take countless services down, so in this case the risk is enormous and the incident should be handled at the highest priority.
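One possible way to turn impact, urgency, and risk into a handling order is sketched below in Python. The three-step scale, the 1 to 5 priority range, and the rule that high risk always wins are assumptions chosen for illustration; a real scheme should follow your own service agreements.

```python
from enum import IntEnum


class Level(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def priority(impact: Level, urgency: Level, risk: Level = Level.LOW) -> int:
    """Return a priority from 1 (handle first) to 5 (handle last)."""
    # Assumption: a likely-but-not-yet-occurred incident (high risk)
    # is treated as top priority regardless of current impact.
    if risk is Level.HIGH:
        return 1
    score = impact + urgency  # ranges from 2 to 6
    return 7 - score          # 6 -> priority 1, 2 -> priority 5
```

Under these assumptions, the overheating data center from the example above, priority(Level.LOW, Level.LOW, Level.HIGH), gets priority 1 even though no user is affected yet.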
2 – Diagnosis
Diagnosis is where you figure out the source of the problem and how it can be fixed. This stage includes investigation and escalation.
Investigation is probably one of the most difficult parts of the process. For more straightforward problems, runbook procedures can accelerate an investigation, as they outline troubleshooting steps in a methodical way.
Following a runbook by hand, however, can be very time consuming and can lengthen the recovery time considerably. Instead, consider automating the diagnostic steps with runbook automation software. If you design the flow carefully and account for all the steps that lead to a conclusion, automated diagnostics will give you quick answers and help you decide on your next step.
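To make the idea of an automated diagnostic flow more concrete, here is a minimal Python sketch, not tied to any specific runbook automation product. It runs a list of checks in order and stops at the first one that fails, returning the suggested next action. The Step type, the run_diagnostics function, and the checks in the usage note are all hypothetical.

```python
from typing import Callable, List, NamedTuple, Optional


class Step(NamedTuple):
    name: str
    check: Callable[[], bool]  # returns True when the check passes
    next_action: str           # what to do if the check fails


def run_diagnostics(steps: List[Step]) -> Optional[Step]:
    """Run checks in order; return the first failing step, or None."""
    for step in steps:
        if not step.check():
            return step
    return None
```

In practice each check would be a function that pings a host, queries a database, or inspects disk usage. The flow would be assembled as a list such as [Step("host reachable", ping_host, "escalate to network team"), ...], where ping_host stands in for whatever check your environment needs.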
Escalation procedures are needed in cases when the incident needs to be resolved by a higher support level.
3 – Repair
The repair step fixes the problem. This is sometimes a gradual process, in which a temporary fix or workaround is implemented first to bring the service back quickly. An incident repair may involve anything from a service restart to a hardware replacement or even a complex software code change. Note that fixing the current incident does not mean the issue won’t recur; preventing recurrence is covered in the recovery step.
In this case too, straightforward repairs such as a service restart, a disk cleanup and others can be automated.
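A simple way to picture automated repair is a lookup from a diagnosed problem to a repair action, with everything else handed to a person. The Python sketch below assumes hypothetical problem names and placeholder actions that only return a message; real actions would restart the service or free disk space.

```python
from typing import Callable, Dict


# Hypothetical placeholder actions for illustration only.
def restart_service() -> str:
    return "service restarted"


def clean_disk() -> str:
    return "temporary files removed"


AUTOMATED_REPAIRS: Dict[str, Callable[[], str]] = {
    "service_down": restart_service,
    "disk_full": clean_disk,
}


def repair(problem: str) -> str:
    """Run the automated repair for a known problem, else hand off."""
    action = AUTOMATED_REPAIRS.get(problem)
    if action is None:
        return "manual repair required"
    return action()
```

Keeping the mapping in one table makes it easy to see which repairs are automated and to add new ones as they prove safe to run unattended.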
4 – Recovery
The recovery phase involves two parts: closure and prevention.
Closure means following up on any notifications previously sent to users about the problem, and on any escalation alerts, this time to announce that the problem has been resolved. Closure also entails formally closing the problem in your logging system.
Prevention covers the actions you take, where possible, to keep a single incident from recurring and thereby becoming a problem. Implement two important tools to help you in this task:
RCA (Root Cause Analysis) process: The purpose of the RCA process is to identify the root cause that led to the service downtime. Note that the RCA should be performed by the service owners, who are not necessarily the people who resolved the specific incident. This is another reason why incident logging is so important: the information in the ticket is crucial to the investigation.
The last stage is similar to the “follow up” stage in customer service. Monitor your resolution to make sure it holds and that everything is working as expected. Follow-up is very important to your customer service success: call the customer, confirm that everything is working, and ask whether they need any assistance or have any questions.
All of these different steps will help your business gather necessary data and help you determine what’s working and what isn’t. Effective Incident Management helps your business improve and provide excellent service.