What is Incident Management?
In today’s digital world, technology has become the focal point of business performance and customer satisfaction across industries of all sorts. Because of the increased complexities within infrastructure environments and the additional abstractions that are layered within applications and services, the need for a centralized incident management platform has never been greater. But what is incident management in the context of technology?
Incident management is the end-to-end business process of addressing an outage, service disruption, or other major incident from its initial detection to its full resolution. While this definition may sound simple, the lifecycle management process itself is extremely complex. It involves cross-team collaboration, disparate technologies, and distributed systems, all of which must work together to resolve the incident efficiently without risking customer experience, brand reputation, and most importantly, the bottom line of the business.
While the process of incident management can grow to be quite complex, you can break down the stages into these seven main categories:
- Incident identification
- Incident logging
- Incident categorization
- Incident prioritization
- Incident assignment
- Task creation and management
- Incident response
  - Diagnosis
  - Escalation
  - Investigation
  - Resolution and recovery
  - Postmortem
We will return to the process of managing an incident later. But first, let’s discuss what exactly an incident is, and what it isn’t.
Types of Incidents
Incidents occurring within a given IT environment can be categorized and defined in numerous ways. Some incidents are defined by severity or business impact, while others are defined by the root cause of the outage. For example, an incident can be as simple as network latency due to high traffic, or as complex as a container failure in a mission-critical, customer-facing application, which can cause widespread outages across a customer base.
In many businesses, incidents are defined by their severity or priority level, and the labels will often look like:
- Sev1
- Sev2
- Sev3
- P1
- P2
- P3
The Incident Management Process
Step 1: Identifying an Incident
It may sound obvious, but the first step in managing an incident is to identify it. To do this, you must determine what defines an incident for your team. An incident is when your service experiences an unplanned interruption or reduction in quality. Since each company is different, as are its infrastructure and applications, it’s important to consider the specific types of incidents you might run into. For example, if your primary service is an online shop, a possible incident could be slower page speeds caused by increased site traffic, perhaps during a big sale.
Step 2: Logging an Incident
Once an incident has been identified, the next step is to correctly log and track the incident. This will typically be done by your service desk. Incidents are logged as tickets, which should include the following information:
- User’s name and contact information
- Description of the incident
- Date and time of the incident (needed for SLA clearance)
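As a rough illustration, the ticket fields listed above could be captured in a small record type. This is only a sketch: the class name `IncidentTicket`, its field names, and the sample values are assumptions made for the example, not part of any particular service desk tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class IncidentTicket:
    """A logged incident holding the three fields listed above (hypothetical shape)."""
    user_name: str
    user_contact: str
    description: str
    # Date and time of the incident; defaults to the logging time in UTC.
    occurred_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


# Example ticket with an explicit timestamp (sample data, not from a real system).
ticket = IncidentTicket(
    user_name="Jane Doe",
    user_contact="jane@example.com",
    description="Checkout page takes over 10 seconds to load",
    occurred_at=datetime(2024, 1, 15, 9, 30, tzinfo=timezone.utc),
)
print(ticket.occurred_at.isoformat())
```

Storing the timestamp as a timezone-aware value keeps later SLA calculations unambiguous when teams work across regions.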
Step 3: Categorizing an Incident
Once an incident is logged, it must then be categorized. This is extremely important, and every incident should be assigned at least one category (such as “Network”) and subcategory (such as “Network Outage”). This will allow your service desk to easily sort through all incidents based on their categories and subcategories rather than having to sift through a sea of uncategorized tickets. We’ve all been there, and it’s not a fun place to be. Proper categorization of incidents can also help to show patterns, track how many times similar incidents occur, and diagnose larger problems and areas that may require additional training. For example, if you continuously run into speed issues, it may be time to discuss upgrading your infrastructure.
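To make the pattern-spotting idea concrete, here is a minimal sketch that counts tickets per category and subcategory and reports the ones that keep recurring. The sample tickets, the function name `recurring_issues`, and the threshold of two are all assumptions chosen for illustration.

```python
from collections import Counter

# Hypothetical categorized tickets as (category, subcategory) pairs.
tickets = [
    ("Network", "Network Outage"),
    ("Network", "Slow Connection"),
    ("Application", "Slow Page Load"),
    ("Network", "Network Outage"),
    ("Application", "Slow Page Load"),
    ("Application", "Slow Page Load"),
]


def recurring_issues(categorized, threshold=2):
    """Return ((category, subcategory), count) pairs seen at least
    `threshold` times, most frequent first."""
    counts = Counter(categorized)
    return [(pair, n) for pair, n in counts.most_common() if n >= threshold]


recurring = recurring_issues(tickets)
for (category, subcategory), n in recurring:
    print(f"{category} / {subcategory}: {n} incidents")
```

A report like this is one simple way to notice that, say, slow page loads keep coming back and may point to an infrastructure problem rather than a one-off.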
Step 4: Prioritizing Your Incidents
As with any task or to-do list, prioritization is key. Prioritizing incidents based on their severity will clearly point to major incidents that need to be solved right away, and minor incidents whose necessary resolution time is much more flexible. An incident’s priority and urgency will be based on the level of impact to users and their ability to use the service. With all incidents categorized, your team can automate how specific incident categories and subcategories should be prioritized.
Incidents are typically prioritized as:
- Low-priority incidents: Users experience no interruption in service
- Medium-priority incidents: Some internal staff affected with little to no interruption for users
- High-priority incidents: Large number of users experience service interruption and reduction in quality. High-priority incidents often have negative financial impacts on the business.
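One way to automate prioritization by category is a small rule table that is consulted whenever a ticket is categorized. The sketch below assumes a hypothetical set of rules and category names; the three levels mirror the low, medium, and high priorities described above.

```python
from enum import IntEnum


class Priority(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Hypothetical rules keyed by (category, subcategory).
# A subcategory of None means "any subcategory in this category".
PRIORITY_RULES = {
    ("Network", "Network Outage"): Priority.HIGH,
    ("Network", None): Priority.MEDIUM,
    ("Internal Tools", None): Priority.MEDIUM,
}


def prioritize(category, subcategory=None, default=Priority.LOW):
    """Look up the most specific matching rule, then fall back to the
    category-wide rule, then to the default priority."""
    for key in ((category, subcategory), (category, None)):
        if key in PRIORITY_RULES:
            return PRIORITY_RULES[key]
    return default


print(prioritize("Network", "Network Outage").name)
print(prioritize("Network", "Slow Connection").name)
print(prioritize("Cosmetic").name)
```

Keeping the rules in data rather than in code means the service desk can adjust priorities without changing the lookup logic.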
Step 5: Responding to an Incident
Once an incident has been identified, logged, categorized, and prioritized, it’s time to respond to the incident. A typical incident response proceeds as follows:
- First, your service desk will need to make an initial diagnosis, where the issue is clearly described and troubleshooting questions are answered.
- Once the incident has been diagnosed, your service desk will determine whether or not an incident escalation is needed. An escalation is when there is advanced support needed to resolve an incident, in which case the incident will be assigned to the appropriate team.
- Next, the assigned team will investigate and diagnose the incident. This is typically done during a troubleshooting phase after confirming the initial incident hypothesis. Once a diagnosis has been made, your team will apply the needed fix, such as a software patch, change in settings, new hardware, etc.
- Finally, once an incident is fixed, your team can close the incident.
- Following the incident closure, your team should have an internal review meeting, and conduct any needed postmortems. At this point, you’ll also need to determine whether any public postmortem is needed.
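The response steps above can be sketched as a small state machine that only permits the transitions described. The stage names and the `advance` helper are assumptions for illustration, including the assumption that an incident may go straight from initial diagnosis to investigation when no escalation is needed.

```python
from enum import Enum


class Stage(Enum):
    INITIAL_DIAGNOSIS = "initial diagnosis"
    ESCALATION = "escalation"
    INVESTIGATION = "investigation"
    RESOLUTION = "resolution"
    CLOSED = "closed"
    POSTMORTEM = "postmortem"


# Allowed next stages for each stage (assumed from the steps above).
ALLOWED = {
    Stage.INITIAL_DIAGNOSIS: {Stage.ESCALATION, Stage.INVESTIGATION},
    Stage.ESCALATION: {Stage.INVESTIGATION},
    Stage.INVESTIGATION: {Stage.RESOLUTION},
    Stage.RESOLUTION: {Stage.CLOSED},
    Stage.CLOSED: {Stage.POSTMORTEM},
    Stage.POSTMORTEM: set(),
}


def advance(current, target):
    """Return `target` if the move is allowed, otherwise raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(
            f"cannot move from {current.value} to {target.value}"
        )
    return target


# Walk one escalated incident through the full lifecycle.
stage = Stage.INITIAL_DIAGNOSIS
for nxt in (Stage.ESCALATION, Stage.INVESTIGATION, Stage.RESOLUTION,
            Stage.CLOSED, Stage.POSTMORTEM):
    stage = advance(stage, nxt)
print(stage.value)
```

Rejecting invalid transitions, such as closing an incident that was never investigated, helps keep ticket histories trustworthy for later review.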
Don’t forget about incident communication with your users! It’s important to remember that while responding to an incident, your team is also in communication with its users as needed. Incident communication is essential to maintaining the trust of your users, as well as the credibility of your brand. Should an incident arise that impacts their ability to use the service without interruption, your team should immediately notify users (whether via email, social media, a designated page or plugin, etc.) of the incident. Let them know your team is on it and provide them with regular updates throughout the incident response process.
Once an incident has been closed, notify users of the incident, how it’s been resolved, and whether or not any additional steps are needed.
Roles
Every organization typically has its own custom roles and responsibilities, but below are some of the most common incident management roles:
- End user. This is the stakeholder who usually experiences the first sign of an outage or disruption and will flag it to initiate the incident management process.
- Tier 1 Service Desk. Typically the first point of contact when there is an incident ticket or request incoming.
- Tier 2 Service Desk. Comprised of technicians with primary knowledge around major incidents involving applications, infrastructure, and systems management.
- Tier 3 (and above) Service Desk. Specialist technicians who have advanced knowledge of extremely specific areas of the company’s infrastructure. Usually these professionals are brought in for complex maintenance and remediation.
- Incident Manager. A key stakeholder in the incident management process that drives the entirety of the lifecycle from diagnosis to resolution.
- Process Owner. This person typically moderates the incident lifecycle, analyzes the process, and points out areas of improvement to make the management lifecycle more efficient for teams.
But how does the process of incident management actually work? With PagerDuty, the process can be broken down into these four stages of management:
- Harness Data
- Make Sense of Data
- Respond & Engage Teams
- Analyze and Learn
Harness Digital Data
When incidents do inevitably occur, understanding the makeup of an incident and its root cause is critical to diagnosing, and eventually mitigating, the issue and saving time and money for your business. While there is no uniform identity to an incident, you can follow the breadcrumbs based on the type of outage you are seeing. For example, if there is a load balancing issue with one of your external applications, you may want to dig deeper into your container environment to better understand the issue. Aggregating all of the digital data surrounding the incident helps you uncover the root cause and is the first step in orchestrating a coordinated, holistic response.
PagerDuty’s ecosystem of 350+ integrations gives your teams a centralized view into your entire environment, so that data signals from any tool, webhook, system, or monitoring application have a single point of ingestion.
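The idea of a single point of ingestion can be illustrated with a generic sketch in which each tool's payload is converted into one common event shape. This is not PagerDuty's API: the tool names, payload fields, and the `ingest` function are all hypothetical and exist only to show the pattern.

```python
from typing import Callable, Dict


# Hypothetical converters: each turns one tool's payload into a common shape.
def from_metrics_tool(raw: dict) -> dict:
    return {
        "source": raw["host"],
        "summary": raw["alert_name"],
        "severity": raw["level"].lower(),
    }


def from_uptime_check(raw: dict) -> dict:
    return {
        "source": raw["url"],
        "summary": f"Check failed: {raw['reason']}",
        "severity": "critical",
    }


NORMALIZERS: Dict[str, Callable[[dict], dict]] = {
    "metrics_tool": from_metrics_tool,
    "uptime_check": from_uptime_check,
}


def ingest(tool: str, raw: dict) -> dict:
    """Single entry point: convert a registered tool's payload to the
    common event shape and record which tool it came from."""
    if tool not in NORMALIZERS:
        raise KeyError(f"no normalizer registered for {tool!r}")
    event = NORMALIZERS[tool](raw)
    event["tool"] = tool
    return event


event = ingest(
    "metrics_tool",
    {"host": "web-1", "alert_name": "High CPU", "level": "WARNING"},
)
print(event)
```

Once every signal shares one shape, downstream steps such as correlation and routing no longer need to know which tool produced it.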
Make Sense of the Data
With all of the data surrounding the incident in front of you, it’s nearly impossible to pinpoint the disruptive signal; doing so would be like searching for a needle in a haystack. To uncover the identity of the incident, you need the ability to aggregate and segment the data you are surveying, which paints a better picture of the incident makeup and turns the data into meaningful signals.
With so much data consistently flowing in and out of a given environment, being able to make sense of the data and create actionable paths to mitigation is key to resolving the issue before it starts to cascade across the rest of the business and your customer base. With PagerDuty’s collection of 10+ years of historical data, we are able to help aggregate, correlate, and connect similar incidents and events into a single instance in order to help orchestrate an efficient and collaborative response.
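As a simplified illustration of correlating similar events into a single instance, the sketch below groups events that share a source and check name and arrive close together in time. It is a toy heuristic, not a description of how PagerDuty correlates events; the event fields, the five-minute window, and the `correlate` function are assumptions.

```python
from datetime import datetime, timedelta


def correlate(events, window=timedelta(minutes=5)):
    """Group events that share (source, check) when each one arrives
    within `window` of the previous event in its group."""
    open_groups = {}  # key -> most recent group for that key
    incidents = []
    for event in sorted(events, key=lambda e: e["time"]):
        key = (event["source"], event["check"])
        group = open_groups.get(key)
        if group is not None and event["time"] - group[-1]["time"] <= window:
            group.append(event)
        else:
            # Too late to join the previous group (or no group yet):
            # start a new incident for this key.
            group = [event]
            open_groups[key] = group
            incidents.append(group)
    return incidents


t0 = datetime(2024, 1, 15, 9, 0)
events = [
    {"source": "web-1", "check": "latency", "time": t0},
    {"source": "web-1", "check": "latency", "time": t0 + timedelta(minutes=2)},
    {"source": "db-1", "check": "disk", "time": t0 + timedelta(minutes=3)},
    {"source": "web-1", "check": "latency", "time": t0 + timedelta(minutes=30)},
]
incidents = correlate(events)
print([len(group) for group in incidents])
```

Here the two latency alerts two minutes apart collapse into one incident, while the alert half an hour later starts a new one, so responders see three incidents instead of four separate alerts.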
Respond and Engage Teams
One of the most important functions of the incident management process is making sure the correct stakeholders and service owners are actively enabled and working to help mitigate the issue at hand. By looping in key stakeholders, teams can take a proactive approach to addressing and remediating the issue, as well as providing organizational visibility so teams are aware of the ongoing response.
By using PagerDuty, key stakeholders and responders can be informed in real time as an incident is happening, so the incident is routed to the right team, which can take immediate action before the issue becomes customer- or revenue-impacting.
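The routing idea can be sketched generically as a lookup from the affected service to its owning team, with stakeholders informed rather than paged. The service names, team names, responders, and the `route` function below are hypothetical and do not represent PagerDuty's configuration or API.

```python
# Hypothetical mapping from affected service to its owning team
# and that team's current on-call responder.
SERVICE_OWNERS = {
    "checkout": {"team": "payments", "on_call": "alice"},
    "search": {"team": "discovery", "on_call": "bob"},
}

# Used when no owner is registered for the affected service.
FALLBACK = {"team": "service-desk", "on_call": "tier1"}


def route(service, stakeholders=()):
    """Pick the team and responder for a service, and list the
    stakeholders who should be kept informed (deduplicated, sorted)."""
    owner = SERVICE_OWNERS.get(service, FALLBACK)
    return {
        "assigned_team": owner["team"],
        "page": owner["on_call"],
        "inform": sorted(set(stakeholders)),
    }


decision = route("checkout", stakeholders=["cto", "support-lead", "cto"])
print(decision)
```

Separating who is paged from who is merely informed keeps responders focused while still giving the rest of the organization visibility.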
Analyze and Learn
Once an incident is fully resolved, the postmortem stage is an important function of the incident lifecycle as it helps teams to better understand what happened and how they can prevent recurring incidents in the future. This enables teams to take a preventative approach to incident management and make sure, when things do inevitably happen, that they are dealt with in a timely and frictionless manner.
PagerDuty gives teams the tools and information necessary to better understand an incident’s makeup, along with actionable insights to prevent similar incidents from recurring in the future.
To learn more about how PagerDuty can improve your organization’s incident management process, try a 14-day free trial today.