Your On-Call Engineer's Incident Management Checklist
The on-call engineer plays a critical role in incident management. As first responders, on-call engineers can mean the difference between an incident that turns critical and one that is resolved quickly.
Smaller companies have few choices regarding who should be on call, but as the organization grows and incident management becomes complex and critical, it’s important to have a structured process for the on-call engineer.
Whether you’re a small business or an enterprise, you can benefit from having a clear process for selecting and equipping your on-call engineer. Here are a few guidelines.
First response is critical
In the first few minutes of an incident, the on-call engineer needs to establish its severity and extent. Based on that, they must gauge who is needed to resolve the incident and how to bring them in quickly. This requires a working knowledge of how the system functions, so that when something breaks, they can tell what is normal from what is broken.
In small to mid-sized teams, the on-call role is usually rotated. This way, the load is shared, and everyone knows how to handle incidents and stays in practice. Larger teams may have the luxury of dedicated incident managers who can initiate the first response. In either case, the primary goal of the on-call engineer is not to resolve the incident, but to sound the alarm and loop in the resources needed to resolve it.
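As a rough illustration of a shared rotation, here is a minimal Python sketch of a round-robin schedule. The engineer names, the start date, and the one-week shift length are assumptions made for the example, not recommendations from any particular tool.

```python
from datetime import date


def on_call_for(day, engineers, start, shift_days=7):
    """Return who is on call on `day` in a simple round-robin rotation."""
    if day < start:
        raise ValueError("day is before the rotation starts")
    # Count how many full shifts have passed since the rotation began,
    # then wrap around the list of engineers.
    shifts_elapsed = (day - start).days // shift_days
    return engineers[shifts_elapsed % len(engineers)]


team = ["alice", "bob", "carol"]    # hypothetical team members
rotation_start = date(2024, 1, 1)   # hypothetical start of the first shift

print(on_call_for(date(2024, 1, 1), team, rotation_start))   # alice
print(on_call_for(date(2024, 1, 10), team, rotation_start))  # bob
print(on_call_for(date(2024, 1, 22), team, rotation_start))  # alice
```

With three engineers and weekly shifts, each person is on call one week in three, which keeps everyone in practice without overloading anyone.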
Have a secondary on-call engineer
It’s crucial to have a secondary on-call engineer ready for escalation. This means the team’s rotation schedule needs to cover both the primary and the secondary role. It’s easy to set up automated rules so PagerDuty escalates to the backup engineer if there’s no response from the primary engineer.
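The logic behind such an escalation rule can be sketched in a few lines of Python. This is not PagerDuty's API; the function name, the chain of roles, and the 15-minute timeout are all assumptions chosen for illustration.

```python
def escalation_target(minutes_unacknowledged, chain, timeout_minutes=15):
    """Pick who should be paged, given how long the alert has gone unacknowledged.

    Each person in `chain` gets `timeout_minutes` to respond before the
    alert moves to the next person. The last person stays paged.
    """
    level = int(minutes_unacknowledged // timeout_minutes)
    return chain[min(level, len(chain) - 1)]


chain = ["primary", "secondary"]  # hypothetical escalation order

print(escalation_target(5, chain))   # primary
print(escalation_target(15, chain))  # secondary
print(escalation_target(90, chain))  # secondary
```

In practice this is usually expressed as configuration in the incident management tool rather than hand-written code, but the rule is the same: no acknowledgement within the timeout means the next person is paged.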
Ensure your on-call engineer has the required training
Since there’s a lot at stake when an incident occurs, your on-call engineer needs to be a developer who can follow protocol and think on the go. They need to be aware of different strategies for troubleshooting and customer support. It is also useful to hand the on-call engineer a checklist or flowchart to follow when incidents occur.
Here are the steps an on-call engineer needs to take during an incident:
- Identify & Log: The first step is to identify or detect the incident, track the problems behind it, and create logs. Logging is important in order to get to the root cause of the issue quickly and to provide a comprehensive post-mortem of the incident once it’s resolved. Since it’s crucial to respond to the incident quickly, identifying and logging must also be done quickly and methodically in order to move on to the next step.
- Categorize & Prioritize: Because a team can encounter a wide variety of problems, it’s important to categorize each incident to prevent confusion. The basic criteria for categorizing an incident are the number of users affected, the features that have failed, the revenue affected, and so on. Prioritizing incidents helps the on-call engineer decide whether an incident requires the time and resources of the rest of the team. Minor incidents can be handled by the engineer alone, saving the entire team’s time and giving the end user a better experience.
- Notify the Right People: If the priority of the incident is high enough, then solutions like PagerDuty and its Slack integration or Response Mobilizer can be used to muster the relevant people and bring them together in one place. In particular, using the room feature for ChatOps, shared video calls, and quick inputs can make a big difference in the outcome. When communicating with team members, it’s important to be brief and describe the incident in as few words as possible. Teams can get distracted by alert overload, and a solution like PagerDuty is imperative to suppress the noise and surface the signal.
- Troubleshoot: Troubleshooting doesn’t have to wait until the whole team is notified and present. Even while waiting for responses, it is vital that first responders like the on-call engineer be able to troubleshoot on the go. Rapid responses can be a lifesaver, much like in real-life emergency services, where the first few minutes can make all the difference between things going critical and being manageable later on.
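The Categorize & Prioritize step above can be made concrete with a small Python sketch. The fields mirror the criteria listed in that step (users affected, features failed, revenue affected), but the class, the function names, and every threshold are hypothetical; each team would need to choose its own.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    title: str
    users_affected: int
    features_failed: int
    revenue_at_risk: float  # per hour, in the team's currency


def priority(incident):
    """Return "high", "medium", or "low" using illustrative thresholds."""
    if incident.users_affected >= 1000 or incident.revenue_at_risk >= 10_000:
        return "high"
    if incident.users_affected >= 50 or incident.features_failed >= 2:
        return "medium"
    return "low"


def next_step(incident):
    """Decide whether the on-call engineer handles it alone or pulls in the team."""
    return "handle alone" if priority(incident) == "low" else "notify team"


outage = Incident("Checkout errors", users_affected=5000,
                  features_failed=1, revenue_at_risk=20_000.0)
typo = Incident("Typo on settings page", users_affected=3,
                features_failed=1, revenue_at_risk=0.0)

print(priority(outage), "->", next_step(outage))  # high -> notify team
print(priority(typo), "->", next_step(typo))      # low -> handle alone
```

Writing the thresholds down ahead of time, in whatever form, means the on-call engineer does not have to make the judgment call from scratch in the middle of an incident.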
Choosing an on-call engineer should not be an afterthought. Having one with sufficient backup and a well-thought-out plan can mean efficiency when things go south. If your on-call engineer follows these basic steps, your team can spend more time creating and less time fixing.