Intelligent Alert Grouping: What It Is and How To Use It
Co-authored by Chris Bonnell, PagerDuty Data Scientist VI
It’s 2 AM and you’re paged while you’re still awake – how well can you find what you need to fix the latest mistake? When the incident begins, it might only be impacting a single service, but as time progresses (your brain boots, the coffee is poured, the docs are read), the incident escalates to other services and teams, and you might not see those alerts if they’re not in your scope of ownership. Although you may be able to go through your alert tool’s UI and combine alerts that are all relevant to the same incident, this requires you to know 1) that other alerts are being sent and 2) that the incident in that other service or services is relevant to the one you’re currently working on.
In the PagerDuty application, some of this work is done for you via the Intelligent Alert Grouping (IAG) feature. While it’s great to have this feature working at least somewhat automagically out of the box, there are probably times when you wish you could use it better. Perhaps you want to improve how alerts are matched to an incident, or prevent alerts that were incorrectly associated with an incident from being matched that way in the future. Maybe you even want to tweak the design of your alerts so that you don’t need to do as much correction after the fact, during an active incident. If that’s what you’re looking for, then look no further! In this blog post series we’ll be discussing the different ways you can improve the accuracy of Intelligent Alert Grouping for your specific needs.
Common incident challenges
Increasing complexity and scale in systems design means it’s getting harder and harder to design alerts that convey enough information, or even that are correctly correlated. When we build our monitors and corresponding alerts, we usually do so with the service in question in mind, but we can’t always effectively map how dependencies will respond to each other’s latencies and outages. So it’s possible that when you see several alerts on a service, a subset of those are caused by other alerts and a different subset might be repeat alerts for the same issue. Depending on how you’ve designed your notifications, it’s also possible that multiple teams receive notifications for multiple services when the cause lies in only one of them.
When we think about how we configure alerts, how we end up in those situations on the response side starts to make sense. We might think in terms of thresholds, where a service can budget for a given amount of latency or outage time, but its dependent services might have stricter requirements. If those relationships aren’t accounted for or are unknown, we end up with situations like the above, where different teams are working on different aspects of the same incident without being aware of it. We can also see the opposite behavior, where an alert storm is triggered and it’s difficult to wade through the noise to map out what is happening where, and how the alerts should be grouped together in terms of incidents.
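To make the threshold mismatch concrete, here’s a small hypothetical: an upstream service budgets for 500 ms of latency before alerting, while one of its consumers pages at 200 ms. The service names and numbers below are made up for illustration; the point is that a single slowdown can page two teams on two different services, neither of which necessarily knows about the other.

```python
# Hypothetical thresholds: one upstream slowdown can page two teams separately.
LATENCY_THRESHOLDS_MS = {
    "payments-api": 500,       # upstream service's own alerting budget
    "checkout-frontend": 200,  # downstream consumer with a stricter requirement
}

def alerts_for(observed_latency_ms: float) -> list[str]:
    """Return the services that would page for a given upstream latency."""
    return [
        service
        for service, threshold in LATENCY_THRESHOLDS_MS.items()
        if observed_latency_ms > threshold
    ]

# At 350 ms, only checkout-frontend pages, so the downstream team investigates
# alone. At 600 ms both services page, and the two teams may not realize
# they're looking at the same incident.
print(alerts_for(350))  # ['checkout-frontend']
print(alerts_for(600))  # ['payments-api', 'checkout-frontend']
```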
Reducing (some) complexity
These challenges aren’t new, and in all likelihood you’re already starting to respond to them. If you’re reading this post, you’re also likely doing so at least in part using the IAG feature. Briefly, IAG uses machine learning to build patterns from the data you send into the platform so that it can start to group alerts by their respective incidents for you. The goal is to help you and your teams better understand the topology of what is amiss in your system(s).
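PagerDuty doesn’t publish the internals of IAG’s model, but as a rough mental model you can think of content-based grouping along these lines: compare an incoming alert against the alerts already attached to recent incidents, and attach it to the best match above some threshold. The sketch below is purely illustrative – the similarity measure, threshold, and time window are assumptions for the sake of the example, not IAG’s actual behavior.

```python
# Illustrative only: a toy text-similarity grouper, NOT PagerDuty's actual IAG model.
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from difflib import SequenceMatcher

@dataclass
class Alert:
    summary: str
    created_at: datetime

@dataclass
class Incident:
    alerts: list = field(default_factory=list)

def similarity(a: str, b: str) -> float:
    """Crude text similarity in [0, 1]; stands in for learned patterns."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def group(alert: Alert, open_incidents: list, threshold: float = 0.7,
          window: timedelta = timedelta(hours=1)) -> Incident:
    """Attach the alert to the best-matching recent incident, or open a new one."""
    best, best_score = None, 0.0
    for incident in open_incidents:
        for existing in incident.alerts:
            if alert.created_at - existing.created_at > window:
                continue  # too old to plausibly be part of the same event
            score = similarity(alert.summary, existing.summary)
            if score > best_score:
                best, best_score = incident, score
    if best is not None and best_score >= threshold:
        best.alerts.append(alert)
        return best
    fresh = Incident(alerts=[alert])
    open_incidents.append(fresh)
    return fresh
```

The real feature learns from your historical data and from the corrections you make, which is exactly the tuning this series will walk through.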
When you get started with IAG there is a lot that works “automagically” to reduce the learning curve and allow you to start improving your response process as soon as possible. That said, eventually you’ll reach a point where you need to correct and tune how alerts are grouped together – which is what this blog post series is about. We’re going to be covering how you can interact with IAG to improve its ability to group alerts by top-level incident.
Where to go from here
This is the first post in a series covering how to design your alerts and services in the PagerDuty application to improve Intelligent Alert Grouping’s ability to group incidents. Specifically:
- How you can train the built-in learning features
- How to craft alerts
- How to design services (in the PagerDuty application)
In our next post, we’ll cover the first topic, the built-in learning, explaining how it works as well as how to merge and unmerge incidents. After that we’ll move on to how to craft alerts, detailing how the different fields are used by IAG and what information you should make sure to include (a small preview of those fields is shown below). In the last post we’ll cover configuring services in the PagerDuty application and what information to include and exclude.
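As a preview of the alert-crafting post, here is roughly what an alert sent to PagerDuty via the Events API v2 looks like. The routing key, service names, and field values are placeholders; which of these fields matter most to grouping is what we’ll dig into later in the series.

```python
# A minimal Events API v2 trigger; routing_key and field values are placeholders.
import requests

event = {
    "routing_key": "YOUR_INTEGRATION_ROUTING_KEY",
    "event_action": "trigger",
    "payload": {
        "summary": "payments-api p95 latency above 500ms",
        "source": "payments-api-prod",
        "severity": "warning",
        "component": "payments-api",
        "group": "checkout",
        "class": "latency",
        "custom_details": {"p95_ms": 612, "region": "us-east-1"},
    },
}

response = requests.post("https://events.pagerduty.com/v2/enqueue", json=event)
response.raise_for_status()
```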
All the posts in this series will be under the ei-architecture-series tag, so please make sure to reference that page when looking for subsequent posts!