Understanding Event-Driven Automation
IT and SRE teams must respond to incidents quickly, but resolving incidents manually is time-consuming and inefficient. Event-driven automation automates incident response to accelerate resolution, reduce human error, and free teams to focus on more strategic initiatives.
What is event-driven automation?
Event-driven automation is a process that starts at the event level. When new data comes in, the system automatically normalizes it (reorganizes it into a consistent, logical format) and enriches it (adds extra details that help responders get up to speed faster). Event-driven automation can accelerate the incident response process by creating an alert from event data and routing it to the correct team with added context on what to do next.
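As a rough illustration of the normalize, enrich, and route flow described above, here is a minimal Python sketch. The field names, the `SEVERITY_MAP` and `SERVICE_OWNERS` tables, and the team names are all hypothetical; they are not taken from any specific product's event schema.

```python
# Hypothetical mapping from source-specific severity labels to one shared scale.
SEVERITY_MAP = {"crit": "critical", "fatal": "critical", "err": "error", "warn": "warning"}

# Hypothetical ownership table used to decide where an alert is routed.
SERVICE_OWNERS = {"checkout-api": "payments-oncall"}


def normalize(raw: dict) -> dict:
    """Map a source-specific payload onto one consistent shape."""
    label = str(raw.get("level") or raw.get("severity") or "info").lower()
    return {
        "summary": (raw.get("msg") or raw.get("message") or "").strip(),
        "source": raw.get("host") or raw.get("source") or "unknown",
        "service": raw.get("service", "unknown"),
        "severity": SEVERITY_MAP.get(label, label),
    }


def enrich(event: dict) -> dict:
    """Add routing context without mutating the normalized event."""
    enriched = dict(event)
    enriched["route_to"] = SERVICE_OWNERS.get(event["service"], "default-oncall")
    return enriched


raw_event = {"msg": " Disk 91% full ", "host": "db-3", "level": "CRIT", "service": "checkout-api"}
alert = enrich(normalize(raw_event))
```

Keeping normalization and enrichment as separate steps means every monitoring source can be mapped to the same shape first, so routing rules only need to be written once.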
What is end-to-end event-driven automation?
End-to-end automations are automatic processes that are in place during the entire incident lifecycle, which starts at the event level and runs through the incident’s resolution. For example, a team could create automations to initiate diagnostics or set up auto-remediation to resolve an issue before it becomes an incident.
How event-driven automation helps your team
Event-driven automation can help teams streamline the incident response process by improving response times and reducing manual work.
Here are some specific ways event-driven automation helps your team:
- Improves prioritization. By normalizing incoming event data and setting global routing criteria, teams can quickly identify and address the most critical issues. This keeps operations efficient and focused, so teams don't waste time parsing irrelevant alerts.
- Reduces manual tasks and human intervention. Automation allows for event transformations and auto-remediation at the ingest level. This means incidents that can be resolved by process automation never bother a human, freeing up teams to focus on strategic tasks.
- Accelerates major incident response. Major Incident Management (MIM) teams benefit from early detection and immediate routing of major incidents. Diagnostics and event data normalization can be automated, which reduces response times and limits the costs of prolonged incidents.
- Gives engineering teams time for innovation. With intelligent routing and auto-remediation for routine issues, engineering teams can avoid constant firefighting and dedicate more time to driving innovation and adding value to the business.
- Enhances customer support. Faster Mean Time to Resolution (MTTR) and fewer customer-impacting incidents mean support teams can resolve issues proactively. With the right teams involved from the start, users experience fewer disruptions, leading to higher customer satisfaction.
How to set up end-to-end event-driven automation
Implementing end-to-end event-driven automation requires a multi-step approach.
Step 1: Suppression and eliminating transient alerts
Suppression helps reduce incident volume by preventing notifications for incidents that don’t meet specific criteria. For example, an event orchestration team can suppress events until they reach a target threshold. Once this threshold is met, suppression is lifted, and events are converted into incidents for further action.
Transient alerts often resolve on their own without intervention. Pausing notifications for these types of alerts allows teams to delay creating an incident for a designated period of time. If the alert is still active when that period ends, an incident is created. This can be helpful for flapping incidents, which switch or "flap" between a normal state and an alert state. For example, a team could pause high CPU usage incidents for five minutes, ensuring an incident is only created if high CPU usage persists beyond that window.
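The two techniques in this step can be sketched in a few lines of Python. This is an illustrative model only: the `ThresholdSuppressor` and `PauseGate` classes, their parameters, and the use of plain numeric timestamps in seconds are assumptions made for the example, not a description of how any particular platform implements suppression.

```python
from collections import deque


class ThresholdSuppressor:
    """Suppress events until `threshold` of them arrive within `window_s` seconds."""

    def __init__(self, threshold: int, window_s: float):
        self.threshold = threshold
        self.window_s = window_s
        self._seen = deque()

    def should_create_incident(self, now: float) -> bool:
        self._seen.append(now)
        # Drop timestamps that have aged out of the window.
        while self._seen and now - self._seen[0] > self.window_s:
            self._seen.popleft()
        return len(self._seen) >= self.threshold


class PauseGate:
    """Delay incident creation until an alert has stayed active for `pause_s` seconds."""

    def __init__(self, pause_s: float):
        self.pause_s = pause_s
        self._active_since = None

    def observe(self, alerting: bool, now: float) -> bool:
        if not alerting:
            self._active_since = None  # alert cleared on its own; reset the timer
            return False
        if self._active_since is None:
            self._active_since = now
        return now - self._active_since >= self.pause_s
```

With `PauseGate(pause_s=300)`, a CPU alert that clears after three minutes never produces an incident, while one that stays active for the full five minutes does. A flapping alert resets the timer each time it returns to normal.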
Step 2: Event, alert, and incident enrichment
Reducing unnecessary alerts is the first step, but after that, teams must ensure that the events, alerts, and incidents that make it through suppression are enriched with as much detail as possible.
- Event enrichment: Event enrichment involves rewriting Common Event Format (CEF) fields or adding new fields for additional information. It ensures incidents are populated with the relevant details and normalizes event data so incidents appear consistently across teams. This helps teams resolve issues faster.
- Alert enrichment: Alert enrichment allows users to define the severity of an alert. Severity influences how teams respond, so classifying an alert correctly ensures the response is handled properly. For example, an SRE team could classify alerts for customer-facing services as Sev1, since they can impact revenue. Alerts for lower-priority services could be classified as Sev3 or Sev4.
- Incident enrichment: Incident enrichment can help teams understand how to respond to an incident. It involves defining incident priority and including notes when an incident is created. These notes can help responders by identifying potential root causes or linking to helpful resources like knowledge base articles or internal wikis.
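The alert and incident enrichment steps above could look something like the following sketch. The service names, the `CUSTOMER_FACING` set, the runbook path, and the severity-to-priority mapping are hypothetical placeholders chosen for illustration; a real team would substitute its own classification rules.

```python
# Hypothetical set of customer-facing services; these are classified as Sev1.
CUSTOMER_FACING = {"checkout-api", "login"}

# Hypothetical runbook locations attached to incidents as notes.
RUNBOOKS = {"checkout-api": "wiki/runbooks/checkout-api"}


def classify_severity(service: str) -> str:
    """Alert enrichment: customer-facing services get Sev1, everything else Sev3."""
    return "Sev1" if service in CUSTOMER_FACING else "Sev3"


def build_incident(alert: dict) -> dict:
    """Incident enrichment: set a priority and attach notes for responders."""
    severity = classify_severity(alert["service"])
    notes = []
    runbook = RUNBOOKS.get(alert["service"])
    if runbook:
        notes.append(f"Runbook: {runbook}")
    return {
        "title": alert["summary"],
        "severity": severity,
        # Assumed mapping from severity to priority, for illustration only.
        "priority": "P1" if severity == "Sev1" else "P3",
        "notes": notes,
    }
```

Calling `build_incident({"service": "checkout-api", "summary": "High error rate"})` yields a Sev1 incident with the runbook note already attached, so the responder does not have to look it up.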
Step 3: End-to-end and auto-remediation
At this stage, teams can start implementing automation to gather diagnostics or resolve issues with pre-defined solutions. Teams can do this using webhooks or automated incident resolution.
- Webhooks: Webhooks automate key tasks by allowing users to define custom headers and payload body fields that are sent when an incident is created. Webhooks provide essential diagnostic information for responders without having to run manual processes. For auto-remediation, webhooks can trigger an action to resolve an incident without involving a human. Using webhooks for diagnostics and auto-remediation can improve MTTR and make incidents less disruptive for customers and less of a burden for response teams.
- Automated incident resolution: Teams can implement automation tools to triage, diagnose, and remediate incidents without human intervention. Automated incident resolution is part of PagerDuty process automation. When an incident is created, automation jobs can be triggered automatically or run manually by responders.
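To make the two options above more concrete, here is a minimal Python sketch using only the standard library. It builds (but does not send) a webhook request with custom headers and a JSON body, and shows a small registry that maps incident types to remediation functions. The URL, the `X-Automation-Token` header, the payload fields, and the `queue_backlog` incident type are all hypothetical; this is not the PagerDuty webhook format or API.

```python
import json
import urllib.request


def build_webhook_request(incident: dict, url: str, token: str) -> urllib.request.Request:
    """Build a POST request carrying custom headers and a JSON payload body."""
    body = json.dumps({
        "incident_id": incident["id"],
        "service": incident["service"],
        "action": "collect_diagnostics",
    }).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json", "X-Automation-Token": token},
    )


REMEDIATIONS = {}  # hypothetical registry: incident type -> remediation function


def remediation(incident_type: str):
    """Decorator that registers a function as the fix for one incident type."""
    def register(fn):
        REMEDIATIONS[incident_type] = fn
        return fn
    return register


@remediation("queue_backlog")
def clear_queue(incident: dict) -> str:
    # Placeholder for a real action such as calling a job runner.
    return f"cleared queue for {incident['service']}"


def auto_remediate(incident: dict):
    """Return the remediation result, or None to hand the incident to a human."""
    handler = REMEDIATIONS.get(incident.get("type"))
    return handler(incident) if handler else None
```

Passing the request returned by `build_webhook_request` to `urllib.request.urlopen` would send it. Returning `None` from `auto_remediate` for unknown incident types keeps a human in the loop for anything without a pre-defined solution.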
How to choose which events to automate
Specific criteria will vary by team, but here are some general guidelines to help teams choose:
- Event priority: Automate high-priority events, where faster resolution can reduce downtime or improve customer experience.
- Team efficiency: Boost efficiency by automating events that take significant time to resolve or frequently interrupt workflows.
- Repetitive tasks: Automating events that involve repetitive actions helps free up team members from handling these tasks manually.
- Resource allocation: Consider the existing tools, integrations, or automation frameworks your team already has.
- Event frequency: Automating high-frequency events can help reduce alert fatigue and allow team members to focus on strategic tasks instead of manual processes.
- Pre-defined solutions: Automate incidents with clear, repeatable solutions, such as restarting a service or clearing a queue.
These recommendations are a great starting point for testing and implementing automations, assessing their effectiveness, and expanding or refining workflows as needed.
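One way to apply these guidelines is to rank event types by how much manual effort automation would save. The sketch below is a hypothetical heuristic, not a recommended formula: the `EventType` fields, the example figures, and the 2x weighting for high-priority events are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class EventType:
    name: str
    monthly_count: int      # event frequency
    manual_minutes: float   # average hands-on time per occurrence
    high_priority: bool     # affects downtime or customer experience
    has_known_fix: bool     # a clear, repeatable solution exists


def automation_score(e: EventType) -> float:
    """Hypothetical heuristic: monthly manual minutes, doubled for high priority."""
    if not e.has_known_fix:
        return 0.0  # nothing pre-defined to automate yet
    toil = e.monthly_count * e.manual_minutes
    return toil * (2.0 if e.high_priority else 1.0)


def rank(candidates: list) -> list:
    """Order candidates from most to least worthwhile to automate."""
    return sorted(candidates, key=automation_score, reverse=True)


candidates = [
    EventType("disk_full", 40, 10, False, True),
    EventType("queue_backlog", 25, 15, True, True),
    EventType("novel_outage", 2, 120, True, False),
]
ranked_names = [e.name for e in rank(candidates)]
```

In this example, `queue_backlog` ranks first because it is frequent, time-consuming, high-priority, and has a known fix, while `novel_outage` scores zero until a repeatable solution exists.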
Event-driven automation helps to enhance overall operational efficiency by accelerating incident response time, reducing manual tasks, and streamlining processes. Automations can help teams eliminate unnecessary alerts, which can lead to errors or alert fatigue. By automating repetitive tasks, event-driven automation saves time and resources and improves incident resolution and overall performance.
Discover how PagerDuty automates event management to help teams reduce noise and resolve incidents faster with diagnostics and auto-remediation. Start your free trial today!