Building Operational Resilience
Operational resilience is an organization’s ability to predict, respond to, and prevent unplanned work to drive reliable customer experiences and protect revenue-at-risk. Resilience is measured in terms of reduced customer impact. That impact doesn’t just include downtime; it also covers service degradation due to latency or other factors. While organizations can measure operational resilience in terms of mean time to acknowledge (MTTA), mean time to resolve (MTTR), service level objectives (SLOs), or a variety of other metrics, the bottom line is how small the impact on the customer is when something goes wrong.
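To make two of these metrics concrete, here is a minimal Python sketch of how MTTA and MTTR could be computed from incident timestamps. The `Incident` class and its field names are assumptions made for illustration; they are not a PagerDuty data model or API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    # Hypothetical record shape, assumed for this example only.
    triggered_at: datetime
    acknowledged_at: datetime
    resolved_at: datetime


def mean_minutes(durations):
    """Average a list of timedeltas and return the result in minutes."""
    return mean(d.total_seconds() for d in durations) / 60


def mtta_minutes(incidents):
    """Mean time to acknowledge: trigger to first acknowledgement."""
    return mean_minutes([i.acknowledged_at - i.triggered_at for i in incidents])


def mttr_minutes(incidents):
    """Mean time to resolve: trigger to resolution."""
    return mean_minutes([i.resolved_at - i.triggered_at for i in incidents])


# Two made-up incidents that both trigger at the same moment.
t0 = datetime(2024, 1, 1, 12, 0)
incidents = [
    Incident(t0, t0 + timedelta(minutes=2), t0 + timedelta(minutes=30)),
    Incident(t0, t0 + timedelta(minutes=4), t0 + timedelta(minutes=50)),
]

print(mtta_minutes(incidents))  # 3.0
print(mttr_minutes(incidents))  # 40.0
```

Averages like these are only a proxy. Two organizations with the same MTTR can differ widely in how much of the incident the customer actually felt.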
In fact, resilience is so key to success in the modern enterprise that, according to a survey conducted by PagerDuty, it was ranked as one of the top 3 operations priorities by IT and business leaders across industries, alongside improving security/reducing risk and supporting revenue growth.
So how do you build more resilient systems?
Defining the pillars of operational resilience
At PagerDuty, we focus on scaling people with automation and empowering them via AI. That focus rests on three pillars: automation-first, people-centric, and AI/ML-assisted. Each one helps provide more reliable service for customers and more resilient systems and processes. Here’s how:
Automation-first: The influx of data and the increase in noise and incidents mean that humans are having trouble keeping up with the sheer volume of information coming in. Not only that, but responding to each of these problems manually leaves room for error and takes subject matter expert (SME) time away from critical work. It’s a waste of resources and only exacerbates customer impact.
With automation as the first line of defense, organizations can let machines enrich and normalize data, run diagnostics, remediate issues, and coordinate response efforts before responders are even alerted to the issue. This preserves human capacity and makes systems more resilient against human error.
People-centric: That said, resilience also relies on the humans who power these technical systems. In cases where automation can’t resolve problems without intervention, it’s important to have processes in place that support teams in doing their best work under challenging circumstances, with as little disruption as possible to both them and the customer.
Consider all the processes that go into ensuring that systems stay up and available. From on-call rotations to how postmortems are conducted and fixes are prioritized, the people involved should feel that these processes make them more efficient and proactive and keep them in the loop.
AI/ML-assisted: Resilience is, in part, a game of speed as well. Things will go wrong. It’s impossible to predict every failure. But being able to fix a broken system and provide a more reliable customer experience is time-sensitive. Every minute of downtime translates to a cost to the business.
Organizations need to leverage AI and ML to help technical teams triage, communicate about, and report on problems faster. With the right information at their fingertips, responders can bring incidents to resolution faster, communicate with less time and toil, and create post-incident reviews more easily, ensuring that the system hardens over time.
How PagerDuty can help companies achieve operational resilience
Working towards improved operational resilience is an effort that will pay dividends in the long term. However, starting from scratch can be challenging. For many organizations, the right move is to work hand-in-hand with a strategic partner. PagerDuty has helped thousands of organizations improve their resilience on the path to operational excellence. Here are a few ways we see our customers take advantage of our unique expertise and capabilities:
- Machine-first response with event-driven automation: Event-driven automation is automation that is kick-started at the event level, normalizing and enriching data at ingest from trusted sources such as monitoring tools. At this point, automation can run diagnostics and remediations, dynamically route or escalate if humans are needed, and more.
- Preserving human capacity while keeping communication lines open: Keeping humans in the loop during response is key. That includes internal business stakeholders, other technical teams, customer support agents, and customers themselves. The goal is to keep all of them informed with as little toil and overhead as possible.
- Getting the right information with a Copilot at your fingertips: PagerDuty’s platform-level AI assistant ensures that technical teams have the ability to ask questions about the system and get immediate answers in the most critical moments. Additionally, Copilot can serve as the first drafter for communications, postmortems, automation runbooks, and more to help teams use their capacity for more value-add work.
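As a rough illustration of the machine-first flow described in the first item above, the following Python sketch normalizes an incoming event, enriches it, attempts an automated remediation, and escalates to a human only if that fails. Every name here (`Event`, `OWNERS`, `REMEDIATIONS`, `restart_worker`, `handle`) is hypothetical and chosen for the example; none of it is a PagerDuty API, and the remediation is a placeholder rather than a real action.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    # Common shape that every monitoring payload is normalized into.
    source: str
    service: str
    severity: str
    summary: str
    context: dict = field(default_factory=dict)


def normalize(raw: dict) -> Event:
    """Map a tool-specific payload onto the common Event shape."""
    return Event(
        source=raw.get("tool", "unknown"),
        service=raw.get("service") or raw.get("host", "unknown"),
        severity=str(raw.get("severity") or raw.get("level") or "info").lower(),
        summary=raw.get("message", ""),
    )


# Hypothetical ownership lookup used to enrich events at ingest.
OWNERS = {"checkout": "payments-oncall"}


def enrich(event: Event) -> Event:
    """Attach the owning on-call team so routing needs no manual lookup."""
    event.context["owner"] = OWNERS.get(event.service, "default-oncall")
    return event


def restart_worker(event: Event) -> bool:
    """Placeholder remediation: succeeds only for a known failure pattern.

    A real implementation would call an orchestration or runbook tool.
    """
    return "worker stalled" in event.summary.lower()


# Hypothetical mapping from service to its automated remediation.
REMEDIATIONS = {"checkout": restart_worker}


def handle(raw: dict) -> str:
    """Run the machine-first steps and report what happened to the event."""
    event = enrich(normalize(raw))
    if event.severity not in ("critical", "error"):
        return "suppressed"  # low-severity noise never reaches a human
    remediation = REMEDIATIONS.get(event.service)
    if remediation and remediation(event):
        return "auto-remediated"
    return f"escalated to {event.context['owner']}"


print(handle({"tool": "monitor-a", "service": "checkout",
              "level": "CRITICAL", "message": "Worker stalled on queue"}))
print(handle({"tool": "monitor-a", "service": "checkout",
              "severity": "error", "message": "Database timeout"}))
```

In this sketch, the first event is resolved without paging anyone, while the second falls through to the owning team because the placeholder remediation does not recognize it. Diagnostics are omitted for brevity; they would slot in between enrichment and remediation.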
If you think your organization could see value in improving its resilience by leveraging AI and automation to help your teams scale, talk to our teams today.