Blog

Are you Prepared for Your Next Major Outage?

by PagerDuty University August 1, 2024 | 4 min read

Software is not perfect. And ultimately, it’s not a matter of if you will have an outage, but of when. With the increasing complexity and frequency of IT incidents,  is your organization prepared to respond and recover when each second counts?

Here at PagerDuty, we’ve compiled a list of best practices to keep your systems up and running.

Before an outage…

1. Document and practice major incident processes
Ensure that responders are prepared to be on call, familiar with incident management processes, and know how to engage with other teams. Conduct mock incident response scenarios to practice internal processes for resolving major incidents.

Tip: Use our Incident Management solution for rapid team engagement, including On-Call Readiness Reports to verify users’ profiles for responsiveness.

2. Determine where to focus your efforts
Do you respond to all incidents manually, or do you take preventative measures to get ahead of issues before they happen? Ensure you know where your organization sits in terms of its Operational Maturity.

PagerDuty Operational Maturity Model

 

 

Tip: Go from reactive to preventative by reviewing your Operational Maturity report in our Incident Management platform. Find and implement specific recommendations, such as adding automation or enhancing team efficiencies, to boost operational resiliency.

During an outage…

3. Increase situational awareness during incident triage
Give responders access to contextual information the moment they’re paged about an incident. Ensure responders have a way to gain situational awareness by identifying past and related incidents as well as any contextual information that answers the question, “Something broke. What changed?”

Tip: Use PagerDuty’s AIOps Root Cause Analysis features to instantly uncover critical insights from Past, Related, and Outlier incidents. Leverage our Change Events feature to view the most recent changes to your services (80% of incidents are the result of change events such as software deployments.)

4. Define roles for your response team
Ensure your response team has a clearly defined set of Incident Roles (e.g., Incident Commander, Customer Liaison, Scribe, etc.). Having established Incident Roles unambiguously defines responsibilities, promotes accountability, and enables a more focused incident response.

Outage responders

Tip: Use our Incident Management platform to create predefined Incident Roles that can be assigned during a major incident. 

5. Deploy automation to expedite diagnostics
Get teams out of firefighting mode. Automate routine tasks and incident response processes to remove manual toil. Reduce alert noise so there are fewer interruptions during incident response, leading to faster resolution.

Tip: Leverage our AIOps and Incident Management solutions for better event management and accelerated triage. For more granular control, use PagerDuty Runbook Automation to execute specific actions based on defined events.

On July 19, 2024, during the world’s largest IT outage ever, our AIOps + IM customers saw a 1425% increase in automation usage. This allowed teams to automate routine tasks, significantly scaling their incident response.

6. Keep your customers informed
Ensure Customer Support and Service teams are receiving real-time data and two-way communications from Engineering. This collaboration allows all teams to act as a unit and resolve issues faster together under the common goal of creating positive customer experiences (even during an outage).

Tip: Use our Customer Service Ops solution to customize workflows and integrate data from all your tools to give Customer Support immediate insights into your IT infrastructure.

7. Establish a stakeholder communication process
Create a clear protocol for communicating with stakeholders during an outage, detailing how they will receive status updates and where to find additional information.

Tip: Create Stakeholder Subscriptions to subscribe stakeholders to business services and incidents, and inform public and private audiences with Status Pages

After an outage…

8. Establish a post-incident review process
Don’t let an outage go to waste. Establish a clear post-incident review process to enhance future responses and create a continuous feedback loop for integrating improvements into your processes.

Tip: Check out our Post-Incident HOWIE guide for detailed recommendations on how to get the most out of your post-incident reviews.

Why listen to PagerDuty?
On July 19, 2024, (during the world’s largest IT outage ever,) our AIOps and Incident Management customers on average avoided 132 incident actions, saving over 1600 response hours—in just one day.

Take a look at this checklist to review your operational resiliency to prepare for the next mass service outage.