How the PagerDuty Operations Cloud Can Play a Part in Your Digital Operational Resilience Act (DORA) Strategy

by Lee Fredricks June 26, 2024 | 8 min read

Since I wrote DORA vs DORA!, a number of people have asked if I could give more practical advice on how the PagerDuty Operations Cloud can play a part in helping firms in the Financial Services Industry (FSI) to meet their obligations under DORA. Let me try to do that now.

Disclaimer: Please note that while PagerDuty can provide some really useful pieces of the puzzle, I am not in any way suggesting that you can achieve instant DORA compliance simply by adopting the PagerDuty Operations Cloud. What I AM suggesting is that when combined with a well-thought-through strategy, a sprinkling of consultancy, and more than a smidge of PagerDuty muscle, you can get a lot closer to DORA compliance.

When we talk to our customers about operational resiliency, three common themes come up:

  1. Teams don’t spend enough time on preventative design.
  2. Learnings from past incidents aren’t leveraged.
  3. Incident resolutions are slow due to noise and a lack of real-time systems.

So, a proactive approach to your DORA planning and strategy will help address some of these issues. Let’s walk through the core pillars of DORA and see where PagerDuty can help. For our UK audience, I’ll highlight where I feel a particular item is also relevant to the UK regulations (for example, FCA PS21/3 and PRA PS6/21).

  1. Robust ICT risk management

Under ICT risk management, DORA mandates the establishment of strong incident management processes. This is really PagerDuty’s raison d’être, so I’ll try to be succinct.

    • Monitoring and alerting: The AIOps capabilities of the PagerDuty Operations Cloud are built on our foundational data model and trained on over a decade of customer data. They can be used to reduce noise by collating and aggregating events from a host of IT systems and tools. With over 700 out-of-the-box integrations, PagerDuty can be configured to receive events and alerts from diverse sources, such as cloud and network monitoring tools, security information and event management (SIEM) systems, and change management tools. This allows for early detection of potential issues that could snowball into more significant problems.
    • Alert routing, call-out, and escalation: PagerDuty allows firms to define notification protocols for different types of incidents based on urgency and severity. PagerDuty works on a service-based model – think identified Important Business Services (IBS) from the PRA regs – and routes alerts directly to the teams and individuals who have the necessary expertise to handle the situation. Getting the right responders involved from the start can radically reduce the mean time to restore (MTTR) the service. It’s also possible to visualize these IBSs and see their upstream and downstream dependencies in the service graph.
    • PagerDuty Automation, Workflow Automation, and Incident Workflows: PagerDuty offers tools to create standardized workflows for handling incidents. These workflows can include automated steps for troubleshooting, diagnosing, and resolving incidents, promoting a consistent, repeatable approach to managing ICT risks across the organization.
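
To make the monitoring and routing bullets above a little more concrete, here is a minimal sketch of how a monitoring tool might raise an alert through the PagerDuty Events API v2. The routing key identifies the service integration that should receive the event, and the dedup key lets repeated alerts for the same problem collapse into one. The function names, the example values, and the idea of mapping one routing key to one Important Business Service are my own illustrative assumptions, not a prescribed setup.

```python
import json
import urllib.request

# Public Events API v2 endpoint.
EVENTS_API_URL = "https://events.pagerduty.com/v2/enqueue"
ALLOWED_SEVERITIES = {"critical", "error", "warning", "info"}


def build_trigger_event(routing_key, summary, source, severity="critical", dedup_key=None):
    """Build an Events API v2 'trigger' event for a single monitoring alert.

    routing_key: integration key of the target service (assumed here to be
                 one key per Important Business Service).
    dedup_key:   optional key so repeated alerts group into one incident.
    """
    if severity not in ALLOWED_SEVERITIES:
        raise ValueError(f"unsupported severity: {severity}")
    event = {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {"summary": summary, "source": source, "severity": severity},
    }
    if dedup_key is not None:
        event["dedup_key"] = dedup_key
    return event


def send_event(event, timeout=10):
    """POST the event to PagerDuty and return the parsed JSON response."""
    request = urllib.request.Request(
        EVENTS_API_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

In practice most firms would rely on one of the out-of-the-box integrations rather than hand-rolled code; the sketch is only meant to show how little information an event needs in order to be routed to the right service.
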
  2. Management, classification, and reporting of ICT-related incidents:

DORA mandates reporting operational incidents that have a significant or potential impact on the delivery of financial services. This necessitates establishing clear procedures for identifying, reporting, and analyzing such incidents.

    • Immutable centralized incident record: PagerDuty provides a time-stamped log of all activities and resolution steps relating to an incident. This central record provides a clear audit trail for all incidents, simplifying compliance with DORA’s reporting requirements.
    • Automated reporting: PagerDuty includes a suite of out-of-the-box dashboards and analytical reports but also allows for integration with external systems, potentially enabling automated reporting of major incidents to the relevant authorities based on predefined criteria. PagerDuty also provides status update templates and web-based Status Pages – directly associated with and linked to Important Business Services (PRA again) – to allow for immediate mass communication to stakeholders and customers.
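
As a rough illustration of how the time-stamped incident record described above could feed a regulatory report, the sketch below turns a list of incident log entries into a chronological audit timeline. It assumes the entries have already been exported (for example, from the REST API's incident log entries endpoint) as dictionaries carrying `created_at`, `type`, and `summary` fields; the function names are hypothetical, and the output format is simply one I chose for readability.

```python
from datetime import datetime


def parse_timestamp(value):
    """Parse an ISO 8601 timestamp such as '2024-06-26T09:00:00Z'."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def build_timeline(log_entries):
    """Return 'timestamp | type | summary' lines, oldest entry first.

    Each entry is assumed to be a dict with 'created_at', 'type', and
    'summary' keys, mirroring an exported incident log.
    """
    ordered = sorted(log_entries, key=lambda entry: parse_timestamp(entry["created_at"]))
    return [
        f'{parse_timestamp(entry["created_at"]).isoformat()} | {entry["type"]} | {entry["summary"]}'
        for entry in ordered
    ]
```

A timeline like this could then be attached to whatever incident report template your compliance team uses; the criteria for what counts as a reportable major incident remain a matter for your own DORA interpretation.
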
  3. Digital operational resiliency testing:

DORA and the UK regulations explicitly require FinServ institutions to conduct regular testing of their ICT systems and incident response plans to identify vulnerabilities in their operational resilience posture. This testing should include running simulations of various disruptive scenarios regularly. 

    • Incident simulation: Practice, practice, practice! PagerDuty Automation capabilities could be used to initiate a simulated incident. Alternatively, firms could manually disable a machine or application or create a PagerDuty test incident to trigger an outage and then practice their response procedures. This helps identify weaknesses and areas for improvement in the incident response plan. PagerDuty as a business conducts such simulations in its own systems every week (so-called “Failure Friday”!). Of course, operational resiliency goes beyond technology to encompass people and processes. We have ‘open sourced’ the PagerDuty incident response procedure – including roles and responsibilities – and you are free to take a copy and customise it as you wish (response.pagerduty.com). 
    • PagerDuty enables operational resiliency: During an incident, real or simulated, the core capabilities of the PagerDuty Operations Cloud (AIOps, PagerDuty Automation, and Incident Response) combined with a firm’s incident response processes and training will help firms reduce the mean time to acknowledge (MTTA) and mean time to resolve (MTTR) the incident and hence minimize disruption.
    • Post-test analysis (post-incident reviews or postmortems): PagerDuty’s GenAI functionality (in early access) facilitates the creation of such reports, allowing firms to analyze incident resolution times and team collaboration during test scenarios. This data is invaluable when refining the incident response plan and improving the speed and efficiency of operational resiliency processes.
    • Automated disaster recovery (DR) response: Resilient firms use PagerDuty Automation to automate the provisioning and failover of disaster recovery environments and single or multiple applications. Testing is crucial to ensure these processes can be executed swiftly and reliably when needed and so help support a firm’s business continuity plan.
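
For readers who want to track whether their test exercises are actually improving things, here is a small sketch of the arithmetic behind MTTA and MTTR: each is simply the average elapsed time from an incident being created to it being acknowledged or resolved. The dictionary fields (`created_at`, `acknowledged_at`, `resolved_at`) are a hypothetical export format of my own, not a PagerDuty schema; PagerDuty's built-in analytics already report these metrics, so treat this purely as an explanation of the calculation.

```python
from datetime import datetime
from statistics import mean


def minutes_between(start, end):
    """Elapsed minutes between two ISO 8601 timestamps."""
    delta = datetime.fromisoformat(end) - datetime.fromisoformat(start)
    return delta.total_seconds() / 60


def response_metrics(incidents):
    """Compute mean time to acknowledge and to resolve, in minutes.

    Incidents missing an acknowledgement or resolution timestamp are
    skipped for that metric; a metric with no data is returned as None.
    """
    ack_times = [
        minutes_between(i["created_at"], i["acknowledged_at"])
        for i in incidents
        if i.get("acknowledged_at")
    ]
    resolve_times = [
        minutes_between(i["created_at"], i["resolved_at"])
        for i in incidents
        if i.get("resolved_at")
    ]
    return {
        "mtta_minutes": mean(ack_times) if ack_times else None,
        "mttr_minutes": mean(resolve_times) if resolve_times else None,
    }
```

Comparing these figures across successive simulations gives a simple, auditable trend line to show that the response process is getting faster over time.
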
  4. ICT third-party risk management:

Firms must implement stringent measures to assess and manage risks associated with critical ICT third-party service providers (CTPPs).

    • Oversight into incident response practices: If a CTPP also uses the PagerDuty Operations Cloud, the financial institution might request access to PagerDuty reports that give insights into the CTPP’s incident response practices, its responsiveness to incidents, and its overall operational resilience.
    • Classification and testing: The PagerDuty Operations Cloud has been used to classify incidents based on the third-party origin of the issue. In addition, PagerDuty Automation has been used to run tests against CTPPs to ensure their availability and robustness.
  5. Information sharing, maintaining records, and documentation:

Under both DORA and UK regulations, FinServ institutions must maintain comprehensive documentation of their ICT risk management activities, incident reports, and test results. This documentation will be crucial for demonstrating compliance with these regulations during potential audits. In addition, DORA encourages information sharing among financial institutions and authorities regarding cyber threats and incidents. This collaborative approach aims to improve overall preparedness and response capabilities within the financial sector.

    • Centralized repository: PagerDuty is a ‘system of action’ and, as mentioned above, serves as an immutable centralized repository for incident data, including time-stamped activity details, communication logs, and resolution steps, all gathered during the heat of an incident. PagerDuty can also integrate with and automatically keep up-to-date the firm’s chosen ‘system of record’ – ITSM and ITOps tools. This simplifies and improves record-keeping and demonstrates a documented, repeatable, and consistent approach to incident management.
    • Reporting and analytics: As mentioned earlier, PagerDuty offers analytical and reporting functionality that can generate reports on incident trends, resolution times, and team performance. These reports give data-driven insights that can drive focused engineering remediation efforts and demonstrate ongoing efforts to improve operational resiliency.
    • Status pages: Information sharing is automatic and effortless if firms use PagerDuty’s web-based status pages, which are linked to and fed directly by incidents affecting Important Business Services.

Conclusion

Hopefully, it’s clear from the above that the PagerDuty Operations Cloud is highly applicable to the core pillars of DORA and the UK regulations:

The PagerDuty Operations Cloud provides a near real-time system of action designed to help you resolve your issue as quickly as possible whilst simultaneously updating your ITSM and documentation tools. It reduces noise and toil for Operations and NOC teams, allowing them to move from reactive fire-fighting to a proactive focus on problem-solving. It can auto-remediate issues to shorten resolution times and provide invaluable post-incident analytics and reports to help you learn and improve your processes. 

Finally, there’s one overarching point worth making. More than anything else, Financial Services regulators want to see that firms are thinking about, documenting, and investing in their operational resilience strategy. Investing in and deploying PagerDuty is a clear indication that FinServ institutions are taking operational resiliency seriously.

If you’d like any further details or information, please reach out.