
AI Monitoring and LLMOps with PagerDuty

by Mitra Goswami | February 5, 2025 | 5 min read

This post was authored by Mitra Goswami, Ralph Bird, Everaldo Aguiar, and Scott Sieper.

Over the past two years, generative AI (GenAI) has come a long way, from the early excitement around ChatGPT to initial explorations and, increasingly, to companies deploying GenAI-powered features into production. As the field continues to evolve, with breakthroughs announced almost daily, we at PagerDuty have been on-call thinking about all aspects of this transformation and how to safely leverage GenAI to further improve our product and better support you with yours.

The PagerDuty Operations Cloud sets itself apart in how it uses AI/ML to help teams eliminate alert noise, improve triage, manage and learn from incidents, automate tasks, and streamline communications. Central to that, we’ve recently announced PagerDuty Advance, which adds a layer of GenAI capabilities to our features and provides tangible improvements to the incident management lifecycle.

Thousands of users rely on us to ensure they can maintain high levels of trust in their products, and just as a server crash can erode that trust, so can a “model hallucination.” While monitoring traditional infrastructure is something we were very familiar with, monitoring AI models (particularly LLMs) is a new challenge. In light of that, we want to share a few notes on the lessons we’ve learned so far and how we are looking at the AI monitoring space moving forward.

LLMOps differs from other Ops roles in several ways, and companies from startups to the traditional big players provide different tools to help with the process. Whether by adding guardrails that block inappropriate content in real time or monitoring that identifies performance issues, these tools combine into a toolkit that helps engineers operate GenAI in production. But unlike traditional Ops, where a system is either “on” or “off,” what should you do with an LLM whose output is inherently non-deterministic and depends so heavily on the user’s inputs?

This non-determinism and input sensitivity make monitoring harder. Is the signal that you are seeing a true issue, a difference in user behavior, or a random fluctuation in the LLM’s output? How do you know whether to wake your engineers up or let them rest? Take these two examples:

Security watch: Real-time jailbreak monitoring and alerting
You have a monitor on your guardrails blocking jailbreaks. It was just triggered. Is this someone trying to break your system and leak your IP? Before you declare a security incident, you need to determine whether this is just a spike in the normal rate of false positives or a deliberate attack. Automation triggered by the event sent into PagerDuty can help here. By running a simple script, you can tell whether this is due to a single customer (a likely attack – let’s wake people up) or lots of customers (more likely noise – let them sleep). This automatic triaging allows us to set a low threshold on the monitor to catch all attacks whilst letting our engineers sleep through the false alarms.
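As a minimal sketch of what that triage script could look like, the snippet below groups recent guardrail-block events by customer and sends an event into PagerDuty via the Events API v2 with a severity that reflects the diagnosis. The fetch_recent_guardrail_blocks helper, the PD_ROUTING_KEY environment variable, and the ATTACK_THRESHOLD value are illustrative placeholders, not part of any PagerDuty product.

```python
"""Hypothetical triage script run by event-driven automation: it inspects
recent guardrail-block events and decides whether the spike looks like a
targeted attack (one customer) or background noise (many customers)."""
import os
from collections import Counter

import requests

PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"
ROUTING_KEY = os.environ["PD_ROUTING_KEY"]  # integration key for the service

ATTACK_THRESHOLD = 10  # blocks from a single customer before we page (illustrative)


def fetch_recent_guardrail_blocks() -> list[dict]:
    """Placeholder: pull recent guardrail-block events from your logging store.
    Each event is assumed to carry a `customer_id` field."""
    raise NotImplementedError


def triage_and_alert() -> None:
    blocks = fetch_recent_guardrail_blocks()
    per_customer = Counter(event["customer_id"] for event in blocks)
    top_customer, top_count = per_customer.most_common(1)[0] if per_customer else (None, 0)

    # One customer generating most of the blocks looks like a deliberate attempt;
    # blocks spread thinly across many customers are more likely false-positive noise.
    likely_attack = top_count >= ATTACK_THRESHOLD
    severity = "critical" if likely_attack else "info"

    requests.post(
        PAGERDUTY_EVENTS_URL,
        json={
            "routing_key": ROUTING_KEY,
            "event_action": "trigger",
            "payload": {
                "summary": f"Jailbreak guardrail triggered {len(blocks)} times",
                "source": "llm-guardrail-monitor",
                "severity": severity,
                "custom_details": {
                    "distinct_customers": len(per_customer),
                    "top_customer": top_customer,
                    "top_customer_blocks": top_count,
                },
            },
        },
        timeout=10,
    )
```

Because the event carries the per-customer breakdown in its custom details, the engineer who does get paged starts with the context they need rather than a bare alarm.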

Quality watch: Smart monitoring and real-time alerts
Many companies use a third-party vendor for their LLMs. This reliance adds a potential failure mode where changes to the model can change the output quality. But how do you know when this has happened? Step one is to monitor key parameters (we favor quick/cheap metrics here, like output length or answer relevance scored using a small model); this lets you see that something could have changed, but is the change in the model? Are users interacting with the product differently? Or is it just random variation from a non-deterministic LLM? The best way to tell is to run a test set of data with known answers and verify whether there has been a change (e.g., through LLM as a judge). Again, we can use automation to trigger this test so that the alerted engineer has the information they need when they start investigating the event.
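Here is a minimal sketch of that follow-up check: replay a small test set with known good answers through the production model and have a judge model grade each response, so the on-call engineer can see whether quality actually regressed. The test set, the call_production_model and judge_score helpers, and the baseline/tolerance values are all illustrative placeholders.

```python
"""Hypothetical regression check run by automation when a quality metric drifts."""
import statistics

# (test prompt, reference answer) pairs with well-understood expected output
TEST_SET = [
    ("Summarize the incident timeline for INC-1234.", "reference summary goes here"),
    # ...more cases...
]


def call_production_model(prompt: str) -> str:
    """Placeholder: call whichever LLM endpoint serves your feature."""
    raise NotImplementedError


def judge_score(prompt: str, reference: str, candidate: str) -> float:
    """Placeholder: ask a judge LLM to rate the candidate against the
    reference on a 0-1 scale and parse the score out of its reply."""
    raise NotImplementedError


def run_regression_check(baseline_mean: float, tolerance: float = 0.05) -> dict:
    scores = [
        judge_score(prompt, reference, call_production_model(prompt))
        for prompt, reference in TEST_SET
    ]
    mean_score = statistics.mean(scores)
    # Compare against the historical baseline; attach the result to the incident
    # so the responder immediately knows whether the model itself has shifted.
    return {
        "mean_score": mean_score,
        "baseline_mean": baseline_mean,
        "regressed": mean_score < baseline_mean - tolerance,
    }
```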

Automation and workflows
PagerDuty allows teams to build response playbooks that outline steps for troubleshooting and resolving common LLM issues. These playbooks can be automatically triggered in response to specific incidents, helping to standardize and speed up responses across the team. With the rise of agentic AI, these response plays will become intelligent. Rather than following a pre-scripted workflow, they will diagnose the issue, undertake low-risk remediation (such as triggering a retraining job), and only alert an engineer if they need authorization to undertake higher-risk actions (like blocking a user’s access).
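To make the shape of such a response play concrete, here is a minimal sketch under assumed names: the diagnose, run_action, and escalate_for_approval helpers are hypothetical stand-ins for whatever runbook automation you use, and the action mappings are illustrative.

```python
"""Hypothetical structure of an automated response play: diagnose first,
remediate automatically when the action is low risk, and page a human only
when a higher-risk action needs authorization."""

LOW_RISK_ACTIONS = {"drift_detected": "trigger_retraining_job"}
HIGH_RISK_ACTIONS = {"suspected_abuse": "block_user_access"}


def diagnose(incident: dict) -> str:
    """Placeholder: classify the incident, e.g. 'drift_detected' or 'suspected_abuse'."""
    raise NotImplementedError


def run_action(action: str) -> None:
    """Placeholder: execute a pre-approved automation job."""
    raise NotImplementedError


def escalate_for_approval(incident: dict, action: str) -> None:
    """Placeholder: page an engineer with the diagnosis and the proposed action."""
    raise NotImplementedError


def handle_incident(incident: dict) -> None:
    diagnosis = diagnose(incident)
    if diagnosis in LOW_RISK_ACTIONS:
        # Low-risk remediation runs without waking anyone up.
        run_action(LOW_RISK_ACTIONS[diagnosis])
    elif diagnosis in HIGH_RISK_ACTIONS:
        # Higher-risk actions still require a human in the loop.
        escalate_for_approval(incident, HIGH_RISK_ACTIONS[diagnosis])
```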

Integrations
PagerDuty integrates with the LLMOps monitoring vendor Arize.

The Arize and PagerDuty integration monitors your production models and sends alerts to PagerDuty when they deviate beyond a defined threshold. Together, Arize and PagerDuty keep your teams in the loop, attach more comprehensive metadata to alerts, and help you debug your models faster than ever before. Arize assists with ML performance tracing, unstructured data monitoring, and automated model monitoring.

Conclusion
As the use of AI, and GenAI in particular, grows, companies will encounter an increasing number of challenges in running their systems reliably and securely. Monitoring is an important first step in this process, but without proper handling of the resulting alerts, how can you maximize their value to your business whilst minimizing the disruption to your team? This is where PagerDuty, particularly its automation capabilities, can help. By minimizing noise and giving engineers the information they need, you can reduce incidents, improve performance, and keep delivering the service your customers rely upon.