Let’s Talk AIOps: Part 1: What IS AIOps, Exactly?
Editor’s Update: Since this blog was last published, we have announced a number of new features in our AIOps solution. You can read about them in this blog post or learn more at https://www.pagerduty.com/use-cases/aiops/
This is the first in a two-part blog series deconstructing AIOps for ITOps leaders.
If you gave me a dollar for every company that claims that they use “A.I.,” I’d be doing pretty well. But as a marketer, I can’t help but be a little skeptical about those claims. Let me explain.
First, there are over 50 vendors in the Gartner market guide for AIOps—that’s a lot of AIOps “solutions.” Then try putting yourself in the customer’s shoes and you likely will start to wonder: “How do I know what I’m getting exactly?”
It’s tough out there. Companies are being pushed harder than ever to transform the business, move the needle, and go digital faster. Even before the pandemic struck, many IT organizations were already turning to AIOps as a potential investment area. They’re hopeful that investing in this new technology will improve operations, provide more data-driven insights, and enable them to scale more efficiently and address issues faster.
But all that still raises the question: What actually is AIOps…and, perhaps more importantly, is it all hype?
The thing is, hype or not, the term isn’t going anywhere. To help clear the noise around AIOps, I sat down with Julian Dunn, Director of Product Marketing at PagerDuty, to level-set on the topic.
Q: Is AIOps just marketing hype?
JD: It may seem that way. But if we remove all the marketing, there is a big opportunity for AI and machine learning (ML) to play an important role in real-time work. If we absolutely had to define it, I think the simplest definition is that AIOps is the “future of monitoring,” one that includes not just the network, servers, and applications, but also the full end-to-end digital customer experience—and the ability to relate those areas to one another.
For real-time operations, AIOps has the potential to create insights across multiple domains to help provide in-depth context around underlying causes of outages. Let’s say an application is slow. Can we relate that all the way back to the customer? Was the cause due to, for example, a deluge of customers purchasing one specific product or interacting with our site in a way that’s slowed down the application? That’s the dream that we’re just taking baby steps towards now as an industry.
Ultimately, AI and machine learning sound like fancy terms, but remember: It’s algorithms, math, and statistics under the hood. What we have today is both an incredible amount of data and an equally incredible amount of computing power in the cloud to apply to it, in a way that wasn’t possible 10 years ago. So ironically, when customers say “I can’t deal with this much data,” it’s actually a problem that ML algorithms love to chew on. The more data you have, the more feasible it is to train an AI model, and the more accurate that model can be.
I do want to sound a note of caution, though, which is to make sure customers can clearly articulate the business problem they’re trying to solve with AIOps. Otherwise, it’s just another fun tool.
Q: What is actually driving people to buy AIOps?
JD: There are two things that we’re hearing most from IT leaders:
- They want to do more with the same number of people. Infrastructure and data volumes are growing astronomically, and they don’t have the operating budget to get more headcount.
- They want to reduce risk and are hoping that AI will help find the root cause faster and get people collaborating on fixing issues faster.
So that’s the entry point for the promise of AI for real-time ops. I won’t deny the fact that AI is and can be valuable, but it’s certainly not a silver bullet that will solve all of central IT’s problems. Some of what they’re looking for is simply not easily achievable, and some of it may not even be possible in the realm of computer science.
For example, with current AIOps solutions, what’s achievable is making recommendations when current behavior closely resembles past behavior, such as: “Service X is downstream of Service Y. We notice that the majority of the time, incidents on Service Y cause incidents on Service X within 5 minutes. Recommendation: Look at Service Y before Service X.”
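To make that kind of recommendation concrete, here is a minimal sketch of how it might be derived from incident history. Everything in it is an assumption for illustration: the incident log, the service names, and the helpers `follow_rate` and `recommend` are hypothetical, and this is not how PagerDuty or any particular AIOps product works. It simply counts how often an incident on one service is followed by an incident on another within a 5-minute window.

```python
from datetime import datetime, timedelta

# Hypothetical incident log: (service name, time the incident opened).
INCIDENTS = [
    ("service-y", datetime(2021, 1, 1, 10, 0)),
    ("service-x", datetime(2021, 1, 1, 10, 3)),
    ("service-y", datetime(2021, 1, 2, 14, 0)),
    ("service-x", datetime(2021, 1, 2, 14, 4)),
    ("service-y", datetime(2021, 1, 3, 9, 0)),
    ("service-x", datetime(2021, 1, 5, 16, 0)),
]


def follow_rate(incidents, upstream, downstream, window=timedelta(minutes=5)):
    """Fraction of upstream incidents followed by a downstream incident within the window."""
    up_times = [t for s, t in incidents if s == upstream]
    down_times = [t for s, t in incidents if s == downstream]
    if not up_times:
        return 0.0
    # Count upstream incidents that have at least one downstream incident
    # opening between 0 and `window` after them.
    followed = sum(
        any(timedelta(0) <= d - u <= window for d in down_times)
        for u in up_times
    )
    return followed / len(up_times)


def recommend(incidents, upstream, downstream, threshold=0.5):
    """Return a recommendation string if the pattern holds most of the time, else None."""
    if follow_rate(incidents, upstream, downstream) > threshold:
        return f"Look at {upstream} before {downstream}"
    return None


print(recommend(INCIDENTS, "service-y", "service-x"))
```

Note that the sketch only ever reports patterns that already appear in the history it is given. A failure mode that has never occurred before leaves no trace in the counts, so nothing is recommended for it.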
What’s not achievable, however, is the ability to predict or detect behaviors that wildly deviate from past behavior, which is what customers want and what a lot of vendors claim they’re able to do. In cybernetics, this limitation is tied to the “Law of Requisite Variety” (you can look it up).
In other words, statistics and data about past incidents and human behavior, fed continually into an AI model, can help us make confident assumptions about new incidents when they closely resemble past ones. But many people are looking for an “AI magic machine” that will help them detect and resolve issues even when current data and behavior are significantly different from what’s been seen in the past, and therein lies the challenge, and the disappointment.
Q: So let me play devil’s advocate here for a minute. Does IT actually have an AIOps problem?
JD: From what we’ve seen, our customers do have an AIOps problem—it’s just often not the one they think they have.
Most of the time (and our conversations with industry analysts back this up), it’s a central IT organization that’s looking to buy an AIOps solution. Typically some of the “jobs to be done” are noise reduction, anomaly detection, and correlation of events across services. Without getting into the theory, though, this often runs into the same feasibility issues I articulated above. For example, it’s tough to label something with high certainty as an anomaly that needs action if you’ve literally never seen it before.
The other blind spot for central IT organizations is that they often overlook the fact that the operations world has changed. Operations teams are increasingly taking on a decentralized, “full-service ownership” model, where lines of business staff their own technology teams, each with their own culture, velocity, toolchain, etc.
Interestingly enough, this decentralization actually makes it more feasible to do AIOps, because it segregates events in a way that makes them easier for algorithms to operate on. Yet we often see central IT organizations fighting the move to decentralized event management even though it would make everyone’s lives easier!
Not only that, but there’s a cultural and organizational issue here, too. An AIOps approach that focuses only on the centralized team’s needs will not deliver the required ROI: if you buy such a solution and try to impose it on decentralized teams, they will push back, making it very difficult to realize the business goals you hoped to achieve.
Think about central teams who try to buy ITSM solutions and shove them down the throats of developers. That rarely goes well. And one of those business goals that gets lost in the rush to deliver features is the need for technical teams to collaborate, particularly between centralized and decentralized groups. All the AI-driven noise reduction in the world isn’t helpful if teams aren’t communicating when things go wrong because they refuse to use each other’s tools.
Keep an eye out for part 2 of this topic next week, where we’ll talk about key considerations IT leaders should think about if they’re in the market for an AIOps solution. We’ll also share what PagerDuty’s approach and investment areas are to bring AIOps to life for our customers.
To further help reduce the noise around AIOps, Julian and I have collaborated to put together a webinar on this very topic, featuring our SVP of Product and Product Marketing, Jonathan Rende: “AIOps Explained: What It Is and How It Can Boost Real-Time Operations.” You can watch the on-demand recording on your own time to get a taste of how we at PagerDuty navigate the many different philosophies around what AIOps can be to formulate our perspective on the topic. In the webinar, Jonathan also shares ways you can evaluate technologies that promise AIOps and work out where it can fit in your broader strategy.
This is just the tip of the iceberg for what PagerDuty has to offer when helping our customers leverage intelligence and automation to alleviate real-time work so teams can focus their time innovating instead of fighting fires. We also have some big product announcements coming at our annual conference, PagerDuty Summit, in just a few weeks, so if you haven’t already, register today for free to save your spot!