
Honeycomb’s Improved Incident Management Process Removes Bottlenecks and Leads to Rich Results Across the Organization


Size: 201-500

Industry: Technology

Location: San Francisco, California

Key Integrations:

Slack
Zoom

Before Jeli

Honeycomb has quickly emerged as a leader in the observability space, with an innovative team leading the charge. During a period of rapid growth, their SRE team began to feel the strain of what it takes to analyze and learn from incidents:

  • The process was labor-intensive: a small team had to manually copy and paste Slack channel messages into Google Docs and piece together key moments in time across disparate systems and tools.
  • Communication and coordination between internal teams during an active incident lacked clear ownership.
  • Incidents often led internal teams to ask questions in a shared operations Slack channel, which made it hard for responders to focus on the task at hand: diagnosing and resolving the incident. The key change Jeli helped facilitate is that the moment an issue looks interesting, it now gets a dedicated Jeli channel.

Honeycomb’s engineering team was looking for a better way to learn from the incidents they were experiencing. They wanted to find themes and patterns that would help identify gaps in their systems and areas for improvement across technical and non-technical teams. They ended up getting that and much more: Jeli now also helps the team respond to and analyze incidents more efficiently, and after using Jeli for a while, they expanded their usage to improve how they worked with internal teams such as Sales and Customer Success during incidents.

Identifying a Solution

Honeycomb began using Jeli for incident analysis with the initial goal of scaling their SRE team and minimizing single points of failure in incident management. They were also hoping to share learnings from their incidents with stakeholders in Sales, Customer Success, Leadership, and beyond.

With a learning culture already baked into the organization, the next step was to solve some of the challenges that come with keeping stakeholders informed during an incident. Honeycomb turned to Jeli’s Incident Response bot to help them continue building out their incident management practice, especially as their teams continued to grow.

“It’s the stuff you get with Jeli, which is a temporary channel, that is discoverable in a single place. Everybody knows what it is. Everybody can do it.”

– Ian Smith, Engineering Manager, Honeycomb

The Results

Fast forward to today, and Honeycomb has successfully scaled their incident management practice from one person to the entire Platform Engineering team, which now participates in both incident response and learning reviews.

  • Jeli’s IR Bot makes it easier for responders to communicate with team members—automatically broadcasting messages to critical Slack channels to share updates with other teams in Sales, Customer Support, and leadership.
  • Auto-importing of messages and Slack threads into Jeli makes analyzing incidents a breeze compared to the previous method of copying and pasting messages into a Google Doc.
  • Jeli’s Narrative Builder gave Honeycomb a lightweight (and more enjoyable) way to build a timeline that tells the story of how an incident unfolded. Engineers now spend less time looking for information and more time writing high-quality reports and investigating the incidents that yield learning and growth opportunities for their team.

    “Back when we evaluated Jeli, I made an experiment where I had annotated a major incident (7h+ duration) by hand, and it had taken me ~4 days (which probably accounted for between 18-25h). I later re-analyzed the incident with Jeli and it took ~6 hours. That analysis duration was one of the key points in switching to Jeli.”

    – Fred Hebert, Staff SRE, Honeycomb

Summary

With Jeli as a key component of its incident management program, Honeycomb has made the incident management lifecycle more efficient and more useful. Jeli’s IR Bot has given people time back to focus on fixing the problem and to create higher-quality post-incident reviews that capture real facts and experiences and drive critical conversations across the organization.

“Our management team uses incident analyses in Jeli to make informed decisions in our roadmap planning. The platform allows us to reference documents and learnings to drive continuous improvement of our software.”

– Ian Smith, Engineering Manager, Honeycomb

Jeli is now a key part of Honeycomb’s onboarding process for new on-call engineers, creating a simple and repeatable process as the company continues to grow.