Understanding Systemic Issues: The PagerDuty Health Check Process
Continuous improvement is one of the fundamental tenets of Agile methodology that PagerDuty’s product development teams emphasize. This already works fairly well at the individual team level via retrospective meetings and postmortems, but sometimes we don’t notice larger or systemic issues that are outside the control of a single team. This blog post shares the process that we use at PagerDuty to uncover those issues, the outcomes we have seen, and how we have evolved that process.
How We Do Health Checks at PagerDuty
A couple of years ago, the Agile Leadership Team (ALT) implemented a health check initiative in order to better understand the state of our teams and potentially uncover systemic issues. The health check process is based on the model currently used by Spotify (see the Spotify Squad Health Check Model) but modified to suit our environment and extended to extract even more value from the process.
On a quarterly basis, everyone on our product development teams completes an anonymous Google Forms health check survey. The survey consists of 13 questions that ask individuals about how they are feeling on a range of topics, such as the value and quality of what their team is delivering, the processes that their team uses, and the general happiness of their team.
We intentionally keep the scoring pretty simple. Respondents can answer “happy,” “neutral,” or “sad” to each of the questions. We also ask them to rate how they see this area changing for their team. Their choices are: “Improving,” “Staying the same,” or “Getting worse.” A supporting comment can be provided for each question. The results are then compiled at a team level, and each team reviews and discusses their own team’s results.
Once all the teams have completed their team-level reviews, the results are consolidated and presented in a single grid. For general dissemination of the results, we intentionally leave the team names off the grid. We do this because the purpose of the exercise is not to compare one team with another, but to identify concerns that are shared across multiple teams.
Below is a sample piece of our consolidated health check results grid.
How to read this diagram:
- Each column represents a team.
- Each row represents the answers from each team on a specific topic.
- The circles represent the current happiness level.
- The arrow represents the trend—no arrow implies the trend is “Staying the same.”
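To make the grid layout concrete, here is a minimal Python sketch of how team-level results could be turned into an anonymized grid like the one described above. It is an illustration only, not the tooling we use: the function names, the text symbols standing in for circles and arrows, and the column shuffling are all assumptions made for this example.

```python
import random

# Hypothetical text symbols standing in for the circles and arrows in the grid.
MOOD_SYMBOLS = {"happy": "(+)", "neutral": "(o)", "sad": "(-)"}
# "Staying the same" maps to no arrow, matching the reading guide above.
TREND_SYMBOLS = {"Improving": "^", "Staying the same": "", "Getting worse": "v"}


def build_grid(team_results, topics, seed=None):
    """Build an anonymized grid: one row per topic, one column per team.

    team_results maps team name -> {topic: (mood, trend)}.
    Team names are used only as lookup keys and never appear in the output.
    """
    teams = list(team_results)
    # Assumption: shuffle columns so their order does not hint at team identity.
    random.Random(seed).shuffle(teams)
    grid = []
    for topic in topics:
        cells = []
        for team in teams:
            mood, trend = team_results[team][topic]
            cells.append(MOOD_SYMBOLS[mood] + TREND_SYMBOLS[trend])
        grid.append([topic] + cells)
    return grid


def format_grid(grid):
    """Render the grid as plain text with aligned columns and no team names."""
    width = max(len(row[0]) for row in grid)
    lines = []
    for row in grid:
        cells = "  ".join(cell.ljust(4) for cell in row[1:])
        lines.append(row[0].ljust(width) + "  " + cells)
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up sample data for two hypothetical teams.
    sample = {
        "Alpha": {"Autonomy": ("happy", "Improving"),
                  "Releasing code": ("sad", "Staying the same")},
        "Beta": {"Autonomy": ("neutral", "Getting worse"),
                 "Releasing code": ("sad", "Improving")},
    }
    print(format_grid(build_grid(sample, ["Autonomy", "Releasing code"])))
```

Keeping team names out of the rendered output, rather than hiding them afterwards, makes it harder to leak them by accident when the grid is shared widely.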
How We Use the Health Check Results
Once the results from all the teams are consolidated, we hold two separate meetings to discuss the data.
In the interests of transparency, we have one meeting where anyone from the product development organization can attend and hear about the results. The ALT presents interesting observations from this quarter, such as trends over time and sudden org-wide changes in scores for specific indicators.
Since the results of the health check are supposed to surface organizational issues, support (and sometimes action) is usually needed at an executive level. The second meeting we hold includes product development managers and our engineering leadership team, who review the summary of the health check results and discuss specific areas that might need their attention.
Improving the Health Check Process
Just as our product development organization has changed over time, so too has our health check process:
- We amend the questions on a regular basis to focus on areas that we feel need more attention. After each quarterly health check is complete, the ALT will discuss whether it would be useful to make changes to the questions for the next round of health checks. For example, after one quarter, we added a question about the on-call health of our teams and a question about how the user-experience process is working. We also combined some questions and eliminated some questions altogether.
- We found that simple scores without context were sometimes difficult to interpret, so we put an emphasis on asking respondents to add comments to support their scores.
- While the individual surveys are still anonymous, we found that the more information we shared with management, the easier it was for them to respond. In the early days, we didn’t share comments or team names with managers, but now we share all the information we can.
- We’ve recently done some work with Google Forms to automatically collect and summarize the results from all teams into a single spreadsheet rather than have people manually cut and paste their team’s results.
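As a rough sketch of what automated summarization might look like, the Python example below reads survey responses exported as CSV and compiles a per-team answer for each question. Everything specific here is an assumption made for illustration: the "Team" column name, the CSV layout, and the choice to summarize each question by its most common answer. It does not describe our actual Google Forms setup.

```python
import csv
import io
from collections import Counter, defaultdict


def summarize_responses(csv_text, team_column="Team"):
    """Return {team: {question: most common answer}} from CSV survey rows.

    Assumes one row per respondent, a column naming the respondent's team,
    and one column per question. Blank answers are skipped.
    """
    tallies = defaultdict(lambda: defaultdict(Counter))
    for row in csv.DictReader(io.StringIO(csv_text)):
        team = row.pop(team_column)
        for question, answer in row.items():
            if answer:
                tallies[team][question][answer] += 1
    # On a tie, most_common returns the answer that was encountered first.
    return {
        team: {q: counts.most_common(1)[0][0] for q, counts in questions.items()}
        for team, questions in tallies.items()
    }


if __name__ == "__main__":
    # Made-up export with hypothetical team and question names.
    export = (
        "Team,Autonomy,Releasing code\n"
        "Alpha,happy,sad\n"
        "Alpha,happy,neutral\n"
        "Alpha,sad,sad\n"
        "Beta,neutral,happy\n"
    )
    for team, answers in summarize_responses(export).items():
        print(team, answers)
```

A most-common-answer summary hides how split a team is, so a real version would probably keep the full counts alongside it, together with the supporting comments.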
What the Results Have Shown Us
Over time, the health checks have surfaced many important issues within our organization. During one health check, we noticed that many teams reported poor scores for one specific question regarding autonomy. Those particular teams had been directed to work on specific projects with specific deadlines, and this approach had a noticeable negative impact on answers to the question. Soon after, there was a change in how teams took on projects, and we noticed that the scores for the autonomy question jumped back up again.
Another area of concern that surfaced through a health check was dissatisfaction with the ease of releasing code to production. Multiple engineering teams have focused on that area since that issue was first surfaced, and the quarterly results have slowly improved over time. Without the health check results that highlighted the need to focus on that process, we may not have made such an investment in that area.
Our Biggest Learning: Organizational Change Is Hard
While one of the goals of our health check process is to surface organizational-level issues, one of the biggest challenges we’ve had with the process is trying to influence changes at an organizational level.
For some of the early health checks, we tried task forces or Tiger Teams that would focus on specific low-performing areas. However, getting this initiative to the top of anyone’s list was difficult, and it was tough for tiny groups of people to change things at an organizational level without buy-in from all levels. In the past, we’ve also asked the Executive Team to help us address specific issues, because buy-in at the executive level can be really important to the success of any initiative, regardless of whether the executives themselves are tasked with specific actions.
Ultimately, the approach that has yielded the best results for us is to address organizational problems at a team level after getting management buy-in (which we ask for during the second meeting mentioned above). For example, a health check identified that teams felt there were too many streams of work coming at them, so we had multiple discussions about prioritization with many teams to try and mitigate the problem. Having multiple teams addressing the same issue simultaneously has resulted in an overall improvement in health check scores for the organization.
---
PagerDuty has changed significantly since we started the health check process, but we’ve found that it continues to provide useful insights into the health and happiness of our teams. It’s great to have a tool in our Agile toolbox that helps us understand the overall state of our organization and helps us identify areas where we could be doing better.
If you’re working in an environment that uses health checks, we’d love to hear how it’s working for you; join the PagerDuty Community to share your best practices and tips. And if you’re working somewhere that doesn’t yet use health checks, maybe give it a try to see if they can provide some of the same organizational insights that they have for us here at PagerDuty—and then join our Community to share your experiences!