All, None, and One: The SQL Left Join Trick
Introduction
PagerDuty released Multi-User Alerting in early 2014, which allowed notifying and assigning multiple people when an incident is triggered. In addition to assigning multiple users to an incident, multi-user alerting also makes it possible for an incident to have multiple acknowledgers. This post will demonstrate the changes we made to our data model to implement multi-user alerting and the resulting sophistication added to our SQL queries to maintain their performance.
Data Model Migrations
The logic and data model behind our incident process exists as part of a Rails app. Before the migration, our data model looked like:
class Incident < ActiveRecord::Base
has_one :acknowledging_user
has_one :assigned_user
# Time objects
acknowledged_at
resolved_at
end
After multi-user alerting rolled out, the data model looked like:
class Incident < ActiveRecord::Base
has_many :acknowledgements, class_name: :IncidentsAcknowledgement, dependent: :delete_all, inverse_of: :incident, order: :created_at
has_many :acknowledging_users, through: :acknowledgements, source: :user
has_many :assignments, class_name: :IncidentsAssignment, dependent: :destroy, inverse_of: :incident
has_many :assigned_to_users, through: :assignments, source: :user
# Time object, acknowledged_at moved to IncidentsAcknowledgement.created_at
resolved_at
end
where IncidentsAcknowledgement
and IncidentsAssignment
are classic join tables:
class IncidentsAcknowledgement < ActiveRecord::Base
belongs_to :incident, inverse_of: :acknowledgements
belongs_to :user
end
class IncidentsAssignment < ActiveRecord::Base
belongs_to :incident, inverse_of: :assignments
belongs_to :user
end
At the surface, this type of migration seems straightforward: A 1:1 relationship becomes a 1:many relationship. A total cake walk straight out of Data Modeling 101, right?
Not really.
Some of the most common queries PagerDuty performs require finding all incidents for an account with a given state: triggered, acknowledged, or resolved. We’ll discuss how we migrated each of these queries, starting with the easiest.
Resolved Incidents
The query for resolved incidents did not change at all with Multi-User Alerting. An incident can only be resolved once, with or without Multi-User Alerting. That leads to the following scopes:
scope :resolved, where.not(resolved_at: nil)
scope :unresolved, where(resolved_at: nil)
Triggered Incidents
Before Multi-User Alerting, the triggered scope was:
scope :triggered, unresolved.where(acknowledged_at: nil)
A triggered incident was merely one where both acknowledged_at
and resolved_at
were nil
. Under Multi-User Alerting, an incident can have multiple acknowledgements, so this scope needed to be changed. We needed a query to find unresolved incidents that had exactly zero acknowledgements.
In SQL, how does one find if a join table has exactly zero associations? LEFT JOIN
of course!
As a refresher, a left join guarantees that the result set contains all of the rows from the original table.
For example, consider the following two incidents, where one has some acknowledgements:
# Incidents
id | description | resolved_at
----------------------------------
1 | CPU high | NULL
2 | Server on fire | NULL
# Acknowledgements
id | incident_id | user_id
-----------------------------
100 | 2 | 9875
101 | 2 | 9876
Consider the following SQL query that uses left join:
SELECT *
FROM incidents
LEFT OUTER JOIN acknowledgements on acknowledgements.incident_id = incidents.id
The result set of the above query is:
id | description | resolved_at | id | incident_id | user_id
-----------------------------------------------------------------
1 | CPU high | NULL | NULL | NULL | NULL
2 | Server on fire | NULL | 100 | 2 | 9875
2 | Server on fire | NULL | 101 | 2 | 9876
Incident 1 (with zero acknowledgements) is the one where acknowledgements.incident_id IS NULL
. That leads to the following straightforward scope definition:
scope :triggered,
arel_left_join(:acknowledgements).
where(incidents_acknowledgements: {incident_id: nil}).
unresolved
where arel_left_join
is a helper defined as:
def self.arel_join(association, join_type)
incident_t = arel_table
association_model = reflect_on_association(association).klass
association_t = association_model.arel_table
predicate = association_t[:incident_id].eq(incident_t[:id])
joins(incident_t.join(association_t, join_type).on(predicate).join_sources)
end
def self.arel_inner_join(association)
arel_join(association, Arel::Nodes::InnerJoin)
end
def self.arel_left_join(association)
arel_join(association, Arel::Nodes::OuterJoin)
end
This new triggered scope translates to the following SQL
SELECT `incidents.*`
FROM incidents
LEFT OUTER JOIN acknowledgements on acknowledgements.incident_id = incidents.id
WHERE (acknowledgements.incident_id IS NULL) AND (resolved_at IS NULL)
Acknowledged Incidents
Before Multi-User Alerting, the scope for acknowledged incidents was:
scope :acknowledged, unresolved.where("acknowledged_at IS NOT NULL")
Extending this to support multiple acknowledgements was similar to how triggered
was extended. In this case, we are looking for unresolved incidents with at least one entry in the acknowledgements table. Instead of using LEFT JOIN
with acknowledgements.incident_id IS NOT NULL
, we can simply use a plain old INNER JOIN
.
Unfortunately, naively using INNER JOIN
as we did with triggered incidents is incorrect. To illustrate, consider the following implementation of acknowledged
:
# Naive and very broken
scope :acknowledged,
arel_inner_join(:acknowledgements).
unresolved
To see why this is incorrect, consider the generated SQL:
SELECT incidents.*
FROM incidents
INNER JOIN acknowledgements on acknowledgements.incident_id = incidents.id
WHERE (resolved_at IS NULL)
as well as our sample incident data from before:
# Incidents
id | description | resolved_at
----------------------------------
2 | Server on fire | NULL
# Acknowledgements
id | incident_id | user_id
-----------------------------
100 | 2 | 9875
101 | 2 | 9876
The result set will be:
id | description | resolved_at | id | incident_id | user_id
-----------------------------------------------------------------
2 | Server on fire | NULL | 100 | 2 | 9875
2 | Server on fire | NULL | 101 | 2 | 9876
The result set has each incident occurring once for each acknowledgement. However, we want an incident to occur exactly once if it has at least one acknowledgement, so we need a way to remove these duplicate incidents. It turns out that efficiently removing duplicates in SQL is not as trivial as it seems.
SQL DISTINCT
/ Rails uniq
The most straightforward approach is to use .uniq
in Rails:
# Naive and very broken
scope :acknowledged,
arel_inner_join(:acknowledgements).
unresolved.
uniq
which translates to the following SQL:
SELECT DISTINCT incidents.*
FROM incidents
INNER JOIN acknowledgements on acknowledgements.incident_id = incidents.id
WHERE (resolved_at IS NULL)
Unfortunately, the above query is not very performant. For accounts with a large number of acknowledged incidents, we found this query was 2 to 3 times slower than the straightforward (though incorrect) query. The explain plan for the above query showed Using temporary
, which indicates that MySQL was using a temporary table to compute the distinct rows. We suspect that a large part of the slowdown is based on MySQL copying all of the columns of each row to the temporary table for deduplication.
Use GROUP BY
In suspecting that the slowdown was caused by copying all of the columns (DISTINCT incidents.*
), we tried to find a way to have DISTINCT
run on just one column. GROUP BY
can be used exactly for this purpose, so we attempted the following query as a starting point:
SELECT incidents.*
FROM incidents
INNER JOIN acknowledgements on acknowledgements.incident_id = incidents.id
WHERE (resolved_at IS NULL)
GROUP BY incident.id
For accounts with large numbers of acknowledged incidents, this query was significantly faster than using DISTINCT
. Furthermore, the explain plan for this query did not contain Using temporary
. Unfortunately, this query too had one problem due to the implementation of .count
in Rails.
The Rails scope that generated the previous query is:
# This works, but breaks other things
scope :acknowledged,
arel_inner_join(:acknowledgements).
unresolved.
group('incident.id')
GROUP BY
affects the results of aggregate functions, specifically COUNT
. In Rails, to count the total number of incidents from a scope, you will use count
(e.g., Incident.acknowledged.count
). However, with group
, .count
returns the count per group, not the total count. In other words,
instead of getting an integer returned:
Incident.acknowledged.count => 7
the return value is a map of incident id to count for that incident
Incident.acknowledged.count => {19 => 1, 20 => 1, 21 => 1}
Lots of places in our app assume that calling count
returns an Integer, so changing that behaviour was not desirable. Wrapping the GROUP BY
query in its own subquery restored the correct behaviour of count
, but that killed the performance gains since a temporary table was again required. So close, yet so far.
Use a second LEFT JOIN
The crux of the initial problem of joining acknowledgements
directly against incidents
was that a single incident could have multiple acknowledgements, leading to duplicate incidents in the joined result.
What if there was a way to select a single exemplar acknowledgement for each incident? The MySQL manual describes a technique using LEFT JOIN
for finding the group-wise maximum of a certain column. Using that trick with our query, we can find the acknowledgement
with the max id for a given incident. We finally arrive at the following query:
SELECT incidents.* FROM incidents
INNER JOIN acknowledgements ON acknowledgements.incident_id = incidents.id
LEFT OUTER JOIN acknowledgements AS acknowledgements_uniq ON
acknowledgements.incident_id = acknowledgements_uniq.incident_id AND
acknowledgements.id < acknowledgements_uniq.id
WHERE (incidents.resolved_at IS NULL) AND (acknowledgements_uniq.id IS NULL)
To illustrate why this works, consider the following incident with four acknowledgements:
# Incidents
id | description | resolved_at
----------------------------------
3 | Server inferno | NULL
# Acknowledgements
id | incident_id | user_id
-----------------------------
200 | 3 | 9875
201 | 3 | 9876
202 | 3 | 9874
203 | 3 | 9873
and the following query without the acknowledgements_uniq.id IS NULL
clause:
SELECT incident.id, acknowledgement.id, acknowledgements_uniq.id FROM incidents
INNER JOIN acknowledgements ON acknowledgements.incident_id = incidents.id
LEFT OUTER JOIN acknowledgements AS acknowledgements_uniq ON
acknowledgements.incident_id = acknowledgements_uniq.incident_id AND
acknowledgements.id < acknowledgements_uniq.id
WHERE (incidents.resolved_at IS NULL)
The result is:
incident.id | acknowledgement.id | acknowledgements_uniq.id |
-------------------------------------------------------------
3 | 200 | 201 |
3 | 200 | 202 |
3 | 200 | 203 |
3 | 201 | 202 |
3 | 201 | 203 |
3 | 202 | 203 |
3 | 203 | NULL |
From the MySQL manual, the LEFT OUTER JOIN
works on the basis that when acknowledgements.id
is at its maximum value, there is no acknowledgements_uniq.id
with a greater value and the acknowledgements_uniq
rows values will be NULL
. Nick Kallen, author of Arel, calls this technique “one of the great mind-blowing SQL queries”.
The fantastic thing about this query is that it is just as performant as the GROUP BY
version, and does not require the use of a temporary table. In addition, this query does not adversely affect count
in Rails. Perfect!
The final scope for Incident.acknowledged
is:
def left_join_as_distinct(association)
association_model = reflect_on_association(association).klass
table = association_model.arel_table
table_alias = table.alias("#{table.name}_uniq")
predicate = table[:incident_id].eq(table_alias[:incident_id])
predicate = predicate.and(table[:id].lt(table_alias[:id]))
join_sources = table.join(table_alias, Arel::Nodes::OuterJoin).on(predicate).join_sources
joins(join_sources).where(table_alias[:id].eq(nil))
end
scope :acknowledged,
arel_inner_join(:acknowledgements).
left_join_as_distinct(:acknowledgements).
unresolved
Conclusion
SQL’s left join operator turned out to be a valuable tool in writing correct and efficient queries for multi-user alerting. Its usage was typical for the triggered incidents scope, which involved finding incidents with zero acknowledgements. Left join additionally proved versatile for the acknowledged incidents scope, where it was used as an efficient proxy for DISTINCT
to find unique incidents with at least one acknowledgement.
Deriving efficient queries involved profiling the original SQL queries, understanding the behaviour of Percona’s query optimizer, and experimenting with alternative queries. If challenges problems like these excite you, PagerDuty is hiring.
Thanks to Steve Rice and Evan Gilman for reviewing drafts of this article.