Engineering Blog

Making Better Decisions With Athena and S3

by Andrew Xu | April 9, 2020 | 7 min read

In a world of highly complex systems, it isn’t uncommon to use different data storage technologies and mechanisms for different purposes, as each technology has its own strengths and weaknesses. Here at PagerDuty, we are no different.

In 2017, we made a conscious decision to leverage S3 to power one of our microservices, accepting the trade-offs as they stood at the time. As that service scaled, one major trade-off of storing data in S3 surfaced: we could not query the data to ask questions and quickly validate assumptions. In this blog, I’ll share how, in 2019, PagerDuty solved this problem inexpensively, so that you, too, can start querying your data stored in S3 and your teams can make better product decisions.

Why Athena/Glue Is an Option

Amazon S3 is a simple storage mechanism with built-in versioning, expiration policies, high availability, and more, which provides our team with many out-of-the-box benefits. It does a great job with storage, but if the data being stored contains valuable insights that can help you make better decisions by validating assumptions, S3 alone is not enough. To query data stored as JSON files on S3, Amazon offers two options: Amazon S3 Select and Amazon Athena.

The first option we looked into was Amazon S3 Select. If you have a single JSON file that contains all of the data, this simple solution is for you. Take this as an example: Sally owns a convenience store where she sells some products. All of these products, along with information about her store, are stored in a single JSON file as follows:

{
  "name": "Sally's Corner Store",
  "inventory": [
    {"product_name": "Chips", "quantity": 20, "cost": 1.50},
    {"product_name": "Gum", "quantity": 5, "cost": 0.75},
    {"product_name": "Chocolate Bar", "quantity": 53, "cost": 1.25},
    ...
  ]
}

In this case, by using S3 Select, we can ask questions such as, “What is the name of my store?” or “Which products do I have on hand that are running low?”:

  SELECT s.name
  FROM S3Object s;

  SELECT *
  FROM S3Object[*].inventory[*] AS p
  WHERE p.quantity < 10;

This method requires no additional setup: it is built into the AWS SDKs (such as boto3 for Python), so you can begin using it without any overhead.
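For reference, here is a minimal sketch of the first query above using boto3; the bucket and key names are hypothetical:

import boto3

s3 = boto3.client("s3")

# S3 Select runs the SQL server-side, against the object in place
response = s3.select_object_content(
    Bucket="sallys-data",  # hypothetical bucket
    Key="store.json",      # hypothetical key
    ExpressionType="SQL",
    Expression="SELECT s.name FROM S3Object s",
    InputSerialization={"JSON": {"Type": "DOCUMENT"}},
    OutputSerialization={"JSON": {}},
)

# The response is an event stream; matching rows arrive in Records events
for event in response["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"))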

However, if your data is spread across separate JSON files (assuming these files share the same JSON schema), then you’ll need to use Athena, which was exactly the case for us here at PagerDuty. Let’s take the previous example and assume that Sally’s business got so big that she began opening stores across the country. Each store is now represented by its own JSON file, which still contains a name, an inventory, and so on:

store_1.json
{
  "name": "Sally's Corner Store",
  "inventory": [
    {"product_name": "Chips", "quantity": 20, "cost": 1.50},
    {"product_name": "Gum", "quantity": 5, "cost": 0.75},
    ...
  ]
}

store_2.json
{
  "name": "Another Corner Store",
  "inventory": [
    {"product_name": "Chips", "quantity": 12, "cost": 1.50},
    {"product_name": "Gum", "quantity": 42, "cost": 0.75},
    ...
  ]
}

Amazon Athena is a query service that Amazon provides to help us analyze the data we have in S3 using standard SQL. Unlike S3 Select, this method carries a bit of overhead, simply because the separate JSON files must be tied together so that they can be queried as one. This is where the AWS Glue Data Catalog comes into play.

In a traditional database, you start by telling the database what the data looks like with CREATE TABLE (the table schema), then data gets written to it via INSERT (storage), and finally it is queried via SQL. This all works because the database knows what the data looks like and where it is stored, which is what enables querying. By storing data in S3 as JSON files, however, we never told the storage layer what the data looks like (the table schema); therefore, we lack the ability to query it.

The AWS Glue Data Catalog fills this gap by discovering the schema of these JSON files (using Crawlers) so that we can query them with Athena. Simply put, there are two ways for it to discover the schema: automatically or manually. In the automatic method, the Crawler scans through all the JSON files in a given bucket/folder and tries to infer a schema based on commonalities. This method is quite intuitive and convenient for most use cases: by scheduling recurring runs, the schema is updated periodically as the JSON files change, which avoids the need to define and maintain it by hand. For our use case, however, it was slightly more complicated. Some free-form attributes within the JSON made the schema difficult for the Crawler to infer, so we opted for the second option and defined the schema (a Glue Catalog Table) manually.
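If your data does lend itself to automatic inference, provisioning a Crawler takes only a few inputs. Here is a minimal sketch with boto3; the crawler name, IAM role, and S3 path are hypothetical:

import boto3

glue = boto3.client("glue")

# Point a Crawler at the folder of JSON files and let it infer the schema
glue.create_crawler(
    Name="stores-crawler",                               # hypothetical name
    Role="arn:aws:iam::123456789012:role/glue-crawler",  # hypothetical IAM role
    DatabaseName="sally_corporation",
    Targets={"S3Targets": [{"Path": "s3://sallys-data/stores/"}]},
    Schedule="cron(0 0 * * ? *)",  # re-scan daily to pick up schema changes
)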

We used Terraform to provision all the necessary access policies, a Glue Catalog Database, and a custom Glue Catalog Table (with a manually defined schema) within that database:

# First, provision the database to house the table
resource "aws_glue_catalog_database" "my_db" {
  # Glue database names must be lowercase, without spaces
  name        = "sally_corporation"
  description = "All that there is to know about Sally Corporation"
}

# Next, provision the table in the database we just provisioned,
# with a custom schema and the S3 path where the data lives
resource "aws_glue_catalog_table" "stores" {
  name          = "stores"
  database_name = "${aws_glue_catalog_database.my_db.name}"

  storage_descriptor {
    location      = "${var.some_s3_path}" # the S3 folder holding the JSON files
    input_format  = "org.apache.hadoop.mapred.TextInputFormat"
    output_format = "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"

    ser_de_info {
      name                  = "my-store-serde"
      serialization_library = "org.openx.data.jsonserde.JsonSerDe"

      parameters = {
        "serialization.format"  = "1"
        "paths"                 = "name,inventory..."
        "ignore.malformed.json" = "true"
        "case.insensitive"      = "true"
      }
    }

    columns {
      name = "name"
      type = "string"
    }

    columns {
      name = "inventory"
      # decimal(10,2) keeps the cents; a bare decimal defaults to decimal(10,0)
      type = "array<struct<product_name:string,quantity:int,cost:decimal(10,2)>>"
    }
  }
}

Once this was all said and done, we were able to select this database in Athena within the AWS console and begin querying the data just like a traditional database!
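The queries don’t have to stay in the console, either. As a minimal sketch (the results bucket here is hypothetical), the low-inventory question from earlier can be run programmatically with boto3; CROSS JOIN UNNEST is how Athena’s SQL flattens the nested inventory array:

import time

import boto3

athena = boto3.client("athena")

# Flatten each store's inventory array and find the items running low
query = """
    SELECT name, item.product_name, item.quantity
    FROM stores
    CROSS JOIN UNNEST(inventory) AS t(item)
    WHERE item.quantity < 10
"""

execution = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "sally_corporation"},
    ResultConfiguration={"OutputLocation": "s3://athena-results-bucket/"},  # hypothetical
)

# Athena runs queries asynchronously, so poll until this one finishes
query_id = execution["QueryExecutionId"]
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

# The first row is the header; the rest are the matching stores and items
for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
    print([col.get("VarCharValue") for col in row["Data"]])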

What We Were Able to Accomplish

By leveraging AWS Athena and the Glue Data Catalog, we were able to unlock some of the capabilities that S3 alone lacked, specifically the ability to query the data within. Here are just a few recent examples of what this technology has enabled us to do:

  • We were able to proactively answer questions in real time over a data set that we previously considered difficult to query. In Sally’s example, we can now quickly verify which stores are running low on any given item (so that we can reorder before it goes out of stock) and the average price of a given item across her stores (so that we can ensure sales of that item remain profitable); see the sketch after this list.
  • This gave us the ability to triage problems in real time, too. Several times, we were alerted about errors in our logs, and the cause turned out to be bad data written to one of the JSON files. After making the code path more robust against these errors, we were also able to leverage Athena to quickly identify the other impacted files and fix them.
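As a sketch of that average-price question (reusing the same boto3 pattern and hypothetical results bucket as above), the aggregation is a single query over the flattened inventory:

import boto3

athena = boto3.client("athena")

# Average cost of each product across all of Sally's stores
avg_cost_query = """
    SELECT item.product_name, AVG(item.cost) AS avg_cost
    FROM stores
    CROSS JOIN UNNEST(inventory) AS t(item)
    GROUP BY item.product_name
"""

athena.start_query_execution(
    QueryString=avg_cost_query,
    QueryExecutionContext={"Database": "sally_corporation"},
    ResultConfiguration={"OutputLocation": "s3://athena-results-bucket/"},  # hypothetical
)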

Next Steps and Where We Would Use It Again

This technology is not suited to everyone who stores data in S3; however, if your data has a consistent structure and you want to get more out of it, Athena can be a great choice. Instead of spending a non-trivial amount of time rewriting our microservice to use a traditional database (and enduring the pains of a migration), Athena allowed us to dive into our data, quickly validate assumptions, and triage problems in real time. If you found Athena helpful, we would love to hear about your experience with the technology and how it helped you solve some of your problems! Share your Athena story in the Community Forums!