Convert JSON objects in S3 to rows in Athena or Presto SQL
It’s common to annotate datasets with JSON. Suppose we have the following metadata:
meta_sample = {
"author": "akarve",
"hero_table": [
{"name": "Thor", "powers": ["thunder", "combat"], "id": 1},
{"name": "Iron Man", "powers": ["tech", "combat"], "id": 2},
{"name": "Spider-Man", "powers": ["sense", "agility"], "id": 3},
],
"items_table": [
{"id": 1, "items": ["Mjolnir", "Stormbreaker"]},
{"id": 2, "items": ["suit", "car"]},
{"id": 3, "items": ["suit", "web shooter"]}
]
}
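Before reaching for SQL, it can help to sanity-check the join we ultimately want in plain Python (this sketch is illustrative only and repeats meta_sample from above; it's not part of the Athena workflow):

```python
# The same metadata as above
meta_sample = {
    "author": "akarve",
    "hero_table": [
        {"name": "Thor", "powers": ["thunder", "combat"], "id": 1},
        {"name": "Iron Man", "powers": ["tech", "combat"], "id": 2},
        {"name": "Spider-Man", "powers": ["sense", "agility"], "id": 3},
    ],
    "items_table": [
        {"id": 1, "items": ["Mjolnir", "Stormbreaker"]},
        {"id": 2, "items": ["suit", "car"]},
        {"id": 3, "items": ["suit", "web shooter"]},
    ],
}

# Index items_table by id, then join each hero to their items
items_by_id = {row["id"]: row["items"] for row in meta_sample["items_table"]}
joined = [
    {**hero, "items": items_by_id[hero["id"]]}
    for hero in meta_sample["hero_table"]
]
```

This is the result we want Athena to produce for us at scale.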
How do we munge this JSON into a SQL table? Specifically, how do we slurp hero_table and items_table into Athena as tables and join them on id? First we’ll need to convert the JSON into proper SQL types.
ROW maps objects to SQL column types
We use CAST to convert JSON into an ARRAY of type ROW. This is the part that I couldn’t find in any tutorial, but did find in the Presto docs on casting to JSON. ROW enables you to cleanly map JSON keys to types as follows:
ROW(name VARCHAR, powers ARRAY(VARCHAR), id INTEGER)
Note how the names in each argument correspond precisely to our JSON keys ("name", "powers", "id").
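If you think in Python terms, the ROW type is roughly analogous to a typed record. Here's a hypothetical sketch (the Hero class is an invented name for illustration, not anything Presto generates):

```python
from typing import List, NamedTuple

# Rough Python analogue of
#   ROW(name VARCHAR, powers ARRAY(VARCHAR), id INTEGER)
class Hero(NamedTuple):
    name: str            # VARCHAR
    powers: List[str]    # ARRAY(VARCHAR)
    id: int              # INTEGER

# One element of the JSON array, now as a typed record
hero = Hero(name="Thor", powers=["thunder", "combat"], id=1)
```

Each JSON key becomes a named, typed field, just as each ROW argument becomes a named, typed SQL column.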
Let’s roll this ROW statement into a larger query:
WITH raw AS (
SELECT CAST(json_extract(user_meta, '$.hero_table')
AS ARRAY(ROW(name VARCHAR, powers ARRAY(VARCHAR), id INTEGER))) AS record
FROM "YOUR_DATABASE_HERE"
WHERE json_extract_scalar(user_meta, '$.author') LIKE 'akarve%'
)
SELECT *
FROM raw
But the output is not quite what we want. It’s just a single column, record, with one row :(
[
{name=Thor, powers=[thunder, combat], id=1},
{name=Iron Man, powers=[tech, combat], id=2},
{name=Spider-Man, powers=[sense, agility], id=3}
]
UNNEST unpacks arrays and maps into relations
We use UNNEST to transpose our array into a column. You’ll typically see UNNEST with a CROSS JOIN. Append CROSS JOIN UNNEST(record) AS t(heroes) to our query and voilà.
The full query solution
WITH raw AS (
SELECT CAST(json_extract(user_meta, '$.hero_table')
AS ARRAY(ROW(name VARCHAR, powers ARRAY(VARCHAR), id INTEGER))) AS record
FROM "YOUR_DATABASE_HERE"
WHERE json_extract_scalar(user_meta, '$.author') LIKE 'akarve%'
)
SELECT heroes.name, heroes.powers, heroes.id
FROM raw
CROSS JOIN UNNEST(record) AS t(heroes)
Now our query produces a proper table from our JSON input:
name powers id
Thor [thunder, combat] 1
Iron Man [tech, combat] 2
Spider-Man [sense, agility] 3
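To see what UNNEST with CROSS JOIN is doing here, consider this rough Python analogue (a sketch of the semantics, not Presto's actual implementation):

```python
# raw: one row whose only column, "record", is an array of row-objects,
# mirroring the intermediate result of the CAST above
raw = [
    {"record": [
        {"name": "Thor", "powers": ["thunder", "combat"], "id": 1},
        {"name": "Iron Man", "powers": ["tech", "combat"], "id": 2},
        {"name": "Spider-Man", "powers": ["sense", "agility"], "id": 3},
    ]}
]

# CROSS JOIN UNNEST(record) AS t(heroes): pair every row on the left
# with every element of its unnested array
unnested = [
    {"heroes": element}
    for left_row in raw
    for element in left_row["record"]
]

# SELECT heroes.name, heroes.powers, heroes.id
result = [
    (r["heroes"]["name"], r["heroes"]["powers"], r["heroes"]["id"])
    for r in unnested
]
```

Because raw has a single row holding a three-element array, the cross join yields three rows, one per array element.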
But how does that CROSS JOIN work?
CROSS JOIN generates a Cartesian product, but what are the two sets in play? On the left we have raw, with a single column, record, and a single row. UNNEST transposes our array into a column with three values (one for each array element). Since the left side has only one row, the product is simply those three rows. We then select only the heroes.name, heroes.powers, and heroes.id columns to hide record and the unnested heroes column (which contains row objects that look like maps). If you’re curious about what’s happening under the hood, put a SELECT * in place of SELECT heroes.name, heroes.powers, heroes.id.
How to associate metadata with a dataset
Datasets without metadata (such as labels and documentation) quickly become meaningless.
Some businesses are so tired of meaningless data that they automatically delete unlabeled data after 30 days.
Quilt packages unify data and metadata as an immutable bundle so that datasets are more durable, meaningful, and reusable. Attaching metadata to a package in Quilt is a one-liner: p.set_meta(meta_sample). Here’s how it all comes together to land your data and its metadata in S3.
import quilt3 as q3
p = q3.Package()
p.set_meta(meta_sample)
# ...
p.push(
"akarve/heroes",
message="annotate dataset",
registry="s3://YOUR_BUCKET"
)
See Editing a package in the Quilt docs for more.