Convert JSON objects in S3 to rows in Athena or Presto SQL
It’s common to annotate datasets with JSON. Suppose we have the following metadata:
meta_sample = {
"author": "akarve",
"hero_table": [
{"name": "Thor", "powers": ["thunder", "combat"], "id": 1},
{"name": "Iron Man", "powers": ["tech", "combat"], "id": 2},
{"name": "Spider-Man", "powers": ["sense", "agility"], "id": 3},
],
"items_table": [
{"id": 1, "items": ["Mjolnir", "Stormbreaker"]},
{"id": 2, "items": ["suit", "car"]},
{"id": 3, "items": ["suit", "web shooter"]}
]
}
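Before reaching for SQL, it can help to sanity-check the join we ultimately want in plain Python (this sketch is illustrative only and repeats meta_sample from above; it's not part of the Athena workflow):

```python
# The same metadata as above
meta_sample = {
    "author": "akarve",
    "hero_table": [
        {"name": "Thor", "powers": ["thunder", "combat"], "id": 1},
        {"name": "Iron Man", "powers": ["tech", "combat"], "id": 2},
        {"name": "Spider-Man", "powers": ["sense", "agility"], "id": 3},
    ],
    "items_table": [
        {"id": 1, "items": ["Mjolnir", "Stormbreaker"]},
        {"id": 2, "items": ["suit", "car"]},
        {"id": 3, "items": ["suit", "web shooter"]},
    ],
}

# Index items_table by id, then join each hero to their items
items_by_id = {row["id"]: row["items"] for row in meta_sample["items_table"]}
joined = [
    {**hero, "items": items_by_id[hero["id"]]}
    for hero in meta_sample["hero_table"]
]
```

This is the result we want Athena to produce for us at scale.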
How do we munge this JSON into a SQL table? Specifically, how do we slurp hero_table and items_table into Athena as tables and join them on id? First we’ll need to convert the JSON into proper SQL types.
ROW maps objects to SQL column types
We use CAST to convert JSON into an ARRAY of type ROW. This is the part that I couldn’t find in any tutorial, but did find in the Presto docs on casting to JSON. ROW enables you to cleanly map JSON keys to types as follows:
ROW(name VARCHAR, powers ARRAY(VARCHAR), id INTEGER)
Note how the names in each argument correspond precisely to our JSON keys ("name", "powers", "id").
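If you think in Python terms, the ROW type is roughly analogous to a typed record. Here's a hypothetical sketch (the Hero class is an invented name for illustration, not anything Presto generates):

```python
from typing import List, NamedTuple

# Rough Python analogue of
#   ROW(name VARCHAR, powers ARRAY(VARCHAR), id INTEGER)
class Hero(NamedTuple):
    name: str            # VARCHAR
    powers: List[str]    # ARRAY(VARCHAR)
    id: int              # INTEGER

# One element of the JSON array, now as a typed record
hero = Hero(name="Thor", powers=["thunder", "combat"], id=1)
```

Each JSON key becomes a named, typed field, just as each ROW argument becomes a named, typed SQL column.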
Let’s roll this ROW statement into a larger query:
WITH raw AS (
SELECT CAST(json_extract(user_meta, '$.hero_table')
AS ARRAY(ROW(name VARCHAR, powers ARRAY(VARCHAR), id INTEGER))) AS record
FROM "YOUR_DATABASE_HERE"
WHERE json_extract_scalar(user_meta, '$.author') LIKE 'akarve%'
)
SELECT *
FROM raw
But the output is not quite what we want. It’s just a single column, record, with one row :(
[
{name=Thor, powers=[thunder, combat], id=1},
{name=Iron Man, powers=[tech, combat], id=2},
{name=Spider-Man, powers=[sense, agility], id=3}
]
UNNEST unpacks arrays and maps into relations
We use UNNEST to transpose our array into a column. You’ll typically see UNNEST with a CROSS JOIN. Append CROSS JOIN UNNEST(record) AS t(heroes) to our query and voilà.
The full query solution
WITH raw AS (
SELECT CAST(json_extract(user_meta, '$.hero_table')
AS ARRAY(ROW(name VARCHAR, powers ARRAY(VARCHAR), id INTEGER))) AS record
FROM "YOUR_DATABASE_HERE"
WHERE json_extract_scalar(user_meta, '$.author') LIKE 'akarve%'
)
SELECT heroes.name, heroes.powers, heroes.id
FROM raw
CROSS JOIN UNNEST(record) AS t(heroes)
Now our query produces a proper table from our JSON input:
name powers id
Thor [thunder, combat] 1
Iron Man [tech, combat] 2
Spider-Man [sense, agility] 3
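To see what UNNEST with CROSS JOIN is doing here, consider this rough Python analogue (a sketch of the semantics, not Presto's actual implementation):

```python
# raw: one row whose only column, "record", is an array of row-objects,
# mirroring the intermediate result of the CAST above
raw = [
    {"record": [
        {"name": "Thor", "powers": ["thunder", "combat"], "id": 1},
        {"name": "Iron Man", "powers": ["tech", "combat"], "id": 2},
        {"name": "Spider-Man", "powers": ["sense", "agility"], "id": 3},
    ]}
]

# CROSS JOIN UNNEST(record) AS t(heroes): pair every row on the left
# with every element of its unnested array
unnested = [
    {"heroes": element}
    for left_row in raw
    for element in left_row["record"]
]

# SELECT heroes.name, heroes.powers, heroes.id
result = [
    (r["heroes"]["name"], r["heroes"]["powers"], r["heroes"]["id"])
    for r in unnested
]
```

Because raw has a single row holding a three-element array, the cross join yields three rows, one per array element.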
But how does that CROSS JOIN work?
CROSS JOIN generates a Cartesian product, but what are the two sets in play? On the left we have raw, with a single column, record, and a single row. UNNEST transposes our array into a column with three values (one for each array element). Since the left side has only one row, the product is simply those three rows. We then select only the heroes.name, heroes.powers, and heroes.id columns to hide record and the unnested heroes column (which contains row objects that look like maps). If you’re curious about what’s happening under the hood, put a SELECT * in place of SELECT heroes.name, heroes.powers, heroes.id.
How to associate metadata with a dataset
Datasets without metadata (such as labels and documentation) quickly become meaningless.
Some businesses are so tired of meaningless data that they automatically delete unlabeled data after 30 days.
Quilt packages unify data and metadata as an immutable bundle so that datasets are more durable, meaningful, and reusable. Attaching metadata to a package in Quilt is a one-liner: p.set_meta(meta_sample). Here’s how it all comes together to land your data and its metadata in S3.
import quilt3 as q3
p = q3.Package()
p.set_meta(meta_sample)
# ...
p.push(
"akarve/heroes",
message="annotate dataset",
registry="s3://YOUR_BUCKET"
)
See Editing a package in the Quilt docs for more.