Interactive data dictionaries for business users — flatten dbt YAML into web tables with Quilt, Perspective, and Voila

Aneesh Karve
Quilt
Published in
4 min readNov 12, 2021

--

In this article we’ll parse dbt YAML into a pandas dataframe, then publish that dataframe as an interactive data grid on the web. Source code is available on GitHub and the data are available on open.quiltdata.com. Let’s get started.

Data dictionaries are a form of data documentation that help readers to accurately interpret data. Column names and column values are common terms in data dictionaries. Dictionaries give data consumers hope that they can grasp the data before them without painstaking data archeology 😅.

Is your documentation integrated and easy to access?

A collaborator recently shared a dbt YAML file with column annotations in it, salesforce.yaml. I dig YAML. It’s clear for developers, but it’s obscure for business users. In this article we’ll flatten a YAML object into a clean table that we can display as a data dictionary in HTML. Armed with a data dictionary, business users know the meaning of each column.

Our file contains several models, eache of which contains nested columns. Here’s the basic structure:

models:
- name: MODEL_NAME
description: MODEL_DESCRIPTION
columns:
- name: COLUMN_NAME
description: COL_DESCRIPTION
tests: COL_TESTS

Parse YAML into a Python dict

Let’s load salesforce.yaml into a dict.

with open("salesforce.yaml", "r") as f:
salesforce = yaml.safe_load(f)

Our data dictionary is in there, but it’s acting all shy and nested 🥺.

salesforce.yaml as a Python dict

The nesting occurs like so:

models:
- name: MODEL_NAME
description: MODEL_DESCRIPTION
columns:
- name: COLUMN_NAME
description: COL_DESCRIPTION
tests: COL_TESTS

Enumerating the model names with [m["name"] for m in salesforce[“models”]], we encounter five models:

['salesforce__manager_performance',
'salesforce__owner_performance',
'salesforce__opportunity_enhanced',
'salesforce__sales_snapshot',
'salesforce__order_enhanced']

Normalize or flatten dict to dataframe

We want to flatten the nested column descriptions into a table. Enter pandas.json_normalize.

pd.json_normalize(
salesforce,
record_path=["models", "columns"]
)

(The record_path= kwarg tells pandas how deeply you want to normalize the data. In this case, we are after a table of columns, which are nested under models, so our path is ["models", "columns"].)

Normalized dataframe from a dictionary with pandas.json_normalize

But we’ve lost the name of the parent model. This confuses columns with the same name but different descriptions across models. That’s where the meta= kwarg comes in:

pd.json_normalize(
salesforce,
record_path=["models", "columns"],
record_prefix="models.columns.",
meta=[["models", "name"]]
).drop(columns="models.columns.tests")

Boom. We’ve cleaned things up a tad more by preserving the namespace of the keys and dropping the “tests” column because that’s TMI (too much information) for our dictionary users.

Data dictionary as a pandas dataframe

Publish our data dictionary to the web with Quilt catalog

We’ve massaged our data into tabular form and made it interactive. Now we need to publish it to the web. Quilt catalog makes this easy.

First, we need to create a quilt_summarize.json file that tells the Quilt catalog to run a Voila kernel for our dashboard notebook:

[
{
"title": "Data dictionary",
"path": "dataDictionary.ipynb",
"types": ["voila"]
}
]

Now, we can push the data to S3. (You’ll need to set up quilt3.)

quilt3 push \
examples/interactive-data-dictionary \
--dir . \
--registry s3://quilt-example \
--message "initial data dictionary"

If you’re running the Quilt catalog, you’ll see the following dashboard (or click the following link for a live data dictionary example).

Data dictionary as an interactive PerspectiveWidget in the Quilt catalog

Perspective is cool. Suppose each column has a data quality score (I injected some random scores). We can visualize data quality per model as an X/Y Scatter.

Salesforce model data quality scores in a PerspectiveWidget in the Quilt catalog

Code samples

Visit quiltdata/examples/interactive_data_dictionary for the full Jupyter notebooks behind this article.

What’s next?

I’ll be doing a whole series on visualizing data quality. Stay tuned. I welcome questions in the comments.

Thanks

Special thanks to @DrErnie for bringing the Salesforce model data to my attention.

--

--

Data, visualization, machine learning, and abstract algebra. CTO and co-founder @QuiltData. Recent talks https://goo.gl/U9VYr5.