Principles of lazy data documentation — and how to get your team onboard

Aneesh Karve
Published in Quilt · May 2, 2020

Data documentation is considerate. It forms a clue trail for anyone who wishes to analyze, reuse, or debug the data in question. There’s no shortage of recommendations on how to document data [1, 3, 4, 5, 6]. So why don’t more teams write data docs?

Because documenting data is a pain. Most of us would rather incur documentation debt in exchange for time to work on more urgent things.

Is there hope for the lazy among us? I think so. Below are a handful of tools and techniques that your team can use to easily and automatically document data.

Make docs easy and attractive, or nothing happens

Data documentation is a habit driven by incentives. If your users can quickly get an attractive web presence by jotting a few lines into a markdown file — or better yet by doing nothing — they will do it, they will like it, and others will imitate them. We all write READMEs on GitHub because the repo would feel bare and unsightly without them.

We’re going to look at lazy data documentation in seven dimensions:

  1. Automatically profile data, verify schemas, and generate docs
  2. Your cloud provider is already documenting your data for you; visualize it for everyone to see
  3. Publishing docs should be as easy as sending an email
  4. Docs should live in the same formats that people use to keep themselves organized, and be accessible with a browser
  5. Randomly selected files document the gist of the collection
  6. Prefer typed file formats
  7. Ensure docs are discoverable with natural language search

In the following sections we’ll provide examples and code samples for each of these seven dimensions.

Automatically profile data, verify schemas, and generate docs

At Quilt we’re enjoying Great Expectations for profiling tabular files and data frames. You point Great Expectations at a data source and it automatically generates a data profile — expectations about the shape of the data — and HTML docs.
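
Here’s a minimal sketch of the expectation-and-validation flow, using the classic Pandas-backed great_expectations API; fruit.csv and the specific expectations below are hypothetical stand-ins, and the automatic profiling and HTML doc generation are covered in the Great Expectations docs:

import great_expectations as ge

# Load a (hypothetical) CSV as a Great Expectations dataset
df = ge.read_csv("fruit.csv")

# Declare expectations about the shape of the data
df.expect_column_values_to_not_be_null("price")
df.expect_column_values_to_be_between("price", min_value=0)
df.expect_column_values_to_be_in_set(
    "type", ["citrus", "stone", "tropical", "berry", "other"]
)

# Validate the data against the expectations accumulated above
results = df.validate()
print(results)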

For semi-structured dictionaries, we use JSON-Schemas and apply validators to inbound data. It looks something like this:

import json
from jsonschema import Draft4Validator as dv

my_schema = {
    "type": "object",
    "properties": {
        "type": {
            "description": "Type of fruit",
            "enum": ["citrus", "stone", "tropical", "berry", "other"]
        }
    }
}

# Fails (yay) because "berru" is not a legit fruit type
dv(my_schema).validate(json.loads('{"type": "berru"}'))

These simple guardrails keep schemas from getting corrupted and provide a place for people to capture tribal knowledge from their heads as community knowledge in the docs. Tribal knowledge, by the way, is evil: it’s expensive to acquire when a pipeline blows up, when someone leaves the team, or when you’re new to a dataset.

In February we introduced a new label, __UNKNOWN__, into the transaction_type column to match the Excel files that we get from the new supplier. I should probably write this down somewhere.

— Your coworker’s head

Your cloud provider is already documenting your data for you; visualize it for everyone to see

Many of us use monitoring services like CloudTrail, or search services like ElasticSearch. These systems are full of metadata on usage patterns and file statistics — but unless you’re a developer you never benefit from any of this metadata.

Fig. 1 — Automatically generated docs for the Allen Cell Imaging Collections.

With a bit of glue code, you can extract highlights from these data sources and visualize them in a browser. Key statistics include the number of objects in a collection, the total size of the collection, the most popular file extensions, and which files and extensions are frequented by other users (Fig. 1). An enormous clue as to how data fit together is “documented” in the path that other users forge through the data. If, as a new user or as someone debugging a pipeline, I know where most people went next, I’m halfway to a solution.

We use CloudTrail and Athena to dump, at hourly intervals, summaries of data downloads and time series of file accesses (Fig. 2, code here). A lot more is possible, such as indicating who the top users of a given file are.

Fig. 2 — CloudTrail access data visualized as a time series.
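
If you want to roll your own, the idea boils down to one Athena query over CloudTrail logs. Here’s a minimal sketch with boto3, assuming the logs have already been registered as an Athena table; the table name cloudtrail_logs and the results bucket are hypothetical:

import boto3

athena = boto3.client("athena")

# Count S3 GetObject events per hour from a (hypothetical) CloudTrail table
QUERY = """
SELECT date_trunc('hour', from_iso8601_timestamp(eventtime)) AS hour,
       count(*) AS downloads
FROM cloudtrail_logs
WHERE eventname = 'GetObject'
GROUP BY 1
ORDER BY 1
"""

response = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "default"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print(response["QueryExecutionId"])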

Publishing docs should be as easy as sending an email

If there’s friction in your doc-writing process, you are disincentivizing information, and you’ll have less of it. The anti-pattern to avoid is a system that mugs people for information like an IRS form. Wikis, web docs, and S3 are all potential homes for data documentation. I have a slight preference for S3, or for any location adjacent to the data, since it increases the likelihood that users of the data will find the docs.

Docs should live in the same formats that people use to keep themselves organized, and be accessible with a browser

Jupyter notebooks, markdown files, and even text files are natural choices here. No data scientist should be asked to make a “quick PowerPoint presentation” just so that Fred in Accounting can look at the monthly sales numbers. Instead, let’s convert from Jupyter to HTML with a script.

Fig. 3 — A Jupyter notebook rendered to HTML with nbconvert.

At Quilt we use nbconvert to render notebooks to HTML (Fig. 3) and are switching to remark to convert markdown to HTML on the fly. nbconvert includes facilities for turning Jupyter notebooks into slide presentations and for filtering out select cell types. For example, you can convert a Jupyter notebook to HTML and eliminate code cells that dashboard users don’t need to see, as follows:

$ jupyter nbconvert \
--to html \
--TemplateExporter.exclude_code_cell=True \
notebook.ipynb

For details on how to programmatically convert Jupyter notebooks to HTML in Python, see the Python Lambda function extract_ipynb().
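
The same conversion is available through the nbconvert and nbformat Python APIs. Here’s a minimal sketch (not the extract_ipynb() function itself, just the general shape):

import nbformat
from nbconvert import HTMLExporter

# Read the notebook and export it to HTML, skipping code cells
nb = nbformat.read("notebook.ipynb", as_version=4)
exporter = HTMLExporter()
exporter.exclude_code_cell = True
body, resources = exporter.from_notebook_node(nb)

with open("notebook.html", "w") as f:
    f.write(body)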

Randomly selected files document the gist of the collection

In our experience, randomly selected files can help users to get the gist of large data collections. This is the kind of thing that you think won’t work until you experience it. As an example, I had used the 1000genomes bucket on AWS for months before I ever knew, thanks to auto-generated data docs, that it contained a few dozen cool visualizations (Fig. 4). Hardened biologists tell me that they, too, never knew these images existed in 1000genomes.

Fig. 4 — Unearthed visualizations in the 1000genomes project

We generate collection gists with random samples of images, tabular files, JSON, and README* files. For Python queries to ElasticSearch for random gists, see Quilt’s search lambda function. In most cases you’ll want to cache the results so that the overview documentation isn’t so dynamic — due to query LIMIT clauses and random sampling — as to be confusing.
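
If you aren’t running ElasticSearch, even a plain S3 listing plus random sampling gets you most of the way there. A minimal sketch with boto3 (the bucket name is hypothetical):

import random

import boto3

s3 = boto3.client("s3")

# List every key in the (hypothetical) bucket
keys = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="my-data-bucket"):
    keys.extend(obj["Key"] for obj in page.get("Contents", []))

# Take a small random gist of the collection; cache this so the overview
# docs aren't confusingly dynamic
gist = random.sample(keys, min(10, len(keys)))
print("\n".join(gist))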

Prefer typed file formats

CSVs are convenient but they are kind of evil. For starters, CSV files do not enforce type integrity. I recommend formats like Parquet, and the excellent pyarrow library (or even pandas) for reading and writing Parquet. Not only does Parquet enforce types, reducing the likelihood of data drifting within columns, it is also faster to read, write, and move over the network than text files. We use pandas’s _repr_html_() method to convert CSVs and Parquet files into an HTML document that’s browsable on the web.
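
A minimal sketch of that round trip with pandas (the file names are hypothetical; pandas delegates the Parquet I/O to pyarrow or fastparquet):

import pandas as pd

# Read an untyped CSV, then persist it as typed Parquet
df = pd.read_csv("fruit.csv")
df.to_parquet("fruit.parquet")

# Reading it back preserves column types
df2 = pd.read_parquet("fruit.parquet")

# Render a browsable HTML preview of the table
with open("fruit.html", "w") as f:
    f.write(df2._repr_html_())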

Useful docs are plaintext searchable

Finding a notebook you touched last week, or a column in a Parquet file, is far easier with the help of tools like ElasticSearch (see Find your Jupyter notebooks with ElasticSearch). Commercial services like Dropbox and Google Docs offer some plain text search, but if you depend on specialized formats like .ipynb or .parquet, you’re better off writing your own extractors and sending documents to ElasticSearch. We’ve open sourced all of the code that we use to extract and send documents to ElasticSearch here.
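
A minimal sketch of such an extractor for notebooks, assuming a local ElasticSearch cluster and the official elasticsearch Python client (the index name and file path are hypothetical):

import json

from elasticsearch import Elasticsearch

es = Elasticsearch()  # defaults to localhost:9200

# Pull the plain text out of a notebook's cells
with open("notebook.ipynb") as f:
    nb = json.load(f)
text = "\n".join("".join(cell["source"]) for cell in nb["cells"])

# Index it so it becomes full-text searchable
es.index(index="notebooks", body={"path": "notebook.ipynb", "text": text})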

Hey, that’s not all easy

Making things easy for documentation readers and writers does make things hard on the app and infrastructure providers. That is the Law of Conservation of Complexity in action. You can still be lazy by using off-the-shelf components. If you’re looking for a full-stack data documentation platform that integrates with pull requests, check out Airbnb’s Knowledge Repo. If you’re using S3, have a look at Quilt’s open source libraries for visualizing and summarizing S3 buckets. If you’re looking for a comprehensive solution to collect and aggregate metadata, see the open source project Marquez.

Doc unto others as you would have them doc unto you

There are fancy data lifecycles [5, 6], metadata formats [4], and frameworks [7] for documenting data. Because this is an article on being lazy, we are required to dismiss all of these frameworks :) That is to say, writing good data docs boils down to giving others a running start. Ask yourself one question when generating documentation: If I were new to this company and this dataset, what would I need to know to start using this data?

A well-defined, standard project structure means that a newcomer can begin to understand an analysis without digging in to extensive documentation. It also means that they don’t necessarily have to read 100% of the code before knowing where to look for very specific things.

— Cookiecutter Data Science [2]

Conclusion: data without docs quickly become meaningless

"name", "price", "weight"
"carambola", 44.02, 181

How would you use this data to, say, purchase fruit for an event? I don’t know, you don’t know, and that’s the point. There are so many unknowns. What currency is the price in? What units are the weight in? Have we ever purchased carambola before?

This is a crude illustration of the point that data without documentation quickly become meaningless. And that’s where the value of data documentation comes in. Data documentation is a cultural habit that leads to greater utilization of data and, ultimately, smarter models. Make data documentation easy so that it actually happens.

References

  1. Data Management: Data Documentation & Standards, University of Portland, Clark Library
  2. Cookiecutter Data Science
  3. Principles of Documenting Data, The Social Science Research Council
  4. Schema.org
  5. Guide for Data Documentation, University of Helsinki
  6. Document, Discover, and Interoperate — Getting Started
