Create and Browse Reusable Datasets in Your Private S3 Buckets with quilt3

Published in Quilt, Mar 9, 2022

I’m grateful to have the chance to talk with science and data teams from many different industries and research areas. Although these teams use different data systems, there’s a story I hear in almost every conversation. There’s a new person on the team, and they’re trying to figure out which datasets are available, what they mean, and who, if anyone, at the company knows the answers. Quite often the person who does know has left the company.

Since the release of quilt3 4.0.0, I can show these teams how to attach READMEs, other documentation, and visualizations to the datasets in their own S3 buckets. These simple additions make it easy for collaborators to understand the data, especially when they’re new to a project. Using the same mechanism, they can also annotate datasets with metadata and track the provenance of their data, which increases trust in the data and in the decisions derived from it. Quilt bundles that data, metadata, documentation, and visualizations into versioned datasets called packages.

Quilt Data Packages

Data packages are immutable collections of related files, metadata, and documentation. Quilt provides a revision history of the entire package (collection). Each revision is uniquely identified by a cryptographic top-hash (or hash of hashes). Specific package revisions can be referenced in code (e.g., by top-hash) or browsed in the Quilt web catalog.
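
To make the “hash of hashes” idea concrete, here is a minimal sketch that uses only Python’s standard library. It illustrates the concept and is not Quilt’s actual manifest format or hashing scheme: the function names file_hash and top_hash, and the way keys, file hashes, and metadata are combined, are assumptions made for this example.

```python
import hashlib
import json


def file_hash(data: bytes) -> str:
    """Return the SHA-256 hex digest of one file's contents."""
    return hashlib.sha256(data).hexdigest()


def top_hash(entries: dict, meta: dict) -> str:
    """Combine package metadata and per-file hashes into a single digest.

    `entries` maps a logical key (path inside the package) to file bytes.
    Illustrative only: this is NOT how quilt3 computes its top-hash.
    """
    digest = hashlib.sha256()
    # Serialize metadata deterministically so equal metadata hashes equally.
    digest.update(json.dumps(meta, sort_keys=True).encode("utf-8"))
    # Visit entries in sorted order so the result does not depend on insertion order.
    for key in sorted(entries):
        digest.update(key.encode("utf-8"))
        digest.update(file_hash(entries[key]).encode("utf-8"))
    return digest.hexdigest()
```

Because every file hash feeds into the final digest, changing any file or any piece of metadata produces a different top-level hash, which is what lets a single identifier pin down an entire revision.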

Quilt3 includes a Python library and web catalog. Install quilt3 including the optional web catalog by running: pip install quilt3[catalog].

To illustrate Quilt in action, I chose a dataset from Reef Check, one of our public data partners on open.quiltdata.com. It focuses on sea urchin populations, especially the kelp-eating purple urchin. We’ll use Quilt to sync our data to Amazon S3, document our dataset, and track each revision.

Reef Check

The Reef Check Foundation was founded in 1996 to help preserve the oceans and reefs which are critical to our survival, yet are being destroyed. Reef Check trains thousands of citizen scientist divers who volunteer to survey the health of coral reefs around the world and rocky reef ecosystems along the west coast of North America.

The dataset we’re using here was collected by the Reef Check team from 2006 to 2019. It’s close to my heart because I grew up surfing in northern California. I felt the rise in ocean temperature during that time and saw its effects firsthand — especially during 2015 when an enormous marine heatwave wreaked havoc on local ecosystems.

Though in this post we’ll focus on crowned, red, and especially purple urchins, Reef Check tracks many other species of animals, which you can see when you check out the public datasets at ai-on-the-beach.

Thanks to Praful Mathur and Selena McMillan for curating and analyzing the urchin dataset and helping with this post. Thanks to everyone at Reef Check for helping protect our oceans.

Saving the raw data

The first step is to create a bucket in S3. In this example, we’ll use boto3, the AWS SDK for Python. After creating the bucket, we turn on object versioning so that all versions of files stored in the bucket will be retained.
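
The bucket setup might look roughly like the sketch below. The helper name create_versioned_bucket and its parameters are hypothetical; the function expects an S3 client such as the one returned by boto3.client("s3"), and it assumes your AWS credentials are allowed to create buckets.

```python
def create_versioned_bucket(s3_client, bucket, region="us-east-1"):
    """Create an S3 bucket and enable object versioning on it.

    `s3_client` is assumed to be a boto3 S3 client, e.g. boto3.client("s3").
    The helper name and arguments are illustrative, not part of boto3.
    """
    if region == "us-east-1":
        # us-east-1 is the default region and must not be sent as a LocationConstraint.
        s3_client.create_bucket(Bucket=bucket)
    else:
        s3_client.create_bucket(
            Bucket=bucket,
            CreateBucketConfiguration={"LocationConstraint": region},
        )
    # Keep every version of every object written to the bucket.
    s3_client.put_bucket_versioning(
        Bucket=bucket,
        VersioningConfiguration={"Status": "Enabled"},
    )
```

For example, after import boto3, calling create_versioned_bucket(boto3.client("s3", region_name="us-west-2"), "my-reef-check-bucket", region="us-west-2") would create a versioned bucket in us-west-2. The bucket name here is a placeholder; S3 bucket names must be globally unique.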

Once the bucket is ready, we can use quilt3 to upload the tabular data and images to our S3 bucket and create a snapshot of the original state.

First, we create a package on our local machine and add the source data files to it. Next, we push the package to our bucket in S3. The push uploads the raw data files and creates a new package manifest that defines the package in its new location in S3.
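
A minimal sketch of those two steps follows. The helper name push_raw_snapshot, the commit message, and the local directory are assumptions; the raw/ prefix is inferred from the bucket path this post browses to later. The function takes a package object (in practice a fresh quilt3.Package()) and relies on quilt3’s Package.set_dir and Package.push methods.

```python
def push_raw_snapshot(package, local_dir, bucket, name="aionthebeach/reef-check"):
    """Add every file under `local_dir` to the package and push it to S3.

    `package` is assumed to be a new quilt3.Package(); `local_dir` and
    `bucket` are placeholders for your own data folder and bucket name.
    """
    # Store the files under the logical prefix "raw/" inside the package.
    package.set_dir("raw", local_dir)
    # push() uploads the files, writes a new manifest to the registry,
    # and returns the package as it now exists in S3.
    return package.push(
        name,
        registry=f"s3://{bucket}",
        message="Initial snapshot of raw Reef Check urchin data",
    )
```

With quilt3 installed, pushed = push_raw_snapshot(quilt3.Package(), "./raw", "my-reef-check-bucket") uploads the snapshot, and pushed.top_hash then identifies this exact revision.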

Browse the Dataset in S3 with Quilt3 Catalog

Once the package is created in S3, we can use quilt3 catalog to browse the raw data in S3. The catalog uses AWS credentials stored in your local Python environment so you can browse data in private buckets.

Browse to your bucket by entering the bucket name in the bucket field in the top-left corner of the page. Once in the bucket, browse contents by selecting the “BUCKET” tab. From there, you’ll see the raw data has been uploaded into a folder structure matching the package structure. Preview files by navigating to aionthebeach/reef-check/raw and selecting a file, e.g., urchins2006-2019.parquet.

Explain the data so collaborators can understand it

Data without explanation or context is notoriously hard to understand, let alone use productively. The first step in documenting a new dataset is to add a README. We can add a README to our data package by writing a local file, say README.md, in any markdown editor and using Package.set to add it to our package. Once our package has the README, we can push the new revision to S3 with Package.push.
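
The Package.set and Package.push steps described above can be sketched as follows. The helper name add_readme and the commit message are hypothetical; the function expects the existing package, which you would typically load with quilt3.Package.browse.

```python
def add_readme(package, readme_path, bucket, name="aionthebeach/reef-check"):
    """Attach a local README to an existing package and push a new revision.

    `package` is assumed to come from quilt3.Package.browse(name, registry=...);
    `readme_path` and `bucket` are placeholders for your own file and bucket.
    """
    # The logical key "README.md" places the file at the top level of the package.
    package.set("README.md", readme_path)
    # Pushing records a new revision; earlier revisions remain available.
    return package.push(
        name,
        registry=f"s3://{bucket}",
        message="Add README describing the urchin dataset",
    )
```

In practice this might be called as add_readme(quilt3.Package.browse("aionthebeach/reef-check", registry="s3://my-reef-check-bucket"), "README.md", "my-reef-check-bucket"), where the bucket name is again a placeholder.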

Now collaborators can easily find an overview of the dataset and background on the organization behind it. To view the new package version with the README, browse to the “PACKAGES” tab and select aionthebeach/reef-check.

Invite collaborators to explore the data with interactive visualizations

Visualization is a powerful way for us to explore the dataset and communicate our findings with one another. In this example, we use the Python package Altair to explore the populations of different sea urchin species from 2006 to 2019.
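
As a rough sketch of what such a chart could look like, the example below averages survey counts per year and species and plots one line per species. The column names year, species, and count are hypothetical stand-ins (the real Reef Check schema is not shown in this post), and the helper names are made up for illustration.

```python
from collections import defaultdict


def mean_count_by_year(rows):
    """Average the survey counts for each (year, species) pair.

    `rows` is an iterable of dicts with hypothetical keys
    "year", "species", and "count".
    """
    totals = defaultdict(lambda: [0.0, 0])  # (year, species) -> [sum, n]
    for row in rows:
        bucket = totals[(row["year"], row["species"])]
        bucket[0] += row["count"]
        bucket[1] += 1
    return [
        {"year": year, "species": species, "mean_count": total / n}
        for (year, species), (total, n) in sorted(totals.items())
    ]


def urchin_chart(records):
    """Build an Altair line chart with one line per species."""
    # Imported here so the aggregation helper above works without Altair installed.
    import altair as alt

    return (
        alt.Chart(alt.Data(values=records))
        .mark_line(point=True)
        .encode(
            x=alt.X("year:O", title="Survey year"),
            y=alt.Y("mean_count:Q", title="Mean urchins counted per survey"),
            color=alt.Color("species:N", title="Species"),
            tooltip=["year:O", "species:N", "mean_count:Q"],
        )
    )
```

Given rows loaded from the dataset, urchin_chart(mean_count_by_year(rows)) returns a chart object that can be displayed in a notebook or saved to a file.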

We can add this visualization to our Quilt package so it’s included in the package summary in the Quilt catalog. First, we save the chart as a Vega (JSON) file. Then we add the JSON file to our package and reference it in the quilt_summarize.json file.
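
Here is a hedged sketch of those steps. The helper name add_chart_to_package and the file name urchin-populations.json are hypothetical, and the summary file is written as a simple JSON list of file paths, which is my understanding of the most basic quilt_summarize.json layout; check the Quilt documentation for the full set of options.

```python
import json
from pathlib import Path


def add_chart_to_package(package, chart, workdir, chart_name="urchin-populations.json"):
    """Save a chart as JSON and list it in quilt_summarize.json.

    `chart` is assumed to be an Altair chart (anything with a .save(path)
    method works); `package` is assumed to be a quilt3.Package.
    """
    workdir = Path(workdir)
    chart_path = workdir / chart_name
    # Altair writes a Vega-Lite JSON specification when the target ends in .json.
    chart.save(str(chart_path))

    # Assumed minimal layout: a JSON list naming the files to show in the summary.
    summary_path = workdir / "quilt_summarize.json"
    summary_path.write_text(json.dumps([chart_name], indent=2))

    # Both files live at the top level of the package.
    package.set(chart_name, str(chart_path))
    package.set("quilt_summarize.json", str(summary_path))
    return package
```

After this call, pushing the package with Package.push, as in the earlier steps, records a new revision that includes the chart.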

Now the entire team can explore the data visually in addition to accessing the underlying raw data.

In this case, we can see that the warming of the oceans is creating more suitable environments for purple urchins farther north in California, with dire consequences for kelp forests, as discussed in this article: In Hotter Climate, ‘Zombie’ Urchins Are Winning And Kelp Forests Are Losing.

Conclusion

The example above shows how a data team can manage a dataset in S3 with the open-source quilt3. Running quilt3 catalog allows collaborators to explore data, documentation, and even interactive visualizations in the browser. Team members can contribute new revisions without fear of disrupting each other because the entire dataset is versioned. To try Quilt for yourself, check out the Quilt Documentation. If you like what you see, please give it a star on GitHub. Thanks!