Quilt

Quilt is a data mesh for cross-functional teams

Manage data like source code

Projects like pip, npm, and GitHub simplify source code management and dependency management. In 2017 it’s easy to share, version, and install source code. Why can’t we do the same with data? Data is dynamic, has complex dependencies, and needs to be shared across environments. Modeling in data science, artificial intelligence, and finance benefits from versioning: versioned data makes it possible to reproduce results, train models repeatably, and track changes. Unfortunately, source code tools like pip, npm, and GitHub aren’t suited to data management. Here’s why:

  • Large data files slow source repositories down. (Git LFS works to some extent, but comes at the cost of added complexity and misses the compilation step, discussed below.)
  • Source repositories don’t compile data. Below we’ll discuss how data can be “compiled” into high-speed, cross-platform, binary formats.
  • Source repositories don’t package data. Instead, source repositories treat data as flat files. The user is left to “extract, transform, and load” the data by hand. Given that 79% of a data scientist’s time goes to finding, cleaning, and organizing data sets, there’s a need for data packages that are easy to create, distribute, and use.

The secret sauce of source code management

Three things make source code management tick:

  • Packages — dependencies are expressed in reusable units
  • Compilation — text files are converted into fast binary formats (we call this serialization in the case of data)
  • Versioning — changes are tracked with hashes, tags, and a change log

We created Quilt to bring packages, serialization, and versioning to data.

Package, serialize, and version data with Quilt

Package

To use the Quilt package manager, first install HDF5. Then install Quilt and a data package:

$ pip install quilt
$ quilt install akarve/sales

Installing a data package downloads it to disk. Data packages import just like standard Python modules.

$ python
>>> from quilt.data.akarve import sales

Data packages are like folders containing data frames (tables optimized for manipulation). With Python’s dot operator you can traverse a data package.

>>> sales.transactions.data()
   Order ID  Order Date       Sales
0         3  2010-10-13    261.5400
1       293  2012-10-01  10123.0200
2       293  2012-10-01    244.5700
3       483  2011-07-10   4965.7595
...

If you type the name of a package you’ll see its contents:

>>> sales
<class 'quilt.data.DataNode'>
File: /Users/akarve/demo/quilt_packages/akarve/examples.json
Path: /
README
transactions
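
The dot-operator traversal can be mimicked in a few lines of plain Python. This is a minimal sketch, not Quilt's actual implementation: the Node class, the nested dictionary, and the README placeholder text are hypothetical, and the sample rows are copied from the transactions output earlier in this post.

```python
class Node:
    """Hypothetical stand-in for a package node; not Quilt's DataNode."""

    def __init__(self, children):
        # Maps a child name to either a nested dict (a group) or a leaf value
        self._children = children

    def __getattr__(self, name):
        # Only called when normal lookup fails; refuse private names so that
        # a missing _children can never trigger infinite recursion
        if name.startswith("_"):
            raise AttributeError(name)
        try:
            child = self._children[name]
        except KeyError:
            raise AttributeError(name) from None
        # Wrap nested dicts so traversal can continue with another dot
        return Node(child) if isinstance(child, dict) else child

    def __repr__(self):
        # Typing the node's name lists its contents, one per line
        return "\n".join(sorted(self._children))


sales = Node({
    "README": "placeholder text",  # assumption: real README content not shown
    "transactions": {
        "rows": [(3, "2010-10-13", 261.54), (293, "2012-10-01", 10123.02)],
    },
})

print(sales)                       # lists README and transactions
print(sales.transactions.rows[0])  # first row of the nested table
```

The design choice worth noting is that groups are wrapped lazily, on access, so a large package tree costs nothing until you traverse it.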

quilt ls lists the packages that you have installed:

/Users/karve/code/dsci/demo/quilt_packages
└── akarve/sales

Every package has its own web page for documentation and discovery.

Serialize

I/O is slow. Parsing on top of I/O is even slower. Formats like HDF5 and Apache Parquet store data in high-efficiency binary form. Performance optimizations like run-length encoding, byte reordering, and memory mapping mean that binary data loads five to twenty times faster than text files and has a much smaller footprint on disk. In 2017, you should be serializing your data.
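
To give a feel for one of those optimizations, here is a minimal sketch of run-length encoding in plain Python. It only illustrates the idea; real binary formats use more elaborate encodings, and the function names rle_encode and rle_decode are chosen for this example. The order IDs are the ones from the transactions table above.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse runs of equal adjacent values into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [value for value, count in pairs for _ in range(count)]


# A column with repeated adjacent values compresses well
order_ids = [3, 293, 293, 483]
encoded = rle_encode(order_ids)
print(encoded)

# Decoding restores the column exactly
assert rle_decode(encoded) == order_ids
```

The longer the runs in a column, the bigger the saving, which is why sorted or low-cardinality columns benefit most.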

You can serialize files into a package as follows:

quilt build USER/PKG_NAME -d DIRECTORY

This will build the package USER/PKG_NAME by converting supported files in DIRECTORY to binary data frames.
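
To make the idea of a build step concrete, here is a minimal sketch that converts every CSV file in a directory into a binary file and records the result in a manifest. It is not how quilt build works internally: the functions build_package and load_node, the manifest layout, and the use of Python's pickle in place of HDF5 or Parquet are all assumptions, made so the example runs with the standard library alone.

```python
import csv
import pickle
from pathlib import Path


def build_package(directory, out_dir):
    """Hypothetical build step: serialize each CSV in `directory` to a binary file."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    manifest = {}
    for source in sorted(Path(directory).glob("*.csv")):
        with source.open(newline="") as handle:
            rows = list(csv.DictReader(handle))  # parse the text once, at build time
        target = out / (source.stem + ".bin")
        with target.open("wb") as handle:
            pickle.dump(rows, handle, protocol=pickle.HIGHEST_PROTOCOL)
        manifest[source.stem] = str(target)  # node name -> serialized file
    return manifest


def load_node(manifest, name):
    """Load a serialized node without re-parsing the original text."""
    with open(manifest[name], "rb") as handle:
        return pickle.load(handle)
```

Calling build_package("data", "build") on a directory that contains transactions.csv would return a manifest with a single "transactions" entry, which load_node can then read back. The point of the sketch is that parsing cost is paid once, at build time, rather than on every load.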

Version

quilt log reveals the history of a package:

Hash       Created             Author
c0f2e30... 2017-03-22T22:35:18 akarve
2c7b6fd... 2017-03-22T22:44:51 akarve

You can install a specific version of a package as follows:

quilt install USER/PKG -x HASH
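
Hash-based versioning rests on a simple property: identical contents always produce the same identifier, and any change produces a different one. The sketch below shows that property with SHA-256 from Python's standard library. The function package_hash and its input layout are hypothetical; Quilt's actual hashing scheme is not described in this post and may differ.

```python
import hashlib


def package_hash(files):
    """Hypothetical version id: SHA-256 over sorted file names and contents."""
    digest = hashlib.sha256()
    for name in sorted(files):
        # Length-prefix each field so different name/content splits cannot collide
        for field in (name.encode("utf-8"), files[name]):
            digest.update(len(field).to_bytes(8, "big"))
            digest.update(field)
    return digest.hexdigest()


# Two versions of a package that differ by a single digit
v1 = package_hash({"transactions.csv": b"3,2010-10-13,261.54\n"})
v2 = package_hash({"transactions.csv": b"3,2010-10-13,261.55\n"})

# Abbreviated hashes, in the spirit of the log output above
print(v1[:7] + "...", v2[:7] + "...")
```

Sorting the file names before hashing makes the identifier independent of the order in which files were added.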

There’s lots more

Check out the Quilt docs for details.

Contribute

The Quilt client is open source. We welcome contributions to the GitHub repository.

Published in Quilt

Written by Aneesh Karve

Data, visualization, machine learning, and abstract algebra. CTO and co-founder @QuiltData. Recent talks https://goo.gl/U9VYr5.
