Manage data like source code

Projects like pip, npm, and GitHub simplify source code management and dependency management. In 2017 it’s easy to share, version, and install source code. Why can’t we do the same with data? Data is dynamic, has complex dependencies, and needs to be shared across environments. Modeling in data science, artificial intelligence, and finance benefits from versioning. Versioned data makes it possible to reproduce results, repeatably train models, and track changes. Unfortunately, source code repositories like pip, npm, and GitHub aren’t suited for data management. Here’s why:
- Large data files slow source repositories down. (Git LFS works to some extent, but comes at the cost of added complexity and misses the compilation step, discussed below.)
- Source repositories don’t compile data. Below we’ll discuss how data can be “compiled” into high-speed, cross-platform, binary formats.
- Source repositories don’t package data. Instead, source repositories treat data as flat files. The user is left to “extract, transform, and load” the data by hand. Given that 79% of a data scientist’s time goes to finding, cleaning, and organizing data sets, there’s a need for data packages that are easy to create, distribute, and use.
The secret sauce of source code management
Three things make source code management tick:
- Packages — dependencies are expressed in reusable units
- Compilation — text files are converted into fast binary formats (we call this serialization in the case of data)
- Versioning — changes are tracked with hashes, tags, and a change log
We created Quilt to bring packages, serialization, and versioning to data.
Package, serialize, and version data with Quilt
Package
To use the Quilt package manager, first install HDF5. Then install Quilt and a data package:
$ pip install quilt
$ quilt install akarve/sales
Installing a data package downloads it to disk. Data packages import just like standard Python modules.
$ python
>>> from quilt.data.akarve import sales
Data packages are like folders containing data frames (tables optimized for manipulation). With Python’s dot operator you can traverse a data package.
>>> sales.transactions.data()
   Order ID  Order Date  Sales
0 3 2010-10-13 261.5400
1 293 2012-10-01 10123.0200
2 293 2012-10-01 244.5700
3 483 2011-07-10 4965.7595
...
If you type the name of a package you’ll see its contents:
>>> sales
<class 'quilt.data.DataNode'>
File: /Users/akarve/demo/quilt_packages/akarve/examples.json
Path: /
README
transactions
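The dot-operator traversal shown above can be mimicked in a few lines of plain Python. This is only a sketch of the idea: `PackageNode` and `DataLeaf` are hypothetical names, not Quilt's classes, and the leaf returns a list of dictionaries where a real package would return a data frame. The two rows are copied from the listing above.

```python
class DataLeaf:
    """Hypothetical leaf that holds a table as a list of row dictionaries."""

    def __init__(self, rows):
        self._rows = rows

    def data(self):
        # Return the stored rows; a real package would return a data frame.
        return self._rows


class PackageNode:
    """Hypothetical stand-in for a package node; not Quilt's actual class."""

    def __init__(self, **children):
        # Each keyword becomes an attribute, so children are reachable
        # with the dot operator.
        self._children = dict(children)
        for name, child in children.items():
            setattr(self, name, child)

    def __repr__(self):
        # Typing the node's name in the REPL lists its children, one per line.
        return "\n".join(sorted(self._children))


sales = PackageNode(
    README=DataLeaf([]),  # placeholder leaf with no content
    transactions=DataLeaf([
        {"Order ID": 3, "Order Date": "2010-10-13", "Sales": 261.54},
        {"Order ID": 293, "Order Date": "2012-10-01", "Sales": 10123.02},
    ]),
)

first_row = sales.transactions.data()[0]
print(first_row["Sales"])
print(repr(sales))
```

Storing children as ordinary attributes keeps tab completion working in an interactive session, which is part of what makes a package feel like a module.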
The command quilt ls lists the packages that you have installed:
/Users/karve/code/dsci/demo/quilt_packages
└── akarve/sales
Every package has its own web page for documentation and discovery.
Serialize
I/O is slow. Parsing on top of I/O is slower still. File formats like HDF5 and Apache Parquet serialize data into high-efficiency binary formats. Performance optimizations like run-length encoding, byte-reordering, and memory mapping mean that binary data loads five to twenty times faster than text files, and has a much smaller footprint on disk. In 2017, you should be serializing your data.
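To make one of those optimizations concrete, here is a toy run-length encoder. It illustrates the idea only and is not the encoder that HDF5 or Parquet actually ship; the `order_dates` column is a made-up example of a sorted column with long runs.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse runs of equal values into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original sequence."""
    decoded = []
    for value, count in pairs:
        decoded.extend([value] * count)
    return decoded


# Hypothetical column with long runs, such as a sorted date column.
order_dates = ["2012-10-01"] * 4 + ["2012-10-02"] * 3 + ["2012-10-03"]

encoded = rle_encode(order_dates)
print(encoded)

# Decoding restores the column exactly, so the encoding is lossless.
assert rle_decode(encoded) == order_dates
```

Eight values shrink to three pairs here. The savings depend entirely on how repetitive the column is, which is why columnar formats apply the encoding per column.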
You can serialize files into a package as follows:
quilt build USER/PKG_NAME -d DIRECTORY
This will build the package USER/PKG_NAME by converting supported files in DIRECTORY to binary data frames.
Version
The command quilt log reveals the history of a package:
Hash Created Author
c0f2e30... 2017-03-22T22:35:18 akarve
2c7b6fd... 2017-03-22T22:44:51 akarve
You can install a specific version of a package as follows:
$ quilt install USER/PKG -x HASH
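The log above identifies each version by a hash. As a rough sketch of why hashes work well for this, the snippet below derives an identifier from the bytes of the data itself. It assumes SHA-256 over raw bytes, which is an illustrative choice and not necessarily how Quilt computes its hashes; `content_hash` and both revisions are hypothetical.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Return the SHA-256 hex digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


# Two hypothetical revisions of the same table, as raw bytes.
revision_1 = b"Order ID,Sales\n3,261.54\n"
revision_2 = b"Order ID,Sales\n3,261.54\n293,10123.02\n"

hash_1 = content_hash(revision_1)
hash_2 = content_hash(revision_2)

# A log entry could show a shortened hash, similar in shape to the
# listing above.
print(hash_1[:7] + "...", hash_2[:7] + "...")
```

Because the identifier is computed from the content, any change to the data produces a different hash, and identical data always maps to the same one. That property is what lets a hash pin an exact version for reproducible results.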
There’s lots more
Check out the Quilt docs for details.
Contribute
The Quilt client is open source. We welcome contributions to the GitHub repository.