Declarative data processing with pandas and pyarrow

Aneesh Karve · Published in Quilt · 3 min read · Mar 5, 2018


Pandas offers a rich set of routines for parsing, transforming, and loading data into memory. But hand-written pandas code becomes impractical when you want to transform and load hundreds of files.

Suppose that you’re working with the babynames folder from Wes McKinney’s Python for Data Analysis. The folder contains 131 .txt files and one .pdf. We use head to get a sense of how the .txt files are structured:

$ head -n 5 yob1880.txt
Mary,F,7065
Anna,F,2604
Emma,F,2003
Elizabeth,F,1939
Minnie,F,1746

head reveals that our .txt files are formatted as CSV and are missing a header row. If we were using pandas directly, we’d call read_csv(file, header=None) on each file. To make things easier, we built declarative parsing syntax into Quilt’s open source data compiler. Now we can replace numerous calls to read_csv with a few lines of YAML:

contents:
  babynames:
    "babynames/*.txt":
      transform: csv
      kwargs:
        header:   # this sets header to None = no column names
The above build.yml file treats all of the .txt descendants of the babynames directory as CSVs with a missing header row. Note that "babynames/*.txt" supports glob syntax and matches against file paths relative to the build root.
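For comparison, here’s a sketch of the hand-written pandas loop that those few lines of YAML replace (the dictionary and glob usage are illustrative, not part of Quilt):

import glob
import os

import pandas as pd

# One read_csv call per file; fine for a handful of files,
# tedious and error-prone for hundreds.
frames = {}
for path in glob.glob("babynames/*.txt"):
    name = os.path.splitext(os.path.basename(path))[0]  # e.g. "yob1880"
    frames[name] = pd.read_csv(path, header=None)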

We can execute our ETL program as follows:

$ pip install quilt # if you haven't already
$ quilt build <USERNAME>/<PACKAGE_NAME>

Next, Quilt calls pandas.read_csv to parse the files into data frames, and pyarrow then shreds those data frames into a columnar storage format, Apache Parquet. Why Parquet? Because it compresses well, enables high-performance querying, and is accessible to a wide variety of big data query engines like PrestoDB and Drill. Conveniently, Parquet is an implementation detail that Quilt hides from the user. The magic of pyarrow (and Arrow as a whole) is that it virtualizes columnar data so that it can be reshaped into a variety of formats.
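Outside of Quilt, the same data-frame-to-Parquet round trip looks roughly like this with pyarrow directly (a minimal sketch; the column names are assumptions, since the raw files carry none):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Parse one file, then shred the data frame into a Parquet table
df = pd.read_csv("babynames/yob1880.txt", header=None,
                 names=["name", "sex", "births"])  # hypothetical column names
table = pa.Table.from_pandas(df)
pq.write_table(table, "yob1880.parquet")

# Reading it back yields an equivalent data frame
df_again = pq.read_table("yob1880.parquet").to_pandas()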

Here’s how we can leverage the package that we just built in Jupyter:

from quilt.data.<USERNAME> import <PACKAGE_NAME> as data

data.babynames
<GroupNode>
yob1880
yob1881
yob1882
yob1883
...

We see that babynames contains the 131 .txt files as data frames. Since all of the files share the same schema, we can implicitly call pandas.concat as follows:

data.babynames()

That one call returns the 1.69 million rows from all 131 files in a single data frame.
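Calling the group node is roughly equivalent to concatenating its children by hand. A sketch, reusing the data import from above (only two of the 131 years shown):

import pandas as pd

# Each child node returns a single data frame when called
df_1880 = data.babynames.yob1880()

# data.babynames() roughly amounts to concatenating every child:
combined = pd.concat(
    [data.babynames.yob1880(), data.babynames.yob1881()],
    ignore_index=True,
)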

If you wish to share your data package with others, run quilt push <USERNAME>/<PACKAGE_NAME> --public.

Quilt’s declarative ETL syntax offers many more options for parsing columnar data (CSV, TSV, XLS, etc.). Visit the documentation to learn more. In future releases we plan to natively support non-columnar formats like JSON and XML. You can include non-columnar files in Quilt packages today; just note that they are copied “as is” and are not transformed to Parquet.

If you’re looking for an automatic method of generating a declarative ETL program for a directory of files, try quilt generate DIR. quilt generate recursively descends DIR and adds an entry for every file it encounters.

The Quilt data compiler is open source and we welcome your contributions on GitHub.

