Rethinking S3: Announcing T4, a team data hub

Aneesh Karve · Quilt · Oct 18, 2018 · 4 min read

In the past year, Quilt has shuttled terabytes of data to and from Amazon S3. Globally, S3 houses more than two trillion objects, and handles more than a million requests per second. Love it or hate it, S3 is the world’s data lake.

On the love side, S3 is fast, scales beautifully, and is reasonably priced. At its worst, S3 is opaque and hard to use — a write-only database.

Today we’re announcing an open source project called T4. T4 gives S3 buckets superpowers, transforming S3 into a team data hub. T4 is for data scientists, data engineers, and data-driven teams.

Blob storage as a data integration layer

Teams want to be data-driven, but that ambition collides with reality: data are scattered across systems, formats, and organizational silos. As a result, no one has a complete and accurate picture of the latest data.

We see the need for a unified, low-cost data layer that houses all of a team’s canonical data and covers all four quadrants of the experiment-to-production lifecycle, a pattern we’ve observed as data graduates from individual experiments to production endpoints.

Figure 1: Experiment-to-production lifecycle

Data science often begins with single-developer, ad hoc experiments (upper left quadrant). These experiments are gradually shared outside of the organization. As experiments prove successful, data engineers scale experimental models and data into production equivalents. Version control — for code, containers, and data — forms an essential thread across all four phases: immutable hashes from systems like Git, Docker, and Quilt ensure that workflows are reproducible.

S3 is an excellent candidate for capturing the full data lifecycle. S3 accepts any data format, scales well, and offers granular permissions. But S3 is missing key features to make data findable, accessible, interoperable, and reusable. T4 adds these missing features to S3.

S3 + services = T4, a team data hub

Version all the things

T4 uses S3 versions to track the history of every path in S3. With S3 versions, you can travel back in time, detect changes, and recover from accidental deletions. One level above S3 versions, T4 introduces packages. A package is an immutable collection of one or more objects, usually a directory or an entire S3 bucket. Once sealed, a package can never change, so data pipelines that consume packages are reproducible. T4’s package diff() shows what’s changed between two packages.
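
Here’s a minimal sketch of the package workflow in Python. The bucket name, package handle, and file paths are placeholders, and exact signatures may shift before 1.0:

```python
import t4

# Assemble a package from local files (all names here are hypothetical).
pkg = t4.Package()
pkg.set("model/weights.h5", "weights.h5")
pkg.set("notebooks/analysis.ipynb", "analysis.ipynb")

# Pushing seals the package and records it in the bucket's registry.
pkg.push("aneesh/experiment", dest="s3://my-team-bucket")

# Later: fetch the sealed package and compare it against a new revision.
old = t4.Package.browse("aneesh/experiment", registry="s3://my-team-bucket")
new = t4.Package()
new.set("model/weights.h5", "weights-v2.h5")
print(old.diff(new))  # objects added, modified, and deleted
```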

Object preview

S3 requires users to download objects in order to know what’s in them. With T4’s web catalog, you can preview images, markdown files, and Jupyter notebooks — without downloading anything from S3.

Figure 2: Markdown preview in T4

Where there’s data, there’s visualization. T4 constructs visual summaries of data in S3, with user-defined combinations of Vega specs, images, Jupyter notebooks, and markdown files.

Figure 3: Visual summaries in T4
Figure 4: Preview notebooks in S3
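
At the time of writing, a summary is defined with a quilt_summarize.json file at the root of a directory: a JSON list of the documents to render. The file names below are illustrative:

```json
[
  "README.md",
  "figures/roc-curve.vega.json",
  "figures/confusion-matrix.png",
  "notebooks/report.ipynb"
]
```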

Full-text and faceted search

Elasticsearch is the de facto search solution on AWS. It’s fast and precise. But, like so many instruments in the AWS toolbox, Elasticsearch is challenging to set up and tune. T4 automatically configures a private Elasticsearch endpoint for your S3 bucket, and attaches Lambda functions that index files as they land in S3. By default, T4 builds a full-text index for Jupyter notebooks and markdown files.

When notebooks are searchable, it’s easier to find and reuse past results — whether the results were generated by you or by a colleague, moments ago or last year.

Each object that you write to S3 can be annotated with custom metadata: put(obj, meta={"author":"aneesh"}). Faceted search gives your queries greater precision. You can sift through metadata facets with queries like user_meta.author:"aneesh".
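
A sketch of writing annotated objects from Python, assuming a placeholder bucket (details of the alpha API may differ):

```python
import pandas as pd
import t4

scores = pd.DataFrame({"model": ["baseline", "tuned"], "auc": [0.87, 0.91]})

# Attach custom metadata at write time; the bucket and key are placeholders.
t4.put(scores, "s3://my-team-bucket/experiments/scores", meta={"author": "aneesh"})

# In the T4 catalog's search bar, the faceted query
#     user_meta.author:"aneesh"
# then narrows results to objects annotated with that author.
```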

Figure 5: Faceted search over Python objects in T4

Read, write, and de/serialize Python objects

T4 adds syntactic sugar to Amazon’s S3 client, so that it’s easy to read and write Python objects to and from S3. We’ve found it convenient to stash data frames, numpy arrays, and dictionaries in S3 with one T4 command, put().

Figure 6: T4’s data movement APIs
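
For instance, a numpy array round-trips with a pair of calls. This is a sketch against the alpha APIs, again with a placeholder bucket; return shapes may change before 1.0:

```python
import numpy as np
import t4

arr = np.random.rand(100, 100)

# One call serializes the array and writes it to S3.
t4.put(arr, "s3://my-team-bucket/arrays/random")

# One call reads it back; T4 deserializes the object and also returns
# the metadata stored alongside it.
arr2, meta = t4.get("s3://my-team-bucket/arrays/random")
assert (arr == arr2).all()
```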

Your data, your rules

T4 runs on top of your own S3 buckets — giving you total control over permissions. In the near future, we’ll offer a hosted version of T4.

It’s still early

T4 is alpha software. We do not yet recommend T4 for production work. The T4 APIs and features will be in flux until we reach version 1.0.

Contribute

We’re offering T4 as a glimpse into the world’s data lake, S3. Come help us build T4 on GitHub.
