Building a data lake for drug discovery: data reuse, compliance, and findability

Aneesh Karve
Published in Quilt
3 min read · Jun 10, 2021


In this webinar, Eric Goldbrener walks us through the highs and lows of data management for drug discovery. Eric is a consultant in compliant infrastructure for enterprises that develop drugs, targets, and therapies. An applied mathematician by training, Eric led the University of California’s Big Data Initiative, culminating in a HIPAA-compliant data lake deployed at the San Diego Supercomputer Center (SDSC).

Slides (.pdf)

Download: Building a data lake for drug discovery.

Key topics

  • Common missteps when planning (or failing to plan) a data lake
  • Drug discovery milestones: funding, lab research, collaboration, live organism testing, clinical trials
  • The tension between data security and data usability
  • The role of data lineage or “provenance” in compliance, auditing, and peer review
  • The business costs of scientists not being able to find or access data that they know exists: lost publications, lost discoveries, repeated experiments
  • Planning for data reuse and data collaboration
  • The false lure of technical debt as a way to bring products to market faster

Eric’s five recommendations for architecting a data lake

  1. Start your data lake in the cloud (not on-premise)
  2. Make the data lake the central repository of reference data for all applications
  3. Define a workflow in which users read source data from the data lake and publish reports and analytics back to the data lake (a sketch of this round trip follows the list)
  4. The data lake should automatically generate its own human-readable content catalog with browse and search capabilities
  5. Build controls and security for regulatory compliant applications
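
To make recommendation 3 concrete, here is a minimal sketch of that round trip using Quilt’s open-source quilt3 client. The bucket and package names are hypothetical; any tool that gives you versioned reads and writes against the lake would serve the same role.

```python
import quilt3

# Read source data from the lake: browse a published package
# (registry bucket and package names are hypothetical).
src = quilt3.Package.browse("lab/assay-results",
                            registry="s3://drug-discovery-lake")
src["raw/plate_reads.csv"].fetch("plate_reads.csv")

# ...analyze locally, producing summary.csv and report.html...

# Publish the derived artifacts back to the lake as a new,
# versioned package so downstream users can find and cite them.
out = quilt3.Package()
out.set("summary.csv", "summary.csv")
out.set("report.html", "report.html")
out.push("lab/assay-report",
         registry="s3://drug-discovery-lake",
         message="Report derived from lab/assay-results")
```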

The structural reasons why you can’t find the data you’re looking for

  • Schemas unknown at write and constantly changing
  • Documentation and metadata absent or siloed from data
  • Data, docs, and metadata are not conveniently indexed for plain text or even SQL searches and queries
  • Technical debt is a seductive lure under schedule pressure but ultimately leads to further delays in the drug discovery process
  • Developers and non-developers work in different systems, leading to divergent sources of truth

Tips for moving towards a self-organizing data lake

  1. Realize that schemas are progressively discovered, not fixed or defined up front
  2. Establish programmatic data quality gates that filter data as it progresses from swampy, to refined, to curated (a sketch follows this list)
  3. Embed documentation and visualizations within data for context and understandability
  4. Leverage the principles of immutable data structures and pure functions
  5. Define a manifest that maps logical keys to physical keys and metadata (see Quilt’s open source data package manifest, sketched below)
  6. Build your data lake from discrete, immutable units to facilitate reproducibility, discoverability, and trust
  7. Make blob storage, like Amazon S3, the heart of your data lake, and enable object versioning (snippet after the list)
  8. Control the chaos with branching and staging (S3 buckets are for data what branches are for source code)
  9. Prefer schema-on-read query engines like Athena for analytics; don’t pay for compute you’re not using (example below)
  10. Embed documentation, visualizations, and metadata with the primary data
  11. Index metadata for discovery with systems like Elasticsearch (example below)
  12. Immutability is the key to moving faster: teams with an immutable revision history can time travel, are audit-ready, have an easier time debugging, and can build on one another’s work
  13. Moving compute to data is cheaper and faster than moving data to compute
  14. Plan access patterns with role-based access control (RBAC) and use compute regions to meet data residency requirements for clinical and regional data
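
On tips 2 and 4, a quality gate can be as simple as a pure function: it never mutates its input, so the same swampy input always yields the same refined output, and the gate can be re-run or audited at any time. A minimal sketch with hypothetical field names:

```python
def refine(raw_rows: list[dict]) -> list[dict]:
    # Pure function: the input list is never mutated, so re-running
    # the gate on the same raw data always yields the same result.
    return [
        {**row, "qc": "pass"}           # copy each row, don't mutate it
        for row in raw_rows
        if row.get("ic50") is not None  # drop rows missing a reading
    ]
```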
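For tip 5, quilt3 expresses the manifest idea directly: each package entry binds a logical key to a physical key plus structured metadata. A sketch with hypothetical keys and metadata:

```python
import quilt3

pkg = quilt3.Package()
# Logical key -> physical key, plus entry-level metadata.
pkg.set("screens/run-01.csv",
        "s3://drug-discovery-lake/raw/2021/06/run-01.csv",
        meta={"assay": "binding", "instrument": "plate-reader-3"})
# Package-level metadata travels with the manifest as well.
pkg.set_meta({"study": "STUDY-001", "stage": "refined"})
```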
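Tip 7’s object versioning is a one-time bucket setting. With boto3 (the bucket name is hypothetical):

```python
import boto3

s3 = boto3.client("s3")
# Keep every prior revision when an object is overwritten or deleted.
s3.put_bucket_versioning(
    Bucket="drug-discovery-lake",
    VersioningConfiguration={"Status": "Enabled"},
)
```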
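For tip 9, Athena applies the schema at read time, directly over objects in S3, and bills per query rather than for an idle cluster. A sketch with hypothetical database, table, and bucket names:

```python
import boto3

athena = boto3.client("athena")
resp = athena.start_query_execution(
    QueryString="SELECT target, AVG(ic50) FROM assays GROUP BY target",
    QueryExecutionContext={"Database": "lake"},
    ResultConfiguration={
        "OutputLocation": "s3://drug-discovery-lake/athena-results/"
    },
)
print(resp["QueryExecutionId"])  # poll this ID for results
```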
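Finally, tip 11: once manifests carry metadata, indexing it for search is a small step. A sketch assuming the Elasticsearch 7.x Python client and a hypothetical endpoint and index:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # hypothetical endpoint
es.index(index="lake-metadata", body={
    "logical_key": "screens/run-01.csv",
    "physical_key": "s3://drug-discovery-lake/raw/2021/06/run-01.csv",
    "assay": "binding",
    "study": "STUDY-001",
})
# A plain-text search on "binding" or "STUDY-001" now finds the data.
```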
