Building a data lake for drug discovery: data reuse, compliance, and findability

Aneesh Karve
Published in Quilt
3 min read · Jun 10, 2021


In this webinar, Eric Goldbrener walks us through the highs and lows of data management for drug discovery. Eric is a consultant in compliant infrastructure for enterprises that develop drugs, targets, and therapies. An applied mathematician by training, Eric led the University of California’s Big Data Initiative, culminating in a HIPAA-compliant data lake deployed at the San Diego Supercomputer Center (SDSC).

Slides (.pdf)

Download: Building a data lake for drug discovery.

Key topics

  • Common missteps when planning (or failing to plan) a data lake
  • Drug discovery milestones: funding, lab research, collaboration, live organism testing, clinical trials
  • The tension between data security and data usability
  • The role of data lineage or “provenance” in compliance, auditing, and peer review
  • The business costs of scientists not being able to find or access data that they know exists: lost publications, lost discoveries, repeated experiments
  • Planning for data reuse and data collaboration
  • The false lure of technical debt as a way to bring products to market faster

Eric’s five recommendations for architecting a data lake

  1. Start your data lake in the cloud (not on-premise)
  2. Make the data lake the central repository of reference data for all applications
  3. Define a workflow in which users read source data from the data lake and publish reports and analytics back to the data lake (a sketch of this round trip follows the list)
  4. The data lake should automatically generate its own human-readable content catalog with browse and search capabilities
  5. Build controls and security for regulatory compliant applications
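
To make recommendation 3 concrete, here is a minimal sketch of that round trip using Quilt’s open-source quilt3 client. The bucket and package names are hypothetical; any tool that gives you versioned reads and writes against the lake would serve the same role.

```python
import quilt3

# Read source data from the lake: browse a published package
# (registry bucket and package names are hypothetical).
src = quilt3.Package.browse("lab/assay-results",
                            registry="s3://drug-discovery-lake")
src["raw/plate_reads.csv"].fetch("plate_reads.csv")

# ...analyze locally, producing summary.csv and report.html...

# Publish the derived artifacts back to the lake as a new,
# versioned package so downstream users can find and cite them.
out = quilt3.Package()
out.set("summary.csv", "summary.csv")
out.set("report.html", "report.html")
out.push("lab/assay-report",
         registry="s3://drug-discovery-lake",
         message="Report derived from lab/assay-results")
```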

The structural reasons why you can’t find the data you’re looking for

  • Schemas unknown at write and constantly changing
  • Documentation and metadata absent or siloed from data
  • Data, docs, and metadata are not conveniently indexed for plain text or even SQL searches and queries
  • Technical debt is a seductive lure under schedule pressure but ultimately leads to further delays in the drug discovery process
  • Developers and non-developers work in different systems, leading to divergent sources of truth

Tips for moving towards a self-organizing data lake

  1. Realize that schemas are progressively discovered, not fixed or defined up front
  2. Establish programmatic data quality gates that filter data as it progresses from swampy, to refined, to curated (a sketch follows this list)
  3. Embed documentation and visualizations within data for context and understandability
  4. Leverage the principles of immutable data structures and pure functions
  5. Define a manifest that maps logical keys to physical keys and metadata (see Quilt’s open source data package manifest, sketched below)
  6. Build your data lake from discrete, immutable units to facilitate reproducibility, discoverability, and trust
  7. Make blob storage, like Amazon S3, the heart of your data lake, and enable object versioning (snippet after the list)
  8. Control the chaos with branching and staging (S3 buckets are for data what branches are for source code)
  9. Prefer schema-on-read query engines like Athena for analytics; don’t pay for compute you’re not using (example below)
  10. Embed documentation, visualizations, and metadata with the primary data
  11. Index metadata for discovery with systems like Elasticsearch (example below)
  12. Immutability is the key to moving faster: teams with an immutable revision history can time travel, are audit-ready, have an easier time debugging, and can build on one another’s work
  13. Moving compute to data is cheaper and faster than moving data to compute
  14. Plan access patterns with role-based access control (RBAC) and use compute regions to meet data residency requirements for clinical and regional data
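
On tips 2 and 4, a quality gate can be as simple as a pure function: it never mutates its input, so the same swampy input always yields the same refined output, and the gate can be re-run or audited at any time. A minimal sketch with hypothetical field names:

```python
def refine(raw_rows: list[dict]) -> list[dict]:
    # Pure function: the input list is never mutated, so re-running
    # the gate on the same raw data always yields the same result.
    return [
        {**row, "qc": "pass"}           # copy each row, don't mutate it
        for row in raw_rows
        if row.get("ic50") is not None  # drop rows missing a reading
    ]
```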
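For tip 5, quilt3 expresses the manifest idea directly: each package entry binds a logical key to a physical key plus structured metadata. A sketch with hypothetical keys and metadata:

```python
import quilt3

pkg = quilt3.Package()
# Logical key -> physical key, plus entry-level metadata.
pkg.set("screens/run-01.csv",
        "s3://drug-discovery-lake/raw/2021/06/run-01.csv",
        meta={"assay": "binding", "instrument": "plate-reader-3"})
# Package-level metadata travels with the manifest as well.
pkg.set_meta({"study": "STUDY-001", "stage": "refined"})
```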
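Tip 7’s object versioning is a one-time bucket setting. With boto3 (the bucket name is hypothetical):

```python
import boto3

s3 = boto3.client("s3")
# Keep every prior revision when an object is overwritten or deleted.
s3.put_bucket_versioning(
    Bucket="drug-discovery-lake",
    VersioningConfiguration={"Status": "Enabled"},
)
```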
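For tip 9, Athena applies the schema at read time, directly over objects in S3, and bills per query rather than for an idle cluster. A sketch with hypothetical database, table, and bucket names:

```python
import boto3

athena = boto3.client("athena")
resp = athena.start_query_execution(
    QueryString="SELECT target, AVG(ic50) FROM assays GROUP BY target",
    QueryExecutionContext={"Database": "lake"},
    ResultConfiguration={
        "OutputLocation": "s3://drug-discovery-lake/athena-results/"
    },
)
print(resp["QueryExecutionId"])  # poll this ID for results
```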
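Finally, tip 11: once manifests carry metadata, indexing it for search is a small step. A sketch assuming the Elasticsearch 7.x Python client and a hypothetical endpoint and index:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # hypothetical endpoint
es.index(index="lake-metadata", body={
    "logical_key": "screens/run-01.csv",
    "physical_key": "s3://drug-discovery-lake/raw/2021/06/run-01.csv",
    "assay": "binding",
    "study": "STUDY-001",
})
# A plain-text search on "binding" or "STUDY-001" now finds the data.
```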
