From zero to science without worrying about infrastructure

Aneesh Karve
Quilt
3 min read · Mar 2, 2022

In this webinar, we reveal how teams in healthcare and life sciences (HCLS) can build effective and usable data lakes, with Jim Davis from Amazon Web Services and Aaron Jeskey from PTP.

You can find PTP’s Zero to Science Quilt Jumpstart Deployment on AWS Marketplace.

Key topics

How can biotechs get started quickly storing, analyzing, and reusing data?

  • Prebuilt solutions for biotechs to get started (e.g. Biotech Blueprint)
  • Quick-start partner network, with consulting partners like PTP and application vendors like Quilt
  • Key imperatives for new HCLS customers in the AWS cloud
  • Transitioning from a university environment where infra is “done for you” to a corporate environment where you have to roll your own
  • Selecting the right AWS services
  • Core to science: getting data from the bench into Amazon S3, via AWS Storage Gateway
  • Finding the right software packages for the team (ELN, Data lake browsing)
  • AWS Control Tower for HIPAA, etc. compliance
  • Simple data management principles: blob storage, designing data for reuse and usability
  • It’s difficult to build the dam after the flood (data grows faster than headcount)
  • Enabling business users, non-developers, bench scientists to access S3 blob storage
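One of the blob-storage principles above, designing data for reuse, can be sketched as a predictable S3 key-naming helper. The project/instrument/date layout shown here is a hypothetical example for illustration, not a convention prescribed by AWS or Quilt:

```python
from datetime import date

def s3_key(project: str, instrument: str, run_date: date, filename: str) -> str:
    """Build a predictable S3 object key so bench data stays findable.

    The project/instrument/date hierarchy is one example scheme;
    any consistent, queryable layout serves the same purpose.
    """
    return f"{project}/{instrument}/{run_date.isoformat()}/{filename}"

# e.g. "oncology-assay/novaseq/2022-03-02/sample_001.fastq.gz"
key = s3_key("oncology-assay", "novaseq", date(2022, 3, 2), "sample_001.fastq.gz")
```

A consistent layout like this is what lets non-developers browse blob storage by prefix, and lets downstream tooling find data without a lookup table.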

Effective data management for life sciences teams

  • Breaking out of the tyranny of files and folders with cloud datasets (Quilt packages)
  • Improving data uniqueness by eliminating data silos and needless copies
  • Open source dataset abstraction for integrating data, metadata, charts, documentation, lineage in an immutable container
  • Meeting growing sample and pipeline volumes
  • How to relate raw data to analyzed data to finalized or sealed data
  • Small teams mask big problems: data management feels easy until you start hiring more people
  • Raw Amazon S3, NAS, Box are not enough for effective data management and labeling
  • As teams grow, finding and reusing data becomes a major hurdle to IND velocity
  • The time to put data management practices in place is NOW, before data volumes and headcounts grow and get messy
  • Ensuring two-sided access to cloud datasets: API and GUI
  • Data culture harmonizes priorities with incentives. Data management is more than technology; it is a human problem
  • If humans do the lazy thing, the data management layer should still work
  • Data uniqueness is required for trust: there should be one authoritative copy of the data that the team regards as true. Data versioning makes datasets authoritative in spite of copying
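As a minimal sketch of the versioning idea, content-addressing a dataset with a hash makes any copy verifiable against the authoritative version. This mirrors, in simplified form, how immutable dataset containers like Quilt packages are identified; the helper below is hypothetical and for illustration only:

```python
import hashlib

def dataset_hash(files: dict) -> str:
    """Fingerprint a dataset (mapping of file name -> bytes) with SHA-256.

    Two copies are 'the same dataset' iff their hashes match,
    no matter where they live or how many times they were copied.
    """
    h = hashlib.sha256()
    for name in sorted(files):  # sort so file order never changes the hash
        h.update(name.encode())
        h.update(files[name])
    return h.hexdigest()

original = {"results.csv": b"a,b\n1,2\n", "README.md": b"Assay run 7"}
copy = dict(original)                                  # byte-identical copy elsewhere
edited = {**original, "results.csv": b"a,b\n1,3\n"}    # a silent edit

assert dataset_hash(original) == dataset_hash(copy)    # copies stay trustworthy
assert dataset_hash(original) != dataset_hash(edited)  # drift is detectable
```

This is why versioning makes datasets authoritative in spite of copying: trust attaches to the hash, not to any particular file location.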

Syncing data to Amazon S3

  • Enable tools to work together by syncing all data into Amazon S3 with a user-friendly interface to S3
  • Cross-functional teams should refer to the same datasets in Amazon S3
  • Gathering CRO data in S3
  • Triggering automated pipelines in S3
  • Syncing file shares to Amazon S3 via AWS Storage Gateway
  • Integrating Egnyte with Quilt, S3
  • All data ultimately lands in S3 (from Egnyte, Box, etc.)
  • Once data are in S3, analysis becomes much easier: there are events, and there is a data lifecycle
  • Separating data collection from curation
  • It is unsustainable to store your data in someone else’s cloud
  • In your private cloud, you have full control over your data
  • Synchronizing data in databases to S3: AWS Glue Crawler; linking to foreign data sources via unique IDs or query strings; generating database tables into Quilt for search and discovery
  • Streamlining the dataset creation UX
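The pipeline-triggering point above can be sketched as an AWS Lambda handler that receives an S3 `ObjectCreated` notification and extracts the bucket and key. The event shape is the standard S3 notification format; the routing rule (dispatch on file extension) is an assumption for illustration:

```python
def handler(event, context):
    """Route S3 ObjectCreated events to a downstream pipeline.

    `event` follows the standard S3 notification shape:
    event["Records"][i]["s3"]["bucket"]["name"] / ["object"]["key"].
    """
    launched = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # Hypothetical routing rule: fastq files kick off alignment,
        # everything else just gets indexed for search.
        pipeline = "alignment" if key.endswith(".fastq.gz") else "index"
        # A real deployment would start the pipeline here via boto3,
        # e.g. by submitting to Step Functions, Batch, or SQS.
        launched.append((pipeline, f"s3://{bucket}/{key}"))
    return launched

# Example S3 notification payload (abbreviated to the fields used above)
event = {"Records": [{"s3": {"bucket": {"name": "lab-data"},
                             "object": {"key": "runs/sample_001.fastq.gz"}}}]}
```

Because every tool lands data in S3, one small event handler like this can fan out to any number of pipelines without the sources knowing about each other.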

Data management is more of a human challenge than it is a technical challenge. See Quilt’s Data Platform Survival Guide.


Data, visualization, machine learning, and abstract algebra. CTO and co-founder @QuiltData. Recent talks https://goo.gl/U9VYr5.