From zero to science without worrying about infrastructure

Aneesh Karve
Quilt
3 min read · Mar 2, 2022

In this webinar, we reveal how teams in healthcare and life sciences (HCLS) can build effective and usable data lakes, with Jim Davis from Amazon Web Services and Aaron Jeskey from PTP.

You can find PTP’s Zero to Science Quilt Jumpstart Deployment on AWS Marketplace.

Key topics

How can biotechs get started quickly storing, analyzing, and reusing data?

  • Prebuilt solutions for biotechs to get started (e.g. Biotech Blueprint)
  • Quick-start partner network, with consulting partners like PTP and application vendors like Quilt
  • Key imperatives for new HCLS customers in the AWS cloud
  • Transitioning from a university environment where infra is “done for you” to a corporate environment where you have to roll your own
  • Selecting the right AWS services
  • Core to science: getting data from the bench into Amazon S3, via AWS Storage Gateway
  • Finding the right software packages for the team (ELN, Data lake browsing)
  • AWS Control Tower for HIPAA, etc. compliance
  • Simple data management principles: blob storage, designing data for reuse and usability
  • It’s difficult to build the dam after the flood (data grows faster than headcount)
  • Enabling business users, non-developers, bench scientists to access S3 blob storage
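One of the blob-storage principles above, designing data for reuse, can be sketched as a predictable S3 key-naming helper. The project/instrument/date layout shown here is a hypothetical example for illustration, not a convention prescribed by AWS or Quilt:

```python
from datetime import date

def s3_key(project: str, instrument: str, run_date: date, filename: str) -> str:
    """Build a predictable S3 object key so bench data stays findable.

    The project/instrument/date hierarchy is one example scheme;
    any consistent, queryable layout serves the same purpose.
    """
    return f"{project}/{instrument}/{run_date.isoformat()}/{filename}"

# e.g. "oncology-assay/novaseq/2022-03-02/sample_001.fastq.gz"
key = s3_key("oncology-assay", "novaseq", date(2022, 3, 2), "sample_001.fastq.gz")
```

A consistent layout like this is what lets non-developers browse blob storage by prefix, and lets downstream tooling find data without a lookup table.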

Effective data management for life sciences teams

  • Breaking out of the tyranny of files and folders with cloud datasets (Quilt packages)
  • Improving data uniqueness by eliminating data silos and needless copies
  • Open source dataset abstraction for integrating data, metadata, charts, documentation, lineage in an immutable container
  • Meeting growing sample and pipeline volumes
  • How to relate raw data to analyzed data to finalized or sealed data
  • Small teams mask big problems: data management feels easy until you start hiring more people
  • Raw Amazon S3, NAS, Box are not enough for effective data management and labeling
  • As teams grow, finding and reusing data becomes a major hurdle to IND velocity
  • The time to put data management practices in place is NOW, before data volumes and headcounts grow and get messy
  • Ensuring two-sided access to cloud datasets: API and GUI
  • Data culture harmonizes priorities with incentives. Data management is more than technology; it is a human problem
  • If humans do the lazy thing, the data management layer should still work
  • Data uniqueness is required for trust: there should be one authoritative copy of the data that the team regards as true. Data versioning makes datasets authoritative in spite of copying
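As a minimal sketch of the versioning idea, content-addressing a dataset with a hash makes any copy verifiable against the authoritative version. This mirrors, in simplified form, how immutable dataset containers like Quilt packages are identified; the helper below is hypothetical and for illustration only:

```python
import hashlib

def dataset_hash(files: dict) -> str:
    """Fingerprint a dataset (mapping of file name -> bytes) with SHA-256.

    Two copies are 'the same dataset' iff their hashes match,
    no matter where they live or how many times they were copied.
    """
    h = hashlib.sha256()
    for name in sorted(files):  # sort so file order never changes the hash
        h.update(name.encode())
        h.update(files[name])
    return h.hexdigest()

original = {"results.csv": b"a,b\n1,2\n", "README.md": b"Assay run 7"}
copy = dict(original)                                  # byte-identical copy elsewhere
edited = {**original, "results.csv": b"a,b\n1,3\n"}    # a silent edit

assert dataset_hash(original) == dataset_hash(copy)    # copies stay trustworthy
assert dataset_hash(original) != dataset_hash(edited)  # drift is detectable
```

This is why versioning makes datasets authoritative in spite of copying: trust attaches to the hash, not to any particular file location.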

Syncing data to Amazon S3

  • Enable tools to work together by syncing all data into Amazon S3 with a user-friendly interface to S3
  • Cross-functional teams should refer to the same datasets in Amazon S3
  • Gathering CRO data in S3
  • Triggering automated pipelines in S3
  • Syncing file shares to Amazon S3 via AWS Storage Gateway
  • Integrating Egnyte with Quilt, S3
  • All data ultimately lands in S3 (from Egnyte, Box, etc.)
  • Once data are in S3, analysis becomes much easier: there are events, and there is a data lifecycle
  • Separating data collection from curation
  • It is unsustainable to store your data in someone else’s cloud
  • In your private cloud, you have full control over your data
  • Synchronizing data in databases to S3: AWS Glue Crawler; linking to foreign data sources via unique IDs or query strings; generating database tables into Quilt for search and discovery
  • Streamlining the dataset creation UX
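The pipeline-triggering point above can be sketched as an AWS Lambda handler that receives an S3 `ObjectCreated` notification and extracts the bucket and key. The event shape is the standard S3 notification format; the routing rule (dispatch on file extension) is an assumption for illustration:

```python
def handler(event, context):
    """Route S3 ObjectCreated events to a downstream pipeline.

    `event` follows the standard S3 notification shape:
    event["Records"][i]["s3"]["bucket"]["name"] / ["object"]["key"].
    """
    launched = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # Hypothetical routing rule: fastq files kick off alignment,
        # everything else just gets indexed for search.
        pipeline = "alignment" if key.endswith(".fastq.gz") else "index"
        # A real deployment would start the pipeline here via boto3,
        # e.g. by submitting to Step Functions, Batch, or SQS.
        launched.append((pipeline, f"s3://{bucket}/{key}"))
    return launched

# Example S3 notification payload (abbreviated to the fields used above)
event = {"Records": [{"s3": {"bucket": {"name": "lab-data"},
                             "object": {"key": "runs/sample_001.fastq.gz"}}}]}
```

Because every tool lands data in S3, one small event handler like this can fan out to any number of pipelines without the sources knowing about each other.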

Data management is more of a human challenge than it is a technical challenge. See Quilt’s Data Platform Survival Guide.


Data, visualization, machine learning, and abstract algebra. CTO and co-founder @QuiltData. Recent talks https://goo.gl/U9VYr5.