Capacity building in the cloud for Data Intensive Cancer Genomics



SLIDE 1

Capacity building in the cloud for Data Intensive Cancer Genomics

Bruce Press, Ci4CC Fall meeting October 1, 2018

SLIDE 2

The rate of data generation is accelerating rapidly.

Doubling time of medical knowledge: 50 years in 1950; a projected 73 days in 2020.

Densen, P. Trans Am Clin Climatol Assoc 2011

SLIDE 3

Using this information to improve cancer patient outcomes isn’t only a technology challenge.

  • Scalable & Secure Environments
  • Data Sharing & Collaboration
  • Data Analysis Fluency
  • Data Harmonization & Organization

[Figure: the four areas above, shown with the labels "Technology" and "Social"]

SLIDE 4

Cloud is the most economically reasonable way to store and analyze our growing health data corpus.

SLIDE 5

Cloud provides significant benefits for health data analysis at scale.

  • Immediate scaling -- no need to wait to purchase and install hardware.
  • Levels the playing field -- even researchers at institutions without large compute infrastructure investments can access powerful data and compute resources.
  • Extreme durability eliminates or reduces the need for backup copies.
  • Multi-tenancy of data means many researchers can access data without needing to physically copy it.

Old model: send data to compute
New model: send compute to data

SLIDE 6

Compute and Data storage platforms allow more researchers to quickly realize the power of cloud.

  • Infrastructure configuration and security/compliance ‘out of the box’.
  • Optimized data storage and analysis methods, across multiple underlying cloud infrastructures.
  • Cost monitoring and management.
  • Allows researchers to focus on science, not managing computational resources.

SLIDE 7

The cloud allows multiple researchers to access the same copy of high-value public datasets.

  • NCI Cloud Pilots (now Cloud Resources and the Cancer Research Data Commons) paved the way for secure access and analysis of high-value datasets in the cloud.
  • Authentication and authorization mechanisms enable approved researchers to access Controlled data, initially from TCGA and TARGET and now from an expanding set of data resources.
  • Potential to save millions of dollars by reducing replication of data, and to speed research by avoiding download times.

https://cbiit.cancer.gov/ncip/cancer-research-data-commons
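To make the download-time point concrete, here is a back-of-envelope sketch. The dataset size and link speed are illustrative assumptions, not figures from the talk, and `transfer_hours` is a hypothetical helper name.

```python
def transfer_hours(size_tb: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Hours needed to copy `size_tb` terabytes over a `link_gbps` gigabit/s link.

    Decimal units are used: 1 TB = 8e12 bits, 1 Gbit/s = 1e9 bits/s.
    `efficiency` is the fraction of nominal bandwidth actually achieved.
    """
    if size_tb < 0 or link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("size must be >= 0, bandwidth > 0, efficiency in (0, 1]")
    bits = size_tb * 8e12
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 3600


# Assumed example: a 1 PB (1000 TB) dataset over a fully utilized 1 Gbit/s link.
hours = transfer_hours(size_tb=1000, link_gbps=1.0)
print(f"{hours:.0f} hours, about {hours / 24:.0f} days")
```

Under these assumptions a single copy takes roughly 93 days, and every institution that keeps its own copy pays that time (and the storage) again. Analyzing the shared copy in place avoids both.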

SLIDE 8

Finding, organizing and cleaning data currently accounts for 80% of work performed by data scientists.

Gil Press, Forbes 2016

SLIDE 9

Connecting data sets across multiple domains will increase the power of each to drive new discoveries.

  • Flexible, semantic data models and advanced search allow finding data of interest in enormous repositories.
  • Can’t be a one-size-fits-all solution: the properties most interesting for a particular research question tend to be unique.
  • For example, pregnancy exposure is highly important for birth defect research but is not a typical variable for adult cancer research.

SLIDE 10

Portable and self-contained analysis methods promote reproducibility and speed harmonization of new data with large repositories.

  • By describing analysis methods in Common Workflow Language and packaging tools in Docker containers, the exact routine used for large harmonization efforts can be applied to novel data.
  • Implementation of the GA4GH Workflow Execution Service (WES) standard allows the same analysis to be performed on multiple platforms.
  • Example: TOPMed harmonization workflow run on GTEx files.
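The portability idea above can be sketched in a few lines. This is a minimal illustration, not the TOPMed workflow or any platform's actual client: the tool (`wc -l`), the container image, the workflow URL, and both platform URLs are placeholders chosen for the example. The WES route and form field names follow the GA4GH WES 1.0 specification, and no network request is sent.

```python
import json


def make_cwl_tool(docker_image: str) -> dict:
    """Return a minimal CWL v1.0 CommandLineTool that runs `wc -l` in a container."""
    return {
        "cwlVersion": "v1.0",
        "class": "CommandLineTool",
        "baseCommand": ["wc", "-l"],
        # DockerRequirement pins the software environment to a container image.
        "requirements": [{"class": "DockerRequirement", "dockerPull": docker_image}],
        "inputs": {"input_file": {"type": "File", "inputBinding": {"position": 1}}},
        "stdout": "line_count.txt",
        "outputs": {"line_count": {"type": "stdout"}},
    }


def build_wes_run_request(wes_base_url: str, workflow_url: str, inputs: dict) -> dict:
    """Assemble the form fields for a WES `POST /runs` call (nothing is sent)."""
    return {
        "method": "POST",
        "url": wes_base_url.rstrip("/") + "/ga4gh/wes/v1/runs",
        "form": {
            "workflow_type": "CWL",
            "workflow_type_version": "v1.0",
            "workflow_url": workflow_url,
            "workflow_params": json.dumps(inputs),
        },
    }


# Placeholder image name; CWL documents can be written as JSON as well as YAML.
tool = make_cwl_tool("ubuntu:22.04")
print(json.dumps(tool, indent=2))

# The same workflow and inputs, addressed to two hypothetical WES-compliant platforms.
inputs = {"input_file": {"class": "File", "path": "sample.txt"}}
run_requests = [
    build_wes_run_request(base, "https://example.org/workflows/count_lines.cwl", inputs)
    for base in ("https://platform-a.example.org", "https://platform-b.example.org/")
]
for request in run_requests:
    print(request["url"])
```

Only the base URL differs between the two requests. The workflow description, the container image, and the inputs are identical, which is what allows the same routine to be applied to new data on a different platform.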

SLIDE 11

Sharing data and working on it together will speed discoveries.

SLIDE 12

Collaborative workspaces allow researchers with different expertise to work together in real time.

  • Capture the end-to-end ‘research journey’ to facilitate reproducibility and extension of results.
  • Fine-grained permissions allow different levels of data and analysis access.
  • Multiple communication channels allow researchers to discuss analyses and results in situ.
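As a rough illustration of what fine-grained permissions can look like, the sketch below models them as combinable flags. The permission names, roles, and `is_allowed` helper are hypothetical and are not taken from any specific platform.

```python
from enum import Flag, auto


class Permission(Flag):
    READ = auto()     # view files and results
    EXECUTE = auto()  # run analyses
    WRITE = auto()    # add or modify files
    ADMIN = auto()    # manage workspace members


# Hypothetical roles, each a combination of individual permissions.
ROLES = {
    "reviewer": Permission.READ,
    "analyst": Permission.READ | Permission.EXECUTE,
    "data_manager": Permission.READ | Permission.WRITE,
    "owner": Permission.READ | Permission.EXECUTE | Permission.WRITE | Permission.ADMIN,
}


def is_allowed(role: str, needed: Permission) -> bool:
    """Return True only if `role` holds every permission in `needed`."""
    granted = ROLES.get(role)
    if granted is None:
        return False  # unknown roles get no access
    return (granted & needed) == needed


print(is_allowed("analyst", Permission.EXECUTE))
print(is_allowed("reviewer", Permission.WRITE))
```

Separating permissions this way lets a workspace grant a collaborator the ability to run analyses without also granting the ability to change the underlying data.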

SLIDE 13

Sharing of data and results will reduce re-work, enhance serendipity, and ultimately result in better outcomes for patients.

  • Cloud platforms provide an efficient way to facilitate data sharing since there’s no additional cost for more researchers to access and analyze data.
  • Data owners are beginning to make data broadly available without embargo while ensuring compliance with patient consents; CHOP has led this charge via the CAVATICA platform.
  • New technologies facilitate sharing raw data, methods, and results in a Findable, Accessible, Interoperable, and Reusable (FAIR) way.

SLIDE 14

Data analysis will become a core competency for both researchers and medical professionals.

SLIDE 15

Platforms and tools must be highly usable with as low a barrier to entry as possible while at the same time enabling power users.

  • Reproducibility of analysis journeys also provides a powerful teaching resource.
  • Interactive workshops, hackathons, and training sessions are important to build expertise across individuals with diverse backgrounds.
  • Programmatic access methods (APIs) allow automation and optimization by advanced users while visual interfaces support a broad user base.

SLIDE 16

Acknowledgements

The Global Seven Bridges Team

Work presented was funded in whole or in part by: HHSN261201400008C, HHSN261200800001E, U2C HL138346-01, OT3 HL142478, OT3 OD02546 and U24CA224067