Scientific Data Asset Management: The Missing Link in Data Driven Discovery



slide-1
SLIDE 1

Scientific Data Asset Management — The Missing Link in Data Driven Discovery

Carl Kesselman University of Southern California

slide-2
SLIDE 2

Acknowledgements

• Karl Czajkowski, Mike D’Arcy, Hongsuda Tangmunarunkit, Robert Schuler, Anoop Kumar, Alejandro Bugacov
• Kristi Clark, Lu Zhau
• Ian Foster, Kyle Chard, Ravi Madduri
• Mike Hanson, Jeff Su, Ray Stevens
• Funded in part by NIH Big Data for Discovery Science Center of Excellence.

slide-3
SLIDE 3

Set the way back machine….

• In 2000 we described how real-time data acquisition could be integrated into the Grid for diffraction studies and tomography

slide-4
SLIDE 4

System architecture….

slide-5
SLIDE 5

Fancy file systems….

slide-6
SLIDE 6

The whole story (kind of….)

Construct design, virus creation, biomass production, protein purification, crystallization, diffraction, flow cytometry, chromatography, gel electrophoresis, imaging, stability measures

slide-7
SLIDE 7

PheWAS findings

Shaw, Molecular Psychiatry (2009) 14, 348–355
Raznahan, Neuroimage (2011) 57, 1517–23

slide-8
SLIDE 8

Image PheWAS

  1. Assemble data collections
  2. Identify subjects with images and extract images
  3. Compute image phenotypes
     • Use Freesurfer with different atlases and computed measures
  4. Associate Freesurfer results with each subject
  5. Quality control on derived data. Rerun on bad results
  6. Identify subset of subjects that have variant of interest in SNP being considered
  7. Collect up all phenotype data associated with identified subset
  8. Do correlation analysis of phenotypes for the SNP to look for predictive correlations
  9. Repeat until discovery
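
Steps 6 to 8 of the list above can be sketched in a few lines. This is only an illustration under stated assumptions, not the method used in the study: the subject records, the phenotype names, and the choice of a plain Pearson correlation between allele count and each phenotype are all hypothetical.

```python
from math import sqrt

# Hypothetical records: allele count for one SNP plus two derived image phenotypes.
subjects = [
    {"id": "s1", "allele_count": 0, "hippocampus_vol": 4.1, "cortex_thickness": 2.5},
    {"id": "s2", "allele_count": 1, "hippocampus_vol": 3.9, "cortex_thickness": 2.6},
    {"id": "s3", "allele_count": 2, "hippocampus_vol": 3.6, "cortex_thickness": 2.4},
    {"id": "s4", "allele_count": 1, "hippocampus_vol": 3.8, "cortex_thickness": 2.5},
]


def carriers(records):
    """Step 6: subjects that carry at least one copy of the variant."""
    return [r for r in records if r["allele_count"] > 0]


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def phewas_scan(records, phenotypes):
    """Steps 7 and 8: gather each phenotype and correlate it with allele count."""
    alleles = [r["allele_count"] for r in records]
    return {p: pearson(alleles, [r[p] for r in records]) for p in phenotypes}


results = phewas_scan(subjects, ["hippocampus_vol", "cortex_thickness"])
for phenotype, r in results.items():
    print(f"{phenotype}: r = {r:.3f}")
```

A real analysis would run this over many phenotypes and SNPs, which is why the quality-control and repeat steps in the list matter as much as the statistics.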
slide-9
SLIDE 9

How do we accelerate discovery?

Discovery cycle (figure): pose question, design experiment, collect data, analyze data, identify patterns, hypothesize explanation, test hypothesis, publish results

slide-10
SLIDE 10

A view from 1960….

“my choices of what to attempt and what not to attempt [are] determined to an embarrassingly great extent by considerations of clerical feasibility, not intellectual capability”

  • J. C. R. Licklider, Man-Computer Symbiosis
slide-11
SLIDE 11

The View From 2016

Scientists report 50-80% of their time is spent “wrangling” messy data, not analyzing it

  • The problem is not the cost of computing!!

Repeatability of results from papers is shockingly low: 10%

• Lack of comprehensive tools for organizing, contextualizing, and sharing data
• Ad hoc processes and practices for managing and sharing information
• Messy Data → Reusable Data → Discovery

  • How to get from point A to point B?
slide-12
SLIDE 12
slide-13
SLIDE 13

What if….

• Every piece of data produced was “citable”
  • Microscope, flow cytometry, mass spec, sequence, mouse, zebrafish, material sample
• Data flowed instantly and seamlessly
  • From points of production/acquisition
  • Between dynamically evolving research teams
• Data was contextualized
• You had automated support to help discover data, extract interesting features, point you to related data, assemble data sets...

slide-14
SLIDE 14

It’s the data, not the analysis!!

“Data is a precious thing and will last longer than the systems themselves.”

Tim Berners-Lee

slide-15
SLIDE 15

An Ecosystem for Data

Why don’t we have tools for managing data sets of cancer and kidneys that are as good as the tools we have for managing data sets of cats and kids?

Apple iPhoto

  • Editable attributes and metadata
  • Full text search
  • Data browsing
  • Flexible data organization
  • Edit and share
  • Automatic analysis

slide-16
SLIDE 16

Applied to other types of work?

• Can we create a reusable platform that enables us to address data centric integration of

  • devices,
  • computation,
  • human interactions
slide-17
SLIDE 17

Digital Asset Management

• “management tasks and decisions surrounding the ingestion, annotation, cataloguing, storage, retrieval and distribution of digital assets”
• Streamline free-form “creative” processes rather than enforce predefined business processes.
• Many commercial DAM offerings, but not well suited to biomedical data

  • Complex and diverse data types
  • Specialized data ingest requirements
  • Data size (big data)
slide-18
SLIDE 18

Scientific Digital Asset Management

• Discovery Environment for Relational Information and Versioned Assets (DERIVA).
• Model discovery as process of creating and updating contextualized digital assets.
• Web services platform
  • “Data Oriented Architecture”
• Adaptive and extensible

slide-19
SLIDE 19

Platform Elements

• Object/Relational data store (ERMrest)
  • Pub/Search/Retrieve structured data
• Object store (HATRAC)
  • Pub/Retrieve immutable objects
• Batch publish/retrieval tool (IOBox)
  • Watch file system and publish data bundle
• Model-driven UI (Chaise)
  • Introspect and adapt to data model
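
To make "publish/retrieve immutable objects" concrete, here is a minimal in-memory sketch. It is not the HATRAC API: the class name, the `name:digest` reference format, and the use of SHA-256 are assumptions chosen for illustration. The point is that publishing under an existing name adds a new version and never overwrites an old one.

```python
import hashlib


class ImmutableObjectStore:
    """Toy object store: every publish creates a new, never-overwritten version."""

    def __init__(self):
        # name -> list of (digest, content) tuples, oldest first
        self._versions = {}

    def publish(self, name: str, content: bytes) -> str:
        """Store content under name and return a reference that pins this version."""
        digest = hashlib.sha256(content).hexdigest()
        self._versions.setdefault(name, []).append((digest, content))
        return f"{name}:{digest}"

    def retrieve(self, reference: str) -> bytes:
        """Return exactly the bytes that were published under this reference."""
        name, digest = reference.rsplit(":", 1)
        for stored_digest, content in self._versions.get(name, []):
            if stored_digest == digest:
                return content
        raise KeyError(reference)


store = ImmutableObjectStore()
first = store.publish("/images/scan1", b"raw pixels v1")
second = store.publish("/images/scan1", b"raw pixels v2")
print(store.retrieve(first))   # the first version is still retrievable
```

Because a reference pins one version, a catalog row that stores the reference keeps pointing at the same bytes even after the object is republished.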

slide-20
SLIDE 20

Software ecosystem

slide-21
SLIDE 21

IOBox

• Configurable tools for enabling arbitrary endpoint
  • Files, databases, microscopes, etc.
  • IoT like
• Contextualize data based on time and location
  • Ruleset per location
  • Metadata extraction, publication to catalog, management of asset
  • Simple recovery mechanisms based on retry/notification
• Triggers per asset ingest pipeline in “cloud”
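
The "ruleset per location" idea with retry-based recovery can be sketched as below. This is a hypothetical illustration, not IOBox code or configuration: the `Rule` type, the `ingest` function, the location name, and the file-naming convention are all invented for the example.

```python
import fnmatch
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    pattern: str                     # glob matched against the file name
    extract: Callable[[str], dict]   # pulls metadata out of the file name


def ingest(filename, location, rulesets, publish, max_retries=3):
    """Apply the first matching rule for this location and publish with retry."""
    for rule in rulesets.get(location, []):
        if fnmatch.fnmatch(filename, rule.pattern):
            record = {"file": filename, "location": location, **rule.extract(filename)}
            for attempt in range(1, max_retries + 1):
                try:
                    publish(record)
                    return record
                except ConnectionError:
                    if attempt == max_retries:
                        raise  # give up; a real tool would notify someone here
    return None  # no rule matched, so the file is ignored


# Hypothetical ruleset: TIFF files at this location are named <sample>_<slice>.tiff
rulesets = {
    "microscope-lab": [Rule("*.tiff", lambda name: {"sample": name.split("_")[0]})],
}

catalog = []
failures = {"remaining": 1}


def flaky_publish(record):
    """Stand-in for a catalog call that fails once before succeeding."""
    if failures["remaining"] > 0:
        failures["remaining"] -= 1
        raise ConnectionError("catalog unreachable")
    catalog.append(record)


record = ingest("S42_slice3.tiff", "microscope-lab", rulesets, flaky_publish)
print(record)
```

Keeping the rules as data rather than code is what lets each acquisition site carry its own ruleset.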

slide-22
SLIDE 22

ERMrest

• Relational data storage service for web-based, data-oriented collaboration.
  • General entity-relationship modeling of data resources manipulated by RESTful access methods.
• RESTful interface → data views as named resource
• Focus on introspection and evolution
  • Data model can change over time to reflect evolving understanding of problem space
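
The idea of a data view as a named resource, together with introspection and model evolution, can be illustrated with a toy resolver. The path grammar below (`table/column=value/...`) is invented for this sketch and is not ERMrest's actual URL syntax; the table contents and function names are likewise hypothetical.

```python
# Hypothetical in-memory catalog: table name -> list of row dicts.
tables = {
    "subject": [
        {"id": 1, "sex": "F", "age": 12},
        {"id": 2, "sex": "M", "age": 15},
        {"id": 3, "sex": "F", "age": 17},
    ],
}


def resolve(path):
    """Treat 'table/col=value/col=value' as the name of a filtered view."""
    table_name, *filters = path.strip("/").split("/")
    rows = tables[table_name]
    for item in filters:
        column, value = item.split("=", 1)
        rows = [row for row in rows if str(row[column]) == value]
    return rows


def describe(table_name):
    """Introspection: report the columns currently present in a table."""
    return sorted({column for row in tables[table_name] for column in row})


print(resolve("subject/sex=F"))

# Model evolution: a new column appears and clients discover it by introspection.
for row in tables["subject"]:
    row.setdefault("has_mri", False)
print(describe("subject"))
```

Because the view is identified by its path, two collaborators who exchange the same path are looking at the same slice of the data.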

slide-23
SLIDE 23

Chaise – Adaptive User Interface

• How little can we assume?
  • Discovery, analysis, visualization, editing, sharing and collaboration over tabular data (ERMrest).
• Makes almost no assumptions about data model
  • Introspect the data model from ERMrest.
  • Use heuristics, for instance, how to flatten a hierarchical structure into a simplified presentation for searching and viewing.
  • Schema annotations are used to modify or override its rendering heuristics, for instance, to hide a column of a table or to use a specific display name.
  • Apply user preferences to override, for instance, to present a nested table of data in a transposed layout
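
The layering described above (heuristic defaults, then schema annotations, then user preferences) can be sketched as a single function. This is an assumption-laden illustration, not Chaise code: the model dictionary, the annotation keys `hidden` and `display_name`, and the `layout` preference are hypothetical names.

```python
def render_columns(model, annotations=None, preferences=None):
    """Decide what to show for a table using only the introspected model,
    optional schema annotations, and optional user preferences."""
    annotations = annotations or {}
    preferences = preferences or {}
    columns = []
    for name in model["columns"]:
        note = annotations.get(name, {})
        if note.get("hidden"):
            continue  # annotation overrides the default of showing every column
        # Heuristic default: turn a snake_case column name into a readable label.
        label = note.get("display_name", name.replace("_", " ").title())
        columns.append(label)
    return {
        "title": model["table"],
        "columns": columns,
        "layout": preferences.get("layout", "table"),  # user preference wins
    }


model = {"table": "specimen", "columns": ["id", "gene_symbol", "internal_notes"]}
annotations = {
    "internal_notes": {"hidden": True},
    "gene_symbol": {"display_name": "Gene"},
}

print(render_columns(model))
print(render_columns(model, annotations, {"layout": "transposed"}))
```

With no annotations at all the function still produces a usable view, which is the sense in which the interface "makes almost no assumptions".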

slide-24
SLIDE 24

One platform, many use cases

• High-resolution 2D and 3D microscopy
• GPCR protein conformation studies
• Kidney reconstruction using stem-cells
• Mapping dynamic synaptome in vivo
• Gene expression studies for craniofacial dysmorphia
• Digital cell line for cancer
• Developmental biology

slide-25
SLIDE 25

Neuroimaging PheWAS

• What is PheWAS?
  • One SNP -> a wide variety of neuroimaging phenotypes (inverse of GWAS)
• Why PheWAS?
  • Explores system-level genetic associations.
• Challenges
  • Complexity, heterogeneity, and volume of the data
  • Complex and sophisticated brain image processing
  • Multiple-comparison correction
  • Result visualization
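
The slide lists multiple-comparison correction as a challenge without naming a method. As one simple possibility, the sketch below applies a Bonferroni correction: with m phenotypes tested against one SNP, a result counts as significant only if its p-value is at most alpha / m. The choice of Bonferroni, the phenotype names, and the p-values are assumptions for illustration only.

```python
def bonferroni(p_values, alpha=0.05):
    """Return {test name: True/False} using the Bonferroni threshold alpha / m."""
    if not p_values:
        return {}
    threshold = alpha / len(p_values)
    return {name: p <= threshold for name, p in p_values.items()}


# Hypothetical p-values for four image phenotypes tested against one SNP.
p_values = {
    "hippocampus_vol": 0.001,
    "cortex_thickness": 0.02,
    "amygdala_vol": 0.04,
    "ventricle_vol": 0.5,
}

significant = bonferroni(p_values)
print(significant)
```

Three of the four values fall below the uncorrected 0.05 level, but only one survives the corrected threshold of 0.0125, which shows why the correction matters when one SNP is tested against many phenotypes.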
slide-26
SLIDE 26

Philadelphia Neurodevelopmental Consortium

• 8719 subjects in study
  • Baseline clinical elements
• 6 different SNP array chipsets resulting in a combined set of 1,873,486 distinct SNPs (out of a possible 85 million in the human genome).
  • The total combinatorial space of the genomic data is 5,435,533,460 (SNP, subject, allele) tuples across the 8719 subjects
• 997 of the subjects have MRI imaging data

slide-27
SLIDE 27

Managing data collections

slide-28
SLIDE 28

Heterogeneous source data

slide-29
SLIDE 29

Bags bridge the gap between tools

Workflow steps (from the slide figure, dated 10/6/16):

  1. Query and discover data (wherever it is)
  2. Create bags
  3. Query for genetic data from 6 chipsets
  4. Create new bags of derived data
  5. Query for specific imaging information based on the derived genetic data
  6. Create new bags of derived data
  7. Transfer bags out for publication

Figure labels: dbGaP, ERMrest, Genetic Data, Brain MRI, PLINK format genetic data, Alleles per subject, Alignment Files, Raw Brain MRI data, Processed MRI data, Process imaging data

After step 6: 628 subjects
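
The slide does not say how a bag is laid out. Assuming it follows the BagIt packaging convention (a `data/` payload directory, a `bagit.txt` declaration, and a checksum manifest), creating and verifying a bag of derived data could look like the sketch below. The function names and the sample file are hypothetical, and this is not the tooling used in the talk.

```python
import hashlib
from pathlib import Path


def make_bag(bag_dir, files):
    """Write a minimal BagIt-style bag. files maps relative payload path -> bytes."""
    bag = Path(bag_dir)
    bag.mkdir(parents=True, exist_ok=True)
    manifest_lines = []
    for rel_path, content in sorted(files.items()):
        target = bag / "data" / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        manifest_lines.append(f"{digest}  data/{rel_path}")
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n", encoding="utf-8"
    )
    (bag / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )
    return bag


def verify_bag(bag_dir):
    """Recompute every payload checksum and compare it with the manifest."""
    bag = Path(bag_dir)
    manifest = (bag / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        digest, rel_path = line.split(maxsplit=1)
        payload = bag / rel_path
        if not payload.is_file():
            return False
        if hashlib.sha256(payload.read_bytes()).hexdigest() != digest:
            return False
    return True
```

For example, `make_bag("derived_alleles", {"alleles/s1.txt": b"AG"})` writes the bag to disk, and `verify_bag("derived_alleles")` confirms the payload still matches its manifest after transfer between tools.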

slide-30
SLIDE 30

Details on one data element

slide-31
SLIDE 31

QC on derived data

slide-32
SLIDE 32

Complex data relationships…

slide-33
SLIDE 33

NeuroimagingPheWAS Toolbox

slide-34
SLIDE 34

Summary

• Exponential increases in computing/storage impose additional complexity on the end user... What to do?
• Scientific Digital Asset Management is the missing link
  • Make science data as good as consumer data
• We have demonstrated that a generally applicable software ecosystem for DAM is feasible