

SLIDE 1

National Institutes of Health • U.S. Department of Health and Human Services


The NIEHS Data Commons

Deep Patel, Mike Conway Office of Data Science National Institute of Environmental Health Sciences

SLIDE 2

The NIEHS Office of Data Science

Who are we?

“The mission of the Office of Data Science is to accelerate scientific discovery, foster collaborative research, and ultimately improve public health through the application of scientific data and knowledge management in the environmental health sciences.”

SLIDE 3

Commons objectives

Develop a standards-based commons

  • Beginning with internal researchers, managing data originating from core laboratories, including next-gen sequencing data
  • Define organizational policies to handle data life-cycle
  • Track provenance and relationship of data sets to source data and analysis

SLIDE 4

Commons objectives

Manage metadata for discoverability and long-term usability

  • FAIR Data
  • Develop standard metadata, including controlled vocabularies and ontologies
  • Automatic metadata from instruments, pipelines, and computer-actionable policies
  • Support multiple indexes and search technologies for data discovery and re-use

  • Allow publication to reference collections, such as NCBI GEO
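
As an illustration of the "standard metadata with controlled vocabularies" objective, the following is a minimal sketch in Python. The field names, vocabulary terms, and the `validate_metadata` helper are all hypothetical and are not taken from the NIEHS Commons; in practice the terms would presumably come from a central ontology or vocabulary service rather than a hard-coded table.

```python
# Hypothetical controlled vocabularies. A real deployment would load these
# from an ontology/CV service instead of hard-coding them.
CONTROLLED_VOCABULARIES = {
    "assay_type": {"RNA-Seq", "ChIP-Seq", "WGS"},
    "organism": {"Homo sapiens", "Mus musculus"},
}

# Hypothetical minimal metadata template: fields every record must carry.
REQUIRED_FIELDS = ("title", "assay_type", "organism")


def validate_metadata(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    # Check that every required field is present and non-empty.
    for name in REQUIRED_FIELDS:
        if not record.get(name):
            problems.append(f"missing required field: {name}")
    # Check that vocabulary-controlled fields only use known terms.
    for name, allowed in CONTROLLED_VOCABULARIES.items():
        value = record.get(name)
        if value and value not in allowed:
            problems.append(f"{name}={value!r} is not in the controlled vocabulary")
    return problems
```

A check of this kind could run at enrollment time, so that records with free-text or missing terms are caught before they reach an index or a reference collection.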
SLIDE 5

Commons objectives

Support integration and use of data in computation and analysis

  • Ease discovery and access through common tools and platforms
  • Securely share data with collaborators
  • Allow audit and enforcement of access and data usage agreements
  • Track provenance and authenticity
  • Ensure reproducibility
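
One way to picture "track provenance and authenticity" is a record that ties a derived data set to checksums of its sources. The sketch below uses only the Python standard library; the record layout and the function names (`provenance_record`, `is_authentic`) are assumptions made for illustration, not the Commons' actual provenance model.

```python
import hashlib
from datetime import datetime, timezone


def sha256_of(data: bytes) -> str:
    """Hex SHA-256 digest, used here as a content fingerprint."""
    return hashlib.sha256(data).hexdigest()


def provenance_record(derived: bytes, sources: dict, activity: str) -> dict:
    """Describe how a derived data set relates to its source data.

    `sources` maps a (hypothetical) source name to its raw bytes.
    """
    return {
        "derived_sha256": sha256_of(derived),
        "activity": activity,
        "sources": {name: sha256_of(content) for name, content in sources.items()},
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


def is_authentic(data: bytes, expected_sha256: str) -> bool:
    """True if the data still matches the checksum recorded earlier."""
    return sha256_of(data) == expected_sha256
```

Re-running the analysis and comparing the new output against `derived_sha256` gives a simple, if strict, reproducibility check: bit-identical output matches, anything else does not.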
SLIDE 6

The NIEHS Data Commons

  • Data/Tools Enrollment
  • Data/Tools Discovery
  • Secure Collaboration, Analysis, and Workflow Execution
  • Data Commons APIs
  • NIH Data Commons

SLIDE 7

Data Commons serving a full data life-cycle

Current ‘commons’ efforts (e.g. NIH Commons) focus on the mature part of the research data lifecycle and say less about where the data comes from!

NIEHS Concerns:

  • Metadata quality
  • Delivery to PI
  • Appropriate sharing within project
  • Retention, compliance
  • Ingest pipelines

NIH Concerns:

  • FAIR
  • Publishing
  • Data sharing/licensing
  • Discoverability
  • Analytics and derived data

Moore, Reagan W., et al. "White Paper: National Data Infrastructure for Earth System Science."

SLIDE 8

Commons ‘Patterns’

  • Let’s look at the NIEHS Commons and see where patterns come into play.
    – How do we as a community develop frameworks around iRODS capabilities and the philosophy of policy-based data management that ease development?
    – How do we develop a pattern language and architectural discipline and talk with each other about systems that support FAIR and Big Data?
    – The Consortium is already developing a pattern catalog, and this is a Good.Thing.

SLIDE 9

Extracting Patterns…

  • Shout out to the Consortium folks, this may be the ‘next thing’.
  • How would a good catalog of patterns translate into frameworks and capabilities in iRODS?

Patterns from https://irods.org/documentation/

SLIDE 10

Core Labs Ingest and Pipelines

[Diagram labels: Instruments; NextGen Sequencing; DDN Clarity or other LIMS; Landing Zone; File Scanner; commonsProdZone; Synch, tiering, staging, replication; Procedure/Provenance metadata; Storage Resource NAS; Tiering; CWL, Nextflow; Data-to-compute; Compute-to-data]
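
The "Landing Zone" and "File Scanner" labels suggest a step that notices newly arrived instrument files before they are registered. A rough sketch of that idea follows, using only the Python standard library. The function name, the record fields, and the `already_ingested` set are assumptions for illustration; this does not call iRODS and is not the scanner used at NIEHS.

```python
import hashlib
from pathlib import Path


def scan_landing_zone(landing_zone: Path, already_ingested: set) -> list:
    """Return one record per file under the landing zone not yet ingested.

    `already_ingested` holds relative POSIX-style paths that were
    registered on an earlier pass (a hypothetical bookkeeping scheme).
    """
    records = []
    for path in sorted(landing_zone.rglob("*")):
        if not path.is_file():
            continue  # skip directories
        relative = path.relative_to(landing_zone).as_posix()
        if relative in already_ingested:
            continue  # seen on a previous scan
        records.append({
            "path": relative,
            "size_bytes": path.stat().st_size,
            # Reads the whole file into memory; large sequencing files
            # would need chunked hashing instead.
            "sha256": hashlib.sha256(path.read_bytes()).hexdigest(),
        })
    return records
```

Each returned record could then drive registration, replication, or a pipeline launch, with the checksum carried along as initial provenance metadata.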
SLIDE 11

Metadata Support

[Diagram labels: Instruments; commonsProdZone; NIEHS Central Ontology/CV Service; Vocabs; Index/Search Platforms; *Indexing Framework; *Metadata Templates; *Virtual Collections; ????]

SLIDE 12

New Challenges

  • Managing immutable archives (e.g. BDBag) and persistent identifiers
  • Managing federated authn/authz
  • Integrating the Data Commons into the workflows and daily routines of researchers in non-disruptive ways that make their work easier, not more difficult

  • Keeping the focus on science, not cyberinfrastructure
SLIDE 13

Big Data is Big Preservation

  • Let’s not forget our roots, and how this applies now more than ever.
  • OAIS and related concepts, including trusted digital preservation, provide lots of useful language and a good conceptual framework to add to the ‘cloud’, ‘FAIR’, and NSF ‘Cyberinfrastructure for the 21st Century’ frame.

  • FAIR does not matter if the data turns out to be lost!
SLIDE 14

Acknowledgements

NIEHS: Beth Bowden, John Bucher, Allen Dearry, Leesa Deterding, Michael Devito, Christopher Duncan, Matthew Edin, Thomas Van'T Erve, John Grovenstein, Guang Hu, Mary Jacobson, Jeffrey Kuhn, Beth Lauderdale, Jian-Liang Li, Alex Merrick, Geoffrey Mueller, Suzanne Osborne, Scott Redman, Andy Shapiro, Troy Simpson, Chris Stone, Cheryl Thompson, Paul Wade, Deborah Wales, Jason Williams, Rick Woychik; Renaissance Computing Institute (RENCI); iRODS Consortium