Reproducibility and Big (Omics) Data, Nuno Bandeira, Ph.D.

Reproducibility and Big (Omics) Data. Nuno Bandeira, Ph.D., Associate Professor, Dept. Computer Science and Engineering and Skaggs School of Pharmacy and Pharmaceutical Sciences; Executive Director, NIH/NIGMS Center for Computational Mass Spectrometry.


SLIDE 1

Reproducibility and Big (Omics) Data

Nuno Bandeira, Ph.D.

Associate Professor
Dept. Computer Science and Engineering
Skaggs School of Pharmacy and Pharmaceutical Sciences

Executive Director
NIH/NIGMS Center for Computational Mass Spectrometry

Center for Computational Mass Spectrometry

SLIDE 2

http://proteomics.ucsd.edu

What is the Proteome?

Not just unmodified major protein isoforms

  • Sequence polymorphisms
  • Alternative splicing
  • Post-translational mods (PTMs)
  • Endogenous peptides
  • May be non-linear: insulin
  • Protein interactions: cross-linking

Microbiome: 10x more cells, 100-360x more genes

Disease proteomes

  • Infectious diseases: MHC peptides
  • Cancer: fusions, polymorphisms
  • Cataracts: hypermodified peptides
  • Antibodies, drug discovery

SLIDE 3

Lens dataset: 5th round

[Figure: the peptide MDVTIQHPWFKRT and its labeled variants. Only the text labels survive from the image.]

Sequences shown (labeled as full peptide, Nterm variants and Cterm variants): MDVTIQHPWFKRT, MDVTIQHPWFKR, MDVTIQHPWFK, MDVTIQHPW, DVTIQHPWFKR, DVTIQHPWFK, VTIQHPWFK, TIQHPWFKR

Modification labels:

  • Nterm acetylation
  • Nterm acetylation, M oxidation
  • Nterm acetylation, W oxidation
  • Nterm acetylation, M oxidation, W oxidation
  • Nterm acetylation, W kynurenin
  • Nterm acetylation, K acetylation
  • Nterm acetylation, M oxidation, W oxidation, K acetylation
  • Nterm acetylation, K carboxyethylation (?)
  • Nterm acetylation, M dethiomethyl
  • Nterm acetylation, Q deamidation, M oxidation, W oxidation, K acetylation
  • Nterm acetylation, Water Loss
  • Q deamidation
  • Undetermined Modification (+38)
  • Undetermined Modification (+25)
  • Undetermined Modification (+25), Q deamidation

Short annotations on the figure: Ac, +O, +1, Ky, +HCS, H2O, (+38), (+26)
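Each modification label on this slide corresponds to a mass shift relative to the unmodified peptide. The arithmetic can be sketched as below, assuming standard monoisotopic residue masses and commonly tabulated modification deltas. The names `peptide_mass`, `RESIDUE_MASS` and `MOD_DELTA` are illustrative and not part of any CCMS tool.

```python
# Monoisotopic residue masses in Da (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565  # added once per peptide for the free termini

# Mass deltas (Da) for some of the modifications named on the slide.
MOD_DELTA = {
    "acetyl": 42.010565,        # Nterm or K acetylation
    "oxidation": 15.994915,     # M or W oxidation (+O)
    "deamidation": 0.984016,    # Q deamidation (+1)
    "kynurenin": 3.994915,      # W to kynurenin
    "dethiomethyl": -48.003371, # M dethiomethyl
    "water_loss": -18.010565,
}


def peptide_mass(sequence, mods=()):
    """Neutral monoisotopic mass of a peptide plus any listed modifications."""
    residues = sum(RESIDUE_MASS[aa] for aa in sequence)
    return residues + WATER + sum(MOD_DELTA[m] for m in mods)


unmodified = peptide_mass("MDVTIQHPWFK")
ac_ox = peptide_mass("MDVTIQHPWFK", ["acetyl", "oxidation"])
print(f"unmodified: {unmodified:.4f} Da")
print(f"Nterm acetylation + M oxidation shift: {ac_ox - unmodified:.4f} Da")
```

The "Undetermined Modification" labels are the cases where an observed shift does not match any entry in a table like `MOD_DELTA`.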

SLIDE 4

More than just big data

  • Big Data: thousands of datasets, hundreds of terabytes
  • Big Compute: 30+ data analysis workflows scalable to thousands of cores
  • Big Algorithms: designed to build on rather than just ‘tolerate’ big data
  • Big Community: empower and enable community-wide sharing of knowledge

Proteomics Scalable, Accessible and Flexible environment

http://massive.ucsd.edu
http://proteomics.ucsd.edu/software
http://proteomics.ucsd.edu/ProteoSAFe
http://gnps.ucsd.edu

SLIDE 5

Dataset reanalysis: PNNL microbiome

12 TB dataset covering 112 species from diverse taxa

  • Can easily import raw data for online reanalysis
  • Includes microbial spectral libraries reusable for searching new data
  • Search results can be compared with dataset results
      – Online results or user-uploaded results
      – Reanalysis results will be ‘attachable’ to submitted dataset
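Searching new data against a reusable spectral library amounts to scoring a query spectrum against each library entry. The sketch below uses a binned cosine similarity, a common generic choice. It is not the scoring used by ProteoSAFe or the PNNL libraries, and the function names, the 1.0 m/z bin width and the 0.7 score cutoff are all assumptions for illustration.

```python
import math
from collections import defaultdict


def binned(spectrum, bin_width=1.0):
    """Sum peak intensities into fixed-width m/z bins."""
    bins = defaultdict(float)
    for mz, intensity in spectrum:
        bins[int(mz / bin_width)] += intensity
    return bins


def cosine(spec_a, spec_b, bin_width=1.0):
    """Cosine similarity between two spectra given as (m/z, intensity) pairs."""
    a, b = binned(spec_a, bin_width), binned(spec_b, bin_width)
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_library_match(query, library, min_score=0.7):
    """Return (entry name, score) for the best match, or (None, score) below cutoff."""
    scored = [(cosine(query, spectrum), name) for name, spectrum in library.items()]
    score, name = max(scored, default=(0.0, None))
    return (name, score) if score >= min_score else (None, score)


# Toy library and query with made-up peaks, for illustration only.
library = {
    "entry_a": [(100.2, 10.0), (200.7, 5.0)],
    "entry_b": [(150.1, 8.0), (300.3, 2.0)],
}
query = [(100.3, 9.0), (200.6, 5.5)]
print(best_library_match(query, library))
```

Because the library is just a mapping from names to spectra, the same entries can be reused for every new dataset that is searched.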

SLIDE 6

ProteoSAFe reanalysis

ProteoSAFe: Compute-intensive discovery MS at the click of a button

(billions of spectra searched)

http://proteomics2.ucsd.edu/ProteoSAFe

30+ workflows, >70 tools

Cohort-aware spectral networks

SLIDE 7

  • Crowdsourced curated libraries
  • Share data
  • Explore unknown molecules
  • Co-analyze private+public data

gnps.ucsd.edu

First MassIVE Knowledge Base, open March 2014

SLIDE 8

The GNPS vision

Data to knowledge 101

– Crowdsourced consensus IDs
  • Curators
  • Revisions
  • Quality levels
– Automated reanalysis of all data

Investigator-centric

– “Living” datasets with new and revised knowledge
– Dataset subscriptions
– Molecular explorer: “Data like mine”

SLIDE 9

Challenges ahead

Worldwide proteomics big data

– Organizing thousands of datasets into a validated scientific resource
– ‘Living’ data: consensus reanalysis, commenting, adding new results
– Needs:
  • FDR models for crowdsourced reanalysis – who’s right?
  • Reference datasets for comparison of tools/workflows?

Most data has no conditions, no biology, no validation

– Need dataset revisions: more metadata, updated IDs
– What constitutes a publishable unit? Label datasets as “gold” once the biological conclusions are confirmed by reanalysis?

Reusable knowledge bases

– Translating global data into a reusable resource (e.g., libraries)
– Crowdsourcing curation of shared community knowledge bases
– Needs: what knowledge to represent? Who reviews the curators?
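For context on the FDR need listed under "Worldwide proteomics big data": within a single search, the false discovery rate is commonly estimated with a target-decoy approach, sketched below. This is the standard single-search estimate, not a model for combining crowdsourced reanalyses, which the slide raises as an open question. The function names and the toy scores are assumptions for illustration.

```python
def fdr_at_threshold(psms, threshold):
    """Estimate FDR as decoys / targets among matches scoring at or above threshold.

    psms is a list of (score, is_decoy) pairs, one per peptide-spectrum match.
    """
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    if targets == 0:
        return 1.0 if decoys else 0.0
    return decoys / targets


def score_cutoff(psms, max_fdr=0.01):
    """Lowest score threshold whose estimated FDR stays within max_fdr, or None."""
    best = None
    for threshold in sorted({score for score, _ in psms}, reverse=True):
        if fdr_at_threshold(psms, threshold) <= max_fdr:
            best = threshold
    return best


# Toy matches as (score, is_decoy); values are made up for illustration.
psms = [(10, False), (9, False), (8, False), (7, True), (6, False)]
print(fdr_at_threshold(psms, 7))
print(score_cutoff(psms, max_fdr=0.1))
```

When several groups reanalyze the same dataset with different tools, each search yields its own cutoff, and reconciling those results is where a dedicated FDR model would be needed.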