The Bioconductor Project for Reproducible Analysis of High Throughput Genomic Data
Martin Morgan (mtmorgan@fhcrc.org)
Fred Hutchinson Cancer Research Center
19-21 January, 2011

Analysis and Comprehension of High Throughput Genomic Data
Hallmarks of effective computational software
1. Extensive: data, annotation
2. Statistical: volume, technology, experimental design
3. Reproducible: long-term, multi-participant science
4. Leading edge: novel, technology-driven
5. Accessible: affordable, transparent, usable

1. Extensive Data and Annotation
Data
◮ Expression, tiling, methylation, custom arrays
◮ Sequence analysis, e.g., ChIP-seq, RNA-seq
◮ Other high-throughput assays, e.g., flow cytometry, mass spectrometry, imaging
◮ Public repositories, e.g., GEO, ArrayExpress
Annotation, e.g.,
◮ Well-curated: NCBI, BioMart, UCSC, MSigDB, GO, KEGG
◮ Loosely curated: emerging, specialized, and lab-based
◮ Consortium: HapMap, 1000 Genomes, TCGA

Bioconductor
Goal
Help biologists understand their data
Focus
◮ Expression and other microarray; flow cytometry
◮ High-throughput sequencing
Themes
◮ Contributions from 'core' members and (primarily academic) user community
◮ Based on the R programming language: statistics, visualization, interoperability
◮ Reproducible: scripts, vignettes, packages
◮ Open source / open development
Success
> 400 packages; publications; 8,000 web visits / week; 75,000 unique IP downloads / year; very active mailing list; annual conferences; courses; ...

Bioconductor: Sample Work Flow

    ## Pre-processing: read the CEL files in the working directory and
    ## compute RMA expression values
    library(affy)
    eset <- just.rma()

    ## Quality assessment: generate a report on array quality
    library(arrayQualityMetrics)
    arrayQualityMetrics(eset)

    ## Differential expression: three treated vs. three control arrays
    library(limma)
    status <- c("Trt", "Trt", "Trt", "Ctrl", "Ctrl", "Ctrl")
    design <- model.matrix(~ status)
    fit <- eBayes(lmFit(eset, design))
    topTable(fit, coef = 2)    # coefficient 2: Trt relative to Ctrl
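To see which effect `coef=2` refers to in the work flow, it can help to inspect the design matrix on its own. The sketch below uses only base R and the same `status` vector; it assumes R's default treatment contrasts, under which the levels of a character vector are ordered alphabetically. The variable `trt_column` is introduced here for illustration only.

```r
## Treatment labels, as in the sample work flow
status <- c("Trt", "Trt", "Trt", "Ctrl", "Ctrl", "Ctrl")

## model.matrix() treats the character vector as a factor; levels are
## sorted alphabetically, so "Ctrl" becomes the reference level
design <- model.matrix(~ status)
colnames(design)
design

## Column 2 is an indicator for the treated samples, so the second
## coefficient estimates the Trt effect relative to Ctrl
trt_column <- unname(design[, 2])
trt_column
```

If the control group should not be the reference, one option is to build `status` as a factor with explicitly ordered levels before calling `model.matrix()`.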

2. Statistical
Technology
◮ Acknowledging artifacts and biases
◮ Accommodate using statistical models, e.g., RMA
Volume of data
◮ Data reduction essential
Experimental design
◮ Exploratory analysis
◮ Hypothesis-driven; designed experiments
◮ Cost-effective, but not too clever
Figure: Expression array. Pseudocolors represent hybridisation intensities of RNA to features. Source: url
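RMA combines background correction, quantile normalization, and probe-set summarization by median polish. As a rough illustration of how a statistical model can accommodate array-wide artifacts, the base R sketch below implements quantile normalization only, on simulated data. The function `quantile_normalize` and all data are assumptions made for this example; this is not the implementation used by the affy package.

```r
## Simulated log2 intensities: 10 probes (rows) on 4 arrays (columns)
set.seed(1)
x <- matrix(rnorm(40, mean = 8), nrow = 10,
            dimnames = list(paste0("probe", 1:10), paste0("array", 1:4)))
x[, 2] <- x[, 2] + 1    # simulate an array-wide intensity shift

## Hypothetical helper: give every array the same empirical distribution
quantile_normalize <- function(m) {
    ## reference distribution: mean of the sorted values across arrays
    reference <- rowMeans(apply(m, 2, sort))
    ## replace each value by the reference value of the same rank
    normalized <- apply(m, 2, function(col) {
        unname(reference)[rank(col, ties.method = "first")]
    })
    dimnames(normalized) <- dimnames(m)
    normalized
}

xn <- quantile_normalize(x)

## Before: array2 is shifted upward; after: all arrays share one distribution
colMeans(x)
colMeans(xn)
```

Within each array the ordering of probes is unchanged; only the values attached to each rank are replaced.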

Figure: Measured intensity increases with GC content; Chronic Lymphocytic Leukemia (CLL) dataset. Axes: number of G and C (x), log2 intensity (y).

Figure: Heatmap summarizing distance between CLL arrays.

Figure: Normalized unscaled standard error (NUSE) suggests array CLL1 is an outlier.

Figure: 'Progressive' vs. 'stable' status. log P vs. log-fold change, CLL data set. Probe sets with extreme differentiation highlighted. Axes: log-ratio (x), log10 p (y).
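A plot of significance against log-ratio, like the one described for the CLL data, can be sketched in base R. The example below uses simulated data and ordinary per-gene t-tests rather than the CLL arrays or a moderated test; the group labels, effect size, and thresholds are assumptions chosen for illustration.

```r
## Simulated log2 expression: 200 genes, 6 'stable' and 6 'progressive' samples
set.seed(42)
n_genes <- 200
group <- factor(rep(c("stable", "progressive"), each = 6))
expr <- matrix(rnorm(n_genes * 12, mean = 8), nrow = n_genes)

## Shift the first 10 genes upward in the 'progressive' group
is_prog <- group == "progressive"
expr[1:10, is_prog] <- expr[1:10, is_prog] + 2

## Per-gene log-ratio (difference of group means on the log scale)
log_ratio <- rowMeans(expr[, is_prog]) - rowMeans(expr[, !is_prog])

## Per-gene p-value from an ordinary two-sample t-test
p_value <- apply(expr, 1, function(y) t.test(y ~ group)$p.value)

## Highlight genes that are extreme on both axes (arbitrary cut-offs)
extreme <- abs(log_ratio) > 1 & p_value < 0.01
plot(log_ratio, -log10(p_value), pch = 20,
     col = ifelse(extreme, "red", "grey40"),
     xlab = "log-ratio", ylab = "-log10 p")
```

With thousands of genes and few samples, a moderated statistic such as the one computed by `eBayes()` in the sample work flow is usually preferred over per-gene t-tests.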

3. Reproducible Research
Long-term
◮ Returning to analysis after days, weeks, months of other activity
Multi-participant: communicating with...
◮ Other statisticians / bioinformaticians
◮ Biologists and others without specialized statistical knowledge
Science: reproducibility...
◮ Facilitates third-party verification
◮ Allows critical assessment
◮ Challenging, even in high-profile journals requiring archived raw data (Ioannidis et al., 2009, Nat Genet 41: 149-155).

Reproducible Research: Case Study
Original research
◮ Potti et al., 2006; Hsu et al., 2007
◮ NCI60 cell line drug sensitivity signature
◮ Clinical trial allocation
Reproducibility
◮ Baggerly & Coombes, 2009
◮ Off-by-one cisplatin gene signature
◮ Four 'interesting' genes not supported by analysis (two not on array)
References
◮ Potti et al. 2006 Nat Med 12: 1294-1300 (retracted)
◮ Hsu et al. 2007 J Clin Oncol 25: 4350-4357 (retracted)
◮ Baggerly & Coombes 2009 Ann Appl Stat 3: 1309-1334

Figure: Hsu et al., cisplatin, fig. 1a

Figure: Baggerly & Coombes, fig. 2a

Figure: Baggerly & Coombes, fig. 2b

Figure: Baggerly & Coombes, fig. 2d

"... results incorporate several simple errors that may be putting patients at risk. One theme that emerges is that the most common errors are simple (e.g., row or column offsets); conversely, it is our experience that the most simple errors are common" (Baggerly & Coombes, 2009)
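How a row offset can silently change a gene list is easy to show with a toy example. The sketch below is entirely hypothetical: the gene names, scores, and indices are invented for illustration and do not reproduce the data or the analysis in the case study.

```r
## Hypothetical gene-level statistics; names and values are made up
scores <- c(geneA = 0.2, geneB = 3.1, geneC = 0.4, geneD = 2.8, geneE = 0.1)

## Indices reported by a hypothetical tool that counts from zero, or
## that was given a table without the header line it expected
selected_idx <- c(1, 3)

## Positional indexing returns genes without any warning
wrong <- names(scores)[selected_idx]        # "geneA" "geneC"

## Correcting the offset recovers the high-scoring genes
right <- names(scores)[selected_idx + 1]    # "geneB" "geneD"

## Defensive alternative: carry identifiers rather than positions,
## and check that every identifier is actually present
selected_ids <- c("geneB", "geneD")
stopifnot(all(selected_ids %in% names(scores)))
scores[selected_ids]
```

Selecting by identifier also catches the case where a reported gene is not on the array at all, because the membership check fails instead of returning a neighbouring row.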

Reproducible Research: Bioconductor
Script-based: Data transformations necessarily documented
'Literate programming': Text documents embed scripts, scripts evaluated when text document processed
Versioned software and repositories: Record which package versions used, and retrieve from Bioconductor archives
Integrated data containers: Sample descriptions and expression data in a single object. Subsetting expression data automatically subsets sample descriptions
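Recording which versions were used can be done from within the analysis script itself. The sketch below relies only on functions that ship with R; the output file name `sessionInfo.txt` and the use of a temporary directory are assumptions for this example.

```r
## Capture the R version and the versions of all attached packages
info <- sessionInfo()
info$R.version$version.string

## Version of a single package; 'stats' ships with R, so it is always present
packageVersion("stats")

## Save the record alongside the analysis results
## (a temporary directory is used here purely for illustration)
record_file <- file.path(tempdir(), "sessionInfo.txt")
writeLines(capture.output(print(info)), record_file)
```

In a literate-programming document, calling `sessionInfo()` in the final code chunk has the same effect: the versions are embedded in the processed report.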

The ALL dataset

    > library(ALL); data(ALL); ALL
    ExpressionSet (storageMode: lockedEnvironment)
    assayData: 12625 features, 128 samples
      element names: exprs
    protocolData: none
    phenoData
      sampleNames: 01005 01010 ... LAL4 (128 total)
      varLabels: cod diagnosis ... date last seen (21 total)
      varMetadata: labelDescription
    featureData: none
    experimentData: use 'experimentData(object)'
      pubMedIds: 14684422 16243790
    Annotation: hgu95av2
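The coordinated subsetting that an integrated container provides can be tried on a small simulated object, without downloading the ALL data. The sketch below assumes the Biobase package is installed; the matrix, the sample descriptions, and the `sex` column are invented for illustration.

```r
library(Biobase)

## Simulated expression matrix: 5 features (rows) by 4 samples (columns)
set.seed(123)
exprs_mat <- matrix(rnorm(20), nrow = 5,
                    dimnames = list(paste0("f", 1:5), paste0("s", 1:4)))

## Sample descriptions; row names must match the matrix column names
pheno <- data.frame(sex = c("F", "M", "F", "M"),
                    row.names = colnames(exprs_mat))

## Combine expression data and sample descriptions in one object
eset <- ExpressionSet(assayData = exprs_mat,
                      phenoData = AnnotatedDataFrame(pheno))

## Subset samples once; expression data and descriptions stay in sync
females <- eset[, eset$sex == "F"]
dim(exprs(females))
pData(females)
```

Because the sample descriptions travel with the expression values, there is no separate table to subset by hand and therefore no chance of the two drifting out of register.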

4. Leading Edge
Technological innovations
◮ E.g., SNP, miRNA arrays
◮ E.g., lab sequencing platforms; novel protocols
Fast-changing
◮ Commercial software products not yet developed, or already out-of-date
◮ Research questions require novel solutions

Leading Edge: Illustration
Sequencing technologies
◮ Historically (e.g., 2 years ago): short reads, low 'tail' quality, tail base call bias, data volume
◮ Current: count models, read bias, designed experiments, variant representations, annotation
Bentley et al., 2008, Nature 456: 53-9
