The Bioconductor Project for Reproducible Analysis of High Throughput Genomic Data Martin Morgan (mtmorgan@fhcrc.org) Fred Hutchinson Cancer Research Center 19-21 January, 2011 Analysis and Comprehension of High Throughput Genomic Data


SLIDE 1

The Bioconductor Project for Reproducible Analysis of High Throughput Genomic Data

Martin Morgan (mtmorgan@fhcrc.org) Fred Hutchinson Cancer Research Center 19-21 January, 2011

SLIDE 2

Analysis and Comprehension of High Throughput Genomic Data

Hallmarks of effective computational software

1. Extensive: data, annotation
2. Statistical: volume, technology, experimental design
3. Reproducible: long-term, multi-participant science
4. Leading edge: novel, technology-driven
5. Accessible: affordable, transparent, usable

SLIDE 3

1. Extensive Data and Annotation

Data

◮ Expression, tiling, methylation, custom arrays
◮ Sequence analysis, e.g., ChIP-, RNA-seq
◮ Other high-throughput assays, e.g., flow cytometry, mass spec., imaging
◮ Public repositories, e.g., GEO, ArrayExpress

Annotation, e.g.,

◮ Well-curated: NCBI, BioMart, UCSC, MSigDB, GO, KEGG
◮ Loosely curated: emerging, specialized, & lab-based
◮ Consortium: HapMap, 1000 Genomes, TCGA

SLIDE 4

Bioconductor

Goal: Help biologists understand their data

Focus

◮ Expression and other microarray; flow cytometry
◮ High-throughput sequencing

Themes

◮ Contributions from ‘core’ members and (primarily academic) user community
◮ Based on the R programming language – statistics, visualization, interoperability
◮ Reproducible – scripts, vignettes, packages
◮ Open source / open development

Success: > 400 packages; publications; 8,000 web visits / week; 75,000 unique IP downloads / year; very active mailing list; annual conferences; courses; . . .

SLIDE 5

Bioconductor: Sample Work Flow

> ##
> ## Pre-processing
> library(affy)
> eset <- just.rma()
> ##
> ## Quality assessment
> library(arrayQualityMetrics)
> arrayQualityMetrics(eset)
> ##
> ## Differential expression
> library(limma)
> status <-
+     c("Trt", "Trt", "Trt", "Ctrl", "Ctrl", "Ctrl")
> design <- model.matrix( ~status )
> fit <- eBayes(lmFit(eset, design))
> topTable(fit, coef=2)
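
The differential expression step depends on how the design matrix is laid out. The following is a minimal base-R sketch, using only the `status` labels from the work flow above and no array data. It is meant to show why `coef=2` picks out the treatment effect: R sorts the levels alphabetically, so `Ctrl` is the baseline and the second column is the Trt versus Ctrl contrast.

```r
# Same sample labels as in the work flow above.
status <- c("Trt", "Trt", "Trt", "Ctrl", "Ctrl", "Ctrl")

# model.matrix() converts the character vector to a factor; levels are
# sorted alphabetically, so "Ctrl" is the reference level.
design <- model.matrix(~ status)

# Column 1 is the intercept (Ctrl mean); column 2 is the Trt - Ctrl effect.
print(colnames(design))
print(design[, "statusTrt"])
```

If the labels were ordered differently, or a different baseline were wanted, the factor levels would have to be set explicitly before building the design.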

SLIDE 6

2. Statistical

Technology

◮ Acknowledging artifacts and biases
◮ Accommodate using statistical models, e.g., RMA

Volume of data

◮ Data reduction essential

Experimental design

◮ Exploratory analysis
◮ Hypothesis-driven; designed experiments
◮ Cost-effective, but not too clever

[Figure: Expression array. Pseudocolors represent hybridisation intensities of RNA to features. Source: url]

SLIDE 7

Statistical

[Figure: log2 intensity vs. number of G and C. Measured intensity increases with GC content; Chronic Lymphocytic Leukemia (CLL) dataset.]

SLIDE 8

Statistical

[Figure: Heatmap summarizing distance between CLL arrays]

SLIDE 9

Statistical

[Figure: NUSE by array. Normalized unscaled standard error (NUSE) suggests array CLL1 is an outlier.]

SLIDE 10

Statistical

[Figure: log10 p vs. log-ratio. ‘Progressive’ vs. ‘stable’ status. log P vs. log-fold change, CLL data set. Probe sets with extreme differentiation highlighted.]

SLIDE 11

3. Reproducible Research

Long-term

◮ Returning to analysis after days, weeks, months of other activity

Multi-participant: communicating with. . .

◮ Other statisticians / bioinformaticians
◮ Biologists and others without specialized statistical knowledge

Science: reproducibility. . .

◮ Facilitates third-party verification
◮ Allows critical assessment
◮ Challenging, even in high-profile journals requiring archived raw data (Ioannidis et al., 2009, Nat Genet 41: 149-155).

SLIDE 12

Reproducible Research: Case Study

Original research

◮ Potti et al., 2006; Hsu et al., 2007
◮ NCI60 cell line drug sensitivity signature
◮ Clinical trial allocation

Reproducibility

◮ Baggerly & Coombes, 2009
◮ Off-by-one cisplatin gene signature
◮ Four ‘interesting’ genes not supported by analysis (two not on array)

References

◮ Potti et al. 2006 Nat Med 12: 1294-1300 (retracted)
◮ Hsu et al. 2007 J Clin Oncol 25: 4350-4357 (retracted)
◮ Baggerly & Coombes 2009 Ann Appl Stat 3: 1309-1334

SLIDE 13

Reproducible Research: Case Study

[Figure: Hsu et al., cisplatin, fig. 1a]

SLIDE 14

Reproducible Research: Case Study

[Figure: Baggerly & Coombes, fig. 2a]

SLIDE 15

Reproducible Research: Case Study

[Figure: Baggerly & Coombes, fig. 2b]

SLIDE 16

Reproducible Research: Case Study

[Figure: Baggerly & Coombes, fig. 2d]

SLIDE 17

Reproducible Research: Case Study

“. . . results incorporate several simple errors that may be putting patients at risk. One theme that emerges is that the most common errors are simple (e.g., row or column offsets); conversely, it is our experience that the most simple errors are common” – Baggerly & Coombes, 2009
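
A row offset of the kind mentioned in the quote is easy to reproduce. The sketch below uses made-up gene names and scores (hypothetical toy data, not taken from any of the cited studies) and one assumed cause of an offset, an extra header entry in an exported list. It only illustrates how selecting by position can silently return the wrong genes, while selecting by name does not depend on row position.

```r
# Hypothetical toy data: five genes with made-up scores (not from any study).
genes  <- c("geneA", "geneB", "geneC", "geneD", "geneE")
scores <- c(geneA = 0.2, geneB = 3.1, geneC = 0.1, geneD = 2.7, geneE = 0.4)

# Intended selection: genes with a score above 2 (positions 2 and 4).
hits <- which(scores > 2)

# Assumed scenario: the gene list gains a header entry somewhere along the
# way, so every position is shifted by one relative to the scores.
shifted <- c("header", genes)
by_position <- shifted[hits]   # wrong genes are returned, and no error is raised

# Selecting by name is unaffected by the shift.
by_name <- names(scores)[scores > 2]

print(by_position)
print(by_name)
```

Keeping identifiers attached to the values, as the named vector does here, is one simple guard against this class of error.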

SLIDE 18

Reproducible Research: Bioconductor

Script-based: Data transformations necessarily documented

‘Literate programming’: Text documents embed scripts, scripts evaluated when text document processed

Versioned software and repositories: Record which package versions used, and retrieve from Bioconductor archives

Integrated data containers: Sample descriptions and expression data in a single object. Subsetting expression data automatically subsets sample descriptions
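
Recording which package versions were used can be done from within the analysis script itself. The following is a minimal sketch using functions from base R's `utils` package. The output file location is an assumption made so the example runs anywhere; a real project would choose its own path.

```r
# Capture the R version and the versions of all loaded and attached packages.
info <- sessionInfo()

# Save the record alongside the analysis output. tempfile() is used here only
# so the sketch runs anywhere; a real analysis would pick a project path.
log_file <- tempfile(fileext = ".txt")
writeLines(capture.output(print(info)), log_file)

# The version of a single package can also be queried directly.
stats_version <- as.character(packageVersion("stats"))
print(stats_version)
```

Calling `sessionInfo()` at the end of a vignette or script is a common way to keep this record with the results.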

SLIDE 19

The ALL dataset

> library(ALL); data(ALL); ALL
ExpressionSet (storageMode: lockedEnvironment)
assayData: 12625 features, 128 samples
  element names: exprs
protocolData: none
phenoData
  sampleNames: 01005 01010 ... LAL4 (128 total)
  varLabels: cod diagnosis ... date last seen (21 total)
  varMetadata: labelDescription
featureData: none
experimentData: use 'experimentData(object)'
  pubMedIds: 14684422 16243790
Annotation: hgu95av2
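
The idea behind an integrated container like the one printed above is coordinated subsetting. The sketch below illustrates that idea with a hypothetical toy S4 class, `ToySet`, written in base R with simulated values. It is not Biobase's `ExpressionSet` and makes no claim about how that class is implemented; it only shows expression data and sample descriptions being subset together.

```r
library(methods)

# Hypothetical toy container for illustration only; NOT Biobase's ExpressionSet.
setClass("ToySet",
    representation(exprs = "matrix", pheno = "data.frame"),
    validity = function(object) {
        # One row of sample description per column of expression data.
        if (identical(colnames(object@exprs), rownames(object@pheno)))
            TRUE
        else
            "column names of 'exprs' must match row names of 'pheno'"
    })

# Subsetting samples (columns) subsets the sample descriptions in the same call.
setMethod("[", "ToySet", function(x, i, j, ..., drop = FALSE) {
    if (missing(i)) i <- seq_len(nrow(x@exprs))
    if (missing(j)) j <- seq_len(ncol(x@exprs))
    new("ToySet",
        exprs = x@exprs[i, j, drop = FALSE],
        pheno = x@pheno[j, , drop = FALSE])
})

# Simulated values: 3 features by 4 samples.
set.seed(1)
exprs <- matrix(rnorm(12), nrow = 3,
                dimnames = list(paste0("gene", 1:3), paste0("s", 1:4)))
pheno <- data.frame(status = c("Trt", "Trt", "Ctrl", "Ctrl"),
                    row.names = colnames(exprs))
toy <- new("ToySet", exprs = exprs, pheno = pheno)

# Keep treated samples; the sample descriptions follow automatically.
trt <- toy[, toy@pheno$status == "Trt"]
print(colnames(trt@exprs))
print(rownames(trt@pheno))
```

Because the two pieces can only be subset through one method, they cannot drift out of step, which is the property the slide attributes to integrated data containers.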

SLIDE 20

4. Leading Edge

Technological innovations

◮ E.g., SNP, miRNA arrays
◮ E.g., lab sequencing platforms; novel protocols

Fast-changing

◮ Commercial software products not yet developed, or already out-of-date
◮ Research questions require novel solutions

SLIDE 21

Leading Edge: Illustration

Sequencing technologies

◮ Historically (e.g., 2 years ago): short reads, low ‘tail’ quality, tail base call bias, data volume
◮ Current: count models, read bias, designed experiments, variant representations, annotation

[Figure: Bentley et al., 2008, Nature 456: 53-9]

SLIDE 22

Leading Edge: Illustration

[Figure: Average error proportion and change (late − early cycles), by base substitution: A|C, A|G, A|T, C|A, C|G, C|T, G|A, G|C, G|T, T|A, T|C, T|G]

SLIDE 23

Leading Edge: Illustration

[Figure: Cumulative proportion vs. copies per read (log10)]

SLIDE 24

Leading Edge: Illustration

[Figure: Poisson (purple) and negative binomial (orange) fit to RNA-seq data. Anders & Huber, 2010, Genome Biol, 11:R106]
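
The difference between the two count models comes down to how variance grows with the mean. The sketch below uses simulated counts only, not the data from Anders & Huber; the mean and dispersion values are arbitrary assumptions chosen for illustration. Under a Poisson model the variance equals the mean, while under R's `rnbinom(mu, size)` parameterization of the negative binomial the variance is `mu + mu^2 / size`.

```r
set.seed(1)    # make the simulation repeatable
n  <- 10000    # number of simulated counts
mu <- 100      # assumed mean count (illustrative value)

# Poisson: variance equals the mean.
pois <- rpois(n, lambda = mu)

# Negative binomial: variance = mu + mu^2 / size, so it exceeds the mean.
# size = 10 is an arbitrary illustrative dispersion setting.
nb <- rnbinom(n, mu = mu, size = 10)

# Variance-to-mean ratios: close to 1 for Poisson, close to 1 + mu / size for NB.
ratio_pois <- var(pois) / mean(pois)
ratio_nb   <- var(nb) / mean(nb)
print(c(poisson = ratio_pois, negbin = ratio_nb))
```

A variance-to-mean ratio well above 1 in replicate counts is the usual sign that a Poisson model underestimates the variability.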

SLIDE 25

5. Accessible

Affordable

◮ Purchase / licensing; time

Transparent

◮ Algorithms, e.g., RMA
◮ Code reuse

Challenges and solutions

◮ Research questions requiring ‘one-off’ solutions
◮ Software bugs

Usable

◮ Documentation
◮ Training, such as today!

SLIDE 26

Accessible

Documentation

◮ Help pages
◮ Vignettes
◮ Archived course and conference material
◮ Mailing list

BioC2011

◮ Annual conference – user and scientific presentations, workshops, poster session
◮ Seattle July 27-29

SLIDE 27

Analysis and Comprehension of High Throughput Genomic Data

Hallmarks of effective computational software

1. Extensive: data, annotation
2. Statistical: volume, technology, experimental design
3. Reproducible: long-term, multi-participant science
4. Leading edge: novel, technology-driven
5. Accessible: affordable, transparent, usable

SLIDE 28

Acknowledgments

◮ Vince Carey (Brigham & Women's, Harvard), Wolfgang Huber (EBI), Rafael Irizarry (JHU), Robert Gentleman (Genentech)
◮ Hervé Pagès, Marc Carlson, Nishant Gopalakrishnan, Chao-Jen Wong, Dan Tenenbaum, Valerie Obenchain
◮ Patrick Aboyoun, Seth Falcon, Michael Lawrence, Deepayan Sarkar, Florian Hahne
◮ Sean Davis, James MacDonald