The Bioconductor Project for Reproducible Analysis of High Throughput Genomic Data

The Bioconductor Project for Reproducible Analysis of High Throughput Genomic Data Martin Morgan (mtmorgan@fhcrc.org) Fred Hutchinson Cancer Research Center 19-21 January, 2011 Analysis and Comprehension of High Throughput Genomic Data


  1. The Bioconductor Project for Reproducible Analysis of High Throughput Genomic Data Martin Morgan (mtmorgan@fhcrc.org) Fred Hutchinson Cancer Research Center 19-21 January, 2011

  2. Analysis and Comprehension of High Throughput Genomic Data Hallmarks of effective computational software 1. Extensive: data, annotation 2. Statistical: volume, technology, experimental design 3. Reproducible: long-term, multi-participant science 4. Leading edge: novel, technology-driven 5. Accessible: affordable, transparent, usable

  3. 1. Extensive Data and Annotation Data ◮ Expression, tiling, methylation, custom arrays. ◮ Sequence analysis, e.g., ChIP-seq, RNA-seq ◮ Other high-throughput assays, e.g., flow cytometry, mass spec., imaging ◮ Public repositories, e.g., GEO, ArrayExpress Annotation, e.g., ◮ Well-curated: NCBI, BioMart, UCSC, MSigDB, GO, KEGG ◮ Loosely curated: emerging, specialized, & lab-based ◮ Consortium: HapMap, 1000 Genomes, TCGA

  4. Bioconductor Goal Help biologists understand their data Focus ◮ Expression and other microarray; flow cytometry ◮ High-throughput sequencing Themes ◮ Contributions from ‘core’ members and (primarily academic) user community ◮ Based on the R programming language: statistics, visualization, interoperability ◮ Reproducible: scripts, vignettes, packages ◮ Open source / open development Success > 400 packages; publications; 8,000 web visits / week; 75,000 unique IP downloads / year; very active mailing list; annual conferences; courses; ...

  5. Bioconductor: Sample Work Flow
       > ## Pre-processing
       > library(affy)
       > eset <- just.rma()
       > ## Quality assessment
       > library(arrayQualityMetrics)
       > arrayQualityMetrics(eset)
       > ## Differential expression
       > library(limma)
       > status <-
       +     c("Trt", "Trt", "Trt", "Ctrl", "Ctrl", "Ctrl")
       > design <- model.matrix( ~status )
       > fit <- eBayes(lmFit(eset, design))
       > topTable(fit, coef=2)
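
In the workflow above, `coef=2` selects the second column of the design matrix. The following base-R sketch (no Bioconductor packages required) reuses the slide's `status` vector to show which column that is. It assumes only R's default treatment contrasts.

```r
# Same status vector as on the slide: three treated, three control arrays.
status <- c("Trt", "Trt", "Trt", "Ctrl", "Ctrl", "Ctrl")

# model.matrix() treats the character vector as a factor whose levels are
# sorted alphabetically, so "Ctrl" becomes the reference level.
design <- model.matrix(~ status)

# Column 1 is the intercept (mean of the reference level, Ctrl);
# column 2 is the Trt-versus-Ctrl difference.
print(colnames(design))
print(design)
```

Under these defaults, the second coefficient is the treatment effect relative to control, which is why the slide asks `topTable()` for `coef=2`.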

  6. 2. Statistical Technology ◮ Acknowledging artifacts and biases ◮ Accommodate using statistical models, e.g., RMA Volume of data ◮ Data reduction essential Experimental design ◮ Exploratory analysis ◮ Hypothesis-driven; designed experiments ◮ Cost-effective, but not too clever Figure: Expression array. Pseudocolors represent hybridisation intensities of RNA to features. Source: url

  7. Statistical, continued (bullets repeated from slide 6) Figure: Measured intensity increases with GC content; Chronic Lymphocytic Leukemia (CLL) dataset. Axes: number of G and C (x), log2 intensity (y).

  8. Statistical, continued (bullets repeated from slide 6) Figure: Heatmap summarizing distance between CLL arrays. Color scale: 0.0 to 0.7.

  9. Statistical, continued (bullets repeated from slide 6) Figure: Normalized unscaled standard error (NUSE), by array, suggests array CLL1 is an outlier.

  10. Statistical, continued (bullets repeated from slide 6) Figure: ‘Progressive’ vs. ‘stable’ status. log P vs. log-fold change, CLL data set. Probe sets with extreme differentiation highlighted. Axes: log-ratio (x), log10 p (y).
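
The two quantities plotted on slide 10 (a log-ratio and a log-scaled p-value per probe set) can be computed in base R. The sketch below is an illustration on simulated data, not the CLL data set, and it swaps in an ordinary per-gene t-test for whatever statistic produced the slide's figure. All object names are hypothetical.

```r
set.seed(1)                       # simulated data only; not the CLL data set
n_genes <- 200
group <- factor(rep(c("stable", "progressive"), each = 6))

# Hypothetical log2 expression matrix: genes in rows, arrays in columns.
x <- matrix(rnorm(n_genes * length(group)), nrow = n_genes,
            dimnames = list(paste0("gene", seq_len(n_genes)), NULL))

# Shift the first five genes in the "progressive" arrays to mimic a signal.
x[1:5, group == "progressive"] <- x[1:5, group == "progressive"] + 3

# Per-gene log-ratio: difference of group means on the log2 scale.
log_ratio <- rowMeans(x[, group == "progressive"]) -
             rowMeans(x[, group == "stable"])

# Per-gene two-sample t-test p-value (a stand-in for a moderated statistic).
p_value <- apply(x, 1, function(g) t.test(g ~ group)$p.value)

# One row per gene, ready for plotting.
volcano <- data.frame(log_ratio = log_ratio, neg_log10_p = -log10(p_value))
print(head(volcano[order(-volcano$neg_log10_p), ]))
```

Calling `plot(volcano$log_ratio, volcano$neg_log10_p)` then gives a display of the same kind as the slide's figure, with strongly shifted genes towards the upper corners.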

  11. 3. Reproducible Research Long-term ◮ Returning to analysis after days, weeks, months of other activity Multi-participant: communicating with... ◮ Other statisticians / bioinformaticians ◮ Biologists and others without specialized statistical knowledge Science: reproducibility... ◮ Facilitates third-party verification ◮ Allows critical assessment ◮ Challenging, even in high-profile journals requiring archived raw data (Ioannidis et al., 2009, Nat Genet 41: 149-155).

  12. Reproducible Research: Case Study Original research ◮ Potti et al., 2006; Hsu et al., 2007 ◮ NCI60 cell line drug sensitivity signature ◮ Clinical trial allocation Reproducibility ◮ Baggerly & Coombes, 2009 ◮ Off-by-one cisplatin gene signature ◮ Four ‘interesting’ genes not supported by analysis (two not on array) References ◮ Potti et al. 2006 Nat Med 12: 1294-1300 (retracted) ◮ Hsu et al. 2007 J Clin Oncol 25: 4350-4357 (retracted) ◮ Baggerly & Coombes 2009 Ann Appl Stat 3: 1309-1334

  13. Reproducible Research: Case Study, continued (bullets repeated from slide 12) Figure: Hsu et al., cisplatin, fig. 1a

  14. Reproducible Research: Case Study, continued (bullets repeated from slide 12) Figure: Baggerly & Coombes, fig. 2a

  15. Reproducible Research: Case Study, continued (bullets repeated from slide 12) Figure: Baggerly & Coombes, fig. 2b

  16. Reproducible Research: Case Study, continued (bullets repeated from slide 12) Figure: Baggerly & Coombes, fig. 2d

  17. Reproducible Research: Case Study, continued (bullets repeated from slide 12) Quote: “... results incorporate several simple errors that may be putting patients at risk. One theme that emerges is that the most common errors are simple (e.g., row or column offsets); conversely, it is our experience that the most simple errors are common” (Baggerly & Coombes, 2009)
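
The “row or column offsets” mentioned in the quote are easy to produce when identifiers and values live in separate tables. The sketch below is a made-up illustration of that class of error, not a reconstruction of the analyses discussed above; the gene names, scores, and header row are all hypothetical.

```r
# Hypothetical gene scores; the identifiers travel with the values as names.
scores <- c(geneA = 0.1, geneB = 2.5, geneC = 0.3, geneD = 3.1, geneE = 0.2)

# A separate label column that still carries its header row, as a
# spreadsheet export might.
labels_with_header <- c("GeneName", names(scores))

# Positions of the high-scoring genes: 2 and 4.
top <- which(scores > 2)

# Error-prone: positional lookup into the label vector, which is shifted by
# one because of the header row.
wrong <- labels_with_header[top]

# Safer: look the identifiers up from the object that holds the values.
right <- names(scores)[top]

print(wrong)   # "geneA" "geneC"
print(right)   # "geneB" "geneD"
```

Both lookups run without any warning, which is what makes the shifted version hard to notice without a scripted, re-runnable analysis.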

  18. Reproducible Research: Bioconductor Script-based: data transformations necessarily documented. ‘Literate programming’: text documents embed scripts; scripts evaluated when text document processed. Versioned software and repositories: record which package versions used, and retrieve from Bioconductor archives. Integrated data containers: sample descriptions and expression data in a single object; subsetting expression data automatically subsets sample descriptions.
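
The “integrated data containers” and “versioned software” points on slide 18 can be sketched with the `ExpressionSet` class from the Biobase package. This is a minimal illustration on made-up toy data; the probe names, sample names, and `status` column are hypothetical, and the Biobase package is assumed to be installed.

```r
library(Biobase)   # Bioconductor package that defines ExpressionSet

# Toy expression matrix (made up): 4 features in rows, 3 samples in columns.
expr <- matrix(as.numeric(1:12), nrow = 4,
               dimnames = list(paste0("probe", 1:4), c("s1", "s2", "s3")))

# Sample descriptions; row names must match the matrix column names.
pheno <- data.frame(status = c("Trt", "Ctrl", "Trt"),
                    row.names = c("s1", "s2", "s3"))

# One object holding both the expression data and the sample descriptions.
eset <- ExpressionSet(assayData = expr,
                      phenoData = AnnotatedDataFrame(pheno))

# Subsetting samples once keeps both parts in step.
trt <- eset[, eset$status == "Trt"]
print(dim(exprs(trt)))   # features x remaining samples
print(pData(trt))        # sample descriptions for the same samples

# Record the R and package versions used for this analysis.
print(sessionInfo())
```

Because the subset is taken on the container rather than on the matrix and the sample table separately, the two cannot drift out of alignment.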

  19. The ALL dataset
       > library(ALL); data(ALL); ALL
       ExpressionSet (storageMode: lockedEnvironment)
       assayData: 12625 features, 128 samples
         element names: exprs
       protocolData: none
       phenoData
         sampleNames: 01005 01010 ... LAL4 (128 total)
         varLabels: cod diagnosis ... date last seen (21 total)
         varMetadata: labelDescription
       featureData: none
       experimentData: use 'experimentData(object)'
         pubMedIds: 14684422 16243790
       Annotation: hgu95av2

  20. 4. Leading Edge Technological innovations ◮ E.g., SNP, miRNA arrays ◮ E.g., lab sequencing platforms; novel protocols Fast-changing ◮ Commercial software products not yet developed, or already out-of-date ◮ Research questions require novel solutions

  21. Leading Edge: Illustration Sequencing technologies ◮ Historically (e.g., 2 years ago): short reads, low ‘tail’ quality, tail base call bias, data volume ◮ Current: count models, read bias, designed experiments, variant representations, annotation Figure source: Bentley et al., 2008, Nature 456: 53-9
