The Bioconductor Project: Current Status
Martin Morgan
Roswell Park Cancer Institute, Buffalo, NY, USA
martin.morgan@roswellpark.org
6 December 2016

Bioconductor
Analysis and comprehension of high-throughput genomic data.
Started 2002
1296 R packages – developed by ‘us’ and user-contributed.
Well-used and respected.
◮ 43k unique IP downloads / month.
◮ 17,000 PubMedCentral citations.

State of the project
Packages
Users
Web & support sites
◮ https://bioconductor.org
◮ https://support.bioconductor.org
Training & meetings
Release & devel builders
Funding
Governance: (annual) Scientific Advisory Board; (monthly) Technical Advisory Board

Recent developments
New package submission
◮ As GitHub issues
◮ Public; review participation welcome
ExperimentHub and AnnotationHub
◮ Similar to ‘Annotation’ and ‘Experiment’ data repositories
◮ ExperimentHub often used as the ‘data store’ for experiment data packages, e.g., alpineData.
Large data representation: HDF5Array
(Sneak peek) Organism.dplyr

HDF5Array
> library(HDF5Array) # available in release & devel
> library(h5vcData)
> h5file <- system.file("extdata", "example.tally.hfs5", package="h5vcData")
> cov0 <- HDF5Array(h5file, "/ExampleStudy/16/Coverages")
> pcov <- t(drop(cov0[ , 1, ])) # coverage on plus strand
> mcov <- t(drop(cov0[ , 2, ])) # coverage on minus strand
> library(SummarizedExperiment)
> SummarizedExperiment(list(pcov=pcov, mcov=mcov))
class: SummarizedExperiment
dim: 90354753 6
metadata(0):
assays(2): pcov mcov
...
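The on-disk idea behind this slide can be tried without the h5vcData file. The following is a minimal sketch, assuming the HDF5Array package is installed; the toy matrix `m` and the names `h5` and `sub` are illustrative and not from the talk. Subsetting and transposition are recorded as delayed operations, and data are only read when the result is realized with `as.matrix()`.

```r
# Sketch only: the body runs when the HDF5Array package is available.
if (requireNamespace("HDF5Array", quietly = TRUE)) {
    suppressPackageStartupMessages(library(HDF5Array))

    m <- matrix(seq_len(20), nrow = 4)   # toy data standing in for coverage
    h5 <- writeHDF5Array(m)              # written to a temporary HDF5 file

    sub <- t(h5[, 2:3])                  # delayed subset + transpose
    stopifnot(identical(dim(sub), c(2L, 4L)))

    # Realize in memory and compare with the same operations on 'm'.
    stopifnot(all(as.matrix(sub) == t(m[, 2:3])))
}
```

Because the operations are delayed, the same pattern is intended to scale to arrays that do not fit in memory, such as the 90-million-row coverage matrices on the slide.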

Sneak peek: Organism.dplyr
> library(Organism.dplyr) # not yet publicly available
> src = src_ucsc("Homo sapiens") # any org.* + TxDb.*
using org.Hs.eg.db, TxDb.Hsapiens.UCSC.hg38.knownGene
> src
src: sqlite 3.8.6 [/home/mtmorgan/organism_dplyr.sqlite]
tbls: id, id_accession, id_go, id_go_all, id_omim_pm, id_protein, id_transcript, ranges_cds, ranges_exon, ranges_gene, ranges_tx
> tbl(src, 'id') %>% filter(symbol == 'BRCA1') %>% select(ensembl, symbol, genename)
> exons(src, filter=list(symbol='BRCA1')) # GRanges
> exons_tbl(src, filter=list(symbol='BRCA1')) # tibble

Programming best practices
Reuse & interoperability
◮ GenomicRanges and SummarizedExperiment
◮ rtracklayer::import() for BED, WIG, GTF, GFF, etc.
Documentation: classic or roxygen2
Testing: RUnit or testthat
Correct, robust, efficient (vectorized) code; BiocParallel
Classic, tidy, and semantically rich data

Correct, robust, efficient...
library(microbenchmark)

## grow the result one element at a time (copies x on every iteration)
f = function(n) {
    x = integer(0)
    for (i in 1:n) x = c(x, i)
    x
}
microbenchmark(f(1000), f(10000), f(100000))

## pre-allocate, then fill
f1 = function(n) {
    x = integer(n)
    for (i in 1:n) x[i] = i
    x
}

## apply-style iteration
f2 = function(n)
    vapply(1:n, c, integer(1))

## vectorized
f3 = function(n)
    seq_len(n)

## correct
identical(f(100), f3(100))
## robust! (1:0 counts down, so f(0) is not empty; f3(0) is)
f(0); f3(0)
## efficient
system.time(f3(1e9))
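The three checks on this slide (correct, robust, efficient) can be sketched in base R alone. The function names `grow`, `prealloc`, and `vectorized` are hypothetical stand-ins, and the loops here use `seq_len(n)` rather than `1:n`, which is one common way to make the zero-length case behave; the slide itself does not say which fix the author prefers.

```r
# Hypothetical helper names; base R only.
grow <- function(n) {
    x <- integer(0)
    for (i in seq_len(n)) x <- c(x, i)   # copies x on every iteration
    x
}

prealloc <- function(n) {
    x <- integer(n)                      # allocate once, then fill
    for (i in seq_len(n)) x[i] <- i
    x
}

vectorized <- function(n) seq_len(n)     # no explicit loop

# Correct: all three agree on ordinary input.
stopifnot(identical(grow(100), vectorized(100)),
          identical(prealloc(100), vectorized(100)))

# Robust: zero-length input yields a zero-length result.
stopifnot(identical(grow(0), integer(0)),
          identical(prealloc(0), integer(0)),
          identical(vectorized(0), integer(0)))

# The pitfall that seq_len() avoids: 1:0 is c(1, 0), not empty.
stopifnot(length(1:0) == 2L, length(seq_len(0)) == 0L)

# Efficient: compare timings informally; results vary by machine.
system.time(grow(1e4))
system.time(vectorized(1e4))
```

Timing output is left unstated because it depends on the machine; the growing version is expected to fall behind as `n` increases, since each `c(x, i)` copies the whole vector.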

Classic, tidy, rich: RNA-seq count data
Classic
◮ Sample x (phenotype + expression) Feature data.frame
Tidy
◮ ‘Melt’ expression values to two long columns, replicated phenotype columns. End result: long data frame.
Rich, e.g., SummarizedExperiment
◮ Phenotype and expression data manipulated in a coordinated fashion but stored separately.

Classic, tidy, rich: RNA-seq count data
## 'classic', 'tidy', and 'rich' hold the same data in the three representations
library(dplyr)
library(ggplot2)
library(SummarizedExperiment)
df0 <- as.data.frame(list(mean=colMeans(classic[, -(1:22)])))
df1 <- tidy %>% group_by(probeset) %>% summarize(mean=mean(exprs))
df2 <- as.data.frame(list(mean=rowMeans(assay(rich))))
ggplot(df1, aes(mean)) + geom_density()
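To make the comparison concrete, here is a minimal sketch with toy data showing that the per-feature mean is the same whether it is computed from a feature x sample matrix (as stored in a rich container's assay) or from the melted, tidy form. It uses base R in place of dplyr to stay self-contained; the object names (`counts`, `tidy_df`) and the simulated Poisson counts are assumptions for illustration only.

```r
set.seed(1)

# Toy feature x sample count matrix, the layout used by assay().
counts <- matrix(rpois(12, lambda = 10), nrow = 4,
                 dimnames = list(paste0("gene", 1:4),
                                 paste0("sample", 1:3)))

# Matrix form: one call on the whole matrix.
means_matrix <- rowMeans(counts)

# Tidy form: 'melt' to one row per (feature, sample) pair.
tidy_df <- as.data.frame(as.table(counts), responseName = "exprs",
                         stringsAsFactors = FALSE)
names(tidy_df)[1:2] <- c("feature", "sample")

# Group by feature and summarize, the base R analogue of
# group_by() followed by summarize().
means_tidy <- vapply(split(tidy_df$exprs, tidy_df$feature), mean, numeric(1))

# Both routes give the same per-feature means.
stopifnot(nrow(tidy_df) == length(counts))
stopifnot(isTRUE(all.equal(unname(means_matrix),
                           unname(means_tidy[names(means_matrix)]))))
```

The tidy table has one row per cell of the matrix, so sample-level phenotype columns would be repeated once per feature; that repetition is the cost the slide alludes to.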

Classic, tidy, rich: RNA-seq count data
Vocabulary
◮ Classic: extensive
◮ Tidy: restricted endomorphisms
◮ Rich: extensive, meaningful
Constraints (e.g., probes & samples)
◮ Tidy: implicit
◮ Classic, Rich: explicit
Flexibility
◮ Classic, tidy: general-purpose
◮ Rich: specialized
Programming contract
◮ Classic, tidy: limited
◮ Rich: strict
Lessons learned / best practices
◮ Considerable value in semantically rich structures
◮ Current implementations trade-off user and developer convenience
◮ Endomorphism, simple vocabulary, consistent paradigm aid use

Future challenges
Git
Cloud. Possible visions:
◮ As now, but ‘in the cloud’
◮ Integrated with ‘third party’ compute efforts, e.g., NCI, NIH in the United States

Acknowledgments
Core team: Yubo Cheng, Valerie Obenchain, Hervé Pagès, Marcel Ramos, Lori Shepherd, Nitesh Turaga, Greg Wargula.
Technical advisory board: Vincent Carey, Kasper Hansen, Wolfgang Huber, Robert Gentleman, Rafael Irizarry, Levi Waldron, Michael Lawrence, Sean Davis, Aedin Culhane
Scientific advisory board: Simon Tavaré (CRUK), Paul Flicek (EMBL/EBI), Simon Urbanek (AT&T), Vincent Carey (Brigham & Women’s), Wolfgang Huber (EBI), Rafael Irizarry (Dana Farber), Robert Gentleman (23andMe)
Research reported in this presentation was supported by the National Human Genome Research Institute and the National Cancer Institute of the National Institutes of Health under award numbers U41HG004059 and U24CA180996. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
