Exploring short read sequences
Martin Morgan (mtmorgan@fhcrc.org)
Fred Hutchinson Cancer Research Center, Seattle, WA
June 27-July 1, 2011

Topics
RNA-seq
◮ Experimental design
◮ Quality assessment
◮ Counting reads
Microbiome
◮ Sequence manipulation

RNA-seq example work flow – Malone and Oliver (2011)
Sample
◮ Purify poly(A)+ RNA with oligo(dT) magnetic beads
Microarray
◮ cDNA synthesis primed with random hexamers
◮ Dye-swap, hybridization, fluorescence, analysis
RNA-seq
◮ Fragment
◮ cDNA synthesis primed with random hexamers
◮ Adapter ligation, size select

Good data: key issues
◮ Experimental design (Auer and Doerge, 2010)
◮ Replication
◮ Randomization and blocking, e.g., batch effects
◮ Depth of coverage
◮ Statistical power
◮ Library complexity
◮ Coverage heterogeneity
◮ Estimation biases
◮ Legitimate comparison
◮ Sequencing uncertainty (Bravo and Irizarry, 2010)

Good data: key issues
Figure: ROC simulation
◮ Replication (red vs. blue)
◮ Randomization and blocking (solid vs. dot)

Good data: key issues
Figure: Cumulative proportion of reads occurring 0, 1, . . . times. x-axis: number of occurrences of each read (log 10); y-axis: cumulative proportion of reads.
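A minimal base R sketch of how a curve like the one on this slide could be computed, assuming the reads are available as a character vector. The `reads` vector is made-up toy data and the variable names are illustrative; none of them come from the slides or from a package.

```r
## Toy data (assumption): seven reads, four distinct sequences
reads <- c("AAA", "AAA", "AAA", "CCC", "CCC", "GGG", "TTT")

## Number of times each distinct read occurs
occ <- table(reads)

## Number of distinct reads seen exactly 1, 2, 3, ... times
nOcc <- table(occ)
occurrences <- as.integer(names(nOcc))

## Reads accounted for at each occurrence level, then cumulative proportion
readsPer <- occurrences * as.vector(nOcc)
cumProp <- cumsum(readsPer) / sum(readsPer)

## A low-complexity library reaches 1.0 only at large occurrence counts
data.frame(occurrences, cumProp)
```

Plotting `cumProp` against `log10(occurrences)` would give axes matching those described for the figure.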

Good data: key issues
Figure: Actual (green) versus uniform φX174 coverage. x-axis: copies per read (log 10); y-axis: cumulative proportion.

Good data: key issues
Figure: Read count increases with gene length.

Good data: key issues
Figure: Reads, stratified by cycle, supporting a spurious SNP call in φX174.

Quality assessment
Subset of Brooks et al. (2011)
◮ RNAi and mRNA-seq to identify pasilla-regulated alternative splicing
◮ Purified polyA, random hexamer primed
◮ Single- and paired-end sequences
◮ Align to reference genome, and to curated splice junctions

> library(ShortRead)
> ## collate statistics: one qa() result per FASTQ file
> fqFiles <- list.files(pattern="\\.fastq$")
> names(fqFiles) <- sub("\\.fastq$", "", fqFiles)
> qas <- mapply(qa, fqFiles, names(fqFiles),
+     MoreArgs=list(type="fastq"), SIMPLIFY=FALSE)
> ## combine per-file results; a new name avoids masking qa()
> qaAll <- do.call(rbind, qas)
> ## create report
> rpt <- report(qaAll)

Counting hits: countGenomicOverlaps
◮ Types of overlaps
◮ Decision tree
◮ Performance: 10’s of seconds to count 10’s of millions of reads against 20,000 regions
Figure: Types of overlaps between reads, genes (G1-G11) and features (F1-F16)
◮ Case I & II: Single read, single gene, single feature
◮ Case III, IV & V: Single read, single gene, multiple features
◮ Case VI: Single read, multiple genes, multiple features
◮ Case VII: Split read, single gene, single feature
◮ Case VIII & IX: Split read, single or multiple genes, multiple features

Counting hits: countGenomicOverlaps
type
◮ "any", "start", "end", "within"
resolution
◮ Reads hit 0 genes → discard
◮ Reads hit 1 gene → count
◮ Reads hit > 1 gene →
  ◮ "none" → discard
  ◮ "divide" → equal division amongst genes
  ◮ "uniqueDisjoint" →
    ◮ Unique disjoint overlap → count
    ◮ Otherwise discard
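The resolution rules on this slide can be sketched in base R as follows. This is an illustration only, not the implementation of countGenomicOverlaps: `resolveHit` and its `uniqueGene` argument are hypothetical names, and the sketch assumes the genes hit by a read (and, for "uniqueDisjoint", the one gene whose non-shared region the read overlaps, if any) have already been determined.

```r
## Hypothetical helper (not part of any package): allocate one read's count
## among the genes it hits, following the resolution rules on the slide.
##   genes:      character vector of genes the read overlaps
##   uniqueGene: for "uniqueDisjoint", the single gene whose disjoint
##               (non-shared) region the read overlaps, or NA if none
resolveHit <- function(genes,
                       resolution = c("none", "divide", "uniqueDisjoint"),
                       uniqueGene = NA_character_) {
    resolution <- match.arg(resolution)
    genes <- unique(genes)
    if (length(genes) == 0L)
        return(numeric(0))                 # read hits 0 genes: discard
    if (length(genes) == 1L)
        return(setNames(1, genes))         # read hits 1 gene: count
    ## read hits more than one gene
    switch(resolution,
        none = numeric(0),                 # discard
        divide = setNames(rep(1 / length(genes), length(genes)), genes),
        uniqueDisjoint =
            if (!is.na(uniqueGene) && uniqueGene %in% genes)
                setNames(1, uniqueGene)    # unique disjoint overlap: count
            else
                numeric(0))                # otherwise discard
}

resolveHit("G1")                           # single gene, counted once
resolveHit(c("G6", "G7"), "divide")        # split equally between G6 and G7
resolveHit(c("G6", "G7"), "uniqueDisjoint", uniqueGene = "G7")
```

Summing the returned values by gene name over all reads would give per-gene counts under the chosen resolution.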


Sequence manipulation: microbiome
Sampling
1. Sample bacterial communities of 10’s of individuals
2. 454 sequencing of 16S rRNA
3. Pre-processing
   ◮ Bar codes
   ◮ Primers
4. Phylogenetic placement
5. ‘Ecological’ analysis
Pre-processing tasks
◮ De-multiplex – simple pattern matching, subset, narrow (remove bar code)
◮ Primer removal – partial, redundant primer requires full Smith-Waterman matching
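The de-multiplexing task (pattern match, subset, narrow) can be sketched with base R string functions. The reads, bar codes and sample names below are made up for illustration, and `demultiplex` is a hypothetical helper rather than a function from the slides. The sketch assumes exact bar codes at the start of each read, and it does not cover primer removal, which the slide notes requires full Smith-Waterman matching.

```r
## Toy data (assumption): reads beginning with a 4-nt bar code
reads <- c("ACGTTTGACCA", "TTAGGGCATCA", "ACGTCCATGGA", "GGGGAAAACCC")
barcodes <- c(sampleA = "ACGT", sampleB = "TTAG")   # hypothetical bar codes

## Hypothetical helper: split reads by bar code and trim the bar code off
demultiplex <- function(reads, barcodes) {
    lapply(barcodes, function(bc) {
        hit <- startsWith(reads, bc)             # pattern match at read start
        substring(reads[hit], nchar(bc) + 1L)    # subset, then narrow
    })
}

demux <- demultiplex(reads, barcodes)
demux$sampleA    # reads assigned to sampleA, bar code removed
```

Reads matching no bar code (the last toy read) are dropped. Real data would likely also need a policy for sequencing errors within the bar code.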

Conclusions
◮ Well-designed experiments include biological replicates, with blocking of potentially confounding variates
◮ Biases are likely pervasive in sequence data; the question under investigation may influence whether biases are important
◮ Bioconductor includes flexible tools for exploring data

Bioconductor
Who
◮ FHCRC: Hervé Pagès, Marc Carlson, Nishant Gopalakrishnan, Valerie Obenchain, Dan Tenenbaum, Chao-Jen Wong
◮ Robert Gentleman (Genentech), Vince Carey (Harvard / Brigham & Women’s), Rafael Irizarry (Johns Hopkins), Wolfgang Huber (EBI, Heidelberg)
◮ A large number of contributors, world-wide
Resources
◮ http://bioconductor.org: installation, packages, work flows, courses, events
◮ Mailing list: friendly prompt help
◮ Conference: Morning talks, afternoon workshops, evening social. 28-29 July, Seattle, WA. Developer Day July 27

P. L. Auer and R. W. Doerge. Statistical design and analysis of RNA sequencing data. Genetics, 185:405–416, Jun 2010.
H. C. Bravo and R. A. Irizarry. Model-based quality assessment and base-calling for second-generation sequencing data. Biometrics, 66:665–674, Sep 2010.
A. N. Brooks, L. Yang, M. O. Duff, K. D. Hansen, J. W. Park, S. Dudoit, S. E. Brenner, and B. R. Graveley. Conservation of an RNA regulatory map between Drosophila and mammals. Genome Res., 21:193–202, Feb 2011.
J. H. Malone and B. Oliver. Microarrays, deep sequencing and the true measure of the transcriptome. BMC Biol., 9:34, 2011.
