Reducing technical variability and bias in RNA-seq data
Francesca Finotello
NETTAB 2012, November 14-16, 2012, Como, Italy

RNA-seq methodology
• RNA-Seq is a recent methodology (Nagalakshmi, Science 2008) for transcriptome profiling that is based on Next-Generation Sequencing (Nat Rev Genet. 2009)
• It is widely adopted in quantitative transcriptomics and seen as a valuable alternative to microarrays (Nat Methods. 2008)

RNA-seq data
Workflow (figure): RNAs, fragmentation, retrotranscription into cDNAs, amplification, sequencing, reads; the figure also marks a size selection step. Reads are then mapped to genes and summarized as counts for DE analysis.
• Counts: number of reads aligned on a gene
• Counts are a digital measure of gene expression
Count table passed to DE analysis:
            Condition 1   Condition 2
gene 1      27            80
gene 2      15            56
…           …             …
gene N      50            20
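As a rough illustration of how a count table like the one above arises, the sketch below tallies the reads whose alignment overlaps each gene interval. The gene coordinates, the read positions and the function name `count_reads_per_gene` are hypothetical and chosen only for this example; real analyses rely on dedicated tools rather than a nested loop like this.

```python
from collections import Counter


def count_reads_per_gene(genes, reads):
    """Count the reads overlapping each gene.

    genes: dict mapping gene name -> (chrom, start, end), 0-based half-open.
    reads: iterable of (chrom, start, end) alignments, same convention.
    A read is counted for every gene it overlaps by at least one base.
    """
    counts = Counter({name: 0 for name in genes})
    for r_chrom, r_start, r_end in reads:
        for name, (g_chrom, g_start, g_end) in genes.items():
            # Two half-open intervals overlap if each starts before the other ends.
            if r_chrom == g_chrom and r_start < g_end and g_start < r_end:
                counts[name] += 1
    return dict(counts)


# Hypothetical toy data: two genes and four 36-bp reads.
genes = {"gene1": ("chr1", 100, 200), "gene2": ("chr1", 500, 650)}
reads = [
    ("chr1", 90, 126),
    ("chr1", 150, 186),
    ("chr1", 520, 556),
    ("chr2", 100, 136),
]
counts = count_reads_per_gene(genes, reads)
print(counts)
```

Running the same tally on each sample gives one column of the count table per condition.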

RNA-seq biases
"RNA-seq […] can capture transcriptome dynamics across different tissues or conditions without sophisticated normalization of data sets." - Wang, Nat Methods. 2008
• Read coverage is not uniform along genes/transcripts
• Different samples can be sequenced at different sequencing depths
• Longer genes are more likely to have higher counts
• Most reads arise from a restricted subset of highly expressed genes

Outline
• Definition of an alternative approach for computing counts
• Assessment of bias with standard and novel approach
• Evaluation of effects on quantification and differential expression analysis
• Conclusions and future developments

New approach: maxcounts
• Consider the reads aligned to an exon
• For each exon $i$ in sample $j$, let $c_{ij}(p)$ be the number of reads covering exon base $p$
• maxcounts are computed as the maximum of per-base counts: $\mathrm{maxcounts}_{ij} = \max_{p} c_{ij}(p)$
Methods
• Reads mapped on reference genomes with TopHat, not allowing multiple alignments (-g 1 option)
• Counts (totcounts) and per-base counts computed with bedtools (Quinlan, 2010)
• maxcounts computed with custom scripts (C++ and Perl)
• Differences in sequencing depths corrected via TMM (Robinson, 2010)
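The definition of maxcounts can be sketched in a few lines of Python. This is not the authors' C++/Perl code, which the slides do not show. It assumes per-base depths are available as tab-separated records laid out like the output of `bedtools coverage -d`: the exon's chrom, start and end first, then the position within the exon and the depth at that position as the last two fields. The function name and the toy records are hypothetical.

```python
def maxcounts_from_per_base(lines):
    """Compute maxcounts per exon from per-base depth records.

    Assumed layout of each tab-separated line: exon chrom, start, end in
    the first three fields; position within the exon second to last;
    per-base depth in the last field.
    """
    maxcounts = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        exon = (fields[0], int(fields[1]), int(fields[2]))
        depth = int(fields[-1])
        # Keep the largest per-base depth seen so far for this exon.
        maxcounts[exon] = max(depth, maxcounts.get(exon, 0))
    return maxcounts


# Hypothetical records: a 4-bp exon and a 2-bp exon.
example = [
    "chr1\t100\t104\t1\t3",
    "chr1\t100\t104\t2\t5",
    "chr1\t100\t104\t3\t5",
    "chr1\t100\t104\t4\t2",
    "chr1\t300\t302\t1\t0",
    "chr1\t300\t302\t2\t1",
]
result = maxcounts_from_per_base(example)
print(result)
```

Because only the maximum depth is kept, the value does not grow simply because an exon offers more positions for reads to start from.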

Biases: exon length
Figure: smoothed scatter plot of counts vs. exon length (log-log), with a cubic-spline fit of mean log-counts over bins of 100 exons each. Data set: Griffith, 2010. Correlation coefficients shown in the panels: r=0.43, r=-0.29, r=0.01
• Length bias also at exon level
• RPKMs overcorrect
• maxcounts strongly reduce length bias
RPKM: Reads Per Kilobase of exon model per Million mapped reads
                  Exp. 1    Exp. 2
e1 [100 bp]       100       80
e2 [95 bp]        120       115
…                 …         …
e100 [2000 bp]    2120      2000
∑ counts          15 000    10 000
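The RPKM definition can be worked through on the numbers in the table above. The sketch below divides each count by the exon length in kilobases and by the number of mapped reads in millions. Treating the ∑ counts row as the number of mapped reads is an assumption about how the slide's example is meant to be read, and the function name `rpkm` is only illustrative.

```python
def rpkm(count, length_bp, total_mapped_reads):
    """Reads Per Kilobase of exon model per Million mapped reads."""
    kilobases = length_bp / 1_000
    millions = total_mapped_reads / 1_000_000
    return count / (kilobases * millions)


# Exon length (bp) and counts in Exp. 1 and Exp. 2, copied from the slide.
exons = {
    "e1": (100, (100, 80)),
    "e2": (95, (120, 115)),
    "e100": (2000, (2120, 2000)),
}
# Assumption: the "sum of counts" row is used as the number of mapped reads.
totals = (15_000, 10_000)

rpkm_table = {
    name: tuple(rpkm(c, length, n) for c, n in zip(counts, totals))
    for name, (length, counts) in exons.items()
}

for name, values in rpkm_table.items():
    print(name, [round(v, 1) for v in values])
```

Under this reading, e1 has fewer reads in Exp. 2 than in Exp. 1 (80 vs. 100) but a higher RPKM, because the total in Exp. 2 is smaller.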

Counts distribution across exons
Data sets: Griffith, 2010; Bullard, 2010; Marioni, 2008
• 3-5% exons contain 50% of counts; 27-32% exons contain 90% of counts
• 1-3% exons contain 50% of counts; 15-34% exons contain 90% of counts
• maxcounts have a less steep curve than totcounts and RPKMs, i.e. counts are more evenly distributed across exons
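Statements such as "x% of exons contain 50% of counts" can be computed by sorting exons from the most to the least expressed and accumulating their counts. The sketch below shows one way to do this; the function name and the toy counts are hypothetical and are not taken from any of the data sets above.

```python
def fraction_of_exons_for_share(counts, share):
    """Smallest fraction of exons, taken from the most to the least
    expressed, whose counts add up to at least `share` of the total."""
    ordered = sorted(counts, reverse=True)
    total = sum(ordered)
    if total == 0:
        raise ValueError("all counts are zero")
    running = 0
    for n_exons, c in enumerate(ordered, start=1):
        running += c
        if running >= share * total:
            return n_exons / len(ordered)
    return 1.0


# Hypothetical counts for ten exons (total = 1000).
toy_counts = [520, 300, 100, 40, 20, 10, 5, 3, 1, 1]
half = fraction_of_exons_for_share(toy_counts, 0.5)
ninety = fraction_of_exons_for_share(toy_counts, 0.9)
print(half, ninety)
```

A measure whose counts are spread more evenly across exons needs a larger fraction of exons to reach the same share, which corresponds to a less steep cumulative curve.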

Variance: technical replicates
Figure: variance vs. mean of log-counts/RPKMs across technical replicates. Data sets: Bullard, 2010; Griffith, 2010
• maxcounts' variance is always lower than totcounts' variance
• RPKMs' variance depends on data set
• Assessment on other data sets
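The quantity plotted here, the variance against the mean of log-transformed values across technical replicates, can be computed per exon as sketched below. The slides do not state the log base or how zero counts were handled, so the base-2 logarithm and the pseudocount of 1 are assumptions, as are the function name and the toy values.

```python
import math
from statistics import mean, variance


def log_mean_variance(replicate_counts, pseudocount=1.0):
    """For each exon, return (mean, sample variance) of
    log2(count + pseudocount) across technical replicates.

    The pseudocount and the base-2 logarithm are assumptions of this sketch.
    """
    summary = {}
    for exon, counts in replicate_counts.items():
        logs = [math.log2(c + pseudocount) for c in counts]
        summary[exon] = (mean(logs), variance(logs))
    return summary


# Hypothetical counts for two exons over three technical replicates.
toy = {"e1": (7, 7, 7), "e2": (3, 7, 15)}
summary = log_mean_variance(toy)
print(summary)
```

Both toy exons have the same mean log-count, but only the second varies across replicates, which is the kind of difference a variance vs. mean plot makes visible.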

Quantification: spike-in RNAs
Data set: Jiang, 2011
Spike-in RNAs (ERCC Consortium)
• Single-isoforms
• Known sequence and concentration
Figure panels: totcounts, RPKMs, maxcounts
• All measures have high concordance with concentrations
• Transcripts length 270-2000 nt (performance on shorter transcripts?)

DE analysis: log-fold-changes
Data set: Griffith, 2010
• DE analysis with edgeR (Robinson, 2010) yields log-fold-changes (logFC)
• Negative Binomial distribution of data required (no RPKMs)
Figure panels: totcounts, maxcounts
• RMSD (root-mean-square deviation): difference between logFC predicted from maxcounts or totcounts and logFC from qRT-PCR (gold standard)
• maxcounts have a lower RMSD, i.e. higher concordance with qRT-PCR
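The RMSD used for this comparison is the square root of the mean squared difference between predicted and reference logFC values. A minimal sketch follows; the logFC values are made up for illustration and do not come from the Griffith data set, and the names `rmsd`, `estimate_a` and `estimate_b` are hypothetical.

```python
import math


def rmsd(predicted, reference):
    """Root-mean-square deviation between two equally long sequences."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical logFC values for four genes.
qpcr_logfc = [1.0, -2.0, 0.5, 3.0]   # reference (e.g. qRT-PCR)
estimate_a = [1.5, -1.5, 0.0, 2.5]   # every gene off by 0.5
estimate_b = [2.0, -1.0, 1.5, 2.0]   # every gene off by 1.0

rmsd_a = rmsd(estimate_a, qpcr_logfc)
rmsd_b = rmsd(estimate_b, qpcr_logfc)
print(rmsd_a, rmsd_b)
```

The estimate with the lower RMSD is the one that agrees more closely with the reference.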

Conclusions & future developments
                            length   count      tech.      spike-in   DE
                            bias     distrib.   variance   quant.     analysis
totcounts (std approach)    -        -          -          +          +
RPKM                        +        +          +          ++
maxcounts                   ++       ++         +          ++         ++
Work in progress and future developments
• Benchmark on more data sets (biological replicates, spike-in RNAs)
• Use other DE methods downstream
• Aggregate exon maxcounts to have a measure at gene/transcript level
• Define a robust pre-processing pipeline to avoid artifacts
• Develop an alternative strategy for computing maxcounts and implement all versions in a bedtools module

Acknowledgements
Enrico Lavezzo, Luisa Barzon, Stefano Toppo, Paolo Fontana, Paolo Mazzon, Barbara Di Camillo
