Bias in RNA sequencing and what to do about it Walter L. (Larry) - PowerPoint PPT Presentation

Bias in RNA sequencing and what to do about it Walter L. (Larry) Ruzzo Computer Science and Engineering Genome Sciences University of Washington Fred Hutchinson Cancer Research Center Seattle, WA, USA ruzzo@uw.edu

⬇ RNAseq ⬇ ⬇ Millions of reads, DNA Sequencer say, 100 bp each map to genome, map to genome, analyze compare & analyze 2

Goals of RNAseq 1. Which genes are being expressed? How? assemble reads (fragments of mRNAs) into (nearly) full-length mRNAs and/or map them to a reference genome 2. How highly expressed are they? How? count how many fragments come from each gene–expect more highly expressed genes to yield more reads per unit length 3. What’s same/diff between 2 samples E.g., tumor/normal 4. ... 3

RNA seq cDNA, fragment, QC filter, RNA → → S equence → → C ount end repair, A-tail, trim, map, ligate, PCR, … … It’s so easy, what could possibly go wrong?

What we expect: Uniform Sampling 100 Count reads starting at each position, not those covering each position 75 50 25 0 0 50 100 150 200 Uniform sampling of 4000 “reads” across a 200 bp “exon.” Average 20 ± 4.7 per position, min ≈ 9, max ≈ 33   I.e., as expected, we see ≈ μ ± 3 σ in 200 samples

What we get: highly non-uniform coverage The bad news : random fragments are not so uniform. E.g., assuming uniform, the 8 peaks above 100 are > +10 σ above mean ~ Count reads starting at Uniform each position, not those 50 25 covering each position 0 Actual ––––––––––– 3’ exon ––––––––– 200 nucleotides Mortazavi data

What we get: highly non-uniform coverage The bad news : random fragments are not so uniform. E.g., assuming uniform, the 8 peaks above 100 are > +10 σ above mean ~ Count reads starting at Uniform each position, not those 50 25 covering each position 0 Actual How to make it more uniform? A: Math tricks like averaging/smoothing (e.g. “coverage”) or transformations (“log”), …, or WE DO   B: Try to model (aspects of) causation   ––––––––––– 3’ exon ––––––––– THIS (& use increased uniformity of result as a measure of success) 200 nucleotides Mortazavi data

What we get: highly non-uniform coverage The Good News : we can (partially) correct the bias The bad news : random fragments are not so uniform. Uniform 50 25 0 Actual not perfect, but better: 38% reduction in LLR of uniform model; hugely more likely 200 nucleotides

(in part) Bias is ^ sequence-dependent Reads and platform/sample-dependent Fitting a model of the sequence surrounding read starts lets us predict which positions have more reads.

what causes bias? No one knows in any great detail Speculations: all steps in the complex protocol may contribute E.g., primers in PCR-like amplification steps may have unequal affinities (“random hexamers”, e.g.) ligase enzyme sequence preferences potential RNA structures fragmentation biases mapping biases 10

some prior work Hansen, et al. 2010 “7-mer” method - directly count foreground/ background 7-mers at read starts, correct by ratio   2 * (4 7 -1) = 32766 free parameters Li, et al. 2010 GLM - generalized linear model } training requires gene MART - multiple additive regression trees annotations 11

d sample foreground sequences o ( a ) h e t n e i M l t sample (local) background sequences ( b ) u O train Bayesian network ( c ) I.e., learn sequence patterns associated w/ high / low read counts. ( d ) ( e )

    defining bias Data is Un biased if read is independent of sequence: Pr( read at i ) = Pr( read at i | sequence at i ) From Bayes rule: Pr( seq at i | read at i ) Pr( read at i | seq at i ) = Pr( read at i ) Pr( seq at i) We define “bias” to be this factor 13

Modeling Sequence Bias Want a probability distribution over k-mers, k ≈ 40? Some obvious choices: Full joint distribution: 4 k -1 parameters PWM (0-th order Markov): (4-1)•k parameters Something intermediate: Directed Bayes network 14

Form of the models: Directed Bayes nets One “node” per nucleotide,   ±20 bp of read start • Filled node means that position is biased • Arrow i → j means letter at position i modifies bias at j • For both, numeric parameters say how much How–optimize: n n Pr [ s i | x i ] Pr [ x i ] � � ℓ = logPr [ x i | s i ]= log � x ∈ { 0 , 1 } Pr [ s i | x ] Pr [ x ] i = 1 i = 1

NB: •Not just initial hexamer •Span ≥ 19 •All include Illumina ABI negative positions •All different, even on same platform

Result – Increased Uniformity Kullback-Leibler Divergence Jones Li et al Hansen et al Trapnell Data

Result – Increased Uniformity R 2 Fractional improvement in log-likelihood under hypothesis test:   uniform model across * = p-value < 10 -23 “Is BN better than X?” 1000 exons (R 2 =1-L’/L) (1-sided Wilcoxon signed-rank test)

some questions What is the chance that we will learn an incorrect model? E.g., learn a biased model from unbiased input? How does the amount of   training data effect accuracy   of the resulting model? 19

Bias in RNA sequencing and what to do about it Walter L. (Larry) - PowerPoint PPT Presentation

Bias in RNA sequencing and what to do about it Walter L. (Larry) Ruzzo Computer Science and Engineering Genome Sciences University of Washington Fred Hutchinson Cancer Research Center Seattle, WA, USA ruzzo@uw.edu RNAseq

RNA-seq basics: From reads to differential expression COMBINE RNA-seq Workshop RNA sequencing

RNA World Hypothesis and RNA folding By Lixin Dai October 16, 2002 Outline: RNA World

Variable selection bias Bias in Ensemble Bias in Ensemble Methods Methods Variable selection

BIAS What Is Bias? Bias can be defined as favoring one side, position, or belief being

Sequencing technology and assembly Sanger sequencing Sanger sequencing with radioactivity

Prediction of RNA-RNA Interaction slides by Mathias M ohl and Rolf Backofen ohl M.M c

BIAS BIAS LIGHT LIGHT & & MEDIUM MEDIUM TR TRUCK UCK TIRES TIRES Bias Bias Ligh

Genomics Sequencing tech Sequencing tech: next generation What do we get from sequencing? How

RNA sequencing with the MinION at Genoscope Jean-Marc Aury jmaury@genoscope.cns.fr @J_M_Aury

Single-cell RNA-sequencing Ximena Ibarra-Soria CRUK Cambridge Institute RNA-Sequence Analysis

DNA AND RNA ATI TEAS SCIENCE DNA & RNA Questions related to DNA and RNA cover topics

Introduction to RNA-Seq Mary Piper Bioinformatics Consultant and Trainer DataCamp RNA-Seq

Prediction of RNA-RNA-Interaction 20 1 15 1 5 10 20 5 10 20 15 10 1 15 5 1 20 10

PROTEIN SYNTHESIS RNA (ribonucleic acid) 3 types RNA DIFFERENCES 1. messenger RNA (mRNA) DNA

PROTEIN SYNTHESIS RNA (ribonucleic acid) 3 types RNA DIFFERENCES 1. messenger RNA (mRNA)

Expectancy bias and Bias and forensic evidence Bias and speech research forensic speech

organization prepared by Jenny Bryan for Reproducible Science Workshop A place for everything,

Targeting multiple mobile platforms with Qt Creator Aurindam Jana Aurindam Jana IRC: auri__ :

Quantifying gene expression Genome Sequence reads GTF (annotation)? FASTQ (+reference

RNA-seq: Analysis options Genome? Biological samples/Library preparation Transcriptome

Sector New Y Sector Ne w Yor ork Risk Insight Risk Insight Define maritime risks

Source-to-Source Compilation in Racket You Want it in Which Language? University of Bergen

Qt and Tizen together can do more Tomasz Olszak Qt, Tizen and Open Source enthusiast Why Qt and

The network-untangling problem: From interactions to activity timelines Polina Rozenshtein

Sambuz

Useful Links

Newsletter

Mail Us

Bias in RNA sequencing and what to do about it Walter L. (Larry) - PowerPoint PPT Presentation

Bias in RNA sequencing and what to do about it Walter L. (Larry) Ruzzo Computer Science and Engineering Genome Sciences University of Washington Fred Hutchinson Cancer Research Center Seattle, WA, USA ruzzo@uw.edu RNAseq

RNA-seq basics: From reads to differential expression COMBINE RNA-seq Workshop RNA sequencing

RNA World Hypothesis and RNA folding By Lixin Dai October 16, 2002 Outline: RNA World

Variable selection bias Bias in Ensemble Bias in Ensemble Methods Methods Variable selection

BIAS What Is Bias? Bias can be defined as favoring one side, position, or belief being

Sequencing technology and assembly Sanger sequencing Sanger sequencing with radioactivity

Prediction of RNA-RNA Interaction slides by Mathias M ohl and Rolf Backofen ohl M.M c

BIAS BIAS LIGHT LIGHT &amp; &amp; MEDIUM MEDIUM TR TRUCK UCK TIRES TIRES Bias Bias Ligh

Genomics Sequencing tech Sequencing tech: next generation What do we get from sequencing? How

RNA sequencing with the MinION at Genoscope Jean-Marc Aury jmaury@genoscope.cns.fr @J_M_Aury

Single-cell RNA-sequencing Ximena Ibarra-Soria CRUK Cambridge Institute RNA-Sequence Analysis

DNA AND RNA ATI TEAS SCIENCE DNA &amp; RNA Questions related to DNA and RNA cover topics

Introduction to RNA-Seq Mary Piper Bioinformatics Consultant and Trainer DataCamp RNA-Seq

Prediction of RNA-RNA-Interaction 20 1 15 1 5 10 20 5 10 20 15 10 1 15 5 1 20 10

PROTEIN SYNTHESIS RNA (ribonucleic acid) 3 types RNA DIFFERENCES 1. messenger RNA (mRNA) DNA

PROTEIN SYNTHESIS RNA (ribonucleic acid) 3 types RNA DIFFERENCES 1. messenger RNA (mRNA)

Expectancy bias and Bias and forensic evidence Bias and speech research forensic speech

organization prepared by Jenny Bryan for Reproducible Science Workshop A place for everything,

Targeting multiple mobile platforms with Qt Creator Aurindam Jana Aurindam Jana IRC: auri__ :

Quantifying gene expression Genome Sequence reads GTF (annotation)? FASTQ (+reference

RNA-seq: Analysis options Genome? Biological samples/Library preparation Transcriptome

Sector New Y Sector Ne w Yor ork Risk Insight Risk Insight Define maritime risks

Source-to-Source Compilation in Racket You Want it in Which Language? University of Bergen

Qt and Tizen together can do more Tomasz Olszak Qt, Tizen and Open Source enthusiast Why Qt and

The network-untangling problem: From interactions to activity timelines Polina Rozenshtein

Sambuz

Useful Links

Newsletter

Mail Us

BIAS BIAS LIGHT LIGHT & & MEDIUM MEDIUM TR TRUCK UCK TIRES TIRES Bias Bias Ligh

DNA AND RNA ATI TEAS SCIENCE DNA & RNA Questions related to DNA and RNA cover topics