
SLIDE 1

Expression quantification II

Helene Kretzmer 15.05.2012

SLIDE 2

Pipeline

(i) RNA isolation from sample

(ii) reverse transcription of RNA to cDNA and fragmentation

(iii) sequencing

(iv) mapping reads to the reference genome

(v) using read counts for expression level estimation

Mapping Problems

unknown isoforms

sequencing non-uniformity

read mapping uncertainty

IZBI Introduction 2

SLIDE 3

Read Mapping Uncertainty

paralogous genes

low-complexity regions

high sequence similarity

reference sequence errors

sequencing errors

⇒ multireads

gene multireads

isoform multireads

SLIDE 4

Mapping Strategies

(a) discard multireads

(b) rescue multireads

(c) EM: a statistical model

SLIDE 5

Measures of Expression - isoform i

$\tau_i$ .. fraction of transcripts: percentage of isoform i of all transcripts in the sample

$\nu_i$ .. fraction of nucleotides: percentage of isoform i of all nucleotides in the sample

$\ell_i$ .. length of isoform i in nucleotides

$\tau_i = \mathrm{RPKM}_i \cdot 10^{-9} \cdot \sum_j \tau_j \ell_j$

IZBI EM Model 5

SLIDE 6

Measures of Expression - isoform i

$\tau_i$ .. fraction of transcripts

$\tau_i = \frac{\nu_i}{\ell_i} \left( \sum_j \frac{\nu_j}{\ell_j} \right)^{-1}$

$\nu_i$ .. fraction of nucleotides

$\nu_i = \frac{\tau_i \ell_i}{\sum_j \tau_j \ell_j}$

$\ell_i$ .. length of isoform i in nucleotides

$\tau_i = \mathrm{RPKM}_i \cdot 10^{-9} \cdot \sum_j \tau_j \ell_j$
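The conversions between the three measures can be sketched in a few lines. This is a minimal illustration, not code from the lecture: the function names are hypothetical, and the RPKM helper uses the rearrangement RPKM_i = 10^9 · ν_i / ℓ_i, which follows from combining the formulas for τ_i and ν_i on this slide.

```python
def nu_to_tau(nu, lengths):
    """Transcript fractions tau_i from nucleotide fractions nu_i."""
    per_length = [n / l for n, l in zip(nu, lengths)]   # nu_i / l_i
    total = sum(per_length)                             # sum_j nu_j / l_j
    return [p / total for p in per_length]


def tau_to_nu(tau, lengths):
    """Nucleotide fractions nu_i from transcript fractions tau_i."""
    weighted = [t * l for t, l in zip(tau, lengths)]    # tau_i * l_i
    total = sum(weighted)                               # sum_j tau_j * l_j
    return [w / total for w in weighted]


def rpkm_from_nu(nu, lengths):
    """RPKM_i = 1e9 * nu_i / l_i (rearranged from tau_i = RPKM_i * 1e-9 * sum_j tau_j l_j)."""
    return [1e9 * n / l for n, l in zip(nu, lengths)]
```

For two equally abundant isoforms of length 1000 and 2000, `tau_to_nu([0.5, 0.5], [1000, 2000])` gives roughly one third and two thirds: the longer isoform contributes more nucleotides although both contribute the same number of transcripts.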

SLIDE 7

EM-Model

Generative Model

N reads, all of length L

Assumptions

M isoforms

isoform sequences are known

additional noise isoform

reads are uniformly distributed: $\frac{\#\,\text{reads of isoform } i}{N} \longrightarrow \nu_i$

SLIDE 20

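$R_n$ .. sequence of read n

$G_n$ .. isoform of read n

$S_n$ .. start position of read n

$O_n$ .. orientation (strand) of read n

$\theta = [\theta_0, \ldots, \theta_M]$ .. expression levels of the isoforms 0, ..., M

[Figure: graphical model over $\theta$, $G_n$, $S_n$, $O_n$, $R_n$, with the read variables repeated for n = 1, ..., N]

$P(g, s, o, r \mid \theta) = \prod_{n=1}^{N} P(g_n \mid \theta)\, P(s_n \mid g_n)\, P(o_n \mid g_n)\, P(r_n \mid g_n, s_n, o_n)$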
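Reading the factorization from left to right gives a recipe for simulating one read: draw the isoform from θ, then a start position and an orientation given the isoform, then the read sequence given all three. The sketch below is an illustration under assumptions that are not on the slides: the function name `sample_read` is hypothetical, start positions are restricted so that the read fits inside the isoform, sequencing errors are uniform substitutions with a single `error_rate`, and the noise isoform emits uniformly random bases.

```python
import random

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def sample_read(theta, isoforms, read_len, strand_specific=True,
                error_rate=0.0, rng=random):
    """Draw one (g, s, o, r). theta[0] belongs to the noise isoform;
    isoforms[i - 1] is the sequence of isoform i (i = 1, ..., M)."""
    # P(g | theta): pick the isoform according to the expression levels
    g = rng.choices(range(len(theta)), weights=theta)[0]
    if g == 0:
        # noise isoform: uniform background over A, C, G, T (assumption)
        r = "".join(rng.choice("ACGT") for _ in range(read_len))
        return g, None, None, r
    seq = isoforms[g - 1]
    # P(s | g): uniform 1-based start; restricted so the read fits (assumption)
    s = rng.randint(1, len(seq) - read_len + 1)
    # P(o | g): always 0 for strand-specific data, otherwise 0 or 1 with prob. 0.5
    o = 0 if strand_specific else rng.randint(0, 1)
    fragment = seq[s - 1:s - 1 + read_len]
    if o == 1:
        fragment = "".join(COMPLEMENT[b] for b in reversed(fragment))
    # P(r | g, s, o): copy the fragment, substituting a base with prob. error_rate
    r = "".join(
        rng.choice([x for x in "ACGT" if x != b]) if rng.random() < error_rate else b
        for b in fragment
    )
    return g, s, o, r
```

Passing a seeded `random.Random` instance as `rng` makes the draws reproducible.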

SLIDE 21

Summary

$P(G_n = i \mid \theta)$ .. probability that read n maps to isoform i given the expression levels $\theta_0, \ldots, \theta_M$

$P(O_n = 0 \mid G_n \neq 0)$ .. probability that read n has the same orientation as its template given that it is not from the noise isoform

$P(S_n = j \mid G_n = i)$ .. probability that read n starts at position j given that it is from isoform i

$P(R_n = \rho \mid G_n = i, S_n = j, O_n = 0)$ .. probability that read n has sequence ρ given that it is from isoform i, starts at position j and has the same orientation as its template

SLIDE 22

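Isoform $G_n$

$P(g, s, o, r \mid \theta) = \prod_{n=1}^{N} P(g_n \mid \theta)\, P(o_n \mid g_n)\, P(s_n \mid g_n)\, P(r_n \mid g_n, s_n, o_n)$

$P(G_n = i \mid \theta)$

$G_n \in \{0, \ldots, M\}$: 0 is the noise isoform; 1, ..., M are the known isoforms

$P(G_n = i \mid \theta) = \theta_i$ and $\sum_i \theta_i = 1$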

SLIDE 23

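Orientation $O_n$

$P(g, s, o, r \mid \theta) = \prod_{n=1}^{N} P(g_n \mid \theta)\, P(o_n \mid g_n)\, P(s_n \mid g_n)\, P(r_n \mid g_n, s_n, o_n)$

$P(O_n = 0 \mid G_n \neq 0)$

$O_n = \begin{cases} 1, & \text{reverse complement} \\ 0, & \text{same orientation as its template} \end{cases}$

$P(O_n = 0 \mid G_n \neq 0) = \begin{cases} 1, & \text{strand-specific sequencing} \\ 0.5, & \text{non-strand-specific sequencing} \end{cases}$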

SLIDE 24

Start position $S_n$

$P(g, s, o, r \mid \theta) = \prod_{n=1}^{N} P(g_n \mid \theta)\, P(o_n \mid g_n)\, P(s_n \mid g_n)\, P(r_n \mid g_n, s_n, o_n)$

$P(S_n = j \mid G_n = i)$

$S_n \in [1, \ldots, \max_i \ell_i]$, $\ell_i$ .. length of isoform i

$P(S_n = j \mid G_n = i) = \begin{cases} \frac{1}{\ell_i}, & \text{uniform read start distribution} \\ f\!\left(\frac{j}{\ell_i}\right) - f\!\left(\frac{j-1}{\ell_i}\right), & \text{non-uniform read start distribution} \end{cases}$

f .. empirical cumulative distribution function over [0, 1]

[Figure: probability density function (y-axis) vs. fractional position along transcript (x-axis)]
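The two cases for P(S_n = j | G_n = i) can be written as one small function. This is a sketch only: the name `start_probability` and the example function `lambda x: x ** 2` are hypothetical stand-ins, since the empirical f from the lecture is not available here.

```python
def start_probability(j, length, cdf=None):
    """P(S_n = j | G_n = i) for a 1-based start j on an isoform of the given length.

    cdf is a cumulative distribution function f on [0, 1] with f(0) = 0 and
    f(1) = 1; with cdf=None the uniform read start distribution is used.
    """
    if cdf is None:
        return 1.0 / length
    return cdf(j / length) - cdf((j - 1) / length)


# Hypothetical non-uniform f that puts more read starts towards the transcript end.
probs = [start_probability(j, 10, cdf=lambda x: x ** 2) for j in range(1, 11)]
```

Because the differences telescope to f(1) − f(0), the probabilities sum to 1 for any valid f, and choosing f(x) = x recovers the uniform case.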

SLIDE 26

Sequence $R_n$

$P(g, s, o, r \mid \theta) = \prod_{n=1}^{N} P(g_n \mid \theta)\, P(o_n \mid g_n)\, P(s_n \mid g_n)\, P(r_n \mid g_n, s_n, o_n)$

$P(R_n = \rho \mid G_n = i, S_n = j, O_n = k)$

strand-specific protocol, known isoforms:

$P(R_n = \rho \mid G_n = i, S_n = j, O_n = 0) = \prod_{t=1}^{L} \omega_t(\rho_t, \gamma^i_{j+t-1})$

$\omega_t(a, b) = P(\text{read}[t] = a \mid \text{isoform}[j+t-1] = b)$

$\gamma^i$ .. sequence of isoform i

Alignment of read and isoform: read C G A T, isoform A T C C G A A T C G

$P(R_n = \rho \mid G_n = i, S_n = j, O_n = 0) = \omega_1(C, C)\, \omega_2(G, G)\, \omega_3(A, A)\, \omega_4(T, A)$
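The alignment example can be reproduced numerically. The product ω1(C, C) ω2(G, G) ω3(A, A) ω4(T, A) corresponds to the read C G A T placed at isoform positions 4 to 7 (1-based) of A T C C G A A T C G. The sketch below assumes a position-independent substitution model with a single `error_rate`, which is a simplification: on the slide ω_t may differ for each read position t. The function names are hypothetical.

```python
def omega(a, b, error_rate=0.01):
    """P(read base a | isoform base b): the correct base with prob. 1 - error_rate,
    each of the three other bases with prob. error_rate / 3 (assumed model)."""
    return 1.0 - error_rate if a == b else error_rate / 3.0


def read_likelihood(read, isoform, j, error_rate=0.01):
    """P(R_n = read | G_n = i, S_n = j, O_n = 0) for a 1-based start position j."""
    p = 1.0
    for t, base in enumerate(read, start=1):
        # isoform position j + t - 1 (1-based) is index j + t - 2 in Python
        p *= omega(base, isoform[j + t - 2], error_rate)
    return p


# Three matches and one mismatch (T read where the isoform has A).
p_example = read_likelihood("CGAT", "ATCCGAATCG", j=4)
```

With `error_rate=0.01` the value is 0.99 · 0.99 · 0.99 · (0.01 / 3), so a single mismatch lowers the likelihood by more than two orders of magnitude relative to a perfect match.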

SLIDE 27

Sequence $R_n$

$P(g, s, o, r \mid \theta) = \prod_{n=1}^{N} P(g_n \mid \theta)\, P(o_n \mid g_n)\, P(s_n \mid g_n)\, P(r_n \mid g_n, s_n, o_n)$

$P(R_n = \rho \mid G_n = i, S_n = j, O_n = k)$

strand-specific protocol, known isoforms:

$P(R_n = \rho \mid G_n = i, S_n = j, O_n = 0) = \prod_{t=1}^{L} \omega_t(\rho_t, \gamma^i_{j+t-1})$

$\omega_t(a, b) = P(\text{read}[t] = a \mid \text{isoform}[j+t-1] = b)$

$\gamma^i$ .. sequence of isoform i

strand-specific protocol, noise isoform 0:

$P(R_n = \rho \mid G_n = 0, S_n = j, O_n = 0) = \prod_{t=1}^{L} \beta(\rho_t)$

$\beta$ .. background distribution

SLIDE 29

Estimation of Expression Levels

Given: N reads of length L and M known isoforms

Assumption: reads are uniformly sampled from the transcriptome

EM Algorithm: find $\theta = [\theta_0, \ldots, \theta_M]$ that maximizes $P(r \mid \theta)$

$P(r \mid \theta) = \prod_{n=1}^{N} \sum_{i=0}^{M} \theta_i \frac{1}{\ell_i} \sum_j P(r_n \mid g_n = i, s_n = j)$

$\nu_i \approx \frac{\theta_i}{1 - \theta_0}$

EM algorithm: iterative optimization of θ

latent variables: $G_n$, $S_n$, $O_n$

E-step: $E[\mathbb{1}\{G_n = i, S_n = j, O_n = k\}] = P(G_n = i, S_n = j, O_n = k \mid r, \theta^{t})$

M-step: $\theta^{t+1} = \arg\max_{\theta} E[\log P(g, s, o, r \mid \theta) \mid r, \theta^{t}]$
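A compact sketch of the iteration, under simplifying assumptions that are not on the slides: the per-read, per-isoform terms (1/ℓ_i) Σ_j P(r_n | g_n = i, s_n = j) are taken as a precomputed matrix `lik`, start position and orientation are already summed out, and only θ is updated. For mixture weights such as θ the M-step has the standard closed form "expected number of reads assigned to isoform i, divided by the number of reads". The function names are hypothetical.

```python
def em_theta(lik, n_iter=200, tol=1e-10):
    """Estimate theta from lik[n][i] = (1 / l_i) * sum_j P(r_n | g_n = i, s_n = j).

    Column 0 belongs to the noise isoform. Assumes at least one read has a
    non-zero likelihood for some isoform.
    """
    n_iso = len(lik[0])
    theta = [1.0 / n_iso] * n_iso          # start from equal expression levels
    for _ in range(n_iter):
        counts = [0.0] * n_iso
        for row in lik:
            # E-step: posterior probability that this read comes from each isoform
            weights = [t * p for t, p in zip(theta, row)]
            total = sum(weights)
            if total == 0.0:
                continue                   # read fits no isoform; skipped in this sketch
            for i, w in enumerate(weights):
                counts[i] += w / total
        # M-step: normalized expected read counts
        used = sum(counts)
        new_theta = [c / used for c in counts]
        converged = max(abs(a - b) for a, b in zip(theta, new_theta)) < tol
        theta = new_theta
        if converged:
            break
    return theta


def nu_from_theta(theta):
    """nu_i ~ theta_i / (1 - theta_0) for the known isoforms i = 1, ..., M."""
    return [t / (1.0 - theta[0]) for t in theta[1:]]
```

As a toy check, take two reads that fit only isoform 1, one read that fits only isoform 2, and one multiread that fits both equally well. The estimate converges to θ_1 = 2/3 and θ_2 = 1/3: the multiread is split in proportion to the current expression estimates instead of being discarded.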

SLIDE 31

[Figure: scatter plots, panels (a) and (b)]

Gene expression estimates (y-axis) vs. sample values (x-axis) for the simulated mouse (a) and maize (b) RNA-Seq data sets. Comparisons are given for ν.

IZBI Results 14

SLIDE 32

IZBI Refinements 15