Pair HMM based gap statistics for re-evaluation of indels in - - PowerPoint PPT Presentation

pair hmm based gap statistics
SMART_READER_LITE
LIVE PREVIEW

Pair HMM based gap statistics for re-evaluation of indels in - - PowerPoint PPT Presentation

Guideline Introduction Methods Results Pair HMM based gap statistics for re-evaluation of indels in alignments with affine gap penalties Alexander Schnhuth, Raheleh Salari Cenk Sahinalp Centrum Wiskunde & Informatica Amsterdam


slide-1
SLIDE 1

Guideline Introduction Methods Results

Pair HMM based gap statistics

for re-evaluation of indels in alignments with affine gap penalties Alexander Schönhuth, Raheleh Salari Cenk Sahinalp

Centrum Wiskunde & Informatica Amsterdam Computational Biology Lab School of Computing Science Simon Fraser University

September 8, 2010

slide-2
SLIDE 2

Guideline Introduction Methods Results

Guideline

Introduction Motivation Affine Gap Cost Alignments Methods Indel Statistics: Problem Formulation Solution: Pair HMMs Results Data and Parameters Evaluation Strategies

slide-3
SLIDE 3

Guideline Introduction Methods Results

Motivation

  • Inaccuracies in sequence alignments abundant
  • Detrimental effects in downstream analyses
slide-4
SLIDE 4

Guideline Introduction Methods Results

Motivation

  • Inaccuracies in sequence alignments abundant
  • Detrimental effects in downstream analyses
  • [Lunter et al. (2007)]: Accuracy 96% far away from gaps, 56%

surrounding gaps

slide-5
SLIDE 5

Guideline Introduction Methods Results

Motivation

  • Inaccuracies in sequence alignments abundant
  • Detrimental effects in downstream analyses
  • [Lunter et al. (2007)]: Accuracy 96% far away from gaps, 56%

surrounding gaps

  • [Lunter et al. (2007)]: Gap attraction and gap annihilation
  • [Loeytynoja et al. (2005), Polyanovsky et al. (2008)]: Downward

bias in the number of computationally inferred indels

slide-6
SLIDE 6

Guideline Introduction Methods Results

Motivation

  • Inaccuracies in sequence alignments abundant
  • Detrimental effects in downstream analyses
  • [Lunter et al. (2007)]: Accuracy 96% far away from gaps, 56%

surrounding gaps

  • [Lunter et al. (2007)]: Gap attraction and gap annihilation
  • [Loeytynoja et al. (2005), Polyanovsky et al. (2008)]: Downward

bias in the number of computationally inferred indels

  • Numbers and sizes of indels may serve as indicator of indel

quality

slide-7
SLIDE 7

Guideline Introduction Methods Results

Indel Facts

Evolutionary processes behind insertions and deletions are not comprehensively understood. Classical alignment procedures with affine gap penalties:

  • Geometric distribution.
slide-8
SLIDE 8

Guideline Introduction Methods Results

Indel Facts

Evolutionary processes behind insertions and deletions are not comprehensively understood. Classical alignment procedures with affine gap penalties:

  • Geometric distribution.

Truly evolutionary distributions of indel length:

  • Mixtures of exponentials [Qian

and Goldstein, 2003], [Pang et. al, 2005]

  • Zipfian distribution [Chang and

Benner, 2004]

slide-9
SLIDE 9

Guideline Introduction Methods Results

Indel Facts

Evolutionary processes behind insertions and deletions are not comprehensively understood. Classical alignment procedures with affine gap penalties:

  • Geometric distribution.

Truly evolutionary distributions of indel length:

  • Mixtures of exponentials [Qian

and Goldstein, 2003], [Pang et. al, 2005]

  • Zipfian distribution [Chang and

Benner, 2004] Moreover, from small-scale studies:

  • Indels often occur in the proteins’

loop regions and cause significant structural changes [Fechteler et al., 1995]

  • Indels occur in disease-causing

mutational hot spots [Kondrashov et al., 2004]

  • Thanks to the structural changes:

novel approaches to antibacterial drug design [Cherkasov et al. 2005, 2006], [Nandan et al. 2007]

slide-10
SLIDE 10

Guideline Introduction Methods Results

Affine Gap Cost Alignments

Smith-Waterman and Needleman-Wunsch

  • Computationally efficient and still most popular.
  • SW and NW alignments are Viterbi paths in pair HMMs.
slide-11
SLIDE 11

Guideline Introduction Methods Results

Affine Gap Cost Alignments

Smith-Waterman and Needleman-Wunsch

  • Computationally efficient and still most popular.
  • SW and NW alignments are Viterbi paths in pair HMMs.
  • Blast employs affine gap penalty scoring schemes.
  • ☞ Many more advanced approaches to sequence alignment are

based on pair HMMs.

slide-12
SLIDE 12

Guideline Introduction Methods Results

Methods: Problem Formulation

slide-13
SLIDE 13

Guideline Introduction Methods Results

Indel Statistics

Problem Definition

  • Let T be a pool of protein pairs of interest and

LA(x, y), SimA(x, y) and Id,A(x, y) be the length, the similarity and the length of the d-th longest indel of the alignment of (x, y) ∈ T, as computed by the affine gap cost algorithm A.

slide-14
SLIDE 14

Guideline Introduction Methods Results

Indel Statistics

Problem Definition

  • Let T be a pool of protein pairs of interest and

LA(x, y), SimA(x, y) and Id,A(x, y) be the length, the similarity and the length of the d-th longest indel of the alignment of (x, y) ∈ T, as computed by the affine gap cost algorithm A.

  • We are interested in computation of

P( Id,A(x, y) ≥ k | LA(x, y) = n, SimA(x, y) ∈ [σ1, σ2] ) where (x, y) has been sampled from the pool T.

slide-15
SLIDE 15

Guideline Introduction Methods Results

Indel Statistics

Problem Definition

Multiple Indel Length Probability Problem Input: A sequence pair x, y, a pool of sequence pairs T s. t. (x, y) ∈ T, integers d, k. Output: An estimate of P( Id,A(x, y) ≥ k | LA(x, y) = n, SimA(x, y) ∈ [σ1, σ2] )

Remark

  • Replacing Id,A(x, y) by SA(x, y), the score of the A- alignment of x and

y is the classical problem of score statistics.

  • Approximate solution (exact for ungapped local alignments):

Altschul-Dembo-Karlin statistics [Karlin and Altschul, 1990], [Dembo and Karlin, 1991].

slide-16
SLIDE 16

Guideline Introduction Methods Results

Motivation

Systematically answer questions like: “Are 4 gaps of size at least 6 in an alignment of length 200 and similarity 50 likely to reflect true indels?”

slide-17
SLIDE 17

Guideline Introduction Methods Results

Solution: Pair HMMs

slide-18
SLIDE 18

Guideline Introduction Methods Results

Solution Strategy: Pair HMMs

slide-19
SLIDE 19

Guideline Introduction Methods Results

Solution Strategy: Pair HMMs

Idea: Approximate Viterbi path statistics by generative statistics of a Markov chain.

slide-20
SLIDE 20

Guideline Introduction Methods Results

Solution Strategy: Pair HMMs

Markov Chain Training

slide-21
SLIDE 21

Guideline Introduction Methods Results

Solution Strategy: Pair HMMs

Markov Chain Training Cd,k ⊂ {B, M1, M2, M3, IN, E}∗ : at least d IN-stretches of length at least k An ⊂ {B, M1, M2, M3, IN, E}∗ : alignment region of length n

slide-22
SLIDE 22

Guideline Introduction Methods Results

Solution Strategy: Pair HMMs

Markov Chain Training Cd,k ⊂ {B, M1, M2, M3, IN, E}∗ : at least d IN-stretches of length at least k An ⊂ {B, M1, M2, M3, IN, E}∗ : alignment region of length n Example: B M1 M1 IN IN IN M2 IN IN IN M3 E ∈ C2,3 ∩ A10 B M1 IN IN IN IN M2 IN M2 IN M3M3 E ∈ C3,1 ∩ C1,4 ∩ A11

slide-23
SLIDE 23

Guideline Introduction Methods Results

Solution Strategy: Pair HMMs

Markov Chain Training 1: Align all sequence pairs from T. 2: Infer q1, q2, q3, q4, q5, q6 by “Viterbi

training” the Markov chain with the alignments of Sim ∈ [σ1, σ2]. ☞ Markov chain Mσ1,σ2.

Cd,k ⊂ {B, M1, M2, M3, IN, E}∗ : at least d IN-stretches of length at least k An ⊂ {B, M1, M2, M3, IN, E}∗ : alignment region of length n Example: B M1 M1 IN IN IN M2 IN IN IN M3 E ∈ C2,3 ∩ A10 B M1 IN IN IN IN M2 IN M2 IN M3M3 E ∈ C3,1 ∩ C1,4 ∩ A11

slide-24
SLIDE 24

Guideline Introduction Methods Results

Solution Strategy: Pair HMMs

Computation of Probabilities Markov chain Mσ1,σ2

M = Mσ1,σ2

1: n, d, k as from the alignment of x and y 2: Compute PM(Cd,k ∩ An) as well as PM(An) 3: Output

PM(Cd,k | An) = PM(Cd,k ∩ An) PM(An) for P(Id ≥ k | L = n, Sim ∈ [σ1, σ2]).

Cd,k ⊂ {B, M1, M2, M3, IN, E}∗ : at least d IN-stretches of length at least k An ⊂ {B, M1, M2, M3, IN, E}∗ : alignment region of length n Example: B M1 M1 IN IN IN M2 IN IN IN M3 E ∈ C2,3 ∩ A10 B M1 IN IN IN IN M2 IN M2 IN M3M3 E ∈ C3,1 ∩ C1,4 ∩ A11

slide-25
SLIDE 25

Guideline Introduction Methods Results

Computation of P(Ck,d ∩ An): Naive Approach

Let Bn,t,k be sequences of the type (Z ∈ {M1, M2, M3, IN}): B

0 Z 1 ... Z t−1

IN

t ...

IN

t+k−1

Z

t+k ... Z n

E

n+1.

slide-26
SLIDE 26

Guideline Introduction Methods Results

Computation of P(Ck,d ∩ An): Naive Approach

Let Bn,t,k be sequences of the type (Z ∈ {M1, M2, M3, IN}): B

0 Z 1 ... Z t−1

IN

t ...

IN

t+k−1

Z

t+k ... Z n

E

n+1.

For example (d = 1), it holds that C1,k ∩ An =

n−k+1

[

t=1

Bn,t,k

slide-27
SLIDE 27

Guideline Introduction Methods Results

Computation of P(Ck,d ∩ An): Naive Approach

Let Bn,t,k be sequences of the type (Z ∈ {M1, M2, M3, IN}): B

0 Z 1 ... Z t−1

IN

t ...

IN

t+k−1

Z

t+k ... Z n

E

n+1.

For example (d = 1), it holds that C1,k ∩ An =

n−k+1

[

t=1

Bn,t,k However, proceeding by inclusion-exlusion P(C1,k ∩ An) =

n−k+1

X

m=1

(−1)m+1 X

1≤t1<...<tm≤n−k+1

P(Bn,t1,k ∩ ... ∩ Bn,tm,k) results in computation of

n−k+1

X

m=1

“n − k + 1 m ” = 2n−k+1 − 1 terms of the type P(Bn,t1,k ∩ ... ∩ Bn,tm,k), hence computation of P(C1,k ∩ An) would be exponential in n.

slide-28
SLIDE 28

Guideline Introduction Methods Results

Efficient Computation of P(Cd,k ∩ An)

Let Dn,t,k be sequences of the type (M ∈ {M2, M3}, Z ∈ {M1, M2, M3, IN}) B

0 Z 1 ... Z t−1

IN

t ...

IN

t+k−1

M

t+k

Z

t+k+1 ... Z n

E

n+1

slide-29
SLIDE 29

Guideline Introduction Methods Results

Efficient Computation of P(Cd,k ∩ An)

Let Dn,t,k be sequences of the type (M ∈ {M2, M3}, Z ∈ {M1, M2, M3, IN}) B

0 Z 1 ... Z t−1

IN

t ...

IN

t+k−1

M

t+k

Z

t+k+1 ... Z n

E

n+1

It holds that Ck,d ∩ An = [

1≤t1<t2<...<td ≤n−k+1

(Dn,t1,k ∩ ... ∩ Dn,td,k) ∩ An.

slide-30
SLIDE 30

Guideline Introduction Methods Results

Efficient Computation of P(Cd,k ∩ An)

Let Dn,t,k be sequences of the type (M ∈ {M2, M3}, Z ∈ {M1, M2, M3, IN}) B

0 Z 1 ... Z t−1

IN

t ...

IN

t+k−1

M

t+k

Z

t+k+1 ... Z n

E

n+1

It holds that Ck,d ∩ An = [

1≤t1<t2<...<td ≤n−k+1

(Dn,t1,k ∩ ... ∩ Dn,td,k) ∩ An. Generalized inclusion-exclusion yields P(Cd,k ∩ An) =

n k+1 ⌋

X

m=d

Km,d · P(Dn,t1,k ∩ ... ∩ Dn,tm,k) · P(An | Dn,tm,k) where Km,d = (−1)m+d `m−1

d−1

´ .

slide-31
SLIDE 31

Guideline Introduction Methods Results

Efficient Computation of P(Cd,k ∩ An)

Let Dn,t,k be sequences of the type (M ∈ {M2, M3}, Z ∈ {M1, M2, M3, IN}) B

0 Z 1 ... Z t−1

IN

t ...

IN

t+k−1

M

t+k

Z

t+k+1 ... Z n

E

n+1

It holds that Ck,d ∩ An = [

1≤t1<t2<...<td ≤n−k+1

(Dn,t1,k ∩ ... ∩ Dn,td,k) ∩ An. Generalized inclusion-exclusion yields P(Cd,k ∩ An) =

n k+1 ⌋

X

m=d

Km,d · P(Dn,t1,k ∩ ... ∩ Dn,tm,k) · P(An | Dn,tm,k) where Km,d = (−1)m+d `m−1

d−1

´ . Theorem: The full table of values P(Cd,k ∩ An), k ≤ n ≤ N for given d can be computed in O(N3) runtime.

slide-32
SLIDE 32

Guideline Introduction Methods Results

Results

slide-33
SLIDE 33

Guideline Introduction Methods Results

Data

  • 1. “Superfamilies” (Sup) and “Twilight Zone” (Twi) protein datasets from SABmark

1.65 (including false positives) with structural alignment information

  • 2. Sup more benign, homologous pairs
  • 3. Twi worst case scenario: alignments of 0-25% sequence identity
slide-34
SLIDE 34

Guideline Introduction Methods Results

Data

  • 1. “Superfamilies” (Sup) and “Twilight Zone” (Twi) protein datasets from SABmark

1.65 (including false positives) with structural alignment information

  • 2. Sup more benign, homologous pairs
  • 3. Twi worst case scenario: alignments of 0-25% sequence identity
  • 4. Computation of all Smith-Waterman (SW) and Needleman-Wunsch (NW)

alignments (both with affine gap penalties)

  • 5. Four groups of alignments: NWTwi, NWSup, SWTwi, SWSup
slide-35
SLIDE 35

Guideline Introduction Methods Results

Data

  • 1. “Superfamilies” (Sup) and “Twilight Zone” (Twi) protein datasets from SABmark

1.65 (including false positives) with structural alignment information

  • 2. Sup more benign, homologous pairs
  • 3. Twi worst case scenario: alignments of 0-25% sequence identity
  • 4. Computation of all Smith-Waterman (SW) and Needleman-Wunsch (NW)

alignments (both with affine gap penalties)

  • 5. Four groups of alignments: NWTwi, NWSup, SWTwi, SWSup

Summary:

  • Number of gaps: NWTwi 122701, NWSup 276082, SWTwi 17776, SWSup 68513

(without false positives)

slide-36
SLIDE 36

Guideline Introduction Methods Results

Data

  • 1. “Superfamilies” (Sup) and “Twilight Zone” (Twi) protein datasets from SABmark

1.65 (including false positives) with structural alignment information

  • 2. Sup more benign, homologous pairs
  • 3. Twi worst case scenario: alignments of 0-25% sequence identity
  • 4. Computation of all Smith-Waterman (SW) and Needleman-Wunsch (NW)

alignments (both with affine gap penalties)

  • 5. Four groups of alignments: NWTwi, NWSup, SWTwi, SWSup

Summary:

  • Number of gaps: NWTwi 122701, NWSup 276082, SWTwi 17776, SWSup 68513

(without false positives)

  • Gap positions therein: NWTwi 179018, NWSup 407629, NWSup 20853, SWSup

86233

slide-37
SLIDE 37

Guideline Introduction Methods Results

Markov Chain Parameters

  • 1. Pools of alignments of similarity

[σ, σ + 10], 20 ≤ σ ≤ 100.

  • 2. Training of 36 = 4 × 9 Markov

chains (with false positives)

  • 3. Efficient computation of probability

tables P(Ck,d | An), k ≤ n ≤ 1500 d = 1, 2, 3, 4, 5, 6, 7 Smith-Waterman Twilight Zone (SWTwi) Similarity (σ) 30 - 40 40 - 50 50 - 60 60 - 70 70 - 80 80 - 90 90 - 100

  • No. Alignments

27 1896 12512 9716 3956 1733 259 q1 0.9552 0.9606 0.9564 0.9485 0.9364 0.9188 0.9149 q2 0.0216 0.0300 0.0315 0.0265 0.0167 0.0078 0.0036 q3 0.7500 0.6667 0.5893 0.4692 0.3210 0.3640 0.0000 q4 0.0588 0.1948 0.2185 0.1739 0.0979 0.0459 0.0000 q5 0.8261 0.9439 0.9353 0.9226 0.8999 0.9253 1.0000 q6 0.9417 0.9514 0.9472 0.9335 0.9077 0.8991 0.7755

slide-38
SLIDE 38

Guideline Introduction Methods Results

Evaluation Strategies

All positions in the L-th longest gap, of length K, in an alignment are classified as true indel positions iff Sig1(θ) : P(I1(x, y) ≥ K | L(x, y), Sim(x, y)) ≤ θ Sig4(θ) : P(Imin(4,L)(x, y) ≥ K | L(x, y), Sim(x, y)) ≤ θ Sig7(θ) : P(Imin(7,L)(x, y) ≥ K | L(x, y), Sim(x, y)) ≤ θ Con : K ≥ Constant E.g. Sig7(0.01): All positions in the 6-th longest gap (of length K) are classified as true indel positions iff (min(7, L) = 6) P(I6(x, y) ≥ K | L(x, y), Sim(x, y)) ≤ 0.01 In the 8-th longest indel, length K (min(7, 8) = 7): P(I7(x, y) ≥ K | L(x, y), Sim(x, y)) ≤ 0.01

slide-39
SLIDE 39

Guideline Introduction Methods Results

Superfamilies

0.62 0.64 0.66 0.68 0.7 0.72 0.74 0.76 0.78 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Precision Recall Needleman-Wunsch Sup

"Sig-7" "Sig-4" "Sig-1" "Const" 0.62 0.64 0.66 0.68 0.7 0.72 0.74 0.76 0.78 0.8 0.82 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Precision Recall Smith-Waterman Sup

"Sig-7" "Sig-4" "Sig-1" "Const"

Decreasing θ [in SigD(θ)], increasing K [in Con] lowers Recall Recall = #Predicted True Indel Positions #True Indel Positions Precision = #Predicted True Indel Positions #Predicted Indel Positions

slide-40
SLIDE 40

Guideline Introduction Methods Results

Twilight Zone

0.62 0.63 0.64 0.65 0.66 0.67 0.68 0.69 0.7 0.71 0.72 0.73 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Precision Recall Needleman-Wunsch Twi

"Sig-7" "Sig-4" "Sig-1" "Const" 0.55 0.6 0.65 0.7 0.75 0.8 0.85 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Precision Recall Smith-Waterman Twi

"Sig-7" "Sig-4" "Sig-1" "Const"

Decreasing θ [in SigD(θ)], increasing K [in Con] lowers Recall Recall = #Predicted True Indel Positions #True Indel Positions Precision = #Predicted True Indel Positions #Predicted Indel Positions

slide-41
SLIDE 41

Guideline Introduction Methods Results

Acknowledgments, Outlook

Acknowledgments:

  • Funding: David DesJardins, Google Inc.
  • Lior Pachter
  • Lab for Mathematical and Computational Biology, UC Berkeley
  • Computational Biology Lab, Simon Fraser University

Outlook: Posterior Decoding Algorithms

slide-42
SLIDE 42

Guideline Introduction Methods Results

Thanks for the attention!