Pair HMM based gap statistics for re-evaluation of indels in alignments with affine gap penalties
Alexander Schönhuth, Raheleh Salari, Cenk Sahinalp
Centrum Wiskunde & Informatica, Amsterdam
Computational Biology Lab, School of Computing Science, Simon Fraser University
September 8, 2010

Guideline
Introduction
• Motivation
• Affine Gap Cost Alignments
Methods
• Indel Statistics: Problem Formulation
• Solution: Pair HMMs
Results
• Data and Parameters
• Evaluation Strategies

Motivation
• Inaccuracies in sequence alignments are abundant
• Detrimental effects in downstream analyses
• [Lunter et al. (2007)]: Accuracy 96% far away from gaps, 56% surrounding gaps
• [Lunter et al. (2007)]: Gap attraction and gap annihilation
• [Loeytynoja et al. (2005), Polyanovsky et al. (2008)]: Downward bias in the number of computationally inferred indels
• Numbers and sizes of indels may serve as indicators of indel quality

Indel Facts
Evolutionary processes behind insertions and deletions are not comprehensively understood.
Classical alignment procedures with affine gap penalties:
• Geometric distribution.
Truly evolutionary distributions of indel length:
• Mixtures of exponentials [Qian and Goldstein, 2003], [Pang et al., 2005]
• Zipfian distribution [Chang and Benner, 2004]
Moreover, from small-scale studies:
• Indels often occur in the proteins’ loop regions and cause significant structural changes [Fechteler et al., 1995]
• Indels occur in disease-causing mutational hot spots [Kondrashov et al., 2004]
• Thanks to the structural changes: novel approaches to antibacterial drug design [Cherkasov et al. 2005, 2006], [Nandan et al. 2007]
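The link between affine gap penalties and a geometric indel length distribution can be illustrated with a small sketch. It assumes a pair-HMM-style gap model with a gap-open probability `delta` and a gap-extension probability `epsilon`; both names and the numeric values are hypothetical and do not come from the slides. Under that assumption, the negative log-probability of a gap of length k is a constant plus (k - 1) times a constant, which is an affine penalty.

```python
import math

def gap_length_pmf(k: int, epsilon: float) -> float:
    """P(gap length = k) when a gap is extended with probability epsilon."""
    return (1.0 - epsilon) * epsilon ** (k - 1)

def affine_cost(k: int, delta: float, epsilon: float) -> float:
    """Negative log-probability of opening a gap and giving it length k.

    Written as gap_open + (k - 1) * gap_extend, i.e. an affine gap penalty.
    """
    gap_open = -math.log(delta * (1.0 - epsilon))
    gap_extend = -math.log(epsilon)
    return gap_open + (k - 1) * gap_extend

# Hypothetical parameters, chosen only for illustration.
delta, epsilon = 0.05, 0.6
for k in (1, 2, 5):
    direct = -math.log(delta * gap_length_pmf(k, epsilon))
    print(k, round(direct, 6), round(affine_cost(k, delta, epsilon), 6))
```

The two printed columns agree for every k, which is the sense in which an affine gap cost implicitly models gap lengths as geometric.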

Affine Gap Cost Alignments
Smith-Waterman and Needleman-Wunsch
• Computationally efficient and still most popular.
• SW and NW alignments are Viterbi paths in pair HMMs.
• BLAST employs affine gap penalty scoring schemes.
☞ Many more advanced approaches to sequence alignment are based on pair HMMs.

Methods: Problem Formulation

Indel Statistics Problem Definition
• Let $T$ be a pool of protein pairs of interest, and let $L_A(x,y)$, $\mathrm{Sim}_A(x,y)$ and $I_{d,A}(x,y)$ be the length, the similarity and the length of the $d$-th longest indel of the alignment of $(x,y) \in T$, as computed by the affine gap cost algorithm $A$.
• We are interested in the computation of
$$P\bigl(I_{d,A}(x,y) \ge k \mid L_A(x,y) = n,\ \mathrm{Sim}_A(x,y) \in [\sigma_1, \sigma_2]\bigr)$$
where $(x,y)$ has been sampled from the pool $T$.

Multiple Indel Length Probability Problem
Input: A sequence pair $x, y$, a pool of sequence pairs $T$ s.t. $(x,y) \in T$, integers $d, k$.
Output: An estimate of $P\bigl(I_{d,A}(x,y) \ge k \mid L_A(x,y) = n,\ \mathrm{Sim}_A(x,y) \in [\sigma_1, \sigma_2]\bigr)$
Remark
• Replacing $I_{d,A}(x,y)$ by $S_A(x,y)$, the score of the $A$-alignment of $x$ and $y$, yields the classical problem of score statistics.
• Approximate solution (exact for ungapped local alignments): Altschul-Dembo-Karlin statistics [Karlin and Altschul, 1990], [Dembo and Karlin, 1991].
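The quantities in the problem definition can be read off a gapped alignment directly. The following is a minimal sketch, assuming the alignment is given as two equal-length rows with `-` as the gap character, and assuming that a run of gap columns counts as one indel regardless of which row carries the gaps. The function names are hypothetical, and the similarity measure is left out because the slides do not define it.

```python
from itertools import groupby

GAP = "-"  # assumed gap character

def indel_lengths(row_x: str, row_y: str) -> list[int]:
    """Lengths of all maximal gap runs in a pairwise alignment, longest first."""
    if len(row_x) != len(row_y):
        raise ValueError("alignment rows must have equal length")
    # A column belongs to an indel if either row carries a gap character.
    is_indel = [a == GAP or b == GAP for a, b in zip(row_x, row_y)]
    runs = [len(list(group)) for flag, group in groupby(is_indel) if flag]
    return sorted(runs, reverse=True)

def d_th_longest_indel(row_x: str, row_y: str, d: int) -> int:
    """I_d: length of the d-th longest indel (0 if there are fewer than d)."""
    runs = indel_lengths(row_x, row_y)
    return runs[d - 1] if 1 <= d <= len(runs) else 0

# Toy alignment, invented for illustration only.
row_x = "ACGT---ACGTAC"
row_y = "ACGTTTTAC--AC"
print(len(row_x))                                # alignment length L
print(indel_lengths(row_x, row_y))               # all indel lengths
print(d_th_longest_indel(row_x, row_y, 2) >= 2)  # the event I_2 >= 2
```

For this toy pair the gap runs have lengths 3 and 2, so the event "the second longest indel has length at least 2" holds.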

Motivation
Systematically answer questions like: “Are 4 gaps of size at least 6 in an alignment of length 200 and similarity 50 likely to reflect true indels?”

Solution: Pair HMMs

Solution Strategy: Pair HMMs
Idea: Approximate Viterbi path statistics by generative statistics of a Markov chain.

Solution Strategy: Pair HMMs
Markov Chain Training
1: Align all sequence pairs from $T$.
2: Infer $q_1, q_2, q_3, q_4, q_5, q_6$ by “Viterbi training” the Markov chain with the alignments of $\mathrm{Sim} \in [\sigma_1, \sigma_2]$.
☞ Markov chain $M_{\sigma_1,\sigma_2}$.
$C_{d,k} \subset \{B, M_1, M_2, M_3, \mathrm{IN}, E\}^*$: at least $d$ $\mathrm{IN}$-stretches of length at least $k$
$A_n \subset \{B, M_1, M_2, M_3, \mathrm{IN}, E\}^*$: alignment region of length $n$
Example:
$B\ M_1\ M_1\ \mathrm{IN}\ \mathrm{IN}\ \mathrm{IN}\ M_2\ \mathrm{IN}\ \mathrm{IN}\ \mathrm{IN}\ M_3\ E \in C_{2,3} \cap A_{10}$
$B\ M_1\ \mathrm{IN}\ \mathrm{IN}\ \mathrm{IN}\ \mathrm{IN}\ M_2\ \mathrm{IN}\ M_2\ \mathrm{IN}\ M_3\ M_3\ E \in C_{3,1} \cap C_{1,4} \cap A_{11}$
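The two example paths can be checked mechanically. The sketch below assumes that a path in $A_n$ starts with B, ends with E, and has exactly n symbols from {M1, M2, M3, IN} in between, which is consistent with both examples. The function names `in_A` and `in_C` are hypothetical.

```python
from itertools import groupby

INNER_STATES = {"M1", "M2", "M3", "IN"}

def in_A(states: list[str], n: int) -> bool:
    """A_n: B, then exactly n symbols from {M1, M2, M3, IN}, then E."""
    if len(states) < 2 or states[0] != "B" or states[-1] != "E":
        return False
    inner = states[1:-1]
    return len(inner) == n and all(s in INNER_STATES for s in inner)

def in_C(states: list[str], d: int, k: int) -> bool:
    """C_{d,k}: at least d maximal IN-stretches of length at least k."""
    runs = [len(list(group)) for s, group in groupby(states) if s == "IN"]
    return sum(r >= k for r in runs) >= d

path1 = "B M1 M1 IN IN IN M2 IN IN IN M3 E".split()
path2 = "B M1 IN IN IN IN M2 IN M2 IN M3 M3 E".split()

print(in_C(path1, 2, 3) and in_A(path1, 10))
print(in_C(path2, 3, 1) and in_C(path2, 1, 4) and in_A(path2, 11))
```

Both checks succeed, matching the memberships stated in the example.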

Solution Strategy: Pair HMMs
Computation of Probabilities
$M = M_{\sigma_1,\sigma_2}$
1: $n, d, k$ as from the alignment of $x$ and $y$
2: Compute $P_M(C_{d,k} \cap A_n)$ as well as $P_M(A_n)$
3: Output $P_M(C_{d,k} \mid A_n) = \dfrac{P_M(C_{d,k} \cap A_n)}{P_M(A_n)}$
Markov chain $M_{\sigma_1,\sigma_2}$ for $P(I_d \ge k \mid L = n, \mathrm{Sim} \in [\sigma_1, \sigma_2])$.
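Steps 2 and 3 can be sketched with a forward-style dynamic program over an augmented state (current chain state, current IN-run length capped at k, number of qualifying runs capped at d). This is a generic technique offered as an illustration, not necessarily the authors' algorithm. The transition table `TOY_CHAIN` is entirely hypothetical: neither its topology nor its numbers are taken from the slides, and the trained parameters $q_1, \dots, q_6$ would take their place.

```python
from collections import defaultdict

# Hypothetical transition probabilities; each row sums to 1.
TOY_CHAIN = {
    "B":  {"M1": 1.0},
    "M1": {"M1": 0.7, "IN": 0.1, "M2": 0.2},
    "M2": {"M2": 0.6, "IN": 0.2, "M3": 0.2},
    "M3": {"M3": 0.8, "IN": 0.05, "E": 0.15},
    "IN": {"IN": 0.5, "M2": 0.4, "M3": 0.1},
}

def prob_C_and_A(trans, n, d, k):
    """Return (P(C_{d,k} and A_n), P(A_n)) for paths B Z_1 ... Z_n E."""
    # layer maps (state, run, count) -> probability mass of partial paths
    layer = {("B", 0, 0): 1.0}
    for _ in range(n):
        nxt = defaultdict(float)
        for (s, run, cnt), p in layer.items():
            for t, q in trans[s].items():
                if t == "E":
                    continue  # a path that ends early is not in A_n
                if t == "IN":
                    # Count an IN-stretch once, when it first reaches length k.
                    new_cnt = min(cnt + (run + 1 == k), d)
                    new_run = min(run + 1, k)
                else:
                    new_cnt, new_run = cnt, 0
                nxt[(t, new_run, new_cnt)] += p * q
        layer = nxt
    # Close every path with a transition into E after exactly n inner symbols.
    p_A = sum(p * trans[s].get("E", 0.0) for (s, _, _), p in layer.items())
    p_CA = sum(p * trans[s].get("E", 0.0)
               for (s, _, c), p in layer.items() if c >= d)
    return p_CA, p_A

p_CA, p_A = prob_C_and_A(TOY_CHAIN, n=50, d=2, k=3)
print(p_CA / p_A)  # estimate of P(C_{2,3} | A_50) under the toy chain
```

The number of augmented states per position is at most (number of chain states) × (k + 1) × (d + 1), so the running time grows linearly in n rather than exponentially.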

Computation of $P(C_{d,k} \cap A_n)$: Naive Approach
Let $B_{n,t,k}$ be sequences of the type ($Z \in \{M_1, M_2, M_3, \mathrm{IN}\}$):
$$B_0\ Z_1 \dots Z_{t-1}\ \mathrm{IN}_t \dots \mathrm{IN}_{t+k-1}\ Z_{t+k} \dots Z_n\ E_{n+1}.$$
For example ($d = 1$), it holds that
$$C_{1,k} \cap A_n = \bigcup_{t=1}^{n-k+1} B_{n,t,k}.$$
However, proceeding by inclusion-exclusion,
$$P(C_{1,k} \cap A_n) = \sum_{m=1}^{n-k+1} (-1)^{m+1} \sum_{1 \le t_1 < \dots < t_m \le n-k+1} P(B_{n,t_1,k} \cap \dots \cap B_{n,t_m,k})$$
results in computation of
$$\sum_{m=1}^{n-k+1} \binom{n-k+1}{m} = 2^{n-k+1} - 1$$
terms of the type $P(B_{n,t_1,k} \cap \dots \cap B_{n,t_m,k})$, hence computation of $P(C_{1,k} \cap A_n)$ would be exponential in $n$.
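The size of the inclusion-exclusion sum can be checked numerically. The short sketch below counts the terms for a few alignment lengths; the function name and the chosen values of n and k are illustrative only.

```python
from math import comb

def inclusion_exclusion_terms(n: int, k: int) -> int:
    """Number of intersection terms for C_{1,k} and A_n."""
    positions = n - k + 1  # possible start positions t of the IN-stretch
    return sum(comb(positions, m) for m in range(1, positions + 1))

for n in (10, 50, 200):
    terms = inclusion_exclusion_terms(n, k=6)
    # The closed form 2^(n-k+1) - 1 agrees with the explicit sum.
    print(n, terms, terms == 2 ** (n - 6 + 1) - 1)
```

Already for an alignment of length 200 and k = 6 the count is 2^195 - 1, which rules out direct evaluation of the sum.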
