

SLIDE 1

Multiple Sequence Alignment using Profile HMM

based on Chapter 5 and Section 6.5 from Biological Sequence Analysis by R. Durbin et al., 1998

Acknowledgements: M.Sc. students Beatrice Miron, Oana Răţoi, Diana Popovici


SLIDE 2

PLAN

  1. From profiles to Profile HMMs
  2. Setting the parameters of a profile HMM; the optimal (MAP) model construction
  3. Basic algorithms for profile HMMs
  4. Profile HMM training from unaligned sequences: getting the model and the multiple alignment simultaneously
  5. Profile HMM variants for non-global alignments
  6. Weighting the training sequences

SLIDE 3

1 From profiles to Profile HMMs

Problem

Given a multiple alignment (obtained either manually or using one of the methods presented in Ch. 6 and Ch. 7), and the profile associated with a set of marked (X = match) columns, design an HMM that would perform sequence alignments to that profile.

Example

      X X . . . X
bat   A G - - - C
rat   A - A G - C
cat   A G - A A -
gnat  - - A A A C
goat  A G - - - C
      1 2 . . . 3

Profile of the marked (match) columns:

         1     2     3
  A     4/5    .     .
  C      .     .    4/5
  G      .    3/5    .
  T      .     .     .
  gap   1/5   2/5   1/5

SLIDE 4

Building up a solution

At first sight, not taking gaps into account:

[Figure: Begin and End states connected by a linear chain of match states M_j]

What about insert residues?

[Figure: the same chain of match states M_j, with insert states I_j (having self-loops) placed between consecutive positions]

SLIDE 5

What about gaps?

[Figure: the match-state chain with extra "jump" transitions that skip over match states]

A better treatment of gaps:

[Figure: the match-state chain with silent delete states D_j, so that a gap at position j is modelled by passing through D_j instead of M_j]

SLIDE 6

Could we put it all together?

[Figure: the full profile HMM, combining match states M_j, insert states I_j and delete states D_j between Begin and End]

Transition structure of a profile HMM

SLIDE 7

Does it work?

      X X . . . X
bat   A G - - - C
rat   A - A G - C
cat   A G - A A -
gnat  - - A A A C
goat  A G - - - C
      1 2 . . . 3

[Figure: the profile HMM built for this alignment: three match states M1–M3 (one per marked column), three delete states D1–D3, and four insert states between Begin and End]

SLIDE 8

Any resemblance to pair HMMs?

[Figure: a pair HMM: state M emits an aligned pair (x_i, y_j) with probability p_{x_i y_j}; state X emits x_i with probability q_{x_i}; state Y emits y_j with probability q_{y_j}; transitions use parameters δ (into X or Y), ε (staying in X or Y) and τ (into End), with 1 − 2δ − τ for staying in M and 1 − ε − τ for returning to M]

Doesn't seem so...

SLIDE 9

However, remember...

An example of the state assignments for global alignment using the affine gap model:

[Figure: a global alignment under the affine gap model,

    x: H V L L S − P −
    y: A A D E K K − S

with each aligned column assigned to one of the states M, I_x, I_y]

When making the extension to multiple sequence alignment, think of generating only one string (instead of a pair of strings): use I_x to insert residues and I_y to produce a gap; use one triplet of states (M, I_x, I_y) for each column in the alignment; finally, define (appropriate) edges in the resulting FSA.

SLIDE 10

Consequence

It shouldn't be difficult to re-write the basic HMM algorithms for profile HMMs!

One moment, though...

SLIDE 11

2 Setting the parameters of a Profile HMM

2.1 Using Maximum Likelihood Estimates (MLE) for transition and emission probabilities

For instance, assuming a given multiple alignment with match states marked (X), the emission probabilities are computed as

    e_{M_j}(a) = c_{ja} / Σ_{a'} c_{ja'}

where c_{ja} is the observed frequency of residue a in column j of the multiple alignment.

SLIDE 12

Counts for our example

      X X . . . X
bat   A G - - - C
rat   A - A G - C
cat   A G - A A -
gnat  - - A A A C
goat  A G - - - C
      1 2 . . . 3

Match emissions (marked columns 1–3) and insert emissions:

        1  2  3        insert
    A   4  _  _        A  6
    C   _  _  4        C  _
    G   _  3  _        G  1
    T   _  _  _        T  _

Transition counts by position (Begin → column 1 counted at position 0, column 3 → End at position 3):

          0  1  2  3
    M−M   4  3  2  4
    M−I   _  _  1  _
    M−D   1  1  _  _
    I−M   _  _  2  _
    I−I   _  _  4  _
    I−D   _  _  1  _
    D−M   _  _  _  1
    D−I   _  _  2  _
    D−D   _  1  _  _
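To make the counting concrete, here is a minimal sketch (not from the slides) that recovers the match-emission counts above from the example alignment and turns them into MLE probabilities as on the previous slide:

```python
from collections import Counter

# The example alignment; columns 0, 1 and 5 (0-based) are the marked match columns.
alignment = ["AG---C", "A-AG-C", "AG-AA-", "--AAAC", "AG---C"]
match_cols = [0, 1, 5]

# Match-emission counts c_{ja}: residues observed in each marked column.
counts = [Counter(seq[c] for seq in alignment if seq[c] != "-") for c in match_cols]

# MLE: e_{M_j}(a) = c_{ja} / sum_{a'} c_{ja'}
for j, cnt in enumerate(counts, start=1):
    total = sum(cnt.values())
    print(f"M{j}:", {a: f"{n}/{total}" for a, n in cnt.items()})
# M1: {'A': '4/4'}   M2: {'G': '3/3'}   M3: {'C': '4/4'}
```

Note that each match column here is pure, so the MLE puts probability 1 on a single residue and 0 on all the others; the next slide addresses exactly this problem.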

SLIDE 13

What about zero counts, i.e. unseen emissions/transitions?

One solution: use pseudocounts (generalising Laplace's law):

    e_{M_j}(a) = (c_{ja} + A q_a) / (Σ_{a'} c_{ja'} + A)

where A is a weight put on the pseudocounts as compared to the real counts c_{ja}, and q_a is the frequency with which a appears in a random model. A = 20 works well for protein alignments.

Note: At the intuitive level, using pseudocounts makes a lot of sense:

  • e_{M_j}(a) is approximately equal to q_a if very little data is available, i.e. all real counts are very small compared to A;
  • when a large amount of data is available, e_{M_j}(a) is essentially equal to the maximum likelihood solution.

For other solutions (e.g. Dirichlet mixtures, substitution matrix mixtures, estimation based on an ancestor), see Durbin et al., 1998, Section 6.5.
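A minimal sketch of this rule (the function name and the uniform background are illustrative; A = 20 is the slide's suggestion for proteins):

```python
def smoothed_emissions(counts, q, A=20.0):
    """Pseudocount rule: e(a) = (c_a + A*q_a) / (sum_{a'} c_{a'} + A)."""
    total = sum(counts.values()) + A
    return {a: (counts.get(a, 0) + A * qa) / total for a, qa in q.items()}

# With few observations the estimate stays close to the background q_a ...
print(smoothed_emissions({"A": 4}, {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}))
# ... and with many observations it approaches the MLE c_a / sum_{a'} c_{a'}.
print(smoothed_emissions({"A": 4000}, {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}))
```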

SLIDE 14

2.2 Setting the L parameter

  • The process of model construction represents a way to decide which columns of the alignment should be assigned to match states, and which to insert states.
  • There are 2^L combinations of markings for an alignment of L columns, and hence 2^L different profile HMMs to choose from. In a marked column, symbols are assigned to match states and gaps are assigned to delete states; in an unmarked column, symbols are assigned to insert states and gaps are ignored.
  • There are at least three ways to determine the marking:
    – manual construction: the user marks alignment columns by hand;
    – heuristic construction: e.g. a column might be marked when the proportion of gap symbols in it is below a certain threshold (see the sketch after this list);
    – Maximum A Posteriori (MAP) model construction: next slides.
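A minimal sketch of the gap-fraction heuristic (the 0.5 threshold is illustrative, not from the slides):

```python
def mark_match_columns(alignment, max_gap_fraction=0.5):
    """Mark column c as a match column when its fraction of gap symbols
    is below max_gap_fraction."""
    n_seqs = len(alignment)
    return [c for c in range(len(alignment[0]))
            if sum(seq[c] == "-" for seq in alignment) / n_seqs < max_gap_fraction]

alignment = ["AG---C", "A-AG-C", "AG-AA-", "--AAAC", "AG---C"]
print(mark_match_columns(alignment))  # [0, 1, 3, 5]
# Note: the heuristic also marks column 3 (gap fraction 2/5), while the
# example marking of the earlier slides keeps only columns 0, 1 and 5.
```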

SLIDE 15

The MAP (maximum a posteriori) model construction

  • Objective: we search for the model µ that maximises the likelihood of the given data, namely

        argmax_µ P(C | µ)

    where C is a set of aligned sequences. Note: the sequences in C are assumed to be statistically independent.
  • Idea: the MAP model construction algorithm recursively calculates S_j, the log probability of the optimal model for the alignment up to and including column j, assuming that column j is marked.
  • More specifically: S_j is calculated from smaller subalignments ending at a marked column i, by incrementing S_i with the summed log probability of the transitions and emissions for the columns between i and j.

SLIDE 16

MAP model construction algorithm: Notations

  • c_{xy} — the observed state transition counts
  • a_{xy} — the transition probabilities, estimated from the c_{xy} in the usual (MLE) fashion:

        a_{xy} = c_{xy} / Σ_{y'} c_{xy'}

  • S_j — the log probability of the optimal model for the alignment up to and including column j, assuming that column j is marked
  • T_{ij} — the summed log probability of all the state transitions between marked columns i and j:

        T_{ij} = Σ_{x,y ∈ {M,D,I}} c_{xy} log a_{xy}

  • M_j — the log probability contribution of the match-state symbol emissions in column j
  • L_{i,j} — the log probability contribution of the insert-state symbol emissions for columns i+1, ..., j−1 (for j − i > 1)

SLIDE 17

The MAP model construction algorithm

Initialization:

    S_0 = 0, M_{L+1} = 0

Recurrence: for j = 1, ..., L+1:

    S_j = max_{0 ≤ i < j} S_i + T_{ij} + M_j + L_{i+1,j−1} + λ
    σ_j = argmax_{0 ≤ i < j} S_i + T_{ij} + M_j + L_{i+1,j−1} + λ

Traceback: from j = σ_{L+1}, while j > 0: mark column j as a match column; set j = σ_j.

Complexity: O(L) in memory and O(L^2) in time for an alignment of L columns... with some care in implementation!

Note: λ is a penalty used to favour models with fewer match states. In Bayesian terms, λ is the log of the prior probability of marking each column. It implies a simple but adequate exponentially decreasing prior distribution over model lengths.
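A minimal sketch of the recurrence and traceback, assuming the scoring terms T_{ij}, M_j and L_{i+1,j−1} have been precomputed as tables T[i][j], M[j] and Lins[i][j] (how to obtain them from counts is on the previous slide; the names and table layout are mine):

```python
def map_model_construction(L, T, M, Lins, lam):
    """MAP marking of columns 1..L.
    T[i][j]: summed log prob of transitions between marked columns i and j;
    M[j]: match-emission log prob of column j (with M[L+1] == 0);
    Lins[i][j]: insert-emission log prob of columns i+1..j-1 (0 when j-i <= 1);
    lam: log prior probability of marking a column."""
    S = [float("-inf")] * (L + 2)
    sigma = [0] * (L + 2)
    S[0] = 0.0
    for j in range(1, L + 2):
        best_i = max(range(j), key=lambda i: S[i] + T[i][j] + M[j] + Lins[i][j] + lam)
        S[j] = S[best_i] + T[best_i][j] + M[j] + Lins[best_i][j] + lam
        sigma[j] = best_i
    # Traceback: collect marked columns starting from sigma_{L+1}.
    marked, j = [], sigma[L + 1]
    while j > 0:
        marked.append(j)
        j = sigma[j]
    return sorted(marked)
```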

SLIDE 18

3 Basic algorithms for Profile HMMs

Notations

  • v^M_j(i) — the probability of the best path matching the subsequence x_{1..i} to the (profile) submodel up to column j, ending with x_i being emitted by the state M_j; v^I_j(i) — the probability of the best path ending in x_i being emitted by I_j; v^D_j(i) — the probability of the best path ending in D_j (x_i being the last character emitted before D_j).
  • V^M_j(i), V^I_j(i), V^D_j(i) — the log-odds scores corresponding respectively to v^M_j(i), v^I_j(i), v^D_j(i).
  • f^M_j(i) — the combined probability of all alignments up to x_i that end in state M_j; similarly f^I_j(i) and f^D_j(i).
  • b^M_j(i), b^I_j(i), b^D_j(i) — the corresponding backward probabilities.

SLIDE 19

The Viterbi algorithm for profile HMMs

Initialization:

    rename the Begin state as M_0 and set v^M_0(0) = 1; rename the End state as M_{L+1}

Recursion:

    v^M_j(i) = e_{M_j}(x_i) · max { v^M_{j−1}(i−1) a_{M_{j−1} M_j},
                                    v^I_{j−1}(i−1) a_{I_{j−1} M_j},
                                    v^D_{j−1}(i−1) a_{D_{j−1} M_j} }

    v^I_j(i) = e_{I_j}(x_i) · max { v^M_j(i−1) a_{M_j I_j},
                                    v^I_j(i−1) a_{I_j I_j},
                                    v^D_j(i−1) a_{D_j I_j} }

    v^D_j(i) = max { v^M_{j−1}(i) a_{M_{j−1} D_j},
                     v^I_{j−1}(i) a_{I_{j−1} D_j},
                     v^D_{j−1}(i) a_{D_{j−1} D_j} }

Termination:

    the final score is v^M_{L+1}(n), calculated using the top recursion relation.
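A minimal sketch of this recursion in log space; the model encoding (emission dict e keyed by state name, transition dict a keyed by state-name pairs, Begin = M0, End = M(L+1)) is an assumption of the sketch, and traceback is omitted:

```python
import math

NEG_INF = float("-inf")

def profile_viterbi_score(x, L, e, a):
    """Log Viterbi score v^M_{L+1}(n) of sequence x against a profile HMM with
    match states M1..ML. e["M1"]["A"] etc. are emission probabilities;
    a[("M0", "M1")] etc. are transition probabilities."""
    n = len(x)
    lg = lambda p: math.log(p) if p > 0 else NEG_INF
    ta = lambda s, t: lg(a.get((s, t), 0.0))          # missing transitions -> -inf
    vM = [[NEG_INF] * (n + 1) for _ in range(L + 1)]
    vI = [[NEG_INF] * (n + 1) for _ in range(L + 1)]
    vD = [[NEG_INF] * (n + 1) for _ in range(L + 1)]
    vM[0][0] = 0.0                                    # start in Begin = M0
    for i in range(n + 1):
        for j in range(L + 1):
            if i > 0 and j > 0:                       # match: emit x_i in column j
                vM[j][i] = lg(e[f"M{j}"][x[i-1]]) + max(
                    vM[j-1][i-1] + ta(f"M{j-1}", f"M{j}"),
                    vI[j-1][i-1] + ta(f"I{j-1}", f"M{j}"),
                    vD[j-1][i-1] + ta(f"D{j-1}", f"M{j}"))
            if i > 0:                                 # insert: emit x_i after column j
                vI[j][i] = lg(e[f"I{j}"][x[i-1]]) + max(
                    vM[j][i-1] + ta(f"M{j}", f"I{j}"),
                    vI[j][i-1] + ta(f"I{j}", f"I{j}"),
                    vD[j][i-1] + ta(f"D{j}", f"I{j}"))
            if j > 0:                                 # delete: silent, same i
                vD[j][i] = max(
                    vM[j-1][i] + ta(f"M{j-1}", f"D{j}"),
                    vI[j-1][i] + ta(f"I{j-1}", f"D{j}"),
                    vD[j-1][i] + ta(f"D{j-1}", f"D{j}"))
    end = f"M{L+1}"                                   # termination: enter End
    return max(vM[L][n] + ta(f"M{L}", end),
               vI[L][n] + ta(f"I{L}", end),
               vD[L][n] + ta(f"D{L}", end))
```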

SLIDE 20

The Viterbi algorithm for profile HMMs: log-odds version

Initialization:

    V^M_0(0) = 0 (the Begin state is M_0, and the End state is M_{L+1})

Recursion:

    V^M_j(i) = log (e_{M_j}(x_i) / q_{x_i}) + max { V^M_{j−1}(i−1) + log a_{M_{j−1} M_j},
                                                    V^I_{j−1}(i−1) + log a_{I_{j−1} M_j},
                                                    V^D_{j−1}(i−1) + log a_{D_{j−1} M_j} }

    V^I_j(i) = log (e_{I_j}(x_i) / q_{x_i}) + max { V^M_j(i−1) + log a_{M_j I_j},
                                                    V^I_j(i−1) + log a_{I_j I_j},
                                                    V^D_j(i−1) + log a_{D_j I_j} }

    V^D_j(i) = max { V^M_{j−1}(i) + log a_{M_{j−1} D_j},
                     V^I_{j−1}(i) + log a_{I_{j−1} D_j},
                     V^D_{j−1}(i) + log a_{D_{j−1} D_j} }

Termination:

    the final score is V^M_{L+1}(n), calculated using the top recursion relation.

SLIDE 21

The Forward algorithm for profile HMMs

Initialization:

    f^M_0(0) = 1

Recursion:

    f^M_j(i) = e_{M_j}(x_i) [ f^M_{j−1}(i−1) a_{M_{j−1} M_j} + f^I_{j−1}(i−1) a_{I_{j−1} M_j} + f^D_{j−1}(i−1) a_{D_{j−1} M_j} ]

    f^I_j(i) = e_{I_j}(x_i) [ f^M_j(i−1) a_{M_j I_j} + f^I_j(i−1) a_{I_j I_j} + f^D_j(i−1) a_{D_j I_j} ]

    f^D_j(i) = f^M_{j−1}(i) a_{M_{j−1} D_j} + f^I_{j−1}(i) a_{I_{j−1} D_j} + f^D_{j−1}(i) a_{D_{j−1} D_j}

Termination:

    f^M_{L+1}(n + 1) = f^M_L(n) a_{M_L M_{L+1}} + f^I_L(n) a_{I_L M_{L+1}} + f^D_L(n) a_{D_L M_{L+1}}
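The forward recursion has exactly the structure of the Viterbi recursion with max replaced by a sum. In log space, the Viterbi sketch above can therefore be adapted by swapping max for a log-sum-exp; a minimal helper:

```python
import math

def logsumexp(*xs):
    """Numerically stable log(sum(exp(x) for x in xs)); substituting this for
    max(...) in the Viterbi sketch above yields the forward algorithm."""
    m = max(xs)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(x - m) for x in xs))
```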

SLIDE 22

The Backward algorithm for profile HMMs

Initialization:

    b^M_{L+1}(n + 1) = 1;  b^M_L(n) = a_{M_L M_{L+1}};  b^I_L(n) = a_{I_L M_{L+1}};  b^D_L(n) = a_{D_L M_{L+1}}

Recursion:

    b^M_j(i) = b^M_{j+1}(i + 1) a_{M_j M_{j+1}} e_{M_{j+1}}(x_{i+1}) + b^I_j(i + 1) a_{M_j I_j} e_{I_j}(x_{i+1}) + b^D_{j+1}(i) a_{M_j D_{j+1}}

    b^I_j(i) = b^M_{j+1}(i + 1) a_{I_j M_{j+1}} e_{M_{j+1}}(x_{i+1}) + b^I_j(i + 1) a_{I_j I_j} e_{I_j}(x_{i+1}) + b^D_{j+1}(i) a_{I_j D_{j+1}}

    b^D_j(i) = b^M_{j+1}(i + 1) a_{D_j M_{j+1}} e_{M_{j+1}}(x_{i+1}) + b^I_j(i + 1) a_{D_j I_j} e_{I_j}(x_{i+1}) + b^D_{j+1}(i) a_{D_j D_{j+1}}

SLIDE 23

The Baum-Welch (Expectation-Maximization) algorithm for profile HMMs: re-estimation equations

Expected emission counts from sequence x:

    E_{M_j}(a) = (1 / P(x)) Σ_{i | x_i = a} f^M_j(i) b^M_j(i)

    E_{I_j}(a) = (1 / P(x)) Σ_{i | x_i = a} f^I_j(i) b^I_j(i)

Expected transition counts from sequence x:

    A_{X_j M_{j+1}} = (1 / P(x)) Σ_i f^X_j(i) a_{X_j M_{j+1}} e_{M_{j+1}}(x_{i+1}) b^M_{j+1}(i + 1)

    A_{X_j I_j} = (1 / P(x)) Σ_i f^X_j(i) a_{X_j I_j} e_{I_j}(x_{i+1}) b^I_j(i + 1)

    A_{X_j D_{j+1}} = (1 / P(x)) Σ_i f^X_j(i) a_{X_j D_{j+1}} b^D_{j+1}(i)

where X_j is one of M_j, I_j, and D_j.
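As a sketch, the first expectation above can be computed directly from forward and backward tables fM and bM (probability space, indexed [j][i] like the earlier sketches) and the sequence probability Px = P(x); the names and layout are mine:

```python
def expected_match_emissions(x, j, fM, bM, Px, alphabet="ACGT"):
    """E_{M_j}(a) = (1/P(x)) * sum over positions i with x_i == a
    of fM[j][i] * bM[j][i]."""
    return {a: sum(fM[j][i] * bM[j][i]
                   for i in range(1, len(x) + 1) if x[i-1] == a) / Px
            for a in alphabet}
```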

SLIDE 24

Avoiding local maxima

  • The Baum-Welch algorithm is guaranteed to find a local maximum on the probability "surface", but there is no guarantee that this local optimum is anywhere near the global optimum.
  • A more involved approach is to use some form of stochastic search algorithm that "bumps" Baum-Welch off from local maxima:
    – noise injection during Baum-Welch re-estimation,
    – simulated annealing Viterbi approximation of EM,
    – Gibbs sampling.
    For details, see Durbin et al., Section 6.5, pages 154–158.

SLIDE 25

4 Getting simultaneously the model and the multiple alignment

Profile HMM training from unaligned sequences

Initialization: choose the length of the profile HMM (i.e., the number of match states), and initialize the transition and emission parameters. A commonly used rule is to set the profile HMM's length to the average length of the training sequences.

Training: estimate the model using the Baum-Welch algorithm or its Viterbi alternative. Start Baum-Welch from multiple different points to see whether it always converges to approximately the same optimum. If necessary, use a heuristic method for avoiding local optima (see the previous slide).

Multiple alignment: align all the sequences to the final model using the Viterbi algorithm and build a multiple alignment.

SLIDE 26

Further comments on the profile HMM's initial parameters:

  • One possibility: guess a multiple alignment of some or all of the given sequences.
  • A further possibility: derive the model's initial parameters from the Dirichlet prior over parameters (see Ch. 11).
  • Alternatively: use frequencies derived from the prior to initialise the model's parameters, then use this model to generate a small number of random sequences, and finally use the resulting counts as 'data' to estimate an initial model.
  • Note: the model should be encouraged to use 'sensible' transitions; for instance, transitions into match states should be large compared to other transition probabilities.

SLIDE 27

Model surgery

After training a model we can analyse the alignment it produces:

  • From the counts estimated by the forward-backward procedure we can see how much a certain transition is used by the training sequences.
  • The usage of a match state is the sum of counts for all letters emitted in that state.
  • If a certain match state is used by fewer than half of the given sequences, the corresponding module (triplet of match, insert, delete states) should be deleted.
  • Similarly, if more than half (or some other predefined fraction) of the sequences use the transitions into a certain insert state, it should be expanded into some number of new modules (usually the average number of insertions).
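A minimal sketch of these two surgery rules, assuming the expected usages have already been accumulated from the forward-backward counts (the names, data layout and the 0.5 default fraction follow the slide, but the encoding is mine):

```python
def model_surgery(match_usage, insert_usage, avg_insert_len, n_seqs, frac=0.5):
    """Return (modules to delete, {module: number of new modules to insert}).
    match_usage[j] / insert_usage[j]: expected number of sequences using M_j / I_j;
    avg_insert_len[j]: average number of insertions at I_j."""
    to_delete = [j for j, u in match_usage.items() if u < frac * n_seqs]
    to_expand = {j: avg_insert_len[j] for j, u in insert_usage.items()
                 if u > frac * n_seqs}
    return to_delete, to_expand
```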

SLIDE 28

5 Profile HMMs for non-global alignments

Local multiple alignment:

[Figure: the profile HMM preceded and followed by flanking (blue diamond) states with self-loops, connected to Begin and End, so that only a subsequence needs to match the model]

SLIDE 29

The looping probabilities of the flanking (blue diamond) states should be close to 1; let's set them to 1 − η.

Transition probabilities from the left flanking state to the match states:

  • one option: all equal to η/L;
  • another option: η/2 for the transition into the first match state, and η/(2(L − 1)) for each of the other positions.

SLIDE 30

For the rare case when the first residue might be missing:

[Figure: a variant of the local-alignment model with additional transitions covering a missing first residue]

SLIDE 31

A profile HMM for repeat matches to subsections of the profile model:

[Figure: the local-alignment model with a transition looping from the right side back to the left, allowing repeated passes through (subsections of) the model]

SLIDE 32

6 Weighting the training sequences

When the (training) sequences in the given multiple alignment are not statistically independent, one may use some simple weighting schemes derived from a phylogenetic tree. Example of a phylogenetic tree:

[Figure: a rooted tree with leaves 1–4 and internal nodes 5, 6 and 7 (the root); edge lengths t1 = 2, t2 = 2, t3 = 5, t4 = 8, t5 = 3, t6 = 3]

SLIDE 33

6.1 A weighting scheme using Kirchhoff's laws [Thompson et al., 1994]

[Figure: the tree redrawn as an electrical network: currents I1–I4 enter at the leaves and combine at the internal nodes (I1 + I2 at node 5, I1 + I2 + I3 at node 6), with voltages V5, V6, V7 at the internal nodes]

  • Let I_n and V_n be the current and, respectively, the voltage at node n. We set the resistance of each edge equal to the edge length (time).
  • Equations:

        V5 = 2 I1 = 2 I2
        V6 = 2 I1 + 3 (I1 + I2) = 5 I3
        V7 = 8 I4 = 5 I3 + 3 (I1 + I2 + I3)

  • Result:

        I1 : I2 : I3 : I4 = 20 : 20 : 32 : 47
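A quick check of this result in code, setting I1 = I2 = 1 (from V5 = 2 I1 = 2 I2), solving the remaining equations exactly with the fractions module, and scaling at the end:

```python
from fractions import Fraction as F

I1 = I2 = F(1)
I3 = (2*I1 + 3*(I1 + I2)) / 5          # V6 = 2*I1 + 3*(I1+I2) = 5*I3
I4 = (5*I3 + 3*(I1 + I2 + I3)) / 8     # V7 = 8*I4 = 5*I3 + 3*(I1+I2+I3)

# Scale to the smallest integer ratio: 20 : 20 : 32 : 47
print([int(i * 20) for i in (I1, I2, I3, I4)])   # [20, 20, 32, 47]
```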

SLIDE 34

6.2 Another simple algorithm [Gerstein et al., 1994]

[Figure: the same tree, with leaves 1–4, internal nodes 5–7, and edge lengths t1 = 2, t2 = 2, t3 = 5, t4 = 8, t5 = 3, t6 = 3]

  • Initially the weights are set to the edge lengths of the leaves: w1 = w2 = 2, w3 = 5, w4 = 8.
  • Then, moving up the tree, at each internal node n every leaf i below n has its weight incremented by

        ∆w_i = t_n · w_i / Σ_{leaves k below n} w_k

    So, at node 5: w1 = w2 = 2 + 3/2 = 3.5. At node 6: w1 = w2 = 3.5 + 3 · 3.5/12 and w3 = 5 + 3 · 5/12.
  • Result:

        w1 : w2 : w3 : w4 = 35 : 35 : 50 : 64
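A minimal sketch of this scheme on the example tree (the tree encoding as node -> (edge length, children) is mine); multiplying the final weights by 8 recovers the ratio on the slide:

```python
# Tree from the slide: internal node -> (edge length t_n, children); leaves 1-4.
tree = {5: (3, [1, 2]), 6: (3, [5, 3]), 7: (None, [6, 4])}
leaf_edge = {1: 2, 2: 2, 3: 5, 4: 8}
w = dict(leaf_edge)                      # initial weights = leaf edge lengths

def leaves_below(n):
    if n in leaf_edge:
        return [n]
    return [l for c in tree[n][1] for l in leaves_below(c)]

for n in (5, 6):                         # visit internal nodes bottom-up (root 7 has no edge)
    t = tree[n][0]
    below = leaves_below(n)
    total = sum(w[k] for k in below)
    for i in below:
        w[i] += t * w[i] / total         # Δw_i = t_n * w_i / Σ_k w_k

print({i: round(wi * 8) for i, wi in w.items()})  # {1: 35, 2: 35, 3: 50, 4: 64}
```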

SLIDE 35

For details on other, more involved weighting schemes:

  • root weights from Gaussian parameters,
  • Voronoi weights,
  • maximum discrimination weights,
  • maximum entropy weights,

see Durbin et al., Section 5.8, pages 126–132.

SLIDE 36

Examples

(including the use of weights in the computation of the parameters of the profile HMM):

  • problems 5.6 and 5.10 in [Borodovsky, Ekisheva, 2006]