CSI5126. Algorithms in bioinformatics: Probabilistic Sequence Motifs



SLIDE 1

Preamble Motivation Profile HMM Definitions Example Decoding Likelihood Model Specification Estimation

CSI5126. Algorithms in bioinformatics

Probabilistic Sequence Motifs

Marcel Turcotte

School of Electrical Engineering and Computer Science (EECS), University of Ottawa

Version: November 25, 2016

SLIDE 2

Summary

Probabilistic sequence motifs

Hidden Markov Models

SLIDE 3

Plan

  • 1. Introduction
  • 2. Motivational example
  • 3. Formal definitions
  • 4. Worked example
  • 5. Applications

SLIDE 4

Modeling biological sequences

Sequence alignment techniques, such as Needleman & Wunsch or Smith & Waterman, assume that positions along the sequence are independent and identically distributed (i.i.d.): the same substitution matrix (PAM250, BLOSUM62, etc.) is used to weight every substitution of an alignment.

Yet anyone looking at a multiple sequence alignment can see that the amino acid distribution varies greatly from one position to another. Some positions are clearly biased towards hydrophobic, charged or aromatic residues, for example.

SLIDE 5

Modeling biological sequences (cont.)

Regular expressions (REs) can be used to model these variations, e.g. [FAMILY][KREND][ILV][PG] … [ST]. However, REs are too rigid: being deterministic, a regular expression either matches a sequence or does not.

⇒ Probabilistic motifs, in particular Hidden Markov Models (HMMs), elegantly combine the advantages of these two approaches.

SLIDE 6

Motivational example

Based on :

  • A. Krogh (1998) An introduction to hidden Markov models for biological sequences. In S.L. Salzberg, D.B. Searls, S. Kasif (Eds.), Computational Methods in Molecular Biology, Elsevier Science. §4, 45–63.

SLIDE 7

Motivational example

Consider the following aligned DNA sequences.

ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATC

A regular expression representing the above motif could be:

[AT][CG][AC][ACGT]*A[GT][CG]

The expression matches all 5 sequences, the shortest possible sequence is of length 6, and a match must have an A three positions from its end.

SLIDE 8

Motivational example (contd)

Consider the following aligned DNA sequences.

ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATC

Which of the following two sequences is the least likely to be a member of the above family, and why?

TGCT--AGG
ACAC--ATC

SLIDE 9

Motivational example (contd)

[AT][CG][AC][ACGT]*A[GT][CG]

First of all, both sequences are recognized by the above RE!

TGCT--AGG
ACAC--ATC

Therefore, both sequences are good candidates for being members of this family.

Regular expressions are deterministic: a sequence either is a member of the family or is not! In itself, this formalism provides no information for ranking the sequences.
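The claim that the RE accepts the five family members as well as both candidates can be checked mechanically. The sketch below is only an illustration: the helper names `ungap` and `is_member` are hypothetical, and gaps are stripped before matching because the RE describes ungapped sequences.

```python
import re

# The motif from the slides, written as a Python regular expression.
MOTIF = re.compile(r"[AT][CG][AC][ACGT]*A[GT][CG]")

FAMILY = ["ACA---ATG", "TCAACTATC", "ACAC--AGC", "AGA---ATC", "ACCG--ATC"]
CANDIDATES = ["TGCT--AGG", "ACAC--ATC"]


def ungap(aligned: str) -> str:
    """Remove alignment gap characters before matching."""
    return aligned.replace("-", "")


def is_member(aligned: str) -> bool:
    """True when the whole ungapped sequence matches the motif."""
    return MOTIF.fullmatch(ungap(aligned)) is not None


if __name__ == "__main__":
    # The answer is a plain yes/no: nothing here ranks the sequences.
    for seq in FAMILY + CANDIDATES:
        print(seq, is_member(seq))
```

The function returns a Boolean only, which is exactly the limitation discussed above: the two candidates are indistinguishable.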

SLIDE 10

Motivational example (contd)

Consider the following aligned DNA sequences.

ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATC

TGCT--AGG (least likely)
ACAC--ATC (most likely)

Notice, however, that the first candidate (TGCT--AGG) was constructed by selecting the "least likely" symbol at each position (i.e. the one that appears only once in that column), whereas the second one (ACAC--ATC) was constructed by selecting the "most likely" nucleotide at each position; it is therefore a consensus sequence.

SLIDE 11

Motivational example (contd)

A natural way to score a match would be to use the frequencies of occurrence at each position of the motif as estimates of the probabilities of occurrence. For the first sequence, this would mean 1/5 × 1/5 × 1/5 × …, whereas for the second one the probability would be 4/5 × 4/5 × 4/5 × … After the third position, our calculation has to take insertions into account, and a diagram would be useful.
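As a rough illustration of this scoring idea, the sketch below estimates per-column probabilities from the alignment and multiplies them over the first three (gap-free) columns. The function names are hypothetical, and the calculation deliberately stops before the insertion region.

```python
from collections import Counter

ALIGNMENT = ["ACA---ATG", "TCAACTATC", "ACAC--AGC", "AGA---ATC", "ACCG--ATC"]


def column_frequencies(alignment, column):
    """Relative frequency of each symbol in one gap-free alignment column."""
    symbols = [row[column] for row in alignment]
    if "-" in symbols:
        raise ValueError("column contains gaps; it is not a conserved column")
    counts = Counter(symbols)
    return {sym: n / len(symbols) for sym, n in counts.items()}


def prefix_probability(sequence, alignment, n_columns=3):
    """Product of the column frequencies over the first n_columns positions."""
    p = 1.0
    for col in range(n_columns):
        p *= column_frequencies(alignment, col).get(sequence[col], 0.0)
    return p


if __name__ == "__main__":
    print(prefix_probability("TGCT--AGG", ALIGNMENT))  # (1/5)^3
    print(prefix_probability("ACAC--ATC", ALIGNMENT))  # (4/5)^3
```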

SLIDE 12

Motivational example (contd)

[Figure: the alignment (ACA---ATG, TCAACTATC, ACAC--AGC, AGA---ATC, ACCG--ATC) with one match state per conserved column. Emission probabilities of the first three states: state 1, A .8 and T .2; state 2, C .8 and G .2; state 3, A .8 and C .2.]

Let's create a diagram to represent the sequence alignment. Each conserved column of the alignment (i.e. each column that has no gaps) is associated with a box, called a (match) state. A state emits a symbol with a certain probability. Finite state machines that produce an output for each state are called Moore machines.

SLIDE 13

Motivational example (contd)

[Figure: the same diagram, with match states 1 to 3 above the alignment.]

After the third position, sequences 1 and 4 have no insertion at all; in terms of the regular expression, [ACGT]* matches the empty string. Sequences 3 and 5 have one insertion ([ACGT]* consumes one symbol) and, finally, sequence 2 has three ([ACGT]* consumes three symbols).

SLIDE 14

Motivational example (contd)

[Figure: the alignment with match states 1 to 3, a new insertion state 4 (A .2, C .4, G .2, T .2) and match states 5 to 7 (state 5, A 1; state 6, T .8 and G .2; state 7, C .8 and G .2).]

Let’s create a new node (state) to model [ACGT]*. Sequences 1 and 4 do not need to visit that state. Sequences 3 and 5 visit this state once, whilst the second sequence visits the state three times. Make sure to understand how the (emission) probabilities for that state are computed.
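One way to see where the emission probabilities of the insertion state come from is to pool every residue found in the gapped columns and count. The sketch below does this under the assumption that the insertion region is exactly the zero-based columns 3 to 5 of the alignment; the function name is hypothetical.

```python
from collections import Counter

ALIGNMENT = ["ACA---ATG", "TCAACTATC", "ACAC--AGC", "AGA---ATC", "ACCG--ATC"]
INSERT_COLUMNS = range(3, 6)  # assumption: zero-based columns 3, 4 and 5


def insert_emissions(alignment, columns, alphabet="ACGT"):
    """Pool all non-gap residues of the insertion columns and normalise."""
    residues = [row[c] for row in alignment for c in columns if row[c] != "-"]
    counts = Counter(residues)
    total = len(residues)
    return {sym: counts[sym] / total for sym in alphabet}


if __name__ == "__main__":
    # Five residues are inserted in total: A, C, T (sequence 2), C and G.
    print(insert_emissions(ALIGNMENT, INSERT_COLUMNS))
```

Five residues are pooled (A, C and T from sequence 2, C from sequence 3, G from sequence 5), so C, which occurs twice, gets 2/5 and the others 1/5 each.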

SLIDE 15

Motivational example (contd)

[Figure: the same states, now with transition probabilities 1.0 (1 → 2), 1.0 (2 → 3), 0.6 (3 → 4) and 0.4 (3 → 5).]

Transitions 1 → 2 and 2 → 3 occur with probability 1.0. Two out of 5 sequences do not visit state 4 and go directly from state 3 to state 5. Let's assign a (transition) probability of 2/5 to that edge, and 3/5 to the other outgoing edge, so that the probabilities of the edges leaving state 3 sum to 1.

SLIDE 16

Motivational example (contd)

[Figure: same diagram as on the previous slide.]

Five transitions leave state 4 in total. Sequences 3 and 5 move to state 5 immediately after the C and the G, respectively, have been emitted/matched. In the case of sequence 2, after the first A has been matched, two transitions back to state 4 are made, matching C and T, before the transition to state 5 is made. Therefore, 2 out of the 5 transitions leaving state 4 return to state 4 (probability 2/5) and 3 go to state 5 (probability 3/5).
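The counting argument for the edges around the insertion state can be written down as a small sketch. It assumes, as in the diagram, that state 3 precedes the insertion state 4 and that state 5 follows it; the function name and the use of `Fraction` are illustrative choices.

```python
from collections import Counter
from fractions import Fraction

ALIGNMENT = ["ACA---ATG", "TCAACTATC", "ACAC--AGC", "AGA---ATC", "ACCG--ATC"]
INSERT_COLUMNS = range(3, 6)  # assumption: zero-based columns 3, 4 and 5


def insert_transitions(alignment, insert_columns):
    """Estimate the transition probabilities 3->4, 3->5, 4->4 and 4->5."""
    counts = Counter()
    for row in alignment:
        n_inserted = sum(1 for c in insert_columns if row[c] != "-")
        if n_inserted == 0:
            counts[(3, 5)] += 1               # skips the insertion state
        else:
            counts[(3, 4)] += 1               # enters the insertion state
            counts[(4, 4)] += n_inserted - 1  # self-loops
            counts[(4, 5)] += 1               # leaves the insertion state
    totals = Counter()
    for (source, _), n in counts.items():
        totals[source] += n
    return {edge: Fraction(n, totals[edge[0]]) for edge, n in counts.items()}


if __name__ == "__main__":
    for edge, p in sorted(insert_transitions(ALIGNMENT, INSERT_COLUMNS).items()):
        print(edge, p)
```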

SLIDE 17

Motivational example (contd)

[Figure: the complete model. Transition probabilities: 1.0 (1 → 2), 1.0 (2 → 3), 0.6 (3 → 4), 0.4 (3 → 5), 0.4 (4 → 4), 0.6 (4 → 5), 1.0 (5 → 6), 1.0 (6 → 7).]

Finally, transitions from states 5 to 6, and 6 to 7 occur with probability 1. This is the basic idea behind Hidden Markov Models (HMMs), as applied to model sequence motifs.

SLIDE 18

Motivational example (contd)

[Figure: the complete model, as on the previous slide.]

It’s now easy to score the probability of any sequence. In particular the consensus sequence,

P(ACACATC) = 0.8 × 1 × 0.8 × 1 × 0.8 × 0.6 × 0.4 × 0.6 × 1 × 1 × 0.8 × 1 × 0.8 × 1 = 0.0472

SLIDE 19

Motivational example (contd)

[Figure: the complete model, as on the previous slides.]

And for the exceptional one,

P(TGCTAGG) = 0.2 × 1 × 0.2 × 1 × 0.2 × 0.6 × 0.2 × 0.6 × 1 × 1 × 0.2 × 1 × 0.2 × 1 = 0.000023

The consensus sequence is about 2,000 times more likely than the exceptional one (0.0472 / 0.000023 ≈ 2,052 with the rounded values; exactly 2,048 without rounding).
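The two products can be reproduced with a short sketch. It hard-codes the emission and transition probabilities read off the diagram and relies on the fact that, in this linear model, the length of a sequence fixes the number of visits to the insertion state, so the path is unique. All names are hypothetical, and transitions with probability 1 are left out of the product.

```python
MATCH_HEAD = [{"A": 0.8, "T": 0.2}, {"C": 0.8, "G": 0.2}, {"A": 0.8, "C": 0.2}]  # states 1-3
INSERT = {"A": 0.2, "C": 0.4, "G": 0.2, "T": 0.2}                                # state 4
MATCH_TAIL = [{"A": 1.0}, {"T": 0.8, "G": 0.2}, {"C": 0.8, "G": 0.2}]            # states 5-7
P_3_TO_4, P_3_TO_5 = 0.6, 0.4
P_4_TO_4, P_4_TO_5 = 0.4, 0.6


def sequence_probability(sequence):
    """P(S | M) for the motif model; gap characters are ignored."""
    seq = sequence.replace("-", "")
    n_insert = len(seq) - 6          # six match states must each emit once
    if n_insert < 0:
        return 0.0
    head, inserted, tail = seq[:3], seq[3:3 + n_insert], seq[3 + n_insert:]
    p = 1.0
    for symbol, emission in zip(head, MATCH_HEAD):
        p *= emission.get(symbol, 0.0)
    if n_insert == 0:
        p *= P_3_TO_5
    else:
        p *= P_3_TO_4 * P_4_TO_4 ** (n_insert - 1) * P_4_TO_5
        for symbol in inserted:
            p *= INSERT.get(symbol, 0.0)
    for symbol, emission in zip(tail, MATCH_TAIL):
        p *= emission.get(symbol, 0.0)
    return p


if __name__ == "__main__":
    consensus = sequence_probability("ACAC--ATC")
    exception = sequence_probability("TGCT--AGG")
    print(consensus, exception, consensus / exception)
```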

SLIDE 20

Remarks

[Figure: the complete model, as on the previous slides.]

We used to assume that all positions are identically distributed; not anymore! We used to model only the length of the gaps; now the distribution of the inserted symbols is modeled as well. One small problem remains to be fixed: the computed probability depends strongly on the sequence length (the number of times the insertion state is used).

SLIDE 21

Length dependency

To remove the length dependency of the score, the probability of a sequence given the model just described is compared to the probability of that sequence given a random (null) model. As usual, it is convenient to express the ratio as a log-odds score. For our random model, let's assume that nucleotides are equiprobable. For a sequence S of length n, the log-odds score becomes

$$\log_2 \frac{P(S \mid M)}{P(S \mid R)} = \log_2 \frac{P(S \mid M)}{0.25^n}$$

Since both models account for the same n symbols, and since the log of a product is the sum of the logs, each probability value in the hidden Markov model can be transformed into a log-odds score: in our case, each emission probability is divided by 0.25 (or by a background frequency estimated from actual data) before taking its logarithm. Working with sums of logarithms also helps to avoid underflow problems.
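A minimal sketch of this transformation is given below. It sums per-emission log-odds and log transition probabilities along one path instead of multiplying raw probabilities. The lists of values are read off the motif model for the consensus and the exceptional sequence; the function name, the `background` parameter and the `base` parameter are assumptions made for the illustration.

```python
import math

# Probabilities along the path 1, 2, 3, 4, 5, 6, 7 of the motif model.
CONSENSUS_EMISSIONS = [0.8, 0.8, 0.8, 0.4, 1.0, 0.8, 0.8]   # ACACATC
EXCEPTION_EMISSIONS = [0.2, 0.2, 0.2, 0.2, 1.0, 0.2, 0.2]   # TGCTAGG
PATH_TRANSITIONS = [1.0, 1.0, 0.6, 0.6, 1.0, 1.0]


def log_odds(emissions, transitions, background=0.25, base=2.0):
    """Log-odds score of one path against an equiprobable null model."""
    # Each emitted symbol contributes log(e / background) ...
    score = sum(math.log(e / background, base) for e in emissions)
    # ... and each transition contributes log(a); the null model has none.
    score += sum(math.log(a, base) for a in transitions)
    return score


if __name__ == "__main__":
    print(log_odds(CONSENSUS_EMISSIONS, PATH_TRANSITIONS))
    print(log_odds(EXCEPTION_EMISSIONS, PATH_TRANSITIONS))
```

The base is left as a parameter because log-odds scores are reported in bits (base 2) in some places and in natural-log units in others; only the scale changes, not the ranking or the sign.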

SLIDE 22

Length dependency (contd)

                 S           P(S|M) × 100   Log-odds
Consensus        ACAC--ATC   4.7             6.7
Sequence motifs  ACA---ATG   3.3             4.9
                 TCAACTATC   0.0075          3.0
                 ACAC--AGC   1.2             5.3
                 AGA---ATC   3.3             4.9
                 ACCG--ATC   0.59            4.6
Exception        TGCT--AGG   0.0023         -0.97

Notice that two matches cannot be compared in probability space because of the length dependency. Consider the second sequence of the family (TCAACTATC) and the exceptional one: their raw probabilities are of the same order of magnitude (0.0075 versus 0.0023, × 100), but their log-odds scores are quite different (3.0 versus -0.97). The log-odds score of the exceptional sequence is negative, indicating that the null model is a better fit. The tabulated log-odds values correspond to natural logarithms, e.g. ln(0.0472 / 0.25^7) ≈ 6.7 for the consensus; in base 2 the same ratio gives about 9.6.

SLIDE 23

Profile-HMM

[Figure: profile-HMM diagram, position j.]

A particular type of HMM is often used to model families of sequences: the profile-HMM. Profile-HMMs resemble ordinary sequence profiles and make it possible to model insertions and deletions in a position-specific manner.

SLIDE 24

Profile-HMM

[Figure: profile-HMM with a Begin state, an End state and, for each position j, a match state Mj, an insertion state Ij and a delete state Dj.]

The bottom nodes are called main or match states. Each Mj corresponds to a particular column of a multiple sequence alignment, and the probability distribution at that node corresponds to the probability distribution of the column it models.

SLIDE 25

Profile-HMM

[Figure: profile-HMM (Begin, Ij, Mj, Dj, End).]

The diamond-shaped nodes are called insertion states, denoted Ij. They model variable regions; the amino acid probability distribution at those nodes could be set to the overall probability distribution of amino acids, for example.

SLIDE 26

Profile-HMM

[Figure: profile-HMM (Begin, Ij, Mj, Dj, End).]

Finally, the round nodes are called delete or silent states, denoted Dj. They model deletions, i.e. they make it possible to skip certain columns of the alignment without emitting a symbol. As you can see, insertions and deletions are modeled separately, and their respective probabilities are allowed to vary along the profile.

SLIDE 27

Profile-HMM

[Figure: profile-HMM (Begin, Ij, Mj, Dj, End) with an emission distribution over the 20 amino acids.]

⇒ Emission probabilities are now associated with each Mj; in the case of profile-HMMs, they correspond to a column of a profile (alignment).

SLIDE 28

Remarks

The profile-HMM topology (states and interconnections) is specific to bioinformatics. Many other topologies are used in bioinformatics, including specific topologies for modeling eukaryotic gene structures (exons and introns), sequence alignments and trans-membrane proteins. Examples will be seen later.

SLIDE 29

An HMM can be seen as a generative model

[Figure: profile-HMM (Begin, Ij, Mj, Dj, End) with an emission distribution over the 20 amino acids.]

⇒ Starting from the begin state, move to an adjacent state, i, according to some transition probability distribution, emit a symbol according to the emission probability distribution of that state, move to an adjacent state, again according to the transition probabilities, repeat until the end state has been reached.
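The generative reading can be sketched in a few lines. The model below is a hypothetical toy (two match states and one insertion state over the DNA alphabet, with made-up probabilities), not the profile-HMM of the figure; only the sampling loop is the point.

```python
import random

# Hypothetical toy model: state -> list of (next state, probability).
TRANSITIONS = {
    "Begin": [("M1", 1.0)],
    "M1": [("I1", 0.3), ("M2", 0.7)],
    "I1": [("I1", 0.4), ("M2", 0.6)],
    "M2": [("End", 1.0)],
}
# Hypothetical emission distributions: state -> list of (symbol, probability).
EMISSIONS = {
    "M1": [("A", 0.8), ("T", 0.2)],
    "I1": [("A", 0.25), ("C", 0.25), ("G", 0.25), ("T", 0.25)],
    "M2": [("C", 0.8), ("G", 0.2)],
}


def draw(rng, choices):
    """Draw one item from a list of (item, probability) pairs."""
    items, weights = zip(*choices)
    return rng.choices(items, weights=weights, k=1)[0]


def generate(rng):
    """Walk from Begin to End; return the hidden path and the emitted string."""
    state, path, symbols = "Begin", [], []
    while True:
        state = draw(rng, TRANSITIONS[state])
        if state == "End":
            return path, "".join(symbols)
        path.append(state)
        symbols.append(draw(rng, EMISSIONS[state]))


if __name__ == "__main__":
    path, sequence = generate(random.Random(0))
    print(path, sequence)
```

An observer would be handed `sequence` only; `path` is the part that stays hidden.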

SLIDE 30

What's hidden?

[Figure: profile-HMM (Begin, Ij, Mj, Dj, End) with an emission distribution over the 20 amino acids.]

Seen as a generative model, at each step this abstract machine moves to a new state and produces a symbol. The observer only sees the sequence of symbols, not the sequence of state transitions, which is hidden. What is Markovian?

SLIDE 31

Definitions

We need to distinguish between the sequence of states (π) and the sequence of symbols (S). The sequence of states, denoted by π and called the path, is modeled as a Markov chain; its transitions are not directly observable (they are hidden):

$$a_{kl} = P(\pi_i = l \mid \pi_{i-1} = k)$$

where $a_{kl}$ is the probability of a transition from state k to state l. Each state has emission probabilities associated with it:

$$e_k(b) = P(S(i) = b \mid \pi_i = k)$$

the probability of observing/emitting the symbol b when in state k.

SLIDE 32

Definitions

The alphabet of emitted symbols, Σ, the set of (hidden) states, Q, a matrix of transition probabilities, A, and the emission probabilities, E, are the parameters of an HMM, M = <Σ, Q, A, E>.
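One possible way to hold the four components in code is a small record type with a sanity check that every row of A and every emission distribution in E sums to 1. This is a sketch only: the class name `HMM`, its field names and the toy parameter values are assumptions, and begin/end states are left out for brevity.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class HMM:
    alphabet: tuple      # Sigma: the emitted symbols
    states: tuple        # Q: the hidden states
    transitions: dict    # A: transitions[k][l] = a_kl
    emissions: dict      # E: emissions[k][b] = e_k(b)

    def validate(self) -> None:
        """Raise ValueError unless every distribution sums to 1."""
        for k in self.states:
            row = sum(self.transitions[k].get(l, 0.0) for l in self.states)
            if not math.isclose(row, 1.0):
                raise ValueError(f"transitions out of {k!r} sum to {row}")
            total = sum(self.emissions[k].get(b, 0.0) for b in self.alphabet)
            if not math.isclose(total, 1.0):
                raise ValueError(f"emissions of {k!r} sum to {total}")


# Hypothetical two-state example; the numbers are for illustration only.
TOY = HMM(
    alphabet=("0", "1"),
    states=("F", "L"),
    transitions={"F": {"F": 0.9, "L": 0.1}, "L": {"F": 0.2, "L": 0.8}},
    emissions={"F": {"0": 0.5, "1": 0.5}, "L": {"0": 0.25, "1": 0.75}},
)

if __name__ == "__main__":
    TOY.validate()
    print("parameters are consistent")
```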

SLIDE 33

Interesting questions

  • 1. P(S, π): the joint probability of a sequence of symbols S and a sequence of states π. The decoding problem consists of finding a path π such that P(S, π) is maximum;
  • 2. P(S|θ): the probability of a sequence of symbols S given the model θ. It represents the likelihood that sequence S has been produced by this HMM; let's call this the likelihood problem;
  • 3. Finally, how are the parameters of the model (HMM), θ, determined? Let's call this the parameter estimation problem.

SLIDE 34

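Definitions

[Figure: profile-HMM (Begin, Ij, Mj, Dj, End).]

Joint probability of a sequence of symbols S and a sequence of states π:

$$P(S, \pi) = a_{0\pi_1} \prod_{i=1}^{L} e_{\pi_i}(S(i)) \, a_{\pi_i \pi_{i+1}}$$

where L is the length of S, state 0 is the begin state and $\pi_{L+1}$ is the end state. For example,

P(S = VGPGGAHA, π = BEG, M1, M2, I3, I3, I3, M3, M4, M5, END)

⇒ In practice, however, the state sequence π is not known in advance.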
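The product can be evaluated directly once a path is given. The sketch below uses a hypothetical two-state model rather than the profile-HMM of the figure, whose parameters are not listed; `start` plays the role of the a_{0k} terms, and the optional `end` argument supplies the final transition into the end state when a model defines one.

```python
def joint_probability(symbols, path, start, transitions, emissions, end=None):
    """P(S, pi): start term, then emission and transition terms along the path."""
    if len(symbols) != len(path) or not path:
        raise ValueError("symbols and path must be non-empty and of equal length")
    p = start[path[0]]                                   # a_{0, pi_1}
    for i, (state, symbol) in enumerate(zip(path, symbols)):
        p *= emissions[state][symbol]                    # e_{pi_i}(S(i))
        if i + 1 < len(path):
            p *= transitions[state][path[i + 1]]         # a_{pi_i, pi_{i+1}}
        elif end is not None:
            p *= end[state]                              # transition to End
    return p


# Hypothetical parameters, for illustration only.
START = {1: 0.5, 2: 0.5}
TRANS = {1: {1: 0.9, 2: 0.1}, 2: {1: 0.2, 2: 0.8}}
EMIT = {1: {0: 0.5, 1: 0.5}, 2: {0: 0.25, 1: 0.75}}

if __name__ == "__main__":
    print(joint_probability((0, 1), (1, 1), START, TRANS, EMIT))
    print(joint_probability((0, 1), (1, 2), START, TRANS, EMIT))
```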

SLIDE 35

Worked example: the occasionally dishonest player

A simplified example will help to better understand the characteristics of HMMs. I want to play a game. I will be tossing a coin n times. The outcomes can be represented as follows: { H, T, T, H, T, T, … } or { 0, 1, 1, 0, 1, 1, … }.

In fact, I will be using two coins! One is fair, i.e. heads and tails are equiprobable outcomes, but the other one is loaded (biased): it returns heads with probability 1/4 and tails with probability 3/4.

I will not reveal when I am exchanging the coins. This information is hidden from you.

Objective: looking at a series of observations, S, can you predict when the exchanges of coins occurred?

SLIDE 36

Worked example: the occasionally dishonest player

[Figure: two-state HMM. State π1 emits with P(0) = 1/2 and P(1) = 1/2; state π2 emits with P(0) = 1/4 and P(1) = 3/4. The edges carry the transition probabilities .9, .1, .8 and .2.]

Such a game can be modeled using an HMM where each state represents a coin, with its own emission probability distribution, and the transition probabilities represent exchanging the coins.

SLIDE 37

Worked example: the occasionally dishonest player

[Figure: the two-state HMM of the previous slide.]

Given an input sequence of heads and tails, such as 0, 1, 1, 0, 1, 1, 1, which sequence of states has the highest probability?

SLIDE 38

Worked example: the occasionally dishonest player

S   0   1   1   0   1   1   1
π   π1  π1  π1  π1  π1  π1  π1
π   π1  π1  π1  π1  π1  π1  π2
. . .
π   π2  π2  π1  π1  π2  π2  π2
. . .
π   π2  π2  π2  π2  π2  π2  π2

Since the game consists of predicting the series of switches from one coin to the other, selecting the path with the highest joint probability, P(S, π), seems appropriate. Here, there are 2^7 = 128 possible paths, so enumerating all of them is feasible; in general, however, the number of states, the sequence length and consequently the number of possible paths are much larger.
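For a sequence this short, the enumeration can actually be carried out. The sketch below makes two assumptions that the slides do not state: both coins are equally likely at the start, and the edge labels of the diagram are read as a11 = .9, a12 = .1, a21 = .2 and a22 = .8. With other readings the winning path may differ.

```python
from itertools import product

STATES = (1, 2)                                        # 1: fair coin, 2: loaded coin
START = {1: 0.5, 2: 0.5}                               # assumption: not given in the slides
TRANS = {1: {1: 0.9, 2: 0.1}, 2: {1: 0.2, 2: 0.8}}     # assumed reading of the diagram
EMIT = {1: {0: 0.5, 1: 0.5}, 2: {0: 0.25, 1: 0.75}}
S = (0, 1, 1, 0, 1, 1, 1)


def joint(symbols, path):
    """Joint probability P(S, pi) of the observations and one state path."""
    p = START[path[0]] * EMIT[path[0]][symbols[0]]
    for prev, cur, symbol in zip(path, path[1:], symbols[1:]):
        p *= TRANS[prev][cur] * EMIT[cur][symbol]
    return p


def best_path_brute_force(symbols):
    """Enumerate all |Q|^L paths and keep the most probable one."""
    all_paths = product(STATES, repeat=len(symbols))
    return max(all_paths, key=lambda path: joint(symbols, path))


if __name__ == "__main__":
    best = best_path_brute_force(S)
    print(best, joint(S, best))
```

The cost grows as |Q|^L, which is why the enumeration is only practical for toy examples.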

SLIDE 39

The decoding problem

Given an observed sequence of symbols, S, the decoding problem consists of finding a sequence of states, π, such that the joint probability of S and π is maximum:

$$\operatorname{argmax}_{\pi} P(S, \pi)$$

For our game, the sequence of states is of interest because it serves to predict the exchanges of coins.

SLIDE 40

The decoding problem

If the observed sequence of symbols was of length one, the sequence of states would also be of length one (in our restricted example). Which state would you predict if the observed symbol was a 0? What if it was a 1?

Now consider an observed sequence of length two and assume that the last symbol is 1. What is the probability of that symbol being emitted from state π1? There are two ways of ending up in π1 while producing S(2):

  • 1. S(1) could have been produced from π1, and the state remained π1, or
  • 2. S(1) could have been produced from π2, and there was a transition from π2 to π1.

The two joint probabilities would be

SLIDE 41

The decoding problem (cont.)

P(S(1)|π1)P(π1 → π1)P(S(2)|π1) and P(S(1)|π2)P(π2 → π1)P(S(2)|π1).

SLIDE 42

The decoding problem

Now consider an observed sequence of length three, and assume that the last symbol is 1. What is the probability of that symbol being emitted from state π1? There are two ways of ending up in π1 while producing S(3): 1) the last state that led to the production of the sequence of symbols S[1, 2] was π1 and the state remained π1, or 2) the last state that led to the production of S[1, 2] was π2 and it is followed by a transition from π2 to π1, with probability $a_{21}$.

Let's define $v_k(i)$ as the probability of the most probable path ending in state k after producing the first i observations. Using this notation, the two scenarios above combine into

$$v_1(3) = \max [\, v_1(2) \times a_{11} \times e_1(1),\; v_2(2) \times a_{21} \times e_1(1) \,]$$

SLIDE 43

The decoding problem

For our two-state HMM, we can write the following equations:

$$v_1(i) = \max [\, v_1(i-1) \times a_{11} \times e_1(S(i)),\; v_2(i-1) \times a_{21} \times e_1(S(i)) \,]$$
$$v_2(i) = \max [\, v_1(i-1) \times a_{12} \times e_2(S(i)),\; v_2(i-1) \times a_{22} \times e_2(S(i)) \,]$$

SLIDE 44

The decoding problem

[Figure: trellis of the two states π1 and π2 over the observed symbols.]

SLIDE 45

The decoding problem

The most probable path can be found recursively. The score for the most probable path ending in state l with observation i, noted $v_l(i)$, is given by

$$v_l(i) = e_l(S(i)) \max_k [\, v_k(i-1)\, a_{kl} \,]$$

SLIDE 46

The decoding problem (cont.)

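[Figure: states k, ..., each carrying the value $v_k(i-1)$, connected by transitions $a_{kl}$ to state l, which emits S(i) with probability $e_l(S(i))$.]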

SLIDE 47

The decoding problem (cont.)

In the recurrence for $v_l(i)$, k runs over the states for which $a_{kl}$ is defined.

SLIDE 48

The decoding problem

The algorithm for solving the decoding problem is known as the Viterbi algorithm. It finds the best (most probable) path using dynamic programming.

Initialization: $v_0(0) = 1$, $v_k(0) = 0$ for $k > 0$
Recurrence: $v_l(i) = e_l(S(i)) \max_k (\, v_k(i-1)\, a_{kl} \,)$

where $v_k(i)$ represents the probability of the most probable path ending in state k at position i in S. A (backward) pointer is kept from l to the value of k that maximizes $v_k(i-1)\, a_{kl}$.

⇒ Implementation issue: because products of many small probabilities lead to underflow, the algorithm is implemented using the logarithms of the values, so that the products become sums.
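The recurrence, the backward pointers and the logarithm trick can be put together in a few lines. What follows is a minimal sketch, not the course's reference implementation: the function name is made up, the parameter values are taken from the Perl example in these slides, and, like that example, it starts from the emission of the first symbol rather than from an explicit begin state.

```python
import math

T = [[0.9, 0.1], [0.2, 0.8]]      # transition probabilities
E = [[0.50, 0.50], [0.05, 0.95]]  # emission probabilities

def viterbi_log(S, T, E):
    """Return (most probable path, its log-probability) for the sequence S."""
    n_states = len(T)
    log = lambda p: math.log(p) if p > 0 else float("-inf")
    # Initialization (assumption: no start-state term, as in the Perl code).
    v = [[log(E[k][S[0]]) for k in range(n_states)]]
    back = []
    for i in range(1, len(S)):
        row, ptr = [], []
        for l in range(n_states):
            # Sums of logs replace products of probabilities.
            k_best = max(range(n_states), key=lambda k: v[-1][k] + log(T[k][l]))
            row.append(log(E[l][S[i]]) + v[-1][k_best] + log(T[k_best][l]))
            ptr.append(k_best)  # backward pointer from l to the best k
        v.append(row)
        back.append(ptr)
    # Traceback: start from the best final state and follow the pointers.
    state = max(range(n_states), key=lambda k: v[-1][k])
    best_log = v[-1][state]
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path, best_log

S = [0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1]
path, best_log = viterbi_log(S, T, E)
print(path, math.exp(best_log))
```

With plain products, a sequence of a few thousand symbols would already underflow to zero; the log-space version keeps returning a finite score.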

SLIDE 50

The decoding problem

[Figure: Viterbi trellis, with the states π1, π2, ..., πm as rows and the observed symbols S(1), S(2), S(3), ..., S(n-1), S(n) as columns.]

SLIDE 51

The decoding problem

# transition probabilities (t)
$t[0][0] = 0.9; $t[0][1] = 0.1;
$t[1][0] = 0.2; $t[1][1] = 0.8;

# emission probabilities (e)
$e[0][0] = 0.50; $e[0][1] = 0.50;
$e[1][0] = 0.05; $e[1][1] = 0.95;

# observed sequence (S)
@S = (0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1);

# initialization (d is the dynamic programming table)
$d[ 0 ][ 0 ] = $e[ 0 ][ $S[ 0 ] ];
$d[ 1 ][ 0 ] = $e[ 1 ][ $S[ 0 ] ];

SLIDE 52

The decoding problem

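# recurrence: fill the table one position (column) at a time
for ( $j=1; $j < @S; $j++ ) {
    for ( $i=0; $i <= 1; $i++ ) {
        $m = 0;
        # find the predecessor state k that maximizes the score
        for ( $k=0; $k <= 1; $k++ ) {
            $v = $d[$k][$j-1]*$t[$k][$i]*$e[$i][$S[$j]];
            if ( $v > $m ) {
                $from = $k; $to = $i; $m = $v;
            }
        }
        $d[ $i ][ $j ] = $m;
        $tr[ $i ][ $j ] = "($from->$to)";   # backward pointer
    }
}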

SLIDE 53

The decoding problem

# print the table: for each state, one row of scores, then one row of pointers
for ( $i=0; $i <= 1; $i++ ) {
    for ( $j=0; $j < @S; $j++ ) {
        printf "\t%5.5f", $d[ $i ][ $j ];
    }
    print "\n";
    for ( $j=0; $j < @S; $j++ ) {
        printf "\t %s", $tr[ $i ][ $j ];
    }
    print "\n";
}

SLIDE 54

The decoding problem

t[0][0] = 0.9 ; t[0][1] = 0.1 ; t[1][0] = 0.2 ; t[1][1] = 0.8 ;
e[0][0] = 0.50 ; e[0][1] = 0.50 ; e[1][0] = 0.05 ; e[1][1] = 0.95 ;

S     0        1        0        1        0        1        1        1        1        1        1        1
d[0]  0.50000  0.22500  0.10125  0.04556  0.02050  0.00923  0.00415  0.00187  0.00084  0.00038  0.00017  0.00008
               (0->0)   (0->0)   (0->0)   (0->0)   (0->0)   (0->0)   (0->0)   (0->0)   (0->0)   (0->0)   (0->0)
d[1]  0.05000  0.04750  0.00190  0.00962  0.00038  0.00195  0.00148  0.00113  0.00086  0.00065  0.00049  0.00038
               (0->1)   (1->1)   (0->1)   (1->1)   (0->1)   (1->1)   (1->1)   (1->1)   (1->1)   (1->1)   (1->1)

SLIDE 55

The decoding problem

Given an HMM representing a protein family as well as an unknown protein sequence, the solution to the decoding problem reveals the internal structure of the unknown sequence, showing the location of the insertions and deletions, core elements, etc.

SLIDE 56

The likelihood problem : calculating P(S|θ)

In the case of a Markov chain there is a single path for a given sequence S, and therefore P(S|θ) is given by

$$P(S \mid \theta) = P(S(1)) \prod_{i=2}^{n} a_{S(i-1)S(i)}$$

In the case of an HMM, several paths produce the same S (some paths are more likely than others), and P(S|θ) is defined as the sum of the probabilities of all possible paths producing S:

$$P(S \mid \theta) = \sum_{\pi} P(S, \pi)$$

The number of paths grows exponentially with the length of the sequence, so the paths cannot simply be enumerated and summed.
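For a very short sequence the sum over paths can still be written out directly, which makes the exponential growth visible. The sketch below is an illustration under assumptions: it reuses the two-state parameters of the Perl decoding example, omits a start-state term as that example does, and uses invented function names.

```python
from itertools import product

T = [[0.9, 0.1], [0.2, 0.8]]      # transition probabilities
E = [[0.50, 0.50], [0.05, 0.95]]  # emission probabilities

def joint(S, path):
    """P(S, path); assumption: no start-state term, as in the Perl code."""
    p = E[path[0]][S[0]]
    for i in range(1, len(S)):
        p *= T[path[i - 1]][path[i]] * E[path[i]][S[i]]
    return p

def likelihood_by_enumeration(S, n_states=2):
    """Sum P(S, path) over every path; returns (sum, number of paths)."""
    paths = list(product(range(n_states), repeat=len(S)))
    return sum(joint(S, p) for p in paths), len(paths)

S = [0, 1, 0, 1]
total, n_paths = likelihood_by_enumeration(S)
print(n_paths, total)
```

Four symbols and two states already give 16 paths; each additional symbol doubles that number.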

SLIDE 57

The likelihood problem : forward algorithm

Modifying the Viterbi algorithm by replacing the maximization with a sum yields the probability of the observed sequence up to position i, ending in state l:

$$f_l(i) = e_l(S(i)) \sum_k f_k(i-1)\, a_{kl}$$
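The forward recurrence translates almost line for line into code. This is a hedged sketch rather than a definitive implementation: the function name is hypothetical, the parameters are those of the Perl decoding example, the first column is initialized from the emissions alone (as in that example), and, since the model has no explicit end state, the final values are simply summed.

```python
T = [[0.9, 0.1], [0.2, 0.8]]      # transition probabilities a_kl
E = [[0.50, 0.50], [0.05, 0.95]]  # emission probabilities e_l(s)

def forward(S, T, E):
    """P(S | model) by the forward algorithm, in time O(len(S) * states**2)."""
    n = len(T)
    # f[l] holds f_l(i) for the current position i.
    f = [E[k][S[0]] for k in range(n)]
    for i in range(1, len(S)):
        f = [E[l][S[i]] * sum(f[k] * T[k][l] for k in range(n))
             for l in range(n)]
    # Assumption: no end state, so all final states are accepted.
    return sum(f)

print(forward([0, 1, 0, 1], T, E))
```

The result equals the sum of P(S, π) over all paths, but the work grows linearly with the length of S instead of exponentially.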

SLIDE 58

The likelihood problem

The score $f_l(i)$, which represents the probability of the sequence up to (and including) S(i) with the path ending in state l, is given by

$$f_l(i) = e_l(S(i)) \sum_k [\, f_k(i-1)\, a_{kl} \,]$$

SLIDE 59

The likelihood problem (cont.)

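[Figure: states k, ..., each carrying the value $f_k(i-1)$, connected by transitions $a_{kl}$ to state l, which emits S(i) with probability $e_l(S(i))$.]

In the recurrence for $f_l(i)$, k runs over the states for which $a_{kl}$ is defined.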

SLIDE 60

Forward Algorithm

Can you think of an application for the forward algorithm? Pfam is a large collection of HMMs covering many common protein domains and families, with one HMM per domain or family; version 30.0 (June 2016) contains 16306 families. Given a new sequence, the forward algorithm can be used to find the family to which it belongs (if any). ⇒ pfam.xfam.org
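The idea of assigning a sequence to the best-scoring model can be illustrated with two toy HMMs. Everything in this sketch is hypothetical: the model names, their parameters and the `assign` helper are invented for illustration, the alphabet is binary rather than amino acids, and no claim is made about how Pfam's own tools score sequences. The forward pass is rescaled at each position so that long sequences do not underflow.

```python
import math

def forward_loglik(S, T, E):
    """log P(S | model) using the forward algorithm with rescaling."""
    n = len(T)
    f = [E[k][S[0]] for k in range(n)]
    loglik = 0.0
    for i in range(1, len(S)):
        scale = sum(f)
        if scale == 0.0:
            return float("-inf")
        loglik += math.log(scale)          # accumulate the factored-out mass
        f = [x / scale for x in f]
        f = [E[l][S[i]] * sum(f[k] * T[k][l] for k in range(n))
             for l in range(n)]
    total = sum(f)
    return loglik + math.log(total) if total > 0 else float("-inf")

# Two made-up "families", each described by (transitions, emissions).
MODELS = {
    "coin_game":  ([[0.9, 0.1], [0.2, 0.8]], [[0.50, 0.50], [0.05, 0.95]]),
    "zeros_rich": ([[0.5, 0.5], [0.5, 0.5]], [[0.90, 0.10], [0.80, 0.20]]),
}

def assign(S, models=MODELS):
    """Name of the model under which S has the highest likelihood."""
    return max(models, key=lambda name: forward_loglik(S, *models[name]))

print(assign([0] * 10), assign([1] * 10))
```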

SLIDE 61

Model Specification

We now turn to our third and final question: how do we determine the parameters of the model? Let x1, ..., xm be m independent examples forming the training set (typically, m sequences). The objective is to find a set of parameters, θ, that maximizes the likelihood of the training set:

$$\max_{\theta} \prod_{i=1}^{m} P(x_i \mid \theta)$$

SLIDE 62

Model Specification

Structure: states + interconnect (this is an occasion to include domain-specific information!);
Estimating the transition/emission probabilities.

SLIDE 63

Modeling the length

At least 5 symbols long

SLIDE 64

Modeling the length (cont.)

2 to 8 symbols long

SLIDE 65

Arbitrary Deletions

Too expensive, too many parameters to evaluate !

SLIDE 66

Arbitrary Deletions (cont.)

Silent (null) states do not emit symbols. ⇒ Silent states prevent modeling specific distant transitions.

SLIDE 67

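Profile HMMs

[Figure: profile HMM running from Begin to End, with an insert state $I_j$, a match state $M_j$ and a delete state $D_j$ at each position j.]

⇒ Models insertions and deletions separately.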

SLIDE 68

Trans-membrane (helical) proteins

SLIDE 69

Trans-membrane (helical) proteins (cont.)

www.cbs.dtu.dk/services/TMHMM-2.0

SLIDE 70

Gene prediction

[Figure: structure of a gene, drawn 5' to 3'. Labels: flanking regions; CAAT box, GC boxes and TATA box; transcription initiation; 5'UTR; initiation codon; Exon 1; Intron I (GT ... AG); Exon 2; ...; Exon n; stop codon; 3'UTR; poly(A).]

SLIDE 71

Gene prediction (cont.)

[Figure: GENSCAN state diagram, with states E0, E1, E2, I0, I1, I2, Einit, Eterm, Esngl, 5'UTR, 3'UTR, Promo, PolyA and Inter.]

genes.mit.edu/GENSCAN.html

SLIDE 72

The parameter estimation problem

Problem: estimate the $a_{st}$ and $e_k(b)$ probabilities. Given:

a fixed topology;
n independent positive examples: S1, S2, ..., Sn.

$$\log P(S_1, S_2, \ldots, S_n \mid \theta) = \sum_{j=1}^{n} \log P(S_j \mid \theta)$$

Two scenarios:

The paths are known (CG islands, secondary structure, gene prediction);
The paths are unknown.

SLIDE 73

Parameters estimation/known paths

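Maximum likelihood estimators are

$$a_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'}} \quad \text{and} \quad e_k(b) = \frac{E_k(b)}{\sum_{b'} E_k(b')}$$

where $A_{kl}$ is the number of times the transition from k to l is used in the training paths, and $E_k(b)$ is the number of times state k emits the symbol b.

Necessitates a large number of positive examples;
If a state k is not visited, then both numerator and denominator are zero;
P(x, π) is a product of probabilities: what happens if an arc/emission probability is zero?

Work around?

$$A_{kl} \leftarrow A_{kl} + r_{kl} \qquad E_k(b) \leftarrow E_k(b) + r_k(b)$$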

SLIDE 74

Parameters estimation/known paths (cont.)

Here $r_{kl}$ and $r_k(b)$ are pseudocounts added to the observed counts. The simplest pseudocounts would be $r_{kl} = 1$ and $r_k(b) = 1$. Better pseudocounts would reflect our prior bias, using the observed frequencies of amino acids or values derived from substitution scores.
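When the paths are known, estimation reduces to counting. The following is a minimal sketch under stated assumptions: the function name, its arguments and the tiny training example are invented, states and symbols are encoded as small integers, and a constant pseudocount is used for every transition and emission.

```python
def estimate(sequences, paths, n_states, n_symbols, r_trans=1.0, r_emit=1.0):
    """Estimate (a, e) from labelled examples, with constant pseudocounts.

    sequences[j][i] is the symbol at position i of example j and
    paths[j][i] is the (known) state that emitted it.
    """
    # Counts start at the pseudocount value instead of zero.
    A = [[r_trans] * n_states for _ in range(n_states)]
    Ecount = [[r_emit] * n_symbols for _ in range(n_states)]
    for S, path in zip(sequences, paths):
        for i, (s, k) in enumerate(zip(S, path)):
            Ecount[k][s] += 1
            if i > 0:
                A[path[i - 1]][k] += 1
    # Normalize each row; with zero pseudocounts an unvisited state
    # would give 0/0 here, which is the problem the slide points out.
    a = [[c / sum(row) for c in row] for row in A]
    e = [[c / sum(row) for c in row] for row in Ecount]
    return a, e

a, e = estimate(sequences=[[0, 1, 1]], paths=[[0, 0, 1]],
                n_states=2, n_symbols=2)
print(a, e)
```

In this toy run, state 1 is seen emitting only the symbol 1, yet the pseudocount keeps its estimated probability of emitting 0 above zero.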

SLIDE 75

Parameters estimation : remarks

Some (emission, transition) probabilities can be zero; if this is the case, then all paths involving those probabilities have probability zero as well. In particular, this happens if the number of sequences used to build the model is low: “strong conclusions would be drawn from very little evidence”. To circumvent that problem, pseudocounts are added prior to calculating the frequencies. The simplest pseudocounts consist in initializing all the counts to one, rather than zero, before counting the number of occurrences of each event. In the case of the emission probabilities, this amounts to assuming that all amino acids are equiprobable. Since counts don’t need to be integers, another solution is to initialize each count with a value between zero and one, proportional to the overall distribution of the amino acids.
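Fractional pseudocounts proportional to a background distribution can be sketched as follows. The function name, the `weight` parameter (the total pseudocount mass shared among the symbols) and the four-letter example are assumptions made for illustration; with `weight=1.0`, each pseudocount lies between zero and one.

```python
def emission_with_background(counts, background, weight=1.0):
    """Emission probabilities from counts plus background-shaped pseudocounts.

    counts[b]     : observed number of emissions of symbol b
    background[b] : overall frequency of symbol b (sums to 1)
    weight        : hypothetical total pseudocount mass to distribute
    """
    total = sum(counts) + weight
    return [(c + weight * q) / total for c, q in zip(counts, background)]

# Toy example: one symbol observed 3 times, three symbols never observed.
probs = emission_with_background([3, 0, 0, 0], [0.25, 0.25, 0.25, 0.25])
print(probs)
```

Unseen symbols keep a small non-zero probability, and symbols that are common overall receive a larger share of the pseudocount mass than rare ones.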

SLIDE 76

Parameters estimation : remarks

More sophisticated pseudocounts would reflect the distribution of the amino acids at that position. For example, if leucine occurs with a high frequency at that position, you would expect isoleucine to occur with a high frequency too, but not arginine: in the PAM250 scoring matrix, the score for substituting leucine with isoleucine is 2.80, whilst the score for substituting leucine with arginine is -2.2.

SLIDE 77

Parameters estimation/unknown paths

It is also possible to estimate the emission/transition probabilities when the paths are unknown. In the case of profile HMMs, it is possible to estimate the parameters of the HMM starting with a set of unaligned sequences. The details of these methods are complex, but the general idea is as follows: the model is initialized with more or less random values (we say more or less because one can use prior knowledge about the distribution of the amino acids, or a rough sequence alignment, as a starting point). The model is used to align the sequences from the training set, and the alignment is then used to improve the parameters of the model. The “improved” model is used to align the training sequences again; in general, this leads to a slightly improved alignment, which is used again to improve the probabilities of the model. The process is repeated until no improvement of the sequence alignment is observed.

This scheme for parameter estimation is called “Expectation-Maximization”; one of the standard algorithms for model estimation is called the Baum-Welch or forward-backward algorithm. One of the main limitations of this technique is that it converges toward a local optimum, i.e. it is not guaranteed to find the most probable model given the observed data.

SLIDE 79

Expectation-Maximization (EM) algorithm

  • 1. Choose an initial model. If no prior information is available, make all the transition probabilities equiprobable, and similarly for the emission probabilities;

  • 2. Use the decoding algorithm to find the maximum likelihood path for each input sequence;

  • 3. Using these alignments, tally statistics for estimating all $a_{kl}$ and $e_k(b)$ values;

  • 4. Repeat steps 2 and 3 until the parameter estimates converge.
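The loop "decode, then re-estimate from the decoded paths" can be sketched in code; this decoding-based variant is often called Viterbi training. The sketch rests on several assumptions: all function names are invented, pseudocounts of one are added at every re-estimation, the loop stops when the decoded paths no longer change (with an iteration cap as a safeguard), and the initial model is passed in by the caller. A fully equiprobable start is deliberately not used here, because it makes the states indistinguishable to the decoder.

```python
import math

def viterbi(S, T, E):
    """Most probable state path for S (log space, no start-state term)."""
    log = lambda p: math.log(p) if p > 0 else float("-inf")
    n = len(T)
    v = [log(E[k][S[0]]) for k in range(n)]
    back = []
    for s in S[1:]:
        ptr = [max(range(n), key=lambda k: v[k] + log(T[k][l])) for l in range(n)]
        v = [log(E[l][s]) + v[ptr[l]] + log(T[ptr[l]][l]) for l in range(n)]
        back.append(ptr)
    state = max(range(n), key=lambda k: v[k])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]

def reestimate(seqs, paths, n, m, r=1.0):
    """Step 3: tally transitions and emissions along the paths, then normalize."""
    A = [[r] * n for _ in range(n)]
    B = [[r] * m for _ in range(n)]
    for S, path in zip(seqs, paths):
        for i, (s, k) in enumerate(zip(S, path)):
            B[k][s] += 1
            if i > 0:
                A[path[i - 1]][k] += 1
    norm = lambda rows: [[c / sum(row) for c in row] for row in rows]
    return norm(A), norm(B)

def viterbi_training(seqs, T, E, max_iter=50):
    n, m = len(T), len(E[0])
    paths = None
    for iteration in range(1, max_iter + 1):
        new_paths = [viterbi(S, T, E) for S in seqs]   # step 2
        if new_paths == paths:                         # step 4: converged
            break
        paths = new_paths
        T, E = reestimate(seqs, paths, n, m)           # step 3
    return T, E, paths, iteration

seqs = [[0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 0, 0, 1, 0]]
T0 = [[0.9, 0.1], [0.2, 0.8]]
E0 = [[0.50, 0.50], [0.05, 0.95]]
T1, E1, paths, n_iter = viterbi_training(seqs, T0, E0)
print(n_iter, T1, E1)
```

Like the scheme it illustrates, this loop can stop at a local optimum, so the result depends on the initial model.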

SLIDE 80

Summary

Like Markov chains, Hidden Markov Models (HMMs) consist of a finite number of states, π1, π2, ..., and transition probabilities, P(πi → πj).
Unlike Markov chains, HMMs also “emit” a symbol (letter) at each state (or at most states, when silent states are present).
Sequence of states: π = π1, π2, ...
Sequence of observed symbols: S = S(1), S(2), ...
Given a new observation, the sequence of symbols is known (observed) but the sequence of states is not: “it is hidden”.

SLIDE 81

Historical note

HMMs were first developed for solving speech recognition problems in the early 1970s.

  • 1. A speech signal is divided into frames of 10 to 20 milliseconds;

  • 2. A process called vector quantization assigns a predefined category to each frame (typically 256 predefined categories);

The input is now represented as a long sequence of category labels (symbols). The next task is to recognize words in this long sequence of categories. However, variations are observed (that will be seen as category substitutions, insertions and deletions). HMMs were developed in such a context.

SLIDE 82

Historical note (cont.)

SLIDE 83

Example : CG islands

[From Durbin et al., Biological Sequence Analysis.] Certain regions of the human genome are known as CG (or CpG) islands. These regions, located around promoters or start regions of many genes, show a higher frequency of CG dinucleotides than elsewhere *. Those regions are a few hundred to a few thousand bases long.

*. This is because whenever C is followed by G, the chances that C will be methylated (a CH3 group is added to its base) are higher. Methylated Cs also mutate to T with high frequency; therefore, CG dinucleotides are observed less frequently than expected by chance, P(C) × P(G). Finally, methylation is suppressed in biologically important regions such as the start of a gene, which explains why CG dinucleotides occur there more frequently than elsewhere.
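The comparison in the footnote, observed CG frequency against the frequency P(C) × P(G) expected by chance, is easy to compute for a given stretch of DNA. The helper below is a hypothetical illustration (its name and the toy inputs are not from the slides); it estimates P(C) and P(G) from the sequence itself.

```python
def cpg_observed_expected(seq):
    """Ratio of the observed CG dinucleotide frequency to P(C) * P(G).

    Returns None when the ratio is undefined (sequence too short,
    or no C or no G at all).
    """
    seq = seq.upper()
    n = len(seq)
    if n < 2:
        return None
    p_c = seq.count("C") / n
    p_g = seq.count("G") / n
    # Frequency of CG among the n - 1 adjacent pairs.
    observed = sum(1 for i in range(n - 1) if seq[i:i + 2] == "CG") / (n - 1)
    expected = p_c * p_g
    return observed / expected if expected > 0 else None

print(cpg_observed_expected("CGCGCGCG"))   # CG-rich toy sequence
print(cpg_observed_expected("CCGG"))
```

A ratio well below one corresponds to the CG depletion described in the footnote, while values near or above one are what one would look for inside an island.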

SLIDE 84

Example : CG islands

Problem 1 : Given an unlabeled (short) sequence of DNA, can we decide if it comes from a CG island or not ? ⇒ promoter : a site on DNA to which RNA polymerase will bind and initiate transcription.

SLIDE 85

Example : CG islands (cont.)

LOCUS       AL162458 150791 bp DNA PRI 29-SEP-2000
DEFINITION  Human DNA sequence from clone RP11-465L10 on chromosome 20.
misc_feature 20670..21997 /note="CpG island"
20641 ...C GCGCGGGTGC CAGGACCCAG GTCCTTGCTA
20701 CGTCCGGAGC CTACGTCACC ACGATGCCTC CCCTGGGCCG GCGGCAGAAC CCGAGACCCC
20761 CGCAGGTTCT AAGACAGCCC CCACGCCCCC CAGTGCGCAC GCTCAGTCCA ACCCCGCCGC
20821 GCACCGCCCA CCGCGAACAT CCGGCTCCTG CGTGTGTGCT CGAGGGGGAA ACTGAGGCGG
20881 GGACGTGCCA GTGAATTCAT TCCTTCCTCA GTCCACCCGC AGGCCTACAA AGCTGTCTCC
20941 CCTTCCTCAG CGCCACAAGG AACAGCAGGG ACGGATGGGA AGAAGGGGAG GGGGCCGAAA
21001 GCAAGCTGGG TGCGAGGAGC CAGCCGACCC TGCCACACTC AAGATGGCGG CGCGGCCGCG
21061 GCGAGGTCCC TCAGAGGCGG TACCAGCGCA TGCGCAGCGC GGAGTCCCGG CCCGGGACAC
21121 AAGATGGCGG CAGCGGCGCT GGGGAGGGCG AGGCGGAGGC GGCAAAACGG GCGGTCGAGC
21181 AGAACGTGTA GCCGCGTCCC CTCCAGTCCG CTCCGGGCAG GTAAGAGTCC CAGGAAGCCA
21241 TGGTCCCGCA GCGAGCCGCG CCAGGGTCTG GGGATCCGAA GCTGGGGGGC GGCGGCCCCT
21301 CCGGCGCTTT CTGCTCGGGA CTGCCGCTTG CCCTGTCTCT GTTGCCGCCG CCATCTTAGA
21361 CCCGCGGGTG GGCGGCCGCG CCGGTGGCCG AAGTGAGGGA GGTGGGCCCG GAGAGCCCCA
21421 GCGGAGCGGG CTCTAGGGCC CCTCCGCTGC TGCCGCCGCC ACCGCCTTTG TGTCGGGCTC
21481 CGACTCTGAG TCGCCTCAGC CCGGGGGCGG GAGCGCGCGG CGGGGCGGGG GGCGGAGCCC
21541 GAGAGATGGG CCGGCGCGCG CGCGCGCGCC AAACAGCCCA CCCTCGCTGG GGTAGGGGGA
21601 GGGGAAGGTG CGCGCGCGCG CGCGCGCTGG AGCTCGCCTC TCGCCTTCGT GCGCCGTCGC
21661 GCCTGCGTAC TTTGTTCGCC CTTTGACTCC TCCCTACTGG GCCGGAGAAT TCTGATTGGT
21721 ACATTGCGGA GATGGTCCCG CCCCACGTGC CTCCAATCCC GGACTCGGAC TCTGGCTTCT
21781 GGTGGGTTTT TCTGGTTGCG CAGATAGAGT TGTTTATCCT TGAGCAGCGG TAATTCTCAA
21841 ACTGCGGTAT GCGTGGGGGT CGGGAAGCCA CAGGATAAAT AAAGACGTTA ACTTAAGAGC
21901 AGTTATGTCT TACTGGGAGC GTACAATGCT GGACTCTACA TATAACGGTC GAGTGATTCC
21961 GGTTTATAAG CCGGAAAGCA GAAGGGCCCG GAATCCG...

SLIDE 86

Probabilistic model of a sequence

$$P(x) = P(x_L, x_{L-1}, \ldots, x_1) = P(x_L \mid x_{L-1}, \ldots, x_1)\, P(x_{L-1} \mid x_{L-2}, \ldots, x_1) \cdots P(x_1)$$

(by application of the general multiplication rule) For example, the probability of CGAT:

$$P(CGAT) = P(T, A, G, C) = P(T \mid A, G, C)\, P(A \mid G, C)\, P(G \mid C)\, P(C)$$

Why can’t this framework be used for modeling CG islands? First and foremost, the model requires estimating a large number of parameters, which in turn implies an exceptionally large number of examples.
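To make "a large number of parameters" concrete, here is a count under two assumptions that the slide does not state explicitly: a four-letter DNA alphabet and sequences of a fixed length L. The factor for position i conditions on the i − 1 preceding symbols, so it needs one distribution over four symbols (three free parameters) for each of the $4^{i-1}$ possible contexts:

```latex
\[
\sum_{i=1}^{L} 3 \cdot 4^{\,i-1} \;=\; 4^{L} - 1,
\qquad \text{e.g. } L = 10 \;\Rightarrow\; 4^{10} - 1 = 1\,048\,575 .
\]
```

Even for sequences of ten bases, more than a million parameters would have to be estimated, far more than any realistic training set supports.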

SLIDE 87

Probabilistic model of a sequence

Under the assumption that positions are independent from one another, $P(x_i \mid x_{i-1}, \ldots, x_1) = P(x_i)$,

$$P(x) = P(x_L \mid x_{L-1}, \ldots, x_1)\, P(x_{L-1} \mid x_{L-2}, \ldots, x_1) \cdots P(x_1) = P(x_L)\, P(x_{L-1}) \cdots P(x_1)$$

$$P(CGAT) = P(T \mid A, G, C)\, P(A \mid G, C)\, P(G \mid C)\, P(C) = P(T)\, P(A)\, P(G)\, P(C)$$

Why can’t this framework be used for modeling CG islands? Dinucleotides play a critical role, and the above model ignores the dependencies completely.
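A short computation shows why the independence model cannot capture dinucleotide effects: it assigns the same probability to any reordering of the same bases. The base frequencies below are made-up values chosen only for illustration, and the function name is hypothetical.

```python
import math

# Hypothetical base frequencies (they sum to 1).
FREQS = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}

def iid_probability(seq, freqs=FREQS):
    """P(seq) when every position is drawn independently from freqs."""
    return math.prod(freqs[base] for base in seq)

p_cgat = iid_probability("CGAT")
p_gcat = iid_probability("GCAT")
print(p_cgat, p_gcat)
```

CGAT contains the dinucleotide CG and GCAT does not, yet both receive the same probability, so this model has no way to express that CG is rarer or more frequent than chance.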

SLIDE 88

Markov Chains

However, under the assumption of an underlying first-order Markov process (memoryless), \( P(x_i \mid x_{i-1}, \ldots, x_1) = P(x_i \mid x_{i-1}) \), the previous equation can be rewritten as follows:

\[ P(x) = P(x_L \mid x_{L-1}, \ldots, x_1)\, P(x_{L-1} \mid x_{L-2}, \ldots, x_1) \cdots P(x_1) = P(x_L \mid x_{L-1})\, P(x_{L-1} \mid x_{L-2}) \cdots P(x_2 \mid x_1)\, P(x_1) \]

In the previous example:

\[ P(CGAT) = P(T, A, G, C) = P(T \mid A)\, P(A \mid G)\, P(G \mid C)\, P(C) \]

This seems to be the right model: the dinucleotide dependencies are represented.
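The factorisation above can be sketched in a few lines. The transition values are those of the positive (+) table given later in these slides; the uniform initial probability of 0.25 is an assumption made here for illustration, and the helper names are not from the course material.

```python
BASES = "ACGT"


def table(rows):
    """Build trans[s][t] from rows ordered A, C, G, T."""
    return {s: dict(zip(BASES, row)) for s, row in zip(BASES, rows)}


# Transition probabilities of the + model (rows: previous symbol).
PLUS = table([
    [0.180, 0.274, 0.426, 0.120],
    [0.171, 0.368, 0.274, 0.188],
    [0.161, 0.339, 0.375, 0.125],
    [0.079, 0.355, 0.384, 0.182],
])


def markov_probability(seq: str, trans=PLUS, initial: float = 0.25) -> float:
    """P(x) = P(x_1) * product of P(x_i | x_{i-1}); `initial` is assumed uniform."""
    p = initial
    for prev, cur in zip(seq, seq[1:]):
        p *= trans[prev][cur]
    return p


if __name__ == "__main__":
    print(markov_probability("CGAT"))
    # Unlike the independence model, CG and GC now differ.
    print(markov_probability("CG"), markov_probability("GC"))
```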

SLIDE 89

Graphical Formalism for Markov Chains

SLIDE 90

Graphical Formalism for Markov Chains (cont.)

[Figure: Markov chain drawn as a graph with one state per nucleotide: A, T, C, G]

SLIDE 91

Graphical Formalism for Markov Chains (cont.)

\( a_{st} = P(S(i) = t \mid S(i-1) = s) \)

⇒ The transition probabilities, \( a_{st} \), are associated with the arcs of this graph.

SLIDE 92

Markov Chains

\( a_{st} = P(S(i) = t \mid S(i-1) = s) \) is the probability that symbol t is observed at position i knowing that s occurs at position i − 1.

Therefore P(S) can now be written as follows,

\[ P(S) = P(S(1)) \prod_{i=2}^{n} a_{S(i-1)S(i)} \]

Here the concept of time (involved in the development of PAM matrices) has been replaced by that of space, with similar observations:

  • memoryless: the probability that symbol a occurs at position i depends only on the symbol found at position i − 1, and not on any other position i′ < i − 1;
  • homogeneity of space: the probability that symbol a occurs at position i does not depend on the particular value of i (e.g. i = 123 or i = 162,144).

SLIDE 93

Markov Chains (contd)

Higher-order Markov models are interesting for modeling DNA sequences, in particular for modeling the codon structure of coding regions. A Markov chain of order k is a model where the probability that symbol a occurs at position i depends only on the symbols found at positions i − 1, i − 2, …, i − k, and not on any other position i′ < i − k.
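As a rough illustration of an order-k chain, the sketch below estimates P(symbol | previous k symbols) by counting in a single training string. The function names and the toy sequence are hypothetical; a real estimate would need many sequences and usually pseudocounts.

```python
from collections import Counter, defaultdict


def order_k_counts(seq: str, k: int):
    """Count how often each symbol follows each context of length k."""
    counts = defaultdict(Counter)
    for i in range(k, len(seq)):
        counts[seq[i - k:i]][seq[i]] += 1
    return counts


def order_k_probabilities(seq: str, k: int):
    """Relative frequencies: P(symbol | context of the k previous symbols)."""
    probs = {}
    for context, row in order_k_counts(seq, k).items():
        total = sum(row.values())
        probs[context] = {symbol: n / total for symbol, n in row.items()}
    return probs


if __name__ == "__main__":
    # Toy periodic sequence: with k = 2 the next symbol is fully determined.
    print(order_k_probabilities("ATGATGATG", 2))
```

An order-k chain over a 4-letter alphabet has 4^k contexts, so the number of parameters again grows quickly with k.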

SLIDE 94

Markov Chains (contd)

Markov chains are particularly convenient for two reasons:

  • \( P(S(i) \mid S(i-1), \ldots, S(1)) \) would be difficult to estimate (do you see why?);
  • They lead to computationally efficient algorithms.

SLIDE 95

Markov Chains (contd) (cont.)

[Figure: Markov chain with one state per nucleotide: A, T, C, G]

SLIDE 96

Markov Chains (contd) (cont.)

⇒ In the above model a sequence can start and end anywhere.

SLIDE 97

Markov Chains (contd) (cont.)

[Figure: Markov chain with states A, T, C, G and two additional states, Start and Stop]

SLIDE 98

Markov Chains (contd) (cont.)

⇒ Adding Start and Stop states:

1) allows modeling start/end effects, e.g. P(Stop|T) could be different from P(Stop|G);
2) models the distribution of the lengths of the sequences;
3) defines a probability distribution over all possible sequences of any length (the probabilities sum to 1);
4) implies that the probability of a sequence length decays exponentially with that length.
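Points 2 to 4 can be checked numerically with a small sketch. It assumes, for simplicity, that the probability of moving to Stop is the same constant `p_stop` from every state (a hypothetical simplification; the model above allows it to differ per symbol). Under that assumption the length follows a geometric distribution.

```python
def length_probability(n: int, p_stop: float) -> float:
    """P(length = n): continue n - 1 times, then stop (n >= 1)."""
    return (1.0 - p_stop) ** (n - 1) * p_stop


if __name__ == "__main__":
    p_stop = 0.1  # hypothetical value
    # Each extra symbol multiplies the probability by (1 - p_stop).
    for n in (1, 2, 3, 10):
        print(n, length_probability(n, p_stop))
    # The probabilities over all lengths sum to 1 (up to truncation).
    print(sum(length_probability(n, p_stop) for n in range(1, 2000)))
```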

SLIDE 99

Methodology

  • Durbin et al. collected a large number of positive and negative examples of CG islands, almost 60,000 nucleotides in all;
  • Construct one Markov model for the positive examples and one for the negative examples; this involves estimating the transition probabilities;
  • To use the models for discrimination, calculate the log-odds ratio.

SLIDE 100

Maximum Likelihood Estimators

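[Figure: Markov chain over A, T, C, G with arcs labelled by transition counts: 1776, 816, 1296, 902]

\[ a^{+}_{st} = \frac{c^{+}_{st}}{\sum_{t'} c^{+}_{st'}} \]

where \( c^{+}_{st} \) is the number of times symbol t follows symbol s in the positive examples.

⇒ \( a^{+}_{CA} = 816 / (816 + 902 + 1296 + 1776) = 0.17 \)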
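The estimator can be sketched as follows. The count 816 for C→A comes from the slide; assigning 1776, 1296 and 902 to C→C, C→G and C→T is an assumption, inferred by matching the ratios against the C row of the + table on the next slide. The function name is illustrative.

```python
def mle_transitions(counts):
    """Maximum likelihood estimate: a[s][t] = c[s][t] / sum over t' of c[s][t']."""
    probs = {}
    for s, row in counts.items():
        total = sum(row.values())
        probs[s] = {t: c / total for t, c in row.items()}
    return probs


# Transition counts out of C in the positive examples.
# Only the C->A count is stated explicitly; the other assignments are inferred.
C_PLUS_COUNTS = {"C": {"A": 816, "C": 1776, "G": 1296, "T": 902}}

if __name__ == "__main__":
    a_plus = mle_transitions(C_PLUS_COUNTS)
    print(round(a_plus["C"]["A"], 2))
```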

SLIDE 101

Maximum Likelihood Estimators

Rows: symbol s at position i − 1; columns: symbol t at position i.

+     A      C      G      T
A   0.180  0.274  0.426  0.120
C   0.171  0.368  0.274  0.188
G   0.161  0.339  0.375  0.125
T   0.079  0.355  0.384  0.182

−     A      C      G      T
A   0.300  0.205  0.285  0.210
C   0.322  0.298  0.078  0.302
G   0.248  0.246  0.298  0.208
T   0.177  0.239  0.292  0.292

SLIDE 102

Maximum Likelihood Estimators (cont.)

[Figure: Markov chain over A, T, C, G with each arc labelled by the corresponding transition probability of the + table, rounded to two decimals]

⇒ Markov Model for the positive examples of CG islands.

SLIDE 103

Maximum Likelihood Estimators (cont.)

[Figure: Markov chain over A, T, C, G with each arc labelled by the corresponding transition probability of the − table, to two decimals]

⇒ Markov Model for the negative examples of CG islands.

SLIDE 104

Discrimination

To test whether a sequence S of length n is likely to be a CG island, the log-odds ratio of the two models is computed,

\[ \log \frac{P(S \mid \text{Model}^{+})}{P(S \mid \text{Model}^{-})} = \sum_{i=1}^{n} \log \frac{a^{+}_{S(i-1)S(i)}}{a^{-}_{S(i-1)S(i)}} \]

which could also be written as

\[ \sum_{i=1}^{n} \log s(S(i-1), S(i)), \quad \text{where } s(s, t) = \frac{a^{+}_{st}}{a^{-}_{st}} \]
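A minimal sketch of this discrimination score, using the two transition tables given earlier. Two assumptions are made here: logarithms are taken in base 2 (scores in bits), and the term for the first symbol is omitted, which amounts to assuming the same initial distribution under both models. The function names are illustrative.

```python
from math import log2

BASES = "ACGT"


def table(rows):
    """Build trans[s][t] from rows ordered A, C, G, T."""
    return {s: dict(zip(BASES, row)) for s, row in zip(BASES, rows)}


PLUS = table([
    [0.180, 0.274, 0.426, 0.120],
    [0.171, 0.368, 0.274, 0.188],
    [0.161, 0.339, 0.375, 0.125],
    [0.079, 0.355, 0.384, 0.182],
])
MINUS = table([
    [0.300, 0.205, 0.285, 0.210],
    [0.322, 0.298, 0.078, 0.302],
    [0.248, 0.246, 0.298, 0.208],
    [0.177, 0.239, 0.292, 0.292],
])


def log_odds(seq: str, plus=PLUS, minus=MINUS) -> float:
    """Sum of log2(a+[s][t] / a-[s][t]) over adjacent pairs (s, t) of seq."""
    return sum(log2(plus[s][t] / minus[s][t]) for s, t in zip(seq, seq[1:]))


if __name__ == "__main__":
    # Positive scores favour the CG island model, negative ones the background.
    for seq in ("CGCG", "ATAT"):
        print(seq, log_odds(seq))
```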

SLIDE 105

Summary

[Figure: Markov chain with one state per nucleotide: A, T, C, G]

  • Each state is associated with a single symbol;
  • The model captures dependencies between adjacent positions;
  • Arcs carry transition probabilities, e.g. \( a_{CT} \).

SLIDE 106

Pitfalls

  • The Markov models (MM) for CG islands can only score entire sequences;
  • How can we use them to find CG islands in an entire genome?
  • Sliding a window (say 100 nt) is unsatisfactory: CG islands have variable lengths and sharp boundaries, which a fixed-size window does not capture;
  • Solution: combine the CG+ and CG− models into a single model.

SLIDE 107

Pitfalls (cont.)

[Figure: a single model with eight states: A+, C+, G+, T+ and A−, C−, G−, T−]

SLIDE 108

Ideas

  • In this model, there are small probabilities of switching from one sub-model to the other;
  • It models islands of CG within a sea of non-CG regions, i.e. the probabilities should be chosen such that a transition from + to − is more probable than a transition from − to +;
  • It can now model …, C G G C G G, …;
  • There is no one-to-one correspondence between symbols and states (more than one state per symbol);
  • Looking at a sequence CGGCGG, you don't know the states (+ or −) that were used to generate the sequence: the sequence of states is hidden.
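One possible way to build such a combined model is sketched below. The switching probabilities `p_leave` (+ to −) and `p_enter` (− to +) are hypothetical values, and the rule used when switching (follow the destination model's table) is an assumption made here for illustration, not the construction used in the lecture.

```python
BASES = "ACGT"


def table(rows):
    """Build trans[s][t] from rows ordered A, C, G, T."""
    return {s: dict(zip(BASES, row)) for s, row in zip(BASES, rows)}


PLUS = table([
    [0.180, 0.274, 0.426, 0.120],
    [0.171, 0.368, 0.274, 0.188],
    [0.161, 0.339, 0.375, 0.125],
    [0.079, 0.355, 0.384, 0.182],
])
MINUS = table([
    [0.300, 0.205, 0.285, 0.210],
    [0.322, 0.298, 0.078, 0.302],
    [0.248, 0.246, 0.298, 0.208],
    [0.177, 0.239, 0.292, 0.292],
])


def combine(plus, minus, p_leave=0.01, p_enter=0.001):
    """Eight-state transition table over A+, ..., T+, A-, ..., T-."""
    a = {}
    for s in BASES:
        a[s + "+"], a[s + "-"] = {}, {}
        for t in BASES:
            a[s + "+"][t + "+"] = (1 - p_leave) * plus[s][t]   # stay in island
            a[s + "+"][t + "-"] = p_leave * minus[s][t]        # leave island
            a[s + "-"][t + "-"] = (1 - p_enter) * minus[s][t]  # stay outside
            a[s + "-"][t + "+"] = p_enter * plus[s][t]         # enter island
    return a


if __name__ == "__main__":
    model = combine(PLUS, MINUS)
    print(model["C+"]["G+"], model["C+"]["G-"], model["C-"]["G+"])
```

Choosing `p_leave` larger than `p_enter` reflects the idea of short islands within a long sea of non-island sequence.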

SLIDE 109

CG island (contd)

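In the case of the hidden Markov model for the CG island, the emission probabilities are all 1 or 0. The probability that the sequence CGCG is emitted along the path C+, G−, C−, G+ in our model is given by

\[ a_{\text{start},C^{+}} \times 1 \times a_{C^{+},G^{-}} \times 1 \times a_{G^{-},C^{-}} \times 1 \times a_{C^{-},G^{+}} \times 1 \times a_{G^{+},\text{end}} \]

In general, the joint probability of an observed sequence S and a state sequence π is

\[ P(S, \pi) = a_{\text{start},\pi_1} \prod_{i=1}^{n} e_{\pi_i}(S(i))\, a_{\pi_i \pi_{i+1}} \]

where \( e_k(b) \) is the probability that state k emits symbol b, and \( \pi_{n+1} \) is the end state. In general, the sequence of states, π, is unknown in advance.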
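The joint probability can be evaluated directly once a path is given. The sketch below uses a toy model restricted to the states C+, G+, C−, G− plus an end state; all numbers in `START` and `TRANS` are hypothetical and serve only to illustrate the formula.

```python
# Hypothetical start and transition probabilities (illustration only).
START = {"C+": 0.25, "G+": 0.25, "C-": 0.25, "G-": 0.25}
TRANS = {
    "C+": {"C+": 0.30, "G+": 0.50, "C-": 0.05, "G-": 0.05, "end": 0.10},
    "G+": {"C+": 0.50, "G+": 0.30, "C-": 0.05, "G-": 0.05, "end": 0.10},
    "C-": {"C+": 0.05, "G+": 0.05, "C-": 0.60, "G-": 0.20, "end": 0.10},
    "G-": {"C+": 0.05, "G+": 0.05, "C-": 0.40, "G-": 0.40, "end": 0.10},
}


def emission(state: str, symbol: str) -> float:
    """Emission probabilities are 1 or 0: state 'C+' can only emit 'C', etc."""
    return 1.0 if state[0] == symbol else 0.0


def joint_probability(seq: str, path) -> float:
    """P(S, pi) = a[start][pi_1] * prod of e[pi_i](S(i)) * a[pi_i][pi_{i+1}]."""
    if not seq or len(seq) != len(path):
        raise ValueError("seq and path must be non-empty and of equal length")
    p = START[path[0]]
    successors = list(path[1:]) + ["end"]   # pi_{n+1} is the end state
    for symbol, state, nxt in zip(seq, path, successors):
        p *= emission(state, symbol) * TRANS[state][nxt]
    return p


if __name__ == "__main__":
    print(joint_probability("CGCG", ["C+", "G-", "C-", "G+"]))
```

A path whose states cannot emit the observed symbols gets probability 0, since one of the emission terms is 0.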

SLIDE 110

CG island (contd) (cont.)

In fact, finding the path π such that P(S, π) is maximum is often the goal. What does it mean? Assume the transition and emission probabilities are known for the CG island HMM. Given a new, unlabeled sequence S, e.g. S = CGCCG…CGCATG, we don't know which state was used to emit a given symbol: was the first C emitted from C+ or C−, was the G in second position emitted from G+ or G−, and so on. Finding the path π such that P(S, π) is maximum tells us the most likely locations of the CG islands:

S = C+G+C+C+G+ … C+G+C−A−T−G+

SLIDE 111

CG island (contd) (cont.)

The most probable path: \( \pi^{\star} = \operatorname{argmax}_{\pi} P(S, \pi) \)

SLIDE 112

Viterbi : Dynamic Programming Table

State   (start)  C      G      C       G
Start   1        0      0      0       0
A+      0        0      0      0       0
C+      0        0.13   0      0.012   0
G+      0        0      0.034  0       0.0032
T+      0        0      0      0       0
A−      0        0      0      0       0
C−      0        0.13   0      0.0026  0
G−      0        0      0.010  0       0.00021
T−      0        0      0      0       0
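A dynamic programming table of this kind can be filled with the Viterbi recurrence, sketched below for an eight-state CG island model. The switching probabilities `P_LEAVE` and `P_ENTER`, the uniform start probability of 1/8, the absence of an end state and the rule used when switching are all assumptions made here, so the numbers are close to, but not expected to reproduce exactly, those in the table.

```python
BASES = "ACGT"


def table(rows):
    """Build trans[s][t] from rows ordered A, C, G, T."""
    return {s: dict(zip(BASES, row)) for s, row in zip(BASES, rows)}


PLUS = table([[0.180, 0.274, 0.426, 0.120], [0.171, 0.368, 0.274, 0.188],
              [0.161, 0.339, 0.375, 0.125], [0.079, 0.355, 0.384, 0.182]])
MINUS = table([[0.300, 0.205, 0.285, 0.210], [0.322, 0.298, 0.078, 0.302],
               [0.248, 0.246, 0.298, 0.208], [0.177, 0.239, 0.292, 0.292]])
P_LEAVE, P_ENTER = 0.01, 0.001   # hypothetical switching probabilities
STATES = [b + sign for sign in "+-" for b in BASES]


def transition(src: str, dst: str) -> float:
    """Transition probability between states such as 'C+' and 'G-'."""
    (s, a), (t, b) = src, dst
    row = PLUS if b == "+" else MINUS           # follow the destination model
    switch = P_LEAVE if a == "+" else P_ENTER
    return (1 - switch if a == b else switch) * row[s][t]


def viterbi(seq: str):
    """Return the most probable state path for seq and its probability."""
    # Initialisation: uniform start, emission is 1 only for matching states.
    v = {k: (1 / 8 if k[0] == seq[0] else 0.0) for k in STATES}
    back = []
    for symbol in seq[1:]:
        new_v, ptr = {}, {}
        for l in STATES:
            best = max(STATES, key=lambda k: v[k] * transition(k, l))
            emit = 1.0 if l[0] == symbol else 0.0
            new_v[l] = emit * v[best] * transition(best, l)
            ptr[l] = best
        v = new_v
        back.append(ptr)
    # Traceback from the best final state.
    last = max(STATES, key=lambda k: v[k])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, v[last]


if __name__ == "__main__":
    print(viterbi("CGCG"))
```

For long sequences the products underflow, so practical implementations usually work with sums of logarithms instead.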

SLIDE 113

Viterbi

The “best path”, i.e. the sequence of states determined by the Viterbi algorithm, identifies the CG islands within a genomic sequence.

SLIDE 114

Viterbi (cont.)

SLIDE 115

Probabilistic Models

Hypothesis: positions are independent of one another

  • Pairwise alignment (PAM, BLOSUM, etc.): given two aligned sequences, \( S'_1 \) and \( S'_2 \), the score of an alignment is given by

\[ \sum_{i=1}^{n} s(S'_1(i), S'_2(i)) \]

  • Position-specific scoring scheme: given a sequence S and a probabilistic model M of a sequence family/motif, represented as a matrix f of size 20 × n, where f(a, i) is the probability that amino acid a occurs at position i in this family/motif,

\[ P(S \mid M) = \prod_{i=1}^{n} f(S(i), i) \]

SLIDE 116

Probabilistic Models (cont.)

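or a log-odds score: \( \sum_{i=1}^{n} s(S(i), i) \), where \( s(a, i) = \log \frac{f(a, i)}{q_a} \) and \( q_a \) is the background frequency of amino acid a.

Markovian models: Markov chains and hidden Markov models

  • They allow modelling the distribution of the amino acids within gaps, as well as the length of the gaps.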
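The position-specific scoring scheme above can be sketched as follows. For brevity the example uses a hypothetical 3-position motif over a 4-letter alphabet rather than a 20 × n amino acid matrix; the matrix `F`, the uniform background `Q` and the base-2 logarithm are all assumptions made for illustration.

```python
from math import log2, prod

# Hypothetical motif: F[i][a] is the probability of symbol a at position i.
F = [
    {"A": 0.70, "C": 0.10, "G": 0.10, "T": 0.10},
    {"A": 0.10, "C": 0.10, "G": 0.70, "T": 0.10},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]
# Hypothetical uniform background frequencies q_a.
Q = {a: 0.25 for a in "ACGT"}


def motif_probability(seq: str, f=F) -> float:
    """P(S | M) = product over positions i of f(S(i), i)."""
    if len(seq) != len(f):
        raise ValueError("sequence length must equal the motif length")
    return prod(column[a] for column, a in zip(f, seq))


def log_odds_score(seq: str, f=F, q=Q) -> float:
    """Sum over positions i of log2(f(S(i), i) / q[S(i)])."""
    if len(seq) != len(f):
        raise ValueError("sequence length must equal the motif length")
    return sum(log2(column[a] / q[a]) for column, a in zip(f, seq))


if __name__ == "__main__":
    for seq in ("AGT", "CCT"):
        print(seq, motif_probability(seq), log_odds_score(seq))
```

The log-odds score is the logarithm of the ratio between P(S|M) and the probability of S under the background model, so positive scores favour the motif.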

SLIDE 117

Pairwise (Multiple) Alignment vs Profiles vs Hidden Markov Models

  • Pairwise → uniform scoring system along the sequence
  • Profiles → position-specific scoring scheme
  • HMMs → insertions/deletions modelled separately + variable topology (including simple grammatical structures)

SLIDE 118

Applications of MCs and HMMs

  • bacterial gene finders (MC)
  • mRNA splicing (MC)
  • trans-membrane helix prediction
  • modeling signal peptides
  • modeling families of aligned proteins
  • multiple sequence alignment

SLIDE 119

Pfam

  • Pfam is a large collection of multiple sequence alignments and hidden Markov models covering many common protein domains.
  • Version 5.5, September 2000: 2478 families.
  • Large coverage of known proteins (∼63%).

⇒ www.sanger.ac.uk/Software/Pfam/index.shtml

SLIDE 120

Web resources

  • HMMER by Sean Eddy, hmmer.org
  • SAM from UCSC’s Computational Biology Group, www.cse.ucsc.edu/research/compbio/sam.html

SLIDE 122

Think about it!

Printing these notes is probably not necessary!
