
School of Computer Science

[Figure: an example graphical model of a signal-transduction pathway (Receptors A and B, Kinases C, D, E, TF F, Genes G and H) represented as variables X1 through X8]

Junction Tree Algorithm

and a case study of the

Hidden Markov Model

Probabilistic Graphical Models (10-708)

Lecture 6, Oct 3, 2007

Eric Xing

Reading: J-Chap 12, 17, KF-Chap. 10


Outline

So far we have studied exact inference in:

  • Trees: message passing on the original graph (which is a tree)
  • Poly-trees and tree-like graphs: message passing on factor trees

Now we will look into exact inference in arbitrary graphs

  • Junction-Tree algorithm

Inference in the Hidden Markov Model


Elimination Clique

Recall that the induced dependencies created during marginalization are captured in elimination cliques:

  • Summation <-> elimination
  • Intermediate term <-> elimination clique
  • Can this lead to a generic inference algorithm?

[Figure: the elimination cliques produced by running elimination on the example graph over nodes A through H]

A Clique Tree

[Figure: the elimination cliques of the example graph arranged as a clique tree, with messages m_b, m_c, m_d, m_e, m_f, m_g, m_h passed between neighboring cliques]

    m_e(a,c,d) = Σ_e p(e|c,d) m_g(e) m_f(a,e)


From Elimination to Message Passing

  • Elimination ≡ message passing on a clique tree
  • Messages can be reused

[Figure: graph elimination on the example graph over nodes A through H, and the corresponding messages m_b, ..., m_h on the clique tree]

    m_e(a,c,d) = Σ_e p(e|c,d) m_g(e) m_f(a,e)

From Elimination to Message Passing

  • Elimination ≡ message passing on a clique tree
  • Another query ...
  • Messages m_f and m_h are reused; the others need to be recomputed

[Figure: the same clique tree with the messages redirected toward a different query node]

The Junction Tree Algorithm

  • Recall: Elimination ≡ message passing on a clique tree
  • Junction Tree Algorithm:
  • computing messages on a clique tree
  • message passing protocol on a clique tree
  • There are several inference algorithms; some of them operate directly on (special) directed graphs
  • Forward-backward algorithm for HMMs (we will see it later)
  • Peeling algorithm for trees and phylogenies
  • The junction tree algorithm is the most popular and general inference algorithm; it operates on an undirected graph
  • To understand the JT algorithm, we need to understand how to compile a directed graph into an undirected graph


Moral Graph

  • Note that for both directed GMs and undirected GMs, the joint probability is in a product form:

        BN:   P(X) = Π_{i=1..d} P(X_i | X_{π_i})          MRF:   P(X) = (1/Z) Π_{c ∈ C} ψ_c(X_c)

  • So let's convert local conditional probabilities into potentials; then the second expression will be generic. But how does this operation affect the directed graph?
  • We can think of a conditional probability, e.g., P(C|A,B), as a function of the three variables A, B, and C (we get a real number for each configuration):

        ψ(A,B,C) = P(C|A,B)

  • Problem: a node and its parents are not generally in the same clique in a BN
  • Solution: Marry the parents to obtain the "moral graph"

[Figure: the BN fragment A → C ← B and its moralized counterpart, in which A and B are connected so that {A, B, C} forms a clique]
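As a concrete illustration of this compilation step, here is a minimal Python sketch of the "marry the parents" operation (the dictionary encoding and the function name moralize are illustrative choices, not from the lecture):

    from itertools import combinations

    def moralize(parents):
        """Moralize a DAG: connect ("marry") the parents of each node, drop directions."""
        # parents maps each node to the list of its parents
        edges = set()
        for child, pa in parents.items():
            for p in pa:                          # keep every parent-child edge, undirected
                edges.add(frozenset((p, child)))
            for u, v in combinations(pa, 2):      # marry co-parents: {node} plus parents becomes a clique
                edges.add(frozenset((u, v)))
        return edges

    # The small example from the slide: P(C|A,B) becomes a potential on the clique {A, B, C}
    bn = {"A": [], "B": [], "C": ["A", "B"]}
    print(sorted(tuple(sorted(e)) for e in moralize(bn)))
    # [('A', 'B'), ('A', 'C'), ('B', 'C')]  --  A and B are now married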


Moral Graph (cont.)

  • Define the potential on a clique as the product over all conditional

probabilities contained within the clique

  • Now the product of potentials gives the right answer:

    P(X1,X2,X3,X4,X5,X6) = P(X1) P(X2) P(X3|X1,X2) P(X4|X3) P(X5|X3) P(X6|X4,X5)
                         = ψ(X1,X2,X3) ψ(X3,X4,X5) ψ(X4,X5,X6)

    where
        ψ(X1,X2,X3) = P(X1) P(X2) P(X3|X1,X2)
        ψ(X3,X4,X5) = P(X4|X3) P(X5|X3)
        ψ(X4,X5,X6) = P(X6|X4,X5)

[Figure: the directed graph over X1 through X6 and its moral graph]

Note that here the interpretation of a potential is ambivalent: it can be either marginals or conditionals.


Clique trees

  • A clique tree is an (undirected) tree of cliques
  • Consider cases in which two neighboring cliques V and W have an overlap S (e.g., (X1, X2, X3) overlaps with (X3, X4, X5))
  • Now we have an alternative representation of the joint in terms of the potentials:

[Figure: the moral graph over X1 through X6 and its clique tree (X1,X2,X3) - (X3,X4,X5) - (X4,X5,X6) with separators X3 and (X4,X5); two neighboring cliques V and W with separator S carry potentials ψ(V), ψ(W), and φ(S)]


Clique trees

  • A clique tree is an (undirected) tree of cliques
  • The alternative representation of the joint in terms of the potentials:

        P(X1,X2,X3,X4,X5,X6) = P(X1) P(X2) P(X3|X1,X2) P(X4|X3) P(X5|X3) P(X6|X4,X5)
                             = P(X1,X2,X3) P(X3,X4,X5) P(X4,X5,X6) / [ P(X3) P(X4,X5) ]
                             = ψ(X1,X2,X3) ψ(X3,X4,X5) ψ(X4,X5,X6) / [ φ(X3) φ(X4,X5) ]

  • Generally:

        P(X) = Π_C ψ(X_C) / Π_S φ(X_S)

Now each potential is isomorphic to the cluster marginal of the attendant set of variables.


Why is this useful?

Propagation of probabilities

  • Now suppose that some evidence has been "absorbed" (i.e., certain values of

some nodes have been observed). How do we propagate this effect to the rest of the graph?

  • What do we mean by propagate?

Can we adjust all the potentials {ψ}, {φ} so that they still represent the correct cluster marginals (or unnormalized equivalents) of their respective attendant variables?

  • Utility?

[Figure: the graph over X1 through X6 and its clique tree (X1,X2,X3) - (X3,X4,X5) - (X4,X5,X6) with separators X3 and (X4,X5)]

Local operations! With X6 = x̄6 observed:

    P(X1 | X6 = x̄6) = Σ_{X2,X3} ψ(X1,X2,X3)
    P(X3 | X6 = x̄6) = φ(X3)
    P(x̄6) = Σ_{X4,X5} ψ(X4,X5,x̄6)


Local Consistency

  • We have two ways of obtaining p(S),

        P(S) = Σ_{V\S} ψ(V)    and    P(S) = Σ_{W\S} ψ(W),

    and they must be the same

[Figure: two neighboring cliques V and W with separator S and potentials ψ(V), ψ(W), φ(S)]

  • The following update rule ensures this:
  • Forward update:

        φ*(S) = Σ_{V\S} ψ(V)
        ψ*(W) = ( φ*(S) / φ(S) ) ψ(W)

  • Backward update:

        φ**(S) = Σ_{W\S} ψ*(W)
        ψ**(V) = ( φ**(S) / φ*(S) ) ψ(V)

  • Two important identities can be proven:

    Local consistency:

        Σ_{V\S} ψ**(V) = φ**(S) = Σ_{W\S} ψ*(W)

    Invariant joint:

        ψ(V) ψ(W) / φ(S) = ψ*(V) ψ*(W) / φ*(S) = ψ**(V) ψ**(W) / φ**(S)
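To make these updates concrete, here is a small numpy sketch of one forward/backward exchange between two table potentials ψ(A,B) and ψ(B,C) with separator S = B (the variable names, table sizes, and random initialization are illustrative assumptions, not the lecture's example):

    import numpy as np

    psi_V = np.random.rand(2, 3)       # ψ(V) = ψ(A, B)
    psi_W = np.random.rand(3, 4)       # ψ(W) = ψ(B, C)
    phi_S = np.ones(3)                 # φ(S) = φ(B), initialized to 1

    # Forward update: V -> W
    phi_S_star = psi_V.sum(axis=0)                             # φ*(S)  = Σ_{V\S} ψ(V)
    psi_W_star = (phi_S_star / phi_S)[:, None] * psi_W         # ψ*(W)  = (φ*/φ) ψ(W)

    # Backward update: W -> V
    phi_S_2star = psi_W_star.sum(axis=1)                       # φ**(S) = Σ_{W\S} ψ*(W)
    psi_V_star = psi_V * (phi_S_2star / phi_S_star)[None, :]   # ψ**(V) = (φ**/φ*) ψ(V)

    # Local consistency: both cliques now agree on the separator marginal
    assert np.allclose(psi_V_star.sum(axis=0), psi_W_star.sum(axis=1))

    # Invariant joint: ψ(V) ψ(W) / φ(S) is unchanged by the updates
    before = psi_V[:, :, None] * psi_W[None, :, :] / phi_S[None, :, None]
    after = psi_V_star[:, :, None] * psi_W_star[None, :, :] / phi_S_2star[None, :, None]
    assert np.allclose(before, after)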


Message Passing Algorithm

This simple local message-passing algorithm on a clique tree

defines the general probability propagation algorithm for directed graphs!

  • Many interesting algorithms are special cases:
  • Forward-backward algorithm for hidden Markov models,
  • Kalman filter updates
  • Peeling algorithms for probabilistic trees
  • The algorithm seems reasonable. Is it correct?

[Figure: two neighboring cliques V and W with separator S and potentials ψ(V), ψ(W), φ(S)]

    φ*(S) = Σ_{V\S} ψ(V),       ψ*(W) = ( φ*(S) / φ(S) ) ψ(W)
    φ**(S) = Σ_{W\S} ψ*(W),     ψ**(V) = ( φ**(S) / φ*(S) ) ψ(V)


A problem

Consider the following graph and a corresponding clique tree

  • Note that C appears in two non-neighboring cliques

Question: with the previous message passing, can we ensure that the probabilities associated with C in these two (non-neighboring) cliques are consistent?

Answer: No. It is not true that in general local consistency implies global consistency.

What else do we need to get such a guarantee?

[Figure: a graph with a 4-cycle over A, B, C, D and a corresponding clique tree with cliques (A,B), (B,D), (A,C), (C,D); C appears in two non-neighboring cliques]


Triangulation

  • A triangulated graph is one in which no cycle of four or more nodes exists without a chord

  • We triangulate a graph by adding chords:
  • Now we no longer have our global inconsistency

problem.

  • A clique tree for a triangulated graph has the running

intersection property: If a node appears in two cliques, it appears everywhere on the path between the cliques

  • Thus local consistency implies global consistency

[Figure: the 4-cycle over A, B, C, D; after adding the chord B-C the graph is triangulated, and its clique tree has cliques (A,B,C) and (B,C,D)]


Junction trees

  • A clique tree for a triangulated graph is referred to as a junction tree
  • In junction trees, local consistency implies global consistency. Thus the local message-passing algorithm is (provably) correct
  • It is also possible to show that only triangulated graphs have the property that their clique trees are junction trees. Thus if we want local algorithms, we must triangulate
  • Are we now all set?
  • How to triangulate?
  • The complexity of building a JT depends on how we triangulate!!
  • Consider this network: it turns out that we will need to pay an O(2^4) or O(2^6) cost depending on how we triangulate!

[Figure: the example network over nodes A through H]


How to triangulate

[Figure: moralization of the example graph over nodes A through H, followed by graph elimination, showing the sequence of remaining graphs]

  • A graph elimination algorithm (a small sketch follows below)
  • Intermediate terms correspond to the cliques resulting from elimination
  • "Good" elimination orderings lead to small cliques and hence reduce complexity (what will happen if we eliminate "e" first in the above graph?)
  • Finding the optimum ordering is NP-hard, but for many graphs an optimum or near-optimum ordering can often be found heuristically
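A minimal sketch of triangulation by graph elimination (plain Python; the adjacency below is an assumed moralized version of the example over nodes a through h, chosen so that it reproduces the elimination cliques shown earlier, and the function name is illustrative):

    def eliminate(adj, order):
        """Triangulate an undirected graph by eliminating nodes in the given order.
        adj maps node -> set of neighbours; returns (fill-in edges, elimination cliques)."""
        adj = {v: set(nbrs) for v, nbrs in adj.items()}   # work on a copy
        fill, cliques = [], []
        for v in order:
            nbrs = adj[v]
            cliques.append({v} | nbrs)                    # the elimination clique of v
            for u in nbrs:                                # connect all remaining neighbours of v
                for w in nbrs:
                    if u != w and w not in adj[u]:
                        adj[u].add(w); adj[w].add(u)
                        fill.append((u, w))
            for u in nbrs:                                # remove v from the graph
                adj[u].discard(v)
            del adj[v]
        return fill, cliques

    # An assumed moral graph for the running example (edges are illustrative).
    moral = {"a": {"b", "c", "d", "f"}, "b": {"a", "c"}, "c": {"a", "b", "d", "e"},
             "d": {"a", "c", "e"}, "e": {"c", "d", "f", "g", "h"},
             "f": {"a", "e", "h"}, "g": {"e"}, "h": {"e", "f"}}
    _, cliques = eliminate(moral, list("hgfedcba"))
    print([sorted(c) for c in cliques])        # largest clique has 4 nodes -> O(2^4) tables
    _, cliques_bad = eliminate(moral, list("eabcdfgh"))
    print(max(len(c) for c in cliques_bad))    # eliminating "e" first gives a 6-node clique -> O(2^6)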


From Elimination to Message Passing

  • Our algorithm so far answers only one query (e.g., on one node); do we need to do a complete elimination for every such query?
  • Elimination ≡ message passing on a clique tree
  • Messages can be reused

[Figure: graph elimination on the example graph over nodes A through H, and the corresponding messages m_b, ..., m_h on the clique tree]

Recall this:

    φ*(S) = Σ_{V\S} ψ(V),       ψ*(W) = ( φ*(S) / φ(S) ) ψ(W)

    m_e(a,c,d) = Σ_e p(e|c,d) m_g(e) m_f(a,e)

From Elimination to Message Passing

  • Our algorithm so far answers only one query (e.g., on one node); do we need to do a complete elimination for every such query?
  • Elimination ≡ message passing on a clique tree
  • Another query ...
  • Messages m_f and m_h are reused; the others need to be recomputed

[Figure: the same clique tree with the messages redirected toward a different query node]

Message-passing algorithms

Message update

  • The Hugin update:

        φ*(S) = Σ_{V\S} ψ(V),       ψ*(W) = ( φ*(S) / φ(S) ) ψ(W)

  • The Shafer-Shenoy update:

        m_{i→j}(S_{ij}) = Σ_{C_i \ S_{ij}} ψ(C_i) Π_{k≠j} m_{k→i}(S_{ki})

[Figure: message passing on a clique tree proceeds in two phases: collect and distribute]


A Sketch of the Junction Tree Algorithm

  • The algorithm

1. Moralize the graph (trivial)
2. Triangulate the graph (good heuristics exist, but the problem is actually NP-hard)
3. Build a clique tree (e.g., using a maximum spanning tree algorithm; a small sketch follows below)
4. Propagation of probabilities --- a local message-passing protocol

  • Results in marginal probabilities of all cliques --- solves all queries

in a single run

  • A generic exact inference algorithm for any GM
  • Complexity: exponential in the size of the maximal clique --- a good elimination order often leads to a small maximal clique, and hence a good (i.e., thin) JT
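Step 3 can be made concrete with a tiny sketch: weight each pair of cliques by the size of their intersection and connect them with a maximum-weight spanning tree (a greedy Kruskal-style sketch; the clique list and helper names are illustrative, reusing the cliques produced by the elimination sketch above):

    def junction_tree(cliques):
        """Connect cliques by a maximum-weight spanning tree, where the weight of an
        edge is the size of the separator (the intersection of the two cliques)."""
        edges = sorted(((len(cliques[i] & cliques[j]), i, j)
                        for i in range(len(cliques)) for j in range(i + 1, len(cliques))),
                       reverse=True)                      # heaviest separators first
        parent = list(range(len(cliques)))                # union-find forest
        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]
                i = parent[i]
            return i
        tree = []
        for w, i, j in edges:
            ri, rj = find(i), find(j)
            if ri != rj and w > 0:                        # add the edge if it joins two components
                parent[ri] = rj
                tree.append((i, j, cliques[i] & cliques[j]))   # remember the separator
        return tree

    cliques = [{"e", "f", "h"}, {"a", "e", "f"}, {"a", "c", "d", "e"}, {"a", "b", "c"}]
    for i, j, sep in junction_tree(cliques):
        print(sorted(cliques[i]), "--", sorted(sep), "--", sorted(cliques[j]))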


Summary

Junction tree data-structure for exact inference on general

graphs

Two methods

  • Shafer-Shenoy
  • Belief update, or Lauritzen-Spiegelhalter

Constructing a junction tree from chordal graphs

  • Maximum spanning tree approach


Case study:

Hidden Markov Model


Recall definition of HMM

  • Transition probabilities between any two states:

        p(y_t^j = 1 | y_{t-1}^i = 1) = a_{i,j}

    or, in general:

        p(y_t | y_{t-1}^i = 1) ~ Multinomial(a_{i,1}, ..., a_{i,M}),   ∀ i ∈ I

  • Start probabilities:

        p(y_1) ~ Multinomial(π_1, ..., π_M)

  • Emission probabilities associated with each state:

        p(x_t | y_t^i = 1) ~ Multinomial(b_{i,1}, ..., b_{i,K}),   ∀ i ∈ I

    or, in general:

        p(x_t | y_t^i = 1) ~ f(· | θ_i),   ∀ i ∈ I

[Figure: the HMM graphical model, a chain y_1 → y_2 → y_3 → ... → y_T with emissions x_1, x_2, x_3, ..., x_T]


The Dishonest Casino Model

[Figure: a two-state HMM with states FAIR and LOADED; each state stays with probability 0.95 and switches with probability 0.05]

Emission probabilities:

    P(1|F) = P(2|F) = P(3|F) = P(4|F) = P(5|F) = P(6|F) = 1/6
    P(1|L) = P(2|L) = P(3|L) = P(4|L) = P(5|L) = 1/10,   P(6|L) = 1/2

Transition probabilities:

    p(y_t^j = 1 | y_{t-1}^i = 1) = a_{i,j}
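For reference, the casino parameters written as arrays (a small numpy sketch; the start distribution π is not given on the slide and is assumed uniform here):

    import numpy as np

    states = ["FAIR", "LOADED"]
    # transition matrix a[i, j] = p(y_t = j | y_{t-1} = i)
    a = np.array([[0.95, 0.05],
                  [0.05, 0.95]])
    # emission matrix b[i, k] = p(roll = k+1 | state i), for die faces 1..6
    b = np.array([[1/6] * 6,
                  [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]])
    # start distribution (assumed uniform; not specified on the slide)
    pi = np.array([0.5, 0.5])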


Typical structure of a gene

[Figure: the typical structure of a gene: promoter, 5'UTR, exons E0, E1, E2, ..., introns I0, I1, I2, ..., 3'UTR, poly-A site, and intergenic regions, on both the forward (+) and reverse (-) strands, shown next to a stretch of raw genomic DNA sequence]

GENSCAN (Burge & Karlin)

Transition probabilities:

    p(y_t^j = 1 | y_{t-1}^i = 1) = a_{i,j}


Probability of a parse

  • Given a sequence x = x_1 ... x_T and a parse y = y_1, ..., y_T
  • To find how likely the parse is (given our HMM and the sequence):

        p(x, y) = p(x_1 ... x_T, y_1, ..., y_T)                                   (joint probability)
                = p(y_1) p(x_1|y_1) p(y_2|y_1) p(x_2|y_2) ... p(y_T|y_{T-1}) p(x_T|y_T)
                = p(y_1, ..., y_T) p(x_1 ... x_T | y_1, ..., y_T)
                = π_{y_1} a_{y_1,y_2} ... a_{y_{T-1},y_T} b_{y_1,x_1} ... b_{y_T,x_T}

    where, in indicator notation,

        a_{y_t,y_{t+1}} := Π_{i,j} (a_{ij})^{y_t^i y_{t+1}^j},    π_{y_1} := Π_i (π_i)^{y_1^i},    b_{y_t,x_t} := Π_{i,k} (b_{ik})^{y_t^i x_t^k}

  • Marginal probability:

        p(x) = Σ_y p(x, y) = Σ_{y_1} Σ_{y_2} ... Σ_{y_T} π_{y_1} Π_{t=2}^T a_{y_{t-1},y_t} Π_{t=1}^T p(x_t|y_t)

  • Posterior probability:

        p(y|x) = p(x, y) / p(x)

[Figure: the HMM chain y_1, y_2, y_3, ..., y_T with emissions x_1, x_2, x_3, ..., x_T]
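The joint probability of a fixed parse is just this product; a short sketch using the casino parameters (the example roll and state sequences are made up for illustration):

    import numpy as np

    # casino parameters as in the earlier sketch (start distribution assumed uniform)
    a = np.array([[0.95, 0.05], [0.05, 0.95]])       # transitions
    b = np.array([[1/6] * 6, [0.1] * 5 + [0.5]])     # emissions, die faces 1..6
    pi = np.array([0.5, 0.5])                        # start probabilities

    def joint_prob(x, y):
        """p(x, y) = pi_{y1} b_{y1,x1} * prod_{t>1} a_{y_{t-1},y_t} b_{y_t,x_t}."""
        p = pi[y[0]] * b[y[0], x[0]]
        for t in range(1, len(x)):
            p *= a[y[t - 1], y[t]] * b[y[t], x[t]]
        return p

    # e.g. rolls 6, 6, 6, 2 (0-indexed faces) under states LOADED, LOADED, LOADED, FAIR
    print(joint_prob([5, 5, 5, 1], [1, 1, 1, 0]))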


Applications of HMMs

Some early applications of HMMs

  • finance, but we never saw them
  • speech recognition
  • modelling ion channels

In the mid-late 1980s HMMs entered genetics and molecular

biology, and they are now firmly entrenched.

Some current applications of HMMs to biology

  • mapping chromosomes
  • aligning biological sequences
  • predicting sequence structure
  • inferring evolutionary relationships
  • finding genes in DNA sequence

Three main questions on HMMs

  • 1. Evaluation

GIVEN an HMM M, and a sequence x, FIND Prob (x | M) ALGO. Forward

  • 2. Decoding

GIVEN an HMM M, and a sequence x , FIND the sequence y of states that maximizes, e.g., P(y | x , M),

  • or the most probable subsequence of states

ALGO. Viterbi, Forward-backward

  • 3. Learning (next lecture)

GIVEN an HMM M, with unspecified transition/emission probs., and a sequence x, FIND parameters θ = (πi, aij, ηik) that maximize P(x | θ) ALGO. Baum-Welch (EM)


The Forward Algorithm

  • We want to calculate P(x), the likelihood of x, given the HMM
  • Sum over all possible ways of generating x:

        p(x) = Σ_y p(x, y) = Σ_{y_1} Σ_{y_2} ... Σ_{y_T} π_{y_1} Π_{t=2}^T a_{y_{t-1},y_t} Π_{t=1}^T p(x_t|y_t)

  • To avoid summing over an exponential number of paths y, define

        α_t^k = P(x_1, ..., x_t, y_t^k = 1)    (the forward probability)

  • The recursion:

        α_t^k = p(x_t | y_t^k = 1) Σ_i α_{t-1}^i a_{i,k}

        P(x) = Σ_k α_T^k


The Forward Algorithm – derivation

Compute the forward probability:

    α_t^k = P(x_1, ..., x_{t-1}, x_t, y_t^k = 1)

          = Σ_{y_{t-1}} P(x_1, ..., x_{t-1}, y_{t-1}, x_t, y_t^k = 1)

          = Σ_{y_{t-1}} P(x_1, ..., x_{t-1}, y_{t-1}) P(y_t^k = 1 | x_1, ..., x_{t-1}, y_{t-1}) P(x_t | x_1, ..., x_{t-1}, y_{t-1}, y_t^k = 1)

          = Σ_{y_{t-1}} P(x_1, ..., x_{t-1}, y_{t-1}) P(y_t^k = 1 | y_{t-1}) P(x_t | y_t^k = 1)

          = P(x_t | y_t^k = 1) Σ_i P(x_1, ..., x_{t-1}, y_{t-1}^i = 1) P(y_t^k = 1 | y_{t-1}^i = 1)

          = P(x_t | y_t^k = 1) Σ_i α_{t-1}^i a_{i,k}

    (Chain rule: P(A, B, C) = P(A) P(B|A) P(C|A, B))

[Figure: the HMM chain around positions 1, ..., t-1, t]


Recall the Elimination and Message Passing Algorithm

  • Elimination ≡ message passing on a clique tree

[Figure: the clique tree for the example graph over nodes A through H with messages m_b, ..., m_h, and the HMM chain y_1, ..., y_T with emissions x_1, ..., x_T]

    m_e(a,c,d) = Σ_e p(e|c,d) m_g(e) m_f(a,e)

    α_t^k = p(x_t | y_t^k = 1) Σ_i α_{t-1}^i a_{i,k},       P(x) = Σ_k α_T^k


The Forward Algorithm

We can compute α_t^k for all k and t using dynamic programming!

Initialization:

    α_1^k = P(x_1, y_1^k = 1) = P(x_1 | y_1^k = 1) P(y_1^k = 1) = P(x_1 | y_1^k = 1) π_k

Iteration:

    α_t^k = P(x_t | y_t^k = 1) Σ_i α_{t-1}^i a_{i,k}

Termination:

    P(x) = Σ_k α_T^k
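A direct numpy transcription of these three steps (a sketch; a, b, pi are the transition, emission, and start arrays in the convention of the earlier casino sketch, and x is a 0-indexed observation sequence):

    import numpy as np

    def forward(x, pi, a, b):
        """alpha[t, k] = P(x_1..x_t, y_t = k); returns (alpha, P(x))."""
        T, K = len(x), len(pi)
        alpha = np.zeros((T, K))
        alpha[0] = b[:, x[0]] * pi                      # initialization: P(x_1 | k) pi_k
        for t in range(1, T):
            alpha[t] = b[:, x[t]] * (alpha[t - 1] @ a)  # iteration: P(x_t | k) Σ_i alpha_{t-1}^i a_{i,k}
        return alpha, alpha[-1].sum()                   # termination: P(x) = Σ_k alpha_T^k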


The Backward Algorithm

We want to compute P(y_t^k = 1 | x), the posterior probability distribution of the t-th position, given x

  • We start by computing

        P(y_t^k = 1, x) = P(x_1, ..., x_t, y_t^k = 1, x_{t+1}, ..., x_T)
                        = P(x_1, ..., x_t, y_t^k = 1) P(x_{t+1}, ..., x_T | x_1, ..., x_t, y_t^k = 1)
                        = P(x_1, ..., x_t, y_t^k = 1) P(x_{t+1}, ..., x_T | y_t^k = 1)
                        = α_t^k β_t^k

    Forward: α_t^k          Backward: β_t^k := P(x_{t+1}, ..., x_T | y_t^k = 1)

  • The recursion:

        β_t^k = Σ_i a_{k,i} p(x_{t+1} | y_{t+1}^i = 1) β_{t+1}^i

[Figure: the HMM chain around positions t, t+1, ..., T]


The Backward Algorithm – derivation

Define the backward probability:

    β_t^k = P(x_{t+1}, ..., x_T | y_t^k = 1)

          = Σ_{y_{t+1}} P(x_{t+1}, x_{t+2}, ..., x_T, y_{t+1} | y_t^k = 1)

          = Σ_i P(y_{t+1}^i = 1 | y_t^k = 1) p(x_{t+1} | y_{t+1}^i = 1, y_t^k = 1) P(x_{t+2}, ..., x_T | x_{t+1}, y_{t+1}^i = 1, y_t^k = 1)

          = Σ_i P(y_{t+1}^i = 1 | y_t^k = 1) p(x_{t+1} | y_{t+1}^i = 1) P(x_{t+2}, ..., x_T | y_{t+1}^i = 1)

          = Σ_i a_{k,i} p(x_{t+1} | y_{t+1}^i = 1) β_{t+1}^i

    (Chain rule: P(A, B, C | α) = P(A | α) P(B | A, α) P(C | A, B, α))

[Figure: the HMM chain around positions t, t+1, ..., T]


The Backward Algorithm

We can compute β_t^k for all k and t using dynamic programming!

Initialization:

    β_T^k = 1,  ∀ k

Iteration:

    β_t^k = Σ_i a_{k,i} P(x_{t+1} | y_{t+1}^i = 1) β_{t+1}^i

Termination:

    P(x) = Σ_k α_1^k β_1^k
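The corresponding backward pass, under the same conventions as the forward sketch above:

    import numpy as np

    def backward(x, pi, a, b):
        """beta[t, k] = P(x_{t+1}..x_T | y_t = k); returns (beta, P(x))."""
        T, K = len(x), len(pi)
        beta = np.zeros((T, K))
        beta[-1] = 1.0                                    # initialization: beta_T^k = 1
        for t in range(T - 2, -1, -1):
            beta[t] = a @ (b[:, x[t + 1]] * beta[t + 1])  # iteration: Σ_i a_{k,i} P(x_{t+1} | i) beta_{t+1}^i
        return beta, np.sum(pi * b[:, x[0]] * beta[0])    # termination: P(x) = Σ_k alpha_1^k beta_1^k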


Shafer-Shenoy for HMMs

Recap: Shafer-Shenoy algorithm

  • Message from clique i to clique j:

        μ_{i→j}(S_{ij}) = Σ_{C_i \ S_{ij}} ψ(C_i) Π_{k≠j} μ_{k→i}(S_{ki})

  • Clique marginal:

        p(C_i) ∝ ψ(C_i) Π_k μ_{k→i}(S_{ki})


Message Passing for HMMs (cont.)

A junction tree for the HMM:

[Figure: the HMM junction tree: cliques (y_1,x_1), (y_1,y_2), (y_2,y_3), ..., (y_{T-1},y_T), with an emission clique (y_t,x_t) attached at each step; the clique potentials are ψ(y_1,x_1), ψ(y_t,y_{t+1}), ψ(y_t,x_t), and the separator potentials are φ(y_1), φ(y_2), ..., ζ(y_2), ζ(y_3), ..., ζ(y_T)]

Rightward pass:

    μ_{t→t+1}(y_{t+1}) = Σ_{y_t} ψ(y_t, y_{t+1}) μ_{t-1→t}(y_t) μ_↑(y_{t+1})
                       = Σ_{y_t} p(y_{t+1} | y_t) p(x_{t+1} | y_{t+1}) μ_{t-1→t}(y_t)
                       = p(x_{t+1} | y_{t+1}) Σ_{y_t} a_{y_t, y_{t+1}} μ_{t-1→t}(y_t)

  • This is exactly the forward algorithm!

Leftward pass ...

    μ_{t←t+1}(y_t) = Σ_{y_{t+1}} ψ(y_t, y_{t+1}) μ_{t+1←t+2}(y_{t+1}) μ_↑(y_{t+1})
                   = Σ_{y_{t+1}} p(y_{t+1} | y_t) μ_{t+1←t+2}(y_{t+1}) p(x_{t+1} | y_{t+1})

  • This is exactly the backward algorithm!


Posterior decoding

We can now calculate

    P(y_t^k = 1 | x) = P(y_t^k = 1, x) / P(x) = α_t^k β_t^k / P(x)

Then, we can ask:

  • What is the most likely state at position t of sequence x:

        k_t^* = argmax_k P(y_t^k = 1 | x)

  • Note that this is an MPA of a single hidden state; what if we want an MPA of the whole hidden state sequence?
  • Posterior Decoding:

        { y_t^{k_t^*} = 1 : t = 1, ..., T }

  • This is different from the MPA of a whole sequence of hidden states
  • This can be understood as bit error rate vs. word error rate

Example: MPA of X? MPA of (X, Y)?

    x   y   P(x,y)
    0   0   0.35
    0   1   0.05
    1   0   0.30
    1   1   0.30
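Combining the two passes gives posterior decoding in a few lines (a sketch that reuses the forward and backward functions from the earlier sketches):

    import numpy as np

    def posterior_decode(x, pi, a, b):
        """Return gamma[t, k] = P(y_t = k | x) and the per-position argmax states."""
        alpha, px = forward(x, pi, a, b)
        beta, _ = backward(x, pi, a, b)
        gamma = alpha * beta / px           # P(y_t = k | x) = alpha_t^k beta_t^k / P(x)
        return gamma, gamma.argmax(axis=1)  # most likely single state at each position t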


Viterbi decoding

  • GIVEN x = x_1, ..., x_T, we want to find y = y_1, ..., y_T such that P(y|x) is maximized:

        y* = argmax_y P(y|x) = argmax_y P(y, x)

  • Let

        V_t^k = max_{y_1,...,y_{t-1}} P(x_1, ..., x_{t-1}, y_1, ..., y_{t-1}, x_t, y_t^k = 1)
              = probability of the most likely sequence of states ending at state y_t = k

    where

        p(x_1, ..., x_T, y_1, ..., y_T) = π_{y_1} a_{y_1,y_2} ... a_{y_{T-1},y_T} b_{y_1,x_1} ... b_{y_T,x_T}

  • The recursion:

        V_t^k = p(x_t | y_t^k = 1) max_i a_{i,k} V_{t-1}^i

  • Underflows are a significant problem: these numbers become extremely small
  • Solution: take the logs of all values:

        V_t^k = log p(x_t | y_t^k = 1) + max_i ( log a_{i,k} + V_{t-1}^i )

[Figure: the Viterbi trellis of K states over positions x_1, x_2, x_3, ..., x_N, with entries V_t^k]


The Viterbi Algorithm – derivation

Define the Viterbi probability:

    V_{t+1}^k = max_{y_1,...,y_t} P(x_1, ..., x_t, y_1, ..., y_t, x_{t+1}, y_{t+1}^k = 1)

              = max_{y_1,...,y_t} P(x_{t+1}, y_{t+1}^k = 1 | x_1, ..., x_t, y_1, ..., y_t) P(x_1, ..., x_t, y_1, ..., y_t)

              = max_{y_1,...,y_t} P(x_{t+1} | y_{t+1}^k = 1) P(y_{t+1}^k = 1 | y_t) P(x_1, ..., x_{t-1}, y_1, ..., y_{t-1}, x_t, y_t)

              = P(x_{t+1} | y_{t+1}^k = 1) max_i a_{i,k} max_{y_1,...,y_{t-1}} P(x_1, ..., x_{t-1}, y_1, ..., y_{t-1}, x_t, y_t^i = 1)

              = P(x_{t+1} | y_{t+1}^k = 1) max_i a_{i,k} V_t^i


The Viterbi Algorithm

  • Input: x = x1, …, xT,

Initialization:

    V_1^k = P(x_1 | y_1^k = 1) π_k

Iteration:

    V_t^k = P(x_t | y_t^k = 1) max_i a_{i,k} V_{t-1}^i
    Ptr(k, t) = argmax_i a_{i,k} V_{t-1}^i

Termination:

    P(x, y*) = max_k V_T^k

TraceBack:

    y_T^* = argmax_k V_T^k,       y_{t-1}^* = Ptr(y_t^*, t)
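A log-space sketch of this recursion and traceback (same array conventions as the earlier sketches; logs are taken once up front to avoid underflow):

    import numpy as np

    def viterbi(x, pi, a, b):
        """Return the most probable state path y* and log P(x, y*)."""
        T, K = len(x), len(pi)
        la, lb, lpi = np.log(a), np.log(b), np.log(pi)
        V = np.zeros((T, K))
        ptr = np.zeros((T, K), dtype=int)
        V[0] = lb[:, x[0]] + lpi                        # initialization
        for t in range(1, T):
            scores = V[t - 1][:, None] + la             # scores[i, k] = V_{t-1}^i + log a_{i,k}
            ptr[t] = scores.argmax(axis=0)              # Ptr(k, t) = argmax_i
            V[t] = lb[:, x[t]] + scores.max(axis=0)     # iteration, in log space
        y = [int(V[-1].argmax())]                       # termination: argmax_k V_T^k
        for t in range(T - 1, 0, -1):                   # traceback
            y.append(int(ptr[t][y[-1]]))
        return y[::-1], float(V[-1].max())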


Computational Complexity and implementation details

What is the running time, and space required, for Forward and Backward?

    Time: O(K^2 N);   Space: O(KN).

Useful implementation technique to avoid underflows

  • Viterbi: sum of logs
  • Forward/Backward: rescaling at each position by multiplying by a constant

The three recursions, for reference:

    α_t^k = p(x_t | y_t^k = 1) Σ_i α_{t-1}^i a_{i,k}
    β_t^k = Σ_i a_{k,i} p(x_{t+1} | y_{t+1}^i = 1) β_{t+1}^i
    V_t^k = p(x_t | y_t^k = 1) max_i a_{i,k} V_{t-1}^i
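The rescaling idea as a sketch: normalize α_t at each position and keep the scaling constants, so that log P(x) is the sum of their logs (same conventions as the earlier sketches):

    import numpy as np

    def forward_scaled(x, pi, a, b):
        """Scaled forward pass: returns the normalized alphas and log P(x)."""
        T, K = len(x), len(pi)
        alpha = np.zeros((T, K))
        c = np.zeros(T)                                  # per-position scaling constants
        alpha[0] = b[:, x[0]] * pi
        c[0] = alpha[0].sum()
        alpha[0] /= c[0]
        for t in range(1, T):
            alpha[t] = b[:, x[t]] * (alpha[t - 1] @ a)
            c[t] = alpha[t].sum()                        # rescale by a constant at each position
            alpha[t] /= c[t]
        return alpha, np.log(c).sum()                    # log P(x) = Σ_t log c_t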