Hidden Markov Models

Steven J Zeil, Old Dominion Univ., Fall 2010
SLIDE 1

Hidden Markov Models

Steven J Zeil
Old Dominion Univ.
Fall 2010

SLIDE 2

Hidden Markov Models

1. Discrete Markov Processes
2. Hidden Markov Models
3. Inferences from HMMs: Evaluation, Decoding
4. Training an HMM: Baum-Welch Algorithm, Model Selection

SLIDE 3

Introduction

Sequences of input, not i.i.d.:
- Sequences in time: phonemes in a word, words in a sentence, pen movements in handwriting
- Sequences in space: base pairs in DNA

SLIDE 4

Discrete Markov Processes

N states: S_1, S_2, ..., S_N. The state at "time" t is q_t = S_i.

First-order Markov: the probability of entering a state depends only on the most recent prior state:
  P(q_{t+1} = S_j | q_t = S_i, q_{t-1} = S_k, ...) = P(q_{t+1} = S_j | q_t = S_i)

Transition probabilities are independent of time:
  a_{ij} ≡ P(q_{t+1} = S_j | q_t = S_i),  with a_{ij} ≥ 0 and Σ_{j=1}^{N} a_{ij} = 1

Initial probabilities:
  π_i ≡ P(q_1 = S_i),  with Σ_{i=1}^{N} π_i = 1

SLIDE 5

Stochastic Automaton

SLIDE 6

Example: Balls & Urns

Three urns, each full of balls of one color; a "genie" moves randomly from urn to urn, selecting balls. S_1: red, S_2: blue, S_3: green.

  π = [0.5, 0.2, 0.3]^T

  A = | 0.4 0.3 0.3 |
      | 0.2 0.6 0.2 |
      | 0.1 0.1 0.8 |

Suppose we observe O = [red, red, green, green]. Then
  P(O | A, π) = P(S_1) P(S_1|S_1) P(S_3|S_1) P(S_3|S_3)
              = π_1 a_{11} a_{13} a_{33}
              = 0.5 × 0.4 × 0.3 × 0.8 = 0.048
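The slide's computation can be checked with a few lines of code. This is an illustrative sketch (the function name and index convention are my own, not from the slides), using the urn model's π and A with state indices 0 = S_1 (red), 1 = S_2 (blue), 2 = S_3 (green):

```python
pi = [0.5, 0.2, 0.3]
A = [[0.4, 0.3, 0.3],
     [0.2, 0.6, 0.2],
     [0.1, 0.1, 0.8]]

def sequence_probability(states, pi, A):
    """P(q_1, ..., q_T) for a fully observed first-order Markov chain."""
    p = pi[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= A[prev][cur]
    return p

# O = [red, red, green, green] corresponds to states [S1, S1, S3, S3]
print(sequence_probability([0, 0, 2, 2], pi, A))  # ≈ 0.048 = 0.5 * 0.4 * 0.3 * 0.8
```

This works only because the states are directly observed; the next slides remove that assumption.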

SLIDE 7

Hiding the Model

Now suppose that:
- the urns and the genie are hidden behind a screen,
- the urns start with different mixtures of all three colors, and
- (if we're really unlucky) we don't even know how many urns there are.

Suppose we observe O = [red, red, green, green]. Can we say anything at all?

SLIDE 8

Hidden Markov Models

States are not observable. Discrete observations [v_1, v_2, ..., v_M] are recorded; each is a probabilistic function of the state.

Emission probabilities: b_j(m) ≡ P(O_t = v_m | q_t = S_j)

For any given sequence of observations, there may be multiple possible state sequences.

SLIDE 9

HMM Unfolded in Time

SLIDE 10

Elements of an HMM

An HMM is λ = (A, B, π):
- A = [a_{ij}]: N × N state transition probability matrix, where N is the number of hidden states
- B = [b_j(m)]: N × M emission probability matrix, where M is the number of observation symbols
- π = [π_i]: N × 1 initial state probability vector

SLIDE 11

Making Inferences from an HMM

Evaluation: Given λ and O, calculate P(O|λ).
Example: Given several HMMs, each trained to recognize a different handwritten character, and given a sequence of pen strokes, which character is most likely denoted by that sequence?

Decoding: Given λ and O, what is the most probable sequence of states leading to that observation?
Example: Given an HMM trained on sentences and a sequence of words, some of which can belong to multiple syntactic classes (e.g., "green" can be an adjective, a noun, or a verb), determine the most likely syntactic class from surrounding context.

Related problems: most likely starting or ending state.

SLIDE 12

Decoding Example

What's the weather been? States can be "labeled" even though "hidden".

SLIDE 13

Evaluation

Given λ and O, calculate P(O|λ).

If we knew the state sequence q, we could compute
  P(O | λ, q) = Π_{t=1}^{T} P(O_t | q_t, λ) = Π_{t=1}^{T} b_{q_t}(O_t)

The probability of a state sequence is
  P(q | λ) = π_{q_1} Π_{t=1}^{T−1} a_{q_t q_{t+1}}

so
  P(O, q | λ) = π_{q_1} b_{q_1}(O_1) Π_{t=1}^{T−1} a_{q_t q_{t+1}} b_{q_{t+1}}(O_{t+1})

  P(O | λ) = Σ_{all possible q} P(O, q | λ)

which is totally impractical: the sum ranges over N^T state sequences.
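The impractical sum can still be written down for a tiny model, which makes the cost concrete. The two-state, two-symbol model below is a made-up toy example (not from the slides):

```python
from itertools import product

pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]   # B[j][m] = P(O_t = v_m | q_t = S_j)
O = [0, 1, 0]                  # observations as symbol indices

def joint_probability(q, O):
    """P(O, q | lambda) = pi_{q1} b_{q1}(O_1) * prod_t a_{q_t q_{t+1}} b_{q_{t+1}}(O_{t+1})."""
    p = pi[q[0]] * B[q[0]][O[0]]
    for t in range(1, len(O)):
        p *= A[q[t-1]][q[t]] * B[q[t]][O[t]]
    return p

# Enumerate all N^T state sequences -- exponential, hence "totally impractical".
p_O = sum(joint_probability(q, O) for q in product(range(2), repeat=len(O)))
print(p_O)
```

For N = 2, T = 3 this is only 8 terms, but the count grows as N^T; the forward variable on the next slide reduces it to O(N²T) work.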

SLIDE 14

Forward Variable

  α_t(i) ≡ P(O_1 ... O_t, q_t = S_i | λ)

  P(O | λ) = Σ_{i=1}^{N} α_T(i)

Computed recursively:
  Initial:   α_1(i) = π_i b_i(O_1)
  Recursion: α_{t+1}(j) = [ Σ_{i=1}^{N} α_t(i) a_{ij} ] b_j(O_{t+1})
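The recursion above translates directly into code. A minimal sketch (the toy model values are illustrative, not from the slides):

```python
def forward(O, pi, A, B):
    """Return alpha, where alpha[t][i] = P(O_1 .. O_{t+1}, q_{t+1} = S_i | lambda)
    with 0-based t. Implements the initial step and recursion from this slide."""
    N = len(pi)
    alpha = [[pi[i] * B[i][O[0]] for i in range(N)]]              # alpha_1(i)
    for t in range(1, len(O)):
        alpha.append([sum(alpha[t-1][i] * A[i][j] for i in range(N)) * B[j][O[t]]
                      for j in range(N)])
    return alpha

pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
alpha = forward([0, 1, 0], pi, A, B)
print(sum(alpha[-1]))    # P(O | lambda), now in O(N^2 T) time
```

Summing the last column reproduces the same P(O|λ) that the brute-force enumeration gives, at polynomial rather than exponential cost.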

SLIDE 15

Decoding

Given λ and O, what is the most probable sequence of states leading to that observation?

Start by introducing a backward variable:
  β_t(i) ≡ P(O_{t+1} ... O_T | q_t = S_i, λ)

  Initial:   β_T(i) = 1
  Recursion: β_t(i) = Σ_{j=1}^{N} a_{ij} b_j(O_{t+1}) β_{t+1}(j)
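The backward recursion runs right-to-left but is otherwise symmetric to the forward one. A sketch on the same made-up toy model:

```python
def backward(O, pi, A, B):
    """Return beta, where beta[t][i] = P(O_{t+2} .. O_T | q_{t+1} = S_i, lambda)
    with 0-based t. Implements the recursion from this slide."""
    N, T = len(pi), len(O)
    beta = [[1.0] * N]                                     # beta_T(i) = 1
    for t in range(T - 2, -1, -1):
        beta.insert(0, [sum(A[i][j] * B[j][O[t+1]] * beta[0][j] for j in range(N))
                        for i in range(N)])
    return beta

pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
beta = backward([0, 1, 0], pi, A, B)
# Consistency check: sum_i pi_i b_i(O_1) beta_1(i) = P(O | lambda)
print(sum(pi[i] * B[i][0] * beta[0][i] for i in range(2)))
```

The printed value agrees with the forward computation of P(O|λ), which is a useful sanity check when implementing both recursions.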

SLIDE 16

Viterbi's Algorithm

A constrained optimizer for state-graph traversal. A dynamic programming algorithm:
- Assign a cost to each edge.
- Update path metrics by addition from shorter paths.
- Discard suboptimal cases.
- Starting from the final state, trace back the optimal path.

SLIDE 17

The HMM Trellis

SLIDE 18

Viterbi's Algorithm for HMMs

  δ_t(i) ≡ max_{q_1 q_2 ... q_{t−1}} P(q_1 q_2 ... q_{t−1}, q_t = S_i, O_1 ... O_t | λ)

  Initial:   δ_1(i) = π_i b_i(O_1),  ψ_1(i) = 0
  Iterate:   δ_t(j) = max_i [ δ_{t−1}(i) a_{ij} ] b_j(O_t)
             ψ_t(j) = arg max_i δ_{t−1}(i) a_{ij}
  Optimum:   p* = max_i δ_T(i),  q*_T = arg max_i δ_T(i)
  Backtrack: q*_t = ψ_{t+1}(q*_{t+1}),  t = T − 1, ..., 1

Examples: numeric sequence (fixed problem); coin flipping (customizable); spelling correction as a decoding problem.
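The four steps above can be sketched directly; the two-state model is the same made-up toy example used earlier, not from the slides:

```python
def viterbi(O, pi, A, B):
    """Most probable state path for observations O, per the recurrences on this slide."""
    N, T = len(pi), len(O)
    delta = [pi[i] * B[i][O[0]] for i in range(N)]        # delta_1(i)
    psi = [[0] * N]                                       # psi_1(i) = 0
    for t in range(1, T):
        new_delta, new_psi = [], []
        for j in range(N):
            best_i = max(range(N), key=lambda i: delta[i] * A[i][j])
            new_psi.append(best_i)
            new_delta.append(delta[best_i] * A[best_i][j] * B[j][O[t]])
        delta, psi = new_delta, psi + [new_psi]
    # Optimum, then backtrack from the best final state.
    q = [max(range(N), key=lambda i: delta[i])]
    for t in range(T - 1, 0, -1):
        q.insert(0, psi[t][q[0]])
    return q, max(delta)

pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
path, p_star = viterbi([0, 1, 0], pi, A, B)
print(path, p_star)
```

Note that p* is the probability of the single best path, which is generally smaller than P(O|λ), the sum over all paths.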

SLIDE 19

Training an HMM

We need to estimate the a_{ij}, π_i, and b_j(m) that maximize the likelihood of observing a set of training instances X = {O^k}_{k=1}^{K}.

SLIDE 20

Baum-Welch Algorithm - Overview

An E-M style algorithm. Repeatedly apply the steps:

E: Use the current λ = (A, B, π) to compute, for each training instance,
   - the probability of being in S_i at time t, and
   - the probability of making the transition from S_i to S_j at time t + 1.

M: Update the values of λ = (A, B, π) to maximize the likelihood of matching those probabilities.

SLIDE 21

From the Training Data

Define indicator variables over the training instances:

  z_i^t = 1 if q_t = S_i, 0 otherwise

  Note that (Σ_k z_i^k) / K = P̂(q_t = S_i | λ)

  z_{ij}^t = 1 if q_t = S_i ∧ q_{t+1} = S_j, 0 otherwise

  Note that (Σ_k z_{ij}^k) / K = P̂(q_t = S_i, q_{t+1} = S_j | λ)

SLIDE 22

From the HMM

  γ_t(i) ≡ P(q_t = S_i | O, λ) = α_t(i) β_t(i) / Σ_{j=1}^{N} α_t(j) β_t(j)

During Baum-Welch, we estimate γ_t^k(i) ≈ E[z_i^t].

SLIDE 23

From the HMM

  ξ_t(i, j) ≡ P(q_t = S_i, q_{t+1} = S_j | O, λ)
            = α_t(i) a_{ij} b_j(O_{t+1}) β_{t+1}(j) / Σ_k Σ_m α_t(k) a_{km} b_m(O_{t+1}) β_{t+1}(m)

During Baum-Welch, we estimate ξ_t^k(i, j) ≈ E[z_{ij}^t].

SLIDE 24

Baum-Welch Algorithm - E

Repeatedly apply the steps:

E: For each O^k,
     γ_t^k(i) ← E[z_i^t]
     ξ_t^k(i, j) ← E[z_{ij}^t]
   Then average over all observations:
     γ_t(i) ← (Σ_{k=1}^{K} γ_t^k(i)) / K
     ξ_t(i, j) ← (Σ_{k=1}^{K} ξ_t^k(i, j)) / K

M: Update the values of λ = (A, B, π) to maximize the likelihood of matching those probabilities.

SLIDE 25

Updating A

Expected number of transitions from S_i to S_j: Σ_t ξ_t(i, j)
Expected number of times to be in S_i: Σ_t γ_t(i)

Therefore the probability of the transition from S_i to S_j is
  â_{ij} = Σ_t ξ_t(i, j) / Σ_t γ_t(i)

SLIDE 26

Updating B

Expected number of times we see v_m when the system is in S_j: Σ_{t=1}^{T} γ_t(j) 1(O_t = v_m)
Expected number of times to be in S_j: Σ_t γ_t(j)

Therefore the probability of emitting v_m from S_j is
  b̂_j(m) = Σ_k Σ_t γ_t^k(j) 1(O_t^k = v_m) / Σ_k Σ_t γ_t^k(j)

SLIDE 27

Updating π

Probability of starting in S_i:
  π̂_i = (Σ_k γ_1^k(i)) / K

SLIDE 28

Baum-Welch Algorithm - EM

Repeatedly apply the steps:

E: For each O^k,
     γ_t^k(i) ← E[z_i^t]
     ξ_t^k(i, j) ← E[z_{ij}^t]
   γ_t(i) ← (Σ_{k=1}^{K} γ_t^k(i)) / K
   ξ_t(i, j) ← (Σ_{k=1}^{K} ξ_t^k(i, j)) / K

M:
   â_{ij} ← Σ_t ξ_t(i, j) / Σ_t γ_t(i)
   b̂_j(m) ← Σ_k Σ_t γ_t^k(j) 1(O_t^k = v_m) / Σ_k Σ_t γ_t^k(j)
   π̂_i ← (Σ_k γ_1^k(i)) / K

(Practical implementations often require careful scaling.)
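One full E-M iteration can be sketched for the simplest case of a single training sequence (K = 1), without the numerical scaling a practical implementation needs. All names and the toy values are illustrative, not from the slides:

```python
def forward_backward(O, pi, A, B):
    """Forward and backward variables (lists indexed [t][i], 0-based t)."""
    N, T = len(pi), len(O)
    alpha = [[pi[i] * B[i][O[0]] for i in range(N)]]
    for t in range(1, T):
        alpha.append([sum(alpha[t-1][i] * A[i][j] for i in range(N)) * B[j][O[t]]
                      for j in range(N)])
    beta = [[1.0] * N]
    for t in range(T - 2, -1, -1):
        beta.insert(0, [sum(A[i][j] * B[j][O[t+1]] * beta[0][j] for j in range(N))
                        for i in range(N)])
    return alpha, beta

def baum_welch_step(O, pi, A, B, M):
    """One E-M update of (pi, A, B) from a single observation sequence O."""
    N, T = len(pi), len(O)
    alpha, beta = forward_backward(O, pi, A, B)
    p_O = sum(alpha[-1])   # P(O | lambda); also the xi normalizer
    # E step: gamma_t(i) and xi_t(i, j) from the forward/backward variables.
    gamma = [[alpha[t][i] * beta[t][i] / p_O for i in range(N)] for t in range(T)]
    xi = [[[alpha[t][i] * A[i][j] * B[j][O[t+1]] * beta[t+1][j] / p_O
            for j in range(N)] for i in range(N)] for t in range(T - 1)]
    # M step: re-estimate per the update formulas on this slide.
    new_pi = gamma[0][:]
    new_A = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1))
              for j in range(N)] for i in range(N)]
    new_B = [[sum(gamma[t][j] for t in range(T) if O[t] == m) /
              sum(gamma[t][j] for t in range(T))
              for m in range(M)] for j in range(N)]
    return new_pi, new_A, new_B

new_pi, new_A, new_B = baum_welch_step([0, 1, 0, 0, 1], [0.6, 0.4],
                                       [[0.7, 0.3], [0.4, 0.6]],
                                       [[0.9, 0.1], [0.2, 0.8]], M=2)
```

Iterating this step monotonically increases the likelihood. For K > 1 the γ and ξ counts are averaged over the training instances before the M step, as on the slide; for long sequences the α and β values underflow, which is the scaling problem noted above.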

SLIDE 29

Model Selection

Classification: train a separate HMM for each class and apply Bayes' rule:
  P(λ_i | O) = P(O | λ_i) P(λ_i) / Σ_j P(O | λ_j) P(λ_j)

For many problems, we encode prior knowledge as known values for transitions, emissions, and/or starting points. For others, we may know something about the "shape" of the HMM.

SLIDE 30

Left-to-Right HMMs

  A = | a_11 a_12 a_13 0    |
      | 0    a_22 a_23 a_24 |
      | 0    0    a_33 a_34 |
      | 0    0    0    a_44 |

Useful for modeling signals whose properties change over time.

Sometimes large jumps are prohibited, e.g., no jumps of more than k states (a band-diagonal matrix).

No change to training is required: transitions that are initially zero never become positive.

Example: face recognition.

SLIDE 31

Layered HMMs

Lower layers recognize sequences (e.g., phonemes to words); the best sequence is fed to a higher layer (e.g., words to sentences).

SLIDE 32

Other Variants

- Edge emitters: in some variants, outputs are associated with transition edges rather than with state nodes. Some edges may emit an empty (null) signal.
- Tied states: some model states may be known to be isomorphic; their parameters are forced to be equal.
- Hierarchical HMM (HHMM): each state is itself an HMM.