SLIDE 1

Machine Learning for Signal Processing

Hidden Markov Models

Bhiksha Raj 10 Nov 2016

11755/18797 1

SLIDE 2

Prediction : a holy grail

  • Physical trajectories

– Automobiles, rockets, heavenly bodies

  • Natural phenomena

– Weather

  • Financial data

– Stock market

  • World affairs

– Who is going to have the next XXXX spring?

  • Signals

– Audio, video..

SLIDE 3

The wind and the target

  • Aim: measure wind velocity accurately

– For some important task

  • Using a noisy wind speed sensor

– E.g. arrows shot at a target

  • Situation:

– Wind speed at time t depends on speed at t-1

  • w_t = w_{t-1} + ε_t

– Arrow position at time t depends on wind speed at time t

  • x_t = c·w_t + δ_t
  • Challenge: Given a sequence of observations x_1, x_2, …, x_t

– Estimate the current wind speed w_t
– Predict the wind speed and arrow position at t+1: w_{t+1} and x_{t+1}

SLIDE 4

A Common Trait

  • Series data with trends
  • Stochastic functions of stochastic functions (of stochastic functions of …)
  • An underlying process that progresses (seemingly) randomly

– E.g. wind speed
– E.g. current position of a vehicle
– E.g. current sentiment in the stock market

  • Random expressions of underlying process

– E.g. wind speed sensor measurement
– E.g. what you see from the vehicle
– E.g. current stock prices of various stocks

SLIDE 5

What a sensible agent must do

  • Learn about the process

– From whatever they know

  • E.g. learn the wind-speed function

and the arrow-to-wind function

– Basic requirement for other procedures

  • Track underlying processes

– Track the wind speed

  • Predict future values

SLIDE 6

A Specific Form of Process..

  • Doubly stochastic processes
  • One random process generates a “state”

variable X

– Random process X ~ P(X; Θ)

  • Second-level process generates observations

as a function of state X

  • Random process Y ~ P(Y; f(X, L))

SLIDE 7

Doubly Stochastic Process Model

  • Doubly stochastic processes

are models

– May not be a true representation of the process underlying the actual data
  • First level variable may be a quantifiable variable

– Position/state of vehicle

  • Second level variable is a stochastic function of position

  • First level variable may not have meaning

– “Sentiment” of a stock market – “Configuration” of vocal tract

SLIDE 8

Stochastic Function of a Markov Chain

  • First-level variable is usually abstract
  • The first level variable is assumed to be the output of a Markov chain

  • The second level variable is a function of the output of the

Markov Chain

  • Also called an HMM
  • Another variant – stochastic function of Markov process

– Kalman Filtering..

SLIDE 9

Markov Chain

  • Process can go through a number of states

– Random walk, Brownian motion..

  • From each state, it can go to any other state with a probability

– Which only depends on the current state

  • Walk goes on forever

– Or until it hits an “absorbing wall”

  • Output of the process – a sequence of states the process went

through

SLIDE 10

Stochastic Function of a Markov Chain

  • Output:

– Y = Y1 Y2 …
– Yi ~ P(Yi; f(si))

  • Probability distribution is a function of the state
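This two-level generative story can be sketched in code. The state names, transition rows, and Gaussian output parameters below are illustrative stand-ins, not values from the slides:

```python
import random

# Hypothetical 3-state Markov chain with a Gaussian output PDF per state.
# All numbers here are illustrative, not taken from the lecture.
states = ["S1", "S2", "S3"]
trans = {"S1": [0.6, 0.4, 0.0],   # P(next state | current state)
         "S2": [0.0, 0.7, 0.3],
         "S3": [0.5, 0.0, 0.5]}
emit = {"S1": (0.0, 1.0), "S2": (5.0, 1.0), "S3": (10.0, 1.0)}  # (mean, std)

def sample(n, start="S1", seed=0):
    """Generate n observations: walk the chain, emit from the current state."""
    rng = random.Random(seed)
    s, ys = start, []
    for _ in range(n):
        mu, sd = emit[s]
        ys.append(rng.gauss(mu, sd))                  # Y_i ~ P(Y; f(s_i))
        s = rng.choices(states, weights=trans[s])[0]  # Markov transition
    return ys
```

Only the observations returned by `sample` would be visible to an observer; the sequence of visited states stays internal to the walk.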

SLIDE 11

A little parable

You’ve been kidnapped and blindfolded.
You can only hear the car.
You must find your way back home from wherever they drop you off.

SLIDE 14

Kidnapped

  • Determine automatically, by only listening to a running

automobile, if it is:

– Idling; or – Travelling at constant velocity; or – Accelerating; or – Decelerating

  • You are super acoustically sensitive and can determine

sound pressure level (SPL)

– The SPL is measured once per second

SLIDE 15

What you know

  • An automobile that is at rest can accelerate, or

continue to stay at rest

  • An accelerating automobile can hit a steady-state velocity, continue to accelerate, or decelerate

  • A decelerating automobile can continue to

decelerate, come to rest, cruise, or accelerate

  • An automobile at a steady-state velocity can

stay in steady state, accelerate or decelerate

SLIDE 16

What else you know

  • The probability distribution of the SPL of the

sound is different in the various conditions

– As shown in figure

  • In reality, depends on the car
  • The distributions for the different conditions overlap

– Simply knowing the current sound level is not enough to know the state of the car

[Figure: SPL distributions P(x|idle), P(x|decel), P(x|cruise), P(x|accel), centered near 45, 60, 65, 70]

SLIDE 17

The Model!

  • The state-space model

– Assuming all transitions from a state are equally probable

– We will help you find your way back home in the next class

[Figure: state-space model. Idling state: P(x|idle) at 45; Accelerating state: P(x|accel) at 70; Cruising state: P(x|cruise) at 65; Decelerating state: P(x|decel) at 60]

Transition probabilities (rows: from-state; columns: to-state I, A, C, D):
  I:  0.5   0.5    -     -
  A:   -    1/3   1/3   1/3
  C:   -    1/3   1/3   1/3
  D:  0.25  0.25  0.25  0.25
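The transition table on this slide can be transcribed directly into code and checked to be a valid Markov chain (every row must sum to 1); the I/A/C/D labels follow the slide:

```python
# Transition matrix from the slide: rows are from-states, columns to-states,
# in the order Idle, Accelerating, Cruising, Decelerating.
T = {
    "I": {"I": 0.5,  "A": 0.5,  "C": 0.0,  "D": 0.0},
    "A": {"I": 0.0,  "A": 1/3,  "C": 1/3,  "D": 1/3},
    "C": {"I": 0.0,  "A": 1/3,  "C": 1/3,  "D": 1/3},
    "D": {"I": 0.25, "A": 0.25, "C": 0.25, "D": 0.25},
}

# Every row of a transition matrix must be a probability distribution.
for s, row in T.items():
    assert abs(sum(row.values()) - 1.0) < 1e-9, s
```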

SLIDE 18

What is an HMM

  • A “probabilistic function of a Markov chain”
  • Models a dynamical system
  • System goes through a number of states

– Following a Markov chain model

  • On arriving at any state it generates observations according to

a state-specific probability distribution

SLIDE 19

What is an HMM

  • The model assumes that the process can be in one of a number of states at any time instant
  • The state of the process at any time instant depends only on the

state at the previous instant (causality, Markovian)

  • At each instant the process generates an observation from a

probability distribution that is specific to the current state

  • The generated observations are all that we get to see

– the actual state of the process is not directly observable

  • Hence the qualifier hidden

SLIDE 20
  • A Hidden Markov Model consists of two components

– A state/transition backbone that specifies how many states there are, and how they can follow one another – A set of probability distributions, one for each state, which specifies the distribution of all vectors in that state


Hidden Markov Models

[Figure: Markov chain backbone; per-state data distributions]

SLIDE 21

How an HMM models a process

[Figure: HMM assumed to be generating data — state distributions, state sequence, observation sequence]

SLIDE 22

HMM Parameters

  • The topology of the HMM

– Number of states and allowed transitions
– E.g. here we have 3 states and cannot go from the blue state to the red

  • The transition probabilities

– Often represented as a matrix, as here
– T_ij is the probability that when in state i, the process will move to j

  • The probability π_i of beginning at any state s_i

– The complete set is represented as π

  • The state output distributions

           5 . 5 . 3 . 7 . 4 . 6 . T

11755/18797 22

0.6 0.4 0.7 0.3 0.5 0.5

slide-23
SLIDE 23

HMM state output distributions

  • The state output distribution is the distribution of data produced from

any state

  • Typically modelled as Gaussian
  • The parameters are μ_i and Θ_i
  • More typically, modelled as Gaussian mixtures
  • Other distributions may also be used
  • E.g. histograms for discrete observations

 

P(x | s_i) = Gaussian(x; μ_i, Θ_i) = (1 / sqrt((2π)^d |Θ_i|)) · exp(-0.5 (x - μ_i)^T Θ_i^{-1} (x - μ_i))

For a Gaussian mixture:

P(x | s_i) = Σ_{j=1..K} w_{i,j} Gaussian(x; μ_{i,j}, Θ_{i,j})

SLIDE 24

The Diagonal Covariance Matrix

  • For GMMs it is frequently assumed that the feature

vector dimensions are all independent of each other

  • Result: The covariance matrix is reduced to a diagonal

form

– The determinant of the diagonal Q matrix is easy to compute

Full covariance (all elements are non-zero):

  0.5 (x - μ)^T Θ^{-1} (x - μ)

Diagonal covariance (off-diagonal elements are zero):

  Σ_i (x_i - μ_i)² / (2 σ_i²)
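With a diagonal covariance the quadratic form factors into independent per-dimension terms; a small numeric check (the vectors and variances are illustrative):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
mu = np.array([0.0, 1.0, 1.0])
var = np.array([1.0, 4.0, 0.25])        # the diagonal of the covariance

# Full-matrix quadratic form: 0.5 (x - mu)^T Theta^{-1} (x - mu)
d = x - mu
full = 0.5 * d @ np.linalg.inv(np.diag(var)) @ d

# Diagonal shortcut: sum_i (x_i - mu_i)^2 / (2 sigma_i^2)
diag = np.sum(d ** 2 / (2 * var))
assert np.isclose(full, diag)
```

The shortcut also avoids the matrix inverse and determinant entirely, which is the practical payoff of the diagonal assumption.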

SLIDE 25

Three Basic HMM Problems

  • What is the probability that it will generate a

specific observation sequence

  • Given an observation sequence, how do we

determine which observation was generated from which state

– The state segmentation problem

  • How do we learn the parameters of the HMM

from observation sequences

SLIDE 26

Computing the Probability of an Observation Sequence

  • Two aspects to producing the observation:

– Progressing through a sequence of states – Producing observations from these states

SLIDE 27

HMM assumed to be generating data

Progressing through states

state sequence

  • The process begins at some state (red) here
  • From that state, it makes an allowed transition

– To arrive at the same or any other state

  • From that state it makes another allowed transition

– And so on

SLIDE 28

Probability that the HMM will follow a particular state sequence

  • P(s1) is the probability that the process will initially be in

state s1

  • P(s_j | s_i) is the transition probability of moving to state s_j at
the next time instant when the system is currently in state s_i

– Also denoted by T_ij earlier

P(s_1, s_2, s_3, ...) = P(s_1) P(s_2 | s_1) P(s_3 | s_2) ...

SLIDE 29

HMM assumed to be generating data

Generating Observations from States

[Figure: state distributions, state sequence, observation sequence]

  • At each time it generates an observation from the

state it is in at that time

SLIDE 30

P(o_1, o_2, o_3, ... | s_1, s_2, s_3, ...) = P(o_1 | s_1) P(o_2 | s_2) P(o_3 | s_3) ...

  • P(o_i | s_i) is the probability of generating observation o_i when the system is in state s_i

Probability that the HMM will generate a particular observation sequence given a state sequence (state sequence known)

Computed from the Gaussian or Gaussian mixture for state s1

SLIDE 31

HMM assumed to be generating data

Proceeding through States and Producing Observations

[Figure: state distributions, state sequence, observation sequence]

  • At each time it produces an observation and makes

a transition

SLIDE 32

Probability that the HMM will generate a particular state sequence and from it, a particular observation sequence

P(o_1, o_2, o_3, ..., s_1, s_2, s_3, ...)
  = P(o_1, o_2, o_3, ... | s_1, s_2, s_3, ...) · P(s_1, s_2, s_3, ...)
  = P(o_1 | s_1) P(o_2 | s_2) P(o_3 | s_3) ... · P(s_1) P(s_2 | s_1) P(s_3 | s_2) ...

SLIDE 33

Probability of Generating an Observation Sequence

P(o_1, o_2, o_3, ...)
  = Σ_{all possible state sequences} P(o_1, o_2, o_3, ..., s_1, s_2, s_3, ...)
  = Σ_{all possible state sequences} P(o_1 | s_1) P(o_2 | s_2) P(o_3 | s_3) ... · P(s_1) P(s_2 | s_1) P(s_3 | s_2) ...

  • The precise state sequence is not known
  • All possible state sequences must be considered

SLIDE 34

Computing it Efficiently

  • Explicit summing over all state sequences is not

tractable

– A very large number of possible state sequences

  • Instead we use the forward algorithm
  • A dynamic programming technique.

SLIDE 35

Illustrative Example

  • Example: a generic HMM with 5 states and a “terminating

state”.

– Left to right topology

  • P(si) = 1 for state 1 and 0 for others

– The arrows represent transitions for which the probability is not 0

  • Notation:

– P(s_j | s_i) = T_ij
– We represent P(o_t | s_i) = b_i(t) for brevity

SLIDE 36

Diversion: The Trellis

[Figure: trellis — Y-axis: state index; X-axis: feature vectors (time)]

  • The trellis is a graphical representation of all possible paths through the HMM to

produce a given observation

  • The Y-axis represents HMM states, the X-axis represents observations
  • Every edge in the graph represents a valid transition in the HMM over a single

time step

  • Every node represents the event of a particular observation being generated

from a particular state

SLIDE 37

The Forward Algorithm

[Figure: trellis with node α(s, t)]

  • α(s,t) is the total probability of ALL state
sequences that end at state s at time t, and all observations until x_t

α(s, t) = P(x_1, x_2, ..., x_t, state(t) = s)

SLIDE 38

The Forward Algorithm

[Figure: trellis — α(s,t) estimated recursively starting from the first time instant (forward recursion)]

  • α(s,t) can be recursively computed in terms of
α(s', t-1), the forward probabilities at time t-1

α(s, t) = P(x_1, x_2, ..., x_t, state(t) = s)

α(s, t) = Σ_{s'} α(s', t-1) P(s | s') P(x_t | s)

SLIDE 39

Totalprob = Σ_s α(s, T)

The Forward Algorithm

[Figure: trellis over the full observation, final time T]

  • In the final observation the alpha at each state gives the

probability of all state sequences ending at that state

  • General model: The total probability of the observation is

the sum of the alpha values at all states
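The recursion takes only a few lines of code. This is a minimal sketch (the matrix conventions are mine, and no per-frame scaling is done, so it underflows on long sequences):

```python
import numpy as np

def forward(pi, T, B):
    """Total probability of an observation sequence under an HMM.
    pi: (S,) initial probabilities; T: (S,S) transitions, T[i,j] = P(j|i);
    B: (S,N) precomputed output likelihoods, B[s,t] = P(x_t|s)."""
    S, N = B.shape
    alpha = np.zeros((S, N))
    alpha[:, 0] = pi * B[:, 0]                         # alpha(s,1) = P(s) P(x_1|s)
    for t in range(1, N):
        alpha[:, t] = (alpha[:, t - 1] @ T) * B[:, t]  # forward recursion
    return alpha[:, -1].sum()                          # sum of alphas at final time
```

In practice the alphas are scaled per frame, or kept in log space, to avoid numerical underflow.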

SLIDE 40

The absorbing state

  • Observation sequences are assumed to end only when the process arrives at an absorbing state

– No observations are produced from the absorbing state

SLIDE 41

 

α(s_absorbing, T+1) = Σ_{s'} α(s', T) P(s_absorbing | s')

Totalprob = α(s_absorbing, T+1)

The Forward Algorithm

[Figure: trellis with the absorbing state entered after the final observation]

  • Absorbing state model: The total probability is the alpha

computed at the absorbing state after the final observation

SLIDE 42

Problem 2: State segmentation

  • Given only a sequence of observations, how

do we determine which sequence of states was followed in producing it?

SLIDE 43

HMM assumed to be generating data

The HMM as a generator

[Figure: state distributions, state sequence, observation sequence]

  • The process goes through a series of states and

produces observations from them

SLIDE 44

HMM assumed to be generating data

[Figure: state distributions, state sequence, observation sequence]

  • The observations do not reveal the underlying state

States are hidden

SLIDE 45

HMM assumed to be generating data

[Figure: state distributions, state sequence, observation sequence]

  • State segmentation: Estimate the state sequence given the observations

The state segmentation problem

SLIDE 46


Estimating the State Sequence

  • Many different state sequences are capable of

producing the observation

  • Solution: Identify the most probable state sequence

– The state sequence for which the probability of progressing through that
sequence and generating the observation sequence is maximum

– i.e. the sequence for which P(o_1, o_2, o_3, ..., s_1, s_2, s_3, ...) is maximum

SLIDE 47

Estimating the state sequence

  • Once again, exhaustive evaluation is impossibly

expensive

  • But once again a simple dynamic-programming

solution is available

  • Needed:

argmax_{s_1, s_2, s_3, ...} P(o_1 | s_1) P(s_1) · P(o_2 | s_2) P(s_2 | s_1) · P(o_3 | s_3) P(s_3 | s_2) ...

  = argmax_{s_1, s_2, s_3, ...} P(o_1, o_2, o_3, ..., s_1, s_2, s_3, ...)

SLIDE 49

HMM assumed to be generating data

The HMM as a generator

[Figure: state distributions, state sequence, observation sequence]

  • Each enclosed term represents one forward

transition and a subsequent emission

SLIDE 50

The state sequence

  • The probability of a state sequence ?,?,?,?,sx,sy ending at

time t , and producing all observations until ot

– P(o1..t-1, ?,?,?,?, sx , ot,sy) = P(o1..t-1,?,?,?,?, sx ) P(ot|sy)P(sy|sx)

  • The best state sequence that ends with sx,sy at t will have

a probability equal to the probability of the best state sequence ending at t-1 at sx times P(ot|sy)P(sy|sx)

SLIDE 51

Extending the state sequence

[Figure: state distributions, state sequence, observation sequence]

  • The probability of a state sequence ?,?,?,?,sx,sy

ending at time t and producing observations until ot

– P(o1..t-1,ot, ?,?,?,?, sx ,sy) = P(o1..t-1,?,?,?,?, sx )P(ot|sy)P(sy|sx)

SLIDE 52

Trellis

  • The graph below shows the set of all possible state

sequences through this HMM in five time instants

SLIDE 53

The cost of extending a state sequence

  • The cost of extending a state sequence ending at s_x is only dependent on
the transition from s_x to s_y, and the observation probability at s_y

[Figure: trellis edge from s_x to s_y at time t, with weight P(o_t | s_y) P(s_y | s_x)]

SLIDE 54

The cost of extending a state sequence

  • The best path to sy through sx is simply an

extension of the best path to sx

[Figure: best path to s_x extended to s_y, scoring BestP(o_1..t-1, ?,?,?,?, s_x) · P(o_t | s_y) P(s_y | s_x)]

SLIDE 55

The Recursion

  • The overall best path to sy is an extension of

the best path to one of the states at the previous time

SLIDE 56

The Recursion

 Prob. of best path to s_y =
   max_{s_x} BestP(o_1..t-1, ?,?,?,?, s_x) · P(o_t | s_y) P(s_y | s_x)

SLIDE 57

Finding the best state sequence

  • The simple algorithm just presented is called the VITERBI

algorithm in the literature

– After A.J.Viterbi, who invented this dynamic programming algorithm for a completely different purpose: decoding error correction codes!
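A minimal sketch of the recursion with backpointers, using the same conventions as the forward sketch (T[i,j] = P(j|i), B[s,t] = P(x_t|s); unscaled, so log scores are preferable for long sequences):

```python
import numpy as np

def viterbi(pi, T, B):
    """Most probable state sequence for an observation sequence."""
    S, N = B.shape
    score = np.zeros((S, N))
    back = np.zeros((S, N), dtype=int)
    score[:, 0] = pi * B[:, 0]
    for t in range(1, N):
        cand = score[:, t - 1, None] * T       # cand[i, j]: best-to-i, then i -> j
        back[:, t] = cand.argmax(axis=0)       # remember the best predecessor
        score[:, t] = cand.max(axis=0) * B[:, t]
    path = [int(score[:, -1].argmax())]        # best final state ...
    for t in range(N - 1, 0, -1):              # ... then trace the path back
        path.append(int(back[path[-1], t]))
    return path[::-1]
```

The structure mirrors the forward algorithm exactly, with the sum over predecessors replaced by a max plus a backpointer.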

SLIDE 58

Viterbi Search (contd.)


Initial state initialized with path-score = P(s1)b1(1) In this example all other states have score 0 since P(si) = 0 for them

SLIDE 59

Viterbi Search (contd.)

[Figure legend: state with best path-score; state with path-score < best; state without a valid path-score]

P_j(t) = max_i [ P_i(t-1) · t_ij · b_j(t) ]

(total path-score ending at state j at time t; t_ij: state transition probability from i to j; b_j(t): score for state j given the input at time t)

SLIDES 60-68

Viterbi Search (contd.)

THE BEST STATE SEQUENCE IS THE ESTIMATE OF THE STATE SEQUENCE FOLLOWED IN GENERATING THE OBSERVATION

SLIDE 69

Problem3: Training HMM parameters

  • We can compute the probability of an observation,

and the best state sequence given an observation, using the HMM’s parameters

  • But where do the HMM parameters come from?
  • They must be learned from a collection of observation sequences

SLIDE 70

Learning HMM parameters: Simple procedure – counting

  • Given a set of training instances
  • Iteratively:
  • 1. Initialize HMM parameters
  • 2. Segment all training instances
  • 3. Estimate transition probabilities and state output probability parameters by counting

SLIDE 71

Learning by counting example

  • Explanation by example in next few slides
  • 2-state HMM, Gaussian PDF at states, 3 observation

sequences

  • Example shows ONE iteration

– How to count after state sequences are obtained

SLIDE 72

Example: Learning HMM Parameters

  • We have an HMM with two states s1 and s2.
  • Observations are vectors xij

– i-th sequence, j-th vector

  • We are given the following three observation sequences

– And have already estimated state sequences

Observation 1:
  Time:  1   2   3   4   5   6   7   8   9   10
  State: S1  S1  S2  S2  S2  S1  S1  S2  S1  S1
  Obs:   Xa1 Xa2 Xa3 Xa4 Xa5 Xa6 Xa7 Xa8 Xa9 Xa10

Observation 2:
  Time:  1   2   3   4   5   6   7   8   9
  State: S2  S2  S1  S1  S2  S2  S2  S2  S1
  Obs:   Xb1 Xb2 Xb3 Xb4 Xb5 Xb6 Xb7 Xb8 Xb9

Observation 3:
  Time:  1   2   3   4   5   6   7   8
  State: S1  S2  S1  S1  S1  S2  S2  S2
  Obs:   Xc1 Xc2 Xc3 Xc4 Xc5 Xc6 Xc7 Xc8

SLIDE 73

Example: Learning HMM Parameters

  • Initial state probabilities (usually denoted as π):

– We have 3 observation sequences
– 2 of these begin with S1, and one with S2
– π(S1) = 2/3, π(S2) = 1/3


SLIDE 77

Example: Learning HMM Parameters

  • Transition probabilities:

– State S1 occurs 11 times in non-terminal locations
– Of these, it is followed immediately by S1 6 times
– It is followed immediately by S2 5 times
– P(S1 | S1) = 6/11; P(S2 | S1) = 5/11


SLIDE 81

Example: Learning HMM Parameters

  • Transition probabilities:

– State S2 occurs 13 times in non-terminal locations
– Of these, it is followed immediately by S1 5 times
– It is followed immediately by S2 8 times
– P(S1 | S2) = 5/13; P(S2 | S2) = 8/13


SLIDE 82

Parameters learnt so far

  • State initial probabilities, often denoted as π

– π(S1) = 2/3 = 0.66
– π(S2) = 1/3 = 0.33

  • State transition probabilities

– P(S1 | S1) = 6/11 = 0.545; P(S2 | S1) = 5/11 = 0.455
– P(S1 | S2) = 5/13 = 0.385; P(S2 | S2) = 8/13 = 0.615
– Represented as a transition matrix

A = | P(S1|S1)  P(S2|S1) |  =  | 0.545  0.455 |
    | P(S1|S2)  P(S2|S2) |     | 0.385  0.615 |

Each row of this matrix must sum to 1.0
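The counts on the preceding slides can be reproduced mechanically from the three state sequences; a sketch:

```python
from collections import Counter
from fractions import Fraction

# The three estimated state sequences from the worked example.
seqs = [
    ["S1","S1","S2","S2","S2","S1","S1","S2","S1","S1"],  # Observation 1
    ["S2","S2","S1","S1","S2","S2","S2","S2","S1"],       # Observation 2
    ["S1","S2","S1","S1","S1","S2","S2","S2"],            # Observation 3
]

# Initial probabilities: fraction of sequences that start in each state.
pi = {s: Fraction(c, len(seqs)) for s, c in Counter(q[0] for q in seqs).items()}

# Transition probabilities: bigram counts over non-terminal positions.
pairs = Counter((a, b) for q in seqs for a, b in zip(q, q[1:]))
froms = Counter(a for q in seqs for a in q[:-1])
T = {(a, b): Fraction(c, froms[a]) for (a, b), c in pairs.items()}
```

This reproduces the slide's numbers exactly: π(S1) = 2/3, P(S1|S1) = 6/11, P(S2|S1) = 5/11, P(S1|S2) = 5/13, P(S2|S2) = 8/13.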

SLIDE 83

Example: Learning HMM Parameters

  • State output probability for S1

– There are 13 observations in S1


SLIDE 84

Example: Learning HMM Parameters

  • State output probability for S1

– There are 13 observations in S1
– Segregate them out and count

  • Compute the parameters (mean and variance) of the Gaussian output density for state S1

From Observation 1:  Time 1, 2, 6, 7, 9, 10 → Xa1 Xa2 Xa6 Xa7 Xa9 Xa10
From Observation 2:  Time 3, 4, 9           → Xb3 Xb4 Xb9
From Observation 3:  Time 1, 3, 4, 5        → Xc1 Xc3 Xc4 Xc5

P(X | S1) = (1 / sqrt((2π)^d |Θ_1|)) · exp(-0.5 (X - μ_1)^T Θ_1^{-1} (X - μ_1))

μ_1 = (1/13)(Xa1 + Xa2 + Xa6 + Xa7 + Xa9 + Xa10 + Xb3 + Xb4 + Xb9 + Xc1 + Xc3 + Xc4 + Xc5)

Θ_1 = (1/13)[(Xa1 - μ_1)(Xa1 - μ_1)^T + (Xa2 - μ_1)(Xa2 - μ_1)^T + ... + (Xc5 - μ_1)(Xc5 - μ_1)^T]
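The per-state averages above are ordinary maximum-likelihood Gaussian estimates; a sketch over the vectors assigned to one state (works for any dimensionality):

```python
import numpy as np

def gaussian_params(X):
    """ML mean and covariance of the observation vectors assigned to one state:
    mu = (1/N) sum_t X_t,  Theta = (1/N) sum_t (X_t - mu)(X_t - mu)^T."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    diffs = X - mu
    cov = diffs.T @ diffs / len(X)   # outer products summed, divided by count
    return mu, cov
```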

SLIDE 85

Example: Learning HMM Parameters

  • State output probability for S2

– There are 14 observations in S2


SLIDE 86

Example: Learning HMM Parameters

  • State output probability for S2

– There are 14 observations in S2
– Segregate them out and count

  • Compute the parameters (mean and variance) of the Gaussian output density for state S2

From Observation 1:  Time 3, 4, 5, 8        → Xa3 Xa4 Xa5 Xa8
From Observation 2:  Time 1, 2, 5, 6, 7, 8  → Xb1 Xb2 Xb5 Xb6 Xb7 Xb8
From Observation 3:  Time 2, 6, 7, 8        → Xc2 Xc6 Xc7 Xc8

P(X | S2) = (1 / sqrt((2π)^d |Θ_2|)) · exp(-0.5 (X - μ_2)^T Θ_2^{-1} (X - μ_2))

μ_2 = (1/14)(Xa3 + Xa4 + Xa5 + Xa8 + Xb1 + Xb2 + Xb5 + Xb6 + Xb7 + Xb8 + Xc2 + Xc6 + Xc7 + Xc8)

Θ_2 = (1/14)[(Xa3 - μ_2)(Xa3 - μ_2)^T + ... + (Xc8 - μ_2)(Xc8 - μ_2)^T]

SLIDE 87

We have learnt all the HMM parameters

  • State initial probabilities, often denoted as π

– π(S1) = 2/3 = 0.66;  π(S2) = 1/3 = 0.33

  • State transition probabilities

A = | 0.545  0.455 |
    | 0.385  0.615 |

  • State output probabilities

P(X | S1) = (1 / sqrt((2π)^d |Θ_1|)) · exp(-0.5 (X - μ_1)^T Θ_1^{-1} (X - μ_1))
P(X | S2) = (1 / sqrt((2π)^d |Θ_2|)) · exp(-0.5 (X - μ_2)^T Θ_2^{-1} (X - μ_2))

SLIDE 88

Update rules at each iteration

  • Assumes state output PDF = Gaussian

– For GMMs, estimate GMM parameters from collection of observations at any state

π(s_i) = (No. of observation sequences that start at state s_i) / (Total no. of observation sequences)

P(s_j | s_i) = [ Σ_{t : state(t) = s_i & state(t+1) = s_j} 1 ] / [ Σ_{t : state(t) = s_i} 1 ]   (counted over all observation sequences)

μ_i = [ Σ_{t : state(t) = s_i} X_t ] / [ Σ_{t : state(t) = s_i} 1 ]

Θ_i = [ Σ_{t : state(t) = s_i} (X_t - μ_i)(X_t - μ_i)^T ] / [ Σ_{t : state(t) = s_i} 1 ]

SLIDE 89

  • Initialize all HMM parameters
  • Segment all training observation sequences into states using the Viterbi algorithm with the current models
  • Using the estimated state sequences and training observation sequences, re-estimate the HMM parameters
  • This method is also called a "segmental k-means" learning procedure

Training by segmentation: Viterbi training

[Flowchart: Initial models → Segmentations → Models → Converged? (no: loop back; yes: done)]
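The loop above can be sketched end-to-end for a discrete-output HMM. All data and initial parameters here are hypothetical toy values (the lecture's version uses Gaussian outputs, but the alternation is the same):

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Most likely state sequence for a discrete-output HMM (log domain)."""
    T, N = len(obs), len(pi)
    back = np.zeros((T, N), dtype=int)        # best predecessor per state
    score = np.log(pi) + np.log(B[:, obs[0]])
    for t in range(1, T):
        cand = score[:, None] + np.log(A)     # cand[i, j]: arrive at j from i
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + np.log(B[:, obs[t]])
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

def segmental_kmeans(seqs, pi, A, B, n_iters=20):
    """Alternate Viterbi segmentation and counting-based re-estimation."""
    n_states, n_syms = B.shape
    for _ in range(n_iters):
        segs = [viterbi(pi, A, B, o) for o in seqs]
        # Re-estimate every parameter by counting over the segmentations;
        # the small floor keeps all probabilities nonzero.
        pi_c = np.full(n_states, 1e-3)
        A_c = np.full((n_states, n_states), 1e-3)
        B_c = np.full((n_states, n_syms), 1e-3)
        for obs, seg in zip(seqs, segs):
            pi_c[seg[0]] += 1
            for t, (s, x) in enumerate(zip(seg, obs)):
                B_c[s, x] += 1
                if t > 0:
                    A_c[seg[t - 1], s] += 1
        new = (pi_c / pi_c.sum(),
               A_c / A_c.sum(axis=1, keepdims=True),
               B_c / B_c.sum(axis=1, keepdims=True))
        if all(np.allclose(o, n) for o, n in zip((pi, A, B), new)):
            break                             # segmentations stopped changing
        pi, A, B = new
    return pi, A, B

# Toy data: symbol 0 early, symbol 1 late, a left-to-right-ish process
seqs = [[0, 0, 0, 1, 1], [0, 0, 1, 1, 1], [0, 1, 1, 1, 1]]
pi, A, B = segmental_kmeans(seqs,
                            np.array([0.9, 0.1]),
                            np.array([[0.8, 0.2], [0.2, 0.8]]),
                            np.array([[0.7, 0.3], [0.3, 0.7]]))
```

On this data the model converges in a couple of iterations, with state 0 capturing symbol 0 and state 1 capturing symbol 1.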

slide-90
SLIDE 90

Alternative to counting: SOFT counting

  • Expectation maximization
  • Every observation contributes to every state


slide-91
SLIDE 91

Update rules at each iteration

  • Every observation contributes to every state

\[ \pi(s_i) = \frac{\sum_{Obs} P(state(1)=s_i \mid Obs)}{\text{Total no. of observation sequences}} \]

\[ P(s_j \mid s_i) = \frac{\sum_{Obs}\sum_t P(state(t)=s_i,\, state(t+1)=s_j \mid Obs)}{\sum_{Obs}\sum_t P(state(t)=s_i \mid Obs)} \]

\[ \mu_i = \frac{\sum_{Obs}\sum_t P(state(t)=s_i \mid Obs)\, X_{Obs,t}}{\sum_{Obs}\sum_t P(state(t)=s_i \mid Obs)} \]

\[ \Theta_i = \frac{\sum_{Obs}\sum_t P(state(t)=s_i \mid Obs)\, (X_{Obs,t}-\mu_i)(X_{Obs,t}-\mu_i)^T}{\sum_{Obs}\sum_t P(state(t)=s_i \mid Obs)} \]
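A sketch of the soft-count mean and covariance update for a single state, with hypothetical posteriors gamma[t, i] = P(state(t) = s_i | Obs):

```python
import numpy as np

# One observation sequence and hypothetical state posteriors
rng = np.random.default_rng(1)
X = rng.normal(size=(10, 2))                   # 10 frames, d = 2
gamma = rng.random((10, 2))
gamma /= gamma.sum(axis=1, keepdims=True)      # each row sums to 1

i = 0                                          # update state s_0
w = gamma[:, i]                                # soft counts for this state
mu_i = (w[:, None] * X).sum(axis=0) / w.sum()  # posterior-weighted mean
diff = X - mu_i
theta_i = (w[:, None] * diff).T @ diff / w.sum()  # weighted covariance
```

With one-hot posteriors this reduces exactly to the hard-count update of the previous slides.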

slide-92
SLIDE 92

Update rules at each iteration

  • Where did these terms come from?

(The update equations of the previous slide, repeated. Each requires the state posteriors P(state(t) = s_i | Obs) and the transition posteriors P(state(t) = s_i, state(t+1) = s_j | Obs), which we derive next.)

slide-93
SLIDE 93

  • \( P(state(t)=s \mid Obs) \): the probability that the process was at state s when it generated \(X_t\), given the entire observation
  • Dropping the "Obs" subscript for brevity:

\[ P(state(t)=s \mid x_1, x_2, \ldots, x_T) \propto P(state(t)=s, x_1, x_2, \ldots, x_T) \]

  • We will compute \( P(state(t)=s, x_1, x_2, \ldots, x_T) \) first

– This is the probability that the process visited s at time t while producing the entire observation

slide-94
SLIDE 94
  • The probability that the HMM was in a particular state s when generating the observation sequence is the probability that it followed a state sequence that passed through s at time t

[Trellis diagram: all state sequences passing through state s at time t]

\[ P(state(t)=s, x_1, x_2, \ldots, x_T) \]

slide-95
SLIDE 95
  • This can be decomposed into two multiplicative sections

– The section of the lattice leading into state s at time t and the section leading out of it

[Trellis diagram: the lattice split into the section entering state s at time t and the section leaving it]

\[ P(state(t)=s, x_1, x_2, \ldots, x_T) \]

slide-96
SLIDE 96

The Forward Paths

  • The probability of the red section is the total probability of all state sequences ending at state s at time t

– This is simply α(s, t)
– It can be computed using the forward algorithm

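A minimal forward pass, assuming the output likelihoods b[t, s] = P(x_t | s) have been precomputed (all numbers below are hypothetical):

```python
import numpy as np

def forward(pi, A, b):
    """alpha[t, s] = P(x_1, ..., x_t, state(t) = s).

    b[t, s] = P(x_t | s) are precomputed output likelihoods, so the same
    code works for discrete or Gaussian outputs.
    """
    T, N = b.shape
    alpha = np.zeros((T, N))
    alpha[0] = pi * b[0]                      # initial states times first output
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * b[t]  # sum over predecessor states
    return alpha

# Hypothetical 2-state model observed for 3 time steps
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
b = np.array([[0.5, 0.1], [0.3, 0.6], [0.2, 0.4]])
alpha = forward(pi, A, b)
total = alpha[-1].sum()                       # P(x_1, x_2, x_3)
```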

slide-97
SLIDE 97

The Backward Paths

  • The blue portion represents the probability of all state sequences that began at state s at time t

– Like the red portion, it can be computed using a backward recursion


slide-98
SLIDE 98

The Backward Recursion

  • Can be recursively estimated starting from the final time instant (backward recursion)
  • β(s, t) is the total probability of ALL state sequences that depart from s at time t, and all observations after \(x_t\)

– β(s, T) = 1 at the final time instant for all valid final states

\[ \beta(s,t) = P(x_{t+1}, x_{t+2}, \ldots, x_T \mid state(t)=s) \]

\[ \beta(s,t) = \sum_{s'} \beta(s', t+1)\, P(s' \mid s)\, P(x_{t+1} \mid s') \]
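A sketch of the backward recursion on a hypothetical 2-state toy model, again with precomputed output likelihoods b[t, s] = P(x_t | s):

```python
import numpy as np

def backward(A, b):
    """beta[t, s] = P(x_{t+1}, ..., x_T | state(t) = s), with beta[T-1] = 1.

    Implements beta(s, t) = sum_{s'} beta(s', t+1) P(s' | s) P(x_{t+1} | s').
    """
    T, N = b.shape
    beta = np.ones((T, N))                    # beta = 1 at the final instant
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (b[t + 1] * beta[t + 1])
    return beta

# Hypothetical 2-state model observed for 3 time steps
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
b = np.array([[0.5, 0.1], [0.3, 0.6], [0.2, 0.4]])
beta = backward(A, b)
# Sanity check: sum_s pi(s) b[0, s] beta[0, s] = P(x_1, ..., x_T)
total = (pi * b[0] * beta[0]).sum()
```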

slide-99
SLIDE 99

The complete probability

[Trellis diagram: forward paths α(s, t) into state s at time t and backward paths β(s, t) out of it]

\[ \alpha(s,t)\,\beta(s,t) = P(x_1, x_2, \ldots, x_T, state(t)=s) \]

slide-100
SLIDE 100

Posterior probability of a state

  • The probability that the process was in state s at time t, given that we have observed the data, is obtained by simple normalization
  • This term is often referred to as the gamma term and denoted by \(\gamma_{s,t}\)

\[ P(state(t)=s \mid Obs) = \frac{P(state(t)=s, x_1, \ldots, x_T)}{\sum_{s'} P(state(t)=s', x_1, \ldots, x_T)} = \frac{\alpha(s,t)\,\beta(s,t)}{\sum_{s'} \alpha(s',t)\,\beta(s',t)} \]
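Putting the two recursions together, the γ terms can be sketched as follows (hypothetical toy model; b[t, s] = P(x_t | s) is precomputed):

```python
import numpy as np

# Hypothetical 2-state model observed for 3 time steps
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
b = np.array([[0.5, 0.1], [0.3, 0.6], [0.2, 0.4]])

T, N = b.shape
alpha = np.zeros((T, N))
beta = np.ones((T, N))
alpha[0] = pi * b[0]
for t in range(1, T):
    alpha[t] = (alpha[t - 1] @ A) * b[t]      # forward pass
for t in range(T - 2, -1, -1):
    beta[t] = A @ (b[t + 1] * beta[t + 1])    # backward pass

# gamma[t, s] = P(state(t) = s | Obs): normalize alpha * beta at each t
gamma = alpha * beta
gamma /= gamma.sum(axis=1, keepdims=True)
```

Note that the normalizer, the sum of alpha * beta over states, equals P(Obs) at every time step.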

slide-101
SLIDE 101

Update rules at each iteration

  • These have been found

(The update equations repeated. The state posterior terms P(state(t) = s_i | Obs) have now been found: they are the γ terms.)

slide-102
SLIDE 102

Update rules at each iteration

  • Where did these terms come from?

(The update equations repeated. The remaining unexplained terms are the transition posteriors P(state(t) = s_i, state(t+1) = s_j | Obs), derived next.)

slide-103
SLIDE 103

s’

time

t

) ,..., , , ' ) 1 ( , ) ( (

2 1 T

x x x s t state s t state P   

s t+1

11755/18797 103

slide-104
SLIDE 104

s’

time

t

) ,..., , , ' ) 1 ( , ) ( (

2 1 T

x x x s t state s t state P   

s t+1

) , ( t s a

11755/18797 104

slide-105
SLIDE 105

s’

time

t

) ,..., , , ' ) 1 ( , ) ( (

2 1 T

x x x s t state s t state P   

s t+1

) , ( t s a

) ' | ( ) | ' (

1 s

x P s s P

t

11755/18797 105

slide-106
SLIDE 106

s’

time

t

) ,..., , , ' ) 1 ( , ) ( (

2 1 T

x x x s t state s t state P   

s t+1

) , ( t s a

) ' | ( ) | ' (

1 s

x P s s P

t

) 1 , ' (  t s b

11755/18797 106

slide-107
SLIDE 107

The a posteriori probability of transition

  • The a posteriori probability of a transition, given an observation

\[ P(state(t)=s, state(t+1)=s' \mid Obs) = \frac{\alpha(s,t)\, P(s' \mid s)\, P(x_{t+1} \mid s')\, \beta(s', t+1)}{\sum_{s_1}\sum_{s_2} \alpha(s_1,t)\, P(s_2 \mid s_1)\, P(x_{t+1} \mid s_2)\, \beta(s_2, t+1)} \]
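The transition posterior can be sketched the same way (hypothetical toy model; `xi` is a common name for this term, not the lecture's notation):

```python
import numpy as np

# Hypothetical 2-state model observed for 3 time steps
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
b = np.array([[0.5, 0.1], [0.3, 0.6], [0.2, 0.4]])

T, N = b.shape
alpha = np.zeros((T, N))
beta = np.ones((T, N))
alpha[0] = pi * b[0]
for t in range(1, T):
    alpha[t] = (alpha[t - 1] @ A) * b[t]      # forward pass
for t in range(T - 2, -1, -1):
    beta[t] = A @ (b[t + 1] * beta[t + 1])    # backward pass

# xi[t, s, s'] = P(state(t) = s, state(t+1) = s' | Obs)
xi = np.zeros((T - 1, N, N))
for t in range(T - 1):
    joint = alpha[t][:, None] * A * (b[t + 1] * beta[t + 1])[None, :]
    xi[t] = joint / joint.sum()               # normalize over all (s, s')
```

Summing the transition posterior over the destination state recovers the γ term for the source state, which is a useful consistency check.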

slide-108
SLIDE 108

Update rules at each iteration

  • These have been found

(The update equations repeated. Every term in them, the state posteriors and the transition posteriors, has now been computed.)

slide-109
SLIDE 109

State association probabilities

  • Every feature vector is associated with every state of every HMM, with a probability
  • Probabilities computed using the forward-backward algorithm
  • Soft decisions taken at the level of HMM state
  • In practice, the segmentation-based Viterbi training is much easier to implement and is much faster
  • The difference in performance between the two is small, especially if we have lots of training data

Training without explicit segmentation: Baum-Welch training

[Flowchart: Initial models → State association probabilities → Models → Converged? (no: loop back; yes: done)]

slide-110
SLIDE 110

HMM Issues

  • How to find the best state sequence: Covered
  • How to learn HMM parameters: Covered
  • How to compute the probability of an observation sequence: Covered


slide-111
SLIDE 111

Magic numbers

  • How many states:

– No nice automatic technique to learn this – You choose

  • For speech, HMM topology is usually left-to-right (no backward transitions)
  • For other cyclic processes, the topology must reflect the nature of the process
  • No. of states: 3 per phoneme in speech
  • For other processes, it depends on the estimated no. of distinct states in the process


slide-112
SLIDE 112

Applications of HMMs

  • Classification:

– Learn HMMs for the various classes of time series from training data
– Compute the probability of a test time series using the HMMs for each class
– Use in a Bayesian classifier
– Speech recognition, vision, gene sequencing, character recognition, text mining…

  • Prediction
  • Tracking


slide-113
SLIDE 113

Applications of HMMs

  • Segmentation:

– Given HMMs for various events, find event boundaries

  • Simply find the best state sequence and the locations where state identities change
  • Automatic speech segmentation, text segmentation by topic, genome segmentation, …

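A sketch of the boundary-finding step on a hypothetical decoded state sequence:

```python
# Event boundaries fall where the decoded state identity changes;
# the path below is a hypothetical Viterbi state sequence.
path = [0, 0, 0, 1, 1, 2, 2, 2]
boundaries = [t for t in range(1, len(path)) if path[t] != path[t - 1]]
print(boundaries)  # [3, 5]
```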