Machine Learning for Signal Processing
Hidden Markov Models
Bhiksha Raj 10 Nov 2016
11755/18797 1
Prediction: a holy grail
– Physical trajectories: automobiles, rockets, heavenly bodies
– Natural phenomena: weather
– Financial
– For some important task
– E.g. arrows shot at a target
– Wind speed at time t depends on speed at t-1
– Arrow position at time t depends on wind speed at time t
– Estimate current wind speed $T_t$
– Predict wind speed and arrow position at $t+1$: $T_{t+1}$ and $Z_{t+1}$
– E.g. wind speed – E.g. current position of a vehicle – E.g. current sentiment in the stock market
– E.g. wind-speed sensor measurement – E.g. what you see from the vehicle – E.g. current prices of various stocks
and the arrow-to-wind function
– Kalman filtering
– Random walk, Brownian motion..
– Which only depends on the current state
– Or until it hits an “absorbing wall”
[Figure: a Markov process over states S1, S2, S3]
– Idling; or – Travelling at constant velocity; or – Accelerating; or – Decelerating
– The SPL is measured once per second
[Figure: observation readings 45, 70, 65, 60 scored against the state distributions P(x|idle), P(x|decel), P(x|cruise), P(x|accel)]
– Assuming all transitions from a state are equally probable
[Figure: 45 dB → P(x|idle), idling state; 70 dB → P(x|accel), accelerating state; 65 dB → P(x|cruise), cruising state; 60 dB → P(x|decel), decelerating state]
Transition probabilities (from the diagram): from I, two transitions of 0.5 each; from A and from C, three transitions of 1/3 each; from D, four transitions of 0.25 each.
– Following a Markov chain model
– the actual state of the process is not directly observable
– A state/transition backbone that specifies how many states there are, and how they can follow one another – A set of probability distributions, one for each state, which specifies the distribution of all vectors in that state
[Figure: HMM assumed to be generating data — a Markov chain produces the state sequence; per-state data distributions produce the observation sequence]
– Number of states and allowed transitions – E.g. here we have 3 states and cannot go from the blue state to the red
– Often represented as a matrix as here – Tij is the probability that when in state i, the process will move to j
– The probability of starting at any state $s_i$
– The complete set of initial state probabilities is represented as $\pi$
– Example (two states): $T = \begin{pmatrix} 0.6 & 0.4 \\ 0.3 & 0.7 \end{pmatrix}$, with initial probabilities $\pi = (0.5, 0.5)$
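A Markov chain like this can be simulated directly. The sketch below is not from the slides; the function name is an assumption and the two-state values simply reuse the illustrative numbers above:

```python
import numpy as np

# Illustrative two-state chain: self-transitions 0.6 and 0.7,
# cross-transitions 0.4 and 0.3, initial probabilities 0.5 each.
T = np.array([[0.6, 0.4],
              [0.3, 0.7]])
pi = np.array([0.5, 0.5])

def sample_chain(T, pi, n_steps, rng=None):
    """Draw a state sequence from a Markov chain (rows of T sum to 1)."""
    rng = np.random.default_rng(rng)
    states = [rng.choice(len(pi), p=pi)]          # initial state ~ pi
    for _ in range(n_steps - 1):
        states.append(rng.choice(T.shape[0], p=T[states[-1]]))
    return states

seq = sample_chain(T, pi, 100, rng=0)
```

Each step depends only on the previous state, which is exactly the Markov property described above.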
– The state output distribution is the distribution of observations produced from any state, e.g. a Gaussian:

$P(x \mid s_i) = Gaussian(x; \mu_i, \Theta_i) = \frac{1}{\sqrt{(2\pi)^d |\Theta_i|}} \exp\left(-\frac{1}{2}(x - \mu_i)^T \Theta_i^{-1} (x - \mu_i)\right)$
– Or a mixture of Gaussians:

$P(x \mid s_i) = \sum_{j=1}^{K} w_{i,j}\, Gaussian(x; \mu_{i,j}, \Theta_{i,j})$
Full covariance: all elements of $\Theta$ may be non-zero
Diagonal covariance: off-diagonal elements are zero, and the exponent reduces to $-\sum_i (x_i - \mu_i)^2 / (2\sigma_i^2)$
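Under the diagonal-covariance assumption, the log-likelihood decomposes into a sum of per-dimension terms. A small sketch (the function name and the numeric values are hypothetical):

```python
import numpy as np

def diag_gaussian_logpdf(x, mu, var):
    """Log of N(x; mu, diag(var)); with a diagonal covariance the exponent
    reduces to -sum_i (x_i - mu_i)^2 / (2 var_i)."""
    x, mu, var = (np.asarray(a, dtype=float) for a in (x, mu, var))
    d = x.size
    return -0.5 * (d * np.log(2 * np.pi)
                   + np.sum(np.log(var))
                   + np.sum((x - mu) ** 2 / var))

# Hypothetical call: a 1-D reading scored against an assumed state model
ll = diag_gaussian_logpdf([45.0], [50.0], [25.0])
```

Because the dimensions decouple, the joint log-density equals the sum of the per-dimension 1-D log-densities.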
– Also denoted by Tij earlier
Computed from the Gaussian or Gaussian mixture for state s1
The total probability of the observation is obtained by summing over all possible state sequences:

$P(O) = \sum_{\text{all possible state sequences } S} P(O, S)$
– Left to right topology
– The arrows represent transitions for which the probability is not 0
– $P(s_j \mid s_i) = T_{ij}$ – We write $P(o_t \mid s_i) = b_i(t)$ for brevity
[Trellis: state index (vertical) against time t−1, t (horizontal)]
– The probability that the process will produce a given observation at a given time step from a particular state
– Can be recursively estimated starting from the first time instant (forward recursion):

$\alpha(s, t) = \left[\sum_{s'} \alpha(s', t-1)\, P(s \mid s')\right] P(o_t \mid s)$
– The total observation probability is accumulated from the absorbing (final) states:

$P(O) = \sum_{s \in \text{absorbing states}} \alpha(s, T)$
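The forward recursion and its termination can be sketched in a few lines. This is a generic implementation, not the slides' code; the two-state model values are made up purely for illustration:

```python
import numpy as np

def forward(pi, T, B):
    """Forward algorithm.
    pi: (S,) initial state probabilities
    T:  (S, S) transition matrix, T[i, j] = P(state j | state i)
    B:  (S, N) observation scores, B[s, t] = P(o_t | s)
    Returns alpha with alpha[s, t] = P(o_1..o_t, state(t) = s),
    and the total observation probability P(O)."""
    S, N = B.shape
    alpha = np.zeros((S, N))
    alpha[:, 0] = pi * B[:, 0]                     # initialization
    for t in range(1, N):
        # alpha[j, t] = (sum_i alpha[i, t-1] * T[i, j]) * B[j, t]
        alpha[:, t] = (alpha[:, t - 1] @ T) * B[:, t]
    return alpha, alpha[:, -1].sum()

# Hypothetical two-state example (all values are illustrative)
pi = np.array([0.5, 0.5])
T = np.array([[0.6, 0.4],
              [0.3, 0.7]])
B = np.array([[0.9, 0.2, 0.7],
              [0.1, 0.8, 0.3]])
alpha, p_obs = forward(pi, T, B)
```

The recursion replaces the exponential sum over all state sequences with an O(S²N) dynamic program.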
– E.g. a state sequence $s_1, s_1, s_1, s_2, s_2, s_2, s_3, s_3, \ldots$ that assigns one state to each observation
– $P(o_{1..t-1}, o_t, \ldots, s_x, s_y) = P(o_{1..t-1}, \ldots, s_x)\, P(o_t \mid s_y)\, P(s_y \mid s_x)$
– Probability of the best path to $s_y$ at time $t$:

$BestP(s_y, t) = \max_{s_x} BestP(s_x, t-1)\, P(s_y \mid s_x)\, P(o_t \mid s_y)$
– After A. J. Viterbi, who invented this dynamic programming algorithm for a completely different purpose: decoding error-correction codes!
The initial state is initialized with path-score P(s1)b1(1). In this example all other states have score 0, since P(si) = 0 for them.
State with best path-score State with path-score < best State without a valid path-score
$P_j(t) = \max_i P_i(t-1)\, T_{ij}\, b_j(t)$

– $P_j(t)$: total path-score ending up at state $j$ at time $t$
– $T_{ij}$: state transition probability, $i$ to $j$
– $b_j(t)$: score for state $j$, given the input at time $t$
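The Viterbi recursion, with back-pointers for recovering the best state sequence, can be sketched as follows (a generic implementation; the example model values are illustrative, not from the slides):

```python
import numpy as np

def viterbi(pi, T, B):
    """Best state sequence: score[j, t] = max_i score[i, t-1] * T[i, j] * B[j, t]."""
    S, N = B.shape
    score = np.zeros((S, N))
    back = np.zeros((S, N), dtype=int)            # best predecessor per state/time
    score[:, 0] = pi * B[:, 0]
    for t in range(1, N):
        cand = score[:, t - 1, None] * T          # cand[i, j] = score[i]*T[i, j]
        back[:, t] = cand.argmax(axis=0)
        score[:, t] = cand.max(axis=0) * B[:, t]
    # Backtrack from the best final state
    path = [int(score[:, -1].argmax())]
    for t in range(N - 1, 0, -1):
        path.append(int(back[path[-1], t]))
    return path[::-1], score[:, -1].max()

# Illustrative two-state model
pi = np.array([0.5, 0.5])
T = np.array([[0.6, 0.4],
              [0.3, 0.7]])
B = np.array([[0.9, 0.2, 0.7],
              [0.1, 0.8, 0.3]])
path, best_score = viterbi(pi, T, B)
```

The structure mirrors the forward algorithm, with the sum over predecessors replaced by a max plus a stored arg-max.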
The best state sequence is the estimate of the state sequence followed in generating the observation.
– How to count after state sequences are obtained
– i-th sequence, j-th vector
– And have already estimated state sequences
Observation 1:
Time:  1    2    3    4    5    6    7    8    9    10
State: S1   S1   S2   S2   S2   S1   S1   S2   S1   S1
Obs:   Xa1  Xa2  Xa3  Xa4  Xa5  Xa6  Xa7  Xa8  Xa9  Xa10

Observation 2:
Time:  1    2    3    4    5    6    7    8    9
State: S2   S2   S1   S1   S2   S2   S2   S2   S1
Obs:   Xb1  Xb2  Xb3  Xb4  Xb5  Xb6  Xb7  Xb8  Xb9

Observation 3:
Time:  1    2    3    4    5    6    7    8
State: S1   S2   S1   S1   S1   S2   S2   S2
Obs:   Xc1  Xc2  Xc3  Xc4  Xc5  Xc6  Xc7  Xc8
– State S1 occurs 11 times in non-terminal positions
– Of these, it is followed immediately by S1 6 times, and by S2 5 times
– P(S1 | S1) = 6/11; P(S2 | S1) = 5/11
– State S2 occurs 13 times in non-terminal positions
– Of these, it is followed immediately by S1 5 times, and by S2 8 times
– P(S1 | S2) = 5/13; P(S2 | S2) = 8/13
Each row of this matrix must sum to 1.0
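The counting procedure can be reproduced directly from the three example state sequences (a sketch; a dictionary of pair probabilities is used rather than a matrix, for clarity):

```python
from collections import Counter

# The three estimated state sequences from the example above
sequences = [
    ["S1", "S1", "S2", "S2", "S2", "S1", "S1", "S2", "S1", "S1"],
    ["S2", "S2", "S1", "S1", "S2", "S2", "S2", "S2", "S1"],
    ["S1", "S2", "S1", "S1", "S1", "S2", "S2", "S2"],
]

def count_transitions(sequences):
    """Estimate P(s_j | s_i) by counting adjacent state pairs; the final
    state of each sequence never serves as a predecessor."""
    pair_counts, from_counts = Counter(), Counter()
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            pair_counts[(a, b)] += 1              # count of i followed by j
            from_counts[a] += 1                   # non-terminal occurrences of i
    return {pair: c / from_counts[pair[0]] for pair, c in pair_counts.items()}

probs = count_transitions(sequences)
```

By construction, the probabilities out of each state sum to 1, matching the row-sum constraint above.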
Vectors assigned to state S1 (from the estimated state sequences):

Observation 1 (times 1, 2, 6, 7, 9, 10): Xa1, Xa2, Xa6, Xa7, Xa9, Xa10
Observation 2 (times 3, 4, 9): Xb3, Xb4, Xb9
Observation 3 (times 1, 3, 4, 5): Xc1, Xc3, Xc4, Xc5
$P(X \mid S_1) = \frac{1}{\sqrt{(2\pi)^d |\Theta_1|}} \exp\left(-\frac{1}{2}(X - \mu_1)^T \Theta_1^{-1} (X - \mu_1)\right)$

$\mu_1 = \frac{1}{13}\left(X_{a1} + X_{a2} + X_{a6} + X_{a7} + X_{a9} + X_{a10} + X_{b3} + X_{b4} + X_{b9} + X_{c1} + X_{c3} + X_{c4} + X_{c5}\right)$

$\Theta_1 = \frac{1}{13}\left((X_{a1} - \mu_1)(X_{a1} - \mu_1)^T + (X_{a2} - \mu_1)(X_{a2} - \mu_1)^T + \cdots + (X_{c5} - \mu_1)(X_{c5} - \mu_1)^T\right)$
Vectors assigned to state S2:

Observation 1 (times 3, 4, 5, 8): Xa3, Xa4, Xa5, Xa8
Observation 2 (times 1, 2, 5, 6, 7, 8): Xb1, Xb2, Xb5, Xb6, Xb7, Xb8
Observation 3 (times 2, 6, 7, 8): Xc2, Xc6, Xc7, Xc8
$P(X \mid S_2) = \frac{1}{\sqrt{(2\pi)^d |\Theta_2|}} \exp\left(-\frac{1}{2}(X - \mu_2)^T \Theta_2^{-1} (X - \mu_2)\right)$

$\mu_2 = \frac{1}{14}\left(X_{a3} + X_{a4} + X_{a5} + X_{a8} + X_{b1} + X_{b2} + X_{b5} + X_{b6} + X_{b7} + X_{b8} + X_{c2} + X_{c6} + X_{c7} + X_{c8}\right)$

$\Theta_2 = \frac{1}{14}\left((X_{a3} - \mu_2)(X_{a3} - \mu_2)^T + \cdots + (X_{c8} - \mu_2)(X_{c8} - \mu_2)^T\right)$
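The segregate-and-estimate step can be sketched generically: pool the vectors aligned to each state and compute that state's sample mean and covariance. The numeric example below is a made-up 1-D illustration, not the slides' data:

```python
import numpy as np

def state_gaussian_params(X, states):
    """Per-state Gaussian estimates from a state-labelled data set.
    X: (N, d) array of observation vectors; states: length-N labels.
    Returns {state: (mean, covariance)}."""
    states = np.asarray(states)
    params = {}
    for s in np.unique(states):
        Xs = X[states == s]                       # vectors aligned to state s
        mu = Xs.mean(axis=0)
        diff = Xs - mu
        params[s] = (mu, diff.T @ diff / len(Xs)) # ML (1/N) covariance
    return params

# Hypothetical 1-D illustration
params = state_gaussian_params(
    np.array([[0.0], [2.0], [10.0], [10.0]]),
    ["S1", "S1", "S2", "S2"])
```

This uses the maximum-likelihood (1/N) normalizer, matching the 1/13 and 1/14 factors in the formulas above.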
State output probabilities:

$P(X \mid S_1) = \frac{1}{\sqrt{(2\pi)^d |\Theta_1|}} \exp\left(-\frac{1}{2}(X - \mu_1)^T \Theta_1^{-1} (X - \mu_1)\right)$

$P(X \mid S_2) = \frac{1}{\sqrt{(2\pi)^d |\Theta_2|}} \exp\left(-\frac{1}{2}(X - \mu_2)^T \Theta_2^{-1} (X - \mu_2)\right)$
– For GMMs, estimate GMM parameters from collection of observations at any state
Parameters re-estimated by counting over the segmented state sequences:

$T_{ij} = \frac{\text{count}\{t : state(t) = s_i \ \&\ state(t+1) = s_j\}}{\text{count}\{t : state(t) = s_i\}}$

$\mu_i = \frac{\sum_{t : state(t) = s_i} X_t}{\text{count}\{t : state(t) = s_i\}}$

$\Theta_i = \frac{\sum_{t : state(t) = s_i} (X_t - \mu_i)(X_t - \mu_i)^T}{\text{count}\{t : state(t) = s_i\}}$
Viterbi training:
1. Initialize all HMM parameters
2. Segment all training observation sequences into states using the Viterbi algorithm with the current models
3. Using the estimated state sequences and training observation sequences, re-estimate the HMM parameters
4. Iterate until the models converge

This method is also called a “segmental k-means” learning procedure.
Expected-count (Baum-Welch) parameter updates:

$\pi(s_i) = \frac{\sum_{Obs} P(state(1) = s_i \mid Obs)}{\text{total no. of sequences}}$

$P(s_j \mid s_i) = \frac{\sum_{Obs} \sum_t P(state(t) = s_i,\, state(t+1) = s_j \mid Obs)}{\sum_{Obs} \sum_t P(state(t) = s_i \mid Obs)}$

$\mu_i = \frac{\sum_{Obs} \sum_t P(state(t) = s_i \mid Obs)\, X_t^{Obs}}{\sum_{Obs} \sum_t P(state(t) = s_i \mid Obs)}$

$\Theta_i = \frac{\sum_{Obs} \sum_t P(state(t) = s_i \mid Obs)\, (X_t^{Obs} - \mu_i)(X_t^{Obs} - \mu_i)^T}{\sum_{Obs} \sum_t P(state(t) = s_i \mid Obs)}$
[Trellis: states (vertical) against time 1…T (horizontal), split at state s, time t]
– The section of the lattice leading into state s at time t and the section leading out of it
– This is simply $\alpha(s, t)$ – Can be computed using the forward algorithm
– Like the red portion it can be computed using a backward recursion
– Can be recursively estimated starting from the final time instant (backward recursion):

$\beta(s, t) = \sum_{s'} P(s' \mid s)\, P(o_{t+1} \mid s')\, \beta(s', t+1)$

– $\beta(s, T) = 1$ at the final time instant for all valid final states
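The backward recursion can be sketched analogously to the forward one (generic code; the model values repeat the earlier illustrative two-state example):

```python
import numpy as np

def backward(T, B):
    """Backward recursion: beta[s, t] = P(o_{t+1}..o_N | state(t) = s),
    computed from the final instant backwards with beta[:, -1] = 1."""
    S, N = B.shape
    beta = np.ones((S, N))
    for t in range(N - 2, -1, -1):
        # beta[s, t] = sum_{s'} T[s, s'] * B[s', t+1] * beta[s', t+1]
        beta[:, t] = T @ (B[:, t + 1] * beta[:, t + 1])
    return beta

# Illustrative two-state model
pi = np.array([0.5, 0.5])
T = np.array([[0.6, 0.4],
              [0.3, 0.7]])
B = np.array([[0.9, 0.2, 0.7],
              [0.1, 0.8, 0.3]])
beta = backward(T, B)
# P(O) can equivalently be read off the first column:
p_obs = (pi * B[:, 0] * beta[:, 0]).sum()
```

Reading the total observation probability from either end (forward termination or the first backward column) gives the same value, a useful correctness check.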
– The two sections together give the posterior probability of being in state $s$ at time $t$:

$\gamma(s, t) = P(state(t) = s \mid O) = \frac{\alpha(s, t)\, \beta(s, t)}{\sum_{s'} \alpha(s', t)\, \beta(s', t)}$
The posterior probability of a transition from $s$ to $s'$ at time $t$:

$P(state(t) = s,\, state(t+1) = s' \mid Obs) = \frac{\alpha(s, t)\, P(s' \mid s)\, P(o_{t+1} \mid s')\, \beta(s', t+1)}{\sum_{s_1} \sum_{s_2} \alpha(s_1, t)\, P(s_2 \mid s_1)\, P(o_{t+1} \mid s_2)\, \beta(s_2, t+1)}$
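Combining the two recursions gives the state and transition posteriors used in the update formulas. A sketch (generic, with the same illustrative model; `gamma` and `xi` are conventional names, not the slides'):

```python
import numpy as np

def posteriors(pi, T, B):
    """Forward-backward posteriors:
    gamma[s, t]  = P(state(t) = s | O)
    xi[i, j, t]  = P(state(t) = i, state(t+1) = j | O)"""
    S, N = B.shape
    alpha = np.zeros((S, N))
    alpha[:, 0] = pi * B[:, 0]
    for t in range(1, N):
        alpha[:, t] = (alpha[:, t - 1] @ T) * B[:, t]
    beta = np.ones((S, N))
    for t in range(N - 2, -1, -1):
        beta[:, t] = T @ (B[:, t + 1] * beta[:, t + 1])
    p_obs = alpha[:, -1].sum()
    gamma = alpha * beta / p_obs
    # xi[i, j, t] = alpha[i, t] * T[i, j] * B[j, t+1] * beta[j, t+1] / P(O)
    xi = alpha[:, None, :-1] * T[:, :, None] * (B * beta)[None, :, 1:] / p_obs
    return gamma, xi

pi = np.array([0.5, 0.5])
T = np.array([[0.6, 0.4],
              [0.3, 0.7]])
B = np.array([[0.9, 0.2, 0.7],
              [0.1, 0.8, 0.3]])
gamma, xi = posteriors(pi, T, B)
```

Marginalizing `xi` over the destination state recovers `gamma`, which is a standard consistency check on the implementation.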
Baum-Welch training:
– Every feature vector is associated with every state of every HMM, with a probability
– The probabilities are computed using the forward-backward algorithm; soft decisions are taken at the level of HMM states
– In practice, the segmentation-based Viterbi training is much easier to implement and is much faster
– The difference in performance between the two is small, especially if we have lots of training data