SLIDE 1

Hidden Markov Models

Slides adapted from Joyce Ho, David Sontag, Geoffrey Hinton, Eric Xing, and Nicholas Ruozzi

SLIDE 2

Sequential Data

  • Time-series: Stock market, weather, speech, video
  • Ordered: Text, genes
SLIDE 3

Sequential Data: Tracking

Observe noisy measurements of missile location

Where is the missile now? Where will it be in 1 minute?

SLIDE 4

Sequential Data: Weather

  • Predict the weather tomorrow using previous information
  • If it rained yesterday and the previous day, and historically it has rained 7 times in the past 10 years on this date, does this affect my prediction?
SLIDE 5

Sequential Data: Weather

  • Use product rule for joint distribution of a sequence
  • How do I solve this?
  • Model how weather changes over time
  • Model how observations are produced
  • Reason about the model
SLIDE 6

Markov Chain

  • Set S is called the state space
  • Process moves from one state to another, generating a sequence of states: x1, x2, …, xt
  • Markov chain property: the probability of each subsequent state depends only on the previous state:

P(xt | x1, …, xt−1) = P(xt | xt−1)
SLIDE 7

Markov Chain: Parameters

  • State transition matrix A (|S| × |S|); A is a stochastic matrix (all rows sum to one)
  • Time-homogeneous Markov chain: the transition probability between two states does not depend on time
  • Initial (prior) state probabilities
SLIDE 8

  • Two states: ‘Rain’ and ‘Dry’
  • Transition probabilities:
    P(‘Rain’|‘Rain’) = 0.3, P(‘Dry’|‘Rain’) = 0.7
    P(‘Rain’|‘Dry’) = 0.2, P(‘Dry’|‘Dry’) = 0.8
  • Initial probabilities:
    P(‘Rain’) = 0.4, P(‘Dry’) = 0.6

Example of Markov Model

SLIDE 9

Example: Weather Prediction

  • Compute the probability of tomorrow’s weather using the Markov property
  • Evaluation: given today is dry, what’s the probability that tomorrow is dry and the next day is rainy?
  • Learning: given some observations, determine the transition probabilities

P({‘Dry’,’Dry’,’Rain’}) = P(‘Rain’|’Dry’) P(‘Dry’|’Dry’) P(‘Dry’) = 0.2 × 0.8 × 0.6 = 0.096
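The chain-rule calculation above can be checked with a short script. This is a minimal sketch; the integer state encoding and the function name are mine, with the numbers taken from the slides.

```python
import numpy as np

# Encoding (mine): state 0 = 'Rain', state 1 = 'Dry'
A = np.array([[0.3, 0.7],    # P('Rain'|'Rain'), P('Dry'|'Rain')
              [0.2, 0.8]])   # P('Rain'|'Dry'),  P('Dry'|'Dry')
pi = np.array([0.4, 0.6])    # P('Rain'), P('Dry')

def chain_prob(states):
    """Joint probability of a state sequence under the Markov chain (pi, A)."""
    p = pi[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= A[prev, cur]
    return p

print(chain_prob([1, 1, 0]))  # {'Dry','Dry','Rain'}: 0.6 * 0.8 * 0.2 = 0.096
```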

SLIDE 10

Hidden Markov Model (HMM)

  • Stochastic model where the states of the model are hidden
  • Each state can emit an output which is observed
SLIDE 11

HMM: Parameters

  • State transition matrix A
  • Emission / observation conditional output probabilities B
  • Initial (prior) state probabilities
SLIDE 12

[State diagram: hidden states ‘Low’ and ‘High’ with transition probabilities 0.7, 0.3, 0.2, 0.8, emitting observations ‘Dry’ and ‘Rain’]

Example of Hidden Markov Model
SLIDE 13
  • Two states: ‘Low’ and ‘High’ atmospheric pressure
  • Two observations: ‘Rain’ and ‘Dry’
  • Transition probabilities:
    P(‘Low’|‘Low’) = 0.3, P(‘High’|‘Low’) = 0.7
    P(‘Low’|‘High’) = 0.2, P(‘High’|‘High’) = 0.8
  • Observation probabilities:
    P(‘Rain’|‘Low’) = 0.6, P(‘Dry’|‘Low’) = 0.4
    P(‘Rain’|‘High’) = 0.4, P(‘Dry’|‘High’) = 0.6
  • Initial probabilities:
    P(‘Low’) = 0.4, P(‘High’) = 0.6

Example of Hidden Markov Model

SLIDE 14
  • Suppose we want to calculate the probability of a sequence of observations in our example, {‘Dry’,’Rain’}
  • Consider all possible hidden state sequences:

P({‘Dry’,’Rain’}) = P({‘Dry’,’Rain’}, {‘Low’,’Low’}) + P({‘Dry’,’Rain’}, {‘Low’,’High’}) + P({‘Dry’,’Rain’}, {‘High’,’Low’}) + P({‘Dry’,’Rain’}, {‘High’,’High’})

where the first term is:

P({‘Dry’,’Rain’}, {‘Low’,’Low’}) = P({‘Dry’,’Rain’} | {‘Low’,’Low’}) P({‘Low’,’Low’}) = P(‘Dry’|’Low’) P(‘Rain’|’Low’) P(‘Low’) P(‘Low’|’Low’) = 0.4 × 0.6 × 0.4 × 0.3 = 0.0288

Calculation of observation sequence probability
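The sum over all hidden state sequences can be written directly as a brute-force enumeration. A minimal sketch; the encoding and function name are mine, the probabilities come from the example slides.

```python
import itertools
import numpy as np

# Encoding (mine): states 0 = 'Low', 1 = 'High'; observations 0 = 'Dry', 1 = 'Rain'
A  = np.array([[0.3, 0.7], [0.2, 0.8]])   # transition probabilities
B  = np.array([[0.4, 0.6], [0.6, 0.4]])   # B[state, obs]
pi = np.array([0.4, 0.6])                 # initial probabilities

def brute_force_likelihood(obs):
    """Sum the joint P(obs, states) over every possible hidden state sequence."""
    total = 0.0
    for states in itertools.product(range(2), repeat=len(obs)):
        p = pi[states[0]] * B[states[0], obs[0]]
        for t in range(1, len(obs)):
            p *= A[states[t-1], states[t]] * B[states[t], obs[t]]
        total += p
    return total

print(brute_force_likelihood([0, 1]))  # P({'Dry','Rain'}) = 0.232
```

Note the cost: the loop visits every one of the 2^T state sequences, which is exactly the exponential blow-up the forward algorithm later avoids.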

SLIDE 15

Example: Dishonest Casino

  • A casino has two dice that it switches between with 5% probability
  • Fair die
  • Loaded die
SLIDE 16

Example: Dishonest Casino

  • Initial probabilities
  • State transition matrix
  • Emission probabilities
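The slide's numeric values did not survive extraction. A common parameterization of this example can be sketched as follows; only the 5% switching probability comes from the slides, and the loaded-die emission probabilities and uniform start are assumptions for illustration.

```python
import numpy as np

# States: 0 = fair die, 1 = loaded die; 5% switch probability (from the slide)
pi = np.array([0.5, 0.5])                 # assumed uniform initial probabilities
A  = np.array([[0.95, 0.05],
               [0.05, 0.95]])             # state transition matrix
# Emission probabilities over faces 1..6; a loaded die favoring six is an
# assumption, not a value from the slide
B = np.array([[1/6] * 6,
              [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]])
print(A.sum(axis=1), B.sum(axis=1))  # each row is a distribution
```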
SLIDE 17

Example: Dishonest Casino

  • Given a sequence of rolls by the casino player
  • How likely is this sequence given our model of how the casino works? – evaluation problem
  • What portion of the sequence was generated with the fair die, and what portion with the loaded die? – decoding problem
  • How “loaded” is the loaded die? How “fair” is the fair die? How often does the casino player change from fair to loaded and back? – learning problem
SLIDE 18

HMM: Problems

  • Evaluation: Given parameters and an observation sequence, find the probability (likelihood) of the observed sequence
  • Forward algorithm
  • Decoding: Given HMM parameters and an observation sequence, find the most probable sequence of hidden states
  • Viterbi algorithm
  • Learning: Given an HMM with unknown parameters and an observation sequence, find the parameters that maximize the likelihood of the data
  • Forward-Backward algorithm
SLIDE 19

HMM: Evaluation Problem

  • Given: HMM parameters and an observation sequence
  • Probability of observed sequence: sum the joint probability over all possible hidden state values at all times, which gives K^T terms (exponential in T)
SLIDE 20

[Trellis diagram: one column of states s1 … sK per time step (Time = 1, …, t, t+1, …, T); edges into state sj are weighted by transition probabilities a1j, a2j, …, aKj; observations o1 … oT appear below the columns]

Trellis representation of an HMM
SLIDE 21

HMM: Forward Algorithm

  • Instead, pose evaluation as a recursive problem
  • Dynamic program to compute the forward probability of being in state St = k after observing the first t observations: α_t(k) = P(o_1, …, o_t, St = k)
  • Algorithm:
  • Initialize: t = 1: α_1(k) = π_k B(k, o_1)
  • Iterate with recursion: t = 2, …, T: α_t(k) = B(k, o_t) Σ_i α_{t−1}(i) A(i, k)
  • Terminate: t = T: P(o_1, …, o_T) = Σ_k α_T(k)
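The recursion above can be implemented in a few lines. A minimal numpy sketch (variable names are mine); on the Low/High weather example it reproduces the brute-force sum over all hidden sequences.

```python
import numpy as np

def forward(obs, pi, A, B):
    """Forward algorithm: alpha[t, k] = P(o_1..o_t, S_t = k)."""
    T, K = len(obs), len(pi)
    alpha = np.zeros((T, K))
    alpha[0] = pi * B[:, obs[0]]              # initialize at t = 1
    for t in range(1, T):                     # recurse for t = 2..T
        alpha[t] = B[:, obs[t]] * (alpha[t-1] @ A)
    return alpha, alpha[-1].sum()             # terminate: sum over final states

# Low/High weather example (states 0='Low', 1='High'; obs 0='Dry', 1='Rain')
A  = np.array([[0.3, 0.7], [0.2, 0.8]])
B  = np.array([[0.4, 0.6], [0.6, 0.4]])
pi = np.array([0.4, 0.6])
alpha, lik = forward([0, 1], pi, A, B)
print(lik)  # 0.232, matching the enumeration over all hidden sequences
```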

SLIDE 22

HMM: Problems

  • Evaluation: Given parameters and an observation sequence, find the probability (likelihood) of the observed sequence
  • Forward algorithm
  • Decoding: Given HMM parameters and an observation sequence, find the most probable sequence of hidden states
  • Viterbi algorithm
  • Learning: Given an HMM with unknown parameters and an observation sequence, find the parameters that maximize the likelihood of the data
  • Forward-Backward algorithm
SLIDE 23

HMM: Decoding Problem 1

  • Given: HMM parameters and an observation sequence
  • Probability that the hidden state at time t was k: P(St = k | o_1, …, o_T) ∝ α_t(k) β_t(k)

We know how to compute the first part, the forward probability α_t(k), using the forward algorithm
SLIDE 24

HMM: Backward Probability

  • Similar to the forward probability, we can express the backward probability β_t(k) = P(o_{t+1}, …, o_T | St = k) as a recursion
  • Dynamic program
  • Initialize: β_T(k) = 1
  • Iterate using recursion: β_t(k) = Σ_j A(k, j) B(j, o_{t+1}) β_{t+1}(j)
SLIDE 25

HMM: Decoding Problem 1

  • Probability that the hidden state at time t was k: γ_t(k) = α_t(k) β_t(k) / Σ_j α_t(j) β_t(j)
  • Most likely state assignment: argmax_k γ_t(k)

Forward-backward algorithm
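Combining the forward and backward passes gives the per-time-step posteriors. A sketch under the same Low/High example parameters (function and variable names are mine):

```python
import numpy as np

def posteriors(obs, pi, A, B):
    """gamma[t, k] = P(S_t = k | o_1..o_T) via the forward-backward algorithm."""
    T, K = len(obs), len(pi)
    alpha = np.zeros((T, K))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = B[:, obs[t]] * (alpha[t-1] @ A)
    beta = np.ones((T, K))                    # beta_T(k) = 1
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t+1]] * beta[t+1])
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)

# Low/High example (states 0='Low', 1='High'; obs 0='Dry', 1='Rain')
A  = np.array([[0.3, 0.7], [0.2, 0.8]])
B  = np.array([[0.4, 0.6], [0.6, 0.4]])
pi = np.array([0.4, 0.6])
gamma = posteriors([0, 1], pi, A, B)
print(gamma.argmax(axis=1))  # most likely state at each time step
```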

SLIDE 26

HMM: Decoding Problem 2

  • Given: HMM parameters and an observation sequence
  • What is the most likely state sequence?

V_t(k) = probability of the most likely sequence of states ending at state St = k

SLIDE 27

HMM: Viterbi Algorithm

  • Compute probability recursively over t
  • Use dynamic programming again!
SLIDE 28

HMM: Viterbi Algorithm

  • Initialize: V_1(k) = π_k B(k, o_1)
  • Iterate: V_t(k) = B(k, o_t) max_i V_{t−1}(i) A(i, k), storing a back-pointer to the maximizing i
  • Terminate: max_k V_T(k)

Traceback: follow the back-pointers from argmax_k V_T(k) to recover the most likely state sequence
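The three steps plus traceback fit in one function. A compact numpy sketch (names are mine, not from the slides), run on the Low/High example:

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most probable hidden-state path via dynamic programming."""
    T, K = len(obs), len(pi)
    delta = np.zeros((T, K))              # best path probability ending in k at t
    psi = np.zeros((T, K), dtype=int)     # back-pointers
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        cand = delta[t-1][:, None] * A    # cand[i, j] = delta_{t-1}(i) * a_ij
        psi[t] = cand.argmax(axis=0)
        delta[t] = cand.max(axis=0) * B[:, obs[t]]
    path = [int(delta[-1].argmax())]      # traceback from the best final state
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1], delta[-1].max()

# Low/High example: most likely explanation of {'Dry','Rain'}
A  = np.array([[0.3, 0.7], [0.2, 0.8]])
B  = np.array([[0.4, 0.6], [0.6, 0.4]])
pi = np.array([0.4, 0.6])
path, prob = viterbi([0, 1], pi, A, B)
print(path, prob)  # [1, 1] i.e. High, High
```

Note the max replaces the sum of the forward algorithm; otherwise the recursions are identical.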

SLIDE 29

HMM: Computational Complexity

  • What is the running time of the forward algorithm, the backward algorithm, and Viterbi? O(K²T), compared to the O(K^T) terms of naive enumeration!
SLIDE 30

HMM: Problems

  • Evaluation: Given parameters and an observation sequence, find the probability (likelihood) of the observed sequence
  • Forward algorithm
  • Decoding: Given HMM parameters and an observation sequence, find the most probable sequence of hidden states
  • Viterbi algorithm
  • Learning: Given an HMM with unknown parameters and an observation sequence, find the parameters that maximize the likelihood of the data
  • Forward-Backward (Baum-Welch) algorithm
SLIDE 31

HMM: Learning Problem

  • Given only observations
  • Find parameters that maximize likelihood
  • Need to learn hidden state sequences as well
SLIDE 32

HMM: Baum-Welch (EM) Algorithm

  • Randomly initialize parameters
  • E-step: Fix parameters, find expected state assignment

Forward-backward algorithm

SLIDE 33

HMM: Baum-Welch (EM) Algorithm

  • Expected number of times we will be in state i
  • Expected number of transitions from state i
  • Expected number of transitions from state i to j
SLIDE 34

HMM: Baum-Welch (EM) Algorithm

  • M-step: Fix expected state assignments, update parameters
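The E- and M-steps above can be sketched in one function. This is a simplified single-sequence version (the vectorized xi computation and all names are my own arrangement, not the slides'):

```python
import numpy as np

def baum_welch_step(obs, pi, A, B):
    """One EM iteration for a single observation sequence."""
    T, K = len(obs), len(pi)
    # E-step: forward and backward probabilities
    alpha = np.zeros((T, K))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = B[:, obs[t]] * (alpha[t-1] @ A)
    beta = np.ones((T, K))
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t+1]] * beta[t+1])
    lik = alpha[-1].sum()
    gamma = alpha * beta / lik                       # P(S_t = i | obs)
    # xi[t, i, j] = P(S_t = i, S_{t+1} = j | obs): expected transitions
    xi = (alpha[:-1, :, None] * A[None, :, :] *
          (B[:, obs[1:]].T * beta[1:])[:, None, :]) / lik
    # M-step: re-estimate parameters from expected counts
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for o in range(B.shape[1]):
        new_B[:, o] = gamma[np.array(obs) == o].sum(axis=0)
    new_B /= gamma.sum(axis=0)[:, None]
    return new_pi, new_A, new_B, lik

# One iteration on the Low/High example
A  = np.array([[0.3, 0.7], [0.2, 0.8]])
B  = np.array([[0.4, 0.6], [0.6, 0.4]])
pi = np.array([0.4, 0.6])
new_pi, new_A, new_B, lik = baum_welch_step([0, 1, 1, 0], pi, A, B)
```

Iterating this step monotonically increases the likelihood until convergence to a local optimum.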

SLIDE 35

HMM: Problems

  • Evaluation: Given parameters and an observation sequence, find the probability (likelihood) of the observed sequence
  • Forward algorithm
  • Decoding: Given HMM parameters and an observation sequence, find the most probable sequence of hidden states
  • Viterbi algorithm
  • Learning: Given an HMM with unknown parameters and an observation sequence, find the parameters that maximize the likelihood of the data
  • Forward-Backward (Baum-Welch) algorithm
SLIDE 36

HMM vs Linear Dynamical Systems

  • HMM
  • States are discrete
  • Observations are discrete or continuous
  • Linear dynamical systems
  • Observations and states are multivariate Gaussians
  • Can use Kalman Filters to solve
SLIDE 37

Linear State Space Models

  • States & observations are Gaussian
  • Kalman filter: (recursive) prediction and update
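A minimal sketch of one predict/update cycle, assuming the standard linear-Gaussian model x_t = F x_{t−1} + w with w ~ N(0, Q) and z_t = H x_t + v with v ~ N(0, R). The matrix names are the conventional ones, not from the slide.

```python
import numpy as np

def kalman_step(x, P, z, F, Q, H, R):
    """One Kalman filter predict/update cycle for a linear-Gaussian model."""
    # predict the next state and its covariance
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # update with measurement z
    S = H @ P_pred @ H.T + R                    # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)         # Kalman gain
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new

# 1-D example: random-walk state observed with unit noise
x_new, P_new = kalman_step(x=np.array([0.0]), P=np.array([[1.0]]),
                           z=np.array([2.0]), F=np.array([[1.0]]),
                           Q=np.array([[0.0]]), H=np.array([[1.0]]),
                           R=np.array([[1.0]]))
print(x_new, P_new)  # estimate moves halfway toward the measurement
```

This is the continuous-state analogue of the forward algorithm: the Gaussian posterior is propagated in closed form instead of summing over discrete states.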

SLIDE 38

More examples

  • Location prediction
  • Privacy preserving data monitoring
SLIDE 39

Next Location Prediction: Definitions

SLIDE 40

Source: A. Monreale, F. Pinelli, R. Trasarti, F. Giannotti. WhereNext: a Location Predictor on Trajectory Pattern Mining. KDD 2009

  • Personalization
  • Individual-based methods only utilize the history of one object to predict its future locations.
  • General-based methods additionally use the movement history of other objects (e.g. similar objects or similar trajectories) to predict the object’s future location.

Next Location Prediction: Classification of Methods

SLIDE 41

Source: A. Monreale, F. Pinelli, R. Trasarti, F. Giannotti. WhereNext: a Location Predictor on Trajectory Pattern Mining. KDD 2009

  • Temporal Representation
  • Location-series representations define trajectories as a set of sequenced locations ordered in time.
  • Fixed-interval time representations use a fixed time interval between two consecutive locations.
  • Variable-interval time representations allow variable transition times between sequenced locations.

Next Location Prediction: Classification of Methods

SLIDE 42
  • Spatial Representation
  • Grid-based methods divide space into fixed-size cells, which can be simple rectangular regions.
  • Frequent/dense-region methods find regions using clustering, e.g. density-based algorithms such as DBSCAN, or hierarchical clustering.
  • Semantic-based methods use semantic features of locations in addition to the geographic information, e.g. home, bank, school.

Next Location Prediction: Classification of Methods

SLIDE 43
  • Mobility Learning Method
  • Model-based (formulate the movement of moving objects using mathematical models)
  • Markov Chains
  • Recursive Motion Function (Y. Tao et al., ACM SIGMOD 2004)
  • Semi-Lazy Hidden Markov Model (J. Zhou et al., ACM SIGKDD 2013)
  • Deep learning models
  • Pattern-based (exploit pattern mining algorithms for prediction)
  • Trajectory Pattern Mining (A. Monreale et al., ACM SIGKDD 2009)
  • Hybrid
  • Recursive Motion Function + Sequential Pattern Mining (H. Jeung et al., ICDE 2008)

Next Location Prediction: Classification of Methods

SLIDE 44

Preliminary Results

[Figure: prediction error for different prediction lengths using (a) the Brinkhoff and (b) the Periodical Synthetic dataset]