HMM, MEMM and CRF
Probabilistic Graphical Models
Sharif University of Technology, Spring 2017
Soleymani
Sequence labeling
- Collectively taking a set of interrelated instances $x_1, \dots, x_T$ and jointly labeling them.
- We get as input a sequence of observations $X = x_{1:T}$ and need to label it with some joint label $Z = z_{1:T}$.
Generalization of mixture models for sequential data
[Figure (Jordan): graphical model of a hidden Markov model — a chain of states $Z_1, Z_2, \dots, Z_{T-1}, Z_T$, each emitting an observation $X_1, X_2, \dots, X_T$. Z: states (latent variables), X: observations.]
HMM examples
- Some applications of HMMs: speech recognition, NLP, activity recognition
- Part-of-speech tagging:
  Students/NNS are/VBP expected/VBN to/TO study/VB
HMM: probabilistic model
- Transition probabilities: probabilities of transitions between states
  $A_{ij} \equiv P(Z_t = j \mid Z_{t-1} = i)$
- Initial state distribution: probabilities of starting in the different states
  $\pi_i \equiv P(Z_1 = i)$
- Observation model: emission probabilities associated with each state
  $P(x_t \mid Z_t, \varphi)$
HMM: probabilistic model
- Transition probabilities:
  $P(Z_t \mid Z_{t-1} = i) = \mathrm{Mult}(Z_t \mid A_{i1}, \dots, A_{iK}) \quad \forall i \in \text{states}$
- Initial state distribution:
  $P(Z_1) = \mathrm{Mult}(Z_1 \mid \pi_1, \dots, \pi_K)$
- Observation model: emission probabilities associated with each state
  - Discrete observations: $P(x_t \mid Z_t = i) = \mathrm{Mult}(x_t \mid B_{i,1}, \dots, B_{i,L}) \quad \forall i \in \text{states}$
  - General: $P(x_t \mid Z_t = i) = f(\cdot \mid \varphi_i)$
- $Z$: states (latent variables), $X$: observations (see the sketch below)
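As a concrete illustration of this parameterization (not part of the original slides), here is a minimal NumPy sketch of a discrete-observation HMM; the sizes K and L and the random Dirichlet initialization are arbitrary choices.

```python
# Toy discrete-observation HMM matching the parameterization above:
# K states, L observation symbols; each row of A and B is a distribution.
import numpy as np

K, L = 3, 4
rng = np.random.default_rng(0)

pi = np.full(K, 1.0 / K)                # pi[i]  = P(Z_1 = i)
A = rng.dirichlet(np.ones(K), size=K)   # A[i,j] = P(Z_t = j | Z_{t-1} = i)
B = rng.dirichlet(np.ones(L), size=K)   # B[i,k] = P(x_t = k | Z_t = i)

# sanity check: rows are probability distributions
assert np.allclose(A.sum(axis=1), 1.0) and np.allclose(B.sum(axis=1), 1.0)
```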
Inference problems in sequential data
- Decoding: $\operatorname*{argmax}_{z_1, \dots, z_T} P(z_1, \dots, z_T \mid x_1, \dots, x_T)$
- Evaluation:
  - Filtering: $P(z_t \mid x_1, \dots, x_t)$
  - Smoothing: $P(z_{t'} \mid x_1, \dots, x_t)$, $t' < t$
  - Prediction: $P(z_{t'} \mid x_1, \dots, x_t)$, $t' > t$
Some questions
- Inference:
  - $P(z_t \mid x_1, \dots, x_t) = ?$
  - $P(x_1, \dots, x_T) = ?$
  - $P(z_t \mid x_1, \dots, x_T) = ?$
- Learning: how do we adjust the HMM parameters?
  - Complete data: each training sample includes a state sequence and the corresponding observation sequence.
  - Incomplete data: each training sample includes only an observation sequence.
Forward algorithm
$\alpha_t(k) = P(x_1, \dots, x_t, Z_t = k)$

$\alpha_t(k) = \sum_i \alpha_{t-1}(i)\, P(Z_t = k \mid Z_{t-1} = i)\, P(x_t \mid Z_t = k)$

- Initialization:
  $\alpha_1(k) = P(x_1, Z_1 = k) = P(x_1 \mid Z_1 = k)\, P(Z_1 = k)$
- Iterations: for $t = 2$ to $T$, for $i, k = 1, \dots, K$:
  $\alpha_t(k) = \sum_i \alpha_{t-1}(i)\, P(Z_t = k \mid Z_{t-1} = i)\, P(x_t \mid Z_t = k)$
- Message-passing view: $\alpha_t(k) = m_{t-1 \to t}(k)$

[Figure: HMM chain with the forward messages $\alpha_1(\cdot), \alpha_2(\cdot), \dots, \alpha_{T-1}(\cdot), \alpha_T(\cdot)$ computed left to right.]
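The recursion translates directly into code. Below is a minimal NumPy sketch (not from the slides); `pi`, `A`, `B` are the arrays from the earlier parameter sketch and `x` is a sequence of integer observation symbols. A practical implementation would rescale each step or work in log space to avoid underflow.

```python
import numpy as np

def forward(x, pi, A, B):
    """alpha[t, k] = P(x_1..x_{t+1}, Z_{t+1} = k), with 0-based t."""
    T, K = len(x), len(pi)
    alpha = np.zeros((T, K))
    alpha[0] = B[:, x[0]] * pi                      # alpha_1(k) = P(x_1 | k) pi_k
    for t in range(1, T):
        # sum_i alpha_{t-1}(i) A[i, k], then multiply by P(x_t | Z_t = k)
        alpha[t] = (alpha[t - 1] @ A) * B[:, x[t]]
    return alpha
```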
Backward algorithm
$\beta_t(k) = m_{t+1 \to t}(k) = P(x_{t+1}, \dots, x_T \mid Z_t = k)$

$\beta_{t-1}(i) = \sum_k \beta_t(k)\, P(Z_t = k \mid Z_{t-1} = i)\, P(x_t \mid Z_t = k)$

- Initialization:
  $\beta_T(k) = 1$
- Iterations: for $t = T$ down to 2, for $i, k \in \text{states}$:
  $\beta_{t-1}(i) = \sum_k \beta_t(k)\, P(Z_t = k \mid Z_{t-1} = i)\, P(x_t \mid Z_t = k)$

[Figure: HMM chain with the backward messages $\beta_T(\cdot) = 1, \beta_{T-1}(\cdot), \dots, \beta_2(\cdot), \beta_1(\cdot)$ computed right to left.]
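The matching backward pass, again as an illustrative NumPy sketch with the same conventions (and the same caveat about underflow on long sequences):

```python
import numpy as np

def backward(x, A, B):
    """beta[t, k] = P(x_{t+2}..x_T | Z_{t+1} = k), with 0-based t."""
    T, K = len(x), A.shape[0]
    beta = np.zeros((T, K))
    beta[T - 1] = 1.0                                 # beta_T(k) = 1
    for t in range(T - 2, -1, -1):
        # sum_k A[i, k] P(x_{t+1} | Z = k) beta_{t+1}(k)
        beta[t] = A @ (B[:, x[t + 1]] * beta[t + 1])
    return beta
```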
Forward-backward algorithm
Forward: $\alpha_1(k) = P(x_1, Z_1 = k) = P(x_1 \mid Z_1 = k)\, P(Z_1 = k)$, then
$\alpha_t(k) = \sum_i \alpha_{t-1}(i)\, P(Z_t = k \mid Z_{t-1} = i)\, P(x_t \mid Z_t = k)$

Backward: $\beta_T(k) = 1$, then
$\beta_{t-1}(i) = \sum_k \beta_t(k)\, P(Z_t = k \mid Z_{t-1} = i)\, P(x_t \mid Z_t = k)$

Evaluation: $P(x_1, x_2, \dots, x_T) = \sum_k \alpha_T(k)\, \beta_T(k) = \sum_k \alpha_T(k)$

Smoothing: $P(Z_t = k \mid x_1, x_2, \dots, x_T) = \dfrac{\alpha_t(k)\, \beta_t(k)}{\sum_j \alpha_T(j)}$

where $\alpha_t(k) \equiv P(x_1, x_2, \dots, x_t, Z_t = k)$ and $\beta_t(k) \equiv P(x_{t+1}, x_{t+2}, \dots, x_T \mid Z_t = k)$.
Forward-backward algorithm
- This will also be used in the E-step of the EM algorithm to train an HMM (see the sketch below):
  $P(Z_t = k \mid x_1, \dots, x_T) = \dfrac{P(x_1, \dots, x_T, Z_t = k)}{P(x_1, \dots, x_T)} = \dfrac{\alpha_t(k)\, \beta_t(k)}{\sum_{j=1}^{K} \alpha_T(j)}$

[Figure: HMM chain annotated with both message sets — forward $\alpha_1(\cdot), \dots, \alpha_T(\cdot)$ and backward $\beta_T(\cdot) = 1, \dots, \beta_1(\cdot)$.]
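Combining the two sketches above gives the sequence likelihood and the smoothed state posteriors in a few lines (illustrative only; `forward` and `backward` are the earlier sketches, and log-space or per-step scaling is needed for long sequences):

```python
import numpy as np

def forward_backward(x, pi, A, B):
    alpha = forward(x, pi, A, B)
    beta = backward(x, A, B)
    likelihood = alpha[-1].sum()          # P(x_1..x_T) = sum_k alpha_T(k)
    gamma = alpha * beta / likelihood     # gamma[t, k] = P(Z_t = k | x_1..x_T)
    return gamma, likelihood
```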
Decoding Problem
- Choose the state sequence that is most probable given the observations:
  $\operatorname*{argmax}_{z_1, \dots, z_T} P(z_1, \dots, z_T \mid x_1, \dots, x_T)$
- Viterbi algorithm: define the auxiliary variable $\delta$:
  $\delta_t(k) = \max_{z_1, \dots, z_{t-1}} P(z_1, z_2, \dots, z_{t-1}, Z_t = k, x_1, x_2, \dots, x_t)$
  - $\delta_t(k)$: probability of the most probable path ending in state $Z_t = k$
- Recursive relation:
  $\delta_t(k) = \max_{i = 1, \dots, K} \delta_{t-1}(i)\, P(Z_t = k \mid Z_{t-1} = i)\, P(x_t \mid Z_t = k)$
Decoding Problem: Viterbi algorithm
- Initialization ($k = 1, \dots, K$):
  $\delta_1(k) = P(x_1 \mid Z_1 = k)\, P(Z_1 = k)$
  $\psi_1(k) = 0$
- Iterations: for $t = 2, \dots, T$ and $k = 1, \dots, K$:
  $\delta_t(k) = \max_i \delta_{t-1}(i)\, P(Z_t = k \mid Z_{t-1} = i)\, P(x_t \mid Z_t = k)$
  $\psi_t(k) = \operatorname*{argmax}_i \delta_{t-1}(i)\, P(Z_t = k \mid Z_{t-1} = i)$
- Final computation:
  $P^* = \max_{k = 1, \dots, K} \delta_T(k)$
  $z_T^* = \operatorname*{argmax}_{k = 1, \dots, K} \delta_T(k)$
- Traceback of the state sequence: for $t = T - 1$ down to 1:
  $z_t^* = \psi_{t+1}(z_{t+1}^*)$
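A compact NumPy transcription of these recursions (an illustrative sketch; products of maxima underflow on long sequences, so real implementations keep log δ instead):

```python
import numpy as np

def viterbi(x, pi, A, B):
    T, K = len(x), len(pi)
    delta = np.zeros((T, K))
    psi = np.zeros((T, K), dtype=int)
    delta[0] = B[:, x[0]] * pi                  # delta_1(k)
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A      # scores[i, k] = delta_{t-1}(i) A[i, k]
        psi[t] = scores.argmax(axis=0)          # best predecessor of each state k
        delta[t] = scores.max(axis=0) * B[:, x[t]]
    z = np.zeros(T, dtype=int)
    z[T - 1] = delta[T - 1].argmax()            # z_T* = argmax_k delta_T(k)
    for t in range(T - 2, -1, -1):              # traceback
        z[t] = psi[t + 1][z[t + 1]]
    return z
```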
Max-product algorithm
Viterbi is an instance of the max-product algorithm on the chain:

$m_{ts}^{\max}(y_s) = \max_{y_t} \left[ \psi(y_t)\, \psi(y_s, y_t) \prod_{u \in \mathcal{N}(t) \setminus s} m_{ut}^{\max}(y_t) \right]$

$\delta_t(k) = m_{t-1,t}^{\max}(k) \times \psi(y_t)$

where $\psi(y_t)$ is the local evidence potential (in the HMM, the emission probability).
HMM Learning
- Supervised learning: we have a set of data samples, each containing a pair of sequences (one is the observation sequence and the other is the state sequence).
- Unsupervised learning: we have a set of data samples, each containing only a sequence of observations.
HMM supervised learning by MLE
- Initial state probability:
  $\pi_i = P(Z_1 = i), \quad 1 \le i \le K$
- State transition probability:
  $A_{ij} = P(Z_{t+1} = j \mid Z_t = i), \quad 1 \le i, j \le K$
- Emission probability (discrete observations):
  $B_{ik} = P(x_t = k \mid Z_t = i), \quad 1 \le k \le L$

[Figure: HMM chain labeled with the parameters $\pi$, $A$ (transitions), and $B$ (emissions).]
HMM: supervised parameter learning by MLE
$P(\mathcal{D} \mid \theta) = \prod_{n=1}^{N} P(z_1^{(n)} \mid \pi) \prod_{t=2}^{T} P(z_t^{(n)} \mid z_{t-1}^{(n)}, A) \prod_{t=1}^{T} P(x_t^{(n)} \mid z_t^{(n)}, B)$

Maximum-likelihood estimates for discrete observations are relative frequencies (a counting sketch follows):

$A_{ij} = \dfrac{\sum_{n=1}^{N} \sum_{t=2}^{T} I\left(z_{t-1}^{(n)} = i,\ z_t^{(n)} = j\right)}{\sum_{n=1}^{N} \sum_{t=2}^{T} I\left(z_{t-1}^{(n)} = i\right)}$

$\pi_i = \dfrac{\sum_{n=1}^{N} I\left(z_1^{(n)} = i\right)}{N}$

$B_{ik} = \dfrac{\sum_{n=1}^{N} \sum_{t=1}^{T} I\left(z_t^{(n)} = i,\ x_t^{(n)} = k\right)}{\sum_{n=1}^{N} \sum_{t=1}^{T} I\left(z_t^{(n)} = i\right)}$
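These closed-form estimates are just normalized counts, as the following illustrative sketch shows; `pairs` is assumed to be a list of `(z_seq, x_seq)` integer sequences, and smoothing of zero counts is omitted for clarity.

```python
import numpy as np

def hmm_mle(pairs, K, L):
    pi, A, B = np.zeros(K), np.zeros((K, K)), np.zeros((K, L))
    for z, x in pairs:
        pi[z[0]] += 1                       # count initial states
        for t in range(1, len(z)):
            A[z[t - 1], z[t]] += 1          # count transitions i -> j
        for zt, xt in zip(z, x):
            B[zt, xt] += 1                  # count emissions i -> k
    # normalize the counts into probability distributions
    return pi / pi.sum(), A / A.sum(axis=1, keepdims=True), B / B.sum(axis=1, keepdims=True)
```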
Learning
- Problem: how do we construct an HMM given only observations?
  - Find $\theta = (A, B, \pi)$ maximizing $P(x_1, \dots, x_T \mid \theta)$
- Incomplete data ⇒ EM algorithm
HMM learning by EM (Baum-Welch)
$\theta^{old} = (\pi^{old}, A^{old}, B^{old})$

- E-step:
  $\gamma_{k,t}^{(n)} = P\left(Z_t^{(n)} = k \mid x^{(n)}; \theta^{old}\right)$
  $\xi_{j,k,t}^{(n)} = P\left(Z_{t-1}^{(n)} = j,\ Z_t^{(n)} = k \mid x^{(n)}; \theta^{old}\right)$
- M-step ($j, k = 1, \dots, K$ and $l = 1, \dots, L$):
  $\pi_k^{new} = \dfrac{\sum_{n=1}^{N} \gamma_{k,1}^{(n)}}{N}$
  $A_{j,k}^{new} = \dfrac{\sum_{n=1}^{N} \sum_{t=2}^{T} \xi_{j,k,t}^{(n)}}{\sum_{n=1}^{N} \sum_{t=1}^{T-1} \gamma_{j,t}^{(n)}}$
  $B_{k,l}^{new} = \dfrac{\sum_{n=1}^{N} \sum_{t=1}^{T} \gamma_{k,t}^{(n)}\, I\left(x_t^{(n)} = l\right)}{\sum_{n=1}^{N} \sum_{t=1}^{T} \gamma_{k,t}^{(n)}}$

Baum-Welch algorithm (Baum, 1972); discrete observations.
HMM learning by EM (Baum-Welch)
$\theta^{old} = (\pi^{old}, A^{old}, \varphi^{old})$

- E-step:
  $\gamma_{k,t}^{(n)} = P\left(Z_t^{(n)} = k \mid x^{(n)}; \theta^{old}\right)$
  $\xi_{j,k,t}^{(n)} = P\left(Z_{t-1}^{(n)} = j,\ Z_t^{(n)} = k \mid x^{(n)}; \theta^{old}\right)$
- M-step ($j, k = 1, \dots, K$): $\pi_k^{new}$ and $A_{j,k}^{new}$ as before, plus
  $\mu_k^{new} = \dfrac{\sum_{n=1}^{N} \sum_{t=1}^{T} \gamma_{k,t}^{(n)}\, x_t^{(n)}}{\sum_{n=1}^{N} \sum_{t=1}^{T} \gamma_{k,t}^{(n)}}$
  $\Sigma_k^{new} = \dfrac{\sum_{n=1}^{N} \sum_{t=1}^{T} \gamma_{k,t}^{(n)} \left(x_t^{(n)} - \mu_k^{new}\right) \left(x_t^{(n)} - \mu_k^{new}\right)^\top}{\sum_{n=1}^{N} \sum_{t=1}^{T} \gamma_{k,t}^{(n)}}$

Assumption: Gaussian emission probabilities. Baum-Welch algorithm (Baum, 1972).
Forward-backward algorithm for E-step
$P(z_{t-1}, z_t \mid x_1, \dots, x_T) = \dfrac{P(x_1, \dots, x_T \mid z_{t-1}, z_t)\, P(z_{t-1}, z_t)}{P(x_1, \dots, x_T)}$
$= \dfrac{P(x_1, \dots, x_{t-1} \mid z_{t-1})\, P(x_t \mid z_t)\, P(x_{t+1}, \dots, x_T \mid z_t)\, P(z_t \mid z_{t-1})\, P(z_{t-1})}{P(x_1, \dots, x_T)}$
$= \dfrac{\alpha_{t-1}(z_{t-1})\, P(x_t \mid z_t)\, P(z_t \mid z_{t-1})\, \beta_t(z_t)}{\sum_{j=1}^{K} \alpha_T(j)}$

This equivalence follows from the independencies in the HMM structure:
$P(x_1, \dots, x_T \mid z_{t-1}, z_t) = P(x_1, \dots, x_{t-1} \mid x_t, \dots, x_T, z_{t-1}, z_t) \times P(x_t \mid x_{t+1}, \dots, x_T, z_{t-1}, z_t) \times P(x_{t+1}, \dots, x_T \mid z_{t-1}, z_t)$
$= P(x_1, \dots, x_{t-1} \mid z_{t-1}) \times P(x_t \mid z_t) \times P(x_{t+1}, \dots, x_T \mid z_t)$
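Putting the E-step formulas to work, here is an illustrative single-sequence (N = 1) Baum-Welch iteration built on the earlier `forward`/`backward` sketches; `xi[t, j, k]` implements the pairwise posterior derived above. It is unscaled, so it is only suitable for short toy sequences.

```python
import numpy as np

def baum_welch_step(x, pi, A, B):
    x = np.asarray(x)
    alpha, beta = forward(x, pi, A, B), backward(x, A, B)
    px = alpha[-1].sum()                          # P(x_1..x_T)
    gamma = alpha * beta / px                     # gamma[t, k]
    # xi[t, j, k] = alpha_t(j) A[j, k] P(x_{t+1} | k) beta_{t+1}(k) / P(x)
    xi = (alpha[:-1, :, None] * A[None, :, :] *
          (B[:, x[1:]].T * beta[1:])[:, None, :]) / px
    # M-step
    pi_new = gamma[0]
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    B_new = np.stack([gamma[x == l].sum(axis=0) for l in range(B.shape[1])], axis=1)
    B_new /= gamma.sum(axis=0)[:, None]
    return pi_new, A_new, B_new
```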
HMM shortcomings
- In modeling the joint distribution $P(X, Z)$, the HMM ignores many dependencies between the observations $x_1, \dots, x_T$ (like most generative models, it must simplify the structure).
- In the sequence labeling task, we need to classify an observation sequence using the conditional probability $P(Z \mid X)$.
- However, the HMM learns a joint distribution $P(X, Z)$ while using only $P(Z \mid X)$ for labeling.

[Figure: HMM graphical model — each observation $X_t$ is generated from its own state $Z_t$ alone.]
Maximum Entropy Markov Model (MEMM)
$P(z_{1:T} \mid x_{1:T}) = P(z_1 \mid x_{1:T}) \prod_{t=2}^{T} P(z_t \mid z_{t-1}, x_{1:T})$

$P(z_t \mid z_{t-1}, x_{1:T}) = \dfrac{\exp\left(\boldsymbol{w}^\top \boldsymbol{f}(z_t, z_{t-1}, x_{1:T})\right)}{\sum_{z'_t} \exp\left(\boldsymbol{w}^\top \boldsymbol{f}(z'_t, z_{t-1}, x_{1:T})\right)}$

- Discriminative model:
  - Models only $P(Z \mid X)$ and completely ignores modeling $P(X)$
  - Maximizes the conditional likelihood $P(\mathcal{Z} \mid \mathcal{X}, \theta)$

[Figure: HMM vs. MEMM graphical models — in the MEMM, the arrows point from the observation sequence $x_{1:T}$ into each state $Z_t$.]
Feature function
- The feature functions $\boldsymbol{f}(z_t, z_{t-1}, x_{1:T})$ can take account of relations between both the data and the label space.
- However, they are often indicator functions showing the absence or presence of a feature.
- The weight $w_k$ captures how closely $f_k(z_t, z_{t-1}, x_{1:T})$ is related with the label (see the sketch below).
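To make the local model concrete, here is an illustrative sketch of the MEMM's per-state softmax; the feature map `feat` is hypothetical and user-supplied (it would return the vector $\boldsymbol{f}(z_t, z_{t-1}, x_{1:T})$ for a candidate next state).

```python
import numpy as np

def memm_transition_probs(z_prev, x, t, w, feat, K):
    """P(z_t = k | z_{t-1} = z_prev, x_{1:T}) for each k, locally normalized."""
    scores = np.array([w @ feat(k, z_prev, x, t) for k in range(K)])
    scores -= scores.max()          # subtract max for numerical stability
    p = np.exp(scores)
    return p / p.sum()              # per-state (local) normalization
```

Note that the normalization runs only over the candidate next states for this one $z_{t-1}$; this locality is exactly what causes the label bias problem discussed next.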
MEMM disadvantages
- A later observation in the sequence has absolutely no effect on the posterior probability of the current state:
  - The model does not allow for any smoothing.
  - It is incapable of going back and changing its prediction about earlier observations.
- The label bias problem:
  - There are cases where a given observation is not useful in predicting the next state of the model.
  - In the extreme case, if a state has a unique outgoing transition, the given observation is useless.

[Figure: MEMM graphical model.]
Label bias problem in MEMM
- Label bias problem: states with fewer outgoing arcs are preferred
  - In decoding, states whose outgoing transition distributions have lower entropy are preferred over others.
  - MEMMs should probably be avoided in cases where many transitions are close to deterministic.
  - Extreme case: when there is only one outgoing arc, it does not matter what the observation is.
- The source of this problem: the probabilities of outgoing arcs are normalized separately for each state
  - The transition probabilities out of any state have to sum to 1:
    $P(z_t \mid z_{t-1}, x_{1:T}) = \dfrac{\exp\left(\boldsymbol{w}^\top \boldsymbol{f}(z_t, z_{t-1}, x_{1:T})\right)}{Z(z_{t-1}, x_{1:T}, \boldsymbol{w})}, \quad Z(z_{t-1}, x_{1:T}, \boldsymbol{w}) = \sum_{z'_t} \exp\left(\boldsymbol{w}^\top \boldsymbol{f}(z'_t, z_{t-1}, x_{1:T})\right)$
- Solution: do not normalize the probabilities locally.
From MEMM to CRF
- From local probabilities to local potentials:
  $P(Z \mid X) = \dfrac{1}{Z(X)} \prod_{t=1}^{T} \phi(z_{t-1}, z_t, X) = \dfrac{1}{Z(X, \boldsymbol{w})} \prod_{t=1}^{T} \exp\left(\boldsymbol{w}^\top \boldsymbol{f}(z_t, z_{t-1}, X)\right)$
- CRF is a discriminative model (like the MEMM):
  - It can capture the dependence between each state and the entire observation sequence.
  - It uses a global normalizer $Z(X, \boldsymbol{w})$ that overcomes the label bias problem of the MEMM.
- MEMMs use an exponential model per state, while CRFs have a single exponential model for the joint probability of the entire label sequence.

[Figure: linear-chain CRF over states $z_{1:T}$, each potential conditioned on the whole observation sequence. Here $X \equiv x_{1:T}$ and $Z \equiv z_{1:T}$.]
CRF: conditional distribution
$P(Z \mid X) = \dfrac{1}{Z(X, \boldsymbol{w})} \exp\left(\sum_{t=1}^{T} \boldsymbol{w}^\top \boldsymbol{f}(z_t, z_{t-1}, X)\right) = \dfrac{1}{Z(X, \boldsymbol{w})} \exp\left(\sum_{t=1}^{T} \sum_{k} w_k f_k(z_t, z_{t-1}, X)\right)$

$Z(X, \boldsymbol{w}) = \sum_{Z'} \exp\left(\sum_{t=1}^{T} \boldsymbol{w}^\top \boldsymbol{f}(z'_t, z'_{t-1}, X)\right)$
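Although $Z(X, \boldsymbol{w})$ sums over $K^T$ label sequences, on a chain it factorizes and can be computed by a forward recursion in $O(TK^2)$. An illustrative log-space sketch, assuming the features have been pre-reduced to per-position score tables `S[t, j, k]` holding $\boldsymbol{w}^\top \boldsymbol{f}(z_t = k, z_{t-1} = j, X)$ (with `S[0, 0, k]` holding the scores for $z_1$):

```python
import numpy as np
from scipy.special import logsumexp

def crf_log_partition(S):
    """log Z(X, w) for a linear-chain CRF with score tables S of shape (T, K, K)."""
    T, K = S.shape[0], S.shape[2]
    log_a = S[0, 0]                                      # log-forward scores for z_1
    for t in range(1, T):
        # log sum_j exp(log_a[j] + S[t, j, k]) for every k
        log_a = logsumexp(log_a[:, None] + S[t], axis=0)
    return logsumexp(log_a)

def crf_log_prob(S, z):
    """log P(z | x) = score(z) - log Z(X, w)."""
    score = S[0, 0, z[0]] + sum(S[t, z[t - 1], z[t]] for t in range(1, len(z)))
    return score - crf_log_partition(S)
```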
CRF: MAP inference
- Given the CRF parameters $\boldsymbol{w}$, find the $Z^*$ that maximizes $P(Z \mid X)$:
  $Z^* = \operatorname*{argmax}_{Z} \exp\left(\sum_{t=1}^{T} \boldsymbol{w}^\top \boldsymbol{f}(z_t, z_{t-1}, X)\right)$
- $Z(X, \boldsymbol{w})$ is not a function of $Z$ and so can be ignored.
- The max-product algorithm can be used for this MAP inference problem.
- Same as Viterbi decoding used in HMMs.
CRF: inference
Exact inference for 1-D chain CRFs, with clique potentials
$\psi(z_t, z_{t-1}) = \exp\left(\boldsymbol{w}^\top \boldsymbol{f}(z_t, z_{t-1}, X)\right)$

[Figure: chain CRF and its clique tree — cliques $(z_1, z_2), (z_2, z_3), \dots, (z_{T-1}, z_T)$ with separators $z_2, \dots, z_{T-1}$.]
CRF: learning
Maximum conditional likelihood: $\boldsymbol{w}^* = \operatorname*{argmax}_{\boldsymbol{w}} \prod_{n=1}^{N} P(z^{(n)} \mid x^{(n)}, \boldsymbol{w})$

$\prod_{n=1}^{N} P(z^{(n)} \mid x^{(n)}, \boldsymbol{w}) = \prod_{n=1}^{N} \dfrac{1}{Z(x^{(n)}, \boldsymbol{w})} \exp\left(\sum_{t=1}^{T} \boldsymbol{w}^\top \boldsymbol{f}(z_t^{(n)}, z_{t-1}^{(n)}, x^{(n)})\right)$
$= \exp\left(\sum_{n=1}^{N} \sum_{t=1}^{T} \boldsymbol{w}^\top \boldsymbol{f}(z_t^{(n)}, z_{t-1}^{(n)}, x^{(n)}) - \sum_{n=1}^{N} \ln Z(x^{(n)}, \boldsymbol{w})\right)$

$\ell(\boldsymbol{w}) = \ln \prod_{n=1}^{N} P(z^{(n)} \mid x^{(n)}, \boldsymbol{w})$
$\nabla_{\boldsymbol{w}} \ell(\boldsymbol{w}) = \sum_{n=1}^{N} \sum_{t=1}^{T} \boldsymbol{f}(z_t^{(n)}, z_{t-1}^{(n)}, x^{(n)}) - \sum_{n=1}^{N} \nabla_{\boldsymbol{w}} \ln Z(x^{(n)}, \boldsymbol{w})$

$\nabla_{\boldsymbol{w}} \ln Z(x^{(n)}, \boldsymbol{w}) = \sum_{Z} P(Z \mid x^{(n)}, \boldsymbol{w}) \sum_{t=1}^{T} \boldsymbol{f}(z_t, z_{t-1}, x^{(n)})$

The gradient of the log-partition function of an exponential family is the expectation of its sufficient statistics.
CRF: learning
$\nabla_{\boldsymbol{w}} \ln Z(x^{(n)}, \boldsymbol{w}) = \sum_{Z} P(Z \mid x^{(n)}, \boldsymbol{w}) \sum_{t=1}^{T} \boldsymbol{f}(z_t, z_{t-1}, x^{(n)})$
$= \sum_{t=1}^{T} \sum_{Z} P(Z \mid x^{(n)}, \boldsymbol{w})\, \boldsymbol{f}(z_t, z_{t-1}, x^{(n)})$
$= \sum_{t=1}^{T} \sum_{z_t, z_{t-1}} P(z_t, z_{t-1} \mid x^{(n)}, \boldsymbol{w})\, \boldsymbol{f}(z_t, z_{t-1}, x^{(n)})$

How do we find these expectations? $P(z_t, z_{t-1} \mid x^{(n)}, \boldsymbol{w})$ must be computed for all $t = 2, \dots, T$.
CRF learning: inference to find $P(z_t, z_{t-1} \mid x^{(n)}, \boldsymbol{w})$

- Junction tree algorithm:
  - Initialization of the clique potentials:
    $\psi(z_t, z_{t-1}) = \exp\left(\boldsymbol{w}^\top \boldsymbol{f}(z_t, z_{t-1}, x)\right), \quad \phi(z_t) = 1$
  - After calibration (completion of message passing), with calibrated potentials $\psi^*$:
    $P(z_t, z_{t-1} \mid x^{(n)}, \boldsymbol{w}) = \dfrac{\psi^*(z_t, z_{t-1})}{\sum_{z_t, z_{t-1}} \psi^*(z_t, z_{t-1})}$

[Figure: clique tree for the chain CRF — cliques $(z_1, z_2), (z_2, z_3), \dots, (z_{T-1}, z_T)$ with separators $z_2, \dots, z_{T-1}$.]
CRF learning: gradient descent
$\boldsymbol{w} \leftarrow \boldsymbol{w} + \eta\, \nabla_{\boldsymbol{w}} \ell(\boldsymbol{w})$

- In each gradient step, for each data sample, we run inference to find the pairwise marginals $P(z_t, z_{t-1} \mid x^{(n)}, \boldsymbol{w})$ required for computing the feature expectations (see the sketch below):
  $\nabla_{\boldsymbol{w}} \ln Z(x^{(n)}, \boldsymbol{w}) = E_{P(Z \mid x^{(n)}, \boldsymbol{w})}\left[\sum_{t=1}^{T} \boldsymbol{f}(z_t, z_{t-1}, x^{(n)})\right] = \sum_{t=1}^{T} \sum_{z_t, z_{t-1}} P(z_t, z_{t-1} \mid x^{(n)}, \boldsymbol{w})\, \boldsymbol{f}(z_t, z_{t-1}, x^{(n)})$
  $\nabla_{\boldsymbol{w}} \ell(\boldsymbol{w}) = \sum_{n=1}^{N} \sum_{t=1}^{T} \boldsymbol{f}(z_t^{(n)}, z_{t-1}^{(n)}, x^{(n)}) - \sum_{n=1}^{N} \nabla_{\boldsymbol{w}} \ln Z(x^{(n)}, \boldsymbol{w})$
Summary
- Discriminative vs. generative:
  - In cases where we have many correlated features, discriminative models (MEMM and CRF) are often better:
    - They avoid the challenge of explicitly modeling the distribution over features.
  - But if only limited training data are available, the stronger bias of the generative model may dominate, and generative models may be preferred.
- Learning:
  - HMMs and MEMMs are much more easily learned.
  - CRF training requires an iterative gradient-based approach, which is considerably more expensive:
    - Inference must be run separately for every training sequence in each step.
- MEMM vs. CRF (label bias problem of the MEMM):
  - In many cases CRFs are likely to be the safer choice (particularly where many transitions are close to deterministic), but the computational cost may be prohibitive for large data sets.