HMM, MEMM and CRF
Probabilistic Graphical Models
Sharif University of Technology, Spring 2017
Soleymani
Sequence labeling
- Collectively taking a set of interrelated instances $x_1, \dots, x_T$ and jointly labeling them.
- We get as input a sequence of observations $X = x_{1:T}$ and need to label it with some joint label $Z = z_{1:T}$.
Generalization of mixture models for sequential data
[Figure (Jordan): graphical model of a hidden Markov model — a chain of states $Z_1, Z_2, \dots, Z_{T-1}, Z_T$, each emitting an observation $X_1, X_2, \dots, X_T$. Z: states (latent variables), X: observations.]
HMM examples
- Some applications of HMMs: speech recognition, NLP, activity recognition
- Part-of-speech tagging:
  Students/NNS are/VBP expected/VBN to/TO study/VB
HMM: probabilistic model
- Transition probabilities: probabilities of transitions between states
  $A_{ij} \equiv P(Z_t = j \mid Z_{t-1} = i)$
- Initial state distribution: probabilities of starting in the different states
  $\pi_i \equiv P(Z_1 = i)$
- Observation model: emission probabilities associated with each state
  $P(x_t \mid Z_t, \varphi)$
HMM: probabilistic model
- Transition probabilities:
  $P(Z_t \mid Z_{t-1} = i) = \mathrm{Mult}(Z_t \mid A_{i1}, \dots, A_{iK}) \quad \forall i \in \text{states}$
- Initial state distribution:
  $P(Z_1) = \mathrm{Mult}(Z_1 \mid \pi_1, \dots, \pi_K)$
- Observation model: emission probabilities associated with each state
  - Discrete observations: $P(x_t \mid Z_t = i) = \mathrm{Mult}(x_t \mid B_{i,1}, \dots, B_{i,L}) \quad \forall i \in \text{states}$
  - General: $P(x_t \mid Z_t = i) = f(\cdot \mid \varphi_i)$
- $Z$: states (latent variables), $X$: observations (see the sketch below)
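As a concrete illustration of this parameterization (not part of the original slides), here is a minimal NumPy sketch of a discrete-observation HMM; the sizes K and L and the random Dirichlet initialization are arbitrary choices.

```python
# Toy discrete-observation HMM matching the parameterization above:
# K states, L observation symbols; each row of A and B is a distribution.
import numpy as np

K, L = 3, 4
rng = np.random.default_rng(0)

pi = np.full(K, 1.0 / K)                # pi[i]  = P(Z_1 = i)
A = rng.dirichlet(np.ones(K), size=K)   # A[i,j] = P(Z_t = j | Z_{t-1} = i)
B = rng.dirichlet(np.ones(L), size=K)   # B[i,k] = P(x_t = k | Z_t = i)

# sanity check: rows are probability distributions
assert np.allclose(A.sum(axis=1), 1.0) and np.allclose(B.sum(axis=1), 1.0)
```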
Inference problems in sequential data
- Decoding: $\operatorname*{argmax}_{z_1, \dots, z_T} P(z_1, \dots, z_T \mid x_1, \dots, x_T)$
- Evaluation:
  - Filtering: $P(z_t \mid x_1, \dots, x_t)$
  - Smoothing: $P(z_{t'} \mid x_1, \dots, x_t)$, $t' < t$
  - Prediction: $P(z_{t'} \mid x_1, \dots, x_t)$, $t' > t$
Some questions
- Inference:
  - $P(z_t \mid x_1, \dots, x_t) = ?$
  - $P(x_1, \dots, x_T) = ?$
  - $P(z_t \mid x_1, \dots, x_T) = ?$
- Learning: how do we adjust the HMM parameters?
  - Complete data: each training sample includes a state sequence and the corresponding observation sequence.
  - Incomplete data: each training sample includes only an observation sequence.
Forward algorithm
$\alpha_t(k) = P(x_1, \dots, x_t, Z_t = k)$

$\alpha_t(k) = \sum_i \alpha_{t-1}(i)\, P(Z_t = k \mid Z_{t-1} = i)\, P(x_t \mid Z_t = k)$

- Initialization:
  $\alpha_1(k) = P(x_1, Z_1 = k) = P(x_1 \mid Z_1 = k)\, P(Z_1 = k)$
- Iterations: for $t = 2$ to $T$, for $i, k = 1, \dots, K$:
  $\alpha_t(k) = \sum_i \alpha_{t-1}(i)\, P(Z_t = k \mid Z_{t-1} = i)\, P(x_t \mid Z_t = k)$
- Message-passing view: $\alpha_t(k) = m_{t-1 \to t}(k)$

[Figure: HMM chain with the forward messages $\alpha_1(\cdot), \alpha_2(\cdot), \dots, \alpha_{T-1}(\cdot), \alpha_T(\cdot)$ computed left to right.]
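The recursion translates directly into code. Below is a minimal NumPy sketch (not from the slides); `pi`, `A`, `B` are the arrays from the earlier parameter sketch and `x` is a sequence of integer observation symbols. A practical implementation would rescale each step or work in log space to avoid underflow.

```python
import numpy as np

def forward(x, pi, A, B):
    """alpha[t, k] = P(x_1..x_{t+1}, Z_{t+1} = k), with 0-based t."""
    T, K = len(x), len(pi)
    alpha = np.zeros((T, K))
    alpha[0] = B[:, x[0]] * pi                      # alpha_1(k) = P(x_1 | k) pi_k
    for t in range(1, T):
        # sum_i alpha_{t-1}(i) A[i, k], then multiply by P(x_t | Z_t = k)
        alpha[t] = (alpha[t - 1] @ A) * B[:, x[t]]
    return alpha
```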
Backward algorithm
$\beta_t(k) = m_{t+1 \to t}(k) = P(x_{t+1}, \dots, x_T \mid Z_t = k)$

$\beta_{t-1}(i) = \sum_k \beta_t(k)\, P(Z_t = k \mid Z_{t-1} = i)\, P(x_t \mid Z_t = k)$

- Initialization:
  $\beta_T(k) = 1$
- Iterations: for $t = T$ down to 2, for $i, k \in \text{states}$:
  $\beta_{t-1}(i) = \sum_k \beta_t(k)\, P(Z_t = k \mid Z_{t-1} = i)\, P(x_t \mid Z_t = k)$

[Figure: HMM chain with the backward messages $\beta_T(\cdot) = 1, \beta_{T-1}(\cdot), \dots, \beta_2(\cdot), \beta_1(\cdot)$ computed right to left.]
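The matching backward pass, again as an illustrative NumPy sketch with the same conventions (and the same caveat about underflow on long sequences):

```python
import numpy as np

def backward(x, A, B):
    """beta[t, k] = P(x_{t+2}..x_T | Z_{t+1} = k), with 0-based t."""
    T, K = len(x), A.shape[0]
    beta = np.zeros((T, K))
    beta[T - 1] = 1.0                                 # beta_T(k) = 1
    for t in range(T - 2, -1, -1):
        # sum_k A[i, k] P(x_{t+1} | Z = k) beta_{t+1}(k)
        beta[t] = A @ (B[:, x[t + 1]] * beta[t + 1])
    return beta
```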
Forward-backward algorithm
Forward: $\alpha_1(k) = P(x_1, Z_1 = k) = P(x_1 \mid Z_1 = k)\, P(Z_1 = k)$, then
$\alpha_t(k) = \sum_i \alpha_{t-1}(i)\, P(Z_t = k \mid Z_{t-1} = i)\, P(x_t \mid Z_t = k)$

Backward: $\beta_T(k) = 1$, then
$\beta_{t-1}(i) = \sum_k \beta_t(k)\, P(Z_t = k \mid Z_{t-1} = i)\, P(x_t \mid Z_t = k)$

Evaluation: $P(x_1, x_2, \dots, x_T) = \sum_k \alpha_T(k)\, \beta_T(k) = \sum_k \alpha_T(k)$

Smoothing: $P(Z_t = k \mid x_1, x_2, \dots, x_T) = \dfrac{\alpha_t(k)\, \beta_t(k)}{\sum_j \alpha_T(j)}$

where $\alpha_t(k) \equiv P(x_1, x_2, \dots, x_t, Z_t = k)$ and $\beta_t(k) \equiv P(x_{t+1}, x_{t+2}, \dots, x_T \mid Z_t = k)$.
Forward-backward algorithm
- This will also be used in the E-step of the EM algorithm to train an HMM (see the sketch below):
  $P(Z_t = k \mid x_1, \dots, x_T) = \dfrac{P(x_1, \dots, x_T, Z_t = k)}{P(x_1, \dots, x_T)} = \dfrac{\alpha_t(k)\, \beta_t(k)}{\sum_{j=1}^{K} \alpha_T(j)}$

[Figure: HMM chain annotated with both message sets — forward $\alpha_1(\cdot), \dots, \alpha_T(\cdot)$ and backward $\beta_T(\cdot) = 1, \dots, \beta_1(\cdot)$.]
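Combining the two sketches above gives the sequence likelihood and the smoothed state posteriors in a few lines (illustrative only; `forward` and `backward` are the earlier sketches, and log-space or per-step scaling is needed for long sequences):

```python
import numpy as np

def forward_backward(x, pi, A, B):
    alpha = forward(x, pi, A, B)
    beta = backward(x, A, B)
    likelihood = alpha[-1].sum()          # P(x_1..x_T) = sum_k alpha_T(k)
    gamma = alpha * beta / likelihood     # gamma[t, k] = P(Z_t = k | x_1..x_T)
    return gamma, likelihood
```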
Decoding Problem
- Choose the state sequence that is most probable given the observations:
  $\operatorname*{argmax}_{z_1, \dots, z_T} P(z_1, \dots, z_T \mid x_1, \dots, x_T)$
- Viterbi algorithm: define the auxiliary variable $\delta$:
  $\delta_t(k) = \max_{z_1, \dots, z_{t-1}} P(z_1, z_2, \dots, z_{t-1}, Z_t = k, x_1, x_2, \dots, x_t)$
  - $\delta_t(k)$: probability of the most probable path ending in state $Z_t = k$
- Recursive relation:
  $\delta_t(k) = \max_{i = 1, \dots, K} \delta_{t-1}(i)\, P(Z_t = k \mid Z_{t-1} = i)\, P(x_t \mid Z_t = k)$
Decoding Problem: Viterbi algorithm
- Initialization ($k = 1, \dots, K$):
  $\delta_1(k) = P(x_1 \mid Z_1 = k)\, P(Z_1 = k)$
  $\psi_1(k) = 0$
- Iterations: for $t = 2, \dots, T$ and $k = 1, \dots, K$:
  $\delta_t(k) = \max_i \delta_{t-1}(i)\, P(Z_t = k \mid Z_{t-1} = i)\, P(x_t \mid Z_t = k)$
  $\psi_t(k) = \operatorname*{argmax}_i \delta_{t-1}(i)\, P(Z_t = k \mid Z_{t-1} = i)$
- Final computation:
  $P^* = \max_{k = 1, \dots, K} \delta_T(k)$
  $z_T^* = \operatorname*{argmax}_{k = 1, \dots, K} \delta_T(k)$
- Traceback of the state sequence: for $t = T - 1$ down to 1:
  $z_t^* = \psi_{t+1}(z_{t+1}^*)$
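A compact NumPy transcription of these recursions (an illustrative sketch; products of maxima underflow on long sequences, so real implementations keep log δ instead):

```python
import numpy as np

def viterbi(x, pi, A, B):
    T, K = len(x), len(pi)
    delta = np.zeros((T, K))
    psi = np.zeros((T, K), dtype=int)
    delta[0] = B[:, x[0]] * pi                  # delta_1(k)
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A      # scores[i, k] = delta_{t-1}(i) A[i, k]
        psi[t] = scores.argmax(axis=0)          # best predecessor of each state k
        delta[t] = scores.max(axis=0) * B[:, x[t]]
    z = np.zeros(T, dtype=int)
    z[T - 1] = delta[T - 1].argmax()            # z_T* = argmax_k delta_T(k)
    for t in range(T - 2, -1, -1):              # traceback
        z[t] = psi[t + 1][z[t + 1]]
    return z
```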
Max-product algorithm
Viterbi is an instance of the max-product algorithm on the chain:

$m_{ts}^{\max}(y_s) = \max_{y_t} \left[ \psi(y_t)\, \psi(y_s, y_t) \prod_{u \in \mathcal{N}(t) \setminus s} m_{ut}^{\max}(y_t) \right]$

$\delta_t(k) = m_{t-1,t}^{\max}(k) \times \psi(y_t)$

where $\psi(y_t)$ is the local evidence potential (in the HMM, the emission probability).
HMM Learning
- Supervised learning: we have a set of data samples, each containing a pair of sequences (one is the observation sequence and the other is the state sequence).
- Unsupervised learning: we have a set of data samples, each containing only a sequence of observations.
HMM supervised learning by MLE
- Initial state probability:
  $\pi_i = P(Z_1 = i), \quad 1 \le i \le K$
- State transition probability:
  $A_{ij} = P(Z_{t+1} = j \mid Z_t = i), \quad 1 \le i, j \le K$
- Emission probability (discrete observations):
  $B_{ik} = P(x_t = k \mid Z_t = i), \quad 1 \le k \le L$

[Figure: HMM chain labeled with the parameters $\pi$, $A$ (transitions), and $B$ (emissions).]
HMM: supervised parameter learning by MLE
$P(\mathcal{D} \mid \theta) = \prod_{n=1}^{N} P(z_1^{(n)} \mid \pi) \prod_{t=2}^{T} P(z_t^{(n)} \mid z_{t-1}^{(n)}, A) \prod_{t=1}^{T} P(x_t^{(n)} \mid z_t^{(n)}, B)$

Maximum-likelihood estimates for discrete observations are relative frequencies (a counting sketch follows):

$A_{ij} = \dfrac{\sum_{n=1}^{N} \sum_{t=2}^{T} I\left(z_{t-1}^{(n)} = i,\ z_t^{(n)} = j\right)}{\sum_{n=1}^{N} \sum_{t=2}^{T} I\left(z_{t-1}^{(n)} = i\right)}$

$\pi_i = \dfrac{\sum_{n=1}^{N} I\left(z_1^{(n)} = i\right)}{N}$

$B_{ik} = \dfrac{\sum_{n=1}^{N} \sum_{t=1}^{T} I\left(z_t^{(n)} = i,\ x_t^{(n)} = k\right)}{\sum_{n=1}^{N} \sum_{t=1}^{T} I\left(z_t^{(n)} = i\right)}$
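These closed-form estimates are just normalized counts, as the following illustrative sketch shows; `pairs` is assumed to be a list of `(z_seq, x_seq)` integer sequences, and smoothing of zero counts is omitted for clarity.

```python
import numpy as np

def hmm_mle(pairs, K, L):
    pi, A, B = np.zeros(K), np.zeros((K, K)), np.zeros((K, L))
    for z, x in pairs:
        pi[z[0]] += 1                       # count initial states
        for t in range(1, len(z)):
            A[z[t - 1], z[t]] += 1          # count transitions i -> j
        for zt, xt in zip(z, x):
            B[zt, xt] += 1                  # count emissions i -> k
    # normalize the counts into probability distributions
    return pi / pi.sum(), A / A.sum(axis=1, keepdims=True), B / B.sum(axis=1, keepdims=True)
```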
Learning
- Problem: how do we construct an HMM given only observations?
  - Find $\theta = (A, B, \pi)$ maximizing $P(x_1, \dots, x_T \mid \theta)$
- Incomplete data ⇒ EM algorithm
HMM learning by EM (Baum-Welch)
$\theta^{old} = (\pi^{old}, A^{old}, B^{old})$

- E-step:
  $\gamma_{k,t}^{(n)} = P\left(Z_t^{(n)} = k \mid x^{(n)}; \theta^{old}\right)$
  $\xi_{j,k,t}^{(n)} = P\left(Z_{t-1}^{(n)} = j,\ Z_t^{(n)} = k \mid x^{(n)}; \theta^{old}\right)$
- M-step ($j, k = 1, \dots, K$ and $l = 1, \dots, L$):
  $\pi_k^{new} = \dfrac{\sum_{n=1}^{N} \gamma_{k,1}^{(n)}}{N}$
  $A_{j,k}^{new} = \dfrac{\sum_{n=1}^{N} \sum_{t=2}^{T} \xi_{j,k,t}^{(n)}}{\sum_{n=1}^{N} \sum_{t=1}^{T-1} \gamma_{j,t}^{(n)}}$
  $B_{k,l}^{new} = \dfrac{\sum_{n=1}^{N} \sum_{t=1}^{T} \gamma_{k,t}^{(n)}\, I\left(x_t^{(n)} = l\right)}{\sum_{n=1}^{N} \sum_{t=1}^{T} \gamma_{k,t}^{(n)}}$

Baum-Welch algorithm (Baum, 1972); discrete observations.
HMM learning by EM (Baum-Welch)
$\theta^{old} = (\pi^{old}, A^{old}, \varphi^{old})$

- E-step:
  $\gamma_{k,t}^{(n)} = P\left(Z_t^{(n)} = k \mid x^{(n)}; \theta^{old}\right)$
  $\xi_{j,k,t}^{(n)} = P\left(Z_{t-1}^{(n)} = j,\ Z_t^{(n)} = k \mid x^{(n)}; \theta^{old}\right)$
- M-step ($j, k = 1, \dots, K$): $\pi_k^{new}$ and $A_{j,k}^{new}$ as before, plus
  $\mu_k^{new} = \dfrac{\sum_{n=1}^{N} \sum_{t=1}^{T} \gamma_{k,t}^{(n)}\, x_t^{(n)}}{\sum_{n=1}^{N} \sum_{t=1}^{T} \gamma_{k,t}^{(n)}}$
  $\Sigma_k^{new} = \dfrac{\sum_{n=1}^{N} \sum_{t=1}^{T} \gamma_{k,t}^{(n)} \left(x_t^{(n)} - \mu_k^{new}\right) \left(x_t^{(n)} - \mu_k^{new}\right)^\top}{\sum_{n=1}^{N} \sum_{t=1}^{T} \gamma_{k,t}^{(n)}}$

Assumption: Gaussian emission probabilities. Baum-Welch algorithm (Baum, 1972).
Forward-backward algorithm for E-step
$P(z_{t-1}, z_t \mid x_1, \dots, x_T) = \dfrac{P(x_1, \dots, x_T \mid z_{t-1}, z_t)\, P(z_{t-1}, z_t)}{P(x_1, \dots, x_T)}$
$= \dfrac{P(x_1, \dots, x_{t-1} \mid z_{t-1})\, P(x_t \mid z_t)\, P(x_{t+1}, \dots, x_T \mid z_t)\, P(z_t \mid z_{t-1})\, P(z_{t-1})}{P(x_1, \dots, x_T)}$
$= \dfrac{\alpha_{t-1}(z_{t-1})\, P(x_t \mid z_t)\, P(z_t \mid z_{t-1})\, \beta_t(z_t)}{\sum_{j=1}^{K} \alpha_T(j)}$

This equivalence follows from the independencies in the HMM structure:
$P(x_1, \dots, x_T \mid z_{t-1}, z_t) = P(x_1, \dots, x_{t-1} \mid x_t, \dots, x_T, z_{t-1}, z_t) \times P(x_t \mid x_{t+1}, \dots, x_T, z_{t-1}, z_t) \times P(x_{t+1}, \dots, x_T \mid z_{t-1}, z_t)$
$= P(x_1, \dots, x_{t-1} \mid z_{t-1}) \times P(x_t \mid z_t) \times P(x_{t+1}, \dots, x_T \mid z_t)$
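Putting the E-step formulas to work, here is an illustrative single-sequence (N = 1) Baum-Welch iteration built on the earlier `forward`/`backward` sketches; `xi[t, j, k]` implements the pairwise posterior derived above. It is unscaled, so it is only suitable for short toy sequences.

```python
import numpy as np

def baum_welch_step(x, pi, A, B):
    x = np.asarray(x)
    alpha, beta = forward(x, pi, A, B), backward(x, A, B)
    px = alpha[-1].sum()                          # P(x_1..x_T)
    gamma = alpha * beta / px                     # gamma[t, k]
    # xi[t, j, k] = alpha_t(j) A[j, k] P(x_{t+1} | k) beta_{t+1}(k) / P(x)
    xi = (alpha[:-1, :, None] * A[None, :, :] *
          (B[:, x[1:]].T * beta[1:])[:, None, :]) / px
    # M-step
    pi_new = gamma[0]
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    B_new = np.stack([gamma[x == l].sum(axis=0) for l in range(B.shape[1])], axis=1)
    B_new /= gamma.sum(axis=0)[:, None]
    return pi_new, A_new, B_new
```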
HMM shortcomings
- In modeling the joint distribution $P(X, Z)$, the HMM ignores many dependencies between the observations $x_1, \dots, x_T$ (like most generative models, it must simplify the structure).
- In the sequence labeling task, we need to classify an observation sequence using the conditional probability $P(Z \mid X)$.
- However, the HMM learns a joint distribution $P(X, Z)$ while using only $P(Z \mid X)$ for labeling.

[Figure: HMM graphical model — each observation $X_t$ is generated from its own state $Z_t$ alone.]
Maximum Entropy Markov Model (MEMM)
$P(z_{1:T} \mid x_{1:T}) = P(z_1 \mid x_{1:T}) \prod_{t=2}^{T} P(z_t \mid z_{t-1}, x_{1:T})$

$P(z_t \mid z_{t-1}, x_{1:T}) = \dfrac{\exp\left(\boldsymbol{w}^\top \boldsymbol{f}(z_t, z_{t-1}, x_{1:T})\right)}{\sum_{z'_t} \exp\left(\boldsymbol{w}^\top \boldsymbol{f}(z'_t, z_{t-1}, x_{1:T})\right)}$

- Discriminative model:
  - Models only $P(Z \mid X)$ and completely ignores modeling $P(X)$
  - Maximizes the conditional likelihood $P(\mathcal{Z} \mid \mathcal{X}, \theta)$

[Figure: HMM vs. MEMM graphical models — in the MEMM, the arrows point from the observation sequence $x_{1:T}$ into each state $Z_t$.]
Feature function
- The feature functions $\boldsymbol{f}(z_t, z_{t-1}, x_{1:T})$ can take account of relations between both the data and the label space.
- However, they are often indicator functions showing the absence or presence of a feature.
- The weight $w_k$ captures how closely $f_k(z_t, z_{t-1}, x_{1:T})$ is related with the label (see the sketch below).
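To make the local model concrete, here is an illustrative sketch of the MEMM's per-state softmax; the feature map `feat` is hypothetical and user-supplied (it would return the vector $\boldsymbol{f}(z_t, z_{t-1}, x_{1:T})$ for a candidate next state).

```python
import numpy as np

def memm_transition_probs(z_prev, x, t, w, feat, K):
    """P(z_t = k | z_{t-1} = z_prev, x_{1:T}) for each k, locally normalized."""
    scores = np.array([w @ feat(k, z_prev, x, t) for k in range(K)])
    scores -= scores.max()          # subtract max for numerical stability
    p = np.exp(scores)
    return p / p.sum()              # per-state (local) normalization
```

Note that the normalization runs only over the candidate next states for this one $z_{t-1}$; this locality is exactly what causes the label bias problem discussed next.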
MEMM disadvantages
- A later observation in the sequence has absolutely no effect on the posterior probability of the current state:
  - The model does not allow for any smoothing.
  - It is incapable of going back and changing its prediction about earlier observations.
- The label bias problem:
  - There are cases where a given observation is not useful in predicting the next state of the model.
  - In the extreme case, if a state has a unique outgoing transition, the given observation is useless.

[Figure: MEMM graphical model.]
Label bias problem in MEMM
- Label bias problem: states with fewer outgoing arcs are preferred
  - In decoding, states whose outgoing transition distributions have lower entropy are preferred over others.
  - MEMMs should probably be avoided in cases where many transitions are close to deterministic.
  - Extreme case: when there is only one outgoing arc, it does not matter what the observation is.
- The source of this problem: the probabilities of outgoing arcs are normalized separately for each state
  - The transition probabilities out of any state have to sum to 1:
    $P(z_t \mid z_{t-1}, x_{1:T}) = \dfrac{\exp\left(\boldsymbol{w}^\top \boldsymbol{f}(z_t, z_{t-1}, x_{1:T})\right)}{Z(z_{t-1}, x_{1:T}, \boldsymbol{w})}, \quad Z(z_{t-1}, x_{1:T}, \boldsymbol{w}) = \sum_{z'_t} \exp\left(\boldsymbol{w}^\top \boldsymbol{f}(z'_t, z_{t-1}, x_{1:T})\right)$
- Solution: do not normalize the probabilities locally.
From MEMM to CRF
- From local probabilities to local potentials:
  $P(Z \mid X) = \dfrac{1}{Z(X)} \prod_{t=1}^{T} \phi(z_{t-1}, z_t, X) = \dfrac{1}{Z(X, \boldsymbol{w})} \prod_{t=1}^{T} \exp\left(\boldsymbol{w}^\top \boldsymbol{f}(z_t, z_{t-1}, X)\right)$
- CRF is a discriminative model (like the MEMM):
  - It can capture the dependence between each state and the entire observation sequence.
  - It uses a global normalizer $Z(X, \boldsymbol{w})$ that overcomes the label bias problem of the MEMM.
- MEMMs use an exponential model per state, while CRFs have a single exponential model for the joint probability of the entire label sequence.

[Figure: linear-chain CRF over states $z_{1:T}$, each potential conditioned on the whole observation sequence. Here $X \equiv x_{1:T}$ and $Z \equiv z_{1:T}$.]
CRF: conditional distribution
$P(Z \mid X) = \dfrac{1}{Z(X, \boldsymbol{w})} \exp\left(\sum_{t=1}^{T} \boldsymbol{w}^\top \boldsymbol{f}(z_t, z_{t-1}, X)\right) = \dfrac{1}{Z(X, \boldsymbol{w})} \exp\left(\sum_{t=1}^{T} \sum_{k} w_k f_k(z_t, z_{t-1}, X)\right)$

$Z(X, \boldsymbol{w}) = \sum_{Z'} \exp\left(\sum_{t=1}^{T} \boldsymbol{w}^\top \boldsymbol{f}(z'_t, z'_{t-1}, X)\right)$
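Although $Z(X, \boldsymbol{w})$ sums over $K^T$ label sequences, on a chain it factorizes and can be computed by a forward recursion in $O(TK^2)$. An illustrative log-space sketch, assuming the features have been pre-reduced to per-position score tables `S[t, j, k]` holding $\boldsymbol{w}^\top \boldsymbol{f}(z_t = k, z_{t-1} = j, X)$ (with `S[0, 0, k]` holding the scores for $z_1$):

```python
import numpy as np
from scipy.special import logsumexp

def crf_log_partition(S):
    """log Z(X, w) for a linear-chain CRF with score tables S of shape (T, K, K)."""
    T, K = S.shape[0], S.shape[2]
    log_a = S[0, 0]                                      # log-forward scores for z_1
    for t in range(1, T):
        # log sum_j exp(log_a[j] + S[t, j, k]) for every k
        log_a = logsumexp(log_a[:, None] + S[t], axis=0)
    return logsumexp(log_a)

def crf_log_prob(S, z):
    """log P(z | x) = score(z) - log Z(X, w)."""
    score = S[0, 0, z[0]] + sum(S[t, z[t - 1], z[t]] for t in range(1, len(z)))
    return score - crf_log_partition(S)
```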
CRF: MAP inference
- Given the CRF parameters $\boldsymbol{w}$, find the $Z^*$ that maximizes $P(Z \mid X)$:
  $Z^* = \operatorname*{argmax}_{Z} \exp\left(\sum_{t=1}^{T} \boldsymbol{w}^\top \boldsymbol{f}(z_t, z_{t-1}, X)\right)$
- $Z(X, \boldsymbol{w})$ is not a function of $Z$ and so can be ignored.
- The max-product algorithm can be used for this MAP inference problem.
- Same as Viterbi decoding used in HMMs.
CRF: inference
Exact inference for 1-D chain CRFs, with clique potentials
$\psi(z_t, z_{t-1}) = \exp\left(\boldsymbol{w}^\top \boldsymbol{f}(z_t, z_{t-1}, X)\right)$

[Figure: chain CRF and its clique tree — cliques $(z_1, z_2), (z_2, z_3), \dots, (z_{T-1}, z_T)$ with separators $z_2, \dots, z_{T-1}$.]
CRF: learning
Maximum conditional likelihood: $\boldsymbol{w}^* = \operatorname*{argmax}_{\boldsymbol{w}} \prod_{n=1}^{N} P(z^{(n)} \mid x^{(n)}, \boldsymbol{w})$

$\prod_{n=1}^{N} P(z^{(n)} \mid x^{(n)}, \boldsymbol{w}) = \prod_{n=1}^{N} \dfrac{1}{Z(x^{(n)}, \boldsymbol{w})} \exp\left(\sum_{t=1}^{T} \boldsymbol{w}^\top \boldsymbol{f}(z_t^{(n)}, z_{t-1}^{(n)}, x^{(n)})\right)$
$= \exp\left(\sum_{n=1}^{N} \sum_{t=1}^{T} \boldsymbol{w}^\top \boldsymbol{f}(z_t^{(n)}, z_{t-1}^{(n)}, x^{(n)}) - \sum_{n=1}^{N} \ln Z(x^{(n)}, \boldsymbol{w})\right)$

$\ell(\boldsymbol{w}) = \ln \prod_{n=1}^{N} P(z^{(n)} \mid x^{(n)}, \boldsymbol{w})$
$\nabla_{\boldsymbol{w}} \ell(\boldsymbol{w}) = \sum_{n=1}^{N} \sum_{t=1}^{T} \boldsymbol{f}(z_t^{(n)}, z_{t-1}^{(n)}, x^{(n)}) - \sum_{n=1}^{N} \nabla_{\boldsymbol{w}} \ln Z(x^{(n)}, \boldsymbol{w})$

$\nabla_{\boldsymbol{w}} \ln Z(x^{(n)}, \boldsymbol{w}) = \sum_{Z} P(Z \mid x^{(n)}, \boldsymbol{w}) \sum_{t=1}^{T} \boldsymbol{f}(z_t, z_{t-1}, x^{(n)})$

The gradient of the log-partition function of an exponential family is the expectation of its sufficient statistics.
CRF: learning
$\nabla_{\boldsymbol{w}} \ln Z(x^{(n)}, \boldsymbol{w}) = \sum_{Z} P(Z \mid x^{(n)}, \boldsymbol{w}) \sum_{t=1}^{T} \boldsymbol{f}(z_t, z_{t-1}, x^{(n)})$
$= \sum_{t=1}^{T} \sum_{Z} P(Z \mid x^{(n)}, \boldsymbol{w})\, \boldsymbol{f}(z_t, z_{t-1}, x^{(n)})$
$= \sum_{t=1}^{T} \sum_{z_t, z_{t-1}} P(z_t, z_{t-1} \mid x^{(n)}, \boldsymbol{w})\, \boldsymbol{f}(z_t, z_{t-1}, x^{(n)})$

How do we find these expectations? $P(z_t, z_{t-1} \mid x^{(n)}, \boldsymbol{w})$ must be computed for all $t = 2, \dots, T$.
CRF learning: inference to find $P(z_t, z_{t-1} \mid x^{(n)}, \boldsymbol{w})$

- Junction tree algorithm:
  - Initialization of the clique potentials:
    $\psi(z_t, z_{t-1}) = \exp\left(\boldsymbol{w}^\top \boldsymbol{f}(z_t, z_{t-1}, x)\right), \quad \phi(z_t) = 1$
  - After calibration (completion of message passing), with calibrated potentials $\psi^*$:
    $P(z_t, z_{t-1} \mid x^{(n)}, \boldsymbol{w}) = \dfrac{\psi^*(z_t, z_{t-1})}{\sum_{z_t, z_{t-1}} \psi^*(z_t, z_{t-1})}$

[Figure: clique tree for the chain CRF — cliques $(z_1, z_2), (z_2, z_3), \dots, (z_{T-1}, z_T)$ with separators $z_2, \dots, z_{T-1}$.]
CRF learning: gradient descent
$\boldsymbol{w} \leftarrow \boldsymbol{w} + \eta\, \nabla_{\boldsymbol{w}} \ell(\boldsymbol{w})$

- In each gradient step, for each data sample, we run inference to find the pairwise marginals $P(z_t, z_{t-1} \mid x^{(n)}, \boldsymbol{w})$ required for computing the feature expectations (see the sketch below):
  $\nabla_{\boldsymbol{w}} \ln Z(x^{(n)}, \boldsymbol{w}) = E_{P(Z \mid x^{(n)}, \boldsymbol{w})}\left[\sum_{t=1}^{T} \boldsymbol{f}(z_t, z_{t-1}, x^{(n)})\right] = \sum_{t=1}^{T} \sum_{z_t, z_{t-1}} P(z_t, z_{t-1} \mid x^{(n)}, \boldsymbol{w})\, \boldsymbol{f}(z_t, z_{t-1}, x^{(n)})$
  $\nabla_{\boldsymbol{w}} \ell(\boldsymbol{w}) = \sum_{n=1}^{N} \sum_{t=1}^{T} \boldsymbol{f}(z_t^{(n)}, z_{t-1}^{(n)}, x^{(n)}) - \sum_{n=1}^{N} \nabla_{\boldsymbol{w}} \ln Z(x^{(n)}, \boldsymbol{w})$
Summary
- Discriminative vs. generative:
  - In cases where we have many correlated features, discriminative models (MEMM and CRF) are often better:
    - They avoid the challenge of explicitly modeling the distribution over features.
  - But if only limited training data are available, the stronger bias of the generative model may dominate, and generative models may be preferred.
- Learning:
  - HMMs and MEMMs are much more easily learned.
  - CRF training requires an iterative gradient-based approach, which is considerably more expensive:
    - Inference must be run separately for every training sequence in each step.
- MEMM vs. CRF (label bias problem of the MEMM):
  - In many cases CRFs are likely to be the safer choice (particularly where many transitions are close to deterministic), but the computational cost may be prohibitive for large data sets.