HMM, MEMM and CRF: Probabilistic Graphical Models (Sharif University of Technology)



SLIDE 1

HMM, MEMM and CRF

Probabilistic Graphical Models, Sharif University of Technology, Spring 2017, Soleymani

SLIDE 2

Sequence labeling

• Taking a set of interrelated instances y_1, …, y_T collectively and jointly labeling them
• We get as input a sequence of observations Y = y_{1:T} and need to label it with a joint label sequence Z = z_{1:T}

SLIDE 3

Generalization of mixture models for sequential data

[Figure (from Jordan): the HMM graphical model, a chain of hidden states Z_1 → Z_2 → … → Z_{T-1} → Z_T, each emitting an observation Y_1, Y_2, …, Y_{T-1}, Y_T. Z: states (latent variables), Y: observations.]

SLIDE 4

HMM examples

• Some applications of HMMs
  • Speech recognition, NLP, activity recognition
  • Part-of-speech tagging, e.g. assigning POS tags to the sentence "Students are expected to study"

SLIDE 5

HMM: probabilistic model

• Transition probabilities: probabilities of transitions between states
  A_{jk} ≡ P(Z_t = k | Z_{t-1} = j)
• Initial state distribution: start probabilities of the different states
  π_j ≡ P(Z_1 = j)
• Observation model: emission probabilities associated with each state
  P(Y_t | Z_t, Φ), where Φ denotes the emission parameters

SLIDE 6

HMM: probabilistic model

• Transition probabilities: probabilities of transitions between states
  P(Z_t | Z_{t-1} = j) = Mult(Z_t | A_{j1}, …, A_{jN})   ∀j ∈ states
• Initial state distribution: start probabilities of the different states
  P(Z_1) = Mult(Z_1 | π_1, …, π_N)
• Observation model: emission probabilities associated with each state
  • Discrete observations: P(Y_t | Z_t = j) = Mult(Y_t | B_{j,1}, …, B_{j,L})   ∀j ∈ states
  • General: P(Y_t | Z_t = j) = f(· | θ_j)

Z: states (latent variables), Y: observations
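As a concrete reference for this parameterization, here is a minimal NumPy sketch of a discrete-observation HMM with N states and L symbols; all names (pi, A, B, sample_hmm) are illustrative, not from the slides.

```python
import numpy as np

# Illustrative discrete-observation HMM:
#   pi[j]   = P(Z_1 = j)                  (initial state distribution)
#   A[j, k] = P(Z_t = k | Z_{t-1} = j)    (transition probabilities)
#   B[j, l] = P(Y_t = l | Z_t = j)        (emission probabilities)
rng = np.random.default_rng(0)
N, L = 3, 4
pi = rng.dirichlet(np.ones(N))
A = rng.dirichlet(np.ones(N), size=N)    # each row sums to 1
B = rng.dirichlet(np.ones(L), size=N)    # each row sums to 1

def sample_hmm(pi, A, B, T):
    """Draw a state sequence z_1..z_T and an observation sequence y_1..y_T."""
    z = np.empty(T, dtype=int)
    y = np.empty(T, dtype=int)
    z[0] = rng.choice(len(pi), p=pi)
    y[0] = rng.choice(B.shape[1], p=B[z[0]])
    for t in range(1, T):
        z[t] = rng.choice(A.shape[1], p=A[z[t - 1]])
        y[t] = rng.choice(B.shape[1], p=B[z[t]])
    return z, y

z, y = sample_hmm(pi, A, B, T=10)
```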

SLIDE 7

Inference problems in sequential data

• Decoding: argmax_{z_1, …, z_T} P(z_1, …, z_T | y_1, …, y_T)
• Evaluation
  • Filtering: P(z_t | y_1, …, y_t)
  • Smoothing: P(z_{t'} | y_1, …, y_t) for t' < t
  • Prediction: P(z_{t'} | y_1, …, y_t) for t' > t

SLIDE 8

Some questions

• Inference
  • P(z_t | y_1, …, y_t) = ?
  • P(y_1, …, y_T) = ?
  • P(z_t | y_1, …, y_T) = ?
• Learning: how do we adjust the HMM parameters?
  • Complete data: each training sample includes a state sequence and the corresponding observation sequence
  • Incomplete data: each training sample includes only an observation sequence
SLIDE 9

Forward algorithm

α_t(j) = P(y_1, …, y_t, Z_t = j)

α_t(j) = Σ_k α_{t-1}(k) P(Z_t = j | Z_{t-1} = k) P(y_t | Z_t = j)

• Initialization:
  α_1(j) = P(y_1, Z_1 = j) = P(y_1 | Z_1 = j) P(Z_1 = j)
• Iterations: t = 2 to T
  α_t(j) = Σ_k α_{t-1}(k) P(Z_t = j | Z_{t-1} = k) P(y_t | Z_t = j),   j, k = 1, …, N

[Figure: the chain Z_1 → Z_2 → … → Z_T with observations Y_1, …, Y_T and the forward messages α_1(·), α_2(·), …, α_{T-1}(·), α_T(·) passed left to right.]

α_t(j) = m_{t-1→t}(j)
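A minimal NumPy sketch of this forward pass (illustrative function and argument names; a practical implementation would rescale α at each step or work in log space to avoid underflow):

```python
import numpy as np

def forward(pi, A, B, y):
    """alpha[t, j] = P(y_1..y_t, Z_t = j) for a discrete-observation HMM.

    pi: (N,) initial distribution, A: (N, N) transitions,
    B: (N, L) emissions, y: (T,) observation indices.
    """
    T, N = len(y), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, y[0]]                         # initialization
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, y[t]]     # recursion
    return alpha

# The likelihood P(y_1, ..., y_T) is the sum of the last row:
# likelihood = forward(pi, A, B, y)[-1].sum()
```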

SLIDE 10

Backward algorithm

β_t(j) = m_{t→t-1}(j) = P(y_{t+1}, …, y_T | Z_t = j)

β_{t-1}(j) = Σ_k β_t(k) P(Z_t = k | Z_{t-1} = j) P(y_t | Z_t = k)

• Initialization:
  β_T(j) = 1
• Iterations: t = T down to 2
  β_{t-1}(j) = Σ_k β_t(k) P(Z_t = k | Z_{t-1} = j) P(y_t | Z_t = k),   j, k ∈ states

[Figure: the chain Z_1 → Z_2 → … → Z_T with observations Y_1, …, Y_T and the backward messages β_T(·) = 1, β_{T-1}(·), …, β_2(·), β_1(·) passed right to left.]

β_t(j) = m_{t→t-1}(j)
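The corresponding backward pass, in the same illustrative NumPy style (again without the rescaling a practical implementation needs):

```python
import numpy as np

def backward(A, B, y):
    """beta[t, j] = P(y_{t+1}..y_T | Z_t = j) for a discrete-observation HMM."""
    T, N = len(y), A.shape[0]
    beta = np.zeros((T, N))
    beta[-1] = 1.0                                   # initialization: beta_T(j) = 1
    for t in range(T - 1, 0, -1):
        # beta_{t-1}(j) = sum_k A[j, k] * B[k, y_t] * beta_t(k)
        beta[t - 1] = A @ (B[:, y[t]] * beta[t])
    return beta
```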

SLIDE 11

Forward-backward algorithm

Forward recursion:
  α_t(j) = Σ_k α_{t-1}(k) P(Z_t = j | Z_{t-1} = k) P(y_t | Z_t = j),   α_1(j) = P(y_1, Z_1 = j) = P(y_1 | Z_1 = j) P(Z_1 = j)

Backward recursion:
  β_{t-1}(j) = Σ_k β_t(k) P(Z_t = k | Z_{t-1} = j) P(y_t | Z_t = k),   β_T(j) = 1

P(y_1, y_2, …, y_T) = Σ_j α_T(j) β_T(j) = Σ_j α_T(j)

P(Z_t = j | y_1, y_2, …, y_T) = α_t(j) β_t(j) / Σ_j α_T(j)

where
  α_t(j) ≡ P(y_1, y_2, …, y_t, Z_t = j)
  β_t(j) ≡ P(y_{t+1}, y_{t+2}, …, y_T | Z_t = j)

SLIDE 12

Forward-backward algorithm

• This posterior is also used in the E-step of the EM algorithm for training an HMM:

  P(Z_t = j | y_1, …, y_T) = P(y_1, …, y_T, Z_t = j) / P(y_1, …, y_T) = α_t(j) β_t(j) / Σ_{k=1}^{N} α_T(k)

[Figure: the chain Z_1 → Z_2 → … → Z_T with observations Y_1, …, Y_T, annotated with the forward messages α_1(·), …, α_T(·) and the backward messages β_T(·) = 1, …, β_1(·).]

SLIDE 13

Decoding Problem

• Choose the state sequence that is most probable given the observations:
  argmax_{z_1, …, z_T} P(z_1, …, z_T | y_1, …, y_T)
• Viterbi algorithm:
  • Define the auxiliary variable δ:
    δ_t(j) = max_{z_1, …, z_{t-1}} P(z_1, z_2, …, z_{t-1}, Z_t = j, y_1, y_2, …, y_t)
    δ_t(j): probability of the most probable path ending in state Z_t = j
  • Recursive relation:
    δ_t(j) = max_{k=1, …, N} δ_{t-1}(k) P(Z_t = j | Z_{t-1} = k) P(y_t | Z_t = j)

SLIDE 14

Decoding Problem: Viterbi algorithm

• Initialization:
  δ_1(j) = P(y_1 | Z_1 = j) P(Z_1 = j),   ψ_1(j) = 0,   j = 1, …, N
• Iterations: t = 2, …, T
  δ_t(j) = max_k δ_{t-1}(k) P(Z_t = j | Z_{t-1} = k) P(y_t | Z_t = j)
  ψ_t(j) = argmax_k δ_{t-1}(k) P(Z_t = j | Z_{t-1} = k),   j = 1, …, N
• Final computation:
  P* = max_{k=1, …, N} δ_T(k)
  z*_T = argmax_{k=1, …, N} δ_T(k)
• Traceback of the state sequence: t = T-1 down to 1
  z*_t = ψ_{t+1}(z*_{t+1})
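A compact NumPy sketch of Viterbi decoding for the discrete HMM above (illustrative names; log probabilities would be used in practice to avoid underflow):

```python
import numpy as np

def viterbi(pi, A, B, y):
    """Return (most probable state path, its probability) for a discrete HMM."""
    T, N = len(y), len(pi)
    delta = np.zeros((T, N))              # delta[t, j]: best-path probability ending in j at t
    psi = np.zeros((T, N), dtype=int)     # backpointers
    delta[0] = pi * B[:, y[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A       # scores[k, j] = delta_{t-1}(k) * A[k, j]
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, y[t]]
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):               # traceback
        path[t] = psi[t + 1, path[t + 1]]
    return path, delta[-1].max()
```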

SLIDE 15

Max-product algorithm

• The Viterbi recursion is an instance of the max-product algorithm on the HMM chain. The max-product message from node k to node j is

  m^max_{kj}(x_j) = max_{x_k} ψ(x_k) ψ(x_j, x_k) Π_{l ∈ N(k)\j} m^max_{lk}(x_k)

and on the chain

  δ_t(j) = m^max_{t-1,t}(j) × ψ(x_j)

SLIDE 16

HMM Learning

• Supervised learning: when we have a set of data samples, each of them containing a pair of sequences (one is the observation sequence and the other is the state sequence)
• Unsupervised learning: when we have a set of data samples, each of them containing a sequence of observations

SLIDE 17

HMM supervised learning by MLE

• Initial state probability:
  π_j = P(Z_1 = j),   1 ≤ j ≤ N
• State transition probability:
  A_{kj} = P(Z_{t+1} = j | Z_t = k),   1 ≤ j, k ≤ N
• Emission (observation) probability:
  B_{jl} = P(Y_t = l | Z_t = j),   1 ≤ l ≤ L

[Figure: the HMM chain Z_1 → Z_2 → … → Z_T with observations Y_1, …, Y_T, annotated with the parameters π (initial), A (transitions) and B (discrete emissions).]

Discrete observations
SLIDE 18

HMM: supervised parameter learning by MLE

Complete-data likelihood:

  P(D | θ) = Π_{n=1}^{N} [ P(z_1^(n) | π) Π_{t=2}^{T} P(z_t^(n) | z_{t-1}^(n), A) Π_{t=1}^{T} P(y_t^(n) | z_t^(n), B) ]

Maximum-likelihood estimates (I(·) is the indicator function):

  A_{jk} = Σ_{n=1}^{N} Σ_{t=2}^{T} I(z_{t-1}^(n) = j, z_t^(n) = k) / Σ_{n=1}^{N} Σ_{t=2}^{T} I(z_{t-1}^(n) = j)

  π_j = Σ_{n=1}^{N} I(z_1^(n) = j) / N

  B_{jl} = Σ_{n=1}^{N} Σ_{t=1}^{T} I(z_t^(n) = j, y_t^(n) = l) / Σ_{n=1}^{N} Σ_{t=1}^{T} I(z_t^(n) = j)

Discrete observations; N is the number of training sequences.
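These counting estimates are straightforward to code; a hedged sketch (illustrative function name, no smoothing for unseen events):

```python
import numpy as np

def hmm_supervised_mle(state_seqs, obs_seqs, n_states, n_symbols):
    """MLE of (pi, A, B) for a discrete HMM from fully labeled sequences.

    state_seqs, obs_seqs: lists of equal-length integer sequences z^(n), y^(n).
    """
    pi = np.zeros(n_states)
    A = np.zeros((n_states, n_states))
    B = np.zeros((n_states, n_symbols))
    for z, y in zip(state_seqs, obs_seqs):
        pi[z[0]] += 1
        for t in range(1, len(z)):
            A[z[t - 1], z[t]] += 1            # transition counts
        for t in range(len(z)):
            B[z[t], y[t]] += 1                # emission counts
    return pi / pi.sum(), A / A.sum(axis=1, keepdims=True), B / B.sum(axis=1, keepdims=True)
```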
SLIDE 19

Learning

• Problem: how do we construct an HMM given only observations?
  • Find θ = (A, B, π) maximizing P(y_1, …, y_T | θ)
• Incomplete data
• EM algorithm

SLIDE 20

HMM learning by EM (Baum-Welch)

• θ^old = (π^old, A^old, B^old)
• E-step:
  γ_{n,t}(j) = P(Z_t^(n) = j | y^(n); θ^old)
  ξ_{n,t}(j,k) = P(Z_{t-1}^(n) = j, Z_t^(n) = k | y^(n); θ^old)
• M-step:
  π_j^new = Σ_{n=1}^{N} γ_{n,1}(j) / N

  A_{j,k}^new = Σ_{n=1}^{N} Σ_{t=2}^{T} ξ_{n,t}(j,k) / Σ_{n=1}^{N} Σ_{t=2}^{T} Σ_{k'} ξ_{n,t}(j,k')
             = Σ_{n=1}^{N} Σ_{t=2}^{T} ξ_{n,t}(j,k) / Σ_{n=1}^{N} Σ_{t=1}^{T-1} γ_{n,t}(j)

  B_{j,l}^new = Σ_{n=1}^{N} Σ_{t=1}^{T} γ_{n,t}(j) I(y_t^(n) = l) / Σ_{n=1}^{N} Σ_{t=1}^{T} γ_{n,t}(j)

  j, k = 1, …, number of states

Baum-Welch algorithm (Baum, 1972). Discrete observations.
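One full EM (Baum-Welch) update for a single observation sequence, sketched in NumPy under the same assumptions as before (discrete emissions, no rescaling; the names are illustrative):

```python
import numpy as np

def baum_welch_step(pi, A, B, y):
    """One EM update (E-step + M-step) for a discrete HMM on one sequence y."""
    y = np.asarray(y)
    T, N = len(y), len(pi)
    # E-step: forward-backward
    alpha = np.zeros((T, N)); beta = np.ones((T, N))
    alpha[0] = pi * B[:, y[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, y[t]]
    for t in range(T - 1, 0, -1):
        beta[t - 1] = A @ (B[:, y[t]] * beta[t])
    lik = alpha[-1].sum()
    gamma = alpha * beta / lik                               # gamma[t, j] = P(Z_t=j | y)
    xi = np.zeros((T - 1, N, N))                             # xi[t, j, k] = P(Z_t=j, Z_{t+1}=k | y)
    for t in range(T - 1):
        xi[t] = alpha[t][:, None] * A * B[:, y[t + 1]] * beta[t + 1] / lik
    # M-step: re-estimate parameters from expected counts
    pi_new = gamma[0]
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    B_new = np.zeros_like(B)
    for l in range(B.shape[1]):
        B_new[:, l] = gamma[y == l].sum(axis=0)
    B_new /= gamma.sum(axis=0)[:, None]
    return pi_new, A_new, B_new
```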

SLIDE 21

HMM learning by EM (Baum-Welch)

• θ^old = (π^old, A^old, {μ_j^old, Σ_j^old})
• E-step:
  γ_{n,t}(j) = P(Z_t^(n) = j | y^(n); θ^old)
  ξ_{n,t}(j,k) = P(Z_{t-1}^(n) = j, Z_t^(n) = k | y^(n); θ^old)
• M-step:
  π_j^new = Σ_{n=1}^{N} γ_{n,1}(j) / N

  A_{j,k}^new = Σ_{n=1}^{N} Σ_{t=2}^{T} ξ_{n,t}(j,k) / Σ_{n=1}^{N} Σ_{t=1}^{T-1} γ_{n,t}(j)

  μ_j^new = Σ_{n=1}^{N} Σ_{t=1}^{T} γ_{n,t}(j) y_t^(n) / Σ_{n=1}^{N} Σ_{t=1}^{T} γ_{n,t}(j)

  Σ_j^new = Σ_{n=1}^{N} Σ_{t=1}^{T} γ_{n,t}(j) (y_t^(n) − μ_j^new)(y_t^(n) − μ_j^new)^T / Σ_{n=1}^{N} Σ_{t=1}^{T} γ_{n,t}(j)

  j, k = 1, …, number of states

Assumption: Gaussian emission probabilities. Baum-Welch algorithm (Baum, 1972).
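For the Gaussian-emission case, the M-step for the emission parameters is just a posterior-weighted mean and covariance; a small illustrative sketch for a single sequence:

```python
import numpy as np

def gaussian_emission_mstep(gamma, Y):
    """Per-state means and covariances from state posteriors.

    gamma: (T, N) with gamma[t, j] = P(Z_t = j | Y);  Y: (T, D) observations.
    """
    weights = gamma.sum(axis=0)                      # expected time spent in each state
    mu = gamma.T @ Y / weights[:, None]              # weighted means, shape (N, D)
    sigma = np.empty((gamma.shape[1], Y.shape[1], Y.shape[1]))
    for j in range(gamma.shape[1]):
        diff = Y - mu[j]
        sigma[j] = (gamma[:, j, None] * diff).T @ diff / weights[j]
    return mu, sigma
```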

SLIDE 22

Forward-backward algorithm for E-step

P(z_{t-1}, z_t | y_1, …, y_T) = P(y_1, …, y_T | z_{t-1}, z_t) P(z_{t-1}, z_t) / P(y_1, …, y_T)
  = P(y_1, …, y_{t-1} | z_{t-1}) P(y_t | z_t) P(y_{t+1}, …, y_T | z_t) P(z_t | z_{t-1}) P(z_{t-1}) / P(y_1, …, y_T)
  = α_{t-1}(z_{t-1}) P(y_t | z_t) P(z_t | z_{t-1}) β_t(z_t) / Σ_{j=1}^{N} α_T(j)

The factorization follows from the independencies in the HMM structure:
P(y_1, …, y_T | z_{t-1}, z_t) = P(y_1, …, y_{t-1} | y_t, …, y_T, z_{t-1}, z_t) × P(y_t | y_{t+1}, …, y_T, z_{t-1}, z_t) × P(y_{t+1}, …, y_T | z_{t-1}, z_t)
  = P(y_1, …, y_{t-1} | z_{t-1}) × P(y_t | z_t) × P(y_{t+1}, …, y_T | z_t)

SLIDE 23

HMM shortcomings

• In modeling the joint distribution P(Z, Y), the HMM ignores many dependencies between the observations Y_1, …, Y_T (like most generative models, it has to simplify the structure)
• In the sequence labeling task, we need to classify an observation sequence using the conditional probability P(Z | Y)
  • However, the HMM learns a joint distribution P(Z, Y) while using only P(Z | Y) for labeling

[Figure: the HMM chain Z_1 → Z_2 → … → Z_T with observations Y_1, …, Y_T.]

SLIDE 24

Maximum Entropy Markov Model (MEMM)

P(z_{1:T} | y_{1:T}) = P(z_1 | y_{1:T}) Π_{t=2}^{T} P(z_t | z_{t-1}, y_{1:T})

P(z_t | z_{t-1}, y_{1:T}) = exp( w^T f(z_t, z_{t-1}, y_{1:T}) ) / Σ_{z_t} exp( w^T f(z_t, z_{t-1}, y_{1:T}) )

• Discriminative model
  • Only models P(Z | Y) and completely ignores modeling P(Y)
  • Maximizes the conditional likelihood P(D_Z | D_Y, θ)

[Figure: the HMM (arrows from each Z_t to its observation Y_t) contrasted with the MEMM (the whole observation sequence Y_{1:T} feeds into each state Z_t, and Z_{t-1} → Z_t).]

SLIDE 25

Feature function

• The feature function f(z_t, z_{t-1}, y_{1:T}) can take into account relations in both the data and the label space
• However, the features are often indicator functions showing the absence or presence of a property
• The weight w_j captures how closely f_j(z_t, z_{t-1}, y_{1:T}) is related to the label

SLIDE 26

MEMM disadvantages

• A later observation in the sequence has no effect on the posterior probability of the current state
  • The model does not allow for any smoothing
  • It is incapable of going back and changing its prediction about earlier observations
• The label bias problem
  • There are cases in which a given observation is not useful for predicting the next state of the model
  • In the extreme case, if a state has a unique outgoing transition, the given observation is useless

[Figure: the MEMM chain Z_1 → Z_2 → … → Z_T with observations Y_1, …, Y_T feeding into the states.]

SLIDE 27

Label bias problem in MEMM

• Label bias problem: states with fewer outgoing arcs are preferred
  • States with lower-entropy transition distributions are preferred over others during decoding
  • MEMMs should probably be avoided in cases where many transitions are close to deterministic
  • Extreme case: when there is only one outgoing arc, it does not matter what the observation is
• The source of the problem: the probabilities of outgoing arcs are normalized separately for each state
  • The transition probabilities out of any state have to sum to 1
• Solution: do not normalize probabilities locally

  Z(z_{t-1}, y_{1:T}, w) = Σ_{z_t} exp( w^T f(z_t, z_{t-1}, y_{1:T}) )
  P(z_t | z_{t-1}, y_{1:T}) = exp( w^T f(z_t, z_{t-1}, y_{1:T}) ) / Z(z_{t-1}, y_{1:T}, w)

SLIDE 28

From MEMM to CRF

• From local probabilities to local potentials:

  P(Z | Y) = 1/Z(Y) Π_{t=1}^{T} ψ(z_{t-1}, z_t, Y) = 1/Z(Y, w) Π_{t=1}^{T} exp( w^T f(z_t, z_{t-1}, y) )

• CRF is a discriminative model (like the MEMM)
  • It can model dependence between each state and the entire observation sequence
  • It uses a global normalizer Z(Y, w) that overcomes the label bias problem of the MEMM
• MEMMs use an exponential model for each state, while CRFs have a single exponential model for the joint probability of the entire label sequence

[Figure: the linear-chain CRF with states Z_1, Z_2, …, Z_T conditioned on the whole observation sequence Y_{1:T}.]

Y ≡ y_{1:T},  Z ≡ z_{1:T}

SLIDE 29

CRF: conditional distribution

P(z | y) = 1/Z(y, λ) exp( Σ_{t=1}^{T} λ^T f(z_t, z_{t-1}, y) ) = 1/Z(y, λ) exp( Σ_{t=1}^{T} Σ_l λ_l f_l(z_t, z_{t-1}, y) )

Z(y, λ) = Σ_z exp( Σ_{t=1}^{T} λ^T f(z_t, z_{t-1}, y) )

SLIDE 30

CRF: MAP inference

• Given the CRF parameters λ, find the z* that maximizes P(z | y):

  z* = argmax_z exp( Σ_{t=1}^{T} λ^T f(z_t, z_{t-1}, y) )

  • Z(y) is not a function of z and so can be ignored
• The max-product algorithm can be used for this MAP inference problem
  • Same as the Viterbi decoding used in HMMs
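Since the chain structure is the same, MAP inference is the Viterbi recursion with sums of log-potentials in place of products of probabilities; a sketch reusing the score-table layout assumed above:

```python
import numpy as np

def crf_viterbi(init_scores, trans_scores):
    """argmax_z [ init score + sum of transition scores ] for a linear-chain CRF."""
    T, K = trans_scores.shape[0] + 1, init_scores.shape[0]
    delta = np.zeros((T, K))
    psi = np.zeros((T, K), dtype=int)
    delta[0] = init_scores
    for t in range(1, T):
        cand = delta[t - 1][:, None] + trans_scores[t - 1]   # cand[i, j]
        psi[t] = cand.argmax(axis=0)
        delta[t] = cand.max(axis=0)
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]
    return path
```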

SLIDE 31

CRF: inference

Exact inference is tractable for 1-D chain CRFs, with potentials ψ(z_t, z_{t-1}) = exp( λ^T f(z_t, z_{t-1}, y) )

[Figure: the linear-chain CRF over Z_1, …, Z_T (conditioned on Y_{1:T}) and its clique tree, with cliques (Z_1, Z_2), (Z_2, Z_3), …, (Z_{T-1}, Z_T) and separators Z_2, …, Z_{T-1}.]

SLIDE 32

CRF: learning

λ* = argmax_λ Π_{n=1}^{N} P(z^(n) | y^(n), λ)

Π_{n=1}^{N} P(z^(n) | y^(n), λ) = Π_{n=1}^{N} 1/Z(y^(n), λ) exp( Σ_{t=1}^{T} λ^T f(z_t^(n), z_{t-1}^(n), y^(n)) )
  = exp( Σ_{n=1}^{N} [ Σ_{t=1}^{T} λ^T f(z_t^(n), z_{t-1}^(n), y^(n)) − ln Z(y^(n), λ) ] )

ℓ(λ) = ln Π_{n=1}^{N} P(z^(n) | y^(n), λ)

∇_λ ℓ(λ) = Σ_{n=1}^{N} [ Σ_{t=1}^{T} f(z_t^(n), z_{t-1}^(n), y^(n)) − ∇_λ ln Z(y^(n), λ) ]

∇_λ ln Z(y^(n), λ) = Σ_z P(z | y^(n), λ) Σ_{t=1}^{T} f(z_t, z_{t-1}, y^(n))

Maximum conditional likelihood: θ = argmax_θ P(D_Z | D_Y, θ).
The gradient of the log-partition function of an exponential family is the expectation of the sufficient statistics.

SLIDE 33

CRF: learning

∇_λ ln Z(y^(n), λ) = Σ_z P(z | y^(n), λ) Σ_{t=1}^{T} f(z_t, z_{t-1}, y^(n))
  = Σ_{t=1}^{T} Σ_z P(z | y^(n), λ) f(z_t, z_{t-1}, y^(n))
  = Σ_{t=1}^{T} Σ_{z_t, z_{t-1}} P(z_t, z_{t-1} | y^(n), λ) f(z_t, z_{t-1}, y^(n))

How do we find these expectations? P(z_t, z_{t-1} | y^(n), λ) must be computed for all t = 2, …, T.
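These pairwise marginals come from a forward-backward pass over the same score tables used for the partition function. A log-space sketch (same assumed layout as above):

```python
import numpy as np
from scipy.special import logsumexp

def crf_pairwise_marginals(init_scores, trans_scores):
    """marginals[t, i, j] = P(label i at position t, label j at t+1 | y, lambda)."""
    T, K = trans_scores.shape[0] + 1, init_scores.shape[0]
    log_alpha = np.zeros((T, K))
    log_beta = np.zeros((T, K))
    log_alpha[0] = init_scores
    for t in range(1, T):
        log_alpha[t] = logsumexp(log_alpha[t - 1][:, None] + trans_scores[t - 1], axis=0)
    for t in range(T - 2, -1, -1):
        log_beta[t] = logsumexp(trans_scores[t] + log_beta[t + 1][None, :], axis=1)
    log_Z = logsumexp(log_alpha[-1])
    marginals = np.empty((T - 1, K, K))
    for t in range(T - 1):
        marginals[t] = np.exp(log_alpha[t][:, None] + trans_scores[t]
                              + log_beta[t + 1][None, :] - log_Z)
    return marginals
```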

SLIDE 34

CRF learning: inference to find P(z_t, z_{t-1} | y^(n), λ)

• Junction tree algorithm
  • Initialization of the clique potentials:
    ψ(z_t, z_{t-1}) = exp( λ^T f(z_t, z_{t-1}, y^(n)) ),   separator potentials φ(z_t) = 1
  • After calibration (completion of message passing), the calibrated clique potentials ψ*(z_t, z_{t-1}) give

    P(z_t, z_{t-1} | y^(n), λ) = ψ*(z_t, z_{t-1}) / Σ_{z_t, z_{t-1}} ψ*(z_t, z_{t-1})

[Figure: the clique tree with cliques (Z_1, Z_2), (Z_2, Z_3), …, (Z_{T-1}, Z_T) and separators Z_2, …, Z_{T-1}.]

SLIDE 35

CRF learning: gradient descent

λ ← λ + η ∇_λ ℓ(λ)

• In each gradient step, for each data sample, we use inference to find P(z_t, z_{t-1} | y^(n), λ), which is required for computing the feature expectations:

  ∇_λ ln Z(y^(n), λ) = E_{P(z | y^(n), λ)} [ Σ_{t=1}^{T} f(z_t, z_{t-1}, y^(n)) ] = Σ_{t=1}^{T} Σ_{z_t, z_{t-1}} P(z_t, z_{t-1} | y^(n), λ) f(z_t, z_{t-1}, y^(n))

  ∇_λ ℓ(λ) = Σ_{n=1}^{N} [ Σ_{t=1}^{T} f(z_t^(n), z_{t-1}^(n), y^(n)) − ∇_λ ln Z(y^(n), λ) ]

SLIDE 36

Summary

• Discriminative vs. generative
  • In cases where we have many correlated features, discriminative models (MEMM and CRF) are often better
    • They avoid the challenge of explicitly modeling the distribution over the features
  • But if only limited training data are available, the stronger bias of the generative model may dominate and generative models may be preferred
• Learning
  • HMMs and MEMMs are much more easily learned
  • CRFs require an iterative gradient-based approach, which is considerably more expensive
    • Inference must be run separately for every training sequence in each step
• MEMM vs. CRF (label bias problem of MEMM)
  • In many cases, CRFs are likely to be a safer choice (particularly when many transitions are close to deterministic), but the computational cost may be prohibitive for large data sets

SLIDE 37

References

• M. I. Jordan, "An Introduction to Probabilistic Graphical Models", Chapter 12.
• D. Koller and N. Friedman, "Probabilistic Graphical Models: Principles and Techniques", Section 20.3.2.