Slide 1

Lecture 12: EM Algorithm

Kai-Wei Chang, CS @ University of Virginia
kw@kwchang.net
Course webpage: http://kwchang.net/teaching/NLP16

Slide 2

Three basic problems for HMMs

• Likelihood of the input: forward algorithm
  (How likely is it that the sentence "I love cat" occurs?)
• Decoding (tagging) the input: Viterbi algorithm
  (What are the POS tags of "I love cat"?)
• Estimation (learning): find the best model parameters
  (How to learn the model?)
  • Case 1: supervised – tags are annotated → maximum likelihood estimation (MLE)
  • Case 2: unsupervised – only unannotated text → forward-backward algorithm

Slide 3

EM algorithm

• POS induction – can we tag POS without annotated data?
• An old idea
  • Good mathematical intuition
  • Tutorial paper: ftp://ftp.icsi.berkeley.edu/pub/techreports/1997/tr-97-021.pdf
  • http://people.csail.mit.edu/regina/6864/em_notes_mike.pdf

Slide 4

Hard EM (Intuition)

• We don't know the hidden states (i.e., POS tags)
• If we know the model…

Slide 5

Recap: Learning from Labeled Data

• If we know the hidden states (labels),
• we count how often we see $u_{i-1} u_i$ and $w_i u_i$, then normalize.

Tags:         C C C H H H H H C H H H
Observations: 1 2 2 2 3 2 2 3 1 2 3 2
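A minimal sketch of this count-and-normalize step on the ice-cream data above (plain Python; the variable names are mine, not the lecture's):

```python
from collections import Counter

# Labeled data from the slide: weather tags (C/H) and ice creams eaten.
tags = "C C C H H H H H C H H H".split()
obs  = [1, 2, 2, 2, 3, 2, 2, 3, 1, 2, 3, 2]

tag_c   = Counter(tags)                 # count(u)
emit_c  = Counter(zip(tags, obs))       # count(u_i, w_i)
trans_c = Counter(zip(tags, tags[1:]))  # count(u_{i-1}, u_i)
prev_c  = Counter(tags[:-1])            # count(u) over positions that have a successor

# MLE = normalize the counts.
P_emit  = {(u, w): c / tag_c[u] for (u, w), c in emit_c.items()}
P_trans = {(u, v): c / prev_c[u] for (u, v), c in trans_c.items()}

print(P_emit[("C", 1)], P_emit[("H", 2)], P_emit[("H", 3)])  # 0.5 0.625 0.375
```

These are exactly the emission values that appear in the table on slide 12 below.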

Slide 6

Recap: Tagging the input

• If we know the model, we can find the best tag sequence

Slide 7

Hard EM (Intuition)

• We don't know the hidden states (i.e., POS tags)
  1. Let's guess!
  2. Then, we have labels; we can estimate the model.
  3. Check if the model is consistent with the labels we guessed; if not, go back to Step 1.

Slide 8

         P(…|C)   P(…|H)   P(…|Start)
P(1|…)   ?
P(2|…)   ?        ?
P(3|…)            ?
P(C|…)   0.8      0.2      0.5
P(H|…)   0.2      0.8      0.5

Tags:         ? ? ? ? ? ? ? ? ? ? ? ?
Observations: 1 2 2 2 3 2 2 3 1 2 3 2

Let's make a guess.

Slide 9

         P(…|C)   P(…|H)   P(…|Start)
P(1|…)   ?
P(2|…)   ?        ?
P(3|…)            ?
P(C|…)   0.8      0.2      0.5
P(H|…)   0.2      0.8      0.5

Tags:         C ? ? ? H ? ? H C ? H ?
Observations: 1 2 2 2 3 2 2 3 1 2 3 2

These are obvious.

Slide 10

         P(…|C)   P(…|H)   P(…|Start)
P(1|…)   ?
P(2|…)   ?        ?
P(3|…)            ?
P(C|…)   0.8      0.2      0.5
P(H|…)   0.2      0.8      0.5

Tags:         C C ? H H H H H C ? H H
Observations: 1 2 2 2 3 2 2 3 1 2 3 2

Guess more.

Slide 11

         P(…|C)   P(…|H)   P(…|Start)
P(1|…)   ?
P(2|…)   ?        ?
P(3|…)            ?
P(C|…)   0.8      0.2      0.5
P(H|…)   0.2      0.8      0.5

Tags:         C C C H H H H H C H H H
Observations: 1 2 2 2 3 2 2 3 1 2 3 2

Guess all of them. Now we can estimate the MLE.

Slide 12

         P(…|C)   P(…|H)   P(…|Start)
P(1|…)   0.5
P(2|…)   0.5      0.625
P(3|…)            0.375
P(C|…)   0.8      0.2      0.5
P(H|…)   0.2      0.8      0.5

Tags:         C C C H H H H H C H H H
Observations: 1 2 2 2 3 2 2 3 1 2 3 2

Is our guess consistent with the model?

Slide 13

         P(…|C)   P(…|H)   P(…|Start)
P(1|…)   0.5
P(2|…)   0.5      0.625
P(3|…)            0.375
P(C|…)   0.8      0.2      0.5
P(H|…)   0.2      0.8      0.5

Tags:         ? ? ? ? ? ? ? ? ? ? ? ?
Observations: 1 2 2 2 3 2 2 3 1 2 3 2

How do we find latent states based on our model? Viterbi! (See the sketch below.)

Slide 14

         P(…|C)   P(…|H)   P(…|Start)
P(1|…)   0.5
P(2|…)   0.5      0.625
P(3|…)            0.375
P(C|…)   0.8      0.2      0.5
P(H|…)   0.2      0.8      0.5

Guess:        C C C H H H H H C H H H
Observations: 1 2 2 2 3 2 2 3 1 2 3 2
From Viterbi: C H H H H H H H C H H H

Something is wrong…

Slide 15

         P(…|C)   P(…|H)   P(…|Start)
P(1|…)   1
P(2|…)            0.7
P(3|…)            0.3
P(C|…)   0.8      0.2      0.5
P(H|…)   0.2      0.8      0.5

Tags:         C H H H H H H H C H H H
Observations: 1 2 2 2 3 2 2 3 1 2 3 2

It's fine. Let's do it again.

Slide 16

         P(…|C)   P(…|H)   P(…|Start)
P(1|…)   1
P(2|…)            0.7
P(3|…)            0.3
P(C|…)   0.8      0.2      0.5
P(H|…)   0.2      0.8      0.5

Tags:         C H H H H H H H C H H H
Observations: 1 2 2 2 3 2 2 3 1 2 3 2
From Viterbi: C H H H H H H H C H H H

This time it is consistent.

Slide 17

Only one solution?

         P(…|C)   P(…|H)   P(…|Start)
P(1|…)   0.22
P(2|…)   0.77
P(3|…)            1
P(C|…)   0.8      0.2      0.5
P(H|…)   0.2      0.8      0.5

Tags:         C C C C H C C H C C H C
Observations: 1 2 2 2 3 2 2 3 1 2 3 2

No! EM is sensitive to initialization.

Slide 18

How about this?

         P(…|C)   P(…|H)   P(…|Start)
P(1|…)   ?
P(2|…)   ?        ?
P(3|…)            ?
P(C|…)   ?        ?        0.5
P(H|…)   ?        ?        0.5

Tags:         ? ? ? ? ? ? ? ? ? ? ? ?
Observations: 1 2 2 2 3 2 2 3 1 2 3 2

Slide 19

Hard EM

• We don't know the hidden states (i.e., POS tags)
  1. Let's guess based on our model!
     • Find the best sequence using the Viterbi algorithm
  2. Then, we have labels; we can estimate the model.
     • Maximum likelihood estimation
  3. Check if the model is consistent with the labels we guessed; if not, go back to Step 1.
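A sketch of the whole hard-EM loop on the running ice-cream example, reusing the `viterbi` function from the slide 13 sketch (so not fully standalone); as on the slides, the start and transition tables stay fixed and only the emissions are re-estimated. `emission_mle` is my name, not the lecture's:

```python
from collections import Counter

# Assumes states, P_start, P_trans, and viterbi() from the earlier sketch.
obs = [1, 2, 2, 2, 3, 2, 2, 3, 1, 2, 3, 2]

def emission_mle(tags, obs):
    # Count how often each state emits each symbol, then normalize (MLE).
    tag_c, emit_c = Counter(tags), Counter(zip(tags, obs))
    return {(s, x): emit_c[(s, x)] / tag_c[s] for s in states for x in set(obs)}

tags = "C C C H H H H H C H H H".split()    # step 1: the guess from slide 11
for _ in range(10):
    P_emit = emission_mle(tags, obs)        # step 2: estimate the model
    new_tags = viterbi(obs, P_emit)         # steps 1/3: re-tag, check consistency
    if new_tags == tags:                    # consistent with the labels -> stop
        break
    tags = new_tags
print(" ".join(tags))   # C H H H H H H H C H H H, the fixed point from slide 16
```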

Slide 20

Soft EM

• We don't know the hidden states (i.e., POS tags)
  1. Let's guess based on our model!
     • Find the best sequence using the Viterbi algorithm
  2. Then, we have labels; we can estimate the model.
     • Maximum likelihood estimation
  3. Check if the model is consistent with the labels we guessed; if not, go back to Step 1.

Let's use expected counts instead!

Slide 21

Expected Counts

         P(…|C)   P(…|H)   P(…|Start)
P(1|…)   0.8
P(2|…)   0.2      0.2
P(3|…)            0.8
P(C|…)   0.8      0.2      0.5
P(H|…)   0.2      0.8      0.5

Tags:         ? ? ?
Observations: 1 2 2

Slide 22

Expected Counts

         P(…|C)   P(…|H)   P(…|Start)
P(1|…)   0.8
P(2|…)   0.2      0.2
P(3|…)            0.8
P(C|…)   0.8      0.2      0.5
P(H|…)   0.2      0.8      0.5

Candidate tag sequences for observations 1 2 2:

C C C    C C H    C H C    C H H
1 2 2    1 2 2    1 2 2    1 2 2

Some sequences are more likely to occur than others.

Slide 23

Expected Counts

         P(…|C)   P(…|H)   P(…|Start)
P(1|…)   0.8
P(2|…)   0.2      0.2
P(3|…)            0.8
P(C|…)   0.8      0.2      0.5
P(H|…)   0.2      0.8      0.5

C C C    C C H    C H C    C H H
1 2 2    1 2 2    1 2 2    1 2 2
0.01024  0.00256  0.00064  0.00256
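These four probabilities can be reproduced by brute-force enumeration, which is feasible only because this toy example has two states and three observations (a sketch; blank table cells again read as 0, so sequences starting with H get probability 0 and are filtered out):

```python
from itertools import product

P_start = {"C": 0.5, "H": 0.5}
P_trans = {("C","C"):0.8, ("C","H"):0.2, ("H","C"):0.2, ("H","H"):0.8}
P_emit  = {("C",1):0.8, ("C",2):0.2, ("C",3):0.0,
           ("H",1):0.0, ("H",2):0.2, ("H",3):0.8}
obs = [1, 2, 2]

for seq in product("CH", repeat=len(obs)):
    p = P_start[seq[0]] * P_emit[(seq[0], obs[0])]     # start * first emission
    for prev, cur, x in zip(seq, seq[1:], obs[1:]):    # transition * emission per step
        p *= P_trans[(prev, cur)] * P_emit[(cur, x)]
    if p > 0:
        print("".join(seq), p)
# CCC 0.01024   CCH 0.00256   CHC 0.00064   CHH 0.00256
```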

Slide 24

Expected Counts

         P(…|C)   P(…|H)   P(…|Start)
P(1|…)   0.8
P(2|…)   0.2      0.2
P(3|…)            0.8
P(C|…)   0.8      0.2      0.5
P(H|…)   0.2      0.8      0.5

C C C    C C H    C H C    C H H
1 2 2    1 2 2    1 2 2    1 2 2
1024     256      64       256

Assume we draw 100,000 random samples…

Slide 25

Expected Counts

         P(…|C)   P(…|H)   P(…|Start)
P(1|…)   0.8
P(2|…)   0.2      0.2
P(3|…)            0.8
P(C|…)   0.8      0.2      0.5
P(H|…)   0.2      0.8      0.5

C C C    C C H    C H C    C H H
1 2 2    1 2 2    1 2 2    1 2 2
1024     256      64       256

Let's update the model.

Slide 26

Expected Counts

         P(…|C)   P(…|H)   P(…|Start)
P(1|…)   0.8
P(2|…)   0.2      0.2
P(3|…)            0.8
P(C|…)   0.8      0.2      0.5
P(H|…)   0.2      0.8      0.5

C C C    C C H    C H C    C H H
1 2 2    1 2 2    1 2 2    1 2 2
1024     256      64       256

Let's update the model. How many C-C? 1024·2 + 256 = 2304

Slide 27

Expected Counts

         P(…|C)   P(…|H)   P(…|Start)
P(1|…)   0.8
P(2|…)   0.2      0.2
P(3|…)            0.8
P(C|…)   0.8      0.2      0.5
P(H|…)   0.2      0.8      0.5

C C C    C C H    C H C    C H H
1 2 2    1 2 2    1 2 2    1 2 2
1024     256      64       256

How many C-C? 1024·2 + 256 = 2304
How many C?   1024·3 + 256·2 + 64·2 + 256 = 3968
P(C|C)? 2304/3968 ≈ 0.58
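A small sketch of this expected-count arithmetic, using the 100,000-sample counts from slide 24 (sequence strings and counts as above):

```python
# Expected counts as weighted sums over the surviving tag sequences.
counts = {"CCC": 1024, "CCH": 256, "CHC": 64, "CHH": 256}

cc = sum(n * sum(1 for a, b in zip(s, s[1:]) if (a, b) == ("C", "C"))
         for s, n in counts.items())                    # expected C-C transitions
c  = sum(n * s.count("C") for s, n in counts.items())   # expected C occurrences
print(cc, c, round(cc / c, 2))                          # 2304 3968 0.58
```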

Slide 28

Expected Counts

         P(…|C)      P(…|H)   P(…|Start)
P(1|…)   0.8
P(2|…)   0.2         0.2
P(3|…)               0.8
P(C|…)   0.8 → 0.58  0.2      0.5
P(H|…)   0.2         0.8      0.5

C C C    C C H    C H C    C H H
1 2 2    1 2 2    1 2 2    1 2 2
1024     256      64       256

P(C|C)? 2304/3968 ≈ 0.58

Do this for all the other entries!

Slide 29

Are we done yet?

• What if we have 45 tags…?
• What if our sentences have 20 tokens…? (That is 45^20 possible tag sequences to enumerate.)
• We need an efficient algorithm again!

Slide 30

Expected Counts

         P(…|C)   P(…|H)   P(…|Start)
P(1|…)   0.8
P(2|…)   0.2      0.2
P(3|…)            0.8
P(C|…)   0.8      0.2      0.5
P(H|…)   0.2      0.8      0.5

C C C    C C H    C H C    C H H
1 2 2    1 2 2    1 2 2    1 2 2
1024     256      64       256

P(C|C)? 2304/3968 ≈ 0.58
P(1|C)? (1024 + 256 + 256 + 64)/3968 ≈ 0.403

Slide 31

Expected Counts

         P(…|C)   P(…|H)   P(…|Start)
P(1|…)   0.8
P(2|…)   0.2      0.2
P(3|…)            0.8
P(C|…)   0.8      0.2      0.5
P(H|…)   0.2      0.8      0.5

P(C|C)? 2304/3968 ≈ 0.58
P(1|C)? (1024 + 256 + 256 + 64)/3968 ≈ 0.403

[Trellis over observations 1 2 2, with states C and H at each position; path probabilities 0.01024, 0.00256, 0.00256, 0.00064.]

Slide 32

In general

         P(…|C)   P(…|H)   P(…|Start)
P(1|…)   0.8
P(2|…)   0.2      0.2
P(3|…)            0.8
P(C|…)   0.8      0.2      0.5
P(H|…)   0.2      0.8      0.5

[Trellis over a longer observation sequence … 2 2 2 …, with states C and H at each position.]

Slide 34

In general

         P(…|C)   P(…|H)   P(…|Start)
P(1|…)   0.8
P(2|…)   0.2      0.2
P(3|…)            0.8
P(C|…)   0.8      0.2      0.5
P(H|…)   0.2      0.8      0.5

[Trellis over observations … 2 2 2 …, with position $i = k$ marked.]

Let's say #words = $n$.

$P(x_{1..n},\, t_k = C)$

Slide 35

In general

[Trellis over observations … 2 2 2 …, with position $i = k$ marked.]

$P(x_{1..n},\, t_k = C) = P(x_{1..k},\, t_k = C) \cdot P(x_{k+1..n} \mid t_k = C)$

The first factor is the probability of $x_1 \dots x_k$ with tag $k$ being "C"; the second is the probability of $x_{k+1} \dots x_n$ given that tag $k$ is "C".

Slide 36

In general

[Trellis over observations … 2 2 2 …, with position $i = k$ marked.]

$P(x_{1..n},\, t_k = C) = P(x_{1..k},\, t_k = C) \cdot P(x_{k+1..n} \mid t_k = C)$

$P(x_{1..k},\, t_k = C) = \sum_{u_{1..k-1}} P(x_{1..k},\, u_{1..k-1},\, t_k = C)$
— can be computed by the forward algorithm

$P(x_{k+1..n} \mid t_k = C) = \sum_{u_{k+1..n}} P(x_{k+1..n},\, u_{k+1..n} \mid t_k = C)$
— can be computed by the backward algorithm

Slide 37

Forward algorithm

Induction:
$\beta_k(r) = P(x_k \mid r) \sum_{r'} \beta_{k-1}(r')\, P(r \mid r')$
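A sketch of the forward pass in the lecture's notation (here $\beta$ is the forward probability, which many references call $\alpha$; toy model from the expected-counts slides):

```python
# Forward pass: beta_k(r) = P(x_1..x_k, t_k = r); list index k-1, since Python is 0-based.
states = ["C", "H"]
P_start = {"C": 0.5, "H": 0.5}
P_trans = {("C","C"):0.8, ("C","H"):0.2, ("H","C"):0.2, ("H","H"):0.8}
P_emit  = {("C",1):0.8, ("C",2):0.2, ("C",3):0.0,
           ("H",1):0.0, ("H",2):0.2, ("H",3):0.8}

def forward(obs):
    beta = [{r: P_start[r] * P_emit[(r, obs[0])] for r in states}]   # base case
    for x in obs[1:]:
        prev = beta[-1]
        beta.append({r: P_emit[(r, x)] *
                        sum(prev[rp] * P_trans[(rp, r)] for rp in states)
                     for r in states})
    return beta

beta = forward([1, 2, 2])
print(sum(beta[-1].values()))  # P(x_1..x_n) = 0.01024+0.00256+0.00064+0.00256 = 0.016
```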

Slide 38

Backward algorithm

$P(x_{k+1..n} \mid t_k = C) = \sum_r P(x_{k+2..n} \mid u_{k+1} = r)\, P(r \mid C)\, P(x_{k+1} \mid r)$

$\gamma_k(C) = \sum_r \gamma_{k+1}(r)\, P(r \mid C)\, P(x_{k+1} \mid r)$
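A matching sketch of the backward recursion $\gamma$ (same toy model; note the base case $\gamma_n(r) = 1$, since nothing is left to generate):

```python
# Backward pass: gamma_k(r) = P(x_{k+1}..x_n | t_k = r); list index k-1, 0-based.
states = ["C", "H"]
P_trans = {("C","C"):0.8, ("C","H"):0.2, ("H","C"):0.2, ("H","H"):0.8}
P_emit  = {("C",1):0.8, ("C",2):0.2, ("C",3):0.0,
           ("H",1):0.0, ("H",2):0.2, ("H",3):0.8}

def backward(obs):
    n = len(obs)
    gamma = [dict() for _ in range(n)]
    gamma[n - 1] = {r: 1.0 for r in states}     # base case
    for k in range(n - 2, -1, -1):              # fill right to left
        gamma[k] = {d: sum(gamma[k + 1][r] * P_trans[(d, r)] * P_emit[(r, obs[k + 1])]
                           for r in states)
                    for d in states}
    return gamma

print(backward([1, 2, 2])[0])   # {'C': 0.04, 'H': 0.04}
```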

Slide 39

In general

[Trellis over observations … 2 2 2 …, with position $i = k$ marked.]

$P(x_{1..n},\, t_k = C) = P(x_{1..k},\, t_k = C) \cdot P(x_{k+1..n} \mid t_k = C)$

$P(x_{1..k},\, t_k = C) = \sum_{u_{1..k-1}} P(x_{1..k},\, u_{1..k-1},\, t_k = C)$
— can be computed by the forward algorithm

$P(x_{k+1..n} \mid t_k = C) = \sum_{u_{k+1..n}} P(x_{k+1..n},\, u_{k+1..n} \mid t_k = C)$
— can be computed by the backward algorithm

Slide 40

Emission Counts

[Trellis over observations … 2 2 2 …, with position $i = k$ marked.]

$P(2 \mid C) = \dfrac{\sum_i P(x_i = 2,\, u_i = C,\, x_{1..n})}{\sum_i P(u_i = C,\, x_{1..n})}$

Numerator: expected count of (2, C). Denominator: expected count of C.
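A sketch of this ratio, assuming the `forward` and `backward` functions from the two previous sketches are in scope: the product $\beta_i(C)\,\gamma_i(C)$ is exactly $P(u_i = C,\, x_{1..n})$, so the expected counts are sums of these products.

```python
# Emission expected counts via forward/backward posteriors (forward/backward from above).
obs = [1, 2, 2]
beta, gamma = forward(obs), backward(obs)

# P(u_i = C, x_1..n) = beta_i(C) * gamma_i(C) at each position i
post_C = [beta[i]["C"] * gamma[i]["C"] for i in range(len(obs))]

den = sum(post_C)                                            # expected count of C
print(post_C[0] / den)                                       # P(1|C) ~ 0.403, as on slide 30
print(sum(p for p, x in zip(post_C, obs) if x == 2) / den)   # P(2|C) ~ 0.597
```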

Slide 41

How about the transition counts?

[Trellis with positions $i = k$ and $i = k+1$ marked, states C and H at each.]

$P(x_{1..n},\, t_k = C,\, u_{k+1} = H)$
$\quad = P(x_{1..k},\, t_k = C)\, P(x_{k+2..n} \mid t_{k+1} = H)\, P(H \mid C)\, P(x_{k+1} \mid H)$
$\quad = \beta_k(C)\, \gamma_{k+1}(H)\, P(H \mid C)\, P(x_{k+1} \mid H)$

Slide 42

Three basic problems for HMMs

• Likelihood of the input: forward algorithm
  (How likely is it that the sentence "I love cat" occurs?)
• Decoding (tagging) the input: Viterbi algorithm
  (What are the POS tags of "I love cat"?)
• Estimation (learning): find the best model parameters
  (How to learn the model?)
  • Case 1: supervised – tags are annotated → maximum likelihood estimation (MLE)
  • Case 2: unsupervised – only unannotated text → forward-backward algorithm

Slide 43

Trick: computing everything in log space

• Homework:
  • Write the forward, backward, and Viterbi algorithms in log space
  • Hint: you need a function to compute log(a+b) (see the sketch below)
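For the hint: log(a+b) can be computed from log a and log b with the standard log-sum-exp identity, $\log(a+b) = \log a + \log(1 + e^{\log b - \log a})$. A minimal sketch:

```python
import math

def log_add(log_a, log_b):
    """Return log(a + b) given log(a) and log(b), staying in log space."""
    if log_a < log_b:
        log_a, log_b = log_b, log_a                  # make log_a the larger
    if log_b == float("-inf"):                       # adding a zero probability
        return log_a
    return log_a + math.log1p(math.exp(log_b - log_a))

# Sanity check: log(0.01024 + 0.00256) == log(0.0128)
print(log_add(math.log(0.01024), math.log(0.00256)), math.log(0.0128))
```

Keeping the larger argument outside the exponent prevents underflow when the two probabilities differ by many orders of magnitude.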

Slide 44

Behind the scenes

• What does EM optimize? The log-likelihood of the input!

$\log P(\boldsymbol{x} \mid \boldsymbol{\mu}) = \log \sum_{\boldsymbol{u}} P(\boldsymbol{x}, \boldsymbol{u} \mid \boldsymbol{\mu}) = \log \sum_{\boldsymbol{u}} \prod_{i=1}^{n} P(u_i \mid u_{i-1}, u_{i-2})\, P(x_i \mid u_i)$

This is hard. In contrast, in the supervised situation we optimize $\log P(\boldsymbol{x}, \boldsymbol{u} \mid \boldsymbol{\mu})$:

$\log P(\boldsymbol{x}, \boldsymbol{u} \mid \boldsymbol{\mu}) = \log \prod_{i=1}^{n} P(u_i \mid u_{i-1}, u_{i-2})\, P(x_i \mid u_i) = \sum_i \big( \log P(u_i \mid u_{i-1}, u_{i-2}) + \log P(x_i \mid u_i) \big)$

$\log \sum \prod$ is hard; $\sum \log$ is easy.

Slide 45

Intuition of EM (from the optimization perspective)

$f(\mu) = \log P(x \mid \mu) = \log \sum_u P(x, u \mid \mu)$

[Figure: $f(\mu)$ with lower bounds $h_t$, $h_{t+1}$ and iterates $\mu^{(t)}, \mu^{(t+1)}, \mu^{(t+2)}$.]

Key idea:
1. Define $h_t(\mu)$ such that $f(\mu) \ge h_t(\mu)$ for all $\mu$, and $f(\mu^{(t)}) = h_t(\mu^{(t)})$
2. Optimize $h_t(\mu)$

Slide 46

Intuition of EM (from the optimization perspective)

$f(\mu) = \log P(x \mid \mu) = \log \sum_u P(x, u \mid \mu)$

[Figure: $f(\mu)$ with lower bounds $h_t$, $h_{t+1}$ and iterates $\mu^{(t)}, \mu^{(t+1)}, \mu^{(t+2)}$.]

Key idea:
1. Define $h_t(\mu)$ such that $f(\mu) \ge h_t(\mu)$ for all $\mu$, and $f(\mu^{(t)}) = h_t(\mu^{(t)})$
2. Optimize $h_t(\mu)$

Hard EM and soft EM define different $h_t(\mu)$.

Slide 47

$h_t(\mu)$ for soft EM

$\log \sum_u P(x, u \mid \mu) = \log \sum_u P(u \mid x, \mu^{(t)})\, \dfrac{P(x, u \mid \mu)}{P(u \mid x, \mu^{(t)})} \;\ge\; \sum_u P(u \mid x, \mu^{(t)}) \log \dfrac{P(x, u \mid \mu)}{P(u \mid x, \mu^{(t)})}$

Jensen's inequality: let $\sum_y q(y) = 1$; then $\log \sum_y g(y)\, q(y) \;\ge\; \sum_y q(y) \log g(y)$.
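A quick numeric sanity check of the Jensen step (arbitrary positive $g$ and a random normalized $q$; illustration only):

```python
import math, random

# Arbitrary positive g(y) and a random distribution q(y) with sum(q) = 1.
random.seed(0)
g = [random.uniform(0.1, 5.0) for _ in range(4)]
q = [random.random() for _ in range(4)]
total = sum(q)
q = [v / total for v in q]

lhs = math.log(sum(gi * qi for gi, qi in zip(g, q)))   # log of the average
rhs = sum(qi * math.log(gi) for gi, qi in zip(g, q))   # average of the logs
print(lhs >= rhs)   # True: log is concave
```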

Slide 48

$h_t(\mu^{(t)}) = f(\mu^{(t)})$?

$f(\mu^{(t)}) = \log \sum_u P(x, u \mid \mu^{(t)}) = \log P(x \mid \mu^{(t)})$

$h_t(\mu^{(t)}) = \sum_u P(u \mid x, \mu^{(t)}) \log \dfrac{P(x, u \mid \mu^{(t)})}{P(u \mid x, \mu^{(t)})} = \sum_u P(u \mid x, \mu^{(t)}) \log P(x \mid \mu^{(t)})$
$\quad = \big( \log P(x \mid \mu^{(t)}) \big) \sum_u P(u \mid x, \mu^{(t)}) = \log P(x \mid \mu^{(t)})$

since $P(x, u \mid \mu^{(t)}) / P(u \mid x, \mu^{(t)}) = P(x \mid \mu^{(t)})$ and $\sum_u P(u \mid x, \mu^{(t)}) = 1$.

Slide 49

Intuition of EM (from the optimization perspective)

$f(\mu) = \log P(x \mid \mu) = \log \sum_u P(x, u \mid \mu)$

[Figure: $f(\mu)$ with lower bounds $h_t$, $h_{t+1}$ and iterates $\mu^{(t)}, \mu^{(t+1)}, \mu^{(t+2)}$.]

Key idea:
1. Define $h_t(\mu)$ such that $f(\mu) \ge h_t(\mu)$ for all $\mu$, and $f(\mu^{(t)}) = h_t(\mu^{(t)})$
2. Optimize $h_t(\mu)$

Soft EM defines
$h_t(\mu) = \sum_u P(u \mid x, \mu^{(t)}) \log \dfrac{P(x, u \mid \mu)}{P(u \mid x, \mu^{(t)})}$

Slide 50

Optimizing $h_t(\mu)$

$h_t(\mu) = \sum_u P(u \mid x, \mu^{(t)}) \log \dfrac{P(x, u \mid \mu)}{P(u \mid x, \mu^{(t)})}$
$\quad = \sum_u P(u \mid x, \mu^{(t)}) \big( \log P(x, u \mid \mu) - \log P(u \mid x, \mu^{(t)}) \big)$

The second term doesn't depend on $\mu$, so

$\max_\mu h_t(\mu) \;\equiv\; \max_\mu \sum_u P(u \mid x, \mu^{(t)}) \log P(x, u \mid \mu)$
$\quad = \max_\mu \sum_u P(u \mid x, \mu^{(t)}) \sum_i \big( \log P(u_i \mid u_{i-1}, u_{i-2}) + \log P(x_i \mid u_i) \big)$

We know how to solve this! It is the supervised case, weighted by $P(u \mid x, \mu^{(t)})$:

$\log P(\boldsymbol{x}, \boldsymbol{u} \mid \boldsymbol{\mu}) = \log \prod_{i=1}^{n} P(u_i \mid u_{i-1}, u_{i-2})\, P(x_i \mid u_i) = \sum_i \big( \log P(u_i \mid u_{i-1}, u_{i-2}) + \log P(x_i \mid u_i) \big)$

$\log \sum \prod$ is hard; $\sum \log$ is easy.