Lecture 12: EM Algorithm
Kai-Wei Chang CS @ University of Virginia kw@kwchang.net Course webpage: http://kwchang.net/teaching/NLP16
1 CS6501 Natural Language Processing
Three basic problems for HMMs
- Likelihood of the input (Forward algorithm): how likely is it that the sentence "I love cat" occurs?
- Decoding: what are the POS tags of "I love cat"?
- Learning: how do we learn the model?
States:        C C C H H H H H C H H H
Observations:  1 2 2 2 3 2 2 3 1 2 3 2
          P(…|C)   P(…|H)   P(…|Start)
P(1|…)    ?        0
P(2|…)    ?        ?
P(3|…)    0        ?
P(C|…)    0.8      0.2      0.5
P(H|…)    0.2      0.8      0.5
States:        ? ? ? ? ? ? ? ? ? ? ? ?
Observations:  1 2 2 2 3 2 2 3 1 2 3 2
States:        C ? ? ? H ? ? H C ? H ?
Observations:  1 2 2 2 3 2 2 3 1 2 3 2
States:        C C ? H H H H H C ? H H
Observations:  1 2 2 2 3 2 2 3 1 2 3 2
States:        C C C H H H H H C H H H
Observations:  1 2 2 2 3 2 2 3 1 2 3 2

          P(…|C)   P(…|H)   P(…|Start)
P(1|…)    0.5      0
P(2|…)    0.5      0.625
P(3|…)    0        0.375
P(C|…)    0.8      0.2      0.5
P(H|…)    0.2      0.8      0.5
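The emission entries in this table are relative frequencies read off the tagged sequence. A minimal sketch of that counting step, assuming the states and observations are given as aligned lists (the function name `mle_emissions` is illustrative and not from the lecture):

```python
from collections import Counter

states = "C C C H H H H H C H H H".split()
observations = [1, 2, 2, 2, 3, 2, 2, 3, 1, 2, 3, 2]

def mle_emissions(states, observations):
    """Estimate P(observation | state) by relative frequency."""
    pair_counts = Counter(zip(states, observations))
    state_counts = Counter(states)
    return {
        (obs, state): count / state_counts[state]
        for (state, obs), count in pair_counts.items()
    }

emissions = mle_emissions(states, observations)
for (obs, state), prob in sorted(emissions.items()):
    print(f"P({obs}|{state}) = {prob}")
```

Pairs that never occur in the data, such as observation 3 in state C, are simply absent from the result, which corresponds to a probability of 0.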
States:        ? ? ? ? ? ? ? ? ? ? ? ?
Observations:  1 2 2 2 3 2 2 3 1 2 3 2
States (previous):  C C C H H H H H C H H H
From Viterbi:       C H H H H H H H C H H H
Observations:       1 2 2 2 3 2 2 3 1 2 3 2

          P(…|C)   P(…|H)   P(…|Start)
P(1|…)    1        0
P(2|…)    0        0.7
P(3|…)    0        0.3
P(C|…)    0.8      0.2      0.5
P(H|…)    0.2      0.8      0.5
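One round of this re-tagging and re-estimation can be sketched as follows. This is a minimal illustration, assuming Viterbi is run over the full 12-observation sequence with the emission estimates 0.5, 0.5, 0.625, 0.375 and the fixed transition table; the helper names `viterbi` and `reestimate_emissions` are hypothetical.

```python
from collections import Counter

STATES = ("C", "H")
start = {"C": 0.5, "H": 0.5}
trans = {"C": {"C": 0.8, "H": 0.2}, "H": {"C": 0.2, "H": 0.8}}  # trans[prev][next]
emit = {"C": {1: 0.5, 2: 0.5, 3: 0.0}, "H": {1: 0.0, 2: 0.625, 3: 0.375}}
obs = [1, 2, 2, 2, 3, 2, 2, 3, 1, 2, 3, 2]

def viterbi(obs, start, trans, emit):
    """Return the most probable state sequence for obs."""
    # best[s] = probability of the best path that ends in state s
    best = {s: start[s] * emit[s][obs[0]] for s in STATES}
    backpointers = []
    for o in obs[1:]:
        new_best, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda r: best[r] * trans[r][s])
            new_best[s] = best[prev] * trans[prev][s] * emit[s][o]
            pointers[s] = prev
        best = new_best
        backpointers.append(pointers)
    # Follow the back-pointers from the best final state.
    state = max(STATES, key=lambda s: best[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]

def reestimate_emissions(tags, obs):
    """Relative-frequency emission estimates from the Viterbi tags."""
    pair_counts = Counter(zip(tags, obs))
    tag_counts = Counter(tags)
    return {
        s: {o: pair_counts[(s, o)] / tag_counts[s] for o in (1, 2, 3)}
        for s in tag_counts
    }

tags = viterbi(obs, start, trans, emit)
new_emit = reestimate_emissions(tags, obs)
print(" ".join(tags))
print(new_emit)
```

Plain probabilities are used instead of log probabilities because the sequence is short and some emission entries are exactly 0.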
States:        C H H H H H H H C H H H
Observations:  1 2 2 2 3 2 2 3 1 2 3 2
States:        C C C C H C C H C C H C
Observations:  1 2 2 2 3 2 2 3 1 2 3 2

          P(…|C)   P(…|H)   P(…|Start)
P(1|…)    0.22     0
P(2|…)    0.78     0
P(3|…)    0        1
P(C|…)    0.8      0.2      0.5
P(H|…)    0.2      0.8      0.5
          P(…|C)   P(…|H)   P(…|Start)
P(1|…)    ?        0
P(2|…)    ?        ?
P(3|…)    0        ?
P(C|…)    ?        ?        0.5
P(H|…)    ?        ?        0.5
States:        ? ? ? ? ? ? ? ? ? ? ? ?
Observations:  1 2 2 2 3 2 2 3 1 2 3 2
          P(…|C)   P(…|H)   P(…|Start)
P(1|…)    0.8      0
P(2|…)    0.2      0.2
P(3|…)    0        0.8
P(C|…)    0.8      0.2      0.5
P(H|…)    0.2      0.8      0.5
States:        ? ? ?
Observations:  1 2 2
States    Observations
C C C     1 2 2
C C H     1 2 2
C H C     1 2 2
C H H     1 2 2
States    Observations    Probability
C C C     1 2 2           0.01024
C C H     1 2 2           0.00256
C H C     1 2 2           0.00064
C H H     1 2 2           0.00256
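The path probabilities 0.01024, 0.00256, 0.00064 and 0.00256 can be reproduced by enumerating every state sequence for the observations 1 2 2. The sketch below assumes the parameter table with P(1|C) = 0.8, P(2|C) = 0.2, P(2|H) = 0.2, P(3|H) = 0.8 and zeros elsewhere; the function name `joint_probability` is illustrative.

```python
from itertools import product

start = {"C": 0.5, "H": 0.5}
trans = {"C": {"C": 0.8, "H": 0.2}, "H": {"C": 0.2, "H": 0.8}}  # trans[prev][next]
emit = {"C": {1: 0.8, 2: 0.2, 3: 0.0}, "H": {1: 0.0, 2: 0.2, 3: 0.8}}

def joint_probability(tags, obs):
    """P(observations, tags) under the HMM above."""
    p = start[tags[0]] * emit[tags[0]][obs[0]]
    for prev, cur, o in zip(tags, tags[1:], obs[1:]):
        p *= trans[prev][cur] * emit[cur][o]
    return p

obs = [1, 2, 2]
path_probs = {
    "".join(tags): joint_probability(tags, obs)
    for tags in product("CH", repeat=len(obs))
}
nonzero = {path: p for path, p in path_probs.items() if p > 0}
for path, p in nonzero.items():
    print(path, round(p, 5))
```

Every sequence that starts with H has probability 0 because P(1|H) = 0, which is why only four of the eight sequences remain.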
States    Observations    Weight (probability × 100000)
C C C     1 2 2           1024
C C H     1 2 2           256
C H C     1 2 2           64
C H H     1 2 2           256
Weighted count of C→C transitions: 1024*2 + 256 = 2304
Weighted count of C: 1024*3 + 256*2 + 64*2 + 256 = 3968
P(C|C) = 2304/3968 = 0.581

          P(…|C)        P(…|H)   P(…|Start)
P(1|…)    0.8           0
P(2|…)    0.2           0.2
P(3|…)    0             0.8
P(C|…)    0.8 → 0.58    0.2      0.5
P(H|…)    0.2           0.8      0.5
Weighted count of C emitting 1: 1024 + 256 + 256 + 64 = 1600
P(1|C) = 1600/3968 = 0.403
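These weighted counts can be checked mechanically. The sketch below assumes the four path weights 1024, 256, 64, 256 and, following the computation on the slides, divides by the weighted number of visits to C at every position, including the last one. The helper `expected_count` is a hypothetical name.

```python
from fractions import Fraction

weights = {"CCC": 1024, "CCH": 256, "CHC": 64, "CHH": 256}
observations = [1, 2, 2]

def expected_count(indicator):
    """Sum of weight * (how often the event occurs on the path) over all paths."""
    return sum(w * indicator(path) for path, w in weights.items())

# Number of C -> C transitions on each path, weighted by the path weight.
cc_transitions = expected_count(
    lambda p: sum(a == "C" and b == "C" for a, b in zip(p, p[1:]))
)
# Number of positions tagged C on each path.
c_visits = expected_count(lambda p: p.count("C"))
# Number of positions where C emits observation 1.
c_emits_1 = expected_count(
    lambda p: sum(s == "C" and o == 1 for s, o in zip(p, observations))
)

p_c_given_c = Fraction(cc_transitions, c_visits)
p_1_given_c = Fraction(c_emits_1, c_visits)
print(cc_transitions, c_visits, float(p_c_given_c))
print(c_emits_1, c_visits, float(p_1_given_c))
```

Using `Fraction` keeps the ratios exact, so rounding only happens when the values are printed.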
[Figure: trellis over states C and H for observations 1 2 2, with path probabilities 0.01024, 0.00256, 0.00256, 0.00064]
[Figure: trellis over states C and H for observations 2 2 2]
[Figure: trellis over states C and H for observations 2 2 2 …, with position i = k marked]

\[ P(w_{1..k}, t_k = C) = \sum_{t_{1..k-1}} P(w_{1..k}, t_{1..k-1}, t_k = C) \]
\[ P(w_{k+1..n} \mid t_k = C) = \sum_{t_{k+1..n}} P(w_{k+1..n}, t_{k+1..n} \mid t_k = C) \]
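Both sums range over exponentially many tag sequences, but each can be computed position by position. The following is a minimal sketch of the forward quantity P(w_{1..k}, t_k) and the backward quantity P(w_{k+1..n} | t_k), assuming a bigram HMM with the 0.8/0.2 parameter table used in the path-enumeration example; the function names are illustrative.

```python
STATES = ("C", "H")
start = {"C": 0.5, "H": 0.5}
trans = {"C": {"C": 0.8, "H": 0.2}, "H": {"C": 0.2, "H": 0.8}}  # trans[prev][next]
emit = {"C": {1: 0.8, 2: 0.2, 3: 0.0}, "H": {1: 0.0, 2: 0.2, 3: 0.8}}

def forward(obs):
    """alpha[k][s] = P(w_1..w_k, t_k = s), with k counted from 0 in the list."""
    alpha = [{s: start[s] * emit[s][obs[0]] for s in STATES}]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append(
            {s: emit[s][o] * sum(prev[r] * trans[r][s] for r in STATES) for s in STATES}
        )
    return alpha

def backward(obs):
    """beta[k][s] = P(w_{k+1}..w_n | t_k = s), with k counted from 0 in the list."""
    beta = [{s: 1.0 for s in STATES}]
    for o in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(
            0,
            {s: sum(trans[s][r] * emit[r][o] * nxt[r] for r in STATES) for s in STATES},
        )
    return beta

obs = [1, 2, 2]
alpha, beta = forward(obs), backward(obs)
likelihood = sum(alpha[-1].values())
# Posterior P(t_k = s | w) at every position k.
posteriors = [
    {s: alpha[k][s] * beta[k][s] / likelihood for s in STATES}
    for k in range(len(obs))
]
print(likelihood)
print(posteriors)
```

At every position the products alpha[k][s] * beta[k][s] sum to the same likelihood, which is a convenient sanity check on an implementation.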
[Figure: trellis over states C and H for observations 2 2 2 …]

\[ P(w_{1..k}, t_k = C) = \sum_{t_{1..k-1}} P(w_{1..k}, t_{1..k-1}, t_k = C) \]
\[ P(w_{k..n}, t_k = C) = \sum_{t_{k+1..n}} P(w_{k..n}, t_{k+1..n}, t_k = C) \]
[Figure: trellis over states C and H]
- How likely is it that the sentence "I love cat" occurs?
- What are the POS tags of "I love cat"?
- How do we learn the model?
\[ P(\mathbf{w} \mid \lambda) = \sum_{\mathbf{t}} \prod_{i=1}^{n} P(t_i \mid t_{i-1}, t_{i-2}) \, P(w_i \mid t_i) \]
\[ \log P(\mathbf{w}, \mathbf{t} \mid \lambda) = \log \prod_{i=1}^{n} P(t_i \mid t_{i-1}, t_{i-2}) \, P(w_i \mid t_i) = \sum_{i=1}^{n} \left( \log P(t_i \mid t_{i-1}, t_{i-2}) + \log P(w_i \mid t_i) \right) \]
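\[ L(\lambda) = \log P(\mathbf{w} \mid \lambda) = \log \sum_{\mathbf{t}} P(\mathbf{w}, \mathbf{t} \mid \lambda) \]
[Figure: successive iterates \( \lambda^{(j)}, \lambda^{(j+1)}, \lambda^{(j+2)} \), where j indexes the EM iteration]
Key idea: find a lower bound \( Q_j(\lambda) \) such that
\[ L(\lambda) \ge Q_j(\lambda) \;\; \forall \lambda \quad \text{and} \quad L(\lambda^{(j)}) = Q_j(\lambda^{(j)}) \]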
Hard EM and soft EM define different lower-bound functions \( Q_j(\lambda) \).
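Jensen's inequality: let \( \sum_x p(x) = 1 \) with \( p(x) \ge 0 \) and \( f(x) > 0 \). Then
\[ \log \sum_x f(x) \, p(x) \ge \sum_x p(x) \log f(x) \]
Applying it with \( p = P(\mathbf{t} \mid \mathbf{w}, \lambda^{(j)}) \), where j indexes the EM iteration:
\[ \log P(\mathbf{w} \mid \lambda) = \log \sum_{\mathbf{t}} P(\mathbf{t} \mid \mathbf{w}, \lambda^{(j)}) \frac{P(\mathbf{w}, \mathbf{t} \mid \lambda)}{P(\mathbf{t} \mid \mathbf{w}, \lambda^{(j)})} \ge \sum_{\mathbf{t}} P(\mathbf{t} \mid \mathbf{w}, \lambda^{(j)}) \log \frac{P(\mathbf{w}, \mathbf{t} \mid \lambda)}{P(\mathbf{t} \mid \mathbf{w}, \lambda^{(j)})} \]
At \( \lambda = \lambda^{(j)} \) the ratio equals \( P(\mathbf{w} \mid \lambda^{(j)}) \) for every \( \mathbf{t} \), so the bound is tight:
\[ \sum_{\mathbf{t}} P(\mathbf{t} \mid \mathbf{w}, \lambda^{(j)}) \log \frac{P(\mathbf{w}, \mathbf{t} \mid \lambda^{(j)})}{P(\mathbf{t} \mid \mathbf{w}, \lambda^{(j)})} = \log P(\mathbf{w} \mid \lambda^{(j)}) \]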
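Jensen's inequality for the logarithm is easy to check numerically. The weights and values below are arbitrary illustrative numbers, not taken from the lecture, and `jensen_sides` is a hypothetical helper name.

```python
import math

def jensen_sides(f, p):
    """Return (log of the weighted mean of f, weighted mean of log f)."""
    lhs = math.log(sum(fx * px for fx, px in zip(f, p)))
    rhs = sum(px * math.log(fx) for fx, px in zip(f, p))
    return lhs, rhs

p = [0.2, 0.3, 0.5]  # arbitrary weights that sum to 1
lhs, rhs = jensen_sides([1.0, 4.0, 9.0], p)
print(lhs, rhs)  # lhs is the larger of the two

# When f is constant the two sides coincide, which is the tight case.
lhs_const, rhs_const = jensen_sides([3.0, 3.0, 3.0], p)
print(lhs_const, rhs_const)
```

The constant case mirrors why the EM lower bound touches the log-likelihood at the current parameters: there the ratio inside the logarithm does not depend on the tag sequence.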
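Soft EM defines
\[ Q_j(\lambda) = \sum_{\mathbf{t}} P(\mathbf{t} \mid \mathbf{w}, \lambda^{(j)}) \log \frac{P(\mathbf{w}, \mathbf{t} \mid \lambda)}{P(\mathbf{t} \mid \mathbf{w}, \lambda^{(j)})} \]
\[ = \sum_{\mathbf{t}} P(\mathbf{t} \mid \mathbf{w}, \lambda^{(j)}) \log P(\mathbf{w}, \mathbf{t} \mid \lambda) - \sum_{\mathbf{t}} P(\mathbf{t} \mid \mathbf{w}, \lambda^{(j)}) \log P(\mathbf{t} \mid \mathbf{w}, \lambda^{(j)}) \]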
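The two defining properties of the soft-EM lower bound can be verified on a toy model. The sketch below is an illustration under strong simplifying assumptions: a bigram HMM over states C and H with observations 1 2 2, fixed transitions, and the parameters reduced to a single free value q = P(1|C), with P(2|C) = 1 - q. All names are hypothetical.

```python
import math
from itertools import product

START = {"C": 0.5, "H": 0.5}
TRANS = {"C": {"C": 0.8, "H": 0.2}, "H": {"C": 0.2, "H": 0.8}}  # TRANS[prev][next]
OBS = [1, 2, 2]
PATHS = ["".join(tags) for tags in product("CH", repeat=len(OBS))]

def joint(path, q):
    """P(w, t | lambda), with lambda reduced to the single parameter q = P(1|C)."""
    emit = {"C": {1: q, 2: 1.0 - q}, "H": {1: 0.0, 2: 0.2}}
    p = START[path[0]] * emit[path[0]][OBS[0]]
    for prev, cur, o in zip(path, path[1:], OBS[1:]):
        p *= TRANS[prev][cur] * emit[cur][o]
    return p

def log_likelihood(q):
    """L(lambda) = log of the sum of the joint probability over all tag sequences."""
    return math.log(sum(joint(path, q) for path in PATHS))

def lower_bound(q, q_j):
    """Q_j(lambda): posterior under q_j times log(joint under q / posterior)."""
    total = sum(joint(path, q_j) for path in PATHS)
    bound = 0.0
    for path in PATHS:
        posterior = joint(path, q_j) / total
        if posterior > 0.0:  # sequences with zero posterior contribute nothing
            bound += posterior * math.log(joint(path, q) / posterior)
    return bound

q_j = 0.8
gaps = {q: log_likelihood(q) - lower_bound(q, q_j) for q in (0.2, 0.5, 0.8, 0.95)}
print(gaps)
```

Each gap is non-negative, and the gap at q = q_j vanishes up to floating-point error, matching the requirement that the bound lies below the log-likelihood everywhere and touches it at the current parameters.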
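The term \( \sum_{\mathbf{t}} P(\mathbf{t} \mid \mathbf{w}, \lambda^{(j)}) \log P(\mathbf{t} \mid \mathbf{w}, \lambda^{(j)}) \) does not depend on \( \lambda \). We know how to maximize the remaining term, because the complete-data log-likelihood decomposes into a sum:
\[ \log P(\mathbf{w}, \mathbf{t} \mid \lambda) = \log \prod_{i=1}^{n} P(t_i \mid t_{i-1}, t_{i-2}) \, P(w_i \mid t_i) = \sum_{i=1}^{n} \left( \log P(t_i \mid t_{i-1}, t_{i-2}) + \log P(w_i \mid t_i) \right) \]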