Lecture 3: Language Model Smoothing
Kai-Wei Chang, CS @ University of Virginia, kw@kwchang.net. Course webpage: http://kwchang.net/teaching/NLP16
This lecture: Zipf's law; dealing with unseen events (smoothing).
There are more principled smoothing methods, too. We’ll look next at log-linear models, which are a good and popular general technique. But the traditional methods are easy to implement, run fast, and will give you intuitions about what you want from a smoothing method.
Credit: the following slides are adapted from Jason Eisner’s NLP course
Unsmoothed counts for P(w | denied the):
  allegations 3, reports 2, claims 1, request 1  (7 total)

Smoothed counts for P(w | denied the):
  allegations 2.5, reports 1.5, claims 0.5, request 0.5, other 2  (7 total)
[Figure: bar charts of these counts before and after smoothing; mass is taken from the seen words (allegations, reports, claims, request) and spread over unseen ones (attack, man).]
Credit: Dan Klein
$P_{\text{MLE}}(w_i \mid w_{i-1}) = \dfrac{c(w_{i-1}, w_i)}{c(w_{i-1})}$

$P_{\text{Add-1}}(w_i \mid w_{i-1}) = \dfrac{c(w_{i-1}, w_i) + 1}{c(w_{i-1}) + V}$

where $c(\cdot)$ is a count in the training data and $V$ is the vocabulary size.
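A minimal sketch of these two estimators in Python (the function names and the corpus format, a list of token lists, are my own choices, not from the slides):

```python
from collections import Counter

def train_bigram_counts(corpus):
    """Collect unigram and bigram counts from a list of token lists."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        for w1, w2 in zip(sentence, sentence[1:]):
            unigrams[w1] += 1
            bigrams[(w1, w2)] += 1
    return unigrams, bigrams

def p_mle(w1, w2, unigrams, bigrams):
    """Maximum-likelihood estimate; undefined (0/0) if w1 was never seen."""
    return bigrams[(w1, w2)] / unigrams[w1]

def p_add1(w1, w2, unigrams, bigrams, V):
    """Add-1 (Laplace) estimate over a vocabulary of V word types."""
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + V)
```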
Add-one smoothing with a vocabulary of 20,000 word types, when the context "see the" occurred 3 times in training. A word w that followed "see the" c times gets probability (c + 1)/(3 + 20000):

  see the abacus   ->  (c + 1)/20003
  see the abbot    ->  (c + 1)/20003
  see the abduct   ->  (c + 1)/20003
  see the above    ->  (c + 1)/20003
  see the Abram    ->  (c + 1)/20003
  …
  see the zygote   ->  (c + 1)/20003
  Total            ->  20003/20003 = 1

Note that 20,000 of the 20,003 units of pseudo-count come from the "+1"s, so almost all of the probability mass goes to events never observed.
Now let the event space be all letter strings, not a fixed 20,000-word dictionary:

  see the aaaaa
  see the aaaab
  see the aaaac
  see the aaaad
  see the aaaae
  …
  see the zzzzz
  Total  ->  (∞+3)/(∞+3)

With infinitely many possible words, add-one's denominator is infinite: every individual probability collapses to 0, and all of the mass is "set aside" for novel events. Add-one only makes sense for a closed, finite vocabulary.
One fix: add λ < 1 instead of 1 (add-λ smoothing). This gives much less probability to novel events. But how much should we smooth? That is, how much probability should we "set aside" for novel events? It depends on how likely novel events really are, which may depend on the type of text, the size of the training corpus, and so on. Can we figure it out from the data? We'll look at a few methods for deciding how much to smooth.
How do we choose λ? Hold out some data: collect counts from 80% of the training data and smooth them using add-λ smoothing; pick the λ that gets the best results on the remaining 20%. Then use that λ to smooth counts from all 100% of the training data, and report results of that final model on test data.
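A sketch of this tuning loop in Python, reusing train_bigram_counts from the earlier sketch (the candidate λ values and the log-likelihood objective are my own illustrative choices):

```python
import math

def heldout_log_likelihood(heldout, unigrams, bigrams, V, lam):
    """Total add-lambda log-probability of the bigrams in the held-out data."""
    total = 0.0
    for sentence in heldout:
        for w1, w2 in zip(sentence, sentence[1:]):
            p = (bigrams[(w1, w2)] + lam) / (unigrams[w1] + lam * V)
            total += math.log(p)
    return total

def pick_lambda(train_80, heldout_20, V, candidates=(1.0, 0.1, 0.01, 0.001)):
    """Grid-search lambda: train on the 80% split, score on the 20% split."""
    unigrams, bigrams = train_bigram_counts(train_80)
    return max(candidates,
               key=lambda lam: heldout_log_likelihood(heldout_20, unigrams,
                                                      bigrams, V, lam))

# Finally: re-collect counts from 100% of the training data, smooth with the
# chosen lambda, and report results on the test set.
```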
How likely is a novel event? Suppose we have observed these counts in some context:

  a          150
  both        18
  candy        1
  donuts       2
  every       50
  grapes       1
  his         38
  ice cream    7
  …

versus a word we have never seen at all, such as farina (count 0). What probability should it get?
Zipf's law: http://wugology.com/zipfs-law/
[Figure: histogram of the number of word types occurring r = 1, 2, …, 6 times in a training corpus; bar labels include 52,108 and 69,836 types.]
Examples of word types in each count class:

  N0 (novel; count 0): abaringe, Babatinde, cabaret, …
  N1 (singletons):     aback, Babbitt, cabanas, …
  N2 (doubletons):     Abbas, babel, Cabot, …
  N3:                  abdominal, Bach, cabana, …
  N4:                  aberrant, backlog, cabinets, …
  N5:                  abdomen, bachelor, Caesar, …
  Highest counts:      the, EOS
Let $N_r$ be the number of word types that occurred exactly $r$ times in training (with $N_0$ the novel words: in the dictionary but never occurring). Then:

  $N_2$ = # of doubleton types, and $N_2 \cdot 2$ = # of doubleton tokens
  $\sum_r N_r$ = total # of types = $T$
  $\sum_r (N_r \cdot r)$ = total # of tokens = $N$
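These quantities are cheap to compute from a token stream; a small Python sketch (the function name is mine):

```python
from collections import Counter

def count_of_counts(tokens):
    """Return (nr, T, N): nr[r] = # of types seen exactly r times,
    T = total # of types, N = total # of tokens."""
    type_counts = Counter(tokens)        # c(w) for every word type
    nr = Counter(type_counts.values())   # count-of-counts table
    T = len(type_counts)                 # sum_r N_r
    N = sum(type_counts.values())        # sum_r N_r * r
    return nr, T, N
```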
One simple recipe: shrink every observed estimate from r/N to r/(N+T), and split the freed-up mass T/(N+T) evenly among the N_0 novel types:

  count r   unsmoothed   smoothed
  2         2/N          2/(N+T)
  1         1/N          1/(N+T)
  0         0/N          (T/(N+T)) / N_0

Intuition: if T/N is large, we've seen lots of novel types in the past, so we expect lots more.
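A sketch of this estimator (the name p_seen_types is mine; note it needs an estimate of N_0, e.g., dictionary size minus T):

```python
def p_seen_types(word, type_counts, T, N, n0):
    """r/(N+T) for a word seen r > 0 times; novel words share mass T/(N+T)."""
    r = type_counts.get(word, 0)
    if r > 0:
        return r / (N + T)
    return (T / (N + T)) / n0   # n0 = # of in-dictionary types never seen
```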
[Figure: the probability assigned to a single word from each count class, 0/N through 6/N, shown against the class sizes N_0 through N_6.]
Good-Turing smoothing: partition the type vocabulary into classes (novel, singletons, doubletons, …) by how often they occurred in the training data, then use the observed total probability of class r+1 to estimate the total probability of class r:

  count r   unsmoothed p(w)   Good-Turing p(w)
  2         2/N               (N_3 · 3/N) / N_2
  1         1/N               (N_2 · 2/N) / N_1
  0         0/N               (N_1 · 1/N) / N_0

In general, the unsmoothed estimate for a word seen r times can be written $r/N = (N_r \cdot r/N)/N_r$ (the class's total probability shared among its $N_r$ types); Good-Turing replaces the numerator with the next class's observed mass, giving $(N_{r+1} \cdot (r+1)/N)/N_r$.
A very nice idea, but a bit tricky in practice; see the paper "Good-Turing frequency estimation without tears" (Gale & Sampson, 1995).
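A bare-bones Good-Turing sketch on top of the count_of_counts table above; it omits the practical fix-ups (such as smoothing the N_r curve) that the paper is about:

```python
def good_turing(word, type_counts, nr, N):
    """((r+1) * N_{r+1} / N) / N_r for a word seen r times.
    For r = 0 this returns the TOTAL novel mass N_1 / N; divide by N_0
    (the number of never-seen types) if you have an estimate of it.
    Returns 0 whenever N_{r+1} = 0 -- the breakdown discussed below."""
    r = type_counts.get(word, 0)
    if r == 0:
        return nr[1] / N
    return ((r + 1) * nr.get(r + 1, 0) / N) / nr[r]
```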
Note that each class borrows its mass from the next one up: N_0 is estimated from N_1, N_1 from N_2, N_2 from N_3, and so on, up to e.g. N_4416 from N_4417 or N_3510 from N_3511. For large r, N_{r+1} is almost always 0, so the raw estimate breaks down; in practice the N_r counts must themselves be smoothed first.
Absolute discounting: subtract a fixed discount $d$ from every observed bigram count and interpolate with the unigram model:

$$P_{\text{AbsoluteDiscounting}}(w_i \mid w_{i-1}) = \frac{\max\big(c(w_{i-1}, w_i) - d,\ 0\big)}{c(w_{i-1})} + \lambda(w_{i-1})\, P(w_i)$$

The first term is the discounted bigram estimate, $\lambda(w_{i-1})$ is the interpolation weight, and $P(w_i)$ is the unigram estimate.
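A sketch of this estimator (d = 0.75 is a typical choice; `followers` is my own helper mapping each word to the set of distinct words observed after it, so that λ hands back exactly the mass removed by the discount):

```python
def p_absolute_discount(w1, w2, unigrams, bigrams, followers, N, d=0.75):
    """Discounted bigram estimate plus lambda(w1) times the unigram estimate."""
    c1 = unigrams[w1]
    discounted = max(bigrams[(w1, w2)] - d, 0) / c1
    lam = d * len(followers[w1]) / c1   # mass freed: d per distinct continuation
    return discounted + lam * (unigrams[w2] / N)
```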
Where should the discount d come from? Compare each bigram's count in the training data with its average count in a held-out set:

  Bigram count in training   Average count in held-out set
  0                          0.0000270
  1                          0.448
  2                          1.25
  3                          2.24
  4                          3.23
  5                          4.21
  6                          5.23

For every nonzero training count, the held-out count is lower by roughly 0.75, which is why d ≈ 0.75 is a common choice.
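One might estimate d directly from such a comparison; a sketch (train_counts and heldout_counts are assumed to be Counters over bigram pairs):

```python
def estimate_discount(train_counts, heldout_counts, max_r=6):
    """Average the gap r - mean(held-out count) over training counts r = 1..max_r."""
    gaps = []
    for r in range(1, max_r + 1):
        bucket = [heldout_counts[bg] for bg, c in train_counts.items() if c == r]
        if bucket:
            gaps.append(r - sum(bucket) / len(bucket))
    return sum(gaps) / len(gaps)
```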