SLIDE 1

Lecture 3: Language Model Smoothing

Kai-Wei Chang CS @ University of Virginia kw@kwchang.net Course webpage: http://kwchang.net/teaching/NLP16

SLIDE 2

This lecture

- Zipf's law
- Dealing with unseen words/n-grams
  - Add-one smoothing
  - Linear smoothing
  - Good-Turing smoothing
  - Absolute discounting
  - Kneser-Ney smoothing

SLIDE 3

Recap: Bigram language model

Training corpus:
  <S> I am Sam </S>
  <S> I am legend </S>
  <S> Sam I am </S>

Let P(<S>) = 1
P(I | <S>) = 2/3
P(am | I) = 1
P(Sam | am) = 1/3
P(</S> | Sam) = 1/2
P(<S> I am Sam </S>) = 1 * 2/3 * 1 * 1/3 * 1/2
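The following minimal Python sketch (my own illustration, not part of the original slides) recovers these bigram estimates from the three-sentence corpus by relative-frequency counting:

```python
from collections import defaultdict

# Toy corpus from the slide, with explicit sentence-boundary markers.
corpus = [
    "<S> I am Sam </S>",
    "<S> I am legend </S>",
    "<S> Sam I am </S>",
]

bigram_counts = defaultdict(int)
context_counts = defaultdict(int)
for sentence in corpus:
    tokens = sentence.split()
    for prev, word in zip(tokens, tokens[1:]):
        bigram_counts[(prev, word)] += 1
        context_counts[prev] += 1

def p_mle(word, prev):
    """Maximum-likelihood bigram probability P(word | prev)."""
    return bigram_counts[(prev, word)] / context_counts[prev]

# P(<S> I am Sam </S>) = P(I|<S>) * P(am|I) * P(Sam|am) * P(</S>|Sam)
print(p_mle("I", "<S>") * p_mle("am", "I") * p_mle("Sam", "am") * p_mle("</S>", "Sam"))
# -> 0.111...  (= 2/3 * 1 * 1/3 * 1/2)
```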

SLIDE 4

More examples: Berkeley Restaurant Project sentences

- can you tell me about any good cantonese restaurants close by
- mid priced thai food is what i'm looking for
- tell me about chez panisse
- can you give me a listing of the kinds of food that are available
- i'm looking for a good place to eat breakfast
- when is caffe venezia open during the day

SLIDE 5

Raw bigram counts

- Out of 9222 sentences

SLIDE 6

Raw bigram probabilities

- Normalize by unigram counts: P(w_i | w_{i-1}) = c(w_{i-1}, w_i) / c(w_{i-1})
- Result: the table of bigram probabilities

SLIDE 7

Zeros

Training set:
  … denied the allegations
  … denied the reports
  … denied the claims
  … denied the request

P("offer" | denied the) = 0

Test set:
  … denied the offer
  … denied the loan

SLIDE 8

Smoothing


There are more principled smoothing methods, too. We’ll look next at log-linear models, which are a good and popular general technique. But the traditional methods are easy to implement, run fast, and will give you intuitions about what you want from a smoothing method.

This dark art is why NLP is taught in the engineering school.

Credit: the following slides are adapted from Jason Eisner’s NLP course

SLIDE 9

What is smoothing?

(Figure: counts estimated from samples of size 20, 200, 2000, and 2,000,000.)

SLIDE 10

ML 101: bias-variance tradeoff

- Different samples of size 20 vary considerably
- though on average, they give the correct bell curve!

(Figure: four different samples of size 20.)

SLIDE 11

Overfitting

SLIDE 12

The perils of overfitting

- N-grams only work well for word prediction if the test corpus looks like the training corpus
- In real life, it often doesn't
- We need to train robust models that generalize!
- One kind of generalization: zeros! Things that never occur in the training set, but occur in the test set

SLIDE 13

The intuition of smoothing

- When we have sparse statistics, steal probability mass to generalize better

P(w | denied the), before smoothing:
  allegations  3
  reports      2
  claims       1
  request      1
  total        7

P(w | denied the), after smoothing:
  allegations  2.5
  reports      1.5
  claims       0.5
  request      0.5
  other        2
  total        7

(Figure: probability mass is shifted from the observed words allegations, reports, claims, and request onto unseen words such as attack, man, and outcome.)

Credit: Dan Klein

SLIDE 14

Add-one estimation (Laplace smoothing)

- Pretend we saw each word one more time than we did
- Just add one to all the counts!

MLE estimate:
  P_MLE(w_i | w_{i-1}) = c(w_{i-1}, w_i) / c(w_{i-1})

Add-1 estimate:
  P_Add-1(w_i | w_{i-1}) = (c(w_{i-1}, w_i) + 1) / (c(w_{i-1}) + V)
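A small Python sketch of the two estimators (my own illustration; bigram_counts, context_counts, and the vocabulary size V are assumed to come from counting a training corpus as in the earlier sketch):

```python
def p_mle(word, prev, bigram_counts, context_counts):
    # Maximum likelihood: zero for unseen bigrams, undefined for unseen contexts.
    return bigram_counts.get((prev, word), 0) / context_counts[prev]

def p_add_one(word, prev, bigram_counts, context_counts, V):
    # Add-1 (Laplace): every possible next word gets one extra pseudo-count,
    # so the denominator grows by the vocabulary size V.
    return (bigram_counts.get((prev, word), 0) + 1) / (context_counts.get(prev, 0) + V)
```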

SLIDE 15

Add-One Smoothing

event      count  prob      add-1 count  add-1 prob
xya        100    100/300   101          101/326
xyb        0      0/300     1            1/326
xyc        0      0/300     1            1/326
xyd        200    200/300   201          201/326
xye        0      0/300     1            1/326
…
xyz        0      0/300     1            1/326
Total xy   300    300/300   326          326/326

SLIDE 16

Berkeley Restaurant Corpus: Laplace smoothed bigram counts

SLIDE 17

Laplace-smoothed bigrams

V=1446 in the Berkeley Restaurant Project corpus

SLIDE 18

Reconstituted counts

SLIDE 19

Compare with raw bigram counts

SLIDE 20

Problem with Add-One Smoothing

We’ve been considering just 26 letter types …

event      count  prob  add-1 count  add-1 prob
xya        1      1/3   2            2/29
xyb        0      0/3   1            1/29
xyc        0      0/3   1            1/29
xyd        2      2/3   3            3/29
xye        0      0/3   1            1/29
…
xyz        0      0/3   1            1/29
Total xy   3      3/3   29           29/29

SLIDE 21

Problem with Add-One Smoothing

Suppose we’re considering 20000 word types

event            count  prob  add-1 count  add-1 prob
see the abacus   1      1/3   2            2/20003
see the abbot    0      0/3   1            1/20003
see the abduct   0      0/3   1            1/20003
see the above    2      2/3   3            3/20003
see the Abram    0      0/3   1            1/20003
…
see the zygote   0      0/3   1            1/20003
Total            3      3/3   20003        20003/20003

SLIDE 22


Problem with Add-One Smoothing

Suppose we’re considering 20000 word types

(Table repeated from the previous slide.)

“Novel event” = event never happened in training data. Here: 19998 novel events, with total estimated probability 19998/20003. Add-one smoothing thinks we are extremely likely to see novel events, rather than words we’ve seen.

SLIDE 23

Infinite Dictionary?

In fact, aren’t there infinitely many possible word types?

event           count  prob  add-1 count  add-1 prob
see the aaaaa   1      1/3   2            2/(∞+3)
see the aaaab   0      0/3   1            1/(∞+3)
see the aaaac   0      0/3   1            1/(∞+3)
see the aaaad   2      2/3   3            3/(∞+3)
see the aaaae   0      0/3   1            1/(∞+3)
…
see the zzzzz   0      0/3   1            1/(∞+3)
Total           3      3/3   ∞+3          (∞+3)/(∞+3)

SLIDE 24

Add-Lambda Smoothing

- A large dictionary makes novel events too probable.
- To fix: instead of adding 1 to all counts, add λ = 0.01?
- This gives much less probability to novel events.
- But how to pick the best value for λ? That is, how much should we smooth? (See the sketch below.)
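A one-function sketch of how add-1 generalizes to add-λ (my own illustration; lam is the smoothing parameter discussed here):

```python
def p_add_lambda(word, prev, bigram_counts, context_counts, V, lam=0.01):
    # Add-lambda: add a fractional pseudo-count lam instead of 1.
    # lam -> 0 approaches the MLE; a large lam approaches the uniform distribution.
    return (bigram_counts.get((prev, word), 0) + lam) / (context_counts.get(prev, 0) + lam * V)
```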

SLIDE 25

Add-0.001 Smoothing

Doesn’t smooth much (estimated distribution has high variance)

event      count  prob  add-0.001 count  add-0.001 prob
xya        1      1/3   1.001            0.331
xyb        0      0/3   0.001            0.0003
xyc        0      0/3   0.001            0.0003
xyd        2      2/3   2.001            0.661
xye        0      0/3   0.001            0.0003
…
xyz        0      0/3   0.001            0.0003
Total xy   3      3/3   3.026            1

SLIDE 26

Add-1000 Smoothing

Smooths too much (estimated distribution has high bias)

event      count  prob  add-1000 count  add-1000 prob
xya        1      1/3   1001            1/26
xyb        0      0/3   1000            1/26
xyc        0      0/3   1000            1/26
xyd        2      2/3   1002            1/26
xye        0      0/3   1000            1/26
…
xyz        0      0/3   1000            1/26
Total xy   3      3/3   26003           1

SLIDE 27

Add-Lambda Smoothing

- A large dictionary makes novel events too probable.
- To fix: instead of adding 1 to all counts, add λ = 0.01?
- This gives much less probability to novel events.
- But how to pick the best value for λ? That is, how much should we smooth?
- E.g., how much probability to "set aside" for novel events?
- Depends on how likely novel events really are! Which may depend on the type of text, size of training corpus, …
- Can we figure it out from the data?
- We'll look at a few methods for deciding how much to smooth.

SLIDE 28

Setting Smoothing Parameters

- How to pick the best value for λ? (in add-λ smoothing)
- Try many λ values & report the one that gets best results?
- How to measure whether a particular λ gets good results?
- Is it fair to measure that on test data (for setting λ)?
- Moral: selective reporting on test data can make a method look artificially good. So it is unethical.
- Rule: test data cannot influence system development. No peeking! Use it only to evaluate the final system(s). Report all results on it.

SLIDE 29

Setting Smoothing Parameters


Feynman’s Advice: “The first principle is that you must not fool yourself, and you are the easiest person to fool.”

SLIDE 30

Setting Smoothing Parameters

- How to pick the best value for λ?
- Try many λ values & pick the one that gets best results on the development set …
- … and report results of that final model on test data.

Pick the λ that gets best results on this 20% (the development set) … when we collect counts from this 80% (the training set) and smooth them using add-λ smoothing. Now use that λ to get smoothed counts from all 100% …

(Figure: the original training data is split into an 80% training portion and a 20% development portion; the test set is kept separate.)
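A compact sketch of that recipe (my own illustration; it reuses the p_add_lambda function sketched earlier, and dev_bigrams is assumed to be the list of bigrams in the held-out 20%):

```python
import math

def pick_lambda(dev_bigrams, bigram_counts, context_counts, V,
                grid=(0.001, 0.01, 0.1, 0.5, 1.0)):
    # Score each candidate lambda by dev-set log-likelihood and keep the best.
    def dev_log_likelihood(lam):
        return sum(math.log(p_add_lambda(w, prev, bigram_counts, context_counts, V, lam))
                   for prev, w in dev_bigrams)
    return max(grid, key=dev_log_likelihood)

# After choosing lambda on the 20% dev split, re-collect counts from all 100%
# of the training data with that lambda, and only then evaluate once on test.
```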

SLIDE 31

Large or small Dev set?

Here we held out 20% of our training set (yellow) for development.

Would like to use > 20% yellow:
- 20% is not enough to reliably assess λ

Would like to use > 80% blue:
- The best λ for smoothing 80% ≠ the best λ for smoothing 100%

SLIDE 32


Cross-Validation

- Try 5 training/dev splits as below
- Pick the λ that gets best average performance
- Tests on all 100% as yellow, so we can more reliably assess λ
- Still picks a λ that's good at smoothing the 80% size, not 100%
- But now we can grow that 80% without trouble

(Figure: five folds, each holding out a different 20% as the Dev. set; the Test set stays separate.)

SLIDE 33


N-fold Cross-Validation (“Leave One Out”)

- Test each sentence with a smoothed model built from the other N-1 sentences
- Still tests on all 100% as yellow, so we can reliably assess λ
- Trains on nearly 100% blue data ((N-1)/N) to measure whether λ is good for smoothing that much data

SLIDE 34


N-fold Cross-Validation (“Leave One Out”)

- Surprisingly fast: why?
- Usually easy to retrain on blue by adding/subtracting 1 sentence's counts (see the sketch below)
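A rough sketch of why leave-one-out is cheap (my own illustration, reusing the p_add_lambda function from earlier): to score a held-out sentence, temporarily subtract its own counts, evaluate, and add them back.

```python
import math

def held_out_log_prob(sentence_bigrams, bigram_counts, context_counts, V, lam):
    # Remove this sentence's counts, so the model is effectively trained
    # on the other N-1 sentences.
    for prev, w in sentence_bigrams:
        bigram_counts[(prev, w)] -= 1
        context_counts[prev] -= 1
    score = sum(math.log(p_add_lambda(w, prev, bigram_counts, context_counts, V, lam))
                for prev, w in sentence_bigrams)
    # Put the counts back; no full retraining is ever needed.
    for prev, w in sentence_bigrams:
        bigram_counts[(prev, w)] += 1
        context_counts[prev] += 1
    return score
```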

SLIDE 35

More Ideas for Smoothing

- Remember, we're trying to decide how much to smooth.
- E.g., how much probability to "set aside" for novel events?
- Depends on how likely novel events really are
- Which may depend on the type of text, size of training corpus, …
- Can we figure this out from the data?

SLIDE 36

How likely are novel events?

a 150, both 18, candy 1, donuts 2, every 50, versus ???, grapes 1, his 38, ice cream 7, …

20000 types, 300 tokens. 0/300 vs. 0/300: which zero would you expect is really rare?

SLIDE 37

How likely are novel events?

a 150, both 18, candy 1, donuts 2, every 50, versus farina, grapes 1, his 38, ice cream 7, …

20000 types, 300 tokens. Determiners: a closed class.

SLIDE 38

How likely are novel events?

a 150, both 18, candy 1, donuts 2, every 50, versus farina, grapes 1, his 38, ice cream 7, …

20000 types, 300 tokens. (Food) nouns: an open class.

SLIDE 39

Zipf's law

The r-th most common word x_r has P(x_r) ∝ 1/r

A few words are very frequent

http://wugology.com/zipfs-law/

SLIDE 42

How common are novel events?

(Figure: token counts by frequency class (N1*1, N2*2, …), with example words for each class: abaringe, Babatinde, cabaret …; aback, Babbitt, cabanas …; Abbas, babel, Cabot …; abdominal, Bach, cabana …; aberrant, backlog, cabinets …; abdomen, bachelor, Caesar …; the types "the" and EOS tower over the rest, at 52108 and 69836 tokens.)

SLIDE 43

How common are novel events?

(Figure: token counts by frequency class, N0*0 through N6*6.)

Counts from the Brown Corpus (N ≈ 1 million tokens):
- novel words (in dictionary but never occur)
- singletons (occur once), doubletons (occur twice)
- N2 = # doubleton types; N2 * 2 = # doubleton tokens
- Σ_r N_r = total # types = T (purple bars)
- Σ_r (N_r * r) = total # tokens = N (all bars)
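A small sketch (my own) of how these count-of-counts N_r are computed from a corpus:

```python
from collections import Counter

def count_of_counts(tokens):
    # N_r = number of word types that occur exactly r times in the corpus.
    word_counts = Counter(tokens)          # c(w) for every observed type
    return Counter(word_counts.values())   # maps r -> N_r

# sum(N.values()) is the number of observed types (add N_0, the dictionary
# words never seen, to get T); sum(r * n for r, n in N.items()) is N (tokens).
```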

SLIDE 44

Witten-Bell Smoothing Idea

(Figure: the same count-of-counts bar chart as the previous slide, highlighting novel words, singletons, and doubletons.)

If T/N is large, we've seen lots of novel types in the past, so we expect lots more.

unsmoothed → smoothed:
  2/N → 2/(N+T)
  1/N → 1/(N+T)
  0/N → (T/(N+T)) / N0
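A minimal sketch of this reweighting for a unigram distribution (my own illustration): T is the number of observed types, N the number of tokens, and N0 the number of still-unseen types; mass T/(N+T) is reserved for novel events and shared evenly among them.

```python
def witten_bell_unigram(word, word_counts, n_tokens, n_unseen_types):
    # Seen word w:  c(w) / (N + T)
    # Unseen word:  (T / (N + T)) / N0   (the reserved novel mass, shared evenly)
    T = len(word_counts)                  # number of observed types
    if word in word_counts:
        return word_counts[word] / (n_tokens + T)
    return (T / (n_tokens + T)) / n_unseen_types
```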

SLIDE 45

Good-Turing Smoothing Idea

(Figure: observed total probability mass per frequency class, 0/N through 6/N.)

Partition the type vocabulary into classes (novel, singletons, doubletons, …) by how often they occurred in training data. Use the observed total probability of class r+1 to estimate the total probability of class r.

unsmoothed → smoothed:
  2/N → (N3*3/N)/N2   (obs. p(tripleton), about 1.2%, becomes est. p(doubleton))
  1/N → (N2*2/N)/N1   (obs. p(doubleton), about 1.5%, becomes est. p(singleton))
  0/N → (N1*1/N)/N0   (obs. p(singleton), about 2%, becomes est. p(novel))


In general: r/N = (Nr*r/N)/Nr → (Nr+1*(r+1)/N)/Nr
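A rough sketch of the resulting estimator (my own illustration): the adjusted count for a word seen r times is r* = (r+1)*Nr+1/Nr, and the mass reserved for novel events is N1/N. It ignores the high-count problems discussed on the "Good-Turing Reweighting" slides below.

```python
from collections import Counter

def good_turing_probs(word_counts, n_unseen_types):
    # word_counts: dict word -> r (training frequency); returns smoothed probabilities.
    N = Counter(word_counts.values())        # count-of-counts N[r]
    n_tokens = sum(word_counts.values())

    def adjusted(r):
        # r* = (r+1) * N[r+1] / N[r]; fall back to r when a class is empty (the "jumpy" region).
        return (r + 1) * N[r + 1] / N[r] if N[r] and N[r + 1] else r

    probs = {w: adjusted(r) / n_tokens for w, r in word_counts.items()}
    p_novel_each = (N[1] / n_tokens) / n_unseen_types   # 0/N -> (N1*1/N)/N0
    return probs, p_novel_each
```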

SLIDE 46

Witten-Bell vs. Good-Turing

- Estimate p(z | xy) using just the tokens we've seen in context xy. Might be a small set …
- Witten-Bell intuition: if those tokens were distributed over many different types, then novel types are likely in future.
- Good-Turing intuition: if many of those tokens came from singleton types, then novel types are likely in future.
  - Very nice idea (but a bit tricky in practice)
  - See the paper "Good-Turing smoothing without tears"

SLIDE 47

Good-Turing Reweighting

- Problem 1: what about "the"? (k = 4417) For small k, Nk > Nk+1. For large k, too jumpy.
- Problem 2: we don't observe events for every k

(Figure: each class Nk is re-estimated from Nk+1: N0 from N1, N1 from N2, N2 from N3, …, N3510 from N3511, N4416 from N4417.)

SLIDE 48

Good-Turing Reweighting

Simple Good-Turing [Gale and Sampson]: replace the empirical Nk with a best-fit regression (e.g., a power law) once the count-of-counts get unreliable:

f(k) = a + b log(k); find a, b such that f(k) ≈ Nk
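A sketch of that regression step (my own illustration). It fits the slide's f(k) = a + b*log(k) to the observed Nk by ordinary least squares; Gale and Sampson's actual recipe fits log Nk against log k (a power law), but the shape of the computation is the same.

```python
import math

def fit_count_counts(count_of_counts):
    # Least-squares fit of f(k) = a + b*log(k) to the observed N_k values.
    pts = [(math.log(k), n) for k, n in sorted(count_of_counts.items()) if n > 0]
    mean_x = sum(x for x, _ in pts) / len(pts)
    mean_y = sum(y for _, y in pts) / len(pts)
    b = sum((x - mean_x) * (y - mean_y) for x, y in pts) / \
        sum((x - mean_x) ** 2 for x, _ in pts)
    a = mean_y - b * mean_x
    return lambda k: a + b * math.log(k)   # use in place of N_k once counts get jumpy
```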

SLIDE 49


Backoff and interpolation

- Why are we treating all novel events as the same?

SLIDE 50


Backoff and interpolation

p(zombie | see the) vs. p(baby | see the)

What if count(see the zygote) = count(see the baby) = 0?
- "baby" beats "zygote" as a unigram
- "the baby" beats "the zygote" as a bigram
- so "see the baby" beats "see the zygote"?

(even if both have the same count, such as 0)

SLIDE 51


Backoff and interpolation

Condition on less context for contexts you haven't learned much about.
- Backoff: use trigram if you have good evidence, otherwise bigram, otherwise unigram
- Interpolation: mixture of unigram, bigram, trigram (etc.) models

SLIDE 52


Smoothing + backoff

Basic smoothing (e.g., add-λ, Good-Turing, Witten-Bell):
- Holds out some probability mass for novel events
- Divided up evenly among the novel events

Backoff smoothing:
- Holds out the same amount of probability mass for novel events
- But divides it up unevenly, in proportion to the backoff probability

SLIDE 53


Smoothing + backoff

When defining p(z | xy), the backoff prob for novel z is p(z | y)

Even if z was never observed after xy, it may have been observed after y (why?). Then p(z | y) can be estimated without further backoff. If not, we back off further to p(z).
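One deliberately simple instance of this back-off chain is the "stupid backoff" scheme (not from these slides): it returns relative frequencies scaled by a constant at each back-off step, so the values are scores rather than a normalized distribution, but it shows the p(z | xy) → p(z | y) → p(z) structure.

```python
def stupid_backoff(z, x, y, trigram_counts, bigram_counts, unigram_counts,
                   n_tokens, alpha=0.4):
    # Use the trigram if it was seen; otherwise back off to the bigram,
    # then to the unigram, multiplying by alpha at each back-off step.
    if trigram_counts.get((x, y, z), 0) > 0:
        return trigram_counts[(x, y, z)] / bigram_counts[(x, y)]
    if bigram_counts.get((y, z), 0) > 0:
        return alpha * bigram_counts[(y, z)] / unigram_counts[y]
    return alpha * alpha * unigram_counts.get(z, 0) / n_tokens
```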

SLIDE 54


Linear Interpolation

- Jelinek-Mercer smoothing
- Use a weighted average of backed-off naïve models:
  p_average(z | xy) = λ3 p(z | xy) + λ2 p(z | y) + λ1 p(z)
  where λ3 + λ2 + λ1 = 1 and all are ≥ 0
- The weights λ can depend on the context xy
  - E.g., we can make λ3 large if count(xy) is high
  - Learn the weights on held-out data w/ jackknifing
  - Different λ3 when xy is observed 1 time, 2 times, 5 times, … (see the sketch below)
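A minimal sketch of the fixed-weight version (my own illustration; p_uni, p_bi, and p_tri stand for whatever component estimators are already available):

```python
def p_interpolated(z, x, y, p_uni, p_bi, p_tri, lambdas=(0.1, 0.3, 0.6)):
    # Jelinek-Mercer linear interpolation with fixed weights
    # (lambda1 + lambda2 + lambda3 = 1); real systems let the weights
    # depend on how often the context xy was observed.
    l1, l2, l3 = lambdas
    return l3 * p_tri(z, x, y) + l2 * p_bi(z, y) + l1 * p_uni(z)
```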

SLIDE 55

Absolute Discounting Interpolation

- Save ourselves some time and just subtract 0.75 (or some d)!
- But should we really just use the regular unigram P(w)?

P_AbsoluteDiscounting(w_i | w_{i-1}) = (c(w_{i-1}, w_i) - d) / c(w_{i-1}) + λ(w_{i-1}) P(w_i)

  (c(w_{i-1}, w_i) - d) / c(w_{i-1})  is the discounted bigram;
  λ(w_{i-1})  is the interpolation weight;  P(w_i)  is the unigram.
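A sketch of this interpolated estimator (my own illustration), using the standard choice of λ(w_{i-1}) that makes the distribution sum to one: d times the number of distinct continuations of w_{i-1}, divided by c(w_{i-1}).

```python
def p_absolute_discount(w, prev, bigram_counts, context_counts, p_unigram, d=0.75):
    # max(c(prev, w) - d, 0) / c(prev)  +  lambda(prev) * P(w)
    # lambda(prev) = d * (# distinct words ever seen after prev) / c(prev),
    # i.e. exactly the probability mass removed by the discounting.
    c_prev = context_counts[prev]
    discounted = max(bigram_counts.get((prev, w), 0) - d, 0) / c_prev
    n_continuations = sum(1 for (p, _), c in bigram_counts.items() if p == prev and c > 0)
    lam = d * n_continuations / c_prev
    return discounted + lam * p_unigram(w)
```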

SLIDE 56

Absolute discounting: just subtract a little from each count

- How much to subtract?
- Church and Gale (1991)'s clever idea:
  - Divide data into training and held-out sets
  - For each bigram count in the training set, see the actual count in the held-out set!
  - It sure looks like c* = (c - .75)

Bigram count in training    Bigram count in held-out set
0                           0.0000270
1                           0.448
2                           1.25
3                           2.24
4                           3.23
5                           4.21
6                           5.23
