Language Modeling (2)


  1. CS 4650/7650: Natural Language Processing. Language Modeling (2). Diyi Yang. Many slides from Dan Jurafsky and Jason Eisner.

  2. Recap: Language Model
     • Unigram model: $P(w_1)\,P(w_2)\,P(w_3)\cdots P(w_n)$
     • Bigram model: $P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_2)\cdots P(w_n \mid w_{n-1})$
     • Trigram model: $P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_2, w_1)\cdots P(w_n \mid w_{n-1}, w_{n-2})$
     • N-gram model: $P(w_1)\,P(w_2 \mid w_1)\cdots P(w_n \mid w_{n-1}, w_{n-2}, \ldots, w_{n-N+1})$
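     The bigram factorization above translates directly into code. Below is a minimal sketch (mine, not from the slides); `bigram_prob` is a hypothetical lookup returning P(w_i | w_{i-1}), and sentences are padded with `<s>` / `</s>` boundary markers, a common convention.

     ```python
     import math

     def sentence_logprob(sentence, bigram_prob):
         """Score a sentence under a bigram model: sum of log2 P(w_i | w_{i-1}).

         Working in log space avoids underflow to 0 on long sentences.
         """
         words = ["<s>"] + sentence.split() + ["</s>"]
         logp = 0.0
         for prev, word in zip(words, words[1:]):   # consecutive bigram pairs
             logp += math.log2(bigram_prob(prev, word))
         return logp

     # Toy usage with a made-up probability table:
     toy = {("<s>", "i"): 0.25, ("i", "want"): 0.33, ("want", "food"): 0.05, ("food", "</s>"): 0.4}
     print(sentence_logprob("i want food", lambda p, w: toy.get((p, w), 1e-6)))
     ```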

  3. Recap: How To Evaluate
     • Extrinsic: build a new language model, use it for some task (MT, ASR, etc.)
     • Intrinsic: measure how good we are at modeling language

  4. Difficulty of Extrinsic Evaluation
     • Extrinsic: build a new language model, use it for some task (MT, etc.)
       • Time-consuming; can take days or weeks
     • So, sometimes use intrinsic evaluation: perplexity
       • Bad approximation, unless the test data looks just like the training data
       • So generally only useful in pilot experiments

  5. Recap: Intrinsic Evaluation
     • Intuitively, language models should assign high probability to real language they have not seen before

  6. Evaluation: Perplexity
     • Test data: $S = s_1, s_2, \ldots, s_{\mathrm{Sent}}$
     • Parameters are not estimated from $S$
     • Perplexity is the normalized inverse probability of $S$:
       $p(S) = \prod_{i=1}^{\mathrm{Sent}} p(s_i)$
       $\log_2 p(S) = \sum_{i=1}^{\mathrm{Sent}} \log_2 p(s_i)$
       $\mathrm{perplexity} = 2^{-x}$, where $x = \frac{1}{M} \sum_{i=1}^{\mathrm{Sent}} \log_2 p(s_i)$

  7. Evaluation: Perplexity
     • $\mathrm{perplexity} = 2^{-x}$, where $x = \frac{1}{M} \sum_{i=1}^{\mathrm{Sent}} \log_2 p(s_i)$
     • Sent is the number of sentences in the test data
     • M is the number of words in the test corpus
     • A better language model has higher $p(S)$ and lower perplexity
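     As a concrete companion to the formula, here is a minimal sketch (mine, not from the slides) that turns per-sentence log probabilities into perplexity; it assumes a `sentence_logprob` function like the one sketched earlier, and the way M counts boundary symbols is just one common convention.

     ```python
     def perplexity(test_sentences, sentence_logprob):
         """perplexity = 2 ** (-(1/M) * sum_i log2 p(s_i)), with M the test word count."""
         total_logp = 0.0
         total_words = 0   # M
         for sent in test_sentences:
             total_logp += sentence_logprob(sent)      # log2 p(s_i)
             total_words += len(sent.split()) + 1      # words plus </s> (one common convention)
         return 2 ** (-total_logp / total_words)
     ```

     Since the per-sentence log probabilities are negative, the exponent is positive and a better model (higher p(S)) yields a smaller result.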

  8. Low Perplexity = Better Model
     • Training: 38 million words; test: 1.5 million words (WSJ)

       N-gram order:   Unigram   Bigram   Trigram
       Perplexity:         962      170       109

  9. Perplexity As A Branching Factor
     • $\mathrm{perplexity} = 2^{-\frac{1}{M} \sum_{i=1}^{\mathrm{Sent}} \log_2 p(s_i)}$
     • Assign probability of 1 to the test data → perplexity = 1
     • Assign probability of $1/|V|$ to every word → perplexity = $|V|$
     • Assign probability of 0 to anything → perplexity = ∞
     • Cannot compare perplexities of LMs trained on different corpora.
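     To see why the uniform baseline lands exactly at |V|, here is a short derivation (not on the slide, just unpacking the formula above): if every one of the M test words is assigned probability 1/|V|, then

     ```latex
     \mathrm{perplexity}
       = 2^{-\frac{1}{M}\sum_{j=1}^{M} \log_2 \frac{1}{|V|}}
       = 2^{-\frac{1}{M}\, M\,(-\log_2 |V|)}
       = 2^{\log_2 |V|}
       = |V| .
     ```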

  10. This Lecture
     • Dealing with unseen words/n-grams
       • Add-one smoothing
       • Linear interpolation
       • Absolute discounting
       • Kneser-Ney smoothing
     • Neural language modeling

  11. Berkeley Restaurant Project Sentences
     • can you tell me about any good cantonese restaurants close by
     • mid priced thai food is what i’m looking for
     • tell me about chez panisse
     • can you give me a listing of the kinds of food that are available
     • i’m looking for a good place to eat breakfast
     • when is cafe venezia open during the day

  12. Raw Bigram Counts
     • Out of 9222 sentences

  13. Raw Bigram Probabilities
     • Normalize by unigrams
     • Result
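     "Normalize by unigrams" means dividing each bigram count by the count of its first word. A minimal sketch of that step (mine, not the course's reference code), using plain dictionaries of counts; the example counts are illustrative only:

     ```python
     from collections import defaultdict

     def mle_bigram_probs(bigram_counts, unigram_counts):
         """P_MLE(w_i | w_{i-1}) = c(w_{i-1}, w_i) / c(w_{i-1})."""
         probs = defaultdict(dict)
         for (prev, word), c in bigram_counts.items():
             probs[prev][word] = c / unigram_counts[prev]
         return probs

     # Toy usage in the spirit of the restaurant corpus (illustrative counts):
     unigrams = {"i": 2533, "want": 927}
     bigrams = {("i", "want"): 827, ("want", "to"): 608}
     print(mle_bigram_probs(bigrams, unigrams)["i"]["want"])   # ≈ 0.33
     ```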

  14. Approximating Shakespeare

  15. Shakespeare As Corpus
     • N = 884,647 tokens, V = 29,066
     • Shakespeare produced 300,000 bigram types out of V² = 844 million possible bigrams
     • 99.96% of the possible bigrams were never seen (have zero entries in the table)
     • Quadrigrams are worse: what's coming out looks like Shakespeare because it is Shakespeare
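     A quick sanity check of the numbers on this slide (my own arithmetic, not part of the deck):

     ```python
     V = 29_066                      # word types in Shakespeare
     possible_bigrams = V ** 2       # 844,832,356, i.e. roughly 844 million
     seen_bigrams = 300_000
     print(possible_bigrams)                      # 844832356
     print(1 - seen_bigrams / possible_bigrams)   # ≈ 0.9996, i.e. 99.96% of bigrams unseen
     ```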

  16. The Perils of Overfitting
     • N-grams only work well for word prediction if the test corpus looks like the training corpus
     • In real life, it often doesn't
     • We need to train robust models that generalize!
     • One kind of generalization: zeros!
       • Things that don't ever occur in the training set
       • But occur in the test set

  17. Zeros
     • Training set:
       … denied the allegations
       … denied the reports
       … denied the claims
       … denied the request
     • Test set:
       … denied the offer
       … denied the loan
     • P(“offer” | denied the) = 0

  18. Zero Probability Bigrams
     • Bigrams with zero probability
       • mean that we will assign 0 probability to the test set
       • and hence we cannot compute perplexity (can't divide by 0)

  19. Smoothing

  20. The Intuition of Smoothing
     • When we have sparse statistics:
       P(w | denied the): 3 allegations, 2 reports, 1 claims, 1 request (7 total)
       (histogram: all mass on allegations, reports, claims, request; words like outcome, attack, man get zero)

  21. The Intuition of Smoothing
     • Steal probability mass to generalize better:
       P(w | denied the): 2.5 allegations, 1.5 reports, 0.5 claims, 0.5 request, 2 other (7 total)
     Credit: Dan Klein

  22. Add-one Estimation (Laplace Smoothing)
     • Pretend we saw each word one more time than we did
     • Just add one to all the counts!
     • MLE estimate: $P_{\mathrm{MLE}}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i)}{c(w_{i-1})}$
     • Add-1 estimate: $P_{\mathrm{Add\text{-}1}}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) + 1}{c(w_{i-1}) + V}$
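     A minimal sketch of the two estimators side by side (mine, not the course's reference code); `V` is the vocabulary size, and the counts are assumed to live in plain dictionaries as in the earlier snippet:

     ```python
     def p_mle(prev, word, bigram_counts, unigram_counts):
         """P_MLE(w_i | w_{i-1}) = c(w_{i-1}, w_i) / c(w_{i-1})."""
         return bigram_counts.get((prev, word), 0) / unigram_counts[prev]

     def p_add1(prev, word, bigram_counts, unigram_counts, V):
         """P_Add-1(w_i | w_{i-1}) = (c(w_{i-1}, w_i) + 1) / (c(w_{i-1}) + V)."""
         return (bigram_counts.get((prev, word), 0) + 1) / (unigram_counts[prev] + V)
     ```

     Note that a bigram never seen in training now gets probability 1/(c(w_{i-1}) + V) instead of 0, which is exactly what rescues perplexity on test data containing zeros.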

  23. Example: Add-one Smoothing

       event      count   MLE prob.   count + 1   add-1 prob.
       xya        100     100/300     101         101/326
       xyb        0       0/300       1           1/326
       xyc        0       0/300       1           1/326
       xyd        200     200/300     201         201/326
       xye        0       0/300       1           1/326
       …
       xyz        0       0/300       1           1/326
       Total xy   300     300/300     326         326/326

  24. Berkeley Restaurant Corpus: Laplace Smoothed Bigram Counts

  25. Laplace-smoothed Bigrams
     • V = 1446 in the Berkeley Restaurant Project corpus

  26. Reconstruct the Count Matrix
     $c^*(w_{n-1} w_n) = P^*_{\mathrm{Add\text{-}1}}(w_n \mid w_{n-1}) \cdot c(w_{n-1}) = \frac{\left[c(w_{n-1} w_n) + 1\right] \cdot c(w_{n-1})}{c(w_{n-1}) + V}$
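     The reconstructed ("effective") count is just the add-1 probability scaled back up by the context count. A minimal sketch, assuming the same count dictionaries as in the earlier snippets:

     ```python
     def adjusted_count(prev, word, bigram_counts, unigram_counts, V):
         """c*(w_{n-1} w_n) = (c(w_{n-1} w_n) + 1) * c(w_{n-1}) / (c(w_{n-1}) + V)."""
         c_bigram = bigram_counts.get((prev, word), 0)
         c_prev = unigram_counts[prev]
         return (c_bigram + 1) * c_prev / (c_prev + V)
     ```

     Comparing these adjusted counts against the raw counts (next slide) makes visible how much probability mass add-1 shifts onto unseen bigrams.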

  27. Compare with Raw Bigram Counts

  28. Problem with Add-One Smoothing
     We've been considering just 26 letter types …

       event      count   MLE prob.   count + 1   add-1 prob.
       xya        1       1/3         2           2/29
       xyb        0       0/3         1           1/29
       xyc        0       0/3         1           1/29
       xyd        2       2/3         3           3/29
       xye        0       0/3         1           1/29
       …
       xyz        0       0/3         1           1/29
       Total xy   3       3/3         29          29/29

  29. Problem with Add-One Smoothing
     Suppose we're considering 20000 word types

       event             count   MLE prob.   count + 1   add-1 prob.
       see the abacus    1       1/3         2           2/20003
       see the abbot     0       0/3         1           1/20003
       see the abduct    0       0/3         1           1/20003
       see the above     2       2/3         3           3/20003
       see the Abram     0       0/3         1           1/20003
       …
       see the zygote    0       0/3         1           1/20003
       Total             3       3/3         20003       20003/20003

  30. Problem with Add-One Smoothing
     Suppose we're considering 20000 word types (same table as on the previous slide)
     • "Novel event" = event never happened in training data.
     • Here: 19998 novel events, with total estimated probability 19998/20003.
     • Add-one smoothing thinks we are extremely likely to see novel events, rather than words we've seen.
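     The 19998/20003 figure follows directly from the add-1 formula on this toy example (just unpacking the arithmetic): the context "see the" was observed 3 times, the vocabulary has 20000 types, and 19998 of them were never seen after this context, so

     ```latex
     P_{\text{novel}}
       = 19998 \cdot \frac{0 + 1}{3 + 20000}
       = \frac{19998}{20003}
       \approx 0.99975,
     \qquad
     P_{\text{seen}}
       = \frac{2}{20003} + \frac{3}{20003}
       = \frac{5}{20003}
       \approx 0.00025 .
     ```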

  31. Infinite Dictionary?
     In fact, aren't there infinitely many possible word types?

       event             count   MLE prob.   count + 1   add-1 prob.
       see the aaaaa     1       1/3         2           2/(∞+3)
       see the aaaab     0       0/3         1           1/(∞+3)
       see the aaaac     0       0/3         1           1/(∞+3)
       see the aaaad     2       2/3         3           3/(∞+3)
       see the aaaae     0       0/3         1           1/(∞+3)
       …
       see the zzzzz     0       0/3         1           1/(∞+3)
       Total             3       3/3         ∞+3         (∞+3)/(∞+3)

  32. Add-Lambda Smoothing
     • A large dictionary makes novel events too probable.
     • To fix: instead of adding 1 to all counts, add λ = 0.01?
       • This gives much less probability to novel events.
     • But how to pick the best value for λ?
       • That is, how much should we smooth?
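     Generalizing the add-1 estimator to add-λ is a one-parameter change. A minimal sketch (mine, not the course code), following the same conventions as the earlier `p_add1`:

     ```python
     def p_add_lambda(prev, word, bigram_counts, unigram_counts, V, lam=0.01):
         """P_add-λ(w_i | w_{i-1}) = (c(w_{i-1}, w_i) + λ) / (c(w_{i-1}) + λ·V)."""
         return (bigram_counts.get((prev, word), 0) + lam) / (unigram_counts[prev] + lam * V)
     ```

     With lam=0.001 the estimates stay close to the MLE (the high-variance case on slide 33); with lam=1000 every word is pushed toward 1/V (the high-bias case on slide 34).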

  33. Add-0.001 Smoothing
     Doesn't smooth much (estimated distribution has high variance)

       event      count   MLE prob.   count + 0.001   add-0.001 prob.
       xya        1       1/3         1.001           0.331
       xyb        0       0/3         0.001           0.0003
       xyc        0       0/3         0.001           0.0003
       xyd        2       2/3         2.001           0.661
       xye        0       0/3         0.001           0.0003
       …
       xyz        0       0/3         0.001           0.0003
       Total xy   3       3/3         3.026           1

  34. Add-1000 Smoothing
     Smooths too much (estimated distribution has high bias)

       event      count   MLE prob.   count + 1000   add-1000 prob.
       xya        1       1/3         1001           1/26
       xyb        0       0/3         1000           1/26
       xyc        0       0/3         1000           1/26
       xyd        2       2/3         1002           1/26
       xye        0       0/3         1000           1/26
       …
       xyz        0       0/3         1000           1/26
       Total xy   3       3/3         26003          1

  35. Add-Lambda Smoothing
     • A large dictionary makes novel events too probable.
     • To fix: instead of adding 1 to all counts, add λ
     • But how to pick the best value for λ?
       • That is, how much should we smooth?
       • E.g., how much probability to "set aside" for novel events?
       • Depends on how likely novel events really are!
       • Which may depend on the type of text, size of training corpus, …
     • Can we figure it out from the data?
     • We'll look at a few methods for deciding how much to smooth.

  36. Setting Smoothing Parameters
     • How to pick the best value for λ? (in add-λ smoothing)
     • Try many λ values & report the one that gets the best results?
       (diagram: data split into Training | Test)
     • How to measure whether a particular λ gets good results?
       • Is it fair to measure that on test data (for setting λ)?
     • Moral: Selective reporting on test data can make a method look artificially good. So it is unethical.
     • Rule: Test data cannot influence system development. No peeking! Use it only to evaluate the final system(s). Report all results on it.

  37. Setting Smoothing Parameters
     • How to pick the best value for λ? (in add-λ smoothing)
     • Try many λ values & report the one that gets the best results?
       (diagram: data split into Training 80% | Dev 20% | Test)
       • Collect counts from the 80% training portion and smooth them using add-λ smoothing.
       • Pick the λ that gets the best results on the 20% dev portion.
       • Now use that λ to get smoothed counts from all 100% of the training data …
       • … and report results of that final model on test data.
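     The recipe on this slide amounts to a small grid search. A sketch under the same assumptions as the earlier snippets: `p_add_lambda`, `perplexity`, and `sentence_logprob` are the functions sketched above, `train_counts` is a hypothetical helper that returns bigram counts, unigram counts, and the vocabulary size, the candidate λ values are arbitrary, and details like unknown words and boundary symbols are glossed over.

     ```python
     def pick_lambda(train_sents, dev_sents, candidates=(0.001, 0.01, 0.1, 1.0, 10.0)):
         """Pick the add-λ value with the lowest perplexity on the held-out dev set."""
         bigrams, unigrams, V = train_counts(train_sents)   # counts from the 80% split
         best_lam, best_ppl = None, float("inf")
         for lam in candidates:
             model = lambda prev, word, lam=lam: p_add_lambda(prev, word, bigrams, unigrams, V, lam)
             ppl = perplexity(dev_sents, lambda s: sentence_logprob(s, model))
             if ppl < best_ppl:
                 best_lam, best_ppl = lam, ppl
         return best_lam
     ```

     After this selection, the slide's recipe re-estimates counts on all 100% of the training data with the chosen λ and evaluates on the test set exactly once.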

  38. Large or Small Dev Set?
     • Here we held out 20% of our training set (yellow) for development.
     • Would like to use > 20% yellow:
       • 20% is not enough to reliably assess λ
     • Would like to use > 80% blue:
       • Best λ for smoothing 80% ≠ best λ for smoothing 100%
