  1. Natural Language Processing Lecture 5: Language Models and Smoothing

  2. Language Modeling • Is this sentence good? – This is a pen – Pen this is a • Help choose between options, help score options – 他向记者介绍了发言的主要内容 – He briefed to reporters on the chief contents of the statement – He briefed reporters on the chief contents of the statement – He briefed to reporters on the main contents of the statement – He briefed reporters on the main contents of the statement

  3. One-Slide Review of Probability Terminology • Random variables take different values, depending on chance. • Notation: p ( X = x ) is the probability that r.v. X takes value x ; p ( x ) is shorthand for the same; p ( X ) is the distribution over values X can take (a function) • Joint probability: p ( X = x , Y = y ) – Independence – Chain rule • Conditional probability: p ( X = x | Y = y )
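
Written out, the standard definitions behind those three bullets (conditional probability, independence, chain rule) are:

```latex
% Conditional probability:
p(X = x \mid Y = y) = \frac{p(X = x, Y = y)}{p(Y = y)}
% Independence (X and Y independent):
p(X = x, Y = y) = p(X = x)\, p(Y = y)
% Chain rule:
p(X_1, \ldots, X_n) = \prod_{i=1}^{n} p(X_i \mid X_1, \ldots, X_{i-1})
```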

  4. Unigram Model • Every word in Σ is assigned some probability. • Random variables W_1 , W_2 , ... (one per word).
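
Since the unigram model treats words as independent draws, a sequence is scored as a product of per-word probabilities:

```latex
p(W_1 = w_1, \ldots, W_n = w_n) = \prod_{i=1}^{n} p(w_i)
```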

  5. Part of A Unigram Distribution
  [rank 1]            [rank 1001]
  p(the) = 0.038      p(joint) = 0.00014
  p(of) = 0.023       p(relatively) = 0.00014
  p(and) = 0.021      p(plot) = 0.00014
  p(to) = 0.017       p(DEL1SUBSEQ) = 0.00014
  p(is) = 0.013       p(rule) = 0.00014
  p(a) = 0.012        p(62.0) = 0.00014
  p(in) = 0.012       p(9.1) = 0.00014
  p(for) = 0.009      p(evaluated) = 0.00014
  ...                 ...
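
A minimal Python sketch of how such a distribution can be estimated (count and normalize) and then sampled from, as the generator slides below do; the toy token list is a placeholder, not the corpus behind the numbers above.

```python
import random
from collections import Counter

def unigram_mle(tokens):
    """Maximum-likelihood unigram probabilities: count and normalize."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def generate_unigram(prob, length=20):
    """Sample words independently from the unigram distribution."""
    words = list(prob)
    weights = [prob[w] for w in words]
    return " ".join(random.choices(words, weights=weights, k=length))

# Toy corpus standing in for the thesis text used on the slides.
tokens = "the model assigns a probability to every word in the vocabulary".split()
print(generate_unigram(unigram_mle(tokens)))
```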

  6. Unigram Model as a Generator [the slide shows a passage sampled word-by-word from the unigram distribution: corpus words, numbers, and symbols strung together with no syntactic structure]

  7. Full History Model • Every word in Σ is assigned some probability, conditioned on every history.
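
Applying the chain rule to the word sequence, the full-history factorization is exact:

```latex
p(w_1, \ldots, w_n) = \prod_{i=1}^{n} p(w_i \mid w_1, \ldots, w_{i-1})
```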

  8. Bill Clinton's unusually direct comment Wednesday on the possible role of race in the election was in keeping with the Clintons' bid to portray Obama, who is aiming to become the first black U.S. president, as the clear favorite, thereby lessening the potential fallout if Hillary Clinton does not win in South Carolina.

  9. N-Gram Model • Every word in Σ is assigned some probability, conditioned on a fixed-length history ( n – 1 ).
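
The n-gram model approximates each full history by just the previous n – 1 words:

```latex
p(w_i \mid w_1, \ldots, w_{i-1}) \approx p(w_i \mid w_{i-n+1}, \ldots, w_{i-1})
% bigram (n = 2): p(w_i | w_{i-1})    trigram (n = 3): p(w_i | w_{i-2}, w_{i-1})
```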

  10. Bigram Model as a Generator [the slide shows a passage sampled from the bigram model: fragments of thesis prose, equation numbers, and table entries that are locally plausible from word to word but globally incoherent]

  11. Trigram Model as a Generator [the slide shows a passage sampled from the trigram model: longer locally fluent stretches of thesis prose, including equation references and a citation, but still without global coherence]

  12. What’s in a word? • Is punctuation a word? – Does knowing the last “word” is a “,” help? • In speech – I do uh main- mainly business processing – Is “uh” a word?

  13. For Thought • Do N-Gram models “know” English? • Unknown words • N-gram models and finite-state automata

  14. Starting and Stopping Unigram model: ... Bigram model: ... Trigram model: ...
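
The formulas themselves are elided above; one common convention (an assumption here, not necessarily the one on the slide) pads each sentence with start and stop symbols, shown for the bigram case:

```latex
p(w_1, \ldots, w_n, \langle/\mathrm{s}\rangle)
  = p(w_1 \mid \langle\mathrm{s}\rangle)
    \left[ \prod_{i=2}^{n} p(w_i \mid w_{i-1}) \right]
    p(\langle/\mathrm{s}\rangle \mid w_n)
```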

  15. Evaluation

  16. Which model is better? • Can I get a number for how good my model is on a test set? • What is P(test_set | Model)? • We measure this by perplexity • Perplexity is the inverse probability of the test set, normalized by the number of words

  17. Perplexity
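
For a test set W = w_1 w_2 … w_N, perplexity is standardly defined as the inverse probability of the test set, normalized by the number of words:

```latex
\mathrm{PP}(W) = P(w_1 w_2 \ldots w_N)^{-1/N} = \sqrt[N]{\frac{1}{P(w_1 w_2 \ldots w_N)}}
```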

  18. Perplexity of different models • Better models have lower perplexity – WSJ: Unigram 962; Bigram 170; Trigram 109 • Different tasks have different perplexity – WSJ (109) vs Bus Information Queries (~25) • The higher the conditional probability, the lower the perplexity • Perplexity is the average branching rate
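
A quick check of the branching interpretation (not on the original slide): for a language of ten equally likely digits, any test string of N digits has probability (1/10)^N, so

```latex
\mathrm{PP}(W) = \left( \left(\tfrac{1}{10}\right)^{N} \right)^{-1/N} = 10
```

which is exactly the number of choices available at each position.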

  19. What about open-class words? • What is the probability of unseen words? – (Naïve answer is 0.0) • But that’s not what you want – The test set will usually include words not seen in training • What is the probability of – P(Nebuchadnezzur | son of )

  20. LM smoothing • Laplace or add-one smoothing – Add one to all counts – Or add “epsilon” to all counts – You still need to know all your vocabulary • Have an OOV word in your vocabulary – The probability of seeing an unseen word
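
For the unigram case, with count c(w), N training tokens, and a vocabulary of size V that includes the OOV word, add-one smoothing gives:

```latex
P_{\mathrm{Laplace}}(w) = \frac{c(w) + 1}{N + V}
% "add epsilon" variant:  P(w) = \frac{c(w) + \epsilon}{N + \epsilon V}
```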

  21. Good-Turing Smoothing • Good (1953), from Turing. – Use the count of things you’ve seen once to estimate the count of things you’ve never seen. • Calculate the frequency of frequencies of N-grams – Count of N-grams that appear 1 time – Count of N-grams that appear 2 times – Count of N-grams that appear 3 times – … – Estimate a new count c* = (c+1) N_(c+1) / N_c • Change the counts a little so we get a better estimate for count 0

  22. Good-Turing’s Discounted Counts
      AP Newswire Bigrams        Berkeley Restaurants Bigrams   Smith Thesis Bigrams
  c   N_c             c*         N_c        c*                  N_c      c*
  0   74,671,100,000  0.0000270  2,081,496  0.002553            x        38,048 / x
  1   2,018,046       0.446      5,315      0.533960            38,048   0.21147
  2   449,721         1.26       1,419      1.357294            4,032    1.05071
  3   188,933         2.24       642        2.373832            1,409    2.12633
  4   105,668         3.24       381        4.081365            749      2.63685
  5   68,379          4.22       311        3.781350            395      3.91899
  6   48,190          5.19       196        4.500000            258      4.42248
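
To check the c* formula against the table, take the Berkeley Restaurants column at c = 1:

```latex
c^{*} = (c + 1)\,\frac{N_{c+1}}{N_c} = 2 \times \frac{1419}{5315} \approx 0.534
```

which matches the 0.533960 entry above.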

  23. Backoff • If no trigram, use bigram • If no bigram, use unigram • If no unigram … smooth the unigrams

  24. Estimating p ( w | history ) • Relative frequencies (count & normalize) • Transform the counts: – Laplace/“add one”/“add λ” – Good-Turing discounting • Interpolate or “back off”: – With Good-Turing discounting: Katz backoff – “Stupid” backoff – Absolute discounting: Kneser-Ney
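
As one concrete instance of the backoff idea, here is a minimal Python sketch of “stupid” backoff over pre-computed count dictionaries. The argument names, the count-dict layout, and the Laplace-smoothed unigram floor are assumptions of this sketch; 0.4 is the penalty used in the original stupid-backoff paper.

```python
def stupid_backoff(w, prev2, prev1, trigrams, bigrams, unigrams, total, alpha=0.4):
    """Score S(w | prev2 prev1): use the longest n-gram with a nonzero count,
    multiplying in the penalty alpha once per level backed off.
    trigrams/bigrams are count dicts keyed by word tuples, unigrams by word;
    total is the number of training tokens."""
    if trigrams.get((prev2, prev1, w), 0) > 0:
        return trigrams[(prev2, prev1, w)] / bigrams[(prev2, prev1)]
    if bigrams.get((prev1, w), 0) > 0:
        return alpha * bigrams[(prev1, w)] / unigrams[prev1]
    # Floor: Laplace-smoothed unigram, so unseen words never score exactly zero.
    return alpha * alpha * (unigrams.get(w, 0) + 1) / (total + len(unigrams) + 1)
```

Unlike Katz backoff, these scores are not normalized probabilities, which is why the method is called “stupid”; it works well when the counts come from very large corpora.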
