Natural Language Processing Lecture 5: Language Models and Smoothing - PowerPoint PPT Presentation
Language Modeling
- Is this sentence good?
– This is a pen
– Pen this is a
- Help choose between options; help score options
– 他向记者介绍了发言的主要内容 (source)
– He briefed to reporters on the chief contents of the statement
– He briefed reporters on the chief contents of the statement
– He briefed to reporters on the main contents of the statement
– He briefed reporters on the main contents of the statement
One-Slide Review of Probability Terminology
- Random variables take different values, depending on chance.
- Notation:
– p(X = x) is the probability that r.v. X takes value x
– p(x) is shorthand for the same
– p(X) is the distribution over the values X can take (a function)
- Joint probability: p(X = x, Y = y)
– Independence
– Chain rule
- Conditional probability: p(X = x | Y = y)
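The definitions above can be checked numerically on a toy joint distribution; the rain/umbrella numbers below are invented purely for illustration.

```python
# Toy check of conditional probability and the chain rule:
# p(x, y) = p(x) * p(y | x). The joint distribution is hypothetical.
joint = {
    ("rain", "umbrella"): 0.3,
    ("rain", "no_umbrella"): 0.1,
    ("sun", "umbrella"): 0.1,
    ("sun", "no_umbrella"): 0.5,
}

def p_x(x):
    """Marginal p(X = x): sum the joint over all values of Y."""
    return sum(p for (xi, _), p in joint.items() if xi == x)

def p_y_given_x(y, x):
    """Conditional p(Y = y | X = x) = p(x, y) / p(x)."""
    return joint[(x, y)] / p_x(x)

# Chain rule holds for every cell of the joint table.
for (x, y), p_joint in joint.items():
    assert abs(p_joint - p_x(x) * p_y_given_x(y, x)) < 1e-12
```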
Unigram Model
- Every word in Σ is assigned some probability.
- Random variables W1, W2, ... (one per word).
Part of a Unigram Distribution
[rank 1] p(the) = 0.038, p(of) = 0.023, p(and) = 0.021, p(to) = 0.017, p(is) = 0.013, p(a) = 0.012, p(in) = 0.012, p(for) = 0.009, …
[rank 1001] p(joint) = 0.00014, p(relatively) = 0.00014, p(plot) = 0.00014, p(DEL1SUBSEQ) = 0.00014, p(rule) = 0.00014, p(62.0) = 0.00014, p(9.1) = 0.00014, p(evaluated) = 0.00014, …
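A distribution like this comes from counting and normalizing (maximum-likelihood estimation). A minimal sketch on a toy stand-in corpus (the slide's actual corpus is a collection of research-paper text):

```python
from collections import Counter

# MLE unigram estimates: p(w) = count(w) / total token count.
corpus = "the cat sat on the mat the cat ran".split()
counts = Counter(corpus)
total = sum(counts.values())
unigram = {w: c / total for w, c in counts.items()}

print(unigram["the"])  # 3 occurrences out of 9 tokens
```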
Unigram Model as a Generator
[Sample text generated by drawing words independently from the unigram distribution; the extracted text is corrupted, but the sample reads as frequency-realistic word salad with no syntax.]
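A sketch of the generator, assuming a toy corpus in place of the slide's thesis text. Each word is drawn independently of its neighbors, which is exactly why the output has realistic word frequencies but no grammar.

```python
import random
from collections import Counter

# Fit unigram counts, then sample words i.i.d. in proportion to them.
corpus = "the cat sat on the mat the cat ran".split()
counts = Counter(corpus)
words = list(counts)
weights = [counts[w] for w in words]

def generate_unigram(n, rng=random):
    """Sample n words independently from the unigram distribution."""
    return " ".join(rng.choices(words, weights=weights, k=n))

print(generate_unigram(10))
```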
Full History Model
- Every word in Σ is assigned some probability,
conditioned on every preceding history.
[Sample text generated by the full-history model; the extracted text is corrupted, but the sample is a fluent news passage about Bill Clinton, Obama, and Hillary Clinton in South Carolina — fluent because conditioning on the entire history leaves the model able only to reproduce training text.]
N-Gram Model
- Every word in Σ is assigned some probability,
conditioned on a fixed-length history of n – 1 words.
Bigram Model as a Generator
[Sample text generated by the bigram model; the extracted text is corrupted, but the sample mixes numbers, equation references, and corpus tokens (MLE, K&M, treebank part-of-speech tags) into locally plausible, globally incoherent strings.]
Trigram Model as a Generator
[Sample text generated by the trigram model; the extracted text is corrupted, but the sample strings together longer locally fluent fragments — equation references, citations, phrases from the training corpus — that still fail to cohere globally.]
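The generators behind these samples can be sketched as follows; the tiny training sentences and the `<s>`/`</s>` padding convention are illustrative assumptions. Each word is drawn conditioned only on the previous word, which yields locally plausible but globally incoherent text.

```python
import random
from collections import defaultdict

def train_bigram(sentences):
    """Collect, for each word, the list of words that follow it.

    Sampling uniformly from the successor list is equivalent to
    sampling from the MLE bigram distribution."""
    nxt = defaultdict(list)
    for s in sentences:
        toks = ["<s>"] + s.split() + ["</s>"]
        for a, b in zip(toks, toks[1:]):
            nxt[a].append(b)
    return nxt

def generate(nxt, rng=random, max_len=20):
    """Walk the chain from <s> until </s> or a length cap."""
    out, w = [], "<s>"
    for _ in range(max_len):
        w = rng.choice(nxt[w])
        if w == "</s>":
            break
        out.append(w)
    return " ".join(out)

model = train_bigram(["the cat sat", "the dog sat", "a cat ran"])
print(generate(model))
```

A trigram generator is the same sketch with two-word histories as keys.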
What’s in a word
- Is punctuation a word?
– Does knowing the last “word” is a “,” help?
- In speech
– I do uh main- mainly business processing
– Is “uh” a word?
For Thought
- Do N-Gram models “know” English?
- Unknown words
- N-gram models and finite-state automata
Starting and Stopping
Unigram model:
...
Bigram model:
...
Trigram model:
...
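One standard way to handle starting and stopping is to pad every sentence with `<s>` and `</s>` symbols, so a bigram model can assign a probability to a whole sentence, including its ending. A minimal sketch on an assumed toy corpus:

```python
from collections import Counter

# Pad each training sentence with start/stop symbols and collect
# bigram and history counts.
sents = ["the cat sat", "the dog sat"]
bigrams, unigrams = Counter(), Counter()
for s in sents:
    toks = ["<s>"] + s.split() + ["</s>"]
    unigrams.update(toks[:-1])          # histories only
    bigrams.update(zip(toks, toks[1:]))

def p_sentence(s):
    """MLE bigram probability of a full sentence, stop symbol included."""
    toks = ["<s>"] + s.split() + ["</s>"]
    p = 1.0
    for a, b in zip(toks, toks[1:]):
        p *= bigrams[(a, b)] / unigrams[a]
    return p

# p(the|<s>) = 1, p(cat|the) = 0.5, p(sat|cat) = 1, p(</s>|sat) = 1
print(p_sentence("the cat sat"))  # 0.5
```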
Evaluation
Which model is better?
- Can I get a number about how good my model
is for a test set?
- What is P(test set | model)?
- We measure this by Perplexity
- Perplexity is the inverse probability of the test
set, normalized by the number of words
Perplexity
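Stated as a formula (the standard definition, consistent with the description above), for a test set W = w1 w2 … wN:

```latex
\mathrm{PP}(W) = P(w_1 w_2 \cdots w_N)^{-\frac{1}{N}}
             = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 \cdots w_{i-1})}}
```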
Perplexity of different models
- Better models have lower perplexity
– WSJ: Unigram 962; Bigram 170; Trigram 109
- Different tasks have different perplexity
– WSJ (109) vs Bus Information Queries (~25)
- The higher the conditional probability,
the lower the perplexity
- Perplexity is the average branching factor
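The branching-factor reading can be checked directly: a model that is uniform over a vocabulary of V words has perplexity exactly V on any test text. A minimal sketch (the 25-word vocabulary is illustrative, echoing the ~25 figure quoted for the bus-information domain):

```python
import math

def perplexity(log_probs):
    """Perplexity from per-word log probabilities (natural log):
    exp of the negative average log probability."""
    n = len(log_probs)
    return math.exp(-sum(log_probs) / n)

V = 25          # hypothetical vocabulary size
test_len = 100  # length of the test text
uniform_logps = [math.log(1.0 / V)] * test_len
print(perplexity(uniform_logps))  # 25.0, up to float rounding
```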
What about open-class words?
- What is the probability of unseen words?
– (Naïve answer is 0.0)
- But that’s not what you want
– Test set will usually include words not in training
- What is the probability of
– P(Nebuchadnezzar | son of )
LM smoothing
- Laplace or add-one smoothing
– Add one to all counts
– Or add “epsilon” to all counts
– You still need to know all your vocabulary
- Have an OOV word in your vocabulary
– The probability of seeing an unseen word
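A minimal sketch of add-one smoothing with an explicit OOV type; the toy corpus and the `<unk>` naming are assumptions. Every count is incremented by one, and the reserved `<unk>` entry gives unseen words nonzero probability.

```python
from collections import Counter

corpus = "the cat sat on the mat".split()
counts = Counter(corpus)
vocab = set(counts) | {"<unk>"}  # fixed vocabulary, incl. the OOV type
V = len(vocab)
N = len(corpus)

def p_laplace(w):
    """Add-one estimate; unknown words map to the <unk> type."""
    w = w if w in vocab else "<unk>"
    return (counts[w] + 1) / (N + V)

# The smoothed distribution still sums to one over the vocabulary.
assert abs(sum(p_laplace(w) for w in vocab) - 1.0) < 1e-9
print(p_laplace("zebra"))  # same as p(<unk>): 1 / (6 + 6)
```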
Good-Turing Smoothing
- Good (1953), from an idea of Turing's.
– Use the count of things you’ve seen once to estimate the count of things you’ve never seen.
- Calculate the frequency of frequencies of N-grams
– Count of N-grams that appear 1 time
– Count of N-grams that appear 2 times
– Count of N-grams that appear 3 times
– …
– Estimated new count: c* = (c + 1) N_{c+1} / N_c
- Change the counts a little so we get a better estimate for
count 0
Good-Turing’s Discounted Counts
         AP Newswire Bigrams      Berkeley Restaurants Bigrams   Smith Thesis Bigrams
 c       Nc              c*           Nc         c*              Nc       c*
 0   74,671,100,000   0.0000270   2,081,496   0.002553            x   38,048 / x
 1        2,018,046   0.446           5,315   0.533960       38,048   0.21147
 2          449,721   1.26            1,419   1.357294        4,032   1.05071
 3          188,933   2.24              642   2.373832        1,409   2.12633
 4          105,668   3.24              381   4.081365          749   2.63685
 5           68,379   4.22              311   3.781350          395   3.91899
 6           48,190   5.19              196   4.500000          258   4.42248
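The discounted counts can be recomputed from the formula c* = (c + 1) N_{c+1} / N_c using the AP Newswire column of the table above:

```python
# Frequency-of-frequency counts N_c for AP Newswire bigrams.
ap_Nc = {0: 74_671_100_000, 1: 2_018_046, 2: 449_721,
         3: 188_933, 4: 105_668, 5: 68_379, 6: 48_190}

def c_star(c, Nc):
    """Good-Turing discounted count: c* = (c + 1) * N_{c+1} / N_c."""
    return (c + 1) * Nc[c + 1] / Nc[c]

print(round(c_star(1, ap_Nc), 3))  # 0.446, matching the table
```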
Backoff
- If no trigram, use bigram
- If no bigram, use unigram
- If no unigram … smooth the unigrams
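The chain above can be sketched as a simplified backoff (it omits the discounting and backoff weights that a proper Katz backoff needs to stay normalized; the probability tables are hypothetical):

```python
def p_backoff(w, hist, tri, bi, uni, vocab_size):
    """Try the trigram estimate, then bigram, then unigram,
    then a crude uniform floor for unseen unigrams."""
    h2, h1 = hist
    if (h2, h1, w) in tri:
        return tri[(h2, h1, w)]
    if (h1, w) in bi:
        return bi[(h1, w)]
    if w in uni:
        return uni[w]
    return 1.0 / vocab_size  # smoothed floor for unseen words

# Hypothetical tables, just to exercise each branch:
tri = {("the", "black", "cat"): 0.4}
bi = {("black", "dog"): 0.2}
uni = {"cat": 0.05, "dog": 0.03}

print(p_backoff("cat", ("the", "black"), tri, bi, uni, 1000))  # 0.4
print(p_backoff("dog", ("the", "black"), tri, bi, uni, 1000))  # 0.2
```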
Estimating p(w | history)
- Relative frequencies (count & normalize)
- Transform the counts:
– Laplace/“add one”/“add λ”
– Good-Turing discounting
- Interpolate or “backoff”:
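Linear interpolation, the companion to backoff, mixes the three estimates with weights that sum to one; the lambda values here are illustrative (in practice they are tuned on held-out data):

```python
def p_interp(p1, p2, p3, lambdas=(0.1, 0.3, 0.6)):
    """p(w | u, v) = l1*p_uni(w) + l2*p_bi(w | v) + l3*p_tri(w | u, v),
    with l1 + l2 + l3 = 1 so the mixture is a valid distribution."""
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9
    return l1 * p1 + l2 * p2 + l3 * p3

# Hypothetical component estimates for one word in one context:
print(p_interp(0.01, 0.1, 0.5))  # 0.001 + 0.03 + 0.3 = 0.331
```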