Natural Language Processing Lecture 5: Language Models and Smoothing


SLIDE 1

Natural Language Processing

Lecture 5: Language Models and Smoothing

SLIDE 2

Language Modeling

  • Is this sentence good?

– This is a pen
– Pen this is a

  • Help choose between options, help score options

– 他向记者介绍了发言的主要内容 ("He briefed reporters on the main contents of the statement")
– He briefed to reporters on the chief contents of the statement
– He briefed reporters on the chief contents of the statement
– He briefed to reporters on the main contents of the statement
– He briefed reporters on the main contents of the statement

SLIDE 3

One-Slide Review

  • Probability Terminology
  • Random variables take different values, depending on chance.
  • Notation:

– p(X = x) is the probability that r.v. X takes value x
– p(x) is shorthand for the same
– p(X) is the distribution over values X can take (a function)

  • Joint probability: p(X = x, Y = y)

– Independence
– Chain rule

  • Conditional probability: p(X = x | Y = y)
SLIDE 4

Unigram Model

  • Every word in Σ is assigned some probability.
  • Random variables W1, W2, ... (one per word).
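
A minimal sketch of how such a model could be estimated and sampled, assuming the training data is simply a Python list of tokens (the corpus and function names are illustrative, not from the slides):

```python
import random
from collections import Counter

def train_unigram(tokens):
    """Estimate p(w) for every word by relative frequency (MLE)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def sample_unigram(probs, length=20):
    """Generate text by drawing each word independently from p(w)."""
    words = list(probs.keys())
    weights = list(probs.values())
    return " ".join(random.choices(words, weights=weights, k=length))

# Toy example:
corpus = "the cat sat on the mat the dog sat on the log".split()
model = train_unigram(corpus)
print(sample_unigram(model, length=10))
```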
SLIDE 5

Part of a Unigram Distribution

[rank 1] p(the) = 0.038, p(of) = 0.023, p(and) = 0.021, p(to) = 0.017, p(is) = 0.013, p(a) = 0.012, p(in) = 0.012, p(for) = 0.009, ...

[rank 1001] p(joint) = 0.00014, p(relatively) = 0.00014, p(plot) = 0.00014, p(DEL1SUBSEQ) = 0.00014, p(rule) = 0.00014, p(62.0) = 0.00014, p(9.1) = 0.00014, p(evaluated) = 0.00014, ...

SLIDE 6

Unigram Model as a Generator

[Random text sampled from the unigram model.]

SLIDE 7

Full History Model

  • Every word in Σ is assigned some probability, conditioned on every history.

SLIDE 8

Bill Clinton's unusually direct comment Wednesday on the possible role of race in the election was in keeping with the Clintons' bid to portray Obama, who is aiming to become the first black U.S. president, as the clear favorite, thereby lessening the potential fallout if Hillary Clinton does not win in South Carolina.

SLIDE 9

N-Gram Model

  • Every word in Σ is assigned some probability, conditioned on a fixed-length history (n – 1).
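
In the usual notation (the slide's own equations are not reproduced in this transcript), the n-gram assumption replaces the full history with the last n – 1 words:

```latex
p(w_1,\dots,w_m) \;=\; \prod_{i=1}^{m} p(w_i \mid w_1,\dots,w_{i-1})
               \;\approx\; \prod_{i=1}^{m} p(w_i \mid w_{i-n+1},\dots,w_{i-1})
```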

SLIDE 10

Bigram Model as a Generator

[Random text sampled from the bigram model.]

SLIDE 11

Trigram Model as a Generator

[Random text sampled from the trigram model.]

SLIDE 12

What’s in a word

  • Is punctuation a word?

– Does knowing the last “word” is a “,” help?

  • In speech

– I do uh main- mainly business processing
– Is "uh" a word?

SLIDE 13

For Thought

  • Do N-Gram models “know” English?
  • Unknown words
  • N-gram models and finite-state automata
SLIDE 14

Starting and Stopping

Unigram model:

...

Bigram model:

...

Trigram model:

...
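
The slide's equations are not reproduced in this transcript; one standard way to write the bigram case, assuming a start symbol <s> and a stop symbol </s>, is:

```latex
p(w_1,\dots,w_m,\texttt{</s>}) \;=\; p(w_1 \mid \texttt{<s>})\,
  \Bigl[\prod_{i=2}^{m} p(w_i \mid w_{i-1})\Bigr]\, p(\texttt{</s>} \mid w_m)
```

Higher-order models condition on additional start symbols; the unigram model needs only the stop symbol to define a proper distribution over sentences of any length.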

SLIDE 15

Evaluation

SLIDE 16

Which model is better?

  • Can I get a number for how good my model is on a test set?
  • What is P(test set | model)?
  • We measure this by perplexity
  • Perplexity is the probability of the test set, normalized by the number of words

SLIDE 17

Perplexity
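
The slide's formula is not reproduced in this transcript; the standard definition, consistent with the description on the previous slide, is, for a test set W = w_1 ... w_N:

```latex
\mathrm{PP}(W) \;=\; P(w_1 w_2 \dots w_N)^{-\frac{1}{N}}
              \;=\; \sqrt[N]{\frac{1}{P(w_1 w_2 \dots w_N)}}
```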

SLIDE 18

Perplexity of different models

  • Better models have lower perplexity

– WSJ: Unigram 962; Bigram 170; Trigram 109

  • Different tasks have different perplexities

– WSJ (109) vs. Bus Information Queries (~25)

  • The higher the conditional probability, the lower the perplexity
  • Perplexity is the average branching factor
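
A sketch of how such a perplexity number could be computed for a bigram model, assuming a helper bigram_prob(prev, word) that returns a smoothed conditional probability (the helper is hypothetical, not something defined on these slides):

```python
import math

def perplexity(test_tokens, bigram_prob):
    """Perplexity = exp of the average negative log-probability per predicted word."""
    log_prob = 0.0
    for prev, word in zip(test_tokens, test_tokens[1:]):
        log_prob += math.log(bigram_prob(prev, word))  # must be > 0, hence smoothing
    n = len(test_tokens) - 1  # number of predicted words
    return math.exp(-log_prob / n)
```

The WSJ figures above (962 / 170 / 109) are perplexities of this kind, measured on held-out text.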
SLIDE 19

What about open-class words?

  • What is the probability of unseen words?

– (Naïve answer is 0.0)

  • But that’s not what you want

– Test set will usually include words not in training

  • What is the probability of

– P(Nebuchadnezzur | son of )

SLIDE 20

LM smoothing

  • Laplace or add-one smoothing

– Add one to all counts
– Or add "epsilon" to all counts
– You still need to know all your vocabulary

  • Have an OOV word in your vocabulary

– The probability of seeing an unseen word
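
For bigrams, add-one (Laplace) smoothing takes the standard form below, where V is the vocabulary size (including the OOV token); replacing the 1 with a small ε gives the "add-epsilon" variant mentioned above:

```latex
P_{\text{add-1}}(w_i \mid w_{i-1}) \;=\; \frac{C(w_{i-1} w_i) + 1}{C(w_{i-1}) + V}
```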

SLIDE 21

Good-Turing Smoothing

  • Good (1953), from Turing.

– Use the count of things you've seen once to estimate the count of things you've never seen.

  • Calculate the frequency of frequencies of N-grams

– Count of N-grams that appear 1 time
– Count of N-grams that appear 2 times
– Count of N-grams that appear 3 times
– …
– Estimate the new count as c* = (c + 1) · N_{c+1} / N_c

  • Change the counts a little so we get a better estimate for count 0
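
A minimal sketch of the count-of-counts computation, assuming the N-gram counts are already in a Python dict (names are illustrative):

```python
from collections import Counter

def good_turing(ngram_counts):
    """Discounted counts c* = (c + 1) * N_{c+1} / N_c, plus the mass reserved for unseen N-grams."""
    # N_c = number of distinct N-grams seen exactly c times ("frequency of frequencies")
    freq_of_freq = Counter(ngram_counts.values())
    discounted = {}
    for c, n_c in freq_of_freq.items():
        n_next = freq_of_freq.get(c + 1, 0)
        if n_next > 0:  # c* is undefined where N_{c+1} = 0; handled by smoothing N_c in practice
            discounted[c] = (c + 1) * n_next / n_c
    total = sum(ngram_counts.values())
    p_unseen = freq_of_freq.get(1, 0) / total  # probability mass given to count-0 events
    return discounted, p_unseen
```

In practice, large counts are usually left undiscounted and the N_c curve is smoothed before applying the formula.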

SLIDE 22

Good-Turing’s Discounted Counts

          AP Newswire bigrams          Berkeley Restaurant bigrams    Smith thesis bigrams
c         Nc              c*           Nc           c*                Nc          c*
0         74,671,100,000  0.0000270    2,081,496    0.002553          x           38,048 / x
1         2,018,046       0.446        5,315        0.533960          38,048      0.21147
2         449,721         1.26         1,419        1.357294          4,032       1.05071
3         188,933         2.24         642          2.373832          1,409       2.12633
4         105,668         3.24         381          4.081365          749         2.63685
5         68,379          4.22         311          3.781350          395         3.91899
6         48,190          5.19         196          4.500000          258         4.42248
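
As a quick check on the table, the c = 1 rows follow directly from the formula and the counts in the c = 2 rows:

```latex
c^*_{\text{AP}}(1) = \frac{2 \times N_2}{N_1} = \frac{2 \times 449{,}721}{2{,}018{,}046} \approx 0.446,
\qquad
c^*_{\text{Berkeley}}(1) = \frac{2 \times 1{,}419}{5{,}315} \approx 0.534
```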

SLIDE 23

Backoff

  • If no trigram, use bigram
  • If no bigram, use unigram
  • If no unigram … smooth the unigrams
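
A sketch of this fall-back logic in the "stupid backoff" style (a fixed weight α instead of proper discounting), assuming trigram/bigram/unigram count dicts; the function and variable names are illustrative:

```python
def backoff_score(w1, w2, w3, tri, bi, uni, total, alpha=0.4):
    """Score for p(w3 | w1 w2): use the trigram if seen, otherwise back off with weight alpha."""
    if tri.get((w1, w2, w3), 0) > 0:
        return tri[(w1, w2, w3)] / bi[(w1, w2)]
    if bi.get((w2, w3), 0) > 0:
        return alpha * bi[(w2, w3)] / uni[w2]
    if uni.get(w3, 0) > 0:
        return alpha * alpha * uni[w3] / total
    return alpha ** 3 / total  # "smooth the unigrams": tiny score for unseen words
```

These are scores rather than true probabilities (they do not sum to 1), which is what distinguishes "stupid" backoff from Katz backoff on the next slide.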
SLIDE 24

Estimating p(w | history)

  • Relative frequencies (count & normalize)
  • Transform the counts:

– Laplace / "add one" / "add λ"
– Good-Turing discounting

  • Interpolate or "back off":

– With Good-Turing discounting: Katz backoff
– "Stupid" backoff
– Absolute discounting: Kneser-Ney
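
Simple linear interpolation, one way to realize the "interpolate" option above, mixes all the model orders with weights that sum to one (the λ's are typically tuned on held-out data):

```latex
\hat{p}(w_i \mid w_{i-2}, w_{i-1}) \;=\;
  \lambda_3\, p(w_i \mid w_{i-2}, w_{i-1})
+ \lambda_2\, p(w_i \mid w_{i-1})
+ \lambda_1\, p(w_i),
\qquad \lambda_1 + \lambda_2 + \lambda_3 = 1
```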