Language models (Chapter 3 in Martin/Jurafsky), 10/17/19
  1. N-gram models using the Markov assumption
     • $P(w_1 w_2 \ldots w_n) \approx \prod_i P(w_i \mid w_{i-k} \ldots w_{i-1})$
     • In other words, we approximate each component in the chain-rule product as $P(w_i \mid w_1 w_2 \ldots w_{i-1}) \approx P(w_i \mid w_{i-k} \ldots w_{i-1})$
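
     To make the approximation concrete, here is a minimal Python sketch (not from the slides); the function name `sentence_prob` and the `ngram_prob` lookup table are illustrative assumptions.

     ```python
     def sentence_prob(sentence, ngram_prob, k=1):
         """Approximate P(w_1 ... w_n) as a product of P(w_i | w_{i-k} ... w_{i-1}).

         ngram_prob is assumed to map (context_tuple, word) to a probability;
         with k=1 this is a bigram model.
         """
         words = ["<s>"] + sentence.split() + ["</s>"]
         prob = 1.0
         for i in range(1, len(words)):
             context = tuple(words[max(0, i - k):i])  # truncated history (Markov assumption)
             prob *= ngram_prob.get((context, words[i]), 0.0)
         return prob
     ```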

  2. Estimating bigram probabilities
     • The Maximum Likelihood estimate: $P(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i)}{c(w_{i-1})}$, where c(xy) is the count of the bigram xy
     • Example corpus:
       <s> I am Sam </s>
       <s> Sam I am </s>
       <s> I do not like green eggs and ham </s>
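
     The toy corpus above is small enough to check by hand; the following sketch (illustrative, not from the slides) counts bigrams and reproduces the usual values.

     ```python
     from collections import Counter

     corpus = [
         "<s> I am Sam </s>",
         "<s> Sam I am </s>",
         "<s> I do not like green eggs and ham </s>",
     ]

     unigram_counts = Counter()
     bigram_counts = Counter()
     for sentence in corpus:
         words = sentence.split()
         unigram_counts.update(words)
         bigram_counts.update(zip(words, words[1:]))

     def bigram_mle(prev, word):
         # P(word | prev) = c(prev, word) / c(prev)
         return bigram_counts[(prev, word)] / unigram_counts[prev]

     print(bigram_mle("<s>", "I"))    # 2/3
     print(bigram_mle("I", "am"))     # 2/3
     print(bigram_mle("Sam", "</s>")) # 1/2
     ```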

  3. Example: Berkeley Restaurant Project sentences
     • can you tell me about any good cantonese restaurants close by
     • mid priced thai food is what i’m looking for
     • tell me about chez panisse
     • can you give me a listing of the kinds of food that are available
     • i’m looking for a good place to eat breakfast
     • when is caffe venezia open during the day
     • Raw bigram counts: a table of bigram counts over these words, computed out of 9222 sentences (table not reproduced here)

  4. Raw bigram probabilities
     • Normalize each bigram count by the unigram count of its first word (table not reproduced here)
     • Result: bigram estimates of sentence probabilities, e.g.
       P(<s> I want english food </s>) = P(I|<s>) × P(want|I) × P(english|want) × P(food|english) × P(</s>|food) = .000031
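
     A hedged sketch of the chain of bigram factors: the two probabilities quoted later in these slides (P(i|<s>) = .25 and P(english|want) = .0011) are used as given, while the remaining three values are assumptions chosen only so that the product lands near the quoted .000031.

     ```python
     bigram_p = {
         ("<s>", "i"): 0.25,           # quoted in these slides
         ("i", "want"): 0.33,          # assumed, illustrative
         ("want", "english"): 0.0011,  # quoted in these slides
         ("english", "food"): 0.5,     # assumed, illustrative
         ("food", "</s>"): 0.68,       # assumed, illustrative
     }

     sentence = ["<s>", "i", "want", "english", "food", "</s>"]
     p = 1.0
     for prev, word in zip(sentence, sentence[1:]):
         p *= bigram_p[(prev, word)]
     print(p)  # ~3.1e-05, i.e. about .000031
     ```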

  5. What is encoded in bigram statistics?
     • P(english|want) = .0011, P(chinese|want) = .0065
     • P(to|want) = .66
     • P(eat|to) = .28
     • P(food|to) = 0, P(want|spend) = 0
     • P(i|<s>) = .25
     • Practical issue: better to do everything in log space
       – Avoids underflow
       – (also, adding is faster than multiplying)
       – $\log(p_1 \times p_2 \times p_3 \times p_4) = \log p_1 + \log p_2 + \log p_3 + \log p_4$
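
     A small sketch of the log-space trick, assuming the five bigram factors from the example above; only standard-library calls are used.

     ```python
     import math

     # The bigram factors from the sentence-probability example (partly assumed, see above)
     probs = [0.25, 0.33, 0.0011, 0.5, 0.68]

     log_p = sum(math.log(p) for p in probs)  # sum of logs instead of product of probabilities
     print(log_p)             # log P(sentence): stays in a safe numeric range
     print(math.exp(log_p))   # back to probability space: ~3.1e-05
     ```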

  6. Google N-Gram Release, August 2006 …
     • http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
     • https://books.google.com/ngrams
     • Example counts:
       serve as the incoming 92
       serve as the incubator 99
       serve as the independent 794
       serve as the index 223
       serve as the indication 72
       serve as the indicator 120
       serve as the indicators 45
       serve as the indispensable 111
       serve as the indispensible 40
       serve as the individual 234

  7. Evaluation: How good is our model?
     • Does our language model prefer good sentences to bad ones?
       – It should assign higher probability to “real” or “frequently observed” sentences than to “ungrammatical” or “rarely observed” ones
     • We train the parameters of our model on a training set
     • We test the model’s performance on data we haven’t seen
       – A test set is an unseen dataset, different from our training set and totally unused during training
       – An evaluation metric tells us how well our model does on the test set
     • Training on the test set
       – Testing on data from the training set assigns it an artificially high probability
       – “Training on the test set” is bad science!
       – And it violates the honor code

  8. Extrinsic evaluation of N-gram models
     • Best evaluation for comparing models A and B
       – Put each model in a task (spelling corrector, speech recognizer, MT system)
       – Run the task and get an accuracy for A and for B (how many misspelled words are corrected properly, how many words are translated correctly)
       – Compare the accuracy of A and B
     • Difficulty: extrinsic evaluation can be time-consuming
       – Is there an easier way?
       – Sometimes we use an intrinsic evaluation instead: perplexity

  9. The intuition for perplexity
     • The Shannon Game: how well can we predict the next word?
       – I always order pizza with cheese and ____
       – The 33rd President of the US was ____
       – I saw a ____
       – For the first blank a model might assign probabilities such as: mushrooms 0.1, pepperoni 0.1, anchovies 0.01, …, fried rice 0.0001, …, and 1e-100
       – Unigrams are terrible at this game. (Why?)
     • A better model of a text is one which assigns a higher probability to the word that actually occurs
     • Perplexity: the best language model is one that best predicts an unseen test set, i.e. gives the highest P(sentence)
       – Perplexity is the inverse probability of the test set, normalized by the number of words: $PP(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1 w_2 \ldots w_N)}}$
       – Expand P(w_1 w_2 … w_N) with the chain rule
       – Minimizing perplexity is the same as maximizing probability
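
     As a sketch of how test-set perplexity might be computed under a bigram model (illustrative, not from the slides; it assumes a `bigram_prob(prev, word)` estimator and, in practice, smoothing so that no factor is zero).

     ```python
     import math

     def bigram_perplexity(test_sentences, bigram_prob):
         """PP = exp(-(1/N) * sum_i log P(w_i | w_{i-1})) over all test-set words."""
         log_p, n = 0.0, 0
         for sentence in test_sentences:
             words = sentence.split()
             for prev, word in zip(words, words[1:]):
                 log_p += math.log(bigram_prob(prev, word))  # assumes no zero probabilities
                 n += 1
         return math.exp(-log_p / n)

     # e.g., with some smoothed estimator p_hat(prev, word):
     # print(bigram_perplexity(["<s> i want english food </s>"], p_hat))
     ```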

  10. Perplexity (continued)
     • $PP(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1 w_2 \ldots w_N)}}$; expand P with the chain rule, and for bigrams replace each factor by $P(w_i \mid w_{i-1})$
     • Intuition on perplexity
       – Suppose a sentence consists of random digits
       – What is the perplexity of this sentence according to a model that assigns P = 1/10 to each digit?
       – $PP(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}} = \left(\left(\tfrac{1}{10}\right)^{N}\right)^{-\frac{1}{N}} = \left(\tfrac{1}{10}\right)^{-1} = 10$
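
     A quick numeric check of the digit example, assuming every digit gets probability 1/10: the perplexity is 10 regardless of the sequence length.

     ```python
     import math

     def perplexity(probs):
         # PP(W) = P(w_1 ... w_N) ** (-1/N), computed in log space for stability
         n = len(probs)
         log_p = sum(math.log(p) for p in probs)
         return math.exp(-log_p / n)

     digits = [1 / 10] * 1000   # 1000 digits, each assigned probability 1/10
     print(perplexity(digits))  # ~10.0
     ```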

  11. Lower perplexity = better model
     • Training on 38 million words and testing on 1.5 million words of WSJ:
       – Unigram perplexity: 962
       – Bigram perplexity: 170
       – Trigram perplexity: 109
     • Digression: information theory
       – I am thinking of an integer between 0 and 1,023. You want to guess it using the fewest number of questions.
       – Most of us would ask “is it less than 512?”
       – This is a good strategy because it provides the most information about the unknown number: it gives the first binary digit of the number.
       – Initially you need to obtain log2(1024) = 10 bits of information; after the first question you only need log2(512) = 9 bits.
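
     A sketch of the guessing game as a binary search over 0..1023, which always takes log2(1024) = 10 yes/no questions (the function name is illustrative, not from the slides).

     ```python
     import math

     def questions_needed(secret, lo=0, hi=1023):
         """Count the yes/no questions a halving strategy asks to pin down `secret`."""
         questions = 0
         while lo < hi:
             mid = (lo + hi) // 2
             questions += 1            # ask: "is it <= mid?"
             if secret <= mid:
                 hi = mid
             else:
                 lo = mid + 1
         return questions

     print(math.log2(1024))      # 10.0 bits needed initially
     print(questions_needed(777))  # 10 questions, whatever the secret is
     ```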

  12. Information and Entropy
     • By halving the search space we obtained one bit.
     • In general, the information associated with a probabilistic outcome of probability p is $I(p) = -\log p$
     • Why the logarithm? Assume we have two independent events x and y; we would like the information they carry to be additive. Let’s check: $I(x, y) = -\log P(x, y) = -\log P(x)P(y) = -\log P(x) - \log P(y) = I(x) + I(y)$
     • Now we can define the entropy, or information, associated with a random variable X: $H(X) = -\sum_{x \in \chi} p(x) \log_2 p(x)$, where $\chi$ is the space the observations belong to (words in the NLP setting)
     • The logarithm can, in principle, be computed in any base; when it is in base 2, entropy is measured in bits
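
     A minimal sketch of these definitions in Python, assuming a distribution is given as a dict of outcome probabilities; it also checks the additivity of information for two independent events.

     ```python
     import math

     def info(p):
         # Information of an outcome with probability p, in bits
         return -math.log2(p)

     def entropy(dist):
         """Entropy in bits of a distribution given as {outcome: probability}."""
         return -sum(p * math.log2(p) for p in dist.values() if p > 0)

     print(entropy({"heads": 0.5, "tails": 0.5}))                   # 1.0 bit
     print(entropy({"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}))   # 2.0 bits

     # Additivity for independent x, y with P(x) = 0.5 and P(y) = 0.25:
     print(info(0.5 * 0.25), info(0.5) + info(0.25))                # both 3.0
     ```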

  13. Entropy
     • For a Bernoulli random variable: $H(p) = -p \log p - (1 - p) \log(1 - p)$
     • Entropy of all sequences of length n in a language L: $H(w_1, w_2, \ldots, w_n) = -\sum_{W_1^n \in L} p(W_1^n) \log p(W_1^n)$
     • Entropy rate (entropy per word): $\frac{1}{n} H(w_1, w_2, \ldots, w_n) = -\frac{1}{n} \sum_{W_1^n \in L} p(W_1^n) \log p(W_1^n)$
     • What we’re interested in is the entropy rate of the language: $H(L) = \lim_{n \to \infty} \frac{1}{n} H(w_1, w_2, \ldots, w_n) = -\lim_{n \to \infty} \frac{1}{n} \sum_{W_1^n \in L} p(w_1, \ldots, w_n) \log p(w_1, \ldots, w_n)$
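
     A short numeric look at the Bernoulli entropy formula (the values of p are illustrative, chosen only to show the shape of the curve): it is maximized at 1 bit when p = 0.5 and falls to 0 as p approaches 0 or 1.

     ```python
     import math

     def bernoulli_entropy(p):
         # H(p) = -p log2 p - (1-p) log2 (1-p), with the convention 0 log 0 = 0
         if p in (0.0, 1.0):
             return 0.0
         return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

     for p in (0.01, 0.1, 0.3, 0.5, 0.9):
         print(p, round(bernoulli_entropy(p), 3))
     # p = 0.5 gives 1.0 bit; values near 0 or 1 give much less entropy
     ```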

  14. Entropy (continued)
     • What we’re interested in: $H(L) = \lim_{n \to \infty} \frac{1}{n} H(w_1, w_2, \ldots, w_n) = -\lim_{n \to \infty} \frac{1}{n} \sum_{W \in L} p(w_1, \ldots, w_n) \log p(w_1, \ldots, w_n)$
     • Using the Shannon-McMillan-Breiman theorem, under certain conditions: $H(L) = \lim_{n \to \infty} -\frac{1}{n} \log p(w_1 w_2 \ldots w_n)$, i.e. we can take a single sequence that is long enough instead of summing over all sequences
     • Therefore we can estimate H(L) by the cross-entropy of our model P on a sequence of words W: $H(W) = -\frac{1}{N} \log_2 P(w_1 w_2 \ldots w_N)$
     • Which gives us: $\text{Perplexity}(W) = P(w_1, \ldots, w_N)^{-\frac{1}{N}} = 2^{H(W)}$
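
     A numeric check that the two ways of computing perplexity agree, using made-up per-word probabilities (they are not from any real model).

     ```python
     import math

     word_probs = [0.1, 0.05, 0.2, 0.01, 0.3]   # illustrative per-word model probabilities
     N = len(word_probs)

     log2_P = sum(math.log2(p) for p in word_probs)
     H = -log2_P / N                            # cross-entropy estimate, in bits per word
     pp_from_H = 2 ** H                         # 2^H(W)
     pp_direct = math.prod(word_probs) ** (-1 / N)  # P(w_1..w_N)^(-1/N)
     print(pp_from_H, pp_direct)                # the two computations agree
     ```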
