SLIDE 1

Language models

Chapter 3 in Martin/Jurafsky

N-gram models using the Markov assumption

P(w_1 w_2 \ldots w_n) \approx \prod_i P(w_i \mid w_{i-k} \ldots w_{i-1})

  • In other words, we approximate each component in the product as

P(w_i \mid w_1 w_2 \ldots w_{i-1}) \approx P(w_i \mid w_{i-k} \ldots w_{i-1})
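To make the approximation concrete, here is a minimal Python sketch (an illustration, not from the slides) that scores a sentence under a k-th order Markov model; `cond_prob` is a hypothetical estimator of P(w_i | context) that any smoothed N-gram model could supply:

    def markov_probability(words, cond_prob, k=1):
        """Approximate P(w_1 ... w_n) as a product of P(w_i | w_{i-k} ... w_{i-1})."""
        prob = 1.0
        for i, word in enumerate(words):
            context = tuple(words[max(0, i - k):i])  # at most k preceding words
            prob *= cond_prob(word, context)
        return prob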

SLIDE 2

Estimating bigram probabilities

  • The Maximum Likelihood estimate
  • c(xy) is the count of the bigram xy

P(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i)}{c(w_{i-1})}

Example

<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>

Applying the estimate to this corpus:

P(I | <s>) = 2/3 = .67    P(Sam | <s>) = 1/3 = .33    P(am | I) = 2/3 = .67
P(</s> | Sam) = 1/2 = .5    P(Sam | am) = 1/2 = .5    P(do | I) = 1/3 = .33
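As a sanity check, this short Python sketch (not part of the slides) computes the maximum likelihood bigram estimates directly from the three example sentences:

    from collections import Counter

    # The example corpus, with explicit sentence-boundary markers.
    sentences = [
        "<s> I am Sam </s>",
        "<s> Sam I am </s>",
        "<s> I do not like green eggs and ham </s>",
    ]

    unigram_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        unigram_counts.update(tokens)
        bigram_counts.update(zip(tokens, tokens[1:]))

    def bigram_prob(prev, word):
        """Maximum likelihood estimate P(word | prev) = c(prev, word) / c(prev)."""
        return bigram_counts[(prev, word)] / unigram_counts[prev]

    print(bigram_prob("<s>", "I"))     # 2/3 ~ 0.67
    print(bigram_prob("I", "am"))      # 2/3 ~ 0.67
    print(bigram_prob("Sam", "</s>"))  # 1/2 = 0.5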

SLIDE 3

Example: Berkeley Restaurant Project sentences

  • can you tell me about any good cantonese restaurants close by
  • mid priced thai food is what i’m looking for
  • tell me about chez panisse
  • can you give me a listing of the kinds of food that are available
  • i’m looking for a good place to eat breakfast
  • when is caffe venezia open during the day

Raw bigram counts

  • Out of 9222 sentences
SLIDE 4

Raw bigram probabilities

  • Normalize each bigram count by the unigram count of its first word: P(w_i | w_{i-1}) = c(w_{i-1}, w_i) / c(w_{i-1})
  • Result: the corresponding table of bigram probabilities

Bigram estimates of sentence probabilities

P(<s> I want english food </s>)
  = P(I | <s>) × P(want | I) × P(english | want) × P(food | english) × P(</s> | food)
  = .000031
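A minimal Python sketch of this computation (the individual bigram values below are assumptions for illustration, chosen so the product matches the slide's .000031; only P(i | <s>) = .25 and P(english | want) = .0011 appear elsewhere in these slides):

    # Assumed bigram-table entries; only the final product .000031 is given on the slide.
    bigram_prob = {
        ("<s>", "I"): 0.25,
        ("I", "want"): 0.33,
        ("want", "english"): 0.0011,
        ("english", "food"): 0.5,
        ("food", "</s>"): 0.68,
    }

    def sentence_prob(tokens):
        """P(sentence) = product of P(w_i | w_{i-1}) over consecutive token pairs."""
        p = 1.0
        for prev, word in zip(tokens, tokens[1:]):
            p *= bigram_prob[(prev, word)]
        return p

    print(sentence_prob(["<s>", "I", "want", "english", "food", "</s>"]))  # ~ 0.000031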

SLIDE 5

What is encoded in bigram statistics?

  • P(english|want) = .0011
  • P(chinese|want) = .0065
  • P(to|want) = .66
  • P(eat | to) = .28
  • P(food | to) = 0
  • P(want | spend) = 0
  • P(i | <s>) = .25

Practical issue

  • Better to do everything in log space

– Avoid underflow
– (also, adding is faster than multiplying)

log(p1 × p2 × p3 × p4) = log p1 + log p2 + log p3 + log p4
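A small Python sketch (an illustration, not from the slides) of why this matters: multiplying many small probabilities underflows to 0.0 in floating point, while summing their logs stays well-behaved:

    import math

    probs = [1e-5] * 100  # one hundred probabilities of 1e-5 each

    product = 1.0
    for p in probs:
        product *= p
    print(product)  # 0.0 -- the true value 1e-500 underflows double precision

    log_prob = sum(math.log(p) for p in probs)
    print(log_prob)  # ~ -1151.29, i.e. log(1e-500), perfectly representable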

SLIDE 6

Google N-Gram Release, August 2006

… https://books.google.com/ngrams

http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html

Google N-Gram Release

  • serve as the incoming 92
  • serve as the incubator 99
  • serve as the independent 794
  • serve as the index 223
  • serve as the indication 72
  • serve as the indicator 120
  • serve as the indicators 45
  • serve as the indispensable 111
  • serve as the indispensible 40
  • serve as the individual 234
SLIDE 7

Evaluation: How good is our model?

  • Does our language model prefer good sentences to bad ones?

– Assign higher probability to “real” or “frequently observed” sentences than “ungrammatical” or “rarely observed” sentences?

  • We train parameters of our model on a training set.
  • We test the model’s performance on data we haven’t seen.

– A test set is an unseen dataset that is different from our training set, totally unused.
– An evaluation metric tells us how well our model does on the test set.

Training on the test set

  • Testing on data from the training set will assign it an artificially high probability

  • “Training on the test set”
  • Bad science!
  • And violates the honor code

SLIDE 8

Extrinsic evaluation of N-gram models

  • Best evaluation for comparing models A and B

– Put each model in a task

  • spelling corrector, speech recognizer, MT system

– Run the task, get an accuracy for A and for B

  • How many misspelled words corrected properly
  • How many words translated correctly

– Compare accuracy for A and B

Difficulty of extrinsic evaluation of N-gram models

  • Extrinsic evaluation can be time-consuming

  • Is there any easier way?

– Sometimes use intrinsic evaluation: perplexity

SLIDE 9

The intuition for Perplexity

  • The Shannon Game:

– How well can we predict the next word?
– Unigrams are terrible at this game. (Why?)

  • A better model of a text

– is one which assigns a higher probability to the word that actually occurs

I always order pizza with cheese and ____
The 33rd President of the US was ____
I saw a ____

mushrooms 0.1
pepperoni 0.1
anchovies 0.01
…
fried rice 0.0001
…
and 1e-100

Perplexity

Perplexity is the inverse probability of the test set, normalized by the number of words:

PP(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1 w_2 \ldots w_N)}}

Chain rule:

PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 \ldots w_{i-1})}}

Minimizing perplexity is the same as maximizing probability

The best language model is one that best predicts an unseen test set

  • Gives the highest P(sentence)
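A short Python sketch (illustrative, not from the slides) of this definition, where `model_prob` is a hypothetical function returning P(w_i | history); the computation is done in log space for numerical stability:

    import math

    def perplexity(words, model_prob):
        """PP(W) = P(w_1 ... w_N)^(-1/N), computed via log2 probabilities."""
        log_prob = sum(math.log2(model_prob(w, words[:i])) for i, w in enumerate(words))
        return 2 ** (-log_prob / len(words))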

SLIDE 10

Perplexity

Perplexity is the inverse probability of the test set, normalized by the number of words:

PP(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1 w_2 \ldots w_N)}}

Chain rule:

PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 \ldots w_{i-1})}}

For bigrams:

PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}}

The best language model is one that best predicts an unseen test set

  • Gives the highest P(sentence)

Intuition on perplexity

  • Let’s suppose a sentence consisting of random digits
  • What is the perplexity of this sentence according to a model that assigns P = 1/10 to each digit?

PP(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}} = \left(\left(\tfrac{1}{10}\right)^N\right)^{-\frac{1}{N}} = \left(\tfrac{1}{10}\right)^{-1} = 10
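A quick numeric check of this result, reusing the `perplexity` sketch from above with a uniform model that assigns 1/10 to every digit:

    # Any digit sequence works; the model ignores history entirely.
    digits = list("0123456789") * 5
    print(perplexity(digits, lambda word, history: 0.1))  # 10.0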

SLIDE 11

Lower perplexity = better model

  • Training: 38 million words; test: 1.5 million words (WSJ)

N-gram order:   Unigram   Bigram   Trigram
Perplexity:     962       170      109

Digression: information theory

  • I am thinking of an integer between 0 and 1,023. You want to guess it using as few questions as possible.

  • Most of us would ask "is it between 0 and 511?"
  • This is a good strategy because it provides the most information about the unknown number.

  • It provides the first binary digit of the number.
  • Initially you need to obtain log2(1024) = 10 bits of information. After the first question you only need log2(512) = 9 bits.
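A short Python sketch (illustrative, not from the slides) confirming that binary search pins down any integer in 0..1023 with exactly log2(1024) = 10 yes/no questions:

    import math

    def questions_needed(secret, lo=0, hi=1023):
        """Each question ("is it <= mid?") halves the remaining search space."""
        count = 0
        while lo < hi:
            mid = (lo + hi) // 2
            count += 1
            if secret <= mid:
                hi = mid
            else:
                lo = mid + 1
        return count

    print(questions_needed(777))  # 10
    print(math.log2(1024))        # 10.0 bits of information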

SLIDE 12

Information and Entropy

  • By halving the search space we obtained one bit.
  • In general, the information associated with a probabilistic outcome:

I(p) = -\log p

  • Why the logarithm?
  • Assume we have two independent events x and y. We would like the information they carry to be additive. Let’s check:

I(x, y) = -\log P(x, y) = -\log P(x)P(y) = -\log P(x) - \log P(y) = I(x) + I(y)

Information and Entropy

  • By halving the search space we obtained one bit.
  • In general, the information associated with a probabilistic outcome: I(p) = -\log p
  • Now we can define the entropy, or information, associated with a random variable X:

H(X) = -\sum_{x \in \chi} p(x) \log_2 p(x)

  • χ is the space the observations belong to (words in the NLP setting)
  • When the logarithm is in base 2, entropy is measured in bits (it can, in principle, be computed in any base)
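A minimal Python sketch of this definition (an illustration, not from the slides):

    import math

    def entropy(dist):
        """H(X) = -sum_x p(x) log2 p(x), in bits; zero-probability outcomes contribute 0."""
        return -sum(p * math.log2(p) for p in dist.values() if p > 0)

    print(entropy({"heads": 0.5, "tails": 0.5}))        # 1.0 bit for a fair coin
    print(entropy({i: 1 / 1024 for i in range(1024)}))  # 10.0 bits, as in the guessing game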

SLIDE 13

Entropy

  • For a Bernoulli random variable:

H(p) = −p log p − (1 − p) log(1 − p)
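Using the `entropy` sketch above (an illustration, not from the slides), the Bernoulli entropy peaks at one bit for p = 0.5 and vanishes as the outcome becomes certain:

    for p in (0.5, 0.9, 0.99):
        print(p, entropy({"success": p, "failure": 1 - p}))
    # 0.5 -> 1.0 bit, 0.9 -> ~0.47 bits, 0.99 -> ~0.08 bits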

Entropy

  • Entropy of all sequences of length n in a language L:

H(w_1, w_2, \ldots, w_n) = -\sum_{W_1^n \in L} p(W_1^n) \log p(W_1^n)

  • Entropy rate (entropy per word):

\frac{1}{n} H(w_1, w_2, \ldots, w_n) = -\frac{1}{n} \sum_{W_1^n \in L} p(W_1^n) \log p(W_1^n)

  • What we're interested in:

H(L) = \lim_{n \to \infty} \frac{1}{n} H(w_1, w_2, \ldots, w_n) = -\lim_{n \to \infty} \frac{1}{n} \sum_{W \in L} p(w_1, \ldots, w_n) \log p(w_1, \ldots, w_n)

SLIDE 14

Entropy

  • What we're interested in:

H(L) = \lim_{n \to \infty} \frac{1}{n} H(w_1, w_2, \ldots, w_n) = -\lim_{n \to \infty} \frac{1}{n} \sum_{W \in L} p(w_1, \ldots, w_n) \log p(w_1, \ldots, w_n)

  • Using the Shannon-McMillan-Breiman theorem: under certain conditions we have that:

H(L) = \lim_{n \to \infty} -\frac{1}{n} \log p(w_1 w_2 \ldots w_n)

– i.e., we can take a single sequence that is long enough

Entropy

  • Therefore we can estimate H(L) as:

H(W) = -\frac{1}{N} \log P(w_1 w_2 \ldots w_N)

(this is the cross-entropy of a model P on a sequence of words W)

  • Which gives us:

Perplexity(W) = P(w_1, \ldots, w_N)^{-\frac{1}{N}} = 2^{H(W)}
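A closing Python sketch (illustrative, not from the slides) checking Perplexity(W) = 2^{H(W)} numerically; the per-word probabilities are hypothetical:

    import math

    # Hypothetical per-word model probabilities P(w_i | history) for a test sequence.
    word_probs = [0.25, 0.1, 0.33, 0.05, 0.5]
    N = len(word_probs)

    log_p = sum(math.log2(p) for p in word_probs)  # log2 P(w_1 ... w_N)
    cross_entropy = -log_p / N                     # H(W) = -(1/N) log2 P(w_1 ... w_N)
    print(2 ** cross_entropy)                      # PP(W) = 2^{H(W)}

    # The same quantity computed directly as P(w_1 ... w_N)^(-1/N):
    print(math.prod(word_probs) ** (-1 / N))       # identical up to float rounding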