Language models
Chapter 3 in Martin/Jurafsky
N-gram models using the Markov assumption
P(w_1 w_2 \ldots w_n) \approx \prod_i P(w_i \mid w_{i-k} \ldots w_{i-1})
- In other words, we approximate each component in the product as
P(w_i \mid w_1 \ldots w_{i-1}) \approx P(w_i \mid w_{i-k} \ldots w_{i-1})
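To make the approximation concrete, here is a minimal sketch of a bigram model (k = 1): it estimates each P(w_i | w_{i-1}) from raw counts and multiplies the estimates to score a sentence. The toy corpus, padding symbols, and function names are illustrative assumptions, not anything from the slides.

from collections import Counter

def train_bigram(corpus):
    # corpus: a list of tokenized sentences; pad each with boundary markers
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        tokens = ["<s>"] + sent + ["</s>"]
        unigrams.update(tokens[:-1])
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    # maximum-likelihood estimate: P(w | prev) = count(prev, w) / count(prev)
    return lambda prev, w: bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0

def sentence_prob(p, sent):
    # Markov approximation: P(w_1 ... w_n) ~= prod_i P(w_i | w_{i-1})
    tokens = ["<s>"] + sent + ["</s>"]
    prob = 1.0
    for prev, w in zip(tokens[:-1], tokens[1:]):
        prob *= p(prev, w)
    return prob

corpus = [["i", "like", "cheese"], ["i", "like", "pizza"]]
p = train_bigram(corpus)
print(sentence_prob(p, ["i", "like", "pizza"]))  # 0.5 on this toy corpus

An unsmoothed model like this assigns probability zero to any unseen bigram, which is why practical models add smoothing.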
… https://books.google.com/ngrams
http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
– Assign higher probability to “real” or “frequently observed” sentences than “ungrammatical” or “rarely observed” sentences?
– A test set is an unseen dataset that is different from our training set, totally unused.
– An evaluation metric tells us how well our model does on the test set.
– How well can we predict the next word?
– Unigrams are terrible at this game. (Why?)
– A better model is one which assigns a higher probability to the word that actually occurs.
I always order pizza with cheese and ____
The 33rd President of the US was ____
I saw a ____
mushrooms 0.1
pepperoni 0.1
anchovies 0.01
….
fried rice 0.0001
….
and 1e-100
The best language model is one that best predicts an unseen test set.
Perplexity is the inverse probability of the test set, normalized by the number of words:
PP(W) = P(w_1 w_2 \ldots w_N)^{-1/N} = \sqrt[N]{\frac{1}{P(w_1 w_2 \ldots w_N)}}
Chain rule:
PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 \ldots w_{i-1})}}
Minimizing perplexity is the same as maximizing probability.
For bigrams:
PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}}
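As a sketch of how the perplexity above could actually be computed for a bigram model, the snippet below works in log space to avoid numerical underflow; the probability function p(prev, w) is assumed to come from a smoothed model, since a single zero probability makes perplexity infinite.

import math

def perplexity(p, test_sentences):
    # PP(W) = P(w_1 ... w_N) ** (-1/N), accumulated in log space
    log_prob, n = 0.0, 0
    for sent in test_sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        for prev, w in zip(tokens[:-1], tokens[1:]):
            log_prob += math.log2(p(prev, w))  # assumes p(prev, w) > 0
            n += 1
    return 2 ** (-log_prob / n)

# usage with some hypothetical smoothed bigram model `p` and held-out data:
# print(perplexity(p, held_out_sentences))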
Example: what is the perplexity of a sentence consisting of random digits, if the model assigns P = 1/10 to each digit?
PP(W) = P(w_1 w_2 \ldots w_N)^{-1/N} = \left(\left(\tfrac{1}{10}\right)^{N}\right)^{-1/N} = \left(\tfrac{1}{10}\right)^{-1} = 10
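A throwaway numerical check of the digit example: assign P = 1/10 to each of N random digits, and the perplexity comes out to 10 regardless of N.

import math

N = 50                               # any sentence length gives the same answer
log_prob = N * math.log2(1 / 10)     # log2 P(w_1 ... w_N) with P = 1/10 per digit
pp = 2 ** (-log_prob / N)            # PP(W) = P(w_1 ... w_N) ** (-1/N)
print(pp)                            # 10.0 (up to floating-point rounding)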
Suppose you have to guess an unknown number (say, one of 1024 equally likely possibilities) using the fewest number of questions. Each yes/no question gives you at most one bit of information about the unknown number: you start out needing log2(1024) = 10 bits, and after the first question you only need log2(512) = 9 bits.
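A tiny check of the arithmetic in the guessing game, assuming it starts with 1024 equally likely candidates (implied by log2(512) = 9 after the first question): every yes/no answer that halves the candidate set removes exactly one bit.

import math

candidates = 1024
for question in range(3):
    # bits still needed to pin down the unknown number
    print(question, candidates, math.log2(candidates))
    candidates //= 2                 # each answer halves the remaining candidates
# prints: 0 1024 10.0 / 1 512 9.0 / 2 256 8.0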
If two events x and y are independent, we would like the information they carry to be additive. Let's check:
I(x, y) = -\log P(x, y) = -\log P(x)P(y) = -\log P(x) - \log P(y) = I(x) + I(y)
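A quick numeric check of the additivity claim, with made-up probabilities for two independent events:

import math

p_x, p_y = 0.25, 0.5                        # independent events x and y
i_joint = -math.log2(p_x * p_y)             # I(x, y) = -log2 P(x)P(y)
i_sum = -math.log2(p_x) - math.log2(p_y)    # I(x) + I(y)
print(i_joint, i_sum)                       # 3.0 3.0 -- the information adds up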
The entropy measures the average information content of a random variable X:
H(X) = -\sum_{x \in \chi} p(x) \log_2 p(x)
(The log can, in principle, be computed in any base; base 2 gives the answer in bits.)
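A minimal sketch of the entropy formula, applied to an illustrative coin-flip distribution (the probabilities are made up for the example):

import math

def entropy(dist):
    # H(X) = -sum_x p(x) * log2 p(x); skip zero-probability outcomes
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

print(entropy({"heads": 0.5, "tails": 0.5}))   # 1.0 bit: a fair coin
print(entropy({"heads": 0.9, "tails": 0.1}))   # ~0.47 bits: a biased coin is more predictable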
The entropy of a sequence of words:
H(w_1, w_2, \ldots, w_n) = -\sum_{W_1^n \in L} p(W_1^n) \log p(W_1^n)
Dividing by the length n gives the per-word entropy:
\frac{1}{n} H(w_1, w_2, \ldots, w_n) = -\frac{1}{n} \sum_{W_1^n \in L} p(W_1^n) \log p(W_1^n)
For a language L producing arbitrarily long sequences, we define the entropy rate (we could also think of this as the per-word entropy of the language):
H(L) = \lim_{n \to \infty} \frac{1}{n} H(w_1, w_2, \ldots, w_n) = -\lim_{n \to \infty} \frac{1}{n} \sum_{W \in L} p(w_1, \ldots, w_n) \log p(w_1, \ldots, w_n)
Under suitable assumptions we can take a single sequence that is long enough, instead of summing over all possible sequences:
H(L) = \lim_{n \to \infty} -\frac{1}{n} \log p(w_1 w_2 \ldots w_n)
The cross-entropy of a model P on a sequence of words W is then approximated by:
H(W) = -\frac{1}{N} \log P(w_1 w_2 \ldots w_N)
and perplexity is related to cross-entropy by:
PP(W) = 2^{H(W)}
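To tie the last two formulas together, here is a sketch with illustrative per-word probabilities: the cross-entropy of the model on the sequence is the average negative log probability per word, and raising 2 to that value recovers the perplexity.

import math

def cross_entropy(word_log_probs):
    # H(W) = -(1/N) log2 P(w_1 ... w_N), with the log already split per word
    return -sum(word_log_probs) / len(word_log_probs)

probs = [0.25, 0.1, 0.05, 0.2]       # made-up per-word probabilities from some model P
h = cross_entropy([math.log2(p) for p in probs])
print(h)                             # ~2.99 bits per word
print(2 ** h)                        # PP(W) = 2 ** H(W), ~7.95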