ANLP Lecture 6: N-gram models and smoothing


  1. ANLP Lecture 6: N-gram models and smoothing
  Sharon Goldwater (some slides from Philipp Koehn)
  26 September 2019

  2. Recap: N-gram models
  • We can model sentence probabilities by conditioning each word on the N − 1 previous words.
  • For example, a bigram model: $P(\vec{w}) = \prod_{i=1}^{n} P(w_i \mid w_{i-1})$
  • Or a trigram model: $P(\vec{w}) = \prod_{i=1}^{n} P(w_i \mid w_{i-2}, w_{i-1})$
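To make the product concrete, here is a minimal sketch (not from the slides) of scoring a sentence under a bigram model, assuming a hypothetical dictionary `bigram_prob` that maps (previous word, word) pairs to probabilities:

```python
from math import log, exp

def sentence_logprob(words, bigram_prob):
    """Score a sentence under a bigram model: sum of log P(w_i | w_{i-1}).

    `bigram_prob` is assumed to map (previous_word, word) -> probability.
    Working in log space avoids underflow for long sentences.
    """
    padded = ["<s>"] + words + ["</s>"]
    total = 0.0
    for prev, curr in zip(padded, padded[1:]):
        total += log(bigram_prob[(prev, curr)])
    return total

# Toy example with made-up probabilities:
probs = {("<s>", "the"): 0.4, ("the", "guests"): 0.1, ("guests", "</s>"): 0.2}
print(exp(sentence_logprob(["the", "guests"], probs)))  # 0.4 * 0.1 * 0.2 = 0.008
```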

  3. MLE estimates for N-grams
  • To estimate each word probability, we could use MLE: $P_{ML}(w_2 \mid w_1) = \frac{C(w_1, w_2)}{C(w_1)}$
  • But what happens when I compute P(consuming | commence)?
    – Assume we have seen commence in our corpus
    – But we have never seen commence consuming

  4. MLE estimates for N-grams
  • To estimate each word probability, we could use MLE: $P_{ML}(w_2 \mid w_1) = \frac{C(w_1, w_2)}{C(w_1)}$
  • But what happens when I compute P(consuming | commence)?
    – Assume we have seen commence in our corpus
    – But we have never seen commence consuming
  • Any sentence with commence consuming gets probability 0:
    The guests shall commence consuming supper
    Green inked commence consuming garden the
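A small illustration (my own sketch, not part of the lecture) of how the MLE bigram estimate is computed from counts, and how an unseen bigram like commence consuming gets probability 0:

```python
from collections import Counter

def train_mle_bigrams(corpus_sentences):
    """Collect counts for P_ML(w2 | w1) = C(w1, w2) / C(w1)."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus_sentences:
        toks = ["<s>"] + sent + ["</s>"]
        unigrams.update(toks[:-1])           # history counts C(w1)
        bigrams.update(zip(toks, toks[1:]))  # bigram counts C(w1, w2)
    return unigrams, bigrams

# Toy corpus; the sentences are made up for illustration.
corpus = [["we", "commence", "supper"], ["we", "commence", "talks"]]
unigrams, bigrams = train_mle_bigrams(corpus)

def p_ml(w1, w2):
    return bigrams[(w1, w2)] / unigrams[w1]

print(p_ml("commence", "supper"))     # 0.5: seen in one of two 'commence' contexts
print(p_ml("commence", "consuming"))  # 0.0: unseen bigram, so any such sentence gets prob 0
```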

  5. The problem with MLE
  • MLE estimates probabilities that make the observed data maximally probable, by assuming anything unseen cannot happen (and also assigning too much probability to low-frequency observed events).
  • It over-fits the training data.
  • We tried to avoid zero-probability sentences by modelling with smaller chunks (n-grams), but even these will sometimes have zero probability under MLE.
  • Today: smoothing methods, which reassign probability mass from observed to unobserved events, to avoid overfitting and zero probabilities.

  6. Today's lecture
  • How does add-α smoothing work, and what are its effects?
  • What are some more sophisticated smoothing methods, and what information do they use that simpler methods don't?
  • What are training, development, and test sets used for?
  • What are the trade-offs between higher order and lower order n-grams?
  • What is a word embedding and how can it help in language modelling?

  7. Add-One Smoothing
  • For all possible bigrams, add one more count.
  • $P_{ML}(w_i \mid w_{i-1}) = \frac{C(w_{i-1}, w_i)}{C(w_{i-1})}$
  • $P_{+1}(w_i \mid w_{i-1}) = \frac{C(w_{i-1}, w_i) + 1}{C(w_{i-1})}$ ⇒ ?

  8. Add-One Smoothing
  • For all possible bigrams, add one more count.
  • $P_{ML}(w_i \mid w_{i-1}) = \frac{C(w_{i-1}, w_i)}{C(w_{i-1})}$
  • $P_{+1}(w_i \mid w_{i-1}) = \frac{C(w_{i-1}, w_i) + 1}{C(w_{i-1})}$ ⇒ ?
  • NO! The sum over possible $w_i$ (in vocabulary V) must equal 1: $\sum_{w_i \in V} P(w_i \mid w_{i-1}) = 1$
  • This holds for $P_{ML}$, but we increased the numerator, so we must change the denominator too.

  9. Add-One Smoothing: normalization
  • We want: $\sum_{w_i \in V} \frac{C(w_{i-1}, w_i) + 1}{C(w_{i-1}) + x} = 1$
  • Solve for x:
    $\sum_{w_i \in V} (C(w_{i-1}, w_i) + 1) = C(w_{i-1}) + x$
    $\sum_{w_i \in V} C(w_{i-1}, w_i) + \sum_{w_i \in V} 1 = C(w_{i-1}) + x$
    $C(w_{i-1}) + v = C(w_{i-1}) + x$
  • So $P_{+1}(w_i \mid w_{i-1}) = \frac{C(w_{i-1}, w_i) + 1}{C(w_{i-1}) + v}$, where v = vocabulary size.
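A hedged sketch of the final formula on toy counts (the vocabulary and counts are made up); the last print checks that the smoothed distribution sums to 1 over the vocabulary:

```python
# Toy counts, reusing the earlier example corpus; <s> is never predicted,
# so it is left out of the vocabulary we sum over.
bigram_counts = {("commence", "supper"): 1, ("commence", "talks"): 1}
unigram_counts = {"commence": 2}
vocab = {"we", "commence", "supper", "talks", "consuming", "</s>"}

def add_one_prob(w1, w2):
    """P_+1(w2 | w1) = (C(w1, w2) + 1) / (C(w1) + v), where v = |V|."""
    v = len(vocab)
    return (bigram_counts.get((w1, w2), 0) + 1) / (unigram_counts.get(w1, 0) + v)

print(add_one_prob("commence", "consuming"))            # unseen bigram is now nonzero
print(sum(add_one_prob("commence", w) for w in vocab))  # 1.0: a proper distribution
```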

  10. Add-One Smoothing: effects
  • Add-one smoothing: $P_{+1}(w_i \mid w_{i-1}) = \frac{C(w_{i-1}, w_i) + 1}{C(w_{i-1}) + v}$
  • With a large vocabulary, v is often much larger than $C(w_{i-1})$ and overpowers the actual counts.
  • Example: in Europarl, v = 86,700 word types (30m tokens, max $C(w_{i-1})$ = 2m).

  11. Add-One Smoothing: effects
  $P_{+1}(w_i \mid w_{i-1}) = \frac{C(w_{i-1}, w_i) + 1}{C(w_{i-1}) + v}$
  Using v = 86,700, compute some example probabilities:

  C(w_{i-1}) = 10,000:
    C(w_{i-1}, w_i)   P_ML     P_+1 ≈
    100               1/100    1/970
    10                1/1k     1/10k
    1                 1/10k    1/48k
    0                 0        1/97k

  C(w_{i-1}) = 100:
    C(w_{i-1}, w_i)   P_ML     P_+1 ≈
    100               1        1/870
    10                1/10     1/9k
    1                 1/100    1/43k
    0                 0        1/87k
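The table entries can be reproduced directly from the formula; a small sketch using the slide's Europarl vocabulary size (the fractions in the table are rounded):

```python
v = 86_700  # Europarl vocabulary size from the slide

def p_plus1(bigram_count, history_count):
    """P_+1 = (C(w1, w2) + 1) / (C(w1) + v)."""
    return (bigram_count + 1) / (history_count + v)

# Frequent history (C(w1) = 10,000), frequent bigram (C(w1, w2) = 100):
print(p_plus1(100, 10_000))  # roughly 1/1000, an order of magnitude below P_ML = 1/100
# Rare history (C(w1) = 100), unseen bigram:
print(p_plus1(0, 100))       # roughly 1/87k, matching the last row of the table
```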

  12. The problem with Add-One smoothing
  • All smoothing methods "steal from the rich to give to the poor".
  • Add-one smoothing steals way too much.
  • ML estimates for frequent events are quite accurate, so we don't want smoothing to change these much.

  13. Add-α Smoothing
  • Add α < 1 to each count: $P_{+\alpha}(w_i \mid w_{i-1}) = \frac{C(w_{i-1}, w_i) + \alpha}{C(w_{i-1}) + \alpha v}$
  • Simplifying notation, with c the n-gram count and n the history count: $P_{+\alpha} = \frac{c + \alpha}{n + \alpha v}$
  • What is a good value for α?
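A minimal sketch of the formula in the slide's shorthand; the example counts reuse the Europarl-scale numbers from the earlier table:

```python
def add_alpha_prob(c, n, v, alpha):
    """P_+alpha = (c + alpha) / (n + alpha * v), in the slide's shorthand:
    c = n-gram count, n = history count, v = vocabulary size."""
    return (c + alpha) / (n + alpha * v)

# alpha = 1 recovers add-one; a smaller alpha steals less mass from seen events:
print(add_alpha_prob(c=100, n=10_000, v=86_700, alpha=1.0))   # ~0.001
print(add_alpha_prob(c=100, n=10_000, v=86_700, alpha=0.01))  # ~0.0092, close to P_ML = 0.01
```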

  14. Optimizing α
  • Divide the corpus into a training set (80-90%), a held-out (or development or validation) set (5-10%), and a test set (5-10%).
  • Train the model (estimate probabilities) on the training set with different values of α.
  • Choose the value of α that minimizes perplexity on the dev set.
  • Report final results on the test set.
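The procedure can be sketched as a grid search over α that minimizes dev-set perplexity. Everything below (the toy corpora and candidate α values) is an illustration, not the lecture's actual setup:

```python
import math
from collections import Counter

def train_counts(sentences):
    """Collect bigram and history counts from tokenised sentences."""
    bigrams, histories = Counter(), Counter()
    for sent in sentences:
        toks = ["<s>"] + sent + ["</s>"]
        bigrams.update(zip(toks, toks[1:]))
        histories.update(toks[:-1])
    return bigrams, histories

def perplexity(sentences, bigrams, histories, vocab, alpha):
    """Perplexity of an add-alpha bigram model: exp(-average log prob per token)."""
    log_prob, n_tokens = 0.0, 0
    v = len(vocab)
    for sent in sentences:
        toks = ["<s>"] + sent + ["</s>"]
        for w1, w2 in zip(toks, toks[1:]):
            p = (bigrams[(w1, w2)] + alpha) / (histories[w1] + alpha * v)
            log_prob += math.log(p)
            n_tokens += 1
    return math.exp(-log_prob / n_tokens)

# Hypothetical splits; in practice these are the 80-90% / 5-10% portions of the corpus.
train = [["we", "commence", "talks"], ["we", "commence", "supper"]]
dev = [["we", "commence", "supper"]]
vocab = {w for s in train for w in s} | {"</s>"}

bigrams, histories = train_counts(train)
best_alpha = min([1.0, 0.1, 0.01, 0.001],
                 key=lambda a: perplexity(dev, bigrams, histories, vocab, a))
print(best_alpha)  # the alpha with the lowest dev-set perplexity
```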

  15. A general methodology
  • The training/dev/test split is used across machine learning.
  • The development set is used for evaluating different models, debugging, and optimizing parameters (like α).
  • The test set simulates deployment; it is only used once the final model and parameters are chosen. (Ideally: once per paper.)
  • This avoids overfitting to the training set, and even to the test set.

  16. Is add-α sufficient?
  • Even if we optimize α, add-α smoothing makes pretty bad predictions for word sequences.
  • Some cleverer methods, such as Good-Turing, improve on this by discounting less from very frequent items.
  • But there's still a problem...

  17. Remaining problem
  • Suppose that in a given corpus we never observe
    – Scottish beer drinkers
    – Scottish beer eaters
  • If we build a trigram model smoothed with add-α or Good-Turing, which example has higher probability?

  18. Remaining problem
  • The previous smoothing methods assign equal probability to all unseen events.
  • Better: use information from lower order N-grams (shorter histories).
    – beer drinkers
    – beer eaters
  • Two ways: interpolation and backoff.

  19. Interpolation
  • Higher and lower order N-gram models have different strengths and weaknesses:
    – high-order N-grams are sensitive to more context, but have sparse counts
    – low-order N-grams consider only very limited context, but have robust counts
  • So, combine them:
    $P_I(w_3 \mid w_1, w_2) = \lambda_1 P_1(w_3) + \lambda_2 P_2(w_3 \mid w_2) + \lambda_3 P_3(w_3 \mid w_1, w_2)$
    e.g. $\lambda_1 P_1(\text{drinkers}) + \lambda_2 P_2(\text{drinkers} \mid \text{beer}) + \lambda_3 P_3(\text{drinkers} \mid \text{Scottish}, \text{beer})$

  20. Interpolation
  • Note that the λ_i's must sum to 1:
    $1 = \sum_{w_3} P_I(w_3 \mid w_1, w_2)$
    $= \sum_{w_3} [\lambda_1 P_1(w_3) + \lambda_2 P_2(w_3 \mid w_2) + \lambda_3 P_3(w_3 \mid w_1, w_2)]$
    $= \lambda_1 \sum_{w_3} P_1(w_3) + \lambda_2 \sum_{w_3} P_2(w_3 \mid w_2) + \lambda_3 \sum_{w_3} P_3(w_3 \mid w_1, w_2)$
    $= \lambda_1 + \lambda_2 + \lambda_3$
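A sketch of interpolation with fixed mixture weights; the component models p1, p2, p3 and their probabilities are made-up stand-ins for the Scottish beer drinkers example:

```python
def interpolated_prob(w3, w2, w1, p1, p2, p3, lambdas):
    """P_I(w3 | w1, w2) = l1*P1(w3) + l2*P2(w3|w2) + l3*P3(w3|w1,w2).

    p1, p2, p3 are assumed to be callables returning unigram, bigram and
    trigram probabilities; the lambdas must sum to 1 so P_I is a distribution.
    """
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9
    return l1 * p1(w3) + l2 * p2(w3, w2) + l3 * p3(w3, w2, w1)

# Toy component models with invented probabilities:
p1 = lambda w3: {"drinkers": 0.001}.get(w3, 1e-6)
p2 = lambda w3, w2: {("drinkers", "beer"): 0.01}.get((w3, w2), 0.0)
p3 = lambda w3, w2, w1: 0.0  # 'Scottish beer drinkers' never seen in training

print(interpolated_prob("drinkers", "beer", "Scottish", p1, p2, p3, (0.2, 0.3, 0.5)))
```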

  21. Fitting the interpolation parameters
  • In general, any weighted combination of distributions is called a mixture model.
  • So the λ_i's are interpolation parameters or mixture weights.
  • The values of the λ_i's are chosen to optimize perplexity on a held-out data set.

  22. Back-Off
  • Trust the highest order language model that contains the N-gram; otherwise "back off" to a lower order model.
  • Basic idea:
    – discount the probabilities slightly in the higher order model
    – spread the extra mass between lower order N-grams
  • But the maths gets complicated to make the probabilities sum to 1.

  23. Back-Off Equation
  $P_{BO}(w_i \mid w_{i-N+1}, \ldots, w_{i-1}) =$
    $P^*(w_i \mid w_{i-N+1}, \ldots, w_{i-1})$, if $\mathrm{count}(w_{i-N+1}, \ldots, w_i) > 0$
    $\alpha(w_{i-N+1}, \ldots, w_{i-1}) \, P_{BO}(w_i \mid w_{i-N+2}, \ldots, w_{i-1})$, otherwise
  • Requires
    – an adjusted prediction model $P^*(w_i \mid w_{i-N+1}, \ldots, w_{i-1})$
    – backoff weights $\alpha(w_1, \ldots, w_{N-1})$
  • See textbook for details/explanation.
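The slide leaves the exact P* and α to the textbook. As an illustration of the recursive back-off structure only, here is a sketch of "stupid backoff" (Brants et al., 2007), which multiplies by a fixed discount instead of properly renormalizing, so it returns scores rather than true probabilities:

```python
from collections import Counter

def ngram_counts(sentences, max_order=3):
    """Count all n-grams up to max_order, keyed by tuples of tokens."""
    counts = Counter()
    for sent in sentences:
        toks = ["<s>", "<s>"] + sent + ["</s>"]
        for order in range(1, max_order + 1):
            counts.update(zip(*(toks[i:] for i in range(order))))
    return counts

def stupid_backoff_score(ngram, counts, total_tokens, discount=0.4):
    """Use the highest-order n-gram if seen; otherwise back off to a shorter
    history, multiplying by a fixed discount (0.4 in the original paper)."""
    if len(ngram) == 1:
        return counts[ngram] / total_tokens          # unigram relative frequency
    if counts[ngram] > 0:
        return counts[ngram] / counts[ngram[:-1]]    # relative frequency at this order
    return discount * stupid_backoff_score(ngram[1:], counts, total_tokens, discount)

# Toy corpus for illustration:
corpus = [["Scottish", "beer", "is", "good"], ["beer", "drinkers", "rejoice"]]
counts = ngram_counts(corpus)
total = sum(c for g, c in counts.items() if len(g) == 1)

# The trigram 'Scottish beer drinkers' is unseen, so we back off to 'beer drinkers':
print(stupid_backoff_score(("Scottish", "beer", "drinkers"), counts, total))  # 0.4 * 1/2 = 0.2
```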

  24. Do our smoothing methods work here?
  Example from MacKay and Bauman Peto (1994):
    Imagine, you see, that the language, you see, has, you see, a frequently occurring couplet, 'you see', you see, in which the second word of the couplet, 'see', follows the first word, 'you', with very high probability, you see. Then the marginal statistics, you see, are going to become hugely dominated, you see, by the words 'you' and 'see', with equal frequency, you see.
  • P(see) and P(you) are both high, but see nearly always follows you.
  • So P(see | novel) should be much lower than P(you | novel).

  25. Diversity of histories matters!
  • A real example: the word York
    – a fairly frequent word in the Europarl corpus, occurring 477 times
    – as frequent as foods, indicates and providers
    – so in a unigram language model it gets a respectable probability
  • However, it almost always directly follows New (473 times).
  • So, in unseen bigram contexts, York should have low probability
    – lower than predicted by the unigram model used in interpolation or backoff.
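One standard way to exploit this observation (beyond what the slide states) is a Kneser-Ney-style continuation probability, which replaces a word's raw frequency with the number of distinct words that precede it; a minimal sketch on toy bigrams:

```python
from collections import defaultdict

def continuation_prob(bigrams):
    """P_cont(w) = (# distinct words that precede w) / (# distinct bigram types).

    This is the unigram distribution used in Kneser-Ney smoothing: a word like
    'York' that almost always follows 'New' gets a low continuation probability
    even though its raw frequency is respectable.
    """
    preceders = defaultdict(set)
    for w1, w2 in bigrams:
        preceders[w2].add(w1)
    n_bigram_types = len(set(bigrams))
    return {w: len(hist) / n_bigram_types for w, hist in preceders.items()}

# Toy data: 'York' occurs 3 times but only ever after 'New';
# 'foods' also occurs 3 times, after 3 different words.
bigrams = [("New", "York")] * 3 + [("fried", "foods"), ("healthy", "foods"), ("cheap", "foods")]
p_cont = continuation_prob(bigrams)
print(p_cont["York"], p_cont["foods"])  # 0.25 vs 0.75
```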
