
ANLP Lecture 6
N-gram models and smoothing

Sharon Goldwater (some slides from Philipp Koehn) 26 September 2019

Sharon Goldwater ANLP Lecture 6 26 September 2019

Recap: N-gram models

  • We can model sentence probabilities by conditioning each word on the N−1 previous words.

  • For example, a bigram model:

      P(w) = ∏_{i=1}^{n} P(wi | wi−1)

  • Or a trigram model:

      P(w) = ∏_{i=1}^{n} P(wi | wi−2, wi−1)

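A minimal sketch of how a bigram model scores a sentence under the product above. The helper name, the `<s>`/`</s>` boundary markers, and the toy probability table are illustrative assumptions, not from the slides:

```python
def bigram_sentence_prob(sentence, bigram_probs):
    """P(w) = product over i of P(wi | wi-1), with <s> and </s>
    marking the sentence boundaries. Unseen bigrams get probability 0
    under this naive lookup, which is exactly the problem smoothing fixes."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, cur in zip(words, words[1:]):
        prob *= bigram_probs.get((prev, cur), 0.0)
    return prob

# Toy probability table (made up for illustration)
probs = {("<s>", "the"): 0.5, ("the", "cat"): 0.2, ("cat", "</s>"): 0.1}
p = bigram_sentence_prob("the cat", probs)  # 0.5 * 0.2 * 0.1 = 0.01
```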

MLE estimates for N-grams

  • To estimate each word prob, we could use MLE...

      PML(w2|w1) = C(w1, w2) / C(w1)

  • But what happens when I compute P(consuming|commence)?

    – Assume we have seen commence in our corpus
    – But we have never seen commence consuming

  • Any sentence with commence consuming gets probability 0

      The guests shall commence consuming supper
      Green inked commence consuming garden the

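The MLE estimate and its zero-probability problem can be shown in a few lines. The tiny corpus and function names are hypothetical; the denominator here simply uses token counts of w1, a simplification that is adequate for the sketch:

```python
from collections import Counter

def mle_bigram(corpus_sentences):
    """MLE bigram estimates: P_ML(w2|w1) = C(w1, w2) / C(w1)."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus_sentences:
        words = sent.split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return lambda w1, w2: bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

corpus = ["we commence dinner now", "they commence eating"]
p_ml = mle_bigram(corpus)
p_seen = p_ml("commence", "dinner")       # 1/2: commence occurs twice, once before dinner
p_unseen = p_ml("commence", "consuming")  # 0.0: the unseen bigram gets zero probability
```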

slide-2
SLIDE 2

The problem with MLE

  • MLE estimates probabilities that make the observed data maximally probable...

  • ...by assuming anything unseen cannot happen (and also assigning too much probability to low-frequency observed events).

  • It over-fits the training data.

  • We tried to avoid zero-probability sentences by modelling with smaller chunks (n-grams), but even these will sometimes have zero prob under MLE.

  • Today: smoothing methods, which reassign probability mass from observed to unobserved events, to avoid overfitting/zero probs.


Today’s lecture:

  • How does add-alpha smoothing work, and what are its effects?

  • What are some more sophisticated smoothing methods, and what information do they use that simpler methods don’t?

  • What are training, development, and test sets used for?

  • What are the trade-offs between higher order and lower order n-grams?

  • What is a word embedding and how can it help in language modelling?


Add-One Smoothing

  • For all possible bigrams, add one more count.

      PML(wi|wi−1) = C(wi−1, wi) / C(wi−1)   ⇒   P+1(wi|wi−1) = (C(wi−1, wi) + 1) / C(wi−1) ?

  • NO! Sum over possible wi (in vocabulary V) must equal 1:

      ∑_{wi∈V} P(wi|wi−1) = 1

  • True for PML, but we increased the numerator; must change the denominator too.



Add-One Smoothing: normalization

  • We want:

      ∑_{wi∈V} (C(wi−1, wi) + 1) / (C(wi−1) + x) = 1

  • Solve for x:

      ∑_{wi∈V} (C(wi−1, wi) + 1) = C(wi−1) + x

      ∑_{wi∈V} C(wi−1, wi) + ∑_{wi∈V} 1 = C(wi−1) + x

      C(wi−1) + v = C(wi−1) + x

  • So,

      P+1(wi|wi−1) = (C(wi−1, wi) + 1) / (C(wi−1) + v)

    where v = vocabulary size.

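The normalized add-one estimate is one line of code. Plugging in the Europarl-scale numbers from the next slide shows how strongly v dominates the counts (the function name and example counts are illustrative):

```python
def p_add_one(c_bigram, c_history, v):
    """Add-one estimate: P+1(wi|wi-1) = (C(wi-1, wi) + 1) / (C(wi-1) + v)."""
    return (c_bigram + 1) / (c_history + v)

# Europarl-scale vocabulary from the slides: v = 86,700 word types
v = 86_700
p_seen = p_add_one(100, 10_000, v)    # MLE would say 1/100; smoothed value is far smaller
p_unseen = p_add_one(0, 10_000, v)    # unseen bigram now gets 1/96,700 instead of 0
```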

Add-One Smoothing: effects

  • Add-one smoothing:

      P+1(wi|wi−1) = (C(wi−1, wi) + 1) / (C(wi−1) + v)

  • Large vocabulary size means v is often much larger than C(wi−1), and overpowers the actual counts.

  • Example: in Europarl, v = 86,700 word types (30m tokens, max C(wi−1) = 2m).


Add-One Smoothing: effects

      P+1(wi|wi−1) = (C(wi−1, wi) + 1) / (C(wi−1) + v)

Using v = 86,700, compute some example probabilities:

    C(wi−1) = 10,000                  C(wi−1) = 100
    C(wi−1, wi)   PML     P+1 ≈      C(wi−1, wi)   PML     P+1 ≈
    100           1/100   1/970      100           1       1/870
    10            1/1k    1/10k      10            1/10    1/9k
    1             1/10k   1/48k      1             1/100   1/43k
    0             0       1/97k      0             0       1/87k


The problem with Add-One smoothing

  • All smoothing methods “steal from the rich to give to the poor”

  • Add-one smoothing steals way too much

  • ML estimates for frequent events are quite accurate; we don’t want smoothing to change these much.



Add-α Smoothing

  • Add α < 1 to each count

      P+α(wi|wi−1) = (C(wi−1, wi) + α) / (C(wi−1) + αv)

  • Simplifying notation: c is the n-gram count, n is the history count

      P+α = (c + α) / (n + αv)

  • What is a good value for α?


Optimizing α

  • Divide corpus into training set (80-90%), held-out (or development or validation) set (5-10%), and test set (5-10%)

  • Train model (estimate probabilities) on training set with different values of α

  • Choose the value of α that minimizes perplexity on dev set

  • Report final results on test set

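The tuning loop above can be sketched as a grid search that scores each candidate α by dev-set perplexity. The toy counts, the candidate grid, and the helper names are assumptions for illustration (here v = 2 since only two continuation types follow the history):

```python
import math

def perplexity(bigrams, prob):
    """Perplexity = exp(-(1/N) * sum_i log P(wi | wi-1))."""
    return math.exp(-sum(math.log(prob(w1, w2)) for w1, w2 in bigrams) / len(bigrams))

def best_alpha(bigram_counts, history_counts, v, dev_bigrams, candidates):
    """Return the candidate alpha whose add-alpha model has lowest dev perplexity."""
    def prob_with(a):
        return lambda w1, w2: (bigram_counts.get((w1, w2), 0) + a) / (history_counts[w1] + a * v)
    return min(candidates, key=lambda a: perplexity(dev_bigrams, prob_with(a)))

# Toy data: the dev set looks like the training counts, so a small alpha wins
train = {("a", "b"): 9, ("a", "c"): 1}
hist = {"a": 10}
dev = [("a", "b")] * 9 + [("a", "c")]
alpha = best_alpha(train, hist, 2, dev, [0.01, 0.1, 1.0])
```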

A general methodology

  • Training/dev/test split is used across machine learning

  • Development set used for evaluating different models, debugging, optimizing parameters (like α)

  • Test set simulates deployment; only used once final model and parameters are chosen. (Ideally: once per paper)

  • Avoids overfitting to the training set and even to the test set


Is add-α sufficient?

  • Even if we optimize α, add-α smoothing makes pretty bad predictions for word sequences.

  • Some cleverer methods such as Good-Turing improve on this by discounting less from very frequent items. But there’s still a problem...



Remaining problem

  • In a given corpus, suppose we never observe

    – Scottish beer drinkers
    – Scottish beer eaters

  • If we build a trigram model smoothed with Add-α or Good-Turing, which example has higher probability?


Remaining problem

  • Previous smoothing methods assign equal probability to all unseen events.

  • Better: use information from lower order N-grams (shorter histories):

    – beer drinkers
    – beer eaters

  • Two ways: interpolation and backoff.


Interpolation

  • Higher and lower order N-gram models have different strengths and weaknesses

    – high-order N-grams are sensitive to more context, but have sparse counts
    – low-order N-grams consider only very limited context, but have robust counts

  • So, combine them:

      PI(w3|w1, w2) = λ1 P1(w3)            e.g., P1(drinkers)
                    + λ2 P2(w3|w2)         e.g., P2(drinkers|beer)
                    + λ3 P3(w3|w1, w2)     e.g., P3(drinkers|Scottish, beer)


Interpolation

  • Note that the λi's must sum to 1:

      1 = ∑_{w3} PI(w3|w1, w2)
        = ∑_{w3} [λ1 P1(w3) + λ2 P2(w3|w2) + λ3 P3(w3|w1, w2)]
        = λ1 ∑_{w3} P1(w3) + λ2 ∑_{w3} P2(w3|w2) + λ3 ∑_{w3} P3(w3|w1, w2)
        = λ1 + λ2 + λ3

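The interpolation formula is straightforward to code. The component models below are hypothetical constant functions purely so the arithmetic is visible; real P1, P2, P3 would be smoothed n-gram estimates:

```python
def p_interp(p1, p2, p3, lambdas):
    """Build the interpolated trigram model
    PI(w3|w1,w2) = l1*P1(w3) + l2*P2(w3|w2) + l3*P3(w3|w1,w2).
    The mixture weights must sum to 1 so PI stays a proper distribution."""
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9
    return lambda w3, w2, w1: l1 * p1(w3) + l2 * p2(w3, w2) + l3 * p3(w3, w2, w1)

# Hypothetical component models (constants just for illustration)
pi = p_interp(lambda w3: 0.1,
              lambda w3, w2: 0.2,
              lambda w3, w2, w1: 0.4,
              (0.2, 0.3, 0.5))
val = pi("drinkers", "beer", "Scottish")  # 0.2*0.1 + 0.3*0.2 + 0.5*0.4 = 0.28
```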


Fitting the interpolation parameters

  • In general, any weighted combination of distributions is called a mixture model.

  • So the λi's are interpolation parameters or mixture weights.

  • The values of the λi's are chosen to optimize perplexity on a held-out data set.


Back-Off

  • Trust the highest order language model that contains the N-gram, otherwise “back off” to a lower order model.

  • Basic idea:

    – discount the probabilities slightly in the higher order model
    – spread the extra mass between lower order N-grams

  • But the maths gets complicated to make probabilities sum to 1.


Back-Off Equation

      PBO(wi|wi−N+1, ..., wi−1) =
          { P*(wi|wi−N+1, ..., wi−1)                             if count(wi−N+1, ..., wi) > 0
          { α(wi−N+1, ..., wi−1) · PBO(wi|wi−N+2, ..., wi−1)     otherwise

  • Requires

    – adjusted prediction model P*(wi|wi−N+1, ..., wi−1)
    – backoff weights α(w1, ..., wN−1)

  • See textbook for details/explanation.

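The recursion is easier to see in a deliberately simplified variant, "stupid backoff" (Brants et al., 2007), which skips the discounted P* and the normalizing α's and therefore returns relative scores, not true probabilities. This is not the Katz backoff of the equation above; the toy counts are hypothetical:

```python
def stupid_backoff(ngram, counts, back_weight=0.4):
    """Simplified backoff: use the relative frequency of the n-gram if seen,
    otherwise recurse on the shorter history with a fixed penalty weight.
    counts maps word tuples of every order (including unigrams) to frequencies."""
    if len(ngram) == 1:
        total = sum(c for ng, c in counts.items() if len(ng) == 1)
        return counts.get(ngram, 0) / total
    if counts.get(ngram, 0) > 0:
        return counts[ngram] / counts[ngram[:-1]]
    return back_weight * stupid_backoff(ngram[1:], counts, back_weight)

# Toy counts (hypothetical)
c = {("beer",): 4, ("drinkers",): 2, ("eaters",): 2,
     ("beer", "drinkers"): 2, ("Scottish", "beer"): 1}
s1 = stupid_backoff(("Scottish", "beer", "drinkers"), c)  # backs off once, to (beer, drinkers)
s2 = stupid_backoff(("Scottish", "beer", "eaters"), c)    # backs off twice, down to (eaters,)
```

Note how the lower-order information gives beer drinkers a higher score than beer eaters, which is exactly what the Scottish beer example calls for.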

Do our smoothing methods work here?

Example from MacKay and Bauman Peto (1994): Imagine, you see, that the language, you see, has, you see, a frequently occurring couplet, ‘you see’, you see, in which the second word of the couplet, ‘see’, follows the first word, ‘you’, with very high probability, you see. Then the marginal statistics, you see, are going to become hugely dominated, you see, by the words ‘you’ and ‘see’, with equal frequency, you see.

  • P(see) and P(you) both high, but see nearly always follows you.
  • So P(see|novel) should be much lower than P(you|novel).



Diversity of histories matters!

  • A real example: the word York

    – fairly frequent word in Europarl corpus, occurs 477 times
    – as frequent as foods, indicates and providers
    → in unigram language model: a respectable probability

  • However, it almost always directly follows New (473 times)

  • So, in unseen bigram contexts, York should have low probability

    – lower than predicted by the unigram model used in interpolation or backoff.


Kneser-Ney Smoothing

  • Kneser-Ney smoothing takes diversity of histories into account

  • Count of distinct histories for a word:

      N1+(•wi) = |{wi−1 : c(wi−1, wi) > 0}|

  • Recall: maximum likelihood estimate of a unigram language model:

      PML(wi) = C(wi) / ∑_w C(w)

  • In KN smoothing, replace raw counts with the count of histories:

      PKN(wi) = N1+(•wi) / ∑_w N1+(•w)

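Computing the continuation counts N1+(•w) is a small exercise. The toy bigram table mimics the York example with made-up numbers (only the 473 for new york comes from the slides):

```python
from collections import defaultdict

def kn_unigram(bigram_counts):
    """Kneser-Ney unigram estimates: probability proportional to the number of
    distinct histories a word follows, N1+(.w), rather than its raw frequency."""
    seen_histories = defaultdict(set)
    for (w1, w2), count in bigram_counts.items():
        if count > 0:
            seen_histories[w2].add(w1)
    total = sum(len(h) for h in seen_histories.values())
    return {w: len(h) / total for w, h in seen_histories.items()}

# Toy version of the York example: 'york' is frequent but follows only 'new'
bc = {("new", "york"): 473, ("red", "fish"): 5, ("blue", "fish"): 3, ("one", "fish"): 1}
p = kn_unigram(bc)
# p["york"] = 1/4 despite its high raw count; p["fish"] = 3/4 (three distinct histories)
```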

Kneser-Ney in practice

  • Original version used backoff; later “modified Kneser-Ney” introduced using interpolation (Chen and Goodman, 1998).

  • Fairly complex equations, but until recently the best smoothing method for word n-grams.

  • See Chen and Goodman for extensive comparisons of KN and other smoothing methods.

  • KN (and other methods) implemented in language modelling toolkits like SRILM (classic), KenLM (good for really big models), OpenGrm Ngram library (uses finite state transducers), etc.


Bayesian interpretations of smoothing

  • We contrasted MLE (which has a mathematical justification, but practical problems) with smoothing (heuristic approaches with better practical performance).

  • It turns out that many smoothing methods are mathematically equivalent to forms of Bayesian estimation (using priors and uncertainty in parameters), so these have a mathematical justification too!

    – Add-α smoothing: Dirichlet prior
    – Kneser-Ney smoothing: Pitman-Yor prior

  See MacKay and Bauman Peto (1994); Goldwater (2006, pp. 13-17); Goldwater et al. (2006); Teh (2006).


Are we done with smoothing yet?

We’ve considered methods that predict rare/unseen words using

  • Uniform probabilities (add-α, Good-Turing)
  • Probabilities from lower-order n-grams (interpolation, backoff)
  • Probability of appearing in new contexts (Kneser-Ney)

What’s left?


Word similarity

  • Two words with C(w1) ≫ C(w2):

    – salmon
    – swordfish

  • Can P(salmon|caught two) tell us something about P(swordfish|caught two)?

  • n-gram models: no.


Word similarity in language modeling

  • Early version: class-based language models (J&M 4.9.2)

    – Define classes c of words, by hand or automatically
    – PCL(wi|wi−1) = P(ci|ci−1) P(wi|ci)   (an HMM)

  • Recent version: distributed language models

    – Current models have better perplexity than MKN.
    – Ongoing research to make them more efficient.
    – Examples: Log Bilinear LM (Mnih and Hinton, 2007), Recurrent Neural Network LM (Mikolov et al., 2010), LSTM LMs, etc.


Distributed word representations

(also called word embeddings)

  • Each word represented as a high-dimensional vector (50-500 dims)

      E.g., salmon is [0.1, 2.3, 0.6, −4.7, . . .]

  • Similar words represented by similar vectors

      E.g., swordfish is [0.3, 2.2, 1.2, −3.6, . . .]

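"Similar vectors" is usually made precise with cosine similarity. A quick check on the slide's example vectors (truncated to the four dimensions shown; cosine as the comparison metric is a standard choice, not something the slide specifies):

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors: dot product
    divided by the product of the vector norms, so 1 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# The slide's example vectors, truncated to four dimensions
salmon = [0.1, 2.3, 0.6, -4.7]
swordfish = [0.3, 2.2, 1.2, -3.6]
sim = cosine(salmon, swordfish)  # close to 1: similar words, similar vectors
```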


Training the model

  • Goal: learn word representations (embeddings) such that words that behave similarly are close together in high-dimensional space.

  • 2-dimensional example: (figure not reproduced here)

We’ll come back to this later in the course...


Training the model

  • N-gram LM: collect counts, maybe optimize some parameters

    – (Relatively) quick, especially these days (minutes-hours)

  • Distributed LM: learn the representation for each word

    – Solved with machine learning methods (e.g., neural networks)
    – Can be extremely time-consuming (hours-days)
    – Learned embeddings seem to encode both semantic and syntactic similarity (using different dimensions) (Mikolov et al., 2013).


Using the model

Want to compute P(w1 . . . wn) for a new sequence.

  • N-gram LM: again, relatively quick

  • Distributed LM: can be slow, but varies; often LMs are not used in isolation anyway (instead use an end-to-end neural model, which does some of the same work).

  • An active area of research for distributed LMs


Other Topics in Language Modeling

Many active research areas in language modeling:

  • Factored/morpheme-based/character language models: back off to word stems, part-of-speech tags, and/or sub-word units

  • Domain adaptation: when only a small domain-specific corpus is available

  • Time efficiency and space efficiency are both key issues (esp. on mobile devices!)



Summary

  • We can estimate sentence probabilities by breaking down the problem, e.g. by instead estimating N-gram probabilities.

  • Longer N-grams capture more linguistic information, but are sparser.

  • Different smoothing methods capture different intuitions about how to estimate probabilities for rare/unseen events.

  • Still lots of work on how to improve these models.


Announcements

  • Assignment 1 will go out on Monday: build and experiment with a character-level N-gram model.

  • Intended for students to work in pairs: we strongly recommend you do. You can discuss and learn from your partner.

  • We’ll have a signup sheet if you want to choose your own partner.

  • On Tue/Wed, we will assign partners to anyone who hasn’t already signed up with a partner (or told us they want to work alone).

  • You may not work with the same partner for both assessed assignments.


Questions and exercises (lects 5-6)

  1. What does sparse data refer to, and why is it important in language modelling?

  2. Write down the equations for the Noisy Channel framework and explain what each term refers to for an example task (say, speech recognition).

  3. Re-derive the equations for a trigram model without looking at the notes.

  4. Given a sentence, show how its probability is computed using a unigram, bigram, or trigram model.

  5. Using a unigram model, I compute the probability of the word sequence the cat bit the dog as 0.00057. Give another word sequence that has the same probability under this model.

  6. Given a probability distribution, compute its entropy.

  7. Here are three different distributions, each over five outcomes. Which has the highest entropy? The lowest? (Distributions (a), (b), and (c) are shown graphically on the slide.)


  8. What is the purpose of the begin/end of sentence markers in an n-gram model?

  9. Given a text, how would you compute P(to|want) using MLE, and using add-1 smoothing? What about P(want|to)? Which conditional probability, P(to|want) or P(want|to), is needed to compute the probability of “I want to go” under a bigram model?

  10. Consider the following trigrams: (a) private eye maneuvered and (b) private car maneuvered. (Note: a private eye is slang for a detective.) Suppose that neither of these has been observed in a particular corpus, and we are using backoff to estimate their probabilities. What are the bigrams that we will back off to in each case? In which case is the backoff model likely to provide a more accurate estimate of the trigram probability? Why?

  11. Suppose I have a smallish corpus to train my language model, and I’m not sure I have enough data to train a good 4-gram model. So, I want to know whether a trigram model or a 4-gram model is likely to make better predictions on other data similar to my corpus, and by how much. Let’s say I only consider two types of smoothing methods: add-alpha smoothing with interpolation, or Kneser-Ney. Describe what I should do to answer my question. What steps should I go through, what experiments do I need to run, and what can I conclude from them?



References

Chen, S. F. and Goodman, J. (1998). An empirical study of smoothing techniques for language modeling. Technical Report TR-10-98, Center for Research in Computing Technology, Harvard University.

Goldwater, S. (2006). Nonparametric Bayesian Models of Lexical Acquisition. PhD thesis, Brown University.

Goldwater, S., Griffiths, T. L., and Johnson, M. (2006). Interpolating between types and tokens by estimating power-law generators. In Advances in Neural Information Processing Systems 18, pages 459–466, Cambridge, MA. MIT Press.

MacKay, D. and Bauman Peto, L. (1994). A hierarchical Dirichlet language model. Natural Language Engineering, 1(1).

Mikolov, T., Karafiát, M., Burget, L., Černocký, J., and Khudanpur, S. (2010). Recurrent neural network based language model. In INTERSPEECH, pages 1045–1048.

Mikolov, T., Yih, W.-t., and Zweig, G. (2013). Linguistic regularities in continuous space word representations. In HLT-NAACL, pages 746–751.

Mnih, A. and Hinton, G. (2007). Three new graphical models for statistical language modelling. In Proceedings of the 24th International Conference on Machine Learning, pages 641–648. ACM.

Teh, Y. W. (2006). A hierarchical Bayesian language model based on Pitman-Yor processes. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 985–992, Sydney, Australia.