Smoothing - BM1: Advanced Natural Language Processing, University of Potsdam - PowerPoint PPT Presentation

SLIDE 1

Smoothing

BM1: Advanced Natural Language Processing University of Potsdam Tatjana Scheffler tatjana.scheffler@uni-potsdam.de November 1, 2016

SLIDE 2

Last Week

¤ Language model: P(X_t = w_t | X_1 = w_1, ..., X_{t-1} = w_{t-1})
¤ Probability of a string w_1 ... w_n with a bigram model: P(w_1 ... w_n) = P(w_1) P(w_2 | w_1) ... P(w_n | w_{n-1})
¤ Maximum likelihood estimation using relative frequencies: a low n leads to modeling errors, a high n leads to estimation errors.
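As a concrete illustration of the bigram maximum likelihood estimate (not from the original slides; the toy corpus and the helper name p_ml are made up for this sketch):

from collections import Counter

# Toy corpus; in practice this would be a large training corpus.
corpus = "the cat sat on the mat the cat ate".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def p_ml(w, prev):
    """Maximum likelihood estimate P(w | prev) = C(prev w) / C(prev)."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, w)] / unigram_counts[prev]

print(p_ml("cat", "the"))   # 2/3: "the cat" occurs twice, "the" three times
print(p_ml("dog", "the"))   # 0.0: unseen bigram -> the sparse-data problem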

SLIDE 3

Today

¤ More about dealing with sparse data
¤ Smoothing
¤ Good-Turing estimation
¤ Linear interpolation
¤ Backoff models

SLIDE 4

An example

(Chen/Goodman, 1998)

SLIDE 5

An example

(Chen/Goodman, 1998)

SLIDE 6

Unseen data

¤ The ML estimate is "optimal" only for the corpus from which we computed it.
¤ It usually does not generalize directly to new data.
¤ OK for unigrams, but there are so many more possible bigrams.
¤ Extreme case: P(unseen | w_{k-1}) = 0 for all w_{k-1}.
¤ This is a disaster, because a product containing a factor of 0 is always 0.

SLIDE 7

Honest evaluation

¤ To get an honest picture of a model's performance, we need to try it on a separate test corpus.
¤ Maximum likelihood for the training corpus is not necessarily good for the test corpus.

¤ In the Cher example corpus, the likelihood of the test data is L(test) = 0.

SLIDE 8

Measures of quality

¤ (Cross-)entropy: the average number of bits per word needed to encode corpus T in an optimal compression scheme based on the model:
  H(T) = -(1/|T|) log2 P(T)
¤ A good language model should minimize the entropy of new observations.
¤ Equivalently, model quality can be expressed in terms of perplexity: PP(T) = 2^{H(T)}.
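A small sketch of how cross-entropy and perplexity could be computed for a test corpus under any conditional bigram model (the function names are illustrative and continue the earlier sketch; they are not from the slides):

import math

def cross_entropy(test_words, prob):
    """H(T) = -(1/N) * sum of log2 P(w_i | w_{i-1}) over the test corpus."""
    log_prob = 0.0
    for prev, w in zip(test_words, test_words[1:]):
        p = prob(w, prev)
        if p == 0.0:
            return float("inf")   # a single unseen bigram makes the whole corpus impossible
        log_prob += math.log2(p)
    n = len(test_words) - 1       # number of bigram predictions made
    return -log_prob / n

def perplexity(test_words, prob):
    """Perplexity PP(T) = 2^H(T)."""
    return 2 ** cross_entropy(test_words, prob)

# Usage with any conditional probability function prob(w, prev), e.g. the MLE from the earlier sketch:
# print(perplexity("the cat sat on the mat".split(), p_ml))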

SLIDE 9

Smoothing techniques

¤ Replace the ML estimate P(w_i | w_{i-1}) = C(w_{i-1} w_i) / C(w_{i-1}) by an estimate based on an adjusted bigram count C*(w_{i-1} w_i).
¤ Redistribute counts from seen to unseen bigrams.
¤ Generalizes easily to n-gram models with n > 2.

SLIDE 10

Smoothing

P(... | eat) in Brown corpus

SLIDE 11

Laplace Smoothing

SLIDE 12

Laplace Smoothing

SLIDE 13

Laplace Smoothing

¤ Count every bigram (seen or unseen) one more time than in the corpus and normalize:
  P_Lap(w_i | w_{i-1}) = (C(w_{i-1} w_i) + 1) / (C(w_{i-1}) + |V|)
¤ Easy to implement, but dramatically overestimates the probability of unseen events.
¤ Quick fix: additive smoothing, which adds some 0 < δ ≤ 1 instead of 1.
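A minimal sketch of Laplace and add-δ smoothing for bigrams (illustrative only; it reuses the count structures from the earlier sketch and the function name is made up):

def p_add_delta(w, prev, bigram_counts, unigram_counts, vocab_size, delta=1.0):
    """Additive smoothing: (C(prev w) + delta) / (C(prev) + delta * |V|).
    With delta = 1.0 this is Laplace (add-one) smoothing."""
    return (bigram_counts[(prev, w)] + delta) / (unigram_counts[prev] + delta * vocab_size)

# Example usage with the counters from the bigram-MLE sketch:
# V = len(unigram_counts)
# print(p_add_delta("dog", "the", bigram_counts, unigram_counts, V))        # no longer zero
# print(p_add_delta("dog", "the", bigram_counts, unigram_counts, V, 0.01))  # add-delta, smaller boost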

SLIDE 14

Cher example

¤ |V| = 11, |seen bigram types| = 11 ⇒ 110 unseen bigram types.
¤ P_Lap(unseen | w_{i-1}) ≥ 1/14; thus "count"(w_{i-1} unseen) ≈ 110 * 1/14 ≈ 7.8.
¤ Compare this against the 12 bigram tokens in the training corpus.

SLIDE 15

Good-Turing Estimation

¤ For each bigram count r in the corpus, look at how many bigrams had the same count: the "count of counts" n_r.
¤ Now re-estimate bigram counts as r* = (r + 1) n_{r+1} / n_r.
¤ One intuition: the bigrams seen r + 1 times tell us how much count mass the bigrams seen r times should receive.
¤ 0* is now greater than zero.
¤ The total sum of counts stays the same: Σ_{r≥0} n_r r* = Σ_{r≥1} r n_r = N.
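A small sketch of the basic Good-Turing count re-estimation (no smoothing of n_r yet, so it only produces r* where n_{r+1} > 0; all names are illustrative):

from collections import Counter

def good_turing_counts(bigram_counts, num_unseen):
    """Re-estimate counts as r* = (r + 1) * n_{r+1} / n_r.
    bigram_counts: Counter of observed bigram counts.
    num_unseen:    number of possible but unseen bigrams (n_0)."""
    n = Counter(bigram_counts.values())   # count of counts: n[r] = number of bigrams seen r times
    n[0] = num_unseen
    r_star = {}
    for r in sorted(n):
        if n.get(r + 1, 0) > 0:
            r_star[r] = (r + 1) * n[r + 1] / n[r]
        # for large r with n_{r+1} = 0, n_r itself must be smoothed (e.g. Simple Good-Turing)
    return r_star

# Example with made-up counts:
# print(good_turing_counts(Counter({("the", "cat"): 2, ("on", "the"): 1, ("the", "mat"): 1}), num_unseen=110))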

SLIDE 16

Good-Turing Estimation

¤ Problem: n_r becomes zero for large r.
¤ Solution: smooth out n_r in some way, e.g. Simple Good-Turing (Gale/Sampson 1995): fit a regression line to log n_r as a function of log r and use the smoothed values in place of the raw n_r.
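A sketch of the core regression step only (assumes at least two distinct observed counts; the full Gale/Sampson recipe also averages n_r over neighbouring r and decides per r whether to use the raw or the smoothed value):

import math

def simple_gt_smoothed_nr(count_of_counts):
    """Fit log n_r = a + b * log r by least squares and return smoothed n_r values."""
    rs = sorted(r for r in count_of_counts if r > 0)
    xs = [math.log(r) for r in rs]
    ys = [math.log(count_of_counts[r]) for r in rs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x
    return {r: math.exp(a + b * math.log(r)) for r in rs}

# Example with made-up counts of counts: n_1 = 120, n_2 = 40, n_3 = 15, n_5 = 4
# print(simple_gt_smoothed_nr({1: 120, 2: 40, 3: 15, 5: 4}))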

SLIDE 17

Good-Turing > Laplace

(Manning/Schütze after Church/Gale 1991)

SLIDE 18

Linear Interpolation

¤ One problem with Good-Turing: all unseen events are assigned the same probability.
¤ Idea: P*(w_i | w_{i-1}) for an unseen bigram w_{i-1} w_i should be higher if w_i is a frequent word.
¤ Linear interpolation: combine multiple models with a weighting factor λ, e.g. P_LI(w_i | w_{i-1}) = λ P(w_i | w_{i-1}) + (1 - λ) P(w_i).
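A minimal sketch of interpolating a bigram and a unigram model with a single λ (the helper names follow the earlier sketches and are not from the slides):

def p_interpolated(w, prev, p_bigram, p_unigram, lam=0.7):
    """Linear interpolation: P_LI(w | prev) = lam * P(w | prev) + (1 - lam) * P(w)."""
    return lam * p_bigram(w, prev) + (1.0 - lam) * p_unigram(w)

# A frequent word keeps some probability even in an unseen bigram context:
# p_uni = lambda w: unigram_counts[w] / sum(unigram_counts.values())
# print(p_interpolated("the", "purple", p_ml, p_uni))   # > 0 although "purple the" was never seen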

SLIDE 19

Linear interpolation

¤ Simplest variant: use the same λ for all bigrams (λ_{w_{i-1} w_i} = λ).
¤ Estimate λ from held-out data.
¤ Can also bucket bigrams in various ways and use one λ per bucket, for better performance.
¤ Linear interpolation generalizes to higher-order n-grams.
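One simple way to pick λ on held-out data is a grid search over perplexity; this is a sketch only (the slide's actual estimation procedure is not reproduced here) and it reuses the perplexity helper and model functions from the earlier sketches:

def best_lambda(heldout_words, p_bigram, p_unigram, grid=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """Pick the lambda with the lowest perplexity on held-out data (simple grid search).
    Assumes perplexity() from the cross-entropy sketch above is in scope."""
    def interpolated(lam):
        return lambda w, prev: lam * p_bigram(w, prev) + (1.0 - lam) * p_unigram(w)
    return min(grid, key=lambda lam: perplexity(heldout_words, interpolated(lam)))

# lam = best_lambda(heldout_corpus, p_ml, p_uni)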

(graph from Dan Klein)

SLIDE 20

Backoff models

¤ Katz: try fine-grained model first; if not enough data available, back off to lower-order model.

¤ By contrast, interpolation always mixes different models.

¤ General formula (e.g., k = 5): if C(w_{i-1} w_i) > 0, use a discounted ML estimate d_r C(w_{i-1} w_i) / C(w_{i-1}), applying the discount only to counts r ≤ k; otherwise back off to α(w_{i-1}) P(w_i).
¤ Choose α and d appropriately to redistribute probability mass in a principled way.
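A deliberately simplified backoff sketch (closer to "stupid backoff" than to full Katz, since the proper computation of α and d_r is omitted; all names are illustrative and reuse the earlier count structures):

def p_backoff(w, prev, bigram_counts, unigram_counts, total_tokens, alpha=0.4):
    """Back off to the unigram model when the bigram is unseen.
    Note: with a constant alpha this is an unnormalized score ("stupid backoff"),
    not the normalized Katz model described on the slide."""
    if bigram_counts[(prev, w)] > 0:
        return bigram_counts[(prev, w)] / unigram_counts[prev]
    return alpha * unigram_counts[w] / total_tokens

# print(p_backoff("dog", "the", bigram_counts, unigram_counts, sum(unigram_counts.values())))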

SLIDE 21

Kneser-Ney smoothing

¤ Interpolation and backoff models that rely on unigram models can make mistakes if there was a reason why a bigram was rare:
¤ "I can't see without my reading ______"
¤ The unigram count C(Francisco) > C(glasses), but "Francisco" appears only in very specific contexts (example from Jurafsky & Martin).
¤ Kneser-Ney smoothing: P(w) models how likely w is to occur after words that we haven't seen it with (a "continuation" probability).
¤ This captures the "specificity" of "Francisco" vs. "glasses".
¤ Originally formulated as a backoff model; nowadays usually used as interpolation.
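A sketch of the continuation probability that Kneser-Ney uses for the lower-order model (the simple type-count version; the discounting and interpolation machinery is omitted, and the structure names follow the earlier sketches):

def continuation_probability(w, bigram_counts):
    """P_continuation(w): fraction of distinct bigram types that end in w.
    A word like "Francisco" follows very few distinct words, so its
    continuation probability is low even if its raw unigram count is high."""
    types_ending_in_w = len({prev for (prev, nxt) in bigram_counts if nxt == w})
    total_bigram_types = len(bigram_counts)
    return types_ending_in_w / total_bigram_types

# print(continuation_probability("glasses", bigram_counts))
# print(continuation_probability("Francisco", bigram_counts))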

SLIDE 22

Smoothing performance

(Chen/Goodman 1998)

SLIDE 23

Summary

¤ In practice (speech recognition, SMT, etc.):

¤ unigram and bigram models are not accurate enough
¤ trigram models work much better
¤ higher-order models only if we have lots of training data

¤ Smoothing is important and surprisingly effective.

¤ permits use of a "deeper" model with the same amount of data
¤ "If data sparsity is not a problem for you, your model is too simple."

SLIDE 24

Friday

¤ Part of Speech Tagging
