CS447: Natural Language Processing
http://courses.engr.illinois.edu/cs447
Lecture 4: Smoothing
Julia Hockenmaier
juliahmr@illinois.edu, 3324 Siebel Center
Last lecture's key concepts:
Basic probability review: joint probability, conditional probability
Probability models and independence assumptions
Parameter estimation: relative frequency estimation (aka maximum likelihood estimation)
Language models
N-gram language models: unigram, bigram, trigram, …
A language model is a distribution P(W) over the (infinite) set of possible word sequences W.
To define a distribution over this infinite set, we have to make independence assumptions. N-gram language models assume that each word wi depends only on the n−1 preceding words:

Pn-gram(w1 … wT) := ∏i=1..T P(wi | wi−1, …, wi−(n−1))
Punigram(w1 … wT) := ∏i=1..T P(wi)
Pbigram(w1 … wT) := ∏i=1..T P(wi | wi−1)
Ptrigram(w1 … wT) := ∏i=1..T P(wi | wi−1, wi−2)
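As a concrete illustration (not from the slides), here is a minimal Python sketch of how a bigram model scores a sentence; the probability table P is made up for the example:

P = {('<s>', 'John'): 0.2, ('John', 'loves'): 0.1, ('loves', 'Mary'): 0.3}

def bigram_prob(words):
    # P(w1 ... wT) = prod_i P(wi | wi-1), with <s> as left padding
    prob = 1.0
    for prev, cur in zip(['<s>'] + words, words):
        prob *= P.get((prev, cur), 0.0)  # an unseen bigram zeroes out the product (MLE)
    return prob

print(bigram_prob(['John', 'loves', 'Mary']))  # 0.2 * 0.1 * 0.3 = 0.006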
Consider the sentence W = “John loves Mary”. For a trigram model, we could write:
P(w3 = Mary | w1w2 = “John loves”)
This notation implies that we treat the preceding bigram w1w2 as one single conditioning variable P(X | Y).
Instead, we typically write:
P(w3 = Mary | w2 = loves, w1 = John)
Although this is less readable (John loves → loves, John), this notation gives us more flexibility, since it implies that we treat the preceding bigram w1w2 as two conditioning variables P(X | Y, Z).
Parameters: the actual probabilities (numbers)
P(wi = ‘the’ | wi-1 = ‘on’) = 0.0123
We need (a large amount of) text as training data to estimate the parameters of a language model.
The most basic estimation technique is relative frequency estimation (= counts):
P(wi = ‘the’ | wi−1 = ‘on’) = C(‘on the’) / C(‘on’)
This assigns all probability mass to events in the training corpus. It is also called Maximum Likelihood Estimation (MLE).
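A minimal sketch of relative frequency estimation for a bigram model (illustration only; sentence-boundary handling is simplified to a single <s> padding symbol):

from collections import Counter

def train_bigram_mle(corpus):
    """MLE bigram estimates, P(w | v) = C(v w) / C(v), from tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ['<s>'] + sentence
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return {(v, w): c / unigrams[v] for (v, w), c in bigrams.items()}

P = train_bigram_mle([['the', 'wolf', 'is', 'an', 'endangered', 'species']])
print(P[('the', 'wolf')])  # 1.0: 'wolf' is the only word ever seen after 'the'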
Recall the Shakespeare example:
Only 30,000 word types occurred. Any word that does not occur in the training data has zero probability!
Only 0.04% of all possible bigrams occurred. Any bigram that does not occur in the training data has zero probability!
In natural language, a few words are very frequent, but most words are very rare.

[Figure: English words, sorted by frequency (w1 = the, w2 = to, …, w5346 = computer, …), plotted against the number of words with each frequency, both on log scales: how many words occur once, twice, 100 times, 1000 times?]

Zipf's law: the r-th most common word wr has P(wr) ∝ 1/r.
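A quick way to see this skew on any corpus (a sketch; 'corpus.txt' is a placeholder for a whitespace-tokenized text file):

from collections import Counter

tokens = open('corpus.txt').read().split()   # placeholder file name
freq = Counter(tokens)

# Zipf's law predicts rank * frequency is roughly constant (P(wr) proportional to 1/r):
for rank, (word, count) in enumerate(freq.most_common(), start=1):
    if rank in (1, 10, 100, 1000):
        print(rank, word, count, rank * count)

# How many word types occur exactly once, twice, 100 times?
freq_of_freq = Counter(freq.values())
print(freq_of_freq[1], freq_of_freq[2], freq_of_freq[100])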
We can't actually evaluate our MLE models on unseen test data (or system output), because both are likely to contain words/n-grams that these models assign zero probability to. We need language models that assign some probability mass to unseen words and n-grams.
How can we design language models* that can deal with previously unseen events?
*actually, probabilistic models in general
MLE model:      P(seen) = 1.0    P(unseen) = 0.0
Smoothed model: P(seen) < 1.0    P(unseen) > 0.0
Relative frequency estimation assigns all probability mass to events in the training corpus. But we need to reserve some probability mass for events that don't occur in the training data.
Unseen events = new words, new bigrams
Important questions:
What possible events are there? How much probability mass should they get?
Simple distributions: P(X = x) (e.g. unigram models)
Possibility: the outcome x has not occurred during training (i.e. is unknown).
Questions:
Simple conditional distributions: P(X = x | Y = y) (e.g. bigram models)
Case 1: the outcome x has been seen, but not in the context of Y = y.
Case 2: the conditioning variable y has not been seen: back off, and use P(X) instead.
Complex conditional distributions (with multiple conditioning variables): P(X = x | Y = y, Z = z) (e.g. trigram models)
Case 1: the outcome X = x was seen, but not in the context of (Y = y, Z = z).
Case 2: the joint conditioning event (Y = y, Z = z) hasn't been seen.
Training data: The wolf is an endangered species
Test data: The wallaby is endangered
What is the probability of an unknown word (in any context)?
What is the probability of a known word in a known context, if that word hasn’t been seen in that context?
What is the probability of a known word in an unseen context?
Unigram            Bigram                   Trigram
P(the)             P(the | <s>)             P(the | <s>)
× P(wallaby)       × P(wallaby | the)       × P(wallaby | the, <s>)
× P(is)            × P(is | wallaby)        × P(is | wallaby, the)
× P(endangered)    × P(endangered | is)     × P(endangered | is, wallaby)
Dealing with unknown words
Training: Define a fixed vocabulary (e.g. all words that occur at least twice (or n times) in the corpus), replace all other words in the training data with an unknown-word token <UNK>, and estimate the model as usual.
Testing: Replace any word not in the vocabulary with <UNK>.
This requires a large training corpus to work well.
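A minimal sketch of this preprocessing step (the threshold of 2 follows the "at least twice" example above):

from collections import Counter

def replace_rare_words(train_sentences, min_count=2):
    """Fix the vocabulary to words seen at least min_count times; map the rest to <UNK>."""
    counts = Counter(w for sent in train_sentences for w in sent)
    vocab = {w for w, c in counts.items() if c >= min_count}
    unked = [[w if w in vocab else '<UNK>' for w in sent] for sent in train_sentences]
    return unked, vocab

# At test time, apply the same mapping with the *training* vocabulary:
# test_sent = [w if w in vocab else '<UNK>' for w in test_sent]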
Dealing with unseen events
Use a different estimation technique:
Idea: Replace the MLE estimate P(w) = C(w)/N with a smoothed estimate that reserves mass for unseen events (add-one smoothing, Good-Turing).
Combine a complex model with a simpler model:
Idea: Use bigram probabilities P(wi | wi−1) to calculate trigram (in general, n-gram) probabilities P(wi | wi−n … wi−1) (linear interpolation, backoff).
Add-one smoothing
Assume every (seen or unseen) event occurred once more than it actually did (i.e. add 1 to all counts).
Example: unigram probabilities, estimated from a corpus with N tokens and a vocabulary (number of word types) of size V:

MLE:     P(wi) = C(wi) / ∑j C(wj) = C(wi) / N
Add-one: P(wi) = (C(wi)+1) / ∑j (C(wj)+1) = (C(wi)+1) / (N+V)
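In code, this is a one-line change to the MLE estimate (a sketch over a fixed vocabulary; setting k < 1 gives the add-k variant discussed below):

from collections import Counter

def add_one_unigram_probs(tokens, vocab, k=1):
    """Add-one (k=1) unigram estimates over a fixed vocabulary; k < 1 gives add-k."""
    C = Counter(tokens)
    N, V = len(tokens), len(vocab)
    # P(w) = (C(w) + k) / (N + k*V); every unseen word now gets k / (N + k*V) > 0
    return {w: (C[w] + k) / (N + k * V) for w in vocab}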
[Figure: original (MLE) vs. add-one smoothed bigram counts]

Problem: Add-one moves too much probability mass from seen to unseen events!
We can “reconstitute” pseudo-counts c* for our training set of size N from our estimate:

Unigrams: c*(wi) = P(wi)·N = (C(wi)+1)/(N+V) · N
Bigrams:  c*(wi | wi−1) = P(wi | wi−1)·C(wi−1) = (C(wi−1wi)+1)/(C(wi−1)+V) · C(wi−1)

Here P(wi) is the probability that the next word is wi, N is the number of word tokens we generate, V is the size of the vocabulary, P(wi−1wi) is the probability of the bigram “wi−1wi”, and C(wi−1) is the frequency of wi−1 in the training data. In each case we plug in the model definition of P(wi) or P(wi | wi−1) and rearrange to see the dependence on N and V.
[Figure: original vs. reconstituted (add-one) bigram counts]
The Shakespeare example (V = 30,000 word types; ‘the’ occurs 25,545 times).
Bigram probabilities for ‘the …’:

P(wi | wi−1 = ‘the’) = (C(‘the’ wi) + 1) / (25,545 + 30,000)

Advantage:
Very simple to implement.
Disadvantage:
Takes away too much probability mass from seen events.
Assigns too much total probability mass to unseen events.
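A back-of-the-envelope check with the slide's numbers (the number S of word types actually seen after ‘the’ is not given, so the value below is a made-up illustration):

C_the = 25_545    # frequency of 'the' in the Shakespeare corpus (from the slide)
V = 30_000        # vocabulary size (word types)

p_unseen = 1 / (C_the + V)     # add-one probability of each unseen bigram 'the w'
S = 1_000                      # hypothetical number of word types seen after 'the'
print(p_unseen)                # ~1.8e-05
print((V - S) * p_unseen)      # ~0.52: over half the mass goes to unseen bigrams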
Add-k smoothing
A variant of add-one smoothing: for any k > 0 (typically, k < 1):

Add-k: P(wi) = (C(wi)+k) / (N + kV)

This is still too simplistic to work well.
Good-Turing smoothing
Basic idea: Use the total frequency of events that occur only once to estimate how much mass to shift to the unseen events.

Nc: number of event types that occur c times (can be counted)
N1: number of event types that occur once
N = 1·N1 + … + m·Nm: total number of observed event tokens

Relative frequency (MLE) estimate:
P(seen) + P(unseen) = N/N + 0 = 1

Good-Turing estimate:
P(seen) + P(unseen) = (2·N2 + … + m·Nm) / ∑i=1..m i·Ni + 1·N1 / ∑i=1..m i·Ni = 1
General principle: Reassign the probability mass of all events that occur k times in the training data to all events that occur k−1 times.

Nk events occur k times, with a total frequency of k·Nk.
The probability mass of all words that appear k−1 times becomes:

∑w:C(w)=k−1 PGT(w) = ∑w′:C(w′)=k PMLE(w′) = ∑w′:C(w′)=k k/N = k·Nk/N

There are Nk−1 words w that occur k−1 times in the training data. Good-Turing replaces the original count ck−1 of w with a new count c*k−1:

c*k−1 = k·Nk / Nk−1
The Maximum Likelihood estimate of the probability of a word w that occurs k−1 times:

PMLE(w) = ck−1/N = (k−1)/N

The Good-Turing estimate of the probability of such a word:

PGT(w) = c*k−1/N = (k·Nk/Nk−1)/N = k·Nk/(N·Nk−1)
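A minimal sketch of the basic Good-Turing computation (it assumes Nk > 0 for the small k it touches; the Simple Good-Turing variant below addresses the gaps):

from collections import Counter

def good_turing(counts, max_k=5):
    """Adjusted counts c*_k = (k+1) * N_{k+1} / N_k for small k, which is the
    slide's c*_{k-1} = k * N_k / N_{k-1} with the index shifted by one."""
    N_k = Counter(counts.values())            # N_k: number of types occurring k times
    adjusted = {k: (k + 1) * N_k[k + 1] / N_k[k]
                for k in range(1, max_k + 1) if N_k[k] and N_k[k + 1]}
    N = sum(counts.values())                  # total number of observed tokens
    p_unseen_total = N_k[1] / N               # mass shifted to all unseen events
    return adjusted, p_unseen_total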
Problem 1: What happens to the most frequent event?
Problem 2: We don't observe events for every k.

Variant: Simple Good-Turing
Replace Nn with a fitted function f(n) = a + b·log(n).
Requires parameter tuning (on held-out data): set a, b so that f(n) ≈ Nn for known values.
Use the adjusted counts cn* only for small n.
Linear interpolation
We don't see “Bob was reading”, but we do see “__ was reading”. We estimate P(reading | ‘Bob was’) = 0, but P(reading | ‘was’) > 0.
Use (n−1)-gram probabilities to smooth n-gram probabilities:

PLI(wi | wi−n … wi−1)  [smoothed n-gram]
   = λ · P(wi | wi−n … wi−1)  [unsmoothed n-gram]
   + (1−λ) · PLI(wi | wi−n+1 … wi−1)  [smoothed (n−1)-gram]
The smoothed probability Psmoothed-trigram(wi | wi−2 wi−1) is a linear combination of Punsmoothed-trigram(wi | wi−2 wi−1) and Pbigram(wi | wi−1):
[Figure: psmoothed-trigram lies between punsmoothed-trigram and pbigram, at a position determined by λ]
We’ve never seen “Bob was reading”, but we might have seen “__ was reading”, and we’ve certainly seen “__ reading” (or <UNK>)
Psmoothed(wi = reading | wi−1 = was, wi−2 = Bob)
   = λ3 · Punsmoothed-trigram(wi = reading | wi−1 = was, wi−2 = Bob)
   + λ2 · Punsmoothed-bigram(wi = reading | wi−1 = was)
   + λ1 · Punsmoothed-unigram(wi = reading)

In general:
Psmoothed(wi | wi−1, wi−2) = λ3·P(wi | wi−1, wi−2) + λ2·P(wi | wi−1) + λ1·P(wi), with λ1 + λ2 + λ3 = 1
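A sketch of the three-way interpolation; P3, P2, P1 stand for dictionaries of unsmoothed trigram, bigram, and unigram MLE estimates, and the λ values are placeholders to be tuned:

def interpolated_prob(w, w1, w2, P3, P2, P1, lambdas=(0.6, 0.3, 0.1)):
    """P~(w | w2 w1) = l3*P(w | w2 w1) + l2*P(w | w1) + l1*P(w)."""
    l3, l2, l1 = lambdas                      # must sum to 1
    return (l3 * P3.get((w2, w1, w), 0.0)     # unsmoothed trigram
            + l2 * P2.get((w1, w), 0.0)       # unsmoothed bigram
            + l1 * P1.get(w, 0.0))            # unsmoothed unigram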
Method A: Held-out estimation
Divide the data into training and held-out data. Estimate the models on the training data. Use the held-out data (and some optimization technique) to find the λ that gives the best model performance. Often, λ is a learned function of the frequencies of wi−n…wi−1.
Method B: λ is some (deterministic) function of the frequencies.
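Method A can be as simple as a grid search over λ settings, keeping whichever maximizes the held-out log-likelihood (a sketch; interpolated_prob is the hypothetical function from the previous example):

import itertools, math

def tune_lambdas(heldout_trigrams, P3, P2, P1, step=0.1):
    """Grid search for (l3, l2, l1) maximizing held-out log-likelihood."""
    grid = [i * step for i in range(int(round(1 / step)) + 1)]
    best, best_ll = None, float('-inf')
    for l3, l2 in itertools.product(grid, repeat=2):
        l1 = 1.0 - l3 - l2
        if l1 < -1e-9:                        # skip infeasible combinations
            continue
        l1 = max(l1, 0.0)
        ll = sum(math.log(interpolated_prob(w, w1, w2, P3, P2, P1, (l3, l2, l1)) or 1e-300)
                 for (w2, w1, w) in heldout_trigrams)
        if ll > best_ll:
            best, best_ll = (l3, l2, l1), ll
    return best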
Absolute discounting
Subtract a constant factor D < 1 from each nonzero n-gram count, and interpolate with PAD(wi | wi−1):

PAD(wi | wi−1, wi−2) = max(C(wi−2wi−1wi) − D, 0) / C(wi−2wi−1) + (1−λ) · PAD(wi | wi−1)

The first term is non-zero only if the trigram wi−2wi−1wi is seen.

If S seen word types occur after wi−2wi−1 in the training data, this reserves the probability mass P(U) = (S·D)/C(wi−2wi−1), to be computed according to PAD(wi | wi−1). Set:

(1−λ) = P(U) = S·D / C(wi−2wi−1)

N.B.: with N1, N2 the number of n-grams that occur once or twice, D = N1/(N1+2N2) works well in practice.
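A sketch of the trigram case (D = 0.75 is a common placeholder; per the N.B., D = N1/(N1+2N2) is a better choice; the scan over all trigram types to find S is kept naive for clarity):

def absolute_discount(w, w1, w2, tri_counts, bi_counts, P_backoff, D=0.75):
    """P_AD(w | w2 w1): discounted trigram estimate plus reserved mass times backoff."""
    c_ctx = bi_counts.get((w2, w1), 0)        # C(w2 w1)
    if c_ctx == 0:
        return P_backoff(w, w1)               # context never seen: back off entirely
    discounted = max(tri_counts.get((w2, w1, w), 0) - D, 0) / c_ctx
    S = sum(1 for (a, b, _) in tri_counts if (a, b) == (w2, w1))  # seen successor types
    reserved = S * D / c_ctx                  # (1 - lambda) = S*D / C(w2 w1)
    return discounted + reserved * P_backoff(w, w1)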
Kneser-Ney smoothing
Observation: “San Francisco” is frequent, but “Francisco” only occurs after “San”.
Solution: the unigram probability P(w) should not depend on the frequency of w, but on the number of contexts in which w appears.

N1+(●w): number of contexts in which w appears = number of word types w′ which precede w
N1+(●●) = ∑w′ N1+(●w′)

Kneser-Ney smoothing: use absolute discounting, but with P(w) = N1+(●w) / N1+(●●).
Modified Kneser-Ney smoothing: use a different D for bigrams and trigrams (Chen & Goodman ’98).
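The continuation counts are easy to compute from the set of observed bigram types (a sketch):

from collections import Counter

def continuation_probs(bigram_counts):
    """Kneser-Ney unigram: P(w) = N1+(.w) / N1+(..), counting distinct contexts."""
    n_contexts = Counter()
    for (v, w) in bigram_counts:              # each distinct bigram type counts once
        n_contexts[w] += 1                    # N1+(.w): word types v that precede w
    total = sum(n_contexts.values())          # N1+(..) = sum over w of N1+(.w)
    return {w: c / total for w, c in n_contexts.items()}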
Today's key concepts:
Dealing with unknown words
Dealing with unseen events
Good-Turing smoothing
Linear interpolation
Absolute discounting
Kneser-Ney smoothing

Today's reading: Jurafsky and Martin, Chapter 4, sections 1-4