Probability & Language Modeling (II) Classification & the Noisy Channel Model
CMSC 473/673 UMBC September 11th, 2017
Some slides adapted from 3SLP , Jason Eisner
Recap from last time
Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today.
Probability Takeaways
Basic probability axioms and definitions
Probabilistic independence
Definition of joint probability
Definition of conditional probability
Bayes rule
Probability chain rule
Kinds of Statistics
Descriptive, confirmatory, predictive.
A descriptive example: "The average grade on this assignment is 83."
Conditional Probabilities Are Probabilities
Consider the space of all (known) outcomes involving: the coin being flipped, the cafeteria serving egg salad, a defective minting process, the coin being ancient, the coin coming up heads.

q(heads | egg salad) vs. q(heads | NOT egg salad): different conditioning events, so these need not sum to 1.

q(heads | egg salad) vs. q(tails | egg salad): the same conditioning event, so together these sum to 1. Conditional probabilities are probabilities.
Bayes Rule
q(A | B) = q(B | A) q(A) / q(B)

posterior probability = likelihood × prior probability / marginal likelihood (probability)
Simple Count-Based Language Models

q(item) ∝ count(item)   ("proportional to")
Proportional To/Normalization

To get probabilities, normalize the counts by their sum over everything:

c(item1) = x1, c(item2) = x2, c(item3) = x3, c(item4) = x4

p(item_i) ∝ x_i, so p(item_i) = x_i / (x1 + x2 + x3 + x4)
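A minimal Python sketch of this count-then-normalize step (the item names and counts here are made up for illustration):

```python
from collections import Counter

def normalize(counts):
    """Turn raw counts into probabilities by dividing by the total."""
    total = sum(counts.values())
    return {item: c / total for item, c in counts.items()}

# Four made-up item types with counts x1..x4:
counts = Counter({"item1": 3, "item2": 1, "item3": 4, "item4": 2})
probs = normalize(counts)
assert abs(sum(probs.values()) - 1.0) < 1e-12  # now a proper distribution
```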
Simple Count-Based Language Models

q(item) ∝ count(item): "proportional to" means equal up to a (normalization) constant.
Simple Count-Based Language Models

The "items" can be sequences of characters (pseudo-words) or sequences of words (pseudo-phrases).
Shakespearian Sequences of Words
Probability Chain Rule
q(y_1, y_2, …, y_T) = q(y_1) q(y_2 | y_1) q(y_3 | y_1, y_2) ⋯ q(y_T | y_1, …, y_{T-1})
Chain Rule + Backoff (Markov assumption) = n-grams
N-Gram Terminology
n   commonly called    history size (Markov order)   example
1   unigram            0                             p(furiously)
2   bigram             1                             p(furiously | sleep)
3   trigram (3-gram)   2                             p(furiously | ideas sleep)
4   4-gram             3                             p(furiously | green ideas sleep)
n   n-gram             n-1                           p(w_i | w_{i-n+1} … w_{i-1})

How to (efficiently) compute p(Colorless green ideas sleep furiously)?
Trigrams
p(Colorless green ideas sleep furiously) = p(Colorless) * p(green | Colorless) * p(ideas | Colorless green) * p(sleep | green ideas) * p(furiously | ideas sleep)
Trigrams
p(Colorless green ideas sleep furiously) = p(Colorless | <BOS> <BOS>) * p(green | <BOS> Colorless) * p(ideas | Colorless green) * p(sleep | green ideas) * p(furiously | ideas sleep) * p(<EOS> | sleep furiously)
Consistent notation: pad the left with <BOS> (beginning of sentence) symbols.
Fully proper distribution: pad the right with a single <EOS> (end of sentence) symbol.
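A sketch of this padded factorization in Python; `ngram_prob` is an assumed interface, (history tuple, next word) → probability, standing in for whichever estimator the later slides define:

```python
import math

def sentence_logprob(words, ngram_prob, n=3):
    """Log-probability of a sentence under an n-gram model, padding the
    left with <BOS> symbols and the right with a single <EOS>."""
    padded = ["<BOS>"] * (n - 1) + words + ["<EOS>"]
    logp = 0.0
    for i in range(n - 1, len(padded)):
        history = tuple(padded[i - n + 1:i])
        logp += math.log(ngram_prob(history, padded[i]))
    return logp

# sentence_logprob("Colorless green ideas sleep furiously".split(), p)
# multiplies exactly the six factors shown above (in log space).
```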
N-Gram Probability

p(w_i | w_{i-n+1} … w_{i-1}) = count(w_{i-n+1} … w_i) / count(w_{i-n+1} … w_{i-1})

Count-Based N-Grams (Unigrams)

For unigrams: p(w) = count(w) / (number of tokens observed), where w is a word type.
Count-Based N-Grams (Unigrams)
The film got a great opening and the film went on to become a hit .
Word (Type)   Raw Count   Normalization   Probability
The           1           16              1/16
film          2           16              2/16 = 1/8
got           1           16              1/16
a             2           16              2/16 = 1/8
great         1           16              1/16
opening       1           16              1/16
and           1           16              1/16
the           1           16              1/16
went          1           16              1/16
on            1           16              1/16
to            1           16              1/16
become        1           16              1/16
hit           1           16              1/16
.             1           16              1/16
Count-Based N-Grams (Trigrams)

p(z | x, y) = count(x, y, z) / count(x, y)

The denominator is the count of the conditioning context; the numerator is the count of the full trigram. Order matters:

count(x, y, z) ≠ count(x, z, y) ≠ count(y, x, z) ≠ …
Count-Based N-Grams (Trigrams)

The film got a great opening and the film went on to become a hit .

Context    Word (Type)   Raw Count   Normalization   Probability
The film   The           0           1               0/1
The film   film          0           1               0/1
The film   got           1           1               1/1
The film   went          0           1               0/1
…
a great    great         0           1               0/1
a great    opening       1           1               1/1
a great    and           0           1               0/1
a great    the           0           1               0/1
…
Count-Based N-Grams (Lowercased Trigrams)

the film got a great opening and the film went on to become a hit .

Context    Word (Type)   Raw Count   Normalization   Probability
the film   the           0           2               0/2
the film   film          0           2               0/2
the film   got           1           2               1/2
the film   went          1           2               1/2
…

Lowercasing merges "The film" and "the film" into a single context with count 2.
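The same counts in code, as a minimal sketch (no smoothing, and contexts are counted only where they start a trigram):

```python
from collections import Counter

def trigram_mle(tokens):
    """Count-based (MLE) trigram estimates from a token list."""
    ctx_counts = Counter()   # count(x, y)
    tri_counts = Counter()   # count(x, y, z); order matters
    for x, y, z in zip(tokens, tokens[1:], tokens[2:]):
        ctx_counts[(x, y)] += 1
        tri_counts[(x, y, z)] += 1
    return lambda ctx, w: tri_counts[ctx + (w,)] / ctx_counts[ctx]

tokens = "the film got a great opening and the film went on to become a hit .".split()
p = trigram_mle(tokens)
print(p(("the", "film"), "got"))   # 0.5, matching the 1/2 in the table
print(p(("the", "film"), "went"))  # 0.5
```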
Evaluating Language Models

What is "correct"? What is working "well"?

Training Data: acquire the primary statistics for learning model parameters.
Dev Data: fine-tune any secondary (hyper)parameters.
Test Data: perform the final evaluation.

DO NOT ITERATE ON THE TEST DATA
Evaluating Language Models

Extrinsic: evaluate the LM in a downstream task. Test an MT, ASR, etc. system and see which LM does better; but this can propagate & conflate errors.

Intrinsic: treat the LM as its own downstream task. Use perplexity (from information theory).
Perplexity

Lower is better: lower perplexity means the model is less surprised. More possible outcomes, more surprised; fewer outcomes, less surprised.

For M tokens w_1 … w_M, each predicted from its n-gram history (the previous n-1 items):

perplexity = b^( -(1/M) Σ_i log_b q(w_i | w_{i-n+1} … w_{i-1}) )

where the base b of the exponent and of the log must be the same. Building it up:

q(w_1 … w_M): ≥ 0 and ≤ 1; higher is better
log q(w_1 … w_M): ≤ 0; higher is better
(1/M) Σ_i log q(w_i | history): ≤ 0; higher is better
-(1/M) Σ_i log q(w_i | history): ≥ 0; lower is better
perplexity: ≥ 0; lower is better

Perplexity is the inverse of a weighted geometric average of the per-token probabilities. 471/671 connection: perplexity acts as an effective branching factor.
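A minimal sketch of the computation, given the per-token probabilities q(w_i | history) over an evaluation corpus:

```python
import math

def perplexity(token_probs, base=2):
    """base^(-(1/M) * sum of log-probs); the log and the exponent
    must use the same base."""
    M = len(token_probs)
    avg_neg_logp = -sum(math.log(p, base) for p in token_probs) / M
    return base ** avg_neg_logp

# A uniform model over 8 outcomes has perplexity 8: the branching factor.
assert round(perplexity([1 / 8] * 100)) == 8
```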
Outline
Probability review
Words
Defining Language Models
Breaking & Fixing Language Models
Maximum Likelihood Estimates
Maximizes the likelihood of the training set. Do different corpora look the same? For large data, MLE can actually do reasonably well.

q(item) ∝ count(item)
n = 1
, , land of in , a teachers The , wilds the and gave a Etienne any two beginning without probably heavily that other useless the the a different . the able mines , unload into in foreign the the be either other Britain finally avoiding , for of have the cure , the Gutenberg-tm ; of being can as country in authority deviates as d seldom and They employed about from business marshal materials than in , they
n = 2
These varied with it to the civil wars , therefore , it did not for the company had the East India , the mechanical , the sum which were by barter , vol. i , and , conveniencies of all made to purchase a council of landlords , constitute a sum as an argument , having thus forced abroad , however , and influence in the one , or banker , will there was encouraged and more common trade to corrupt , profit , it ; but a master does not , twelfth year the consent that of volunteers and […] , the other hand , it certainly it very earnestly entreat both nations . In opulent nations in a revenue of four parts of production .
n = 3
His employer , if silver was regulated according to the temporary and
What goods could bear the expense of defending themselves , than in the value of different sorts of goods , and placed at a much greater , there have been the effects of self-deception , this attention , but a very important ones , and which , having become of less than they ever were in this agreement for keeping up the business of weighing . After food , clothes , and a few months longer credit than is wanted , there must be sufficient to keep by him , are of such colonies to surmount . They facilitated the acquisition of the empire , both from the rents of land and labour of those pedantic pieces of silver which he can afford to take from the duty upon every quarter which they have a more equable distribution of employment .
n = 4
To buy in one market , in order to have it ; but the 8th of George III . The tendency of some of the great lords , gradually encouraged their villains to make upon the prices of corn , cattle , poultry , etc . Though it may , perhaps , in the mean time , that part of the governments of New England , the market , trade cannot always be transported to so great a number of seamen , not inferior to those of
The farmer makes his profit by parting with it . But the government of that country below what it is in itself necessarily slow , uncertain , liable to be interrupted by the weather .
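Samples like those above come from sampling an MLE n-gram model; a minimal bigram-only sketch (the higher-order versions extend the history the same way):

```python
import random
from collections import Counter, defaultdict

def sample_bigram(tokens, max_len=30):
    """Generate a word sequence from an MLE bigram model."""
    follows = defaultdict(Counter)
    padded = ["<BOS>"] + tokens + ["<EOS>"]
    for x, y in zip(padded, padded[1:]):
        follows[x][y] += 1          # count(y follows x)
    out, prev = [], "<BOS>"
    for _ in range(max_len):
        words, counts = zip(*follows[prev].items())
        prev = random.choices(words, weights=counts)[0]  # sample ∝ count
        if prev == "<EOS>":
            break
        out.append(prev)
    return " ".join(out)
```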
0s Are Not Your (Language Model's) Friend

q(item) ∝ count(item), so count(item) = 0 → q(item) = 0.

A 0-probability item is treated as impossible.
0s annihilate: x · y · z · 0 = 0.
Language is creative: new words keep appearing, and existing words keep appearing in new contexts.
How much do you trust your data?
Add-λ Estimation

Also called Laplace smoothing (λ = 1) or, in general, Lidstone smoothing. Pretend we saw each word λ more times than we did: add λ to all the counts, then renormalize:

p_add-λ(w) = (count(w) + λ) / (N + λV)

where N is the number of tokens and V the vocabulary size.
Add-1 N-Grams (Unigrams)

The film got a great opening and the film went on to become a hit .

Word (Type)   Raw Count   Norm.   Prob.   Add-1 Count   Add-1 Norm.       Add-1 Prob.
The           1           16      1/16    2             16 + 14·1 = 30    2/30 = 1/15
film          2           16      1/8     3             30                3/30 = 1/10
got           1           16      1/16    2             30                1/15
a             2           16      1/8     3             30                1/10
great         1           16      1/16    2             30                1/15
opening       1           16      1/16    2             30                1/15
and           1           16      1/16    2             30                1/15
the           1           16      1/16    2             30                1/15
went          1           16      1/16    2             30                1/15
on            1           16      1/16    2             30                1/15
to            1           16      1/16    2             30                1/15
become        1           16      1/16    2             30                1/15
hit           1           16      1/16    2             30                1/15
.             1           16      1/16    2             30                1/15
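A sketch of add-λ for unigrams, reproducing the table's numbers:

```python
from collections import Counter

def add_lambda_unigram(tokens, lam=1.0):
    """Add-λ smoothed unigrams: (count + λ) / (N + λV)."""
    counts = Counter(tokens)
    N, V = len(tokens), len(counts)   # 16 tokens, 14 types here
    return {w: (c + lam) / (N + lam * V) for w, c in counts.items()}

tokens = "The film got a great opening and the film went on to become a hit .".split()
probs = add_lambda_unigram(tokens, lam=1.0)
print(probs["film"])  # 3/30 = 0.1, matching the table
```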
Backoff and Interpolation

Sometimes it helps to use less context: condition on less context for contexts you haven't learned much about.

Backoff: use the trigram if you have good evidence; otherwise fall back to the bigram, then the unigram.

Interpolation: mix (average) the unigram, bigram, and trigram estimates.
Linear Interpolation

Simple interpolation:

q(z | y) = μ q2(z | y) + (1 - μ) q1(z),   0 ≤ μ ≤ 1

Conditioning the weights on the context:

q(w | y, z) = μ3(y, z) q3(w | y, z) + μ2(z) q2(w | z) + μ1 q1(w)
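A sketch of the simple (context-independent) mixture; p1, p2, p3 are assumed unigram/bigram/trigram estimators, and the weights here are made-up values:

```python
def interpolate(p1, p2, p3, mus=(0.2, 0.3, 0.5)):
    """q(w | y, z) = mu3*p3(w|y,z) + mu2*p2(w|z) + mu1*p1(w)."""
    mu1, mu2, mu3 = mus
    assert min(mus) >= 0 and abs(sum(mus) - 1.0) < 1e-12
    def q(w, y, z):
        return mu3 * p3(w, y, z) + mu2 * p2(w, z) + mu1 * p1(w)
    return q
```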
Discounted Backoff

Trust your statistics, up to a point:

q(z | x, y) = (count(x, y, z) - δ) / count(x, y)   if count(x, y, z) > 0
q(z | x, y) = α(x, y) · q(z | y)                   otherwise

where δ is the discount constant and α(x, y) is a context-dependent normalization constant (chosen so the distribution sums to 1).
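A sketch under the formula above; `tri_counts`/`ctx_counts` are Counters as in the earlier trigram code, and `bigram_prob(w, y)` is an assumed backoff estimator q(w | y):

```python
def discounted_backoff(tri_counts, ctx_counts, bigram_prob, delta=0.5):
    """Discounted backoff: subtract delta from seen trigram counts and
    hand the reserved mass, scaled by alpha(x, y), to the bigram."""
    def q(z, x, y):
        if tri_counts[(x, y, z)] > 0:
            return (tri_counts[(x, y, z)] - delta) / ctx_counts[(x, y)]
        # Words seen after (x, y); scanning all trigrams is slow but short.
        seen = {w for (a, b, w) in tri_counts if (a, b) == (x, y)}
        reserved = delta * len(seen) / ctx_counts[(x, y)]
        alpha = reserved / (1.0 - sum(bigram_prob(w, y) for w in seen))
        return alpha * bigram_prob(z, y)
    return q
```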
Setting Hyperparameters

Use a development corpus. Choose the λs to maximize the probability of the dev data: fix the N-gram probabilities/counts (on the training data), then search for the λs that give the largest probability to the held-out set.
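A grid-search sketch; `make_model` is an assumed factory that fixes the training counts and returns a dev-set log-probability function for a given weight triple:

```python
import itertools

def best_lambdas(dev_tokens, make_model, grid=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """Pick interpolation weights that maximize held-out probability."""
    best, best_score = None, float("-inf")
    for mu1, mu2 in itertools.product(grid, repeat=2):
        mu3 = 1.0 - mu1 - mu2
        if mu3 < 0:
            continue   # weights must stay a proper mixture
        score = make_model((mu1, mu2, mu3))(dev_tokens)
        if score > best_score:
            best, best_score = (mu1, mu2, mu3), score
    return best
```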
Implementation: Unknown Words

Create an unknown word token <UNK>.

Training: fix a vocabulary in advance; replace any out-of-vocabulary training word with <UNK>, and estimate its probabilities like those of any other word.

Evaluation: use the <UNK> probabilities for any word not in training.
Other Kinds of Smoothing

Interpolated (modified) Kneser-Ney. Idea: how "productive" is a context? How many different word types v appear in a context x, y?

Good-Turing. Partition words into classes of occurrence, then smooth the class statistics; properties of some classes are likely to predict properties of other classes.

Witten-Bell. Idea: every observed type was at some point novel; give an MLE prediction for a novel type occurring.
Bayes Rule NLP Applications
q(A | B) = q(B | A) q(A) / q(B)

posterior probability = likelihood × prior probability / marginal likelihood (probability)
Text Classification

Assigning subject categories, topics, or genres; spam detection; authorship identification; age/gender identification; language identification; sentiment analysis; …

Input: a document d and a fixed set of classes C = {c1, c2, …, cJ}
Output: a predicted class c ∈ C
Text Classification: Hand-coded Rules?

Rules based on combinations of words or other features, e.g.

spam: black-list-address OR ("dollars" AND "have been selected")

Accuracy can be high if the rules are carefully refined by an expert, but building and maintaining these rules is expensive.
Text Classification: Supervised Machine Learning

Input:
a document d
a fixed set of classes C = {c1, c2, …, cJ}
a training set of m hand-labeled documents (d1, c1), …, (dm, cm)

Output: a learned classifier γ : d → c

The classifier can be any of: Naïve Bayes, logistic regression, support-vector machines, k-nearest neighbors, …
Probabilistic Text Classification

Pick the most probable class given the data:

q(class | data) ∝ q(data | class) q(class)

where q(data | class) is the class-based likelihood and q(class) is the prior probability of the class.
Noisy Channel Model

What I want to tell you: "sports"
What you actually see: "The Os lost again…"
Decode a hypothesized intent: "sad stories"? "sports"?
Rerank: reweight the hypotheses according to what's likely, giving "sports"
Noisy Channel

Machine translation, speech-to-text, spelling correction, text normalization, part-of-speech tagging, morphological analysis, …

Recover the possible (clean) text X behind the observed (noisy) text Y:

p(X | Y) ∝ p(Y | X) p(X)

where p(Y | X) is the translation/decode model and p(X) is the (clean) language model.
Language Model

Use any of the language modeling algorithms we've learned: unigram, bigram, trigram; add-λ, interpolation, backoff. (Later: maxent, RNNs, hierarchical Bayesian LMs, …)
Noisy Channel

argmax_X p(X | Y) = argmax_X p(Y | X) p(X) / p(Y) = argmax_X p(Y | X) p(X)

because p(Y) is constant with respect to X.
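The whole pipeline as one argmax; `candidates`, `channel_logprob`, and `lm_logprob` are assumed components (a hypothesis generator, a decode model, and a language model):

```python
def decode(noisy, candidates, channel_logprob, lm_logprob):
    """argmax_X log p(Y | X) + log p(X); p(Y) is dropped because it is
    constant with respect to X."""
    return max(candidates(noisy),
               key=lambda x: channel_logprob(noisy, x) + lm_logprob(x))
```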
Evaluation: the 2-by-2 contingency table

                           Actually Correct      Actually Incorrect
Selected/guessed           True Positive (TP)    False Positive (FP)
Not selected/not guessed   False Negative (FN)   True Negative (TN)
Accuracy, Precision, and Recall

Accuracy: % of items correct = (TP + TN) / (TP + FP + FN + TN)
Precision: % of selected items that are correct = TP / (TP + FP)
Recall: % of correct items that are selected = TP / (TP + FN)
A combined measure: F

F is a weighted harmonic average of Precision and Recall:

F = 1 / ( α(1/P) + (1 - α)(1/R) ) = (β² + 1)PR / (β²P + R)

The balanced F1 measure sets β = 1 (α = ½):

F1 = 2PR / (P + R)
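All four measures from one contingency table, as a small sketch:

```python
def prf(tp, fp, fn, tn, beta=1.0):
    """Accuracy, precision, recall, and F from a 2x2 table."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    f = (b2 + 1) * precision * recall / (b2 * precision + recall)
    return accuracy, precision, recall, f

print(prf(tp=10, fp=10, fn=10, tn=970))  # class 1 of the example below
```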
Micro- vs. Macro-Averaging
If we have more than one class, how do we combine multiple performance measures into one quantity?

Macroaveraging: compute performance for each class, then average.
Microaveraging: collect decisions for all classes, compute one contingency table, evaluate.
Micro- vs. Macro-Averaging: Example
Class 1:
                  Truth: yes   Truth: no
Classifier: yes   10           10
Classifier: no    10           970

Class 2:
                  Truth: yes   Truth: no
Classifier: yes   90           10
Classifier: no    10           890

Microaveraged (pooled) table:
                  Truth: yes   Truth: no
Classifier: yes   100          20
Classifier: no    20           1860

Macroaveraged precision: (0.5 + 0.9)/2 = 0.7
Microaveraged precision: 100/120 ≈ 0.83
The microaveraged score is dominated by the score on common classes.
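The two averages in code, using the tables above:

```python
tables = [               # (tp, fp, fn, tn) per class, from the example
    (10, 10, 10, 970),   # class 1
    (90, 10, 10, 890),   # class 2
]

# Macroaverage: precision per class, then average.
macro = sum(tp / (tp + fp) for tp, fp, _, _ in tables) / len(tables)

# Microaverage: pool the counts into one table, then compute precision.
tp, fp = sum(t[0] for t in tables), sum(t[1] for t in tables)
micro = tp / (tp + fp)

print(macro, micro)   # 0.7 and 0.8333...
```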
The Bag of Words Representation

Represent the document by its word counts, ignoring word order:

seen 2
sweet 1
whimsical 1
recommend 1
happy 1
…
Naïve Bayes Classifier

Start with Bayes rule, then:
Adopt the naïve bag-of-words representation.
Assume position doesn't matter.
Assume the feature probabilities X_i are independent given the class Y:

c_NB = argmax_{c ∈ C} p(c) ∏_i p(x_i | c)
Multinomial Naïve Bayes: Learning

From the training corpus, extract the Vocabulary.

Calculate the P(c_j) terms: for each c_j in C,
  docs_j ← all docs with class = c_j
  P(c_j) ← |docs_j| / (total # of documents)

Calculate the P(w_k | c_j) terms:
  Text_j ← a single doc containing all of docs_j
  for each word w_k in Vocabulary:
    n_k ← # of occurrences of w_k in Text_j
    P(w_k | c_j) ← (n_k + α) / (n + α|Vocabulary|)

Each q(w | c_j) is a class-specific language model.
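A compact sketch of this learning procedure plus prediction (variable names are illustrative; unseen test words are simply skipped):

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, alpha=1.0):
    """docs: list of (token_list, class_label). Returns log-priors and
    per-class add-alpha word log-probabilities."""
    vocab = {w for toks, _ in docs for w in toks}
    by_class = defaultdict(list)
    for toks, c in docs:
        by_class[c].append(toks)
    log_prior, log_word = {}, {}
    for c, doc_list in by_class.items():
        log_prior[c] = math.log(len(doc_list) / len(docs))
        text = Counter(w for toks in doc_list for w in toks)  # "Text_j"
        n = sum(text.values())
        log_word[c] = {w: math.log((text[w] + alpha) / (n + alpha * len(vocab)))
                       for w in vocab}
    return log_prior, log_word

def predict_nb(tokens, log_prior, log_word):
    """argmax_c log P(c) + sum_i log P(w_i | c)."""
    return max(log_prior, key=lambda c: log_prior[c] +
               sum(log_word[c].get(w, 0.0) for w in tokens))
```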
Naïve Bayes and Language Modeling
Naïve Bayes classifiers can use any sort of feature. But if, as in the previous slides, we use only word features and we use all of the words in the text (not a subset), then Naïve Bayes has an important similarity to language modeling.
Naïve Bayes as a Language Model (Sec. 13.2.1)

Which class assigns the higher probability to s = "I love this fun film"?

Positive model:  0.1 I    0.1 love     0.01 this   0.05 fun    0.1 film
Negative model:  0.2 I    0.001 love   0.01 this   0.005 fun   0.1 film

        I      love    this    fun     film
pos:    0.1    0.1     0.01    0.05    0.1
neg:    0.2    0.001   0.01    0.005   0.1

P(s | pos) = 0.1 · 0.1 · 0.01 · 0.05 · 0.1 = 5e-7
P(s | neg) = 0.2 · 0.001 · 0.01 · 0.005 · 0.1 = 1e-9

5e-7 ≈ P(s | pos) > P(s | neg) ≈ 1e-9
Brill and Banko (2001): with enough data, the classifier may not matter.
Summary: Naïve Bayes is Not So Naïve
Very fast, with low storage requirements.
Robust to irrelevant features.
Very good in domains with many equally important features.
Optimal if the independence assumptions hold.
A dependable baseline for text classification (but often not the best).