

SLIDE 1

Classification & The Noisy Channel Model

CMSC 473/673 UMBC

Some slides adapted from 3SLP

SLIDE 2

Outline

Recap: language modeling Classification Why incorporate uncertainty Posterior decoding Noisy channel decoding Evaluation

SLIDE 3

Chain Rule + Backoff (Markov assumption) = n-grams

SLIDE 4

N-Gram Terminology

n   Commonly called    History size (Markov order)   Example
1   unigram            0                             p(furiously)
2   bigram             1                             p(furiously | sleep)
3   trigram (3-gram)   2                             p(furiously | ideas sleep)
4   4-gram             3                             p(furiously | green ideas sleep)
n   n-gram             n−1                           p(w_i | w_{i−n+1} … w_{i−1})

How do we (efficiently) compute p(Colorless green ideas sleep furiously)?
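To make the chain-rule-plus-Markov idea concrete, here is a minimal sketch (assuming a toy corpus and unsmoothed MLE counts; the corpus and function names are illustrative, not from the slides) that scores a sentence with a bigram model:

```python
from collections import Counter

# Toy training corpus; in practice these counts come from a large corpus.
corpus = [["colorless", "green", "ideas", "sleep", "furiously"],
          ["green", "ideas", "sleep", "furiously"]]

unigram_counts = Counter(w for sent in corpus for w in sent)
bigram_counts = Counter((w1, w2) for sent in corpus for w1, w2 in zip(sent, sent[1:]))

def bigram_sentence_prob(sentence):
    """p(w1 ... wn) ~ p(w1) * prod_i p(w_i | w_{i-1}): chain rule + first-order Markov assumption."""
    total = sum(unigram_counts.values())
    p = unigram_counts[sentence[0]] / total                      # p(w1), unsmoothed MLE
    for prev, cur in zip(sentence, sentence[1:]):
        p *= bigram_counts[(prev, cur)] / unigram_counts[prev]   # MLE of p(cur | prev)
    return p

print(bigram_sentence_prob(["colorless", "green", "ideas", "sleep", "furiously"]))
```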

SLIDE 5

Language Models & Smoothing

  • Maximum likelihood (MLE): simple counting
  • Laplace smoothing, add-λ
  • Interpolation models
  • Discounted backoff
  • Interpolated (modified) Kneser-Ney
  • Good-Turing
  • Witten-Bell
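As one concrete instance from this list, a hedged sketch of add-λ smoothing for bigram probabilities (λ = 1 recovers Laplace smoothing; the count dictionaries and vocabulary size are assumed inputs, not defined by the slides):

```python
def add_lambda_bigram_prob(cur, prev, bigram_counts, unigram_counts, vocab_size, lam=0.1):
    """Add-lambda estimate: (c(prev, cur) + lam) / (c(prev) + lam * V).

    lam = 1 gives Laplace smoothing; lam -> 0 approaches the unsmoothed MLE."""
    return (bigram_counts.get((prev, cur), 0) + lam) / \
           (unigram_counts.get(prev, 0) + lam * vocab_size)
```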

SLIDE 6

Evaluation Framework

What is “correct”? What is working “well”?

  • Training data: acquire primary statistics for learning model parameters
  • Dev data: fine-tune any secondary (hyper)parameters
  • Test data: perform final evaluation

DO NOT ITERATE ON THE TEST DATA

SLIDE 7

Setting Hyperparameters

Use a development corpus. Choose the λs to maximize the probability of the dev data:

  • Fix the n-gram probabilities/counts (on the training data)
  • Search for the λs that give the largest probability to the held-out set

(Data splits: training data | dev data | test data)
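A sketch of this recipe under simple assumptions: a two-way bigram/unigram interpolation and a coarse grid search, where the probability functions are assumed to be fit on the training data and held fixed.

```python
import math

def dev_logprob(dev_sents, lam, p_bigram, p_unigram):
    # Dev-set log-probability of the interpolated model
    # p(w | h) = lam * p_bigram(w | h) + (1 - lam) * p_unigram(w).
    total = 0.0
    for sent in dev_sents:
        for prev, cur in zip(sent, sent[1:]):
            total += math.log(lam * p_bigram(cur, prev) + (1 - lam) * p_unigram(cur))
    return total

def tune_lambda(dev_sents, p_bigram, p_unigram, grid=(0.1, 0.3, 0.5, 0.7, 0.9)):
    # Counts stay fixed (estimated on training data); only lambda is searched on dev data.
    return max(grid, key=lambda lam: dev_logprob(dev_sents, lam, p_bigram, p_unigram))
```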

SLIDE 8

Evaluating Language Models

What is “correct”? What is working “well”?

  • Extrinsic: evaluate the LM in a downstream task. Test an MT, ASR, etc. system and see which LM does better; this propagates & conflates errors.
  • Intrinsic: treat the LM as its own downstream task, using perplexity (from information theory).

SLIDE 9

Perplexity

Lower is better: lower perplexity means the model is less surprised by the data.

perplexity = exp( −(1/N) · Σ_{j=1..N} log q(x_j | h_j) )

where h_j is the n-gram history (the previous n−1 items).
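A small sketch of this formula, assuming `q(word, history)` returns a (smoothed, nonzero) probability and that the token stream is already padded; the signature is illustrative.

```python
import math

def perplexity(tokens, q, order=2):
    """exp( -(1/N) * sum_j log q(x_j | h_j) ), with h_j the previous order-1 tokens."""
    n = order - 1
    log_sum, N = 0.0, 0
    for j in range(n, len(tokens)):
        history = tuple(tokens[j - n:j])      # the n-gram history (n-1 items)
        log_sum += math.log(q(tokens[j], history))
        N += 1
    return math.exp(-log_sum / N)
```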

SLIDE 10

Implementation: Unknown words

Create an unknown word token <UNK>

Training:

  • 1. Create a fixed lexicon L of size V
  • 2. Change any word not in L to <UNK>
  • 3. Train LM as normal

Evaluation:

Use the <UNK> probabilities for any word not in the training lexicon L
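A hedged sketch of this recipe (the lexicon-size cutoff and function names are assumptions for illustration):

```python
from collections import Counter

UNK = "<UNK>"

def build_lexicon(train_sents, max_size=10000):
    # Step 1: a fixed lexicon L of size at most V, from training counts.
    counts = Counter(w for sent in train_sents for w in sent)
    return {w for w, _ in counts.most_common(max_size)}

def unkify(sent, lexicon):
    # Step 2 (training) and evaluation: any word outside L becomes <UNK>.
    return [w if w in lexicon else UNK for w in sent]
```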

SLIDE 11

Implementation: EOS Padding

Create an end-of-sentence (“chunk”) token <EOS>. Don’t estimate p(<BOS> | <EOS>).

Training & Evaluation:

  • 1. Identify the “chunks” that are relevant (sentences, paragraphs, documents)
  • 2. Append the <EOS> token to the end of the chunk
  • 3. Train or evaluate the LM as normal
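A minimal sketch of the padding step (the <BOS>/<EOS> symbols follow the slide; the helper itself is illustrative):

```python
BOS, EOS = "<BOS>", "<EOS>"

def pad_chunk(chunk_tokens, order=2):
    # Pad with order-1 <BOS> symbols so the first word has a full history,
    # and append one <EOS> so the model learns where chunks end.
    # Chunks are scored independently, so p(<BOS> | <EOS>) is never estimated.
    return [BOS] * (order - 1) + chunk_tokens + [EOS]

print(pad_chunk(["colorless", "green", "ideas", "sleep", "furiously"]))
```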


SLIDE 12

Outline

Recap: language modeling Classification Why incorporate uncertainty Posterior decoding Noisy channel decoding Evaluation

SLIDE 13

Probabilistic Classification

Discriminatively trained classifier: directly model the posterior
    q(Y | Z) = h(Y; Z)

Generatively trained classifier: model the posterior with Bayes rule
    q(Y | Z) ∝ q(Z | Y) · q(Y)

SLIDE 14

Generative Training: Two Different Philosophical Frameworks

q(Y | Z) = q(Z | Y) · q(Y) / q(Z)

  • q(Y | Z): posterior probability
  • q(Z | Y): likelihood
  • q(Y): prior probability
  • q(Z): marginal likelihood (probability)

Framework 1: Posterior classification/decoding (maximum a posteriori)
Framework 2: Noisy channel model decoding

there are others too (CMSC 478/678)

SLIDE 15

Outline

Recap: language modeling Classification Why incorporate uncertainty Posterior decoding Noisy channel decoding Evaluation

SLIDE 16

Classification

POLITICS TERRORISM SPORTS TECH HEALTH FINANCE …

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.

SLIDE 17

Classification

POLITICS TERRORISM SPORTS TECH HEALTH FINANCE …

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.

SLIDE 18

Classification

POLITICS TERRORISM SPORTS TECH HEALTH FINANCE …

Electronic alerts have been used to assist the authorities in moments of chaos and potential danger: after the Boston bombing in 2013, when the Boston suspects were still at large, and last month in Los Angeles, during an active shooter scare at the airport.

SLIDE 19

Classification

POLITICS TERRORISM SPORTS TECH HEALTH FINANCE …

Electronic alerts have been used to assist the authorities in moments of chaos and potential danger: after the Boston bombing in 2013, when the Boston suspects were still at large, and last month in Los Angeles, during an active shooter scare at the airport.

SLIDE 20

Classify with Uncertainty

Use probabilities

SLIDE 21

Classify with Uncertainty

Use probabilities*

*There are non-probabilistic ways to handle uncertainty… but probabilities sure are handy!

SLIDE 22

Classification

POLITICS .05 TERRORISM .48 SPORTS .0001 TECH .39 HEALTH .0001 FINANCE .0002 …

Electronic alerts have been used to assist the authorities in moments of chaos and potential danger: after the Boston bombing in 2013, when the Boston suspects were still at large, and last month in Los Angeles, during an active shooter scare at the airport.

SLIDE 23

Text Classification

Assigning subject categories, topics, or genres Spam detection Authorship identification Age/gender identification Language Identification Sentiment analysis …

SLIDE 24

Text Classification

Assigning subject categories, topics, or genres Spam detection Authorship identification Age/gender identification Language Identification Sentiment analysis …

Input:

  • a document
  • a fixed set of classes C = {c1, c2, …, cJ}

Output: a predicted class c from C

SLIDE 25

Text Classification

Assigning subject categories, topics, or genres Spam detection Authorship identification Age/gender identification Language Identification Sentiment analysis …

Input:

  • a document (really: any linguistic blob)
  • a fixed set of classes C = {c1, c2, …, cJ}

Output: a predicted class c from C

SLIDE 26

Text Classification: Hand-coded Rules?

Assigning subject categories, topics, or genres Spam detection Authorship identification Age/gender identification Language Identification Sentiment analysis …

Rules based on combinations of words or other features

spam: black-list-address OR (“dollars” AND “have been selected”)

Accuracy can be high, if the rules are carefully refined by an expert.

But building and maintaining these rules is expensive. And can humans faithfully assign uncertainty?
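As a toy illustration of the hand-coded rule above (the blacklist address and the example message are invented):

```python
BLACKLIST = {"spammer@example.com"}   # hypothetical black-listed addresses

def is_spam(sender, text):
    # spam if: black-listed address OR ("dollars" AND "have been selected")
    return sender in BLACKLIST or ("dollars" in text and "have been selected" in text)

print(is_spam("friend@example.edu", "you have been selected to receive dollars"))  # True
```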

SLIDE 27

Text Classification: Supervised Machine Learning

Assigning subject categories, topics, or genres Spam detection Authorship identification Age/gender identification Language Identification Sentiment analysis …

Input:

  • a document d
  • a fixed set of classes C = {c1, c2, …, cJ}
  • a training set of m hand-labeled documents (d1, c1), …, (dm, cm)

Output:

a learned classifier γ that maps documents to classes

SLIDE 28

Text Classification: Supervised Machine Learning

Assigning subject categories, topics, or genres Spam detection Authorship identification Age/gender identification Language Identification Sentiment analysis …

Input:

  • a document d
  • a fixed set of classes C = {c1, c2, …, cJ}
  • a training set of m hand-labeled documents (d1, c1), …, (dm, cm)

Output:

a learned classifier γ that maps documents to classes

Naïve Bayes Logistic regression Support-vector machines k-Nearest Neighbors …

SLIDE 29

Text Classification: Supervised Machine Learning

Assigning subject categories, topics, or genres Spam detection Authorship identification Age/gender identification Language Identification Sentiment analysis …

Naïve Bayes Logistic regression Support-vector machines k-Nearest Neighbors …

Input:

  • a document d
  • a fixed set of classes C = {c1, c2, …, cJ}
  • a training set of m hand-labeled documents (d1, c1), …, (dm, cm)

Output:

a learned classifier γ that maps documents to classes

SLIDE 30

Outline

Recap: language modeling Classification Why incorporate uncertainty Posterior decoding Noisy channel decoding Evaluation

SLIDE 31

Generative Training: Two Different Philosophical Frameworks

q(Y | Z) = q(Z | Y) · q(Y) / q(Z)

  • q(Y | Z): posterior probability
  • q(Z | Y): likelihood
  • q(Y): prior probability
  • q(Z): marginal likelihood (probability)

Framework 1: Posterior classification/decoding (maximum a posteriori)
Framework 2: Noisy channel model decoding

SLIDE 32

Probabilistic Text Classification

Assigning subject categories, topics, or genres Spam detection Authorship identification Age/gender identification Language Identification Sentiment analysis …

q(Y | Z) = q(Z | Y) · q(Y) / q(Z)

Y: class
Z: observed data

SLIDE 33

Probabilistic Text Classification

Assigning subject categories, topics, or genres Spam detection Authorship identification Age/gender identification Language Identification Sentiment analysis …

q(Y | Z) = q(Z | Y) · q(Y) / q(Z)

Y: class
Z: observed data
q(Z | Y): class-based likelihood (language model)
q(Y): prior probability of class
q(Z): observation likelihood (averaged over all classes)

SLIDE 34

Probabilistic Text Classification

Assigning subject categories, topics, or genres Spam detection Authorship identification Age/gender identification Language Identification Sentiment analysis …

q(Y | Z) = q(Z | Y) · q(Y) / q(Z)

Y: class
Z: observed data
q(Z | Y): class-based likelihood (language model)
q(Y): prior probability of class
q(Z): observation likelihood (averaged over all classes)
SLIDE 35

Classification with Bayes Rule

argmax_Y q(Y | Z)

SLIDE 36

Classification with Bayes Rule

argmax_Y q(Z | Y) · q(Y) / q(Z)

SLIDE 37

Classification with Bayes Rule

argmax_Y q(Z | Y) · q(Y) / q(Z)

q(Z) is constant with respect to Y

SLIDE 38

Classification with Bayes Rule

argmax_Y q(Z | Y) · q(Y)

SLIDE 39

SLIDE 40

SLIDE 41

Classification with Bayes Rule

argmax_Y log q(Z | Y) + log q(Y)

SLIDE 42

Classification (labels) with Bayes Rule

argmax_Y log q(Z | Y) + log q(Y)

log q(Z | Y): how well does the text Z represent the label Y?
log q(Y): how likely is the label Y overall?

For “simple” or “flat” labels (a sketch of this loop follows):
  • iterate through the labels
  • evaluate the score for each label, keeping only the best (or n best)
  • return the best (or n best) label and score
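A hedged sketch of that loop, assuming per-class language models that return log q(Z | Y) and a prior q(Y); both are inputs here, not defined by the slides.

```python
import math

def classify(tokens, class_loglik, class_prior):
    """argmax_Y [ log q(Z | Y) + log q(Y) ] over a flat label set.

    class_loglik[y](tokens) is assumed to return log q(Z | y), e.g. from an
    n-gram LM trained only on class-y documents; class_prior[y] is q(y)."""
    best_label, best_score = None, float("-inf")
    for y, loglik in class_loglik.items():
        score = loglik(tokens) + math.log(class_prior[y])
        if score > best_score:
            best_label, best_score = y, score
    return best_label, best_score
```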

SLIDE 43

Classification/Decoding with Bayes Rule

argmax_Y log q(Z | Y) + log q(Y)

log q(Z | Y): how well does the text (complex input) Z represent the text (complex output) Y?
log q(Y): how likely is the text (complex output) Y overall?

  • iterate through the candidate outputs
  • evaluate the score for each, keeping only the best (or n best)
  • return the best (or n best) output and score

If Y is a string (or some other complex structure), this can be complicated

SLIDE 44

Outline

Recap: language modeling Classification Why incorporate uncertainty Posterior decoding Noisy channel decoding Evaluation

SLIDE 45

Generative Training: Two Different Philosophical Frameworks

q(Y | Z) = q(Z | Y) · q(Y) / q(Z)

  • q(Y | Z): posterior probability
  • q(Z | Y): likelihood
  • q(Y): prior probability
  • q(Z): marginal likelihood (probability)

Framework 1: Posterior classification/decoding (maximum a posteriori)
Framework 2: Noisy channel model decoding

SLIDE 46

Noisy Channel Model

SLIDE 47

Noisy Channel Model

what I want to tell you: “sports”

SLIDE 48

Noisy Channel Model

what I want to tell you: “sports”
what you actually see: “The Os lost again…”

SLIDE 49

Noisy Channel Model

what I want to tell you: “sports”
what you actually see: “The Os lost again…”
Decode (hypothesized intent): “sad stories”, “sports”

SLIDE 50

Noisy Channel Model

what I want to tell you: “sports”
what you actually see: “The Os lost again…”
Decode (hypothesized intent): “sad stories”, “sports”
Rerank (reweight according to what’s likely): “sports”

SLIDE 51

Noisy Channel

Machine translation Speech-to-text Spelling correction Text normalization Part-of-speech tagging Morphological analysis Image captioning …

q(Y | Z) = q(Z | Y) · q(Y) / q(Z)

Y: possible (clean) output
Z: observed (noisy) text
q(Z | Y): translation / decode model
q(Y): (clean) language model
q(Z): observation (noisy) likelihood
SLIDE 52

Noisy Channel

Machine translation Speech-to-text Spelling correction Text normalization Part-of-speech tagging Morphological analysis Image captioning …

q(Y | Z) = q(Z | Y) · q(Y) / q(Z)

Y: possible (clean) output
Z: observed (noisy) text
q(Z | Y): translation / decode model
q(Y): (clean) language model
q(Z): observation (noisy) likelihood
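A hedged sketch of noisy channel decoding for one of the listed applications, spelling correction; the candidate set, channel scores, and language model scores below are invented stand-ins, not part of the slides.

```python
def noisy_channel_decode(observed, candidates, channel_logprob, lm_logprob):
    # argmax_Y [ log q(Z | Y) + log q(Y) ]:
    # channel_logprob plays the translation/decode model q(Z | Y),
    # lm_logprob plays the clean language model q(Y).
    return max(candidates, key=lambda y: channel_logprob(observed, y) + lm_logprob(y))

# Toy usage with made-up log-probabilities: "the" is the likeliest clean source of "teh".
candidates = ["the", "ten", "teh"]
channel = lambda z, y: {"the": -1.0, "ten": -3.0, "teh": -0.5}[y]   # log q(z="teh" | y)
lm = lambda y: {"the": -2.0, "ten": -6.0, "teh": -12.0}[y]          # log q(y)
print(noisy_channel_decode("teh", candidates, channel, lm))          # -> "the"
```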

SLIDE 53

Language Model

Use any of the language modeling algorithms we’ve learned:
  • unigram, bigram, trigram
  • add-λ, interpolation, backoff
  • (later: maxent LMs, RNNs, hierarchical Bayesian LMs, …)

SLIDE 54

Probabilistic Classification

𝑞 𝑌 𝑍) ∝ 𝑞 𝑍 𝑌) ∗ 𝑞(𝑌)

𝑞 𝑌 𝑍) = ℎ(𝑌; 𝑍)

Discriminatively trained classifier Generatively trained classifier

Directly model the posterior Model the posterior with Bayes rule

Posterior Classification/Decoding maximum a posteriori Noisy Channel Model Decoding

SLIDE 55

Outline

Recap: language modeling Classification Why incorporate uncertainty Posterior decoding Noisy channel decoding Evaluation

SLIDE 56

Evaluation: the 2-by-2 contingency table

                              Actually correct     Actually incorrect
Selected / guessed
Not selected / not guessed

SLIDE 57

Evaluation: the 2-by-2 contingency table

                              Actually correct     Actually incorrect
Selected / guessed
Not selected / not guessed

SLIDE 58

Evaluation: the 2-by-2 contingency table

                              Actually correct     Actually incorrect
Selected / guessed            True Positive (TP)
Not selected / not guessed

SLIDE 59

Evaluation: the 2-by-2 contingency table

                              Actually correct     Actually incorrect
Selected / guessed            True Positive (TP)   False Positive (FP)
Not selected / not guessed

SLIDE 60

Evaluation: the 2-by-2 contingency table

                              Actually correct     Actually incorrect
Selected / guessed            True Positive (TP)   False Positive (FP)
Not selected / not guessed    False Negative (FN)

SLIDE 61

Evaluation: the 2-by-2 contingency table

                              Actually correct     Actually incorrect
Selected / guessed            True Positive (TP)   False Positive (FP)
Not selected / not guessed    False Negative (FN)  True Negative (TN)

SLIDE 62

Accuracy, Precision, and Recall

Accuracy: % of items correct

                              Actually correct     Actually incorrect
Selected / guessed            True Positive (TP)   False Positive (FP)
Not selected / not guessed    False Negative (FN)  True Negative (TN)

accuracy = (TP + TN) / (TP + FP + FN + TN)

SLIDE 63

Accuracy, Precision, and Recall

Accuracy: % of items correct Precision: % of selected items that are correct

                              Actually correct     Actually incorrect
Selected / guessed            True Positive (TP)   False Positive (FP)
Not selected / not guessed    False Negative (FN)  True Negative (TN)

precision = TP / (TP + FP)
accuracy = (TP + TN) / (TP + FP + FN + TN)

SLIDE 64

Accuracy, Precision, and Recall

Accuracy: % of items correct Precision: % of selected items that are correct Recall: % of correct items that are selected

                              Actually correct     Actually incorrect
Selected / guessed            True Positive (TP)   False Positive (FP)
Not selected / not guessed    False Negative (FN)  True Negative (TN)

precision = TP / (TP + FP)
recall = TP / (TP + FN)
accuracy = (TP + TN) / (TP + FP + FN + TN)
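These three definitions in a small sketch (arguments named after the table cells):

```python
def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)   # % of items correct

def precision(tp, fp):
    return tp / (tp + fp)                    # % of selected items that are correct

def recall(tp, fn):
    return tp / (tp + fn)                    # % of correct items that are selected
```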

SLIDE 65

A combined measure: F

Weighted (harmonic) average of precision & recall:

F = 1 / ( α · (1/P) + (1 − α) · (1/R) )

SLIDE 66

A combined measure: F

Weighted (harmonic) average of precision & recall:

F = 1 / ( α · (1/P) + (1 − α) · (1/R) ) = (1 + β²) · P · R / (β² · P + R), with β² = (1 − α)/α

algebra (not important)

SLIDE 67

A combined measure: F

Weighted (harmonic) average of precision & recall.

F_β = (1 + β²) · P · R / (β² · P + R)

Balanced F1 measure (β = 1):

F1 = 2 · P · R / (P + R)
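And the combined measure, following the formula above (β = 1 gives the balanced F1):

```python
def f_beta(p, r, beta=1.0):
    # F_beta = (1 + beta^2) * P * R / (beta^2 * P + R); beta = 1 gives F1 = 2PR / (P + R).
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)

print(f_beta(0.5, 0.9))   # balanced F1 of P = 0.5, R = 0.9: about 0.64
```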

SLIDE 68

Micro- vs. Macro-Averaging

If we have more than one class, how do we combine multiple performance measures into one quantity?

  • Macroaveraging: compute performance for each class, then average.
  • Microaveraging: collect decisions for all classes, compute the contingency table, evaluate.

(Sec. 15.2.4)
SLIDE 69

Micro- vs. Macro-Averaging: Example

Class 1:
                    Truth: yes    Truth: no
  Classifier: yes       10            10
  Classifier: no        10           970

Class 2:
                    Truth: yes    Truth: no
  Classifier: yes       90            10
  Classifier: no        10           890

Micro-averaged (pooled) table:
                    Truth: yes    Truth: no
  Classifier: yes      100            20
  Classifier: no        20          1860

(Sec. 15.2.4)

Macroaveraged precision: (0.5 + 0.9) / 2 = 0.7
Microaveraged precision: 100 / 120 ≈ 0.83
The microaveraged score is dominated by the score on the common classes.
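A sketch reproducing the macro vs. micro computation from the tables above, using the per-class (TP, FP) counts:

```python
def macro_precision(per_class):
    # Average the per-class precisions: every class counts equally.
    return sum(tp / (tp + fp) for tp, fp in per_class) / len(per_class)

def micro_precision(per_class):
    # Pool the counts first: frequent classes dominate the result.
    tp_total = sum(tp for tp, _ in per_class)
    fp_total = sum(fp for _, fp in per_class)
    return tp_total / (tp_total + fp_total)

per_class = [(10, 10), (90, 10)]       # (TP, FP) for class 1 and class 2
print(macro_precision(per_class))      # (0.5 + 0.9) / 2 = 0.7
print(micro_precision(per_class))      # 100 / 120 ≈ 0.83
```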

SLIDE 70

Outline

Recap: language modeling Classification Why incorporate uncertainty Posterior decoding Noisy channel decoding Evaluation