Classification & The Noisy Channel Model (CMSC 473/673, UMBC)


SLIDE 1

Classification & The Noisy Channel Model

CMSC 473/673 UMBC September 13th, 2017

Some slides adapted from 3SLP

SLIDE 2

Recap from last time…

SLIDE 3

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today.

pθ( · )

SLIDE 4

Chain Rule + Backoff (Markov assumption) = n-grams

SLIDE 5

N-Gram Terminology

n = 1: unigram, history size (Markov order) 0, e.g. p(furiously)
n = 2: bigram, history size 1, e.g. p(furiously | sleep)
n = 3: trigram (3-gram), history size 2, e.g. p(furiously | ideas sleep)
n = 4: 4-gram, history size 3, e.g. p(furiously | green ideas sleep)
general n: n-gram, history size n-1, p(wi | wi-n+1 … wi-1)

How to (efficiently) compute p(Colorless green ideas sleep furiously)?

SLIDE 6

Count-Based N-Grams (Unigrams)

SLIDE 7

Count-Based N-Grams (Trigrams)

SLIDE 8

Add-λ estimation

Laplace smoothing, Lidstone smoothing
Pretend we saw each word λ more times than we did: add λ to all the counts
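As a minimal sketch of the idea (the toy counts and vocabulary size here are assumptions), add-λ estimation just inflates every count by λ before normalizing:

```python
from collections import Counter

def add_lambda_prob(word, counts, lam, vocab_size):
    """Add-lambda (Lidstone) estimate: pretend every word was seen lam extra times."""
    total = sum(counts.values())
    return (counts[word] + lam) / (total + lam * vocab_size)

counts = Counter({"the": 3, "cat": 1})
# With lam=1 (Laplace) and a 4-word vocabulary: (1 + 1) / (4 + 1*4) = 0.25
p = add_lambda_prob("cat", counts, lam=1.0, vocab_size=4)
```

Note that even an unseen word now gets nonzero mass: λ / (total + λV).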

SLIDE 9

Linear Interpolation

Simple interpolation --> Averaging

q(z | y) = μ q2(z | y) + (1 − μ) q1(z),   0 ≤ μ ≤ 1

simpler models
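The interpolation formula above can be sketched in a few lines; the bigram/unigram tables and the value of μ here are hypothetical toy numbers:

```python
def interpolate(z, y, q2, q1, mu):
    """Linear interpolation: mix a bigram estimate q2(z|y) with a unigram fallback q1(z)."""
    assert 0.0 <= mu <= 1.0
    return mu * q2.get((z, y), 0.0) + (1 - mu) * q1.get(z, 0.0)

q2 = {("furiously", "sleep"): 0.4}   # toy bigram table (hypothetical numbers)
q1 = {"furiously": 0.1}              # toy unigram table
p = interpolate("furiously", "sleep", q2, q1, mu=0.75)  # 0.75*0.4 + 0.25*0.1 = 0.325
```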

SLIDE 10

Discounted Backoff

Trust your statistics, up to a point

discount constant; context-dependent normalization constant; simpler models

SLIDE 11

Evaluation Framework

What is “correct?” What is working “well?”

Training data: acquire primary statistics for learning model parameters
Dev data: fine-tune any secondary (hyper)parameters
Test data: perform final evaluation

DO NOT ITERATE ON THE TEST DATA

SLIDE 12

Setting Hyperparameters

Use a development corpus. Choose λs to maximize the probability of dev data:
Fix the N-gram probabilities/counts (on the training data)
Search for λs that give the largest probability to the held-out set

Training Data

Dev Data Test Data

SLIDE 13

Evaluating Language Models

What is “correct”? What is working “well”?
Extrinsic: evaluate the LM in a downstream task; test an MT, ASR, etc. system and see which LM does better (propagates & conflates errors)
Intrinsic: treat the LM as its own downstream task; use perplexity (from information theory)

SLIDE 14

Perplexity

Lower is better: lower perplexity --> less surprised

n-gram history (n-1 items)
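Perplexity can be computed directly from the per-token probabilities the model assigns; this sketch uses the standard exponential of the negative average log-probability:

```python
import math

def perplexity(probs):
    """Perplexity = exp(-average log-probability); lower means less surprised."""
    n = len(probs)
    return math.exp(-sum(math.log(p) for p in probs) / n)

# A model that assigns 1/4 to every token has perplexity 4.
pp = perplexity([0.25, 0.25, 0.25])
```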

SLIDE 15

Maximum Likelihood Estimates

Maximizes the likelihood of the training set
Do different corpora look the same?
For large data: can actually do reasonably well

p(item) ∝ count(item)

SLIDE 16

Implementation: Unknown words

Create an unknown word token <UNK>

Training:

  • 1. Create a fixed lexicon L of size V
  • 2. Change any word not in L to <UNK>
  • 3. Train LM as normal

Evaluation:

Use UNK probabilities for any word not in training
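A minimal sketch of the <UNK> recipe above (the lexicon size V and the toy corpus are assumptions for illustration):

```python
from collections import Counter

def build_lexicon(tokens, V):
    """Step 1: keep a fixed lexicon of the V most frequent training words."""
    return {w for w, _ in Counter(tokens).most_common(V)}

def unkify(tokens, lexicon):
    """Step 2: map any word outside the lexicon to <UNK> (also used at evaluation)."""
    return [w if w in lexicon else "<UNK>" for w in tokens]

train = ["the", "cat", "the", "dog", "the", "emu"]
lex = build_lexicon(train, V=2)
test_sent = unkify(["the", "platypus"], lex)  # ["the", "<UNK>"]
```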

SLIDE 17

<BOS>/<EOS> Padding

p(Colorless green ideas sleep furiously) = p(Colorless | <BOS> <BOS>) * p(green | <BOS> Colorless) * p(ideas | Colorless green) * p(sleep | green ideas) * p(furiously | ideas sleep) * p(<EOS> | sleep furiously)

Consistent notation: Pad the left with <BOS> (beginning of sentence) symbols Fully proper distribution: Pad the right with a single <EOS> symbol
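The padding convention can be sketched as follows for trigrams (n = 3); the padded sentence yields exactly the six factors listed above:

```python
def pad(tokens, n):
    """Pad with n-1 <BOS> symbols on the left and a single <EOS> on the right."""
    return ["<BOS>"] * (n - 1) + tokens + ["<EOS>"]

padded = pad(["Colorless", "green", "ideas", "sleep", "furiously"], n=3)
# Slide the trigram window across the padded sentence:
ngrams = [tuple(padded[i:i + 3]) for i in range(len(padded) - 2)]
# first trigram: ("<BOS>", "<BOS>", "Colorless"); last: ("sleep", "furiously", "<EOS>")
```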


SLIDE 18

Implementation: EOS Padding

Create an end-of-sentence (“chunk”) token <EOS>
Don’t estimate p(<BOS> | <EOS>)

Training & Evaluation:

  • 1. Identify “chunks” that are relevant (sentences,

paragraphs, documents)

  • 2. Append the <EOS> token to the end of the chunk
  • 3. Train or evaluate LM as normal
SLIDE 19

Other Kinds of Smoothing

Interpolated (modified) Kneser-Ney

Idea: How “productive” is a context? How many different word types v appear in a context x, y

Good-Turing

Partition words into classes of occurrence Smooth class statistics Properties of classes are likely to predict properties of other classes

Witten-Bell

Idea: Every observed type was at some point novel Give MLE prediction for novel type occurring

SLIDE 20

Bayes Rule → NLP Applications

Bayes rule components: posterior probability = likelihood × prior probability / marginal likelihood (probability)

SLIDE 21

Two Different Philosophical Frameworks

posterior probability likelihood prior probability marginal likelihood (probability)

Posterior classification/decoding: maximum a posteriori
Noisy channel model decoding

there are others too (CMSC 478/678)


SLIDE 23

Classification

POLITICS TERRORISM SPORTS TECH HEALTH FINANCE …

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.


SLIDE 25

Classification

POLITICS TERRORISM SPORTS TECH HEALTH FINANCE …

Electronic alerts have been used to assist the authorities in moments of chaos and potential danger: after the Boston bombing in 2013, when the Boston suspects were still at large, and last month in Los Angeles, during an active shooter scare at the airport.


SLIDE 27

Classify with Uncertainty

Use probabilities

SLIDE 28

Classify with Uncertainty

Use probabilities*

*There are non-probabilistic ways to handle uncertainty… but probabilities sure are handy!

SLIDE 29

Classification

POLITICS .05 TERRORISM .48 SPORTS .0001 TECH .39 HEALTH .0001 FINANCE .0002 …

Electronic alerts have been used to assist the authorities in moments of chaos and potential danger: after the Boston bombing in 2013, when the Boston suspects were still at large, and last month in Los Angeles, during an active shooter scare at the airport.

SLIDE 30

Text Classification

Assigning subject categories, topics, or genres
Spam detection
Authorship identification
Age/gender identification
Language identification
Sentiment analysis
…

SLIDE 31

Text Classification


Input:
  • a document
  • a fixed set of classes C = {c1, c2, …, cJ}

Output: a predicted class c from C

SLIDE 32

Text Classification


Input:

a document (more generally: any linguistic blob)
a fixed set of classes C = {c1, c2, …, cJ}

Output: a predicted class c from C

SLIDE 33

Text Classification: Hand-coded Rules?


Rules based on combinations of words or other features

spam: black-list-address OR (“dollars” AND “have been selected”)

Accuracy can be high

If rules carefully refined by expert

Building and maintaining these rules is expensive
Can humans faithfully assign uncertainty?

SLIDE 34

Text Classification: Supervised Machine Learning


Input:

a document d
a fixed set of classes C = {c1, c2, …, cJ}
a training set of m hand-labeled documents (d1,c1), …, (dm,cm)

Output:

a learned classifier γ that maps documents to classes

SLIDE 35

Text Classification: Supervised Machine Learning


Input:

a document d
a fixed set of classes C = {c1, c2, …, cJ}
a training set of m hand-labeled documents (d1,c1), …, (dm,cm)

Output:

a learned classifier γ that maps documents to classes

Naïve Bayes
Logistic regression
Support-vector machines
k-Nearest Neighbors
…


SLIDE 37

Probabilistic Text Classification


class
observed data

SLIDE 38

Probabilistic Text Classification


class
observed data
prior probability of class
observation likelihood (averaged over all classes)
class-based likelihood (language model)


SLIDE 41

Two Different Philosophical Frameworks

posterior probability likelihood prior probability marginal likelihood (probability)

Posterior Classification/Decoding maximum a posteriori Noisy Channel Model Decoding

there are others too (CMSC 478/678)

SLIDE 42

Noisy Channel Model

SLIDE 43

Noisy Channel Model

what I want to tell you “sports”

SLIDE 44

Noisy Channel Model

what I want to tell you “sports” what you actually see “The Os lost again…”

SLIDE 45

Noisy Channel Model

what I want to tell you “sports” what you actually see “The Os lost again…” Decode hypothesized intent “sad stories” “sports”

SLIDE 46

Noisy Channel Model

what I want to tell you: “sports”
what you actually see: “The Os lost again…”
Decode hypothesized intent: “sad stories”, “sports”
Rerank: reweight according to what’s likely: “sports”

SLIDE 47

Noisy Channel

Machine translation Speech-to-text Spelling correction Text normalization Part-of-speech tagging Morphological analysis Image captioning …

possible (clean) output
observed (noisy) text
translation/decode model
(clean) language model
observation (noisy) likelihood

SLIDE 49

Language Model

Use any of the language modeling algorithms we’ve learned:
Unigram, bigram, trigram
Add-λ, interpolation, backoff
(Later: Maxent, RNNs, hierarchical Bayesian LMs, …)

SLIDE 50

Two Different Philosophical Frameworks

posterior probability likelihood prior probability marginal likelihood (probability)

Posterior Classification/Decoding maximum a posteriori Noisy Channel Model Decoding

there are others too (CMSC 478/678)

SLIDE 51

Classification with Bayes Rule

The marginal p(X) is constant with respect to the class, so it can be dropped when choosing the most probable class.
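A sketch of posterior classification via Bayes rule: since p(X) is the same for every class, the argmax needs only likelihood × prior. All numbers and the tiny likelihood table here are hypothetical:

```python
def classify(x, classes, prior, likelihood):
    """argmax_c p(c|x) = argmax_c p(x|c) p(c): the marginal p(x) is constant in c."""
    return max(classes, key=lambda c: likelihood(x, c) * prior[c])

# Toy, hypothetical numbers:
prior = {"TERRORISM": 0.3, "SPORTS": 0.7}

def likelihood(x, c):
    table = {("attack", "TERRORISM"): 0.02, ("attack", "SPORTS"): 0.001}
    return table.get((x, c), 1e-9)

# TERRORISM: 0.02 * 0.3 = 0.006  >  SPORTS: 0.001 * 0.7 = 0.0007
label = classify("attack", ["TERRORISM", "SPORTS"], prior, likelihood)
```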


SLIDE 58

Evaluation: the 2-by-2 contingency table

                          Actually Correct      Actually Incorrect
Selected/Guessed
Not selected/not guessed


SLIDE 63

Evaluation: the 2-by-2 contingency table

                          Actually Correct      Actually Incorrect
Selected/Guessed          True Positive (TP)    False Positive (FP)
Not selected/not guessed  False Negative (FN)   True Negative (TN)

SLIDE 64

Accuracy, Precision, and Recall

Accuracy: % of items correct
Precision: % of selected items that are correct
Recall: % of correct items that are selected

                          Actually Correct      Actually Incorrect
Selected/Guessed          True Positive (TP)    False Positive (FP)
Not selected/not guessed  False Negative (FN)   True Negative (TN)
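These three definitions translate directly into code over the contingency-table counts (the counts below are hypothetical):

```python
def accuracy(tp, fp, fn, tn):
    """% of items correct: (TP + TN) over everything."""
    return (tp + tn) / (tp + fp + fn + tn)

def precision(tp, fp):
    """% of selected items that are correct."""
    return tp / (tp + fp)

def recall(tp, fn):
    """% of correct items that are selected."""
    return tp / (tp + fn)

# Hypothetical counts:
tp, fp, fn, tn = 10, 10, 10, 970
```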

SLIDE 67

A combined measure: F

Weighted (harmonic) average of Precision & Recall

Balanced F1 measure: β = 1
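The weighted harmonic mean can be sketched as the usual F-beta formula; β = 1 recovers balanced F1 (the input precision/recall values are hypothetical):

```python
def f_beta(p, r, beta=1.0):
    """Weighted harmonic mean of precision p and recall r; beta=1 gives balanced F1."""
    b2 = beta ** 2
    return (1 + b2) * p * r / (b2 * p + r)

f1 = f_beta(0.5, 0.9)  # harmonic mean of 0.5 and 0.9, which is below their arithmetic mean
```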

slide-70
SLIDE 70

Micro- vs. Macro-Averaging

If we have more than one class, how do we combine multiple performance measures into one quantity?
Macroaveraging: compute performance for each class, then average.
Microaveraging: collect decisions for all classes, compute the contingency table, evaluate.

  • Sec. 15.2.4
SLIDE 71

Micro- vs. Macro-Averaging: Example

Class 1:            Truth: yes   Truth: no
  Classifier: yes       10          10
  Classifier: no        10         970

Class 2:            Truth: yes   Truth: no
  Classifier: yes       90          10
  Classifier: no        10         890

Micro Ave. Table:   Truth: yes   Truth: no
  Classifier: yes      100          20
  Classifier: no        20        1860

  • Sec. 15.2.4

Macroaveraged precision: (0.5 + 0.9)/2 = 0.7
Microaveraged precision: 100/120 = .83
The microaveraged score is dominated by the score on common classes
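The slide's numbers can be reproduced directly from the two per-class contingency tables:

```python
def prec(tp, fp):
    return tp / (tp + fp)

# (TP, FP) per class, from the tables above
class1 = (10, 10)   # precision 0.5
class2 = (90, 10)   # precision 0.9

# Macro: average the per-class precisions; micro: pool the counts first.
macro = (prec(*class1) + prec(*class2)) / 2                  # 0.7
micro = prec(class1[0] + class2[0], class1[1] + class2[1])   # 100/120
```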

SLIDE 72

Language Modeling as Naïve Bayes Classifier

posterior probability

Posterior classification/decoding: maximum a posteriori; noisy channel model decoding
class
observed data
prior probability of class
observation likelihood (averaged over all classes)
class-based likelihood (language model)

SLIDE 73

The Bag of Words Representation


SLIDE 76

Bag of Words Representation

γ( document ) = c

seen       2
sweet      1
whimsical  1
recommend  1
happy      1
…          …
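A bag-of-words representation is just a multiset of word counts, with order discarded; a minimal sketch (the toy sentence is an assumption):

```python
from collections import Counter

def bag_of_words(tokens):
    """Discard word order; keep only per-word counts."""
    return Counter(tokens)

bow = bag_of_words("I have seen it and I recommend it".lower().split())
# bow["i"] == 2, bow["seen"] == 1, ...
```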

SLIDE 77

Language Modeling as Naïve Bayes Classifier

Start with Bayes Rule

Adopt the naïve bag of words representation Yi
Assume position doesn’t matter
Assume the feature probabilities are independent given the class X

SLIDE 81

Multinomial Naïve Bayes: Learning

From training corpus, extract Vocabulary

Calculate P(cj) terms:
  For each cj in C: docsj = all docs with class = cj
Calculate P(wk | cj) terms:
  Textj = single doc containing all docsj
  For each word wk in Vocabulary: nk = # of occurrences of wk in Textj
  p(wk | cj) = class LM estimate
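The learning procedure above can be sketched end-to-end. This toy version uses an add-λ smoothed unigram class LM for p(wk | cj); the (tokens, class) input format is a hypothetical interface, not the course's required one:

```python
from collections import Counter

def train_nb(docs, lam=1.0):
    """Multinomial Naive Bayes training.
    docs: list of (token_list, class_label) pairs (hypothetical toy interface)."""
    vocab = {w for toks, _ in docs for w in toks}
    classes = {c for _, c in docs}
    # P(cj): fraction of documents with class cj
    prior = {c: sum(1 for _, cj in docs if cj == c) / len(docs) for c in classes}
    cond = {}
    for c in classes:
        # Text_j: one big bag of all words in class-c documents
        text_c = Counter(w for toks, cj in docs if cj == c for w in toks)
        total = sum(text_c.values())
        # Add-lambda smoothed class LM: P(wk | cj)
        cond[c] = {w: (text_c[w] + lam) / (total + lam * len(vocab)) for w in vocab}
    return prior, cond

docs = [(["fun", "film"], "pos"), (["boring", "film"], "neg")]
prior, cond = train_nb(docs)
```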

SLIDE 84

Naïve Bayes and Language Modeling

Naïve Bayes classifiers can use any sort of feature. But if, as in the previous slides, we use only word features, and we use all of the words in the text (not a subset), then Naïve Bayes has an important similarity to language modeling.

SLIDE 85

Naïve Bayes as a Language Model

Sec.13.2.1

Positive Model    Negative Model
I      0.1        I      0.2
love   0.1        love   0.001
this   0.01       this   0.01
fun    0.05       fun    0.005
film   0.1        film   0.1

Which class assigns the higher probability to s = “film love this fun I”?

word:        film    love    this    fun     I
P(w | pos):  0.1     0.1     0.01    0.05    0.1
P(w | neg):  0.1     0.001   0.01    0.005   0.2

5e-7 ≈ P(s|pos) > P(s|neg) ≈ 1e-9
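The slide's arithmetic can be checked by multiplying the per-word class probabilities, exactly as a unigram language model would score the sentence:

```python
# Per-word probabilities from the two class models on the slide
pos = {"I": 0.1, "love": 0.1, "this": 0.01, "fun": 0.05, "film": 0.1}
neg = {"I": 0.2, "love": 0.001, "this": 0.01, "fun": 0.005, "film": 0.1}

def seq_prob(model, words):
    """Score a bag of words under a unigram class LM (order does not matter)."""
    p = 1.0
    for w in words:
        p *= model[w]
    return p

s = ["film", "love", "this", "fun", "I"]
p_pos = seq_prob(pos, s)   # 5e-7
p_neg = seq_prob(neg, s)   # 1e-9
```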

SLIDE 89

Brill and Banko (2001): with enough data, the classifier may not matter.

SLIDE 90

Summary: Naïve Bayes is Not So Naïve

Very fast, with low storage requirements
Robust to irrelevant features
Very good in domains with many equally important features
Optimal if the independence assumptions hold
A dependable baseline for text classification (but often not the best)

SLIDE 91

But: Naïve Bayes Isn’t Without Issue

Model the posterior in one go?
Are the features really uncorrelated?
Are plain counts always appropriate?
Are there “better” (automated, more principled) ways of handling missing/noisy data?