Probability & Language Modeling (II): Classification & the Noisy Channel Model


slide-1
SLIDE 1

Probability & Language Modeling (II) Classification & the Noisy Channel Model

CMSC 473/673 UMBC September 11th, 2017

Some slides adapted from 3SLP, Jason Eisner

slide-2
SLIDE 2

Recap from last time…

slide-3
SLIDE 3

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today.

pθ(sentence): the language model assigns a probability to the sentence above.

slide-4
SLIDE 4

Probability Takeaways

• Basic probability axioms and definitions
• Probabilistic independence
• Definition of joint probability
• Definition of conditional probability
• Bayes' rule
• Probability chain rule

slide-5
SLIDE 5

Kinds of Statistics

• Descriptive
• Confirmatory
• Predictive

Example of a descriptive statistic: "The average grade on this assignment is 83."

slide-6
SLIDE 6

Conditional Probabilities Are Probabilities

[Diagram: the space of all (known) outcomes involving the coin being flipped, with events such as "cafeteria serves egg salad", "defective minting process", "coin is ancient", and "coin coming up heads".]

q(heads | egg salad)  vs.  q(heads | NOT egg salad)

q(heads | egg salad)  vs.  q(tails | egg salad)

Fixing the conditioning event (egg salad) gives a proper distribution: q(heads | egg salad) + q(tails | egg salad) = 1. Comparing across different conditioning events (egg salad vs. NOT egg salad) does not.

slide-7
SLIDE 7

Bayes Rule

p(A | B) = p(B | A) · p(A) / p(B)
posterior probability = likelihood × prior probability / marginal likelihood (probability)

slide-8
SLIDE 8

Simple Count-Based Language Models

slide-9
SLIDE 9

Simple Count-Based Language Models

“proportional to”

slide-10
SLIDE 10

Simple Count-Based Language Models

“proportional to”

slide-11
SLIDE 11

Proportional To/Normalization

p(everything) = 1

slide-12
SLIDE 12

Proportional To/Normalization

c(item1) = x1

slide-13
SLIDE 13

Proportional To/Normalization

c(item1) = x1,  c(item2) = x2,  c(item3) = x3,  c(item4) = x4

slide-14
SLIDE 14

Proportional To/Normalization

p(item1) ∝ x1,  p(item2) ∝ x2,  p(item3) ∝ x3,  p(item4) ∝ x4

slide-15
SLIDE 15

Simple Count-Based Language Models

“proportional to”

slide-16
SLIDE 16

Simple Count-Based Language Models

“proportional to”

constant

slide-17
SLIDE 17

Simple Count-Based Language Models

sequence of characters → pseudo-words; sequence of words → pseudo-phrases

slide-18
SLIDE 18

Shakespearian Sequences of Words

slide-19
SLIDE 19

Probability Chain Rule

q(y1, y2, …, yT) = q(y1) · q(y2 | y1) · q(y3 | y1, y2) ⋯ q(yT | y1, …, yT−1)

slide-20
SLIDE 20

Chain Rule + Backoff (Markov assumption) = n-grams

slide-21
SLIDE 21

N-Gram Terminology

• n = 1: unigram, history size (Markov order) 0, e.g. p(furiously)
• n = 2: bigram, history size 1, e.g. p(furiously | sleep)
• n = 3: trigram (3-gram), history size 2, e.g. p(furiously | ideas sleep)
• n = 4: 4-gram, history size 3, e.g. p(furiously | green ideas sleep)
• general n: n-gram, history size n−1, p(wi | wi−n+1 … wi−1)

How to (efficiently) compute p(Colorless green ideas sleep furiously)?

slide-22
SLIDE 22

Trigrams

p(Colorless green ideas sleep furiously) = p(Colorless) * p(green | Colorless) * p(ideas | Colorless green) * p(sleep | green ideas) * p(furiously | ideas sleep)

slide-23
SLIDE 23

Trigrams

p(Colorless green ideas sleep furiously) = p(Colorless | <BOS> <BOS>) * p(green | <BOS> Colorless) * p(ideas | Colorless green) * p(sleep | green ideas) * p(furiously | ideas sleep) * p(<EOS> | sleep furiously)

• Consistent notation: pad the left with <BOS> (beginning of sentence) symbols
• Fully proper distribution: pad the right with a single <EOS> (end of sentence) symbol
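Concretely, a minimal Python sketch of the padded product above (my illustration, not from the slides); `trigram_prob` is a hypothetical stand-in for whatever estimate of p(w | u, v) the model provides:

```python
# Sketch: probability of a sentence under a trigram model, with <BOS>/<EOS> padding.
# `trigram_prob(w, u, v)` is a hypothetical placeholder for p(w | u v).

def sentence_probability(words, trigram_prob):
    padded = ["<BOS>", "<BOS>"] + list(words) + ["<EOS>"]
    prob = 1.0
    for i in range(2, len(padded)):
        u, v, w = padded[i - 2], padded[i - 1], padded[i]
        prob *= trigram_prob(w, u, v)   # one factor per word, plus one for <EOS>
    return prob

# Toy check with a uniform model over a 10-word vocabulary (6 factors -> 1e-6):
uniform = lambda w, u, v: 0.1
print(sentence_probability("Colorless green ideas sleep furiously".split(), uniform))
```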

slide-24
SLIDE 24

N-Gram Probability

slide-25
SLIDE 25

Count-Based N-Grams (Unigrams)

slide-26
SLIDE 26

Count-Based N-Grams (Unigrams)

slide-27
SLIDE 27

Count-Based N-Grams (Unigrams)

slide-28
SLIDE 28

Count-Based N-Grams (Unigrams)

p(w) = count(w) / Σw′ count(w′)   (w and w′ range over word types)

slide-29
SLIDE 29

Count-Based N-Grams (Unigrams)

p(w) = count(w) / (number of tokens observed), for each word type w

slide-30
SLIDE 30

Count-Based N-Grams (Unigrams)

The film got a great opening and the film went on to become a hit .

Word (Type) and Raw Count (Normalization and Probability columns still to be filled in):
The 1, film 2, got 1, a 2, great 1, opening 1, and 1, the 1, went 1, on 1, to 1, become 1, hit 1, . 1

slide-31
SLIDE 31

Count-Based N-Grams (Unigrams)

The film got a great opening and the film went on to become a hit .

Word (Type) and Raw Count, with Normalization = 16 (the total number of tokens):
The 1, film 2, got 1, a 2, great 1, opening 1, and 1, the 1, went 1, on 1, to 1, become 1, hit 1, . 1

slide-32
SLIDE 32

Count-Based N-Grams (Unigrams)

The film got a great opening and the film went on to become a hit .

Word (Type)   Raw Count   Normalization   Probability
The           1           16              1/16
film          2                           1/8
got           1                           1/16
a             2                           1/8
great         1                           1/16
opening       1                           1/16
and           1                           1/16
the           1                           1/16
went          1                           1/16
on            1                           1/16
to            1                           1/16
become        1                           1/16
hit           1                           1/16
.             1                           1/16
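The count-and-normalize step above can be reproduced in a few lines of Python (my own sketch, not part of the slides); it recovers the 1/16 and 1/8 values in the table:

```python
from collections import Counter

sentence = "The film got a great opening and the film went on to become a hit .".split()
counts = Counter(sentence)                  # raw count per word type
total = sum(counts.values())                # normalization constant: 16 tokens

unigram_prob = {w: c / total for w, c in counts.items()}
print(total)                  # 16
print(unigram_prob["The"])    # 0.0625 = 1/16
print(unigram_prob["film"])   # 0.125  = 1/8
```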

slide-33
SLIDE 33

Count-Based N-Grams (Trigrams)

• Order matters in conditioning
• Order matters in the count

slide-34
SLIDE 34

Count-Based N-Grams (Trigrams)

• Order matters in conditioning
• Order matters in the count

count(x, y, z) ≠ count(x, z, y) ≠ count(y, x, z) ≠ …

slide-35
SLIDE 35

Count-Based N-Grams (Trigrams)

slide-36
SLIDE 36

Count-Based N-Grams (Trigrams)

The film got a great opening and the film went on to become a hit .

Context     Word (Type)   Raw Count   Normalization   Probability
The film    The           0           1               0/1
The film    film          0           1               0/1
The film    got           1           1               1/1
The film    went          0           1               0/1
…
a great     great         0           1               0/1
a great     opening       1           1               1/1
a great     and           0           1               0/1
a great     the           0           1               0/1
…

slide-37
SLIDE 37

Count-Based N-Grams (Lowercased Trigrams)

the film got a great opening and the film went on to become a hit .

Context     Word (Type)   Raw Count   Normalization   Probability
the film    the           0           2               0/2
the film    film          0           2               0/2
the film    got           1           2               1/2
the film    went          1           2               1/2
…
a great     great         0           1               0/1
a great     opening       1           1               1/1
a great     and           0           1               0/1
a great     the           0           1               0/1
…
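A small sketch (mine, not from the slides) of the lowercased trigram relative-frequency estimates in this table; it reproduces the 1/2 and 1/1 values:

```python
from collections import Counter

tokens = "the film got a great opening and the film went on to become a hit .".lower().split()

trigram_counts = Counter(zip(tokens, tokens[1:], tokens[2:]))   # order matters in the count
context_counts = Counter(zip(tokens, tokens[1:]))               # order matters in conditioning

def trigram_prob(w, u, v):
    """p(w | u v) as a relative frequency; unseen contexts are left undefined here."""
    return trigram_counts[(u, v, w)] / context_counts[(u, v)]

print(trigram_prob("got", "the", "film"))     # 0.5 = 1/2
print(trigram_prob("went", "the", "film"))    # 0.5 = 1/2
print(trigram_prob("opening", "a", "great"))  # 1.0 = 1/1
```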

slide-38
SLIDE 38

Evaluating Language Models

What is “correct?” What is working “well?”

slide-39
SLIDE 39

Evaluating Language Models

What is “correct?” What is working “well?”

Training Data

Dev Data Test Data

slide-40
SLIDE 40

Evaluating Language Models

What is “correct?” What is working “well?”

• Training data: acquire primary statistics for learning model parameters
• Dev data: fine-tune any secondary (hyper)parameters
• Test data: perform final evaluation

slide-41
SLIDE 41

Evaluating Language Models

What is “correct?” What is working “well?”

• Training data: acquire primary statistics for learning model parameters
• Dev data: fine-tune any secondary (hyper)parameters
• Test data: perform final evaluation

DO NOT ITERATE ON THE TEST DATA

slide-42
SLIDE 42

Evaluating Language Models

What is “correct”? What is working “well”?
Extrinsic: evaluate the LM in a downstream task
• Test an MT, ASR, etc. system and see which LM does better
• Propagate & conflate errors

slide-43
SLIDE 43

Evaluating Language Models

What is “correct”? What is working “well”?
Extrinsic: evaluate the LM in a downstream task
• Test an MT, ASR, etc. system and see which LM does better
• Propagate & conflate errors
Intrinsic: treat the LM as its own downstream task
• Use perplexity (from information theory)

slide-44
SLIDE 44

Perplexity

Lower is better: lower perplexity → less surprised

More outcomes → more surprised. Fewer outcomes → less surprised.

slide-45
SLIDE 45

Perplexity

Lower is better: lower perplexity → less surprised

perplexity(w1 … wT) = 2^( −(1/T) Σt log2 p(wt | wt−n+1 … wt−1) ), conditioning on the n-gram history (n−1 items)

slide-46
SLIDE 46

Perplexity

Lower is better: lower perplexity → less surprised

≥ 0, ≤ 1: higher

slide-47
SLIDE 47

Perplexity

Lower is better: lower perplexity → less surprised

≥ 0, ≤ 1: higher ≤ 0: higher

slide-48
SLIDE 48

Perplexity

Lower is better: lower perplexity → less surprised

≥ 0, ≤ 1: higher ≤ 0: higher ≤ 0, higher

slide-49
SLIDE 49

Perplexity

Lower is better: lower perplexity → less surprised

≥ 0, ≤ 1: higher ≤ 0: higher ≤ 0, higher ≥ 0, lower is better

slide-50
SLIDE 50

Perplexity

Lower is better: lower perplexity → less surprised

≥ 0, ≤ 1: higher ≤ 0: higher ≤ 0, higher ≥ 0, lower is better ≥ 0, lower

slide-51
SLIDE 51

Perplexity

Lower is better: lower perplexity → less surprised

≥ 0, ≤ 1: higher ≤ 0: higher ≤ 0, higher ≥ 0, lower is better ≥ 0, lower base must be the same

slide-52
SLIDE 52

Perplexity

Lower is better: lower perplexity → less surprised

weighted geometric average

slide-53
SLIDE 53

Perplexity

Lower is better: lower perplexity → less surprised. 471/671: branching factor
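The perplexity formula itself is an image in the deck; the sketch below (mine) implements the standard geometric-average form described above. Lower is better, and a uniform distribution over k outcomes has perplexity k, the branching factor:

```python
import math

def perplexity(token_probs):
    """Perplexity from a list of per-token probabilities p(w_t | history).
    Equivalent to the inverse weighted geometric average of those probabilities."""
    T = len(token_probs)
    avg_log2 = sum(math.log2(p) for p in token_probs) / T   # each log2(p) <= 0
    return 2 ** (-avg_log2)                                  # >= 1; lower is better

# A uniform model over 100 outcomes has perplexity 100 (the branching factor):
print(perplexity([0.01] * 20))   # ~100.0
```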

slide-54
SLIDE 54

Outline

• Probability review
• Words
• Defining Language Models
• Breaking & Fixing Language Models

slide-55
SLIDE 55

Maximum Likelihood Estimates

• Maximizes the likelihood of the training set
• Do different corpora look the same?
• For large data: can actually do reasonably well

q(item) ∝ count(item)

slide-56
SLIDE 56

n = 1

, , land of in , a teachers The , wilds the and gave a Etienne any two beginning without probably heavily that other useless the the a different . the able mines , unload into in foreign the the be either other Britain finally avoiding , for of have the cure , the Gutenberg-tm ; of being can as country in authority deviates as d seldom and They employed about from business marshal materials than in , they

slide-57
SLIDE 57

n = 2

These varied with it to the civil wars , therefore , it did not for the company had the East India , the mechanical , the sum which were by barter , vol. i , and , conveniencies of all made to purchase a council of landlords , constitute a sum as an argument , having thus forced abroad , however , and influence in the one , or banker , will there was encouraged and more common trade to corrupt , profit , it ; but a master does not , twelfth year the consent that of volunteers and […] , the other hand , it certainly it very earnestly entreat both nations . In opulent nations in a revenue of four parts of production .

slide-58
SLIDE 58

n = 3

His employer , if silver was regulated according to the temporary and occasional event .

What goods could bear the expense of defending themselves , than in the value of different sorts of goods , and placed at a much greater , there have been the effects of self-deception , this attention , but a very important ones , and which , having become of less than they ever were in this agreement for keeping up the business of weighing . After food , clothes , and a few months longer credit than is wanted , there must be sufficient to keep by him , are of such colonies to surmount . They facilitated the acquisition of the empire , both from the rents of land and labour of those pedantic pieces of silver which he can afford to take from the duty upon every quarter which they have a more equable distribution of employment .

slide-59
SLIDE 59

n = 4

To buy in one market , in order to have it ; but the 8th of George III . The tendency of some of the great lords , gradually encouraged their villains to make upon the prices of corn , cattle , poultry , etc . Though it may , perhaps , in the mean time , that part of the governments of New England , the market , trade cannot always be transported to so great a number of seamen , not inferior to those of other European nations from any direct trade to America .

The farmer makes his profit by parting with it . But the government of that country below what it is in itself necessarily slow , uncertain , liable to be interrupted by the weather .

slide-60
SLIDE 60

0s Are Not Your (Language Model’s) Friend

q(item) ∝ count(item), so count(item) = 0 → q(item) = 0

slide-61
SLIDE 61

0s Are Not Your (Language Model’s) Friend

• 0 probability → item is impossible
• 0s annihilate: x·y·z·0 = 0
• Language is creative: new words keep appearing; existing words could appear in known contexts
• How much do you trust your data?

q(item) ∝ count(item), so count(item) = 0 → q(item) = 0

slide-62
SLIDE 62

Add-λ estimation

• Laplace smoothing, Lidstone smoothing
• Pretend we saw each word λ more times than we did
• Add λ to all the counts

slide-63
SLIDE 63

Add-λ estimation

Laplace smoothing, Lidstone smoothing Pretend we saw each word λ more times than we did Add λ to all the counts

slide-64
SLIDE 64

Add-λ estimation

Laplace smoothing, Lidstone smoothing Pretend we saw each word λ more times than we did Add λ to all the counts

slide-65
SLIDE 65

Add-λ estimation

Laplace smoothing, Lidstone smoothing Pretend we saw each word λ more times than we did Add λ to all the counts

slide-66
SLIDE 66

Add-λ N-Grams (Unigrams)

The film got a great opening and the film went on to become a hit .

Word (Type), Raw Count, and Probability as before (The 1, 1/16; film 2, 1/8; got 1, 1/16; a 2, 1/8; great 1, 1/16; opening 1, 1/16; and 1, 1/16; the 1, 1/16; went 1, 1/16; on 1, 1/16; to 1, 1/16; become 1, 1/16; hit 1, 1/16; . 1, 1/16), with the Add-λ Count, Add-λ Normalization, and Add-λ Probability columns still to be filled in.

slide-67
SLIDE 67

Add-1 N-Grams (Unigrams)

The film got a great opening and the film went on to become a hit .

Add-1 counts are the raw counts plus 1: The 2, film 3, got 2, a 3, great 2, opening 2, and 2, the 2, went 2, on 2, to 2, become 2, hit 2, . 2 (raw counts and the 1/16, 1/8 probabilities as before).

slide-68
SLIDE 68

Add-1 N-Grams (Unigrams)

The film got a great opening and the film went on to become a hit .

Add-1 counts as above; Add-1 normalization: 16 + 14·1 = 30 (16 tokens plus 1 for each of the 14 word types).

slide-69
SLIDE 69

Add-1 N-Grams (Unigrams)

The film got a great opening and the film went on to become a hit .

Word (Type)   Raw Count   Prob.   Add-1 Count   Add-1 Norm.       Add-1 Prob.
The           1           1/16    2             16 + 14·1 = 30    2/30 = 1/15
film          2           1/8     3                               3/30 = 1/10
got           1           1/16    2                               1/15
a             2           1/8     3                               1/10
great         1           1/16    2                               1/15
opening       1           1/16    2                               1/15
and           1           1/16    2                               1/15
the           1           1/16    2                               1/15
went          1           1/16    2                               1/15
on            1           1/16    2                               1/15
to            1           1/16    2                               1/15
become        1           1/16    2                               1/15
hit           1           1/16    2                               1/15
.             1           1/16    2                               1/15
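A sketch (mine, not from the slides) of the add-λ computation behind this table, with λ = 1; it reproduces the 30, 1/15, and 1/10 values. Only observed word types are tabulated, as on the slide:

```python
from collections import Counter

tokens = "The film got a great opening and the film went on to become a hit .".split()
counts = Counter(tokens)

lam = 1.0
V = len(counts)                 # 14 observed word types
total = sum(counts.values())    # 16 tokens
denom = total + lam * V         # 16 + 14*1 = 30

add_lambda_prob = {w: (c + lam) / denom for w, c in counts.items()}
print(denom)                    # 30.0
print(add_lambda_prob["The"])   # 2/30 ~ 0.0667 = 1/15
print(add_lambda_prob["film"])  # 3/30 = 0.1    = 1/10
```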

slide-70
SLIDE 70

Backoff and Interpolation

Sometimes it helps to use less context

condition on less context for contexts you haven’t learned much about

slide-71
SLIDE 71

Backoff and Interpolation

Sometimes it helps to use less context

condition on less context for contexts you haven’t learned much about

Backoff:

use trigram if you have good evidence; otherwise bigram, otherwise unigram
slide-72
SLIDE 72

Backoff and Interpolation

Sometimes it helps to use less context

condition on less context for contexts you haven’t learned much about

Backoff:

use trigram if you have good evidence; otherwise bigram, otherwise unigram

Interpolation:

mix (average) unigram, bigram, trigram

slide-73
SLIDE 73

Linear Interpolation

Simple interpolation

q(z | y) = μ·q2(z | y) + (1 − μ)·q1(z),   0 ≤ μ ≤ 1

slide-74
SLIDE 74

Linear Interpolation

Simple interpolation Condition on context

q(z | y) = μ·q2(z | y) + (1 − μ)·q1(z),   0 ≤ μ ≤ 1

q(x | y, z) = μ3(y, z)·q3(x | y, z) + μ2(z)·q2(x | z) + μ1·q1(x)
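A sketch of simple linear interpolation with constant weights (the first formula); in the context-dependent version the weights μ3 and μ2 would be functions of the history. q1, q2, q3 are hypothetical component estimators:

```python
# Sketch: mix (average) unigram, bigram, and trigram estimates with fixed weights.

def interpolated_prob(w, y, z, q1, q2, q3, mu3=0.6, mu2=0.3, mu1=0.1):
    assert abs(mu1 + mu2 + mu3 - 1.0) < 1e-9   # weights must sum to 1
    return mu3 * q3(w, y, z) + mu2 * q2(w, z) + mu1 * q1(w)

# Toy example: all three component models are uniform over a 10-word vocabulary.
uni = lambda w: 0.1
bi  = lambda w, z: 0.1
tri = lambda w, y, z: 0.1
print(interpolated_prob("ideas", "Colorless", "green", uni, bi, tri))  # 0.1
```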

slide-75
SLIDE 75

Backoff

Trust your statistics, up to a point

slide-76
SLIDE 76

Discounted Backoff

Trust your statistics, up to a point

slide-77
SLIDE 77

Discounted Backoff

Trust your statistics, up to a point

d: discount constant; α(context): context-dependent normalization constant
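The formula itself is an image in the deck; as a hedged sketch, here is one standard instantiation of discounted backoff (absolute discounting on bigrams, backing off to unigrams). The discount constant d and the context-dependent normalization constant α(context) play the roles named above; the exact form used on the slide may differ:

```python
from collections import Counter

def make_backoff_bigram(tokens, d=0.5):
    """Bigram model with absolute discounting that backs off to unigrams.
    d is the discount constant; alpha(y) renormalizes the freed-up mass."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = sum(unigrams.values())

    def p_uni(w):
        return unigrams[w] / total

    def alpha(y):
        seen = {w for (a, w) in bigrams if a == y}
        reserved = d * len(seen) / unigrams[y]                    # mass freed by discounting
        return reserved / (1.0 - sum(p_uni(w) for w in seen))     # spread over unseen words

    def p(w, y):
        c = bigrams[(y, w)]
        if c > 0:
            return (c - d) / unigrams[y]   # trust the (discounted) statistics
        return alpha(y) * p_uni(w)         # otherwise back off to the unigram estimate
    return p

p = make_backoff_bigram("the film got a great opening and the film went on".split())
print(p("film", "the"))   # (2 - 0.5) / 2 = 0.75
print(p("went", "the"))   # backoff: alpha("the") * p_uni("went")
```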

slide-78
SLIDE 78

Setting Hyperparameters

Use a development corpus. Choose λs to maximize the probability of dev data:
• Fix the n-gram probabilities/counts (on the training data)
• Search for λs that give the largest probability to the held-out set

Training Data

Dev Data Test Data

slide-79
SLIDE 79

Implementation: Unknown words

Create an unknown word token <UNK> Training:

  • 1. Create a fixed lexicon L of size V
  • 2. Change any word not in L to <UNK>
  • 3. Train LM as normal

Evaluation:

Use UNK probabilities for any word not in training
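One way to implement this recipe (a sketch with hypothetical helper names; the lexicon size V = 5 is tiny purely for illustration):

```python
from collections import Counter

def build_lexicon(train_tokens, V=10000):
    """Keep the V most frequent training words; everything else becomes <UNK>."""
    counts = Counter(train_tokens)
    return {w for w, _ in counts.most_common(V)}

def apply_unk(tokens, lexicon):
    return [w if w in lexicon else "<UNK>" for w in tokens]

train = "the film got a great opening and the film went on to become a hit .".split()
lexicon = build_lexicon(train, V=5)           # tiny V just for illustration
print(apply_unk(train, lexicon))              # training text with rare words mapped to <UNK>
print(apply_unk("an unseen word appears".split(), lexicon))   # all <UNK> at evaluation time
```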

slide-80
SLIDE 80

Other Kinds of Smoothing

Interpolated (modified) Kneser-Ney
• Idea: How “productive” is a context? How many different word types v appear in a context x, y?

Good-Turing
• Partition words into classes of occurrence; smooth class statistics
• Properties of classes are likely to predict properties of other classes

Witten-Bell
• Idea: Every observed type was at some point novel
• Give an MLE prediction for a novel type occurring

slide-81
SLIDE 81

Bayes Rule  NLP Applications

posterior probability likelihood prior probability marginal likelihood (probability)

slide-82
SLIDE 82

Text Classification

• Assigning subject categories, topics, or genres
• Spam detection
• Authorship identification
• Age/gender identification
• Language identification
• Sentiment analysis
• …

slide-83
SLIDE 83

Text Classification

Assigning subject categories, topics, or genres Spam detection Authorship identification Age/gender identification Language Identification Sentiment analysis …

Input:
• a document
• a fixed set of classes C = {c1, c2, …, cJ}

Output: a predicted class c ∈ C

slide-84
SLIDE 84

Text Classification: Hand-coded Rules?

Assigning subject categories, topics, or genres Spam detection Authorship identification Age/gender identification Language Identification Sentiment analysis …

Rules based on combinations of words or other features

spam: black-list-address OR (“dollars” AND “have been selected”)

Accuracy can be high

If rules carefully refined by expert

Building and maintaining these rules is expensive

slide-85
SLIDE 85

Text Classification: Supervised Machine Learning

Assigning subject categories, topics, or genres Spam detection Authorship identification Age/gender identification Language Identification Sentiment analysis …

Input:
• a document d
• a fixed set of classes C = {c1, c2, …, cJ}
• a training set of m hand-labeled documents (d1, c1), …, (dm, cm)

Output:
• a learned classifier γ: d → c

slide-86
SLIDE 86

Text Classification: Supervised Machine Learning

Assigning subject categories, topics, or genres Spam detection Authorship identification Age/gender identification Language Identification Sentiment analysis …

Input:

a document d a fixed set of classes C = {c1, c2,…, cJ} A training set of m hand-labeled documents (d1,c1),....,(dm,cm)

Output:

a learned classifier γ: d → c

Naïve Bayes Logistic regression Support-vector machines k-Nearest Neighbors …

slide-87
SLIDE 87

Text Classification: Supervised Machine Learning

Assigning subject categories, topics, or genres Spam detection Authorship identification Age/gender identification Language Identification Sentiment analysis …

Input:

a document d a fixed set of classes C = {c1, c2,…, cJ} A training set of m hand-labeled documents (d1,c1),....,(dm,cm)

Output:

a learned classifier γ: d → c

Naïve Bayes Logistic regression Support-vector machines k-Nearest Neighbors …

slide-88
SLIDE 88

Probabilistic Text Classification

Assigning subject categories, topics, or genres Spam detection Authorship identification Age/gender identification Language Identification Sentiment analysis …

p(class | observed data)

slide-89
SLIDE 89

Probabilistic Text Classification

Assigning subject categories, topics, or genres Spam detection Authorship identification Age/gender identification Language Identification Sentiment analysis …

p(class | observed data) = p(observed data | class) · p(class) / p(observed data)

p(observed data | class): class-based likelihood; p(class): prior probability of class; p(observed data): observation likelihood (averaged over all classes)
slide-90
SLIDE 90

Noisy Channel Model

slide-91
SLIDE 91

Noisy Channel Model

what I want to tell you “sports”

slide-92
SLIDE 92

Noisy Channel Model

what I want to tell you: “sports”  →  what you actually see: “The Os lost again…”

slide-93
SLIDE 93

Noisy Channel Model

what I want to tell you: “sports”  →  what you actually see: “The Os lost again…”  →  Decode: hypothesized intent “sad stories”, “sports”

slide-94
SLIDE 94

Noisy Channel Model

what I want to tell you: “sports”  →  what you actually see: “The Os lost again…”  →  Decode: hypothesized intent “sad stories”, “sports”  →  Rerank: reweight according to what’s likely → “sports”

slide-95
SLIDE 95

Noisy Channel

Machine translation Speech-to-text Spelling correction Text normalization Part-of-speech tagging Morphological analysis …

p(possible (clean) output | observed (noisy) text) ∝ p(noisy text | clean output) · p(clean output)

translation/decode model: p(noisy text | clean output); (clean) language model: p(clean output); observation (noisy) likelihood: p(noisy text)
slide-96
SLIDE 96

Noisy Channel

Machine translation Speech-to-text Spelling correction Text normalization Part-of-speech tagging Morphological analysis …

possible (clean) output · observed (noisy) text · (clean) language model · observation (noisy) likelihood · translation/decode model

slide-97
SLIDE 97

Language Model

Use any of the language modeling algorithms we’ve learned:
• Unigram, bigram, trigram
• Add-λ, interpolation, backoff
• (Later: Maxent, RNNs, hierarchical Bayesian LMs, …)

slide-98
SLIDE 98

Noisy Channel

slide-99
SLIDE 99

Noisy Channel

slide-100
SLIDE 100

Noisy Channel

argmaxX p(X | Y) = argmaxX p(Y | X) · p(X) / p(Y) = argmaxX p(Y | X) · p(X), since p(Y) is constant with respect to X
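The decision rule this builds toward can be sketched as follows (my illustration; `channel_model` and `language_model` are hypothetical stand-ins, and the numbers are invented):

```python
import math

def decode(noisy_Y, candidates, channel_model, language_model):
    """Return the clean hypothesis X maximizing p(Y | X) * p(X).
    The marginal p(Y) is dropped: it is constant with respect to X."""
    return max(candidates,
               key=lambda X: math.log(channel_model(noisy_Y, X)) + math.log(language_model(X)))

# Toy spelling-correction numbers, invented purely for illustration.
language_model = {"the": 0.05, "thew": 1e-7}.get
channel_model = lambda y, x: {"the": 0.10, "thew": 0.02}.get(x, 1e-9)  # p("teh" | x)
print(decode("teh", ["the", "thew"], channel_model, language_model))   # -> "the"
```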

slide-101
SLIDE 101

Noisy Channel

slide-102
SLIDE 102
slide-103
SLIDE 103
slide-104
SLIDE 104

Noisy Channel

slide-105
SLIDE 105

Evaluation: the 2-by-2 contingency table

                         Actually Correct       Actually Incorrect
Selected/Guessed         True Positive (TP)     False Positive (FP)
Not selected/guessed     False Negative (FN)    True Negative (TN)

slide-106
SLIDE 106

Accuracy, Precision, and Recall

Accuracy: % of items correct

                         Actually Correct       Actually Incorrect
Selected/Guessed         True Positive (TP)     False Positive (FP)
Not selected/guessed     False Negative (FN)    True Negative (TN)

slide-107
SLIDE 107

Accuracy, Precision, and Recall

Accuracy: % of items correct Precision: % of selected items that are correct

                         Actually Correct       Actually Incorrect
Selected/Guessed         True Positive (TP)     False Positive (FP)
Not selected/guessed     False Negative (FN)    True Negative (TN)

slide-108
SLIDE 108

Accuracy, Precision, and Recall

Accuracy: % of items correct Precision: % of selected items that are correct Recall: % of correct items that are selected

                         Actually Correct       Actually Incorrect
Selected/Guessed         True Positive (TP)     False Positive (FP)
Not selected/guessed     False Negative (FN)    True Negative (TN)

slide-109
SLIDE 109

A combined measure: F

Weighted (harmonic) average of Precision & Recall. Balanced F1 measure: β = 1 (α = ½)

F = 1 / ( α·(1/P) + (1 − α)·(1/R) ) = (β² + 1)·P·R / (β²·P + R)

F1 = 2PR / (P + R)
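These definitions in code (a small sketch; the TP/FP/FN numbers are toy values):

```python
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_measure(p, r, beta=1.0):
    """Weighted harmonic mean of precision and recall; beta = 1 gives balanced F1."""
    return (beta**2 + 1) * p * r / (beta**2 * p + r)

# Toy contingency table: TP = 10, FP = 10, FN = 10
p, r = precision(10, 10), recall(10, 10)
print(p, r, f_measure(p, r))   # 0.5 0.5 0.5
```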

slide-110
SLIDE 110

Micro- vs. Macro-Averaging

If we have more than one class, how do we combine multiple performance measures into one quantity? Macroaveraging: Compute performance for each class, then average. Microaveraging: Collect decisions for all classes, compute contingency table, evaluate.

  • Sec. 15.2.4
slide-111
SLIDE 111

Micro- vs. Macro-Averaging: Example

Class 1:             Truth: yes   Truth: no
Classifier: yes          10          10
Classifier: no           10         970

Class 2:             Truth: yes   Truth: no
Classifier: yes          90          10
Classifier: no           10         890

Micro-average table: Truth: yes   Truth: no
Classifier: yes         100          20
Classifier: no           20        1860

  • Sec. 15.2.4

Macroaveraged precision: (0.5 + 0.9)/2 = 0.7
Microaveraged precision: 100/120 ≈ 0.83
The microaveraged score is dominated by the score on common classes.
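A sketch reproducing the macro- and micro-averaged precision from the tables above:

```python
# Per-class contingency tables from the example above.
tables = [
    {"tp": 10, "fp": 10, "fn": 10, "tn": 970},   # class 1
    {"tp": 90, "fp": 10, "fn": 10, "tn": 890},   # class 2
]

def precision(tp, fp):
    return tp / (tp + fp)

# Macroaveraging: compute performance per class, then average.
macro = sum(precision(t["tp"], t["fp"]) for t in tables) / len(tables)

# Microaveraging: pool the decisions into one table, then evaluate.
pooled_tp = sum(t["tp"] for t in tables)   # 100
pooled_fp = sum(t["fp"] for t in tables)   # 20
micro = precision(pooled_tp, pooled_fp)

print(macro)   # 0.7
print(micro)   # 0.8333...
```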

slide-112
SLIDE 112

The Bag of Words Representation

slide-113
SLIDE 113

The Bag of Words Representation

slide-114
SLIDE 114

The Bag of Words Representation

114

slide-115
SLIDE 115

Bag of Words Representation

γ(document) = c

seen 2, sweet 1, whimsical 1, recommend 1, happy 1, …
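A minimal sketch of the bag-of-words representation; the document text here is invented for illustration (the slide's actual review text is not in the transcript):

```python
from collections import Counter

def bag_of_words(text):
    """Order-insensitive word-count representation of a document."""
    return Counter(text.lower().split())

doc = "I have seen this sweet , whimsical film twice ; I recommend it ; I left happy , having seen it"
bow = bag_of_words(doc)
print({w: bow[w] for w in ["seen", "sweet", "whimsical", "recommend", "happy"]})
# {'seen': 2, 'sweet': 1, 'whimsical': 1, 'recommend': 1, 'happy': 1}
```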

slide-116
SLIDE 116

Naïve Bayes Classifier

Start with Bayes Rule

slide-117
SLIDE 117

Naïve Bayes Classifier

Adopt naïve bag of words representation Yi

slide-118
SLIDE 118

Naïve Bayes Classifier

Adopt naïve bag of words representation Yi Assume position doesn’t matter

slide-119
SLIDE 119

Naïve Bayes Classifier

Adopt naïve bag of words representation Yi Assume position doesn’t matter Assume the feature probabilities are independent given the class X

slide-120
SLIDE 120

Multinomial Naïve Bayes: Learning

From the training corpus, extract Vocabulary

Calculate the P(cj) terms:
• For each cj in C do: docsj ← all docs with class = cj

Calculate the P(wk | cj) terms:
• Textj ← single doc containing all of docsj
• For each word wk in Vocabulary: nk ← # of occurrences of wk in Textj

q(xl | ck) = class LM
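A hedged sketch of this training procedure, with add-λ smoothing for each per-class "language model". The prior estimate P(cj) = |docsj| / (number of documents) is the standard choice and is assumed here, since the slide's formulas are images:

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, lam=1.0):
    """docs: list of (token_list, class_label). Returns log-priors and add-lambda
    smoothed class-conditional log-probabilities (one 'class LM' per class)."""
    vocab = {w for tokens, _ in docs for w in tokens}
    class_docs = defaultdict(list)
    for tokens, c in docs:
        class_docs[c].append(tokens)

    log_prior, log_lik = {}, {}
    for c, doc_list in class_docs.items():
        log_prior[c] = math.log(len(doc_list) / len(docs))
        text_c = Counter(w for tokens in doc_list for w in tokens)   # Text_j: all docs of class c
        denom = sum(text_c.values()) + lam * len(vocab)
        log_lik[c] = {w: math.log((text_c[w] + lam) / denom) for w in vocab}
    return log_prior, log_lik, vocab

def predict(tokens, log_prior, log_lik, vocab):
    scores = {c: log_prior[c] + sum(log_lik[c][w] for w in tokens if w in vocab)
              for c in log_prior}
    return max(scores, key=scores.get)

docs = [("fun fun great film".split(), "pos"),
        ("boring film sad".split(), "neg"),
        ("great fun".split(), "pos")]
lp, ll, V = train_multinomial_nb(docs)
print(predict("fun great".split(), lp, ll, V))   # pos
```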

slide-121
SLIDE 121

Naïve Bayes and Language Modeling

Naïve Bayes classifiers can use any sort of feature But if, as in the previous slides

We use only word features we use all of the words in the text (not a subset)

Then

Naïve Bayes has an important similarity to language modeling

slide-122
SLIDE 122

Naïve Bayes as a Language Model

Sec.13.2.1

Positive model: P(I) = 0.1, P(love) = 0.1, P(this) = 0.01, P(fun) = 0.05, P(film) = 0.1
Negative model: P(I) = 0.2, P(love) = 0.001, P(this) = 0.01, P(fun) = 0.005, P(film) = 0.1

slide-123
SLIDE 123

Naïve Bayes as a Language Model

Which class assigns the higher probability to s?

Sentence s: “film love this fun I”

Sec. 13.2.1

Positive model: P(I) = 0.1, P(love) = 0.1, P(this) = 0.01, P(fun) = 0.05, P(film) = 0.1
Negative model: P(I) = 0.2, P(love) = 0.001, P(this) = 0.01, P(fun) = 0.005, P(film) = 0.1

slide-124
SLIDE 124

Naïve Bayes as a Language Model

Which class assigns the higher probability to s?

Positive model: P(I) = 0.1, P(love) = 0.1, P(this) = 0.01, P(fun) = 0.05, P(film) = 0.1
Negative model: P(I) = 0.2, P(love) = 0.001, P(this) = 0.01, P(fun) = 0.005, P(film) = 0.1

Sentence s: “film love this fun I”
Per-word probabilities (positive vs. negative): film 0.1 vs 0.1, love 0.1 vs 0.001, this 0.01 vs 0.01, fun 0.05 vs 0.005, I 0.1 vs 0.2

Sec.13.2.1

slide-125
SLIDE 125

Naïve Bayes as a Language Model

Which class assigns the higher probability to s?

Positive model: P(I) = 0.1, P(love) = 0.1, P(this) = 0.01, P(fun) = 0.05, P(film) = 0.1
Negative model: P(I) = 0.2, P(love) = 0.001, P(this) = 0.01, P(fun) = 0.005, P(film) = 0.1

Sentence s: “film love this fun I”
P(s | pos) = 0.1 · 0.1 · 0.01 · 0.05 · 0.1 ≈ 5e-7
P(s | neg) = 0.1 · 0.001 · 0.01 · 0.005 · 0.2 ≈ 1e-9
5e-7 ≈ P(s | pos) > P(s | neg) ≈ 1e-9

Sec.13.2.1
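A two-line check of the slide's arithmetic (class priors are ignored here, as on the slide):

```python
from math import prod

pos = {"I": 0.1, "love": 0.1, "this": 0.01, "fun": 0.05, "film": 0.1}
neg = {"I": 0.2, "love": 0.001, "this": 0.01, "fun": 0.005, "film": 0.1}

s = "film love this fun I".split()
print(prod(pos[w] for w in s))   # ~5e-07
print(prod(neg[w] for w in s))   # ~1e-09 -> the positive class assigns higher probability
```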

slide-126
SLIDE 126

Brill and Banko (2001) With enough data, the classifier may not matter

slide-127
SLIDE 127

Summary: Naïve Bayes is Not So Naïve

• Very fast, low storage requirements
• Robust to irrelevant features
• Very good in domains with many equally important features
• Optimal if the independence assumptions hold
• Dependable baseline for text classification (but often not the best)