Naïve Bayes & Maxent Models - CMSC 473/673, UMBC - September 18th, 2017 - PowerPoint PPT Presentation

slide-1
SLIDE 1

Naïve Bayes & Maxent Models

CMSC 473/673 UMBC September 18th, 2017

Some slides adapted from 3SLP

slide-2
SLIDE 2

Announcements: Assignment 1

Due 11:59 AM, Wednesday 9/20 (< 2 days)
Use the submit utility with: class id cs473_ferraro, assignment id a1
We must be able to run it on GL!
Common pitfall #1: forgetting files
Common pitfall #2: incorrect paths to files
Common pitfall #3: 3rd-party libraries

slide-3
SLIDE 3

Announcements: Course Project

Official handout will be out Wednesday 9/20

Until then, focus on assignment 1

Teams of 1-3
Mixed undergrad/grad is encouraged but not required
Some novel aspect is needed

Ex 1: reimplement an existing technique and apply it to a new domain
Ex 2: reimplement an existing technique and apply it to a new (human) language
Ex 3: explore a novel technique on an existing problem

slide-4
SLIDE 4

Recap from last time…

slide-5
SLIDE 5

Two Different Philosophical Frameworks

posterior probability = likelihood × prior probability / marginal likelihood (probability)

Posterior Classification/Decoding (maximum a posteriori)
Noisy Channel Model Decoding

there are others too (CMSC 478/678)

slide-6
SLIDE 6

Posterior Decoding: Probabilistic Text Classification

Assigning subject categories, topics, or genres
Spam detection
Authorship identification
Age/gender identification
Language identification
Sentiment analysis
…

P(class | observed data) = class-based likelihood (language model) × prior probability of class / observation likelihood (averaged over all classes)
slide-7
SLIDE 7

Noisy Channel Model

what I want to tell you: “sports”
what you actually see: “The Os lost again…”
Decode → hypothesized intents: “sad stories”, “sports”
Rerank → reweight according to what’s likely: “sports”

slide-8
SLIDE 8

Noisy Channel

Machine translation
Speech-to-text
Spelling correction
Text normalization
Part-of-speech tagging
Morphological analysis
Image captioning
…

P(possible (clean) output | observed (noisy) text) = (clean) language model × translation/decode model / observation (noisy) likelihood

slide-9
SLIDE 9

Classify or Decode with Bayes Rule

slide-10
SLIDE 10

Classify or Decode with Bayes Rule

slide-11
SLIDE 11

Classify or Decode with Bayes Rule

constant with respect to X

slide-12
SLIDE 12

Classify or Decode with Bayes Rule

slide-13
SLIDE 13
slide-14
SLIDE 14
slide-15
SLIDE 15

Classify or Decode with Bayes Rule

slide-16
SLIDE 16

Classify or Decode with Bayes Rule

slide-17
SLIDE 17

Classify or Decode with Bayes Rule

how well does text Y represent label X? how likely is label X overall?

slide-18
SLIDE 18

Classify or Decode with Bayes Rule

how well does text Y represent label X? how likely is label X overall?

For “simple” or “flat” labels:
* iterate through labels
* evaluate score for each label, keeping only the best (n best)
* return the best (or n best) label and score
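The iterate/score/return loop above can be sketched in Python. The log-probability lookup tables (`log_prior`, `log_likelihood`) and the fallback score for unseen words are assumptions for illustration, not part of the slides.

```python
import math

def classify(text_tokens, classes, log_prior, log_likelihood):
    """Posterior decoding for "simple" labels: iterate through labels,
    score each with Bayes rule (dropping P(Y), which is constant with
    respect to the label), and keep the best one."""
    best_label, best_score = None, -math.inf
    for x in classes:
        # log P(X) + sum_i log P(Y_i | X); unseen words get a floor score
        score = log_prior[x] + sum(log_likelihood[x].get(w, -20.0)
                                   for w in text_tokens)
        if score > best_score:
            best_label, best_score = x, score
    return best_label, best_score
```

Extending this to the n-best labels would just mean keeping a small sorted list instead of a single running maximum.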

slide-19
SLIDE 19

Classify or Decode with Bayes Rule

how well does text (complex input) Y represent text (complex output) X? how likely is text (complex output) X overall?
slide-20
SLIDE 20

Classify or Decode with Bayes Rule

how well does text (complex input) Y represent text (complex output) X? how likely is text (complex output) X overall?

* iterate through labels
* evaluate score for each label, keeping only the best (n best)
* return the best (or n best) label and score
(can be complicated)

slide-21
SLIDE 21

Classify or Decode with Bayes Rule

how well does text (complex input) Y represent text (complex output) X? how likely is text (complex output) X overall?

* iterate through labels
* evaluate score for each label, keeping only the best (n best)
* return the best (or n best) label and score
(can be complicated; we’ll come back to this in October)

slide-22
SLIDE 22

Evaluation: the 2-by-2 contingency table

                          Actually Correct      Actually Incorrect
Selected/Guessed
Not selected/not guessed

slide-23
SLIDE 23

Classification Evaluation: the 2-by-2 contingency table

                          Actually Correct      Actually Incorrect
Selected/Guessed
Not selected/not guessed

Classes/Choices

slide-24
SLIDE 24

Classification Evaluation: the 2-by-2 contingency table

                          Actually Correct      Actually Incorrect
Selected/Guessed          True Positive (TP)
Not selected/not guessed

Classes/Choices

(Venn diagram: the Correct set vs. the Guessed set)

slide-25
SLIDE 25

Classification Evaluation: the 2-by-2 contingency table

                          Actually Correct      Actually Incorrect
Selected/Guessed          True Positive (TP)    False Positive (FP)
Not selected/not guessed

Classes/Choices


slide-26
SLIDE 26

Classification Evaluation: the 2-by-2 contingency table

                          Actually Correct      Actually Incorrect
Selected/Guessed          True Positive (TP)    False Positive (FP)
Not selected/not guessed  False Negative (FN)

Classes/Choices


slide-27
SLIDE 27

Classification Evaluation: the 2-by-2 contingency table

                          Actually Correct      Actually Incorrect
Selected/Guessed          True Positive (TP)    False Positive (FP)
Not selected/not guessed  False Negative (FN)   True Negative (TN)

Classes/Choices


slide-28
SLIDE 28

Classification Evaluation: Accuracy, Precision, and Recall

Accuracy: % of items correct

                          Actually Correct      Actually Incorrect
Selected/Guessed          True Positive (TP)    False Positive (FP)
Not selected/not guessed  False Negative (FN)   True Negative (TN)

slide-29
SLIDE 29

Classification Evaluation: Accuracy, Precision, and Recall

Accuracy: % of items correct Precision: % of selected items that are correct

                          Actually Correct      Actually Incorrect
Selected/Guessed          True Positive (TP)    False Positive (FP)
Not selected/not guessed  False Negative (FN)   True Negative (TN)

slide-30
SLIDE 30

Classification Evaluation: Accuracy, Precision, and Recall

Accuracy: % of items correct Precision: % of selected items that are correct Recall: % of correct items that are selected

                          Actually Correct      Actually Incorrect
Selected/Guessed          True Positive (TP)    False Positive (FP)
Not selected/not guessed  False Negative (FN)   True Negative (TN)
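The three definitions map directly onto the contingency-table counts; a minimal sketch:

```python
def accuracy(tp, fp, fn, tn):
    # % of items correct: everything on the table's diagonal
    return (tp + tn) / (tp + fp + fn + tn)

def precision(tp, fp):
    # % of selected/guessed items that are actually correct
    return tp / (tp + fp)

def recall(tp, fn):
    # % of actually correct items that are selected/guessed
    return tp / (tp + fn)
```

With the Class 1 counts from the micro/macro example later in the deck (TP=10, FP=10, FN=10, TN=970), precision and recall are both 0.5 while accuracy is 0.98, which is why accuracy alone can be misleading on skewed classes.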

slide-31
SLIDE 31

A combined measure: F

Weighted (harmonic) average of Precision & Recall

slide-32
SLIDE 32

A combined measure: F

Weighted (harmonic) average of Precision & Recall

algebra (not important)

slide-33
SLIDE 33

A combined measure: F

Weighted (harmonic) average of Precision & Recall:
F_β = (1 + β²) · P · R / (β² · P + R)
Balanced F1 measure: β = 1, giving F1 = 2PR / (P + R)
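The F_β formula is small enough to write out directly; a minimal sketch:

```python
def f_beta(p, r, beta=1.0):
    """Weighted harmonic mean of precision p and recall r.
    beta=1 gives the balanced F1 = 2PR/(P+R); beta > 1 weights
    recall more heavily, beta < 1 weights precision more heavily."""
    b2 = beta * beta
    return (1 + b2) * p * r / (b2 * p + r)
```

For example, with P = R = 0.5 the balanced F1 is 0.5, and with P = 1.0, R = 0.5 it is 2/3, pulled toward the lower of the two values, as a harmonic mean always is.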

slide-34
SLIDE 34

Micro- vs. Macro-Averaging

If we have more than one class, how do we combine multiple performance measures into one quantity? Macroaveraging: Compute performance for each class, then average. Microaveraging: Collect decisions for all classes, compute contingency table, evaluate.

Sec. 15.2.4
slide-35
SLIDE 35

Micro- vs. Macro-Averaging: Example

Class 1           Truth: yes   Truth: no
Classifier: yes   10           10
Classifier: no    10           970

Class 2           Truth: yes   Truth: no
Classifier: yes   90           10
Classifier: no    10           890

Micro Ave. Table  Truth: yes   Truth: no
Classifier: yes   100          20
Classifier: no    20           1860

Sec. 15.2.4

Macroaveraged precision: (0.5 + 0.9)/2 = 0.7
Microaveraged precision: 100/120 ≈ 0.83
Microaveraged score is dominated by score on common classes
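The difference between the two averages comes down to when you pool the counts; a sketch using the (TP, FP) counts from the example tables:

```python
def prec(tp, fp):
    return tp / (tp + fp)

# Per-class (TP, FP) counts from the example
class1 = (10, 10)   # per-class precision 0.5
class2 = (90, 10)   # per-class precision 0.9

# Macroaverage: compute performance for each class, then average
macro = (prec(*class1) + prec(*class2)) / 2

# Microaverage: pool decisions into one contingency table, then evaluate
micro = prec(class1[0] + class2[0], class1[1] + class2[1])
```

Because Class 2 contributes far more true positives to the pooled table, the microaverage (≈0.83) sits much closer to Class 2's precision than the macroaverage (0.7) does.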

slide-36
SLIDE 36

Language Modeling as Naïve Bayes Classifier

Posterior Classification/Decoding (maximum a posteriori); Noisy Channel Model Decoding

posterior probability: P(class | observed data) = class-based likelihood (language model) × prior probability of class / observation likelihood (averaged over all classes)

slide-37
SLIDE 37

The Bag of Words Representation

slide-38
SLIDE 38

The Bag of Words Representation

slide-39
SLIDE 39

The Bag of Words Representation

slide-40
SLIDE 40

Bag of Words Representation

γ( doc ) = c

seen 2, sweet 1, whimsical 1, recommend 1, happy 1, …

classifier

slide-41
SLIDE 41

Language Modeling as Naïve Bayes Classifier

Start with Bayes Rule

slide-42
SLIDE 42

Language Modeling as Naïve Bayes Classifier

Adopt naïve bag of words representation Y_i

slide-43
SLIDE 43

Language Modeling as Naïve Bayes Classifier

Adopt naïve bag of words representation Y_i
Assume position doesn’t matter

slide-44
SLIDE 44

Language Modeling as Naïve Bayes Classifier

Adopt naïve bag of words representation Y_i
Assume position doesn’t matter
Assume the feature probabilities are independent given the class X

slide-45
SLIDE 45

Multinomial Naïve Bayes: Learning

From training corpus, extract Vocabulary

slide-46
SLIDE 46

Multinomial Naïve Bayes: Learning

Calculate P(c_j) terms
For each c_j in C do:
  docs_j = all docs with class = c_j

From training corpus, extract Vocabulary

slide-47
SLIDE 47

Multinomial Naïve Bayes: Learning

From training corpus, extract Vocabulary

Calculate P(c_j) terms
For each c_j in C do:
  docs_j = all docs with class = c_j

Calculate P(w_k | c_j) terms
Text_j = single doc containing all docs_j
For each word w_k in Vocabulary:
  n_k = # of occurrences of w_k in Text_j
  P(w_k | c_j) = n_k / (total # of words in Text_j)  (the class LM)
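The training procedure above can be sketched directly; this follows the slide's unsmoothed counts (real use would add e.g. Laplace smoothing), and the input format `(tokens, class)` is an assumption for illustration.

```python
from collections import Counter

def train_multinomial_nb(docs):
    """docs: list of (tokens, class) pairs. Returns the priors P(c_j)
    and the per-class unigram language models P(w_k | c_j)."""
    vocab = {w for tokens, _ in docs for w in tokens}
    classes = {c for _, c in docs}
    prior, likelihood = {}, {}
    for c in classes:
        class_docs = [tokens for tokens, cj in docs if cj == c]
        prior[c] = len(class_docs) / len(docs)
        # Text_j: one big doc concatenating all docs of this class
        text_c = [w for tokens in class_docs for w in tokens]
        counts = Counter(text_c)
        likelihood[c] = {w: counts[w] / len(text_c) for w in vocab}
    return prior, likelihood
```

Note that without smoothing, any vocabulary word unseen in a class gets probability 0, which zeroes out the whole product at classification time; that is one motivation for the smoothing discussed earlier in the course.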

slide-48
SLIDE 48

Naïve Bayes and Language Modeling

Naïve Bayes classifiers can use any sort of feature.
But if, as in the previous slides,
* we use only word features
* we use all of the words in the text (not a subset)
then Naïve Bayes has an important similarity to language modeling.

slide-49
SLIDE 49

Naïve Bayes as a Language Model

Sec.13.2.1

Positive Model: I 0.1, love 0.1, this 0.01, fun 0.05, film 0.1
Negative Model: I 0.2, love 0.001, this 0.01, fun 0.005, film 0.1

slide-50
SLIDE 50

Naïve Bayes as a Language Model

Which class assigns the higher probability to s?

s = “I love this fun film”

Sec.13.2.1

Positive Model: I 0.1, love 0.1, this 0.01, fun 0.05, film 0.1
Negative Model: I 0.2, love 0.001, this 0.01, fun 0.005, film 0.1

slide-51
SLIDE 51

Naïve Bayes as a Language Model

Which class assigns the higher probability to s?

Positive Model: I 0.1, love 0.1, this 0.01, fun 0.05, film 0.1
Negative Model: I 0.2, love 0.001, this 0.01, fun 0.005, film 0.1

s = “I love this fun film”
P(s|pos): 0.1 × 0.1 × 0.01 × 0.05 × 0.1
P(s|neg): 0.2 × 0.001 × 0.01 × 0.005 × 0.1

Sec.13.2.1

slide-52
SLIDE 52

Naïve Bayes as a Language Model

Which class assigns the higher probability to s?

Positive Model: I 0.1, love 0.1, this 0.01, fun 0.05, film 0.1
Negative Model: I 0.2, love 0.001, this 0.01, fun 0.005, film 0.1

s = “I love this fun film”
P(s|pos) = 0.1 × 0.1 × 0.01 × 0.05 × 0.1 ≈ 5e-7
P(s|neg) = 0.2 × 0.001 × 0.01 × 0.005 × 0.1 ≈ 1e-9
5e-7 ≈ P(s|pos) > P(s|neg) ≈ 1e-9

Sec.13.2.1
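The slide's arithmetic can be checked in a few lines, treating each class model as a unigram language model over the sentence:

```python
from math import prod

# Per-word probabilities from the slide's positive and negative models
pos = {"I": 0.1, "love": 0.1, "this": 0.01, "fun": 0.05, "film": 0.1}
neg = {"I": 0.2, "love": 0.001, "this": 0.01, "fun": 0.005, "film": 0.1}

s = "I love this fun film".split()
p_pos = prod(pos[w] for w in s)   # ≈ 5e-7
p_neg = prod(neg[w] for w in s)   # ≈ 1e-9
```

The positive model assigns the sentence a probability roughly 500 times higher, so with equal priors the classifier labels s positive.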

slide-53
SLIDE 53

Brill and Banko (2001): with enough data, the classifier may not matter

slide-54
SLIDE 54

Summary: Naïve Bayes is Not So Naïve

Very fast, low storage requirements
Robust to irrelevant features
Very good in domains with many equally important features
Optimal if the independence assumptions hold
Dependable baseline for text classification (but often not the best)

slide-55
SLIDE 55

But: Naïve Bayes Isn’t Without Issue

Model the posterior in one go?
Are the features really uncorrelated?
Are plain counts always appropriate?
Are there “better” ways of handling missing/noisy data? (automated, more principled)

slide-56
SLIDE 56

Maximum Entropy (Log-linear) Models

a more general language model

slide-57
SLIDE 57

Maximum Entropy (Log-linear) Models

classify in one go

slide-58
SLIDE 58

Document Classification

ATTACK

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.

Observed document → Label

slide-59
SLIDE 59

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.

Document Classification

ATTACK

  • # killed:
  • Type:
  • Perp:

shot

ATTACK

slide-60
SLIDE 60

Document Classification

ATTACK

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.

slide-61
SLIDE 61

slide-62
SLIDE 62

slide-63
SLIDE 63

slide-64
SLIDE 64
slide-65
SLIDE 65

ATTACK

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.

We need to score the different combinations.

slide-66
SLIDE 66

Score and Combine Our Possibilities

score1(fatally shot, ATTACK) score2(seriously wounded, ATTACK) score3(Shining Path, ATTACK)

COMBINE → posterior probability of ATTACK. Are all of these uncorrelated?

scorek(department, ATTACK)

slide-67
SLIDE 67

Score and Combine Our Possibilities

score1(fatally shot, ATTACK) score2(seriously wounded, ATTACK) score3(Shining Path, ATTACK)

COMBINE → posterior probability of ATTACK

Q: What are the score and combine functions for Naïve Bayes?

slide-68
SLIDE 68

Scoring Our Possibilities

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region .

score( doc, ATTACK ) = score1(fatally shot, ATTACK) + score2(seriously wounded, ATTACK) + score3(Shining Path, ATTACK) + …

slide-69
SLIDE 69

Scoring Our Possibilities

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region .

score( doc, ATTACK ) = score1(fatally shot, ATTACK) + score2(seriously wounded, ATTACK) + score3(Shining Path, ATTACK) + …

Learn these scores… but how? What do we optimize?
slide-70
SLIDE 70

Maxent Modeling

p( ATTACK | doc ) ∝ SNAP(score( doc, ATTACK ))

doc = “Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.”

slide-71
SLIDE 71

What function…

operates on any real number?
is never less than 0?

slide-72
SLIDE 72

What function…

operates on any real number?
is never less than 0?

f(x) = exp(x)

slide-73
SLIDE 73

Maxent Modeling

p( ATTACK | doc ) ∝ exp(score( doc, ATTACK ))

doc = “Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.”

slide-74
SLIDE 74

Maxent Modeling

p( ATTACK | doc ) ∝ exp( score1(fatally shot, ATTACK) + score2(seriously wounded, ATTACK) + score3(Shining Path, ATTACK) + … )

slide-75
SLIDE 75

Maxent Modeling

p( ATTACK | doc ) ∝ exp( score1(fatally shot, ATTACK) + score2(seriously wounded, ATTACK) + score3(Shining Path, ATTACK) + … )

Learn the scores (but we’ll declare what combinations should be looked at)

slide-76
SLIDE 76

Maxent Modeling

p( ATTACK | doc ) ∝ exp( weight1 * applies1(fatally shot, ATTACK) + weight2 * applies2(seriously wounded, ATTACK) + weight3 * applies3(Shining Path, ATTACK) + … )

slide-77
SLIDE 77

Maxent Modeling

p( ATTACK | doc ) = (1/Z) · exp( weight1 * applies1(fatally shot, ATTACK) + weight2 * applies2(seriously wounded, ATTACK) + weight3 * applies3(Shining Path, ATTACK) + … )

Q: How do we define Z?

slide-78
SLIDE 78

Normalization for Classification

Z = Σ_{label x} exp( weight1 * applies1(fatally shot, x) + weight2 * applies2(seriously wounded, x) + weight3 * applies3(Shining Path, x) + … )

p(x | y) ∝ exp(θ ⋅ f(x, y))

classify doc y with label x in one go
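The weight-times-feature score, the exp, and the normalizer Z fit in a short sketch. The feature representation here (a function returning the names of the features that apply to a label/document pair) and the specific weights are illustrative assumptions, not from the slides.

```python
import math

def maxent_prob(weights, features, labels, y):
    """p(x | y) = exp(theta . f(x, y)) / Z, where
    Z = sum over labels x' of exp(theta . f(x', y)).
    `features(x, y)` returns the names of the indicator features
    that apply; unknown features default to weight 0."""
    def score(x):
        return sum(weights.get(name, 0.0) for name in features(x, y))
    z = sum(math.exp(score(x)) for x in labels)       # the normalizer Z
    return {x: math.exp(score(x)) / z for x in labels}
```

Because exp is positive and Z sums over all labels, the result is a proper distribution over labels, which is exactly the "classify in one go" posterior the slides contrast with Naïve Bayes.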

slide-79
SLIDE 79

Normalization for Language Model

general class-based (X) language model of doc y

slide-80
SLIDE 80

Normalization for Language Model

Can be significantly harder in the general case

general class-based (X) language model of doc y

slide-81
SLIDE 81

Normalization for Language Model

Can be significantly harder in the general case
Simplifying assumption: maxent n-grams!

general class-based (X) language model of doc y