SLIDE 1

Naïve Bayes

CMSC 473/673 UMBC

Some slides adapted from 3SLP

SLIDE 2

Outline

Terminology: bag-of-words
“Naïve” assumption
Training & performance
NB as a language model

SLIDE 3

Outline

Terminology: bag-of-words
“Naïve” assumption
Training & performance
NB as a language model

SLIDE 4

Bag-of-words

Based on some tokenization, turn an input document into an array (or dictionary or set) of its unique vocab items
Position/ordering in the document is not important
Bag-of-words representations go hand-in-hand with one-hot representations, but they can be extended to handle dense representations
Short-hand: BOW
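As a concrete illustration (a minimal Python sketch of the idea, not code from the slides), here is one way to turn a tokenized document into a count-based BOW dictionary; note that word order is thrown away:

from collections import Counter

def bag_of_words(tokens):
    """Map each vocab item in the document to its count; position/ordering is discarded."""
    return Counter(tokens)

doc = "it was sweet and whimsical and i would recommend it".split()
print(bag_of_words(doc))  # e.g. Counter({'it': 2, 'and': 2, 'was': 1, ...})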

SLIDE 5

The Bag of Words Representation

SLIDE 6

The Bag of Words Representation

SLIDE 7

The Bag of Words Representation


SLIDE 8

The Bag of Words Representation


Count-based BOW

SLIDE 9

Bag-of-words as a Function

Based on some tokenization, turn an input document into an array (or dictionary or set) of its unique vocab items
Think of getting a BOW rep. as a function f:
  • input: a document
  • output: a container of size V (one entry per vocab type), indexable by each vocab type v

SLIDE 10

Some Bag-of-words Functions

Kind         | Type of g_v                | Interpretation
Binary       | 0 or 1                     | Did v appear in the document?
Count-based  | natural number (int >= 0)  | How often did v occur in the document?
Averaged     | real number (>= 0, <= 1)   | How often did v occur in the document, normalized by doc length?
TF-IDF (term frequency, inverse document frequency) | real number (>= 0) | How frequent is a word, tempered by how prevalent it is across the corpus (to be covered later!)
…

SLIDE 11

Some Bag-of-words Functions

Kind         | Type of g_v                | Interpretation
Binary       | 0 or 1                     | Did v appear in the document?
Count-based  | natural number (int >= 0)  | How often did v occur in the document?
Averaged     | real number (>= 0, <= 1)   | How often did v occur in the document, normalized by doc length?
TF-IDF (term frequency, inverse document frequency) | real number (>= 0) | How frequent is a word, tempered by how prevalent it is across the corpus (to be covered later!)
…

Q: Is this a reasonable representation?
Q: What are some tradeoffs (benefits vs. costs)?
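To make the table concrete, here is a small sketch (my illustration, not from the slides) of the binary, count-based, and averaged kinds; TF-IDF is deferred, as noted above.

from collections import Counter

def bow(tokens, vocab, kind="count"):
    """Return a container indexed by every vocab type v, under one of three BOW kinds."""
    counts = Counter(t for t in tokens if t in vocab)
    if kind == "binary":      # did v appear in the document?
        return {v: int(counts[v] > 0) for v in vocab}
    if kind == "count":       # how often did v occur in the document?
        return {v: counts[v] for v in vocab}
    if kind == "averaged":    # counts normalized by document length
        n = max(len(tokens), 1)
        return {v: counts[v] / n for v in vocab}
    raise ValueError(f"unknown kind: {kind}")

vocab = {"i", "love", "this", "fun", "film"}
print(bow("i love love this film".split(), vocab, kind="averaged"))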

SLIDE 12

Outline

Terminology: bag-of-words
“Naïve” assumption
Training & performance
NB as a language model

SLIDE 13

Bag of Words Classifier

γ( document ) = c

seen        2
sweet       1
whimsical   1
recommend   1
happy       1
…           …

classifier

SLIDE 14

Naïve Bayes (NB) Classifier
argmax_Z q(Y | Z) * q(Z)

Start with Bayes Rule

Z: label, Y: text
Q: Are we doing discriminative training or generative training?
SLIDE 15

Naïve Bayes (NB) Classifier

Start with Bayes Rule

Z: label, Y: text
Q: Are we doing discriminative training or generative training?

A: generative training

argmax_Z q(Y | Z) * q(Z)

SLIDE 16

Naïve Bayes (NB) Classifier
argmax_Z ∏_t q(Y_t | Z) * q(Z)

Adopt naïve bag of words representation Y_t
(Z: label; Y_t: each word; ∏_t iterates through the types)

SLIDE 17

Naïve Bayes (NB) Classifier
argmax_Z ∏_t q(Y_t | Z) * q(Z)

Adopt naïve bag of words representation Y_t
Assume position doesn’t matter
(Z: label; Y_t: each word; ∏_t iterates through the types)
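A minimal Python sketch of this decision rule (my illustration, not the course's code): score each label as log q(Z) plus the summed log q(Y_t | Z) terms, then take the argmax. The priors, the unseen-word floor, and the toy probabilities (echoing the positive/negative models later in the deck) are all assumptions.

import math

def nb_classify(tokens, prior, likelihood):
    """Return argmax_Z of q(Z) * prod_t q(Y_t | Z), computed in log space to avoid underflow."""
    def score(label):
        s = math.log(prior[label])
        for w in tokens:
            s += math.log(likelihood[label].get(w, 1e-12))  # crude floor for unseen words
        return s
    return max(prior, key=score)

prior = {"pos": 0.5, "neg": 0.5}
likelihood = {
    "pos": {"I": 0.1, "love": 0.1, "this": 0.01, "fun": 0.05, "film": 0.1},
    "neg": {"I": 0.2, "love": 0.001, "this": 0.01, "fun": 0.005, "film": 0.1},
}
print(nb_classify("I love this fun film".split(), prior, likelihood))  # -> pos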

SLIDE 18

Outline

Terminology: bag-of-words
“Naïve” assumption
Training & performance
NB as a language model

SLIDE 19

Learning for a Naïve Bayes Classifier

Assuming V vocab types x_1, …, x_V and L classes c_1, …, c_L (and appropriate corpora)

SLIDE 20

Learning for a Naïve Bayes Classifier

Assuming V vocab types x_1, …, x_V and L classes c_1, …, c_L (and appropriate corpora)

Q: What parameters (values/weights) must be learned?

SLIDE 21

Learning for a Naïve Bayes Classifier

Assuming V vocab types x_1, …, x_V and L classes c_1, …, c_L (and appropriate corpora)

Q: What parameters (values/weights) must be learned?
A: q(x_k | c_j) and q(c_j)

SLIDE 22

Learning for a Naïve Bayes Classifier

Assuming V vocab types x_1, …, x_V and L classes c_1, …, c_L (and appropriate corpora)

Q: What parameters (values/weights) must be learned?
A: q(x_k | c_j) and q(c_j)
Q: How many parameters must be learned?

SLIDE 23

Learning for a Naïve Bayes Classifier

Assuming V vocab types x_1, …, x_V and L classes c_1, …, c_L (and appropriate corpora)

Q: What parameters (values/weights) must be learned?
A: q(x_k | c_j) and q(c_j)
Q: How many parameters must be learned?
A: V·L + L
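As a quick illustration (numbers mine, not from the slides): with V = 10,000 vocab types and L = 2 classes, that is 10,000 · 2 + 2 = 20,002 parameters.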

SLIDE 24

Learning for a Naïve Bayes Classifier

Assuming V vocab types x_1, …, x_V and L classes c_1, …, c_L (and appropriate corpora)

Q: What parameters (values/weights) must be learned?
A: q(x_k | c_j) and q(c_j)
Q: How many parameters must be learned?
A: V·L + L
Q: What distributions need to sum to 1?

SLIDE 25

Learning for a Naïve Bayes Classifier

Assuming V vocab types x_1, …, x_V and L classes c_1, …, c_L (and appropriate corpora)

Q: What parameters (values/weights) must be learned?
A: q(x_k | c_j) and q(c_j)
Q: How many parameters must be learned?
A: V·L + L
Q: What distributions need to sum to 1?
A: Each q(· | c_j), and the prior

SLIDE 26

Learning for a Naïve Bayes Classifier

Assuming V vocab types x_1, …, x_V and L classes c_1, …, c_L (and appropriate corpora)
If you’re going to compute perplexity from q(· | c_j), all class-specific language models q(· | c_j) MUST share a common vocab (otherwise it’s not a fair comparison!)

SLIDE 27

Learning for a Naïve Bayes Classifier

Assuming V vocab types x_1, …, x_V and L classes c_1, …, c_L (and appropriate corpora)
If you’re going to compute perplexity from q(· | c_j), all class-specific language models q(· | c_j) MUST share a common vocab (otherwise it’s not a fair comparison!)

Q: Should OOV and UNK be included?

SLIDE 28

Learning for a Naïve Bayes Classifier

Assuming V vocab types x_1, …, x_V and L classes c_1, …, c_L (and appropriate corpora)
If you’re going to compute perplexity from q(· | c_j), all class-specific language models q(· | c_j) MUST share a common vocab (otherwise it’s not a fair comparison!)

Q: Should OOV and UNK be included?
Q: Should EOS be included?

SLIDE 29

Learning for a Naïve Bayes Classifier

Assuming V vocab types x_1, …, x_V and L classes c_1, …, c_L (and appropriate corpora)
If you’re going to compute perplexity from q(· | c_j), all class-specific language models q(· | c_j) MUST share a common vocab (otherwise it’s not a fair comparison!)
A binary/count-based NB classifier is also called a Multinomial Naïve Bayes classifier

SLIDE 30

Multinomial Naïve Bayes: Learning

From training corpus, extract Vocabulary

SLIDE 31

Multinomial Naïve Bayes: Learning

From training corpus, extract Vocabulary
Calculate P(c_j) terms:
  For each c_j in C do
    docs_j = all docs with class = c_j
    P(c_j) = |docs_j| / (# docs)

SLIDE 32

Multinomial Naïve Bayes: Learning

From training corpus, extract Vocabulary
Calculate P(c_j) terms:
  For each c_j in C do
    docs_j = all docs with class = c_j
    P(c_j) = |docs_j| / (# docs)
Calculate P(w_k | c_j) terms:
  Text_j = single doc containing all docs_j
  For each word w_k in Vocabulary
    n_k = # of occurrences of w_k in Text_j
    P(w_k | c_j) = class unigram LM
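A compact Python sketch of this training loop (an illustration under my own assumptions, not the course's reference implementation): it estimates P(c_j) from document counts and builds each class unigram LM with add-1 (Laplace) smoothing, which the "To Try" slide later mentions.

from collections import Counter, defaultdict

def train_multinomial_nb(docs):
    """docs: list of (tokens, label) pairs. Returns (prior, likelihood, vocab)."""
    vocab = {w for tokens, _ in docs for w in tokens}          # extract Vocabulary
    by_class = defaultdict(list)
    for tokens, label in docs:
        by_class[label].append(tokens)
    prior, likelihood = {}, {}
    for c, class_docs in by_class.items():
        prior[c] = len(class_docs) / len(docs)                 # P(c_j) = |docs_j| / #docs
        text_c = [w for tokens in class_docs for w in tokens]  # Text_j: all docs_j concatenated
        counts, total = Counter(text_c), len(text_c)
        # class unigram LM, with add-1 (Laplace) smoothing over the shared vocab
        likelihood[c] = {w: (counts[w] + 1) / (total + len(vocab)) for w in vocab}
    return prior, likelihood, vocab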

SLIDE 33

Brill and Banko (2001): with enough data, the classifier may not matter

SLIDE 34

Outline

Terminology: bag-of-words
“Naïve” assumption
Training & performance
NB as a language model

SLIDE 35

Naïve Bayes as a Language Model

Sec.13.2.1

Positive Model          Negative Model
0.1    I                0.2     I
0.1    love             0.001   love
0.01   this             0.01    this
0.05   fun              0.005   fun
0.1    film             0.1     film

SLIDE 36

Naïve Bayes as a Language Model

Which class assigns the higher probability to s?

s: film love this fun I

Sec.13.2.1

Positive Model          Negative Model
0.1    I                0.2     I
0.1    love             0.001   love
0.01   this             0.01    this
0.05   fun              0.005   fun
0.1    film             0.1     film

SLIDE 37

Naïve Bayes as a Language Model

Which class assigns the higher probability to s?

Positive Model          Negative Model
0.1    I                0.2     I
0.1    love             0.001   love
0.01   this             0.01    this
0.05   fun              0.005   fun
0.1    film             0.1     film

s:         film   love   this   fun    I
Positive:  0.1    0.1    0.01   0.05   0.1
Negative:  0.1    0.001  0.01   0.005  0.2

Sec.13.2.1

SLIDE 38

Naïve Bayes as a Language Model

Which class assigns the higher probability to s?

Positive Model          Negative Model
0.1    I                0.2     I
0.1    love             0.001   love
0.01   this             0.01    this
0.05   fun              0.005   fun
0.1    film             0.1     film

s:         film   love   this   fun    I
Positive:  0.1    0.1    0.01   0.05   0.1
Negative:  0.1    0.001  0.01   0.005  0.2

5e-7 ≈ P(s|pos) > P(s|neg) ≈ 1e-9
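Spelling out the products behind these numbers (the per-word factors can be multiplied in any order, which is exactly the bag-of-words point):
P(s | pos) = 0.1 × 0.1 × 0.01 × 0.05 × 0.1 = 5 × 10^-7
P(s | neg) = 0.1 × 0.001 × 0.01 × 0.005 × 0.2 = 1 × 10^-9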

Sec.13.2.1

SLIDE 39

Naïve Bayes To Try

http://csee.umbc.edu/courses/undergraduate/473/f19/nb

  • Toy problem: classify whether a tweet will be retweeted
  • Toy problem: OOV and EOS are not included
  • Laplace smoothing is used for p(word | label)
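Regarding the last bullet: Laplace (add-1) smoothing for the word likelihoods is typically computed as
p(word | label) = (count(word, label) + 1) / (count(all words, label) + |V|),
though the exact setup on the linked course page may differ.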
SLIDE 40

Summary: Naïve Bayes is Not So Naïve

Very fast, low storage requirements
Robust to irrelevant features
Very good in domains with many equally important features
Optimal if the independence assumptions hold
Dependable baseline for text classification (but often not the best)

SLIDE 41

But: Naïve Bayes Isn’t Without Issue

Model the posterior in one go?
Are the features really uncorrelated?
Are plain counts always appropriate?
Are there “better” ways of handling missing/noisy data? (automated, more principled)

SLIDE 42

Outline

Terminology: bag-of-words
“Naïve” assumption
Training & performance
NB as a language model