Spam Filtering with Naive Bayes Classifier Yuriy Arabskyy June 6, - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Spam Filtering with Naive Bayes Classifier

Yuriy Arabskyy June 6, 2017

slide-2
SLIDE 2

Table of contents

◮ What is spam?
◮ Different spam types
◮ Anti-Spam Techniques
◮ Probability theory basics
◮ Conditional probability
◮ Bayes Theorem
◮ Naive Bayes Theorem
◮ Spam filtering with Naive Bayes Classifier (NBC)
◮ Definition of terms
◮ Feature representation
◮ Evaluation
◮ Comparison to Logistic Classifier (LC)

slide-5
SLIDE 5

What is spam?

Spam: mass-mailing of a message over the internet for the purposes of advertising.

Examples:

HOT sing1e wOmen seeking your attention near this AREA!!!%%###!!! Just follow this link . . .

I am Wumi Abdul; the only Daughter of late Mr and Mrs George Abdul. My father was a very wealthy cocoa merchant in Abidjan, he was poisoned to death by his business associates. . . I seek for a foreign partner. Please provide a Bank account where this money would be transferred to.

slide-6
SLIDE 6

Different spam types

Figure: Spam chart

slide-8
SLIDE 8

Anti-Spam Techniques

◮ End-user techniques
  ◮ Discretion
  ◮ Address munging
  ◮ Ham passwords
◮ Mail server level filtering
  ◮ Realtime Blackhole Lists
  ◮ Spamtrapping
  ◮ SMTP callback verification
  ◮ Statistical spam filtering

slide-10
SLIDE 10

Probability theory basics

Conditional Probability:

$$\Pr[X \cap Y] = \Pr[Y \mid X] \cdot \Pr[X] \tag{1}$$

Figure: Weather - conditional probability

slide-13
SLIDE 13

Probability theory basics

$$\Pr[X \cap Y] = \Pr[Y \mid X] \cdot \Pr[X] = \Pr[X \mid Y] \cdot \Pr[Y] \tag{2}$$

Bayes Theorem:

$$\Pr[Y \mid X] = \frac{\Pr[X \mid Y] \cdot \Pr[Y]}{\Pr[X]} \tag{3}$$

Bayes Theorem is a way of updating what we believe about the world based on the evidence we observe.
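As a minimal sketch of equation (3), the snippet below computes the probability that a message is spam given that it contains a particular word. The numbers and the function name `bayes_posterior` are made up for illustration and do not come from the talk.

```python
def bayes_posterior(p_x_given_y, p_y, p_x):
    """Pr[Y|X] = Pr[X|Y] * Pr[Y] / Pr[X], i.e. equation (3)."""
    return p_x_given_y * p_y / p_x

# Hypothetical numbers: Y = "message is spam", X = "message contains the word 'free'".
p_spam = 0.4               # Pr[Y], the prior
p_word_given_spam = 0.5    # Pr[X|Y]
p_word_given_ham = 0.1     # Pr[X|not Y]

# The evidence Pr[X] follows from the law of total probability.
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

posterior = bayes_posterior(p_word_given_spam, p_spam, p_word)
print(round(posterior, 3))
```

With these assumed numbers, seeing the word raises the belief that the message is spam from the prior of 0.4 to roughly 0.77.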

slide-16
SLIDE 16

Probability theory basics

Multiple variables:

$$\Pr[x_1, x_2, \ldots, x_n] = \Pr[x_1 \mid x_2, x_3, \ldots, x_n] \cdot \Pr[x_2, x_3, \ldots, x_n] \tag{4}$$

$$\Pr[x_2, x_3, \ldots, x_n] = \Pr[x_2 \mid x_3, x_4, \ldots, x_n] \cdot \Pr[x_3, x_4, \ldots, x_n] \tag{5}$$

Assuming $x_i$ and $x_j$ are independent:

$$\Pr[x_i \mid x_j] = \Pr[x_i] \tag{6}$$

Applying (4) and (5) repeatedly and using (6) for every conditional factor, the joint probability simplifies to:

$$\Pr[x_1, x_2, \ldots, x_n] = \Pr[x_1] \cdot \Pr[x_2] \cdot \ldots \cdot \Pr[x_n] \tag{7}$$
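For n = 3 the steps can be written out in full. This is only a worked instance of equations (4) to (7), and it assumes that x1, x2 and x3 are mutually independent, so that conditioning on several variables at once also leaves the probability unchanged:

```latex
\begin{align*}
\Pr[x_1, x_2, x_3]
  &= \Pr[x_1 \mid x_2, x_3] \cdot \Pr[x_2, x_3]
     && \text{by (4)} \\
  &= \Pr[x_1 \mid x_2, x_3] \cdot \Pr[x_2 \mid x_3] \cdot \Pr[x_3]
     && \text{by (5)} \\
  &= \Pr[x_1] \cdot \Pr[x_2] \cdot \Pr[x_3]
     && \text{by (6)}
\end{align*}
```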

slide-18
SLIDE 18

Spam filtering with NBC

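Bayes theorem rewritten using the naive assumption (the features $x_1, \ldots, x_n$ are conditionally independent given the class $c$):

$$\Pr[c \mid x_1, x_2, \ldots, x_n] = \frac{\Pr[c] \cdot \Pr[x_1 \mid c] \cdot \Pr[x_2 \mid c] \cdots \Pr[x_n \mid c]}{\Pr[x_1, x_2, \ldots, x_n]} \tag{8}$$

$$\text{Class of } d_i = \arg\max_c \Pr[c \mid d_i] \tag{9}$$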
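Equations (8) and (9) can be sketched in a few lines of Python. The function name `classify`, the class labels and all probabilities below are hypothetical. The sketch sums logarithms instead of multiplying probabilities, a common way to avoid numerical underflow that the slides do not mention, and it assumes every probability is strictly positive.

```python
import math

def classify(priors, likelihoods):
    """Return the class c that maximizes Pr[c] * prod_i Pr[x_i|c].

    priors:      {class: Pr[c]}
    likelihoods: {class: [Pr[x_1|c], ..., Pr[x_n|c]]} for the observed features
    The denominator of equation (8) is identical for every class,
    so it does not affect the argmax in equation (9) and is dropped.
    """
    scores = {}
    for c, prior in priors.items():
        # log(a * b) = log(a) + log(b); all inputs must be > 0.
        scores[c] = math.log(prior) + sum(math.log(p) for p in likelihoods[c])
    return max(scores, key=scores.get)

# Hypothetical values for one message with three observed features.
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {"spam": [0.5, 0.3, 0.2], "ham": [0.1, 0.4, 0.3]}
predicted = classify(priors, likelihoods)
print(predicted)
```

Here the unnormalized scores are 0.4 · 0.03 = 0.012 for spam and 0.6 · 0.012 = 0.0072 for ham, so the message is assigned to the spam class.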

slide-19
SLIDE 19

Spam filtering with NBC

Definition of terms:

◮ Vocabulary (V) is an ordered collection of words, i.e., $V = (v_1, v_2, v_3, \ldots, v_n)$, used to classify an email.
◮ Document (D) is an ordered collection of words used in a message, $D = (w_1, w_2, w_3, \ldots, w_{|D|})$.
◮ The classifier is a machine that, when given a document D and a collection of parameters θ, deterministically returns the class of the document.
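The terms above can be illustrated with a small sketch. The talk does not say how messages are split into words or how the vocabulary is chosen, so the regular expression, the lowercasing, the alphabetical ordering and the names `tokenize` and `build_vocabulary` are all assumptions made for this example.

```python
import re

def tokenize(message):
    """Turn a raw message into a document D: an ordered list of lowercase words."""
    return re.findall(r"[a-z0-9']+", message.lower())

def build_vocabulary(documents):
    """Collect the distinct words of all documents into an ordered tuple V."""
    return tuple(sorted({word for doc in documents for word in doc}))

# Two made-up messages.
messages = ["Just follow this link", "Please follow up on this"]
documents = [tokenize(m) for m in messages]
vocabulary = build_vocabulary(documents)

print(documents[0])
print(vocabulary)
```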

slide-21
SLIDE 21

Spam filtering with NBC

Document representation

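◮ A binary vector $x = (x_1, \ldots, x_{|V|})$ of length $|V|$ is used to represent a document.
◮ $x_i$ indicates the presence ($x_i = 1$) or absence ($x_i = 0$) of the word $v_i$ in the specified document.

Bernoulli event model

$$\Pr[x_i \mid c_k] = p_{ki}^{x_i} \cdot (1 - p_{ki})^{1 - x_i} \tag{10}$$

$p_{ki}$ is the probability of class $c_k$ generating the word $v_i$ and can be calculated as follows:

$$p_{ki} = \frac{\sum_{d \in c_k} \text{isPresent}(v_i, d)}{\text{\# of documents in } c_k} \tag{11}$$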
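A minimal sketch of the binary representation and of equations (10) and (11) follows. It takes x_i = 1 to mean that word v_i is present. The function names, the toy vocabulary and the toy documents are assumptions for illustration only.

```python
def to_binary_vector(document, vocabulary):
    """x_i = 1 if vocabulary word v_i occurs in the document, else 0."""
    words = set(document)
    return [1 if v in words else 0 for v in vocabulary]

def estimate_p(class_documents, vocabulary):
    """Equation (11): fraction of the class's documents that contain each word."""
    vectors = [to_binary_vector(d, vocabulary) for d in class_documents]
    n_docs = len(vectors)
    # zip(*vectors) iterates column-wise, i.e. once per vocabulary word.
    return [sum(column) / n_docs for column in zip(*vectors)]

def bernoulli_likelihood(x, p):
    """Product over i of equation (10): p_i^x_i * (1 - p_i)^(1 - x_i)."""
    result = 1.0
    for x_i, p_i in zip(x, p):
        result *= p_i if x_i == 1 else (1.0 - p_i)
    return result

# Toy data: a three-word vocabulary and two spam documents.
vocabulary = ("bank", "link", "meeting")
spam_docs = [["follow", "link"], ["bank", "link"]]

p_spam_words = estimate_p(spam_docs, vocabulary)
x = to_binary_vector(["bank", "link", "now"], vocabulary)
likelihood = bernoulli_likelihood(x, p_spam_words)
print(p_spam_words, x, likelihood)
```

In this toy data the estimates hit the extremes 0 and 1, so any document containing "meeting" would get a likelihood of exactly zero under the spam class. Practical implementations often smooth the counts to avoid this, but that step is not part of the formulas on the slide.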

slide-22
SLIDE 22

Evaluation

                      Legitimate   Spam
Classifier accepted   a            b
Classifier rejected   c            d

◮ b: accepted even though it was spam
◮ c: legitimate mail is classified as spam (very bad!)

$$\text{Recall} = \frac{a}{a + c} \qquad \text{Precision} = \frac{a}{a + b} \tag{12}$$
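The two measures in equation (12) can be computed directly from the four counts of the table. The counts below are invented for illustration. Note that legitimate mail is the positive class here, since a counts the accepted legitimate messages.

```python
def recall(a, c):
    """Share of legitimate messages that the classifier accepted."""
    return a / (a + c)

def precision(a, b):
    """Share of accepted messages that were actually legitimate."""
    return a / (a + b)

# Hypothetical counts: a, b = accepted legitimate/spam; c, d = rejected legitimate/spam.
a, b, c, d = 90, 5, 10, 95

r = recall(a, c)
p = precision(a, b)
print(r, round(p, 3))
```

With these counts, 10 of 100 legitimate messages are lost (recall 0.9) and 5 of the 95 accepted messages are spam (precision about 0.947).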

slide-23
SLIDE 23

Comparison to Logistic Classifier

Advantage

NBC requires less training data to be able to function properly.

Disadvantage

Logistic Classifier can reach a lower error rate when given enough data.

slide-24
SLIDE 24

Comparison to Logistic Classifier

Figure: dashed line is LC; solid line is NBC; Y-axis is error; X-axis is m (1000 random train splits)

slide-27
SLIDE 27

Thank you for listening!

And remember, what do we say to Nigerian princes who want to do business with you? :)

If you’re interested, listen to James Veitch’s talk about answering spam: https://www.ted.com/talks/james_veitch_this_is_what_happens_when_you_reply_to_spam_email