SLIDE 1
Spam Filtering with Naive Bayes Classifier
Yuriy Arabskyy
June 6, 2017

Table of contents: What is spam? · Different spam types · Anti-Spam Techniques · Probability theory basics · Conditional probability · Bayes Theorem · Naive Bayes Theorem · Spam filtering with NBC
SLIDE 2
SLIDE 3
What is spam?
Spam – the mass mailing of unsolicited messages over the internet, typically for advertising purposes.
SLIDE 4
What is spam?
HOT sing1e wOmen seeking your attention near this AREA!!!%%###!!! Just follow this link . . .
SLIDE 5
What is spam?
I am Wumi Abdul; the only Daughter of late Mr and Mrs George Abdul. My father was a very wealthy cocoa merchant in Abidjan, he was poisoned to death by his business associates. . . I seek for a foreign partner. Please provide a Bank account where this money would be transferred to.
SLIDE 6
Different spam types
Figure: Spam chart
SLIDE 7
Anti-Spam Techniques
◮ End-user techniques
  ◮ Discretion
  ◮ Address munging
  ◮ Ham passwords
SLIDE 8
Anti-Spam Techniques
◮ Mail server level filtering
  ◮ Realtime Blackhole Lists
  ◮ Spamtrapping
  ◮ SMTP callback verification
  ◮ Statistical spam filtering
SLIDE 9
Probability theory basics
Conditional Probability: Pr[X ∩ Y] = Pr[Y|X] · Pr[X]   (1)
SLIDE 10
Probability theory basics
Figure: Weather - conditional probability
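The figure's idea can be put in numbers with a small Python sketch; the weather probabilities below are invented for illustration, not taken from the slide:

```python
# eq. (1) with invented weather numbers:
# Pr[cloudy ∩ rain] = Pr[rain | cloudy] * Pr[cloudy]
p_cloudy = 0.40            # it is cloudy on 40% of mornings
p_rain_given_cloudy = 0.5  # half of cloudy mornings turn to rain

p_cloudy_and_rain = p_rain_given_cloudy * p_cloudy
print(p_cloudy_and_rain)  # 0.2
```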
SLIDE 11
Probability theory basics
Pr[X ∩ Y] = Pr[Y|X] · Pr[X] = Pr[X|Y] · Pr[Y]   (2)
SLIDE 12
Probability theory basics
Bayes Theorem: Pr[Y|X] = Pr[X|Y] · Pr[Y] / Pr[X]   (3)
SLIDE 13
Probability theory basics
Bayes Theorem is a way of updating what we believe about the world based on the evidence we observe.
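As a concrete illustration, here is a small Python sketch that applies eq. (3) to a spam question; every probability below is invented for the example:

```python
# Bayes Theorem (eq. 3) on invented spam statistics:
# Pr[spam | "free"] = Pr["free" | spam] * Pr[spam] / Pr["free"]

p_spam = 0.4              # prior belief: 40% of incoming mail is spam
p_free_given_spam = 0.25  # "free" appears in 25% of spam messages
p_free_given_ham = 0.02   # "free" appears in 2% of legitimate messages

# law of total probability gives the denominator Pr["free"]
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)

# posterior: the updated belief after observing the word "free"
p_spam_given_free = p_free_given_spam * p_spam / p_free
print(round(p_spam_given_free, 3))  # 0.893
```

Observing one spammy word moved the belief from 0.4 to roughly 0.89, which is exactly the "updating" the slide describes.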
SLIDE 14
Probability theory basics
Multiple variables

Pr[x1, x2, . . . , xn] = Pr[x1 | x2, x3, . . . , xn] · Pr[x2, x3, . . . , xn]   (4)
Pr[x2, x3, . . . , xn] = Pr[x2 | x3, x4, . . . , xn] · Pr[x3, x4, . . . , xn]   (5)
SLIDE 15
Probability theory basics
Assuming xi and xj are independent: Pr[xi | xj] = Pr[xi]   (6)
SLIDE 16
Probability theory basics
Assuming all xi are mutually independent, the chain rule above simplifies to:

Pr[x1, x2, . . . , xn] = Pr[x1] · Pr[x2] · . . . · Pr[xn]   (7)
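A quick sanity check of eqs. (6) and (7) in Python; the die example is an added illustration, not from the slides. For a fair die, the events "even" and "at most 4" happen to be independent, so their joint probability factorizes:

```python
from fractions import Fraction

# Fair six-sided die: "even" and "at most 4" are independent events,
# so Pr[x1 ∩ x2] = Pr[x1] * Pr[x2], as in eqs. (6)-(7).
outcomes = range(1, 7)
even = {o for o in outcomes if o % 2 == 0}   # {2, 4, 6}
at_most_4 = {o for o in outcomes if o <= 4}  # {1, 2, 3, 4}

def p(event):
    """Probability of an event under the uniform distribution on a die."""
    return Fraction(len(event), 6)

p_joint = p(even & at_most_4)  # Pr[even ∩ at_most_4] = 2/6 = 1/3
print(p_joint == p(even) * p(at_most_4))  # True: 1/3 == 1/2 * 2/3
```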
SLIDE 17
Spam filtering with NBC
Bayes theorem rewritten using the naive assumption:

Pr(c | x1, x2, . . . , xn) = Pr(c) · Pr(x1|c) · Pr(x2|c) · · · Pr(xn|c) / Pr(x1, x2, . . . , xn)   (8)
SLIDE 18
Spam filtering with NBC
Class of di = argmax_c Pr(c | di)   (9)
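Eqs. (8) and (9) can be sketched in a few lines of Python; the class priors and per-class word probabilities below are invented, and log-probabilities are used to avoid numerical underflow on long documents:

```python
import math

# invented class priors and per-class word probabilities
priors = {"spam": 0.4, "ham": 0.6}
word_probs = {
    "spam": {"free": 0.30, "meeting": 0.05},
    "ham":  {"free": 0.03, "meeting": 0.20},
}

def classify(words):
    """Return argmax_c Pr(c) * prod_i Pr(x_i | c), as in eq. (9).
    The denominator Pr(x1, ..., xn) of eq. (8) is the same for every
    class, so it can be dropped from the argmax."""
    scores = {}
    for c in priors:
        score = math.log(priors[c])              # log Pr(c)
        for w in words:
            score += math.log(word_probs[c][w])  # + log Pr(x_i | c)
        scores[c] = score
    return max(scores, key=scores.get)

print(classify(["free"]))     # spam
print(classify(["meeting"]))  # ham
```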
SLIDE 19
Spam filtering with NBC
Definition of terms:
◮ Vocabulary (V) is an ordered collection of words, i.e., V = (v1, v2, v3, . . . , vn), used to classify an email.
◮ Document (D) is an ordered collection of words used in a message, D = (w1, w2, w3, . . . , wn).
◮ The classifier is a machine that, when given a document D and a collection of parameters θ, deterministically returns the class of the document.
SLIDE 20
Spam filtering with NBC
Document representation
◮ A binary vector of length |V| is used to represent a document.
◮ xi = 1 indicates the presence and xi = 0 the absence of the word vi in the specified document.
SLIDE 21
Spam filtering with NBC
Bernoulli event model

Pr[xi | ck] = pki^xi · (1 − pki)^(1−xi)   (10)

pki is the probability of class ck generating the word vi and can be calculated as follows:

pki = (Σ_{d ∈ ck} isPresent(vi, d)) / (# of documents in ck)   (11)
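A minimal Python sketch of eqs. (10) and (11), using an invented three-document spam corpus:

```python
# toy corpus: each document is the set of vocabulary words it contains
spam_docs = [{"free", "winner"}, {"free", "click"}, {"winner"}]
vocab = ["free", "winner", "click"]

# eq. (11): p_ki = (# of spam documents containing v_i) / (# of spam documents)
p_k = {v: sum(v in d for d in spam_docs) / len(spam_docs) for v in vocab}

def bernoulli_likelihood(doc, p):
    """eq. (10) multiplied over the whole vocabulary: a present word
    contributes p_ki, an absent word contributes (1 - p_ki)."""
    result = 1.0
    for v in vocab:
        x = 1 if v in doc else 0
        result *= p[v] ** x * (1 - p[v]) ** (1 - x)
    return result

print(p_k["free"])                           # 2/3 of spam contains "free"
print(bernoulli_likelihood({"free"}, p_k))   # (2/3) * (1 - 2/3) * (1 - 1/3)
```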
SLIDE 22
Evaluation
                      Legitimate   Spam
Classifier accepted       a          b
Classifier rejected       c          d

◮ b – accepted even though it was spam
◮ c – legitimate mail is classified as spam (very bad!)

Recall = a / (a + c)   Precision = a / (a + b)   (12)
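In Python, with invented counts for a, b, c, d:

```python
# confusion-matrix counts (invented for illustration):
# a = legitimate & accepted, b = spam & accepted,
# c = legitimate & rejected, d = spam & rejected
a, b, c, d = 90, 5, 10, 95

recall = a / (a + c)     # share of legitimate mail that was accepted
precision = a / (a + b)  # share of accepted mail that was legitimate

print(recall)     # 0.9
print(precision)  # 0.9473684210526315
```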
SLIDE 23
Comparison to Logistic Classifier
Advantage
NBC requires less training data to reach useful accuracy.
Disadvantage
Logistic Classifier can reach a lower error rate when given enough data.
SLIDE 24
Comparison to Logistic Classifier
Figure: Dashed – LC; Solid – NBC; Y-axis – error; X-axis – m (1000 random train splits)
SLIDE 25
Thank you for listening!
SLIDE 26
And remember, what do we say to Nigerian princes who want to do business with you? :)
SLIDE 27