 
Spam Filtering with Naive Bayes Classifier
Yuriy Arabskyy
June 6, 2017
Table of contents
◮ What is spam?
◮ Different spam types
◮ Anti-Spam Techniques
◮ Probability theory basics
◮ Conditional probability
◮ Bayes Theorem
◮ Naive Bayes Theorem
◮ Spam filtering with Naive Bayes Classifier (NBC)
◮ Definition of terms
◮ Feature representation
◮ Evaluation
◮ Comparison to Logistic Classifier (LC)
What is spam?
Spam: unsolicited mass-mailing of a message over the internet, usually for the purposes of advertising.

Examples:
◮ "HOT sing1e wOmen seeking your attention near this AREA!!!%%###!!! Just follow this link . . ."
◮ "I am Wumi Abdul; the only Daughter of late Mr and Mrs George Abdul. My father was a very wealthy cocoa merchant in Abidjan, he was poisoned to death by his business associates. . . I seek for a foreign partner. Please provide a Bank account where this money would be transferred to."
Different spam types
Figure: Spam chart
Anti-Spam Techniques
◮ End-user techniques
  ◮ Discretion
  ◮ Address munging
  ◮ Ham passwords
◮ Mail server level filtering
  ◮ Realtime Blackhole Lists
  ◮ Spamtrapping
  ◮ SMTP callback verification
  ◮ Statistical spam filtering
Probability theory basics
Conditional Probability:
\[ \Pr[X \cap Y] = \Pr[Y \mid X] \cdot \Pr[X] \tag{1} \]
Figure: Weather - conditional probability
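Equation (1) can be checked on a small weather-style example. The sketch below uses made-up day counts (they are not taken from the slides), with X = "cloudy" and Y = "rain" chosen purely for illustration.

```python
from fractions import Fraction

# Hypothetical counts over 100 days (illustrative only, not from the slides).
days = 100
cloudy = 40            # days on which X = "cloudy" held
cloudy_and_rain = 10   # days on which both X and Y = "rain" held

pr_x = Fraction(cloudy, days)                     # Pr[X]
pr_y_given_x = Fraction(cloudy_and_rain, cloudy)  # Pr[Y | X]
pr_x_and_y = pr_y_given_x * pr_x                  # equation (1)

print(pr_x_and_y)  # 1/10
```

The product matches the directly counted joint frequency, 10 days out of 100.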
Probability theory basics
\[ \Pr[X \cap Y] = \Pr[Y \mid X] \cdot \Pr[X] = \Pr[X \mid Y] \cdot \Pr[Y] \tag{2} \]
Bayes Theorem:
\[ \Pr[Y \mid X] = \frac{\Pr[X \mid Y] \cdot \Pr[Y]}{\Pr[X]} \tag{3} \]
Bayes Theorem is a way of updating what we believe about the world, based on what we observe about it.
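A small numeric sketch of equation (3), with Y = "message is spam" and X = "message contains a given word". The three input probabilities are invented for illustration, and `bayes_posterior` is a hypothetical helper name. Pr[X] is expanded with the law of total probability, which the slides do not state explicitly.

```python
from fractions import Fraction

def bayes_posterior(pr_x_given_y, pr_y, pr_x_given_not_y):
    """Return Pr[Y | X] via equation (3).

    Pr[X] is obtained from the law of total probability:
    Pr[X] = Pr[X | Y] * Pr[Y] + Pr[X | not Y] * Pr[not Y].
    """
    pr_x = pr_x_given_y * pr_y + pr_x_given_not_y * (1 - pr_y)
    return pr_x_given_y * pr_y / pr_x

# Hypothetical inputs: 40% of mail is spam; the word appears in half of
# all spam messages and in one tenth of all legitimate messages.
posterior = bayes_posterior(Fraction(1, 2), Fraction(2, 5), Fraction(1, 10))
print(posterior)  # 10/13
```

Seeing the word raises the spam probability from the prior 2/5 to the posterior 10/13.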
Probability theory basics
Multiple variables
\[ \Pr[x_1, x_2, \dots, x_n] = \Pr[x_1 \mid x_2, x_3, \dots, x_n] \cdot \Pr[x_2, x_3, \dots, x_n] \tag{4} \]
\[ \Pr[x_2, x_3, \dots, x_n] = \Pr[x_2 \mid x_3, x_4, \dots, x_n] \cdot \Pr[x_3, x_4, \dots, x_n] \tag{5} \]
Assuming x_i and x_j are independent for all i ≠ j:
\[ \Pr[x_i \mid x_j] = \Pr[x_i] \tag{6} \]
Applying the factorization in (4) and (5) repeatedly and dropping every condition by independence, the joint probability simplifies to:
\[ \Pr[x_1, x_2, \dots, x_n] = \Pr[x_1] \cdot \Pr[x_2] \cdots \Pr[x_n] \tag{7} \]
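The step from (4) and (5) to (7) can be written out for n = 3. This is a worked sketch, not part of the original slides. Note that the last line needs x_1 to be independent of the pair (x_2, x_3) jointly, which is slightly more than the pairwise statement in (6).

```latex
\begin{align*}
\Pr[x_1, x_2, x_3]
  &= \Pr[x_1 \mid x_2, x_3] \cdot \Pr[x_2, x_3]
     && \text{by (4)} \\
  &= \Pr[x_1 \mid x_2, x_3] \cdot \Pr[x_2 \mid x_3] \cdot \Pr[x_3]
     && \text{by (5)} \\
  &= \Pr[x_1] \cdot \Pr[x_2] \cdot \Pr[x_3]
     && \text{by independence, as in (6)}
\end{align*}
```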
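Spam filtering with NBC
Bayes theorem rewritten using the naive assumption (the features x_1, ..., x_n are treated as independent of each other given the class c):
\[ \Pr(c \mid x_1, x_2, \dots, x_n) = \frac{\Pr(c)\,\Pr(x_1 \mid c)\,\Pr(x_2 \mid c) \cdots \Pr(x_n \mid c)}{\Pr(x_1, x_2, \dots, x_n)} \tag{8} \]
\[ \text{Class of } d_i = \operatorname*{argmax}_{c} \Pr(c \mid d_i) \tag{9} \]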
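The argmax in (9) can be sketched as follows. Because the denominator of (8) is the same for every class, it can be dropped when comparing classes. The sketch also sums logarithms instead of multiplying probabilities, a common way to avoid numeric underflow that the slides do not mention. The function name `classify`, the class labels, and all probability values are hypothetical, and every probability is assumed to be strictly positive.

```python
import math

def classify(priors, cond_probs):
    """Return argmax_c Pr(c) * prod_i Pr(x_i | c), computed in log space.

    priors:     {class: Pr(c)}
    cond_probs: {class: [Pr(x_1 | c), ..., Pr(x_n | c)]} for the observed x_i
    All probabilities must be > 0, otherwise math.log raises ValueError.
    """
    def log_score(c):
        return math.log(priors[c]) + sum(math.log(p) for p in cond_probs[c])

    return max(priors, key=log_score)

# Hypothetical numbers for a document with two observed features.
priors = {"spam": 0.4, "ham": 0.6}
cond_probs = {"spam": [0.5, 0.2], "ham": [0.1, 0.3]}

print(classify(priors, cond_probs))  # spam
```

Here the unnormalized scores are 0.4 · 0.5 · 0.2 = 0.04 for spam and 0.6 · 0.1 · 0.3 = 0.018 for ham, so spam wins.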
Spam filtering with NBC
Definition of terms:
◮ Vocabulary (V) is an ordered collection of words, i.e., V = (v_1, v_2, v_3, ..., v_n), used to classify an email.
◮ Document (D) is an ordered collection of the words used in a message, D = (w_1, w_2, w_3, ..., w_m).
◮ The classifier is a machine that, when given a document D and a collection of parameters θ, deterministically returns the class of the document.
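A vocabulary and documents in the sense defined above could be represented as tuples of words. The sketch below is one possible construction: the whitespace tokenizer, the lowercasing, and the choice to build V from all words seen in the given documents are assumptions, and `tokenize` and `build_vocabulary` are hypothetical names.

```python
def tokenize(message):
    """Turn a message into a document D: an ordered tuple of lowercase words.

    Splitting on whitespace is a deliberately crude choice for illustration.
    """
    return tuple(message.lower().split())

def build_vocabulary(documents):
    """Return V: the distinct words of all documents, in sorted order."""
    return tuple(sorted({word for doc in documents for word in doc}))

docs = [tokenize("Follow this link"), tokenize("follow the money")]
vocabulary = build_vocabulary(docs)

print(docs[0])     # ('follow', 'this', 'link')
print(vocabulary)  # ('follow', 'link', 'money', 'the', 'this')
```

Sorting gives the vocabulary a fixed order, so position i always refers to the same word v_i.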
Spam filtering with NBC
Document representation
◮ Binary vector of length |V| is used to represent a document.
◮ x_i = 1 means the word v_i is present in the specified document; x_i = 0 means it is absent.

Bernoulli event model
\[ \Pr[x_i \mid c_k] = p_{ki}^{x_i} \cdot (1 - p_{ki})^{1 - x_i} \tag{10} \]
p_{ki} is the probability of class c_k generating the word v_i and can be calculated as follows:
\[ p_{ki} = \frac{\sum_{d \in c_k} \operatorname{isPresent}(v_i, d)}{\#\text{ of documents in } c_k} \tag{11} \]
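The binary representation and equations (10) and (11) can be sketched in a few lines. The function names, the toy vocabulary, and the toy spam documents are all hypothetical. The sketch applies (11) literally, without smoothing, so a word that never occurs in a class gets p = 0. The per-word factors from (10) are multiplied together, which relies on the naive independence assumption.

```python
def to_binary_vector(vocabulary, document):
    """x_i = 1 if vocabulary word v_i occurs in the document, else 0."""
    words = set(document)
    return [1 if v in words else 0 for v in vocabulary]

def estimate_p(vocabulary, class_documents):
    """Equation (11): fraction of the class's documents containing each v_i."""
    vectors = [to_binary_vector(vocabulary, d) for d in class_documents]
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def bernoulli_likelihood(x, p):
    """Product over i of equation (10): p_i if x_i = 1, else (1 - p_i)."""
    result = 1.0
    for x_i, p_i in zip(x, p):
        result *= p_i if x_i else 1.0 - p_i
    return result

# Toy data, invented for illustration.
vocabulary = ("link", "money", "meeting")
spam_docs = [("follow", "link"), ("money", "link"), ("money",), ("link",)]

p_spam = estimate_p(vocabulary, spam_docs)
print(p_spam)  # [0.75, 0.5, 0.0]

x = to_binary_vector(vocabulary, ("send", "money"))
print(x)                                # [0, 1, 0]
print(bernoulli_likelihood(x, p_spam))  # 0.125
```

The likelihood is (1 − 0.75) · 0.5 · (1 − 0.0) = 0.125: absent words contribute a factor too, which is what distinguishes the Bernoulli model from simply multiplying the probabilities of the words that occur.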
Evaluation

                       Legitimate   Spam
Classifier accepted        a          b
Classifier rejected        c          d

◮ b: accepted even though it was spam
◮ c: legitimate mail is classified as spam (very bad!)

\[ \text{Recall} = \frac{a}{a + c} \qquad \text{Precision} = \frac{a}{a + b} \tag{12} \]
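Equation (12) translates directly into code. The counts below are hypothetical. As in the table, a, b, and c are the numbers of accepted legitimate mails, accepted spam mails, and rejected legitimate mails.

```python
def recall(a, c):
    """Share of all legitimate mail that the classifier accepted."""
    return a / (a + c)

def precision(a, b):
    """Share of all accepted mail that was actually legitimate."""
    return a / (a + b)

# Hypothetical counts: 90 legitimate accepted, 10 spam accepted,
# 30 legitimate rejected.
a, b, c = 90, 10, 30

print(recall(a, c))     # 0.75
print(precision(a, b))  # 0.9
```

Both measures here treat legitimate mail as the class of interest, so a low recall corresponds to many cases of type c, the error the slide marks as very bad.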
Comparison to Logistic Classifier
◮ Advantage: NBC requires less training data to be able to function properly.
◮ Disadvantage: Logistic Classifier can reach a lower error rate when given enough data.
Comparison to Logistic Classifier
Figure: Dashed: LC; Solid: NBC; Y-axis: error; X-axis: m (1000 random train splits)
Thank you for listening!
And remember, what do we say to Nigerian princes who want to do business with you? :)
If you're interested, listen to James Veitch's talk about answering spam: https://www.ted.com/talks/james_veitch_this_is_what_happens_when_you_reply_to_spam_email