SLIDE 1
Spam Filtering with Naive Bayes Classifier
Yuriy Arabskyy
June 6, 2017

Table of contents: What is spam? · Different spam types · Anti-Spam Techniques · Probability theory basics · Conditional probability · Bayes Theorem · Naive Bayes Theorem · Spam filtering with NBC
SLIDE 2
SLIDE 3
What is spam?
Spam – the mass mailing of unsolicited messages over the internet, typically for advertising purposes.
SLIDE 4
What is spam?
HOT sing1e wOmen seeking your attention near this AREA!!!%%###!!! Just follow this link . . .
SLIDE 5
What is spam?
I am Wumi Abdul; the only Daughter of late Mr and Mrs George Abdul. My father was a very wealthy cocoa merchant in Abidjan, he was poisoned to death by his business associates. . . I seek for a foreign partner. Please provide a Bank account where this money would be transferred to.
SLIDE 6
Different spam types
Figure: Spam chart
SLIDE 7
Anti-Spam Techniques
◮ End-user techniques
  ◮ Discretion
  ◮ Address munging
  ◮ Ham passwords
SLIDE 8
Anti-Spam Techniques
◮ Mail server level filtering
  ◮ Realtime Blackhole Lists
  ◮ Spamtrapping
  ◮ SMTP callback verification
  ◮ Statistical spam filtering
SLIDE 9
Probability theory basics
Conditional Probability: Pr[X ∩ Y] = Pr[Y|X] · Pr[X]   (1)
SLIDE 10
Probability theory basics
Figure: Weather - conditional probability
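The figure's idea can be put in numbers with a small Python sketch; the weather probabilities below are invented for illustration, not taken from the slide:

```python
# eq. (1) with invented weather numbers:
# Pr[cloudy ∩ rain] = Pr[rain | cloudy] * Pr[cloudy]
p_cloudy = 0.40            # it is cloudy on 40% of mornings
p_rain_given_cloudy = 0.5  # half of cloudy mornings turn to rain

p_cloudy_and_rain = p_rain_given_cloudy * p_cloudy
print(p_cloudy_and_rain)  # 0.2
```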
SLIDE 11
Probability theory basics
Pr[X ∩ Y] = Pr[Y|X] · Pr[X] = Pr[X|Y] · Pr[Y]   (2)
SLIDE 12
Probability theory basics
Bayes Theorem: Pr[Y|X] = Pr[X|Y] · Pr[Y] / Pr[X]   (3)
SLIDE 13
Probability theory basics
Bayes Theorem is a way of updating what we believe about the world based on the evidence we observe.
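As a concrete illustration, here is a small Python sketch that applies eq. (3) to a spam question; every probability below is invented for the example:

```python
# Bayes Theorem (eq. 3) on invented spam statistics:
# Pr[spam | "free"] = Pr["free" | spam] * Pr[spam] / Pr["free"]

p_spam = 0.4              # prior belief: 40% of incoming mail is spam
p_free_given_spam = 0.25  # "free" appears in 25% of spam messages
p_free_given_ham = 0.02   # "free" appears in 2% of legitimate messages

# law of total probability gives the denominator Pr["free"]
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)

# posterior: the updated belief after observing the word "free"
p_spam_given_free = p_free_given_spam * p_spam / p_free
print(round(p_spam_given_free, 3))  # 0.893
```

Observing one spammy word moved the belief from 0.4 to roughly 0.89, which is exactly the "updating" the slide describes.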
SLIDE 14
Probability theory basics
Multiple variables

Pr[x1, x2, . . . , xn] = Pr[x1 | x2, x3, . . . , xn] · Pr[x2, x3, . . . , xn]   (4)
Pr[x2, x3, . . . , xn] = Pr[x2 | x3, x4, . . . , xn] · Pr[x3, x4, . . . , xn]   (5)
SLIDE 15
Probability theory basics
Assuming xi and xj are independent: Pr[xi | xj] = Pr[xi]   (6)
SLIDE 16
Probability theory basics
Assuming all xi are mutually independent, the chain rule above simplifies to:

Pr[x1, x2, . . . , xn] = Pr[x1] · Pr[x2] · . . . · Pr[xn]   (7)
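A quick sanity check of eqs. (6) and (7) in Python; the die example is an added illustration, not from the slides. For a fair die, the events "even" and "at most 4" happen to be independent, so their joint probability factorizes:

```python
from fractions import Fraction

# Fair six-sided die: "even" and "at most 4" are independent events,
# so Pr[x1 ∩ x2] = Pr[x1] * Pr[x2], as in eqs. (6)-(7).
outcomes = range(1, 7)
even = {o for o in outcomes if o % 2 == 0}   # {2, 4, 6}
at_most_4 = {o for o in outcomes if o <= 4}  # {1, 2, 3, 4}

def p(event):
    """Probability of an event under the uniform distribution on a die."""
    return Fraction(len(event), 6)

p_joint = p(even & at_most_4)  # Pr[even ∩ at_most_4] = 2/6 = 1/3
print(p_joint == p(even) * p(at_most_4))  # True: 1/3 == 1/2 * 2/3
```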
SLIDE 17
Spam filtering with NBC
Bayes theorem rewritten using the naive assumption:

Pr(c | x1, x2, . . . , xn) = Pr(c) · Pr(x1|c) · Pr(x2|c) · · · Pr(xn|c) / Pr(x1, x2, . . . , xn)   (8)
SLIDE 18
Spam filtering with NBC
Class of di = argmax_c Pr(c | di)   (9)
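Eqs. (8) and (9) can be sketched in a few lines of Python; the class priors and per-class word probabilities below are invented, and log-probabilities are used to avoid numerical underflow on long documents:

```python
import math

# invented class priors and per-class word probabilities
priors = {"spam": 0.4, "ham": 0.6}
word_probs = {
    "spam": {"free": 0.30, "meeting": 0.05},
    "ham":  {"free": 0.03, "meeting": 0.20},
}

def classify(words):
    """Return argmax_c Pr(c) * prod_i Pr(x_i | c), as in eq. (9).
    The denominator Pr(x1, ..., xn) of eq. (8) is the same for every
    class, so it can be dropped from the argmax."""
    scores = {}
    for c in priors:
        score = math.log(priors[c])              # log Pr(c)
        for w in words:
            score += math.log(word_probs[c][w])  # + log Pr(x_i | c)
        scores[c] = score
    return max(scores, key=scores.get)

print(classify(["free"]))     # spam
print(classify(["meeting"]))  # ham
```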
SLIDE 19
Spam filtering with NBC
Definition of terms:
◮ Vocabulary (V) is an ordered collection of words, i.e., V = (v1, v2, v3, . . . , vn), used to classify an email.
◮ Document (D) is an ordered collection of words used in a message, D = (w1, w2, w3, . . . , wn).
◮ The classifier is a machine that, when given a document D and a collection of parameters θ, deterministically returns the class of the document.
SLIDE 20
Spam filtering with NBC
Document representation
◮ A binary vector of length |V| is used to represent a document.
◮ xi = 1 indicates the presence and xi = 0 the absence of the word vi in the specified document.
SLIDE 21
Spam filtering with NBC
Bernoulli event model

Pr[xi | ck] = pki^xi · (1 − pki)^(1−xi)   (10)

pki is the probability of class ck generating the word vi and can be calculated as follows:

pki = (Σ_{d ∈ ck} isPresent(vi, d)) / (# of documents in ck)   (11)
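A minimal Python sketch of eqs. (10) and (11), using an invented three-document spam corpus:

```python
# toy corpus: each document is the set of vocabulary words it contains
spam_docs = [{"free", "winner"}, {"free", "click"}, {"winner"}]
vocab = ["free", "winner", "click"]

# eq. (11): p_ki = (# of spam documents containing v_i) / (# of spam documents)
p_k = {v: sum(v in d for d in spam_docs) / len(spam_docs) for v in vocab}

def bernoulli_likelihood(doc, p):
    """eq. (10) multiplied over the whole vocabulary: a present word
    contributes p_ki, an absent word contributes (1 - p_ki)."""
    result = 1.0
    for v in vocab:
        x = 1 if v in doc else 0
        result *= p[v] ** x * (1 - p[v]) ** (1 - x)
    return result

print(p_k["free"])                           # 2/3 of spam contains "free"
print(bernoulli_likelihood({"free"}, p_k))   # (2/3) * (1 - 2/3) * (1 - 1/3)
```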
SLIDE 22
Evaluation
                      Legitimate   Spam
Classifier accepted       a          b
Classifier rejected       c          d

◮ b – accepted even though it was spam
◮ c – legitimate mail is classified as spam (very bad!)

Recall = a / (a + c)   Precision = a / (a + b)   (12)
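In Python, with invented counts for a, b, c, d:

```python
# confusion-matrix counts (invented for illustration):
# a = legitimate & accepted, b = spam & accepted,
# c = legitimate & rejected, d = spam & rejected
a, b, c, d = 90, 5, 10, 95

recall = a / (a + c)     # share of legitimate mail that was accepted
precision = a / (a + b)  # share of accepted mail that was legitimate

print(recall)     # 0.9
print(precision)  # 0.9473684210526315
```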
SLIDE 23
Comparison to Logistic Classifier
Advantage
NBC requires less training data to reach useful accuracy.
Disadvantage
Logistic Classifier can reach a lower error rate when given enough data.
SLIDE 24
Comparison to Logistic Classifier
Figure: Dashed – LC; Solid – NBC; Y-axis – error; X-axis – m (1000 random train splits)
SLIDE 25
Thank you for listening!
SLIDE 26
And remember, what do we say to Nigerian princes who want to do business with you? :)
SLIDE 27