Machine Learning in Spam Filtering

A Crash Course in ML

Konstantin Tretyakov

kt@ut.ee

Institute of Computer Science, University of Tartu


Overview

Spam is Evil
ML for Spam Filtering: General Idea, Problems
Some Algorithms:
  Naïve Bayesian Classifier
  k-Nearest Neighbors Classifier
  The Perceptron
  Support Vector Machine
Algorithms in Practice: Measures and Numbers
Improvement Ideas: Striving for the Ideal Filter


Spam is Evil

It is cheap to send, but expensive to receive:
  Large amounts of bulk traffic between servers
  Dial-up users spend bandwidth to download it
  People spend time sorting through unwanted mail
  Important e-mails may get deleted by mistake
  Pornographic spam is not meant for everyone


Eliminating Spam

Social and political solutions:
  Never send spam
  Never respond to spam
  Put all spammers in jail
Technical solutions:
  Block the spammer's IP address
  Require authorization for sending e-mails (?)
  Mail filtering:
    Knowledge engineering (KE)
    Machine learning (ML)


Knowledge Engineering

Create a set of classification rules by hand, e.g. "if the Subject of a message contains the text BUY NOW, then the message is spam" (procmail, "Message Rules" in Outlook, etc.).
The set of rules is difficult to maintain.
Possible solution: maintain it in a centralized manner.
But then spammers have access to the rules.
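For illustration, here is the same kind of hand-written rule set expressed in Python. This is a sketch of mine, not from the slides; the rule strings are made up, and real systems such as procmail use their own recipe syntax.

    # Hand-maintained rule list in the spirit of procmail recipes or Outlook "Message Rules".
    # The rule strings are illustrative, not an actual rule database.
    SPAM_RULES = ["BUY NOW", "FREE OFFER", "WIN A PRIZE"]

    def is_spam(subject: str) -> bool:
        """Flag a message as spam if its Subject matches any hand-written rule."""
        return any(rule in subject.upper() for rule in SPAM_RULES)

    print(is_spam("Buy now and save!"))   # True
    print(is_spam("Meeting tomorrow"))    # False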


Machine Learning

Classification rules are derived from a set of training samples.

For example, training samples:

Subject: "BUY NOW"    -> SPAM
Subject: "BUY IT"     -> SPAM
Subject: "A GOOD BUY" -> SPAM
Subject: "A GOOD BOY" -> LEGITIMATE
Subject: "A GOOD DAY" -> LEGITIMATE

Derived rule: Subject contains "BUY" -> SPAM


Machine Learning

A training set is required, and it has to be updated regularly.
It is hard to guarantee that no misclassifications occur.
On the other hand, there is no need to manage and understand the rules by hand.


Machine Learning

Training set:

{(m₁, c₁), (m₂, c₂), . . . , (mₙ, cₙ)}

where mᵢ ∈ M are training messages and cᵢ ∈ {S, L} is the class assigned to each message. Using the training set, we construct a classification function

f : M → {S, L}

and use this function afterwards to classify (unseen) messages.


ML for Spam: Problem 1


Problem: we classify text, but most classification algorithms either
  require numerical data (ℝⁿ),
  require a distance metric between objects, or
  require a scalar product.

Solution: use a feature extractor φ to convert messages to vectors:

φ : M → ℝⁿ
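A common choice of φ is a binary bag-of-words vector. A minimal sketch follows; the vocabulary and the naive whitespace tokenization are illustrative assumptions of mine, not the slides':

    import numpy as np

    # Hypothetical fixed vocabulary; in practice it would be built from the training corpus.
    VOCABULARY = ["buy", "now", "free", "meeting", "report"]

    def phi(message: str) -> np.ndarray:
        """Map a message to a binary feature vector in R^n, n = len(VOCABULARY)."""
        words = set(message.lower().split())
        return np.array([1.0 if w in words else 0.0 for w in VOCABULARY])

    print(phi("Buy now free offer"))  # [1. 1. 1. 0. 0.]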


ML for Spam: Problem 2

Problem: a spam filter may not make mistakes.
  False positive: a legitimate mail classified as spam.
  False negative: a spam classified as legitimate mail.
False negatives are OK; false positives are very bad.
Solution: ?


Algorithms: Naive Bayes

The Bayes’ rule:

P(c | x) = P(x | c)P(c) / P(x) = P(x | c)P(c) / (P(x | S)P(S) + P(x | L)P(L))


Classification rule:

P(S | x) > P(L | x) ⇒ SPAM


Algorithms: Naive Bayes

The Bayesian classifier is optimal, i.e. its average error rate is minimal over all possible classifiers. The problem is that we can never know the exact probabilities in practice.


Algorithms: Naive Bayes


How to calculate P(x | c)? It is simple if the feature vector is simple: let the feature vector consist of a single binary attribute x_w, with x_w = 1 if a certain word w is present in the message and x_w = 0 otherwise.

We may use more complex feature vectors if we assume that the presence of one word does not influence the probability of the presence of other words, i.e.

P(x_w, x_v | c) = P(x_w | c) P(x_v | c)
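A minimal sketch of the resulting classifier, with the word-presence probabilities estimated from training counts. The Laplace smoothing is my addition (to avoid zero probabilities); the toy samples are illustrative.

    import math

    def train_naive_bayes(samples):
        """samples: list of (set_of_words, cls) pairs with cls in {"S", "L"}."""
        counts = {"S": {}, "L": {}}   # per-class word -> document frequency
        totals = {"S": 0, "L": 0}     # per-class message counts
        vocab = set()
        for words, cls in samples:
            totals[cls] += 1
            vocab |= words
            for w in words:
                counts[cls][w] = counts[cls].get(w, 0) + 1
        n = totals["S"] + totals["L"]
        priors = {c: totals[c] / n for c in ("S", "L")}

        def classify(words):
            # Compare log P(c) + sum over w of log P(x_w | c); smoothing avoids log(0).
            best_cls, best_score = None, float("-inf")
            for c in ("S", "L"):
                score = math.log(priors[c])
                for w in vocab:
                    p = (counts[c].get(w, 0) + 1) / (totals[c] + 2)  # P(x_w = 1 | c)
                    score += math.log(p if w in words else 1.0 - p)
                if score > best_score:
                    best_cls, best_score = c, score
            return best_cls

        return classify

    f = train_naive_bayes([({"buy", "now"}, "S"), ({"good", "buy"}, "S"),
                           ({"good", "day"}, "L"), ({"good", "boy"}, "L")])
    print(f({"buy", "now"}))  # S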


Algorithms: k-NN

Suppose we have a distance metric d defined for messages. To determine the class of a certain message m, we find its k nearest neighbors in the training set. If there are more spam messages among the neighbors, classify m as spam, otherwise as legitimate mail.
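A minimal sketch of this procedure, assuming messages are already mapped to feature vectors and taking Euclidean distance as the metric d (both assumptions mine):

    import numpy as np

    def knn_classify(x, train_X, train_y, k=3):
        """Classify x by majority vote among its k nearest training vectors."""
        dists = np.linalg.norm(train_X - x, axis=1)   # distance to every training sample
        nearest = np.argsort(dists)[:k]               # indices of the k closest samples
        spam_votes = sum(1 for i in nearest if train_y[i] == "S")
        return "S" if spam_votes > k - spam_votes else "L"

    train_X = np.array([[1, 1, 0], [1, 0, 1], [0, 0, 1], [0, 1, 0]], dtype=float)
    train_y = ["S", "S", "L", "L"]
    print(knn_classify(np.array([1.0, 1.0, 1.0]), train_X, train_y, k=3))  # S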


Algorithms: k-NN

k-NN is one of the few universally consistent classification rules.

Theorem (Stone): as the size n of the training set goes to infinity, if k → ∞ and k/n → 0, then the average error of the k-NN classifier approaches its minimal possible value.


Algorithms: The Perceptron

The idea is to find a linear function of the feature vector, f(x) = wᵀx + b, such that f(x) > 0 for vectors of one class and f(x) < 0 for vectors of the other class.

w = (w₁, w₂, . . . , wₘ) is the vector of coefficients (weights) of the function, and b is the so-called bias. If we denote the classes by the numbers +1 and −1, we can state that we search for a decision function

d(x) = sign(wᵀx + b)


Algorithms: The Perceptron


Start with arbitrarily chosen parameters (w₀, b₀) and update them iteratively. On the n-th iteration of the algorithm, choose a training sample (x, c) such that the current decision function classifies it incorrectly (i.e. sign(wₙᵀx + bₙ) ≠ c), and update the parameters (wₙ, bₙ) using the rule:

wₙ₊₁ = wₙ + cx
bₙ₊₁ = bₙ + c

The procedure is guaranteed to terminate after finitely many updates, provided the training samples are linearly separable.
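A minimal sketch of this training loop, assuming numeric feature vectors; the iteration cap and the toy data are my additions (the cap guards against non-separable data):

    import numpy as np

    def train_perceptron(X, y, max_iters=1000):
        """X: (n_samples, n_features) array; y: labels +1 / -1."""
        w = np.zeros(X.shape[1])
        b = 0.0
        for _ in range(max_iters):
            misclassified = [(x, c) for x, c in zip(X, y) if np.sign(w @ x + b) != c]
            if not misclassified:        # converged: every sample classified correctly
                return w, b
            x, c = misclassified[0]
            w = w + c * x                # w_{n+1} = w_n + c x
            b = b + c                    # b_{n+1} = b_n + c
        return w, b                      # give up: data may not be linearly separable

    X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -1.0], [-2.0, 0.0]])
    y = np.array([1, 1, -1, -1])
    w, b = train_perceptron(X, y)
    print(np.sign(X @ w + b))  # matches y: [ 1.  1. -1. -1.]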


Algorithms: The Perceptron

Fast and simple. Easy to implement. Requires linearly separable data.


Algorithms: SVM

The same idea as in the case of the Perceptron: find a separating hyperplane

wᵀx + b = 0

This time we are not interested in just any separating hyperplane, but in the maximal-margin separating hyperplane.


Algorithms: SVM

[Figure: the maximal margin separating hyperplane]


Algorithms: SVM


Finding the optimal hyperplane requires minimizing a quadratic function over a convex domain, a task known as quadratic programming. Statistical Learning Theory (V. Vapnik) guarantees good generalization for SVMs. There are many further options for SVMs (soft-margin classification, nonlinear kernels, regression). SVMs are currently among the most widely used ML classification techniques.
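In practice the quadratic programme is rarely solved by hand. A minimal sketch using scikit-learn's SVC; the library choice and the toy data are my assumptions, not the slides':

    import numpy as np
    from sklearn.svm import SVC

    # Toy feature vectors (e.g. binary bag-of-words); labels: 1 = spam, -1 = legitimate.
    X = np.array([[1, 1, 0], [1, 0, 1], [0, 0, 1], [0, 1, 0]], dtype=float)
    y = np.array([1, 1, -1, -1])

    # A linear, soft-margin SVM; C controls the margin/error trade-off.
    clf = SVC(kernel="linear", C=1.0)
    clf.fit(X, y)
    print(clf.predict([[1.0, 1.0, 1.0]]))  # [1] -> spam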


Practice: Measures

Denote by N_{S→L} the number of false negatives and by N_{L→S} the number of false positives. The quantities of interest are then the error rate and precision

E = (N_{S→L} + N_{L→S}) / N,   P = 1 − E

and the legitimate mail fallout and spam fallout

F_L = N_{L→S} / N_L,   F_S = N_{S→L} / N_S

(here N is the total number of messages, N_L the number of legitimate messages, and N_S the number of spam messages). Note that the error rate and precision must be considered relative to the case of no classifier at all.
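A minimal sketch computing these measures from raw counts. The example call uses the k-NN numbers from the next slide, assuming the PU1 corpus sizes of 618 legitimate and 481 spam messages (an assumption of mine; it does reproduce the table's percentages):

    def filter_measures(n_s_to_l, n_l_to_s, n_spam, n_legit):
        """Error rate E, precision P, legitimate-mail fallout F_L, spam fallout F_S."""
        n = n_spam + n_legit
        error = (n_s_to_l + n_l_to_s) / n     # E = (N_{S->L} + N_{L->S}) / N
        precision = 1.0 - error               # P = 1 - E
        legit_fallout = n_l_to_s / n_legit    # F_L = N_{L->S} / N_L
        spam_fallout = n_s_to_l / n_spam      # F_S = N_{S->L} / N_S
        return error, precision, legit_fallout, spam_fallout

    # k-NN row of the results table: 68 false positives, 33 false negatives.
    print(filter_measures(33, 68, 481, 618))  # ~ (0.092, 0.908, 0.110, 0.069)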


Practice: Numbers

Algorithm      N_{L→S}  N_{S→L}  P      F_L     F_S
Naïve Bayes    0        138      87.4%  0.0%    28.7%
k-NN           68       33       90.8%  11.0%   6.9%
Perceptron     8        8        98.5%  1.3%    1.7%
SVM            10       11       98.1%  1.6%    2.3%

Results of 10-fold cross-validation on PU1 spam corpus


Eliminating False Positives

Algorithm         N_{L→S}  N_{S→L}  P      F_L    F_S
Naïve Bayes       0        140      87.3%  0.0%   29.1%
l/k-NN            0        337      69.3%  0.0%   70.0%
SVM, soft margin  0        101      90.8%  0.0%   21.0%

Results after tuning the parameters to eliminate false positives


Combining Classifiers

If we have two different classifiers f and g that have a low probability of false positives, we may combine them to get a classifier with higher precision: classify message m as spam if f or g classifies it as spam. Denote the resulting classifier as f ∪ g.

Algorithm          N_{L→S}  N_{S→L}  P      F_L    F_S
N.B. ∪ SVM s. m.   0        61       94.4%  0.0%   12.7%


Combining Classifiers

If we add to f and g another classifier h with high precision, we may use it to make f ∪ g even safer: if f(m) = g(m), classify message m as f(m); otherwise (if f and g give different answers) consult h instead of blindly marking m as spam. In other words, assign m to the class proposed by at least 2 of the three classifiers. Denote this classifier as (f ∩ g) ∪ (g ∩ h) ∪ (f ∩ h).

Algorithm  N_{L→S}  N_{S→L}  P      F_L    F_S
2-of-3     0        62       94.4%  0.0%   12.9%


Questions

?
