SLIDE 1 Machine Learning
Machine Learning: algorithms that use “experience” to improve their performance.
We use machine learning in situations where it is very challenging (or impossible) to define the rules by hand, e.g.:
- face detection
- speech recognition
- stock prediction
- driving a car
- medical diagnosis
- credit card fraud detection
SLIDE 3 Example 2: Face detection
SLIDE 5 Example 4: Machine translation
SLIDE 8
Spam Detection Using Naïve Bayes Classification
Jonathan Lee and Varun Mahadevan
SLIDE 9
Programming Project: Spam Filter
On homework 3, you’ll be asked to implement a Naïve Bayes classifier for classifying emails as either spam or ham (= non-spam).
SLIDE 10
Spam vs. Ham
- In the past, the bane of any email user’s existence
- Less of a problem for consumers now, because spam filters have gotten really good
- Easy for humans to identify spam, but not necessarily easy for computers
SLIDE 11
The spam classification problem
Input: a collection of emails, already labeled spam or ham
- Someone has to label these by hand
- Called the training data
Use this data to train a model that can predict whether an email is spam or ham
- Many approaches are possible; we’ll use a Naïve Bayes classifier
Test your model on emails whose labels aren’t provided, and see how well it does
- Called the test data
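A minimal sketch of this train/test protocol in Python (the (text, label) pair representation and helper name here are hypothetical, not the assignment’s actual interface):

```python
import random

# Split labeled emails into training and test data. Assumes
# labeled_emails is a list of (email_text, label) pairs; this is a
# hypothetical helper, not part of the assignment's starter code.
def split_train_test(labeled_emails, test_fraction=0.2, seed=0):
    shuffled = labeled_emails[:]
    random.Random(seed).shuffle(shuffled)
    cutoff = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cutoff], shuffled[cutoff:]
```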
SLIDE 12 Naïve Bayes in the real world
One of the oldest and simplest methods for classification, yet powerful and still used in the real world/industry:
- Identifying credit card fraud
- Identifying fake Amazon reviews
- Identifying vandalism on Wikipedia
- Still used (with modifications) by Gmail to prevent spam
- Facial recognition
- Categorizing Google News articles
- Even used for medical diagnosis!
SLIDE 13 Naïve Bayes in theory
You will use what we’ve learned recently. Specifically:

Conditional Probability
P(A | B) = P(A ∩ B) / P(B)

Bayes’ Theorem
P(A | B) = P(B | A) P(A) / P(B)

Law of Total Probability
P(A) = Σ_n P(A | B_n) P(B_n)

Chain Rule
P(A_1, …, A_n) = P(A_1) P(A_2 | A_1) … P(A_n | A_{n-1}, …, A_1)

Conditional Independence
P(A ∩ B | C) = P(A | C) P(B | C)
P(A | B ∩ C) = P(A | C)
SLIDE 14
[Figure: training examples]
SLIDE 15 How do we represent an email?
- There are characteristics of emails that might give a computer a hint about whether it’s spam
- Possible features: words in body, subject line, sender, message header, time sent
- For this assignment, we choose to represent an email as the set {x_1, x_2, …, x_n} of distinct words in the subject and body
SLIDE 16 How do we represent an email?
SUBJECT: Top Secret Business Venture

Dear Sir. First, I must solicit your confidence in this transaction, this is by virture of its nature as being utterly confidencial and top secret…

{top, secret, business, venture, dear, sir, first, I, must, solicit, your, confidence, in, this, transaction, is, by, virture, of, its, nature, as, being, utterly, confidencial, and}
Notice that there are no duplicate words
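A minimal sketch of this representation in Python (assuming whitespace tokenization and lowercasing; the course notes may specify different rules, e.g., for punctuation):

```python
# Represent an email as the SET of distinct words in its subject and
# body; a set automatically discards duplicate words.
def words_in(subject, body):
    return set((subject + " " + body).lower().split())

print(words_in("Top Secret Business Venture",
               "Dear Sir. First, I must solicit your confidence..."))
```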
SLIDE 17
Programming Project
Take the set {x_1, x_2, …, x_n} of distinct words to represent the email. We are trying to compute

P(spam | x_1, x_2, …, x_n) = ???
SLIDE 18
Programming Project
Take the set {x_1, x_2, …, x_n} of distinct words to represent the email. We are trying to compute

P(spam | x_1, x_2, …, x_n) = ???

Apply Bayes’ Theorem. It’s easier to find the probability of a word appearing in a spam email than the reverse.

P(spam | x_1, x_2, …, x_n) = P(x_1, x_2, …, x_n | spam) P(spam) / [P(x_1, x_2, …, x_n | spam) P(spam) + P(x_1, x_2, …, x_n | ham) P(ham)]
SLIDE 19
Apply the chain rule to the numerator:

P(x_1, x_2, …, x_n | spam) P(spam) = P(x_1, x_2, …, x_n, spam)

Apply the Chain Rule again to decompose this:

P(x_1, x_2, …, x_n, spam) = P(x_1 | x_2, …, x_n, spam) P(x_2 | x_3, …, x_n, spam) … P(x_n | spam) P(spam)

But this is still hard to compute. How could you compute P(x_1 | x_2, …, x_n, spam)?
SLIDE 20
We’ll simplify the problem with an assumption (a big one!): we will assume that the words in the email are conditionally independent of each other, given that we know whether or not the email is spam.

Definition: Two events A and B are conditionally independent given C if and only if

P(A ∩ B | C) = P(A | C) P(B | C).

Equivalently, if P(B ∩ C) > 0, then P(A | B ∩ C) = P(A | C).
SLIDE 21
Let’s simplify the problem with an assumption. We will assume that the words in the email are conditionally independent of each other, given that we know whether or not the email is spam. This is why we call this Naïve Bayes: conditional independence isn’t actually true. So how does this help?

P(x_1, x_2, …, x_n, spam)
= P(x_1 | x_2, …, x_n, spam) P(x_2 | x_3, …, x_n, spam) … P(x_n | spam) P(spam)
≈ P(x_1 | spam) P(x_2 | spam) … P(x_n | spam) P(spam)

So P(x_1, x_2, …, x_n, spam) ≈ P(spam) ∏_{i=1}^{n} P(x_i | spam)
SLIDE 22 Using conditional independence

P(x_1, x_2, …, x_n, spam) ≈ P(spam) ∏_{i=1}^{n} P(x_i | spam)

Similarly,

P(x_1, x_2, …, x_n, ham) ≈ P(ham) ∏_{i=1}^{n} P(x_i | ham)

Putting it all together:

P(spam | x_1, x_2, …, x_n) ≈ [P(spam) ∏_{i=1}^{n} P(x_i | spam)] / [P(spam) ∏_{i=1}^{n} P(x_i | spam) + P(ham) ∏_{i=1}^{n} P(x_i | ham)]

Given labelled training data, how do we compute these quantities? P(spam) and P(ham)? What about P(x_i | spam), e.g., P(viagra | spam)?
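For the first two quantities, the next slide’s answer (the fraction of training emails with each label) is a one-liner; a sketch, assuming training data as hypothetical (word_set, label) pairs:

```python
# Estimate the priors P(spam) and P(ham) as the fractions of training
# emails labeled spam and ham. Assumes training_data is a list of
# (word_set, label) pairs with label "spam" or "ham".
def priors(training_data):
    num_spam = sum(1 for _, label in training_data if label == "spam")
    p_spam = num_spam / len(training_data)
    return p_spam, 1.0 - p_spam
```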
SLIDE 23

P(spam | x_1, x_2, …, x_n) ≈ [P(spam) ∏_{i=1}^{n} P(x_i | spam)] / [P(spam) ∏_{i=1}^{n} P(x_i | spam) + P(ham) ∏_{i=1}^{n} P(x_i | ham)]

P(spam) and P(ham) are just the fractions of training emails that are spam and ham. What about P(x_i | spam)?
SLIDE 24 How spammy is a word?
What is P(viagra | spam) asking?

It would be easy to count how many spam emails contain this word:

P(w | spam) = (number of spam emails containing w) / (total number of spam emails)
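That count, as a minimal sketch in Python (assuming each spam training email is stored as its set of distinct words):

```python
# Unsmoothed estimate of P(w | spam): the fraction of spam training
# emails whose word set contains w.
def p_word_given_spam(word, spam_word_sets):
    containing = sum(1 for words in spam_word_sets if word in words)
    return containing / len(spam_word_sets)
```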
This seems reasonable, but there’s a problem…
SLIDE 25
Suppose the word Pokemon only appears in ham in the training data, never in spam. Then we would estimate

P(pokemon | spam) = 0

Since the overall spam probability is the product of such individual probabilities, if any of those is 0, the whole product is 0. Any email with the word Pokemon would be assigned a spam probability of 0.
What can we do?
SUBJECT: Get out of debt! Cheap prescription pills! Earn fast cash using this one weird trick! Meet singles near you and get preapproved for a low interest credit card! Pokemon
definitely not spam, right?
SLIDE 26 Laplace smoothing
- Crazy idea: what if we pretend we’ve seen every outcome once already?
- Pretend we’ve seen one more spam email with x, one more without x

P(x | spam) = (|spam emails containing x| + 1) / (|spam emails| + 2)

- Then, P(pokemon | spam) > 0
- No one word will bias the overall probability too much
- General technique to avoid assuming that unseen events will never happen
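The smoothed estimate, as a sketch under the same assumed set-of-words representation as before:

```python
# Laplace-smoothed estimate of P(w | spam): pretend we saw one extra
# spam email containing w and one extra without it, so the estimate
# is never exactly 0 or 1.
def p_word_given_spam_smoothed(word, spam_word_sets):
    containing = sum(1 for words in spam_word_sets if word in words)
    return (containing + 1) / (len(spam_word_sets) + 2)
```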
SLIDE 27 Naïve Bayes Overview
For each word w in the spam training set, count how many spam emails contain w:

P(w | spam) = (|spam emails containing w| + 1) / (|spam emails| + 2)

Compute P(w | ham) analogously.

P(spam) = |spam emails| / (|spam emails| + |ham emails|),  P(ham) = 1 − P(spam)

For each test email with words {x_1, x_2, …, x_n},

P(spam | x_1, x_2, …, x_n) ≈ [P(spam) ∏_{i=1}^{n} P(x_i | spam)] / [P(spam) ∏_{i=1}^{n} P(x_i | spam) + P(ham) ∏_{i=1}^{n} P(x_i | ham)]

Output “spam” iff P(spam | x_1, x_2, …, x_n) > 1/2
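Putting the whole pipeline together, a compact end-to-end sketch (assuming Python and emails already converted to word sets; the course notes give the authoritative specification, and these names are our own):

```python
def train(training_data):
    """training_data: list of (word_set, label), label "spam" or "ham"."""
    spam_sets = [ws for ws, label in training_data if label == "spam"]
    ham_sets = [ws for ws, label in training_data if label == "ham"]
    p_spam = len(spam_sets) / len(training_data)
    vocabulary = set().union(*(ws for ws, _ in training_data))
    # Laplace-smoothed P(w | spam) and P(w | ham) for every training word.
    p_w_spam = {w: (sum(1 for s in spam_sets if w in s) + 1) / (len(spam_sets) + 2)
                for w in vocabulary}
    p_w_ham = {w: (sum(1 for s in ham_sets if w in s) + 1) / (len(ham_sets) + 2)
               for w in vocabulary}
    return p_spam, p_w_spam, p_w_ham

def classify(word_set, p_spam, p_w_spam, p_w_ham):
    # The numerator and the ham term of the denominator from slide 27.
    spam_term = p_spam
    ham_term = 1.0 - p_spam
    for w in word_set:
        if w in p_w_spam:  # skip words never seen in training
            spam_term *= p_w_spam[w]
            ham_term *= p_w_ham[w]
    posterior = spam_term / (spam_term + ham_term)
    return "spam" if posterior > 0.5 else "ham"
```

Note that the direct products here can underflow to 0.0 for long emails; that is exactly the issue the notes on the next slide address.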
SLIDE 28
Read the Notes!
Read Jonathan Lee’s Naïve Bayes notes on the course webpage for precise technical details, start early, and ask for help if you get stuck!
The notes describe how to avoid floating point underflow in formulas such as ∏_{i=1}^{n} P(x_i | spam).