

SLIDE 1: Bayes Classifier

CSCI 4520 - Introduction to Machine Learning
Mehdi Allahyari, Georgia Southern University
(slides borrowed from Tom Mitchell, Barnabás Póczos & Aarti Singh)

SLIDE 2: Joint Distribution

The joint distribution sounds like the solution to learning F: X → Y, or P(Y | X).

Main problem: learning P(Y|X) can require more data than we have. Consider learning the joint distribution over 100 attributes:
– How many rows are in this table?
– How many people are on earth?
– What fraction of rows have 0 training examples?
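To make the sparsity concrete, a quick back-of-the-envelope check (a minimal sketch, not from the slides; the population figure is approximate):

```python
# Counting argument: the full joint table over 100 boolean attributes is unlearnable.
n_attributes = 100
rows_in_joint_table = 2 ** n_attributes      # one row per configuration of X
people_on_earth = 8e9                        # roughly 8 billion, for scale

print(f"rows in joint table: {rows_in_joint_table:.3e}")   # ~1.3e+30
print(f"people on earth:     {people_on_earth:.3e}")
# Even if every person contributed one example, almost all rows stay empty.
print(f"max fraction of rows with any example: {people_on_earth / rows_in_joint_table:.1e}")
```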

SLIDE 3: What to do?

  • 1. Be smart about how we estimate probabilities from sparse data
    – maximum likelihood estimates
    – maximum a posteriori estimates
  • 2. Be smart about how to represent joint distributions
    – Bayes networks, graphical models

SLIDE 4: 1. Be smart about how we estimate probabilities

SLIDE 5: Principles for Estimating Probabilities

Principle 1 (maximum likelihood):
  • choose parameters θ that maximize P(data | θ)

Principle 2 (maximum a posteriori probability):
  • choose parameters θ that maximize P(θ | data) = P(data | θ) P(θ) / P(data)

SLIDE 6: Two Principles for Estimating Parameters

  • Maximum Likelihood Estimate (MLE): choose θ that maximizes the probability of the observed data
  • Maximum a Posteriori (MAP) estimate: choose θ that is most probable given the prior probability and the data

SLIDE 7: Some terminology

  • Likelihood function: P(data | θ)
  • Prior: P(θ)
  • Posterior: P(θ | data)
  • Conjugate prior: P(θ) is the conjugate prior for the likelihood function P(data | θ) if the forms of P(θ) and P(θ | data) are the same.
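To make MLE, MAP, and conjugacy concrete, here is a minimal sketch (not from the slides) for a coin-flip parameter θ with a Beta(α, β) prior, the conjugate prior for the Bernoulli/binomial likelihood:

```python
# MLE vs. MAP for a Bernoulli parameter theta = P(heads), with a Beta(alpha, beta) prior.

def estimate_theta(heads, tails, alpha=1.0, beta=1.0):
    """Return (MLE, MAP) estimates of P(heads) from coin-flip counts."""
    # MLE: argmax_theta P(data | theta) = heads / (heads + tails)
    mle = heads / (heads + tails)
    # MAP: argmax_theta P(theta | data) = (heads + alpha - 1) / (heads + tails + alpha + beta - 2)
    map_est = (heads + alpha - 1) / (heads + tails + alpha + beta - 2)
    return mle, map_est

# With sparse data, the prior's "virtual" flips keep the estimate away from 0 or 1.
print(estimate_theta(heads=3, tails=0))                    # (1.0, 1.0) with a uniform prior
print(estimate_theta(heads=3, tails=0, alpha=2, beta=2))   # (1.0, 0.8)
```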

SLIDE 8: You should know

§ Probability basics
  – random variables, events, sample space, conditional probabilities, …
  – independence of random variables
  – Bayes rule
§ Joint probability distributions
  – calculating probabilities from the joint distribution
§ Point estimation
  – maximum likelihood estimates
  – maximum a posteriori estimates
  – distributions: binomial, Beta, Dirichlet, …

SLIDE 9: Let's learn classifiers by learning P(Y|X)

Consider Y = Wealth, X = <Gender, HoursWorked>

Gender  HrsWorked  P(rich | G,HW)  P(poor | G,HW)
F       <40.5      .09             .91
F       >40.5      .21             .79
M       <40.5      .23             .77
M       >40.5      .38             .62

SLIDE 10: How many parameters must we estimate?

Suppose X = <X1, …, Xn> where the Xi and Y are boolean random variables.

To estimate P(Y | X1, X2, …, Xn) we need one parameter per setting of <X1, …, Xn>, i.e., 2^n parameters.

If we have 30 boolean Xi's: P(Y | X1, X2, …, X30) requires 2^30 ≅ 1 billion parameters.

SLIDE 11: Chain Rule & Bayes Rule

Chain rule: P(X, Y) = P(X | Y) P(Y) = P(Y | X) P(X)

Bayes rule: P(Y | X) = P(X | Y) P(Y) / P(X)

SLIDE 12: Bayesian Learning

§ Use Bayes rule: P(Y | X) = P(X | Y) P(Y) / P(X)   (posterior = likelihood × prior / evidence)
§ Or equivalently: P(Y | X) ∝ P(X | Y) P(Y)

SLIDE 13: The Naïve Bayes Classifier

SLIDE 14: Can we reduce parameters using Bayes Rule?

Suppose X = <X1, …, Xn> where the Xi and Y are boolean random variables.

To estimate P(Y | X1, X2, …, Xn) directly requires a table with 2^n rows; rewriting it with Bayes rule means estimating P(X1, …, Xn | Y), which takes 2(2^n − 1) parameters.

If we have 30 Xi's instead of 2: P(Y | X1, X2, …, X30) involves 2^30 ≅ 1 billion rows.

SLIDE 15: Naïve Bayes Assumption

Naïve Bayes assumption: features X1 and X2 are conditionally independent given the class label Y:

P(X1, X2 | Y) = P(X1 | Y) P(X2 | Y)

More generally: P(X1, …, Xn | Y) = ∏i P(Xi | Y)

SLIDE 16: Conditional Independence

Definition: X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y, given the value of Z:

(∀ i, j, k)  P(X = xi | Y = yj, Z = zk) = P(X = xi | Z = zk)

which we often write as P(X | Y, Z) = P(X | Z).

E.g., P(Thunder | Rain, Lightning) = P(Thunder | Lightning)

SLIDE 17: Naïve Bayes Assumption

Naïve Bayes uses the assumption that the Xi are conditionally independent given Y. Given this assumption:

P(X1, X2 | Y) = P(X1 | Y) P(X2 | Y),  and in general  P(X1, …, Xn | Y) = ∏i P(Xi | Y)

How many parameters are needed to describe P(X1…Xn | Y) and P(Y), for boolean Xi and Y?
  • Without the conditional independence assumption: 2(2^n − 1) + 1
  • With the conditional independence assumption: 2n + 1
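A quick check of these counts (a minimal sketch under the slide's assumption of boolean Xi and Y):

```python
# Parameters needed for P(X1..Xn | Y) plus P(Y), with and without the NB assumption.

def params_full(n):
    # P(X1..Xn | Y): (2^n - 1) free parameters per class, 2 classes; P(Y): 1 more.
    return 2 * (2 ** n - 1) + 1

def params_naive_bayes(n):
    # P(Xi | Y): 1 free parameter per feature per class = 2n; P(Y): 1 more.
    return 2 * n + 1

for n in (2, 10, 30):
    print(n, params_full(n), params_naive_bayes(n))
# n=30: about 2.1 billion parameters without the assumption vs. 61 with it.
```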

SLIDE 18: Application of Bayes Rule

SLIDE 19: AIDS test (Bayes rule)

Data:
§ Approximately 0.1% of people are infected
§ The test detects all infections
§ The test reports positive for 1% of healthy people

Probability of having AIDS if the test is positive:

P(infected | positive) = P(positive | infected) P(infected) / P(positive)
                       = (1.0 × 0.001) / (1.0 × 0.001 + 0.01 × 0.999) ≈ 0.09

Only 9%!...

SLIDE 20: Improving the diagnosis

Use a weaker follow-up test!
§ Approximately 0.1% of people are infected
§ Test 2 reports positive for 90% of infections
§ Test 2 reports positive for 5% of healthy people

P(infected | both tests positive) ≈ 64%!...
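Both posteriors can be reproduced with a short Bayes-rule update (a minimal sketch, not from the slides; the second update uses Test 1's posterior as the new prior):

```python
# Bayes-rule update for the AIDS-test example.

def posterior(prior, sensitivity, false_positive_rate):
    """P(infected | positive), given P(infected) = prior,
    P(positive | infected) = sensitivity, P(positive | healthy) = false_positive_rate."""
    evidence = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / evidence

p1 = posterior(prior=0.001, sensitivity=1.00, false_positive_rate=0.01)
print(f"after Test 1: {p1:.1%}")            # ~9.1%

# Tests 1 and 2 are conditionally independent given infection status,
# so Test 2's update can simply start from Test 1's posterior.
p2 = posterior(prior=p1, sensitivity=0.90, false_positive_rate=0.05)
print(f"after Tests 1 and 2: {p2:.1%}")     # ~64%
```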

SLIDE 21: Improving the diagnosis

Why can't we use Test 1 twice?
§ Its outcomes are not independent,
§ but Tests 1 and 2 are conditionally independent given the infection status (by assumption):

P(T1, T2 | infected) = P(T1 | infected) P(T2 | infected)

SLIDE 22: Naïve Bayes in a Nutshell

Bayes rule:
P(Y = yk | X1, …, Xn) = P(Y = yk) P(X1, …, Xn | Y = yk) / Σj P(Y = yj) P(X1, …, Xn | Y = yj)

Assuming conditional independence among the Xi's:
P(Y = yk | X1, …, Xn) = P(Y = yk) ∏i P(Xi | Y = yk) / Σj P(Y = yj) ∏i P(Xi | Y = yj)

So, the classification rule for Xnew = <X1, …, Xn> is:
Ynew ← argmax_yk  P(Y = yk) ∏i P(Xi_new | Y = yk)

SLIDE 23: Naïve Bayes Algorithm – discrete Xi

  • Train Naïve Bayes (examples):
    for each* value yk, estimate P(Y = yk)
    for each* value xij of each attribute Xi, estimate P(Xi = xij | Y = yk)

  • Classify (Xnew):
    Ynew ← argmax_yk  P(Y = yk) ∏i P(Xi_new | Y = yk)

* probabilities must sum to 1, so we need to estimate only n−1 of these...
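As an illustration of this train/classify loop, here is a minimal sketch (assumed helper names, plain MLE counts, no smoothing):

```python
# Naive Bayes with discrete features: MLE training and argmax classification.
from collections import Counter

def train_nb(examples):
    """examples: list of (features, label); features is a tuple of discrete values."""
    class_counts = Counter(label for _, label in examples)
    feature_counts = Counter()          # (i, value, label) -> count
    for features, label in examples:
        for i, value in enumerate(features):
            feature_counts[(i, value, label)] += 1
    priors = {y: c / len(examples) for y, c in class_counts.items()}
    def cond_prob(i, value, y):
        return feature_counts[(i, value, y)] / class_counts[y]   # MLE; 0 if unseen
    return priors, cond_prob

def classify_nb(x_new, priors, cond_prob):
    scores = {y: prior for y, prior in priors.items()}
    for y in scores:
        for i, value in enumerate(x_new):
            scores[y] *= cond_prob(i, value, y)
    return max(scores, key=scores.get)

# Tiny usage example shaped like the wealth data: features (gender, hours bucket).
data = [(("F", "<40.5"), "poor"), (("M", ">40.5"), "rich"),
        (("M", "<40.5"), "poor"), (("F", ">40.5"), "poor")]
priors, cond_prob = train_nb(data)
print(classify_nb(("M", ">40.5"), priors, cond_prob))   # -> "rich"
```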

SLIDE 24: Estimating Parameters: Y, Xi discrete-valued

Maximum likelihood estimates (MLEs) are relative frequencies:

P̂(Y = yk) = #D{Y = yk} / |D|

P̂(Xi = xij | Y = yk) = #D{Xi = xij ∧ Y = yk} / #D{Y = yk}

where #D{Y = yk} is the number of items in dataset D for which Y = yk.

SLIDE 25: Naïve Bayes: Subtlety #1

Often the Xi are not really conditionally independent.

  • We use Naïve Bayes in many cases anyway, and it often works pretty well
    – often the right classification, even when not the right probability (see [Domingos & Pazzani, 1996])
  • What is the effect on the estimated P(Y|X)?
    – Extreme case: what if we add two copies of a feature, Xi = Xk?

SLIDE 26: Subtlety #2: Insufficient training data

What if we never see a training example where Xi = a for class Y = b? Then the MLE gives P̂(Xi = a | Y = b) = 0, and the Naïve Bayes product is zero for every example with Xi = a, regardless of the other features.

What now??? What can be done to avoid this? For example, …

SLIDE 27: Estimating Parameters

  • Maximum Likelihood Estimate (MLE): choose θ that maximizes the probability of the observed data
  • Maximum a Posteriori (MAP) estimate: choose θ that is most probable given the prior probability and the data

SLIDE 28: Conjugate priors [A. Singh]

SLIDE 29: Conjugate priors (continued) [A. Singh]

SLIDE 30: Estimating Parameters: Y, Xi discrete-valued

Training data: use your expert knowledge & apply prior distributions:

§ Add m "virtual" examples (m = # virtual examples with Y = b)
§ This is the same as assuming conjugate priors

Assume priors, then form the MAP estimate from the real counts augmented by the virtual examples.

SLIDE 31: Estimating Parameters: Y, Xi discrete-valued

Maximum likelihood estimates vs. MAP estimates (Beta and Dirichlet priors):

the only difference is that the MAP estimates add "imaginary" (virtual) examples to the counts.
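A minimal sketch of the difference (assumed m-estimate style smoothing, spreading m virtual examples uniformly over the possible values of Xi; the notation is not the slide's):

```python
# MLE vs. MAP (smoothed) estimates of P(Xi = x | Y = y) from counts.

def mle(count_xy, count_y):
    return count_xy / count_y

def map_estimate(count_xy, count_y, m=1, n_values=2):
    # Pretend we saw m extra "virtual" examples spread over the n_values of Xi.
    return (count_xy + m) / (count_y + m * n_values)

# A value never observed with this class: MLE collapses to 0, MAP does not.
print(mle(0, 50))                             # 0.0 -> zeroes out the whole NB product
print(map_estimate(0, 50, m=1, n_values=2))   # ~0.019
```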

SLIDE 32: Case Study: Text Classification

§ Classify e-mails – Y = {Spam, NotSpam}
§ Classify news articles – Y = {what is the topic of the article?}

What are the features X? The text! Let Xi represent the ith word in the document.

SLIDE 33: Data for spam filtering

Features available in a raw e-mail include:
  • date
  • time
  • recipient path
  • IP number
  • sender
  • encoding
  • many more features

Delivered-To: alex.smola@gmail.com Received: by 10.216.47.73 with SMTP id s51cs361171web; Tue, 3 Jan 2012 14:17:53 -0800 (PST) Received: by 10.213.17.145 with SMTP id s17mr2519891eba.147.1325629071725; Tue, 03 Jan 2012 14:17:51 -0800 (PST) Return-Path: <alex+caf_=alex.smola=gmail.com@smola.org> Received: from mail-ey0-f175.google.com (mail-ey0-f175.google.com [209.85.215.175]) by mx.google.com with ESMTPS id n4si29264232eef.57.2012.01.03.14.17.51 (version=TLSv1/SSLv3 cipher=OTHER); Tue, 03 Jan 2012 14:17:51 -0800 (PST) Received-SPF: neutral (google.com: 209.85.215.175 is neither permitted nor denied by best guess record for domain of alex+caf_=alex.smola=gmail.com@smola.org) client-ip=209.85.215.175; Authentication-Results: mx.google.com; spf=neutral (google.com: 209.85.215.175 is neither permitted nor denied by best guess record for domain of alex+caf_=alex.smola=gmail.com@smola.org) smtp.mail=alex+caf_=alex.smola=gmail.com@smola.org; dkim=pass (test mode) header.i=@googlemail.com Received: by eaal1 with SMTP id l1so15092746eaa.6 for <alex.smola@gmail.com>; Tue, 03 Jan 2012 14:17:51 -0800 (PST) Received: by 10.205.135.18 with SMTP id ie18mr5325064bkc.72.1325629071362; Tue, 03 Jan 2012 14:17:51 -0800 (PST) X-Forwarded-To: alex.smola@gmail.com X-Forwarded-For: alex@smola.org alex.smola@gmail.com Delivered-To: alex@smola.org Received: by 10.204.65.198 with SMTP id k6cs206093bki; Tue, 3 Jan 2012 14:17:50 -0800 (PST) Received: by 10.52.88.179 with SMTP id bh19mr10729402vdb.38.1325629068795; Tue, 03 Jan 2012 14:17:48 -0800 (PST) Return-Path: <althoff.tim@googlemail.com> Received: from mail-vx0-f179.google.com (mail-vx0-f179.google.com [209.85.220.179]) by mx.google.com with ESMTPS id dt4si11767074vdb.93.2012.01.03.14.17.48 (version=TLSv1/SSLv3 cipher=OTHER); Tue, 03 Jan 2012 14:17:48 -0800 (PST) Received-SPF: pass (google.com: domain of althoff.tim@googlemail.com designates 209.85.220.179 as permitted sender) client-ip=209.85.220.179; Received: by vcbf13 with SMTP id f13so11295098vcb.10 for <alex@smola.org>; Tue, 03 Jan 2012 14:17:48 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=googlemail.com; s=gamma; h=mime-version:sender:date:x-google-sender-auth:message-id:subject :from:to:content-type; bh=WCbdZ5sXac25dpH02XcRyDOdts993hKwsAVXpGrFh0w=; b=WK2B2+ExWnf/gvTkw6uUvKuP4XeoKnlJq3USYTm0RARK8dSFjyOQsIHeAP9Yssxp6O 7ngGoTzYqd+ZsyJfvQcLAWp1PCJhG8AMcnqWkx0NMeoFvIp2HQooZwxSOCx5ZRgY+7qX uIbbdna4lUDXj6UFe16SpLDCkptd8OZ3gr7+o= MIME-Version: 1.0 Received: by 10.220.108.81 with SMTP id e17mr24104004vcp.67.1325629067787; Tue, 03 Jan 2012 14:17:47 -0800 (PST) Sender: althoff.tim@googlemail.com Received: by 10.220.17.129 with HTTP; Tue, 3 Jan 2012 14:17:47 -0800(PST) Date: Tue, 3 Jan 2012 14:17:47 -0800 X-Google-Sender-Auth: 6bwi6D17HjZIkxOEol38NZzyeHs Message-ID: <CAFJJHDGPBW+SdZg0MdAABiAKydDk9tpeMoDijYGjoGO-WC7osg@mail.gmail.com> Subject: CS 281B. Advanced Topics in Learning and Decision Making From: Tim Althoff <althoff@eecs.berkeley.edu> To: alex@smola.org Content-Type: multipart/alternative; boundary=f46d043c7af4b07e8d04b5a7113a

--f46d043c7af4b07e8d04b5a7113a

Content-Type: text/plain; charset=ISO-8859-1


SLIDE 34: Xi represents the ith word in the document

SLIDE 35: NB for Text Classification

A problem: the support of P(X|Y) is huge!
– An article has at least 1000 words: X = {X1, …, X1000}
– Xi represents the ith word in the document, i.e., the domain of Xi is the entire vocabulary, e.g., Webster's Dictionary (or more): Xi ∈ {1, …, 50000}

⇒ K(50000^1000 − 1) parameters to estimate without the NB assumption….

SLIDE 36: NB for Text Classification

Xi ∈ {1, …, 50000} ⇒ K(50000^1000 − 1) parameters to estimate…. The NB assumption helps a lot!!!

If P(Xi = xi | Y = y) is the probability of observing word xi at the ith position in a document on topic y:

⇒ 1000·K·(50000 − 1) parameters to estimate with the NB assumption.

The NB assumption helps, but there are still lots of parameters to estimate.
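A rough check of these counts (a sketch; the number of topics K is an assumed example value):

```python
# Parameter counts for P(X1..X1000 | Y) with a 50,000-word vocabulary.
K = 20          # e.g., 20 newsgroups (assumed for illustration)
V = 50_000      # vocabulary size
n = 1_000       # word positions per article

without_nb = K * (V ** n - 1)     # one parameter per word sequence, per class
with_nb = n * K * (V - 1)         # one distribution per position, per class

print(len(str(without_nb)), "digits")   # about 4,700 digits -- hopeless
print(f"{with_nb:,}")                   # 999,980,000 -- still a lot
```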

SLIDE 37: Bag of words model

Typical additional assumption: position in the document doesn't matter:
P(Xi = xi | Y = y) = P(Xk = xi | Y = y)

– "Bag of words" model – the order of words on the page is ignored
– The document is just a bag of words: i.i.d. words
– Sounds really silly, but often works very well!

The probability of a document with words x1, x2, … is then ∏i P(Xi = xi | Y = y)

⇒ K(50000 − 1) parameters to estimate.

SLIDE 38: Bag of words model

"When the lecture is over, remember to wake up the person sitting next to you in the lecture room."

as a bag of words (order discarded, duplicates kept):

in is lecture lecture next over person remember room sitting the the the to to up wake when you
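The point of the example is that order never matters: the sentence and its alphabetized word list yield the same count vector (a minimal sketch, not from the slides):

```python
# A "bag of words" keeps only word counts, so word order is irrelevant.
from collections import Counter

sentence = ("When the lecture is over, remember to wake up the person "
            "sitting next to you in the lecture room.")
words = sentence.lower().replace(",", "").replace(".", "").split()

bag = Counter(words)
print(bag["the"], bag["lecture"], bag["to"])   # 3 2 2, matching the slide's listing
print(bag == Counter(sorted(words)))           # True: sorting the words changes nothing
```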

SLIDE 39: Bag of words approach

Represent each document by its vector of word counts, e.g.:

aardvark   0
about      2
all        2
Africa     1
apple      …
anxious    …
…
gas        …
…
oil        1
…
Zaire      1

SLIDE 40: Learning to classify a document: P(Y|X) – the "Bag of Words" model

  • Y is discrete valued, e.g., Spam or not
  • X = <X1, X2, …, Xn> = document
  • Xi is a random variable describing the word at position i in the document
  • possible values for Xi: any word wk in English
  • Document = bag of words: the vector of counts for all wk's
  • This vector of counts follows a ?? distribution

SLIDE 41: Naïve Bayes Algorithm – discrete Xi

  • Train Naïve Bayes (examples):
    for each value yk, estimate P(Y = yk)
    for each value xij of each attribute Xi, estimate P(Xi = xij | Y = yk) – the probability that word xij appears in position i, given Y = yk

  • Classify (Xnew):
    Ynew ← argmax_yk  P(Y = yk) ∏i P(Xi_new | Y = yk)

* Additional assumption: word probabilities are position independent, i.e., P(Xi = w | Y = yk) is the same for all positions i.

SLIDE 42: MAP estimates for bag of words

MAP estimate for the multinomial: what β's should we choose?

SLIDE 43: Twenty Newsgroups results

Naïve Bayes: 89% accuracy

SLIDE 44: Twenty Newsgroups results

For code and data, see www.cs.cmu.edu/~tom/mlbook.html and click on "Software and Data".

SLIDE 45: What if we have continuous Xi?

E.g., image classification: Xi is the ith pixel.

SLIDE 46: What if we have continuous Xi?

Image classification: Xi is the ith pixel, Y = mental state.

Still have: P(Y = yk | X1, …, Xn) ∝ P(Y = yk) ∏i P(Xi | Y = yk)

Just need to decide how to represent P(Xi | Y).

SLIDE 47: What if features are continuous?

E.g., image classification: Xi is the ith pixel.

Gaussian Naïve Bayes (GNB): assume P(Xi = x | Y = yk) is Gaussian, N(x; μik, σik).

Sometimes we assume the σik
  • is independent of Y (i.e., σi),
  • or independent of Xi (i.e., σk),
  • or both (i.e., σ).

SLIDE 48: Gaussian Naïve Bayes Algorithm – continuous Xi (but still discrete Y)

  • Train Naïve Bayes (examples):
    for each value yk, estimate* P(Y = yk)
    for each attribute Xi, estimate the class-conditional mean μik and variance σik²

  • Classify (Xnew):
    Ynew ← argmax_yk  P(Y = yk) ∏i N(Xi_new; μik, σik)

* probabilities must sum to 1, so we need to estimate only n−1 parameters...

SLIDE 49: Estimating parameters: Y discrete, Xi continuous

SLIDE 50: Estimating parameters: Y discrete, Xi continuous

Maximum likelihood estimates:

μ̂ik = Σj Xi(j) δ(Y(j) = yk) / Σj δ(Y(j) = yk)

σ̂ik² = Σj (Xi(j) − μ̂ik)² δ(Y(j) = yk) / Σj δ(Y(j) = yk)

where j indexes the jth training image, Xi(j) is the ith pixel in the jth training image, yk is the kth class, and δ(z) = 1 if z is true, else 0.
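A compact sketch of these estimates and the resulting classifier (assumed function names, not the original course code):

```python
# Gaussian Naive Bayes: per-class, per-feature means and variances by MLE.
import numpy as np

def gnb_fit(X, y):
    """X: (n_examples, n_features) floats; y: (n_examples,) integer class labels."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == k) for k in classes])
    means = np.array([X[y == k].mean(axis=0) for k in classes])
    variances = np.array([X[y == k].var(axis=0) + 1e-9 for k in classes])  # avoid /0
    return classes, priors, means, variances

def gnb_predict(x, classes, priors, means, variances):
    # Work in log space: log P(y) + sum_i log N(x_i; mu_ik, sigma_ik^2).
    log_post = (np.log(priors)
                - 0.5 * np.sum(np.log(2 * np.pi * variances)
                               + (x - means) ** 2 / variances, axis=1))
    return classes[np.argmax(log_post)]

# Tiny usage example with made-up data (two features, two classes).
X = np.array([[1.0, 2.0], [1.2, 1.9], [4.0, 4.1], [3.9, 4.3]])
y = np.array([0, 0, 1, 1])
model = gnb_fit(X, y)
print(gnb_predict(np.array([1.1, 2.1]), *model))   # -> 0
```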

SLIDE 51: Example: GNB for classifying mental states

Classify a person's cognitive activity based on a brain image:
  • ~1 mm resolution
  • ~2 images per second
  • 15,000 voxels/image
  • non-invasive, safe
  • measures the Blood Oxygen Level Dependent (BOLD) response

[Mitchell et al.]
SLIDE 52: Learned Naïve Bayes Models – Means for P(BrainActivity | WordCategory)

[Figure: learned mean activation images for "Tool" words vs. "Building" words]

Pairwise classification accuracy: 78–99%, 12 participants

[Mitchell et al.]

SLIDE 53: What you should know…

  • Training and using classifiers based on Bayes rule
  • Conditional independence
    § What it is
    § Why it's important
  • Naïve Bayes
    § What it is
    § Why we use it so much
    § Training using MLE and MAP estimates
    § Discrete and continuous (Gaussian) variables