Naïve Bayes (ECE-5424G / CS-5824), Jia-Bin Huang, Virginia Tech, Spring 2019
SLIDE 1

Naïve Bayes

Jia-Bin Huang Virginia Tech

Spring 2019

ECE-5424G / CS-5824

SLIDE 2

Administrative

  • HW 1 out today. Please start early!
  • Office hours
  • Chen: Wed 4pm-5pm
  • Shih-Yang: Fri 3pm-4pm
  • Location: Whittemore 266
SLIDE 3

Linear Regression

  • Model representation

$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n = \theta^\top x$

  • Cost function

$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$

  • Gradient descent for linear regression

Repeat until convergence { $\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$ }

  • Features and polynomial regression

Can combine features; can use different functions to generate features (e.g., polynomial)

  • Normal equation: $\theta = (X^\top X)^{-1} X^\top y$

(A runnable sketch of both solvers follows below.)
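
Below is a minimal NumPy sketch (mine, not the slides') of the two solvers in this recap. The synthetic data, learning rate, and iteration count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 2
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])   # prepend x_0 = 1
theta_true = np.array([1.0, 2.0, -3.0])
y = X @ theta_true + 0.1 * rng.normal(size=m)

# Gradient descent: theta_j := theta_j - alpha * (1/m) * sum_i (h(x^(i)) - y^(i)) * x_j^(i)
alpha, theta = 0.1, np.zeros(n + 1)
for _ in range(2000):
    theta -= alpha * (X.T @ (X @ theta - y)) / m

# Normal equation: theta = (X^T X)^{-1} X^T y
theta_ne = np.linalg.solve(X.T @ X, X.T @ y)

print(theta)     # ~ [1, 2, -3]
print(theta_ne)  # agrees with gradient descent up to numerical error
```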
SLIDE 4

| ($x_0$) | Size in feet² ($x_1$) | Number of bedrooms ($x_2$) | Number of floors ($x_3$) | Age of home in years ($x_4$) | Price ($) in 1000's ($y$) |
|---|---|---|---|---|---|
| 1 | 2104 | 5 | 1 | 45 | 460 |
| 1 | 1416 | 3 | 2 | 40 | 232 |
| 1 | 1534 | 3 | 2 | 30 | 315 |
| 1 | 852  | 2 | 1 | 36 | 178 |
| … | … | … | … | … | … |

$y = \begin{bmatrix} 460 \\ 232 \\ 315 \\ 178 \end{bmatrix}$

$\theta = (X^\top X)^{-1} X^\top y$

Slide credit: Andrew Ng
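
As a hedged aside (not on the slide): applying the normal equation to just these four rows is instructive but degenerate, since this $X$ has 5 columns and only 4 rows, so $X^\top X$ is singular. The sketch below therefore uses the pseudoinverse.

```python
import numpy as np

X = np.array([[1, 2104, 5, 1, 45],
              [1, 1416, 3, 2, 40],
              [1, 1534, 3, 2, 30],
              [1,  852, 2, 1, 36]], dtype=float)
y = np.array([460, 232, 315, 178], dtype=float)

# With fewer examples than columns, X^T X is not invertible; the Moore-Penrose
# pseudoinverse gives the minimum-norm least-squares solution instead.
theta = np.linalg.pinv(X.T @ X) @ X.T @ y
print(X @ theta)  # reproduces y up to numerical error: the fit is exact here
```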

SLIDE 5

Least squares solution

  • $J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 = \frac{1}{2m} \sum_{i=1}^{m} \left( \theta^\top x^{(i)} - y^{(i)} \right)^2 = \frac{1}{2m} \left\| X\theta - y \right\|_2^2$
  • Set the gradient to zero: $\frac{\partial}{\partial \theta} J(\theta) = 0$
  • $\theta = (X^\top X)^{-1} X^\top y$
SLIDE 6

Justification/interpretation 1

  • Geometric interpretation

$X = \begin{bmatrix} 1 & \leftarrow x^{(1)\top} \rightarrow \\ 1 & \leftarrow x^{(2)\top} \rightarrow \\ \vdots & \vdots \\ 1 & \leftarrow x^{(m)\top} \rightarrow \end{bmatrix} = \begin{bmatrix} \uparrow & \uparrow & \uparrow & & \uparrow \\ z_1 & z_2 & z_3 & \cdots & z_n \\ \downarrow & \downarrow & \downarrow & & \downarrow \end{bmatrix}$

  • $X\theta$ lies in the column space of $X$, i.e., $\mathrm{span}(\{z_1, z_2, \cdots, z_n\})$
  • The residual $X\theta - y$ is orthogonal to the column space of $X$
  • $X^\top (X\theta - y) = 0 \Rightarrow (X^\top X)\theta = X^\top y$

[Figure: $y$, its projection $X\theta$ onto the column space of $X$, and the residual $X\theta - y$]

SLIDE 7

Justification/interpretation 2

  • Probabilistic model
  • Assume a linear model with Gaussian errors:

$p_\theta(y^{(i)} \mid x^{(i)}) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{1}{2\sigma^2} \left( y^{(i)} - \theta^\top x^{(i)} \right)^2 \right)$

  • Solving for the maximum likelihood estimate (a numeric check follows below):

$\operatorname{argmax}_\theta \prod_{i=1}^{m} p_\theta(y^{(i)} \mid x^{(i)}) = \operatorname{argmax}_\theta \log \prod_{i=1}^{m} p_\theta(y^{(i)} \mid x^{(i)}) = \operatorname{argmin}_\theta \frac{1}{2\sigma^2} \sum_{i=1}^{m} \left( \theta^\top x^{(i)} - y^{(i)} \right)^2$

Image credit: CS 446@UIUC
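
A small numeric check of this equivalence, under the assumption $\sigma = 1$ and a slope-only toy model of my own choosing: the negative log-likelihood and the squared error differ only by a constant, so they share the same minimizer.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=30)
y = 2.0 * x + 0.5 * rng.normal(size=30)

thetas = np.linspace(0.0, 4.0, 401)                       # candidate slopes
sq_err = np.array([np.sum((t * x - y) ** 2) for t in thetas])
# -log prod_i p(y_i | x_i) with sigma = 1: constant + half the squared error
nll = 0.5 * len(x) * np.log(2 * np.pi) + 0.5 * sq_err

print(thetas[sq_err.argmin()], thetas[nll.argmin()])      # same minimizer
```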

SLIDE 8

Justification/interpretation 3

  • Loss minimization

$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 = \frac{1}{m} \sum_{i=1}^{m} \ell\left( h_\theta(x^{(i)}), y^{(i)} \right)$

  • $\ell(y, \hat{y}) = \frac{1}{2} \| y - \hat{y} \|_2^2$: least squares loss
  • Empirical Risk Minimization (ERM):

$\frac{1}{m} \sum_{i=1}^{m} \ell\left( y^{(i)}, \hat{y}^{(i)} \right)$

SLIDE 9

$m$ training examples, $n$ features

Gradient Descent

  • Need to choose $\alpha$
  • Need many iterations
  • Works well even when $n$ is large

Normal Equation

  • No need to choose $\alpha$
  • Don't need to iterate
  • Need to compute $(X^\top X)^{-1}$
  • Slow if $n$ is very large

Slide credit: Andrew Ng

SLIDE 10

Things to remember

  • Model representation

$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n = \theta^\top x$

  • Cost function

$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$

  • Gradient descent for linear regression

Repeat until convergence { $\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$ }

  • Features and polynomial regression

Can combine features; can use different functions to generate features (e.g., polynomial)

  • Normal equation: $\theta = (X^\top X)^{-1} X^\top y$
SLIDE 11

Todayโ€™s plan

  • Probability basics
  • Estimating parameters from data
  • Maximum likelihood (ML)
  • Maximum a posteriori estimation (MAP)
  • Naïve Bayes
SLIDE 12

Todayโ€™s plan

  • Probability basics
  • Estimating parameters from data
  • Maximum likelihood (ML)
  • Maximum a posteriori estimation (MAP)
  • Naive Bayes
SLIDE 13

Random variables

  • Outcome space S: the space of possible outcomes
  • Random variables: functions that map outcomes to real numbers
  • Event E: a subset of S
SLIDE 14

Visualizing probability P(A)

[Figure: sample space of area 1, with a blue circle inside; A is true inside the circle and false outside]

P(A) = area of the blue circle

SLIDE 15

Visualizing probability P(A) + P(~A)

[Figure: A is true inside the circle, false outside]

P(A) + P(~A) = 1

SLIDE 16

Visualizing probability P(A)

[Figure: circle A overlapping circle B, splitting A into A∧B and A∧~B]

P(A) = P(A, B) + P(A, ~B)

SLIDE 17

Visualizing conditional probability

[Figure: overlapping circles A and B with intersection A∧B]

P(A|B) = P(A, B) / P(B)

Corollary: The chain rule

P(A, B) = P(A|B) P(B)

SLIDE 18

Bayes rule

[Figure: overlapping circles A and B with intersection A∧B]

P(A|B) = P(A, B) / P(B) = P(B|A) P(A) / P(B)

Corollary: The chain rule

P(A, B) = P(A|B) P(B) = P(B|A) P(A)

[Portrait: Thomas Bayes]

SLIDE 19

Other forms of Bayes rule

๐‘„ ๐ต|๐ถ = ๐‘„ ๐ถ ๐ต ๐‘„ ๐ต ๐‘„(๐ถ) ๐‘„ ๐ต|๐ถ, ๐‘Œ = ๐‘„ ๐ถ ๐ต, ๐‘Œ ๐‘„ ๐ต, ๐‘Œ ๐‘„(๐ถ, ๐‘Œ) ๐‘„ ๐ต|๐ถ = ๐‘„ ๐ถ ๐ต ๐‘„ ๐ต ๐‘„ ๐ถ ๐ต ๐‘„ ๐ต + ๐‘„ ๐ถ ~๐ต ๐‘„(~๐ต)

SLIDE 20

Applying Bayes rule

๐‘„ ๐ต|๐ถ = ๐‘„ ๐ถ ๐ต ๐‘„ ๐ต ๐‘„ ๐ถ ๐ต ๐‘„ ๐ต + ๐‘„ ๐ถ ~๐ต ๐‘„(~๐ต)

  • A = you have the flu

B = you just coughed

  • Assume:
  • ๐‘„ ๐ต = 0.05
  • ๐‘„ ๐ถ ๐ต = 0.8
  • ๐‘„ ๐ถ ~๐ต = 0.2
  • What is P(flu | cough) = P(A|B)?

๐‘„ ๐ต|๐ถ = 0.8 ร— 0.05 0.8 ร— 0.05 + 0.2 ร— 0.95 ~0.17

Slide credit: Tom Mitchell
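
The slide's arithmetic, restated as a few lines of Python (only the numbers given above are assumed):

```python
p_a = 0.05       # P(A): you have the flu
p_b_a = 0.8      # P(B|A): cough given flu
p_b_na = 0.2     # P(B|~A): cough given no flu

p_a_b = p_b_a * p_a / (p_b_a * p_a + p_b_na * (1 - p_a))
print(round(p_a_b, 3))  # 0.174, i.e. ~0.17
```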

SLIDE 21

Why are we learning this? Learn P(Y|X)

[Figure: a hypothesis h maps input x to output y]

SLIDE 22

Joint distribution

  • Making a joint distribution of M variables
  • 1. Make a truth table listing all combinations
  • 2. For each combination of values, say how probable it is
  • 3. Probability must sum to 1

| A | B | C | Prob |
|---|---|---|------|
| 0 | 0 | 0 | 0.30 |
| 0 | 0 | 1 | 0.05 |
| 0 | 1 | 0 | 0.10 |
| 0 | 1 | 1 | 0.05 |
| 1 | 0 | 0 | 0.05 |
| 1 | 0 | 1 | 0.10 |
| 1 | 1 | 0 | 0.25 |
| 1 | 1 | 1 | 0.10 |

Slide credit: Tom Mitchell

SLIDE 23

Using joint distribution

  • Can ask for any logical expression involving these variables
  • $P(E) = \sum_{\text{rows matching } E} P(\text{row})$
  • $P(E_1 \mid E_2) = \dfrac{\sum_{\text{rows matching } E_1 \text{ and } E_2} P(\text{row})}{\sum_{\text{rows matching } E_2} P(\text{row})}$

Slide credit: Tom Mitchell
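
A sketch of these two queries against the joint table from the previous slide (the helper names are mine):

```python
joint = {  # (A, B, C) -> probability, from the slide's table
    (0, 0, 0): 0.30, (0, 0, 1): 0.05, (0, 1, 0): 0.10, (0, 1, 1): 0.05,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10, (1, 1, 0): 0.25, (1, 1, 1): 0.10,
}

def prob(event):
    """P(E) = sum of P(row) over rows where the predicate `event` holds."""
    return sum(p for row, p in joint.items() if event(row))

def cond_prob(e1, e2):
    """P(E1|E2) = P(E1 and E2) / P(E2)."""
    return prob(lambda r: e1(r) and e2(r)) / prob(e2)

print(prob(lambda r: r[0] == 1))                            # P(A) = 0.50
print(cond_prob(lambda r: r[2] == 1, lambda r: r[0] == 1))  # P(C|A) = 0.40
```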

SLIDE 24

The solution to learn P(Y|X)?

  • Main problem: learning P(Y|X) may require more data than we have
  • Say, learning a joint distribution with 100 attributes
  • # of rows in this table? $2^{100} \geq 10^{30}$
  • # of people on earth? $\approx 10^9$

Slide credit: Tom Mitchell

SLIDE 25

What should we do?

  • 1. Be smart about how we estimate probabilities from sparse data
  • Maximum likelihood estimates (ML)
  • Maximum a posteriori estimates (MAP)
  • 2. Be smart about how to represent joint distributions
  • Bayes network, graphical models (more on this later)

Slide credit: Tom Mitchell

SLIDE 26

Todayโ€™s plan

  • Probability basics
  • Estimating parameters from data
  • Maximum likelihood (ML)
  • Maximum a posteriori (MAP)
  • Naive Bayes
SLIDE 27

Estimating the probability

  • Flip the coin repeatedly, observing:
  • It turns up heads $\alpha_1$ times
  • It turns up tails $\alpha_0$ times
  • Your estimate for P(X = 1) is?
  • Case A: 100 flips: 51 heads (X = 1), 49 tails (X = 0)

P(X = 1) = ?

  • Case B: 3 flips: 2 heads (X = 1), 1 tail (X = 0)

P(X = 1) = ?

[Figure: coin showing heads (X = 1) and tails (X = 0)]

Slide credit: Tom Mitchell

SLIDE 28

Two principles for estimating parameters

  • Maximum Likelihood Estimate (MLE): choose $\theta$ that maximizes the probability of the observed data

$\hat{\theta}_{\text{MLE}} = \operatorname{argmax}_\theta P(\text{Data} \mid \theta)$

  • Maximum a posteriori (MAP) estimate: choose $\theta$ that is most probable given the prior probability and the data

$\hat{\theta}_{\text{MAP}} = \operatorname{argmax}_\theta P(\theta \mid \text{Data}) = \operatorname{argmax}_\theta \frac{P(\text{Data} \mid \theta)\, P(\theta)}{P(\text{Data})}$

Slide credit: Tom Mitchell

SLIDE 29

Two principles for estimating parameters

  • Maximum Likelihood Estimate (MLE): choose $\theta$ that maximizes $P(\text{Data} \mid \theta)$

$\hat{\theta}_{\text{MLE}} = \frac{\alpha_1}{\alpha_1 + \alpha_0}$

  • Maximum a posteriori (MAP) estimate: choose $\theta$ that maximizes $P(\theta \mid \text{Data})$

$\hat{\theta}_{\text{MAP}} = \frac{\alpha_1 + \#\text{hallucinated 1s}}{(\alpha_1 + \#\text{hallucinated 1s}) + (\alpha_0 + \#\text{hallucinated 0s})}$

Slide credit: Tom Mitchell

SLIDE 30

Maximum likelihood estimate

  • Each flip yields a Boolean value for $X$

$X \sim \text{Bernoulli}$: $P(X) = \theta^X (1 - \theta)^{1 - X}$, i.e., $P(X = 1) = \theta$ and $P(X = 0) = 1 - \theta$

  • A data set $D$ of independent, identically distributed (iid) flips produces $\alpha_1$ ones and $\alpha_0$ zeros

$P(D \mid \theta) = P(\alpha_1, \alpha_0 \mid \theta) = \theta^{\alpha_1} (1 - \theta)^{\alpha_0}$

$\hat{\theta} = \operatorname{argmax}_\theta P(D \mid \theta) = \frac{\alpha_1}{\alpha_1 + \alpha_0}$

Slide credit: Tom Mitchell
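
The MLE formula as code, applied to the two cases from the earlier coin-flip slide:

```python
def mle(a1, a0):
    """theta_hat = a1 / (a1 + a0), for a1 observed heads and a0 observed tails."""
    return a1 / (a1 + a0)

print(mle(51, 49))  # Case A: 100 flips -> 0.51
print(mle(2, 1))    # Case B: 3 flips -> 0.667, same rule but far less data behind it
```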

SLIDE 31

Beta prior distribution P(θ)

  • $P(\theta) = \text{Beta}(\beta_1, \beta_0) = \frac{1}{B(\beta_1, \beta_0)}\, \theta^{\beta_1 - 1} (1 - \theta)^{\beta_0 - 1}$

Slide credit: Tom Mitchell

SLIDE 32

Maximum a posteriori (MAP) estimate

  • A data set $D$ of iid flips produces $\alpha_1$ ones and $\alpha_0$ zeros

$P(D \mid \theta) = P(\alpha_1, \alpha_0 \mid \theta) = \theta^{\alpha_1} (1 - \theta)^{\alpha_0}$

  • Assume a Beta prior (conjugate prior: closed-form representation of the posterior)

$P(\theta) = \text{Beta}(\beta_1, \beta_0) = \frac{1}{B(\beta_1, \beta_0)}\, \theta^{\beta_1 - 1} (1 - \theta)^{\beta_0 - 1}$

$\hat{\theta} = \operatorname{argmax}_\theta P(D \mid \theta)\, P(\theta) = \frac{\alpha_1 + \beta_1 - 1}{(\alpha_1 + \beta_1 - 1) + (\alpha_0 + \beta_0 - 1)}$

Slide credit: Tom Mitchell
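
The MAP formula as code; a sketch in which the Beta($\beta_1$, $\beta_0$) prior acts like $\beta_1 - 1$ hallucinated heads and $\beta_0 - 1$ hallucinated tails (the prior values below are my own):

```python
def map_estimate(a1, a0, b1, b0):
    """(a1 + b1 - 1) / ((a1 + b1 - 1) + (a0 + b0 - 1)) with a Beta(b1, b0) prior."""
    return (a1 + b1 - 1) / ((a1 + b1 - 1) + (a0 + b0 - 1))

# 3 real flips (2 heads, 1 tail) plus a Beta(3, 3) prior: the estimate is
# pulled from the MLE of 0.667 toward 0.5.
print(map_estimate(2, 1, 3, 3))  # 0.571
```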

SLIDE 33

Some terminology

  • Likelihood function $P(\text{Data} \mid \theta)$
  • Prior $P(\theta)$
  • Posterior $P(\theta \mid \text{Data})$
  • Conjugate prior: a prior $P(\theta)$ is the conjugate prior for a likelihood function $P(\text{Data} \mid \theta)$ if the prior $P(\theta)$ and the posterior $P(\theta \mid \text{Data})$ have the same form
  • Example (coin flip problem):
  • Prior $P(\theta)$: $\text{Beta}(\beta_1, \beta_0)$
  • Likelihood $P(\text{Data} \mid \theta)$: Binomial, $\theta^{\alpha_1} (1 - \theta)^{\alpha_0}$
  • Posterior $P(\theta \mid \text{Data})$: $\text{Beta}(\alpha_1 + \beta_1, \alpha_0 + \beta_0)$

Slide credit: Tom Mitchell

SLIDE 34

How many parameters?

  • Suppose $X = [X_1, \cdots, X_n]$, where the $X_i$ and $Y$ are Boolean random variables
  • To estimate $P(Y \mid X_1, \cdots, X_n)$:
  • When $n = 2$ (Gender, Hours-worked)?
  • When $n = 30$?

Slide credit: Tom Mitchell

SLIDE 35

Can we reduce the number of parameters using Bayes rule?

$P(Y \mid X) = \frac{P(X \mid Y)\, P(Y)}{P(X)}$

  • How many parameters for $P(X_1, \cdots, X_n \mid Y)$? $(2^n - 1) \times 2$
  • How many parameters for $P(Y)$? 1

Slide credit: Tom Mitchell

SLIDE 36

Todayโ€™s plan

  • Probability basics
  • Estimating parameters from data
  • Maximum likelihood (ML)
  • Maximum a posteriori estimation (MAP)
  • Naive Bayes
SLIDE 37

Naïve Bayes

  • Assumption:

$P(X_1, \cdots, X_n \mid Y) = \prod_{i=1}^{n} P(X_i \mid Y)$

  • i.e., $X_i$ and $X_j$ are conditionally independent given $Y$ for $i \neq j$

Slide credit: Tom Mitchell

SLIDE 38

Conditional independence

  • Definition: $X$ is conditionally independent of $Y$ given $Z$ if the probability distribution governing $X$ is independent of the value of $Y$, given the value of $Z$:

$\forall i, j, k \quad P(X = x_i \mid Y = y_j, Z = z_k) = P(X = x_i \mid Z = z_k)$

i.e., $P(X \mid Y, Z) = P(X \mid Z)$

Example: P(Thunder | Rain, Lightning) = P(Thunder | Lightning)

Slide credit: Tom Mitchell

SLIDE 39

Applying conditional independence

  • Naïve Bayes assumes the $X_i$ are conditionally independent given $Y$

e.g., $P(X_1 \mid X_2, Y) = P(X_1 \mid Y)$

$P(X_1, X_2 \mid Y) = P(X_1 \mid X_2, Y)\, P(X_2 \mid Y) = P(X_1 \mid Y)\, P(X_2 \mid Y)$

General form: $P(X_1, \cdots, X_n \mid Y) = \prod_{i=1}^{n} P(X_i \mid Y)$

How many parameters to describe $P(X_1, \cdots, X_n \mid Y)$? $P(Y)$?

  • Without the conditional independence assumption?
  • With the conditional independence assumption?

Slide credit: Tom Mitchell

SLIDE 40

Naïve Bayes classifier

  • Bayes rule:

$P(Y = y_k \mid X_1, \cdots, X_n) = \frac{P(Y = y_k)\, P(X_1, \cdots, X_n \mid Y = y_k)}{\sum_j P(Y = y_j)\, P(X_1, \cdots, X_n \mid Y = y_j)}$

  • Assume conditional independence among the $X_i$'s:

$P(Y = y_k \mid X_1, \cdots, X_n) = \frac{P(Y = y_k)\, \prod_i P(X_i \mid Y = y_k)}{\sum_j P(Y = y_j)\, \prod_i P(X_i \mid Y = y_j)}$

  • Pick the most probable $Y$:

$\hat{Y} \leftarrow \operatorname{argmax}_{y_k} P(Y = y_k)\, \prod_i P(X_i \mid Y = y_k)$

Slide credit: Tom Mitchell

SLIDE 41

Naïve Bayes algorithm – discrete $X_i$

  • For each value $y_k$:

Estimate $\pi_k = P(Y = y_k)$

For each value $x_{ij}$ of each attribute $X_i$:

Estimate $\theta_{ijk} = P(X_i = x_{ij} \mid Y = y_k)$

  • Classify $X^{\text{test}}$ (a runnable sketch follows after the MLE estimates on the next slide):

$\hat{Y} \leftarrow \operatorname{argmax}_{y_k} P(Y = y_k)\, \prod_i P(X_i^{\text{test}} \mid Y = y_k)$

$\hat{Y} \leftarrow \operatorname{argmax}_{y_k} \pi_k \prod_i \theta_{ijk}$

SLIDE 42

Estimating parameters: discrete $Y$, $X_i$

  • Maximum likelihood estimates (MLE):

$\hat{\pi}_k = \hat{P}(Y = y_k) = \frac{\#D\{Y = y_k\}}{|D|}$

$\hat{\theta}_{ijk} = \hat{P}(X_i = x_{ij} \mid Y = y_k) = \frac{\#D\{X_i = x_{ij} \wedge Y = y_k\}}{\#D\{Y = y_k\}}$

Slide credit: Tom Mitchell
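
A compact sketch of the discrete Naive Bayes algorithm using the MLE counts above; the tiny weather-style dataset is invented for illustration.

```python
from collections import Counter, defaultdict

data = [  # (x_1, x_2) -> y
    (("sunny", "hot"), "no"), (("sunny", "mild"), "no"),
    (("rainy", "mild"), "yes"), (("rainy", "cool"), "yes"),
    (("sunny", "cool"), "yes"), (("rainy", "hot"), "no"),
]

# pi_k = #D{Y = y_k} / |D|
class_counts = Counter(y for _, y in data)
pi = {y: c / len(data) for y, c in class_counts.items()}

# theta_ijk = #D{X_i = x_ij and Y = y_k} / #D{Y = y_k}
theta = defaultdict(float)
for x, y in data:
    for i, v in enumerate(x):
        theta[(i, v, y)] += 1 / class_counts[y]

def classify(x):
    # Y_hat = argmax_k pi_k * prod_i theta_ijk
    def score(y):
        s = pi[y]
        for i, v in enumerate(x):
            s *= theta[(i, v, y)]
        return s
    return max(pi, key=score)

print(classify(("rainy", "mild")))  # "yes"
```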

SLIDE 43
  • F = 1 iff you live in Fox Ridge
  • S = 1 iff you watched the Super Bowl last night
  • D = 1 iff you drive to VT
  • G = 1 iff you went to the gym in the last month

Parameters to estimate:

P(F = 1) =          P(F = 0) =
P(S = 1 | F = 1) =  P(S = 0 | F = 1) =
P(S = 1 | F = 0) =  P(S = 0 | F = 0) =
P(D = 1 | F = 1) =  P(D = 0 | F = 1) =
P(D = 1 | F = 0) =  P(D = 0 | F = 0) =
P(G = 1 | F = 1) =  P(G = 0 | F = 1) =
P(G = 1 | F = 0) =  P(G = 0 | F = 0) =

$P(F \mid S, D, G) \propto P(F)\, P(S \mid F)\, P(D \mid F)\, P(G \mid F)$

SLIDE 44

Naïve Bayes: Subtlety #1

  • Often the $X_i$ are not really conditionally independent
  • Naïve Bayes often works pretty well anyway
  • Often gives the right classification, even when not the right probability [Domingos & Pazzani, 1996]
  • What is the effect on the estimated P(Y|X)?
  • What if we have two copies, $X_i = X_j$? (see the sketch below)

$P(Y = y_k \mid X_1, \cdots, X_n) \propto P(Y = y_k) \prod_i P(X_i \mid Y = y_k)$

Slide credit: Tom Mitchell
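
A quick numeric illustration (numbers mine) of the duplicated-feature case $X_i = X_j$: Naive Bayes multiplies the same evidence in twice, so the posterior becomes overconfident even when the argmax does not change.

```python
p_y = 0.5                    # P(Y=1) = P(Y=0)
p_x_y1, p_x_y0 = 0.8, 0.4    # P(X=1|Y=1), P(X=1|Y=0)

def posterior(copies):
    # P(Y=1 | X=1 repeated `copies` times) under the (wrong) independence assumption
    s1 = p_y * p_x_y1 ** copies
    s0 = (1 - p_y) * p_x_y0 ** copies
    return s1 / (s1 + s0)

print(posterior(1))  # 0.667: the correct posterior for one observation of X=1
print(posterior(2))  # 0.800: the duplicate inflates confidence
```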

SLIDE 45

Naïve Bayes: Subtlety #2

  • The MLE estimate for $P(X_i \mid Y = y_k)$ might be zero (for example, $X_i$ = birthdate, $X_i$ = Feb_4_1995)
  • Why worry about just one parameter out of many? Because a single zero factor zeroes out the entire product:

$P(Y = y_k \mid X_1, \cdots, X_n) \propto P(Y = y_k) \prod_i P(X_i \mid Y = y_k)$

  • What can we do to address this?
  • MAP estimates (adding "imaginary" examples)

Slide credit: Tom Mitchell

SLIDE 46

Estimating parameters: discrete $Y$, $X_i$

  • Maximum likelihood estimates (MLE):

$\hat{\pi}_k = \hat{P}(Y = y_k) = \frac{\#D\{Y = y_k\}}{|D|}$

$\hat{\theta}_{ijk} = \hat{P}(X_i = x_{ij} \mid Y = y_k) = \frac{\#D\{X_i = x_{ij} \wedge Y = y_k\}}{\#D\{Y = y_k\}}$

  • MAP estimates (Dirichlet priors):

$\hat{\pi}_k = \hat{P}(Y = y_k) = \frac{\#D\{Y = y_k\} + (\beta_k - 1)}{|D| + \sum_m (\beta_m - 1)}$

$\hat{\theta}_{ijk} = \hat{P}(X_i = x_{ij} \mid Y = y_k) = \frac{\#D\{X_i = x_{ij} \wedge Y = y_k\} + (\beta_k - 1)}{\#D\{Y = y_k\} + \sum_m (\beta_m - 1)}$

Slide credit: Tom Mitchell
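
The MAP (smoothed) estimate above as code; a sketch in which setting $\beta = 2$ for every outcome reduces to add-one (Laplace) smoothing, so no $\theta_{ijk}$ is ever exactly zero.

```python
def smoothed_theta(count_xy, count_y, n_values, beta=2):
    # (#D{X_i=x_ij and Y=y_k} + (beta-1)) / (#D{Y=y_k} + n_values * (beta-1)),
    # assuming the same beta for every value of the attribute
    return (count_xy + (beta - 1)) / (count_y + n_values * (beta - 1))

# An attribute value never seen with this class (count_xy = 0), out of 10
# examples of the class, with 3 possible values for the attribute:
print(smoothed_theta(0, 10, 3))  # 0.077 instead of an MLE of exactly 0
```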

SLIDE 47

What if we have continuous $X_i$?

  • Gaussian Naïve Bayes (GNB): assume

$P(X_i = x \mid Y = y_k) = \frac{1}{\sqrt{2\pi}\,\sigma_{ik}} \exp\left( -\frac{(x - \mu_{ik})^2}{2\sigma_{ik}^2} \right)$

  • Additional assumptions on $\sigma_{ik}$:
  • independent of $Y$ ($\sigma_i$)
  • independent of $X_i$ ($\sigma_k$)
  • independent of both $X_i$ and $Y$ ($\sigma$)

Slide credit: Tom Mitchell

SLIDE 48

Naïve Bayes algorithm – continuous $X_i$

  • For each value $y_k$:

Estimate $\pi_k = P(Y = y_k)$

For each attribute $X_i$, estimate the class-conditional mean $\mu_{ik}$ and variance $\sigma_{ik}$

  • Classify $X^{\text{test}}$ (a sketch follows below):

$\hat{Y} \leftarrow \operatorname{argmax}_{y_k} P(Y = y_k)\, \prod_i P(X_i^{\text{test}} \mid Y = y_k)$

$\hat{Y} \leftarrow \operatorname{argmax}_{y_k} \pi_k \prod_i \text{Normal}(X_i^{\text{test}}; \mu_{ik}, \sigma_{ik})$

Slide credit: Tom Mitchell
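
A minimal Gaussian Naive Bayes sketch (data and names mine) following the recipe above: per-class priors, per-class/per-feature means and variances, then the argmax of $\pi_k \prod_i \text{Normal}(x_i; \mu_{ik}, \sigma_{ik})$, computed in log space to avoid underflow.

```python
import numpy as np

X = np.array([[170.0, 60.0], [180.0, 80.0], [160.0, 50.0], [175.0, 75.0]])
y = np.array([0, 1, 0, 1])

classes = np.unique(y)
pi = {k: np.mean(y == k) for k in classes}
mu = {k: X[y == k].mean(axis=0) for k in classes}
var = {k: X[y == k].var(axis=0) + 1e-6 for k in classes}  # small floor for stability

def classify(x):
    def log_score(k):  # log pi_k + sum_i log Normal(x_i; mu_ik, sigma_ik^2)
        log_pdf = -0.5 * np.log(2 * np.pi * var[k]) - (x - mu[k]) ** 2 / (2 * var[k])
        return np.log(pi[k]) + log_pdf.sum()
    return max(classes, key=log_score)

print(classify(np.array([172.0, 65.0])))  # 0
```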

SLIDE 49

Things to remember

  • Probability basics
  • Estimating parameters from data
  • Maximum likelihood (ML): maximize $P(\text{Data} \mid \theta)$
  • Maximum a posteriori (MAP) estimation: maximize $P(\theta \mid \text{Data})$
  • Naive Bayes

$P(Y = y_k \mid X_1, \cdots, X_n) \propto P(Y = y_k) \prod_i P(X_i \mid Y = y_k)$

SLIDE 50

Next class

  • Logistic regression