Applied Machine Learning: Naive Bayes (PowerPoint presentation)
Siamak Ravanbakhsh, COMP 551 (Winter 2020)


slide-1
SLIDE 1

Applied Machine Learning

Naive Bayes

Siamak Ravanbakhsh

COMP 551 (Winter 2020)

slide-2
SLIDE 2

Learning objectives

  • generative vs. discriminative classifiers
  • the Naive Bayes classifier assumption
  • different design choices

slide-10
SLIDE 10

Discriminative vs generative classification

so far we modeled the conditional distribution p(y ∣ x): discriminative

a generative classifier instead learns the joint distribution p(y, x) = p(y) p(x ∣ y)

how to classify a new input x? use Bayes rule:

p(y = c ∣ x) = p(c) p(x ∣ c) / p(x) = p(c) p(x ∣ c) / ∑_{c′=1}^{C} p(x, c′)

  • p(c): prior class probability, the frequency of observing this label
  • p(x ∣ c): likelihood of the input features given the class label (the input features for each label come from a different distribution, e.g. p(x ∣ y = 0) vs. p(x ∣ y = 1))
  • p(y = c ∣ x): posterior probability of a given class
  • p(x) = ∑_{c′=1}^{C} p(x, c′): marginal probability of the input (the evidence)

image: https://rpsychologist.com

slide-16
SLIDE 16

Winter 2020 | Applied Machine Learning (COMP551)

Example: Bayes rule for classification

is the patient having cancer?

y ∈ {yes, no}; x ∈ {−, +} is the test result, a single binary feature

p(c ∣ x) = p(c) p(x ∣ c) / p(x)

prior: 1% of the population has cancer, p(yes) = .01

likelihood: p(+ ∣ yes) = .9, the TP rate of the test (90%); p(+ ∣ no) = .05, the FP rate of the test (5%)

evidence: p(+) = p(yes) p(+ ∣ yes) + p(no) p(+ ∣ no) = .01 × .9 + .99 × .05 = .0585

posterior: p(yes ∣ +) = .01 × .9 / .0585 ≈ .15

in a generative classifier the likelihood and prior class probabilities are learned from data
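As a check, the Bayes-rule arithmetic above can be reproduced in a few lines of plain Python; note that the stated 1% prior, 90% TP rate, and 5% FP rate give an evidence of .0585 and a posterior of about .15:

```python
# Bayes rule for the cancer-test example, using the rates stated above
p_yes = 0.01            # prior: 1% of the population has cancer
p_pos_given_yes = 0.9   # TP rate of the test (90%)
p_pos_given_no = 0.05   # FP rate of the test (5%)

# evidence: total probability of a positive test result
p_pos = p_yes * p_pos_given_yes + (1 - p_yes) * p_pos_given_no

# posterior: probability of cancer given a positive test
p_yes_given_pos = p_yes * p_pos_given_yes / p_pos
print(round(p_pos, 4), round(p_yes_given_pos, 2))  # 0.0585 0.15
```

Even with a 90% TP rate, the posterior stays small because the prior is only 1%.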

slide-17
SLIDE 17

Generative classification

p(y = c ∣ x) = p(c) p(x ∣ c) / ∑_{c′=1}^{C} p(x, c′)

with the prior class probability p(c), the likelihood p(x ∣ c) of the input features given the class label, the posterior probability of a given class, and the marginal probability of the input (the evidence) in the denominator, as before

Some generative classifiers:
  • Gaussian Discriminant Analysis: the likelihood is a multivariate Gaussian
  • Naive Bayes: the likelihood decomposes over features

image: https://rpsychologist.com

slide-21
SLIDE 21

Naive Bayes: model

assumption about the likelihood: p(x ∣ y) = ∏_{d=1}^{D} p(x_d ∣ y), where D is the number of input features

when is this assumption correct? when the features are conditionally independent given the label (x_i ⊥ x_j ∣ y): knowing the label, the value of one input feature gives us no information about the other input features

chain rule of probability (true for any distribution):

p(x ∣ y) = p(x_1 ∣ y) p(x_2 ∣ y, x_1) p(x_3 ∣ y, x_1, x_2) ⋯ p(x_D ∣ y, x_1, …, x_{D−1})

conditional independence assumption: x_1, x_2 give no extra information, so p(x_3 ∣ y, x_1, x_2) = p(x_3 ∣ y), and likewise for every other factor
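To make the factorized likelihood concrete, here is a tiny sketch with D = 2 binary features and hypothetical per-feature parameters; since p(x ∣ y) = ∏_d p(x_d ∣ y) is a product of valid distributions, it sums to 1 over all feature configurations:

```python
import itertools

# hypothetical class-conditional marginals p(x_d = 1 | y) for D = 2 binary features
theta = {0: [0.2, 0.7], 1: [0.9, 0.4]}

def naive_bayes_likelihood(x, y):
    # p(x | y) = prod_d p(x_d | y): the Naive Bayes factorization
    p = 1.0
    for d, xd in enumerate(x):
        p *= theta[y][d] if xd == 1 else 1 - theta[y][d]
    return p

# a product of valid per-feature distributions is itself a valid distribution:
# the factorized likelihood sums to 1 over all 2^D feature configurations
for y in (0, 1):
    total = sum(naive_bayes_likelihood(x, y)
                for x in itertools.product([0, 1], repeat=2))
    print(y, round(total, 10))
```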

slide-25
SLIDE 25

Naive Bayes: objective

given the training dataset D = {(x^(1), y^(1)), …, (x^(N), y^(N))}

maximize the joint likelihood (contrast with logistic regression, which maximizes the conditional likelihood):

ℓ(w, u) = ∑_n log p_{u,w}(x^(n), y^(n))
        = ∑_n log p_u(y^(n)) + ∑_n log p_w(x^(n) ∣ y^(n))
        = ∑_n log p_u(y^(n)) + ∑_d ∑_n log p_{w[d]}(x_d^(n) ∣ y^(n))   (using the Naive Bayes assumption)

the objective splits into separate MLE estimates for each part

slide-27
SLIDE 27

Naive Bayes: train-test

given the training dataset D = {(x^(1), y^(1)), …, (x^(N), y^(N))}

training time: learn the prior class probabilities p_u(y) and the likelihood components p_{w[d]}(x_d ∣ y) for all d

test time: find the posterior class probabilities

arg max_c p(c ∣ x) = arg max_c [p_u(c) ∏_{d=1}^{D} p_{w[d]}(x_d ∣ c)] / [∑_{c′=1}^{C} p_u(c′) ∏_{d=1}^{D} p_{w[d]}(x_d ∣ c′)] = arg max_c p_u(c) ∏_{d=1}^{D} p_{w[d]}(x_d ∣ c)

(the evidence in the denominator is the same for every class, so it can be dropped)
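The test-time rule above can be sketched as a generic function (hypothetical helper names; working in log space avoids numerical underflow from the product over features):

```python
import numpy as np

def predict(log_prior, log_lik_fn, x, C):
    # arg max over classes of log p_u(c) + sum_d log p_{w[d]}(x_d | c);
    # the evidence is constant in c, so it is simply dropped
    scores = np.array([log_prior[c] + log_lik_fn(x, c) for c in range(C)])
    return int(np.argmax(scores))

# toy usage: two classes, one binary feature with a Bernoulli likelihood
log_prior = np.log([0.5, 0.5])
theta = np.array([0.1, 0.9])  # hypothetical p(x = 1 | c)

def log_lik(x, c):
    return np.log(theta[c] if x == 1 else 1 - theta[c])

print(predict(log_prior, log_lik, 1, 2), predict(log_prior, log_lik, 0, 2))  # 1 0
```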

slide-32
SLIDE 32

Class prior

binary classification: the class prior is a Bernoulli distribution

p_u(y) = u^y (1 − u)^{1−y}

maximizing the log-likelihood:

ℓ(u) = ∑_{n=1}^{N} y^(n) log(u) + (1 − y^(n)) log(1 − u)
     = N_1 log(u) + (N − N_1) log(1 − u)

where N_1 is the frequency of class 1 in the dataset and N − N_1 is the frequency of class 0

setting the derivative to zero:

(d/du) ℓ(u) = N_1/u − (N − N_1)/(1 − u) = 0  ⇒  u* = N_1/N

the max-likelihood estimate (MLE) is the frequency of class labels
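The MLE u* = N_1/N is just the empirical frequency; a small sketch with toy labels, plus a numerical check that it maximizes the log-likelihood:

```python
import numpy as np

y = np.array([1, 0, 0, 1, 1, 0, 1, 1])  # toy binary labels: N = 8, N_1 = 5
u_mle = y.mean()                        # u* = N_1 / N
print(u_mle)  # 0.625

# numerical check: u* maximizes the Bernoulli log-likelihood on this data
def loglik(u):
    return np.sum(y * np.log(u) + (1 - y) * np.log(1 - u))

grid = np.linspace(0.01, 0.99, 99)
u_best = grid[np.argmax([loglik(u) for u in grid])]
assert abs(u_best - u_mle) < 0.01  # grid optimum agrees with the closed form
```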

slide-34
SLIDE 34

Class prior

multiclass classification: the class prior is a categorical distribution

p_u(y) = ∏_{c=1}^{C} u_c^{y_c}

assuming one-hot coding for the labels; u = [u_1, …, u_C] is now a parameter vector

maximizing the log-likelihood:

ℓ(u) = ∑_n ∑_c y_c^(n) log(u_c)   subject to: ∑_c u_c = 1

closed form for the optimal parameter:

u* = [N_1/N, …, N_C/N]

where N_c is the number of instances in class c and N is the number of all instances in the dataset
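With one-hot labels, the closed form u* = [N_1/N, …, N_C/N] is simply the column mean of the label matrix (toy sketch):

```python
import numpy as np

Y = np.array([[1, 0, 0],   # one-hot labels: N = 4, C = 3
              [0, 1, 0],
              [0, 1, 0],
              [0, 0, 1]])
u_mle = Y.mean(0)  # u*_c = N_c / N
print(u_mle)       # class fractions: 0.25, 0.5, 0.25
```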

slide-35
SLIDE 35

choice of likelihood distribution depends on the type of features

(likelihood encodes our assumption about "generative process")

Bernoulli: binary features Categorical: categorical features Gaussian: continuous distribution ...

Likelihood terms Likelihood terms

p(c∣x) =

p

(c) p (x ∣c )

∑c =1

′ C u

∏d=1

D w [d] d ′

p

(c) p (x ∣c)

u

∏d=1

D w [d] d

(class-conditionals)

6 . 1

slide-36
SLIDE 36

choice of likelihood distribution depends on the type of features

(likelihood encodes our assumption about "generative process")

Bernoulli: binary features Categorical: categorical features Gaussian: continuous distribution ...

Likelihood terms Likelihood terms

p(c∣x) =

p

(c) p (x ∣c )

∑c =1

′ C u

∏d=1

D w [d] d ′

p

(c) p (x ∣c)

u

∏d=1

D w [d] d

note that these are different from the choice of distribution for class prior

(class-conditionals)

6 . 1

slide-37
SLIDE 37

Likelihood terms (class-conditionals)

the choice of likelihood distribution depends on the type of features
(the likelihood encodes our assumption about the "generative process")

  • Bernoulli: binary features
  • Categorical: categorical features
  • Gaussian: continuous features
  • ...

note that these are different from the choice of distribution for the class prior

each feature x_d may use a different likelihood, with separate max-likelihood estimates for each feature:

w[d]* = arg max_{w[d]} ∑_{n=1}^{N} log p_{w[d]}(x_d^(n) ∣ y^(n))

slide-40
SLIDE 40

Bernoulli Naive Bayes

binary features: the likelihood is Bernoulli, with one parameter per label:

p_{w[d]}(x_d ∣ y = 0) = Bernoulli(x_d; w_{[d],0})
p_{w[d]}(x_d ∣ y = 1) = Bernoulli(x_d; w_{[d],1})

short form: p_{w[d]}(x_d ∣ y) = Bernoulli(x_d; w_{[d],y})

max-likelihood estimation is similar to what we saw for the prior; the closed form solution of the MLE is

w*_{[d],c} = N(y = c, x_d = 1) / N(y = c)

where N(⋅) counts the training instances satisfying the condition
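The closed-form MLE counts co-occurrences; with integer labels it reduces to a per-class mean of each binary feature (toy sketch):

```python
import numpy as np

X = np.array([[1, 0],       # N x D binary features
              [1, 1],
              [0, 1],
              [0, 0]])
y = np.array([0, 0, 1, 1])  # N integer labels

# w[d, c] = N(y = c, x_d = 1) / N(y = c): mean of each feature within each class
w = np.stack([X[y == c].mean(0) for c in range(2)], axis=1)  # D x C
print(w)  # feature 0: [1.0, 0.0]; feature 1: [0.5, 0.5]
```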

slide-43
SLIDE 43

Example: Bernoulli Naive Bayes

using naive Bayes for document classification: 2 classes (document types), 600 binary features

x_d^(n) = 1 iff word d is present in document n (vocabulary of 600)

w*_{[d],0} and w*_{[d],1}: likelihood of words in the two document types

def BernoulliNaiveBayes(prior,      # vector of size 2: class prior
                        likelihood, # 600 x 2: likelihood of each word under each class
                        x,          # vector of size 600: binary features for a new document
                        ):
    log_p = np.log(prior) + np.sum(x[:,None] * np.log(likelihood) +
                                   (1 - x[:,None]) * np.log(1 - likelihood), 0)
    log_p -= np.max(log_p)          # numerical stability
    posterior = np.exp(log_p)       # vector of size 2
    posterior /= np.sum(posterior)  # normalize
    return posterior                # posterior class probabilities
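A quick sanity check of the same computation on a tiny hypothetical vocabulary of 2 words instead of 600 (restated as a self-contained function so the snippet runs alone):

```python
import numpy as np

def bernoulli_nb_posterior(prior, likelihood, x):
    # same computation as BernoulliNaiveBayes, on a D-word vocabulary
    log_p = np.log(prior) + np.sum(x[:, None] * np.log(likelihood) +
                                   (1 - x[:, None]) * np.log(1 - likelihood), 0)
    log_p -= np.max(log_p)          # numerical stability
    posterior = np.exp(log_p)
    return posterior / np.sum(posterior)

prior = np.array([0.5, 0.5])
likelihood = np.array([[0.9, 0.1],   # word 0 is common in class-0 documents
                       [0.2, 0.8]])  # word 1 is common in class-1 documents
post = bernoulli_nb_posterior(prior, likelihood, np.array([1, 0]))
print(post.round(3))  # class 0 dominates when only word 0 is present
```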

slide-48
SLIDE 48

Multinomial Naive Bayes

what if we wanted to use word frequencies in document classification?

x_d^(n) is the number of times word d appears in document n

Multinomial likelihood:

p_w(x ∣ c) = [(∑_d x_d)! / ∏_{d=1}^{D} x_d!] ∏_{d=1}^{D} w_{d,c}^{x_d}

we have a parameter vector of size D for each class: C × D parameters

MLE estimates:

w*_{d,c} = [∑_n x_d^(n) y_c^(n)] / [∑_n ∑_{d′} x_{d′}^(n) y_c^(n)]
         = (count of word d in all documents labelled c) / (total word count in all documents labelled c)
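The multinomial MLE (count of word d in class c over total word count in class c) vectorizes to two lines with one-hot labels (toy counts):

```python
import numpy as np

X = np.array([[2, 1, 0],   # N x D word counts
              [0, 1, 1],
              [3, 0, 1]])
Y = np.array([[1, 0],      # N x C one-hot labels
              [1, 0],
              [0, 1]])

counts = Y.T @ X                           # C x D: count of each word per class
w = counts / counts.sum(1, keepdims=True)  # w*_{d,c}: normalized within each class
print(w)  # rows sum to 1: [0.4, 0.4, 0.2] and [0.75, 0.0, 0.25]
```

In practice a smoothing count is usually added so that unseen words do not get zero probability.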

slide-51
SLIDE 51

Gaussian Naive Bayes

Gaussian likelihood terms:

p_{w[d]}(x_d ∣ y) = N(x_d; μ_{d,y}, σ²_{d,y}) = (1/√(2π σ²_{d,y})) e^{−(x_d − μ_{d,y})² / (2σ²_{d,y})}

w[d] = (μ_{d,1}, σ_{d,1}, …, μ_{d,C}, σ_{d,C}): one mean and std. parameter for each class-feature pair

writing the log-likelihood and setting its derivative to zero we get the maximum likelihood estimates:

μ_{d,c} = (1/N_c) ∑_{n=1}^{N} x_d^(n) y_c^(n)
σ²_{d,c} = (1/N_c) ∑_{n=1}^{N} y_c^(n) (x_d^(n) − μ_{d,c})²

these are the empirical mean & std of feature x_d across the instances with label c

slide-53
SLIDE 53

Example: Gaussian Naive Bayes

classification on the Iris flowers dataset: N_c = 50 samples with D = 4 features, for each of C = 3 species of Iris flower (a classic dataset originally used by Fisher)

our setting: 3 classes, 2 features (sepal width, petal length)

slide-54
SLIDE 54

Example: Gaussian Naive Bayes

categorical class prior & Gaussian likelihood

def GaussianNaiveBayes(X,     # N x D
                       y,     # N x C (one-hot labels)
                       Xtest, # N_test x D
                       ):
    N, C = y.shape
    D = X.shape[1]
    mu, s = np.zeros((C, D)), np.zeros((C, D))
    for c in range(C):  # per-class mean and std of each feature
        inds = np.nonzero(y[:, c])[0]
        mu[c, :] = np.mean(X[inds, :], 0)
        s[c, :] = np.std(X[inds, :], 0)
    log_prior = np.log(np.mean(y, 0))[:, None]
    log_likelihood = -np.sum(np.log(s[:, None, :]) +
                             .5 * (((Xtest[None, :, :] - mu[:, None, :]) / s[:, None, :]) ** 2), 2)
    return log_prior + log_likelihood  # C x N_test (unnormalized log-posterior)

decision boundaries are not linear!

(figure: posterior class probability for c = 1)

slide-61
SLIDE 61

Example: Gaussian Naive Bayes

using the same variance for all classes: the σ-dependent terms are then identical for every class, so its value does not make a difference and can be dropped from the log-likelihood

log_likelihood = - np.sum(.5*(((Xtest[None,:,:] - mu[:,None,:]))**2), 2)

(the rest of GaussianNaiveBayes is unchanged, with the per-class std estimate removed)

decision boundaries are linear

slide-66
SLIDE 66

Decision boundary in generative classifiers

decision boundaries: two classes have the same probability,

p(y = c ∣ x) = p(y = c′ ∣ x)

which means

log [p(y = c ∣ x) / p(y = c′ ∣ x)] = log [p(c) p(x ∣ c) / (p(c′) p(x ∣ c′))]
  = log [p(c) / p(c′)] + log [p(x ∣ c) / p(x ∣ c′)]

the first term is not a function of x (ignore it); the second ratio is linear (in some bases) for a large family of probabilities (called the linear exponential family):

p(x ∣ c) = exp(w_{y,c}ᵀ ϕ(x)) / Z(w_{y,c})

log [p(x ∣ c) / p(x ∣ c′)] = (w_{y,c} − w_{y,c′})ᵀ ϕ(x) + g(w_{y,c}, w_{y,c′})

here the first term is linear using the bases ϕ(x), and g(w_{y,c}, w_{y,c′}) is not a function of x.

e.g., Bernoulli is a member of this family with ϕ(x) = x, so Bernoulli Naive Bayes has a linear decision boundary.

8
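The linearity claim can be checked numerically: for two 1D Gaussians with a shared variance (a member of the linear exponential family), the log-density ratio is linear in x. A small sketch (the means, variance, and function name are our choices):

```python
import numpy as np

# For two 1D Gaussians with a shared variance, the log-density ratio
# log p(x|c) - log p(x|c') reduces to (mu1 - mu2)/s^2 * x + const,
# i.e. it is linear in x, as the exponential-family argument predicts.
def log_ratio(x, mu1, mu2, s):
    log_gauss = lambda x, mu: -0.5 * np.log(2 * np.pi * s**2) - (x - mu)**2 / (2 * s**2)
    return log_gauss(x, mu1) - log_gauss(x, mu2)

x = np.linspace(-3, 3, 7)
r = log_ratio(x, mu1=1.0, mu2=-1.0, s=0.5)
slopes = np.diff(r) / np.diff(x)
print(np.allclose(slopes, slopes[0]))   # constant slope, i.e. linear in x -> True
```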

slide-70
SLIDE 70

Discriminative vs generative classification

discriminative: maximize the conditional likelihood p(y ∣ x); makes no assumption about p(x); often works better on larger datasets

generative: maximize the joint likelihood p(y, x) = p(y) p(x ∣ y); makes assumptions about p(x); can deal with missing values; can learn from unlabelled data; often works better on smaller datasets

9 . 1
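The "can deal with missing values" point has a concrete reading for Naive Bayes: since p(x ∣ y) factorizes over features, a missing feature is marginalized out simply by dropping its likelihood term. A minimal sketch, assuming Bernoulli likelihoods (the parameter values, data, and function name below are made up):

```python
import numpy as np

# Because Naive Bayes factorizes p(x | y) over features, a missing
# feature can be marginalized out by dropping its likelihood factor.
def nb_log_posterior(x, log_prior, theta):
    # x: length-D input with np.nan marking missing entries
    # theta: C x D Bernoulli parameters p(x_d = 1 | y = c)
    obs = ~np.isnan(x)                     # keep only observed features
    ll = (x[obs] * np.log(theta[:, obs]) +
          (1 - x[obs]) * np.log(1 - theta[:, obs])).sum(1)
    return log_prior + ll

log_prior = np.log(np.array([0.5, 0.5]))
theta = np.array([[0.9, 0.8, 0.1],
                  [0.1, 0.2, 0.9]])
x = np.array([1.0, np.nan, 0.0])           # second feature unobserved
print(nb_log_posterior(x, log_prior, theta).argmax())   # -> 0
```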

slide-71
SLIDE 71

Example: naive Bayes vs logistic regression on UCI datasets (from Ng & Jordan, 2001); m is the number of instances.

9 . 2

slide-72
SLIDE 72

Summary Summary

generative classification: learn the class prior and likelihood; Bayes rule gives the conditional class probability
Naive Bayes assumes conditional independence of features given the label (e.g., word appearances independent of each other given the document type)
class prior: Bernoulli or Categorical; likelihood: Bernoulli, Gaussian, Categorical, ...
MLE has a closed form (in contrast to logistic regression), estimated separately for each feature and each label
evaluation measures for classification: accuracy

10
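The closed-form MLE mentioned above fits in a few lines; a sketch for the Bernoulli likelihood (the toy data is ours): the class prior is the label frequency, and each theta[c, d] is the fraction of class-c examples with feature d active.

```python
import numpy as np

# Closed-form MLE for Bernoulli Naive Bayes: the class prior is the
# label frequency, and theta[c, d] is the fraction of class-c examples
# with feature d equal to 1. Toy data below is made up.
X = np.array([[1, 0, 1],
              [1, 1, 1],
              [0, 0, 1],
              [0, 1, 0]])
y = np.array([0, 0, 1, 1])

C = 2
prior = np.array([(y == c).mean() for c in range(C)])       # class frequencies
theta = np.array([X[y == c].mean(0) for c in range(C)])     # per-class feature means
print(prior)
print(theta)
```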

slide-73
SLIDE 73

Measuring performance

binary classification

A side note on measuring the performance of classifiers: we use the confusion matrix, which counts the combinations of the true label y and the prediction ŷ.

11 . 1

slide-80
SLIDE 80

Measuring performance Measuring performance

binary classification

{Harmonic mean}

RP = TP + FP RN = FN + TN P = TP + FN N = FP + TN

11 . 2

use the confusion matrix to quantify difference metrics marginals:

slide-81
SLIDE 81

Measuring performance Measuring performance

binary classification

Accuracy =

P+N TP+TN

{Harmonic mean}

RP = TP + FP RN = FN + TN P = TP + FN N = FP + TN

11 . 2

use the confusion matrix to quantify difference metrics marginals:

slide-82
SLIDE 82

Measuring performance Measuring performance

binary classification

Accuracy =

P+N TP+TN

{Harmonic mean}

RP = TP + FP RN = FN + TN P = TP + FN N = FP + TN

Error rate =

P+N FP+FN

11 . 2

use the confusion matrix to quantify difference metrics marginals:

slide-83
SLIDE 83

Measuring performance Measuring performance

binary classification

Accuracy =

P+N TP+TN

Precision =

RP TP

{Harmonic mean}

RP = TP + FP RN = FN + TN P = TP + FN N = FP + TN

Error rate =

P+N FP+FN

11 . 2

use the confusion matrix to quantify difference metrics marginals:

slide-84
SLIDE 84

Measuring performance Measuring performance

binary classification

Accuracy =

P+N TP+TN

Recall =

P TP

Precision =

RP TP

{Harmonic mean}

RP = TP + FP RN = FN + TN P = TP + FN N = FP + TN

Error rate =

P+N FP+FN

11 . 2

use the confusion matrix to quantify difference metrics marginals:

slide-85
SLIDE 85

Measuring performance

binary classification

use the confusion matrix to quantify different metrics.

marginals: RP = TP + FP, RN = FN + TN, P = TP + FN, N = FP + TN

Accuracy = (TP + TN) / (P + N)
Error rate = (FP + FN) / (P + N)
Precision = TP / RP
Recall = TP / P
F1 score = 2 (Precision × Recall) / (Precision + Recall)  {harmonic mean}

11 . 2
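The definitions above translate directly into code; a small sketch (the counts are made up):

```python
# Metrics from a binary confusion matrix; the counts are arbitrary.
TP, FP, FN, TN = 40, 10, 5, 45

P, N = TP + FN, FP + TN        # actual positives / negatives
RP, RN = TP + FP, FN + TN      # predicted positive / negative totals (marginals)

accuracy   = (TP + TN) / (P + N)
error_rate = (FP + FN) / (P + N)
precision  = TP / RP
recall     = TP / P
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean

print(accuracy, precision, round(recall, 3), round(f1, 3))   # 0.85 0.8 0.889 0.842
```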

slide-86
SLIDE 86

Measuring performance

binary classification

less common metrics (same marginals as before):

Miss rate = FN / P
Fallout = FP / N
False discovery rate = FP / RP
Selectivity = TN / N
False omission rate = FN / RN
Negative predictive value = TN / RN

11 . 3
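The less common rates follow the same pattern; a sketch with the same kind of made-up counts:

```python
# "Less common" rates from a binary confusion matrix (counts made up).
TP, FP, FN, TN = 40, 10, 5, 45
P, N = TP + FN, FP + TN
RP, RN = TP + FP, FN + TN

miss_rate   = FN / P    # 1 - recall
fallout     = FP / N    # false alarm rate
fdr         = FP / RP   # false discovery rate
selectivity = TN / N    # true negative rate
false_omission_rate = FN / RN
npv         = TN / RN   # negative predictive value

print(round(miss_rate, 3), round(fallout, 3), fdr)   # 0.111 0.182 0.2
```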

slide-87
SLIDE 87

Threshold invariant: ROC & AUC

the ROC curve traces, as a function of the decision threshold:

TPR = TP / P (recall, sensitivity)
FPR = FP / N (fallout, false alarm)

11 . 4
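An ROC curve can be traced by sweeping the threshold over the classifier's scores and recording (FPR, TPR) pairs; AUC is the area under that curve. A minimal sketch (the toy scores and labels are ours; AUC is computed with the trapezoidal rule):

```python
import numpy as np

# Sweep a decision threshold over classifier scores and record
# TPR = TP/P and FPR = FP/N at each setting, as defined on the slide.
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2])
labels = np.array([0,   0,   1,    1,   1,   0])

thresholds = np.sort(np.unique(scores))[::-1]   # high to low
P, N = labels.sum(), (1 - labels).sum()
tpr = np.array([((scores >= t) & (labels == 1)).sum() / P for t in thresholds])
fpr = np.array([((scores >= t) & (labels == 0)).sum() / N for t in thresholds])

# pad with the (0, 0) and (1, 1) endpoints, then trapezoidal area
fpr_c = np.concatenate([[0.0], fpr, [1.0]])
tpr_c = np.concatenate([[0.0], tpr, [1.0]])
auc = np.sum(np.diff(fpr_c) * (tpr_c[1:] + tpr_c[:-1]) / 2)
print(round(auc, 3))   # 0.889
```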