Probabilistic classification. CE-717: Machine Learning, Sharif University of Technology.

SLIDE 1

Probabilistic classification

CE-717: Machine Learning

Sharif University of Technology

M. Soleymani

Fall 2016

SLIDE 2

Topics

- Probabilistic approach
- Bayes decision theory
- Generative models
  - Gaussian Bayes classifier
  - Naïve Bayes
- Discriminative models
  - Logistic regression

SLIDE 3

Classification problem: probabilistic view

- Each feature is treated as a random variable
- The class label is also treated as a random variable
- We observe the feature values for a random sample and intend to find its class label
- Evidence: feature vector x
- Query: class label

SLIDE 4

Definitions

- Posterior probability: p(Ci | x)
- Likelihood or class-conditional probability: p(x | Ci)
- Prior probability: p(Ci)

p(x): pdf of the feature vector x,  p(x) = Σ_{i=1}^{K} p(x | Ci) p(Ci)
p(x | Ci): pdf of the feature vector x for samples of class Ci
p(Ci): probability that the label is Ci

SLIDE 5

Bayes decision rule

K = 2 classes:

  P(error | x) = p(C2 | x)  if we decide C1
                 p(C1 | x)  if we decide C2

- If p(C1 | x) > p(C2 | x), decide C1; otherwise decide C2.
- If we use the Bayes decision rule:

  P(error | x) = min{ p(C1 | x), p(C2 | x) }

- Using the Bayes rule, for each x, P(error | x) is as small as possible, and thus this rule minimizes the probability of error.

SLIDE 6

Optimal classifier

- The optimal decision is the one that minimizes the expected number of mistakes.
- We show that the Bayes classifier is an optimal classifier.

SLIDE 7

Bayes decision rule: minimizing misclassification rate

K = 2 classes:

- Decision regions: ℛi = {x | α(x) = i}
- All points in ℛi are assigned to class Ci
- Choose the class with the highest p(Ci | x) as α(x)

  P(error) = E_{x,y}[ I(α(x) ≠ y) ]
           = p(x ∈ ℛ1, C2) + p(x ∈ ℛ2, C1)
           = ∫_{ℛ1} p(x, C2) dx + ∫_{ℛ2} p(x, C1) dx
           = ∫_{ℛ1} p(C2 | x) p(x) dx + ∫_{ℛ2} p(C1 | x) p(x) dx

SLIDE 8

Bayes minimum error

- Bayes minimum error classifier (zero-one loss):

  min over α(·) of E_{x,y}[ I(α(x) ≠ y) ]

- If we know the probabilities in advance, the above optimization problem is solved easily:

  α(x) = argmax_y p(y | x)

- In practice, we estimate p(y | x) from a set of training samples D.

SLIDE 9

Bayes theorem

- Bayes' theorem:

  p(Ci | x) = p(x | Ci) p(Ci) / p(x)
  (posterior = likelihood × prior / evidence)

- Posterior probability: p(Ci | x)
- Likelihood or class-conditional probability: p(x | Ci)
- Prior probability: p(Ci)

p(x): pdf of the feature vector x,  p(x) = Σ_{i=1}^{K} p(x | Ci) p(Ci)
p(x | Ci): pdf of the feature vector x for samples of class Ci
p(Ci): probability that the label is Ci

SLIDE 10

Bayes decision rule: example

- Bayes decision: choose the class with the highest p(Ci | x)

[Figure: class-conditional densities p(x|C1) and p(x|C2) with priors p(C1) = 2/3, p(C2) = 1/3, the resulting posteriors p(C1|x), p(C2|x), and the decision regions]

  p(Ci | x) = p(x | Ci) p(Ci) / p(x),   p(x) = p(C1) p(x | C1) + p(C2) p(x | C2)

SLIDE 11

Bayesian decision rule

- If p(C1 | x) > p(C2 | x), decide C1; otherwise decide C2.
- Equivalently: if p(x | C1) p(C1) / p(x) > p(x | C2) p(C2) / p(x), decide C1; otherwise decide C2.
- Equivalently: if p(x | C1) p(C1) > p(x | C2) p(C2), decide C1; otherwise decide C2.
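
A minimal NumPy sketch of this two-class rule, assuming the class-conditional densities are known univariate Gaussians; the densities and priors below are made up for illustration, not taken from the slides:

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

priors = {1: 2 / 3, 2: 1 / 3}                        # p(C1), p(C2)
likelihoods = {1: lambda x: gauss_pdf(x, 0.0, 1.0),  # assumed p(x | C1)
               2: lambda x: gauss_pdf(x, 2.0, 1.0)}  # assumed p(x | C2)

def bayes_decide(x):
    # Decide C1 iff p(x|C1) p(C1) > p(x|C2) p(C2); the evidence p(x) cancels.
    score = {c: likelihoods[c](x) * priors[c] for c in (1, 2)}
    return 1 if score[1] > score[2] else 2

print(bayes_decide(0.5), bayes_decide(2.5))   # C1 for x = 0.5, C2 for x = 2.5
```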

SLIDE 12

Bayes decision rule: example

- Bayes decision: choose the class with the highest p(Ci | x)

[Figure: the same example as before; since p(C1) = 2/3 and p(C2) = 1/3, comparing posteriors amounts to comparing 2 × p(x|C1) against p(x|C2), which determines the decision regions]

SLIDE 13

Bayes classifier

- Simple Bayes classifier: estimate the posterior probability of each class
- What should the decision criterion be?
  - Choose the class with the highest p(Ci | x)
- The optimal decision is the one that minimizes the expected number of mistakes.

SLIDE 14

Diabetes example

- Observation: white blood cell count

This example has been adapted from Sanja Fidler's slides, University of Toronto, CSC411.

SLIDE 15

Diabetes example

- The doctor has a prior: p(y = 1) = 0.2
  - Prior: in the absence of any observation, what do I know about the probability of the classes?
- A patient comes in with white blood cell count x.
- Does the patient have diabetes, i.e., what is p(y = 1 | x)?
  - Given a new observation, we still need to compute the posterior.

SLIDE 16

Diabetes example

[Figure: class-conditional distributions of the white blood cell count, p(x | y = 0) and p(x | y = 1)]

This example has been adapted from Sanja Fidler's slides, University of Toronto, CSC411.

SLIDE 17

Estimate probability densities from data

- Assume Gaussian distributions for p(x | C1) and p(x | C2).
- Recall that for samples {x^(1), ..., x^(N)}, if we assume a Gaussian distribution, the MLE estimates are the sample mean and the sample variance.

SLIDE 18

Diabetes example

This example has been adapted from Sanja Fidler's slides, University of Toronto, CSC411.

  p(x | y = 1) = N(μ1, σ1²)

  μ1  = Σ_{n: y^(n)=1} x^(n) / Σ_{n: y^(n)=1} 1 = (1/N1) Σ_{n: y^(n)=1} x^(n)
  σ1² = (1/N1) Σ_{n: y^(n)=1} (x^(n) − μ1)²

[Figure: fitted Gaussians p(x | y = 0) and p(x | y = 1)]
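
A minimal NumPy sketch of these MLE formulas, assuming x and y are arrays holding the feature values and 0/1 labels; the data below are made up for illustration:

```python
import numpy as np

# Illustrative data: x holds feature values, y holds 0/1 labels
x = np.array([3.2, 5.1, 4.8, 7.9, 8.3, 6.7, 4.0, 9.1])
y = np.array([0,   0,   0,   1,   1,   1,   0,   1  ])

def gaussian_mle(x, y, label):
    # MLE for one class: sample mean and (biased) sample variance, divided by N_label as on the slide
    xc = x[y == label]
    mu = xc.mean()
    sigma2 = ((xc - mu) ** 2).mean()
    return mu, sigma2

mu1, sigma2_1 = gaussian_mle(x, y, 1)
mu0, sigma2_0 = gaussian_mle(x, y, 0)
print(mu1, sigma2_1, mu0, sigma2_0)
```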

SLIDE 19

Diabetes example

- Add a second observation: plasma glucose value

This example has been adapted from Sanja Fidler's slides, University of Toronto, CSC411.

SLIDE 20

Generative approach for this example

- Multivariate Gaussian distributions for p(x | Ci):

  p(x | y = i) = 1 / ((2π)^(d/2) |Σi|^(1/2)) exp{ −(1/2) (x − μi)ᵀ Σi⁻¹ (x − μi) },   i = 1, 2

- Prior distribution:

  p(y = 1) = π,   p(y = 0) = 1 − π

SLIDE 21

MLE for multivariate Gaussian

- For samples {x^(1), ..., x^(N)}, if we assume a multivariate Gaussian distribution, the MLE estimates are:

  μ = (1/N) Σ_{n=1}^{N} x^(n)
  Σ = (1/N) Σ_{n=1}^{N} (x^(n) − μ)(x^(n) − μ)ᵀ

SLIDE 22

Generative approach: example

Maximum likelihood estimation on D = {(x^(n), y^(n))}_{n=1}^{N}:

- π  = N1 / N
- μ1 = (1/N1) Σ_{n=1}^{N} y^(n) x^(n),    μ2 = (1/N2) Σ_{n=1}^{N} (1 − y^(n)) x^(n)
- Σ1 = (1/N1) Σ_{n=1}^{N} y^(n) (x^(n) − μ1)(x^(n) − μ1)ᵀ
- Σ2 = (1/N2) Σ_{n=1}^{N} (1 − y^(n)) (x^(n) − μ2)(x^(n) − μ2)ᵀ

where N1 = Σ_{n=1}^{N} y^(n) and N2 = N − N1.
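
A minimal NumPy sketch of these estimators, assuming X is an N×d data matrix and y a 0/1 label vector (class 1 corresponds to y = 1, class 2 to y = 0); variable names and the toy data are mine, not from the slides:

```python
import numpy as np

def fit_gaussian_generative(X, y):
    """MLE for a two-class Gaussian generative model (separate covariances)."""
    X1, X2 = X[y == 1], X[y == 0]
    pi = len(X1) / len(X)                        # class prior p(y = 1)
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    Sigma1 = (X1 - mu1).T @ (X1 - mu1) / len(X1)
    Sigma2 = (X2 - mu2).T @ (X2 - mu2) / len(X2)
    return pi, mu1, mu2, Sigma1, Sigma2

# Tiny illustrative data set
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1.0, size=(50, 2)),
               rng.normal([3, 3], 1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(fit_gaussian_generative(X, y))
```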

SLIDE 23

Decision boundary for Gaussian Bayes classifier

The decision boundary is where p(C1 | x) = p(C2 | x):

  ln p(C1 | x) = ln p(C2 | x)
  ln p(x | C1) + ln p(C1) − ln p(x) = ln p(x | C2) + ln p(C2) − ln p(x)
  ln p(x | C1) + ln p(C1) = ln p(x | C2) + ln p(C2)

using p(Ci | x) = p(x | Ci) p(Ci) / p(x) and

  ln p(x | Ci) = −(d/2) ln 2π − (1/2) ln |Σi| − (1/2) (x − μi)ᵀ Σi⁻¹ (x − μi)
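
A minimal sketch of classifying by comparing ln p(x|Ci) + ln p(Ci); it assumes the parameters come from the fit_gaussian_generative sketch above:

```python
import numpy as np

def log_gaussian(x, mu, Sigma):
    # ln p(x | Ci) for a multivariate Gaussian
    d = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (d * np.log(2 * np.pi) + logdet
                   + diff @ np.linalg.solve(Sigma, diff))

def gaussian_bayes_predict(x, pi, mu1, mu2, Sigma1, Sigma2):
    # Compare ln p(x|C1) + ln p(C1) with ln p(x|C2) + ln p(C2)
    s1 = log_gaussian(x, mu1, Sigma1) + np.log(pi)
    s2 = log_gaussian(x, mu2, Sigma2) + np.log(1 - pi)
    return 1 if s1 > s2 else 0
```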

SLIDE 24

Decision boundary

[Figure: class-conditional densities p(x|C1) and p(x|C2), the posterior p(C1|x), and the decision boundary where p(C1|x) = p(C2|x)]

SLIDE 25

Shared covariance matrix

- When the classes share a single covariance matrix Σ = Σ1 = Σ2:

  p(x | Ci) = 1 / ((2π)^(d/2) |Σ|^(1/2)) exp{ −(1/2) (x − μi)ᵀ Σ⁻¹ (x − μi) },   i = 1, 2

- p(C1) = π,   p(C2) = 1 − π

SLIDE 26

Likelihood

  Π_{n=1}^{N} p(x^(n), y^(n) | π, μ1, μ2, Σ) = Π_{n=1}^{N} p(x^(n) | y^(n), μ1, μ2, Σ) p(y^(n) | π)

SLIDE 27

Shared covariance matrix

- Maximum likelihood estimation on D = {(x^(n), y^(n))}_{n=1}^{N}:

  π  = N1 / N
  μ1 = (1/N1) Σ_{n=1}^{N} y^(n) x^(n)
  μ2 = (1/N2) Σ_{n=1}^{N} (1 − y^(n)) x^(n)
  Σ  = (1/N) [ Σ_{n∈C1} (x^(n) − μ1)(x^(n) − μ1)ᵀ + Σ_{n∈C2} (x^(n) − μ2)(x^(n) − μ2)ᵀ ]
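
A minimal NumPy sketch of the shared-covariance (pooled) estimate, under the same assumed X, y layout as the earlier sketches:

```python
import numpy as np

def fit_shared_covariance(X, y):
    """Two-class Gaussian model with a single pooled covariance matrix."""
    X1, X2 = X[y == 1], X[y == 0]
    pi = len(X1) / len(X)
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - mu1).T @ (X1 - mu1)
    S2 = (X2 - mu2).T @ (X2 - mu2)
    Sigma = (S1 + S2) / len(X)          # pooled over both classes
    return pi, mu1, mu2, Sigma
```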

SLIDE 28

Decision boundary with a shared covariance matrix

  ln p(x | C1) + ln p(C1) = ln p(x | C2) + ln p(C2)

  ln p(x | Ci) = −(d/2) ln 2π − (1/2) ln |Σ| − (1/2) (x − μi)ᵀ Σ⁻¹ (x − μi)

Since Σ is shared, the quadratic terms in x cancel and the decision boundary is linear in x.

SLIDE 29

Bayes decision rule: multi-class misclassification rate

- Multi-class problem: probability of error of the Bayes decision rule
- It is simpler to compute the probability of a correct decision:

  P(error) = 1 − P(correct)

  P(correct) = Σ_{j=1}^{K} ∫_{ℛj} p(x, Cj) dx = Σ_{j=1}^{K} ∫_{ℛj} p(Cj | x) p(x) dx

where ℛj is the subset of the feature space assigned to class Cj by the classifier.

SLIDE 30

Bayes minimum error

- Bayes minimum error classifier (zero-one loss):

  min over α(·) of E_{x,y}[ I(α(x) ≠ y) ]

  α(x) = argmax_y p(y | x)

SLIDE 31

Minimizing Bayes risk (expected loss)

  E_{x,y}[ L(α(x), y) ] = ∫ Σ_{j=1}^{K} L(α(x), Cj) p(x, Cj) dx
                        = ∫ p(x) [ Σ_{j=1}^{K} L(α(x), Cj) p(Cj | x) ] dx

For each x we minimize the bracketed term, which is called the conditional risk.

- Bayes minimum loss (risk) decision rule:

  α(x) = argmin_{i=1,...,K} Σ_{j=1}^{K} L_ij p(Cj | x)

  L_ij: the loss of assigning a sample to Ci when the correct class is Cj
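
A minimal sketch of this minimum-risk rule, assuming a K×K loss matrix (loss[i, j] is the cost of deciding Ci when the truth is Cj) and a vector of posteriors p(Cj | x); the numbers are illustrative:

```python
import numpy as np

def min_risk_decision(posteriors, loss):
    """posteriors: shape (K,); loss: shape (K, K) with loss[i, j] = cost of
    deciding class i when the true class is j. Returns the minimizing class index."""
    conditional_risk = loss @ posteriors       # risk of each possible decision
    return int(np.argmin(conditional_risk))

# Illustrative check: with zero-one loss, the rule reduces to argmax of the posterior
posteriors = np.array([0.2, 0.5, 0.3])
zero_one = 1 - np.eye(3)
print(min_risk_decision(posteriors, zero_one))   # -> 1
```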

SLIDE 32

Minimizing expected loss: special case (loss = misclassification rate)

- Problem definition for this special case:
  - If action α(x) = i is taken and the true category is Cj, then the decision is correct if i = j and otherwise it is incorrect.
- Zero-one loss function:

  L_ij = 1 − δ_ij = { 0 if i = j;  1 otherwise }

  α(x) = argmin_{i=1,...,K} Σ_{j=1}^{K} L_ij p(Cj | x)
       = argmin_{i=1,...,K} [ 0 × p(Ci | x) + Σ_{j≠i} p(Cj | x) ]
       = argmin_{i=1,...,K} (1 − p(Ci | x)) = argmax_{i=1,...,K} p(Ci | x)

SLIDE 33

Probabilistic discriminant functions

- Discriminant functions: a popular way of representing a classifier
- A discriminant function g_i(x) for each class Ci (i = 1, ..., K):
  - x is assigned to class Ci if g_i(x) > g_j(x) for all j ≠ i
- Representing the Bayes classifier using discriminant functions:
  - Classifier minimizing the error rate: g_i(x) = p(Ci | x)
  - Classifier minimizing the risk: g_i(x) = − Σ_{j=1}^{K} L_ij p(Cj | x)

SLIDE 34

Naïve Bayes classifier

- Generative methods can require a high number of parameters
- Assumption: conditional independence of the features given the class

  p(x | Ci) = p(x1 | Ci) × p(x2 | Ci) × ⋯ × p(xd | Ci)

SLIDE 35

Naïve Bayes classifier

- In the decision phase, it finds the label of x according to:

  argmax_{i=1,...,K} p(Ci | x) = argmax_{i=1,...,K} p(Ci) Π_{j=1}^{d} p(xj | Ci)

  since p(x | Ci) = p(x1 | Ci) × ⋯ × p(xd | Ci) implies p(Ci | x) ∝ p(Ci) Π_{j=1}^{d} p(xj | Ci)
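
A minimal sketch of this decision rule for binary features, assuming priors[i] holds p(Ci) and cond[i, j] holds p(xj = 1 | Ci); the numbers are illustrative, not from the slides:

```python
import numpy as np

def naive_bayes_predict(x, priors, cond):
    """x: binary feature vector (0/1); priors[i] = p(Ci); cond[i, j] = p(x_j = 1 | Ci).
    Returns the index of the class maximizing p(Ci) * prod_j p(x_j | Ci)."""
    # Work in log space: log p(Ci) + sum_j log p(x_j | Ci)
    log_post = (np.log(priors)
                + (x * np.log(cond) + (1 - x) * np.log(1 - cond)).sum(axis=1))
    return int(np.argmax(log_post))

priors = np.array([0.6, 0.4])
cond = np.array([[0.8, 0.3, 0.5],    # p(x_j = 1 | C1) for each feature j
                 [0.2, 0.7, 0.5]])   # p(x_j = 1 | C2)
print(naive_bayes_predict(np.array([1, 0, 1]), priors, cond))   # -> 0 (class C1)
```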

SLIDE 36

Naïve Bayes classifier

- Fits d univariate distributions p(x1 | Ci), ..., p(xd | Ci) instead of one multivariate distribution p(x | Ci):
  - Example 1: for a Gaussian class-conditional density p(x | Ci), it estimates d + d parameters (a mean and a variance per dimension) instead of d + d(d+1)/2 parameters.
  - Example 2: for a Bernoulli class-conditional density p(x | Ci), it estimates d parameters (one mean per dimension) instead of 2^d − 1 parameters.
- It first estimates the class-conditional densities p(x1 | Ci), ..., p(xd | Ci) and the prior probability p(Ci) for each class (i = 1, ..., K) from the training set.

SLIDE 37

Naïve Bayes: discrete example

Training data (predicting Heart disease H from Diabetes D and Smoking S):

  D (Diabetes)   S (Smoke)   H (Heart disease)
  Y              N           Y
  Y              N           N
  N              Y           N
  N              Y           N
  N              N           N
  N              Y           Y
  N              N           N
  N              Y           Y
  N              N           N
  Y              N           N

Estimated probabilities:

- p(H = Yes) = 0.3
- p(D = Yes | H = Yes) = 1/3
- p(S = Yes | H = Yes) = 2/3
- p(D = Yes | H = No) = 2/7
- p(S = Yes | H = No) = 2/7

Decision on x = [Yes, Yes] (a person who has diabetes and also smokes):

  p(H = Yes | x) ∝ p(H = Yes) p(D = Yes | H = Yes) p(S = Yes | H = Yes) = 0.3 × 1/3 × 2/3 ≈ 0.066
  p(H = No | x)  ∝ p(H = No) p(D = Yes | H = No) p(S = Yes | H = No) = 0.7 × 2/7 × 2/7 ≈ 0.057

Thus decide H = Yes.
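
A small sketch that reproduces the numbers above from the table; the arrays just re-encode the ten rows with Yes = 1, No = 0:

```python
import numpy as np

# Ten rows of the table above, Yes = 1, No = 0
D = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 1])
S = np.array([0, 0, 1, 1, 0, 1, 0, 1, 0, 0])
H = np.array([1, 0, 0, 0, 0, 1, 0, 1, 0, 0])

p_h = H.mean()                          # p(H = Yes) = 0.3
p_d_given_h = D[H == 1].mean()          # p(D = Yes | H = Yes) = 1/3
p_s_given_h = S[H == 1].mean()          # p(S = Yes | H = Yes) = 2/3
p_d_given_not_h = D[H == 0].mean()      # p(D = Yes | H = No) = 2/7
p_s_given_not_h = S[H == 0].mean()      # p(S = Yes | H = No) = 2/7

score_yes = p_h * p_d_given_h * p_s_given_h                    # ≈ 0.066
score_no = (1 - p_h) * p_d_given_not_h * p_s_given_not_h       # ≈ 0.057
print(score_yes, score_no,
      "decide H = Yes" if score_yes > score_no else "decide H = No")
```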

SLIDE 38

Probabilistic classifiers

- How can we find the probabilities required in the Bayes decision rule?
- Probabilistic classification approaches can be divided into two main categories:
  - Generative
    - Estimate the joint pdf p(x, Ci) for each class Ci and then use it to find p(Ci | x)
    - or alternatively estimate both the pdf p(x | Ci) and p(Ci) to find p(Ci | x)
  - Discriminative
    - Directly estimate p(Ci | x) for each class Ci

SLIDE 39

Generative approach

- Inference stage
  - Determine the class-conditional densities p(x | Ci) and the priors p(Ci)
  - Use Bayes' theorem to find p(Ci | x)
- Decision stage: after learning the model (inference stage), make the optimal class assignment for a new input
  - If p(Cj | x) > p(Ck | x) for all k ≠ j, then decide Cj

SLIDE 40

Discriminative vs. generative approach

[Figure from Bishop]

SLIDE 41

Class-conditional densities vs. posterior

[Figure from Bishop: class-conditional densities p(x|C1), p(x|C2) and the posterior p(C1|x)]

With a shared covariance matrix Σ, the posterior is a logistic sigmoid of a linear function of x:

  p(C1 | x) = σ(wᵀx + w0)

  w  = Σ⁻¹ (μ1 − μ2)
  w0 = −(1/2) μ1ᵀ Σ⁻¹ μ1 + (1/2) μ2ᵀ Σ⁻¹ μ2 + ln( p(C1) / p(C2) )
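
A minimal sketch of these two formulas, assuming the parameters come from the shared-covariance fit sketched earlier:

```python
import numpy as np

def posterior_params(mu1, mu2, Sigma, p1):
    """w and w0 such that p(C1 | x) = sigmoid(w.T x + w0) for the shared-covariance model."""
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu2)
    w0 = (-0.5 * mu1 @ Sigma_inv @ mu1
          + 0.5 * mu2 @ Sigma_inv @ mu2
          + np.log(p1 / (1 - p1)))
    return w, w0

def posterior_c1(x, w, w0):
    # Sigmoid of the linear function of x
    return 1.0 / (1.0 + np.exp(-(w @ x + w0)))
```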

SLIDE 42

Discriminative approach

- Inference stage
  - Determine the posterior class probabilities p(Ci | x) directly
- Decision stage: after learning the model (inference stage), make the optimal class assignment for a new input
  - If p(Cj | x) > p(Ck | x) for all k ≠ j, then decide Cj

SLIDE 43

Posterior probabilities

- Two-class: p(C1 | x) can be written as a logistic sigmoid for a wide choice of p(x | Ci) distributions:

  p(C1 | x) = σ(a(x)) = 1 / (1 + exp(−a(x)))

- Multi-class: p(Ci | x) can be written as a softmax for a wide choice of p(x | Ci):

  p(Ci | x) = exp(a_i(x)) / Σ_{j=1}^{K} exp(a_j(x))

SLIDE 44

Discriminative approach: logistic regression

K = 2,  x = [1, x1, ..., xd]ᵀ,  w = [w0, w1, ..., wd]ᵀ

- More general than discriminant functions:
  - f(x; w) predicts the posterior probability p(y = 1 | x)

  f(x; w) = σ(wᵀx)

- σ(·) is the sigmoid (logistic) activation function:

  σ(a) = 1 / (1 + e^(−a))
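
A minimal sketch of this model, assuming each input row already includes the constant 1 as its first component:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def predict_proba(X, w):
    """X: (N, d+1) design matrix whose first column is all ones, w: (d+1,) weights.
    Returns p(y = 1 | x) for every row of X."""
    return sigmoid(X @ w)
```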

SLIDE 45

Logistic regression

K = 2,  y ∈ {0, 1}

- f(x; w): probability that y = 1 given x (parameterized by w)

  p(y = 1 | x, w) = f(x; w)
  p(y = 0 | x, w) = 1 − f(x; w)

  f(x; w) = σ(wᵀx),  0 ≤ f(x; w) ≤ 1: the estimated probability of y = 1 on input x

- Example: cancer (malignant vs. benign)
  - f(x; w) = 0.7 means a 70% chance of the tumor being malignant

SLIDE 46

Logistic regression: decision surface

- Decision surface: f(x; w) = constant

  f(x; w) = σ(wᵀx) = 1 / (1 + e^(−wᵀx)) = 0.5

- Decision surfaces are linear functions of x
- If f(x; w) ≥ 0.5 then y = 1, else y = 0
  - Equivalent to: if wᵀx ≥ 0 then y = 1, else y = 0

SLIDE 47

Logistic regression: ML estimation

- Maximum (conditional) log-likelihood:

  ŵ = argmax_w log Π_{i=1}^{N} p(y^(i) | w, x^(i))

  p(y^(i) | w, x^(i)) = f(x^(i); w)^(y^(i)) (1 − f(x^(i); w))^(1 − y^(i))

  log p(y | X, w) = Σ_{i=1}^{N} [ y^(i) log f(x^(i); w) + (1 − y^(i)) log(1 − f(x^(i); w)) ]
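
A minimal sketch of this conditional log-likelihood (the negative of the cross-entropy cost on the next slide), under the same assumed X, y layout as before:

```python
import numpy as np

def log_likelihood(w, X, y):
    """Conditional log-likelihood of logistic regression.
    X: (N, d+1) design matrix with a leading column of ones, y: (N,) labels in {0, 1}."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))   # f(x^(i); w) for every sample
    eps = 1e-12                          # guard against log(0)
    return np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
```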

SLIDE 48

Logistic regression: cost function

  ŵ = argmin_w J(w)

  J(w) = − Σ_{i=1}^{N} log p(y^(i) | w, x^(i))
       = Σ_{i=1}^{N} [ −y^(i) log f(x^(i); w) − (1 − y^(i)) log(1 − f(x^(i); w)) ]

- No closed-form solution for ∇w J(w) = 0
- However, J(w) is convex.

SLIDE 49

Logistic regression: gradient descent

  w^(t+1) = w^(t) − η ∇w J(w^(t))

  ∇w J(w) = Σ_{i=1}^{N} ( f(x^(i); w) − y^(i) ) x^(i)

- Is it similar to the gradient of the SSE cost for linear regression?

  ∇w J(w) = Σ_{i=1}^{N} ( wᵀx^(i) − y^(i) ) x^(i)
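
A minimal batch gradient-descent sketch for this update rule; the step size, iteration count, and toy data are illustrative choices, not from the slides:

```python
import numpy as np

def fit_logistic_gd(X, y, eta=0.01, n_iters=2000):
    """Batch gradient descent on the logistic-regression cost J(w).
    X: (N, d+1) with a leading column of ones, y: (N,) in {0, 1}."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))   # f(x^(i); w) for all i
        grad = X.T @ (p - y)                 # sum_i (f(x^(i); w) - y^(i)) x^(i)
        w -= eta * grad
    return w

# Tiny illustrative run
rng = np.random.default_rng(0)
X_raw = np.vstack([rng.normal(-1, 1, size=(20, 2)), rng.normal(+1, 1, size=(20, 2))])
X = np.hstack([np.ones((40, 1)), X_raw])
y = np.array([0] * 20 + [1] * 20)
w = fit_logistic_gd(X, y)
print(((1.0 / (1.0 + np.exp(-(X @ w))) >= 0.5) == y).mean())   # training accuracy
```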

SLIDE 50

Logistic regression: loss function

  f(x; w) = 1 / (1 + exp(−wᵀx))

  Loss(y, f(x; w)) = −y log f(x; w) − (1 − y) log(1 − f(x; w))

Since y = 1 or y = 0:

  Loss(y, f(x; w)) = −log f(x; w)        if y = 1
                     −log(1 − f(x; w))   if y = 0

How is it related to the zero-one loss?

  Loss(y, ŷ) = 1 if ŷ ≠ y,  0 if ŷ = y

SLIDE 51

Logistic regression: cost function (summary)

- Logistic Regression (LR) has a more suitable cost function for classification than SSE and the Perceptron.
- Why is the cost function of LR more suitable than the SSE cost

  J(w) = (1/N) Σ_{i=1}^{N} ( y^(i) − f(x^(i); w) )²,   where f(x; w) = σ(wᵀx)?

  - The conditional distribution p(y | x, w) in the classification problem is not Gaussian (it is Bernoulli).
  - The cost function of LR is also convex, whereas the SSE cost with a sigmoid is not.

SLIDE 52

Multi-class logistic regression

- For each class i, f_i(x; W) predicts the probability of y = i, i.e., P(y = i | x, W)
- On a new input x, to make a prediction, pick the class that maximizes f_i(x; W):

  α(x) = argmax_{i=1,...,K} f_i(x; W)

  If f_i(x; W) > f_j(x; W) for all j ≠ i, then decide Ci.

SLIDE 53

Multi-class logistic regression

K > 2,  y ∈ {1, 2, ..., K}

  f_i(x; W) = p(y = i | x) = exp(w_iᵀx) / Σ_{j=1}^{K} exp(w_jᵀx)

- Normalized exponential (aka softmax)
- If w_iᵀx ≫ w_jᵀx for all j ≠ i, then p(Ci | x) ≃ 1 and p(Cj | x) ≃ 0

Compare with the generative form:  p(Ci | x) = p(x | Ci) p(Ci) / Σ_{j=1}^{K} p(x | Cj) p(Cj)
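
A minimal softmax-prediction sketch, assuming W is a (d+1)×K matrix whose columns are the per-class weight vectors w_i:

```python
import numpy as np

def softmax_predict_proba(X, W):
    """X: (N, d+1) design matrix, W: (d+1, K) weights, one column per class.
    Returns an (N, K) matrix of class probabilities."""
    scores = X @ W
    scores -= scores.max(axis=1, keepdims=True)   # for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

def softmax_predict(X, W):
    # Pick the class with the largest probability (class indices 0..K-1)
    return np.argmax(softmax_predict_proba(X, W), axis=1)
```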

SLIDE 54

Logistic regression: multi-class

  Ŵ = argmin_W J(W)

  J(W) = − log Π_{i=1}^{N} p(y^(i) | x^(i), W)
       = − log Π_{i=1}^{N} Π_{j=1}^{K} f_j(x^(i); W)^(y_j^(i))
       = − Σ_{i=1}^{N} Σ_{j=1}^{K} y_j^(i) log f_j(x^(i); W)

  W = [w1 ⋯ wK]

  Y = [y^(1); ...; y^(N)] is the N×K matrix of targets, where each y^(i) is a vector of length K in 1-of-K coding, e.g., y = [0, 0, 1, 0]ᵀ when the target class is C3.

SLIDE 55

Logistic regression: multi-class

  w_j^(t+1) = w_j^(t) − η ∇_{w_j} J(W^(t))

  ∇_{w_j} J(W) = Σ_{i=1}^{N} ( f_j(x^(i); W) − y_j^(i) ) x^(i)
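
A minimal sketch of this update for all classes at once, assuming Y is the N×K one-hot target matrix from the previous slide and the same (d+1)×K weight layout as the earlier softmax sketch:

```python
import numpy as np

def softmax_gd_step(X, Y, W, eta=0.01):
    """One batch gradient-descent step for multi-class logistic regression.
    X: (N, d+1), Y: (N, K) one-hot targets, W: (d+1, K)."""
    scores = X @ W
    scores -= scores.max(axis=1, keepdims=True)
    P = np.exp(scores)
    P /= P.sum(axis=1, keepdims=True)        # f_j(x^(i); W) for all i, j
    grad = X.T @ (P - Y)                     # column j holds the gradient wrt w_j
    return W - eta * grad
```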

SLIDE 56

Logistic Regression (LR): summary

- LR is a linear classifier.
- The LR optimization problem is obtained by maximum likelihood when assuming a Bernoulli distribution for the conditional probabilities, with mean 1 / (1 + e^(−wᵀx)).
- There is no closed-form solution for its optimization problem.
  - But the cost function is convex, so the global optimum can be found by gradient descent on the cost (equivalently, gradient ascent on the log-likelihood).

SLIDE 57

Discriminative vs. generative: number of parameters

- d-dimensional feature space
- Logistic regression: d + 1 parameters
  - w = (w0, w1, ..., wd)
- Generative approach (Gaussian class-conditionals with a shared covariance matrix):
  - 2d parameters for the means
  - d(d + 1)/2 parameters for the shared covariance matrix
  - one parameter for the class prior p(C1)
- But LR is more robust and less sensitive to incorrect modeling assumptions.

SLIDE 58

Summary of alternatives

- Generative
  - Most demanding, because it finds the joint distribution p(x, Ci)
  - Usually needs a large training set to find p(x | Ci)
  - Can find p(x) ⇒ useful for outlier or novelty detection
- Discriminative
  - Estimates only what is really needed (i.e., p(Ci | x))
  - More computationally efficient

SLIDE 59

Resources

- C. Bishop, "Pattern Recognition and Machine Learning", Sections 4.2 and 4.3.