

  1. Probabilistic classification
     CE-717: Machine Learning
     Sharif University of Technology
     M. Soleymani, Fall 2016

  2. Topics
  - Probabilistic approach
  - Bayes decision theory
  - Generative models
    - Gaussian Bayes classifier
    - Naïve Bayes
  - Discriminative models
    - Logistic regression

  3. Classification problem: probabilistic view
  - Each feature is treated as a random variable.
  - The class label is also a random variable.
  - We observe the feature values of a random sample and want to infer its class label.
  - Evidence: feature vector $\boldsymbol{x}$
  - Query: class label

  4. Definitions
  - Posterior probability: $p(\mathcal{C}_k \mid \boldsymbol{x})$
  - Likelihood (class-conditional probability): $p(\boldsymbol{x} \mid \mathcal{C}_k)$
  - Prior probability: $p(\mathcal{C}_k)$
  - $p(\boldsymbol{x})$: pdf of the feature vector $\boldsymbol{x}$, where $p(\boldsymbol{x}) = \sum_{k=1}^{K} p(\boldsymbol{x} \mid \mathcal{C}_k)\, p(\mathcal{C}_k)$
  - $p(\boldsymbol{x} \mid \mathcal{C}_k)$: pdf of the feature vector $\boldsymbol{x}$ for samples of class $\mathcal{C}_k$
  - $p(\mathcal{C}_k)$: probability that the label is $\mathcal{C}_k$

  5. Bayes decision rule ($K = 2$)
  - If $P(\mathcal{C}_1 \mid \boldsymbol{x}) > P(\mathcal{C}_2 \mid \boldsymbol{x})$, decide $\mathcal{C}_1$; otherwise decide $\mathcal{C}_2$.
  - $P(\text{error} \mid \boldsymbol{x}) = \begin{cases} P(\mathcal{C}_2 \mid \boldsymbol{x}) & \text{if we decide } \mathcal{C}_1 \\ P(\mathcal{C}_1 \mid \boldsymbol{x}) & \text{if we decide } \mathcal{C}_2 \end{cases}$
  - If we use the Bayes decision rule: $P(\text{error} \mid \boldsymbol{x}) = \min\{P(\mathcal{C}_1 \mid \boldsymbol{x}), P(\mathcal{C}_2 \mid \boldsymbol{x})\}$
  - With the Bayes rule, $P(\text{error} \mid \boldsymbol{x})$ is as small as possible for every $\boldsymbol{x}$, so this rule minimizes the overall probability of error.

  6. Optimal classifier
  - The optimal decision is the one that minimizes the expected number of mistakes.
  - We show that the Bayes classifier is an optimal classifier.

  7. Bayes decision rule: minimizing the misclassification rate ($K = 2$)
  - Decision regions: $\mathcal{R}_k = \{\boldsymbol{x} \mid \alpha(\boldsymbol{x}) = k\}$; all points in $\mathcal{R}_k$ are assigned to class $\mathcal{C}_k$.
  - $p(\text{error}) = \mathbb{E}_{\boldsymbol{x},y}\!\left[ I(\alpha(\boldsymbol{x}) \neq y) \right] = p(\boldsymbol{x} \in \mathcal{R}_1, \mathcal{C}_2) + p(\boldsymbol{x} \in \mathcal{R}_2, \mathcal{C}_1)$
    $= \int_{\mathcal{R}_1} p(\boldsymbol{x}, \mathcal{C}_2)\, d\boldsymbol{x} + \int_{\mathcal{R}_2} p(\boldsymbol{x}, \mathcal{C}_1)\, d\boldsymbol{x} = \int_{\mathcal{R}_1} p(\mathcal{C}_2 \mid \boldsymbol{x})\, p(\boldsymbol{x})\, d\boldsymbol{x} + \int_{\mathcal{R}_2} p(\mathcal{C}_1 \mid \boldsymbol{x})\, p(\boldsymbol{x})\, d\boldsymbol{x}$
  - To minimize $p(\text{error})$, choose the class with the highest $p(\mathcal{C}_k \mid \boldsymbol{x})$ as $\alpha(\boldsymbol{x})$.

  8. Bayes minimum error
  - Bayes minimum error classifier (zero-one loss): $\min_{\alpha(\cdot)} \mathbb{E}_{\boldsymbol{x},y}\!\left[ I(\alpha(\boldsymbol{x}) \neq y) \right]$
  - If we know the probabilities in advance, this optimization problem is solved easily: $\alpha(\boldsymbol{x}) = \operatorname*{argmax}_{y}\, p(y \mid \boldsymbol{x})$
  - In practice, we estimate $p(y \mid \boldsymbol{x})$ from a set of training samples $\mathcal{D}$.
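A minimal sketch of the minimum-error rule above: given posterior probabilities $p(\mathcal{C}_k \mid \boldsymbol{x})$ for each class (the numbers below are purely illustrative, not from the slides), decide the class with the largest posterior; the per-sample error probability is one minus that maximum.

```python
import numpy as np

# Illustrative posteriors, one row per sample: [p(C_1|x), p(C_2|x)].
posteriors = np.array([
    [0.7, 0.3],
    [0.4, 0.6],
    [0.9, 0.1],
])

decisions = np.argmax(posteriors, axis=1)        # Bayes decision per sample
p_error_per_x = 1.0 - posteriors.max(axis=1)     # P(error | x) = 1 - max_k p(C_k | x)

print(decisions)       # [0 1 0]
print(p_error_per_x)   # [0.3 0.4 0.1]
```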

  9. Bayes' theorem
  - $p(\mathcal{C}_k \mid \boldsymbol{x}) = \dfrac{p(\boldsymbol{x} \mid \mathcal{C}_k)\, p(\mathcal{C}_k)}{p(\boldsymbol{x})}$   (posterior = likelihood × prior / evidence)
  - Posterior probability: $p(\mathcal{C}_k \mid \boldsymbol{x})$
  - Likelihood (class-conditional probability): $p(\boldsymbol{x} \mid \mathcal{C}_k)$
  - Prior probability: $p(\mathcal{C}_k)$
  - $p(\boldsymbol{x})$: pdf of the feature vector $\boldsymbol{x}$, where $p(\boldsymbol{x}) = \sum_{k=1}^{K} p(\boldsymbol{x} \mid \mathcal{C}_k)\, p(\mathcal{C}_k)$
  - $p(\boldsymbol{x} \mid \mathcal{C}_k)$: pdf of the feature vector $\boldsymbol{x}$ for samples of class $\mathcal{C}_k$; $p(\mathcal{C}_k)$: probability that the label is $\mathcal{C}_k$

  10. Bayes decision rule: example
  - Bayes decision: choose the class with the highest $p(\mathcal{C}_k \mid x)$.
  - Priors: $p(\mathcal{C}_1) = \tfrac{2}{3}$, $p(\mathcal{C}_2) = \tfrac{1}{3}$.
  - $p(\mathcal{C}_k \mid x) = \dfrac{p(x \mid \mathcal{C}_k)\, p(\mathcal{C}_k)}{p(x)}$, where $p(x) = p(\mathcal{C}_1)\, p(x \mid \mathcal{C}_1) + p(\mathcal{C}_2)\, p(x \mid \mathcal{C}_2)$.
  - [Figure: class-conditional densities $p(x \mid \mathcal{C}_1)$, $p(x \mid \mathcal{C}_2)$, posteriors $p(\mathcal{C}_1 \mid x)$, $p(\mathcal{C}_2 \mid x)$, and the resulting decision regions $\mathcal{R}_1$, $\mathcal{R}_2$.]

  11. Bayes decision rule: equivalent forms
  - If $P(\mathcal{C}_1 \mid \boldsymbol{x}) > P(\mathcal{C}_2 \mid \boldsymbol{x})$, decide $\mathcal{C}_1$; otherwise decide $\mathcal{C}_2$.
  - Equivalently: if $\dfrac{p(\boldsymbol{x} \mid \mathcal{C}_1)\, P(\mathcal{C}_1)}{p(\boldsymbol{x})} > \dfrac{p(\boldsymbol{x} \mid \mathcal{C}_2)\, P(\mathcal{C}_2)}{p(\boldsymbol{x})}$, decide $\mathcal{C}_1$; otherwise decide $\mathcal{C}_2$.
  - Equivalently: if $p(\boldsymbol{x} \mid \mathcal{C}_1)\, P(\mathcal{C}_1) > p(\boldsymbol{x} \mid \mathcal{C}_2)\, P(\mathcal{C}_2)$, decide $\mathcal{C}_1$; otherwise decide $\mathcal{C}_2$ (the evidence $p(\boldsymbol{x})$ cancels).
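A quick numerical check of the equivalence, using the example's priors and assumed (illustrative) likelihood values at a single point $x$: the evidence $p(x)$ only rescales both sides, so comparing likelihood times prior gives the same decision as comparing posteriors.

```python
import numpy as np

priors = np.array([2/3, 1/3])           # p(C_1), p(C_2) from the example
likelihoods = np.array([0.05, 0.20])    # assumed p(x|C_1), p(x|C_2) at some x

joint = likelihoods * priors            # p(x|C_k) * p(C_k)
posteriors = joint / joint.sum()        # divide by p(x) = sum_k p(x|C_k) p(C_k)

assert np.argmax(joint) == np.argmax(posteriors)
print(posteriors)                       # [0.333... 0.666...] -> decide C_2 here
```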

  12. Bayes decision rule: example (continued)
  - Bayes decision: choose the class with the highest $p(\mathcal{C}_k \mid x)$.
  - With $p(\mathcal{C}_1) = \tfrac{2}{3}$ and $p(\mathcal{C}_2) = \tfrac{1}{3}$, comparing $p(x \mid \mathcal{C}_1)\, p(\mathcal{C}_1)$ with $p(x \mid \mathcal{C}_2)\, p(\mathcal{C}_2)$ is the same as comparing $2\, p(x \mid \mathcal{C}_1)$ with $p(x \mid \mathcal{C}_2)$.
  - [Figure: scaled likelihood $2\, p(x \mid \mathcal{C}_1)$ versus $p(x \mid \mathcal{C}_2)$, the posteriors $p(\mathcal{C}_1 \mid x)$, $p(\mathcal{C}_2 \mid x)$, and the resulting decision regions.]

  13. Bayes classifier
  - Simple Bayes classifier: estimate the posterior probability of each class.
  - What should the decision criterion be? Choose the class with the highest $p(\mathcal{C}_k \mid \boldsymbol{x})$.
  - The optimal decision is the one that minimizes the expected number of mistakes.

  14. Diabetes example
  - Observed feature: white blood cell count.
  - This example has been adapted from Sanja Fidler's slides, University of Toronto, CSC411.

  15. Diabetes example
  - The doctor has a prior: $p(y = 1) = 0.2$.
  - Prior: in the absence of any observation, what do we know about the probability of each class?
  - A patient comes in with white blood cell count $x$.
  - Does the patient have diabetes, i.e., what is $p(y = 1 \mid x)$?
  - Given a new observation, we still need to compute the posterior.
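A sketch of that posterior computation. The prior $p(y=1) = 0.2$ comes from the slide; the Gaussian class-conditional parameters for the white blood cell count are assumed values for illustration only, not the course's actual numbers.

```python
import numpy as np
from scipy.stats import norm

prior_1 = 0.2
prior_0 = 1.0 - prior_1

def posterior_diabetes(x, mu0=6.0, sd0=1.5, mu1=9.0, sd1=2.0):
    """p(y=1 | x) via Bayes' rule with assumed Gaussian likelihoods."""
    lik0 = norm.pdf(x, mu0, sd0)                 # p(x | y=0)
    lik1 = norm.pdf(x, mu1, sd1)                 # p(x | y=1)
    evidence = lik0 * prior_0 + lik1 * prior_1   # p(x)
    return lik1 * prior_1 / evidence

print(posterior_diabetes(10.0))   # posterior probability of diabetes at x = 10
```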

  16. Diabetes example
  - [Figure: class-conditional distributions $p(x \mid y = 0)$ and $p(x \mid y = 1)$ of the white blood cell count.]
  - This example has been adapted from Sanja Fidler's slides, University of Toronto, CSC411.

  17. Estimating probability densities from data
  - Assume Gaussian distributions for $p(x \mid \mathcal{C}_1)$ and $p(x \mid \mathcal{C}_2)$.
  - Recall that for samples $\{x^{(1)}, \dots, x^{(N)}\}$, under a Gaussian assumption the MLE estimates are the sample mean and the sample variance (spelled out on the next slide).

  18. Diabetes example
  - $p(x \mid y = 1) = \mathcal{N}(\mu_1, \sigma_1^2)$, estimated from the training samples with $y^{(n)} = 1$:
    $\hat{\mu}_1 = \frac{1}{N_1} \sum_{n:\, y^{(n)} = 1} x^{(n)}, \qquad \hat{\sigma}_1^2 = \frac{1}{N_1} \sum_{n:\, y^{(n)} = 1} \left(x^{(n)} - \hat{\mu}_1\right)^2, \qquad N_1 = \sum_{n:\, y^{(n)} = 1} 1$
  - $p(x \mid y = 0)$ is estimated in the same way from the samples with $y^{(n)} = 0$.
  - This example has been adapted from Sanja Fidler's slides, University of Toronto, CSC411.
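A minimal sketch of this per-class MLE. The data below is randomly generated purely for illustration; `x` holds the single feature and `y` the 0/1 labels.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(6.0, 1.5, 80), rng.normal(9.0, 2.0, 20)])
y = np.concatenate([np.zeros(80, dtype=int), np.ones(20, dtype=int)])

mu_1 = x[y == 1].mean()                      # MLE mean for class y = 1
var_1 = np.mean((x[y == 1] - mu_1) ** 2)     # MLE variance (divides by N_1, not N_1 - 1)

mu_0 = x[y == 0].mean()
var_0 = np.mean((x[y == 0] - mu_0) ** 2)

print(mu_1, var_1, mu_0, var_0)
```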

  19. Diabetes example
  - Add a second observation: the plasma glucose value.
  - This example has been adapted from Sanja Fidler's slides, University of Toronto, CSC411.

  20. Generative approach for this example
  - Multivariate Gaussian class-conditional distributions:
    $p(\boldsymbol{x} \mid y = k) = \dfrac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}_k|^{1/2}} \exp\!\left\{ -\tfrac{1}{2} (\boldsymbol{x} - \boldsymbol{\mu}_k)^T \boldsymbol{\Sigma}_k^{-1} (\boldsymbol{x} - \boldsymbol{\mu}_k) \right\}, \qquad k = 1, 2$
  - Prior distribution: $p(y = 1) = \pi$, $p(y = 0) = 1 - \pi$.
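A small sketch of evaluating this class-conditional density. The mean vector and covariance matrix below are assumed, illustrative values for the two features (white blood cell count and plasma glucose); the direct computation simply re-implements the density formula above as a check against SciPy.

```python
import numpy as np
from scipy.stats import multivariate_normal

mu_1 = np.array([9.0, 7.5])
Sigma_1 = np.array([[2.0, 0.5],
                    [0.5, 1.0]])

x = np.array([8.5, 7.0])

# SciPy evaluates the multivariate Gaussian density directly.
density = multivariate_normal.pdf(x, mean=mu_1, cov=Sigma_1)

# Same value computed from the formula on the slide, as a sanity check.
d = len(x)
diff = x - mu_1
direct = np.exp(-0.5 * diff @ np.linalg.inv(Sigma_1) @ diff) / \
         np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma_1))

print(density, direct)
```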

  21. MLE for the multivariate Gaussian
  - For samples $\{\boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(N)}\}$ from a multivariate Gaussian distribution, the MLE estimates are:
    $\hat{\boldsymbol{\mu}} = \frac{1}{N}\sum_{n=1}^{N} \boldsymbol{x}^{(n)}, \qquad \hat{\boldsymbol{\Sigma}} = \frac{1}{N}\sum_{n=1}^{N} \left(\boldsymbol{x}^{(n)} - \hat{\boldsymbol{\mu}}\right)\left(\boldsymbol{x}^{(n)} - \hat{\boldsymbol{\mu}}\right)^T$
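A sketch of these estimates for an $N \times d$ data matrix (random data here, just for illustration). One practical detail worth noting: `np.cov` divides by $N-1$ by default, so the MLE requires `bias=True`.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))

mu = X.mean(axis=0)                          # MLE mean vector
Sigma = (X - mu).T @ (X - mu) / X.shape[0]   # MLE covariance (divides by N)

# np.cov uses N - 1 by default; bias=True recovers the MLE.
assert np.allclose(Sigma, np.cov(X.T, bias=True))
```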

  22. Generative approach: example
  - Maximum likelihood estimation from $\mathcal{D} = \{(\boldsymbol{x}^{(n)}, y^{(n)})\}_{n=1}^{N}$:
    $\hat{\pi} = \dfrac{N_1}{N}$
    $\hat{\boldsymbol{\mu}}_1 = \dfrac{\sum_{n=1}^{N} y^{(n)} \boldsymbol{x}^{(n)}}{N_1}, \qquad \hat{\boldsymbol{\mu}}_2 = \dfrac{\sum_{n=1}^{N} (1 - y^{(n)}) \boldsymbol{x}^{(n)}}{N_2}$
    $\hat{\boldsymbol{\Sigma}}_1 = \dfrac{1}{N_1} \sum_{n=1}^{N} y^{(n)} \left(\boldsymbol{x}^{(n)} - \hat{\boldsymbol{\mu}}_1\right)\left(\boldsymbol{x}^{(n)} - \hat{\boldsymbol{\mu}}_1\right)^T$
    $\hat{\boldsymbol{\Sigma}}_2 = \dfrac{1}{N_2} \sum_{n=1}^{N} (1 - y^{(n)}) \left(\boldsymbol{x}^{(n)} - \hat{\boldsymbol{\mu}}_2\right)\left(\boldsymbol{x}^{(n)} - \hat{\boldsymbol{\mu}}_2\right)^T$
    where $N_1 = \sum_{n=1}^{N} y^{(n)}$ and $N_2 = N - N_1$.
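A compact sketch of this generative classifier: fit a Gaussian per class and the class prior by MLE, then classify with Bayes' rule. Function and variable names are mine, not the slides'; `X` is an $N \times d$ array and `y` a 0/1 label vector.

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gaussian_bayes(X, y):
    pi = y.mean()                                    # pi-hat = N_1 / N
    params = {}
    for c in (0, 1):
        Xc = X[y == c]
        mu = Xc.mean(axis=0)
        Sigma = (Xc - mu).T @ (Xc - mu) / len(Xc)    # per-class MLE covariance
        params[c] = (mu, Sigma)
    return pi, params

def predict(X, pi, params):
    # Compare p(x|y=1) * pi with p(x|y=0) * (1 - pi); the evidence p(x) cancels.
    p1 = multivariate_normal.pdf(X, *params[1]) * pi
    p0 = multivariate_normal.pdf(X, *params[0]) * (1 - pi)
    return (p1 > p0).astype(int)
```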

  23. Decision boundary for the Gaussian Bayes classifier
  - The boundary is where the posteriors are equal: $p(\mathcal{C}_1 \mid \boldsymbol{x}) = p(\mathcal{C}_2 \mid \boldsymbol{x})$, with $p(\mathcal{C}_k \mid \boldsymbol{x}) = \dfrac{p(\boldsymbol{x} \mid \mathcal{C}_k)\, p(\mathcal{C}_k)}{p(\boldsymbol{x})}$.
  - Taking logarithms:
    $\ln p(\mathcal{C}_1 \mid \boldsymbol{x}) = \ln p(\mathcal{C}_2 \mid \boldsymbol{x})$
    $\ln p(\boldsymbol{x} \mid \mathcal{C}_1) + \ln p(\mathcal{C}_1) - \ln p(\boldsymbol{x}) = \ln p(\boldsymbol{x} \mid \mathcal{C}_2) + \ln p(\mathcal{C}_2) - \ln p(\boldsymbol{x})$
    $\ln p(\boldsymbol{x} \mid \mathcal{C}_1) + \ln p(\mathcal{C}_1) = \ln p(\boldsymbol{x} \mid \mathcal{C}_2) + \ln p(\mathcal{C}_2)$
  - where $\ln p(\boldsymbol{x} \mid \mathcal{C}_k) = -\dfrac{d}{2} \ln 2\pi - \dfrac{1}{2} \ln |\boldsymbol{\Sigma}_k| - \dfrac{1}{2} (\boldsymbol{x} - \boldsymbol{\mu}_k)^T \boldsymbol{\Sigma}_k^{-1} (\boldsymbol{x} - \boldsymbol{\mu}_k)$
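A sketch of the log-domain rule implied by this slide: compute $g_k(\boldsymbol{x}) = \ln p(\boldsymbol{x} \mid \mathcal{C}_k) + \ln p(\mathcal{C}_k)$ for each class and compare. The parameter values are assumed for illustration; when $\boldsymbol{\Sigma}_1 \neq \boldsymbol{\Sigma}_2$ the boundary $g_1(\boldsymbol{x}) = g_2(\boldsymbol{x})$ is quadratic in $\boldsymbol{x}$.

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = {1: np.array([9.0, 7.5]), 2: np.array([6.0, 5.0])}
Sigma = {1: np.diag([2.0, 1.0]), 2: np.diag([1.0, 1.5])}
prior = {1: 0.2, 2: 0.8}

def g(x, k):
    # g_k(x) = ln p(x | C_k) + ln p(C_k)
    return multivariate_normal.logpdf(x, mean=mu[k], cov=Sigma[k]) + np.log(prior[k])

x = np.array([8.0, 6.5])
decision = 1 if g(x, 1) > g(x, 2) else 2
print(decision)
```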

  24. Decision boundary
  - [Figure: class-conditional densities $p(\boldsymbol{x} \mid \mathcal{C}_1)$ and $p(\boldsymbol{x} \mid \mathcal{C}_2)$, the posterior $p(\mathcal{C}_1 \mid \boldsymbol{x})$, and the decision boundary where $p(\mathcal{C}_1 \mid \boldsymbol{x}) = p(\mathcal{C}_2 \mid \boldsymbol{x})$.]

  25. Shared covariance matrix
  - When the classes share a single covariance matrix, $\boldsymbol{\Sigma} = \boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2$:
    $p(\boldsymbol{x} \mid \mathcal{C}_k) = \dfrac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\!\left\{ -\tfrac{1}{2} (\boldsymbol{x} - \boldsymbol{\mu}_k)^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{x} - \boldsymbol{\mu}_k) \right\}, \qquad k = 1, 2$
  - $p(\mathcal{C}_1) = \pi$, $p(\mathcal{C}_2) = 1 - \pi$

  26. Likelihood
  - $\prod_{n=1}^{N} p(\boldsymbol{x}^{(n)}, y^{(n)} \mid \pi, \boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \boldsymbol{\Sigma}) = \prod_{n=1}^{N} p(\boldsymbol{x}^{(n)} \mid y^{(n)}, \boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \boldsymbol{\Sigma})\, p(y^{(n)} \mid \pi)$
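A sketch of evaluating this joint likelihood (in log form, for numerical stability) for given parameters and data. Names and shapes are my own illustration; `y` uses 1 for class $\mathcal{C}_1$ and 0 for class $\mathcal{C}_2$, matching the factorization above.

```python
import numpy as np
from scipy.stats import multivariate_normal, bernoulli

def log_likelihood(X, y, pi, mu1, mu2, Sigma):
    # log p(x^(n) | y^(n), mu_1, mu_2, Sigma): pick the class mean per sample.
    ll_x = np.where(
        y == 1,
        multivariate_normal.logpdf(X, mean=mu1, cov=Sigma),
        multivariate_normal.logpdf(X, mean=mu2, cov=Sigma),
    )
    ll_y = bernoulli.logpmf(y, pi)      # log p(y^(n) | pi)
    return np.sum(ll_x + ll_y)          # log of the product over n
```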

  27. Shared covariance matrix: MLE
  - Maximum likelihood estimation from $\mathcal{D} = \{(\boldsymbol{x}^{(n)}, y^{(n)})\}_{n=1}^{N}$:
    $\hat{\pi} = \dfrac{N_1}{N}$
    $\hat{\boldsymbol{\mu}}_1 = \dfrac{\sum_{n=1}^{N} y^{(n)} \boldsymbol{x}^{(n)}}{N_1}, \qquad \hat{\boldsymbol{\mu}}_2 = \dfrac{\sum_{n=1}^{N} (1 - y^{(n)}) \boldsymbol{x}^{(n)}}{N_2}$
    $\hat{\boldsymbol{\Sigma}} = \dfrac{1}{N} \left[ \sum_{n \in \mathcal{C}_1} \left(\boldsymbol{x}^{(n)} - \hat{\boldsymbol{\mu}}_1\right)\left(\boldsymbol{x}^{(n)} - \hat{\boldsymbol{\mu}}_1\right)^T + \sum_{n \in \mathcal{C}_2} \left(\boldsymbol{x}^{(n)} - \hat{\boldsymbol{\mu}}_2\right)\left(\boldsymbol{x}^{(n)} - \hat{\boldsymbol{\mu}}_2\right)^T \right]$
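A sketch of the shared (pooled) covariance estimate above: estimate each class mean separately, then pool the centered scatter matrices and divide by the total $N$. `X` is $N \times d$, `y` a 0/1 label vector; the names are illustrative.

```python
import numpy as np

def fit_shared_covariance(X, y):
    N, d = X.shape
    pi = y.mean()                            # pi-hat = N_1 / N
    mu1 = X[y == 1].mean(axis=0)
    mu2 = X[y == 0].mean(axis=0)
    D1 = X[y == 1] - mu1
    D2 = X[y == 0] - mu2
    Sigma = (D1.T @ D1 + D2.T @ D2) / N      # pooled over both classes
    return pi, mu1, mu2, Sigma
```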

  28. Decision boundary with a shared covariance matrix
  - $\ln p(\boldsymbol{x} \mid \mathcal{C}_1) + \ln p(\mathcal{C}_1) = \ln p(\boldsymbol{x} \mid \mathcal{C}_2) + \ln p(\mathcal{C}_2)$
  - $\ln p(\boldsymbol{x} \mid \mathcal{C}_k) = -\dfrac{d}{2} \ln 2\pi - \dfrac{1}{2} \ln |\boldsymbol{\Sigma}| - \dfrac{1}{2} (\boldsymbol{x} - \boldsymbol{\mu}_k)^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{x} - \boldsymbol{\mu}_k)$
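Expanding the quadratic forms shows why the shared-covariance boundary is linear; this step is not spelled out on the slide but follows directly from the equations above. The constant terms and $\ln|\boldsymbol{\Sigma}|$ are the same on both sides, and the quadratic terms $\boldsymbol{x}^T \boldsymbol{\Sigma}^{-1} \boldsymbol{x}$ cancel:

$$
-\tfrac{1}{2}(\boldsymbol{x}-\boldsymbol{\mu}_1)^T\boldsymbol{\Sigma}^{-1}(\boldsymbol{x}-\boldsymbol{\mu}_1) + \ln p(\mathcal{C}_1)
= -\tfrac{1}{2}(\boldsymbol{x}-\boldsymbol{\mu}_2)^T\boldsymbol{\Sigma}^{-1}(\boldsymbol{x}-\boldsymbol{\mu}_2) + \ln p(\mathcal{C}_2)
$$

$$
\Rightarrow\quad \boldsymbol{w}^T \boldsymbol{x} + w_0 = 0, \qquad
\boldsymbol{w} = \boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2), \qquad
w_0 = -\tfrac{1}{2}\boldsymbol{\mu}_1^T\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_1 + \tfrac{1}{2}\boldsymbol{\mu}_2^T\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_2 + \ln\frac{p(\mathcal{C}_1)}{p(\mathcal{C}_2)}
$$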

  29. Bayes decision rule: multi-class misclassification rate
  - Multi-class problem: probability of error of the Bayes decision rule.
  - It is simpler to compute the probability of a correct decision: $P(\text{error}) = 1 - P(\text{correct})$
    $P(\text{correct}) = \sum_{i=1}^{K} \int_{\mathcal{R}_i} p(\boldsymbol{x}, \mathcal{C}_i)\, d\boldsymbol{x} = \sum_{i=1}^{K} \int_{\mathcal{R}_i} p(\mathcal{C}_i \mid \boldsymbol{x})\, p(\boldsymbol{x})\, d\boldsymbol{x}$
  - $\mathcal{R}_i$: the subset of the feature space assigned to class $\mathcal{C}_i$ by the classifier.
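A Monte Carlo sketch of this quantity for the Bayes classifier: each region $\mathcal{R}_i$ is where class $i$ has the largest posterior, so $P(\text{correct}) = \mathbb{E}_{\boldsymbol{x}}\!\left[\max_k p(\mathcal{C}_k \mid \boldsymbol{x})\right]$. The posterior samples below are assumed (drawn from a Dirichlet) purely for illustration; in practice they would come from the fitted model evaluated on samples of $\boldsymbol{x}$.

```python
import numpy as np

rng = np.random.default_rng(0)
posteriors = rng.dirichlet(alpha=[2.0, 1.0, 1.0], size=10_000)  # each row sums to 1

p_correct = posteriors.max(axis=1).mean()   # average of max_k p(C_k | x^(m))
p_error = 1.0 - p_correct
print(p_correct, p_error)
```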

  30. Bayes minimum error (recap)
  - Bayes minimum error classifier (zero-one loss): $\min_{\alpha(\cdot)} \mathbb{E}_{\boldsymbol{x},y}\!\left[ I(\alpha(\boldsymbol{x}) \neq y) \right]$
  - $\alpha(\boldsymbol{x}) = \operatorname*{argmax}_{y}\, p(y \mid \boldsymbol{x})$
