Probabilistic classification. CE-717: Machine Learning, Sharif University of Technology.

SLIDE 1

Probabilistic classification

CE-717: Machine Learning

Sharif University of Technology

M. Soleymani

Fall 2016

SLIDE 2

Topics

- Probabilistic approach
- Bayes decision theory
- Generative models
  - Gaussian Bayes classifier
  - Naïve Bayes
- Discriminative models
  - Logistic regression

SLIDE 3

Classification problem: probabilistic view

- Each feature is treated as a random variable
- The class label is also treated as a random variable
- We observe the feature values for a random sample and intend to find its class label
- Evidence: feature vector x
- Query: class label

SLIDE 4

Definitions

- Posterior probability: p(Ci | x)
- Likelihood or class-conditional probability: p(x | Ci)
- Prior probability: p(Ci)

p(x): pdf of the feature vector x,  p(x) = Σ_{i=1}^{K} p(x | Ci) p(Ci)
p(x | Ci): pdf of the feature vector x for samples of class Ci
p(Ci): probability that the label is Ci

SLIDE 5

Bayes decision rule

K = 2 classes:

  P(error | x) = p(C2 | x)  if we decide C1
                 p(C1 | x)  if we decide C2

- If p(C1 | x) > p(C2 | x), decide C1; otherwise decide C2.
- If we use the Bayes decision rule:

  P(error | x) = min{ p(C1 | x), p(C2 | x) }

- Using the Bayes rule, for each x, P(error | x) is as small as possible, and thus this rule minimizes the probability of error.

SLIDE 6

Optimal classifier

- The optimal decision is the one that minimizes the expected number of mistakes.
- We show that the Bayes classifier is an optimal classifier.

SLIDE 7

Bayes decision rule: minimizing misclassification rate

K = 2 classes:

- Decision regions: ℛi = {x | α(x) = i}
- All points in ℛi are assigned to class Ci
- Choose the class with the highest p(Ci | x) as α(x)

  P(error) = E_{x,y}[ I(α(x) ≠ y) ]
           = p(x ∈ ℛ1, C2) + p(x ∈ ℛ2, C1)
           = ∫_{ℛ1} p(x, C2) dx + ∫_{ℛ2} p(x, C1) dx
           = ∫_{ℛ1} p(C2 | x) p(x) dx + ∫_{ℛ2} p(C1 | x) p(x) dx

SLIDE 8

Bayes minimum error

- Bayes minimum error classifier (zero-one loss):

  min over α(·) of E_{x,y}[ I(α(x) ≠ y) ]

- If we know the probabilities in advance, the above optimization problem is solved easily:

  α(x) = argmax_y p(y | x)

- In practice, we estimate p(y | x) from a set of training samples D.

SLIDE 9

Bayes theorem

- Bayes' theorem:

  p(Ci | x) = p(x | Ci) p(Ci) / p(x)
  (posterior = likelihood × prior / evidence)

- Posterior probability: p(Ci | x)
- Likelihood or class-conditional probability: p(x | Ci)
- Prior probability: p(Ci)

p(x): pdf of the feature vector x,  p(x) = Σ_{i=1}^{K} p(x | Ci) p(Ci)
p(x | Ci): pdf of the feature vector x for samples of class Ci
p(Ci): probability that the label is Ci

SLIDE 10

Bayes decision rule: example

- Bayes decision: choose the class with the highest p(Ci | x)

[Figure: class-conditional densities p(x|C1) and p(x|C2) with priors p(C1) = 2/3, p(C2) = 1/3, the resulting posteriors p(C1|x), p(C2|x), and the decision regions]

  p(Ci | x) = p(x | Ci) p(Ci) / p(x),   p(x) = p(C1) p(x | C1) + p(C2) p(x | C2)

SLIDE 11

Bayesian decision rule

- If p(C1 | x) > p(C2 | x), decide C1; otherwise decide C2.
- Equivalently: if p(x | C1) p(C1) / p(x) > p(x | C2) p(C2) / p(x), decide C1; otherwise decide C2.
- Equivalently: if p(x | C1) p(C1) > p(x | C2) p(C2), decide C1; otherwise decide C2.
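
A minimal NumPy sketch of this two-class rule, assuming the class-conditional densities are known univariate Gaussians; the densities and priors below are made up for illustration, not taken from the slides:

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

priors = {1: 2 / 3, 2: 1 / 3}                        # p(C1), p(C2)
likelihoods = {1: lambda x: gauss_pdf(x, 0.0, 1.0),  # assumed p(x | C1)
               2: lambda x: gauss_pdf(x, 2.0, 1.0)}  # assumed p(x | C2)

def bayes_decide(x):
    # Decide C1 iff p(x|C1) p(C1) > p(x|C2) p(C2); the evidence p(x) cancels.
    score = {c: likelihoods[c](x) * priors[c] for c in (1, 2)}
    return 1 if score[1] > score[2] else 2

print(bayes_decide(0.5), bayes_decide(2.5))   # C1 for x = 0.5, C2 for x = 2.5
```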

SLIDE 12

Bayes decision rule: example

- Bayes decision: choose the class with the highest p(Ci | x)

[Figure: the same example as before; since p(C1) = 2/3 and p(C2) = 1/3, comparing posteriors amounts to comparing 2 × p(x|C1) against p(x|C2), which determines the decision regions]

SLIDE 13

Bayes classifier

- Simple Bayes classifier: estimate the posterior probability of each class
- What should the decision criterion be?
  - Choose the class with the highest p(Ci | x)
- The optimal decision is the one that minimizes the expected number of mistakes.

SLIDE 14

Diabetes example

- Observation: white blood cell count

This example has been adapted from Sanja Fidler's slides, University of Toronto, CSC411.

SLIDE 15

Diabetes example

- The doctor has a prior: p(y = 1) = 0.2
  - Prior: in the absence of any observation, what do I know about the probability of the classes?
- A patient comes in with white blood cell count x.
- Does the patient have diabetes, i.e., what is p(y = 1 | x)?
  - Given a new observation, we still need to compute the posterior.

SLIDE 16

Diabetes example

[Figure: class-conditional distributions of the white blood cell count, p(x | y = 0) and p(x | y = 1)]

This example has been adapted from Sanja Fidler's slides, University of Toronto, CSC411.

SLIDE 17

Estimate probability densities from data

- Assume Gaussian distributions for p(x | C1) and p(x | C2).
- Recall that for samples {x^(1), ..., x^(N)}, if we assume a Gaussian distribution, the MLE estimates are the sample mean and the sample variance.

SLIDE 18

Diabetes example

This example has been adapted from Sanja Fidler's slides, University of Toronto, CSC411.

  p(x | y = 1) = N(μ1, σ1²)

  μ1  = Σ_{n: y^(n)=1} x^(n) / Σ_{n: y^(n)=1} 1 = (1/N1) Σ_{n: y^(n)=1} x^(n)
  σ1² = (1/N1) Σ_{n: y^(n)=1} (x^(n) − μ1)²

[Figure: fitted Gaussians p(x | y = 0) and p(x | y = 1)]
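
A minimal NumPy sketch of these MLE formulas, assuming x and y are arrays holding the feature values and 0/1 labels; the data below are made up for illustration:

```python
import numpy as np

# Illustrative data: x holds feature values, y holds 0/1 labels
x = np.array([3.2, 5.1, 4.8, 7.9, 8.3, 6.7, 4.0, 9.1])
y = np.array([0,   0,   0,   1,   1,   1,   0,   1  ])

def gaussian_mle(x, y, label):
    # MLE for one class: sample mean and (biased) sample variance, divided by N_label as on the slide
    xc = x[y == label]
    mu = xc.mean()
    sigma2 = ((xc - mu) ** 2).mean()
    return mu, sigma2

mu1, sigma2_1 = gaussian_mle(x, y, 1)
mu0, sigma2_0 = gaussian_mle(x, y, 0)
print(mu1, sigma2_1, mu0, sigma2_0)
```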

SLIDE 19

Diabetes example

- Add a second observation: plasma glucose value

This example has been adapted from Sanja Fidler's slides, University of Toronto, CSC411.

SLIDE 20

Generative approach for this example

- Multivariate Gaussian distributions for p(x | Ci):

  p(x | y = i) = 1 / ((2π)^(d/2) |Σi|^(1/2)) exp{ −(1/2) (x − μi)ᵀ Σi⁻¹ (x − μi) },   i = 1, 2

- Prior distribution:

  p(y = 1) = π,   p(y = 0) = 1 − π

SLIDE 21

MLE for multivariate Gaussian

- For samples {x^(1), ..., x^(N)}, if we assume a multivariate Gaussian distribution, the MLE estimates are:

  μ = (1/N) Σ_{n=1}^{N} x^(n)
  Σ = (1/N) Σ_{n=1}^{N} (x^(n) − μ)(x^(n) − μ)ᵀ

SLIDE 22

Generative approach: example

Maximum likelihood estimation on D = {(x^(n), y^(n))}_{n=1}^{N}:

- π  = N1 / N
- μ1 = (1/N1) Σ_{n=1}^{N} y^(n) x^(n),    μ2 = (1/N2) Σ_{n=1}^{N} (1 − y^(n)) x^(n)
- Σ1 = (1/N1) Σ_{n=1}^{N} y^(n) (x^(n) − μ1)(x^(n) − μ1)ᵀ
- Σ2 = (1/N2) Σ_{n=1}^{N} (1 − y^(n)) (x^(n) − μ2)(x^(n) − μ2)ᵀ

where N1 = Σ_{n=1}^{N} y^(n) and N2 = N − N1.
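
A minimal NumPy sketch of these estimators, assuming X is an N×d data matrix and y a 0/1 label vector (class 1 corresponds to y = 1, class 2 to y = 0); variable names and the toy data are mine, not from the slides:

```python
import numpy as np

def fit_gaussian_generative(X, y):
    """MLE for a two-class Gaussian generative model (separate covariances)."""
    X1, X2 = X[y == 1], X[y == 0]
    pi = len(X1) / len(X)                        # class prior p(y = 1)
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    Sigma1 = (X1 - mu1).T @ (X1 - mu1) / len(X1)
    Sigma2 = (X2 - mu2).T @ (X2 - mu2) / len(X2)
    return pi, mu1, mu2, Sigma1, Sigma2

# Tiny illustrative data set
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1.0, size=(50, 2)),
               rng.normal([3, 3], 1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(fit_gaussian_generative(X, y))
```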

SLIDE 23

Decision boundary for Gaussian Bayes classifier

The decision boundary is where p(C1 | x) = p(C2 | x):

  ln p(C1 | x) = ln p(C2 | x)
  ln p(x | C1) + ln p(C1) − ln p(x) = ln p(x | C2) + ln p(C2) − ln p(x)
  ln p(x | C1) + ln p(C1) = ln p(x | C2) + ln p(C2)

using p(Ci | x) = p(x | Ci) p(Ci) / p(x) and

  ln p(x | Ci) = −(d/2) ln 2π − (1/2) ln |Σi| − (1/2) (x − μi)ᵀ Σi⁻¹ (x − μi)
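
A minimal sketch of classifying by comparing ln p(x|Ci) + ln p(Ci); it assumes the parameters come from the fit_gaussian_generative sketch above:

```python
import numpy as np

def log_gaussian(x, mu, Sigma):
    # ln p(x | Ci) for a multivariate Gaussian
    d = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (d * np.log(2 * np.pi) + logdet
                   + diff @ np.linalg.solve(Sigma, diff))

def gaussian_bayes_predict(x, pi, mu1, mu2, Sigma1, Sigma2):
    # Compare ln p(x|C1) + ln p(C1) with ln p(x|C2) + ln p(C2)
    s1 = log_gaussian(x, mu1, Sigma1) + np.log(pi)
    s2 = log_gaussian(x, mu2, Sigma2) + np.log(1 - pi)
    return 1 if s1 > s2 else 0
```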

SLIDE 24

Decision boundary

[Figure: class-conditional densities p(x|C1) and p(x|C2), the posterior p(C1|x), and the decision boundary where p(C1|x) = p(C2|x)]

SLIDE 25

Shared covariance matrix

- When the classes share a single covariance matrix Σ = Σ1 = Σ2:

  p(x | Ci) = 1 / ((2π)^(d/2) |Σ|^(1/2)) exp{ −(1/2) (x − μi)ᵀ Σ⁻¹ (x − μi) },   i = 1, 2

- p(C1) = π,   p(C2) = 1 − π

SLIDE 26

Likelihood

  Π_{n=1}^{N} p(x^(n), y^(n) | π, μ1, μ2, Σ) = Π_{n=1}^{N} p(x^(n) | y^(n), μ1, μ2, Σ) p(y^(n) | π)

SLIDE 27

Shared covariance matrix

- Maximum likelihood estimation on D = {(x^(n), y^(n))}_{n=1}^{N}:

  π  = N1 / N
  μ1 = (1/N1) Σ_{n=1}^{N} y^(n) x^(n)
  μ2 = (1/N2) Σ_{n=1}^{N} (1 − y^(n)) x^(n)
  Σ  = (1/N) [ Σ_{n∈C1} (x^(n) − μ1)(x^(n) − μ1)ᵀ + Σ_{n∈C2} (x^(n) − μ2)(x^(n) − μ2)ᵀ ]
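
A minimal NumPy sketch of the shared-covariance (pooled) estimate, under the same assumed X, y layout as the earlier sketches:

```python
import numpy as np

def fit_shared_covariance(X, y):
    """Two-class Gaussian model with a single pooled covariance matrix."""
    X1, X2 = X[y == 1], X[y == 0]
    pi = len(X1) / len(X)
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - mu1).T @ (X1 - mu1)
    S2 = (X2 - mu2).T @ (X2 - mu2)
    Sigma = (S1 + S2) / len(X)          # pooled over both classes
    return pi, mu1, mu2, Sigma
```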

SLIDE 28

Decision boundary with a shared covariance matrix

  ln p(x | C1) + ln p(C1) = ln p(x | C2) + ln p(C2)

  ln p(x | Ci) = −(d/2) ln 2π − (1/2) ln |Σ| − (1/2) (x − μi)ᵀ Σ⁻¹ (x − μi)

Since Σ is shared, the quadratic terms in x cancel and the decision boundary is linear in x.

SLIDE 29

Bayes decision rule: multi-class misclassification rate

- Multi-class problem: probability of error of the Bayes decision rule
- It is simpler to compute the probability of a correct decision:

  P(error) = 1 − P(correct)

  P(correct) = Σ_{j=1}^{K} ∫_{ℛj} p(x, Cj) dx = Σ_{j=1}^{K} ∫_{ℛj} p(Cj | x) p(x) dx

where ℛj is the subset of the feature space assigned to class Cj by the classifier.

SLIDE 30

Bayes minimum error

- Bayes minimum error classifier (zero-one loss):

  min over α(·) of E_{x,y}[ I(α(x) ≠ y) ]

  α(x) = argmax_y p(y | x)

SLIDE 31

Minimizing Bayes risk (expected loss)

  E_{x,y}[ L(α(x), y) ] = ∫ Σ_{j=1}^{K} L(α(x), Cj) p(x, Cj) dx
                        = ∫ p(x) [ Σ_{j=1}^{K} L(α(x), Cj) p(Cj | x) ] dx

For each x we minimize the bracketed term, which is called the conditional risk.

- Bayes minimum loss (risk) decision rule:

  α(x) = argmin_{i=1,...,K} Σ_{j=1}^{K} L_ij p(Cj | x)

  L_ij: the loss of assigning a sample to Ci when the correct class is Cj
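
A minimal sketch of this minimum-risk rule, assuming a K×K loss matrix (loss[i, j] is the cost of deciding Ci when the truth is Cj) and a vector of posteriors p(Cj | x); the numbers are illustrative:

```python
import numpy as np

def min_risk_decision(posteriors, loss):
    """posteriors: shape (K,); loss: shape (K, K) with loss[i, j] = cost of
    deciding class i when the true class is j. Returns the minimizing class index."""
    conditional_risk = loss @ posteriors       # risk of each possible decision
    return int(np.argmin(conditional_risk))

# Illustrative check: with zero-one loss, the rule reduces to argmax of the posterior
posteriors = np.array([0.2, 0.5, 0.3])
zero_one = 1 - np.eye(3)
print(min_risk_decision(posteriors, zero_one))   # -> 1
```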

SLIDE 32

Minimizing expected loss: special case (loss = misclassification rate)

- Problem definition for this special case:
  - If action α(x) = i is taken and the true category is Cj, then the decision is correct if i = j and otherwise it is incorrect.
- Zero-one loss function:

  L_ij = 1 − δ_ij = { 0 if i = j;  1 otherwise }

  α(x) = argmin_{i=1,...,K} Σ_{j=1}^{K} L_ij p(Cj | x)
       = argmin_{i=1,...,K} [ 0 × p(Ci | x) + Σ_{j≠i} p(Cj | x) ]
       = argmin_{i=1,...,K} (1 − p(Ci | x)) = argmax_{i=1,...,K} p(Ci | x)

SLIDE 33

Probabilistic discriminant functions

- Discriminant functions: a popular way of representing a classifier
- A discriminant function g_i(x) for each class Ci (i = 1, ..., K):
  - x is assigned to class Ci if g_i(x) > g_j(x) for all j ≠ i
- Representing the Bayes classifier using discriminant functions:
  - Classifier minimizing the error rate: g_i(x) = p(Ci | x)
  - Classifier minimizing the risk: g_i(x) = − Σ_{j=1}^{K} L_ij p(Cj | x)

SLIDE 34

Naïve Bayes classifier

- Generative methods can require a high number of parameters
- Assumption: conditional independence of the features given the class

  p(x | Ci) = p(x1 | Ci) × p(x2 | Ci) × ⋯ × p(xd | Ci)

SLIDE 35

Naïve Bayes classifier

- In the decision phase, it finds the label of x according to:

  argmax_{i=1,...,K} p(Ci | x) = argmax_{i=1,...,K} p(Ci) Π_{j=1}^{d} p(xj | Ci)

  since p(x | Ci) = p(x1 | Ci) × ⋯ × p(xd | Ci) implies p(Ci | x) ∝ p(Ci) Π_{j=1}^{d} p(xj | Ci)
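
A minimal sketch of this decision rule for binary features, assuming priors[i] holds p(Ci) and cond[i, j] holds p(xj = 1 | Ci); the numbers are illustrative, not from the slides:

```python
import numpy as np

def naive_bayes_predict(x, priors, cond):
    """x: binary feature vector (0/1); priors[i] = p(Ci); cond[i, j] = p(x_j = 1 | Ci).
    Returns the index of the class maximizing p(Ci) * prod_j p(x_j | Ci)."""
    # Work in log space: log p(Ci) + sum_j log p(x_j | Ci)
    log_post = (np.log(priors)
                + (x * np.log(cond) + (1 - x) * np.log(1 - cond)).sum(axis=1))
    return int(np.argmax(log_post))

priors = np.array([0.6, 0.4])
cond = np.array([[0.8, 0.3, 0.5],    # p(x_j = 1 | C1) for each feature j
                 [0.2, 0.7, 0.5]])   # p(x_j = 1 | C2)
print(naive_bayes_predict(np.array([1, 0, 1]), priors, cond))   # -> 0 (class C1)
```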

SLIDE 36

Naïve Bayes classifier

- Fits d univariate distributions p(x1 | Ci), ..., p(xd | Ci) instead of one multivariate distribution p(x | Ci):
  - Example 1: for a Gaussian class-conditional density p(x | Ci), it estimates d + d parameters (a mean and a variance per dimension) instead of d + d(d+1)/2 parameters.
  - Example 2: for a Bernoulli class-conditional density p(x | Ci), it estimates d parameters (one mean per dimension) instead of 2^d − 1 parameters.
- It first estimates the class-conditional densities p(x1 | Ci), ..., p(xd | Ci) and the prior probability p(Ci) for each class (i = 1, ..., K) from the training set.

SLIDE 37

Naïve Bayes: discrete example

Training data (predicting Heart disease H from Diabetes D and Smoking S):

  D (Diabetes)   S (Smoke)   H (Heart disease)
  Y              N           Y
  Y              N           N
  N              Y           N
  N              Y           N
  N              N           N
  N              Y           Y
  N              N           N
  N              Y           Y
  N              N           N
  Y              N           N

Estimated probabilities:

- p(H = Yes) = 0.3
- p(D = Yes | H = Yes) = 1/3
- p(S = Yes | H = Yes) = 2/3
- p(D = Yes | H = No) = 2/7
- p(S = Yes | H = No) = 2/7

Decision on x = [Yes, Yes] (a person who has diabetes and also smokes):

  p(H = Yes | x) ∝ p(H = Yes) p(D = Yes | H = Yes) p(S = Yes | H = Yes) = 0.3 × 1/3 × 2/3 ≈ 0.066
  p(H = No | x)  ∝ p(H = No) p(D = Yes | H = No) p(S = Yes | H = No) = 0.7 × 2/7 × 2/7 ≈ 0.057

Thus decide H = Yes.
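
A small sketch that reproduces the numbers above from the table; the arrays just re-encode the ten rows with Yes = 1, No = 0:

```python
import numpy as np

# Ten rows of the table above, Yes = 1, No = 0
D = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 1])
S = np.array([0, 0, 1, 1, 0, 1, 0, 1, 0, 0])
H = np.array([1, 0, 0, 0, 0, 1, 0, 1, 0, 0])

p_h = H.mean()                          # p(H = Yes) = 0.3
p_d_given_h = D[H == 1].mean()          # p(D = Yes | H = Yes) = 1/3
p_s_given_h = S[H == 1].mean()          # p(S = Yes | H = Yes) = 2/3
p_d_given_not_h = D[H == 0].mean()      # p(D = Yes | H = No) = 2/7
p_s_given_not_h = S[H == 0].mean()      # p(S = Yes | H = No) = 2/7

score_yes = p_h * p_d_given_h * p_s_given_h                    # ≈ 0.066
score_no = (1 - p_h) * p_d_given_not_h * p_s_given_not_h       # ≈ 0.057
print(score_yes, score_no,
      "decide H = Yes" if score_yes > score_no else "decide H = No")
```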

SLIDE 38

Probabilistic classifiers

- How can we find the probabilities required in the Bayes decision rule?
- Probabilistic classification approaches can be divided into two main categories:
  - Generative
    - Estimate the joint pdf p(x, Ci) for each class Ci and then use it to find p(Ci | x)
    - or alternatively estimate both the pdf p(x | Ci) and p(Ci) to find p(Ci | x)
  - Discriminative
    - Directly estimate p(Ci | x) for each class Ci

SLIDE 39

Generative approach

- Inference stage
  - Determine the class-conditional densities p(x | Ci) and the priors p(Ci)
  - Use Bayes' theorem to find p(Ci | x)
- Decision stage: after learning the model (inference stage), make the optimal class assignment for a new input
  - If p(Cj | x) > p(Ck | x) for all k ≠ j, then decide Cj

SLIDE 40

Discriminative vs. generative approach

[Figure from Bishop]

SLIDE 41

Class-conditional densities vs. posterior

[Figure from Bishop: class-conditional densities p(x|C1), p(x|C2) and the posterior p(C1|x)]

With a shared covariance matrix Σ, the posterior is a logistic sigmoid of a linear function of x:

  p(C1 | x) = σ(wᵀx + w0)

  w  = Σ⁻¹ (μ1 − μ2)
  w0 = −(1/2) μ1ᵀ Σ⁻¹ μ1 + (1/2) μ2ᵀ Σ⁻¹ μ2 + ln( p(C1) / p(C2) )
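
A minimal sketch of these two formulas, assuming the parameters come from the shared-covariance fit sketched earlier:

```python
import numpy as np

def posterior_params(mu1, mu2, Sigma, p1):
    """w and w0 such that p(C1 | x) = sigmoid(w.T x + w0) for the shared-covariance model."""
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu2)
    w0 = (-0.5 * mu1 @ Sigma_inv @ mu1
          + 0.5 * mu2 @ Sigma_inv @ mu2
          + np.log(p1 / (1 - p1)))
    return w, w0

def posterior_c1(x, w, w0):
    # Sigmoid of the linear function of x
    return 1.0 / (1.0 + np.exp(-(w @ x + w0)))
```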

SLIDE 42

Discriminative approach

- Inference stage
  - Determine the posterior class probabilities p(Ci | x) directly
- Decision stage: after learning the model (inference stage), make the optimal class assignment for a new input
  - If p(Cj | x) > p(Ck | x) for all k ≠ j, then decide Cj

SLIDE 43

Posterior probabilities

- Two-class: p(C1 | x) can be written as a logistic sigmoid for a wide choice of p(x | Ci) distributions:

  p(C1 | x) = σ(a(x)) = 1 / (1 + exp(−a(x)))

- Multi-class: p(Ci | x) can be written as a softmax for a wide choice of p(x | Ci):

  p(Ci | x) = exp(a_i(x)) / Σ_{j=1}^{K} exp(a_j(x))

SLIDE 44

Discriminative approach: logistic regression

K = 2,  x = [1, x1, ..., xd]ᵀ,  w = [w0, w1, ..., wd]ᵀ

- More general than discriminant functions:
  - f(x; w) predicts the posterior probability p(y = 1 | x)

  f(x; w) = σ(wᵀx)

- σ(·) is the sigmoid (logistic) activation function:

  σ(a) = 1 / (1 + e^(−a))
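
A minimal sketch of this model, assuming each input row already includes the constant 1 as its first component:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def predict_proba(X, w):
    """X: (N, d+1) design matrix whose first column is all ones, w: (d+1,) weights.
    Returns p(y = 1 | x) for every row of X."""
    return sigmoid(X @ w)
```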

SLIDE 45

Logistic regression

K = 2,  y ∈ {0, 1}

- f(x; w): probability that y = 1 given x (parameterized by w)

  p(y = 1 | x, w) = f(x; w)
  p(y = 0 | x, w) = 1 − f(x; w)

  f(x; w) = σ(wᵀx),  0 ≤ f(x; w) ≤ 1: the estimated probability of y = 1 on input x

- Example: cancer (malignant vs. benign)
  - f(x; w) = 0.7 means a 70% chance of the tumor being malignant

SLIDE 46

Logistic regression: decision surface

- Decision surface: f(x; w) = constant

  f(x; w) = σ(wᵀx) = 1 / (1 + e^(−wᵀx)) = 0.5

- Decision surfaces are linear functions of x
- If f(x; w) ≥ 0.5 then y = 1, else y = 0
  - Equivalent to: if wᵀx ≥ 0 then y = 1, else y = 0

SLIDE 47

Logistic regression: ML estimation

- Maximum (conditional) log-likelihood:

  ŵ = argmax_w log Π_{i=1}^{N} p(y^(i) | w, x^(i))

  p(y^(i) | w, x^(i)) = f(x^(i); w)^(y^(i)) (1 − f(x^(i); w))^(1 − y^(i))

  log p(y | X, w) = Σ_{i=1}^{N} [ y^(i) log f(x^(i); w) + (1 − y^(i)) log(1 − f(x^(i); w)) ]
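
A minimal sketch of this conditional log-likelihood (the negative of the cross-entropy cost on the next slide), under the same assumed X, y layout as before:

```python
import numpy as np

def log_likelihood(w, X, y):
    """Conditional log-likelihood of logistic regression.
    X: (N, d+1) design matrix with a leading column of ones, y: (N,) labels in {0, 1}."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))   # f(x^(i); w) for every sample
    eps = 1e-12                          # guard against log(0)
    return np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
```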

SLIDE 48

Logistic regression: cost function

  ŵ = argmin_w J(w)

  J(w) = − Σ_{i=1}^{N} log p(y^(i) | w, x^(i))
       = Σ_{i=1}^{N} [ −y^(i) log f(x^(i); w) − (1 − y^(i)) log(1 − f(x^(i); w)) ]

- No closed-form solution for ∇w J(w) = 0
- However, J(w) is convex.

SLIDE 49

Logistic regression: gradient descent

  w^(t+1) = w^(t) − η ∇w J(w^(t))

  ∇w J(w) = Σ_{i=1}^{N} ( f(x^(i); w) − y^(i) ) x^(i)

- Is it similar to the gradient of the SSE cost for linear regression?

  ∇w J(w) = Σ_{i=1}^{N} ( wᵀx^(i) − y^(i) ) x^(i)
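
A minimal batch gradient-descent sketch for this update rule; the step size, iteration count, and toy data are illustrative choices, not from the slides:

```python
import numpy as np

def fit_logistic_gd(X, y, eta=0.01, n_iters=2000):
    """Batch gradient descent on the logistic-regression cost J(w).
    X: (N, d+1) with a leading column of ones, y: (N,) in {0, 1}."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))   # f(x^(i); w) for all i
        grad = X.T @ (p - y)                 # sum_i (f(x^(i); w) - y^(i)) x^(i)
        w -= eta * grad
    return w

# Tiny illustrative run
rng = np.random.default_rng(0)
X_raw = np.vstack([rng.normal(-1, 1, size=(20, 2)), rng.normal(+1, 1, size=(20, 2))])
X = np.hstack([np.ones((40, 1)), X_raw])
y = np.array([0] * 20 + [1] * 20)
w = fit_logistic_gd(X, y)
print(((1.0 / (1.0 + np.exp(-(X @ w))) >= 0.5) == y).mean())   # training accuracy
```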

SLIDE 50

Logistic regression: loss function

  f(x; w) = 1 / (1 + exp(−wᵀx))

  Loss(y, f(x; w)) = −y log f(x; w) − (1 − y) log(1 − f(x; w))

Since y = 1 or y = 0:

  Loss(y, f(x; w)) = −log f(x; w)        if y = 1
                     −log(1 − f(x; w))   if y = 0

How is it related to the zero-one loss?

  Loss(y, ŷ) = 1 if ŷ ≠ y,  0 if ŷ = y

SLIDE 51

Logistic regression: cost function (summary)

- Logistic Regression (LR) has a more suitable cost function for classification than SSE and the Perceptron.
- Why is the cost function of LR more suitable than the SSE cost

  J(w) = (1/N) Σ_{i=1}^{N} ( y^(i) − f(x^(i); w) )²,   where f(x; w) = σ(wᵀx)?

  - The conditional distribution p(y | x, w) in the classification problem is not Gaussian (it is Bernoulli).
  - The cost function of LR is also convex, whereas the SSE cost with a sigmoid is not.

SLIDE 52

Multi-class logistic regression

- For each class i, f_i(x; W) predicts the probability of y = i, i.e., P(y = i | x, W)
- On a new input x, to make a prediction, pick the class that maximizes f_i(x; W):

  α(x) = argmax_{i=1,...,K} f_i(x; W)

  If f_i(x; W) > f_j(x; W) for all j ≠ i, then decide Ci.

SLIDE 53

Multi-class logistic regression

K > 2,  y ∈ {1, 2, ..., K}

  f_i(x; W) = p(y = i | x) = exp(w_iᵀx) / Σ_{j=1}^{K} exp(w_jᵀx)

- Normalized exponential (aka softmax)
- If w_iᵀx ≫ w_jᵀx for all j ≠ i, then p(Ci | x) ≃ 1 and p(Cj | x) ≃ 0

Compare with the generative form:  p(Ci | x) = p(x | Ci) p(Ci) / Σ_{j=1}^{K} p(x | Cj) p(Cj)
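
A minimal softmax-prediction sketch, assuming W is a (d+1)×K matrix whose columns are the per-class weight vectors w_i:

```python
import numpy as np

def softmax_predict_proba(X, W):
    """X: (N, d+1) design matrix, W: (d+1, K) weights, one column per class.
    Returns an (N, K) matrix of class probabilities."""
    scores = X @ W
    scores -= scores.max(axis=1, keepdims=True)   # for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

def softmax_predict(X, W):
    # Pick the class with the largest probability (class indices 0..K-1)
    return np.argmax(softmax_predict_proba(X, W), axis=1)
```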

SLIDE 54

Logistic regression: multi-class

  Ŵ = argmin_W J(W)

  J(W) = − log Π_{i=1}^{N} p(y^(i) | x^(i), W)
       = − log Π_{i=1}^{N} Π_{j=1}^{K} f_j(x^(i); W)^(y_j^(i))
       = − Σ_{i=1}^{N} Σ_{j=1}^{K} y_j^(i) log f_j(x^(i); W)

  W = [w1 ⋯ wK]

  Y = [y^(1); ...; y^(N)] is the N×K matrix of targets, where each y^(i) is a vector of length K in 1-of-K coding, e.g., y = [0, 0, 1, 0]ᵀ when the target class is C3.

SLIDE 55

Logistic regression: multi-class

  w_j^(t+1) = w_j^(t) − η ∇_{w_j} J(W^(t))

  ∇_{w_j} J(W) = Σ_{i=1}^{N} ( f_j(x^(i); W) − y_j^(i) ) x^(i)
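
A minimal sketch of this update for all classes at once, assuming Y is the N×K one-hot target matrix from the previous slide and the same (d+1)×K weight layout as the earlier softmax sketch:

```python
import numpy as np

def softmax_gd_step(X, Y, W, eta=0.01):
    """One batch gradient-descent step for multi-class logistic regression.
    X: (N, d+1), Y: (N, K) one-hot targets, W: (d+1, K)."""
    scores = X @ W
    scores -= scores.max(axis=1, keepdims=True)
    P = np.exp(scores)
    P /= P.sum(axis=1, keepdims=True)        # f_j(x^(i); W) for all i, j
    grad = X.T @ (P - Y)                     # column j holds the gradient wrt w_j
    return W - eta * grad
```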

SLIDE 56

Logistic Regression (LR): summary

- LR is a linear classifier.
- The LR optimization problem is obtained by maximum likelihood when assuming a Bernoulli distribution for the conditional probabilities, with mean 1 / (1 + e^(−wᵀx)).
- There is no closed-form solution for its optimization problem.
  - But the cost function is convex, so the global optimum can be found by gradient descent on the cost (equivalently, gradient ascent on the log-likelihood).

SLIDE 57

Discriminative vs. generative: number of parameters

- d-dimensional feature space
- Logistic regression: d + 1 parameters
  - w = (w0, w1, ..., wd)
- Generative approach (Gaussian class-conditionals with a shared covariance matrix):
  - 2d parameters for the means
  - d(d + 1)/2 parameters for the shared covariance matrix
  - one parameter for the class prior p(C1)
- But LR is more robust and less sensitive to incorrect modeling assumptions.

SLIDE 58

Summary of alternatives

- Generative
  - Most demanding, because it finds the joint distribution p(x, Ci)
  - Usually needs a large training set to find p(x | Ci)
  - Can find p(x) ⇒ useful for outlier or novelty detection
- Discriminative
  - Estimates only what is really needed (i.e., p(Ci | x))
  - More computationally efficient

SLIDE 59

Resources

- C. Bishop, "Pattern Recognition and Machine Learning", Sections 4.2 and 4.3.