SLIDE 1

CSC 411: Lecture 09: Naive Bayes

Class based on Raquel Urtasun & Rich Zemel's lectures
Sanja Fidler

University of Toronto

Feb 8, 2015

SLIDE 2

Today

Classification
– Multi-dimensional (Gaussian) Bayes classifier
– Estimate probability densities from data
– Naive Bayes classifier

SLIDE 3

Generative vs Discriminative

Two approaches to classification:
Discriminative classifiers estimate parameters of the decision boundary/class separator directly from labeled examples

◮ learn p(y|x) directly (logistic regression models)
◮ learn mappings from inputs to classes (least-squares, neural nets)

Generative approach: model the distribution of inputs characteristic of the class (Bayes classifier)

◮ Build a model of p(x|y)
◮ Apply Bayes Rule

SLIDE 4

Bayes Classifier

Aim to diagnose whether patient has diabetes: classify into one of two classes (yes C = 1; no C = 0)
Run battery of tests
Given patient's results x = [x_1, x_2, \cdots, x_d]^T, we want to update class probabilities using Bayes Rule:
p(C | x) = \frac{p(x | C) \, p(C)}{p(x)}
More formally: \text{posterior} = \frac{\text{class likelihood} \times \text{prior}}{\text{evidence}}
How can we compute p(x) for the two class case?
p(x) = p(x | C = 0) p(C = 0) + p(x | C = 1) p(C = 1)
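
As a quick illustration of the Bayes Rule computation above, here is a minimal Python sketch for the two-class case; the likelihood values and the prior are made-up numbers for illustration, not the lecture's diabetes data.

    # Hypothetical class-conditional likelihoods p(x|C) evaluated at one observed x,
    # and a hypothetical prior p(C = 1); these numbers are only illustrative.
    lik_c0, lik_c1 = 0.02, 0.05
    prior_c1 = 0.3

    # Evidence: p(x) = p(x|C=0) p(C=0) + p(x|C=1) p(C=1)
    evidence = lik_c0 * (1 - prior_c1) + lik_c1 * prior_c1

    # Posterior via Bayes Rule: p(C=1|x) = p(x|C=1) p(C=1) / p(x)
    posterior_c1 = lik_c1 * prior_c1 / evidence
    print(posterior_c1)  # about 0.517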

SLIDE 5

Classification: Diabetes Example

Last class we had a single observation per patient: white blood cell count
p(C = 1 | x = 48) = \frac{p(x = 48 | C = 1) \, p(C = 1)}{p(x = 48)}
Add second observation: plasma glucose value
Now our input x is 2-dimensional

SLIDE 6

Gaussian Discriminant Analysis (Gaussian Bayes Classifier)

Gaussian Discriminant Analysis in its general form assumes that p(x|t) is distributed according to a multivariate normal (Gaussian) distribution

Multivariate Gaussian distribution:
p(x | t = k) = \frac{1}{(2\pi)^{d/2} |\Sigma_k|^{1/2}} \exp\left[-\frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)\right]
where |\Sigma_k| denotes the determinant of the matrix, and d is the dimension of x

Each class k has an associated mean vector \mu_k and covariance matrix \Sigma_k
Typically the classes share a single covariance matrix \Sigma ("share" means that they have the same parameters, the covariance matrix in this case):
\Sigma = \Sigma_1 = \cdots = \Sigma_k
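
A minimal NumPy sketch of evaluating this class-conditional density (in log space, which is numerically safer); the mean, covariance, and query point are arbitrary illustrative values, not from the lecture.

    import numpy as np

    def log_gaussian_density(x, mu, Sigma):
        """log N(x; mu, Sigma) for a d-dimensional Gaussian."""
        d = len(mu)
        diff = x - mu
        _, logdet = np.linalg.slogdet(Sigma)                 # log |Sigma|
        quad = diff @ np.linalg.solve(Sigma, diff)           # (x - mu)^T Sigma^{-1} (x - mu)
        return -0.5 * (d * np.log(2 * np.pi) + logdet + quad)

    # Illustrative parameters for one class k
    mu_k = np.array([1.0, 2.0])
    Sigma_k = np.array([[1.0, 0.3],
                        [0.3, 2.0]])
    print(log_gaussian_density(np.array([0.5, 1.5]), mu_k, Sigma_k))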

SLIDE 7

Multivariate Data

Multiple measurements (sensors)
d inputs/features/attributes
N instances/observations/examples

X = \begin{bmatrix} x_1^{(1)} & x_2^{(1)} & \cdots & x_d^{(1)} \\ x_1^{(2)} & x_2^{(2)} & \cdots & x_d^{(2)} \\ \vdots & \vdots & \ddots & \vdots \\ x_1^{(N)} & x_2^{(N)} & \cdots & x_d^{(N)} \end{bmatrix}

SLIDE 8

Multivariate Parameters

Mean: E[x] = [\mu_1, \cdots, \mu_d]^T

Covariance:
\Sigma = \mathrm{Cov}(x) = E[(x - \mu)(x - \mu)^T] = \begin{bmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1d} \\ \sigma_{12} & \sigma_2^2 & \cdots & \sigma_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{d1} & \sigma_{d2} & \cdots & \sigma_d^2 \end{bmatrix}

Correlation: \mathrm{Corr}(x) is the covariance divided by the product of standard deviations:
\rho_{ij} = \frac{\sigma_{ij}}{\sigma_i \sigma_j}

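All three quantities can be estimated directly from a data matrix; a short NumPy sketch with made-up data, where rows are the N instances and columns the d features:

    import numpy as np

    X = np.random.randn(100, 3)        # made-up data: N = 100 instances, d = 3 features

    mu = X.mean(axis=0)                # estimate of E[x], a d-vector
    Sigma = np.cov(X, rowvar=False)    # d x d sample covariance (uses the N-1 normalization)
    sd = np.sqrt(np.diag(Sigma))
    Corr = Sigma / np.outer(sd, sd)    # rho_ij = sigma_ij / (sigma_i * sigma_j)
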
SLIDE 9

Multivariate Gaussian Distribution

x \sim N(\mu, \Sigma), a Gaussian (or normal) distribution, is defined as
p(x) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left[-\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu)\right]
The Mahalanobis distance (x - \mu)^T \Sigma^{-1} (x - \mu) measures the distance from x to \mu in terms of \Sigma
It normalizes for differences in variances and correlations
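
The Mahalanobis distance is cheap to compute; a brief NumPy sketch with arbitrary illustrative values of mu and Sigma:

    import numpy as np

    mu = np.array([1.0, 2.0])
    Sigma = np.array([[1.0, 0.3],
                      [0.3, 2.0]])
    x = np.array([2.0, 0.5])

    diff = x - mu
    # Squared Mahalanobis distance: (x - mu)^T Sigma^{-1} (x - mu)
    d2 = diff @ np.linalg.solve(Sigma, diff)
    print(d2)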

SLIDE 10

Bivariate Normal

Figure: probability density function and contour plot of the pdf for \Sigma = I, \Sigma = 0.5\,I, and \Sigma = 2\,I

SLIDE 11

Bivariate Normal

Figure: probability density function and contour plot of the pdf for the cases var(x_1) = var(x_2), var(x_1) > var(x_2), and var(x_1) < var(x_2)

SLIDE 12

Bivariate Normal

Figure: probability density function and contour plot of the pdf for \Sigma = I, \Sigma = \begin{bmatrix} 1 & 0.5 \\ 0.5 & 1 \end{bmatrix}, and \Sigma = \begin{bmatrix} 1 & 0.8 \\ 0.8 & 1 \end{bmatrix}

SLIDE 13

Bivariate Normal

Figure: probability density function and contour plot of the pdf for the cases Cov(x_1, x_2) = 0, Cov(x_1, x_2) > 0, and Cov(x_1, x_2) < 0

SLIDE 14

Gaussian Discriminant Analysis (Gaussian Bayes Classifier)

The GDA (GBC) decision boundary is based on the class posterior:
\log p(t_k | x) = \log p(x | t_k) + \log p(t_k) - \log p(x)
= -\frac{d}{2} \log(2\pi) - \frac{1}{2} \log|\Sigma_k| - \frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) + \log p(t_k) - \log p(x)

Decision: take the class with the highest posterior probability
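
A minimal sketch of that decision rule in NumPy: score each class by its (unnormalized) log posterior and take the argmax; the log p(x) term is dropped because it is the same for every class. The parameter lists here are assumed to have been estimated already.

    import numpy as np

    def gda_predict(x, mus, Sigmas, priors):
        """Return the class k maximizing log p(x | t_k) + log p(t_k)."""
        scores = []
        for mu_k, Sigma_k, prior_k in zip(mus, Sigmas, priors):
            diff = x - mu_k
            _, logdet = np.linalg.slogdet(Sigma_k)
            quad = diff @ np.linalg.solve(Sigma_k, diff)
            log_lik = -0.5 * (len(x) * np.log(2 * np.pi) + logdet + quad)
            scores.append(log_lik + np.log(prior_k))
        return int(np.argmax(scores))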

SLIDE 15

Decision Boundary

Figure: class likelihoods, the posterior for t_1, and the discriminant P(t_1 | x) = 0.5

SLIDE 16

Decision Boundary with a Shared Covariance Matrix

SLIDE 17

Learning

Learn the parameters using maximum likelihood:
\ell(\phi, \mu_0, \mu_1, \Sigma) = -\log \prod_{n=1}^{N} p(x^{(n)}, t^{(n)} \,|\, \phi, \mu_0, \mu_1, \Sigma) = -\log \prod_{n=1}^{N} p(x^{(n)} \,|\, t^{(n)}, \mu_0, \mu_1, \Sigma) \, p(t^{(n)} \,|\, \phi)

What have we assumed?

SLIDE 18

More on MLE

Assume the prior is Bernoulli (we have two classes):
p(t | \phi) = \phi^t (1 - \phi)^{1 - t}

You can compute the ML estimate in closed form:
\phi = \frac{1}{N} \sum_{n=1}^{N} \mathbb{1}[t^{(n)} = 1]
\mu_0 = \frac{\sum_{n=1}^{N} \mathbb{1}[t^{(n)} = 0] \, x^{(n)}}{\sum_{n=1}^{N} \mathbb{1}[t^{(n)} = 0]}
\mu_1 = \frac{\sum_{n=1}^{N} \mathbb{1}[t^{(n)} = 1] \, x^{(n)}}{\sum_{n=1}^{N} \mathbb{1}[t^{(n)} = 1]}
\Sigma = \frac{1}{N} \sum_{n=1}^{N} (x^{(n)} - \mu_{t^{(n)}})(x^{(n)} - \mu_{t^{(n)}})^T
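
These closed-form estimates translate almost line for line into NumPy; a sketch assuming X is an N x d array and t is a length-N array of 0/1 labels:

    import numpy as np

    def fit_gda_shared_cov(X, t):
        N = len(t)
        phi = np.mean(t == 1)                        # Bernoulli prior parameter
        mu0 = X[t == 0].mean(axis=0)                 # class-0 mean
        mu1 = X[t == 1].mean(axis=0)                 # class-1 mean
        mus = np.where((t == 1)[:, None], mu1, mu0)  # mu_{t^(n)} for each example
        diff = X - mus
        Sigma = diff.T @ diff / N                    # shared covariance
        return phi, mu0, mu1, Sigma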

SLIDE 19

Gaussian Discriminant Analysis vs Logistic Regression

If you examine p(t = 1|x) under GDA, you will find that it looks like this:
p(t = 1 \,|\, x, \phi, \mu_0, \mu_1, \Sigma) = \frac{1}{1 + \exp(-w^T x)}
where w is an appropriate function of (\phi, \mu_0, \mu_1, \Sigma)
So the decision boundary has the same form as logistic regression!
When should we prefer GDA to LR, and vice versa?
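
For the shared-covariance case this can be made explicit; the following derivation sketch is not on the slide, but it shows how the quadratic terms in x cancel, leaving a linear function (the bias b is absorbed into w when a constant 1 is appended to x, which gives the form above):

\log \frac{p(x | t = 1)\,\phi}{p(x | t = 0)\,(1 - \phi)} = w^T x + b, \quad \text{so} \quad p(t = 1 | x) = \frac{1}{1 + \exp(-(w^T x + b))}

w = \Sigma^{-1}(\mu_1 - \mu_0), \qquad b = -\frac{1}{2}\mu_1^T \Sigma^{-1} \mu_1 + \frac{1}{2}\mu_0^T \Sigma^{-1} \mu_0 + \log\frac{\phi}{1 - \phi}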

SLIDE 20

Gaussian Discriminant Analysis vs Logistic Regression

GDA makes a stronger modeling assumption: it assumes that the class-conditional data is multivariate Gaussian
If this is true, GDA is asymptotically efficient (best model in the limit of large N)
But LR is more robust and less sensitive to incorrect modeling assumptions
Many class-conditional distributions lead to a logistic classifier
When these distributions are non-Gaussian, in the limit of large N, LR beats GDA

SLIDE 21

Simplifying the Model

What if x is high-dimensional?
For the Gaussian Bayes Classifier, if the input x is high-dimensional, then the covariance matrix has many parameters (d(d + 1)/2 per class)
Save some parameters by using a shared covariance for the classes
Any other idea you can think of?

SLIDE 22

Naive Bayes

Naive Bayes is an alternative generative model: it assumes the features are independent given the class:
p(x | t = k) = \prod_{i=1}^{d} p(x_i | t = k)
Assuming the likelihoods are Gaussian, how many parameters are required for the Naive Bayes classifier?
Important note: Naive Bayes does not assume a particular distribution for the per-feature likelihoods

SLIDE 23

Naive Bayes Classifier

Given
◮ the prior p(t = k)
◮ assuming features are conditionally independent given the class
◮ the likelihood p(x_i | t = k) for each x_i

The decision rule:
y = \arg\max_{k} \; p(t = k) \prod_{i=1}^{d} p(x_i | t = k)

If the assumption of conditional independence holds, NB is the optimal classifier
If not, it is a heavily regularized version of the generative classifier. What's the regularization?
Note: NB's assumption (conditional independence) typically does not hold in practice. However, the resulting algorithm still works well on many problems, and it typically serves as a decent baseline for more sophisticated models
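
A minimal NumPy sketch of this decision rule, working in log space so the product of many small per-feature likelihoods does not underflow; log_priors and log_likelihood are assumed to be supplied by whichever likelihood model is chosen (e.g., the Gaussian one on the next slide):

    import numpy as np

    def naive_bayes_predict(x, log_priors, log_likelihood):
        """log_priors[k] = log p(t = k); log_likelihood(i, xi, k) = log p(x_i = xi | t = k)."""
        scores = [
            log_priors[k] + sum(log_likelihood(i, xi, k) for i, xi in enumerate(x))
            for k in range(len(log_priors))
        ]
        return int(np.argmax(scores))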

SLIDE 24

Gaussian Naive Bayes

The Gaussian Naive Bayes classifier assumes that the likelihoods are Gaussian:
p(x_i | t = k) = \frac{1}{\sqrt{2\pi}\,\sigma_{ik}} \exp\left[\frac{-(x_i - \mu_{ik})^2}{2\sigma_{ik}^2}\right]
(this is just a 1-dimensional Gaussian, one for each input dimension)

This is the same model as Gaussian Discriminant Analysis with a diagonal covariance matrix

Maximum likelihood estimates of the parameters:
\mu_{ik} = \frac{\sum_{n=1}^{N} \mathbb{1}[t^{(n)} = k] \, x_i^{(n)}}{\sum_{n=1}^{N} \mathbb{1}[t^{(n)} = k]}
\sigma_{ik}^2 = \frac{\sum_{n=1}^{N} \mathbb{1}[t^{(n)} = k] \, (x_i^{(n)} - \mu_{ik})^2}{\sum_{n=1}^{N} \mathbb{1}[t^{(n)} = k]}
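
These ML estimates are just per-class, per-feature means and variances; a sketch of fitting them, assuming X is an N x d array and t holds integer class labels 0, ..., K-1:

    import numpy as np

    def fit_gaussian_nb(X, t, K):
        d = X.shape[1]
        prior = np.zeros(K)
        mu = np.zeros((K, d))
        var = np.zeros((K, d))
        for k in range(K):
            Xk = X[t == k]
            prior[k] = len(Xk) / len(X)
            mu[k] = Xk.mean(axis=0)   # mu_{ik} for all i
            var[k] = Xk.var(axis=0)   # sigma^2_{ik} (ML estimate, divides by N_k)
        return prior, mu, var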

SLIDE 25

Decision Boundary: Shared Variances (between Classes)

the variances may differ across input dimensions (but are shared between classes)

SLIDE 26

Decision Boundary: isotropic


Same variance across all classes and input dimensions, all class priors equal
Classification only depends on the distance to the mean. Why?

SLIDE 27

Decision Boundary: isotropic

In this case: \sigma_{i,k} = \sigma (just one parameter), class priors equal (e.g., p(t_k) = 0.5 for the 2-class case)

Going back to the class posterior for GDA:
\log p(t_k | x) = \log p(x | t_k) + \log p(t_k) - \log p(x)
= -\frac{d}{2} \log(2\pi) - \frac{1}{2} \log|\Sigma_k| - \frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) + \log p(t_k) - \log p(x)

where we take \Sigma_k = \sigma^2 I and ignore terms that don't depend on k (they don't matter when we take the max over classes):
\log p(t_k | x) = -\frac{1}{2\sigma^2} (x - \mu_k)^T (x - \mu_k)
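
So in the isotropic, equal-prior case the classifier just picks the nearest class mean in Euclidean distance; a tiny NumPy sketch:

    import numpy as np

    def nearest_mean_predict(x, mus):
        """mus: K x d array of class means (isotropic, equal-prior case)."""
        d2 = np.sum((mus - x) ** 2, axis=1)   # squared Euclidean distance to each mean
        return int(np.argmin(d2))             # = argmax of -(x - mu_k)^T (x - mu_k) / (2 sigma^2)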

SLIDE 28

Spam Classification

You have examples of emails that are spam and non-spam.
How would you classify spam vs non-spam?
Think about it at home; the solution is in the next tutorial.
