CSC 411 Lecture 09: Naive Bayes


  1. CSC 411: Lecture 09: Naive Bayes. Class based on Raquel Urtasun & Rich Zemel's lectures. Sanja Fidler, University of Toronto. Feb 8, 2015.

  2. Today: classification. Topics: the multi-dimensional (Gaussian) Bayes classifier, estimating probability densities from data, and the Naive Bayes classifier.

  3. Generative vs Discriminative. Two approaches to classification. Discriminative classifiers estimate the parameters of the decision boundary / class separator directly from labeled examples: learn p(y | x) directly (e.g., logistic regression), or learn a mapping from inputs to classes (least-squares, neural nets). Generative approach: model the distribution of inputs characteristic of each class (Bayes classifier): build a model of p(x | y), then apply Bayes' rule.

  4. Bayes Classifier. Aim: diagnose whether a patient has diabetes, i.e., classify into one of two classes (yes C = 1; no C = 0). Run a battery of tests. Given the patient's results x = [x_1, x_2, ..., x_d]^T, we update the class probabilities using Bayes' rule: p(C | x) = p(x | C) p(C) / p(x). More formally, posterior = (class likelihood × prior) / evidence. How can we compute p(x) for the two-class case? p(x) = p(x | C = 0) p(C = 0) + p(x | C = 1) p(C = 1).
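To make the two-class Bayes rule concrete, here is a minimal sketch in Python; the prior and likelihood values are made up for illustration, not taken from the lecture.

```python
# Minimal sketch of the two-class Bayes rule (illustrative numbers only).
p_C1 = 0.2                      # prior p(C=1), assumed
p_x_given_C1 = 0.05             # class likelihood p(x | C=1), assumed
p_x_given_C0 = 0.01             # class likelihood p(x | C=0), assumed

# Evidence: p(x) = p(x|C=0) p(C=0) + p(x|C=1) p(C=1)
p_x = p_x_given_C0 * (1 - p_C1) + p_x_given_C1 * p_C1

# Posterior: p(C=1|x) = p(x|C=1) p(C=1) / p(x)
p_C1_given_x = p_x_given_C1 * p_C1 / p_x
print(p_C1_given_x)             # ~0.556 with these made-up numbers
```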

  5. Classification: Diabetes Example. Last class we had a single observation per patient: the white blood cell count. p(C = 1 | x = 48) = p(x = 48 | C = 1) p(C = 1) / p(x = 48). Add a second observation, the plasma glucose value: now our input x is 2-dimensional.

  6. Gaussian Discriminant Analysis (Gaussian Bayes Classifier). Gaussian Discriminant Analysis in its general form assumes that p(x | t) follows a multivariate normal (Gaussian) distribution: p(x | t = k) = 1 / ((2π)^(d/2) |Σ_k|^(1/2)) exp(-1/2 (x - µ_k)^T Σ_k^{-1} (x - µ_k)), where |Σ_k| denotes the determinant of the matrix and d is the dimension of x. Each class k has an associated mean vector µ_k and covariance matrix Σ_k. Typically the classes share a single covariance matrix Σ ("share" means they have the same parameters, here the covariance matrix): Σ = Σ_1 = ... = Σ_k.
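A small sketch of the density formula above, assuming NumPy; the mean, covariance, and query point are illustrative values, not from the slides.

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Multivariate normal density, following the formula on the slide."""
    d = x.shape[0]
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)            # (x-mu)^T Sigma^{-1} (x-mu)
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

# Illustrative 2-D example (assumed values).
mu = np.array([1.0, 2.0])
Sigma = np.array([[1.0, 0.5],
                  [0.5, 1.0]])
print(gaussian_pdf(np.array([1.0, 2.0]), mu, Sigma))      # density at the mean
```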

  7. Multivariate Data. Multiple measurements (sensors): d inputs/features/attributes, N instances/observations/examples. The data matrix X is N × d, with entry x_i^(n) giving feature i of example n: X = [ x_1^(1) x_2^(1) ... x_d^(1) ; x_1^(2) x_2^(2) ... x_d^(2) ; ... ; x_1^(N) x_2^(N) ... x_d^(N) ].
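In code, this data matrix is simply an N × d array with one example per row; the numbers below are hypothetical patient measurements used only to show the layout.

```python
import numpy as np

# Illustrative only: N = 5 patients, d = 2 features (say white blood cell count
# and plasma glucose). Row n is the example x^(n); column i collects feature x_i.
X = np.array([[48.0,  90.0],
              [52.0, 110.0],
              [60.0, 140.0],
              [45.0,  85.0],
              [70.0, 160.0]])
N, d = X.shape                 # (5, 2)
```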

  8. Multivariate Parameters. Mean: E[x] = [µ_1, ..., µ_d]^T. Covariance: Σ = Cov(x) = E[(x - µ)(x - µ)^T], the d × d matrix with the variances σ_1^2, ..., σ_d^2 on the diagonal and the covariances σ_ij off the diagonal. Correlation: Corr(x) is the covariance divided by the product of the standard deviations, ρ_ij = σ_ij / (σ_i σ_j).
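A minimal sketch of the sample versions of these quantities, assuming NumPy and a randomly generated data matrix (illustrative only).

```python
import numpy as np

# Illustrative only: a random N x d data matrix standing in for real measurements.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))

mu = X.mean(axis=0)                  # sample mean, estimate of E[x]
Sigma = np.cov(X, rowvar=False)      # d x d sample covariance matrix
std = np.sqrt(np.diag(Sigma))        # per-feature standard deviations
Corr = Sigma / np.outer(std, std)    # rho_ij = sigma_ij / (sigma_i * sigma_j)
```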

  9. Multivariate Gaussian Distribution. x ~ N(µ, Σ), a Gaussian (or normal) distribution defined as p(x) = 1 / ((2π)^(d/2) |Σ|^(1/2)) exp(-1/2 (x - µ)^T Σ^{-1} (x - µ)). The Mahalanobis distance (x - µ)^T Σ^{-1} (x - µ) measures the distance from x to µ in terms of Σ; it normalizes for differences in variances and correlations.
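The following sketch computes the squared Mahalanobis distance and shows how the covariance rescales each direction; the numbers are illustrative.

```python
import numpy as np

def mahalanobis_sq(x, mu, Sigma):
    """Squared Mahalanobis distance (x - mu)^T Sigma^{-1} (x - mu)."""
    diff = x - mu
    return diff @ np.linalg.solve(Sigma, diff)

mu = np.zeros(2)
# With Sigma = I this reduces to the squared Euclidean distance:
print(mahalanobis_sq(np.array([3.0, 4.0]), mu, np.eye(2)))             # 25.0
# A feature with larger variance contributes less to the distance:
print(mahalanobis_sq(np.array([3.0, 4.0]), mu, np.diag([1.0, 4.0])))   # 9 + 16/4 = 13.0
```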

  10. Bivariate Normal. Three isotropic cases: Σ = I, Σ = 0.5 I, and Σ = 2 I. Figures: probability density function and contour plot of the pdf for each case.

  11. Bivariate Normal. Three cases: var(x_1) = var(x_2), var(x_1) > var(x_2), and var(x_1) < var(x_2). Figures: probability density function and contour plot of the pdf for each case.

  12. Bivariate Normal. Three cases: Σ = [[1, 0], [0, 1]], Σ = [[1, 0.5], [0.5, 1]], and Σ = [[1, 0.8], [0.8, 1]]. Figures: probability density function and contour plot of the pdf for each case.

  13. Bivariate Normal. Three cases: Cov(x_1, x_2) = 0, Cov(x_1, x_2) > 0, and Cov(x_1, x_2) < 0. Figures: probability density function and contour plot of the pdf for each case.

  14. Gaussian Discriminant Analysis (Gaussian Bayes Classifier). The GDA (GBC) decision boundary is based on the class posterior: log p(t_k | x) = log p(x | t_k) + log p(t_k) - log p(x) = -d/2 log(2π) - 1/2 log|Σ_k| - 1/2 (x - µ_k)^T Σ_k^{-1} (x - µ_k) + log p(t_k) - log p(x). Decision: take the class with the highest posterior probability.
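A sketch of the resulting decision rule, assuming NumPy; since -log p(x) is the same for every class, it is dropped from the argmax.

```python
import numpy as np

def gda_predict(x, mus, Sigmas, priors):
    """Pick the class with the highest log posterior; the shared -log p(x)
    term is omitted because it does not affect the argmax."""
    scores = []
    for mu, Sigma, prior in zip(mus, Sigmas, priors):
        d = x.shape[0]
        diff = x - mu
        log_lik = (-0.5 * d * np.log(2 * np.pi)
                   - 0.5 * np.log(np.linalg.det(Sigma))
                   - 0.5 * diff @ np.linalg.solve(Sigma, diff))
        scores.append(log_lik + np.log(prior))
    return int(np.argmax(scores))
```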

  15. Decision Boundary. The discriminant is where the posterior for t_1 equals 0.5, i.e., p(t_1 | x) = 0.5. Figures: class likelihoods and the posterior for t_1, with the decision boundary marked.

  16. Decision Boundary when the Covariance Matrix is Shared. Figure: with a shared Σ, the boundary between the classes is linear.

  17. Learning. Learn the parameters using maximum likelihood, i.e., minimize the negative log-likelihood: ℓ(φ, µ_0, µ_1, Σ) = -log ∏_{n=1}^N p(x^(n), t^(n) | φ, µ_0, µ_1, Σ) = -log ∏_{n=1}^N p(x^(n) | t^(n), µ_0, µ_1, Σ) p(t^(n) | φ). What have we assumed?
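A direct, unoptimized sketch of this objective, assuming NumPy and SciPy and the two-class, shared-covariance setup on the slide; the helper name neg_log_likelihood is just an illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

def neg_log_likelihood(X, t, phi, mu0, mu1, Sigma):
    """-log prod_n p(x^(n) | t^(n)) p(t^(n) | phi) for two-class GDA
    with a shared covariance. X is N x d, t holds 0/1 labels."""
    nll = 0.0
    for x, tn in zip(X, t):
        mu = mu1 if tn == 1 else mu0
        log_px = multivariate_normal(mean=mu, cov=Sigma).logpdf(x)
        log_pt = tn * np.log(phi) + (1 - tn) * np.log(1 - phi)   # Bernoulli prior
        nll -= log_px + log_pt
    return nll
```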

  18. More on MLE. Assume the prior is Bernoulli (we have two classes): p(t | φ) = φ^t (1 - φ)^(1 - t). The ML estimates can be computed in closed form: φ = (1/N) sum_{n=1}^N 1[t^(n) = 1]; µ_0 = sum_n 1[t^(n) = 0] x^(n) / sum_n 1[t^(n) = 0]; µ_1 = sum_n 1[t^(n) = 1] x^(n) / sum_n 1[t^(n) = 1]; Σ = (1/N) sum_n (x^(n) - µ_{t^(n)}) (x^(n) - µ_{t^(n)})^T.
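These estimates translate almost line-for-line into code; a sketch assuming NumPy, 0/1 labels, and the shared-covariance model above.

```python
import numpy as np

def gda_fit(X, t):
    """Closed-form ML estimates for two-class GDA with a shared covariance.
    X is an N x d array, t is a length-N NumPy array of 0/1 labels."""
    N = X.shape[0]
    phi = np.mean(t == 1)                             # fraction of class-1 examples
    mu0 = X[t == 0].mean(axis=0)                      # class-0 mean
    mu1 = X[t == 1].mean(axis=0)                      # class-1 mean
    mus = np.where((t == 1)[:, None], mu1, mu0)       # mu_{t^(n)} for each example
    diff = X - mus
    Sigma = diff.T @ diff / N                         # shared covariance estimate
    return phi, mu0, mu1, Sigma
```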

  19. Gaussian Discriminant Analysis vs Logistic Regression. If you examine p(t = 1 | x) under GDA, you will find that it takes the form p(t = 1 | x, φ, µ_0, µ_1, Σ) = 1 / (1 + exp(-w^T x)), where w is an appropriate function of (φ, µ_0, µ_1, Σ). So the decision boundary has the same form as logistic regression! When should we prefer GDA to LR, and vice versa?
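One way to see this is to expand the two quadratic forms for a shared Σ: the x^T Σ^{-1} x terms cancel, leaving a linear function of x inside a sigmoid. A numerical sketch, assuming NumPy; the bias term b below is absorbed into w^T x on the slide by augmenting x with a constant 1.

```python
import numpy as np

def gda_posterior_as_logistic(x, phi, mu0, mu1, Sigma):
    """p(t=1 | x) written as a sigmoid of a linear function of x (shared Sigma).
    The weight and bias follow from expanding the two quadratic forms."""
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu0)
    b = (-0.5 * mu1 @ Sigma_inv @ mu1
         + 0.5 * mu0 @ Sigma_inv @ mu0
         + np.log(phi / (1 - phi)))
    return 1.0 / (1.0 + np.exp(-(w @ x + b)))
```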

  20. Gaussian Discriminant Analysis vs Logistic Regression. GDA makes a stronger modeling assumption: it assumes the class-conditional data is multivariate Gaussian. If this is true, GDA is asymptotically efficient (the best model in the limit of large N). But LR is more robust and less sensitive to incorrect modeling assumptions: many class-conditional distributions lead to a logistic classifier, and when these distributions are non-Gaussian, LR beats GDA in the limit of large N.

  21. Simplifying the Model. What if x is high-dimensional? For the Gaussian Bayes classifier, a high-dimensional input x means the covariance matrix has many parameters. We can save some parameters by using a shared covariance matrix across the classes. Any other idea you can think of?

  22. Naive Bayes. Naive Bayes is an alternative generative model: it assumes the features are independent given the class, p(x | t = k) = ∏_{i=1}^d p(x_i | t = k). Assuming the likelihoods are Gaussian, how many parameters does the Naive Bayes classifier require? Important note: Naive Bayes itself does not assume a particular distribution for p(x_i | t = k); the conditional-independence factorization is the defining assumption.
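Under the Gaussian-likelihood assumption, each class needs only d per-feature means and d per-feature variances, so the covariance structure shrinks from order d^2 to order d parameters per class. A fitting sketch, assuming NumPy and integer class labels; the function name is illustrative.

```python
import numpy as np

def gaussian_nb_fit(X, t, n_classes):
    """Per-class, per-feature ML estimates for Gaussian Naive Bayes: with the
    independence assumption, each class needs only d means and d variances
    instead of a full d x d covariance matrix."""
    priors, means, variances = [], [], []
    for k in range(n_classes):
        Xk = X[t == k]
        priors.append(len(Xk) / len(X))          # p(t = k)
        means.append(Xk.mean(axis=0))            # mean of each feature in class k
        variances.append(Xk.var(axis=0))         # variance of each feature in class k
    return np.array(priors), np.array(means), np.array(variances)
```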

  23. Naive Bayes Classifier. Given a prior p(t = k), and assuming the features are conditionally independent given the class with likelihoods p(x_i | t = k) for each x_i, the decision rule is y = argmax_k p(t = k) ∏_{i=1}^d p(x_i | t = k). If the assumption of conditional independence holds, NB is the optimal classifier; if not, it behaves like a heavily regularized version of the full generative classifier. What is the regularization? Note: NB's assumption (conditional independence) typically does not hold in practice. However, the resulting algorithm still works well on many problems, and it typically serves as a decent baseline for more sophisticated models.
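A matching prediction sketch for the Gaussian case, assuming NumPy; it works with log probabilities, which is equivalent to the product above but avoids numerical underflow.

```python
import numpy as np

def gaussian_nb_predict(x, priors, means, variances):
    """argmax_k [ log p(t=k) + sum_i log p(x_i | t=k) ] with Gaussian
    per-feature likelihoods (parameters as returned by gaussian_nb_fit)."""
    log_scores = []
    for prior, mu, var in zip(priors, means, variances):
        log_lik = np.sum(-0.5 * np.log(2 * np.pi * var)
                         - 0.5 * (x - mu) ** 2 / var)
        log_scores.append(np.log(prior) + log_lik)
    return int(np.argmax(log_scores))
```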
