
  1. 10-601 Introduction to Machine Learning, Machine Learning Department, School of Computer Science, Carnegie Mellon University. Naïve Bayes. Matt Gormley, Lecture 18, Oct. 31, 2018

  2. Reminders • Homework 6: PAC Learning / Generative Models – Out: Wed, Oct 31 – Due: Wed, Nov 7 at 11:59pm (1 week). TIP: Do the readings! • Exam Viewing – Thu, Nov 1 – Fri, Nov 2

  3. NAÏVE BAYES

  4. Naïve Bayes Outline • Real-world Dataset – Economist vs. Onion articles – Document → bag-of-words → binary feature vector • Naive Bayes: Model – Generating synthetic "labeled documents" – Definition of model – Naive Bayes assumption – Counting # of parameters with / without NB assumption • Naïve Bayes: Learning from Data – Data likelihood – MLE for Naive Bayes – MAP for Naive Bayes • Visualizing Gaussian Naive Bayes

  5. Fake News Detector Today's Goal: to define a generative model of documents of two different classes (e.g. real vs. fake news), such as The Economist vs. The Onion

  6. Naive Bayes: Model (Whiteboard) – Document → bag-of-words → binary feature vector (see the sketch below) – Generating synthetic "labeled documents" – Definition of model – Naive Bayes assumption – Counting # of parameters with / without NB assumption
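To make the "document → bag-of-words → binary feature vector" step concrete, here is a minimal Python sketch; the vocabulary, example document, and function name are invented for illustration and are not from the lecture:

```python
# Minimal sketch: turn a document into a binary bag-of-words feature vector.
# The vocabulary and the example document are invented for illustration.
vocab = ["election", "senate", "alien", "potato", "economy"]

def binary_bow(document, vocab):
    """Return x in {0,1}^K where x_k = 1 iff vocabulary word k appears in the document."""
    tokens = set(document.lower().split())
    return [1 if word in tokens else 0 for word in vocab]

doc = "The senate debated the economy after the election"
print(binary_bow(doc, vocab))  # [1, 1, 0, 0, 1]
```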

  7. Model 1: Bernoulli Naïve Bayes
  Flip the weighted coin to get y. If TAILS, flip each blue coin; if HEADS, flip each red coin. Each red coin corresponds to an x_m.
  y  x1  x2  x3  …  xM
  0   1   0   1  …   1
  1   0   0   1  …   1
  1   1   1   1  …   1
  0   0   1   0  …   1
  0   1   0   1  …   0
  1   1   0   1  …   0
  We can generate data in this fashion. Though in practice we never would, since our data is given. Instead, this provides an explanation of how the data was generated (albeit a terrible one).
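The coin-flipping story above can be simulated directly; a minimal sketch with invented parameter values (phi is the weighted coin, theta[y] holds the biases of the blue coins for y = 0 and the red coins for y = 1):

```python
import random

phi = 0.4                       # P(Y = 1): the weighted coin
theta = {0: [0.2, 0.7, 0.5],    # "blue" coin biases, used when y = 0
         1: [0.9, 0.1, 0.6]}    # "red"  coin biases, used when y = 1

def generate_example():
    y = 1 if random.random() < phi else 0                      # flip the weighted coin
    x = [1 if random.random() < t else 0 for t in theta[y]]    # flip each x_m coin
    return x, y

for _ in range(5):
    print(generate_example())
```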

  8. What's wrong with the Naïve Bayes Assumption? The features might not be independent!! • Example 1: – If a document contains the word "Donald", it's extremely likely to contain the word "Trump" – These are not independent! • Example 2: – If the petal width is very high, the petal length is also likely to be very high

  9. Naïve Bayes: Learning from Data (Whiteboard) – Data likelihood – MLE for Naive Bayes – Example: MLE for Naïve Bayes with Two Features – MAP for Naive Bayes

  10. NAÏVE BAYES: MODEL DETAILS

  11. Model 1: Bernoulli Naïve Bayes
  Support: binary vectors of length K, $\mathbf{x} \in \{0,1\}^K$
  Generative Story:
    $Y \sim \text{Bernoulli}(\phi)$
    $X_k \sim \text{Bernoulli}(\theta_{k,Y}) \quad \forall k \in \{1,\dots,K\}$
  Model:
    $p_{\phi,\theta}(\mathbf{x}, y) = p_{\phi,\theta}(x_1,\dots,x_K,y) = p_\phi(y)\prod_{k=1}^K p_{\theta_k}(x_k \mid y) = (\phi)^y(1-\phi)^{(1-y)}\prod_{k=1}^K(\theta_{k,y})^{x_k}(1-\theta_{k,y})^{(1-x_k)}$
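A short sketch that evaluates this joint probability for given parameters (the parameter values below are invented for illustration):

```python
def bernoulli_nb_joint(x, y, phi, theta):
    """p(x, y) = phi^y (1-phi)^(1-y) * prod_k theta[k][y]^x_k (1-theta[k][y])^(1-x_k)."""
    p = phi if y == 1 else 1.0 - phi
    for k, x_k in enumerate(x):
        t = theta[k][y]            # theta[k][y] = P(X_k = 1 | Y = y)
        p *= t if x_k == 1 else 1.0 - t
    return p

theta = [(0.2, 0.9), (0.7, 0.1), (0.5, 0.6)]                    # invented values
print(bernoulli_nb_joint([1, 0, 1], y=1, phi=0.4, theta=theta))  # 0.4 * 0.9 * 0.9 * 0.6
```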

  12. Model 1: Bernoulli Naïve Bayes
  Support: binary vectors of length K, $\mathbf{x} \in \{0,1\}^K$
  Generative Story:
    $Y \sim \text{Bernoulli}(\phi)$
    $X_k \sim \text{Bernoulli}(\theta_{k,Y}) \quad \forall k \in \{1,\dots,K\}$
  Model (same as Generic Naïve Bayes):
    $p_{\phi,\theta}(\mathbf{x}, y) = (\phi)^y(1-\phi)^{(1-y)}\prod_{k=1}^K(\theta_{k,y})^{x_k}(1-\theta_{k,y})^{(1-x_k)}$
  Classification: find the class that maximizes the posterior, $\hat{y} = \operatorname{argmax}_y\ p(y \mid \mathbf{x})$
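Because p(x) does not depend on y, the posterior can be maximized by maximizing the joint; a minimal classification sketch in log space (parameter values invented for illustration):

```python
import math

def predict(x, phi, theta):
    """Return argmax_y p(y | x) for Bernoulli naive Bayes with y in {0, 1}.

    Uses log p(x, y) = log p(y) + sum_k log p(x_k | y); the argmax over y is the
    same as for the posterior p(y | x) because p(x) is constant in y.
    """
    best_y, best_score = None, -math.inf
    for y in (0, 1):
        score = math.log(phi if y == 1 else 1.0 - phi)
        for k, x_k in enumerate(x):
            t = theta[k][y]                       # P(X_k = 1 | Y = y)
            score += math.log(t if x_k == 1 else 1.0 - t)
        if score > best_score:
            best_y, best_score = y, score
    return best_y

theta = [(0.2, 0.9), (0.7, 0.1), (0.5, 0.6)]      # invented values
print(predict([1, 0, 1], phi=0.4, theta=theta))   # 1
```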

  13. Model 1: Bernoulli Naïve Bayes
  Training: find the class-conditional MLE parameters. For P(Y), we find the MLE using all the data. For each P(X_k|Y) we condition on the data with the corresponding class.
    $\phi = \frac{\sum_{i=1}^N I(y^{(i)} = 1)}{N}$
    $\theta_{k,0} = \frac{\sum_{i=1}^N I(y^{(i)} = 0 \wedge x_k^{(i)} = 1)}{\sum_{i=1}^N I(y^{(i)} = 0)}$
    $\theta_{k,1} = \frac{\sum_{i=1}^N I(y^{(i)} = 1 \wedge x_k^{(i)} = 1)}{\sum_{i=1}^N I(y^{(i)} = 1)} \quad \forall k \in \{1,\dots,K\}$
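A minimal sketch of these count-based MLEs, assuming the training data is given as a list of (x, y) pairs with binary feature lists (this data format and the function name are assumptions, not from the slides):

```python
def train_bernoulli_nb(data):
    """MLE for Bernoulli naive Bayes.

    data: list of (x, y) pairs with x a list of K binary features and y in {0, 1}.
    Returns phi = P(Y = 1) and theta, where theta[k][y] = P(X_k = 1 | Y = y).
    Assumes both classes appear in the data (otherwise the ratios are undefined).
    """
    N, K = len(data), len(data[0][0])
    n_y = {0: 0, 1: 0}                    # count of examples in each class
    n_xy = {0: [0] * K, 1: [0] * K}       # count of x_k = 1 within each class
    for x, y in data:
        n_y[y] += 1
        for k, x_k in enumerate(x):
            n_xy[y][k] += x_k
    phi = n_y[1] / N
    theta = [(n_xy[0][k] / n_y[0], n_xy[1][k] / n_y[1]) for k in range(K)]
    return phi, theta

data = [([1, 0, 1], 0), ([0, 0, 1], 1), ([1, 1, 1], 1), ([0, 1, 0], 0)]
print(train_bernoulli_nb(data))   # phi = 0.5, plus one (theta_k0, theta_k1) pair per feature
```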

  14. Model 1: Bernoulli Naïve Bayes
  Training: find the class-conditional MLE parameters. For P(Y), we find the MLE using all the data. For each P(X_k|Y) we condition on the data with the corresponding class.
    $\phi = \frac{\sum_{i=1}^N I(y^{(i)} = 1)}{N}$
    $\theta_{k,0} = \frac{\sum_{i=1}^N I(y^{(i)} = 0 \wedge x_k^{(i)} = 1)}{\sum_{i=1}^N I(y^{(i)} = 0)}$
    $\theta_{k,1} = \frac{\sum_{i=1}^N I(y^{(i)} = 1 \wedge x_k^{(i)} = 1)}{\sum_{i=1}^N I(y^{(i)} = 1)} \quad \forall k \in \{1,\dots,K\}$
  Data:
  y  x1  x2  x3  …  xK
  0   1   0   1  …   1
  1   0   0   1  …   1
  1   1   1   1  …   1
  0   0   1   0  …   1
  0   1   0   1  …   0
  1   1   0   1  …   0
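For example, reading the class column off the table above: three of the six rows have y = 1, so $\phi = 3/6 = 0.5$; and of the three rows with y = 0, the x1 column equals 1 in two, so $\theta_{1,0} = 2/3$.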

  15. Other NB Models 1. Bernoulli Naïve Bayes: – for binary features 2. Gaussian Naïve Bayes: – for continuous features 3. Multinomial Naïve Bayes: – for integer features 4. Multi-class Naïve Bayes: – for classification problems with > 2 classes – event model could be any of Bernoulli, Gaussian, Multinomial, depending on features

  16. Model 2: Gaussian Naïve Bayes
  Support: $\mathbf{x} \in \mathbb{R}^K$
  Model: product of prior and the event model,
    $p(\mathbf{x}, y) = p(x_1,\dots,x_K,y) = p(y)\prod_{k=1}^K p(x_k \mid y)$
  Gaussian Naive Bayes assumes that $p(x_k \mid y)$ is given by a Normal distribution.
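A small sketch of this event model: each class-conditional $p(x_k \mid y)$ is a univariate Normal with its own mean and variance (the parameter values and function names below are invented for illustration):

```python
import math

def gaussian_pdf(x, mu, sigma2):
    """Density of N(mu, sigma2) evaluated at x."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def gaussian_nb_joint(x, y, prior, mu, sigma2):
    """p(x, y) = p(y) * prod_k N(x_k; mu[y][k], sigma2[y][k])."""
    p = prior[y]
    for k, x_k in enumerate(x):
        p *= gaussian_pdf(x_k, mu[y][k], sigma2[y][k])
    return p

# Invented parameters for two classes and two continuous features.
prior = {0: 0.5, 1: 0.5}
mu = {0: [1.5, 0.2], 1: [4.5, 1.4]}
sigma2 = {0: [0.1, 0.05], 1: [0.3, 0.1]}
print(gaussian_nb_joint([1.4, 0.2], y=0, prior=prior, mu=mu, sigma2=sigma2))
```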

  17. Model 3: Multinomial Naïve Bayes
  Support (Option 1: integer vector of word IDs): $\mathbf{x} = [x_1, x_2, \dots, x_M]$ where $x_m \in \{1,\dots,K\}$ is a word id.
  Generative Story:
    for $i \in \{1,\dots,N\}$:
      $y^{(i)} \sim \text{Bernoulli}(\phi)$
      for $j \in \{1,\dots,M_i\}$:
        $x_j^{(i)} \sim \text{Multinomial}(\theta_{y^{(i)}}, 1)$
  Model:
    $p_{\phi,\theta}(\mathbf{x}, y) = p_\phi(y)\prod_{j=1}^{M_i} p_\theta(x_j \mid y) = (\phi)^y(1-\phi)^{(1-y)}\prod_{j=1}^{M_i}\theta_{y,x_j}$
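A short sketch of this event model: each word token in the document contributes a factor $\theta_{y, x_j}$ (the parameter values are invented for illustration):

```python
def multinomial_nb_joint(word_ids, y, phi, theta):
    """p(x, y) = phi^y (1-phi)^(1-y) * prod_j theta[y][x_j] over the word ids x_j."""
    p = phi if y == 1 else 1.0 - phi
    for word_id in word_ids:
        p *= theta[y][word_id]
    return p

# theta[y] is a distribution over a K-word vocabulary (each row sums to 1); invented values.
theta = {0: [0.5, 0.3, 0.2], 1: [0.1, 0.2, 0.7]}
print(multinomial_nb_joint([2, 0, 2], y=1, phi=0.4, theta=theta))   # 0.4 * 0.7 * 0.1 * 0.7
```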

  18. Model 5: Multiclass Naïve Bayes
  Model: the only change is that we permit y to range over C classes.
    $p(\mathbf{x}, y) = p(x_1,\dots,x_K,y) = p(y)\prod_{k=1}^K p(x_k \mid y)$
  Now $y \sim \text{Multinomial}(\boldsymbol{\phi}, 1)$ and we have a separate conditional distribution $p(x_k \mid y)$ for each of the C classes.

  19. Generic Naïve Bayes Model
  Support: depends on the choice of event model, P(X_k|Y).
  Model: product of prior and the event model,
    $P(\mathbf{X}, Y) = P(Y)\prod_{k=1}^K P(X_k \mid Y)$
  Training: find the class-conditional MLE parameters. For P(Y), we find the MLE using all the data. For each P(X_k|Y) we condition on the data with the corresponding class.
  Classification: find the class that maximizes the posterior, $\hat{y} = \operatorname{argmax}_y\ p(y \mid \mathbf{x})$

  20. Generic Naïve Bayes Model
  Classification:
    $\hat{y} = \operatorname{argmax}_y\ p(y \mid \mathbf{x})$   (posterior)
    $\quad = \operatorname{argmax}_y\ \frac{p(\mathbf{x} \mid y)\,p(y)}{p(\mathbf{x})}$   (by Bayes' rule)
    $\quad = \operatorname{argmax}_y\ p(\mathbf{x} \mid y)\,p(y)$

  21. Smoothing 1. Add-1 Smoothing 2. Add-λ Smoothing 3. MAP Estimation (Beta Prior)

  22. MLE What does maximizing likelihood accomplish? • There is only a finite amount of probability mass (i.e. sum-to-one constraint) • MLE tries to allocate as much probability mass as possible to the things we have observed… …at the expense of the things we have not observed

  23. MLE For Naïve Bayes, suppose we never observe the word "serious" in an Onion article. In this case, what is the MLE of $p(x_k \mid y)$?
    $\theta_{k,0} = \frac{\sum_{i=1}^N I(y^{(i)} = 0 \wedge x_k^{(i)} = 1)}{\sum_{i=1}^N I(y^{(i)} = 0)}$
  Now suppose we observe the word "serious" at test time. What is the posterior probability that the article was an Onion article?
    $p(y \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid y)\,p(y)}{p(\mathbf{x})}$
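The answer is what motivates smoothing: the numerator count is zero, so the MLE gives $\theta_{k,0} = 0$; any test article containing "serious" then has $p(\mathbf{x} \mid y = 0) = 0$, which forces the posterior probability of the Onion class to 0 no matter what the rest of the article contains.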

  24. 1. Add-1 Smoothing The simplest setting for smoothing simply adds a single pseudo-observation to the data. This converts the true observations $\mathcal{D}$ into a new dataset $\mathcal{D}'$ from which we derive the MLEs.
    $\mathcal{D} = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^N$   (1)
    $\mathcal{D}' = \mathcal{D} \cup \{(\mathbf{0}, 0), (\mathbf{0}, 1), (\mathbf{1}, 0), (\mathbf{1}, 1)\}$   (2)
  where $\mathbf{0}$ is the vector of all zeros and $\mathbf{1}$ is the vector of all ones. This has the effect of pretending that we observed each feature $x_k$ with each class $y$.

  25. 1. Add-1 Smoothing What if we write the MLEs in terms of the original dataset $\mathcal{D}$?
    $\phi = \frac{\sum_{i=1}^N I(y^{(i)} = 1)}{N}$
    $\theta_{k,0} = \frac{1 + \sum_{i=1}^N I(y^{(i)} = 0 \wedge x_k^{(i)} = 1)}{2 + \sum_{i=1}^N I(y^{(i)} = 0)}$
    $\theta_{k,1} = \frac{1 + \sum_{i=1}^N I(y^{(i)} = 1 \wedge x_k^{(i)} = 1)}{2 + \sum_{i=1}^N I(y^{(i)} = 1)} \quad \forall k \in \{1,\dots,K\}$
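A sketch of the smoothed estimates, written against hypothetical count variables (n_y[y] = number of training examples with class y, n_xy[y][k] = number of those with x_k = 1; these names are assumptions, not from the slides):

```python
def add_one_smoothed_theta(n_xy, n_y, K):
    """Add-1 smoothing: theta[k][y] = (1 + count(y, x_k = 1)) / (2 + count(y))."""
    return [((1 + n_xy[0][k]) / (2 + n_y[0]),
             (1 + n_xy[1][k]) / (2 + n_y[1])) for k in range(K)]

# Invented counts: 3 documents per class, feature k = 0 never seen with y = 0.
n_y = {0: 3, 1: 3}
n_xy = {0: [0, 2], 1: [3, 1]}
print(add_one_smoothed_theta(n_xy, n_y, K=2))
# theta[0][0] becomes 1/5 = 0.2 instead of the unsmoothed MLE's 0/3 = 0
```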

  26. 2. Add-λ Smoothing For the Categorical Distribution Suppose we have a dataset obtained by repeatedly rolling a K-sided (weighted) die. Given data $\mathcal{D} = \{x^{(i)}\}_{i=1}^N$ where $x^{(i)} \in \{1,\dots,K\}$, we have the following MLE:
    $\phi_k = \frac{\sum_{i=1}^N I(x^{(i)} = k)}{N}$
  With add-λ smoothing, we add pseudo-observations as before to obtain a smoothed estimate:
    $\phi_k = \frac{\lambda + \sum_{i=1}^N I(x^{(i)} = k)}{K\lambda + N}$
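A tiny sketch of the smoothed categorical estimate (the counts and λ below are invented):

```python
def add_lambda_categorical(counts, lam):
    """Smoothed estimate phi_k = (lambda + count_k) / (K*lambda + N) for a K-sided die."""
    N, K = sum(counts), len(counts)
    return [(lam + c) / (K * lam + N) for c in counts]

print(add_lambda_categorical([5, 0, 3], lam=0.1))   # sums to 1; the zero-count side no longer gets 0
```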

  27. 3. MAP Estimation (Beta Prior)
  Generative Story: the parameters are drawn once for the entire dataset.
    for $k \in \{1,\dots,K\}$:
      for $y \in \{0,1\}$:
        $\theta_{k,y} \sim \text{Beta}(\alpha, \beta)$
    for $i \in \{1,\dots,N\}$:
      $y^{(i)} \sim \text{Bernoulli}(\phi)$
      for $k \in \{1,\dots,K\}$:
        $x_k^{(i)} \sim \text{Bernoulli}(\theta_{k,y^{(i)}})$
  Training: find the class-conditional MAP parameters.
    $\phi = \frac{\sum_{i=1}^N I(y^{(i)} = 1)}{N}$
    $\theta_{k,0} = \frac{(\alpha - 1) + \sum_{i=1}^N I(y^{(i)} = 0 \wedge x_k^{(i)} = 1)}{(\alpha - 1) + (\beta - 1) + \sum_{i=1}^N I(y^{(i)} = 0)}$
    $\theta_{k,1} = \frac{(\alpha - 1) + \sum_{i=1}^N I(y^{(i)} = 1 \wedge x_k^{(i)} = 1)}{(\alpha - 1) + (\beta - 1) + \sum_{i=1}^N I(y^{(i)} = 1)} \quad \forall k \in \{1,\dots,K\}$
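A sketch of these MAP estimates; note that setting α = β = 2 recovers the add-1 smoothing rule above (the count variables are the same hypothetical ones used in the earlier sketches):

```python
def map_theta(n_xy, n_y, K, alpha, beta):
    """MAP estimate of theta[k][y] under a Beta(alpha, beta) prior."""
    def est(y, k):
        return ((alpha - 1) + n_xy[y][k]) / ((alpha - 1) + (beta - 1) + n_y[y])
    return [(est(0, k), est(1, k)) for k in range(K)]

# Same invented counts as before; alpha = beta = 2 reproduces add-1 smoothing.
n_y = {0: 3, 1: 3}
n_xy = {0: [0, 2], 1: [3, 1]}
print(map_theta(n_xy, n_y, K=2, alpha=2, beta=2))   # [(0.2, 0.8), (0.6, 0.4)]
```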

  28. VISUALIZING NAÏVE BAYES

  29. Fisher Iris Dataset Fisher (1936) used 150 measurements of flowers from 3 different species: Iris setosa (0), Iris virginica (1), Iris versicolor (2), collected by Anderson (1936).
  Species  Sepal Length  Sepal Width  Petal Length  Petal Width
  0        4.3           3.0          1.1           0.1
  0        4.9           3.6          1.4           0.1
  0        5.3           3.7          1.5           0.2
  1        4.9           2.4          3.3           1.0
  1        5.7           2.8          4.1           1.3
  1        6.3           3.3          4.7           1.6
  1        6.7           3.0          5.0           1.7
  Full dataset: https://en.wikipedia.org/wiki/Iris_flower_data_set
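For the visualization that follows, one quick way to fit a Gaussian naive Bayes model to this dataset is scikit-learn; this is not part of the lecture, just a convenience sketch assuming scikit-learn is installed (its internal class coding may differ from the 0/1/2 labels above):

```python
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)   # 150 flowers, 4 features, 3 species
model = GaussianNB().fit(X, y)
print(model.theta_)                 # per-class feature means
print(model.score(X, y))            # training accuracy, roughly 0.96
```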

  30. Slide from William Cohen
