

  1. Naïve Bayes Yingyu Liang Computer Sciences 760 Fall 2017 http://pages.cs.wisc.edu/~yliang/cs760/ Some of the slides in these lectures have been adapted/borrowed from materials developed by Mark Craven, David Page, Jude Shavlik, Tom Mitchell, Nina Balcan, Matt Gormley, Elad Hazan, Tom Dietterich, and Pedro Domingos.

  2. Goals for the lecture • understand the concepts • generative/discriminative models • examples of the two approaches • MLE (Maximum Likelihood Estimation) • Naïve Bayes • Naïve Bayes assumption • model 1: Bernoulli Naïve Bayes • model 2: Multinomial Naïve Bayes • model 3: Gaussian Naïve Bayes • model 4: Multiclass Naïve Bayes

  3. Review: supervised learning problem setting • set of possible instances: X • unknown target function (concept): f : X → Y • set of hypotheses (hypothesis class): H = { h | h : X → Y } • given: a training set of instances of the unknown target function f, { (x^(1), y^(1)), (x^(2), y^(2)), …, (x^(m), y^(m)) } • output: a hypothesis h ∈ H that best approximates the target function

  4. Parametric hypothesis class • a hypothesis h_θ ∈ H is indexed by a parameter θ • learning: find the θ such that h_θ best approximates the target • different from nonparametric approaches like decision trees and nearest neighbor • advantages: flexible choice of hypothesis class; easier to use math/optimization

  5. Discriminative approaches • a hypothesis h ∈ H directly predicts the label given the features: y = h(x), or more generally p(y | x) = h(x) • then define a loss function L(h) and find the hypothesis with minimum loss • example: linear regression, h(x) = ⟨θ, x⟩, L(h) = (1/m) Σ_{i=1..m} ( h(x^(i)) − y^(i) )²
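As a hedged illustration of this discriminative recipe (the synthetic data and variable names are mine, not from the lecture), a short NumPy sketch that fits the linear-regression example by minimizing the squared loss L(h):

    import numpy as np

    # Hypothetical synthetic data: one feature plus a bias column, y = <theta_true, x> + noise.
    rng = np.random.default_rng(0)
    X = np.hstack([rng.normal(size=(100, 1)), np.ones((100, 1))])
    theta_true = np.array([2.0, -1.0])
    y = X @ theta_true + 0.1 * rng.normal(size=100)

    # Discriminative step: choose theta minimizing the average squared loss L(h).
    theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(theta_hat)                          # close to theta_true
    print(np.mean((X @ theta_hat - y) ** 2))  # the minimized loss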

  6. Generative approaches • a hypothesis h ∈ H specifies a generative story for how the data was created: h(x, y) = p(x, y) • then pick a hypothesis by maximum likelihood estimation (MLE) or maximum a posteriori (MAP) estimation • example: roll a weighted die • the weights θ for each side define how the data are generated • use MLE on the training data to learn θ
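A small sketch of the weighted-die example, under the assumption that the die is six-sided and the rolls below are hypothetical training data; the MLE weight for each side is simply its relative frequency:

    import numpy as np

    rolls = np.array([1, 3, 3, 6, 2, 3, 5, 3, 6, 1])  # hypothetical training rolls of a 6-sided die
    counts = np.bincount(rolls, minlength=7)[1:]      # counts for sides 1..6
    theta_mle = counts / counts.sum()                 # MLE: each side's relative frequency
    print(theta_mle)                                  # e.g. side 3 gets weight 0.4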

  7. Comments on discriminative/generative • usually used for supervised learning with a parametric hypothesis class • can also be used for unsupervised learning • k-means clustering (discriminative flavor) vs. mixture of Gaussians (generative) • can also be nonparametric • nonparametric Bayesian methods: a large subfield of ML • when is discriminative or generative likely to be better? discussed in a later lecture • typical discriminative: linear regression, logistic regression, SVM, many neural networks (not all!), … • typical generative: Naïve Bayes, Bayesian networks, …

  8. MLE vs. MAP Maximum Likelihood Estimate (MLE)
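For reference, the standard definition being pointed to here, assuming i.i.d. data D = {x^(1), …, x^(N)} and a model p(x | θ) (notation mine):

    \hat{\theta}_{\mathrm{MLE}}
      = \arg\max_{\theta} \; p(\mathcal{D} \mid \theta)
      = \arg\max_{\theta} \; \sum_{i=1}^{N} \log p\big(x^{(i)} \mid \theta\big)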

  9. Background: MLE Example: MLE of Exponential Distribution

  10. Background: MLE Example: MLE of Exponential Distribution

  11. Background: MLE Example: MLE of Exponential Distribution
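A compact version of the standard derivation for this example, assuming N i.i.d. samples x^(1), …, x^(N) from an exponential distribution with rate λ:

    p(x \mid \lambda) = \lambda e^{-\lambda x}, \quad x \ge 0

    \ell(\lambda) = \sum_{i=1}^{N} \log p\big(x^{(i)} \mid \lambda\big)
                  = N \log \lambda - \lambda \sum_{i=1}^{N} x^{(i)}

    \frac{d\ell}{d\lambda} = \frac{N}{\lambda} - \sum_{i=1}^{N} x^{(i)} = 0
      \;\Rightarrow\;
    \hat{\lambda}_{\mathrm{MLE}} = \frac{N}{\sum_{i=1}^{N} x^{(i)}} = \frac{1}{\bar{x}}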

  12. MLE vs. MAP Maximum Likelihood Estimate (MLE) Maximum a posteriori (MAP) estimate Prior
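The contrast on this slide, written out in standard notation (symbols mine): MAP multiplies the likelihood by a prior p(θ) before maximizing:

    \hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta} \; p(\mathcal{D} \mid \theta),
    \qquad
    \hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta} \; p(\mathcal{D} \mid \theta)\, p(\theta)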

  13. Spam News The Economist The Onion

  14. Model 0: Not-so-naïve Model? Generative Story: 1. Flip a weighted coin ( Y ) 2. If heads, roll the red many-sided die to sample a document vector ( X ) from the Spam distribution 3. If tails, roll the blue many-sided die to sample a document vector ( X ) from the Not-Spam distribution This model is computationally naïve!

  15. Model 0: Not-so-naïve Model? Generative Story: 1. Flip a weighted coin ( Y ) 2. If heads, sample a document ID ( X ) from the Spam distribution 3. If tails, sample a document ID ( X ) from the Not-Spam distribution This model is computationally naïve!

  16. Model 0: Not-so-naïve Model? Flip the weighted coin; if HEADS, roll the red die; if TAILS, roll the blue die. Each side of the die is labeled with a document vector (e.g. [1,0,1,…,1]). Sampled data:
  y | x1 x2 x3 … xK
  0 |  1  0  1 …  1
  1 |  0  1  0 …  1
  1 |  1  1  1 …  1
  0 |  0  0  1 …  1
  0 |  1  0  1 …  0
  1 |  1  0  1 …  0
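A hedged Python sketch of this generative story (all names and probabilities are hypothetical): the class-conditional "die" is a distribution over every possible length-K binary document vector, which is why the model needs on the order of 2^K parameters per class and is computationally naïve:

    import itertools
    import numpy as np

    K = 4                                                  # tiny K; for real vocabularies 2**K is intractable
    vectors = list(itertools.product([0, 1], repeat=K))    # every possible document vector

    rng = np.random.default_rng(0)
    p_y = 0.3                                              # weighted coin for Y (Spam prior)
    # One distribution over ALL 2^K vectors per class: the "red" and "blue" dice.
    p_x_given_spam = rng.dirichlet(np.ones(len(vectors)))
    p_x_given_ham = rng.dirichlet(np.ones(len(vectors)))

    y = rng.random() < p_y                                 # 1. flip the weighted coin
    die = p_x_given_spam if y else p_x_given_ham           # 2./3. pick the class's die
    x = vectors[rng.choice(len(vectors), p=die)]           # roll it to get a whole document vector
    print(y, x)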

  17. Naïve Bayes Assumption Conditional independence of features: P(X_1, …, X_K | Y) = P(X_1 | Y) · … · P(X_K | Y)

  18. Assuming conditional independence, the conditional probabilities encode the same information as the joint table. They are very convenient for estimating P(X_1, …, X_n | Y) = P(X_1 | Y) · … · P(X_n | Y). They are almost as good for computing P(Y | X_1, …, X_n) = P(X_1, …, X_n | Y) P(Y) / P(X_1, …, X_n), i.e. for all x, y: P(Y = y | X_1 = x_1, …, X_n = x_n) = P(X_1 = x_1, …, X_n = x_n | Y = y) P(Y = y) / P(X_1 = x_1, …, X_n = x_n)

  19. Generic Naïve Bayes Model Support: Depends on the choice of event model, P(X_k | Y) Model: Product of prior and the event model Training: Find the class-conditional MLE parameters For P(Y), we find the MLE using all the data. For each P(X_k | Y) we condition on the data with the corresponding class. Classification: Find the class that maximizes the posterior
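The model line, written out for an event model over features X_1, …, X_K (notation mine):

    p(x_1, \dots, x_K, y) = p(y) \prod_{k=1}^{K} p(x_k \mid y)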

  20. Generic Naïve Bayes Model Classification:
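The classification rule being referenced: choose the class with the largest posterior, which by Bayes' rule (dropping the denominator, which does not depend on y) becomes:

    \hat{y} = \arg\max_{y} \; p(y \mid x_1, \dots, x_K)
            = \arg\max_{y} \; p(y) \prod_{k=1}^{K} p(x_k \mid y)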

  21. Model 1: Bernoulli Naïve Bayes Support: Binary vectors of length K Generative Story: Model:
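One common way to write out this generative story and model (the parameter names φ and θ_{k,y} are mine): the label is a weighted coin, and each feature is a class-specific coin:

    Y \sim \mathrm{Bernoulli}(\phi), \qquad
    X_k \mid Y = y \;\sim\; \mathrm{Bernoulli}(\theta_{k,y})

    p(\mathbf{x}, y) = \phi^{y} (1-\phi)^{1-y}
      \prod_{k=1}^{K} \theta_{k,y}^{\,x_k} \, (1-\theta_{k,y})^{1-x_k}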

  22. Model 1: Bernoulli Naïve Bayes Flip the weighted coin; if HEADS, flip each red coin; if TAILS, flip each blue coin. Each red coin corresponds to an x_k. We can generate data in this fashion, though in practice we never would since our data is given. Instead, this provides an explanation of how the data was generated (albeit a terrible one). Sampled data:
  y | x1 x2 x3 … xK
  0 |  1  0  1 …  1
  1 |  0  1  0 …  1
  1 |  1  1  1 …  1
  0 |  0  0  1 …  1
  0 |  1  0  1 …  0
  1 |  1  0  1 …  0

  23. Model 1: Bernoulli Naïve Bayes Support: Binary vectors of length K Generative Story: Same as Generic Naïve Bayes Model: Classification: Find the class that maximizes the posterior

  24. Generic Naïve Bayes Model Classification:

  25. Model 1: Bernoulli Naïve Bayes Training: Find the class-conditional MLE parameters For P(Y), we find the MLE using all the data. For each P(X_k | Y) we condition on the data with the corresponding class.
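A minimal Python sketch of these MLE estimates and the resulting classifier (the data and names are hypothetical; a practical implementation would add smoothing and work in log space to avoid zero probabilities and underflow):

    import numpy as np

    def train_bernoulli_nb(X, y):
        """MLE for P(Y) and each P(X_k | Y) from binary data X (n x K), labels y in {0,1}."""
        phi = y.mean()                                # P(Y=1): fraction of positive examples
        theta = np.stack([X[y == c].mean(axis=0)      # P(X_k=1 | Y=c): per-class feature frequencies
                          for c in (0, 1)])
        return phi, theta

    def predict(x, phi, theta):
        """Return argmax_y P(Y=y) * prod_k P(x_k | Y=y) for a single binary vector x."""
        priors = np.array([1 - phi, phi])
        likelihoods = np.prod(theta ** x * (1 - theta) ** (1 - x), axis=1)
        return int(np.argmax(priors * likelihoods))

    X = np.array([[1, 0, 1], [1, 1, 1], [0, 0, 1], [0, 1, 0]])
    y = np.array([1, 1, 0, 0])
    phi, theta = train_bernoulli_nb(X, y)
    print(predict(np.array([1, 0, 1]), phi, theta))   # predicts class 1 on this toy data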

  26. Model 2: Multinomial Naïve Bayes Support: Integer vector (word IDs) Generative Story: Model:
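Written out under one common parameterization (symbols mine), keeping the binary spam label: a document is a sequence of word IDs x_1, …, x_M, each drawn from a class-specific distribution over a vocabulary of size V:

    X_i \mid Y = y \;\sim\; \mathrm{Categorical}(\theta_{1,y}, \dots, \theta_{V,y}),
    \qquad
    p(\mathbf{x}, y) = p(y) \prod_{i=1}^{M} \theta_{x_i, y}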

  27. Model 3: Gaussian Naïve Bayes Support: Real-valued feature vectors Model: Product of prior and the event model
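Written out (parameter names mine): each real-valued feature gets its own class-conditional Gaussian, so the event model is:

    X_k \mid Y = y \;\sim\; \mathcal{N}(\mu_{k,y}, \sigma_{k,y}^2),
    \qquad
    p(\mathbf{x}, y) = p(y) \prod_{k=1}^{K}
      \frac{1}{\sqrt{2\pi \sigma_{k,y}^2}}
      \exp\!\left( -\frac{(x_k - \mu_{k,y})^2}{2 \sigma_{k,y}^2} \right)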

  28. Model 4: Multiclass Naïve Bayes Model:
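Written out (symbols mine): the label now takes one of C values with prior probabilities π_1, …, π_C, and the factorization over features is unchanged:

    Y \sim \mathrm{Categorical}(\pi_1, \dots, \pi_C),
    \qquad
    p(\mathbf{x}, y) = \pi_y \prod_{k=1}^{K} p(x_k \mid y)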
