
Applied Machine Learning: Naive Bayes - PowerPoint PPT Presentation



  1. Applied Machine Learning: Naive Bayes. Siamak Ravanbakhsh, COMP 551 (Winter 2020)

  2. Learning objectives: generative vs. discriminative classifiers; the Naive Bayes classifier and its assumption; different design choices.

  3. Discriminative vs generative classification
     So far we modeled the conditional distribution p(y | x) directly: a discriminative classifier. A generative classifier instead learns the joint distribution p(y, x) = p(y) p(x | y), where
     - p(c) is the prior class probability: the frequency of observing this label;
     - p(x | c) is the likelihood of the input features given the class label (the input features for each label come from a different distribution, e.g. p(x | y = 0) vs. p(x | y = 1)).
     How to classify a new input x? Bayes rule:
         p(y = c \mid x) = \frac{p(c)\, p(x \mid c)}{p(x)}, \qquad p(x) = \sum_{c'=1}^{C} p(x, c')
     where p(y = c | x) is the posterior probability of class c and p(x) is the marginal probability of the input (the evidence). Image: https://rpsychologist.com

  4. Example: Bayes rule for classification
     Patient having cancer? y ∈ {yes, no}; the test result x ∈ {−, +} is a single binary feature.
     - prior: 1% of the population has cancer, p(yes) = 0.01
     - likelihood: p(+ | yes) = 0.9, the true-positive rate of the test (90%); p(+ | no) = 0.05, the false-positive rate (5%)
     - evidence: p(+) = p(yes) p(+ | yes) + p(no) p(+ | no) = 0.01 × 0.9 + 0.99 × 0.05 = 0.0585
     - posterior: p(yes | +) = p(yes) p(+ | yes) / p(+) = 0.009 / 0.0585 ≈ 0.15
     In a generative classifier the likelihood and the prior class probabilities are learned from data.
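     The arithmetic above is easy to check in a few lines (a minimal sketch using the rates stated on the slide):

        # Bayes rule for the cancer-test example
        prior_yes = 0.01          # p(yes)
        tp_rate   = 0.90          # p(+ | yes)
        fp_rate   = 0.05          # p(+ | no)

        evidence  = prior_yes * tp_rate + (1 - prior_yes) * fp_rate   # p(+)
        posterior = prior_yes * tp_rate / evidence                    # p(yes | +)
        print(evidence, posterior)                                    # 0.0585  ~0.154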

  5. Generative classification
     - p(c): prior class probability, the frequency of observing this label.
     - p(x | c): likelihood of the input features given the class label (the input features for each label come from a different distribution).
     Posterior probability of a given class:
         p(y = c \mid x) = \frac{p(c)\, p(x \mid c)}{p(x)}, \qquad p(x) = \sum_{c'=1}^{C} p(x, c')   (marginal probability of the input, the evidence)
     Some generative classifiers:
     - Gaussian Discriminant Analysis: the likelihood is a multivariate Gaussian
     - Naive Bayes: decomposed likelihood
     Image: https://rpsychologist.com

  6. Naive Bayes: model
     Assumption about the likelihood (D is the number of input features):
         p(x \mid y) = \prod_{d=1}^{D} p(x_d \mid y)
     When is this assumption correct? When the features are conditionally independent given the label, x_i ⟂ x_j | y: knowing the label, the value of one input feature gives us no information about the other input features.
     Chain rule of probability (true for any distribution):
         p(x \mid y) = p(x_1 \mid y)\, p(x_2 \mid y, x_1)\, p(x_3 \mid y, x_1, x_2) \cdots p(x_D \mid y, x_1, \ldots, x_{D-1})
     Under the conditional independence assumption, x_1 and x_2 give no extra information about x_3, so p(x_3 | y, x_1, x_2) = p(x_3 | y), and likewise for every factor, which yields the product above; a numeric check is sketched below.
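     The sketch below makes the factorization concrete (illustrative only, with made-up Bernoulli parameters): it builds a class-conditional distribution over three binary features that factorizes by construction and verifies that p(x_3 | y, x_1, x_2) = p(x_3 | y).

        import numpy as np

        theta = np.array([0.8, 0.3, 0.6])        # assumed values of p(x_d = 1 | y = 1), d = 1, 2, 3

        # full table p(x1, x2, x3 | y = 1) as a product of per-feature Bernoulli terms
        joint = np.ones((2, 2, 2))
        for d, t in enumerate(theta):
            shape = [1, 1, 1]
            shape[d] = 2
            joint *= np.array([1 - t, t]).reshape(shape)

        # p(x3 = 1 | y = 1, x1, x2) computed from the table ...
        cond_x3 = joint[..., 1] / joint.sum(axis=-1)
        # ... equals the marginal p(x3 = 1 | y = 1) for every (x1, x2) setting
        print(np.allclose(cond_x3, theta[2]))    # True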

  7. Naive Bayes: objective
     Given the training dataset D = {(x^{(1)}, y^{(1)}), \ldots, (x^{(N)}, y^{(N)})}, maximize the joint log-likelihood (contrast with logistic regression, which maximizes the conditional likelihood p(y | x)):
         \ell(w, u) = \sum_n \log p_{u,w}(x^{(n)}, y^{(n)})
                    = \sum_n \left[ \log p_u(y^{(n)}) + \log p_w(x^{(n)} \mid y^{(n)}) \right]
                    = \sum_n \log p_u(y^{(n)}) + \sum_n \log p_w(x^{(n)} \mid y^{(n)})
                    = \sum_n \log p_u(y^{(n)}) + \sum_d \sum_n \log p_{w_{[d]}}(x_d^{(n)} \mid y^{(n)})   (using the Naive Bayes assumption)
     The terms decouple, so we get separate MLE estimates for each part.
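     To see the decomposition in code, here is a minimal sketch for binary features with Bernoulli likelihoods (the names u and W, and the shapes, are assumptions for illustration; parameters are assumed to lie strictly inside (0, 1)):

        import numpy as np

        def joint_log_likelihood(X, y, u, W):
            """Joint log-likelihood of a Bernoulli Naive Bayes model.
            X: N x D binary features, y: N integer labels in {0, ..., C-1},
            u: C class priors, W: D x C parameters p(x_d = 1 | y = c)."""
            prior_term = np.sum(np.log(u[y]))                 # sum_n log p_u(y^(n))
            theta = W[:, y]                                   # D x N, parameter for each (d, n) pair
            feature_terms = np.sum(X.T * np.log(theta) + (1 - X.T) * np.log(1 - theta))
            return prior_term + feature_terms                 # prior part + per-feature parts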

  8. Naive Bayes: train-test
     Given the training dataset D = {(x^{(1)}, y^{(1)}), \ldots, (x^{(N)}, y^{(N)})}.
     Training time: learn the prior class probabilities p_u(y) and the likelihood components p_{w_{[d]}}(x_d | y) for all d.
     Test time: find the posterior class probabilities and predict
         \arg\max_c p(c \mid x) = \arg\max_c \frac{p_u(c) \prod_{d=1}^{D} p_{w_{[d]}}(x_d \mid c)}{\sum_{c'=1}^{C} p_u(c') \prod_{d=1}^{D} p_{w_{[d]}}(x_d \mid c')}
     (the denominator does not depend on c, so the arg max can be taken over the numerator alone).

  9. Class prior
         p(c \mid x) = \frac{p_u(c) \prod_{d=1}^{D} p_{w_{[d]}}(x_d \mid c)}{\sum_{c'=1}^{C} p_u(c') \prod_{d=1}^{D} p_{w_{[d]}}(x_d \mid c')}
     Binary classification: Bernoulli distribution p_u(y) = u^y (1 - u)^{1 - y}.
     Maximizing the log-likelihood:
         \ell(u) = \sum_{n=1}^{N} y^{(n)} \log u + (1 - y^{(n)}) \log(1 - u) = N_1 \log u + (N - N_1) \log(1 - u)
     where N_1 is the frequency of class 1 in the dataset and N - N_1 the frequency of class 0. Setting the derivative to zero,
         \frac{d}{du} \ell(u) = \frac{N_1}{u} - \frac{N - N_1}{1 - u} = 0 \;\Rightarrow\; u^* = \frac{N_1}{N}
     so the maximum-likelihood estimate (MLE) is the frequency of the class labels.

  10. Class prior (multiclass)
         p(c \mid x) = \frac{p_u(c) \prod_{d=1}^{D} p_{w_{[d]}}(x_d \mid c)}{\sum_{c'=1}^{C} p_u(c') \prod_{d=1}^{D} p_{w_{[d]}}(x_d \mid c')}
     Multiclass classification: categorical distribution p_u(y) = \prod_{c=1}^{C} u_c^{y_c}, assuming one-hot coding for the labels; u = [u_1, \ldots, u_C] is now a parameter vector.
     Maximizing the log-likelihood \ell(u) = \sum_n \sum_c y_c^{(n)} \log u_c subject to \sum_c u_c = 1 gives a closed form for the optimal parameter:
         u^* = [N_1 / N, \ldots, N_C / N]
     where N_c is the number of instances in class c and N is the number of all instances in the dataset.
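     As a minimal sketch (assuming integer labels 0..C-1 in a numpy array y), the MLE of the class prior is just label counting:

        import numpy as np

        y = np.array([0, 2, 1, 1, 0, 2, 2, 2])   # toy labels, assumed encoding 0..C-1
        u = np.bincount(y) / len(y)              # u*_c = N_c / N, the class frequencies
        print(u)                                 # [0.25 0.25 0.5]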

  11. Likelihood terms (class-conditionals)
         p(c \mid x) = \frac{p_u(c) \prod_{d=1}^{D} p_{w_{[d]}}(x_d \mid c)}{\sum_{c'=1}^{C} p_u(c') \prod_{d=1}^{D} p_{w_{[d]}}(x_d \mid c')}
     The choice of likelihood distribution depends on the type of the features (the likelihood encodes our assumption about the "generative process"):
     - Bernoulli for binary features
     - Categorical for categorical features
     - Gaussian for continuous features
     - ...
     Note that these are different from the choice of distribution for the class prior, and each feature x_d may use a different likelihood. We get separate maximum-likelihood estimates for each feature:
         w_{[d]}^* = \arg\max_{w_{[d]}} \sum_{n=1}^{N} \log p_{w_{[d]}}(x_d^{(n)} \mid y^{(n)})

  12. Bernoulli Naive Bayes
     Binary features: the likelihood is Bernoulli, with one parameter per label,
         p_{w_{[d]}}(x_d \mid y = 0) = \text{Bernoulli}(x_d; w_{[d],0})
         p_{w_{[d]}}(x_d \mid y = 1) = \text{Bernoulli}(x_d; w_{[d],1})
     or in short form p_{w_{[d]}}(x_d \mid y) = \text{Bernoulli}(x_d; w_{[d],y}).
     Maximum-likelihood estimation is similar to what we saw for the prior; the closed-form solution of the MLE is
         w_{[d],c}^* = \frac{N(y = c, x_d = 1)}{N(y = c)}
     where N(\cdot) is the number of training instances satisfying the condition.
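     A minimal numpy sketch of this closed form (the names X, y, C are assumptions; this is the unsmoothed MLE exactly as above, while in practice pseudo-counts are often added to avoid zero probabilities):

        import numpy as np

        def bernoulli_mle(X, y, C):
            """W[d, c] = N(y = c, x_d = 1) / N(y = c) for binary features X (N x D)."""
            W = np.zeros((X.shape[1], C))
            for c in range(C):
                W[:, c] = X[y == c].mean(axis=0)   # fraction of class-c instances with x_d = 1
            return W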

  13. Example: Bernoulli Naive Bayes
     Using naive Bayes for document classification: 2 classes (document types) and 600 binary features (a vocabulary of 600 words), with x_d^{(n)} = 1 if word d is present in document n. The learned parameters w_{[d],0}^* and w_{[d],1}^* give the likelihood of the words in the two document types.

        import numpy as np

        def BernoulliNaiveBayes(prior,       # vector of size 2 for the class prior
                                likelihood,  # 600 x 2: likelihood of each word under each class
                                x,           # vector of size 600: binary features for a new document
                                ):
            log_p = np.log(prior) + np.sum(np.log(likelihood) * x[:, None], 0) \
                                  + np.sum(np.log(1 - likelihood) * (1 - x[:, None]), 0)
            log_p -= np.max(log_p)                # numerical stability
            posterior = np.exp(log_p)             # vector of size 2
            posterior /= np.sum(posterior)        # normalize
            return posterior                      # posterior class probabilities
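     A hypothetical usage sketch (random parameters instead of learned ones, just to show the expected shapes):

        rng = np.random.default_rng(0)
        prior = np.array([0.7, 0.3])                      # p(y = 0), p(y = 1)
        likelihood = rng.uniform(0.05, 0.95, (600, 2))    # made-up p(word d present | class c)
        x = rng.integers(0, 2, 600)                       # binary bag-of-words for a new document
        print(BernoulliNaiveBayes(prior, likelihood, x))  # two posterior probabilities summing to 1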

  14. Multinomial Naive Bayes
     What if we wanted to use word frequencies in document classification? Now x_d^{(n)} is the number of times word d appears in document n.
     Multinomial likelihood:
         p_w(x \mid c) = \frac{(\sum_d x_d)!}{\prod_{d=1}^{D} x_d!} \prod_{d=1}^{D} w_{d,c}^{x_d}
     We have a parameter vector of size D for each class, i.e. C × D parameters.
     MLE estimates:
         w_{d,c}^* = \frac{\sum_n x_d^{(n)} y_c^{(n)}}{\sum_n \sum_{d'} x_{d'}^{(n)} y_c^{(n)}} = \frac{\text{count of word } d \text{ in all documents labelled } c}{\text{total word count in all documents labelled } c}
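     A minimal sketch of the count-based MLE and the resulting unnormalized log-posterior (the names X, Y, u are assumptions: X is an N x D count matrix, Y an N x C one-hot label matrix, u the class prior; zero counts would need smoothing in practice, and the multinomial coefficient is dropped because it does not depend on the class):

        import numpy as np

        def multinomial_mle(X, Y):
            """w[d, c] = (count of word d in class-c documents) / (total words in class-c documents)."""
            counts = X.T @ Y                      # D x C word counts per class
            return counts / counts.sum(axis=0)    # normalize each class column

        def log_posterior(x, u, W):
            """Unnormalized log p(c | x) for a count vector x: log u_c + sum_d x_d log w_{d,c}."""
            return np.log(u) + x @ np.log(W)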

  15. Gaussian Naive Bayes
     Gaussian likelihood terms:
         p_{w_{[d]}}(x_d \mid y) = \mathcal{N}(x_d; \mu_{d,y}, \sigma_{d,y}^2) = \frac{1}{\sqrt{2\pi \sigma_{d,y}^2}} \exp\!\left(-\frac{(x_d - \mu_{d,y})^2}{2\sigma_{d,y}^2}\right)
     with w_{[d]} = (\mu_{d,1}, \sigma_{d,1}, \ldots, \mu_{d,C}, \sigma_{d,C}): one mean and one standard-deviation parameter for each class-feature pair.
     Writing the log-likelihood and setting its derivative to zero, we get the maximum-likelihood estimates
         \mu_{d,c} = \frac{1}{N_c} \sum_{n=1}^{N} x_d^{(n)} y_c^{(n)}, \qquad \sigma_{d,c}^2 = \frac{1}{N_c} \sum_{n=1}^{N} (x_d^{(n)} - \mu_{d,c})^2 y_c^{(n)}
     i.e. the empirical mean and variance of feature d across the instances with label c.
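     A small numpy sketch of these estimates and of the per-class log-likelihood of a new input (the names X, y, C are assumptions; a tiny eps keeps the variances away from zero):

        import numpy as np

        def gaussian_mle(X, y, C, eps=1e-9):
            """Per class c and feature d: empirical mean and variance of X[y == c, d]."""
            mu  = np.stack([X[y == c].mean(axis=0) for c in range(C)], axis=1)        # D x C
            var = np.stack([X[y == c].var(axis=0) for c in range(C)], axis=1) + eps   # D x C
            return mu, var

        def gaussian_log_likelihood(x, mu, var):
            """sum_d log N(x_d; mu_{d,c}, var_{d,c}) for every class c."""
            return np.sum(-0.5 * np.log(2 * np.pi * var) - (x[:, None] - mu) ** 2 / (2 * var), axis=0)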

  16. Example: Gaussian Naive Bayes
     Classification on the Iris flowers dataset (a classic dataset originally used by Fisher): N_c = 50 samples with D = 4 features for each of C = 3 species of Iris flower. Our setting: 3 classes, 2 features (sepal width, petal length).
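     One way to reproduce a setting like this, assuming scikit-learn is available (columns 1 and 2 of its Iris data are sepal width and petal length):

        from sklearn.datasets import load_iris
        from sklearn.naive_bayes import GaussianNB

        iris = load_iris()
        X = iris.data[:, [1, 2]]          # sepal width, petal length
        y = iris.target                   # 3 classes, 50 samples each
        model = GaussianNB().fit(X, y)
        print(model.score(X, y))          # training accuracy of the fitted classifier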
