  1. STAT 339 Naive Bayes Classification 8-10 March 2017 Colin Reimer Dawson

  2. Outline
  ▸ Naive Bayes Classification
  ▸ Naive Bayes Classifier Using MLE Parameter Estimates
  ▸ Bayesian Naive Bayes
  ▸ Continuous Features
  ▸ Mixed Feature Types

  3. Back to Supervised Learning...
  ▸ How can we use Bayesian methods to do classification?
  ▸ General idea: model the class-conditional distributions $p(\mathbf{x} \mid t = c)$ for each category $c$, and include a prior distribution over categories. Then use Bayes' rule:
  $$p(t_{\text{new}} = c \mid \mathbf{x}_{\text{new}}) = \frac{p(t_{\text{new}} = c)\, p(\mathbf{x}_{\text{new}} \mid t_{\text{new}} = c)}{\sum_{c'} p(t_{\text{new}} = c')\, p(\mathbf{x}_{\text{new}} \mid t_{\text{new}} = c')}$$
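As a numerical illustration (not from the course materials), here is a minimal Python sketch of this rule; the class priors and class-conditional likelihood values are made-up numbers, just to show the arithmetic:

```python
import numpy as np

# Hypothetical class priors p(t = c) and class-conditional likelihoods
# p(x_new | t = c) for three classes; the numbers are illustrative only.
prior = np.array([0.5, 0.3, 0.2])
likelihood = np.array([0.002, 0.010, 0.001])

# Bayes' rule: posterior is proportional to prior * likelihood, then normalized.
unnormalized = prior * likelihood
posterior = unnormalized / unnormalized.sum()
print(posterior)   # approx. [0.238, 0.714, 0.048]
```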

  4. Example: Federalist Papers (Mosteller and Wallace, 1963)
  ▸ Dataset: 12 anonymous essays written to convince New York to ratify the U.S. Constitution, presumed written by John Jay, James Madison, or Alexander Hamilton.
  ▸ Training data: known-author Federalist papers
  ▸ $t$ = author, $\mathbf{x}$ = vector of words
  ▸ Question: how do we model $p(\mathbf{x} \mid t)$?

  5. The Naive Bayes Assumption
  ▸ To simplify, make the unrealistic (hence "naive") assumption that the words are all independent given the class (author). That is, for an $N$-word document,
  $$p(\mathbf{x} \mid t) = \prod_{n=1}^{N} p(x_n \mid t)$$
  ▸ Then the problem reduces to estimating a word distribution for each author.

  6. Categorical and Multinomial Distributions
  A given word, $x_n$, is one of many in a vocabulary. We can model the word distribution for author $t$ as a categorical distribution with parameter vector $\boldsymbol{\theta}_t = \{\theta_{tw}\}$, $w = 1, \ldots, W$, where $\theta_{tw} = p(x_n = w \mid t)$:
  $$p(x_n \mid t) = \prod_{w=1}^{W} \theta_{tw}^{I(x_n = w)}$$
  Then
  $$p(\mathbf{x} \mid t, \boldsymbol{\theta}_t) = \prod_{n=1}^{N} \prod_{w=1}^{W} \theta_{tw}^{I(x_n = w)} = \prod_{w=1}^{W} \theta_{tw}^{n_{tw}},$$
  where $n_{tw}$ is the number of times $w$ appears in documents written by author $t$. The distribution of the $n_{tw}$ (for fixed $t$) is a multinomial distribution.
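A minimal sketch (with a made-up vocabulary and documents, not the Federalist data) of how the count statistics $n_{tw}$ for one author might be tabulated:

```python
import numpy as np

vocab = ["the", "people", "government", "states"]           # toy vocabulary, W = 4
docs_by_author_t = [["the", "people", "the", "states"],     # toy documents for author t
                    ["government", "the", "people"]]

# n_tw: number of times word w appears across author t's documents.
n_tw = np.zeros(len(vocab), dtype=int)
for doc in docs_by_author_t:
    for word in doc:
        n_tw[vocab.index(word)] += 1
print(dict(zip(vocab, n_tw)))   # {'the': 3, 'people': 2, 'government': 1, 'states': 1}
```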

  7. Samples from a Multinomial
  Figure: Each line represents one sample from a multinomial distribution over the values $\{1, 2, \ldots, 6\}$ with equal probabilities for each category. Sample size is 100.

  8. Maximum Likelihood Estimation
  ▸ The likelihood and log likelihood for $\boldsymbol{\theta}_t$:
  $$L(\boldsymbol{\theta}_t) = \prod_{w=1}^{W} \theta_{tw}^{n_{tw}}, \qquad \log L(\boldsymbol{\theta}_t) = \sum_{w=1}^{W} n_{tw} \log(\theta_{tw})$$
  ▸ Because the $\theta_{tw}$ are constrained to sum to 1 for each $t$, we can't just maximize this function freely; we need a constrained optimization technique such as Lagrange multipliers.
  ▸ Omitting details, we get
  $$\hat{\theta}_{tw} = \frac{n_{tw}}{\sum_{w'=1}^{W} n_{tw'}},$$
  i.e., the sample proportions.
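As a sketch with hypothetical counts, the MLE is just each word count divided by the author's total:

```python
import numpy as np

# Hypothetical counts n_tw of W = 4 vocabulary words for one author t.
n_tw = np.array([12, 0, 5, 3])

# MLE: sample proportions, theta_hat_tw = n_tw / sum over w' of n_tw'.
theta_hat = n_tw / n_tw.sum()
print(theta_hat)   # [0.6, 0.0, 0.25, 0.15]
```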

  9. Sparse Data
  ▸ Many of the counts will be zero for a particular author in the training set. Do we really want to say that these words have probability zero for that author?
  ▸ Ad-hoc approach: "Laplace smoothing". Add a small number to each count to get rid of zeroes (a sketch follows below).
  ▸ Bayesian approach: use a prior on the parameter vectors $\boldsymbol{\theta}_t$, $t = 1, \ldots, T$ ($T$ being the number of different classes).
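A sketch of the ad-hoc fix, assuming a pseudo-count of 1 (a common but arbitrary choice); adding it removes the zeros that the raw proportions assign:

```python
import numpy as np

n_tw = np.array([12, 0, 5, 3])   # same hypothetical counts as above
alpha = 1.0                      # pseudo-count added to every word

theta_mle = n_tw / n_tw.sum()                           # zero for unseen words
theta_smoothed = (n_tw + alpha) / (n_tw + alpha).sum()  # strictly positive
print(theta_mle)        # [0.6   0.    0.25  0.15 ]
print(theta_smoothed)   # [0.542 0.042 0.25  0.167]
```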

  10. Dirichlet-Multinomial Model
  ▸ The conjugate prior for a multinomial parameter vector is the Dirichlet distribution:
  $$p(\boldsymbol{\theta}_t \mid \boldsymbol{\alpha}) = \frac{\Gamma(\sum_w \alpha_w)}{\prod_w \Gamma(\alpha_w)} \prod_{w=1}^{W} \theta_{tw}^{\alpha_w - 1}$$
  (where all $\alpha_w > 0$ and $\sum_w \theta_{tw} = 1$)

  11. Mean of a Dirichlet
  The mean vector for a Dirichlet is $E\{\boldsymbol{\theta}_t \mid \boldsymbol{\alpha}\} = (E\{\theta_{t1} \mid \boldsymbol{\alpha}\}, \ldots, E\{\theta_{tW} \mid \boldsymbol{\alpha}\})$, where
  $$E\{\theta_{tw_0} \mid \boldsymbol{\alpha}\} = \frac{\Gamma(\sum_w \alpha_w)}{\prod_w \Gamma(\alpha_w)} \int \theta_{tw_0} \prod_{w=1}^{W} \theta_{tw}^{\alpha_w - 1}\, d\boldsymbol{\theta}_t = \frac{\Gamma(\sum_w \alpha_w)}{\prod_w \Gamma(\alpha_w)} \int \prod_{w=1}^{W} \theta_{tw}^{\alpha_w + I(w = w_0) - 1}\, d\boldsymbol{\theta}_t$$
  $$= \frac{\Gamma(\sum_w \alpha_w)\, \Gamma(\alpha_{w_0} + 1) \prod_{w \neq w_0} \Gamma(\alpha_w)}{\Gamma(\alpha_{w_0}) \prod_{w \neq w_0} \Gamma(\alpha_w)\, \Gamma(\sum_w \alpha_w + 1)} = \frac{\alpha_{w_0}}{\sum_w \alpha_w}$$
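This result, $E\{\theta_{tw} \mid \boldsymbol{\alpha}\} = \alpha_w / \sum_{w'} \alpha_{w'}$, is easy to sanity-check by simulation; a sketch using NumPy's Dirichlet sampler with an arbitrary choice of $\boldsymbol{\alpha}$:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([2.0, 1.0, 5.0, 2.0])     # arbitrary Dirichlet parameters

samples = rng.dirichlet(alpha, size=100_000)
print(samples.mean(axis=0))                # empirical mean, approx. alpha / alpha.sum()
print(alpha / alpha.sum())                 # [0.2, 0.1, 0.5, 0.2]
```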

  12. Dirichlet-Multinomial Model
  ▸ The conjugate prior for a multinomial parameter vector is the Dirichlet distribution:
  $$p(\boldsymbol{\theta}_t \mid \boldsymbol{\alpha}) = \frac{\Gamma(\sum_w \alpha_w)}{\prod_w \Gamma(\alpha_w)} \prod_{w=1}^{W} \theta_{tw}^{\alpha_w - 1}$$
  ▸ Together with the multinomial likelihood:
  $$p(\mathbf{x} \mid \boldsymbol{\theta}_t) = \prod_{w=1}^{W} \theta_{tw}^{n_{tw}}$$
  ▸ the posterior is Dirichlet:
  $$p(\boldsymbol{\theta}_t \mid \mathbf{x}) = \frac{\Gamma(N + \sum_w \alpha_w)}{\prod_w \Gamma(\alpha_w + n_{tw})} \prod_{w=1}^{W} \theta_{tw}^{\alpha_w + n_{tw} - 1}$$

  13. Dirichlet-Multinomial Predictive Distribution
  ▸ The posterior is Dirichlet:
  $$p(\boldsymbol{\theta}_t \mid \mathbf{x}, t) = \frac{\Gamma(N + \sum_w \alpha_w)}{\prod_w \Gamma(\alpha_w + n_{tw})} \prod_{w=1}^{W} \theta_{tw}^{\alpha_w + n_{tw} - 1}$$
  ▸ The predictive probability that $x_{\text{new}} = w_0$ is then
  $$p(x_{\text{new}} = w_0 \mid \mathbf{x}, t) = \int p(x_{\text{new}} = w_0 \mid \boldsymbol{\theta}_t)\, p(\boldsymbol{\theta}_t \mid \mathbf{x}, t)\, d\boldsymbol{\theta}_t = E\{\theta_{tw_0} \mid \mathbf{x}, t\} = \frac{\alpha_{w_0} + n_{tw_0}}{N + \sum_w \alpha_w}$$
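A sketch of the resulting predictive word probabilities, again with hypothetical counts and a symmetric prior; the posterior mean simply adds the prior pseudo-counts $\alpha_w$ to the observed counts:

```python
import numpy as np

n_tw = np.array([12, 0, 5, 3])          # hypothetical counts for author t
alpha = np.array([0.5, 0.5, 0.5, 0.5])  # symmetric Dirichlet prior
N = n_tw.sum()

# Predictive probability p(x_new = w | x, t) = (alpha_w + n_tw) / (N + sum_w alpha_w)
predictive = (alpha + n_tw) / (N + alpha.sum())
print(predictive)          # sums to 1, with no zero entries
```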

  14. Summary: Naive Bayes with Categorical Features
  ▸ We make the "naive Bayes" assumption that the feature dimensions are independent given the class:
  $$p(\mathbf{x} \mid t = c) = \prod_{d=1}^{D} p(x_d \mid t = c)$$
  ▸ Each $p(x_d \mid t)$ is a categorical distribution with parameter vector $\boldsymbol{\theta}_{dt}$, giving probabilities over the values of $x_d$.
  ▸ The MLEs are just the training proportions; but to "smooth", we can use a Dirichlet prior, which yields predictive probabilities (integrating out $\boldsymbol{\theta}_{dt}$):
  $$p(x_{\text{new},d} = w \mid t = c, \mathbf{x}) = \frac{\alpha_w + n_{dtw}}{N_{dt} + \sum_{w'} \alpha_{w'}}$$
  ▸ We can use the same procedure to estimate the prior probabilities $p(t)$.
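Putting the pieces together, here is a minimal sketch of a categorical naive Bayes classifier with Dirichlet smoothing; the toy data, array layout, and helper functions are assumptions for illustration, not code from the course:

```python
import numpy as np

def fit_counts(X, t, n_classes, n_values):
    """Count n[d, c, w]: how often feature d takes value w among class-c examples."""
    N, D = X.shape
    counts = np.zeros((D, n_classes, n_values))
    for n in range(N):
        for d in range(D):
            counts[d, t[n], X[n, d]] += 1
    return counts

def predict_log_probs(x_new, counts, class_counts, alpha=1.0):
    """Unnormalized log posterior over classes for one new example."""
    D, C, W = counts.shape
    log_post = np.log(class_counts / class_counts.sum())   # log prior from class frequencies
    for d in range(D):
        n_dc = counts[d]                                    # C x W counts for feature d
        pred = (alpha + n_dc) / (n_dc.sum(axis=1, keepdims=True) + alpha * W)
        log_post += np.log(pred[:, x_new[d]])
    return log_post

# Toy data: 6 examples, 2 categorical features with 3 possible values, 2 classes.
X = np.array([[0, 1], [0, 2], [1, 1], [2, 0], [2, 2], [1, 0]])
t = np.array([0, 0, 0, 1, 1, 1])

counts = fit_counts(X, t, n_classes=2, n_values=3)
class_counts = np.bincount(t, minlength=2)

log_post = predict_log_probs(np.array([0, 1]), counts, class_counts)
post = np.exp(log_post - log_post.max())
print(post / post.sum())    # posterior over the two classes
```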

  15. Example: Iris Data

  16. A Generative Model for Continuous Features (figure)

  17. Naive Bayes with Continuous Features
  The naive Bayes assumption can be made regardless of the individual feature types:
  $$p(\mathbf{x} \mid t = c) = \prod_{d=1}^{D} p(x_d \mid t = c)$$
  For example, suppose $x_d \mid t$ can be modeled as Normal:
  $$p(x_d \mid t) = \frac{1}{\sqrt{2\pi\sigma_{td}^2}} \exp\left\{-\frac{1}{2\sigma_{td}^2}(x_d - \mu_{td})^2\right\}$$
  Then the joint likelihood function is
  $$L(\boldsymbol{\mu}, \boldsymbol{\sigma}^2) = \prod_{n=1}^{N} \prod_{d=1}^{D} \frac{1}{\sqrt{2\pi\sigma_{t_n d}^2}} \exp\left\{-\frac{1}{2\sigma_{t_n d}^2}(x_{nd} - \mu_{t_n d})^2\right\}$$

  18. MLE Parameter Estimates
  The joint likelihood function is
  $$L(\boldsymbol{\mu}, \boldsymbol{\sigma}^2) = \prod_{n=1}^{N} \prod_{d=1}^{D} \frac{1}{\sqrt{2\pi\sigma_{t_n d}^2}} \exp\left\{-\frac{1}{2\sigma_{t_n d}^2}(x_{nd} - \mu_{t_n d})^2\right\}$$
  When considering the $\mu_{t_0 d_0}, \sigma_{t_0 d_0}^2$ part of the gradient, all terms for which $t_n \neq t_0$ and/or $d \neq d_0$ are constants, so we get
  $$\frac{\partial \log L}{\partial \mu_{t_0 d_0}} = \frac{1}{\sigma_{t_0 d_0}^2} \sum_{n : t_n = t_0} (x_{n d_0} - \mu_{t_0 d_0})$$
  $$\frac{\partial \log L}{\partial \sigma_{t_0 d_0}^2} = \sum_{n : t_n = t_0} \left( -\frac{1}{2\sigma_{t_0 d_0}^2} + \frac{1}{2(\sigma_{t_0 d_0}^2)^2} (x_{n d_0} - \mu_{t_0 d_0})^2 \right)$$
  which means we can consider each class and each coordinate separately, and the MLEs are the MLEs for the corresponding univariate Normal: the within-class sample mean and variance for each dimension.
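A sketch of the resulting estimates on toy data: for each class and feature dimension, the MLEs are the within-class sample mean and (biased, ddof=0) sample variance:

```python
import numpy as np

# Toy continuous data: N = 6 examples, D = 2 features, labels in {0, 1}.
X = np.array([[5.1, 3.5], [4.9, 3.0], [5.0, 3.4],
              [6.3, 2.9], [6.5, 3.0], [6.1, 2.8]])
t = np.array([0, 0, 0, 1, 1, 1])

mu_hat = np.array([X[t == c].mean(axis=0) for c in range(2)])          # shape (C, D)
var_hat = np.array([X[t == c].var(axis=0, ddof=0) for c in range(2)])  # MLE (biased) variance

print(mu_hat)
print(var_hat)
```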

  19. Naive Bayes Generative Model of Iris Data
  Figure: Sepal.Length vs. Sepal.Width for the three species (setosa, versicolor, virginica). Class-conditional densities are shown as bivariate Normals with diagonal covariance matrices. Means and variances estimated using MLE.

  20. Classification
  Having estimated the parameters, the posterior probability that $t_{\text{new}} = c$ is just
  $$p(t = c \mid \mathbf{x}_{\text{new}}, \hat{\boldsymbol{\mu}}, \hat{\boldsymbol{\sigma}}^2) = \frac{p(t = c)\, p(\mathbf{x}_{\text{new}} \mid t = c, \hat{\boldsymbol{\mu}}, \hat{\boldsymbol{\sigma}}^2)}{\sum_{c'} p(t = c')\, p(\mathbf{x}_{\text{new}} \mid t = c', \hat{\boldsymbol{\mu}}, \hat{\boldsymbol{\sigma}}^2)} = \frac{p(t = c) \prod_{d=1}^{D} \mathcal{N}(x_{\text{new},d} \mid \hat{\mu}_{cd}, \hat{\sigma}^2_{cd})}{\sum_{c'} p(t = c') \prod_{d=1}^{D} \mathcal{N}(x_{\text{new},d} \mid \hat{\mu}_{c'd}, \hat{\sigma}^2_{c'd})}$$
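A minimal sketch of the classification step, using rounded estimates from the previous sketch and assumed equal class priors; summing log densities avoids underflow when $D$ is large:

```python
import numpy as np

def log_normal_pdf(x, mu, var):
    """Elementwise log density of x under Normal(mu, var)."""
    return -0.5 * np.log(2 * np.pi * var) - (x - mu) ** 2 / (2 * var)

def posterior(x_new, mu_hat, var_hat, prior):
    """Posterior p(t = c | x_new) under the Gaussian naive Bayes model."""
    log_joint = np.log(prior) + np.array(
        [log_normal_pdf(x_new, mu_hat[c], var_hat[c]).sum()
         for c in range(len(prior))])
    log_joint -= log_joint.max()          # subtract max for numerical stability
    p = np.exp(log_joint)
    return p / p.sum()

# Rounded estimates from the previous sketch (hypothetical values, not real iris fits).
mu_hat = np.array([[5.0, 3.3], [6.3, 2.9]])
var_hat = np.array([[0.007, 0.047], [0.027, 0.007]])
prior = np.array([0.5, 0.5])              # assumed equal class priors

print(posterior(np.array([5.2, 3.3]), mu_hat, var_hat, prior))
```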

  21. Mixed Feature Types
  Since the naive Bayes assumption models all features as conditionally independent given the class label, there is no extra difficulty in constructing models with arbitrary combinations of categorical and quantitative features.
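As a sketch of how this plays out, with hypothetical parameters for one categorical feature and one Gaussian feature, the per-dimension log probabilities simply come from different families and are summed:

```python
import numpy as np

# Per class: a categorical distribution for feature 1 and Normal parameters for feature 2.
theta = np.array([[0.7, 0.2, 0.1],      # p(x_1 = w | t = 0)
                  [0.2, 0.3, 0.5]])     # p(x_1 = w | t = 1)
mu = np.array([5.0, 6.3])               # per-class means for feature 2
var = np.array([0.05, 0.03])            # per-class variances for feature 2
prior = np.array([0.5, 0.5])

x1_new, x2_new = 0, 5.2                 # one categorical value, one real value

log_joint = (np.log(prior)
             + np.log(theta[:, x1_new])
             - 0.5 * np.log(2 * np.pi * var)
             - (x2_new - mu) ** 2 / (2 * var))
post = np.exp(log_joint - log_joint.max())
print(post / post.sum())                # posterior over the two classes
```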
