  1. STAT 339 Naive Bayes Classification 8-10 March 2017 Colin Reimer Dawson

  2. Outline
  ▸ Naive Bayes Classification
  ▸ Naive Bayes Classifier Using MLE Parameter Estimates
  ▸ Bayesian Naive Bayes
  ▸ Continuous Features
  ▸ Mixed Feature Types

  3. Back to Supervised Learning...
  ▸ How can we use Bayesian methods to do classification?
  ▸ General idea: model the class-conditional distributions $p(\mathbf{x} \mid t = c)$ for each category $c$, and include a prior distribution over categories. Then use Bayes' rule:
  $$p(t_{\text{new}} = c \mid \mathbf{x}_{\text{new}}) = \frac{p(t_{\text{new}} = c)\, p(\mathbf{x}_{\text{new}} \mid t_{\text{new}} = c)}{\sum_{c'} p(t_{\text{new}} = c')\, p(\mathbf{x}_{\text{new}} \mid t_{\text{new}} = c')}$$
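As a numerical illustration (not from the course materials), here is a minimal Python sketch of this rule; the class priors and class-conditional likelihood values are made-up numbers, just to show the arithmetic:

```python
import numpy as np

# Hypothetical class priors p(t = c) and class-conditional likelihoods
# p(x_new | t = c) for three classes; the numbers are illustrative only.
prior = np.array([0.5, 0.3, 0.2])
likelihood = np.array([0.002, 0.010, 0.001])

# Bayes' rule: posterior is proportional to prior * likelihood, then normalized.
unnormalized = prior * likelihood
posterior = unnormalized / unnormalized.sum()
print(posterior)   # approx. [0.238, 0.714, 0.048]
```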

  4. Example: Federalist Papers (Mosteller and Wallace, 1963)
  ▸ Dataset: 12 anonymous essays written to convince New York to ratify the U.S. Constitution, presumed written by John Jay, James Madison, or Alexander Hamilton.
  ▸ Training data: known-author Federalist papers
  ▸ $t$ = author, $\mathbf{x}$ = vector of words
  ▸ Question: how do we model $p(\mathbf{x} \mid t)$?

  5. The Naive Bayes Assumption
  ▸ To simplify, make the unrealistic (hence "naive") assumption that the words are all independent given the class (author). That is, for an $N$-word document,
  $$p(\mathbf{x} \mid t) = \prod_{n=1}^{N} p(x_n \mid t)$$
  ▸ Then the problem reduces to estimating a word distribution for each author.

  6. Categorical and Multinomial Distributions
  A given word, $x_n$, is one of many in a vocabulary. We can model the word distribution for author $t$ as a categorical distribution with parameter vector $\boldsymbol{\theta}_t = \{\theta_{tw}\}$, $w = 1, \ldots, W$, where $\theta_{tw} = p(x_n = w \mid t)$:
  $$p(x_n \mid t) = \prod_{w=1}^{W} \theta_{tw}^{I(x_n = w)}$$
  Then
  $$p(\mathbf{x} \mid t, \boldsymbol{\theta}_t) = \prod_{n=1}^{N} \prod_{w=1}^{W} \theta_{tw}^{I(x_n = w)} = \prod_{w=1}^{W} \theta_{tw}^{n_{tw}},$$
  where $n_{tw}$ is the number of times $w$ appears in documents written by author $t$. The distribution of the $n_{tw}$ (for fixed $t$) is a multinomial distribution.
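A minimal sketch (with a made-up vocabulary and documents, not the Federalist data) of how the count statistics $n_{tw}$ for one author might be tabulated:

```python
import numpy as np

vocab = ["the", "people", "government", "states"]           # toy vocabulary, W = 4
docs_by_author_t = [["the", "people", "the", "states"],     # toy documents for author t
                    ["government", "the", "people"]]

# n_tw: number of times word w appears across author t's documents.
n_tw = np.zeros(len(vocab), dtype=int)
for doc in docs_by_author_t:
    for word in doc:
        n_tw[vocab.index(word)] += 1
print(dict(zip(vocab, n_tw)))   # {'the': 3, 'people': 2, 'government': 1, 'states': 1}
```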

  7. Samples from a Multinomial
  Figure: Each line represents one sample from a multinomial distribution over the values $\{1, 2, \ldots, 6\}$ with equal probabilities for each category. Sample size is 100.

  8. Maximum Likelihood Estimation
  ▸ The likelihood and log likelihood for $\boldsymbol{\theta}_t$:
  $$L(\boldsymbol{\theta}_t) = \prod_{w=1}^{W} \theta_{tw}^{n_{tw}}, \qquad \log L(\boldsymbol{\theta}_t) = \sum_{w=1}^{W} n_{tw} \log(\theta_{tw})$$
  ▸ Because the $\theta_{tw}$ are constrained to sum to 1 for each $t$, we can't just maximize this function freely; we need a constrained optimization technique such as Lagrange multipliers.
  ▸ Omitting details, we get
  $$\hat{\theta}_{tw} = \frac{n_{tw}}{\sum_{w'=1}^{W} n_{tw'}},$$
  i.e., the sample proportions.
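As a sketch with hypothetical counts, the MLE is just each word count divided by the author's total:

```python
import numpy as np

# Hypothetical counts n_tw of W = 4 vocabulary words for one author t.
n_tw = np.array([12, 0, 5, 3])

# MLE: sample proportions, theta_hat_tw = n_tw / sum over w' of n_tw'.
theta_hat = n_tw / n_tw.sum()
print(theta_hat)   # [0.6, 0.0, 0.25, 0.15]
```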

  9. Sparse Data
  ▸ Many of the counts will be zero for a particular author in the training set. Do we really want to say that these words have probability zero for that author?
  ▸ Ad-hoc approach: "Laplace smoothing". Add a small number to each count to get rid of zeroes (a sketch follows below).
  ▸ Bayesian approach: use a prior on the parameter vectors $\boldsymbol{\theta}_t$, $t = 1, \ldots, T$ ($T$ being the number of different classes).
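A sketch of the ad-hoc fix, assuming a pseudo-count of 1 (a common but arbitrary choice); adding it removes the zeros that the raw proportions assign:

```python
import numpy as np

n_tw = np.array([12, 0, 5, 3])   # same hypothetical counts as above
alpha = 1.0                      # pseudo-count added to every word

theta_mle = n_tw / n_tw.sum()                           # zero for unseen words
theta_smoothed = (n_tw + alpha) / (n_tw + alpha).sum()  # strictly positive
print(theta_mle)        # [0.6   0.    0.25  0.15 ]
print(theta_smoothed)   # [0.542 0.042 0.25  0.167]
```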

  10. Dirichlet-Multinomial Model
  ▸ The conjugate prior for a multinomial parameter vector is the Dirichlet distribution:
  $$p(\boldsymbol{\theta}_t \mid \boldsymbol{\alpha}) = \frac{\Gamma(\sum_w \alpha_w)}{\prod_w \Gamma(\alpha_w)} \prod_{w=1}^{W} \theta_{tw}^{\alpha_w - 1}$$
  (where all $\alpha_w > 0$ and $\sum_w \theta_{tw} = 1$)

  11. Mean of a Dirichlet
  The mean vector for a Dirichlet is $E\{\boldsymbol{\theta}_t \mid \boldsymbol{\alpha}\} = (E\{\theta_{t1} \mid \boldsymbol{\alpha}\}, \ldots, E\{\theta_{tW} \mid \boldsymbol{\alpha}\})$, where
  $$E\{\theta_{tw_0} \mid \boldsymbol{\alpha}\} = \frac{\Gamma(\sum_w \alpha_w)}{\prod_w \Gamma(\alpha_w)} \int \theta_{tw_0} \prod_{w=1}^{W} \theta_{tw}^{\alpha_w - 1}\, d\boldsymbol{\theta}_t = \frac{\Gamma(\sum_w \alpha_w)}{\prod_w \Gamma(\alpha_w)} \int \prod_{w=1}^{W} \theta_{tw}^{\alpha_w + I(w = w_0) - 1}\, d\boldsymbol{\theta}_t$$
  $$= \frac{\Gamma(\sum_w \alpha_w)\, \Gamma(\alpha_{w_0} + 1) \prod_{w \neq w_0} \Gamma(\alpha_w)}{\Gamma(\alpha_{w_0}) \prod_{w \neq w_0} \Gamma(\alpha_w)\, \Gamma(\sum_w \alpha_w + 1)} = \frac{\alpha_{w_0}}{\sum_w \alpha_w}$$
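This result, $E\{\theta_{tw} \mid \boldsymbol{\alpha}\} = \alpha_w / \sum_{w'} \alpha_{w'}$, is easy to sanity-check by simulation; a sketch using NumPy's Dirichlet sampler with an arbitrary choice of $\boldsymbol{\alpha}$:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([2.0, 1.0, 5.0, 2.0])     # arbitrary Dirichlet parameters

samples = rng.dirichlet(alpha, size=100_000)
print(samples.mean(axis=0))                # empirical mean, approx. alpha / alpha.sum()
print(alpha / alpha.sum())                 # [0.2, 0.1, 0.5, 0.2]
```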

  12. Dirichlet-Multinomial Model
  ▸ The conjugate prior for a multinomial parameter vector is the Dirichlet distribution:
  $$p(\boldsymbol{\theta}_t \mid \boldsymbol{\alpha}) = \frac{\Gamma(\sum_w \alpha_w)}{\prod_w \Gamma(\alpha_w)} \prod_{w=1}^{W} \theta_{tw}^{\alpha_w - 1}$$
  ▸ Together with the multinomial likelihood:
  $$p(\mathbf{x} \mid \boldsymbol{\theta}_t) = \prod_{w=1}^{W} \theta_{tw}^{n_{tw}}$$
  ▸ the posterior is Dirichlet:
  $$p(\boldsymbol{\theta}_t \mid \mathbf{x}) = \frac{\Gamma(N + \sum_w \alpha_w)}{\prod_w \Gamma(\alpha_w + n_{tw})} \prod_{w=1}^{W} \theta_{tw}^{\alpha_w + n_{tw} - 1}$$

  13. Dirichlet-Multinomial Predictive Distribution
  ▸ The posterior is Dirichlet:
  $$p(\boldsymbol{\theta}_t \mid \mathbf{x}, t) = \frac{\Gamma(N + \sum_w \alpha_w)}{\prod_w \Gamma(\alpha_w + n_{tw})} \prod_{w=1}^{W} \theta_{tw}^{\alpha_w + n_{tw} - 1}$$
  ▸ The predictive probability that $x_{\text{new}} = w_0$ is then
  $$p(x_{\text{new}} = w_0 \mid \mathbf{x}, t) = \int p(x_{\text{new}} = w_0 \mid \boldsymbol{\theta}_t)\, p(\boldsymbol{\theta}_t \mid \mathbf{x}, t)\, d\boldsymbol{\theta}_t = E\{\theta_{tw_0} \mid \mathbf{x}, t\} = \frac{\alpha_{w_0} + n_{tw_0}}{N + \sum_w \alpha_w}$$
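A sketch of the resulting predictive word probabilities, again with hypothetical counts and a symmetric prior; the posterior mean simply adds the prior pseudo-counts $\alpha_w$ to the observed counts:

```python
import numpy as np

n_tw = np.array([12, 0, 5, 3])          # hypothetical counts for author t
alpha = np.array([0.5, 0.5, 0.5, 0.5])  # symmetric Dirichlet prior
N = n_tw.sum()

# Predictive probability p(x_new = w | x, t) = (alpha_w + n_tw) / (N + sum_w alpha_w)
predictive = (alpha + n_tw) / (N + alpha.sum())
print(predictive)          # sums to 1, with no zero entries
```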

  14. Summary: Naive Bayes with Categorical Features
  ▸ We make the "naive Bayes" assumption that the feature dimensions are independent given the class:
  $$p(\mathbf{x} \mid t = c) = \prod_{d=1}^{D} p(x_d \mid t = c)$$
  ▸ Each $p(x_d \mid t)$ is a categorical distribution with parameter vector $\boldsymbol{\theta}_{dt}$, giving probabilities over the values of $x_d$.
  ▸ The MLEs are just the training proportions; but to "smooth", we can use a Dirichlet prior, which yields predictive probabilities (integrating out $\boldsymbol{\theta}_{dt}$):
  $$p(x_{\text{new},d} = w \mid t = c, \mathbf{x}) = \frac{\alpha_w + n_{dtw}}{N_{dt} + \sum_{w'} \alpha_{w'}}$$
  ▸ We can use the same procedure to estimate the prior probabilities $p(t)$.
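Putting the pieces together, here is a minimal sketch of a categorical naive Bayes classifier with Dirichlet smoothing; the toy data, array layout, and helper functions are assumptions for illustration, not code from the course:

```python
import numpy as np

def fit_counts(X, t, n_classes, n_values):
    """Count n[d, c, w]: how often feature d takes value w among class-c examples."""
    N, D = X.shape
    counts = np.zeros((D, n_classes, n_values))
    for n in range(N):
        for d in range(D):
            counts[d, t[n], X[n, d]] += 1
    return counts

def predict_log_probs(x_new, counts, class_counts, alpha=1.0):
    """Unnormalized log posterior over classes for one new example."""
    D, C, W = counts.shape
    log_post = np.log(class_counts / class_counts.sum())   # log prior from class frequencies
    for d in range(D):
        n_dc = counts[d]                                    # C x W counts for feature d
        pred = (alpha + n_dc) / (n_dc.sum(axis=1, keepdims=True) + alpha * W)
        log_post += np.log(pred[:, x_new[d]])
    return log_post

# Toy data: 6 examples, 2 categorical features with 3 possible values, 2 classes.
X = np.array([[0, 1], [0, 2], [1, 1], [2, 0], [2, 2], [1, 0]])
t = np.array([0, 0, 0, 1, 1, 1])

counts = fit_counts(X, t, n_classes=2, n_values=3)
class_counts = np.bincount(t, minlength=2)

log_post = predict_log_probs(np.array([0, 1]), counts, class_counts)
post = np.exp(log_post - log_post.max())
print(post / post.sum())    # posterior over the two classes
```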

  15. Example: Iris Data

  16. A Generative Model for Continuous Features (figure)

  17. Naive Bayes with Continuous Features
  The naive Bayes assumption can be made regardless of the individual feature types:
  $$p(\mathbf{x} \mid t = c) = \prod_{d=1}^{D} p(x_d \mid t = c)$$
  For example, suppose $x_d \mid t$ can be modeled as Normal:
  $$p(x_d \mid t) = \frac{1}{\sqrt{2\pi\sigma_{td}^2}} \exp\left\{-\frac{1}{2\sigma_{td}^2}(x_d - \mu_{td})^2\right\}$$
  Then the joint likelihood function is
  $$L(\boldsymbol{\mu}, \boldsymbol{\sigma}^2) = \prod_{n=1}^{N} \prod_{d=1}^{D} \frac{1}{\sqrt{2\pi\sigma_{t_n d}^2}} \exp\left\{-\frac{1}{2\sigma_{t_n d}^2}(x_{nd} - \mu_{t_n d})^2\right\}$$

  18. MLE Parameter Estimates
  The joint likelihood function is
  $$L(\boldsymbol{\mu}, \boldsymbol{\sigma}^2) = \prod_{n=1}^{N} \prod_{d=1}^{D} \frac{1}{\sqrt{2\pi\sigma_{t_n d}^2}} \exp\left\{-\frac{1}{2\sigma_{t_n d}^2}(x_{nd} - \mu_{t_n d})^2\right\}$$
  When considering the $\mu_{t_0 d_0}, \sigma_{t_0 d_0}^2$ part of the gradient, all terms for which $t_n \neq t_0$ and/or $d \neq d_0$ are constants, so we get
  $$\frac{\partial \log L}{\partial \mu_{t_0 d_0}} = \frac{1}{\sigma_{t_0 d_0}^2} \sum_{n : t_n = t_0} (x_{n d_0} - \mu_{t_0 d_0})$$
  $$\frac{\partial \log L}{\partial \sigma_{t_0 d_0}^2} = \sum_{n : t_n = t_0} \left( -\frac{1}{2\sigma_{t_0 d_0}^2} + \frac{1}{2(\sigma_{t_0 d_0}^2)^2} (x_{n d_0} - \mu_{t_0 d_0})^2 \right)$$
  which means we can consider each class and each coordinate separately, and the MLEs are the MLEs for the corresponding univariate Normal: the within-class sample mean and variance for each dimension.
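A sketch of the resulting estimates on toy data: for each class and feature dimension, the MLEs are the within-class sample mean and (biased, ddof=0) sample variance:

```python
import numpy as np

# Toy continuous data: N = 6 examples, D = 2 features, labels in {0, 1}.
X = np.array([[5.1, 3.5], [4.9, 3.0], [5.0, 3.4],
              [6.3, 2.9], [6.5, 3.0], [6.1, 2.8]])
t = np.array([0, 0, 0, 1, 1, 1])

mu_hat = np.array([X[t == c].mean(axis=0) for c in range(2)])          # shape (C, D)
var_hat = np.array([X[t == c].var(axis=0, ddof=0) for c in range(2)])  # MLE (biased) variance

print(mu_hat)
print(var_hat)
```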

  19. Naive Bayes Generative Model of Iris Data
  Figure: Sepal.Length vs. Sepal.Width for the three species (setosa, versicolor, virginica). Class-conditional densities are shown as bivariate Normals with diagonal covariance matrices. Means and variances estimated using MLE.

  20. Classification
  Having estimated the parameters, the posterior probability that $t_{\text{new}} = c$ is just
  $$p(t = c \mid \mathbf{x}_{\text{new}}, \hat{\boldsymbol{\mu}}, \hat{\boldsymbol{\sigma}}^2) = \frac{p(t = c)\, p(\mathbf{x}_{\text{new}} \mid t = c, \hat{\boldsymbol{\mu}}, \hat{\boldsymbol{\sigma}}^2)}{\sum_{c'} p(t = c')\, p(\mathbf{x}_{\text{new}} \mid t = c', \hat{\boldsymbol{\mu}}, \hat{\boldsymbol{\sigma}}^2)} = \frac{p(t = c) \prod_{d=1}^{D} \mathcal{N}(x_{\text{new},d} \mid \hat{\mu}_{cd}, \hat{\sigma}^2_{cd})}{\sum_{c'} p(t = c') \prod_{d=1}^{D} \mathcal{N}(x_{\text{new},d} \mid \hat{\mu}_{c'd}, \hat{\sigma}^2_{c'd})}$$
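A minimal sketch of the classification step, using rounded estimates from the previous sketch and assumed equal class priors; summing log densities avoids underflow when $D$ is large:

```python
import numpy as np

def log_normal_pdf(x, mu, var):
    """Elementwise log density of x under Normal(mu, var)."""
    return -0.5 * np.log(2 * np.pi * var) - (x - mu) ** 2 / (2 * var)

def posterior(x_new, mu_hat, var_hat, prior):
    """Posterior p(t = c | x_new) under the Gaussian naive Bayes model."""
    log_joint = np.log(prior) + np.array(
        [log_normal_pdf(x_new, mu_hat[c], var_hat[c]).sum()
         for c in range(len(prior))])
    log_joint -= log_joint.max()          # subtract max for numerical stability
    p = np.exp(log_joint)
    return p / p.sum()

# Rounded estimates from the previous sketch (hypothetical values, not real iris fits).
mu_hat = np.array([[5.0, 3.3], [6.3, 2.9]])
var_hat = np.array([[0.007, 0.047], [0.027, 0.007]])
prior = np.array([0.5, 0.5])              # assumed equal class priors

print(posterior(np.array([5.2, 3.3]), mu_hat, var_hat, prior))
```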

  21. Mixed Feature Types
  Since the naive Bayes assumption models all features as conditionally independent given the class label, there is no extra difficulty in constructing models with arbitrary combinations of categorical and quantitative features.
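As a sketch of how this plays out, with hypothetical parameters for one categorical feature and one Gaussian feature, the per-dimension log probabilities simply come from different families and are summed:

```python
import numpy as np

# Per class: a categorical distribution for feature 1 and Normal parameters for feature 2.
theta = np.array([[0.7, 0.2, 0.1],      # p(x_1 = w | t = 0)
                  [0.2, 0.3, 0.5]])     # p(x_1 = w | t = 1)
mu = np.array([5.0, 6.3])               # per-class means for feature 2
var = np.array([0.05, 0.03])            # per-class variances for feature 2
prior = np.array([0.5, 0.5])

x1_new, x2_new = 0, 5.2                 # one categorical value, one real value

log_joint = (np.log(prior)
             + np.log(theta[:, x1_new])
             - 0.5 * np.log(2 * np.pi * var)
             - (x2_new - mu) ** 2 / (2 * var))
post = np.exp(log_joint - log_joint.max())
print(post / post.sum())                # posterior over the two classes
```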
