Bayes Theorem Thomas Bayes (1701-1761) Simple form of Bayes - PowerPoint PPT Presentation

Naïve Bayes Classification
Debapriyo Majumdar
Data Mining – Fall 2014
Indian Statistical Institute Kolkata
August 14, 2014

Bayes' Theorem
§ Thomas Bayes (1701-1761)
§ Simple form of Bayes' Theorem, for two random variables $C$ and $X$:
$$ P(C \mid X) = \frac{P(X \mid C) \times P(C)}{P(X)} $$
– $P(C \mid X)$: posterior probability
– $P(X \mid C)$: likelihood
– $P(C)$: class prior probability
– $P(X)$: predictor prior probability, or evidence
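
As a quick numeric illustration of the formula above, the following minimal sketch computes a posterior from a prior, a likelihood, and the evidence. All numbers (the spam rate and the word frequencies) are made-up assumptions, not data from the lecture, and the evidence is obtained with the law of total probability.

```python
def posterior(prior_c, likelihood_x_given_c, evidence_x):
    """Bayes' theorem: P(C | X) = P(X | C) * P(C) / P(X)."""
    return likelihood_x_given_c * prior_c / evidence_x

# Hypothetical numbers, chosen only for illustration.
p_spam = 0.2                 # class prior P(C = spam)
p_word_given_spam = 0.5      # likelihood P(X = word present | spam)
p_word_given_ham = 0.05      # likelihood P(X = word present | not spam)

# Evidence P(X) by the law of total probability over the two classes.
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

p_spam_given_word = posterior(p_spam, p_word_given_spam, p_word)
print(f"P(spam | word) = {p_spam_given_word:.4f}")
```

With these assumed numbers the posterior is much larger than the 0.2 prior, because the word is ten times more likely in spam than in non-spam.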

Probability Model
§ Probability model: for a target class variable $C$ which depends on features $X_1, \dots, X_n$
$$ P(C \mid X_1, \dots, X_n) = \frac{P(C) \times P(X_1, \dots, X_n \mid C)}{P(X_1, \dots, X_n)} $$
§ The values of the features are given, so the denominator is effectively constant
§ Goal: calculating probabilities for the possible values of $C$
§ We are interested in the numerator: $P(C) \times P(X_1, \dots, X_n \mid C)$

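Naïve Bayes Probability Model
§ Naïve assumption: the features $X_1, \dots, X_n$ are conditionally independent given $C$, so the likelihood factorizes:
$$ P(C \mid X_1, \dots, X_n) = \frac{P(C) \times P(X_1 \mid C) \times P(X_2 \mid C) \times \cdots \times P(X_n \mid C)}{P(X_1, \dots, X_n)} $$
– Left-hand side: class posterior probability
– Denominator: known values, constant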

Classifier based on Naïve Bayes
§ Decision rule: pick the hypothesis (value $c$ of $C$) that has highest probability
– Maximum A-Posteriori (MAP) decision rule
$$ \operatorname*{argmax}_{c} \left[ P(C = c) \prod_{i=1}^{n} P(X_i = x_i \mid C = c) \right] $$
– $P(C = c)$ and $P(X_i = x_i \mid C = c)$: approximated from relative frequencies in the training set
– The values $x_i$ of the features are known for the new observation
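
The MAP decision rule above can be sketched for categorical features as follows. This is a minimal illustration, not the lecture's implementation: the tiny weather-style dataset, the function names `train` and `classify`, and the absence of smoothing are all assumptions made for the example.

```python
from collections import Counter, defaultdict

def train(rows, labels):
    """Estimate P(C=c) and P(X_i=x | C=c) by relative frequencies."""
    n = len(labels)
    class_counts = Counter(labels)
    priors = {c: class_counts[c] / n for c in class_counts}
    # cond_counts[(i, c)][x] = number of class-c rows whose feature i equals x
    cond_counts = defaultdict(Counter)
    for row, c in zip(rows, labels):
        for i, x in enumerate(row):
            cond_counts[(i, c)][x] += 1

    def cond(i, x, c):
        return cond_counts[(i, c)][x] / class_counts[c]

    return priors, cond

def classify(row, priors, cond):
    """Return the MAP class and the unnormalized score of every class."""
    scores = {}
    for c, prior in priors.items():
        score = prior
        for i, x in enumerate(row):
            score *= cond(i, x, c)
        scores[c] = score
    return max(scores, key=scores.get), scores

# Hypothetical training data: features are (outlook, windy), label is "play".
rows = [("sunny", "no"), ("sunny", "yes"), ("rainy", "yes"),
        ("rainy", "no"), ("sunny", "no"), ("rainy", "yes")]
labels = ["yes", "yes", "no", "yes", "yes", "no"]

priors, cond = train(rows, labels)
label, scores = classify(("rainy", "yes"), priors, cond)
print(label, scores)
```

The scores are the numerators only; dividing by their sum would give the posterior probabilities, but the argmax does not need that step.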

Text Classification with Naïve Bayes
Example of Naïve Bayes
Reference: The IR Book by Raghavan et al, Chapter 6

The Text Classification Problem
§ Set of class labels / tags / categories: $C$
§ Training set: set $D$ of documents with labels, $\langle d, c \rangle \in D \times C$
§ Example: a document, and a class label
  <Beijing joins the World Trade Organization, China>
§ Set of all terms: $V$
§ Given $d \in D'$, a set of new documents, the goal is to find a class of the document, $c(d)$

Multinomial Naïve Bayes Model
§ Probability of a document $d$ being in class $c$
$$ P(c \mid d) \propto P(c) \times \prod_{k=1}^{n_d} P(t_k \mid c) $$
where $P(t_k \mid c)$ = probability of a term $t_k$ occurring in a document of class $c$, and $n_d$ = number of tokens in $d$
Intuitively:
• $P(t_k \mid c)$ ~ how much evidence $t_k$ contributes to the class $c$
• $P(c)$ ~ prior probability of a document being labeled by class $c$

Multinomial Naïve Bayes Model
§ The expression
$$ P(c) \times \prod_{k=1}^{n_d} P(t_k \mid c) $$
is a product of many probabilities, each at most 1.
§ This may result in a floating point underflow.
– Add logarithms of the probabilities instead of multiplying the original probability values
$$ c_{map} = \operatorname*{argmax}_{c \in C} \left[ \log P(c) + \sum_{k=1}^{n_d} \log P(t_k \mid c) \right] $$
– $\log P(t_k \mid c)$: term weight of $t_k$ in class $c$
Todo: estimating these probabilities
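
To see why the logarithms matter, the sketch below multiplies 100 small probabilities directly and then sums their logarithms instead. The probability values are made up for illustration, and the helper names `product` and `log_score` are hypothetical.

```python
import math

def product(probs):
    """Multiply probabilities directly (prone to underflow)."""
    result = 1.0
    for p in probs:
        result *= p
    return result

def log_score(probs):
    """Sum of log-probabilities: same ordering as the product, no underflow."""
    return sum(math.log(p) for p in probs)

# Hypothetical per-term probabilities for two classes over a 100-token document.
probs_a = [1e-5] * 100
probs_b = [2e-5] * 100

product_a, product_b = product(probs_a), product(probs_b)
log_a, log_b = log_score(probs_a), log_score(probs_b)

print(product_a, product_b)   # both underflow to 0.0 in double precision
print(log_a, log_b)           # finite, and class b clearly scores higher
```

Because the logarithm is monotonic, the class with the highest log score is also the class with the highest product, so the argmax is unchanged.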

Maximum Likelihood Estimate
§ Based on relative frequencies
§ Class prior probability
$$ P(c) = \frac{N_c}{N} $$
– $N_c$: # of documents labeled as class $c$ in training data
– $N$: total # of documents in training data
§ Estimate $P(t \mid c)$ as the relative frequency of $t$ occurring in documents labeled as class $c$
$$ P(t \mid c) = \frac{T_{ct}}{\sum_{t' \in V} T_{ct'}} $$
– $T_{ct}$: total # of occurrences of $t$ in documents $d \in c$
– $\sum_{t' \in V} T_{ct'}$: total # of occurrences of all terms in documents $d \in c$
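
The two estimates above amount to counting. The sketch below computes them for a small tokenized corpus; the three training documents and the function name `mle_estimates` are assumptions made for illustration, not the lecture's example data.

```python
from collections import Counter

def mle_estimates(training):
    """Return P(c) = N_c / N and P(t | c) = T_ct / sum_t' T_ct'."""
    n_docs = len(training)                                   # N
    doc_counts = Counter(label for _, label in training)     # N_c
    term_counts = {c: Counter() for c in doc_counts}         # T_ct
    for tokens, label in training:
        term_counts[label].update(tokens)

    prior = {c: doc_counts[c] / n_docs for c in doc_counts}
    cond = {}
    for c, counts in term_counts.items():
        total = sum(counts.values())    # all term occurrences in class c
        cond[c] = {t: counts[t] / total for t in counts}
    return prior, cond

# Hypothetical toy training set of (tokens, class label) pairs.
training = [
    (["indian", "delhi", "indian"], "India"),
    (["indian", "goa"], "India"),
    (["london", "embassy"], "UK"),
]

prior, cond = mle_estimates(training)
print(prior)
print(cond["India"])
# A term never seen in a class gets probability zero under this estimate.
print(cond["India"].get("london", 0.0))
```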

Handling Rare Events
§ What if: a term $t$ did not occur in documents belonging to class $c$ in the training data?
– Quite common. Terms are sparse in documents.
§ Problem: $P(t \mid c)$ becomes zero, the whole expression becomes zero
§ Use add-one or Laplace smoothing
$$ P(t \mid c) = \frac{T_{ct} + 1}{\sum_{t' \in V} (T_{ct'} + 1)} = \frac{T_{ct} + 1}{\left( \sum_{t' \in V} T_{ct'} \right) + |V|} $$
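
Putting add-one smoothing together with log-space scoring gives a small multinomial Naïve Bayes classifier. The following is a minimal sketch under stated assumptions: the four training documents and the test document are hypothetical toy data, the function names are invented for the example, and test terms outside the training vocabulary are simply skipped.

```python
import math
from collections import Counter

def train_multinomial_nb(training):
    """Estimate log P(c) and add-one smoothed log P(t | c)."""
    vocab = {t for tokens, _ in training for t in tokens}
    doc_counts = Counter(label for _, label in training)
    term_counts = {c: Counter() for c in doc_counts}
    for tokens, label in training:
        term_counts[label].update(tokens)

    log_prior = {c: math.log(doc_counts[c] / len(training)) for c in doc_counts}
    log_cond = {}
    for c, counts in term_counts.items():
        denom = sum(counts.values()) + len(vocab)   # sum_t' T_ct' + |V|
        log_cond[c] = {t: math.log((counts[t] + 1) / denom) for t in vocab}
    return vocab, log_prior, log_cond

def classify(tokens, vocab, log_prior, log_cond):
    """Return the MAP class and the log score of every class."""
    scores = {}
    for c in log_prior:
        # Assumption of this sketch: terms unseen in training are ignored.
        scores[c] = log_prior[c] + sum(log_cond[c][t] for t in tokens if t in vocab)
    return max(scores, key=scores.get), scores

# Hypothetical toy data (tokens, class label).
training = [
    (["indian", "delhi", "indian"], "India"),
    (["indian", "indian", "goa"], "India"),
    (["indian", "tajmahal"], "India"),
    (["london", "embassy", "indian"], "UK"),
]
vocab, log_prior, log_cond = train_multinomial_nb(training)

test_doc = ["indian", "indian", "indian", "london", "embassy"]
label, scores = classify(test_doc, vocab, log_prior, log_cond)
print(label, scores)
```

On this toy data the vocabulary has 6 terms, so for example the smoothed estimate of P(london | India) is (0 + 1) / (8 + 6) = 1/14 rather than zero, and the repeated occurrences of "indian" are enough for class India to win.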

Example
[Figure: training documents labeled India, India, India and UK, built from the terms Indian, Delhi, Goa, Taj Mahal, London and Embassy, and one new document to classify]

Bernoulli Naïve Bayes Model
§ Binary indicator of occurrence instead of frequency
§ Estimate $P(t \mid c)$ as the fraction of documents in $c$ containing the term $t$
§ Models absence of terms explicitly:
$$ P(X_1, \dots, X_n \mid C) = \prod_{i=1}^{n} \left[ X_i \, P(t_i \mid C) + (1 - X_i)\,(1 - P(t_i \mid C)) \right] $$
where $X_i = 1$ if $t_i$ is present, 0 otherwise; the factor $(1 - X_i)(1 - P(t_i \mid C))$ accounts for the absence of terms
§ Difference between Multinomial with frequencies truncated to 1, and Bernoulli Naïve Bayes?
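
The Bernoulli product above runs over every term in the vocabulary, present or absent. The sketch below scores a document this way on hypothetical toy data. The add-one smoothing with denominator (number of documents in the class + 2) is an assumption of this sketch and is not stated on the slide; the function names are also invented for the example.

```python
def train_bernoulli_nb(training):
    """Estimate P(c) and the smoothed fraction of class-c documents containing t."""
    vocab = sorted({t for tokens, _ in training for t in tokens})
    classes = sorted({label for _, label in training})
    prior, cond = {}, {}
    for c in classes:
        docs_c = [set(tokens) for tokens, label in training if label == c]
        prior[c] = len(docs_c) / len(training)
        # Assumed smoothing: (docs containing t + 1) / (docs in class + 2).
        cond[c] = {t: (sum(t in d for d in docs_c) + 1) / (len(docs_c) + 2)
                   for t in vocab}
    return vocab, prior, cond

def bernoulli_score(tokens, c, vocab, prior, cond):
    """P(c) times the product over ALL vocabulary terms, present or absent."""
    present = set(tokens)
    score = prior[c]
    for t in vocab:
        x = 1 if t in present else 0
        score *= x * cond[c][t] + (1 - x) * (1 - cond[c][t])
    return score

# Hypothetical toy data (tokens, class label).
training = [
    (["indian", "delhi", "indian"], "India"),
    (["indian", "indian", "goa"], "India"),
    (["indian", "tajmahal"], "India"),
    (["london", "embassy", "indian"], "UK"),
]
vocab, prior, cond = train_bernoulli_nb(training)

test_doc = ["indian", "indian", "indian", "london", "embassy"]
scores = {c: bernoulli_score(test_doc, c, vocab, prior, cond) for c in prior}
best = max(scores, key=scores.get)
print(best, scores)
```

On this toy data the Bernoulli model picks UK: the repeated "indian" tokens count only once, while the presence of "london" and "embassy" and the absence of "delhi", "goa" and "tajmahal" all favor the UK class.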

Example
[Figure: training documents labeled India, India, India and UK, built from the terms Indian, Delhi, Goa, Taj Mahal, London and Embassy, and one new document to classify]

Naïve Bayes as a Generative Model
§ The probability model:
$$ P(c \mid d) = \frac{P(c) \times P(d \mid c)}{P(d)} $$
Multinomial model
$$ P(d \mid c) = P(t_1, \dots, t_{n_d} \mid c) $$
– Terms as they occur in $d$, exclude other terms
[Figure: class UK generating $X_1$ = London, $X_2$ = India, $X_3$ = Embassy]
where $X_i$ is the random variable for position $i$ in the document
– Takes values as terms of the vocabulary
§ Positional independence assumption → Bag of words model:
$$ P(X_{k_1} = t \mid c) = P(X_{k_2} = t \mid c) $$

Naïve Bayes as a Generative Model
§ The probability model:
$$ P(c \mid d) = \frac{P(c) \times P(d \mid c)}{P(d)} $$
Bernoulli model
$$ P(d \mid c) = P(e_1, \dots, e_{|V|} \mid c) $$
– All terms in the vocabulary
[Figure: class UK generating $U_{london} = 1$, $U_{Embassy} = 1$, $U_{India} = 1$, $U_{delhi} = 0$, $U_{TajMahal} = 0$, $U_{goa} = 0$]
§ $P(U_i = 1 \mid c)$ is the probability that term $t_i$ will occur in any position in a document of class $c$

Multinomial vs Bernoulli

                       Multinomial                   Bernoulli
Event model            Generation of token           Generation of document
Multiple occurrences   Matters                       Does not matter
Length of documents    Better for larger documents   Better for shorter documents
#Features              Can handle more               Works best with fewer

On Naïve Bayes
§ Text classification
– Spam filtering (email classification) [Sahami et al. 1998]
– Adapted by many commercial spam filters
– SpamAssassin, SpamBayes, CRM114, …
§ Simple: the conditional independence assumption is very strong (naïve)
– Naïve Bayes may not estimate the probabilities accurately in many cases, but ends up classifying correctly quite often
– It is difficult to understand the dependencies between features in real problems
