DATA MINING: NAVE BAYES 1 Nave Bayes Classifier Thomas Bayes 1702 - PowerPoint PPT Presentation

DATA MINING: NAÏVE BAYES 1

Naïve Bayes Classifier Thomas Bayes 1702 - 1761 We will start off with some mathematical background. But first we start with some visual intuition. 2

Grasshoppers Katydids 10 9 8 7 Antenna Length 6 5 4 3 2 1 1 2 3 4 5 6 7 8 9 10 Abdomen Length Remember this example? Let’s get lots more data… 3

With a lot of data, we can build a histogram. Let us just build one for “Antenna Length” for now… 10 9 8 7 Antenna Length 6 5 4 3 2 1 1 2 3 4 5 6 7 8 9 10 Katydids Grasshoppers 4

We can leave the histograms as they are, or we can summarize them with two normal distributions. Let us us two normal distributions for ease of visualization in the following slides… 5

• We want to classify an insect we have found. Its antennae are 3 units long. • How can we classify it? • We can just ask ourselves, give the distributions of antennae lengths we have seen, is it more probable that our insect is a Grasshopper or a Katydid . • There is a formal way to discuss the most probable classification … P(c j |d) = probability of c j given that we have observed d 3 Antennae length is 3 6

P(c j |d) = probability of c j given that we have observed d P( Grasshopper | 3 ) = 10 / (10 + 2) = 0.833 P( Katydid | 3 ) = 2 / (10 + 2) = 0.166 10 2 3 Antennae length is 3 7

Bayes Classifier  A probabilistic framework for classification problems  Often appropriate because the world is noisy and also some relationships are probabilistic in nature  Is predicting who will win a baseball game probabilistic in nature?  Before getting the heart of the matter, we will go over some basic probability.  We will review the concept of reasoning with uncertainty, which is based on probability theory  Should be review for many of you 10

Discrete Random Variables  A is a Boolean-valued random variable if A denotes an event, and there is some degree of uncertainty as to whether A occurs.  Examples  A = The next patient you examine is suffering from inhalational anthrax  A = The next patient you examine has a cough  A = There is an active terrorist cell in your city  We view P(A) as “the fraction of possible worlds in which A is true” 11

Visualizing A Event space of all possible worlds P(A) = Area of Worlds in which A reddish oval is true Its area is 1 Worlds in which A is False 12

The Axioms Of Probability  0 <= P(A) <= 1  P(True) = 1  P(False) = 0  P(A or B) = P(A) + P(B) - P(A and B) The area of A can’t get any smaller than 0 And a zero area would mean no world could ever have A true 13

Interpreting the axioms  0 <= P(A) <= 1  P(True) = 1  P(False) = 0  P(A or B) = P(A) + P(B) - P(A and B) The area of A can’t get any bigger than 1 And an area of 1 would mean all worlds will have A true 14

Interpreting the axioms  0 <= P(A) <= 1  P(True) = 1  P(False) = 0  P(A or B) = P(A) + P(B) - P(A and B) A P(A or B) B B P(A and B) Simple addition and subtraction 15

Another Important Theorem  0 <= P(A) <= 1, P(True) = 1, P(False) = 0  P(A or B) = P(A) + P(B) - P(A and B) From these we can prove: P(A) = P(A and B) + P(A and not B) A B 16

Conditional Probability  P(A|B) = Fraction of worlds in which B is true that also have A true H = “Have a headache” F = “Coming down with Flu” P(H) = 1/10 P(F) = 1/40 P(H|F) = 1/2 F “Headaches are rare and flu is H rarer, but if you’re coming down with ‘flu there’s a 50 -50 chance you’ll have a headache.” 17

Conditional Probability P(H|F) = Fraction of flu-inflicted worlds in which you have a headache F = #worlds with flu and headache H ------------------------------------ #worlds with flu = Area of “H and F” region H = “Have a headache” ------------------------------ F = “Coming down with Flu” Area of “F” region P(H) = 1/10 = P(H and F) P(F) = 1/40 --------------- P(H|F) = 1/2 P(F) 18

Definition of Conditional Probability P(A and B) P(A|B) = ----------- P(B) Corollary: The Chain Rule P(A and B) = P(A|B) P(B) 19

Probabilistic Inference H = “Have a headache” F = “Coming down with Flu” F P(H) = 1/10 H P(F) = 1/40 P(H|F) = 1/2 One day you wake up with a headache. You think: “Drat! 50% of flus are associated with headaches so I must have a 50-50 chance of coming down with flu” Is this reasoning good? 20

Probabilistic Inference H = “Have a headache” F = “Coming down with Flu” F P(H) = 1/10 H P(F) = 1/40 P(H|F) = 1/2 P(F and H) = … P(F|H) = … 21

Probabilistic Inference H = “Have a headache” F = “Coming down with Flu” F P(H) = 1/10 H P(F) = 1/40 P(H|F) = 1/2 1 1 1      ( and ) ( | ) ( ) P F H P H F P F 2 40 80 1 ( and ) 1 P F H 80    ( | ) P F H 1 ( ) 8 P H 10 22

What we just did… P(A & B) P(A|B) P(B) P(B|A) = ----------- = --------------- P(A) P(A) This is Bayes Rule Bayes, Thomas (1763) An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53:370-418 23

More Terminology  The Prior Probability is the probability assuming no specific information.  Thus we would refer to P(A) as the prior probability of even A occurring  We would not say that P(A|C) is the prior probability of A occurring  The Posterior probability is the probability given that we know something  We would say that P(A|C) is the posterior probability of A (given that C occurs) 24

Example of Bayes Theorem  Given:  A doctor knows that meningitis causes stiff neck 50% of the time  Prior probability of any patient having meningitis is 1/50,000  Prior probability of any patient having stiff neck is 1/20  If a patient has stiff neck, what’s the probability he/she has meningitis?  ( | ) ( ) 0 . 5 1 / 50000 P S M P M    ( | ) 0 . 0002 P M S ( ) 1 / 20 P S 25

Why Bayes Theorem at All? ( | ) ( ) P A C P C  ( | ) P C A ( ) P A  Why model P(C|A) via P(A|C)  We will see it is easier, but only with significant assumptions  In classification, what is C and what is A?  C is class and A is the example, a vector of attribute values  Why not model P(C|A) directly? How would we compute it?  We would need to observe A at least once and probably many times in order to come up with reasonable probability estimates. If we observe it once, we would have a probability of 1 for some C and 0 for rest.  We cannot expect to see every attribute vector even once! 26

Bayes Classifiers That was a visual intuition for a simple case of the Bayes classifier, also called: • Idiot Bayes • Naïve Bayes • Simple Bayes We are about to see some of the mathematical formalisms, and more examples, but keep in mind the basic idea. Find out the probability of the previously unseen instance belonging to each class, then simply pick the most probable class. 27

Bayesian Classifiers  Bayesian classifiers use Bayes theorem , which says p ( c j | d ) = p ( d | c j ) p ( c j ) p(d) p ( c j | d ) = probability of instance d being in class c j , This is what we are trying to compute  p ( d | c j ) = probability of generating instance d given class c j , We can imagine that being in class c j , causes you to have feature d with some probability  p(c j ) = probability of occurrence of class c j , This is just how frequent the class c j , is in our database  p(d) = probability of instance d occurring This can actually be ignored, since it is the same for all classes 28

Bayesian Classifiers  Given a record with attributes (A 1 , A 2 ,…,A n )  The goal is to predict class C  Actually, we want to find the value of C that maximizes P(C| A 1 , A 2 ,…,A n )  Can we estimate P(C| A 1 , A 2 ,…,A n ) directly (w/o Bayes)?  Yes, we simply need to count up the number of times we see A 1 , A 2 ,…,A n and then see what fraction belongs to each class  For example, if n=3 and the feature vector “4,3,2” occurs 10 times and 4 of these belong to C1 and 6 to C2, then:  What is P(C1|”4,3,2”)?  What is P(C2|”4,3,2”)?  Unfortunately, this is generally not feasible since not every feature vector will be found in the training set (as we just said) 29

Bayesian Classifiers  Indirect Approach: Use Bayes Theorem  compute the posterior probability P(C | A 1 , A 2 , …, A n ) for all values of C using the Bayes theorem  ( | ) ( ) P A A A C P C   ( | ) P C A A A 1 2 n  1 2 n ( ) P A A A 1 2 n  Choose value of C that maximizes P(C | A 1 , A 2 , …, A n )  Equivalent to choosing value of C that maximizes P(A 1 , A 2 , …, A n |C) P(C)  Since the denominator is the same for all values of C 30

DATA MINING: NAVE BAYES 1 Nave Bayes Classifier Thomas Bayes 1702 - PowerPoint PPT Presentation

DATA MINING: NAVE BAYES 1 Nave Bayes Classifier Thomas Bayes 1702 - 1761 We will start off with some mathematical background. But first we start with some visual intuition. 2 Grasshoppers Katydids 10 9 8 7 Antenna Length 6 5 4

Web Mining Web Mining Web Mining Web Mining Web mining is the use of data mining techniques

Naive Bayes and Gaussian Bayes Classifier Ladislav Rampasek slides by Mengye Ren and others

Bayes Theorem Thomas Bayes (1701-1761) Simple form of Bayes Theorem, for

The Nave Bayes Classifier Machine Learning 1 Todays lecture The nave Bayes Classifier

Data Mining 2020 Text Classification Naive Bayes Ad Feelders Universiteit Utrecht Ad Feelders

Introduction What is data mining? to Data Mining: On what kind of data? Data Mining

Web Mining Web Mining Web mining is the use of data mining techniques to automatically

Bayes Classifiers Nave Bayes Classification Patrick Mair Bayes Classifiers Weather data

Cognitive Modeling Unseen Examples 2 Bayes Classifiers Lecture 14: Naive Bayes Classifiers

STAT 339 Naive Bayes Classification 8-10 March 2017 Colin Reimer Dawson Outline Naive Bayes

I ntroduction to Mobile Robotics Bayes Filter Kalm an Filter Wolfram Burgard 1 Bayes

Introduction What is data mining? to Data mining functionalities Data Mining Major

Data mining Machine Intelligence Thomas D. Nielsen September 2008 Data mining September 2008

DATA MINING LECTURE 2 What is data? The data mining pipeline What is Data Mining? Data

Data Mining 2020 Frequent Pattern Mining (2) Ad Feelders Universiteit Utrecht October 2, 2020

Web MINING Web MINING Overview Overview Dr Ahmed Rafea Rafea Dr Ahmed 1 Web Mining Outline

Learning: A Bayesian solution Dmitry P. Vetrov Research professor at HSE, Head of Bayesian

Introduction to Machine Learning CMU-10701 2. Basic Statistics Barnabs Pczos & Alex

Two Statistical Paradigms Bayesian versus Frequentist Steven Janke April 2012 (Bayesian

Classification Algorithms UCSB 293S, 2017. T. Yang Some of slides based on R. Mooney (UT Austin)

ECE 4524 Artificial Intelligence and Engineering Applications Lecture 17: Bayesian Inference

Bayesian Networks Philipp Koehn 2 April 2020 Philipp Koehn Artificial Intelligence: Bayesian

Ba y esian Learning Read Ch Suggested exercises

Should all Machine Learning be Bayesian? Should all Bayesian models be non-parametric? Zoubin

DATA MINING: NAVE BAYES 1 Nave Bayes Classifier Thomas Bayes 1702 - PowerPoint PPT Presentation

DATA MINING: NAVE BAYES 1 Nave Bayes Classifier Thomas Bayes 1702 - 1761 We will start off with some mathematical background. But first we start with some visual intuition. 2 Grasshoppers Katydids 10 9 8 7 Antenna Length 6 5 4

Web Mining Web Mining Web Mining Web Mining Web mining is the use of data mining techniques

Naive Bayes and Gaussian Bayes Classifier Ladislav Rampasek slides by Mengye Ren and others

Bayes Theorem Thomas Bayes (1701-1761) Simple form of Bayes Theorem, for

The Nave Bayes Classifier Machine Learning 1 Todays lecture The nave Bayes Classifier

Data Mining 2020 Text Classification Naive Bayes Ad Feelders Universiteit Utrecht Ad Feelders

Introduction What is data mining? to Data Mining: On what kind of data? Data Mining

Web Mining Web Mining Web mining is the use of data mining techniques to automatically

Bayes Classifiers Nave Bayes Classification Patrick Mair Bayes Classifiers Weather data

Cognitive Modeling Unseen Examples 2 Bayes Classifiers Lecture 14: Naive Bayes Classifiers

STAT 339 Naive Bayes Classification 8-10 March 2017 Colin Reimer Dawson Outline Naive Bayes

I ntroduction to Mobile Robotics Bayes Filter Kalm an Filter Wolfram Burgard 1 Bayes

Introduction What is data mining? to Data mining functionalities Data Mining Major

Data mining Machine Intelligence Thomas D. Nielsen September 2008 Data mining September 2008

DATA MINING LECTURE 2 What is data? The data mining pipeline What is Data Mining? Data

Data Mining 2020 Frequent Pattern Mining (2) Ad Feelders Universiteit Utrecht October 2, 2020

Web MINING Web MINING Overview Overview Dr Ahmed Rafea Rafea Dr Ahmed 1 Web Mining Outline

Learning: A Bayesian solution Dmitry P. Vetrov Research professor at HSE, Head of Bayesian

Introduction to Machine Learning CMU-10701 2. Basic Statistics Barnabs Pczos &amp; Alex

Two Statistical Paradigms Bayesian versus Frequentist Steven Janke April 2012 (Bayesian

Classification Algorithms UCSB 293S, 2017. T. Yang Some of slides based on R. Mooney (UT Austin)

ECE 4524 Artificial Intelligence and Engineering Applications Lecture 17: Bayesian Inference

Bayesian Networks Philipp Koehn 2 April 2020 Philipp Koehn Artificial Intelligence: Bayesian

Ba y esian Learning Read Ch Suggested exercises

Should all Machine Learning be Bayesian? Should all Bayesian models be non-parametric? Zoubin

Introduction to Machine Learning CMU-10701 2. Basic Statistics Barnabs Pczos & Alex