SLIDE 1

Data Mining 2013 Bayesian Network Classifiers

Ad Feelders

Universiteit Utrecht

October 24, 2013

SLIDE 2

Literature

  • N. Friedman, D. Geiger and M. Goldszmidt, Bayesian Network Classifiers, Machine Learning 29, pp. 131–163 (1997) (except Section 6)

SLIDE 3

Bayesian Network Classifiers

Bayesian Networks are models of the joint distribution of a collection of random variables. The joint distribution is simplified by introducing independence assumptions. In many applications we are in fact interested in the conditional distribution of one variable (the class variable) given the other variables (attributes). Can we use Bayesian Networks as classifiers?

SLIDE 4

The Naive Bayes Classifier

[Figure: Naive Bayes structure: arcs from the class C to each attribute A1, A2, . . . , Ak.]

This Bayesian Network is equivalent to its undirected version (why?):

[Figure: the same structure with the edge directions dropped.]

Attributes are independent given the class label.

SLIDE 5

The Naive Bayes Classifier

[Figure: Naive Bayes structure, as on the previous slide.]

BN factorisation: $P(X) = \prod_{i=1}^{k} P(X_i \mid X_{pa(i)})$, so the factorisation corresponding to the NB classifier is

$P(C, A_1, \ldots, A_k) = P(C)\,P(A_1 \mid C) \cdots P(A_k \mid C)$

SLIDE 6

Naive Bayes assumption

Via Bayes' rule we have

$P(C=i \mid A) = \dfrac{P(A_1, A_2, \ldots, A_k, C=i)}{P(A_1, A_2, \ldots, A_k)}$  (product rule)

$= \dfrac{P(A_1, \ldots, A_k \mid C=i)\,P(C=i)}{\sum_{j=1}^{c} P(A_1, \ldots, A_k \mid C=j)\,P(C=j)}$  (product rule and sum rule)

$= \dfrac{P(A_1 \mid C=i) \cdots P(A_k \mid C=i)\,P(C=i)}{\sum_{j=1}^{c} P(A_1 \mid C=j) \cdots P(A_k \mid C=j)\,P(C=j)}$  (NB factorisation)

SLIDE 7

Why Naive Bayes is competitive

The conditional independence assumption is often clearly inappropriate, yet the predictive accuracy of Naive Bayes is competitive with more complex classifiers. How come? The probability estimates of Naive Bayes may be way off, but this does not necessarily result in wrong classifications! Naive Bayes has only a few parameters compared to more complex models, so it can estimate them more reliably.

SLIDE 8

Naive Bayes: Example

P(C = 0) = 0.4, P(C = 1) = 0.6

Class-conditional joint distributions of A1 (rows) and A2 (columns):

  C = 0:    A2 = 0   A2 = 1   P(A1)
  A1 = 0     0.2      0.1      0.3
  A1 = 1     0.1      0.6      0.7
  P(A2)      0.3      0.7      1

  C = 1:    A2 = 0   A2 = 1   P(A1)
  A1 = 0     0.5      0.2      0.7
  A1 = 1     0.1      0.2      0.3
  P(A2)      0.6      0.4      1

We have that

$P(C=1 \mid A_1=0, A_2=0) = \dfrac{0.5 \times 0.6}{0.5 \times 0.6 + 0.2 \times 0.4} = 0.79$

According to naive Bayes,

$P(C=1 \mid A_1=0, A_2=0) = \dfrac{0.7 \times 0.6 \times 0.6}{0.7 \times 0.6 \times 0.6 + 0.3 \times 0.3 \times 0.4} = 0.88$

Naive Bayes assigns the point to the correct class.
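To make the arithmetic concrete, here is a minimal R sketch (the tables are typed in from this slide; all variable names are mine) that reproduces both posteriors:

# Class-conditional joint tables; rows = A1 in {0,1}, columns = A2 in {0,1}
pC <- c(0.4, 0.6)                                          # P(C=0), P(C=1)
joint0 <- matrix(c(0.2, 0.1, 0.1, 0.6), 2, byrow = TRUE)   # given C = 0
joint1 <- matrix(c(0.5, 0.2, 0.1, 0.2), 2, byrow = TRUE)   # given C = 1

# Exact posterior P(C=1 | A1=0, A2=0) uses the joint cell [1,1]:
num <- joint1[1, 1] * pC[2]
num / (num + joint0[1, 1] * pC[1])                   # 0.789

# Naive Bayes replaces the joint cell by a product of marginals:
nb1 <- rowSums(joint1)[1] * colSums(joint1)[1] * pC[2]
nb0 <- rowSums(joint0)[1] * colSums(joint0)[1] * pC[1]
nb1 / (nb1 + nb0)                                    # 0.875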

SLIDE 9

What about this model?

[Figure: reversed structure: arcs from each attribute A1, . . . , Ak to the class C.]

BN factorisation: $P(X) = \prod_{i=1}^{k} P(X_i \mid X_{pa(i)})$, so the factorisation is

$P(C, A_1, \ldots, A_k) = P(C \mid A_1, \ldots, A_k)\,P(A_1) \cdots P(A_k)$

SLIDE 10

Bayesian Networks as Classifiers

[Figure: a Bayesian network over the class C and attributes A1, . . . , A8.]

Markov Blanket: Parents, Children and Parents of Children.

SLIDE 11

Markov Blanket of C: Moral Graph

[Figure: the moral graph of the network above, with the Markov blanket of C highlighted.]

Markov Blanket: Parents, Children and Parents of Children. Local Markov property: $C \perp\!\!\!\perp \text{rest} \mid \text{boundary}(C)$

SLIDE 12

Bayesian Networks as Classifiers

Loglikelihood under model M is

$L(M \mid D) = \sum_{j=1}^{n} \log P_M(X^{(j)})$, where $X^{(j)} = (A_1^{(j)}, A_2^{(j)}, \ldots, A_k^{(j)}, C^{(j)})$.

We can rewrite this as

$L(M \mid D) = \sum_{j=1}^{n} \log P_M(C^{(j)} \mid A^{(j)}) + \sum_{j=1}^{n} \log P_M(A^{(j)})$

If there are many attributes, the second term will dominate the loglikelihood score. But we are not interested in modeling the distribution of the attributes!

SLIDE 13

Bayesian Networks as Classifiers

[Figure: $-\log P(x)$ as a function of $P(x) \in (0, 1]$: the term grows without bound as $P(x) \to 0$.]

SLIDE 14

 #   Dataset         # Attributes   # Classes   # Instances (Train)   Test
 1   australian           14            2              690            CV-5
 2   breast               10            2              683            CV-5
 3   chess                36            2             2130            1066
 4   cleve                13            2              296            CV-5
 5   corral                6            2              128            CV-5
 6   crx                  15            2              653            CV-5
 7   diabetes              8            2              768            CV-5
 8   flare                10            2             1066            CV-5
 9   german               20            2             1000            CV-5
10   glass                 9            7              214            CV-5
11   glass2                9            2              163            CV-5
12   heart                13            2              270            CV-5
13   hepatitis            19            2               80            CV-5
14   iris                  4            3              150            CV-5
15   letter               16           26            15000            5000
16   lymphography         18            4              148            CV-5
17   mofn-3-7-10          10            2              300            1024
18   pima                  8            2              768            CV-5
19   satimage             36            6             4435            2000
20   segment              19            7             1540             770
21   shuttle-small         9            7             3866            1934
22   soybean-large        35           19              562            CV-5
23   vehicle              18            4              846            CV-5
24   vote                 16            2              435            CV-5
25   waveform-21          21            3              300            4700

SLIDE 15

Naive Bayes vs. Unrestricted BN

[Figure: percentage classification error per data set: unrestricted Bayesian network vs. Naive Bayes on the 25 data sets.]

SLIDE 16

Use Conditional Log-likelihood?

Discriminative vs. generative learning. Conditional loglikelihood function:

$CL(M \mid D) = \sum_{j=1}^{n} \log P_M(C^{(j)} \mid A_1^{(j)}, \ldots, A_k^{(j)})$

There is no closed-form solution for the ML estimates. Remark: this can be done via logistic regression for models with perfect graphs (Naive Bayes, TANs).

SLIDE 17

NB and Logistic Regression

The logistic regression assumption is

$\log \dfrac{P(C=1 \mid A)}{P(C=0 \mid A)} = \alpha + \sum_{i=1}^{k} \beta_i A_i$,

that is, the log odds is a linear function of the attributes. Under the naive Bayes assumption, this is exactly true. Assign to class 1 if $\alpha + \sum_{i=1}^{k} \beta_i A_i > 0$ and to class 0 otherwise.

Logistic regression maximizes the conditional likelihood under this assumption (it is a so-called discriminative model). There is no closed-form solution for the maximum likelihood estimates of $\alpha$ and $\beta_i$, but the loglikelihood function is globally concave (unique global optimum).
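As an aside, a small R sketch (simulated data; all names are mine) of fitting this discriminative model with glm; with data generated from the naive Bayes model of the Example slide below, the fitted coefficients come out close to the values derived there analytically:

# Simulate from a naive Bayes model with two binary attributes,
# then maximize the conditional likelihood with logistic regression.
set.seed(1)
n  <- 50000
C  <- rbinom(n, 1, 0.6)
A1 <- rbinom(n, 1, ifelse(C == 1, 0.8, 0.5))
A2 <- rbinom(n, 1, ifelse(C == 1, 0.6, 0.3))
fit <- glm(C ~ A1 + A2, family = binomial)
coef(fit)   # approximately alpha = -1.07, beta1 = 1.39, beta2 = 1.25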

SLIDE 18

Proof (for binary attributes Ai)

Under the naive Bayes assumption we have:

$\dfrac{P(C=1 \mid a)}{P(C=0 \mid a)} = \dfrac{P(a_1 \mid C=1) \cdots P(a_k \mid C=1)\,P(C=1)}{P(a_1 \mid C=0) \cdots P(a_k \mid C=0)\,P(C=0)}$

$= \prod_{i=1}^{k} \left( \dfrac{P(a_i=1 \mid C=1)}{P(a_i=1 \mid C=0)} \right)^{a_i} \left( \dfrac{P(a_i=0 \mid C=1)}{P(a_i=0 \mid C=0)} \right)^{1-a_i} \times \dfrac{P(C=1)}{P(C=0)}$

Taking the log we get

$\log \dfrac{P(C=1 \mid a)}{P(C=0 \mid a)} = \sum_{i=1}^{k} \left[ a_i \log \dfrac{P(a_i=1 \mid C=1)}{P(a_i=1 \mid C=0)} + (1 - a_i) \log \dfrac{P(a_i=0 \mid C=1)}{P(a_i=0 \mid C=0)} \right] + \log \dfrac{P(C=1)}{P(C=0)}$

SLIDE 19

Proof (continued)

Expand and collect terms:

$\log \dfrac{P(C=1 \mid a)}{P(C=0 \mid a)} = \sum_{i=1}^{k} a_i \underbrace{\log \left( \dfrac{P(a_i=1 \mid C=1)}{P(a_i=1 \mid C=0)} \cdot \dfrac{P(a_i=0 \mid C=0)}{P(a_i=0 \mid C=1)} \right)}_{\beta_i} + \underbrace{\sum_{i=1}^{k} \log \dfrac{P(a_i=0 \mid C=1)}{P(a_i=0 \mid C=0)} + \log \dfrac{P(C=1)}{P(C=0)}}_{\alpha}$

which is a linear function of $a$.

SLIDE 20

Example

Suppose P(C = 1) = 0.6, P(a1 = 1 | C = 1) = 0.8, P(a1 = 1 | C = 0) = 0.5, P(a2 = 1 | C = 1) = 0.6, P(a2 = 1 | C = 0) = 0.3. Then

$\log \dfrac{P(C=1 \mid a_1, a_2)}{P(C=0 \mid a_1, a_2)} = 1.386 a_1 + 1.253 a_2 - 1.476 + 0.405 = -1.071 + 1.386 a_1 + 1.253 a_2$

Classify a point with a1 = 1 and a2 = 0:

$\log \dfrac{P(C=1 \mid 1, 0)}{P(C=0 \mid 1, 0)} = -1.071 + 1.386 \times 1 + 1.253 \times 0 = 0.315$

Decision rule: assign to class 1 if $\alpha + \sum_{i=1}^{k} \beta_i A_i > 0$ and to class 0 otherwise. Linear decision boundary.
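These coefficients can be recovered directly from the NB parameters, following the proof above; a short R sketch (names are mine):

# NB parameters from the slide
pC1 <- 0.6
p1  <- c(0.8, 0.6)   # P(a_i = 1 | C = 1) for i = 1, 2
p0  <- c(0.5, 0.3)   # P(a_i = 1 | C = 0)

# beta_i is the log odds-ratio; alpha collects the a_i = 0 terms and the prior
beta  <- log(p1 / p0) + log((1 - p0) / (1 - p1))               # 1.386 1.253
alpha <- sum(log((1 - p1) / (1 - p0))) + log(pC1 / (1 - pC1))  # -1.071

alpha + sum(beta * c(1, 0))   # 0.315 > 0, so a = (1, 0) goes to class 1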

SLIDE 21

Linear Decision Boundary

[Figure: the decision boundary $a_2 = 0.855 - 1.106\,a_1$ in the unit square, with class 1 above the line and class 0 below.]

SLIDE 22

Relax strong assumptions of NB

Conditional independence assumption of NB is often incorrect, and could lead to suboptimal classification performance. Relax this assumption by allowing (restricted) dependencies between attributes. This may produce more accurate probability estimates, possibly leading to better classification performance. This is not guaranteed, because the more complex model may be overfitting.

SLIDE 23

Tree Structured BN

Tree structure: each node (except the root of the tree) has exactly one parent.

For tree-structured Bayesian Networks there is an algorithm, due to Chow and Liu, that produces the optimal structure in polynomial time. This algorithm is guaranteed to produce the tree structure that maximizes the loglikelihood score. Why no penalty for complexity?

SLIDE 24

Mutual Information

Measure of association between (discrete) random variables X and Y:

$I_P(X; Y) = \sum_{x, y} P(x, y) \log \dfrac{P(x, y)}{P(x)\,P(y)}$

"Distance" (Kullback-Leibler divergence) between the joint distribution of X and Y, and their joint distribution under the independence assumption. If X and Y are independent, their mutual information is zero; otherwise it is some positive quantity.
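A direct R implementation of this estimate (a sketch; the function name is mine), using natural logarithms as the slides do:

# Empirical mutual information of two discrete vectors.
mutual_info <- function(x, y) {
  pxy <- table(x, y) / length(x)            # joint relative frequencies
  px  <- rowSums(pxy)
  py  <- colSums(pxy)
  terms <- pxy * log(pxy / outer(px, py))
  sum(terms[pxy > 0])                       # skip the 0 log 0 cells
}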

SLIDE 25

Algorithm Construct-Tree of Chow and Liu

Compute $\hat{I}_{P_D}(X_i; X_j)$ between each pair of variables ($O(nk^2)$).
Build a complete undirected graph with weights $\hat{I}_{P_D}(X_i; X_j)$.
Build a maximum weighted spanning tree ($O(k^2 \log k)$).
Choose a root, and let all edges point away from it.
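The spanning-tree step might be sketched in R as follows (my own Prim-style implementation; it assumes a symmetric matrix W of pairwise mutual informations and maximizes instead of minimizes):

# Maximum weighted spanning tree over k variables; W is a symmetric
# k x k weight matrix. Returns a (k-1) x 2 matrix of tree edges.
max_spanning_tree <- function(W) {
  k <- nrow(W)
  in_tree <- c(TRUE, rep(FALSE, k - 1))     # grow the tree from vertex 1
  edges <- matrix(0, k - 1, 2)
  for (e in seq_len(k - 1)) {
    M  <- W[in_tree, !in_tree, drop = FALSE]
    ij <- which(M == max(M), arr.ind = TRUE)[1, ]
    u  <- which(in_tree)[ij[1]]
    v  <- which(!in_tree)[ij[2]]
    edges[e, ] <- c(u, v)                   # heaviest edge leaving the tree
    in_tree[v] <- TRUE
  }
  edges
}

On the example that follows, with W built from the pairwise mutual informations, this picks the X1–X4 edge (weight 0.55) first.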

SLIDE 26

Example Data Set

Obs  X1  X2  X3  X4
  1   1   1   1   1
  2   1   1   1   1
  3   1   1   2   1
  4   1   2   2   1
  5   1   2   2   2
  6   2   1   1   2
  7   2   1   2   3
  8   2   1   2   3
  9   2   2   2   3
 10   2   2   1   3

SLIDE 27

Mutual Information

$\hat{I}_{P_D}(X_1; X_4) = \sum_{x_1, x_4} \hat{P}_D(x_1, x_4) \log \dfrac{\hat{P}_D(x_1, x_4)}{\hat{P}_D(x_1)\,\hat{P}_D(x_4)}$

$= 0.4 \log \dfrac{0.4}{(0.5)(0.4)} + 0.1 \log \dfrac{0.1}{(0.5)(0.2)} + 0 + 0 + 0.1 \log \dfrac{0.1}{(0.5)(0.2)} + 0.4 \log \dfrac{0.4}{(0.5)(0.4)} = 0.55$

(the two empty cells contribute 0).
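Checking this with the mutual_info() sketch from Slide 24 (data typed in from the example table):

x1 <- c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2)
x4 <- c(1, 1, 1, 1, 2, 2, 3, 3, 3, 3)
mutual_info(x1, x4)   # 0.5545, i.e. the 0.55 above (natural log)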

SLIDE 28

Build Graph with Weights

[Figure: complete undirected graph on X1, X2, X3, X4 with mutual-information edge weights; the X1–X4 edge has weight 0.55 and the other displayed weights are 0.032.]

SLIDE 29

Maximum weighted spanning tree

[Figure: the maximum weighted spanning tree over X1, X2, X3, X4.]

SLIDE 30

Choose root node

[Figure: the spanning tree with a root chosen and all edges directed away from it.]

SLIDE 31

Algorithm to construct TAN

Compute $\hat{I}_{P_D}(A_i; A_j \mid C)$ between each pair of attributes, where

$I_P(A_i; A_j \mid C) = \sum_{a_i, a_j, c} P(a_i, a_j, c) \log \dfrac{P(a_i, a_j \mid c)}{P(a_i \mid c)\,P(a_j \mid c)}$

is the conditional mutual information between $A_i$ and $A_j$ given $C$.
Build a complete undirected graph with weights $\hat{I}_{P_D}(A_i; A_j \mid C)$.
Build a maximum weighted spanning tree.
Choose a root, and let all edges point away from it.
Construct a TAN by adding C and an arc from C to every attribute.

This algorithm is guaranteed to produce the TAN structure with optimal loglikelihood score.
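The new weight computation could be sketched as follows (my own helper, analogous to mutual_info above): the conditional mutual information is a P(c)-weighted average of the within-class mutual informations.

# Empirical conditional mutual information I(x; y | z).
cond_mutual_info <- function(x, y, z) {
  total <- 0
  for (cl in unique(z)) {
    idx <- z == cl
    pxy <- table(x[idx], y[idx]) / sum(idx)   # joint given z = cl
    px  <- rowSums(pxy)
    py  <- colSums(pxy)
    terms <- pxy * log(pxy / outer(px, py))
    total <- total + mean(idx) * sum(terms[pxy > 0])  # weight by P(z = cl)
  }
  total
}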

SLIDE 32

TAN for Pima Indians Data

[Figure: TAN structure for the Pima Indians data: class node C with arcs to Pregnant, Insulin, Age, DPF, Glucose and Mass, plus tree edges among the attributes.]

SLIDE 33

Interpretation of TAN’s

Just like NB models, TANs are equivalent to their undirected counterparts (why?).

[Figure: undirected version of the Pima TAN, with nodes C, P(regnant), A(ge), I(nsulin), D(PF), M(ass), G(lucose).]

Since there is an edge between Pregnant and Age, the influence of Age on the class label depends on (is different for different values of) Pregnant (and vice versa).

SLIDE 34

Smoothing by adding “prior counts”

Recall that the maximum likelihood estimate of $p(x_i \mid x_{pa(i)})$ is

$\hat{p}(x_i \mid x_{pa(i)}) = \dfrac{n(x_i, x_{pa(i)})}{n(x_{pa(i)})}$,

where $n(x_{pa(i)})$ is the number of observations (rows) with parent configuration $x_{pa(i)}$, and $n(x_i, x_{pa(i)})$ is the number of observations with parent configuration $x_{pa(i)}$ and value $x_i$ for variable $X_i$. But sometimes we have no observations ($n(x_{pa(i)}) = 0$), or very few, to estimate these (conditional) probabilities.

SLIDE 35

Smoothing by adding “prior counts”

Add "prior counts" to "smooth" the estimates:

$\hat{p}_s(x_i \mid x_{pa(i)}) = \dfrac{n(x_{pa(i)})\,\hat{p}(x_i \mid x_{pa(i)}) + m(x_{pa(i)})\,p_0(x_i \mid x_{pa(i)})}{n(x_{pa(i)}) + m(x_{pa(i)})}$

where $m(x_{pa(i)})$ is the prior precision, $\hat{p}_s(x_i \mid x_{pa(i)})$ is the smoothed estimate, and $p_0(x_i \mid x_{pa(i)})$ is our prior estimate of $p(x_i \mid x_{pa(i)})$. It is common to take $m(x_{pa(i)})$ to be the same for all parent configurations. The result is a weighted average of the ML estimate and the prior estimate.

SLIDE 36

ML estimate vs. smoothed estimate

For example,

$\hat{p}_{3|1,2}(1 \mid 1, 2) = \dfrac{n(x_1=1, x_2=2, x_3=1)}{n(x_1=1, x_2=2)} = \dfrac{0}{2} = 0$

Suppose we set $m = 2$ and $p_0(x_i \mid x_{pa(i)}) = \hat{p}(x_i)$. Then we get

$\hat{p}_s{}_{3|1,2}(1 \mid 1, 2) = \dfrac{2 \times 0 + 2 \times 0.4}{2 + 2} = 0.2$,

since $\hat{p}_3(1) = 0.4$.
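The same computation as a tiny R helper (a sketch; names are mine):

# Smoothed estimate: weighted average of the ML estimate and the prior
# estimate p0, written in terms of the counts n(x_i, x_pa) and n(x_pa).
smooth_est <- function(n_xi_pa, n_pa, m, p0) {
  (n_xi_pa + m * p0) / (n_pa + m)
}
smooth_est(n_xi_pa = 0, n_pa = 2, m = 2, p0 = 0.4)   # 0.2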

SLIDE 37

Bayesian Multinets

Build a structure on the attributes for each class separately, and use

$P_M(C=i, A_1, \ldots, A_k) = P(C=i)\,P_M(A_1, \ldots, A_k \mid C=i) = P(C=i)\,P_{M_i}(A_1, \ldots, A_k), \quad i = 1, \ldots, c$

Using trees for the class-conditional structures:
Split D into c partitions $D_1, D_2, \ldots, D_c$, where c is the number of distinct values of the class label C; $D_i$ contains all records in D with C = i.
Set $P(C=i) = \hat{P}_D(C=i)$ for $i = 1, \ldots, c$.
Apply Construct-Tree to $D_i$ to construct $M_i$.
Unlike in TANs, you may get a different tree structure for each class!
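One way to sketch this in R (my own code, reusing the mutual_info and max_spanning_tree helpers from earlier slides; d is assumed to be a data frame whose first column is the class):

# Learn one Chow-Liu tree per class partition D_i.
multinet_trees <- function(d) {
  lapply(split(d[, -1], d[, 1]), function(di) {
    k <- ncol(di)
    W <- matrix(0, k, k)
    for (i in 1:(k - 1)) for (j in (i + 1):k)
      W[i, j] <- W[j, i] <- mutual_info(di[[i]], di[[j]])
    max_spanning_tree(W)    # edge list of the tree for this class
  })
}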

SLIDE 38

Experimental Study of Friedman et al.

The following classifiers were compared:

NB: Naive Bayes Classifier
SNB: Naive Bayes with attribute selection
BN: Unrestricted Bayesian Network (Markov Blanket of Class)
C4.5: Classification Tree

And also:

                   structure per class
  structure        same      different
  tree             TAN       CL
  dag              ANB       MN

Superscript s indicates smoothing of parameter estimates.

SLIDE 39

The data sets are the same as on Slide 14.

SLIDE 40

[Figure: experimental results]

SLIDE 41

[Figure: experimental results]

SLIDE 42

[Figure: experimental results]

SLIDE 43

Example: data on death penalty

Data provided by the Georgia parole board.

  Variable   Description
  death      Did the defendant get the death penalty?
  blkdef     Is the defendant black?
  whtvict    Is the victim white?
  aggcirc    Number of aggravating circumstances
  stranger   Were victim and defendant strangers?

We have 100 observations.

SLIDE 44

Example in bnlearn package

> death.small[1:5,]
  death blkdef whtvict aggcirc stranger
  [first five rows of the data; values lost in extraction]
> summary(death.small)
 death  blkdef  whtvict  aggcirc  stranger
 0:51   0:47    0:26     1:22     0:49
 1:49   1:53    1:74     2:78     1:51
> death.nb <- naive.bayes("death", data = death.small)
> death.nb.pred <- predict(death.nb, death.small)
> table(death.small[,1], death.nb.pred)
   death.nb.pred
     0  1
  0 36 15
  1 12 37
> sum(diag(table(death.small[,1], death.nb.pred))/nrow(death.small))
[1] 0.73

SLIDE 45

Example in bnlearn package

# fit TAN to death penalty data with "death" as class variable
> death.tan <- tree.bayes(death.small, "death")
# fit TAN parameters using maximum likelihood estimation ("mle")
> death.tan.fit <- bn.fit(death.tan, death.small, "mle")
# predict class on training sample
> death.tan.pred <- predict(death.tan.fit, death.small)
# make confusion matrix
> table(death.small[,1], death.tan.pred)
   death.tan.pred
     0  1
  0 32 19
  1  8 41
# compute accuracy
> sum(diag(table(death.small[,1], death.tan.pred))/nrow(death.small))
[1] 0.73
# plot the TAN structure
> plot(death.tan)

SLIDE 46

TAN for death penalty data

[Figure: TAN structure for the death penalty data over death, blkdef, whtvict, aggcirc and stranger.]

SLIDE 47

Example in bnlearn package

# learn network structure with hill-climber
> death.hc <- hc(death.small)
> plot(death.hc)
> death.hc.fit <- bn.fit(death.hc, death.small, "mle")
> death.hc.pred <- predict(death.hc.fit, node = "death", data = death.small)
> table(death.small[,1], death.hc.pred)
   death.hc.pred
     0  1
  0 20 31
  1  6 43
> sum(diag(table(death.small[,1], death.hc.pred))/nrow(death.small))
[1] 0.63

SLIDE 48

Network structure with hill climbing for death penalty data

[Figure: network structure learned by hill climbing for the death penalty data.]

SLIDE 49

Network structure with hill climbing for death penalty data

[Figure: the hill-climbing network structure, shown again.]

Death penalty and race of defendant are independent given race of victim.
