Data Mining 2016 Bayesian Network Classifiers
Ad Feelders
Universiteit Utrecht
Ad Feelders ( Universiteit Utrecht ) Data Mining 1 / 48
Literature: N. Friedman, D. Geiger and M. Goldszmidt, Bayesian Network Classifiers, Machine Learning, 29, 131-163, 1997.
[Figure: the naive Bayes network structure. The class variable C is the parent of every attribute A1, A2, ..., Ak, and there are no edges between the attributes.]
By Bayes' rule,

  P(C = j | A1, ..., Ak) = P(A1, A2, ..., Ak | C = j) P(C = j) / Σ_{j'} P(A1, A2, ..., Ak | C = j') P(C = j')

Under the naive Bayes assumption the class-conditional distribution factorizes, so the numerator becomes

  P(A1 | C = j) P(A2 | C = j) · · · P(Ak | C = j) P(C = j)
P(C = 0) = 0.4, P(C = 1) = 0.6

Joint distribution of (A1, A2) given C = 0:

            A2 = 0   A2 = 1   P(A1)
  A1 = 0      0.2      0.1     0.3
  A1 = 1      0.1      0.6     0.7
  P(A2)       0.3      0.7       1

Joint distribution of (A1, A2) given C = 1:

            A2 = 0   A2 = 1   P(A1)
  A1 = 0      0.5      0.2     0.7
  A1 = 1      0.1      0.2     0.3
  P(A2)       0.6      0.4       1

We have that

  P(C = 1 | A1 = 0, A2 = 0) = (0.5 × 0.6) / (0.5 × 0.6 + 0.2 × 0.4) ≈ 0.79

According to naive Bayes

  P(C = 1 | A1 = 0, A2 = 0) = (0.7 × 0.6 × 0.6) / (0.7 × 0.6 × 0.6 + 0.3 × 0.3 × 0.4) ≈ 0.88

Even though the posteriors differ, naive Bayes assigns the example to the correct class.
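The two posterior computations above can be checked with a short script (a sketch in Python rather than the R used later in these slides; the tables are exactly the ones from the example):

```python
# Joint tables P(A1, A2 | C) from the example; indices are [a1][a2].
joint_c0 = [[0.2, 0.1], [0.1, 0.6]]   # P(A1, A2 | C = 0)
joint_c1 = [[0.5, 0.2], [0.1, 0.2]]   # P(A1, A2 | C = 1)
prior = {0: 0.4, 1: 0.6}

a1, a2 = 0, 0

# Exact posterior from the full joint distribution.
num = joint_c1[a1][a2] * prior[1]
exact = num / (num + joint_c0[a1][a2] * prior[0])
print(round(exact, 3))   # 0.789

# Naive Bayes only uses the marginals P(A1 | C) and P(A2 | C).
def marginals(joint):
    p_a1 = [sum(row) for row in joint]         # P(A1 | C)
    p_a2 = [sum(col) for col in zip(*joint)]   # P(A2 | C)
    return p_a1, p_a2

m0_a1, m0_a2 = marginals(joint_c0)
m1_a1, m1_a2 = marginals(joint_c1)
num_nb = m1_a1[a1] * m1_a2[a2] * prior[1]
nb = num_nb / (num_nb + m0_a1[a1] * m0_a2[a2] * prior[0])
print(round(nb, 3))      # 0.875
```

The naive Bayes posterior (0.875) differs from the exact posterior (0.789), but both exceed 0.5, so the predicted class is the same.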
[Figures: Bayesian network structures over the class variable C and attributes A1, ..., A8.]
The loglikelihood of model M given data D is

  L(M | D) = Σ_{j=1}^n log PM(X^(j)) = Σ_{j=1}^n log PM(A^(j), C^(j)),

where A^(j) = (A^(j)_1, A^(j)_2, ..., A^(j)_k).

We can rewrite this as (product rule and log ab = log a + log b):

  L(M | D) = Σ_{j=1}^n log PM(C^(j) | A^(j)) + Σ_{j=1}^n log PM(A^(j))

If there are many attributes, the second term will dominate the loglikelihood score. But we are not interested in modeling the distribution of the attributes!
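The decomposition can be illustrated numerically. The sketch below uses a hypothetical joint distribution over one binary attribute A and a binary class C, and checks that the joint loglikelihood of a small sample equals the sum of the conditional and marginal terms:

```python
import math

# Hypothetical joint distribution P(A, C) and its attribute marginal P(A).
joint = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.4, (1, 1): 0.2}
p_a = {a: joint[(a, 0)] + joint[(a, 1)] for a in (0, 1)}

# A tiny hypothetical sample of (a, c) pairs.
data = [(0, 1), (1, 0), (1, 0), (0, 0)]

ll_joint = sum(math.log(joint[(a, c)]) for a, c in data)          # sum log P(A, C)
ll_cond  = sum(math.log(joint[(a, c)] / p_a[a]) for a, c in data) # sum log P(C | A)
ll_marg  = sum(math.log(p_a[a]) for a, c in data)                 # sum log P(A)

# Product rule: log P(A, C) = log P(C | A) + log P(A), term by term.
assert abs(ll_joint - (ll_cond + ll_marg)) < 1e-12
print(ll_joint, ll_cond, ll_marg)
```

With many attributes, the `ll_marg` term grows with the dimensionality of A and swamps `ll_cond`, which is the term that matters for classification.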
[Figure: −log P(x) plotted against P(x) for P(x) ∈ (0, 1).]
     Dataset         Attributes  Classes   Train    Test
  1  australian           14        2        690    CV-5
  2  breast               10        2        683    CV-5
  3  chess                36        2       2130    1066
  4  cleve                13        2        296    CV-5
  5  corral                6        2        128    CV-5
  6  crx                  15        2        653    CV-5
  7  diabetes              8        2        768    CV-5
  8  flare                10        2       1066    CV-5
  9  german               20        2       1000    CV-5
 10  glass                 9        7        214    CV-5
 11  glass2                9        2        163    CV-5
 12  heart                13        2        270    CV-5
 13  hepatitis            19        2         80    CV-5
 14  iris                  4        3        150    CV-5
 15  letter               16       26      15000    5000
 16  lymphography         18        4        148    CV-5
 17  mofn-3-7-10          10        2        300    1024
 18  pima                  8        2        768    CV-5
 19  satimage             36        6       4435    2000
 20  segment              19        7       1540     770
 21  shuttle-small         9        7       3866    1934
 22  soybean-large        35       19        562    CV-5
 23  vehicle              18        4        846    CV-5
 24  vote                 16        2        435    CV-5
 25  waveform-21          21        3        300    4700
[Figure: percentage classification error of the unrestricted Bayesian network versus naive Bayes on the 25 benchmark data sets.]
Conditional loglikelihood:

  Σ_{j=1}^n log PM(C^(j) | A^(j)_1, ..., A^(j)_k)
The logistic regression assumption is

  log [ P(C = 1 | A) / P(C = 0 | A) ] = α + Σ_{i=1}^k βi Ai,

that is, the log odds is a linear function of the attributes. Under the naive Bayes assumption this is exactly true. Assign to class 1 if α + Σ_{i=1}^k βi Ai > 0 and to class 0 otherwise.

Logistic regression maximizes the conditional likelihood under this assumption (it is a so-called discriminative model). There is no closed-form solution for the maximum likelihood estimates of α and the βi, but the loglikelihood function is globally concave, so there is a unique global optimum.
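Because there is no closed form, the estimates are found iteratively. A minimal gradient-ascent sketch on a hypothetical six-point data set with one binary attribute (class frequency 1/3 at a = 0 and 2/3 at a = 1, so the optimum is known exactly):

```python
import math

# Hypothetical (a, c) pairs: P(C=1 | a=0) = 1/3, P(C=1 | a=1) = 2/3.
data = [(0, 0), (0, 0), (0, 1), (1, 1), (1, 1), (1, 0)]

alpha, beta = 0.0, 0.0
lr = 0.5
for _ in range(5000):
    ga = gb = 0.0
    for a, c in data:
        p = 1.0 / (1.0 + math.exp(-(alpha + beta * a)))  # P(C = 1 | a)
        ga += c - p          # d loglikelihood / d alpha
        gb += (c - p) * a    # d loglikelihood / d beta
    alpha += lr * ga
    beta += lr * gb

# The loglikelihood is concave, so gradient ascent reaches the unique
# optimum, where sigma(alpha) = 1/3 and sigma(alpha + beta) = 2/3.
print(round(alpha, 3), round(beta, 3))   # -0.693 1.386
```

In practice one would use a library routine (as R's `glm` does later in these slides); the point here is only that the concave objective makes the iterative search reliable.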
Under the naive Bayes assumption we have:

  P(C = 1 | a) / P(C = 0 | a)
    = [ P(a1 | C = 1) · · · P(ak | C = 1) P(C = 1) ] / [ P(a1 | C = 0) · · · P(ak | C = 0) P(C = 0) ]
    = ∏_{i=1}^k [ P(ai = 1 | C = 1) / P(ai = 1 | C = 0) ]^{ai} [ P(ai = 0 | C = 1) / P(ai = 0 | C = 0) ]^{1 − ai} × P(C = 1) / P(C = 0)

Taking the log we get

  log [ P(C = 1 | a) / P(C = 0 | a) ]
    = Σ_{i=1}^k ( ai log [ P(ai = 1 | C = 1) / P(ai = 1 | C = 0) ] + (1 − ai) log [ P(ai = 0 | C = 1) / P(ai = 0 | C = 0) ] )
      + log [ P(C = 1) / P(C = 0) ]
Expand and collect terms:

  log [ P(C = 1 | a) / P(C = 0 | a) ] = α + Σ_{i=1}^k βi ai

where

  βi = log [ P(ai = 1 | C = 1) P(ai = 0 | C = 0) ] / [ P(ai = 1 | C = 0) P(ai = 0 | C = 1) ]

  α = Σ_{i=1}^k log [ P(ai = 0 | C = 1) / P(ai = 0 | C = 0) ] + log [ P(C = 1) / P(C = 0) ]

which is a linear function of a.
Suppose P(C = 1) = 0.6, P(a1 = 1|C = 1) = 0.8, P(a1 = 1|C = 0) = 0.5, P(a2 = 1|C = 1) = 0.6, P(a2 = 1|C = 0) = 0.3. Then

  log [ P(C = 1 | a1, a2) / P(C = 0 | a1, a2) ] = 1.386 a1 + 1.253 a2 − 1.476 + 0.405
                                                = −1.071 + 1.386 a1 + 1.253 a2

Classify a point with a1 = 1 and a2 = 0:

  log [ P(C = 1 | 1, 0) / P(C = 0 | 1, 0) ] = −1.071 + 1.386 = 0.315 > 0,

so it is assigned to class 1.
Decision rule: assign to class 1 if α + Σ_{i=1}^k βi Ai > 0 and to class 0 otherwise. This gives a linear decision boundary.
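The coefficients in the example can be checked numerically (a Python sketch; the probabilities are the ones given above, and the linear log odds is compared against a direct naive Bayes computation):

```python
import math

p_c1 = 0.6
p1 = [0.8, 0.6]   # P(ai = 1 | C = 1) for i = 1, 2
p0 = [0.5, 0.3]   # P(ai = 1 | C = 0)

# beta_i and alpha as derived on the previous slide.
beta = [math.log(q1 * (1 - q0) / (q0 * (1 - q1))) for q1, q0 in zip(p1, p0)]
alpha = sum(math.log((1 - q1) / (1 - q0)) for q1, q0 in zip(p1, p0)) \
        + math.log(p_c1 / (1 - p_c1))

def direct_log_odds(a):
    """Log odds straight from the naive Bayes factorization."""
    s1, s0 = math.log(p_c1), math.log(1 - p_c1)
    for ai, q1, q0 in zip(a, p1, p0):
        s1 += math.log(q1 if ai else 1 - q1)
        s0 += math.log(q0 if ai else 1 - q0)
    return s1 - s0

a = (1, 0)
linear = alpha + sum(b * ai for b, ai in zip(beta, a))
assert abs(linear - direct_log_odds(a)) < 1e-12
print([round(b, 3) for b in beta], round(alpha, 3))
print(linear)   # positive, so the point is assigned to class 1
```

(The slide's −1.071 and 0.315 come from rounding the intermediate logs to three decimals; at full precision the linear and the direct computation agree exactly.)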
[Figure: the linear decision boundary a2 = 0.855 − 1.106 a1 in the (A1, A2) plane.]
Compute the mutual information

  IPD(Xi; Xj) = Σ_{xi, xj} PD(xi, xj) log [ PD(xi, xj) / ( PD(xi) PD(xj) ) ]

between each pair of variables. (O(nk2))
[Worked example: computing IPD(X1; X4) from a sample distribution.]
Compute the conditional mutual information IPD(Ai; Aj | C) between each pair of attributes, where

  IPD(Ai; Aj | C) = Σ_{ai, aj, c} PD(ai, aj, c) log [ PD(ai, aj | c) / ( PD(ai | c) PD(aj | c) ) ]
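The conditional mutual information can likewise be estimated from counts; a sketch on hypothetical (ai, aj, c) triples:

```python
import math
from collections import Counter

# Hypothetical sample: rows are (a_i, a_j, c) triples.
data = [(0, 0, 0), (0, 1, 0), (1, 1, 0), (1, 1, 1), (1, 0, 1), (0, 0, 1)]
n = len(data)

p_xyc = Counter(data)                          # counts of (ai, aj, c)
p_xc = Counter((x, c) for x, _, c in data)     # counts of (ai, c)
p_yc = Counter((y, c) for _, y, c in data)     # counts of (aj, c)
p_c = Counter(c for _, _, c in data)           # counts of c

# I(Ai; Aj | C) = sum p(x,y,c) log [ p(x,y,c) p(c) / (p(x,c) p(y,c)) ],
# an equivalent form of the definition above.
cmi = sum((cnt / n) * math.log((cnt / n * p_c[c] / n)
                               / (p_xc[(x, c)] / n * p_yc[(y, c)] / n))
          for (x, y, c), cnt in p_xyc.items())
print(cmi)
```

In the TAN procedure these conditional weights play the role that the unconditional mutual information plays in the Chow-Liu algorithm.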
[Worked example: smoothed estimate of P(A3 = 1 | A1 = 1, A2 = 2), combining the terms 2 × 0 and 2 × 0.4.]
(The 25 benchmark data sets were listed in the earlier table.)
> library(bnlearn)
> library(Rgraphviz)
> death.nb <- naive.bayes(rhcsmall.dat[train.index,], "death")
> plot(as(amat(death.nb), "graphNEL"))
> death.nb.pred <- predict(death.nb, rhcsmall.dat[-train.index,])
Warning message:
In check.data(data, allowed.types = discrete.data.types) :
  variable cat1 has levels that are not observed in the data.
> confmat <- table(rhcsmall.dat[-train.index,"death"], death.nb.pred)
> confmat
     death.nb.pred
       No Yes
  No  259 341
  Yes 217 918
> sum(diag(confmat))/sum(confmat)
[1] 0.6783862
> sum(confmat[2,])/sum(confmat)
[1] 0.6541787
[Figure: naive Bayes structure — death is the parent of cat1, swang1, gender, race, ninsclas, income, ca, age and meanbp1.]
# fit TAN structure to data with "death" as class variable
> death.tan <- tree.bayes(rhcsmall.dat[train.index,], "death")
# fit TAN parameters using maximum likelihood estimation ("mle")
> death.tan.fit <- bn.fit(death.tan, rhcsmall.dat[train.index,], "mle")
# predict class on test sample
> death.tan.pred <- predict(death.tan.fit, rhcsmall.dat[-train.index,])
# make confusion matrix
> confmat <- table(rhcsmall.dat[-train.index,"death"], death.tan.pred)
> confmat
     death.tan.pred
       No Yes
  No  230 370
  Yes 171 964
# compute accuracy
> sum(diag(confmat))/sum(confmat)
[1] 0.6881844
# plot the TAN structure
> plot(as(amat(death.tan), "graphNEL"))
[Figure: TAN structure over death, cat1, swang1, gender, race, ninsclas, income, ca, age and meanbp1.]
# fit model on train set
> death.logreg <- glm(death ~ ., data = rhcsmall.dat[train.index,], family = "binomial")
# make predictions on test set
> death.logreg.pred <- predict(death.logreg, rhcsmall.dat[-train.index,], type = "response")
# compute confusion matrix
> confmat <- table(rhcsmall.dat[-train.index,"death"], death.logreg.pred > 0.5)
> confmat
       FALSE TRUE
  No     212  388
  Yes    156  979
# compute accuracy
> sum(diag(confmat))/sum(confmat)
[1] 0.6864553