Data Mining 2016: Bayesian Network Classifiers, Ad Feelders (PowerPoint presentation transcript)



SLIDE 1

Data Mining 2016 Bayesian Network Classifiers

Ad Feelders

Universiteit Utrecht

Ad Feelders ( Universiteit Utrecht ) Data Mining 1 / 48

SLIDE 2

Literature

  • N. Friedman, D. Geiger and M. Goldszmidt, Bayesian Network Classifiers, Machine Learning, 29, pp. 131-163 (1997) (except section 6)

SLIDE 3

Bayesian Network Classifiers

Bayesian Networks are models of the joint distribution of a collection of random variables. The joint distribution is simplified by introducing independence assumptions. In many applications we are in fact interested in the conditional distribution of one variable (the class variable) given the other variables (attributes). Can we use Bayesian Networks as classifiers?

SLIDE 4

The Naive Bayes Classifier

(figure: naive Bayes network; arcs from the class C to each attribute A1, A2, ..., Ak)

This Bayesian Network is equivalent to its undirected version (why?):

(figure: the same graph with undirected edges between C and each attribute)

Attributes are independent given the class label.

SLIDE 5

The Naive Bayes Classifier

(figure: naive Bayes network; arcs from the class C to each attribute A1, A2, ..., Ak)

BN factorisation: P(X) = ∏_{i=1}^{k} P(X_i | X_pa(i)).

So the factorisation corresponding to the NB classifier is:

P(C, A1, ..., Ak) = P(C) P(A1|C) · · · P(Ak|C)

SLIDE 6

Naive Bayes assumption

Via Bayes' rule we have

P(C = i|A) = P(A1, A2, ..., Ak, C = i) / P(A1, A2, ..., Ak)   (product rule)

= P(A1, A2, ..., Ak | C = i) P(C = i) / ∑_{j=1}^{c} P(A1, A2, ..., Ak | C = j) P(C = j)   (product rule and sum rule)

= P(A1|C = i) P(A2|C = i) · · · P(Ak|C = i) P(C = i) / ∑_{j=1}^{c} P(A1|C = j) P(A2|C = j) · · · P(Ak|C = j) P(C = j)   (NB factorisation)

SLIDE 7

Why Naive Bayes is competitive

The conditional independence assumption is often clearly inappropriate, yet the predictive accuracy of Naive Bayes is competitive with more complex classifiers. How come? The probability estimates of Naive Bayes may be way off, but this does not necessarily result in wrong classifications! Naive Bayes has only a few parameters compared to more complex models, so it can estimate them more reliably.

SLIDE 8

Naive Bayes: Example

P(C = 0) = 0.4, P(C = 1) = 0.6

Class-conditional joint distributions of (A1, A2):

C = 0:
         A2 = 0   A2 = 1   P(A1)
A1 = 0   0.2      0.1      0.3
A1 = 1   0.1      0.6      0.7
P(A2)    0.3      0.7      1

C = 1:
         A2 = 0   A2 = 1   P(A1)
A1 = 0   0.5      0.2      0.7
A1 = 1   0.1      0.2      0.3
P(A2)    0.6      0.4      1

We have that (exact posterior)

P(C = 1|A1 = 0, A2 = 0) = (0.5 × 0.6) / (0.5 × 0.6 + 0.2 × 0.4) = 0.79

According to naive Bayes

P(C = 1|A1 = 0, A2 = 0) = (0.7 × 0.6 × 0.6) / (0.7 × 0.6 × 0.6 + 0.3 × 0.3 × 0.4) = 0.88

Naive Bayes assigns to the right class.
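The two posteriors above are easy to check numerically. A minimal sketch in Python (the deck's own code examples use R with bnlearn; plain Python is used here just for the arithmetic):

```python
# Class prior and class-conditional joints P(A1, A2 | C) from the slide.
p_c = {0: 0.4, 1: 0.6}
joint = {
    0: {(0, 0): 0.2, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.6},
    1: {(0, 0): 0.5, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.2},
}

def exact_posterior(a1, a2):
    """P(C = 1 | a1, a2) using the full class-conditional joint."""
    num = joint[1][(a1, a2)] * p_c[1]
    den = num + joint[0][(a1, a2)] * p_c[0]
    return num / den

def nb_posterior(a1, a2):
    """P(C = 1 | a1, a2) under the naive Bayes factorisation."""
    def marg(c, var, val):
        # marginal P(A_var = val | C = c), obtained by summing the joint
        return sum(p for cell, p in joint[c].items() if cell[var] == val)
    num = marg(1, 0, a1) * marg(1, 1, a2) * p_c[1]
    den = num + marg(0, 0, a1) * marg(0, 1, a2) * p_c[0]
    return num / den

print(f"{exact_posterior(0, 0):.3f}")  # 0.789
print(f"{nb_posterior(0, 0):.3f}")     # 0.875
```

Both posteriors exceed 0.5, so naive Bayes picks the same (correct) class even though its probability estimate is off.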

SLIDE 9

What about this model?

(figure: network with an arc from each attribute Ai to the class C)

SLIDE 10

What about this model?

(figure: network with an arc from each attribute Ai to the class C)

BN factorisation: P(X) = ∏_{i=1}^{k} P(X_i | X_pa(i)).

So the factorisation is:

P(C, A1, ..., Ak) = P(C|A1, ..., Ak) P(A1) · · · P(Ak)

SLIDE 11

Bayesian Networks as Classifiers

(figure: a Bayesian network over the class C and attributes A1, ..., A8)

Markov Blanket: Parents, Children and Parents of Children.

SLIDE 12

Markov Blanket of C: Moral Graph

(figure: the moral graph of the network, with the Markov blanket of C)

Markov Blanket: Parents, Children and Parents of Children. Local Markov property: C ⊥⊥ rest | boundary(C)

SLIDE 13

Bayesian Networks as Classifiers

The loglikelihood under model M is

L(M|D) = ∑_{j=1}^{n} log P_M(X^(j)) = ∑_{j=1}^{n} log P_M(A^(j), C^(j))

where A^(j) = (A_1^(j), A_2^(j), ..., A_k^(j)).

We can rewrite this as (product rule and log ab = log a + log b):

L(M|D) = ∑_{j=1}^{n} log P_M(C^(j)|A^(j)) + ∑_{j=1}^{n} log P_M(A^(j))

If there are many attributes, the second term will dominate the loglikelihood score. But we are not interested in modeling the distribution of the attributes!

SLIDE 14

Bayesian Networks as Classifiers

(figure: plot of −log P(x) against P(x))

SLIDE 15

 #  Dataset         Attributes  Classes  Instances (Train)  Test
 1  australian      14          2        690                CV-5
 2  breast          10          2        683                CV-5
 3  chess           36          2        2130               1066
 4  cleve           13          2        296                CV-5
 5  corral          6           2        128                CV-5
 6  crx             15          2        653                CV-5
 7  diabetes        8           2        768                CV-5
 8  flare           10          2        1066               CV-5
 9  german          20          2        1000               CV-5
10  glass           9           7        214                CV-5
11  glass2          9           2        163                CV-5
12  heart           13          2        270                CV-5
13  hepatitis       19          2        80                 CV-5
14  iris            4           3        150                CV-5
15  letter          16          26       15000              5000
16  lymphography    18          4        148                CV-5
17  mofn-3-7-10     10          2        300                1024
18  pima            8           2        768                CV-5
19  satimage        36          6        4435               2000
20  segment         19          7        1540               770
21  shuttle-small   9           7        3866               1934
22  soybean-large   35          19       562                CV-5
23  vehicle         18          4        846                CV-5
24  vote            16          2        435                CV-5
25  waveform-21     21          3        300                4700

SLIDE 16

Naive Bayes vs. Unrestricted BN

(figure: percentage classification error per data set, comparing the unrestricted Bayesian network with naive Bayes)

SLIDE 17

Use Conditional Log-likelihood?

Discriminative vs. generative learning. The conditional loglikelihood function is

CL(M|D) = ∑_{j=1}^{n} log P_M(C^(j) | A_1^(j), ..., A_k^(j))

There is no closed-form solution for the ML estimates. Remark: they can be obtained via logistic regression for models with perfect graphs (naive Bayes, TANs).

SLIDE 18

NB and Logistic Regression

The logistic regression assumption is

log [ P(C = 1|A) / P(C = 0|A) ] = α + ∑_{i=1}^{k} βi Ai,

that is, the log odds is a linear function of the attributes. Under the naive Bayes assumption, this is exactly true. Assign to class 1 if α + ∑_{i=1}^{k} βi Ai > 0 and to class 0 otherwise.

Logistic regression maximizes the conditional likelihood under this assumption (it is a so-called discriminative model). There is no closed-form solution for the maximum likelihood estimates of α and βi, but the loglikelihood function is globally concave (unique global optimum).

SLIDE 19

Proof (for binary attributes Ai)

Under the naive Bayes assumption we have:

P(C = 1|a) / P(C = 0|a) = [ P(a1|C = 1) · · · P(ak|C = 1) P(C = 1) ] / [ P(a1|C = 0) · · · P(ak|C = 0) P(C = 0) ]

= ∏_{i=1}^{k} [ P(ai = 1|C = 1) / P(ai = 1|C = 0) ]^{ai} [ P(ai = 0|C = 1) / P(ai = 0|C = 0) ]^{1−ai} × P(C = 1) / P(C = 0)

Taking the log we get

log [ P(C = 1|a) / P(C = 0|a) ] = ∑_{i=1}^{k} [ ai log ( P(ai = 1|C = 1) / P(ai = 1|C = 0) ) + (1 − ai) log ( P(ai = 0|C = 1) / P(ai = 0|C = 0) ) ] + log [ P(C = 1) / P(C = 0) ]
SLIDE 20

Proof (continued)

Expand and collect terms:

log [ P(C = 1|a) / P(C = 0|a) ] = ∑_{i=1}^{k} ai βi + α

with

βi = log [ P(ai = 1|C = 1) P(ai = 0|C = 0) / ( P(ai = 1|C = 0) P(ai = 0|C = 1) ) ]

α = ∑_{i=1}^{k} log [ P(ai = 0|C = 1) / P(ai = 0|C = 0) ] + log [ P(C = 1) / P(C = 0) ]

which is a linear function of a.

SLIDE 21

Example

Suppose P(C = 1) = 0.6, P(a1 = 1|C = 1) = 0.8, P(a1 = 1|C = 0) = 0.5, P(a2 = 1|C = 1) = 0.6, P(a2 = 1|C = 0) = 0.3. Then

log [ P(C = 1|a1, a2) / P(C = 0|a1, a2) ] = 1.386 a1 + 1.253 a2 − 1.476 + 0.405
= −1.071 + 1.386 a1 + 1.253 a2

Classify a point with a1 = 1 and a2 = 0:

log [ P(C = 1|1, 0) / P(C = 0|1, 0) ] = −1.071 + 1.386 × 1 + 1.253 × 0 = 0.315

Decision rule: assign to class 1 if α + ∑_{i=1}^{k} βi Ai > 0 and to class 0 otherwise. Linear decision boundary.
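The coefficients in this example follow from the formulas for βi and α on the previous slide; a quick Python check (illustrative only, names are ad hoc):

```python
from math import log

# NB parameters from the slide: p_a_c[i][c] = P(a_i = 1 | C = c)
p_c1 = 0.6
p_a_c = {1: {1: 0.8, 0: 0.5},
         2: {1: 0.6, 0: 0.3}}

def beta(i):
    """beta_i = log odds-ratio of attribute i between the two classes."""
    p1, p0 = p_a_c[i][1], p_a_c[i][0]
    return log(p1 * (1 - p0) / (p0 * (1 - p1)))

# alpha = sum_i log P(a_i = 0|C=1)/P(a_i = 0|C=0) + log P(C=1)/P(C=0)
alpha = (sum(log((1 - p_a_c[i][1]) / (1 - p_a_c[i][0])) for i in p_a_c)
         + log(p_c1 / (1 - p_c1)))

# log odds of the point a1 = 1, a2 = 0
log_odds = alpha + beta(1) * 1 + beta(2) * 0
print(f"{alpha:.3f} {beta(1):.3f} {beta(2):.3f} {log_odds:.3f}")
```

The printed values reproduce −1.071, 1.386 and 1.253 from the slide, and the positive log odds means the point is assigned to class 1.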

SLIDE 22

Linear Decision Boundary

(figure: the linear decision boundary a2 = 0.855 − 1.106 a1 in the (A1, A2) plane; points on one side are assigned to CLASS 0, points on the other side to CLASS 1)

SLIDE 23

Relax strong assumptions of NB

The conditional independence assumption of NB is often incorrect, and can lead to suboptimal classification performance. Relax this assumption by allowing (restricted) dependencies between attributes. This may produce more accurate probability estimates, possibly leading to better classification performance. This is not guaranteed, because the more complex model may be overfitting.

SLIDE 24

Tree Structured BN

Tree structure: each node (except the root of the tree) has exactly one parent.

For tree-structured Bayesian networks there is an algorithm, due to Chow and Liu, that produces the optimal structure in polynomial time. This algorithm is guaranteed to produce the tree structure that maximizes the loglikelihood score. Why no penalty for complexity?

SLIDE 25

Mutual Information

A measure of association between (discrete) random variables X and Y:

I_P(X; Y) = ∑_{x,y} P(x, y) log [ P(x, y) / ( P(x) P(y) ) ]

This is the “distance” (Kullback-Leibler divergence) between the joint distribution of X and Y, and their joint distribution under the independence assumption. If X and Y are independent, their mutual information is zero; otherwise it is some positive quantity.

SLIDE 26

Algorithm Construct-Tree of Chow and Liu

  • Compute I_P̂D(Xi; Xj) between each pair of variables. (O(nk²))
  • Build the complete undirected graph with weights I_P̂D(Xi; Xj).
  • Build a maximum weighted spanning tree. (O(k² log k))
  • Choose a root, and let all edges point away from it.

SLIDE 27

Example Data Set

Obs  X1  X2  X3  X4
 1    1   1   1   1
 2    1   1   1   1
 3    1   1   2   1
 4    1   2   2   1
 5    1   2   2   2
 6    2   1   1   2
 7    2   1   2   3
 8    2   1   2   3
 9    2   2   2   3
10    2   2   1   3

SLIDE 28

Mutual Information

I_P̂D(X1; X4) = ∑_{x1,x4} P̂D(x1, x4) log [ P̂D(x1, x4) / ( P̂D(x1) P̂D(x4) ) ]

= 0.4 log [ 0.4 / ((0.5)(0.4)) ] + 0.1 log [ 0.1 / ((0.5)(0.2)) ] + 0 · log [ 0 / ((0.5)(0.4)) ]
+ 0 · log [ 0 / ((0.5)(0.4)) ] + 0.1 log [ 0.1 / ((0.5)(0.2)) ] + 0.4 log [ 0.4 / ((0.5)(0.4)) ]

= 0.55

(using the convention 0 log 0 = 0)
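The computation above, together with the Construct-Tree procedure from the previous slides, can be sketched in Python (a self-contained illustration, not the course code; `data` holds the ten observations of the example data set):

```python
from collections import Counter
from math import log

def mutual_information(data, i, j):
    """Empirical mutual information I(Xi; Xj) in nats, from a list of tuples."""
    n = len(data)
    pij = Counter((row[i], row[j]) for row in data)
    pi = Counter(row[i] for row in data)
    pj = Counter(row[j] for row in data)
    # sum over observed cells: P(x,y) log[ P(x,y) / (P(x) P(y)) ];
    # unobserved cells contribute 0 (convention 0 log 0 = 0)
    return sum((c / n) * log(c * n / (pi[x] * pj[y]))
               for (x, y), c in pij.items())

def chow_liu(data, root=0):
    """Chow-Liu: directed tree edges (parent, child) maximizing loglikelihood."""
    k = len(data[0])
    weight = {(i, j): mutual_information(data, i, j)
              for i in range(k) for j in range(i + 1, k)}
    # Prim-style maximum weighted spanning tree, edges directed away from root
    in_tree, edges = {root}, []
    while len(in_tree) < k:
        i, j = max((e for e in weight if (e[0] in in_tree) != (e[1] in in_tree)),
                   key=lambda e: weight[e])
        parent, child = (i, j) if i in in_tree else (j, i)
        edges.append((parent, child))
        in_tree.add(child)
    return edges

data = [(1, 1, 1, 1), (1, 1, 1, 1), (1, 1, 2, 1), (1, 2, 2, 1), (1, 2, 2, 2),
        (2, 1, 1, 2), (2, 1, 2, 3), (2, 1, 2, 3), (2, 2, 2, 3), (2, 2, 1, 3)]
print(round(mutual_information(data, 0, 3), 2))  # 0.55
print(chow_liu(data))
```

With this data the X1-X4 edge has by far the largest weight, so it always ends up in the tree, matching the worked example.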

SLIDE 29

Build Graph with Weights

(figure: complete weighted graph on X1, ..., X4; the edge X1-X4 has weight 0.55, the other displayed weights are 0.032)

SLIDE 30

Maximum weighted spanning tree

(figure: the maximum weighted spanning tree on X1, ..., X4)

SLIDE 31

Choose root node

(figure: the spanning tree with a root chosen and all edges directed away from it)

SLIDE 32

Algorithm to construct TAN

  • Compute I_P̂D(Ai; Aj|C) between each pair of attributes, where

    I_P(Ai; Aj|C) = ∑_{ai,aj,c} P(ai, aj, c) log [ P(ai, aj|c) / ( P(ai|c) P(aj|c) ) ]

    is the conditional mutual information between Ai and Aj given C.
  • Build the complete undirected graph with weights I_P̂D(Ai; Aj|C).
  • Build a maximum weighted spanning tree.
  • Choose a root, and let all edges point away from it.
  • Construct a TAN by adding C and an arc from C to every attribute.

This algorithm is guaranteed to produce the TAN structure with the optimal loglikelihood score.
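Relative to Chow-Liu, the only new ingredient is the class-conditional edge weight. A sketch of the conditional mutual information estimate in Python (an illustration; rows are assumed to be tuples whose last entry is the class label):

```python
from collections import Counter
from math import log

def conditional_mi(rows, i, j):
    """Empirical I(Ai; Aj | C) in nats; the class is the last column."""
    n = len(rows)
    pijc = Counter((r[i], r[j], r[-1]) for r in rows)
    pic = Counter((r[i], r[-1]) for r in rows)
    pjc = Counter((r[j], r[-1]) for r in rows)
    pc = Counter(r[-1] for r in rows)
    total = 0.0
    for (a, b, c), cnt in pijc.items():
        # P(a,b,c) * log[ P(a,b|c) / (P(a|c) P(b|c)) ]; in counts the
        # sample sizes cancel, leaving cnt * n(c) / (n(a,c) * n(b,c))
        total += (cnt / n) * log(cnt * pc[c] / (pic[(a, c)] * pjc[(b, c)]))
    return total
```

Plugging these weights into the spanning-tree step of Construct-Tree, and then adding C with an arc to every attribute, yields the TAN structure.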

SLIDE 33

TAN for Pima Indians Data

(figure: TAN structure for the Pima Indians data; class C with arcs to the attributes Pregnant, Insulin, Age, DPF, Glucose and Mass, plus a tree over the attributes)

SLIDE 34

Interpretation of TAN’s

Just like NB models, TANs are equivalent to their undirected counterparts (why?).

(figure: the undirected version of the Pima TAN; attributes abbreviated P, A, I, D, M, G)

Since there is an edge between Pregnant and Age, the influence of Age on the class label depends on (is different for different values of) Pregnant, and vice versa.

SLIDE 35

Smoothing by adding “prior counts”

Recall that the maximum likelihood estimate of p(xi | xpa(i)) is:

p̂(xi | xpa(i)) = n(xi, xpa(i)) / n(xpa(i)),

where n(xpa(i)) is the number of observations (rows) with parent configuration xpa(i), and n(xi, xpa(i)) is the number of observations with parent configuration xpa(i) and value xi for variable Xi. But sometimes we have no (n(xpa(i)) = 0) or very few observations to estimate these (conditional) probabilities.

SLIDE 36

Smoothing by adding “prior counts”

Add “prior counts” to “smooth” the estimates:

p̂s(xi | xpa(i)) = [ n(xpa(i)) p̂(xi | xpa(i)) + m(xpa(i)) p0(xi | xpa(i)) ] / [ n(xpa(i)) + m(xpa(i)) ]

where m(xpa(i)) is the prior precision, p̂s(xi | xpa(i)) is the smoothed estimate, and p0(xi | xpa(i)) is our prior estimate of p(xi | xpa(i)). It is common to take m(xpa(i)) to be the same for all parent configurations. The result is a weighted average of the ML estimate and the prior estimate.

SLIDE 37

ML estimate vs. smoothed estimate

For example,

p̂3|1,2(1|1, 2) = n(x1 = 1, x2 = 2, x3 = 1) / n(x1 = 1, x2 = 2) = 0/2 = 0

Suppose we set m = 2, and p0(xi | xpa(i)) = p̂(xi). Then we get

p̂s3|1,2(1|1, 2) = (2 × 0 + 2 × 0.4) / (2 + 2) = 0.2,

since p̂3(1) = 0.4.
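Using the example data set from the Chow-Liu slides, this smoothed estimate can be reproduced in a few lines of Python (a sketch; variable names are ad hoc):

```python
# Rows of the example data set (X1, X2, X3, X4).
data = [(1, 1, 1, 1), (1, 1, 1, 1), (1, 1, 2, 1), (1, 2, 2, 1), (1, 2, 2, 2),
        (2, 1, 1, 2), (2, 1, 2, 3), (2, 1, 2, 3), (2, 2, 2, 3), (2, 2, 1, 3)]

def smoothed(x3, x1, x2, m=2):
    """Smoothed estimate of P(X3 = x3 | X1 = x1, X2 = x2), with the
    marginal P̂(X3 = x3) as the prior estimate p0."""
    n_pa = sum(1 for r in data if (r[0], r[1]) == (x1, x2))
    n_joint = sum(1 for r in data if (r[0], r[1], r[2]) == (x1, x2, x3))
    p_ml = n_joint / n_pa if n_pa > 0 else 0.0       # ML estimate (0/2 here)
    p0 = sum(1 for r in data if r[2] == x3) / len(data)  # prior: marginal of X3
    return (n_pa * p_ml + m * p0) / (n_pa + m)

print(smoothed(1, 1, 2))  # (2*0 + 2*0.4) / (2 + 2) = 0.2
```

With m = 2 the two prior counts pull the zero ML estimate toward the marginal 0.4, giving 0.2 as on the slide.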

SLIDE 38

Bayesian Multinets

Build a structure on the attributes for each class separately and use

P_M(C = i, A1, ..., Ak) = P(C = i) P_M(A1, ..., Ak | C = i) = P(C = i) P_Mi(A1, ..., Ak),  i = 1, ..., c

Using trees for the class-conditional structures:
  • Split D into c partitions D1, D2, ..., Dc, where c is the number of distinct values of the class label C; Di contains all records in D with C = i.
  • Set P(C = i) = P̂D(C = i) for i = 1, ..., c.
  • Apply Construct-Tree to Di to construct Mi.

Unlike in TANs, you may get a different tree structure for each class!

SLIDE 39

Experimental Study of Friedman et al.

The following classifiers were compared:
  • NB: Naive Bayes Classifier
  • SNB: Naive Bayes with attribute selection
  • BN: Unrestricted Bayesian Network (Markov Blanket of Class)
  • C4.5: Classification Tree

And also:

            per-class structure
structure   same   different
tree        TAN    CL
dag         ANB    MN

A superscript s indicates smoothing of the parameter estimates.

SLIDE 40

(dataset table repeated from Slide 15)

SLIDE 41

(figure: experimental results, not recoverable from the transcript)

SLIDE 42

(figure: experimental results, not recoverable from the transcript)

SLIDE 43

(figure: experimental results, not recoverable from the transcript)

SLIDE 44

Example: RHC small data set

Variable   Description
cat1       Primary disease category
death      Did patient die within 180 days?
swang1     Did patient get Swan-Ganz catheter?
gender     Gender
race       Race
ninsclas   Insurance class
income     Income
ca         Cancer status
age        Age
meanbp1    Mean blood pressure

We have 4000 observations for training and 1735 for testing.

SLIDE 45

Naive Bayes

> library(bnlearn)
> library(Rgraphviz)
> death.nb <- naive.bayes(rhcsmall.dat[train.index,], "death")
> plot(as(amat(death.nb), "graphNEL"))
> death.nb.pred <- predict(death.nb, rhcsmall.dat[-train.index,])
Warning message:
In check.data(data, allowed.types = discrete.data.types) :
  variable cat1 has levels that are not observed in the data.
> confmat <- table(rhcsmall.dat[-train.index, "death"], death.nb.pred)
> confmat
     death.nb.pred
       No  Yes
  No  259  341
  Yes 217  918
# accuracy
> sum(diag(confmat))/sum(confmat)
[1] 0.6783862
# proportion of the majority class ("Yes") in the test set
> sum(confmat[2,])/sum(confmat)
[1] 0.6541787

SLIDE 46

NB for RHC data

(figure: naive Bayes structure for the RHC data; death with arcs to cat1, swang1, gender, race, ninsclas, income, ca, age and meanbp1)

SLIDE 47

Tree Augmented Naive Bayes (TAN)

# fit TAN structure to data with "death" as class variable
> death.tan <- tree.bayes(rhcsmall.dat[train.index,], "death")
# fit TAN parameters using maximum likelihood estimation ("mle")
> death.tan.fit <- bn.fit(death.tan, rhcsmall.dat[train.index,], "mle")
# predict class on test sample
> death.tan.pred <- predict(death.tan.fit, rhcsmall.dat[-train.index,])
# make confusion matrix
> confmat <- table(rhcsmall.dat[-train.index, "death"], death.tan.pred)
> confmat
     death.tan.pred
       No  Yes
  No  230  370
  Yes 171  964
# compute accuracy
> sum(diag(confmat))/sum(confmat)
[1] 0.6881844
# plot the TAN structure
> plot(as(amat(death.tan), "graphNEL"))

SLIDE 48

TAN for RHC data

(figure: TAN structure for the RHC data over death and the nine attributes)

SLIDE 49

Logistic Regression

# fit model on train set
> death.logreg <- glm(death ~ ., data = rhcsmall.dat[train.index,],
                      family = "binomial")
# make predictions on test set
> death.logreg.pred <- predict(death.logreg, rhcsmall.dat[-train.index,],
                               type = "response")
# compute confusion matrix
> confmat <- table(rhcsmall.dat[-train.index, "death"],
                   death.logreg.pred > 0.5)
> confmat
      FALSE TRUE
  No    212  388
  Yes   156  979
# compute accuracy
> sum(diag(confmat))/sum(confmat)
[1] 0.6864553
