SLIDE 1

Data Mining 2013 Bayesian Network Classifiers

Ad Feelders

Universiteit Utrecht

October 24, 2013

SLIDE 2

Literature

  • N. Friedman, D. Geiger and M. Goldszmidt, Bayesian Network Classifiers, Machine Learning 29, pp. 131–163 (1997) (except Section 6)

SLIDE 3

Bayesian Network Classifiers

Bayesian Networks are models of the joint distribution of a collection of random variables. The joint distribution is simplified by introducing independence assumptions. In many applications we are in fact interested in the conditional distribution of one variable (the class variable) given the other variables (attributes). Can we use Bayesian Networks as classifiers?

SLIDE 4

The Naive Bayes Classifier

[Figure: Naive Bayes structure: arcs from the class C to each attribute A1, A2, . . . , Ak.]

This Bayesian Network is equivalent to its undirected version (why?):

[Figure: the same structure with the edge directions dropped.]

Attributes are independent given the class label.

SLIDE 5

The Naive Bayes Classifier

[Figure: Naive Bayes structure, as on the previous slide.]

BN factorisation: $P(X) = \prod_{i=1}^{k} P(X_i \mid X_{pa(i)})$, so the factorisation corresponding to the NB classifier is

$P(C, A_1, \ldots, A_k) = P(C)\,P(A_1 \mid C) \cdots P(A_k \mid C)$

SLIDE 6

Naive Bayes assumption

Via Bayes' rule we have

$P(C=i \mid A) = \dfrac{P(A_1, A_2, \ldots, A_k, C=i)}{P(A_1, A_2, \ldots, A_k)}$  (product rule)

$= \dfrac{P(A_1, \ldots, A_k \mid C=i)\,P(C=i)}{\sum_{j=1}^{c} P(A_1, \ldots, A_k \mid C=j)\,P(C=j)}$  (product rule and sum rule)

$= \dfrac{P(A_1 \mid C=i) \cdots P(A_k \mid C=i)\,P(C=i)}{\sum_{j=1}^{c} P(A_1 \mid C=j) \cdots P(A_k \mid C=j)\,P(C=j)}$  (NB factorisation)

SLIDE 7

Why Naive Bayes is competitive

The conditional independence assumption is often clearly inappropriate, yet the predictive accuracy of Naive Bayes is competitive with more complex classifiers. How come? The probability estimates of Naive Bayes may be way off, but this does not necessarily result in wrong classifications! Naive Bayes has only a few parameters compared to more complex models, so it can estimate them more reliably.

SLIDE 8

Naive Bayes: Example

P(C = 0) = 0.4, P(C = 1) = 0.6

Class-conditional joint distributions of A1 (rows) and A2 (columns):

  C = 0:    A2 = 0   A2 = 1   P(A1)
  A1 = 0     0.2      0.1      0.3
  A1 = 1     0.1      0.6      0.7
  P(A2)      0.3      0.7      1

  C = 1:    A2 = 0   A2 = 1   P(A1)
  A1 = 0     0.5      0.2      0.7
  A1 = 1     0.1      0.2      0.3
  P(A2)      0.6      0.4      1

We have that

$P(C=1 \mid A_1=0, A_2=0) = \dfrac{0.5 \times 0.6}{0.5 \times 0.6 + 0.2 \times 0.4} = 0.79$

According to naive Bayes,

$P(C=1 \mid A_1=0, A_2=0) = \dfrac{0.7 \times 0.6 \times 0.6}{0.7 \times 0.6 \times 0.6 + 0.3 \times 0.3 \times 0.4} = 0.88$

Naive Bayes assigns the point to the correct class.
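To make the arithmetic concrete, here is a minimal R sketch (the tables are typed in from this slide; all variable names are mine) that reproduces both posteriors:

# Class-conditional joint tables; rows = A1 in {0,1}, columns = A2 in {0,1}
pC <- c(0.4, 0.6)                                          # P(C=0), P(C=1)
joint0 <- matrix(c(0.2, 0.1, 0.1, 0.6), 2, byrow = TRUE)   # given C = 0
joint1 <- matrix(c(0.5, 0.2, 0.1, 0.2), 2, byrow = TRUE)   # given C = 1

# Exact posterior P(C=1 | A1=0, A2=0) uses the joint cell [1,1]:
num <- joint1[1, 1] * pC[2]
num / (num + joint0[1, 1] * pC[1])                   # 0.789

# Naive Bayes replaces the joint cell by a product of marginals:
nb1 <- rowSums(joint1)[1] * colSums(joint1)[1] * pC[2]
nb0 <- rowSums(joint0)[1] * colSums(joint0)[1] * pC[1]
nb1 / (nb1 + nb0)                                    # 0.875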

SLIDE 9

What about this model?

[Figure: reversed structure: arcs from each attribute A1, . . . , Ak to the class C.]

BN factorisation: $P(X) = \prod_{i=1}^{k} P(X_i \mid X_{pa(i)})$, so the factorisation is

$P(C, A_1, \ldots, A_k) = P(C \mid A_1, \ldots, A_k)\,P(A_1) \cdots P(A_k)$

SLIDE 10

Bayesian Networks as Classifiers

[Figure: a Bayesian network over the class C and attributes A1, . . . , A8.]

Markov Blanket: Parents, Children and Parents of Children.

SLIDE 11

Markov Blanket of C: Moral Graph

[Figure: the moral graph of the network above, with the Markov blanket of C highlighted.]

Markov Blanket: Parents, Children and Parents of Children. Local Markov property: $C \perp\!\!\!\perp \text{rest} \mid \text{boundary}(C)$

SLIDE 12

Bayesian Networks as Classifiers

Loglikelihood under model M is

$L(M \mid D) = \sum_{j=1}^{n} \log P_M(X^{(j)})$, where $X^{(j)} = (A_1^{(j)}, A_2^{(j)}, \ldots, A_k^{(j)}, C^{(j)})$.

We can rewrite this as

$L(M \mid D) = \sum_{j=1}^{n} \log P_M(C^{(j)} \mid A^{(j)}) + \sum_{j=1}^{n} \log P_M(A^{(j)})$

If there are many attributes, the second term will dominate the loglikelihood score. But we are not interested in modeling the distribution of the attributes!

SLIDE 13

Bayesian Networks as Classifiers

[Figure: $-\log P(x)$ as a function of $P(x) \in (0, 1]$: the term grows without bound as $P(x) \to 0$.]

SLIDE 14

 #   Dataset         # Attributes   # Classes   # Instances (Train)   Test
 1   australian           14            2              690            CV-5
 2   breast               10            2              683            CV-5
 3   chess                36            2             2130            1066
 4   cleve                13            2              296            CV-5
 5   corral                6            2              128            CV-5
 6   crx                  15            2              653            CV-5
 7   diabetes              8            2              768            CV-5
 8   flare                10            2             1066            CV-5
 9   german               20            2             1000            CV-5
10   glass                 9            7              214            CV-5
11   glass2                9            2              163            CV-5
12   heart                13            2              270            CV-5
13   hepatitis            19            2               80            CV-5
14   iris                  4            3              150            CV-5
15   letter               16           26            15000            5000
16   lymphography         18            4              148            CV-5
17   mofn-3-7-10          10            2              300            1024
18   pima                  8            2              768            CV-5
19   satimage             36            6             4435            2000
20   segment              19            7             1540             770
21   shuttle-small         9            7             3866            1934
22   soybean-large        35           19              562            CV-5
23   vehicle              18            4              846            CV-5
24   vote                 16            2              435            CV-5
25   waveform-21          21            3              300            4700

SLIDE 15

Naive Bayes vs. Unrestricted BN

[Figure: percentage classification error per data set: unrestricted Bayesian network vs. Naive Bayes on the 25 data sets.]

SLIDE 16

Use Conditional Log-likelihood?

Discriminative vs. generative learning. Conditional loglikelihood function:

$CL(M \mid D) = \sum_{j=1}^{n} \log P_M(C^{(j)} \mid A_1^{(j)}, \ldots, A_k^{(j)})$

There is no closed-form solution for the ML estimates. Remark: this can be done via logistic regression for models with perfect graphs (Naive Bayes, TANs).

SLIDE 17

NB and Logistic Regression

The logistic regression assumption is

$\log \dfrac{P(C=1 \mid A)}{P(C=0 \mid A)} = \alpha + \sum_{i=1}^{k} \beta_i A_i$,

that is, the log odds is a linear function of the attributes. Under the naive Bayes assumption, this is exactly true. Assign to class 1 if $\alpha + \sum_{i=1}^{k} \beta_i A_i > 0$ and to class 0 otherwise.

Logistic regression maximizes the conditional likelihood under this assumption (it is a so-called discriminative model). There is no closed-form solution for the maximum likelihood estimates of $\alpha$ and $\beta_i$, but the loglikelihood function is globally concave (unique global optimum).
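As an aside, a small R sketch (simulated data; all names are mine) of fitting this discriminative model with glm; with data generated from the naive Bayes model of the Example slide below, the fitted coefficients come out close to the values derived there analytically:

# Simulate from a naive Bayes model with two binary attributes,
# then maximize the conditional likelihood with logistic regression.
set.seed(1)
n  <- 50000
C  <- rbinom(n, 1, 0.6)
A1 <- rbinom(n, 1, ifelse(C == 1, 0.8, 0.5))
A2 <- rbinom(n, 1, ifelse(C == 1, 0.6, 0.3))
fit <- glm(C ~ A1 + A2, family = binomial)
coef(fit)   # approximately alpha = -1.07, beta1 = 1.39, beta2 = 1.25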

SLIDE 18

Proof (for binary attributes Ai)

Under the naive Bayes assumption we have:

$\dfrac{P(C=1 \mid a)}{P(C=0 \mid a)} = \dfrac{P(a_1 \mid C=1) \cdots P(a_k \mid C=1)\,P(C=1)}{P(a_1 \mid C=0) \cdots P(a_k \mid C=0)\,P(C=0)}$

$= \prod_{i=1}^{k} \left( \dfrac{P(a_i=1 \mid C=1)}{P(a_i=1 \mid C=0)} \right)^{a_i} \left( \dfrac{P(a_i=0 \mid C=1)}{P(a_i=0 \mid C=0)} \right)^{1-a_i} \times \dfrac{P(C=1)}{P(C=0)}$

Taking the log we get

$\log \dfrac{P(C=1 \mid a)}{P(C=0 \mid a)} = \sum_{i=1}^{k} \left[ a_i \log \dfrac{P(a_i=1 \mid C=1)}{P(a_i=1 \mid C=0)} + (1 - a_i) \log \dfrac{P(a_i=0 \mid C=1)}{P(a_i=0 \mid C=0)} \right] + \log \dfrac{P(C=1)}{P(C=0)}$

SLIDE 19

Proof (continued)

Expand and collect terms:

$\log \dfrac{P(C=1 \mid a)}{P(C=0 \mid a)} = \sum_{i=1}^{k} a_i \underbrace{\log \left( \dfrac{P(a_i=1 \mid C=1)}{P(a_i=1 \mid C=0)} \cdot \dfrac{P(a_i=0 \mid C=0)}{P(a_i=0 \mid C=1)} \right)}_{\beta_i} + \underbrace{\sum_{i=1}^{k} \log \dfrac{P(a_i=0 \mid C=1)}{P(a_i=0 \mid C=0)} + \log \dfrac{P(C=1)}{P(C=0)}}_{\alpha}$

which is a linear function of $a$.

SLIDE 20

Example

Suppose P(C = 1) = 0.6, P(a1 = 1 | C = 1) = 0.8, P(a1 = 1 | C = 0) = 0.5, P(a2 = 1 | C = 1) = 0.6, P(a2 = 1 | C = 0) = 0.3. Then

$\log \dfrac{P(C=1 \mid a_1, a_2)}{P(C=0 \mid a_1, a_2)} = 1.386 a_1 + 1.253 a_2 - 1.476 + 0.405 = -1.071 + 1.386 a_1 + 1.253 a_2$

Classify a point with a1 = 1 and a2 = 0:

$\log \dfrac{P(C=1 \mid 1, 0)}{P(C=0 \mid 1, 0)} = -1.071 + 1.386 \times 1 + 1.253 \times 0 = 0.315$

Decision rule: assign to class 1 if $\alpha + \sum_{i=1}^{k} \beta_i A_i > 0$ and to class 0 otherwise. Linear decision boundary.
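These coefficients can be recovered directly from the NB parameters, following the proof above; a short R sketch (names are mine):

# NB parameters from the slide
pC1 <- 0.6
p1  <- c(0.8, 0.6)   # P(a_i = 1 | C = 1) for i = 1, 2
p0  <- c(0.5, 0.3)   # P(a_i = 1 | C = 0)

# beta_i is the log odds-ratio; alpha collects the a_i = 0 terms and the prior
beta  <- log(p1 / p0) + log((1 - p0) / (1 - p1))               # 1.386 1.253
alpha <- sum(log((1 - p1) / (1 - p0))) + log(pC1 / (1 - pC1))  # -1.071

alpha + sum(beta * c(1, 0))   # 0.315 > 0, so a = (1, 0) goes to class 1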

SLIDE 21

Linear Decision Boundary

[Figure: the decision boundary $a_2 = 0.855 - 1.106\,a_1$ in the unit square, with class 1 above the line and class 0 below.]

SLIDE 22

Relax strong assumptions of NB

Conditional independence assumption of NB is often incorrect, and could lead to suboptimal classification performance. Relax this assumption by allowing (restricted) dependencies between attributes. This may produce more accurate probability estimates, possibly leading to better classification performance. This is not guaranteed, because the more complex model may be overfitting.

SLIDE 23

Tree Structured BN

Tree structure: each node (except the root of the tree) has exactly one parent.

For tree-structured Bayesian Networks there is an algorithm, due to Chow and Liu, that produces the optimal structure in polynomial time. This algorithm is guaranteed to produce the tree structure that maximizes the loglikelihood score. Why no penalty for complexity?

SLIDE 24

Mutual Information

Measure of association between (discrete) random variables X and Y:

$I_P(X; Y) = \sum_{x, y} P(x, y) \log \dfrac{P(x, y)}{P(x)\,P(y)}$

"Distance" (Kullback-Leibler divergence) between the joint distribution of X and Y, and their joint distribution under the independence assumption. If X and Y are independent, their mutual information is zero; otherwise it is some positive quantity.
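A direct R implementation of this estimate (a sketch; the function name is mine), using natural logarithms as the slides do:

# Empirical mutual information of two discrete vectors.
mutual_info <- function(x, y) {
  pxy <- table(x, y) / length(x)            # joint relative frequencies
  px  <- rowSums(pxy)
  py  <- colSums(pxy)
  terms <- pxy * log(pxy / outer(px, py))
  sum(terms[pxy > 0])                       # skip the 0 log 0 cells
}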

SLIDE 25

Algorithm Construct-Tree of Chow and Liu

Compute $\hat{I}_{P_D}(X_i; X_j)$ between each pair of variables ($O(nk^2)$).
Build a complete undirected graph with weights $\hat{I}_{P_D}(X_i; X_j)$.
Build a maximum weighted spanning tree ($O(k^2 \log k)$).
Choose a root, and let all edges point away from it.
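The spanning-tree step might be sketched in R as follows (my own Prim-style implementation; it assumes a symmetric matrix W of pairwise mutual informations and maximizes instead of minimizes):

# Maximum weighted spanning tree over k variables; W is a symmetric
# k x k weight matrix. Returns a (k-1) x 2 matrix of tree edges.
max_spanning_tree <- function(W) {
  k <- nrow(W)
  in_tree <- c(TRUE, rep(FALSE, k - 1))     # grow the tree from vertex 1
  edges <- matrix(0, k - 1, 2)
  for (e in seq_len(k - 1)) {
    M  <- W[in_tree, !in_tree, drop = FALSE]
    ij <- which(M == max(M), arr.ind = TRUE)[1, ]
    u  <- which(in_tree)[ij[1]]
    v  <- which(!in_tree)[ij[2]]
    edges[e, ] <- c(u, v)                   # heaviest edge leaving the tree
    in_tree[v] <- TRUE
  }
  edges
}

On the example that follows, with W built from the pairwise mutual informations, this picks the X1–X4 edge (weight 0.55) first.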

SLIDE 26

Example Data Set

Obs  X1  X2  X3  X4
  1   1   1   1   1
  2   1   1   1   1
  3   1   1   2   1
  4   1   2   2   1
  5   1   2   2   2
  6   2   1   1   2
  7   2   1   2   3
  8   2   1   2   3
  9   2   2   2   3
 10   2   2   1   3

SLIDE 27

Mutual Information

$\hat{I}_{P_D}(X_1; X_4) = \sum_{x_1, x_4} \hat{P}_D(x_1, x_4) \log \dfrac{\hat{P}_D(x_1, x_4)}{\hat{P}_D(x_1)\,\hat{P}_D(x_4)}$

$= 0.4 \log \dfrac{0.4}{(0.5)(0.4)} + 0.1 \log \dfrac{0.1}{(0.5)(0.2)} + 0 + 0 + 0.1 \log \dfrac{0.1}{(0.5)(0.2)} + 0.4 \log \dfrac{0.4}{(0.5)(0.4)} = 0.55$

(the two empty cells contribute 0).
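Checking this with the mutual_info() sketch from Slide 24 (data typed in from the example table):

x1 <- c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2)
x4 <- c(1, 1, 1, 1, 2, 2, 3, 3, 3, 3)
mutual_info(x1, x4)   # 0.5545, i.e. the 0.55 above (natural log)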

SLIDE 28

Build Graph with Weights

[Figure: complete undirected graph on X1, X2, X3, X4 with mutual-information edge weights; the X1–X4 edge has weight 0.55 and the other displayed weights are 0.032.]

SLIDE 29

Maximum weighted spanning tree

[Figure: the maximum weighted spanning tree over X1, X2, X3, X4.]

SLIDE 30

Choose root node

[Figure: the spanning tree with a root chosen and all edges directed away from it.]

SLIDE 31

Algorithm to construct TAN

Compute $\hat{I}_{P_D}(A_i; A_j \mid C)$ between each pair of attributes, where

$I_P(A_i; A_j \mid C) = \sum_{a_i, a_j, c} P(a_i, a_j, c) \log \dfrac{P(a_i, a_j \mid c)}{P(a_i \mid c)\,P(a_j \mid c)}$

is the conditional mutual information between $A_i$ and $A_j$ given $C$.
Build a complete undirected graph with weights $\hat{I}_{P_D}(A_i; A_j \mid C)$.
Build a maximum weighted spanning tree.
Choose a root, and let all edges point away from it.
Construct a TAN by adding C and an arc from C to every attribute.

This algorithm is guaranteed to produce the TAN structure with optimal loglikelihood score.
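The new weight computation could be sketched as follows (my own helper, analogous to mutual_info above): the conditional mutual information is a P(c)-weighted average of the within-class mutual informations.

# Empirical conditional mutual information I(x; y | z).
cond_mutual_info <- function(x, y, z) {
  total <- 0
  for (cl in unique(z)) {
    idx <- z == cl
    pxy <- table(x[idx], y[idx]) / sum(idx)   # joint given z = cl
    px  <- rowSums(pxy)
    py  <- colSums(pxy)
    terms <- pxy * log(pxy / outer(px, py))
    total <- total + mean(idx) * sum(terms[pxy > 0])  # weight by P(z = cl)
  }
  total
}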

SLIDE 32

TAN for Pima Indians Data

[Figure: TAN structure for the Pima Indians data: class node C with arcs to Pregnant, Insulin, Age, DPF, Glucose and Mass, plus tree edges among the attributes.]

SLIDE 33

Interpretation of TAN’s

Just like NB models, TANs are equivalent to their undirected counterparts (why?).

[Figure: undirected version of the Pima TAN, with nodes C, P(regnant), A(ge), I(nsulin), D(PF), M(ass), G(lucose).]

Since there is an edge between Pregnant and Age, the influence of Age on the class label depends on (is different for different values of) Pregnant (and vice versa).

SLIDE 34

Smoothing by adding “prior counts”

Recall that the maximum likelihood estimate of $p(x_i \mid x_{pa(i)})$ is

$\hat{p}(x_i \mid x_{pa(i)}) = \dfrac{n(x_i, x_{pa(i)})}{n(x_{pa(i)})}$,

where $n(x_{pa(i)})$ is the number of observations (rows) with parent configuration $x_{pa(i)}$, and $n(x_i, x_{pa(i)})$ is the number of observations with parent configuration $x_{pa(i)}$ and value $x_i$ for variable $X_i$. But sometimes we have no observations ($n(x_{pa(i)}) = 0$), or very few, to estimate these (conditional) probabilities.

SLIDE 35

Smoothing by adding “prior counts”

Add "prior counts" to "smooth" the estimates:

$\hat{p}_s(x_i \mid x_{pa(i)}) = \dfrac{n(x_{pa(i)})\,\hat{p}(x_i \mid x_{pa(i)}) + m(x_{pa(i)})\,p_0(x_i \mid x_{pa(i)})}{n(x_{pa(i)}) + m(x_{pa(i)})}$

where $m(x_{pa(i)})$ is the prior precision, $\hat{p}_s(x_i \mid x_{pa(i)})$ is the smoothed estimate, and $p_0(x_i \mid x_{pa(i)})$ is our prior estimate of $p(x_i \mid x_{pa(i)})$. It is common to take $m(x_{pa(i)})$ to be the same for all parent configurations. The result is a weighted average of the ML estimate and the prior estimate.

SLIDE 36

ML estimate vs. smoothed estimate

For example,

$\hat{p}_{3|1,2}(1 \mid 1, 2) = \dfrac{n(x_1=1, x_2=2, x_3=1)}{n(x_1=1, x_2=2)} = \dfrac{0}{2} = 0$

Suppose we set $m = 2$ and $p_0(x_i \mid x_{pa(i)}) = \hat{p}(x_i)$. Then we get

$\hat{p}_s{}_{3|1,2}(1 \mid 1, 2) = \dfrac{2 \times 0 + 2 \times 0.4}{2 + 2} = 0.2$,

since $\hat{p}_3(1) = 0.4$.
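The same computation as a tiny R helper (a sketch; names are mine):

# Smoothed estimate: weighted average of the ML estimate and the prior
# estimate p0, written in terms of the counts n(x_i, x_pa) and n(x_pa).
smooth_est <- function(n_xi_pa, n_pa, m, p0) {
  (n_xi_pa + m * p0) / (n_pa + m)
}
smooth_est(n_xi_pa = 0, n_pa = 2, m = 2, p0 = 0.4)   # 0.2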

SLIDE 37

Bayesian Multinets

Build a structure on the attributes for each class separately, and use

$P_M(C=i, A_1, \ldots, A_k) = P(C=i)\,P_M(A_1, \ldots, A_k \mid C=i) = P(C=i)\,P_{M_i}(A_1, \ldots, A_k), \quad i = 1, \ldots, c$

Using trees for the class-conditional structures:
Split D into c partitions $D_1, D_2, \ldots, D_c$, where c is the number of distinct values of the class label C; $D_i$ contains all records in D with C = i.
Set $P(C=i) = \hat{P}_D(C=i)$ for $i = 1, \ldots, c$.
Apply Construct-Tree to $D_i$ to construct $M_i$.
Unlike in TANs, you may get a different tree structure for each class!
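One way to sketch this in R (my own code, reusing the mutual_info and max_spanning_tree helpers from earlier slides; d is assumed to be a data frame whose first column is the class):

# Learn one Chow-Liu tree per class partition D_i.
multinet_trees <- function(d) {
  lapply(split(d[, -1], d[, 1]), function(di) {
    k <- ncol(di)
    W <- matrix(0, k, k)
    for (i in 1:(k - 1)) for (j in (i + 1):k)
      W[i, j] <- W[j, i] <- mutual_info(di[[i]], di[[j]])
    max_spanning_tree(W)    # edge list of the tree for this class
  })
}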

SLIDE 38

Experimental Study of Friedman et al.

The following classifiers were compared:

NB: Naive Bayes Classifier
SNB: Naive Bayes with attribute selection
BN: Unrestricted Bayesian Network (Markov Blanket of Class)
C4.5: Classification Tree

And also:

                   structure per class
  structure        same      different
  tree             TAN       CL
  dag              ANB       MN

Superscript s indicates smoothing of parameter estimates.

SLIDE 39

The data sets are the same as on Slide 14.

SLIDE 40

[Figure: experimental results]

SLIDE 41

[Figure: experimental results]

SLIDE 42

[Figure: experimental results]

SLIDE 43

Example: data on death penalty

Data provided by the Georgia parole board.

  Variable   Description
  death      Did the defendant get the death penalty?
  blkdef     Is the defendant black?
  whtvict    Is the victim white?
  aggcirc    Number of aggravating circumstances
  stranger   Were victim and defendant strangers?

We have 100 observations.

SLIDE 44

Example in bnlearn package

> death.small[1:5,]
  death blkdef whtvict aggcirc stranger
  [first five rows of the data; values lost in extraction]
> summary(death.small)
 death  blkdef  whtvict  aggcirc  stranger
 0:51   0:47    0:26     1:22     0:49
 1:49   1:53    1:74     2:78     1:51
> death.nb <- naive.bayes("death", data = death.small)
> death.nb.pred <- predict(death.nb, death.small)
> table(death.small[,1], death.nb.pred)
   death.nb.pred
     0  1
  0 36 15
  1 12 37
> sum(diag(table(death.small[,1], death.nb.pred))/nrow(death.small))
[1] 0.73

SLIDE 45

Example in bnlearn package

# fit TAN to death penalty data with "death" as class variable
> death.tan <- tree.bayes(death.small, "death")
# fit TAN parameters using maximum likelihood estimation ("mle")
> death.tan.fit <- bn.fit(death.tan, death.small, "mle")
# predict class on training sample
> death.tan.pred <- predict(death.tan.fit, death.small)
# make confusion matrix
> table(death.small[,1], death.tan.pred)
   death.tan.pred
     0  1
  0 32 19
  1  8 41
# compute accuracy
> sum(diag(table(death.small[,1], death.tan.pred))/nrow(death.small))
[1] 0.73
# plot the TAN structure
> plot(death.tan)

SLIDE 46

TAN for death penalty data

[Figure: TAN structure for the death penalty data over death, blkdef, whtvict, aggcirc and stranger.]

SLIDE 47

Example in bnlearn package

# learn network structure with hill-climber
> death.hc <- hc(death.small)
> plot(death.hc)
> death.hc.fit <- bn.fit(death.hc, death.small, "mle")
> death.hc.pred <- predict(death.hc.fit, node = "death", data = death.small)
> table(death.small[,1], death.hc.pred)
   death.hc.pred
     0  1
  0 20 31
  1  6 43
> sum(diag(table(death.small[,1], death.hc.pred))/nrow(death.small))
[1] 0.63

SLIDE 48

Network structure with hill climbing for death penalty data

[Figure: network structure learned by hill climbing for the death penalty data.]

SLIDE 49

Network structure with hill climbing for death penalty data

[Figure: the hill-climbing network structure, shown again.]

Death penalty and race of defendant are independent given race of victim.
