Full Bayesian Network Classifiers by Jiang Su and Harry Zhang
Flemming Jensen
November 2008
Purpose
To introduce the full Bayesian network classifier (FBC).
Introduction
Bayesian networks are often used for the classification problem, where a learner attempts to construct a classifier from a given set of labeled training examples.
Since the number of possible network structures is extremely large, structure learning often has high computational complexity. The idea behind the full Bayesian network classifier is to reduce the computational complexity of structure learning by using a full Bayesian network as the structure, and to represent variable independence in the conditional probability tables instead of in the network structure. We use decision trees to represent the conditional probability tables to keep the representation of the joint distribution compact.
Variable Independence
Definition - Conditional independence. Let X, Y, Z be subsets of the variable set W. The subsets X and Y are conditionally independent given Z if: P(X | Y, Z) = P(X | Z).
Definition - Contextual independence. Let X, Y, Z, T be disjoint subsets of the variable set W. The subsets X and Y are contextually independent given Z and the context t (a value assignment of T) if: P(X | Y, Z, t) = P(X | Z, t).
Existence
Theorem - Existence. For any BN B, there exists an FBC FB, such that B and FB encode the same variable independencies.
Proof: Since B is an acyclic graph, the nodes of B can be sorted on the basis of a topological ordering. Go through each node X in the topological ordering and add arcs from X to all the nodes ranked after it. The resulting network FB is a full BN. Build a CPT-tree for each node X in FB, such that any variable that is not in the parent set ΠX of X in B does not occur in the CPT-tree of X in FB.
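To illustrate the constructive step of the proof, here is a minimal Python sketch (not from the paper; the function name is hypothetical) that turns a topological ordering into the arc set of the corresponding full BN:

    def full_bn_arcs(topological_order):
        """Arcs of the full BN obtained from a topological ordering."""
        arcs = []
        for i, x in enumerate(topological_order):
            # add an arc from x to every node ranked after it
            for y in topological_order[i + 1:]:
                arcs.append((x, y))
        return arcs

    # For the ordering C, X1, X2 this yields C->X1, C->X2, X1->X2.
    print(full_bn_arcs(["C", "X1", "X2"]))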
Example - FBC for Naive Bayes
Example of a naive Bayes network:
[Figure: the class C with arcs to the features X1, X2, X3, X4.]
Example - FBC for Naive Bayes
Example of an FBC for the naive Bayes network:
[Figure: the full BN over C, X1, X2, X3, X4; the CPT-tree of each feature Xi branches only on the class C, with leaf parameters pi1, pi2, pi3, pi4, so the arcs among the features carry no dependencies.]
Learning Full Bayesian Network Classifiers
Learning an FBC consists of two parts:
1. Construction of a full BN.
2. Learning of decision trees to represent the CPT of each variable.
The full BN is implemented using a Bayesian multinet.
Definition - Bayesian multinet. A Bayesian multinet is a set of Bayesian networks, each of which corresponds to a value c of the class variable C.
Structure Learning
Learning the structure of a full BN actually means learning an order of variables and then adding arcs from a variable to all the variables ranked after it. A variable is ranked based on its total influence on other variables. The influence (dependency) between two variables can be measured by mutual information.
Definition - Mutual information. Let X and Y be two variables in a Bayesian network. The mutual information is defined as:
M(X; Y) = Σ_{x∈X, y∈Y} P(x, y) · log( P(x, y) / (P(x)·P(y)) )
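A direct implementation of this definition might look as follows (a sketch, assuming the joint distribution is given as a dictionary mapping value pairs to probabilities; base-10 logarithms are used so the numbers match the worked example later in these slides):

    import math

    def mutual_information(joint):
        """M(X; Y) for a joint distribution given as {(x, y): P(x, y)}."""
        px, py = {}, {}
        for (x, y), p in joint.items():          # marginals P(x), P(y)
            px[x] = px.get(x, 0.0) + p
            py[y] = py.get(y, 0.0) + p
        return sum(p * math.log10(p / (px[x] * py[y]))
                   for (x, y), p in joint.items() if p > 0)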
Structure Learning
It is possible that the dependency between two variables, measured by mutual information, is caused merely by noise. Results by Friedman are used as a dependency threshold to filter out unreliable dependencies.
Definition - Dependency threshold. Let Xi and Xj be two variables in a Bayesian network, and let N be the number of training instances. The dependency threshold, denoted by φ, is defined as:
φ(Xi, Xj) = (log N / (2N)) × Tij, where Tij = |Xi| × |Xj|.
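The threshold itself is a one-liner; a sketch under the same assumptions (base-10 log, as in the worked example below):

    import math

    def dependency_threshold(card_i, card_j, N):
        """phi(Xi, Xj) = (log N / 2N) * Tij, with Tij = |Xi| * |Xj|."""
        return math.log10(N) / (2 * N) * (card_i * card_j)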
Structure Learning
The total influence of a variable on other variables can now be defined:
Definition - Total influence. Let Xi be a variable in a Bayesian network. The total influence of Xi on other variables, denoted by W(Xi), is defined as:
W(Xi) = Σ_{j≠i : M(Xi; Xj) > φ(Xi, Xj)} M(Xi; Xj)
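Combining the two previous sketches (the lookup tables M and phi are assumed to be precomputed, e.g. with the hypothetical helpers above):

    def total_influence(i, variables, M, phi):
        """W(Xi): sum of M(Xi; Xj) over j != i, counting only
        dependencies that exceed the threshold phi(Xi, Xj)."""
        return sum(M[i, j] for j in variables
                   if j != i and M[i, j] > phi[i, j])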
Structure Learning Algorithm
Algorithm FBC-Structure(S, X)
1. B = empty.
2. Partition the training data S into |C| subsets Sc by the class value c.
3. For each training data set Sc:
   - Compute the mutual information M(Xi; Xj) and the dependency threshold φ(Xi, Xj) between each pair of variables Xi and Xj.
   - Compute W(Xi) for each variable Xi.
   - For all variables Xi in X:
     - Add all the variables Xj with W(Xj) > W(Xi) to the parent set ΠXi of Xi.
     - Add arcs from all the variables Xj in ΠXi to Xi.
   - Add the resulting network Bc to B.
4. Return B.
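A runnable Python sketch of FBC-Structure, under the assumption that the training data S is a list of (class value, instance) pairs with each instance a dict {variable: value}; the cardinalities |Xi| are estimated from the observed values:

    import math
    from collections import Counter

    def fbc_structure(S, X):
        """Return, for each class value c, the parent set of every
        variable in the full BN built from the subset Sc."""
        B = {}
        by_class = {}
        for c, inst in S:                       # step 2: partition by class
            by_class.setdefault(c, []).append(inst)
        for c, Sc in by_class.items():          # step 3
            N = len(Sc)
            M, phi = {}, {}
            for i in X:                         # pairwise M and phi on Sc
                for j in X:
                    if i == j:
                        continue
                    joint = Counter((inst[i], inst[j]) for inst in Sc)
                    pxy = {xy: n / N for xy, n in joint.items()}
                    px, py = {}, {}
                    for (x, y), p in pxy.items():
                        px[x] = px.get(x, 0.0) + p
                        py[y] = py.get(y, 0.0) + p
                    M[i, j] = sum(p * math.log10(p / (px[x] * py[y]))
                                  for (x, y), p in pxy.items())
                    phi[i, j] = math.log10(N) / (2 * N) * len(px) * len(py)
            W = {i: sum(M[i, j] for j in X
                        if j != i and M[i, j] > phi[i, j]) for i in X}
            # parents of Xi: every Xj ranked above Xi by total influence
            B[c] = {i: [j for j in X if W[j] > W[i]] for i in X}
        return B                                # step 4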
Example - Structure Learning Algorithm
Example using 1000 labeled instances, where C is the class variable and A, B, and D are feature variables.

C   A   B   D    #        C   A   B   D    #
c1  a1  b1  d1   11       c2  a1  b1  d1   36
c1  a1  b1  d2    5       c2  a1  b1  d2   36
c1  a1  b2  d1    7       c2  a1  b2  d1  259
c1  a1  b2  d2   17       c2  a1  b2  d2   29
c1  a2  b1  d1  227       c2  a2  b1  d1   96
c1  a2  b1  d2   97       c2  a2  b1  d2   96
c1  a2  b2  d1   11       c2  a2  b2  d1   43
c1  a2  b2  d2   25       c2  a2  b2  d2    5
Example - Structure Learning Algorithm
From the 400 data instances where C = c1, the joint distribution P(A, B) is:

P(A, B)   b1                      b2
a1        (11+5)/400 = 0.04       (7+17)/400 = 0.06
a2        (227+97)/400 = 0.81     (11+25)/400 = 0.09
Example - Structure Learning Algorithm
The marginals are P(a1) = 0.10, P(a2) = 0.90, P(b1) = 0.85, P(b2) = 0.15, which gives the product distribution:

P(A)P(B)   b1      b2
a1         0.085   0.015
a2         0.765   0.135

M(A; B) = 0.04·log(0.04/0.085) + 0.81·log(0.81/0.765) + 0.06·log(0.06/0.015) + 0.09·log(0.09/0.135) = 0.027
(The example uses base-10 logarithms.)
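The arithmetic can be checked directly (base-10 logarithms reproduce the 0.027 on the slide):

    import math

    p_ab = {("a1", "b1"): 0.04, ("a1", "b2"): 0.06,
            ("a2", "b1"): 0.81, ("a2", "b2"): 0.09}
    p_a = {"a1": 0.10, "a2": 0.90}              # row sums of P(A, B)
    p_b = {"b1": 0.85, "b2": 0.15}              # column sums of P(A, B)

    m_ab = sum(p * math.log10(p / (p_a[x] * p_b[y]))
               for (x, y), p in p_ab.items())
    print(round(m_ab, 3))                       # 0.027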
Example - Structure Learning Algorithm
Mutual information:
M(A; B) = 0.027
M(A; D) = 0.004
M(B; D) = 0.018

Dependency threshold:
φ(Xi, Xj) = (log N / (2N)) × Tij
φ(A, B) = φ(A, D) = φ(B, D) = 4·log(400)/800 = 0.013

Total influence (M(A; D) = 0.004 falls below the threshold and is filtered out):
W(A) = M(A; B) = 0.027
W(B) = M(A; B) + M(B; D) = 0.045
W(D) = M(B; D) = 0.018
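A quick check of the threshold and total-influence numbers (all variables here are binary, so Tij = 4):

    import math

    N = 400                                     # instances with C = c1
    phi = math.log10(N) / (2 * N) * 4           # Tij = 2 * 2 = 4
    print(round(phi, 3))                        # 0.013

    M = {("A", "B"): 0.027, ("A", "D"): 0.004, ("B", "D"): 0.018}
    M.update({(j, i): v for (i, j), v in M.items()})    # symmetric

    for v in ["A", "B", "D"]:
        W = sum(m for (i, j), m in M.items() if i == v and m > phi)
        print(v, round(W, 3))                   # 0.027, 0.045, 0.018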
Example - Structure Learning Algorithm
We now construct a full Bayesian network with the variables ordered by their total influence values:
W(A) = 0.027, W(B) = 0.045, W(D) = 0.018, so W(B) > W(A) > W(D).
[Figure: the full BN with arcs B → A, B → D, and A → D.]
We now have the full Bayesian network Bc1, which is the part of the multinet that corresponds to C = c1. We should now repeat the process on the 600 instances with C = c2 to construct Bc2 and thereby complete the FBC structure learning.
CPT-tree Learning
We now need to learn a CPT-tree for each variable in the full BN. A traditional decision tree learning algorithm, such as C4.5, could be used to learn the CPT-trees. However, since its time complexity is typically O(n² · N), the resulting FBC learning algorithm would have a complexity of O(n³ · N). Instead, a fast decision tree learning algorithm is proposed. The algorithm uses mutual information to determine a fixed ordering of variables from root to leaves. The predefined variable ordering makes the algorithm faster than traditional decision tree learning algorithms.
CPT-tree Learning Algorithm
Algorithm Fast-CPT-Tree(ΠXi, S)
1. Create an empty tree T.
2. If (S is pure or empty) or (ΠXi is empty): Return T.
3. qualified = False.
4. While (qualified == False) and (ΠXi is not empty):
   - Choose the variable Xj in ΠXi with the highest M(Xj; Xi).
   - Remove Xj from ΠXi.
   - Compute the local mutual information MS(Xi; Xj) on S.
   - Compute the local dependency threshold φS(Xi, Xj) on S.
   - If MS(Xi; Xj) > φS(Xi, Xj): qualified = True.
5. If qualified == True:
   - Create a root Xj for T.
   - Partition S into disjoint subsets Sx, one for each value x of Xj.
   - For all values x of Xj:
     - Tx = Fast-CPT-Tree(ΠXi, Sx)
     - Add Tx as a child of Xj.
6. Return T.
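A recursive Python sketch of Fast-CPT-Tree (assumed representation: S is a list of instances as dicts, parents is the candidate parent list already sorted by M(Xj; Xi) in descending order, which is what fixes the root-to-leaf ordering; leaves are returned as None and would store the class counts of S in a full implementation):

    import math
    from collections import Counter

    def local_mi(S, xi, xj):
        """MS(Xi; Xj): mutual information estimated on the subset S."""
        N = len(S)
        joint = Counter((inst[xi], inst[xj]) for inst in S)
        px = Counter(inst[xi] for inst in S)
        py = Counter(inst[xj] for inst in S)
        return sum((n / N) * math.log10((n / N) / ((px[x] / N) * (py[y] / N)))
                   for (x, y), n in joint.items())

    def fast_cpt_tree(parents, S, xi):
        # step 2: stop on a pure or empty S, or when no candidates remain
        if not S or not parents or len({inst[xi] for inst in S}) == 1:
            return None
        parents = list(parents)
        # step 4: scan candidates in order of decreasing M(Xj; Xi)
        while parents:
            xj = parents.pop(0)
            N = len(S)
            t_ij = len({inst[xi] for inst in S}) * len({inst[xj] for inst in S})
            if local_mi(S, xi, xj) > math.log10(N) / (2 * N) * t_ij:
                # step 5: split on xj and recurse on each value subset
                children = {}
                for x in {inst[xj] for inst in S}:
                    sub = [inst for inst in S if inst[xj] == x]
                    children[x] = fast_cpt_tree(parents, sub, xi)
                return (xj, children)
        return None    # step 6 with qualified == False: an empty tree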