Full Bayesian Network Classifiers by Jiang Su and Harry Zhang

SLIDE 1

Full Bayesian Network Classifiers by Jiang Su and Harry Zhang

Flemming Jensen, November 2008

SLIDE 2

Purpose

To introduce the full Bayesian network classifier (FBC).

SLIDE 3-6

Introduction

Bayesian networks are often used for the classification problem, where a learner attempts to construct a classifier from a given set of labeled training examples.

Since the number of possible network structures is extremely large, structure learning often has high computational complexity.

The idea behind the full Bayesian network classifier is to reduce the computational complexity of structure learning by using a full Bayesian network as the structure, and to represent variable independence in the conditional probability tables instead of in the network structure.

We use decision trees to represent the conditional probability tables, so that the representation of the joint distribution stays compact.

SLIDE 7-8

Variable Independence

Definition - Conditional independence: Let X, Y, Z be subsets of the variable set W. The subsets X and Y are conditionally independent given Z if: P(X | Y, Z) = P(X | Z).

Definition - Contextual independence: Let X, Y, Z, T be disjoint subsets of the variable set W. The subsets X and Y are contextually independent given Z and the context t (an assignment of values to T) if: P(X | Y, Z, t) = P(X | Z, t).

SLIDE 9-14

Existence

Theorem - Existence: For any BN B, there exists an FBC FB such that B and FB encode the same variable independencies.

Proof: Since B is an acyclic graph, the nodes of B can be sorted on the basis of the topological ordering. Go through each node X in the topological ordering, and add arcs from X to all the nodes ranked after it. The resulting network FB is a full BN. Build a CPT-tree for each node X in FB, such that any variable that is not in the parent set ΠX of X in B does not occur in the CPT-tree of X in FB.
SLIDE 15

Example - FBC for Naive Bayes

Example of a naive Bayes:

[Figure: naive Bayes network with class node C and arcs from C to the feature nodes X1, X2, X3, X4.]

SLIDE 16

Example - FBC for Naive Bayes

Example of an FBC for the naive Bayes:

[Figure: the naive Bayes network above, where each feature node Xi carries a CPT-tree over C with leaf probabilities p_i1, p_i2, p_i3, p_i4.]

SLIDE 17-21

Learning Full Bayesian Network Classifiers

Learning an FBC consists of two parts:

  • Construction of a full BN.
  • Learning of decision trees to represent the CPT of each variable.

The full BN is implemented using a Bayesian multinet.

Definition - Bayesian multinet: A Bayesian multinet is a set of Bayesian networks, each of which corresponds to a value c of the class variable C.

SLIDE 22-25

Structure Learning

Learning the structure of a full BN actually means learning an order of the variables and then adding arcs from each variable to all the variables ranked after it. A variable is ranked based on its total influence on the other variables. The influence (dependency) between two variables can be measured by mutual information.

Definition - Mutual information: Let X and Y be two variables in a Bayesian network. The mutual information is defined as:

  M(X; Y) = Σ_{x∈X, y∈Y} P(x, y) · log( P(x, y) / (P(x) · P(y)) )
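
For concreteness, here is a minimal Python sketch of this definition (not part of the original slides); the joint distribution is passed in as a table, and the logarithm is taken base 10, which is what reproduces the numbers in the deck's later worked example:

    import math

    def mutual_information(joint):
        """M(X; Y) for a joint table: a dict mapping (x, y) -> P(x, y).
        Log base 10, matching the worked example later in the deck."""
        px, py = {}, {}
        for (x, y), p in joint.items():
            px[x] = px.get(x, 0.0) + p   # marginal P(x)
            py[y] = py.get(y, 0.0) + p   # marginal P(y)
        return sum(p * math.log10(p / (px[x] * py[y]))
                   for (x, y), p in joint.items() if p > 0)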

SLIDE 26-28

Structure Learning

It is possible that the dependency between two variables, measured by mutual information, is caused merely by noise. Results by Friedman are used as a dependency threshold to filter out unreliable dependencies.

Definition - Dependency threshold: Let Xi and Xj be two variables in a Bayesian network. The dependency threshold, denoted by φ, is defined as:

  φ(Xi, Xj) = (log N / (2N)) × Tij, where Tij = |Xi| × |Xj| and N is the number of data instances.
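
The threshold translates directly into a one-line helper (a sketch; the caller supplies N and the two cardinalities):

    import math

    def dependency_threshold(n_instances, card_i, card_j):
        """phi(Xi, Xj) = (log N / (2N)) * |Xi| * |Xj|, log base 10 as above."""
        return math.log10(n_instances) / (2 * n_instances) * card_i * card_j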

SLIDE 29-30

Structure Learning

The total influence of a variable on the other variables can now be defined:

Definition - Total influence: Let Xi be a variable in a Bayesian network. The total influence of Xi on the other variables, denoted by W(Xi), is defined as:

  W(Xi) = Σ_{j ≠ i, M(Xi; Xj) > φ(Xi, Xj)} M(Xi; Xj)
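
In code, assuming pairwise tables M and phi indexed by variable pairs (an illustrative layout, not from the slides):

    def total_influence(i, variables, M, phi):
        """W(Xi): sum of M(Xi; Xj) over all j != i above the threshold."""
        return sum(M[(i, j)] for j in variables
                   if j != i and M[(i, j)] > phi[(i, j)])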

SLIDE 31-41

Structure Learning Algorithm

Algorithm FBC-Structure(S, X)

1. B = empty.
2. Partition the training data S into |C| subsets Sc by the class value c.
3. For each training data set Sc:
     Compute the mutual information M(Xi; Xj) and the dependency threshold φ(Xi, Xj) between each pair of variables Xi and Xj.
     Compute W(Xi) for each variable Xi.
     For all variables Xi in X:
       • Add all the variables Xj with W(Xj) > W(Xi) to the parent set ΠXi of Xi.
       • Add arcs from all the variables Xj in ΠXi to Xi.
     Add the resulting network Bc to B.
4. Return B.
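
Putting the pieces together, a compact Python sketch of FBC-Structure (the data layout, a list of (class value, row dict) pairs, is an assumption for illustration; mutual_information, dependency_threshold, and total_influence are the helpers sketched above):

    from collections import defaultdict

    def fbc_structure(data, variables, n_values):
        """data: list of (c, row) pairs, row a dict variable -> value.
        n_values: dict of variable cardinalities. Returns, per class
        value, the parent set of each variable (the full-BN arcs)."""
        by_class = defaultdict(list)
        for c, row in data:
            by_class[c].append(row)
        multinet = {}
        for c, rows in by_class.items():
            N = len(rows)
            M, phi = {}, {}
            for i in variables:              # pairwise MI and thresholds on Sc
                for j in variables:
                    if i == j:
                        continue
                    joint = defaultdict(float)
                    for r in rows:
                        joint[(r[i], r[j])] += 1.0 / N
                    M[(i, j)] = mutual_information(joint)
                    phi[(i, j)] = dependency_threshold(N, n_values[i], n_values[j])
            # Total influence, keeping only dependencies above the threshold.
            W = {i: total_influence(i, variables, M, phi) for i in variables}
            # Parents of Xi: every variable ranked above Xi by W.
            multinet[c] = {i: [j for j in variables if W[j] > W[i]]
                           for i in variables}
        return multinet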

SLIDE 42

Example - Structure Learning Algorithm

Example using 1000 labeled instances, where C is the class variable and A, B, and D are feature variables. Counts (#) per configuration:

  C   A   B   D   #        C   A   B   D   #
  c1  a1  b1  d1   11      c2  a1  b1  d1   36
  c1  a1  b1  d2    5      c2  a1  b1  d2   36
  c1  a1  b2  d1    7      c2  a1  b2  d1  259
  c1  a1  b2  d2   17      c2  a1  b2  d2   29
  c1  a2  b1  d1  227      c2  a2  b1  d1   96
  c1  a2  b1  d2   97      c2  a2  b1  d2   96
  c1  a2  b2  d1   11      c2  a2  b2  d1   43
  c1  a2  b2  d2   25      c2  a2  b2  d2    5

SLIDE 43-49

Example - Structure Learning Algorithm

From the 400 data instances where C = c1, the joint distribution P(A, B) is estimated as:

            b1                     b2
  a1   (11+5)/400 = 0.04     (7+17)/400 = 0.06
  a2   (227+97)/400 = 0.81   (11+25)/400 = 0.09

SLIDE 50-63

Example - Structure Learning Algorithm

            b1      b2
  a1       0.04    0.06
  a2       0.81    0.09
  P(A, B)

            b1                              b2
  a1   (0.04+0.06)·(0.04+0.81) = 0.085   (0.04+0.06)·(0.06+0.09) = 0.015
  a2   (0.81+0.09)·(0.04+0.81) = 0.765   (0.81+0.09)·(0.06+0.09) = 0.135
  P(A)P(B)

  M(A; B) = 0.04 · log(0.04/0.085) + 0.81 · log(0.81/0.765)
          + 0.06 · log(0.06/0.015) + 0.09 · log(0.09/0.135) = 0.027
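
This value can be checked against the mutual_information sketch from earlier, feeding it the P(A, B) table above:

    joint_ab = {('a1', 'b1'): 0.04, ('a1', 'b2'): 0.06,
                ('a2', 'b1'): 0.81, ('a2', 'b2'): 0.09}
    print(round(mutual_information(joint_ab), 3))   # 0.027, as on the slide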

SLIDE 64-81

Example - Structure Learning Algorithm

Mutual information:

  M(A; B) = 0.027
  M(A; D) = 0.004
  M(B; D) = 0.018

Dependency threshold: φ(Xi, Xj) = (log N / (2N)) × Tij

  φ(A, B) = φ(A, D) = φ(B, D) = 4 · log 400 / 800 = 0.013

Total influence: W(Xi) = Σ_{j ≠ i, M(Xi; Xj) > φ(Xi, Xj)} M(Xi; Xj)

  W(A) = M(A; B) = 0.027
  W(B) = M(A; B) + M(B; D) = 0.045
  W(D) = M(B; D) = 0.018

(M(A; D) = 0.004 falls below the threshold 0.013, so it is filtered out of every sum.)
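
Both the threshold and the three W values can be reproduced with the earlier sketches (N = 400 instances with C = c1; A, B, and D are binary, so Tij = 2 · 2 = 4):

    print(round(dependency_threshold(400, 2, 2), 3))   # 0.013
    M = {('A', 'B'): 0.027, ('B', 'A'): 0.027, ('A', 'D'): 0.004,
         ('D', 'A'): 0.004, ('B', 'D'): 0.018, ('D', 'B'): 0.018}
    phi = {pair: 0.013 for pair in M}
    for v in 'ABD':
        print(v, round(total_influence(v, 'ABD', M, phi), 3))
    # A 0.027, B 0.045, D 0.018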

SLIDE 82-89

Example - Structure Learning Algorithm

We now construct a full Bayesian network with the variables ordered according to the total influence values:

  W(A) = 0.027, W(B) = 0.045, W(D) = 0.018, so W(B) > W(A) > W(D).

[Figure: full BN over the ordering B, A, D, i.e. with arcs B → A, B → D, and A → D.]

We now have the full Bayesian network Bc1, which is the part of the multinet that corresponds to C = c1. We should now repeat the process to construct Bc2 and thereby complete the FBC structure learning.

SLIDE 90-94

CPT-tree Learning

We now need to learn a CPT-tree for each variable in the full BN.

A traditional decision tree learning algorithm, such as C4.5, could be used to learn the CPT-trees. However, since its time complexity is typically O(n² · N), the resulting FBC learning algorithm would have a complexity of O(n³ · N).

Instead, a fast decision tree learning algorithm is proposed. The algorithm uses the mutual information to determine a fixed ordering of variables from root to leaves. The predefined variable ordering makes the algorithm faster than traditional decision tree learning algorithms.

SLIDE 95-111

CPT-tree Learning Algorithm

Algorithm Fast-CPT-Tree(ΠXi, S)

1. Create an empty tree T.
2. If (S is pure or empty) or (ΠXi is empty):
     Return T.
3. qualified = False.
4. While (qualified == False) and (ΠXi is not empty):
     Choose the variable Xj in ΠXi with the highest M(Xj; Xi).
     Remove Xj from ΠXi.
     Compute the local mutual information MS(Xi; Xj) on S.
     Compute the local dependency threshold φS(Xi, Xj) on S.
     If MS(Xi; Xj) > φS(Xi, Xj): qualified = True.
5. If qualified == True:
     Create a root Xj for T.
     Partition S into disjoint subsets Sx, one for each value x of Xj.
     For all values x of Xj:
       • Tx = Fast-CPT-Tree(ΠXi, Sx)
       • Add Tx as a child of Xj.
6. Return T.
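
A recursive Python sketch of Fast-CPT-Tree (illustrative, not from the slides): the tree is a nested dict, rows is a list of variable-to-value dicts, and global_mi holds the precomputed pairwise mutual information that fixes the root-to-leaf ordering; mutual_information and dependency_threshold are the earlier helpers.

    from collections import defaultdict

    def fast_cpt_tree(parents, rows, target, n_values, global_mi):
        """Returns {split_var: {value: subtree}}; {} denotes a leaf."""
        if not rows or not parents or len({r[target] for r in rows}) <= 1:
            return {}                        # S empty or pure, or no parents left
        # Fixed ordering: try candidates by decreasing global M(Xj; Xi).
        order = sorted(parents, key=lambda xj: global_mi[(target, xj)],
                       reverse=True)
        for k, xj in enumerate(order):
            remaining = order[k + 1:]        # Xj leaves the parent set either way
            N = len(rows)
            joint = defaultdict(float)
            for r in rows:
                joint[(r[target], r[xj])] += 1.0 / N
            m_s = mutual_information(joint)              # local MI on S
            phi_s = dependency_threshold(N, n_values[target], n_values[xj])
            if m_s > phi_s:                  # qualified: split on Xj
                return {xj: {x: fast_cpt_tree(remaining,
                                              [r for r in rows if r[xj] == x],
                                              target, n_values, global_mi)
                             for x in {r[xj] for r in rows}}}
        return {}                            # no parent qualified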

SLIDE 112-121

Example - CPT-tree Learning Algorithm

We construct the CPT-tree for the variable D first.

Fast-CPT-Tree(ΠD = {A, B}, S)
  M(D; B) = 0.018 > M(D; A) = 0.004, so Xj = B.
  MS(D; B) = M(D; B) = 0.018 and φS(D, B) = φ(D, B) = 0.013.
  MS(D; B) > φS(D, B), so qualified = True.
  Since qualified == True, create a root for Xj = B and partition S into the subsets Sb1 and Sb2.
  Recursively call Fast-CPT-Tree(ΠD = {A}, Sb1) and Fast-CPT-Tree(ΠD = {A}, Sb2), and add the resulting trees as children of Xj = B.

[Figure: tree with root B and branches b1 and b2.]

SLIDE 122-133

Example - CPT-tree Learning Algorithm

Fast-CPT-Tree(ΠD = {A}, Sb1)
  Only one parent variable remains, so Xj = A.
  MSb1(D; A) = 7 · 10⁻⁶ and φSb1(D, A) = 0.015.
  MSb1(D; A) ≯ φSb1(D, A), so qualified = False.
  Since qualified == False, return the empty tree.

Fast-CPT-Tree(ΠD = {A}, Sb2)
  Only one parent variable remains, so Xj = A.
  MSb2(D; A) = 4 · 10⁻⁵ and φSb2(D, A) = 0.059.
  MSb2(D; A) ≯ φSb2(D, A), so qualified = False.
  Since qualified == False, return the empty tree.

SLIDE 134-137

Example - CPT-tree Learning Algorithm

We now only need to add leaves for Xi = D as children of B and specify the probabilities, which are trivial to calculate (a quick numeric check follows below):

[Figure: CPT-tree with root B; the branch b1 ends in a leaf over D (d1, d2), and likewise the branch b2.]

  P(d1 | b1) = (11+227)/340 = 0.7    P(d2 | b1) = (5+97)/340 = 0.3
  P(d1 | b2) = (7+11)/60 = 0.3       P(d2 | b2) = (17+25)/60 = 0.7

We should repeat this process for each variable in each network.
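
The four leaf probabilities follow directly from the Slide 42 counts restricted to C = c1:

    counts = {('b1', 'd1'): 11 + 227, ('b1', 'd2'): 5 + 97,
              ('b2', 'd1'): 7 + 11,  ('b2', 'd2'): 17 + 25}
    for b in ('b1', 'b2'):
        total = counts[(b, 'd1')] + counts[(b, 'd2')]
        print(b, counts[(b, 'd1')] / total, counts[(b, 'd2')] / total)
    # b1 0.7 0.3
    # b2 0.3 0.7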

SLIDE 138-142

Complexity

Let n be the number of variables and N the number of data instances.

FBC-Structure has time complexity O(n² · N). Fast-CPT-Tree has time complexity O(n · N).

Fast-CPT-Tree is called once for each variable in each of the |C| multinet parts, and each part holds roughly N/|C| instances. Hence the time complexity: O(|C| · n² · N/|C|) = O(n² · N).

Thus, the FBC learning algorithm has the time complexity O(n² · N).

SLIDE 143-147

Experiments - Results

33 UCI data sets, available in Weka, are used for the experiments. The performance of an algorithm on each data set is observed via 10 runs of 10-fold cross-validation. A two-tailed t-test at the 95% confidence level is conducted to compare each pair of algorithms on each data set.

Results on accuracy - classification (data sets won/draw/lost by FBC against each competitor):

         AODE     HGC      TAN      NBT      C4.5     SMO
  FBC    8/22/3   4/27/2   6/27/0   6/27/0   11/19/3  6/24/2

Results on AUC - ranking (data sets won/draw/lost):

         AODE     HGC      TAN      NBT      C4.5L    SMO
  FBC    7/22/4   6/25/2   9/24/0   8/24/1   25/7/1   10/20/3

SLIDE 148

Experiments - Complexity

Complexity of the tested algorithms:

          Training       Classification
  FBC     O(n² · N)      O(n)
  AODE    O(n² · N)      O(n²)
  HGC     O(n⁴ · N)      O(n)
  TAN     O(n² · N)      O(n)
  NBT     O(n³ · N)      O(n)
  C4.5    O(n² · N)      O(n)
  SMO     O(n^2.3)       O(n)

SLIDE 149-151

Experiments - Conclusion

FBC demonstrates good performance in both classification and ranking.

FBC is among the most efficient algorithms in both training and classification time.

Overall, the performance of FBC is the best among the algorithms compared.