Full Bayesian Network Classifiers by Jiang Su and Harry Zhang
Flemming Jensen
November 2008
Purpose
To introduce the full Bayesian network classifier (FBC).
Introduction
Bayesian networks are often used for the classification problem, where a learner attempts to construct a classifier from a given set of labeled training examples.
Since the number of possible network structures is extremely large, structure learning often has high computational complexity. The idea behind the full Bayesian network classifier is to reduce the computational complexity of structure learning by using a full Bayesian network as the structure, and to represent variable independence in the conditional probability tables instead of in the network structure. We use decision trees to represent the conditional probability tables to keep the representation of the joint distribution compact.
Variable Independence
Definition - Conditional independence. Let X, Y, Z be subsets of the variable set W. The subsets X and Y are conditionally independent given Z if: P(X | Y, Z) = P(X | Z).
Definition - Contextual independence. Let X, Y, Z, T be disjoint subsets of the variable set W. The subsets X and Y are contextually independent given Z and the context t (a value assignment of T) if: P(X | Y, Z, t) = P(X | Z, t).
Existence
Theorem - Existence. For any BN B, there exists an FBC FB, such that B and FB encode the same variable independencies.
Proof: Since B is an acyclic graph, the nodes of B can be sorted on the basis of a topological ordering. Go through each node X in the topological ordering and add arcs from X to all the nodes ranked after it. The resulting network FB is a full BN. Build a CPT-tree for each node X in FB, such that any variable that is not in the parent set ΠX of X in B does not occur in the CPT-tree of X in FB.
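To illustrate the constructive step of the proof, here is a minimal Python sketch (not from the paper; the function name is hypothetical) that turns a topological ordering into the arc set of the corresponding full BN:

    def full_bn_arcs(topological_order):
        """Arcs of the full BN obtained from a topological ordering."""
        arcs = []
        for i, x in enumerate(topological_order):
            # add an arc from x to every node ranked after it
            for y in topological_order[i + 1:]:
                arcs.append((x, y))
        return arcs

    # For the ordering C, X1, X2 this yields C->X1, C->X2, X1->X2.
    print(full_bn_arcs(["C", "X1", "X2"]))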
Example - FBC for Naive Bayes
Example of a naive Bayes network:
[Figure: the class C with arcs to the features X1, X2, X3, X4.]
Example - FBC for Naive Bayes
Example of an FBC for the naive Bayes network:
[Figure: the full BN over C, X1, X2, X3, X4; the CPT-tree of each feature Xi branches only on the class C, with leaf parameters pi1, pi2, pi3, pi4, so the arcs among the features carry no dependencies.]
Learning Full Bayesian Network Classifiers
Learning an FBC consists of two parts:
1. Construction of a full BN.
2. Learning of decision trees to represent the CPT of each variable.
The full BN is implemented using a Bayesian multinet.
Definition - Bayesian multinet. A Bayesian multinet is a set of Bayesian networks, each of which corresponds to a value c of the class variable C.
Structure Learning
Learning the structure of a full BN actually means learning an order of variables and then adding arcs from a variable to all the variables ranked after it. A variable is ranked based on its total influence on other variables. The influence (dependency) between two variables can be measured by mutual information.
Definition - Mutual information. Let X and Y be two variables in a Bayesian network. The mutual information is defined as:
M(X; Y) = Σ_{x∈X, y∈Y} P(x, y) · log( P(x, y) / (P(x)·P(y)) )
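A direct implementation of this definition might look as follows (a sketch, assuming the joint distribution is given as a dictionary mapping value pairs to probabilities; base-10 logarithms are used so the numbers match the worked example later in these slides):

    import math

    def mutual_information(joint):
        """M(X; Y) for a joint distribution given as {(x, y): P(x, y)}."""
        px, py = {}, {}
        for (x, y), p in joint.items():          # marginals P(x), P(y)
            px[x] = px.get(x, 0.0) + p
            py[y] = py.get(y, 0.0) + p
        return sum(p * math.log10(p / (px[x] * py[y]))
                   for (x, y), p in joint.items() if p > 0)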
Structure Learning
It is possible that the dependency between two variables, measured by mutual information, is caused merely by noise. Results by Friedman are used as a dependency threshold to filter out unreliable dependencies.
Definition - Dependency threshold. Let Xi and Xj be two variables in a Bayesian network, and let N be the number of training instances. The dependency threshold, denoted by φ, is defined as:
φ(Xi, Xj) = (log N / (2N)) × Tij, where Tij = |Xi| × |Xj|.
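The threshold itself is a one-liner; a sketch under the same assumptions (base-10 log, as in the worked example below):

    import math

    def dependency_threshold(card_i, card_j, N):
        """phi(Xi, Xj) = (log N / 2N) * Tij, with Tij = |Xi| * |Xj|."""
        return math.log10(N) / (2 * N) * (card_i * card_j)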
Structure Learning
The total influence of a variable on other variables can now be defined:
Definition - Total influence. Let Xi be a variable in a Bayesian network. The total influence of Xi on other variables, denoted by W(Xi), is defined as:
W(Xi) = Σ_{j≠i : M(Xi; Xj) > φ(Xi, Xj)} M(Xi; Xj)
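Combining the two previous sketches (the lookup tables M and phi are assumed to be precomputed, e.g. with the hypothetical helpers above):

    def total_influence(i, variables, M, phi):
        """W(Xi): sum of M(Xi; Xj) over j != i, counting only
        dependencies that exceed the threshold phi(Xi, Xj)."""
        return sum(M[i, j] for j in variables
                   if j != i and M[i, j] > phi[i, j])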
Structure Learning Algorithm
Algorithm FBC-Structure(S, X)
1. B = empty.
2. Partition the training data S into |C| subsets Sc by the class value c.
3. For each training data set Sc:
   - Compute the mutual information M(Xi; Xj) and the dependency threshold φ(Xi, Xj) between each pair of variables Xi and Xj.
   - Compute W(Xi) for each variable Xi.
   - For all variables Xi in X:
     - Add all the variables Xj with W(Xj) > W(Xi) to the parent set ΠXi of Xi.
     - Add arcs from all the variables Xj in ΠXi to Xi.
   - Add the resulting network Bc to B.
4. Return B.
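A runnable Python sketch of FBC-Structure, under the assumption that the training data S is a list of (class value, instance) pairs with each instance a dict {variable: value}; the cardinalities |Xi| are estimated from the observed values:

    import math
    from collections import Counter

    def fbc_structure(S, X):
        """Return, for each class value c, the parent set of every
        variable in the full BN built from the subset Sc."""
        B = {}
        by_class = {}
        for c, inst in S:                       # step 2: partition by class
            by_class.setdefault(c, []).append(inst)
        for c, Sc in by_class.items():          # step 3
            N = len(Sc)
            M, phi = {}, {}
            for i in X:                         # pairwise M and phi on Sc
                for j in X:
                    if i == j:
                        continue
                    joint = Counter((inst[i], inst[j]) for inst in Sc)
                    pxy = {xy: n / N for xy, n in joint.items()}
                    px, py = {}, {}
                    for (x, y), p in pxy.items():
                        px[x] = px.get(x, 0.0) + p
                        py[y] = py.get(y, 0.0) + p
                    M[i, j] = sum(p * math.log10(p / (px[x] * py[y]))
                                  for (x, y), p in pxy.items())
                    phi[i, j] = math.log10(N) / (2 * N) * len(px) * len(py)
            W = {i: sum(M[i, j] for j in X
                        if j != i and M[i, j] > phi[i, j]) for i in X}
            # parents of Xi: every Xj ranked above Xi by total influence
            B[c] = {i: [j for j in X if W[j] > W[i]] for i in X}
        return B                                # step 4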
Example - Structure Learning Algorithm
Example using 1000 labeled instances, where C is the class variable and A, B, and D are feature variables.

C   A   B   D    #        C   A   B   D    #
c1  a1  b1  d1   11       c2  a1  b1  d1   36
c1  a1  b1  d2    5       c2  a1  b1  d2   36
c1  a1  b2  d1    7       c2  a1  b2  d1  259
c1  a1  b2  d2   17       c2  a1  b2  d2   29
c1  a2  b1  d1  227       c2  a2  b1  d1   96
c1  a2  b1  d2   97       c2  a2  b1  d2   96
c1  a2  b2  d1   11       c2  a2  b2  d1   43
c1  a2  b2  d2   25       c2  a2  b2  d2    5
Example - Structure Learning Algorithm
From the 400 data instances where C = c1, the joint distribution P(A, B) is:

P(A, B)   b1                      b2
a1        (11+5)/400 = 0.04       (7+17)/400 = 0.06
a2        (227+97)/400 = 0.81     (11+25)/400 = 0.09
Example - Structure Learning Algorithm
The marginals are P(a1) = 0.10, P(a2) = 0.90, P(b1) = 0.85, P(b2) = 0.15, which gives the product distribution:

P(A)P(B)   b1      b2
a1         0.085   0.015
a2         0.765   0.135

M(A; B) = 0.04·log(0.04/0.085) + 0.81·log(0.81/0.765) + 0.06·log(0.06/0.015) + 0.09·log(0.09/0.135) = 0.027
(The example uses base-10 logarithms.)
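The arithmetic can be checked directly (base-10 logarithms reproduce the 0.027 on the slide):

    import math

    p_ab = {("a1", "b1"): 0.04, ("a1", "b2"): 0.06,
            ("a2", "b1"): 0.81, ("a2", "b2"): 0.09}
    p_a = {"a1": 0.10, "a2": 0.90}              # row sums of P(A, B)
    p_b = {"b1": 0.85, "b2": 0.15}              # column sums of P(A, B)

    m_ab = sum(p * math.log10(p / (p_a[x] * p_b[y]))
               for (x, y), p in p_ab.items())
    print(round(m_ab, 3))                       # 0.027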
Example - Structure Learning Algorithm
Mutual information:
M(A; B) = 0.027
M(A; D) = 0.004
M(B; D) = 0.018

Dependency threshold:
φ(Xi, Xj) = (log N / (2N)) × Tij
φ(A, B) = φ(A, D) = φ(B, D) = 4·log(400)/800 = 0.013

Total influence (M(A; D) = 0.004 falls below the threshold and is filtered out):
W(A) = M(A; B) = 0.027
W(B) = M(A; B) + M(B; D) = 0.045
W(D) = M(B; D) = 0.018
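A quick check of the threshold and total-influence numbers (all variables here are binary, so Tij = 4):

    import math

    N = 400                                     # instances with C = c1
    phi = math.log10(N) / (2 * N) * 4           # Tij = 2 * 2 = 4
    print(round(phi, 3))                        # 0.013

    M = {("A", "B"): 0.027, ("A", "D"): 0.004, ("B", "D"): 0.018}
    M.update({(j, i): v for (i, j), v in M.items()})    # symmetric

    for v in ["A", "B", "D"]:
        W = sum(m for (i, j), m in M.items() if i == v and m > phi)
        print(v, round(W, 3))                   # 0.027, 0.045, 0.018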
Example - Structure Learning Algorithm
We now construct a full Bayesian network with the variables ordered by their total influence values:
W(A) = 0.027, W(B) = 0.045, W(D) = 0.018, so W(B) > W(A) > W(D).
[Figure: the full BN with arcs B → A, B → D, and A → D.]
We now have the full Bayesian network Bc1, which is the part of the multinet that corresponds to C = c1. We should now repeat the process on the 600 instances with C = c2 to construct Bc2 and thereby complete the FBC structure learning.
CPT-tree Learning
We now need to learn a CPT-tree for each variable in the full BN. A traditional decision tree learning algorithm, such as C4.5, could be used to learn the CPT-trees. However, since its time complexity is typically O(n² · N), the resulting FBC learning algorithm would have a complexity of O(n³ · N). Instead, a fast decision tree learning algorithm is proposed. The algorithm uses mutual information to determine a fixed ordering of variables from root to leaves. The predefined variable ordering makes the algorithm faster than traditional decision tree learning algorithms.
CPT-tree Learning Algorithm
Algorithm Fast-CPT-Tree(ΠXi, S)
1. Create an empty tree T.
2. If (S is pure or empty) or (ΠXi is empty): Return T.
3. qualified = False.
4. While (qualified == False) and (ΠXi is not empty):
   - Choose the variable Xj in ΠXi with the highest M(Xj; Xi).
   - Remove Xj from ΠXi.
   - Compute the local mutual information MS(Xi; Xj) on S.
   - Compute the local dependency threshold φS(Xi, Xj) on S.
   - If MS(Xi; Xj) > φS(Xi, Xj): qualified = True.
5. If qualified == True:
   - Create a root Xj for T.
   - Partition S into disjoint subsets Sx, one for each value x of Xj.
   - For all values x of Xj:
     - Tx = Fast-CPT-Tree(ΠXi, Sx)
     - Add Tx as a child of Xj.
6. Return T.
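A recursive Python sketch of Fast-CPT-Tree (assumed representation: S is a list of instances as dicts, parents is the candidate parent list already sorted by M(Xj; Xi) in descending order, which is what fixes the root-to-leaf ordering; leaves are returned as None and would store the class counts of S in a full implementation):

    import math
    from collections import Counter

    def local_mi(S, xi, xj):
        """MS(Xi; Xj): mutual information estimated on the subset S."""
        N = len(S)
        joint = Counter((inst[xi], inst[xj]) for inst in S)
        px = Counter(inst[xi] for inst in S)
        py = Counter(inst[xj] for inst in S)
        return sum((n / N) * math.log10((n / N) / ((px[x] / N) * (py[y] / N)))
                   for (x, y), n in joint.items())

    def fast_cpt_tree(parents, S, xi):
        # step 2: stop on a pure or empty S, or when no candidates remain
        if not S or not parents or len({inst[xi] for inst in S}) == 1:
            return None
        parents = list(parents)
        # step 4: scan candidates in order of decreasing M(Xj; Xi)
        while parents:
            xj = parents.pop(0)
            N = len(S)
            t_ij = len({inst[xi] for inst in S}) * len({inst[xj] for inst in S})
            if local_mi(S, xi, xj) > math.log10(N) / (2 * N) * t_ij:
                # step 5: split on xj and recurse on each value subset
                children = {}
                for x in {inst[xj] for inst in S}:
                    sub = [inst for inst in S if inst[xj] == x]
                    children[x] = fast_cpt_tree(parents, sub, xi)
                return (xj, children)
        return None    # step 6 with qualified == False: an empty tree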