SLIDE 1

Mixtures of Tree-Structured Probabilistic Graphical Models for Density Estimation in High Dimensional Spaces

  • F. Schnitzler

University of Liège

24 September 2012

SLIDE 2

Density estimation

Density estimation consists of learning a joint probability density P(X) from N realisations of the problem: $\{x_i\}_{i=1}^{N}$, with $x_i \sim P(X)$.

Example: estimating P(X = "result of a dice throw").

◮ A realisation belongs to {1, 2, 3, 4, 5, 6}.
◮ 10 realisations: D = (2, 3, 1, 3, 6, 1, 4, 2, 6, 2).
◮ A possible estimate, based on these realisations:

P(X = 1) = 2/10, P(X = 2) = 3/10, P(X = 3) = 3/10, P(X = 4) = 1/10, P(X = 5) = 0/10, P(X = 6) = 1/10.
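A minimal sketch of this counting estimate, in Python:

```python
from collections import Counter

# The 10 realisations from the example above.
D = [2, 3, 1, 3, 6, 1, 4, 2, 6, 2]

# Empirical estimate: P(X = x) = (occurrences of x) / N.
counts = Counter(D)
P_hat = {x: counts[x] / len(D) for x in range(1, 7)}
print(P_hat)  # {1: 0.2, 2: 0.3, 3: 0.3, 4: 0.1, 5: 0.0, 6: 0.1}
```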

In this thesis: density estimation for high-dimensional problems:

◮ high number of discrete variables p (thousands or more),
◮ low number of samples N (hundreds).

SLIDE 3

Density estimation for electrical networks

Dimensions: tens of thousands to millions (depending on the level of detail). Example: on the order of 10 000 transmission nodes at the extra-high voltage level, 100 000 wind turbines.

Prediction: power flow into/out of countries, based on local consumption? Production of solar/wind energy, based on the weather? Power in each line, based on production?

SLIDE 4

Density estimation in bioinformatics

Dimensions: thousands to tens of thousands of genes, hundreds of thousands of proteins.

Prediction: effect of a combination of diseases and treatments on gene expression levels? Most efficient medicine to tackle a particular disease?

SLIDE 5

Problem solving with probabilistic graphical models

[Diagram: learning set → learning algorithm → probabilistic graphical model → inference algorithm; question → inference algorithm → answer]

Definitions

Learning refers to the automatic construction of a model from a set of observations, the learning set. It may be done only once.

A probabilistic graphical model encodes a probability density, e.g. a joint probability density over a set of variables: P(X).

Probabilistic inference, on a given model and for a particular question, consists of computing an answer to the query. The more general the model, the more questions can be answered.

SLIDE 6

What is so difficult about it?

Algorithmic complexity

Algorithmic complexity refers to the asymptotic complexity of an algorithm as a function of the size of the input problem. Example: an O(p) algorithm has a run-time that increases linearly with p, as p → ∞. For problems with a high number of variables p, the algorithms must have a complexity that is a very low-order polynomial in p.

Small number of samples → high variance

Having few samples leads to variability in the models constructed. Illustration: dice throw example, 2 sequences of observations:

◮ D1 = (2, 3, 1, 3, 6, 1, 4, 2, 6, 2) : P1("5") = 0/10
◮ D2 = (1, 5, 2, 6, 1, 1, 6, 1, 6, 5) : P2("5") = 2/10

Both models cannot be right: variance is a source of error.

SLIDE 8

Mixtures of Tree-Structured Probabilistic Graphical Models for Density Estimation in High Dimensional Spaces

Goal: density estimation for high p (number of variables) and low N (number of samples).

Algorithmic complexity → simple models (≡ Markov trees).

High variance → simple models → mixture of simple models.

[Diagram: a mixture of three Markov trees over {A, B, C, D}, with weights µ1, µ2, µ3]

$\hat{P}_T(X) = \sum_{i=1}^{m} \mu_i P_{T_i}(X)$
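A sketch of how such a mixture is evaluated, in Python (the per-tree densities are assumed to be given as callables):

```python
def mixture_density(x, weights, tree_densities):
    """Evaluate P_T(x) = sum_i mu_i * P_Ti(x) for one joint configuration x.

    weights:        mixture weights mu_i (non-negative, summing to 1)
    tree_densities: callables, tree_densities[i](x) returning P_Ti(x)
    """
    assert abs(sum(weights) - 1.0) < 1e-9
    return sum(mu * p(x) for mu, p in zip(weights, tree_densities))
```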

SLIDE 9

Mixtures of Tree-Structured Probabilistic Graphical Models for Density Estimation in High Dimensional Spaces

1 Background
2 Contributions (x3)
3 Final words

Background: What is it you are doing again?
⋆ Probabilistic graphical models
⋆ Tree-structured probabilistic graphical models
⋆ Mixtures

SLIDE 11

A Bayesian network is a PGM encoding a joint probability density over a set of variables X.

A Bayesian network is composed of 2 elements. The directed acyclic graph structure G encodes (conditional independence) relationships among variables. The set of parameters θ quantifies the probability density.

[Diagram: example network with edges Burglary → Alarm, Earthquake → Alarm, Earthquake → RadioNews, Alarm → NeighborCall]

The value of each variable is either "yes" or "no".

The Bayesian network reduces the number of parameters stored:

P(B, E, A, R, N) = P(B) P(E|B) P(A|B, E) P(R|B, E, A) P(N|B, E, A, R)
= P(B) P(E) P(A|B, E) P(R|E) P(N|A)

Parameters: $2^5 - 1 = 31$ ↔ $1 + 1 + 4 + 2 + 2 = 10$.
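As an illustration of the reduced factorization, a sketch in Python (the CPT numbers are placeholders, not values from the thesis):

```python
# Placeholder conditional probability tables for the factorization
# P(B) P(E) P(A|B,E) P(R|E) P(N|A); each variable is True ("yes") or False.
P_B = 0.01                                    # P(B = yes)          1 parameter
P_E = 0.02                                    # P(E = yes)          1 parameter
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}  # P(A=yes|B,E)  4 parameters
P_R = {True: 0.9, False: 0.0001}              # P(R = yes | E)      2 parameters
P_N = {True: 0.7, False: 0.01}                # P(N = yes | A)      2 parameters

def joint(b, e, a, r, n):
    """P(B=b, E=e, A=a, R=r, N=n) from the 10 stored parameters."""
    def bern(p, value):              # P(value) for a Bernoulli parameter p
        return p if value else 1 - p
    return (bern(P_B, b) * bern(P_E, e) * bern(P_A[(b, e)], a)
            * bern(P_R[e], r) * bern(P_N[a], n))
```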

SLIDE 12

A Markov tree is a Bayesian network whose graphical structure is constrained to be a (connected) tree.

[Diagram: four directed graphs over X1, X2, X3, X4, classifying each as a Bayesian network or not, a Markov forest or not, and a Markov tree or not]

SLIDE 13

The class of models considered is a tradeoff between capacity of representation and computational complexity.

[Diagram: learning/inference pipeline, as on SLIDE 5]

             Bayesian networks                       Markov trees
Complexity   superexponential in p for both          tractable for both learning
             learning and inference                  (O(p² log p)) and inference (O(p))
Accuracy     any probability density                 only a restricted set of probability
             can be encoded                          densities can be encoded

Capacity of representation might be detrimental to accuracy!

SLIDE 15

Average error of a model learnt = bias + variance.

[Figure: two panels, "Overly simple class of models" and "Overly complex class of models"; each shows the target polynomial, the mean model learned, and the standard-deviation limits]

Bias is the difference between the mean model learned (over all possible sets of observations of a given size) and the best model. Variance is the average variability of a model learnt, with respect to the mean model learned.

When the complexity of the class of models increases:

◮ bias tends to decrease,
◮ variance tends to increase.

SLIDE 16

Constructing a mixture can reduce the variance.

Normal algorithm: learning set → learning algorithm. Error: bias + variance(D).

Perturb: randomize the learning algorithm. Error: bias + variance(D) + variance(algo).

Perturb & Combine: generate several models and combine them. Error: bias + variance(D) + variance(mixture algo), where, for m trees learnt with the perturbed algorithm,

$\hat{P}(X) = \frac{1}{m} \sum_{i=1}^{m} P_i(X), \qquad m \to \infty.$

SLIDE 17

The validation of the algorithms is empirical.

[Diagram: target distributions P generate learning sets and test sets; the algorithms tested produce mixtures from the learning sets, scored for efficiency and accuracy on the test sets; reference algorithms produce reference models and reference scores]

SLIDE 22

Mixtures of Tree-Structured Probabilistic Graphical Models for Density Estimation in High Dimensional Spaces

1 Background
2 Contributions
◮ Repeatedly learning a perturbed Markov tree
◮ Building a sequence of Markov trees
◮ Combining mixtures
3 Final words

SLIDE 23

Learning a Markov tree is a 2-step process.

[Diagram: learning set → learning algorithm; error = bias + variance(D)]

1. Learn the best structure T, given the observations.
◮ Define a score on the structures, and find the structure maximizing it.
2. Learn the parameters θ for the selected structure.
◮ This amounts to counting observations.

NB: parameter learning in this thesis uses the smoothed estimate

$P(X_i = x \mid Pa_{X_i} = a) = \dfrac{1 + N_D(a, x)}{|Val(X_i)| + \sum_{x' \in Val(X_i)} N_D(a, x')}$,

where $N_D(a, x)$ is the number of samples where $X_i = x$ and $Pa_{X_i} = a$.
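A sketch of this estimator in Python, assuming the samples for one variable are given as (parent-configuration, value) pairs:

```python
from collections import Counter

def smoothed_cpt(pairs, values):
    """P(X = x | Pa_X = a) with one pseudo-count per value of X.

    pairs:  observed (a, x) samples, a = parent configuration, x = value of X
    values: Val(X), the list of possible values of X
    """
    n = Counter(pairs)  # n[(a, x)] = N_D(a, x)
    parent_configs = {a for a, _ in pairs}
    return {
        (a, x): (1 + n[(a, x)])
                / (len(values) + sum(n[(a, x2)] for x2 in values))
        for a in parent_configs for x in values
    }
```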

SLIDE 24

The Chow-Liu algorithm learns the Markov tree structure that maximizes the likelihood of the learning set.

$T_{CL}(D) = \arg\max_{T} \sum_{(X,Y) \in E(T)} I_D(X; Y)$

[Diagram: learning set (p variables, N samples) → complete graph over {A, B, C, D} weighted by empirical mutual informations such as $I_D(A; B)$ and $I_D(C; D)$ → maximum weight spanning tree → tree oriented from an arbitrary root]

Best modeling of the learning set. Structure learning is a 3-step process:

1. Construction of a complete graph whose edge weights are empirical mutual informations (O(p²N)).
2. Computation of the maximum weight spanning tree (O(p² log p)).
3. Orientation from an arbitrary root (O(p)).

→ Complexity is O(p² log p).
→ Bottleneck: the number of candidate edges for each tree.
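A compact sketch of the three steps in Python/NumPy, for data given as an N × p array of small non-negative integers (a naive Prim loop stands in for the O(p² log p) spanning-tree step):

```python
import numpy as np

def mutual_information(x, y):
    """Empirical mutual information I_D(X; Y) of two discrete columns."""
    joint = np.zeros((x.max() + 1, y.max() + 1))
    np.add.at(joint, (x, y), 1.0)
    joint /= len(x)
    px, py = joint.sum(axis=1), joint.sum(axis=0)
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / np.outer(px, py)[nz])).sum())

def chow_liu(data):
    """Edges of the maximum weight spanning tree of the complete graph
    weighted by empirical mutual information, oriented away from root 0."""
    p = data.shape[1]
    w = np.zeros((p, p))
    for i in range(p):                       # step 1: O(p^2 N) edge weights
        for j in range(i + 1, p):
            w[i, j] = w[j, i] = mutual_information(data[:, i], data[:, j])
    in_tree, edges = {0}, []                 # step 2: spanning tree (Prim)
    while len(in_tree) < p:
        i, j = max(((a, b) for a in in_tree for b in range(p)
                    if b not in in_tree), key=lambda e: w[e])
        in_tree.add(j)
        edges.append((i, j))                 # step 3: (i, j) already points
    return edges                             #         away from the root
```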

SLIDE 33

Example of perturbation: bagging

[Diagram: m bootstrap replicates of the learning set, each passed to the Chow-Liu algorithm, yielding trees 1 ... m]

Average over m max-likelihood trees learnt from m bootstrap replicates. A bootstrap replicate D′ of a sample set D is the same size as D and is drawn with replacement from D.

→ Complexity is O(mp² log p).

In my case, the parameters θ are learned on D, not D′.

Probability Density Estimation by Perturbing and Combining Tree Structured Markov Networks, S. Ammar, P. Leray, B. Defourny and L. Wehenkel, ECSQARU 2009.
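Continuing the chow_liu sketch above, the bagged structures might be generated as follows (a sketch; as stated, the parameters θ would then be re-estimated on the original D):

```python
import numpy as np

def bagged_structures(data, m, seed=0):
    """m Markov tree structures, each learned by Chow-Liu on a
    bootstrap replicate D' of the data; the mixture weights are 1/m."""
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    structures = []
    for _ in range(m):
        replicate = data[rng.integers(0, n, size=n)]  # draw with replacement
        structures.append(chow_liu(replicate))        # structure from D'
    return structures  # fit each tree's parameters on D, not D'
```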
SLIDE 34

Learning the parameters on the original learning set leads to better accuracy.

The more trees in the mixture, the better the accuracy.

[Plot: $\hat{D}_{KL}(P \| \hat{P}_T)$ versus the number of Markov trees m (for the mixtures), comparing a Chow-Liu tree, a mixture of bagged Chow-Liu trees (structure only), and a mixture of bagged Chow-Liu trees (structure and parameters)]

Averaged results on 10 randomly generated Bayesian networks (1000 binary variables) × 10 learning sets, for N = 300 samples.

SLIDE 35

To reduce algorithmic complexity, only a subset of K candidate edges is considered.

[Diagram: learning set → random subset of candidate edges over {A, B, C, D} → maximum weight spanning tree]

A simple way to reduce complexity is therefore to randomly select a subset of K edges. Reduction in complexity (for each term):

◮ construction of an incomplete graph: O(KN),
◮ computation of the maximum weight spanning tree: O(K log K).

The higher K, the closer to the Chow-Liu algorithm.
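A sketch of this randomized variant, reusing the mutual_information helper from the Chow-Liu sketch (a Kruskal pass with union-find; it may return a forest if the K sampled edges do not connect all variables):

```python
import numpy as np

def random_edge_tree(data, K, seed=0):
    """Score only K randomly chosen candidate edges, then keep a maximum
    weight spanning tree/forest among them."""
    rng = np.random.default_rng(seed)
    p = data.shape[1]
    candidates = [(i, j) for i in range(p) for j in range(i + 1, p)]
    chosen = rng.choice(len(candidates), size=K, replace=False)
    scored = [(mutual_information(data[:, i], data[:, j]), i, j)  # O(KN)
              for i, j in (candidates[k] for k in chosen)]
    parent = list(range(p))                                       # union-find
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u
    forest = []
    for _, i, j in sorted(scored, reverse=True):                  # O(K log K)
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            forest.append((i, j))
    return forest
```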

SLIDE 36

Intuitively, the structure of the problem can be exploited to target more interesting edges.

In a Euclidean space, similar problems can be approximated by sub-quadratic algorithms. When 2 points B and C are both close to A, they are likely to be close to each other as well:

$d(B, C) \le d(A, B) + d(A, C)$.

Mutual information is not a Euclidean distance. However, the same reasoning can be applied: if the pairs (A, B) and (A, C) have high mutual information values, I(B; C) may be high as well:

$I(B; C) \ge I(A; B) + I(A; C) - H(A)$.
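One way to see why this lower bound holds (not necessarily the argument used in the thesis) is via the chain $I(A;C \mid B) \le H(A \mid B)$:

```latex
I(A;C) - I(B;C) = H(C \mid B) - H(C \mid A)
                \le H(C \mid B) - H(C \mid A, B)
                 = I(A;C \mid B) \le H(A \mid B),
```

and since $I(A;B) - H(A) = -H(A \mid B)$, rearranging gives $I(B;C) \ge I(A;B) + I(A;C) - H(A)$.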

SLIDE 37

I want to group related variables together and to use this structure to target interesting edges.

The algorithm:

◮ builds a clustering of the variables and relationships between clusters,
◮ exploits this structure to target interesting edges, i.e. edges with a strong associated mutual information,
◮ uses these edges to build a Markov tree.

[Diagram: learning set → interesting edges and associated mutual information values → MWST → Markov tree structure]

Classical approach to building clusters: compute a distance for each pair of objects → not suitable here (quadratic in p).

SLIDE 38

The clusters are built iteratively, one at a time:

A center (X5) is randomly chosen and compared to the 12 other variables.

SLIDE 39

The clusters are built iteratively, one at a time:

The first cluster is created: it is composed of 5 members and 1 neighbour. Variables are assigned to a cluster based on two thresholds and their empirical mutual information with the center of the cluster.

SLIDE 40

The clusters are built iteratively, one at a time:

The second cluster is built around X13. It is only compared to the 7 remaining variables.

SLIDE 41

The clusters are built iteratively, one at a time:

After 4 iterations, all variables belong to a cluster. The algorithm stops.
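A sketch of the cluster-construction loop, with hypothetical threshold names t_member and t_neighbour (the exact bookkeeping of neighbours is an assumption inferred from the example counts above); mutual_information is the helper from the Chow-Liu sketch:

```python
import numpy as np

def build_clusters(data, t_member, t_neighbour, seed=0):
    """Iteratively pick a random center among the unassigned variables and
    compare it only to the other unassigned variables: MI >= t_member makes
    a member, t_neighbour <= MI < t_member a neighbour of the cluster."""
    rng = np.random.default_rng(seed)
    unassigned = set(range(data.shape[1]))
    clusters = []
    while unassigned:
        center = int(rng.choice(sorted(unassigned)))
        members, neighbours = {center}, set()
        for v in unassigned - {center}:
            mi = mutual_information(data[:, center], data[:, v])
            if mi >= t_member:
                members.add(v)
            elif mi >= t_neighbour:
                neighbours.add(v)
        unassigned -= members | neighbours   # assumption: both leave the pool
        clusters.append((center, members, neighbours))
    return clusters
```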

SLIDE 42

The clusters are exploited to target interesting edges:

Computation of mutual information among variables belonging to the same cluster.

SLIDE 43

The clusters are exploited to target interesting edges:

Two clusters containing neighbor variables are neighbors.

SLIDE 44

The clusters are exploited to target interesting edges:

Computation of mutual information between variables belonging to neighboring clusters.

SLIDE 45

The mixtures developed can achieve better accuracy than a single Chow-Liu tree.

Targeting interesting edges is better than random subsampling.

[Plot: $\hat{D}_{KL}(P \| \hat{P}_T)$ versus the number of Markov trees m (for the mixtures), comparing a Chow-Liu tree, a mixture of cluster-based trees, a mixture of bagged Chow-Liu trees, and a mixture of random edge-subsampling trees]

Averaged results on 10 randomly generated Bayesian networks (1000 binary variables) × 10 learning sets, for N = 300 samples.

SLIDE 46

Targeting interesting rather than random edges also leads to an improvement in computational complexity.

Table: Relative CPU times for training the Chow-Liu tree and 10×10 mixtures of size m = 100, cumulated on 100 data sets of 1000 samples and 200 variables. (1 ≈ 16.5 hours)

Random edge subsampling   1.08
Cluster-based trees       1
Bagged Chow-Liu trees     2.8
(Single Chow-Liu tree     0.03)

This improvement might be a consequence of the determinism of the cluster-based learning procedure, once a root is chosen. However, the number of edges, and therefore the run-time, of this procedure is not directly controllable.

SLIDE 47

Mixtures of Tree-Structured Probabilistic Graphical Models for Density Estimation in High Dimensional Spaces

1 Background
2 Contributions
◮ Repeatedly learning a perturbed Markov tree
◮ Building a sequence of Markov trees
◮ Combining mixtures
3 Final words

Previous chapter: [diagram: each Markov tree is learned independently from the learning set].

This chapter: [diagram: the Markov trees are learned as a sequence, reusing information from the first tree].

SLIDE 48

Structural knowledge obtained from the first tree learning operation can guide the other tree learning procedures.

Starting each tree learning operation from scratch throws away valuable information about the problem. Exploit the first tree to select a good subset S of candidate edges. The search is then constrained to the tree (or forest) structures spanning S.

Complexity: O(mp² log p) → O(p² log p + m|S| log |S|).

[Diagram: the first tree T1 is learned from the original learning set and defines the skeleton S by edge selection; trees T2 ... Tm are then learned from bootstrap replicates, constrained to S]
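A sketch of the overall loop; mst_over_edges (a maximum weight spanning tree/forest restricted to a given edge set, e.g. the Kruskal pass above) is an assumed helper, and the MI values of the first run are reused for the selection of S:

```python
import numpy as np

def skeleton_mixture(data, m, mi_threshold, seed=0):
    """First tree from the full graph on D; the m-1 following trees are
    spanning trees over the skeleton S only, with the edge weights of S
    recomputed on bootstrap replicates of D."""
    rng = np.random.default_rng(seed)
    n, p = data.shape
    mi = {(i, j): mutual_information(data[:, i], data[:, j])
          for i in range(p) for j in range(i + 1, p)}   # O(p^2 N), done once
    trees = [mst_over_edges(p, mi)]                     # T1: full graph
    S = [e for e, v in mi.items() if v > mi_threshold]  # skeleton
    for _ in range(m - 1):
        d = data[rng.integers(0, n, size=n)]            # bootstrap replicate
        w = {(i, j): mutual_information(d[:, i], d[:, j]) for i, j in S}
        trees.append(mst_over_edges(p, w))              # O(|S| log |S|)
    return trees
```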

SLIDE 50

The skeleton S should contain edges with a strong associated mutual information.

Edges with weak weights are

◮ not likely to be part of a tree (even if the weights are perturbed),
◮ probably not meaningful (noise, or not a direct relation).

→ We can ignore them in the search.

SLIDE 51

Edges are included in S based on an independence test.

Compare $I_D(X; Y)$ (which, suitably scaled, is χ²-distributed under independence) to a threshold depending on a postulated p-value, say α = 0.05 or smaller.

S contains the pairs of variables whose mutual information (on the original data set) is above the threshold.

Mutual information values are a by-product of the computation of the first tree.
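A sketch of the corresponding threshold, using the classical relation that $2N \cdot I_D$ is asymptotically χ²-distributed with $(|Val(X)|-1)(|Val(Y)|-1)$ degrees of freedom under independence:

```python
from scipy.stats import chi2

def mi_threshold(N, alpha=0.05, df=1):
    """Smallest empirical mutual information whose edge enters S:
    reject independence when 2*N*I_D exceeds the chi2(df) quantile."""
    return chi2.ppf(1 - alpha, df) / (2 * N)

# e.g. with binary variables (df = 1) and N = 200 samples:
# keep edge (X, Y) in S whenever I_D(X; Y) > mi_threshold(200)
```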


SLIDE 52

Regularization is another way to reduce variance.

$F_{CL}(D) = \arg\max_{F} \left[ \sum_{(X,Y) \in E(F)} I_D(X; Y) - \lambda |E(F)| \right]$

New reference algorithm based on regularization: λ is optimized to maximize the evaluation score (i.e. on the test set) → best possible regularization (optimistic score).
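Since forests form a matroid, this penalized objective is maximized by a Kruskal pass that simply stops accepting edges once their weight drops to λ (a sketch, with mi a dict of precomputed edge weights):

```python
def regularized_forest(p, mi, lam):
    """arg max over forests of sum of I_D(edge) - lam * |E(F)|:
    greedily add maximum weight spanning tree edges while the
    next weight still exceeds lam."""
    parent = list(range(p))
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u
    forest = []
    for (i, j), w in sorted(mi.items(), key=lambda kv: -kv[1]):
        if w <= lam:
            break                # any further edge lowers the objective
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            forest.append((i, j))
    return forest
```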

[Plot: $\hat{D}_{KL}(P \| P_{F_\lambda})$ versus the number of edges in the forest F, for the best possible regularization, compared to a tree]

SLIDE 53

The skeleton method can achieve an accuracy close to that of a mixture of bagged Chow-Liu trees, but is faster.

The mixtures are better than the regularized Chow-Liu algorithm.

[Plot: $\hat{D}_{KL}(P \| \hat{P}_T)$ versus the number of trees (for the mixtures), comparing a Chow-Liu tree, a mixture of bagged Chow-Liu trees (≡ skeleton method [α = 1]), the skeleton method (α = 0.05), and a Chow-Liu forest regularized by an oracle]

Relative run-time for mixtures of 500 trees (one max-likelihood tree: 1):

◮ mixture of bagged Chow-Liu trees: 532,
◮ skeleton-based approximations: 21.

Averaged results on 5 randomly generated Bayesian networks (200 binary variables) × 6 learning sets of 200 samples.

SLIDE 54

A closer look at the influence of α

[Plot: $\hat{D}_{KL}(P \| \hat{P}_T)$ versus the number of trees, for α = 5E−4, 5E−3 and 5E−2]

The smaller α:

◮ the smaller the variance of the first tree → increase in accuracy (here),
◮ but the larger the bias.

The larger α:

◮ the slower the convergence, but the better the accuracy,
◮ larger diversity in the Markov trees generated → better reduction of the variance by the mixture,
◮ the bias of these Markov trees is also smaller.

Averaged results on 5 randomly generated Bayesian networks (200 binary variables) × 6 learning sets of 200 samples.

SLIDE 56

Results on realistic learning sets

On 8 more realistic models × 2 learning set sizes (200 and 500), the skeleton-based method (α = 0.05) achieves:

◮ a worse accuracy than bagging in 3/16 settings,
◮ an accuracy similar to bagging in 9/16 settings,
◮ a better accuracy than bagging in 4/16 settings.

[Plots: negative log-likelihood versus the number of trees in the mixture, on two of the data sets, comparing a mixture of bagged Chow-Liu trees, the skeleton method (α = 0.05), a Chow-Liu tree, and a Chow-Liu forest regularized by an oracle]

SLIDE 57

Mixtures of Tree-Structured Probabilistic Graphical Models for Density Estimation in High Dimensional Spaces

1 Background
2 Contributions
◮ Repeatedly learning a perturbed Markov tree
◮ Building a sequence of Markov trees
◮ Combining mixtures
3 Final words

SLIDE 58

Mixtures can also be built to reduce the bias with respect to a single Chow-Liu tree.

Expectation-Maximization mixture

Learning the mixture is viewed as a global optimization problem aiming at maximizing the data likelihood. There is a bias-variance trade-off associated with the number of terms. Soft partition of the learning set: each tree models a subset of observations.
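A sketch of the EM loop; weighted_chow_liu (a Chow-Liu variant accepting sample weights) and tree_density (evaluating $P_{T_k}(x)$) are assumed helpers, not code from the thesis:

```python
import numpy as np

def em_mixture(data, m, n_iter=20, seed=0):
    """E-step: responsibilities r[k, i] proportional to mu_k * P_Tk(x_i);
    M-step: mu_k = mean responsibility, and each tree T_k is refit by a
    weighted Chow-Liu run, so tree k models its soft subset D_k."""
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    mu = np.full(m, 1.0 / m)
    r = rng.dirichlet(np.ones(m), size=n).T        # random soft initialization
    trees = [weighted_chow_liu(data, r[k]) for k in range(m)]
    for _ in range(n_iter):
        lik = np.array([[mu[k] * tree_density(trees[k], x) for x in data]
                        for k in range(m)])        # E-step
        r = lik / lik.sum(axis=0, keepdims=True)
        mu = r.mean(axis=1)                        # M-step
        trees = [weighted_chow_liu(data, r[k]) for k in range(m)]
    return mu, trees, r
```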

[Plot: $\hat{D}_{KL}(P \| \hat{P}_T)$ versus the number of trees m (1 to 5), EM mixture versus Chow-Liu tree; example with 200 variables and 2000 samples]

SLIDE 59

I combine the two types of mixtures.

1. Build an EM mixture and the associated soft partition $\{D_k\}_{k=1}^{m_1}$.
2. Replace each tree $T_k$ by a variance-reducing mixture learnt on $D_k$ (sketched below).
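A sketch of step 2, assuming a bagged_structures_weighted variant of the earlier bagging sketch that resamples the data according to the responsibilities r[k] of the EM run:

```python
def two_level_mixture(data, mu, r, m2):
    """Replace each term k of the EM mixture by an ensemble of m2 bagged
    Chow-Liu trees learned on D_k (the data weighted by r[k]); the result is
    P(x) = sum_k mu_k * (1/m2) * sum_j P_Tkj(x)."""
    return [(mu_k, bagged_structures_weighted(data, r_k, m2))
            for mu_k, r_k in zip(mu, r)]
```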

[Diagram: the EM algorithm creates a soft partition of the data set. E.g., for $m_1 = 2$: two Chow-Liu trees T1 and T2 over {A, B, C, D}, with weights $\mu_1 = 3/4$ and $\mu_2 = 1/4$]

SLIDE 60

I combine the two types of mixtures.

[Diagram: two-level mixture $\hat{P}_T(X)$ with first-level weights $\mu_1, \ldots, \mu_{m_1}$ and second-level weights $\lambda_{k,1}, \ldots, \lambda_{k,m_2}$: each tree of the EM mixture is replaced by $m_2$ bagged Chow-Liu trees]

SLIDE 61

The combined method achieves higher accuracy than an EM mixture when learning a mixture of 3 Markov trees.

[Plot: $\hat{D}_{KL}(P \| \hat{P}_T)$ versus the number of samples N (10² to 10⁴, logarithmic scale), EM mixture versus a mixture of 3 ensembles of 10 bagged CL trees]

Averaged results on 1 uniformly weighted mixture of 3 randomly generated Markov trees (200 binary variables) × 1 learning set × many initializations of the EM algorithm.

SLIDE 62

The mixture of ensembles of CL trees can lead to better accuracy than both the EM mixture and an ensemble of bagged Chow-Liu trees.

[Plot: $\hat{D}_{KL}(P \| \hat{P}_T)$ versus the number of trees m₁ (1 to 5), comparing a Chow-Liu tree, an EM mixture, an ensemble of 10 bagged Chow-Liu trees, and a mixture of ensembles of 10 bagged CL trees]

Averaged results on 5 randomly generated Bayesian networks (200 binary variables) × 6 learning sets of 2000 samples.

SLIDE 63

The mixture of ensembles of CL trees is better than the EM mixture, no matter what the optimal m1 is.

[Plots: $\hat{D}_{KL}(P \| \hat{P}_T)$ versus the number of trees m₁ (1 to 5), for 120 samples and for 8000 samples, comparing the EM mixture, the mixture of ensembles of 10 bagged CL trees, a Chow-Liu tree, and an ensemble of 10 bagged CL trees]

The 2-level mixture does not compensate for the increase in variance when m₁ increases. The gap between the EM and 2-level mixtures increases with m₁.

SLIDE 64

Mixtures of Tree-Structured Probabilistic Graphical Models for Density Estimation in High Dimensional Spaces

1 Background
2 Contributions
◮ Repeatedly learning a perturbed Markov tree
◮ Building a sequence of Markov trees
◮ Combining mixtures
3 Final words

SLIDE 65

Main contributions

◮ Randomized Chow-Liu algorithm and corresponding mixtures
◮ Skeleton-based approximation
◮ Combining bias- and variance-reducing mixtures
◮ Extensive evaluation on synthetic and realistic probability densities
  ⋆ in particular, comparison to regularization

SLIDE 66

Short term perspectives

Compare P&C mixtures to other methods:

◮ true Bayesian approaches,
◮ other two-level mixtures.

Understand why and when mixtures are good, and when the skeleton method is as good as bagging:

◮ application related: automatically set a value for the parameters (e.g. α),
◮ tool: better measure the bias and the 2 types of variance of the randomized methods.

SLIDE 67

Long term perspectives

New methods:

◮ regularization on each term of the mixture,
◮ new models:
  ⋆ bounded tree-width models,
  ⋆ conditional random fields.

New experimental conditions:

◮ other classes of target densities,
◮ other types of variables.

Applications: electrical networks, bioinformatics.

SLIDE 68

How can perturb and combine be applied to tree-structured CRFs?

[Diagram: tree-structured CRF: tree structure over C1, C2, C3, C4 with potentials φ1, φ2, φ3, and a feature mapping to X1, X2, X3, X4]

Conditional random fields encode a conditional probability density:

$P(C|X) = \frac{1}{Z(X)}\, \phi_1(C_1, C_2 \mid X_1, X_3, X_4)\, \phi_2(C_2, C_3 \mid X_2, X_4)\, \phi_3(C_2, C_4 \mid X_3, X_4)$

No modelling of P(X)! New targets for randomization with respect to mixtures of Markov trees:

◮ the feature mapping: which X are associated with each edge (Ci, Cj), i ≠ j?
◮ parameter learning: less trivial (done by gradient descent).

SLIDE 72

More realistic data sets (by C. Aliferis, A. Statnikov, I. Tsamardinos et al.).

Name          p     |Xi|   |E(G)|   |θ|
Alarm10       370   2-4    570      5468
Child10       200   2-6    257      2323
Gene          801   3-5    977      8348
Hailfinder10  560   2-11   1017     97448
Insurance10   270   2-5    556      14613
Link          724   2-4    1125     14211
Lung Cancer   800   2-3    1476     8452
Munin         189   1-21   282      15622
Pigs          441   3-3    592      3675

Table: Distributions from the literature and their characteristics. p corresponds to the number of variables, |Xi| to the range of cardinalities of single variables, |E(G)| to the number of edges and |θ| to the number of independent parameters in the original model.

SLIDE 73

More terms might be necessary in the 2nd level ensemble when estimating more complex probability densities.

Data set        N = 200            N = 500            N = 2500
              MT-EM  +bag,bagCL  MT-EM  +bag,bagCL  MT-EM  +bag,bagCL
Child10          -       25         -       25         -       10
Pigs            21        4         -       25         -       10
Alarm10          3       22         -       25         -       10
Gene            25        -         -       25         -       10
Lung Cancer     25        -         8       17         -       10
Link             3       22         -       25         -       10
Insurance10      2       23         1       24         -       10
Munin            1       24         6       19         -       10
Hailfinder10    25        -         -       25         -       10
ALL            105      120        40      185         -       90

Table: Best methods on realistic data sets (by increasing complexity) for 5 learning sets × several initializations of the EM mixture, with m1 = 2 and m2 = 10. N is the number of learning samples.