

SLIDE 1

Who Learns Better Bayesian Network Structures

Constraint-Based, Score-based or Hybrid Algorithms? Marco Scutari1 Catharina Elisabeth Graafland2 José Manuel Gutiérrez2

1Department of Statistics

University of Oxford, UK scutari@stats.ox.ac.uk

2Institute of Physics of Cantabria (CSIC-UC)

Santander, Spain

September 11, 2018

SLIDE 2

Outline

Bayesian network structure learning is defined by the combination of a statistical criterion and an algorithm that determines how the criterion is applied to the data. After removing the confounding effect of different choices for the statistical criterion, we ask the following questions:

Q1 Which of constraint-based and score-based algorithms provides the most accurate structural reconstruction?

Q2 Are hybrid algorithms more accurate than constraint-based or score-based algorithms?

Q3 Are score-based algorithms slower than constraint-based and hybrid algorithms?

SLIDE 3

Classes of Structure Learning Algorithms

Structure learning consists in finding the DAG G that encodes the dependence structure of a data set D with n observations. Algorithms for this task fall into one of three classes:

  • Constraint-based algorithms identify conditional independence constraints with statistical tests, and link nodes that are not found to be independent.

  • Score-based algorithms are applications of general optimisation techniques; each candidate DAG is assigned a network score to maximise as the objective function.

  • Hybrid algorithms have a restrict phase implementing a constraint-based strategy to reduce the space of candidate DAGs, and a maximise phase implementing a score-based strategy to find the optimal DAG in the restricted space.
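To make the score-based class concrete, here is a deliberately minimal greedy search sketch in Python: it only adds arcs (real implementations such as the tabu search used later also delete and reverse them, and cache local scores), and every name in it is illustrative rather than taken from the slides.

```python
import itertools

def is_acyclic(arcs, nodes):
    """Check that a set of (u, v) arcs has no directed cycle (Kahn's algorithm)."""
    indeg = {v: 0 for v in nodes}
    for _, v in arcs:
        indeg[v] += 1
    queue = [v for v in nodes if indeg[v] == 0]
    seen = 0
    while queue:
        u = queue.pop()
        seen += 1
        for a, b in arcs:
            if a == u:
                indeg[b] -= 1
                if indeg[b] == 0:
                    queue.append(b)
    return seen == len(nodes)

def hill_climb(nodes, local_score):
    """Greedy score-based structure search (a sketch of the general idea):
    repeatedly apply the single-arc addition that most improves the
    decomposable network score, stopping at a local optimum.
    local_score(node, parent_set) -> float."""
    arcs = set()
    def parents(v, a):
        return frozenset(u for u, w in a if w == v)
    def total(a):
        return sum(local_score(v, parents(v, a)) for v in nodes)
    current = total(arcs)
    while True:
        best, best_arcs = current, None
        for u, v in itertools.permutations(nodes, 2):
            if (u, v) in arcs:
                continue
            cand = arcs | {(u, v)}
            if not is_acyclic(cand, nodes):
                continue
            s = total(cand)
            if s > best:
                best, best_arcs = s, cand
        if best_arcs is None:
            return arcs
        arcs, current = best_arcs, best

# hypothetical toy score that rewards one specific parent set per node
target = {"A": frozenset(), "B": frozenset({"A"}), "C": frozenset({"B"})}
def toy_score(v, ps):
    return 1.0 if ps == target[v] else -0.1 * len(ps)

result = hill_climb(["A", "B", "C"], toy_score)  # recovers A -> B -> C
```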

SLIDE 4

Conditional Independence Tests and Network Scores

For discrete BNs, the most common test is the log-likelihood ratio test

$$ G^2(X, Y \mid Z) = 2 \log \frac{P(X \mid Y, Z)}{P(X \mid Z)} = 2 \sum_{i=1}^{R} \sum_{j=1}^{C} \sum_{k=1}^{L} n_{ijk} \log \frac{n_{ijk}\, n_{++k}}{n_{i+k}\, n_{+jk}}, $$

which has an asymptotic $\chi^2_{(R-1)(C-1)L}$ distribution. For GBNs,

$$ G^2(X, Y \mid Z) = -n \log(1 - \rho^2_{XY \mid Z}) \sim \chi^2_1. $$

As for network scores, the Bayesian Information Criterion

$$ \mathrm{BIC}(\mathcal{G}; \mathcal{D}) = \sum_{i=1}^{N} \left[ \log P(X_i \mid \Pi_{X_i}) - \frac{|\Theta_{X_i}|}{2} \log n \right] $$

is a common choice for both discrete BNs and GBNs, as it provides a simple approximation to log P(G | D). log P(G | D) itself is available in closed form as BDeu and BGeu [5, 4].
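As a concrete illustration, the discrete G² statistic above can be computed directly from a three-way contingency table. This is a minimal sketch (the function name and the NumPy/SciPy usage are mine, not from the slides):

```python
import numpy as np
from scipy.stats import chi2

def g2_test(n_ijk):
    """G^2 test of X independent of Y given Z from an R x C x L
    contingency table n_ijk. Returns the statistic and its asymptotic
    chi-square p-value with (R-1)(C-1)L degrees of freedom."""
    n_ijk = np.asarray(n_ijk, dtype=float)
    R, C, L = n_ijk.shape
    n_ik = n_ijk.sum(axis=1, keepdims=True)        # n_{i+k}
    n_jk = n_ijk.sum(axis=0, keepdims=True)        # n_{+jk}
    n_k = n_ijk.sum(axis=(0, 1), keepdims=True)    # n_{++k}
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = n_ijk * np.log(n_ijk * n_k / (n_ik * n_jk))
    stat = 2 * np.nansum(terms)                    # 0 * log(0) cells drop out
    df = (R - 1) * (C - 1) * L
    return stat, chi2.sf(stat, df)
```

On a table that factorises exactly as P(X | Z) P(Y | Z) P(Z), the statistic is zero, as expected.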

SLIDE 5

Score- and Constraint-Based Algorithms Can Be Equivalent

Cowell [3] famously showed that constraint-based and score-based algorithms can select identical discrete BNs.

1. He noticed that the G² test has the same expression as a score-based network comparison based on the log-likelihoods log P(X | Y, Z) − log P(X | Z) if we take Z = ΠX.

2. He then showed that these two classes of algorithms are equivalent if we assume a fixed, known topological ordering and we use log-likelihood and G² as matching statistical criteria.

We take the same view: the algorithms and the statistical criteria they use are separate and complementary in determining the overall behaviour of structure learning. We then want to remove the confounding effect of choices for the statistical criterion from our evaluation of the algorithms.
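Cowell's observation can be checked numerically: the count form of G² equals twice the difference between the maximised log-likelihoods of the two models, P(X | Y, Z) versus P(X | Z). A small NumPy sketch (variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# random 3 x 2 x 2 contingency table of counts n_ijk for (X, Y, Z)
n = rng.integers(1, 20, size=(3, 2, 2)).astype(float)

n_ik = n.sum(axis=1)        # n_{i+k}
n_jk = n.sum(axis=0)        # n_{+jk}
n_k = n.sum(axis=(0, 1))    # n_{++k}

# G^2 statistic in its direct count form
g2 = 2 * np.sum(n * np.log(n * n_k / (n_ik[:, None, :] * n_jk[None, :, :])))

# twice the difference of maximised log-likelihoods:
# model P(X | Y, Z) (MLE n_ijk / n_{+jk}) vs model P(X | Z) (MLE n_{i+k} / n_{++k})
ll_full = np.sum(n * np.log(n / n_jk[None, :, :]))
ll_reduced = np.sum(n_ik * np.log(n_ik / n_k))

assert np.isclose(g2, 2 * (ll_full - ll_reduced))
```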

SLIDE 6

Constructing Matching Tests and Scores

Consider two DAGs G+ and G− that differ by a single arc Xj → Xi. In a score-based approach, we can compare them using BIC:

$$ \mathrm{BIC}(\mathcal{G}^+; \mathcal{D}) > \mathrm{BIC}(\mathcal{G}^-; \mathcal{D}) \Rightarrow 2 \log \frac{P(X_i \mid \Pi_{X_i} \cup \{X_j\})}{P(X_i \mid \Pi_{X_i})} > \left( |\Theta_{X_i}^{\mathcal{G}^+}| - |\Theta_{X_i}^{\mathcal{G}^-}| \right) \log n, $$

which is equivalent to testing the conditional independence of Xi and Xj given ΠXi using the G² test, just with a different significance threshold. We will call this test G²_BIC and use it as the matching statistical criterion for BIC to compare different learning algorithms. For discrete BNs, starting from log P(G | D) we get

$$ \log P(\mathcal{G}^+ \mid \mathcal{D}) > \log P(\mathcal{G}^- \mid \mathcal{D}) \Rightarrow \log \mathrm{BF} = \log \frac{P(\mathcal{G}^+ \mid \mathcal{D})}{P(\mathcal{G}^- \mid \mathcal{D})} > 0, $$

which uses Bayes factors as matching tests for BDeu.
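The BIC-versus-test equivalence above amounts to comparing the usual G² statistic against a threshold that grows with log n rather than a fixed χ² quantile. A small sketch (the function name and the numbers in the example are illustrative, not from the slides):

```python
import math
from scipy.stats import chi2

def g2_bic_threshold(extra_params, n):
    """Critical value of the G^2_BIC test: the arc Xj -> Xi improves BIC
    iff the G^2 statistic exceeds (|Theta_G+| - |Theta_G-|) * log(n),
    where extra_params is the number of free parameters the arc adds."""
    return extra_params * math.log(n)

# hypothetical example: the arc adds 1 parameter, n = 1000 observations
bic_threshold = g2_bic_threshold(1, 1000)   # log(1000), about 6.91
alpha_threshold = chi2.ppf(0.95, df=1)      # about 3.84, conventional alpha = 0.05
```

Note that for this n the BIC-matched test is stricter than a conventional 0.05-level χ² test, and the gap widens as n grows.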

SLIDE 7

A Simulation Study

We assess three constraint-based algorithms (PC [2], GS [6], Inter-IAMB [13]), two score-based algorithms (tabu search, simulated annealing [7] for BIC, GES [1] for log BDeu) and two hybrid algorithms (MMHC [10], RSMAX2 [9]) on 14 reference networks [8]. For each BN:

1. We generate 20 samples of size n/|Θ| = 0.1, 0.2, 0.5 (small samples) and 1.0, 2.0, 5.0 (large samples).

2. We learn G using (BIC, G²_BIC), and (log BDeu, log BF) as well for discrete BNs.

3. We measure the accuracy of the learned DAGs using the SHD/|A| [10] from the reference BN; and we measure the speed of the learning algorithms with the number of calls to the statistical criterion.
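The SHD counts the arc additions, deletions and direction changes needed to turn the learned graph into the reference one. A simplified sketch of the idea (computed on the DAGs directly, whereas [10] defines SHD on the corresponding CPDAGs; the function name and matrices are mine):

```python
import numpy as np

def shd(ref, learned):
    """Simplified Structural Hamming Distance between two DAG adjacency
    matrices (entry [i, j] = 1 iff arc i -> j). Counts each missing edge,
    extra edge, and differently-oriented edge as one error."""
    ref, learned = np.asarray(ref), np.asarray(learned)
    # skeletons: undirected presence of an edge
    ref_sk = ref + ref.T
    lrn_sk = learned + learned.T
    missing_or_extra = int(np.sum(np.triu(ref_sk != lrn_sk)))
    # edges present in both skeletons but oriented differently
    both = (ref_sk > 0) & (lrn_sk > 0)
    flipped = int(np.sum(np.triu(both & (ref != learned))))
    return missing_or_extra + flipped

ref = np.array([[0, 1, 0],
                [0, 0, 1],
                [0, 0, 0]])
learned = np.array([[0, 1, 0],
                    [0, 0, 0],
                    [0, 1, 0]])   # arc 1 -> 2 reversed to 2 -> 1
# shd(ref, learned) == 1
```

Dividing this count by |A|, the number of arcs in the reference BN, gives the scaled SHD used in the plots.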

SLIDE 8

Discrete Bayesian Networks (Large Samples)

[Figure: scaled SHD vs. log10(calls to the statistical criterion), one panel per network: ALARM, ANDES, CHILD, HAILFINDER, HEPAR2, MUNIN1, PATHFINDER, PIGS, WATER, WIN95PTS.]

SLIDE 9

Discrete Bayesian Networks (Small Samples)

[Figure: scaled SHD vs. log10(calls to the statistical criterion), one panel per network: ALARM, ANDES, CHILD, HAILFINDER, HEPAR2, MUNIN1, PATHFINDER, PIGS, WATER, WIN95PTS.]

SLIDE 10

Gaussian Bayesian Networks

[Figure: scaled SHD vs. log10(calls to the statistical criterion) for ARTH150, ECOLI70, MAGIC-IRRI and MAGIC-NIAB, each in small and large samples.]

SLIDE 11

Overall Conclusions

Discrete networks:

  • score-based algorithms often have higher SHDs for small samples;
  • hybrid and constraint-based algorithms have comparable SHDs;
  • constraint-based algorithms have better SHD than score-based algorithms for small sample sizes in 7/10 BNs, but their SHD decreases more slowly as n increases for all BNs;
  • simulated annealing is consistently slower; tabu search is always fast, and accurate in large samples and in 6/10 BNs in small samples.

Gaussian networks:

  • tabu search and simulated annealing have larger SHDs than constraint-based or hybrid algorithms for most samples;
  • hybrid and constraint-based algorithms have roughly the same SHD for all sample sizes.

SLIDE 12

Real-World Climate Data...

Climate networks aim to analyse the complex spatial structure of climate data: spatial dependence among nearby locations, but also long-range, large-scale oscillation patterns over distant regions of the world, known as teleconnections [11], such as the El Niño Southern Oscillation (ENSO) [12]. We confirm the results above using NCEP/NCAR monthly surface temperature data on a global 10°-resolution grid between 1981 and 2010. This gives sample size n = 30 × 12 = 360 and N = 18 × 36 = 648 variables, which we model with a Gaussian Bayesian network. The sample would count as a "small sample" in the simulation study.
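For a GBN on data like this, the G² statistic is −n log(1 − ρ²_{XY|Z}) with ρ the partial correlation of X and Y given Z. A minimal sketch of such a test (the residualisation-based implementation is my own; the slides only state the statistic):

```python
import numpy as np

def gaussian_g2(x, y, z=None):
    """G^2 test for Gaussian BNs: -n * log(1 - rho^2), where rho is the
    (partial) correlation of x and y given the columns of z, obtained by
    residualising both on z via least squares. Asymptotically chi-square
    with 1 degree of freedom."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    if z is not None:
        Z = np.column_stack([np.ones(n), z])
        x = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
        y = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    rho = np.corrcoef(x, y)[0, 1]
    return -n * np.log(1 - rho ** 2)

# toy check: x and y are dependent only through a common cause z
rng = np.random.default_rng(42)
z = rng.normal(size=500)
x = z + 0.5 * rng.normal(size=500)
y = z + 0.5 * rng.normal(size=500)
# gaussian_g2(x, y) is large; gaussian_g2(x, y, z) is small
```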

SLIDE 13

... Gives Networks that Look Like This...

[Figure: learned climate networks: (a) |A| = 1594 (links), (b) |A| = 898 (links), (c) |A| = 1594 (conditional probabilities), (d) |A| = 898 (conditional probabilities); colour scale P(V ≥ 1 | V81 = 2) − P(V ≥ 1).]

We want to find teleconnections, so we are more interested in learning networks like that on the left than that on the right, because the latter only encodes short-range perturbations.

SLIDE 14

... and Agree with the Simulation Study

[Figure: (a) speed, log10(calls to the statistical criterion); (b) score, log P(X | G, Θ); (c) size, |A| = number of arcs.]

  • Constraint-based algorithms produce BNs with the highest log-likelihood; hybrid algorithms have the worst log-likelihood values and include only a few teleconnections;
  • score-based algorithms produce high-likelihood networks with a large number of teleconnections that allow propagating evidence with realistic results;
  • score-based algorithms are faster than both hybrid and constraint-based algorithms.

SLIDE 15

Conclusions

We assessed the three classes of BN structure learning algorithms, removing the confounding effect of different choices of statistical criteria. Interestingly, we found that:

Q1 constraint-based algorithms are more accurate than score-based algorithms for small sample sizes;

Q2 they are as accurate as hybrid algorithms;

Q3 tabu search, as a score-based algorithm, is faster than constraint-based algorithms more often than not.

This is in contrast with the general view in the literature that score-based algorithms are less sensitive to individual errors and more accurate than constraint-based algorithms; that hybrid algorithms are faster and more accurate than both, more so at small sample sizes; and that score-based algorithms scale less well to high-dimensional data.

SLIDE 16

Thanks!

SLIDE 17

References

SLIDE 18

References I

[1] D. M. Chickering. Optimal Structure Identification With Greedy Search. Journal of Machine Learning Research, 3:507–554, 2002.

[2] D. Colombo and M. H. Maathuis. Order-Independent Constraint-Based Causal Structure Learning. Journal of Machine Learning Research, 15:3921–3962, 2014.

[3] R. Cowell. Conditions Under Which Conditional Independence and Scoring Methods Lead to Identical Selection of Bayesian Network Models. In Proceedings of the 17th Conference on Uncertainty in Artificial Intelligence, pages 91–97, 2001.

[4] D. Geiger and D. Heckerman. Learning Gaussian Networks. In Proceedings of the 10th Conference on Uncertainty in Artificial Intelligence, pages 235–243, 1994.

[5] D. Heckerman, D. Geiger, and D. M. Chickering. Learning Bayesian Networks: The Combination of Knowledge and Statistical Data. Machine Learning, 20(3):197–243, 1995.

SLIDE 19

References II

[6] D. Margaritis. Learning Bayesian Network Model Structure from Data. PhD thesis, School of Computer Science, Carnegie-Mellon University, Pittsburgh, PA, May 2003.

[7] S. J. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, 3rd edition, 2009.

[8] M. Scutari. Bayesian Network Repository. http://www.bnlearn.com/bnrepository, 2012.

[9] M. Scutari, P. Howell, D. J. Balding, and I. Mackay. Multiple Quantitative Trait Analysis Using Bayesian Networks. Genetics, 198(1):129–137, 2014.

[10] I. Tsamardinos, L. E. Brown, and C. F. Aliferis. The Max-Min Hill-Climbing Bayesian Network Structure Learning Algorithm. Machine Learning, 65(1):31–78, 2006.

SLIDE 20

References III

[11] A. A. Tsonis, K. L. Swanson, and G. Wang. On the Role of Atmospheric Teleconnections in Climate. Journal of Climate, 21(12):2990–3001, 2008.

[12] K. Yamasaki, A. Gozolchiani, and S. Havlin. Climate Networks around the Globe are Significantly Affected by El Niño. Physical Review Letters, 100:228501, 2008.

[13] S. Yaramakala and D. Margaritis. Speculative Markov Blanket Discovery for Optimal Feature Selection. In ICDM '05: Proceedings of the Fifth IEEE International Conference on Data Mining, pages 809–812. IEEE Computer Society, 2005.
