
An Empirical-Bayes Score for Discrete Bayesian Networks

Marco Scutari
scutari@stats.ox.ac.uk
Department of Statistics, University of Oxford
September 8, 2016

Bayesian Network Structure Learning

Learning a BN $B = (G, \Theta)$ from a data set $D$ is performed in two steps:

$$\underbrace{P(B \mid D) = P(G, \Theta \mid D)}_{\text{learning}} = \underbrace{P(G \mid D)}_{\text{structure learning}} \cdot \underbrace{P(\Theta \mid G, D)}_{\text{parameter learning}}.$$

In a Bayesian setting, structure learning consists in finding the DAG with the best $P(G \mid D)$ (BIC [5] is a common alternative) with some heuristic search algorithm. We can decompose $P(G \mid D)$ into

$$P(G \mid D) \propto P(G)\, P(D \mid G) = P(G) \int P(D \mid G, \Theta)\, P(\Theta \mid G)\, d\Theta$$

where $P(G)$ is the prior distribution over the space of the DAGs and $P(D \mid G)$ is the marginal likelihood of the data given $G$, averaged over all possible parameter sets $\Theta$; and then

$$P(D \mid G) = \prod_{i=1}^{N} \int P(X_i \mid \Pi_{X_i}, \Theta_{X_i})\, P(\Theta_{X_i} \mid \Pi_{X_i})\, d\Theta_{X_i},$$

where $\Pi_{X_i}$ are the parents of $X_i$ in $G$.

The Bayesian Dirichlet Marginal Likelihood

If $D$ contains no missing values and assuming:

- a Dirichlet conjugate prior ($X_i \mid \Pi_{X_i} \sim \mathrm{Multinomial}(\Theta_{X_i \mid \Pi_{X_i}})$ and $\Theta_{X_i \mid \Pi_{X_i}} \sim \mathrm{Dirichlet}(\alpha_{ijk})$, with $\sum_{jk} \alpha_{ijk} = \alpha_i$ the imaginary sample size);
- positivity (all conditional probabilities $\pi_{ijk} > 0$);
- parameter independence ($\pi_{ijk}$ for different parent configurations are independent) and modularity ($\pi_{ijk}$ in different nodes are independent);

[2] derived a closed form expression for $P(D \mid G)$:

$$\mathrm{BD}(G, D; \alpha) = \prod_{i=1}^{N} \mathrm{BD}(X_i, \Pi_{X_i}; \alpha_i) = \prod_{i=1}^{N} \prod_{j=1}^{q_i} \left[ \frac{\Gamma(\alpha_{ij})}{\Gamma(\alpha_{ij} + n_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(\alpha_{ijk} + n_{ijk})}{\Gamma(\alpha_{ijk})} \right]$$

where $r_i$ is the number of states of $X_i$; $q_i$ is the number of configurations of $\Pi_{X_i}$; $n_{ijk}$ is the number of observations in which $X_i$ takes its $k$-th state and $\Pi_{X_i}$ its $j$-th configuration; $n_{ij} = \sum_k n_{ijk}$; and $\alpha_{ij} = \sum_k \alpha_{ijk}$.
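The closed form for a single node can be sketched in a few lines of Python, working on the log scale to avoid overflow in the Gamma functions. This is only an illustration under stated assumptions: the function name `log_bd_node` and the nested-list layout (`counts[j][k]` for $n_{ijk}$, `alphas[j][k]` for $\alpha_{ijk}$) are hypothetical and not taken from the slides or from any package.

```python
import math

def log_bd_node(counts, alphas):
    """Log of BD(X_i, Pi_{X_i}; alpha_i) for one node.

    counts[j][k] is n_ijk and alphas[j][k] is alpha_ijk (all > 0), where j
    indexes parent configurations and k indexes the states of X_i.
    (Hypothetical data layout, chosen for illustration.)
    """
    total = 0.0
    for n_j, a_j in zip(counts, alphas):
        n_ij = sum(n_j)  # n_ij = sum_k n_ijk
        a_ij = sum(a_j)  # alpha_ij = sum_k alpha_ijk
        total += math.lgamma(a_ij) - math.lgamma(a_ij + n_ij)
        for n_ijk, a_ijk in zip(n_j, a_j):
            total += math.lgamma(a_ijk + n_ijk) - math.lgamma(a_ijk)
    return total

# Binary node with one binary parent: two parent configurations, two states.
counts = [[3, 1], [0, 4]]
alphas = [[0.5, 0.5], [0.5, 0.5]]
print(log_bd_node(counts, alphas))
```

Because the score factorises over nodes, the log score of a whole DAG would be the sum of these per-node terms.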

Bayesian Dirichlet Equivalent Uniform (BDeu)

The most common implementation of BD assumes $\alpha_{ijk} = \alpha / (r_i q_i)$, $\alpha_i = \alpha$, and is known from [2] as the Bayesian Dirichlet equivalent uniform (BDeu) marginal likelihood.

The uniform prior over the parameters was justified by the lack of prior knowledge and widely assumed to be non-informative. However, there is ample evidence that this is a problematic choice:

- The prior is actually not uninformative.
- MAP DAGs selected using BDeu are highly sensitive to the choice of $\alpha$ and can have markedly different numbers of arcs even for reasonable $\alpha$ [8].
- In the limits $\alpha \to 0$ and $\alpha \to \infty$ it is possible to obtain both very simple and very complex DAGs, and model comparison may be inconsistent for small $D$ and small $\alpha$ [8, 10].
- The sparseness of the MAP network is determined by a complex interaction between $\alpha$ and $D$ [10, 13].
- There are formal proofs of all this in [12, 13].

The Uniform (U) Graph Prior

The most common choice for $P(G)$ is the uniform (U) distribution because it is extremely difficult to specify informative priors [1, 3]. Assuming a uniform prior is problematic because:

- Score-based structure learning algorithms typically generate new candidate DAGs by a single arc addition, deletion or reversal, e.g.

$$\frac{P(G \cup \{X_j \to X_i\} \mid D)}{P(G \mid D)} = \frac{P(G \cup \{X_j \to X_i\})}{P(G)} \cdot \frac{P(D \mid G \cup \{X_j \to X_i\})}{P(D \mid G)}.$$

  Under U the prior ratio always cancels, and that implies $\overrightarrow{p_{ij}} = \overleftarrow{p_{ij}} = \mathring{p}_{ij} = 1/3$, favouring the inclusion of new arcs as $\overrightarrow{p_{ij}} + \overleftarrow{p_{ij}} = 2/3$ for each possible arc $a_{ij}$.
- Two arcs are correlated if they are incident on a common node [7], so false positives and false negatives can potentially propagate through $P(G)$ and lead to further errors in learning $G$.
- DAGs that are completely unsupported by the data have most of the probability mass for large enough $N$.

Better Than BDeu: Bayesian Dirichlet Sparse (BDs)

If the positivity assumption is violated or the sample size $n$ is small, there may be configurations of some $\Pi_{X_i}$ that are not observed in $D$. Writing $\alpha^* = \alpha / (r_i q_i)$,

$$\mathrm{BDeu}(X_i, \Pi_{X_i}; \alpha) = \prod_{j : n_{ij} = 0} \left[ \frac{\Gamma(r_i \alpha^*)}{\Gamma(r_i \alpha^*)} \prod_{k=1}^{r_i} \frac{\Gamma(\alpha^*)}{\Gamma(\alpha^*)} \right] \prod_{j : n_{ij} > 0} \left[ \frac{\Gamma(r_i \alpha^*)}{\Gamma(r_i \alpha^* + n_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(\alpha^* + n_{ijk})}{\Gamma(\alpha^*)} \right],$$

where the first product, over the unobserved configurations, is equal to 1 and cancels. So the effective imaginary sample size decreases as the number of unobserved parent configurations increases, and the MAP estimates of $\pi_{ijk}$ gradually converge to the maximum likelihood (ML) estimates and favour overfitting.

To address these two undesirable features of BDeu we replace $\alpha^*$ with

$$\tilde{\alpha} = \begin{cases} \alpha / (r_i \tilde{q}_i) & \text{if } n_{ij} > 0 \\ 0 & \text{otherwise,} \end{cases} \qquad \tilde{q}_i = \{\text{number of } \Pi_{X_i} \text{ such that } n_{ij} > 0\},$$

and we plug it in BD instead of $\alpha^* = \alpha / (r_i q_i)$ to obtain BDs.
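To make the difference concrete, the sketch below computes both scores for one node from a table of counts. It is a minimal illustration, not the author's implementation: the helper names `log_bd_uniform`, `log_bdeu` and `log_bds` and the nested-list layout `counts[j][k]` are assumptions made here. BDeu spreads $\alpha$ over all $q_i$ parent configurations, while BDs spreads it over the $\tilde{q}_i$ observed ones only.

```python
import math

def log_bd_uniform(counts, a):
    """Log BD for one node when every alpha_ijk in the listed rows equals a."""
    total = 0.0
    for row in counts:
        r = len(row)
        total += math.lgamma(r * a) - math.lgamma(r * a + sum(row))
        total += sum(math.lgamma(a + n) - math.lgamma(a) for n in row)
    return total

def log_bdeu(counts, alpha):
    """BDeu: alpha* = alpha / (r_i * q_i) for all parent configurations."""
    r, q = len(counts[0]), len(counts)
    return log_bd_uniform(counts, alpha / (r * q))

def log_bds(counts, alpha):
    """BDs: alpha~ = alpha / (r_i * q~_i), using observed configurations only."""
    observed = [row for row in counts if sum(row) > 0]
    if not observed:  # no data for this node: empty product, log score 0
        return 0.0
    r, q_tilde = len(counts[0]), len(observed)
    return log_bd_uniform(observed, alpha / (r * q_tilde))

# The second parent configuration is never observed, so the two scores differ.
sparse = [[3, 1], [0, 0]]
print(log_bdeu(sparse, 1.0), log_bds(sparse, 1.0))

# Every configuration is observed, so q~_i = q_i and the two scores coincide.
dense = [[3, 1], [0, 4]]
print(log_bdeu(dense, 1.0), log_bds(dense, 1.0))
```

Rows with $n_{ij} = 0$ contribute nothing to the BDeu sum, which is the cancellation shown in the slide; the only effect of the unobserved configurations is the smaller $\alpha^*$ used for the observed ones.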

BDeu and BDs Compared

[Figure: cells that correspond to $(X_i, \Pi_{X_i})$ combinations that are not observed in the data are in red; observed combinations are in green.]

Some Notes on BDs

$$\mathrm{BDs}(X_i, \Pi_{X_i}; \alpha) = \prod_{j : n_{ij} > 0} \left[ \frac{\Gamma(r_i \tilde{\alpha})}{\Gamma(r_i \tilde{\alpha} + n_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(\tilde{\alpha} + n_{ijk})}{\Gamma(\tilde{\alpha})} \right]$$

- BDeu is score-equivalent, meaning it takes the same value for DAGs that represent the same probability distribution. BDs is not score-equivalent for finite samples that have unobserved parent configurations. Asymptotically, BDs $\to$ BDeu as $n \to \infty$ if the positivity assumption holds.
- $\tilde{\alpha}$ is a piecewise uniform empirical Bayes prior because it depends on $D$.
- We always have $\sum_{j : n_{ij} > 0} \sum_k \tilde{\alpha} = \alpha$, so the effective imaginary sample size is the same for all DAGs. Therefore DAG comparisons are consistent with BDs, which is not the case with BDeu.
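The claim about the effective imaginary sample size follows directly from the definitions of $\tilde{\alpha}$ and $\alpha^*$. As a short worked check, using only symbols already defined in the slides: there are $\tilde{q}_i$ observed configurations, each with $r_i$ states, so

```latex
\sum_{j : n_{ij} > 0} \sum_{k=1}^{r_i} \tilde{\alpha}
  = \tilde{q}_i \, r_i \, \frac{\alpha}{r_i \tilde{q}_i}
  = \alpha,
\qquad \text{whereas} \qquad
\sum_{j : n_{ij} > 0} \sum_{k=1}^{r_i} \alpha^{*}
  = \tilde{q}_i \, r_i \, \frac{\alpha}{r_i q_i}
  = \alpha \, \frac{\tilde{q}_i}{q_i} \le \alpha .
```

Under BDeu the prior mass actually used therefore shrinks by the fraction of observed configurations, which varies from one candidate parent set to another; under BDs it is always $\alpha$.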

Better Than U: the Marginal Uniform (MU) Graph Prior

In our previous work [7], we explored the first- and second-order properties of U and we showed that

$$\overrightarrow{p_{ij}} = \overleftarrow{p_{ij}} \approx \frac{1}{4} + \frac{1}{4(N - 1)} \to \frac{1}{4} \qquad \text{and} \qquad \mathring{p}_{ij} \approx \frac{1}{2} - \frac{1}{2(N - 1)} \to \frac{1}{2},$$

so each possible arc is present in $G$ with marginal probability $\approx 1/2$ and, when present, it appears in each direction with probability $1/2$.

We can use that as a starting point, and assume an independent prior for each arc with the same marginal probabilities as U (hence the name MU).

- MU does not favour arc inclusion as $\overrightarrow{p_{ij}} + \overleftarrow{p_{ij}} = 1/2$.
- MU does not favour the propagation of errors in structure learning because arcs are independent from each other.
- MU is computationally trivial to use: the ratio of the prior probabilities is $1/2$ for arc addition, $2$ for arc deletion and $1$ for arc reversal, for all arcs.
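The prior ratios above are all a search algorithm needs when it scores a single-arc change. A minimal sketch follows, assuming the caller already has the log Bayes factor $\log P(D \mid G') - \log P(D \mid G)$ for the proposed move; the table `LOG_PRIOR_RATIO` and the function `log_posterior_ratio` are hypothetical names introduced for illustration.

```python
import math

# log P(G') / P(G) for a single-arc move from G to G'.
# U: the ratio is always 1. MU: 1/2 (addition), 2 (deletion), 1 (reversal).
LOG_PRIOR_RATIO = {
    "U": {"add": 0.0, "delete": 0.0, "reverse": 0.0},
    "MU": {"add": math.log(0.5), "delete": math.log(2.0), "reverse": 0.0},
}

def log_posterior_ratio(log_bayes_factor, move, prior="MU"):
    """log P(G' | D) - log P(G | D) = log prior ratio + log Bayes factor."""
    return LOG_PRIOR_RATIO[prior][move] + log_bayes_factor

# A weakly supported arc addition: accepted under U, rejected under MU.
log_bf = 0.5
print(log_posterior_ratio(log_bf, "add", prior="U") > 0)   # True
print(log_posterior_ratio(log_bf, "add", prior="MU") > 0)  # False
```

An addition is accepted under MU only if the marginal likelihood at least doubles, which is how MU offsets the bias of U towards including arcs.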

Design of the Simulation Study

We evaluated BIC and U+BDeu, U+BDs, MU+BDeu, MU+BDs with $\alpha = 1, 5, 10$ on:

- 10 reference BNs covering a wide range of $N$ (8 to 442), $p = |\Theta|$ (18 to 77K) and number of arcs $|A|$ (8 to 602);
- 20 samples of size $n/p = 0.1, 0.2, 0.5, 1.0, 2.0$ and $5.0$ (to allow for meaningful comparisons between BNs with such different $N$ and $p$) for each BN and $n/p$;

with performance measures for:

- the quality of the learned DAG, using the SHD distance [11] from the reference BN;
- the number of arcs compared to the reference BN;
- the log-likelihood on a separate test set of size 10K, as an approximation of the Kullback-Leibler distance;

using hill-climbing and the bnlearn R package [6].
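As a rough intuition for the first performance measure, the sketch below counts the node pairs on which two DAGs disagree (an arc missing, extra, or pointing the other way). This is a deliberately simplified stand-in and an assumption of this example: the SHD of [11] is defined on equivalence classes rather than on raw DAGs, so the numbers will not in general match those reported in the study. The function name `shd_simplified` and the arc-list representation are hypothetical.

```python
def shd_simplified(arcs_a, arcs_b):
    """Count node pairs whose arc status differs between two DAGs.

    Each DAG is an iterable of (from_node, to_node) tuples. A missing arc,
    an extra arc and a reversed arc each count as one difference.
    """
    a, b = set(arcs_a), set(arcs_b)
    pairs = {frozenset(arc) for arc in a | b}  # unordered node pairs
    distance = 0
    for pair in pairs:
        x, y = tuple(pair)
        status_a = ((x, y) in a, (y, x) in a)
        status_b = ((x, y) in b, (y, x) in b)
        if status_a != status_b:
            distance += 1
    return distance

reference = [("A", "B"), ("B", "C")]
learned = [("B", "A"), ("A", "C")]
# One reversed arc (A-B), one missing arc (B-C), one extra arc (A-C).
print(shd_simplified(reference, learned))  # 3
```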

Results: ALARM, SHD

[Figure: SHD for BIC, U + BDeu, U + BDs, MU + BDeu and MU + BDs with α = 1 and α = 10; x-axis ticks at 0.1, 0.2, 0.5, 1, 2, 5, matching the n/p values of the simulation design.]

Results: ALARM, Number of Arcs

[Figure: number of arcs for BIC, U + BDeu, U + BDs, MU + BDeu and MU + BDs with α = 1 and α = 10; x-axis ticks at 0.1, 0.2, 0.5, 1, 2, 5, matching the n/p values of the simulation design.]

Results: ALARM, Log-likelihood on the Test Set

[Figure: test-set log-likelihood for BIC, U + BDeu, U + BDs, MU + BDeu and MU + BDs with α = 1 and α = 10; x-axis ticks at 0.1, 0.2, 0.5, 1, 2, 5, matching the n/p values of the simulation design.]

Conclusions

- We propose a new posterior score for discrete BN structure learning, defined as the combination of a new prior over the space of DAGs, the marginal uniform (MU) prior, and of a new empirical Bayes marginal likelihood, which we call Bayesian Dirichlet sparse (BDs).
- In an extensive simulation study using 10 reference BNs we find that MU+BDs outperforms U+BDeu for all combinations of BN and sample sizes, both in the quality of the learned DAGs and in predictive accuracy. Other proposals in the literature improve one at the expense of the other [4, 9, 13, 14].
- This is achieved without increasing the computational complexity of the posterior score, since MU+BDs can be computed in the same time as U+BDeu.

Thanks!

