SLIDE 1

Beyond Uniform Priors in Bayesian Network Structure Learning
(for Discrete Bayesian Networks)

Marco Scutari
scutari@stats.ox.ac.uk
Department of Statistics, University of Oxford

April 5, 2017

SLIDE 2

Bayesian Network Structure Learning

Learning a BN B = (G, Θ) from a data set D is performed in two steps:

  P(B | D) = P(G, Θ | D) = P(G | D) · P(Θ | G, D),

where the first factor is structure learning and the second is parameter learning.

In a Bayesian setting, structure learning consists in finding the DAG with the best P(G | D) (BIC [5] is a common alternative) with some heuristic search algorithm. We can decompose P(G | D) into

  P(G | D) ∝ P(G) P(D | G) = P(G) ∫ P(D | G, Θ) P(Θ | G) dΘ,

where P(G) is the prior distribution over the space of the DAGs and P(D | G) is the marginal likelihood of the data given G, averaged over all possible parameter sets Θ; and then

  P(D | G) = ∏_{i=1}^{N} [ ∫ P(Xi | ΠXi, ΘXi) P(ΘXi | ΠXi) dΘXi ],

where ΠXi are the parents of Xi in G.

SLIDE 3

The Bayesian Dirichlet Marginal Likelihood

If D contains no missing values and assuming:

• a Dirichlet conjugate prior (Xi | ΠXi ~ Multinomial(ΘXi | ΠXi) and ΘXi | ΠXi ~ Dirichlet(αijk), with Σ_{jk} αijk = αi, the imaginary sample size);
• positivity (all conditional probabilities πijk > 0);
• parameter independence (the πijk for different parent configurations are independent) and parameter modularity (the πijk in different nodes are independent);

Heckerman et al. [2] derived a closed-form expression for P(D | G):

  BD(G, D; α) = ∏_{i=1}^{N} BD(Xi, ΠXi; αi)
              = ∏_{i=1}^{N} ∏_{j=1}^{qi} [ Γ(αij) / Γ(αij + nij) ∏_{k=1}^{ri} Γ(αijk + nijk) / Γ(αijk) ],

where ri is the number of states of Xi; qi is the number of configurations of ΠXi; nij = Σ_k nijk; and αij = Σ_k αijk.
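As a concrete illustration, here is a minimal R sketch of the node-wise BD term above, computed in log space from a matrix of counts nijk and matching imaginary counts αijk; the function name and interface are ours, not bnlearn's:

```r
# BD marginal likelihood term for a single node Xi, in log space.
# `counts` and `alpha` are ri x qi matrices holding n_ijk and alpha_ijk
# (rows: states of Xi, columns: parent configurations). Hypothetical helper.
log_bd_node <- function(counts, alpha) {
  n_ij <- colSums(counts)   # n_ij  = sum_k n_ijk
  a_ij <- colSums(alpha)    # a_ij  = sum_k alpha_ijk
  sum(lgamma(a_ij) - lgamma(a_ij + n_ij)) +
    sum(lgamma(alpha + counts) - lgamma(alpha))
}
```

Summing log_bd_node() over all nodes gives log BD(G, D; α); BDeu below corresponds to filling alpha with the constant α/(ri qi).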

SLIDE 4

Bayesian Dirichlet Equivalent Uniform (BDeu)

The most common implementation of BD assumes αijk = α/(ri qi), αi = α, and is known from [2] as the Bayesian Dirichlet equivalent uniform (BDeu) marginal likelihood. The uniform prior over the parameters was justified by the lack of prior knowledge and widely assumed to be non-informative. However, there is ample evidence that this is a problematic choice:

• The prior is actually not uninformative.
• MAP DAGs selected using BDeu are highly sensitive to the choice of α and can have markedly different numbers of arcs even for reasonable values of α [8].
• In the limits α → 0 and α → ∞ it is possible to obtain both very simple and very complex DAGs, and model comparison may be inconsistent for small D and small α [8, 10].
• The sparseness of the MAP network is determined by a complex interaction between α and D [10, 13].
• There are formal proofs of all of this in [12, 13].
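The sensitivity to α is easy to see empirically; a sketch using the bnlearn R package [6], whose hc() accepts the imaginary sample size through the iss argument, on the learning.test example data set shipped with the package:

```r
library(bnlearn)

# Learn a DAG by hill-climbing with BDeu at several imaginary sample
# sizes and watch the number of arcs change.
data(learning.test)
for (iss in c(1, 5, 10, 50)) {
  dag <- hc(learning.test, score = "bde", iss = iss)
  cat("alpha =", iss, "-> arcs:", nrow(arcs(dag)), "\n")
}
```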

SLIDE 5

Exhibits A and B

[Figure: two DAGs over the nodes W, X, Y, Z; in both, X has parents Z and W, and G+ additionally contains the arc Y → X.]

SLIDE 6

Exhibit A

The sample frequencies (nijk) for X | ΠX are:

  Z, W:    0,0   1,0   0,1   1,1
  X = 0:     2     1     1     2
  X = 1:     1     2     2     1

and those for X | ΠX ∪ Y concentrate the same counts in four of the eight configurations of (Z, W, Y); the remaining four configurations are not observed at all (nijk = 0).

Even though X | ΠX and X | ΠX ∪ Y have the same entropy,

  H(X | ΠX) = H(X | ΠX ∪ Y) = 4 [ −(1/3) log(1/3) − (2/3) log(2/3) ] = 2.546...

SLIDE 7

Exhibit A

... G− has a higher entropy than G+ a posteriori ...

  H(X | ΠX; α) = 4 [ −((1 + 1/8)/(3 + 1/4)) log((1 + 1/8)/(3 + 1/4)) − ((2 + 1/8)/(3 + 1/4)) log((2 + 1/8)/(3 + 1/4)) ] = 2.580,

  H(X | ΠX ∪ Y; α) = 4 [ −((1 + 1/16)/(3 + 1/8)) log((1 + 1/16)/(3 + 1/8)) − ((2 + 1/16)/(3 + 1/8)) log((2 + 1/16)/(3 + 1/8)) ] = 2.564,

... and BDeu with α = 1 chooses accordingly, and things fortunately work out:

  BDeu(X | ΠX) = [ (Γ(1/4)/Γ(1/4 + 3)) · (Γ(1/8 + 2)/Γ(1/8)) · (Γ(1/8 + 1)/Γ(1/8)) ]^4 = 3.906 × 10⁻⁷,

  BDeu(X | ΠX ∪ Y) = [ (Γ(1/8)/Γ(1/8 + 3)) · (Γ(1/16 + 2)/Γ(1/16)) · (Γ(1/16 + 1)/Γ(1/16)) ]^4 = 3.721 × 10⁻⁸.
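These values are easy to verify numerically with the hypothetical log_bd_node() helper sketched earlier:

```r
# Exhibit A with alpha = 1: BDeu sets alpha_ijk = alpha / (ri * qi).
nA  <- matrix(c(2, 1,  1, 2,  1, 2,  2, 1), nrow = 2)  # X | {Z, W}
nAY <- cbind(nA, matrix(0, 2, 4))   # X | {Z, W, Y}: four unobserved configurations
exp(log_bd_node(nA,  matrix(1 / (2 * 4), 2, 4)))   # 3.906e-07
exp(log_bd_node(nAY, matrix(1 / (2 * 8), 2, 8)))   # 3.721e-08
```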

SLIDE 8

Exhibit B

The sample frequencies for X | ΠX are:

  Z, W:    0,0   1,0   0,1   1,1
  X = 0:     3     3     0     0
  X = 1:     0     0     3     3

and those for X | ΠX ∪ Y again concentrate the same counts in four of the eight configurations of (Z, W, Y), leaving the other four unobserved.

The conditional entropy of X is equal to zero for both G+ and G−, since the value of X is completely determined by the configurations of its parents in both cases.

SLIDE 9

Exhibit B

Again, the posterior entropies for G+ and G− differ:

  H(X | ΠX; α) = 4 [ −((0 + 1/8)/(3 + 1/4)) log((0 + 1/8)/(3 + 1/4)) − ((3 + 1/8)/(3 + 1/4)) log((3 + 1/8)/(3 + 1/4)) ] = 0.652,

  H(X | ΠX ∪ Y; α) = 4 [ −((0 + 1/16)/(3 + 1/8)) log((0 + 1/16)/(3 + 1/8)) − ((3 + 1/16)/(3 + 1/8)) log((3 + 1/16)/(3 + 1/8)) ] = 0.392.

However, BDeu with α = 1 yields

  BDeu(X | ΠX) = [ (Γ(1/4)/Γ(1/4 + 3)) · (Γ(1/8 + 3)/Γ(1/8)) · (Γ(1/8)/Γ(1/8)) ]^4 = 0.032,

  BDeu(X | ΠX ∪ Y) = [ (Γ(1/8)/Γ(1/8 + 3)) · (Γ(1/16 + 3)/Γ(1/16)) · (Γ(1/16)/Γ(1/16)) ]^4 = 0.044,

preferring G+ over G− even though the additional arc Y → X does not provide any additional information on the distribution of X, and even though 4 out of 8 conditional distributions in X | ΠX ∪ Y are not observed at all in the data.
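Again, the same hypothetical helper reproduces the numbers:

```r
# Exhibit B with alpha = 1: deterministic columns with counts (3, 0).
nB  <- matrix(c(3, 0,  3, 0,  0, 3,  0, 3), nrow = 2)  # X | {Z, W}
nBY <- cbind(nB, matrix(0, 2, 4))                      # X | {Z, W, Y}
exp(log_bd_node(nB,  matrix(1 / (2 * 4), 2, 4)))   # ~0.0326
exp(log_bd_node(nBY, matrix(1 / (2 * 8), 2, 8)))   # ~0.0441
```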

SLIDE 10

Better Than BDeu: Bayesian Dirichlet Sparse (BDs)

If the positivity assumption is violated or the sample size n is small, there may be configurations of some ΠXi that are not observed in D. Writing α* = α/(ri qi),

  BDeu(Xi, ΠXi; α) = [ ∏_{j: nij = 0} (Γ(ri α*)/Γ(ri α*)) ∏_{k=1}^{ri} (Γ(α*)/Γ(α*)) ] [ ∏_{j: nij > 0} (Γ(ri α*)/Γ(ri α* + nij)) ∏_{k=1}^{ri} (Γ(α* + nijk)/Γ(α*)) ],

where the first bracket cancels to 1, so the effective imaginary sample size decreases as the number of unobserved parent configurations increases. We can prevent that by replacing αijk with

  α̃ijk = α/(ri q̃i) if nij > 0, and 0 otherwise,  where q̃i = |{j : nij > 0}|,

and plugging α̃ijk into BD instead of αijk = α/(ri qi) to obtain BDs. Then

  BDs(Xi, ΠXi; α) = BDeu(Xi, ΠXi; α qi/q̃i).
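A sketch of BDs on a counts matrix, mirroring the definition above (our own helper, not the bnlearn implementation):

```r
# BDs for a single node: unobserved parent configurations (columns with
# zero counts) get alpha_ijk = 0 and contribute nothing; observed ones
# share the imaginary sample size alpha over q_tilde configurations.
log_bds_node <- function(counts, alpha = 1) {
  ri <- nrow(counts)
  obs <- colSums(counts) > 0
  q_tilde <- sum(obs)               # number of observed configurations
  a_jk <- alpha / (ri * q_tilde)    # alpha-tilde_ijk for observed columns
  m <- counts[, obs, drop = FALSE]
  sum(lgamma(ri * a_jk) - lgamma(ri * a_jk + colSums(m))) +
    sum(lgamma(a_jk + m) - lgamma(a_jk))
}
```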

SLIDE 11

BDeu and BDs Compared

[Figure omitted. Cells that correspond to (Xi, ΠXi) combinations that are not observed in the data are in red; observed combinations are in green.]

SLIDE 12

Exhibits A and B, Once More

BDs does not suffer from the bias arising from q̃i < qi: it correctly assigns the same score to G− and G+ in both examples, BDs(X | ΠX) = BDs(X | ΠX ∪ Y) = 3.906 × 10⁻⁷ in Exhibit A and BDs(X | ΠX) = BDs(X | ΠX ∪ Y) = 0.03262 in Exhibit B, following the maximum entropy principle.
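The log_bds_node() sketch above confirms the Exhibit A case, reusing nA and nAY from the earlier check (Exhibit B works the same way):

```r
exp(log_bds_node(nA))    # 3.906e-07 for X | {Z, W}
exp(log_bds_node(nAY))   # 3.906e-07 for X | {Z, W, Y}: empty configurations
                         # no longer dilute the imaginary sample size
```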

[Figure: Bayes factor for G+ versus G− as a function of log10(α), computed with BDeu and with BDs, for the two exhibits.]

SLIDE 13

Entropy and BDeu

In a Bayesian setting, the conditional entropy H(·) of X | ΠX given a uniform Dirichlet prior with imaginary sample size α over the cell probabilities is

  H(X | ΠX; α) = − Σ_{j: nij > 0} Σ_{k=1}^{ri} p(α*i)_{k|j} log p(α*i)_{k|j},  with  p(α*i)_{k|j} = (α*i + nijk) / (ri α*i + nij),

and H(X | ΠX; α) > H(X | ΠX; β) if α > β and X | ΠX is not a uniform distribution. Let α/(ri qi) → 0 and let α > β > 0. Then

  BDeu(X | ΠX; α) > BDeu(X | ΠX; β)   if d_EP(Xi, G) > 0,
  BDeu(X | ΠX; α) = (1/ri)^q̃i         if d_EP(Xi, G) = 0.

SLIDE 14

To Sum It Up in a Theorem

Let G+ and G− be two DAGs differing by a single arc Xj → Xi, and let α/(ri qi) → 0. Then the Bayes factor computed using BDs corresponds to the Bayes factor computed using BDeu weighted by the following implicit prior ratio:

  P(G+) / P(G−) = (qi/q̃i)^d_EP(Xi, G+) / (q′i/q̃′i)^d_EP(Xi, G−),

and can be written as

  BDs(Xi, ΠXi ∪ Xj; α) / BDs(Xi, ΠXi; α) =
    [ (qi/q̃i)^d_EP(Xi, G+) α^d_EP(G+) ] / [ (q′i/q̃′i)^d_EP(Xi, G−) α^d_EP(G−) ]   if dEDF > −log_α(P(G+)/P(G−)),
    +∞                                                                             if dEDF < −log_α(P(G+)/P(G−)).

SLIDE 15

The Uniform (U) Graph Prior

The most common choice for P(G) is the uniform (U) distribution, because it is extremely difficult to specify informative priors [1, 3]. Assuming a uniform prior is problematic because:

• Score-based structure learning algorithms typically generate new candidate DAGs by a single arc addition, deletion or reversal, e.g.

    P(G ∪ {Xj → Xi} | D) / P(G | D) = [P(G ∪ {Xj → Xi}) / P(G)] · [P(D | G ∪ {Xj → Xi}) / P(D | G)].

  U always simplifies out of such ratios, and it implies p→ij = p←ij = ˚pij = 1/3 (writing p→ij, p←ij and ˚pij for the prior probabilities that the arc aij is present in each direction or absent), favouring the inclusion of new arcs as p→ij + p←ij = 2/3 for each possible arc aij.

• Two arcs are correlated if they are incident on a common node [7], so false positives and false negatives can potentially propagate through P(G) and lead to further errors in learning G.
• DAGs that are completely unsupported by the data have most of the probability mass for large enough N.

SLIDE 16

Better Than U: the Marginal Uniform (MU) Graph Prior

In our previous work [7], we showed that under U

  p→ij = p←ij ≈ 1/4 + 1/(4(N − 1)) → 1/4  and  ˚pij ≈ 1/2 − 1/(2(N − 1)) → 1/2,

so each possible arc is present in G with marginal probability ≈ 1/2 and, when present, it appears in each direction with probability 1/2. We can use that as a starting point, and assume an independent prior for each arc with the same marginal probabilities as U (hence the name MU).

• MU does not favour arc inclusion, as p→ij + p←ij = 1/2.
• MU does not favour the propagation of errors in structure learning, because arcs are independent of each other.
• MU is computationally trivial to use: the ratio of the prior probabilities is 1/2 for arc addition, 2 for arc deletion and 1 for arc reversal, for all arcs.

We can also assume p→ij + p←ij = β with β = 2/(N − 1) to have O(N) expected arcs in the prior, which often works even better.
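In bnlearn [6] both the MU prior and the BDs score are exposed through the usual learning functions; a sketch, assuming the score = "bds" and prior = "marginal" options of recent bnlearn releases, on its bundled learning.test data:

```r
library(bnlearn)

# Hill-climbing under MU + BDs versus the classic U + BDeu.
data(learning.test)
dag.mu.bds <- hc(learning.test, score = "bds", iss = 1, prior = "marginal")
dag.u.bdeu <- hc(learning.test, score = "bde", iss = 1)  # uniform P(G) is the default
unlist(compare(dag.u.bdeu, dag.mu.bds))   # tp/fp/fn arc differences
```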

SLIDE 17

Design of the Simulation Study

We evaluated BIC and U+BDeu, U+BDs, MU+BDeu, MU+BDs with α = 1, 5, 10 on:

• 10 reference BNs covering a wide range of N (8 to 442), p = |Θ| (18 to 77K) and number of arcs |A| (8 to 602);
• 20 samples of size n/p = 0.1, 0.2, 0.5, 1.0, 2.0 and 5.0 for each BN (to allow for meaningful comparisons between BNs with such different N and p);

with performance measures for:

• the quality of the learned DAG, using the SHD distance [11] from the reference BN;
• the number of arcs, compared to the reference BN;
• the log-likelihood on a separate test set of size 10K, as an approximation of the Kullback-Leibler distance;

using hill-climbing and the bnlearn R package [6] (a one-replication sketch follows below).
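For a single replication, the comparison can be sketched as follows; shd() is bnlearn's structural Hamming distance, while ref.dag and sample.data are hypothetical stand-ins for one of the 10 reference BNs and one of its generated samples:

```r
library(bnlearn)

# Learn from one sample and score the result against the reference structure.
evaluate_one <- function(sample.data, ref.dag, alpha = 1) {
  dag <- hc(sample.data, score = "bds", iss = alpha, prior = "marginal")
  c(shd = shd(dag, ref.dag), arcs = nrow(arcs(dag)))
}
```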

SLIDE 18

Results: ALARM, SHD

[Figure: SHD from the reference (ALARM) network, plotted against n/p ∈ {0.1, 0.2, 0.5, 1, 2, 5} for BIC, U + BDeu, U + BDs, MU + BDeu and MU + BDs with α = 1 and α = 10.]

SLIDE 19

Results: ALARM, Number of Arcs

[Figure: number of arcs learned for ALARM, plotted against n/p ∈ {0.1, 0.2, 0.5, 1, 2, 5} for the same scores and priors (BIC, U/MU × BDeu/BDs, α = 1, 10).]

SLIDE 20

Results: ALARM, Log-likelihood on the Test Set

[Figure: log-likelihood on the test set for ALARM, plotted against n/p ∈ {0.1, 0.2, 0.5, 1, 2, 5} for the same scores and priors.]

SLIDE 21

Conclusions

• We propose a new default posterior score for discrete BN structure learning, defined as the combination of a new prior over the space of DAGs, the marginal uniform (MU) prior, and a new empirical Bayes marginal likelihood, which we call Bayesian Dirichlet sparse (BDs).
• In an extensive simulation study using 10 reference BNs, we find that MU+BDs outperforms U+BDeu for all combinations of BN and sample size, both in the quality of the learned DAGs and in predictive accuracy. Other proposals in the literature improve one at the expense of the other [4, 9, 13, 14].
• This is achieved without increasing the computational complexity of the posterior score, since MU+BDs can be computed in the same time as U+BDeu.

SLIDE 22

Thanks!

SLIDE 23

References

SLIDE 24

References I

[1] R. Castelo and A. Siebes. Priors on Network Structures. Biasing the Search for Bayesian Networks. International Journal of Approximate Reasoning, 24(1):39–57, 2000.

[2] D. Heckerman, D. Geiger, and D. M. Chickering. Learning Bayesian Networks: The Combination of Knowledge and Statistical Data. Machine Learning, 20(3):197–243, 1995. Available as Technical Report MSR-TR-94-09.

[3] S. Mukherjee and T. P. Speed. Network Inference Using Informative Priors. Proceedings of the National Academy of Sciences, 105(38):14313–14318, 2008.

[4] M. Scanagatta, C. P. de Campos, and M. Zaffalon. Min-BDeu and Max-BDeu Scores for Learning Bayesian Networks. In Proceedings of the 7th Probabilistic Graphical Model Workshop, pages 426–441, 2014.

[5] G. Schwarz. Estimating the Dimension of a Model. The Annals of Statistics, 6(2):461–464, 1978.

SLIDE 25

References II

[6] M. Scutari. Learning Bayesian Networks with the bnlearn R Package. Journal of Statistical Software, 35(3):1–22, 2010.

[7] M. Scutari. On the Prior and Posterior Distributions Used in Graphical Modelling (with discussion). Bayesian Analysis, 8(3):505–532, 2013.

[8] T. Silander, P. Kontkanen, and P. Myllymäki. On Sensitivity of the MAP Bayesian Network Structure to the Equivalent Sample Size Parameter. In Proceedings of the 23rd Conference on Uncertainty in Artificial Intelligence, pages 360–367, 2007.

[9] H. Steck. Learning the Bayesian Network Structure: Dirichlet Prior versus Data. In Proceedings of the 24th Conference on Uncertainty in Artificial Intelligence, pages 511–518, 2008.

[10] H. Steck and T. S. Jaakkola. On the Dirichlet Prior and Bayesian Regularization. In Advances in Neural Information Processing Systems 15, pages 713–720, 2003.

SLIDE 26

References III

[11] I. Tsamardinos, L. E. Brown, and C. F. Aliferis. The Max-Min Hill-Climbing Bayesian Network Structure Learning Algorithm. Machine Learning, 65(1):31–78, 2006.

[12] M. Ueno. Learning Networks Determined by the Ratio of Prior and Data. In Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence, pages 598–605, 2010.

[13] M. Ueno. Robust Learning of Bayesian Networks for Prior Belief. In Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence, pages 698–707, 2011.

[14] M. Ueno and M. Uto. Non-Informative Dirichlet Score for Learning Bayesian Networks. In Proceedings of the 6th European Workshop on Probabilistic Graphical Models, pages 331–338, 2012.
