  1. Beyond Uniform Priors in Bayesian Network Structure Learning (for Discrete Bayesian Networks)

     Marco Scutari <scutari@stats.ox.ac.uk>
     Department of Statistics, University of Oxford
     April 5, 2017

  2. Bayesian Network Structure Learning

     Learning a BN $\mathcal{B} = (\mathcal{G}, \Theta)$ from a data set $\mathcal{D}$ is performed in two steps:

     $$\mathrm{P}(\mathcal{B} \mid \mathcal{D}) = \mathrm{P}(\mathcal{G}, \Theta \mid \mathcal{D}) = \underbrace{\mathrm{P}(\mathcal{G} \mid \mathcal{D})}_{\text{structure learning}} \cdot \underbrace{\mathrm{P}(\Theta \mid \mathcal{G}, \mathcal{D})}_{\text{parameter learning}}.$$

     In a Bayesian setting, structure learning consists in finding the DAG with the best $\mathrm{P}(\mathcal{G} \mid \mathcal{D})$ (BIC [5] is a common alternative) with some heuristic search algorithm. We can decompose $\mathrm{P}(\mathcal{G} \mid \mathcal{D})$ into

     $$\mathrm{P}(\mathcal{G} \mid \mathcal{D}) \propto \mathrm{P}(\mathcal{G})\,\mathrm{P}(\mathcal{D} \mid \mathcal{G}) = \mathrm{P}(\mathcal{G}) \int \mathrm{P}(\mathcal{D} \mid \mathcal{G}, \Theta)\,\mathrm{P}(\Theta \mid \mathcal{G})\,d\Theta,$$

     where $\mathrm{P}(\mathcal{G})$ is the prior distribution over the space of DAGs and $\mathrm{P}(\mathcal{D} \mid \mathcal{G})$ is the marginal likelihood of the data given $\mathcal{G}$, averaged over all possible parameter sets $\Theta$; and then

     $$\mathrm{P}(\mathcal{D} \mid \mathcal{G}) = \prod_{i=1}^{N} \left[ \int \mathrm{P}(X_i \mid \Pi_{X_i}, \Theta_{X_i})\,\mathrm{P}(\Theta_{X_i} \mid \Pi_{X_i})\,d\Theta_{X_i} \right],$$

     where $\Pi_{X_i}$ are the parents of $X_i$ in $\mathcal{G}$.
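     As a concrete instance of this node-by-node decomposition (an illustration added here, not from the slides), consider a three-node DAG $X \rightarrow Y \leftarrow Z$:

     $$\mathrm{P}(\mathcal{D} \mid \mathcal{G}) = \underbrace{\int \mathrm{P}(X \mid \Theta_X)\,\mathrm{P}(\Theta_X)\,d\Theta_X}_{X \text{ has no parents}} \cdot \underbrace{\int \mathrm{P}(Z \mid \Theta_Z)\,\mathrm{P}(\Theta_Z)\,d\Theta_Z}_{Z \text{ has no parents}} \cdot \underbrace{\int \mathrm{P}(Y \mid X, Z, \Theta_Y)\,\mathrm{P}(\Theta_Y)\,d\Theta_Y}_{\Pi_Y = \{X, Z\}},$$

     so each node is scored locally given its parents, which is what makes heuristic search over the space of DAGs feasible.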

  3. The Bayesian Dirichlet Marginal Likelihood

     If $\mathcal{D}$ contains no missing values and assuming:
     • a Dirichlet conjugate prior ($X_i \mid \Pi_{X_i} \sim \mathrm{Multinomial}(\Theta_{X_i} \mid \Pi_{X_i})$ and $\Theta_{X_i} \mid \Pi_{X_i} \sim \mathrm{Dirichlet}(\alpha_{ijk})$, with $\sum_{jk} \alpha_{ijk} = \alpha_i$ the imaginary sample size);
     • positivity (all conditional probabilities $\pi_{ijk} > 0$);
     • parameter independence ($\pi_{ijk}$ for different parent configurations are independent) and modularity ($\pi_{ijk}$ in different nodes are independent);
     Heckerman et al. [2] derived a closed-form expression for $\mathrm{P}(\mathcal{D} \mid \mathcal{G})$:

     $$\mathrm{BD}(\mathcal{G}, \mathcal{D}; \alpha) = \prod_{i=1}^{N} \mathrm{BD}(X_i, \Pi_{X_i}; \alpha_i) = \prod_{i=1}^{N} \prod_{j=1}^{q_i} \left[ \frac{\Gamma(\alpha_{ij})}{\Gamma(\alpha_{ij} + n_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(\alpha_{ijk} + n_{ijk})}{\Gamma(\alpha_{ijk})} \right],$$

     where $r_i$ is the number of states of $X_i$; $q_i$ is the number of configurations of $\Pi_{X_i}$; $n_{ij} = \sum_k n_{ijk}$; and $\alpha_{ij} = \sum_k \alpha_{ijk}$.
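     To make the closed form concrete, here is a minimal sketch (added here; not Scutari's code, and the function name and array layout are my own choices) that computes the log-BD score of a single node from its table of counts $n_{ijk}$:

     ```python
     import numpy as np
     from scipy.special import gammaln  # log Gamma, for numerical stability

     def log_bd_node(n, alpha):
         """log BD(X_i, Pi_Xi; alpha_i) for a single node.

         n     : (q_i, r_i) array of counts n_ijk, one row per parent configuration j
         alpha : (q_i, r_i) array of Dirichlet hyperparameters alpha_ijk
         """
         n = np.asarray(n, dtype=float)
         alpha = np.asarray(alpha, dtype=float)
         n_ij = n.sum(axis=1)          # n_ij = sum_k n_ijk
         alpha_ij = alpha.sum(axis=1)  # alpha_ij = sum_k alpha_ijk
         return np.sum(gammaln(alpha_ij) - gammaln(alpha_ij + n_ij)
                       + np.sum(gammaln(alpha + n) - gammaln(alpha), axis=1))
     ```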

  4. Bayesian Dirichlet Equivalent Uniform (BDeu)

     The most common implementation of BD assumes $\alpha_{ijk} = \alpha/(r_i q_i)$, $\alpha_i = \alpha$, and is known from [2] as the Bayesian Dirichlet equivalent uniform (BDeu) marginal likelihood. The uniform prior over the parameters was justified by the lack of prior knowledge and widely assumed to be non-informative. However, there is ample evidence that this is a problematic choice:
     • The prior is actually not uninformative.
     • MAP DAGs selected using BDeu are highly sensitive to the choice of $\alpha$ and can have markedly different numbers of arcs even for reasonable $\alpha$ [8].
     • In the limits $\alpha \to 0$ and $\alpha \to \infty$ it is possible to obtain both very simple and very complex DAGs, and model comparison may be inconsistent for small $\mathcal{D}$ and small $\alpha$ [8, 10].
     • The sparseness of the MAP network is determined by a complex interaction between $\alpha$ and $\mathcal{D}$ [10, 13].
     • There are formal proofs of all this in [12, 13].
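     In code, BDeu is just BD with all hyperparameters fixed to $\alpha/(r_i q_i)$; a one-line sketch on top of the log_bd_node() helper above (again my own toy helper, not part of any package):

     ```python
     def log_bdeu_node(n, alpha=1.0):
         # BDeu: alpha_ijk = alpha / (r_i * q_i), so alpha_i = alpha overall.
         q_i, r_i = np.asarray(n).shape
         return log_bd_node(n, np.full((q_i, r_i), alpha / (r_i * q_i)))
     ```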

  5. Exhibits A and B

     [Figure: two DAGs over the nodes W, X, Y, Z. In both, X has parents Z and W; the second DAG (G+) adds the arc Y → X to the first (G−).]

  6. Exhibit A

     The sample frequencies ($n_{ijk}$) for $X \mid \Pi_X$ are:

        Z,W:    0,0   1,0   0,1   1,1
        X = 0:   2     1     1     2
        X = 1:   1     2     2     1

     and those for $X \mid \Pi_X \cup Y$ are as follows.

        Z,W,Y:  0,0,0  1,0,0  0,1,0  1,1,0  0,0,1  1,0,1  0,1,1  1,1,1
        X = 0:    2      1      1      2      0      0      0      0
        X = 1:    1      2      2      1      0      0      0      0

     Even though $X \mid \Pi_X$ and $X \mid \Pi_X \cup Y$ have the same entropy,

     $$\mathrm{H}(X \mid \Pi_X) = \mathrm{H}(X \mid \Pi_X \cup Y) = 4\left[ -\tfrac{1}{3}\log\tfrac{1}{3} - \tfrac{2}{3}\log\tfrac{2}{3} \right] = 2.546\ldots$$

  7. Exhibit A

     ... $\mathcal{G}^-$ has a higher entropy than $\mathcal{G}^+$ a posteriori ...

     $$\mathrm{H}(X \mid \Pi_X; \alpha) = 4\left[ -\frac{1 + 1/8}{3 + 1/4}\log\frac{1 + 1/8}{3 + 1/4} - \frac{2 + 1/8}{3 + 1/4}\log\frac{2 + 1/8}{3 + 1/4} \right] = 2.580$$

     $$\mathrm{H}(X \mid \Pi_X \cup Y; \alpha) = 4\left[ -\frac{1 + 1/16}{3 + 1/8}\log\frac{1 + 1/16}{3 + 1/8} - \frac{2 + 1/16}{3 + 1/8}\log\frac{2 + 1/16}{3 + 1/8} \right] = 2.564$$

     ... and BDeu with $\alpha = 1$ chooses accordingly, and things fortunately work out:

     $$\mathrm{BDeu}(X \mid \Pi_X) = \left[ \frac{\Gamma(1/4)}{\Gamma(1/4 + 3)} \cdot \frac{\Gamma(1/8 + 2)}{\Gamma(1/8)} \cdot \frac{\Gamma(1/8 + 1)}{\Gamma(1/8)} \right]^4 = 3.906 \times 10^{-7},$$

     $$\mathrm{BDeu}(X \mid \Pi_X \cup Y) = \left[ \frac{\Gamma(1/8)}{\Gamma(1/8 + 3)} \cdot \frac{\Gamma(1/16 + 2)}{\Gamma(1/16)} \cdot \frac{\Gamma(1/16 + 1)}{\Gamma(1/16)} \right]^4 = 3.721 \times 10^{-8}.$$
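     These two numbers can be checked with the log_bdeu_node() sketch above. The count tables are transcribed from the slide, with the four unobserved parent configurations of $\Pi_X \cup Y$ entered as all-zero rows:

     ```python
     # Exhibit A: rows are parent configurations, columns are X = 0 and X = 1.
     n_A_small = np.array([[2, 1], [1, 2], [1, 2], [2, 1]])   # X | Z,W
     n_A_big = np.vstack([n_A_small, np.zeros((4, 2), int)])  # X | Z,W,Y

     print(np.exp(log_bdeu_node(n_A_small)))  # ~3.906e-07
     print(np.exp(log_bdeu_node(n_A_big)))    # ~3.721e-08, so BDeu prefers G-
     ```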

  8. Exhibit B

     The sample frequencies for $X \mid \Pi_X$ are:

        Z,W:    0,0   1,0   0,1   1,1
        X = 0:   3     0     0     3
        X = 1:   0     3     3     0

     and those for $X \mid \Pi_X \cup Y$ are as follows.

        Z,W,Y:  0,0,0  1,0,0  0,1,0  1,1,0  0,0,1  1,0,1  0,1,1  1,1,1
        X = 0:    3      0      0      3      0      0      0      0
        X = 1:    0      3      3      0      0      0      0      0

     The conditional entropy of X is equal to zero for both $\mathcal{G}^+$ and $\mathcal{G}^-$, since the value of X is completely determined by the configuration of its parents in both cases.

  9. Exhibit B

     Again, the posterior entropies for $\mathcal{G}^+$ and $\mathcal{G}^-$ differ:

     $$\mathrm{H}(X \mid \Pi_X; \alpha) = 4\left[ -\frac{0 + 1/8}{3 + 1/4}\log\frac{0 + 1/8}{3 + 1/4} - \frac{3 + 1/8}{3 + 1/4}\log\frac{3 + 1/8}{3 + 1/4} \right] = 0.652,$$

     $$\mathrm{H}(X \mid \Pi_X \cup Y; \alpha) = 4\left[ -\frac{0 + 1/16}{3 + 1/8}\log\frac{0 + 1/16}{3 + 1/8} - \frac{3 + 1/16}{3 + 1/8}\log\frac{3 + 1/16}{3 + 1/8} \right] = 0.392.$$

     However, BDeu with $\alpha = 1$ yields

     $$\mathrm{BDeu}(X \mid \Pi_X) = \left[ \frac{\Gamma(1/4)}{\Gamma(1/4 + 3)} \cdot \frac{\Gamma(1/8 + 3)}{\Gamma(1/8)} \right]^4 = 0.032,$$

     $$\mathrm{BDeu}(X \mid \Pi_X \cup Y) = \left[ \frac{\Gamma(1/8)}{\Gamma(1/8 + 3)} \cdot \frac{\Gamma(1/16 + 3)}{\Gamma(1/16)} \right]^4 = 0.044,$$

     preferring $\mathcal{G}^+$ over $\mathcal{G}^-$ even though the additional arc Y → X does not provide any additional information on the distribution of X, and even though 4 out of 8 conditional distributions in $X \mid \Pi_X \cup Y$ are not observed at all in the data.
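     The same check for Exhibit B (continuing the sketch above) reproduces the inversion, with the larger DAG winning despite the extra arc carrying no information:

     ```python
     # Exhibit B: X is a deterministic function of (Z, W) in both DAGs.
     n_B_small = np.array([[3, 0], [0, 3], [0, 3], [3, 0]])   # X | Z,W
     n_B_big = np.vstack([n_B_small, np.zeros((4, 2), int)])  # X | Z,W,Y

     print(np.exp(log_bdeu_node(n_B_small)))  # ~0.0326
     print(np.exp(log_bdeu_node(n_B_big)))    # ~0.0441 > 0.0326: BDeu prefers G+
     ```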

  10. Better Than BDeu: Bayesian Dirichlet Sparse (BDs)

     If the positivity assumption is violated or the sample size n is small, there may be configurations of some $\Pi_{X_i}$ that are not observed in $\mathcal{D}$. Writing $\alpha^* = \alpha/(r_i q_i)$,

     $$\mathrm{BDeu}(X_i, \Pi_{X_i}; \alpha) = \prod_{j:\, n_{ij} = 0} \underbrace{\frac{\Gamma(r_i \alpha^*)}{\Gamma(r_i \alpha^*)} \prod_{k=1}^{r_i} \frac{\Gamma(\alpha^* + n_{ijk})}{\Gamma(\alpha^*)}}_{=\,1} \;\cdot\; \prod_{j:\, n_{ij} > 0} \frac{\Gamma(r_i \alpha^*)}{\Gamma(r_i \alpha^* + n_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(\alpha^* + n_{ijk})}{\Gamma(\alpha^*)},$$

     where the terms for unobserved parent configurations cancel (all their $n_{ijk}$ are zero), so the effective imaginary sample size decreases as the number of unobserved parent configurations increases. We can prevent that by replacing $\alpha_{ijk}$ with

     $$\tilde{\alpha}_{ijk} = \begin{cases} \alpha/(r_i \tilde{q}_i) & \text{if } n_{ij} > 0 \\ 0 & \text{otherwise,} \end{cases} \qquad \tilde{q}_i = \{\text{number of configurations of } \Pi_{X_i} \text{ such that } n_{ij} > 0\},$$

     and we plug it into BD instead of $\alpha_{ijk} = \alpha/(r_i q_i)$ to obtain BDs. Then

     $$\mathrm{BDs}(X_i, \Pi_{X_i}; \alpha) = \mathrm{BDeu}(X_i, \Pi_{X_i}; \alpha q_i / \tilde{q}_i).$$
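     Using the identity on this slide, BDs can be sketched as BDeu with the imaginary sample size rescaled by $q_i/\tilde{q}_i$ (again a toy helper of mine, not the reference implementation):

     ```python
     def log_bds_node(n, alpha=1.0):
         # q~_i = number of parent configurations actually observed in the data.
         # Unobserved configurations contribute nothing to BDeu, so rescaling the
         # imaginary sample size by q_i / q~_i reproduces BDs exactly.
         n = np.asarray(n)
         q_i = n.shape[0]
         q_tilde = np.count_nonzero(n.sum(axis=1))
         return log_bdeu_node(n, alpha * q_i / q_tilde)
     ```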

  11. BDeu and BDs Compared

     [Figure: the prior allocations used by BDeu and BDs shown side by side over the contingency table. Cells corresponding to $(X_i, \Pi_{X_i})$ combinations that are not observed in the data are shown in red; observed combinations are shown in green.]

  12. Exhibits A and B, Once More

     BDs does not suffer from the bias arising from $\tilde{q}_i < q_i$, and it correctly assigns the same score to $\mathcal{G}^-$ and $\mathcal{G}^+$ in both examples,

     $$\mathrm{BDs}(X \mid \Pi_X) = \mathrm{BDs}(X \mid \Pi_X \cup Y) = 3.906 \times 10^{-7} \quad \text{(Exhibit A)},$$

     $$\mathrm{BDs}(X \mid \Pi_X) = \mathrm{BDs}(X \mid \Pi_X \cup Y) = 0.03262 \quad \text{(Exhibit B)},$$

     following the maximum entropy principle.

     [Figure: two panels plotting the Bayes factor as a function of log10(α) over the range −4 to 4, for BDeu and BDs, one panel per exhibit.]
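     Continuing the sketch from the previous slides, BDs indeed scores $\mathcal{G}^-$ and $\mathcal{G}^+$ identically in both exhibits:

     ```python
     print(np.exp(log_bds_node(n_A_small)), np.exp(log_bds_node(n_A_big)))  # both ~3.906e-07
     print(np.exp(log_bds_node(n_B_small)), np.exp(log_bds_node(n_B_big)))  # both ~0.0326
     ```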

  13. Entropy and BDeu

     In a Bayesian setting, the conditional entropy H(·) of $X \mid \Pi_X$, given a uniform Dirichlet prior with imaginary sample size α over the cell probabilities, is

     $$\mathrm{H}(X \mid \Pi_X; \alpha) = -\sum_{j:\, n_{ij} > 0} \sum_{k=1}^{r_i} p^{(\alpha_i^*)}_{ij|k} \log p^{(\alpha_i^*)}_{ij|k} \quad \text{with} \quad p^{(\alpha_i^*)}_{ij|k} = \frac{\alpha_i^* + n_{ijk}}{r_i \alpha_i^* + n_{ij}},$$

     and $\mathrm{H}(X \mid \Pi_X; \alpha) > \mathrm{H}(X \mid \Pi_X; \beta)$ if $\alpha > \beta$ and $X \mid \Pi_X$ is not a uniform distribution. Let $\alpha/(r_i q_i) \to 0$ and let $\alpha > \beta > 0$. Then

     $$\mathrm{BDeu}(X \mid \Pi_X; \alpha) > \mathrm{BDeu}(X \mid \Pi_X; \beta) > 0 \quad \text{if } d_{\mathrm{EP}}(X_i, \mathcal{G}) > 0,$$

     $$\mathrm{BDeu}(X \mid \Pi_X; \alpha) = \left(\frac{1}{r_i}\right)^{\tilde{q}_i} \quad \text{if } d_{\mathrm{EP}}(X_i, \mathcal{G}) = 0.$$
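     The posterior entropy on this slide can be sketched as follows (my own helper, summing over observed parent configurations only); it reproduces the 2.580 and 2.564 of Exhibit A:

     ```python
     def posterior_entropy(n, alpha=1.0):
         # H(X | Pi_X; alpha) with p_ij|k = (alpha* + n_ijk) / (r_i alpha* + n_ij),
         # where alpha* = alpha / (r_i q_i) as under BDeu.
         n = np.asarray(n, dtype=float)
         q_i, r_i = n.shape
         a_star = alpha / (r_i * q_i)
         h = 0.0
         for row in n:
             n_ij = row.sum()
             if n_ij == 0:
                 continue  # unobserved configurations are skipped
             p = (a_star + row) / (r_i * a_star + n_ij)
             h -= np.sum(p * np.log(p))
         return h

     print(posterior_entropy(n_A_small))  # ~2.580 (Exhibit A, X | Z,W)
     print(posterior_entropy(n_A_big))    # ~2.564 (Exhibit A, X | Z,W,Y)
     ```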

  14. To Sum It Up in a Theorem

     Let $\mathcal{G}^+$ and $\mathcal{G}^-$ be two DAGs differing by a single arc $X_j \to X_i$, and let $\alpha/(r_i q_i) \to 0$. Then the Bayes factor computed using BDs corresponds to the Bayes factor computed using BDeu weighted by the following implicit prior ratio:

     $$\frac{\mathrm{P}(\mathcal{G}^+)}{\mathrm{P}(\mathcal{G}^-)} = \frac{(q_i/\tilde{q}_i)^{d_{\mathrm{EP}}(X_i,\, \mathcal{G}^+)}}{(q'_i/\tilde{q}'_i)^{d_{\mathrm{EP}}(X_i,\, \mathcal{G}^-)}},$$

     and can be written as

     $$\frac{\mathrm{BDs}(X_i, \Pi_{X_i} \cup X_j; \alpha)}{\mathrm{BDs}(X_i, \Pi_{X_i}; \alpha)} = \frac{(q_i/\tilde{q}_i)^{d_{\mathrm{EP}}(X_i,\, \mathcal{G}^+)}\, \alpha^{d_{\mathrm{EP}}(\mathcal{G}^+)}}{(q'_i/\tilde{q}'_i)^{d_{\mathrm{EP}}(X_i,\, \mathcal{G}^-)}\, \alpha^{d_{\mathrm{EP}}(\mathcal{G}^-)}} \to \begin{cases} 0 & \text{if } d_{\mathrm{EDF}} > -\log_\alpha\left(\mathrm{P}(\mathcal{G}^+)/\mathrm{P}(\mathcal{G}^-)\right) \\ +\infty & \text{if } d_{\mathrm{EDF}} < -\log_\alpha\left(\mathrm{P}(\mathcal{G}^+)/\mathrm{P}(\mathcal{G}^-)\right). \end{cases}$$
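     As a worked instance (added here, using only quantities from the exhibits): in Exhibits A and B the candidate parent set $\Pi_X \cup Y$ has $q_i = 8$ configurations of which $\tilde{q}_i = 4$ are observed, while $\Pi_X$ has $q'_i = \tilde{q}'_i = 4$, so the implicit prior ratio reduces to

     $$\frac{\mathrm{P}(\mathcal{G}^+)}{\mathrm{P}(\mathcal{G}^-)} = \frac{(8/4)^{d_{\mathrm{EP}}(X_i,\, \mathcal{G}^+)}}{(4/4)^{d_{\mathrm{EP}}(X_i,\, \mathcal{G}^-)}} = 2^{\,d_{\mathrm{EP}}(X_i,\, \mathcal{G}^+)}.$$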
