

  1. Dirichlet Bayesian Network Scores and the Maximum Entropy Principle
     Marco Scutari <scutari@stats.ox.ac.uk>
     Department of Statistics, University of Oxford
     September 21, 2017

  2. Bayesian Network Structure Learning

Learning a BN $\mathcal{B} = (\mathcal{G}, \Theta)$ from a data set $\mathcal{D}$ is performed in two steps:

$$P(\mathcal{B} \mid \mathcal{D}) = P(\mathcal{G}, \Theta \mid \mathcal{D}) = \underbrace{P(\mathcal{G} \mid \mathcal{D})}_{\text{structure learning}} \cdot \underbrace{P(\Theta \mid \mathcal{G}, \mathcal{D})}_{\text{parameter learning}}.$$

In a Bayesian setting, structure learning consists in finding the DAG with the best $P(\mathcal{G} \mid \mathcal{D})$ (BIC [6] is a common alternative) with some heuristic search algorithm. We can decompose $P(\mathcal{G} \mid \mathcal{D})$ into

$$P(\mathcal{G} \mid \mathcal{D}) \propto P(\mathcal{G})\, P(\mathcal{D} \mid \mathcal{G}) = P(\mathcal{G}) \int P(\mathcal{D} \mid \mathcal{G}, \Theta)\, P(\Theta \mid \mathcal{G})\, d\Theta$$

where $P(\mathcal{G})$ is the prior distribution over the space of DAGs and $P(\mathcal{D} \mid \mathcal{G})$ is the marginal likelihood of the data given $\mathcal{G}$, averaged over all possible parameter sets $\Theta$; and then

$$P(\mathcal{D} \mid \mathcal{G}) = \prod_{i=1}^{N} \int P(X_i \mid \Pi_{X_i}, \Theta_{X_i})\, P(\Theta_{X_i} \mid \Pi_{X_i})\, d\Theta_{X_i}$$

where $\Pi_{X_i}$ are the parents of $X_i$ in $\mathcal{G}$.
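Since the marginal likelihood factorises over nodes, a decomposable score can be evaluated one node at a time from each node's parent set. The following minimal Python sketch illustrates that decomposition; the names `log_network_score` and `local_score` are mine (not from the talk), and the placeholder local score is only there to make the example runnable.

```python
from typing import Callable, Dict, List
import math

def log_network_score(parents: Dict[str, List[str]],
                      local_score: Callable[[str, List[str]], float]) -> float:
    """Log-score of a DAG: the marginal likelihood factorises over nodes,
    so the log-score is the sum of per-node local log-scores."""
    return sum(local_score(node, pa) for node, pa in parents.items())

# Example with a dummy local score, just to show the shape of the call.
dag = {"W": [], "Z": [], "Y": [], "X": ["Z", "W"]}
print(log_network_score(dag, lambda node, pa: -math.log(2) * (1 + len(pa))))
```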

  3. The Bayesian Dirichlet Marginal Likelihood

If $\mathcal{D}$ contains no missing values and assuming:

• a Dirichlet conjugate prior ($X_i \mid \Pi_{X_i} \sim \mathrm{Mult}(\Theta_{X_i} \mid \Pi_{X_i})$ and $\Theta_{X_i} \mid \Pi_{X_i} \sim \mathrm{Dir}(\alpha_{ijk})$, with $\sum_{jk} \alpha_{ijk} = \alpha_i$ the imaginary sample size);
• positivity (all conditional probabilities $\pi_{ijk} > 0$);
• parameter independence ($\pi_{ijk}$ for different parent configurations are independent) and parameter modularity ($\pi_{ijk}$ in different nodes are independent);

Heckerman et al. [4] derived a closed-form expression for $P(\mathcal{D} \mid \mathcal{G})$:

$$\mathrm{BD}(\mathcal{G}, \mathcal{D}; \alpha) = \prod_{i=1}^{N} \mathrm{BD}(X_i, \Pi_{X_i}; \alpha_i) = \prod_{i=1}^{N} \prod_{j=1}^{q_i} \left[ \frac{\Gamma(\alpha_{ij})}{\Gamma(\alpha_{ij} + n_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(\alpha_{ijk} + n_{ijk})}{\Gamma(\alpha_{ijk})} \right]$$

where $r_i$ is the number of states of $X_i$; $q_i$ is the number of configurations of $\Pi_{X_i}$; $n_{ij} = \sum_k n_{ijk}$; and $\alpha_{ij} = \sum_k \alpha_{ijk}$.
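As a concrete companion to the formula above, here is a small Python sketch (my own, not from the talk) that evaluates the local BD term for a single node in log space, given the observed counts $n_{ijk}$ and arbitrary hyperparameters $\alpha_{ijk}$.

```python
from math import lgamma

def log_bd_local(counts, alpha):
    """Log BD(X_i, Pi_X_i; alpha) for one node.

    `counts[j][k]` is n_ijk (parent configuration j, child state k) and
    `alpha[j][k]` is the matching Dirichlet hyperparameter alpha_ijk.
    """
    log_score = 0.0
    for n_ij_row, a_ij_row in zip(counts, alpha):
        n_ij, a_ij = sum(n_ij_row), sum(a_ij_row)
        log_score += lgamma(a_ij) - lgamma(a_ij + n_ij)
        for n_ijk, a_ijk in zip(n_ij_row, a_ij_row):
            log_score += lgamma(a_ijk + n_ijk) - lgamma(a_ijk)
    return log_score
```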

  4. Bayesian Dirichlet Equivalent Uniform (BDeu)

The most common implementation of BD assumes $\alpha_{ijk} = \alpha / (r_i q_i)$, $\alpha = \alpha_i$, and is known from [4] as the Bayesian Dirichlet equivalent uniform (BDeu) marginal likelihood. However, there is evidence that assuming a flat prior over the parameters can be problematic:

• The prior is actually not uninformative [5].
• MAP DAGs selected using BDeu are highly sensitive to the choice of $\alpha$ and can have markedly different numbers of arcs even for reasonable values of $\alpha$ [8].
• In the limits $\alpha \to 0$ and $\alpha \to \infty$ it is possible to obtain both very simple and very complex DAGs, and model comparison may be inconsistent for small $\mathcal{D}$ and small $\alpha$ [8, 10].
• The sparseness of the MAP network is determined by a complex interaction between $\alpha$ and $\mathcal{D}$ [10, 12].
• There are formal proofs of all this in [11, 12].
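For reference, the BDeu prior simply fills every cell of the hyperparameter table with $\alpha / (r_i q_i)$. A short sketch building on `log_bd_local` from the previous block (both helper names are mine, not the talk's):

```python
def bdeu_alpha(r_i: int, q_i: int, alpha: float = 1.0):
    """BDeu hyperparameters: alpha_ijk = alpha / (r_i * q_i) for every cell."""
    return [[alpha / (r_i * q_i)] * r_i for _ in range(q_i)]

def log_bdeu_local(counts, alpha=1.0):
    """Log BDeu local score: BD with the uniform hyperparameters above."""
    q_i, r_i = len(counts), len(counts[0])
    return log_bd_local(counts, bdeu_alpha(r_i, q_i, alpha))
```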

  5. Exhibits A and B

[Figure: two DAGs over the variables W, X, Y, Z. In $\mathcal{G}^-$ the node X has parents Z and W; in $\mathcal{G}^+$ it additionally has the parent Y.]

  6. Exhibit A

The sample frequencies ($n_{ijk}$) for $X \mid \Pi_X$ are:

  Z,W      0,0   1,0   0,1   1,1
  X = 0     2     1     1     2
  X = 1     1     2     2     1

and those for $X \mid \Pi_X \cup Y$ are as follows.

  Z,W,Y    0,0,0   1,0,0   0,1,0   1,1,0   0,0,1   1,0,1   0,1,1   1,1,1
  X = 0      2       1       1       0       0       0       0       2
  X = 1      1       2       2       0       0       0       0       1

Even though $X \mid \Pi_X$ and $X \mid \Pi_X \cup Y$ have the same empirical entropy,

$$\mathrm{H}(X \mid \Pi_X) = \mathrm{H}(X \mid \Pi_X \cup Y) = 4 \left[ -\frac{1}{3} \log \frac{1}{3} - \frac{2}{3} \log \frac{2}{3} \right] = 2.546\ldots$$
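A quick numerical check of the entropy value quoted above (my own sketch: it sums the entropies, in nats, of the empirical conditional distributions over the observed parent configurations, which is how the quantity above is computed):

```python
from math import log

def summed_conditional_entropy(counts):
    """Sum, over observed parent configurations, of the entropy (in nats)
    of the empirical conditional distribution of X."""
    total = 0.0
    for row in counts:
        n = sum(row)
        if n == 0:
            continue  # unobserved parent configuration contributes nothing
        total += -sum((c / n) * log(c / n) for c in row if c > 0)
    return total

# Exhibit A counts, one row per parent configuration, one column per state of X.
a_minus = [[2, 1], [1, 2], [1, 2], [2, 1]]              # X | {Z, W}
a_plus  = [[2, 1], [1, 2], [1, 2], [0, 0],
           [0, 0], [0, 0], [0, 0], [2, 1]]              # X | {Z, W, Y}

print(summed_conditional_entropy(a_minus))   # ~2.546
print(summed_conditional_entropy(a_plus))    # ~2.546
```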

  7. Exhibit A

... $\mathcal{G}^-$ has a higher entropy than $\mathcal{G}^+$ a posteriori with $\alpha = 1$ ...

$$\mathrm{H}(X \mid \Pi_X; \alpha) = 4 \left[ -\frac{1 + 1/8}{3 + 1/4} \log \frac{1 + 1/8}{3 + 1/4} - \frac{2 + 1/8}{3 + 1/4} \log \frac{2 + 1/8}{3 + 1/4} \right] = 2.580,$$

$$\mathrm{H}(X \mid \Pi_X \cup Y; \alpha) = 4 \left[ -\frac{1 + 1/16}{3 + 1/8} \log \frac{1 + 1/16}{3 + 1/8} - \frac{2 + 1/16}{3 + 1/8} \log \frac{2 + 1/16}{3 + 1/8} \right] = 2.564$$

... and BDeu with $\alpha = 1$ chooses accordingly, so things fortunately work out:

$$\mathrm{BDeu}(X \mid \Pi_X) = \left[ \frac{\Gamma(1/4)}{\Gamma(1/4 + 3)} \cdot \frac{\Gamma(1/8 + 2)}{\Gamma(1/8)} \cdot \frac{\Gamma(1/8 + 1)}{\Gamma(1/8)} \right]^4 = 3.906 \times 10^{-7},$$

$$\mathrm{BDeu}(X \mid \Pi_X \cup Y) = \left[ \frac{\Gamma(1/8)}{\Gamma(1/8 + 3)} \cdot \frac{\Gamma(1/16 + 2)}{\Gamma(1/16)} \cdot \frac{\Gamma(1/16 + 1)}{\Gamma(1/16)} \right]^4 = 3.721 \times 10^{-8}.$$
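The two BDeu values above can be reproduced with the earlier sketches (again my own code, reusing the hypothetical `log_bdeu_local` helper and the Exhibit A count tables `a_minus` and `a_plus` defined previously):

```python
from math import exp

print(exp(log_bdeu_local(a_minus)))  # ~3.906e-07, X | {Z, W}
print(exp(log_bdeu_local(a_plus)))   # ~3.721e-08, X | {Z, W, Y}
```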

  8. Exhibit B

The sample frequencies for $X \mid \Pi_X$ are:

  Z,W      0,0   1,0   0,1   1,1
  X = 0     3     0     0     3
  X = 1     0     3     3     0

and those for $X \mid \Pi_X \cup Y$ are as follows.

  Z,W,Y    0,0,0   1,0,0   0,1,0   1,1,0   0,0,1   1,0,1   0,1,1   1,1,1
  X = 0      3       0       0       0       0       0       0       3
  X = 1      0       3       3       0       0       0       0       0

The empirical entropy of X is equal to zero for both $\mathcal{G}^+$ and $\mathcal{G}^-$, since the value of X is completely determined by the configurations of its parents in both cases.

  9. Exhibit B

Again, the posterior entropies for $\mathcal{G}^+$ and $\mathcal{G}^-$ differ:

$$\mathrm{H}(X \mid \Pi_X; \alpha) = 4 \left[ -\frac{0 + 1/8}{3 + 1/4} \log \frac{0 + 1/8}{3 + 1/4} - \frac{3 + 1/8}{3 + 1/4} \log \frac{3 + 1/8}{3 + 1/4} \right] = 0.652,$$

$$\mathrm{H}(X \mid \Pi_X \cup Y; \alpha) = 4 \left[ -\frac{0 + 1/16}{3 + 1/8} \log \frac{0 + 1/16}{3 + 1/8} - \frac{3 + 1/16}{3 + 1/8} \log \frac{3 + 1/16}{3 + 1/8} \right] = 0.392.$$

However, BDeu with $\alpha = 1$ yields

$$\mathrm{BDeu}(X \mid \Pi_X) = \left[ \frac{\Gamma(1/4)}{\Gamma(1/4 + 3)} \cdot \frac{\Gamma(1/8 + 3)}{\Gamma(1/8)} \cdot \frac{\Gamma(1/8)}{\Gamma(1/8)} \right]^4 = 0.032,$$

$$\mathrm{BDeu}(X \mid \Pi_X \cup Y) = \left[ \frac{\Gamma(1/8)}{\Gamma(1/8 + 3)} \cdot \frac{\Gamma(1/16 + 3)}{\Gamma(1/16)} \cdot \frac{\Gamma(1/16)}{\Gamma(1/16)} \right]^4 = 0.044,$$

preferring $\mathcal{G}^+$ over $\mathcal{G}^-$ even though the additional arc $Y \to X$ does not provide any additional information on the distribution of X, and even though 4 out of 8 conditional distributions in $X \mid \Pi_X \cup Y$ are not observed at all in the data.
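Exhibit B's scores can be checked the same way (my sketch, reusing `exp` and `log_bdeu_local` from the earlier blocks):

```python
# Exhibit B counts, one row per parent configuration, one column per state of X.
b_minus = [[3, 0], [0, 3], [0, 3], [3, 0]]              # X | {Z, W}
b_plus  = [[3, 0], [0, 3], [0, 3], [0, 0],
           [0, 0], [0, 0], [0, 0], [3, 0]]              # X | {Z, W, Y}

print(exp(log_bdeu_local(b_minus)))  # ~0.032: BDeu prefers ...
print(exp(log_bdeu_local(b_plus)))   # ~0.044: ... the larger DAG G+
```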

  10. Better Than BDeu: Bayesian Dirichlet Sparse (BDs)

If the positivity assumption is violated or the sample size n is small, there may be configurations of some $\Pi_{X_i}$ that are not observed in $\mathcal{D}$. In that case

$$\mathrm{BDeu}(X_i, \Pi_{X_i}; \alpha) = \left[ \prod_{j: n_{ij} = 0} \frac{\Gamma(\alpha_{ij})}{\Gamma(\alpha_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(\alpha_{ijk})}{\Gamma(\alpha_{ijk})} \right] \left[ \prod_{j: n_{ij} > 0} \frac{\Gamma(\alpha_{ij})}{\Gamma(\alpha_{ij} + n_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(\alpha_{ijk} + n_{ijk})}{\Gamma(\alpha_{ijk})} \right],$$

where the first factor cancels to one, so the effective imaginary sample size decreases as the number of unobserved parent configurations increases. We can prevent that by replacing $\alpha_{ijk}$ with

$$\tilde{\alpha}_{ijk} = \begin{cases} \alpha / (r_i \tilde{q}_i) & \text{if } n_{ij} > 0 \\ 0 & \text{otherwise,} \end{cases} \qquad \tilde{q}_i = \{\text{number of configurations of } \Pi_{X_i} \text{ such that } n_{ij} > 0\}$$

and plugging it into BD instead of $\alpha_{ijk} = \alpha / (r_i q_i)$ to obtain BDs. Then

$$\mathrm{BDs}(X_i, \Pi_{X_i}; \alpha) = \mathrm{BDeu}(X_i, \Pi_{X_i}; \alpha q_i / \tilde{q}_i).$$
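A direct Python sketch of this definition (the helper name is mine; it reuses `log_bd_local` from the earlier block). Parent configurations with $\tilde{\alpha}_{ijk} = 0$ and $n_{ijk} = 0$ contribute a factor of one, so they can simply be skipped:

```python
def log_bds_local(counts, alpha=1.0):
    """Log BDs local score: BD restricted to the observed parent
    configurations, with alpha_ijk = alpha / (r_i * q_tilde_i)."""
    r_i = len(counts[0])
    observed = [row for row in counts if sum(row) > 0]
    q_tilde = len(observed)  # number of parent configurations seen in the data
    bds_alpha = [[alpha / (r_i * q_tilde)] * r_i for _ in observed]
    return log_bd_local(observed, bds_alpha)
```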

  11. BDeu and BDs Compared

[Figure omitted. Caption: cells that correspond to $(X_i, \Pi_{X_i})$ combinations that are not observed in the data are in red, observed combinations are in green.]

  12. Exhibits A and B, Once More

BDs does not suffer from the bias arising from $\tilde{q}_i < q_i$, and it assigns the same score to $\mathcal{G}^-$ and $\mathcal{G}^+$ in both examples:

Exhibit A: $\mathrm{BDs}(X \mid \Pi_X) = \mathrm{BDs}(X \mid \Pi_X \cup Y) = 3.9 \times 10^{-7}$,
Exhibit B: $\mathrm{BDs}(X \mid \Pi_X) = \mathrm{BDs}(X \mid \Pi_X \cup Y) = 0.032$.

It also avoids giving wildly different Bayes factors depending on the value of $\alpha$.

[Figure: two panels plotting the Bayes factor against $\log_{10}(\alpha)$, comparing BDeu and BDs.]
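The following sketch (mine, reusing the helpers and count tables from the earlier blocks) checks the two equalities above and contrasts how a local Bayes factor for $\mathcal{G}^+$ versus $\mathcal{G}^-$ behaves as $\alpha$ varies; the exact Bayes factor plotted on this slide may be defined slightly differently, so treat this only as a qualitative illustration.

```python
# BDs assigns the same local score to G- and G+ in both exhibits ...
print(exp(log_bds_local(a_minus)), exp(log_bds_local(a_plus)))  # both ~3.9e-07
print(exp(log_bds_local(b_minus)), exp(log_bds_local(b_plus)))  # both ~0.032

# ... and its Bayes factor for G+ vs G- stays at 1, while BDeu's drifts with alpha.
for log10_alpha in (-4, -2, 0, 2, 4):
    alpha = 10.0 ** log10_alpha
    bf_bdeu = exp(log_bdeu_local(a_plus, alpha) - log_bdeu_local(a_minus, alpha))
    bf_bds = exp(log_bds_local(a_plus, alpha) - log_bds_local(a_minus, alpha))
    print(f"log10(alpha) = {log10_alpha:+d}: BDeu BF = {bf_bdeu:.4f}, BDs BF = {bf_bds:.4f}")
```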

  13. This Left Me with a Few Questions...

The obvious one being:

1. The behaviour of BDeu is certainly undesirable, but is it wrong?

Followed by:

2. Posterior entropy and BDeu rank $\mathcal{G}^-$ and $\mathcal{G}^+$ in the same order for Exhibit A, but they do not for Exhibit B. Why is that?

And the reason why I found that surprising is that:

3. Maximum (relative) entropy [7, 9, 1] represents a very general approach that includes Bayesian posterior estimation as a particular case [3]; it can also be seen as a particular case of MDL [2]. Hence, unless something is wrong with BDeu, I would expect the two to agree. Especially because we can use MDL (using BIC), MAP (using BDeu/BDs), ...

  14. Bayesian Statistics and Information Theory (I)

The derivation of the Bayesian posterior as a particular case of maximum (relative) entropy is made clear in Giffin and Caticha [3]. The selected joint posterior $P(X, \Theta)$ is that which maximises the relative entropy

$$S(P, P_{\mathrm{old}}) = - \int P(X, \Theta) \log \frac{P(X, \Theta)}{P_{\mathrm{old}}(X, \Theta)} \, dX \, d\Theta.$$

The family of posteriors that reflects the fact that X is now known to take value $x'$ is such that

$$P(X) = \int P(X, \Theta) \, d\Theta = \delta(X - x'),$$

which amounts to a (possibly infinite) number of constraints on $P(X, \Theta)$: for each possible value of X there is one constraint.

  15. Bayesian Statistics and Information Theory (II)

Maximising $S(P, P_{\mathrm{old}})$ subject to those constraints using Lagrange multipliers means maximising

$$S(P, P_{\mathrm{old}}) + \underbrace{\lambda_0 \left[ \int P \, dX \, d\Theta - 1 \right]}_{\text{normalising constraint}} + \underbrace{\int \lambda(X) \left[ \int P(X, \Theta) \, d\Theta - \delta(X - x') \right] dX}_{\text{one constraint for each value of } X}$$

and yields the familiar Bayesian update rule:

$$P_{\mathrm{new}}(X, \Theta) = P_{\mathrm{old}}(X, \Theta) \frac{\delta(X - x')}{P_{\mathrm{old}}(X)} = P_{\mathrm{old}}(\Theta \mid X) \, \delta(X - x').$$
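A small numerical illustration of this result (my own sketch, discrete instead of continuous, with arbitrary made-up dimensions): for a finite joint distribution $p_{\mathrm{old}}(x, \theta)$, the update rule becomes $p_{\mathrm{new}}(x, \theta) = \delta_{x, x'}\, p_{\mathrm{old}}(\theta \mid x')$, and among all distributions satisfying the constraint it attains the highest relative entropy.

```python
import numpy as np

rng = np.random.default_rng(0)

# A discrete joint prior p_old(x, theta) over 3 values of X and 4 of Theta.
p_old = rng.random((3, 4))
p_old /= p_old.sum()

x_obs = 1  # X is observed to take its second value

# Bayesian update written as the constrained maximum-entropy solution:
# p_new(x, theta) = delta(x, x_obs) * p_old(theta | x_obs).
p_new = np.zeros_like(p_old)
p_new[x_obs] = p_old[x_obs] / p_old[x_obs].sum()

def rel_entropy(p, q):
    """S(P, P_old) = -sum p log(p / q), with the convention 0 log 0 = 0."""
    mask = p > 0
    return -np.sum(p[mask] * np.log(p[mask] / q[mask]))

# Any other distribution satisfying sum_theta p(x, theta) = delta(x, x_obs)
# has relative entropy no larger than the Bayesian update.
for _ in range(5):
    alt = np.zeros_like(p_old)
    alt[x_obs] = rng.random(p_old.shape[1])
    alt[x_obs] /= alt[x_obs].sum()
    assert rel_entropy(alt, p_old) <= rel_entropy(p_new, p_old) + 1e-12

print("posterior p(theta | x'):", p_new[x_obs])
```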
