
On Testing Marginal versus Conditional Independence, Richard Guo (PowerPoint presentation)



  1. On Testing Marginal versus Conditional Independence. Richard Guo (ricguo@uw.edu), Nov 2019. Department of Statistics, University of Washington, Seattle.

  2. Introduction

  3. Motivation Inferring causal structures usually involves model selection among directed acyclic graphs (DAGs). While learning undirected graphical models is relatively well-developed (e.g., graphical lasso, neighborhood selection), model selection for DAGs is less well-understood. This poses a challenge to maintaining error guarantees in causal inference, even in large samples. In this talk, I will analyze the simplest example where such a challenge arises.

  4. Marginal vs. conditional independence Consider (X1, X2, X3)^⊤ ∼ N(0, Σ) on ℝ³, with covariance Σ ∈ S³, the set of 3 × 3 real positive definite matrices. We want to test between
      M0 : X1 ⊥⊥ X2 (X1 → X3 ← X2),
      M1 : X1 ⊥⊥ X2 | X3 (X1 − X3 − X2),
  assuming that at least one of them is true. X1 − X3 − X2 stands for the following Markov-equivalent DAGs: X1 ← X3 ← X2, X1 → X3 → X2, X1 ← X3 → X2.

  5. Marginal vs. conditional independence Testing M0 : X1 ⊥⊥ X2 vs. M1 : X1 ⊥⊥ X2 | X3 is a non-nested model selection problem. The two models correspond to equality/algebraic constraints on Σ = {σij}:
      M0 : σ12 = 0,
      M1 : σ12·3 = σ12 − σ13 σ33⁻¹ σ23 = 0 ⇔ σ12 σ33 = σ13 σ23.
  M0 and M1 intersect at the two axes: M0 ∩ M1 = {σ12 = σ13 = 0} ∪ {σ12 = σ23 = 0}.
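The two constraints can be checked mechanically. A minimal numpy sketch (the covariance below is hypothetical, chosen so that the M0 constraint holds but the M1 constraint does not):

```python
import numpy as np

# Hypothetical covariance from the collider X1 -> X3 <- X2, so sigma_12 = 0.
Sigma = np.array([[1.0, 0.0, 0.5],
                  [0.0, 1.0, 0.4],
                  [0.5, 0.4, 1.0]])

def holds_m0(S, tol=1e-9):
    # M0: sigma_12 = 0 (marginal independence of X1 and X2)
    return abs(S[0, 1]) < tol

def holds_m1(S, tol=1e-9):
    # M1: sigma_12 * sigma_33 = sigma_13 * sigma_23
    # (zero partial covariance of X1 and X2 given X3)
    return abs(S[0, 1] * S[2, 2] - S[0, 2] * S[1, 2]) < tol

print(holds_m0(Sigma), holds_m1(Sigma))  # True False
```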

  6. Geometry We visualize the parameter space in correlation coordinates: M0 : ρ12 = 0, M1 : ρ12 = ρ13 ρ23.

  7. Singularity The two axes further intersect at the origin, M_sing : {σ12 = σ13 = σ23 = 0}, which is a singularity. M_sing corresponds to diagonal Σ.
  • M0 ∩ M1 vs. S³: the likelihood-ratio test (LRT) was studied by Drton (2006, 2009) and Drton and Sullivant (2007); the LRT has a non-standard asymptotic distribution at M_sing.
  • M0 vs. M1: at M_sing, the tangent cones of the two models coincide. The models are called "1-equivalent" by Evans (2018), meaning that the linear approximations to the two parameter spaces are the same.
  • Within a Euclidean m^{−1/2}-ball around M_sing, m² samples are required to distinguish M0 and M1.

  8. Difficulty Model selection for DAGs is usually conducted by the following approaches (Drton and Maathuis, 2017).
  • Score-based: pick the model with the highest penalized likelihood score (e.g., AIC, BIC). Since dim(M0) = dim(M1), both AIC and BIC will pick the model with the higher likelihood.
  • Constraint-based: test M0 : X1 ⊥⊥ X2 vs. M1 : X1 ⊥⊥ X2 | X3. This is the approach adopted by the PC algorithm. For Gaussian data, Fisher's z-transformation of the partial correlation is used as the test statistic.
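As a sketch of the constraint-based approach, the following numpy simulation (illustrative coefficients and seed, not from the talk) computes Fisher's z-statistics for the marginal and partial correlations under the collider X1 → X3 ← X2, where M0 holds:

```python
import numpy as np

def fisher_z(r, n, k):
    # Fisher z-statistic for a (partial) correlation r from n samples with
    # k conditioning variables; approximately N(0, 1) when the true
    # (partial) correlation is zero.
    z = 0.5 * np.log((1 + r) / (1 - r))
    return np.sqrt(n - k - 3) * z

rng = np.random.default_rng(0)
n = 1000
# Simulate from the collider X1 -> X3 <- X2, so X1 and X2 are independent.
x1 = rng.standard_normal(n)
x2 = rng.standard_normal(n)
x3 = 0.8 * x1 + 0.8 * x2 + rng.standard_normal(n)

R = np.corrcoef(np.vstack([x1, x2, x3]))
r12 = R[0, 1]  # marginal correlation of X1, X2
# partial correlation of X1, X2 given X3
r12_3 = (R[0, 1] - R[0, 2] * R[1, 2]) / np.sqrt(
    (1 - R[0, 2] ** 2) * (1 - R[1, 2] ** 2))

# The conditional statistic should be far larger than the marginal one:
# conditioning on the collider X3 induces dependence between X1 and X2.
print(abs(fisher_z(r12, n, 0)), abs(fisher_z(r12_3, n, 1)))
```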

  9. Difficulty Simulated with n = 1,000, ρ = 0.3 and unit variances under level α = 0.05: starting from a strong edge of strength ρ into X3, a weak edge of strength γ n^{−1/2} is added, once under M0 \ M1 and once under M1 \ M0. [Figure: two panels (M0 \ M1 and M1 \ M0); size of BIC/AIC and PC plotted against |γ| from 0 to 10, on a 0 to 1 scale.]

  10. Method

  11. Likelihood ratio test for nested models Consider a parametric family {P_θ : θ ∈ Θ}, where Θ is an open subset of ℝ^d. For Θ0 ⊆ Θ, suppose we want to test H0 : θ ∈ Θ0 vs. H1 : θ ∈ Θ. Under regularity conditions, the likelihood ratio test (LRT) statistic
      λ_n = 2 ( sup_{θ∈Θ} ℓ_n(θ) − sup_{θ∈Θ0} ℓ_n(θ) ) ⇒ χ²_c,
  where c = d − dim(Θ0) and ℓ_n(·) is the log-likelihood under sample size n. For example, in the linear regression y ∼ β0 + β1 X1 + β2 X2 + β3 X3, we use χ²_2 for testing H0 : β0 = β1 = 0 vs. H1 : β ∈ ℝ⁴.
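The regression example can be sketched numerically; the design, coefficients, and seed below are illustrative assumptions, and the variance is profiled out so the LRT reduces to a ratio of residual sums of squares:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
X = rng.standard_normal((n, 3))
# Generate under H0: beta_0 = beta_1 = 0 (only X2 and X3 enter the mean).
y = 0.5 * X[:, 1] + 0.5 * X[:, 2] + rng.standard_normal(n)

def rss(A, y):
    # residual sum of squares of the least-squares fit of y on columns of A
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ beta
    return r @ r

full = np.hstack([np.ones((n, 1)), X])  # intercept + X1 + X2 + X3 (d = 4)
null = X[:, 1:]                         # X2 + X3 only (dim(Theta_0) = 2)

# Gaussian LRT with the variance profiled out: lambda_n = n log(RSS_0/RSS_1),
# approximately chi-squared with c = 4 - 2 = 2 degrees of freedom under H0.
lam = n * np.log(rss(null, y) / rss(full, y))
print(lam)
```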

  12. Likelihood ratio test Similarly, we define the log-likelihood ratio of M0 versus M1 as
      λ_n^{(0:1)} := 2 ( sup_{Σ∈M0} ℓ_n(Σ) − sup_{Σ∈M1} ℓ_n(Σ) ) = 2 ( ℓ_n(Σ̂_n^{(0)}) − ℓ_n(Σ̂_n^{(1)}) ),
  where Σ̂_n^{(0)}, Σ̂_n^{(1)} are the MLEs within M0 and M1 respectively, and ℓ_n(·) is the Gaussian log-likelihood
      ℓ_n(Σ) = (n/2) ( −log|Σ| − Tr(S_n Σ⁻¹) ).
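A direct numpy transcription of this log-likelihood; the sanity check at the end uses simulated data (an assumption, not from the talk) and the fact that the unrestricted maximizer is the sample covariance itself:

```python
import numpy as np

def gauss_loglik(Sigma, S, n):
    # l_n(Sigma) = (n/2) * (-log|Sigma| - tr(S_n Sigma^{-1})), as on the slide
    sign, logdet = np.linalg.slogdet(Sigma)
    return 0.5 * n * (-logdet - np.trace(S @ np.linalg.inv(Sigma)))

rng = np.random.default_rng(2)
X = rng.standard_normal((500, 3))
S = (X.T @ X) / 500  # sample covariance taken with respect to mean zero

# The unrestricted maximizer over positive definite matrices is Sigma = S,
# so any other candidate (e.g. the identity) scores no higher.
assert gauss_loglik(S, S, 500) >= gauss_loglik(np.eye(3), S, 500)
```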

  13. Likelihood ratio test The Gaussian MLEs for DAGs take a closed form (Drton and Richardson, 2008), which yields the following expression for the LRT:
      λ_n^{(0:1)} = n log( (s13² − s11 s33)(s23² − s22 s33) / s33 )
                    − n log( s11 s22 ( (s22 s13² − 2 s12 s23 s13 + s11 s23²) / (s12² − s11 s22) + s33 ) ),
  where S = {s_ij} is the sample covariance taken with respect to mean zero.
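A sketch of this closed form in numpy. The two sample covariances below are hypothetical, chosen to satisfy the M0 and M1 constraints exactly, so the statistic should favor (be positive for) M0 in the first case and favor M1 in the second:

```python
import numpy as np

def lrt_m0_vs_m1(S, n):
    # lambda_n^{(0:1)} from the closed-form MLEs; positive values favor M0.
    s11, s22, s33 = S[0, 0], S[1, 1], S[2, 2]
    s12, s13, s23 = S[0, 1], S[0, 2], S[1, 2]
    det_m1 = (s13**2 - s11 * s33) * (s23**2 - s22 * s33) / s33
    det_m0 = s11 * s22 * (
        (s22 * s13**2 - 2 * s12 * s23 * s13 + s11 * s23**2)
        / (s12**2 - s11 * s22)
        + s33
    )
    return n * np.log(det_m1) - n * np.log(det_m0)

# Hypothetical sample covariances (illustrative, not from the talk):
S_m0 = np.array([[1.0, 0.0, 0.5],
                 [0.0, 1.0, 0.4],
                 [0.5, 0.4, 1.0]])   # s12 = 0: M0 fits exactly
S_m1 = np.array([[1.0, 0.2, 0.5],
                 [0.2, 1.0, 0.4],
                 [0.5, 0.4, 1.0]])   # s12 * s33 = s13 * s23: M1 fits exactly
print(lrt_m0_vs_m1(S_m0, 100) > 0, lrt_m0_vs_m1(S_m1, 100) < 0)  # True True
```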

  14. Our plan
  1. An information-theoretic analysis of how well the two models can be distinguished (by any means).
  2. Identify the regimes of "effect size" relative to n in which the optimal error stays strictly between 0 and 1.
  • There, the LRT has a stable, non-degenerate asymptotic distribution.
  • We will be doing large-n, small-effect asymptotics!
  3. Derive the asymptotic distributions.
  • Are they uniform?
  4. Develop a model selection procedure with error guarantees.

  15. Optimal error We study the minimax rate of distinguishing two sequences of distributions, one within M0 and the other within M1, as they approach M0 ∩ M1.
  Lemma (testing two simple hypotheses): for testing H0 : X ∼ P versus H1 : X ∼ Q, the minimum sum of type-I and type-II errors is 1 − d_TV(P, Q), where the total variation distance is
      d_TV(P, Q) = sup_A |P(A) − Q(A)| = (1/2) ∫ |p − q| dμ.

  16. Optimal error Consider a sequence Σ_n^{(0)} ∈ M0 \ M1 with Σ_n^{(0)} → Σ* ∈ M0 ∩ M1, and let P_n = P_{Σ_n^{(0)}}. Correspondingly, let Q_n = P_{Σ_n^{(1)}} from M1 \ M0 such that
      Σ_n^{(1)} = argmin_{Σ ∈ M1\M0} D_KL( P_{Σ_n^{(0)}} ‖ P_Σ ),
  i.e., the member of M1 that is most difficult to distinguish from P_n. We then compute the total variation between the product measures P_n^n and Q_n^n (n iid samples). The limiting optimal error can be sandwiched by the Hellinger distance H(P, Q) := ( (1/2) ∫ (√p − √q)² dμ )^{1/2}:
      H²(P_n^n, Q_n^n) ≤ d_TV(P_n^n, Q_n^n) ≤ H(P_n^n, Q_n^n) √(2 − H²(P_n^n, Q_n^n)).
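This sandwich holds for any pair of distributions, so it can be sanity-checked numerically. The sketch below uses two arbitrary discrete distributions as a stand-in for the Gaussian product measures:

```python
import numpy as np

# Check the sandwich H^2 <= d_TV <= H * sqrt(2 - H^2) on two arbitrary
# discrete distributions p, q over 10 points.
rng = np.random.default_rng(3)
p = rng.random(10); p /= p.sum()
q = rng.random(10); q /= q.sum()

tv = 0.5 * np.abs(p - q).sum()                             # total variation
H = np.sqrt(0.5 * ((np.sqrt(p) - np.sqrt(q)) ** 2).sum())  # Hellinger

assert H**2 <= tv <= H * np.sqrt(2 - H**2)
```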

  17. Optimal error With some algebra, we have
      1 − d_TV(P_n^n, Q_n^n) → 0 if H(P_n, Q_n) = ω(n^{−1/2}),
      1 − d_TV(P_n^n, Q_n^n) → 1 if H(P_n, Q_n) = o(n^{−1/2}),
  and when H(P_n, Q_n) ≍ n^{−1/2},
      0 < liminf_n {1 − d_TV(P_n^n, Q_n^n)} ≤ limsup_n {1 − d_TV(P_n^n, Q_n^n)} < 1.
  Effect size: H(P_n, Q_n) ≍ ρ13,n ρ23,n, where ρij = σij / √(σii σjj) is the correlation coefficient.

  18. Optimal error Comparing H(P_n, Q_n) to n^{−1/2}, there are two regimes in which the asymptotic error stabilizes, i.e., {1 − d_TV(P_n^n, Q_n^n)} → c ∈ (0, 1):
  • "weak-strong": ρ13,n ≍ γ n^{−1/2} and ρ23,n → ρ23 ≠ 0 (or symmetrically, ρ23,n ≍ γ n^{−1/2} and ρ13,n → ρ13 ≠ 0);
  • "weak-weak": ρ13,n ρ23,n ≍ δ n^{−1/2} with ρ13,n, ρ23,n → 0.

  19. Asymptotics: weak-strong regime We study the (local) asymptotic distribution of λ_n^{(0:1)}. For r = γ √(σ11 σ33), we set
      Σ_n^{(0)} = ( σ11    0      r/√n )
                  ( 0      σ22    σ23  )
                  ( r/√n   σ23    σ33  ),
      Σ_n^{(1)} = ( σ11             (r/√n) σ23/σ33   r/√n )
                  ( (r/√n) σ23/σ33  σ22              σ23  )
                  ( r/√n            σ23              σ33  ),
      Σ_n^{(0)}, Σ_n^{(1)} → Σ* = ( σ11   0     0   )
                                  ( 0     σ22   σ23 )
                                  ( 0     σ23   σ33 ).
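These local sequences are easy to construct and check. A sketch with illustrative values (σii = 1, σ23 = 0.5, γ = 1 are assumptions for the example) verifying that Σ_n^{(0)} satisfies the M0 constraint, Σ_n^{(1)} satisfies the M1 constraint, and both converge to Σ*:

```python
import numpy as np

# Illustrative parameter values (not from the talk).
s11 = s22 = s33 = 1.0
s23 = 0.5
gamma = 1.0
r = gamma * np.sqrt(s11 * s33)

def sigma0(n):
    t = r / np.sqrt(n)
    return np.array([[s11, 0.0, t],
                     [0.0, s22, s23],
                     [t,   s23, s33]])

def sigma1(n):
    t = r / np.sqrt(n)
    return np.array([[s11, t * s23 / s33, t],
                     [t * s23 / s33, s22, s23],
                     [t,   s23, s33]])

n = 10_000
S0, S1 = sigma0(n), sigma1(n)
assert S0[0, 1] == 0.0                                    # M0: sigma_12 = 0
assert np.isclose(S1[0, 1] * S1[2, 2], S1[0, 2] * S1[1, 2])  # M1 constraint

# Both sequences converge to the same limit Sigma* as n grows.
Sigma_star = np.array([[s11, 0, 0], [0, s22, s23], [0, s23, s33]])
assert np.allclose(sigma0(10**12), Sigma_star, atol=1e-5)
assert np.allclose(sigma1(10**12), Sigma_star, atol=1e-5)
```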

  20. Asymptotics: weak-strong regime [Figure: the three DAGs on (X1, X2, X3) side by side, with correlations ρ12, ρ23, ρ13 marked; Σ_n^{(0)} ∈ M0 \ M1 with weak edge γ n^{−1/2} and strong edge ρ, Σ_n^{(1)} ∈ M1 \ M0, and the common limit Σ* ∈ M0 ∩ M1.]
