 
Two-Sample Instrumental Variable Analysis: Challenges and Some Progress

Qingyuan Zhao
Department of Statistics, The Wharton School, University of Pennsylvania
November 28, 2017

Outline

Some interesting history: Bristol → Admiral William Penn → William Penn → Pennsylvania (Penn's woods).

This talk is based on joint work with
  Jingshu Wang, Dylan Small (Penn).
  Jack Bowden (Bristol).

Manuscript and slides are available on my webpage http://www-stat.wharton.upenn.edu/~qyzhao/ .

Part 0: Primer of instrumental variable (IV) and Mendelian randomization (MR).
Part 1: Two-sample IV using heterogeneous samples.
Part 2: New methods for two-sample MR using GWAS summary statistics.

Causal inference

The general problem of causal inference
Without randomized controlled experiments, can we still estimate the causal effect of variable X on variable Y?

Three general identification strategies
  1. Condition on all common causes of X and Y.
  2. Study all causal mechanisms by which X influences Y.
  3. Use instrumental variables (IV) or natural experiments.

[Diagram: Z → X → M → Y with common cause C of X and Y. Edge labels match the strategies: 1 on C → X and C → Y, 2 on X → M and M → Y, 3 on Z → X.]

Instrumental variables

Core IV assumptions
  1. IV causes the exposure (X).
  2. IV is independent of the unmeasured confounder (C).
  3. IV cannot have any direct effect on the outcome (Y).

[Diagram: Z → X → Y with unmeasured confounder C of X and Y. Label 1 marks Z → X; crosses mark the excluded links between Z and C (2) and from Z directly to Y (3).]

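Why does IV work?

[Diagram: Z → X → Y with unmeasured confounder C of X and Y. The Z → X edge is labeled γ and the X → Y edge is labeled β; crosses mark the excluded links between Z and C and from Z directly to Y.]

Heuristic: the effect of Z on Y goes entirely through X.

Wald ratio estimator:
\[ \hat\beta = \frac{\mathrm{lm}(Y \sim Z)}{\mathrm{lm}(X \sim Z)}. \]

Two-stage least squares (LS):
\[ \hat\beta = \mathrm{lm}(Y \sim \hat X), \quad \text{where } \hat X = E[X \mid Z] = \mathrm{predict}(\mathrm{lm}(X \sim Z)). \]
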
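The two estimators on this slide can be sketched on simulated data. The slide uses R-style `lm` notation; the sketch below swaps in plain Python, and the data-generating model, the helper names (`mean`, `ols_slope`, `simulate`) and all numeric values are assumptions made only for illustration. With a single instrument the Wald ratio and two-stage LS coincide algebraically, which the code checks numerically.

```python
import random


def mean(values):
    return sum(values) / len(values)


def ols_slope(x, y):
    """Slope from regressing y on x with an intercept, i.e. lm(y ~ x)."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx


def simulate(n, beta, gamma, seed):
    """Hypothetical linear model with an unmeasured confounder c."""
    rng = random.Random(seed)
    z, x, y = [], [], []
    for _ in range(n):
        zi = rng.gauss(0.0, 1.0)  # instrument
        ci = rng.gauss(0.0, 1.0)  # unmeasured confounder of x and y
        xi = gamma * zi + ci + rng.gauss(0.0, 1.0)
        yi = beta * xi + 2.0 * ci + rng.gauss(0.0, 1.0)
        z.append(zi)
        x.append(xi)
        y.append(yi)
    return z, x, y


BETA = 0.5  # assumed true causal effect
z, x, y = simulate(n=20000, beta=BETA, gamma=1.0, seed=1)

# Naive regression of y on x is biased by the confounder.
naive = ols_slope(x, y)

# Wald ratio: lm(y ~ z) divided by lm(x ~ z).
gamma_hat = ols_slope(z, x)
wald = ols_slope(z, y) / gamma_hat

# Two-stage LS: regress y on the fitted values of x given z.
mz, mx = mean(z), mean(x)
x_hat = [mx + gamma_hat * (zi - mz) for zi in z]
tsls = ols_slope(x_hat, y)

print(f"naive={naive:.3f} wald={wald:.3f} tsls={tsls:.3f}")
```

In this assumed setup the naive slope is pulled away from 0.5 by the confounder, while both IV estimates stay close to it.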
Can we trust an IV analysis?

Success of an IV analysis depends on
  1. Using good instrument(s).
       Can we reasonably justify the core IV assumptions?
       Is the IV-exposure association strong enough?
  2. Statistical inference.
       Can we establish consistency and asymptotic normality?
  3. Robustness.
       Can we check if the data satisfy the modeling assumptions?
       How sensitive is the conclusion to violations of the identification and modeling assumptions?

Mendelian randomization (MR)

A brilliant idea [Katan, 1986; Davey Smith and Ebrahim, 2003]
Use genetic variants as IV.

Recall the three core IV assumptions:
  1. Need to find SNPs that are associated with the exposure.
  2. Independence of unmeasured confounder is self-evident. The only minor concern is population stratification.
  3. Direct effect on the outcome is possible (pleiotropy).

Next

Two great ideas
  1. Two-sample IV: don't need the full data (Z, X, Y) for all individuals.
       Use (Z, X, NA) to estimate lm(X ∼ Z).
       Use (Z, NA, Y) to estimate lm(Y ∼ Z).
       Dates back at least to Klevmarken [1982] (thanks to David Pacini). The best-known references are Angrist and Krueger [1992] and Inoue and Solon [2010].
  2. MR with GWAS summary statistics: don't need individual-level data.

Next:
  Part 1: What if the two samples are from different populations?
  Part 2: New statistical methods for two-sample MR.

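A minimal sketch of the two-sample idea follows. It assumes both samples come from the same population and the same hypothetical linear model; the helper names and numeric values are illustrative and not from the talk. Sample a contributes only (Z, X) and sample b contributes only (Z, Y).

```python
import random


def mean(values):
    return sum(values) / len(values)


def ols_slope(x, y):
    """Slope from regressing y on x with an intercept."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx


def draw_sample(n, beta, gamma, rng):
    """Hypothetical population: linear first stage, confounded outcome."""
    z, x, y = [], [], []
    for _ in range(n):
        zi = rng.gauss(0.0, 1.0)
        ci = rng.gauss(0.0, 1.0)  # unmeasured confounder
        xi = gamma * zi + ci + rng.gauss(0.0, 1.0)
        yi = beta * xi + 2.0 * ci + rng.gauss(0.0, 1.0)
        z.append(zi)
        x.append(xi)
        y.append(yi)
    return z, x, y


BETA = 0.5  # assumed true causal effect
rng = random.Random(2)

# Sample a: keep (Z, X), treat Y as missing.
z_a, x_a, _ = draw_sample(20000, BETA, 1.0, rng)
# Sample b: keep (Z, Y), treat X as missing.
z_b, _, y_b = draw_sample(20000, BETA, 1.0, rng)

gamma_hat = ols_slope(z_a, x_a)         # lm(X ~ Z) from sample a
reduced_form_hat = ols_slope(z_b, y_b)  # lm(Y ~ Z) from sample b
beta_two_sample = reduced_form_hat / gamma_hat

print(f"two-sample estimate: {beta_two_sample:.3f}")
```

The ratio needs only the two fitted slopes, which is why summary statistics can be enough when the samples are homogeneous.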
An example

An easy way to confirm heterogeneity of the two samples: check allele frequency.

  SNP         Gene     Allele   Frequency (sample a)   Frequency (sample b)
  rs12916     HMGCR    C        0.40                   0.43
  rs1564348   LPA      C        0.18                   0.16
  rs2072183   NPC1L1   C        0.29                   0.25
  rs2479409   PCSK9    G        0.32                   0.35

Table: The instrumental variables usually have different distributions in two-sample Mendelian randomization. The table includes four single nucleotide polymorphisms (SNPs) used in Hemani et al. [2016, Figure 2] to estimate the effect of low-density lipoprotein (LDL) cholesterol lowering on the risk of coronary heart disease.

Summary of results

Question
Is this a big problem (for identification and estimation)?
Surprisingly, little is known even though two-sample IV is widely used in econometrics.

Main messages
  Additional untestable assumptions are needed for identification.
  The IV analysis is no longer robust to a misspecified instrument-exposure model.
  The two-stage LS is not asymptotically efficient.

Some notations

Data: $(z_i^s, x_i^s, y_i^s)$, $i = 1, 2, \dots, n_s$, where $s \in \{a, b\}$ is the sample index.

The two-sample instrumental variable problem
Suppose only $Z^a$, $x^a$, $Z^b$, and $y^b$ are observed (in other words, $y^a$ and $x^b$ are not observed). If $x$ is endogenous, what can we learn about the exposure-outcome relationship by using the IVs $z$?

Message 1: Identification

  Assumption                         | Detail                                                                    | Case 1 | Case 2 | Case 3 | Case 4
  (1) Structural model               | Y ∼ X: $y_i^s = g^s(x_i^s, u_i^s)$; X ∼ Z: $x_i^s = f^s(z_i^s, v_i^s)$     | ✓      | ✓      | ✓      | ✓
  (2) Validity of IV                 | $z_i^s \perp (u_i^s, v_i^s)$                                              | ✓      | ✓      | ✓      | ✓
  (3.1) Linearity of Y ∼ X           | $g^b(x_i, u_i) = \beta^b x_i + u_i$                                       | ✓      | ✓      |        |
  (3.2) Linearity of X ∼ Z           | $f^s(z_i, v_i) = (\gamma^s)^T z_i + v_i$                                  | ✓      |        |        |
  (4) Structural invariance          | $f^a = f^b$                                                               | ✓      | ✓      | ✓      | ✓
  (5) Sampling homogeneity of noise  | $v_i^a \overset{d}{=} v_i^b$                                              |        |        | ✓      |
  (6) Additivity of X ∼ Z            | $f^s(z, v) = f_z^s(z) + f_v^s(v)$                                         |        | ✓      |        |
  (7) Monotonicity                   | $f^s(z, v)$ is monotone in $z$                                            |        |        | ✓      | ✓
  Identifiable estimand              |                                                                           | $\beta^b$ | $\beta^b$ | $\beta^b_{\mathrm{LATE}}$ | $\beta^{ab}_{\mathrm{LATE}}$

Table: Summary of some identification results and assumptions. Assumptions (4) and (5) are new due to heterogeneity and are untestable. Cases 3 and 4 consider binary IV and binary exposure. $\beta^b_{\mathrm{LATE}}$ is the local average treatment effect (LATE) in population b [Angrist, Imbens, and Rubin, 1996]. $\beta^{ab}_{\mathrm{LATE}} = \beta^b_{\mathrm{LATE}} \times P^b(\text{complier}) / P^a(\text{complier})$.

A robustness property of one-sample IV

A well-known fact
In one-sample IV analysis, two-stage LS is robust against a misspecified IV-exposure model.

Why? $\beta$ can be identified by the estimating equation
\[ E[h(z)(y - x\beta)] = 0 \]
for any function $h$ of $z$.

IV estimate:
\[ \hat\beta_h = \Big( \sum_{i=1}^n y_i h(z_i) \Big) \Big/ \Big( \sum_{i=1}^n x_i h(z_i) \Big). \]

  Consistent and asymptotically normal if $\mathrm{Cov}(x, h(z)) \neq 0$.
  The most efficient choice is $h^*(z) = E[x \mid z]$.
  Two-stage LS: $h(z) = z^T \gamma$ is the best linear approximation to $h^*(z)$.

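The robustness claim can be illustrated with a small simulation: several choices of h give estimates close to the same β even though the first stage is nonlinear. This is a sketch under assumed conditions. The data-generating model (a first stage with a `sin` term, mean-zero errors independent of z) and the three h functions are hypothetical choices made for illustration.

```python
import math
import random


def iv_estimate(z, x, y, h):
    """Solve the sample estimating equation sum h(z_i) (y_i - x_i * beta) = 0."""
    hz = [h(zi) for zi in z]
    numerator = sum(yi * hi for yi, hi in zip(y, hz))
    denominator = sum(xi * hi for xi, hi in zip(x, hz))
    return numerator / denominator


BETA = 0.5  # assumed true causal effect
rng = random.Random(3)
z, x, y = [], [], []
for _ in range(20000):
    zi = rng.gauss(0.0, 1.0)
    ci = rng.gauss(0.0, 1.0)  # unmeasured confounder
    # Nonlinear first stage, so a linear IV-exposure model is misspecified.
    xi = zi + math.sin(zi) + ci + rng.gauss(0.0, 1.0)
    yi = BETA * xi + 2.0 * ci + rng.gauss(0.0, 1.0)
    z.append(zi)
    x.append(xi)
    y.append(yi)

# Three instrument transformations; each is correlated with x.
choices = {
    "h(z) = z": lambda v: v,
    "h(z) = sign(z)": lambda v: 1.0 if v > 0 else -1.0,
    "h(z) = tanh(z)": math.tanh,
}
estimates = {name: iv_estimate(z, x, y, h) for name, h in choices.items()}

for name, value in estimates.items():
    print(f"{name}: {value:.3f}")
```

The choice of h affects precision but not consistency here, because z, x and y all come from the same sample.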
Message 2

This robustness property does not carry over to two-sample IV with heterogeneous samples.

Why? The best parametric approximation depends on the population!
Buja et al. [2014] described this "conspiracy" of model misspecification and random design.

An example of the conspiracy

[Figure: two stacked panels sharing the horizontal axis x. Top: y against x for samples a and b. Bottom: density of x in samples a and b.]

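The conspiracy can be reproduced numerically. In the sketch below, both populations share the same structural first stage, assumed here to be x = z² + noise, and differ only in the distribution of the instrument: z has mean 1 in sample a and mean 2 in sample b. The model, helper names and numbers are hypothetical. The best linear slope of x on z is then 2 in population a and 4 in population b, so the two-sample ratio converges to 2β rather than β.

```python
import random


def mean(values):
    return sum(values) / len(values)


def ols_slope(x, y):
    """Slope from regressing y on x with an intercept."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx


def draw_sample(n, z_mean, beta, rng):
    """Same structural model in every population; only z's mean differs."""
    z, x, y = [], [], []
    for _ in range(n):
        zi = rng.gauss(z_mean, 1.0)
        ci = rng.gauss(0.0, 1.0)  # unmeasured confounder
        xi = zi ** 2 + ci + rng.gauss(0.0, 1.0)  # nonlinear in z
        yi = beta * xi + 2.0 * ci + rng.gauss(0.0, 1.0)
        z.append(zi)
        x.append(xi)
        y.append(yi)
    return z, x, y


BETA = 0.5  # assumed true causal effect
rng = random.Random(4)
z_a, x_a, _ = draw_sample(20000, 1.0, BETA, rng)
z_b, x_b, y_b = draw_sample(20000, 2.0, BETA, rng)

# Best linear approximation of x given z; population value is 2 * z_mean.
slope_a = ols_slope(z_a, x_a)
# x_b would be unobserved in practice; it is used here only to show the gap.
slope_b = ols_slope(z_b, x_b)

# Two-sample two-stage LS with a (misspecified) linear first stage.
beta_two_sample = ols_slope(z_b, y_b) / slope_a

print(f"slope_a={slope_a:.2f} slope_b={slope_b:.2f} estimate={beta_two_sample:.2f}")
```

Structural invariance holds in this setup, so the bias comes entirely from combining a misspecified linear model with different instrument distributions.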
Matching

An intuitive solution: make sure the IVs have the same distribution in both samples, for example by matching.

[Figure: two stacked panels sharing the horizontal axis x. Top: y against x for samples a and b. Bottom: density of x in samples a and b.]

Message 3

When the linear IV-exposure model is correctly specified, the two-stage LS estimator is asymptotically efficient in the class of limited information estimators
  1. in the one-sample setting [Wooldridge, 2010], and
  2. in the homogeneous two-sample setting [Inoue and Solon, 2010].

Message 3
The asymptotic efficiency does not carry over to two-sample IV with heterogeneous samples.

Generalized method of moments (GMM)

Assume all the variables are centered. Let $S$ be the sample covariance matrix. For example, $S^s_{zy} = (Z^s)^T y^s / n_s$.

Over-identified estimating equations:
\[ m_n(\beta) = (S^b_{zz})^{-1} S^b_{zy} - (S^a_{zz})^{-1} S^a_{zx} \beta. \]

The class of GMM estimators:
\[ \hat\beta_{n,W} = \arg\min_{\beta} \; m_n(\beta)^T W m_n(\beta). \]

Two-stage LS: $W = S^b_{zz}$.

Optimal choice: $W \propto \mathrm{Cov}(m_n(\beta))^{-1}$, where
\[ \mathrm{Cov}(m_n(\beta)) = \frac{1}{n_b} (S^b_{zz})^{-1} \mathrm{Var}(y^b_i \mid z^b_i) + \frac{1}{n_a} (S^a_{zz})^{-1} \beta^2 \mathrm{Var}(x^a_i \mid z^a_i). \]

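As a worked step, the GMM minimizer has a closed form because $m_n(\beta)$ is linear in the scalar $\beta$. The shorthand $\hat a$ and $\hat b$ below is introduced only for this sketch and is not notation from the slides; $W$ is assumed symmetric and positive definite.

```latex
\[
\hat a = (S^a_{zz})^{-1} S^a_{zx}, \qquad
\hat b = (S^b_{zz})^{-1} S^b_{zy}, \qquad
m_n(\beta) = \hat b - \hat a \beta .
\]
\[
\frac{d}{d\beta}\, m_n(\beta)^T W m_n(\beta)
  = -2\, \hat a^T W (\hat b - \hat a \beta) = 0
\quad\Longrightarrow\quad
\hat\beta_{n,W} = \frac{\hat a^T W \hat b}{\hat a^T W \hat a}.
\]
```

With a single instrument, $\hat a$ and $\hat b$ are scalars, $W$ cancels, and the estimator reduces to the ratio $\hat b / \hat a$. The weight matrix only matters in the over-identified case.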
Recap

Three messages of Part 1
In two-sample IV with heterogeneous samples,
  Additional untestable assumptions are needed for identification.
  The IV analysis is no longer robust to a misspecified instrument-exposure model.
  The two-stage LS is not asymptotically efficient.

Next: Part 2, new statistical methods for two-sample MR using just summary statistics.