Instrumental Variables, DeepIV, and Forbidden Regressions
Aaron Mishkin, UBC MLRG 2019W2


  1. Instrumental Variables, DeepIV, and Forbidden Regressions
     Aaron Mishkin, UBC MLRG 2019W2

  2. Introduction
     Goal: counterfactual reasoning in the presence of unknown confounders.
     From the CONSORT 2010 statement [Schulz et al., 2010];
     https://commons.wikimedia.org/w/index.php?curid=9841081

  3. Introduction: Motivation
     Can we draw causal conclusions from observational data?
     • Medical trials: Is the new sunscreen I'm using effective?
       ◦ Confounder: I live in my laboratory!
     • Pricing: Should airlines increase ticket prices next December?
       ◦ Confounder: NeurIPS 2019 was in Vancouver.
     • Policy: Will unemployment continue to drop if the Federal Reserve keeps interest rates low?
       ◦ Confounder: US shale oil production increases.
     We cannot control for confounders in observational data!

  4. Introduction: Graphical Model
     [Graphical model: features X, confounder ε, policy P, response Y.]
     We will use graphical models to represent our learning problem.
     • X: observed features associated with a trial.
     • ε: unobserved (possibly unknown) confounders.
     • P: the policy variable we wish to control.
     • Y: the response we want to predict.

  5. Introduction: Answering Causal Questions
     [Same graphical model: features X, confounder ε, policy P, response Y.]
     • Causal statements: Y is caused by P.
     • Action sentences: Y will happen if we do P.
     • Counterfactuals: Given that (x, p, y) happened, how would Y change if we had done P differently?

  6. Introduction: Berkeley Gender Bias Study
     S: Gender causes admission to UC Berkeley [Bickel et al., 1975].
     A: Estimate the mapping g(p) from 1973 admissions records.
     [Graph: gender G → admission A via the unknown mapping g(G).]

              Applications   Admitted
     Men      8442           44%
     Women    4321           35%

  7. Introduction: Berkeley with a Controlled Trial
     [Two graphs over gender G, department D, and admission A: a controlled experiment vs. the observational data.]
     Simpson's Paradox: Controlling for the effects of D shows a "small but statistically significant bias in favor of women" [Bickel et al., 1975].
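To make the paradox concrete, here is a quick check in Python using figures commonly quoted for two Berkeley departments; treat the counts as approximate and illustrative, not as Bickel et al.'s exact table:

```python
# dept: (men_applied, men_admitted, women_applied, women_admitted)
departments = {
    "A": (825, 512, 108, 89),
    "F": (373, 22, 341, 24),
}

men_app = sum(v[0] for v in departments.values())
men_adm = sum(v[1] for v in departments.values())
wom_app = sum(v[2] for v in departments.values())
wom_adm = sum(v[3] for v in departments.values())

# Aggregated over departments, men appear favoured...
print(f"aggregate: men {men_adm / men_app:.0%}, women {wom_adm / wom_app:.0%}")

# ...but within each department, women are admitted at an equal or higher rate.
for dept, (ma, md, wa, wd) in departments.items():
    print(f"dept {dept}: men {md / ma:.0%}, women {wd / wa:.0%}")
```

The reversal happens because women applied disproportionately to the more selective departments, so conditioning on D changes the conclusion.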

  8. Part 1: "Intervention Graphs"

  9. Intervention Graphs
     The do(·) operator formalizes this transformation [Pearl, 2009].
     [Two graphs: under observation, ε and X both feed into P; under the intervention do(P = p0), the edges into P are severed and P is clamped to p0.]
     Intuition: the effects of forcing P = p0 vs. its "natural" occurrence.

  10. Intervention Graphs: Supervised vs Causal Learning
      Setup:
      • ε, η ∼ N(0, 1).
      • P = p + 2ε.
      • g0(P) = max(P/5, P).
      • Y = g0(P) − 2ε + η.
      Can supervised learning recover g0(P = p0) from observations?
      Synthetic example introduced by Bennett et al. [2019].
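A short simulation of this data-generating process (the slide leaves the distribution of the base policy p unspecified, so the uniform draw below is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)

def g0(p):
    # Ground-truth response from the slide: g0(p) = max(p / 5, p), a leaky ReLU.
    return np.maximum(p / 5.0, p)

n = 1000
p_base = rng.uniform(-5, 5, size=n)  # assumed distribution for the base policy p
eps = rng.standard_normal(n)         # confounder eps
eta = rng.standard_normal(n)         # independent observation noise eta
P = p_base + 2 * eps                 # the policy is confounded by eps...
Y = g0(P) - 2 * eps + eta            # ...and so is the response, with opposite sign
```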

  11. Intervention Graphs: Supervised Failure
      [Figure: observed (P, Y) pairs, the true g0, and a neural-net estimate; the supervised fit deviates badly from g0.]
      Supervised learning fails because it assumes P ⊥⊥ ε!
      Taken from https://arxiv.org/abs/1905.12495

  12. Intervention Graphs: Supervised vs Causal Learning
      [Two graphs: the observational model and the interventional model under do(P).]
      Given a dataset D = {p_i, y_i}_{i=1}^n:
      • Supervised learning estimates the conditional
        E[Y | P] = g0(P) − 2 E[ε | P].
      • Causal learning estimates the interventional conditional
        E[Y | do(P)] = g0(P) − 2 E[ε], where E[ε] = 0.
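Continuing the simulation above, a plain regression of Y on P recovers the first quantity, not the second. A minimal check, with a polynomial fit standing in for any supervised learner:

```python
# Fit E[Y | P] and compare it to the interventional target E[Y | do(P)] = g0(P).
coeffs = np.polyfit(P, Y, deg=5)

for p0 in (-4.0, -2.0, 0.0, 2.0, 4.0):
    supervised = np.polyval(coeffs, p0)  # estimates g0(p0) - 2 E[eps | P = p0]
    causal = g0(p0)                      # the do-quantity, since E[eps] = 0
    print(f"p0={p0:+.1f}  E[Y|P=p0]~{supervised:+.2f}  E[Y|do(P=p0)]={causal:+.2f}")
# The columns disagree because E[eps | P] != 0 when P and eps are dependent.
```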

  13. Intervention Graphs: Known Confounders
      [Graphs: observations (X_i, P_i, Y_i) for i ∈ [n] sharing a global ε, vs. the intervention do(P = p0).]
      What if
      1. all confounders are known and contained in ε;
      2. ε persists across observations;
      3. the mapping Y = f(X, P, ε) is known and persists.

  14. Intervention Graphs: Inference
      [Graphs: observations (X_i, P_i, Y_i) for i ∈ [n] sharing a global ε, vs. the intervention do(P = p0).]
      Steps to inference:
      1. Abduction: compute the posterior P(ε | {x_i, p_i, y_i}_{i=1}^n).
      2. Action: form the subgraph corresponding to do(P = p0).
      3. Prediction: compute P(Y | do(P = p0), {x_i, p_i, y_i}_{i=1}^n).
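A minimal sketch of these three steps, assuming a concrete linear-Gaussian instance of Y = f(X, P, ε); the mechanism, coefficients, and noise scales below are illustrative assumptions, not part of the slides:

```python
import numpy as np

# Assumed mechanism: Y = a*X + b*P + eps + noise, with one global confounder
# eps ~ N(0, 1) shared by all observations and noise ~ N(0, sigma^2).
a, b, sigma = 1.5, -0.5, 0.5
rng = np.random.default_rng(1)

eps_true = rng.standard_normal()                  # the global confounder
X = rng.standard_normal(50)
P = X + eps_true + 0.1 * rng.standard_normal(50)  # policy depends on eps
Y = a * X + b * P + eps_true + sigma * rng.standard_normal(50)

# 1. Abduction: conjugate Gaussian posterior over eps given the residuals.
resid = Y - a * X - b * P                         # each residual ~ N(eps, sigma^2)
post_var = 1.0 / (1.0 + len(Y) / sigma**2)
post_mean = post_var * resid.sum() / sigma**2

# 2. Action: sever the edges into P and clamp it: do(P = p0).
p0, x_new = 2.0, 0.0

# 3. Prediction: P(Y | do(P = p0), data) for a new trial with X = x_new.
y_mean = a * x_new + b * p0 + post_mean
print(f"eps | data ~ N({post_mean:.2f}, {post_var:.4f}); E[Y | do(P=p0)] = {y_mean:.2f}")
```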

  15. Intervention Graphs: Limitations
      Our assumptions are unrealistic, since
      • identifying all confounders is hard;
      • assuming all confounders are "global" is unrealistic;
      • characterizing Y = f(X, P, ε) requires expert knowledge.
      What we really want is to
      • allow any number and kind of confounders;
      • allow confounders to be "local";
      • learn f(X, P, ε) from data!

  16. Part 2: Instrumental Variables

  17. Instrumental Variables
      ". . . the drawing of inferences from studies in which subjects have the final choice of program; the randomization is confined to an indirect instrument (or assignment) that merely encourages or discourages participation in the various programs." — Pearl [2009]

  18. IV: Expanded Model
      [Graphical model: instrument Z → policy P; features X and confounder ε → both P and response Y; P → Y.]
      We augment our model with an instrumental variable Z that
      • affects the distribution of P;
      • only affects Y through P;
      • is conditionally independent of ε.

  19. IV: Air Travel Example
      [Graph: fuel costs F (instrument) → price P (policy) → income I (response), with conferences C confounding both P and I.]
      Intuition: "[F is] as good as randomization for the purposes of causal inference." — Hartford et al. [2017]

  20. IV: Formally
      Goal: counterfactual predictions of the form
      E[Y | X, do(P = p0)] − E[Y | X, do(P = p1)].
      Let's make the following assumptions:
      1. the additive noise model Y = g(P, X) + ε;
      2. the following conditions on the IV:
         2.1 Relevance: p(P | X, Z) is not constant in Z.
         2.2 Exclusion: Z ⊥⊥ Y | P, X, ε.
         2.3 Unconfounded instrument: Z ⊥⊥ ε | X.

  21. IV: Model Learning, Part 1
      [Intervention graph with Y = g(P, X) + ε and P clamped to p.]
      Under the do operator:
      E[Y | X, do(P = p0)] − E[Y | X, do(P = p1)]
        = g(p0, X) − g(p1, X) + E[ε − ε | X]
        = g(p0, X) − g(p1, X),
      since the confounder terms cancel. So, we only need to estimate
      h(P, X) = g(P, X) + E[ε | X]!

  22. IV: Model Learning, Part 2
      Want: h(P, X) = g(P, X) + E[ε | X].
      Approach: marginalize out the confounded policy P:
      E[Y | X, Z] = ∫ (g(P, X) + E[ε | P, X]) dp(P | X, Z)
                  = ∫ (g(P, X) + E[ε | X]) dp(P | X, Z)
                  = ∫ h(P, X) dp(P | X, Z).
      Key trick: E[ε | X] is the same as E[ε | P, X] when marginalizing.

  23. IV: Two-Stage Methods
      Objective: (1/n) Σ_{i=1}^n L(y_i, ∫ h(P, x_i) dp(P | z_i)).
      Two-stage methods (see the sketch after this list):
      1. Estimate density: learn p̂(P | X, Z) from D = {p_i, x_i, z_i}_{i=1}^n.
      2. Estimate function: learn ĥ(P, X) from D̄ = {y_i, x_i, z_i}_{i=1}^n.
      3. Evaluate: counterfactual reasoning via ĥ(p0, x) − ĥ(p1, x).
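A self-contained sketch of this recipe on a toy problem with no features X, assuming a Gaussian stage-1 density and a polynomial basis for ĥ; both are modelling assumptions for illustration. With squared loss and ĥ linear in a basis, stage 2 reduces to least squares on sample-averaged basis features:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data with an instrument (no features X, to keep the sketch short):
n = 2000
Z = rng.uniform(-3, 3, n)                       # instrument
eps = rng.standard_normal(n)                    # confounder
P = Z + 2 * eps                                 # relevance: Z moves P
Y = np.maximum(P / 5, P) - 2 * eps + rng.standard_normal(n)

# Stage 1: estimate p_hat(P | Z); a Gaussian linear model is assumed here,
# which happens to match the toy data by construction.
coef = np.polyfit(Z, P, 1)
s = np.std(P - np.polyval(coef, Z))

def sample_P(z, m=100):
    return np.polyval(coef, z) + s * rng.standard_normal(m)

# Stage 2: with squared loss and h(P) = w . phi(P) linear in a basis, the
# objective becomes least squares on the sample-averaged basis features.
def phi(p):
    return np.vander(p, 6)                      # polynomial basis (assumption)

Phi_bar = np.stack([phi(sample_P(z)).mean(axis=0) for z in Z])
w, *_ = np.linalg.lstsq(Phi_bar, Y, rcond=None)

# Evaluate: counterfactual difference via h_hat.
p0, p1 = 2.0, -2.0
diff = (phi(np.array([p0])) - phi(np.array([p1]))) @ w
true = max(p0 / 5, p0) - max(p1 / 5, p1)
print(f"h_hat(p0) - h_hat(p1) = {diff[0]:.2f}  (true g0 gap = {true:.2f})")
```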

  24. IV: Two-Stage Least-Squares
      Classic approach: two-stage least-squares (2SLS). Assume linear models
      h(P, X) = w0ᵀ P + w1ᵀ X + ε,
      P = A0 X + A1 Z + r(ε), so E[P | X, Z] = A0 X + A1 Z.
      Then we have the following:
      E[Y | X, Z] = ∫ h(P, X) dp(P | X, Z)
                  = ∫ (w0ᵀ P + w1ᵀ X) dp(P | X, Z)
                  = w1ᵀ X + w0ᵀ ∫ P dp(P | X, Z)
                  = w1ᵀ X + w0ᵀ (A0 X + A1 Z).
      No need for density estimation! See Angrist and Pischke [2008].
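A minimal numpy version of 2SLS on synthetic data (scalar P and Z, no X, with an assumed true causal effect of 2.0), compared against naive OLS:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5000
Z = rng.standard_normal(n)                           # instrument
eps = rng.standard_normal(n)                         # confounder
P = Z + eps                                          # relevance: Z moves P
Y = 2.0 * P - 3.0 * eps + rng.standard_normal(n)     # true causal effect: 2.0

# Stage 1: regress P on Z (with intercept) to get P_hat = E[P | Z].
Zd = np.column_stack([np.ones(n), Z])
A = np.linalg.lstsq(Zd, P, rcond=None)[0]
P_hat = Zd @ A

# Stage 2: regress Y on the fitted P_hat.
Pd = np.column_stack([np.ones(n), P_hat])
w = np.linalg.lstsq(Pd, Y, rcond=None)[0]

# Naive OLS of Y on P is biased because P is confounded by eps.
naive = np.linalg.lstsq(np.column_stack([np.ones(n), P]), Y, rcond=None)[0]
print(f"2SLS slope: {w[1]:.2f} (true 2.0); naive OLS slope: {naive[1]:.2f}")
```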

  25. Part 3: Deep IV

  26. Deep IV: Problems with 2SLS
      Problem: Linear models aren't very expressive.
      • What if we want to do causal inference with time series?
      [Figure: economic time series.]
      Federal Reserve Economic Data (FRED), Federal Reserve Bank of St. Louis. https://fred.stlouisfed.org/

  27. Deep IV: Problems with 2SLS
      Problem: Linear models aren't very expressive.
      • How about complex image data?
      https://alexgkendall.com/computer_vision/bayesian_deep_learning_for_safe_ai/

  28. Deep IV: Approach
      Remember our objective function:
      Objective: (1/n) Σ_{i=1}^n L(y_i, ∫ h(P, x_i) dp(P | z_i)).
      Deep IV: a two-stage method using deep neural networks.
      1. Treatment network: estimate p̂(P | φ(X, Z)).
         ◦ Categorical P: softmax with your favourite architecture.
         ◦ Continuous P: autoregressive models (MADE, RNADE, etc.), normalizing flows (MAF, IAF, etc.), and so on.
      2. Outcome network: fit your favourite architecture ĥθ(P, X) ≈ h(P, X).
      Autoregressive models: [Germain et al., 2015, Uria et al., 2013]. Normalizing flows: [Rezende and Mohamed, 2015, Papamakarios et al., 2017, Kingma et al., 2016].
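As a concrete stand-in for those density models, here is a minimal mixture-density treatment network in PyTorch; it uses a Gaussian mixture rather than a flow, and the architecture and component count are illustrative assumptions:

```python
import torch
import torch.nn as nn

class TreatmentMDN(nn.Module):
    """Models p_hat(P | x, z) as a K-component Gaussian mixture."""

    def __init__(self, in_dim, K=10, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 3 * K))

    def forward(self, xz):
        # Mixture logits, means, and log-std-devs, each of shape (B, K).
        return self.net(xz).chunk(3, dim=-1)

    def nll(self, xz, p):
        # Negative log-likelihood of scalar treatments p, shape (B,).
        logits, mu, log_sigma = self(xz)
        log_w = torch.log_softmax(logits, dim=-1)
        comp = torch.distributions.Normal(mu, log_sigma.exp())
        return -torch.logsumexp(log_w + comp.log_prob(p.unsqueeze(-1)), dim=-1).mean()

    @torch.no_grad()
    def sample(self, xz, m):
        # Draw m treatment samples per row of xz; returns shape (m, B).
        logits, mu, log_sigma = self(xz)
        idx = torch.distributions.Categorical(logits=logits).sample((m,))  # (m, B)
        pick = lambda t: t.unsqueeze(0).expand(m, -1, -1).gather(
            2, idx.unsqueeze(-1)).squeeze(-1)
        return pick(mu) + pick(log_sigma.exp()) * torch.randn(idx.shape)
```

Stage 1 is then ordinary maximum likelihood: minimize `mdn.nll(xz, p)` over minibatches with any optimizer.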

  29. Deep IV: Training Deep IV Models
      1. Treatment network: "easy" via maximum likelihood:
         φ* = arg max_φ Σ_{i=1}^n log p̂(p_i | φ(x_i, z_i)).
      2. Outcome network: Monte Carlo approximation for the loss:
         L(θ) = (1/n) Σ_{i=1}^n L(y_i, ∫ ĥθ(P, x_i) dp̂(P | φ(x_i, z_i)))
              ≈ (1/n) Σ_{i=1}^n L(y_i, (1/m) Σ_{j=1}^m ĥθ(p_j, x_i)) =: L̂(θ),
         where p_j ∼ p̂(P | φ(x_i, z_i)).
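Step 2 in code, continuing the PyTorch sketch above with squared error; the call signature of `h_theta` is an assumption:

```python
import torch

def outcome_loss(h_theta, mdn, x, z, y, m=10):
    """Monte Carlo estimate of L_hat(theta) with squared error. Assumes
    h_theta(p, x) broadcasts over a leading sample dimension and returns
    shape (m, B, 1)."""
    xz = torch.cat([x, z], dim=-1)
    p = mdn.sample(xz, m)                                 # (m, B) draws of P
    x_rep = x.unsqueeze(0).expand(m, -1, -1)              # tile x over samples
    h_bar = h_theta(p.unsqueeze(-1), x_rep).mean(dim=0)   # average h over draws
    return ((y - h_bar.squeeze(-1)) ** 2).mean()
```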

  30. Deep IV: Biased and Unbiased Gradients
      When L(y, ŷ) = (y − ŷ)²:
      L(θ) = (1/n) Σ_{i=1}^n (y_i − ∫ h(P, x_i) dp(P | z_i))².
      If we use a single set of samples to estimate E_p̂[ĥθ(P, x_i)]:
      ∇θ L̂(θ) ≈ −(2/n) Σ_{i=1}^n E_p̂[(y_i − ĥθ(P, x_i)) ∇θ ĥθ(P, x_i)]
               ≥ −(2/n) Σ_{i=1}^n E_p̂[y_i − ĥθ(P, x_i)] E_p̂[∇θ ĥθ(P, x_i)] = ∇θ L(θ),
      by Jensen's inequality. The single-sample gradient estimate is biased.
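One standard fix, used by Hartford et al. [2017], is to draw two independent sample sets so the two expectations factorize. A sketch continuing the code above:

```python
import torch

def unbiased_sq_loss(h_theta, mdn, x, z, y, m=10):
    """Squared-error surrogate with an unbiased gradient: the two factors use
    independent treatment samples, so E[(y - h1)(y - h2)] = (y - E[h])^2."""
    xz = torch.cat([x, z], dim=-1)
    x_rep = x.unsqueeze(0).expand(m, -1, -1)
    p1, p2 = mdn.sample(xz, m), mdn.sample(xz, m)         # independent draws
    h1 = h_theta(p1.unsqueeze(-1), x_rep).mean(dim=0).squeeze(-1)
    h2 = h_theta(p2.unsqueeze(-1), x_rep).mean(dim=0).squeeze(-1)
    return ((y - h1) * (y - h2)).mean()
```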

  31. Part 4: Experimental Results and Forbidden Techniques
