Instrumental Variables, DeepIV, and Forbidden Regressions
Aaron Mishkin
UBC MLRG 2019W2
Introduction
Goal: Counterfactual reasoning in the presence of unknown confounders.
From the CONSORT 2010 statement [Schulz et al., 2010]; https://commons.wikimedia.org/w/index.php?curid=9841081
Can we draw causal conclusions from observational data?
◮ Confounder: I live in my laboratory!
◮ Confounder: NeurIPS 2019 was in Vancouver.
The Federal Reserve keeps interest rates low?
◮ Confounder: US shale oil production increases.
We cannot control for confounders in observational data!
[Graphical model: features X and policy P determine response Y, with confounder ε.]
We will use graphical models to represent our learning problem.
[Graphical model: features X, policy P, response Y, confounder ε.]
How would the response Y change if we had done P instead?
S: Gender causes admission to UC Berkeley [Bickel et al., 1975].
A: Estimate a mapping g(G) from 1973 admissions records.
[Graphical model: Gender G → Admission A, via g(G)?]

Gender   Applications   Admitted
Men      8442           44%
Women    4321           35%
[Graphical models over A, G, D: observational data vs. controlled experiment.]
Simpson's Paradox: controlling for the effects of department D shows a "small but statistically significant bias in favor of women" [Bickel et al., 1975].
The do(·) operator formalizes this transformation [Pearl, 2009].
[Graphical models: observation (X, P, Y, ε) vs. intervention do(P = p0).]
Intuition: the effects of forcing P = p0 vs. its "natural" occurrence.
[Graphical model: response Y = g0(P) with confounder ε and independent noise η.]
Can supervised learning recover g0(P = p0) from observations?
Synthetic example introduced by Bennett et al. [2019].
[Plot: the true g0 vs. the function estimated by a neural net.]
Supervised learning fails because it assumes P ⊥⊥ ε!
Taken from https://arxiv.org/abs/1905.12495
[Graphical models: observation vs. intervention do(P), with structural function g0(p), confounder ε, and noise η.]
Given a dataset D = {(p_i, y_i)}_{i=1}^n:
E[Y | P] = g0(P) − 2 E[ε | P]
E[Y | do(P)] = g0(P) − 2 E[ε]
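A minimal simulation of this gap, consistent with the expectations above (so Y = g0(P) − 2ε + η); my choices of g0 = |·| and of the confounding mechanism P = ε + noise are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)
n = 100_000

g0 = np.abs                         # illustrative structural function
eps = rng.normal(size=n)            # unobserved confounder
eta = rng.normal(size=n)            # independent noise
P = eps + rng.normal(size=n)        # policy is confounded with eps
Y = g0(P) - 2 * eps + eta           # matches E[Y | P] = g0(P) - 2 E[eps | P]

p0 = 1.0
near_p0 = np.abs(P - p0) < 0.05     # observations with P close to p0

# What supervised learning targets vs. the interventional quantity:
print("E[Y | P = p0]     ≈", Y[near_p0].mean())   # ≈ 0.0 here
print("E[Y | do(P = p0)] =", g0(p0))              # = 1.0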
[Graphical models: observations (X_i, P_i, Y_i) for i ∈ [n] with confounder ε vs. the intervention do(P = p0).]
What if there were no unknown confounders?
[Graphical models: observations (X_i, P_i, Y_i) for i ∈ [n] with confounder ε vs. the intervention do(P = p0).]
Steps to inference:
1. Estimate ĝ(P, X) by regression on D = {(p_i, x_i, y_i)}_{i=1}^n.
2. Predict counterfactuals with ĝ (e.g., ĝ(p0, x) − ĝ(p1, x)).
Our assumptions are unrealistic, since observational data is rarely free of confounding.
What we really want is to reason counterfactually in the presence of unknown confounders.
[Graphical model: instrument Z → policy P → response Y, with features X and confounder ε.]
We augment our model with an instrumental variable Z that influences the policy P but is independent of the confounder ε.
[Example graph: Conference C, Income I, Price P, Fuel F — fuel costs as an instrument for price.]
Intuition: "[F is] as good as randomization for the purposes of causal inference" — Hartford et al. [2017].
Goal: counterfactual predictions of the form
E[Y | X, do(P = p0)] − E[Y | X, do(P = p1)].
Let's make the following assumptions:
1. Additive noise: Y = g(P, X) + ε.
2.1 Relevance: p(P | X, Z) is not constant in Z.
2.2 Exclusion: Z ⊥⊥ Y | P, X, ε.
2.3 Unconfounded Instrument: Z ⊥⊥ ε | X.
[Graphical model under intervention: Y = g(P, X) + ε with do(P = p).]
Under the do operator:
E[Y | X, do(P = p0)] − E[Y | X, do(P = p1)] = g(p0, X) − g(p1, X) + E[ε − ε | X] = g(p0, X) − g(p1, X).
Since the E[ε | X] terms cancel in differences, we only need to estimate h(P, X) = g(P, X) + E[ε | X]!
Want: h(P, X) = g(P, X) + E[ε | X]. Approach: marginalize out the confounded policy P.
E[Y | X, Z] = ∫ (g(P, X) + E[ε | P, X]) dp(P | X, Z)
            = ∫ (g(P, X) + E[ε | X]) dp(P | X, Z)
            = ∫ h(P, X) dp(P | X, Z).
Key Trick: E[ε | X] is the same as E[ε | P, X] when marginalizing over p(P | X, Z).
Objective: (1/n) Σ_{i=1}^n L( y_i, ∫ h(P, x_i) dp(P | x_i, z_i) )
Two-stage methods:
1. Estimate p̂(P | X, Z) from D = {(p_i, x_i, z_i)}_{i=1}^n.
2. Estimate ĥ(P, X) from D̄ = {(y_i, x_i, z_i)}_{i=1}^n.
Then predict counterfactuals using ĥ(p0, x) − ĥ(p1, x).
Classic Approach: two-stage least-squares (2SLS). Assume linear models:
h(P, X) = w0⊤ P + w1⊤ X + ε
P = A0 X + A1 Z + r(ε), so that E[P | X, Z] = A0 X + A1 Z
(absorbing the residual mean E[r(ε) | X] into the intercept). Then we have the following:
E[Y | X, Z] = ∫ h(P, X) dp(P | X, Z)
            = ∫ (w0⊤ P + w1⊤ X) dp(P | X, Z)
            = w1⊤ X + w0⊤ ∫ P dp(P | X, Z)
            = w1⊤ X + w0⊤ (A0 X + A1 Z).
No need for density estimation!
See Angrist and Pischke [2008].
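A numpy sketch of this recipe on simulated data (the data-generating process and coefficient values are illustrative assumptions; here X is empty, but with features it would enter both stages):

import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# Simulated data: instrument Z, confounder eps, endogenous policy P.
Z = rng.normal(size=n)
eps = rng.normal(size=n)
P = 0.8 * Z + eps + 0.1 * rng.normal(size=n)
Y = 2.0 * P + eps                    # true w0 = 2.0

# Stage 1: regress P on (1, Z) to estimate E[P | Z].
F = np.c_[np.ones(n), Z]
P_hat = F @ np.linalg.lstsq(F, P, rcond=None)[0]

# Stage 2: regress Y on (1, P_hat); the slope estimates w0.
w = np.linalg.lstsq(np.c_[np.ones(n), P_hat], Y, rcond=None)[0]
ols = np.linalg.lstsq(np.c_[np.ones(n), P], Y, rcond=None)[0]

print(f"2SLS: {w[1]:.3f} (≈ 2), naive OLS: {ols[1]:.3f} (biased upward)")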
Problem: Linear models aren’t very expressive.
Federal Reserve Economic Research, Federal Reserve Bank of Saint Louis. https://fred.stlouisfed.org/
https://alexgkendall.com/computer_vision/bayesian_deep_learning_for_safe_ai/
Remember our objective function:
Objective: (1/n) Σ_{i=1}^n L( y_i, ∫ h(P, x_i) dp(P | x_i, z_i) )
Deep IV: a two-stage method using deep neural networks [Hartford et al., 2017].
1. Estimate p̂(P | φ(X, Z)) with a density network φ:
◮ Categorical P: softmax w/ favourite architecture.
◮ Continuous P: autoregressive models (MADE, RNADE, etc.), normalizing flows (MAF, IAF, etc.), and so on.
2. Estimate ĥθ(P, X) ≈ h(P, X) with a second network.
Autoregressive models: [Germain et al., 2015, Uria et al., 2013]; Normalizing Flows: [Rezende and Mohamed, 2015, Papamakarios et al., 2017, Kingma et al., 2016]
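For continuous P, Hartford et al. [2017] use a mixture-of-Gaussians density network; the PyTorch sketch below is a minimal stage-1 model in that spirit (architecture, sizes, and names are my own illustrative choices):

import torch
import torch.nn as nn
import torch.distributions as D

class StageOne(nn.Module):
    """Mixture density network for p(P | x, z) with K Gaussian components."""
    def __init__(self, in_dim, K=10, hidden=64):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.logits = nn.Linear(hidden, K)      # mixture weights
        self.means = nn.Linear(hidden, K)       # component means
        self.log_scales = nn.Linear(hidden, K)  # component log std-devs

    def dist(self, xz):
        h = self.phi(xz)
        mix = D.Categorical(logits=self.logits(h))
        comp = D.Normal(self.means(h), self.log_scales(h).exp())
        return D.MixtureSameFamily(mix, comp)

model = StageOne(in_dim=2)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# One maximum-likelihood step on a fake batch of (x, z) -> p pairs.
xz, p = torch.randn(128, 2), torch.randn(128)
loss = -model.dist(xz).log_prob(p).mean()
opt.zero_grad(); loss.backward(); opt.step()

Stage 2 can then draw treatment samples via model.dist(xz).sample((M,)).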
Stage 1: fit the density network by maximum likelihood,
φ* = argmax_φ Σ_{i=1}^n log p̂(p_i | φ(x_i, z_i)).
Stage 2: minimize the sampled objective,
L(θ) = (1/n) Σ_{i=1}^n L( y_i, ∫ ĥθ(P, x_i) dp̂(P | φ(x_i, z_i)) )
     ≈ (1/n) Σ_{i=1}^n L( y_i, (1/M) Σ_{j=1}^M ĥθ(p_j, x_i) ) := L̂(θ), where p_j ∼ p̂(P | φ(x_i, z_i)).
When L(y, ŷ) = (y − ŷ)²:
L(θ) = (1/n) Σ_{i=1}^n ( y_i − ∫ ĥθ(P, x_i) dp̂(P | φ(x_i, z_i)) )².
If we use a single set of samples to estimate E_p̂[ĥθ(P, x_i)], then
∇L̂(θ) ≈ −2 (1/n) Σ_{i=1}^n ( y_i E_p̂[∇θ ĥθ(P, x_i)] − E_p̂[ ĥθ(P, x_i) ∇θ ĥθ(P, x_i) ] )
       ≠ −2 (1/n) Σ_{i=1}^n ( y_i − E_p̂[ĥθ(P, x_i)] ) E_p̂[∇θ ĥθ(P, x_i)]
by Jensen's inequality. The fix: estimate the two expectations with independent sample sets [Hartford et al., 2017].
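A sketch of that fix: two independent sample sets make the gradient of the surrogate match the true gradient in expectation (h_theta and p_hat_dist are stand-ins for the fitted stage-2 network and stage-1 distribution):

import torch

def stage2_loss_unbiased(h_theta, p_hat_dist, x, y, M=8):
    """Surrogate for (y - E[h_theta(P, x)])^2 with unbiased gradients.

    Two independent sample sets ensure the gradient involves
    E[h] * E[grad h] rather than the biased E[h * grad h].
    """
    p1 = p_hat_dist.sample((M,))     # first sample set
    p2 = p_hat_dist.sample((M,))     # independent second set
    h1 = h_theta(p1, x).mean(0)      # Monte Carlo estimate of E[h]
    h2 = h_theta(p2, x).mean(0)      # independent estimate of E[h]
    # grad of (y - h1)(y - h2) = -(y - h2) grad h1 - (y - h1) grad h2,
    # whose expectation is -2 (y - E[h]) E[grad h]: the true gradient.
    return ((y - h1) * (y - h2)).mean()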
Synthetic Price Sensitivity: ρ ∈ [0, 1] tunes the confounding; noise η ∼ N(0, 1).
[Figure: out-of-sample counterfactual MSE (0.02–10.00, log scale) vs. training-sample size in 1000s (1, 5, 10, 20), one panel per ρ ∈ {0.1, 0.25, 0.5, 0.75}. Methods: FFNet, 2SLS, 2SLS(poly), NonPar, DeepIV.]
What if S is an MNIST digit?
[Convolutional architecture: 64 (3×3) conv → (2×2) pooling → 64 (3×3) conv → dense 64 on (x, z) → dense 32 → y.]
[Figure: out-of-sample counterfactual MSE (0.05–20.00) vs. training-sample size in 1000s (5, 10, 20) for Controlled Experiment, DeepIV, 2SLS, and a naive deep net.]
Let f be some (non-linear) function and consider
h(P, X) = w0⊤ P + w1⊤ X + ε
E[P | X, Z] = f(X, Z, ε).
Amazing Property: 2SLS is consistent when h is linear, even if f isn't!
Deep IV: bias from p̂(P | φ(X, Z)) propagates to ĥθ(P, X).
See this PDF for a hint on how to proceed.
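To see the amazing property concretely, here is a small simulation (my construction, not from the slides): the first stage is badly misspecified by a linear fit, yet 2SLS still recovers the structural coefficient.

import numpy as np

rng = np.random.default_rng(1)
n = 200_000

Z = rng.normal(size=n)
eps = rng.normal(size=n)
P = np.sin(2 * Z) + Z**2 + eps     # nonlinear first stage f(X, Z, eps)
Y = 2.0 * P + eps                  # linear structural equation, w0 = 2

# 2SLS with a deliberately *linear* stage 1: project P onto (1, Z).
F = np.c_[np.ones(n), Z]
P_hat = F @ np.linalg.lstsq(F, P, rcond=None)[0]
w = np.linalg.lstsq(np.c_[np.ones(n), P_hat], Y, rcond=None)[0]
print(f"2SLS estimate of w0: {w[1]:.3f}")   # ≈ 2 despite the misspecification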
Conclusions:
◮ Causal conclusions drawn from observational data are vulnerable to confounders.
◮ Supervised learning gives biased counterfactuals under persistent confounders.
◮ Instrumental variables enable causal inference when confounders are unknown.
◮ DeepIV pairs instrumental variables with deep networks for flexible counterfactual reasoning.
[Recap graphical model: instrument Z, features X, policy P, response Y, confounder ε.]
References
Joshua D. Angrist and Jörn-Steffen Pischke. Mostly Harmless Econometrics: An Empiricist's Companion. Princeton University Press, 2008.
Andrew Bennett, Nathan Kallus, and Tobias Schnabel. Deep generalized method of moments for instrumental variable analysis. In Advances in Neural Information Processing Systems, pages 3559–3569, 2019.
Peter J. Bickel, Eugene A. Hammel, and J. William O'Connell. Sex bias in graduate admissions: Data from Berkeley. Science, 187(4175):398–404, 1975.
Mathieu Germain, Karol Gregor, Iain Murray, and Hugo Larochelle. MADE: Masked autoencoder for distribution estimation. In International Conference on Machine Learning, pages 881–889, 2015.
Jason Hartford, Greg Lewis, Kevin Leyton-Brown, and Matt Taddy. Deep IV: A flexible approach for counterfactual prediction. In Proceedings of the 34th International Conference on Machine Learning, pages 1414–1423, 2017.
Durk P. Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pages 4743–4751, 2016.
George Papamakarios, Theo Pavlakou, and Iain Murray. Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, pages 2338–2347, 2017.
Judea Pearl. Causality. Cambridge University Press, 2009.
Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770, 2015.
Kenneth F. Schulz, Douglas G. Altman, and David Moher. CONSORT 2010 statement: updated guidelines for reporting parallel group randomised trials. BMC Medicine, 8(1):18, 2010.
Benigno Uria, Iain Murray, and Hugo Larochelle. RNADE: The real-valued neural autoregressive density-estimator. In Advances in Neural Information Processing Systems, pages 2175–2183, 2013.