

SLIDE 1

Causal Inference

An introduction based on S. Wager’s course on Causal Inference (OIT 661)

Imke Mayer November 23, 2018

Group Meeting, CMAP

SLIDE 2

Outline

  • 1. Treatment effect estimation in randomized experiments
  • 2. Beyond a single randomized controlled trial
  • 3. Inverse-propensity weighting
  • 4. Double robustness property
  • 5. Cross-fitting and machine learning for ATE estimation
  • 6. Conclusion


SLIDE 3

Treatment effect estimation in randomized experiments


SLIDE 5

Definitions and notation

Causal effect
Given a binary treatment Wi ∈ {0, 1} on the i-th individual, with potential outcomes Yi(1) and Yi(0), the individual causal effect of the treatment is ∆i = Yi(1) − Yi(0).

  • Problem: ∆i is never observed (at most one of the two potential outcomes is seen per individual).
  • (Partial) solution: randomized experiments to learn certain properties of ∆i.
  • Average treatment effect: τ = E[∆i] = E[Yi(1) − Yi(0)].


SLIDE 7

Average treatment effect (ATE)

Average treatment effect
τ = E[∆i] = E[Yi(1) − Yi(0)]

Idea: estimate τ using large randomized experiments.

Assumptions: random variables (Y, W) taking values in R × {0, 1}; we observe n iid samples (Yi, Wi), each satisfying:

  • Yi = Yi(Wi) (SUTVA)
  • Wi ⊥⊥ {Yi(0), Yi(1)} (random treatment assignment)

Difference-in-means estimator
τ̂DM = (1/n1) Σ{i: Wi=1} Yi − (1/n0) Σ{i: Wi=0} Yi, where nw = #{i : Wi = w}.
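As a numerical sketch (not part of the original slides), the difference-in-means estimator is essentially one line of numpy. The simulated experiment below, with a true effect of 2, is invented for illustration; all names are made up:

```python
import numpy as np

def tau_dm(y, w):
    """Difference-in-means ATE estimate: mean outcome of the treated (Wi = 1)
    minus mean outcome of the controls (Wi = 0)."""
    y, w = np.asarray(y, dtype=float), np.asarray(w)
    return y[w == 1].mean() - y[w == 0].mean()

# Toy randomized experiment: Yi = Yi(Wi) (SUTVA), Wi assigned at random,
# and the treatment shifts the outcome by exactly 2.
rng = np.random.default_rng(0)
n = 10_000
w = rng.integers(0, 2, size=n)
y = 1.0 + 2.0 * w + rng.normal(size=n)
print(tau_dm(y, w))   # close to the true ATE of 2
```

With random assignment the estimate concentrates around the true ATE at the √n rate stated on the next slide.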

SLIDE 9

Average treatment effect estimation

Properties of τ̂DM

  • Under the previous assumptions (iid sampling, SUTVA, random treatment assignment), τ̂DM is unbiased and √n-consistent:

√n (τ̂DM − τ) →d N(0, VDM) as n → ∞,

where VDM = Var(Yi(0)) / P(Wi = 0) + Var(Yi(1)) / P(Wi = 1).

  • Using plug-in estimators we also get confidence intervals:

lim{n→∞} P( τ ∈ [ τ̂DM ± Φ⁻¹(1 − α/2) √(V̂DM / n) ] ) = 1 − α,

where Φ is the standard Gaussian cdf.
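The plug-in confidence interval above can be sketched directly in numpy (this is an illustration, not code from the course; the 95% level and the simulated data are fixed for simplicity):

```python
import numpy as np

def dm_confint(y, w):
    """95% confidence interval for tau around the difference-in-means estimate,
    using the plug-in estimate of the asymptotic variance V_DM."""
    y, w = np.asarray(y, dtype=float), np.asarray(w)
    n = len(y)
    tau_hat = y[w == 1].mean() - y[w == 0].mean()
    # V_DM = Var(Yi(0)) / P(Wi = 0) + Var(Yi(1)) / P(Wi = 1), each term plugged in
    v_hat = (y[w == 0].var(ddof=1) / (w == 0).mean()
             + y[w == 1].var(ddof=1) / (w == 1).mean())
    half = 1.96 * np.sqrt(v_hat / n)   # Phi^{-1}(1 - 0.05/2) ~ 1.96
    return tau_hat - half, tau_hat + half

# Simulated randomized experiment with true ATE = 2.
rng = np.random.default_rng(1)
n = 10_000
w = rng.integers(0, 2, size=n)
y = 1.0 + 2.0 * w + rng.normal(size=n)
lo, hi = dm_confint(y, w)
```

On this simulated data the interval is narrow (half-width about 1.96 √(V̂DM/n)) and centered near the true effect.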


SLIDE 11

Average treatment effect estimation with difference-in-means

Difference-in-means estimator

  • conceptually simple and easy to compute,
  • consistent, with asymptotically valid inference,
  • but is it the optimal way to use the data for fixed finite n?

Average treatment effect
τ is a causal parameter, i.e. a property of the population that we wish to know. It is not tied to the study design or the estimation method.

SLIDE 12

Randomized trials in the linear model

Idea: assume that the responses Yi(0) and Yi(1) are linear in the covariates.

Assumptions

  • n iid samples (Xi, Yi, Wi),
  • Yi(w) = c(w) + Xi β(w) + εi(w), for w ∈ {0, 1},
  • E[εi(w) | Xi] = 0 and Var(εi(w) | Xi) = σ².

Without loss of generality we additionally assume:

  • P(Wi = 0) = P(Wi = 1) = 1/2,
  • E[X] = 0.


SLIDE 14

Randomized trials in the linear model

OLS estimator
τ̂OLS := ĉ(1) − ĉ(0) + X̄ (β̂(1) − β̂(0)) = (1/n) Σ{i=1..n} [ (ĉ(1) + Xi β̂(1)) − (ĉ(0) + Xi β̂(0)) ],

where X̄ = (1/n) Σ{i=1..n} Xi, and the estimators are obtained by OLS on the two linear models (treated and control samples fitted separately).

Properties of τ̂OLS

  • Asymptotic independence of ĉ(w), β̂(w) and X̄, together with the decomposition

τ̂OLS − τ = (ĉ(1) − c(1)) − (ĉ(0) − c(0)) + X̄ (β(1) − β(0)) + X̄ (β̂(1) − β̂(0) − β(1) + β(0)).

  • Writing VOLS = 4σ² + (β(0) − β(1))ᵀ Var(X) (β(0) − β(1)), the central limit theorem gives

√n (τ̂OLS − τ) →d N(0, VOLS) as n → ∞.
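A minimal sketch of τ̂OLS (not from the slides): fit two OLS regressions and average the difference of the fitted surfaces over the empirical covariate distribution. The simulated linear model, with c(1) − c(0) = 1 and E[X] = 0 so the true ATE is 1, is invented for illustration:

```python
import numpy as np

def tau_ols(X, y, w):
    """OLS ATE estimate: separate linear fits (with intercept) on treated and
    controls, then average the difference of the two fitted surfaces over all Xi.
    This equals c^(1) - c^(0) + Xbar (beta^(1) - beta^(0))."""
    y, w = np.asarray(y, dtype=float), np.asarray(w)
    Z = np.column_stack([np.ones(len(y)), X])        # intercept + covariates
    coef1, *_ = np.linalg.lstsq(Z[w == 1], y[w == 1], rcond=None)
    coef0, *_ = np.linalg.lstsq(Z[w == 0], y[w == 0], rcond=None)
    return (Z @ coef1 - Z @ coef0).mean()

# Linear potential outcomes: c(0)=0, c(1)=1, beta(0)=1, beta(1)=3, E[X]=0,
# so the true ATE is c(1) - c(0) = 1.
rng = np.random.default_rng(2)
n = 20_000
x = rng.normal(size=n)
w = rng.integers(0, 2, size=n)
y = w * (1.0 + 3.0 * x) + (1 - w) * (1.0 * x) + rng.normal(size=n)
```

Here `tau_ols(x, y, w)` lands near 1, with the smaller asymptotic variance VOLS from the next slide.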

SLIDE 15

Randomized trials in the linear model

Properties of τ̂OLS

  • Writing VOLS = 4σ² + ‖β(0) − β(1)‖²A, where ‖v‖²A = vᵀAv and A = Var(X), the central limit theorem gives

√n (τ̂OLS − τ) →d N(0, VOLS) as n → ∞.

Remark

  • Under the linearity assumption,

VDM = 4σ² + ‖β(0) − β(1)‖²A + ‖β(0) + β(1)‖²A.

⇒ τ̂OLS is always at least as good as τ̂DM in terms of asymptotic variance.

  • This still holds in case of model mis-specification (the proof uses Huber–White linear regression analysis).

SLIDE 16

Beyond a single randomized controlled trial

SLIDE 17

How to combine different experiments or data sets

Study the effect of a cash incentive to discourage teenagers from smoking in two different cities.

Correct aggregation of the two studies: [figure omitted]

SLIDE 21

Aggregating several ATE estimators

How to combine several trials testing the same treatment but on different populations?

Assumptions

  • n iid samples (Xi, Yi, Wi),
  • Covariates Xi take values in a finite discrete space X (with |X| = p),
  • Treatment assignment is random conditionally on Xi: {Yi(0), Yi(1)} ⊥⊥ Wi | Xi = x, for all x ∈ X.

Bucket-wise ATE
τ(x) = E[Yi(1) − Yi(0) | Xi = x]

SLIDE 22

Results for aggregated difference-in-means estimators

Aggregated difference-in-means estimator
τ̂ := Σ{x∈X} (nx/n) τ̂(x) = Σ{x∈X} (nx/n) [ (1/nx1) Σ{i: Xi=x, Wi=1} Yi − (1/nx0) Σ{i: Xi=x, Wi=0} Yi ]

  • Denoting e(x) = P(Wi = 1 | Xi = x) and adding the simplifying assumption Var(Y(w) | X = x) = σ²(x), we can show that

√nx (τ̂(x) − τ(x)) →d N( 0, σ²(x) / (e(x)(1 − e(x))) ) as n → ∞.

  • Finally, denoting VBUCKET = Var(τ(X)) + E[ σ²(X) / (e(X)(1 − e(X))) ],

√n (τ̂ − τ) →d N(0, VBUCKET) as n → ∞,

with no dependence on p, the number of buckets!
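A sketch of the bucket-wise aggregation (not from the slides): a within-bucket difference-in-means for each value of a discrete covariate, weighted by the bucket shares nx/n. The two-bucket "two cities" simulation, with different propensities and bucket effects τ(0) = 1, τ(1) = 3 (so the true ATE is 2), is invented for illustration:

```python
import numpy as np

def tau_bucket(x, y, w):
    """Aggregated difference-in-means: a within-bucket DM estimate for each
    value of the discrete covariate x, weighted by the bucket share n_x / n."""
    x, y, w = np.asarray(x), np.asarray(y, dtype=float), np.asarray(w)
    total = 0.0
    for v in np.unique(x):
        m = x == v
        total += m.mean() * (y[m & (w == 1)].mean() - y[m & (w == 0)].mean())
    return total

# Two buckets ("cities") with different treatment propensities e(0)=0.3,
# e(1)=0.7 and bucket-wise effects tau(0)=1, tau(1)=3; the true ATE is 2.
rng = np.random.default_rng(3)
n = 20_000
x = rng.integers(0, 2, size=n)
e = 0.3 + 0.4 * x
w = (rng.random(n) < e).astype(int)
y = x + (1.0 + 2.0 * x) * w + rng.normal(size=n)
```

Because treatment is more likely in the x = 1 bucket, a naive pooled difference-in-means would be biased here, while `tau_bucket(x, y, w)` recovers the ATE.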

SLIDE 23

Inverse-propensity weighting

SLIDE 24

Continuous X and the propensity score

Observation from discrete X with a finite number of buckets: the number of buckets p does not affect the accuracy of inference.

How to transpose the analysis and results to the continuous case?

  • 1. Modify the assumptions
  • 2. Define an analogue of "buckets"

Assumptions

  • n iid samples (Xi, Yi, Wi),
  • Covariates Xi take values in a continuous space X,
  • Treatment assignment is random conditionally on Xi: {Yi(0), Yi(1)} ⊥⊥ Wi | Xi ≡ unconfoundedness assumption.

SLIDE 25

Unconfoundedness and the propensity score

Propensity score
e(x) = P(Wi = 1 | Xi = x) for all x ∈ X.

SLIDE 26

Unconfoundedness and the propensity score

Key property
e is a balancing score, i.e. under unconfoundedness it satisfies {Yi(0), Yi(1)} ⊥⊥ Wi | e(Xi). As a consequence, it suffices to control for e(X) (rather than X) to remove the biases associated with non-random treatment assignment.

SLIDE 27

Unconfoundedness and the propensity score: finite number of strata

If the data fall into J strata (Sj), 1 ≤ j ≤ J, with J < ∞ and such that e(x) = ej in each stratum, then we have a consistent estimator for the ATE:

τ̂ := Σ{j=1..J} (nj/n) τ̂j = Σ{j=1..J} (nj/n) [ (1/nj1) Σ{i: Xi∈Sj, Wi=1} Yi − (1/nj0) Σ{i: Xi∈Sj, Wi=0} Yi ]

SLIDE 28

Unconfoundedness and the propensity score: inverse-propensity weighting

The previous finite-number-of-strata assumption is unrealistic, but we can generalize the previous estimator using propensity score estimates:

τ̂ := Σ{j=1..J} (nj/n) [ (1/nj1) Σ{i: Xi∈Sj, Wi=1} Yi − (1/nj0) Σ{i: Xi∈Sj, Wi=0} Yi ]
   = (1/n) Σ{j=1..J} [ (1/êj) Σ{i: Xi∈Sj, Wi=1} Yi − (1/(1 − êj)) Σ{i: Xi∈Sj, Wi=0} Yi ]
   = (1/n) Σ{i=1..n} [ Wi Yi / ê(Xi) − (1 − Wi) Yi / (1 − ê(Xi)) ].

Here ê(x) = êj = nj1/nj for all x ∈ Sj, but we could use any other method to estimate ê.

SLIDE 29

Unconfoundedness and the propensity score: inverse-propensity weighting

We thus define

τ̂IPW = (1/n) Σ{i=1..n} [ Wi Yi / ê(Xi) − (1 − Wi) Yi / (1 − ê(Xi)) ],

an inverse-propensity weighted estimator of the ATE. The quality of this estimator depends on the quality of the estimate ê(x).
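τ̂IPW is a single weighted average once propensity estimates are in hand. The sketch below (not from the slides) reuses a confounded two-group simulation where treatment is more likely when x = 1 and the true ATE is 2, with ê estimated bucket-wise as êj = nj1/nj; all names and data are invented:

```python
import numpy as np

def tau_ipw(y, w, e_hat):
    """IPW ATE estimate given (estimated) propensity scores e_hat = e(Xi)."""
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    e_hat = np.asarray(e_hat, dtype=float)
    return np.mean(w * y / e_hat - (1 - w) * y / (1 - e_hat))

# Confounded assignment: treatment probability 0.3 when x=0, 0.7 when x=1;
# bucket-wise effects 1 and 3, so the true ATE is 2.
rng = np.random.default_rng(4)
n = 20_000
x = rng.integers(0, 2, size=n)
e = 0.3 + 0.4 * x
w = (rng.random(n) < e).astype(int)
y = x + (1.0 + 2.0 * x) * w + rng.normal(size=n)
e_hat = np.where(x == 1, w[x == 1].mean(), w[x == 0].mean())   # e_j = n_j1 / n_j
```

With these bucket-wise propensity estimates, `tau_ipw(y, w, e_hat)` reproduces the aggregated difference-in-means behaviour.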

SLIDE 30

Propensity score estimation and inverse-propensity weighting

Assume a linear-logistic model:

  • 1. e(x) = P(Wi = 1 | Xi = x) = 1 / (1 + e^(−xᵀα)),
  • 2. µ(w)(x) = xᵀβ(w) (for w ∈ {0, 1}),
  • 3. Yi = µ(Wi)(Xi) + εi.

Decompose the general ATE estimator

τ̂ = (1/n) Σ{i=1..n} [ γ̂(1)(Xi) Wi Yi − γ̂(0)(Xi) (1 − Wi) Yi ]

as follows:

τ̂ = X̄ (β(1) − β(0)) + [term to pay that depends on the noise ε]
   + [ (1/n) Σ{i=1..n} γ̂(1)(Xi) Wi Xi − X̄ ] β(1)
   − [ (1/n) Σ{i=1..n} γ̂(0)(Xi) (1 − Wi) Xi − X̄ ] β(0).
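The linear-logistic propensity model above can be fit by plain maximum likelihood; the Newton iteration below is a generic sketch, not the course's method (the CBPS fit on the next slide uses moment matching instead). Function names and the simulated data are invented:

```python
import numpy as np

def fit_logistic_propensity(X, w, steps=25):
    """Maximum-likelihood fit of the linear-logistic propensity model
    e(x) = 1 / (1 + exp(-x^T alpha)) via Newton's method (intercept included).
    Returns the coefficient vector and the fitted propensities e(Xi)."""
    Z = np.column_stack([np.ones(len(w)), np.asarray(X, dtype=float)])
    w = np.asarray(w, dtype=float)
    alpha = np.zeros(Z.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-Z @ alpha))          # current e(Xi)
        grad = Z.T @ (w - p)                          # score of the log-likelihood
        hess = Z.T @ (Z * (p * (1.0 - p))[:, None])   # Fisher information
        alpha += np.linalg.solve(hess, grad)
    return alpha, 1.0 / (1.0 + np.exp(-Z @ alpha))

# Treatment assigned with probability sigmoid(x), i.e. alpha = (0, 1).
rng = np.random.default_rng(5)
n = 20_000
x = rng.normal(size=n)
w = (rng.random(n) < 1.0 / (1.0 + np.exp(-x))).astype(int)
alpha_hat, e_hat = fit_logistic_propensity(x, w)
```

The recovered coefficients `alpha_hat` are close to (0, 1), and `e_hat` can be plugged into the IPW estimator.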

SLIDE 31

Propensity score estimation and inverse-propensity weighting

Covariate balancing propensity score (CBPS)

  • Use γ̂(1) = 1/ê(x) = 1 + e^(−xᵀα̂(1)) and solve for α̂(1) by moment matching:

(1/n) Σ{i=1..n} γ̂(1)(Xi) Wi Xi − X̄ = 0.

  • Same for γ̂(0) = 1/(1 − ê(x)) = (1 + e^(−xᵀα̂(0))) / e^(−xᵀα̂(0)).

Note that γ̂(1) and γ̂(0) do not use the same propensity model, but we can verify that both α̂(1) and α̂(0) are √n-consistent: ‖α̂(w) − α‖2 = OP(1/√n) for w ∈ {0, 1}.

SLIDE 32

Propensity score estimation and inverse-propensity weighting

IPW with the covariate balancing propensity score (CBPS)
Under regularity assumptions (including overlap, i.e. ∃ η > 0 such that η ≤ e(x) ≤ 1 − η for all x ∈ X), we have:

τ̂CBPS = X̄ (β(1) − β(0)) + (1/n) Σ{i=1..n} [ Wi εi / ê(Xi) − (1 − Wi) εi / (1 − ê(Xi)) ] + OP(1/n),

and this estimator has the same asymptotic variance as the bucketing estimator.

SLIDE 33

Double robustness property

SLIDE 34

Double robustness of CBPS

Under the linear-logistic specification, τ̂CBPS has "good" asymptotic variance. What happens if the model is mis-specified?

Double robustness
τ̂CBPS remains consistent in either one of the following cases:

  • 1. The outcome model is linear but the propensity score e(x) is not logistic.
  • 2. The propensity score e(x) is logistic but the outcome model is not linear.

Note that the asymptotic variance might be different in these cases.

SLIDE 35

Another doubly robust ATE estimator

Define µ(w)(x) := E[Yi(w) | Xi = x] and e(x) := P(Wi = 1 | Xi = x).

Doubly robust estimator
Assume we have access to estimates µ̂(w) and ê. Then

τ̂DR := (1/n) Σ{i=1..n} [ µ̂(1)(Xi) − µ̂(0)(Xi) + Wi (Yi − µ̂(1)(Xi)) / ê(Xi) − (1 − Wi) (Yi − µ̂(0)(Xi)) / (1 − ê(Xi)) ]

is consistent if either the µ̂(w)(x) are consistent or ê(x) is consistent. Furthermore, τ̂DR has the same asymptotic variance as τ̂BUCKET and τ̂CBPS.

Remark: in case of overparametrization or non-parametric estimation, µ̂(w)(x) and ê(x) should be learned/estimated by cross-validation to avoid overfitting.
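The doubly robust (AIPW) formula translates directly into numpy. The sketch below (not from the slides) also illustrates the double robustness numerically: with a deliberately wrong outcome model but the correct propensity, the estimate stays consistent. The simulated setting (µ(1)(x) = 1 + x, µ(0)(x) = x, e(x) = 0.5, true ATE 1) is invented:

```python
import numpy as np

def tau_dr(y, w, mu1_hat, mu0_hat, e_hat):
    """Doubly robust (AIPW) ATE estimate: regression-adjustment term plus an
    inverse-propensity-weighted correction built on the outcome residuals."""
    y, w = np.asarray(y, dtype=float), np.asarray(w, dtype=float)
    return np.mean(mu1_hat - mu0_hat
                   + w * (y - mu1_hat) / e_hat
                   - (1 - w) * (y - mu0_hat) / (1 - e_hat))

# mu(1)(x) = 1 + x, mu(0)(x) = x, e(x) = 0.5, true ATE = 1.
rng = np.random.default_rng(6)
n = 20_000
x = rng.normal(size=n)
w = rng.integers(0, 2, size=n)
y = w * (1.0 + x) + (1 - w) * x + rng.normal(size=n)

ok_both = tau_dr(y, w, 1.0 + x, x, 0.5)                    # both models correct
ok_bad_mu = tau_dr(y, w, np.zeros(n), np.zeros(n), 0.5)    # wrong mu, correct e
```

Both `ok_both` and `ok_bad_mu` land near the true ATE of 1, the latter at the cost of a larger variance.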

SLIDE 36

Semiparametric efficiency for ATE estimation

Efficient score estimator
Given unconfoundedness ({Yi(0), Yi(1)} ⊥⊥ Wi | Xi) but no further parametric assumptions on µ(w)(x) and e(x), the previously attained asymptotic variance,

V∗ := Var(τ(X)) + E[ σ²(X) / (e(X)(1 − e(X))) ],

is optimal, and any estimator τ̂∗ that attains it is asymptotically equivalent to τ̂DR. V∗ is the semiparametric efficient variance for ATE estimation.

SLIDE 37

Cross-fitting and machine learning for ATE estimation

SLIDE 38

Cross-fitting for ATE estimation

Cross-fitted ATE estimator
Assume we divide the data into K folds. Define

τ̂CF = (1/n) Σ{i=1..n} [ µ̂(1)^(−k(i))(Xi) − µ̂(0)^(−k(i))(Xi) + Wi (Yi − µ̂(1)^(−k(i))(Xi)) / ê^(−k(i))(Xi) − (1 − Wi) (Yi − µ̂(0)^(−k(i))(Xi)) / (1 − ê^(−k(i))(Xi)) ],

where k(i) maps observation i to one of the K folds and the superscript (−j) indicates that the estimator has been learned on all folds except the j-th. Assuming overlap, sup-norm consistency of all the machine learning adjustments used, and sufficient risk decay, we have

√n (τ̂CF − τ̂DR) →p 0 as n → ∞,

and we can prove that we can build level-α confidence intervals for τ.
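A minimal cross-fitting sketch (not from the course): for each fold k the nuisances are fit on the other K − 1 folds only. For simplicity the nuisances here are OLS outcome models and a constant propensity, as in a randomized trial; any ML method could be swapped in. All names and the simulated data (true ATE 1) are invented:

```python
import numpy as np

def tau_cf(x, y, w, K=5, seed=0):
    """Cross-fitted AIPW estimate of the ATE.  Nuisance estimates used on fold k
    (OLS outcome fits and a constant propensity) are learned on the other folds."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    n = len(y)
    Z = np.column_stack([np.ones(n), x])
    fold = np.random.default_rng(seed).integers(0, K, size=n)
    psi = np.empty(n)
    for k in range(K):
        te, tr = fold == k, fold != k
        b1, *_ = np.linalg.lstsq(Z[tr & (w == 1)], y[tr & (w == 1)], rcond=None)
        b0, *_ = np.linalg.lstsq(Z[tr & (w == 0)], y[tr & (w == 0)], rcond=None)
        e_hat = w[tr].mean()                 # propensity learned out-of-fold
        mu1, mu0 = Z[te] @ b1, Z[te] @ b0
        psi[te] = (mu1 - mu0
                   + w[te] * (y[te] - mu1) / e_hat
                   - (1 - w[te]) * (y[te] - mu0) / (1 - e_hat))
    return psi.mean()

# Linear outcomes, randomized assignment, true ATE = 1.
rng = np.random.default_rng(7)
n = 20_000
x = rng.normal(size=n)
w = rng.integers(0, 2, size=n)
y = w * (1.0 + x) + (1 - w) * x + rng.normal(size=n)
```

`tau_cf(x, y, w)` lands near 1; the point of the fold structure is that each observation is never evaluated with nuisance models trained on itself.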

SLIDE 39

Heterogeneous treatment effect estimation

Instead of estimating the average treatment effect, we may seek to estimate the conditional average treatment effect function τ(x) = E[Yi(1) − Yi(0) | Xi = x].

→ This problem is harder to solve (note that τ = E[τ(X)]).
→ Take care of regularization bias (different amounts of regularization in the treatment and control models).
→ Further investigations are needed in different directions.

SLIDE 40

Optimal policy estimation

Beyond causal inference: based on the heterogeneous treatment effect, establish decision rules by defining an optimal policy. Given Π = {π : X → {0, 1}, with potentially some constraints on π}, find the policy that maximizes the expected utility E[Yi(π(Xi))] or that minimizes the regret.

SLIDE 41

Conclusion


SLIDE 43

Summary

Problem and question of interest

  • Estimate the effect of a treatment on an individual via a potential outcomes model.
  • Inevitably faced with missing values (we only observe one outcome per individual).

Established approach(es)

  • First solution: randomized controlled trials (RCTs).
  • Second solution: bucketing / inverse-propensity weighting to adjust for biases in the treatment assignment.
  • The efficient score estimator is computationally feasible (by using cross-fitting).
  • Double robustness property under model mis-specification.
  • Using machine learning approaches does not harm the interpretability of the causal effect estimation.

SLIDE 44

Objectives for Traumabase and traumatic brain injury (TBI)

  • Traumatic brain injury is a very heterogeneous injury: patients' injury and physiological profiles can differ a lot, and the symptoms and degrees of severity cover a large spectrum. Is it still possible to estimate causal effects using the potential outcomes model?
  • Does the administration of tranexamic acid have an effect on mortality? → single treatment and binary outcome, currently studied by a student group.
  • Do certain treatment strategies, i.e. bundles of treatments (administration of noradrenaline and SSH and tranexamic acid, etc.), have an effect on 24h mortality, on 14d mortality, etc.? → more methodological investigations are needed to perform causal inference for this type of question.

SLIDE 45

Alternatives to potential outcomes models

The potential outcomes model was proposed by Neyman (1923) and Rubin (1974). But there are other approaches to causal effect estimation:

  • Structural equation models (common in economics and the social sciences).
  • Instrumental variables (Wright, 1928).

The causal inference model can also be made richer by introducing "mediators", which are affected by the treatment and linked to the outcome.

SLIDE 46

Do you have any questions or comments?


SLIDE 47

References I

  • Loftus, J. (2015). Lecture on Causal Inference. Stanford University.
  • Pearl, J. (2000). Causality: Models, Reasoning and Inference. Cambridge University Press, New York, NY.
  • Wager, S. (2018). Lecture Notes on Causal Inference (OIT 661). Stanford University.