

SLIDE 1

Causal Inference

An introduction based on S. Wager’s course on Causal Inference (OIT 661)

Imke Mayer November 23, 2018

Group Meeting, CMAP

SLIDE 2

Outline

  • 1. Treatment effect estimation in randomized experiments
  • 2. Beyond a single randomized controlled trial
  • 3. Inverse-propensity weighting
  • 4. Double robustness property
  • 5. Cross-fitting and machine learning for ATE estimation
  • 6. Conclusion


SLIDE 3

Treatment effect estimation in randomized experiments


SLIDE 5

Definitions and notation

Causal effect
Given a binary treatment Wi ∈ {0, 1} on the i-th individual, with potential outcomes Yi(1) and Yi(0), the individual causal effect of the treatment is ∆i = Yi(1) − Yi(0).

  • Problem: ∆i is never observed (at most one of the two potential outcomes is seen per individual).
  • (Partial) solution: randomized experiments to learn certain properties of ∆i.
  • Average treatment effect: τ = E[∆i] = E[Yi(1) − Yi(0)].


SLIDE 7

Average treatment effect (ATE)

Average treatment effect
τ = E[∆i] = E[Yi(1) − Yi(0)]

Idea: estimate τ using large randomized experiments.

Assumptions: random variables (Y, W) taking values in R × {0, 1}; we observe n iid samples (Yi, Wi), each satisfying:

  • Yi = Yi(Wi) (SUTVA)
  • Wi ⊥⊥ {Yi(0), Yi(1)} (random treatment assignment)

Difference-in-means estimator
τ̂DM = (1/n1) Σ{i: Wi=1} Yi − (1/n0) Σ{i: Wi=0} Yi, where nw = #{i : Wi = w}.
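As a numerical sketch (not part of the original slides), the difference-in-means estimator is essentially one line of numpy. The simulated experiment below, with a true effect of 2, is invented for illustration; all names are made up:

```python
import numpy as np

def tau_dm(y, w):
    """Difference-in-means ATE estimate: mean outcome of the treated (Wi = 1)
    minus mean outcome of the controls (Wi = 0)."""
    y, w = np.asarray(y, dtype=float), np.asarray(w)
    return y[w == 1].mean() - y[w == 0].mean()

# Toy randomized experiment: Yi = Yi(Wi) (SUTVA), Wi assigned at random,
# and the treatment shifts the outcome by exactly 2.
rng = np.random.default_rng(0)
n = 10_000
w = rng.integers(0, 2, size=n)
y = 1.0 + 2.0 * w + rng.normal(size=n)
print(tau_dm(y, w))   # close to the true ATE of 2
```

With random assignment the estimate concentrates around the true ATE at the √n rate stated on the next slide.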

SLIDE 9

Average treatment effect estimation

Properties of τ̂DM

  • Under the previous assumptions (iid sampling, SUTVA, random treatment assignment), τ̂DM is unbiased and √n-consistent:

√n (τ̂DM − τ) →d N(0, VDM) as n → ∞,

where VDM = Var(Yi(0)) / P(Wi = 0) + Var(Yi(1)) / P(Wi = 1).

  • Using plug-in estimators we also get confidence intervals:

lim{n→∞} P( τ ∈ [ τ̂DM ± Φ⁻¹(1 − α/2) √(V̂DM / n) ] ) = 1 − α,

where Φ is the standard Gaussian cdf.
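The plug-in confidence interval above can be sketched directly in numpy (this is an illustration, not code from the course; the 95% level and the simulated data are fixed for simplicity):

```python
import numpy as np

def dm_confint(y, w):
    """95% confidence interval for tau around the difference-in-means estimate,
    using the plug-in estimate of the asymptotic variance V_DM."""
    y, w = np.asarray(y, dtype=float), np.asarray(w)
    n = len(y)
    tau_hat = y[w == 1].mean() - y[w == 0].mean()
    # V_DM = Var(Yi(0)) / P(Wi = 0) + Var(Yi(1)) / P(Wi = 1), each term plugged in
    v_hat = (y[w == 0].var(ddof=1) / (w == 0).mean()
             + y[w == 1].var(ddof=1) / (w == 1).mean())
    half = 1.96 * np.sqrt(v_hat / n)   # Phi^{-1}(1 - 0.05/2) ~ 1.96
    return tau_hat - half, tau_hat + half

# Simulated randomized experiment with true ATE = 2.
rng = np.random.default_rng(1)
n = 10_000
w = rng.integers(0, 2, size=n)
y = 1.0 + 2.0 * w + rng.normal(size=n)
lo, hi = dm_confint(y, w)
```

On this simulated data the interval is narrow (half-width about 1.96 √(V̂DM/n)) and centered near the true effect.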


SLIDE 11

Average treatment effect estimation with difference-in-means

Difference-in-means estimator

  • conceptually simple and easy to compute,
  • consistent, with asymptotically valid inference,
  • but is it the optimal way to use the data for fixed finite n?

Average treatment effect
τ is a causal parameter, i.e. a property of the population that we wish to know. It is not tied to the study design or the estimation method.

SLIDE 12

Randomized trials in the linear model

Idea: assume that the responses Yi(0) and Yi(1) are linear in the covariates.

Assumptions

  • n iid samples (Xi, Yi, Wi),
  • Yi(w) = c(w) + Xi β(w) + εi(w), for w ∈ {0, 1},
  • E[εi(w) | Xi] = 0 and Var(εi(w) | Xi) = σ².

Without loss of generality we additionally assume:

  • P(Wi = 0) = P(Wi = 1) = 1/2,
  • E[X] = 0.


SLIDE 14

Randomized trials in the linear model

OLS estimator
τ̂OLS := ĉ(1) − ĉ(0) + X̄ (β̂(1) − β̂(0)) = (1/n) Σ{i=1..n} [ (ĉ(1) + Xi β̂(1)) − (ĉ(0) + Xi β̂(0)) ],

where X̄ = (1/n) Σ{i=1..n} Xi, and the estimators are obtained by OLS on the two linear models (treated and control samples fitted separately).

Properties of τ̂OLS

  • Asymptotic independence of ĉ(w), β̂(w) and X̄, together with the decomposition

τ̂OLS − τ = (ĉ(1) − c(1)) − (ĉ(0) − c(0)) + X̄ (β(1) − β(0)) + X̄ (β̂(1) − β̂(0) − β(1) + β(0)).

  • Writing VOLS = 4σ² + (β(0) − β(1))ᵀ Var(X) (β(0) − β(1)), the central limit theorem gives

√n (τ̂OLS − τ) →d N(0, VOLS) as n → ∞.
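A minimal sketch of τ̂OLS (not from the slides): fit two OLS regressions and average the difference of the fitted surfaces over the empirical covariate distribution. The simulated linear model, with c(1) − c(0) = 1 and E[X] = 0 so the true ATE is 1, is invented for illustration:

```python
import numpy as np

def tau_ols(X, y, w):
    """OLS ATE estimate: separate linear fits (with intercept) on treated and
    controls, then average the difference of the two fitted surfaces over all Xi.
    This equals c^(1) - c^(0) + Xbar (beta^(1) - beta^(0))."""
    y, w = np.asarray(y, dtype=float), np.asarray(w)
    Z = np.column_stack([np.ones(len(y)), X])        # intercept + covariates
    coef1, *_ = np.linalg.lstsq(Z[w == 1], y[w == 1], rcond=None)
    coef0, *_ = np.linalg.lstsq(Z[w == 0], y[w == 0], rcond=None)
    return (Z @ coef1 - Z @ coef0).mean()

# Linear potential outcomes: c(0)=0, c(1)=1, beta(0)=1, beta(1)=3, E[X]=0,
# so the true ATE is c(1) - c(0) = 1.
rng = np.random.default_rng(2)
n = 20_000
x = rng.normal(size=n)
w = rng.integers(0, 2, size=n)
y = w * (1.0 + 3.0 * x) + (1 - w) * (1.0 * x) + rng.normal(size=n)
```

Here `tau_ols(x, y, w)` lands near 1, with the smaller asymptotic variance VOLS from the next slide.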

SLIDE 15

Randomized trials in the linear model

Properties of τ̂OLS

  • Writing VOLS = 4σ² + ‖β(0) − β(1)‖²A, where ‖v‖²A = vᵀAv and A = Var(X), the central limit theorem gives

√n (τ̂OLS − τ) →d N(0, VOLS) as n → ∞.

Remark

  • Under the linearity assumption,

VDM = 4σ² + ‖β(0) − β(1)‖²A + ‖β(0) + β(1)‖²A.

⇒ τ̂OLS is always at least as good as τ̂DM in terms of asymptotic variance.

  • This still holds in case of model mis-specification (the proof uses Huber–White linear regression analysis).

SLIDE 16

Beyond a single randomized controlled trial

SLIDE 17

How to combine different experiments or data sets

Study the effect of a cash incentive to discourage teenagers from smoking in two different cities.

Correct aggregation of the two studies: [figure omitted]

SLIDE 21

Aggregating several ATE estimators

How to combine several trials testing the same treatment but on different populations?

Assumptions

  • n iid samples (Xi, Yi, Wi),
  • Covariates Xi take values in a finite discrete space X (with |X| = p),
  • Treatment assignment is random conditionally on Xi: {Yi(0), Yi(1)} ⊥⊥ Wi | Xi = x, for all x ∈ X.

Bucket-wise ATE
τ(x) = E[Yi(1) − Yi(0) | Xi = x]

SLIDE 22

Results for aggregated difference-in-means estimators

Aggregated difference-in-means estimator
τ̂ := Σ{x∈X} (nx/n) τ̂(x) = Σ{x∈X} (nx/n) [ (1/nx1) Σ{i: Xi=x, Wi=1} Yi − (1/nx0) Σ{i: Xi=x, Wi=0} Yi ]

  • Denoting e(x) = P(Wi = 1 | Xi = x) and adding the simplifying assumption Var(Y(w) | X = x) = σ²(x), we can show that

√nx (τ̂(x) − τ(x)) →d N( 0, σ²(x) / (e(x)(1 − e(x))) ) as n → ∞.

  • Finally, denoting VBUCKET = Var(τ(X)) + E[ σ²(X) / (e(X)(1 − e(X))) ],

√n (τ̂ − τ) →d N(0, VBUCKET) as n → ∞,

with no dependence on p, the number of buckets!
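A sketch of the bucket-wise aggregation (not from the slides): a within-bucket difference-in-means for each value of a discrete covariate, weighted by the bucket shares nx/n. The two-bucket "two cities" simulation, with different propensities and bucket effects τ(0) = 1, τ(1) = 3 (so the true ATE is 2), is invented for illustration:

```python
import numpy as np

def tau_bucket(x, y, w):
    """Aggregated difference-in-means: a within-bucket DM estimate for each
    value of the discrete covariate x, weighted by the bucket share n_x / n."""
    x, y, w = np.asarray(x), np.asarray(y, dtype=float), np.asarray(w)
    total = 0.0
    for v in np.unique(x):
        m = x == v
        total += m.mean() * (y[m & (w == 1)].mean() - y[m & (w == 0)].mean())
    return total

# Two buckets ("cities") with different treatment propensities e(0)=0.3,
# e(1)=0.7 and bucket-wise effects tau(0)=1, tau(1)=3; the true ATE is 2.
rng = np.random.default_rng(3)
n = 20_000
x = rng.integers(0, 2, size=n)
e = 0.3 + 0.4 * x
w = (rng.random(n) < e).astype(int)
y = x + (1.0 + 2.0 * x) * w + rng.normal(size=n)
```

Because treatment is more likely in the x = 1 bucket, a naive pooled difference-in-means would be biased here, while `tau_bucket(x, y, w)` recovers the ATE.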

SLIDE 23

Inverse-propensity weighting

SLIDE 24

Continuous X and the propensity score

Observation from discrete X with a finite number of buckets: the number of buckets p does not affect the accuracy of inference.

How to transpose the analysis and results to the continuous case?

  • 1. Modify the assumptions
  • 2. Define an analogue of "buckets"

Assumptions

  • n iid samples (Xi, Yi, Wi),
  • Covariates Xi take values in a continuous space X,
  • Treatment assignment is random conditionally on Xi: {Yi(0), Yi(1)} ⊥⊥ Wi | Xi ≡ unconfoundedness assumption.

SLIDE 25

Unconfoundedness and the propensity score

Propensity score
e(x) = P(Wi = 1 | Xi = x) for all x ∈ X.

SLIDE 26

Unconfoundedness and the propensity score

Key property
e is a balancing score, i.e. under unconfoundedness it satisfies {Yi(0), Yi(1)} ⊥⊥ Wi | e(Xi). As a consequence, it suffices to control for e(X) (rather than X) to remove the biases associated with non-random treatment assignment.

SLIDE 27

Unconfoundedness and the propensity score: finite number of strata

If the data fall into J strata (Sj), 1 ≤ j ≤ J, with J < ∞ and such that e(x) = ej in each stratum, then we have a consistent estimator for the ATE:

τ̂ := Σ{j=1..J} (nj/n) τ̂j = Σ{j=1..J} (nj/n) [ (1/nj1) Σ{i: Xi∈Sj, Wi=1} Yi − (1/nj0) Σ{i: Xi∈Sj, Wi=0} Yi ]

SLIDE 28

Unconfoundedness and the propensity score: inverse-propensity weighting

The previous finite-number-of-strata assumption is unrealistic, but we can generalize the previous estimator using propensity score estimates:

τ̂ := Σ{j=1..J} (nj/n) [ (1/nj1) Σ{i: Xi∈Sj, Wi=1} Yi − (1/nj0) Σ{i: Xi∈Sj, Wi=0} Yi ]
   = (1/n) Σ{j=1..J} [ (1/êj) Σ{i: Xi∈Sj, Wi=1} Yi − (1/(1 − êj)) Σ{i: Xi∈Sj, Wi=0} Yi ]
   = (1/n) Σ{i=1..n} [ Wi Yi / ê(Xi) − (1 − Wi) Yi / (1 − ê(Xi)) ].

Here ê(x) = êj = nj1/nj for all x ∈ Sj, but we could use any other method to estimate ê.

SLIDE 29

Unconfoundedness and the propensity score: inverse-propensity weighting

We thus define

τ̂IPW = (1/n) Σ{i=1..n} [ Wi Yi / ê(Xi) − (1 − Wi) Yi / (1 − ê(Xi)) ],

an inverse-propensity weighted estimator of the ATE. The quality of this estimator depends on the quality of the estimate ê(x).
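τ̂IPW is a single weighted average once propensity estimates are in hand. The sketch below (not from the slides) reuses a confounded two-group simulation where treatment is more likely when x = 1 and the true ATE is 2, with ê estimated bucket-wise as êj = nj1/nj; all names and data are invented:

```python
import numpy as np

def tau_ipw(y, w, e_hat):
    """IPW ATE estimate given (estimated) propensity scores e_hat = e(Xi)."""
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    e_hat = np.asarray(e_hat, dtype=float)
    return np.mean(w * y / e_hat - (1 - w) * y / (1 - e_hat))

# Confounded assignment: treatment probability 0.3 when x=0, 0.7 when x=1;
# bucket-wise effects 1 and 3, so the true ATE is 2.
rng = np.random.default_rng(4)
n = 20_000
x = rng.integers(0, 2, size=n)
e = 0.3 + 0.4 * x
w = (rng.random(n) < e).astype(int)
y = x + (1.0 + 2.0 * x) * w + rng.normal(size=n)
e_hat = np.where(x == 1, w[x == 1].mean(), w[x == 0].mean())   # e_j = n_j1 / n_j
```

With these bucket-wise propensity estimates, `tau_ipw(y, w, e_hat)` reproduces the aggregated difference-in-means behaviour.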

SLIDE 30

Propensity score estimation and inverse-propensity weighting

Assume a linear-logistic model:

  • 1. e(x) = P(Wi = 1 | Xi = x) = 1 / (1 + e^(−xᵀα)),
  • 2. µ(w)(x) = xᵀβ(w) (for w ∈ {0, 1}),
  • 3. Yi = µ(Wi)(Xi) + εi.

Decompose the general ATE estimator

τ̂ = (1/n) Σ{i=1..n} [ γ̂(1)(Xi) Wi Yi − γ̂(0)(Xi) (1 − Wi) Yi ]

as follows:

τ̂ = X̄ (β(1) − β(0)) + [term to pay that depends on the noise ε]
   + [ (1/n) Σ{i=1..n} γ̂(1)(Xi) Wi Xi − X̄ ] β(1)
   − [ (1/n) Σ{i=1..n} γ̂(0)(Xi) (1 − Wi) Xi − X̄ ] β(0).
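The linear-logistic propensity model above can be fit by plain maximum likelihood; the Newton iteration below is a generic sketch, not the course's method (the CBPS fit on the next slide uses moment matching instead). Function names and the simulated data are invented:

```python
import numpy as np

def fit_logistic_propensity(X, w, steps=25):
    """Maximum-likelihood fit of the linear-logistic propensity model
    e(x) = 1 / (1 + exp(-x^T alpha)) via Newton's method (intercept included).
    Returns the coefficient vector and the fitted propensities e(Xi)."""
    Z = np.column_stack([np.ones(len(w)), np.asarray(X, dtype=float)])
    w = np.asarray(w, dtype=float)
    alpha = np.zeros(Z.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-Z @ alpha))          # current e(Xi)
        grad = Z.T @ (w - p)                          # score of the log-likelihood
        hess = Z.T @ (Z * (p * (1.0 - p))[:, None])   # Fisher information
        alpha += np.linalg.solve(hess, grad)
    return alpha, 1.0 / (1.0 + np.exp(-Z @ alpha))

# Treatment assigned with probability sigmoid(x), i.e. alpha = (0, 1).
rng = np.random.default_rng(5)
n = 20_000
x = rng.normal(size=n)
w = (rng.random(n) < 1.0 / (1.0 + np.exp(-x))).astype(int)
alpha_hat, e_hat = fit_logistic_propensity(x, w)
```

The recovered coefficients `alpha_hat` are close to (0, 1), and `e_hat` can be plugged into the IPW estimator.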

SLIDE 31

Propensity score estimation and inverse-propensity weighting

Covariate balancing propensity score (CBPS)

  • Use γ̂(1) = 1/ê(x) = 1 + e^(−xᵀα̂(1)) and solve for α̂(1) by moment matching:

(1/n) Σ{i=1..n} γ̂(1)(Xi) Wi Xi − X̄ = 0.

  • Same for γ̂(0) = 1/(1 − ê(x)) = (1 + e^(−xᵀα̂(0))) / e^(−xᵀα̂(0)).

Note that γ̂(1) and γ̂(0) do not use the same propensity model, but we can verify that both α̂(1) and α̂(0) are √n-consistent: ‖α̂(w) − α‖2 = OP(1/√n) for w ∈ {0, 1}.

SLIDE 32

Propensity score estimation and inverse-propensity weighting

IPW with the covariate balancing propensity score (CBPS)
Under regularity assumptions (including overlap, i.e. ∃ η > 0 such that η ≤ e(x) ≤ 1 − η for all x ∈ X), we have:

τ̂CBPS = X̄ (β(1) − β(0)) + (1/n) Σ{i=1..n} [ Wi εi / ê(Xi) − (1 − Wi) εi / (1 − ê(Xi)) ] + OP(1/n),

and this estimator has the same asymptotic variance as the bucketing estimator.

SLIDE 33

Double robustness property

SLIDE 34

Double robustness of CBPS

Under the linear-logistic specification, τ̂CBPS has "good" asymptotic variance. What happens if the model is mis-specified?

Double robustness
τ̂CBPS remains consistent in either one of the following cases:

  • 1. The outcome model is linear but the propensity score e(x) is not logistic.
  • 2. The propensity score e(x) is logistic but the outcome model is not linear.

Note that the asymptotic variance might be different in these cases.

SLIDE 35

Another doubly robust ATE estimator

Define µ(w)(x) := E[Yi(w) | Xi = x] and e(x) := P(Wi = 1 | Xi = x).

Doubly robust estimator
Assume we have access to estimates µ̂(w) and ê. Then

τ̂DR := (1/n) Σ{i=1..n} [ µ̂(1)(Xi) − µ̂(0)(Xi) + Wi (Yi − µ̂(1)(Xi)) / ê(Xi) − (1 − Wi) (Yi − µ̂(0)(Xi)) / (1 − ê(Xi)) ]

is consistent if either the µ̂(w)(x) are consistent or ê(x) is consistent. Furthermore, τ̂DR has the same asymptotic variance as τ̂BUCKET and τ̂CBPS.

Remark: in case of overparametrization or non-parametric estimation, µ̂(w)(x) and ê(x) should be learned/estimated by cross-validation to avoid overfitting.
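The doubly robust (AIPW) formula translates directly into numpy. The sketch below (not from the slides) also illustrates the double robustness numerically: with a deliberately wrong outcome model but the correct propensity, the estimate stays consistent. The simulated setting (µ(1)(x) = 1 + x, µ(0)(x) = x, e(x) = 0.5, true ATE 1) is invented:

```python
import numpy as np

def tau_dr(y, w, mu1_hat, mu0_hat, e_hat):
    """Doubly robust (AIPW) ATE estimate: regression-adjustment term plus an
    inverse-propensity-weighted correction built on the outcome residuals."""
    y, w = np.asarray(y, dtype=float), np.asarray(w, dtype=float)
    return np.mean(mu1_hat - mu0_hat
                   + w * (y - mu1_hat) / e_hat
                   - (1 - w) * (y - mu0_hat) / (1 - e_hat))

# mu(1)(x) = 1 + x, mu(0)(x) = x, e(x) = 0.5, true ATE = 1.
rng = np.random.default_rng(6)
n = 20_000
x = rng.normal(size=n)
w = rng.integers(0, 2, size=n)
y = w * (1.0 + x) + (1 - w) * x + rng.normal(size=n)

ok_both = tau_dr(y, w, 1.0 + x, x, 0.5)                    # both models correct
ok_bad_mu = tau_dr(y, w, np.zeros(n), np.zeros(n), 0.5)    # wrong mu, correct e
```

Both `ok_both` and `ok_bad_mu` land near the true ATE of 1, the latter at the cost of a larger variance.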

SLIDE 36

Semiparametric efficiency for ATE estimation

Efficient score estimator
Given unconfoundedness ({Yi(0), Yi(1)} ⊥⊥ Wi | Xi) but no further parametric assumptions on µ(w)(x) and e(x), the previously attained asymptotic variance,

V∗ := Var(τ(X)) + E[ σ²(X) / (e(X)(1 − e(X))) ],

is optimal, and any estimator τ̂∗ that attains it is asymptotically equivalent to τ̂DR. V∗ is the semiparametric efficient variance for ATE estimation.

SLIDE 37

Cross-fitting and machine learning for ATE estimation

SLIDE 38

Cross-fitting for ATE estimation

Cross-fitted ATE estimator
Assume we divide the data into K folds. Define

τ̂CF = (1/n) Σ{i=1..n} [ µ̂(1)^(−k(i))(Xi) − µ̂(0)^(−k(i))(Xi) + Wi (Yi − µ̂(1)^(−k(i))(Xi)) / ê^(−k(i))(Xi) − (1 − Wi) (Yi − µ̂(0)^(−k(i))(Xi)) / (1 − ê^(−k(i))(Xi)) ],

where k(i) maps observation i to one of the K folds and the superscript (−j) indicates that the estimator has been learned on all folds except the j-th. Assuming overlap, sup-norm consistency of all the machine learning adjustments used, and sufficient risk decay, we have

√n (τ̂CF − τ̂DR) →p 0 as n → ∞,

and we can prove that we can build level-α confidence intervals for τ.
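A minimal cross-fitting sketch (not from the course): for each fold k the nuisances are fit on the other K − 1 folds only. For simplicity the nuisances here are OLS outcome models and a constant propensity, as in a randomized trial; any ML method could be swapped in. All names and the simulated data (true ATE 1) are invented:

```python
import numpy as np

def tau_cf(x, y, w, K=5, seed=0):
    """Cross-fitted AIPW estimate of the ATE.  Nuisance estimates used on fold k
    (OLS outcome fits and a constant propensity) are learned on the other folds."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    n = len(y)
    Z = np.column_stack([np.ones(n), x])
    fold = np.random.default_rng(seed).integers(0, K, size=n)
    psi = np.empty(n)
    for k in range(K):
        te, tr = fold == k, fold != k
        b1, *_ = np.linalg.lstsq(Z[tr & (w == 1)], y[tr & (w == 1)], rcond=None)
        b0, *_ = np.linalg.lstsq(Z[tr & (w == 0)], y[tr & (w == 0)], rcond=None)
        e_hat = w[tr].mean()                 # propensity learned out-of-fold
        mu1, mu0 = Z[te] @ b1, Z[te] @ b0
        psi[te] = (mu1 - mu0
                   + w[te] * (y[te] - mu1) / e_hat
                   - (1 - w[te]) * (y[te] - mu0) / (1 - e_hat))
    return psi.mean()

# Linear outcomes, randomized assignment, true ATE = 1.
rng = np.random.default_rng(7)
n = 20_000
x = rng.normal(size=n)
w = rng.integers(0, 2, size=n)
y = w * (1.0 + x) + (1 - w) * x + rng.normal(size=n)
```

`tau_cf(x, y, w)` lands near 1; the point of the fold structure is that each observation is never evaluated with nuisance models trained on itself.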

SLIDE 39

Heterogeneous treatment effect estimation

Instead of estimating the average treatment effect, we may seek to estimate the conditional average treatment effect function τ(x) = E[Yi(1) − Yi(0) | Xi = x].

→ This problem is harder to solve (note that τ = E[τ(X)]).
→ Take care of regularization bias (different amounts of regularization in the treatment and control models).
→ Further investigations are needed in different directions.

SLIDE 40

Optimal policy estimation

Beyond causal inference: based on the heterogeneous treatment effect, establish decision rules by defining an optimal policy. Given Π = {π : X → {0, 1}, with potentially some constraints on π}, find the policy that maximizes the expected utility E[Yi(π(Xi))] or that minimizes the regret.

SLIDE 41

Conclusion


SLIDE 43

Summary

Problem and question of interest

  • Estimate the effect of a treatment on an individual via a potential outcomes model.
  • Inevitably faced with missing values (we only observe one outcome per individual).

Established approach(es)

  • First solution: randomized controlled trials (RCTs).
  • Second solution: bucketing / inverse-propensity weighting to adjust for biases in the treatment assignment.
  • The efficient score estimator is computationally feasible (by using cross-fitting).
  • Double robustness property under model mis-specification.
  • Using machine learning approaches does not harm the interpretability of the causal effect estimation.

SLIDE 44

Objectives for Traumabase and traumatic brain injury (TBI)

  • Traumatic brain injury is a very heterogeneous injury: patients' injury and physiological profiles can differ a lot, and the symptoms and degrees of severity cover a large spectrum. Is it still possible to estimate causal effects using the potential outcomes model?
  • Does the administration of tranexamic acid have an effect on mortality? → single treatment and binary outcome, currently studied by a student group.
  • Do certain treatment strategies, i.e. bundles of treatments (administration of noradrenaline and SSH and tranexamic acid, etc.), have an effect on 24h mortality, on 14d mortality, etc.? → more methodological investigations are needed to perform causal inference for this type of question.

SLIDE 45

Alternatives to potential outcomes models

The potential outcomes model was proposed by Neyman (1923) and Rubin (1974). But there are other approaches to causal effect estimation:

  • Structural equation models (common in economics and the social sciences).
  • Instrumental variables (Wright, 1928).

The causal inference model can also be made richer by introducing "mediators", which are affected by the treatment and linked to the outcome.

SLIDE 46

Do you have any questions or comments?


SLIDE 47

References I

  • Loftus, J. (2015). Lecture on Causal Inference. Stanford University.
  • Pearl, J. (2000). Causality: Models, Reasoning and Inference. Cambridge University Press, New York, NY.
  • Wager, S. (2018). Lecture Notes on Causal Inference (OIT 661). Stanford University.