Machine Learning for Healthcare HST.956, 6.S897, Lecture 15: Causal Inference Part 2


SLIDE 1

Machine Learning for Healthcare HST.956, 6.S897

Lecture 15: Causal Inference Part 2 David Sontag

Acknowledgement: adapted from slides by Uri Shalit (Technion)

SLIDE 2

Reminder: Potential Outcomes

  • Each unit (individual) x_i has two potential outcomes:

    – Y0(x_i) is the potential outcome had the unit not been treated: the “control outcome”

    – Y1(x_i) is the potential outcome had the unit been treated: the “treated outcome”

  • Conditional average treatment effect for unit i:

    CATE(x_i) = E_{Y1∼p(Y1 | x_i)}[Y1 | x_i] − E_{Y0∼p(Y0 | x_i)}[Y0 | x_i]

  • Average Treatment Effect:

    ATE = E_{x∼p(x)}[CATE(x)]

SLIDE 3

Two common approaches for counterfactual inference:
  • Covariate adjustment
  • Propensity scores

SLIDE 4

Covariate adjustment (reminder)

Explicitly model the relationship between treatment, confounders, and outcome:

[Figure: covariates (features) x_1, x_2, x_3, … and treatment T feed into a regression model g(x, T), which outputs the outcome y]

SLIDE 5

Covariate adjustment (reminder)

  • Under ignorability,

    CATE(x) = E[Y1 | T=1, x] − E[Y0 | T=0, x]

  • Fit a model g(x, t) ≈ E[Y_t | T=t, x], then:

    ĈATE(x_i) = g(x_i, 1) − g(x_i, 0).

SLIDE 6

Covariate adjustment with linear models

  • Assume that:

    Y_t(x) = γ·x + δ·t + ε_t,   E[ε_t] = 0

  • Then:

    CATE(x) := E[Y1(x) − Y0(x)] = E[(γx + δ + ε1) − (γx + ε0)] = δ

[Figure: causal graph relating age (confounder), medication (treatment), and blood pressure (outcome)]
SLIDE 7
Covariate adjustment with linear models

  • Assume that:

    Y_t(x) = γ·x + δ·t + ε_t,   E[ε_t] = 0

  • Then:

    CATE(x) := E[Y1(x) − Y0(x)] = E[(γx + δ + ε1) − (γx + ε0)] = δ

    ATE := E_{p(x)}[CATE(x)] = δ

[Figure: causal graph relating age (confounder), medication (treatment), and blood pressure (outcome)]

SLIDE 8
Covariate adjustment with linear models

  • Assume that:

    Y_t(x) = γ·x + δ·t + ε_t,   E[ε_t] = 0,   so   ATE := E_{p(x)}[CATE(x)] = δ

  • For causal inference we need to estimate δ well, not Y_t(x): identification, not prediction

  • This is a major difference between ML and statistics

[Figure: causal graph relating age (confounder), medication (treatment), and blood pressure (outcome)]

SLIDE 9

What happens if true model is not linear?

  • True data generating process, x ∈ ℝ:

    Y_t(x) = γ·x + δ·t + ε·x²,   so   ATE = E[Y1 − Y0] = δ

  • Hypothesized (linear) model:

    Ŷ_t(x) = γ̂·x + δ̂·t

  • Least squares then yields

    δ̂ = δ + ε · (E[x²]·E[x²t] − E[x³]·E[xt]) / (E[x²]·E[t²] − E[xt]²)

Depending on ε, the estimate δ̂ can be made arbitrarily large or small!
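A short simulation makes the danger concrete (all numbers here are illustrative assumptions): with a quadratic term in the true process, the misspecified linear fit misses δ badly, while adding the x² feature recovers it exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
x = rng.normal(0.5, 1.0, size=n)
t = rng.binomial(1, 1 / (1 + np.exp(-2 * x)))   # confounded treatment assignment
delta, eps = 0.5, 3.0
y = 1.0 * x + delta * t + eps * x**2            # true ATE = delta = 0.5

# Misspecified hypothesis: y ~= gamma*x + delta*t
delta_lin = np.linalg.lstsq(np.column_stack([x, t]), y, rcond=None)[0][1]

# Correctly specified: y ~= gamma*x + delta*t + eps*x**2
delta_quad = np.linalg.lstsq(np.column_stack([x, t, x**2]), y, rcond=None)[0][1]
```

`delta_lin` is biased away from 0.5 by an amount proportional to `eps`, while `delta_quad` recovers it; good prediction of y is not the same as identification of δ.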

SLIDE 10

Covariate adjustment with non-linear models

  • Random forests and Bayesian trees

Hill (2011), Athey & Imbens (2015), Wager & Athey (2015)

  • Gaussian processes

Hoyer et al. (2009), Zigler et al. (2012)

  • Neural networks

Beck et al. (2000), Johansson et al. (2016), Shalit et al. (2016), Lopez-Paz et al. (2016)

SLIDE 11

Example: Gaussian processes

[Figure: two panels fitting treated and control outcome models Ŷ1(x) and Ŷ0(x) to treated and control data. “GP-Independent” fits separate treated and control models; “GP-Grouped” fits a joint treated and control model. Figures: Vincent Dorie & Jennifer Hill]

SLIDE 12

Example: Neural networks

Shalit, Johansson, Sontag. Estimating Individual Treatment Effect: Generalization Bounds and Algorithms. ICML, 2017

[Figure: neural network architecture. Covariates pass through shared representation layers Φ, which branch into separate heads predicting the two potential outcomes; the learning objective combines the prediction losses with a term on the shared representation]

SLIDE 13

Matching

  • Find each unit’s long-lost counterfactual identical twin, and check up on its outcome

SLIDE 14

Matching

  • Find each unit’s long-lost counterfactual identical twin, and check up on its outcome

[Photos: Obama, had he gone to law school; Obama, had he gone to business school]

SLIDE 15

Matching

  • Find each unit’s long-lost counterfactual identical twin, and check up on its outcome

  • Used for estimating both ATE and CATE
SLIDE 16

Match to nearest neighbor from opposite group

[Figure: scatter plot of treated and control units, axes: age and Charlson comorbidity index]

SLIDE 17

Match to nearest neighbor from opposite group

[Figure: scatter plot of treated and control units, axes: age and Charlson comorbidity index]

SLIDE 18

1-NN Matching

  • Let d(⋅,⋅) be a metric between x’s
  • For each unit i, define

    k(i) = argmin_{j s.t. t_j ≠ t_i} d(x_j, x_i)

    k(i) is the nearest counterfactual neighbor of i

  • If t_i = 1, unit i is treated:

    ĈATE(x_i) = y_i − y_{k(i)}

  • If t_i = 0, unit i is control:

    ĈATE(x_i) = y_{k(i)} − y_i

SLIDE 19

1-NN Matching

  • Let d(⋅,⋅) be a metric between x’s
  • For each unit i, define

    k(i) = argmin_{j s.t. t_j ≠ t_i} d(x_j, x_i)

    k(i) is the nearest counterfactual neighbor of i

  • ĈATE(x_i) = (2t_i − 1)(y_i − y_{k(i)})

  • ÂTE = (1/n) Σ_{i=1}^{n} ĈATE(x_i)
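The estimator on this slide can be sketched directly. The one-dimensional covariate, the absolute-difference metric d, the helper name `matching_ate`, and the synthetic data are all assumptions made for the illustration:

```python
import numpy as np

def matching_ate(x, t, y):
    """1-NN matching: impute each unit's counterfactual outcome with the
    outcome of its nearest neighbor in the opposite treatment group."""
    x, t, y = map(np.asarray, (x, t, y))
    cate = np.empty(len(t))
    for i in range(len(t)):
        opposite = np.flatnonzero(t != t[i])                 # units j with t_j != t_i
        k = opposite[np.argmin(np.abs(x[opposite] - x[i]))]  # k(i) under d(x_j, x_i) = |x_j - x_i|
        cate[i] = (2 * t[i] - 1) * (y[i] - y[k])             # CATE-hat(x_i)
    return cate.mean()                                       # ATE-hat

rng = np.random.default_rng(0)
n = 500
x = rng.uniform(0, 1, size=n)
t = rng.binomial(1, 0.5, size=n)
y = 3.0 * x + 2.0 * t            # true treatment effect = 2
ate_hat = matching_ate(x, t, y)  # close to 2.0
```

With many covariates the choice of metric d dominates the estimate, which is the weakness listed on the next slide.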

SLIDE 20

Matching

  • Interpretable, especially in the small-sample regime
  • Nonparametric
  • Heavily reliant on the underlying metric
  • Could be misled by features which don’t affect the outcome

SLIDE 21

Covariate adjustment and matching

  • Matching is equivalent to covariate adjustment with two 1-nearest-neighbor classifiers:

    Ŷ1(x) = y_{NN1(x)},   Ŷ0(x) = y_{NN0(x)}

    where NN_t(x) is the nearest neighbor of x among units with treatment assignment t ∈ {0, 1}

  • 1-NN matching is in general inconsistent, though only with small bias (Imbens 2004)

SLIDE 22

Two common approaches for counterfactual inference:
  • Covariate adjustment
  • Propensity scores

SLIDE 23

Propensity scores

  • Tool for estimating the ATE
  • Basic idea: turn an observational study into a pseudo-randomized trial by re-weighting samples, similar to importance sampling

SLIDE 24

Inverse propensity score re-weighting

x1 = age,   x2 = Charlson comorbidity index

[Figure: treated and control covariate distributions, with p(x | T=0) ≠ p(x | T=1)]

SLIDE 25

π‘ž 𝑦 𝑒 = 0 β‹… π‘₯$(𝑦) β‰ˆ π‘ž 𝑦 𝑒 = 1 β‹… π‘₯'(𝑦)

reweighted control reweighted treated

Inverse propensity score re-weighting

𝑦' = 𝑏𝑕𝑓 𝑦; = Charlson comorbidity index

Treated Control

SLIDE 26

Propensity score

  • Propensity score: p(T=1 | x), estimated using machine learning tools
  • Samples are re-weighted by the inverse propensity score of the treatment they received
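The slide leaves the estimator open (“machine learning tools”); as one concrete possibility, here is a minimal logistic-regression propensity model fit by gradient ascent on the log-likelihood. The helper name, hyperparameters, and synthetic data are assumptions for illustration:

```python
import numpy as np

def fit_propensity(x, t, lr=0.5, steps=5000):
    """Fit p(T=1 | x) by logistic regression, trained with plain gradient
    ascent on the mean log-likelihood (a minimal stand-in for 'any ML tool')."""
    X = np.column_stack([x, np.ones(len(x))])          # feature + intercept
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-X @ w))                   # current p(T=1 | x_i)
        w += lr * X.T @ (t - p) / len(t)               # gradient of mean log-likelihood
    return lambda xq: 1 / (1 + np.exp(-(w[0] * xq + w[1])))

rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
true_p = 1 / (1 + np.exp(-(1.5 * x - 0.5)))            # true propensity (assumed DGP)
t = rng.binomial(1, true_p)
e_hat = fit_propensity(x, t)(x)                        # estimated propensity scores
```

Any calibrated classifier can play this role; logistic regression is just the simplest choice to write down.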

SLIDE 27

Propensity scores – algorithm

Inverse probability of treatment weighted estimator

How to calculate the ATE with the propensity score for a sample (x_1, t_1, y_1), …, (x_n, t_n, y_n):

  1. Use any ML method to estimate p̂(T=t | x)
  2. Compute

     ÂTE = (1/n) Σ_{i s.t. t_i=1} y_i / p̂(t_i=1 | x_i) − (1/n) Σ_{i s.t. t_i=0} y_i / p̂(t_i=0 | x_i)
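The two-step recipe above, sketched in code. For illustration the true (oracle) propensity is plugged in; in practice step 1 would estimate it with any ML method. Names and numbers are assumptions for the example:

```python
import numpy as np

def ipw_ate(t, y, e):
    """Step 2: inverse probability of treatment weighting.
    e[i] is p-hat(T=1 | x_i) from step 1."""
    n = len(t)
    treated = t == 1
    return (y[treated] / e[treated]).sum() / n \
         - (y[~treated] / (1 - e[~treated])).sum() / n

rng = np.random.default_rng(0)
n = 20_000
x = rng.normal(size=n)
e = 1 / (1 + np.exp(-x))                          # oracle propensity p(T=1 | x)
t = rng.binomial(1, e)
y = x + 2.0 * t + rng.normal(scale=0.1, size=n)   # true ATE = 2
ate_hat = ipw_ate(t, y, e)                        # close to 2.0
```

With e ≡ 0.5 the expression collapses to the randomized-trial difference of scaled group sums worked out on the following slides.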

SLIDE 28

Propensity scores – algorithm

Inverse probability of treatment weighted estimator

How to calculate the ATE with the propensity score for a sample (x_1, t_1, y_1), …, (x_n, t_n, y_n):

  1. In a randomized trial, p(T=t | x) = 0.5
  2. Compute

     ÂTE = (1/n) Σ_{i s.t. t_i=1} y_i / p̂(t_i=1 | x_i) − (1/n) Σ_{i s.t. t_i=0} y_i / p̂(t_i=0 | x_i)

SLIDE 29

Propensity scores – algorithm

Inverse probability of treatment weighted estimator

How to calculate the ATE with the propensity score for a sample (x_1, t_1, y_1), …, (x_n, t_n, y_n):

  1. In a randomized trial, p(T=t | x) = 0.5
  2. Compute

     ÂTE = (1/n) Σ_{i s.t. t_i=1} y_i / 0.5 − (1/n) Σ_{i s.t. t_i=0} y_i / 0.5

SLIDE 30

Propensity scores – algorithm

Inverse probability of treatment weighted estimator

How to calculate the ATE with the propensity score for a sample (x_1, t_1, y_1), …, (x_n, t_n, y_n):

  1. In a randomized trial, p = 0.5
  2. Compute

     ÂTE = (1/n) Σ_{i s.t. t_i=1} y_i / 0.5 − (1/n) Σ_{i s.t. t_i=0} y_i / 0.5
          = (2/n) Σ_{i s.t. t_i=1} y_i − (2/n) Σ_{i s.t. t_i=0} y_i

SLIDE 31

Propensity scores – algorithm

Inverse probability of treatment weighted estimator

How to calculate the ATE with the propensity score for a sample (x_1, t_1, y_1), …, (x_n, t_n, y_n):

  1. In a randomized trial, p = 0.5
  2. Compute

     ÂTE = (1/n) Σ_{i s.t. t_i=1} y_i / 0.5 − (1/n) Σ_{i s.t. t_i=0} y_i / 0.5
          = (2/n) Σ_{i s.t. t_i=1} y_i − (2/n) Σ_{i s.t. t_i=0} y_i

Each sum is over roughly n/2 terms, so this is the familiar difference of group means.

SLIDE 32

Propensity scores - derivation

  • Recall the average treatment effect:

    ATE = E_{x∼p(x)}[ E[Y1 | x, T=1] − E[Y0 | x, T=0] ]

  • We only have samples for:

    E_{x∼p(x|T=1)}[ E[Y1 | x, T=1] ]   and   E_{x∼p(x|T=0)}[ E[Y0 | x, T=0] ]

SLIDE 33

Propensity scores - derivation

  • We only have samples for:

    E_{x∼p(x|T=1)}[ E[Y1 | x, T=1] ]   and   E_{x∼p(x|T=0)}[ E[Y0 | x, T=0] ]

SLIDE 34

Propensity scores - derivation

  • We only have samples for:

    E_{x∼p(x|T=1)}[ E[Y1 | x, T=1] ]   and   E_{x∼p(x|T=0)}[ E[Y0 | x, T=0] ]

  • We need to turn p(x | T=1) into p(x):

    p(x | T=1) · p(T=1) / p(T=1 | x) = p(x)   ?

SLIDE 35

Propensity scores - derivation

  • We only have samples for:

    E_{x∼p(x|T=1)}[ E[Y1 | x, T=1] ]   and   E_{x∼p(x|T=0)}[ E[Y0 | x, T=0] ]

  • We need to turn p(x | T=1) into p(x):

    p(x | T=1) · p(T=1) / p(T=1 | x) = p(x)

    where p(T=1 | x) is the propensity score

SLIDE 36

Propensity scores - derivation

  • We only have samples for:

    E_{x∼p(x|T=1)}[ E[Y1 | x, T=1] ]   and   E_{x∼p(x|T=0)}[ E[Y0 | x, T=0] ]

  • We need to turn p(x | T=0) into p(x):

    p(x | T=0) · p(T=0) / p(T=0 | x) = p(x)

    where p(T=0 | x) = 1 − p(T=1 | x), one minus the propensity score

SLIDE 37
  • We want: E_{x∼p(x)}[Y1(x)]

  • We know that: p(x | T=1) · p(T=1) / p(T=1 | x) = p(x)

  • Thus:

    E_{x∼p(x|T=1)}[ (p(T=1) / p(T=1 | x)) · Y1(x) ] = E_{x∼p(x)}[Y1(x)]

  • We can approximate this empirically as (similarly for t_i = 0):

    (1/n1) Σ_{i s.t. t_i=1} ((n1/n) / p̂(t_i=1 | x_i)) · y_i = (1/n) Σ_{i s.t. t_i=1} y_i / p̂(t_i=1 | x_i)

SLIDE 38

Problems with IPW

  • Need to estimate the propensity score (a problem in all propensity score methods)
  • If there’s not much overlap, propensity scores become non-informative and easily miscalibrated
  • Weighting by the inverse can create large variance and large errors for small propensity scores
    – Exacerbated when there are more than two treatments

SLIDE 39

Many more ideas and methods

  • Natural experiments & regression discontinuity
  • Instrumental variables
SLIDE 40

Many more ideas and methods – Natural experiments

  • Does stress during pregnancy affect later child development?
  • Confounding: genetics, the mother’s personality, economic factors…
  • Natural experiment: the Cuban missile crisis of October 1962, when many people were afraid a nuclear war was about to break out
  • Compare children who were in utero during the crisis with children from immediately before and after

SLIDE 41

Many more ideas and methods – Instrumental variables

  • Informally: a variable which affects treatment assignment but has no direct effect on the outcome
  • Example: are private schools better than public schools?
  • Confounding: different student populations, different teacher populations
  • We can’t force people to attend a particular school
SLIDE 42

Many more ideas and methods – Instrumental variables

  • Informally: a variable which affects treatment assignment but has no direct effect on the outcome
  • Example: are private schools better than public schools?
  • We can’t force people to attend a particular school
  • We can randomly give out vouchers to some children, giving them an opportunity to attend private schools
  • The voucher assignment is the instrumental variable

SLIDE 43

Summary

  • Two approaches to using machine learning for causal inference:
    1. Predict the outcome given features and treatment, then use the resulting model to impute counterfactuals (covariate adjustment)
    2. Predict the treatment using the features (the propensity score), then use it to reweight outcomes or stratify the data
  • Causal graphs are important for thinking through whether the problem is set up appropriately and whether the assumptions hold