MACHINE LEARNING FOR HEALTHCARE
6.S897, HST.S53, Lecture 3: Causal inference

SLIDE 1

MACHINE LEARNING FOR HEALTHCARE 6.S897, HST.S53

Lecture 3: Causal inference

Prof. David Sontag
MIT EECS, CSAIL, IMES

(Thanks to Uri Shalit for many of the slides)

SLIDE 2

*Last week: Type 2 diabetes

Early detection of Type 2 diabetes (Razavian et al., Big Data, 2016)

[Figure: U.S. maps of diagnosed diabetes prevalence in 1994, 2000, and 2013; legend buckets <4.5%, 4.5%–5.9%, 6.0%–7.4%, 7.5%–8.9%, >9.0%]

SLIDE 3

*Last week: Discovered risk factors (diabetes, 1-year gap)

Highly weighted features                    Odds Ratio
Impaired Fasting Glucose (Code 790.21)      4.17 (3.87–4.49)
Abnormal Glucose NEC (790.29)               4.07 (3.76–4.41)
Hypertension (401)                          3.28 (3.17–3.39)
Obstructive Sleep Apnea (327.23)            2.98 (2.78–3.20)
Obesity (278)                               2.88 (2.75–3.02)
Abnormal Blood Chemistry (790.6)            2.49 (2.36–2.62)
Hyperlipidemia (272.4)                      2.45 (2.37–2.53)
Shortness Of Breath (786.05)                2.09 (1.99–2.19)
Esophageal Reflux (530.81)                  1.85 (1.78–1.93)

Additional disease risk factors include: Pituitary dwarfism (253.3), Hepatomegaly (789.1), Chronic Hepatitis C (070.54), Hepatitis (573.3), Calcaneal Spur (726.73), Thyrotoxicosis without mention of goiter (242.90), Sinoatrial Node dysfunction (427.81), Acute frontal sinusitis (461.1), Hypertrophic and atrophic conditions of skin (701.9), Irregular menstruation (626.4), … (Razavian et al., Big Data, 2016)

SLIDE 4

Thinking about interventions

1. Do highly weighted features suggest avenues for preventing onset of diabetes?
   • Example: gastric bypass surgery has the highest negative weight (9th most predictive feature)
   • What is the mathematical justification for thinking of highly weighted features in this way?
2. What happens if the patient did not get diabetes because of an intervention made in the gap?
   • How do we deconvolve the effect of interventions from the prediction task?
3. Solution: reframe as a causal inference problem, i.e. predict for which patients an intervention will reduce the chances of getting T2D

SLIDE 5

Randomized trials vs. observational studies

Which treatment works better? A or B

SLIDE 6

Randomized controlled trial (RCT)

Which treatment works better? A or B

[Figure: patients along a socio-economic axis (wealthy to poor); treatments A and B are assigned at random across the spectrum]

SLIDE 7

Observational study

Which treatment works better? A or B

[Figure: same socio-economic axis (wealthy to poor); the mix of A's and B's now varies with socio-economic class]

SLIDE 8

Observational study

Which treatment works better? A or B

Socio-economic class is a potential confounder

[Figure: same as previous slide]

SLIDE 9

In many fields randomized studies are the gold standard for causal inference, but…

SLIDE 10

• Does inhaling asbestos cause cancer?
• Does decreasing the interest rate reinvigorate the economy?
• We have a budget for one new anti-diabetic drug experiment. Can we use past health records of 100,000 diabetics to guide us?

SLIDE 11

Even randomized controlled trials have flaws

• Not personalized: only a population-level effect
• The study population might not represent the true population
• Recruiting is hard
• People might drop out of the study
• A study in one company/hospital/state/country could fail to generalize to others

SLIDE 12

Example 1. Precision medicine: Individualized Treatment Effect (ITE)

SLIDE 13

Which treatment is best for me?

• Which anti-hypertensive treatment?
  • Calcium channel blocker (A)
  • ACE inhibitor (B)
• Current situation:
  • Clinical trials
  • Doctor's knowledge & intuition
• Use datasets of patients and their histories:
  • Blood pressure = 150/95
  • WBC count = 6×10⁹/L
  • Temperature = 98°F
  • HbA1c = 6.6%
  • Thickness of heart artery plaque = 3mm
  • Weight = 65kg
SLIDE 14

Which treatment is best for me?

• Which anti-hypertensive treatment?
  • Calcium channel blocker (A)
  • ACE inhibitor (B)
• Future blood pressure under treatment A vs. B
• Individualized Treatment Effect (ITE)
SLIDE 15

Which treatment is best for me?

• Which anti-hypertensive treatment?
  • Calcium channel blocker (A)
  • ACE inhibitor (B)
• Potential confounder: maybe rich patients got medication A more often, and poor patients got medication B more often

SLIDE 16

Example 2. Job training: Average Treatment Effect (ATE)

SLIDE 17

Should the government fund job-training programs?

• Existing job-training programs seem to help the unemployed and underemployed find better jobs
• Should the government fund such programs?
• Maybe training helps, but only marginally? Is it worth the investment?
• Average Treatment Effect (ATE)
• Potential confounder: maybe only motivated people go to job training? Maybe they would have found better jobs anyway?

SLIDE 18

Observational studies

A major challenge in causal inference from observational studies is how to control or adjust for the confounding factors

SLIDE 19

Counterfactuals and causal inference

• Does treatment T cause outcome Y?
• "If T had not occurred, Y would not have occurred" (David Hume)
• Counterfactuals: Kim received job training (T), and her income one year later (Y) is $20,000. What would have been Kim's income had she not had job training?

SLIDE 20

Counterfactuals and causal inference

• Counterfactuals: Kim received job training (T), and her income one year later (Y) is $20,000. What would have been Kim's income had she not had job training?
• If her income would have been $18,000, we say that job training caused an increase of $2,000 in Kim's income
• The problem: you never know what might have been

SLIDE 21

Sliding Doors

[Image: poster for "Sliding Doors" (1998), a film that follows two counterfactual versions of the protagonist's life]

SLIDE 22

Potential Outcomes Framework (Rubin-Neyman Causal Model)

• Each unit x_i has two potential outcomes:
  • Y0(x_i) is the potential outcome had the unit not been treated: "control outcome"
  • Y1(x_i) is the potential outcome had the unit been treated: "treated outcome"
• Individual Treatment Effect for unit i:
  ITE(x_i) = E_{Y1~p(Y1|x_i)}[Y1 | x_i] − E_{Y0~p(Y0|x_i)}[Y0 | x_i]
• Average Treatment Effect:
  ATE := E[Y1 − Y0] = E_{x~p(x)}[ITE(x)]

SLIDE 23

Potential Outcomes Framework (Rubin-Neyman Causal Model)

• Each unit x_i has two potential outcomes:
  • Y0(x_i) is the potential outcome had the unit not been treated: "control outcome"
  • Y1(x_i) is the potential outcome had the unit been treated: "treated outcome"
• Observed factual outcome:
  y_i = t_i·Y1(x_i) + (1 − t_i)·Y0(x_i)
• Unobserved counterfactual outcome (see the sketch below):
  y_i^CF = (1 − t_i)·Y1(x_i) + t_i·Y0(x_i)
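To make the bookkeeping concrete, here is a minimal sketch in Python (simulated data; all numbers and names are ours, not the slides'): it assembles factual and counterfactual outcomes from Y0, Y1, and t exactly as in the formulas above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
Y0 = rng.normal(loc=120.0, scale=5.0, size=n)  # potential control outcomes Y0(x_i)
Y1 = Y0 - 10.0                                 # potential treated outcomes Y1(x_i)
t = rng.binomial(1, 0.5, size=n)               # treatment assignments t_i

y_factual = t * Y1 + (1 - t) * Y0              # y_i: the only outcome we ever observe
y_counterfactual = (1 - t) * Y1 + t * Y0       # y_i^CF: never observed in practice

ite = Y1 - Y0                                  # here exactly -10 for every unit
ate = ite.mean()                               # ATE = E[Y1 - Y0]
```

The "fundamental problem" below is visible here: only y_factual would ever appear in a real dataset.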

SLIDE 24

Terminology

• Unit: data point, e.g. patient, customer, student
• Treatment: binary indicator (in this tutorial); also called intervention
• Treated: units who received treatment = 1
• Control: units who received treatment = 0
• Factual: the set of observed units with their respective treatment assignment
• Counterfactual: the factual set with flipped treatment assignment

SLIDE 25

Example – blood pressure and age

[Figure: x = age, y = blood pressure; two curves, Y1(x) for treated and Y0(x) for control]

SLIDE 26

Blood pressure and age

[Figure: same curves; ITE(x) is the gap between Y1(x) and Y0(x) at a given x]
SLIDE 27

Blood pressure and age

[Figure: same curves; the ATE averages the gap Y1(x) − Y0(x) over the distribution of x]

SLIDE 28

Blood pressure and age

[Figure: same curves, now with observed samples: treated points on Y1(x), control points on Y0(x)]

SLIDE 29

Blood pressure and age

[Figure: observed treated and control samples, plus their unobserved counterfactual counterparts on the opposite curve]

SLIDE 30

The fundamental problem of causal inference

"The fundamental problem of causal inference": we only ever observe one of the two outcomes

SLIDE 31

"The Assumptions" – no unmeasured confounders

Y0, Y1: potential outcomes for control and treated
x: unit covariates (features)
T: treatment assignment

We assume:

(Y0, Y1) ⫫ T | x

The potential outcomes are independent of treatment assignment, conditioned on covariates x

SLIDE 32

"The Assumptions" – no unmeasured confounders

Y0, Y1: potential outcomes for control and treated
x: unit covariates (features)
T: treatment assignment

We assume:

(Y0, Y1) ⫫ T | x

This assumption is called ignorability

SLIDE 33

Ignorability

(Y0, Y1) ⫫ T | x

[Graphical model: covariates x (features) have arrows into the treatment T and into the potential outcomes Y0, Y1; x is the only parent of T]

SLIDE 34

Ignorability

(Y0, Y1) ⫫ T | x

[Graphical model instantiated: x = age, gender, weight, diet, heart rate at rest, …; T = anti-hypertensive medication; Y0, Y1 = blood pressure after medication A / after medication B]

SLIDE 35

No ignorability

[Graphical model: as before, but a hidden confounder ("diabetic") has arrows into both the treatment T (anti-hypertensive medication) and the potential outcomes, so (Y0, Y1) ⫫ T | x no longer holds]

SLIDE 36

"The Assumptions" – common support

Y0, Y1: potential outcomes for control and treated
x: unit covariates (features)
T: treatment assignment

We assume:

p(T = t | X = x) > 0  for all t and x

SLIDE 37

Average Treatment Effect

The expected causal effect of T on Y:

ATE := E[Y1 − Y0]

SLIDE 38

Average Treatment Effect – the adjustment formula

• Assuming ignorability, we will derive the adjustment formula (Hernán & Robins 2010, Pearl 2009)
• The adjustment formula is extremely useful in causal inference
• Also called the G-formula
SLIDE 39

Average Treatment Effect

The expected causal effect of T on Y:

ATE := E[Y1 − Y0]

SLIDE 40

Average Treatment Effect

The expected causal effect of T on Y:

ATE := E[Y1 − Y0]

E[Y1] = E_{x~p(x)}[ E_{Y1~p(Y1|x)}[Y1 | x] ]    (law of total expectation)

SLIDE 41

Average Treatment Effect

The expected causal effect of T on Y:

ATE := E[Y1 − Y0]

E[Y1] = E_{x~p(x)}[ E_{Y1~p(Y1|x)}[Y1 | x] ]
      = E_{x~p(x)}[ E_{Y1~p(Y1|x)}[Y1 | x, T = 1] ]    (ignorability: (Y0, Y1) ⫫ T | x)

SLIDE 42

Average Treatment Effect

The expected causal effect of T on Y:

ATE := E[Y1 − Y0]

E[Y1] = E_{x~p(x)}[ E_{Y1~p(Y1|x)}[Y1 | x] ]
      = E_{x~p(x)}[ E_{Y1~p(Y1|x)}[Y1 | x, T = 1] ]
      = E_{x~p(x)}[ E[Y1 | x, T = 1] ]    (shorter notation)

SLIDE 43

Average Treatment Effect

The expected causal effect of T on Y:

ATE := E[Y1 − Y0]

E[Y0] = E_{x~p(x)}[ E_{Y0~p(Y0|x)}[Y0 | x] ]
      = E_{x~p(x)}[ E_{Y0~p(Y0|x)}[Y0 | x, T = 0] ]    (ignorability)
      = E_{x~p(x)}[ E[Y0 | x, T = 0] ]
slide-44
SLIDE 44

Quantities we can estimate from data

The adjustment formula

(

E [Y1|x, T = 1] E [Y0|x, T = 0]

ATE = E [Y1 − Y0] = Ex∼p(x)[ E [Y1|x, T = 1]−E [Y0|x, T = 0] ] Under the assumption of ignorability, we have that:

slide-45
SLIDE 45

Quantities we cannot directly estimate from data

The adjustment formula

(

ATE = E [Y1 − Y0] = Ex∼p(x)[ E [Y1|x, T = 1]−E [Y0|x, T = 0] ] Under the assumption of ignorability, we have that:

E [Y0|x, T = 1] E [Y1|x, T = 0] E [Y0|x] E [Y1|x]

SLIDE 46

The adjustment formula

Under the assumption of ignorability, we have that:

ATE = E[Y1 − Y0] = E_{x~p(x)}[ E[Y1 | x, T = 1] − E[Y0 | x, T = 0] ]

Empirically we only have samples from p(x | T = 1) or p(x | T = 0); we must extrapolate to p(x)

SLIDE 47

Outline: tools of the trade

• Matching
• Covariate adjustment
• Propensity score

SLIDE 48

Set up

• Samples: x_1, x_2, …, x_n
• Observed binary treatment assignments: t_1, t_2, …, t_n
• Observed outcomes: y_1, y_2, …, y_n

x = (age, gender, married, education, income_last_year, …)
t ∈ {no_job_training, job_training}
y = income_one_year_after_training

• Does job training raise average future income?
SLIDE 49

Outline: tools of the trade

• Matching
• Covariate adjustment
• Propensity score

SLIDE 50

Matching

• Find each unit's long-lost counterfactual identical twin, check up on his outcome

SLIDE 51

Matching

• Find each unit's long-lost counterfactual identical twin, check up on his outcome

[Photos: Obama, had he gone to law school; Obama, had he gone to business school]

SLIDE 52

Matching

• Find each unit's long-lost counterfactual identical twin, check up on his outcome
• Used for estimating both ATE and ITE
SLIDE 53

Match to nearest neighbor from opposite group

[Figure: treated and control units plotted by age vs. years of education]

SLIDE 54

Match to nearest neighbor from opposite group

[Figure: same plot, with each unit linked to its nearest neighbor in the opposite group]

SLIDE 55

1-NN Matching

• Let d(⋅,⋅) be a metric between x's
• For each i, define j(i) = argmin_{j s.t. t_j ≠ t_i} d(x_j, x_i)
  j(i) is the nearest counterfactual neighbor of i
• If t_i = 1 (unit i is treated):  \hat{ITE}(x_i) = y_i − y_{j(i)}
• If t_i = 0 (unit i is control):  \hat{ITE}(x_i) = y_{j(i)} − y_i

SLIDE 56

1-NN Matching

• Let d(⋅,⋅) be a metric between x's
• For each i, define j(i) = argmin_{j s.t. t_j ≠ t_i} d(x_j, x_i)
  j(i) is the nearest counterfactual neighbor of i
• \hat{ITE}(x_i) = (2t_i − 1)(y_i − y_{j(i)})
• \hat{ATE} = (1/n) ∑_{i=1}^n \hat{ITE}(x_i)   (a code sketch follows below)
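A minimal sketch of this estimator (Euclidean distance stands in for the unspecified metric d; the function name is ours):

```python
import numpy as np

def one_nn_matching_ate(X, t, y):
    """1-NN matching: ITE_hat(x_i) = (2 t_i - 1)(y_i - y_{j(i)}), averaged to ATE_hat."""
    X, t, y = np.asarray(X, float), np.asarray(t), np.asarray(y, float)
    ite_hat = np.empty(len(y))
    for i in range(len(y)):
        opposite = np.where(t != t[i])[0]                   # units with flipped treatment
        dists = np.linalg.norm(X[opposite] - X[i], axis=1)  # d(x_j, x_i)
        j = opposite[np.argmin(dists)]                      # nearest counterfactual neighbor j(i)
        ite_hat[i] = (2 * t[i] - 1) * (y[i] - y[j])
    return ite_hat.mean()
```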

SLIDE 57

Matching

• Interpretable, especially in the small-sample regime
• Nonparametric
• Heavily reliant on the underlying metric (however, see below about propensity score matching)
• Could be misled by features which don't affect the outcome

SLIDE 58

Matching

• Many other matching methods we won't discuss:
  • Coarsened exact matching: Iacus et al. (2011)
  • Optimal matching: Rosenbaum (1989, 2002)
  • Propensity score matching: Rosenbaum & Rubin (1983), Austin (2011)
  • Mahalanobis distance matching: Rosenbaum (1989, 2002)

SLIDE 59

Outline: tools of the trade

• Matching
• Covariate adjustment
• Propensity score

SLIDE 60

Covariate adjustment

• Explicitly model the relationship between treatment, confounders, and outcome
• Also called "Response Surface Modeling"
• Used for both ITE and ATE
• A regression problem
SLIDE 61

[Diagram: covariates (features) x_1, x_2, x_3, … and treatment T feed into a regression model g(x, T), whose output is the outcome y]

SLIDE 62

[Diagram: same regression model g(x, T); the covariate effects are nuisance parameters, while the effect of the treatment T is the parameter of interest]

SLIDE 63

Covariate adjustment (parametric g-formula)

• Explicitly model the relationship between treatment, confounders, and outcome
• Under ignorability, the expected causal effect of T on Y is
  E_{x~p(x)}[ E[Y1 | T = 1, x] − E[Y0 | T = 0, x] ]
• Fit a model g(x, t) ≈ E[Y_t | T = t, x]; then
  \hat{ATE} = (1/n) ∑_{i=1}^n ( g(x_i, 1) − g(x_i, 0) )

SLIDE 64

Covariate adjustment (parametric g-formula)

• Explicitly model the relationship between treatment, confounders, and outcome
• Under ignorability, the expected causal effect of T on Y is
  E_{x~p(x)}[ E[Y1 | T = 1, x] − E[Y0 | T = 0, x] ]
• Fit a model g(x, t) ≈ E[Y_t | T = t, x]; then (sketch below)
  \hat{ITE}(x_i) = g(x_i, 1) − g(x_i, 0)
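A sketch of the parametric g-formula with an off-the-shelf regressor (the random forest is one arbitrary choice of g; the slides do not prescribe a model class):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def g_formula(X, t, y):
    """Fit g(x, t) ~ E[Y_t | T = t, x]; return ITE_hat per unit and ATE_hat."""
    Xt = np.column_stack([X, t])                             # treatment appended as a feature
    g = RandomForestRegressor(n_estimators=200, random_state=0).fit(Xt, y)
    g1 = g.predict(np.column_stack([X, np.ones(len(y))]))    # g(x_i, 1)
    g0 = g.predict(np.column_stack([X, np.zeros(len(y))]))   # g(x_i, 0)
    ite_hat = g1 - g0
    return ite_hat, ite_hat.mean()
```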

SLIDE 65

Covariate adjustment

[Figure: x = age, y = blood pressure; observed treated samples near Y1(x) and control samples near Y0(x)]

SLIDE 66

Covariate adjustment

[Figure: same plot; the fitted model g imputes the counterfactual treated and counterfactual control outcomes on the opposite curve]

SLIDE 67

Warning: this is not a classic supervised learning problem

• Our model was optimized to predict outcome, not to differentiate the influence of A vs. B
• What if our high-dimensional model threw away the feature of medication A/B?
• Maybe the model never saw a patient like Anna get medication A? Maybe there's a reason patients like Anna never get A?

SLIDE 68

Covariate adjustment – consistency

• If the model g(x, t) ≈ E[Y_t | T = t, x] is consistent in the limit of infinite samples, then under ignorability the estimated \hat{ATE} will converge to the true ATE
• A sufficient condition: overlap and a well-specified model

SLIDE 69

Covariate adjustment: no overlap

[Figure: x = age, y = blood pressure; treated and control samples occupy disjoint ranges of age, so neither Y1(x) nor Y0(x) can be fit where its group is unobserved]

SLIDE 70

Linear model

• Assume that (x = age, t = medication, Y = blood pressure):
  Y_t(x) = γx + δ·t + ε_t,  with E[ε_t] = 0
• Then:
  ITE(x) := Y1(x) − Y0(x) = (γx + δ + ε_1) − (γx + ε_0) = δ + ε_1 − ε_0
  ATE := E[Y1(x) − Y0(x)] = δ + E[ε_1] − E[ε_0] = δ

SLIDE 71

Linear model

• Assume that:
  Y_t(x) = γᵀx + δ·t + ε_t,  with E[ε_t] = 0
• We care about δ, not about Y_t(x): identification, not prediction
  ATE = E[Y1(x) − Y0(x)] = δ

SLIDE 72

Linear model

• Y_t(x) = γᵀx + δ·t + ε_t
• Hypertension is affected by many variables: lifestyle, weight, genetics, age
• Each of these is often a stronger predictor of blood pressure than the type of medication taken
• Regularization (e.g. Lasso) might remove the treatment variable, as the simulation below illustrates!
• Features split into ("nuisance parameters", "variable of interest"): age, weight, … are nuisance parameters; the medication is the variable of interest; blood pressure is the outcome
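A small simulation of this failure mode (all numbers invented for illustration): strong covariate effects plus a weak treatment effect, and an off-the-shelf Lasso zeroes out the treatment coefficient.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, d = 500, 10
X = rng.normal(size=(n, d))           # covariates: lifestyle, weight, age, ...
t = rng.binomial(1, 0.5, size=n)      # treatment (randomized here for simplicity)
gamma = rng.normal(0.0, 2.0, size=d)  # strong covariate effects
delta = 0.3                           # weak true treatment effect
y = X @ gamma + delta * t + rng.normal(size=n)

model = Lasso(alpha=0.5).fit(np.column_stack([X, t]), y)
print(model.coef_[-1])                # treatment coefficient: often shrunk to exactly 0
```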

SLIDE 73

Regression – misspecification

• True data generating process, x ∈ ℝ:
  Y_t(x) = γx + δ·t + ε·x²,  so  ATE = E[Y1 − Y0] = δ
• Hypothesized model:
  \hat{Y}_t(x) = \hat{γ}x + \hat{δ}·t
• The fitted coefficient is biased (numerical check below):
  \hat{δ} = δ + ε · ( E[xt]·E[x²] − E[t²]·E[x²t] ) / ( E[xt]² − E[x²]·E[t²] )
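A quick numerical check of this bias (a hypothetical setup where the treatment probability depends on x, so the omitted x² term leaks into \hat{δ}):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
x = rng.normal(loc=1.0, scale=1.0, size=n)
t = rng.binomial(1, 1.0 / (1.0 + np.exp(-2.0 * x)))  # assignment depends on x
gamma, delta, eps = 1.0, 1.0, 0.5
y = gamma * x + delta * t + eps * x**2               # true DGP includes eps * x^2

A = np.column_stack([np.ones(n), x, t])              # misspecified linear model [1, x, t]
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print(coef[2])                                       # delta_hat: noticeably biased away from 1.0
```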

SLIDE 74

Using machine learning for causal inference

• Machine learning techniques can be very useful and have recently seen wider adoption:
  • Random forests and Bayesian trees: Hill (2011), Athey & Imbens (2015), Wager & Athey (2015)
  • Gaussian processes: Hoyer et al. (2009), Zigler et al. (2012)
  • Neural nets: Beck et al. (2000), Johansson et al. (2016), Shalit et al. (2016), Lopez-Paz et al. (2016)
  • "Causal" Lasso: Belloni et al. (2013), Farrell (2015), Athey et al. (2016)

SLIDE 75

Using machine learning for causal inference

• Machine learning techniques can be very useful and have recently seen wider adoption
• How is the treatment variable used?
  • Fit two different models for treated and control?
  • Not regularized?
  • Privileged?

SLIDE 76

Example: Gaussian process

[Figure: two panels, "GP-Independent" (separate treated and control models) and "GP-Grouped" (joint treated and control model), each showing fitted curves Y1(x) and Y0(x) through treated and control samples. Figures: Vincent Dorie & Jennifer Hill]

SLIDE 77

Covariate adjustment and matching

• Matching is equivalent to covariate adjustment with two 1-NN regressors:
  \hat{Y}_1(x) = y_{NN_1(x)},  \hat{Y}_0(x) = y_{NN_0(x)},
  where NN_t(x) is the nearest neighbor of x among units with treatment assignment t ∈ {0, 1}
• 1-NN matching is in general inconsistent, though only with small bias (Imbens 2004)

SLIDE 78

Outline: tools of the trade

• Matching
• Covariate adjustment
• Propensity score

SLIDE 79

Propensity score

• A tool for estimating the ATE
• Basic idea: turn an observational study into a pseudo-randomized trial by re-weighting samples, similar to importance sampling

SLIDE 80

Inverse propensity score re-weighting

[Figure: treated and control units plotted by x_1 = age, x_2 = income; the two covariate distributions differ: p(x | t = 0) ≠ p(x | t = 1)]

Re-weighting matches the two distributions:

p(x | t = 0) · w_0(x) ≈ p(x | t = 1) · w_1(x)
(reweighted control ≈ reweighted treated)

SLIDE 81

Propensity score

• Propensity score: p(T = 1 | x), estimated using machine learning tools
• Samples are re-weighted by the inverse propensity score of the treatment they received

SLIDE 82

How to obtain ATE with propensity score

SLIDE 83

Propensity scores – algorithm
Inverse probability of treatment weighted (IPW) estimator

How to calculate the ATE with the propensity score, for a sample (x_1, t_1, y_1), …, (x_n, t_n, y_n):

1. Use any ML method to estimate \hat{p}(T = t | x)
2. \hat{ATE} = (1/n) ∑_{i s.t. t_i=1} y_i / \hat{p}(t_i = 1 | x_i) − (1/n) ∑_{i s.t. t_i=0} y_i / \hat{p}(t_i = 0 | x_i)

(A code sketch follows below.)
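A minimal sketch of steps 1–2 (logistic regression as the "any ML method"; function name is ours):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ipw_ate(X, t, y):
    """Inverse probability of treatment weighted (IPW) estimate of the ATE."""
    t, y = np.asarray(t), np.asarray(y, float)
    # Step 1: estimate the propensity score p_hat(T = 1 | x).
    ps = LogisticRegression().fit(X, t).predict_proba(X)[:, 1]
    n = len(y)
    treated = t == 1
    # Step 2: re-weight each unit by the inverse propensity of its own treatment.
    return (y[treated] / ps[treated]).sum() / n \
         - (y[~treated] / (1.0 - ps[~treated])).sum() / n
```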

SLIDE 84

Propensity scores – algorithm
Inverse probability of treatment weighted (IPW) estimator

Special case: a randomized trial, where p(T = t | x) = 0.5

\hat{ATE} = (1/n) ∑_{i s.t. t_i=1} y_i / \hat{p}(t_i = 1 | x_i) − (1/n) ∑_{i s.t. t_i=0} y_i / \hat{p}(t_i = 0 | x_i)

SLIDE 85

Propensity scores – algorithm
Inverse probability of treatment weighted (IPW) estimator

Special case: a randomized trial, where p(T = t | x) = 0.5

\hat{ATE} = (1/n) ∑_{i s.t. t_i=1} y_i / 0.5 − (1/n) ∑_{i s.t. t_i=0} y_i / 0.5

SLIDE 86

Propensity scores – algorithm
Inverse probability of treatment weighted (IPW) estimator

Special case: a randomized trial, where p = 0.5

\hat{ATE} = (1/n) ∑_{i s.t. t_i=1} y_i / 0.5 − (1/n) ∑_{i s.t. t_i=0} y_i / 0.5
          = (2/n) ∑_{i s.t. t_i=1} y_i − (2/n) ∑_{i s.t. t_i=0} y_i

SLIDE 87

Propensity scores – algorithm
Inverse probability of treatment weighted (IPW) estimator

Special case: a randomized trial, where p = 0.5

\hat{ATE} = (2/n) ∑_{i s.t. t_i=1} y_i − (2/n) ∑_{i s.t. t_i=0} y_i

Each sum runs over ~ n/2 terms, so with the factor 2/n each term is simply the average outcome within its treatment group

SLIDE 88

Propensity scores – derivation

• Recall the average treatment effect:
  E_{x~p(x)}[ E[Y1 | x, T = 1] − E[Y0 | x, T = 0] ]
• We only have samples for:
  E_{x~p(x|T=1)}[ E[Y1 | x, T = 1] ]  and  E_{x~p(x|T=0)}[ E[Y0 | x, T = 0] ]

SLIDE 89

Propensity scores – derivation

• We only have samples for:
  E_{x~p(x|T=1)}[ E[Y1 | x, T = 1] ]  and  E_{x~p(x|T=0)}[ E[Y0 | x, T = 0] ]

SLIDE 90

Propensity scores – derivation

• We only have samples for:
  E_{x~p(x|T=1)}[ E[Y1 | x, T = 1] ]  and  E_{x~p(x|T=0)}[ E[Y0 | x, T = 0] ]
• We need to turn p(x | T = 1) into p(x):
  p(x | T = 1) · p(T = 1) / p(T = 1 | x) = p(x)    (by Bayes' rule)

SLIDE 91

Propensity scores – derivation

• We only have samples for:
  E_{x~p(x|T=1)}[ E[Y1 | x, T = 1] ]  and  E_{x~p(x|T=0)}[ E[Y0 | x, T = 0] ]
• We need to turn p(x | T = 1) into p(x):
  p(x | T = 1) · p(T = 1) / p(T = 1 | x) = p(x)
  The denominator p(T = 1 | x) is the propensity score

SLIDE 92

Propensity scores – derivation

• We only have samples for:
  E_{x~p(x|T=1)}[ E[Y1 | x, T = 1] ]  and  E_{x~p(x|T=0)}[ E[Y0 | x, T = 0] ]
• We need to turn p(x | T = 0) into p(x):
  p(x | T = 0) · p(T = 0) / p(T = 0 | x) = p(x)
  The denominator p(T = 0 | x) is the propensity score

SLIDE 93

• We only have samples for: E_{x~p(x|T=1)}[ E[Y1 | x, T = 1] ]
• We want: E_{x~p(x)}[ E[Y1 | x, T = 1] ]
• We know that: p(x | T = 1) · p(T = 1) / p(T = 1 | x) = p(x)
• Then (expanded below):
  E_{x~p(x|T=1)}[ (p(T = 1) / p(T = 1 | x)) · E[Y1 | x, T = 1] ] = E_{x~p(x)}[ E[Y1 | x, T = 1] ]
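Spelled out, the "Then" step is a one-line change of measure (our expansion of the slide's identity):

```latex
\mathbb{E}_{x \sim p(x|T=1)}\!\left[ \tfrac{p(T=1)}{p(T=1|x)}\, \mathbb{E}[Y_1 \mid x, T=1] \right]
  = \int p(x|T=1)\, \tfrac{p(T=1)}{p(T=1|x)}\, \mathbb{E}[Y_1 \mid x, T=1]\, dx
  = \int p(x)\, \mathbb{E}[Y_1 \mid x, T=1]\, dx
  = \mathbb{E}_{x \sim p(x)}\big[ \mathbb{E}[Y_1 \mid x, T=1] \big]
```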
SLIDE 94

Calculating the propensity score

• If p(T = t | x) is known, then propensity score re-weighting is consistent
  • Example: an ad-placement algorithm samples T = t based on a known algorithm
• Usually the score is unknown and must be estimated
  • Example: use logistic regression to estimate the probability that patient x received medication T = t
• Calibration: we must estimate the probability correctly, not just the binary assignment variable (see the sketch below)
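One way to implement this (sklearn's calibration_curve is our choice of diagnostic, not the slides'): estimate the score with logistic regression, then check that predicted probabilities match observed treatment frequencies bin by bin.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import calibration_curve

def estimate_propensity(X, t, n_bins=10):
    """Estimate p_hat(T = 1 | x) and return a simple calibration diagnostic."""
    ps = LogisticRegression().fit(X, t).predict_proba(X)[:, 1]
    # In each bin of predicted probability, the observed fraction of treated
    # units should match the mean prediction; large gaps mean miscalibration.
    frac_treated, mean_pred = calibration_curve(t, ps, n_bins=n_bins)
    return ps, frac_treated, mean_pred
```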

SLIDE 95

"The Assumptions" – ignorability

• If ignorability doesn't hold, then the average treatment effect is not
  E_{x~p(x)}[ E[Y1 | T = 1, x] − E[Y0 | T = 0, x] ],
  invalidating the starting point of the derivation

SLIDE 96

"The Assumptions" – overlap

• If there's not much overlap, propensity scores become non-informative and easily miscalibrated
• The sample variance of inverse propensity score re-weighting scales with
  ∑_{i=1}^n 1 / ( \hat{p}(t = 1 | x_i) · \hat{p}(t = 0 | x_i) ),
  which can grow very large when samples are non-overlapping (Williamson et al., 2014)

SLIDE 97

Propensity score in machine learning

• The same idea appears in importance sampling!
• Used in off-policy evaluation and learning from logged bandit feedback (Swaminathan & Joachims, 2015)
• Similar ideas are used in covariate-shift work (Bickel et al., 2009)