Gov 2002 - Causal Inference II: Instrumental Variables

Matthew Blackwell, Arthur Spirling
October 2nd, 2014


Instrumental Variables

◮ Last week we talked about how to make progress when you have randomization or selection on the observables.
◮ But what if you have neither of those two for your treatment variable? Are you doomed?
◮ Maybe.
◮ But if you can identify some exogenous source of variation that drives the treatment, even if the treatment was not randomly assigned, you may be able to make headway.
◮ The basic idea behind instrumental variables is that we have a treatment with unmeasured confounding, but we have another variable, called the instrument, that affects the treatment but not the outcome, and thus gives us that exogenous variation.


Basic IV setup with DAGs

[DAG: Z → A → Y, with unmeasured U → A and U → Y; the absent Z → Y arrow is the exclusion restriction]

◮ Z is the instrument, A is the treatment, and U is the unmeasured confounder.
◮ Exclusion restriction:
  ◮ no common causes of the instrument and the outcome;
  ◮ no direct or indirect effect of the instrument on the outcome except through the treatment.
◮ First-stage relationship: Z affects A.


An IV is only as good as its assumptions

[DAG as above: Z → A → Y with unmeasured confounder U; no Z → Y arrow (exclusion restriction)]

◮ Finding a believable instrument is incredibly difficult, and some people never believe any IV setups.
◮ We will see that even if all of the untestable assumptions are met, the IV approach estimates a "local" ATE, that is, local to this particular case/instrument.


IVs in the field

◮ Angrist (1990): draft lottery as an IV for military service (income as outcome)
◮ Acemoglu et al. (2001): settler mortality as an IV for institutional quality (GDP per capita as outcome)
◮ Levitt (1997): being an election year as an IV for police force size (crime as outcome)
◮ Kern & Hainmueller (2009): having West German TV reception in East Berlin as an instrument for West German TV watching (outcome is support for the East German regime)
◮ Nunn & Wantchekon (2011): historical distance of an ethnic group to the coast as an instrument for the slave raiding of that ethnic group (outcome is trust attitudes today)
◮ Acharya, Blackwell, Sen (2014): cotton suitability as an IV for proportion slave in 1860 (outcome is white attitudes today)


IV with constant effects

◮ Let's write down a causal model for Yi with constant effects and an unmeasured confounder, Ui:

  Yi(a, u) = α + τa + γu + ηi

◮ If we combine this with a consistency assumption, we get this regression form:

  Yi = α + τAi + γUi + ηi

◮ Here we assume that E[Aiηi] = 0, so if we measured Ui, then we would be able to estimate τ.
◮ But cov(γUi + ηi, Ai) ≠ 0 because U is a common cause of A and Y.


The role of the instrument

◮ If we have an instrument, Zi, that satisfies the exclusion restriction, then cov(γUi + ηi, Zi) = 0.
◮ It must be independent of Ui, and it has no correlation with ηi because neither does the treatment.

  cov(Yi, Zi) = cov(α + τAi + γUi + ηi, Zi)
              = cov(α, Zi) + cov(τAi, Zi) + cov(γUi + ηi, Zi)
              = 0 + τ·cov(Ai, Zi) + 0


IV estimator with constant effects

Yi = α + τAi + γUi + ηi

◮ With this in hand, we can formulate an expression for the average treatment effect here:

  τ = cov(Yi, Zi) / cov(Ai, Zi) = [cov(Yi, Zi)/V[Zi]] / [cov(Ai, Zi)/V[Zi]]

◮ Reduced form coefficient: cov(Yi, Zi)/V[Zi]
◮ First stage coefficient: cov(Ai, Zi)/V[Zi]
◮ What happens with a weak first stage?
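As a sanity check on this identification result, here is a small simulated sketch (the data-generating process, coefficients, and seed are all hypothetical illustrations, not an example from the slides) showing that the OLS slope of Yi on Ai is biased by the unmeasured confounder, while the ratio cov(Yi, Zi)/cov(Ai, Zi) recovers τ:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical constant-effects model: Yi = alpha + tau*Ai + gamma*Ui + eta_i,
# with U confounding both the treatment and the outcome. tau = 2 is the truth.
tau, gamma = 2.0, 3.0
U = rng.normal(size=n)                        # unmeasured confounder
Z = rng.binomial(1, 0.5, size=n)              # instrument, independent of U
A = 0.5 * Z + 0.8 * U + rng.normal(size=n)    # first stage: Z shifts A
Y = 1.0 + tau * A + gamma * U + rng.normal(size=n)

# Naive OLS slope of Y on A picks up the confounding through U ...
ols = np.cov(Y, A)[0, 1] / np.var(A, ddof=1)
# ... while the IV ratio cov(Y, Z)/cov(A, Z) recovers tau.
iv = np.cov(Y, Z)[0, 1] / np.cov(A, Z)[0, 1]
```

In this design the OLS slope is pulled well above 2 by γ·cov(U, A), whereas the IV ratio stays near the true τ.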


Wald Estimator

◮ With a binary instrument, there is a simple estimator based on this formulation called the Wald estimator. It is easy to show that:

  τ = cov(Yi, Zi) / cov(Ai, Zi) = (E[Yi | Zi = 1] − E[Yi | Zi = 0]) / (E[Ai | Zi = 1] − E[Ai | Zi = 0])

◮ Intuitively, the effect of Zi on Yi divided by the effect of Zi on Ai.
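The Wald estimator is just two differences in means. A minimal sketch on simulated data (the DGP, coefficients, and seed are hypothetical, chosen so the true constant effect is 2):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Hypothetical DGP: binary instrument Z, confounded binary treatment A,
# constant treatment effect tau = 2 in the outcome equation.
U = rng.normal(size=n)                       # unmeasured confounder
Z = rng.binomial(1, 0.5, size=n)             # randomized binary instrument
A = (0.8 * Z + U + rng.normal(size=n) > 0).astype(float)
Y = 1.0 + 2.0 * A + 1.5 * U + rng.normal(size=n)

# Wald estimator: effect of Z on Y divided by effect of Z on A.
itt_y = Y[Z == 1].mean() - Y[Z == 0].mean()
itt_a = A[Z == 1].mean() - A[Z == 0].mean()
wald = itt_y / itt_a
```

The numerator is the intent-to-treat effect of the instrument; dividing by the first-stage difference rescales it to a per-unit-of-treatment effect.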


What about covariates?

◮ No covariates up until now. What if we have a set of covariates Xi that we are also conditioning on?
◮ Let's start with linear models for both the outcome and the treatment:

  Yi = Xi′β + τAi + εi
  Ai = Xi′α + γZi + νi

◮ Now, we assume that Xi is exogenous along with Zi:

  E[Ziνi] = 0   E[Ziεi] = 0   E[Xiνi] = 0   E[Xiεi] = 0

◮ . . . but Ai is endogenous: E[Aiεi] ≠ 0


Getting the reduced form

◮ We can plug the treatment equation into the outcome equation:

  Yi = Xi′β + τ[Xi′α + γZi + νi] + εi
     = Xi′β + τ[Xi′α + γZi] + [τνi + εi]
     = Xi′β + τ[Xi′α + γZi] + εi*

◮ The bracketed term [Xi′α + γZi] is the population fitted value of the treatment, E[Ai | Xi, Zi].
◮ Because Zi and Xi are uncorrelated with νi and εi, this fitted value is also independent of εi*.
◮ Thus, the population regression coefficient of Yi on [Xi′α + γZi] is the average treatment effect, τ.


Two-stage least squares

◮ In practice, we estimate the first stage from a sample and calculate OLS fitted values: Âi = Xi′α̂ + γ̂Zi.
◮ Here, α̂ and γ̂ are estimates from OLS. Then, we estimate a regression of Yi on Xi and Âi. We plug this into our equation for Yi and note that the error for Ai is now a residual:

  Yi = Xi′β + τÂi + [εi + τ(Ai − Âi)]

◮ Key question: is Âi uncorrelated with the error?
◮ Âi is just a function of Xi and Zi, so it is uncorrelated with εi.
◮ We also know that Âi is uncorrelated with the OLS residual (Ai − Âi).


Two-stage least squares

◮ Heuristic procedure:
  1. Run a regression of the treatment on the covariates and the instrument.
  2. Construct fitted values of the treatment.
  3. Run a regression of the outcome on the covariates and the fitted values.
◮ Note that this isn't how we actually estimate 2SLS, because the standard errors are all wrong.
◮ The computer wants to calculate the standard errors based on εi*, but what we really want is the standard errors based on εi.
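The three-step heuristic can be sketched in a few lines of numpy (all names, coefficients, and the data-generating process below are hypothetical illustrations, not the slides' example):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20_000

# Hypothetical DGP: one covariate, one instrument, confounded treatment,
# true treatment effect tau = 2.
U = rng.normal(size=n)
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # intercept + covariate
Z = rng.normal(size=n)
A = X @ np.array([0.2, 0.5]) + 1.0 * Z + U + rng.normal(size=n)
Y = X @ np.array([1.0, -0.3]) + 2.0 * A + 2.0 * U + rng.normal(size=n)

# 1. Regress the treatment on the covariates and the instrument.
W = np.column_stack([X, Z])
first_stage = np.linalg.lstsq(W, A, rcond=None)[0]
# 2. Construct fitted values of the treatment.
A_hat = W @ first_stage
# 3. Regress the outcome on the covariates and the fitted values.
second_stage = np.linalg.lstsq(np.column_stack([X, A_hat]), Y, rcond=None)[0]
tau_hat = second_stage[-1]
# As the slides warn, the second regression's default standard errors are
# wrong: they use the residual Yi - Xi'b - tau*A_hat_i rather than epsilon_i.
```

The point estimate from step 3 is the 2SLS estimate; only the naive standard errors from this manual procedure need correcting.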


Nunn & Wantchekon IV example


General 2SLS

◮ To save on notation, we'll roll all the variables in the structural model into one vector, Xi, of size k, some of which may be endogenous.
◮ The structural model, then, is:

  Yi = Xi′β + εi

◮ Zi will be a vector of l exogenous variables that includes any exogenous variables in Xi plus any instruments. Key assumption: E[Ziεi] = 0.


Nasty Matrix Algebra

◮ Useful quantities:

  Π = (E[ZiZi′])⁻¹E[ZiXi′]   (projection matrix)
  Vi = Π′Zi                  (fitted values)

◮ To derive the 2SLS estimator, take the fitted values, Π′Zi, and multiply both sides of the outcome equation by them:

  Yi = Xi′β + εi
  Π′ZiYi = Π′ZiXi′β + Π′Ziεi
  Π′E[ZiYi] = Π′E[ZiXi′]β + Π′E[Ziεi]
  Π′E[ZiYi] = Π′E[ZiXi′]β
  β = (Π′E[ZiXi′])⁻¹Π′E[ZiYi]
  β = (E[XiZi′](E[ZiZi′])⁻¹E[ZiXi′])⁻¹E[XiZi′](E[ZiZi′])⁻¹E[ZiYi]


How to estimate the parameters

◮ Collect the Xi into an n × k matrix X = (X1′, . . . , Xn′)′
◮ Collect the Zi into an n × l matrix Z = (Z1′, . . . , Zn′)′
◮ Matrix party trick: X′Z/n = (1/n) Σᵢ XiZi′ →p E[XiZi′].
◮ Take the population formula for the parameters:

  β = (E[XiZi′](E[ZiZi′])⁻¹E[ZiXi′])⁻¹E[XiZi′](E[ZiZi′])⁻¹E[ZiYi]

◮ And plug in the sample values (the n cancels out):

  β̂ = [(X′Z)(Z′Z)⁻¹(Z′X)]⁻¹(X′Z)(Z′Z)⁻¹(Z′Y)

◮ This is how R/Stata estimates the 2SLS parameters.
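The sample formula translates line for line into numpy. A minimal sketch on a simulated just-identified example (the DGP, coefficients, and seed are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20_000

# Hypothetical DGP: X = [intercept, endogenous A], Z = [intercept, instrument].
U = rng.normal(size=n)
z = rng.normal(size=n)
A = 0.7 * z + U + rng.normal(size=n)
Y = 1.0 + 2.0 * A + 1.5 * U + rng.normal(size=n)
X = np.column_stack([np.ones(n), A])
Z = np.column_stack([np.ones(n), z])

# beta_hat = [(X'Z)(Z'Z)^{-1}(Z'X)]^{-1} (X'Z)(Z'Z)^{-1}(Z'Y)
XtZ = X.T @ Z
ZtZ_inv = np.linalg.inv(Z.T @ Z)
beta_hat = np.linalg.solve(XtZ @ ZtZ_inv @ (Z.T @ X), XtZ @ ZtZ_inv @ (Z.T @ Y))
# beta_hat[1] estimates the coefficient on A (truth here: 2.0).
```

In the just-identified case (l = k) this collapses to the simpler formula (Z′X)⁻¹Z′Y; the sandwich form above is what handles l > k.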


Asymptotics for 2SLS

◮ Let V = Z(Z′Z)⁻¹Z′X be the matrix of fitted values for X; then we have β̂ = (V′V)⁻¹V′Y.
◮ We can insert the true model for Y:

  β̂ = (V′V)⁻¹V′(Xβ + ε)

◮ Using the matrix party trick and the fact that V′X = V′V, we have:

  β̂ = (V′V)⁻¹V′Xβ + (V′V)⁻¹V′ε = β + (n⁻¹ Σᵢ ViVi′)⁻¹ (n⁻¹ Σᵢ Viεi)

◮ Consistent because n⁻¹ Σᵢ Viεi →p E[Viεi] = 0.


Asymptotic variance for 2SLS

  √n(β̂ − β) = (n⁻¹ Σᵢ ViVi′)⁻¹ (n^(−1/2) Σᵢ Viεi)

◮ By the CLT, n^(−1/2) Σᵢ Viεi converges in distribution to N(0, B), where B = E[εi² ViVi′].
◮ By the LLN, n⁻¹ Σᵢ ViVi′ →p E[ViVi′].
◮ Thus, √n(β̂ − β) has asymptotic variance:

  (E[ViVi′])⁻¹ E[εi² ViVi′] (E[ViVi′])⁻¹

◮ Replace with the sample quantities to get estimates:

  var̂(β̂) = (V′V)⁻¹ (Σᵢ ûi² ViVi′) (V′V)⁻¹,  where ûi = Yi − Xi′β̂
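The sandwich estimator is short to code. A sketch on a hypothetical just-identified DGP (all coefficients and the seed are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 20_000

# Hypothetical just-identified DGP with true coefficients (1.0, 2.0).
U = rng.normal(size=n)
z = rng.normal(size=n)
A = 0.7 * z + U + rng.normal(size=n)
Y = 1.0 + 2.0 * A + 1.5 * U + rng.normal(size=n)
X = np.column_stack([np.ones(n), A])
Z = np.column_stack([np.ones(n), z])

# Fitted values V = Z(Z'Z)^{-1}Z'X and 2SLS estimate beta_hat = (V'V)^{-1}V'Y.
V = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
beta_hat = np.linalg.solve(V.T @ V, V.T @ Y)

# Sandwich variance (V'V)^{-1} [sum_i u_i^2 V_i V_i'] (V'V)^{-1}, where the
# residual u_i = Y_i - X_i'beta_hat uses the actual X_i, not the fitted values.
u = Y - X @ beta_hat
meat = (V * (u ** 2)[:, None]).T @ V
bread = np.linalg.inv(V.T @ V)
se = np.sqrt(np.diag(bread @ meat @ bread))
```

Note the residual is computed with X, not V; using the fitted values there is exactly the mistake the naive two-step procedure makes.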


Overidentification

◮ What if we have more instruments than endogenous variables?
◮ When there are more instruments than causal parameters (l > k), the model is overidentified.
◮ When there are as many instruments as causal parameters (l = k), the model is just identified.
◮ With more than one instrument and constant effects, we can test the plausibility of the exclusion restriction(s) using an overidentification test.
◮ Is it plausible to find more than one instrument?


Overidentification tests

◮ Sargan test, Hansen test, J-test, etc.
◮ Basic idea: under the null that all instruments are valid, estimates based on different subsets of the instruments should differ only due to sampling noise.
◮ Identify the distribution of that noise under the null to develop a test.
◮ If we reject the null hypothesis in these overidentification tests, it means that the exclusion restrictions for our instruments are probably incorrect. Note that the test won't tell us which of them are incorrect, just that at least one is.
◮ These overidentification tests depend heavily on the constant effects assumption.
◮ Once we move away from constant effects, we generally can no longer pool multiple instruments together in this way.
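One common version, the Sargan test, is easy to sketch: regress the 2SLS residuals on all the instruments and compare n·R² to a χ²(l − k) distribution. Everything below (DGP, coefficients, seed) is a hypothetical illustration with two valid instruments:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 20_000

# Hypothetical overidentified DGP: two valid instruments, one endogenous A.
U = rng.normal(size=n)
z1, z2 = rng.normal(size=n), rng.normal(size=n)
A = 0.6 * z1 + 0.4 * z2 + U + rng.normal(size=n)
Y = 1.0 + 2.0 * A + 1.5 * U + rng.normal(size=n)
X = np.column_stack([np.ones(n), A])           # k = 2
Z = np.column_stack([np.ones(n), z1, z2])      # l = 3, so l - k = 1

# 2SLS estimate and residuals.
V = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
beta_hat = np.linalg.solve(V.T @ V, V.T @ Y)
u = Y - X @ beta_hat

# Sargan statistic: n * R^2 from regressing the residuals on the instruments.
# Under the null that all instruments satisfy the exclusion restriction, it is
# approximately chi2(l - k); the 5% critical value for 1 df is about 3.84.
fitted = Z @ np.linalg.lstsq(Z, u, rcond=None)[0]
sargan = n * fitted.var() / u.var()
```

Because both instruments are valid in this simulation, the statistic should usually fall well below the critical value; invalid instruments would inflate it.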

Reading


Instrumental Variables and Potential Outcomes

◮ The basic idea behind instrumental variable approaches is that we do not have ignorability for Ai, but we do have a variable, Zi, that affects Ai but only affects the outcome through Ai.
◮ Note that we allow the instrument, Zi, to have an effect on Ai, so the treatment must have potential outcomes, Ai(1) and Ai(0), with the usual consistency assumption: Ai = ZiAi(1) + (1 − Zi)Ai(0).
◮ The outcome can depend on both the treatment and the instrument: Yi(a, z) is the outcome if unit i had received treatment Ai = a and instrument value Zi = z.
◮ The effect of the treatment given the value of the instrument is Yi(1, Zi) − Yi(0, Zi).

slide-99
SLIDE 99

Key assumptions

  • 1. Randomization
slide-100
SLIDE 100

Key assumptions

  • 1. Randomization
  • 2. Exclusion Restriction
slide-101
SLIDE 101

Key assumptions

  • 1. Randomization
  • 2. Exclusion Restriction
  • 3. First-stage relationship
slide-102
SLIDE 102

Key assumptions

  • 1. Randomization
  • 2. Exclusion Restriction
  • 3. First-stage relationship
  • 4. Monotonicity
slide-103
SLIDE 103

Randomization

◮ Need the instrument to be randomized:

[{Yi(a, z), ∀a, z}, Ai(1), Ai(0)] ⊥ ⊥ Zi

slide-104
SLIDE 104

Randomization

◮ Need the instrument to be randomized:

[{Yi(a, z), ∀a, z}, Ai(1), Ai(0)] ⊥ ⊥ Zi

◮ We can weaken this to conditional ignorability

slide-105
SLIDE 105

Randomization

◮ Need the instrument to be randomized:

[{Yi(a, z), ∀a, z}, Ai(1), Ai(0)] ⊥ ⊥ Zi

◮ We can weaken this to conditional ignorability
◮ But why believe conditional ignorability for the instrument but

not the treatment?

slide-106
SLIDE 106

Randomization

◮ Need the instrument to be randomized:

[{Yi(a, z), ∀a, z}, Ai(1), Ai(0)] ⊥ ⊥ Zi

◮ We can weaken this to conditional ignorability
◮ But why believe conditional ignorability for the instrument but

not the treatment?

◮ Best instruments are truly randomized.

slide-107
SLIDE 107

Randomization

◮ Need the instrument to be randomized:

[{Yi(a, z), ∀a, z}, Ai(1), Ai(0)] ⊥ ⊥ Zi

◮ We can weaken this to conditional ignorability
◮ But why believe conditional ignorability for the instrument but

not the treatment?

◮ Best instruments are truly randomized.
◮ Identifies the intent-to-treat (ITT) effect:

E[Yi|Zi = 1] − E[Yi|Zi = 0] = E[Yi(Ai(1), 1) − Yi(Ai(0), 0)]

slide-108
SLIDE 108

Exclusion Restriction

◮ The instrument has no direct effect on the outcome, once we

fix the value of the treatment. Yi(a, 1) = Yi(a, 0) for a = 0, 1

slide-109
SLIDE 109

Exclusion Restriction

◮ The instrument has no direct effect on the outcome, once we

fix the value of the treatment. Yi(a, 1) = Yi(a, 0) for a = 0, 1

◮ Given this exclusion restriction, we know that the potential

outcomes for each treatment status only depend on the

treatment, not the instrument: Yi(1) ≡ Yi(1, 1) = Yi(1, 0) Yi(0) ≡ Yi(0, 1) = Yi(0, 0)

slide-110
SLIDE 110

Exclusion Restriction

◮ The instrument has no direct effect on the outcome, once we

fix the value of the treatment. Yi(a, 1) = Yi(a, 0) for a = 0, 1

◮ Given this exclusion restriction, we know that the potential

outcomes for each treatment status only depend on the

treatment, not the instrument: Yi(1) ≡ Yi(1, 1) = Yi(1, 0) Yi(0) ≡ Yi(0, 1) = Yi(0, 0)

◮ NOT

slide-111
SLIDE 111

Exclusion Restriction

◮ The instrument has no direct effect on the outcome, once we

fix the value of the treatment. Yi(a, 1) = Yi(a, 0) for a = 0, 1

◮ Given this exclusion restriction, we know that the potential

outcomes for each treatment status only depend on the

treatment, not the instrument: Yi(1) ≡ Yi(1, 1) = Yi(1, 0) Yi(0) ≡ Yi(0, 1) = Yi(0, 0)

◮ NOT

slide-112
SLIDE 112

Exclusion Restriction

◮ The instrument has no direct effect on the outcome, once we

fix the value of the treatment. Yi(a, 1) = Yi(a, 0) for a = 0, 1

◮ Given this exclusion restriction, we know that the potential

outcomes for each treatment status only depend on the

treatment, not the instrument: Yi(1) ≡ Yi(1, 1) = Yi(1, 0) Yi(0) ≡ Yi(0, 1) = Yi(0, 0)

◮ NOT A

slide-113
SLIDE 113

Exclusion Restriction

◮ The instrument has no direct effect on the outcome, once we

fix the value of the treatment. Yi(a, 1) = Yi(a, 0) for a = 0, 1

◮ Given this exclusion restriction, we know that the potential

outcomes for each treatment status only depend on the

treatment, not the instrument: Yi(1) ≡ Yi(1, 1) = Yi(1, 0) Yi(0) ≡ Yi(0, 1) = Yi(0, 0)

◮ NOT A TESTABLE

slide-114
SLIDE 114

Exclusion Restriction

◮ The instrument has no direct effect on the outcome, once we

fix the value of the treatment. Yi(a, 1) = Yi(a, 0) for a = 0, 1

◮ Given this exclusion restriction, we know that the potential

outcomes for each treatment status only depend on the

treatment, not the instrument: Yi(1) ≡ Yi(1, 1) = Yi(1, 0) Yi(0) ≡ Yi(0, 1) = Yi(0, 0)

◮ NOT A TESTABLE ASSUMPTION

slide-115
SLIDE 115

The linear model with heterogeneous effects

◮ Rewriting the usual consistency assumption gives us a linear

model with heterogeneous effects (we have seen this before in randomized experiments): Yi = Yi(0) + (Yi(1) − Yi(0))Ai = α0 + τiAi + ηi

slide-116
SLIDE 116

The linear model with heterogeneous effects

◮ Rewriting the usual consistency assumption gives us a linear

model with heterogeneous effects (we have seen this before in randomized experiments): Yi = Yi(0) + (Yi(1) − Yi(0))Ai = α0 + τiAi + ηi

◮ Here, we have α0 = E[Yi(0)] and τi = Yi(1) − Yi(0).
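The rewrite is just consistency plus relabeling; a tiny check with hypothetical potential outcomes (the numbers and the helper are illustrative, not from the slides) confirms that Yi = Yi(0) + (Yi(1) − Yi(0))Ai reproduces the observed outcome in both arms:

```python
# For hypothetical potential outcomes (Y1, Y0), verify that
# Y = Y0 + (Y1 - Y0) * A equals Y1 when A = 1 and Y0 when A = 0.
def observed(Y1, Y0, A):
    return Y0 + (Y1 - Y0) * A

assert observed(Y1=5.0, Y0=2.0, A=1) == 5.0
assert observed(Y1=5.0, Y0=2.0, A=0) == 2.0
print("consistency holds in both arms")
```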

slide-117
SLIDE 117

First Stage

◮ This next assumption is a little mundane, but turns out to be

very important: the instrument must have an effect on the treatment: E[Ai(1) − Ai(0)] ≠ 0

slide-118
SLIDE 118

First Stage

◮ This next assumption is a little mundane, but turns out to be

very important: the instrument must have an effect on the treatment: E[Ai(1) − Ai(0)] ≠ 0

◮ Otherwise, what would we be doing? The instrument wouldn’t

affect anything.

slide-119
SLIDE 119

Monotonicity

◮ Lastly, we need to make another assumption about the

relationship between the instrument and the treatment.

slide-120
SLIDE 120

Monotonicity

◮ Lastly, we need to make another assumption about the

relationship between the instrument and the treatment.

◮ Monotonicity says that the presence of the instrument never

dissuades someone from taking the treatment: Ai(1) − Ai(0) ≥ 0

slide-121
SLIDE 121

Monotonicity

◮ Lastly, we need to make another assumption about the

relationship between the instrument and the treatment.

◮ Monotonicity says that the presence of the instrument never

dissuades someone from taking the treatment: Ai(1) − Ai(0) ≥ 0

◮ Note if this holds in the opposite direction Ai(1) − Ai(0) ≤ 0,

we can always rescale Ai to make the assumption hold.

slide-122
SLIDE 122

Monotonicity means no defiers

◮ This is sometimes called “no defiers”. It turns out that with a

binary treatment and a binary instrument, we can group units into four categories:

Name           Ai(1)   Ai(0)
Always-takers    1       1
Never-takers     0       0
Compliers        1       0
Defiers          0       1

slide-123
SLIDE 123

Monotonicity means no defiers

◮ This is sometimes called “no defiers”. It turns out that with a

binary treatment and a binary instrument, we can group units into four categories:

Name           Ai(1)   Ai(0)
Always-takers    1       1
Never-takers     0       0
Compliers        1       0
Defiers          0       1

◮ These compliance groups are sometimes called “principal

strata.”

slide-124
SLIDE 124

Monotonicity means no defiers

◮ This is sometimes called “no defiers”. It turns out that with a

binary treatment and a binary instrument, we can group units into four categories:

Name           Ai(1)   Ai(0)
Always-takers    1       1
Never-takers     0       0
Compliers        1       0
Defiers          0       1

◮ These compliance groups are sometimes called “principal

strata.”

◮ The monotonicity assumption removes the possibility of there

being defiers in the population.

slide-125
SLIDE 125

Monotonicity means no defiers

◮ This is sometimes called “no defiers”. It turns out that with a

binary treatment and a binary instrument, we can group units into four categories:

Name           Ai(1)   Ai(0)
Always-takers    1       1
Never-takers     0       0
Compliers        1       0
Defiers          0       1

◮ These compliance groups are sometimes called “principal

strata.”

◮ The monotonicity assumption removes the possibility of there

being defiers in the population.

◮ Anyone with Ai = 1 when Zi = 0 must be an always-taker and

anyone with Ai = 0 when Zi = 1 must be a never-taker.
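The classification logic above can be sketched as a small helper (a hypothetical function, not from the lecture materials):

```python
# Classify a unit's principal stratum from its potential treatment values.
# A1 = A_i(1): treatment taken if encouraged; A0 = A_i(0): if not encouraged.
def stratum(A1, A0):
    if A1 == 1 and A0 == 1:
        return "always-taker"
    if A1 == 0 and A0 == 0:
        return "never-taker"
    if A1 == 1 and A0 == 0:
        return "complier"
    return "defier"  # A1 = 0, A0 = 1; ruled out by monotonicity

# Under monotonicity the observed (Zi, Ai) cell only partially reveals the
# stratum: Ai = 1 with Zi = 0 implies always-taker, Ai = 0 with Zi = 1
# implies never-taker, and the remaining two cells are mixtures.
print(stratum(1, 0))  # complier
print(stratum(0, 1))  # defier
```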

slide-126
SLIDE 126

Local Average Treatment Effect (LATE)

◮ Under these four assumptions, the Wald estimator is equal

to what we call the local average treatment effect (LATE) or the complier average treatment effect (CATE).

slide-127
SLIDE 127

Local Average Treatment Effect (LATE)

◮ Under these four assumptions, the Wald estimator is equal

to what we call the local average treatment effect (LATE) or the complier average treatment effect (CATE).

◮ This is the ATE among the compliers: those who take the

treatment when encouraged to do so.

slide-128
SLIDE 128

Local Average Treatment Effect (LATE)

◮ Under these four assumptions, the Wald estimator is equal

to what we call the local average treatment effect (LATE) or the complier average treatment effect (CATE).

◮ This is the ATE among the compliers: those who take the

treatment when encouraged to do so.

◮ That is, the LATE theorem states that:

(E[Yi|Zi = 1] − E[Yi|Zi = 0]) / (E[Ai|Zi = 1] − E[Ai|Zi = 0]) = E[Yi(1) − Yi(0)|Ai(1) > Ai(0)]

slide-129
SLIDE 129

Local Average Treatment Effect (LATE)

◮ Under these four assumptions, the Wald estimator is equal

to what we call the local average treatment effect (LATE) or the complier average treatment effect (CATE).

◮ This is the ATE among the compliers: those who take the

treatment when encouraged to do so.

◮ That is, the LATE theorem states that:

(E[Yi|Zi = 1] − E[Yi|Zi = 0]) / (E[Ai|Zi = 1] − E[Ai|Zi = 0]) = E[Yi(1) − Yi(0)|Ai(1) > Ai(0)]

◮ This fact was a massive intellectual jump in our understanding

of IV.
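A quick simulation illustrates the theorem (all shares and effect sizes below are hypothetical): with always-takers, never-takers, and compliers whose treatment effects differ, the Wald ratio recovers the complier effect rather than any population-wide average.

```python
import random
random.seed(0)

n = 200_000
Z, A, Y = [], [], []
for _ in range(n):
    z = random.randint(0, 1)           # randomized binary instrument
    u = random.random()
    if u < 0.2:                        # always-taker, effect 5
        a, tau = 1, 5.0
    elif u < 0.5:                      # never-taker (its effect is never seen)
        a, tau = 0, 1.0
    else:                              # complier, effect 2
        a, tau = z, 2.0
    y = random.gauss(0, 1) + tau * a   # exclusion: Z enters only through A
    Z.append(z); A.append(a); Y.append(y)

def diff_in_means(x, z):
    m1 = sum(xi for xi, zi in zip(x, z) if zi == 1) / z.count(1)
    m0 = sum(xi for xi, zi in zip(x, z) if zi == 0) / z.count(0)
    return m1 - m0

itt = diff_in_means(Y, Z)          # reduced form (ITT)
first_stage = diff_in_means(A, Z)  # close to 0.5, the complier share
wald = itt / first_stage
print(round(first_stage, 2), round(wald, 1))  # close to 0.5 and 2.0
```

The Wald ratio lands near 2 (the complier effect), not near the always-takers' effect of 5 or any weighted mix of the two.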
slide-130
SLIDE 130

Proof of the LATE theorem

◮ Under the exclusion restriction and randomization,

E[Yi|Zi = 1] = E[Yi(0) + (Yi(1) − Yi(0))Ai|Zi = 1] = E[Yi(0) + (Yi(1) − Yi(0))Ai(1)] (randomization)

slide-131
SLIDE 131

Proof of the LATE theorem

◮ Under the exclusion restriction and randomization,

E[Yi|Zi = 1] = E[Yi(0) + (Yi(1) − Yi(0))Ai|Zi = 1] = E[Yi(0) + (Yi(1) − Yi(0))Ai(1)] (randomization)

◮ The same applies to when Zi = 0, so we have

E[Yi|Zi = 0] = E[Yi(0) + (Yi(1) − Yi(0))Ai(0)]

slide-132
SLIDE 132

Proof of the LATE theorem

◮ Under the exclusion restriction and randomization,

E[Yi|Zi = 1] = E[Yi(0) + (Yi(1) − Yi(0))Ai|Zi = 1] = E[Yi(0) + (Yi(1) − Yi(0))Ai(1)] (randomization)

◮ The same applies to when Zi = 0, so we have

E[Yi|Zi = 0] = E[Yi(0) + (Yi(1) − Yi(0))Ai(0)]

◮ Thus, E[Yi|Zi = 1] − E[Yi|Zi = 0] =

E[(Yi(1) − Yi(0))(Ai(1) − Ai(0))]

slide-133
SLIDE 133

Proof of the LATE theorem

◮ Under the exclusion restriction and randomization,

E[Yi|Zi = 1] = E[Yi(0) + (Yi(1) − Yi(0))Ai|Zi = 1] = E[Yi(0) + (Yi(1) − Yi(0))Ai(1)] (randomization)

◮ The same applies to when Zi = 0, so we have

E[Yi|Zi = 0] = E[Yi(0) + (Yi(1) − Yi(0))Ai(0)]

◮ Thus, E[Yi|Zi = 1] − E[Yi|Zi = 0] =

E[(Yi(1) − Yi(0))(Ai(1) − Ai(0))]

slide-134
SLIDE 134

Proof of the LATE theorem

◮ Under the exclusion restriction and randomization,

E[Yi|Zi = 1] = E[Yi(0) + (Yi(1) − Yi(0))Ai|Zi = 1] = E[Yi(0) + (Yi(1) − Yi(0))Ai(1)] (randomization)

◮ The same applies to when Zi = 0, so we have

E[Yi|Zi = 0] = E[Yi(0) + (Yi(1) − Yi(0))Ai(0)]

◮ Thus, E[Yi|Zi = 1] − E[Yi|Zi = 0] =

E[(Yi(1) − Yi(0))(Ai(1) − Ai(0))]
= E[(Yi(1) − Yi(0))(1)|Ai(1) > Ai(0)] Pr[Ai(1) > Ai(0)]
+ E[(Yi(1) − Yi(0))(−1)|Ai(1) < Ai(0)] Pr[Ai(1) < Ai(0)]

slide-135
SLIDE 135

Proof of the LATE theorem

◮ Under the exclusion restriction and randomization,

E[Yi|Zi = 1] = E[Yi(0) + (Yi(1) − Yi(0))Ai|Zi = 1] = E[Yi(0) + (Yi(1) − Yi(0))Ai(1)] (randomization)

◮ The same applies to when Zi = 0, so we have

E[Yi|Zi = 0] = E[Yi(0) + (Yi(1) − Yi(0))Ai(0)]

◮ Thus, E[Yi|Zi = 1] − E[Yi|Zi = 0] =

E[(Yi(1) − Yi(0))(Ai(1) − Ai(0))]
= E[(Yi(1) − Yi(0))(1)|Ai(1) > Ai(0)] Pr[Ai(1) > Ai(0)]
+ E[(Yi(1) − Yi(0))(−1)|Ai(1) < Ai(0)] Pr[Ai(1) < Ai(0)]
= E[Yi(1) − Yi(0)|Ai(1) > Ai(0)] Pr[Ai(1) > Ai(0)]

slide-136
SLIDE 136

Proof of the LATE theorem

◮ Under the exclusion restriction and randomization,

E[Yi|Zi = 1] = E[Yi(0) + (Yi(1) − Yi(0))Ai|Zi = 1] = E[Yi(0) + (Yi(1) − Yi(0))Ai(1)] (randomization)

◮ The same applies to when Zi = 0, so we have

E[Yi|Zi = 0] = E[Yi(0) + (Yi(1) − Yi(0))Ai(0)]

◮ Thus, E[Yi|Zi = 1] − E[Yi|Zi = 0] =

E[(Yi(1) − Yi(0))(Ai(1) − Ai(0))]
= E[(Yi(1) − Yi(0))(1)|Ai(1) > Ai(0)] Pr[Ai(1) > Ai(0)]
+ E[(Yi(1) − Yi(0))(−1)|Ai(1) < Ai(0)] Pr[Ai(1) < Ai(0)]
= E[Yi(1) − Yi(0)|Ai(1) > Ai(0)] Pr[Ai(1) > Ai(0)]

◮ The third equality comes from monotonicity: with this

assumption, Ai(1) < Ai(0) never occurs.

slide-137
SLIDE 137

Proof (continued)

E[Yi|Zi = 1]−E[Yi|Zi = 0] = E[Yi(1)−Yi(0)|Ai(1) > Ai(0)] Pr[Ai(1) > Ai(0)]

  • We can use the same argument for the denominator:

E[Ai|Zi = 1] − E[Ai|Zi = 0] = E[Ai(1) − Ai(0)] = Pr[Ai(1) > Ai(0)]

  • Dividing these two expressions through gives the LATE.
slide-138
SLIDE 138

Reading

slide-139
SLIDE 139

Reading

slide-140
SLIDE 140

Is the LATE useful?

◮ Once we allow for heterogeneous effects, all we can estimate

with IV is the effect of treatment among compliers.

slide-141
SLIDE 141

Is the LATE useful?

◮ Once we allow for heterogeneous effects, all we can estimate

with IV is the effect of treatment among compliers.

◮ This is an unknown subset of the data. Among treated units

with Zi = 1, we cannot distinguish them from the always-takers and similarly for the control units with Zi = 0.

slide-142
SLIDE 142

Is the LATE useful?

◮ Once we allow for heterogeneous effects, all we can estimate

with IV is the effect of treatment among compliers.

◮ This is an unknown subset of the data. Among treated units

with Zi = 1, we cannot distinguish them from the always-takers and similarly for the control units with Zi = 0.

◮ Without further assumptions, this estimand is not equal to

the overall treatment effect or the treatment effect on the treated.
slide-143
SLIDE 143

Is the LATE useful?

◮ Once we allow for heterogeneous effects, all we can estimate

with IV is the effect of treatment among compliers.

◮ This is an unknown subset of the data. Among treated units

with Zi = 1, we cannot distinguish them from the always-takers and similarly for the control units with Zi = 0.

◮ Without further assumptions, this estimand is not equal to

the overall treatment effect or the treatment effect on the treated.

◮ Furthermore, since the complier group depends on the

instrument, an IV estimate with one instrument will generally estimate a different quantity than an IV estimate of the same effect with a different instrument.

slide-144
SLIDE 144

Is the LATE useful?

◮ Once we allow for heterogeneous effects, all we can estimate

with IV is the effect of treatment among compliers.

◮ This is an unknown subset of the data. Among treated units

with Zi = 1, we cannot distinguish them from the always-takers and similarly for the control units with Zi = 0.

◮ Without further assumptions, this estimand is not equal to

the overall treatment effect or the treatment effect on the treated.

◮ Furthermore, since the complier group depends on the

instrument, an IV estimate with one instrument will generally estimate a different quantity than an IV estimate of the same effect with a different instrument.

◮ 2SLS “cheats” by assuming that the effect is constant, so it is

the same for compliers and non-compliers.

slide-145
SLIDE 145

Randomized trials with one-sided noncompliance

◮ Will the LATE ever be equal to a usual causal quantity?

slide-146
SLIDE 146

Randomized trials with one-sided noncompliance

◮ Will the LATE ever be equal to a usual causal quantity?
◮ When non-compliance is one-sided, then the LATE is equal to

the ATT.

slide-147
SLIDE 147

Randomized trials with one-sided noncompliance

◮ Will the LATE ever be equal to a usual causal quantity?
◮ When non-compliance is one-sided, then the LATE is equal to

the ATT.

◮ Think of a randomized experiment:

slide-148
SLIDE 148

Randomized trials with one-sided noncompliance

◮ Will the LATE ever be equal to a usual causal quantity?
◮ When non-compliance is one-sided, then the LATE is equal to

the ATT.

◮ Think of a randomized experiment:

◮ Randomized treatment assignment = instrument (Zi)

slide-149
SLIDE 149

Randomized trials with one-sided noncompliance

◮ Will the LATE ever be equal to a usual causal quantity?
◮ When non-compliance is one-sided, then the LATE is equal to

the ATT.

◮ Think of a randomized experiment:

◮ Randomized treatment assignment = instrument (Zi)
◮ Non-randomized actual treatment taken = treatment (Ai)

slide-150
SLIDE 150

Randomized trials with one-sided noncompliance

◮ Will the LATE ever be equal to a usual causal quantity?
◮ When non-compliance is one-sided, then the LATE is equal to

the ATT.

◮ Think of a randomized experiment:

◮ Randomized treatment assignment = instrument (Zi)
◮ Non-randomized actual treatment taken = treatment (Ai)

◮ One-sided noncompliance: only those assigned to treatment

(control) can actually take the treatment (control). Or Pr[Ai = 1|Zi = 0] = 0

slide-151
SLIDE 151

Randomized trials with one-sided noncompliance

◮ Will the LATE ever be equal to a usual causal quantity?
◮ When non-compliance is one-sided, then the LATE is equal to

the ATT.

◮ Think of a randomized experiment:

◮ Randomized treatment assignment = instrument (Zi)
◮ Non-randomized actual treatment taken = treatment (Ai)

◮ One-sided noncompliance: only those assigned to treatment

(control) can actually take the treatment (control). Or Pr[Ai = 1|Zi = 0] = 0

◮ Maybe this is because only those treated actually get pills or

only they are invited to the job training location.
slide-152
SLIDE 152

Benefits of one-sided noncompliance

◮ With this assumption, we know that there are no

“always-takers,” and since there are no defiers, anyone assigned to treatment (Zi = 1) who takes the treatment (Ai = 1) is a complier.

slide-153
SLIDE 153

Benefits of one-sided noncompliance

◮ With this assumption, we know that there are no

“always-takers,” and since there are no defiers, anyone assigned to treatment (Zi = 1) who takes the treatment (Ai = 1) is a complier.

◮ Thus, we know that: E[Yi|Zi = 1] − E[Yi|Zi = 0] =

E[Yi(0) + (Yi(1) − Yi(0))Ai|Zi = 1] − E[Yi(0)|Zi = 0] (exclusion restriction + one-sided noncompliance)

slide-154
SLIDE 154

Benefits of one-sided noncompliance

◮ With this assumption, we know that there are no

“always-takers,” and since there are no defiers, anyone assigned to treatment (Zi = 1) who takes the treatment (Ai = 1) is a complier.

◮ Thus, we know that: E[Yi|Zi = 1] − E[Yi|Zi = 0] =

E[Yi(0) + (Yi(1) − Yi(0))Ai|Zi = 1] − E[Yi(0)|Zi = 0] (exclusion restriction + one-sided noncompliance)

slide-155
SLIDE 155

Benefits of one-sided noncompliance

◮ With this assumption, we know that there are no

“always-takers,” and since there are no defiers, anyone assigned to treatment (Zi = 1) who takes the treatment (Ai = 1) is a complier.

◮ Thus, we know that: E[Yi|Zi = 1] − E[Yi|Zi = 0] =

E[Yi(0) + (Yi(1) − Yi(0))Ai|Zi = 1] − E[Yi(0)|Zi = 0] (exclusion restriction + one-sided noncompliance)
= E[Yi(0)|Zi = 1] + E[(Yi(1) − Yi(0))Ai|Zi = 1] − E[Yi(0)|Zi = 0]

slide-156
SLIDE 156

Benefits of one-sided noncompliance

◮ With this assumption, we know that there are no

“always-takers,” and since there are no defiers, anyone assigned to treatment (Zi = 1) who takes the treatment (Ai = 1) is a complier.

◮ Thus, we know that: E[Yi|Zi = 1] − E[Yi|Zi = 0] =

E[Yi(0) + (Yi(1) − Yi(0))Ai|Zi = 1] − E[Yi(0)|Zi = 0] (exclusion restriction + one-sided noncompliance)
= E[Yi(0)|Zi = 1] + E[(Yi(1) − Yi(0))Ai|Zi = 1] − E[Yi(0)|Zi = 0]
= E[Yi(0)] + E[(Yi(1) − Yi(0))Ai|Zi = 1] − E[Yi(0)] (randomization)

slide-157
SLIDE 157

Benefits of one-sided noncompliance

◮ With this assumption, we know that there are no

“always-takers,” and since there are no defiers, anyone assigned to treatment (Zi = 1) who takes the treatment (Ai = 1) is a complier.

◮ Thus, we know that: E[Yi|Zi = 1] − E[Yi|Zi = 0] =

E[Yi(0) + (Yi(1) − Yi(0))Ai|Zi = 1] − E[Yi(0)|Zi = 0] (exclusion restriction + one-sided noncompliance)
= E[Yi(0)|Zi = 1] + E[(Yi(1) − Yi(0))Ai|Zi = 1] − E[Yi(0)|Zi = 0]
= E[Yi(0)] + E[(Yi(1) − Yi(0))Ai|Zi = 1] − E[Yi(0)] (randomization)
= E[(Yi(1) − Yi(0))Ai|Zi = 1]

slide-158
SLIDE 158

Benefits of one-sided noncompliance

◮ With this assumption, we know that there are no

“always-takers,” and since there are no defiers, anyone assigned to treatment (Zi = 1) who takes the treatment (Ai = 1) is a complier.

◮ Thus, we know that: E[Yi|Zi = 1] − E[Yi|Zi = 0] =

E[Yi(0) + (Yi(1) − Yi(0))Ai|Zi = 1] − E[Yi(0)|Zi = 0] (exclusion restriction + one-sided noncompliance)
= E[Yi(0)|Zi = 1] + E[(Yi(1) − Yi(0))Ai|Zi = 1] − E[Yi(0)|Zi = 0]
= E[Yi(0)] + E[(Yi(1) − Yi(0))Ai|Zi = 1] − E[Yi(0)] (randomization)
= E[(Yi(1) − Yi(0))Ai|Zi = 1]
= E[Yi(1) − Yi(0)|Ai = 1, Zi = 1] Pr[Ai = 1|Zi = 1] (law of iterated expectations + binary treatment)

slide-159
SLIDE 159

Benefits of one-sided noncompliance

◮ With this assumption, we know that there are no

“always-takers,” and since there are no defiers, anyone assigned to treatment (Zi = 1) who takes the treatment (Ai = 1) is a complier.

◮ Thus, we know that: E[Yi|Zi = 1] − E[Yi|Zi = 0] =

E[Yi(0) + (Yi(1) − Yi(0))Ai|Zi = 1] − E[Yi(0)|Zi = 0] (exclusion restriction + one-sided noncompliance)
= E[Yi(0)|Zi = 1] + E[(Yi(1) − Yi(0))Ai|Zi = 1] − E[Yi(0)|Zi = 0]
= E[Yi(0)] + E[(Yi(1) − Yi(0))Ai|Zi = 1] − E[Yi(0)] (randomization)
= E[(Yi(1) − Yi(0))Ai|Zi = 1]
= E[Yi(1) − Yi(0)|Ai = 1, Zi = 1] Pr[Ai = 1|Zi = 1] (law of iterated expectations + binary treatment)
= E[Yi(1) − Yi(0)|Ai = 1] Pr[Ai = 1|Zi = 1] (one-sided noncompliance)

slide-160
SLIDE 160

Benefits of one-sided noncompliance

◮ With this assumption, we know that there are no

“always-takers,” and since there are no defiers, anyone assigned to treatment (Zi = 1) who takes the treatment (Ai = 1) is a complier.

◮ Thus, we know that: E[Yi|Zi = 1] − E[Yi|Zi = 0] =

E[Yi(0) + (Yi(1) − Yi(0))Ai|Zi = 1] − E[Yi(0)|Zi = 0] (exclusion restriction + one-sided noncompliance)
= E[Yi(0)|Zi = 1] + E[(Yi(1) − Yi(0))Ai|Zi = 1] − E[Yi(0)|Zi = 0]
= E[Yi(0)] + E[(Yi(1) − Yi(0))Ai|Zi = 1] − E[Yi(0)] (randomization)
= E[(Yi(1) − Yi(0))Ai|Zi = 1]
= E[Yi(1) − Yi(0)|Ai = 1, Zi = 1] Pr[Ai = 1|Zi = 1] (law of iterated expectations + binary treatment)
= E[Yi(1) − Yi(0)|Ai = 1] Pr[Ai = 1|Zi = 1] (one-sided noncompliance)

slide-161
SLIDE 161

◮ Noting that Pr[Ai = 1|Zi = 0] = 0, then the Wald estimator is

just the ATT: (E[Yi|Zi = 1] − E[Yi|Zi = 0]) / Pr[Ai = 1|Zi = 1] = E[Yi(1) − Yi(0)|Ai = 1]

slide-162
SLIDE 162

◮ Noting that Pr[Ai = 1|Zi = 0] = 0, then the Wald estimator is

just the ATT: (E[Yi|Zi = 1] − E[Yi|Zi = 0]) / Pr[Ai = 1|Zi = 1] = E[Yi(1) − Yi(0)|Ai = 1]

◮ Thus, under the additional assumption of one-sided noncompliance,

we can estimate the ATT using the usual IV approach.

slide-163
SLIDE 163

◮ Noting that Pr[Ai = 1|Zi = 0] = 0, then the Wald estimator is

just the ATT: (E[Yi|Zi = 1] − E[Yi|Zi = 0]) / Pr[Ai = 1|Zi = 1] = E[Yi(1) − Yi(0)|Ai = 1]

◮ Thus, under the additional assumption of one-sided noncompliance,

we can estimate the ATT using the usual IV approach.

◮ The ATT is a combination of the LATE and the ATE for the

always-takers. If we remove the possibility of the always takers, then anyone who actually takes the treatment is a complier.

slide-164
SLIDE 164

◮ Noting that Pr[Ai = 1|Zi = 0] = 0, then the Wald estimator is

just the ATT: (E[Yi|Zi = 1] − E[Yi|Zi = 0]) / Pr[Ai = 1|Zi = 1] = E[Yi(1) − Yi(0)|Ai = 1]

◮ Thus, under the additional assumption of one-sided noncompliance,

we can estimate the ATT using the usual IV approach.

◮ The ATT is a combination of the LATE and the ATE for the

always-takers. If we remove the possibility of the always takers, then anyone who actually takes the treatment is a complier.

◮ It’s also easy to see that if we switch the direction of one-sided

noncompliance, then we can estimate the average treatment effect for the controls.
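The ATT claim can be checked in a simulation (hypothetical shares and effect sizes) in which no one assigned to control can take the treatment, so every treated unit is a complier:

```python
import random
random.seed(1)

n = 100_000
Z, A, Y = [], [], []
for _ in range(n):
    z = random.randint(0, 1)
    complier = random.random() < 0.6       # 60% compliers, rest never-takers
    a = 1 if (z == 1 and complier) else 0  # one-sided: Pr[A = 1 | Z = 0] = 0
    tau = 3.0 if complier else 0.0         # hypothetical treatment effect
    Z.append(z); A.append(a)
    Y.append(random.gauss(0, 1) + tau * a)

def cmean(x, z, v):
    sub = [xi for xi, zi in zip(x, z) if zi == v]
    return sum(sub) / len(sub)

wald = (cmean(Y, Z, 1) - cmean(Y, Z, 0)) / (cmean(A, Z, 1) - cmean(A, Z, 0))
# Every unit with A = 1 is a complier here, so the ATT equals the complier
# effect of 3, and the Wald ratio recovers it.
print(round(wald, 1))  # close to 3.0
```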

slide-165
SLIDE 165

Falsification tests

◮ The exclusion restriction cannot be tested directly, but it can

be falsified.

slide-166
SLIDE 166

Falsification tests

◮ The exclusion restriction cannot be tested directly, but it can

be falsified.

◮ Under the exclusion restriction, Zi only has an effect on Yi

because it has an effect on Ai.

slide-167
SLIDE 167

Falsification tests

◮ The exclusion restriction cannot be tested directly, but it can

be falsified.

◮ Under the exclusion restriction, Zi only has an effect on Yi

because it has an effect on Ai.

◮ Falsification test: test the reduced-form effect of Zi on Yi in

situations where it is impossible or extremely unlikely that Zi could affect Ai.

slide-168
SLIDE 168

Falsification tests

◮ The exclusion restriction cannot be tested directly, but it can

be falsified.

◮ Under the exclusion restriction, Zi only has an effect on Yi

because it has an effect on Ai.

◮ Falsification test: test the reduced-form effect of Zi on Yi in

situations where it is impossible or extremely unlikely that Zi could affect Ai.

◮ Because Zi can’t affect Ai, then the exclusion restriction

implies that this falsification test should have 0 effect. If we find an effect, instrument is suspicious.

slide-169
SLIDE 169

Falsification tests

◮ The exclusion restriction cannot be tested directly, but it can

be falsified.

◮ Under the exclusion restriction, Zi only has an effect on Yi

because it has an effect on Ai.

◮ Falsification test: test the reduced-form effect of Zi on Yi in

situations where it is impossible or extremely unlikely that Zi could affect Ai.

◮ Because Zi can’t affect Ai, then the exclusion restriction

implies that this falsification test should have 0 effect. If we find an effect, instrument is suspicious.

◮ Nunn & Wantchekon (2011): use distance to coast as an

instrument for Africans, and use distance to the coast in an Asian sample as a falsification test.
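A falsification check is just the reduced form estimated where the first stage is known to be absent; the sketch below uses synthetic data standing in for such a placebo sample:

```python
import random
random.seed(2)

# Placebo sample in which Z cannot affect A (as with distance to the coast
# in the Asian sample): under the exclusion restriction, the reduced-form
# effect of Z on Y should be indistinguishable from zero.
n = 50_000
Z = [random.randint(0, 1) for _ in range(n)]
Y = [random.gauss(0, 1) for _ in range(n)]  # no Z -> Y channel by construction

m1 = sum(y for y, z in zip(Y, Z) if z == 1) / Z.count(1)
m0 = sum(y for y, z in zip(Y, Z) if z == 0) / Z.count(0)
reduced_form = m1 - m0
print(abs(reduced_form) < 0.05)  # a sizable estimate would cast doubt on Z
```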

slide-170
SLIDE 170

Nunn & Wantchekon falsification test

slide-171
SLIDE 171

Size, characteristics of the compliers

◮ While we cannot identify who is a complier and who is not a

complier in general, we can estimate the size of the complier group: Pr[Ai(1) > Ai(0)] = E[Ai(1)−Ai(0)] = E[Ai|Zi = 1]−E[Ai|Zi = 0]

slide-172
SLIDE 172

Size, characteristics of the compliers

◮ While we cannot identify who is a complier and who is not a

complier in general, we can estimate the size of the complier group: Pr[Ai(1) > Ai(0)] = E[Ai(1)−Ai(0)] = E[Ai|Zi = 1]−E[Ai|Zi = 0]

◮ Angrist and Pischke describe ways to calculate the difference

between the compliers and overall population in terms of binary covariates.

slide-173
SLIDE 173

Size, characteristics of the compliers

◮ While we cannot identify who is a complier and who is not a

complier in general, we can estimate the size of the complier group: Pr[Ai(1) > Ai(0)] = E[Ai(1)−Ai(0)] = E[Ai|Zi = 1]−E[Ai|Zi = 0]

◮ Angrist and Pischke describe ways to calculate the difference

between the compliers and overall population in terms of binary covariates.

◮ Abadie (2003) shows how to calculate the mean of any

covariate in the complier group.
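The complier share is just the first-stage difference in treatment rates; with a toy dataset of hypothetical (Z, A) pairs:

```python
# Hypothetical (Z, A) pairs: instrument assignment and observed treatment.
data = [(1, 1), (1, 1), (1, 0), (1, 1), (0, 0), (0, 1), (0, 0), (0, 0)]

def p_treated(z):
    taken = [a for zi, a in data if zi == z]
    return sum(taken) / len(taken)

# Pr[complier] = E[A | Z = 1] - E[A | Z = 0] under monotonicity.
complier_share = p_treated(1) - p_treated(0)
print(complier_share)  # 0.75 - 0.25 = 0.5
```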

slide-174
SLIDE 174

Multiple instruments

◮ Since each instrument implies a different complier group, each

instrument estimates a causal effect for a different subset of the population.

slide-175
SLIDE 175

Multiple instruments

◮ Since each instrument implies a different complier group, each

instrument estimates a causal effect for a different subset of the population.

◮ Thus, if we had two instruments, then there would be two

different LATEs, ρ1 and ρ2, for instruments Z1i and Z2i. We might try to use 2SLS to estimate an overall effect with these instruments with the following first stage: Âi = π1Z1i + π2Z2i.

slide-176
SLIDE 176

2SLS as weighted average

◮ Angrist and Pischke show that the 2SLS estimator

using these two instruments is a weighted sum of the two component LATEs: ρ2SLS = ψρ1 + (1 − ψ)ρ2, where the weight is: ψ = π1Cov(Ai, Z1i) / (π1Cov(Ai, Z1i) + π2Cov(Ai, Z2i))

slide-177
SLIDE 177

2SLS as weighted average

◮ Angrist and Pischke show that the 2SLS estimator

using these two instruments is a weighted sum of the two component LATEs: ρ2SLS = ψρ1 + (1 − ψ)ρ2, where the weight is: ψ = π1Cov(Ai, Z1i) / (π1Cov(Ai, Z1i) + π2Cov(Ai, Z2i))

◮ Thus, the 2SLS estimate is a weighted average of causal effects

for each instrument, where the weights reflect the strength of each instrument's first-stage effect.
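The weighting identity can be verified numerically. The sketch below uses a hypothetical compliance structure with independent binary instruments, so the bivariate slopes stand in for the first-stage coefficients, and compares ψρ1 + (1 − ψ)ρ2 with the 2SLS ratio built from the fitted first stage:

```python
import random
random.seed(3)

n = 100_000
Y, A, Z1, Z2 = [], [], [], []
for _ in range(n):
    z1, z2 = random.randint(0, 1), random.randint(0, 1)
    u = random.random()
    if u < 0.3:    a, tau = z1, 1.0   # responds only to Z1, effect 1
    elif u < 0.6:  a, tau = z2, 4.0   # responds only to Z2, effect 4
    else:          a, tau = 0, 0.0    # never-taker
    Y.append(random.gauss(0, 1) + tau * a)
    A.append(a); Z1.append(z1); Z2.append(z2)

def cov(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)

rho1 = cov(Y, Z1) / cov(A, Z1)   # IV (Wald) estimate using Z1 alone
rho2 = cov(Y, Z2) / cov(A, Z2)   # IV (Wald) estimate using Z2 alone
pi1 = cov(A, Z1) / cov(Z1, Z1)   # first-stage slopes (instruments ~independent)
pi2 = cov(A, Z2) / cov(Z2, Z2)
psi = pi1 * cov(A, Z1) / (pi1 * cov(A, Z1) + pi2 * cov(A, Z2))
blended = psi * rho1 + (1 - psi) * rho2

Ahat = [pi1 * a + pi2 * b for a, b in zip(Z1, Z2)]   # fitted first stage
tsls = cov(Y, Ahat) / cov(A, Ahat)                   # 2SLS-style IV ratio
print(round(blended, 2), round(tsls, 2))  # the two agree; both near 2.5
```

The agreement is algebraic, not a coincidence of this draw: Cov(Y, Âi) = π1Cov(Y, Z1i) + π2Cov(Y, Z2i), so the ratio using Âi collapses exactly to the ψ-weighted average of the two instrument-specific estimates.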

slide-178
SLIDE 178

Covariates and heterogeneous effects

◮ It might be the case that the above assumptions only hold

conditional on some covariates, Xi. That is, instead of randomization, we might have conditional ignorability: [{Yi(a, z), ∀a, z}, Ai(1), Ai(0)] ⊥ ⊥ Zi|Xi

slide-179
SLIDE 179

Covariates and heterogeneous effects

◮ It might be the case that the above assumptions only hold

conditional on some covariates, Xi. That is, instead of randomization, we might have conditional ignorability: [{Yi(a, z), ∀a, z}, Ai(1), Ai(0)] ⊥ ⊥ Zi|Xi

◮ We would also have exclusion conditional on the covariates:

Pr[Yi(a, 0) = Yi(a, 1)|Xi] = 1 for a = 1, 0

slide-180
SLIDE 180

Covariates and heterogeneous effects

◮ It might be the case that the above assumptions only hold

conditional on some covariates, Xi. That is, instead of randomization, we might have conditional ignorability: [{Yi(a, z), ∀a, z}, Ai(1), Ai(0)] ⊥ ⊥ Zi|Xi

◮ We would also have exclusion conditional on the covariates:

Pr[Yi(a, 0) = Yi(a, 1)|Xi] = 1 for a = 1, 0

◮ Under these assumptions, Angrist and Pischke show that if you

fully saturate the first stage and the second stage in the covariates, then 2SLS estimates a weighted average of the covariate-specific LATEs (very similar to regression).

slide-181
SLIDE 181

Covariates and heterogeneous effects

◮ It might be the case that the above assumptions only hold

conditional on some covariates, Xi. That is, instead of randomization, we might have conditional ignorability: [{Yi(a, z), ∀a, z}, Ai(1), Ai(0)] ⊥ ⊥ Zi|Xi

◮ We would also have exclusion conditional on the covariates:

Pr[Yi(a, 0) = Yi(a, 1)|Xi] = 1 for a = 1, 0

◮ Under these assumptions, Angrist and Pischke show that if you

fully saturate the first stage and the second stage in the covariates, then 2SLS estimates a weighted average of the covariate-specific LATEs (very similar to regression).

◮ Abadie (2003) shows how to estimate the overall LATE using a

weighting approach based on a “propensity score” for the instrument.