Differences-in-Differences Estimator: Example Card and Krueger



SLIDE 1

Differences-in-Differences Estimator: Example Card and Krueger (1994)

  • Example: Card D. and Krueger A. (1994) “Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania”, American Economic Review, Vol. 84, No. 4, pp. 772-793.
  • We mention this paper as an example of a natural or quasi-experiment.
  • Effect of minimum wages on employment (a classic and controversial question in labour economics).

SLIDE 2

Differences-in-Differences Estimator: Example Card and Krueger (1994)

  • In February 1992 New Jersey (NJ) increased the state minimum wage from $4.25 to $5.05. Pennsylvania (PA)’s minimum wage stayed at $4.25.
  • They surveyed about 400 fast food stores both in NJ and in PA, both before and after the minimum wage increase in NJ.
  • The differences-in-differences strategy amounts to comparing the change in employment in NJ to the change in employment in PA.
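The comparison of changes can be written as a one-line computation. A minimal sketch with hypothetical employment numbers (not the values reported in Card and Krueger (1994)):

```python
# Minimal 2x2 differences-in-differences, Card-Krueger style.
# The employment numbers below are made up for illustration, NOT the
# actual values reported in Card and Krueger (1994).

def did(treat_before, treat_after, control_before, control_after):
    """DID = (change in treated group) - (change in control group)."""
    return (treat_after - treat_before) - (control_after - control_before)

# Mean full-time-equivalent employment per store (hypothetical):
nj_before, nj_after = 20.0, 21.0   # NJ: minimum wage raised in Feb 1992
pa_before, pa_after = 23.0, 21.5   # PA: minimum wage unchanged

effect = did(nj_before, nj_after, pa_before, pa_after)
print(effect)  # 2.5: NJ gained 1.0 while PA lost 1.5
```

Subtracting the PA change removes any statewide shock common to both states, which is exactly the DID logic on the slide.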

SLIDE 3

Differences-in-Differences Estimator and Common Trend Assumption

  • The key assumption for DID is that the outcome in the treatment and control group would follow the same time trend in the absence of the treatment.
  • This does not mean that they have to have the same mean of the outcome!
  • The common trend assumption is difficult to verify, but one often uses pre-treatment data to show that the trends are the same.
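A common use of pre-treatment data is a placebo DID computed entirely before the treatment: under common trends it should be close to zero even when the groups' levels differ. A sketch with hypothetical numbers:

```python
# Placebo DID on two periods that BOTH precede the treatment: under
# common trends the two groups' pre-treatment changes are equal, so the
# placebo DID is ~0. All numbers are hypothetical outcome means.

treated_pre1, treated_pre2 = 10.0, 11.0   # treated group trend: +1.0
control_pre1, control_pre2 = 14.0, 15.0   # control group trend: +1.0

placebo = (treated_pre2 - treated_pre1) - (control_pre2 - control_pre1)
print(placebo)  # 0.0: same trends despite different levels
```

Note the group means differ (10 vs 14), which is fine: only the trends must match.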

SLIDE 4

Example of TTTC for time series: Causal Effects Using Bayesian Structural Time-Series Models (Brodersen et al. 2015)

  • Brodersen et al. (2015): new methodology to estimate the causal effect of an intervention using time series data.
  • Goal: propose a new approach to infer the causal impact of a market intervention, such as the launch of a new product or the effect of an advertising campaign.
  • Very useful method for private sector firms that want to assess the profitability of investment decisions.
– No coincidence that all authors of the article work at Google.
– They develop the R package CausalImpact to perform the estimation.

SLIDE 5

Example of TTTC for time series: Causal Effects Using Bayesian Structural Time-Series Models (Brodersen et al. 2015)

  • Main idea: generalize the DID estimator to a time-series setting by modelling the counterfactual of a time series observed both before and after a given intervention.
  • Example: estimate the causal effect of an ad campaign on the number of clicks to a website.
– Causal effect: difference between the observed number of clicks and the number of clicks that would have been observed absent the ad campaign.

SLIDE 6

Example of TTTC for time series: Causal Effects Using Bayesian Structural Time-Series Models (Brodersen et al. 2015)

  • Advantages with respect to DID:
– DID is based on a static regression model that assumes i.i.d. data, while time series clearly are not i.i.d.
– DID considers only two points in time: before and after the treatment. However, the way in which the treatment effect changes over time can be very important.
– DID applied to time series data imposes restrictions on how the variables used to characterize the control group are selected.

SLIDE 7

CausalImpact Package in R

  • https://google.github.io/CausalImpact/CausalImpact.html#installing-the-package
  • Tutorial: https://www.youtube.com/watch?v=GTgZfCltMm8
  • Methodology: Bayesian structural time-series model.
  • Find time series that are correlated with the outcome of interest but are uncorrelated with the treatment:
– Examples: markets where no action was taken, stock markets, weather, search queries from Google Trends.
– “Unmoved movers” that indicate an interest in the industry or market where a firm operates without being directly affected by the action taken.
  • Train the model in the pre-period and apply it in the post-period.
  • A spike-and-slab prior is used to automatically find useful predictors.
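The train-in-pre, predict-in-post logic can be sketched in a few lines. This is a deliberately simplified stand-in for CausalImpact's Bayesian structural time-series model, using a single control series, a plain OLS fit, and synthetic numbers (all hypothetical):

```python
# Simplified stand-in for the CausalImpact idea (NOT the actual Bayesian
# structural time-series model): fit the outcome on a control series in
# the pre-period, predict the counterfactual in the post-period, and take
# observed - predicted as the estimated effect. Data are synthetic.

def fit_line(x, y):
    """OLS intercept and slope for y = a + b*x (one regressor)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    return my - b * mx, b

# Control series (e.g. an unaffected market) and outcome; in the
# post-period the true intervention effect is +5 on every observation.
control = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
outcome = [12.0, 14.0, 16.0, 18.0, 25.0, 27.0, 29.0, 31.0]  # 10+2*c, +5 after t=4
pre, post = slice(0, 4), slice(4, 8)

a, b = fit_line(control[pre], outcome[pre])          # train in the pre-period
counterfactual = [a + b * c for c in control[post]]  # predict the post-period
effects = [y - yhat for y, yhat in zip(outcome[post], counterfactual)]
print(effects)  # [5.0, 5.0, 5.0, 5.0]
```

The real package replaces the OLS line with a state-space model, many candidate predictors, and posterior uncertainty bands around the counterfactual.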
SLIDE 8

CausalImpact Package in R

  • Caveats and checks:
– Back-test the model to prevent spurious correlations between time series that perfectly predict the outcome before the treatment:
  • Test the model before the treatment to check that there was no effect before the treatment.
– Rule of thumb: between 5 and 20 predictor time series to balance the explanatory power of the predictors.
  • Open question: how to use this method to calculate the impact of multiple events that overlap over time?
  • Precision of the estimates:
– The longer you predict into the future, the wider the CI.
– The stronger the predictor time series and the less noisy the outcome, the tighter the CI.

SLIDE 9

Different Approaches to Program Evaluation

1. Run an experiment and use the simple differences estimator.
2. Use observational data to construct the counterfactual:
  • a. Selection on observables:
  • Unconfoundedness assumption: we assume to observe all X variables that affect both participation decision and outcome.
  • Differences-in-Differences (DID)
  • Matching
  • Regression discontinuity (RD)
  • b. Selection on unobservables:
  • We assume participation depends on unobserved variables.
  • Instrumental variable estimation
  • Control function approach
SLIDE 10

Use Observational Data to Construct the Counterfactuals

  • How to construct the counterfactual when we cannot simply use a differences estimator (DE)?
  • NOTE: we cannot simply use the DE because of unobservables affecting the treatment and the control groups differently over time.
  • 1. Selection on observables:
  • Unconfoundedness assumption (UA): we assume to observe all X variables that affect both participation decision and outcome.
– Differences-in-Differences (DID): if the UA is satisfied, we can control for all relevant Xs and make sure the common trend assumption is satisfied.
  • Common trend assumption: absent treatment, the change in treated outcome would have been the same as the change in non-treated outcome.
– Matching
– Regression discontinuity (RD)

SLIDE 11

Matching Estimator

  • Intuitive idea: we use a group of observed variables Z to form matches between individuals in treatment and control groups.
  • Example: Angrist (1998) uses matching to estimate the effect of military service on earnings later on in life.
– Treatment: D=1 if someone was in the military.
– Matching idea: conditional on the observed variables Z that are used to select soldiers (age, schooling and test scores), having been in the military is independent of potential future earnings.
– Matching estimator: conditional on Z, the impact of military service on earnings can be estimated by comparing the earnings of those who were in the military to the earnings of those who were not.

SLIDE 12

Matching Estimator

  • The matching method compares the outcomes of treated participants with those of matched non-treated, where matches are chosen on the basis of similarity in observed characteristics.
  • Covariate-specific treatment-control comparison, weighted using the distribution of covariates among the treated.
  • Main advantage of matching estimators: they typically do not require specifying a functional form of the outcome equation, and are therefore not susceptible to wrong functional form bias.
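The covariate-specific comparison described above can be sketched for a discrete covariate (hypothetical data; the function name is ours):

```python
# Exact matching on a discrete covariate Z: compare treated and control
# means within each Z-cell, then weight the cell contrasts by the
# distribution of Z among the treated (this estimates the TTE).
# Data are hypothetical.
from collections import defaultdict

def matching_tte(rows):
    """rows: list of (z, d, y) tuples, with d = 1 treated, 0 control."""
    y_by = defaultdict(list)                 # (z, d) -> list of outcomes
    for z, d, y in rows:
        y_by[(z, d)].append(y)
    treated_z = [z for z, d, _ in rows if d == 1]
    tte = 0.0
    for z in set(treated_z):
        gap = (sum(y_by[(z, 1)]) / len(y_by[(z, 1)])
               - sum(y_by[(z, 0)]) / len(y_by[(z, 0)]))   # cell contrast
        tte += gap * treated_z.count(z) / len(treated_z)  # weight: P(Z=z | D=1)
    return tte

rows = [  # (z, d, y): the treatment adds +2 in cell z=0 and +4 in cell z=1
    (0, 1, 7.0), (0, 0, 5.0), (0, 0, 5.0),
    (1, 1, 14.0), (1, 0, 10.0),
]
print(matching_tte(rows))  # 3.0 (equal weights of 1/2 on the two cell gaps)
```

Note that no outcome equation is specified anywhere: each cell supplies its own counterfactual mean, which is the "no functional form" advantage from the slide.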

SLIDE 13

Assumptions of Matching Approach

  • Assume you have access to data on treated and untreated individuals (D=1 and D=0).
  • Assume you also have access to a set of Z variables whose distribution, F(.), is not affected by D: F(Z|D,Y1,Y0) = F(Z|Y1,Y0).

SLIDE 15

Assumptions of Matching Approach

1. Selection on Observables (Unconfoundedness Assumption)
– There exists a set of observed characteristics Z such that outcomes are independent of treatment conditional on Z (i.e. treatment assignment is “strictly ignorable” given Z (Rosenbaum and Rubin 1983)).

2. Common Support Assumption: 0 < P(D=1|Z) < 1.
– Assumption 2 is required so that matches for D=0 and D=1 observations can effectively be found.
SLIDE 19

Implications of Assumptions

  • If Assumptions 1 and 2 are satisfied, then the problem of determining the average treatment effect can be solved by substituting the Y0 distribution observed for “matched-on-Z non-participants” for the missing Y0 distribution of participants.
  • For assumption 1 to hold, individuals cannot select into the program based on anticipated treatment impact.
  • Assumption 1 implies: E(Y0|Z, D=1) = E(Y0|Z, D=0) = E(Y0|Z).
  • Under these assumptions, we can estimate all of the ATE, TTE and UTE.
SLIDE 20

Implications of Assumptions

  • In addition to assumptions 1 and 2, we also assume that the distribution of the matching Z variables, F(.), is not affected by D: F(Z|D,Y1,Y0) = F(Z|Y1,Y0).
  • Empirical implication: age, gender and race would generally be valid matching variables since they are predetermined, while, for example, marital status may not be if it was directly affected by having received the treatment.

SLIDE 22

Matching Estimator

  • A typical matching estimator for the TTE takes the following form (n1 is the number of observations in the treatment group):

TTE_hat = (1/n1) Σ_{i: Di=1} (Y1i − Ŷ0i),

where Ŷ0i is an estimator for the matched “no treatment” outcome.
  • Recall that we can estimate the counterfactual due to Assumption 1: E(Y0|Z, D=1) = E(Y0|Z, D=0).
SLIDE 23

How does Matching Compare to a Randomized Experiment?

  • In matching, the distribution of observables will be the same in treatment and control group.
  • However, the distribution of unobservables is not necessarily balanced across groups.
  • In an experiment there is by definition “full support”, while matching relies on the common support condition (assumption 2).
– If there are regions where the support of Z does not overlap for the D=0 and D=1 groups, then matching is only justified when it is performed over the region of common support, that is, for the pairs that we can effectively match, so that the estimated treatment effect is only defined for this specific sample of matches.
SLIDE 24

Implementing Matching

  • Problem: how to construct a match when Z is of high dimension?
  • Solution: estimate the probability of receiving the treatment given Z, and construct matches based on this probability.
  • Propensity Score Matching: match on P(D=1|Z) (the “propensity score”).
SLIDE 28

Propensity Score Matching

  • Matching estimators are difficult to implement when the set of conditioning variables Z is large (lots of Z variables) or made of continuous variables.
– Rosenbaum and Rubin theorem (1983): assumption 1 implies that outcomes are independent of treatment conditional on the propensity score P(D=1|Z) alone.
– This reduces the matching problem to a univariate problem, provided P(D=1|Z) (the “propensity score”) can be parametrically estimated.
– We have to control only for the covariates that affect the probability of receiving the treatment.

SLIDE 29

Implementation of the Propensity Score Matching Estimator

Step 1: Estimate the propensity score P(D=1|Z) for each unit of analysis, usually using a logit or probit.
Step 2: Select matches based on the estimated propensity score.

  • Propensity score matching works in the same way as covariate matching, except that we match on the propensity score instead of the covariates.

SLIDE 31

Propensity Score Matching Methods

  • For notational simplicity, let P = P(D=1|Z).
  • A typical propensity score matching estimator for the TTE takes the form:

TTE_hat = (1/n1) Σ_{i ∈ I1 ∩ Sp} [ Y1i − Σ_{j ∈ I0} w(Pi, Pj) Y0j ]

where I1 denotes the set of program participants, I0 the set of non-participants, Sp the region of common support, and n1 is the number of persons in the set I1 ∩ Sp.
– The match for each participant is constructed as a weighted average over the outcomes of non-participants, where the weights w(Pi, Pj) depend on the distance between Pi and Pj.
SLIDE 32

Alternative Ways of Constructing Matched Outcomes Using the Propensity Score

  • There are alternative ways of constructing matched outcomes that are based on:
– defining and constructing a neighborhood for each unit in the participants’ sample;
– matching each participant to the non-participants whose propensity scores belong to the defined neighborhood.
  • These alternative matching estimators differ in how:
– the neighborhood is defined;
– the weights are constructed.

SLIDE 34

Alternative Ways of Constructing Matched Outcomes Using the Propensity Score

  • Define a neighborhood C(Pi) for each i in the participant sample.
  • Neighbors for i are non-participants j for whom Pj ∈ C(Pi).
  • The persons matched to i are those in the set Ai = { j ∈ I0 : Pj ∈ C(Pi) }.
  • Alternative matching estimators differ:
– in how the neighborhood is defined, and
– in how the weights are constructed:
1. Nearest Neighbor Matching
2. Stratification or Interval Matching
3. Kernel and Local Linear Matching
  • Implementation requires knowledge of nonparametric estimation; see Appendix 3 for details.

SLIDE 37

Differences-in-Differences Matching Estimators

  • Assumption of cross-sectional matching estimators:
– After conditioning on a set of observable characteristics, outcomes are conditionally mean independent of program participation.
  • BUT: there may be systematic differences between participant and nonparticipant outcomes that could lead to a violation of the identification conditions required for matching:
– e.g. due to program selectivity on unmeasured characteristics.
  • Solution in the case of time-invariant differences in outcomes between participants and nonparticipants:
– the difference-in-differences matching strategy (see Heckman, Ichimura and Todd (1997)).

SLIDE 38

Standard Errors of Matching Estimators

  • Standard errors for matching estimators are often generated using bootstrap resampling methods.
  • This is valid for kernel or local linear matching estimators, but not for nearest neighbor matching estimators (see Abadie and Imbens (2004), also for alternatives in that case).

SLIDE 39

Matching Estimator and Machine Learning

  • We can use machine learning tools for propensity score matching: construct and estimate predictive models for the probability of treatment.
  • Wager and Athey (2018): “causal forests”, a version of a nearest neighbor matching method with a data-driven approach to determine which dimensions of the covariate space to match on.
  • Very promising and emerging literature that combines machine learning methods to improve the estimation of causal effects (Section 4.2 in Athey and Imbens 2016).
– Random forests, boosting or LASSO to estimate the propensity score.
– Improved LASSO via the double-selection method (Belloni et al. 2014 JEP):
  • 1. LASSO to select covariates that are correlated with the outcome.
  • 2. LASSO to select covariates that are correlated with the treatment.
  • 3. OLS of Y on D and the covariates selected in 1 and 2 to estimate the ATE.

SLIDE 40

Use Observational Data to Construct the Counterfactuals

  • How to construct the counterfactual when we cannot simply use a differences estimator (DE)?
  • NOTE: we cannot simply use the DE because of unobservables affecting the treatment and the control groups differently over time.
  • 1. Selection on observables:
  • Unconfoundedness assumption (UA): we assume to observe all X variables that affect both participation decision and outcome.
– Differences-in-Differences (DID): if the UA is satisfied, we can control for all relevant Xs and make sure the common trend assumption is satisfied.
  • Common trend assumption: absent treatment, the change in treated outcome would have been the same as the change in non-treated outcome.
– Matching: assumption 1 is the UA.
– Regression discontinuity (RD)

SLIDE 43

Regression Discontinuity

  • Basic Idea:
– Assignment to the treatment is determined, in whole or in part, based on whether an observable variable (the forcing variable x) is below or above a given threshold value.
– The jump or discontinuity at the threshold gives the name to this technique.
  • The design often exploits precise knowledge of the rules determining treatment. Often precise administrative decisions determine the treatment’s assignment.
– E.g.: test scores above a specific threshold give access to a fellowship or admission to a university.
SLIDE 44

Regression Discontinuity: Sharp and Fuzzy

Two types of RD design:

  • 1. Sharp: receipt of treatment is entirely determined by whether a given variable x exceeds a threshold value. Treatment status is a deterministic and discontinuous function of one covariate.
  • 2. Fuzzy: crossing the threshold influences receipt of treatment, but it is not the sole determinant. The probability of treatment is discontinuous at a certain point of one covariate. We use the discontinuity as an IV.
SLIDE 45

Sharp Regression Discontinuity

  • The Sharp Regression Discontinuity Design:
– Treatment status is a deterministic and discontinuous function of the forcing variable x; given the known threshold value x*, once we know the value of x, we know the treatment status:
D = 1 if xi ≥ x*, D = 0 if xi < x*.
– All units with a covariate value of at least x* are in the treatment group (and participation is mandatory for these individuals), and all units with a covariate level less than x* are in the control group (members of this group are not eligible).
– Therefore, there is no problem of partial compliance, since all treated individuals have to get the program and all untreated individuals are prevented from taking part in it.
– There is a jump or discontinuity in the value of Y at the threshold.

SLIDE 46

Sharp Regression Discontinuity

Suppose that potential outcomes can be described by a linear model, and that there is a rule such that x above or below a certain threshold determines the value of Y. Simple regression model formalizing the RD idea:

E(Y0i | xi) = α + β xi   (no treatment)
E(Y1i | xi) = E(Y0i | xi) + ρ   (with treatment)

where ρ is the causal effect of interest. Therefore:

yi = α + β xi + ρ Di + ηi   (1)

where D is a dummy which is equal to 1 if xi ≥ x*.
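Equation (1) can be estimated by a standard OLS regression of y on a constant, x and the treatment dummy D. A minimal sketch on noiseless synthetic data (coefficients and cutoff are hypothetical), solving the 3×3 normal equations directly so no external libraries are needed:

```python
# Estimating equation (1), y = a + b*x + rho*D + eta, by OLS on noiseless
# synthetic data, so the coefficients are recovered (nearly) exactly.
# Solves the normal equations X'X beta = X'y by Gaussian elimination.

def ols(X, y):
    k = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    b = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    for i in range(k):  # forward elimination with partial pivoting
        p = max(range(i, k), key=lambda r: abs(A[r][i]))
        A[i], A[p] = A[p], A[i]
        b[i], b[p] = b[p], b[i]
        for r in range(i + 1, k):
            f = A[r][i] / A[i][i]
            for c in range(i, k):
                A[r][c] -= f * A[i][c]
            b[r] -= f * b[i]
    beta = [0.0] * k
    for i in reversed(range(k)):  # back substitution
        beta[i] = (b[i] - sum(A[i][j] * beta[j] for j in range(i + 1, k))) / A[i][i]
    return beta

cutoff = 0.0
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
X = [[1.0, x, 1.0 if x >= cutoff else 0.0] for x in xs]  # [const, x, D]
y = [1.0 + 0.5 * x + 2.0 * (1.0 if x >= cutoff else 0.0) for x in xs]
alpha, beta, rho = ols(X, y)
print(round(rho, 10))  # 2.0, the jump at the threshold
```

With real (noisy) data, ρ would of course come with a standard error, and the linear specification of f(x) would need to be defended.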

SLIDE 47

Sharp Regression Discontinuity

If potential outcomes are a nonlinear function of x, that is, if

E(Y0i | xi) = f(xi),

then we can estimate a nonlinear regression function:

yi = f(xi) + ρ Di + ηi   (2)

where D is a dummy which is equal to 1 if xi ≥ x*. A simple case of non-linearity can be modelled with f(x) as a pth-order polynomial of x.

  • See Figure 6.1.1, page 254, of Angrist and Pischke’s textbook.

SLIDE 49

Sharp Regression Discontinuity

  • If the effect of x on Y is nonlinear, the validity of the RD estimation depends on whether the chosen nonlinear model describes E(Y0|x) well.
  • Otherwise, we can wrongly interpret a nonlinearity in the counterfactual conditional mean function as a jump due to treatment.
  • See Figure 6.1.1, page 254, of Angrist and Pischke’s textbook.
  • To reduce these mistakes, we look at data in a neighborhood of the discontinuity.
  • We compare average outcomes in a small neighborhood to the left and right of the threshold, so that the estimated treatment effect does not depend on the correct specification of a model for E(Y0|x).

SLIDE 50

Sharp Regression Discontinuity

  • How can we consistently estimate averages of Y in a small neighborhood of the threshold?
  • Main problem: small neighborhoods mean that there aren’t many data points.
  • Solution: nonparametric estimation techniques:
– local linear regressions, partial linear and local polynomial regression estimators, which are essentially weighted least squares with more weight given to points that are close to the cutoff.
  • Robustness checks:
1. Compare the stability of RD estimates computed using different discontinuity samples: the smaller the neighborhood (and thus the smaller the sample), the less precise the estimates, but also the smaller the number of polynomial terms needed to model f(x).
2. No jump in the value of pretreatment variables at the threshold.
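The local linear approach mentioned above can be sketched as follows: fit a separate line on each side of the cutoff within a bandwidth h and take the predicted jump at the cutoff. The data, cutoff and bandwidth are hypothetical, and the sample is noiseless so the jump is recovered exactly; a real application would also need bandwidth selection, kernel weights and standard errors:

```python
# Local linear sharp-RD sketch: fit one line on each side of the cutoff
# within a bandwidth h, predict both at the cutoff, and report the jump.
# Noiseless synthetic data, so the true jump is recovered exactly.

def fit_line(x, y):
    """OLS intercept and slope for y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((a - mx) * (c - my) for a, c in zip(x, y)) / \
        sum((a - mx) ** 2 for a in x)
    return my - b * mx, b

def rd_local_linear(xs, ys, cutoff, h):
    left = [(a, c) for a, c in zip(xs, ys) if cutoff - h <= a < cutoff]
    right = [(a, c) for a, c in zip(xs, ys) if cutoff <= a <= cutoff + h]
    al, bl = fit_line(*zip(*left))
    ar, br = fit_line(*zip(*right))
    return (ar + br * cutoff) - (al + bl * cutoff)  # jump at the cutoff

cutoff, rho = 0.0, 2.0
xs = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5]
ys = [1.0 * a + (rho if a >= cutoff else 0.0) for a in xs]  # y = x + rho*D
print(rd_local_linear(xs, ys, cutoff, h=2.0))  # 2.0
```

Shrinking h re-runs the estimate on different discontinuity samples, which is the first robustness check on the slide.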

SLIDE 51

Sharp Regression Discontinuity

  • Key identifying assumption: E(Y0i | xi) and E(Y1i | xi) are continuous in x at x*.
  • This means that all unobserved determinants of Y are continuously related to x, and thus do not substantially affect the levels of Y for the treatment and control groups at each side of the discontinuity differentially.
  • This allows us to use average outcomes of units just below the cutoff as a valid counterfactual for units right above the cutoff.
  • This assumption cannot be directly tested. The robustness checks in the previous slides give suggestive evidence that the assumption is satisfied.

SLIDE 52

Example of Sharp Regression Discontinuity: Lee (2008)

  • Example of sharp RD: Lee (2008) studies the effect of party incumbency on re-election probabilities in the US.
– Research question: does a Democratic candidate for a seat in the US House of Representatives have an advantage if his/her party won the seat last time?
  • Y = probability that a Democratic candidate wins
  • x = relative vote share in the previous election
  • D = 1 if x ≥ x*, where x* is the threshold for the vote share margin of victory (the difference between the Democratic and Republican vote shares)

SLIDE 54

Example of Sharp Regression Discontinuity: Lee (2008)

  • Incumbency appears to increase party re-election probability by about 40 percentage points.
– A very substantial effect of incumbency on party re-election probability.
  • Key identifying assumption: all unobserved determinants of Y are continuously related to x, and thus do not differentially affect the levels of Y for the treatment and control groups at each side of the discontinuity.

SLIDE 55

Example of Sharp Regression Discontinuity: Lee (2008)

  • Validity check: look at Democratic victories before the last election. Democratic win rates in older elections are unrelated to the margin of victory cutoff in the last election.
  • Additional validity check: look at the density of x around the discontinuity to see if there is bunching in the distribution of x at the threshold.
– Individuals with a stake in winning may try to manipulate x near the threshold, so that observations on either side of the cutoff are not comparable (something that happened in Florida in the 2000 US presidential election).

SLIDE 56

Fuzzy Regression Discontinuity

The Fuzzy Regression Discontinuity Design:

  • The probability of receiving the treatment does not change from zero to one at the threshold.
  • Rather, there are discontinuities in the probability of receiving the treatment conditional on a covariate x.
  • Treatment rules change at a threshold, but these rules are not powerful enough to determine a 0-1 move from nonparticipation to participation.
  • The discontinuity is an instrumental variable for treatment status.

SLIDE 57

Fuzzy Regression Discontinuity

  • Let D denote treatment status as before; rather than D=1 when xi ≥ x*, there is a jump in the probability of treatment at x*, so that:

P(Di = 1 | xi) = g0(xi)   if xi < x*
P(Di = 1 | xi) = g1(xi)   if xi ≥ x*,   with g1(x*) ≠ g0(x*).

  • The functions g0 and g1 can be anything as long as they differ (and the more the better) at x*.

SLIDE 58

  • Define the dummy variable T=1 if xi ≥ x*, T=0 otherwise.
  • T is the perfect instrument to estimate the causal effect of D on Y:
  • 1. Strong correlation between T and D.
  • 2. Exogeneity of T: T affects Y only through D.
  • Use being above the threshold as an instrument for treatment status, and estimate a two-stage least squares regression where the treatment rule is the instrument.
  • As for SRD, it is crucial to correctly model the functional form f(x) in the non-linear case.

Fuzzy Regression Discontinuity
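With a single binary instrument T = 1{x ≥ x*}, the 2SLS estimate reduces to the Wald ratio: the jump in the mean of Y at the cutoff divided by the jump in the treatment probability. A minimal sketch on hand-built hypothetical data near the cutoff:

```python
# Fuzzy-RD sketch via the Wald ratio, which 2SLS reduces to when the only
# instrument is the binary threshold dummy T = 1{x >= x*}. Toy data near
# the cutoff are built by hand, so the ratio is exact.

def mean(v):
    return sum(v) / len(v)

def wald(ys_above, ys_below, ds_above, ds_below):
    return ((mean(ys_above) - mean(ys_below))
            / (mean(ds_above) - mean(ds_below)))

# Near the cutoff: crossing it raises the treatment rate from 1/4 to 3/4,
# and the treatment itself raises Y by 5 (baseline 10 for everyone).
ds_below = [1, 0, 0, 0]   # P(D=1 | just below cutoff) = 1/4
ds_above = [1, 1, 1, 0]   # P(D=1 | just above cutoff) = 3/4
ys_below = [10.0 + 5.0 * d for d in ds_below]
ys_above = [10.0 + 5.0 * d for d in ds_above]

print(wald(ys_above, ys_below, ds_above, ds_below))  # 5.0
```

Dividing by the jump in treatment probability (0.5 here) rescales the reduced-form jump in Y to the effect on those whose treatment status is moved by crossing the threshold.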

SLIDE 59

Example of Fuzzy Regression Discontinuity: van der Klaauw (2002)

  • Example of fuzzy RD (one of the very first papers using this technique): van der Klaauw (2002) evaluates the effects of university financial aid on college enrolment.
  • D is the size of the financial aid award offer; T is a dummy variable indicating applicants with an ability index above predetermined award threshold cutoffs.
  • Note: it is possible to allow for treatment effects that change as a function of some covariates by estimating the 2SLS model with treatment-covariate interactions.

SLIDE 60

Fuzzy Regression Discontinuity

  • Since fuzzy RD is an IV estimate, it estimates a LATE (local average treatment effect).
– The causal effect on compliers, defined as individuals whose treatment status changes as the value of x moves from just to the left of the threshold to just to the right of it.
  • It is possible to exploit multiple discontinuities.
– Example: Angrist and Lavy (1999) use fuzzy RD to estimate the effects of class size on children’s test scores (the same question addressed by the STAR experiment). They find substantial positive effects of smaller class size on test scores, of a magnitude comparable to the STAR estimates.

SLIDE 61

Regression Discontinuity and Matching

  • In RD there is no value of x at which we observe both treatment and control observations.
  • Matching is based on treatment (T)-control (C) comparisons conditional on the values of a set of covariates where the T and C groups overlap.
  • RD assumes that T and C are perfectly comparable at least in a neighbourhood of the discontinuity.
  • In RD we have to carefully specify the functional form of the regression relationship between x and Y around the discontinuity.

SLIDE 62

RD and External Validity

  • Both SRD and FRD provide estimates of the average effect for the subpopulation with a value of x at the cutoff.
  • The RD designs never allow the researcher to estimate the overall (population) average effect of the treatment.
– Unless one is willing to make strong assumptions justifying the extrapolation to other subpopulations, such as homogeneity of the treatment effect.
  • But the specific RD average effect may be of special interest:
– For example, in cases where the research question concerns changing the location of the threshold.

SLIDE 63

RD and External Validity

  • Disadvantage of RD: limited degree of external validity (e.g. compared to matching).
  • Advantage of RD: relatively high degree of internal validity (in settings where this design is applicable).

SLIDE 64

Choosing Among Alternative Non-Experimental Methods

  • When running an experiment is not an available option, the implicit question often is:
– “Which estimator is the ‘magic bullet’ that is able to solve the selection problem?”
  • This is the wrong question! There is no econometric estimator that always solves the selection problem.
  • Correct question: what is the correct estimator given the research question?
– Different estimators make different assumptions about the nature of the selection process, require different types of data and estimate different parameters of interest.

SLIDE 65

Choosing Among Alternative Non-Experimental Methods

  • There is no magic bullet. Know what estimators are available and what assumptions they make.
  • Sometimes there is no appropriate estimator.

– STOP!!!

  • Results are more convincing if several different identification strategies (i.e. different estimators) deliver similar results (since they rely on different assumptions).
  • One helpful strategy for assessing the performance and validity of a non-experimental estimator is to run simulations.

SLIDE 66

Simulations: Motivation

  • In the presence of self-selection, you would expect different estimators to give different estimates.
  • Idea: get a sense of the magnitude of the selection bias present in different contexts, using realistic values for the parameters of the data-generating process behind the simulated data.

SLIDE 67

Simulations: Very Brief Introduction

  • See Section 8.3 of Heckman, LaLonde and Smith (1999).
  • Generate simulated data using standard participation and outcome equations in a program-evaluation context.

– Vary the parameters of the data-generating process and observe the corresponding changes in the bias associated with different non-experimental (NE) estimators of treatment effects.
– Vary the sample size and observe the corresponding changes in the performance of NE estimators.
– Compare the performance of NE estimators in contexts with and without variation in the impact of treatment (homogeneous/heterogeneous treatment effects).
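The simulation logic above can be sketched in a few lines. This is a minimal, hypothetical data-generating process (the participation and outcome equations, coefficients, and variable names are illustrative, not taken from Heckman, LaLonde and Smith): an unobservable u drives both participation and the outcome, so the naive difference-in-means estimator is biased for the true treatment effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical DGP: the unobservable u raises both participation and the outcome.
u = rng.normal(size=n)
x = rng.normal(size=n)
d = (0.5 * x + u + rng.normal(size=n) > 0).astype(float)   # participation equation
true_effect = 1.0
y = 2.0 + true_effect * d + 0.8 * x + u + rng.normal(size=n)  # outcome equation

# Naive non-experimental estimator: simple difference in group means.
naive = y[d == 1].mean() - y[d == 0].mean()
bias = naive - true_effect
print(f"naive estimate = {naive:.2f}, selection bias = {bias:.2f}")
```

Re-running this with different coefficients in the participation equation, or with a heterogeneous treatment effect, is exactly the kind of exercise the slide describes.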

SLIDE 68

Appendix 3: Alternative Ways of Constructing Matched Outcomes Using the Propensity Score

SLIDES 69–70

Cross-sectional Method 1: Nearest Neighbor Matching

  • Traditional, pairwise matching, also called nearest-neighbor matching, sets

A_i = { j ∈ C : |P_i − P_j| = min_{k ∈ C} |P_i − P_k| }

  • That is, the non-participant with the value of P_j closest to P_i is selected as the match, and A_i is a singleton set.
  • The estimator can be implemented matching either with or without replacement:

– With replacement: the same comparison-group observation can be used repeatedly as a match.
– Drawback of matching without replacement: the final estimate will usually depend on the initial ordering of the treated observations for which the matches are selected.
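A minimal sketch of nearest-neighbor matching with replacement (the function name and interface are illustrative): for each treated unit, the control with the closest propensity score supplies the matched outcome, and the ATT is the average treated-minus-matched gap.

```python
import numpy as np

def nn_match_att(y, d, pscore):
    """Nearest-neighbor propensity-score matching (with replacement).

    For each treated unit i, pick the control j minimizing |P_i - P_j|,
    then average the outcome gaps y_i - y_j over treated units.
    """
    treated = np.flatnonzero(d == 1)
    controls = np.flatnonzero(d == 0)
    # distance matrix: rows = treated units, columns = controls
    dist = np.abs(pscore[treated, None] - pscore[controls])
    j = controls[np.argmin(dist, axis=1)]   # nearest control for each treated unit
    return np.mean(y[treated] - y[j])
```

With replacement, the same control may appear in several pairs, so the ordering of treated units does not matter, unlike matching without replacement.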

SLIDES 71–73

Cross-sectional Method 1: Nearest Neighbor Matching

  • Variation of nearest-neighbor matching: caliper matching (Cochran and Rubin 1973).
  • Attempts to avoid “bad” matches (those for which P_j is far from P_i) by imposing a tolerance on the maximum distance allowed: a match for person i is selected only if |P_i − P_j| < ε, where ε is a pre-specified tolerance.
  • Treated individuals for whom no matches can be found within the caliper are excluded from the analysis (one way of imposing the common-support condition).
  • Drawback of caliper matching: it is difficult to know a priori what a reasonable choice for the tolerance level is.
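Caliper matching is a one-line change to the nearest-neighbor sketch (again, the function name and the default tolerance are illustrative): a treated unit keeps its nearest-neighbor match only if the propensity-score distance is below ε, otherwise it is dropped.

```python
import numpy as np

def caliper_match_att(y, d, pscore, eps=0.05):
    """Caliper matching: accept the nearest-neighbor match only if the
    propensity-score distance is below the tolerance eps.

    Treated units with no control inside the caliper are excluded,
    which is one way of imposing the common-support condition.
    """
    controls = np.flatnonzero(d == 0)
    gaps = []
    for i in np.flatnonzero(d == 1):
        dist = np.abs(pscore[i] - pscore[controls])
        if dist.min() < eps:                       # keep only "good" matches
            gaps.append(y[i] - y[controls[np.argmin(dist)]])
    return np.mean(gaps) if gaps else np.nan      # no treated unit matched
```

Shrinking eps trades bias (worse matches) against sample size (more excluded treated units), which is exactly why the right tolerance is hard to pick a priori.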

SLIDES 74–75

Cross-sectional Method 2: Stratification or Interval Matching

  • Method:

1. In this variant of matching, the common support is partitioned into a set of intervals.
2. Average treatment impacts are calculated by simple averaging within each interval.
3. Overall average impact estimate: a weighted average of the interval impact estimates, using the fraction of the D = 1 population in each interval as the weight.

  • Requires a decision on how wide the intervals should be:

– Dehejia and Wahba (1999) select intervals such that the mean values of the estimated P_i’s and P_j’s are not statistically different from each other within intervals.
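Steps 1–3 above can be sketched as follows. This simplified version (function name and equal-width binning are assumptions; Dehejia and Wahba choose bins by balancing tests instead) averages the treated-control gap within each propensity-score bin and weights by the treated share in each bin.

```python
import numpy as np

def stratification_att(y, d, pscore, n_bins=5):
    """Stratification (interval) matching on the propensity score:
    average the treated-control outcome gap within each equal-width bin,
    then weight each bin's gap by the fraction of treated units in it."""
    edges = np.linspace(pscore.min(), pscore.max(), n_bins + 1)
    bins = np.clip(np.digitize(pscore, edges) - 1, 0, n_bins - 1)
    att, n_treated = 0.0, (d == 1).sum()
    for k in range(n_bins):
        t = (bins == k) & (d == 1)
        c = (bins == k) & (d == 0)
        if t.any() and c.any():   # only bins containing both groups contribute
            att += (t.sum() / n_treated) * (y[t].mean() - y[c].mean())
    return att
```

Bins with treated units but no controls silently contribute nothing here; a more careful implementation would flag them as a common-support failure.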

SLIDE 76

Cross-sectional Method 3: Kernel and Local Linear Matching

  • Kernel method:

– Uses a weighted average of all comparison observations within the common-support region: the farther away the comparison unit is from the treated unit, the lower its weight.

  • Local linear matching:

– Similar to the kernel estimator but includes a linear term in the weighting function, which helps to avoid bias.

SLIDES 77–78

Kernel and Local Linear Matching

A kernel estimator for the matched (counterfactual) outcome of treated unit i is given by

Ŷ_{0i} = Σ_{j ∈ C} W_{ij} Y_j,  with weights  W_{ij} = K((P_j − P_i)/h) / Σ_{k ∈ C} K((P_k − P_i)/h),

where K is a kernel function and h is a bandwidth (or smoothing parameter).
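The kernel estimator above translates directly into code. A minimal sketch (function name, Gaussian kernel, and default bandwidth are assumptions; the slides leave K unspecified): each treated unit's counterfactual is a weighted average of all control outcomes, with weights W_ij that shrink as |P_j − P_i| grows.

```python
import numpy as np

def kernel_match_att(y, d, pscore, h=0.1):
    """Kernel matching on the propensity score.

    For each treated unit i, build the counterfactual outcome
    Yhat_0i = sum_j W_ij * Y_j over all controls j, where
    W_ij = K((P_j - P_i)/h) / sum_k K((P_k - P_i)/h), K Gaussian.
    """
    treated = np.flatnonzero(d == 1)
    controls = np.flatnonzero(d == 0)
    # K((P_j - P_i)/h) for every treated i (rows) and control j (columns)
    k = np.exp(-0.5 * ((pscore[controls] - pscore[treated, None]) / h) ** 2)
    w = k / k.sum(axis=1, keepdims=True)   # weights sum to one per treated unit
    y0_hat = w @ y[controls]               # matched counterfactual outcomes
    return np.mean(y[treated] - y0_hat)
```

The bandwidth h plays the role of the caliper's tolerance: a small h concentrates weight on the nearest controls (approaching nearest-neighbor matching), while a large h averages over the whole comparison group.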