SLIDE 1

Synthetic Difference in Differences

Dmitry Arkhangelsky Susan Athey David Hirshberg Guido Imbens Stefan Wager

  • JSM, August 3rd, 2020.

SLIDE 2

When Berkeley implemented the first soda tax, we compared to San Francisco.

While Berkeley, the first U.S. city to pass a “soda tax,” saw a substantial decline of 0.13 times/day in soda consumption in the months following the tax’s implementation in March 2015, neighboring San Francisco, where a soda-tax measure was defeated, and Oakland saw a 0.03 times/day increase, according to a study published in the American Journal of Public Health.

SLIDE 3

This is how we did it.

[Figure: soda consumption over time for Berkeley, San Francisco, and a “hallucinated” parallel-trends Berkeley counterfactual.]
SLIDE 4

This is a “Difference-in-Differences” estimate

  • We compare Berkeley’s change in consumption to San Francisco’s.

  τ = Y(1)_BK,post − Y(0)_BK,post.

  τ̂ = [Y(1)_BK,post − Y(0)_BK,pre] − [Y(0)_SF,post − Y(0)_SF,pre]

  • Subtracting SF’s change adjusts for a trend in absence of intervention.
  • It works if the cities follow parallel trends in absence of intervention.

  Y(0)_city,time ≈ α_city + β · 1{time = post}.

  • This assumption is strong, but we need it (or more data).
  • We can’t distinguish a treatment effect from a difference in trends.
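As a sketch, here is the 2 × 2 arithmetic with hypothetical pre/post consumption levels; only the two changes (−0.13 for Berkeley, +0.03 for San Francisco) come from the study above.

```python
# Hypothetical consumption levels (times/day); only the changes
# (-0.13 for Berkeley, +0.03 for San Francisco) come from the study.
bk_pre, bk_post = 1.00, 0.87   # Berkeley, before and after the tax
sf_pre, sf_post = 1.00, 1.03   # San Francisco, same periods

# Diff-in-diff: Berkeley's change minus San Francisco's change.
tau_hat = (bk_post - bk_pre) - (sf_post - sf_pre)
print(round(tau_hat, 2))  # -0.16
```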

SLIDE 5

Difference in Differences

Things get interesting when we observe many units over many time periods. We focus on simultaneous adoption:

                                  1, …, T0        T0 + 1, …, T = T0 + T1
  units 1, …, N0                  no treatment    no treatment
  units N0 + 1, …, N = N0 + N1    no treatment    treatment

  • We could still use a parallel trends model: Y(w)_it ∼ α_i + β_t + wτ
  • Least squares in this model is equivalent to a 2 × 2 diff-in-diff applied to the averages of our 4 ‘blocks’
  • But we can see that trends in absence of treatment aren’t parallel
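That equivalence is easy to check numerically; a sketch on a simulated balanced panel (all names and parameters here are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, N0, T0, tau = 6, 8, 4, 5, 2.0      # units, periods, controls, pre-periods
alpha = rng.normal(size=N)                # unit fixed effects
beta = rng.normal(size=T)                 # time fixed effects
W = np.zeros((N, T)); W[N0:, T0:] = 1.0   # treated block adopts after T0
Y = alpha[:, None] + beta[None, :] + tau * W + rng.normal(scale=0.1, size=(N, T))

# Least squares in the two-way fixed effects model Y_it = a_i + b_t + tau * W_it.
i_idx, t_idx = np.meshgrid(np.arange(N), np.arange(T), indexing="ij")
X = np.column_stack([np.eye(N)[i_idx.ravel()],   # unit dummies
                     np.eye(T)[t_idx.ravel()],   # time dummies
                     W.ravel()])
tau_twfe = np.linalg.lstsq(X, Y.ravel(), rcond=None)[0][-1]

# 2 x 2 diff-in-diff applied to the four block averages.
tau_did = ((Y[N0:, T0:].mean() - Y[N0:, :T0].mean())
           - (Y[:N0, T0:].mean() - Y[:N0, :T0].mean()))
print(np.isclose(tau_twfe, tau_did))  # True
```

The dummies are collinear, but the coefficient on W is still uniquely determined, so the minimum-norm `lstsq` solution recovers it.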

SLIDE 6

California’s anti-smoking legislation (Proposition 99)

[Figure: per-capita cigarette consumption, 1970–2000, for California vs. the average of control states.]

A 25 cents/pack excise tax increase took effect in 1989.

California ≈ (1/49) Alaska + (1/49) Alabama + …

SLIDE 7

California’s anti-smoking legislation: Difference-in-Differences

[Figure: the same series, with a diff-in-diff counterfactual line for California.]

If we average and hallucinate a line, it obviously doesn’t fit.

SLIDE 8

Synthetic Controls

  • If California’s pre-treatment trend doesn’t match the average state’s,

compare it to something else.

  • For example, a weighted average of states with a trend that does match.
  • This weighted average of units is called a synthetic control

[Abadie, Diamond, and Hainmueller, 2010]

  • Construction: weight the control units to match pre-treatment outcomes,

      Σ_{n≤N0} ω̂_n Y_nt ≈ Ȳ_treated,t   for all t ≤ T0,

    i.e., the weighted control-unit average at time t matches the treated-unit average at time t.

  • Treatment effects are typically estimated by cross-sectional comparison: the mean post-treatment difference between treated and synthetic control,

      τ̂ = (1/T1) Σ_{t>T0} [ Ȳ_treated,t − Σ_{n≤N0} ω̂_n Y_nt ].
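The construction can be sketched with simplex-constrained least squares on hypothetical data (the solver choice and all numbers here are illustrative assumptions, not the method of Abadie et al.):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
N0, T0, T1 = 10, 12, 4
Y_control = rng.normal(size=(N0, T0 + T1))                     # toy control outcomes
Y_treated = 0.5 * Y_control[0] + 0.5 * Y_control[1]            # treated = mixture of controls
Y_treated = Y_treated + np.r_[np.zeros(T0), np.full(T1, 3.0)]  # true effect: +3 post-treatment

# Weight the controls to match the treated unit's pre-treatment outcomes.
def pre_mse(w):
    return np.sum((w @ Y_control[:, :T0] - Y_treated[:T0]) ** 2)

res = minimize(pre_mse, np.full(N0, 1.0 / N0), method="SLSQP",
               bounds=[(0.0, 1.0)] * N0,
               constraints={"type": "eq", "fun": lambda w: w.sum() - 1.0})
w_hat = res.x

# Cross-sectional comparison over the post-treatment periods.
tau_hat = np.mean(Y_treated[T0:] - w_hat @ Y_control[:, T0:])
```

Because the treated unit is an exact mixture of two controls here, the recovered effect lands close to the true +3.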

SLIDE 9

California’s anti-smoking legislation: Synthetic Control

[Figure: per-capita cigarette consumption, 1970–2000, for California vs. its synthetic control.]

When comparing to a synthetic control, trends line up better. California ≈ .3 Utah + .2 Nevada + .15 Montana + . . .

SLIDE 10

Improving on Synthetic Control

Instead of constructing a unit for a cross-sectional comparison, construct a unit and time period for a diff-in-diff comparison.

This is a doubly robust version of synthetic control. If the before/after comparison is good, the unit comparison doesn’t have to be. And it’s easier to make them good: constant shifts get differenced out, so constructed parallel trends are as good as overlaid ones.
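A toy illustration of the point about constant shifts, with made-up numbers:

```python
import numpy as np

# Toy series: the control differs from the treated counterfactual by a constant shift.
pre, post = np.array([10.0, 11.0, 12.0]), np.array([13.0, 14.0])
shift = 5.0
treated = np.r_[pre, post + 2.0]    # true effect: +2 after period 3
control = np.r_[pre, post] + shift  # parallel to the counterfactual, but shifted

# Cross-sectional (SC-style) comparison is off by the shift...
sc_gap = treated[3:].mean() - control[3:].mean()
# ...but differencing out the pre-period gap removes it.
did_gap = sc_gap - (treated[:3].mean() - control[:3].mean())
print(sc_gap, did_gap)  # -3.0 2.0
```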

SLIDE 11

California’s anti-smoking legislation: Constructed Parallel Trends

[Figure: cigarette consumption, 1970–2000; series: treated, sdid, sc.]
SLIDE 16

California’s anti-smoking legislation: Double Robustness

[Figure: per-state comparison of unit weights for the sc and sdid estimators.]

SLIDE 17

Implementation

  1. Estimate synthetic control weights ω̂ by simplex-constrained least squares on the pre-treatment data:

       ω̂ = argmin_{ω0, ω} Σ_{t≤T0} (ω0 + ω^T Y_control,t − Ȳ_treated,t)² + ζ² T0 ‖ω‖²
            subject to ω1, …, ω_{N0} ≥ 0 and Σ_{n=1}^{N0} ω_n = 1.

     Use an intercept: we want parallel lines, not overlaid ones. Use a ridge penalty: multicollinearity is typical, and shrinkage helps control variance and own-observation bias.

  2. Estimate time series regression weights λ̂ on the control units.
  3. Estimate τ by (2 × 2) diff-in-diff on weighted block averages.
  4. Form confidence intervals using the jackknife estimate of standard error.
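Steps 1–3 can be sketched in code. This is an illustrative implementation under simplifying assumptions (a generic SLSQP solver, no ridge on the time weights, a toy data-generating process), not the authors’ package:

```python
import numpy as np
from scipy.optimize import minimize

def simplex_lsq(A, b, ridge=0.0):
    """argmin over (c0, c) of ||c0 + A @ c - b||^2 + ridge * ||c||^2,
    with c >= 0 summing to 1 (simplex) and c0 a free intercept."""
    m = A.shape[1]
    def obj(x):
        c0, c = x[0], x[1:]
        return np.sum((c0 + A @ c - b) ** 2) + ridge * np.sum(c ** 2)
    res = minimize(obj, np.r_[0.0, np.full(m, 1.0 / m)], method="SLSQP",
                   bounds=[(None, None)] + [(0.0, 1.0)] * m,
                   constraints={"type": "eq", "fun": lambda x: x[1:].sum() - 1.0})
    return res.x[0], res.x[1:]

def sdid(Y, N0, T0, zeta2=1e-6):
    """Steps 1-3 on an N x T outcome matrix Y whose last N - N0 units
    are treated in the last T - T0 periods."""
    Yc, Yt = Y[:N0], Y[N0:]
    # 1. unit weights: match the treated pre-treatment trajectory up to a shift
    _, omega = simplex_lsq(Yc[:, :T0].T, Yt[:, :T0].mean(axis=0), ridge=zeta2 * T0)
    # 2. time weights: predict control post-treatment means from pre periods
    _, lam = simplex_lsq(Yc[:, :T0], Yc[:, T0:].mean(axis=1))
    # 3. (2 x 2) diff-in-diff on the weighted block averages
    treated = Yt[:, T0:].mean() - (Yt[:, :T0] @ lam).mean()
    control = omega @ Yc[:, T0:].mean(axis=1) - omega @ (Yc[:, :T0] @ lam)
    return treated - control

# Toy panel: additive unit/time effects plus a treatment effect of 1.5.
rng = np.random.default_rng(0)
N, T, N0, T0, tau = 10, 9, 8, 6, 1.5
alpha, beta = rng.normal(size=N), rng.normal(size=T)
Y = alpha[:, None] + beta[None, :] + rng.normal(scale=0.05, size=(N, T))
Y[N0:, T0:] += tau
tau_hat = sdid(Y, N0, T0)
print(round(tau_hat, 2))
```

On a panel with additive unit and time effects, the weighted (2 × 2) diff-in-diff recovers the effect for any simplex weights, which is the robustness the earlier slides describe.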

SLIDE 18

Synthetic Difference-in-Differences

The estimator is a (2 × 2) diff-in-diff on four weighted block averages:

                      synthetic pre-treatment                  post-treatment
  synthetic control   Σ_{n≤N0} Σ_{t≤T0} ω̂_n λ̂_t Y_nt           Σ_{n≤N0} Σ_{t>T0} ω̂_n T1⁻¹ Y_nt
  treated average     Σ_{n>N0} Σ_{t≤T0} N1⁻¹ λ̂_t Y_nt           Σ_{n>N0} Σ_{t>T0} N1⁻¹ T1⁻¹ Y_nt

DID uses equal weights ω_n = 1/N0, λ_t = 1/T0. SC takes only one difference (it uses zero time weights, λ_t = 0).

SLIDE 19

Theory

SLIDE 20

A General Setting

Y_nt = L_nt + W_nt τ_nt + ε_nt,   E[ε | W] = 0

  • L: Matrix of noiseless control potential outcomes
  • τ: Matrix of treatment effects
  • ε: Noise matrix with iid subgaussian rows
  • We have autocorrelation over time
  • But no correlation between units.
  • W indicates the treated block

We estimate the ATT:  τ̄ = (1/(N1 T1)) Σ_{nt} W_nt τ_nt.

  • Typical sample sizes are small, but the setting is ‘high dimensional’.
  • We see various dimension ratios T/N, T1/T0, N1/N0.
  • We lose the essence in asymptotics with too many fixed dimensions.
  • The signal L tends to be multicollinear: no restricted eigenvalue condition!
  • For simplicity, we’ll assume rank(L) ≪ min(N0, T0).

SLIDE 21

What can go wrong?

  • Underfitting: we don’t create parallel trends in pre-treatment outcomes.
  • Overfitting: we do, but by predicting signal from noise.
  • Failed identification: we adjust as intended, but we’re still confounded.

SLIDE 22

Underfitting

It happens, but it tends to be something we can see: e.g., California cigarette consumption with southeastern states as controls.

[Figure: cigarette consumption, 1970–2000, for California and a synthetic California built from southeastern states.]

California ≈ .82 Louisiana + .10 Mississippi + . . .

SLIDE 23

Overfitting

We prove concentration around an oracle estimator to rule out overfitting.

  1. Consider the limits of our unit and time weights: the minimizers of expected (as opposed to empirical) mean squared error,

       ω̃ = argmin_{(ω0, ω) ∈ R×S} Σ_{t≤T0} (ω0 + ω^T L_control,t − L̄_treated,t)² + [trace(Σ) + ζ² T0] ‖ω‖²,

       λ̃ = argmin_{(λ0, λ) ∈ R×S} Σ_{n≤N0} (λ0 + L_n,pre λ − L̄_n,post)² + N0 ‖Σ^{1/2} (λ − ψ)‖².

     We’re in an errors-in-variables model, so implicit ridge penalty terms arise as the expectation of quadratics in the noise matrix ε:

       Σ = E[ε_{n,pre}^T ε_{n,pre}]   (pre-treatment autocovariance matrix),
       ψ = argmin_{v ∈ R^{T0}} E[(ε_{n,pre} v − ε̄_{n,post})²]   (post-on-pre autoregression vector).

  2. The oracle estimator τ̃ uses these in place of the empirical minimizers.
  3. Its error is easy to characterize because these weights are non-random.
  • 3. Its error is easy to characterize because these weights are non-random.

SLIDE 24

Concentration around the oracle

Deviation from the oracle is essentially bilinear in the weight differences:

  τ̂ − τ̃ ≈ (ω̂ − ω̃)^T L_control,pre (λ̂ − λ̃) ≤ ‖ω̂ − ω̃‖ · ‖L_control,pre (λ̂ − λ̃)‖.

Cauchy-Schwarz bounds depend on prediction error and coefficient error. We characterize these using a version of the ‘slow rate’ analysis for the lasso.

  • The simplex is small, so predictions converge at a rate that depends logarithmically on its dimension.
  • With a multicollinear signal L, convergence of coefficients is driven by ridge regularization and improves with its strength ζ.
  • Because we have error in variables, rates can be worse than without, depending on the fit and dispersion (2-norm) of the limiting weights. Ridge regularization helps, as long as the limiting weights still fit the data.

  ‖L_c,p (λ̂ − λ̃)‖ ≲ log(N0) [(T0/Nef)^{1/4} + MSE^{1/4}(λ̃)],   Nef = min{N1, ‖ω̃‖⁻¹},
  ‖ω̂ − ω̃‖ ≲ (√(log T0) / (ζ T^{1/2})) [(N0/Tef)^{1/4} + MSE^{1/4}(ω̃)],   Tef = min{T1, ‖λ̃‖⁻¹}.

SLIDE 25

Oracle bias

The oracle estimator’s bias is caused by changes in the predictive bias of the limiting weights from training to generalization:

  [a_{N1}^T L_tre,post − ω̃^T L_con,post − ω̃0] a_{T1}   (counterfactual post-treatment bias of ω̃)
  − [a_{N1}^T L_tre,pre − ω̃^T L_con,pre − ω̃0] λ̃   (bias of ω̃ over the synthetic pre-treatment period)

This change is small if either:

  • the regressions fit well during training and generalize; or
  • they don’t, but the errors they make are predictable.

I’ve written this in terms of the bias of the unit weights ω̃ above, but there is an analogous decomposition swapping the roles of ω̃ and λ̃. Here a_n = n⁻¹ 1 ∈ R^n.

SLIDE 26

Oracle normality

Our oracle estimator’s error is approximately normal around the oracle bias:

  τ̃ − τ̄ − bias ≈ a_{N1}^T (ε_tre,post a_{T1} − ε_tre,pre λ̃) − ω̃^T (ε_con,post a_{T1} − ε_con,pre λ̃).

The same goes for the real estimator if its deviation from the oracle is negligible, in which case we can estimate variance by resampling units. With autocorrelated noise, variance is reduced by the inclusion of time weights, as they are predictive of the post-treatment noise.
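One common way to resample units is the jackknife from the implementation slide; a generic sketch of the jackknife standard error, given the leave-one-unit-out estimates:

```python
import numpy as np

def jackknife_se(estimates):
    """Jackknife standard error from the n leave-one-out estimates."""
    est = np.asarray(estimates)
    n = est.size
    return np.sqrt((n - 1) / n * np.sum((est - est.mean()) ** 2))

# Sanity check: for the sample mean, the jackknife SE equals the usual s/sqrt(n).
x = np.array([1.0, 2.0, 4.0, 7.0])
n = x.size
loo_means = np.array([np.delete(x, i).mean() for i in range(n)])
se_jack = jackknife_se(loo_means)
se_classic = x.std(ddof=1) / np.sqrt(n)
print(np.isclose(se_jack, se_classic))  # True
```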

SLIDE 27

Minimum wage assignment

[Figure: density of the estimation error for the SDID, SC, and DID estimators.]

Error in simulation based on Bertrand, Duflo, and Mullainathan [2004].

SLIDE 28

Is our identification strategy a problem?

  • Maybe, but it’s unclear what the alternatives are.
  • No consensus on the underlying conceptual assumptions about panel data.
  • Estimators rely on some latent structure that relates units.
  • To compare fundamentally different estimation strategies, we need more ‘minimal’ assumptions.

SLIDE 29

Thank you!

arxiv.org/abs/1812.09970 github.com/davidahirshberg/synthdid

SLIDE 30

References

Alberto Abadie, Alexis Diamond, and Jens Hainmueller. Synthetic control methods for comparative case studies: Estimating the effect of California's tobacco control program. Journal of the American Statistical Association, 105(490), 2010.

Marianne Bertrand, Esther Duflo, and Sendhil Mullainathan. How much should we trust differences-in-differences estimates? The Quarterly Journal of Economics, 119(1):249–275, 2004.
