SLIDE 1

Inference with Arbitrary Clustering

Fabrizio Colella,∗ Rafael Lalive,∗ Seyhun O. Sakalli,∗ Mathias Thoenig∗

Swiss Stata Users Group Meeting, October 2018

∗University of Lausanne

SLIDE 2

Introduction

SLIDE 3

Motivation

A tremendous surge of empirical analysis with spatial data:

  • Growing availability of geocoded data
  • Integration of geographic information systems (GIS) in the toolkit of economists

Network relations among individuals are known and easily accessible.

Need for econometric methods to obtain asymptotically valid inference in settings with varying types of spatial, network, and temporal dependence between observation units.

Absence of Stata commands, especially in the 2SLS setting.

Colella, Lalive, Sakalli, and Thoenig Inference with Arbitrary Clustering

SLIDE 4

This paper

  • Proposes an approach to obtain asymptotically valid inference in the presence of arbitrary correlation (spatial or within a network) in both OLS and 2SLS settings
  • Provides a package, acreg, for the statistical software Stata
  • Performs Monte Carlo simulations (using spatial data on U.S. towns and counties) to show the properties and performance of the proposed estimator
      • Generate random variables and check how close we get to a 5% null-rejection rate at the 5% test level, following Bertrand, Duflo, and Mullainathan (2004)

SLIDE 5

Stata command: acreg

What is new in acreg compared to existing packages?

  • Performs standard error correction in both OLS and 2SLS settings following White (1980)
  • Correlation weights can be given as input or computed from spatial or network relations or multi-way clustering (Cameron et al., 2011)
  • Spatial relations can be defined either with a distance cutoff or with a contiguity/distance matrix (neighboring observations only)
  • Network relations can be defined with a matrix of links, with a distance matrix, or with any arbitrary cluster structure that the user defines
  • Allows for observation i in time t to be correlated with observation j in its cluster in time t + s
  • HAC standard errors and distance decays are optional
  • Fixes some bugs that exist in Conley (1999) and Hsiang (2010)

SLIDE 6

Arbitrary Clustering

SLIDE 7

Spatial - 1 Cluster

SLIDE 8

Spatial - 2 Overlapping clusters

SLIDE 9

Network

SLIDE 10

Network - Adjacency matrix

[Table: 11 × 11 adjacency matrix with rows and columns j1 to j11; an entry of 1 marks a link between two nodes.]
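As a rough illustration of how such a matrix can be built from a list of links, here is a minimal Python sketch. The four-node network, the `links` input format, and the choice to put ones on the diagonal are all assumptions made for the example; they are not taken from the slide or from acreg.

```python
def adjacency_matrix(n_nodes, links, self_links=True):
    """Return an n x n 0/1 matrix for an undirected network.

    links: iterable of (i, j) pairs with 0-based node indices
           (a hypothetical input format chosen for this sketch).
    self_links: if True, put 1 on the diagonal so that each node
                counts as linked to itself (an assumption).
    """
    a = [[0] * n_nodes for _ in range(n_nodes)]
    for i, j in links:
        a[i][j] = 1
        a[j][i] = 1  # undirected link: keep the matrix symmetric
    if self_links:
        for i in range(n_nodes):
            a[i][i] = 1
    return a


# Hypothetical network: node 0 is linked to 1 and 2, node 2 is linked to 3.
links = [(0, 1), (0, 2), (2, 3)]
A = adjacency_matrix(4, links)
for row in A:
    print(row)
```

Each row of the result lists the observations whose errors are allowed to be correlated with the row observation.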

SLIDE 11

Conceptual Framework

SLIDE 12

Theoretical VCV of the 2SLS estimator

Standard IV estimator:

$$b_{2SLS} = (\hat{X}'\hat{X})^{-1}(\hat{X}'y)$$

With variance:

$$VCV(b_{2SLS}) = (\hat{X}'\hat{X})^{-1}\,\hat{X}'\Omega\hat{X}\,(\hat{X}'\hat{X})^{-1}$$

Where:

  • $y$ is the dependent variable
  • $X$ is the matrix of regressors (exogenous and endogenous)
  • $Z$ is the matrix of instruments (excluded and included)
  • $\hat{X} = Z(Z'Z)^{-1}(Z'X)$ is the matrix of fitted values from the first-stage regression
  • $\Omega$ is the VCV of the errors
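A short derivation may help to see where the sandwich form comes from. It is only a sketch: it assumes the structural equation $y = X\beta + \varepsilon$ and $\Omega = E[\varepsilon\varepsilon' \mid Z]$, neither of which is written out on the slide, and it treats $\hat{X}$ as given, which is an approximation.

$$
\begin{aligned}
\hat{X}'X &= X'P_Z X = X'P_Z P_Z X = \hat{X}'\hat{X}, \qquad P_Z = Z(Z'Z)^{-1}Z' \\
b_{2SLS} - \beta &= (\hat{X}'\hat{X})^{-1}\hat{X}'(X\beta + \varepsilon) - \beta = (\hat{X}'\hat{X})^{-1}\hat{X}'\varepsilon \\
VCV(b_{2SLS}) &\approx (\hat{X}'\hat{X})^{-1}\,\hat{X}'E[\varepsilon\varepsilon' \mid Z]\,\hat{X}\,(\hat{X}'\hat{X})^{-1} = (\hat{X}'\hat{X})^{-1}\,\hat{X}'\Omega\hat{X}\,(\hat{X}'\hat{X})^{-1}
\end{aligned}
$$

The first line uses the fact that the projection matrix $P_Z$ is symmetric and idempotent.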

SLIDE 13

Estimating the VCV of the 2SLS estimator

Proposed estimator for $\hat{X}'\Omega\hat{X}$:

$$\hat{X}'\big(S \mathbin{.\times} (uu')\big)\hat{X} = \sum_{i=1}^{n}\sum_{t=1}^{T}\sum_{j=1}^{n}\sum_{s=1}^{T} \hat{x}_{it}\,u_{it}\,u_{js}\,\hat{x}_{js}\,s_{itjs}$$

Where: $u \equiv y - \hat{X}\hat{\beta}_{2SLS}$ are the estimated residuals and $.\times$ denotes the element-by-element product.

  • Each element $s_{itjs}$ of $S$ is a correlation weight in [0, 1]
  • The correlation weight can be set arbitrarily
  • The correlation weight should reflect the dependence of the error of observation it on the error of observation js
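The weighted sum above can be sketched in a few lines of Python. This is an illustration only, not the acreg implementation: it assumes a single regressor, flattens each (i, t) pair into one observation index, and uses made-up values for the fitted regressor and the residuals. The three weight matrices show how familiar special cases arise from the choice of weights.

```python
def meat(x_hat, u, s):
    """Sum over all pairs (a, b) of x_hat[a] * u[a] * u[b] * x_hat[b] * s[a][b].

    x_hat: fitted regressor values, one scalar per observation
    u:     estimated residuals, same length as x_hat
    s:     square matrix of correlation weights in [0, 1]
    """
    n = len(x_hat)
    return sum(
        x_hat[a] * u[a] * u[b] * x_hat[b] * s[a][b]
        for a in range(n)
        for b in range(n)
    )


# Hypothetical data: four observations.
x_hat = [1.0, 2.0, 3.0, 4.0]
u = [1.0, -1.0, 2.0, 0.0]
n = len(x_hat)

# Weights equal to 1 only on the diagonal: no cross-observation dependence.
s_independent = [[1 if a == b else 0 for b in range(n)] for a in range(n)]
# Weights equal to 1 within two clusters, {0, 1} and {2, 3}.
cluster = [0, 0, 1, 1]
s_cluster = [[1 if cluster[a] == cluster[b] else 0 for b in range(n)] for a in range(n)]
# Weights equal to 1 everywhere: every pair is allowed to be correlated.
s_all = [[1] * n for _ in range(n)]

print(meat(x_hat, u, s_independent))  # 41.0
print(meat(x_hat, u, s_cluster))      # 37.0
print(meat(x_hat, u, s_all))          # 25.0
```

Intermediate weights between 0 and 1, for example from a distance decay, fit into the same function without changes.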

SLIDE 14

Asymptotics of the proposed estimator (work in progress)

Equivalence with multi-way clustering

  • Any bilateral links structure can be represented by a multi-way clustering structure.
  • $\widehat{VCV}(\hat{\beta}_{2SLS})$ in a multi-way cluster environment can be represented as a sum of one-way cluster-robust matrices (Cameron et al. 2011)
  • The sandwich estimator of $\widehat{VCV}(\hat{\beta}_{2SLS})$ in a one-way cluster environment is consistent as the number of clusters G → ∞ (White 1984; Arellano 1987; Rogers 1993; Hansen 2007)
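The correspondence can be checked numerically in one direction with a small Python sketch. It takes a two-way clustering structure (hypothetical `firm` and `year` labels and made-up scores), combines one-way cluster terms with the inclusion-exclusion rule of Cameron et al. (2011), and compares the result with a pairwise weight matrix that links two observations whenever they share a cluster in either dimension. The single-regressor setup and all names are assumptions of the example.

```python
def same_group(labels):
    """Weight matrix with 1 where two observations share a label, else 0."""
    return [[1 if a == b else 0 for b in labels] for a in labels]


def meat(g, s):
    """Weighted sum of score cross-products, sum over (i, j) of g[i] * g[j] * s[i][j]."""
    n = len(g)
    return sum(g[i] * g[j] * s[i][j] for i in range(n) for j in range(n))


g = [1.0, -2.0, 6.0, 0.0]       # hypothetical scores (fitted regressor times residual)
firm = [0, 0, 1, 1]             # first clustering dimension
year = [0, 1, 0, 1]             # second clustering dimension
both = list(zip(firm, year))    # intersection of the two dimensions

# Two-way clustering as a signed sum of one-way cluster terms.
two_way = (
    meat(g, same_group(firm))
    + meat(g, same_group(year))
    - meat(g, same_group(both))
)

# The same structure expressed as bilateral links.
n = len(g)
linked = [
    [1 if firm[i] == firm[j] or year[i] == year[j] else 0 for j in range(n)]
    for i in range(n)
]
direct = meat(g, linked)

print(two_way, direct)  # 49.0 49.0
```

Note that the one-way terms enter with signs: the intersection term is subtracted so that pairs sharing both dimensions are not counted twice.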

Dimensionality with arbitrary clustering (work in progress)

SLIDE 15

Command

SLIDE 16

acreg - Syntax: baseline

SLIDE 17

acreg - Syntax: Spatial 1

SLIDE 18

acreg - Syntax: Spatial 2

SLIDE 19

acreg - Syntax: Network 1

SLIDE 20

acreg - Syntax: Network 2

SLIDE 21

acreg - Syntax: Multiway clustering

SLIDE 22

acreg - Additional Options

  • Panel Dimension and optional HAC standard errors
  • Allows for sampling weights (pweights)
  • Allows for ‘if’ and ‘in’ statements
  • Allows for partialling out up to 2 high-order fixed effects
  • Produces output similar to Stata’s native commands
  • Allows for storing distance matrix and weights matrix
  • Stores main results in e()

SLIDE 23

acreg - Output: Spatial

SLIDE 24

acreg - Output: Network

SLIDE 25

Simulations

SLIDE 26

Simulations

In each Monte Carlo draw:

  1. Generate random variables Y and X1, and random shocks εY and εX1, for each observation
  2. Distribute the random shocks to "linked observations"
       • Spatial environment: kernel around counties in the U.S.
       • Network environment: coauthors in economics (RePEc)
  3. Introduce the correlation in the model by adding the common shocks to Y and X1
  4. Regress Y on X1 and a constant

Test: as the number of Monte Carlo draws approaches infinity, the null hypothesis that β = 0, in a test with α = 0.05, will be rejected 5% of the time only if spatial correlation is accounted for.
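The four steps can be sketched as a small self-contained Python simulation. It is a simplified stand-in, not the authors' design: links are defined by hypothetical groups of five observations instead of U.S. geography or RePEc coauthorship, every linked observation receives the full shock (ρ = 1), all variables are standard normal, the critical value 1.96 is used without small-sample adjustments, and the function names and default values are chosen for the example.

```python
import random


def slope_and_ses(x, y, s):
    """OLS of y on x and a constant: slope plus two standard errors.

    The first standard error ignores dependence across observations;
    the second weights every pair of scores by s[i][j].
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    xd = [v - mean_x for v in x]  # demeaning partials out the constant
    yd = [v - mean_y for v in y]
    sxx = sum(v * v for v in xd)
    b = sum(a * c for a, c in zip(xd, yd)) / sxx
    g = [a * (c - b * a) for a, c in zip(xd, yd)]  # score: regressor times residual
    meat_plain = sum(v * v for v in g)
    meat_corr = sum(g[i] * g[j] * s[i][j] for i in range(n) for j in range(n))
    return b, meat_plain ** 0.5 / sxx, max(meat_corr, 0.0) ** 0.5 / sxx


def simulate(draws=300, groups=20, size=5, rho=1.0, seed=2018):
    """Return null-rejection rates (not corrected, corrected) at the 5% level."""
    rng = random.Random(seed)
    n = groups * size
    group = [i // size for i in range(n)]
    s = [[1.0 if group[i] == group[j] else 0.0 for j in range(n)] for i in range(n)]

    def spread(e):
        # Step 2: own shock plus rho times the shocks of linked observations.
        totals = [0.0] * groups
        for i, v in enumerate(e):
            totals[group[i]] += v
        return [v + rho * (totals[group[i]] - v) for i, v in enumerate(e)]

    rejected_plain = rejected_corr = 0
    for _ in range(draws):
        # Step 1: idiosyncratic shocks; steps 2 and 3: spread them and add to X1 and Y.
        shock_x = spread([rng.gauss(0, 1) for _ in range(n)])
        shock_y = spread([rng.gauss(0, 1) for _ in range(n)])
        x = [rng.gauss(0, 1) + c for c in shock_x]
        y = [rng.gauss(0, 1) + c for c in shock_y]
        # Step 4: regress Y on X1 and a constant, then test beta = 0.
        b, se_plain, se_corr = slope_and_ses(x, y, s)
        rejected_plain += abs(b) > 1.96 * se_plain
        rejected_corr += abs(b) > 1.96 * se_corr
    return rejected_plain / draws, rejected_corr / draws


rate_plain, rate_corr = simulate()
print(f"not corrected: {rate_plain:.3f}, corrected: {rate_corr:.3f}")
```

With only 20 groups, the corrected rate should not be expected to sit exactly at 5%; the sketch is meant to show the gap between the two rates.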

SLIDE 27

Results

SLIDE 28

Spatial setting: Null-rejection rates

Data generating process: Bartlett kernel

                                                          Null-rejection rate
Unit:                                                     U.S. towns  U.S. towns  U.S. counties
Sample size:                                              N=101       N=1001      N=3141
Spatial correlation  Correction  Endogeneity  Estimator   (1)         (2)         (3)

Panel A: Cross section, t = 1
no                   no          no           OLS         5.9%        5.0%        5.0%
no                   no          yes          2SLS        5.6%        5.1%        5.2%
yes                  no          no           OLS         37.8%       50.2%       28.2%
yes                  no          yes          2SLS        33.4%       48.3%       26.5%
yes                  yes         no           OLS         16.8%       7.2%        5.6%
yes                  yes         yes          2SLS        16.7%       8.4%        5.5%

Panel B: Panel, t = 5
no                   no          no           OLS         5.8%        5.1%        5.3%
no                   no          yes          2SLS        5.3%        5.0%        4.6%
yes                  no          no           OLS         39.1%       46.1%       17.9%
yes                  no          yes          2SLS        37.3%       44.3%       15.5%
yes                  yes         no           OLS         19.4%       11.2%       10.1%
yes                  yes         yes          2SLS        19.0%       11.1%       9.6%

SLIDE 29

Spatial setting: Null-rejection rates by sample size, cross section, t=1

[Figure: null-rejection rate (vertical axis) against number of cities per state (horizontal axis, 2 to 20); series: Not corrected, Corrected. Panels: (a) OLS, (b) 2SLS.]

SLIDE 30

Spatial setting: Null-rejection rates by sample size, panel, t=5

[Figure: null-rejection rate (vertical axis) against number of cities per state (horizontal axis, 2 to 20); series: Not corrected, Corrected. Panels: (c) OLS, (d) 2SLS.]

SLIDE 31

Network setting: Null-rejection rates

Data generating process: First-degree friends

                                                          Null-rejection rate
Unit:                                                     Top of the distribution Random sample
Sample size:                                              N=1000      N=2500      N=1000      N=2500
Network correlation  Correction  Endogeneity  Estimator   (1)         (2)         (3)         (4)

no                   no          no           OLS         5.1%        4.7%        4.7%        5.1%
no                   no          yes          2SLS        5.3%        4.9%        5.4%        4.7%
yes                  no          no           OLS         64.9%       59.0%       26.9%       36.2%
yes                  no          yes          2SLS        63.0%       58.2%       25.4%       35.4%
yes                  yes         no           OLS         13.2%       9.2%        7.5%        8.1%
yes                  yes         yes          2SLS        13.4%       9.7%        7.2%        8.4%

SLIDE 32

Conclusions

SLIDE 33

Conclusions

  • We propose a variance-covariance matrix (VCV) estimator, together with a companion statistical package, acreg, for Stata, that allows researchers to obtain cluster-robust inference in OLS and 2SLS settings with arbitrary dependence across observations and over time
  • We show by means of Monte Carlo simulations that the arbitrary clustering correction produces consistent estimates of the VCV
  • Next step: address the dimensionality problem (sufficient number of clusters) theoretically in the arbitrary clustering environment and produce guidelines for users

SLIDE 34

Thank You

SLIDE 35

Appendix

SLIDE 36

Data Generating Process (DGP) - Baseline

For each observational unit we generate two iid random variables, Y and X1:

$$X_1 \sim N(\bar{X}_1, \sigma_{X_1}), \qquad Y \sim N(\bar{Y}, \sigma_Y)$$

Note: $\bar{Y}$ and $\bar{X}_1$ can be any number. Given that Y and X1 are iid, statistical theory predicts that if we regress Y on X1, the null hypothesis that the β coefficient is equal to 0 will be rejected with 5% probability at the 5% level.

For each observational unit we also generate two random shocks, εY and εX1, that are independent and identically distributed (iid):

$$\varepsilon_{X_1} \sim (0, \sigma_{\varepsilon_{X_1}}), \qquad \varepsilon_Y \sim (0, \sigma_{\varepsilon_Y})$$

SLIDE 37

Data Generating Process (DGP) - Correlation

Spatial Environment

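We take each town/county in the U.S. as an observational unit and we dissipate the shocks $\varepsilon_{X_1 i}$ and $\varepsilon_{Y i}$ to all observations j that are within a given spatial distance of observation i. We impose a Bartlett kernel such that the effect is lower as the spatial distance between observations i and j increases. The total common shock an observation receives is $\varsigma_{\xi}$, with $\xi = \varepsilon_{X_1}, \varepsilon_Y$:

$$\varsigma_{\xi i} = \xi_i + \sum_{j \neq i}^{N} \left[1 - (dist_{ij}/dist_{cut})\right] \times \xi_j$$

where the sum includes only observations j with $dist_{ij} \le dist_{cut}$.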
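A minimal Python sketch of this spatial shock distribution follows. To stay self-contained it places the observations on a line with hypothetical coordinates and uses absolute differences as distances; real applications would use geographic distances between towns or counties. The function names and the numbers are assumptions of the example.

```python
def bartlett_weight(dist, dist_cut):
    """Bartlett kernel: declines linearly from 1 at distance 0 to 0 at the cutoff."""
    if dist > dist_cut:
        return 0.0  # observations beyond the cutoff receive nothing
    return 1.0 - dist / dist_cut


def total_common_shocks(xi, positions, dist_cut):
    """Own shock plus kernel-weighted shocks of all other observations within the cutoff."""
    n = len(xi)
    totals = []
    for i in range(n):
        total = xi[i]
        for j in range(n):
            if j != i:
                dist = abs(positions[i] - positions[j])
                total += bartlett_weight(dist, dist_cut) * xi[j]
        totals.append(total)
    return totals


positions = [0.0, 1.0, 2.0, 10.0]  # hypothetical locations on a line
xi = [1.0, 2.0, 4.0, 8.0]          # hypothetical idiosyncratic shocks
shocks = total_common_shocks(xi, positions, dist_cut=4.0)
print(shocks)  # [4.5, 5.75, 6.0, 8.0]
```

The last observation lies farther than the cutoff from all others, so its total common shock equals its own shock.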

Network Environment

We take each author registered at RePEc as an observational unit and we dissipate the shocks $\varepsilon_{X_1 i}$ and $\varepsilon_{Y i}$ to all of her coauthors registered at RePEc. Each coauthor j receives a fraction, ρ, of each shock. The total common shock an observation receives is $\varsigma_{\xi}$, with $\xi = \varepsilon_{X_1}, \varepsilon_Y$:

$$\varsigma_{\xi i} = \xi_i + \sum_{j \neq i}^{N_i} \rho \times \xi_j ; \qquad \rho > 0$$

SLIDE 38

DGP - correlation in the model

We introduce the correlation into the model by adding the sum of common shocks to the variables X1 and Y:

$$\hat{X}_{1i} = X_{1i} + \varsigma_{\varepsilon_{X_1} i}, \qquad \hat{Y}_i = Y_i + \varsigma_{\varepsilon_Y i}$$

SLIDE 39

DGP - regression

We estimate the following equation using OLS, both correcting and not correcting for the presence of spatial/network correlation:

$$\hat{Y}_i = \alpha_{3i} + \hat{\beta}\hat{X}_{1i} + \upsilon_i = \alpha_{3i} + \hat{\beta}(X_{1i} + \varsigma_{\varepsilon_{X_1} i}) + (\upsilon'_i + \varsigma_{\varepsilon_Y i}) \qquad (1)$$

The null hypothesis that β = 0 will be rejected 5% of the time at the 5% level if spatial correlation in the model is accounted for.

SLIDE 40

Illustration 1: Idiosyncratic shocks

[Figure legend, idiosyncratic component: -10.82 to -5.0; -4.99 to -3.0; -2.99 to -1.5; -1.49 to 0.0; 0.01 to 1.5; 1.51 to 3.0; 3.01 to 5.0; 5.01 to 10.19.]

SLIDE 41

Illustration 2: Spatially correlated shocks

[Figure legend, total common shocks: -12.28 to -6.0; -5.99 to -3.5; -3.49 to -1.5; -1.49 to 0.0; 0.01 to 3.0; 3.01 to 6.5; 6.51 to 13.0; 13.01 to 23.41.]

SLIDE 42

Data Generating Process, endogeneity

We introduce endogeneity to the model by adding an endogenous variable, End, as a regressor:

$$Y_i = \alpha_{1i} + \delta_1 X_{1i} + \delta_2 End_i + \mu_i \qquad (2)$$

We generate a random variable IV, which is iid and independent of Y and X1:

$$IV = \overline{IV} + \epsilon_{IV}, \qquad \epsilon_{IV} \sim N(0, \sigma_{\epsilon_{IV}})$$

We define $End_i$ as:

$$End_i = F(X_{1i}, IV_i) + \epsilon_{Y i}$$

We introduce correlation to the 2SLS model by adding the sum of common random shocks, $\varsigma_{\varepsilon_{IV} i}$, to the variable IV and computing End as a function of correlated variables and common shocks:

$$\widehat{IV}_i = IV_i + \varsigma_{\varepsilon_{IV} i}, \qquad \widehat{End}_i = F(\hat{X}_{1i}, \widehat{IV}_i) + \epsilon_{Y i} + \varsigma_{\varepsilon_Y i}$$

SLIDE 43

Data Generating Process, panel dimension

Before introducing correlation to the model, we introduce autocorrelation of degree 1 by adding a fraction, φ, of the random common shock an observation receives in time t − 1 to the random common shock it receives in time t:

$$\varepsilon_{Y it} = \varepsilon_{Y it} + \phi\,\varepsilon_{Y it-1}; \qquad \varepsilon_{X_1 it} = \varepsilon_{X_1 it} + \phi\,\varepsilon_{X_1 it-1}; \qquad \varepsilon_{IV it} = \varepsilon_{IV it} + \phi\,\varepsilon_{IV it-1}; \qquad \phi > 0$$

This ensures that observation i in time t affects observation j in time t + 1 if i and j are in the same arbitrary spatial cluster, i.e., $dist_{ij} \le dist_{cut}$.
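The update rule can be read in more than one way. The Python sketch below assumes the recursive reading, in which the shock carried over from t − 1 is the already-updated one, and it assumes no carry-over into the first period. Both choices, the function name, and the numbers are assumptions of the example, not details confirmed by the slides.

```python
def add_autocorrelation(raw_shocks, phi):
    """Apply the update shock_t = raw_t + phi * shock_(t-1) for one observation.

    raw_shocks: list of per-period shocks before autocorrelation is added
    phi:        fraction of the previous period's shock carried forward (phi > 0)
    """
    updated = []
    previous = 0.0  # assumption: nothing is carried into the first period
    for raw in raw_shocks:
        current = raw + phi * previous
        updated.append(current)
        previous = current  # recursive reading: carry the updated shock forward
    return updated


# Hypothetical shocks over four periods for one observation.
path = add_autocorrelation([1.0, 0.0, 0.0, 2.0], phi=0.5)
print(path)  # [1.0, 0.5, 0.25, 2.125]
```

Under the alternative reading, where only the raw shock from t − 1 is carried forward, the third value would be 0.0 instead of 0.25.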