Inference with Arbitrary Clustering
Fabrizio Colella,∗ Rafael Lalive,∗ Seyhun O. Sakalli,∗ Mathias Thoenig∗
Swiss Stata Users Group Meeting, October 2018
∗ University of Lausanne

Introduction

Motivation
A tremendous surge of empirical analysis with spatial data:
• Growing availability of geocoded data
• Integration of geographic information systems (GIS) in the toolkit of economists
Network relations among individuals known and easily accessible
Need for econometric methods to obtain asymptotically valid inference in settings with varying types of spatial, network, and temporal dependence between observation units
Absence of Stata commands, especially in the 2SLS setting
Colella, Lalive, Sakalli, and Thoenig Inference with Arbitrary Clustering

This paper
Proposes an approach to obtain asymptotically valid inference in the presence of arbitrary correlation (spatial or within a network) in both OLS and 2SLS settings
Provides a package, acreg, for the statistical software Stata
Performs Monte Carlo simulations (using spatial data on U.S. towns and counties) to show the properties and performance of the proposed estimator
• Generate random variables and check how close we get to a 5% null-rejection rate at the 5% test level, following Bertrand, Duflo, and Mullainathan (2004)

Stata command: acreg
What is new in acreg compared to existing packages?
• Performs standard error correction in both OLS and 2SLS settings following White (1980)
• Correlation weights can be given as input or computed from spatial or network relations or multi-way clustering (Cameron et al., 2011)
• Spatial relations can be defined either with a distance cutoff or with a contiguity/distance matrix (neighboring observations only)
• Network relations can be defined with a matrix of links, a distance matrix, or any arbitrary cluster structure that the user defines
• Allows for observation i in time t to be correlated with observation j in its cluster in time t + s
• HAC standard errors and distance decays are optional
• Fixes some bugs that exist in Conley (1999) and Hsiang (2010)
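The idea of computing correlation weights from a distance cutoff, with an optional distance decay, can be sketched in a few lines of Python. This is only an illustration and not acreg's implementation: the function names, the haversine distance, the linear (Bartlett-style) decay `1 - d/cutoff`, and the approximate city coordinates are all assumptions made for the example.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius_km * math.asin(min(1.0, math.sqrt(a)))

def spatial_weights(coords, cutoff_km, decay=False):
    """Weight matrix: 1 within the cutoff (or a linear decay), 0 beyond it."""
    n = len(coords)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            d = haversine_km(coords[i][0], coords[i][1], coords[j][0], coords[j][1])
            if d <= cutoff_km:
                # Assumed decay: weight falls linearly from 1 (d = 0) to 0 (d = cutoff).
                weights[i][j] = 1.0 - d / cutoff_km if decay else 1.0
    return weights

# Hypothetical example: approximate coordinates of Lausanne, Geneva, Zurich.
coords = [(46.52, 6.63), (46.20, 6.14), (47.38, 8.54)]
uniform = spatial_weights(coords, cutoff_km=100.0)
decayed = spatial_weights(coords, cutoff_km=100.0, decay=True)
```

With a 100 km cutoff only the first two points are treated as correlated; every observation keeps weight 1 with itself.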

Arbitrary Clustering

Spatial - 1 Cluster

Spatial - 2 Overlapping clusters

Network

Network - Adjacency matrix

       j1   j2   j3   j4   j5   j6   j7   j8   j9   j10  j11
j1      1    0    1    0    0    1    1    0    0    0    1
j2      0    1    1    0    1    0    0    1    0    0    1
j3      1    1    1    0    0    0    0    0    0    1    0
j4      0    0    0    1    0    0    1    1    0    1    0
j5      0    1    0    0    1    0    0    0    0    0    1
j6      1    0    0    0    0    1    1    0    0    0    0
j7      0    0    0    1    0    1    1    0    0    1    0
j8      0    1    0    1    0    0    0    1    1    0    0
j9      1    0    0    0    0    0    0    1    1    0    0
j10     0    0    1    1    0    0    1    0    0    1    0
j11     1    1    0    0    1    0    0    0    0    0    1
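An adjacency matrix of this kind can be built from a list of pairwise links. The following Python sketch assumes undirected, first-degree links and a weight of 1 on the diagonal; the function name and the four-node example network are hypothetical and not taken from the slides.

```python
def weights_from_links(n, links):
    """Build an n x n 0/1 weight matrix from undirected links (pairs of indices)."""
    # Every observation is correlated with itself.
    weights = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i, j in links:
        weights[i][j] = 1.0
        weights[j][i] = 1.0  # links are assumed to be symmetric
    return weights

# Hypothetical network: 0-1 and 1-2 are linked, node 3 is isolated.
adjacency = weights_from_links(4, [(0, 1), (1, 2)])
```

Nodes 0 and 2 share a neighbor but are not directly linked, so their weight stays 0 under this first-degree definition.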

Conceptual Framework

Theoretical VCV of the 2SLS estimator
Standard IV estimator:
$$b_{2SLS} = (\hat{X}'\hat{X})^{-1} (\hat{X}'y)$$
With variance:
$$VCV(b_{2SLS}) = (\hat{X}'\hat{X})^{-1} \, \hat{X}'\Omega\hat{X} \, (\hat{X}'\hat{X})^{-1}$$
Where:
$y$ is the dependent variable
$X$ is the matrix of regressors (exogenous and endogenous)
$Z$ is the matrix of instruments (excluded and included)
$\hat{X} = Z(Z'Z)^{-1}(Z'X)$ is the fitted values from the first-stage regression
$\Omega$ is the VCV of errors
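To make the two-step structure concrete, here is a minimal Python sketch for the simplest possible case: one regressor, one instrument, and no constant, so that every matrix in the formulas reduces to a scalar. The function names and the toy data are assumptions made for illustration only.

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(p * q for p, q in zip(a, b))

def tsls_scalar(y, x, z):
    """2SLS with a single regressor x and a single instrument z (no constant)."""
    first_stage = dot(z, x) / dot(z, z)      # (Z'Z)^{-1} (Z'X)
    x_hat = [first_stage * zi for zi in z]   # fitted values from the first stage
    b = dot(x_hat, y) / dot(x_hat, x_hat)    # (X_hat'X_hat)^{-1} (X_hat'y)
    return b, x_hat

# Toy data: y is exactly 2 * x, so the estimate should recover a slope of 2.
z = [1.0, 2.0, 3.0, 4.0]
x = [1.1, 1.9, 3.2, 3.8]
y = [2.0 * xi for xi in x]
b, x_hat = tsls_scalar(y, x, z)
```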

Estimating the VCV of the 2SLS estimator
Proposed estimator for $\hat{X}'\Omega\hat{X}$ is:
$$\hat{X}'\big(S \;.\!\times\; (uu')\big)\hat{X} = \sum_{i=1}^{n} \sum_{t=1}^{T} \sum_{j=1}^{n} \sum_{s=1}^{T} \hat{x}_{it} \, u_{it} \, u_{js} \, \hat{x}_{js} \, s_{itjs}$$
Where:
$u \equiv y - \hat{X}\hat{\beta}_{2SLS}$ are the estimated residuals
• Each $itjs$-th component $s_{itjs}$ of $S$ is a correlation weight in $[0,1]$
• The correlation weight can be arbitrarily set
• The correlation weight should reflect the dependence of the error of observation $it$ on the error of observation $js$
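The quadruple sum can be written out directly. The sketch below is a scalar illustration (one regressor, no constant) in which each observation-period pair `it` is flattened into a single index; the function name and the two-observation example are hypothetical, and the residuals are supplied by the caller.

```python
def arbitrary_cluster_variance(x_hat, u, weights):
    """Sandwich variance for a single coefficient with arbitrary correlation weights.

    x_hat   : fitted regressor values, one per (flattened) observation
    u       : estimated residuals, same order as x_hat
    weights : square matrix of correlation weights in [0, 1]
    """
    n = len(x_hat)
    bread = 1.0 / sum(v * v for v in x_hat)   # (X_hat'X_hat)^{-1}
    meat = 0.0
    for i in range(n):
        for j in range(n):
            # One term of the sum: x_i * u_i * u_j * x_j * s_ij
            meat += x_hat[i] * u[i] * u[j] * x_hat[j] * weights[i][j]
    return bread * meat * bread

x_hat = [1.0, 2.0]
u = [1.0, -1.0]
independent = [[1.0, 0.0], [0.0, 1.0]]   # identity weights: no cross-correlation
one_cluster = [[1.0, 1.0], [1.0, 1.0]]   # all ones: a single cluster
var_independent = arbitrary_cluster_variance(x_hat, u, independent)
var_one_cluster = arbitrary_cluster_variance(x_hat, u, one_cluster)
```

With identity weights only the own-observation terms survive, which is the heteroskedasticity-robust case; with a block structure of ones the same function gives the one-way cluster-robust case.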

Asymptotics of the proposed estimator (work in progress)
Equivalence with multi-way clustering
• Any bilateral links structure can be represented by a multi-way clustering structure.
• $\widehat{VCV}(\hat{\beta}_{2SLS})$ in a multi-way cluster environment can be represented as a sum of one-way cluster-robust matrices (Cameron et al. 2011)
• The sandwich estimator $\widehat{VCV}(\hat{\beta}_{2SLS})$ in a one-way cluster environment is consistent as the number of clusters $G \to \infty$ (White 1984; Arellano 1987; Rogers 1993; Hansen 2007)
Dimensionality with arbitrary clustering (work in progress)
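For two clustering dimensions, the sum-of-one-way-matrices representation takes the inclusion-exclusion form of Cameron et al. (2011): dimension A plus dimension B minus their intersection. The Python sketch below checks this identity numerically for the middle ("meat") term in the scalar case; the cluster labels, data values, and function names are hypothetical.

```python
def same_cluster(labels):
    """0/1 weight matrix: 1 when two observations carry the same label."""
    n = len(labels)
    return [[1.0 if labels[i] == labels[j] else 0.0 for j in range(n)] for i in range(n)]

def meat(x, u, weights):
    """Scalar middle term: sum over i, j of x_i * u_i * u_j * x_j * s_ij."""
    n = len(x)
    return sum(x[i] * u[i] * u[j] * x[j] * weights[i][j]
               for i in range(n) for j in range(n))

# Hypothetical two-way structure: observations clustered by state and by year.
state = ["A", "A", "B", "B"]
year = [1, 2, 1, 2]
x = [1.0, 2.0, 3.0, 4.0]
u = [1.0, -1.0, 2.0, -2.0]

n = len(x)
# Arbitrary-clustering weights: linked if the pair shares a state OR a year.
either = [[1.0 if state[i] == state[j] or year[i] == year[j] else 0.0
           for j in range(n)] for i in range(n)]

direct = meat(x, u, either)
one_way_sum = (meat(x, u, same_cluster(state))
               + meat(x, u, same_cluster(year))
               - meat(x, u, same_cluster(list(zip(state, year)))))
```

Here `direct` and `one_way_sum` coincide because the indicator "shares a state or a year" equals the sum of the two one-way indicators minus the indicator for sharing both.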

Command

acreg - Syntax: baseline

acreg - Syntax: Spatial 1

acreg - Syntax: Spatial 2

acreg - Syntax: Network 1

acreg - Syntax: Network 2

acreg - Syntax: Multiway clustering

acreg - Additional Options
• Panel dimension and optional HAC standard errors
• Allows for sampling weights (pweights)
• Allows for ‘if’ and ‘in’ statements
• Allows for partialling out up to 2 high-order fixed effects
• Produces output similar to Stata’s native commands
• Allows for storing distance matrix and weights matrix
• Stores main results in e()

acreg - Output: Spatial

acreg - Output: Network

Simulations

Simulations
In each Monte Carlo draw:
1. Generate random variables $Y$ and $X_1$, and random shocks $\varepsilon_Y$ and $\varepsilon_{X_1}$ for each observation
2. Distribute the random shocks to "linked observations"
   • Spatial environment: kernel around counties in the U.S.
   • Network environment: coauthors in economics (RePEc)
3. Introduce the correlation in the model by adding the common shocks to $Y$ and $X_1$
4. Regress $Y$ on $X_1$ and a constant
Test: as the number of Monte Carlo draws approaches infinity, the null hypothesis that $\beta = 0$, in a test with $\alpha = 0.05$, will be rejected 5% of the time only if spatial correlation is accounted for.
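The four steps can be mimicked in a deliberately simplified Python sketch. It is not the authors' simulation: the "linked observations" are assumed to be equal-sized groups sharing one common shock (instead of a spatial kernel or a coauthor network), only OLS is run, and the sample sizes, seed, and function name are hypothetical.

```python
import random

def simulate_rejection_rates(n_groups=30, group_size=10, draws=300, seed=0):
    """Share of draws rejecting beta = 0 at the 5% level, naive vs. corrected."""
    rng = random.Random(seed)
    n = n_groups * group_size
    reject_naive = reject_corrected = 0
    for _ in range(draws):
        x, y, group = [], [], []
        for g in range(n_groups):
            # Steps 1-3: idiosyncratic draws plus a shock common to linked units.
            shock_x, shock_y = rng.gauss(0, 1), rng.gauss(0, 1)
            for _unit in range(group_size):
                x.append(rng.gauss(0, 1) + shock_x)
                y.append(rng.gauss(0, 1) + shock_y)
                group.append(g)
        # Step 4: OLS of y on x and a constant (demeaning absorbs the constant).
        mean_x, mean_y = sum(x) / n, sum(y) / n
        xd = [v - mean_x for v in x]
        yd = [v - mean_y for v in y]
        sxx = sum(v * v for v in xd)
        beta = sum(p * q for p, q in zip(xd, yd)) / sxx
        resid = [q - beta * p for p, q in zip(xd, yd)]
        # Uncorrected variance assumes independent, homoskedastic errors.
        var_naive = sum(e * e for e in resid) / (n - 2) / sxx
        # Corrected variance: weight 1 for pairs in the same group, 0 otherwise,
        # so the middle term is the sum of squared within-group scores.
        score = [0.0] * n_groups
        for p, e, g in zip(xd, resid, group):
            score[g] += p * e
        var_corrected = sum(v * v for v in score) / (sxx * sxx)
        if abs(beta) > 1.96 * var_naive ** 0.5:
            reject_naive += 1
        if abs(beta) > 1.96 * var_corrected ** 0.5:
            reject_corrected += 1
    return reject_naive / draws, reject_corrected / draws

naive_rate, corrected_rate = simulate_rejection_rates()
print(f"not corrected: {naive_rate:.3f}, corrected: {corrected_rate:.3f}")
```

Because Y and X1 are generated independently, the true coefficient is zero, and any rejection rate well above 5% reflects understated standard errors rather than a real effect.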

Results

Spatial setting: Null-rejection rates
Data generating process: Bartlett kernel

                                                    Null-rejection rate
                                                    U.S. towns            U.S. counties
Spatial       Correction   Endogeneity   Estimator  N=101      N=1001     N=3141
correlation                                         (1)        (2)        (3)

Panel A: Cross section, t = 1
No            No           No            OLS        5.9%       5.0%       5.0%
No            No           Yes           2SLS       5.6%       5.1%       5.2%
Yes           No           No            OLS        37.8%      50.2%      28.2%
Yes           No           Yes           2SLS       33.4%      48.3%      26.5%
Yes           Yes          No            OLS        16.8%      7.2%       5.6%
Yes           Yes          Yes           2SLS       16.7%      8.4%       5.5%

Panel B: Panel, t = 5
No            No           No            OLS        5.8%       5.1%       5.3%
No            No           Yes           2SLS       5.3%       5.0%       4.6%
Yes           No           No            OLS        39.1%      46.1%      17.9%
Yes           No           Yes           2SLS       37.3%      44.3%      15.5%
Yes           Yes          No            OLS        19.4%      11.2%      10.1%
Yes           Yes          Yes           2SLS       19.0%      11.1%      9.6%

Spatial setting: Null-rejection rates by sample size, cross section, t=1
[Figure: panels (a) OLS and (b) 2SLS. x-axis: number of cities per state (2 to 20); y-axis: null-rejection rate (0 to .6, with .05 marked); series: not corrected, corrected.]

Spatial setting: Null-rejection rates by sample size, panel, t=5
[Figure: panels (c) OLS and (d) 2SLS. x-axis: number of cities per state (2 to 20); y-axis: null-rejection rate (0 to .6, with .05 marked); series: not corrected, corrected.]

Network setting: Null-rejection rates
Data generating process: First-degree friends

                                                    Null-rejection rate
                                                    Top of the distribution   Random sample
Network       Correction   Endogeneity   Estimator  N=1000     N=2500         N=1000     N=2500
correlation                                         (1)        (2)            (3)        (4)

No            No           No            OLS        5.1%       4.7%           4.7%       5.1%
No            No           Yes           2SLS       5.3%       4.9%           5.4%       4.7%
Yes           No           No            OLS        64.9%      59.0%          26.9%      36.2%
Yes           No           Yes           2SLS       63.0%      58.2%          25.4%      35.4%
Yes           Yes          No            OLS        13.2%      9.2%           7.5%       8.1%
Yes           Yes          Yes           2SLS       13.4%      9.7%           7.2%       8.4%

Conclusions
