

SLIDE 1

Clustering as a Design Problem

Alberto Abadie, Susan Athey, Guido Imbens, & Jeffrey Wooldridge. CEMMAP, London, April 15, 2016.

SLIDE 2

  • Adjusting standard errors for clustering is common in empirical work.
  • Motivation is not always clear.
  • Implementation is not always clear.
  • We present a coherent framework for thinking about clustering that clarifies when and how to adjust for clustering.
  • Currently mostly exact calculations in simple cases.
  • Clarifies the role of large-number-of-clusters asymptotics.
  • NOT about small-sample issues (small number of clusters or small number of units), and NOT about serial-correlation issues. (Important, but not key to the issues discussed here.)

SLIDE 3

Setup

Data on (Y_i, D_i, G_i), i = 1, …, N. Y_i is the outcome. D_i is the regressor; we mainly focus on the special case D_i ∈ {−1, 1} (to allow for exact results). G_i ∈ {1, …, G} is the group/cluster indicator.

Estimate the regression function

$$Y_i = \alpha + \tau \cdot D_i + \varepsilon_i = X_i'\beta + \varepsilon_i, \qquad X_i' = (1, D_i), \qquad \beta' = (\alpha, \tau).$$

SLIDE 4

Least squares estimator (not generalized least squares):

$$(\hat\alpha, \hat\tau) = \arg\min_{\alpha,\tau} \sum_{i=1}^{N} (Y_i - \alpha - \tau \cdot D_i)^2, \qquad \hat\beta = (\hat\alpha, \hat\tau)'.$$

Residuals: ε̂_i = Y_i − α̂ − τ̂ · D_i.

The focus of the paper is on the properties of τ̂:

  • What is the variance of τ̂?
  • How do we estimate the variance of τ̂?

SLIDE 5

Standard Textbook Approach: view D and G as fixed, assume ε ∼ N(0, Ω), with Ω block diagonal, corresponding to clusters:

$$\Omega = \begin{pmatrix} \Omega_1 & & & \\ & \Omega_2 & & \\ & & \ddots & \\ & & & \Omega_G \end{pmatrix}.$$

Variance estimators differ by the assumptions on Ω_g: diagonal (robust, Eicker-Huber-White), unrestricted (cluster, Liang-Zeger/Stata), constant off-diagonal (Moulton/Kloek).

SLIDE 6

Common Variance Estimators (normalized by sample size)

Eicker-Huber-White, the standard robust variance (zero error covariances):

$$\hat V_{\text{robust}} = N \left( \sum_{i=1}^{N} X_i X_i' \right)^{-1} \left( \sum_{i=1}^{N} X_i X_i' \hat\varepsilon_i^2 \right) \left( \sum_{i=1}^{N} X_i X_i' \right)^{-1}$$

Liang-Zeger, Stata, the standard clustering adjustment (unrestricted within-cluster covariance matrix):

$$\hat V_{\text{cluster}} = N \left( \sum_{i=1}^{N} X_i X_i' \right)^{-1} \left( \sum_{g=1}^{G} \left( \sum_{i: G_i = g} X_i \hat\varepsilon_i \right) \left( \sum_{i: G_i = g} X_i \hat\varepsilon_i \right)' \right) \left( \sum_{i=1}^{N} X_i X_i' \right)^{-1}$$

Moulton/Kloek (constant covariance within clusters):

$$\hat V_{\text{moulton}} = \hat V_{\text{robust}} \cdot \left( 1 + \rho_\varepsilon \cdot \rho_D \cdot \frac{N}{G} \right)$$

where ρ_ε and ρ_D are the within-cluster correlations of ε̂ and D.
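The two workhorse estimators on slide 6 can be computed directly. A minimal numerical sketch (my own illustration, not from the slides), using the same normalization by N as above:

```python
import numpy as np

def ols_variances(Y, D, G):
    """OLS of Y on (1, D), plus the Eicker-Huber-White and Liang-Zeger
    variance estimators from slide 6, both normalized by the sample size N."""
    N = len(Y)
    X = np.column_stack([np.ones(N), D])          # X_i' = (1, D_i)
    bread = np.linalg.inv(X.T @ X)
    beta = bread @ (X.T @ Y)                      # (alpha-hat, tau-hat)
    eps = Y - X @ beta                            # residuals eps-hat_i

    # robust: meat = sum_i X_i X_i' eps_i^2
    meat_r = (X * (eps ** 2)[:, None]).T @ X
    V_robust = N * bread @ meat_r @ bread

    # cluster: meat = sum_g (sum_{i in g} X_i eps_i)(sum_{i in g} X_i eps_i)'
    meat_c = np.zeros((2, 2))
    for g in np.unique(G):
        s = (X[G == g] * eps[G == g, None]).sum(axis=0)
        meat_c += np.outer(s, s)
    V_cluster = N * bread @ meat_c @ bread

    return beta, V_robust, V_cluster
```

When every unit is its own cluster, the Liang-Zeger meat collapses to the Eicker-Huber-White meat, so the two estimators coincide exactly.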

SLIDE 7

Related Literature

  • Clustering: Moulton (1986, 1987, 1990), Kloek (1981), Hansen (2007), Cameron & Miller (2015), Angrist & Pischke (2008), Liang & Zeger (1986), Wooldridge (2010), Donald & Lang (2007), Bertrand, Duflo, & Mullainathan (2004)
  • Sample Design: Kish (1965)
  • Causal Literature: Neyman (1935, 1990), Rubin (1976, 2006), Rosenbaum (2000), Imbens & Rubin (2015)
  • Experimental Design: Murray (1998), Donner & Klar (2000)
  • Finite Population Issues: Abadie, Athey, Imbens, & Wooldridge (2014)

SLIDE 8

Views from the Literature

  • "The clustering problem is caused by the presence of a common unobserved random shock at the group level that will lead to correlation between all observations within each group." (Hansen, p. 671)
  • "The consensus is to be conservative and avoid bias and to use bigger and more aggregate clusters when possible, up to and including the point at which there is concern about having too few clusters." (Cameron and Miller, p. 333)
  • Clustering does not matter when the regressors are not correlated within clusters.
  • Use V̂_cluster when in doubt.

SLIDE 9

Questions

  1. Is there any harm in using V̂_cluster when V̂_robust is valid?
  2. Can we infer from the data whether V̂_cluster or V̂_robust is appropriate?
  3. When are V̂_cluster, V̂_robust, or V̂_moulton appropriate?
  4. Is V̂_cluster superior to V̂_robust in large samples?
  5. What is the role of within-cluster correlation of regressors?

SLIDE 10

We develop a framework within which these questions can be answered. Key features:

  • Specify the population and estimand
  • Specify the data generating process

SLIDE 11

Answers

  1. Is there any harm in using V̂_cluster when V̂_robust is valid? YES
  2. Can we infer from the data whether V̂_cluster or V̂_robust is appropriate? NO
  3. When are V̂_cluster or V̂_robust appropriate? DEPENDS ON DESIGN
  4. Is V̂_cluster superior to V̂_robust in large samples? DEPENDS ON DESIGN
  5. What is the role of within-cluster correlation of regressors? DEPENDS ON DESIGN

SLIDE 12

First, Define the Population and Estimand

Population of size M. The population is partitioned into G groups/clusters. The population size in cluster g is M_g; here M_g = M/G for all clusters, for convenience. G_i ∈ {1, …, G} is the group/cluster indicator. M may be large/infinite, G may be large/infinite, M_g may be large/infinite.

R_i ∈ {0, 1} is the sampling indicator; $\sum_{i=1}^{M} R_i = N$ is the sample size.

SLIDE 13

1. Descriptive Setting:

Outcome Y_i. The estimand is the population average

$$\theta^* = \frac{1}{M} \sum_{i=1}^{M} Y_i.$$

The estimator is the sample average

$$\hat\theta = \frac{1}{N} \sum_{i=1}^{M} R_i \cdot Y_i.$$

SLIDE 14

2. Causal Setting:

Potential outcomes Y_i(−1), Y_i(1); treatment D_i ∈ {−1, 1}; realized outcome Y_i = Y_i(D_i).

The estimand is 0.5 times the average treatment effect (to make the estimand equal to the limit of the regression coefficient; this simplifies calculations later, but is not of the essence):

$$\theta^* = \frac{1}{M} \sum_{i=1}^{M} \frac{Y_i(1) - Y_i(-1)}{2}.$$

The estimator is

$$\hat\theta = \frac{\sum_{i=1}^{M} R_i \cdot Y_i \cdot (D_i - \bar D)}{\sum_{i=1}^{M} R_i \cdot (D_i - \bar D)^2}, \qquad \text{where } \bar D = \frac{\sum_{i=1}^{M} R_i \cdot D_i}{\sum_{i=1}^{M} R_i}.$$

SLIDE 15

Descriptive Setting: population definitions

$$\sigma^2_g = \frac{1}{M_g - 1} \sum_{i: G_i = g} \left( Y_i - \bar Y_{M,g} \right)^2, \qquad \bar Y_{M,g} = \frac{G}{M} \sum_{i: G_i = g} Y_i$$

$$\sigma^2_{\text{cluster}} = \frac{1}{G - 1} \sum_{g=1}^{G} \left( \bar Y_{M,g} - \bar Y_M \right)^2, \qquad \sigma^2_{\text{cond}} = \frac{1}{G} \sum_{g=1}^{G} \sigma^2_g$$

$$\rho = \frac{G}{M(M - G)\,\sigma^2} \sum_{i \neq j,\; G_i = G_j} (Y_i - \bar Y_M)(Y_j - \bar Y_M) \approx \frac{\sigma^2_{\text{cluster}}}{\sigma^2_{\text{cluster}} + \sigma^2_{\text{cond}}}$$

$$\sigma^2 = \frac{1}{M - 1} \sum_{i=1}^{M} (Y_i - \bar Y_M)^2 \approx \sigma^2_{\text{cluster}} + \sigma^2_{\text{cond}}$$
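These population quantities are easy to compute for a simulated finite population. The following sketch (my own illustration; the cluster structure and numbers are arbitrary) checks the exact analysis-of-variance identity behind σ² ≈ σ²_cluster + σ²_cond, and the approximation for ρ:

```python
import numpy as np

# A fixed finite population with G equal-sized clusters and a strong
# cluster-level component (illustrative, not from the slides).
rng = np.random.default_rng(4)
G, Mg = 40, 25
M = G * Mg
cl = np.repeat(np.arange(G), Mg)
Y = 2.0 * rng.normal(size=G)[cl] + rng.normal(size=M)

Ybar = Y.mean()
Ybar_g = Y.reshape(G, Mg).mean(axis=1)            # cluster means
sigma2_g = Y.reshape(G, Mg).var(axis=1, ddof=1)   # within-cluster variances
sigma2_cluster = Ybar_g.var(ddof=1)               # between-cluster variance
sigma2_cond = sigma2_g.mean()
sigma2 = Y.var(ddof=1)

# rho, using sum_{i != j in g} (Y_i - Ybar)(Y_j - Ybar)
#   = (sum_{i in g} (Y_i - Ybar))^2 - sum_{i in g} (Y_i - Ybar)^2
dev = (Y - Ybar).reshape(G, Mg)
cross = (dev.sum(axis=1) ** 2 - (dev ** 2).sum(axis=1)).sum()
rho = G * cross / (M * (M - G) * sigma2)
```

With equal cluster sizes the exact decomposition is (M − 1)σ² = (M − G)σ²_cond + M_g(G − 1)σ²_cluster, which gives σ² ≈ σ²_cluster + σ²_cond when G and M_g are large.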

SLIDE 16

The estimator is $\hat\theta = \frac{1}{N} \sum_{i=1}^{M} R_i \cdot Y_i$.

(Random sampling.) Suppose sampling is completely random:

$$\Pr(R = r) = \binom{M}{N}^{-1}, \quad \text{for all } r \text{ s.t. } \sum_{i=1}^{M} r_i = N.$$

The exact variance, normalized by sample size, is

$$N \cdot \mathbb{V}(\hat\theta \mid \text{RS}) = \sigma^2 \cdot \left( 1 - \frac{N}{M} \right) \approx \sigma^2.$$
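The exact random-sampling variance above can be checked by Monte Carlo over repeated simple random samples from a fixed finite population (an illustrative sketch, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(3)
M, N = 2000, 100
Y = rng.normal(size=M) ** 2          # an arbitrary fixed finite population
sigma2 = Y.var(ddof=1)               # sigma^2 with the M - 1 denominator

# Draw many simple random samples without replacement and compare
# N * Var(theta-hat) with sigma^2 * (1 - N/M).
reps = 5000
means = np.array([rng.choice(Y, size=N, replace=False).mean()
                  for _ in range(reps)])
lhs = N * means.var()
rhs = sigma2 * (1 - N / M)
```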

SLIDE 17

What do the variance estimators give us here?

$$\mathbb{E}\left[ \hat V_{\text{robust}} \,\middle|\, \text{RS} \right] \approx \sigma^2$$

$$\mathbb{E}\left[ \hat V_{\text{cluster}} \,\middle|\, \text{RS} \right] \approx \sigma^2_{\text{cluster}} \cdot \frac{N}{G} + \sigma^2_{\text{cond}} \approx \sigma^2 \cdot \left( 1 + \rho \cdot \left( \frac{N}{G} - 1 \right) \right)$$

  • Adjusting the standard errors for clustering can make a difference here.
  • Adjusting standard errors for clustering is wrong here.

SLIDE 18

Why is the cluster variance wrong here?

Implicitly, the cluster variance takes as the estimand the average outcome in a super-population with a large number of clusters. The set of clusters that we see in the sample is viewed as just a small subset of that large population of clusters. In that case we don't have a random sample from the population of interest.

  • Be explicit about the population of interest. Do we see all clusters in the population or not?
  • This issue is distinct from the use of distributional approximations based on increasing the number of clusters.

SLIDE 19

Consider a model-based approach:

$$Y_i = X_i'\beta + \varepsilon_i + \eta_{G_i}, \qquad \varepsilon_i \sim N(0, \sigma^2_\varepsilon), \quad \eta_g \sim N(0, \sigma^2_\eta).$$

The standard OLS variance expression

$$\mathbb{V}(\hat\beta) = (X'X)^{-1}(X'\Omega X)(X'X)^{-1}$$

is based on resampling units, or resampling both ε and η. In a random sample we will eventually see units from all clusters, and we do not need to resample the η_g. The random sampling variance keeps the η_g fixed.

SLIDE 20

(Clustered sampling.) Suppose we randomly select H clusters out of G, and then select N/H units randomly from each of the sampled clusters:

$$\Pr(R = r) = \binom{G}{H}^{-1} \cdot \binom{M/G}{N/H}^{-H}, \quad \text{for all } r \text{ s.t. for each } g:\ \sum_{i: G_i = g} r_i = N/H \ \text{ or } \ \sum_{i: G_i = g} r_i = 0.$$

Now the exact variance is

$$N \cdot \mathbb{V}(\hat\theta \mid \text{CS}) = \sigma^2_{\text{cluster}} \cdot \frac{N}{H} \cdot \left( 1 - \frac{H}{G} \right) + \sigma^2_{\text{cond}} \cdot \left( 1 - \frac{N}{M} \right).$$

  • Adjusting standard errors for clustering can make a difference and is correct here. Failure to do so leads to invalid confidence intervals.
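A Monte Carlo sketch of the clustered-sampling variance (my own illustration; the population and sample sizes are arbitrary choices). With N ≪ M and a strong cluster component, the σ²_cluster · (N/H) · (1 − H/G) term dominates:

```python
import numpy as np

rng = np.random.default_rng(1)
G, Mg = 50, 200                      # 50 clusters of 200 units
M = G * Mg
H, n = 10, 20                        # sample H clusters, n = N/H units each
N = H * n

cl = np.repeat(np.arange(G), Mg)
Y = rng.normal(size=G)[cl] + rng.normal(size=M)   # fixed clustered population

# Population quantities as defined on slide 15
Yg = Y.reshape(G, Mg)
sigma2_cluster = Yg.mean(axis=1).var(ddof=1)
sigma2_cond = Yg.var(axis=1, ddof=1).mean()

# Slide 20's variance, normalized by N
V_formula = (sigma2_cluster * (N / H) * (1 - H / G)
             + sigma2_cond * (1 - N / M))

# Monte Carlo over two-stage clustered samples
reps = 3000
est = np.empty(reps)
for r in range(reps):
    gs = rng.choice(G, size=H, replace=False)              # stage 1: clusters
    draws = [rng.choice(Yg[g], size=n, replace=False) for g in gs]  # stage 2
    est[r] = np.concatenate(draws).mean()
V_mc = N * est.var()
```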

SLIDE 21

Four Causal Settings

  • Random sample, random assignment of units.
  • Random sample, random assignment of clusters.
  • Clustered sample, random assignment of units.
  • Random sample, assignment probability varying across clusters.

Questions

  1. Is V̂_robust valid?
  2. Is V̂_cluster valid?

SLIDE 22

Answers

  • Random sample, random assignment of units: V̂_robust valid; V̂_cluster not generally valid.
  • Random sample, random assignment of clusters: V̂_robust not generally valid; V̂_cluster valid.
  • Clustered sample, random assignment of units: depends on the estimand (average effect in the population versus average effect in the sample).
  • Random sample, assignment probability varying across clusters: neither is generally valid.

SLIDE 23

Causal Setting: Random Sampling, Random Assignment

Points:

  1. Should not cluster.
  2. V̂_robust is valid.
  3. V̂_cluster can be different from V̂_robust in large samples, with many clusters, even with ρ_ε = ρ_D = 0.
  4. V̂_moulton and V̂_cluster are conceptually quite different.

SLIDE 24

Example. Data generating process:

R_i = 1 (all units are sampled)
W_i ∼ B(1, 1/2), D_i = 2 · W_i − 1 ∈ {−1, 1}
τ_i = 1 + ξ_{G_i}, ξ_g ∼ B(1, 1/2)
Y_i = τ_i · D_i + ν_i, ν_i ∼ N(0, 1)

Estimated regression: Y_i = α + τ · D_i + ε_i

NOTE: ρ_D = 0, ρ_ε = 0.

SLIDE 25

Random Sampling, Random or Clustered Assignment

                             random assignment   clustered assignment
standard deviation                 0.05                0.13
V̂_robust                          0.05                0.04
coverage rate, V̂_robust           0.96                0.48
V̂_cluster                         0.12                0.16
coverage rate, V̂_cluster          1.00                0.97

SLIDE 26

Causal Setting: Fuzzy Clustering

Suppose:

$$\mathbb{E}[D_i] = 0, \qquad \mathbb{E}[D_i \cdot D_j \mid G_i = G_j,\ i \neq j] = \gamma, \qquad \mathbb{E}[D_i \cdot D_j \mid G_i \neq G_j] = 0.$$

Assignment is correlated within clusters, but not perfectly correlated.

  • V̂_cluster cannot be right, because it is wrong if γ = 0.
  • V̂_robust cannot be right, because it is wrong if γ = 1.
  • So, what do we do?

SLIDE 27

  • Will look at a simple case where exact calculations are possible.
  • Will propose a new variance estimator that can deal with:
    – random assignment (where it reduces to the robust variance),
    – clustered assignment (where it reduces to the clustered variance),
    – intermediate correlated-assignment cases (for which there is no variance estimator).

SLIDE 28

Example Data Generating Process

Population size M, G clusters, all of equal size, all units sampled: R_i = 1. In G/2 randomly selected clusters the fraction of treated units is 1/2 − δ; in the remaining clusters the fraction of treated units is 1/2 + δ. Hence

$$\sum_{i=1}^{M} D_i = 0, \qquad \sum_{i=1}^{M} D_i^2 = M.$$

For each unit there are two values, Y_i(−1) and Y_i(1). The estimand is

$$\tau = \frac{1}{2M} \sum_{i=1}^{M} \left( Y_i(1) - Y_i(-1) \right).$$

The estimator is the OLS estimator in the regression Y_i = α + τ · D_i + ε_i, leading to

$$\hat\tau = \frac{1}{M} \sum_{i=1}^{M} D_i \cdot Y_i.$$

SLIDE 29

Define

$$\varepsilon_i(-1) = Y_i(-1) - \frac{1}{M} \sum_{i=1}^{M} Y_i(-1), \qquad \varepsilon_i(1) = Y_i(1) - \frac{1}{M} \sum_{i=1}^{M} Y_i(1),$$

with the realized error ε_i(D_i), and let

$$\varepsilon_i = \frac{\varepsilon_i(-1) + \varepsilon_i(1)}{2}.$$

Then

$$\hat\tau = \frac{Y'D}{N} = \tau + \frac{\varepsilon'D}{N}.$$

So the true (infeasible) variance is

$$\mathbb{V}(\hat\tau) = \varepsilon'\, \mathbb{V}(D)\, \varepsilon / N^2.$$

We need to figure out the exact variance of D and find an estimator for ε. Note that E[D] = 0, so we just need second moments.

SLIDE 30

Elements of V(D) = E[DD′]:

$$\mathbb{E}[D_i^2] = 1$$

$$\mathbb{E}[D_i \cdot D_j \mid G_i = G_j,\ i \neq j] = \frac{4M\delta^2 - G}{M - G} \approx 4\delta^2$$

$$\mathbb{E}[D_i \cdot D_j \mid G_i \neq G_j] = -\frac{4\delta^2}{G - 1} \approx 0$$

The approximate variance of D is V̂(D):

$$\hat{\mathbb{V}}(D)_{ij} = \begin{cases} 1 & \text{if } i = j, \\ 4\delta^2 & \text{if } i \neq j,\ G_i = G_j, \\ 0 & \text{otherwise.} \end{cases}$$

SLIDE 31

We do not observe ε_i, but can estimate it:

$$\mathbb{E}[\hat\varepsilon_i] = \varepsilon_i$$

So the proposed feasible variance estimator is

$$\hat V_{\text{fc}} = \hat\varepsilon'\, \hat{\mathbb{V}}(D)\, \hat\varepsilon / N^2$$

NEW VARIANCE ESTIMATOR

  • If δ = 0 (random assignment), then V̂_fc = V̂_robust.
  • If δ = 1/2 (clustered assignment), then V̂_fc = V̂_cluster.
  • Can deal with intermediate cases.
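The new estimator is simple to implement. A sketch (illustrative code, specialized to the slide-28 setting where τ̂ = Σ_i D_i · Y_i / N and V̂(D) has the block form above); the two reduction properties can be checked directly:

```python
import numpy as np

def V_fc(eps, cl, delta, N):
    """Feasible fuzzy-clustering variance eps' Vhat(D) eps / N^2, where
    Vhat(D)_ij = 1 if i = j, 4*delta^2 if i != j in the same cluster, else 0."""
    total = np.sum(eps ** 2)                     # diagonal terms
    for g in np.unique(cl):
        e = eps[cl == g]
        # off-diagonal within-cluster pairs: (sum e)^2 - sum e^2
        total += 4 * delta ** 2 * (e.sum() ** 2 - np.sum(e ** 2))
    return total / N ** 2

def V_robust_simple(eps, D, N):
    # robust variance of tau-hat = sum_i D_i^2 eps_i^2 / N^2
    return np.sum(D ** 2 * eps ** 2) / N ** 2

def V_cluster_simple(eps, D, cl, N):
    # cluster variance of tau-hat = sum_g (sum_{i in g} D_i eps_i)^2 / N^2
    return sum((D[cl == g] * eps[cl == g]).sum() ** 2
               for g in np.unique(cl)) / N ** 2
```

With δ = 0 the off-diagonal blocks vanish and V̂_fc equals the robust variance (since D_i² = 1); with δ = 1/2 and fully clustered assignment, the within-cluster block is all ones and V̂_fc equals the cluster variance.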

SLIDE 32

Simulation: random sampling, correlated assignment within clusters.

$$\frac{Y_i(-1) + Y_i(1)}{2} = \nu_{G_i} + \eta_i, \qquad \nu_g \sim N(0, 1), \quad \eta_i \sim N(0, 1)$$

$$\tau_i = \frac{Y_i(1) - Y_i(-1)}{2} = \xi_{G_i} + \omega_i, \qquad \xi_g \sim N(0, 1), \quad \omega_i \sim N(0, 1)$$

Three values for δ:

  1. δ = 0: stratified assignment
  2. δ = 1/4: correlated assignment / fuzzy clustering
  3. δ = 1/2: clustered assignment

SLIDE 33

  • std is the standard deviation of the estimator
  • strat is the variance estimator for a stratified randomized experiment
  • robust is the Eicker-Huber-White robust variance estimator
  • cluster is the Liang-Zeger (Stata) variance estimator
  • fc is the proposed feasible variance estimator for fuzzy clustering
  • ifc is the infeasible true variance

Random or Clustered Assignment

                      strat        robust       cluster      fc           ifc
δ             std     se    size   se    size   se    size   se    size   se    size
random        .01     .01   .03    .02   .00    .05   .00    .03   .02    .01   .05
correlated    .05     .01   .62    .02   .51    .07   .01    .05   .04    .05   .05
clustered     .10     .01   .81    .02   .74    .11   .03    .11   .03    .10   .05

SLIDE 34

Summary

What to do depends on the sampling scheme and the assignment of the regressors.

                             Sampling
Assignment       random        correlated    clustered
random           robust/fc     ??            cluster
correlated       fc            ??            cluster
clustered        cluster/fc    ??            cluster