Robust Statistics using Stata: First Belgian Stata Users Meeting

Robust Statistics using Stata First Belgian Stata Users Meeting Vincenzo Verardi Fnrs, UNamur, ULB September 2016 Vincenzo Verardi (Fnrs, UNamur, ULB) First Belgian Stata Users Meeting 6/09/2016 1 / 77

Outliers do matter and are not always bad
[Photo: August Landmesser]

Outliers do matter and are not always bad
Structure of the presentation
- Introduction
- Descriptive statistics
- Univariate outlier identification
- Regression models
- Multivariate analysis
- Multivariate outlier identification
- Robust logit
- Conclusion

Outliers do matter and are not necessarily coding errors
[Figure: Hertzsprung-Russell data for the star cluster CYG OB1. Log of light intensity (vertical axis) against log of temperature (horizontal axis), with the least squares fit and a robust fit. Source: P. J. Rousseeuw and A. M. Leroy (1987).]

Outliers do matter and are not necessarily coding errors
[Figure: Brain and body weights of 65 species of land animals. Log of brain weight (vertical axis) against log of body weight (horizontal axis). Robust fit: y = 1.98 + 0.75x. LS fit: y = 2.17 + 0.59x. Labelled points: Human, Brachiosaurus, Triceratops, Diplodocus, Water opossum. Source: Weisberg, S. (1985).]

Outliers do matter and are not necessarily coding errors
[Figure: Number of international calls from Belgium by year (50 to 75), from the Belgian Statistical Survey, Ministry of Economy, with the least squares fit and a robust fit. Source: P. J. Rousseeuw and A. M. Leroy (1987).]

Measuring robustness of an estimator

Measuring robustness of an estimator
Sensitivity curve, see inter alia [Maronna et al., 2006]
Let us consider a data set $X_n = \{x_1, \ldots, x_n\}$ and the statistic $T_n = T_n(x_1, \ldots, x_n)$. To study the impact of a potential outlier on this statistic, we may analyze how the value of the statistic changes when we add an extra data point $x$ and allow it to move over the whole real line (from $-\infty$ to $+\infty$).
The (standardized) sensitivity curve of the statistic $T_n$ for the sample $X_n$ is defined by
$$\mathrm{SC}(x; T_n, X_n) = \frac{T_{n+1}(x_1, \ldots, x_n, x) - T_n(x_1, \ldots, x_n)}{\frac{1}{n+1}};$$
for each value of $x$, we compare the value of the statistic in the "contaminated" sample with its value in the initial sample, and rescale the difference by dividing by $1/(n+1)$, the amount of contamination.
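The definition above can be evaluated directly. The following is a minimal Python sketch (not the Stata code of the talk); the function name `sensitivity_curve` and the five-point sample are hypothetical choices made only for illustration.

```python
from statistics import mean, median


def sensitivity_curve(statistic, sample, x):
    """Standardized sensitivity curve SC(x; T_n, X_n) at a single point x."""
    n = len(sample)
    contaminated = list(sample) + [x]
    # Difference between the statistic on the contaminated sample and on the
    # original sample, divided by the amount of contamination 1 / (n + 1).
    return (statistic(contaminated) - statistic(sample)) * (n + 1)


# Hypothetical sample (assumption): symmetric, with mean 0 and median 0.
sample = [-1.5, -0.5, 0.0, 0.5, 1.5]

for x in (-100.0, -1.0, 1.0, 100.0):
    sc_mean = sensitivity_curve(mean, sample, x)
    sc_median = sensitivity_curve(median, sample, x)
    print(f"x = {x:8.1f}   SC(mean) = {sc_mean:8.2f}   SC(median) = {sc_median:5.2f}")
```

For this sample the curve of the mean grows linearly in $x$, while the curve of the median stays within a fixed band however far the extra point is moved.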

Measuring robustness of an estimator
Sensitivity curve
[Figure: Standardized sensitivity curves of the mean and the median, X ~ N(0,1), N = 20.]

Measuring robustness of an estimator
Influence function
The influence function (IF) can be considered as an asymptotic version of the sensitivity curve of the statistic $T_n$ when the sample size $n$ grows, that is, when the empirical distribution function $F_n$ tends to the underlying population distribution function $F$:
$$\mathrm{IF}(x; T, F) = \lim_{n \to \infty} \frac{T\left(\left(1 - \frac{1}{n+1}\right)F + \frac{1}{n+1}\Delta_x\right) - T(F)}{\frac{1}{n+1}} = \lim_{\varepsilon \to 0} \frac{T\left((1 - \varepsilon)F + \varepsilon\Delta_x\right) - T(F)}{\varepsilon},$$
where $\Delta_x$ denotes the probability distribution putting all its mass at the point $x$. This function measures the effect on $T$ of a perturbation of $F$ obtained by adding a small probability mass at the point $x$.
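As a worked illustration of this limit (a standard textbook case, not taken from the slides), consider the mean functional $T(F) = \int u \, dF(u) = \mu$. Because $T$ is linear in $F$, the contaminated value is available in closed form and the limit is immediate:

```latex
\begin{align*}
T\bigl((1-\varepsilon)F + \varepsilon\Delta_x\bigr)
  &= (1-\varepsilon)\int u \, dF(u) + \varepsilon x
   = (1-\varepsilon)\mu + \varepsilon x, \\
\mathrm{IF}(x; T, F)
  &= \lim_{\varepsilon \to 0}
     \frac{(1-\varepsilon)\mu + \varepsilon x - \mu}{\varepsilon}
   = x - \mu .
\end{align*}
```

The influence function of the mean is therefore a straight line in $x$, which is unbounded.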

Measuring robustness of an estimator
Influence function
[Figure: Influence functions of the mean and the median, X ~ N(0,1).]

Measuring robustness of an estimator
Gross-error sensitivity
The gross-error sensitivity of $T$ at distribution $F$, defined by
$$\gamma^*(T, F) = \sup_x \left|\mathrm{IF}(x; T, F)\right|,$$
evaluates the biggest influence that an outlier may have on $T$. From the robustness point of view, it is of course preferable to use an estimator for which $\gamma^*(T, F)$ is finite (i.e. bounded IF).
Local-shift sensitivity
The local-shift sensitivity measures the effect on $T$ of a small perturbation of the value of $x$. It is defined by
$$\lambda^*(T, F) = \sup_{x \neq y} \frac{\left|\mathrm{IF}(y; T, F) - \mathrm{IF}(x; T, F)\right|}{|y - x|}.$$
From the robustness point of view, it is of course preferable to use an estimator for which the IF is smooth everywhere.
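To make the two measures concrete, the block below evaluates them for the mean and the median at the standard normal distribution $\Phi$ (density $\varphi$). It relies on the standard textbook influence functions $\mathrm{IF}(x; \mu, \Phi) = x$ and $\mathrm{IF}(x; Q_{0.5}, \Phi) = \operatorname{sign}(x) / (2\varphi(0))$, which are quoted here as an assumption and are not derived on the slides.

```latex
\begin{align*}
\gamma^*(\mu, \Phi) &= \sup_x |x| = \infty,
  & \lambda^*(\mu, \Phi) &= \sup_{x \neq y} \frac{|y - x|}{|y - x|} = 1, \\
\gamma^*(Q_{0.5}, \Phi) &= \frac{1}{2\varphi(0)} = \sqrt{\pi/2} \approx 1.2533,
  & \lambda^*(Q_{0.5}, \Phi) &= \infty
  \quad \text{(jump of the IF at } x = 0\text{)}.
\end{align*}
```

The two estimators thus sit at opposite ends: the mean has a smooth but unbounded IF, the median a bounded but discontinuous one.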

Measuring robustness of an estimator
Breakdown point
The sensitivity curve shows how an estimator reacts to the introduction of one single outlier. Some estimators have a bounded sensitivity curve (SC) and therefore resist this contamination. However, the number of outliers in a sample may be so large that even estimators with a bounded SC break down. The breakdown point is, roughly, the smallest amount of contamination in the sample that may cause the estimator to take on arbitrary values.
Example
If the $i$th observation among $x_1, \ldots, x_n$ goes to infinity, the sample mean $\mu_n$ goes to infinity as well. This means that the finite-sample breakdown point of this statistic is only $1/n$. In contrast, the finite-sample breakdown point of the median $Q_{0.5;n}$ is $\frac{n/2}{n}$ if $n$ is even and $\frac{(n+1)/2}{n}$ if $n$ is odd.
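The example can be checked numerically. The sketch below is a Python illustration under stated assumptions: the helper `smallest_breaking_count`, the value `1e12` standing in for "infinity", the threshold `1e6`, and the two small samples are all hypothetical. It replaces observations one at a time by a huge value and reports the first count at which the estimate explodes.

```python
from statistics import mean, median


def smallest_breaking_count(estimator, sample, bad_value=1e12, threshold=1e6):
    """Smallest number m of replaced observations that drives the estimate
    beyond `threshold` in absolute value (None if this never happens)."""
    n = len(sample)
    for m in range(1, n + 1):
        # Replace the first m observations by a huge value (a proxy for infinity).
        contaminated = [bad_value] * m + list(sample[m:])
        if abs(estimator(contaminated)) > threshold:
            return m
    return None


# Hypothetical samples (assumption): one with even n, one with odd n.
even_sample = [-2.1, -1.3, -0.7, -0.2, 0.0, 0.3, 0.8, 1.1, 1.9, 2.4]  # n = 10
odd_sample = even_sample + [3.0]                                      # n = 11

for name, data in (("n = 10", even_sample), ("n = 11", odd_sample)):
    m_mean = smallest_breaking_count(mean, data)
    m_median = smallest_breaking_count(median, data)
    print(f"{name}: mean breaks after {m_mean}, median after {m_median} replacements")
```

Dividing the reported counts by $n$ gives the finite-sample breakdown points stated in the example.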

Measuring robustness of an estimator
Choosing a good (robust) estimator
- Fisher-consistent. If the estimator were calculated using the entire population rather than a sample, the true value of the estimated parameter should be obtained.
- Bounded influence function (low gross-error sensitivity). The biggest influence that an outlier may have on the estimator should be limited.
- Smooth influence function (low local-shift sensitivity). The effect on the estimator of a small perturbation in the data should be limited.
- High breakdown point. The estimator must withstand contamination of a large proportion of the data.
- Highly efficient, with convergence rate of $\sqrt{n}$.
- Computationally feasible.
Compromises must often be made to achieve good performance.

Descriptive statistics

Descriptive statistics
Location parameters
Several measures of location are available in the literature. We compare i) two "classical" estimators based on (centered) moments of the empirical distribution, ii) an estimator based on quantiles of the distribution, and iii) an estimator based on pairwise comparisons of the observations.
- Classical estimator (mean): $\mu_n = \frac{1}{n} \sum_{i=1}^{n} x_i$
- Classical estimator (trimmed mean): $\mu_n^{\alpha} = \frac{1}{n - 2\lfloor \alpha n \rfloor} \sum_{i = \lfloor \alpha n \rfloor + 1}^{n - \lfloor \alpha n \rfloor} x_{(i)}$
- Quantile-based estimator (median): $Q_{0.5} = \operatorname{med}\{x_i\}$
- Pairwise-based estimator [Hodges and Lehmann, 1963]: $HL_n = \operatorname{med}\left\{\frac{x_i + x_j}{2};\ i < j\right\}$
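The four estimators translate almost line by line into code. The following Python sketch is for illustration only (it is not the `robstat` implementation used in the talk); the function names and the toy data are hypothetical, and the Hodges-Lehmann version uses the naive enumeration of all pairs rather than a fast algorithm.

```python
import math
from itertools import combinations
from statistics import mean, median


def trimmed_mean(x, alpha):
    """Mean of the order statistics left after dropping floor(alpha * n)
    observations in each tail (requires alpha < 0.5)."""
    xs = sorted(x)
    n = len(xs)
    k = math.floor(alpha * n)
    return mean(xs[k:n - k])


def hodges_lehmann(x):
    """Median of the pairwise averages (x_i + x_j) / 2 over all pairs i < j.
    Naive enumeration of the n(n-1)/2 pairs, fine for small samples only."""
    return median((a + b) / 2 for a, b in combinations(x, 2))


# Hypothetical toy data (assumption): four regular points and one outlier.
data = [1.0, 2.0, 3.0, 4.0, 100.0]

print("mean          :", mean(data))
print("trimmed (0.25):", trimmed_mean(data, 0.25))
print("median        :", median(data))
print("Hodges-Lehmann:", hodges_lehmann(data))
```

On this toy sample only the mean is dragged toward the outlier; the three other estimates stay close to the bulk of the data.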

Location parameters
Influence functions
[Figure: Influence functions (IF against x) of the mean $\mu$, the median $Q_{0.5}$, the Hodges-Lehmann estimator $HL$, and the trimmed means $\mu^{0.25}$ and $\mu^{0.05}$.]

Location parameters
Comparing properties of location estimators

Estimator          ASV(., Φ)                    Asymptotic breakdown value   Computational complexity
$\mu_n$            1                            0%                           O(n)
$\mu_n^{\alpha}$   1.0263 if α = 0.05           100α%                        O(n)
                   1.0604 if α = 0.10
                   1.1952 if α = 0.25
$Q_{0.5;n}$        π/2 = 1.5708                 50%                          O(n)
$HL_n$             π/3 = 1.0472                 29%                          O(n log n)
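One way to read the ASV column: since the mean has asymptotic variance 1 at the Gaussian model, the Gaussian efficiency of each estimator relative to the mean is simply 1/ASV. The short Python sketch below does this arithmetic on the tabulated values; the dictionary keys are labels chosen here for readability.

```python
import math

# Asymptotic variances at the standard normal, as given in the table.
asv = {
    "mean": 1.0,
    "trimmed mean (0.05)": 1.0263,
    "trimmed mean (0.10)": 1.0604,
    "trimmed mean (0.25)": 1.1952,
    "median": math.pi / 2,
    "Hodges-Lehmann": math.pi / 3,
}

# Gaussian efficiency relative to the mean: ASV(mean) / ASV(estimator) = 1 / ASV.
efficiency = {name: 1.0 / value for name, value in asv.items()}

for name, eff in efficiency.items():
    print(f"{name:>20s}: {100 * eff:5.1f}%")
```

Read together with the breakdown column, this shows the compromise at stake: the median buys a 50% breakdown value at a sizeable efficiency cost, whereas the Hodges-Lehmann estimator keeps most of the efficiency with a 29% breakdown value.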

Location parameters
Stata example

Clean dataset:
    clear
    set seed 1234
    set obs 10000
    drawnorm z
    gen x=z
    sum x, d
    robstat x, stat(hl)

Contaminated dataset:
    clear
    set seed 1234
    set obs 10000
    drawnorm z
    gen x=z+10 in 1/100
    sum x, d
    robstat x, stat(hl)

Results:

Dataset        Statistic   $\mu_n$   $Q_{0.5;n}$   $HL_n$
Clean          Value       -0.00     -0.01         -0.01
Clean          Time         0.01      0.01          0.40
Contaminated   Value        1.00      0.00          0.01
Contaminated   Time         0.01      0.01          0.41
