
Robust Location and Scatter Estimators for Multivariate Data Analysis



  1. Robust Location and Scatter Estimators for Multivariate Data Analysis
     {robustbase}, {rrcov}
     Valentin Todorov, valentin.todorov@chello.at
     15.06.2006 useR'2006, Vienna: Valentin Todorov

     Outline
     • Background and Motivation
     • Computing the Robust Estimates
       – Definition and computation
         • MCD, OGK, S, M
       – Object model for robust estimation
       – Comparison to other implementations
     • Applications
       – Hotelling $T^2$
       – Robust Linear Discriminant Analysis
     • Conclusions and future work

     Multivariate location and scatter
     • Location: coordinate-wise mean
     • Scatter: covariance matrix
       – Variances of the variables on the diagonal
       – Covariance of two variables as off-diagonal elements
     • Optimally estimated by the sample mean and sample covariance matrix at any multivariate normal model
     • Essential to a number of multivariate data analysis methods
     • But extremely sensitive to outlying observations

     Example
     • Maronna & Yohai (1998)
     • rrcov: data set maryo
     • A bivariate data set with: $n = 20$, $\mu = (0, 0)$, $S = \begin{pmatrix} 1 & 0.8 \\ 0.8 & 1 \end{pmatrix}$
     • sample correlation: 0.81
     • interchange the largest and smallest value in the first coordinate
     • the sample correlation becomes 0.05
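The sensitivity shown in the Example slide can be reproduced in spirit with a few lines of base R. This is only a sketch on synthetic, deterministic data (an assumption made for illustration, not the maryo data set), so the numbers will differ from the 0.81 and 0.05 quoted on the slide. The names `x_swapped`, `r_before` and `r_after` are illustrative.

```r
# Synthetic bivariate data with a strong positive association.
# NOTE: this is NOT the maryo data set from rrcov; it is made up for illustration.
n <- 20
x <- seq(-2, 2, length.out = n)
y <- 0.8 * x + 0.4 * cos(seq_len(n))

r_before <- cor(x, y)

# Interchange the largest and smallest value in the first coordinate.
x_swapped <- x
i_max <- which.max(x)
i_min <- which.min(x)
x_swapped[c(i_max, i_min)] <- x[c(i_min, i_max)]

r_after <- cor(x_swapped, y)

# The swap leaves the marginal distribution of x unchanged,
# yet the sample correlation drops.
print(c(before = r_before, after = r_after))
```

Only two of the twenty observations are modified, and the set of x values is exactly the same before and after the swap, which is what makes the drop in the classical correlation instructive.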

  2. Software for robust estimation of multivariate location and scatter
     • S-Plus – covRob in the Robust library
     • Matlab – mcdcov in the toolbox LIBRA
     • SAS/IML – MCD call
     • R – cov.rob and cov.mcd in MASS
     • R – covMcd in {robustbase}
     • R – CovMcd, CovOgk, CovMest in {rrcov}

     Motivation
     • R 2.3.1: cov.rob (cov.mcd) in MASS, but
       – Implements a C-step similar to the one in Rousseeuw & Van Driessen (1999), but with no partitioning and no nesting -> very slow for larger data sets
       – No small sample corrections
       – No generic functions print/show, summary, plot
       – No graphical and diagnostic tools

     rrcov → robustbase
     - Port of the Fortran code for FAST-MCD and FAST-LTS of Rousseeuw and Van Driessen
     - Functions covMcd, ltsReg and the corresponding help files
     + Datasets - Rousseeuw and Leroy (1987), Milk - Daudin (1988), etc.
     + Generic functions print and summary for covMcd
     + Graphical and diagnostic tools based on the robust and classical Mahalanobis distances - plot.mcd
     + Formula interface and generic functions print, summary and predict for ltsReg
     + Graphical and diagnostic tools based on the residuals - plot.lts

     rrcov
     + Constrained M-estimates of location and covariance - Rocke (1996)
     + Orthogonalized Gnanadesikan-Kettenring (OGK) – Maronna and Zamar (2002)
     + S4 object model
       + CovMcd
       + CovOgk
       + CovMest

  3. rrcov
     » CovSest: S estimates - FAST-S, Salibian & Yohai (2005)
     » Trellis style graphics
     » Hotelling $T^2$
     » Robust Linear Discriminant Analysis with option for stepwise selection of variables
     » More data sets

     Minimum Covariance Determinant Estimator
     Given a $p$-dimensional data set $X = \{x_1, \ldots, x_n\}$
     – The MCD estimator (Rousseeuw, 1984) is defined by
       • the subset of $h$ observations out of $n$ whose classical covariance matrix has the smallest determinant
       • the MCD location estimator $T$ is defined by the mean of that subset
       • the MCD scatter estimator $C$ is a multiple of its covariance matrix
       • $n/2 \le h < n$; $h = [(n + p + 1)/2]$ yields maximal BDP

     Computing of MCD: FAST-MCD
     • Consists of three phases: basic C-step iteration, partitioning and nesting
     • C-step: move from one approximation $(T_1, C_1)$ of the MCD of a data set $X = \{x_1, \ldots, x_n\}$ to a new one $(T_2, C_2)$ with possibly lower determinant, by computing the distances relative to $(T_1, C_1)$ and then computing $(T_2, C_2)$ for the $h$ observations with smallest distances.
     • C-step iteration:
       – Repeat a number of times (say 500) {
         • start from a trial subset of $h$ points and perform several C-steps
         • keep the 10 best solutions
         }
       – From each of these solutions carry out C-steps until convergence and select the best result
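The C-step described above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the FAST-MCD Fortran code used by covMcd: the function name `c_step`, the simulated data and the single random start are all hypothetical, and partitioning, nesting, multiple starts and the consistency and small sample corrections are omitted.

```r
# One C-step: given a subset `idx` of h observations, compute its mean (T1) and
# covariance (C1), then return the h observations closest to (T1, C1).
c_step <- function(X, idx) {
  h <- length(idx)
  center  <- colMeans(X[idx, , drop = FALSE])
  scatter <- cov(X[idx, , drop = FALSE])
  d2 <- mahalanobis(X, center, scatter)   # squared distances relative to (T1, C1)
  sort(order(d2)[seq_len(h)])             # indices of the h smallest distances
}

# Simulated data (an assumption for illustration only).
set.seed(1)
n <- 50
p <- 2
X <- matrix(rnorm(n * p), nrow = n, ncol = p)
h <- floor((n + p + 1) / 2)

# A single random trial subset, iterated until the subset stops changing.
idx  <- sort(sample(n, h))
dets <- det(cov(X[idx, , drop = FALSE]))
for (i in 1:100) {
  new_idx <- c_step(X, idx)
  if (identical(new_idx, idx)) break
  idx  <- new_idx
  dets <- c(dets, det(cov(X[idx, , drop = FALSE])))
}

print(dets)  # sequence of determinants along the C-step iterations
```

The printed determinants should never increase from one step to the next, which is the property that lets FAST-MCD iterate a trial subset to a local optimum. A full implementation would repeat this from many trial subsets and keep the best result.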

  4. Computing of MCD: FAST-MCD
     • Partitioning: If the data set is large (e.g. > 600) it is partitioned into (five) disjoint subsets
       – Carry out C-step iterations for each of the subsets
       – Use the best (50) solutions as starting points for C-steps on the entire data set and again keep the best 10 solutions
       – Iterate these 10 solutions to convergence
     • Nesting: If the data set is larger than (say) 1500
       – draw a random subset and apply the partitioning procedure to it
       – use the 10 best solutions from the partitioning phase for iterations on the entire data set
     • The number of solutions used and the number of C-steps performed on the entire data set depend on its size

     Compound Estimators
     • MVE and MCD - a first stage procedure
     • Rousseeuw and Leroy 87, Rousseeuw and van Zomeren 91 - one step re-weighting
     • One-step M-estimates using Huber or Hampel function
     • Woodruff and Rocke 93, 96 - use MCD as a starting point for S-estimation or constrained M-estimation

     Using the estimators: Example
     Delivery Time Data
     – Rousseeuw and Leroy (1987), page 155, table 23 (Montgomery and Peck (1982)).
     – 25 observations in 3 variables
       • X1 Number of Products
       • X2 Distance
       • Y Delivery time
     – The aim is to explain the time required to service a vending machine (Y) by means of the number of products stocked (X1) and the distance walked by the route driver (X2).
     – delivery.x – the X-part of the data set

         > library(rrcov)
         Loading required package: robustbase
         Loading required package: MASS
         Scalable Robust Estimators with High Breakdown Point (version 0.3-03)
         > data(delivery)
         > delivery.x <- as.matrix(delivery[, 1:2])
         > mcd <- CovMcd(delivery.x)
         > mcd

         Call:
         CovMcd(x = delivery.x)

         Robust Estimate of Location:
           n.prod  distance
            5.895   268.053

         Robust Estimate of Covariance:
                    n.prod  distance
         n.prod      12.30    232.98
         distance   232.98  56158.36

  5. The CovMcd object
     • CovMcd() returns an S4 object of class CovMcd

         > data.class(mcd)
         [1] "CovMcd"

     • Input parameters used for controlling the estimation algorithm: alpha, quan, method, n.obs, etc.
     • Raw MCD estimates: crit, best, raw.center, raw.cov, raw.mah, raw.wt
     • Final (re-weighted) estimates – center, cov, mah, wt

         > summary(mcd)

         Call:
         CovMcd(x = delivery.x)

         Robust Estimate of Location:
           n.prod  distance
            5.895   268.053

         Robust Estimate of Covariance:
                    n.prod  distance
         n.prod      12.30    232.98
         distance   232.98  56158.36

         Eigenvalues of covariance matrix:
         [1] 56159.32    11.34

         Robust Distances:
          [1]  1.51872  0.68199  0.99165  0.73930  0.27939  0.13181  1.37029
          [8]  0.21985 57.68290  2.48532  9.30993  1.70046  0.30187  0.71296
         …

     The CovMcd object (cont.)
     • show(mcd)
     • summary(mcd) – additionally prints the eigenvalues of the covariance and the robust distances.
     • plot(mcd) - shows the Mahalanobis distances based on the robust and classical estimates of the location and the scatter matrix in different plots.
       – distance plot
       – distance-distance plot
       – chi-square plot
       – tolerance ellipses
       – scree plot

     Plot of the Robust Distances
     • The Mahalanobis distances based on the robust estimates – the outliers have large $RD_i$
     • A line is drawn at $y = \text{cutoff} = \sqrt{\chi^2_{p,\,0.975}}$
     • The observations with $RD_i \ge \text{cutoff} = \sqrt{\chi^2_{p,\,0.975}}$ are identified by their subscript
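The cutoff rule from the distance plot can be sketched in base R without rrcov. The center and covariance below are the rounded robust estimates printed in the CovMcd output above; the two test points, and the names `pts`, `rd` and `flagged`, are assumptions chosen only to illustrate the rule. Note that `mahalanobis()` returns squared distances, so the square root is taken before comparing with the cutoff.

```r
# Rounded robust estimates as printed in the CovMcd output for delivery.x.
center  <- c(n.prod = 5.895, distance = 268.053)
scatter <- matrix(c( 12.30,   232.98,
                    232.98, 56158.36), nrow = 2, byrow = TRUE)

# Two illustrative points (assumptions): the center itself and a far-away point.
pts <- rbind(at_center = c(5.895, 268.053),
             far_away  = c(30, 1460))

p  <- ncol(pts)
rd <- sqrt(mahalanobis(pts, center, scatter))  # robust distances RD_i
cutoff <- sqrt(qchisq(0.975, df = p))          # square root of the chi-square quantile

# Observations at or above the cutoff are flagged as outliers.
flagged <- rd >= cutoff
print(data.frame(rd = rd, flagged = flagged))
```

With an actual CovMcd fit, the same comparison would be made on the distances stored in the object rather than on hand-typed estimates.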
