
Finding Multivariate Outlier: Applied Multivariate Statistics, Spring 2012

Goals: detect outliers with the (robustly) estimated Mahalanobis distance and a QQ-plot. R: chisq.plot and pcout from package mvoutlier.


  1. Finding Multivariate Outlier Applied Multivariate Statistics – Spring 2012

  2. Goals
     - Concept: Detecting outliers with (robustly) estimated Mahalanobis distance and QQ-plot
     - R: chisq.plot, pcout from package “mvoutlier”

  3. Outlier in one dimension - easy
     - Look at scatterplots
     - Find dimensions of outliers
     - Find extreme samples just in these dimensions
     - Remove outlier

  4. 2d: More tricky (figure: scatterplot with one point marked “Outlier”; no outlier in x or y alone)

  5. Recap: Mahalanobis distance
     - True Mahalanobis distance: $MD(x) = \sqrt{(x - \mu)^T \Sigma^{-1} (x - \mu)}$
     - Estimated Mahalanobis distance: $\widehat{MD}(x) = \sqrt{(x - \hat{\mu})^T \hat{\Sigma}^{-1} (x - \hat{\mu})}$
     - Squared Mahalanobis distance: $MD^2(x)$ = squared distance from the mean in standard deviations IN THE DIRECTION OF x

  6. Mahalanobis distance: Example with $\mu = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$, $\Sigma = \begin{pmatrix} 25 & 0 \\ 0 & 1 \end{pmatrix}$

  7. Mahalanobis distance: Example (same $\mu$, $\Sigma$): the point (20, 0) has MD = 4

  8. Mahalanobis distance: Example (same $\mu$, $\Sigma$): the point (0, 10) has MD = 10

  9. Mahalanobis distance: Example (same $\mu$, $\Sigma$): the point (10, 7) has MD = 7.3
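
The example values on slides 6 to 9 can be checked with a few lines of code. The following is a minimal Python sketch (standard library only), not the R code used in the lecture; the function name `mahalanobis_2d` is an assumption made for illustration and it only handles the two-dimensional case.

```python
import math

def mahalanobis_2d(x, mu, sigma):
    """Mahalanobis distance of a 2-d point x from mean mu under covariance sigma."""
    (a, b), (c, d) = sigma
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    # Inverse of a 2x2 matrix, written out explicitly.
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx = (x[0] - mu[0], x[1] - mu[1])
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu).
    q = (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
         + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))
    return math.sqrt(q)

# Mean and covariance from the slide example.
mu = (0.0, 0.0)
sigma = ((25.0, 0.0), (0.0, 1.0))

for point in [(20, 0), (0, 10), (10, 7)]:
    print(point, round(mahalanobis_2d(point, mu, sigma), 1))
```

The point (20, 0) is far from the mean in Euclidean terms but lies along the direction with standard deviation 5, so it is only 4 standard deviations away; (0, 10) is closer in Euclidean terms but 10 standard deviations away.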

  10. Theory of Mahalanobis Distance
      - Assume the data is multivariate normally distributed (d dimensions)
      - Then the squared Mahalanobis distance of the samples follows a Chi-Square distribution with d degrees of freedom
      - (“By definition”: the sum of squares of d independent standard normal random variables has a Chi-Square distribution with d degrees of freedom.)
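
A short derivation may help to see where the Chi-Square distribution comes from. It is a sketch under the assumption that $\Sigma$ is positive definite and that the true $\mu$ and $\Sigma$ are used; the standardized vector $z$ is notation introduced here, not on the slides.

```latex
x \sim N_d(\mu, \Sigma)
\;\Longrightarrow\;
z = \Sigma^{-1/2}(x - \mu) \sim N_d(0, I_d)

MD^2(x) = (x - \mu)^T \Sigma^{-1} (x - \mu) = z^T z = \sum_{j=1}^{d} z_j^2 \sim \chi^2_d
```

Note that it is the squared distance $MD^2(x)$, not $MD(x)$ itself, that is compared with the Chi-Square distribution.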

  11. Check for multivariate outlier
      - Are there samples with estimated Mahalanobis distance that don’t fit at all to a Chi-Square distribution?
      - Check with a QQ-Plot
      - Technical details:
        - Chi-Square distribution is still reasonably good for estimated Mahalanobis distance
        - use robust estimates for $\mu$, $\Sigma$
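
The QQ-plot check can be sketched numerically: sort the squared distances and pair each with the corresponding Chi-Square quantile. The Python sketch below is an illustration only, not the `chisq.plot` function from mvoutlier. It assumes d = 2, where the Chi-Square quantile has a closed form; the helper names, the plotting positions (i - 0.5)/n and the squared distances are all hypothetical.

```python
import math

def chi2_quantile_2df(p):
    """Quantile of the Chi-Square distribution with 2 degrees of freedom.

    For d = 2 the cdf is 1 - exp(-q / 2), so it inverts in closed form.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return -2.0 * math.log(1.0 - p)

def qq_pairs(squared_distances):
    """Pair sorted squared distances with Chi-Square(2) quantiles."""
    n = len(squared_distances)
    ordered = sorted(squared_distances)
    # Plotting positions (i - 0.5) / n for i = 1..n are one common convention.
    return [(chi2_quantile_2df((i + 0.5) / n), value)
            for i, value in enumerate(ordered)]

# Hypothetical squared distances; the last one is far too large.
md2 = [0.3, 0.9, 1.4, 2.1, 3.0, 4.2, 53.0]
for theoretical, observed in qq_pairs(md2):
    print(round(theoretical, 2), observed)
```

Points that follow the Chi-Square distribution have observed values close to the theoretical ones; a sample whose observed value lies far above its theoretical quantile is the point “sticking out” in the QQ-plot.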

  12. Robust Estimates: Income of 7 people (figure labels: Robust Scatter, Std. Dev.)

  13. Robust Std. Dev.

  14. Robust Std. Dev.

  15. Robust Estimates for outlier detection
      - If scatter is estimated robustly, outliers “stick out” much more
      - Robust Mahalanobis distance: mean and covariance matrix estimated robustly
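
A one-dimensional sketch shows why robust scatter makes outliers stick out. The income values below are hypothetical (the data from the slides are not available in this transcript), and the median with the scaled MAD is only one possible choice of robust estimates, assumed here for illustration.

```python
import statistics

# Hypothetical yearly incomes (in thousands) of 7 people; the last is extreme.
incomes = [30, 32, 35, 38, 40, 42, 1000]

def classical_score(x, data):
    """Distance from the mean in units of the sample standard deviation."""
    return abs(x - statistics.mean(data)) / statistics.stdev(data)

def robust_score(x, data):
    """Distance from the median in units of the scaled MAD."""
    med = statistics.median(data)
    # MAD scaled by 1.4826 so that it estimates the std. dev. for normal data.
    mad = 1.4826 * statistics.median([abs(v - med) for v in data])
    return abs(x - med) / mad

print(classical_score(1000, incomes))
print(robust_score(1000, incomes))
```

The extreme value inflates both the mean and the standard deviation, so its classical score stays small; the median and MAD are barely affected, so its robust score is very large.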

  16. Example - continued: Outlier easily detected!

  17. Outliers in >2d can be well hidden! No outlier, right?

  18. Outliers in >2d can be well hidden! Wrong!

  19. Outliers in >2d can be well hidden! This outlier can’t be seen in the scatterplot matrix (but in a 3d plot)

  20. Method 1: Quantile of Chi-Square distribution
      - Compute for each sample (in d dimensions) the squared robustly estimated Mahalanobis distance MD²(x_i)
      - Compute the 97.5%-Quantile Q of the Chi-Square distribution with d degrees of freedom
      - All samples with MD²(x_i) > Q are declared outliers
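
Method 1 can be sketched as follows. This Python sketch assumes d = 2, where the Chi-Square quantile has the closed form -2 ln(1 - p); for general d one would use a library routine such as `qchisq(0.975, df = d)` in R. The function names and the squared distances are hypothetical, and the cutoff is applied to squared distances.

```python
import math

def chi2_quantile_2df(p):
    """Quantile of the Chi-Square distribution with 2 degrees of freedom."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return -2.0 * math.log(1.0 - p)

def flag_outliers(squared_distances, level=0.975):
    """Flag samples whose squared distance exceeds the Chi-Square(2) quantile."""
    cutoff = chi2_quantile_2df(level)
    return [md2 > cutoff for md2 in squared_distances]

# Hypothetical squared robust Mahalanobis distances of four samples.
md2 = [0.5, 2.0, 6.9, 16.0]
print(round(chi2_quantile_2df(0.975), 2))
print(flag_outliers(md2))
```

With a fixed 97.5% quantile, roughly 2.5% of perfectly regular samples are flagged even when the data contain no outliers at all, which is one motivation for the adjusted quantile of Method 2.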

  21. Method 2: Adjusted Quantile
      - Adjusted quantile for outliers: depends on the distance between the cdf of the Chi-Square distribution and the ecdf of the samples in the tails
      - Simulate “normal” deviations in the tails
      - Outliers have “abnormally large” deviations in the tails (e.g. more than seen in 100 simulations without outliers)

  22. Method 2: Adjusted Quantile (figure annotations: ECDF leaves “plausible” range; this defines the adaptive cutoff)

  23. Method 2: Adjusted Quantile: function “aq.plot”

  24. Method 3: State of the art - pcout
      - Complex method based on robust principal components
      - Pretty involved methodology
      - Very fast; good for high dimensions
      - R: function “pcout” in package “mvoutlier”
        - $wfinal01: 0 is outlier
        - $wfinal: small values are more severe outliers
      - P. Filzmoser, R. Maronna, M. Werner. Outlier identification in high dimensions, Computational Statistics and Data Analysis, 52, 1694-1711, 2008

  25. Automatic outlier detection
      - It is always better to look at a QQ-plot to find outliers! Just find points “sticking out”; no distributional assumption
      - If you can’t: automatic outlier detection
        - (-) usually finds too many or too few outliers, depending on parameter settings
        - (-) depends on distributional assumptions (e.g. multivariate normality)
        - (+) good for screening large amounts of data

  26. Concepts to know
      - Find multivariate outliers with robustly estimated Mahalanobis distance
      - Cutoff
        - by eye (best method)
        - quantile of Chi-Square distribution

  27. R commands to know
      - chisq.plot, pcout in package “mvoutlier”

  28. Next week
      - Missing values
