
SLIDE 1

Finding Multivariate Outlier

Applied Multivariate Statistics – Spring 2012

SLIDE 2

Goals

  • Concept: Detecting outliers with (robustly) estimated Mahalanobis distance and QQ-plot

  • R: chisq.plot, pcout from package “mvoutlier”

SLIDE 3

Outlier in one dimension - easy

  • Look at scatterplots
  • Find dimensions of outliers
  • Find extreme samples just in these dimensions
  • Remove outlier

SLIDE 4

2d: More tricky

[Figure: 2d example; annotations: “Outlier”, “No outlier in x or y”]

SLIDE 5

Recap: Mahalanobis distance

  • True Mahalanobis distance: $\mathrm{MD}(x) = \sqrt{(x - \mu)^T \Sigma^{-1} (x - \mu)}$
  • Estimated Mahalanobis distance: $\widehat{\mathrm{MD}}(x) = \sqrt{(x - \hat{\mu})^T \hat{\Sigma}^{-1} (x - \hat{\mu})}$
  • Sq. Mahalanobis distance $\mathrm{MD}^2(x)$ = sq. distance from the mean in standard deviations IN DIRECTION OF X

SLIDE 6

Mahalanobis distance: Example

$\Sigma = \begin{pmatrix} 25 & 0 \\ 0 & 1 \end{pmatrix}, \quad \mu = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$

SLIDE 7

Mahalanobis distance: Example

$\Sigma = \begin{pmatrix} 25 & 0 \\ 0 & 1 \end{pmatrix}, \quad \mu = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$

Point (20, 0): MD = 4

SLIDE 8

Mahalanobis distance: Example

$\Sigma = \begin{pmatrix} 25 & 0 \\ 0 & 1 \end{pmatrix}, \quad \mu = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$

Point (0, 10): MD = 10

SLIDE 9

Mahalanobis distance: Example

$\Sigma = \begin{pmatrix} 25 & 0 \\ 0 & 1 \end{pmatrix}, \quad \mu = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$

Point (10, 7): MD = 7.3
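
The three distances from the example slides can be checked numerically. The following is a minimal sketch in Python, assuming mean (0, 0) and covariance diag(25, 1), which is what the stated distances imply. The helper name `mahalanobis_2d` is hypothetical and only handles the 2x2 case.

```python
import math

def mahalanobis_2d(x, mu, sigma):
    """Mahalanobis distance of a 2d point x for mean mu and 2x2 covariance sigma."""
    (a, b), (c, d) = sigma
    det = a * d - b * c
    if a <= 0 or det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    # Inverse of a 2x2 matrix: (1 / det) * [[d, -b], [-c, a]]
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    sq = dx * (inv[0][0] * dx + inv[0][1] * dy) + dy * (inv[1][0] * dx + inv[1][1] * dy)
    return math.sqrt(sq)

mu = (0.0, 0.0)
sigma = ((25.0, 0.0), (0.0, 1.0))  # assumed: std. dev. 5 in x, 1 in y, no correlation

for point in [(20, 0), (0, 10), (10, 7)]:
    print(point, round(mahalanobis_2d(point, mu, sigma), 1))
```

The point (20, 0) is far from the mean in Euclidean terms but only 4 standard deviations away along x, while (0, 10) is 10 standard deviations away along y.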

SLIDE 10

Theory of Mahalanobis Distance

Assume the data are multivariate normally distributed (d dimensions).

The squared Mahalanobis distance of the samples then follows a Chi-Square distribution with d degrees of freedom (“by definition”: the sum of d squared independent standard normal random variables has a Chi-Square distribution with d degrees of freedom).
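
The Chi-Square statement can be made plausible with a small simulation. This is a sketch under simplifying assumptions: d = 2, true mean 0 and identity covariance (so the squared Mahalanobis distance is just the sum of the two squared coordinates), and an arbitrary seed. For d = 2 the Chi-Square quantile has a closed form, which avoids any external library.

```python
import math
import random

def chi2_quantile_2df(p):
    """Quantile of the Chi-Square distribution with 2 degrees of freedom.

    For d = 2 the cdf is 1 - exp(-q / 2), so the quantile has a closed form.
    """
    return -2.0 * math.log(1.0 - p)

rng = random.Random(0)  # fixed seed, chosen arbitrarily for reproducibility
n = 20000

# Standard bivariate normal: mean 0, identity covariance, so MD^2 = z1^2 + z2^2.
sq_md = [rng.gauss(0, 1) ** 2 + rng.gauss(0, 1) ** 2 for _ in range(n)]

q = chi2_quantile_2df(0.975)
share_above = sum(d > q for d in sq_md) / n
mean_sq_md = sum(sq_md) / n

print(f"97.5% quantile for d = 2: {q:.3f}")
print(f"share of samples above it: {share_above:.4f}")  # should be close to 0.025
print(f"mean squared distance: {mean_sq_md:.3f}")        # should be close to d = 2
```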

SLIDE 11

Check for multivariate outlier

  • Are there samples whose estimated Mahalanobis distance doesn’t fit a Chi-Square distribution at all?
  • Check with a QQ-plot
  • Technical details:
  • The Chi-Square distribution is still a reasonably good approximation for the estimated Mahalanobis distance
  • Use robust estimates for $\mu$, $\Sigma$
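
The coordinates of such a QQ-plot can be sketched as follows: the ordered squared distances are paired with the matching Chi-Square quantiles. The sketch assumes d = 2 (closed-form quantile), uses the plotting positions (i - 0.5) / n as one common convention, and runs on made-up distances. The function name `chi2_qq_pairs_2df` is hypothetical.

```python
import math

def chi2_qq_pairs_2df(sq_distances):
    """Return (theoretical quantile, observed value) pairs for a Chi-Square QQ-plot, d = 2."""
    ordered = sorted(sq_distances)
    n = len(ordered)
    pairs = []
    for i, observed in enumerate(ordered, start=1):
        p = (i - 0.5) / n                       # plotting position (one common convention)
        theoretical = -2.0 * math.log(1.0 - p)  # Chi-Square quantile, closed form for d = 2
        pairs.append((theoretical, observed))
    return pairs

# Hypothetical squared Mahalanobis distances; the last value is a planted outlier.
sq_md = [0.1, 0.3, 0.6, 0.9, 1.2, 1.6, 2.1, 2.8, 3.8, 25.0]

pairs = chi2_qq_pairs_2df(sq_md)
for theoretical, observed in pairs:
    print(f"{theoretical:6.2f}  {observed:6.2f}")
```

Points that fit the Chi-Square distribution lie close to the diagonal; the planted value is far above its theoretical quantile and “sticks out”.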

SLIDE 12

Robust Estimates: Income of 7 people

[Figure labels: “Robust Scatter”, “Std. Dev.”]
SLIDE 13

[Figure labels: “Robust”, “Std. Dev.”]
SLIDE 14

[Figure labels: “Robust”, “Std. Dev.”]
SLIDE 15

Robust Estimates for outlier detection

  • If scatter is estimated robustly, outliers “stick out” much more
  • Robust Mahalanobis distance: mean and covariance matrix estimated robustly
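
A one-dimensional illustration of why robust estimates make outliers stick out, as a hedged sketch: the incomes below are made up (the values from the slides are not part of this transcript), and median and scaled MAD are used as robust location and scatter. That choice is an assumption; the slides do not say which robust estimator they use.

```python
import statistics

# Hypothetical incomes of 7 people (illustrative values, not taken from the slides).
incomes = [40, 42, 45, 47, 50, 52, 500]

mean = statistics.mean(incomes)
sd = statistics.stdev(incomes)

median = statistics.median(incomes)
# Median absolute deviation, scaled by 1.4826 so that it estimates the
# standard deviation for normally distributed data.
mad = 1.4826 * statistics.median([abs(x - median) for x in incomes])

outlier = incomes[-1]
classical_score = (outlier - mean) / sd       # distance in classical standard deviations
robust_score = (outlier - median) / mad       # distance in robust scale units

print(f"classical: mean = {mean:.1f}, sd = {sd:.1f}, score = {classical_score:.1f}")
print(f"robust: median = {median}, MAD = {mad:.1f}, score = {robust_score:.1f}")
```

The single extreme income inflates both the mean and the standard deviation, so its classical score stays small, while the robust score is very large. In the multivariate case, a robustly estimated mean and covariance matrix play the same role inside the Mahalanobis distance.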

SLIDE 16

Example - continued

Outlier easily detected!

SLIDE 17

Outliers in >2d can be well hidden!

No outlier, right?

SLIDE 18

Outliers in >2d can be well hidden!

Wrong!

SLIDE 19

Outliers in >2d can be well hidden!

This outlier can’t be seen in the scatterplot matrix (but it can be seen in a 3d plot)
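
A constructed numerical example of such a hidden outlier, as a sketch: the covariance matrix and the point are hypothetical, chosen so that the data vary mostly within the plane x1 + x2 + x3 = 0. The squared Mahalanobis distance is computed for every pair of variables and for all three together, and compared with the 97.5% Chi-Square quantiles (rounded table values for 2 and 3 degrees of freedom).

```python
def solve(matrix, rhs):
    """Solve matrix * y = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    y = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * y[c] for c in range(r + 1, n))
        y[r] = (aug[r][n] - tail) / aug[r][r]
    return y

def sq_mahalanobis(x, mu, sigma):
    """Squared Mahalanobis distance (x - mu)^T sigma^{-1} (x - mu)."""
    diff = [xi - mi for xi, mi in zip(x, mu)]
    return sum(d * y for d, y in zip(diff, solve(sigma, diff)))

# Hypothetical covariance: little variance along the direction (1, 1, 1).
sigma = [[0.67, -0.33, -0.33],
         [-0.33, 0.67, -0.33],
         [-0.33, -0.33, 0.67]]
mu = [0.0, 0.0, 0.0]
x = [1.0, 1.0, 1.0]  # hypothetical sample lying off the plane

# 97.5% Chi-Square quantiles for 2 and 3 degrees of freedom (rounded table values).
Q2, Q3 = 7.378, 9.348

pair_sq_md = []
for i, j in [(0, 1), (0, 2), (1, 2)]:
    sigma_ij = [[sigma[i][i], sigma[i][j]], [sigma[j][i], sigma[j][j]]]
    d2 = sq_mahalanobis([x[i], x[j]], [mu[i], mu[j]], sigma_ij)
    pair_sq_md.append(d2)
    print(f"variables {i},{j}: MD^2 = {d2:.2f}, outlier: {d2 > Q2}")

d3 = sq_mahalanobis(x, mu, sigma)
print(f"all three variables: MD^2 = {d3:.2f}, outlier: {d3 > Q3}")
```

No pair of variables flags the point, yet its distance in all three dimensions is far beyond the cutoff.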

SLIDE 20

Method 1: Quantile of Chi-Square distribution

  • Compute for each sample (in d dimensions) the robustly estimated squared Mahalanobis distance $\mathrm{MD}^2(x_i)$
  • Compute the 97.5% quantile Q of the Chi-Square distribution with d degrees of freedom
  • All samples with $\mathrm{MD}^2(x_i) > Q$ are declared outliers
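
The three steps can be sketched in a few lines. This is a simplified illustration, not the mvoutlier implementation: it is restricted to d = 2 (where the Chi-Square quantile has a closed form), and the robust mean and covariance are placeholder values that would in practice come from a robust estimator. All names and numbers are hypothetical.

```python
import math

def sq_mahalanobis_2d(x, mu, sigma):
    """Squared Mahalanobis distance for a 2d point and a 2x2 covariance matrix."""
    (a, b), (c, d) = sigma
    det = a * d - b * c
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    # Quadratic form with the closed-form 2x2 inverse (1 / det) * [[d, -b], [-c, a]]
    return (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det

def flag_outliers(samples, mu, sigma, level=0.975):
    """Flag samples whose squared distance exceeds the Chi-Square quantile (d = 2)."""
    cutoff = -2.0 * math.log(1.0 - level)  # closed form, valid only for d = 2
    return [sq_mahalanobis_2d(x, mu, sigma) > cutoff for x in samples]

# Placeholder robust estimates (hypothetical values).
mu_robust = (0.0, 0.0)
sigma_robust = ((4.0, 1.8), (1.8, 1.0))

samples = [(1.0, 0.5), (-2.0, -1.0), (3.0, 1.2), (2.0, -1.0)]
flags = flag_outliers(samples, mu_robust, sigma_robust)

for x, is_outlier in zip(samples, flags):
    print(x, round(sq_mahalanobis_2d(x, mu_robust, sigma_robust), 2), is_outlier)
```

The last sample is only one standard deviation from the mean in each coordinate, but it runs against the positive correlation and is therefore flagged.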

SLIDE 21

Method 2: Adjusted Quantile

  • Adjusted quantile for outliers: depends on the distance between the cdf of the Chi-Square distribution and the ecdf of the samples in the tails
  • Simulate “normal” deviations in the tails
  • Outliers have “abnormally large” deviations in the tails (e.g. more than seen in 100 simulations without outliers)
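
The idea of comparing tail deviations with simulated ones can be sketched as follows. This is a simplified illustration and not the algorithm behind `aq.plot`: it only checks whether the tail deviation of a data set is larger than anything seen in 100 simulated outlier-free data sets, and it does not compute the adjusted cutoff itself. It assumes d = 2, squared distances as input, made-up data, and hypothetical function names.

```python
import math
import random

def chi2_cdf_2df(u):
    """Cdf of the Chi-Square distribution with 2 degrees of freedom."""
    return 1.0 - math.exp(-u / 2.0)

def tail_deviation(sq_distances, delta):
    """Largest amount by which the Chi-Square cdf exceeds the ecdf for u >= delta."""
    ordered = sorted(sq_distances)
    n = len(ordered)
    worst = 0.0
    for j, u in enumerate(ordered):  # j samples come before position j
        if u >= delta:
            # The ecdf just below u is j / n; the gap is largest right before a jump.
            worst = max(worst, chi2_cdf_2df(u) - j / n)
    return worst

def simulate_critical_value(n, delta, n_sim=100, seed=0):
    """Largest tail deviation seen in n_sim simulated data sets without outliers."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(n_sim):
        clean = [rng.gauss(0, 1) ** 2 + rng.gauss(0, 1) ** 2 for _ in range(n)]
        worst = max(worst, tail_deviation(clean, delta))
    return worst

delta = -2.0 * math.log(1.0 - 0.975)  # start of the tail: 97.5% quantile, d = 2

# Hypothetical data: 200 regular squared distances plus 20 planted outliers.
rng = random.Random(1)
regular = [rng.gauss(0, 1) ** 2 + rng.gauss(0, 1) ** 2 for _ in range(200)]
planted = [40.0 + i for i in range(20)]
data = regular + planted

critical = simulate_critical_value(len(data), delta)
observed = tail_deviation(data, delta)

print(f"largest simulated deviation: {critical:.3f}")
print(f"observed deviation: {observed:.3f}")
print("abnormally large tail deviation:", observed > critical)
```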

SLIDE 22

Method 2: Adjusted Quantile

[Figure annotations: ECDF leaves “plausible” range; defines adaptive cutoff]

SLIDE 23

Method 2: Adjusted Quantile Function “aq.plot”

SLIDE 24

Method 3: State of the art - pcout

  • Complex method based on robust principal components
  • Pretty involved methodology
  • Very fast – good for high dimensions
  • R: Function “pcout” in package “mvoutlier”
  • $wfinal01: 0 marks an outlier
  • $wfinal: smaller values indicate more severe outliers
  • P. Filzmoser, R. Maronna, M. Werner. Outlier identification in high dimensions. Computational Statistics and Data Analysis, 52, 1694-1711, 2008

SLIDE 25

Automatic outlier detection

  • It is always better to look at a QQ-plot to find outliers! Just find points “sticking out”; no distributional assumption
  • If you can’t: automatic outlier detection
  • usually finds too many or too few outliers, depending on parameter settings
  • depends on distributional assumptions (e.g. multivariate normality)
  • + good for screening large amounts of data

SLIDE 26

Concepts to know

  • Find multivariate outliers with robustly estimated Mahalanobis distance

  • Cutoff
  • by eye (best method)
  • quantile of Chi-Square distribution

SLIDE 27

R commands to know

  • chisq.plot, pcout in package “mvoutlier”

SLIDE 28

Next week

  • Missing values
