slide-1
SLIDE 1

Identification of local multivariate outliers

Anne Ruiz-Gazen and Christine Thomas-Agnan

Gremaq, TSE and IMT Toulouse, France (in collab. with Peter Filzmoser)

SSIAB - Avignon - 11/05/12

  • A. Ruiz-Gazen & C. Thomas-Agnan (TSE)

Local multivariate outliers SSIAB - Avignon - 11/05/12 1 / 24

slide-4
SLIDE 4

Introduction

In robust statistics, an observation is considered outlying if it differs from the main bulk of the data set.

$F_\varepsilon = (1 - \varepsilon) F + \varepsilon G$

In the case of continuous attributes, the main bulk of the data set is assumed to follow an elliptical distribution $F$ (e.g. Gaussian), and the outlying observations follow a distribution $G$ (e.g. a point mass).

Objective: identify/detect gross errors and atypical observations, taking into account the multivariate and the spatial nature of the data.
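To make the contamination model concrete, here is a minimal sketch of drawing a sample from a mixture (1 − ε)F + εG. The choices F = N(0, 1), G = point mass at 8, ε = 0.1 and the helper name `sample_contaminated` are assumptions made for illustration only; they do not come from the slides.

```python
import random

def sample_contaminated(n, eps, draw_f, draw_g, rng):
    """Draw n values from F_eps = (1 - eps) * F + eps * G.

    Each value comes from G with probability eps and from F otherwise.
    Returns the values and a parallel list of flags (True = drawn from G).
    """
    values, from_g = [], []
    for _ in range(n):
        contaminated = rng.random() < eps
        values.append(draw_g(rng) if contaminated else draw_f(rng))
        from_g.append(contaminated)
    return values, from_g

rng = random.Random(0)
# Illustrative assumptions: F = N(0, 1), G = point mass at 8, eps = 0.1.
values, from_g = sample_contaminated(
    1000, 0.1, lambda r: r.gauss(0.0, 1.0), lambda r: 8.0, rng
)
share_from_g = sum(from_g) / len(from_g)  # should be close to eps
```

In real data the flags are of course unknown; recovering them is the detection problem.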

slide-6
SLIDE 6

Introduction

Fig.: Map of the Kola project area (Norway, Finland, Russia; Barents Sea, White Sea). Legend: mine, in production; mine, closed down; important mineral occurrence, not developed; smelter, production of mineral concentrate; city, town, settlement; project boundary.

The Kola project: concentration measurements for more than 50 chemical elements in four layers, with 617 observations. The data are available in the R package mvoutlier by M. Gschwandtner and P. Filzmoser.

slide-7
SLIDE 7

1 Detection of outliers in a non spatial context
  • Detection of univariate outliers
  • Detection of multivariate outliers

2 Spatial outliers
  • Global and local outliers
  • Identification of univariate spatial outliers

3 Identification of multivariate spatial outliers
  • Variocloud of pairwise Mahalanobis distances
  • Toy example
  • Quantile geographical-variate plot

slide-12
SLIDE 12

Detection of outliers in a non spatial context

Detection of univariate outliers

Let us consider an n × p data set x, with n observations $x_i$ and p variables.

In one dimension (p = 1), the detection of outliers is often based on $|x_i - \bar{x}| / \sigma_x$ (Grubbs, 1969).

Masking effect: outliers may distort the empirical mean and standard deviation so much that the outliers themselves are not detected.

Robust version: $\bar{x}$ and $\sigma_x$ are replaced by robust estimators such as the median and the MAD.
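The masking effect can be seen on a small made-up sample. The sketch below compares the classical score with its median/MAD counterpart. The data, the cut-off of 2 and the factor 1.4826 (the usual scaling that makes the MAD comparable to a standard deviation under normality) are assumptions for illustration.

```python
import statistics

def classical_scores(x):
    """|x_i - mean| / standard deviation (non-robust)."""
    mean = statistics.fmean(x)
    sd = statistics.stdev(x)
    return [abs(v - mean) / sd for v in x]

def robust_scores(x):
    """|x_i - median| / MAD, with the MAD scaled by 1.4826 (assumed convention)."""
    med = statistics.median(x)
    mad = 1.4826 * statistics.median([abs(v - med) for v in x])
    return [abs(v - med) / mad for v in x]

# Made-up sample: eight values near 2 and two gross errors at the end.
x = [2.1, 1.9, 2.0, 2.2, 1.8, 2.0, 2.1, 1.9, 9.0, 9.5]
cutoff = 2.0  # illustrative threshold

classical_flags = [i for i, s in enumerate(classical_scores(x)) if s > cutoff]
robust_flags = [i for i, s in enumerate(robust_scores(x)) if s > cutoff]
```

On this sample the two large values inflate the mean and the standard deviation enough that the classical rule flags nothing, while the median/MAD rule flags exactly the last two observations.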

slide-13
SLIDE 13

Detection of univariate outliers

Fig.: Histogram of log(Arsenic). Axis: log10(As). Overlays: mean +/− 2*s, median +/− 2*MAD, boxplot.

slide-17
SLIDE 17

Detection of multivariate outliers

In the multivariate case (p > 1), detection is based on Mahalanobis distances to the center of the data set. Let t be a location estimator (p × 1) and C a dispersion matrix estimator (p × p) of the distribution of the main bulk of the data.

$\mathrm{MD}(x_i, t, C) = \left[ (x_i - t)' C^{-1} (x_i - t) \right]^{1/2}$

Multivariate outliers are associated with large Mahalanobis distances.

In the Gaussian case N(µ, Σ), the squared distances $\mathrm{MD}^2(x_i, \mu, \Sigma)$ follow a chi-square distribution with p degrees of freedom, and a commonly used cut-off value is the 95% quantile of this chi-square distribution.
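As a small illustration of the distance and its cut-off, the sketch below works in dimension p = 2, where the chi-square quantile has a closed form (a chi-square with 2 degrees of freedom has distribution function 1 − exp(−x/2)). The values of t and C and the two test points are made up for the example.

```python
import math

def mahalanobis_2d(x, t, C):
    """MD(x, t, C) = sqrt((x - t)' C^{-1} (x - t)) for p = 2."""
    (a, b), (c, d) = C
    det = a * d - b * c
    dx, dy = x[0] - t[0], x[1] - t[1]
    # Quadratic form written with the closed-form inverse of a 2 x 2 matrix.
    q = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.sqrt(q)

# 95% quantile of the chi-square distribution with 2 degrees of freedom.
cutoff_sq = -2.0 * math.log(1.0 - 0.95)

# Made-up location and dispersion with positive correlation.
t = (0.0, 0.0)
C = ((2.0, 1.0), (1.0, 2.0))

along = (2.5, 2.5)     # lies along the main axis of the ellipse
across = (2.5, -2.5)   # same Euclidean norm, but against the correlation

md_along = mahalanobis_2d(along, t, C)
md_across = mahalanobis_2d(across, t, C)
flag_along = md_along ** 2 > cutoff_sq
flag_across = md_across ** 2 > cutoff_sq
```

Both points are equally far from t in Euclidean terms, but only the one that contradicts the correlation structure exceeds the cut-off.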

slide-20
SLIDE 20

Detection of multivariate outliers

(Rousseeuw and Van Zomeren, 1990)

Use robust estimators t and C such as the Minimum Covariance Determinant (MCD) estimators: look for the subset of data points (e.g. 75% of them) whose covariance matrix has the smallest determinant.

R packages mvoutlier and rrcov.
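The subset search behind the MCD idea can be sketched by brute force on a tiny two-dimensional data set. This is an exhaustive search over all subsets of size h, practical only for very small n. It is not how the R packages compute the estimator, and it omits any rescaling or reweighting of the raw estimate. The data and function names are hypothetical.

```python
from itertools import combinations

def mean_and_cov_2d(points):
    """Sample mean and sample covariance matrix of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), ((sxx, sxy), (sxy, syy))

def raw_mcd_2d(points, h):
    """Return (det, subset, t, C) for the h-subset with the smallest covariance determinant."""
    best = None
    for subset in combinations(range(len(points)), h):
        t, C = mean_and_cov_2d([points[i] for i in subset])
        det = C[0][0] * C[1][1] - C[0][1] ** 2
        if best is None or det < best[0]:
            best = (det, subset, t, C)
    return best

# Made-up data: six regular points and two gross outliers (indices 6 and 7).
points = [(0.0, 0.0), (1.0, 0.5), (2.0, 1.5), (1.0, 1.5),
          (0.0, 1.0), (2.0, 0.5), (10.0, 10.0), (11.0, 9.0)]
h = 6  # 75% of the 8 observations

det, subset, t, C = raw_mcd_2d(points, h)
```

Here the selected subset is the six regular points, so t and C are computed without the two outliers.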

slide-21
SLIDE 21

Detection of multivariate outliers

In two dimensions: scatterplot with tolerance ellipses for the 95% quantile, based on the non-robust estimators t and C (blue) and on the MCD estimators (red).

slide-24
SLIDE 24

Spatial outliers

Global and local outliers

Global outlier: outlying relative to the distribution of the whole data set (whole area of study).

Local outlier: outlying relative to the sub-distribution associated with the observation and its neighborhood.

Underlying assumption: positive spatial autocorrelation.

slide-26
SLIDE 26

Exploratory plots for identifying univariate spatial outliers

  • Neighbor plot
  • Moran plot
  • Drift map
  • Angle plot
  • Variocloud

R package GeoXp by C. Thomas et al. (2012).

slide-30
SLIDE 30

Identification of multivariate spatial outliers

Variocloud of pairwise Mahalanobis distances

$\mathrm{MD}(x_i, x_j, C) = \left[ (x_i - x_j)' C^{-1} (x_i - x_j) \right]^{1/2}$

Draw a variocloud by replacing the absolute pairwise differences with robust pairwise Mahalanobis distances (MCD covariance estimator).

Draw only part of the cloud and summarize the rest by conditional quantile curves.
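The points of such a variocloud can be sketched as follows for two attributes: every pair of sites contributes one point, with the spatial distance as one coordinate and the pairwise Mahalanobis distance as the other. The matrix C is taken as given; in the approach described here it would be a robust (MCD) estimate, while the identity matrix used below is only a placeholder. The coordinates, attribute values and selection thresholds are made up.

```python
import math
from itertools import combinations

def variocloud_pairs(coords, values, C):
    """Return (spatial distance, pairwise Mahalanobis distance, i, j) for every pair i < j."""
    (a, b), (c, d) = C
    det = a * d - b * c
    pairs = []
    for i, j in combinations(range(len(coords)), 2):
        spatial = math.dist(coords[i], coords[j])
        dx = values[i][0] - values[j][0]
        dy = values[i][1] - values[j][1]
        # (x_i - x_j)' C^{-1} (x_i - x_j) with the closed-form 2 x 2 inverse.
        md = math.sqrt((d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det)
        pairs.append((spatial, md, i, j))
    return pairs

# Made-up sites: three close neighbors and one remote site.
coords = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
# Made-up attributes: site 2 differs strongly from its neighbors.
values = [(0.0, 0.0), (0.2, 0.1), (4.0, -3.0), (0.1, 0.3)]
C = ((1.0, 0.0), (0.0, 1.0))  # placeholder for a robust covariance estimate

cloud = variocloud_pairs(coords, values, C)
# Pairs that are close in space but far apart in attribute space (illustrative thresholds).
selected = [(i, j) for s, md, i, j in cloud if s <= 2.0 and md >= 3.0]
```

Site 2 appears in every selected pair, which is the pattern that points to a local outlier candidate.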

slide-31
SLIDE 31

Example of multivariate variocloud

Selected units: pairs with small spatial distances and high pairwise Mahalanobis distances.

Fig.: Multivariate variocloud. Panels: pairwise Mahalanobis distances and pairwise spatial distances; map of the observations (X coordinate, Y coordinate).

slide-32
SLIDE 32

A small toy example

Fig.: Toy example. Panels: scatterplot of the attributes (Variable 1, Variable 2); map of the observations (X coordinate, Y coordinate).

Comparing pairwise geographical distances and pairwise distances in the non-spatial attribute space.

slide-33
SLIDE 33

Variocloud for the toy example

Fig.: Toy example. Panels: pairwise Mahalanobis distances and pairwise spatial distances; map of the observations (X coordinate, Y coordinate).

slide-34
SLIDE 34

Distribution of the pairwise Mahalanobis distance

When the observations $X_1, \ldots, X_n$ are i.i.d. with a normal distribution N(µ, Σ), we can prove that, conditional on one observation $X_i$, the squared pairwise Mahalanobis distance $\mathrm{MD}^2(X_i, X_j, \Sigma)$ of $X_j$, $j \neq i$, follows a non-central chi-square distribution with p degrees of freedom and with the squared Mahalanobis distance $\mathrm{MD}^2(X_i, \mu, \Sigma)$ as the non-centrality parameter.
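A short sketch of why this holds, written out as a standardization argument. The symbols $x_i$ (the fixed value of $X_i$), $Z$ and $\lambda$ are notation introduced here, and the non-centrality parameter is taken as the squared norm of the mean vector, which is one common convention.

```latex
% Conditional on X_i = x_i, and using the independence of X_i and X_j (j \neq i):
\begin{align*}
  Z &= \Sigma^{-1/2}(X_j - x_i)
      \;\sim\; N\!\left(\Sigma^{-1/2}(\mu - x_i),\; I_p\right), \\
  \mathrm{MD}^2(x_i, X_j, \Sigma) &= (X_j - x_i)'\,\Sigma^{-1}(X_j - x_i) = Z'Z
      \;\sim\; \chi^2_p(\lambda), \\
  \lambda &= \left\lVert \Sigma^{-1/2}(\mu - x_i) \right\rVert^2
      = (x_i - \mu)'\,\Sigma^{-1}(x_i - \mu)
      = \mathrm{MD}^2(x_i, \mu, \Sigma).
\end{align*}
% In particular, E[ MD^2(x_i, X_j, \Sigma) | X_i = x_i ] = p + \lambda.
```

An observation far from the center therefore has stochastically larger pairwise distances to all other observations, which is why the pairwise distances are judged against quantiles of this conditional distribution and not against a single fixed threshold.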

slide-35
SLIDE 35

Distribution of pairwise MD: toy example

Fig.: Toy example. Histograms of the squared pairwise Mahalanobis distances (pairwise MD^2, density) for three observations: filled square, filled triangle, filled circle.

slide-36
SLIDE 36

Quantile geographical-variate plot

Fig.: Toy example. Two panels, each with axes Variable 1 and Variable 2.

slide-37
SLIDE 37

Quantile geographical-variate plot

Fig.: Toy example. Axes: percentage of next 10 neighbors and degree of isolation. Panels: regular observations; outlying observations.

slide-38
SLIDE 38

Quantile geographical-variate plot on Kola example

Fig.: Kola example. Axes: percentage of neighbors inside tolerance ellipse and quantile of non-central chi-square distribution. Panels: regular observations; outlying observations.

slide-39
SLIDE 39

Conclusion

Thank you for your attention!
