SLIDE 1

Outlier Detection with Application to Geochemistry

Peter Filzmoser
Department of Statistics and Probability Theory
Vienna University of Technology, Austria

Vienna, Austria

June 16, 2006

Univariate versus Multivariate Outliers

[Figure: scatter plot; both axes run from −3 to 3]

SLIDE 2

Univariate versus Multivariate Outliers

[Figure: the same scatter plot built up over four slides; both axes run from −3 to 3]

SLIDE 3

Multivariate Outlier Detection Methods

Standard methods are based on the Mahalanobis distances (MD):

$\mathrm{MD}_i := d(x_i, t, C) = \{(x_i - t)^\top C^{-1} (x_i - t)\}^{1/2}$

for a sample $x_1, \ldots, x_n \in \mathbb{R}^p$ and estimators of location $t$ and covariance $C$.

⇒ Robust estimates of location and covariance are needed!

Outlier detection: Outliers will typically have large distance. If multivariate normal distribution is assumed, $\mathrm{MD}_i^2$ is approx. $\chi^2_p$ distributed.

⇒ suspect observations: $\mathrm{MD}_i^2 > \chi^2_{p,0.975}$

  • does not account for different sample size
  • $\chi^2_p$-approximation is poor
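
As a minimal sketch of the rule above, the following Python snippet computes squared Mahalanobis distances for bivariate data and flags points beyond the 97.5% chi-square quantile. The function names, the sample points, and the values of t and C are illustrative assumptions, not taken from the slides; in practice t and C would come from a robust estimator. The code is restricted to p = 2, where both the inverse of C and the chi-square quantile have closed forms.

```python
import math

def squared_mahalanobis_2d(x, t, C):
    """Return MD^2 = (x - t)^T C^{-1} (x - t) for one bivariate point x."""
    (a, b), (c, d) = C
    det = a * d - b * c
    if a <= 0 or det <= 0:
        raise ValueError("C must be positive definite")
    dx, dy = x[0] - t[0], x[1] - t[1]
    # Inverse of a 2x2 matrix: (1/det) * [[d, -b], [-c, a]]
    return (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det

def chi2_quantile_2df(q):
    """Quantile of the chi-square distribution with 2 degrees of freedom.

    For p = 2 the distribution function is G(u) = 1 - exp(-u / 2),
    so the quantile has the closed form -2 * log(1 - q).
    """
    return -2.0 * math.log(1.0 - q)

# Assumed (hypothetical) estimates of location t and covariance C
t = (0.0, 0.0)
C = ((1.0, 0.9), (0.9, 1.0))

# Hypothetical observations
points = [(1.0, 1.0), (2.0, 2.0), (2.0, -2.0)]
cutoff = chi2_quantile_2df(0.975)

md2 = [squared_mahalanobis_2d(x, t, C) for x in points]
suspect = [d > cutoff for d in md2]
```

In this toy setting the point (2, −2) is unremarkable in each coordinate taken alone, yet it runs against the assumed positive correlation, so only its Mahalanobis distance marks it as suspect.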

Garrett (1989)

Chi-square plot: Plot robust $\mathrm{MD}_i^2$ against quantiles of $\chi^2_p$.

⇒ iterative deletion of points with large distance until a straight line appears.

Drawback: no automatic procedure, needs user interaction.
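
The coordinates of such a chi-square plot can be sketched as follows. This is an illustration under stated assumptions: the plotting positions (i − 0.5)/n are one common convention that the slides do not specify, the distances are made-up numbers, the function name is hypothetical, and the chi-square quantile uses the closed form that holds only for p = 2.

```python
import math

def chi2_plot_points(md2, drop_largest=0):
    """Pair ordered squared distances with chi-square(2) quantiles.

    drop_largest removes that many of the largest distances first,
    mimicking one step of the manual deletion described above.
    Plotting positions (i - 0.5) / n are an assumed convention.
    """
    ordered = sorted(md2)
    if drop_largest:
        ordered = ordered[:-drop_largest]
    n = len(ordered)
    # Closed-form chi-square quantile for p = 2: -2 * log(1 - q)
    quantiles = [-2.0 * math.log(1.0 - (i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(ordered, quantiles))

# Hypothetical squared robust distances; the last two are unusually large
md2 = [0.2, 0.5, 0.9, 1.4, 2.1, 3.0, 4.4, 25.0, 40.0]

all_points = chi2_plot_points(md2)
after_deletion = chi2_plot_points(md2, drop_largest=2)
```

Drawing each list as a scatter plot and judging whether it follows a straight line reproduces the manual procedure; how many points to drop at each step remains the user's decision, which is the drawback noted above.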

Example: Simulated data with outliers

[Figure: scatter plot of the simulated data]

Iterative deletion of outliers:

[Figure: Chi^2-Plot; axes: Ordered robust MD^2, Quantiles of Chi_p^2]

SLIDE 4

Iterative deletion of outliers:

[Figure: sequence of four Chi^2-Plots after successive deletions; axes: Ordered robust MD^2, Quantiles of Chi_p^2; the range of the ordered robust MD^2 shrinks from about 15 to about 6]

SLIDE 5

Example: Simulated data with outliers

[Figure: scatter plot of the simulated data]

New Proposal

$G(u)$ . . . theoretical distribution function of $\chi^2_p$,

$G_n(u)$ . . . empirical distribution function of $\mathrm{MD}_i^2$.

For $\eta = \chi^2_{p,1-\alpha}$ define

$p_n(\eta) = \sup_{u \ge \eta} \{G(u) - G_n(u)\}^+$.

Then a measure of outliers in the sample is

$\alpha_n(\eta) = \begin{cases} 0 & \text{if } p_n(\eta) \le p_{crit}(\eta, n, p) \\ p_n(\eta) & \text{if } p_n(\eta) > p_{crit}(\eta, n, p). \end{cases}$

$p_{crit}(\eta, n, p)$ can be obtained by simulations.
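
The two definitions above can be sketched in Python as follows. The function names and the sample distances are hypothetical, the critical value is passed in by the caller because the slides only state that it comes from simulations, and the distribution function G uses the closed form for p = 2.

```python
import math

def chi2_cdf_2df(u):
    """Distribution function G(u) of chi-square with 2 degrees of freedom."""
    return 1.0 - math.exp(-u / 2.0) if u > 0 else 0.0

def tail_departure(md2, eta):
    """Compute p_n(eta) = sup_{u >= eta} {G(u) - G_n(u)}^+.

    G_n is a right-continuous step function and G is increasing, so
    between two jumps the difference G - G_n is largest just before
    the next jump. It is therefore enough to examine u = eta and the
    left limits at the ordered distances above eta.
    """
    ordered = sorted(md2)
    n = len(ordered)
    # Value at u = eta: G_n(eta) is the share of distances <= eta
    below = sum(1 for d in ordered if d <= eta)
    best = chi2_cdf_2df(eta) - below / n
    for i, d in enumerate(ordered):
        if d > eta:
            # Just before d, at most i distances are smaller
            best = max(best, chi2_cdf_2df(d) - i / n)
    return max(best, 0.0)

def outlier_measure(md2, eta, p_crit):
    """alpha_n(eta): 0 if p_n(eta) <= p_crit, otherwise p_n(eta)."""
    p_n = tail_departure(md2, eta)
    return p_n if p_n > p_crit else 0.0

eta = -2.0 * math.log(1.0 - 0.975)  # chi-square(2) quantile for alpha = 0.025
p_crit = 0.05                       # hypothetical value, not from the slides

# Hypothetical squared robust distances, with and without a heavy tail
md2 = [0.1, 0.3, 0.6, 1.0, 1.5, 2.2, 3.1, 4.5, 20.0, 30.0]
clean = [0.1, 0.3, 0.6, 1.0, 1.5, 2.2, 3.1, 4.5, 5.0, 6.0]

alpha_tail = outlier_measure(md2, eta, p_crit)
alpha_clean = outlier_measure(clean, eta, p_crit)
```

With these made-up numbers the heavy-tailed sample yields a positive measure while the clean sample yields 0; a realistic p_crit would have to be simulated for the given n and p.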

Simulated Data Example

[Figure: scatter plot of the simulated data, and a plot of Cumulative probability against Ordered squared robust distance with the 97.5% quantile and the Adjusted quantile marked]

Example: Simulated Data

[Figure: two scatter plots of the simulated data; panels: Outliers based on 97.5% quantile, Outliers based on new method]
SLIDE 6

Example: Kola Data

Consider the O-horizon (organic surface soil) of the Kola data set. Take (more or less) typical elements for "pollution": As, Cd, Co, Cu, Mg, Pb, Zn.

Question: Where are the multivariate outliers?

KOLA-Project

[Figure: map of the project area in Norway, Finland, and Russia, bordered by the Barents Sea and the White Sea, with a scale bar from 0 to 200 km. Legend: Mine, in production; Mine, closed down; Important mineral occurrence, not developed; Smelter, production of mineral concentrate; City, town, settlement; Project boundary. Labelled places include Rovaniemi, Ivalo, Kirkenes, Nikel, Murmansk, Apatity, Kirovsk, Zapolyarnij, Monchegorsk, Kandalaksha, Kovdor, Kittilä, Olenegorsk, Keivitsa, Pahtavaara, Saattopora.]

Example: Map showing outliers

[Figure: map of sample locations; axes: XCOO, YCOO; RED points are outliers]

Choice of Symbols

[Figure: scatter plot; axes: x, y]

SLIDE 7

Choice of Symbols

[Figure: scatter plot; axes: x, y]

Symbols:
  • 25% quantile
  • 50% quantile
  • 75% quantile
  • Adjusted quantile
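
One way to turn these cut points into symbol classes is sketched below. It rests on assumptions the slides do not state: the percentages are read as quantiles of the chi-square distribution of the squared robust distance, the adjusted quantile is a made-up number, the function name is hypothetical, and the quantile formula is the closed form for p = 2.

```python
import bisect
import math

def symbol_class(md2_value, thresholds):
    """Return the index of the class that a squared distance falls into.

    thresholds must be sorted ascending; class 0 lies below the first
    threshold and class len(thresholds) lies at or above the last one.
    """
    return bisect.bisect_right(thresholds, md2_value)

# Chi-square(2) quantiles at 25%, 50%, 75% via the closed form -2*log(1-q)
quartiles = [-2.0 * math.log(1.0 - q) for q in (0.25, 0.50, 0.75)]
adjusted = 9.0  # hypothetical adjusted quantile, not a value from the slides
thresholds = quartiles + [adjusted]

# Hypothetical squared robust distances
md2 = [0.3, 1.0, 2.0, 5.0, 12.0]
classes = [symbol_class(d, thresholds) for d in md2]
```

Each class index can then be mapped to a plotting symbol, with the highest class reserved for observations beyond the adjusted quantile.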

Example: Map showing outliers

[Figure: map of sample locations; axes: XCOO, YCOO; Robust MD with symbols]

Which Outliers?

[Figure: scatter plot shown on two slides; axes: x, y]

SLIDE 8

Example: Map showing outliers

[Figure: map of sample locations; axes: XCOO, YCOO; Multivariate outlier plot]

Example: From Multivariate to Univariate

[Figure: Scaled data shown separately for each element: As, Cd, Co, Cu, Mg, Pb, Zn]

Example: Symbols from multivariate plot

[Figure: Scaled data shown separately for each element (As, Cd, Co, Cu, Mg, Pb, Zn), drawn with the symbols from the multivariate plot]

Summary

The R package mvoutlier, loaded with library(mvoutlier), includes

  • all routines to generate the presented plots
  • Kola data and other interesting geochemical data sets