Robust multivariate methods for compositional data (Peter Filzmoser)



slide-1
SLIDE 1

Robust multivariate methods for compositional data

Peter Filzmoser
Department of Statistics and Probability Theory
Vienna University of Technology

Compstat – Paris, France

August 23, 2010


slide-2
SLIDE 2

Contents

  • Characterization of compositional data
  • Examples
  • Transformations
  • Factor analysis
  • Robustness
  • Conclusions
slide-3
SLIDE 3

Joint work with:

  • Karel Hron, Univ. Olomouc, Czech Republic
  • Clemens Reimann, Geological Survey of Norway
  • Robert Garrett, Geological Survey of Canada

slide-4
SLIDE 4

Example household expenditures

Household expenditures in former HK$ (Aitchison, 1986)

Person  Housing  Foodstuff  Alcohol  Tobacco  Other goods  Total
     1      640        328      147      169          196   1480
     2     1800        484      515     2291          912   6002
     3     2085        445      725     8373         1732  13360
     4      616        331      126      117          149   1339
     5      875        368      191      290          275   1999
     6      770        364      196      242          236   1808
     7      990        415      284      588          420   2697
     8      414        305       94       68          112    993
   ...      ...        ...      ...      ...          ...    ...
    18     1195        443      329      974          523   3464
    19     2180        521      553     2781         1010   7045
    20     1017        410      225      419          345   2416

slide-5
SLIDE 5

Characterization of compositional data

Definition: Compositional data consist of real-valued vectors $x = (x_1, \dots, x_D)^t$ with D strictly positive components that describe the parts of a whole and carry only relative information (Aitchison, 1986; Egozcue, 2009).

Consequences:

  • The values $x_1, \dots, x_D$ as such are not informative; only their ratios are of interest.
  • The parts $x_1, \dots, x_D$ do not need to sum up to 1.
  • Compositional data follow the so-called Aitchison geometry on the simplex (and not the Euclidean geometry).

Most important reference:

  • J. Aitchison. The Statistical Analysis of Compositional Data. Chapman and Hall, London, U.K., 1986.

slide-6
SLIDE 6

Example Kola data

Kola data: library(StatDA), about 600 samples from 4 soil layers.

Book: Statistical Data Analysis Explained: Applied Environmental Statistics with R. Clemens Reimann, Peter Filzmoser, Robert Garrett, Rudolf Dutter.

[Figure: map of the survey area with scale bar (km) and north arrow, showing Murmansk, Monchegorsk, Apatity, Kovdor, Nikel, Zapoljarnij and Rovaniemi, bordered by the Barents Sea, Finland, Norway and Russia.]

slide-7
SLIDE 7

Example Kola data

Two dominant parts in the C-horizon:

[Figure: two plots. First plot axes: SiO2 in C-horizon [wt.-%] and Al2O3 in C-horizon [wt.-%]. Second plot axes: log(SiO2/TiO2) and log(Al2O3/TiO2).]

slide-8
SLIDE 8

Example factor analysis

(Reimann, Filzmoser, Garrett, 2002, Appl. Geochem.)

Kola moss data: library(StatDA), data(moss); 594 samples, 31 variables.

Factor analysis:

  • log-transformation
  • results presented in biplots

⇒ industrial contamination! BUT: We have compositional data!

[Figure: biplot of Factor 1 (26.5%) against Factor 2 (16.5%), with scores of the samples and loadings of the 31 variables Ag, Al, As, B, Ba, Bi, Ca, Cd, Co, Cr, Cu, Fe, Hg, K, Mg, Mn, Mo, Na, Ni, P, Pb, Rb, S, Sb, Si, Sr, Th, Tl, U, V, Zn.]

slide-9
SLIDE 9

Example household expenditures

Household expenditures in former HK$ (Aitchison, 1986): the same table as on Slide 4.

slide-10
SLIDE 10

Example household expenditures

Two versions: data with and without Tobacco. The data are normalized with the total expenditures.

[Figure: two scatterplots of the normalized shares. Panel "Normalized data without Tobacco": axes Housing and Foodstuff. Panel "Normalized data with Tobacco": axes Housing* and Foodstuff*.]

slide-11
SLIDE 11

Example household expenditures

Solution: consider (log-)ratios

[Figure: two index plots over the 20 observations. Panel "Normalized data without Tobacco": log(Housing/Foodstuff) against Index. Panel "Normalized data with Tobacco": log(Housing*/Foodstuff*) against Index.]

Normalization not necessary: same result with original data in HK$
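The claim can be checked numerically. The following is a minimal sketch in plain Python (the slides themselves refer to R), using the first three rows of the expenditure table; the helper names `shares` and `log_ratio` are illustrative and not taken from any package.

```python
import math

# Rows 1-3 of the household expenditure table (HK$):
# Housing, Foodstuff, Alcohol, Tobacco, Other goods
rows = [
    [640, 328, 147, 169, 196],
    [1800, 484, 515, 2291, 912],
    [2085, 445, 725, 8373, 1732],
]
TOBACCO = 3  # column index of Tobacco


def shares(parts):
    """Normalize the parts by their total."""
    total = sum(parts)
    return [p / total for p in parts]


def log_ratio(a, b):
    """Natural log of the ratio a / b."""
    return math.log(a / b)


for row in rows:
    with_tob = shares(row)
    without_tob = shares([p for i, p in enumerate(row) if i != TOBACCO])
    # The Housing share changes when Tobacco is dropped from the total ...
    print("Housing share:", round(with_tob[0], 3), round(without_tob[0], 3))
    # ... but log(Housing/Foodstuff) is identical for the raw HK$ values
    # and for both normalized versions.
    print("log-ratio:",
          log_ratio(row[0], row[1]),
          log_ratio(with_tob[0], with_tob[1]),
          log_ratio(without_tob[0], without_tob[1]))
```

The normalizing total cancels in every ratio, which is why the raw and the normalized data give the same log-ratios.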

slide-12
SLIDE 12

Geometrical properties

Compositional data with only 2 parts

[Figure: two plots of two-part compositions with axes "First part" and "Second part"; the second plot marks the values 0.1, 0.2, 0.5 and 0.6.]

Aitchison distance:

$$
d_A(x, \tilde{x}) = \sqrt{\frac{1}{D} \sum_{i=1}^{D-1} \sum_{j=i+1}^{D} \left( \ln\frac{x_i}{x_j} - \ln\frac{\tilde{x}_i}{\tilde{x}_j} \right)^2}
$$
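As a rough illustration, the Aitchison distance can be computed directly from its pairwise log-ratio form (square root of the sum over all pairs i < j, divided by D). This is a sketch in plain Python; the function name and the two-part example values are assumptions chosen for illustration, not taken from the slides' code.

```python
import math
from itertools import combinations


def aitchison_distance(x, x_tilde):
    """Aitchison distance between two compositions with the same D parts."""
    if len(x) != len(x_tilde):
        raise ValueError("compositions must have the same number of parts")
    D = len(x)
    total = 0.0
    for i, j in combinations(range(D), 2):  # all pairs with i < j
        diff = math.log(x[i] / x[j]) - math.log(x_tilde[i] / x_tilde[j])
        total += diff ** 2
    return math.sqrt(total / D)


# Two pairs of two-part compositions. Within each pair the Euclidean
# distance is the same, but the Aitchison distance is not.
a, b = [0.1, 0.9], [0.2, 0.8]
c, d = [0.5, 0.5], [0.6, 0.4]
print(aitchison_distance(a, b), aitchison_distance(c, d))
```

Because only ratios enter, rescaling either composition (for example passing `[1, 9]` instead of `[0.1, 0.9]`) leaves the distance unchanged.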

slide-13
SLIDE 13

Transformations

Special transformations from the simplex to the Euclidean space:

  • alr (additive logratio) transformation: divide the values by the j-th part, $j \in \{1, \dots, D\}$:

$$
x^{(j)} = \left( \ln\frac{x_1}{x_j}, \dots, \ln\frac{x_{j-1}}{x_j}, \ln\frac{x_{j+1}}{x_j}, \dots, \ln\frac{x_D}{x_j} \right)^t
$$

  • clr (centered logratio) transformation: divide the values by the geometric mean:

$$
y = \left( \ln\frac{x_1}{\sqrt[D]{\prod_{i=1}^{D} x_i}}, \dots, \ln\frac{x_D}{\sqrt[D]{\prod_{i=1}^{D} x_i}} \right)^t
$$

  • ilr (isometric logratio) transformation: take an orthonormal basis in the clr-space ⇒ difficult to interpret
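The three transformations can be sketched in a few lines of plain Python. The function names are illustrative. For ilr the slide does not fix a basis, so the sketch assumes one particular orthonormal basis (often called pivot coordinates); any other orthonormal basis of the clr-space would serve equally well.

```python
import math


def alr(x, j):
    """Additive logratio: log of every other part divided by part j (0-based)."""
    return [math.log(xi / x[j]) for i, xi in enumerate(x) if i != j]


def clr(x):
    """Centered logratio: log of every part divided by the geometric mean."""
    logs = [math.log(xi) for xi in x]
    mean_log = sum(logs) / len(logs)  # log of the geometric mean
    return [v - mean_log for v in logs]


def ilr(x):
    """Isometric logratio in one assumed orthonormal basis (pivot coordinates)."""
    D = len(x)
    logs = [math.log(xi) for xi in x]
    z = []
    for i in range(D - 1):
        rest = logs[i + 1:]  # the D - i - 1 parts after part i
        scale = math.sqrt((D - i - 1) / (D - i))
        z.append(scale * (logs[i] - sum(rest) / len(rest)))
    return z


x = [640, 328, 147, 169, 196]  # person 1 of the expenditure table
print(alr(x, 0))                # D - 1 coordinates
print(sum(clr(x)))              # clr coordinates sum to zero (up to rounding)
print(sum(v * v for v in clr(x)), sum(v * v for v in ilr(x)))  # equal norms
```

The last line illustrates the "isometric" property: clr and ilr coordinates have the same squared norm, while ilr needs only D - 1 coordinates.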

slide-14
SLIDE 14

Factor analysis for compositional data

Given a D-dimensional random variable y, the FA model is

$$
y = \Lambda f + e
$$

with

  Λ . . . loadings matrix
  f . . . "factors" of dimension k < D
  e . . . error term

With the usual assumptions this results in $\mathrm{Cov}(y) = \Lambda\Lambda^t + \Psi$ with the diagonal matrix $\Psi = \mathrm{Cov}(e)$ (uniquenesses).
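The slide does not list the "usual assumptions". A sketch of the standard ones, and of the one-line derivation they give, is shown below; the symbol $I_k$ (the k × k identity matrix) is introduced here for illustration.

$$
\mathrm{E}(f) = 0, \quad \mathrm{Cov}(f) = I_k, \quad \mathrm{E}(e) = 0, \quad \mathrm{Cov}(e) = \Psi \ \text{(diagonal)}, \quad \mathrm{Cov}(f, e) = 0
$$

$$
\mathrm{Cov}(y) = \Lambda \, \mathrm{Cov}(f) \, \Lambda^t + \mathrm{Cov}(e) = \Lambda\Lambda^t + \Psi
$$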

slide-15
SLIDE 15

Factor analysis for compositional data

(Filzmoser, Hron, Reimann, Garrett, 2009, Comp. & Geosci.)

For an interpretation, FA must be related to the original variables!

⇒ ilr transformation (z), covariance estimation (Cov(z)), back-transformation to the clr-space: $\mathrm{Cov}(y) = V \, \mathrm{Cov}(z) \, V^t$

Next problem: Cov(y) is singular, which conflicts with $\mathrm{Cov}(y) = \Lambda\Lambda^t + \Psi$ with a diagonal form of Ψ.

Solution: projection of the diagonal matrix Ψ onto the hyperplane $y_1 + \dots + y_D = 0$ formed by the clr-space.

⇒ the resulting Ψ* is no longer a diagonal matrix
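The effect of such a projection can be illustrated numerically. The slide does not spell out the formula, so the sketch below assumes the two-sided orthogonal projection P Ψ P with P = I - (1/D) 1 1^t, the projector onto the hyperplane y1 + ... + yD = 0. The matrix values and helper names are made up for illustration.

```python
def matmul(A, B):
    """Product of two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def centering_projector(D):
    """Orthogonal projector I - (1/D) 1 1^t onto the hyperplane sum(y) = 0."""
    return [[(1.0 if i == j else 0.0) - 1.0 / D for j in range(D)]
            for i in range(D)]


# Hypothetical diagonal uniqueness matrix for D = 3 (values are made up).
Psi = [[0.5, 0.0, 0.0],
       [0.0, 0.2, 0.0],
       [0.0, 0.0, 0.3]]

P = centering_projector(3)
Psi_star = matmul(matmul(P, Psi), P)  # assumed form of the projection

for row in Psi_star:
    print([round(v, 4) for v in row])
```

Under this assumption the projected matrix has non-zero off-diagonal entries and rows that sum to zero, so it is singular in the same direction as Cov(y) and no longer diagonal.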

slide-16
SLIDE 16

Robust parameter estimation

The basis for parameter estimation in the FA model is the estimation of the covariance matrix. The classical estimation is sensitive to outliers.

⇒ robust estimation of the covariance matrix leads to robust estimation of the parameters for FA (Pison, Rousseeuw, Filzmoser, Croux, 2003, J. Multiv. Anal.)

[Figure: two panels, "Classical estimation" and "Robust estimation".]

slide-20
SLIDE 20

Robust covariance estimation

Minimum Covariance Determinant estimator (MCD): search for those 75% of the data points having the smallest determinant of their classical covariance matrix.

→ The arithmetic mean of these points is a robust estimator of location.

→ Their classical covariance matrix, multiplied by a factor, is a robust covariance estimator.
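The search described above can be sketched by brute force on a tiny two-dimensional data set. This is only an illustration of the definition in plain Python: practical implementations use a fast approximate search rather than enumerating all subsets, and the correction factor for the covariance is omitted here. The data values and function names are made up.

```python
from itertools import combinations
import math


def mean_and_cov(points):
    """Classical mean and 2x2 sample covariance matrix of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), ((sxx, sxy), (sxy, syy))


def det2(c):
    """Determinant of a 2x2 matrix."""
    return c[0][0] * c[1][1] - c[0][1] * c[1][0]


def raw_mcd(points, fraction=0.75):
    """Exhaustive search for the subset of size h = ceil(fraction * n)
    whose classical covariance matrix has the smallest determinant."""
    h = math.ceil(fraction * len(points))
    best = min(combinations(points, h),
               key=lambda s: det2(mean_and_cov(s)[1]))
    location, cov = mean_and_cov(best)
    return location, cov, best


# Six regular points and two clear outliers (illustrative values).
data = [(1.0, 1.1), (1.2, 0.9), (0.8, 1.0), (1.1, 1.3),
        (0.9, 0.8), (1.0, 1.2), (9.0, -7.0), (8.5, -6.0)]
location, cov, subset = raw_mcd(data)
print("MCD location:", location)
print("classical mean:", mean_and_cov(data)[0])
```

In this example the selected subset excludes the two outlying points, so the subset mean stays with the bulk of the data while the classical mean is pulled towards the outliers.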

slide-21
SLIDE 21

Robust FA for compositional data

Kola moss data: library(StatDA), data(moss); 594 samples, 31 variables.

Compare:

  • classical and robust FA for
  • log-transformed and ilr-transformed data

[Figure: map of the survey area with scale bar (km) and north arrow, showing Murmansk, Monchegorsk, Apatity, Kovdor, Nikel, Zapoljarnij and Rovaniemi, bordered by the Barents Sea, Finland, Norway and Russia.]

slide-22
SLIDE 22

Example

[Figure: biplot "Classical FA (log-transformed)": Classical F1 (26.5%) against Classical F2 (16.5%), with loadings of the 31 element variables Ag to Zn.]

[Figure: biplot "Classical FA (ilr-transformed)": Classical F1 (18.7%) against Classical F2 (18.0%), with loadings of the 31 element variables Ag to Zn.]

slide-23
SLIDE 23

Example

[Figure: biplot "Robust FA (log-transformed)": Robust F1 (21.8%) against Robust F2 (21.0%), with loadings of the 31 element variables Ag to Zn.]

[Figure: biplot "Robust FA (ilr-transformed)": Robust F1 (28.7%) against Robust F2 (17.4%), with loadings of the 31 element variables Ag to Zn.]

Relations between variables would indicate a dominance of industrial contamination. Interesting processes: sea spray, relations to plant nutrients, contamination.

slide-24
SLIDE 24

Summary

  • Compositional data are NOT characterized by a constant sum constraint. Rather, the compositional nature is an inherent data property.
  • The sample space of compositional data is the simplex. For applying methods developed for the Euclidean geometry, the data first have to be transformed to the Euclidean space (ilr).
  • Robust statistical methods cannot "repair" an incorrect geometrical representation of the data.
  • Software is available, e.g. in the R packages compositions and robCompositions.

slide-25
SLIDE 25

Further work on this issue

  • M. Templ, P. Filzmoser, and C. Reimann (2008). Cluster analysis applied to regional geochemical data: Problems and possibilities. Applied Geochemistry, 23(8):2198-2213.
  • P. Filzmoser and K. Hron (2008). Outlier detection for compositional data using robust methods. Mathematical Geosciences, 40(3):233-248.
  • P. Filzmoser and K. Hron (2009). Correlation analysis for compositional data. Mathematical Geosciences, 41:905-919.
  • P. Filzmoser, K. Hron, and C. Reimann (2009). Univariate statistical analysis of environmental (compositional) data: Problems and possibilities. Science of the Total Environment, 407:6100-6108.
  • P. Filzmoser, K. Hron, and C. Reimann (2009). Principal component analysis for compositional data with outliers. Environmetrics, 20:621-632.
  • P. Filzmoser, K. Hron, C. Reimann, and R.G. Garrett (2009). Robust factor analysis for compositional data. Computers and Geosciences, 35:1854-1861.
  • K. Hron, M. Templ, and P. Filzmoser (2010). Imputation of missing values for compositional data using classical and robust methods. Computational Statistics & Data Analysis. To appear.
  • P. Filzmoser, K. Hron, and M. Templ (20??). Discriminant analysis for compositional data and robust parameter estimation. Under review.

slide-26
SLIDE 26

Important references

  • J. Aitchison (1986). The statistical analysis of compositional data. Monographs on statistics and applied probability. Chapman & Hall, London.
  • A. Buccianti, G. Mateu-Figueras, and V. Pawlowsky-Glahn, editors (2006). Compositional data analysis in the geosciences: From theory to practice. Geological Society, London.
  • J.J. Egozcue, V. Pawlowsky-Glahn, G. Mateu-Figueras, and C. Barcelo-Vidal (2003). Isometric logratio transformations for compositional data analysis. Mathematical Geology, 35(3):279-300.

. . . and many more . . .