PCA by Projection Pursuit
The Package pcaPP

Heinrich Fritz
Vienna University of Technology, Austria

Vienna, Austria
June 2006

Joint work with . . .

  • P. Filzmoser
    Department of Statistics and Probability Theory, Vienna University of Technology, Austria
  • C. Croux
    Department of Applied Economics, K.U. Leuven, Belgium
  • M.R. Oliveira
    Department of Mathematics, Instituto Superior Técnico, Lisbon, Portugal
  • K. Kalcher
    Vienna University of Technology, Austria

Agenda

  • Principal components
  • Robust approaches
  • The implementation
  • Supporting methods
  • Covariance estimation by PCAs

Principal Component Analysis (PCA)

[Figure: scatter plot of x against y, built up over three frames; the last frame adds the principal component directions PC1 and PC2.]

Outliers

[Figure: the same scatter plot of x against y with outliers added, built up over three frames; the second frame shows the directions PC1 and PC2, the last frame shows two sets of PC1 and PC2 directions overlaid.]

The Classical Approach

  • PCA by decomposition of the covariance matrix

    $\hat{\Sigma} = \Gamma \Lambda \Gamma^t$, \quad $Y = (X - 1\bar{x}^t)\,\Gamma$

  • Robustness due to robust covariance estimates.

    – package rrcov: covMCD, covMest
    – package robustbase: covGK, covOGK
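The classical approach on this slide can be sketched in a few lines of base R. This is only an illustration using the built-in swiss data (which also appears in the example near the end of the talk). The variable names are chosen here and are not part of any package.

```r
# Classical PCA: eigendecomposition of the sample covariance matrix.
X      <- as.matrix(swiss)             # built-in data set (47 x 6)
xbar   <- colMeans(X)                  # classical center
Sigma  <- cov(X)                       # classical covariance estimate
eig    <- eigen(Sigma, symmetric = TRUE)
Gamma  <- eig$vectors                  # loadings (columns = principal directions)
Lambda <- diag(eig$values)             # variances of the principal components

# Scores: Y = (X - 1 xbar^t) Gamma
Y <- sweep(X, 2, xbar) %*% Gamma
```

Replacing `cov(X)` and `colMeans(X)` by robust estimates is what the second bullet refers to. The decomposition step itself stays the same.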

PCA by Projection Pursuit

  • No covariance estimation necessary
  • Especially for high dimensional data
  • Procedure

    – Define a data center (mean, median, l1median, . . . )
    – Search for promising directions by maximizing a spread estimator (sd, mad, qn) of the data projected onto these directions
    – Reduce the number of candidate directions
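The procedure above can be illustrated with a minimal base R sketch. It is not the pcaPP implementation. The function name `pp_pca` and its arguments are hypothetical, the candidate directions are simply the centered observations, and no reduction of the candidate set is attempted.

```r
# Illustrative projection-pursuit PCA (hypothetical helper, not pcaPP code).
pp_pca <- function(X, k = 2,
                   center = function(x) apply(x, 2, median),
                   spread = mad) {
  X <- as.matrix(X)
  X <- sweep(X, 2, center(X))                  # step 1: center the data
  loadings <- matrix(0, ncol(X), k)
  sdev <- numeric(k)
  for (j in seq_len(k)) {
    # Candidate directions: observations scaled to unit length.
    norms <- sqrt(rowSums(X^2))
    keep  <- norms > 1e-8 * max(norms)         # skip (near-)zero rows
    A <- X[keep, , drop = FALSE] / norms[keep]
    # Step 2: spread of the data projected onto each candidate.
    s <- apply(X %*% t(A), 2, spread)
    best <- which.max(s)
    a <- A[best, ]
    loadings[, j] <- a
    sdev[j] <- s[best]
    # Remove the found direction so later directions are orthogonal to it.
    X <- X - (X %*% a) %*% t(a)
  }
  list(sdev = sdev, loadings = loadings)
}

res <- pp_pca(swiss, k = 2)   # median center, MAD as spread estimator
```

Swapping `spread = mad` for `sd` and the center for `colMeans` gives a non-robust variant of the same search.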

Defining the Data Center

[Figure: scatter plot of x against y illustrating the choice of the data center.]

Maximizing Spread

[Figure: sequence of 21 frames of the scatter plot of x against y, each reporting the standard deviation s and the MAD of the projected data for one candidate direction.]

Values per frame (s, MAD):

(0.62, 0.54) (0.62, 0.46) (0.63, 0.40) (0.63, 0.32) (0.64, 0.26) (0.65, 0.25) (0.65, 0.30)
(0.66, 0.35) (0.66, 0.43) (0.67, 0.52) (0.67, 0.63) (0.66, 0.67) (0.66, 0.70) (0.65, 0.69)
(0.65, 0.67) (0.64, 0.61) (0.63, 0.64) (0.63, 0.63) (0.62, 0.66) (0.62, 0.59) (0.62, 0.54)

PCAproj

[Figure: scatter plot of x against y with candidate directions through the center, built up over several frames.]

Candidate Directions:

  • each data point
  • additionally random directions through center
  • additional directions by linear combinations of data points
  • update algorithm (based on eigenvalues)

PCAgrid

Grid Algorithm:

Optimization is done on a regular grid in the plane.

  • select two variables
  • optimization on the grid
  • select other variables
  • . . .

[Figure: scatter plot of x against y illustrating the grid search.]
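The search in the plane can be sketched as a grid over angles that is refined around the best angle found so far. The code below is a base R illustration for two variables only, not the C code of the package. The function name `grid_direction` is hypothetical. The argument names `splitcircle` and `maxiter` are borrowed from the PCAgrid signature, but their exact meaning in the package is an assumption here.

```r
# Illustrative grid search for the direction of maximal spread in the plane.
# X2 must be a centered n x 2 matrix.
grid_direction <- function(X2, spread = mad, splitcircle = 10, maxiter = 10) {
  angle <- 0      # current best angle
  width <- pi     # directions repeat after a half turn
  for (it in seq_len(maxiter)) {
    cand <- angle + seq(-width / 2, width / 2, length.out = splitcircle + 1)
    s <- vapply(cand,
                function(a) spread(drop(X2 %*% c(cos(a), sin(a)))),
                numeric(1))
    angle <- cand[which.max(s)]   # keep the best grid point
    width <- width / 2            # halve the angular window and refine
  }
  c(cos(angle), sin(angle))       # unit-length direction
}

X2 <- as.matrix(swiss[, c("Fertility", "Education")])
X2 <- sweep(X2, 2, apply(X2, 2, median))     # center by the coordinatewise median
a_mad <- grid_direction(X2, spread = mad)
```

With `spread = sd` and mean-centered data the result agrees, up to sign and grid resolution, with the first eigenvector of the covariance matrix, which is a convenient sanity check.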

Implementation

  • Implementation in C
  • Wrapping functions

    – PCAproj(x, k = 2, method = c("sd", "mad", "qn"),
              CalcMethod = c("eachobs", "lincomb", "sphere"),
              nmax = 1000, update = TRUE, scores = TRUE,
              maxit = 5, maxhalf = 5, control, ...)
    – PCAgrid(x, k = 2, method = c("sd", "mad", "qn"),
              maxiter = 10, splitcircle = 10, scores = TRUE,
              anglehalving = TRUE, fact2dim = 10, control, ...)

Common Parameters

  • x: Data matrix (data frame)
  • k: Number of principal components
  • method: Spread estimator for projection pursuit
  • scores: Return scores-matrix?
  • control: Control-structure
  • ...: Further arguments passed to ScaleAdv

PCAproj - Individual Parameters

  • CalcMethod: "eachobs", "lincomb" or "sphere"
  • nmax: Max directions to search in each step (for "lincomb" or "sphere")
  • update: Perform update steps?

    – maxhalf: Maximum number of steps for angle halving
    – maxit: Maximum number of iterations

PCAgrid - Individual Parameters

  • splitcircle: Number of directions
  • anglehalving: Perform angle halving?
  • fact2dim: Behavior in the 2-dimensional case
  • maxiter: Maximum number of iterations

Return Structure

  • (S3) class pcaPP derived from princomp:

    – sdev: Spread of principal components
    – loadings: Matrix containing the loadings
    – center: Center applied to the data matrix
    – scale: Scale applied to the data matrix
    – n.obs: Number of observations
    – scores: Matrix containing the scores
    – call: Function call

Additional Functions

  • l1median(X, MaxStep = 200, ItTol = 10^-8)
    Robust center estimator
  • qn(x)
    Robust scale estimator
  • ScaleAdv(x, center = mean, scale = sd)
    Advanced scaling method (takes functions or vectors as input values)
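To make the idea of a robust center concrete, the following base R sketch computes an L1 median (the point minimizing the sum of Euclidean distances to the observations) with the Weiszfeld iteration. The choice of Weiszfeld is an assumption made for illustration. The slides do not say which algorithm l1median uses, and the name `l1median_sketch` and its arguments are hypothetical.

```r
# Illustrative L1 median via Weiszfeld iteration (not the pcaPP routine).
l1median_sketch <- function(X, max_step = 200, tol = 1e-8) {
  X <- as.matrix(X)
  m <- apply(X, 2, median)                   # start at the coordinatewise median
  for (i in seq_len(max_step)) {
    d <- sqrt(rowSums(sweep(X, 2, m)^2))     # distances to the current center
    d <- pmax(d, .Machine$double.eps)        # guard against division by zero
    w <- 1 / d
    m_new <- colSums(X * w) / sum(w)         # distance-weighted mean
    done <- sqrt(sum((m_new - m)^2)) < tol
    m <- m_new
    if (done) break
  }
  m
}

ctr <- l1median_sketch(swiss)
```

A known limitation of this simple iteration is that it can stall if an iterate coincides with an observation, so it should be read as a sketch of the estimator rather than a robust implementation.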

Robust Covariance Estimation

  • Robust covariance estimation based on PCs

$\hat{\Sigma} = \hat{\Gamma} \hat{\Lambda} \hat{\Gamma}^t$

  • covPCAproj(x, control)
  • covPCAgrid(x, control)
  • covPC(x, k, method) (under construction . . . )
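The reconstruction of a covariance matrix from principal components can be sketched in base R from the loadings and the component spreads. The helper name `cov_from_pcs` is hypothetical, and classical `prcomp` stands in for a robust fit here. With robust loadings and spreads the same product would give a robust estimate.

```r
# Covariance matrix from loadings (Gamma) and component spreads (sqrt of Lambda).
cov_from_pcs <- function(sdev, loadings) {
  G <- unclass(loadings)
  S <- G %*% diag(sdev^2, nrow = length(sdev)) %*% t(G)
  dimnames(S) <- list(rownames(G), rownames(G))
  S
}

fit <- prcomp(swiss)                      # classical PCA as a stand-in
S   <- cov_from_pcs(fit$sdev, fit$rotation)
```

When all components are kept, the classical fit reproduces `cov(swiss)`, which confirms that the product is assembled correctly.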

Example

> library(pcaPP)
> data(swiss)
> result = PCAproj(swiss, k = 6, method = "mad")
> summary(result)
Importance of components:
                           Comp.1     Comp.2      Comp.3     Comp.4      Comp.5      Comp.6
Standard deviation     44.1005199 41.0302723 17.09114152 6.92022550 4.619893062 3.505520822
Proportion of Variance  0.4859749  0.4206639  0.07299087 0.01196649 0.005333229 0.003070658
Cumulative Proportion   0.4859749  0.9066387  0.97962962 0.99159611 0.996929342 1.000000000

Example

> screeplot(result)

[Figure: scree plot of the variances of Comp.1 to Comp.6.]
Example

> biplot(result)

[Figure: biplot of Comp.1 against Comp.2 showing the Swiss provinces as labeled points and the variables Fertility, Agriculture, Examination, Education, Catholic and Infant.Mortality.]

Covariance Estimation

> library(covrob)
> covswiss.mad <- covrob(swiss, method = "covPCAproj", control = list(k = 6, method = "mad"))
> covswiss.sd <- covrob(swiss, method = "covPCAproj", control = list(k = 6, method = "sd"))
> plot(covswiss.mad, covswiss.sd)

[Figure: comparison plot of the two covariance estimates for the variables Fertility, Agriculture, Examination, Education, Catholic and Infant.Mortality, with the numeric covariance entries printed in the panels. Legend: "Robust cov-estimation based on PCs (projection mode - sd)" and "Robust cov-estimation based on PCs (projection mode - mad)".]