SLIDE 1

Recasting Principal Components

R.W. Oldford

University of Waterloo

SLIDE 2

Reducing dimensions - recasting the problem of principal components

The principal axes (V) for a set of data X^T = [x_1, . . . , x_n] can be found in one of two ways. Either:

◮ via the eigen-decomposition of X^T X = V D_λ V^T, or
◮ via the singular value decomposition of X = U D_σ V^T.

Either way, the principal components are formed as Y = X V = U D_σ, which means that all we need are U and D_σ; we don't really need V. We can get these two matrices from a different eigen-decomposition, namely

Y Y^T = U D_σ (U D_σ)^T = U D_σ^2 U^T = X X^T.

Note that this matrix is n × n, i.e. it depends only on the sample size and not on the dimension p. So there is a choice: either the p × p matrix X^T X or the n × n matrix X X^T.
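As a quick illustration (a minimal sketch, not part of the original slides, using a small random data matrix), the two routes give the same principal components, and the eigenvalues of the n × n matrix X X^T are the squared singular values of X:

set.seed(1)
# a small centred data matrix X (n = 20, p = 3)
X <- scale(matrix(rnorm(20 * 3), nrow = 20), center = TRUE, scale = FALSE)

sv <- svd(X)                    # X = U D_sigma V^T
Y1 <- X %*% sv$v                # principal components as X V
Y2 <- sv$u %*% diag(sv$d)       # the same components as U D_sigma
max(abs(Y1 - Y2))               # ~ 0

eg <- eigen(tcrossprod(X), symmetric = TRUE)  # eigen-decomposition of the n x n matrix X X^T
eg$values[1:3]                  # equal to sv$d^2; the remaining eigenvalues are ~ 0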

SLIDE 3

Reducing dimensions - recasting the problem of principal components

The choice X X^T has an interesting structure:

X X^T = [ x_1^T ]
        [ x_2^T ]  [ x_1  x_2  · · ·  x_n ]
        [  ...  ]
        [ x_n^T ]

      = [ x_1^T x_1   x_1^T x_2   · · ·   x_1^T x_n ]
        [ x_2^T x_1   x_2^T x_2   · · ·   x_2^T x_n ]
        [    ...         ...                 ...    ]
        [ x_n^T x_1   x_n^T x_2   · · ·   x_n^T x_n ]

Note that

◮ the (i, j) element x_i^T x_j is an inner product,
◮ this matrix of inner products is often called the Gram matrix, and
◮ if the data were centred (Σ_{i=1}^n x_i = 0) then each row and column above sums to 0 (see the quick check below).
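A quick numerical check of the last bullet (a minimal sketch, not from the slides, assuming a small random data matrix):

set.seed(2)
# centre the data, form the Gram matrix G = X X^T, and check that its rows
# (and, by symmetry, its columns) sum to approximately zero
X <- scale(matrix(rnorm(15 * 4), nrow = 15), center = TRUE, scale = FALSE)
G <- tcrossprod(X)     # n x n matrix of inner products x_i^T x_j
range(rowSums(G))      # both endpoints ~ 0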

SLIDE 4

Reducing dimensions - problems with principal axes

Principal axes are, well, just that . . . axes. That is, they correspond to directions v1, . . . , vk where the data x1, . . . , xn, when projected orthogonally onto them, have a maximal spread. We look to remove axes (directions) for which the sum of the squared lengths of the projections is relatively small. For this to work, the data have to be (nearly) restricted to a linear subspace.

SLIDE 5

Reducing dimensions - problems with principal axes

What if the data lie in a very restricted part of the space, but that part is just not linear? (Figure on slide: points scattered around a circle in the plane.) Clearly, these points lie in a very compact region of the plane, but not along any direction vector (or principal axis). The data are still essentially only one dimensional, just around the perimeter of a circle.
SLIDE 6

Reducing dimensions - problems with principal axes

Possible solutions?

1. Change variables, e.g. to polar coordinates:

   (x_1, x_2) → (f_1, f_2) = ( sqrt(x_1^2 + x_2^2),  acos( x_1 / sqrt(x_1^2 + x_2^2) ) )

2. Somehow follow the points around the circle?

   ◮ That is, try to find the nonlinear manifold on (or near) which the points lie.
   ◮ Somehow preserve local structure:
   ◮ points in the same neighbourhood of each other (along the nonlinear manifold) should also be near each other in the reduced (linear) space.

SLIDE 7

Reducing dimensions - changing variables

x1 <- data[,1]
x2 <- data[,2]
newdata <- data.frame(f1 = sqrt(x1^2 + x2^2),
                      f2 = acos(x1/sqrt(x1^2 + x2^2)))
colnames(newdata) <- c("r","theta")
newdata <- scale(newdata, center=TRUE, scale=FALSE)
svd_data <- svd(newdata)
svd_data$d

## [1] 1.103150e+01 7.430655e-16

SLIDE 8

Reducing dimensions - changing variables

x1 <- data[,1]
x2 <- data[,2]
newdata <- data.frame(f1 = sqrt(x1^2 + x2^2),
                      f2 = acos(x1/sqrt(x1^2 + x2^2)))
colnames(newdata) <- c("r","theta")
newdata <- scale(newdata, center=TRUE, scale=FALSE)
svd_data <- svd(newdata)
svd_data$d

## [1] 1.103150e+01 7.430655e-16

This non-linear transformation has produced a coordinate system where all the data lie in a one-dimensional linear subspace.

SLIDE 9

Reducing dimensions - changing variables

Alternatively, we might try a transformation that adds non-linear coordinates, say

(x_1, x_2) → (f_1, f_2, f_3, f_4) = (x_1, x_2, x_1^2, x_2^2).

(Figure on slide: scatterplot matrix (pairs plot) of the four coordinates f1, f2, f3, f4.)

SLIDE 10

Reducing dimensions - changing variables

A principal component analysis on the transformed data produces the singular values:

## [1] 8.790603e+00 8.461389e+00 5.923241e+00 1.750249e-15

The last of these is essentially zero; the corresponding eigen-vector v_4^T is

## [1] 0.000000e+00 -3.052969e-16 7.071068e-01 7.071068e-01

Note the linear structure in (f_3, f_4): since the points lie (essentially) on a circle of constant radius, f_3 + f_4 = x_1^2 + x_2^2 is (nearly) constant. This is picked up by the last eigen-vector which, as seen above, is ≈ (0, 0, 1/√2, 1/√2)^T. The line visible in the (f_3, f_4) panel of the previous plot is orthogonal to the direction (1, 1) (i.e. to the eigen-vector v_4). Note also that there would be 3 principal components, more than the original dimensionality of the data!
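In code, the construction might look as follows (a sketch only; it assumes, as on the earlier slides, that `data` holds the points lying around a circle in the plane):

x1 <- data[, 1]
x2 <- data[, 2]
# add the squared coordinates as two extra (non-linear) features
fdata <- cbind(f1 = x1, f2 = x2, f3 = x1^2, f4 = x2^2)
fdata <- scale(fdata, center = TRUE, scale = FALSE)
svd_f <- svd(fdata)
svd_f$d        # the fourth singular value should be essentially zero
svd_f$v[, 4]   # with eigen-vector approximately +/- (0, 0, 1/sqrt(2), 1/sqrt(2))^T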

SLIDE 11

Reducing dimensions - changing variables

More generally, if x ∈ R^p, we can consider a mapping ψ : R^p → R^m.

◮ ψ could be non-linear,
◮ m could be larger than p,
◮ f_i = ψ(x_i) for i = 1, . . . , n are called "feature vectors" by some writers, and the range of ψ the "feature space" F ⊂ R^m.

Note that while the dimensionality increases, the number of points n stays the same. Whether working in the original data space or in the constructed feature space, a principal component analysis can be had via an n × n Gram matrix. The dimensionality of the feature space could be much larger than the dimensionality of the data (i.e. m >> p); it could even be infinite! The principal component analysis (PCA) will never need a matrix larger than the corresponding n × n Gram matrix.

SLIDE 12

Reducing dimensions - changing variables

F^T = [f_1, . . . , f_n] is the m × n matrix of feature vectors, which can be a nuisance to work with if m >> n and impossible if m = ∞. The corresponding Gram matrix for the feature space, K = [k_ij] = F F^T, however, is always n × n (even if m = ∞).

All we need to be able to do is determine the inner products

k_ij = f_i^T f_j = ψ(x_i)^T ψ(x_j) = < ψ(x_i), ψ(x_j) > = K(x_i, x_j), say.

It looks like we only need to choose the function K(x_i, x_j). That is, we never need to calculate any feature vector f_i = ψ(x_i) (or even determine the function ψ(·)) if we have the function K(x_i, x_j), a function of pairs of vectors in the data space.

K(x_i, x_j) is called a Kernel function (N.B. not to be confused with "Kernel density" estimates), and this move from ψ functions to Kernel functions is sometimes called the "Kernel trick".
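A minimal sketch of the kernel trick (toy data and a degree-2 polynomial kernel, both chosen purely for illustration): the n × n matrix K is filled in directly from pairs of data vectors; no feature vector is ever computed.

set.seed(3)
X <- matrix(rnorm(10 * 3), nrow = 10)           # n = 10 points in R^3

poly2_kernel <- function(x, y) (sum(x * y))^2   # K(x, y) = (x^T y)^2

n <- nrow(X)
K <- matrix(0, n, n)
for (i in 1:n)
  for (j in 1:n)
    K[i, j] <- poly2_kernel(X[i, ], X[j, ])     # k_ij = K(x_i, x_j)
dim(K)   # n x n, whatever the dimension of the (implicit) feature space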

SLIDE 13

Reducing dimensions - changing variables

A number of kernel functions have been proposed. Three common choices are

1. Polynomial of degree d, scale parameter σ, and offset θ:

   K(x, y) = (σ x^T y + θ)^d

   For example, suppose p = 3 and d = 2 (with σ = 1 and θ = 0); then

   K(x, y) = (x_1 y_1 + x_2 y_2 + x_3 y_3)^2
           = x_1^2 y_1^2 + x_2^2 y_2^2 + x_3^2 y_3^2 + 2 x_1 x_2 y_1 y_2 + 2 x_1 x_3 y_1 y_3 + 2 x_2 x_3 y_2 y_3
           = (x_1^2, x_2^2, x_3^2, √2 x_1 x_2, √2 x_1 x_3, √2 x_2 x_3) (y_1^2, y_2^2, y_3^2, √2 y_1 y_2, √2 y_1 y_3, √2 y_2 y_3)^T

   So ψ(x) = (x_1^2, x_2^2, x_3^2, √2 x_1 x_2, √2 x_1 x_3, √2 x_2 x_3)^T.
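A quick numerical check of this expansion (a sketch with two arbitrary vectors in R^3):

x <- c(1, 2, 3)
y <- c(-1, 0.5, 2)

# the explicit feature map psi for the degree-2 polynomial kernel above
psi <- function(z) c(z[1]^2, z[2]^2, z[3]^2,
                     sqrt(2) * z[1] * z[2],
                     sqrt(2) * z[1] * z[3],
                     sqrt(2) * z[2] * z[3])

(sum(x * y))^2         # kernel value K(x, y) = (x^T y)^2
sum(psi(x) * psi(y))   # inner product in the feature space -- the same number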

SLIDE 14

Reducing dimensions - changing variables

2. Radial basis function (Gaussian), with scale parameter σ:

   K(x, y) = exp( − ||x − y||^2 / (2 σ^2) )

   To see that this is also an inner product, consider a series expansion of e^t. The feature space is infinite dimensional.

3. Sigmoid (hyperbolic tangent), with scale σ and offset θ:

   K(x, y) = tanh( σ x^T y + θ )

There is a theorem from functional analysis (involving reproducing Kernel Hilbert spaces, hence the name) called Mercer's Theorem which gives conditions under which a function K(x, y) can be expressed as a dot product.
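A sketch of the radial basis kernel in R (toy data; the helper name `rbf_kernel_matrix` is not from the slides). Note that only pairwise squared Euclidean distances are needed, even though the implicit feature space is infinite dimensional:

rbf_kernel_matrix <- function(X, sigma = 1) {
  D2 <- as.matrix(dist(X))^2        # pairwise squared Euclidean distances
  exp(-D2 / (2 * sigma^2))          # K(x, y) = exp(-||x - y||^2 / (2 sigma^2))
}

set.seed(4)
X <- matrix(rnorm(10 * 3), nrow = 10)
K <- rbf_kernel_matrix(X, sigma = 2)
dim(K)                              # 10 x 10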

SLIDE 15

Reducing dimensions - changing variables

The kernel needs to be such that the feature vectors f_1, . . . , f_n are centred, i.e. Σ_{i=1}^n f_i = 0.

Recall that the kernel matrix K is a Gram matrix, so its elements are the inner products f_i^T f_j in the feature space.

A simple way to effect this is to ensure that the kernel matrix K has rows and columns that sum to zero. That is, replace K by removing means from the front and back:

K* = (I_n − 1 (1^T 1)^{-1} 1^T) K (I_n − 1 (1^T 1)^{-1} 1^T)
   = (I_n − (1/n) 1 1^T) K (I_n − (1/n) 1 1^T)
   = (K − (1/n) 1 1^T K) (I_n − (1/n) 1 1^T)
   = K − (1/n) K 1 1^T − (1/n) 1 1^T K + (1/n^2) 1 1^T K 1 1^T
   = K − ((1/n) K 1) 1^T − 1 ((1/n) 1^T K) + (1/n^2) (1^T K 1) 1 1^T

Or, in words: subtract the row means, subtract the column means, add back in the overall mean.
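In R, this double-centring can be done directly on the kernel matrix (a sketch; the helper name `center_kernel` is mine, not from the slides):

center_kernel <- function(K) {
  n <- nrow(K)
  # subtract row means, subtract column means, add back the overall mean
  K - outer(rowMeans(K), rep(1, n)) - outer(rep(1, n), colMeans(K)) + mean(K)
}

set.seed(5)
X <- matrix(rnorm(8 * 2), nrow = 8)
K <- tcrossprod(X)                       # any kernel matrix would do; a linear kernel here
Kc <- center_kernel(K)
range(rowSums(Kc)); range(colSums(Kc))   # all ~ 0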

SLIDE 16

Reducing dimensions - changing variables

Many of these are available in R through the package kernlab, whose kpca() function implements KPCA, "Kernel Principal Component Analysis".

library(kernlab)

# Note that the argument "sigma" here is actually
# 1 / (square of the sigma in our formula above)
frey.kpca1 <- kpca(data, kernel="rbfdot",  kpar=list(sigma=0.01))
frey.kpca2 <- kpca(data, kernel="rbfdot",  kpar=list(sigma=0.00001))
frey.kpca3 <- kpca(data, kernel="polydot", kpar=list(degree=10))

d1 <- frey.kpca1@eig
d2 <- frey.kpca2@eig
d3 <- frey.kpca3@eig

####### Choose one
d <- round(d1,10)
d <- round(d2,10)
d <- round(d3,10)
#######

plot(d/max(d), type="b", main="Scree plot",
     xlab="principal component", ylab="eigen value (relative)",
     ylim=c(0,1), col=adjustcolor("grey50", 0.2), pch=19, cex=1)

####### Choose one
reduced <- frey.kpca1@pcv[,1:20]
reduced <- frey.kpca2@pcv[,1:20]
reduced <- frey.kpca3@pcv[,1:20]
#######

kpca_scags2d <- scagnostics2d(reduced)
kpca_nav <- l_ng_plots(measures=kpca_scags2d, linkingGroup="frey")
kpca_gl <- l_glyph_add_image(kpca_nav$plot, images=frey.imgs, label="frey faces")
kpca_nav$plot['glyph'] <- kpca_gl

SLIDE 17

Reducing dimensions - local neighbourhoods

To determine which points are nearby each other, we could calculate the distances between each pair of points in the original data space. Let d_ij = ||x_i − x_j|| be the Euclidean distance between points i and j. Note that

d_ij^2 = ||x_i − x_j||^2
       = (x_i − x_j)^T (x_i − x_j)
       = x_i^T x_i − 2 x_i^T x_j + x_j^T x_j
       = g_ii − 2 g_ij + g_jj,

where g_ij = x_i^T x_j is the (i, j) entry of the Gram matrix G = [g_ij].

Let D⋆ = [d_ij^2] denote the matrix of squared Euclidean distances and let g = (g_11, . . . , g_nn)^T be the vector of diagonal entries of G. Then the above equation can be written in matrix and vector form as

D⋆ = 1 g^T − 2 G + g 1^T.
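A numerical check of this matrix identity (a minimal sketch with random data, not from the slides):

set.seed(6)
n <- 7
X <- matrix(rnorm(n * 3), nrow = n)
G <- tcrossprod(X)                       # Gram matrix
g <- diag(G)                             # vector of g_ii
Dstar_formula <- outer(rep(1, n), g) - 2 * G + outer(g, rep(1, n))
Dstar_direct  <- as.matrix(dist(X))^2    # squared Euclidean distances
max(abs(Dstar_formula - Dstar_direct))   # ~ 0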

SLIDE 18

Reducing dimensions - local neighbourhoods

So, given

D⋆ = 1 g^T − 2 G + g 1^T,

we can pre- and post-multiply both sides by (I_n − (1/n) 1 1^T), giving

(I_n − (1/n) 1 1^T) D⋆ (I_n − (1/n) 1 1^T)
  = (I_n − (1/n) 1 1^T) (1 g^T − 2 G + g 1^T) (I_n − (1/n) 1 1^T)
  = (I_n − (1/n) 1 1^T) 1 g^T (I_n − (1/n) 1 1^T)
      − 2 (I_n − (1/n) 1 1^T) G (I_n − (1/n) 1 1^T)
      + (I_n − (1/n) 1 1^T) g 1^T (I_n − (1/n) 1 1^T)
  = 0 g^T (I_n − (1/n) 1 1^T)
      − 2 (I_n − (1/n) 1 1^T) G (I_n − (1/n) 1 1^T)
      + (I_n − (1/n) 1 1^T) g 0^T
  = −2 (I_n − (1/n) 1 1^T) G (I_n − (1/n) 1 1^T),

since (I_n − (1/n) 1 1^T) 1 = 0 and, by symmetry, 1^T (I_n − (1/n) 1 1^T) = 0^T.

SLIDE 19

Reducing dimensions - local neighbourhoods

So we have

(I_n − (1/n) 1 1^T) D⋆ (I_n − (1/n) 1 1^T) = −2 (I_n − (1/n) 1 1^T) G (I_n − (1/n) 1 1^T).

Now, since the data are centred (Σ_{i=1}^n x_i = 0) and G = X X^T, we have

(I_n − (1/n) 1 1^T) G (I_n − (1/n) 1 1^T) = G.

So . . . given a matrix of squared distances, we can derive the corresponding Gram matrix as

G = −(1/2) (I_n − 1 (1^T 1)^{-1} 1^T) D⋆ (I_n − 1 (1^T 1)^{-1} 1^T).

And recall that, given a Gram matrix, we can always determine a point configuration (and hence possibly a lower dimensional embedding) from an eigen-decomposition G = X X^T = U D_λ U^T, which opens up a whole realm of possibilities, beginning with any squared distance between points!
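This recipe is classical multidimensional scaling. A minimal sketch in R (toy data; base R's cmdscale() packages the same computation):

set.seed(7)
n <- 9
X <- matrix(rnorm(n * 3), nrow = n)

Dstar <- as.matrix(dist(X))^2            # squared Euclidean distances
H <- diag(n) - matrix(1/n, n, n)         # centring matrix I_n - (1/n) 1 1^T
G <- -0.5 * H %*% Dstar %*% H            # recovered Gram matrix

e <- eigen(G, symmetric = TRUE)
k <- 3                                   # dimension of the new configuration
Y <- e$vectors[, 1:k] %*% diag(sqrt(pmax(e$values[1:k], 0)))
# Y is an n x k point configuration (unique up to rotation/reflection); with
# k = 3, the rank of the centred data here, the original distances are reproduced
max(abs(dist(Y) - dist(X)))              # ~ 0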

SLIDE 20

Reducing dimensions - local neighbourhoods

We might use this connection as follows. Consider a 3d roll: an example of an intrinsically 2d non-linear manifold in 3d space. Can we imagine how this might be unrolled?

A. Original data points, comparing the Euclidean distance to the distance along the manifold.
B. Form a local neighbourhood graph and use it to determine distances. These are more likely to be along the manifold.
C. Use the neighbourhood graph distances to embed the points in a two-dimensional linear space. In this linear space the Euclidean distances should be close to those along the neighbourhood graph.

SLIDE 21

Reducing dimensions - local neighbourhoods

We could use the relationship between D⋆ and G to find a point configuration that respects local distances as follows:

1. For every point i, determine a neighbourhood of nearby points, e.g. either by
   ◮ determining the k nearest neighbours of i for some k, or
   ◮ finding all points j such that d_ij ≤ δ for some δ.
2. Create a graph with the points as nodes and an edge between each point i and all points j in the defined neighbourhood of i.
   ◮ The graph should follow whatever low-dimensional structure the points lie on,
   ◮ e.g. a k nearest neighbour graph.
3. For every pair of points i and j in the graph, determine the shortest path along the graph between i and j. Let δ_ij denote the distance along this path.
4. Produce the matrix of squared distances ∆⋆ = [δ_ij^2].
5. Construct the Gram matrix G from ∆⋆.
6. Use the eigen-decomposition of G to produce a new point configuration f_1, . . . , f_n.

This method is called isomap. It preserves local distances and ignores longer distances. A hand-rolled sketch of these steps is given below; the next slide uses the dimRed package instead.
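A rough, hand-rolled sketch of steps 1 to 6 (not the dimRed implementation; it assumes a small toy "roll" and that the k-nearest-neighbour graph is connected):

set.seed(8)
# toy example: a 1d curve (spiral) embedded non-linearly in the plane
s <- sort(runif(40, 0, 3 * pi))
X <- cbind(s * cos(s), s * sin(s))
n <- nrow(X); k <- 5

D <- as.matrix(dist(X))                   # Euclidean distances in the data space

# steps 1 & 2: k nearest neighbour graph, edge weights = Euclidean distances
A <- matrix(Inf, n, n)
for (i in 1:n) {
  nbrs <- order(D[i, ])[2:(k + 1)]        # k nearest neighbours of i (excluding i)
  A[i, nbrs] <- D[i, nbrs]
  A[nbrs, i] <- D[nbrs, i]                # keep the graph symmetric
}
diag(A) <- 0

# step 3: shortest path distances along the graph (Floyd-Warshall)
Delta <- A
for (m in 1:n)
  Delta <- pmin(Delta, outer(Delta[, m], Delta[m, ], `+`))

# steps 4 & 5: squared graph distances -> Gram matrix by double centring
H <- diag(n) - matrix(1/n, n, n)
G <- -0.5 * H %*% (Delta^2) %*% H

# step 6: eigen-decomposition gives the new configuration (here 1d)
e <- eigen(G, symmetric = TRUE)
f <- e$vectors[, 1] * sqrt(max(e$values[1], 0))
plot(s, f)    # roughly monotone: the position along the curve has been recovered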

SLIDE 22

Reducing dimensions - local neighbourhoods

library(dimRed)

# Number of neighbours
k_nbhrs <- 12
# Target dimension
n_dims <- 20

# Dimension reduction via isomap
iso20 <- embed(data, "Isomap", knn = k_nbhrs, ndim = n_dims)

# iso20 is an S4 object having "slots" (from slotNames(iso20) )
# "data"  "org.data"  "apply"  "inverse"  "has.org.data"
# "has.apply"  "has.inverse"  "method"  "pars"
iso20data <- iso20@data@data

#######
iso_scags2d <- scagnostics2d(iso20data)
iso_nav <- l_ng_plots(measures=iso_scags2d, linkingGroup="frey")
iso_gl <- l_glyph_add_image(iso_nav$plot, images=frey.imgs, label="frey faces")
iso_nav$plot['glyph'] <- iso_gl