

SLIDE 1

Reducing dimensionality

Principal components R.W. Oldford

SLIDE 2

Reducing dimensions

Recall how orthogonal projections work.

Given V = [v1, . . . , vk], an orthogonal projection matrix P is easily constructed as P = V(VᵀV)⁻¹Vᵀ. And, if the column vectors of V form an orthonormal basis for S, then P = VVᵀ.

So far, we have only considered the choice where the vi s are unit vectors in the direction of the original data axes ei. Projections onto these directions simply return the scatterplots on pairs of the original variates. Are there other directions which would do as well (or possibly better)?

Imagine that we have n points x1, . . . , xn ∈ ℝᵖ, centred so that Σ_{i=1}^n xi = 0, and denote by X = [x1, . . . , xn]ᵀ the n × p real matrix whose ith row is xiᵀ.

Being centred, we can now ask whether the data truly lie in a linear subspace of Rp. And, if they do, can we find that subspace? Alternatively, do they lie nearly in a linear subspace and could we find it?
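The projection matrix recalled above is easy to check numerically. A minimal sketch (not from the slides; the matrix V here is arbitrary illustrative data): build P = V(VᵀV)⁻¹Vᵀ and confirm it is idempotent, symmetric, and unchanged when V is replaced by an orthonormal basis for the same subspace.

```r
# Sketch: orthogonal projection matrix P = V (V^T V)^{-1} V^T
set.seed(1)
V <- matrix(rnorm(5 * 2), nrow = 5)      # V = [v1, v2], a 5 x 2 matrix
P <- V %*% solve(t(V) %*% V) %*% t(V)    # P = V (V^T V)^{-1} V^T
stopifnot(all.equal(P, P %*% P))         # idempotent: P^2 = P
stopifnot(all.equal(P, t(P)))            # symmetric: P = P^T
# With an orthonormal basis for span(V) (here via QR), P simplifies to Q Q^T
Q <- qr.Q(qr(V))
stopifnot(all.equal(P, Q %*% t(Q)))
```

The last check illustrates the slide's point: P depends only on the subspace spanned by V, and for an orthonormal basis the formula collapses to VVᵀ.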

SLIDE 3

Reducing dimensions - Finding the principal axes

For example, suppose n = 20 and p = 2, so that a point cloud might look like:

[Figure: a cloud of 20 points in the plane, falling nearly along a straight line]

The points x1, . . . , xn lie in the plane, but do not occupy all of it. They appear to lie nearly in a one-dimensional subspace of ℝ².

SLIDE 4

Reducing dimensions - Finding the principal axes

We can think about orthogonally projecting the points (or equivalently, vectors) x1, . . . , xn onto any direction vector a ∈ ℝᵖ (i.e. a ∈ ℝᵖ with ‖a‖ = 1).

[Figure: the point cloud with a direction vector a and a point xk projected orthogonally onto it]

SLIDE 5

Reducing dimensions - Finding the principal axes

The orthogonal projection of the point xk onto a (i.e. onto span{a}) is (aaᵀ)xk = wk a, a vector in the direction sign(wk) a of length |wk|, with wk = aᵀxk. Note that the squared length of this projection is

wk² = ‖aaᵀxk‖² = xkᵀaaᵀxk = aᵀxkxkᵀa.

Since every point can be projected onto the direction vector a, we might ask which vector would maximize (or minimize) the sum of the squared lengths of the projections. That is, onto which direction would the projections have the largest average squared length? Because the points are already centred about 0, this is the same as asking in which direction the original data points are most (or least) spread out (i.e. most or least variable).
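The projection formula above can be checked on a single made-up point. A small sketch (values chosen only for illustration): project xk onto a unit direction a and verify that the result is wk a with squared length wk².

```r
# Sketch: project one point x_k onto a unit direction a
a  <- c(3, 4) / 5                          # unit vector: ||a|| = 1
xk <- c(2, 1)
wk <- sum(a * xk)                          # w_k = a^T x_k
proj <- (a %o% a) %*% xk                   # (a a^T) x_k, the projection
stopifnot(all.equal(as.vector(proj), wk * a))   # equals w_k * a
stopifnot(all.equal(sum(proj^2), wk^2))         # squared length is w_k^2
```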

SLIDE 6

Reducing dimensions - Finding the principal axes

Mathematically, we want to find the direction vector a which maximizes (minimizes) the sum Σ_{k=1}^n wk². This sum can in turn be expressed in terms of the original points in the point cloud as:

Σ_{k=1}^n wk² = Σ_{k=1}^n xkᵀaaᵀxk = Σ_{k=1}^n aᵀxkxkᵀa = aᵀ(Σ_{k=1}^n xkxkᵀ)a = aᵀ(XᵀX)a

where Xᵀ = [x1, x2, . . . , xn] is the p × n matrix of data vectors.
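The identity Σ wk² = aᵀ(XᵀX)a is easy to confirm numerically. A quick sketch with made-up centred data (not from the slides):

```r
# Sketch: sum of squared projection lengths equals a^T (X^T X) a
set.seed(2)
X <- scale(matrix(rnorm(20 * 2), ncol = 2), center = TRUE, scale = FALSE)
a <- c(1, 1) / sqrt(2)                       # any unit direction
w <- X %*% a                                 # w_k = a^T x_k, one per row of X
lhs <- sum(w^2)
rhs <- as.numeric(t(a) %*% t(X) %*% X %*% a)
stopifnot(all.equal(lhs, rhs))
```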

SLIDE 7

Reducing dimensions - Finding the principal axes

The maximization (minimization) problem can now be expressed as follows. Find a ∈ ℝᵖ which maximizes (minimizes) aᵀ(XᵀX)a subject to the constraint that aᵀa = 1.

We can write this as an unconstrained optimization by introducing a Lagrange multiplier λ. The problem then becomes to find λ ∈ ℝ and a ∈ ℝᵖ which maximize (minimize) aᵀ(XᵀX)a + λ(1 − aᵀa), which we now simply differentiate with respect to a, set to zero, solve, etc.

Note that the objective function to be maximized (minimized) is a quadratic form in a. More generally, a quadratic form in z ∈ ℝᵖ can always be written as Q(z) = zᵀAz + bᵀz + c where A ∈ ℝᵖˣᵖ, b ∈ ℝᵖ, and c ∈ ℝ are all constants (wlog A = Aᵀ).

SLIDE 8

Reducing dimensions - Finding the principal axes

Differentiating the quadratic form Q(z) = zᵀAz + bᵀz + c with respect to the vector z gives ∂Q(z)/∂z = 2Az + b.

For our problem, we have the variable vector z = a and constants c = λ, b = 0, and A = XᵀX − λIp. Differentiating with respect to a and setting the result to 0 gives the set of equations:

2(XᵀX − λIp)a = 0  ⟹  (XᵀX)a = λa.

Differentiating with respect to λ, setting to zero and solving yields aᵀa = 1. Which should look familiar . . . ?

SLIDE 9

Reducing dimensions - Finding the principal axes

The solutions to the system of equations (XᵀX)a = λa and aᵀa = 1 are an eigen-vector a and its corresponding eigen-value λ of the real symmetric matrix XᵀX. The quadratic form we are maximizing (minimizing) is aᵀ(XᵀX)a = λaᵀa = λ. To maximize (minimize) this quadratic form, we choose the eigen-vector corresponding to the largest (smallest) eigen-value.

Denote the solution to this problem as v1 (or vp for the minimization problem). The solution v1 (or vp) will be the eigen-vector of XᵀX which corresponds to its largest (or smallest) eigen-value λ1 (or λp).

Putting all eigen-vectors into an orthogonal matrix V = [v1, · · · , vp], ordered by the eigen-values λ1 ≥ λ2 ≥ · · · ≥ λp, we have XᵀX = VDλVᵀ, the eigen-decomposition of XᵀX with Dλ = diag(λ1, . . . , λp).
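In R this eigen-decomposition is direct: `eigen()` on the symmetric matrix XᵀX returns the eigen-values in decreasing order with the corresponding eigen-vectors as columns. A sketch on made-up centred data (not from the slides):

```r
# Sketch: principal axes as eigen-vectors of X^T X
set.seed(3)
X <- scale(matrix(rnorm(50 * 3), ncol = 3), center = TRUE, scale = FALSE)
eig <- eigen(t(X) %*% X, symmetric = TRUE)
V <- eig$vectors                     # V = [v1, ..., vp]
lambda <- eig$values                 # lambda_1 >= ... >= lambda_p
stopifnot(all(diff(lambda) <= 0))    # already ordered, largest first
# Each column satisfies (X^T X) v_j = lambda_j v_j
stopifnot(all.equal(as.vector(t(X) %*% X %*% V[, 1]), lambda[1] * V[, 1]))
# V is orthogonal: V^T V = I
stopifnot(all.equal(t(V) %*% V, diag(3)))
```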

SLIDE 10

Reducing dimensions - Finding the principal axes

The figure below shows v1 (and v2) for this data.

[Figure: the point cloud with the principal axes v1 and v2 drawn in]

SLIDE 11

Reducing dimensions - Finding the principal axes

Consider a change of variables y = Vᵀx (with VᵀV = Ip = VVᵀ). The data in the original coordinate system has

x = (x1, x2, . . . , xp)ᵀ = [e1, e2, · · · , ep](x1, x2, . . . , xp)ᵀ = Vy

and in the new coordinate system becomes

x = [v1, v2, · · · , vp](y1, y2, . . . , yp)ᵀ = Vy = VVᵀx.

The values y1, . . . , yp are coordinates on new axes v1, . . . , vp. The axes were chosen so that the coordinates are most spread out for variable y1, next for variable y2, . . . , least for yp. The axes v1, . . . , vp are called the principal axes and the variables y1, . . . , yp the principal components.

(N.B. sometimes both axes and variables are called the principal components . . . sigh.) Note that the transformed variates (the principal components) yi and yj are now uncorrelated for i ≠ j (since the y points are now aligned along their principal axes).
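That the principal components are uncorrelated can be checked directly. A sketch using the versicolor subset of R's built-in iris data (the same subset used later in these slides); the construction via `eigen()` follows the derivation above:

```r
# Sketch: principal components y_i and y_j are uncorrelated for i != j
X <- scale(as.matrix(iris[iris$Species == "versicolor", 1:4]),
           center = TRUE, scale = FALSE)
V <- eigen(t(X) %*% X, symmetric = TRUE)$vectors
Y <- X %*% V                                # the principal components
C <- cov(Y)
stopifnot(all(abs(C[upper.tri(C)]) < 1e-10))  # off-diagonals are ~ 0
```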

SLIDE 12

Reducing dimensions - Finding the principal axes

For our example, the transform y = Vᵀx yields

y1 = v1ᵀx = c1x1 + c2x2 + · · · + cpxp

as a weighted linear combination of the original x variables (using the entries of v1 as weights). The following figure shows the points as they appear in the new coordinate system (e.g. yk = Vᵀxk).

[Figure: the point cloud in the (y1, y2) coordinate system, with the point yk marked]

Note that the transformation rotates the points into position and (in this example) reflects them through one (or more) of the principal axes.

SLIDE 13

Reducing dimensions - Finding the principal axes

For our example, consider only the Species == "versicolor" cases of the iris data and the first three variates. In three dimensions, the plot looks like

library(loon)
## Loading required package: tcltk
data <- l_scale3D(iris[iris$Species == "versicolor", 1:4])
p3D <- l_plot3D(data[, 1:3], showGuides = TRUE)
plot(p3D)

[Figure: 3-D scatterplot of the first three variates, with Sepal.Length and Sepal.Width among the axes]

Which can now be rotated by hand to get to the principal components.

SLIDE 14

Reducing dimensions

Note that

λj = λjvjᵀvj = vjᵀ(λjvj) = vjᵀ(XᵀX)vj = (Xvj)ᵀ(Xvj) = zᵀz = Σ_{i=1}^n zi²

for real values zi, and so λj ≥ 0 ∀ j = 1, . . . , p. Note that each zi = vjᵀxi is the coordinate of the data point xi projected onto the principal axis vj.

So, if λj = 0, then the projection of every point onto the direction vj is identically 0! That is, the data lie in a space orthogonal to vj.

Suppose there is a value d < p such that λ1 ≥ · · · ≥ λd > 0, and λj = 0 for j > d. Then the points x1, . . . , xn lie in a d-dimensional subspace of ℝᵖ defined by the principal axes v1, . . . , vd. That is, xi ∈ span{v1, . . . , vd} ⊂ ℝᵖ for all i = 1, . . . , n.

Question: What if we only have λj ≈ 0 for j > d?
Answer: The points x1, . . . , xn nearly lie in a d-dimensional subspace. Perhaps we can reduce consideration to d dimensions.
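The zero-eigenvalue case can be simulated. A sketch (not from the slides) in which p = 3 but the points are constructed to span only a d = 2 dimensional subspace, so the third eigen-value of XᵀX is (numerically) zero:

```r
# Sketch: data lying exactly in a 2-D subspace of R^3 give lambda_3 = 0
set.seed(4)
B <- matrix(rnorm(3 * 2), nrow = 3)       # basis for a 2-D subspace of R^3
W <- matrix(rnorm(30 * 2), ncol = 2)      # 30 points in subspace coordinates
X <- scale(W %*% t(B), center = TRUE, scale = FALSE)  # 30 x 3, rank 2
lambda <- eigen(t(X) %*% X, symmetric = TRUE)$values
stopifnot(lambda[1] > 0, lambda[2] > 0)
stopifnot(abs(lambda[3]) < 1e-8)          # third eigen-value ~ 0
```

Centring does not move the points out of the subspace (the mean of points in a subspace lies in it), which is why the rank, and hence the zero eigen-value, survives.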

SLIDE 15

Reducing dimensions

Let Yᵀ = [y1, . . . , yn] be the p × n matrix of the points y1, . . . , yn in the coordinate system of the principal axes. Note that these coordinates in the new space are simply given by Y = X[v1, . . . , vp] = XV. The ith column of Y is the ith principal component.

When we want to reduce the dimensionality to only those defined by the first d principal components, we could right-multiply X by the p × d matrix [v1, . . . , vd], or equivalently simply select the first d columns of Y.

There is a particularly handy decomposition of a real rectangular n × p matrix X called the singular value decomposition: X = UDσVᵀ where U is an n × p matrix with the property that UᵀU = Ip, V is a p × p matrix with VᵀV = VVᵀ = Ip, and Dσ = diag(σ1, . . . , σp) with σ1 ≥ σ2 ≥ · · · ≥ σp ≥ 0.

The scalars σi are called the singular values of X, the columns of U the left singular vectors and the columns of V the right singular vectors.
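The two equivalent routes to a d-dimensional reduction mentioned above can be sketched directly (made-up data, not from the slides):

```r
# Sketch: reduce to the first d principal components, two equivalent ways
set.seed(5)
X <- scale(matrix(rnorm(40 * 5), ncol = 5), center = TRUE, scale = FALSE)
V <- eigen(t(X) %*% X, symmetric = TRUE)$vectors
d <- 2
Y_d1 <- X %*% V[, 1:d]        # right-multiply X by [v1, ..., vd]
Y    <- X %*% V               # all principal components
Y_d2 <- Y[, 1:d]              # ... or select the first d columns of Y
stopifnot(all.equal(Y_d1, Y_d2))
```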

SLIDE 16

Reducing dimensions

Letting X = UDσVᵀ be the singular value decomposition of X, then

XᵀX = (UDσVᵀ)ᵀUDσVᵀ = VDσUᵀUDσVᵀ = VDσDσVᵀ = VDσ²Vᵀ

is just the eigen-decomposition of XᵀX with Dσ² = Dλ. In particular, note that σi = √λi; unlike λ, σ is on the same scale as the data values.

So, a quick way to get the principal components is to find the singular value decomposition of the (centred) matrix X; then Y = XV = UDσ.
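The connection between `svd()` and the eigen-decomposition is easy to verify. A sketch on made-up centred data (not from the slides): the squared singular values equal the eigen-values of XᵀX, and XV = UDσ.

```r
# Sketch: svd() versus eigen() for principal components
set.seed(6)
X <- scale(matrix(rnorm(25 * 4), ncol = 4), center = TRUE, scale = FALSE)
s <- svd(X)
lambda <- eigen(t(X) %*% X, symmetric = TRUE)$values
stopifnot(all.equal(s$d^2, lambda))                      # sigma_i^2 = lambda_i
stopifnot(all.equal(X %*% s$v, s$u %*% diag(s$d)))       # Y = X V = U D_sigma
```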

SLIDE 17

Reducing dimensions

For example, in R

# Principal components
# ... via a ...
# Singular value decomposition (svd)
#          T
# X = U D V
library(loon)
data <- oliveAcids
# Scale as well as centre since we are looking at scatterplots
data <- scale(data, center = TRUE, scale = TRUE)
svd_data <- svd(data)
u <- svd_data$u
d <- svd_data$d
v <- svd_data$v

SLIDE 18

Reducing dimensions

#
# V has orthonormal columns
#
round(t(v) %*% v, 10)
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,]    1
## [2,]         1
## [3,]              1
## [4,]                   1
## [5,]                        1
## [6,]                             1
## [7,]                                  1
## [8,]                                       1

# and, being square, so are its rows
round(v %*% t(v), 10)
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,]    1
## [2,]         1
## [3,]              1
## [4,]                   1
## [5,]                        1
## [6,]                             1
## [7,]                                  1
## [8,]                                       1

(In both cases the result is the 8 × 8 identity matrix; zero entries are suppressed in the display.)

SLIDE 19

Reducing dimensions

#
# Similarly u has orthonormal columns
# (but not rows)
#
round(t(u) %*% u, 10)
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,]    1
## [2,]         1
## [3,]              1
## [4,]                   1
## [5,]                        1
## [6,]                             1
## [7,]                                  1
## [8,]                                       1

# compare: u %*% t(u) ... n by n non-orthogonal
#

SLIDE 20

Reducing dimensions

# d contains the singular values of X
# Here they are in a diagonal matrix
round(diag(d), 0)
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,]   46
## [2,]        32
## [3,]             24
## [4,]                  21
## [5,]                       14
## [6,]                            12
## [7,]                                  8
## [8,]                                       1

Clearly, some of these are much larger than others. Perhaps we do not need all of these dimensions to explore the data? To get a sense of the relative size we look at a so-called scree plot.

SLIDE 21

Reducing dimensions

Scree plots: determining the effective dimension of the subspace.

# The following is a "Scree plot"
plot(d/max(d), type = "b",
     xlab = "principal component",
     ylab = "singular value (relative)",
     col = "grey50", pch = 16)

[Figure: scree plot of the relative singular values for the 8 components]

Perhaps 4 dimensions are enough? 5? Ideally there would be a large noticeable drop at some point.

SLIDE 22

Reducing dimensions

Now, σi/σmax is a measure of the effective (fractional) dimensionality of the data. This suggests that we might consider looking at the ratio

Σ_{i=1}^d σi / Σ_{j=1}^p σj

as a measure of the proportion of the dimensionality retained by the first d components. This suggests an alternative plot:

# The following plots the proportion retained
plot(cumsum(d)/sum(d), type = "b", col = "grey50",
     xlab = "Number of principal components",
     ylab = "Proportion of effective dimension",
     ylim = 0:1, pch = 16, cex = 2)

[Figure: cumulative proportion of effective dimension over the 8 components]

SLIDE 23

Reducing dimensions - a more complex example

In the frey dataset to be found in the loon.data package, we have 1,965 images of Brendan Frey’s face. Here are 4 examples:

[Figure: four example face images from the frey data]

SLIDE 24

Reducing dimensions - a more complex example

◮ Each pixel has a grey value in [0,1].
◮ Each image is a 28 × 20 pixel image. So each image is a point in 560 dimensions.

SLIDE 25

Reducing dimensions - a more complex example

# Principal components
# ... via a ...
# Singular value decomposition (svd)
#          T
# X = U D V
library(loon)
library(loon.data)
data(frey)
data <- t(frey)
# Centre only (do not rescale the pixel grey values)
data <- scale(data, center = TRUE, scale = FALSE)
svd_data <- svd(data)
u <- svd_data$u
d <- svd_data$d
v <- svd_data$v

SLIDE 26

Reducing dimensions - a more complex example

[Figure: scree plot of the relative singular values and the cumulative proportion of effective dimension for the 560 components]

Doesn’t look like we need all 560 dimensions, but how many should we choose? 100? Let’s look at 20.

SLIDE 29

Reducing dimensions - a more complex example

# Here are the images
frey.imgs <- l_image_import_array(frey, 28, 20,
                                  img_in_row = FALSE, rotate = 90)
l_imageviewer(frey.imgs)
reduced <- data %*% v[, 1:20]
nav1 <- l_navgraph(reduced, linkingGroup = "frey")
gl1 <- l_glyph_add_image(nav1$plot, images = frey.imgs, label = "frey faces")
nav1$plot['glyph'] <- gl1
scags2d <- scagnostics2d(reduced)
nav2 <- l_ng_plots(measures = scags2d, linkingGroup = "frey")
gl2 <- l_glyph_add_image(nav2$plot, images = frey.imgs, label = "frey faces")
nav2$plot['glyph'] <- gl2

SLIDE 30

Reducing dimensions - a more complex example

Some remarks:

◮ the dimension reduction made no use of the fact that the data:
  ◮ were faces (i.e. no facial features extracted)
  ◮ were images (i.e. spatial position not really used)
◮ principal components does not care which position in the 560-dimensional vector each pixel location occupies
  ◮ i.e. we could have randomly assigned the image pixel array locations to the locations in the vector
◮ we still got some useful positioning even in only the first two principal components
◮ there is no reason to reduce the dimensionality to only 2 or 3 dimensions
  ◮ we can reduce to 20 or even more dimensions if that is what the scree plot suggests
  ◮ we can reduce to any number provided that the determination of interesting projections is computationally feasible
◮ principal components is looking for low-dimensional linear subspaces (or linear manifolds)