Reducing dimensionality - Principal components
R.W. Oldford
Reducing dimensions
Recall how orthogonal projections work.
Given $V = [v_1, \ldots, v_k]$, an orthogonal projection matrix $P$ is easily constructed as
$$P = V(V^TV)^{-1}V^T.$$
And, if the column vectors of $V$ form an orthonormal basis for the subspace $S$, then $P = VV^T$.
So far, we have only considered the choice where the $v_i$ are unit vectors in the direction of the original data axes $e_i$. Projections onto these directions simply return the scatterplots on pairs of the original variates. Are there other directions which would do as well (or possibly better)?
Imagine that we have $n$ points $x_1, \ldots, x_n \in \mathbb{R}^p$, centred so that $\sum_{i=1}^n x_i = 0$, and denote by $X = [x_1, \ldots, x_n]^T$ the $n \times p$ real matrix whose $i$th row is $x_i^T$.
With the data centred, we can now ask whether they truly lie in a linear subspace of Rp. And, if they do, can we find that subspace? Alternatively, do they lie nearly in a linear subspace, and could we find it?
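Before looking for new directions, it may help to see the projection machinery in code. The following R sketch is not from the slides: it builds $P = V(V^TV)^{-1}V^T$ for an arbitrary made-up V and checks the defining properties of an orthogonal projection matrix.

# Sketch (not from the slides): construct an orthogonal projection matrix
set.seed(314)
V <- matrix(rnorm(3 * 2), nrow = 3, ncol = 2)   # p = 3, k = 2, arbitrary example
P <- V %*% solve(t(V) %*% V) %*% t(V)
# P is symmetric and idempotent, as any orthogonal projection matrix must be
all.equal(P, t(P))                      # TRUE
all.equal(P, P %*% P)                   # TRUE
# projecting an already-projected vector changes nothing
x <- rnorm(3)
all.equal(P %*% x, P %*% (P %*% x))     # TRUE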
Reducing dimensions - Finding the principal axes
For example suppose n = 20 and p = 2 so that a point cloud might look like:
[Figure: scatterplot of the n = 20 points in the plane.]
The points x1, . . . , xn lie in the plane, but do not occupy all of it. They appear nearly to lie in a one-dimensional subspace of $\mathbb{R}^2$.
Reducing dimensions - Finding the principal axes
We can think about orthogonally projecting the points (or equivalently, vectors) $x_1, \ldots, x_n$ onto any direction vector $a \in \mathbb{R}^p$ (i.e. any $a \in \mathbb{R}^p$ with $\|a\| = 1$).
[Figure: the point cloud with a unit direction vector a drawn; the point x_k is orthogonally projected onto a.]
Reducing dimensions - Finding the principal axes
The orthogonal projection of the point $x_k$ onto $a$ (i.e. onto $\mathrm{span}\{a\}$) is
$$(aa^T)x_k = a\,w_k,$$
a vector in the direction $a \times \mathrm{sign}(w_k)$ of length $|w_k|$, with $w_k = a^Tx_k$. Note that the squared length of this projection is
$$w_k^2 = \|aa^Tx_k\|^2 = x_k^Ta\,a^Tx_k = a^Tx_kx_k^Ta.$$
Since every point can be projected onto the direction vector a, we might ask which vector would maximize (or minimize) the sum of the squared lengths of the projections. That is, onto which direction would the projections have the largest average squared length? Because the points are already centred at 0, this is the same as asking in which direction the original data points are most (or least) spread out (i.e. most or least variable).
Reducing dimensions - Finding the principal axes
Mathematically we want to find the direction vector $a$ which maximizes (minimizes) the sum $\sum_{k=1}^n w_k^2$.
This sum can in turn be expressed in terms of the original points in the point cloud as:
$$\sum_{k=1}^n w_k^2 = \sum_{k=1}^n x_k^Ta\,a^Tx_k = \sum_{k=1}^n a^Tx_kx_k^Ta = a^T\left(\sum_{k=1}^n x_kx_k^T\right)a = a^T(X^TX)\,a,$$
where $X^T = [x_1, x_2, \ldots, x_n]$ is the $p \times n$ matrix of data vectors.
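This identity is easy to check numerically. The following sketch is not from the slides; it uses simulated centred data and a random unit direction a.

# Sketch: sum of squared projection lengths equals a^T (X^T X) a
set.seed(1)
n <- 20; p <- 2
X <- scale(matrix(rnorm(n * p), n, p), center = TRUE, scale = FALSE)  # centred data
a <- rnorm(p); a <- a / sqrt(sum(a^2))       # a unit direction vector
w <- X %*% a                                 # w_k = a^T x_k for each point
c(sum(w^2), t(a) %*% crossprod(X) %*% a)     # the two quantities agree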
Reducing dimensions - Finding the principal axes
The maximization (minimization) problem can now be expressed as follows. Find $a \in \mathbb{R}^p$ which maximizes (minimizes) $a^T(X^TX)a$ subject to the constraint that $a^Ta = 1$. We can write this as an unconstrained optimization by introducing a Lagrange multiplier $\lambda$. The problem then becomes: find $\lambda \in \mathbb{R}$ and $a \in \mathbb{R}^p$ which maximize (minimize)
$$a^T(X^TX)a + \lambda(1 - a^Ta),$$
which we now simply differentiate with respect to $a$, set to zero, solve, etc.
Note that the objective function to be maximized (minimized) is a quadratic form in $a$. More generally, a quadratic form in $z \in \mathbb{R}^p$ can always be written as
$$Q(z) = z^TAz + b^Tz + c$$
where $A \in \mathbb{R}^{p \times p}$, $b \in \mathbb{R}^p$, and $c \in \mathbb{R}$ are all constants (w.l.o.g. $A = A^T$).
Reducing dimensions - Finding the principal axes
Differentiating the quadratic form $Q(z) = z^TAz + b^Tz + c$ with respect to the vector $z$ gives
$$\frac{\partial}{\partial z}Q(z) = 2Az + b.$$
For our problem, we have the variable vector $z = a$ and constants $c = \lambda$, $b = 0$, and $A = X^TX - \lambda I_p$. Differentiating with respect to $a$ and setting the result to $0$ gives the set of equations
$$2(X^TX - \lambda I_p)a = 0 \quad\Longleftrightarrow\quad (X^TX)a = \lambda a.$$
Differentiating with respect to $\lambda$, setting to zero, and solving yields $a^Ta = 1$. Which should look familiar . . . ?
Reducing dimensions - Finding the principal axes
The solutions to the system of equations $(X^TX)a = \lambda a$ and $a^Ta = 1$ are the eigenvectors $a$ and their corresponding eigenvalues $\lambda$ of the real symmetric matrix $X^TX$. The quadratic form we are maximizing (minimizing) is $a^T(X^TX)a = \lambda a^Ta = \lambda$. To maximize (minimize) this quadratic form, we choose the eigenvector corresponding to the largest (smallest) eigenvalue. Denote the solution to the maximization problem by $v_1$ (and to the minimization problem by $v_p$): $v_1$ (or $v_p$) is the eigenvector of $X^TX$ corresponding to its largest (or smallest) eigenvalue $\lambda_1$ (or $\lambda_p$). Putting all eigenvectors into an orthogonal matrix $V = [v_1, \cdots, v_p]$, ordered by the eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p$, we have the eigen-decomposition
$$X^TX = VD_\lambda V^T \quad\text{with}\quad D_\lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_p).$$
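In R, this eigen-decomposition is available directly through eigen(), which returns the eigenvalues in decreasing order (so v1 is the first column of the eigenvector matrix and vp the last). A sketch with simulated data, not from the slides:

# Sketch: the principal axes are eigenvectors of X^T X
set.seed(2)
X <- scale(matrix(rnorm(20 * 2), 20, 2), center = TRUE, scale = FALSE)
decomp <- eigen(crossprod(X), symmetric = TRUE)
V <- decomp$vectors          # columns v1, ..., vp, eigenvalues in decreasing order
lambda <- decomp$values
# the quadratic form attains lambda_1 at v1 and lambda_p at vp
c(t(V[, 1]) %*% crossprod(X) %*% V[, 1], lambda[1])
c(t(V[, 2]) %*% crossprod(X) %*% V[, 2], lambda[2])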
Reducing dimensions - Finding the principal axes
The figure below shows v1 (and v2) for this data.
[Figure: the point cloud with the principal axes v1 and v2 overlaid.]
Reducing dimensions - Finding the principal axes
Consider a change of variables $y = V^Tx$ (with $V^TV = I_p = VV^T$). The data point in the original coordinate system is
$$x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_p \end{bmatrix} = [e_1, e_2, \cdots, e_p]\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_p \end{bmatrix} = Vy$$
and in the new coordinate system becomes
$$[v_1, v_2, \cdots, v_p]\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_p \end{bmatrix} = VV^Tx = x.$$
The values $y_1, \ldots, y_p$ are coordinates on new axes $v_1, \ldots, v_p$. The axes were chosen so that the coordinates are most spread out for variable $y_1$, next for variable $y_2$, . . . , and least for $y_p$. The axes $v_1, \ldots, v_p$ are called the principal axes and the variables $y_1, \ldots, y_p$ the principal components.
(N.B. sometimes both the axes and the variables are called the principal components . . . sigh.) Note that the transformed variates (the principal components) $y_i$ and $y_j$ are now uncorrelated for $i \ne j$ (since the $y$ points are now aligned along their principal axes).
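The uncorrelatedness of the principal components can also be verified numerically. A sketch with simulated data, not from the slides:

# Sketch: principal components are uncorrelated
set.seed(3)
X <- scale(matrix(rnorm(100 * 3), 100, 3), center = TRUE, scale = FALSE)
V <- eigen(crossprod(X), symmetric = TRUE)$vectors
Y <- X %*% V                     # principal components as columns
round(cov(Y), 10)                # diagonal matrix: off-diagonal entries are 0
diag(cov(Y)) * (nrow(Y) - 1)     # recovers the eigenvalues lambda_1 >= lambda_2 >= lambda_3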
Reducing dimensions - Finding the principal axes
For our example, the transform $y = V^Tx$ yields
$$y_1 = v_1^Tx = c_1x_1 + c_2x_2 + \cdots + c_px_p,$$
a weighted linear combination of the original $x$ variables (using the entries of $v_1$ as weights). The following figure shows the points as they appear in the new coordinate system (i.e. $y_k = V^Tx_k$).
[Figure: the point cloud plotted in the new (y1, y2) coordinate system, with one point y_k highlighted.]
Note that the transformation rotates the points into position and (in this example) reflects them through one (or more) of the principal axes.
Reducing dimensions - Finding the principal axes
For our example, consider only the versicolor species of the iris data and the first three variates. In three dimensions, the plot looks like:
library(loon)
## Loading required package: tcltk
data <- l_scale3D(iris[iris$Species == "versicolor", 1:4])
p3D <- l_plot3D(data[, 1:3], showGuides = TRUE)
plot(p3D)
[Figure: interactive 3D scatterplot of Sepal.Length, Sepal.Width, and Petal.Length for the versicolor irises.]
This can now be rotated by hand to get to the principal components.
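As a cross-check (not part of the slides), base R's prcomp() computes the same principal axes via the singular value decomposition; the numbers will differ slightly from the interactive plot above because l_scale3D rescales the variates differently.

# Sketch: the principal axes of the versicolor data via base R's prcomp()
versicolor <- iris[iris$Species == "versicolor", 1:3]
pc <- prcomp(versicolor, center = TRUE, scale. = FALSE)
pc$rotation      # columns are the principal axes v1, v2, v3
head(pc$x)       # the principal components y = V^T x for each observation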
Reducing dimensions
Note that
$$\lambda_j = \lambda_j v_j^Tv_j = v_j^T(\lambda_j v_j) = v_j^T(X^TX)v_j = (Xv_j)^T(Xv_j) = z^Tz = \sum_{i=1}^n z_i^2$$
for real values $z_i$, and so $\lambda_j \ge 0$ for all $j = 1, \ldots, p$. Note that each $z_i = v_j^Tx_i$ is the coordinate of the data point $x_i$ projected onto the principal axis $v_j$.
So, if $\lambda_j = 0$, then the projection of every point onto the direction $v_j$ is identically $0$! That is, the data lie in a space orthogonal to $v_j$.
Suppose there is a value $d < p$ such that $\lambda_1 \ge \cdots \ge \lambda_d > 0$, and $\lambda_j = 0$ for $j > d$. Then the points $x_1, \ldots, x_n$ lie in a $d$-dimensional subspace of $\mathbb{R}^p$ defined by the principal axes $v_1, \ldots, v_d$. That is, $x_i \in \mathrm{span}\{v_1, \ldots, v_d\} \subset \mathbb{R}^p$ for all $i = 1, \ldots, n$.
Question: What if we only have $\lambda_j \approx 0$ for $j > d$?
Answer: The points $x_1, \ldots, x_n$ nearly lie in a $d$-dimensional subspace. Perhaps we can reduce consideration to $d$ dimensions.
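To see the effect of a near-zero eigenvalue, the following sketch (simulated data, not from the slides) generates points that almost lie on a line in the plane and shows that the second eigenvalue is tiny relative to the first.

# Sketch: data nearly in a 1-dimensional subspace of R^2
set.seed(4)
z <- rnorm(20)
X <- cbind(z, 2 * z + rnorm(20, sd = 0.05))     # nearly on the line x2 = 2 x1
X <- scale(X, center = TRUE, scale = FALSE)
lambda <- eigen(crossprod(X), symmetric = TRUE)$values
lambda                     # lambda_2 is near 0
lambda[2] / lambda[1]      # a very small ratio: the data are effectively 1-dimensional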
Reducing dimensions
Let $Y^T = [y_1, \ldots, y_n]$ be the $p \times n$ matrix of the points $y_1, \ldots, y_n$ in the coordinate system of the principal axes. Note that these coordinates in the new space are simply given by
$$Y = X[v_1, \ldots, v_p] = XV.$$
The $i$th column of $Y$ is the $i$th principal component. When we want to reduce the dimensionality to only that defined by the first $d$ principal components, we can right-multiply $X$ by the $p \times d$ matrix $[v_1, \ldots, v_d]$, or equivalently simply select the first $d$ columns of $Y$.
There is a particularly handy decomposition of a real rectangular $n \times p$ matrix $X$ called the singular value decomposition:
$$X = UD_\sigma V^T$$
where $U$ is an $n \times p$ matrix with the property that $U^TU = I_p$, $V$ is a $p \times p$ matrix with $V^TV = VV^T = I_p$, and $D_\sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_p)$ with $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_p \ge 0$.
The scalars σi are called the singular values of X, the columns of U the left singular vectors and the columns of V the right singular vectors.
Reducing dimensions
Letting $X = UD_\sigma V^T$ be the singular value decomposition of $X$, then
$$X^TX = (UD_\sigma V^T)^T UD_\sigma V^T = VD_\sigma U^TUD_\sigma V^T = VD_\sigma D_\sigma V^T = VD_\sigma^2 V^T$$
is just the eigen-decomposition of $X^TX$ with $D_\sigma^2 = D_\lambda$. In particular, note that $\sigma_i = \sqrt{\lambda_i}$; unlike $\lambda$, $\sigma$ is on the same scale as the data values.
So, a quick way to get the principal components is to find the singular value decomposition of the (centred) matrix $X$ and then
$$Y = XV = UD_\sigma.$$
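This relationship between the SVD and the principal components can be confirmed numerically; the sketch below (simulated data, not from the slides) checks both $Y = XV = UD_\sigma$ and $\sigma_i^2 = \lambda_i$.

# Sketch: principal components from the SVD
set.seed(5)
X <- scale(matrix(rnorm(50 * 4), 50, 4), center = TRUE, scale = FALSE)
s <- svd(X)
all.equal(X %*% s$v, s$u %*% diag(s$d))                           # Y = X V = U D_sigma
all.equal(s$d^2, eigen(crossprod(X), symmetric = TRUE)$values)    # sigma_i^2 = lambda_i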
Reducing dimensions
For example, in R:

# Principal components
# ... via a ...
# Singular value decomposition (svd)
# X = U D V^T
library(loon)
data <- oliveAcids
# Scale as well as centre since we are looking at scatterplots
data <- scale(data, center = TRUE, scale = TRUE)
svd_data <- svd(data)
u <- svd_data$u
d <- svd_data$d
v <- svd_data$v
Reducing dimensions
# V has orthonormal columns
round(t(v) %*% v, 10)
## an 8 x 8 identity matrix (1 on the diagonal, 0 elsewhere)

# and, being square, so are its rows
round(v %*% t(v), 10)
## again the 8 x 8 identity matrix
Reducing dimensions
# Similarly u has orthonormal columns (but not rows)
round(t(u) %*% u, 10)
## an 8 x 8 identity matrix (1 on the diagonal, 0 elsewhere)

# compare: u %*% t(u) ... n by n, not orthogonal
Reducing dimensions
# d contains the singular values of X
# Here they are in a diagonal matrix
round(diag(d), 0)
## an 8 x 8 diagonal matrix with diagonal entries 46, 32, 24, 21, 14, 12, 8, 1

Clearly, some of these are much larger than others. Perhaps we do not need all of these dimensions to explore the data. To get a sense of their relative sizes, we look at a so-called scree plot.
Reducing dimensions
Scree plots: determining the effective dimension of the subspace.
# The following is a "Scree plot"
plot(d/max(d), type = "b",
     xlab = "principal component",
     ylab = "singular value (relative)",
     col = "grey50", pch = 16, cex = 2)
[Figure: scree plot of the relative singular values against principal component number.]
Perhaps 4 dimensions are enough? 5? Ideally there would be a large noticeable drop at some point.
Reducing dimensions
Now, $\sum_i \frac{\sigma_i}{\sigma_{\max}}$ is a measure of the effective (fractional) dimensionality of the data. This suggests that we might consider looking at the ratio
$$\frac{\sum_{i=1}^d \sigma_i}{\sum_{j=1}^p \sigma_j}$$
as a measure of the proportion of the dimensionality retained by the first $d$ components. This suggests an alternative plot:
# The following plots the cumulative proportion
plot(cumsum(d)/sum(d), type = "b", col = "grey50",
     xlab = "Number of principal components",
     ylab = "Proportion of effective dimension",
     ylim = 0:1, pch = 16, cex = 2)
[Figure: cumulative proportion of effective dimension against the number of principal components.]
Reducing dimensions - a more complex example
In the frey dataset, found in the loon.data package, we have 1,965 images of Brendan Frey's face. Here are 4 examples:
[Figure: four example images of Brendan Frey's face.]
Reducing dimensions - a more complex example
◮ Each pixel has a grey value in [0, 1].
◮ Each image is a 28 × 20 pixel image, so each image is a point in 560 dimensions (see the sketch below).
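The sketch below (a made-up image, not from the slides) shows the flattening that turns a 28 × 20 grid of grey values into a single point in 560 dimensions.

# Sketch: a 28 x 20 image flattened to a point in R^560
img <- matrix(runif(28 * 20), nrow = 28, ncol = 20)   # a made-up grey-scale image
x <- as.vector(img)                                   # one point in 560 dimensions
length(x)                                             # 560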
Reducing dimensions - a more complex example
# Principal components
# ... via a ...
# Singular value decomposition (svd)
# X = U D V^T
library(loon)
library(loon.data)
data(frey)
data <- t(frey)
# Centre only (no scaling): the pixel grey values are already on a common scale
data <- scale(data, center = TRUE, scale = FALSE)
svd_data <- svd(data)
u <- svd_data$u
d <- svd_data$d
v <- svd_data$v
Reducing dimensions - a more complex example
[Figure: "Scree plot" of the relative singular values and "Proportion effective dimension" plot for the 560 pixel dimensions.]
Doesn’t look like we need all 560 dimensions, but how many should we choose? 100? Let’s look at 20.
Reducing dimensions - a more complex example
# Here are the images
frey.imgs <- l_image_import_array(frey, 28, 20, img_in_row = FALSE, rotate = 90)
l_imageviewer(frey.imgs)
reduced <- data %*% v[, 1:20]
nav1 <- l_navgraph(reduced, linkingGroup = "frey")
gl1 <- l_glyph_add_image(nav1$plot, images = frey.imgs, label = "frey faces")
nav1$plot['glyph'] <- gl1
scags2d <- scagnostics2d(reduced)
nav2 <- l_ng_plots(measures = scags2d, linkingGroup = "frey")
gl2 <- l_glyph_add_image(nav2$plot, images = frey.imgs, label = "frey faces")
nav2$plot['glyph'] <- gl2
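A complementary way to see what the 20 retained components capture (not shown in the slides) is to reconstruct an image from them. The sketch below assumes the objects frey, data, v, and reduced created in the chunk above; the display orientation may need adjusting, since l_image_import_array above used rotate = 90.

# Sketch: rank-20 reconstruction of the first face from its principal components
centre <- attr(data, "scaled:center")                 # column means removed by scale()
approx1 <- reduced[1, ] %*% t(v[, 1:20]) + centre     # back to 560 pixel values
orig1 <- matrix(t(frey)[1, ], nrow = 28, ncol = 20)   # the original first image
rec1  <- matrix(approx1, nrow = 28, ncol = 20)        # its rank-20 approximation
# view them side by side with base graphics
op <- par(mfrow = c(1, 2))
image(orig1, col = grey(seq(0, 1, length.out = 64)))
image(rec1,  col = grey(seq(0, 1, length.out = 64)))
par(op)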
Reducing dimensions - a more complex example
Some remarks:
◮ the dimension reduction made no use of the fact that the data:
  ◮ were faces (i.e. no facial features extracted)
  ◮ were images (i.e. spatial position not really used)
◮ principal components does not care which position in the 560-dimensional vector each pixel location occupies
  ◮ i.e. we could have randomly assigned the image pixel array locations to the locations in the vector
◮ we still got some useful positioning even in only the first two principal components
◮ there is no reason to reduce the dimensionality to only 2 or 3 dimensions
  ◮ we can reduce to 20 or even more dimensions if that is what the scree plot suggests
  ◮ we can reduce to any number provided that the determination of interesting projections is computationally feasible