Non-linear dimensionality reduction: Recasting Principal Components


1. Non-linear dimensionality reduction: Recasting Principal Components (R.W. Oldford)

2. Reducing dimensions - recasting the problem of principal components

The principal axes $V$ for a set of data $X^T = [x_1, \ldots, x_n]$ can be found in one of two ways. Either:
◮ via the eigen-decomposition of $X^T X = V D_\lambda V^T$, or
◮ via the singular value decomposition of $X = U D_\sigma V^T$.
Either way, the principal components are formed as
$$Y = X V = U D_\sigma,$$
which means that all we need are $U$ and $D_\sigma$; we don't really need $V$. We can get these two matrices from a different eigen-decomposition, namely
$$Y Y^T = U D_\sigma (U D_\sigma)^T = U D_\sigma^2 U^T = X X^T.$$
Note that this matrix is $n \times n$, i.e. it depends only on the sample size and not on the dimension $p$. So there is a choice: either the $p \times p$ matrix $X^T X$ or the $n \times n$ matrix $X X^T$.
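
To make the two routes concrete, here is a minimal R sketch (not from the slides; the data matrix X and all object names below are invented for illustration) showing that the SVD of X and the eigen-decomposition of X X^T yield the same principal components:

set.seed(1)
# an assumed centred n x p data matrix (n = 20, p = 3)
X <- scale(matrix(rnorm(20 * 3), nrow = 20), center = TRUE, scale = FALSE)

# Route 1: singular value decomposition X = U D_sigma V^T
sv <- svd(X)
Y_svd <- sv$u %*% diag(sv$d)        # principal components Y = U D_sigma = X V

# Route 2: eigen-decomposition of the n x n matrix X X^T = U D_sigma^2 U^T
eg <- eigen(X %*% t(X), symmetric = TRUE)
all.equal(eg$values[1:3], sv$d^2)   # nonzero eigenvalues are the squared singular values
Y_eig <- eg$vectors[, 1:3] %*% diag(sqrt(eg$values[1:3]))
max(abs(abs(Y_svd) - abs(Y_eig)))   # ~0 (the columns agree up to sign)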

3. Reducing dimensions - recasting the problem of principal components

The choice $X X^T$ has an interesting structure:
$$
X X^T = \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_n^T \end{bmatrix} [x_1, x_2, \ldots, x_n]
      = \begin{bmatrix}
          x_1^T x_1 & x_1^T x_2 & \cdots & x_1^T x_n \\
          x_2^T x_1 & x_2^T x_2 & \cdots & x_2^T x_n \\
          \vdots    & \vdots    & \ddots & \vdots    \\
          x_n^T x_1 & x_n^T x_2 & \cdots & x_n^T x_n
        \end{bmatrix}
$$
Note that:
◮ the $(i, j)$ element $x_i^T x_j$ is an inner product
◮ this matrix of inner products is often called the Gram matrix
◮ if the data were centred ($\sum_{i=1}^n x_i = 0$) then each row and column above sums to 0.
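
A small R sketch of this structure, continuing with the hypothetical centred matrix X from the sketch above (the names are mine, not the slides'):

G <- X %*% t(X)        # the n x n Gram matrix: G[i, j] = x_i^T x_j
dim(G)                 # n x n, regardless of p
range(rowSums(G))      # ~0: with centred data every row sums to zero
range(colSums(G))      # ~0: and, by symmetry, so does every column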

4. Reducing dimensions - problems with principal axes

Principal axes are, well, just that . . . axes. That is, they correspond to directions $v_1, \ldots, v_k$ onto which the data $x_1, \ldots, x_n$, when projected orthogonally, have maximal spread. We look to remove axes (directions) for which the sum of the squared lengths of the projections is relatively small. For this to work, the data have to be (nearly) restricted to a linear subspace.

5. Reducing dimensions - problems with principal axes

What if the data lie in a very restricted part of the space, but that part is just not linear? For example, points scattered around the perimeter of a circle lie in a very compact region of the plane, yet not along any direction vector (or principal axis). The data are still essentially only one-dimensional, though: just arranged around the perimeter of a circle rather than along a line.

6. Reducing dimensions - problems with principal axes

Possible solutions?

1. Change variables, e.g. to polar coordinates:
$$(x_1, x_2) \rightarrow (f_1, f_2) = \left( \sqrt{x_1^2 + x_2^2},\; \arccos\!\left( \frac{x_1}{\sqrt{x_1^2 + x_2^2}} \right) \right)$$
2. Somehow follow the points around the circle?
◮ That is, try to find the nonlinear manifold on (or near) which the points lie.
◮ Somehow preserve local structure:
◮ we want points that are in the same neighbourhood of each other (along the nonlinear manifold) to also be near each other in the reduced (linear) space.
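
The slides do not show how the circle data were generated; a minimal sketch of the kind of `data` object the next two slides operate on (exact points on the unit circle, so the numbers below will not reproduce the slides' output) might be:

set.seed(2)
angle <- runif(100, 0, 2 * pi)                    # random angles around the circle
data  <- cbind(x1 = cos(angle), x2 = sin(angle))  # n x 2 matrix of circle points
plot(data, asp = 1)                               # a compact, but non-linear, region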

7. Reducing dimensions - changing variables

x1 <- data[, 1]
x2 <- data[, 2]
newdata <- data.frame(f1 = sqrt(x1^2 + x2^2),
                      f2 = acos(x1 / sqrt(x1^2 + x2^2)))
colnames(newdata) <- c("r", "theta")
newdata <- scale(newdata, center = TRUE, scale = FALSE)
svd_data <- svd(newdata)
svd_data$d

## [1] 1.103150e+01 7.430655e-16

8. Reducing dimensions - changing variables

(The code and singular values are the same as on the previous slide.) This non-linear transformation has produced a coordinate system in which all the data lie in a one-dimensional linear subspace: the second singular value is essentially zero.

9. Reducing dimensions - changing variables

Alternatively, we might try a transformation that adds non-linear coordinates, say
$$(x_1, x_2) \rightarrow (f_1, f_2, f_3, f_4) = (x_1, x_2, x_1^2, x_2^2).$$

[Scatterplot matrix of the transformed coordinates $f_1, f_2, f_3, f_4$.]

The singular values of the (centred) transformed data are:

## [1] 8.790603e+00 8.461389e+00 5.923241e+00 1.741439e-15
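
A minimal sketch of this augmented transformation, using the hypothetical circle `data` from the earlier sketch (object names are mine; the slide's exact output will differ):

newdata4 <- cbind(f1 = data[, 1], f2 = data[, 2],
                  f3 = data[, 1]^2, f4 = data[, 2]^2)
newdata4 <- scale(newdata4, center = TRUE, scale = FALSE)
pairs(newdata4)          # scatterplot matrix of f1, ..., f4
svd4 <- svd(newdata4)
svd4$d                   # last singular value ~0, since f3 + f4 = 1 on the unit circle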

10. Reducing dimensions - changing variables

A principal component analysis on the transformed data produces the singular values:

## [1] 8.790603e+00 8.461389e+00 5.923241e+00 1.750249e-15

The last of these is essentially zero; the corresponding eigenvector $v_4$ is

## [1] 0.000000e+00 -3.052969e-16 7.071068e-01 7.071068e-01

Note the linear structure in $(f_3, f_4)$: on the circle, $f_3 + f_4 = x_1^2 + x_2^2$ is constant. This is picked up by the last eigenvector which, as seen above, is approximately $(0, 0, \tfrac{1}{\sqrt{2}}, \tfrac{1}{\sqrt{2}})^T$. The line in the $(f_3, f_4)$ panel is orthogonal to $(1, 1)$ (i.e. to the eigenvector $v_4$). Note also that there would be 3 principal components, more than the original dimensionality of the data!
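
Continuing the sketch above (and assuming the `svd4` object it defined), the last right singular vector can be inspected directly:

svd4$v[, 4]                      # approximately (0, 0, 1, 1) / sqrt(2), up to sign
range(newdata4 %*% svd4$v[, 4])  # ~0: no variation at all in that direction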

11. Reducing dimensions - changing variables

More generally, if $x \in \mathbb{R}^p$, we can consider a mapping $\psi: \mathbb{R}^p \rightarrow \mathbb{R}^m$.
◮ $\psi$ could be non-linear
◮ $m$ could be larger than $p$
◮ $f_i = \psi(x_i)$ for $i = 1, \ldots, n$ are called "feature vectors" by some writers, and the range of $\psi$ is the "feature space" $F \subset \mathbb{R}^m$.
Note that while the dimensionality increases, the number of points $n$ stays the same. Whether working in the original data space or in the constructed feature space, a principal component analysis can be obtained from an $n \times n$ Gram matrix. The dimensionality of the feature space could be much larger than the dimensionality of the data (i.e. $m \gg p$); it could even be infinite! The principal component analysis (PCA) will never need a matrix larger than the corresponding $n \times n$ Gram matrix.
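
As a small illustration (the feature map psi below is an arbitrary assumption, not one proposed on the slides), the Gram matrix in the feature space stays n x n no matter how many coordinates psi produces:

psi  <- function(x) c(x[1], x[2], x[1]^2, x[2]^2, x[1] * x[2])  # psi: R^2 -> R^5
Fmat <- t(apply(data, 1, psi))   # n x m matrix whose rows are the feature vectors f_i
K_feat <- Fmat %*% t(Fmat)       # n x n Gram matrix of the feature vectors
dim(K_feat)                      # n x n, even though m = 5 > p = 2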

12. Reducing dimensions - changing variables

$F^T = [f_1, \ldots, f_n]$ is the $m \times n$ matrix of feature vectors, which can be a nuisance to work with if $m \gg n$ and impossible if $m = \infty$. The corresponding Gram matrix for the feature space, $K = [k_{ij}] = F F^T$, however, is always $n \times n$ (even if $m = \infty$). All we need to be able to do is determine the inner products
$$k_{ij} = f_i^T f_j = \psi^T(x_i)\, \psi(x_j) = \langle \psi(x_i), \psi(x_j) \rangle = K(x_i, x_j), \text{ say.}$$
It looks like we only need to choose the function $K(x_i, x_j)$. That is, we never need to calculate any feature vector $f_i = \psi(x_i)$ (or even determine the function $\psi(\cdot)$) if we have the function $K(x_i, x_j)$, a function of vector pairs in the data space. $K(x_i, x_j)$ is called a kernel function (N.B. not to be confused with "kernel density" estimates) and this move from $\psi$ functions to kernel functions is sometimes called the "kernel trick".
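
A minimal sketch of the kernel trick in code (this helper is my own construction, not from the slides): the n x n kernel matrix is filled in directly from a kernel function of pairs of data vectors, without ever computing psi:

kernel_matrix <- function(X, kernel) {
  n <- nrow(X)
  K <- matrix(0, n, n)
  for (i in 1:n)
    for (j in 1:n)
      K[i, j] <- kernel(X[i, ], X[j, ])   # K[i, j] = K(x_i, x_j)
  K
}

# e.g. the linear kernel reproduces the ordinary Gram matrix of the data
K_lin <- kernel_matrix(data, function(x, y) sum(x * y))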

13. Reducing dimensions - changing variables

A number of kernel functions have been proposed. Three common choices are:

1. Polynomial of degree $d$, scale parameter $\sigma$, and offset $\theta$:
$$K(x, y) = (\sigma\, x^T y + \theta)^d$$
For example, suppose $p = 3$, $d = 2$ (with $\sigma = 1$ and $\theta = 0$); then
$$
\begin{aligned}
K(x, y) &= (x_1 y_1 + x_2 y_2 + x_3 y_3)^2 \\
        &= x_1^2 y_1^2 + x_2^2 y_2^2 + x_3^2 y_3^2 + 2 x_1 x_2 y_1 y_2 + 2 x_1 x_3 y_1 y_3 + 2 x_2 x_3 y_2 y_3 \\
        &= (x_1^2,\, x_2^2,\, x_3^2,\, \sqrt{2} x_1 x_2,\, \sqrt{2} x_1 x_3,\, \sqrt{2} x_2 x_3)
           \begin{pmatrix} y_1^2 \\ y_2^2 \\ y_3^2 \\ \sqrt{2} y_1 y_2 \\ \sqrt{2} y_1 y_3 \\ \sqrt{2} y_2 y_3 \end{pmatrix}
\end{aligned}
$$
So $\psi(x) = (x_1^2, x_2^2, x_3^2, \sqrt{2} x_1 x_2, \sqrt{2} x_1 x_3, \sqrt{2} x_2 x_3)^T$.
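
A quick numerical check of this expansion, with small example vectors of my own choosing: the explicit feature map and the kernel give the same value.

psi_poly <- function(x) c(x[1]^2, x[2]^2, x[3]^2,
                          sqrt(2) * x[1] * x[2],
                          sqrt(2) * x[1] * x[3],
                          sqrt(2) * x[2] * x[3])
x <- c(1, 2, 3)
y <- c(-1, 0.5, 2)
(sum(x * y))^2                  # polynomial kernel with sigma = 1, theta = 0, d = 2
sum(psi_poly(x) * psi_poly(y))  # inner product of the feature vectors: the same value (36)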

14. Reducing dimensions - changing variables

2. Radial basis function (Gaussian), with scale parameter $\sigma$:
$$K(x, y) = \exp\left( \frac{-\| x - y \|^2}{2 \sigma^2} \right)$$
To see that this is also an inner product, consider a series expansion of $e^t$. The feature space is infinite dimensional.

3. Sigmoid (hyperbolic tangent), with scale $\sigma$ and offset $\theta$:
$$K(x, y) = \tanh\left( \sigma\, x^T y + \theta \right)$$

There is a theorem from functional analysis (involving reproducing kernel Hilbert spaces, hence the name) called Mercer's Theorem which gives conditions for which a function $K(x, y)$ can be expressed as a dot product.
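
For instance, a radial basis kernel matrix for the hypothetical circle data, built with the kernel_matrix helper sketched earlier (sigma = 0.5 is an arbitrary choice):

rbf   <- function(sigma) function(x, y) exp(-sum((x - y)^2) / (2 * sigma^2))
K_rbf <- kernel_matrix(data, rbf(sigma = 0.5))
dim(K_rbf)        # n x n, although the implied feature space is infinite dimensional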

15. Reducing dimensions - changing variables

The kernel function needs to be such that the vectors $f_1, \ldots, f_n$ in the feature space are centred, i.e. $\sum_{i=1}^n f_i = 0$. Recall that the kernel matrix $K$ is a Gram matrix, so its elements are the inner products $f_i^T f_j$ in the feature space. A simple way to effect this is to ensure that the kernel matrix $K$ has rows and columns that sum to zero. That is, replace $K$ by $K^*$, removing means from the front and back:
$$
\begin{aligned}
K^* &= \left( I_n - \mathbf{1} (\mathbf{1}^T \mathbf{1})^{-1} \mathbf{1}^T \right) K \left( I_n - \mathbf{1} (\mathbf{1}^T \mathbf{1})^{-1} \mathbf{1}^T \right) \\
    &= \left( I_n - \tfrac{1}{n} \mathbf{1}\mathbf{1}^T \right) K \left( I_n - \tfrac{1}{n} \mathbf{1}\mathbf{1}^T \right) \\
    &= \left( K - \tfrac{1}{n} \mathbf{1}\mathbf{1}^T K \right) \left( I_n - \tfrac{1}{n} \mathbf{1}\mathbf{1}^T \right) \\
    &= K - \tfrac{1}{n} \mathbf{1}\mathbf{1}^T K - \tfrac{1}{n} K \mathbf{1}\mathbf{1}^T + \tfrac{1}{n^2} \mathbf{1}\mathbf{1}^T K \mathbf{1}\mathbf{1}^T \\
    &= K - \left( \tfrac{1}{n} K \mathbf{1} \right) \mathbf{1}^T - \mathbf{1} \left( \tfrac{1}{n} \mathbf{1}^T K \right) + \tfrac{1}{n^2} \left( \mathbf{1}^T K \mathbf{1} \right) \mathbf{1}\mathbf{1}^T
\end{aligned}
$$
Or, in words: subtract the row means, subtract the column means, add back in the overall mean.
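
A minimal sketch of this double-centring, applied to the hypothetical K_rbf from the previous sketch; both forms, the matrix identity and the "row means, column means, overall mean" description, give the same K*:

n <- nrow(K_rbf)
J <- diag(n) - matrix(1 / n, n, n)   # I_n - (1/n) 1 1^T
K_star <- J %*% K_rbf %*% J          # remove means from the front and back

# equivalently: subtract row means, subtract column means, add back the overall mean
K_star2 <- sweep(sweep(K_rbf, 1, rowMeans(K_rbf)), 2, colMeans(K_rbf)) + mean(K_rbf)
all.equal(K_star, K_star2)
range(rowSums(K_star))               # ~0: rows (and columns) of K* now sum to zero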
