Applied Machine Learning
Dimensionality reduction using PCA
Siamak Ravanbakhsh
COMP 551 (Fall 2020)
Learning objectives

- What is dimensionality reduction? What is it good for?
- Linear dimensionality reduction: Principal Component Analysis
- Relation to Singular Value Decomposition
Motivation

Real-world data is high-dimensional. Scenario: we are given high-dimensional data and asked to make sense of it! Some challenges:

- we can't visualize beyond 3D
- features may not have any semantics (the value of a pixel vs. happy/sad)
- processing and storage are costly
- many features may not vary much in our dataset (e.g., background pixels in face images)

Dimensionality reduction: faithfully represent the data in low dimensions. We can often do this with real-world data (manifold hypothesis). How to do it?
Dimensionality reduction

Dimensionality reduction: faithfully represent the data in low dimensions. How to do it? Learn a mapping between coordinates $z^{(n)} \in \mathbb{R}^2$ in a low-dimensional space and the high-dimensional data $x^{(n)} \in \mathbb{R}^3$. Some methods give this mapping in both directions and some only in one direction.
Dimensionality reduction: faithfully represent the data in low dimensions. How to do it? Learn a mapping between a low-dimensional Euclidean space and our data: each 20x20 image $x^{(n)} \in \mathbb{R}^{400}$ is mapped to $z^{(n)} \in \mathbb{R}^2$. (image: wikipedia)
Principal Component Analysis (PCA)

PCA is a linear dimensionality reduction method where the mapping matrix $Q \in \mathbb{R}^{3 \times 2}$ has orthonormal columns, $Q^\top Q = I$, and $z^{(n)} = Q^\top x^{(n)} \in \mathbb{R}^2$. It follows that the pseudo-inverse of $Q$ is

$Q^\dagger = (Q^\top Q)^{-1} Q^\top = Q^\top$
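As a quick numerical check (a sketch using an assumed toy matrix, not data from the slides), NumPy confirms that a matrix with orthonormal columns satisfies $Q^\top Q = I$ and that its pseudo-inverse reduces to its transpose:

```python
import numpy as np

# assumed toy example: build a 3x2 Q with orthonormal columns via QR
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(3, 2)))

# orthonormal columns: Q^T Q = I (2x2)
assert np.allclose(Q.T @ Q, np.eye(2))
# pseudo-inverse (Q^T Q)^{-1} Q^T reduces to Q^T
assert np.allclose(np.linalg.pinv(Q), Q.T)
```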
PCA: optimization objective

PCA is a linear dimensionality reduction method; faithfulness is measured by the reconstruction error:

$\min_Q \sum_n ||x^{(n)} - Q Q^\top x^{(n)}||_2^2 \quad \text{s.t.} \quad Q^\top Q = I$

Here each image has 28x28 = 784 pixels, so $x^{(n)} \in \mathbb{R}^{784}$, $Q \in \mathbb{R}^{784 \times 2}$, and $z^{(n)} = Q^\top x^{(n)} \in \mathbb{R}^2$.
PCA: optimization objective

PCA is a linear dimensionality reduction method; faithfulness is measured by the reconstruction error:

$\min_Q \sum_n ||x^{(n)} - Q Q^\top x^{(n)}||_2^2 \quad \text{s.t.} \quad Q^\top Q = I$

Strategy: find a $D \times D$ matrix

$Q = \begin{bmatrix} Q_{1,1} & \dots & Q_{1,D} \\ \vdots & \ddots & \vdots \\ Q_{D,1} & \dots & Q_{D,D} \end{bmatrix}$

with columns $q_1, \dots, q_D$, and only use its first $D'$ columns. Since $Q$ is orthogonal we can think of it as a change of coordinates: the standard basis $(1,0,0), (0,1,0), (0,0,1)$ is replaced by $q_1, q_2, q_3$.
PCA: optimization objective

Since $Q$ is orthogonal we can think of it as a change of coordinates. Strategy: find the $D \times D$ matrix $Q$ with columns $q_1, \dots, q_D$, and only use its first $D'$ columns. We want to change coordinates such that coordinates $1, 2, \dots, D'$ best explain the data, for any given $D'$. Example: with $D' = 2$ we keep $q_1, q_2$ in place of $(1,0,0), (0,1,0)$.
In other words

Find a change of coordinates using an orthonormal matrix such that:

- the first new coordinate has maximum variance (lowest reconstruction error)
- the second coordinate has the next largest variance
- ...

Along which one of these directions does the data have a higher variance? The (scalar) projection of $x^{(n)}$ onto a direction $q_1$ is given by

$\frac{x^{(n)\top} q_1}{||q_1||_2} = x^{(n)\top} q_1$ (for unit-norm $q_1$)

and the projection of the whole dataset is $X q_1 = z_1$.
Covariance matrix

Find a change of coordinates using an orthonormal matrix whose first new coordinate has maximum variance. The projection of the whole dataset is $z_1 = X q_1$, so (assuming features have zero mean) we maximize the variance of the projection $\frac{1}{N} z_1^\top z_1$:

$\max_{q_1} \frac{1}{N} z_1^\top z_1 = \max_{q_1} \frac{1}{N} q_1^\top X^\top X q_1 = \max_{q_1} q_1^\top \Sigma q_1$

where $\Sigma = \frac{1}{N} X^\top X$ is the $D \times D$ covariance matrix. $\Sigma_{i,j}$ is the sample covariance of features $i$ and $j$:

$\Sigma_{i,j} = \mathrm{Cov}[X_{:,i}, X_{:,j}] = \frac{1}{N} \sum_n x_i^{(n)} x_j^{(n)}$

Recall $\Sigma = \frac{1}{N} X^\top X = \frac{1}{N} \sum_n (x^{(n)} - 0)(x^{(n)} - 0)^\top$, because the mean is zero.
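The covariance computation above can be checked directly in NumPy (a sketch on assumed toy data); with the data centered, $\frac{1}{N} X^\top X$ matches NumPy's biased sample covariance:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))   # assumed toy data: N=100 points, D=3 features
X = X - X.mean(axis=0)          # center the features (zero mean, as the slides assume)

Sigma = (X.T @ X) / len(X)      # D x D sample covariance, (1/N) X^T X

# matches numpy's covariance with the 1/N (biased) convention
assert np.allclose(Sigma, np.cov(X, rowvar=False, bias=True))
```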
Eigenvalue decomposition

Find a change of coordinates using an orthogonal matrix whose first new coordinate has maximum variance:

$\max_{q_1} q_1^\top \Sigma q_1 \quad \text{s.t.} \quad ||q_1|| = 1$

The covariance matrix is symmetric and positive semi-definite:

- symmetric: $(X^\top X)^\top = X^\top X$
- positive semi-definite: $a^\top \Sigma a = \frac{1}{N} a^\top X^\top X a = \frac{1}{N} ||Xa||_2^2 \geq 0 \;\; \forall a$

Any symmetric matrix has the decomposition $\Sigma = Q \Lambda Q^\top$, where $\Lambda$ is diagonal with the corresponding eigenvalues on the diagonal (positive semi-definiteness means these are non-negative), and $Q$ is a $D \times D$ orthogonal matrix whose columns are eigenvectors: $Q Q^\top = Q^\top Q = I$ (as we will see shortly, using $Q$ here is not a coincidence).
Principal directions

Find a change of coordinates using an orthogonal matrix whose first new coordinate has maximum variance:

$q_1^* = \arg\max_{q_1} q_1^\top \Sigma q_1 \quad \text{s.t.} \quad ||q_1|| = 1$

Using the eigenvalue decomposition, $\max_{q_1} q_1^\top Q \Lambda Q^\top q_1 = \lambda_1$, so for PCA we need to find the eigenvectors of the covariance matrix. The maximizing direction is the eigenvector with the largest eigenvalue (the first column of $Q$): the first principal direction is $q_1 = Q_{:,1}$. The second eigenvector gives the second principal direction, $q_2 = Q_{:,2}$.
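A minimal sketch of this step on assumed toy data: `np.linalg.eigh` returns eigenvalues in ascending order, so we flip to put the largest-variance direction first, and check that the variance along $q_1$ is indeed $\lambda_1$:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # assumed correlated toy data
X = X - X.mean(axis=0)
Sigma = (X.T @ X) / len(X)

# eigh gives eigenvalues in ascending order; flip so lambda_1 is largest
evals, Q = np.linalg.eigh(Sigma)
evals, Q = evals[::-1], Q[:, ::-1]

q1 = Q[:, 0]                                   # first principal direction
assert np.isclose(np.linalg.norm(q1), 1.0)     # unit norm
assert np.isclose(q1 @ Sigma @ q1, evals[0])   # variance along q1 is lambda_1
```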
Reducing dimensionality

The projection onto the principal direction $q_i$ is given by $X q_i$; think of the projection $XQ$ as a change of coordinates. We can use the first $D'$ coordinates to reduce the dimensionality while capturing a lot of the variance in the data:

$Z = X Q_{:,:D'}$

We can project back into the original coordinates (the reconstruction) using:

$\tilde{X} = Z Q_{:,:D'}^\top$
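The project-then-reconstruct pipeline can be sketched as follows (assumed toy data; $D'$ here is an arbitrary choice for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10)) @ rng.normal(size=(10, 10))  # assumed toy data
X = X - X.mean(axis=0)

evals, Q = np.linalg.eigh((X.T @ X) / len(X))
Q = Q[:, ::-1]                       # largest-variance directions first

D_prime = 3                          # assumed choice of reduced dimension
Z = X @ Q[:, :D_prime]               # N x D' low-dimensional coordinates
X_tilde = Z @ Q[:, :D_prime].T       # reconstruction in the original space

# sanity check: keeping all D directions recovers X exactly
assert np.allclose(X @ Q @ Q.T, X)
```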
Example: digits dataset

Let's only work with the digit 2! Each image is $x^{(n)} \in \mathbb{R}^{784}$. Center the data, form the $784 \times 784$ covariance matrix $\Sigma$, find its eigenvectors (the principal directions $q_1, q_2, \dots, q_{20}$), and use the first 20 directions to reduce the dimensionality from 784 to 20. The PC coefficients $x^{(n)\top} q_i$ are the new coordinates: using 20 numbers we can represent each image $x^{(1)}, x^{(2)}, \dots$ with good accuracy.
Example 2: digits dataset

3D embedding of MNIST digits (https://projector.tensorflow.org/): each $x^{(n)} \in \mathbb{R}^{784}$, and the 3D embedding coordinates are $X q_1, X q_2, X q_3$.
There is another way to do PCA, without using the covariance matrix.

Singular Value Decomposition (SVD)

Any $N \times D$ real matrix has the decomposition $X = U S V^\top$, where $U$ is $N \times N$, $S$ is $N \times D$, and $V^\top$ is $D \times D$:

- $S$ is rectangular diagonal with the singular values $s_1, s_2, \dots$ on the diagonal, $s_i \geq 0$
- the columns of $U$ are the left singular vectors $\{u_i\}$: $u_i^\top u_j = 0 \;\; \forall i \neq j$
- the columns of $V$ are the right singular vectors $\{v_i\}$: $v_i^\top v_j = 0 \;\; \forall i \neq j$

Compressed SVD: assuming $N > D$, we can ignore the last $N - D$ columns of $U$ and the last $N - D$ rows of $S$ (why? those rows of $S$ are all zero), leaving $X$ as $N \times D$, $U$ as $N \times D$, $S$ as $D \times D$, and $V^\top$ as $D \times D$. Similarly, if $D > N$ we can compress $V$ and $S$.
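NumPy exposes both forms (a sketch on an assumed toy matrix): `full_matrices=True` gives the full SVD, `full_matrices=False` the compressed ("thin") one, and both reconstruct $X$ exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))   # assumed toy shape with N=6 > D=4

# full SVD: U is N x N, s has D values, Vt is D x D
U, s, Vt = np.linalg.svd(X, full_matrices=True)
# compressed SVD: U is N x D, the rest unchanged
U_c, s_c, Vt_c = np.linalg.svd(X, full_matrices=False)

assert U.shape == (6, 6) and U_c.shape == (6, 4)
assert np.allclose((U_c * s_c) @ Vt_c, X)   # thin SVD still reconstructs X exactly
```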
Singular Value Decomposition (SVD)

For $N = D = 2$: writing $X = U S V^\top$, it is as if we are finding orthonormal bases $V$ for $\mathbb{R}^D$ and $U$ for $\mathbb{R}^N$ such that $X$ simply scales the $i$-th basis vector of $\mathbb{R}^D$ by $s_i$ and maps it to the $i$-th basis vector of $\mathbb{R}^N$.
Singular value & eigenvalue decomposition

Recall that for PCA we used the eigenvalue decomposition of $\Sigma = \frac{1}{N} X^\top X$. How does it relate to SVD?

$X^\top X = (U S V^\top)^\top (U S V^\top) = V S^\top U^\top U S V^\top = V S^2 V^\top$

Compare to $\frac{1}{N} X^\top X = Q \Lambda Q^\top$: the eigenvectors of $\Sigma$ are the right singular vectors of $X$ ($Q = V$), so for PCA we could use SVD.
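This equivalence is easy to verify numerically (a sketch on assumed centered toy data): the eigenvalues of $\Sigma$ are $s_i^2 / N$, and each eigenvector matches the corresponding right singular vector up to sign:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))   # assumed toy data
X = X - X.mean(axis=0)

_, s, Vt = np.linalg.svd(X, full_matrices=False)
evals, Q = np.linalg.eigh((X.T @ X) / len(X))
evals, Q = evals[::-1], Q[:, ::-1]     # descending, to match singular value order

# eigenvalues of Sigma are the squared singular values of X divided by N
assert np.allclose(evals, s**2 / len(X))
# eigenvectors match right singular vectors up to sign
for i in range(4):
    assert np.isclose(abs(Q[:, i] @ Vt[i]), 1.0)
```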
Picking the number of PCs

The number of PCs in PCA is a hyper-parameter; how should we choose it? Each new principal direction explains some variance in the data, $a_d = \frac{1}{N} \sum_n z_d^{(n)2}$, such that (by definition of PCA) $a_1 \geq a_2 \geq \dots \geq a_D$. We can divide by the total variance to get a ratio:

$r_i = \frac{a_i}{\sum_d a_d}$

For our digits example, the first few principal directions explain most of the variance in the data! Summing the variance ratios up to a PC, we can explain 90% of the variance in the data using 100 PCs.
Picking the number of PCs

Recall that for picking the principal direction we maximized the variance of the PC:

$\max_q \frac{1}{N} q^\top X^\top X q = \max_q q^\top \Sigma q = \max_q q^\top Q \Lambda Q^\top q = \lambda_1 \quad \text{s.t.} \quad ||q|| = 1$

So the variance ratios are also given by:

$r_i = \frac{\lambda_i}{\sum_d \lambda_d}$

In the digits example the two estimates of the variance ratios do match, so we can also use the eigenvalues to pick the number of PCs, with $X \approx (X Q_{:,:D'}) Q_{:,:D'}^\top$.
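The eigenvalue-based selection rule can be sketched as follows (assumed toy data; the 90% threshold mirrors the slides' digits example):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20)) @ rng.normal(size=(20, 20))  # assumed toy data
X = X - X.mean(axis=0)

evals = np.linalg.eigvalsh((X.T @ X) / len(X))[::-1]  # eigenvalues, descending
ratios = evals / evals.sum()                          # variance ratio per PC
cumulative = np.cumsum(ratios)

# smallest D' explaining at least 90% of the variance
D_prime = int(np.searchsorted(cumulative, 0.90) + 1)
assert 1 <= D_prime <= 20
assert cumulative[D_prime - 1] >= 0.90
```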
Matrix factorization

PCA and SVD perform matrix factorization: $X \approx Z Q_{:,:D'}^\top$, where $Z$ ($N \times D'$) is the matrix of low-dimensional features (the PC coefficients, or factor loading matrix) and $Q_{:,:D'}^\top$ ($D' \times D$) is the factor matrix, whose rows are the principal components (and are orthonormal). This gives a low-rank approximation to our original $N \times D$ matrix $X$: we can use it to compress the matrix, or to give a "smooth" reconstruction of $X$ (remove noise or fill in missing values).
Matrix factorization

Example: a $427 \times 640$ image is approximated as $(427 \times D') \times (D' \times 640)$, e.g., $427 \times 50$ times $50 \times 640$. Changing the rank $D'$ gives different amounts of compression:

- $D' = 5$: compression factor 2%
- $D' = 20$: compression factor 8%
- $D' = 50$: compression factor 20%
- $D' = 200$: compression factor 80%
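The compression factors above follow from a simple count (a sketch; the formula is the storage ratio, not from the slides): storing $Z$ ($N \times D'$) and $Q_{:,:D'}$ ($D \times D'$) instead of $X$ ($N \times D$) costs $D'(N + D)$ numbers instead of $ND$:

```python
# storage for the rank-D' factorization divided by storage for X itself
def compression_factor(N, D, D_prime):
    return D_prime * (N + D) / (N * D)

# the slide's 427 x 640 image example, to the stated precision
for d, expected in [(5, 0.02), (20, 0.08), (50, 0.20), (200, 0.80)]:
    assert abs(compression_factor(427, 640, d) - expected) < 0.02
```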
Matrix factorization

Relationship to K-means: K-means can also be seen as matrix factorization, $X \approx Z M$ with $Z$ of size $N \times K$ and $M$ of size $K \times D$ (writing $M$ for the centers matrix). Instead of principal components, each row of $M$ is a cluster center $\mu_k$, and each row of $Z$ (the responsibilities) has exactly one nonzero entry, e.g., $[0, 1, 0, 0, 0]$, since each point belongs to one cluster; the matrix product simply equates each row of $X$ with one row of the factor matrix. Similar to clustering, PCA has a probabilistic latent variable model formulation: high-dimensional observations ($x$) have a low-dimensional latent representation ($z$), with $p(x, z) = p(z)\, p_Q(x \mid z)$.
Summary

Dimensionality reduction helps us:

- visualize our data
- compress it
- simplify the computational needs of further analysis (clustering, supervised learning, etc.)
- it can also be used for anomaly detection (not discussed)

PCA is a linear dimensionality reduction method:

- it projects the data to a linear space (spanned by $D'$ principal directions)
- the directions are eigenvectors of the covariance matrix
- the projection has maximum variance (minimum reconstruction error)
- the eigenvalues tell us about the contribution of each new principal direction

We also saw: PCA using Singular Value Decomposition, model selection for PCA, and PCA as matrix factorization and its relationship to K-means. Practical note: don't forget to subtract the mean!
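Putting the practical note into code, here is a minimal end-to-end PCA sketch (the function name and toy data are assumptions for illustration, not from the slides):

```python
import numpy as np

def pca(X, d):
    """Minimal PCA sketch: center, eigendecompose the covariance, project."""
    mu = X.mean(axis=0)
    Xc = X - mu                                  # don't forget to subtract the mean!
    evals, Q = np.linalg.eigh((Xc.T @ Xc) / len(Xc))
    Q = Q[:, ::-1][:, :d]                        # top-d principal directions
    return Xc @ Q, Q, mu                         # coordinates Z, directions Q, mean

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))                    # assumed toy data
Z, Q, mu = pca(X, 2)
assert Z.shape == (100, 2) and Q.shape == (6, 2)
assert np.allclose(Q.T @ Q, np.eye(2))           # orthonormal columns, as required
```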