Dimension Reduction using PCA and SVD Plan of Class Starting the - PowerPoint PPT Presentation

Dimension Reduction using PCA and SVD

Plan of Class • Starting the machine Learning part of the course. • Based on Linear Algebra. • If your linear algebra is rusty, check out the pages on “Resources/Linear Algebra” • This class will all be theory. • Next class will be on doing PCA in Spark. • HW3 will open on friday, be due the following friday.

Dimensionality reduction Why reduce the number of features in a data set? 1 It reduces storage and computation time. 2 High-dimensional data often has a lot of redundancy. 3 Remove noisy or irrelevant features. Example: are all the pixels in an image equally informative? x ∈ R 784 28 × 28 = 784pixels. A vector � If we were to choose a few pixels to discard, which would be the prime candidates? Those with lowest variance...

Eliminating low variance coordinates Example: MNIST. What fraction of the total variance is contained in the 100 (or 200, or 300) coordinates with lowest variance? We can easily drop 300-400 pixels... Can we eliminate more? Yes! By using features that are combinations of pixels instead of single pixels.

Covariance (a quick review) Suppose X has mean µ X and Y has mean µ Y . • Covariance cov( X , Y ) = E [( X − µ X )( Y − µ Y )] = E [ XY ] − µ X µ Y Maximized when X = Y , in which case it is var( X ). In general, it is at most std( X )std( Y ).

Covariance: example 1 cov( X , Y ) = E [( X − µ X )( Y − µ Y )] = E [ XY ] − µ X µ Y Pr ( x , y ) µ X = 0 x y − 1 − 1 1 / 3 µ Y = − 1 / 3 − 1 1 1 / 6 var( X ) = 1 1 − 1 1 / 3 var( Y ) = 8 / 9 1 1 1 / 6 cov( X , Y ) = 0 In this case, X , Y are independent. Independent variables always have zero covariance.

Covariance: example 2 cov( X , Y ) = E [( X − µ X )( Y − µ Y )] = E [ XY ] − µ X µ Y x y Pr ( x , y ) − 1 − 10 1 / 6 µ X = 0 − 1 10 1 / 3 µ Y = 0 1 − 10 1 / 3 var( X ) = 1 1 10 1 / 6 var( Y ) = 100 cov( X , Y ) = − 10 / 3 In this case, X and Y are negatively correlated.

Example: MNIST approximate a digit from class j as the class av- erage plus k corrections: k � � x ≈ µ j + a i � v j , i i =1 • µ j ∈ R 784 class mean vector • � v j , 1 , . . . , � v j , k are the principal directions .

The effect of correlation Suppose we wanted just one feature for the following data. This is the direction of maximum variance .

Two types of projection Projection onto a 1-d line in R 2 : Projection onto R :

Projection: formally What is the projection of x ∈ R p onto direction u ∈ R p (where � u � = 1)? As a one-dimensional value: x p � x · u = u · x = u T x = u i x i . i =1 As a p -dimensional vector: u ( x · u ) u = uu T x x · u “Move x · u units in direction u ” � 2 � What is the projection of x = onto the following directions? 3 • The coordinate direction e 1 ? Answer: 2 � 1 � √ • The direction ? Answer: − 1 / 2 − 1

matrix notation I A notation that allows a simple representation of multiple projections v ∈ R d can be represented, in matrix notation, as A vector � • A column vector:   v 1   v 2   v =   . .   . v d • A row vector: � � v T = v 1 v 2 · · · v d

matrix notation II By convension an inner product is represented by a row vector followed by a a column vector:   v 1   d v 2 � � u 1 �   u 2 · · · u d  =  .  u i v i .  . i =1 v d While a column vector followd by a row vector represents an outer product which is a matrix:     v 1 u 1 v 1 u 2 v 1 · · · u m v 1   v 2 � �     . . ... ... · · · = . .   u 1 u 2 u m   . . . .   . · · · u 1 v n u 2 v n u m v n v n

Projection onto multiple directions Want to project x ∈ R p into the k -dimensional subspace defined by vectors u 1 , . . . , u k ∈ R p . This is easiest when the u i ’s are orthonormal : • They each have length one. • They are at right angles to each other: u i · u j = 0 whenever i � = j Then the projection, as a k -dimensional vector, is   �   ← − − − − − u 1 − − − − − →     ← − − − − − u 2 − − − − − →       ( x · u 1 , x · u 2 , . . . , x · u k ) =   x .    .   .  � ← − − − − − u k − − − − − → � �� call this U T As a p -dimensional vector, the projection is ( x · u 1 ) u 1 + ( x · u 2 ) u 2 + · · · + ( x · u k ) u k = UU T x .

Projection onto multiple directions: example Suppose data are in R 4 and we want to project onto the first two coordinates.     1 0     0 1     Take vectors u 1 =  , u 2 = (notice: orthonormal)    0 0 0 0 � ← � � 1 � − − − − − u 1 − − − − − → 0 0 0 U T = Then write = ← − − − − − u 2 − − − − − → 0 1 0 0 The projection of x ∈ R 4 , The projection of x as a as a 2-d vector, is 4-d vector is   � � x 1 x 1 U T x =   x 2 x 2   UU T x =   0 0 But we’ll generally project along non-coordinate directions.

The best single direction Suppose we need to map our data x ∈ R p into just one dimension: for some unit direction u ∈ R p x �→ u · x What is the direction u of maximum variance? Theorem : Let Σ be the p × p covariance matrix of X . The variance of X in direction u is given by u T Σ u . • Suppose the mean of X is µ ∈ R p . The projection u T X has mean E ( u T X ) = u T E X = u T µ. • The variance of u T X is var( u T X ) = E ( u T X − u T µ ) 2 = E ( u T ( X − µ )( X − µ ) T u ) = u T E ( X − µ )( X − µ ) T u = u T Σ u . Another theorem: u T Σ u is maximized by setting u to the first eigenvector of Σ. The maximum value is the corresponding eigenvalue .

Best single direction: example This direction is the first eigenvector of the 2 × 2 covariance matrix of the data.

The best k -dimensional projection Let Σ be the p × p covariance matrix of X . Its eigendecomposition can be computed in O ( p 3 ) time and consists of: • real eigenvalues λ 1 ≥ λ 2 ≥ · · · ≥ λ p • corresponding eigenvectors u 1 , . . . , u p ∈ R p that are orthonormal: that is, each u i has unit length and u i · u j = 0 whenever i � = j . Theorem : Suppose we want to map data X ∈ R p to just k dimensions, while capturing as much of the variance of X as possible. The best choice of projection is: x �→ ( u 1 · x , u 2 · x , . . . , u k · x ) , where u i are the eigenvectors described above. Projecting the data in this way is principal component analysis (PCA).

Example: MNIST Contrast coordinate projections with PCA:

MNIST: image reconstruction Reconstruct this original image from its PCA projection to k dimensions. k = 200 k = 150 k = 100 k = 50 Q: What are these reconstructions exactly? A: Image x is reconstructed as UU T x , where U is a p × k matrix whose columns are the top k eigenvectors of Σ.

What are eigenvalues and eigenvectors? There are several steps to understanding these. 1 Any matrix M defines a function (or transformation ) x �→ Mx . 2 If M is a p × q matrix, then this transformation maps vector x ∈ R q to vector Mx ∈ R p . 3 We call it a linear transformation because M ( x + x ′ ) = Mx + Mx ′ . 4 We’d like to understand the nature of these transformations. The easiest case is when M is diagonal :       2 0 0 x 1 2 x 1       0 − 1 0 − x 2 x 2 = 0 0 10 10 x 3 x 3 � �� x M Mx In this case, M simply scales each coordinate separately. 5 What about more general matrices that are symmetric but not necessarily diagonal? They also just scale coordinates separately, but in a different coordinate system .

Eigenvalue and eigenvector: definition Let M be a p × p matrix. We say u ∈ R p is an eigenvector if M maps u onto the same direction, that is, Mu = λ u for some scaling constant λ . This λ is the eigenvalue associated with u . Question: What are the eigenvectors and eigenvalues of:   2 0 0   ? M = 0 − 1 0 0 0 10 Answer: Eigenvectors e 1 , e 2 . e 3 , with corresponding eigenvalues 2 , − 1 , 10. Notice that these eigenvectors form an orthonormal basis.

Eigenvectors of a real symmetric matrix Theorem. Let M be any real symmetric p × p matrix. Then M has • p eigenvalues λ 1 , . . . , λ p • corresponding eigenvectors u 1 , . . . , u p ∈ R p that are orthonormal We can think of u 1 , . . . , u p as being the axes of the natural coordinate system for understanding M . Example: consider the matrix � 3 � 1 M = 1 3 It has eigenvectors � 1 � � − 1 � 1 1 u 1 = √ , u 2 = √ 1 1 2 2 and corresponding eigenvalues λ 1 = 4 and λ 2 = 2. (Check)

Dimension Reduction using PCA and SVD Plan of Class Starting the - PowerPoint PPT Presentation

Dimension Reduction using PCA and SVD Plan of Class Starting the machine Learning part of the course. Based on Linear Algebra. If your linear algebra is rusty, check out the pages on Resources/Linear Algebra This class will

SVD Status H. Yin August 24, 2017 H. Yin SVD Status August 24, 2017 1 / 19 Overview SVD

MLCC 2015 Dimensionality Reduction and PCA Lorenzo Rosasco UNIGE-MIT-IIT June 25, 2015 Outline

Parallel Singular Value Decomposition Jiaxing Tan Outline What is SVD? How to calculate

ECS231 PCA, revisited May 28, 2019 1 / 18 Outline 1. PCA for lossy data compression 2. PCA for

1 Low-rank approximations to a matrix using SVD First point: we can write the SVD as a sum of

A study for hit-time reconstruction of Belle II SVD Yuma Uematsu (UTokyo) on behalf of Belle II

Dimension Reduction and Nearest Neighbor Search Advanced Algorithms Nanjing University, Fall

Dimension Reduction CSE 6242 / CX 4242 Thanks : Prof. Jaegul Choo , Dr. Ramakrishnan Kannan,

Dimensionality Reduction; PCA & SVD Kalev Kask Motivation High-dimensional data

Big Data Management & Analytics EXERCISE 9 SVD, CUR 11th of January, 2016 Sabrina

GPU Parallel Implementation of The Approximate K-SVD Algorithm Using OpenCL Paul Irofti 1 Bogdan

Linear Dimension Reduction (in L 2 ) Linear Dimension Reduction: R D R d Goal: Find a low-dim.

Discriminative Feature Extraction and Dimension Reduction - PCA & LDA Berlin Chen, 2004

Methods for feature extraction and reduction of dimensionality: Probabilistic PCA and kernel PCA

PCA, Kernel PCA, ICA Learning Representations. Dimensionality Reduction. Maria-Florina Balcan

Partial Lanczos SVD methods for R Bryan Lewis 1 , adapted from the work of Jim Baglama 2 and Lothar

Intro to harmonic analysis on groups Risi Kondor . The Fourier series (1807) Any (sufficiently

How to use the Scanner For Slides Double click on the HP Precision scan icon You will see

A Consistent Regularization Approach for Structured Prediction Carlo Ciliberto, Alessandro Rudi,

Relational Pooling for Graph Representations Ryan L. Murphy 1 (with Balasubramaniam Srinivasan 2 ,

Companion symmetry of SUSY PHENO 2008 Hye-Sung Lee Lightest U -parity Particle (LUP) dark matter

Grover Mixers for QAOA: Shifting Complexity from Mixer Design to State Preparation you joint

Controlling Virtual Register Pressure in LLVM Middle-End 1 Outline Motivation Related work

"Key Migration Protocol" Sounds Scary Singapore Plenary - Oct, 2018

Sambuz

Useful Links

Newsletter

Mail Us

Dimension Reduction using PCA and SVD Plan of Class Starting the - PowerPoint PPT Presentation

Dimension Reduction using PCA and SVD Plan of Class Starting the machine Learning part of the course. Based on Linear Algebra. If your linear algebra is rusty, check out the pages on Resources/Linear Algebra This class will

SVD Status H. Yin August 24, 2017 H. Yin SVD Status August 24, 2017 1 / 19 Overview SVD

MLCC 2015 Dimensionality Reduction and PCA Lorenzo Rosasco UNIGE-MIT-IIT June 25, 2015 Outline

Parallel Singular Value Decomposition Jiaxing Tan Outline What is SVD? How to calculate

ECS231 PCA, revisited May 28, 2019 1 / 18 Outline 1. PCA for lossy data compression 2. PCA for

1 Low-rank approximations to a matrix using SVD First point: we can write the SVD as a sum of

A study for hit-time reconstruction of Belle II SVD Yuma Uematsu (UTokyo) on behalf of Belle II

Dimension Reduction and Nearest Neighbor Search Advanced Algorithms Nanjing University, Fall

Dimension Reduction CSE 6242 / CX 4242 Thanks : Prof. Jaegul Choo , Dr. Ramakrishnan Kannan,

Dimensionality Reduction; PCA &amp; SVD Kalev Kask Motivation High-dimensional data

Big Data Management &amp; Analytics EXERCISE 9 SVD, CUR 11th of January, 2016 Sabrina

GPU Parallel Implementation of The Approximate K-SVD Algorithm Using OpenCL Paul Irofti 1 Bogdan

Linear Dimension Reduction (in L 2 ) Linear Dimension Reduction: R D R d Goal: Find a low-dim.

Discriminative Feature Extraction and Dimension Reduction - PCA &amp; LDA Berlin Chen, 2004

Methods for feature extraction and reduction of dimensionality: Probabilistic PCA and kernel PCA

PCA, Kernel PCA, ICA Learning Representations. Dimensionality Reduction. Maria-Florina Balcan

Partial Lanczos SVD methods for R Bryan Lewis 1 , adapted from the work of Jim Baglama 2 and Lothar

Intro to harmonic analysis on groups Risi Kondor . The Fourier series (1807) Any (sufficiently

How to use the Scanner For Slides Double click on the HP Precision scan icon You will see

A Consistent Regularization Approach for Structured Prediction Carlo Ciliberto, Alessandro Rudi,

Relational Pooling for Graph Representations Ryan L. Murphy 1 (with Balasubramaniam Srinivasan 2 ,

Companion symmetry of SUSY PHENO 2008 Hye-Sung Lee Lightest U -parity Particle (LUP) dark matter

Grover Mixers for QAOA: Shifting Complexity from Mixer Design to State Preparation you joint

Controlling Virtual Register Pressure in LLVM Middle-End 1 Outline Motivation Related work

&quot;Key Migration Protocol&quot; Sounds Scary Singapore Plenary - Oct, 2018

Sambuz

Useful Links

Newsletter

Mail Us

Dimensionality Reduction; PCA & SVD Kalev Kask Motivation High-dimensional data

Big Data Management & Analytics EXERCISE 9 SVD, CUR 11th of January, 2016 Sabrina

Discriminative Feature Extraction and Dimension Reduction - PCA & LDA Berlin Chen, 2004

"Key Migration Protocol" Sounds Scary Singapore Plenary - Oct, 2018