Principal component analysis DS-GA 1013 / MATH-GA 2824 Mathematical - PowerPoint PPT Presentation

Step By induction assumption there exist γ 1 , . . . , γ d − 1 and w 1 , . . . , w d − 1 such that || y || 2 = 1 y T By γ 1 = max || y || 2 = 1 y T By w 1 = arg max y T By , γ k = max 2 ≤ k ≤ d − 2 || y || 2 = 1 , y ⊥ w 1 ,..., w k − 1 y T By , w k = arg max 2 ≤ k ≤ d − 2 || y || 2 = 1 , y ⊥ w 1 ,..., w k − 1

Step For any x ∈ span( u 1 ) ⊥ , x = V ⊥ y for some y ∈ R d − 1 x T Ax = x T ( A − λ 1 u 1 u T max max 1 ) x || x || 2 = 1 , x ⊥ u 1 || x || 2 = 1 , x ⊥ u 1 x T V ⊥ V T ⊥ ( A − λ 1 u 1 u T 1 ) V ⊥ V T = max ⊥ x || x || 2 = 1 , x ⊥ u 1 || y || 2 = 1 y T By = max = γ 1 Inspired by this: u k := V ⊥ w k − 1 for k = 2 , . . . , d u 1 , . . . , u d are orthonormal basis

Step: eigenvectors Au k = V ⊥ V T ⊥ ( A − λ 1 u 1 u T 1 ) V ⊥ V T ⊥ V ⊥ w k − 1 = V ⊥ Bw k − 1 = γ k − 1 V ⊥ w k − 1 = λ k u k u k is an eigenvector of A with eigenvalue λ k := γ k − 1

Step Let x ∈ span( u 1 ) ⊥ be orthogonal to u k ′ , where 2 ≤ k ′ ≤ d There is y ∈ R d − 1 such that x = V ⊥ y and w T k ′ − 1 y = w T k ′ V T ⊥ V ⊥ y = u T k ′ x = 0

Step: eigenvalues Let x ∈ span( u 1 ) ⊥ be orthogonal to u k ′ , where 2 ≤ k ′ ≤ d There is y ∈ R d − 1 such that x = V ⊥ y and w T k ′ − 1 y = 0 x T Ax = x T V ⊥ V T ⊥ ( A − λ 1 u 1 u T 1 ) V ⊥ V T max max ⊥ x || x || 2 = 1 , x ⊥ u 1 ,..., u k − 1 || x || 2 = 1 , x ⊥ u 1 ,..., u k − 1 y T By = max || y || 2 = 1 , y ⊥ w 1 ,..., w k − 2 = γ k − 1 = λ k

Covariance matrix The spectral theorem Principal component analysis Dimensionality reduction via PCA Gaussian random vectors

Spectral theorem If A ∈ R d × d is symmetric, then it has an eigendecomposition   λ 1 0 · · · 0 � �   � � T , 0 λ 2 · · · 0   A = u 1 u 2 · · · u d u 1 u 2 · · · u d   · · · 0 0 · · · λ d Eigenvalues λ 1 ≥ λ 2 ≥ · · · ≥ λ d are real Eigenvectors u 1 , u 2 , . . . , u n are real and orthogonal

Variance in direction of a fixed vector v If random vector ˜ x has covariance matrix Σ ˜ x � � v T ˜ = v T Σ ˜ Var x x v

Principal directions Let u 1 , . . . , u d , and λ 1 > . . . > λ d be the eigenvectors/eigenvalues of Σ ˜ x || v || 2 = 1 Var ( v T ˜ λ 1 = max x ) || v || 2 = 1 Var ( v T ˜ u 1 = arg max x ) Var ( v T ˜ λ k = max x ) , 2 ≤ k ≤ d || v || 2 = 1 , v ⊥ u 1 ,..., u k − 1 Var ( v T ˜ u k = arg max x ) , 2 ≤ k ≤ d || v || 2 = 1 , v ⊥ u 1 ,..., u k − 1

Principal components Let c (˜ x ) := ˜ x − E (˜ x ) pc [ i ] := u T � i c (˜ x ) , 1 ≤ i ≤ d is the i th principal component Var ( � pc [ i ]) := λ i , 1 ≤ i ≤ d

Principal components are uncorrelated pc [ j ]) = E ( u T x ) u T E ( � pc [ i ] � i c (˜ j c (˜ x )) = u T x ) T ) u j i E ( c (˜ x ) c (˜ = u T i Σ ˜ x u j = λ i u T i u j = 0

Principal components For dataset X containing x 1 , x 2 , . . . , x n ∈ R d 1. Compute sample covariance matrix Σ X 2. Eigendecomposition of Σ X yields principal directions u 1 , . . . , u d 3. Center the data and compute principal components pc i [ j ] := u T j c ( x i ) , 1 ≤ i ≤ n , 1 ≤ j ≤ d , where c ( x i ) := x i − av( X )

First principal direction Centered latitude 30 20 10 0 10 40 20 0 20 40 Centered longitude

First principal component 0.10 0.08 Density 0.06 0.04 0.02 0.00 40 20 0 20 40 First principal component

Second principal direction Centered latitude 30 20 10 0 10 40 20 0 20 40 Centered longitude

Second principal component 0.10 0.08 Density 0.06 0.04 0.02 0.00 40 20 0 20 40 Second principal component

Sample variance in direction of a fixed vector v var ( P v X ) = v T Σ X v

Principal directions Let u 1 , . . . , u d , and λ 1 > . . . > λ d be the eigenvectors/eigenvalues of Σ X λ 1 = max || v || 2 = 1 var ( P v X ) u 1 = arg max || v || 2 = 1 var ( P v X ) λ k = max var ( P v X ) , 2 ≤ k ≤ d || v || 2 = 1 , v ⊥ u 1 ,..., u k − 1 u k = arg max var ( P v X ) , 2 ≤ k ≤ d || v || 2 = 1 , v ⊥ u 1 ,..., u k − 1

Sample variance = 229 (sample std = 15.1) Centered latitude 30 20 10 0 10 40 20 0 20 40 Centered longitude

Sample variance = 229 (sample std = 15.1) 0.10 0.08 Density 0.06 0.04 0.02 0.00 40 20 0 20 40 Component in selected direction

Sample variance = 531 (sample std = 23.1) Centered latitude 30 20 10 0 10 40 20 0 20 40 Centered longitude

Sample variance = 531 (sample std = 23.1 0.10 0.08 Density 0.06 0.04 0.02 0.00 40 20 0 20 40 First principal component

Sample variance = 46.2 (sample std = 6.80) Centered latitude 30 20 10 0 10 40 20 0 20 40 Centered longitude

Sample variance = 46.2 (sample std = 6.80) 0.10 0.08 Density 0.06 0.04 0.02 0.00 40 20 0 20 40 Second principal component

PCA of faces Data set of 400 64 × 64 images from 40 subjects (10 per subject) Each face is vectorized and interpreted as a vector in R 4096

PCA of faces Center PD 1 PD 2 PD 3 PD 4 PD 5 330 251 192 152 130

PCA of faces PD 10 PD 15 PD 20 PD 30 PD 40 PD 50 90.2 70.8 58.7 45.1 36.0 30.8

PCA of faces PD 100 PD 150 PD 200 PD 250 PD 300 PD 359 19.0 13.7 10.3 8.01 6.14 3.06

Dimensionality reduction Data with a large number of features can be difficult to analyze or process Dimensionality reduction is a useful preprocessing step If data are modeled as vectors in R p we can reduce the dimension by projecting onto R k , where k < p For orthogonal projections, the new representation is � v 1 , x � , � v 2 , x � , . . . , � v k , x � for a basis v 1 , . . . , v k of the subspace that we project on Problem: How do we choose the subspace? Possible criterion: Capture as much sample variance as possible

Captured variance For any orthonormal v 1 , . . . , v k � k � k � n 1 v T i c ( x j ) c ( x j ) T v i var( P v i X ) = n i = 1 i = 1 j = 1 � k v T = i Σ X v i i = 1 By spectral theorem, eigenvectors optimize each individual term

Eigenvectors also optimize sum For any symmetric A ∈ R d × d with eigenvectors u 1 , . . . , u k � k � k u T v T i Au i ≥ i Av i . i = 1 i = 1 for any k orthonormal vectors v 1 , . . . , v k

Proof by induction on k Base ( k = 1)? Follows from spectral theorem

Step Let S := span( v 1 , . . . , v k ) For any orthonormal basis for S b 1 , . . . , b k of S VV T = BB T Choice of basis does not change cost function � � � k i = 1 v T V T AV i Av i = trace � AVV T � = trace � ABB T � = trace = � k i = 1 b T i Ab i Let’s choose wisely

Step We choose b orthogonal to u 1 , . . . , u k − 1 By spectral theorem u T k Au k ≥ b T Ab Now choose orthonormal basis b 1 , b 2 , . . . , b k for S so that b k := b By induction assumption k − 1 k − 1 � � u T b T i Au i ≥ i Ab i i = 1 i = 1

Conclusion For any k orthonormal vectors v 1 , . . . , v k k k � � var(pc[ i ]) ≥ var( P v i X ) , i = 1 i = 1 where pc[ i ] := { pc 1 [ i ] , . . . , pc n [ i ] } = P u i X

Faces � 7 x reduced := av( X ) + pc i [ j ] u j i j = 1

Projection onto first 7 principal directions Center PD 1 PD 2 = - 2459 8613 + 665 PD 3 PD 4 PD 5 - 180 + 301 + 566 PD 6 PD 7 + 638 + 403

Projection onto first k principal directions Signal 5 PDs 10 PDs 20 PDs 30 PDs 50 PDs 100 PDs 150 PDs 200 PDs 250 PDs 300 PDs 359 PDs

Nearest-neighbor classification Training set of points and labels { x 1 , l 1 } , . . . , { x n , l n } To classify a new data point y , find i ∗ := arg min 1 ≤ i ≤ n || y − x i || 2 , and assign l i ∗ to y Cost: O ( nd ) to classify new point

Nearest neighbors in principal-component space Idea: Project onto first k main principal directions beforehand Costly reduced to O ( nk ) Computing eigendecomposition is costly, but only needs to be done once

Face recognition Training set: 360 64 × 64 images from 40 different subjects (9 each) Test set: 1 new image from each subject We model each image as a vector in R 4096 ( d = 4096) To classify we: 1. Project onto first k principal directions 2. Apply nearest-neighbor classification using the ℓ 2 -norm distance in R k

Performance 30 Errors 20 10 4 0 10 20 30 40 50 60 70 80 90 100 Number of principal components

Nearest neighbor in R 41 Test image Projection Closest projection Corresponding image

Dimensionality reduction for visualization Motivation: Visualize high-dimensional features projected onto 2D or 3D Example: Seeds from three different varieties of wheat: Kama, Rosa and Canadian Features: ◮ Area ◮ Perimeter ◮ Compactness ◮ Length of kernel ◮ Width of kernel ◮ Asymmetry coefficient ◮ Length of kernel groove

Projection onto two first PDs 2.0 Second principal component 1.5 1.0 0.5 0.0 0.5 1.0 1.5 2.0 2.5 2.5 2.0 1.5 1.0 0.5 0.0 0.5 1.0 1.5 2.0 First principal component

Projection onto two last PDs 2.0 1.5 dth principal component 1.0 0.5 0.0 0.5 1.0 1.5 2.0 2.5 2.5 2.0 1.5 1.0 0.5 0.0 0.5 1.0 1.5 2.0 (d-1)th principal component

Gaussian random variables The pdf of a Gaussian or normal random variable ˜ a with mean µ and standard deviation σ is given by 1 e − ( a − µ ) 2 √ f ˜ a ( a ) = 2 σ 2 2 πσ

Gaussian random variables µ = 2 σ = 1 0 . 4 µ = 0 σ = 2 µ = 0 σ = 4 0 . 3 a ( a ) 0 . 2 f ˜ 0 . 1 0 − 10 − 8 − 6 − 4 − 2 0 2 4 6 8 10 a

Gaussian random variables � ∞ µ = af ˜ a ( a ) d a a = −∞ � ∞ σ 2 = ( a − µ ) 2 f ˜ a ( a ) d a a = −∞

Linear transformation of Gaussian If ˜ a is a Gaussian random variable with mean µ and standard deviation σ , then for any α, β ∈ R ˜ b := α ˜ a + β is a Gaussian random variable with αµ + β and standard deviation | α | σ

Proof Let α > 0 (proof for a < 0 is very similar), � � ˜ F ˜ b ( b ) = P b ≤ b = P ( α ˜ a + β ≤ b ) � � a ≤ b − β = P ˜ α � b − β e − ( a − µ ) 2 1 α = √ d a 2 σ 2 2 πσ −∞ � b e − ( w − αµ − β ) 2 1 = √ d w change of variables w := α a + β 2 α 2 σ 2 2 πασ −∞ Differentiating with respect to b : 1 e − ( b − αµ − β ) 2 b ( b ) = √ f ˜ 2 α 2 σ 2 2 πασ

Gaussian random vector A Gaussian random vector ˜ x is a random vector with joint pdf � � 1 − 1 2 ( x − µ ) T Σ − 1 ( x − µ ) � f ˜ x ( x ) = exp ( 2 π ) n | Σ | where µ ∈ R d is the mean and Σ ∈ R d × d the covariance matrix Σ ∈ R d × d is positive definite (positive eigenvalues)

Contour surfaces Set of points at which pdf is constant c = x T Σ − 1 x assuming µ = 0 = x T U Λ − 1 Ux d � ( u T i x ) 2 = λ i i = 1 Ellipsoid with axes proportional to √ λ i

2D example µ = 0 � 0 . 5 � − 0 . 3 Σ = − 0 . 3 0 . 5 λ 1 = 0 . 8 λ 2 = 0 . 2 � 1 / √ � 2 √ u 1 = − 1 / 2 √ � 1 / � 2 √ u 2 = 1 / 2 How does the ellipse look like?

Contour surfaces 1 . 5 10 − 4 1 . 0 10 − 2 0 . 5 x[2] 0 . 0 0.37 0.24 10 − 1 − 0 . 5 10 − 2 − 1 . 0 10 − 4 − 1 . 5 − 1 . 5 − 1 . 0 − 0 . 5 0 . 0 0 . 5 1 . 0 1 . 5 x[1]

Principal component analysis DS-GA 1013 / MATH-GA 2824 Mathematical - PowerPoint PPT Presentation

Principal component analysis DS-GA 1013 / MATH-GA 2824 Mathematical Tools for Data Science https://cims.nyu.edu/~cfgranda/pages/MTDS_spring20/index.html Carlos Fernandez-Granda Discussion Covariance matrix The spectral theorem Principal

Continuous Latent Variables Oliver Schulte - CMPT 419/726 Bishop PRML Ch. 12 Principal Component

Section 1 Principal Component Analysis 1 / 16 Principal Component Analysis ST 810-006

Functional Principal Component Analysis May 14, 2018 Empirical Principal Component FPC for the

Principal Component Analysis Powerpoint Presentation What is multivariate analysis? Summarizing

Principal component analysis Ingo Blechschmidt December 17th, 2014 Kleine Bayessche AG

Functional components Notification component Application received Refuse ? Notification

WIO IOSAP Project Budget Nairobi Convention WIO IOSAP Budget per Project Component COMPONENT

Principal Component Analysis http://setosa.io/ev/principal- Food consumption in the UK

CS475/CS675 Lecture 23: July 19, 2016 Principal Component Analysis, Eigenfaces CS475/CS675 (c)

Dimensionality Reduction: Linear Discriminant Analysis and Principal Component Analysis CMSC 678

Introduction to Principal Component Analysis and Indepedent Component Analysis Tristan A. Hearn

Chapter 5 Singular value decomposition and principal component analysis In A Practical Approach to

Hebbian Learning, Hebbian Learning Principal Component Analysis, and Independent Component

Principal Component Analysis in a Linear Algebraic View by Anna Orosz under the mentorship of

Lecture 3 Principal Component Analysis Lin ZHANG, PhD School of Software Engineering Tongji

Component selection 1 (c) 2020 A.J.M. Montagne Component selection + - + - + - 2 (c)

Principal Component Analysis Eric Eager Data Scientist at Pro Football Focus DataCamp Linear

1 Principal Components Analysis (PCA) Review of basic setup: N vectors, { x 1 , . . .

Curriculum Briefing for P3 & P4 Parents 18 January 2020 Overview Vision, Mission &

Networking of ICT Technologies for Networking of ICT Technologies for Improvement in the Health

Data Mining and Machine Learning: Fundamental Concepts and Algorithms dataminingbook.info

Linear Dimensionality Reduction Practical Machine Learning (CS294-34) September 24, 2009 Percy

A Cluster Target Similarity Based g y Principal Component Analysis for Interval Valued

PRINCIPAL COMPONENT ANALYSIS(PCA) By Deepen naorem Latent(hidden) representation Method A

Principal component analysis DS-GA 1013 / MATH-GA 2824 Mathematical - PowerPoint PPT Presentation

Principal component analysis DS-GA 1013 / MATH-GA 2824 Mathematical Tools for Data Science https://cims.nyu.edu/~cfgranda/pages/MTDS_spring20/index.html Carlos Fernandez-Granda Discussion Covariance matrix The spectral theorem Principal

Continuous Latent Variables Oliver Schulte - CMPT 419/726 Bishop PRML Ch. 12 Principal Component

Section 1 Principal Component Analysis 1 / 16 Principal Component Analysis ST 810-006

Functional Principal Component Analysis May 14, 2018 Empirical Principal Component FPC for the

Principal Component Analysis Powerpoint Presentation What is multivariate analysis? Summarizing

Principal component analysis Ingo Blechschmidt December 17th, 2014 Kleine Bayessche AG

Functional components Notification component Application received Refuse ? Notification

WIO IOSAP Project Budget Nairobi Convention WIO IOSAP Budget per Project Component COMPONENT

Principal Component Analysis http://setosa.io/ev/principal- Food consumption in the UK

CS475/CS675 Lecture 23: July 19, 2016 Principal Component Analysis, Eigenfaces CS475/CS675 (c)

Dimensionality Reduction: Linear Discriminant Analysis and Principal Component Analysis CMSC 678

Introduction to Principal Component Analysis and Indepedent Component Analysis Tristan A. Hearn

Chapter 5 Singular value decomposition and principal component analysis In A Practical Approach to

Hebbian Learning, Hebbian Learning Principal Component Analysis, and Independent Component

Principal Component Analysis in a Linear Algebraic View by Anna Orosz under the mentorship of

Lecture 3 Principal Component Analysis Lin ZHANG, PhD School of Software Engineering Tongji

Component selection 1 (c) 2020 A.J.M. Montagne Component selection + - + - + - 2 (c)

Principal Component Analysis Eric Eager Data Scientist at Pro Football Focus DataCamp Linear

1 Principal Components Analysis (PCA) Review of basic setup: N vectors, { x 1 , . . .

Curriculum Briefing for P3 &amp; P4 Parents 18 January 2020 Overview Vision, Mission &amp;

Networking of ICT Technologies for Networking of ICT Technologies for Improvement in the Health

Data Mining and Machine Learning: Fundamental Concepts and Algorithms dataminingbook.info

Linear Dimensionality Reduction Practical Machine Learning (CS294-34) September 24, 2009 Percy

A Cluster Target Similarity Based g y Principal Component Analysis for Interval Valued

PRINCIPAL COMPONENT ANALYSIS(PCA) By Deepen naorem Latent(hidden) representation Method A

Curriculum Briefing for P3 & P4 Parents 18 January 2020 Overview Vision, Mission &