Principal Component Analysis (PCA)
CE-717: Machine Learning
Sharif University of Technology, Spring 2016
Soleymani
Dimensionality Reduction: Feature Selection vs. Feature Extraction
Feature selection
Select a subset of a given feature set
Feature extraction
A linear or non-linear transform on the original feature space
Feature selection: $\begin{bmatrix} x_1 \\ \vdots \\ x_d \end{bmatrix} \rightarrow \begin{bmatrix} x_{i_1} \\ \vdots \\ x_{i_{d'}} \end{bmatrix}$ with $d' < d$

Feature extraction: $\begin{bmatrix} x_1 \\ \vdots \\ x_d \end{bmatrix} \rightarrow \begin{bmatrix} z_1 \\ \vdots \\ z_{d'} \end{bmatrix} = f\left(\begin{bmatrix} x_1 \\ \vdots \\ x_d \end{bmatrix}\right)$
Feature Extraction
Mapping of the original data to another space.
The criterion for feature extraction can differ based on the problem setting:
Unsupervised task: minimize the information loss (reconstruction error)
Supervised task: maximize the class discrimination in the projected space
Feature extraction algorithms
Linear methods:
Unsupervised: e.g., Principal Component Analysis (PCA)
Supervised: e.g., Linear Discriminant Analysis (LDA), also known as Fisher's Discriminant Analysis (FDA)
Feature Extraction
Unsupervised feature extraction: the input is only the data matrix

$$\boldsymbol{X} = \begin{bmatrix} x_1^{(1)} & \cdots & x_d^{(1)} \\ \vdots & \ddots & \vdots \\ x_1^{(N)} & \cdots & x_d^{(N)} \end{bmatrix}$$

and the output is a mapping $f: \mathbb{R}^d \rightarrow \mathbb{R}^{d'}$, or only the transformed data

$$\boldsymbol{X}' = \begin{bmatrix} x_1'^{(1)} & \cdots & x_{d'}'^{(1)} \\ \vdots & \ddots & \vdots \\ x_1'^{(N)} & \cdots & x_{d'}'^{(N)} \end{bmatrix}$$

Supervised feature extraction: the input is the data matrix $\boldsymbol{X}$ together with the targets $\boldsymbol{Y} = \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(N)} \end{bmatrix}$, and the output is again a mapping $f: \mathbb{R}^d \rightarrow \mathbb{R}^{d'}$ or only the transformed data $\boldsymbol{X}'$.
Unsupervised Feature Reduction
Visualization: projection of high-dimensional data onto 2D or 3D.
Data compression: efficient storage, communication, and retrieval.
Pre-processing: improve accuracy by reducing the number of features
As a preprocessing step to reduce dimensionality for supervised learning tasks
Helps avoid overfitting
Noise removal
E.g., "noise" in images introduced by minor lighting variations, slightly different imaging conditions, etc.
Linear Transformation
For a linear transformation, we find an explicit mapping $f(\boldsymbol{x}) = \boldsymbol{A}^T\boldsymbol{x}$ that can also transform new data vectors:

$$\boldsymbol{x}' = \boldsymbol{A}^T\boldsymbol{x}, \qquad \boldsymbol{A}^T \in \mathbb{R}^{d' \times d}, \quad \boldsymbol{x} \in \mathbb{R}^d \ \text{(original data)}, \quad \boldsymbol{x}' \in \mathbb{R}^{d'} \ \text{(reduced data)}, \quad d' < d$$
Linear Transformation
Linear transformations are simple mappings:

$$\boldsymbol{x}' = \boldsymbol{A}^T\boldsymbol{x}, \qquad x_k' = \boldsymbol{a}_k^T\boldsymbol{x} \quad (k = 1, \ldots, d')$$

$$\boldsymbol{A} = \begin{bmatrix} a_{11} & \cdots & a_{1d'} \\ \vdots & \ddots & \vdots \\ a_{d1} & \cdots & a_{dd'} \end{bmatrix}, \qquad \begin{bmatrix} x_1' \\ \vdots \\ x_{d'}' \end{bmatrix} = \begin{bmatrix} a_{11} & \cdots & a_{d1} \\ \vdots & \ddots & \vdots \\ a_{1d'} & \cdots & a_{d'd} \end{bmatrix} \begin{bmatrix} x_1 \\ \vdots \\ x_d \end{bmatrix}$$

(the rows of $\boldsymbol{A}^T$ are $\boldsymbol{a}_1^T, \ldots, \boldsymbol{a}_{d'}^T$)
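As a concrete illustration, here is a minimal NumPy sketch of the mapping $\boldsymbol{x}' = \boldsymbol{A}^T\boldsymbol{x}$; the matrix and data point below are made-up placeholders, not from the slides.

```python
# A minimal sketch of x' = A^T x with NumPy; A and x are made-up placeholders.
import numpy as np

d, d_prime = 4, 2                      # original and reduced dimensions (d' < d)
rng = np.random.default_rng(0)
A = rng.standard_normal((d, d_prime))  # columns a_1, ..., a_{d'} define the transform
x = rng.standard_normal(d)             # a single d-dimensional data point

x_prime = A.T @ x                      # x' = A^T x, i.e., x'_k = a_k^T x
print(x_prime.shape)                   # (2,)
```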
Linear Dimensionality Reduction
Unsupervised:
Principal Component Analysis (PCA) [we will discuss]
Independent Component Analysis (ICA) [we will discuss]
Singular Value Decomposition (SVD)
Multi-Dimensional Scaling (MDS)
Canonical Correlation Analysis (CCA)
Principal Component Analysis (PCA)
Also known as the Karhunen-Loève (KL) transform
Principal Components (PCs): orthogonal vectors ordered by the fraction of the total information (variation) in the corresponding directions
Find the directions along which the data approximately lie
When the data is projected onto the first PC, the variance of the projected data is maximized
PCA is an orthogonal projection of the data into a subspace such that the variance of the projected data is maximized.
Principal Component Analysis (PCA)
The "best" linear subspace (i.e., the one providing the least reconstruction error of the data):
Find the mean-reduced data; the axes are then rotated to new (principal) axes such that:
Principal axis 1 has the highest variance
...
Principal axis i has the i-th highest variance
The principal axes are uncorrelated: the covariance between each pair of principal axes is zero.
Goal: reduce the dimensionality of the data while preserving as much of the variation present in the dataset as possible.
PCs can be found as the "best" eigenvectors of the covariance matrix of the data points.
Principal components
If the data has a Gaussian distribution $\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, the direction of the largest variance can be found as the eigenvector of $\boldsymbol{\Sigma}$ that corresponds to the largest eigenvalue of $\boldsymbol{\Sigma}$.

[Figure: a Gaussian data cloud with principal directions $\boldsymbol{v}_1$ and $\boldsymbol{v}_2$]
PCA: Steps
Input: an $N \times d$ data matrix $\boldsymbol{X}$ (each row contains a $d$-dimensional data point)

$\boldsymbol{\mu} = \frac{1}{N}\sum_{i=1}^{N} \boldsymbol{x}^{(i)}$
$\boldsymbol{X} \leftarrow$ the mean of the data points is subtracted from each row of $\boldsymbol{X}$
$\boldsymbol{C} = \frac{1}{N}\boldsymbol{X}^T\boldsymbol{X}$ (covariance matrix)
Calculate the eigenvalues and eigenvectors of $\boldsymbol{C}$
Pick the $d'$ eigenvectors corresponding to the largest eigenvalues and put them in the columns of $\boldsymbol{A} = [\boldsymbol{v}_1, \ldots, \boldsymbol{v}_{d'}]$ (first PC through $d'$-th PC)
$\boldsymbol{X}' = \boldsymbol{X}\boldsymbol{A}$
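These steps translate directly into a few lines of NumPy. Below is a minimal sketch (the function name `pca` is our own, not from the slides):

```python
# A sketch of the PCA steps above; X is an N x d data matrix.
import numpy as np

def pca(X, d_prime):
    mu = X.mean(axis=0)                   # mean of the data points
    Xc = X - mu                           # subtract the mean from each row
    C = Xc.T @ Xc / X.shape[0]            # covariance matrix C = (1/N) X^T X
    eigvals, eigvecs = np.linalg.eigh(C)  # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]     # re-order them descending
    A = eigvecs[:, order[:d_prime]]       # d' eigenvectors with largest eigenvalues
    return Xc @ A, A, mu                  # projected data X' = X A, plus A and mu
```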
Covariance Matrix
$$\boldsymbol{\mu}_{\boldsymbol{x}} = \begin{bmatrix} \mu_1 \\ \vdots \\ \mu_d \end{bmatrix} = \begin{bmatrix} E(x_1) \\ \vdots \\ E(x_d) \end{bmatrix}, \qquad \boldsymbol{\Sigma} = E\left[(\boldsymbol{x} - \boldsymbol{\mu}_{\boldsymbol{x}})(\boldsymbol{x} - \boldsymbol{\mu}_{\boldsymbol{x}})^T\right]$$

ML estimate of the covariance matrix from the data points $\{\boldsymbol{x}^{(i)}\}_{i=1}^{N}$:

$$\widehat{\boldsymbol{\Sigma}} = \frac{1}{N}\sum_{i=1}^{N}\left(\boldsymbol{x}^{(i)} - \boldsymbol{\mu}\right)\left(\boldsymbol{x}^{(i)} - \boldsymbol{\mu}\right)^T = \frac{1}{N}\boldsymbol{X}^T\boldsymbol{X}, \qquad \boldsymbol{\mu} = \frac{1}{N}\sum_{i=1}^{N}\boldsymbol{x}^{(i)}$$

where $\boldsymbol{X}$ is the mean-centered data matrix:

$$\boldsymbol{X} = \begin{bmatrix} \boldsymbol{x}^{(1)} - \boldsymbol{\mu} \\ \vdots \\ \boldsymbol{x}^{(N)} - \boldsymbol{\mu} \end{bmatrix}$$

We now assume the data are mean-removed, so $\boldsymbol{x}$ in the later slides denotes the centered data.
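As a quick sanity check (our own sketch, not from the slides), the ML estimate above matches NumPy's biased covariance, which divides by $N$ rather than $N - 1$:

```python
# The manual (1/N) X^T X estimate equals np.cov with bias=True.
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 3))
Xc = X - X.mean(axis=0)                       # mean-centered data
C_manual = Xc.T @ Xc / X.shape[0]             # ML estimate, divides by N
C_numpy = np.cov(X, rowvar=False, bias=True)  # bias=True also divides by N
assert np.allclose(C_manual, C_numpy)
```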
Correlation matrix
$$\frac{1}{N}\boldsymbol{X}^T\boldsymbol{X} = \frac{1}{N} \begin{bmatrix} x_1^{(1)} & \cdots & x_1^{(N)} \\ \vdots & \ddots & \vdots \\ x_d^{(1)} & \cdots & x_d^{(N)} \end{bmatrix} \begin{bmatrix} x_1^{(1)} & \cdots & x_d^{(1)} \\ \vdots & \ddots & \vdots \\ x_1^{(N)} & \cdots & x_d^{(N)} \end{bmatrix} = \frac{1}{N} \begin{bmatrix} \sum_{n=1}^{N} x_1^{(n)} x_1^{(n)} & \cdots & \sum_{n=1}^{N} x_1^{(n)} x_d^{(n)} \\ \vdots & \ddots & \vdots \\ \sum_{n=1}^{N} x_d^{(n)} x_1^{(n)} & \cdots & \sum_{n=1}^{N} x_d^{(n)} x_d^{(n)} \end{bmatrix}$$

where $\boldsymbol{X} = \begin{bmatrix} x_1^{(1)} & \cdots & x_d^{(1)} \\ \vdots & \ddots & \vdots \\ x_1^{(N)} & \cdots & x_d^{(N)} \end{bmatrix}$
Two Interpretations
Maximum variance subspace: PCA finds vectors $\boldsymbol{a}$ such that projections onto these vectors capture the maximum variance in the data:

$$\frac{1}{N}\sum_{n=1}^{N}\left(\boldsymbol{a}^T\boldsymbol{x}^{(n)}\right)^2 = \frac{1}{N}\boldsymbol{a}^T\boldsymbol{X}^T\boldsymbol{X}\boldsymbol{a}$$

Minimum reconstruction error: PCA finds vectors $\boldsymbol{a}$ such that projection onto these vectors yields the minimum MSE reconstruction:

$$\frac{1}{N}\sum_{n=1}^{N}\left\|\boldsymbol{x}^{(n)} - \left(\boldsymbol{a}^T\boldsymbol{x}^{(n)}\right)\boldsymbol{a}\right\|^2$$
Least Squares Error Interpretation
PCs are linear least squares fits to the samples, each orthogonal to the previous PCs:
The first PC is a minimum-distance fit to a vector in the original feature space
The second PC is a minimum-distance fit to a vector in the plane perpendicular to the first PC
And so on
Least Squares Error and Maximum Variance Views Are Equivalent (1-dim Interpretation)
Minimizing the sum of squared distances to the line is equivalent to maximizing the sum of squares of the projections onto that line (Pythagoras):

$$\text{red}^2 + \text{blue}^2 = \text{green}^2$$

green² is fixed (it is the squared length of the data vector after mean removal), so maximizing blue² (the squared projection) is equivalent to minimizing red² (the squared distance to the line).
First PC
The first PC is the direction of greatest variability in the data.
We will show that the first PC is the eigenvector of the covariance matrix corresponding to the maximum eigenvalue of this matrix.
If $\|\boldsymbol{a}\| = 1$, the projection of a $d$-dimensional $\boldsymbol{x}$ onto $\boldsymbol{a}$ is $\boldsymbol{a}^T\boldsymbol{x}$:

$$\|\boldsymbol{x}\|\cos\theta = \|\boldsymbol{x}\|\,\frac{\boldsymbol{a}^T\boldsymbol{x}}{\|\boldsymbol{x}\|\,\|\boldsymbol{a}\|} = \boldsymbol{a}^T\boldsymbol{x}$$
First PC
$$\arg\max_{\boldsymbol{a}}\ \frac{1}{N}\sum_{n=1}^{N}\left(\boldsymbol{a}^T\boldsymbol{x}^{(n)}\right)^2 = \frac{1}{N}\boldsymbol{a}^T\boldsymbol{X}^T\boldsymbol{X}\boldsymbol{a} \qquad \text{s.t. } \boldsymbol{a}^T\boldsymbol{a} = 1$$

Setting the derivative of the Lagrangian to zero:

$$\frac{\partial}{\partial\boldsymbol{a}}\left[\frac{1}{N}\boldsymbol{a}^T\boldsymbol{X}^T\boldsymbol{X}\boldsymbol{a} + \lambda\left(1 - \boldsymbol{a}^T\boldsymbol{a}\right)\right] = 0 \;\Rightarrow\; \frac{1}{N}\boldsymbol{X}^T\boldsymbol{X}\boldsymbol{a} = \lambda\boldsymbol{a}$$

So $\boldsymbol{a}$ is an eigenvector of the sample covariance matrix $\frac{1}{N}\boldsymbol{X}^T\boldsymbol{X}$.
The eigenvalue $\lambda$ denotes the amount of variance along that dimension:

$$\text{Variance} = \frac{1}{N}\boldsymbol{a}^T\boldsymbol{X}^T\boldsymbol{X}\boldsymbol{a} = \boldsymbol{a}^T\left(\frac{1}{N}\boldsymbol{X}^T\boldsymbol{X}\right)\boldsymbol{a} = \boldsymbol{a}^T\lambda\boldsymbol{a} = \lambda$$

So if we seek the dimension with the largest variance, it is the eigenvector corresponding to the largest eigenvalue of the sample covariance matrix.
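One simple way to compute this top eigenvector without a full eigendecomposition is power iteration; the sketch below is our own illustration, not part of the slides, and assumes the largest eigenvalue is strictly dominant:

```python
# Power iteration on the sample covariance converges to the first PC
# (assuming the largest eigenvalue is strictly larger than the others).
import numpy as np

def first_pc(X, n_iter=200):
    Xc = X - X.mean(axis=0)
    C = Xc.T @ Xc / X.shape[0]        # sample covariance (1/N) X^T X
    a = np.ones(C.shape[0])
    a /= np.linalg.norm(a)            # any unit-norm starting vector
    for _ in range(n_iter):
        a = C @ a                     # multiply by C ...
        a /= np.linalg.norm(a)        # ... and renormalize so a^T a = 1
    return a                          # eigenvector with the largest eigenvalue
```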
PCA: Uncorrelated Features
$$\boldsymbol{x}' = \boldsymbol{A}^T\boldsymbol{x} \;\Rightarrow\; \boldsymbol{R}_{\boldsymbol{x}'} = E\left[\boldsymbol{x}'\boldsymbol{x}'^T\right] = E\left[\boldsymbol{A}^T\boldsymbol{x}\boldsymbol{x}^T\boldsymbol{A}\right] = \boldsymbol{A}^T E\left[\boldsymbol{x}\boldsymbol{x}^T\right]\boldsymbol{A} = \boldsymbol{A}^T\boldsymbol{R}_{\boldsymbol{x}}\boldsymbol{A}$$

If $\boldsymbol{A} = [\boldsymbol{a}_1, \ldots, \boldsymbol{a}_d]$ where $\boldsymbol{a}_1, \ldots, \boldsymbol{a}_d$ are orthonormal eigenvectors of $\boldsymbol{R}_{\boldsymbol{x}}$:

$$\boldsymbol{R}_{\boldsymbol{x}'} = \boldsymbol{A}^T\boldsymbol{R}_{\boldsymbol{x}}\boldsymbol{A} = \boldsymbol{A}^T\left(\boldsymbol{A}\boldsymbol{\Lambda}\boldsymbol{A}^T\right)\boldsymbol{A} = \boldsymbol{\Lambda} \;\Rightarrow\; E\left[x_j' x_k'\right] = 0 \quad \forall j \neq k, \ j,k = 1,\ldots,d$$

so mutually uncorrelated features are obtained. Completely uncorrelated features avoid information redundancy.
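A quick numeric check of this property (our own sketch; the example covariance matrix is made up):

```python
# After projecting onto orthonormal eigenvectors, the covariance of
# x' = A^T x is (numerically) the diagonal matrix Lambda.
import numpy as np

rng = np.random.default_rng(2)
Sigma = np.array([[3.0, 1.0, 0.0], [1.0, 2.0, 0.0], [0.0, 0.0, 1.0]])
X = rng.multivariate_normal(np.zeros(3), Sigma, size=5000)
Xc = X - X.mean(axis=0)
C = Xc.T @ Xc / X.shape[0]
_, A = np.linalg.eigh(C)                     # all orthonormal eigenvectors
C_proj = A.T @ C @ A                         # covariance of projected features
print(np.max(np.abs(C_proj - np.diag(np.diag(C_proj)))))  # ~0: uncorrelated
```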
PCA Derivation: Mean Square Error Approximation
Incorporating all eigenvectors in $\boldsymbol{A} = [\boldsymbol{a}_1, \ldots, \boldsymbol{a}_d]$:

$$\boldsymbol{x}' = \boldsymbol{A}^T\boldsymbol{x} \;\Rightarrow\; \boldsymbol{A}\boldsymbol{x}' = \boldsymbol{A}\boldsymbol{A}^T\boldsymbol{x} = \boldsymbol{x}$$

Therefore, if $d' = d$, then $\boldsymbol{x}$ can be reconstructed exactly from $\boldsymbol{x}'$.
PCA Derivation: Relation between Eigenvalues and Variances
The $k$-th largest eigenvalue of $\boldsymbol{R}_{\boldsymbol{x}}$ is the variance along the $k$-th PC:

$$\mathrm{var}\left(x_k'\right) = E\left[x_k' x_k'\right] = E\left[\boldsymbol{a}_k^T\boldsymbol{x}\boldsymbol{x}^T\boldsymbol{a}_k\right] = \boldsymbol{a}_k^T E\left[\boldsymbol{x}\boldsymbol{x}^T\right]\boldsymbol{a}_k = \boldsymbol{a}_k^T\boldsymbol{R}_{\boldsymbol{x}}\boldsymbol{a}_k = \boldsymbol{a}_k^T\lambda_k\boldsymbol{a}_k = \lambda_k$$

With eigenvalues ordered as $\lambda_1 \geq \lambda_2 \geq \lambda_3 \geq \cdots$:
- The 1st PC $\boldsymbol{v}_1$ is the eigenvector of the sample covariance matrix associated with the largest eigenvalue
- The 2nd PC $\boldsymbol{v}_2$ is the eigenvector of the sample covariance matrix associated with the second largest eigenvalue
- And so on ...
PCA Derivation: Mean Square Error Approximation
Incorporating only the $d'$ eigenvectors corresponding to the largest eigenvalues, $\boldsymbol{A} = [\boldsymbol{a}_1, \ldots, \boldsymbol{a}_{d'}]$ ($d' < d$), minimizes the MSE between $\boldsymbol{x}$ and $\widehat{\boldsymbol{x}} = \boldsymbol{A}\boldsymbol{x}'$:

$$J(\boldsymbol{A}) = E\left[\left\|\boldsymbol{x} - \widehat{\boldsymbol{x}}\right\|^2\right] = E\left[\Bigg\|\sum_{k=d'+1}^{d} x_k'\boldsymbol{a}_k\Bigg\|^2\right] = E\left[\sum_{k=d'+1}^{d}\sum_{l=d'+1}^{d} x_k'\,\boldsymbol{a}_k^T\boldsymbol{a}_l\,x_l'\right] = E\left[\sum_{k=d'+1}^{d} x_k'^2\right] = \sum_{k=d'+1}^{d} E\left[x_k'^2\right] = \sum_{k=d'+1}^{d} \lambda_k$$

i.e., the sum of the $d - d'$ smallest eigenvalues.
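This identity is easy to verify numerically; the sketch below (our own, with made-up data) compares the average reconstruction error against the sum of the discarded eigenvalues:

```python
# The mean squared reconstruction error with d' = 2 kept components equals
# the sum of the d - d' smallest eigenvalues of the sample covariance.
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((2000, 5)) * np.array([3.0, 2.0, 1.0, 0.5, 0.1])
Xc = X - X.mean(axis=0)
C = Xc.T @ Xc / X.shape[0]
eigvals, eigvecs = np.linalg.eigh(C)       # ascending eigenvalues
A = eigvecs[:, -2:]                        # keep the top d' = 2 eigenvectors
X_hat = Xc @ A @ A.T                       # reconstruct from the projection
mse = np.mean(np.sum((Xc - X_hat) ** 2, axis=1))
print(mse, eigvals[:-2].sum())             # the two numbers agree
```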
PCA Derivation: Mean Square Error Approximation
In general, it can also be shown that the MSE is minimized compared with any other approximation of $\boldsymbol{x}$ by any $d'$-dimensional orthonormal basis; this result can be obtained without first assuming that the axes are eigenvectors of the correlation matrix.
If the data is mean-centered in advance, $\boldsymbol{R}_{\boldsymbol{x}}$ and $\boldsymbol{C}_{\boldsymbol{x}}$ (the covariance matrix) are the same.
However, in the correlation version, when $\boldsymbol{C}_{\boldsymbol{x}} \neq \boldsymbol{R}_{\boldsymbol{x}}$, the approximation is not, in general, a good one (although it is a minimum-MSE solution).
PCA on Faces: “Eigenfaces”
ORL Database
[Figure: some sample face images from the database]
PCA on Faces: “Eigenfaces”
For eigenfaces, "gray" = 0, "white" > 0, "black" < 0.
[Figures: the average face, and the 1st to 10th PCs]
PCA on Faces: Feature Vector

Feature vector $= [x_1', x_2', \ldots, x_{d'}']$, where $x_i' = \mathrm{PC}_i^T\,\boldsymbol{x}$ is the projection of $\boldsymbol{x}$ onto the $i$-th PC:

$$\text{face} = \text{average face} + x_1' \times \mathrm{PC}_1 + x_2' \times \mathrm{PC}_2 + \cdots + x_{256}' \times \mathrm{PC}_{256}$$

$\boldsymbol{x}$ is a $112 \times 92 = 10304$-dimensional vector containing the pixel intensities of the image.
PCA on Faces: Reconstructed Face
[Figure: reconstructions with d' = 1, 2, 4, 8, 16, 32, 64, 128, 256, alongside the original image]
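A sketch of how such reconstructions can be produced (our own illustration; `faces` is assumed to be an N × 10304 array of flattened 112 × 92 images, e.g. from the ORL database):

```python
# Reconstruct a face from its projection onto the top d' PCs ("eigenfaces").
import numpy as np

def reconstruct(faces, image, d_prime):
    mean_face = faces.mean(axis=0)
    Xc = faces - mean_face
    # Economy SVD of the centered data yields the PCs without forming the
    # 10304 x 10304 covariance matrix explicitly.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    A = Vt[:d_prime].T                     # top d' eigenfaces as columns
    coeffs = A.T @ (image - mean_face)     # x'_i = PC_i^T x
    return mean_face + A @ coeffs          # average face + weighted eigenfaces
```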
Dimensionality Reduction by PCA
In high-dimensional problems, data sometimes lies near a linear subspace (small variability around this subspace can be treated as noise).
Keep only the data projections onto the principal components with large eigenvalues.
We might lose some information, but if the discarded eigenvalues are small, we do not lose much. (A common selection heuristic is sketched below.)
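One common heuristic (our own addition, not from the slides) keeps enough components to explain a fixed fraction of the total variance:

```python
# Choose d' so the kept eigenvalues explain, e.g., 95% of the total variance.
import numpy as np

def choose_d_prime(eigvals, threshold=0.95):
    eigvals = np.sort(eigvals)[::-1]               # descending order
    ratio = np.cumsum(eigvals) / np.sum(eigvals)   # cumulative variance fraction
    return int(np.searchsorted(ratio, threshold)) + 1
```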
Kernel PCA
Kernel extension of PCA: useful when the data (approximately) lies on a lower-dimensional non-linear manifold.
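A minimal sketch using scikit-learn's KernelPCA (assuming scikit-learn is available); with an RBF kernel, the non-linear structure of two concentric circles becomes linearly separable in the transformed space:

```python
# Kernel PCA with an RBF kernel on data lying on a non-linear structure.
import numpy as np
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

X, _ = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10)
X_kpca = kpca.fit_transform(X)    # the two circles separate along the first PCs
```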
PCA and LDA: Drawbacks
PCA drawback: an excellent information-packing transform does not necessarily lead to good class separability.
The directions of maximum variance may be useless for classification purposes.
LDA drawbacks:
Singularity or under-sampled problem (when $N < d$)
Example: gene expression data, images, text documents
Can reduce the dimension only to $d' \leq C - 1$, where $C$ is the number of classes (unlike PCA)
PCA vs. LDA
Although LDA often provides more suitable features for classification tasks, PCA might outperform LDA in some situations:
when the number of samples per class is small (overfitting problem of LDA)
when the number of desired features is more than $C - 1$
Advances in the last decade:
Semi-supervised feature extraction
E.g., PCA+LDA, Regularized LDA, Locally FDA (LFDA)
Singular Value Decomposition (SVD)
Given a matrix $\boldsymbol{X} \in \mathbb{R}^{N \times d}$, the SVD is the decomposition

$$\boldsymbol{X} = \boldsymbol{U}\boldsymbol{S}\boldsymbol{V}^T$$

where $\boldsymbol{S}$ is a diagonal matrix holding the singular values $\sigma_1, \ldots, \sigma_d$ of $\boldsymbol{X}$, and the columns of $\boldsymbol{U}$ and $\boldsymbol{V}$ are orthonormal. Shapes: $\boldsymbol{X}$ is $N \times d$, $\boldsymbol{U}$ is $N \times d$, $\boldsymbol{S}$ is $d \times d$, $\boldsymbol{V}^T$ is $d \times d$.
Singular Value Decomposition (SVD)
The SVD of $\boldsymbol{X}$ is related to the eigen-decompositions of $\boldsymbol{X}^T\boldsymbol{X}$ and $\boldsymbol{X}\boldsymbol{X}^T$:

$$\boldsymbol{X}^T\boldsymbol{X} = \boldsymbol{V}\boldsymbol{S}\boldsymbol{U}^T\boldsymbol{U}\boldsymbol{S}\boldsymbol{V}^T = \boldsymbol{V}\boldsymbol{S}^2\boldsymbol{V}^T$$

so $\boldsymbol{V}$ contains the eigenvectors of $\boldsymbol{X}^T\boldsymbol{X}$ and $\boldsymbol{S}^2$ holds its eigenvalues ($\lambda_i = \sigma_i^2$).

$$\boldsymbol{X}\boldsymbol{X}^T = \boldsymbol{U}\boldsymbol{S}\boldsymbol{V}^T\boldsymbol{V}\boldsymbol{S}\boldsymbol{U}^T = \boldsymbol{U}\boldsymbol{S}^2\boldsymbol{U}^T$$

so $\boldsymbol{U}$ contains the eigenvectors of $\boldsymbol{X}\boldsymbol{X}^T$ and $\boldsymbol{S}^2$ holds its eigenvalues ($\lambda_i = \sigma_i^2$).

In fact, we can view each row of $\boldsymbol{U}\boldsymbol{S}$ as the coordinates of an example along the axes given by the eigenvectors.
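This connection means PCA can be computed from the SVD of the centered data matrix; a quick numeric check (our own sketch):

```python
# PCA via the covariance eigendecomposition and via the SVD of the centered
# data give the same eigenvalues and the same projected coordinates.
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((300, 6))
Xc = X - X.mean(axis=0)

eigvals, _ = np.linalg.eigh(Xc.T @ Xc / X.shape[0])
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# for the covariance (1/N) X^T X, the eigenvalues are lambda_i = sigma_i^2 / N
print(np.allclose(np.sort(S**2 / X.shape[0]), np.sort(eigvals)))
print(np.allclose(Xc @ Vt.T, U * S))   # rows of U S are the PC coordinates
```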
Independent Component Analysis (ICA)
PCA:
The transformed dimensions are uncorrelated with each other
Orthogonal linear transform
Uses only second-order statistics (i.e., the covariance matrix)
ICA:
The transformed dimensions are as independent as possible
Non-orthogonal linear transform
Higher-order statistics can also be used
Uncorrelated and Independent
Gaussian: independent ⟺ uncorrelated
Non-Gaussian: independent ⇒ uncorrelated, but uncorrelated ⇏ independent
Uncorrelated: $\mathrm{cov}(X_1, X_2) = 0$
Independent: $P(X_1, X_2) = P(X_1)P(X_2)$
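A small illustration of the non-Gaussian case (our own sketch): $X_2 = X_1^2$ is completely determined by $X_1$, yet the two are uncorrelated:

```python
# Uncorrelated does not imply independent for non-Gaussian variables.
import numpy as np

rng = np.random.default_rng(5)
x1 = rng.uniform(-1, 1, size=100_000)   # symmetric, so E[x1] = E[x1^3] = 0
x2 = x1 ** 2                            # fully dependent on x1
print(np.cov(x1, x2)[0, 1])             # ~0: uncorrelated
# but P(X1, X2) != P(X1) P(X2): knowing x1 pins down x2 exactly
```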
ICA: Cocktail party problem
Cocktail party problem: $d$ speakers speak simultaneously, and each microphone records an overlapping combination of these voices.
Each microphone records a different combination of the speakers' voices.
Using these $d$ microphone recordings, can we separate out the original $d$ speakers' speech signals?
Mixing matrix $\boldsymbol{A}$: $\quad \boldsymbol{x} = \boldsymbol{A}\boldsymbol{s}$
Unmixing matrix $\boldsymbol{A}^{-1}$: $\quad \boldsymbol{s} = \boldsymbol{A}^{-1}\boldsymbol{x}$
$s_j^{(i)}$: the sound speaker $j$ was uttering at time $i$
$x_j^{(i)}$: the acoustic reading recorded by microphone $j$ at time $i$
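A toy version of this setup with scikit-learn's FastICA (assumed available; the sources and mixing matrix below are made up):

```python
# Mix two synthetic sources and recover them with FastICA.
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(6)
t = np.linspace(0, 8, 2000)
S = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))]   # two independent sources
A = np.array([[1.0, 0.5], [0.5, 1.0]])             # mixing matrix
X = S @ A.T                                        # microphone recordings x = A s

ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(X)    # recovered sources (up to scale and order)
```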
ICA
Find a linear transformation $\boldsymbol{x} = \boldsymbol{A}\boldsymbol{s}$ for which the dimensions of