


  1. Principal Component Analysis (PCA)
     CE-717: Machine Learning, Sharif University of Technology, Spring 2016, Soleymani

  2. Dimensionality Reduction: Feature Selection vs. Feature Extraction
     - Feature selection: select a subset of a given feature set
       $(x_1, \dots, x_d) \rightarrow (x_{i_1}, \dots, x_{i_{d'}})$
     - Feature extraction: a linear or non-linear transform on the original feature space
       $(x_1, \dots, x_d) \rightarrow (x'_1, \dots, x'_{d'}) = f(\boldsymbol{x})$, with $d' < d$

  3. Feature Extraction
     - Mapping of the original data to another space
     - The criterion for feature extraction can be different based on the problem setting:
       - Unsupervised task: minimize the information loss (reconstruction error)
       - Supervised task: maximize the class discrimination in the projected space
     - Feature extraction algorithms (linear methods):
       - Unsupervised: e.g., Principal Component Analysis (PCA)
       - Supervised: e.g., Linear Discriminant Analysis (LDA), also known as Fisher's Discriminant Analysis (FDA)

  4. Feature Extraction
     - Unsupervised feature extraction: a mapping $f: \mathbb{R}^d \rightarrow \mathbb{R}^{d'}$, or only the transformed data
       - Input: data matrix $\boldsymbol{X} = \begin{bmatrix} x_1^{(1)} & \cdots & x_d^{(1)} \\ \vdots & \ddots & \vdots \\ x_1^{(N)} & \cdots & x_d^{(N)} \end{bmatrix}$
       - Output: $\boldsymbol{X}' = \begin{bmatrix} x_1'^{(1)} & \cdots & x_{d'}'^{(1)} \\ \vdots & \ddots & \vdots \\ x_1'^{(N)} & \cdots & x_{d'}'^{(N)} \end{bmatrix}$
     - Supervised feature extraction: a mapping $f: \mathbb{R}^d \rightarrow \mathbb{R}^{d'}$, or only the transformed data
       - Input: data matrix $\boldsymbol{X}$ and labels $\boldsymbol{Y} = [y^{(1)}, \dots, y^{(N)}]^T$
       - Output: $\boldsymbol{X}'$

  5. Unsupervised Feature Reduction
     - Visualization: projection of high-dimensional data onto 2D or 3D
     - Data compression: efficient storage, communication, and retrieval
     - Pre-processing: improve accuracy by reducing the number of features
       - As a preprocessing step to reduce dimensions for supervised learning tasks
       - Helps avoid overfitting
     - Noise removal
       - E.g., "noise" in images introduced by minor lighting variations, slightly different imaging conditions, etc.

  6. Linear Transformation
     - For a linear transformation, we find an explicit mapping $f(\boldsymbol{x}) = \boldsymbol{A}^T \boldsymbol{x}$ that can also transform new data vectors:
       original data $\boldsymbol{x} \in \mathbb{R}^d$, projection $\boldsymbol{A}^T \in \mathbb{R}^{d' \times d}$, reduced data $\boldsymbol{x}' = \boldsymbol{A}^T \boldsymbol{x} \in \mathbb{R}^{d'}$, with $d' < d$

  7. Linear Transformation
     - Linear transformations are simple mappings:
       $\boldsymbol{x}' = \boldsymbol{A}^T \boldsymbol{x}$, where $\boldsymbol{A} = \begin{bmatrix} a_{11} & \cdots & a_{1d'} \\ \vdots & \ddots & \vdots \\ a_{d1} & \cdots & a_{dd'} \end{bmatrix}$
       $\begin{bmatrix} x'_1 \\ \vdots \\ x'_{d'} \end{bmatrix} = \begin{bmatrix} a_{11} & \cdots & a_{d1} \\ \vdots & \ddots & \vdots \\ a_{1d'} & \cdots & a_{dd'} \end{bmatrix} \begin{bmatrix} x_1 \\ \vdots \\ x_d \end{bmatrix}$, i.e., $x'_j = \boldsymbol{a}_j^T \boldsymbol{x}$ for $j = 1, \dots, d'$
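As a small illustration, here is a minimal NumPy sketch of the transform above; the projection matrix A and the data are random placeholders chosen only for this example, not values from the slides:

import numpy as np

# Minimal sketch of the linear transform x' = A^T x (illustrative values only).
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 2))    # hypothetical d x d' projection matrix (d = 3, d' = 2)
x = rng.standard_normal(3)         # one d-dimensional data point

x_new = A.T @ x                    # x'_j = a_j^T x for j = 1, ..., d'
print(x_new.shape)                 # (2,)

# Applied to a whole N x d data matrix X, the same transform is X' = X A.
X = rng.standard_normal((100, 3))
X_new = X @ A
print(X_new.shape)                 # (100, 2)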

  8. Linear Dimensionality Reduction
     - Unsupervised
       - Principal Component Analysis (PCA) [we will discuss]
       - Independent Component Analysis (ICA) [we will discuss]
       - Singular Value Decomposition (SVD)
       - Multidimensional Scaling (MDS)
       - Canonical Correlation Analysis (CCA)

  9. Principal Component Analysis (PCA)
     - Also known as the Karhunen-Loève (KL) transform
     - Principal Components (PCs): orthogonal vectors that are ordered by the fraction of the total information (variation) in the corresponding directions
     - Find the directions along which the data approximately lie
       - When the data is projected onto the first PC, the variance of the projected data is maximized
     - PCA is an orthogonal projection of the data onto a subspace such that the variance of the projected data is maximized

  10. Principal Component Analysis (PCA)
     - The "best" linear subspace (i.e., the one providing the least reconstruction error of the data):
       - Find the mean-reduced data
       - The axes are rotated to new (principal) axes such that:
         - Principal axis 1 has the highest variance, ..., principal axis i has the i-th highest variance
         - The principal axes are uncorrelated: the covariance between each pair of principal axes is zero
     - Goal: reduce the dimensionality of the data while preserving as much of the variation present in the dataset as possible
     - PCs can be found as the "best" eigenvectors of the covariance matrix of the data points

  11. Principal Components
     - If the data has a Gaussian distribution $N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, the direction of the largest variance is given by the eigenvector of $\boldsymbol{\Sigma}$ that corresponds to the largest eigenvalue of $\boldsymbol{\Sigma}$
     [Figure: data cloud with principal directions $\boldsymbol{v}_1$ and $\boldsymbol{v}_2$]
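A quick numeric check of this claim, as a sketch: the 2x2 covariance below is an arbitrary example chosen only for illustration, and the sample eigenvector is compared against the eigenvector of the true covariance:

import numpy as np

# For Gaussian data N(mu, Sigma), the direction of largest variance is the
# eigenvector of Sigma with the largest eigenvalue.
rng = np.random.default_rng(0)
Sigma = np.array([[3.0, 1.0],
                  [1.0, 1.0]])
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=Sigma, size=50_000)

# Top eigenvector of the true covariance (eigh returns ascending eigenvalues).
_, eigvecs = np.linalg.eigh(Sigma)
v1_true = eigvecs[:, -1]

# Top eigenvector of the sample covariance.
_, eigvecs_s = np.linalg.eigh(np.cov(X, rowvar=False))
v1_sample = eigvecs_s[:, -1]

# The two directions agree up to sign, so the absolute dot product is near 1.
print(abs(v1_true @ v1_sample))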

  12. PCA: Steps
     - Input: $N \times d$ data matrix $\boldsymbol{X}$ (each row contains a $d$-dimensional data point)
       - $\boldsymbol{\mu} = \frac{1}{N} \sum_{i=1}^{N} \boldsymbol{x}^{(i)}$
       - $\boldsymbol{X} \leftarrow$ the mean of the data points is subtracted from each row of $\boldsymbol{X}$
       - $\boldsymbol{C} = \frac{1}{N} \boldsymbol{X}^T \boldsymbol{X}$ (covariance matrix)
       - Calculate the eigenvalues and eigenvectors of $\boldsymbol{C}$
       - Pick the $d'$ eigenvectors corresponding to the largest eigenvalues and put them in the columns of $\boldsymbol{A} = [\boldsymbol{v}_1, \dots, \boldsymbol{v}_{d'}]$
       - $\boldsymbol{X}' = \boldsymbol{X}\boldsymbol{A}$ (the columns of $\boldsymbol{A}$ are the first PC, ..., the $d'$-th PC)
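These steps map directly onto a few lines of NumPy. The following is a minimal sketch of the recipe above; the function name pca and the random test data are illustrative placeholders, not part of the slides:

import numpy as np

def pca(X, d_prime):
    """Sketch of the steps above: X is N x d; returns X' (N x d') and the
    projection matrix A whose columns are the top d' eigenvectors."""
    mu = X.mean(axis=0)                    # mean of the data points
    Xc = X - mu                            # subtract the mean from every row
    C = (Xc.T @ Xc) / X.shape[0]           # covariance matrix (1/N) X^T X
    eigvals, eigvecs = np.linalg.eigh(C)   # eigen-decomposition (ascending order)
    order = np.argsort(eigvals)[::-1]      # sort eigenvalues descending
    A = eigvecs[:, order[:d_prime]]        # columns = top d' eigenvectors (the PCs)
    return Xc @ A, A

# Usage on random data (illustrative only).
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
X_new, A = pca(X, d_prime=2)
print(X_new.shape, A.shape)                # (200, 2) (5, 2)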

  13. Covariance Matrix
     - $\boldsymbol{\mu}_x = E[\boldsymbol{x}] = [E(x_1), \dots, E(x_d)]^T$
     - $\boldsymbol{\Sigma} = E\left[(\boldsymbol{x} - \boldsymbol{\mu}_x)(\boldsymbol{x} - \boldsymbol{\mu}_x)^T\right]$
     - ML estimate of the covariance matrix from the data points $\{\boldsymbol{x}^{(i)}\}_{i=1}^{N}$:
       $\widehat{\boldsymbol{\Sigma}} = \frac{1}{N} \sum_{i=1}^{N} (\boldsymbol{x}^{(i)} - \boldsymbol{\mu})(\boldsymbol{x}^{(i)} - \boldsymbol{\mu})^T = \frac{1}{N} \boldsymbol{X}^T \boldsymbol{X}$, where $\boldsymbol{\mu} = \frac{1}{N} \sum_{i=1}^{N} \boldsymbol{x}^{(i)}$ and $\boldsymbol{X} = \begin{bmatrix} (\boldsymbol{x}^{(1)} - \boldsymbol{\mu})^T \\ \vdots \\ (\boldsymbol{x}^{(N)} - \boldsymbol{\mu})^T \end{bmatrix}$ is the mean-centered data matrix
     - We now assume that the data are mean-removed; $\boldsymbol{x}$ in the later slides is indeed the mean-centered $\boldsymbol{x}$
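A short sketch checking that this ML estimate, written as (1/N) X^T X on the mean-centered data matrix, matches NumPy's biased covariance; the data is a random placeholder:

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 4))

mu = X.mean(axis=0)
Xc = X - mu                                # mean-centered data matrix
Sigma_ml = (Xc.T @ Xc) / X.shape[0]        # (1/N) X^T X

# np.cov with bias=True also uses the 1/N normalization, so the two agree.
print(np.allclose(Sigma_ml, np.cov(X, rowvar=False, bias=True)))   # True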

  14. Correlation Matrix
     - $\boldsymbol{X} = \begin{bmatrix} x_1^{(1)} & \cdots & x_d^{(1)} \\ \vdots & \ddots & \vdots \\ x_1^{(N)} & \cdots & x_d^{(N)} \end{bmatrix}$
     - $\frac{1}{N} \boldsymbol{X}^T \boldsymbol{X} = \frac{1}{N} \begin{bmatrix} x_1^{(1)} & \cdots & x_1^{(N)} \\ \vdots & \ddots & \vdots \\ x_d^{(1)} & \cdots & x_d^{(N)} \end{bmatrix} \begin{bmatrix} x_1^{(1)} & \cdots & x_d^{(1)} \\ \vdots & \ddots & \vdots \\ x_1^{(N)} & \cdots & x_d^{(N)} \end{bmatrix} = \frac{1}{N} \begin{bmatrix} \sum_{n=1}^{N} x_1^{(n)} x_1^{(n)} & \cdots & \sum_{n=1}^{N} x_1^{(n)} x_d^{(n)} \\ \vdots & \ddots & \vdots \\ \sum_{n=1}^{N} x_d^{(n)} x_1^{(n)} & \cdots & \sum_{n=1}^{N} x_d^{(n)} x_d^{(n)} \end{bmatrix}$

  15. Two Interpretations
     - Maximum variance subspace
       - PCA finds vectors $\boldsymbol{v}$ such that projections onto the vectors capture maximum variance in the data:
         $\frac{1}{N} \sum_{n=1}^{N} (\boldsymbol{v}^T \boldsymbol{x}^{(n)})^2 = \frac{1}{N} \boldsymbol{v}^T \boldsymbol{X}^T \boldsymbol{X} \boldsymbol{v}$
     - Minimum reconstruction error
       - PCA finds vectors $\boldsymbol{v}$ such that projection onto the vectors yields minimum MSE reconstruction:
         $\frac{1}{N} \sum_{n=1}^{N} \left\| \boldsymbol{x}^{(n)} - (\boldsymbol{v}^T \boldsymbol{x}^{(n)}) \boldsymbol{v} \right\|^2$
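A numeric sketch of why the two criteria agree: on mean-centered data, variance captured plus reconstruction error is constant for any unit direction, so the direction maximizing the first also minimizes the second. The data and the set of candidate directions below are random placeholders:

import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[3.0, 1.0], [1.0, 1.0]], size=2000)
X -= X.mean(axis=0)

def variance_along(v):
    return np.mean((X @ v) ** 2)                        # (1/N) sum (v^T x)^2

def reconstruction_error(v):
    X_hat = np.outer(X @ v, v)                          # projection onto span{v}
    return np.mean(np.sum((X - X_hat) ** 2, axis=1))    # (1/N) sum ||x - (v^T x) v||^2

# Compare many random unit directions: the argmax of one is the argmin of the other.
directions = rng.standard_normal((1000, 2))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
variances = np.array([variance_along(v) for v in directions])
errors = np.array([reconstruction_error(v) for v in directions])
print(np.argmax(variances) == np.argmin(errors))        # True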

  16. Least Squares Error Interpretation
     - PCs are linear least squares fits to the samples, each orthogonal to the previous PCs:
       - The first PC is a minimum-distance fit to a vector in the original feature space
       - The second PC is a minimum-distance fit to a vector in the plane perpendicular to the first PC
       - And so on

  17. Example 17

  18. Example [figure]

  19. Least Squares Error and Maximum Variance Views Are Equivalent (1-dim Interpretation)
     - Minimizing the sum of squared distances to the line is equivalent to maximizing the sum of squares of the projections onto that line (Pythagoras):
       - For each mean-removed data vector, red² + blue² = green², where green is the (fixed) length of the data vector, blue is its projection onto the line through the origin, and red is its distance to that line
       - green² is fixed ⇒ maximizing blue² is equivalent to minimizing red²

  20. First PC
     - The first PC is the direction of greatest variability in the data
     - We will show that the first PC is the eigenvector of the covariance matrix corresponding to the maximum eigenvalue of this matrix
     - If $\|\boldsymbol{v}\| = 1$, the projection of a $d$-dimensional $\boldsymbol{x}$ onto $\boldsymbol{v}$ is $\boldsymbol{v}^T \boldsymbol{x} = \|\boldsymbol{x}\| \cos\theta$, where $\theta$ is the angle between $\boldsymbol{x}$ and $\boldsymbol{v}$

  21. First PC
     - $\underset{\boldsymbol{v}}{\operatorname{argmax}} \; \frac{1}{N} \sum_{n=1}^{N} (\boldsymbol{v}^T \boldsymbol{x}^{(n)})^2 = \frac{1}{N} \boldsymbol{v}^T \boldsymbol{X}^T \boldsymbol{X} \boldsymbol{v}$, s.t. $\boldsymbol{v}^T \boldsymbol{v} = 1$
     - Setting the derivative of the Lagrangian $\frac{1}{N} \boldsymbol{v}^T \boldsymbol{X}^T \boldsymbol{X} \boldsymbol{v} + \lambda (1 - \boldsymbol{v}^T \boldsymbol{v})$ with respect to $\boldsymbol{v}$ to zero gives $\frac{1}{N} \boldsymbol{X}^T \boldsymbol{X} \boldsymbol{v} = \lambda \boldsymbol{v}$
       - So $\boldsymbol{v}$ is an eigenvector of the sample covariance matrix $\frac{1}{N} \boldsymbol{X}^T \boldsymbol{X}$
     - The eigenvalue $\lambda$ denotes the amount of variance along that dimension:
       Variance $= \frac{1}{N} \boldsymbol{v}^T \boldsymbol{X}^T \boldsymbol{X} \boldsymbol{v} = \boldsymbol{v}^T \lambda \boldsymbol{v} = \lambda$
     - So, if we seek the dimension with the largest variance, it will be the eigenvector corresponding to the largest eigenvalue of the sample covariance matrix
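A small sketch checking the conclusion above numerically: the variance of the data projected onto the top eigenvector of the sample covariance matrix equals the largest eigenvalue. The data below is a random placeholder:

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 3)) @ np.diag([3.0, 1.0, 0.5])
X -= X.mean(axis=0)

C = (X.T @ X) / X.shape[0]                  # sample covariance (1/N) X^T X
eigvals, eigvecs = np.linalg.eigh(C)
v = eigvecs[:, -1]                          # eigenvector with the largest eigenvalue

projected_variance = np.mean((X @ v) ** 2)  # (1/N) sum (v^T x)^2
print(np.isclose(projected_variance, eigvals[-1]))   # True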

  22. PCA: Uncorrelated Features
     - $\boldsymbol{x}' = \boldsymbol{A}^T \boldsymbol{x}$
     - $\boldsymbol{\Sigma}_{x'} = E[\boldsymbol{x}' \boldsymbol{x}'^T] = E[\boldsymbol{A}^T \boldsymbol{x} \boldsymbol{x}^T \boldsymbol{A}] = \boldsymbol{A}^T E[\boldsymbol{x} \boldsymbol{x}^T] \boldsymbol{A} = \boldsymbol{A}^T \boldsymbol{\Sigma}_x \boldsymbol{A}$
     - If $\boldsymbol{A} = [\boldsymbol{v}_1, \dots, \boldsymbol{v}_d]$, where $\boldsymbol{v}_1, \dots, \boldsymbol{v}_d$ are orthonormal eigenvectors of $\boldsymbol{\Sigma}_x$:
       $\boldsymbol{\Sigma}_{x'} = \boldsymbol{A}^T \boldsymbol{\Sigma}_x \boldsymbol{A} = \boldsymbol{A}^T \boldsymbol{A} \boldsymbol{\Lambda} \boldsymbol{A}^T \boldsymbol{A} = \boldsymbol{\Lambda}$
       $\Rightarrow E[x'_i x'_j] = 0$ for all $i \neq j$, $i, j = 1, \dots, d$
     - Then mutually uncorrelated features are obtained
       - Completely uncorrelated features avoid information redundancies
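A sketch of this decorrelation in NumPy: projecting onto all the orthonormal eigenvectors of the sample covariance yields features whose covariance matrix is (numerically) the diagonal matrix Lambda. The covariance used to generate the data is an arbitrary illustrative choice:

import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0, 0],
                            [[4.0, 1.5, 0.5],
                             [1.5, 2.0, 0.3],
                             [0.5, 0.3, 1.0]], size=5000)
X -= X.mean(axis=0)

C = (X.T @ X) / X.shape[0]
eigvals, A = np.linalg.eigh(C)              # A = [v_1, ..., v_d], orthonormal columns

X_new = X @ A                               # x' = A^T x for every data point
C_new = (X_new.T @ X_new) / X.shape[0]      # covariance of the new features

# Off-diagonal entries vanish: the projected features are mutually uncorrelated.
print(np.allclose(C_new, np.diag(eigvals), atol=1e-8))   # True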

  23. PCA Derivation: Mean Square Error Approximation
     - Incorporating all eigenvectors in $\boldsymbol{A} = [\boldsymbol{v}_1, \dots, \boldsymbol{v}_d]$:
       $\boldsymbol{x}' = \boldsymbol{A}^T \boldsymbol{x} \;\Rightarrow\; \boldsymbol{A}\boldsymbol{x}' = \boldsymbol{A}\boldsymbol{A}^T \boldsymbol{x} = \boldsymbol{x} \;\Rightarrow\; \boldsymbol{x} = \boldsymbol{A}\boldsymbol{x}'$
     - $\Longrightarrow$ If $d' = d$, then $\boldsymbol{x}$ can be reconstructed exactly from $\boldsymbol{x}'$
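A short sketch of this exact-reconstruction case: keeping all d eigenvectors makes A orthogonal, so A A^T = I and the data is recovered exactly from the projections. The data is again a random placeholder:

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 4))
X -= X.mean(axis=0)

C = (X.T @ X) / X.shape[0]
_, A = np.linalg.eigh(C)                    # all d eigenvectors, orthonormal columns

X_new = X @ A                               # x' = A^T x
X_reconstructed = X_new @ A.T               # x = A x'
print(np.allclose(X, X_reconstructed))      # True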
