

  1. PCA & ICA, CE-717: Machine Learning, Sharif University of Technology, Spring 2018, Soleymani

  2. Dimensionality Reduction: Feature Selection vs. Feature Extraction
  - Feature selection: select a subset of a given feature set
    $[y_1, \dots, y_d]^T \rightarrow [y_{i_1}, \dots, y_{i_{d'}}]^T$
  - Feature extraction: a linear or non-linear transform on the original feature space
    $[z_1, \dots, z_{d'}]^T = g\left([y_1, \dots, y_d]^T\right)$, with $d' < d$

  3. Feature Extraction
  - Mapping of the original data to another space
  - The criterion for feature extraction differs based on the problem setting:
    - Unsupervised task: minimize the information loss (reconstruction error)
    - Supervised task: maximize the class discrimination in the projected space
  - Feature extraction algorithms:
    - Linear methods
      - Unsupervised: e.g., Principal Component Analysis (PCA)
      - Supervised: e.g., Linear Discriminant Analysis (LDA), also known as Fisher's Discriminant Analysis (FDA)
    - Non-linear methods
      - Supervised: e.g., MLP neural networks
      - Unsupervised: e.g., autoencoders

  4. Feature Extraction
  - Unsupervised feature extraction: a mapping $g: \mathbb{R}^d \rightarrow \mathbb{R}^{d'}$ found from the data matrix alone:
    $\boldsymbol{Y} = \begin{bmatrix} y_1^{(1)} & \cdots & y_d^{(1)} \\ \vdots & \ddots & \vdots \\ y_1^{(N)} & \cdots & y_d^{(N)} \end{bmatrix} \;\xrightarrow{\text{Feature Extraction}}\; \boldsymbol{Y}' = \begin{bmatrix} {y'}_1^{(1)} & \cdots & {y'}_{d'}^{(1)} \\ \vdots & \ddots & \vdots \\ {y'}_1^{(N)} & \cdots & {y'}_{d'}^{(N)} \end{bmatrix}$
  - Supervised feature extraction: a mapping $g: \mathbb{R}^d \rightarrow \mathbb{R}^{d'}$ found from the data matrix $\boldsymbol{Y}$ together with the targets $\boldsymbol{Z} = \begin{bmatrix} z^{(1)} \\ \vdots \\ z^{(N)} \end{bmatrix}$, producing the transformed matrix $\boldsymbol{Y}'$ as above.

  5. Unsupervised Feature Reduction
  - Visualization and interpretation: projection of high-dimensional data onto 2D or 3D
  - Data compression: efficient storage, communication, and retrieval
  - Pre-processing: improve accuracy by reducing the number of features
    - As a preprocessing step to reduce dimensions for supervised learning tasks
    - Helps avoid overfitting
  - Noise removal
    - E.g., "noise" in images introduced by minor lighting variations, slightly different imaging conditions, etc.

  6. Linear Transformation
  - For a linear transformation, we find an explicit mapping $g(\boldsymbol{y}) = \boldsymbol{B}^T \boldsymbol{y}$ that can also transform new data vectors:
    original data $\boldsymbol{y} \in \mathbb{R}^d$, transformation matrix $\boldsymbol{B}^T \in \mathbb{R}^{d' \times d}$, reduced data $\boldsymbol{y}' = \boldsymbol{B}^T \boldsymbol{y} \in \mathbb{R}^{d'}$, with $d' < d$
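As a small illustration of this slide (not part of the original deck), the following numpy sketch applies a dimension-reducing linear map $\boldsymbol{y}' = \boldsymbol{B}^T \boldsymbol{y}$; the matrix B here is random and purely illustrative, standing in for whatever transform a method such as PCA would produce.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_prime = 5, 2                           # original and reduced dimensions (d' < d)
B = rng.standard_normal((d, d_prime))       # illustrative transform; columns define the mapping

y = rng.standard_normal(d)                  # one original data vector in R^d
y_prime = B.T @ y                           # reduced vector y' = B^T y, in R^d'

# Because the mapping is explicit, the same B transforms new, unseen data vectors too.
y_new = rng.standard_normal(d)
print(y_prime.shape, (B.T @ y_new).shape)   # (2,) (2,)
```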

  7. Linear Transformation
  - Linear transformations are simple mappings:
    $\boldsymbol{y}' = \boldsymbol{B}^T \boldsymbol{y}$, i.e., $y_k' = \boldsymbol{b}_k^T \boldsymbol{y}$ for $k = 1, \dots, d'$
    where $\boldsymbol{B} = [\boldsymbol{b}_1, \dots, \boldsymbol{b}_{d'}] = \begin{bmatrix} b_{11} & \cdots & b_{1d'} \\ \vdots & \ddots & \vdots \\ b_{d1} & \cdots & b_{dd'} \end{bmatrix}$

  8. Linear Dimensionality Reduction
  - Unsupervised:
    - Principal Component Analysis (PCA)
    - Independent Component Analysis (ICA)
    - Singular Value Decomposition (SVD)
    - Multidimensional Scaling (MDS)
    - Canonical Correlation Analysis (CCA)
    - ...

  9. Principal Component Analysis (PCA)
  - Also known as the Karhunen-Loève (KL) transform
  - Principal Components (PCs): orthogonal vectors ordered by the fraction of the total information (variation) in the corresponding directions
  - Find the directions along which the data approximately lie
  - When the data are projected onto the first PC, the variance of the projected data is maximized

  10. Principal Component Analysis (PCA)
  - The "best" linear subspace (i.e., the one giving the least reconstruction error of the data):
    - Find the mean-reduced (centered) data
    - The axes are rotated to new (principal) axes such that:
      - Principal axis 1 has the highest variance, ...
      - Principal axis i has the i-th highest variance
    - The principal axes are uncorrelated: the covariance between each pair of principal axes is zero
  - Goal: reduce the dimensionality of the data while preserving as much of the variation present in the dataset as possible
  - The PCs can be found as the "best" eigenvectors (those with the largest eigenvalues) of the covariance matrix of the data points

  11. Principal Components
  - If the data have a Gaussian distribution $\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, the direction of the largest variance can be found as the eigenvector of $\boldsymbol{\Sigma}$ that corresponds to the largest eigenvalue of $\boldsymbol{\Sigma}$
  [Figure: Gaussian data cloud with the principal directions $\boldsymbol{w}_1$ and $\boldsymbol{w}_2$ drawn on it]
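A quick numerical check of this claim (added here, not in the slides), assuming a hand-picked 2-D covariance matrix: the variance of samples projected onto the top eigenvector of $\boldsymbol{\Sigma}$ matches the largest eigenvalue, and exceeds the variance along an arbitrary axis.

```python
import numpy as np

rng = np.random.default_rng(0)

# A 2-D Gaussian N(mu, Sigma) with a clearly dominant variance direction (illustrative values).
mu = np.array([1.0, -2.0])
Sigma = np.array([[3.0, 1.2],
                  [1.2, 1.0]])

# Eigenvector of Sigma with the largest eigenvalue = direction of largest variance.
eigvals, eigvecs = np.linalg.eigh(Sigma)          # eigh returns eigenvalues in ascending order
w1 = eigvecs[:, -1]

# Empirical check: project samples onto w1 and onto an arbitrary coordinate axis.
Y = rng.multivariate_normal(mu, Sigma, size=100_000)
e1 = np.array([1.0, 0.0])
print(np.var(Y @ w1), eigvals[-1])                # projected variance ~ largest eigenvalue
print(np.var(Y @ e1), Sigma[0, 0])                # smaller variance along an arbitrary axis
```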

  12. Example: random direction [figure]

  13. Example: principal component [figure]

  14. Covariance Matrix
  $\boldsymbol{\mu}_{\boldsymbol{y}} = E[\boldsymbol{y}] = \begin{bmatrix} E(y_1) \\ \vdots \\ E(y_d) \end{bmatrix} = \begin{bmatrix} \mu_1 \\ \vdots \\ \mu_d \end{bmatrix}, \qquad \boldsymbol{\Sigma} = E\!\left[ (\boldsymbol{y} - \boldsymbol{\mu}_{\boldsymbol{y}}) (\boldsymbol{y} - \boldsymbol{\mu}_{\boldsymbol{y}})^T \right]$
  - ML estimate of the covariance matrix from data points $\{\boldsymbol{y}^{(i)}\}_{i=1}^{N}$:
    $\widehat{\boldsymbol{\Sigma}} = \frac{1}{N} \sum_{i=1}^{N} \left(\boldsymbol{y}^{(i)} - \widehat{\boldsymbol{\mu}}\right) \left(\boldsymbol{y}^{(i)} - \widehat{\boldsymbol{\mu}}\right)^T = \frac{1}{N} \widetilde{\boldsymbol{Y}}^T \widetilde{\boldsymbol{Y}}$
    where $\widetilde{\boldsymbol{Y}} = \begin{bmatrix} (\boldsymbol{y}^{(1)} - \widehat{\boldsymbol{\mu}})^T \\ \vdots \\ (\boldsymbol{y}^{(N)} - \widehat{\boldsymbol{\mu}})^T \end{bmatrix}$ is the mean-centered data matrix.
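A minimal numpy sketch of this estimate (added for illustration; the data here are random placeholders): center the data matrix and form $\frac{1}{N}\widetilde{\boldsymbol{Y}}^T \widetilde{\boldsymbol{Y}}$, then cross-check against numpy's own covariance routine with the same $1/N$ normalization.

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.standard_normal((200, 3))            # N=200 data points, d=3 features (rows are samples)

mu_hat = Y.mean(axis=0)                      # sample mean of each feature
Y_tilde = Y - mu_hat                         # mean-centered data matrix
Sigma_hat = (Y_tilde.T @ Y_tilde) / len(Y)   # ML estimate (1/N) * Y~^T Y~, shape (3, 3)

# Same estimate via numpy; bias=True gives the 1/N normalization used on the slide.
assert np.allclose(Sigma_hat, np.cov(Y, rowvar=False, bias=True))
```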

  15. PCA: Steps
  - Input: $N \times d$ data matrix $\boldsymbol{Y}$ (each row contains a $d$-dimensional data point)
  - $\widehat{\boldsymbol{\mu}} = \frac{1}{N} \sum_{i=1}^{N} \boldsymbol{y}^{(i)}$
  - $\widetilde{\boldsymbol{Y}} \leftarrow$ the mean value of the data points is subtracted from the rows of $\boldsymbol{Y}$
  - $\boldsymbol{\Sigma} = \frac{1}{N} \widetilde{\boldsymbol{Y}}^T \widetilde{\boldsymbol{Y}}$ (covariance matrix)
  - Calculate the eigenvalues and eigenvectors of $\boldsymbol{\Sigma}$
  - Pick the $d'$ eigenvectors corresponding to the largest eigenvalues and put them in the columns of $\boldsymbol{B} = [\boldsymbol{w}_1, \dots, \boldsymbol{w}_{d'}]$ (first PC through $d'$-th PC)
  - $\boldsymbol{Y}' = \widetilde{\boldsymbol{Y}} \boldsymbol{B}$
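The steps above translate almost line by line into numpy. The sketch below is an illustrative implementation (function and variable names are my own, not from the deck), exercised on random correlated data.

```python
import numpy as np

def pca(Y, d_prime):
    """PCA following the slide's steps: center, covariance, eigendecomposition, project."""
    mu = Y.mean(axis=0)                       # mean of the N data points
    Y_tilde = Y - mu                          # mean-centered data
    Sigma = (Y_tilde.T @ Y_tilde) / len(Y)    # covariance matrix, 1/N normalization
    eigvals, eigvecs = np.linalg.eigh(Sigma)  # symmetric matrix; ascending eigenvalues
    order = np.argsort(eigvals)[::-1]         # reorder so the largest eigenvalue comes first
    B = eigvecs[:, order[:d_prime]]           # columns of B = top d' principal components
    Y_prime = Y_tilde @ B                     # projected (reduced) data, shape (N, d')
    return Y_prime, B, mu, eigvals[order]

# Example on random correlated data.
rng = np.random.default_rng(0)
Y = rng.standard_normal((500, 4)) @ rng.standard_normal((4, 4))
Y_prime, B, mu, eigvals = pca(Y, d_prime=2)
print(Y_prime.shape, B.shape)                 # (500, 2) (4, 2)
```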

  16. Find Principal Components
  - Assume that the data are centered.
  - Find the vector $\boldsymbol{w}$ that maximizes the sample variance of the projected data:
    $\arg\max_{\boldsymbol{w}} \ \frac{1}{N} \sum_{j=1}^{N} \left( \boldsymbol{w}^T \boldsymbol{y}^{(j)} \right)^2 = \frac{1}{N} \boldsymbol{w}^T \boldsymbol{Y}^T \boldsymbol{Y} \boldsymbol{w} \quad \text{s.t.} \quad \boldsymbol{w}^T \boldsymbol{w} = 1$
  - Lagrangian: $L(\boldsymbol{w}, \lambda) = \boldsymbol{w}^T \boldsymbol{Y}^T \boldsymbol{Y} \boldsymbol{w} - \lambda \boldsymbol{w}^T \boldsymbol{w}$
    $\frac{\partial L}{\partial \boldsymbol{w}} = 0 \;\Rightarrow\; 2 \boldsymbol{Y}^T \boldsymbol{Y} \boldsymbol{w} - 2 \lambda \boldsymbol{w} = 0 \;\Rightarrow\; \boldsymbol{Y}^T \boldsymbol{Y} \boldsymbol{w} = \lambda \boldsymbol{w}$

  17. Find Principal Components
  - For symmetric matrices, there exist eigenvectors that are orthogonal.
  - Let $\boldsymbol{w}_1, \dots, \boldsymbol{w}_d$ denote the eigenvectors of $\boldsymbol{Y}^T \boldsymbol{Y}$, chosen such that:
    $\boldsymbol{w}_i^T \boldsymbol{w}_j = 0 \;\; \forall i \neq j, \qquad \boldsymbol{w}_i^T \boldsymbol{w}_i = 1 \;\; \forall i$

  18. Find Principal Components
  $\boldsymbol{Y}^T \boldsymbol{Y} \boldsymbol{w} = \lambda \boldsymbol{w} \;\Rightarrow\; \boldsymbol{w}^T \boldsymbol{Y}^T \boldsymbol{Y} \boldsymbol{w} = \lambda \boldsymbol{w}^T \boldsymbol{w} = \lambda$
  - $\lambda$ denotes the amount of variance along the found direction $\boldsymbol{w}$ (called the energy along that dimension).
  - Order the eigenvalues: $\lambda_1 \geq \lambda_2 \geq \lambda_3 \geq \cdots$
  - The first PC $\boldsymbol{w}_1$ is the eigenvector of the sample covariance matrix $\boldsymbol{Y}^T \boldsymbol{Y}$ associated with the largest eigenvalue.
  - The second PC $\boldsymbol{w}_2$ is the eigenvector of the sample covariance matrix $\boldsymbol{Y}^T \boldsymbol{Y}$ associated with the second largest eigenvalue.
  - And so on ...
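Since each eigenvalue is the variance ("energy") along its PC, the fraction of total variance captured by each component follows directly from the sorted eigenvalues. A short sketch (added for illustration, with arbitrary synthetic data) assuming numpy:

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.standard_normal((300, 4)) @ np.diag([3.0, 2.0, 1.0, 0.5])   # features with unequal scales
Y = Y - Y.mean(axis=0)                                              # center the data

eigvals = np.linalg.eigvalsh(Y.T @ Y / len(Y))[::-1]                # lambda_1 >= lambda_2 >= ...

# Each eigenvalue is the variance along its PC; their sum equals the total variance.
explained = eigvals / eigvals.sum()
print(np.round(explained, 3))                                       # variance fraction per PC
print(np.isclose(eigvals.sum(), Y.var(axis=0).sum()))               # True
```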

  19. Another Interpretation: Least Squares Error
  - PCs are linear least squares fits to the samples, each orthogonal to the previous PCs:
    - The first PC is a minimum-distance fit to a vector in the original feature space
    - The second PC is a minimum-distance fit to a vector in the plane perpendicular to the first PC
    - ...

  20. Least Squares Error and Maximum Variance Views Are Equivalent (1-D Interpretation)
  - When the data are mean-removed:
    - Minimizing the sum of squared distances to the line is equivalent to maximizing the sum of squares of the projections onto that line (Pythagoras).
  [Figure: a data point, its projection onto a line through the origin, and the orthogonal residual]

  21. Two Interpretations
  - Maximum-variance subspace:
    $\arg\max_{\boldsymbol{w}} \ \sum_{j=1}^{N} \left( \boldsymbol{w}^T \boldsymbol{y}^{(j)} \right)^2 = \boldsymbol{w}^T \boldsymbol{Y}^T \boldsymbol{Y} \boldsymbol{w}$
  - Minimum reconstruction error:
    $\arg\min_{\boldsymbol{w}} \ \sum_{j=1}^{N} \left\| \boldsymbol{y}^{(j)} - \left(\boldsymbol{w}^T \boldsymbol{y}^{(j)}\right) \boldsymbol{w} \right\|^2$
  - For each point (with $\|\boldsymbol{w}\| = 1$): blue² + red² = green², where green (the distance of $\boldsymbol{y}$ from the origin) is fixed by the data, red is the projection $\boldsymbol{w}^T \boldsymbol{y}$, and blue is the residual. So maximizing red² is equivalent to minimizing blue².
  [Figure: a point $\boldsymbol{y}$, its projection $\boldsymbol{w}^T \boldsymbol{y}$ onto the line through the origin along $\boldsymbol{w}$, and the residual]
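A numerical check of this equivalence (my addition, with synthetic data): for any unit vector, the two objectives sum to the same constant (the total squared norm of the data), so the first PC simultaneously maximizes projected variance and minimizes reconstruction error.

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.standard_normal((200, 3)) @ rng.standard_normal((3, 3))
Y = Y - Y.mean(axis=0)                        # centered data, rows are samples

def proj_var(w):                              # sum of squared projections (max-variance objective)
    return np.sum((Y @ w) ** 2)

def recon_err(w):                             # sum of squared residuals (min-reconstruction objective)
    return np.sum((Y - np.outer(Y @ w, w)) ** 2)

eigvals, eigvecs = np.linalg.eigh(Y.T @ Y)
w_top = eigvecs[:, -1]                        # first PC (largest eigenvalue)
w_rand = rng.standard_normal(3)
w_rand /= np.linalg.norm(w_rand)              # arbitrary unit direction for comparison

print(proj_var(w_top) > proj_var(w_rand), recon_err(w_top) < recon_err(w_rand))   # True True
print(np.isclose(proj_var(w_top) + recon_err(w_top),
                 proj_var(w_rand) + recon_err(w_rand)))                           # True (Pythagoras)
```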

  22. PCA: Uncorrelated Features
  $\boldsymbol{y}' = \boldsymbol{B}^T \boldsymbol{y}$
  $\boldsymbol{S}_{\boldsymbol{y}'} = E\!\left[ \boldsymbol{y}' \boldsymbol{y}'^T \right] = E\!\left[ \boldsymbol{B}^T \boldsymbol{y} \boldsymbol{y}^T \boldsymbol{B} \right] = \boldsymbol{B}^T E\!\left[ \boldsymbol{y} \boldsymbol{y}^T \right] \boldsymbol{B} = \boldsymbol{B}^T \boldsymbol{S}_{\boldsymbol{y}} \boldsymbol{B}$
  - If $\boldsymbol{B} = [\boldsymbol{b}_1, \dots, \boldsymbol{b}_d]$, where $\boldsymbol{b}_1, \dots, \boldsymbol{b}_d$ are orthonormal eigenvectors of $\boldsymbol{S}_{\boldsymbol{y}}$:
    $\boldsymbol{S}_{\boldsymbol{y}'} = \boldsymbol{B}^T \boldsymbol{S}_{\boldsymbol{y}} \boldsymbol{B} = \boldsymbol{B}^T \boldsymbol{B} \boldsymbol{\Lambda} \boldsymbol{B}^T \boldsymbol{B} = \boldsymbol{\Lambda} \;\Rightarrow\; E\!\left[ y_j' y_k' \right] = 0, \;\; \forall j \neq k, \;\; j, k = 1, \dots, d$
  - Then mutually uncorrelated features are obtained
  - Completely uncorrelated features avoid information redundancies
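The decorrelation property is easy to verify numerically. A minimal sketch (added here, with random correlated data): the empirical covariance of the transformed features is diagonal and equals $\boldsymbol{\Lambda}$.

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.standard_normal((1000, 3)) @ rng.standard_normal((3, 3))   # correlated features
Y = Y - Y.mean(axis=0)

S_y = Y.T @ Y / len(Y)                        # covariance of the original features
eigvals, B = np.linalg.eigh(S_y)              # B: orthonormal eigenvectors in its columns

Y_prime = Y @ B                               # transformed features y' = B^T y (applied row-wise)
S_y_prime = Y_prime.T @ Y_prime / len(Y)      # covariance of the transformed features

# Off-diagonal entries vanish: the new features are mutually uncorrelated,
# and the diagonal equals the eigenvalues Lambda.
print(np.allclose(S_y_prime, np.diag(eigvals)))   # True (up to floating point)
```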

  23. Reconstruction
  $\boldsymbol{y}' = \boldsymbol{B}^T \boldsymbol{y} = \begin{bmatrix} \boldsymbol{w}_1^T \boldsymbol{y} \\ \vdots \\ \boldsymbol{w}_{d'}^T \boldsymbol{y} \end{bmatrix}, \qquad \boldsymbol{B} = [\boldsymbol{w}_1, \dots, \boldsymbol{w}_{d'}]$
  - Incorporating all eigenvectors in $\boldsymbol{B} = [\boldsymbol{w}_1, \dots, \boldsymbol{w}_d]$, i.e., $d' = d$:
    $\boldsymbol{B} \boldsymbol{y}' = \boldsymbol{B} \boldsymbol{B}^T \boldsymbol{y} = \boldsymbol{y} \;\Rightarrow\; \boldsymbol{y} = \boldsymbol{B} \boldsymbol{y}'$, so $\boldsymbol{y}$ can be reconstructed exactly from $\boldsymbol{y}'$.
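A short numerical illustration (not from the slides, data synthetic): with all $d$ eigenvectors the reconstruction $\boldsymbol{B}\boldsymbol{B}^T\boldsymbol{y}$ is exact, while keeping only the top $d' < d$ components leaves a small residual error.

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.standard_normal((500, 4)) @ rng.standard_normal((4, 4))
Y_tilde = Y - Y.mean(axis=0)                      # centered data

eigvals, W = np.linalg.eigh(Y_tilde.T @ Y_tilde / len(Y))
W = W[:, ::-1]                                    # columns = PCs, largest eigenvalue first

# d' = d: using all eigenvectors, B B^T = I and reconstruction is exact.
B_full = W
print(np.allclose(Y_tilde @ B_full @ B_full.T, Y_tilde))   # True

# d' < d: keep only the top 2 PCs; reconstruction is approximate.
B_2 = W[:, :2]
approx = Y_tilde @ B_2 @ B_2.T
print(np.mean((Y_tilde - approx) ** 2))           # small but nonzero reconstruction error
```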

  24. PCA Derivation: Relation between Eigenvalues and Variances
  - The $k$-th largest eigenvalue of $\boldsymbol{S}_{\boldsymbol{y}}$ is the variance along the $k$-th PC:
    $\operatorname{var}\!\left( y_k' \right) = \boldsymbol{w}_k^T \boldsymbol{S}_{\boldsymbol{y}} \boldsymbol{w}_k = \lambda_k$
