

  1. Dimensionality Reduction: Linear Discriminant Analysis and Principal Component Analysis CMSC 678 UMBC

  2. Outline Linear Algebra/Math Review Two Methods of Dimensionality Reduction Linear Discriminant Analysis (LDA, LDiscA) Principal Component Analysis (PCA)

  3. Covariance: how (linearly) correlated are two variables? $\sigma_{ij} = \frac{1}{N-1} \sum_{k=1}^{N} (x_{ki} - \mu_i)(x_{kj} - \mu_j)$, where $\mu_i$ and $\mu_j$ are the means of variables $i$ and $j$, $x_{ki}$ is the value of variable $i$ in object $k$, and $\sigma_{ij}$ is the covariance of variables $i$ and $j$.

  4. Covariance: how (linearly) correlated are two variables? Collecting every pairwise covariance $\sigma_{ij} = \frac{1}{N-1} \sum_{k=1}^{N} (x_{ki} - \mu_i)(x_{kj} - \mu_j)$ gives the covariance matrix $\Sigma = \begin{pmatrix} \sigma_{11} & \cdots & \sigma_{1K} \\ \vdots & \ddots & \vdots \\ \sigma_{K1} & \cdots & \sigma_{KK} \end{pmatrix}$, with $\sigma_{ij} = \sigma_{ji}$ (the matrix is symmetric).
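
A minimal NumPy sketch of this estimator (an illustration, not course code): it fills in each $\sigma_{ij}$ with the formula above and checks the result against np.cov, using the same toy points that appear on the later "Summarizing Redundant Information" slides.

```python
import numpy as np

# Toy data: N = 4 objects (rows), K = 2 variables (columns).
X = np.array([[4.0, 2.0], [2.0, 1.0], [2.0, -1.0], [-2.0, -1.0]])
N, K = X.shape

mu = X.mean(axis=0)                      # per-variable means mu_i
S = np.zeros((K, K))
for i in range(K):
    for j in range(K):
        # sigma_ij = 1/(N-1) * sum_k (x_ki - mu_i)(x_kj - mu_j)
        S[i, j] = np.sum((X[:, i] - mu[i]) * (X[:, j] - mu[j])) / (N - 1)

assert np.allclose(S, S.T)                        # covariance matrices are symmetric
assert np.allclose(S, np.cov(X, rowvar=False))    # matches NumPy's built-in estimator
print(S)
```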

  5. Eigenvalues and Eigenvectors: $A x = \lambda x$, where $A$ is a matrix, $x$ a non-zero vector, and $\lambda$ a scalar. For a given matrix operation (multiplication): which non-zero vector(s) change only by a single scalar multiplication?

  6. Eigenvalues and Eigenvectors: $A x = \lambda x$, with $A = \begin{pmatrix} 1 & 5 \\ 0 & 1 \end{pmatrix}$.

  7. Eigenvalues and Eigenvectors: $A x = \lambda x$, with $A = \begin{pmatrix} 1 & 5 \\ 0 & 1 \end{pmatrix}$. Multiplying a generic vector: $\begin{pmatrix} 1 & 5 \\ 0 & 1 \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} x + 5y \\ y \end{pmatrix} \overset{?}{=} \lambda \begin{pmatrix} x \\ y \end{pmatrix}$.

  8. Eigenvalues and Eigenvectors: $\begin{pmatrix} 1 & 5 \\ 0 & 1 \end{pmatrix}\begin{pmatrix} 1 \\ 0 \end{pmatrix} = \begin{pmatrix} 1 \\ 0 \end{pmatrix} = 1 \cdot \begin{pmatrix} 1 \\ 0 \end{pmatrix}$. Up to scaling, $(1, 0)$ is the only non-zero vector that this matrix simply scales (eigenvalue $\lambda = 1$).
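
A quick NumPy check of this example (illustrative only): multiply $A$ against a couple of vectors and confirm with np.linalg.eig that $(1, 0)$ is the only direction that gets merely scaled.

```python
import numpy as np

A = np.array([[1.0, 5.0], [0.0, 1.0]])

x = np.array([1.0, 0.0])
print(A @ x)             # [1. 0.] -> same direction, so an eigenvector with lambda = 1

v = np.array([0.0, 1.0])
print(A @ v)             # [5. 1.] -> direction changes, so not an eigenvector

# The eigenvalue 1 is repeated; numerically both returned eigenvectors lie along (1, 0).
vals, vecs = np.linalg.eig(A)
print(vals)              # [1. 1.]
print(vecs)
```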

  9. Outline Linear Algebra/Math Review Two Methods of Dimensionality Reduction Linear Discriminant Analysis (LDA, LDiscA) Principal Component Analysis (PCA)

  10. Dimensionality Reduction: map $N$ instances of the original (lightly preprocessed) data with $D$ input features to a compressed representation with $L$ reduced features.

  11. Dimensionality Reduction: clarity of representation vs. ease of understanding; oversimplification risks losing important or relevant information. Courtesy of Antanas Žilinskas

  12. Why "maximize" the variance? How can we efficiently summarize? We maximize the variance within our summarization; we don't increase the variance in the dataset. How can we capture the most information with the fewest axes?

  13. Summarizing Redundant Information: points (4,2), (2,1), (2,-1), (-2,-1).

  14. Summarizing Redundant Information: points (4,2), (2,1), (2,-1), (-2,-1). In the standard basis, (2,1) = 2*(1,0) + 1*(0,1).

  15. Summarizing Redundant Information: choose the basis u1 = (2,1), u2 = (2,-1). Then (2,1) = 1*u1 + 0*u2, (4,2) = 2*u1 + 0*u2 (= 2u1), and (-2,-1) = -u1.

  16. Summarizing Redundant Information: with the basis u1 = (2,1), u2 = (2,-1), we get (2,1) = 1*u1 + 0*u2 and (4,2) = 2*u1 + 0*u2. (Is this the most general choice? These vectors aren't orthogonal.)
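
A small sketch of the change of basis described above. Because u1 and u2 are not orthogonal, the coefficients come from solving a linear system rather than from simple dot products (plain NumPy, illustrative values only).

```python
import numpy as np

# Columns are the (non-orthogonal) basis vectors u1 = (2, 1) and u2 = (2, -1).
U = np.array([[2.0, 2.0],
              [1.0, -1.0]])

points = np.array([[2.0, 1.0], [4.0, 2.0], [-2.0, -1.0], [2.0, -1.0]])
for p in points:
    # Solve U @ coeffs = p, i.e. express p as coeffs[0]*u1 + coeffs[1]*u2.
    coeffs = np.linalg.solve(U, p)
    print(p, "->", coeffs)
# (2,1) -> [1, 0], (4,2) -> [2, 0], (-2,-1) -> [-1, 0], (2,-1) -> [0, 1]
```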

  17. Outline Linear Algebra/Math Review Two Methods of Dimensionality Reduction Linear Discriminant Analysis (LDA, LDiscA) Principal Component Analysis (PCA)

  18. Linear Discriminant Analysis (LDA, LDiscA) and Principal Component Analysis (PCA): summarize D-dimensional input data by uncorrelated axes. Uncorrelated axes are also called principal components. Use the first L components to account for as much variance as possible.

  19. Geometric Rationale of LDiscA & PCA. Objective: rigidly rotate the axes of the D-dimensional space to new positions (principal axes), ordered such that principal axis 1 has the highest variance, axis 2 the next highest, ..., and axis D the lowest variance; the covariance among each pair of principal axes is zero (the principal axes are uncorrelated). Courtesy of Antanas Žilinskas

  20. Remember: MAP Classifiers are Optimal for Classification. Minimizing expected 0/1 loss is equivalent to maximizing the posterior probability of the correct label: $\min \sum_i \mathbb{E}\left[\ell_{0/1}(y_i, \hat{y}_i)\right] \rightarrow \max \prod_i p(\hat{y}_i = y_i \mid x_i)$, where by Bayes' rule $p(\hat{y}_i = y_i \mid x_i) \propto p(x_i \mid \hat{y}_i)\, p(\hat{y}_i)$ (posterior $\propto$ class-conditional likelihood $\times$ class prior), with $x_i \in \mathbb{R}^D$.

  21. Linear Discriminant Analysis: a MAP classifier where (1) the class-conditional likelihoods are Gaussian, and (2) the classes share a common covariance.

  22. LDiscA: (1) What if likelihoods are Gaussian? $p(\hat{y}_i = y_i \mid x_i) \propto p(x_i \mid \hat{y}_i)\, p(\hat{y}_i)$, with class-conditional likelihood $p(x_i \mid k) = \mathcal{N}(\mu_k, \Sigma_k) = \frac{\exp\left(-\frac{1}{2}(x_i - \mu_k)^T \Sigma_k^{-1} (x_i - \mu_k)\right)}{(2\pi)^{D/2}\, |\Sigma_k|^{1/2}}$. https://upload.wikimedia.org/wikipedia/commons/5/57/Multivariate_Gaussian.png
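
As a concrete sketch, the density can be evaluated exactly as written on the slide; gaussian_pdf is a hypothetical helper name, and in practice scipy.stats.multivariate_normal does the same job.

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Multivariate normal density, following the slide's formula."""
    D = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (D / 2) * np.linalg.det(Sigma) ** 0.5
    return np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) / norm

# Made-up mean and covariance for illustration.
mu_k = np.array([0.0, 0.0])
Sigma_k = np.array([[1.0, 0.3], [0.3, 1.0]])
print(gaussian_pdf(np.array([0.5, -0.2]), mu_k, Sigma_k))
```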

  23. LDiscA: (2) Shared Covariance. Compare classes $k$ and $l$ through the log posterior ratio: $\log \frac{p(\hat{y}_i = k \mid x_i)}{p(\hat{y}_i = l \mid x_i)} = \log \frac{p(x_i \mid k)}{p(x_i \mid l)} + \log \frac{p(k)}{p(l)}$.

  24. LDiscA: (2) Shared Covariance. Plugging in the Gaussian likelihoods: $\log \frac{p(\hat{y}_i = k \mid x_i)}{p(\hat{y}_i = l \mid x_i)} = \log \frac{p(k)}{p(l)} + \log \frac{\exp\left(-\frac{1}{2}(x_i - \mu_k)^T \Sigma_k^{-1}(x_i - \mu_k)\right) \big/ \left((2\pi)^{D/2}|\Sigma_k|^{1/2}\right)}{\exp\left(-\frac{1}{2}(x_i - \mu_l)^T \Sigma_l^{-1}(x_i - \mu_l)\right) \big/ \left((2\pi)^{D/2}|\Sigma_l|^{1/2}\right)}$.

  25. LDiscA: (2) Shared Covariance. Now assume a shared covariance, $\Sigma_l = \Sigma_k = \Sigma$: $\log \frac{p(\hat{y}_i = k \mid x_i)}{p(\hat{y}_i = l \mid x_i)} = \log \frac{p(k)}{p(l)} + \log \frac{\exp\left(-\frac{1}{2}(x_i - \mu_k)^T \Sigma^{-1}(x_i - \mu_k)\right)}{\exp\left(-\frac{1}{2}(x_i - \mu_l)^T \Sigma^{-1}(x_i - \mu_l)\right)}$ (the normalizing constants cancel).

  26. LDiscA: (2) Shared Covariance. With shared $\Sigma$, the log ratio simplifies to $\log \frac{p(\hat{y}_i = k \mid x_i)}{p(\hat{y}_i = l \mid x_i)} = \log \frac{p(k)}{p(l)} - \frac{1}{2}(\mu_k + \mu_l)^T \Sigma^{-1} (\mu_k - \mu_l) + x_i^T \Sigma^{-1} (\mu_k - \mu_l)$, which is linear in $x_i$ (check for yourself: why did the quadratic $x_i$ terms cancel?).

  27. LDiscA: (2) Shared Covariance. Grouping the terms by class: $\log \frac{p(\hat{y}_i = k \mid x_i)}{p(\hat{y}_i = l \mid x_i)} = \left( x_i^T \Sigma^{-1} \mu_k - \frac{1}{2}\mu_k^T \Sigma^{-1} \mu_k + \log p(k) \right) - \left( x_i^T \Sigma^{-1} \mu_l - \frac{1}{2}\mu_l^T \Sigma^{-1} \mu_l + \log p(l) \right)$, still linear in $x_i$, and now rewritten only in terms of $x_i$ (the data) and single-class terms (check for yourself: why did the quadratic $x_i$ terms cancel?).

  28. Classify via Linear Discriminant Functions: $\delta_k(x_i) = x_i^T \Sigma^{-1} \mu_k - \frac{1}{2}\mu_k^T \Sigma^{-1} \mu_k + \log p(k)$; choosing $\arg\max_k \delta_k(x_i)$ is equivalent to the MAP classifier.
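
A minimal sketch of classification with these discriminant functions; lda_discriminants is an illustrative helper name, and the means, covariance, and priors below are made-up toy values.

```python
import numpy as np

def lda_discriminants(x, mus, Sigma, priors):
    """delta_k(x) = x^T Sigma^{-1} mu_k - 1/2 mu_k^T Sigma^{-1} mu_k + log p(k)."""
    Sigma_inv = np.linalg.inv(Sigma)
    return np.array([x @ Sigma_inv @ mu - 0.5 * mu @ Sigma_inv @ mu + np.log(p)
                     for mu, p in zip(mus, priors)])

# Toy two-class problem.
mus = np.array([[0.0, 0.0], [2.0, 2.0]])
Sigma = np.eye(2)
priors = np.array([0.5, 0.5])

x = np.array([1.8, 1.5])
deltas = lda_discriminants(x, mus, Sigma, priors)
print(deltas, "-> predicted class", np.argmax(deltas))   # arg max_k delta_k(x)
```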

  29. LDiscA. Parameters to learn: priors $\{p(k)\}_k$, class means $\{\mu_k\}_k$, and the shared covariance $\Sigma$. The prior is $p(k) \propto N_k$, the number of items labeled with class $k$.

  30. LDiscA. Parameters to learn: $\{p(k)\}_k$, $\{\mu_k\}_k$, $\Sigma$. Priors: $p(k) \propto N_k$. Class means: $\mu_k = \frac{1}{N_k} \sum_{i : y_i = k} x_i$.

  31. LDiscA. Parameters to learn: $\{p(k)\}_k$, $\{\mu_k\}_k$, $\Sigma$. Priors: $p(k) \propto N_k$; class means: $\mu_k = \frac{1}{N_k} \sum_{i : y_i = k} x_i$; one option for $\Sigma$ is the within-class (pooled) covariance $\Sigma = \frac{1}{N - K} \sum_k \sum_{i : y_i = k} (x_i - \mu_k)(x_i - \mu_k)^T = \frac{1}{N - K} \sum_k \mathrm{scatter}_k$.
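
A sketch of these estimators, assuming a data matrix X of shape (N, D) and integer labels y; fit_ldisca is an illustrative name, not part of the course code.

```python
import numpy as np

def fit_ldisca(X, y):
    """Estimate priors p(k), class means mu_k, and the pooled covariance Sigma."""
    classes = np.unique(y)
    N, D = X.shape
    K = len(classes)
    priors = np.array([np.mean(y == k) for k in classes])      # p(k) = N_k / N
    mus = np.array([X[y == k].mean(axis=0) for k in classes])  # mu_k
    Sigma = np.zeros((D, D))
    for mu, k in zip(mus, classes):
        diff = X[y == k] - mu
        Sigma += diff.T @ diff                                  # scatter_k
    Sigma /= (N - K)                                            # pooled covariance
    return priors, mus, Sigma
```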

  32. Computational Steps for Full-Dimensional LDiscA: 1. Compute means, priors, and covariance.

  33. Computational Steps for Full-Dimensional LDiscA: 1. Compute means, priors, and covariance. 2. Diagonalize the covariance: $\Sigma = U D U^T$ (eigendecomposition), where $U$ is a K x K orthonormal matrix of eigenvectors and $D$ is the diagonal matrix of eigenvalues.

  34. Computational Steps for Full-Dimensional LDiscA: 1. Compute means, priors, and covariance. 2. Diagonalize the covariance: $\Sigma = U D U^T$. 3. Sphere the data: $X^* = D^{-\frac{1}{2}} U^T X$.

  35. Computational Steps for Full-Dimensional LDiscA: 1. Compute means, priors, and covariance. 2. Diagonalize the covariance: $\Sigma = U D U^T$. 3. Sphere the data (get unit covariance): $X^* = D^{-\frac{1}{2}} U^T X$. 4. Classify according to the linear discriminant functions $\delta_k(x_i^*)$.
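
Putting the four steps together, a sketch that reuses the fit_ldisca and lda_discriminants helpers sketched earlier (illustrative code under those assumptions); since the covariance is symmetric, np.linalg.eigh handles the diagonalization.

```python
import numpy as np

def sphere(Sigma):
    """Steps 2-3 helper: Sigma = U D U^T, whitening map W = D^(-1/2) U^T."""
    eigvals, U = np.linalg.eigh(Sigma)
    return np.diag(eigvals ** -0.5) @ U.T

def ldisca_predict(X, y, X_new):
    priors, mus, Sigma = fit_ldisca(X, y)         # step 1: means, priors, covariance
    W = sphere(Sigma)                             # steps 2-3: diagonalize and sphere
    mus_s = mus @ W.T                             # class means in the sphered space
    preds = []
    for x_s in X_new @ W.T:                       # sphere the new points too
        deltas = lda_discriminants(x_s, mus_s, np.eye(X.shape[1]), priors)  # step 4
        preds.append(int(np.argmax(deltas)))      # index into np.unique(y)
    return preds
```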

  36. Two Extensions to LDiscA. Quadratic Discriminant Analysis (QDA): keep a separate covariance per class, $\delta_k(x_i) = -\frac{1}{2}(x_i - \mu_k)^T \Sigma_k^{-1} (x_i - \mu_k) - \frac{1}{2}\log|\Sigma_k| + \log p(k)$.

  37. Two Extensions to LDiscA. Quadratic Discriminant Analysis (QDA): keep a separate covariance per class, $\delta_k(x_i) = -\frac{1}{2}(x_i - \mu_k)^T \Sigma_k^{-1} (x_i - \mu_k) - \frac{1}{2}\log|\Sigma_k| + \log p(k)$. Regularized LDiscA: interpolate between the shared covariance estimate (LDiscA) and the class-specific estimate (QDA), $\Sigma_k(\alpha) = \alpha \Sigma_k + (1 - \alpha) \Sigma$.
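
A hedged sketch of both extensions (function names are illustrative): the per-class QDA discriminant as written above, and the regularized interpolation between the class-specific and shared covariance estimates.

```python
import numpy as np

def qda_discriminant(x, mu_k, Sigma_k, prior_k):
    """delta_k(x) = -1/2 (x - mu_k)^T Sigma_k^{-1} (x - mu_k) - 1/2 log|Sigma_k| + log p(k)."""
    diff = x - mu_k
    return (-0.5 * diff @ np.linalg.inv(Sigma_k) @ diff
            - 0.5 * np.log(np.linalg.det(Sigma_k))
            + np.log(prior_k))

def regularized_covariance(Sigma_k, Sigma_shared, alpha):
    """Sigma_k(alpha) = alpha*Sigma_k + (1-alpha)*Sigma: alpha=1 gives QDA, alpha=0 gives LDiscA."""
    return alpha * Sigma_k + (1.0 - alpha) * Sigma_shared
```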

  38. Vowel Classification: LDiscA (left) vs. QDA (right). (Figure: ESL 4.3)

  39. Vowel Classification: LDiscA (left) vs. QDA (right), and Regularized LDiscA with $\Sigma_k(\alpha) = \alpha \Sigma_k + (1 - \alpha) \Sigma$. (Figure: ESL 4.3)

