
Machine Learning 2, DS 4420 (Spring 2020): Dimensionality Reduction I



  1. Machine Learning 2 DS 4420 - Spring 2020 Dimensionality reduction I Byron C Wallace

  2. Machine Learning 2 DS 4420 - Spring 2020. Some slides today borrowed from Percy Liang (Stanford); other material is from the MML book (Deisenroth, Faisal, and Ong).

  3. Motivation • We often want to work with high-dimensional data (e.g., images), and we often have lots of it. • Such data is computationally expensive to store and work with.

  4. Dimensionality Reduction. Fundamental idea: exploit redundancy in the data; find a lower-dimensional representation. [Figure: two scatter plots of the data over axes $x_1$ and $x_2$, each spanning roughly −5 to 5.]

  5. Example (from lecture 5): Dimensionality reduction via k-means

  6. Example (from lecture 5): Dimensionality reduction via k-means. This highlights the natural connection between dimensionality reduction and compression.
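
An illustrative sketch (not from the slides) of k-means as a compressor, using scikit-learn's KMeans on synthetic data: each point is stored as the integer index of its nearest centroid, and "decompression" replaces the index with that centroid.

```python
# k-means as compression / vector quantization (illustrative sketch).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 64))       # 1000 synthetic points in 64 dims

k = 16
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)

codes = km.predict(X)                 # each point -> one integer in [0, k)
X_hat = km.cluster_centers_[codes]    # lossy reconstruction from the codebook

print("reconstruction MSE:", np.mean((X - X_hat) ** 2))
```

Storing one integer per point plus the 16-row codebook is far cheaper than storing 64 floats per point, which is exactly the compression view above.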

  7. Dimensionality reduction. Goal: map high-dimensional data onto lower-dimensional data in a manner that preserves distances/similarities. [Figure: original data (4 dims) vs. projection with PCA (2 dims).] Objective: the projection should “preserve” relative distances.

  8. Linear dimensionality reduction. Idea: project a high-dimensional vector $x \in \mathbb{R}^{361}$ onto a lower-dimensional space: $z = U^\top x$, with $z \in \mathbb{R}^{10}$.

  9. Linear dimensionality reduction. Original $x \in \mathbb{R}^D$ → compressed $z \in \mathbb{R}^M$ → reconstructed $\tilde{x} \in \mathbb{R}^D$.
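
A minimal NumPy sketch (not from the slides) of these two maps; here $U$ is a random orthonormal basis rather than one learned by PCA, purely to show the mechanics:

```python
# Compress x in R^361 to z in R^10, then reconstruct (illustrative sketch).
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=361)                         # a "data point" in R^361
U, _ = np.linalg.qr(rng.normal(size=(361, 10)))  # random 361 x 10 orthonormal basis
z = U.T @ x                                      # compressed: z in R^10
x_tilde = U @ z                                  # reconstructed: x~ in R^361
print(np.linalg.norm(x - x_tilde))               # reconstruction error
```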

  10. Objective. Key intuition: $\underbrace{\text{variance of data}}_{\text{fixed}} = \underbrace{\text{captured variance}}_{\text{want large}} + \underbrace{\text{reconstruction error}}_{\text{want small}}$
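
Spelled out in the eigenvalue notation introduced on the next slides (an added clarification, not on the slide itself), projecting onto the top-$k$ eigenvectors splits the total variance as

$$\underbrace{\sum_{i=1}^{d} \lambda_i}_{\text{variance of data (fixed)}} = \underbrace{\sum_{i=1}^{k} \lambda_i}_{\text{captured variance (want large)}} + \underbrace{\sum_{i=k+1}^{d} \lambda_i}_{\text{reconstruction error (want small)}}$$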

  11. Principal Component Analysis (on board)

  12. In Sum: Principal Component Analysis. Data $X = (x_1 \cdots x_n) \in \mathbb{R}^{d \times n}$; orthonormal basis $U = (u_1 \cdots u_k) \in \mathbb{R}^{d \times k}$, whose columns are eigenvectors of the covariance matrix. The eigendecomposition gives $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_d)$. Idea: take the top-$k$ eigenvectors to maximize captured variance.
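
A minimal NumPy implementation of this recipe (a sketch, not from the slides; `pca_eig` is a name invented here), keeping the slides' convention that data points are the columns of $X$:

```python
# PCA via eigendecomposition of the covariance matrix (sketch).
import numpy as np

def pca_eig(X, k):
    """X is d x n with one data point per column; returns the top-k basis."""
    X = X - X.mean(axis=1, keepdims=True)   # center each feature
    N = X.shape[1]
    S = (X @ X.T) / N                       # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)    # eigh: ascending eigenvalues
    order = np.argsort(eigvals)[::-1]       # reorder to descending
    U = eigvecs[:, order[:k]]               # top-k eigenvectors (d x k)
    Z = U.T @ X                             # compressed data (k x n)
    return U, Z, eigvals[order]             # basis, codes, full spectrum
```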

  13. Getting the eigenvalues, two ways • Direct eigenvalue decomposition of the covariance matrix $S = \frac{1}{N} \sum_{n=1}^{N} x_n x_n^\top = \frac{1}{N} X X^\top$

  14. Getting the eigenvalues, two ways • Direct eigenvalue decomposition of the covariance matrix $S = \frac{1}{N} \sum_{n=1}^{N} x_n x_n^\top = \frac{1}{N} X X^\top$ • Singular Value Decomposition (SVD)

  15. Singular Value Decomposition. Idea: decompose the $d \times n$ matrix $X$ into 1. an $n \times n$ basis $V$ (unitary matrix), 2. a $d \times n$ matrix $\Sigma$ (diagonal projection), 3. a $d \times d$ basis $U$ (unitary matrix): $X = \underbrace{U}_{d \times d}\, \underbrace{\Sigma}_{d \times n}\, \underbrace{V^\top}_{n \times n}$

  16. SVD for PCA. $\underbrace{X}_{D \times N} = \underbrace{U}_{D \times D}\, \underbrace{\Sigma}_{D \times N}\, \underbrace{V^\top}_{N \times N}$, so $S = \frac{1}{N} X X^\top = \frac{1}{N} U \Sigma \underbrace{V^\top V}_{= I_N} \Sigma^\top U^\top = \frac{1}{N} U \Sigma \Sigma^\top U^\top$

  17. SVD for PCA. $\underbrace{X}_{D \times N} = \underbrace{U}_{D \times D}\, \underbrace{\Sigma}_{D \times N}\, \underbrace{V^\top}_{N \times N}$, so $S = \frac{1}{N} X X^\top = \frac{1}{N} U \Sigma \underbrace{V^\top V}_{= I_N} \Sigma^\top U^\top = \frac{1}{N} U \Sigma \Sigma^\top U^\top$. It turns out the columns of $U$ are the eigenvectors of $X X^\top$.
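
A quick numerical check of this equivalence (a sketch, not from the slides): on centered data, the left singular vectors of $X$ match the eigenvectors of $\frac{1}{N} X X^\top$ up to sign, with $\lambda_i = \sigma_i^2 / N$.

```python
# Verifying that SVD and covariance eigendecomposition agree (sketch).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 200))               # D=5 features, N=200 points
X = X - X.mean(axis=1, keepdims=True)       # center
N = X.shape[1]

U_svd, sigma, Vt = np.linalg.svd(X, full_matrices=False)
eigvals, U_eig = np.linalg.eigh((X @ X.T) / N)    # ascending order

print(np.allclose(sigma**2 / N, eigvals[::-1]))   # lambda_i = sigma_i^2 / N
# Eigenvectors agree column-by-column up to sign:
print(np.allclose(np.abs(U_svd), np.abs(U_eig[:, ::-1])))
```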

  18. Principal Component Analysis. Example 10.3 from the MML book (MNIST digits embedding). [Figure: MNIST digits embedding.]

  19. Principal Component Analysis. [Figure: data projected onto the top 2 vs. the bottom 2 principal components.] Data: three varieties of wheat (Kama, Rosa, Canadian). Attributes: area, perimeter, compactness, length of kernel, width of kernel, asymmetry coefficient, length of groove.

  20. Eigen-faces [Turk & Pentland 1991] • $d$ = number of pixels • Each $x_i \in \mathbb{R}^d$ is a face image • $x_{ji}$ = intensity of the $j$-th pixel in image $i$

  21. Eigen-faces [Turk & Pentland 1991] • $d$ = number of pixels • Each $x_i \in \mathbb{R}^d$ is a face image • $x_{ji}$ = intensity of the $j$-th pixel in image $i$. $X_{d \times n} \approx U_{d \times k} Z_{k \times n}$, with $Z = (z_1 \ldots z_n)$.

  22. Eigen-faces [Turk & Pentland 1991] • $d$ = number of pixels • Each $x_i \in \mathbb{R}^d$ is a face image • $x_{ji}$ = intensity of the $j$-th pixel in image $i$. $X_{d \times n} \approx U_{d \times k} Z_{k \times n}$, with $Z = (z_1 \ldots z_n)$. Idea: $z_i$ is a more “meaningful” representation of the $i$-th face than $x_i$; we can use $z_i$ for nearest-neighbor classification.

  23. Eigen-faces [Turk & Pentland 1991] • $d$ = number of pixels • Each $x_i \in \mathbb{R}^d$ is a face image • $x_{ji}$ = intensity of the $j$-th pixel in image $i$. $X_{d \times n} \approx U_{d \times k} Z_{k \times n}$, with $Z = (z_1 \ldots z_n)$. Idea: $z_i$ is a more “meaningful” representation of the $i$-th face than $x_i$; we can use $z_i$ for nearest-neighbor classification. Much faster: $O(dk + nk)$ time instead of $O(dn)$ when $n, d \gg k$.
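
A minimal sketch of that nearest-neighbor step (`nearest_face` is a hypothetical helper; `U`, `Z_train`, and `mean` are assumed to come from a PCA fit such as `pca_eig` above):

```python
# Nearest-neighbor lookup in the compressed z-space (sketch).
import numpy as np

def nearest_face(x_query, U, Z_train, mean):
    """Index of the training face closest to x_query, measured in z-space."""
    z = U.T @ (x_query - mean)                            # project: O(dk)
    dists = np.linalg.norm(Z_train - z[:, None], axis=0)  # compare: O(nk)
    return int(np.argmin(dists))
```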

  24. Aside: How many components? • The magnitude of the eigenvalues indicates the fraction of variance captured. • [Figure: eigenvalues $\lambda_i$ on a face image dataset, falling from about 1353 to 287 across the first ten or so components.] • Eigenvalues typically drop off sharply, so we don’t need that many. • Of course, variance isn’t everything...
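
One common recipe, sketched here as an addition (`choose_k` and the 0.95 target are illustrative assumptions, not from the slides): pick the smallest $k$ whose leading eigenvalues capture a target fraction of the total variance.

```python
# Choosing k from the eigenvalue spectrum (illustrative sketch).
import numpy as np

def choose_k(eigvals_desc, target=0.95):
    """Smallest k whose top-k eigenvalues capture `target` of the variance."""
    frac = np.cumsum(eigvals_desc) / np.sum(eigvals_desc)
    return int(np.searchsorted(frac, target) + 1)
```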

  25. Latent Semantic Analysis [Deerwester et al. 1990] • $d$ = number of words in the vocabulary • Each $x_i \in \mathbb{R}^d$ is a vector of word counts

  26. Latent Semantic Analysis [Deerwester et al. 1990] • $d$ = number of words in the vocabulary • Each $x_i \in \mathbb{R}^d$ is a vector of word counts • $x_{ji}$ = frequency of word $j$ in document $i$. $X_{d \times n} \approx U_{d \times k} Z_{k \times n}$. [Figure: a term-document count matrix $X$, with rows for words such as “game”, “stocks”, “chairman”, “the”, ..., “wins”, factored into word loadings $U$ and document embeddings $Z = (z_1 \ldots z_n)$.]

  27. Latent Semantic Analysis [Deerwester et al. 1990] • $d$ = number of words in the vocabulary • Each $x_i \in \mathbb{R}^d$ is a vector of word counts • $x_{ji}$ = frequency of word $j$ in document $i$. $X_{d \times n} \approx U_{d \times k} Z_{k \times n}$. [Figure: the same term-document matrix and its factors.] How to measure similarity between two documents? $z_1^\top z_2$ is probably better than $x_1^\top x_2$.
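
A toy sketch of that comparison (the vocabulary echoes the slide, but the counts for the middle document are invented, since the slide elides them):

```python
# LSA document similarity via a rank-k SVD (toy sketch; counts invented).
import numpy as np

#                doc1 doc2 doc3
X = np.array([[1,   0,   3],    # game
              [2,   1,   0],    # stocks
              [4,   3,   1],    # chairman
              [8,   9,   7],    # the
              [0,   0,   2]],   # wins
             dtype=float)       # d=5 words, n=3 documents

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
Z = np.diag(s[:k]) @ Vt[:k]     # k x n document embeddings

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print("z-space:", cos(Z[:, 0], Z[:, 2]), " x-space:", cos(X[:, 0], X[:, 2]))
```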

  28. Probabilistic PCA • If we define a prior over $z$, then we can sample from the latent space and hallucinate images.
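
A minimal sketch of that generative step (not the full probabilistic PCA model, which also fits an observation-noise variance): draw $z \sim \mathcal{N}(0, I_k)$ and decode. `sample_images`, `sigma`, and the assumption that `U` and `mean` come from an earlier PCA fit are all illustrative.

```python
# Sampling "hallucinated" data from a prior over z (illustrative sketch).
import numpy as np

def sample_images(U, mean, n_samples=5, sigma=0.0, rng=None):
    rng = np.random.default_rng(rng)
    k = U.shape[1]
    z = rng.normal(size=(k, n_samples))           # z ~ N(0, I_k)
    X = U @ z + mean[:, None]                     # decode into data space
    return X + sigma * rng.normal(size=X.shape)   # optional pixel noise
```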

  29. Limitations of Linearity. [Figure: a dataset where PCA is effective vs. one where PCA is ineffective.]

  30. Nonlinear PCA. [Figure: broken solution vs. desired solution.] We want the desired solution: $S = \{(x_1, x_2) : x_2 = u x_1^2\}$

  31. Nonlinear PCA. [Figure: broken solution vs. desired solution.] We want the desired solution: $S = \{(x_1, x_2) : x_2 = u x_1^2\}$. Idea: use kernels. We can get this: $S = \{x : \phi(x) = Uz\}$ with $\phi(x) = (x_1^2, x_2)^\top$. Linear dimensionality reduction in $\phi(x)$ space ⇔ nonlinear dimensionality reduction in $x$ space.
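
Because $\phi$ is explicit here, we can run the construction directly (a sketch, not from the slides): sample points near a parabola, map them through $\phi(x) = (x_1^2, x_2)^\top$, and do ordinary PCA in feature space. The top component comes out close to the diagonal, i.e., the parabola has become a line.

```python
# Nonlinear PCA via an explicit feature map (illustrative sketch).
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.uniform(-2, 2, size=500)
x2 = x1**2 + 0.05 * rng.normal(size=500)      # points near x2 = x1^2

Phi = np.stack([x1**2, x2])                   # phi(x) = (x1^2, x2), 2 x n
Phi_c = Phi - Phi.mean(axis=1, keepdims=True)
S = Phi_c @ Phi_c.T / Phi.shape[1]            # 2 x 2 covariance in phi-space
eigvals, eigvecs = np.linalg.eigh(S)
print(eigvecs[:, -1])                         # top component ~ ±(1, 1)/sqrt(2)
```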

  32. Kernel PCA
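
When $\phi$ is only implicit, kernel PCA achieves the same effect through a kernel function. A sketch using scikit-learn's KernelPCA on the same parabola data (the RBF kernel and `gamma=1.0` are illustrative choices, not from the slides; note that scikit-learn puts samples in rows, unlike the slides' column convention):

```python
# Kernel PCA with an RBF kernel (illustrative sketch).
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(0)
x1 = rng.uniform(-2, 2, size=500)
X = np.column_stack([x1, x1**2 + 0.05 * rng.normal(size=500)])  # n x 2

kpca = KernelPCA(n_components=1, kernel="rbf", gamma=1.0)
Z = kpca.fit_transform(X)       # n x 1 nonlinear embedding of the parabola
print(Z[:5, 0])
```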

  33. Wrapping up • PCA is a linear model for dimensionality reduction: it finds a mapping to a lower-dimensional space that maximizes variance • We saw that this is equivalent to performing an eigendecomposition of the covariance matrix of $X$ • Next time: auto-encoders and neural compression for non-linear projections

