1. Unsupervised Machine Learning and Data Mining
DS 5230 / DS 4420 - Fall 2018
Lecture 7
Jan-Willem van de Meent

2. DIMENSIONALITY REDUCTION
Borrowing from: Percy Liang (Stanford)

3. Dimensionality Reduction
Goal: Map high-dimensional data onto lower-dimensional data in a manner that preserves distances/similarities.
Original data (4 dims) vs. projection with PCA (2 dims).
Objective: the projection should "preserve" relative distances.

4. Linear Dimensionality Reduction
Idea: Project a high-dimensional vector onto a lower-dimensional space, e.g. x ∈ R^361, z = U^T x, z ∈ R^10.

5. Problem Setup
Given n data points in d dimensions: x_1, ..., x_n ∈ R^d
X = (x_1 ⋯ x_n) ∈ R^{d×n} (this is the transpose of the X used in regression!)

6. Problem Setup
Given n data points in d dimensions: x_1, ..., x_n ∈ R^d, X = (x_1 ⋯ x_n) ∈ R^{d×n}
Want to reduce dimensionality from d to k.
Choose k directions u_1, ..., u_k, collected as U = (u_1 ⋯ u_k) ∈ R^{d×k}.

7. Problem Setup
Given n data points in d dimensions: x_1, ..., x_n ∈ R^d, X = (x_1 ⋯ x_n) ∈ R^{d×n}
Want to reduce dimensionality from d to k.
Choose k directions u_1, ..., u_k, collected as U = (u_1 ⋯ u_k) ∈ R^{d×k}.
For each u_j, compute the "similarity" z_j = u_j^T x.

8. Problem Setup
Given n data points in d dimensions: x_1, ..., x_n ∈ R^d, X = (x_1 ⋯ x_n) ∈ R^{d×n}
Want to reduce dimensionality from d to k.
Choose k directions u_1, ..., u_k, collected as U = (u_1 ⋯ u_k) ∈ R^{d×k}.
For each u_j, compute the "similarity" z_j = u_j^T x.
Project x down to z = (z_1, ..., z_k)^T = U^T x.
How to choose U?
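As a concrete illustration of this setup, here is a minimal NumPy sketch (the dimensions, the random data, and the orthonormal U chosen via QR are illustrative placeholders, not from the lecture):

```python
import numpy as np

# Illustrative setup: n points in d dimensions, stored as columns of X,
# matching the slides' convention X = (x_1 ... x_n) in R^{d x n}.
d, n, k = 361, 500, 10
rng = np.random.default_rng(0)
X = rng.normal(size=(d, n))

# Pick k orthonormal directions u_1, ..., u_k. Here they come from a QR
# factorization of a random matrix; PCA gives a principled choice below.
U, _ = np.linalg.qr(rng.normal(size=(d, k)))   # U is d x k with U^T U = I

# Encode every point at once: column i of Z is z_i = U^T x_i.
Z = U.T @ X                                    # Z is k x n
```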

9. Principal Component Analysis
Top 2 components vs. bottom 2 components.
Data: three varieties of wheat (Kama, Rosa, Canadian).
Attributes: Area, Perimeter, Compactness, Length of Kernel, Width of Kernel, Asymmetry Coefficient, Length of Groove.

10. Principal Component Analysis
x ∈ R^361, z = U^T x, z ∈ R^10
PCA optimizes two equivalent objectives:
1. Minimize the reconstruction error
2. Maximize the projected variance

11. PCA Objective 1: Reconstruction Error
U serves two functions:
• Encode: z = U^T x, z_j = u_j^T x

12. PCA Objective 1: Reconstruction Error
U serves two functions:
• Encode: z = U^T x, z_j = u_j^T x
• Decode: x̃ = U z = Σ_{j=1}^k z_j u_j

13. PCA Objective 1: Reconstruction Error
U serves two functions:
• Encode: z = U^T x, z_j = u_j^T x
• Decode: x̃ = U z = Σ_{j=1}^k z_j u_j
Want the reconstruction error ||x - x̃|| to be small.

14. PCA Objective 1: Reconstruction Error
U serves two functions:
• Encode: z = U^T x, z_j = u_j^T x
• Decode: x̃ = U z = Σ_{j=1}^k z_j u_j
Want the reconstruction error ||x - x̃|| to be small.
Objective: minimize the total squared reconstruction error
min_{U ∈ R^{d×k}} Σ_{i=1}^n ||x_i - U U^T x_i||^2
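A short sketch of this objective in NumPy (the function name and shapes are illustrative; it assumes X stores data points as columns and U has orthonormal columns):

```python
import numpy as np

def total_squared_reconstruction_error(X, U):
    """Sum over data points x_i of ||x_i - U U^T x_i||^2.

    X has shape (d, n) with points as columns; U has shape (d, k)
    with orthonormal columns (the encode/decode matrix)."""
    X_tilde = U @ (U.T @ X)           # decode(encode(X)), shape (d, n)
    return np.sum((X - X_tilde) ** 2)
```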

15. PCA Objective 2: Projected Variance
Empirical distribution: uniform over x_1, ..., x_n
Expectation (think sum over data points): Ê[f(x)] = (1/n) Σ_{i=1}^n f(x_i)
Variance (think sum of squares if centered): var̂[f(x)] + (Ê[f(x)])^2 = Ê[f(x)^2] = (1/n) Σ_{i=1}^n f(x_i)^2

16. PCA Objective 2: Projected Variance
Empirical distribution: uniform over x_1, ..., x_n
Expectation (think sum over data points): Ê[f(x)] = (1/n) Σ_{i=1}^n f(x_i)
Variance (think sum of squares if centered): var̂[f(x)] + (Ê[f(x)])^2 = Ê[f(x)^2] = (1/n) Σ_{i=1}^n f(x_i)^2
Assume the data is centered: Ê[x] = 0 (what's Ê[U^T x]?)

17. PCA Objective 2: Projected Variance
Empirical distribution: uniform over x_1, ..., x_n
Expectation (think sum over data points): Ê[f(x)] = (1/n) Σ_{i=1}^n f(x_i)
Variance (think sum of squares if centered): var̂[f(x)] + (Ê[f(x)])^2 = Ê[f(x)^2] = (1/n) Σ_{i=1}^n f(x_i)^2
Assume the data is centered: Ê[x] = 0 (what's Ê[U^T x]?)

18. PCA Objective 2: Projected Variance
Empirical distribution: uniform over x_1, ..., x_n
Expectation (think sum over data points): Ê[f(x)] = (1/n) Σ_{i=1}^n f(x_i)
Variance (think sum of squares if centered): var̂[f(x)] + (Ê[f(x)])^2 = Ê[f(x)^2] = (1/n) Σ_{i=1}^n f(x_i)^2
Assume the data is centered: Ê[x] = 0 (what's Ê[U^T x]?)
Objective: maximize the variance of the projected data
max_{U ∈ R^{d×k}, U^T U = I} Ê[||U^T x||^2]

19. PCA Objective 2: Projected Variance
Empirical distribution: uniform over x_1, ..., x_n
Expectation (think sum over data points): Ê[f(x)] = (1/n) Σ_{i=1}^n f(x_i)
Variance (think sum of squares if centered): var̂[f(x)] + (Ê[f(x)])^2 = Ê[f(x)^2] = (1/n) Σ_{i=1}^n f(x_i)^2
Assume the data is centered: Ê[x] = 0 (what's Ê[U^T x]?)
Objective: maximize the variance of the projected data
max_{U ∈ R^{d×k}, U^T U = I} Ê[||U^T x||^2]
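A small numerical sketch of the centering assumption and the projected-variance objective (the variable names and random data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, k = 5, 1000, 2
X = rng.normal(size=(d, n)) * np.arange(1, d + 1)[:, None]  # points as columns

# Center the data so the empirical mean E[x] is 0;
# then E[U^T x] = U^T E[x] = 0 as well.
X = X - X.mean(axis=1, keepdims=True)

U, _ = np.linalg.qr(rng.normal(size=(d, k)))   # some orthonormal directions

# Empirical projected variance: E[||U^T x||^2] = (1/n) sum_i ||U^T x_i||^2
projected_variance = np.mean(np.sum((U.T @ X) ** 2, axis=0))
```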

20. Equivalence of the Two Objectives
Key intuition: variance of data (fixed) = captured variance (want large) + reconstruction error (want small)

21. Equivalence of the Two Objectives
Key intuition: variance of data (fixed) = captured variance (want large) + reconstruction error (want small)
Pythagorean decomposition: x = U U^T x + (I - U U^T) x
(right triangle with legs ||U U^T x|| and ||(I - U U^T) x|| and hypotenuse ||x||)
Take expectations; note that the rotation U does not affect length:
Ê[||x||^2] = Ê[||U^T x||^2] + Ê[||x - U U^T x||^2]

22. Equivalence of the Two Objectives
Key intuition: variance of data (fixed) = captured variance (want large) + reconstruction error (want small)
Pythagorean decomposition: x = U U^T x + (I - U U^T) x
(right triangle with legs ||U U^T x|| and ||(I - U U^T) x|| and hypotenuse ||x||)
Take expectations; note that the rotation U does not affect length:
Ê[||x||^2] = Ê[||U^T x||^2] + Ê[||x - U U^T x||^2]
Minimize reconstruction error ⟺ Maximize captured variance
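This identity is easy to verify numerically; a quick sketch (again with illustrative random data and an arbitrary orthonormal U):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, k = 8, 2000, 3
X = rng.normal(size=(d, n))
X = X - X.mean(axis=1, keepdims=True)           # centered data
U, _ = np.linalg.qr(rng.normal(size=(d, k)))    # orthonormal columns

total_variance    = np.mean(np.sum(X ** 2, axis=0))               # E[||x||^2]
captured_variance = np.mean(np.sum((U.T @ X) ** 2, axis=0))       # E[||U^T x||^2]
reconstruction    = np.mean(np.sum((X - U @ (U.T @ X)) ** 2, axis=0))

# variance of data = captured variance + reconstruction error
assert np.isclose(total_variance, captured_variance + reconstruction)
```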

23. Changes of Basis
Data: X = (x_1 ⋯ x_n) ∈ R^{d×n}
Orthonormal basis: U = (u_1 ⋯ u_d) ∈ R^{d×d}

24. Changes of Basis
Data: X = (x_1 ⋯ x_n) ∈ R^{d×n}
Orthonormal basis: U = (u_1 ⋯ u_d) ∈ R^{d×d}
Change of basis: z_j = u_j^T x, z = U^T x
Inverse change of basis: x = U z = Σ_{j=1}^d z_j u_j
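A tiny sketch of this round trip with a full orthonormal basis (illustrative; with all d directions kept, decoding recovers x exactly):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 6
x = rng.normal(size=d)

# A full orthonormal basis U = (u_1 ... u_d) in R^{d x d}.
U, _ = np.linalg.qr(rng.normal(size=(d, d)))

z = U.T @ x            # change of basis: z_j = u_j^T x
x_back = U @ z         # inverse change of basis: x = U z = sum_j z_j u_j

assert np.allclose(x, x_back)   # exact because no directions were dropped
```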

25. Principal Component Analysis
Data: X = (x_1 ⋯ x_n) ∈ R^{d×n}
Orthonormal basis: U = (u_1 ⋯ u_d) ∈ R^{d×d}, the eigenvectors of the covariance
Eigendecomposition: Λ = diag(λ_1, λ_2, ..., λ_d)
Claim: the eigenvectors of a symmetric matrix are orthogonal.
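The claim is easy to check numerically on the empirical covariance; a sketch (the random data is illustrative, and np.linalg.eigh is used because the covariance is symmetric):

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 5, 1000
X = rng.normal(size=(d, n))
X = X - X.mean(axis=1, keepdims=True)

C = (X @ X.T) / n                  # empirical covariance, symmetric d x d
eigvals, U = np.linalg.eigh(C)     # eigendecomposition C = U diag(eigvals) U^T

# Eigenvectors of a symmetric matrix are orthogonal: U^T U = I
assert np.allclose(U.T @ U, np.eye(d))
```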

26. Principal Component Analysis
[Figure illustrating PCA (from Stack Exchange)]

27. Principal Component Analysis
Data: X = (x_1 ⋯ x_n) ∈ R^{d×n}
Orthonormal basis: U = (u_1 ⋯ u_d) ∈ R^{d×d}, the eigenvectors of the covariance
Eigendecomposition: Λ = diag(λ_1, λ_2, ..., λ_d)
Idea: take the top-k eigenvectors to maximize variance.

28. Principal Component Analysis
Data: X = (x_1 ⋯ x_n) ∈ R^{d×n}
Truncated basis: U = (u_1 ⋯ u_k) ∈ R^{d×k}, the top-k eigenvectors of the covariance
Truncated decomposition: Λ^(k) = diag(λ_1, λ_2, ..., λ_k)
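Putting the pieces together, here is a compact sketch of PCA via the covariance eigendecomposition, keeping the top-k eigenvectors (the function name pca_eig and its interface are illustrative, not the lecture's reference code):

```python
import numpy as np

def pca_eig(X, k):
    """PCA of X (shape d x n, data points as columns) via eigendecomposition.

    Returns the top-k principal directions U (d x k) and the codes Z (k x n)."""
    Xc = X - X.mean(axis=1, keepdims=True)   # center the data
    C = (Xc @ Xc.T) / Xc.shape[1]            # covariance, O(n d^2)
    eigvals, eigvecs = np.linalg.eigh(C)     # ascending eigenvalues, O(d^3)
    top_k = np.argsort(eigvals)[::-1][:k]    # indices of the k largest eigenvalues
    U = eigvecs[:, top_k]
    return U, U.T @ Xc
```

For instance, U, Z = pca_eig(X, 2) would produce the kind of 2-dimensional projection shown for the wheat data above.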

29. PCA: Complexity
Data: X = (x_1 ⋯ x_n) ∈ R^{d×n}; truncated basis: U = (u_1 ⋯ u_k) ∈ R^{d×k}
Using the eigenvalue decomposition:
• Computation of the covariance C: O(n d^2)
• Eigenvalue decomposition: O(d^3)
• Total complexity: O(n d^2 + d^3)

30. PCA: Complexity
Data: X = (x_1 ⋯ x_n) ∈ R^{d×n}; truncated basis: U = (u_1 ⋯ u_k) ∈ R^{d×k}
Using the singular value decomposition:
• Full decomposition: O(min{n d^2, n^2 d})
• Rank-k decomposition: O(k d n log(n)) (with the power method)
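A sketch of the SVD route: np.linalg.svd for the full decomposition, and, as one possible stand-in for the rank-k power-method idea, scikit-learn's randomized_svd (treat that particular helper, and the function names here, as assumptions rather than the lecture's prescription):

```python
import numpy as np
from sklearn.utils.extmath import randomized_svd  # only needed for the rank-k route

def pca_svd(X, k):
    """PCA of X (d x n) via a full SVD of the centered data matrix."""
    Xc = X - X.mean(axis=1, keepdims=True)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)   # O(min{n d^2, n^2 d})
    return U[:, :k], U[:, :k].T @ Xc

def pca_randomized(X, k):
    """Rank-k PCA using a randomized (power-iteration based) truncated SVD."""
    Xc = X - X.mean(axis=1, keepdims=True)
    U, s, Vt = randomized_svd(Xc, n_components=k)
    return U, U.T @ Xc
```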


31. Singular Value Decomposition
Idea: Decompose a d x d matrix M into
1. A change of basis V (unitary matrix)
2. A scaling Σ (diagonal matrix)
3. A change of basis U (unitary matrix)

32. Singular Value Decomposition
Idea: Decompose the d x n matrix X into
1. An n x n basis V (unitary matrix)
2. A d x n matrix Σ (diagonal projection)
3. A d x d basis U (unitary matrix)
X = U_{d×d} Σ_{d×n} V^T_{n×n}
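A small shape check with NumPy (np.linalg.svd returns the singular values as a vector, so the rectangular diagonal Σ is reassembled explicitly below):

```python
import numpy as np

d, n = 7, 4
X = np.random.default_rng(4).normal(size=(d, n))

U, s, Vt = np.linalg.svd(X, full_matrices=True)   # U: d x d, Vt: n x n
Sigma = np.zeros((d, n))                          # rectangular "diagonal" matrix
Sigma[:len(s), :len(s)] = np.diag(s)

assert np.allclose(X, U @ Sigma @ Vt)             # X = U Sigma V^T
```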

33. Eigen-faces [Turk & Pentland 1991]
• d = number of pixels
• Each x_i ∈ R^d is a face image
• x_{ji} = intensity of the j-th pixel in image i

34. Eigen-faces [Turk & Pentland 1991]
• d = number of pixels
• Each x_i ∈ R^d is a face image
• x_{ji} = intensity of the j-th pixel in image i
X_{d×n} ≈ U_{d×k} Z_{k×n}, with Z = (z_1 ⋯ z_n)

35. Eigen-faces [Turk & Pentland 1991]
• d = number of pixels
• Each x_i ∈ R^d is a face image
• x_{ji} = intensity of the j-th pixel in image i
X_{d×n} ≈ U_{d×k} Z_{k×n}, with Z = (z_1 ⋯ z_n)
Idea: z_i is a more "meaningful" representation of the i-th face than x_i.
Can use z_i for nearest-neighbor classification.

36. Eigen-faces [Turk & Pentland 1991]
• d = number of pixels
• Each x_i ∈ R^d is a face image
• x_{ji} = intensity of the j-th pixel in image i
X_{d×n} ≈ U_{d×k} Z_{k×n}, with Z = (z_1 ⋯ z_n)
Idea: z_i is a more "meaningful" representation of the i-th face than x_i.
Can use z_i for nearest-neighbor classification.
Much faster: O(d k + n k) time instead of O(d n) when n, d ≫ k.
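A sketch of nearest-neighbor classification in the reduced space (the face matrices, labels, and the helper's name and interface are illustrative; U and the mean would come from PCA on the training faces, e.g. via the pca_eig sketch above):

```python
import numpy as np

def nearest_neighbor_in_z(X_train, y_train, X_test, U, mean):
    """Classify test faces by their nearest neighbor in the k-dim PCA space.

    X_train, X_test store faces as columns; U (d x k) and mean (d x 1)
    come from PCA on the training faces."""
    Z_train = U.T @ (X_train - mean)   # k x n_train codes
    Z_test  = U.T @ (X_test - mean)    # k x n_test codes
    # Squared distances between every test and train code: O(k) per pair
    # instead of O(d) in pixel space.
    d2 = (np.sum(Z_test ** 2, axis=0)[:, None]
          + np.sum(Z_train ** 2, axis=0)[None, :]
          - 2 * Z_test.T @ Z_train)
    return y_train[np.argmin(d2, axis=1)]
```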
