

  1. compsci 514: algorithms for data science. Cameron Musco, University of Massachusetts Amherst. Spring 2020. Lecture 16.

  2. summary
  Last Class: Low-Rank Approximation, Eigendecomposition, and PCA.
  • Can approximate data lying close to a k-dimensional subspace by projecting the data points into that subspace (PCA).
  • Finding the best k-dimensional subspace via eigendecomposition.
  • Measuring error in terms of the eigenvalue spectrum.
  This Class: Finish low-rank approximation and the connection to the singular value decomposition (SVD).
  • Finish up PCA: runtime considerations and picking k.
  • View of optimal low-rank approximation using the SVD.
  • Applications of low-rank approximation beyond compression.

  3. basic set up
  Set Up: Assume that data points x_1, ..., x_n lie close to a k-dimensional subspace V of R^d. Let X ∈ R^(n×d) be the data matrix. Let v_1, ..., v_k be an orthonormal basis for V and V ∈ R^(d×k) be the matrix with these vectors as its columns.
  • VV^T ∈ R^(d×d) is the projection matrix onto V.
  • X ≈ X(VV^T). Gives the closest approximation to X with rows in V.
  Notation: x_1, ..., x_n ∈ R^d: data points; X ∈ R^(n×d): data matrix; v_1, ..., v_k ∈ R^d: orthonormal basis for subspace V; V ∈ R^(d×k): matrix with columns v_1, ..., v_k.
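
To make the projection X(VV^T) concrete, here is a minimal numpy sketch (not from the slides; the data matrix and the orthonormal basis V are synthetic, generated just for illustration):

```python
import numpy as np

# Minimal sketch of the set-up: data points (rows of X) lying close to a
# k-dimensional subspace spanned by the orthonormal columns of V, and the
# projection X (V V^T) giving the closest approximation with rows in V.
rng = np.random.default_rng(0)
n, d, k = 100, 20, 3

V, _ = np.linalg.qr(rng.standard_normal((d, k)))   # orthonormal basis, V in R^(d x k)
X = rng.standard_normal((n, k)) @ V.T + 0.01 * rng.standard_normal((n, d))

P = V @ V.T                                  # d x d projection matrix onto the subspace
X_proj = X @ P                               # rows of X_proj lie in the subspace
print(np.linalg.norm(X - X_proj, 'fro')**2)  # small, since X is close to the subspace
```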

  4. low-rank approximation via eigendecomposition
  The orthonormal V ∈ R^(d×k) minimizing ∥X − XVV^T∥_F^2 is given by:
  V_k = arg max_{orthonormal V ∈ R^(d×k)} ∥XV∥_F^2 = arg max_{orthonormal V ∈ R^(d×k)} Σ_{j=1}^k ∥Xv_j∥^2.
  Solution via eigendecomposition: V_k has columns v_1, ..., v_k corresponding to the top k eigenvectors of the covariance matrix X^T X.
  • Proof via Courant-Fischer and greedy maximization.
  • Approximation error is ∥X∥_F^2 − ∥XV_k∥_F^2 = Σ_{i=k+1}^d λ_i(X^T X).
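
A small numpy sketch of this recipe, using a synthetic data matrix X: it computes V_k from the eigendecomposition of X^T X and checks the stated error formula (the specific data is illustrative only):

```python
import numpy as np

# Sketch: the optimal subspace via the eigendecomposition of the covariance
# matrix X^T X, plus a check of the error formula from this slide.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
k = 3

evals, evecs = np.linalg.eigh(X.T @ X)       # eigenpairs, eigenvalues ascending
evals, evecs = evals[::-1], evecs[:, ::-1]   # reorder to descending
V_k = evecs[:, :k]                           # top-k eigenvectors as columns

approx_err = np.linalg.norm(X, 'fro')**2 - np.linalg.norm(X @ V_k, 'fro')**2
print(np.isclose(approx_err, evals[k:].sum()))  # True: error = sum of tail eigenvalues
```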

  5. low-rank approximation via eigendecomposition

  6. spectrum analysis
  Plotting the spectrum of the covariance matrix X^T X (its eigenvalues) shows how compressible X is using low-rank approximation (i.e., how close x_1, ..., x_n are to a low-dimensional subspace).
  • Choose k to balance accuracy and compression.
  • Often chosen at an 'elbow' in the spectrum.
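
One way to inspect the spectrum numerically (an illustrative sketch with synthetic, approximately rank-10 data; the data and the printed cutoffs are not from the lecture):

```python
import numpy as np

# Sketch: compute the eigenvalue spectrum of X^T X and the fraction of
# ||X||_F^2 captured by the top k directions, to help pick k.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 10)) @ rng.standard_normal((10, 50))  # ~rank-10 data

evals = np.linalg.eigvalsh(X.T @ X)[::-1]          # eigenvalues, descending
frac_captured = np.cumsum(evals) / evals.sum()     # fraction of ||X||_F^2 kept by top k
for k in (1, 5, 10, 20):
    print(k, round(frac_captured[k - 1], 4))       # jumps to ~1.0 at the 'elbow' k=10
```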

  7. spectrum analysis
  Exercise: Show that the eigenvalues of X^T X are always nonnegative. Hint: Use that λ_j = v_j^T X^T X v_j.
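
One way to carry out the hint (a short derivation sketch, assuming v_j is a unit eigenvector of X^T X with eigenvalue λ_j):

```latex
X^T X \vec{v}_j = \lambda_j \vec{v}_j
\;\Longrightarrow\;
\lambda_j = \vec{v}_j^{\,T} X^T X \vec{v}_j
          = (X\vec{v}_j)^T (X\vec{v}_j)
          = \|X\vec{v}_j\|_2^2 \;\ge\; 0 .
```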

  8. interpretation in terms of correlation
  Recall: Low-rank approximation is possible when our data features are correlated.
  Our compressed dataset is C = XV_k, where the columns of V_k are the top k eigenvectors of X^T X. What is the covariance of C?
  C^T C = V_k^T X^T X V_k = V_k^T V Λ V^T V_k = Λ_k.
  The covariance becomes diagonal, i.e., all correlations have been removed. Maximal compression.
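
A quick numerical check of this claim (synthetic data, illustrative only):

```python
import numpy as np

# Sketch: the compressed dataset C = X V_k has diagonal covariance C^T C = Lambda_k.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 8))

evals, evecs = np.linalg.eigh(X.T @ X)
V_k = evecs[:, ::-1][:, :3]                # top-3 eigenvectors of X^T X

C = X @ V_k                                # compressed dataset
cov_C = C.T @ C
off_diag = cov_C - np.diag(np.diag(cov_C))
print(np.round(np.diag(cov_C), 3))         # the top-3 eigenvalues of X^T X
print(np.abs(off_diag).max())              # ~0: all correlations removed
```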

  9. algorithmic considerations
  What is the runtime to compute an optimal low-rank approximation?
  • Computing the covariance matrix X^T X requires O(nd^2) time.
  • Computing its full eigendecomposition to obtain v_1, ..., v_k requires O(d^3) time (similar to computing the inverse (X^T X)^(-1)).
  Many faster iterative and randomized methods exist. Runtime is roughly Õ(ndk) to output just the top k eigenvectors v_1, ..., v_k.
  • Will see these in a few classes (power method, Krylov methods).
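
As a hedged illustration, the sketch below contrasts a full eigendecomposition (numpy's eigh) with an iterative Krylov-type solver (scipy's eigsh) that returns only the top k eigenpairs; these particular library calls are one possible choice, not the specific methods analyzed later in the course:

```python
import numpy as np
from scipy.sparse.linalg import eigsh   # iterative (Lanczos/Krylov) eigensolver

# Sketch: full O(d^3) eigendecomposition vs. an iterative solver that
# returns only the top-k eigenvectors of the covariance matrix X^T X.
rng = np.random.default_rng(0)
n, d, k = 2000, 300, 10
X = rng.standard_normal((n, d))
cov = X.T @ X                            # O(n d^2) to form

evals_full, evecs_full = np.linalg.eigh(cov)          # all d eigenpairs
evals_topk, evecs_topk = eigsh(cov, k=k, which='LM')  # only the k largest

print(np.allclose(np.sort(evals_topk), evals_full[-k:]))  # True: same top-k spectrum
```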

  10. singular value decomposition
  The singular value decomposition (SVD) generalizes the eigendecomposition to asymmetric (even rectangular) matrices. The 'swiss army knife' of modern linear algebra.
  Any matrix X ∈ R^(n×d) with rank(X) = r can be written as X = UΣV^T.
  • U has orthonormal columns u_1, ..., u_r ∈ R^n (left singular vectors).
  • V has orthonormal columns v_1, ..., v_r ∈ R^d (right singular vectors).
  • Σ is diagonal with entries σ_1 ≥ σ_2 ≥ ... ≥ σ_r > 0 (singular values).
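
A minimal numpy example of computing an SVD (the matrix is random, just to show the shapes and the basic properties):

```python
import numpy as np

# Sketch: X = U Sigma V^T via numpy; full_matrices=False gives the 'economy'
# factors with min(n, d) columns / singular values.
rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))

U, s, Vt = np.linalg.svd(X, full_matrices=False)
print(U.shape, s.shape, Vt.shape)            # (6, 4) (4,) (4, 4)
print(np.all(s[:-1] >= s[1:]))               # singular values sorted in decreasing order
print(np.allclose(X, U @ np.diag(s) @ Vt))   # exact reconstruction
```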

  11. connection of the svd to eigendecomposition
  Writing X ∈ R^(n×d) in its singular value decomposition X = UΣV^T:
  X^T X = VΣU^T UΣV^T = VΣ^2V^T (the eigendecomposition of the covariance matrix).
  Similarly: XX^T = UΣV^T VΣU^T = UΣ^2U^T.
  So the right and left singular vectors are the eigenvectors of the covariance matrix X^T X and the Gram matrix XX^T, respectively.
  So, letting V_k ∈ R^(d×k) have columns v_1, ..., v_k, we know that XV_kV_k^T is the best rank-k approximation to X (given by PCA). What about U_kU_k^T X, where U_k ∈ R^(n×k) has columns u_1, ..., u_k? Gives exactly the same approximation!
  Notation: X ∈ R^(n×d): data matrix; U ∈ R^(n×rank(X)): matrix with orthonormal columns u_1, u_2, ... (left singular vectors); V ∈ R^(d×rank(X)): matrix with orthonormal columns v_1, v_2, ... (right singular vectors); Σ ∈ R^(rank(X)×rank(X)): positive diagonal matrix containing the singular values of X.
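
A short numerical check of these identities on a synthetic matrix (illustrative only):

```python
import numpy as np

# Sketch: the right singular vectors of X are eigenvectors of X^T X, and
# projecting onto V_k or onto U_k gives the same rank-k approximation.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 12))
k = 4

U, s, Vt = np.linalg.svd(X, full_matrices=False)
V_k, U_k = Vt[:k].T, U[:, :k]

print(np.allclose(X.T @ X, Vt.T @ np.diag(s**2) @ Vt))  # X^T X = V Sigma^2 V^T
print(np.allclose(X @ V_k @ V_k.T, U_k @ U_k.T @ X))    # same approximation
```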

  12. the svd and optimal low-rank approximation
  The best rank-k approximation to X,
  X_k = arg min_{rank-k B ∈ R^(n×d)} ∥X − B∥_F, is given by:
  X_k = XV_kV_k^T = U_kU_k^T X = U_kΣ_kV_k^T.
  These correspond to projecting the rows (data points) onto the span of V_k or the columns (features) onto the span of U_k.

  13. the svd and optimal low-rank approximation
  The best rank-k approximation to X, X_k = arg min_{rank-k B ∈ R^(n×d)} ∥X − B∥_F, is given by X_k = XV_kV_k^T = U_kU_k^T X = U_kΣ_kV_k^T.
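
A sketch of computing X_k by truncating the SVD, together with the error identity ∥X − X_k∥_F^2 = Σ_{i>k} σ_i^2 (the same fact as the eigenvalue error formula earlier, since λ_i(X^T X) = σ_i^2); the data is synthetic:

```python
import numpy as np

# Sketch: best rank-k approximation X_k = U_k Sigma_k V_k^T from a truncated SVD,
# with squared Frobenius error equal to the tail sum of squared singular values.
rng = np.random.default_rng(0)
X = rng.standard_normal((80, 30))
k = 5

U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]

print(np.linalg.matrix_rank(X_k))   # k
print(np.allclose(np.linalg.norm(X - X_k, 'fro')**2, (s[k:]**2).sum()))  # True
```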

  14. the svd and optimal low-rank approximation

  15. applications of low-rank approximation
  Rest of Class: Examples of how low-rank approximation is applied in a variety of data science applications.
  • Used for many reasons other than dimensionality reduction/data compression.

  16. matrix completion
  Consider a matrix X ∈ R^(n×d) which we cannot fully observe but believe is close to rank-k (i.e., well approximated by a rank-k matrix). Classic example: the Netflix prize problem.
  Solve: Y = arg min_{rank-k B} Σ_{observed (j,k)} [X_{j,k} − B_{j,k}]^2.
  Under certain assumptions, can show that Y well approximates X on both the observed and (most importantly) unobserved entries.
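
One simple heuristic for this objective is iterative SVD imputation: fill in the unobserved entries with the current rank-k fit, re-fit, and repeat. The sketch below is illustrative only, on synthetic data; it is not the specific algorithm or guarantees discussed in class:

```python
import numpy as np

# Hedged sketch: iterative SVD imputation for fitting a rank-k matrix
# to the observed entries of a partially observed matrix.
rng = np.random.default_rng(0)
n, d, k = 100, 60, 3
X_true = rng.standard_normal((n, k)) @ rng.standard_normal((k, d))  # low-rank "truth"
mask = rng.random((n, d)) < 0.4                                     # observed entries

Y = np.where(mask, X_true, 0.0)           # start with unobserved entries set to 0
for _ in range(200):
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    Y_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]     # best rank-k fit to the current Y
    Y = np.where(mask, X_true, Y_k)              # keep observed entries fixed

# Relative error on the *unobserved* entries (small if recovery succeeds).
print(np.linalg.norm((Y_k - X_true)[~mask]) / np.linalg.norm(X_true[~mask]))
```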

  17. entity embeddings
  Dimensionality reduction embeds d-dimensional vectors into d′ dimensions. But what about when you want to embed objects other than vectors?
  • Documents (for topic-based search and classification)
  • Words (to identify synonyms, translations, etc.)
  • Nodes in a social network
  Usual Approach: Convert each item into a high-dimensional feature vector and then apply low-rank approximation.

  18. example: latent semantic analysis

  19. example: latent semantic analysis
  • If the error ∥X − YZ^T∥_F is small, then on average, X_{i,a} ≈ (YZ^T)_{i,a} = ⟨y_i, z_a⟩.
  • I.e., ⟨y_i, z_a⟩ ≈ 1 when doc i contains word a.
  • If doc i and doc j both contain word a, then ⟨y_i, z_a⟩ ≈ ⟨y_j, z_a⟩ ≈ 1.

  20. example: latent semantic analysis
  If doc i and doc j both contain word a, then ⟨y_i, z_a⟩ ≈ ⟨y_j, z_a⟩ ≈ 1.
  Another View: Each column of Y represents a 'topic'. y_i(j) indicates how much doc i belongs to topic j, and z_a(j) indicates how much word a associates with that topic.

  21. example: latent semantic analysis
  • Just like with documents, z_a and z_b will tend to have a high dot product if word a and word b appear in many of the same documents.
  • In an SVD decomposition we set Y = U_k and Z^T = Σ_kV_k^T (i.e., Z = V_kΣ_k).
  • The columns of V_k are equivalently the top k eigenvectors of X^T X. The eigendecomposition of X^T X is X^T X = VΣ^2V^T.
  • What is the best rank-k approximation of X^T X? I.e., arg min_{rank-k B} ∥X^T X − B∥_F.
  • It is V_kΣ_k^2V_k^T = ZZ^T.
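
A toy end-to-end sketch of these LSA embeddings via a truncated SVD; the documents and vocabulary below are made up purely for illustration:

```python
import numpy as np

# Sketch: a tiny doc-word count matrix X (rows = docs, columns = words),
# factored as X ~ Y Z^T with Y = U_k and Z = V_k Sigma_k.
docs = ["dog barks at the cat",
        "cat chases the dog",
        "stocks fell as markets closed",
        "markets rallied and stocks rose"]
vocab = sorted({w for doc in docs for w in doc.split()})
X = np.array([[doc.split().count(w) for w in vocab] for doc in docs], dtype=float)

k = 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)
Y = U[:, :k]              # doc embeddings y_i (rows of Y)
Z = Vt[:k].T * s[:k]      # word embeddings z_a (rows of Z), Z = V_k Sigma_k

print(np.round(Y @ Y.T, 2))                        # doc-doc similarities in embedding space
print(np.allclose(X.T @ X, (Vt.T * s**2) @ Vt))    # X^T X = V Sigma^2 V^T
```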
