SLIDE 1
compsci 514: algorithms for data science
Cameron Musco University of Massachusetts Amherst. Spring 2020. Lecture 16
SLIDE 2 summary
Last Class: Low-Rank Approximation, Eigendecomposition, and PCA
- Can approximate data lying close to a k-dimensional subspace by projecting the data points into that space.
- Finding the best k-dimensional subspace via eigendecomposition
(PCA).
- Measuring error in terms of the eigenvalue spectrum.
This Class: Finish Low-Rank Approximation and Connection to the singular value decomposition (SVD)
- Finish up PCA – runtime considerations and picking k.
- View of optimal low-rank approximation using the SVD.
- Applications of low-rank approximation beyond compression.
SLIDE 3 basic set up
Set Up: Assume that the data points ⃗x1, . . . , ⃗xn lie close to a k-dimensional subspace V of Rd. Let X ∈ Rn×d be the data matrix. Let ⃗v1, . . . , ⃗vk be an orthonormal basis for V and V ∈ Rd×k be the matrix with these vectors as its columns.
- VV^T ∈ Rd×d is the projection matrix onto V.
- X ≈ X(VV^T). Gives the closest approximation to X with rows in V.
⃗x1, . . . , ⃗xn ∈ Rd: data points, X ∈ Rn×d: data matrix, ⃗v1, . . . , ⃗vk ∈ Rd: orthogonal basis for subspace V, V ∈ Rd×k: matrix with columns ⃗v1, . . . , ⃗vk.
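To make the projection step concrete, here is a minimal numpy sketch (with made-up data, not from the lecture): V holds an orthonormal basis as its columns, VV^T is the projection matrix onto the subspace, and X(VV^T) replaces each row of X with its closest point in that subspace.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))                  # n = 100 data points in d = 5 dimensions
V, _ = np.linalg.qr(rng.standard_normal((5, 2)))   # orthonormal basis for a (random) k = 2 subspace

P = V @ V.T            # d x d projection matrix onto the subspace spanned by V's columns
X_proj = X @ P         # project every data point (row of X) into the subspace
print(np.allclose(X_proj @ P, X_proj))   # projecting twice changes nothing: P is idempotent
```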
SLIDE 4 low-rank approximation via eigendecomposition
The orthonormal V ∈ Rd×k minimizing ∥X − XVV^T∥_F^2 is given by:

V = arg max_{orthonormal V ∈ Rd×k} ∥XV∥_F^2 = arg max_{orthonormal V ∈ Rd×k} ∑_{j=1}^k ∥X⃗vj∥_2^2

Solution via eigendecomposition: Letting Vk have columns ⃗v1, . . . , ⃗vk corresponding to the top k eigenvectors of the covariance matrix X^T X, Vk = arg max_{orthonormal V ∈ Rd×k} ∥XV∥_F^2.
- Proof via Courant-Fischer and greedy maximization.
- Approximation error is ∥X∥_F^2 − ∥XVk∥_F^2 = ∑_{i=k+1}^d λi(X^T X).
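A small numpy sketch of this recipe on synthetic data (an illustrative assumption, not the lecture's dataset): take the top-k eigenvectors of X^T X, project onto them, and check that the error equals the sum of the bottom d − k eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data lying near a 3-dimensional subspace of R^20.
X = rng.standard_normal((100, 3)) @ rng.standard_normal((3, 20))
X += 0.01 * rng.standard_normal((100, 20))

k = 3
eigvals, eigvecs = np.linalg.eigh(X.T @ X)   # eigenvalues in ascending order
Vk = eigvecs[:, -k:]                         # top-k eigenvectors of the covariance matrix
Xk = X @ Vk @ Vk.T                           # rank-k approximation of X

err = np.linalg.norm(X - Xk, 'fro') ** 2
tail = eigvals[:-k].sum()                    # sum of the bottom d - k eigenvalues
print(err, tail)                             # equal, up to floating point error
```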
SLIDE 5
low-rank approximation via eigendecomposition
SLIDE 6 spectrum analysis
Plotting the spectrum of the covariance matrix XTX (its eigenvalues) shows how compressible X is using low-rank approximation (i.e., how close ⃗ x1, . . . ,⃗ xn are to a low-dimensional subspace).
- Choose k to balance accuracy and compression.
- Often at an ‘elbow’.
⃗x1, . . . , ⃗xn ∈ Rd: data points, X ∈ Rn×d: data matrix, ⃗v1, . . . , ⃗vk ∈ Rd: top eigenvectors of X^T X, Vk ∈ Rd×k: matrix with columns ⃗v1, . . . , ⃗vk.
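As a sketch (on made-up data), the spectrum and the cumulative fraction of ∥X∥_F^2 captured by the top k directions can be computed directly; an 'elbow' in these numbers suggests a good k.

```python
import numpy as np

rng = np.random.default_rng(0)
# Data close to (but not exactly) rank 4.
X = rng.standard_normal((500, 4)) @ rng.standard_normal((4, 40))
X += 0.1 * rng.standard_normal((500, 40))

eigvals = np.linalg.eigvalsh(X.T @ X)[::-1]   # spectrum of the covariance matrix, largest first
frac = np.cumsum(eigvals) / eigvals.sum()     # fraction of ||X||_F^2 captured by the top k
print(np.round(frac[:8], 3))                  # rises sharply, then flattens after k = 4
```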
SLIDE 7
spectrum analysis
Exercise: Show that the eigenvalues of X^T X are always nonnegative. Hint: Use that λj = ⃗vj^T X^T X ⃗vj.
SLIDE 8 interpretation in terms of correlation
Recall: Low-rank approximation is possible when our data features are correlated. Our compressed dataset is C = XVk, where the columns of Vk are the top k eigenvectors of X^T X. What is the covariance of C?

C^T C = Vk^T X^T X Vk = Vk^T VΛV^T Vk = Λk

- The covariance becomes diagonal: all correlations between features have been removed. Maximal compression.
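A quick numpy check of this fact on synthetic data (illustrative only): the covariance of C = XVk is diagonal, with the top k eigenvalues of X^T X on the diagonal.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
eigvals, eigvecs = np.linalg.eigh(X.T @ X)
Vk = eigvecs[:, -3:]                 # top k = 3 eigenvectors of the covariance matrix

C = X @ Vk                           # compressed dataset
cov = C.T @ C
print(np.allclose(cov, np.diag(np.diag(cov))))               # off-diagonal entries are ~0
print(np.round(np.diag(cov), 2), np.round(eigvals[-3:], 2))  # diagonal = top eigenvalues
```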
SLIDE 9 algorithmic considerations
What is the runtime to compute an optimal low-rank approximation?
- Computing the covariance matrix X^T X requires O(nd^2) time.
- Computing its full eigendecomposition to obtain ⃗v1, . . . , ⃗vk requires O(d^3) time (similar to computing the inverse (X^T X)^−1).
- Many faster iterative and randomized methods exist. Runtime is roughly Õ(ndk) to output just the top k eigenvectors ⃗v1, . . . , ⃗vk.
- Will see in a few classes (power method, Krylov methods).
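As a sketch of such an iterative method in practice (parameters are illustrative), scipy's Lanczos-based eigsh computes only the top-k eigenpairs rather than a full O(d^3) eigendecomposition; for very large problems one would also avoid forming X^T X explicitly and instead apply it as an operator v ↦ X^T(Xv).

```python
import numpy as np
from scipy.sparse.linalg import eigsh

rng = np.random.default_rng(0)
X = rng.standard_normal((5000, 300))
k = 10

# Lanczos (a Krylov method) for the k largest-magnitude eigenpairs of the covariance matrix.
vals, vecs = eigsh(X.T @ X, k=k, which='LM')
order = np.argsort(vals)[::-1]        # sort so the largest eigenvalue comes first
Vk = vecs[:, order]
print(np.round(vals[order][:3], 1))
```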
SLIDE 10 singular value decomposition
The Singular Value Decomposition (SVD) generalizes the eigendecomposition to asymmetric (even rectangular) matrices. Any matrix X ∈ Rn×d with rank(X) = r can be written as X = UΣV^T.
- U has orthonormal columns ⃗u1, . . . , ⃗ur ∈ Rn (left singular vectors).
- V has orthonormal columns ⃗v1, . . . , ⃗vr ∈ Rd (right singular vectors).
- Σ is diagonal with entries σ1 ≥ σ2 ≥ . . . ≥ σr > 0 (singular values).
The SVD is the ‘swiss army knife’ of modern linear algebra.
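A minimal numpy illustration (random matrix, purely for concreteness): np.linalg.svd returns exactly these three factors, and they reconstruct X.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))                     # a rectangular matrix

U, s, Vt = np.linalg.svd(X, full_matrices=False)    # s holds sigma_1 >= ... >= sigma_r
print(np.allclose(X, U @ np.diag(s) @ Vt))          # X = U Sigma V^T
print(np.allclose(U.T @ U, np.eye(4)),              # columns of U are orthonormal
      np.allclose(Vt @ Vt.T, np.eye(4)))            # columns of V are orthonormal
```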
SLIDE 11 connection of the svd to eigendecomposition
Writing X ∈ Rn×d in its singular value decomposition X = UΣV^T:

X^T X = VΣU^T UΣV^T = VΣ^2 V^T (the eigendecomposition of X^T X). Similarly: XX^T = UΣV^T VΣU^T = UΣ^2 U^T.

The right and left singular vectors are the eigenvectors of the covariance matrix X^T X and the gram matrix XX^T, respectively.

So, letting Vk ∈ Rd×k have columns equal to ⃗v1, . . . , ⃗vk, we know that XVkVk^T is the best rank-k approximation to X (given by PCA). What about UkUk^T X, where Uk ∈ Rn×k has columns equal to ⃗u1, . . . , ⃗uk? It gives exactly the same approximation!
X ∈ Rn×d: data matrix, U ∈ Rn×rank(X): matrix with orthonormal columns ⃗u1, ⃗u2, . . . (left singular vectors), V ∈ Rd×rank(X): matrix with orthonormal columns ⃗v1, ⃗v2, . . . (right singular vectors), Σ ∈ Rrank(X)×rank(X): positive diagonal matrix containing the singular values of X.
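A quick numerical check of this claim (on a random matrix, for illustration): projecting the rows onto span(Vk) and projecting the columns onto span(Uk) produce the same rank-k matrix, UkΣkVk^T.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 20))
k = 5

U, s, Vt = np.linalg.svd(X, full_matrices=False)
Uk, Vk = U[:, :k], Vt[:k].T

print(np.allclose(X @ Vk @ Vk.T, Uk @ Uk.T @ X))                  # identical matrices
print(np.allclose(X @ Vk @ Vk.T, Uk @ np.diag(s[:k]) @ Vt[:k]))   # both equal Uk Sigma_k Vk^T
```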
SLIDE 12 the svd and optimal low-rank approximation
The best low-rank approximation to X, Xk = arg min_{rank-k B ∈ Rn×d} ∥X − B∥_F, is given by:

Xk = XVkVk^T = UkUk^T X = UkΣkVk^T

- These correspond to projecting the rows (data points) onto the span of Vk, or the columns (features) onto the span of Uk.
SLIDE 13 the svd and optimal low-rank approximation
The best low-rank approximation to X: Xk = XVkVk^T = UkUk^T X = UkΣkVk^T.
SLIDE 14
the svd and optimal low-rank approximation
SLIDE 15 applications of low-rank approximation
Rest of Class: Examples of how low-rank approximation is applied in a variety of data science applications.
- Used for many reasons other than dimensionality
reduction/data compression.
SLIDE 16 matrix completion
Consider a matrix X ∈ Rn×d which we cannot fully observe, but believe is close to rank-k (i.e., well approximated by a rank-k matrix). Classic example: the Netflix prize problem. Solve:

Y = arg min_{rank-k B} ∑_{observed (j,k)} [Xj,k − Bj,k]^2

Under certain assumptions, one can show that Y well approximates X on both the observed and (most importantly) unobserved entries.
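One simple heuristic that targets this objective (a sketch, not the method analyzed in class) is iterative 'hard impute': fill in the missing entries, take the best rank-k approximation via the SVD, restore the observed entries, and repeat. All data below is synthetic and illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 50, 30, 2
X_true = rng.standard_normal((n, k)) @ rng.standard_normal((k, d))  # low-rank "ratings" matrix
mask = rng.random((n, d)) < 0.5                                     # True where an entry is observed

Y = np.where(mask, X_true, 0.0)              # unobserved entries start at 0
for _ in range(100):
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    Y_k = (U[:, :k] * s[:k]) @ Vt[:k]        # best rank-k approximation of the current fill-in
    Y = np.where(mask, X_true, Y_k)          # keep the observed entries fixed

rel_err = np.linalg.norm((Y_k - X_true)[~mask]) / np.linalg.norm(X_true[~mask])
print(f"relative error on unobserved entries: {rel_err:.3f}")
```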
SLIDE 17 entity embeddings
Dimensionality reduction embeds d-dimensional vectors into d′ dimensions. But what about when you want to embed objects other than vectors?
- Documents (for topic-based search and classification)
- Words (to identify synonyms, translations, etc.)
- Nodes in a social network
Usual Approach: Convert each item into a high-dimensional feature vector and then apply low-rank approximation.
SLIDE 18
example: latent semantic analysis
SLIDE 19 example: latent semantic analysis
Setup: X is the document-term matrix, with Xi,a = 1 when doci contains worda, and we approximate X ≈ YZ^T, where the rows of Y ∈ Rn×k embed documents and the rows of Z ∈ Rd×k embed words.
- If the error ∥X − YZ^T∥_F is small, then on average, Xi,a ≈ (YZ^T)i,a = ⟨⃗yi, ⃗za⟩.
- So ⟨⃗yi, ⃗za⟩ ≈ 1 when doci contains worda.
- If doci and docj both contain worda, then ⟨⃗yi, ⃗za⟩ ≈ ⟨⃗yj, ⃗za⟩ = 1.
SLIDE 20
example: latent semantic analysis
If doci and docj both contain worda, ⟨⃗yi, ⃗za⟩ ≈ ⟨⃗yj, ⃗za⟩ = 1.

Another View: Each column of Y represents a ‘topic’. ⃗yi(j) indicates how much doci belongs to topic j, and ⃗za(j) indicates how much worda associates with that topic.
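A tiny end-to-end sketch of LSA (the documents and words below are made up): factor a binary document-term matrix with a rank-2 SVD, taking Y = UkΣk as document embeddings and Z = Vk as word embeddings (how Σk is split between the two factors is a choice); documents about the same topic end up with similar embeddings.

```python
import numpy as np

# X[i, a] = 1 if document i contains word a (toy corpus for illustration).
words = ["dog", "cat", "pet", "stock", "market"]
X = np.array([
    [1, 1, 1, 0, 0],   # doc 0: pets
    [1, 0, 1, 0, 0],   # doc 1: pets
    [0, 0, 0, 1, 1],   # doc 2: finance
    [0, 1, 1, 0, 0],   # doc 3: pets
    [0, 0, 1, 1, 1],   # doc 4: mixed
], dtype=float)

k = 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)
Y = U[:, :k] * s[:k]     # document embeddings (rows), Y = U_k Sigma_k
Z = Vt[:k].T             # word embeddings (rows), Z = V_k, so X is approximately Y @ Z.T

print(np.round(Y @ Z.T, 2))      # approximates the original 0/1 entries
print(Y[0] @ Y[1], Y[0] @ Y[2])  # pet docs have a much larger dot product than pet vs. finance
```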
SLIDE 21 example: latent semantic analysis
- Just like with documents, ⃗za and ⃗zb will tend to have a high dot product if worda and wordb appear in many of the same documents.
- In an SVD decomposition we set Z = VkΣk (so that Z^T = ΣkVk^T).
- The columns of Vk are equivalently the top k eigenvectors of X^T X. The eigendecomposition of X^T X is X^T X = VΣ^2 V^T.
- What is the best rank-k approximation of X^T X, i.e., arg min_{rank-k B} ∥X^T X − B∥_F? It is VkΣk^2 Vk^T = ZZ^T.
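A small numerical check of this last point (random binary matrix, purely illustrative): with Z = VkΣk, the matrix ZZ^T coincides with the best rank-k approximation of X^T X computed from its own eigendecomposition.

```python
import numpy as np

rng = np.random.default_rng(1)
X = (rng.random((40, 10)) < 0.3).astype(float)   # toy binary document-term matrix

k = 3
U, s, Vt = np.linalg.svd(X, full_matrices=False)
Z = Vt[:k].T * s[:k]                             # word embeddings Z = V_k Sigma_k

G = X.T @ X                                      # word-word co-occurrence counts
evals, evecs = np.linalg.eigh(G)
G_k = evecs[:, -k:] @ np.diag(evals[-k:]) @ evecs[:, -k:].T   # best rank-k approx of X^T X
print(np.allclose(Z @ Z.T, G_k))
```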
SLIDE 22 example: word embedding
LSA gives a way of embedding words into k-dimensional space.
- Embedding is via low-rank approximation of X^T X, where (X^T X)a,b is the number of documents that both worda and wordb appear in.
- Think about XTX as a similarity matrix (gram matrix, kernel matrix)
with entry (a, b) being the similarity between worda and wordb.
- Many ways to measure similarity: number of sentences both occur
in, number of times both appear in the same window of w words, in similar positions of documents in different languages, etc.
- Replacing X^T X with these different metrics (sometimes appropriately transformed) leads to popular word embedding algorithms: word2vec, GloVe, fastText, etc.
SLIDE 23
example: word embedding
Note: word2vec is typically described as a neural-network method, but it is really just low-rank approximation of a specific similarity matrix. Neural word embedding as implicit matrix factorization, Levy and Goldberg.
SLIDE 24 summary
Summary:
- Can use the SVD to understand optimal low-rank approximation in terms of the dual row/column projection view: XVkVk^T = UkUk^T X = UkΣkVk^T.
- The SVD is a generalization of eigendecomposition: the left and right singular vectors are eigenvectors of XX^T and X^T X, respectively.
- Applications of low-rank approximation to matrix completion and entity embeddings.

Next Time: Low-rank representations of graphs and networks. Beginning of spectral graph theory.