slide-1
SLIDE 1

Data Mining II

  • Prof. Dr. Karsten Borgwardt, Department Biosystems, ETH Zürich Basel, Spring Semester 2016

D-BSSE

slide-2
SLIDE 2

Our course - The team

  • Dr. Damian Roqueiro, Dr. Dean Bodenham, Dr. Dominik Grimm, Dr. Xiao He

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 2 / 117

slide-3
SLIDE 3

Our course - Background information

Schedule
Lecture: Wednesdays 14:15-15:50
Tutorial: Wednesdays 16:00-16:45
Room: Manser
Written exam to get the certificate in summer 2016

Structure
Key topics: dimensionality reduction, text mining, graph mining, association rules, and others
Biweekly homework to apply the algorithms in practice; weekly tutorials, rotating between example tutorials and homework solution tutorials

Moodle link: https://moodle-app2.let.ethz.ch/course/view.php?id=2072

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 3 / 117

slide-4
SLIDE 4

Our course - Background information

More information

Homework: Avoid plagiarism and be aware of the consequences. Next week, March 2, we will host the Exam Inspection of the Data Mining 1 Exam, from 15:30-16:30 in this room.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 4 / 117

slide-5
SLIDE 5
  • 1. Dimensionality Reduction

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 5 / 117

slide-6
SLIDE 6

Why Dimensionality Reduction?

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 6 / 117

slide-7
SLIDE 7

Why Dimensionality Reduction?

Definition

Dimensionality reduction is the task of finding an r-dimensional representation of a d-dimensional dataset, with r ≪ d, such that the d-dimensional information is maximally preserved.

Motivation

uncovering the intrinsic dimensionality of the data, visualization, reduction of redundancy, correlation and noise, computational or memory savings.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 7 / 117

slide-8
SLIDE 8

Why Dimensionality Reduction?

[Figure: scatter plots of a 3D dataset (axes x, y, z) containing five classes (class 1 to class 5).]

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 8 / 117

slide-9
SLIDE 9

Why Dimensionality Reduction?

[Figure: the same dataset projected onto its first two principal components (transformed X and Y values), colored by class.]

Variance explained: PC1: 0.55, PC2: 0.32, PC3: 0.13

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 9 / 117

slide-10
SLIDE 10

Reminder: Feature Selection

Feature Selection versus Dimensionality Reduction

Feature selection tries to select a subset of relevant variables: for reduced computational, experimental and storage costs, and for better data understanding and visualization.
Univariate feature selection ignores correlations or even redundancies between the features. The same underlying signal can be represented by several features.
Dimensionality reduction tries to find these underlying signals and to represent the original data in terms of these signals.
Unlike feature selection, these underlying signals are not identical to the original features, but typically combinations of them.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 10 / 117

slide-11
SLIDE 11

1.1 Principal Component Analysis

based on:
Shai Shalev-Shwartz and Shai Ben-David, Understanding Machine Learning: From Theory to Algorithms, Cambridge University Press 2014, Chapter 23.1
Mohammed J. Zaki and Wagner Meira Jr., Data Mining and Analysis: Fundamental Concepts and Algorithms, Cambridge University Press 2014, Chapter 7.2

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 11 / 117

slide-12
SLIDE 12

Dimensionality Reduction: Principal Component Analysis

Goals

To understand why principal component analysis can be interpreted as a compression-recovery scheme.
To understand the link between principal component analysis and the eigenvectors and eigenvalues of the covariance matrix.
To understand what a principal component is.
To understand how to compute principal component analysis efficiently, both for large n and d.
To learn an objective way to determine the number of principal components.
To understand in which sense principal component analysis captures the major axes of variation in the data.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 12 / 117

slide-13
SLIDE 13

Dimensionality Reduction: Principal Component Analysis

Concept

Let x1, . . . , xn be n vectors in Rd. We assume that they are mean-centered. The goal of PCA is to reduce the dimensionality of these vectors using a linear transformation. The matrix W ∈ Rr×d, where r < d, induces a mapping x ↦ Wx, where Wx is the lower-dimensional representation of x. A second matrix U ∈ Rd×r can be used to approximately recover each original vector x from its compressed form. That is, for a compressed vector y = Wx, where y is in the low-dimensional space Rr, we can construct x̃ = Uy, so that x̃ is the recovered version of x in the original space Rd.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 13 / 117

slide-14
SLIDE 14

Principal Component Analysis

Objective

In PCA, the corresponding objective of finding the compression matrix W and the recovery matrix U is then phrased as
\[
\operatorname*{arg\,min}_{W \in \mathbb{R}^{r\times d},\, U \in \mathbb{R}^{d\times r}} \; \sum_{i=1}^{n} \| x_i - U W x_i \|_2^2 \tag{1}
\]
That is, PCA tries to minimize the total squared distance between the original vectors and the recovered vectors.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 14 / 117

slide-15
SLIDE 15

Principal Component Analysis

How to find U and W?

Lemma

Let (U, W) be a solution to Objective (1). Then the columns of U are orthonormal (that is, U⊤U is the r × r identity matrix) and W = U⊤.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 15 / 117

slide-16
SLIDE 16

Principal Component Analysis

Proof

We make the following assumptions: Choose any U and W and consider the mapping x → UW x. The range of this mapping, R = {UW x : x ∈ Rd} is an r-dimensional linear subspace of Rd. Let V ∈ Rd×r be a matrix whose columns form an orthonormal basis of this subspace (i.e. V ⊤V = I). Each vector in R can be written as V y, where y ∈ Rr.

For every x ∈ Rd and y ∈ Rr we have
\[
\|x - Vy\|_2^2 = \|x\|^2 + \|y\|^2 - 2\, y^\top (V^\top x) \tag{2}
\]
This expression is minimized for y = V⊤x.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 16 / 117

slide-17
SLIDE 17

Principal Component Analysis

Proof

Continued: Therefore, for each x we have that
\[
V V^\top x = \operatorname*{arg\,min}_{\tilde{x} \in \mathcal{R}} \|x - \tilde{x}\|_2^2 \tag{3}
\]
This holds for all xi, so we can replace U, W by V, V⊤ without increasing the objective:
\[
\sum_{i=1}^{n} \|x_i - U W x_i\|_2^2 \;\ge\; \sum_{i=1}^{n} \|x_i - V V^\top x_i\|_2^2 \tag{4}
\]
As this holds for any U, W, the proof is complete.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 17 / 117

slide-18
SLIDE 18

Principal Component Analysis

How to find U and W ?

Due to the previous lemma, we can rewrite Objective (1) as
\[
\operatorname*{arg\,min}_{U \in \mathbb{R}^{d\times r}:\, U^\top U = I} \; \sum_{i=1}^{n} \|x_i - U U^\top x_i\|_2^2 \tag{5}
\]

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 18 / 117

slide-19
SLIDE 19

Principal Component Analysis

How to find U and W ?

We can further simplify the objective by the following transformation:
\[
\begin{aligned}
\|x - U U^\top x\|^2 &= \|x\|^2 - 2\, x^\top U U^\top x + x^\top U U^\top U U^\top x && (6)\\
&= \|x\|^2 - x^\top U U^\top x && (7)\\
&= \|x\|^2 - \operatorname{trace}(U^\top x x^\top U). && (8)
\end{aligned}
\]
Note: The trace of a matrix is the sum of its diagonal entries.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 19 / 117

slide-20
SLIDE 20

Principal Component Analysis

How to find U and W ?

This allows us to turn Objective (5) into a trace maximization problem:
\[
\operatorname*{arg\,max}_{U \in \mathbb{R}^{d\times r}:\, U^\top U = I} \; \operatorname{trace}\!\left( U^\top \sum_{i=1}^{n} x_i x_i^\top \, U \right) \tag{9}
\]
If we define \(\Sigma = \frac{1}{n}\sum_{i=1}^{n} x_i x_i^\top\), this is equivalent (up to a constant factor) to
\[
\operatorname*{arg\,max}_{U \in \mathbb{R}^{d\times r}:\, U^\top U = I} \; \operatorname{trace}\!\left( U^\top \Sigma U \right) \tag{10}
\]

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 20 / 117

slide-21
SLIDE 21

Principal Component Analysis

Theorem

Let Σ = VDV ⊤ be the spectral decomposition of Σ. D is a diagonal matrix, such that Di,i is the i-th largest eigenvalue of Σ. The columns of V are the corresponding eigenvectors, and V ⊤V = VV ⊤ = I. Then the solution to Objective (10) is the matrix U whose columns are the r first eigenvectors of Σ.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 21 / 117

slide-22
SLIDE 22

Principal Component Analysis

Proof

Choose a matrix U ∈ Rd×r with orthonormal columns and let B = V⊤U. Then VB = VV⊤U = U. It follows that
\[
U^\top \Sigma U = B^\top V^\top V D V^\top V B = B^\top D B \tag{11}
\]
and therefore
\[
\operatorname{trace}(U^\top \Sigma U) = \sum_{j=1}^{d} D_{j,j} \sum_{i=1}^{r} B_{j,i}^2. \tag{12}
\]
Note that B⊤B = U⊤VV⊤U = U⊤U = I. Hence the columns of B are orthonormal and \(\sum_{j=1}^{d} \sum_{i=1}^{r} B_{j,i}^2 = r\).

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 22 / 117

slide-23
SLIDE 23

Principal Component Analysis

Proof

Define \(\tilde{B} \in \mathbb{R}^{d\times d}\) to be a matrix whose first r columns are the columns of B and which satisfies \(\tilde{B}^\top \tilde{B} = I\). Then for every j, \(\sum_{i=1}^{d} \tilde{B}_{j,i}^2 = 1\), which implies that \(\sum_{i=1}^{r} B_{j,i}^2 \le 1\).

It follows that
\[
\operatorname{trace}(U^\top \Sigma U) \;\le\; \max_{\beta \in [0,1]^d : \|\beta\|_1 \le r} \sum_{j=1}^{d} D_{j,j}\,\beta_j \;=\; \sum_{j=1}^{r} D_{j,j}.
\]
Therefore, for every matrix U ∈ Rd×r with orthonormal columns, the inequality \(\operatorname{trace}(U^\top \Sigma U) \le \sum_{j=1}^{r} D_{j,j}\) holds.

But if we set U to the matrix with the r leading eigenvectors of Σ as its columns, we obtain \(\operatorname{trace}(U^\top \Sigma U) = \sum_{j=1}^{r} D_{j,j}\) and thereby the optimal solution.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 23 / 117

slide-24
SLIDE 24

Principal Component Analysis

Runtime properties

The overall runtime of PCA is O(d³ + d²n):
O(d³) for calculating the eigendecomposition of Σ,
O(d²n) for constructing the matrix Σ.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 24 / 117

slide-25
SLIDE 25

Principal Component Analysis

Speed-up for d >> n

Often the number of features d greatly exceeds the number of samples n. The standard runtime of O(d³ + d²n) is very expensive for large d. However, there is a workaround to perform the same calculations in O(n³ + n²d).

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 25 / 117

slide-26
SLIDE 26

Principal Component Analysis

Speed-up for d >> n

Workaround:

Let X ∈ Rd×n, so that \(\Sigma = \frac{1}{n}\sum_{i=1}^{n} x_i x_i^\top\), or equivalently \(\Sigma = \frac{1}{n} X X^\top\).
Consider K = X⊤X, such that K_{ij} = ⟨x_i, x_j⟩.
Suppose v* is an eigenvector of K: Kv* = λv* for some λ ∈ R. Multiplying the equation by \(\frac{1}{n}X\) and using the definition of K, we obtain
\[
\frac{1}{n} X X^\top X v^* = \frac{1}{n} \lambda X v^*. \tag{13}
\]
But, using the definition of Σ, we get that \(\Sigma (X v^*) = \frac{\lambda}{n} (X v^*)\).
Hence \(\frac{X v^*}{\|X v^*\|}\) is an eigenvector of Σ with eigenvalue \(\frac{\lambda}{n}\).
Therefore, one can calculate the PCA solution by computing the eigendecomposition of K rather than of Σ.
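To make the workaround concrete, here is a minimal NumPy sketch (not part of the original slides) that numerically checks the correspondence between eigenpairs of K = X⊤X and of Σ = (1/n)XX⊤; all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 1000, 50                      # many more features than samples
X = rng.standard_normal((d, n))
X -= X.mean(axis=1, keepdims=True)   # mean-center the samples (columns)

Sigma = X @ X.T / n                  # d x d covariance matrix (expensive for large d)
K = X.T @ X                          # n x n Gram matrix (cheap)

lam, V = np.linalg.eigh(K)           # eigenpairs of K, ascending order
v_star, lam_star = V[:, -1], lam[-1] # leading eigenpair of K

u = X @ v_star
u /= np.linalg.norm(u)               # candidate eigenvector of Sigma

# Sigma u should equal (lam_star / n) * u, as derived on the slide
print(np.allclose(Sigma @ u, (lam_star / n) * u))   # True (up to numerical error)
```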

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 26 / 117

slide-27
SLIDE 27

Principal Component Analysis

Pseudocode

Require: A matrix X of n examples ∈ Rd×n, number of components r
1: for i = 1, . . . , n set x_i ← x_i − (1/n) Σ_{j=1}^{n} x_j
2: if n > d then
3:   Σ = (1/n) X X⊤
4:   Compute the r leading eigenvectors v_1, . . . , v_r of Σ.
5: else
6:   K = X⊤X
7:   Compute the r leading eigenvectors v*_1, . . . , v*_r of K.
8:   for i = 1, . . . , r set v_i := X v*_i / ||X v*_i||
9: end if
10: return Compression matrix W = [v_1, . . . , v_r]⊤ or compressed points WX
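A possible NumPy translation of this pseudocode is sketched below; it is not from the slides, and the function and variable names are chosen for illustration only.

```python
import numpy as np

def pca(X, r):
    """PCA following the pseudocode above.

    X : (d, n) array with one example per column.
    r : number of principal components.
    Returns the (r, d) compression matrix W.
    """
    d, n = X.shape
    X = X - X.mean(axis=1, keepdims=True)      # step 1: mean-center the examples

    if n > d:
        Sigma = X @ X.T / n                    # d x d covariance matrix
        eigvals, eigvecs = np.linalg.eigh(Sigma)
        V = eigvecs[:, ::-1][:, :r]            # r leading eigenvectors of Sigma
    else:
        K = X.T @ X                            # n x n Gram matrix
        eigvals, eigvecs = np.linalg.eigh(K)
        V_star = eigvecs[:, ::-1][:, :r]       # r leading eigenvectors of K
        V = X @ V_star                         # map back to R^d ...
        V /= np.linalg.norm(V, axis=0)         # ... and normalize each column

    return V.T                                 # compression matrix W, shape (r, d)

# usage: Z = pca(X, 2) @ X gives the compressed points W X
```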

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 27 / 117

slide-28
SLIDE 28

Principal Component Analysis

How to choose the number of principal components?

The principal components should capture α per cent of the total variance in the data (α is often set to 90%).
Total variance in the data: \(\sum_{i=1}^{d} \lambda_i\).
Total variance captured by the first r eigenvectors: \(\sum_{i=1}^{r} \lambda_i\).
The variance explained, α, is the ratio between the two:
\[
\alpha = \frac{\sum_{i=1}^{r} \lambda_i}{\sum_{i=1}^{d} \lambda_i}.
\]

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 28 / 117
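As an illustration (not from the slides), this rule can be implemented directly on the eigenvalue spectrum; `eigvals` is assumed to hold the eigenvalues of Σ in arbitrary order.

```python
import numpy as np

def choose_r(eigvals, alpha=0.9):
    """Smallest r such that the first r eigenvalues explain at least alpha of the variance."""
    lam = np.sort(eigvals)[::-1]               # eigenvalues in decreasing order
    explained = np.cumsum(lam) / np.sum(lam)   # cumulative variance explained
    return int(min(np.searchsorted(explained, alpha) + 1, len(lam)))
```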

slide-29
SLIDE 29

Principal Component Analysis

Theorem

The variance captured by the first r eigenvectors of Σ is the sum of its r largest eigenvalues, \(\sum_{i=1}^{r} \lambda_i\).

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 29 / 117

slide-30
SLIDE 30

Principal Component Analysis

Proof

The variance in a (mean-centered) dataset X is defined as
\[
\begin{aligned}
\operatorname{var}(X) &= \frac{1}{n}\sum_{i=1}^{n} \|x_i - 0\|^2 && (14)\\
&= \frac{1}{n}\sum_{i=1}^{n} \|x_i\|^2 = \frac{1}{n}\sum_{i=1}^{n} \langle x_i, x_i\rangle && (15)\\
&= \operatorname{tr}(\Sigma) = \operatorname{tr}(V D V^\top) = \operatorname{tr}(V^\top V D) = \operatorname{tr}(D) && (16)\\
&= \sum_{i=1}^{d} D_{i,i} = \sum_{i=1}^{d} \lambda_i. && (17)
\end{aligned}
\]

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 30 / 117

slide-31
SLIDE 31

Principal Component Analysis

Proof - Continued

The variance in a projected dataset WX, with W = [v_1, . . . , v_r]⊤, is defined as
\[
\begin{aligned}
\operatorname{var}(WX) &= \frac{1}{n}\sum_{j=1}^{n} \|W x_j\|^2 = \frac{1}{n}\sum_{j=1}^{n} \langle W x_j, W x_j\rangle = \frac{1}{n}\sum_{j=1}^{n} x_j^\top W^\top W x_j && (18)\\
&= \frac{1}{n}\sum_{j=1}^{n} x_j^\top (v_1 v_1^\top + \ldots + v_r v_r^\top)\, x_j = \sum_{i=1}^{r} v_i^\top \left(\sum_{j=1}^{n} \frac{1}{n}\, x_j x_j^\top\right) v_i && (19)\\
&= v_1^\top \Sigma v_1 + \ldots + v_r^\top \Sigma v_r = \sum_{i=1}^{r} v_i^\top \Sigma v_i = \sum_{i=1}^{r} v_i^\top \lambda_i v_i = \sum_{i=1}^{r} \lambda_i. && (20)
\end{aligned}
\]
Therefore the variance explained can be written as a ratio of sums over eigenvalues of the covariance matrix Σ.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 31 / 117

slide-32
SLIDE 32

Principal Component Analysis

Geometric Interpretation

An alternative interpretation of PCA is that it finds the major axis of variation in the data. This means that the 1st principal component defines the direction in the data with the greatest variance. The 2nd principal component defines a direction that i) is orthogonal to the 1st principal component and ii) captures the major direction of the remaining variance in the data. In general, the i-th principal component is orthogonal to all previous i − 1 principal components and represents the direction of maximum variance remaining in the data.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 32 / 117

slide-33
SLIDE 33

Principal Component Analysis

Proof: Variance Maximization along 1st principal component

We start by trying to find one principal component v_1 that maximizes the variance of X:
\[
\operatorname*{arg\,max}_{v_1} \operatorname{Var}(v_1^\top X) = \operatorname*{arg\,max}_{v_1} v_1^\top \Sigma v_1
\]
To avoid picking a v_1 with arbitrarily large entries, we enforce \(v_1^\top v_1 = 1\).
We form the Lagrangian to solve this problem
\[
v_1^\top \Sigma v_1 - \lambda (v_1^\top v_1 - 1) \tag{21}
\]
and take the derivative with respect to v_1 and set it to zero:
\[
\frac{\partial}{\partial v_1}\left( v_1^\top \Sigma v_1 - \lambda (v_1^\top v_1 - 1) \right) = 0 \tag{22}
\]

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 33 / 117

slide-34
SLIDE 34

Principal Component Analysis

Proof: Variance Maximization along 1st principal component

\[
\begin{aligned}
\frac{\partial}{\partial v_1}\left( v_1^\top \Sigma v_1 - \lambda (v_1^\top v_1 - 1) \right) &= 0 && (23)\\
2 \Sigma v_1 - 2 \lambda v_1 &= 0 && (24)\\
\Sigma v_1 &= \lambda v_1. && (25)
\end{aligned}
\]
The solution v_1 is an eigenvector of Σ. As \(v_1^\top \Sigma v_1 = v_1^\top \lambda v_1 = \lambda v_1^\top v_1 = \lambda\), the variance is maximized by picking the eigenvector corresponding to the largest eigenvalue. Hence the 1st eigenvector v_1 is the principal component/direction that maximizes the variance of \(v_1^\top X\).

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 34 / 117

slide-35
SLIDE 35

Principal Component Analysis

Proof: Variance Maximization along 2nd principal component

We will now show the solution for the 2nd principal component. The second direction of projection should be independent of the first one: \(\operatorname{cov}(v_2^\top X, v_1^\top X) = 0\).
This can be written as
\[
\operatorname{cov}(v_2^\top X, v_1^\top X) = v_2^\top X X^\top v_1 = v_2^\top \Sigma v_1 = v_2^\top \lambda v_1 = \lambda\, v_2^\top v_1 = 0.
\]
We form the Lagrangian \(v_2^\top \Sigma v_2 - \lambda (v_2^\top v_2 - 1) - \rho (v_2^\top v_1)\) and set the derivative with respect to v_2 to zero: \(2\Sigma v_2 - 2\lambda v_2 - \rho v_1 = 0\). If we multiply this from the left by \(v_1^\top\), the first two terms vanish: \(-\rho = 0\).
As ρ = 0, we are left with \(\Sigma v_2 = \lambda v_2\), showing that v_2 is again an eigenvector of Σ, and we again pick the eigenvector of the now second-largest eigenvalue to maximize the variance along the second principal component. The proofs for the other principal components k > 2 follow the same scheme.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 35 / 117

slide-36
SLIDE 36

Principal Component Analysis

[Figure: 2D scatter plot of a dataset (x vs. y).]

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 36 / 117

slide-37
SLIDE 37

Principal Component Analysis

[Figure: the same 2D dataset with the directions of PC1 and PC2 overlaid.]

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 37 / 117

slide-38
SLIDE 38

Principal Component Analysis

[Figure: the dataset projected onto its principal components (transformed X and Y values).]

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 38 / 117

slide-39
SLIDE 39

Principal Component Analysis

Summary

Principal Component Analysis (Pearson, 1901) is the optimal way to perform dimensionality reduction to a lower-dimensional space with subsequent reconstruction, in the sense of minimal squared reconstruction error.
The principal components are the eigenvectors of the covariance matrix of the data.
The eigenvalues capture the variance that is described by the corresponding eigenvector.
The number of principal components can be chosen such that they capture α per cent of the total variance.
The principal components capture the orthogonal directions of maximum variance in the data.
PCA can be computed in O(min(n³ + n²d, d³ + d²n)).

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 39 / 117

slide-40
SLIDE 40

1.2 Singular Value Decomposition

based on:
Charu Aggarwal, Data Mining - The Textbook, Springer 2015, Chapter 2.4.3.2
Mohammed J. Zaki and Wagner Meira Jr., Data Mining and Analysis: Fundamental Concepts and Algorithms, Cambridge University Press 2014, Chapter 7.4
Dr. R. Costin, Lecture Notes of MATHEMATICS 5101: Linear Mathematics in Finite Dimensions, OSU 2013

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 40 / 117

slide-41
SLIDE 41

Singular Value Decomposition

Goals

Get to know a fundamental matrix decomposition that is widely used throughout data mining. Learn how SVD can be used to obtain a low-rank approximation to a given matrix. Get to know how to compute Principal Component Analysis via SVD. Learn how SVD can speed up fundamental matrix operations in data mining.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 41 / 117

slide-42
SLIDE 42

Singular Value Decomposition

Definition

A Singular Value Decomposition is defined as a factorization of a given matrix D ∈ Rn×d into three matrices:
\[
D = L \Delta R^\top, \tag{26}
\]
where
L is an n × n matrix with orthonormal columns, the left singular vectors,
∆ is an n × d diagonal matrix containing the singular values, and
R is a d × d matrix with orthonormal columns, the right singular vectors.
The singular values are non-negative and by convention arranged in decreasing order. Note that ∆ is not (necessarily) a square matrix.
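For concreteness, a short NumPy sketch (not part of the slides) that computes this factorization and checks the shapes of the three factors:

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.standard_normal((6, 4))                # n = 6 examples, d = 4 features

L, delta, Rt = np.linalg.svd(D)                # full SVD: L is 6x6, Rt is 4x4
Delta = np.zeros_like(D)                       # n x d matrix ...
Delta[:len(delta), :len(delta)] = np.diag(delta)   # ... with the singular values on its diagonal

print(np.allclose(D, L @ Delta @ Rt))          # True: D = L Delta R^T
print(delta)                                   # non-negative, in decreasing order
```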

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 42 / 117

slide-43
SLIDE 43

Singular Value Decomposition

[Figure: block-matrix view of the decomposition D = L Δ R⊤, where D is n × d, L is n × n, Δ is an n × d diagonal matrix with δ₁, δ₂, . . . on its diagonal, and R⊤ is d × d.]

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 43 / 117

slide-44
SLIDE 44

Singular Value Decomposition

Reduced SVD

One can discard the singular vectors that correspond to zero singular values to obtain the reduced SVD:
\[
D = L_r \Delta_r R_r^\top,
\]
where L_r is the n × r matrix of the left singular vectors, R_r is the d × r matrix of the right singular vectors, and ∆_r is the r × r diagonal matrix containing the positive singular values.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 44 / 117

slide-45
SLIDE 45

Singular Value Decomposition

Reduced SVD

The reduced SVD gives rise to the spectral decomposition of D:
\[
\begin{aligned}
D = L_r \Delta_r R_r^\top &= [l_1, l_2, \ldots, l_r]
\begin{pmatrix}
\delta_1 & & & \\
& \delta_2 & & \\
& & \ddots & \\
& & & \delta_r
\end{pmatrix}
\begin{pmatrix}
r_1^\top \\ r_2^\top \\ \vdots \\ r_r^\top
\end{pmatrix} && (27),(28)\\
&= \delta_1 l_1 r_1^\top + \delta_2 l_2 r_2^\top + \ldots + \delta_r l_r r_r^\top = \sum_{i=1}^{r} \delta_i l_i r_i^\top && (29)
\end{aligned}
\]
Hence D is represented as a sum of rank-one matrices of the form \(\delta_i l_i r_i^\top\).

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 45 / 117

slide-46
SLIDE 46

Singular Value Decomposition

Eckart-Young Theorem

By selecting the r largest singular values δ1, δ2, . . . , δr, and the corresponding left and right singular vectors, we obtain the best rank-r approximation to the original matrix D.

Theorem (Eckart-Young Theorem, 1936)

If D_r is the matrix defined as \(\sum_{i=1}^{r} \delta_i l_i r_i^\top\), then D_r is the rank-r matrix that minimizes the objective \(\|D - D_r\|_F\). The Frobenius norm of a matrix A, \(\|A\|_F\), is defined as
\[
\|A\|_F = \sqrt{\sum_{i} \sum_{j} A_{i,j}^2}.
\]
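A small NumPy sketch (an illustration, not from the slides) of the rank-r approximation \(D_r = \sum_{i=1}^{r}\delta_i l_i r_i^\top\) and its Frobenius error:

```python
import numpy as np

def best_rank_r(D, r):
    """Best rank-r approximation of D in Frobenius norm (Eckart-Young)."""
    L, delta, Rt = np.linalg.svd(D, full_matrices=False)
    return L[:, :r] @ np.diag(delta[:r]) @ Rt[:r, :]   # sum of delta_i * l_i * r_i^T for i <= r

rng = np.random.default_rng(0)
D = rng.standard_normal((50, 30))
D2 = best_rank_r(D, 2)
print(np.linalg.matrix_rank(D2))            # 2
print(np.linalg.norm(D - D2, 'fro'))        # minimal over all rank-2 matrices
```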

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 46 / 117

slide-47
SLIDE 47

Singular Value Decomposition

Proof of Eckart-Young Theorem (Part 1/2)

Assume D is of rank k (k > r). Since L and R are orthogonal, \(\|D - D_r\|_F = \|L\Delta R^\top - D_r\|_F = \|\Delta - L^\top D_r R\|_F\). Denoting N = L⊤D_rR, we can compute the squared Frobenius norm as
\[
\|\Delta - N\|_F^2 = \sum_{i,j} |\Delta_{i,j} - N_{i,j}|^2 = \sum_{i=1}^{k} |\delta_i - N_{i,i}|^2 + \sum_{i>k} |N_{i,i}|^2 + \sum_{i\neq j} |N_{i,j}|^2.
\]
This is minimized if all off-diagonal terms of N and all N_{i,i} for i > k are zero. Since N can have rank at most r, the minimum of \(\sum_i |\delta_i - N_{i,i}|^2\) is obtained for N_{i,i} = δ_i for i = 1, . . . , r, while all other N_{i,i} are zero.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 47 / 117

slide-48
SLIDE 48

Singular Value Decomposition

Proof of Eckart-Young Theorem (Part 2/2)

We do not need the full L and R for computing D_r, only their first r columns. This can be seen by splitting L and R into blocks, L = [L_r, L_0] and R = [R_r, R_0], where ∆_r below denotes the singular-value matrix with all but the r largest singular values set to zero:
\[
\begin{aligned}
D_r = L \Delta_r R^\top &= [L_r, L_0]\, \Delta_r \begin{pmatrix} R_r^\top \\ R_0^\top \end{pmatrix} && (30)\\
&= [L_r, L_0] \begin{pmatrix} \Delta_r R_r^\top \\ 0 \end{pmatrix} = L_r \Delta_r R_r^\top. && (31)
\end{aligned}
\]

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 48 / 117

slide-49
SLIDE 49

Singular Value Decomposition

Geometric interpretation: Column and row space

Any n × d matrix D represents a linear transformation, D : Rd → Rn, from the space of d-dimensional vectors to the space of n-dimensional vectors, because for any x ∈ Rd, there exists a y ∈ Rn, such that Dx = y. The column space of D is defined as the set of all vectors y ∈ Rn such that Dx = y over all possible x ∈ Rd. The row space of D is defined as the set of all vectors x ∈ Rd such that D⊤y = x over all possible y ∈ Rn. The row space of D is the column space of D⊤.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 49 / 117

slide-50
SLIDE 50

Singular Value Decomposition

Geometric interpretation: Null spaces

The set of all vectors x ∈ Rd, such that Dx = 0 is called the (right) null space of D. The set of all vectors y ∈ Rn, such that D⊤y = 0 is called the left null space of D.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 50 / 117

slide-51
SLIDE 51

Singular Value Decomposition

Geometric interpretation: SVD as a ‘basis-generator’

SVD gives a basis for each of the four spaces associated with a matrix D. If D has rank r, then it has only r independent columns and only r independent rows. The r left singular vectors l1, l2, . . . , lr corresponding to the r nonzero singular values of D represent a basis for the column space of D. The remaining n − r left singular vectors lr+1, . . . , ln represent a basis for the left null space of D. The r right singular vectors r1, r2, . . . , rr, corresponding to the r non-zero singular values represent a basis for the row space of D. The remaining d − r right singular vectors rr+1, . . . , rd represent a basis for the (right) null space of D.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 51 / 117

slide-52
SLIDE 52

Singular Value Decomposition

Geometric interpretation: SVD as a ‘basis-generator’

Proof of right null space:
\[
\begin{aligned}
Dx &= 0 && (32)\\
L^\top L \Delta R^\top x &= 0 && (33)\\
R^\top x &=: \tilde{x} && (34)\\
\Delta \tilde{x} &= (\delta_1 \tilde{x}_1, \delta_2 \tilde{x}_2, \ldots, \delta_r \tilde{x}_r, 0, \ldots, 0)^\top = 0 && (35)\\
x &= R \tilde{x} && (36)
\end{aligned}
\]
The weights for the first r right singular vectors r_1, . . . , r_r are zero. Hence x is a linear combination of the d − r right singular vectors r_{r+1}, . . . , r_d.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 52 / 117

slide-53
SLIDE 53

Singular Value Decomposition

Geometric interpretation: SVD as a ‘basis-generator’

Consider the reduced SVD expression from Equation (27). Right-multiplying both sides by R_r and noting that \(R_r^\top R_r = I\), we obtain:
\[
\begin{aligned}
D R_r &= L_r \Delta_r R_r^\top R_r && (37)\\
D R_r &= L_r \Delta_r && (38)\\
D R_r &= L_r \begin{pmatrix} \delta_1 & & & \\ & \delta_2 & & \\ & & \ddots & \\ & & & \delta_r \end{pmatrix} && (39)\\
D\,[r_1, r_2, \ldots, r_r] &= [\delta_1 l_1, \delta_2 l_2, \ldots, \delta_r l_r] && (40)
\end{aligned}
\]

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 53 / 117

slide-54
SLIDE 54

Singular Value Decomposition

Geometric interpretation: SVD as a ‘basis-generator’

Hence Dri = δili for all i = 1, . . . , r. That means, SVD is a special factorization of the matrix D such that any basis vector ri for the row space is mapped to the corresponding basis vector li for the column space, scaled by δi. SVD can be thought of as a mapping from an orthonormal basis (r1, r2, . . . , rr) in Rd, the row space, to an orthonormal basis (l1, l2, . . . , lr) in Rn, the column space, with the corresponding axes scaled according to the singular values δ1, δ2, . . . , δr.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 54 / 117

slide-55
SLIDE 55

Singular Value Decomposition

Link to PCA

There is a direct link between PCA and SVD, which we will elucidate next. This link allows the computation of PCA via SVD. We assume that D is a d × n matrix.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 55 / 117

slide-56
SLIDE 56

Singular Value Decomposition

Link to PCA: The covariance matrix via SVD

The columns of the matrix L, which are also the left singular vectors, are the orthonormal eigenvectors of DD⊤.
Proof: \(D D^\top = L \Delta (R^\top R) \Delta^\top L^\top = L \Delta \Delta^\top L^\top\)
Note that R⊤R = I. The diagonal entries of ∆∆⊤, that is, the squared nonzero singular values, are therefore the nonzero eigenvalues of DD⊤.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 56 / 117

slide-57
SLIDE 57

Singular Value Decomposition

Link to PCA: The covariance matrix via SVD

The covariance matrix of mean-centered data is \(\frac{1}{n} D D^\top\), and the left singular vectors of the SVD are eigenvectors of DD⊤. It follows that the eigenvectors of PCA are the same as the left singular vectors of the SVD for mean-centered data. Furthermore, the squared singular values of the SVD are n times the eigenvalues of PCA.
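This correspondence can be checked numerically; the following sketch (illustrative, not from the slides) assumes D is a d × n matrix of mean-centered data, as on this slide.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 100
D = rng.standard_normal((d, n))
D -= D.mean(axis=1, keepdims=True)             # mean-center the n samples

# PCA: eigendecomposition of the covariance matrix (1/n) D D^T
Sigma = D @ D.T / n
eigvals, eigvecs = np.linalg.eigh(Sigma)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # sort in decreasing order

# SVD of the data matrix
L, delta, Rt = np.linalg.svd(D, full_matrices=False)

# Left singular vectors = PCA eigenvectors (up to sign), and delta^2 = n * eigenvalues
print(np.allclose(delta**2 / n, eigvals))
print(np.allclose(np.abs(L), np.abs(eigvecs)))
```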

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 57 / 117

slide-58
SLIDE 58

Singular Value Decomposition

Link to PCA: The kernel matrix via SVD

The columns of the matrix R, which are the right singular vectors, are the orthonormal eigenvectors of D⊤D.
Proof: \(D^\top D = R \Delta^\top (L^\top L) \Delta R^\top = R \Delta^\top \Delta R^\top\)
Note that L⊤L = I. The diagonal entries of ∆⊤∆, that is, the squared nonzero singular values of ∆, are therefore the nonzero eigenvalues of D⊤D.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 58 / 117

slide-59
SLIDE 59

Singular Value Decomposition

Link to PCA: The kernel matrix via SVD

The (linear) kernel matrix on D is D⊤D, and the right singular vectors of the SVD are eigenvectors of D⊤D.
It follows that each eigenvector v of PCA can be expressed in terms of a right singular vector r of the SVD:
\[
v = \frac{D r}{\|D r\|}
\]
Again, the squared singular values of the SVD are n times the eigenvalues of PCA \(\left(\lambda_{\Sigma} = \frac{\lambda_K}{n} = \frac{\delta^2}{n}\right)\).

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 59 / 117

slide-60
SLIDE 60

Singular Value Decomposition

Applications of SVD and PCA beyond dimensionality reduction

Noise reduction: Removing smaller eigenvectors or singular vectors often improves data quality.
Data imputation: The reduced SVD matrices L_r, ∆_r and R_r can be computed even for incomplete matrices and used to impute missing values (→ link prediction).
Linear equations: Obtain a basis of the solution space of Dx = 0.
Matrix inversion: Assume D is square and invertible. Then D = L∆R⊤ ⇒ D⁻¹ = R∆⁻¹L⊤.
Powers of a matrix: Assume D is square and positive semidefinite. Then D = L∆L⊤ and Dᵏ = L∆ᵏL⊤.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 60 / 117

slide-61
SLIDE 61

Singular Value Decomposition

Summary

Singular Value Decomposition (SVD) is a decomposition of a given matrix D into three submatrices. SVD can be used to obtain an optimal low-rank approximation of a matrix in terms of the Frobenius norm. SVD generates bases for the four spaces associated with a matrix. D can be thought of as a mapping between the basis vectors of its row space and the scaled basis vectors of its column space. SVD can be used to implement PCA. SVD is used throughout data mining for efficient matrix computations.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 61 / 117

slide-62
SLIDE 62

1.3 Kernel Principal Component Analysis

based on:
B. Schölkopf, A. Smola, K.R. Müller, Nonlinear Component Analysis as a Kernel Eigenvalue Problem. Max Planck Institute for Biological Cybernetics, Technical Report No. 44, 1996
Mohammed J. Zaki and Wagner Meira Jr., Data Mining and Analysis: Fundamental Concepts and Algorithms, Cambridge University Press 2014, Chapter 7.3

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 62 / 117

slide-63
SLIDE 63

Kernel Principal Component Analysis

[Figure: Nonlinear 2D dataset (x vs. y).]

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 63 / 117

slide-64
SLIDE 64

Kernel Principal Component Analysis

[Figure: a second nonlinear 2D dataset (x vs. y).]

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 64 / 117

slide-65
SLIDE 65

Kernel Principal Component Analysis

Goals

Learn how to perform non-linear dimensionality reduction. Learn how to perform PCA in feature space solely in terms of kernel functions. Learn how to compute the projections of points onto principal components in feature space.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 65 / 117

slide-66
SLIDE 66

Kernel Principal Component Analysis

Kernel Principal Component Analysis (Schölkopf, Smola, Müller, 1996)

PCA is restrictive in the sense that it only allows for linear dimensionality reduction. What about non-linear dimensionality reduction?
Idea: Move the computation of principal components to feature space. This approach exists and is called Kernel PCA (Schölkopf, Smola and Müller, 1996)!
Define a mapping φ : X → H, x ↦ φ(x).

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 66 / 117

slide-67
SLIDE 67

Kernel Principal Component Analysis

Kernel Principal Component Analysis

We assume that we are dealing with centered data: \(\sum_{i=1}^{n} \varphi(x_i) = 0\).
The covariance matrix then takes the form \(C = \frac{1}{n}\sum_{i=1}^{n} \varphi(x_i)\varphi(x_i)^\top\).
Then we have to find eigenvalues λ ≥ 0 and nonzero eigenvectors v ∈ H \ {0} satisfying λv = Cv.
All solutions v with λ ≠ 0 lie in the span of φ(x_1), . . . , φ(x_n), due to the fact that
\[
\lambda v = C v = \frac{1}{n}\sum_{i=1}^{n} \left(\varphi(x_i)^\top v\right) \varphi(x_i).
\]

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 67 / 117

slide-68
SLIDE 68

Kernel Principal Component Analysis

Kernel Principal Component Analysis

The first useful consequence is that we can consider the following equations instead:
\[
\lambda \langle \varphi(x_j), v\rangle = \langle \varphi(x_j), C v\rangle \quad \text{for all } j = 1, \ldots, n,
\]
and the second is that there exist coefficients \(\alpha_j\ (j = 1, \ldots, n)\) such that \(v = \sum_{i=1}^{n} \alpha_i \varphi(x_i)\).
Combining these two consequences, we get for all j = 1, . . . , n:
\[
\lambda \sum_{i=1}^{n} \alpha_i \langle \varphi(x_j), \varphi(x_i)\rangle = \frac{1}{n} \sum_{i=1}^{n} \alpha_i \left\langle \varphi(x_j), \sum_{k=1}^{n} \varphi(x_k) \langle \varphi(x_k), \varphi(x_i)\rangle \right\rangle.
\]

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 68 / 117

slide-69
SLIDE 69

Kernel Principal Component Analysis

Kernel Principal Component Analysis

Combining these two consequences, we get for all j = 1, . . . , n:
\[
\lambda \sum_{i=1}^{n} \alpha_i \langle \varphi(x_j), \varphi(x_i)\rangle = \frac{1}{n} \sum_{i=1}^{n} \alpha_i \sum_{k=1}^{n} \langle \varphi(x_j), \varphi(x_k)\rangle \langle \varphi(x_k), \varphi(x_i)\rangle.
\]

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 69 / 117

slide-70
SLIDE 70

Kernel Principal Component Analysis

Kernel PCA as an eigenvector problem

In terms of the n × n Gram matrix \(K_{j,i} := \langle \varphi(x_j), \varphi(x_i)\rangle\), this can be rewritten as
\[
n\lambda K \alpha = K^2 \alpha, \tag{41}
\]
where α denotes the column vector with entries α_1, . . . , α_n. To find solutions of Equation (41), we solve the eigenvalue problem
\[
n\lambda \alpha = K \alpha, \tag{42}
\]
which we obtain by multiplying (41) by K⁻¹ from the left.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 70 / 117

slide-71
SLIDE 71

Kernel Principal Component Analysis

Normalizing the coefficients

We require the eigenvectors v^k to have unit length, that is, \(\langle v^k, v^k\rangle = 1\) for all k = 1, . . . , r. That means that
\[
\begin{aligned}
1 &= \langle v^k, v^k\rangle && (43)\\
&= \sum_{i,j=1}^{n} \alpha_i^k \alpha_j^k \langle \varphi(x_i), \varphi(x_j)\rangle && (44)\\
&= \sum_{i,j=1}^{n} \alpha_i^k \alpha_j^k K_{i,j} && (45)\\
&= \langle \alpha^k, K \alpha^k\rangle = \lambda_k \langle \alpha^k, \alpha^k\rangle. && (46)
\end{aligned}
\]

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 71 / 117

slide-72
SLIDE 72

Kernel Principal Component Analysis

Normalizing the coefficients

As eigenvectors of K, the α^k have unit norm. Therefore we have to rescale them by \(\frac{1}{\sqrt{\lambda_k}}\) to enforce that their norm is \(\frac{1}{\sqrt{\lambda_k}}\).

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 72 / 117

slide-73
SLIDE 73

Kernel Principal Component Analysis

Projecting points onto principal components

A point x can be projected onto the principal component v^k (for k = 1, . . . , r) via
\[
\langle v^k, \varphi(x)\rangle = \sum_{i=1}^{n} \alpha_i^k \langle \varphi(x_i), \varphi(x)\rangle,
\]
resulting in an r-dimensional representation of x based on Kernel PCA in H.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 73 / 117

slide-74
SLIDE 74

Kernel Principal Component Analysis

How to center the kernel matrix in H?

The kernel matrix K can be centered via
\[
\begin{aligned}
\tilde{K} &= K - \tfrac{1}{n} 1_{n\times n} K - \tfrac{1}{n} K\, 1_{n\times n} + \tfrac{1}{n^2} 1_{n\times n} K\, 1_{n\times n} && (47)\\
&= \left(I - \tfrac{1}{n} 1_{n\times n}\right) K \left(I - \tfrac{1}{n} 1_{n\times n}\right) && (48)
\end{aligned}
\]
using the notation \((1_{n\times n})_{ij} := 1\) for all i, j.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 74 / 117

slide-75
SLIDE 75

Kernel Principal Component Analysis

Pseudocode: KernelPCA(X, r)

Require: A matrix X of n examples ∈ Rd×n, number of components r
1: K_{i,j} = k(x_i, x_j) for i, j = 1, . . . , n
2: K := (I − (1/n) 1_{n×n}) K (I − (1/n) 1_{n×n})
3: (λ_1, . . . , λ_r) = eigenvalues(K)
4: (α_1, . . . , α_r) = eigenvectors(K)
5: α_i := (1/√λ_i) α_i for all i = 1, . . . , r
6: A = (α_1, . . . , α_r)
7: return Set of projected points: Z = {z_i | z_i = A⊤K(:, i), for i = 1, . . . , n}
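A compact NumPy sketch of this pseudocode (illustrative only; the RBF kernel and its bandwidth parameter are assumptions, not part of the slides):

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """RBF kernel matrix for X with one example per column (d x n)."""
    sq = np.sum(X**2, axis=0)
    d2 = sq[:, None] + sq[None, :] - 2 * X.T @ X       # pairwise squared distances
    return np.exp(-gamma * d2)

def kernel_pca(X, r, gamma=1.0):
    n = X.shape[1]
    K = rbf_kernel(X, gamma)                           # step 1: kernel matrix
    H = np.eye(n) - np.ones((n, n)) / n
    K = H @ K @ H                                      # step 2: double-center K
    eigvals, eigvecs = np.linalg.eigh(K)
    eigvals = eigvals[::-1][:r]                        # steps 3-4: r leading eigenpairs
    A = eigvecs[:, ::-1][:, :r] / np.sqrt(eigvals)     # step 5: rescale alpha_i by 1/sqrt(lambda_i)
    return A.T @ K                                     # step 7: projected points, shape (r, n)
```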

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 75 / 117

slide-76
SLIDE 76

Kernel Principal Component Analysis

Summary

Non-linear dimensionality reduction can be performed in feature space in KernelPCA. At the heart of KernelPCA is finding eigenvectors of the kernel matrix. One can compute projections of the data onto the principal components explicitly, but not the principal components themselves (unless function φ is explicitly known).

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 76 / 117

slide-77
SLIDE 77

1.4 Multidimensional Scaling

based on:
Wolfgang Karl Härdle, Leopold Simar, Applied Multivariate Statistical Analysis, Springer 2015, Chapter 17
Trevor Hastie, Robert Tibshirani and Jerome Friedman, The Elements of Statistical Learning, Springer 2008, Second Edition, Chapters 14.8 and 14.9

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 77 / 117

slide-78
SLIDE 78

Multidimensional Scaling

Pairwise distance matrix between six objects (diagonal entries are zero):

      0   214   279   610   596   237
    214     0   492   533   496   444
    279   492     0   520   772   140
    610   533   520     0   521   687
    596   496   772   521     0   771
    237   444   140   687   771     0

Source of table and subsequent figure: Wolfgang Karl Härdle, Leopold Simar, Applied Multivariate Statistical Analysis, Springer 2015, Chapter 17

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 78 / 117

slide-79
SLIDE 79

Multidimensional Scaling

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 79 / 117

slide-80
SLIDE 80

Multidimensional Scaling

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 80 / 117

slide-81
SLIDE 81

Multidimensional Scaling

Goals

Find a low-dimensional representation of data for which only distances, similarities or dissimilarities are given.
Understand the link between multidimensional scaling and PCA.

Applications

Visualize similarities or dissimilarities between high-dimensional objects, e.g. DNA sequences, protein structures

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 81 / 117

slide-82
SLIDE 82

Multidimensional Scaling

Setting

We assume that we are given the pairwise distances, similarities or dissimilarities d_{i,j} between all pairs of points in a dataset. Note: we do not need to know the actual coordinates of the points, as in PCA.
The goal in multidimensional scaling (MDS) is to
1. find a low-dimensional representation of the data points,
2. which maximally preserves the pairwise distances.
The solution is only determined up to rotation, reflection and shift.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 82 / 117

slide-83
SLIDE 83

Multidimensional Scaling

Optimization problem

We assume that the original data objects are x_1, . . . , x_n ∈ Rd, and the distance between x_i and x_j is d_{i,j}. The goal is to find a lower-dimensional representation of these n objects: z_1, . . . , z_n ∈ Rr. Hence the objective in metric MDS is to minimize the so-called stress function S_M:
\[
\operatorname*{arg\,min}_{z_1,\ldots,z_n} S_M(z_1, \ldots, z_n) = \sum_{i \neq j} (d_{ij} - \|z_i - z_j\|)^2 \tag{49}
\]

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 83 / 117

slide-84
SLIDE 84

Multidimensional Scaling

Classic scaling

Definition

An n × n distance matrix D_{ij} = d_{ij} is Euclidean if for some points x_1, . . . , x_n ∈ Rd:
\[
d_{ij}^2 = (x_i - x_j)^\top (x_i - x_j).
\]

Theorem

Define an n × n matrix A with \(A_{ij} = -\frac{1}{2} d_{ij}^2\) and B = HAH, where H is the centering matrix. D is Euclidean if and only if B is positive semidefinite. If D is the distance matrix of a data matrix X, then B = HXX⊤H. B is called the inner product matrix.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 84 / 117

slide-85
SLIDE 85

Multidimensional Scaling

Classical scaling - From distances to inner products: Overview

If distances are Euclidean, we can convert them to centered inner products. Let \(d_{ij}^2 = \|x_i - x_j\|^2\) be the matrix of pairwise squared Euclidean distances. Then we write
\[
d_{ij}^2 = \|x_i - \bar{x}\|^2 + \|x_j - \bar{x}\|^2 - 2\langle x_i - \bar{x},\, x_j - \bar{x}\rangle \tag{50}
\]
Defining \(A_{ij} = \{-\frac{1}{2} d_{ij}^2\}\), we double-center A to obtain B:
\[
B = HAH, \tag{51}
\]
where \(H = I - \frac{1}{n} 1 1^\top\). B is then the matrix of centered inner products.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 85 / 117

slide-86
SLIDE 86

Multidimensional Scaling

Classical scaling - From distances to inner products: Step by step

The task of MDS is to find the original Euclidean coordinates from a given distance matrix. The squared Euclidean distance between the i-th and j-th points is given by
\[
d_{ij}^2 = \sum_{k=1}^{d} (x_{ik} - x_{jk})^2
\]
The general term b_{ij} of B is given by
\[
b_{ij} = \sum_{k=1}^{d} x_{ik} x_{jk} = x_i^\top x_j
\]
D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 86 / 117

slide-87
SLIDE 87

Multidimensional Scaling

Classical scaling - From distances to inner products: Step by step

It is possible to derive B from the known squared distances d_{ij}, and then, from B, the unknown coordinates:
\[
d_{ij}^2 = x_i^\top x_i + x_j^\top x_j - 2 x_i^\top x_j = b_{ii} + b_{jj} - 2 b_{ij} \tag{52}
\]
Centering of the coordinate matrix X implies that \(\sum_{i=1}^{n} b_{ij} = 0\).

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 87 / 117

slide-88
SLIDE 88

Multidimensional Scaling

Classical scaling - From distances to inner products: Step by step

Summing over i and j, we obtain
\[
\begin{aligned}
\frac{1}{n}\sum_{i=1}^{n} d_{ij}^2 &= \frac{1}{n}\sum_{i=1}^{n} b_{ii} + b_{jj} && (53)\\
\frac{1}{n}\sum_{j=1}^{n} d_{ij}^2 &= b_{ii} + \frac{1}{n}\sum_{j=1}^{n} b_{jj} && (54)\\
\frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} d_{ij}^2 &= \frac{2}{n}\sum_{i=1}^{n} b_{ii} && (55)\\
b_{ij} &= -\frac{1}{2}\left( d_{ij}^2 - d_{i\bullet}^2 - d_{\bullet j}^2 + d_{\bullet\bullet}^2 \right) && (56)
\end{aligned}
\]

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 88 / 117

slide-89
SLIDE 89

Multidimensional Scaling

Classical scaling - From distances to inner products: Step by step

With \(a_{ij} = -\frac{1}{2} d_{ij}^2\), and
\[
\begin{aligned}
a_{i\bullet} &= \frac{1}{n}\sum_{j=1}^{n} a_{ij} && (57)\\
a_{\bullet j} &= \frac{1}{n}\sum_{i=1}^{n} a_{ij} && (58)\\
a_{\bullet\bullet} &= \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} a_{ij}, && (59)
\end{aligned}
\]
we get \(b_{ij} = a_{ij} - a_{i\bullet} - a_{\bullet j} + a_{\bullet\bullet}\).

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 89 / 117

slide-90
SLIDE 90

Multidimensional Scaling

Classical scaling - From distances to inner products: Step by step

Define the matrix A as (a_{ij}) and observe that
\[
B = HAH \tag{60}
\]
The inner product matrix B can be expressed as
\[
B = XX^\top, \tag{61}
\]
where X = (x_1, . . . , x_n)⊤ is the n × p matrix of coordinates. The rank of B is then
\[
\operatorname{rank}(B) = \operatorname{rank}(XX^\top) = \operatorname{rank}(X) = p. \tag{62}
\]

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 90 / 117

slide-91
SLIDE 91

Multidimensional Scaling

Classical scaling - From distances to inner products: Step by step

B is symmetric, positive semidefinite and of rank p, and therefore has p non-zero eigenvalues. B can now be written as
\[
B = V_p \Lambda_p V_p^\top, \tag{63}
\]
where Λ_p = diag(λ_1, . . . , λ_p) is the diagonal matrix of the eigenvalues of B, and V_p = (v_1, . . . , v_p) is the matrix of corresponding eigenvectors. Hence the coordinate matrix X containing the point configuration in Rp is given by
\[
X = V_p \Lambda_p^{\frac{1}{2}} \tag{64}
\]

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 91 / 117

slide-92
SLIDE 92

Multidimensional Scaling

Lower-dimensional approximation

What if we want the representation of X to be lower-dimensional than p (r-dimensional, where r < p)?
Minimize \(\operatorname*{arg\,min}_{Z : ZZ^\top = B_r} \|B - B_r\|_F\).
This is achieved if B_r is the rank-r approximation of B based on SVD. Then
\[
Z = V_r \Lambda_r^{\frac{1}{2}}. \tag{65}
\]

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 92 / 117

slide-93
SLIDE 93

Multidimensional Scaling

Classical scaling - Pseudocode

The algorithm for recovering coordinates from distances D_{ij} = d_{ij} between pairs of points is as follows:

1. Form the matrix \(A_{ij} = \{-\frac{1}{2} d_{ij}^2\}\).
2. Form the matrix B = HAH, where H is the centering matrix \(H = I - \frac{1}{n} 1 1^\top\).
3. Find the spectral decomposition of B, B = VΛV⊤, where Λ is the diagonal matrix formed from the eigenvalues of B, and V is the matrix of corresponding eigenvectors.
4. If the points were originally in a p-dimensional space, the first p eigenvalues of B are nonzero and the remaining n − p are zero. Discard the zero eigenvalues from Λ (rename it Λ_p), and discard the corresponding eigenvectors from V (rename it V_p).
5. Find \(X = V_p \Lambda_p^{\frac{1}{2}}\); the coordinates of the points are then given by the rows of X.
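A direct NumPy translation of these five steps (a sketch, not from the slides; the function name is illustrative):

```python
import numpy as np

def classical_mds(D, r):
    """Recover r-dimensional coordinates from an n x n distance matrix D."""
    n = D.shape[0]
    A = -0.5 * D**2                                 # step 1
    H = np.eye(n) - np.ones((n, n)) / n             # centering matrix
    B = H @ A @ H                                   # step 2
    eigvals, eigvecs = np.linalg.eigh(B)            # step 3: spectral decomposition
    idx = np.argsort(eigvals)[::-1][:r]             # step 4: keep the r largest eigenvalues
    return eigvecs[:, idx] * np.sqrt(eigvals[idx])  # step 5: rows are the recovered points
```

For example, it could be applied with r = 2 to the 6 × 6 distance table shown on slide 78 to obtain a planar configuration of the six objects (up to rotation, reflection and shift).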

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 93 / 117

slide-94
SLIDE 94

Multidimensional Scaling

Non-metric multidimensional scaling

Classic scaling, and more general metric scaling, approximate actual dissimilarities or similarities between the data. Non-metric scaling effectively uses a ranking of the dissimilarities to obtain a low-dimensional approximation of the data.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 94 / 117

slide-95
SLIDE 95

Multidimensional Scaling

Methods for nonlinear dimensionality reduction

The underlying idea of these methods is that the data lie close to an intrinsically low-dimensional nonlinear manifold embedded in a high-dimensional space. The methods "flatten" the manifold and thereby create a low-dimensional representation of the data and their relative locations on the manifold. They tend to be useful for systems with high signal-to-noise ratios.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 95 / 117

slide-96
SLIDE 96

Multidimensional Scaling

ISOMAP (Tenenbaum et al., 2000)

Isometric feature mapping (ISOMAP) constructs a graph to approximate the geodesic distance between points along the manifold:
Find all points in the ε-neighborhood of a point.
Connect the point to its neighbors by an edge.
The distance between all non-adjacent points is approximated by the shortest-path (geodesic) distance along the neighborhood graph.
Finally, classical scaling is applied to the graph distances.

Hence ISOMAP can be thought of as multidimensional scaling on an ε-neighborhood graph.
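A minimal sketch of these steps (illustrative only; it uses a k-nearest-neighbor graph instead of an ε-neighborhood and SciPy's shortest-path routine, which are assumptions, not part of the slides):

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse.csgraph import shortest_path

def isomap(X, n_neighbors=10, r=2):
    """X: (n, d) data matrix. Returns an (n, r) embedding."""
    n = X.shape[0]
    dist = cdist(X, X)                                    # pairwise Euclidean distances
    graph = np.full((n, n), np.inf)                       # inf marks non-edges
    for i in range(n):                                    # keep only edges to the k nearest neighbors
        nn = np.argsort(dist[i])[1:n_neighbors + 1]
        graph[i, nn] = dist[i, nn]
        graph[nn, i] = dist[i, nn]
    # n_neighbors must be large enough to keep the graph connected
    geo = shortest_path(graph, method='D')                # geodesic (shortest-path) distances
    # classical scaling on the geodesic distances
    A = -0.5 * geo**2
    H = np.eye(n) - np.ones((n, n)) / n
    B = H @ A @ H
    eigvals, eigvecs = np.linalg.eigh(B)
    idx = np.argsort(eigvals)[::-1][:r]
    return eigvecs[:, idx] * np.sqrt(eigvals[idx])
```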

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 96 / 117

slide-97
SLIDE 97

Multidimensional Scaling

ISOMAP (Tenenbaum et al., 2000)

1. Find the neighbours of each point
2. Connect the point to its neighbours by an edge
3. Compute shortest paths between non-adjacent points
4. MDS on graph distances

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 97 / 117

slide-98
SLIDE 98

Multidimensional Scaling

Local linear embedding (Roweis and Saul, 2000)

Key idea: Local linear embedding (LLE) approximates each point as a linear combination of its neighbors. Then a lower dimensional representation is constructed that best preserves these local approximations.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 98 / 117

slide-99
SLIDE 99

Multidimensional Scaling

Local linear embedding (Roweis and Saul, 2000)

1. Find the k-nearest neighbours of each point
2. Compute weights for each point (linear combination of its neighbours)
3. Compute embedding coordinates for fixed weights

[Figure: a point x0 approximated as a weighted combination of its neighbours x1, x2, x3 with weights w1, w2, w3: x0 = x1·w1 + x2·w2 + x3·w3.]

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 99 / 117

slide-100
SLIDE 100

Multidimensional Scaling

Local linear embedding (Roweis and Saul, 2000)

For each data point x_i in d dimensions, we find its k nearest neighbors N(i) in Euclidean distance. We approximate each point by an affine mixture of the points in its neighborhood:
\[
\min_{W_{il}} \left\| x_i - \sum_{l \in N(i)} w_{il}\, x_l \right\|^2 \tag{66}
\]
over weights w_{il} satisfying \(w_{il} = 0\) for \(l \notin N(i)\) and \(\sum_{l=1}^{N} w_{il} = 1\). Here w_{il} is the contribution of point l to the reconstruction of point i. Note that, to have any hope of a unique solution, we must have k < d.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 100 / 117

slide-101
SLIDE 101

Multidimensional Scaling

Local linear embedding (Roweis and Saul, 2000)

In the final step, we find points y_i in a space of dimension r < d to minimize
\[
\sum_{i=1}^{N} \left\| y_i - \sum_{l=1}^{N} w_{il}\, y_l \right\|^2 \tag{67}
\]
with w_{il} fixed.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 101 / 117

slide-102
SLIDE 102

Multidimensional Scaling

Local linear embedding (Roweis and Saul, 2000)

In step 3, the following expression is minimized:
\[
\begin{aligned}
&\operatorname{tr}\left( (Y - WY)^\top (Y - WY) \right) && (68)\\
=\; &\operatorname{tr}\left( Y^\top (I - W)^\top (I - W)\, Y \right) && (69)
\end{aligned}
\]
where W is N × N and Y is N × r, for some small r < d. The solutions are the trailing eigenvectors of M = (I − W)⊤(I − W). The constant vector 1 is a trivial eigenvector with eigenvalue 0, which is discarded; the solution is given by the next r eigenvectors. As a side effect, 1⊤Y = 0, that is, the embedding coordinates are centered.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 102 / 117

slide-103
SLIDE 103

Multidimensional Scaling

Local MDS (Chen and Buja, 2008)

Local MDS defines N to be the symmetric set of nearby pairs of points: an edge exists between two points (i, i′) if i is among the k nearest neighbors of i′, or vice versa. Then the stress function to be minimized is
\[
S_L(z_1, \ldots, z_N) = \sum_{(i,i')\in N} (d_{ii'} - \|z_i - z_{i'}\|)^2 + \sum_{(i,i')\notin N} w\,(D - \|z_i - z_{i'}\|)^2 \tag{70}
\]
D is a large constant and w is a weight. A large choice of D means that non-neighbors should be far apart. A small choice of w ensures that non-neighbors do not dominate the overall objective function.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 103 / 117

slide-104
SLIDE 104

Multidimensional Scaling

Local MDS (Chen and Buja, 2008)

Setting w ∼ 1/D and letting D → ∞, we obtain:
\[
\begin{aligned}
S_L(z_1, \ldots, z_N) &= \sum_{(i,i')\in N} (d_{ii'} - \|z_i - z_{i'}\|)^2 - \tau \sum_{(i,i')\notin N} \|z_i - z_{i'}\| && (71)\\
&\quad + w \sum_{(i,i')\notin N} \|z_i - z_{i'}\|^2 && (72)
\end{aligned}
\]
where τ = 2wD. The last term vanishes for D → ∞. The first term tries to preserve local structure, and the second term encourages the representations z_i, z_{i'} for non-neighbor pairs (i, i′) to be further apart. Local MDS optimizes (z_1, . . . , z_N) for fixed values of k and τ.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 104 / 117

slide-105
SLIDE 105

Multidimensional Scaling

Summary

Multidimensional scaling learns a low-dimensional representation of the data given only dissimilarity or similarity scores.
The solution of classic scaling is given by the scaled eigenvectors of the centered inner product matrix; these coincide with the projections onto the principal components in PCA.
ISOMAP, Local Linear Embedding and Local MDS try to approximate local distances rather than all pairwise distances, to better capture the underlying manifold by nonlinear dimensionality reduction.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 105 / 117

slide-106
SLIDE 106

1.5 Self-Organizing Maps

based on: Trevor Hastie, Robert Tibshirani and Jerome Friedman, The Elements of Statistical Learning, Springer 2008, Second Edition, Chapter 14.4

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 106 / 117

slide-107
SLIDE 107

Self-Organizing Maps

Goals

Understand the key idea behind self-organizing maps.
Understand the computation of self-organizing maps.
Understand the link to k-means.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 107 / 117

slide-108
SLIDE 108

Self-Organizing Maps

Key idea (Kohonen, 1982)

The idea is to learn a map of the data, i.e. a low-dimensional embedding.
Self-Organizing Maps can be thought of as a constrained version of k-means clustering, in which the prototypes are encouraged to lie in a one- or two-dimensional manifold in the feature space.
This manifold is also referred to as a constrained topological map, since the original high-dimensional observations can be mapped down onto the two-dimensional coordinate system.
The original SOM algorithm was online, but batch versions have been proposed in the meantime.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 108 / 117

slide-109
SLIDE 109

Self-Organizing Maps

Key idea (Kohonen, 1982)

We consider a SOM with a two-dimensional rectangular grid of k prototypes m_j ∈ Rd. The choice of a grid is arbitrary; other choices such as hexagonal grids are also possible.
Each of the k prototypes is parameterized with respect to an integer coordinate pair ℓ_j ∈ Q1 × Q2. Here Q1 = {1, 2, . . . , q1}, Q2 = {1, 2, . . . , q2}, and k = q1q2.
One can think of the prototypes as "buttons" arranged in a regular pattern. Intuitively, the SOM tries to bend the plane so that the buttons approximate the data points as well as possible. Once the model is fit, the observations can be mapped onto the two-dimensional grid.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 109 / 117

slide-110
SLIDE 110

Self-Organizing Maps


FIGURE 14.15. Simulated data in three classes, near the surface of a half-sphere.

Source: Hastie, Tibshirani, Friedman, 2008

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 110 / 117

slide-111
SLIDE 111

Self-Organizing Maps

FIGURE 14.16. Self-organizing map applied to the half-sphere data example. The left panel is the initial configuration, the right panel the final one. The 5 × 5 grid of prototypes is indicated by circles, and the points that project to each prototype are plotted randomly within the corresponding circle.

Source: Hastie, Tibshirani, Friedman, 2008

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 111 / 117

slide-112
SLIDE 112

FIGURE 14.17. Wiremesh representation of the fitted SOM model in R³. The lines represent the horizontal and vertical edges of the topological lattice. The double lines indicate that the surface was folded diagonally back on itself in order to model the red points. The cluster members have been jittered to indicate their color, and the purple points are the node centers.

Source: Hastie, Tibshirani, Friedman, 2008

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 112 / 117

slide-113
SLIDE 113

Self-Organizing Maps

Key idea (Kohonen, 1982)

The observations x_i are processed one at a time. We find the closest prototype m_j to x_i in Euclidean distance in Rd, and then for all neighbors m_k of m_j, we move m_k toward x_i via the update
\[
m_k \leftarrow m_k + \alpha\, (x_i - m_k)
\]
The neighbors of m_j are defined to be all m_k such that the distance between ℓ_j and ℓ_k is small. α ∈ R is the learning rate and determines the scale of the step. The simplest approach uses Euclidean distance, and "small" is determined by a threshold ε. The neighborhood always includes the closest prototype itself.
Notice that the distance is defined in the space Q1 × Q2 of integer topological coordinates of the prototypes, rather than in the feature space Rd. The effect of the update is to move the prototypes closer to the data, but also to maintain a smooth two-dimensional spatial relationship between the prototypes.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 113 / 117

slide-114
SLIDE 114

Self-Organizing Maps

Practical considerations

The performance is influenced by the learning rate α and the distance threshold ε. Typically α is decreased from 1 to 0 over a few thousand iterations. Similarly, ε is decreased linearly from a starting value R to 1 over a few thousand iterations.
The description above refers to the simplest instance of SOM. In more advanced approaches, the update step is weighted by distance:
\[
m_k \leftarrow m_k + \alpha\, h(\|\ell_j - \ell_k\|)\,(x_i - m_k), \tag{73}
\]
where the neighborhood function h gives more weight to prototypes m_k with indices ℓ_k closer to ℓ_j than to those farther away.
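A bare-bones sketch of the online update on a q1 × q2 grid (illustrative only; the constant learning rate and fixed threshold are simplifications of the decreasing schedules described above, and all names are assumptions):

```python
import numpy as np

def som(X, q1=5, q2=5, n_iter=2000, alpha=0.5, eps=2.0):
    """X: (n, d) data. Returns the (q1*q2, d) prototypes and their grid coordinates."""
    rng = np.random.default_rng(0)
    n, d = X.shape
    grid = np.array([(i, j) for i in range(q1) for j in range(q2)], dtype=float)  # l_j coordinates
    M = X[rng.choice(n, q1 * q2, replace=True)].copy()     # initialize prototypes from the data

    for t in range(n_iter):                                 # in practice, decay alpha and eps over t
        x = X[rng.integers(n)]                              # process one observation at a time
        j = np.argmin(np.linalg.norm(M - x, axis=1))        # closest prototype m_j in R^d
        neighbors = np.linalg.norm(grid - grid[j], axis=1) <= eps   # small distance on the grid
        M[neighbors] += alpha * (x - M[neighbors])          # m_k <- m_k + alpha (x_i - m_k)
    return M, grid
```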

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 114 / 117

slide-115
SLIDE 115

Self-Organizing Maps

Link to k-means

If ε is chosen so small that each neighborhood contains only one point, then the spatial connection between prototypes is lost. In that case, SOM is an online version of k-means, and converges to a local minimum of the k-means objective.
Since SOM is a constrained version of k-means, it is important to check whether the constraint is reasonable. One can do this by computing the reconstruction error ||x − m_j|| with respect to the closest prototype, across all points. This is necessarily smaller for k-means, but should not be much smaller if the SOM is a reasonable approximation.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 115 / 117

slide-116
SLIDE 116

FIGURE 14.18. Half-sphere data: reconstruction error for the SOM as a function of iteration. Error for k-means clustering is indicated by the horizontal line.

Source: Hastie, Tibshirani, Friedman, 2008

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 116 / 117

slide-117
SLIDE 117

Self-Organizing Maps

Summary

Self-Organizing Maps are a variant of k-means for nonlinear dimensionality reduction and visualisation. As in k-means, points are iteratively assigned to cluster means. Unlike k-means, updates of cluster means are constrained and synchronized based on a user-defined lattice between cluster means.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 117 / 117