1. Principal Component Analysis and Autoencoders
Shuiwang Ji
Department of Computer Science & Engineering, Texas A&M University

2. Orthogonal Matrices
1 An orthogonal matrix is a square matrix whose columns and rows are orthogonal unit vectors, i.e., orthonormal vectors. That is, if a matrix $Q$ is an orthogonal matrix, we have $Q^T Q = Q Q^T = I$.
2 It leads to $Q^{-1} = Q^T$, which is a very useful property as it provides an easy way to compute the inverse.
3 For an orthogonal $n \times n$ matrix $Q = [q_1, q_2, \ldots, q_n]$, where $q_i \in \mathbb{R}^n$, $i = 1, 2, \ldots, n$, it is easy to see that $q_i^T q_j = 0$ when $i \neq j$ and $q_i^T q_i = 1$.
4 Furthermore, suppose $Q_1 = [q_1, q_2, \ldots, q_i]$ and $Q_2 = [q_{i+1}, q_{i+2}, \ldots, q_n]$. Then we have $Q_1^T Q_1 = I$, $Q_2^T Q_2 = I$, but $Q_1 Q_1^T \neq I$, $Q_2 Q_2^T \neq I$.
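
A quick numerical check of these properties may be helpful. The sketch below is not part of the slides; it assumes NumPy and obtains an orthogonal $Q$ from the QR factorization of a random matrix (the matrix size and seed are arbitrary choices).

```python
import numpy as np

# Build an orthogonal matrix Q from the QR factorization of a random matrix.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))

I = np.eye(4)
print(np.allclose(Q.T @ Q, I))             # True: Q^T Q = I
print(np.allclose(Q @ Q.T, I))             # True: Q Q^T = I
print(np.allclose(np.linalg.inv(Q), Q.T))  # True: Q^{-1} = Q^T

# Keeping only the first two columns gives Q1 with Q1^T Q1 = I (2 x 2),
# but Q1 Q1^T is a rank-2 projection, not the 4 x 4 identity.
Q1 = Q[:, :2]
print(np.allclose(Q1.T @ Q1, np.eye(2)))   # True
print(np.allclose(Q1 @ Q1.T, I))           # False
```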

3. Eigen-Decomposition
1 A square $n \times n$ matrix $S$ with $n$ linearly independent eigenvectors can be factorized as $S = Q \Lambda Q^{-1}$, where $Q$ is the square $n \times n$ matrix whose columns are eigenvectors of $S$, and $\Lambda$ is the diagonal matrix whose diagonal elements are the corresponding eigenvalues.
2 Note that only diagonalizable matrices can be factorized in this way.
3 If $S$ is a symmetric matrix, its eigenvectors are orthogonal. Thus $Q$ is an orthogonal matrix and we have $S = Q \Lambda Q^T$.
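
As an aside (not from the slides), the symmetric case can be verified numerically. The sketch below assumes NumPy, whose `eigh` routine returns orthonormal eigenvectors of a symmetric matrix.

```python
import numpy as np

# Eigen-decompose a symmetric matrix S and verify S = Q Lambda Q^T with orthogonal Q.
rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
S = A + A.T                              # make a symmetric matrix

eigvals, Q = np.linalg.eigh(S)           # eigh returns orthonormal eigenvectors
Lam = np.diag(eigvals)

print(np.allclose(S, Q @ Lam @ Q.T))     # True: S = Q Lambda Q^T
print(np.allclose(Q.T @ Q, np.eye(4)))   # True: Q is orthogonal
```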

4. Singular Value Decomposition
The singular value decomposition (SVD) of an $m \times n$ real matrix $R$ (without loss of generality, we assume $m \ge n$) can be written as $R = U \tilde{\Sigma} V^T$, where $U$ is an orthogonal $m \times m$ matrix, $V$ is an orthogonal $n \times n$ matrix, and $\tilde{\Sigma}$ is a diagonal $m \times n$ matrix with non-negative real values on the diagonal. That is, $U^T U = U U^T = I_{m \times m}$, $V^T V = V V^T = I_{n \times n}$, and

$$\tilde{\Sigma} = \begin{bmatrix} \Sigma_{n \times n} \\ 0 \end{bmatrix}_{m \times n}, \qquad
\Sigma_{n \times n} = \begin{bmatrix}
\sigma_1 & 0 & \cdots & 0 \\
0 & \sigma_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \sigma_n
\end{bmatrix}, \qquad (1)$$

where $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_n \ge 0$ are known as singular values. If $\mathrm{rank}(R) = r$ ($r \le n$), we have $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r > 0$ and $\sigma_{r+1} = \sigma_{r+2} = \cdots = \sigma_n = 0$.
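
The shapes in Eqn. (1) can be checked directly. The sketch below (not from the slides) assumes NumPy; the $6 \times 4$ size and random seed are arbitrary.

```python
import numpy as np

# Full SVD of a tall (m >= n) matrix R, matching the block shapes in Eqn. (1).
rng = np.random.default_rng(2)
m, n = 6, 4
R = rng.standard_normal((m, n))

U, s, Vt = np.linalg.svd(R, full_matrices=True)   # U: m x m, Vt: n x n
Sigma_tilde = np.zeros((m, n))
Sigma_tilde[:n, :n] = np.diag(s)                  # [Sigma; 0], an m x n matrix

print(np.allclose(R, U @ Sigma_tilde @ Vt))       # True: R = U Sigma~ V^T
print(s)                                          # singular values, sorted descending
```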

5. Relation to Eigen-Decomposition
The columns of $U$ (left-singular vectors) are orthonormal eigenvectors of $R R^T$, and the columns of $V$ (right-singular vectors) are orthonormal eigenvectors of $R^T R$. In other words, we have $R R^T = U \Lambda U^{-1}$ and $R^T R = V \Lambda V^{-1}$. It is easy to verify this, as we have

$$R^T R = \left( U \begin{bmatrix} \Sigma \\ 0 \end{bmatrix} V^T \right)^T \left( U \begin{bmatrix} \Sigma \\ 0 \end{bmatrix} V^T \right) = V \begin{bmatrix} \Sigma & 0 \end{bmatrix} \begin{bmatrix} \Sigma \\ 0 \end{bmatrix} V^T = V \Sigma^2 V^T,$$

$$R R^T = \left( U \begin{bmatrix} \Sigma \\ 0 \end{bmatrix} V^T \right) \left( U \begin{bmatrix} \Sigma \\ 0 \end{bmatrix} V^T \right)^T = U \begin{bmatrix} \Sigma \\ 0 \end{bmatrix} \begin{bmatrix} \Sigma & 0 \end{bmatrix} U^T = U \begin{bmatrix} \Sigma^2 & 0 \\ 0 & 0 \end{bmatrix} U^T,$$

and $V^T = V^{-1}$, $U^T = U^{-1}$.
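
The same relation can be observed numerically: the squared singular values of $R$ are the eigenvalues of $R^T R$ (and of $R R^T$, padded with zeros). The sketch below is not from the slides and assumes NumPy.

```python
import numpy as np

# Compare squared singular values of R with the eigenvalues of R^T R and R R^T.
rng = np.random.default_rng(3)
R = rng.standard_normal((6, 4))

U, s, Vt = np.linalg.svd(R, full_matrices=True)
eig_RtR = np.sort(np.linalg.eigvalsh(R.T @ R))[::-1]   # eigenvalues of R^T R, descending
eig_RRt = np.sort(np.linalg.eigvalsh(R @ R.T))[::-1]   # eigenvalues of R R^T, descending

print(np.allclose(eig_RtR, s**2))        # True: eigenvalues of R^T R are sigma_i^2
print(np.allclose(eig_RRt[:4], s**2))    # True: same for R R^T; the rest are ~0
```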

6. SVD and Eigen-Decomposition
1 Under what conditions are SVD and eigen-decomposition the same? First, $R$ is a symmetric matrix, i.e., $R = R^T$. Second, $R$ is a positive semi-definite matrix, i.e., $\forall x \in \mathbb{R}^n$, $x^T R x \ge 0$.
2 The difference between $\Lambda$ in eigen-decomposition and $\Sigma$ in SVD is that the diagonal entries of $\Lambda$ can be negative, while the diagonal entries of $\Sigma$ are non-negative. What are the fundamental reasons underlying this difference? Why do the requirements on the singular values in SVD (non-negative and in sorted order) not prevent the generality of SVD?
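
As a concrete illustration of the sign difference (not from the slides, assuming NumPy): for a symmetric but indefinite matrix, the singular values are the absolute values of the eigenvalues, so the two decompositions differ.

```python
import numpy as np

# A symmetric, indefinite 2 x 2 matrix: eigenvalues -1 and 4.
S = np.array([[0.0, 2.0],
              [2.0, 3.0]])

eigvals = np.linalg.eigvalsh(S)     # eigenvalues, ascending
_, singvals, _ = np.linalg.svd(S)   # singular values, descending

print(eigvals)      # [-1.  4.]  (Lambda can contain negative entries)
print(singvals)     # [ 4.  1.]  (Sigma = sorted absolute values, non-negative)
```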

7. Compact SVD
If $\mathrm{rank}(R) = r$ ($r \le n$), we have

$$R = U \tilde{\Sigma} V^T = [u_1, u_2, \ldots, u_r, \ldots, u_m]
\begin{bmatrix}
\sigma_1 & & & & \\
& \ddots & & & \\
& & \sigma_r & & \\
& & & 0 & \\
& & & & \ddots
\end{bmatrix}
\begin{bmatrix} v_1^T \\ v_2^T \\ \vdots \\ v_n^T \end{bmatrix}.$$

By removing zero components, we obtain

$$R = U_r \Sigma_r V_r^T = [u_1, u_2, \ldots, u_r]
\begin{bmatrix}
\sigma_1 & & \\
& \ddots & \\
& & \sigma_r
\end{bmatrix}
\begin{bmatrix} v_1^T \\ v_2^T \\ \vdots \\ v_r^T \end{bmatrix}
= [\sigma_1 u_1, \sigma_2 u_2, \ldots, \sigma_r u_r]
\begin{bmatrix} v_1^T \\ v_2^T \\ \vdots \\ v_r^T \end{bmatrix}
= \sum_{i=1}^{r} \sigma_i u_i v_i^T,$$

where $\mathrm{rank}(\sigma_i u_i v_i^T) = 1$, $i = 1, 2, \ldots, r$.
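
The rank-one expansion can be checked on a small example. The sketch below is not from the slides; it assumes NumPy and constructs a rank-2 matrix on purpose.

```python
import numpy as np

# Build a rank-r matrix and reconstruct it from the compact SVD as a sum of
# r rank-one terms sigma_i * u_i * v_i^T.
rng = np.random.default_rng(4)
m, n, r = 6, 4, 2
R = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))   # rank-2 by construction

U, s, Vt = np.linalg.svd(R, full_matrices=False)
print(np.sum(s > 1e-10))            # 2: only r singular values are non-zero

R_compact = sum(s[i] * np.outer(U[:, i], Vt[i]) for i in range(r))
print(np.allclose(R, R_compact))    # True: R equals the sum of r rank-1 terms
```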

8. Truncated SVD and Best Low-Rank Approximation
We can also approximate the matrix $R$ using only the $k$ largest singular values as

$$R_k = U_k \Sigma_k V_k^T = \sum_{i=1}^{k} \sigma_i u_i v_i^T.$$

In general, $R \neq R_k$ unless $\mathrm{rank}(R) \le k$. This approximation is the best in the following sense:

$$\min_{B:\, \mathrm{rank}(B) \le k} \|R - B\|_F = \|R - R_k\|_F = \sqrt{\sum_{i=k+1}^{n} \sigma_i^2},$$

$$\min_{B:\, \mathrm{rank}(B) \le k} \|R - B\|_2 = \|R - R_k\|_2 = \sigma_{k+1},$$

where $\|\cdot\|_F$ denotes the Frobenius norm and $\|\cdot\|_2$ denotes the spectral norm, defined as the largest singular value of the matrix. That is, $R_k$ is the best rank-$k$ approximation to $R$ in terms of both the Frobenius norm and the spectral norm. Note the difference in the approximation errors when different matrix norms are used.
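
Both error formulas are easy to verify numerically. The sketch below is not from the slides; it assumes NumPy, and the matrix size and choice of $k$ are arbitrary.

```python
import numpy as np

# Form the rank-k truncation R_k and check the Frobenius and spectral errors.
rng = np.random.default_rng(5)
R = rng.standard_normal((6, 4))
k = 2

U, s, Vt = np.linalg.svd(R, full_matrices=False)
R_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]

fro_err = np.linalg.norm(R - R_k, 'fro')   # Frobenius-norm error
spec_err = np.linalg.norm(R - R_k, 2)      # spectral-norm error (largest singular value)

print(np.isclose(fro_err, np.sqrt(np.sum(s[k:]**2))))   # True: sqrt(sum of trailing sigma_i^2)
print(np.isclose(spec_err, s[k]))                       # True: sigma_{k+1}
```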

9. What is PCA?
1 Principal Component Analysis (PCA) is a statistical procedure that can be used to achieve feature (dimensionality) reduction.
2 Note that feature reduction is different from feature selection. After feature reduction, we still use all the features, while feature selection selects a subset of features to use.
3 The goal of PCA is to project the high-dimensional features to a lower-dimensional space with maximal variance and minimum reconstruction error simultaneously.
4 We derive PCA based on maximizing variance, and then we show that the solution also minimizes reconstruction error.
5 In machine learning, PCA is an unsupervised learning technique, and therefore does not need labels.

10. PCA to 1D
1 To introduce PCA, we start from the simple case where PCA projects the features to a 1-dimensional space.
2 Formally, suppose we have $n$ $p$-dimensional ($p > 1$) features $x_1, x_2, \ldots, x_n \in \mathbb{R}^p$.
3 Let $a \in \mathbb{R}^p$ represent a projection such that $a^T x_i = z_i$, $i = 1, 2, \ldots, n$, where $z_1, z_2, \ldots, z_n \in \mathbb{R}^1$.
4 PCA aims to solve
$$a^* = \arg\max_{\|a\| = 1} \frac{1}{n} \sum_{i=1}^{n} (z_i - \bar{z})^2.$$
5 Note that
$$\frac{1}{n} \sum_{i=1}^{n} (z_i - \bar{z})^2$$
is the variance of the reduced data, which means that PCA tries to find the projection with the maximum variance in the reduced data.

11. PCA to 1D
Since
$$\bar{z} = \frac{1}{n} \sum_{i=1}^{n} z_i = \frac{1}{n} \sum_{i=1}^{n} a^T x_i = a^T \left( \frac{1}{n} \sum_{i=1}^{n} x_i \right) = a^T \bar{x},$$
the problem can be written as
$$\begin{aligned}
a^* &= \arg\max_{\|a\| = 1} \frac{1}{n} \sum_{i=1}^{n} (z_i - \bar{z})^2 \\
&= \arg\max_{\|a\| = 1} \frac{1}{n} \sum_{i=1}^{n} (a^T x_i - \bar{z})^2 \\
&= \arg\max_{\|a\| = 1} \frac{1}{n} \sum_{i=1}^{n} (a^T x_i - a^T \bar{x})^2 \\
&= \arg\max_{\|a\| = 1} \frac{1}{n} \sum_{i=1}^{n} a^T (x_i - \bar{x})(x_i - \bar{x})^T a \\
&= \arg\max_{\|a\| = 1} a^T \underbrace{\left( \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^T \right)}_{p \times p \text{ covariance matrix}} a \\
&= \arg\max_{\|a\| = 1} a^T C a,
\end{aligned}$$
where $C = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^T$ denotes the covariance matrix.
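
The identity at the end of this derivation can be sanity-checked numerically: for a unit vector $a$, the variance of the projections $z_i = a^T x_i$ equals $a^T C a$, and the maximizer is the top eigenvector of $C$. The sketch below is not from the slides; it assumes NumPy, and the synthetic data and seed are arbitrary.

```python
import numpy as np

# Projected variance equals the quadratic form a^T C a with the 1/n covariance.
rng = np.random.default_rng(6)
n, p = 200, 3
X = rng.standard_normal((n, p)) @ np.diag([3.0, 1.0, 0.3])   # rows are x_i^T

x_bar = X.mean(axis=0)
C = (X - x_bar).T @ (X - x_bar) / n        # p x p covariance matrix (1/n version)

a = rng.standard_normal(p)
a /= np.linalg.norm(a)                     # a random unit-norm projection
z = X @ a                                  # z_i = a^T x_i

print(np.isclose(z.var(), a @ C @ a))      # True: projected variance = a^T C a

# The maximizer over ||a|| = 1 is the top eigenvector of C.
eigvals, eigvecs = np.linalg.eigh(C)       # ascending eigenvalues
a_star = eigvecs[:, -1]
print((X @ a_star).var(), eigvals[-1])     # both equal the largest eigenvalue
```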

12. PCA to k-Dimensional Space
1 What if we want to project the features to a $k$-dimensional space? Then the PCA problem becomes
$$A^* = \arg\max_{A \in \mathbb{R}^{p \times k}:\, A^T A = I_k} \mathrm{trace}\left( A^T C A \right), \qquad (2)$$
where $A = [a_1, a_2, \cdots, a_k] \in \mathbb{R}^{p \times k}$. Note that when projecting onto a $k$-dimensional space, PCA requires different projection vectors to be orthogonal. Also, the trace above is the sum of the variances after projecting the data onto each of the $k$ directions, as
$$\mathrm{trace}\left( A^T C A \right) = \sum_{i=1}^{k} a_i^T C a_i.$$
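
The trace interpretation can be checked directly: for any orthonormal $A$, $\mathrm{trace}(A^T C A)$ equals the total variance of the $k$ projected coordinates. The sketch below is not from the slides and assumes NumPy.

```python
import numpy as np

# trace(A^T C A) equals the sum of variances of the k projected coordinates.
rng = np.random.default_rng(9)
n, p, k = 300, 4, 2
X = rng.standard_normal((n, p))

Xc = X - X.mean(axis=0)
C = Xc.T @ Xc / n                                  # p x p covariance matrix

A, _ = np.linalg.qr(rng.standard_normal((p, k)))   # a random orthonormal p x k matrix
Z = Xc @ A                                         # projected (centered) data, n x k

print(np.isclose(np.trace(A.T @ C @ A),
                 Z.var(axis=0, ddof=0).sum()))     # True
```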

13. Ky Fan Theorem
1 Solving the problem in Eqn. (2) requires the following theorem.
2 Theorem (Ky Fan). Let $H \in \mathbb{R}^{n \times n}$ be a symmetric matrix with eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$ and corresponding eigenvectors $U = [u_1, \ldots, u_n]$. Then
$$\lambda_1 + \cdots + \lambda_k = \max_{A \in \mathbb{R}^{n \times k}:\, A^T A = I_k} \mathrm{trace}\left( A^T H A \right),$$
and the optimal $A^*$ is given by $A^* = [u_1, \ldots, u_k]\, Q$, with $Q$ an arbitrary orthogonal matrix.
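
A brute-force numerical illustration (not a proof, and not from the slides) may make the statement concrete; it assumes NumPy and compares the top-$k$ eigenvector solution against many random orthonormal matrices.

```python
import numpy as np

# Illustrate the Ky Fan theorem: over orthonormal A (n x k), trace(A^T H A)
# is maximized by the top-k eigenvectors, with maximum lambda_1 + ... + lambda_k.
rng = np.random.default_rng(7)
n, k = 5, 2
B = rng.standard_normal((n, n))
H = B + B.T                                   # symmetric matrix

eigvals, eigvecs = np.linalg.eigh(H)          # ascending eigenvalues
A_star = eigvecs[:, -k:]                      # top-k eigenvectors
best = np.trace(A_star.T @ H @ A_star)
print(np.isclose(best, eigvals[-k:].sum()))   # True: equals lambda_1 + ... + lambda_k

# Random orthonormal A's never exceed the optimum (up to numerical noise).
for _ in range(1000):
    A, _ = np.linalg.qr(rng.standard_normal((n, k)))
    assert np.trace(A.T @ H @ A) <= best + 1e-9
print("no random A beat the top-k eigenvectors")
```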

14. Solutions to PCA
1 Note that in Eqn. (2) the covariance matrix $C$ is a symmetric matrix. Given the above theorem, we directly obtain
$$\lambda_1 + \cdots + \lambda_k = \max_{A \in \mathbb{R}^{p \times k}:\, A^T A = I_k} \mathrm{trace}\left( A^T C A \right), \qquad A^* = [u_1, \ldots, u_k]\, Q,$$
where $\lambda_1, \ldots, \lambda_k$ are the $k$ largest eigenvalues of the covariance matrix $C$, $u_1, \ldots, u_k$ are the corresponding eigenvectors, and $Q$ is an arbitrary orthogonal matrix.
2 It also follows from the above theorem that solutions to PCA are not unique; they differ by an orthogonal matrix. We use the special case where $Q = I$, i.e., $A^* = [u_1, \ldots, u_k]$.
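
Putting the pieces together, a minimal PCA sketch that follows this solution (taking $Q = I$) might look as follows. It is not from the slides; it assumes NumPy, the function name `pca` and the synthetic data are my own choices, and the check at the end confirms that the projected variances equal the top eigenvalues of $C$.

```python
import numpy as np

def pca(X, k):
    """Project rows of X (n x p) onto the top-k principal directions."""
    x_bar = X.mean(axis=0)
    Xc = X - x_bar                             # center the data
    C = Xc.T @ Xc / X.shape[0]                 # p x p covariance matrix (1/n version)
    eigvals, eigvecs = np.linalg.eigh(C)       # ascending eigenvalues
    A = eigvecs[:, ::-1][:, :k]                # columns: top-k eigenvectors (Q = I)
    return Xc @ A, A, eigvals[::-1][:k]

rng = np.random.default_rng(8)
X = rng.standard_normal((500, 5)) @ np.diag([4.0, 2.0, 1.0, 0.5, 0.1])
Z, A, top_eigvals = pca(X, k=2)

print(A.shape, Z.shape)                        # (5, 2) (500, 2)
print(np.allclose(Z.var(axis=0, ddof=0),
                  top_eigvals))                # True: projected variances = top eigenvalues
```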
