

SLIDE 1

Principal Component Analysis and Autoencoders

Shuiwang Ji
Department of Computer Science & Engineering, Texas A&M University

SLIDE 2

Orthogonal Matrices

1. An orthogonal matrix is a square matrix whose columns and rows are orthogonal unit vectors, i.e., orthonormal vectors. That is, if a matrix Q is an orthogonal matrix, we have Q^T Q = QQ^T = I.

2. It leads to Q^{-1} = Q^T, which is a very useful property as it provides an easy way to compute the inverse.

3. For an orthogonal n × n matrix Q = [q_1, q_2, ..., q_n], where q_i ∈ R^n, i = 1, 2, ..., n, it is easy to see that q_i^T q_j = 0 when i ≠ j and q_i^T q_i = 1.

4. Furthermore, suppose Q_1 = [q_1, q_2, ..., q_i] and Q_2 = [q_{i+1}, q_{i+2}, ..., q_n]. Then we have Q_1^T Q_1 = I and Q_2^T Q_2 = I, but Q_1 Q_1^T ≠ I and Q_2 Q_2^T ≠ I.
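A minimal NumPy sketch of these facts (the random matrix, the QR construction, and the variable names are only for illustration, not from the slides):

```python
import numpy as np

# Build an orthogonal matrix Q from the QR factorization of a random matrix.
rng = np.random.default_rng(0)
n = 5
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))

I = np.eye(n)
print(np.allclose(Q.T @ Q, I), np.allclose(Q @ Q.T, I))  # True True
print(np.allclose(np.linalg.inv(Q), Q.T))                # True: Q^{-1} = Q^T

# Split the columns: Q1 takes the first i columns, Q2 the rest.
i = 2
Q1, Q2 = Q[:, :i], Q[:, i:]
print(np.allclose(Q1.T @ Q1, np.eye(i)))   # True: the columns stay orthonormal
print(np.allclose(Q1 @ Q1.T, I))           # False: Q1 Q1^T is not the identity
```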

SLIDE 3

Eigen-Decomposition

1. A square n × n matrix S with n linearly independent eigenvectors can be factorized as S = QΛQ^{-1}, where Q is the square n × n matrix whose columns are the eigenvectors of S, and Λ is the diagonal matrix whose diagonal elements are the corresponding eigenvalues.

2. Note that only diagonalizable matrices can be factorized in this way.

3. If S is a symmetric matrix, its eigenvectors can be chosen to be orthonormal. Thus Q is an orthogonal matrix and we have S = QΛQ^T.
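As a quick numerical illustration (the random symmetric matrix below is made up for the example), NumPy's eigh returns exactly this factorization for symmetric matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
S = (B + B.T) / 2                      # make a symmetric matrix

lam, Q = np.linalg.eigh(S)             # eigen-decomposition for symmetric matrices
print(np.allclose(Q.T @ Q, np.eye(4)))         # True: eigenvectors are orthonormal
print(np.allclose(Q @ np.diag(lam) @ Q.T, S))  # True: S = Q Lambda Q^T
```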

SLIDE 4

Singular Value Decomposition

The singular value decomposition (SVD) of an m × n real matrix R (without loss of generality, we assume m ≥ n) can be written as R = U Σ̃ V^T, where U is an orthogonal m × m matrix, V is an orthogonal n × n matrix, and Σ̃ is an m × n diagonal matrix with non-negative real values on the diagonal. That is,

\[
U^T U = U U^T = I_{m \times m}, \qquad V^T V = V V^T = I_{n \times n},
\]
\[
\tilde{\Sigma} = \begin{bmatrix} \Sigma_{n \times n} \\ 0 \end{bmatrix}_{m \times n}, \qquad
\Sigma_{n \times n} = \begin{bmatrix} \sigma_1 & & & \\ & \sigma_2 & & \\ & & \ddots & \\ & & & \sigma_n \end{bmatrix}, \tag{1}
\]

where σ_1 ≥ σ_2 ≥ ··· ≥ σ_n ≥ 0 are known as singular values. If rank(R) = r (r ≤ n), we have σ_1 ≥ σ_2 ≥ ··· ≥ σ_r > 0 and σ_{r+1} = σ_{r+2} = ··· = σ_n = 0.
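A small NumPy sketch of the definition above (sizes are arbitrary); np.linalg.svd with full_matrices=True returns the full U, the singular values, and V^T:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 6, 4
R = rng.standard_normal((m, n))

U, s, Vt = np.linalg.svd(R, full_matrices=True)
print(U.shape, s.shape, Vt.shape)               # (6, 6) (4,) (4, 4)
print(np.all(s[:-1] >= s[1:]), np.all(s >= 0))  # singular values sorted and non-negative

# Rebuild the m x n diagonal matrix Sigma-tilde and check R = U Sigma-tilde V^T.
Sigma = np.zeros((m, n))
Sigma[:n, :n] = np.diag(s)
print(np.allclose(U @ Sigma @ Vt, R))           # True
```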

SLIDE 5

Relation to Eigen-Decomposition

The columns of U (left-singular vectors) are orthonormal eigenvectors of RR^T, and the columns of V (right-singular vectors) are orthonormal eigenvectors of R^T R. In other words, we have RR^T = UΛU^{-1} and R^T R = VΛV^{-1}. This is easy to verify, as we have

\[
R^T R = (U \tilde{\Sigma} V^T)^T (U \tilde{\Sigma} V^T) = V (\tilde{\Sigma}^T \tilde{\Sigma}) V^T = V \Sigma^2 V^T,
\]
\[
R R^T = (U \tilde{\Sigma} V^T)(U \tilde{\Sigma} V^T)^T = U (\tilde{\Sigma} \tilde{\Sigma}^T) U^T = U \Sigma^2 U^T
\]

(with Σ² zero-padded to size m × m in the second line), and V^T = V^{-1}, U^T = U^{-1}.
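The relation can be checked numerically; the sketch below (with made-up dimensions) compares the squared singular values of R with the eigenvalues and eigenvectors of R^T R:

```python
import numpy as np

rng = np.random.default_rng(0)
R = rng.standard_normal((6, 4))
U, s, Vt = np.linalg.svd(R, full_matrices=False)

# Eigen-decomposition of R^T R (symmetric PSD); eigh returns ascending eigenvalues.
lam, _ = np.linalg.eigh(R.T @ R)
print(np.allclose(np.sort(lam)[::-1], s**2))  # True: eigenvalues of R^T R are sigma_i^2

# Each right-singular vector v_i is an eigenvector of R^T R with eigenvalue sigma_i^2.
for i in range(4):
    v = Vt[i]
    print(np.allclose(R.T @ R @ v, (s[i]**2) * v))  # True for every i
```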

SLIDE 6

SVD and eigen-decomposition

1. Under what conditions are the SVD and the eigen-decomposition the same? First, R must be a symmetric matrix, i.e., R = R^T. Second, R must be a positive semi-definite matrix, i.e., x^T R x ≥ 0 for all x ∈ R^n.

2. The difference between Λ in the eigen-decomposition and Σ in the SVD is that the diagonal entries of Λ can be negative, while the diagonal entries of Σ are non-negative. What are the fundamental reasons underlying this difference? Why do the requirements on the singular values in the SVD (non-negative and in sorted order) not restrict the generality of the SVD?
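As a small illustrative example (not from the slides): for a symmetric but indefinite matrix, a negative eigenvalue shows up as a positive singular value, with the sign and the reordering absorbed into the singular vectors:

```python
import numpy as np

# A symmetric but indefinite matrix: one eigenvalue is negative.
A = np.diag([1.0, -2.0])

lam, Q = np.linalg.eigh(A)
U, s, Vt = np.linalg.svd(A)
print(lam)  # [-2.  1.]  eigenvalues may be negative (ascending order from eigh)
print(s)    # [ 2.  1.]  singular values are the absolute values, sorted decreasingly

# The sign and the reordering are absorbed into U and V, so A = U diag(s) V^T still holds.
print(np.allclose(U @ np.diag(s) @ Vt, A))  # True
```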

SLIDE 7

Compact SVD

If rank(R) = r (r ≤ n), we have

\[
R = U \tilde{\Sigma} V^T
= [u_1, u_2, \ldots, u_r, \ldots, u_m]
\begin{bmatrix}
\sigma_1 & & & & \\
& \ddots & & & \\
& & \sigma_r & & \\
& & & 0 & \\
& & & & \ddots
\end{bmatrix}
\begin{bmatrix} v_1^T \\ v_2^T \\ \vdots \\ v_r^T \\ \vdots \\ v_n^T \end{bmatrix}.
\]

By removing the zero components, we obtain

\[
R = U_r \Sigma_r V_r^T
= [u_1, u_2, \ldots, u_r]
\begin{bmatrix} \sigma_1 & & \\ & \ddots & \\ & & \sigma_r \end{bmatrix}
\begin{bmatrix} v_1^T \\ v_2^T \\ \vdots \\ v_r^T \end{bmatrix}
= [\sigma_1 u_1, \sigma_2 u_2, \ldots, \sigma_r u_r]
\begin{bmatrix} v_1^T \\ v_2^T \\ \vdots \\ v_r^T \end{bmatrix}
= \sum_{i=1}^{r} \sigma_i u_i v_i^T,
\]

where rank(σ_i u_i v_i^T) = 1 for i = 1, 2, ..., r.
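A short NumPy check of the compact SVD and the rank-one expansion, using a synthetic rank-2 matrix as the example:

```python
import numpy as np

rng = np.random.default_rng(0)
# Build a 6 x 4 matrix of known rank r = 2 as a product of thin factors.
R = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 4))

U, s, Vt = np.linalg.svd(R)
r = int(np.sum(s > 1e-10))
print(r)  # 2: the trailing singular values are numerically zero

# Compact SVD: keep only the first r singular triplets.
R_compact = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]
# Equivalently, the sum of r rank-one matrices sigma_i u_i v_i^T.
R_sum = sum(s[i] * np.outer(U[:, i], Vt[i]) for i in range(r))
print(np.allclose(R_compact, R), np.allclose(R_sum, R))  # True True
```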

SLIDE 8

Truncated SVD and Best Low-Rank Approximation

We can also approximate the matrix R using only the k largest singular values, as

\[
R_k = U_k \Sigma_k V_k^T = \sum_{i=1}^{k} \sigma_i u_i v_i^T .
\]

Apparently, R ≠ R_k unless rank(R) ≤ k. This approximation is the best in the following sense:

\[
\min_{B:\, \mathrm{rank}(B) \le k} \|R - B\|_F = \|R - R_k\|_F = \sqrt{\sum_{i=k+1}^{n} \sigma_i^2},
\]
\[
\min_{B:\, \mathrm{rank}(B) \le k} \|R - B\|_2 = \|R - R_k\|_2 = \sigma_{k+1},
\]

where ||·||_F denotes the Frobenius norm and ||·||_2 denotes the spectral norm, defined as the largest singular value of the matrix. That is, R_k is the best rank-k approximation to R in terms of both the Frobenius norm and the spectral norm. Note the difference in the approximation errors when different matrix norms are used.
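The statement above can be probed numerically. The sketch below compares the truncated-SVD error with random rank-k competitors (the sizes and the number of trials are arbitrary choices for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
R = rng.standard_normal((8, 6))
U, s, Vt = np.linalg.svd(R)
k = 2

# Best rank-k approximation from the truncated SVD.
R_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
err_fro = np.linalg.norm(R - R_k, 'fro')
err_spec = np.linalg.norm(R - R_k, 2)
print(np.isclose(err_fro, np.sqrt(np.sum(s[k:]**2))))  # True: Frobenius-norm error
print(np.isclose(err_spec, s[k]))                      # True: spectral-norm error = sigma_{k+1}

# Random rank-k candidates do at least as badly in the Frobenius norm.
for _ in range(100):
    B = rng.standard_normal((8, k)) @ rng.standard_normal((k, 6))  # rank <= k
    assert np.linalg.norm(R - B, 'fro') >= err_fro - 1e-12
```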

SLIDE 9

What is PCA?

1. Principal Component Analysis (PCA) is a statistical procedure that can be used to achieve feature (dimensionality) reduction.

2. Note that feature reduction is different from feature selection. After feature reduction, we still use all of the original features (combined into a smaller number of new ones), while feature selection uses only a subset of the features.

3. The goal of PCA is to project the high-dimensional features to a lower-dimensional space with maximal variance and minimal reconstruction error simultaneously.

4. We derive PCA based on maximizing the variance, and then we show that the solution also minimizes the reconstruction error.

5. In machine learning, PCA is an unsupervised learning technique and therefore does not need labels.

SLIDE 10

PCA to 1D

1. To introduce PCA, we start from the simple case where PCA projects the features to a 1-dimensional space.

2. Formally, suppose we have n p-dimensional (p > 1) features x_1, x_2, ..., x_n ∈ R^p.

3. Let a ∈ R^p represent a projection such that a^T x_i = z_i, i = 1, 2, ..., n, where z_1, z_2, ..., z_n ∈ R.

4. PCA aims to solve

\[
a^* = \arg\max_{\|a\|=1} \frac{1}{n} \sum_{i=1}^{n} (z_i - \bar{z})^2 .
\]

5. Note that (1/n) Σ_{i=1}^{n} (z_i − z̄)² is the variance of the reduced data, which means that PCA tries to find the projection with the maximum variance in the reduced data (see the sketch below).
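A small sketch of this objective (assuming NumPy; the data layout follows the slides, with columns as samples, and the scales are made up). The variance of the projected data equals a^T C a, and the best direction is the top eigenvector of C:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 3, 200
X = np.diag([3.0, 1.0, 0.3]) @ rng.standard_normal((p, n))  # columns are the samples x_i

a = rng.standard_normal(p)
a /= np.linalg.norm(a)                 # unit-norm projection direction
z = a @ X                              # z_i = a^T x_i, shape (n,)
x_bar = X.mean(axis=1, keepdims=True)
C = (X - x_bar) @ (X - x_bar).T / n    # p x p covariance matrix

# The variance of the projected data equals a^T C a, the PCA objective.
print(np.isclose(z.var(), a @ C @ a))  # True

# The best direction is the top eigenvector of C; its variance is the largest eigenvalue.
lam, Q = np.linalg.eigh(C)
a_star = Q[:, -1]
print((a_star @ X).var() >= z.var())              # True
print(np.isclose((a_star @ X).var(), lam[-1]))    # True
```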

SLIDE 11

PCA to 1D

Since z̄ = (1/n) Σ_{i=1}^{n} z_i = (1/n) Σ_{i=1}^{n} a^T x_i = a^T ((1/n) Σ_{i=1}^{n} x_i) = a^T x̄, the problem can be written as

\[
\begin{aligned}
a^* &= \arg\max_{\|a\|=1} \frac{1}{n} \sum_{i=1}^{n} (z_i - \bar{z})^2
     = \arg\max_{\|a\|=1} \frac{1}{n} \sum_{i=1}^{n} (a^T x_i - \bar{z})^2 \\
    &= \arg\max_{\|a\|=1} \frac{1}{n} \sum_{i=1}^{n} (a^T x_i - a^T \bar{x})^2
     = \arg\max_{\|a\|=1} \frac{1}{n} \sum_{i=1}^{n} a^T (x_i - \bar{x})(x_i - \bar{x})^T a \\
    &= \arg\max_{\|a\|=1} a^T \underbrace{\left[ \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^T \right]}_{p \times p \text{ covariance matrix}} a
     = \arg\max_{\|a\|=1} a^T C a,
\end{aligned}
\]

where C = (1/n) Σ_{i=1}^{n} (x_i − x̄)(x_i − x̄)^T denotes the covariance matrix.

SLIDE 12

PCA to k-dimensional space

1. What if we want to project the features to a k-dimensional space? Then the PCA problem becomes

\[
A^* = \arg\max_{A \in \mathbb{R}^{p \times k}:\, A^T A = I_k} \operatorname{trace}\!\left( A^T C A \right), \tag{2}
\]

where A = [a_1, a_2, ..., a_k] ∈ R^{p×k}. Note that when projecting onto a k-dimensional space, PCA requires the different projection vectors to be orthogonal. Also, the trace above is the sum of the variances after projecting the data onto each of the k directions, as

\[
\operatorname{trace}\!\left( A^T C A \right) = \sum_{i=1}^{k} a_i^T C a_i .
\]

SLIDE 13

Ky Fan Theorem

1. Solving the problem in Eqn. (2) requires the following theorem.

2. Theorem (Ky Fan). Let H ∈ R^{n×n} be a symmetric matrix with eigenvalues λ_1 ≥ λ_2 ≥ ··· ≥ λ_n and corresponding eigenvectors U = [u_1, ..., u_n]. Then

\[
\lambda_1 + \cdots + \lambda_k = \max_{A \in \mathbb{R}^{n \times k}:\, A^T A = I_k} \operatorname{trace}\!\left( A^T H A \right),
\]

and the optimal A* is given by A* = [u_1, ..., u_k] Q, with Q an arbitrary k × k orthogonal matrix. A numerical check is sketched below.
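A sanity check of the theorem (the random symmetric matrix and the sizes are illustrative): the trace is maximized by the top-k eigenvectors, multiplying by an orthogonal Q leaves the value unchanged, and random feasible A do no better:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 6, 2
B = rng.standard_normal((n, n))
H = (B + B.T) / 2                       # symmetric matrix

lam, U = np.linalg.eigh(H)              # ascending eigenvalues
top_k_sum = lam[-k:].sum()

A_star = U[:, -k:]                      # eigenvectors of the k largest eigenvalues
print(np.isclose(np.trace(A_star.T @ H @ A_star), top_k_sum))  # True: the bound is attained

# Any other A with orthonormal columns gives a trace that is no larger.
for _ in range(200):
    A, _ = np.linalg.qr(rng.standard_normal((n, k)))
    assert np.trace(A.T @ H @ A) <= top_k_sum + 1e-10

# Multiplying A* by an orthogonal k x k matrix Q leaves the trace unchanged.
Q, _ = np.linalg.qr(rng.standard_normal((k, k)))
print(np.isclose(np.trace((A_star @ Q).T @ H @ (A_star @ Q)), top_k_sum))  # True
```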
SLIDE 14

Solutions to PCA

1. Note that in Eqn. (2), the covariance matrix C is a symmetric matrix. Given the above theorem, we directly obtain

\[
\lambda_1 + \cdots + \lambda_k = \max_{A \in \mathbb{R}^{p \times k}:\, A^T A = I_k} \operatorname{trace}\!\left( A^T C A \right),
\qquad
A^* = [u_1, \ldots, u_k]\, Q,
\]

where λ_1, ..., λ_k are the k largest eigenvalues of the covariance matrix C, and the solution A* is the matrix whose columns are the corresponding eigenvectors.

2. It also follows from the above theorem that solutions to PCA are not unique, and they differ by an orthogonal matrix. We use the special case where Q = I, i.e., A* = [u_1, ..., u_k].

SLIDE 15

How to compute PCA efficiently?

1. We define

\[
X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{p \times n}, \qquad
\bar{x} = \frac{1}{n} X \mathbf{1}_n \in \mathbb{R}^{p \times 1}, \qquad
\tilde{X} = X - \bar{x} \mathbf{1}_n^T \in \mathbb{R}^{p \times n},
\]

where 1_n is the n-dimensional all-ones vector. Here, X̃ is the centered X, obtained by subtracting the mean from each column.

2. Then we have

\[
C = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^T = \frac{1}{n} \tilde{X} \tilde{X}^T \in \mathbb{R}^{p \times p}.
\]

If p is large, it is very costly to compute the eigenvectors of C directly. Moreover, if p ≫ n, there are known computational problems.
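A brief sketch of the centering step and the two equivalent expressions for C (NumPy, with made-up sizes):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 4, 100
X = rng.standard_normal((p, n)) + 5.0   # columns are samples, with a non-zero mean

ones = np.ones((n, 1))
x_bar = (X @ ones) / n                   # p x 1 mean vector
X_tilde = X - x_bar @ ones.T             # centered data: subtract the mean from each column

C = X_tilde @ X_tilde.T / n              # p x p covariance matrix
# The same covariance from the per-sample definition (1/n) sum_i (x_i - x_bar)(x_i - x_bar)^T.
C_loop = sum(np.outer(X[:, i] - x_bar[:, 0], X[:, i] - x_bar[:, 0]) for i in range(n)) / n
print(np.allclose(C, C_loop))            # True
```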

SLIDE 16

How to compute PCA efficiently?

1. However, the SVD of X̃ provides what we need. If the SVD of X̃ is X̃ = U Σ̃ V^T, the columns of U (left-singular vectors) are orthonormal eigenvectors of X̃ X̃^T. Since σ_1 ≥ σ_2 ≥ ··· ≥ σ_n ≥ 0 in Eqn. (1), the first k columns of U correspond to the k largest eigenvalues of X̃ X̃^T.

2. With the SVD of X̃, if we want to project the features to a k-dimensional (k < p) space, we simply take the first k columns of U as the projection matrix. In general, the process of computing PCA can be described as

\[
X \in \mathbb{R}^{p \times n} \;\rightarrow\; \tilde{X} \in \mathbb{R}^{p \times n} \;\rightarrow\; \tilde{X} = U \tilde{\Sigma} V^T \;\rightarrow\; G = U_k \in \mathbb{R}^{p \times k}.
\]

The projected features are then

\[
G^T \tilde{X} = Z \in \mathbb{R}^{k \times n}. \tag{3}
\]

It is worth noting that the k-th row of Z is called the k-th principal component (PC). Therefore, PCA achieves feature reduction by keeping only the first k PCs.
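Putting the pipeline together as a sketch (the sizes and data-generating scales are illustrative): center, take the SVD, keep U_k, and project. The result agrees with the eigen-decomposition of C up to the usual sign ambiguity:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, k = 5, 300, 2
X = rng.standard_normal((p, n)) * np.array([4.0, 2.0, 1.0, 0.5, 0.1])[:, None] + 1.0

X_tilde = X - X.mean(axis=1, keepdims=True)   # center the data
U, s, Vt = np.linalg.svd(X_tilde, full_matrices=False)
G = U[:, :k]                                  # projection matrix: first k left-singular vectors
Z = G.T @ X_tilde                             # k x n matrix of principal components

# The same subspace comes from the eigen-decomposition of the covariance matrix.
C = X_tilde @ X_tilde.T / n
lam, Q = np.linalg.eigh(C)
print(np.allclose(lam[::-1][:k], s[:k]**2 / n))           # top eigenvalues = sigma_i^2 / n
print(np.allclose(np.abs(Q[:, ::-1][:, :k]), np.abs(G)))  # same directions, up to sign
```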

SLIDE 17

Centered or not?

1. Note that in Eqn. (3) we project the centered data matrix X̃ instead of the original data matrix X. Maximal variance can be achieved in both cases, as the covariance matrix C does not change.

2. But the minimal reconstruction error can only be achieved when X̃ is used, as shown on the next slides.

3. A common practice to avoid any confusion is to center the data before applying PCA, and to use the centered data matrix in all computations.

SLIDE 18

Have We Achieved the Minimal Reconstruction Error?

1. First, we perform the verification in the case where we project and reconstruct the centered data matrix X̃.

2. Given the projection G^T X̃ = Z, the reconstruction process is X̌ = G Z = G G^T X̃, and the reconstruction error is ||X̃ − X̌||_F = ||X̃ − G G^T X̃||_F.

3. We will show that

\[
G G^T \tilde{X} = B^* = \arg\min_{B:\, \mathrm{rank}(B) \le k} \|\tilde{X} - B\|_F, \tag{4}
\]

which means that X̌ gives the minimum reconstruction error.

4. We have already learned that B* is given by the truncated SVD of X̃. So we only need to show that G G^T X̃ is indeed the truncated SVD of X̃, as follows. Note that similar arguments can be made for the spectral norm.

SLIDE 19

More details

Given X̃ = U Σ̃ V^T and G = U_k, we have

\[
G G^T \tilde{X}
= U_k U_k^T \, U \tilde{\Sigma} V^T
= U_k U_k^T \, [U_k, \; U_{-k}] \, \tilde{\Sigma} V^T
= [U_k, \; 0] \, \tilde{\Sigma} V^T
= U_k \Sigma_k V_k^T
= \sum_{i=1}^{k} \sigma_i u_i v_i^T ,
\]

where U_{-k} = [u_{k+1}, ..., u_m] denotes the remaining columns of U. As a result, X̌ gives the minimum reconstruction error, which is

\[
\min_{B:\, \mathrm{rank}(B) \le k} \|\tilde{X} - B\|_F = \|\tilde{X} - G G^T \tilde{X}\|_F = \sqrt{\sum_{i=k+1}^{n} \sigma_i^2}.
\]
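A quick check of the reconstruction-error formulas, also covering the spectral-norm case of the next slide (sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, k = 5, 50, 2
X = rng.standard_normal((p, n))
X_tilde = X - X.mean(axis=1, keepdims=True)

U, s, Vt = np.linalg.svd(X_tilde, full_matrices=False)
G = U[:, :k]
X_hat = G @ G.T @ X_tilde                    # reconstruction from the k-dimensional projection

err = np.linalg.norm(X_tilde - X_hat, 'fro')
print(np.isclose(err, np.sqrt(np.sum(s[k:]**2))))             # True: truncated-SVD error
print(np.isclose(np.linalg.norm(X_tilde - X_hat, 2), s[k]))   # True: spectral error = sigma_{k+1}
```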

SLIDE 20

The case of spectral norm

1. Similarly, it can be shown that

\[
\min_{B:\, \mathrm{rank}(B) \le k} \|\tilde{X} - B\|_2 = \|\tilde{X} - G G^T \tilde{X}\|_2 = \sigma_{k+1}.
\]

2. Therefore, PCA projects the high-dimensional features to a lower-dimensional space with minimum reconstruction error in terms of both the Frobenius and spectral norms.

SLIDE 21

Connections with Autoencoders

(Figure: the autoencoder framework, with an encoder E(·) and a decoder D(·).)

1. For simplicity of notation, we assume that X = [x_1, x_2, ..., x_n] ∈ R^{p×n} is already centered.

2. The outputs of this framework have the same dimension as the inputs, and are supposed to reconstruct the inputs exactly. Without any constraint, the reconstruction task could be solved trivially by copying the inputs. However, autoencoders come with a crucial restriction: the dimension of the intermediate outputs must be smaller than that of the inputs.

SLIDE 22

Autoencoders

1. Concretely, the autoencoder framework can be divided into two parts: the encoder E(·) and the decoder D(·). Given X ∈ R^{p×n} as the input, the encoder outputs a reduced representation of the input, i.e., E(X) ∈ R^{k×n}, where k < p, and the decoder tries to reconstruct the input from this reduced representation.

2. It is straightforward to see that the objective of an autoencoder is to minimize the reconstruction error L(X, D(E(X))), where L(·, ·) is a loss function that measures the reconstruction error.

SLIDE 23

PCA as Autoencoders

1. We have shown that PCA achieves the minimum reconstruction error. Both PCA and autoencoders are unsupervised learning models with the goal of minimizing the reconstruction error. Are they connected?

2. Consider a special autoencoder with the encoder and decoder defined as

\[
E(X) = W^T X \in \mathbb{R}^{k \times n}, \qquad
D(E(X)) = W E(X) = W W^T X \in \mathbb{R}^{p \times n},
\]

where W ∈ R^{p×k}. Note that rank(W W^T X) ≤ min(rank(W), rank(W^T X)) ≤ k.

3. If the loss function L(·, ·) is the mean squared error, the objective of this autoencoder is equivalent to

\[
\min_{W:\, \mathrm{rank}(W W^T X) \le k} \|X - W W^T X\|_F. \tag{5}
\]

A small gradient-descent sketch of this linear autoencoder is given below.
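A minimal sketch of this linear autoencoder trained by gradient descent (assuming NumPy; the learning rate, iteration count, and data scales are arbitrary choices for the example). The gradient of ||X − WW^T X||_F^2 works out to 2SWW^TW + 2WW^TSW − 4SW with S = XX^T:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, k = 5, 200, 2
X = np.diag([3.0, 2.0, 1.0, 0.5, 0.1]) @ rng.standard_normal((p, n))
X = X - X.mean(axis=1, keepdims=True)        # assume centered data, as in the slides

# PCA solution for reference.
U, s, _ = np.linalg.svd(X, full_matrices=False)
G = U[:, :k]
pca_err = np.linalg.norm(X - G @ G.T @ X, 'fro')

# Gradient descent on f(W) = ||X - W W^T X||_F^2 with tied weights and no biases.
S = X @ X.T
W = 0.1 * rng.standard_normal((p, k))
lr = 0.01 / n
for _ in range(5000):
    grad = 2 * S @ W @ (W.T @ W) + 2 * W @ (W.T @ S @ W) - 4 * S @ W
    W -= lr * grad

gd_err = np.linalg.norm(X - W @ W.T @ X, 'fro')
print(pca_err, gd_err)                           # nearly identical reconstruction errors
print(np.allclose(W, G, atol=1e-3))              # typically False: W itself differs from G ...
print(np.allclose(W @ W.T, G @ G.T, atol=1e-3))  # ... but both span the same subspace (W = G Q)
```

In line with the next slide, the learned W typically differs from the PCA solution G by an orthogonal transformation, while the reconstruction errors match.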

SLIDE 24

PCA as Autoencoders

1. In an autoencoder, W can be randomly initialized and optimized by gradient descent. However, comparing Eqn. (4) and Eqn. (5), it is easy to see that the optimum is achieved when W = G, where G is given by PCA.

2. Therefore, a natural question is: will gradient descent give the same result as PCA? The answer is two-fold: yes in terms of the minimal reconstruction error, but not necessarily in terms of the optimal W. Gradient descent converges to the global optimum, resulting in the same minimal reconstruction error as PCA. However, the solutions to PCA are not unique and can differ by an orthogonal matrix. Thus, the optimal W found by gradient descent can be different from a PCA solution G. But one can be computed from the other through an orthogonal matrix, and the sets of possible solutions are the same.

3. To summarize, PCA is a special case of autoencoders, where the encoder and decoder are both one-layer linear transformations that share their weights and have no bias terms.

SLIDE 25

THANKS!
