SLIDE 1

MLCC 2015 Dimensionality Reduction and PCA

Lorenzo Rosasco, UNIGE-MIT-IIT, June 25, 2015

SLIDE 2

Outline

◮ PCA & Reconstruction
◮ PCA and Maximum Variance
◮ PCA and Associated Eigenproblem
◮ Beyond the First Principal Component
◮ PCA and Singular Value Decomposition
◮ Kernel PCA

SLIDE 3

Dimensionality Reduction

In many practical applications it is of interest to reduce the dimensionality of the data:

◮ data visualization
◮ data exploration: for investigating the "effective" dimensionality of the data

SLIDE 4

Dimensionality Reduction (cont.)

This problem of dimensionality reduction can be seen as the problem of defining a map M : X = R^D → R^k, k ≪ D, according to some suitable criterion.

SLIDE 5

Dimensionality Reduction (cont.)

This problem of dimensionality reduction can be seen as the problem of defining a map M : X = R^D → R^k, k ≪ D, according to some suitable criterion. In the following, data reconstruction will be our guiding principle.

SLIDE 6

Principal Component Analysis

PCA is arguably the most popular dimensionality reduction procedure.

SLIDE 7

Principal Component Analysis

PCA is arguably the most popular dimensionality reduction procedure. It is a data-driven procedure that, given an unsupervised sample S = (x_1, . . . , x_n), derives a dimensionality reduction defined by a linear map M.

SLIDE 8

Principal Component Analysis

PCA is arguably the most popular dimensionality reduction procedure. It is a data-driven procedure that, given an unsupervised sample S = (x_1, . . . , x_n), derives a dimensionality reduction defined by a linear map M. PCA can be derived from several perspectives; here we give a geometric derivation.

SLIDE 9

Dimensionality Reduction by Reconstruction

Recall that, if w ∈ R^D, ‖w‖ = 1, then (w^T x)w is the orthogonal projection of x on w.

SLIDE 10

Dimensionality Reduction by Reconstruction

Recall that, if w ∈ R^D, ‖w‖ = 1, then (w^T x)w is the orthogonal projection of x on w.

SLIDE 11

Dimensionality Reduction by Reconstruction (cont.)

First, consider k = 1. The associated reconstruction error is ‖x − (w^T x)w‖² (that is, how much we lose by projecting x along the direction w).

SLIDE 12

Dimensionality Reduction by Reconstruction (cont.)

First, consider k = 1. The associated reconstruction error is ‖x − (w^T x)w‖² (that is, how much we lose by projecting x along the direction w).

Problem:

Find the direction p allowing the best reconstruction of the training set

SLIDE 13

Dimensionality Reduction by Reconstruction (cont.)

Let S^{D-1} = {w ∈ R^D | ‖w‖ = 1} be the unit sphere in D dimensions. Consider the empirical reconstruction minimization problem

\[
\min_{w \in S^{D-1}} \frac{1}{n} \sum_{i=1}^{n} \lVert x_i - (w^T x_i)\, w \rVert^2 .
\]

The solution p to the above problem is called the first principal component of the data.

SLIDE 14

An Equivalent Formulation

A direct computation shows that ‖x_i − (w^T x_i)w‖² = ‖x_i‖² − (w^T x_i)².

SLIDE 15

An Equivalent Formulation

A direct computation shows that ‖x_i − (w^T x_i)w‖² = ‖x_i‖² − (w^T x_i)². Then, the problem

\[
\min_{w \in S^{D-1}} \frac{1}{n} \sum_{i=1}^{n} \lVert x_i - (w^T x_i)\, w \rVert^2
\]

is equivalent to

\[
\max_{w \in S^{D-1}} \frac{1}{n} \sum_{i=1}^{n} (w^T x_i)^2 .
\]
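
For completeness, the identity used above follows by expanding the square and using ‖w‖ = 1 (a small added step, not on the original slide):

\[
\lVert x_i - (w^T x_i)\, w \rVert^2
= \lVert x_i \rVert^2 - 2\,(w^T x_i)^2 + (w^T x_i)^2 \lVert w \rVert^2
= \lVert x_i \rVert^2 - (w^T x_i)^2 .
\]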

SLIDE 16

Outline

◮ PCA & Reconstruction
◮ PCA and Maximum Variance
◮ PCA and Associated Eigenproblem
◮ Beyond the First Principal Component
◮ PCA and Singular Value Decomposition
◮ Kernel PCA

SLIDE 17

Reconstruction and Variance

Assume the data to be centered, x̄ = (1/n) Σ_{i=1}^n x_i = 0; then we can interpret the term (1/n) Σ_{i=1}^n (w^T x_i)² as the variance of the data in the direction w.

SLIDE 18

Reconstruction and Variance

Assume the data to be centered, x̄ = (1/n) Σ_{i=1}^n x_i = 0; then we can interpret the term (1/n) Σ_{i=1}^n (w^T x_i)² as the variance of the data in the direction w. The first PC can be seen as the direction along which the data have maximum variance:

\[
\max_{w \in S^{D-1}} \frac{1}{n} \sum_{i=1}^{n} (w^T x_i)^2 .
\]

SLIDE 19

Centering

If the data are not centered, we should consider

\[
\max_{w \in S^{D-1}} \frac{1}{n} \sum_{i=1}^{n} \big( w^T (x_i - \bar{x}) \big)^2 , \qquad (1)
\]

which is equivalent to

\[
\max_{w \in S^{D-1}} \frac{1}{n} \sum_{i=1}^{n} (w^T x_i^c)^2
\]

with x_i^c = x_i − x̄.
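
A small numpy illustration (ours, not from the slides; the toy data and names are arbitrary) of why the centering matters: without subtracting the mean, the "maximum variance" direction is pulled toward the mean of the data.

```python
import numpy as np

def top_direction(C):
    """Eigenvector of the largest eigenvalue of a symmetric matrix C."""
    eigvals, eigvecs = np.linalg.eigh(C)
    return eigvecs[:, -1]

rng = np.random.default_rng(0)
# Spread along the first axis, but with a mean far from the origin along the second axis.
X = rng.normal(size=(200, 2)) * np.array([2.0, 0.5]) + np.array([0.0, 50.0])

C_raw = X.T @ X / len(X)                                     # uses (w^T x_i)^2
C_centered = (X - X.mean(0)).T @ (X - X.mean(0)) / len(X)    # uses (w^T (x_i - xbar))^2

print(top_direction(C_raw))       # close to (0, 1): points toward the mean, not the spread
print(top_direction(C_centered))  # close to (1, 0): the direction of maximum variance
```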

SLIDE 20

Centering and Reconstruction

If we consider the effect of centering on reconstruction, it is easy to see that we get

\[
\min_{w \in S^{D-1},\, b \in \mathbb{R}^{D}} \frac{1}{n} \sum_{i=1}^{n} \big\lVert x_i - \big( (w^T (x_i - b))\, w + b \big) \big\rVert^2 ,
\]

where (w^T (x_i − b)) w + b is an affine (rather than an orthogonal) projection.

SLIDE 21

Outline

◮ PCA & Reconstruction
◮ PCA and Maximum Variance
◮ PCA and Associated Eigenproblem
◮ Beyond the First Principal Component
◮ PCA and Singular Value Decomposition
◮ Kernel PCA

SLIDE 22

PCA as an Eigenproblem

A further manipulation shows that PCA corresponds to an eigenvalue problem.

SLIDE 23

PCA as an Eigenproblem

A further manipulation shows that PCA corresponds to an eigenvalue problem. Using the symmetry of the inner product,

\[
\frac{1}{n} \sum_{i=1}^{n} (w^T x_i)^2
= \frac{1}{n} \sum_{i=1}^{n} w^T x_i\, w^T x_i
= \frac{1}{n} \sum_{i=1}^{n} w^T x_i x_i^T w
= w^T \Big( \frac{1}{n} \sum_{i=1}^{n} x_i x_i^T \Big) w .
\]

SLIDE 24

PCA as an Eigenproblem

A further manipulation shows that PCA corresponds to an eigenvalue problem. Using the symmetry of the inner product,

\[
\frac{1}{n} \sum_{i=1}^{n} (w^T x_i)^2
= \frac{1}{n} \sum_{i=1}^{n} w^T x_i\, w^T x_i
= \frac{1}{n} \sum_{i=1}^{n} w^T x_i x_i^T w
= w^T \Big( \frac{1}{n} \sum_{i=1}^{n} x_i x_i^T \Big) w .
\]

Then, we can consider the problem

\[
\max_{w \in S^{D-1}} w^T C_n w , \qquad C_n = \frac{1}{n} \sum_{i=1}^{n} x_i x_i^T .
\]

SLIDE 25

PCA as an Eigenproblem (cont.)

We make two observations:

◮ The ("covariance") matrix C_n = (1/n) X_n^T X_n is symmetric and positive semi-definite (here X_n denotes the n × D data matrix whose i-th row is x_i^T, so that X_n^T X_n = Σ_{i=1}^n x_i x_i^T).

SLIDE 26

PCA as an Eigenproblem (cont.)

We make two observations:

◮ The ("covariance") matrix C_n = (1/n) X_n^T X_n is symmetric and positive semi-definite.
◮ The objective function of PCA can be written as

\[
\frac{w^T C_n w}{w^T w} ,
\]

the so-called Rayleigh quotient.

SLIDE 27

PCA as an Eigenproblem (cont.)

We make two observations:

◮ The ("covariance") matrix C_n = (1/n) X_n^T X_n is symmetric and positive semi-definite.
◮ The objective function of PCA can be written as

\[
\frac{w^T C_n w}{w^T w} ,
\]

the so-called Rayleigh quotient. Note that, if C_n u = λu, then

\[
\frac{u^T C_n u}{u^T u} = \lambda ,
\]

since u is normalized.

SLIDE 28

PCA as an Eigenproblem (cont.)

We make two observations:

◮ The ("covariance") matrix C_n = (1/n) X_n^T X_n is symmetric and positive semi-definite.
◮ The objective function of PCA can be written as

\[
\frac{w^T C_n w}{w^T w} ,
\]

the so-called Rayleigh quotient. Note that, if C_n u = λu, then

\[
\frac{u^T C_n u}{u^T u} = \lambda ,
\]

since u is normalized. Indeed, it is possible to show that the Rayleigh quotient achieves its maximum at an eigenvector corresponding to the maximum eigenvalue of C_n.
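
A quick numerical sanity check of this fact (our own illustration, not part of the slides; the matrix and names are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))
C = A.T @ A / 5                          # a symmetric positive semi-definite "covariance-like" matrix
eigvals, eigvecs = np.linalg.eigh(C)     # ascending eigenvalues
u_top = eigvecs[:, -1]                   # eigenvector of the largest eigenvalue

rayleigh = lambda w: (w @ C @ w) / (w @ w)
print(np.isclose(rayleigh(u_top), eigvals[-1]))    # True: the quotient attains the top eigenvalue
print(all(rayleigh(rng.normal(size=5)) <= eigvals[-1] + 1e-9
          for _ in range(1000)))                   # True: no random direction does better
```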

SLIDE 29

PCA as an Eigenproblem (cont.)

Computing the first principal component of the data thus reduces to computing the largest eigenvalue of the covariance matrix and the corresponding eigenvector:

\[
C_n u = \lambda u , \qquad C_n = \frac{1}{n} X_n^T X_n .
\]
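
To make this concrete, here is a minimal numpy sketch (ours, not from the slides; function and variable names are illustrative) that computes the first principal component exactly this way:

```python
import numpy as np

def first_principal_component(X):
    """First principal component of the rows of X (an n x D data matrix)."""
    Xc = X - X.mean(axis=0)               # center the data
    C = Xc.T @ Xc / X.shape[0]            # covariance C_n = (1/n) X_n^T X_n
    eigvals, eigvecs = np.linalg.eigh(C)  # ascending eigenvalues, orthonormal eigenvectors
    return eigvecs[:, -1]                 # eigenvector of the largest eigenvalue

# Toy usage: a cloud elongated along the direction (1, 1)/sqrt(2).
rng = np.random.default_rng(0)
Z = rng.normal(size=(500, 2)) * np.array([3.0, 0.3])   # elongated along the first axis
theta = np.pi / 4
R = np.array([[np.cos(theta), np.sin(theta)],
              [-np.sin(theta), np.cos(theta)]])
X = Z @ R                                               # rotate by 45 degrees
print(first_principal_component(X))                     # approximately +/- (0.71, 0.71)
```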

SLIDE 30

Outline

◮ PCA & Reconstruction
◮ PCA and Maximum Variance
◮ PCA and Associated Eigenproblem
◮ Beyond the First Principal Component
◮ PCA and Singular Value Decomposition
◮ Kernel PCA

SLIDE 31

Beyond the First Principal Component

We discuss how to consider more than one principal component (k > 1),

M : X = R^D → R^k, k ≪ D.

The idea is simply to iterate the previous reasoning.

SLIDE 32

Residual Reconstruction

The idea is to consider the one-dimensional projection that can best reconstruct the residuals r_i = x_i − (p^T x_i) p.

SLIDE 33

Residual Reconstruction

The idea is to consider the one-dimensional projection that can best reconstruct the residuals r_i = x_i − (p^T x_i) p. An associated minimization problem is given by

\[
\min_{w \in S^{D-1},\, w \perp p} \frac{1}{n} \sum_{i=1}^{n} \lVert r_i - (w^T r_i)\, w \rVert^2
\]

(note the constraint w ⊥ p).

SLIDE 34

Residual Reconstruction (cont.)

Note that, for all i = 1, . . . , n, ‖r_i − (w^T r_i)w‖² = ‖r_i‖² − (w^T r_i)² = ‖r_i‖² − (w^T x_i)², since w ⊥ p.
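
The last equality holds because the residuals carry no component along p (a short added step, not on the original slide):

\[
w^T r_i = w^T \big( x_i - (p^T x_i)\, p \big) = w^T x_i - (p^T x_i)\,(w^T p) = w^T x_i ,
\qquad \text{since } w \perp p .
\]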

SLIDE 35

Residual Reconstruction (cont.)

Note that, for all i = 1, . . . , n, ‖r_i − (w^T r_i)w‖² = ‖r_i‖² − (w^T r_i)² = ‖r_i‖² − (w^T x_i)², since w ⊥ p. Then, we can consider the following equivalent problem:

\[
\max_{w \in S^{D-1},\, w \perp p} \frac{1}{n} \sum_{i=1}^{n} (w^T x_i)^2
= \max_{w \in S^{D-1},\, w \perp p} w^T C_n w .
\]

SLIDE 36

PCA as an Eigenproblem

\[
\max_{w \in S^{D-1},\, w \perp p} \frac{1}{n} \sum_{i=1}^{n} (w^T x_i)^2
= \max_{w \in S^{D-1},\, w \perp p} w^T C_n w .
\]

Again, we have to maximize the Rayleigh quotient of the covariance matrix, now with the extra constraint w ⊥ p.

SLIDE 37

PCA as an Eigenproblem

\[
\max_{w \in S^{D-1},\, w \perp p} \frac{1}{n} \sum_{i=1}^{n} (w^T x_i)^2
= \max_{w \in S^{D-1},\, w \perp p} w^T C_n w .
\]

Again, we have to maximize the Rayleigh quotient of the covariance matrix, now with the extra constraint w ⊥ p. Similarly to before, it can be proved that the solution of the above problem is given by the eigenvector of C_n corresponding to its second largest eigenvalue.

SLIDE 38

PCA as an Eigenproblem (cont.)

\[
C_n u = \lambda u , \qquad C_n = \frac{1}{n} \sum_{i=1}^{n} x_i x_i^T .
\]

The reasoning generalizes to more than two components: the computation of the first k principal components reduces to finding the k largest eigenvalues of C_n and the corresponding eigenvectors.
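
A compact sketch of this k-component version (our own illustration; names are not from the slides), reusing the centering and eigendecomposition from before:

```python
import numpy as np

def pca(X, k):
    """Top-k PCA of the rows of X (n x D).

    Returns P (D x k, principal components as columns) and the k-dimensional
    representations M(x_i) of the centered data points."""
    Xc = X - X.mean(axis=0)
    C = Xc.T @ Xc / X.shape[0]            # C_n = (1/n) X_n^T X_n
    eigvals, eigvecs = np.linalg.eigh(C)  # ascending eigenvalues
    P = eigvecs[:, ::-1][:, :k]           # eigenvectors of the k largest eigenvalues
    return P, Xc @ P                      # projections (M(x_i))_j = p_j^T (x_i - xbar)
```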

SLIDE 39

Remarks

◮ Computational complexity is roughly O(kD²) (the complexity of forming C_n is O(nD²)). If we have n points in D dimensions and n ≪ D, can we compute PCA in less than O(nD²)?

SLIDE 40

Remarks

◮ Computational complexity is roughly O(kD²) (the complexity of forming C_n is O(nD²)). If we have n points in D dimensions and n ≪ D, can we compute PCA in less than O(nD²)?
◮ The dimensionality reduction induced by PCA is a linear projection. Can we generalize PCA to non-linear dimensionality reduction?

SLIDE 41

Outline

◮ PCA & Reconstruction
◮ PCA and Maximum Variance
◮ PCA and Associated Eigenproblem
◮ Beyond the First Principal Component
◮ PCA and Singular Value Decomposition
◮ Kernel PCA

SLIDE 42

Singular Value Decomposition

Consider the data matrix X_n; its singular value decomposition is given by X_n = UΣV^T, where:

◮ U is an n × k matrix with orthonormal columns,
◮ V is a D × k matrix with orthonormal columns,
◮ Σ is a diagonal matrix such that Σ_{i,i} = √λ_i, i = 1, . . . , k, with k ≤ min{n, D}.

The columns of U and the columns of V are the left and right singular vectors, and the diagonal entries of Σ are the singular values.

SLIDE 43

Singular Value Decomposition (cont.)

The SVD can be equivalently described by the equations

\[
C_n p_j = \lambda_j p_j , \qquad
\frac{1}{n} K_n u_j = \lambda_j u_j , \qquad
X_n p_j = \sqrt{\lambda_j}\, u_j , \qquad
\frac{1}{n} X_n^T u_j = \sqrt{\lambda_j}\, p_j ,
\]

for j = 1, . . . , k, where C_n = (1/n) X_n^T X_n and (1/n) K_n = (1/n) X_n X_n^T.

SLIDE 44

PCA and Singular Value Decomposition

If n ≪ D we can consider the following procedure:

◮ form the matrix K_n, which is O(Dn²),
◮ find the first k eigenvectors of K_n, which is O(kn²),
◮ compute the principal components using

\[
p_j = \frac{1}{\sqrt{\lambda_j}} X_n^T u_j
= \frac{1}{\sqrt{\lambda_j}} \sum_{i=1}^{n} x_i\, u_j^i ,
\qquad j = 1, \ldots, k ,
\]

where u_j = (u_j^1, . . . , u_j^n). This is O(knD) if we consider k principal components.
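
A minimal numpy sketch of this n ≪ D route (our own illustration, not from the slides). For clean normalization we use the eigenvalues of K_n itself rather than of (1/n)K_n, which just rescales the λ_j above; we also assume k is at most the rank of the centered data:

```python
import numpy as np

def pca_small_n(X, k):
    """Top-k principal components of the rows of X (n x D) when n << D."""
    Xc = X - X.mean(axis=0)
    K = Xc @ Xc.T                          # K_n = X_n X_n^T, an n x n matrix: O(D n^2)
    eigvals, U = np.linalg.eigh(K)         # ascending eigenvalues of K_n
    eigvals = eigvals[::-1][:k]            # k largest eigenvalues (assumed > 0)
    U = U[:, ::-1][:, :k]                  # corresponding eigenvectors u_1, ..., u_k
    P = Xc.T @ U / np.sqrt(eigvals)        # p_j = X_n^T u_j / ||X_n^T u_j||: O(k n D)
    return P                               # D x k, unit-norm columns

# Sanity check on a small example with n = 20 points in D = 1000 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 1000))
P = pca_small_n(X, k=3)
print(np.allclose(P.T @ P, np.eye(3), atol=1e-8))   # True: columns are orthonormal
```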

SLIDE 45

Outline

◮ PCA & Reconstruction
◮ PCA and Maximum Variance
◮ PCA and Associated Eigenproblem
◮ Beyond the First Principal Component
◮ PCA and Singular Value Decomposition
◮ Kernel PCA

SLIDE 46

Beyond Linear Dimensionality Reduction?

By considering PCA we are implicitly assuming the data to lie on a linear subspace....

SLIDE 47

Beyond Linear Dimensionality Reduction?

By considering PCA we are implicitly assuming the data to lie on a linear subspace... ...it is easy to think of situations where this assumption might be violated.

SLIDE 48

Beyond Linear Dimensionality Reduction?

By considering PCA we are implicitly assuming the data to lie on a linear subspace... ...it is easy to think of situations where this assumption might be violated. Can we use kernels to obtain a non-linear generalization of PCA?

SLIDE 49

From SVD to KPCA

Using the SVD, the projection of a point x onto a principal component p_j, for j = 1, . . . , k, is

\[
(M(x))_j = x^T p_j
= \frac{1}{\sqrt{\lambda_j}}\, x^T X_n^T u_j
= \frac{1}{\sqrt{\lambda_j}} \sum_{i=1}^{n} x^T x_i\, u_j^i .
\]

Recall that

\[
C_n p_j = \lambda_j p_j , \qquad
\frac{1}{n} K_n u_j = \lambda_j u_j , \qquad
X_n p_j = \sqrt{\lambda_j}\, u_j , \qquad
\frac{1}{n} X_n^T u_j = \sqrt{\lambda_j}\, p_j .
\]

SLIDE 50

PCA and Feature Maps

\[
(M(x))_j = \frac{1}{\sqrt{\lambda_j}} \sum_{i=1}^{n} x^T x_i\, u_j^i .
\]

What if we consider a non-linear feature map Φ : X → F before performing PCA?

SLIDE 51

PCA and Feature Maps

\[
(M(x))_j = \frac{1}{\sqrt{\lambda_j}} \sum_{i=1}^{n} x^T x_i\, u_j^i .
\]

What if we consider a non-linear feature map Φ : X → F before performing PCA? Then

\[
(M(x))_j = \Phi(x)^T p_j
= \frac{1}{\sqrt{\lambda_j}} \sum_{i=1}^{n} \Phi(x)^T \Phi(x_i)\, u_j^i ,
\]

where K_n u_j = σ_j u_j and (K_n)_{i,j} = Φ(x_i)^T Φ(x_j).

SLIDE 52

Kernel PCA

\[
(M(x))_j = \Phi(x)^T p_j
= \frac{1}{\sqrt{\lambda_j}} \sum_{i=1}^{n} \Phi(x)^T \Phi(x_i)\, u_j^i .
\]

If the feature map is defined by a positive definite kernel K : X × X → R, then

\[
(M(x))_j = \frac{1}{\sqrt{\lambda_j}} \sum_{i=1}^{n} K(x, x_i)\, u_j^i ,
\]

where K_n u_j = σ_j u_j and (K_n)_{i,j} = K(x_i, x_j).
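
A minimal kernel PCA sketch along these lines (our own illustration, not from the slides; the Gaussian kernel, the scaling by the eigenvalues of K_n, and all names are our choices, and feature-space centering is omitted, as in the formula above):

```python
import numpy as np

def gaussian_kernel(A, B, gamma=1.0):
    """K(a, b) = exp(-gamma * ||a - b||^2) for all pairs of rows of A and B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

def kernel_pca_fit(X, k, gamma=1.0):
    """Eigendecompose (K_n)_{i,j} = K(x_i, x_j) and keep the top-k eigenpairs."""
    K = gaussian_kernel(X, X, gamma)
    sigmas, U = np.linalg.eigh(K)             # K_n u_j = sigma_j u_j, ascending order
    return U[:, ::-1][:, :k], sigmas[::-1][:k]

def kernel_pca_transform(Xnew, X, U, sigmas, gamma=1.0):
    """(M(x))_j = (1 / sqrt(sigma_j)) * sum_i K(x, x_i) u_j^i, for each row x of Xnew."""
    Kx = gaussian_kernel(Xnew, X, gamma)      # kernel values K(x, x_i)
    return Kx @ U / np.sqrt(sigmas)

# Usage sketch: two concentric circles, which no linear projection can separate.
rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, size=200)
radii = np.repeat([1.0, 3.0], 100)
X = np.stack([radii * np.cos(angles), radii * np.sin(angles)], axis=1)
U, sigmas = kernel_pca_fit(X, k=2, gamma=0.5)
M = kernel_pca_transform(X, X, U, sigmas, gamma=0.5)
print(M.shape)   # (200, 2): a 2-dimensional non-linear embedding of the data
```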

SLIDE 53

Wrapping Up

In this class we introduced PCA as a basic tool for dimensionality reduction. We discussed computational aspects and extensions to non-linear dimensionality reduction (KPCA).

SLIDE 54

Next Class

In the next class, beyond dimensionality reduction, we ask how we can devise interpretable data models, and discuss a class of methods based on the concept of sparsity.
