SLIDE 1

Principal component analysis

Ingo Blechschmidt

December 17th, 2014

Kleine Bayessche AG

SLIDE 21

Outline

1 Theory
  • Singular value decomposition
  • Pseudoinverses
  • Low-rank approximation

2 Applications
  • Image compression
  • Proper orthogonal decomposition
  • Principal component analysis
  • Eigenfaces
  • Digit recognition


SLIDE 22

Singular value decomposition

Let $A \in \mathbb{R}^{n \times m}$. Then there exist numbers $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_m \geq 0$, an orthonormal basis $v_1, \ldots, v_m$ of $\mathbb{R}^m$, and an orthonormal basis $w_1, \ldots, w_n$ of $\mathbb{R}^n$, such that
$$A v_i = \sigma_i w_i, \qquad i = 1, \ldots, m.$$
In matrix language: $A = W \Sigma V^t$, where $V = (v_1 | \cdots | v_m) \in \mathbb{R}^{m \times m}$ is orthogonal, $W = (w_1 | \cdots | w_n) \in \mathbb{R}^{n \times n}$ is orthogonal, and $\Sigma = \operatorname{diag}(\sigma_1, \ldots, \sigma_m) \in \mathbb{R}^{n \times m}$.
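As a quick numerical illustration (not from the original slides), NumPy's numpy.linalg.svd returns exactly these factors $W$, $\sigma_i$ and $V^t$; a minimal check that $A = W \Sigma V^t$:

```python
import numpy as np

# A hypothetical rectangular matrix A ∈ R^(n×m) with n = 4, m = 3.
A = np.array([[3.0, 1.0, 0.0],
              [1.0, 2.0, 1.0],
              [0.0, 1.0, 3.0],
              [2.0, 0.0, 1.0]])

# full_matrices=True yields W ∈ R^(n×n) and V^t ∈ R^(m×m);
# sigma contains the singular values σ1 ≥ ... ≥ σm ≥ 0.
W, sigma, Vt = np.linalg.svd(A, full_matrices=True)

# Rebuild Σ ∈ R^(n×m) and verify the factorization A = W Σ V^t.
Sigma = np.zeros(A.shape)
np.fill_diagonal(Sigma, sigma)
print(np.allclose(A, W @ Sigma @ Vt))   # True up to rounding
```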


SLIDE 23
  • The singular value decomposition (SVD) exists for any real matrix, even rectangular ones.
  • The singular values $\sigma_i$ are unique.
  • The basis vectors are not unique.
  • If $A$ is orthogonally diagonalizable with eigenvalues $\lambda_i$ (for instance, if $A$ is symmetric), then $\sigma_i = |\lambda_i|$.
  • $\|A\|_{\mathrm{Frobenius}} = \sqrt{\sum_{ij} A_{ij}^2} = \sqrt{\operatorname{tr}(A^t A)} = \sqrt{\sum_i \sigma_i^2}$.
  • There exists a generalization to complex matrices. In this case, the matrix $A$ can be decomposed as $W \Sigma V^\star$, where $V^\star$ is the complex conjugate of $V^t$ and $W$ and $V$ are unitary matrices.
  • The singular value decomposition can also be formulated in a basis-free manner as a result about linear maps between finite-dimensional Hilbert spaces.

SLIDE 24

Existence proof (sketch):

  1. Consider the eigenvalue decomposition of the symmetric and positive-semidefinite matrix $A^t A$: We have an orthonormal basis $v_i$ of eigenvectors corresponding to eigenvalues $\lambda_i$.
  2. Set $\sigma_i := \sqrt{\lambda_i}$.
  3. Set $w_i := \frac{1}{\sigma_i} A v_i$ (for those $i$ with $\lambda_i \neq 0$).
  4. Then $A v_i = \sigma_i w_i$ holds trivially.
  5. The $w_i$ are orthonormal: $(w_i, w_j) = \frac{1}{\sigma_i \sigma_j} (A^t A v_i, v_j) = \frac{\lambda_i \delta_{ij}}{\sigma_i \sigma_j}$.
  6. If necessary, extend the $w_i$ to an orthonormal basis.

This proof gives rise to an algorithm for calculating the SVD, but unless $A^t A$ is small, it has undesirable numerical properties. (But note that one can also use $A A^t$!) Since the 1960s, there exists a stable iterative algorithm by Golub and Kahan.
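A minimal NumPy sketch of this construction (assuming, for simplicity, that $A$ has full column rank so that step 6 is unnecessary; as noted above, this route is only advisable for small matrices):

```python
import numpy as np

def svd_via_eig(A):
    """Naive SVD following the existence proof, via the eigendecomposition of A^t A.
    Assumes rank A = m, so all singular values are positive and step 6 is skipped."""
    lam, V = np.linalg.eigh(A.T @ A)          # step 1: A^t A = V diag(λ) V^t (λ ascending)
    lam, V = lam[::-1], V[:, ::-1]            # reorder so that σ1 ≥ σ2 ≥ ...
    sigma = np.sqrt(np.clip(lam, 0.0, None))  # step 2: σ_i = sqrt(λ_i)
    W = (A @ V) / sigma                       # step 3: w_i = A v_i / σ_i (column-wise)
    return W, sigma, V                        # steps 4-5: A v_i = σ_i w_i, w_i orthonormal

A = np.random.randn(5, 3)                     # hypothetical full-rank test matrix
W, sigma, V = svd_via_eig(A)
print(np.allclose(A, W @ np.diag(sigma) @ V.T))   # True
print(np.allclose(W.T @ W, np.eye(3)))            # the w_i are indeed orthonormal
```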

SLIDE 25

The pseudoinverse of a matrix

Let $A \in \mathbb{R}^{n \times m}$ and $b \in \mathbb{R}^n$. Then the solutions to the optimization problem
$$\|Ax - b\|^2 \longrightarrow \min \quad \text{under } x \in \mathbb{R}^m$$
are given by $x = A^+ b + v$ with $v \in \ker A$, where $A = W \Sigma V^t$ is the SVD and
$$A^+ = V \Sigma^+ W^t, \qquad \Sigma^+ = \operatorname{diag}(\sigma_1^{-1}, \ldots, \sigma_m^{-1}) \in \mathbb{R}^{m \times n}.$$


SLIDE 26
  • In the formula for $\Sigma^+$, set $0^{-1} := 0$.
  • If $A$ happens to be invertible, then $A^+ = A^{-1}$.
  • The pseudoinverse can be used for polynomial approximation: Let data points $(x_i, y_i) \in \mathbb{R}^2$, $1 \leq i \leq N$, be given. Want to find a polynomial $p(z) = \sum_{k=0}^{n} \alpha_k z^k$, $n \ll N$, such that
    $$\sum_{i=1}^{N} |p(x_i) - y_i|^2 \longrightarrow \min.$$
    In matrix language, this problem is written $\|Au - y\|^2 \longrightarrow \min$, where $u = (\alpha_0, \ldots, \alpha_n)^T \in \mathbb{R}^{n+1}$ and
    $$A = \begin{pmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^n \\ 1 & x_2 & x_2^2 & \cdots & x_2^n \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_N & x_N^2 & \cdots & x_N^n \end{pmatrix} \in \mathbb{R}^{N \times (n+1)}, \qquad y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix} \in \mathbb{R}^N.$$
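A short sketch of this polynomial fit in NumPy (the data points are made up for illustration); numpy.linalg.pinv computes $A^+$ via the SVD as defined above:

```python
import numpy as np

# Hypothetical noisy samples (x_i, y_i) of a quadratic polynomial.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 50)                   # N = 50 data points
y = 2.0 - x + 3.0 * x**2 + 0.1 * rng.standard_normal(x.size)

n = 2                                            # polynomial degree, n << N
A = np.vander(x, n + 1, increasing=True)         # columns: 1, x, x^2, ..., x^n
u = np.linalg.pinv(A) @ y                        # u = A^+ y minimizes ||A u - y||^2
print(u)                                         # ≈ (α_0, α_1, α_2) = (2, -1, 3)
```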

SLIDE 27

Low-rank approximation

Let $A = W \Sigma V^t \in \mathbb{R}^{n \times m}$ and $1 \leq r \leq n, m$. Then a solution to the optimization problem
$$\|A - M\|_{\mathrm{Frobenius}} \longrightarrow \min \quad \text{under all matrices } M \text{ with } \operatorname{rank} M \leq r$$
is given by $M = W \Sigma_r V^t$, where $\Sigma_r = \operatorname{diag}(\sigma_1, \ldots, \sigma_r, 0, \ldots, 0)$. The approximation error is
$$\|A - W \Sigma_r V^t\|_F = \sqrt{\sigma_{r+1}^2 + \cdots + \sigma_m^2}.$$
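A quick numerical check of this statement (random test matrix, not from the slides):

```python
import numpy as np

A = np.random.randn(8, 6)
W, sigma, Vt = np.linalg.svd(A, full_matrices=False)

r = 2
A_r = W[:, :r] @ np.diag(sigma[:r]) @ Vt[:r, :]   # best rank-r approximation
print(np.linalg.matrix_rank(A_r))                 # r
# The Frobenius error equals sqrt(σ_{r+1}^2 + ... + σ_m^2).
print(np.linalg.norm(A - A_r, 'fro'), np.sqrt(np.sum(sigma[r:]**2)))
```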


SLIDE 28
  • This is the Eckart–Young(–Mirsky) theorem.
  • Beware of false and incomplete proofs in the literature!
SLIDE 29

Image compression

Think of images as matrices. Substitute a matrix $W \Sigma V^t$ by $W \Sigma_r V^t$ with $r$ small. To reconstruct $W \Sigma_r V^t$, one only needs to know

  • the $r$ singular values $\sigma_1, \ldots, \sigma_r$ ($r$ numbers),
  • the first $r$ columns of $W$ (height · $r$ numbers), and
  • the top $r$ rows of $V^t$ (width · $r$ numbers).

Total amount: $r \cdot (1 + \text{height} + \text{width}) \ll \text{height} \cdot \text{width}$.
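As a rough illustration with made-up numbers: for a hypothetical 512 × 512 grayscale image and $r = 20$, one stores $20 \cdot (1 + 512 + 512) = 20{,}500$ numbers instead of $512 \cdot 512 = 262{,}144$, i.e. less than 8 % of the original.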


SLIDE 30
  • See http://speicherleck.de/iblech/stuff/pca-images.pdf for sample compressions and http://pizzaseminar.speicherleck.de/skript4/08-principal-component-analysis/svd-image.py for the Python code producing these images.
  • Image compression by singular value decomposition is mostly of academic interest only.
  • This might be for the following reasons: other compression algorithms have more efficient implementations; other algorithms are tailored to the specific properties of human vision; the basis vectors of other approaches (for instance, DCT) are similar to the most important singular basis vectors of a sufficiently large corpus of images.

  • See http://dsp.stackexchange.com/questions/7859/relationship-between-dct-and-pca.
SLIDE 31

Proper orthogonal decomposition

Given data points $x_i \in \mathbb{R}^N$, want to find a low-dimensional linear subspace which approximately contains the $x_i$. Minimize
$$J(U) := \sum_i \|x_i - P_U(x_i)\|^2$$
under all $r$-dimensional subspaces $U \subseteq \mathbb{R}^N$, $r \ll N$, where $P_U : \mathbb{R}^N \to \mathbb{R}^N$ is the orthogonal projection onto $U$.


SLIDE 32

Proper orthogonal decomposition

More concrete formulation: Minimize
$$J(u_1, \ldots, u_r) := \sum_i \Big\| x_i - \sum_{j=1}^{r} \langle x_i, u_j \rangle\, u_j \Big\|^2,$$
where $u_1, \ldots, u_r \in \mathbb{R}^N$ and $\langle u_j, u_k \rangle = \delta_{jk}$.


SLIDE 33
  • In the first formulation, the optimization domain is the Grassmannian of $r$-dimensional subspaces in $\mathbb{R}^N$. It is a compact topological space (in fact a manifold of dimension $r \cdot (N - r)$). Since $J(U)$ depends continuously on $U$, the optimization problem is guaranteed to have a solution.
  • The solution is in general not unique, not even locally. For instance, consider the four data points $(\pm 1, \pm 1)$ in $\mathbb{R}^2$. Then any line $U$ through the origin solves the optimization problem, with functional value $J(U) = 4$.
  • In the more concrete formulation, we look for an orthonormal basis of a suitable subspace. In this case, the optimization domain is a compact subset of $\mathbb{R}^{N \times r}$.
  • Since a given subspace possesses infinitely many orthonormal bases (at least for $r \geq 2$), solutions to this refined problem are never unique, not even locally.
  • Note that this is a non-convex optimization problem. Therefore common numerical techniques do not apply.

SLIDE 34

Collect the data points $x_i$ as columns of a matrix $X = (x_1 | \cdots | x_\ell) \in \mathbb{R}^{N \times \ell}$ and consider its singular value decomposition $X = W \Sigma V^t$. Then a solution to the minimization problem is given by the first $r$ columns of $W$, with approximation error
$$J = \sum_i \|x_i\|^2 - \sum_{j=1}^{r} \sigma_j^2.$$
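A minimal NumPy sketch of this recipe (synthetic data; if an affine subspace is wanted, the points should first be centered, as noted on the next slide):

```python
import numpy as np

# Synthetic data: ℓ = 200 points in R^50 lying close to a 3-dimensional subspace.
rng = np.random.default_rng(1)
basis = np.linalg.qr(rng.standard_normal((50, 3)))[0]
X = basis @ rng.standard_normal((3, 200)) + 0.01 * rng.standard_normal((50, 200))

W, sigma, Vt = np.linalg.svd(X, full_matrices=False)
r = 3
U = W[:, :r]                 # POD basis: the first r columns of W
PX = U @ (U.T @ X)           # projections P_U(x_i) of the data points

# J = sum_i ||x_i - P_U(x_i)||^2 equals sum_i ||x_i||^2 - sum_{j<=r} σ_j^2.
J = np.sum((X - PX)**2)
print(J, np.sum(X**2) - np.sum(sigma[:r]**2))   # the two values agree
```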

SLIDE 35
  • Proper orthogonal decomposition (POD) cannot be used to find low-dimensional submanifolds which approximately contain given data points. But check out kernel principal component analysis.
  • Also, POD does not work well with affine subspaces. But in this case, the fix is easy: Simply shift the data points so that their mean is zero.
  • POD is a general method for dimension reduction and can be used as a kind of “preconditioner” for many other algorithms: Simply substitute the given points $x_i$ by their projections $P_U(x_i)$.

SLIDE 36

Principal component analysis

Given observations $x_i^{(k)}$ of random variables $X^{(k)}$, want to find linearly uncorrelated principal components. Write $X = (x_1 | \cdots | x_\ell) \in \mathbb{R}^{N \times \ell}$. Calculate $X = W \Sigma V^t$. Then the principal components are the variables
$$Y^{(j)} = \sum_k W_{kj} X^{(k)}.$$
Most of the variance is captured by $Y^{(1)}$; the second most is captured by $Y^{(2)}$; and so on.
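A hedged sketch in NumPy (synthetic data anticipating the circle example in the remarks on the next slide; note the mean-centering discussed there):

```python
import numpy as np

# Three linearly correlated variables: radius, diameter, circumference of circles.
rng = np.random.default_rng(2)
radius = rng.uniform(1.0, 10.0, 500)                       # ℓ = 500 observations
X = np.vstack([radius, 2 * radius, 2 * np.pi * radius])
X = X + 0.01 * rng.standard_normal(X.shape)                # small measurement noise

X = X - X.mean(axis=1, keepdims=True)   # zero empirical mean for each variable
W, sigma, Vt = np.linalg.svd(X, full_matrices=False)

Y = W.T @ X                             # rows are the principal components Y^(j)
print(sigma**2 / np.sum(sigma**2))      # ≈ (1, 0, 0): one component captures the variance
print(np.round(np.cov(Y), 6))           # nearly diagonal: the Y^(j) are uncorrelated
```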


SLIDE 37
  • For instance, in a study about circles, the variables radius, diameter, and circumference are linearly correlated.
  • Principal component analysis (PCA) would automatically pick one of these attributes as a principal component.
  • We have to normalize the data to have zero empirical mean first. Then $X X^t$ is the empirical covariance matrix.
  • Note that, in the given sample, the $Y^{(j)}$ are indeed uncorrelated: the empirical covariance of $Y^{(j)}$ and $Y^{(k)}$ is
    $$(W e_j)^t X X^t (W e_k) = e_j^t W^t W \Sigma V^t V \Sigma^t W^t W e_k = e_j^t \Sigma \Sigma^t e_k,$$
    which vanishes for $j \neq k$ since $\Sigma \Sigma^t$ is diagonal.
  • Beware that PCA cannot resolve nonlinear correlation.
  • Also note that PCA is sensitive to outliers and is not scaling-independent.

SLIDE 38

Eigenfaces

Record sample faces $x_1, \ldots, x_N \in \mathbb{R}^{\text{width} \cdot \text{height}}$. Calculate a POD basis of eigenfaces. Recognize faces by looking at the coefficients of the most important eigenfaces.

Eigenfaces resemble faces.
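A rough sketch of the recipe (hypothetical array names and shapes; a real implementation would load actual face images and handle alignment):

```python
import numpy as np

def eigenface_coefficients(faces, new_face, r=20):
    """faces: array of shape (num_faces, height*width), one flattened face per row.
    Returns the coefficients of new_face in the first r eigenfaces."""
    mean_face = faces.mean(axis=0)
    X = (faces - mean_face).T                    # columns = centered sample faces
    W, sigma, Vt = np.linalg.svd(X, full_matrices=False)
    eigenfaces = W[:, :r]                        # POD basis of "eigenfaces"
    return eigenfaces.T @ (new_face - mean_face)

# Recognition then compares such coefficient vectors, e.g. by nearest neighbour.
```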


SLIDE 39

More eigenfaces


SLIDE 40
  • A naive approach is very sensitive to lighting, scale and translation.
  • But extensions are possible, for instance considering the eyes, the nose, and the mouth separately; this leads to eigeneyes, eigennoses, and eigenmouths.
  • Image credit:
    http://upload.wikimedia.org/wikipedia/commons/6/67/Eigenfaces.png
    http://www.cenparmi.concordia.ca/~jdong/eigenface.gif
  • Live demo:
    http://cognitrn.psych.indiana.edu/nsfgrant/FaceMachine/faceMachine.html
  • Examples:
    http://www.cs.princeton.edu/~cdecoro/eigenfaces/

SLIDE 41

Digit recognition

Apply POD for dimension reduction, then use some similarity measure or clustering technique. Results: [figure: POD coefficients of sample digits, described in the SLIDE 42 notes]
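The original code (linked in the notes below) only visualizes the coefficients; a hypothetical minimal classifier on top of the POD coefficients could look like this sketch:

```python
import numpy as np

def pod_basis(train_images, r=10):
    """train_images: array of shape (num_samples, 28*28) of flattened digit images."""
    mean = train_images.mean(axis=0)
    X = (train_images - mean).T                  # columns = centered training samples
    W, _, _ = np.linalg.svd(X, full_matrices=False)
    return mean, W[:, :r]                        # POD basis ("eigendigits")

def classify(image, mean, basis, train_coeffs, train_labels):
    """Nearest neighbour in the r-dimensional POD coefficient space,
    where train_coeffs = (train_images - mean) @ basis."""
    c = basis.T @ (image - mean)
    return train_labels[np.argmin(np.linalg.norm(train_coeffs - c, axis=1))]
```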


SLIDE 42
  • These images were produced by a Python program. The actual numerical code is very short (a few lines). Of course, the MNIST data set was used.
    http://pizzaseminar.speicherleck.de/skript4/08-principal-component-analysis/digit-recognition.py
  • The nine images show the values of the first ten POD coefficients of the first 30 samples of the digits 1 to 9. Each column corresponds to a different sample. The first four POD coefficients are drawn using more vertical space, so that visual weight aligns with importance.
  • One can clearly see that the POD coefficients differ for the different digits.
  • Also, one can see that the difference is not so great for similar digits like 5 and 8 or 7 and 9.

SLIDE 43

Eigendigits


SLIDE 44
  • These images show the first 12 POD basis vectors. The first basis vector is a kind of “prototypical digit”. The other basis vectors give subsequent “higher-order terms”.
  • Because I didn’t implement a similarity measure or clustering technique, I couldn’t calculate the percentage of correctly classified digits. However, presumably the success rate would not be too high: Like the eigenfaces approach, this naive implementation is sensitive to the specific position of the digits in the bounding box. Refined techniques are discussed in the literature.
