Lecture 24: Principal Component Analysis (Aykut Erdem, January 2017, Hacettepe University)


slide-1
SLIDE 1

Lecture 24: Principal Component Analysis

Aykut Erdem

January 2017, Hacettepe University

slide-2
SLIDE 2

This week

  • Motivation
  • PCA algorithms
  • Applications
  • PCA shortcomings
  • Autoencoders
  • Kernel PCA

2

slide-3
SLIDE 3

PCA Applications

  • Data Visualization
  • Data Compression
  • Noise Reduction
  • Learning
  • Anomaly detection

3

slide by Barnabás Póczos and Aarti Singh

slide-4
SLIDE 4

Data Visualization

Example:

  • Given 53 blood and urine samples (features) from 65 people.

  • How can we visualize the measurements?

4

slide by Barnabás Póczos and Aarti Singh

slide-5
SLIDE 5

Data Visualization

  • Matrix format (65 × 53)

        H-WBC   H-RBC   H-Hgb   H-Hct   H-MCV   H-MCH   H-MCHC
  A1    8.0     4.82    14.1    41.0    85.0    29.0    34.0
  A2    7.3     5.02    14.7    43.0    86.0    29.0    34.0
  A3    4.3     4.48    14.1    41.0    91.0    32.0    35.0
  A4    7.5     4.47    14.9    45.0    101.0   33.0    33.0
  A5    7.3     5.52    15.4    46.0    84.0    28.0    33.0
  A6    6.9     4.86    16.0    47.0    97.0    33.0    34.0
  A7    7.8     4.68    14.7    43.0    92.0    31.0    34.0
  A8    8.6     4.82    15.8    42.0    88.0    33.0    37.0
  A9    5.1     4.71    14.0    43.0    92.0    30.0    32.0

Rows: instances (people); columns: features. Difficult to see the correlations between the features...

5

slide by Barnabás Póczos and Aarti Singh

slide-6
SLIDE 6

Data Visualization

  • Spectral format (65 curves, one for each person)

6

(Figure: measurement value vs. measurement number, one curve per person)

Difficult to compare the different patients...

slide by Barnabás Póczos and Aarti Singh

slide-7
SLIDE 7

Data Visualization

  • Spectral format (53 pictures, one for each feature)

7

(Figure: H-Bands value vs. person)

  • Difficult to see the correlations between the features...

slide by Barnabás Póczos and Aarti Singh

slide-8
SLIDE 8

Data Visualization

8

(Figures: bi-variate scatter plot of C-LDH vs. C-Triglycerides; tri-variate scatter plot of C-Triglycerides, C-LDH, and M-EPI)

How can we visualize the other variables? ... difficult to see in 4- or higher-dimensional spaces...

slide by Barnabás Póczos and Aarti Singh

slide-9
SLIDE 9

Data Visualization

  • Is there a representation better than the coordinate axes?
  • Is it really necessary to show all the 53 dimensions?
  • ... what if there are strong correlations between the features?
  • How could we find the smallest subspace of the 53-D space that keeps the most information about the original data?

  • A solution: Principal Component Analysis

9

slide by Barnabás Póczos and Aarti Singh

slide-10
SLIDE 10

PCA algorithms

10

slide-11
SLIDE 11

Principal Component Analysis

PCA: Orthogonal projection of the data onto a lower-dimensional linear space that...

  • maximizes the variance of the projected data (purple line)
  • minimizes the mean squared distance between the data points and their projections (sum of blue lines)

11

slide by Barnabás Póczos and Aarti Singh
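A short side note (not on the slide) on why the two criteria above pick the same direction: for centered data and a unit vector w, the reconstruction error decomposes into a constant minus the projected variance, so minimizing one maximizes the other.

```latex
\frac{1}{m}\sum_{i=1}^{m}\bigl\|x_i - w\,(w^{\top}x_i)\bigr\|^{2}
  \;=\; \frac{1}{m}\sum_{i=1}^{m}\|x_i\|^{2}
  \;-\; \frac{1}{m}\sum_{i=1}^{m}(w^{\top}x_i)^{2},
\qquad \|w\| = 1 .
```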
slide-12
SLIDE 12

Principal Component Analysis

Idea:

  • Given data points in a d-dimensional space, project them into a lower-dimensional space while preserving as much information as possible.
  • Find the best planar approximation to 3D data
  • Find the best 12-D approximation to 10⁴-D data
  • In particular, choose the projection that minimizes the squared error in reconstructing the original data.

12

slide by Barnabás Póczos and Aarti Singh

slide-13
SLIDE 13

Principal Component Analysis

  • PCA vectors originate from the center of mass.

  • Principal component #1 points in the direction of the largest variance.

  • Each subsequent principal component is orthogonal to the previous ones and points in the direction of the largest variance of the residual subspace.

13

slide by Barnabás Póczos and Aarti Singh
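A small numerical illustration (my own, not from the slides) of the next three figures: draw a correlated 2-D Gaussian sample, center it, and read the two PCA axes off the eigenvectors of the sample covariance. The names `X`, `S`, and `pcs` are just for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Correlated 2-D Gaussian dataset, one row per data point
cov = np.array([[3.0, 1.5],
                [1.5, 1.0]])
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=500)

# PCA vectors originate from the center of mass: subtract the mean
Xc = X - X.mean(axis=0)

# Eigendecomposition of the sample covariance (2 x 2 here)
S = Xc.T @ Xc / Xc.shape[0]
eigvals, eigvecs = np.linalg.eigh(S)        # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]
eigvals, pcs = eigvals[order], eigvecs[:, order]

print("1st PCA axis:", pcs[:, 0], "variance:", eigvals[0])
print("2nd PCA axis:", pcs[:, 1], "variance:", eigvals[1])
# The two axes are orthogonal: pcs[:, 0] @ pcs[:, 1] is ~ 0
```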

slide-14
SLIDE 14

2D Gaussian dataset

14

slide by Barnabás Póczos and Aarti Singh

slide-15
SLIDE 15

1st PCA axis

15

slide by Barnabás Póczos and Aarti Singh

slide-16
SLIDE 16

2nd PCA axis

16

slide by Barnabás Póczos and Aarti Singh

slide-17
SLIDE 17

PCA algorithm I (sequential)

17

Given the centered data {x_1, …, x_m}, compute the principal vectors:

1st PCA vector:

  w_1 = \arg\max_{\|w\|=1} \frac{1}{m}\sum_{i=1}^{m} (w^\top x_i)^2

  We maximize the variance of the projection of x.

2nd PCA vector:

  w_2 = \arg\max_{\|w\|=1} \frac{1}{m}\sum_{i=1}^{m} \bigl[ w^\top (x_i - w_1 w_1^\top x_i) \bigr]^2

  We maximize the variance of the projection in the residual subspace.

(Figure: a data point x, its PCA reconstruction x' = w_1(w_1^\top x), and the residual x − x')

slide by Barnabás Póczos and Aarti Singh

slide-18
SLIDE 18

PCA algorithm I (sequential)

18

Given w_1, …, w_{k-1}, we compute the k-th principal vector as before:

k-th PCA vector:

  w_k = \arg\max_{\|w\|=1} \frac{1}{m}\sum_{i=1}^{m} \Bigl[ w^\top \Bigl( x_i - \sum_{j=1}^{k-1} w_j w_j^\top x_i \Bigr) \Bigr]^2

  We maximize the variance of the projection in the residual subspace.

(Figure: x, its PCA reconstruction x' = w_1(w_1^\top x) + w_2(w_2^\top x), and the residual x − x')

slide by Barnabás Póczos and Aarti Singh
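A sketch of this sequential view (my own helper, assuming centered data points in the rows of `X`): find each direction by maximizing the variance of the residual, here via power iteration on the residual covariance, then deflate.

```python
import numpy as np

def sequential_pca(X, k, n_iter=200):
    """Sequential PCA by deflation.

    X : (m, d) array of centered data points (one per row).
    Returns W : (d, k) with the principal vectors as columns.
    """
    m, d = X.shape
    R = X.copy()                      # residual data
    W = np.zeros((d, k))
    for comp in range(k):
        S = R.T @ R / m               # covariance of the residual
        w = np.random.default_rng(comp).standard_normal(d)
        for _ in range(n_iter):       # power iteration -> top eigenvector of S
            w = S @ w
            w /= np.linalg.norm(w)
        W[:, comp] = w
        R = R - np.outer(R @ w, w)    # remove the projection onto w (deflation)
    return W

# Toy usage on random 5-D data
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5)) @ rng.standard_normal((5, 5))
X = X - X.mean(axis=0)
W = sequential_pca(X, k=2)
print(W.T @ W)    # ~ identity: the components are orthonormal
```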

slide-19
SLIDE 19

PCA algorithm II (sample covariance matrix)

  • Given data {x_1, …, x_m}, compute the covariance matrix Σ:

      \Sigma = \frac{1}{m}\sum_{i=1}^{m} (x_i - \bar{x})(x_i - \bar{x})^\top ,
      \qquad \bar{x} = \frac{1}{m}\sum_{i=1}^{m} x_i

  • PCA basis vectors = the eigenvectors of Σ
  • Larger eigenvalue ⇒ more important eigenvector

19

slide by Barnabás Póczos and Aarti Singh

slide-20
SLIDE 20

PCA algorithm II 
 (sample covariance matrix)

20

PCA algorithm(X, k): top k eigenvalues/eigenvectors
  % X = N × m data matrix,
  % ... each data point x_i = column vector, i = 1..m

  • x̄ ← (1/m) Σ_{i=1..m} x_i
  • X ← subtract mean x̄ from each column vector x_i in X
  • Σ ← X Xᵀ                        ... covariance matrix of X
  • {λ_i, u_i}_{i=1..N} = eigenvectors/eigenvalues of Σ,  λ_1 ≥ λ_2 ≥ … ≥ λ_N
  • Return {λ_i, u_i}_{i=1..k}      % top k PCA components

slide by Barnabás Póczos and Aarti Singh
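A runnable counterpart to the pseudocode above (a sketch, keeping the slide's convention of data points as columns of `X`; the 1/m factor only rescales the eigenvalues, not the eigenvectors).

```python
import numpy as np

def pca_algorithm(X, k):
    """PCA algorithm II: X is an (N, m) matrix with data points as columns."""
    x_bar = X.mean(axis=1, keepdims=True)       # mean data point
    Xc = X - x_bar                              # subtract the mean from every column
    Sigma = Xc @ Xc.T / X.shape[1]              # sample covariance matrix (N x N)
    eigvals, eigvecs = np.linalg.eigh(Sigma)    # symmetric eigendecomposition
    order = np.argsort(eigvals)[::-1]           # sort so lambda_1 >= lambda_2 >= ...
    return eigvals[order][:k], eigvecs[:, order][:, :k]

# Toy usage: 10-dimensional points, keep the top 3 components
rng = np.random.default_rng(1)
X = rng.standard_normal((10, 200))
lam, U = pca_algorithm(X, k=3)
print(lam)        # top 3 eigenvalues, descending
print(U.shape)    # (10, 3): the top 3 principal vectors as columns
```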

slide-21
SLIDE 21

PCA algorithm III 
 (SVD of the data matrix)

21

Singular Value Decomposition of the centered data matrix X:

  X_{features × samples} = U S Vᵀ

(Figure: block diagram of X = U S Vᵀ over the samples; the leading singular values/vectors carry the significant structure, the trailing ones mostly noise)

slide by Barnabás Póczos and Aarti Singh
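A sketch of algorithm III with NumPy (my own toy example, same layout as the slide: features in rows, samples in columns): the columns of U are the principal directions, and the singular values give the eigenvalue spectrum of the covariance via λ_i = s_i² / m.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((53, 65))              # features x samples, as on the slide
Xc = X - X.mean(axis=1, keepdims=True)         # center each feature

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 5
principal_vectors = U[:, :k]                   # columns of U: principal directions
coefficients = np.diag(s[:k]) @ Vt[:k, :]      # S V^T: coordinates of each sample
X_hat = principal_vectors @ coefficients       # rank-k reconstruction of Xc

# Same eigenvalues as the covariance route: lambda_i = s_i**2 / m
print(s[:k] ** 2 / Xc.shape[1])
print(np.linalg.norm(Xc - X_hat))              # reconstruction error with k components
```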

slide-22
SLIDE 22

PCA algorithm III

22

  • Columns of U
    - the principal vectors, {u(1), …, u(k)}
    - orthogonal and of unit norm, so UᵀU = I
    - we can reconstruct the data using linear combinations of {u(1), …, u(k)}
  • Matrix S
    - diagonal
    - shows the importance of each eigenvector
  • Columns of Vᵀ
    - the coefficients for reconstructing the samples

slide by Barnabás Póczos and Aarti Singh

slide-23
SLIDE 23

Applications

23

slide-24
SLIDE 24

Face Recognition

24

slide-25
SLIDE 25

Face Recognition

  • Want to identify specific person, based on facial image
  • Robust to glasses, lighting, …
  • Can’t just use the given 256 x 256 pixels

25


slide by Barnabás Póczos and Aarti Singh

slide-26
SLIDE 26

Applying PCA: Eigenfaces

26

Example data set: Images of faces

  • Famous Eigenface approach

[Turk & Pentland], [Sirovich & Kirby]

Each face x is ...

  • 256 × 256 values (luminance at each location)
  • x in ℝ^(256·256) (view as a 64K-dimensional vector)

Form X = [x1, …, xm], the centered data matrix.
Compute Σ = X Xᵀ.
Problem: Σ is 64K × 64K ... HUGE!!!

(Figure: X = [x1, …, xm]: m faces, each 256 × 256 real values)

Method A: Build a PCA subspace for each person and check which subspace can reconstruct the test image the best.
Method B: Build one PCA database for the whole dataset and then classify based on the weights.

slide by Barnabás Póczos and Aarti Singh
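A rough sketch of Method B (my own illustration, not code from the lecture, and it interprets "classify based on the weights" as a nearest-neighbour rule): project every training face onto a shared eigenface basis and classify a test face by the closest weight vector. The arrays `faces` and `labels` are hypothetical stand-ins for a real face dataset.

```python
import numpy as np

def fit_eigenfaces(faces, k):
    """faces: (m, d) matrix, one flattened face image per row."""
    mean_face = faces.mean(axis=0)
    U, s, Vt = np.linalg.svd(faces - mean_face, full_matrices=False)
    return mean_face, Vt[:k]                     # k eigenfaces (rows of length d)

def project(face, mean_face, eigenfaces):
    return eigenfaces @ (face - mean_face)       # weight vector of length k

def classify(test_face, train_faces, labels, mean_face, eigenfaces):
    """Method B: nearest neighbour in eigenface-weight space."""
    w_test = project(test_face, mean_face, eigenfaces)
    w_train = (train_faces - mean_face) @ eigenfaces.T
    dists = np.linalg.norm(w_train - w_test, axis=1)
    return labels[np.argmin(dists)]

# Toy usage with random "faces" standing in for real images
rng = np.random.default_rng(3)
faces = rng.standard_normal((50, 4096))          # 50 faces, 64 x 64 pixels flattened
labels = np.repeat(np.arange(10), 5)             # 10 people, 5 images each
mean_face, eigenfaces = fit_eigenfaces(faces, k=20)
print(classify(faces[0], faces, labels, mean_face, eigenfaces))
```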

slide-27
SLIDE 27

Computational Complexity

27

  • Suppose m instances, each of size N
    - Eigenfaces: m = 500 faces, each of size N = 64K
  • Given the N × N covariance matrix Σ, we can compute
    - all N eigenvectors/eigenvalues in O(N³)
    - the first k eigenvectors/eigenvalues in O(kN²)
  • But if N = 64K ... EXPENSIVE!

slide by Barnabás Póczos and Aarti Singh

slide-28
SLIDE 28

A Clever Workaround

28

  • Note that m << 64K
  • Use L = XᵀX (size m × m) instead of Σ = XXᵀ (size 64K × 64K)
  • If v is an eigenvector of L, then Xv is an eigenvector of Σ.

    Proof:   L v = λ v
             XᵀX v = λ v
             X (XᵀX v) = X (λ v) = λ (Xv)
             (XXᵀ)(X v) = λ (Xv)
             Σ (Xv) = λ (Xv)

slide by Barnabás Póczos and Aarti Singh
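A quick numerical check of this trick (a sketch with my own toy dimensions): eigenvectors of the small m × m matrix L = XᵀX are mapped through X to eigenvectors of XXᵀ, then normalized; the large covariance matrix never has to be formed.

```python
import numpy as np

rng = np.random.default_rng(4)
N, m = 4096, 50                        # N huge, m small (m << N)
X = rng.standard_normal((N, m))
X = X - X.mean(axis=1, keepdims=True)  # centered data, points as columns

L = X.T @ X                            # small m x m matrix
lam, V = np.linalg.eigh(L)             # eigenpairs of L, ascending
lam, V = lam[::-1], V[:, ::-1]         # descending order

U = X @ V                              # map eigenvectors of L to eigenvectors of X X^T
U = U / np.linalg.norm(U, axis=0)      # normalize each column

# Verify for the top component: (X X^T) u = lambda u,
# computed as X @ (X.T @ u) so the N x N matrix is never built
u, l = U[:, 0], lam[0]
print(np.allclose(X @ (X.T @ u), l * u))    # True
```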

slide-29
SLIDE 29

Principal Components (Method B)

29

slide by Barnabás Póczos and Aarti Singh

slide-30
SLIDE 30

Principal Components (Method B)

  • … faster if train with …
  • only people w/out glasses
  • same lighting conditions

30


  • Reconstructing… (Method B)

slide by Barnabás Póczos and Aarti Singh

slide-31
SLIDE 31

Shortcomings

  • Requires carefully controlled data:
  • All faces centered in frame
  • Same size
  • Some sensitivity to angle
  • Method is completely knowledge free
  • (sometimes this is good!)
  • Doesn’t know that faces are wrapped around 3D objects (heads)

  • Makes no effort to preserve class distinctions

31

slide by Barnabás Póczos and Aarti Singh

slide-32
SLIDE 32

Happiness subspace (method A)

32

slide by Barnabás Póczos and Aarti Singh

slide-33
SLIDE 33

Disgust subspace (method A)

33

slide by Barnabás Póczos and Aarti Singh

slide-34
SLIDE 34

Facial Expression Recognition 
 Movies

34

slide by Barnabás Póczos and Aarti Singh

slide-35
SLIDE 35

Facial Expression Recognition 
 Movies

35

slide by Barnabás Póczos and Aarti Singh

slide-36
SLIDE 36

Facial Expression Recognition 
 Movies

36

slide by Barnabás Póczos and Aarti Singh

slide-37
SLIDE 37

Image Compression

37

slide-38
SLIDE 38

Original Image

  • Divide the original 372 × 492 image into 12 × 12 patches
  • Each patch is an instance
  • View each patch as a 144-D vector

38

slide by Barnabás Póczos and Aarti Singh
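A sketch of the compression experiment shown on the following slides (my own reconstruction of the procedure, with a random array standing in for the actual photo): cut the image into 12 × 12 patches, run PCA on the patches, keep k coefficients per patch, and rebuild the image.

```python
import numpy as np

def compress_with_pca(img, patch=12, k=16):
    """Patch-based PCA compression: keep k coefficients per patch."""
    H, W = img.shape
    H, W = H - H % patch, W - W % patch          # crop to a multiple of the patch size
    patches = (img[:H, :W]
               .reshape(H // patch, patch, W // patch, patch)
               .transpose(0, 2, 1, 3)
               .reshape(-1, patch * patch))      # one 144-D row per patch

    mean = patches.mean(axis=0)
    U, s, Vt = np.linalg.svd(patches - mean, full_matrices=False)
    basis = Vt[:k]                               # top-k eigen-patches
    coeffs = (patches - mean) @ basis.T          # k numbers per patch
    recon = coeffs @ basis + mean                # reconstruct every patch

    return (recon.reshape(H // patch, W // patch, patch, patch)
                 .transpose(0, 2, 1, 3)
                 .reshape(H, W))

# Toy usage: a random "image" in place of the 372 x 492 photo
rng = np.random.default_rng(5)
img = rng.random((372, 492))
for k in (60, 16, 6, 3, 1):
    recon = compress_with_pca(img, k=k)          # 372 and 492 are multiples of 12, no crop
    print(f"k={k:2d}  mean squared error={np.mean((img - recon) ** 2):.4f}")
```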
slide-39
SLIDE 39

L2 error and PCA dim

39

slide by Barnabás Póczos and Aarti Singh

slide-40
SLIDE 40

PCA compression: 144D => 60D

40

slide by Barnabás Póczos and Aarti Singh

slide-41
SLIDE 41

PCA compression: 144D => 16D

41

slide by Barnabás Póczos and Aarti Singh

slide-42
SLIDE 42

16 most important eigenvectors

42

(Figure: the 16 eigenvectors, each displayed as a 12 × 12 image patch)

slide by Barnabás Póczos and Aarti Singh

slide-43
SLIDE 43

PCA compression: 144D => 6D

43

slide by Barnabás Póczos and Aarti Singh

slide-44
SLIDE 44

6 most important eigenvectors

44

(Figure: the 6 eigenvectors, each displayed as a 12 × 12 image patch)

slide by Barnabás Póczos and Aarti Singh

slide-45
SLIDE 45

PCA compression: 144D => 3D

45

slide by Barnabás Póczos and Aarti Singh

slide-46
SLIDE 46

3 most important eigenvectors

46

(Figure: the 3 eigenvectors, each displayed as a 12 × 12 image patch)

slide by Barnabás Póczos and Aarti Singh

slide-47
SLIDE 47

PCA compression: 144D => 1D

47

slide by Barnabás Póczos and Aarti Singh

slide-48
SLIDE 48

60 most important eigenvectors

  • Looks like the discrete cosine bases of JPEG!

48

slide by Barnabás Póczos and Aarti Singh

slide-49
SLIDE 49

2D Discrete Cosine Basis

49

http://en.wikipedia.org/wiki/Discrete_cosine_transform

slide by Barnabás Póczos and Aarti Singh

slide-50
SLIDE 50

Noise Filtering

50

slide-51
SLIDE 51

Noise Filtering

51

x' = U Uᵀ x   (project x onto the top principal components, then reconstruct)

slide by Barnabás Póczos and Aarti Singh
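A sketch of PCA denoising (my own toy example with 1-D signals standing in for image patches): learn the top components from many noisy samples, then reconstruct each sample from its first 15 coefficients, as in the images on the next slides.

```python
import numpy as np

rng = np.random.default_rng(6)

# Toy "images": smooth 1-D signals plus noise, one per row
t = np.linspace(0, 1, 200)
clean = np.array([np.sin(2 * np.pi * (3 + i % 5) * t) for i in range(300)])
noisy = clean + 0.5 * rng.standard_normal(clean.shape)

# PCA on the noisy data
mean = noisy.mean(axis=0)
U, s, Vt = np.linalg.svd(noisy - mean, full_matrices=False)

k = 15                                           # keep 15 PCA components
denoised = (noisy - mean) @ Vt[:k].T @ Vt[:k] + mean

print("noisy MSE   :", np.mean((noisy - clean) ** 2))
print("denoised MSE:", np.mean((denoised - clean) ** 2))   # noticeably smaller
```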

slide-52
SLIDE 52

Noisy image

52

slide by Barnabás Póczos and Aarti Singh

slide-53
SLIDE 53

Denoised image 
 using 15 PCA components

53

slide by Barnabás Póczos and Aarti Singh

slide-54
SLIDE 54

PCA Shortcomings

54

slide-55
SLIDE 55

Problematic Data Set for PCA

  • PCA doesn’t know labels!

55


slide by Barnabás Póczos and Aarti Singh

slide-56
SLIDE 56

PCA vs. Fisher Linear Discriminant

56

Principal Component Analysis
  • higher variance
  • bad for discriminability

Fisher Linear Discriminant
  • smaller variance
  • good discriminability

slide by Javier Hernandez Rivera
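A hedged illustration of this contrast (not from the slides), using scikit-learn on a toy two-class dataset where the direction of largest variance is nearly useless for separating the classes.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(7)

# Two elongated, parallel clusters: big variance along x, classes separated along y
n = 500
X0 = rng.multivariate_normal([0, 0], [[10, 0], [0, 0.2]], size=n)
X1 = rng.multivariate_normal([0, 2], [[10, 0], [0, 0.2]], size=n)
X = np.vstack([X0, X1])
y = np.array([0] * n + [1] * n)

z_pca = PCA(n_components=1).fit_transform(X).ravel()       # ignores the labels
z_lda = LinearDiscriminantAnalysis(n_components=1).fit_transform(X, y).ravel()

def separation(z, y):
    """Distance between class means, in units of the overall standard deviation."""
    return abs(z[y == 0].mean() - z[y == 1].mean()) / z.std()

print("separation along 1st PC :", separation(z_pca, y))   # near 0
print("separation along LDA dir:", separation(z_lda, y))   # large
```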

slide-57
SLIDE 57

Problematic Data Set for PCA

  • PCA cannot capture NON-LINEAR structure!

57

slide by Barnabás Póczos and Aarti Singh

slide-58
SLIDE 58

PCA Conclusions

  • PCA
  • Finds an orthonormal basis for the data
  • Sorts dimensions in order of “importance”
  • Discards low-significance dimensions

  • Uses:
  • Get compact description
  • Ignore noise
  • Improve classification (hopefully)

  • Not magic:
  • Doesn’t know class labels
  • Can only capture linear variations

  • One of many tricks to reduce dimensionality!

58

slide by Barnabás Póczos and Aarti Singh