SLIDE 1

Introduction to Machine Learning CMU-10701

Principal Component Analysis

Barnabás Póczos & Aarti Singh

SLIDE 2

Contents

Motivation PCA algorithms Applications

Some of these slides are taken from

  • Karl Booksh Research group
  • Tom Mitchell
  • Ron Parr
SLIDE 3

Motivation

SLIDE 4

  • Data Visualization
  • Data Compression
  • Noise Reduction

PCA Applications

SLIDE 5

Data Visualization

Example:

  • Given 53 blood and urine samples (features) from 65 people.

  • How can we visualize the measurements?
SLIDE 6

  • Matrix format (65 × 53)

        H-WBC   H-RBC   H-Hgb    H-Hct    H-MCV     H-MCH    H-MCHC
A1      8.0000  4.8200  14.1000  41.0000  85.0000   29.0000  34.0000
A2      7.3000  5.0200  14.7000  43.0000  86.0000   29.0000  34.0000
A3      4.3000  4.4800  14.1000  41.0000  91.0000   32.0000  35.0000
A4      7.5000  4.4700  14.9000  45.0000  101.0000  33.0000  33.0000
A5      7.3000  5.5200  15.4000  46.0000  84.0000   28.0000  33.0000
A6      6.9000  4.8600  16.0000  47.0000  97.0000   33.0000  34.0000
A7      7.8000  4.6800  14.7000  43.0000  92.0000   31.0000  34.0000
A8      8.6000  4.8200  15.8000  42.0000  88.0000   33.0000  37.0000
A9      5.1000  4.7100  14.0000  43.0000  92.0000   30.0000  32.0000

Rows are instances (patients), columns are features. Difficult to see the correlations between the features...

Data Visualization

SLIDE 7

  • Spectral format (65 curves, one for each person)

[Figure: 65 spectral curves; x-axis: measurement index, y-axis: measurement value]

Difficult to compare the different patients...

Data Visualization

SLIDE 8

[Figure: H-Bands value for each of the 65 people; x-axis: person, y-axis: H-Bands]

  • Spectral format (53 pictures, one for each feature)

Difficult to see the correlations between the features...

Data Visualization

SLIDE 9

[Figures: a bi-variate scatter plot of C-Triglycerides vs C-LDH, and a tri-variate scatter plot of C-Triglycerides vs C-LDH vs M-EPI]

How can we visualize the other variables??? … it is difficult to see in 4- or higher-dimensional spaces...

Data Visualization

SLIDE 10

  • Is there a representation better than the coordinate axes?
  • Is it really necessary to show all 53 dimensions?
  • … what if there are strong correlations between the features?

  • How could we find the smallest subspace of the 53-D space that keeps the most information about the original data?

  • A solution: Principal Component Analysis

Data Visualization

SLIDE 11

PCA Algorithms

SLIDE 12

PCA: orthogonal projection of the data onto a lower-dimensional linear space that...

  • maximizes the variance of the projected data (purple line), and
  • minimizes the mean squared distance between the data points and their projections (sum of the blue lines)

Principal Component Analysis

SLIDE 13

Idea: given data points in a d-dimensional space, project them onto a lower-dimensional space while preserving as much information as possible:

  • Find the best planar approximation of 3D data
  • Find the best 12-D approximation of 104-D data

In particular, choose the projection that minimizes the squared error in reconstructing the original data.

Principal Component Analysis
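The two criteria agree: minimizing reconstruction error is the same as maximizing the variance of the projection. A short derivation (not on the slides), for a unit vector w and a centered data point x:

$$\|x - w w^T x\|^2 = \|x\|^2 - 2\,(w^T x)^2 + (w^T x)^2 \|w\|^2 = \|x\|^2 - (w^T x)^2 \quad (\|w\| = 1)$$

Since ||x||² is fixed, minimizing the average of the left-hand side over the data is the same as maximizing the average of (wᵀx)², i.e., the variance of the projected data.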

SLIDE 14

Properties:

PCA vectors originate from the center of mass. Principal component #1 points in the direction of the largest variance. Each subsequent principal component

  • is orthogonal to the previous ones, and
  • points in the direction of the largest variance of the residual subspace

Principal Component Analysis

SLIDE 15

2D Gaussian dataset

SLIDE 16

1st PCA axis

SLIDE 17

2nd PCA axis

SLIDE 18

Given the centered data {x1, …, xm}, compute the principal vectors.

1st PCA vector: to find w1, maximize the variance of the projection of x:

$$w_1 = \arg\max_{\|w\|=1} \frac{1}{m} \sum_{i=1}^{m} (w^T x_i)^2$$

2nd PCA vector: to find w2, we maximize the variance of the projection in the residual subspace x − x', where x' = w1(w1ᵀx) is the PCA reconstruction of x:

$$w_2 = \arg\max_{\|w\|=1} \frac{1}{m} \sum_{i=1}^{m} \left[ w^T \left( x_i - w_1 w_1^T x_i \right) \right]^2$$

PCA algorithm I (sequential)

SLIDE 19

Given w1, …, wk−1, we compute the kth principal vector as before: maximize the variance of the projection in the residual subspace,

$$w_k = \arg\max_{\|w\|=1} \frac{1}{m} \sum_{i=1}^{m} \left[ w^T \left( x_i - \sum_{j=1}^{k-1} w_j w_j^T x_i \right) \right]^2$$

where x' = w1(w1ᵀx) + w2(w2ᵀx) + … is the PCA reconstruction of x from the previous principal vectors.

PCA algorithm I (sequential)
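A minimal numpy sketch of this sequential scheme (mine, not from the slides): each wj is found as the top eigenvector of the current covariance via power iteration, and its variance is then deflated away, which restricts the search to the residual subspace.

    import numpy as np

    def sequential_pca(X, k, n_iter=200, seed=0):
        """X: (d, m) centered data, one column per point. Returns (d, k) principal vectors."""
        d, m = X.shape
        Sigma = X @ X.T / m                       # sample covariance of the centered data
        W = np.zeros((d, k))
        rng = np.random.default_rng(seed)
        for j in range(k):
            v = rng.standard_normal(d)
            for _ in range(n_iter):               # power iteration -> dominant eigenvector
                v = Sigma @ v
                v /= np.linalg.norm(v)
            lam = v @ Sigma @ v                   # Rayleigh quotient = eigenvalue
            Sigma = Sigma - lam * np.outer(v, v)  # deflate: move to the residual subspace
            W[:, j] = v
        return W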

SLIDE 20

  • Given data {x1, …, xm}, compute the covariance matrix Σ:

$$\Sigma = \frac{1}{m} \sum_{i=1}^{m} (x_i - \bar{x})(x_i - \bar{x})^T, \qquad \text{where } \bar{x} = \frac{1}{m} \sum_{i=1}^{m} x_i$$

  • PCA basis vectors = the eigenvectors of Σ
  • Larger eigenvalue ⇒ more important eigenvector

PCA algorithm II (sample covariance matrix)

SLIDE 21

PCA algorithm(X, k): top k eigenvalues/eigenvectors
% X = N × m data matrix,
% … each data point xi = column vector, i = 1..m

  • x̄ ← (1/m) Σi xi
  • X ← subtract the mean x̄ from each column vector xi in X
  • Σ ← X Xᵀ … covariance matrix of X
  • {λi, ui}i=1..N = eigenvectors/eigenvalues of Σ, sorted so that λ1 ≥ λ2 ≥ … ≥ λN
  • Return {λi, ui}i=1..k   % top k PCA components

PCA algorithm II (sample covariance matrix)
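A numpy sketch of Algorithm II (my code, assuming the pseudocode's conventions: data points as columns of X):

    import numpy as np

    def pca_cov(X, k):
        """X: (N, m) data matrix, one column per data point.
        Returns the top-k eigenvalues and eigenvectors of the sample covariance."""
        x_bar = X.mean(axis=1, keepdims=True)
        Xc = X - x_bar                   # subtract the mean from every column
        Sigma = Xc @ Xc.T / X.shape[1]   # N x N sample covariance matrix
        lam, U = np.linalg.eigh(Sigma)   # eigh: eigenvalues in ascending order
        order = np.argsort(lam)[::-1]    # re-sort so lambda_1 >= ... >= lambda_N
        return lam[order][:k], U[:, order[:k]]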

SLIDE 22

Animation

Power iteration 1:

    v = Sigma*v;  v = v/sqrt(v'*v);          % repeat until convergence -> vPCA1

    lambda1 = vPCA1'*Sigma*vPCA1;
    Sigma2  = Sigma - lambda1*vPCA1*vPCA1';  % deflate the first component

Power iteration 2:

    v = Sigma2*v;  v = v/sqrt(v'*v);         % repeat until convergence -> vPCA2

SLIDE 23

Singular Value Decomposition of the centered data matrix X:

$$X_{\text{features} \times \text{samples}} = U S V^T$$

[Figure: X = U S Vᵀ, with the leading components of U, S, and Vᵀ carrying the significant structure and the trailing ones carrying noise]

PCA algorithm III (SVD of the data matrix)
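The same computation via numpy's SVD; a sketch assuming the slide's features × samples layout and an already-centered X:

    import numpy as np

    def pca_svd(X, k):
        """X: (features, samples) centered data. Returns the top-k principal
        vectors, singular values, and reconstruction coefficients."""
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        # columns of U: principal vectors; S: importance; columns of Vt: coefficients
        return U[:, :k], s[:k], Vt[:k, :]

The singular values relate to Algorithm II's eigenvalues by lambda_i = s_i^2 / m, so both routes rank the components identically.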

SLIDE 24

  • Columns of U
    • the principal vectors, {u(1), …, u(k)}
    • orthogonal with unit norm, so UᵀU = I
    • the data can be reconstructed using linear combinations of {u(1), …, u(k)}
  • Matrix S
    • diagonal
    • shows the importance of each eigenvector
  • Columns of Vᵀ
    • the coefficients for reconstructing the samples

PCA algorithm III

SLIDE 25

Applications

SLIDE 26

  • Want to identify a specific person, based on a facial image
  • Robust to glasses, lighting, …
  • Can't just use the given 256 × 256 pixels

Face Recognition


SLIDE 27

Method A: build a PCA subspace for each person and check which subspace reconstructs the test image best.

Method B: build one PCA database for the whole dataset and then classify based on the weights.

Applying PCA: Eigenfaces
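A sketch of Method B (hypothetical helper names; the slides do not fix an interface): represent every face by its weight vector in a single PCA basis, then classify a test face by the nearest neighbor in weight space.

    import numpy as np

    def weights(U, mean_face, x):
        """Coordinates of a flattened face x in the eigenface basis U."""
        return U.T @ (x - mean_face)

    def classify_method_b(U, mean_face, train_faces, train_labels, test_face):
        """Method B: nearest neighbor over PCA weights of the whole database."""
        w = weights(U, mean_face, test_face)
        d = [np.linalg.norm(w - weights(U, mean_face, x)) for x in train_faces]
        return train_labels[int(np.argmin(d))]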


SLIDE 28

Example data set: images of faces

  • Eigenface approach [Turk & Pentland], [Sirovich & Kirby]

Each face x is …

  • 256 × 256 values (luminance at each location)
  • x ∈ R^(256×256), viewed as a 64K-dimensional vector

Form X = [x1, …, xm], the centered data matrix. Compute Σ = X Xᵀ.
Problem: Σ is 64K × 64K … HUGE!!!

Applying PCA: Eigenfaces

[Figure: X = [x1, …, xm]; m faces, each with 256 × 256 real values]

SLIDE 29

Suppose we have m instances, each of size N.

  • Eigenfaces: m = 500 faces, each of size N = 64K

Given the N × N covariance matrix Σ, we can compute

  • all N eigenvectors/eigenvalues in O(N³)
  • the first k eigenvectors/eigenvalues in O(kN²)

But if N = 64K: EXPENSIVE!

Computational Complexity

SLIDE 30
  • Note that m ≪ 64K
  • Use L = XᵀX instead of Σ = XXᵀ
  • If v is an eigenvector of L, then Xv is an eigenvector of Σ

Proof:   L v = λ v
         XᵀX v = λ v
         X (XᵀX v) = X (λ v) = λ (Xv)
         (XXᵀ)(X v) = λ (Xv)
         Σ (Xv) = λ (Xv)

A Clever Workaround
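A numpy sketch of the workaround (the unit-norm rescaling at the end is mine; the slide only shows that Xv is an eigenvector of Σ):

    import numpy as np

    def pca_gram_trick(X, k):
        """X: (N, m) centered data with m << N (e.g., N = 64K pixels, m = 500 faces)."""
        L = X.T @ X                       # small m x m matrix instead of N x N
        lam, V = np.linalg.eigh(L)
        order = np.argsort(lam)[::-1][:k]
        U = X @ V[:, order]               # X v is an eigenvector of Sigma = X X^T
        U /= np.linalg.norm(U, axis=0)    # rescale each eigenvector to unit norm
        return lam[order], U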

SLIDE 31

Principal Components (Method B)

SLIDE 32

… faster if trained with…

  • only people without glasses
  • the same lighting conditions

Reconstructing… (Method B)

SLIDE 33

Requires carefully controlled data:

  • All faces centered in frame
  • Same size
  • Some sensitivity to angle

The method is completely knowledge-free:

  • (sometimes this is good!)
  • Doesn't know that faces are wrapped around 3D objects (heads)
  • Makes no effort to preserve class distinctions

Shortcomings

SLIDE 34

Happiness subspace (Method A)

SLIDE 35

Disgust subspace (Method A)

SLIDE 36

Facial Expression Recognition Movies

SLIDE 37

Facial Expression Recognition Movies

SLIDE 38

Facial Expression Recognition Movies

SLIDE 39

Image Compression

SLIDE 40

  • Divide the original 372 × 492 image into patches:
    • each patch is an instance that contains 12 × 12 pixels on a grid
  • Consider each patch as a 144-D vector

Original Image
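A sketch of the whole experiment (my code; it assumes a grayscale image whose sides are multiples of 12, which holds for 372 × 492):

    import numpy as np

    def pca_compress_patches(img, k):
        """Cut img into 12x12 patches, keep k PCA components, reconstruct."""
        h, w = img.shape
        P = (img.reshape(h // 12, 12, w // 12, 12)
                .transpose(0, 2, 1, 3).reshape(-1, 144))  # one 144-D row per patch
        mean = P.mean(axis=0)
        Xc = P - mean
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        Uk = Vt[:k].T                     # top-k principal directions, (144, k)
        rec = (Xc @ Uk) @ Uk.T + mean     # k-D code per patch, then reconstruction
        return (rec.reshape(h // 12, w // 12, 12, 12)
                   .transpose(0, 2, 1, 3).reshape(h, w))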

SLIDE 41

L2 reconstruction error vs. the number of PCA dimensions

SLIDE 42

PCA compression: 144D → 60D

SLIDE 43

PCA compression: 144D → 16D

SLIDE 44


16 most important eigenvectors

SLIDE 45

PCA compression: 144D → 6D

SLIDE 46


6 most important eigenvectors

SLIDE 47

PCA compression: 144D → 3D

SLIDE 48


3 most important eigenvectors

SLIDE 49

PCA compression: 144D → 1D

SLIDE 50

Looks like the discrete cosine bases of JPEG!...

60 most important eigenvectors

SLIDE 51

http://en.wikipedia.org/wiki/Discrete_cosine_transform

2D Discrete Cosine Basis

SLIDE 52

Noise Filtering

SLIDE 53

x x’ U x

Noise Filtering
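In code, the filtering step is one projection; a sketch assuming U holds the top principal vectors (15 of them on the following slides) and x is a centered noisy sample:

    import numpy as np

    def pca_denoise(U, x):
        """x' = U U^T x: project onto the PCA subspace, then reconstruct."""
        return U @ (U.T @ x)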

SLIDE 54

Noisy image

SLIDE 55

Denoised image using 15 PCA components

SLIDE 56

PCA Shortcomings

SLIDE 57

PCA doesn’t know labels!

Problematic Data Set for PCA

SLIDE 58

  • PCA maximizes the variance, independent of the classes → magenta line
  • FLD attempts to separate the classes → green line

PCA vs Fisher Linear Discriminant
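A small numpy experiment illustrating the contrast (my construction; Fisher's direction from the standard formula w ∝ Sw⁻¹(μ1 − μ2)):

    import numpy as np

    rng = np.random.default_rng(0)
    cov = np.array([[9.0, 0.0], [0.0, 0.5]])             # classes elongated along x
    X1 = rng.multivariate_normal([0.0, 0.0], cov, 200)
    X2 = rng.multivariate_normal([0.0, 3.0], cov, 200)   # separated along y
    X = np.vstack([X1, X2])

    # PCA: top eigenvector of the pooled covariance, labels ignored
    Xc = X - X.mean(axis=0)
    _, V = np.linalg.eigh(Xc.T @ Xc)
    w_pca = V[:, -1]                  # points along the long x-axis, useless here

    # Fisher: within-class scatter, then w ~ Sw^{-1} (mu1 - mu2)
    Sw = np.cov(X1.T) + np.cov(X2.T)
    w_fld = np.linalg.solve(Sw, X1.mean(axis=0) - X2.mean(axis=0))
    w_fld /= np.linalg.norm(w_fld)    # points along y, separating the classes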

SLIDE 59

PCA cannot capture NON-LINEAR structure!

Problematic Data Set for PCA

SLIDE 60

PCA:

  • finds an orthonormal basis for the data
  • sorts dimensions in order of "importance"
  • discards low-significance dimensions

Applications:

  • get a compact description
  • remove noise
  • improve classification (hopefully)

Not magic:

  • doesn't know the class labels
  • can only capture linear variations

One of many tricks to reduce dimensionality!

PCA Conclusions

SLIDE 61

Kernel PCA

SLIDE 62

Performing PCA in the feature space

[Lemma and proof given as equations on the slide]

Kernel PCA

SLIDE 63

Lemma

Kernel PCA

SLIDE 64

Proof

Kernel PCA

SLIDE 65

  • How can we use this to calculate the projection of a new sample t?
  • Where was I cheating? The data should be centered in the feature space, too! But this is manageable...

Kernel PCA
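A numpy sketch of kernel PCA with that centering handled (my code, assuming an RBF kernel; the double-centering of K is the "manageable" fix):

    import numpy as np

    def kernel_pca(X, k, gamma=1.0):
        """X: (m, d) data, one row per sample. Returns the k-D projections."""
        m = X.shape[0]
        sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        K = np.exp(-gamma * sq)                    # RBF kernel matrix
        J = np.eye(m) - np.ones((m, m)) / m
        Kc = J @ K @ J                             # centering in feature space
        lam, A = np.linalg.eigh(Kc)
        order = np.argsort(lam)[::-1][:k]
        lam, A = lam[order], A[:, order]
        A = A / np.sqrt(np.maximum(lam, 1e-12))    # unit-norm feature-space PCs
        return Kc @ A                              # projections of the samples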

SLIDE 66

http://en.wikipedia.org/wiki/Kernel_principal_component_analysis

Input points before kernel PCA

SLIDE 67

The three groups are distinguishable using the first component only

Output after kernel PCA

SLIDE 68

PCA Theory

SLIDE 69

GOAL:

Justification of Algorithm II

SLIDE 70

x is centered!

Justification of Algorithm II

SLIDE 71

GOAL:

Use Lagrange multipliers for the constraints.

Justification of Algorithm II

SLIDE 72

Justification of Algorithm II

SLIDE 73

Thanks for the Attention!