SLIDE 1

Principal Component Analysis (PCA)

CE-717: Machine Learning
Sharif University of Technology, Spring 2016
Soleymani

SLIDE 2

Dimensionality Reduction: Feature Selection vs. Feature Extraction

 Feature selection

 Select a subset of a given feature set

 Feature extraction

 A linear or non-linear transform on the original feature space

Feature selection ($d' < d$):

$\begin{bmatrix} x_1 \\ \vdots \\ x_d \end{bmatrix} \to \begin{bmatrix} x_{i_1} \\ \vdots \\ x_{i_{d'}} \end{bmatrix}$

Feature extraction:

$\begin{bmatrix} x_1 \\ \vdots \\ x_d \end{bmatrix} \to \begin{bmatrix} z_1 \\ \vdots \\ z_{d'} \end{bmatrix} = g\left(\begin{bmatrix} x_1 \\ \vdots \\ x_d \end{bmatrix}\right)$

SLIDE 3

Feature Extraction

 Mapping of the original data to another space

 Criterion for feature extraction can be different based on problem settings

 Unsupervised task: minimize the information loss (reconstruction error)
 Supervised task: maximize the class discrimination on the projected space

 Feature extraction algorithms

 Linear methods

 Unsupervised: e.g., Principal Component Analysis (PCA)
 Supervised: e.g., Linear Discriminant Analysis (LDA)

 Also known as Fisher’s Discriminant Analysis (FDA)

SLIDE 4

Feature Extraction

 Unsupervised feature extraction:

Input: the data matrix

$\mathbf{X} = \begin{bmatrix} x_1^{(1)} & \cdots & x_d^{(1)} \\ \vdots & \ddots & \vdots \\ x_1^{(N)} & \cdots & x_d^{(N)} \end{bmatrix}$

Output: a mapping $g: \mathbb{R}^d \to \mathbb{R}^{d'}$, or only the transformed data

$\mathbf{X}' = \begin{bmatrix} x_1'^{(1)} & \cdots & x_{d'}'^{(1)} \\ \vdots & \ddots & \vdots \\ x_1'^{(N)} & \cdots & x_{d'}'^{(N)} \end{bmatrix}$

 Supervised feature extraction:

Input: the data matrix $\mathbf{X}$ (as above) together with the targets $\mathbf{Z} = \begin{bmatrix} z^{(1)} \\ \vdots \\ z^{(N)} \end{bmatrix}$

Output: a mapping $g: \mathbb{R}^d \to \mathbb{R}^{d'}$, or only the transformed data $\mathbf{X}'$

SLIDE 5

Unsupervised Feature Reduction

 Visualization: projection of high-dimensional data onto 2D or 3D

 Data compression: efficient storage, communication, and retrieval

 Pre-processing: improve accuracy by reducing the number of features

 As a preprocessing step to reduce dimensions for supervised learning tasks
 Helps avoid overfitting

 Noise removal

 E.g., “noise” in images introduced by minor lighting variations, slightly different imaging conditions, etc.

SLIDE 6

Linear Transformation

 For a linear transformation, we find an explicit mapping $g(\mathbf{x}) = \mathbf{A}^T\mathbf{x}$ that can also transform new data vectors:

$\mathbf{x}' = \mathbf{A}^T\mathbf{x}$, where $\mathbf{x} \in \mathbb{R}^d$ (original data), $\mathbf{x}' \in \mathbb{R}^{d'}$ (reduced data), $\mathbf{A}^T \in \mathbb{R}^{d' \times d}$, and $d' < d$.
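To make the mapping concrete, here is a minimal NumPy sketch (not from the slides; the matrix A below is a random placeholder standing in for a learned projection): reducing a new vector is a single matrix-vector product $\mathbf{x}' = \mathbf{A}^T\mathbf{x}$.

```python
import numpy as np

# Minimal sketch: a linear feature extractor is just a matrix A of shape (d, d');
# new vectors are reduced by x' = A^T x.
rng = np.random.default_rng(0)
d, d_reduced = 5, 2
A = rng.normal(size=(d, d_reduced))   # placeholder for a learned projection matrix

x_new = rng.normal(size=d)            # a previously unseen data vector
x_reduced = A.T @ x_new               # x' = A^T x, shape (d',)
print(x_reduced.shape)                # (2,)
```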

SLIDE 7

Linear Transformation

 Linear transformations are simple mappings:

$\mathbf{x}' = \mathbf{A}^T\mathbf{x}$, i.e., $x_k' = \mathbf{a}_k^T\mathbf{x}$ for $k = 1, \dots, d'$

$\mathbf{A} = \begin{bmatrix} a_{11} & \cdots & a_{1d'} \\ \vdots & \ddots & \vdots \\ a_{d1} & \cdots & a_{dd'} \end{bmatrix}, \qquad \begin{bmatrix} x_1' \\ \vdots \\ x_{d'}' \end{bmatrix} = \begin{bmatrix} a_{11} & \cdots & a_{d1} \\ \vdots & \ddots & \vdots \\ a_{1d'} & \cdots & a_{dd'} \end{bmatrix} \begin{bmatrix} x_1 \\ \vdots \\ x_d \end{bmatrix} = \begin{bmatrix} \mathbf{a}_1^T \\ \vdots \\ \mathbf{a}_{d'}^T \end{bmatrix} \mathbf{x}$

SLIDE 8

Linear Dimensionality Reduction

 Unsupervised

 Principal Component Analysis (PCA) [we will discuss]
 Independent Component Analysis (ICA) [we will discuss]
 Singular Value Decomposition (SVD)
 Multi-Dimensional Scaling (MDS)
 Canonical Correlation Analysis (CCA)

SLIDE 9

Principal Component Analysis (PCA)

 Also known as the Karhunen-Loève (KL) transform

 Principal Components (PCs): orthogonal vectors that are ordered by the fraction of the total information (variation) in the corresponding directions

 Find the directions along which the data approximately lie

 When the data is projected onto the first PC, the variance of the projected data is maximized

 PCA is an orthogonal projection of the data onto a subspace such that the variance of the projected data is maximized.

SLIDE 10

Principal Component Analysis (PCA)

 The “best” linear subspace (i.e., the one providing the least reconstruction error of the data):

 Find the mean-reduced data
 The axes are rotated to new (principal) axes such that:

 Principal axis 1 has the highest variance
 ....
 Principal axis i has the i-th highest variance

 The principal axes are uncorrelated

 Covariance between each pair of principal axes is zero.

 Goal: reduce the dimensionality of the data while preserving the variation present in the dataset as much as possible.

 PCs can be found as the “best” eigenvectors of the covariance matrix of the data points.

SLIDE 11

Principal components

 If the data has a Gaussian distribution $\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, the direction of largest variance can be found as the eigenvector of $\boldsymbol{\Sigma}$ that corresponds to the largest eigenvalue of $\boldsymbol{\Sigma}$

 [Figure: Gaussian data cloud with its two principal directions $\mathbf{w}_1$ and $\mathbf{w}_2$]

SLIDE 12

PCA: Steps

 Input: $N \times d$ data matrix $\mathbf{X}$ (each row contains a $d$-dimensional data point)

 $\boldsymbol{\mu} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{x}^{(i)}$

 $\mathbf{X} \leftarrow$ the mean of the data points is subtracted from each row of $\mathbf{X}$

 $\mathbf{C} = \frac{1}{N}\mathbf{X}^T\mathbf{X}$ (covariance matrix)

 Calculate the eigenvalues and eigenvectors of $\mathbf{C}$

 Pick the $d'$ eigenvectors corresponding to the largest eigenvalues and put them in the columns of $\mathbf{A} = [\mathbf{a}_1, \dots, \mathbf{a}_{d'}]$ (first PC, ..., $d'$-th PC)

 $\mathbf{X}' = \mathbf{X}\mathbf{A}$
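The steps above translate almost line by line into NumPy. The following is a minimal sketch (function and variable names are my own, not from the slides):

```python
import numpy as np

def pca(X, d_reduced):
    """PCA following the steps above: center, covariance, eigendecomposition, project."""
    N, d = X.shape
    mu = X.mean(axis=0)                      # mean of the data points
    Xc = X - mu                              # subtract the mean from each row
    C = (Xc.T @ Xc) / N                      # sample covariance matrix (d x d)
    eigvals, eigvecs = np.linalg.eigh(C)     # eigh: C is symmetric; eigenvalues ascending
    order = np.argsort(eigvals)[::-1]        # sort eigenvalues in decreasing order
    A = eigvecs[:, order[:d_reduced]]        # d x d' matrix of top eigenvectors (the PCs)
    X_reduced = Xc @ A                       # N x d' projected data
    return X_reduced, A, mu, eigvals[order]

# Toy usage
X = np.random.default_rng(0).normal(size=(100, 5))
X_red, A, mu, lam = pca(X, d_reduced=2)
print(X_red.shape)   # (100, 2)
```

Using np.linalg.eigh is appropriate here because the covariance matrix is symmetric; for very high-dimensional data, the SVD route discussed on the later SVD slides is usually preferred numerically.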

SLIDE 13

Covariance Matrix

$\boldsymbol{\mu}_{\mathbf{x}} = \begin{bmatrix} \mu_1 \\ \vdots \\ \mu_d \end{bmatrix} = \begin{bmatrix} E(x_1) \\ \vdots \\ E(x_d) \end{bmatrix}, \qquad \boldsymbol{\Sigma} = E\left[(\mathbf{x} - \boldsymbol{\mu}_{\mathbf{x}})(\mathbf{x} - \boldsymbol{\mu}_{\mathbf{x}})^T\right]$

 ML estimate of the covariance matrix from the data points $\{\mathbf{x}^{(i)}\}_{i=1}^{N}$:

$\hat{\boldsymbol{\Sigma}} = \frac{1}{N}\sum_{i=1}^{N} (\mathbf{x}^{(i)} - \boldsymbol{\mu})(\mathbf{x}^{(i)} - \boldsymbol{\mu})^T = \frac{1}{N}\bar{\mathbf{X}}^T\bar{\mathbf{X}}$

where $\boldsymbol{\mu} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{x}^{(i)}$ and the mean-centered data matrix is

$\bar{\mathbf{X}} = \begin{bmatrix} \bar{\mathbf{x}}^{(1)T} \\ \vdots \\ \bar{\mathbf{x}}^{(N)T} \end{bmatrix} = \begin{bmatrix} (\mathbf{x}^{(1)} - \boldsymbol{\mu})^T \\ \vdots \\ (\mathbf{x}^{(N)} - \boldsymbol{\mu})^T \end{bmatrix}$

 We now assume that the data are mean-removed, i.e., $\mathbf{x}$ in the later slides is in fact the mean-centered $\bar{\mathbf{x}}$.

SLIDE 14

Correlation matrix

$\frac{1}{N}\mathbf{X}^T\mathbf{X} = \frac{1}{N}\begin{bmatrix} x_1^{(1)} & \cdots & x_1^{(N)} \\ \vdots & \ddots & \vdots \\ x_d^{(1)} & \cdots & x_d^{(N)} \end{bmatrix}\begin{bmatrix} x_1^{(1)} & \cdots & x_d^{(1)} \\ \vdots & \ddots & \vdots \\ x_1^{(N)} & \cdots & x_d^{(N)} \end{bmatrix} = \frac{1}{N}\begin{bmatrix} \sum_{n=1}^{N} x_1^{(n)}x_1^{(n)} & \cdots & \sum_{n=1}^{N} x_1^{(n)}x_d^{(n)} \\ \vdots & \ddots & \vdots \\ \sum_{n=1}^{N} x_d^{(n)}x_1^{(n)} & \cdots & \sum_{n=1}^{N} x_d^{(n)}x_d^{(n)} \end{bmatrix}$

where $\mathbf{X} = \begin{bmatrix} x_1^{(1)} & \cdots & x_d^{(1)} \\ \vdots & \ddots & \vdots \\ x_1^{(N)} & \cdots & x_d^{(N)} \end{bmatrix}$

SLIDE 15

Two Interpretations

 Maximum Variance Subspace

 PCA finds vectors $\mathbf{a}$ such that projections onto these vectors capture maximum variance in the data:

$\frac{1}{N}\sum_{n=1}^{N} (\mathbf{a}^T\mathbf{x}^{(n)})^2 = \frac{1}{N}\mathbf{a}^T\mathbf{X}^T\mathbf{X}\mathbf{a}$

 Minimum Reconstruction Error

 PCA finds vectors $\mathbf{a}$ such that projection onto these vectors yields minimum MSE reconstruction:

$\frac{1}{N}\sum_{n=1}^{N} \left\| \mathbf{x}^{(n)} - (\mathbf{a}^T\mathbf{x}^{(n)})\,\mathbf{a} \right\|^2$

SLIDE 16

Least Squares Error Interpretation


 PCs are linear least squares fits to samples, each orthogonal to

the previous PCs:

 First PC is a minimum distance fit to a vector in the original feature

space

 Second PC is a minimum distance fit to a vector in the plane

perpendicular to the first PC

 And so on

SLIDE 17

Example


SLIDE 18

Example


SLIDE 19

Least Squares Error and Maximum Variance Views Are Equivalent (1-dim Interpretation)

 Minimizing sum of square distances to the line is equivalent to

maximizing the sum of squares of the projections on that line (Pythagoras).

 [Figure: a mean-removed data vector (green) from the origin, its projection onto the candidate line (blue), and its distance to the line (red)]

 red² + blue² = green²; green² is fixed (it is the data vector after mean removal), so maximizing blue² is equivalent to minimizing red².

SLIDE 20

First PC

 The first PC is the direction of greatest variability in the data

 We will show that the first PC is the eigenvector of the covariance matrix corresponding to the maximum eigenvalue of this matrix.

 If $\|\mathbf{a}\| = 1$, the projection of a $d$-dimensional $\mathbf{x}$ onto $\mathbf{a}$ is $\mathbf{a}^T\mathbf{x}$:

$\|\mathbf{x}\|\cos\theta = \|\mathbf{x}\|\,\frac{\mathbf{a}^T\mathbf{x}}{\|\mathbf{x}\|\,\|\mathbf{a}\|} = \mathbf{a}^T\mathbf{x}$

 [Figure: vector $\mathbf{x}$, unit vector $\mathbf{a}$ from the origin, and angle $\theta$ between them]

SLIDE 21

First PC

$\arg\max_{\mathbf{a}} \; \frac{1}{N}\sum_{n=1}^{N}(\mathbf{a}^T\mathbf{x}^{(n)})^2 = \frac{1}{N}\mathbf{a}^T\mathbf{X}^T\mathbf{X}\mathbf{a} \quad \text{s.t.} \quad \mathbf{a}^T\mathbf{a} = 1$

$\frac{\partial}{\partial\mathbf{a}}\left[\frac{1}{N}\mathbf{a}^T\mathbf{X}^T\mathbf{X}\mathbf{a} + \lambda(1 - \mathbf{a}^T\mathbf{a})\right] = 0 \;\Rightarrow\; \frac{1}{N}\mathbf{X}^T\mathbf{X}\mathbf{a} = \lambda\mathbf{a}$

 So $\mathbf{a}$ is an eigenvector of the sample covariance matrix $\frac{1}{N}\mathbf{X}^T\mathbf{X}$

 The eigenvalue $\lambda$ denotes the amount of variance along that dimension:

 Variance $= \frac{1}{N}\mathbf{a}^T\mathbf{X}^T\mathbf{X}\mathbf{a} = \mathbf{a}^T\left(\frac{1}{N}\mathbf{X}^T\mathbf{X}\right)\mathbf{a} = \mathbf{a}^T\lambda\mathbf{a} = \lambda$

 So, if we seek the direction with the largest variance, it is the eigenvector corresponding to the largest eigenvalue of the sample covariance matrix.
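A small numeric check of this result (my own illustration, not part of the slides): the variance of the data projected onto the top eigenvector equals the largest eigenvalue, and no other unit direction gives a larger projected variance.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3)) @ np.diag([3.0, 1.0, 0.3])   # anisotropic data
X = X - X.mean(axis=0)                                      # mean-centered

C = X.T @ X / len(X)                                        # sample covariance
eigvals, eigvecs = np.linalg.eigh(C)
a1 = eigvecs[:, -1]                                         # eigenvector of the largest eigenvalue

var_top = np.mean((X @ a1) ** 2)
print(np.isclose(var_top, eigvals[-1]))                     # True: variance along a1 = lambda_max

u = rng.normal(size=3); u /= np.linalg.norm(u)              # an arbitrary unit direction
print(np.mean((X @ u) ** 2) <= var_top + 1e-9)              # True: no direction beats the first PC
```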

SLIDE 22

PCA: Uncorrelated Features

$\mathbf{x}' = \mathbf{A}^T\mathbf{x}$

$\mathbf{R}_{\mathbf{x}'} = E[\mathbf{x}'\mathbf{x}'^T] = E[\mathbf{A}^T\mathbf{x}\mathbf{x}^T\mathbf{A}] = \mathbf{A}^T E[\mathbf{x}\mathbf{x}^T]\mathbf{A} = \mathbf{A}^T\mathbf{R}_{\mathbf{x}}\mathbf{A}$

 If $\mathbf{A} = [\mathbf{a}_1, \dots, \mathbf{a}_d]$ where $\mathbf{a}_1, \dots, \mathbf{a}_d$ are orthonormal eigenvectors of $\mathbf{R}_{\mathbf{x}}$:

$\mathbf{R}_{\mathbf{x}'} = \mathbf{A}^T\mathbf{R}_{\mathbf{x}}\mathbf{A} = \mathbf{A}^T(\mathbf{A}\boldsymbol{\Lambda}\mathbf{A}^T)\mathbf{A} = \boldsymbol{\Lambda} \;\Rightarrow\; E[x_j' x_k'] = 0 \quad \forall j \neq k, \;\; j, k = 1, \dots, d$

 Then mutually uncorrelated features are obtained

 Completely uncorrelated features avoid information redundancies
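A quick numeric illustration of this property (my own sketch, assuming NumPy): projecting mean-centered data onto the orthonormal eigenvectors of its covariance yields features whose sample covariance is diagonal.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.multivariate_normal(mean=[0, 0, 0],
                            cov=[[4, 2, 0], [2, 3, 1], [0, 1, 2]], size=2000)
X = X - X.mean(axis=0)

C = X.T @ X / len(X)
_, A = np.linalg.eigh(C)          # columns of A: orthonormal eigenvectors of C
X_prime = X @ A                   # transformed (uncorrelated) features

C_prime = X_prime.T @ X_prime / len(X_prime)
off_diag = C_prime - np.diag(np.diag(C_prime))
print(np.allclose(off_diag, 0, atol=1e-10))   # True: off-diagonal covariances vanish
```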

SLIDE 23

PCA Derivation: Mean Square Error Approximation

 Incorporating all eigenvectors in $\mathbf{A} = [\mathbf{a}_1, \dots, \mathbf{a}_d]$:

$\mathbf{x}' = \mathbf{A}^T\mathbf{x} \;\Rightarrow\; \mathbf{A}\mathbf{x}' = \mathbf{A}\mathbf{A}^T\mathbf{x} = \mathbf{x} \;\Rightarrow\; \mathbf{x} = \mathbf{A}\mathbf{x}'$

 ⟹ If $d' = d$ then $\mathbf{x}$ can be reconstructed exactly from $\mathbf{x}'$

SLIDE 24

PCA Derivation: Relation between Eigenvalues and Variances

 The $k$-th largest eigenvalue of $\mathbf{R}_{\mathbf{x}}$ is the variance along the $k$-th PC:

$\mathrm{var}(x_k') = \lambda_k$

$\mathrm{var}(x_k') = E[x_k' x_k'] = E[\mathbf{a}_k^T\mathbf{x}\mathbf{x}^T\mathbf{a}_k] = \mathbf{a}_k^T E[\mathbf{x}\mathbf{x}^T]\mathbf{a}_k = \mathbf{a}_k^T\mathbf{R}_{\mathbf{x}}\mathbf{a}_k = \mathbf{a}_k^T\lambda_k\mathbf{a}_k = \lambda_k$

 Eigenvalues $\lambda_1 \geq \lambda_2 \geq \lambda_3 \geq \cdots$

  • The 1st PC is the eigenvector of the sample covariance matrix associated with the largest eigenvalue
  • The 2nd PC is the eigenvector of the sample covariance matrix associated with the second largest eigenvalue
  • And so on …
SLIDE 25

PCA Derivation: Mean Square Error Approximation

 Incorporating only the $d'$ eigenvectors corresponding to the largest eigenvalues, $\mathbf{A} = [\mathbf{a}_1, \dots, \mathbf{a}_{d'}]$ ($d' < d$)

 This minimizes the MSE between $\mathbf{x}$ and $\hat{\mathbf{x}} = \mathbf{A}\mathbf{x}'$:

$J(\mathbf{A}) = E\left[\|\mathbf{x} - \hat{\mathbf{x}}\|^2\right] = E\left[\|\mathbf{x} - \mathbf{A}\mathbf{x}'\|^2\right] = E\left[\Big\|\sum_{k=d'+1}^{d} x_k'\,\mathbf{a}_k\Big\|^2\right]$

$= E\left[\sum_{k=d'+1}^{d}\sum_{l=d'+1}^{d} x_k'\,\mathbf{a}_k^T\mathbf{a}_l\,x_l'\right] = E\left[\sum_{k=d'+1}^{d} x_k'^2\right] = \sum_{k=d'+1}^{d} E[x_k'^2] = \sum_{k=d'+1}^{d}\lambda_k$

 This is the sum of the $d - d'$ smallest eigenvalues.
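This identity is easy to verify numerically. A small sketch (my own, assuming NumPy):

```python
import numpy as np

# Check: for mean-centered data, the average squared reconstruction error when
# keeping the top d' PCs equals the sum of the discarded (smallest) eigenvalues
# of the sample covariance matrix.
rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 6)) @ rng.normal(size=(6, 6))    # correlated 6-D data
X = X - X.mean(axis=0)

C = X.T @ X / len(X)
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]        # decreasing order

d_reduced = 2
A = eigvecs[:, :d_reduced]                                  # keep top d' eigenvectors
X_hat = (X @ A) @ A.T                                       # reconstruction from d' PCs

mse = np.mean(np.sum((X - X_hat) ** 2, axis=1))             # E[ ||x - x_hat||^2 ]
print(np.isclose(mse, eigvals[d_reduced:].sum()))           # True
```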

SLIDE 26

PCA Derivation: Mean Square Error Approximation

 In general, it can also be shown that the MSE is minimized compared to any other approximation of $\mathbf{x}$ by any $d'$-dimensional orthonormal basis

 This result can also be obtained without first assuming that the axes are eigenvectors of the correlation matrix.

 If the data is mean-centered in advance, the correlation matrix $\mathbf{R}_{\mathbf{x}}$ and the covariance matrix $\mathbf{C}_{\mathbf{x}}$ are the same.

 However, in the correlation version, when $\mathbf{C}_{\mathbf{x}} \neq \mathbf{R}_{\mathbf{x}}$, the approximation is not, in general, a good one (although it is a minimum-MSE solution).

SLIDE 27

PCA on Faces: “Eigenfaces”


 ORL Database

[Figure: sample images from the ORL face database]

SLIDE 28

PCA on Faces: “Eigenfaces”

 For eigenfaces: “gray” = 0, “white” > 0, “black” < 0

 [Figure: the average face and the 1st to 10th PCs (eigenfaces)]

SLIDE 29

PCA on Faces:

 Feature vector $= [x_1', x_2', \dots, x_{d'}']$

 [Figure: a face image $\approx$ Average Face $+\, x_1' \times \mathrm{PC}_1 + x_2' \times \mathrm{PC}_2 + \cdots + x_{256}' \times \mathrm{PC}_{256}$]

 $x_i' = \mathbf{a}_i^T\mathbf{x}$ is the projection of $\mathbf{x}$ onto the $i$-th PC

 $\mathbf{x}$ is a $112 \times 92 = 10304$-dimensional vector containing the pixel intensities of the image

SLIDE 30

PCA on Faces: Reconstructed Face

[Figure: faces reconstructed using d'=1, 2, 4, 8, 16, 32, 64, 128, and 256 PCs, alongside the original image]
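A hedged sketch of this reconstruction experiment (assuming scikit-learn is available and can download the Olivetti faces, which are a 64×64 downsampled version of the ORL database rather than the 112×92 originals):

```python
import numpy as np
from sklearn.datasets import fetch_olivetti_faces   # downsampled ORL/Olivetti faces
from sklearn.decomposition import PCA

# Reconstruct a face from its projections onto the top d' eigenfaces:
# x_hat = mean_face + sum_k x'_k * PC_k, for increasing d'.
faces = fetch_olivetti_faces()
X = faces.data                                      # (400, 4096): 64x64 images as rows

pca = PCA(n_components=256).fit(X)                  # eigenfaces = pca.components_
x = X[0]                                            # one face to reconstruct

for d_reduced in [1, 2, 4, 8, 16, 32, 64, 128, 256]:
    coeffs = pca.components_[:d_reduced] @ (x - pca.mean_)      # x'_k = a_k^T (x - mean)
    x_hat = pca.mean_ + coeffs @ pca.components_[:d_reduced]    # reconstruction
    err = np.mean((x - x_hat) ** 2)
    print(f"d'={d_reduced:3d}  reconstruction MSE = {err:.5f}")
```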

SLIDE 31

Dimensionality Reduction by PCA

 In high-dimensional problems, data sometimes lies near a linear subspace (small variability around this subspace can be considered as noise)

 Only keep data projections onto principal components with large eigenvalues

 We might lose some information, but if the discarded eigenvalues are small, we do not lose much
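A common practical rule that follows from this observation is to pick $d'$ so that the kept eigenvalues explain a given fraction of the total variance. A small sketch (my own illustration, with made-up eigenvalues):

```python
import numpy as np

def choose_n_components(eigvals_desc, var_to_keep=0.95):
    """Smallest d' whose top eigenvalues explain at least var_to_keep of the variance."""
    ratio = np.cumsum(eigvals_desc) / np.sum(eigvals_desc)
    return int(np.searchsorted(ratio, var_to_keep) + 1)

eigvals = np.array([5.0, 2.0, 0.5, 0.3, 0.1, 0.1])   # eigenvalues in decreasing order
print(choose_n_components(eigvals, 0.95))             # -> 4: the top 4 explain 97.5% >= 95%
```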

SLIDE 32

Kernel PCA

 Kernel extension of PCA

 Useful when the data (approximately) lies on a lower-dimensional non-linear space
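A minimal sketch of kernel PCA (assuming scikit-learn; the dataset, kernel choice, and gamma value are my own illustration, not from the slides):

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

# On data lying on concentric circles, linear PCA cannot separate the two rings
# with one component, while kernel PCA with an RBF kernel typically can.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

z_lin = PCA(n_components=1).fit_transform(X)[:, 0]
z_ker = KernelPCA(n_components=1, kernel="rbf", gamma=10).fit_transform(X)[:, 0]

def separation(z, y):
    # how far apart the two classes are along the single extracted feature
    return abs(z[y == 0].mean() - z[y == 1].mean()) / z.std()

print("linear PCA separation:", round(separation(z_lin, y), 3))
print("kernel PCA separation:", round(separation(z_ker, y), 3))
```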

SLIDE 33

PCA and LDA: Drawbacks

 PCA drawback: an excellent information-packing transform does not necessarily lead to good class separability.

 The directions of maximum variance may be useless for classification purposes

 LDA drawbacks:

 Singularity or under-sampled problem (when $N < d$)

 Example: gene expression data, images, text documents

 Can reduce the dimension only to $d' \leq C - 1$ (unlike PCA), where $C$ is the number of classes

 [Figure: comparison of PCA and LDA projection directions on the same data]

SLIDE 34

PCA vs. LDA

 Although LDA often provides more suitable features for classification tasks, PCA might outperform LDA in some situations:

 when the number of samples per class is small (overfitting problem of LDA)

 when the number of desired features is more than $C - 1$

 Advances in the last decade:

 Semi-supervised feature extraction

 E.g., PCA+LDA, Regularized LDA, Locally FDA (LFDA)

SLIDE 35

Singular Value Decomposition (SVD)

 Given a matrix $\mathbf{X} \in \mathbb{R}^{N \times d}$, the SVD is a decomposition:

$\mathbf{X} = \mathbf{U}\mathbf{S}\mathbf{V}^T$

 $\mathbf{S}$ is a diagonal matrix with the singular values $\sigma_1, \dots, \sigma_d$ of $\mathbf{X}$

 The columns of $\mathbf{U}$ and $\mathbf{V}$ are orthonormal

 Shapes (thin SVD): $\mathbf{X}$ ($N \times d$), $\mathbf{U}$ ($N \times d$), $\mathbf{S}$ ($d \times d$), $\mathbf{V}^T$ ($d \times d$)

SLIDE 36

Singular Value Decomposition (SVD)

 Given a matrix $\mathbf{X} \in \mathbb{R}^{N \times d}$, the SVD is a decomposition:

$\mathbf{X} = \mathbf{U}\mathbf{S}\mathbf{V}^T$

 The SVD of $\mathbf{X}$ is related to the eigendecompositions of $\mathbf{X}^T\mathbf{X}$ and $\mathbf{X}\mathbf{X}^T$:

 $\mathbf{X}^T\mathbf{X} = \mathbf{V}\mathbf{S}\mathbf{U}^T\mathbf{U}\mathbf{S}\mathbf{V}^T = \mathbf{V}\mathbf{S}^2\mathbf{V}^T$

 so $\mathbf{V}$ contains the eigenvectors of $\mathbf{X}^T\mathbf{X}$ and $\mathbf{S}^2$ contains its eigenvalues ($\lambda_j = \sigma_j^2$)

 $\mathbf{X}\mathbf{X}^T = \mathbf{U}\mathbf{S}\mathbf{V}^T\mathbf{V}\mathbf{S}\mathbf{U}^T = \mathbf{U}\mathbf{S}^2\mathbf{U}^T$

 so $\mathbf{U}$ contains the eigenvectors of $\mathbf{X}\mathbf{X}^T$ and $\mathbf{S}^2$ contains its eigenvalues ($\lambda_j = \sigma_j^2$)

 In fact, we can view each row of $\mathbf{U}\mathbf{S}$ as the coordinates of an example along the axes given by the eigenvectors.
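A quick numeric check of this relation (my own illustration, assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 4))
X = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(X, full_matrices=False)   # X = U S V^T (thin SVD)
eigvals, eigvecs = np.linalg.eigh(X.T @ X)         # eigendecomposition of X^T X
eigvals = eigvals[::-1]                            # sort in decreasing order

print(np.allclose(s**2, eigvals))                  # singular values squared = eigenvalues
print((U * s).shape)                               # rows of U S: coordinates of the examples
```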

SLIDE 37

Independent Component Analysis (ICA)

 PCA:

 The transformed dimensions will be uncorrelated with each other
 Orthogonal linear transform
 Only uses second-order statistics (i.e., the covariance matrix)

 ICA:

 The transformed dimensions will be as independent as possible
 Non-orthogonal linear transform
 Higher-order statistics can also be used

SLIDE 38

Uncorrelated and Independent

 Gaussian

 Independent ⟺ Uncorrelated

 Non-Gaussian

 Independent ⇒ Uncorrelated
 Uncorrelated ⇏ Independent

 Uncorrelated: $\mathrm{cov}(X_1, X_2) = 0$
 Independent: $P(X_1, X_2) = P(X_1)P(X_2)$

SLIDE 39

ICA: Cocktail party problem

 Cocktail party problem

 $d$ speakers are speaking simultaneously and each microphone records only an overlapping combination of these voices.

 Each microphone records a different combination of the speakers’ voices.

 Using these $d$ microphone recordings, can we separate out the original $d$ speakers’ speech signals?

 Mixing matrix $\mathbf{A}$: $\mathbf{x} = \mathbf{A}\mathbf{s}$

 Unmixing matrix $\mathbf{A}^{-1}$: $\mathbf{s} = \mathbf{A}^{-1}\mathbf{x}$

 $s_k^{(i)}$: sound that speaker $k$ was uttering at time $i$.

 $x_k^{(i)}$: acoustic reading recorded by microphone $k$ at time $i$.

SLIDE 40

ICA

 Find a linear transformation $\mathbf{x} = \mathbf{A}\mathbf{s}$ for which the dimensions of $\mathbf{s} = [s_1, s_2, \dots, s_d]^T$ are statistically independent:

$p(s_1, \dots, s_d) = p_1(s_1)\,p_2(s_2)\cdots p_d(s_d)$

 Algorithmically, we need to identify the matrix $\mathbf{A}$ and the sources $\mathbf{s}$, where $\mathbf{x} = \mathbf{A}\mathbf{s}$, such that the mutual information between $s_1, s_2, \dots, s_d$ is minimized:

$I(s_1, s_2, \dots, s_d) = \sum_{j=1}^{d} H(s_j) - H(s_1, s_2, \dots, s_d)$
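A toy cocktail-party sketch (assuming scikit-learn's FastICA; the signals and mixing matrix are made up for illustration and are not from the slides):

```python
import numpy as np
from sklearn.decomposition import FastICA

# Mix two independent non-Gaussian sources with a matrix A, then try to recover
# them with FastICA (recovery is up to scale, order, and sign).
t = np.linspace(0, 8, 2000)
s1 = np.sign(np.sin(3 * t))                 # square wave "speaker" 1
s2 = np.mod(t, 1.0) - 0.5                   # sawtooth "speaker" 2
S = np.c_[s1, s2]                           # sources, shape (2000, 2)

A = np.array([[1.0, 0.5],
              [0.7, 1.2]])                  # unknown mixing matrix
X = S @ A.T                                 # microphone recordings: x = A s

ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(X)                # estimated sources

# Correlate each estimated source with the true ones to check the recovery
corr = np.corrcoef(np.c_[S, S_hat], rowvar=False)[:2, 2:]
print(np.round(np.abs(corr), 2))            # one large entry per row/column indicates success
```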