Principal Component Analysis (PCA)
CE-717: Machine Learning
Sharif University of Technology, Spring 2016
Soleymani
Dimensionality Reduction: Feature Selection vs. Feature Extraction
Feature selection
Select a subset of a given feature set
Feature extraction
A linear or non-linear transform on the original feature space
Feature selection: $\begin{bmatrix} x_1 \\ \vdots \\ x_d \end{bmatrix} \rightarrow \begin{bmatrix} x_{i_1} \\ \vdots \\ x_{i_{d'}} \end{bmatrix}$ with $d' < d$

Feature extraction: $\begin{bmatrix} x_1 \\ \vdots \\ x_d \end{bmatrix} \rightarrow \begin{bmatrix} z_1 \\ \vdots \\ z_{d'} \end{bmatrix} = f\left(\begin{bmatrix} x_1 \\ \vdots \\ x_d \end{bmatrix}\right)$
Feature Extraction
Mapping of the original data to another space.
The criterion for feature extraction can differ based on the problem setting:
Unsupervised task: minimize the information loss (reconstruction error)
Supervised task: maximize the class discrimination in the projected space
Feature extraction algorithms
Linear methods:
Unsupervised: e.g., Principal Component Analysis (PCA)
Supervised: e.g., Linear Discriminant Analysis (LDA), also known as Fisher's Discriminant Analysis (FDA)
Feature Extraction
Unsupervised feature extraction: the input is only the data matrix

$$\boldsymbol{X} = \begin{bmatrix} x_1^{(1)} & \cdots & x_d^{(1)} \\ \vdots & \ddots & \vdots \\ x_1^{(N)} & \cdots & x_d^{(N)} \end{bmatrix}$$

and the output is a mapping $f: \mathbb{R}^d \rightarrow \mathbb{R}^{d'}$, or only the transformed data

$$\boldsymbol{X}' = \begin{bmatrix} x_1'^{(1)} & \cdots & x_{d'}'^{(1)} \\ \vdots & \ddots & \vdots \\ x_1'^{(N)} & \cdots & x_{d'}'^{(N)} \end{bmatrix}$$

Supervised feature extraction: the input is the data matrix $\boldsymbol{X}$ together with the targets $\boldsymbol{Y} = \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(N)} \end{bmatrix}$, and the output is again a mapping $f: \mathbb{R}^d \rightarrow \mathbb{R}^{d'}$ or only the transformed data $\boldsymbol{X}'$.
Unsupervised Feature Reduction
Visualization: projection of high-dimensional data onto 2D or 3D.
Data compression: efficient storage, communication, and retrieval.
Pre-processing: improve accuracy by reducing the number of features
As a preprocessing step to reduce dimensionality for supervised learning tasks
Helps avoid overfitting
Noise removal
E.g., "noise" in images introduced by minor lighting variations, slightly different imaging conditions, etc.
Linear Transformation
For a linear transformation, we find an explicit mapping $f(\boldsymbol{x}) = \boldsymbol{A}^T\boldsymbol{x}$ that can also transform new data vectors:

$$\boldsymbol{x}' = \boldsymbol{A}^T\boldsymbol{x}, \qquad \boldsymbol{A}^T \in \mathbb{R}^{d' \times d}, \quad \boldsymbol{x} \in \mathbb{R}^d \ \text{(original data)}, \quad \boldsymbol{x}' \in \mathbb{R}^{d'} \ \text{(reduced data)}, \quad d' < d$$
Linear Transformation
Linear transformations are simple mappings:

$$\boldsymbol{x}' = \boldsymbol{A}^T\boldsymbol{x}, \qquad x_k' = \boldsymbol{a}_k^T\boldsymbol{x} \quad (k = 1, \ldots, d')$$

$$\boldsymbol{A} = \begin{bmatrix} a_{11} & \cdots & a_{1d'} \\ \vdots & \ddots & \vdots \\ a_{d1} & \cdots & a_{dd'} \end{bmatrix}, \qquad \begin{bmatrix} x_1' \\ \vdots \\ x_{d'}' \end{bmatrix} = \begin{bmatrix} a_{11} & \cdots & a_{d1} \\ \vdots & \ddots & \vdots \\ a_{1d'} & \cdots & a_{d'd} \end{bmatrix} \begin{bmatrix} x_1 \\ \vdots \\ x_d \end{bmatrix}$$

(the rows of $\boldsymbol{A}^T$ are $\boldsymbol{a}_1^T, \ldots, \boldsymbol{a}_{d'}^T$)
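As a concrete illustration, here is a minimal NumPy sketch of the mapping $\boldsymbol{x}' = \boldsymbol{A}^T\boldsymbol{x}$; the matrix and data point below are made-up placeholders, not from the slides.

```python
# A minimal sketch of x' = A^T x with NumPy; A and x are made-up placeholders.
import numpy as np

d, d_prime = 4, 2                      # original and reduced dimensions (d' < d)
rng = np.random.default_rng(0)
A = rng.standard_normal((d, d_prime))  # columns a_1, ..., a_{d'} define the transform
x = rng.standard_normal(d)             # a single d-dimensional data point

x_prime = A.T @ x                      # x' = A^T x, i.e., x'_k = a_k^T x
print(x_prime.shape)                   # (2,)
```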
Linear Dimensionality Reduction
Unsupervised:
Principal Component Analysis (PCA) [we will discuss]
Independent Component Analysis (ICA) [we will discuss]
Singular Value Decomposition (SVD)
Multi-Dimensional Scaling (MDS)
Canonical Correlation Analysis (CCA)
Principal Component Analysis (PCA)
Also known as the Karhunen-Loève (KL) transform
Principal Components (PCs): orthogonal vectors ordered by the fraction of the total information (variation) in the corresponding directions
Find the directions along which the data approximately lie
When the data is projected onto the first PC, the variance of the projected data is maximized
PCA is an orthogonal projection of the data into a subspace such that the variance of the projected data is maximized.
Principal Component Analysis (PCA)
The "best" linear subspace (i.e., the one providing the least reconstruction error of the data):
Find the mean-reduced data; the axes are then rotated to new (principal) axes such that:
Principal axis 1 has the highest variance
...
Principal axis i has the i-th highest variance
The principal axes are uncorrelated: the covariance between each pair of principal axes is zero.
Goal: reduce the dimensionality of the data while preserving as much of the variation present in the dataset as possible.
PCs can be found as the "best" eigenvectors of the covariance matrix of the data points.
Principal components
If the data has a Gaussian distribution $\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, the direction of the largest variance can be found as the eigenvector of $\boldsymbol{\Sigma}$ that corresponds to the largest eigenvalue of $\boldsymbol{\Sigma}$.

[Figure: a Gaussian data cloud with principal directions $\boldsymbol{v}_1$ and $\boldsymbol{v}_2$]
PCA: Steps
Input: an $N \times d$ data matrix $\boldsymbol{X}$ (each row contains a $d$-dimensional data point)

$\boldsymbol{\mu} = \frac{1}{N}\sum_{i=1}^{N} \boldsymbol{x}^{(i)}$
$\boldsymbol{X} \leftarrow$ the mean of the data points is subtracted from each row of $\boldsymbol{X}$
$\boldsymbol{C} = \frac{1}{N}\boldsymbol{X}^T\boldsymbol{X}$ (covariance matrix)
Calculate the eigenvalues and eigenvectors of $\boldsymbol{C}$
Pick the $d'$ eigenvectors corresponding to the largest eigenvalues and put them in the columns of $\boldsymbol{A} = [\boldsymbol{v}_1, \ldots, \boldsymbol{v}_{d'}]$ (first PC through $d'$-th PC)
$\boldsymbol{X}' = \boldsymbol{X}\boldsymbol{A}$
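These steps translate directly into a few lines of NumPy. Below is a minimal sketch (the function name `pca` is our own, not from the slides):

```python
# A sketch of the PCA steps above; X is an N x d data matrix.
import numpy as np

def pca(X, d_prime):
    mu = X.mean(axis=0)                   # mean of the data points
    Xc = X - mu                           # subtract the mean from each row
    C = Xc.T @ Xc / X.shape[0]            # covariance matrix C = (1/N) X^T X
    eigvals, eigvecs = np.linalg.eigh(C)  # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]     # re-order them descending
    A = eigvecs[:, order[:d_prime]]       # d' eigenvectors with largest eigenvalues
    return Xc @ A, A, mu                  # projected data X' = X A, plus A and mu
```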
Covariance Matrix
$$\boldsymbol{\mu}_{\boldsymbol{x}} = \begin{bmatrix} \mu_1 \\ \vdots \\ \mu_d \end{bmatrix} = \begin{bmatrix} E(x_1) \\ \vdots \\ E(x_d) \end{bmatrix}, \qquad \boldsymbol{\Sigma} = E\left[(\boldsymbol{x} - \boldsymbol{\mu}_{\boldsymbol{x}})(\boldsymbol{x} - \boldsymbol{\mu}_{\boldsymbol{x}})^T\right]$$

ML estimate of the covariance matrix from the data points $\{\boldsymbol{x}^{(i)}\}_{i=1}^{N}$:

$$\widehat{\boldsymbol{\Sigma}} = \frac{1}{N}\sum_{i=1}^{N}\left(\boldsymbol{x}^{(i)} - \boldsymbol{\mu}\right)\left(\boldsymbol{x}^{(i)} - \boldsymbol{\mu}\right)^T = \frac{1}{N}\boldsymbol{X}^T\boldsymbol{X}, \qquad \boldsymbol{\mu} = \frac{1}{N}\sum_{i=1}^{N}\boldsymbol{x}^{(i)}$$

where $\boldsymbol{X}$ is the mean-centered data matrix:

$$\boldsymbol{X} = \begin{bmatrix} \boldsymbol{x}^{(1)} - \boldsymbol{\mu} \\ \vdots \\ \boldsymbol{x}^{(N)} - \boldsymbol{\mu} \end{bmatrix}$$

We now assume the data are mean-removed, so $\boldsymbol{x}$ in the later slides denotes the centered data.
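As a quick sanity check (our own sketch, not from the slides), the ML estimate above matches NumPy's biased covariance, which divides by $N$ rather than $N - 1$:

```python
# The manual (1/N) X^T X estimate equals np.cov with bias=True.
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 3))
Xc = X - X.mean(axis=0)                       # mean-centered data
C_manual = Xc.T @ Xc / X.shape[0]             # ML estimate, divides by N
C_numpy = np.cov(X, rowvar=False, bias=True)  # bias=True also divides by N
assert np.allclose(C_manual, C_numpy)
```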
Correlation matrix
$$\frac{1}{N}\boldsymbol{X}^T\boldsymbol{X} = \frac{1}{N} \begin{bmatrix} x_1^{(1)} & \cdots & x_1^{(N)} \\ \vdots & \ddots & \vdots \\ x_d^{(1)} & \cdots & x_d^{(N)} \end{bmatrix} \begin{bmatrix} x_1^{(1)} & \cdots & x_d^{(1)} \\ \vdots & \ddots & \vdots \\ x_1^{(N)} & \cdots & x_d^{(N)} \end{bmatrix} = \frac{1}{N} \begin{bmatrix} \sum_{n=1}^{N} x_1^{(n)} x_1^{(n)} & \cdots & \sum_{n=1}^{N} x_1^{(n)} x_d^{(n)} \\ \vdots & \ddots & \vdots \\ \sum_{n=1}^{N} x_d^{(n)} x_1^{(n)} & \cdots & \sum_{n=1}^{N} x_d^{(n)} x_d^{(n)} \end{bmatrix}$$

where $\boldsymbol{X} = \begin{bmatrix} x_1^{(1)} & \cdots & x_d^{(1)} \\ \vdots & \ddots & \vdots \\ x_1^{(N)} & \cdots & x_d^{(N)} \end{bmatrix}$
Two Interpretations
Maximum variance subspace: PCA finds vectors $\boldsymbol{a}$ such that projections onto these vectors capture the maximum variance in the data:

$$\frac{1}{N}\sum_{n=1}^{N}\left(\boldsymbol{a}^T\boldsymbol{x}^{(n)}\right)^2 = \frac{1}{N}\boldsymbol{a}^T\boldsymbol{X}^T\boldsymbol{X}\boldsymbol{a}$$

Minimum reconstruction error: PCA finds vectors $\boldsymbol{a}$ such that projection onto these vectors yields the minimum MSE reconstruction:

$$\frac{1}{N}\sum_{n=1}^{N}\left\|\boldsymbol{x}^{(n)} - \left(\boldsymbol{a}^T\boldsymbol{x}^{(n)}\right)\boldsymbol{a}\right\|^2$$
Least Squares Error Interpretation
PCs are linear least squares fits to the samples, each orthogonal to the previous PCs:
The first PC is a minimum-distance fit to a vector in the original feature space
The second PC is a minimum-distance fit to a vector in the plane perpendicular to the first PC
And so on
Least Squares Error and Maximum Variance Views Are Equivalent (1-dim Interpretation)
Minimizing the sum of squared distances to the line is equivalent to maximizing the sum of squares of the projections onto that line (Pythagoras):

$$\text{red}^2 + \text{blue}^2 = \text{green}^2$$

green² is fixed (it is the squared length of the data vector after mean removal), so maximizing blue² (the squared projection) is equivalent to minimizing red² (the squared distance to the line).
First PC
The first PC is the direction of greatest variability in the data.
We will show that the first PC is the eigenvector of the covariance matrix corresponding to the maximum eigenvalue of this matrix.
If $\|\boldsymbol{a}\| = 1$, the projection of a $d$-dimensional $\boldsymbol{x}$ onto $\boldsymbol{a}$ is $\boldsymbol{a}^T\boldsymbol{x}$:

$$\|\boldsymbol{x}\|\cos\theta = \|\boldsymbol{x}\|\,\frac{\boldsymbol{a}^T\boldsymbol{x}}{\|\boldsymbol{x}\|\,\|\boldsymbol{a}\|} = \boldsymbol{a}^T\boldsymbol{x}$$
First PC
$$\arg\max_{\boldsymbol{a}}\ \frac{1}{N}\sum_{n=1}^{N}\left(\boldsymbol{a}^T\boldsymbol{x}^{(n)}\right)^2 = \frac{1}{N}\boldsymbol{a}^T\boldsymbol{X}^T\boldsymbol{X}\boldsymbol{a} \qquad \text{s.t. } \boldsymbol{a}^T\boldsymbol{a} = 1$$

Setting the derivative of the Lagrangian to zero:

$$\frac{\partial}{\partial\boldsymbol{a}}\left[\frac{1}{N}\boldsymbol{a}^T\boldsymbol{X}^T\boldsymbol{X}\boldsymbol{a} + \lambda\left(1 - \boldsymbol{a}^T\boldsymbol{a}\right)\right] = 0 \;\Rightarrow\; \frac{1}{N}\boldsymbol{X}^T\boldsymbol{X}\boldsymbol{a} = \lambda\boldsymbol{a}$$

So $\boldsymbol{a}$ is an eigenvector of the sample covariance matrix $\frac{1}{N}\boldsymbol{X}^T\boldsymbol{X}$.
The eigenvalue $\lambda$ denotes the amount of variance along that dimension:

$$\text{Variance} = \frac{1}{N}\boldsymbol{a}^T\boldsymbol{X}^T\boldsymbol{X}\boldsymbol{a} = \boldsymbol{a}^T\left(\frac{1}{N}\boldsymbol{X}^T\boldsymbol{X}\right)\boldsymbol{a} = \boldsymbol{a}^T\lambda\boldsymbol{a} = \lambda$$

So if we seek the dimension with the largest variance, it is the eigenvector corresponding to the largest eigenvalue of the sample covariance matrix.
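One simple way to compute this top eigenvector without a full eigendecomposition is power iteration; the sketch below is our own illustration, not part of the slides, and assumes the largest eigenvalue is strictly dominant:

```python
# Power iteration on the sample covariance converges to the first PC
# (assuming the largest eigenvalue is strictly larger than the others).
import numpy as np

def first_pc(X, n_iter=200):
    Xc = X - X.mean(axis=0)
    C = Xc.T @ Xc / X.shape[0]        # sample covariance (1/N) X^T X
    a = np.ones(C.shape[0])
    a /= np.linalg.norm(a)            # any unit-norm starting vector
    for _ in range(n_iter):
        a = C @ a                     # multiply by C ...
        a /= np.linalg.norm(a)        # ... and renormalize so a^T a = 1
    return a                          # eigenvector with the largest eigenvalue
```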
PCA: Uncorrelated Features
$$\boldsymbol{x}' = \boldsymbol{A}^T\boldsymbol{x} \;\Rightarrow\; \boldsymbol{R}_{\boldsymbol{x}'} = E\left[\boldsymbol{x}'\boldsymbol{x}'^T\right] = E\left[\boldsymbol{A}^T\boldsymbol{x}\boldsymbol{x}^T\boldsymbol{A}\right] = \boldsymbol{A}^T E\left[\boldsymbol{x}\boldsymbol{x}^T\right]\boldsymbol{A} = \boldsymbol{A}^T\boldsymbol{R}_{\boldsymbol{x}}\boldsymbol{A}$$

If $\boldsymbol{A} = [\boldsymbol{a}_1, \ldots, \boldsymbol{a}_d]$ where $\boldsymbol{a}_1, \ldots, \boldsymbol{a}_d$ are orthonormal eigenvectors of $\boldsymbol{R}_{\boldsymbol{x}}$:

$$\boldsymbol{R}_{\boldsymbol{x}'} = \boldsymbol{A}^T\boldsymbol{R}_{\boldsymbol{x}}\boldsymbol{A} = \boldsymbol{A}^T\left(\boldsymbol{A}\boldsymbol{\Lambda}\boldsymbol{A}^T\right)\boldsymbol{A} = \boldsymbol{\Lambda} \;\Rightarrow\; E\left[x_j' x_k'\right] = 0 \quad \forall j \neq k, \ j,k = 1,\ldots,d$$

so mutually uncorrelated features are obtained. Completely uncorrelated features avoid information redundancy.
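A quick numeric check of this property (our own sketch; the example covariance matrix is made up):

```python
# After projecting onto orthonormal eigenvectors, the covariance of
# x' = A^T x is (numerically) the diagonal matrix Lambda.
import numpy as np

rng = np.random.default_rng(2)
Sigma = np.array([[3.0, 1.0, 0.0], [1.0, 2.0, 0.0], [0.0, 0.0, 1.0]])
X = rng.multivariate_normal(np.zeros(3), Sigma, size=5000)
Xc = X - X.mean(axis=0)
C = Xc.T @ Xc / X.shape[0]
_, A = np.linalg.eigh(C)                     # all orthonormal eigenvectors
C_proj = A.T @ C @ A                         # covariance of projected features
print(np.max(np.abs(C_proj - np.diag(np.diag(C_proj)))))  # ~0: uncorrelated
```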
PCA Derivation: Mean Square Error Approximation
Incorporating all eigenvectors in $\boldsymbol{A} = [\boldsymbol{a}_1, \ldots, \boldsymbol{a}_d]$:

$$\boldsymbol{x}' = \boldsymbol{A}^T\boldsymbol{x} \;\Rightarrow\; \boldsymbol{A}\boldsymbol{x}' = \boldsymbol{A}\boldsymbol{A}^T\boldsymbol{x} = \boldsymbol{x}$$

Therefore, if $d' = d$, then $\boldsymbol{x}$ can be reconstructed exactly from $\boldsymbol{x}'$.
PCA Derivation: Relation between Eigenvalues and Variances
The $k$-th largest eigenvalue of $\boldsymbol{R}_{\boldsymbol{x}}$ is the variance along the $k$-th PC:

$$\mathrm{var}\left(x_k'\right) = E\left[x_k' x_k'\right] = E\left[\boldsymbol{a}_k^T\boldsymbol{x}\boldsymbol{x}^T\boldsymbol{a}_k\right] = \boldsymbol{a}_k^T E\left[\boldsymbol{x}\boldsymbol{x}^T\right]\boldsymbol{a}_k = \boldsymbol{a}_k^T\boldsymbol{R}_{\boldsymbol{x}}\boldsymbol{a}_k = \boldsymbol{a}_k^T\lambda_k\boldsymbol{a}_k = \lambda_k$$

With eigenvalues ordered as $\lambda_1 \geq \lambda_2 \geq \lambda_3 \geq \cdots$:
- The 1st PC $\boldsymbol{v}_1$ is the eigenvector of the sample covariance matrix associated with the largest eigenvalue
- The 2nd PC $\boldsymbol{v}_2$ is the eigenvector of the sample covariance matrix associated with the second largest eigenvalue
- And so on ...
PCA Derivation: Mean Square Error Approximation
Incorporating only the $d'$ eigenvectors corresponding to the largest eigenvalues, $\boldsymbol{A} = [\boldsymbol{a}_1, \ldots, \boldsymbol{a}_{d'}]$ ($d' < d$), minimizes the MSE between $\boldsymbol{x}$ and $\widehat{\boldsymbol{x}} = \boldsymbol{A}\boldsymbol{x}'$:

$$J(\boldsymbol{A}) = E\left[\left\|\boldsymbol{x} - \widehat{\boldsymbol{x}}\right\|^2\right] = E\left[\Bigg\|\sum_{k=d'+1}^{d} x_k'\boldsymbol{a}_k\Bigg\|^2\right] = E\left[\sum_{k=d'+1}^{d}\sum_{l=d'+1}^{d} x_k'\,\boldsymbol{a}_k^T\boldsymbol{a}_l\,x_l'\right] = E\left[\sum_{k=d'+1}^{d} x_k'^2\right] = \sum_{k=d'+1}^{d} E\left[x_k'^2\right] = \sum_{k=d'+1}^{d} \lambda_k$$

i.e., the sum of the $d - d'$ smallest eigenvalues.
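This identity is easy to verify numerically; the sketch below (our own, with made-up data) compares the average reconstruction error against the sum of the discarded eigenvalues:

```python
# The mean squared reconstruction error with d' = 2 kept components equals
# the sum of the d - d' smallest eigenvalues of the sample covariance.
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((2000, 5)) * np.array([3.0, 2.0, 1.0, 0.5, 0.1])
Xc = X - X.mean(axis=0)
C = Xc.T @ Xc / X.shape[0]
eigvals, eigvecs = np.linalg.eigh(C)       # ascending eigenvalues
A = eigvecs[:, -2:]                        # keep the top d' = 2 eigenvectors
X_hat = Xc @ A @ A.T                       # reconstruct from the projection
mse = np.mean(np.sum((Xc - X_hat) ** 2, axis=1))
print(mse, eigvals[:-2].sum())             # the two numbers agree
```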
PCA Derivation: Mean Square Error Approximation
In general, it can also be shown that the MSE is minimized compared with any other approximation of $\boldsymbol{x}$ by any $d'$-dimensional orthonormal basis; this result can be obtained without first assuming that the axes are eigenvectors of the correlation matrix.
If the data is mean-centered in advance, $\boldsymbol{R}_{\boldsymbol{x}}$ and $\boldsymbol{C}_{\boldsymbol{x}}$ (the covariance matrix) are the same.
However, in the correlation version, when $\boldsymbol{C}_{\boldsymbol{x}} \neq \boldsymbol{R}_{\boldsymbol{x}}$, the approximation is not, in general, a good one (although it is a minimum-MSE solution).
PCA on Faces: “Eigenfaces”
ORL Database
[Figure: some sample face images from the database]
PCA on Faces: “Eigenfaces”
For eigenfaces, "gray" = 0, "white" > 0, "black" < 0.
[Figures: the average face, and the 1st to 10th PCs]
PCA on Faces: Feature Vector

Feature vector $= [x_1', x_2', \ldots, x_{d'}']$, where $x_i' = \mathrm{PC}_i^T\,\boldsymbol{x}$ is the projection of $\boldsymbol{x}$ onto the $i$-th PC:

$$\text{face} = \text{average face} + x_1' \times \mathrm{PC}_1 + x_2' \times \mathrm{PC}_2 + \cdots + x_{256}' \times \mathrm{PC}_{256}$$

$\boldsymbol{x}$ is a $112 \times 92 = 10304$-dimensional vector containing the pixel intensities of the image.
PCA on Faces: Reconstructed Face
[Figure: reconstructions with d' = 1, 2, 4, 8, 16, 32, 64, 128, 256, alongside the original image]
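A sketch of how such reconstructions can be produced (our own illustration; `faces` is assumed to be an N × 10304 array of flattened 112 × 92 images, e.g. from the ORL database):

```python
# Reconstruct a face from its projection onto the top d' PCs ("eigenfaces").
import numpy as np

def reconstruct(faces, image, d_prime):
    mean_face = faces.mean(axis=0)
    Xc = faces - mean_face
    # Economy SVD of the centered data yields the PCs without forming the
    # 10304 x 10304 covariance matrix explicitly.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    A = Vt[:d_prime].T                     # top d' eigenfaces as columns
    coeffs = A.T @ (image - mean_face)     # x'_i = PC_i^T x
    return mean_face + A @ coeffs          # average face + weighted eigenfaces
```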
Dimensionality Reduction by PCA
In high-dimensional problems, data sometimes lies near a linear subspace (small variability around this subspace can be treated as noise).
Keep only the data projections onto the principal components with large eigenvalues.
We might lose some information, but if the discarded eigenvalues are small, we do not lose much. (A common selection heuristic is sketched below.)
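One common heuristic (our own addition, not from the slides) keeps enough components to explain a fixed fraction of the total variance:

```python
# Choose d' so the kept eigenvalues explain, e.g., 95% of the total variance.
import numpy as np

def choose_d_prime(eigvals, threshold=0.95):
    eigvals = np.sort(eigvals)[::-1]               # descending order
    ratio = np.cumsum(eigvals) / np.sum(eigvals)   # cumulative variance fraction
    return int(np.searchsorted(ratio, threshold)) + 1
```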
Kernel PCA
Kernel extension of PCA: useful when the data (approximately) lies on a lower-dimensional non-linear manifold.
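A minimal sketch using scikit-learn's KernelPCA (assuming scikit-learn is available); with an RBF kernel, the non-linear structure of two concentric circles becomes linearly separable in the transformed space:

```python
# Kernel PCA with an RBF kernel on data lying on a non-linear structure.
import numpy as np
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

X, _ = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10)
X_kpca = kpca.fit_transform(X)    # the two circles separate along the first PCs
```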
PCA and LDA: Drawbacks
PCA drawback: an excellent information-packing transform does not necessarily lead to good class separability.
The directions of maximum variance may be useless for classification purposes.
LDA drawbacks:
Singularity or under-sampled problem (when $N < d$)
Example: gene expression data, images, text documents
Can reduce the dimension only to $d' \leq C - 1$, where $C$ is the number of classes (unlike PCA)
PCA vs. LDA
Although LDA often provides more suitable features for classification tasks, PCA might outperform LDA in some situations:
when the number of samples per class is small (overfitting problem of LDA)
when the number of desired features is more than $C - 1$
Advances in the last decade:
Semi-supervised feature extraction
E.g., PCA+LDA, Regularized LDA, Locally FDA (LFDA)
Singular Value Decomposition (SVD)
Given a matrix $\boldsymbol{X} \in \mathbb{R}^{N \times d}$, the SVD is the decomposition

$$\boldsymbol{X} = \boldsymbol{U}\boldsymbol{S}\boldsymbol{V}^T$$

where $\boldsymbol{S}$ is a diagonal matrix holding the singular values $\sigma_1, \ldots, \sigma_d$ of $\boldsymbol{X}$, and the columns of $\boldsymbol{U}$ and $\boldsymbol{V}$ are orthonormal. Shapes: $\boldsymbol{X}$ is $N \times d$, $\boldsymbol{U}$ is $N \times d$, $\boldsymbol{S}$ is $d \times d$, $\boldsymbol{V}^T$ is $d \times d$.
Singular Value Decomposition (SVD)
The SVD of $\boldsymbol{X}$ is related to the eigen-decompositions of $\boldsymbol{X}^T\boldsymbol{X}$ and $\boldsymbol{X}\boldsymbol{X}^T$:

$$\boldsymbol{X}^T\boldsymbol{X} = \boldsymbol{V}\boldsymbol{S}\boldsymbol{U}^T\boldsymbol{U}\boldsymbol{S}\boldsymbol{V}^T = \boldsymbol{V}\boldsymbol{S}^2\boldsymbol{V}^T$$

so $\boldsymbol{V}$ contains the eigenvectors of $\boldsymbol{X}^T\boldsymbol{X}$ and $\boldsymbol{S}^2$ holds its eigenvalues ($\lambda_i = \sigma_i^2$).

$$\boldsymbol{X}\boldsymbol{X}^T = \boldsymbol{U}\boldsymbol{S}\boldsymbol{V}^T\boldsymbol{V}\boldsymbol{S}\boldsymbol{U}^T = \boldsymbol{U}\boldsymbol{S}^2\boldsymbol{U}^T$$

so $\boldsymbol{U}$ contains the eigenvectors of $\boldsymbol{X}\boldsymbol{X}^T$ and $\boldsymbol{S}^2$ holds its eigenvalues ($\lambda_i = \sigma_i^2$).

In fact, we can view each row of $\boldsymbol{U}\boldsymbol{S}$ as the coordinates of an example along the axes given by the eigenvectors.
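This connection means PCA can be computed from the SVD of the centered data matrix; a quick numeric check (our own sketch):

```python
# PCA via the covariance eigendecomposition and via the SVD of the centered
# data give the same eigenvalues and the same projected coordinates.
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((300, 6))
Xc = X - X.mean(axis=0)

eigvals, _ = np.linalg.eigh(Xc.T @ Xc / X.shape[0])
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# for the covariance (1/N) X^T X, the eigenvalues are lambda_i = sigma_i^2 / N
print(np.allclose(np.sort(S**2 / X.shape[0]), np.sort(eigvals)))
print(np.allclose(Xc @ Vt.T, U * S))   # rows of U S are the PC coordinates
```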
Independent Component Analysis (ICA)
PCA:
The transformed dimensions are uncorrelated with each other
Orthogonal linear transform
Uses only second-order statistics (i.e., the covariance matrix)
ICA:
The transformed dimensions are as independent as possible
Non-orthogonal linear transform
Higher-order statistics can also be used
Uncorrelated and Independent
Gaussian: independent ⟺ uncorrelated
Non-Gaussian: independent ⇒ uncorrelated, but uncorrelated ⇏ independent
Uncorrelated: $\mathrm{cov}(X_1, X_2) = 0$
Independent: $P(X_1, X_2) = P(X_1)P(X_2)$
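A small illustration of the non-Gaussian case (our own sketch): $X_2 = X_1^2$ is completely determined by $X_1$, yet the two are uncorrelated:

```python
# Uncorrelated does not imply independent for non-Gaussian variables.
import numpy as np

rng = np.random.default_rng(5)
x1 = rng.uniform(-1, 1, size=100_000)   # symmetric, so E[x1] = E[x1^3] = 0
x2 = x1 ** 2                            # fully dependent on x1
print(np.cov(x1, x2)[0, 1])             # ~0: uncorrelated
# but P(X1, X2) != P(X1) P(X2): knowing x1 pins down x2 exactly
```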
ICA: Cocktail party problem
Cocktail party problem: $d$ speakers speak simultaneously, and each microphone records an overlapping combination of these voices.
Each microphone records a different combination of the speakers' voices.
Using these $d$ microphone recordings, can we separate out the original $d$ speakers' speech signals?
Mixing matrix $\boldsymbol{A}$: $\quad \boldsymbol{x} = \boldsymbol{A}\boldsymbol{s}$
Unmixing matrix $\boldsymbol{A}^{-1}$: $\quad \boldsymbol{s} = \boldsymbol{A}^{-1}\boldsymbol{x}$
$s_j^{(i)}$: the sound speaker $j$ was uttering at time $i$
$x_j^{(i)}$: the acoustic reading recorded by microphone $j$ at time $i$
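A toy version of this setup with scikit-learn's FastICA (assumed available; the sources and mixing matrix below are made up):

```python
# Mix two synthetic sources and recover them with FastICA.
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(6)
t = np.linspace(0, 8, 2000)
S = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))]   # two independent sources
A = np.array([[1.0, 0.5], [0.5, 1.0]])             # mixing matrix
X = S @ A.T                                        # microphone recordings x = A s

ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(X)    # recovered sources (up to scale and order)
```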
ICA
Find a linear transformation $\boldsymbol{x} = \boldsymbol{A}\boldsymbol{s}$ for which the dimensions of