Geometric Data Analysis
Principal Component Analysis
MAT 6480W / STT 6705V
Guy Wolf guy.wolf@umontreal.ca
Université de Montréal, Fall 2019
Outline
1. Preprocessing for data simplification: sampling, aggregation, discretization, density estimation, dimensionality reduction
2. Principal component analysis (PCA): autoencoder, variance maximization, singular value decomposition (SVD)
Preprocessing for data simplification
Sampling
Select a subset of representative data points instead of processing the entire data. A sampled subset is useful only if its analysis yields the same patterns, results, conclusions, etc., as the analysis of the entire data.
[Figure: the same dataset shown with 8000, 2000, and 500 sampled points]
Common sampling approaches
- Random: each item has an equal probability of being selected.
  - Without replacement: selected items are iteratively removed from the population.
  - With replacement: selected items remain in the population and may be chosen again.
- Stratified: draw random samples from each partition of the data.
Choosing a sufficient sample size is often crucial for effective sampling (see the sketch below).
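A minimal sketch of these sampling schemes, assuming NumPy arrays and hypothetical group labels for the stratified case:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8000, 2))            # toy data: 8000 points in 2-D
groups = rng.integers(0, 4, size=8000)    # hypothetical group/stratum labels

# Random sampling without replacement: selected items are removed from the pool.
idx_without = rng.choice(len(X), size=500, replace=False)

# Random sampling with replacement: items may be selected more than once.
idx_with = rng.choice(len(X), size=500, replace=True)

# Stratified sampling: draw the same number of samples from each partition.
idx_strat = np.concatenate([
    rng.choice(np.where(groups == g)[0], size=125, replace=False)
    for g in np.unique(groups)
])

sample = X[idx_strat]
```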
Example
Choose enough samples to guarantee that at least one representative is selected from each distinct group/cluster/profile in the data.
Aggregation
Instead of sampling representative data points, we can coarse-grain the data by aggregating attributes or data points together.
Aggregation
Combining several attributes into a single feature, or several data points into a single observation.
Examples
- Change monthly revenues to annual revenues (see the sketch below)
- Analyze neighborhoods instead of houses
- Provide the average rating of a season (not per episode)
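A minimal pandas sketch of the first example, using a hypothetical table with one row per store per month:

```python
import pandas as pd

# Hypothetical monthly revenue records (one row per store per month).
monthly = pd.DataFrame({
    "store":   ["A", "A", "A", "B", "B", "B"],
    "year":    [2019, 2019, 2019, 2019, 2019, 2019],
    "month":   [1, 2, 3, 1, 2, 3],
    "revenue": [100.0, 120.0, 90.0, 80.0, 85.0, 95.0],
})

# Aggregate monthly revenues into annual revenues: many rows become one observation.
annual = monthly.groupby(["store", "year"], as_index=False)["revenue"].sum()
print(annual)
```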
Discretization
It is sometimes convenient to transform the entire data to nominal (or ordinal) attributes.
Discretization
Transformation of continuous attributes (or ones with an infinite range) to discrete ones with a finite range. Discretization can be done in a supervised manner (e.g., using class labels) or in an unsupervised manner (e.g., using clustering).
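A minimal sketch of unsupervised discretization by equal-width and equal-frequency binning, assuming NumPy (a supervised variant would instead choose the cut points using class labels):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=50.0, scale=10.0, size=1000)   # a continuous attribute

# Equal-width binning into 5 discrete values (unsupervised discretization).
edges = np.linspace(x.min(), x.max(), num=6)       # 5 bins -> 6 edges
x_equal_width = np.digitize(x, edges[1:-1])        # ordinal codes 0..4

# Equal-frequency binning: each bin holds roughly the same number of points.
quantile_edges = np.quantile(x, [0.2, 0.4, 0.6, 0.8])
x_equal_freq = np.digitize(x, quantile_edges)      # ordinal codes 0..4
```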
Supervised discretization based on minimizing impurity:
[Figure: discretized data with 3 values per axis and with 5 values per axis]
Unsupervised discretization: [Figure]
Density estimation
Transforming attributes from raw values to densities can be used to coarse-grain the data and bring its features to comparable scales between zero and one.
Cell-based density estimation: [Figure]
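A minimal sketch of cell-based density estimation on a 2-D grid, assuming NumPy; each point is assigned the count of its cell, scaled to [0, 1]:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 2))                      # 2-D data

# Partition the plane into a grid of cells and count points per cell.
bins = 20
counts, xedges, yedges = np.histogram2d(X[:, 0], X[:, 1], bins=bins)

# Map each point to the density of its cell, scaled to [0, 1].
ix = np.clip(np.digitize(X[:, 0], xedges) - 1, 0, bins - 1)
iy = np.clip(np.digitize(X[:, 1], yedges) - 1, 0, bins - 1)
density = counts[ix, iy] / counts.max()
```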
Center-based density estimation: [Figure]
Dimensionality reduction
Dimensionality of data is generally determined by the number of attributes or features that represent each data point.
Curse of dimensionality
A general term for various phenomena that arise when analyzing and processing high-dimensional data.
Common theme - statistical significance is difficult, impractical, or even impossible to obtain due to the sparsity of the data in high dimensions. This causes poor performance of classical statistical methods compared to low-dimensional data.
Common solution - reduce the dimensionality of the data as part of its (pre)processing.
There are several approaches to represent the data in a lower dimension, which can generally be split into two types:
Dimensionality reduction approaches
- Feature selection/weighting: select a subset of existing features and only use them in the analysis, while possibly also assigning them importance weights, to eliminate redundant information.
- Feature extraction/construction: create new features by extracting relevant information from the original features.
PCA and MDS are two of the most common dimensionality reduction methods in data analysis, but many others exist as well.
Feature subset selection
Ideally: choose the best feature subset out of all possible subsets.
Feature selection approaches
- Embedded methods: choose the best features for a task as part of running the data mining algorithm itself.
- Filter methods: choose features that optimize a general criterion (e.g., minimal correlation) as part of data preprocessing, using an efficient search algorithm.
- Wrapper methods: first formulate and handle a data mining task to select features, and then use the resulting subset to solve the real task.
Alternatively, expert knowledge can sometimes be used to eliminate redundant and unnecessary features. (A small filter-method sketch follows below.)
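A minimal sketch of a filter method, assuming NumPy: drop one feature from every highly correlated pair before any downstream analysis (the 0.95 threshold is an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
X[:, 3] = X[:, 0] + 0.01 * rng.normal(size=500)     # feature 3 is redundant

# Filter criterion: absolute pairwise correlation between features.
corr = np.abs(np.corrcoef(X, rowvar=False))

keep = []
for j in range(X.shape[1]):
    # Keep feature j only if it is not highly correlated with an already kept one.
    if all(corr[j, k] < 0.95 for k in keep):
        keep.append(j)

X_selected = X[:, keep]                              # feature 3 gets dropped
```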
Principal component analysis
Given centered data (avg = 0), find the best k-dimensional projection.
Projection on principal components:
[Figure: data points and their principal components]
[Figure: data in a 3D space projected onto the 1D space spanned by the leading principal component $\lambda_1 \phi_1$]
What is the best projection?
Find a subspace $S \subseteq \mathbb{R}^n$ s.t. $\dim(S) = k$ and the data is well approximated by $\hat{x} = \mathrm{proj}_S\, x$.
⇓
Find a subspace $S \subseteq \mathbb{R}^n$ s.t. $S = \mathrm{span}\{u_1, \ldots, u_k\}$ and $\|x - \hat{x}\|$ is minimal over the data, with $\hat{x} = \mathrm{proj}_S\, x$.
⇓
Find $k$ vectors $u_1, \ldots, u_k$ s.t. $N^{-1} \sum_{i=1}^{N} \|x_i - \hat{x}_i\|^2$ is minimal, with $\hat{x}_i = \mathrm{proj}_{\mathrm{span}\{u_1,\ldots,u_k\}}\, x_i$.
How do we find these vectors $u_1, \ldots, u_k$?
Autoencoder
Minimize $N^{-1} \sum_{i=1}^{N} \|x_i - \hat{x}_i\|^2$ s.t. $\hat{x}_i = \mathrm{proj}_{\mathrm{span}\{u_1,\ldots,u_k\}}\, x_i$.
[Figure: a linear autoencoder. Input layer $x[1], \ldots, x[5]$; hidden layer $h[1], h[2], h[3]$ computed as $h_i = W x_i$; output layer $\hat{x}[1], \ldots, \hat{x}[5]$ computed as $\hat{x}_i = U h_i$.]
$\arg\min_{W \in \mathbb{R}^{k \times n},\, U \in \mathbb{R}^{n \times k}} \sum_{i=1}^{N} \|x_i - U W x_i\|^2$
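A minimal NumPy sketch of this linear autoencoder, trained with plain gradient descent (the hyperparameters are arbitrary illustrative choices); with enough iterations the learned subspace approaches the PCA subspace:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, N = 5, 2, 1000
X = rng.normal(size=(N, n)) @ rng.normal(size=(n, n))   # correlated toy data
X -= X.mean(axis=0)                                      # center the data

W = 0.01 * rng.normal(size=(k, n))   # encoder:  h_i = W x_i
U = 0.01 * rng.normal(size=(n, k))   # decoder:  x_hat_i = U h_i

lr = 1e-3
for _ in range(2000):
    H = X @ W.T                      # hidden representations, N x k
    Xhat = H @ U.T                   # reconstructions, N x n
    R = Xhat - X                     # residuals
    # Gradients of (1/N) * sum_i ||x_i - U W x_i||^2
    grad_U = 2.0 / N * R.T @ H
    grad_W = 2.0 / N * (U.T @ R.T @ X)
    U -= lr * grad_U
    W -= lr * grad_W

loss = np.mean(np.sum((X - X @ W.T @ U.T) ** 2, axis=1))
```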
Reconstruction error minimization
We only need to consider orthonormal vectors $u_1, \ldots, u_k$ (i.e., $\|u_i\| = 1$ and $\langle u_i, u_j \rangle = 0$ for $i \neq j$) that form a basis for the subspace. We can then extend this set to a basis $u_1, \ldots, u_n$ of the entire $\mathbb{R}^n$. Then, we can write
$x = \sum_{j=1}^{n} \langle x, u_j \rangle u_j = \sum_{j=1}^{n} u_j u_j^T x$ and $\mathrm{proj}_{\mathrm{span}\{u_1,\ldots,u_k\}}\, x = \sum_{j=1}^{k} u_j u_j^T x$.
We now consider the reconstruction error $N^{-1} \sum_{i=1}^{N} \|x_i - \hat{x}_i\|^2$.
First, notice that $x - \hat{x} = \sum_{j=1}^{n} u_j u_j^T x - \sum_{j=1}^{k} u_j u_j^T x = \sum_{j=k+1}^{n} u_j u_j^T x$.
⇓
$\|x - \hat{x}\|^2 = \sum_{q=1}^{n} \Big(\sum_{j=k+1}^{n} u_j[q]\, u_j^T x\Big)^2 = \sum_{j=k+1}^{n} \sum_{j'=k+1}^{n} \Big(\sum_{q=1}^{n} u_j[q]\, u_{j'}[q]\Big)(u_j^T x)(u_{j'}^T x) = \sum_{j=k+1}^{n} (u_j^T x)^2 = \sum_{j=1}^{n} (u_j^T x)^2 - \sum_{j=1}^{k} (u_j^T x)^2 = \|x\|^2 - \|\hat{x}\|^2$
⇓
Minimizing the reconstruction error is equivalent to maximizing
$N^{-1} \sum_{i=1}^{N} \|\hat{x}_i\|^2 = \sum_{j=1}^{k} N^{-1} \sum_{i=1}^{N} (u_j^T x_i)^2 = \sum_{j=1}^{k} \mathrm{variance}(u_j^T x)$
Variance maximization
Find a direction that maximizes the variance in the projected data.
⇓
Find a unit vector $u \in \mathbb{R}^n$ that maximizes:
$\mathrm{variance}(u^T x) = N^{-1} \sum_{i=1}^{N} (u^T x_i)^2 = N^{-1} \sum_{i=1}^{N} (u^T x_i)(x_i^T u) = u^T \Big(N^{-1} \sum_{i=1}^{N} x_i x_i^T\Big) u = u^T \Sigma u$,
where $\Sigma$ is the covariance matrix.
⇓
Solve the maximization problem: maximize $u^T \Sigma u$ s.t. $\|u\| = 1$.
Apply the Lagrange multipliers method:
$f(u, \alpha) = u^T \Sigma u + \alpha(1 - u^T u)$
$\nabla_u f(u, \alpha) = 2(\Sigma u - \alpha u)$
$\nabla_u f(u, \alpha) = 0 \Rightarrow \Sigma u = \alpha u$
Therefore, $u$ is an eigenvector of $\Sigma$ with eigenvalue $\alpha$, which has to be the maximal eigenvalue in order to maximize $u^T \Sigma u = \alpha$.
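A small NumPy check of this conclusion: the variance $u^T \Sigma u$ captured by any unit vector never exceeds the largest eigenvalue, which is attained by its eigenvector.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4)) @ rng.normal(size=(4, 4))
X -= X.mean(axis=0)
Sigma = X.T @ X / len(X)                     # covariance matrix

evals, evecs = np.linalg.eigh(Sigma)         # eigenvalues in ascending order
u1 = evecs[:, -1]                            # eigenvector of the largest eigenvalue

# Variance captured by random unit vectors never exceeds the top eigenvalue.
for _ in range(1000):
    u = rng.normal(size=4)
    u /= np.linalg.norm(u)
    assert u @ Sigma @ u <= evals[-1] + 1e-9

assert np.isclose(u1 @ Sigma @ u1, evals[-1])
```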
Similarly, a second direction is found via: maximize $u_2^T \Sigma u_2$ s.t. $\|u_2\| = 1$ and $\langle u_2, u_1 \rangle = 0$.
Apply the Lagrange multipliers method:
$f(u_2, \alpha, \beta) = u_2^T \Sigma u_2 + \alpha(1 - u_2^T u_2) - \beta\, u_2^T u_1$
$\nabla_{u_2} f(u_2, \alpha, \beta) = 2(\Sigma u_2 - \alpha u_2) - \beta u_1$
Multiplying $\nabla_{u_2} f(u_2, \alpha, \beta) = 0$ by $u_1^T$ gives $\beta = 0$, and then $\nabla_{u_2} f(u_2, \alpha, \beta) = 0 \Rightarrow \Sigma u_2 = \alpha u_2$.
Therefore, $u_2$ is an eigenvector of $\Sigma$ with the second largest eigenvalue.
Eigendecomposition and SVD
[Figure: the covariance matrix (features × features) computed from the data matrix (data points × features), with entries]
$\mathrm{cov}(q_1, q_2) = \sum_i x_i[q_1] \cdot x_i[q_2]$ (for centered data, up to normalization by $N$).
[Figure: eigendecomposition of the covariance matrix: an eigenvalue $\lambda_i$ and its eigenvector $\phi_i$ satisfy $\Sigma \phi_i = \lambda_i \phi_i$.]
[Figure: SVD of the covariance matrix as a product of singular vectors and singular values.]
The spectral theorem applies to covariance matrices, so SVD (Singular Value Decomposition) recovers their eigendecomposition; spectral theorem: $\mathrm{cov}(q_1, q_2) = \sum_i \lambda_i\, \phi_i[q_1]\, \phi_i[q_2]$.
Singular value decomposition
Any matrix $M \in \mathbb{R}^{n \times k}$ can be decomposed via $U, S, V \leftarrow \mathrm{SVD}(M)$ as $M = U S V^T$, where $U$ is $n \times n$, $S$ is an $n \times k$ diagonal matrix, and $V$ is $k \times k$.
- The singular values in $S$ are the square roots of the (nonnegative) eigenvalues of both $M M^T$ and $M^T M$.
- The singular vectors in (the columns of) $U$ are the eigenvectors of $M M^T$.
- The singular vectors in (the columns of) $V$ are the eigenvectors of $M^T M$.
Proof and more details about SVD can be found on Wikipedia.
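A small NumPy sketch verifying these relations for an arbitrary matrix $M$:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(6, 4))                  # M in R^{n x k}

U, s, Vt = np.linalg.svd(M)                  # M = U @ diag(s) @ Vt

# Singular values are the square roots of the eigenvalues of M M^T and M^T M.
eig_MMt = np.sort(np.linalg.eigvalsh(M @ M.T))[::-1]
eig_MtM = np.sort(np.linalg.eigvalsh(M.T @ M))[::-1]
assert np.allclose(s ** 2, eig_MMt[: len(s)])
assert np.allclose(s ** 2, eig_MtM[: len(s)])

# Columns of U (resp. V) are eigenvectors of M M^T (resp. M^T M).
assert np.allclose(M @ M.T @ U[:, 0], s[0] ** 2 * U[:, 0])
assert np.allclose(M.T @ M @ Vt[0], s[0] ** 2 * Vt[0])
```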
[Figure: scree plot of the eigenvalues $\lambda_1, \lambda_2, \lambda_3, \lambda_4, \lambda_5$.]
A decaying covariance spectrum reveals the (low) dimensionality of the data.
[Figure: the covariance matrix factored into its eigenvectors and eigenvalues.]
The covariance matrix can be approximated by a truncated SVD.
Trivial example
Consider the simple case of data points that all lie on the same high-dimensional line:
- The straight line is defined by a unit vector $\psi$.
- Points on the line are obtained by multiplying $\psi$ by scalars, so the points can be written as $x_i = c_i \psi$.
- Covariance: $\mathrm{cov}(t_1, t_2) = \sum_i x_i[t_1]\, x_i[t_2] = \sum_i c_i \psi[t_1]\, c_i \psi[t_2] = \big(\sum_i c_i^2\big)\, \psi[t_1]\, \psi[t_2] = c^2\, \psi[t_1]\, \psi[t_2]$, where $c^2 = \sum_i c_i^2$.
[Figure: the covariance matrix as the rank-one product $c^2\, \psi \psi^T$.]
The covariance matrix has a single (nonzero) eigenvalue $c^2$ and a single eigenvector $\psi$, which defines the principal direction of the data-point vectors.
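A small NumPy check of this trivial example, with points $x_i = c_i \psi$ on a line:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
psi = rng.normal(size=n)
psi /= np.linalg.norm(psi)                   # unit vector defining the line

c = rng.normal(size=200)                     # scalars c_i
X = np.outer(c, psi)                         # points x_i = c_i * psi

Sigma = X.T @ X                              # unnormalized covariance, as on the slide
evals, evecs = np.linalg.eigh(Sigma)

# Single nonzero eigenvalue c^2 = sum_i c_i^2, with eigenvector psi (up to sign).
assert np.isclose(evals[-1], np.sum(c ** 2))
assert np.allclose(np.abs(evecs[:, -1]), np.abs(psi))
assert np.allclose(evals[:-1], 0.0)
```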
[Figure: the data points on the line together with the leading eigenvector $\phi_1 = \psi$; the principal components $\lambda_1 \phi_1$ and $\lambda_2 \phi_2$ are drawn as arrows.]
Length: eigenvalues. Direction: eigenvectors.
Principal components ⇒ maximal-variance directions.
PCA algorithm:
1. Centering
2. Covariance
3. SVD (or eigendecomposition)
4. Projection (see the sketch below)
Alternative method: Multi-Dimensional Scaling (MDS) - preserve distances/inner products with a minimal set of coordinates.
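A minimal NumPy sketch of these four steps, written as a standalone function for illustration:

```python
import numpy as np

def pca(X, k):
    """Project the rows of X onto their k leading principal components."""
    # 1. Centering: subtract the mean of each feature.
    Xc = X - X.mean(axis=0)
    # 2. Covariance matrix of the centered data.
    Sigma = Xc.T @ Xc / len(Xc)
    # 3. SVD (or eigendecomposition) of the covariance matrix.
    U, S, _ = np.linalg.svd(Sigma)
    # 4. Projection onto the k leading principal components.
    return Xc @ U[:, :k], U[:, :k], S[:k]

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10)) @ rng.normal(size=(10, 10))
Y, components, eigenvalues = pca(X, k=2)     # Y is the 2-D representation
```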
Summary
- Preprocessing steps are crucial in preparing data for meaningful analysis.
- Linear dimensionality reduction helps alleviate the curse of dimensionality.
- Principal Component Analysis (PCA) is a standard dimensionality reduction approach:
  - Based on projecting data onto the leading eigenvectors of the covariance matrix.
  - Minimizes the reconstruction error of the projection.
  - Equivalently, finds a subspace that maximizes the captured variance.
  - In practice, SVD is used instead of eigendecomposition.