SLIDE 1

Geometric Data Analysis

Principal Component Analysis

MAT 6480W / STT 6705V

Guy Wolf guy.wolf@umontreal.ca

Université de Montréal, Fall 2019

SLIDE 2

Outline

1. Preprocessing for data simplification
   - Sampling
   - Aggregation
   - Discretization
   - Density estimation
   - Dimensionality reduction

2. Principal component analysis (PCA)
   - Autoencoder
   - Variance maximization
   - Singular value decomposition (SVD)

SLIDE 3

Preprocessing for data simplification

Sampling

Select a subset of representative data points instead of processing the entire data. A sampled subset is useful only if its analysis yields the same patterns, results, conclusions, etc., as the analysis of the entire data.

[Figure: the same point cloud sampled at 8000, 2000, and 500 points]

SLIDE 4

Preprocessing for data simplification

Sampling


Select a subset of representative data points instead of processing the entire data.

Common sampling approaches

- Random: an equal probability of selecting any particular item.
  - Without replacement: iteratively select & remove items.
  - With replacement: selected items remain in the population.
- Stratified: draw random samples from each partition.

Choosing a sufficient sample size is often crucial for effective sampling.
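As a rough illustration of these schemes (my own sketch, not from the slides), the snippet below draws samples with and without replacement and a stratified sample using NumPy; the toy data, the stratum labels, and the sample sizes are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8000, 2))          # toy data: 8000 points in 2-D
labels = (X[:, 0] > 0).astype(int)      # a crude "stratum" label, just for illustration

# Random sampling without replacement: each point is selected at most once
idx_without = rng.choice(len(X), size=500, replace=False)

# Random sampling with replacement: points may repeat in the sample
idx_with = rng.choice(len(X), size=500, replace=True)

# Stratified sampling: draw the same number of points from each stratum
idx_strat = np.concatenate([
    rng.choice(np.where(labels == s)[0], size=250, replace=False)
    for s in np.unique(labels)
])

sample = X[idx_strat]                   # the stratified subsample
```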

SLIDE 5

Preprocessing for data simplification

Sampling

Example

Choose enough samples to guarantee that at least one representative is selected from each distinct group/cluster/profile in the data.

SLIDE 6

Preprocessing for data simplification

Aggregation

Instead of sampling representative data points, we can coarse-grain the data by aggregating together attributes or data points.

Aggregation

Combining several attributes into a single feature, or several data points into a single observation.

Examples

- Change monthly revenues to annual revenues
- Analyze neighborhoods instead of houses
- Provide average rating of a season (not per episode)
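As a small illustration (not part of the slides), the first example can be sketched with pandas; the table and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical monthly revenue table
monthly = pd.DataFrame({
    "year":    [2018, 2018, 2018, 2019, 2019, 2019],
    "month":   [1, 2, 3, 1, 2, 3],
    "revenue": [10.0, 12.5, 9.0, 11.0, 13.0, 14.5],
})

# Aggregate monthly revenues into annual revenues (one row per year)
annual = monthly.groupby("year", as_index=False)["revenue"].sum()
print(annual)
```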

SLIDE 7

Preprocessing for data simplification

Discretization

It is sometimes convenient to transform the entire data to nominal (or ordinal) attributes.

Discretization

Transformation of continuous attributes (or ones with an infinite range) to discrete ones with a finite range. Discretization can be done in a supervised manner (e.g., using class labels) or in an unsupervised manner (e.g., using clustering).
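A minimal unsupervised-discretization sketch with NumPy, assuming a single continuous attribute and 5 target values; the toy data and the bin counts are made up (a supervised, impurity-based scheme, as on the next slide, would instead use class labels).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=1000)   # a continuous attribute (toy data)

# Unsupervised, equal-width discretization into 5 ordinal values
edges_width = np.linspace(x.min(), x.max(), num=6)[1:-1]   # 4 interior cut points
x_width = np.digitize(x, edges_width)                       # values in {0, ..., 4}

# Unsupervised, equal-frequency (quantile) discretization into 5 ordinal values
edges_quant = np.quantile(x, [0.2, 0.4, 0.6, 0.8])
x_quant = np.digitize(x, edges_quant)
```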

SLIDE 8

Preprocessing for data simplification

Discretization

Supervised discretization based on minimizing impurity: [Figure: discretized data with 3 values per axis and with 5 values per axis]

SLIDE 9

Preprocessing for data simplification

Discretization

Unsupervised discretization: [figure]

SLIDE 10

Preprocessing for data simplification

Density estimation

Transforming attributes from raw values to densities can be used to coarse-grain the data and bring its features to comparable scales between zero and one.

SLIDE 11

Preprocessing for data simplification

Density estimation

Transforming attributes from raw values to densities can be used to coarse-grain the data and bring its features to comparable scales between zero and one.

Cell-based density estimation: [figure]

SLIDE 12

Preprocessing for data simplification

Density estimation

Transforming attributes from raw values to densities can be used to coarse-grain the data and bring its features to comparable scales between zero and one.

Center-based density estimation: [figure]
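The two estimators named on these slides can be sketched as follows (my own illustration, with made-up data, grid resolution, and radius): the cell-based version counts points per grid cell, the center-based version counts neighbors within a radius of each point.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 2))   # toy 2-D data

# Cell-based density: count points per grid cell, then rescale to [0, 1]
counts, xedges, yedges = np.histogram2d(X[:, 0], X[:, 1], bins=20)
cell_density = counts / counts.max()

# Center-based density: for each point, the fraction of points within radius r
r = 0.5
dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # pairwise distances
center_density = (dists < r).sum(axis=1) / len(X)                # values in (0, 1]
```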

SLIDE 13

Preprocessing for data simplification

Dimensionality reduction

Dimensionality of data is generally determined by the number of attributes or features that represent each data point.

Curse of dimensionality

A general term for various phenomena that arise when analyzing and processing high-dimensional data.

Common theme: statistical significance is difficult, impractical, or even impossible to obtain due to the sparsity of the data in high dimensions. This causes poor performance of classical statistical methods compared to low-dimensional data.

Common solution - reduce the dimensionality of the data as part of its (pre)processing.

SLIDE 14

Preprocessing for data simplification

Dimensionality reduction

There are several approaches to represent the data in a lower dimension, which can generally be split into two types:

Dimensionality reduction approaches

- Feature selection/weighting: select a subset of existing features and only use them in the analysis, while possibly also assigning them importance weights to eliminate redundant information.
- Feature extraction/construction: create new features by extracting relevant information from the original features.

PCA and MDS are two of the most common dimensionality reduction methods in data analysis, but many others exist as well.

SLIDE 15

Preprocessing for data simplification

Feature subset selection

Ideally: choose the best feature subset out of all possible combinations. Impractical: there are $2^n$ choices for $n$ attributes!

Feature selection approaches

- Embedded methods: choose the best features for a task as part of the data mining algorithm (e.g., decision trees).
- Filter methods: choose features that optimize a general criterion (e.g., minimal correlation) as part of data preprocessing, using an efficient search algorithm.
- Wrapper methods: first formulate & handle a data mining task to select features, and then use the resulting subset to solve the real task.

Alternatively, expert knowledge can sometimes be used to eliminate redundant and unnecessary features.
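As a sketch of one possible filter method using minimal correlation as its criterion (my own example, not the course's): greedily keep a feature only if it is not too correlated with the features already kept. The threshold, function name, and toy data are arbitrary.

```python
import numpy as np

def correlation_filter(X, threshold=0.9):
    """Greedy filter: keep a feature only if its absolute correlation with every
    previously kept feature is below `threshold`. Returns indices of kept features."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep = []
    for j in range(X.shape[1]):
        if all(corr[j, k] < threshold for k in keep):
            keep.append(j)
    return keep

rng = np.random.default_rng(0)
A = rng.normal(size=(500, 3))
# Fourth feature is almost a copy of the first, i.e., redundant
X = np.column_stack([A, A[:, 0] + 0.01 * rng.normal(size=500)])
print(correlation_filter(X))   # e.g. [0, 1, 2]: the redundant copy is dropped
```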

SLIDE 16

Principal component analysis

SLIDE 17

Principal component analysis

SLIDE 18

Principal component analysis

Assume: avg = 0 (i.e., the data is centered). Find: the best k-dimensional projection.

SLIDE 19

Principal component analysis

Projection on principal components: [Figure: the data points expressed in terms of the principal components]

SLIDE 20

Principal component analysis

Projection on principal components: [Figure: data in 3D space projected onto the 1D space spanned by $\lambda_1 \varphi_1$]

SLIDE 21

Principal component analysis

What is the best projection?

Find a subspace $S \subseteq \mathbb{R}^n$ s.t. $\dim(S) = k$ and the data is well approximated by $\hat{x} = \mathrm{proj}_S x$.

SLIDE 22

Principal component analysis

What is the best projection?

Find a subspace $S \subseteq \mathbb{R}^n$ s.t. $\dim(S) = k$ and the data is well approximated by $\hat{x} = \mathrm{proj}_S x$.
⇓
Find a subspace $S \subseteq \mathbb{R}^n$ s.t. $S = \mathrm{span}\{u_1, \ldots, u_k\}$ and $\|x - \hat{x}\|$ is minimal over the data, with $\hat{x} = \mathrm{proj}_S x$.

SLIDE 23

Principal component analysis

What is the best projection?

Find a subspace $S \subseteq \mathbb{R}^n$ s.t. $\dim(S) = k$ and the data is well approximated by $\hat{x} = \mathrm{proj}_S x$.
⇓
Find a subspace $S \subseteq \mathbb{R}^n$ s.t. $S = \mathrm{span}\{u_1, \ldots, u_k\}$ and $\|x - \hat{x}\|$ is minimal over the data, with $\hat{x} = \mathrm{proj}_S x$.
⇓
Find $k$ vectors $u_1, \ldots, u_k$ s.t. $N^{-1} \sum_{i=1}^{N} \|x_i - \hat{x}_i\|^2$ is minimal, with $\hat{x} = \mathrm{proj}_{\mathrm{span}\{u_1, \ldots, u_k\}} x$.

SLIDE 24

Principal component analysis

What is the best projection?

Find a subspace $S \subseteq \mathbb{R}^n$ s.t. $\dim(S) = k$ and the data is well approximated by $\hat{x} = \mathrm{proj}_S x$.
⇓
Find a subspace $S \subseteq \mathbb{R}^n$ s.t. $S = \mathrm{span}\{u_1, \ldots, u_k\}$ and $\|x - \hat{x}\|$ is minimal over the data, with $\hat{x} = \mathrm{proj}_S x$.
⇓
Find $k$ vectors $u_1, \ldots, u_k$ s.t. $N^{-1} \sum_{i=1}^{N} \|x_i - \hat{x}_i\|^2$ is minimal, with $\hat{x} = \mathrm{proj}_{\mathrm{span}\{u_1, \ldots, u_k\}} x$.

How do we find these vectors $u_1, \ldots, u_k$?

SLIDE 25

Principal component analysis

Autoencoder

Minimize $N^{-1} \sum_{i=1}^{N} \|x_i - \hat{x}_i\|^2$ s.t. $\hat{x} = \mathrm{proj}_{\mathrm{span}\{u_1, \ldots, u_k\}} x$.

[Figure: a linear autoencoder network]
- Input layer: $x[1], \ldots, x[5]$; encoding $h_i = W x_i$
- Hidden layer: $h[1], h[2], h[3]$; decoding $\hat{x}_i = U h_i$
- Output layer: $\hat{x}[1], \ldots, \hat{x}[5]$

$$\arg\min_{W \in \mathbb{R}^{k \times n},\; U \in \mathbb{R}^{n \times k}} \sum_{i=1}^{N} \|x_i - U W x_i\|^2$$
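A minimal sketch of this linear autoencoder trained by plain gradient descent (my own NumPy illustration, not the course's code); the toy data, dimensions, learning rate, and iteration count are all made up, and a closed-form solution via SVD would of course also work.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, N = 5, 2, 1000
# Toy data lying (approximately) on a 2-D subspace of R^5, then centered
X = rng.normal(size=(N, k)) @ rng.normal(size=(k, n)) + 0.05 * rng.normal(size=(N, n))
X -= X.mean(axis=0)

W = 0.1 * rng.normal(size=(k, n))     # encoder:  h_i = W x_i
U = 0.1 * rng.normal(size=(n, k))     # decoder:  x_hat_i = U h_i
lr = 0.01
for _ in range(2000):
    H = X @ W.T                       # hidden codes h_i
    R = H @ U.T - X                   # reconstruction residuals U W x_i - x_i
    grad_U = (2 / N) * R.T @ H        # gradient of the mean squared error w.r.t. U
    grad_W = (2 / N) * U.T @ R.T @ X  # gradient of the mean squared error w.r.t. W
    U -= lr * grad_U
    W -= lr * grad_W

# Average reconstruction error; with these settings it should end up small,
# since U W approaches a projection onto the top-2 subspace of the data
print(np.mean(np.sum((X - X @ W.T @ U.T) ** 2, axis=1)))
```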

SLIDE 26

Principal component analysis

Reconstruction error minimization

We only need to consider orthonormal vectors $u_1, \ldots, u_k$ (i.e., $\|u_i\| = 1$, $\langle u_i, u_j \rangle = 0$ for $i \neq j$) that form a basis for the subspace. We can then extend this set to form a basis $u_1, \ldots, u_n$ for the entire $\mathbb{R}^n$. Then, we can write
$$x = \sum_{j=1}^{n} \langle x, u_j \rangle u_j = \sum_{j=1}^{n} u_j u_j^T x \quad\text{and}\quad \mathrm{proj}_{\mathrm{span}\{u_1, \ldots, u_k\}} x = \sum_{j=1}^{k} u_j u_j^T x.$$

We now consider the reconstruction error $N^{-1} \sum_{i=1}^{N} \|x_i - \hat{x}_i\|^2$.

SLIDE 27

Principal component analysis

Reconstruction error minimization

First, notice that
$$x - \hat{x} = \sum_{j=1}^{n} u_j u_j^T x - \sum_{j=1}^{k} u_j u_j^T x = \sum_{j=k+1}^{n} u_j u_j^T x$$
⇓
$$\|x - \hat{x}\|^2 = \sum_{q=1}^{n} \Big( \sum_{j=k+1}^{n} u_j[q]\, u_j^T x \Big)^2 = \sum_{j=k+1}^{n} \sum_{j'=k+1}^{n} \Big( \sum_{q=1}^{n} u_j[q]\, u_{j'}[q] \Big) (u_j^T x)(u_{j'}^T x)$$
$$= \sum_{j=k+1}^{n} (u_j^T x)^2 = \sum_{j=1}^{n} (u_j^T x)^2 - \sum_{j=1}^{k} (u_j^T x)^2 = \|x\|^2 - \|\hat{x}\|^2$$
⇓
Minimizing the reconstruction error is equivalent to maximizing
$$N^{-1} \sum_{i=1}^{N} \|\hat{x}_i\|^2 = \sum_{j=1}^{k} N^{-1} \sum_{i=1}^{N} (u_j^T x_i)^2 = \sum_{j=1}^{k} \mathrm{variance}(u_j^T x)$$
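A quick numerical check of the identity $\|x - \hat{x}\|^2 = \|x\|^2 - \|\hat{x}\|^2$ for orthogonal projections (my own illustration; the toy data and the choice of orthonormal vectors are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, N = 5, 2, 1000
X = rng.normal(size=(N, n))
X -= X.mean(axis=0)                        # centered data, as assumed on the slides

Q, _ = np.linalg.qr(rng.normal(size=(n, n)))
U_k = Q[:, :k]                             # any orthonormal u_1, ..., u_k (columns)

Xhat = X @ U_k @ U_k.T                     # projections x_hat_i onto span{u_1, ..., u_k}
lhs = np.mean(np.sum((X - Xhat) ** 2, axis=1))                               # reconstruction error
rhs = np.mean(np.sum(X ** 2, axis=1)) - np.mean(np.sum(Xhat ** 2, axis=1))   # ||x||^2 - ||x_hat||^2
print(np.isclose(lhs, rhs))                # True
```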

SLIDE 28

Principal component analysis

Variance maximization

Find a direction that maximizes the variance in the projected data.

SLIDE 29

Principal component analysis

Variance maximization

Find a direction that maximizes the variance in the projected data.
⇓
Find a unit vector $u \in \mathbb{R}^n$ that maximizes $\mathrm{variance}(u^T x) = u^T \Sigma u$, where $\Sigma$ is the covariance matrix.

SLIDE 30

Principal component analysis

Variance maximization

Find a direction that maximizes the variance in the projected data.
⇓
Find a unit vector $u \in \mathbb{R}^n$ that maximizes:
$$\mathrm{variance}(u^T x) = N^{-1} \sum_{i=1}^{N} (u^T x_i)^2 = N^{-1} \sum_{i=1}^{N} (u^T x_i)(x_i^T u) = u^T \Big( N^{-1} \sum_{i=1}^{N} x_i x_i^T \Big) u = u^T \Sigma u$$
where $\Sigma$ is the covariance matrix.

SLIDE 31

Principal component analysis

Variance maximization

Find a direction that maximizes the variance in the projected data.
⇓
Find a unit vector $u \in \mathbb{R}^n$ that maximizes $\mathrm{variance}(u^T x) = u^T \Sigma u$, where $\Sigma$ is the covariance matrix.

SLIDE 32

Principal component analysis

Variance maximization

Find a direction that maximizes the variance in the projected data.
⇓
Find a unit vector $u \in \mathbb{R}^n$ that maximizes $\mathrm{variance}(u^T x) = u^T \Sigma u$, where $\Sigma$ is the covariance matrix.
⇓
Solve the maximization problem: maximize $u^T \Sigma u$ s.t. $\|u\| = 1$

SLIDE 33

Principal component analysis

Variance maximization

Solve the maximization problem: maximize $u^T \Sigma u$ s.t. $\|u\| = 1$

Apply the Lagrange multipliers method:
$$f(u, \alpha) = u^T \Sigma u + \alpha(1 - u^T u)$$
$$\nabla_u f(u, \alpha) = 2(\Sigma u - \alpha u)$$
$$\nabla_u f(u, \alpha) = 0 \;\Rightarrow\; \Sigma u = \alpha u$$
Therefore, $u$ is an eigenvector of $\Sigma$ with eigenvalue $\alpha$, which has to be the maximal eigenvalue in order to maximize $u^T \Sigma u = \alpha$.
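A small numerical check (with made-up anisotropic data) that the leading eigenvector of $\Sigma$ attains the maximal projected variance, as derived above.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5)) @ np.diag([3.0, 2.0, 1.0, 0.5, 0.1])   # anisotropic toy data
X -= X.mean(axis=0)
Sigma = X.T @ X / len(X)                       # covariance matrix (data already centered)

evals, evecs = np.linalg.eigh(Sigma)           # eigenvalues in ascending order
u_top = evecs[:, -1]                           # eigenvector of the largest eigenvalue

def proj_var(u):
    return np.var(X @ u)                       # variance of u^T x over the (centered) data

random_u = rng.normal(size=5)
random_u /= np.linalg.norm(random_u)
print(proj_var(u_top), evals[-1])              # the two values match: u^T Sigma u = largest eigenvalue
print(proj_var(random_u) <= proj_var(u_top))   # True: a random unit vector does no better
```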

SLIDE 34

Principal component analysis

Variance maximization

Similarly, a second direction is found via: maximize $u_2^T \Sigma u_2$ s.t. $\|u_2\| = 1$ and $\langle u_2, u_1 \rangle = 0$.

Apply the Lagrange multipliers method:
$$f(u_2, \alpha, \beta) = u_2^T \Sigma u_2 + \alpha(1 - u_2^T u_2) - \beta\, u_2^T u_1$$
$$\nabla_{u_2} f(u_2, \alpha, \beta) = 2(\Sigma u_2 - \alpha u_2) - \beta u_1$$
Taking the inner product of $\nabla_{u_2} f(u_2, \alpha, \beta) = 0$ with $u_1$ (and using $\Sigma u_1 = \lambda_1 u_1$ and $\langle u_2, u_1 \rangle = 0$) gives $\beta = 0$; then $\nabla_{u_2} f(u_2, \alpha, \beta) = 0 \Rightarrow \Sigma u_2 = \alpha u_2$.
Therefore, $u_2$ is an eigenvector of $\Sigma$ with the second largest eigenvalue.

SLIDE 35

Principal component analysis

Eigendecomposition and SVD

[Figure: the (features × features) covariance matrix computed from the (data points × features) data matrix]

$$\mathrm{cov}(q_1, q_2) = \sum_i x_i[q_1] \cdot x_i[q_2]$$
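The covariance matrix on this slide can be computed directly from the centered data matrix; a small NumPy check (with made-up data, normalizing by $N$ as on the earlier variance slide) against np.cov:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))             # rows: data points, columns: features
X -= X.mean(axis=0)                         # center each feature

# Entry (q1, q2) of the covariance matrix: sum_i x_i[q1] * x_i[q2], normalized by N
Sigma = X.T @ X / len(X)
print(np.allclose(Sigma, np.cov(X, rowvar=False, bias=True)))   # True
```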

SLIDE 36

Principal component analysis

Eigendecomposition and SVD

[Figure: the eigenvector equation for the covariance matrix]

$$\Sigma\, \varphi_i = \lambda_i\, \varphi_i \qquad (\lambda_i: \text{eigenvalue},\ \varphi_i: \text{eigenvector})$$

SLIDE 37

Principal component analysis

Eigendecomposition and SVD

[Figure: the eigenvector equation $\Sigma \varphi_i = \lambda_i \varphi_i$, and the factorization of the covariance matrix into singular vectors and singular values]

The spectral theorem applies to covariance matrices, realized via the SVD (Singular Value Decomposition):
$$\mathrm{cov}(q_1, q_2) = \sum_i \lambda_i\, \varphi_i[q_1]\, \varphi_i[q_2]$$

SLIDE 38

Principal component analysis

Singular value decomposition

Any matrix $M \in \mathbb{R}^{n \times k}$ can be decomposed to $U, S, V \leftarrow \mathrm{SVD}(M)$ as
$$M = U\, S\, V^T$$
where $U$ is $n \times n$ orthogonal, $S$ is $n \times k$ diagonal, and $V$ is $k \times k$ orthogonal.

- The singular values in $S$ are the square roots of the (nonnegative) eigenvalues of both $M M^T$ and $M^T M$.
- The singular vectors in (the columns of) $U$ are the eigenvectors of $M M^T$.
- The singular vectors in (the columns of) $V$ are the eigenvectors of $M^T M$.

Proof and more details about SVD can be found on Wikipedia.
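A quick NumPy check of these facts on a random matrix (my own illustration, with made-up dimensions):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(6, 4))

U, s, Vt = np.linalg.svd(M)              # full SVD: M = U @ S @ Vt, s holds the singular values

# Singular values are the square roots of the eigenvalues of M M^T and M^T M
eig_MMt = np.sort(np.linalg.eigvalsh(M @ M.T))[::-1]
eig_MtM = np.sort(np.linalg.eigvalsh(M.T @ M))[::-1]
print(np.allclose(s ** 2, eig_MMt[:4]))  # True (the remaining eigenvalues of M M^T are 0)
print(np.allclose(s ** 2, eig_MtM))      # True

# Reconstruct M from the factorization
S = np.zeros((6, 4))
np.fill_diagonal(S, s)
print(np.allclose(M, U @ S @ Vt))        # True
```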

SLIDE 39

Principal component analysis

Singular value decomposition

[Figure: decaying eigenvalues $\lambda_1, \lambda_2, \lambda_3, \lambda_4, \lambda_5$] A decaying covariance spectrum reveals (low) dimensionality.

SLIDE 40

Principal component analysis

Singular value decomposition

[Figure: SVD of the covariance matrix into its eigenvectors (the principal components) and eigenvalues]

The covariance matrix can be approximated by a truncated SVD.
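A sketch of approximating the covariance matrix with a truncated SVD (my own example; the toy data, its intrinsic dimension, and the truncation rank r are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
# Data with roughly 3 "real" dimensions plus a little noise, embedded in 10-D
X = rng.normal(size=(1000, 3)) @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(1000, 10))
X -= X.mean(axis=0)
Sigma = X.T @ X / len(X)

U, s, Vt = np.linalg.svd(Sigma)
r = 3
Sigma_r = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]    # keep only the top-r components
# Small relative error: the top-3 components capture almost all of the covariance
print(np.linalg.norm(Sigma - Sigma_r) / np.linalg.norm(Sigma))
```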

SLIDE 41

Principal component analysis

Trivial example

- Consider the simple case of data points that are all on the same high-dimensional (straight) line.
- The straight line is defined by a unit vector $\vec{\psi}$, $\|\vec{\psi}\| = 1$.
- Points on the line are defined by multiplying $\vec{\psi}$ by scalars, so the points can be formulated as $x_i = c_i \vec{\psi}$.
- Covariance:
$$\mathrm{cov}(t_1, t_2) = \sum_i x_i[t_1]\, x_i[t_2] = \sum_i c_i \vec{\psi}[t_1]\, c_i \vec{\psi}[t_2] = \Big( \sum_i c_i^2 \Big) \vec{\psi}[t_1]\, \vec{\psi}[t_2] = \|\vec{c}\|^2\, \vec{\psi}[t_1]\, \vec{\psi}[t_2]$$
  with $\vec{c} = (c_1, c_2, \ldots)$.

SLIDE 42

Principal component analysis

Trivial example

- Consider the simple case of data points that are all on the same high-dimensional (straight) line.
- The straight line is defined by a unit vector $\vec{\psi}$, $\|\vec{\psi}\| = 1$.
- Points on the line are defined by multiplying $\vec{\psi}$ by scalars, so the points can be formulated as $x_i = c_i \vec{\psi}$.
- Covariance:
$$\mathrm{cov}(t_1, t_2) = \sum_i x_i[t_1]\, x_i[t_2] = \sum_i c_i \vec{\psi}[t_1]\, c_i \vec{\psi}[t_2] = \Big( \sum_i c_i^2 \Big) \vec{\psi}[t_1]\, \vec{\psi}[t_2] = \|\vec{c}\|^2\, \vec{\psi}[t_1]\, \vec{\psi}[t_2]$$
  with $\vec{c} = (c_1, c_2, \ldots)$.

[Figure: rank-one factorization of the covariance matrix]
$$\text{Covariance matrix} = \vec{\psi}\, \|\vec{c}\|^2\, \vec{\psi}^T$$

SLIDE 43

Principal component analysis

Trivial example

- Consider the simple case of data points that are all on the same high-dimensional (straight) line.
- The straight line is defined by a unit vector $\vec{\psi}$, $\|\vec{\psi}\| = 1$.
- Points on the line are defined by multiplying $\vec{\psi}$ by scalars, so the points can be formulated as $x_i = c_i \vec{\psi}$.
- Covariance:
$$\mathrm{cov}(t_1, t_2) = \sum_i x_i[t_1]\, x_i[t_2] = \sum_i c_i \vec{\psi}[t_1]\, c_i \vec{\psi}[t_2] = \Big( \sum_i c_i^2 \Big) \vec{\psi}[t_1]\, \vec{\psi}[t_2] = \|\vec{c}\|^2\, \vec{\psi}[t_1]\, \vec{\psi}[t_2]$$
  with $\vec{c} = (c_1, c_2, \ldots)$.

The covariance matrix has a single (nonzero) eigenvalue $\|\vec{c}\|^2$ and a single eigenvector $\vec{\psi}$, which defines the principal direction of the data-point vectors.
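A numerical version of this trivial example (my own sketch; the ambient dimension and the number of points are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 10, 200
psi = rng.normal(size=n)
psi /= np.linalg.norm(psi)                              # unit vector defining the line
c = rng.normal(size=N)                                  # scalar coordinates along the line
X = np.outer(c, psi)                                    # points x_i = c_i * psi

Sigma = X.T @ X                                         # covariance (without 1/N, as on the slide)
evals, evecs = np.linalg.eigh(Sigma)                    # ascending eigenvalues
print(np.isclose(evals[-1], np.sum(c ** 2)))            # True: the single nonzero eigenvalue is ||c||^2
print(np.allclose(np.abs(evecs[:, -1]), np.abs(psi)))   # True: its eigenvector is +/- psi
print(np.allclose(evals[:-1], 0, atol=1e-8))            # True: all other eigenvalues vanish
```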

SLIDE 44

Principal component analysis

Trivial example

[Figure: data points lying on a line in 3D space]

SLIDE 45

Principal component analysis

Trivial example

[Figure: data points lying on a line in 3D space; the leading principal component is $\varphi_1 = \vec{\psi}$]

SLIDE 46

Principal component analysis

Trivial example

[Figure: a data cloud in 3D space]

SLIDE 47

Principal component analysis

Trivial example

[Figure: a data cloud in 3D space with principal-component arrows $\lambda_1 \varphi_1$ and $\lambda_2 \varphi_2$]

Length: eigenvalues; direction: eigenvectors. Principal components ⇒ max variance directions.

SLIDE 48

Principal component analysis

PCA algorithm:

1. Centering
2. Covariance
3. SVD (or eigendecomposition)
4. Projection

Alternative method: Multi-Dimensional Scaling (MDS), which preserves distances/inner products with a minimal set of coordinates.
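A compact NumPy sketch of these four steps (my own illustration, not the course's code; the function name and toy data are made up):

```python
import numpy as np

def pca(X, k):
    """PCA following the four steps above; a plain NumPy sketch."""
    Xc = X - X.mean(axis=0)                  # 1. centering
    Sigma = Xc.T @ Xc / len(Xc)              # 2. covariance matrix
    U, s, _ = np.linalg.svd(Sigma)           # 3. SVD of the (symmetric PSD) covariance
    components = U[:, :k]                    #    leading k principal components
    return Xc @ components, components, s    # 4. projection onto the top-k subspace

rng = np.random.default_rng(0)
# Toy data: 2-D structure embedded in 6-D, plus a constant offset removed by centering
X = rng.normal(size=(500, 2)) @ rng.normal(size=(2, 6)) + rng.normal(size=6)
Y, comps, spectrum = pca(X, k=2)
print(Y.shape, spectrum[:3])                 # (500, 2); the spectrum collapses after 2 values
```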

SLIDE 49

Summary

- Preprocessing steps are crucial in preparing data for meaningful analysis.
- Linear dimensionality reduction helps alleviate the curse of dimensionality.
- Principal Component Analysis (PCA) is a standard dimensionality reduction approach:
  - Based on projecting data on the leading eigenvectors of the covariance matrix.
  - Minimizes the reconstruction error of the projection.
  - Equivalently, finds a subspace that maximizes the captured variance.
  - In practice, SVD is used instead of eigendecomposition.
