

SLIDE 1

Principal Components Analysis

Sargur Srihari, University at Buffalo


SLIDE 2

Topics

  • Projection Pursuit Methods
  • Principal Components
  • Examples of using PCA
  • Graphical use of PCA
  • Multidimensional Scaling


SLIDE 3

Motivation

  • Scatterplots
    – Good for two variables at a time
    – Disadvantage: may miss complicated relationships
  • PCA is a method to transform the data into new variables
  • Projections along different directions can be used to detect relationships
    – Say, along the direction defined by 2x1 + 3x2 + x3 = 0


SLIDE 4

Projection pursuit methods

  • Allow searching for “interesting” directions
  • Interesting means maximum variability
  • Data in 2-d space projected to 1-d:

[Figure: 2-d data with axes x1 and x2 projected onto the 1-d direction defined by 2x1 + 3x2 = 0; the task is to find such a projection direction.]

SLIDE 5

Principal Components

  • Find linear combinations that maximize variance subject to being uncorrelated with those already selected
  • Hopefully there are few such linear combinations -- they are known as principal components
  • The task is to find a k-dimensional projection, where 0 < k < d-1


SLIDE 6

Data Matrix Definition

X is an n x d data matrix: n cases (rows) and d variables (columns). Each case x(i) is a d x 1 column vector, so each row of the matrix is of the form x(i)^T. Assume X is mean-centered, so that the mean of each variable has been subtracted from that variable.
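
As a concrete illustration, here is a minimal sketch of mean-centering a data matrix, assuming Python with NumPy (the toy values are arbitrary):

```python
import numpy as np

# Toy data matrix: n = 5 cases (rows), d = 3 variables (columns)
X_raw = np.array([[2.0, 4.0, 1.0],
                  [3.0, 5.0, 0.0],
                  [4.0, 6.0, 2.0],
                  [5.0, 7.0, 1.0],
                  [6.0, 8.0, 3.0]])

# Mean-center: subtract each variable's (column) mean from that variable
X = X_raw - X_raw.mean(axis=0)

print(X.mean(axis=0))   # each column mean is now (numerically) zero
```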


SLIDE 7

Projection Definition

Let a be a p x 1 column vector of projection weights that result in the largest variance when the data X are projected along a.

The projection of a data vector x = (x1, ..., xp)^T onto a = (a1, ..., ap)^T is the linear combination

    a^T x = Σ_{j=1}^{p} a_j x_j

The projected values of all the data vectors in X onto a are given by Xa, an n x 1 column vector -- a set of scalar values corresponding to the n projected points. Since X is n x p and a is p x 1, Xa is n x 1.
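
A minimal sketch of this projection in NumPy (the data and the direction a are arbitrary examples, not taken from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))        # n = 20 cases, p = 3 variables
X = X - X.mean(axis=0)              # mean-centered, as assumed throughout

a = np.array([1.0, 2.0, 0.5])
a = a / np.linalg.norm(a)           # normalize the projection weights

projected = X @ a                   # Xa: n x 1 vector of projected values
print(projected.shape)              # (20,)
```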


SLIDE 8

Variance along Projection

The variance along a is

    σ_a^2 = (Xa)^T (Xa) = a^T X^T X a = a^T V a

where V = X^T X is the p × p covariance matrix of the data (since X has zero mean). Thus the variance is a function of both the projection direction a and the covariance matrix V.
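
A small numerical check of this identity, assuming NumPy and arbitrary example data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
X = X - X.mean(axis=0)              # zero-mean data
a = np.array([1.0, 2.0, 0.5])
a = a / np.linalg.norm(a)

V = X.T @ X                         # covariance matrix as defined on this slide
var_direct  = (X @ a) @ (X @ a)     # (Xa)^T (Xa)
var_via_cov = a @ V @ a             # a^T V a
print(np.isclose(var_direct, var_via_cov))   # True
```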


SLIDE 9

Maximization of Variance

Maximizing the variance along a is not well-defined, since we can increase it without limit by increasing the size of the components of a.

Impose a normalization constraint on the a vectors such that a^T a = 1. The optimization problem is then to maximize

    u = a^T V a − λ(a^T a − 1)

where λ is a Lagrange multiplier. Differentiating with respect to a yields

    ∂u/∂a = 2Va − 2λa = 0,  which reduces to  (V − λI)a = 0

Characteristic Equation!

SLIDE 10

What is the Characteristic Equation?

Given a d x d matrix V, a very important class of linear equations is of the form

    Vx = λx    (V is d x d, x and λx are d x 1)

which can be rewritten as

    (V − λI)x = 0

If V is real and symmetric there are d possible solution vectors, called eigenvectors, e1, ..., ed, with associated eigenvalues.


SLIDE 11

Principal Component is obtained from the Covariance Matrix

If the matrix V is the covariance matrix, its characteristic equation is

    (V − λI)a = 0

The roots are the eigenvalues, and the corresponding eigenvectors are the principal components. The first principal component is the eigenvector associated with the largest eigenvalue of V.
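
A minimal sketch of computing principal components this way, assuming NumPy (np.linalg.eigh is used because V is real and symmetric; the data are an arbitrary example):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X = X - X.mean(axis=0)                 # mean-centered data matrix

V = X.T @ X                            # covariance matrix, V = X^T X
eigvals, eigvecs = np.linalg.eigh(V)   # eigenvalues returned in ascending order

order = np.argsort(eigvals)[::-1]      # re-sort: largest eigenvalue first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

first_pc = eigvecs[:, 0]               # eigenvector with the largest eigenvalue
print(eigvals)
print(first_pc)
```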


SLIDE 12

Other Principal Components

  • The second principal component is in the direction orthogonal to the first
  • It has the second largest eigenvalue, etc.

[Figure: data in the (X1, X2) plane, showing the first principal component e1 along the direction of greatest variance and the second principal component e2 orthogonal to it.]


SLIDE 13

Projection onto k Eigenvectors

  • The variance of the data projected onto the first k eigenvectors e1, ..., ek is the sum of the corresponding eigenvalues, Σ_{j=1}^{k} λ_j
  • The squared error (as a fraction of the total variance) in approximating the true data matrix X using only the first k eigenvectors is

        Σ_{j=k+1}^{d} λ_j / Σ_{l=1}^{d} λ_l

  • How to choose k?
    – Increase k until the squared error is less than a threshold
    – Usually 5-10 principal components capture 90% of the variance in the data
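
A sketch of choosing k by this rule, assuming NumPy and eigenvalues sorted in decreasing order (the example values are the CPU-data eigenvalues quoted on the next slide, used here purely for illustration):

```python
import numpy as np

eigvals = np.array([63.26, 10.70, 10.30, 6.68, 5.23, 2.18, 1.31, 0.34])

explained = np.cumsum(eigvals) / np.sum(eigvals)   # cumulative fraction of variance
k = int(np.searchsorted(explained, 0.90) + 1)      # smallest k reaching 90%
print(k, explained[k - 1])                         # e.g. k = 4 here
```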


SLIDE 14

Scree Plot

A scree plot shows the amount of variance explained by each consecutive eigenvalue.

Example of PCA: CPU data

  • Eight eigenvalues: 63.26, 10.70, 10.30, 6.68, 5.23, 2.18, 1.31, 0.34
  • Weights put by the first component e1 on the eight variables: 0.199, 0.365, 0.399, 0.336, 0.331, 0.298, 0.421, 0.423

[Figures: a scatterplot matrix of the CPU data, an example eigenvector, and a scree plot of the eigenvalues of the correlation matrix (percent variance explained versus eigenvalue number).]
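
A minimal sketch of drawing such a scree plot, assuming matplotlib (the heights plotted are simply the eigenvalues listed above, expressed as percentages of their total):

```python
import matplotlib.pyplot as plt

eigenvalues = [63.26, 10.70, 10.30, 6.68, 5.23, 2.18, 1.31, 0.34]
percent = [100 * v / sum(eigenvalues) for v in eigenvalues]

plt.plot(range(1, len(eigenvalues) + 1), percent, marker='o')
plt.xlabel('Eigenvalue number')
plt.ylabel('Percent variance explained')
plt.title('Scree plot (CPU data)')
plt.show()
```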


SLIDE 15

PCA using correlation matrix and covariance matrix

Proportions of variation attributable to the different components: 96.02, 3.93, 0.04, 0.01

[Figures: two scree plots (percent variance explained versus eigenvalue number), one computed from the correlation matrix and one from the covariance matrix.]


SLIDE 16

Graphical Use of PCA

Projection onto the first two principal components of six-dimensional data:

  • 17 pills (data points)
  • The six values are the times at which a specified proportion of the pill has dissolved: 10%, 30%, 50%, 70%, 75%, 90%
  • Pill 3 is very different

[Figure: the 17 pills plotted against Principal Component 1 and Principal Component 2.]


SLIDE 17

Computational Issue: Scaling with Dimensionality

  • Complexity is O(nd^2 + d^3):
    – O(nd^2) to calculate V
    – O(d^3) to solve the eigenvalue equations for the d x d matrix
  • Can be applied to large numbers of records n, but does not scale well with the dimensionality d
  • Also, appropriate scaling of the variables has to be done


SLIDE 18

Multidimensional Scaling

  • Using PCA to project onto a plane is effective only if the data lie on a 2-d subspace
  • Intrinsic dimensionality
    – Data may lie on a string or surface in d-space
    – E.g., when a digit image is translated and rotated, the images in pixel space lie on a 3-dimensional manifold (defined by location and orientation)


SLIDE 19

Goal of Multidimensional Scaling

  • Detect underlying structure
  • Represent the data in a lower-dimensional space so that distances are preserved
    – Distances between data points are mapped to a reduced space
  • Typically displayed on a 2-d plot
  • Begin with the distances and then compute the plot
    – E.g., psychometrics and market research, where similarities between objects are given by subjects


SLIDE 20

Defining the B Matrix

  • For an n x d data matrix X we could compute the n x n matrix B = XX^T
  • We will see (next slide) that the squared Euclidean distance between the ith and jth objects is given by

        d_ij^2 = b_ii + b_jj − 2b_ij

  • The matrices XX^T and X^T X are both meaningful
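
A small numerical check of this identity, assuming NumPy and an arbitrary data matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))                   # n = 6 objects, d = 4 variables
B = X @ X.T                                   # n x n matrix B = X X^T

i, j = 0, 2                                   # any pair of objects
d2_direct = np.sum((X[i] - X[j]) ** 2)        # squared Euclidean distance
d2_from_B = B[i, i] + B[j, j] - 2 * B[i, j]   # b_ii + b_jj - 2 b_ij
print(np.isclose(d2_direct, d2_from_B))       # True
```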


SLIDE 21

X^T X versus XX^T

  • If X is n x d
  • X^T X is (d x n)(n x d) = d x d: the covariance matrix
  • B = XX^T is (n x d)(d x n) = n x n: the B matrix contains distance information,

        d_ij^2 = b_ii + b_jj − 2b_ij

SLIDE 22

Factorizing the B matrix

  • Given a matrix of distances D
    – derived from the original data by computing the n(n-1)/2 pairwise distances
    – compute the elements of B by inverting d_ij^2 = b_ii + b_jj − 2b_ij
  • Factorize B
    – in terms of eigenvectors, to yield the coordinates of the points
    – the two largest eigenvalues would give a 2-d representation


SLIDE 23

Inverting distances to get B

Start from d_ij^2 = b_ii + b_jj − 2b_ij:

  • Summing over i gives b_jj (in terms of tr(B))
  • Summing over j gives b_ii (in terms of tr(B))
  • Summing over both i and j gives tr(B)

Thus b_ij can be expressed as a function of the d_ij^2. The method is known as the Principal Coordinates Method.
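
A minimal sketch of the resulting procedure (classical MDS / principal coordinates), assuming NumPy; the double-centering line is one standard way of expressing b_ij in terms of d_ij^2:

```python
import numpy as np

def principal_coordinates(D, k=2):
    """Classical MDS: k-dimensional coordinates from an n x n distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                # recover B from the squared distances
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1]          # largest eigenvalues first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Coordinates: top-k eigenvectors scaled by the square roots of their eigenvalues
    return eigvecs[:, :k] * np.sqrt(np.maximum(eigvals[:k], 0.0))

# Example: recover a 2-d configuration from pairwise Euclidean distances
rng = np.random.default_rng(0)
points = rng.normal(size=(10, 2))
D = np.sqrt(((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1))
print(principal_coordinates(D, k=2).shape)     # (10, 2)
```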


SLIDE 24

Criterion for Multidimensional Scaling

  • Find a projection into two dimensions that minimizes the discrepancy between the observed distances between points i and j in d-space and the distances between the corresponding points in the two-dimensional space
  • This criterion is invariant with respect to rotations and translations; however, it is not invariant to scaling
  • A better criterion is a normalized version of this quantity, called the stress
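
The slide's own expressions are not reproduced in the extracted text; as an assumption, one common form of the stress, writing δ_ij for the observed distances in d-space and d_ij for the distances in the two-dimensional representation, is

    stress = sqrt( Σ_{i<j} (d_ij − δ_ij)² / Σ_{i<j} d_ij² )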


SLIDE 25

Algorithm for Multidimensional Scaling

  • Two-stage procedure
  • Assume that d_ij = a + b·δ_ij + e_ij, where the δ_ij are the original dissimilarities
  • Stage 1: regression of the 2-d distances on the given dissimilarities, yielding estimates for a and b
  • Stage 2: find new values of d_ij that minimize the stress
  • Repeat until convergence
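
In practice one would usually rely on an existing implementation. A minimal usage sketch with scikit-learn (an assumption that this library is acceptable here; note that its MDS uses the SMACOF algorithm rather than exactly the two-stage procedure above, and D is a precomputed symmetric dissimilarity matrix):

```python
import numpy as np
from sklearn.manifold import MDS

# Toy 3 x 3 dissimilarity matrix
D = np.array([[0.0, 1.0, 3.0],
              [1.0, 0.0, 2.5],
              [3.0, 2.5, 0.0]])

mds = MDS(n_components=2, dissimilarity='precomputed', random_state=0)
coords = mds.fit_transform(D)      # 2-d coordinates chosen to minimize the stress
print(coords)
```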


SLIDE 26

Multidimensional Scaling Plot: Dialect Similarities

  • Each pair of villages is rated by the percentage of 60 items for which the villagers used different words
  • We are able to visualize 625 distances intuitively

[Figure: MDS plot of the villages, labeled with the numerical codes of the villages and their counties.]


SLIDE 27

Variations of Multidimensional Scaling

  • The methods above are called metric methods
  • Sometimes precise similarities may not be known -- only rank orderings
  • We also may not be able to assume a particular form of the relationship between d_ij and δ_ij
    – This requires a two-stage approach
    – Replace the simple linear regression with monotonic regression


SLIDE 28

Multidimensional Scaling: Disadvantages

  • When there are too many data points, the structure becomes obscured
  • These are highly sophisticated transformations of the data (compared to scatterplots and PCA)
    – Possibility of introducing artifacts
    – Dissimilarities can be determined more accurately when objects are similar than when they are very dissimilar
      • Horseshoe effect: e.g., when objects manufactured within a short time span differ greatly from objects separated by a greater time gap
  • Biplots show both the data points and the variables
