SLIDE 1

Dimension Reduction Techniques

Presented by Jie (Jerry) Yu

SLIDE 2

Outline

  • Problem Modeling
  • Review of PCA and MDS
  • Isomap
  • Local Linear Embedding (LLE)
  • Charting

SLIDE 3

Background

  • Advances in data collection and storage capacities lead to information overload in many fields.
  • Traditional statistical methods often break down because of the increase in the number of variables in each observation, that is, the dimension of the data.
  • One of the most challenging problems is to reduce the dimension of the original data.

SLIDE 4

Problem Modeling

  • Original high-dimensional data: $X = (x_1, \ldots, x_p)^T$, a p-dimensional multivariate random vector.
  • Underlying/intrinsic low-dimensional data: $Y = (y_1, \ldots, y_k)^T$, a k-dimensional multivariate random vector with $k \ll p$.
  • The mean and covariance: $\mu = E(X) = (\mu_1, \ldots, \mu_p)^T$ and $\Sigma_x = E\{(X - \mu)(X - \mu)^T\}$.
  • Problems: 1) find the appropriate mapping that can best capture the most important features in low dimension, and 2) find the appropriate k that can best describe the data in low dimension.

SLIDE 5

State-of-the-art Techniques

Dimension reduction techniques can be categorized into two major classes: linear and non-linear.

  • Non-linear methods: Multidimensional Scaling (MDS), Principal Curves, Self-Organizing Maps (SOM), Neural Networks, Isomap, Local Linear Embedding (LLE) and Charting.
  • Linear methods: Principal Component Analysis (PCA), Factor Analysis, Projection Pursuit and Independent Component Analysis (ICA).

SLIDE 6

Principal Component Analysis (PCA)

  • In essence, PCA tries to reduce the data dimension by finding a few orthogonal linear combinations (Principal Components, PCs) of the original variables with the largest variance.
  • Denote a linear projection as $W = [w_1, \ldots, w_k]$; thus $y_i = w_i^T X$, and PCA solves

$$W = \arg\max_W \sum_{i=1}^{k} \operatorname{var}\{y_i\} = \arg\max_W \sum_{i=1}^{k} \operatorname{var}\{w_i^T X\}.$$

  • It can also be further rewritten as

$$W = \arg\max_W \, (W^T \Sigma_x W).$$

SLIDE 7

PCA

  • $\Sigma_x$ can be decomposed by eigendecomposition as

$$\Sigma_x = U \Lambda U^T,$$

where $\Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_p)$ is the diagonal matrix of eigenvalues sorted in descending order and $U$ is the orthogonal matrix containing the corresponding eigenvectors.
  • It can be proven that the optimal projection matrix $W$ consists of the first k eigenvectors in $U$ (a small sketch follows below).
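A minimal NumPy sketch of this procedure, assuming one observation per row of the data matrix (the function and variable names are my own choices, not from the slides):

```python
import numpy as np

def pca(X, k):
    """Project the (n, p) data matrix X onto its first k principal components."""
    mu = X.mean(axis=0)                     # sample mean, estimating E(X)
    Xc = X - mu                             # centre the data
    Sigma = Xc.T @ Xc / (len(X) - 1)        # sample covariance Sigma_x
    eigvals, U = np.linalg.eigh(Sigma)      # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]       # re-sort so the largest variances come first
    W = U[:, order[:k]]                     # first k eigenvectors = optimal projection W
    return Xc @ W, W                        # k-dimensional scores Y and the projection
```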

SLIDE 8

PCA

  • Property 1: The subspace spanned by the first k eigenvectors has the smallest mean squared deviation from X among all subspaces of dimension k.
  • Property 2: The total variance is equal to the sum of the eigenvalues of the original covariance matrix.
SLIDE 9

Multidimensional Scaling (MDS)

  • Multidimensional Scaling (MDS) produces a low-dimensional representation of the data such that the distances in the new space reflect the proximities of the data in the original space.
  • Denote the symmetric proximity matrix as $\Delta = \{\delta_{ij},\ i, j = 1, \ldots, n\}$.
  • MDS tries to find the mapping such that the distances $d_{ij} = d(y_i, y_j)$ in the lower-dimensional space are as close as possible to a function $f(\delta_{ij})$ of the corresponding proximities.

SLIDE 10

MDS

Mapping Cost function: The scale_factor are often based on

  • r .

Problem: Find optimal mapping that minimize the

cost function

If the proximity is the distance measure ,

  • r , we call it metric-MDS.

If the proximity uses ordinal information of the

data, it is called non-metric-MDS.

2 ,

) ( ij

j i f δ

factor scale d f

ij ij j i

_ ] ) ( [

2 ,

− δ

2 , ij j i d

2

L

1

L
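The metric case with Euclidean-like proximities has a well-known closed-form solution, classical (Torgerson) MDS, which is also the building block Isomap reuses later. A minimal NumPy sketch (the choice of the classical variant and all names are mine):

```python
import numpy as np

def classical_mds(D, k):
    """Classical metric MDS: embed an (n, n) matrix of pairwise distances D into k dimensions."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n            # centring matrix
    B = -0.5 * J @ (D ** 2) @ J                    # double-centred squared distances (inner products)
    eigvals, V = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1]              # largest eigenvalues first
    lam = np.clip(eigvals[order[:k]], 0.0, None)   # guard against small negative eigenvalues
    return V[:, order[:k]] * np.sqrt(lam)          # coordinates whose distances approximate D
```

Non-metric MDS, in contrast, is usually solved iteratively, because only the rank order of the proximities is used.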

SLIDE 11

Isomap

  • Disadvantages of PCA and MDS: 1) both methods often fail to discover complicated nonlinear structure, and 2) both have difficulties in detecting the intrinsic dimension of the data.
  • Goal: combine the major algorithmic features of PCA and MDS (computational efficiency, global optimality and asymptotic convergence guarantees) with the flexibility to learn nonlinear manifolds.
  • Idea: introduce the geodesic distance, which can better describe the relations between data points.

SLIDE 12

Isomap

Illustration: points that are far apart on the underlying manifold, as measured by their geodesic distance, may appear close together in the high-dimensional input space.

The Swiss Roll data set

SLIDE 13

Isomap

  • In this approach the intrinsic geometry of the data is preserved by capturing the manifold (geodesic) distances between all data points.
  • For neighboring points (within ε, or among the k nearest), the Euclidean distance provides a good approximation to the geodesic distance.
  • For faraway points, the geodesic distance can be approximated by adding up a sequence of "short hops" between neighboring points (Floyd's shortest-path algorithm).

SLIDE 14

Isomap Algorithm

  • Step 1: determine which points are neighbors on the manifold M, based on the input-space distances $d_X(i, j)$.
  • Step 2: estimate the geodesic distances $d_G(i, j)$ between all pairs of points on the manifold M by computing their shortest-path distances in the neighborhood graph.
  • Step 3: apply MDS or PCA to the graph distance matrix $D_G = \{d_G(i, j)\}$ (a sketch of the full pipeline follows below).
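A compact sketch of the three steps, assuming the neighborhood graph is connected and reusing the classical_mds helper from the MDS slide (the neighborhood size and all names are illustrative):

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

def isomap(X, n_neighbors=7, k=2):
    """Isomap on an (n, p) data matrix X, returning an (n, k) embedding."""
    # Pairwise Euclidean distances d_X(i, j) in the input space.
    diff = X[:, None, :] - X[None, :, :]
    D_X = np.sqrt((diff ** 2).sum(axis=-1))

    # Step 1: keep only edges from each point to its n_neighbors nearest points.
    graph = np.full_like(D_X, np.inf)                        # inf marks a non-edge
    nearest = np.argsort(D_X, axis=1)[:, 1:n_neighbors + 1]
    for i, nbrs in enumerate(nearest):
        graph[i, nbrs] = D_X[i, nbrs]

    # Step 2: geodesic distances d_G(i, j) as shortest paths in the neighborhood graph
    # (Floyd-Warshall, as mentioned on the slide).
    D_G = shortest_path(graph, method="FW", directed=False)

    # Step 3: apply classical MDS to the graph distance matrix D_G.
    return classical_mds(D_G, k)
```

For large data sets, running Dijkstra's algorithm from every node (method="D") is usually faster than Floyd-Warshall.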

SLIDE 15

The Swiss Roll Problem

SLIDE 16

Detect Intrinsic Dimension

  • The intrinsic dimensionality of the data can be estimated from the rate of decrease of the residual variance as the dimensionality of Y is increased.
  • The residual variance is defined as

$$1 - R(D_M, D_y)^2,$$

where $R(\cdot)$ is the linear correlation coefficient, $D_M$ is the estimated (geodesic) distance matrix in the original space, and $D_y$ is the distance matrix in the projected space (see the helper below).
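A small helper for this estimate, assuming D_M and D_y are the full pairwise-distance matrices described above (the function name is mine):

```python
import numpy as np

def residual_variance(D_M, D_y):
    """Residual variance 1 - R(D_M, D_y)^2 between geodesic and embedded distances."""
    r = np.corrcoef(D_M.ravel(), D_y.ravel())[0, 1]   # linear correlation coefficient R
    return 1.0 - r ** 2
```

Computing this value for embeddings of increasing dimensionality and looking for the elbow where the curve flattens gives the estimate of the intrinsic dimension.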

SLIDE 17

Theoretical Analysis

  • The main contribution of Isomap is to substitute the Euclidean distance with the geodesic distance, which may better capture the nonlinear structure of a manifold.
  • Given sufficient data, Isomap is guaranteed asymptotically to recover the true dimensionality and geometric structure of a non-linear manifold.

SLIDE 18

Experiments

SLIDE 19

Experiments

SLIDE 20

Experiment 1: Facial Images

SLIDE 21

Experiment 2: The hand-written 2’s

SLIDE 22

Locally Linear Embedding (LLE)

  • MDS and its variant Isomap try to preserve pairwise distances between data points.
  • Locally Linear Embedding (LLE) is an unsupervised learning algorithm that recovers global nonlinear structure from locally linear fits.
  • Assumption: each data point and its neighbors lie on or close to a locally linear patch of the manifold.

SLIDE 23

Local Linearity

SLIDE 24

LLE

  • Idea: the local geometry is characterized by linear coefficients that reconstruct each data point from its neighbors.
  • The reconstruction cost is defined as

$$\varepsilon(W) = \sum_i \Big| x_i - \sum_j w_{ij}\, x_j \Big|^2.$$

  • Two constraints (see the sketch after this list): 1) each data point is reconstructed only from its neighbors, not from faraway points, and 2) the rows of the weight matrix sum to one.
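A minimal NumPy sketch of this step: for each point, the weights over its nearest neighbors are obtained by solving a small linear system built from the local Gram matrix and then normalizing them to sum to one (the regularization term and all names are my own additions):

```python
import numpy as np

def lle_weights(X, n_neighbors=10, reg=1e-3):
    """Reconstruction weights W minimising sum_i |x_i - sum_j w_ij x_j|^2."""
    n = X.shape[0]
    W = np.zeros((n, n))
    diff = X[:, None, :] - X[None, :, :]
    D = (diff ** 2).sum(axis=-1)                      # squared pairwise distances
    for i in range(n):
        nbrs = np.argsort(D[i])[1:n_neighbors + 1]    # constraint 1: neighbours only
        Z = X[nbrs] - X[i]                            # neighbours shifted so x_i is the origin
        C = Z @ Z.T                                   # local Gram matrix
        C += reg * np.trace(C) * np.eye(len(nbrs))    # regularise in case C is singular
        w = np.linalg.solve(C, np.ones(len(nbrs)))    # solve C w = 1
        W[i, nbrs] = w / w.sum()                      # constraint 2: row sums to one
    return W
```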

SLIDE 25

Linear reconstruction

SLIDE 26

LLE

  • The reconstruction weights for any data point are invariant to rotations, rescalings and translations of that point and its neighbors.
  • Although the global manifold may be nonlinear, for each locally linear neighborhood there exists a linear mapping (consisting of a translation, rotation and rescaling) that projects the neighborhood to the low-dimensional space.
  • The same weights that reconstruct the i-th data point in the original D-dimensional space should also reconstruct it on the embedded d-dimensional manifold.

SLIDE 27

LLE

  • W is solved for by minimizing the reconstruction cost function in the original space.
  • To find the optimal global mapping to the lower-dimensional space, define an embedding cost function (see the sketch below):

$$\Phi(Y) = \sum_i \Big| y_i - \sum_j w_{ij}\, y_j \Big|^2.$$

  • Because W is now fixed, the problem turns into finding the optimal projection (X → Y) that minimizes the embedding cost.
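For fixed W this minimization has a standard closed-form solution (the one used in the original LLE paper, although the slide only states the cost): the optimal coordinates are the bottom eigenvectors of $M = (I - W)^T (I - W)$, discarding the constant eigenvector. A sketch with my own names:

```python
import numpy as np

def lle_embedding(W, k):
    """Embedding Y minimising Phi(Y) = sum_i |y_i - sum_j w_ij y_j|^2 for a fixed W."""
    n = W.shape[0]
    IW = np.eye(n) - W
    M = IW.T @ IW                      # quadratic form of the embedding cost
    eigvals, V = np.linalg.eigh(M)     # eigenvalues in ascending order
    return V[:, 1:k + 1]               # skip the zero eigenvector (the constant vector)
```

Because a (k+1)-dimensional embedding only appends one more eigenvector, LLE does not have to be rerun for higher-dimensional embeddings (point 3 on the next slide).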

SLIDE 28

Theoretical analysis:

  • 1) There is only one free parameter, K, and the transformation is deterministic.
  • 2) The embedding is guaranteed to converge to the global optimum given sufficient data points.
  • 3) LLE does not have to be rerun to compute higher-dimensional embeddings.
  • 4) The intrinsic dimension d can be estimated by analyzing a reciprocal cost function that reconstructs X from Y.

SLIDE 29

Experiment 1: Facial Images

SLIDE 30

Experiment 2: Words in semantic space

SLIDE 31

Experiment 2: Arranging words in semantic space

SLIDE 32

Charting

  • Charting is the problem of assigning a low-dimensional coordinate system to a set of high-dimensional sample points.
  • Assume that the data lie on or near a low-dimensional manifold in the sample space, and that there exists a one-to-one smooth nonlinear transform between the manifold and a low-dimensional vector space.
  • Goal: find a mapping, expressed as a kernel-based mixture of linear projections, that minimizes information loss about the density and relative locations of the sample points.

SLIDE 33

Local Linear Scale and Intrinsic Dimensionality

  • Local linear scale (r): at some scale r, the mapping from a neighborhood on M (the original space) to the lower-dimensional space is linear.
  • Consider a ball of radius r centered on a data point and containing n(r) data points. The count n(r) grows as $r^d$ only at the locally linear scale.

SLIDE 34

Local Linear Scale and Intrinsic Dimensionality

  • There are two other factors that may affect the data distribution at different scales: isotropic noise (at smaller scales) and embedding curvature (at larger scales).
  • Define $c(r) = \log r / \log n(r)$. At the noise scale, $c(r) = 1/D < 1/d$; at the locally linear scale, $c(r) = 1/d$; at the curvature scale, $c(r) < 1/d$.

SLIDE 35

Local Linear Scale and Intrinsic Dimensionality

  • Gradually increase r; when c(r) first peaks (at 1/d), we obtain one observation of both r and d (a simplified sketch follows below).
  • Averaging over all data points, we can estimate the global r and d.
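A simplified, per-point sketch of this recipe (my own simplification; it assumes the radius grid is scaled so that r > 1 and that every ball contains at least two points):

```python
import numpy as np

def scale_curve(X, x0, radii):
    """c(r) = log r / log n(r) around one point x0, over a grid of radii."""
    dists = np.sqrt(((X - x0) ** 2).sum(axis=1))
    n_r = np.array([(dists <= r).sum() for r in radii])   # n(r): points inside radius r
    return np.log(radii) / np.log(np.maximum(n_r, 2))     # guard against n(r) < 2

# The radius where c(r) first peaks gives one observation of the locally linear
# scale r, and 1 / c(r) at that peak gives one observation of the intrinsic
# dimension d; averaging these over all data points estimates the global r and d.
```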

SLIDE 36

Charting the data

  • Model: each chart is modeled as a Gaussian in a Gaussian Mixture Model (GMM).
  • Goal: find a soft partition of the data into locally linear, low-dimensional neighborhoods.
  • Problem: one data point may belong to several neighboring charts, so the estimation of each local Gaussian should take into account the information from neighboring charts.

SLIDE 37

Charting the data

  • Co-locality is defined to estimate how close two charts are:

$$m_j(\mu_i) = N(\mu_j;\ \mu_i, \sigma^2).$$

  • Each data point is associated with a Gaussian neighborhood, with $\mu_i = x_i$.
  • The covariance of each neighborhood is estimated by (see the simplified sketch below)

$$\Sigma_i = \frac{\sum_j m_j(\mu_i) \big[ (x_j - \mu_i)(x_j - \mu_i)^T + (\mu_j - \mu_i)(\mu_j - \mu_i)^T \big]}{\sum_j m_j(\mu_i)}.$$

  • This step brings non-local information about the manifold's shape into the local description of each neighborhood, ensuring that adjoining neighborhoods have similar covariances and small angles between their respective subspaces.
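A simplified sketch of the co-locality weighting, keeping only the weighted scatter of the neighbors; sigma is the locally linear scale estimated in the previous step, and all names are mine:

```python
import numpy as np

def neighborhood_covariances(X, sigma):
    """Per-point covariances pooled with Gaussian co-locality weights m_j(mu_i)."""
    n, p = X.shape
    mu = X                                                      # mu_i = x_i, as on the slide
    sq = ((mu[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)
    m = np.exp(-sq / (2.0 * sigma ** 2))                        # m_j(mu_i), up to a constant factor
    covs = np.zeros((n, p, p))
    for i in range(n):
        d = X - mu[i]                                           # (x_j - mu_i)
        covs[i] = (m[i][:, None, None] * d[:, :, None] * d[:, None, :]).sum(axis=0) / m[i].sum()
    return covs
```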

SLIDE 38

Connecting the charts

  • To minimize the information loss in the connection step, the projections of the data points into the local subspace associated with each neighborhood should have 1) minimal loss of local variance and 2) maximal agreement between the projections of nearby points into nearby neighborhoods.
  • The first criterion is met by applying PCA within each chart to obtain a local low-dimensional coordinate system. Each original data point then has several copies (projected low-dimensional samples), one in each local coordinate system.
  • The second criterion is met by projecting each local coordinate system into a global coordinate system with minimal disagreement between the projected copies of each data point in the global space.

SLIDE 39

Connecting the charts

  • Each data point i is projected into the local coordinate system of each neighboring chart j:

$$u_{ji} = l_j(x_i).$$

  • Each copy of a data point in a local coordinate system is finally projected to the global coordinate system:

$$y_i = \sum_j p_j(x_i)\, G_j\, u_{ji},$$

where $G_j$ is the projection from the j-th chart to the global space.
  • Minimizing the disagreement is modeled as a weighted least-squared-distance problem:

$$[G_1, \ldots, G_K] \equiv \arg\min_G \sum_{i,j,k} p_j(x_i)\, p_k(x_i) \left\| G_j \begin{bmatrix} u_{ji} \\ 1 \end{bmatrix} - G_k \begin{bmatrix} u_{ki} \\ 1 \end{bmatrix} \right\|^2.$$

SLIDE 40

Experiment 1: The Twisted Curl Problem

SLIDE 41

Experiment 2: The Trefoil Problem

SLIDE 42

Experiment 3: The Facial Image Modeling