Dimension Reduction Techniques
Presented by Jie (Jerry) Yu
Outline
- Problem Modeling
- Review of PCA and MDS
- Isomap
- Locally Linear Embedding (LLE)
- Charting
Background
Advances in data collection and storage capacities have led to information overload in many fields.
Traditional statistical methods often break down because of the increase in the number of variables in each observation, that is, the dimension of the data.
One of the most challenging problems is to reduce the dimension of the original data.
Problem Modeling
- Original high-dimensional data: X = (x_1, …, x_p)^T, a p-dimensional multivariate random vector.
- Underlying/intrinsic low-dimensional data: Y = (y_1, …, y_k)^T, a k-dimensional multivariate random vector with k ≪ p.
- The mean and covariance: µ = E(X) = (µ_1, …, µ_p)^T and Σ_x = E{(X − µ)(X − µ)^T}.
- Problems: 1) find the appropriate mapping that best captures the most important features in low dimension, and 2) find the appropriate k that best describes the data in low dimension.
State-of-the-art Techniques
Dimension reduction techniques can be categorized into two major classes: linear and non-linear.
Non-linear methods: Multidimensional Scaling (MDS), Principal Curves, Self-Organizing Maps (SOM), Neural Networks, Isomap, Locally Linear Embedding (LLE), and Charting.
Linear methods: Principal Component Analysis (PCA), Factor Analysis, Projection Pursuit, and Independent Component Analysis (ICA).
Principal Component Analysis (PCA)
In essence, PCA tries to reduce the data dimension by finding a few orthogonal linear combinations (Principal Components, PCs) of the original variables with the largest variance.
Denote a linear projection as W = [w_1, …, w_k], with y_i = w_i^T X. Then

W = argmax Σ_{i=1..k} var{y_i} = argmax Σ_{i=1..k} var{w_i^T X}

It can also be rewritten as

W = argmax tr(W^T Σ_x W)
PCA
Σ_x can be decomposed by eigendecomposition as

Σ_x = U Λ U^T

where Λ = diag(λ_1, …, λ_p) is the diagonal matrix of eigenvalues sorted in descending order, and U is the orthogonal matrix containing the corresponding eigenvectors.
It can be proven that the optimal projection matrix W consists of the first k eigenvectors (columns) of U.
PCA
Property 1: the subspace spanned by the first k eigenvectors has the smallest mean squared deviation from X among all subspaces of dimension k.
Property 2: the total variance is equal to the sum of the eigenvalues of the original covariance matrix.
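The eigendecomposition recipe above can be sketched in a few lines of NumPy (a minimal illustration, not the presenter's code; the function name `pca` and the use of the sample covariance are my own choices):

```python
import numpy as np

def pca(X, k):
    """Project the rows of X onto the top-k principal components.

    Follows the slides: eigendecompose the sample covariance, sort the
    eigenvalues in descending order, keep the first k eigenvectors as W.
    """
    Xc = X - X.mean(axis=0)                 # center the data
    Sigma = Xc.T @ Xc / (len(X) - 1)        # sample covariance (p x p)
    vals, vecs = np.linalg.eigh(Sigma)      # eigh returns ascending order
    W = vecs[:, ::-1][:, :k]                # reorder to descending, keep k
    return Xc @ W, W                        # scores Y and projection W
```

`eigh` is appropriate because Σ_x is symmetric; the returned columns of W are orthonormal, so W^T W = I_k.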
Multidimensional Scaling (MDS)
Multidimensional Scaling (MDS) produces a low-dimensional representation of the data such that the distances in the new space reflect the proximities of the data in the original space.
Denote the symmetric proximity matrix as Δ = {δ_ij, i, j = 1, …, n}. MDS tries to find the mapping such that the distances d_ij = d(y_i, y_j) in the lower-dimensional space are as close as possible to a function f(δ_ij) of the corresponding proximities.
MDS
Mapping cost function:

Σ_{i,j} [d_ij − f(δ_ij)]² / scale_factor

The scale_factor is often based on Σ_{i,j} d_ij² or Σ_{i,j} f(δ_ij)².
Problem: find the optimal mapping that minimizes the cost function.
If the proximity is a distance measure (e.g., the L₂ or L₁ distance), we call it metric MDS.
If the proximity uses only ordinal information about the data, it is called non-metric MDS.
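In the metric case with Euclidean proximities and f taken as the identity, the classical MDS solution has a closed form, sketched below (a minimal illustration, not the presenter's code; it double-centers the squared distances and embeds via the top-k eigenvectors):

```python
import numpy as np

def classical_mds(D, k):
    """Embed points into k dimensions from a pairwise distance matrix D.

    Double-centering -0.5 * J D^2 J recovers a Gram (inner-product)
    matrix B; its top-k eigenpairs give coordinates whose Euclidean
    distances approximate the entries of D.
    """
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n         # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                 # Gram matrix
    vals, vecs = np.linalg.eigh(B)              # ascending order
    vals, vecs = vals[::-1][:k], vecs[:, ::-1][:, :k]
    return vecs * np.sqrt(np.maximum(vals, 0))  # coordinates (n x k)
```

When D comes from points that truly lie in a k-dimensional Euclidean space, the embedding reproduces D exactly (up to rotation and reflection).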
Isomap
Disadvantages of PCA and MDS: 1) both methods often fail to discover complicated nonlinear structure, and 2) both have difficulty detecting the intrinsic dimension of the data.
Goal: combine the major algorithmic features of PCA and MDS (computational efficiency, global optimality, and guaranteed asymptotic convergence) with the flexibility to learn nonlinear manifolds.
Idea: introduce a geodesic distance that better describes the relations between data points.
Isomap
Illustration: points far apart on the underlying manifold, as measured by their geodesic distance, may appear deceptively close in the high-dimensional input space.
The Swiss Roll data set
Isomap
In this approach the intrinsic geometry of the data is preserved by capturing the manifold distances between all data points.
For neighboring points (within ε, or among the k nearest neighbors), the Euclidean distance provides a good approximation to the geodesic distance.
For faraway points, the geodesic distance can be approximated by adding up a sequence of "short hops" between neighboring points (computed with Floyd's algorithm).
Isomap Algorithm
Step 1: determine which points are neighbors on the manifold, based on the input distances d_X(i, j).
Step 2: estimate the geodesic distances d_G(i, j) between all pairs of points on the manifold M by computing their shortest-path distances in the neighborhood graph.
Step 3: apply MDS (or PCA) to the matrix of graph distances D_G = {d_G(i, j)}.
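The three steps can be sketched as follows (a hedged illustration, not the original implementation; it builds a k-nearest-neighbor graph, uses SciPy's Floyd–Warshall shortest paths for the geodesic distances, and finishes with classical MDS):

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

def isomap(X, n_neighbors, k):
    """Isomap sketch: neighbor graph -> geodesics -> classical MDS."""
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    # Step 1: keep edges only between each point and its nearest neighbors.
    G = np.zeros((n, n))                       # zero = no edge (dense format)
    for i in range(n):
        nbrs = np.argsort(D[i])[1:n_neighbors + 1]
        G[i, nbrs] = D[i, nbrs]
        G[nbrs, i] = D[i, nbrs]
    # Step 2: geodesic distances = shortest paths in the graph (Floyd-Warshall).
    Dg = shortest_path(G, method="FW", directed=False)
    # Step 3: classical MDS on the geodesic distance matrix.
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (Dg ** 2) @ J
    vals, vecs = np.linalg.eigh(B)
    vals, vecs = vals[::-1][:k], vecs[:, ::-1][:, :k]
    return vecs * np.sqrt(np.maximum(vals, 0))
```

Note that the neighborhood graph must be connected, otherwise some geodesic distances are infinite.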
The Swiss Roll Problem
Detect Intrinsic Dimension
- The intrinsic dimensionality of the data can be estimated from the rate at which the residual variance decreases as the dimensionality of Y increases.
- The residual variance is defined as

1 − R(D_M, D_y)²

where R(·,·) is the linear correlation coefficient, D_M is the estimated (geodesic) distance matrix in the original space, and D_y is the distance matrix in the projected space.
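This diagnostic is easy to compute; the sketch below (my own helper, with `D_geo` playing the role of D_M) evaluates 1 − R² over the unique point pairs:

```python
import numpy as np

def residual_variance(D_geo, Y):
    """Residual variance 1 - R(D_M, D_y)^2 between the input-space
    (geodesic) distances and the Euclidean distances in embedding Y."""
    D_y = np.linalg.norm(Y[:, None] - Y[None, :], axis=-1)
    iu = np.triu_indices_from(D_geo, k=1)      # unique (i < j) pairs
    r = np.corrcoef(D_geo[iu], D_y[iu])[0, 1]  # linear correlation R
    return 1.0 - r ** 2
```

Plotting this value for increasing embedding dimensionality and looking for the point where the decrease levels off gives the intrinsic-dimension estimate.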
Theoretical Analysis
The main contribution of Isomap is substituting the geodesic distance for the Euclidean distance, which may better capture the nonlinear structure of a manifold.
Given sufficient data, Isomap is guaranteed asymptotically to recover the true dimensionality and geometric structure of a nonlinear manifold.
Experiments
Experiment 1: Facial Images
Experiment 2: The hand-written 2’s
Locally Linear Embedding (LLE)
MDS and its variant Isomap try to preserve pairwise distances between data points.
Locally Linear Embedding (LLE) is an unsupervised learning algorithm that recovers global nonlinear structure from locally linear fits.
Assumption: each data point and its neighbors lie on, or close to, a locally linear patch of the manifold.
Local Linearity
LLE
Idea: the local geometry is characterized by linear coefficients that reconstruct each data point from its neighbors.
The reconstruction cost is defined as

ε(W) = Σ_i | x_i − Σ_j w_ij x_j |²

with two constraints: 1) each data point is reconstructed only from its neighbors, not from faraway points, and 2) the rows of the weight matrix sum to one.
Linear reconstruction
LLE
The weight matrix for any data point is invariant to rotations, rescalings, and translations.
Although the global manifold may be nonlinear, for each locally linear neighborhood there exists a linear mapping (consisting of a translation, rotation, and rescaling) that projects the neighborhood to the low-dimensional space.
The same weights that reconstruct the ith data point in D dimensions should also reconstruct its embedding on the manifold in d dimensions.
LLE
W is obtained by minimizing the reconstruction cost function in the original space.
To find the optimal global mapping to the lower-dimensional space, define an embedding cost function:

Φ(Y) = Σ_i | y_i − Σ_j w_ij y_j |²

Because W is now fixed, the problem becomes finding the optimal projection (X → Y) that minimizes the embedding cost.
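The two stages (solve for W, then for Y) can be sketched as follows (a hedged illustration; the trace-based regularization of the local Gram matrix is my own choice, not from the slides):

```python
import numpy as np

def lle(X, n_neighbors, d, reg=1e-3):
    """Locally Linear Embedding sketch.

    Stage 1: weights w_ij minimizing |x_i - sum_j w_ij x_j|^2 over each
             point's neighbors, with each row summing to one.
    Stage 2: embedding Y from the bottom nonzero eigenvectors of
             M = (I - W)^T (I - W), which minimizes the embedding cost.
    """
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(D[i])[1:n_neighbors + 1]
        Z = X[nbrs] - X[i]                          # neighborhood shifted to x_i
        C = Z @ Z.T                                 # local Gram matrix
        C += reg * np.trace(C) * np.eye(len(nbrs))  # regularize (assumption)
        w = np.linalg.solve(C, np.ones(len(nbrs)))
        W[i, nbrs] = w / w.sum()                    # enforce sum-to-one
    M = (np.eye(n) - W).T @ (np.eye(n) - W)
    vals, vecs = np.linalg.eigh(M)
    return vecs[:, 1:d + 1]                         # skip the constant eigenvector
```

The bottom eigenvector of M is the constant vector (eigenvalue ~0, because rows of W sum to one), so it is discarded.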
Theoretical analysis:
1) There is only one free parameter, K, and the transformation is deterministic.
2) LLE is guaranteed to converge to the global optimum given sufficient data points.
3) LLE does not have to be rerun to compute higher-dimensional embeddings.
4) The intrinsic dimension d can be estimated by analyzing a reciprocal cost function that reconstructs X from Y.
Experiment 1 Facial Images
Experiment 2: Arranging words in semantic space
Charting
Charting is the problem of assigning a low-dimensional coordinate system to data points in a high-dimensional sample space.
Assume that the data lie on or near a low-dimensional manifold in the sample space, and that there exists a one-to-one smooth nonlinear transform between the manifold and a low-dimensional vector space.
Goal: find a mapping, expressed as a kernel-based mixture of linear projections, that minimizes information loss about the density and relative locations of the sample points.
Local Linear Scale and Intrinsic Dimensionality
Local linear scale (r): at some scale r, the mapping from a neighborhood on M (the original space) to the lower-dimensional space is linear.
Consider a ball of radius r centered on a data point and containing n(r) data points. The count n(r) grows as r^d only at the locally linear scale.
Local Linear Scale and Intrinsic Dimensionality
There are two other factors that may affect the data distribution at different scales: isotropic noise (at smaller scales) and embedding curvature (at larger scales).
Define c(r) = log r / log n(r). At the noise scale, c(r) = 1/D < 1/d; at the locally linear scale, c(r) = 1/d; at the curvature scale, c(r) < 1/d.
Local Linear Scale and Intrinsic Dimensionality
Gradually increase r; when c(r) first peaks (at 1/d), we obtain one observation of both r and d.
Averaging over all data points, we can estimate r and d.
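One way to operationalize this (a rough sketch; the radius grid and the slope fit are my own choices) is to estimate d as the growth rate of log n(r) versus log r, which is the reciprocal of c(r) at the locally linear scale:

```python
import numpy as np

def intrinsic_dim(X, radii):
    """Estimate intrinsic dimension d from the growth of n(r):
    at the locally linear scale n(r) ~ r^d, so d is the slope of
    log n(r) versus log r (i.e., 1/c(r) in the text)."""
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    # average neighbor count n(r) over all data points, for each radius
    counts = [np.mean((D <= r).sum(axis=1)) for r in radii]
    slope, _ = np.polyfit(np.log(radii), np.log(counts), 1)
    return slope
```

The radii should be chosen inside the locally linear regime: too small and the noise scale dominates, too large and curvature flattens the slope.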
Charting the data
Model: each chart is modeled as a component of a Gaussian Mixture Model (GMM).
Goal: find a soft partition of the data into locally linear low-dimensional neighborhoods.
Problem: one data point may belong to several neighboring charts, so the estimation of each local Gaussian should take into account information from the neighboring charts.
Charting the data
Co-locality is defined to estimate how close two charts are:

m_i(µ_j) = N(µ_j ; µ_i, σ²)

Each data point is associated with a Gaussian neighborhood, with µ_i = x_i.
The covariance of each neighborhood is estimated as

Σ_i = [ Σ_j m_i(µ_j) ( (x_j − µ_i)(x_j − µ_i)^T + (µ_j − µ_i)(µ_j − µ_i)^T ) ] / Σ_j m_i(µ_j)

This step brings non-local information about the manifold's shape into the local description of each neighborhood, ensuring that adjoining neighborhoods have similar covariances and small angles between their respective subspaces.
Connecting the charts
- To minimize the information loss in the connection, the data points projected into the local subspace associated with each neighborhood should have 1) minimal loss of local variance and 2) maximal agreement between the projections of nearby points into nearby neighborhoods.
- The first criterion is met by applying PCA to each chart to obtain a local low-dimensional coordinate system. Each original data point then has a copy (a projected low-dimensional sample) in each local coordinate system.
- The second criterion is met by projecting each local coordinate system into a global coordinate system with minimal disagreement among the projected copies of each data point in the global space.
Connecting the charts
Each data point x_i is projected into the neighboring local coordinate systems j:

u_ji = l_j(x_i)

Each copy of the data point in a local coordinate system is then projected into the global coordinate system:

y_i = Σ_j G_j [u_ji ; 1] p(j | x_i)

where G_j is the projection from the jth chart to the global space.
Minimizing the disagreement is modeled as a weighted least-squared-distance problem:

[G_1, …, G_K] = argmin_{G_1,…,G_K} Σ_{j≠k} Σ_i p(j | x_i) p(k | x_i) || G_j [u_ji ; 1] − G_k [u_ki ; 1] ||²