SLIDE 1
Dimension Reduction
CS 6242 Ramakrishnan Kannan
Thanks: Prof. Jaegul Choo and Prof. Le Song
SLIDE 2 What is Dimension Reduction?
[Figure: a high-dimensional data matrix (dimension index d, data item index n; columns as data items) is transformed by "Dimension Reduction" into low-dim data. Annotations: "How big is this?" and "Why?"]
Attribute = Feature = Variable = Dimension
SLIDE 3 Image Data
Serialized/rasterized pixel values
[Figure: raw images → pixel values → serialized pixels, e.g., 3 80 24 58 63 45 5 34 78 …]
A 4K (4096x2160) image has 8,847,360 pixels in total, about 8.8 million.
SLIDE 4 Video Data
Serialized/rasterized pixel values
[Figure: raw images → pixel values → serialized pixels, one long vector per frame]
Huge dimensions: a 4096x2160 frame → 8,847,360 dimensions. At 30 fps, a 2-minute video generates a matrix of size 8,847,360 x 3,600.
SLIDE 5 Text Documents
Bag-of-words vector
Document 1 = "Life of Pi won Oscar"
Document 2 = "Life of Pi is also a book."

Vocabulary   Doc 1   Doc 2
Life         1       1
Pi           1       1
movies       0       0
also         0       1
book         0       1
won          1       0
…
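To make this concrete, here is a minimal sketch of building such binary bag-of-words vectors; the use of scikit-learn's CountVectorizer is an assumption for illustration, not something the slides prescribe:

```python
# Minimal bag-of-words sketch. CountVectorizer and its defaults
# (lowercasing, dropping single-character tokens) are assumptions.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Life of Pi won Oscar", "Life of Pi is also a book."]
vectorizer = CountVectorizer(binary=True)   # record presence/absence only
X = vectorizer.fit_transform(docs)          # 2 x |vocabulary| sparse matrix

print(vectorizer.get_feature_names_out())   # the learned vocabulary
print(X.toarray())                          # one 0/1 row per document
```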
SLIDE 6 Two Axes of Data Set
Data items: how many data items?
Dimensions: how many dimensions representing each item?
[Figure: a d x n matrix with data item index (n) and dimension index (d)]
Columns as data items vs. rows as data items: we will use columns as data items during this lecture.
SLIDE 7 Dimension Reduction
[Diagram] Inputs: high-dim data (d x n; dimension index d, n data items), the number of reduced dimensions k (user-specified), additional info about the data, and other parameters. Outputs: low-dim data (k x n; reduced dimension k) and a dim-reducing transformation for new data.
SLIDE 8
Benefits of Dimension Reduction
Obviously:
- Compression
- Visualization
- Faster computation, e.g., computing distances between 100,000-dim vs. 10-dim vectors (see the timing sketch after this list)
More importantly:
- Noise removal (improving data quality)
- Separates the data into general pattern + sparse + noise (but is the noise itself the important signal?)
- Works as pre-processing for better performance, e.g., microarray data analysis, information retrieval, face recognition, protein disorder prediction, network intrusion detection, document categorization, speech recognition
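As a rough illustration of the "faster computation" point, the sketch below times all pairwise distances in 100,000 vs. 10 dimensions; the point count and dimensions are assumptions chosen for illustration:

```python
# Timing sketch: all pairwise Euclidean distances among 200 points
# in 100,000 vs. 10 dimensions (sizes are assumed for illustration).
import time
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
for d in (100_000, 10):
    X = rng.random((200, d))
    t0 = time.perf_counter()
    pdist(X)                                # condensed pairwise-distance vector
    print(f"{d:>7} dims: {time.perf_counter() - t0:.4f} s")
```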
SLIDE 9 Two Main Techniques
Feature selection: selects a subset of the original variables as reduced dimensions, relevant for a particular task; e.g., the number of genes responsible for a particular disease may be small.
Feature extraction: each reduced dimension combines multiple original dimensions; the original dataset is transformed into some other numbers.
Feature = Variable = Dimension
SLIDE 10 Feature Selection
What is the optimal subset of m features that maximizes a given criterion?
Widely-used criteria: information gain, correlation, …
Typically a combinatorial optimization problem, so greedy methods are popular:
- Forward selection: empty set → add one variable at a time (see the sketch below)
- Backward elimination: entire set → remove one variable at a time
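A minimal sketch of greedy forward selection follows; the dataset, the logistic-regression model, and the use of scikit-learn's SequentialFeatureSelector are all assumptions for illustration:

```python
# Greedy forward selection sketch (model and dataset are assumptions).
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=5000),
    n_features_to_select=5,     # the target subset size m
    direction="forward",        # empty set -> add one variable at a time
)
selector.fit(X, y)
print(selector.get_support(indices=True))   # indices of the selected features
```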
SLIDE 11
Feature Extraction
SLIDE 12 Aspects of Dimension Reduction
- Linear vs. Nonlinear
- Unsupervised vs. Supervised
- Global vs. Local
- Feature vectors vs. Similarity (as an input)
SLIDE 13 Linear vs. Nonlinear
Linear: represents each reduced dimension as a linear combination of original dimensions
Of the form AX + b, where A is a matrix of coefficients and b a vector
e.g., Y1 = 3*X1 – 4*X2 + 0.3*X3 – 1.5*X4
      Y2 = 2*X1 + 3.2*X2 – X3 + 2*X4
Naturally capable of mapping new data to the same space (computed below as a matrix product)
[Table: worked example mapping a data item with values in dimensions X1–X4 to reduced values Y1 = 1.75, Y2 = 0.58]
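The slide's example mapping written as a single matrix product; the coefficients come from the slide, while the random input data is an assumption:

```python
# Linear dimension reduction as a matrix product (columns as data items).
import numpy as np

A = np.array([[3.0, -4.0,  0.3, -1.5],   # Y1 = 3*X1 - 4*X2 + 0.3*X3 - 1.5*X4
              [2.0,  3.2, -1.0,  2.0]])  # Y2 = 2*X1 + 3.2*X2 -   X3 + 2*X4
X = np.random.rand(4, 10)                # 10 data items in 4 dimensions
Y = A @ X                                # the same 10 items in 2 reduced dimensions
print(Y.shape)                           # (2, 10)
```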
SLIDE 14 Linear vs. Nonlinear
Linear: represents each reduced dimension as a linear combination of original dimensions
e.g., Y1 = 3*X1 – 4*X2 + 0.3*X3 – 1.5*X4, Y2 = 2*X1 + 3.2*X2 – X3 + 2*X4
Naturally capable of mapping new data to the same space
Nonlinear: more complicated, but generally more powerful; a recently popular topic
SLIDE 15 Unsupervised vs. Supervised
Unsupervised: uses only the input data
[Diagram: the dimension reduction box from Slide 7, with only the high-dim data input active]
SLIDE 16 Unsupervised vs. Supervised
Supervised: uses the input data + additional info
[Diagram: same as Slide 7, with the "additional info about data" input highlighted]
SLIDE 17 Unsupervised vs. Supervised
Supervised: uses the input data + additional info, e.g., grouping label
[Diagram: same as Slide 7, with the "additional info about data" input highlighted]
SLIDE 18 Global vs. Local
Dimension reduction typically tries to preserve all the relationships/distances in data, but information loss is unavoidable. So what should we emphasize more?
Global: treats all pairwise distances as equally important; focuses on preserving large distances
Local: focuses on small distances and neighborhood relationships; an active research area, e.g., manifold learning
SLIDE 19 Feature vectors vs. Similarity (as an input)
Typical setup: feature vectors as an input
[Diagram: high-dim data (n items, dimension index d) → dimension reduction (k dimensions, other parameters, additional info) → low-dim data (reduced dimension k) plus a dim-reducing transformer for new data]
SLIDE 20 Feature vectors vs. Similarity (as an input)
Alternatively, takes a similarity matrix instead:
- the (i,j)-th component indicates the similarity between the i-th and j-th data items
- assuming the distance is a metric, the similarity matrix is symmetric
[Diagram: similarity matrix → dimension reduction → low-dim data]
SLIDE 21 Feature vectors vs. Similarity (as an input)
Some methods internally convert feature vectors to a similarity matrix before performing dimension reduction (a sketch of such a conversion follows).
[Diagram: high-dim data (d x n) → similarity matrix (n x n) → dimension reduction → low-dim data (k x n), labeled "Graph Embedding"]
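Here is a minimal sketch of such a feature-vectors-to-similarity conversion; the Gaussian-kernel formula and the bandwidth heuristic are assumptions for illustration, since the slides do not fix a particular similarity:

```python
# Feature vectors -> n x n similarity matrix via a Gaussian kernel
# (kernel choice and bandwidth heuristic are assumptions).
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.random.rand(5, 100)               # d=5 dims, n=100 items (columns as items)
D = squareform(pdist(X.T))               # n x n pairwise Euclidean distances
sigma = D.mean()                         # heuristic bandwidth
S = np.exp(-D**2 / (2 * sigma**2))       # similarities in (0, 1], 1 on the diagonal
assert np.allclose(S, S.T)               # symmetric, as noted on Slide 20
```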
SLIDE 22 Feature vectors vs. Similarity (as an input)
Why is this called graph embedding? The similarity matrix can be viewed as a graph where similarity represents edge weight.
[Diagram: high-dim data (d x n) → similarity matrix viewed as a weighted graph → dimension reduction → low-dim data, labeled "Graph Embedding"]
SLIDE 23 Methods
Traditional:
- Principal component analysis (PCA)
- Multidimensional scaling (MDS)
- Linear discriminant analysis (LDA)
Advanced (nonlinear, kernelized, manifold learning):
- Isometric feature mapping (Isomap)
* Matlab codes are available at
http://homepage.tudelft.nl/19j49/Matlab_Toolbox_for_Dimensionality_Reduction.html
SLIDE 24 Principal Component Analysis
Finds the axis showing the largest variation and projects all points onto this axis; the reduced dimensions are orthogonal
Algorithm: eigen-decomposition
Pros: fast
Cons: limited performance
[Linear | Unsupervised | Global | Feature vectors]
Image source: http://en.wikipedia.org/wiki/Principal_component_analysis (PC1, PC2)
SLIDE 25 PCA – Some Questions
Algorithm:
1. Subtract the mean from the dataset (X − µ)
2. Form the covariance matrix of the centered data, (X − µ)(X − µ)′ with columns as data items
3. Eigen-decompose (or SVD) this covariance matrix to find the leading eigenvectors
4. Project the data onto these leading eigenvectors, i.e., multiply
Key questions: Why the covariance matrix? Can't we perform SVD on the original matrix?
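The four steps above as a numpy sketch, following the lecture's columns-as-data-items convention; the random data and the choice of k = 2 are assumptions:

```python
# The slide's PCA recipe in numpy (columns of X are data items).
import numpy as np

X = np.random.rand(100, 500)             # d=100 dims, n=500 items
mu = X.mean(axis=1, keepdims=True)
Xc = X - mu                              # step 1: subtract the mean
C = Xc @ Xc.T / (X.shape[1] - 1)         # step 2: d x d covariance matrix
vals, vecs = np.linalg.eigh(C)           # step 3: eigh suits symmetric matrices
W = vecs[:, ::-1][:, :2]                 # leading 2 eigenvectors (largest eigenvalues)
Y = W.T @ Xc                             # step 4: project, giving 2 x n low-dim data
```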
SLIDE 26 Multidimensional Scaling (MDS)
Main idea: tries to preserve given pairwise distances in low-dimensional space
Metric MDS: preserves the given distance values
Nonmetric MDS: for when you only know/care about the ordering of distances; preserves only the orderings of distance values
Algorithm: gradient-descent type (cf. classical MDS is equivalent to PCA)
[Nonlinear | Unsupervised | Global | Similarity input]
Objective (reconstructed from the slide's "ideal distance" vs. "low-dim distance" sketch): minimize the stress $\sum_{i<j} \big( \delta_{ij} - \lVert \mathbf{y}_i - \mathbf{y}_j \rVert \big)^2$, where $\delta_{ij}$ is the given ideal distance and $\lVert \mathbf{y}_i - \mathbf{y}_j \rVert$ is the low-dimensional distance.
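A metric-MDS sketch with scikit-learn follows; the library choice and data are assumptions, and note that scikit-learn expects rows, not columns, as data items:

```python
# Metric MDS sketch (library choice and data are assumptions).
import numpy as np
from sklearn.manifold import MDS

X = np.random.rand(200, 50)              # 200 items in 50 dimensions
mds = MDS(n_components=2, metric=True)   # metric=False would give nonmetric MDS
Y = mds.fit_transform(X)                 # 200 x 2 embedding
print(mds.stress_)                       # residual mismatch between distances
```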
SLIDE 27 Multidimensional Scaling
Pros: widely used (works well in general)
Cons: slow (an n-body-type problem); nonmetric MDS is much slower than metric MDS
Fast algorithms are available: the Barnes-Hut algorithm, GPU-based implementations
SLIDE 28
Linear Discriminant Analysis
What if clustering information is available? LDA tries to separate clusters by
- putting different clusters as far apart as possible
- making each cluster as compact as possible
[Figure panels (a) and (b)]
SLIDE 29 Aspects of Dimension Reduction
Unsupervised vs. Supervised
Supervised: uses the input data + additional info, e.g., grouping label
[Diagram: same as Slide 17]
SLIDE 30 Linear Discriminant Analysis (LDA) vs. Principal Component Analysis
[Figure: 2D visualizations of a mixture of 7 Gaussians in 1000 dimensions; linear discriminant analysis (supervised) vs. principal component analysis (unsupervised)]
SLIDE 31 LDA
Algorithm:
1. Compute the mean of each class and the global mean (µ1, µ2, µ)
2. Compute the within-class scatter matrix Sw
3. Compute the between-class scatter matrix Sb using the means
4. Eigen-decompose inv(Sw)*Sb; the leading eigenvectors give the projection
Question: why is inv(Sw)*Sb the solution?
- Fisher’s LDA generalizes gracefully
– 𝑧 𝐷 − 1 [𝑧1, 𝑧2, … 𝑧𝐷−1] 𝐷 − 1 𝑥𝑗 𝑋 = [𝑥1|𝑥2| … |𝑥𝐷−1] 𝑧𝑗 = 𝑥𝑗
𝑈𝑦 ⇒ 𝑧 = 𝑋𝑈𝑦
𝑇𝑋 = 𝑇𝑗
𝐷 𝑗=1
𝑦 − 𝜈𝑗 𝑦 − 𝜈𝑗 𝑈
𝑦∈𝜕𝑗
𝜈𝑗 =
1 𝑂𝑗
𝑦
𝑦∈𝜕𝑗
– es 𝑇𝐶 = 𝑂𝑗 𝜈𝑗 − 𝜈 𝜈𝑗 − 𝜈 𝑈
𝐷 𝑗=1
1 𝑂
𝑦
∀𝑦
=
1 𝑂
𝑂𝑗𝜈𝑗
𝐷 𝑗=1
– 𝑇𝑈 = 𝑇𝐶 + 𝑇𝑋
1 2 3
S B 3 S B 2 S W 3 S W 1 S W 2 x 1 x 2 1 2 3
S B 3 S B 2 S W 3 S W 1 S W 2 x 1 x 2
*http://research.cs.tamu.edu/prism/lectures/pr/pr_l10.pdf
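Putting the scatter-matrix definitions above into code, here is a minimal numpy/scipy sketch of Fisher's LDA; it is an illustration, not the lecture's reference implementation, and the nonsingularity caveat is an added note:

```python
# Fisher's LDA sketch following the scatter matrices above.
import numpy as np
from scipy.linalg import eigh

def lda_directions(X, labels, k):
    """X: d x n data, columns as items; returns the top-k projection directions."""
    d = X.shape[0]
    mu = X.mean(axis=1, keepdims=True)                 # global mean
    Sw = np.zeros((d, d))                              # within-class scatter
    Sb = np.zeros((d, d))                              # between-class scatter
    for c in np.unique(labels):
        Xc = X[:, labels == c]
        mu_c = Xc.mean(axis=1, keepdims=True)          # class mean
        Sw += (Xc - mu_c) @ (Xc - mu_c).T
        Sb += Xc.shape[1] * (mu_c - mu) @ (mu_c - mu).T
    # Generalized eigenproblem Sb w = lambda Sw w, i.e., eig of inv(Sw) Sb;
    # Sw must be nonsingular (e.g., n > d), otherwise regularize it.
    vals, vecs = eigh(Sb, Sw)
    return vecs[:, ::-1][:, :k]                        # directions for the largest lambdas
```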
SLIDE 32 Linear Discriminant Analysis
Maximally separates clusters by putting different clusters far apart and shrinking each cluster compactly
Algorithm: generalized eigendecomposition
Pros: better shows cluster structure
Cons: may distort the original relationships of the data
[Linear | Supervised | Global | Feature vectors]
SLIDE 33 Methods
Traditional:
- Principal component analysis (PCA)
- Multidimensional scaling (MDS)
- Linear discriminant analysis (LDA)
Advanced (nonlinear, kernelized, manifold learning):
- Isometric feature mapping (Isomap)
* Matlab codes are available at
http://homepage.tudelft.nl/19j49/Matlab_Toolbox_for_Dimensionality_Reduction.html
SLIDE 34 Manifold Learning
Swiss Roll Data
Swiss roll data, originally in 3D. What is the intrinsic dimensionality (allowing flattening)?
SLIDE 35 Manifold Learning
Swiss Roll Data
Swiss roll data, originally in 3D. What is the intrinsic dimensionality (allowing flattening)? → 2D
What if your data has low intrinsic dimensionality but resides in high-dimensional space?
SLIDE 36 Isomap
(Isometric Feature Mapping)
Let's preserve pairwise geodesic distance (along the manifold):
- Compute geodesic distance as the shortest-path length in the k-nearest-neighbor (k-NN) graph
- Eigen-decompose* the pairwise geodesic distance matrix to obtain the embedding that best preserves the given distances (see the sketch below)
* Eigen-decomposition is also the main algorithm of PCA
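A short Isomap sketch on the swiss roll from Slides 34–35; the use of scikit-learn and the parameter values are assumptions:

```python
# Isomap sketch: k-NN graph + geodesic distances + eigen-decomposition,
# all handled internally by scikit-learn (an assumed library choice).
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

X, _ = make_swiss_roll(n_samples=1000)        # 1000 points on a 3D swiss roll
iso = Isomap(n_neighbors=10, n_components=2)  # k for the k-NN graph, target dims
Y = iso.fit_transform(X)                      # 1000 x 2 "unrolled" embedding
```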
SLIDE 37 Isomap
(Isometric Feature Mapping)
Algorithm: all-pairs shortest path computation + eigen-decomposition
Pros: performs well in general
Cons: slow (shortest paths), sensitive to parameters
[Nonlinear | Unsupervised | Global (all pairwise distances are considered) | Feature vectors]
SLIDE 38 Practitioner's Guide
Caveats
Trustworthiness of dimension reduction results:
- Distortion/information loss in 2D/3D is inevitable
- The best result of a method may not align with what we want, e.g., PCA visualization of facial image data
[Figure: facial images projected onto PCA dimensions (1, 2) vs. dimensions (3, 4)]
SLIDE 39 Practitioner's Guide
General Recommendation
Want something simple and fast to visualize data? PCA, force-directed layout
Want to first try a manifold learning method? Isomap; empirically, it is the method that tends to give the best results.
Have a cluster label to use (pre-given or computed)? LDA (supervised). A supervised approach is sometimes the only viable option when your data do not have clearly separable clusters.
No labels, but still want some clusters to be revealed? Or simply want some state-of-the-art method for visualization?
SLIDE 40 Practitioner's Guide
Results Still Not Good?
Try various pre-processing (a sketch follows this list):
- Data centering: subtract the global mean from each vector
- Normalization: make each vector have unit Euclidean norm; otherwise, a few outliers can affect dimension reduction significantly
- Application-specific pre-processing:
  - Documents: TF-IDF weighting; remove terms that are too rare and/or too short
  - Images: histogram normalization
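A minimal sketch of the centering and normalization steps; the random data is an assumption, and rows are used as data items here for brevity:

```python
# Centering and unit-norm normalization sketch (rows as data items).
import numpy as np

X = np.random.rand(100, 20)
Xc = X - X.mean(axis=0)                       # centering: subtract the global mean
norms = np.linalg.norm(Xc, axis=1, keepdims=True)
Xn = Xc / np.where(norms == 0, 1.0, norms)    # unit Euclidean norm per data item
```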
SLIDE 41 Practitioner's Guide
Too Slow?
Apply PCA to reduce to an intermediate dimension before the main dimension reduction step (sketched below); the results may even improve due to the noise removed by PCA.
See if there is an approximate but faster version:
- Landmarked versions (only using a subset of data items), e.g., landmarked Isomap
- Linearized versions (the same criterion, but only allowing a linear mapping), e.g., Laplacian Eigenmaps → locality preserving projection
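The PCA pre-reduction idea as a two-step sketch; the data shape, the intermediate dimension of 50, and the choice of Isomap as the main method are assumptions:

```python
# PCA pre-reduction before a slower nonlinear method (assumed pipeline).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap

X = np.random.rand(500, 10_000)                   # hypothetical very high-dim data
Z = PCA(n_components=50).fit_transform(X)         # fast intermediate reduction
Y = Isomap(n_neighbors=10, n_components=2).fit_transform(Z)  # main (slow) step
```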
SLIDE 42 Practitioner's Guide
Still Need More?
Tweak dimension reduction for your own purpose:
- Play with its algorithm, convergence criteria, etc.
- See if you can impose label information
- Restrict the number of iterations to save computation time
The main purpose of dimension reduction is to serve us in exploring data and solving complicated real-world problems.
SLIDE 43 Take Away

                  PCA   MDS   LDA   Isomap
Supervised        ✖     ✖     ✔     ✖
Linear            ✔     ✖     ✔     ✖
Global            ✔     ✔     ✔     ✔
Feature vectors   ✔     ✖     ✔     ✔
SLIDE 44 Useful Resources
Tutorial on PCA: http://arxiv.org/pdf/1404.1100.pdf
Tutorial on LDA: http://research.cs.tamu.edu/prism/lectures/pr/pr_l10.pdf
Review article: http://www.iai.uni-bonn.de/~jz/dimensionality_reduction_a_comparative_review.pdf
Matlab toolbox for dimension reduction: http://homepage.tudelft.nl/19j49/Matlab_Toolbox_for_Dimensionality_Reduction.html
Matlab manifold learning demo: http://www.math.ucla.edu/~wittman/mani/