Dimension Reduction for Classification


SLIDE 1

Dimension Reduction for Classification

Alfred O. Hero

  • Dept. EECS, Dept. BME, Dept. Statistics

University of Michigan - Ann Arbor
hero@eecs.umich.edu
http://www.eecs.umich.edu/~hero

  • 1. The manifold supporting data sample
  • 2. Classification constrained dimension reduction
  • 3. Dimension estimation on smooth manifolds
  • 4. Applications
  • 5. Conclusions

BIRS, July 2005

SLIDE 2
  • 1. Manifold supporting data sample
  • Yale Face Database B: each 128x128 image lies in R^(16384)
  • Yet the set of face images concentrates near a d-dimensional subspace (submanifold) with d << 16384

SLIDE 3

Data-driven dimensionality reduction

  • Data-driven dimensionality reduction consists of:

– Estimation of intrinsic dimension d:

  • Direct intrinsic dimension estimation

– Reconstruction of data samples in the manifold domain:

  • Manifold learning
  • Classifiers on intrinsic data dimension

– Estimated dimension as a discriminant
– Label-constrained manifold learning

SLIDE 4

Manifold learning problem setup:

Given a finite sampling {y_1, ..., y_n} of a d-dimensional manifold M embedded in R^D, find an embedding of M into a subset of R^m, m << D (usually m = d), without any prior knowledge about M or d.

SLIDE 5

Reconstructing the mapping and attributes of the manifold from a finite dataset falls into the general manifold learning problem. Manifold reconstruction for fixed d:

1. ISOMAP: Tenenbaum, de Silva, Langford (2000)
2. Locally Linear Embedding (LLE): Roweis, Saul (2000)
3. Laplacian Eigenmaps: Belkin, Niyogi (2002)
4. Hessian Eigenmaps (HLLE): Donoho, Grimes (2003)
5. Local Tangent Space Alignment (LTSA): Zhang, Zha (2003)
6. Semidefinite Embedding (SDE): Weinberger, Saul (2004)

Manifold learning background

SLIDE 6

Laplacian Eigenmaps: preserving local information

(Belkin & Niyogi 2002)

  • 1. Constructing an Adjacency Graph:
  • a. compute a k-NN graph on the dataset;
  • b. compute a similarity/weight matrix W between data points that encodes neighborhood information, e.g., the heat kernel: W_ij = exp(-||y_i - y_j||^2 / t) if i and j are k-NN neighbors, W_ij = 0 otherwise.
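A minimal numpy/scipy sketch of this adjacency-graph step, assuming Euclidean distances; the neighborhood size k and kernel width t are illustrative choices, not values from the talk.

```python
import numpy as np
from scipy.spatial import cKDTree

def heat_kernel_weights(Y, k=10, t=1.0):
    """Step 1: symmetric k-NN weight matrix with heat-kernel weights,
    W_ij = exp(-||y_i - y_j||^2 / t) if i, j are k-NN neighbors, else 0."""
    n = Y.shape[0]
    dist, idx = cKDTree(Y).query(Y, k=k + 1)   # column 0 is the point itself
    W = np.zeros((n, n))
    for i in range(n):
        for dij, j in zip(dist[i, 1:], idx[i, 1:]):
            W[i, j] = W[j, i] = np.exp(-dij**2 / t)  # symmetrize the k-NN relation
    return W
```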

SLIDE 7
  • 2. Manifold learning as an optimization problem:
  • a. objective function:

J(X) = sum_{i,j} W_ij ||x_i - x_j||^2 = 2 tr(X L X^T),

where L = D - W is the Graph Laplacian and D = diag(sum_j W_ij) is the degree matrix.

  • b. embedding is the solution of

min_X tr(X L X^T) subject to X D X^T = I    (*)


SLIDE 8
  • 3. Eigenmaps:
  • a. the solution to (*) is given by the d generalized eigenvectors f_1, ..., f_d

associated with the d smallest nonzero generalized eigenvalues that solve L f = lambda D f; equivalently, the eigenvectors of the normalized Graph Laplacian D^(-1/2) L D^(-1/2).

  • b. if F = [f_1, ..., f_d] is the collection of such eigenvectors,

then the embedded points are given by the rows of F: x_i = (f_1(i), ..., f_d(i)).

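A sketch of this eigen-solve, consistent with the weight matrix built above; dense scipy.linalg.eigh is used for brevity, though a sparse solver would be preferred for large n.

```python
import numpy as np
from scipy.linalg import eigh

def laplacian_eigenmap(W, d=2):
    """Steps 2-3: solve L f = lambda D f and embed each point i as
    x_i = (f_1(i), ..., f_d(i)), skipping the constant eigenvector."""
    D = np.diag(W.sum(axis=1))   # degree matrix (assumes no isolated points)
    L = D - W                    # combinatorial graph Laplacian
    _, F = eigh(L, D)            # generalized eigenvectors, ascending eigenvalues
    return F[:, 1:d + 1]         # rows are the embedded points x_i
```

For example, X = laplacian_eigenmap(heat_kernel_weights(Y), d=2) produces a two-dimensional embedding of a point cloud Y, like the embeddings shown on the following slides (up to sign and scaling).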

SLIDE 9

Dimension Reduction for Labeled Data

[Figure: left, original data Y: 800 points uniform on the Swiss roll, 400 per class; right, dimension-reduced data X.]

SLIDE 10

  • 2. Classification constrained dimensionality reduction
  • Adding class-dependent constraints: "virtual" class vertices.

SLIDE 11
Label-penalized Laplacian Eigenmaps

  • 1. If C is the class membership matrix (i.e., c_ij = 1 if point j is from class i), define the objective function

J(X, Z) = sum_{i,j} W_ij ||x_i - x_j||^2 + beta * sum_{i,j} c_ij ||z_i - x_j||^2,

where z_1, ..., z_K (the columns of Z) are the "virtual" class centers and beta is a regularization parameter.

  • 2. The embedding is the solution of the generalized eigenproblem L' f = lambda D' f,

where L' is the Laplacian of the augmented weight matrix

W' = [ W        beta C^T ]
     [ beta C   0        ]
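A minimal sketch of the augmented-graph construction read off the description above; the exact weighting and normalization conventions of the ICASSP 2005 paper may differ, so treat beta and the block layout as assumptions.

```python
import numpy as np
from scipy.linalg import eigh

def ccdr_embedding(W, C, beta=1.0, d=2):
    """Embed n data points together with K 'virtual' class centers.

    W : (n, n) data weight matrix; C : (K, n) membership matrix with
    c_kj = 1 if point j has label k (all-zero column if unlabeled)."""
    n, K = W.shape[0], C.shape[0]
    # edge of weight beta between point j and the class vertex of its label
    W_aug = np.block([[W, beta * C.T],
                      [beta * C, np.zeros((K, K))]])
    D = np.diag(W_aug.sum(axis=1))   # assumes every class has a labeled point
    L = D - W_aug                    # Laplacian of the augmented graph
    _, F = eigh(L, D)
    E = F[:, 1:d + 1]
    return E[:n], E[n:]              # embedded points x_j and class centers z_k
```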

SLIDE 12

[Figure: 2-D embeddings of the two-class Swiss roll: classification constrained dimensionality reduction (left) vs. unconstrained dimensionality reduction (right).]

SLIDE 13

Semi-Supervised Learning on Manifolds

[Figure: partially labeled data on the Swiss roll: a few labeled samples among many unlabeled samples.]

SLIDE 14

Semisupervised extension

Algorithm:

  • 1. Compute the constrained embedding of the entire data set, inserting a zero column in C for each unlabeled sample.
  • 2. Fit a (e.g., linear) classifier to the labeled embedded points by minimizing the quadratic error loss, e.g. min_A sum_{j labeled} ||t_j - A x_j||^2, where t_j is the class indicator vector of point j.
  • 3. For an unlabeled point x, label it using the fitted (linear) classifier: assign the class k that maximizes (A x)_k.
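A sketch of steps 2 and 3 on the embedded points, assuming one-hot least-squares targets (my reading of the quadratic loss above):

```python
import numpy as np

def propagate_labels(X, labels):
    """Fit a linear classifier to the labeled embedded points by least
    squares, then label the unlabeled points; labels < 0 mean unlabeled."""
    lab = labels >= 0
    classes = np.unique(labels[lab])
    Xl = np.column_stack([X[lab], np.ones(lab.sum())])            # affine features
    T = (labels[lab][:, None] == classes[None, :]).astype(float)  # one-hot targets
    A, *_ = np.linalg.lstsq(Xl, T, rcond=None)                    # min ||Xl A - T||^2
    Xu = np.column_stack([X[~lab], np.ones((~lab).sum())])
    out = labels.copy()
    out[~lab] = classes[np.argmax(Xu @ A, axis=1)]                # step 3: classify
    return out
```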
SLIDE 15

Classification Error Rates

[Figure: error rate (%) vs. number of labeled points (100 to 700) for k-NN, Laplacian eigenmaps, and CCDR.]

Percentage of errors for labeling unlabeled samples as a function of the number of labeled points, out of a total of 1000 points on the Swiss roll.

SLIDE 16
  • 3. Methods of dimension estimation
  • Scree plots

– Plot residual fitting errors of SVD, Isomap, LE, LLE

  • Kolmogorov/Entropy/Correlation dimension

– Box counting, sphere packing (Liebovitch and Toth)

  • Maximum likelihood

– Poisson approximation to the Binomial (Levina & Bickel, 2004); see the sketch at the end of this slide

  • Entropic graphs

– Spanner-graph length approximation to an entropy functional (Costa & Hero, 2003)

[Figure: ISOMAP residual curve: residual variance vs. dimensionality for Data Set 1; the intrinsic dimension is read from the knee of the curve, which is hard to locate ("Knee?").]
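As a concrete instance of the maximum-likelihood entry above, a sketch of the Levina-Bickel nearest-neighbor MLE; the neighborhood size k is an illustrative choice.

```python
import numpy as np
from scipy.spatial import cKDTree

def mle_dimension(Y, k=10):
    """Levina-Bickel MLE of intrinsic dimension, averaged over all points:
    m_k(x) = [ (1/(k-1)) * sum_{j<k} log(T_k(x)/T_j(x)) ]^(-1),
    with T_j(x) the distance from x to its j-th nearest neighbor."""
    dist, _ = cKDTree(Y).query(Y, k=k + 1)
    T = dist[:, 1:]                   # T_1, ..., T_k for each point
    m = 1.0 / np.mean(np.log(T[:, -1:] / T[:, :-1]), axis=1)
    return m.mean()
```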

SLIDE 17

Euclidean Random Graphs

  • data in D-dimensionalEuclidean space
  • Euclidean MST with edge power weighting gamma:
  • pairwise distance matrix over
  • edge length matrix of spanning trees over
  • Euclidean k-NNG with edge power weighting gamma:
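A short sketch of the kNNG length functional just defined (the MST analogue could use scipy.sparse.csgraph.minimum_spanning_tree on the pairwise distance matrix):

```python
import numpy as np
from scipy.spatial import cKDTree

def knng_length(X, k=4, gamma=1.0):
    """Power-weighted k-NN graph length:
    L_gamma(X_n) = sum_i sum_{j in N_k(i)} ||x_i - x_j||^gamma."""
    dist, _ = cKDTree(X).query(X, k=k + 1)   # column 0 is the zero self-distance
    return float(np.sum(dist[:, 1:] ** gamma))
```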
SLIDE 18

Example: Uniform Planar Sample

[Figure: uniform sample on the square (N=400); its Minimal Spanning Tree (γ=1); and its k-NN Graph (γ=1, k=4).]

SLIDE 19

Convergence of Euclidean CQF’s

Beardwood, Halton, Hammersley Theorem (BHH:1959): if X_n = {x_1, ..., x_n} is an i.i.d. sample with density f on R^d and 0 < gamma < d, then with probability 1

L_gamma(X_n) / n^((d-gamma)/d)  ->  beta_(d,gamma) * Integral f(x)^((d-gamma)/d) dx,

where beta_(d,gamma) is a constant not depending on f.
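Since BHH makes log L_gamma(X_n) approximately linear in log n with slope (d - gamma)/d, the intrinsic dimension can be estimated by regression over subsample sizes. A sketch using my own subsampling scheme (the Costa-Hero papers use a specific bootstrap), reusing knng_length from the previous sketch:

```python
import numpy as np

def dimension_from_growth(X, sizes, k=4, gamma=1.0, trials=10, seed=0):
    """Fit log L_gamma(n) ~ a*log n + b over random subsamples; BHH gives
    a = (d - gamma)/d, hence d_hat = gamma / (1 - a). The intercept b
    carries the entropy information (not extracted here)."""
    rng = np.random.default_rng(seed)
    mean_log_L = []
    for n in sizes:
        logs = [np.log(knng_length(X[rng.choice(len(X), n, replace=False)],
                                   k=k, gamma=gamma))
                for _ in range(trials)]
        mean_log_L.append(np.mean(logs))
    a, _ = np.polyfit(np.log(sizes), mean_log_L, 1)   # slope of the log-log fit
    return gamma / (1.0 - a)
```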

SLIDE 20

k-NNG Convergence Theorem in Non-Euclidean Spaces

Costa, Hero: TSP (2004), Birkhauser (2005): for i.i.d. samples on a smooth compact d-dimensional manifold embedded in R^D, the kNNG length obeys the same n^((d-gamma)/d) growth law with the intrinsic dimension d in place of D, and the limit is determined by the (Renyi) entropy of the sampling density on the manifold. Hence the growth exponent identifies d and the intercept identifies the entropy H.

SLIDE 21

Application to 2D Torus

  • 600 uniformly distributed random samples

[Figure: average kNNG (k=5) length vs. number of samples; the growth-rate fit gives d = 2 and H = 9.6 bits.]

SLIDE 22

Local Extension via kNNG

– Initialize: for each sample x_i, find the set N_i of its Q nearest neighbors.
– For i = 1, 2, ..., p:

  • compute the kNNG growth-rate estimate on N_i and set (d_hat_i, H_hat_i) to the resulting local dimension and entropy estimates.
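A hedged sketch of this local extension, reusing dimension_from_growth from the BHH sketch; the neighborhood size Q and the subsample sizes are my choices (the exact procedure is in the Costa-Girotra-Hero SSP 2005 paper).

```python
import numpy as np
from scipy.spatial import cKDTree

def local_dimensions(X, Q=65, k=4, gamma=1.0):
    """Local intrinsic dimension at each sample: run the kNNG growth-rate
    estimator on the Q-nearest-neighbor ball around each point."""
    _, nbrs = cKDTree(X).query(X, k=Q)
    sizes = [Q // 2, 3 * Q // 4, Q]   # subsample sizes for the log-log fit
    return np.array([dimension_from_growth(X[nbrs[i]], sizes, k=k, gamma=gamma)
                     for i in range(len(X))])
```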
SLIDE 23
  • 4. Application to MNIST Digits
  • Large database of 8-bit images of handwritten digits 0-9.
  • 28x28 pixels for each image
  • First 1000 images in training set used here
  • Non-adaptive: digit labels are known
SLIDE 24

Scree Plot

[Figure: scree plot for the MNIST digits (Costa & Hero, Birkhauser 2005).]

SLIDE 25

Local Dimension/Entropy Statistics

[Figure: local dimension and entropy statistics for the MNIST digits (Costa & Hero, Birkhauser 2005).]

SLIDE 26

Adaptive Anomaly Detection

  • Spatio-temporal measurement vector:

[Figure: per-router measurement time series (value vs. day) for the 11 Abilene nodes: STTL, SNVA, LOSA, KSCY, HSTN, DNVR, CHIN, IPLS, ATLA, WASH, NYCM.]

SLIDE 27

Data Observed from the Abilene Network

  • Objective: detect changes in network traffic via local intrinsic dimension
  • Hypotheses:

– High traffic from a few sources lowers the local dimension of the network traffic
– Changes in the distribution of the dimension estimate can be used as a marker for more subtle changes in traffic

  • Data collection period: 1/1/05-1/2/05
  • Data sampling: packet flow sampled every 5 minutes from all 11 routers on the Abilene Network
  • Data fields: aggregate of all flows to/from all ports

SLIDE 28

KNN Algorithm (Costa)

[Figure: time series of local dimension estimates d-hat ("Modified Knn-Algo II") above the actual traffic data; 1/01/05-1/02/05, n=150, Q=65, NN=4, N=10, M=10.]

SLIDE 29

Example – 1/1/05, 12:20 pm

  • Large data transfers from IPs 145.146.96 and 192.31.120 drastically increase flows through Chicago and NYC.

[Figure: local dimension estimates ("Modified Knn-Algo II") and actual traffic around the 12:20 pm event; n=150, Q=65, NN=4, N=10, M=10.]

SLIDE 30
  • 5. Conclusions
  • Classification constraints can be included in manifold-learning dimension reduction algorithms
  • kNNGs jointly estimate the dimension and entropy of high-dimensional data
  • Dimension can be used as a discriminant in anomaly detection
  • These estimates can serve as a precursor to model reduction and database compression
  • The methods suffer only from the curse of intrinsic dimensionality, not the ambient dimension

SLIDE 31

References

  • J. Costa, N. Patwari and A. O. Hero, "Distributed multidimensional scaling with adaptive weighting for node localization in sensor networks," ACM Journal on Networking, to appear, 2005. http://www.eecs.umich.edu/~hero/Preprints/wmds_v9.pdf
  • J. Costa and A. O. Hero, "Geodesic entropic graphs for dimension and entropy estimation in manifold learning," IEEE Trans. on Signal Processing, vol. 52, no. 8, pp. 2210-2221, Aug. 2004. http://www.eecs.umich.edu/~hero/Preprints/sp_mlsi_final_twocolumn.pdf
  • J. A. Costa, A. Girotra and A. O. Hero, "Estimating Local Intrinsic Dimension with k-Nearest Neighbor Graphs," IEEE Workshop on Statistical Signal Processing (SSP), Bordeaux, July 2005. http://www.eecs.umich.edu/~hero/Preprints/ssp_2005_final_1.pdf
  • J. Costa and A. O. Hero, "Classification constrained dimensionality reduction," Proc. of ICASSP, Philadelphia, March 2005. http://www.eecs.umich.edu/~hero/Preprints/costa_icassp2005.pdf