Dimensionality Reduction
Aarti Singh Machine Learning 10-701/15-781 Nov 17, 2010
Slides Courtesy: Tom Mitchell, Eric Xing, Lawrence Saul
High-Dimensional Data
High dimensions = lots of features
Document classification: features per document = thousands of words/unigrams, millions of bigrams, plus contextual information
Surveys (Netflix): 480,189 users x 17,770 movies
Discovering gene networks: 10,000 genes x 1,000 drugs x several species
MEG brain imaging: 120 locations x 500 time points x 20 objects
– Redundant features (not all words are useful for classifying a document) add more noise than signal
– Hard to interpret and visualize
– Hard to store and process the data (computationally challenging)
– Complexity of the decision rule tends to grow with the number of features; complex rules are hard to learn as the VC dimension increases (statistically challenging)
“Unrolling the swiss roll”
A lower-dimensional embedding can give a more efficient representation than the observed features.
[Figure: data plotted against features X1, X2, X3; feature X3 is irrelevant]
Common subset selection methods:
Embedded approach: integrate feature selection into the learning objective by penalizing the number of features with non-zero weights.
Penalty choices: the L0 penalty directly minimizes the number of features chosen, the L2 penalty only yields small weights for the features chosen (no exact zeros), and the L1 penalty is the convex compromise that still drives many weights exactly to zero.
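A minimal sketch of this embedded approach using scikit-learn's L1-penalized linear regression (Lasso); the synthetic data and the regularization strength alpha are illustrative assumptions, not values from the lecture:

import numpy as np
from sklearn.linear_model import Lasso

# Illustrative synthetic data: 100 samples, 20 features, only the first 3 informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
true_w = np.zeros(20)
true_w[:3] = [2.0, -1.5, 1.0]
y = X @ true_w + 0.1 * rng.normal(size=100)

# The L1 penalty drives many weights exactly to zero, i.e. it performs feature selection.
model = Lasso(alpha=0.1).fit(X, y)
print("features with non-zero weight:", np.flatnonzero(model.coef_))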
Combinations of observed features can provide a more efficient representation and capture the underlying relations that govern the data.
E.g., ego, personality, and intelligence are hidden attributes that characterize human behavior, instead of individual survey questions; topics (sports, science, news, etc.) instead of individual documents.
These latent features often may not have a physical meaning.
Linear latent-variable models: Principal Component Analysis (PCA), Factor Analysis, Independent Component Analysis (ICA)
Nonlinear (manifold learning): Laplacian Eigenmaps, ISOMAP, Local Linear Embedding (LLE)
[Figure: two scatter plots of the same data. With one choice of axes both features are relevant; with a rotated choice of axes only one feature is relevant.]
Can we transform the features so that we only need to preserve one latent feature? Find a linear projection so that the projected data are uncorrelated.
Assumption: data lies on or near a low d-dimensional linear subspace. The axes of this subspace are an effective representation of the data. Identifying these axes is known as Principal Component Analysis (PCA), and they can be obtained by eigendecomposition or singular value decomposition (SVD).
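A minimal numpy sketch of the SVD route (the random data and the choice of d = 2 are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))   # illustrative data, samples in rows

Ac = A - A.mean(axis=0)                  # center the data
U, S, Vt = np.linalg.svd(Ac, full_matrices=False)
# With samples in rows (numpy convention) the principal directions are the rows of Vt;
# with samples in columns, as on these slides, they are the left singular vectors.
components = Vt[:2]                      # top d = 2 principal directions
Z = Ac @ components.T                    # d-dimensional representation of each sample
print(Z.shape)                           # (500, 2)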
Principal Components (PCs) are orthogonal directions that capture most of the variance in the data.
1st PC – the direction of greatest variability in the data. Projections of the data points along the 1st PC discriminate the data more than along any other single direction.
Take a data point xi (a D-dimensional vector); the projection of xi onto the 1st PC v is vᵀxi.
[Figure: data point xi projected onto the direction v, giving the scalar vᵀxi]
Principal Components (PCs) are orthogonal directions that capture most of the variance in the data.
1st PC – the direction of greatest variability in the data.
2nd PC – the next orthogonal (uncorrelated) direction of greatest variability (remove all variability along the first direction, then find the next direction of greatest variability).
And so on …
[Figure: data point xi, its projection vᵀxi along the 1st PC, and the residual xi − (vᵀxi)v from which the 2nd PC is found]
Let v1, v2, …, vd denote the principal components.
They are orthogonal and unit norm: viᵀvj = 0 for i ≠ j, and viᵀvi = 1.
Find the vector that maximizes the sample variance of the projection, (1/n) Σi (vᵀxi)² = (1/n) vᵀXXᵀv. Assume the data are centered, with data points X = [x1 x2 … xn].
Maximize vᵀXXᵀv subject to vᵀv = 1. Wrap the constraint into the objective using a Lagrange multiplier:
max over v of  vᵀXXᵀv − λ(vᵀv − 1).
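Setting the gradient of this Lagrangian to zero gives the eigenvalue equation (a standard step, written out here in LaTeX):

\[
\frac{\partial}{\partial v}\Big[\, v^\top X X^\top v \;-\; \lambda\,(v^\top v - 1) \,\Big]
  \;=\; 2\,X X^\top v \;-\; 2\lambda\, v \;=\; 0
  \quad\Longrightarrow\quad X X^\top v \;=\; \lambda\, v .
\]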
Therefore, v is an eigenvector of the sample correlation/covariance matrix XXᵀ.
Sample variance of the projection = vᵀXXᵀv = λ vᵀv = λ.
Thus, the eigenvalue λ denotes the amount of variability captured along that dimension (aka the amount of energy along that dimension).
Order the eigenvalues λ1 > λ2 > λ3 > …
The 1st principal component v1 is the eigenvector of the sample covariance matrix XXᵀ associated with the largest eigenvalue λ1.
The 2nd principal component v2 is the eigenvector of the sample covariance matrix XXᵀ associated with the second largest eigenvalue λ2.
And so on …
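A minimal numpy sketch of exactly this computation (the random data is an illustrative assumption); samples are stored as columns of X to match the convention above:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 200))            # D = 3 features, n = 200 samples (as columns)
X = X - X.mean(axis=1, keepdims=True)    # center the data

eigvals, eigvecs = np.linalg.eigh(X @ X.T)   # eigh handles the symmetric matrix XX^T
order = np.argsort(eigvals)[::-1]            # sort eigenvalues in decreasing order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

v1 = eigvecs[:, 0]                       # 1st principal component (largest eigenvalue)
projections = v1 @ X                     # v^T x_i for every data point
print(projections.var() * X.shape[1])    # equals the largest eigenvalue lambda_1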
Eigenvectors are solutions of the equation XXᵀv = λv, i.e. (XXᵀ − λI)v = 0.
A non-zero solution v ≠ 0 is possible only if det(XXᵀ − λI) = 0. This is the characteristic equation: a Dth-order polynomial in λ, so it has at most D distinct solutions (roots), the eigenvalues.
Once the eigenvalues are computed, solve for the eigenvectors (principal components) using (XXᵀ − λI)v = 0.
For symmetric matrices, eigenvectors corresponding to distinct eigenvalues are orthogonal.
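A small worked instance with an illustrative 2 x 2 matrix (not from the lecture):

\[
XX^\top = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}, \qquad
\det\begin{pmatrix} 2-\lambda & 1 \\ 1 & 2-\lambda \end{pmatrix}
  = (2-\lambda)^2 - 1 = 0
  \;\Longrightarrow\; \lambda_1 = 3,\; \lambda_2 = 1,
\]
with orthogonal eigenvectors \( v_1 = \tfrac{1}{\sqrt{2}}(1, 1)^\top \) and \( v_2 = \tfrac{1}{\sqrt{2}}(1, -1)^\top \).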
So, the new axes are the eigenvectors of the matrix of sample correlations XXᵀ of the data; they capture the similarities of the original features in terms of how the data samples project onto them. The transformed features are uncorrelated.
– Linear transformation
Two equivalent views of PCA:
Maximum variance subspace: PCA finds vectors v such that projections onto them capture the maximum variance in the data.
Minimum reconstruction error: PCA finds vectors v such that projection onto them yields the minimum MSE reconstruction of the data.
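The equivalence follows from a one-line identity for a unit-norm direction v and centered data:

\[
\sum_i \big\| x_i - (v^\top x_i)\, v \big\|^2
  = \sum_i \|x_i\|^2 \;-\; \sum_i (v^\top x_i)^2 ,
\]
so minimizing the mean squared reconstruction error over unit vectors v is the same as maximizing the projected variance \( \sum_i (v^\top x_i)^2 \).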
The eigenvalue λ denotes the amount of variability captured along that dimension. Zero eigenvalues indicate no variability along those directions => the data lies exactly on a linear subspace.
Only keep data projections onto the principal components with non-zero eigenvalues, say v1, …, vd, where d = rank(XXᵀ).
Original representation: data point xi = [xi1, xi2, …, xiD] (a D-dimensional vector).
Transformed representation: projections [v1ᵀxi, v2ᵀxi, …, vdᵀxi] (a d-dimensional vector).
In high-dimensional problems, data usually lies near a linear subspace, since noise introduces only small variability in the remaining directions. Only keep data projections onto the principal components with large eigenvalues; the components of lesser significance can be ignored. You might lose some information, but if the eigenvalues are small, you don't lose much.
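A minimal sketch of this truncation step, continuing the numpy example above; the 95% variance-retention threshold is an illustrative assumption:

import numpy as np

def reduce_dimension(X, retain=0.95):
    """Project centered data X (features x samples) onto the fewest principal
    components whose eigenvalues capture at least `retain` of the total variance."""
    Xc = X - X.mean(axis=1, keepdims=True)
    eigvals, eigvecs = np.linalg.eigh(Xc @ Xc.T)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    captured = np.cumsum(eigvals) / eigvals.sum()    # cumulative fraction of variance
    d = int(np.searchsorted(captured, retain)) + 1   # smallest d reaching the threshold
    return eigvecs[:, :d].T @ Xc                     # d x n transformed representation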
[Figure: scree plot showing the variance (%) captured by each principal component, PC1 through PC10]