Dimensionality Reduction Lecture 23 - David Sontag, New York University - PowerPoint PPT Presentation



SLIDE 1

Dimensionality Reduction Lecture 23

David Sontag New York University

Slides adapted from Carlos Guestrin and Luke Zettlemoyer

SLIDE 2

Dimensionality reduction

  • Input data may have thousands or millions of dimensions!
    – e.g., text data has ???, images have ???
  • Dimensionality reduction: represent data with fewer dimensions
    – easier learning – fewer parameters
    – visualization – show high dimensional data in 2D
    – discover “intrinsic dimensionality” of data
      • high dimensional data that is truly lower dimensional
      • noise reduction
SLIDE 3

Dimension reduction

Assumption: data (approximately) lies on a lower dimensional space

Examples: n = 2, k = 1; n = 3, k = 2

Slide from Yi Zhang

SLIDE 4

Example (from Bishop)

  • Suppose we have a dataset of digits (“3”) perturbed in various ways:
  • What operations did I perform? What is the data’s intrinsic dimensionality?
  • Here the underlying manifold is nonlinear
SLIDE 5

Lower dimensional projections

  • Obtain new feature vector by transforming the original features x1 … xn
  • New features are linear combinations of the old ones:

    z_1 = w_0^{(1)} + \sum_i w_i^{(1)} x_i, \quad \ldots, \quad z_k = w_0^{(k)} + \sum_i w_i^{(k)} x_i

  • Reduces dimension when k < n
  • This is typically done in an unsupervised setting
    – just X, but no Y
  • In general the transformation will not be invertible – cannot go from z back to x
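
As a concrete illustration of such a linear map, here is a minimal NumPy sketch (not from the slides; the data and the weight matrix W are made up) that turns n = 5 original features into k = 2 new ones:

```python
import numpy as np

# Minimal sketch: project n-dimensional points down to k dimensions
# with an arbitrary weight matrix W, illustrating z = W x.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))      # 100 points, n = 5 original features
k = 2
W = rng.normal(size=(k, 5))        # one weight vector w^(j) per new feature

Z = X @ W.T                        # each row is (z_1, ..., z_k) for one point
print(Z.shape)                     # (100, 2): fewer dimensions than X

# W is 2x5, so the map is not invertible: many x's give the same z.
```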

SLIDE 6

Which projection is better?

From notes by Andrew Ng

SLIDE 7

Reminder: Vector Projections

  • Basic definitions:
    – A·B = |A||B| cos θ
  • Assume |B| = 1 (unit vector)
    – A·B = |A| cos θ
    – So, the dot product is the length of the projection!
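
A quick numerical check of this identity (a toy example with arbitrarily chosen vectors A and B, not from the slides):

```python
import numpy as np

# With |B| = 1, A.B equals the length of A's projection onto B.
A = np.array([3.0, 4.0])
B = np.array([1.0, 0.0])           # unit vector along the x-axis

proj_length = A @ B                # = |A| cos(theta)
print(proj_length)                 # 3.0, the x-component of A
```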

SLIDE 8

Using a new basis for the data

  • Project a point into a (lower dimensional) space:
    – point: x = (x1, …, xn)
    – select a basis – a set of unit (length 1) basis vectors (u1, …, uk)
      • we consider an orthonormal basis:
        – uj•uj = 1, and uj•ul = 0 for j ≠ l
    – select a center – x̄, which defines the offset of the space
    – the best coordinates in the lower dimensional space are defined by dot products: (z1, …, zk), with z_j^(i) = (x^(i) - x̄)•uj
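
A small sketch of this projection step in NumPy (illustrative only: the data are random and the orthonormal basis is simply the first two coordinate axes):

```python
import numpy as np

# Center the data at x_bar, then take dot products with k orthonormal
# basis vectors u_1, ..., u_k to get the coordinates (z_1, ..., z_k).
X = np.random.default_rng(1).normal(size=(50, 4))   # 50 points in R^4
x_bar = X.mean(axis=0)
U = np.eye(4)[:, :2]                # columns u_1, u_2; orthonormal by construction

Z = (X - x_bar) @ U                 # row i holds (z_1, ..., z_k) for point x^(i)
print(Z.shape)                      # (50, 2)
```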

SLIDE 9

Maximize variance of projection

Let x^{(i)} be the i-th data point minus the mean. Choose unit-length u (||u|| = 1) to maximize:

\frac{1}{m}\sum_{i=1}^{m} \big(x^{(i)T} u\big)^2 \;=\; \frac{1}{m}\sum_{i=1}^{m} u^T x^{(i)} x^{(i)T} u \;=\; u^T \Big(\frac{1}{m}\sum_{i=1}^{m} x^{(i)} x^{(i)T}\Big) u \;=\; u^T \Sigma\, u

where Σ is the covariance matrix. Using the method of Lagrange multipliers, one can show that the solution is given by the principal eigenvector of the covariance matrix! (shown on board)
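
The claim can be checked numerically. The sketch below (NumPy, synthetic data; not the board derivation) compares the projected variance u^T Σ u for the principal eigenvector against an arbitrary unit vector:

```python
import numpy as np

# The unit vector maximizing the projected variance u^T Sigma u is the
# principal eigenvector of the covariance matrix Sigma.
rng = np.random.default_rng(2)
X = rng.normal(size=(500, 3)) @ np.diag([3.0, 1.0, 0.3])  # anisotropic data
Xc = X - X.mean(axis=0)
Sigma = Xc.T @ Xc / len(Xc)

eigvals, eigvecs = np.linalg.eigh(Sigma)       # eigenvalues in ascending order
u_star = eigvecs[:, -1]                        # principal eigenvector

u_rand = rng.normal(size=3)
u_rand /= np.linalg.norm(u_rand)               # some other unit vector

print(u_star @ Sigma @ u_star)                 # = largest eigenvalue
print(u_rand @ Sigma @ u_rand)                 # <= the value above
```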

SLIDE 10

Basic PCA algorithm

  • Start from the m by n data matrix X
  • Recenter: subtract the mean from each row of X
    – Xc ← X – X̄
  • Compute the covariance matrix:
    – Σ ← (1/m) Xc^T Xc
  • Find the eigenvectors and eigenvalues of Σ
  • Principal components: the k eigenvectors with the highest eigenvalues

[Pearson, 1901; Hotelling, 1933]
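
A direct NumPy sketch of this algorithm, assuming X is an m x n array (the function name basic_pca is just for illustration):

```python
import numpy as np

def basic_pca(X, k):
    Xc = X - X.mean(axis=0)              # recenter: subtract the mean row
    Sigma = Xc.T @ Xc / X.shape[0]       # covariance matrix (1/m) Xc^T Xc
    eigvals, eigvecs = np.linalg.eigh(Sigma)
    order = np.argsort(eigvals)[::-1]    # sort eigenvalues high to low
    components = eigvecs[:, order[:k]]   # k eigenvectors with highest eigenvalues
    return components, eigvals[order[:k]]

X = np.random.default_rng(3).normal(size=(200, 10))
U, lam = basic_pca(X, k=3)
print(U.shape, lam)                      # (10, 3) and the top-3 eigenvalues
```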

SLIDE 11

PCA example

[Figures: Data, Projection, Reconstruction]

SLIDE 12

Dimensionality reduction with PCA


In high-dimensional problems, data usually lies near a linear subspace, as noise introduces only small variability. Only keep the data's projections onto the principal components with large eigenvalues; the components of lesser significance can be ignored. You might lose some information, but if the eigenvalues are small, you don't lose much.

[Figure: bar chart of the variance (%) captured by each of the components PC1–PC10]

Slide from Aarti Singh

Variance captured by dimension z_j:

\mathrm{var}(z_j) \;=\; \frac{1}{m}\sum_{i=1}^{m} \big(z_j^{(i)}\big)^2 \;=\; \frac{1}{m}\sum_{i=1}^{m} \big(x^{(i)} \cdot u_j\big)^2 \;=\; \lambda_j

Percentage of total variance captured by dimension z_j, for j = 1 to 10:

\frac{\lambda_j}{\sum_{l=1}^{n} \lambda_l}
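
A short sketch of how these fractions might be computed from the eigenvalues of the covariance matrix (NumPy, synthetic data):

```python
import numpy as np

# Fraction of total variance captured by each principal direction,
# computed from the eigenvalues of the covariance matrix.
X = np.random.default_rng(4).normal(size=(300, 10)) * np.arange(1, 11)
Xc = X - X.mean(axis=0)
lam = np.linalg.eigvalsh(Xc.T @ Xc / len(Xc))[::-1]   # eigenvalues, descending

frac = lam / lam.sum()
print(np.round(100 * frac, 1))        # variance (%) per component, PC1..PC10
print(np.cumsum(frac))                # how many PCs are needed to reach, e.g., 95%
```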

SLIDE 13

Eigenfaces [Turk, Pentland ’91]

  • Input images:

Principal components:

SLIDE 14

Eigenfaces reconstruction

  • Each image corresponds to adding together (weighted versions of) the principal components:
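
A rough sketch of that reconstruction in NumPy (random vectors stand in for face images; the variable names are illustrative):

```python
import numpy as np

# An image is the mean face plus a weighted sum of principal components.
rng = np.random.default_rng(5)
X = rng.normal(size=(100, 64))            # 100 "images", 64 pixels each
mean_face = X.mean(axis=0)
Xc = X - mean_face

# principal components via eigendecomposition of the covariance matrix
_, eigvecs = np.linalg.eigh(Xc.T @ Xc / len(Xc))
U = eigvecs[:, ::-1][:, :10]              # top-10 components, as columns

z = Xc[0] @ U                             # weights for the first image
reconstruction = mean_face + U @ z        # add together weighted components
print(np.linalg.norm(X[0] - reconstruction))  # error shrinks as more components are kept
```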

SLIDE 15

Scaling up

  • Covariance matrix can be really big!

– Σ is n by n – 10000 features can be common! – finding eigenvectors is very slow…

  • Use singular value decomposi9on (SVD)

– Finds k eigenvectors – great implementa9ons available, e.g., Matlab svd

SLIDE 16

SVD

  • Write X = Z S U^T
    – X ← data matrix, one row per datapoint
    – S ← singular value matrix, a diagonal matrix with entries σ_i
      • relationship between the singular values of X and the eigenvalues of Σ: λ_i = σ_i^2 / m
    – Z ← weight matrix, one row per datapoint
      • Z times S gives the coordinates of x_i in the eigenspace
    – U^T ← singular vector matrix
      • in our setting, each row is an eigenvector u_j
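
A small sketch verifying this relationship with NumPy's SVD (synthetic data; numpy.linalg.svd returns the factors in the order Z, singular values, U^T):

```python
import numpy as np

# Check lambda_i = sigma_i^2 / m on centered data.
rng = np.random.default_rng(6)
X = rng.normal(size=(200, 5))
Xc = X - X.mean(axis=0)
m = Xc.shape[0]

Z, s, Ut = np.linalg.svd(Xc, full_matrices=False)   # Xc = Z diag(s) Ut
lam = np.linalg.eigvalsh(Xc.T @ Xc / m)[::-1]        # eigenvalues, descending

print(np.allclose(s**2 / m, lam))                    # True
# (Z * s) gives each point's coordinates in the eigenspace; rows of Ut are the u_j.
```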
SLIDE 17

PCA using SVD algorithm

  • Start from the m by n data matrix X
  • Recenter: subtract the mean from each row of X
    – Xc ← X – X̄
  • Call an SVD algorithm on Xc – ask for k singular vectors
  • Principal components: the k singular vectors with the highest singular values (rows of U^T)
    – Coefficients: project each point onto the new vectors
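
A compact NumPy sketch of this procedure (np.linalg.svd in place of Matlab's svd; the function name pca_svd is just for illustration):

```python
import numpy as np

def pca_svd(X, k):
    Xc = X - X.mean(axis=0)                  # recenter
    Z, s, Ut = np.linalg.svd(Xc, full_matrices=False)
    components = Ut[:k]                      # k top singular vectors (rows of U^T)
    coeffs = Xc @ components.T               # project each point onto them
    return components, coeffs

X = np.random.default_rng(7).normal(size=(150, 8))
components, coeffs = pca_svd(X, k=2)
print(components.shape, coeffs.shape)        # (2, 8) and (150, 2)
```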

SLIDE 18

Non-linear methods


  • Linear
    – Principal Component Analysis (PCA)
    – Factor Analysis
    – Independent Component Analysis (ICA)
  • Nonlinear
    – Laplacian Eigenmaps
    – ISOMAP
    – Local Linear Embedding (LLE)

Slide from Aarti Singh

SLIDE 19

Isomap

Goal: use the geodesic distance between points (with respect to the manifold).

  • Estimate the manifold using a graph. The distance between points is given by the length of the shortest path.
  • Embed onto a 2D plane so that Euclidean distance approximates the graph distance.

[Tenenbaum, de Silva, Langford. Science 2000]

SLIDE 20

Isomap

Table 1. The Isomap algorithm takes as input the distances dX(i,j) between all pairs i, j of N data points in the high-dimensional input space X, measured either in the standard Euclidean metric (as in Fig. 1A) or in some domain-specific metric (as in Fig. 1B). The algorithm outputs coordinate vectors y_i in a d-dimensional Euclidean space Y that (according to Eq. 1) best represent the intrinsic geometry of the data. The only free parameter (ε or K) appears in Step 1.

Step 1 – Construct neighborhood graph: Define the graph G over all data points by connecting points i and j if [as measured by dX(i,j)] they are closer than ε (ε-Isomap), or if i is one of the K nearest neighbors of j (K-Isomap). Set edge lengths equal to dX(i,j).

Step 2 – Compute shortest paths: Initialize dG(i,j) = dX(i,j) if i, j are linked by an edge; dG(i,j) = ∞ otherwise. Then for each value of k = 1, 2, …, N in turn, replace all entries dG(i,j) by min{dG(i,j), dG(i,k) + dG(k,j)}. The matrix of final values DG = {dG(i,j)} will contain the shortest-path distances between all pairs of points in G (16, 19).

Step 3 – Construct d-dimensional embedding: Let λ_p be the p-th eigenvalue (in decreasing order) of the matrix τ(DG) (17), and v_p^i be the i-th component of the p-th eigenvector. Then set the p-th component of the d-dimensional coordinate vector y_i equal to √λ_p · v_p^i.
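
The three steps above can be sketched compactly in NumPy (K-Isomap variant; purely illustrative: it assumes the neighborhood graph is connected and uses the standard classical-MDS centering τ(D) = -H D² H / 2 for Step 3):

```python
import numpy as np

def isomap(X, n_neighbors=5, d=2):
    N = len(X)
    dX = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

    # Step 1: neighborhood graph (connect each point to its K nearest neighbors)
    dG = np.full((N, N), np.inf)
    np.fill_diagonal(dG, 0.0)
    for i in range(N):
        for j in np.argsort(dX[i])[1:n_neighbors + 1]:
            dG[i, j] = dG[j, i] = dX[i, j]

    # Step 2: shortest paths (Floyd-Warshall, exactly as in the table;
    # assumes the graph is connected so no infinities remain)
    for k in range(N):
        dG = np.minimum(dG, dG[:, k:k + 1] + dG[k:k + 1, :])

    # Step 3: d-dimensional embedding via top eigenvectors of tau(DG)
    H = np.eye(N) - np.ones((N, N)) / N
    tau = -H @ (dG ** 2) @ H / 2
    eigvals, eigvecs = np.linalg.eigh(tau)
    top = np.argsort(eigvals)[::-1][:d]
    return eigvecs[:, top] * np.sqrt(eigvals[top])

Y = isomap(np.random.default_rng(8).normal(size=(60, 3)))
print(Y.shape)   # (60, 2)
```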

SLIDE 21

Isomap

[Tenenbaum, de Silva, Langford. Science 2000]

SLIDE 22

Isomap

[Tenenbaum, de Silva, Langford. Science 2000]

SLIDE 23

Isomap

[Figure: residual variance vs. number of dimensions for face images and Swiss roll data, comparing PCA and Isomap]

SLIDE 24

What you need to know

  • Dimensionality reduction
    – why and when it's important
  • Principal component analysis
    – minimizing reconstruction error
    – relationship to the covariance matrix and eigenvectors
    – using SVD
  • Non-linear dimensionality reduction