
  1. Machine Learning 2 DS 4420 - Spring 2020 Dimensionality reduction 2 Byron C Wallace

  2. Today • A bit of wrap-up on PCA • Then: Non-linear dimensionality reduction! (SNE/t-SNE)

  3. In Sum: Principal Component Analysis. Data: X = (x_1 · · · x_n) ∈ R^{d×n}. Eigenvectors of the covariance matrix, with eigenvalues Λ = diag(λ_1, λ_2, . . . , λ_d). Idea: Take the top-k eigenvectors to maximize variance.
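As a concrete companion to this summary, here is a minimal NumPy sketch of PCA via the covariance eigendecomposition; the random data matrix, the choice of k, and all variable names are illustrative assumptions rather than anything from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 100))            # d = 5 features, n = 100 samples stored as columns
k = 2                                    # number of principal components to keep

Xc = X - X.mean(axis=1, keepdims=True)   # center each feature
S = Xc @ Xc.T / Xc.shape[1]              # d x d covariance matrix
eigvals, eigvecs = np.linalg.eigh(S)     # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]        # re-order by decreasing variance
U_k = eigvecs[:, order[:k]]              # top-k eigenvectors (principal components)
Z = U_k.T @ Xc                           # k x n compressed representation
```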

  4. Why? Idea: Take the top-k eigenvectors to maximize variance. Last time, we saw that we can derive this by maximizing the variance in the compressed space.

  5. Why? Idea: Take the top-k eigenvectors to maximize variance. Last time, we saw that we can derive this by maximizing the variance in the compressed space. We can also motivate it by explicitly minimizing the reconstruction error.

  6. Minimizing reconstruction error

  7. Getting the eigenvalues, two ways • Direct eigenvalue decomposition of the covariance matrix S = (1/N) ∑_{n=1}^{N} x_n x_n^T = (1/N) X X^T

  8. Getting the eigenvalues, two ways • Direct eigenvalue decomposition of the covariance matrix S = (1/N) ∑_{n=1}^{N} x_n x_n^T = (1/N) X X^T • Singular Value Decomposition (SVD)

  9. Singular Value Decomposition. Idea: Decompose the d × n matrix X into 1. an n × n basis V (unitary matrix), 2. a d × n matrix Σ (diagonal projection), and 3. a d × d basis U (unitary matrix): X = U_{d×d} Σ_{d×n} V^T_{n×n}
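To make the shapes above concrete, here is a small sketch using NumPy's full SVD; the random matrix and its sizes are stand-ins, not anything from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 8
X = rng.normal(size=(d, n))

# full_matrices=True gives U (d x d) and V^T (n x n); s holds the min(d, n) singular values
U, s, Vt = np.linalg.svd(X, full_matrices=True)
Sigma = np.zeros((d, n))
Sigma[:min(d, n), :min(d, n)] = np.diag(s)   # embed the singular values into a d x n matrix

assert np.allclose(U @ Sigma @ Vt, X)        # X = U Sigma V^T
assert np.allclose(U.T @ U, np.eye(d))       # U is unitary
assert np.allclose(Vt @ Vt.T, np.eye(n))     # V is unitary
```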

  10. SVD applies X in three steps: 1. Rotation: V^T x = ∑_{i=1}^{n} ⟨x, v_i⟩ e_i 2. Scaling: Σ V^T x = ∑_{i=1}^{n} s_i ⟨x, v_i⟩ e_i 3. Rotation: U Σ V^T x = ∑_{i=1}^{n} s_i ⟨x, v_i⟩ u_i (where the s_i are the singular values)
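The rotate-scale-rotate reading can be checked numerically. This sketch (assumed variable names, reduced SVD for brevity) verifies that applying X to a vector x is the same as rotating into the v_i basis, scaling by the singular values, and rotating into the u_i basis:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 4, 6
X = rng.normal(size=(d, n))
x = rng.normal(size=n)

U, s, Vt = np.linalg.svd(X, full_matrices=False)   # reduced SVD: r = min(d, n) components

coords = Vt @ x                    # step 1 (rotation): coordinates <x, v_i>
scaled = s * coords                # step 2 (scaling): s_i <x, v_i>
recon  = U @ scaled                # step 3 (rotation): sum_i s_i <x, v_i> u_i

assert np.allclose(recon, X @ x)   # identical to applying X directly
```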

  11. SVD for PCA: X = U Σ V^T, with X of size D × N, U of size D × D, Σ of size D × N, and V^T of size N × N. Then S = (1/N) X X^T = (1/N) U Σ V^T V Σ^T U^T = (1/N) U Σ Σ^T U^T, since V^T V = I.

  12. SVD for PCA: X = U Σ V^T, with X of size D × N, U of size D × D, Σ of size D × N, and V^T of size N × N. Then S = (1/N) X X^T = (1/N) U Σ V^T V Σ^T U^T = (1/N) U Σ Σ^T U^T, since V^T V = I. It turns out the columns of U are the eigenvectors of X X^T.

  13. Computing PCA. Method 1: eigendecomposition. U are the eigenvectors of the covariance matrix C = (1/n) X X^T. Computing C already takes O(nd^2) time (very expensive). Method 2: singular value decomposition (SVD). Find X = U_{d×d} Σ_{d×n} V^T_{n×n} where U^T U = I_{d×d}, V^T V = I_{n×n}, and Σ is diagonal. Computing the top k singular vectors takes only O(ndk) time.

  14. Computing PCA. Method 1: eigendecomposition. U are the eigenvectors of the covariance matrix C = (1/n) X X^T. Computing C already takes O(nd^2) time (very expensive). Method 2: singular value decomposition (SVD). Find X = U_{d×d} Σ_{d×n} V^T_{n×n} where U^T U = I_{d×d}, V^T V = I_{n×n}, and Σ is diagonal. Computing the top k singular vectors takes only O(ndk) time. Relationship between eigendecomposition and SVD: the left singular vectors are the principal components (C = (1/n) U Σ^2 U^T).
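A sanity check of the relationship on this slide, with made-up data: the left singular vectors of X match the eigenvectors of the covariance up to sign, and the eigenvalues equal the squared singular values divided by n. In practice one would use a truncated or randomized SVD to get the O(ndk) behaviour; the full decomposition below is just for checking.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, k = 6, 200, 3
X = rng.normal(size=(d, n))
X -= X.mean(axis=1, keepdims=True)

# Method 1: eigendecomposition of the covariance (forming C already costs O(n d^2))
C = X @ X.T / n
w, V = np.linalg.eigh(C)
V = V[:, np.argsort(w)[::-1][:k]]          # top-k eigenvectors

# Method 2: SVD of X directly
U, s, _ = np.linalg.svd(X, full_matrices=False)
U = U[:, :k]                               # top-k left singular vectors

# Same subspace: columns agree up to sign, eigenvalues are s^2 / n
assert np.allclose(np.abs(U.T @ V), np.eye(k), atol=1e-6)
assert np.allclose(np.sort(w)[::-1][:k], (s ** 2 / n)[:k])
```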

  15. Eigen-faces [Turk & Pentland 1991] • d = number of pixels • Each x_i ∈ R^d is a face image • x_{ji} = intensity of the j-th pixel in image i

  16. Eigen-faces [Turk & Pentland 1991] • d = number of pixels • Each x_i ∈ R^d is a face image • x_{ji} = intensity of the j-th pixel in image i • Factorize X_{d×n} ≈ U_{d×k} Z_{k×n}, where the columns of Z are the codes (z_1 . . . z_n)

  17. Eigen-faces [Turk & Pentland 1991] • d = number of pixels • Each x_i ∈ R^d is a face image • x_{ji} = intensity of the j-th pixel in image i • Factorize X_{d×n} ≈ U_{d×k} Z_{k×n}, where the columns of Z are the codes (z_1 . . . z_n) • Idea: z_i is a more “meaningful” representation of the i-th face than x_i • Can use z_i for nearest-neighbor classification

  18. Eigen-faces [Turk & Pentland 1991] • d = number of pixels • Each x_i ∈ R^d is a face image • x_{ji} = intensity of the j-th pixel in image i • Factorize X_{d×n} ≈ U_{d×k} Z_{k×n}, where the columns of Z are the codes (z_1 . . . z_n) • Idea: z_i is a more “meaningful” representation of the i-th face than x_i • Can use z_i for nearest-neighbor classification • Much faster: O(dk + nk) time instead of O(dn) when n, d ≫ k
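A hedged sketch of the eigenfaces pipeline described above: project images onto the top-k components and do nearest-neighbor lookup in the compressed z-space. The random "images", image size, and k are placeholders, not the Turk & Pentland data.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, k = 32 * 32, 500, 40                    # d pixels per image, n training faces, k eigenfaces
X = rng.normal(size=(d, n))                   # columns are (fake) face images

mean_face = X.mean(axis=1, keepdims=True)
U, s, _ = np.linalg.svd(X - mean_face, full_matrices=False)
U_k = U[:, :k]                                # d x k eigenfaces
Z = U_k.T @ (X - mean_face)                   # k x n codes z_i

def nearest_face(x_new):
    """Index of the training face closest to x_new in eigenface (z) space."""
    z_new = U_k.T @ (x_new.reshape(-1, 1) - mean_face)   # k-dimensional code
    return int(np.argmin(np.linalg.norm(Z - z_new, axis=0)))

print(nearest_face(X[:, 7]))                  # prints 7: an image matches itself
```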

  19. Aside: How many components? • The magnitude of the eigenvalues indicates the fraction of variance captured. • Eigenvalues on a face image dataset: [plot of λ_i versus component index i] • Eigenvalues typically drop off sharply, so we don’t need that many. • Of course, variance isn’t everything...
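One common way to operationalize "fraction of variance captured" is the cumulative eigenvalue ratio. A small sketch follows; the 95% threshold is an illustrative choice, the leading eigenvalues echo the tick marks on the slide's plot, and the trailing values are made up to round out the example.

```python
import numpy as np

def choose_k(eigvals, threshold=0.95):
    """Smallest k whose top-k eigenvalues capture `threshold` of the total variance."""
    eigvals = np.sort(eigvals)[::-1]
    frac = np.cumsum(eigvals) / eigvals.sum()
    return int(np.searchsorted(frac, threshold) + 1)

# Rapidly decaying spectrum, as on the slide's face-image example
print(choose_k(np.array([1353.2, 1086.7, 820.1, 553.6, 287.1, 100.0, 50.0, 10.0])))  # -> 5
```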

  20. Wrapping up PCA • PCA is a linear model for dimensionality reduction that finds a mapping to a lower-dimensional space which maximizes variance • We saw that this is equivalent to performing an eigendecomposition on the covariance matrix of X • Next time: auto-encoders and neural compression for non-linear projections

  22. Non-linear dimensionality reduction

  23. Limitations of Linearity: [two example datasets, one where PCA is effective and one where PCA is ineffective]

  24. Nonlinear PCA. [Figure: a broken (linear PCA) solution versus the desired solution.] We want the desired solution: S = {(x_1, x_2) : x_2 = x_1^2}

  25. Nonlinear PCA. We want the desired solution: S = {(x_1, x_2) : x_2 = x_1^2}. Idea: use kernels. We can get this: S = {φ(x) = Uz} with φ(x) = (x_1^2, x_2)^T. Linear dimensionality reduction in φ(x) space ⇔ nonlinear dimensionality reduction in x space.
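A sketch of the feature-map idea on this slide: generate points near the parabola x_2 = x_1^2, lift them with φ(x) = (x_1^2, x_2), and observe that ordinary linear PCA in φ-space captures the curve with a single component. The data generation and noise level are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.uniform(-2, 2, size=200)
X = np.stack([x1, x1 ** 2 + 0.05 * rng.normal(size=200)])   # points near the parabola x2 = x1^2

Phi = np.stack([X[0] ** 2, X[1]])            # feature map phi(x) = (x1^2, x2)
Phi -= Phi.mean(axis=1, keepdims=True)

U, s, _ = np.linalg.svd(Phi, full_matrices=False)
print(s[1] / s[0])                            # tiny ratio: phi(x) lies almost on a line,
                                              # so one linear component captures the curve
```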

  26. Kernel PCA

  27. Alternatively: t-SNE!

  28. Stochastic Neighbor Embeddings. Borrowing from: Laurens van der Maaten (Delft -> Facebook AI)

  29. Manifold learning. Idea: Perform a non-linear dimensionality reduction in a manner that preserves proximity (but not distances).

  30. Manifold learning

  31. PCA on MNIST digits

  32. t-SNE on MNIST Digits

  33. Swiss roll: Euclidean distance is not always a good notion of proximity.

  34. Non-linear projection Bad projection: relative position to neighbors changes

  35. Non-linear projection Intuition: Want to preserve local neighborhood

  36. Stochastic Neighbor Embedding Original space The map

  37. SNE to t-SNE (on board)

  38. t-SNE: SNE with a t-Distribution. Similarity in high dimension: p_ij = exp(−||x_i − x_j||^2 / 2σ^2) / ∑_{k≠l} exp(−||x_k − x_l||^2 / 2σ^2). Similarity in low dimension: q_ij = (1 + ||y_i − y_j||^2)^{−1} / ∑_{k≠l} (1 + ||y_k − y_l||^2)^{−1}. Gradient: ∂C/∂y_i = 4 ∑_{j≠i} (p_ij − q_ij) (1 + ||y_i − y_j||^2)^{−1} (y_i − y_j).
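The three formulas above translate almost directly into NumPy. The sketch below uses a single global σ for the high-dimensional similarities, matching the simplified formula on the slide (the full method chooses a per-point σ_i via the perplexity); the function name and array shapes are assumptions for illustration.

```python
import numpy as np

def tsne_quantities(X, Y, sigma=1.0):
    """Pairwise affinities p_ij, q_ij and the t-SNE gradient dC/dy_i.

    X: (n, d_high) data points as rows; Y: (n, d_low) current embedding.
    """
    n = X.shape[0]
    off = ~np.eye(n, dtype=bool)                          # mask out the i == j terms

    D_x = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    P = np.exp(-D_x / (2 * sigma ** 2)) * off
    P /= P.sum()                                          # high-dimensional similarities p_ij

    D_y = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    W = (1.0 / (1.0 + D_y)) * off
    Q = W / W.sum()                                       # low-dimensional similarities q_ij

    # dC/dy_i = 4 * sum_j (p_ij - q_ij) (1 + ||y_i - y_j||^2)^-1 (y_i - y_j)
    grad = 4.0 * np.einsum('ij,ijk->ik', (P - Q) * W, Y[:, None, :] - Y[None, :, :])
    return P, Q, grad
```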

  39. Algorithm 1: Simple version of t-Distributed Stochastic Neighbor Embedding.
  Data: data set X = {x_1, x_2, . . . , x_n}; cost function parameter: perplexity Perp; optimization parameters: number of iterations T, learning rate η, momentum α(t).
  Result: low-dimensional data representation Y(T) = {y_1, y_2, . . . , y_n}.
  begin
    compute pairwise affinities p_{j|i} with perplexity Perp (using Equation 1)
    set p_ij = (p_{j|i} + p_{i|j}) / (2n)
    sample initial solution Y(0) = {y_1, y_2, . . . , y_n} from N(0, 10^{−4} I)
    for t = 1 to T do
      compute low-dimensional affinities q_ij (using Equation 4)
      compute gradient δC/δY (using Equation 5)
      set Y(t) = Y(t−1) + η δC/δY + α(t) (Y(t−1) − Y(t−2))
    end
  end
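Putting Algorithm 1 together with the `tsne_quantities` sketch above gives a deliberately simplified training loop. It omits the perplexity search for per-point σ_i, early exaggeration, and adaptive learning rates; the hyperparameter values are illustrative; and it writes the update as an explicit descent step (minus the gradient), whereas Algorithm 1 folds that sign into its notation.

```python
import numpy as np

def simple_tsne(X, n_iter=500, eta=10.0, alpha=0.5, seed=0):
    """Simplified t-SNE loop in the spirit of Algorithm 1 (reuses tsne_quantities above)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    Y = rng.normal(scale=1e-2, size=(n, 2))       # sample initial solution from N(0, 10^-4 I)
    Y_prev = Y.copy()
    for t in range(n_iter):
        _, _, grad = tsne_quantities(X, Y)        # affinities and gradient dC/dY
        # descend the KL gradient, plus a momentum term
        Y_next = Y - eta * grad + alpha * (Y - Y_prev)
        Y_prev, Y = Y, Y_next
    return Y

# e.g.: Y = simple_tsne(np.random.default_rng(1).normal(size=(100, 20)))
```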

  41. set Y(t) = Y(t−1) + η δC/δY + α(t) (Y(t−1) − Y(t−2))

  42. set Y(t) = Y(t−1) + η δC/δY + α(t) (Y(t−1) − Y(t−2)). The η δC/δY term is the regular gradient descent step.

  43. set Y(t) = Y(t−1) + η δC/δY + α(t) (Y(t−1) − Y(t−2)). The α(t) (Y(t−1) − Y(t−2)) term is the “momentum”.
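Spelled out in code, the update on these slides is just the sum of the two labeled pieces; the arrays below are placeholders, purely to make the decomposition explicit.

```python
import numpy as np

eta, alpha_t = 10.0, 0.5
Y      = np.zeros((3, 2))     # current embedding Y(t-1)   (placeholder values)
Y_prev = np.zeros((3, 2))     # previous embedding Y(t-2)  (placeholder values)
grad   = np.ones((3, 2))      # gradient dC/dY             (placeholder values)

gradient_step = eta * grad                 # the "regular gradient descent" piece
momentum_term = alpha_t * (Y - Y_prev)     # the "momentum" piece
Y_next = Y + gradient_step + momentum_term # the full update from Algorithm 1
```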

  44. Figure credit: Bisong, 2019

  45. What’s magical about t? Basically, the gradient has nice properties. [Plot (a): gradient of SNE, as a function of high-dimensional and low-dimensional distance.]

  46. What’s magical about t? Basically, the gradient has nice properties. [Plot (a): gradient of SNE.] Positive gradient -> “attraction” between points

  47. What’s magical about t? [Plots (a) and (c): gradients of SNE and of t-SNE, each as a function of high-dimensional and low-dimensional distance.] Positive gradient -> “attraction” between points
