SLIDE 1

Machine Learning (AIMS) - MT 2018

1. Dimensionality Reduction

Varun Kanade, University of Oxford
November 5, 2018

SLIDE 2

Unsupervised Learning

Training data is of the form $x_1, \ldots, x_N$. Infer properties about the data:
◮ Search: Identify patterns in data
◮ Density Estimation: Learn the underlying distribution generating the data
◮ Clustering: Group similar points together
◮ Today: Dimensionality Reduction

SLIDE 3

Outline

Today, we'll study a technique for dimensionality reduction:
◮ Principal Component Analysis (PCA) identifies a small number of directions which explain most of the variation in the data
◮ PCA can be kernelised
◮ Dimensionality reduction is important both for visualisation and as a preprocessing step before applying other (typically unsupervised) learning algorithms

SLIDE 4

Principal Component Analysis (PCA)

SLIDE 9

PCA: Maximum Variance View

PCA is a linear dimensionality reduction technique: find the directions of maximum variance in the data $(x_i)_{i=1}^N$.

Assume that the data is centered, i.e., $\sum_i x_i = 0$.

SLIDE 10

PCA: Maximum Variance View

PCA is a linear dimensionality reduction technique: find the directions of maximum variance in the data $(x_i)_{i=1}^N$.

Assume that the data is centered, i.e., $\sum_i x_i = 0$.

Find a set of orthogonal vectors $v_1, \ldots, v_k$:
◮ The first principal component (PC) $v_1$ is the direction of largest variance
◮ The second PC $v_2$ is the direction of largest variance orthogonal to $v_1$
◮ The $i$th PC $v_i$ is the direction of largest variance orthogonal to $v_1, \ldots, v_{i-1}$

$V_{D \times k}$ gives the projection $z_i = V^T x_i$ for datapoint $x_i$; $Z = XV$ for the entire dataset.

SLIDE 11

PCA: Maximum Variance View

We are given i.i.d. data $(x_i)_{i=1}^N$; data matrix $X$.

Want to find $v_1 \in \mathbb{R}^D$, $\|v_1\| = 1$, that maximizes $\|Xv_1\|^2$.

SLIDE 12

PCA: Maximum Variance View

We are given i.i.d. data $(x_i)_{i=1}^N$; data matrix $X$.

Want to find $v_1 \in \mathbb{R}^D$, $\|v_1\| = 1$, that maximizes $\|Xv_1\|^2$.

Let $z = Xv_1$, so $z_i = x_i \cdot v_1$. We wish to find $v_1$ so that $\sum_{i=1}^N z_i^2$ is maximised.

$$\sum_{i=1}^N z_i^2 = z^T z = v_1^T X^T X v_1$$

The maximum value attained by $v_1^T X^T X v_1$ for $\|v_1\|^2 = 1$ is the largest eigenvalue of $X^T X$. The argmax is the corresponding eigenvector $v_1$.

SLIDE 13

PCA: Maximum Variance View

We are given i.i.d. data $(x_i)_{i=1}^N$; data matrix $X$.

Want to find $v_1 \in \mathbb{R}^D$, $\|v_1\| = 1$, that maximizes $\|Xv_1\|^2$.

Let $z = Xv_1$, so $z_i = x_i \cdot v_1$. We wish to find $v_1$ so that $\sum_{i=1}^N z_i^2$ is maximised.

$$\sum_{i=1}^N z_i^2 = z^T z = v_1^T X^T X v_1$$

The maximum value attained by $v_1^T X^T X v_1$ for $\|v_1\|^2 = 1$ is the largest eigenvalue of $X^T X$. The argmax is the corresponding eigenvector $v_1$.

Find $v_2, v_3, \ldots, v_k$ that are all successively orthogonal to the previous directions and maximise the (as yet unexplained) variance.
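A minimal NumPy sketch of this maximum-variance view (not from the slides; the synthetic data and variable names are illustrative): centre the data, take the top eigenvector of $X^T X$, and check that it maximises $\|Xv\|^2$ over unit vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))          # N x D data matrix (synthetic)
X = X - X.mean(axis=0)                 # centre the data: sum_i x_i = 0

# First principal component = eigenvector of X^T X with the largest eigenvalue
evals, evecs = np.linalg.eigh(X.T @ X) # eigh returns eigenvalues in ascending order
v1 = evecs[:, -1]

# ||X v1||^2 equals the largest eigenvalue ...
print(np.linalg.norm(X @ v1) ** 2, evals[-1])

# ... and is typically larger than ||X v||^2 for a random unit vector v
v = rng.normal(size=5)
v /= np.linalg.norm(v)
print(np.linalg.norm(X @ v) ** 2)
```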

SLIDE 14

PCA: Best Reconstruction

We have i.i.d. data $(x_i)_{i=1}^N$; data matrix $X$.

Find a $k$-dimensional linear projection that best represents the data.

SLIDE 15

PCA: Best Reconstruction

We have i.i.d. data $(x_i)_{i=1}^N$; data matrix $X$.

Find a $k$-dimensional linear projection that best represents the data.

Suppose $V_k \in \mathbb{R}^{D \times k}$ is such that the columns of $V_k$ are orthogonal. Project the data $X$ on to the subspace defined by $V_k$: $Z = XV_k$.

Minimize the reconstruction error
$$\sum_{i=1}^N \|x_i - V_k V_k^T x_i\|^2$$

SLIDE 16

Principal Component Analysis (PCA)

SLIDE 17

Equivalence between the Two Objectives: One PC Case

Let $v_1$ be the direction of projection. The point $x$ is mapped to $\tilde{x} = (v_1 \cdot x)\,v_1$, where $\|v_1\| = 1$.

SLIDE 18

Equivalence between the Two Objectives: One PC Case

Let $v_1$ be the direction of projection. The point $x$ is mapped to $\tilde{x} = (v_1 \cdot x)\,v_1$, where $\|v_1\| = 1$.

Maximum Variance: Find $v_1$ that maximises $\sum_{i=1}^N (v_1 \cdot x_i)^2$.

Best Reconstruction: Find $v_1$ that minimises:
$$\sum_{i=1}^N \|x_i - \tilde{x}_i\|_2^2 = \sum_{i=1}^N \left( \|x_i\|_2^2 - 2(x_i \cdot \tilde{x}_i) + \|\tilde{x}_i\|_2^2 \right)$$
$$= \sum_{i=1}^N \left( \|x_i\|_2^2 - 2(v_1 \cdot x_i)^2 + (v_1 \cdot x_i)^2 \|v_1\|_2^2 \right)$$
$$= \sum_{i=1}^N \|x_i\|_2^2 - \sum_{i=1}^N (v_1 \cdot x_i)^2$$

So the same $v_1$ satisfies the two objectives.

SLIDE 19

Finding Principal Components: SVD

Let $X$ be the $N \times D$ data matrix.

A pair of singular vectors $u \in \mathbb{R}^N$, $v \in \mathbb{R}^D$ and a singular value $\sigma \in \mathbb{R}^+$ satisfy $\sigma u = Xv$ and $\sigma v = X^T u$.

$v$ is an eigenvector of $X^T X$ with eigenvalue $\sigma^2$; $u$ is an eigenvector of $XX^T$ with eigenvalue $\sigma^2$.

SLIDE 20

Finding Principal Components: SVD

$X = U\Sigma V^T$ (say $N > D$).

Thin SVD: $U$ is $N \times D$, $\Sigma$ is $D \times D$, $V$ is $D \times D$, with $U^T U = V^T V = I_D$. $\Sigma$ is diagonal with $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_D \geq 0$.

The first $k$ principal components are the first $k$ columns of $V$.

Full SVD: $U$ is $N \times N$, $\Sigma$ is $N \times D$, $V$ is $D \times D$; $V$ and $U$ are orthonormal matrices.
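A short sketch of reading the principal components off the thin SVD (synthetic data; names are illustrative): the first $k$ rows of NumPy's `Vt` are the first $k$ principal components, and the projections satisfy $Z = XV_k = U_k\Sigma_k$.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 6))
X = X - X.mean(axis=0)                      # centred N x D data matrix

# Thin SVD: X = U diag(s) Vt, with s in decreasing order
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2
V_k = Vt[:k].T                              # first k principal components (D x k)
Z = X @ V_k                                 # projections on to the first k PCs
print(np.allclose(Z, U[:, :k] * s[:k]))     # True: Z = U_k Sigma_k
```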

SLIDE 21

Algorithm for finding PCs (when N > D)

Constructing the matrix $X^T X$ takes time $O(D^2 N)$. Eigenvectors of $X^T X$ can be computed in time $O(D^3)$.

SLIDE 22

Algorithm for finding PCs (when N > D)

Constructing the matrix $X^T X$ takes time $O(D^2 N)$. Eigenvectors of $X^T X$ can be computed in time $O(D^3)$.

Iterative methods give the top $k$ (right) singular vectors directly (a code sketch follows below):
◮ Initialise $v_0$ to be a random unit-norm vector
◮ Iterative update, until (approximate) convergence: $v_{t+1} = X^T X v_t$, then normalise $v_{t+1} \leftarrow v_{t+1} / \|v_{t+1}\|_2$
◮ The update step only takes $O(ND)$ time (compute $Xv_t$ first, then $X^T(Xv_t)$)
◮ This gives the singular vector corresponding to the largest singular value
◮ Subsequent singular vectors are obtained by choosing $v_0$ orthogonal to the previously identified singular vectors (this needs to be done at each iteration to avoid numerical errors creeping in)
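A minimal sketch of the power iteration described above (synthetic data; the tolerance and iteration cap are illustrative choices, not from the slides):

```python
import numpy as np

def top_right_singular_vector(X, n_iters=500, tol=1e-10):
    """Power iteration for the right singular vector with the largest singular value."""
    rng = np.random.default_rng(0)
    v = rng.normal(size=X.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(n_iters):
        w = X.T @ (X @ v)          # O(ND): compute Xv first, then X^T(Xv)
        w /= np.linalg.norm(w)
        if np.linalg.norm(w - v) < tol:
            v = w
            break
        v = w
    return v

X = np.random.default_rng(2).normal(size=(50, 8))
X = X - X.mean(axis=0)
v1 = top_right_singular_vector(X)

# Compare against the first right singular vector from the SVD (up to sign)
print(np.allclose(np.abs(v1), np.abs(np.linalg.svd(X)[2][0]), atol=1e-5))
```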

SLIDE 23

Algorithm for finding PCs (when D ≫ N)

Constructing the matrix $XX^T$ takes time $O(N^2 D)$. Eigenvectors of $XX^T$ can be computed in time $O(N^3)$.

The eigenvectors give the 'left' singular vectors $u_i$ of $X$. To obtain $v_i$, we use the fact that $v_i = \sigma_i^{-1} X^T u_i$.

The iterative method can be used directly, as in the case when $N > D$.
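A hedged sketch of this $D \gg N$ route (illustrative synthetic data): eigendecompose the $N \times N$ matrix $XX^T$ and recover each right singular vector as $v_i = \sigma_i^{-1} X^T u_i$.

```python
import numpy as np

rng = np.random.default_rng(3)
N, D = 20, 1000                       # far more features than datapoints
X = rng.normal(size=(N, D))
X = X - X.mean(axis=0)

G = X @ X.T                           # N x N matrix, O(N^2 D) to build
evals, U = np.linalg.eigh(G)          # ascending eigenvalues
order = np.argsort(evals)[::-1]       # reorder to descending
evals, U = evals[order], U[:, order]

k = 3
sigmas = np.sqrt(evals[:k])           # singular values of X
V_k = X.T @ U[:, :k] / sigmas         # v_i = sigma_i^{-1} X^T u_i, shape D x k
print(np.allclose(np.linalg.norm(V_k, axis=0), 1.0))   # unit-norm principal components
```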

SLIDE 24

PCA: Reconstruction Error

We have the thin SVD: $X = U\Sigma V^T$. Let $V_k$ be the matrix containing the first $k$ columns of $V$.

Projection on to $k$ PCs: $Z = XV_k = U_k\Sigma_k$, where $U_k$ is the matrix of the first $k$ columns of $U$ and $\Sigma_k$ is the $k \times k$ diagonal submatrix of $\Sigma$ with the top $k$ singular values.

Reconstruction: $\tilde{X} = ZV_k^T = U_k\Sigma_k V_k^T$

Reconstruction error:
$$\sum_{i=1}^N \|x_i - V_k V_k^T x_i\|^2 = \sum_{j=k+1}^D \sigma_j^2$$

SLIDE 25

PCA: Reconstruction Error

We have the thin SVD: $X = U\Sigma V^T$. Let $V_k$ be the matrix containing the first $k$ columns of $V$.

Projection on to $k$ PCs: $Z = XV_k = U_k\Sigma_k$, where $U_k$ is the matrix of the first $k$ columns of $U$ and $\Sigma_k$ is the $k \times k$ diagonal submatrix of $\Sigma$ with the top $k$ singular values.

Reconstruction: $\tilde{X} = ZV_k^T = U_k\Sigma_k V_k^T$

Reconstruction error:
$$\sum_{i=1}^N \|x_i - V_k V_k^T x_i\|^2 = \sum_{j=k+1}^D \sigma_j^2$$

This follows from the following calculations:
$$X = U\Sigma V^T = \sum_{j=1}^D \sigma_j u_j v_j^T, \qquad \tilde{X} = U_k\Sigma_k V_k^T = \sum_{j=1}^k \sigma_j u_j v_j^T, \qquad \|X - \tilde{X}\|_F^2 = \sum_{j=k+1}^D \sigma_j^2$$
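A quick numerical check of this identity (synthetic data; the names are illustrative): the squared reconstruction error from keeping $k$ components equals the sum of the discarded squared singular values.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(60, 10))
X = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 4
V_k = Vt[:k].T
X_rec = X @ V_k @ V_k.T                          # rank-k reconstruction

err = np.sum((X - X_rec) ** 2)                   # sum_i ||x_i - V_k V_k^T x_i||^2
print(np.isclose(err, np.sum(s[k:] ** 2)))       # True: equals sum_{j>k} sigma_j^2
```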

SLIDE 26

Reconstruction of an Image using PCA

SLIDE 27

How many principal components to pick?

SLIDE 28

How many principal components to pick?

Look for an ‘elbow’ in the curve of reconstruction error vs # PCs
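One way to look for that elbow in code (a sketch with synthetic data, not from the slides) is to report the reconstruction error, or equivalently the cumulative fraction of variance explained, as a function of $k$:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 20)) @ rng.normal(size=(20, 20))   # correlated synthetic features
X = X - X.mean(axis=0)

s = np.linalg.svd(X, compute_uv=False)                       # singular values, descending
total = np.sum(s ** 2)
for k in range(len(s) + 1):
    err = np.sum(s[k:] ** 2)                                 # reconstruction error with k PCs
    print(k, f"explained variance: {1 - err / total:.3f}")
```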

SLIDE 29

Application: Eigenfaces

A popular application of PCA for face detection and recognition is known as Eigenfaces:
◮ Face detection: Identify faces in a given image
◮ Face recognition: A classification (or search) problem to identify a certain person

SLIDE 30

Application: Eigenfaces

PCA on a dataset of face images. Each principal component can be thought of as being an 'element' of a face.

Source: http://vismod.media.mit.edu/vismod/demos/facerec/basic.html

SLIDE 31

Application: Eigenfaces

Detection: Each patch of the image can be checked to identify whether there is a face in it.

Recognition: Map all faces in terms of their principal components, then use some distance measure on the projections to find faces that are most like the input image.

Why use PCA for face detection?
◮ Even though images can be large, we can use the $D \gg N$ approach to be efficient
◮ The final model (the PCs) can be quite compact and can fit on cameras and phones
◮ It works very well given the simplicity of the model

SLIDE 32

Application: Latent Semantic Analysis

$X$ is an $N \times D$ matrix, where $D$ is the size of the dictionary; $x_i$ is a vector of word counts (bag of words).

Reconstruction using $k$ eigenvectors: $X \approx ZV_k^T$, where $Z = XV_k$.

$\langle z_i, z_j \rangle$ is probably a better notion of similarity than $\langle x_i, x_j \rangle$.

Non-negative matrix factorisation has a more natural interpretation, but is harder to compute.
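A small sketch of LSA on a toy term-count matrix (everything here, including the counts, `k`, and the similarity measure, is illustrative rather than from the slides): reduce with the top $k$ right singular vectors and compare documents in the reduced space.

```python
import numpy as np

# Toy bag-of-words counts: rows are documents, columns are dictionary words
X = np.array([
    [2, 1, 0, 0, 0],
    [1, 2, 1, 0, 0],
    [0, 0, 0, 2, 1],
    [0, 0, 1, 1, 2],
], dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
V_k = Vt[:k].T
Z = X @ V_k                       # documents in the k-dimensional latent space

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Similarity of documents 0 and 1 in latent space vs. raw count space
print(cos(Z[0], Z[1]), cos(X[0], X[1]))
```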

SLIDE 33

PCA: Beyond Linearity

SLIDE 37

Projection: Linear PCA

SLIDE 38

Projection: Kernel PCA

SLIDE 39

Kernel PCA

Suppose our original data is, for example, $x \in \mathbb{R}^2$. We could perform a degree-2 polynomial basis expansion as:
$$\phi(x) = \left(1,\ \sqrt{2}x_1,\ \sqrt{2}x_2,\ x_1^2,\ x_2^2,\ \sqrt{2}x_1 x_2\right)^T$$

Recall that we can compute the inner products $\phi(x) \cdot \phi(x')$ efficiently using the kernel trick:
$$\phi(x) \cdot \phi(x') = 1 + 2x_1 x_1' + 2x_2 x_2' + x_1^2 (x_1')^2 + x_2^2 (x_2')^2 + 2 x_1 x_2 x_1' x_2'$$
$$= (1 + x_1 x_1' + x_2 x_2')^2 = (1 + x \cdot x')^2 =: \kappa(x, x')$$
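A tiny numeric check of this identity (the test points are arbitrary): the explicit degree-2 feature map and the kernel $(1 + x \cdot x')^2$ give the same inner product.

```python
import numpy as np

def phi(x):
    # Explicit degree-2 polynomial feature map for x in R^2
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

def kappa(x, xp):
    return (1.0 + x @ xp) ** 2

x = np.array([0.3, -1.2])
xp = np.array([2.0, 0.5])
print(phi(x) @ phi(xp), kappa(x, xp))   # identical values
```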

SLIDE 40

Kernel PCA

Suppose we use the feature map $\phi : \mathbb{R}^D \to \mathbb{R}^M$, and let $\phi(X)$ be the $N \times M$ matrix.

We want to find the singular vectors of $\phi(X)$ (eigenvectors of $\phi(X)^T\phi(X)$). However, in general $M \gg N$ (in fact $M$ could be infinite for some kernels).

Instead, we'll find the eigenvectors of $\phi(X)\phi(X)^T$, the kernel matrix.

SLIDE 41

Kernel PCA

Recall that the kernel matrix is:
$$K = \phi(X)\phi(X)^T = \begin{pmatrix} \kappa(x_1, x_1) & \kappa(x_1, x_2) & \cdots & \kappa(x_1, x_N) \\ \kappa(x_2, x_1) & \kappa(x_2, x_2) & \cdots & \kappa(x_2, x_N) \\ \vdots & \vdots & \ddots & \vdots \\ \kappa(x_N, x_1) & \kappa(x_N, x_2) & \cdots & \kappa(x_N, x_N) \end{pmatrix}$$

Let $u \in \mathbb{R}^N$ be an eigenvector of $K$ (a left singular vector of $\phi(X)$). The corresponding principal component $v \in \mathbb{R}^M$ is $\sigma^{-1}\phi(X)^T u$.

We won't express $v$ explicitly; instead we can compute projections of a new datapoint $x_{\mathrm{new}}$ on to the principal component $v$ using the kernel function:
$$\phi(x_{\mathrm{new}})^T v = \sigma^{-1}\phi(x_{\mathrm{new}})^T \phi(X)^T u = \sigma^{-1}\left[\kappa(x_{\mathrm{new}}, x_1), \kappa(x_{\mathrm{new}}, x_2), \cdots, \kappa(x_{\mathrm{new}}, x_N)\right] u$$

So in order to compute projections onto principal components we do not need to store the principal components explicitly!
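A sketch of that projection (the RBF kernel and synthetic data are my own choices; the centering correction from the next slide is omitted here to keep the example short):

```python
import numpy as np

def rbf(a, b, gamma=0.5):
    return np.exp(-gamma * np.sum((a - b) ** 2))

rng = np.random.default_rng(6)
X = rng.normal(size=(30, 2))

# Kernel matrix K = phi(X) phi(X)^T
K = np.array([[rbf(xi, xj) for xj in X] for xi in X])

evals, U = np.linalg.eigh(K)                 # ascending eigenvalues
u1, lam1 = U[:, -1], evals[-1]
sigma1 = np.sqrt(lam1)                       # singular value of phi(X)

x_new = np.array([0.1, -0.4])
k_new = np.array([rbf(x_new, xi) for xi in X])
proj = (k_new @ u1) / sigma1                 # phi(x_new)^T v_1, without forming v_1
print(proj)
```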

SLIDE 42

Kernel PCA

For PCA, we assumed that the data matrix $X$ is centered, i.e., $\sum_i x_i = 0$. However, this is not the case for the matrix $\phi(X)$. Instead we can consider:
$$\tilde{\phi}(x_i) = \phi(x_i) - \frac{1}{N}\sum_{k=1}^N \phi(x_k)$$

The corresponding matrix $\tilde{K}$ is given by the entries
$$\tilde{K}_{ij} = \kappa(x_i, x_j) - \frac{1}{N}\sum_{l=1}^N \kappa(x_i, x_l) - \frac{1}{N}\sum_{l=1}^N \kappa(x_j, x_l) + \frac{1}{N^2}\sum_{k=1}^N \sum_{l=1}^N \kappa(x_l, x_k)$$

Succinctly, if $O$ is the matrix with every entry $1/N$, i.e., $O = \mathbf{1}\mathbf{1}^T/N$:
$$\tilde{K} = K - OK - KO + OKO$$

To perform kernel PCA, we need to find the eigenvectors of $\tilde{K}$.
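A minimal sketch of this centering step (it can be applied to an uncentred kernel matrix such as the `K` built in the previous example; the random PSD matrix below just stands in for a kernel matrix):

```python
import numpy as np

def center_kernel_matrix(K):
    """Return K_tilde = K - OK - KO + OKO, where O has every entry 1/N."""
    N = K.shape[0]
    O = np.full((N, N), 1.0 / N)
    return K - O @ K - K @ O + O @ K @ O

A = np.random.default_rng(7).normal(size=(5, 5))
K = A @ A.T                                    # stand-in for a kernel matrix
K_tilde = center_kernel_matrix(K)
print(np.allclose(K_tilde.sum(axis=0), 0.0))   # columns of K_tilde sum to zero
```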

SLIDE 43

Projection: PCA vs Kernel PCA

SLIDE 44

Kernel PCA Applications

◮ Kernel PCA is not necessarily very useful for visualisation
◮ Also, kernel PCA does not directly give a useful way to construct a low-dimensional reconstruction of the original data
◮ The most powerful uses of kernel PCA are in other machine learning applications
◮ After kernel PCA preprocessing, we may get higher accuracy for classification, clustering, etc.

SLIDE 45

PCA Summary

Algorithm: We've expressed PCA as the SVD of the data matrix $X$; equivalently, we can use the eigendecomposition of the matrix $X^T X$.

Running time: $O(NDk)$ to compute $k$ principal components (avoid computing the matrix $X^T X$).

PCs are uncorrelated, but there may be non-linear (higher-order) effects.

PCA depends on the scale or units of measurement; it may be a good idea to standardize the data.

PCA is sensitive to outliers.

PCA can be kernelised: useful as preprocessing for further ML applications, rather than for visualisation.
