SLIDE 1

Principal component analysis

Course of Machine Learning Master Degree in Computer Science University of Rome “Tor Vergata” Giorgio Gambosi a.a. 2018-2019

SLIDE 2

Curse of dimensionality

In general, using many features means working in high-dimensional spaces. This implies:

  • sparseness of data
  • increase in the number of coefficients: for example, for dimension D and order 3 of the polynomial,

    y(x, w) = w_0 + \sum_{i=1}^{D} w_i x_i + \sum_{i=1}^{D}\sum_{j=1}^{D} w_{ij} x_i x_j + \sum_{i=1}^{D}\sum_{j=1}^{D}\sum_{k=1}^{D} w_{ijk} x_i x_j x_k

    and, for order M, the number of coefficients is O(D^M)

High dimensions lead to difficulties in machine learning algorithms (lower reliability or the need for a large number of coefficients): this is denoted as the curse of dimensionality.

SLIDE 3

Dimensionality reduction

  • for any given classifier, the training set size required to obtain a certain accuracy grows exponentially wrt the number of features (curse of dimensionality)
  • it is important to bound the number of features, identifying the least discriminant ones

SLIDE 4

Discriminant features

  • Discriminant feature: makes it possible to distinguish between two classes
  • Non-discriminant feature: does not allow classes to be distinguished

SLIDE 5

Searching hyperplanes for the dataset

  • verify whether training set elements lie on a hyperplane (a space of lower dimensionality), apart from a limited variability (which could be seen as noise)
  • principal component analysis looks for a d′-dimensional subspace (d′ < d) such that the projection of elements onto such a subspace is a “faithful” representation of the original dataset
  • by “faithful” representation we mean that distances between elements and their projections are small, even minimal

SLIDE 6

PCA for d′ = 0

  • Objective: represent all d-dimensional vectors x_1, . . . , x_n by means of a unique vector x_0, in the most faithful way, that is, so that

    J(x_0) = \sum_{i=1}^{n} \|x_0 - x_i\|^2

    is minimum
  • it is easy to show that

    x_0 = m = \frac{1}{n} \sum_{i=1}^{n} x_i
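A quick numerical check of this claim (my own sketch, not part of the slides; the data and names are illustrative): the mean minimizes the sum of squared distances.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))      # n = 100 vectors x_i in d = 5 dimensions
m = X.mean(axis=0)                 # candidate x_0 = m

def J(x0, X):
    """Sum of squared distances between x0 and every data vector."""
    return np.sum(np.linalg.norm(X - x0, axis=1) ** 2)

print(J(m, X))                     # J at the mean
print(J(m + 0.1, X))               # any other x_0 gives a strictly larger value
```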

SLIDE 7

PCA for d′ = 0

  • In fact,

    J(x_0) = \sum_{i=1}^{n} \|(x_0 - m) - (x_i - m)\|^2
           = \sum_{i=1}^{n} \|x_0 - m\|^2 - 2 \sum_{i=1}^{n} (x_0 - m)^T (x_i - m) + \sum_{i=1}^{n} \|x_i - m\|^2
           = \sum_{i=1}^{n} \|x_0 - m\|^2 - 2 (x_0 - m)^T \sum_{i=1}^{n} (x_i - m) + \sum_{i=1}^{n} \|x_i - m\|^2
           = \sum_{i=1}^{n} \|x_0 - m\|^2 + \sum_{i=1}^{n} \|x_i - m\|^2

  • since

    \sum_{i=1}^{n} (x_i - m) = \sum_{i=1}^{n} x_i - n \cdot m = n \cdot m - n \cdot m = 0

  • the second term is independent of x_0, while the first one is equal to zero for x_0 = m

SLIDE 8

PCA for d′ = 1

  • a single vector is too concise a representation of the dataset: anything related to data variability gets lost
  • a more interesting case is the one where vectors are projected onto a line passing through m

SLIDE 9

PCA for d′ = 1

  • let u_1 be a unit vector (\|u_1\| = 1) in the line direction: the line equation is then x = \alpha u_1 + m, where \alpha is the distance of x from m along the line
  • let \tilde{x}_i = \alpha_i u_1 + m be the projection of x_i (i = 1, . . . , n) onto the line: given x_1, . . . , x_n, we wish to find the set of projections minimizing the quadratic error

SLIDE 10

PCA for d′ = 1

The quadratic error is defined as

J(\alpha_1, \dots, \alpha_n, u_1) = \sum_{i=1}^{n} \|\tilde{x}_i - x_i\|^2
  = \sum_{i=1}^{n} \|(m + \alpha_i u_1) - x_i\|^2
  = \sum_{i=1}^{n} \|\alpha_i u_1 - (x_i - m)\|^2
  = \sum_{i=1}^{n} \alpha_i^2 \|u_1\|^2 + \sum_{i=1}^{n} \|x_i - m\|^2 - 2 \sum_{i=1}^{n} \alpha_i u_1^T (x_i - m)
  = \sum_{i=1}^{n} \alpha_i^2 + \sum_{i=1}^{n} \|x_i - m\|^2 - 2 \sum_{i=1}^{n} \alpha_i u_1^T (x_i - m)

SLIDE 11

PCA for d′ = 1

Its derivative wrt \alpha_k is

\frac{\partial}{\partial \alpha_k} J(\alpha_1, \dots, \alpha_n, u_1) = 2\alpha_k - 2 u_1^T (x_k - m)

which is zero when \alpha_k = u_1^T (x_k - m) (the orthogonal projection of x_k onto the line). The second derivative turns out to be positive,

\frac{\partial^2}{\partial \alpha_k^2} J(\alpha_1, \dots, \alpha_n, u_1) = 2

showing that what we have found is indeed a minimum.

SLIDE 12

PCA for d′ = 1

To derive the best direction u_1 of the line, we consider the covariance matrix of the dataset

S = \frac{1}{n} \sum_{i=1}^{n} (x_i - m)(x_i - m)^T

By plugging the values computed for \alpha_i into the definition of J(\alpha_1, \dots, \alpha_n, u_1), we get

J(u_1) = \sum_{i=1}^{n} \alpha_i^2 + \sum_{i=1}^{n} \|x_i - m\|^2 - 2 \sum_{i=1}^{n} \alpha_i^2
  = -\sum_{i=1}^{n} [u_1^T (x_i - m)]^2 + \sum_{i=1}^{n} \|x_i - m\|^2
  = -\sum_{i=1}^{n} u_1^T (x_i - m)(x_i - m)^T u_1 + \sum_{i=1}^{n} \|x_i - m\|^2
  = -n\, u_1^T S u_1 + \sum_{i=1}^{n} \|x_i - m\|^2

SLIDE 13

PCA for d′ = 1

  • u_1^T (x_i - m) is the projection of x_i onto the line
  • the product u_1^T (x_i - m)(x_i - m)^T u_1 is then the variance of the projection of x_i wrt the mean m
  • the sum

    \sum_{i=1}^{n} u_1^T (x_i - m)(x_i - m)^T u_1 = n\, u_1^T S u_1

    is the overall variance of the projections of the vectors x_i wrt the mean m

SLIDE 14

PCA for d′ = 1

Minimizing J(u_1) is equivalent to maximizing u_1^T S u_1. That is, J(u_1) is minimum if u_1 is the direction which keeps the maximum amount of variance in the dataset. Hence, we wish to maximize u_1^T S u_1 (wrt u_1), under the constraint \|u_1\| = 1.

By applying Lagrange multipliers, this turns out to be equivalent to maximizing

u = u_1^T S u_1 - \lambda_1 (u_1^T u_1 - 1)

This can be done by setting the first derivative wrt u_1,

\frac{\partial u}{\partial u_1} = 2 S u_1 - 2 \lambda_1 u_1

to 0, obtaining S u_1 = \lambda_1 u_1.

SLIDE 15

PCA for d′ = 1

Note that:

  • u is maximized if u_1 is an eigenvector of S
  • the overall variance of the projections is then equal to the corresponding eigenvalue:

    u_1^T S u_1 = u_1^T \lambda_1 u_1 = \lambda_1 u_1^T u_1 = \lambda_1

  • the variance of the projections is then maximized (and the error minimized) if u_1 is the eigenvector of S corresponding to the maximum eigenvalue \lambda_1
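A minimal NumPy sketch of this result (my own illustration, not from the slides; data and names are made up): the eigenvector of S with the largest eigenvalue carries the largest projected variance.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ np.diag([3.0, 1.0, 0.3])   # anisotropic data, n x d
m = X.mean(axis=0)
S = (X - m).T @ (X - m) / X.shape[0]        # covariance matrix S

eigvals, eigvecs = np.linalg.eigh(S)        # eigh: S is symmetric
u1 = eigvecs[:, np.argmax(eigvals)]         # eigenvector with the largest eigenvalue

alphas = (X - m) @ u1                       # projections alpha_i = u1^T (x_i - m)
print(np.var(alphas), eigvals.max())        # projected variance equals lambda_1
```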

SLIDE 16

PCA for d′ > 1

  • The quadratic error is minimized by projecting vectors onto a hyperplane defined by the directions associated to the d′ eigenvectors corresponding to the d′ largest eigenvalues of S
  • If we assume data are modeled by a d-dimensional gaussian distribution with mean µ and covariance matrix Σ, PCA returns a d′-dimensional subspace corresponding to the hyperplane defined by the eigenvectors associated to the d′ largest eigenvalues of Σ
  • The projections of vectors onto that hyperplane are distributed as a d′-dimensional distribution which keeps the maximum possible amount of data variability

SLIDE 17

An example of PCA

  • Digit recognition (D = 28 × 28 = 784)

SLIDE 18

Choosing d′

The distribution of eigenvalue sizes is usually characterized by a fast initial decrease followed by a much slower one. This makes it possible to identify the number of eigenvalues to keep, and thus the dimensionality of the projections.

SLIDE 19

Choosing d′

Eigenvalues measure the amount of distribution variance kept in the projection. Let us consider, for each k < d, the value

r_k = \frac{\sum_{i=1}^{k} \lambda_i^2}{\sum_{i=1}^{d} \lambda_i^2}

which provides a measure of the variance fraction associated to the k largest eigenvalues. Since r_1 < \dots < r_d, a given fraction p of the variance can be kept by setting

d' = \min \{\, i \in \{1, \dots, d\} : r_i > p \,\}
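A small sketch of this selection rule (mine, not from the slides), using the slide's definition of r_k:

```python
import numpy as np

def choose_dim(eigvals, p=0.95):
    """Pick the smallest d' whose leading eigenvalues keep a fraction p of the variance.

    Follows the slide's definition r_k = sum_{i<=k} lambda_i^2 / sum_i lambda_i^2.
    """
    lam = np.sort(eigvals)[::-1]              # largest eigenvalues first
    r = np.cumsum(lam ** 2) / np.sum(lam ** 2)
    return int(np.argmax(r > p)) + 1          # first index where r_k exceeds p (1-based)

print(choose_dim(np.array([5.0, 2.0, 0.5, 0.1]), p=0.9))   # -> 2
```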

SLIDE 20

Singular value decomposition

SLIDE 21

Singular Value Decomposition

Let W ∈ ℝ^{n×m} be a matrix of rank r ≤ min(n, m), and let n > m. Then, there exist

  • U ∈ ℝ^{n×r} orthonormal (that is, U^T U = I_r)
  • V ∈ ℝ^{m×r} orthonormal (that is, V^T V = I_r)
  • Σ ∈ ℝ^{r×r} diagonal

such that

W = U Σ V^T
(n × m)   (n × r) (r × r) (r × m)
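A quick NumPy illustration (my own, not from the slides) of the decomposition and the orthonormality properties, using the thin SVD of a random full-column-rank matrix (so r = m):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))                        # n = 6 > m = 4

U, s, Vt = np.linalg.svd(W, full_matrices=False)   # thin SVD: U (n x r), s (r,), Vt (r x m)
print(U.shape, s.shape, Vt.shape)
print(np.allclose(W, U @ np.diag(s) @ Vt))         # W = U Sigma V^T
print(np.allclose(U.T @ U, np.eye(4)),             # U^T U = I_r
      np.allclose(Vt @ Vt.T, np.eye(4)))           # V^T V = I_r
```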

SLIDE 22

SVD in greater detail

Let us consider the matrix A = W^T W ∈ ℝ^{m×m}. Observe that

  • by definition, A has the same rank as W, that is r
  • A is symmetric: in fact, a_{ij} = w_i^T w_j by definition, where w_k is the k-th column of W; by the commutativity of the dot product, a_{ij} = w_i^T w_j = w_j^T w_i = a_{ji}
  • A is positive semidefinite, that is, x^T A x ≥ 0 for all non-null x ∈ ℝ^m: this derives from

    x^T A x = x^T (W^T W) x = (Wx)^T (Wx) = \|Wx\|^2 ≥ 0

SLIDE 23

SVD in greater detail

All eigenvalues of A are real. In fact,

  • let \lambda ∈ ℂ be an eigenvalue of A, and let v ∈ ℂ^m be a corresponding eigenvector: then, Av = \lambda v and \bar{v}^T A v = \bar{v}^T \lambda v = \lambda \bar{v}^T v
  • observe that the complex conjugates \bar{\lambda} and \bar{v} are themselves an eigenvalue-eigenvector pair for A: then, A\bar{v} = \bar{\lambda}\bar{v}. Since \bar{\lambda}\bar{v}^T = (\bar{\lambda}\bar{v})^T = (A\bar{v})^T = \bar{v}^T A^T = \bar{v}^T A by the symmetry of A, it follows that \bar{v}^T A v = \bar{\lambda}\bar{v}^T v
  • as a consequence, \lambda \bar{v}^T v = \bar{\lambda}\bar{v}^T v, that is, \lambda \|v\|^2 = \bar{\lambda}\|v\|^2
  • since v ≠ 0 (being an eigenvector), it must be \lambda = \bar{\lambda}, hence \lambda ∈ ℝ

SLIDE 24

SVD in greater detail

The eigenvectors of A corresponding to different eigenvalues are orthogonal:

  • let v_1, v_2 ∈ ℝ^m be two eigenvectors, with corresponding distinct eigenvalues \lambda_1, \lambda_2
  • then, by the symmetry of A, \lambda_1 (v_1^T v_2) = (\lambda_1 v_1)^T v_2 = (A v_1)^T v_2 = v_1^T A^T v_2 = v_1^T A v_2 = v_1^T \lambda_2 v_2 = \lambda_2 (v_1^T v_2)
  • as a consequence, (\lambda_1 - \lambda_2) v_1^T v_2 = 0
  • since \lambda_1 ≠ \lambda_2, it must be v_1^T v_2 = 0, that is, v_1, v_2 must be orthogonal

If an eigenvalue \lambda' has multiplicity m' > 1, it is always possible to find a set of m' orthonormal eigenvectors associated to \lambda'.

As a result, there exists a set of eigenvectors of A which provides an orthonormal basis.

SLIDE 25

SVD in greater detail

All eigenvalues of A are non-negative.

  • A is real and symmetric, hence for each eigenvalue \lambda it must be \lambda ∈ ℝ, and there must exist an eigenvector v ∈ ℝ^m such that Av = \lambda v
  • as a consequence, v^T (Av) = \lambda v^T v and

    \lambda = \frac{v^T A v}{v^T v} = \frac{v^T A v}{\|v\|^2}

  • \|v\|^2 > 0 since v is an eigenvector and, since A is positive semidefinite, v^T A v ≥ 0
  • as a consequence, \lambda ≥ 0

SLIDE 26

SVD in greater detail

Overall,

  • A = W^T W has r real and positive eigenvalues \lambda_1, . . . , \lambda_r
  • the corresponding eigenvectors v_1, . . . , v_r are orthonormal
  • A v_i = (W^T W) v_i = \lambda_i v_i,  i = 1, . . . , r

Let us define the r singular values

\sigma_i = \sqrt{\lambda_i},  i = 1, . . . , r

and let us also consider the set of vectors

u_i = \frac{1}{\sigma_i} W v_i,  i = 1, . . . , r
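A small NumPy sketch (my own illustration) of this construction: the singular values and the vectors u_i are built from the eigendecomposition of A = WᵀW, and they reassemble W.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))                  # n = 6, m = 4, full column rank, so r = 4

A = W.T @ W                                  # A = W^T W, symmetric positive semidefinite
lam, V = np.linalg.eigh(A)                   # eigenvalues (ascending) and orthonormal eigenvectors
lam, V = lam[::-1], V[:, ::-1]               # reorder eigenvalues in decreasing order

sigma = np.sqrt(lam)                         # singular values sigma_i = sqrt(lambda_i)
U = W @ V / sigma                            # columns u_i = (1 / sigma_i) W v_i

print(np.allclose(U.T @ U, np.eye(4)))            # the u_i are orthonormal
print(np.allclose(W, U @ np.diag(sigma) @ V.T))   # W = U Sigma V^T
```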

SLIDE 27

SVD in greater detail

  • Observe that u_1, . . . , u_r are orthogonal; in fact:

    u_i^T u_j = \left(\frac{1}{\sigma_i} W v_i\right)^T \left(\frac{1}{\sigma_j} W v_j\right) = \frac{1}{\sigma_i \sigma_j} v_i^T W^T W v_j = \frac{1}{\sigma_i \sigma_j} v_i^T (\lambda_j v_j) = \frac{\sigma_j}{\sigma_i} v_i^T v_j

    Hence, u_i^T u_j ≠ 0 iff v_i^T v_j ≠ 0, that is, iff i = j.

  • Moreover, u_1, . . . , u_r have unit norm; in fact:

    \|u_i\|^2 = \left\|\frac{1}{\sigma_i} W v_i\right\|^2 = \frac{1}{\lambda_i} (W v_i)^T (W v_i) = \frac{1}{\lambda_i} v_i^T (W^T W v_i) = \frac{1}{\lambda_i} v_i^T (\lambda_i v_i) = \frac{1}{\lambda_i} \lambda_i (v_i^T v_i) = 1

SLIDE 28

SVD in greater detail

Let us also consider the following matrices:

  • V ∈ ℝ^{m×r}, having the vectors v_1, . . . , v_r as columns
  • U ∈ ℝ^{n×r}, having the vectors u_1, . . . , u_r as columns
  • Σ ∈ ℝ^{r×r}, having the singular values \sigma_1, . . . , \sigma_r on the diagonal

SLIDE 29

SVD in greater detail

It is easy to verify that WV = UΣ. Moreover, right-multiplying by V^T and observing that WVV^T = W (VV^T is the orthogonal projection onto the row space of W, and V^{-1} = V^T in the square case r = m), we obtain

W = U Σ V^T

with the vectors u_i as columns of U, the singular values \sigma_i on the diagonal of Σ, and the vectors v_i^T as rows of V^T.

SLIDE 30

PCA and SVD

SLIDE 31

PCA and SVD

  • Given the matrix X whose columns are the vectors x_1, x_2, . . . , x_n,
  • the mean of the vectors x_1, . . . , x_n is

    m = \frac{1}{n} X \mathbf{1}

    where \mathbf{1} is the all-ones vector of length n
  • let \tilde{X} be the matrix of such vectors translated to have zero mean:

    \tilde{X} = X - m\mathbf{1}^T = X - \frac{1}{n} X \mathbf{1}\mathbf{1}^T = X \left( I - \frac{1}{n}\mathbf{1}\mathbf{1}^T \right)

SLIDE 32

PCA and SVD

The correlation matrix of x_1, . . . , x_n is defined as

S = \sum_{i=1}^{n} (x_i - m)(x_i - m)^T = \sum_{i=1}^{n} \tilde{x}_i \tilde{x}_i^T

where \tilde{x}_i is the i-th column of \tilde{X}. That is, S = \tilde{X}\tilde{X}^T.

\tilde{X}^T has dimension n × d: assuming n > d, we may consider its SVD

\tilde{X}^T = U \Sigma V^T

where U^T U = V^T V = I and \Sigma is a diagonal matrix.

SLIDE 33

PCA and SVD

By the properties of SVD, the squares of the diagonal elements of \Sigma are the eigenvalues of S (since S = \tilde{X}\tilde{X}^T = V \Sigma^2 V^T), and the columns of V are the corresponding eigenvectors. In summary:

  • To perform a PCA on X, it is sufficient to compute the SVD

    \tilde{X}^T = \left( X \left( I - \frac{1}{n}\mathbf{1}\mathbf{1}^T \right) \right)^T = U \Sigma V^T

  • The principal components of X are the columns of V, with corresponding eigenvalues given by the diagonal elements of \Sigma^2.
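A short NumPy sketch of this equivalence (my own illustration, not from the slides): PCA via the SVD of the centered data matrix with samples as rows (the X̃ᵀ above) agrees with the eigendecomposition of S.

```python
import numpy as np

rng = np.random.default_rng(0)
Xrows = rng.normal(size=(200, 5))             # data as rows: n x d, plays the role of X̃ᵀ
Xc = Xrows - Xrows.mean(axis=0)               # center the data

# PCA via SVD of the centered n x d matrix
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Vt                               # rows of Vt = columns of V = principal directions
eigvals_svd = s ** 2                          # eigenvalues of S are the squared singular values

# Same eigenvalues from the eigendecomposition of S = Xc^T Xc
S = Xc.T @ Xc
eigvals_eig = np.sort(np.linalg.eigvalsh(S))[::-1]
print(np.allclose(eigvals_svd, eigvals_eig))  # True
```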

SLIDE 34

Latent semantic analysis

SLIDE 35

Introduction

Definitions

Many models in text processing refer to co-occurrence data. Given two sets V, D (for example, a set of terms and a collection of documents), a sequence of observations W = {(w_1, d_1), . . . , (w_N, d_N)} is considered, with w_i ∈ V, d_i ∈ D (for example, these are occurrences of terms in documents).

SLIDE 36

Latent semantic analysis

Fundamental hypotheses

The Latent Semantic Analysis (LSA) approach is based on the following three hypotheses:

  • it is possible to derive semantic information from the matrix of occurrences of terms in documents
  • the reduction of dimensionality is a key aspect of this derivation
  • terms and documents can be modeled as points (vectors) in a euclidean space

Context

  1. Dictionary V of V terms t_1, t_2, . . . , t_V
  2. Collection D of D documents d_1, d_2, . . . , d_D
  3. Each document d_i is a sequence of N_i occurrences of terms in V

SLIDE 37

Model

Idea

  1. A document d_i can be seen as a multiset of N_i terms in V (bag-of-words hypothesis)
  2. There exists a correspondence between V and D, and a vector space S: each term t_i has an associated vector u_i and, likewise, to each document d_j a vector v_j in S is associated

Occurrence matrix

Let us define the matrix W ∈ ℝ^{V×D}, where w_{i,j} is associated to the occurrences of term t_i in document d_j. The value w_{i,j} derives from some measure of the number of occurrences of t_i in d_j (binary, count, tf, tf-idf, entropy, etc.).

  • Terms correspond to row vectors (of size D)
  • Documents correspond to column vectors (of size V)
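A minimal sketch of building such an occurrence matrix from raw counts (my own toy example; the documents and vocabulary are made up):

```python
import numpy as np

docs = [["cat", "dog", "cat"], ["dog", "bird"], ["cat", "bird", "bird"]]   # toy documents
vocab = sorted({t for d in docs for t in d})                               # dictionary V
index = {t: i for i, t in enumerate(vocab)}

W = np.zeros((len(vocab), len(docs)))        # V x D occurrence matrix (raw counts)
for j, d in enumerate(docs):
    for t in d:
        W[index[t], j] += 1

print(vocab)
print(W)
```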

SLIDE 38

Model

Problem

  1. The values V, D are usually quite large
  2. Vectors corresponding to t_i and d_j are very sparse
  3. Terms and documents are modeled as vectors defined on different spaces (ℝ^D and ℝ^V, respectively)

Exploit singular value decomposition.

SLIDE 39

In short

  • The occurrence matrix W is decomposed into the product of three matrices:
  • a term matrix U, with rows corresponding to terms: each term spans over r dimensions
  • a document matrix V^T, with columns corresponding to documents: each document spans over r dimensions
  • the matrix of singular values Σ, whose diagonal elements provide a measure of the relevance of the corresponding dimensions

SLIDE 40

Use of SVD

W = U Σ V^T
(V × D)   (V × r) (r × r) (r × D)

Effect

Rows of W (terms) are projected onto an r-dimensional subspace of ℝ^D. The rows of V^T provide a basis of such a subspace, hence each term is associated to a linear combination of these basis vectors. In particular, each term t_i is represented, wrt that basis, by the i-th row of UΣ: the value u_{ik}\sigma_k provides a measure of the relevance of term t_i in the k-th topic.

SLIDE 41

Use of SVD

W^T = V Σ U^T
(D × V)   (D × r) (r × r) (r × V)

Effect

Rows of W^T (documents) are projected onto an r-dimensional subspace of ℝ^V. The rows of U^T provide a basis of such a subspace, hence each document is associated to a linear combination of these basis vectors. In particular, each document d_j is represented, wrt that basis, by the j-th row of VΣ: the value v_{jk}\sigma_k provides a measure of the presence of the k-th topic in document d_j.
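A small NumPy sketch of these two projections (my own toy example, not from the slides): the rows of UΣ give term coordinates and the rows of VΣ give document coordinates over the r latent dimensions.

```python
import numpy as np

W = np.array([[2., 0., 1.],      # toy V x D term-document count matrix
              [1., 1., 0.],      # rows = terms, columns = documents
              [0., 1., 2.]])

U, s, Vt = np.linalg.svd(W, full_matrices=False)
Sigma = np.diag(s)

term_coords = U @ Sigma          # i-th row: coordinates of term t_i over the r topics
doc_coords = Vt.T @ Sigma        # j-th row: coordinates of document d_j over the r topics
print(term_coords)
print(doc_coords)
```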

SLIDE 42

LSA

Dimensionality reduction

The dimension d of the projection subspace can be predefined to be less than the rank of W. In this case W ≈ \bar{W} = U \Sigma V^T, where only the d largest singular values (and the corresponding columns of U and V) are retained.

Approximation

The following property holds:

\min_{A : \mathrm{rank}(A) = d} \|W - A\|^2 = \|W - \bar{W}\|^2

That is, \bar{W} is the best approximation of W among all matrices of rank d wrt the Frobenius norm

\|A\|^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} |a_{ij}|^2
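A quick check of this optimality property (my own sketch, not from the slides): for the rank-d truncation, the squared Frobenius error equals the sum of the discarded squared singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 6))
d = 2                                            # target rank

U, s, Vt = np.linalg.svd(W, full_matrices=False)
W_bar = U[:, :d] @ np.diag(s[:d]) @ Vt[:d, :]    # rank-d truncated SVD

err = np.linalg.norm(W - W_bar, "fro") ** 2      # squared Frobenius error
print(err, np.sum(s[d:] ** 2))                   # equal: the sum of the discarded sigma_i^2
```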

SLIDE 43

LSA

Effect

SVD provides a transformation of the two discrete vector spaces (terms in ℤ^D and documents in ℤ^V) into a unique continuous vector space T ∈ ℝ^d of lower dimension. The dimension of T is at most equal to the (unknown) rank of W, and is determined by the acceptable amount of distortion induced by the projection.

Interpretation

\bar{W} keeps most of the associations between terms and documents in W: it only drops the least significant relations.

  • Each term is now seen as a linear combination of unknown “topics”: terms with similar projections tend to appear in the same documents (or in semantically similar documents, in which similar terms appear)
  • Each document is also seen as a linear combination of the same unknown topics: documents with similar projections tend to contain the same terms (or semantically similar terms, which appear in similar documents)

SLIDE 44

LSA and clustering

Co-occurrences

  • WW^T ∈ ℤ^{V×V} provides co-occurrences of terms in V (number of documents in which both terms appear)
  • W^T W ∈ ℤ^{D×D} provides co-occurrences of documents in D (number of terms appearing in both documents)

SVD and co-occurrence matrix

By applying the SVD,

WW^T = U \Sigma V^T V \Sigma U^T = U \Sigma^2 U^T
W^T W = V \Sigma U^T U \Sigma V^T = V \Sigma^2 V^T

SLIDE 45

Term clustering

WW^T = (UΣ)(UΣ)^T: element (i, j) of WW^T is the product of the i-th row of UΣ and the j-th column of (UΣ)^T.

Proximity of terms

A reasonable measure of the proximity between two terms t_i, t_j is the number of documents in which they co-occur, that is, the value of element (i, j) in WW^T. This corresponds to the dot product of the vectors u_iΣ (the i-th row of UΣ) and u_jΣ (the j-th row of UΣ), where u_i denotes the i-th row of U. In particular, we may define

D(t_i, t_j) = \frac{1}{\cos(u_i, u_j)} = \frac{\|u_i\| \cdot \|u_j\|}{u_i u_j^T}
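A small sketch of term clustering by this kind of proximity (my own toy example, not from the slides): cosine similarities between the rows of UΣ.

```python
import numpy as np

W = np.array([[2., 0., 1.],      # toy V x D occurrence matrix (terms x documents)
              [1., 1., 0.],
              [0., 1., 2.]])

U, s, Vt = np.linalg.svd(W, full_matrices=False)
T = U @ np.diag(s)                          # term coordinates: i-th row represents term t_i

norms = np.linalg.norm(T, axis=1, keepdims=True)
cos_sim = (T / norms) @ (T / norms).T       # cosine similarity between every pair of terms
print(np.round(cos_sim, 3))
```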

SLIDE 46

Document clustering

W^T W = (VΣ)(VΣ)^T: element (i, j) of W^T W is the product of the i-th row of VΣ and the j-th column of (VΣ)^T.

A reasonable measure of the proximity between two documents d_i, d_j is the number of terms co-occurring in them, that is, the value of element (i, j) in W^T W. This corresponds to the dot product of the vectors v_iΣ (the i-th row of VΣ) and v_jΣ (the j-th row of VΣ), where v_i denotes the i-th row of V. In particular, we may define

D(d_i, d_j) = \frac{1}{\cos(v_i, v_j)} = \frac{\|v_i\| \cdot \|v_j\|}{v_i v_j^T}

SLIDE 47

Proximity of a document to a topic

Objective

Determine, given a document, the topic (in a predefined collection) which is most related to its content.

Approach

Construct a vector of weights associated to the topic: it can be seen as a further document d (a topic template). W can then be extended by attaching d as its (D + 1)-th column, thus obtaining \bar{W} ∈ ℤ^{V×(D+1)}.

SLIDE 48

Proximity of a document to a topic

\bar{W} = U Σ \bar{V}^T
(V × (D + 1))   (V × d) (d × d) (d × (D + 1))

Effect

The SVD of \bar{W} provides a vector v ∈ ℝ^d as the (D + 1)-th row of \bar{V}, such that d = UΣv^T.
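A sketch of this folding-in step (my own illustration, not from the slides): instead of recomputing the SVD, the new column d can be projected as v = dᵀUΣ⁻¹, which is equivalent to the relation d = UΣvᵀ above whenever d lies in the span of U.

```python
import numpy as np

W = np.array([[2., 0., 1.],      # toy V x D occurrence matrix
              [1., 1., 0.],
              [0., 1., 2.]])
U, s, Vt = np.linalg.svd(W, full_matrices=False)

d_new = np.array([1., 0., 2.])             # topic template / new document (length V)
v = d_new @ U @ np.diag(1.0 / s)           # fold-in: v = d^T U Sigma^{-1}, length r
print(v)
print(np.allclose(U @ np.diag(s) @ v, d_new))   # d = U Sigma v^T when d lies in span(U)
```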

SLIDE 49

Proximity of a document to a topic

\bar{W}^T \bar{W} = (\bar{V}Σ)(\bar{V}Σ)^T: element (i, D + 1) of \bar{W}^T \bar{W} is the product of the i-th row of \bar{V}Σ and the (D + 1)-th column of (\bar{V}Σ)^T.

A reasonable measure of the proximity between a document d_i and the topic d corresponds to the dot product of the vectors v_iΣ (the i-th row of \bar{V}Σ) and vΣ (the (D + 1)-th row of \bar{V}Σ). In particular, we may define

D(d_i, d) = \frac{1}{\cos(v_i, v)} = \frac{\|v_i\| \cdot \|v\|}{v_i v^T}
