

  1. Principal component analysis. Course of Machine Learning, Master Degree in Computer Science, University of Rome “Tor Vergata”. Giorgio Gambosi, a.a. 2018-2019

  2. Curse of dimensionality
  In general, many features: high-dimensional spaces. High dimensions lead to difficulties in machine learning algorithms (lower reliability, or the need for a large number of coefficients); this is denoted as the curse of dimensionality:
  • sparseness of data
  • increase in the number of coefficients; for example, for dimension D and order 3 of the polynomial,
    y(x, w) = w_0 + \sum_{i=1}^{D} w_i x_i + \sum_{i=1}^{D} \sum_{j=1}^{D} w_{ij} x_i x_j + \sum_{i=1}^{D} \sum_{j=1}^{D} \sum_{k=1}^{D} w_{ijk} x_i x_j x_k
  For a polynomial of order M, the number of coefficients is O(D^M).
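As a rough illustration (a sketch added here, not part of the original slides), the following Python snippet counts the distinct monomials of a polynomial of order M in D variables, C(D + M, M), which grows as O(D^M) for fixed M; the helper name n_coefficients is hypothetical.

    # Hypothetical helper, only to illustrate the growth of the coefficient count
    from math import comb

    def n_coefficients(D: int, M: int) -> int:
        # number of distinct monomials of total degree <= M in D variables
        return comb(D + M, M)

    for D in (2, 10, 100, 784):
        print(D, n_coefficients(D, 3))   # order-3 polynomial, as on the slide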

  3. Dimensionality reduction
  • for any given classifier, the training set size required to obtain a certain accuracy grows exponentially wrt the number of features (curse of dimensionality)
  • it is important to bound the number of features, identifying the least discriminant ones

  4. Discriminant features
  • Discriminant feature: makes it possible to distinguish between two classes
  • Non-discriminant feature: does not allow classes to be distinguished

  5. Searching hyperplanes for the dataset
  • verifying whether training set elements lie on a hyperplane (a space of lower dimensionality), apart from a limited variability (which could be seen as noise)
  • principal component analysis looks for a d′-dimensional subspace (d′ < d) such that the projection of elements onto such subspace is a “faithful” representation of the original dataset
  • by a “faithful” representation we mean that distances between elements and their projections are small, even minimal

  6. PCA for d′ = 0
  • Objective: represent all d-dimensional vectors x_1, ..., x_n by means of a unique vector x_0, in the most faithful way, that is, so that
    J(x_0) = \sum_{i=1}^{n} ||x_0 - x_i||^2
  is minimum
  • it is easy to show that J(x_0) is minimum for
    x_0 = m = \frac{1}{n} \sum_{i=1}^{n} x_i

  7. PCA for d′ = 0
  • In fact,
    J(x_0) = \sum_{i=1}^{n} ||(x_0 - m) - (x_i - m)||^2
           = \sum_{i=1}^{n} ||x_0 - m||^2 - 2 \sum_{i=1}^{n} (x_0 - m)^T (x_i - m) + \sum_{i=1}^{n} ||x_i - m||^2
           = \sum_{i=1}^{n} ||x_0 - m||^2 - 2 (x_0 - m)^T \sum_{i=1}^{n} (x_i - m) + \sum_{i=1}^{n} ||x_i - m||^2
           = \sum_{i=1}^{n} ||x_0 - m||^2 + \sum_{i=1}^{n} ||x_i - m||^2
  • since
    \sum_{i=1}^{n} (x_i - m) = \sum_{i=1}^{n} x_i - n \cdot m = n \cdot m - n \cdot m = 0
  • the second term is independent from x_0, while the first one is equal to zero for x_0 = m
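A quick numerical check of this result (a sketch added here, not from the slides): the sample mean m attains a value of J no larger than any perturbed candidate.

    # check that x0 = m minimizes J(x0) = sum_i ||x0 - x_i||^2
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))          # n = 100 points in d = 3 dimensions
    m = X.mean(axis=0)

    def J(x0):
        return np.sum(np.linalg.norm(X - x0, axis=1) ** 2)

    # J at the mean never exceeds J at randomly perturbed candidates
    assert all(J(m) <= J(m + rng.normal(scale=0.5, size=3)) for _ in range(1000))
    print("J(m) =", J(m))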

  8. PCA for d′ = 1
  • a single vector is too concise a representation of the dataset: anything related to data variability gets lost
  • a more interesting case is the one where vectors are projected onto a line passing through m

  9. PCA for d′ = 1
  • let u_1 be a unit vector (||u_1|| = 1) in the line direction: the line equation is x = \alpha u_1 + m, where \alpha is the distance of x from m along the line
  • let \tilde{x}_i = \alpha_i u_1 + m be the projection of x_i (i = 1, ..., n) onto the line: given x_1, ..., x_n, we wish to find the set of projections minimizing the quadratic error

  10. PCA for d′ = 1
  The quadratic error is defined as
    J(\alpha_1, ..., \alpha_n, u_1) = \sum_{i=1}^{n} ||\tilde{x}_i - x_i||^2
      = \sum_{i=1}^{n} ||(m + \alpha_i u_1) - x_i||^2
      = \sum_{i=1}^{n} ||\alpha_i u_1 - (x_i - m)||^2
      = \sum_{i=1}^{n} \alpha_i^2 ||u_1||^2 - 2 \sum_{i=1}^{n} \alpha_i u_1^T (x_i - m) + \sum_{i=1}^{n} ||x_i - m||^2
      = \sum_{i=1}^{n} \alpha_i^2 - 2 \sum_{i=1}^{n} \alpha_i u_1^T (x_i - m) + \sum_{i=1}^{n} ||x_i - m||^2

  11. PCA for d′ = 1
  Its derivative wrt \alpha_k is
    \frac{\partial}{\partial \alpha_k} J(\alpha_1, ..., \alpha_n, u_1) = 2 \alpha_k - 2 u_1^T (x_k - m)
  which is zero when \alpha_k = u_1^T (x_k - m) (the orthogonal projection of x_k onto the line). The second derivative
    \frac{\partial^2}{\partial \alpha_k^2} J(\alpha_1, ..., \alpha_n, u_1) = 2
  turns out to be positive, showing that what we have found is indeed a minimum.

  12. PCA for d′ = 1
  To derive the best direction u_1 of the line, we consider the covariance matrix of the dataset
    S = \frac{1}{n} \sum_{i=1}^{n} (x_i - m)(x_i - m)^T
  By plugging the values computed for \alpha_i into the definition of J(\alpha_1, ..., \alpha_n, u_1), we get
    J(u_1) = \sum_{i=1}^{n} \alpha_i^2 - 2 \sum_{i=1}^{n} \alpha_i^2 + \sum_{i=1}^{n} ||x_i - m||^2
           = - \sum_{i=1}^{n} [u_1^T (x_i - m)]^2 + \sum_{i=1}^{n} ||x_i - m||^2
           = - \sum_{i=1}^{n} u_1^T (x_i - m)(x_i - m)^T u_1 + \sum_{i=1}^{n} ||x_i - m||^2
           = - n u_1^T S u_1 + \sum_{i=1}^{n} ||x_i - m||^2

  13. PCA for d′ = 1
  • u_1^T (x_i - m) is the projection of x_i onto the line
  • the product u_1^T (x_i - m)(x_i - m)^T u_1 is then the variance of the projection of x_i wrt the mean m
  • the sum \sum_{i=1}^{n} u_1^T (x_i - m)(x_i - m)^T u_1 = n u_1^T S u_1 is the overall variance of the projections of the vectors x_i wrt the mean m

  14. PCA for d′ = 1
  Minimizing J(u_1) is equivalent to maximizing u_1^T S u_1. That is, J(u_1) is minimum if u_1 is the direction which keeps the maximum amount of variance in the dataset.
  Hence, we wish to maximize u_1^T S u_1 (wrt u_1), with the constraint ||u_1|| = 1. By applying Lagrange multipliers, this is equivalent to maximizing
    u = u_1^T S u_1 - \lambda_1 (u_1^T u_1 - 1)
  This can be done by setting the first derivative wrt u_1,
    \frac{\partial u}{\partial u_1} = 2 S u_1 - 2 \lambda_1 u_1,
  to 0, obtaining
    S u_1 = \lambda_1 u_1

  15. PCA for d′ = 1
  Note that:
  • u is maximized if u_1 is an eigenvector of S
  • the overall variance of the projections is then equal to the corresponding eigenvalue:
    u_1^T S u_1 = u_1^T \lambda_1 u_1 = \lambda_1 u_1^T u_1 = \lambda_1
  • the variance of the projections is then maximized (and the error minimized) if u_1 is the eigenvector of S corresponding to the maximum eigenvalue \lambda_1
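A minimal numpy sketch of the d′ = 1 case (added here for illustration, not part of the original slides): the first principal direction is the eigenvector of the sample covariance matrix S with the largest eigenvalue, and the variance of the projections equals that eigenvalue.

    # PCA for d' = 1 via the eigendecomposition of the covariance matrix S
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 5))              # n = 500 points, d = 5
    m = X.mean(axis=0)
    S = (X - m).T @ (X - m) / X.shape[0]       # S = (1/n) sum_i (x_i - m)(x_i - m)^T

    eigvals, eigvecs = np.linalg.eigh(S)       # eigh: S is symmetric
    u1 = eigvecs[:, np.argmax(eigvals)]        # direction of maximum variance
    alpha = (X - m) @ u1                       # alpha_i = u_1^T (x_i - m)

    # the variance of the projections equals the largest eigenvalue of S
    print(np.allclose(alpha @ alpha / X.shape[0], eigvals.max()))   # True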

  16. PCA for d′ > 1
  • The quadratic error is minimized by projecting vectors onto the hyperplane defined by the directions associated to the d′ eigenvectors corresponding to the d′ largest eigenvalues of S
  • The projections of vectors onto that hyperplane are distributed as a d′-dimensional distribution which keeps the maximum possible amount of data variability
  • If we assume data are modeled by a d-dimensional gaussian distribution with mean \mu and covariance matrix \Sigma, PCA returns the d′-dimensional subspace corresponding to the hyperplane defined by the eigenvectors associated to the d′ largest eigenvalues of \Sigma
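Extending the previous sketch to d′ > 1 (again an illustration, not from the slides; the helper name pca_project is hypothetical): project onto the d′ leading eigenvectors of S and, if needed, reconstruct back into the original space.

    # PCA projection onto the d' leading eigenvectors of the covariance matrix
    import numpy as np

    def pca_project(X: np.ndarray, d_prime: int):
        m = X.mean(axis=0)
        S = (X - m).T @ (X - m) / X.shape[0]
        eigvals, eigvecs = np.linalg.eigh(S)       # eigenvalues in ascending order
        U = eigvecs[:, ::-1][:, :d_prime]          # d x d' matrix of top eigenvectors
        Z = (X - m) @ U                            # d'-dimensional coordinates
        X_rec = Z @ U.T + m                        # reconstruction in the original space
        return Z, X_rec, eigvals[::-1]

    rng = np.random.default_rng(1)
    X = rng.normal(size=(300, 10))
    Z, X_rec, eigvals = pca_project(X, d_prime=3)
    print(Z.shape, X_rec.shape)                    # (300, 3) (300, 10)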

  17. An example of PCA
  • Digit recognition (D = 28 × 28 = 784)

  18. Choosing d′
  The distribution of eigenvalue sizes is usually characterized by a fast initial decrease followed by a much slower one. This makes it possible to identify the number of eigenvalues to keep, and thus the dimensionality of the projections.

  19. Choosing d′
  Eigenvalues measure the amount of distribution variance kept in the projection. Let us consider, for each k < d, the value
    r_k = \frac{\sum_{i=1}^{k} \lambda_i^2}{\sum_{i=1}^{d} \lambda_i^2}
  which provides a measure of the variance fraction associated to the k largest eigenvalues. When r_1 < ... < r_d are known, a certain amount p of variance can be kept by setting
    d′ = \operatorname{argmin}_{i \in \{1, ..., d\}} \; r_i > p
  that is, by choosing the smallest index i such that r_i > p.
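A short sketch of this selection rule (added for illustration; the helper name choose_d_prime is hypothetical):

    # choose d' as the smallest k whose cumulative (squared) eigenvalue
    # fraction r_k exceeds a target p
    import numpy as np

    def choose_d_prime(eigvals: np.ndarray, p: float = 0.95) -> int:
        lam = np.sort(eigvals)[::-1]                  # eigenvalues in decreasing order
        r = np.cumsum(lam ** 2) / np.sum(lam ** 2)    # r_k as defined on the slide
        return int(np.argmax(r > p)) + 1              # first k with r_k > p (1-based)

    eigvals = np.array([5.0, 3.0, 1.0, 0.5, 0.1])
    print(choose_d_prime(eigvals, p=0.9))             # 2 for these toy eigenvalues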

  20. Singular value decomposition

  21. Singular Value Decomposition
  Let W ∈ ℝ^{n × m} be a matrix of rank r ≤ min(n, m), and let n > m. Then, there exist
  • U ∈ ℝ^{n × r} orthonormal (that is, U^T U = I_r)
  • V ∈ ℝ^{m × r} orthonormal (that is, V^T V = I_r)
  • Σ ∈ ℝ^{r × r} diagonal
  such that W = U Σ V^T, where W is n × m, U is n × r, Σ is r × r and V^T is r × m.
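A small numpy illustration of the decomposition (a sketch, not part of the slides). Note that numpy's reduced SVD returns factors with k = min(n, m) columns rather than the rank-r shapes above; when r < m the extra singular values are numerically zero.

    # SVD of a tall matrix W and a check that the factors reconstruct it
    import numpy as np

    rng = np.random.default_rng(2)
    W = rng.normal(size=(6, 3)) @ rng.normal(size=(3, 4))   # 6 x 4, rank <= 3

    U, s, Vt = np.linalg.svd(W, full_matrices=False)        # U: 6x4, s: (4,), Vt: 4x4
    print(np.allclose(W, U @ np.diag(s) @ Vt))              # True
    print(np.allclose(U.T @ U, np.eye(U.shape[1])))         # orthonormal columns of U
    print(np.round(s, 6))                                   # last singular value ~ 0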

  22. SVD in greater detail
  Let us consider the matrix A = W^T W ∈ ℝ^{m × m}. Observe that
  • by definition, A has the same rank as W, that is r
  • A is symmetric: in fact, a_{ij} = w_i^T w_j by definition, where w_k is the k-th column of W; by the commutativity of the inner product, a_{ij} = w_i^T w_j = w_j^T w_i = a_{ji}
  • A is positive semidefinite, that is x^T A x ≥ 0 for all non-null x ∈ ℝ^m: this derives from x^T A x = x^T (W^T W) x = (Wx)^T (Wx) = ||Wx||^2 ≥ 0
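These properties are easy to verify numerically (a sketch added here, not from the slides):

    # A = W^T W is symmetric and positive semidefinite
    import numpy as np

    rng = np.random.default_rng(3)
    W = rng.normal(size=(6, 4))
    A = W.T @ W

    print(np.allclose(A, A.T))                                        # symmetry
    x = rng.normal(size=4)
    print(x @ A @ x >= 0, np.isclose(x @ A @ x, np.linalg.norm(W @ x) ** 2))
    print(np.all(np.linalg.eigvalsh(A) >= -1e-12))                    # eigenvalues >= 0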

  23. SVD in greater detail
  All eigenvalues of A are real. In fact,
  • let \lambda ∈ ℂ be an eigenvalue of A, and let v ∈ ℂ^m be a corresponding eigenvector: then, Av = \lambda v and \bar{v}^T A v = \bar{v}^T \lambda v = \lambda \bar{v}^T v
  • observe that, since A is real, the complex conjugates \bar{\lambda} and \bar{v} are themselves an eigenvalue-eigenvector pair for A: then, A\bar{v} = \bar{\lambda}\bar{v}. Since \bar{\lambda}\bar{v}^T = (\bar{\lambda}\bar{v})^T = (A\bar{v})^T = \bar{v}^T A^T = \bar{v}^T A by the symmetry of A, it derives that \bar{v}^T A v = \bar{\lambda}\bar{v}^T v
  • as a consequence, \lambda \bar{v}^T v = \bar{\lambda} \bar{v}^T v, that is, \lambda ||v||^2 = \bar{\lambda} ||v||^2
  • since v ≠ 0 (being an eigenvector), it must be \lambda = \bar{\lambda}, hence \lambda ∈ ℝ

  24. SVD in greater detail
  The eigenvectors of A corresponding to different eigenvalues are orthogonal.
  • Let v_1, v_2 ∈ ℂ^m be two eigenvectors, with corresponding distinct eigenvalues \lambda_1, \lambda_2
  • then, by the symmetry of A, \lambda_1 (v_1^T v_2) = (\lambda_1 v_1)^T v_2 = (A v_1)^T v_2 = v_1^T A^T v_2 = v_1^T A v_2 = v_1^T \lambda_2 v_2 = \lambda_2 (v_1^T v_2)
  • as a consequence, (\lambda_1 - \lambda_2) v_1^T v_2 = 0
  • since \lambda_1 ≠ \lambda_2, it must be v_1^T v_2 = 0, that is, v_1 and v_2 must be orthogonal
  If an eigenvalue \lambda′ has multiplicity m > 1, it is always possible to find a set of m orthonormal eigenvectors associated to \lambda′. As a result, there exists a set of eigenvectors of A which provides an orthonormal base.
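The same facts can be checked numerically with np.linalg.eigh, which is designed for symmetric matrices (a sketch added here, not part of the slides):

    # for the symmetric PSD matrix A = W^T W, eigh returns real eigenvalues
    # and an orthonormal set of eigenvectors
    import numpy as np

    rng = np.random.default_rng(4)
    W = rng.normal(size=(8, 5))
    A = W.T @ W

    eigvals, V = np.linalg.eigh(A)
    print(eigvals.dtype)                               # real (float64) eigenvalues
    print(np.allclose(V.T @ V, np.eye(A.shape[0])))    # orthonormal eigenvectors
    print(np.allclose(A @ V, V @ np.diag(eigvals)))    # A v_i = lambda_i v_i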
