Machine Learning (CSE 446): PCA (continued) and Learning as Minimizing Loss




  1. Machine Learning (CSE 446): PCA (continued) and Learning as Minimizing Loss. Sham M Kakade, © 2018 University of Washington. cse446-staff@cs.washington.edu

  2. PCA: continuing on...

  3. Dimension of Greatest Variance. Assume that the data are centered, i.e., that $\operatorname{mean}\left(\langle x_n \rangle_{n=1}^{N}\right) = 0$.

  4. Projection into One Dimension. Let $u$ be the dimension of greatest variance, where $\|u\|_2 = 1$. Then $p_n = x_n \cdot u$ is the projection of the $n$-th example onto $u$. Since the mean of the data is $0$, the mean of $\langle p_1, \ldots, p_N \rangle$ is also $0$. This implies that the variance of $\langle p_1, \ldots, p_N \rangle$ is $\frac{1}{N} \sum_{n=1}^{N} p_n^2$. The $u$ that gives the greatest variance, then, is $\operatorname{argmax}_u \sum_{n=1}^{N} (x_n \cdot u)^2$.
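
A minimal sketch (my own illustration, not course code): center the data, project it onto a unit vector $u$, and check numerically that the variance of the projections equals $\frac{1}{N}\sum_n p_n^2$, as claimed on the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))        # N = 100 examples, d = 3 features (made up)
X = X - X.mean(axis=0)               # center the data so the mean is 0

u = np.array([1.0, 2.0, -1.0])
u = u / np.linalg.norm(u)            # enforce ||u||_2 = 1

p = X @ u                            # p_n = x_n . u for every example
print(p.mean())                      # ~0: projections of centered data are centered
print(p.var(), np.mean(p ** 2))      # the two agree (up to floating point)
```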

  5. Finding the Maximum-Variance Direction.
     $\operatorname{argmax}_u \sum_{n=1}^{N} (x_n \cdot u)^2$, s.t. $\|u\|_2 = 1$
     (Why do we constrain $u$ to have length 1?)
     If we let $X = [x_1 \mid x_2 \mid \cdots \mid x_N]^\top$ (so row $n$ is $x_n^\top$), then we want $\operatorname{argmax}_u \|Xu\|_2^2$, s.t. $\|u\|_2 = 1$. This is PCA in one dimension!

  6. Linear algebra review: things to understand
     - $\|x\|_2$ is the Euclidean norm.
     - What is the dimension of $Xu$?
     - What is the $i$-th component of $Xu$?
     - Also, note: $\|u\|_2^2 = u^\top u$.
     - So what is $\|Xu\|_2^2$?
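
A quick numerical check of the review questions (my own sketch, using made-up data):

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 5, 3
X = rng.normal(size=(N, d))                        # rows are x_1^T, ..., x_N^T
u = rng.normal(size=d)

Xu = X @ u
print(Xu.shape)                                    # (5,): Xu has one entry per example
print(np.isclose(Xu[2], X[2] @ u))                 # the i-th component of Xu is x_i . u
print(np.isclose(u @ u, np.linalg.norm(u) ** 2))   # ||u||^2 = u^T u
print(np.isclose(Xu @ Xu, np.sum((X @ u) ** 2)))   # ||Xu||^2 = sum_n (x_n . u)^2
```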

  7. Constrained Optimization. The blue lines represent contours: all points on a blue line have the same objective function value.

  8. Deriving the Solution. Don't panic.
     $\operatorname{argmax}_u \|Xu\|_2^2$, s.t. $\|u\|_2 = 1$
     - The Lagrangian encoding of the problem moves the constraint into the objective:
       $\max_u \min_\lambda \; \|Xu\|_2^2 - \lambda \left( \|u\|_2^2 - 1 \right) \;\Rightarrow\; \min_\lambda \max_u \; \|Xu\|_2^2 - \lambda \left( \|u\|_2^2 - 1 \right)$
     - Gradient (first derivatives with respect to $u$): $2 X^\top X u - 2 \lambda u$
     - Setting it equal to $0$ leads to: $\lambda u = X^\top X u$
     - You may recognize this as the definition of an eigenvector ($u$) and eigenvalue ($\lambda$) for the matrix $X^\top X$.
     - We take the first (largest) eigenvalue.
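
A small sketch (assumed, not from the slides) confirming the derivation numerically: the top eigenvector of $X^\top X$ satisfies $\lambda u = X^\top X u$ and maximizes $\|Xu\|_2^2$ over unit vectors.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 4))
X = X - X.mean(axis=0)

evals, evecs = np.linalg.eigh(X.T @ X)       # symmetric matrix, eigenvalues ascending
lam, u = evals[-1], evecs[:, -1]             # largest eigenvalue and its eigenvector

print(np.allclose(X.T @ X @ u, lam * u))     # the eigenvector condition lambda*u = X^T X u
print(np.sum((X @ u) ** 2), lam)             # ||Xu||^2 equals the top eigenvalue
v = rng.normal(size=4)
v = v / np.linalg.norm(v)                    # any other unit vector...
print(np.sum((X @ v) ** 2) <= lam + 1e-9)    # ...achieves a smaller (or equal) objective
```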

  9. Deriving the Solution: Scratch space.

  10. Variance in Multiple Dimensions. So far, we've projected each $x_n$ into one dimension. To get a second direction $v$, we solve the same problem again, but this time with another constraint:
      $\operatorname{argmax}_v \|Xv\|_2^2$, s.t. $\|v\|_2 = 1$ and $u \cdot v = 0$
      (That is, we want a dimension that's orthogonal to the $u$ that we found earlier.) Following the same steps we had for $u$, the solution will be the second eigenvector.
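
My own check of this claim: under the extra constraint $u \cdot v = 0$, the best second direction is the eigenvector of $X^\top X$ with the second-largest eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 4))
X = X - X.mean(axis=0)

evals, evecs = np.linalg.eigh(X.T @ X)
u = evecs[:, -1]                        # the top direction from before
v = evecs[:, -2]                        # the second eigenvector

print(np.isclose(u @ v, 0.0))           # orthogonal, as the constraint requires
print(np.sum((X @ v) ** 2), evals[-2])  # its objective value is the second eigenvalue
```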

  11. “Eigenfaces”. Fig. from https://github.com/AlexOuyang/RealTimeFaceRecognition

  12. Principal Components Analysis
      - Input: unlabeled data $X = [x_1 \mid x_2 \mid \cdots \mid x_N]^\top$; dimensionality $K < d$.
      - Output: a $K$-dimensional "subspace".
      - Algorithm:
        1. Compute the mean $\mu$.
        2. Compute the covariance matrix: $\Sigma = \frac{1}{N} \sum_i (x_i - \mu)(x_i - \mu)^\top$.
        3. Let $\langle \lambda_1, \ldots, \lambda_K \rangle$ be the top $K$ eigenvalues of $\Sigma$ and $\langle u_1, \ldots, u_K \rangle$ be the corresponding eigenvectors.
      - Let $U = [u_1 \mid u_2 \mid \cdots \mid u_K]$. Return $U$.
      You can read about many algorithms for finding eigendecompositions of a matrix.
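
A sketch of the algorithm exactly as the slide states it (mean, covariance, top-$K$ eigenvectors); the function name and the made-up data are mine.

```python
import numpy as np

def pca(X, K):
    """X is N x d with one example per row; returns the mean and U = [u_1 | ... | u_K]."""
    N = X.shape[0]
    mu = X.mean(axis=0)                       # step 1: compute the mean
    Xc = X - mu
    Sigma = (Xc.T @ Xc) / N                   # step 2: the covariance matrix
    evals, evecs = np.linalg.eigh(Sigma)      # eigenvalues in ascending order
    U = evecs[:, ::-1][:, :K]                 # step 3: top-K eigenvectors as columns
    return mu, U

# Hypothetical usage on made-up data:
rng = np.random.default_rng(4)
X = rng.normal(size=(500, 10))
mu, U = pca(X, K=3)
Z = (X - mu) @ U                              # the K-dimensional representation
print(U.shape, Z.shape)                       # (10, 3) (500, 3)
```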

  13. Alternate View of PCA: Minimizing Reconstruction Error. Assume that the data are centered. Find a line which minimizes the squared reconstruction error.

  14. Projection and Reconstruction: the one-dimensional case
      - Take out the mean $\mu$.
      - Find the "top" eigenvector $u$ of the covariance matrix.
      - What are your projections?
      - What are your reconstructions, $\widetilde{X} = [\widetilde{x}_1 \mid \widetilde{x}_2 \mid \cdots \mid \widetilde{x}_N]^\top$?
      - What is your reconstruction error? $\frac{1}{N} \sum_i (x_i - \widetilde{x}_i)^2 = \,??$
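
One way to answer the slide's questions for the one-dimensional case (my own illustration; names like X_tilde are not from the course):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 5))
mu = X.mean(axis=0)
Xc = X - mu                                          # take out the mean

Sigma = Xc.T @ Xc / len(X)
u = np.linalg.eigh(Sigma)[1][:, -1]                  # "top" eigenvector of the covariance

p = Xc @ u                                           # projections: p_n = (x_n - mu) . u
X_tilde = mu + np.outer(p, u)                        # reconstructions: x~_n = mu + p_n * u
err = np.mean(np.sum((X - X_tilde) ** 2, axis=1))    # (1/N) * sum_i ||x_i - x~_i||^2
print(err)
```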

  15. Alternate View: Minimizing Reconstruction Error with a $K$-dimensional subspace. Equivalent ("dual") formulation of PCA: find an "orthonormal basis" $u_1, u_2, \ldots, u_K$ which minimizes the total reconstruction error on the data:
      $\operatorname{argmin}_{\text{orthonormal basis } u_1, \ldots, u_K} \; \frac{1}{N} \sum_i \left( x_i - \operatorname{Proj}_{u_1, \ldots, u_K}(x_i) \right)^2$
      Recall that the projection of $x$ onto a $K$-orthonormal basis is $\operatorname{Proj}_{u_1, \ldots, u_K}(x) = \sum_{j=1}^{K} (u_j \cdot x)\, u_j$. The SVD "simultaneously" finds all of $u_1, u_2, \ldots, u_K$.
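
A sketch of the projection onto an orthonormal basis and of the SVD giving all the directions at once (my own illustration, using made-up data):

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(300, 8))
X = X - X.mean(axis=0)
K = 3

_, _, Vt = np.linalg.svd(X, full_matrices=False)   # right-singular vectors = eigenvectors of X^T X
U = Vt[:K].T                                       # d x K matrix with orthonormal columns u_1..u_K

x = X[0]
proj = sum((U[:, j] @ x) * U[:, j] for j in range(K))   # Proj(x) = sum_j (u_j . x) u_j
print(np.allclose(proj, U @ (U.T @ x)))                 # the same projection in matrix form

recon_err = np.mean(np.sum((X - X @ U @ U.T) ** 2, axis=1))
print(recon_err)                                        # the total reconstruction error
```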

  16. Choosing $K$ (Hyperparameter Tuning). How do you select $K$ for PCA? Read CIML (similar methods as for $K$-means).
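
The slide defers to CIML; one common heuristic (an assumption on my part, not necessarily the course's recommendation) is to keep the smallest $K$ whose top-$K$ eigenvalues explain a target fraction of the total variance.

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(400, 20)) @ rng.normal(size=(20, 20))   # made-up correlated data
X = X - X.mean(axis=0)

evals = np.linalg.eigvalsh(X.T @ X / len(X))[::-1]   # eigenvalues, largest first
explained = np.cumsum(evals) / np.sum(evals)         # fraction of variance explained by top-K
K = int(np.searchsorted(explained, 0.95)) + 1        # smallest K reaching 95%
print(K, explained[K - 1])
```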

  17. PCA and Clustering. There's a unified view of both PCA and clustering.
      - $K$-means chooses cluster means so that the squared distances to the data are small.
      - PCA chooses a basis so that the reconstruction error of the data is small.
      Both attempt to find a "simple" way to summarize the data: fewer points or fewer dimensions. Both could be used to create new features for supervised learning.
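
A sketch of the "unified view" (my own illustration, not course code): both methods score a summary of the data by a mean squared error, cluster means for $K$-means and a low-dimensional basis for PCA.

```python
import numpy as np

rng = np.random.default_rng(8)
X = rng.normal(size=(200, 6))
X = X - X.mean(axis=0)
K = 4

z = rng.integers(0, K, size=len(X))                    # some cluster assignment
M = np.stack([X[z == k].mean(axis=0) for k in range(K)])
kmeans_obj = np.mean(np.sum((X - M[z]) ** 2, axis=1))  # squared distance to cluster means

U = np.linalg.svd(X, full_matrices=False)[2][:K].T     # a K-dimensional basis
pca_obj = np.mean(np.sum((X - X @ U @ U.T) ** 2, axis=1))  # squared reconstruction error
print(kmeans_obj, pca_obj)
```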

  18. Loss functions

  19. Perceptron. A model and an algorithm, rolled into one. Model: $f(x) = \operatorname{sign}(w \cdot x + b)$, known as linear, visualized by a (hopefully) separating hyperplane in feature space. Algorithm: PerceptronTrain, an error-driven, iterative updating algorithm.
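
A minimal sketch of the standard error-driven perceptron update, in the spirit of the PerceptronTrain algorithm referenced on the slide (the exact course pseudocode is not reproduced here, so details such as the stopping rule are assumptions):

```python
import numpy as np

def perceptron_train(X, y, max_iter=100):
    """X: N x d features; y: labels in {-1, +1}. Returns weights w and bias b."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_iter):
        for x_n, y_n in zip(X, y):
            if y_n * (w @ x_n + b) <= 0:     # a mistake: non-positive margin
                w = w + y_n * x_n            # update only when the model errs
                b = b + y_n
    return w, b

def predict(w, b, X):
    return np.sign(X @ w + b)                # the linear model f(x) = sign(w . x + b)
```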

  20. A Different View of PerceptronTrain: Optimization
      "Minimize training-set error rate":
      $\min_{w, b} \; \frac{1}{N} \sum_{n=1}^{N} \mathbf{1}\!\left[ y_n \cdot (w \cdot x_n + b) \le 0 \right]$
      The summand is the zero-one loss, the quantity $y \cdot (w \cdot x + b)$ is the margin, and the objective is the training error $\epsilon_{\text{train}}$. This problem is NP-hard; even trying to get a (multiplicative) approximation is NP-hard.
      What the perceptron does instead:
      $\min_{w, b} \; \frac{1}{N} \sum_{n=1}^{N} \max\!\left( -y_n \cdot (w \cdot x_n + b),\, 0 \right)$
      The summand here is the perceptron loss.
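
A sketch comparing the two objectives on the slide: the zero-one loss we would like to minimize, and the perceptron loss that the perceptron effectively minimizes (my own illustration, with made-up data).

```python
import numpy as np

def zero_one_loss(w, b, X, y):
    margins = y * (X @ w + b)
    return np.mean(margins <= 0)                  # fraction of non-positive margins

def perceptron_loss(w, b, X, y):
    margins = y * (X @ w + b)
    return np.mean(np.maximum(-margins, 0.0))     # mean of max(-y_n (w . x_n + b), 0)

rng = np.random.default_rng(9)
X = rng.normal(size=(50, 2))
y = np.sign(X[:, 0] + 0.1 * rng.normal(size=50))  # mostly separable labels
w, b = np.array([1.0, 0.0]), 0.0
print(zero_one_loss(w, b, X, y), perceptron_loss(w, b, X, y))
```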

  21. Smooth out the Loss?
