Machine Learning (CSE 446): Learning as Minimizing Loss; Least Squares


  1. Machine Learning (CSE 446): Learning as Minimizing Loss; Least Squares. Sham M. Kakade, © 2018 University of Washington. cse446-staff@cs.washington.edu

  2. Review

  3. Alternate View of PCA: Minimizing Reconstruction Error. Assume that the data are centered. Find a line which minimizes the squared reconstruction error.

  5. Alternate View: Minimizing Reconstruction Error with a K-dim subspace. Equivalent ("dual") formulation of PCA: find an "orthonormal basis" u_1, u_2, ..., u_K which minimizes the total reconstruction error on the data:
     $\operatorname*{argmin}_{\text{orthonormal basis } u_1,\dots,u_K} \ \frac{1}{N} \sum_i \big\| x_i - \mathrm{Proj}_{u_1,\dots,u_K}(x_i) \big\|^2$
     Recall that the projection of x onto a K-orthonormal basis is:
     $\mathrm{Proj}_{u_1,\dots,u_K}(x) = \sum_{j=1}^{K} (u_j \cdot x)\, u_j$
     The SVD "simultaneously" finds all of u_1, u_2, ..., u_K.
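
A minimal numpy sketch of this dual view (the function name and toy data are illustrative, not from the course): take the top-K right singular vectors of the centered data as the orthonormal basis and measure the average squared reconstruction error.

```python
import numpy as np

def pca_reconstruction_error(X, K):
    """Average squared error of reconstructing each row of X from its
    projection onto the top-K principal directions."""
    Xc = X - X.mean(axis=0)                          # assume centered data; center just in case
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    U_K = Vt[:K]                                     # rows are the orthonormal basis u_1, ..., u_K
    reconstruction = Xc @ U_K.T @ U_K                # Proj_{u_1,...,u_K}(x_i) for every row
    return np.mean(np.sum((Xc - reconstruction) ** 2, axis=1))

# toy usage: the error shrinks as K grows
X = np.random.randn(100, 5)
print([round(pca_reconstruction_error(X, K), 3) for K in range(6)])
```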

  6. Projection and Reconstruction: the one-dimensional case.
     - Take out the mean μ.
     - Find the "top" eigenvector u of the covariance matrix.
     - What are your projections?
     - What are your reconstructions $\tilde{X} = [\tilde{x}_1 \mid \tilde{x}_2 \mid \cdots \mid \tilde{x}_N]^\top$?
     - What is your reconstruction error of doing nothing (K = 0) versus using K = 1? With K = 0 the reconstruction is just the mean, so the error is the variance:
       $\frac{1}{N} \sum_i (x_i - \mu)^2 = \frac{1}{N} \sum_i (x_i - \tilde{x}_i)^2$
     - Reduction in error by using a k-dim PCA projection:
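
To make the last bullet concrete, a standard PCA fact (not spelled out on the slide): the K = 0 error is the sum of all eigenvalues of the covariance matrix, and keeping the top k directions removes exactly the top k eigenvalues:

     $\frac{1}{N}\sum_i \|x_i - \mu\|^2 = \sum_{j=1}^{d} \lambda_j, \qquad \frac{1}{N}\sum_i \big\|x_i - \mathrm{Proj}_{u_1,\dots,u_k}(x_i)\big\|^2 = \sum_{j=k+1}^{d} \lambda_j$

so the reduction in error is $\sum_{j=1}^{k} \lambda_j$, where $\lambda_1 \ge \cdots \ge \lambda_d$ are the eigenvalues of the covariance matrix.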

  7. PCA vs. Clustering. Summarize your data with fewer points or with fewer dimensions?

  8. Loss functions

  9. Today

  10. Perceptron. Perceptron Algorithm: a model and an algorithm, rolled into one. Isn't there a more principled methodology to derive algorithms?

  11. What we ("naively") want: "minimize the training-set error rate":
      $\min_{w,b} \ \frac{1}{N} \sum_{n=1}^{N} \mathbf{1}[\, y_n (w \cdot x_n + b) \le 0 \,]$
      where the indicator is the zero-one loss on point n, and the margin is $y \cdot (w \cdot x + b)$.
      This problem is NP-hard, even to approximate (multiplicatively). Why is this loss function so unwieldy?
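
For concreteness, a sketch of the quantity being minimized (variable names are illustrative). It is piecewise constant in (w, b), which is one way to see why it gives gradient-based methods nothing to work with.

```python
import numpy as np

def zero_one_risk(w, b, X, y):
    """Fraction of training points with non-positive margin y_n * (w·x_n + b)."""
    margins = y * (X @ w + b)
    return np.mean(margins <= 0)
```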

  12. Relax!
      - The mis-classification optimization problem:
        $\min_{w} \ \frac{1}{N} \sum_{n=1}^{N} \mathbf{1}[\, y_n (w \cdot x_n) \le 0 \,]$
      - Instead, let's try to choose a "reasonable" loss function ℓ(y_n, w · x_n) and then try to solve the relaxation:
        $\min_{w} \ \frac{1}{N} \sum_{n=1}^{N} \ell(y_n, w \cdot x_n)$
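
The relaxation keeps the same average and only swaps the indicator for ℓ. A minimal sketch of that template (the helper name and the two example surrogates are mine, though both losses appear later in the deck):

```python
import numpy as np

def empirical_risk(loss, w, X, y):
    """(1/N) * sum_n loss(y_n, w·x_n) for any surrogate loss."""
    return np.mean(loss(y, X @ w))

# candidate surrogates, assuming labels y in {-1, +1}
square_loss   = lambda y, z: (y - z) ** 2
logistic_loss = lambda y, z: np.log1p(np.exp(-y * z))
```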

  13. What is a good "relaxation"?
      - We want minimizing our surrogate loss to help with minimizing the mis-classification loss.
      - Idea: try to upper bound the zero-one loss (sharply) by ℓ:
        $\mathbf{1}[\, y (w \cdot x) \le 0 \,] \le \ell(y, w \cdot x)$
      - We want our relaxed optimization problem to be easy to solve.
      What properties might we want for ℓ(·)?

  14. What is a good "relaxation"?
      - We want minimizing our surrogate loss to help with minimizing the mis-classification loss.
      - Idea: try to upper bound the zero-one loss (sharply) by ℓ:
        $\mathbf{1}[\, y (w \cdot x) \le 0 \,] \le \ell(y, w \cdot x)$
      - We want our relaxed optimization problem to be easy to solve.
      What properties might we want for ℓ(·)?
      - Differentiable? Sensitive to changes in w?
      - Convex?

  15. The square loss! (and linear regression)
      - The square loss: ℓ(y, w · x) = (y − w · x)².
      - The relaxed optimization problem:
        $\min_{w} \ \frac{1}{N} \sum_{n=1}^{N} (y_n - w \cdot x_n)^2$
      - Nice properties:
        - For binary classification, it is an upper bound on the zero-one loss.
        - It makes sense more generally, e.g. if we want to predict a real-valued y.
        - We have a convex optimization problem.
      - For classification, what is your decision rule using a w?
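
One natural decision rule for the last question, assuming labels in {−1, +1} (a sketch, not the slide's stated answer): threshold the fitted w·x at zero.

```python
import numpy as np

def predict(w, X):
    """Classify with the regression fit: sign of w·x, with ties sent to +1."""
    return np.where(X @ w >= 0, 1, -1)
```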

  16. The square loss as an upper bound.
      - We have: $\mathbf{1}[\, y (w \cdot x) \le 0 \,] \le (y - w \cdot x)^2$
      - Easy to see by plotting:
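
A plotting sketch of that picture: for y in {−1, +1} both losses depend only on the margin m = y(w·x), since (y − w·x)² = (1 − m)².

```python
import numpy as np
import matplotlib.pyplot as plt

m = np.linspace(-2, 2, 400)            # margin y * (w·x)
zero_one = (m <= 0).astype(float)
square = (1 - m) ** 2                  # (y - w·x)^2 rewritten in terms of the margin

plt.plot(m, zero_one, label="zero-one loss")
plt.plot(m, square, label="square loss")
plt.xlabel("margin y (w·x)")
plt.legend()
plt.show()
```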

  17. Remember this problem? Data derived from https://archive.ics.uci.edu/ml/datasets/Auto+MPG. Columns: mpg; cylinders; displacement; horsepower; weight; acceleration; year; origin. Input: a row in this table. Goal: predict whether mpg is below 23 ("bad" = 0) or above ("good" = 1) given the input row.

  18. Remember this problem? Data derived from https://archive.ics.uci.edu/ml/datasets/Auto+MPG. Columns: mpg; cylinders; displacement; horsepower; weight; acceleration; year; origin. Input: a row in this table. Goal: predict whether mpg is below 23 ("bad" = 0) or above ("good" = 1) given the input row. Predicting a real y (often) makes more sense.
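
A hedged sketch of setting up that label (the filename and delimiter are assumptions about a local export of the data; the column names are the ones listed above):

```python
import pandas as pd

# assumed local export of the UCI Auto MPG data with the columns listed above
df = pd.read_csv("auto_mpg.csv", sep=";")

y = (df["mpg"] >= 23).astype(int)    # "good" = 1, "bad" = 0
X = df.drop(columns=["mpg"])         # the rest of the row is the input
```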

  19. A better (convex) upper bound.
      - The logistic loss: $\ell_{\text{logistic}}(y, w \cdot x) = \log(1 + \exp(-y\, w \cdot x))$.
      - We have: $\mathbf{1}[\, y (w \cdot x) \le 0 \,] \le \text{constant} \cdot \ell_{\text{logistic}}(y, w \cdot x)$
      - Again, easy to see by plotting:
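
A quick numeric check of the bound (1/log 2 is one constant that works, since the logistic loss equals log 2 at margin 0):

```python
import numpy as np

m = np.linspace(-5, 5, 1001)                 # margin y * (w·x)
zero_one = (m <= 0).astype(float)
logistic = np.log1p(np.exp(-m))

# 1[m <= 0] <= (1 / log 2) * logistic loss (tiny tolerance for floating point)
assert np.all(zero_one <= logistic / np.log(2) + 1e-9)
```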

  20. Least squares: let's minimize it!
      - The optimization problem:
        $\min_{w} \ \frac{1}{N} \sum_{n=1}^{N} (y_n - w \cdot x_n)^2 \;=\; \min_{w} \ \frac{1}{N} \| Y - X w \|^2$
        where Y is an n-vector and X is our n × d data matrix.
      - How do we interpret X w?

  21. Least squares: let's minimize it!
      - The optimization problem:
        $\min_{w} \ \frac{1}{N} \sum_{n=1}^{N} (y_n - w \cdot x_n)^2 \;=\; \min_{w} \ \frac{1}{N} \| Y - X w \|^2$
        where Y is an n-vector and X is our n × d data matrix.
      - How do we interpret X w?
      The solution is the least squares estimator:
        $w_{\text{least squares}} = (X^\top X)^{-1} X^\top Y$
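
A direct numpy sketch of the estimator on toy data (all names and numbers are illustrative; the literal inverse is shown only to mirror the formula, and the linear-system slide below is the better way to compute it):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                       # n = 50 points, d = 3 features
Y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

w_formula = np.linalg.inv(X.T @ X) @ X.T @ Y       # literal (X^T X)^{-1} X^T Y
w_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)    # numerically preferred routine

print(w_formula)
print(w_lstsq)
```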

  22. Matrix calculus proof: scratch space
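
One way to fill the scratch space, as a standard derivation (not necessarily the one written in lecture): expand the objective and set its gradient to zero.

      $\|Y - Xw\|^2 = Y^\top Y - 2\, w^\top X^\top Y + w^\top X^\top X\, w$
      $\nabla_w \|Y - Xw\|^2 = -2 X^\top Y + 2 X^\top X\, w = 0 \;\Longrightarrow\; X^\top X\, w = X^\top Y$
      so $w = (X^\top X)^{-1} X^\top Y$ whenever $X^\top X$ is invertible.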

  24. Remember your linear system solving!
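
In code, "remember your linear system solving" usually means: solve the normal equations XᵀX w = XᵀY with a linear solver rather than forming the inverse explicitly (a sketch):

```python
import numpy as np

def least_squares(X, Y):
    """Solve the normal equations X^T X w = X^T Y without forming an inverse."""
    return np.linalg.solve(X.T @ X, X.T @ Y)
```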

  25. Lots of questions:
      - What could go wrong with least squares?
      - Suppose we are in "high dimensions": more dimensions than data points.
      - Inductive bias: we need a way to control the complexity of the model.
      - How do we minimize the (summed) logistic loss?
      - Optimization: how do we do all of this quickly?
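
For the logistic-loss question, a minimal gradient-descent sketch (step size and iteration count are arbitrary illustrative choices, not course recommendations):

```python
import numpy as np

def logistic_grad(w, X, y):
    """Gradient of (1/N) * sum_n log(1 + exp(-y_n w·x_n)), labels in {-1, +1}."""
    margins = y * (X @ w)
    coeffs = -y / (1.0 + np.exp(margins))   # derivative of log(1 + e^{-m}) w.r.t. m, times y_n
    return (X * coeffs[:, None]).mean(axis=0)

def minimize_logistic(X, y, step=0.1, iters=1000):
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        w -= step * logistic_grad(w, X, y)
    return w
```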
