Machine Learning (CSE 446): Learning as Minimizing Loss: Regularization and Gradient Descent



  1. Machine Learning (CSE 446): Learning as Minimizing Loss: Regularization and Gradient Descent. Sham M Kakade, © 2018 University of Washington. cse446-staff@cs.washington.edu

  2. Announcements
     - Assignment 2 due tomorrow.
     - Midterm: Wednesday, Feb 7th.
     - Quiz section: review.
     - Today: Regularization and Optimization!

  3. Review

  4. Relax!
     - The mis-classification optimization problem: $\min_w \frac{1}{N} \sum_{n=1}^{N} \mathbf{1}[\, y_n (w \cdot x_n) \le 0 \,]$
     - Instead, use a loss function $\ell(y_n, w \cdot x_n)$ and solve a relaxation: $\min_w \frac{1}{N} \sum_{n=1}^{N} \ell(y_n, w \cdot x_n)$
     - What do we want? How do we get it? Speed? Accuracy?
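
A minimal NumPy sketch of the two objectives side by side (the dataset, weight vector, and function names here are made up for illustration, not from the slides): the 0-1 objective counts mistakes and is piecewise constant in $w$, while the relaxed objective averages a surrogate loss $\ell$ that varies smoothly with $w$.

```python
import numpy as np

def misclassification_rate(w, X, y):
    # (1/N) * sum_n 1[ y_n (w . x_n) <= 0 ]
    return np.mean(y * (X @ w) <= 0)

def relaxed_objective(w, X, y, loss):
    # (1/N) * sum_n loss(y_n, w . x_n), for a surrogate loss of your choice
    return np.mean(loss(y, X @ w))

# Example with a tiny random dataset and the logistic loss as the surrogate.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = np.sign(X[:, 0] - X[:, 1])
w = np.array([1.0, -1.0])
logistic = lambda y, s: np.logaddexp(0.0, -y * s)   # log(1 + exp(-y s))

print(misclassification_rate(w, X, y))       # 0.0: this w separates the toy data
print(relaxed_objective(w, X, y, logistic))  # small but positive surrogate value
```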

  5. Some loss functions:
     - The square loss: $\ell(y, w \cdot x) = (y - w \cdot x)^2$
     - The logistic loss: $\ell_{\text{logistic}}(y, w \cdot x) = \log(1 + \exp(-y\, w \cdot x))$
     - They both "upper bound" the mistake rate.
     - Instead of only classification, let's also care about "regression", where $y$ is real-valued.
     - What if we have multiple classes (not just binary classification)?
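
The two losses as short Python functions (the function names are mine). One way to make the quoted "upper bound" precise: for a label $y \in \{-1, +1\}$, the square loss is at least 1 on any misclassified example, and the logistic loss dominates the 0-1 indicator up to a $1/\log 2$ factor.

```python
import numpy as np

def square_loss(y, s):
    # s = w . x; for y in {-1, +1}, (y - s)^2 >= 1 whenever y * s <= 0
    return (y - s) ** 2

def logistic_loss(y, s):
    # log(1 + exp(-y s)), computed stably via logaddexp
    return np.logaddexp(0.0, -y * s)

# A few margins y * s: negative (a mistake), zero, positive (correct).
for y, s in [(1.0, -2.0), (1.0, 0.0), (1.0, 2.0)]:
    print(y * s, square_loss(y, s), logistic_loss(y, s))
```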

  6. Least squares: let's minimize it!
     - The optimization problem: $\min_w \frac{1}{N} \sum_{n=1}^{N} (y_n - w \cdot x_n)^2 = \min_w \frac{1}{N} \|Y - Xw\|^2$, where $Y$ is an $n$-vector and $X$ is our $n \times d$ data matrix.
     - The solution is the least squares estimator: $w_{\text{least squares}} = (X^\top X)^{-1} X^\top Y$
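
A sketch of the closed form in NumPy (the function name and the commented toy data are mine). Solving the normal equations $X^\top X w = X^\top Y$ is preferable to forming the inverse explicitly.

```python
import numpy as np

def least_squares_estimator(X, Y):
    """w = (X^T X)^{-1} X^T Y, computed by solving the normal equations."""
    return np.linalg.solve(X.T @ X, X.T @ Y)

# Usage sketch on made-up data: n = 100 examples, d = 3 features.
# rng = np.random.default_rng(0)
# X = rng.normal(size=(100, 3))
# Y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
# w_hat = least_squares_estimator(X, Y)
```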

  7. Matrix calculus proof: scratch space
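
The slide leaves this as scratch space (the derivation was presumably done in lecture); a standard sketch, expanding the squared norm and setting the gradient to zero, is:

$$\nabla_w \|Y - Xw\|^2 = \nabla_w \bigl( Y^\top Y - 2\, w^\top X^\top Y + w^\top X^\top X\, w \bigr) = -2\, X^\top Y + 2\, X^\top X\, w = 0 ,$$

which gives the normal equations $X^\top X\, w = X^\top Y$, and hence $w_{\text{least squares}} = (X^\top X)^{-1} X^\top Y$ when $X^\top X$ is invertible.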

  8. Let's remember our linear system solving!

  9. Today

  10. Least squares: What could go wrong?!
     - The optimization problem: $\min_w \frac{1}{N} \sum_{n=1}^{N} (y_n - w \cdot x_n)^2 = \min_w \frac{1}{N} \|Y - Xw\|^2$, where $Y$ is an $n$-vector and $X$ is our $n \times d$ data matrix.
     - The solution is the least squares estimator: $w_{\text{least squares}} = (X^\top X)^{-1} X^\top Y$
     - What if $d$ is bigger than $n$? Even if not?

  11. What could go wrong?
     - Suppose $d > n$. What about $n > d$?
     - What happens if features are very correlated? (e.g. rows/columns in our matrix are co-linear.)
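
A quick NumPy check of both failure modes (the data here is made up): with $d > n$, or with a perfectly correlated column, $X^\top X$ is rank-deficient, so the inverse in the closed form does not exist and the minimizer is not unique.

```python
import numpy as np

rng = np.random.default_rng(0)

# Case 1: more features than examples (d > n).
X_wide = rng.normal(size=(5, 10))
print(np.linalg.matrix_rank(X_wide.T @ X_wide))   # at most 5 < 10, so X^T X is singular

# Case 2: n > d, but two features are co-linear.
X_tall = rng.normal(size=(100, 3))
X_tall[:, 2] = 2.0 * X_tall[:, 1]                 # third column is a multiple of the second
print(np.linalg.matrix_rank(X_tall.T @ X_tall))   # 2 < 3, again singular
```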

  12. Linear system solving: scratch space

  13. A fix: Regularization
     - Regularize the optimization problem: $\min_w \frac{1}{N} \sum_{n=1}^{N} (y_n - w \cdot x_n)^2 + \lambda \|w\|^2 = \min_w \frac{1}{N} \|Y - Xw\|^2 + \lambda \|w\|^2$
     - This particular case: "Ridge" regression, also known as Tikhonov regularization.
     - The solution is the ridge regression estimator: $w_{\text{ridge}} = \left( \frac{1}{N} X^\top X + \lambda I \right)^{-1} \left( \frac{1}{N} X^\top Y \right)$
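
A sketch of this closed form in NumPy (the function name is mine). For $\lambda > 0$ the matrix $\frac{1}{N} X^\top X + \lambda I$ is always invertible, which is exactly what the regularization buys us in the singular cases above.

```python
import numpy as np

def ridge_estimator(X, Y, lam):
    """Minimizer of (1/N) ||Y - Xw||^2 + lam * ||w||^2 (the slide's closed form)."""
    n, d = X.shape
    A = X.T @ X / n + lam * np.eye(d)   # (1/N) X^T X + lam * I, invertible for lam > 0
    b = X.T @ Y / n                     # (1/N) X^T Y
    return np.linalg.solve(A, b)        # solve A w = b rather than forming A^{-1}
```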

  14. The "general" approach
     - The regularized optimization problem: $\min_w \frac{1}{N} \sum_{n=1}^{N} \ell(y_n, w \cdot x_n) + R(w)$
     - Penalize some $w$ more than others. Example: $R(w) = \|w\|^2$.
     - How do we find a solution quickly?
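
As one concrete instance (a sketch; the choice of the square loss and $R(w) = \lambda \|w\|^2$ is mine, matching the ridge slide), here is the regularized objective and the gradient that the gradient descent slides below will need.

```python
import numpy as np

def regularized_objective(w, X, Y, lam):
    # (1/N) * sum_n (y_n - w . x_n)^2 + lam * ||w||^2
    return np.mean((Y - X @ w) ** 2) + lam * w @ w

def regularized_gradient(w, X, Y, lam):
    # Gradient of the objective above: (2/N) X^T (Xw - Y) + 2 * lam * w
    n = X.shape[0]
    return (2.0 / n) * X.T @ (X @ w - Y) + 2.0 * lam * w
```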

  15. Remember: convexity
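
The slide is a prompt; for reference, the standard definition being recalled is that $F$ is convex when every chord lies on or above the graph:

$$F(\alpha z + (1 - \alpha) z') \le \alpha F(z) + (1 - \alpha) F(z') \quad \text{for all } z, z' \text{ and } \alpha \in [0, 1].$$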

  16. Gradient Descent
     - Want to solve: $\min_z F(z)$
     - How should we update $z$?

  17. Gradient Descent (Algorithm 1: GradientDescent)
     - Data: a function $F : \mathbb{R}^d \to \mathbb{R}$, a number of iterations $K$, step sizes $\eta^{(1)}, \ldots, \eta^{(K)}$
     - Result: $z \in \mathbb{R}^d$
     - Initialize $z^{(0)} = 0$; for $k \in \{1, \ldots, K\}$, set $z^{(k)} = z^{(k-1)} - \eta^{(k)} \cdot \nabla_z F(z^{(k-1)})$; return $z^{(K)}$
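
A direct Python transcription of Algorithm 1. One implementation choice not spelled out on the slide: the caller supplies the gradient $\nabla F$ rather than $F$ itself, since that is all the update uses.

```python
import numpy as np

def gradient_descent(grad_F, d, etas):
    """Algorithm 1: z^(k) = z^(k-1) - eta^(k) * grad F(z^(k-1)), starting from z^(0) = 0."""
    z = np.zeros(d)              # initialize z^(0) = 0
    for eta in etas:             # k = 1, ..., K
        z = z - eta * grad_F(z)
    return z                     # z^(K)

# Usage sketch: K = 100 iterations with a constant step size of 0.1.
# grad_F = lambda z: 2 * z       # gradient of F(z) = ||z||^2
# z_hat = gradient_descent(grad_F, d=3, etas=[0.1] * 100)
```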

  18. Gradient Descent: Convergence
     - Let $z^* = \operatorname{argmin}_z F(z)$ denote the global minimizer.
     - Let $z^{(k)}$ be our parameter after $k$ updates.
     - Thm: Suppose $F$ is convex and "$L$-smooth". Using a fixed step size $\eta \le \frac{1}{L}$, we have: $F(z^{(k)}) - F(z^*) \le \frac{\|z^{(0)} - z^*\|^2}{\eta \cdot k}$
     - That is, the convergence rate is $O(\frac{1}{k})$.
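
A small end-to-end check of the theorem on least squares (the data and sizes are made up). For $F(w) = \frac{1}{N}\|Y - Xw\|^2$ the smoothness constant is $L = \frac{2}{N}\lambda_{\max}(X^\top X)$, so $\eta = 1/L$ is a valid fixed step size; the printed suboptimality should stay below the theorem's bound $\|z^{(0)} - z^*\|^2 / (\eta k)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
Y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

F = lambda w: np.mean((Y - X @ w) ** 2)          # (1/N) ||Y - Xw||^2
grad = lambda w: (2.0 / n) * X.T @ (X @ w - Y)   # its gradient

L = 2.0 * np.linalg.eigvalsh(X.T @ X / n).max()  # smoothness constant of F
eta = 1.0 / L                                    # fixed step size eta <= 1/L

w_star, *_ = np.linalg.lstsq(X, Y, rcond=None)   # global minimizer, for reference

w = np.zeros(d)                                  # z^(0) = 0
for k in range(1, 201):
    w = w - eta * grad(w)
    if k in (1, 10, 100, 200):
        bound = np.sum(w_star ** 2) / (eta * k)  # ||z^(0) - z^*||^2 / (eta * k)
        print(k, F(w) - F(w_star), bound)
```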

  19. Smoothness and Gradient Descent Convergence
     - Smooth functions: for all $z, z'$, $\|\nabla F(z) - \nabla F(z')\| \le L \|z - z'\|$
     - Proof idea:
       1. If the gradient is large, we make good progress in decreasing the function value.
       2. If the gradient is small, the function value must already be near the optimal value.
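
Neither step is spelled out on the slide; a sketch of both, using only the smoothness and convexity definitions above. Step 1: $L$-smoothness implies the quadratic upper bound $F(z') \le F(z) + \nabla F(z) \cdot (z' - z) + \frac{L}{2}\|z' - z\|^2$, and plugging in one update $z' = z - \eta \nabla F(z)$ with $\eta \le \frac{1}{L}$ gives

$$F(z - \eta \nabla F(z)) \le F(z) - \frac{\eta}{2}\|\nabla F(z)\|^2 ,$$

so a large gradient forces a large decrease. Step 2: convexity gives $F(z) - F(z^*) \le \nabla F(z) \cdot (z - z^*) \le \|\nabla F(z)\| \, \|z - z^*\|$, so the suboptimality is small whenever the gradient is.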
