SLIDE 1

Machine Learning (CSE 446): Learning as Minimizing Loss: Regularization and Gradient Descent

Sham M Kakade

© 2018 University of Washington, cse446-staff@cs.washington.edu

SLIDE 2

Announcements

◮ Assignment 2 due tomorrow.
◮ Midterm: Wednesday, Feb 7th.
◮ Quiz section: review.
◮ Today: Regularization and Optimization!

SLIDE 3

Review

SLIDE 4

Relax!

◮ The mis-classification optimization problem:

min_w (1/N) Σ_{n=1}^N 1[ y_n (w · x_n) ≤ 0 ]

◮ Instead, use a loss function ℓ(y_n, w · x_n) and solve a relaxation:

min_w (1/N) Σ_{n=1}^N ℓ(y_n, w · x_n)

SLIDE 5

Relax!

◮ The mis-classification optimization problem:

min_w (1/N) Σ_{n=1}^N 1[ y_n (w · x_n) ≤ 0 ]

◮ Instead, use a loss function ℓ(y_n, w · x_n) and solve a relaxation:

min_w (1/N) Σ_{n=1}^N ℓ(y_n, w · x_n)

◮ What do we want? Speed? Accuracy?
◮ How do we get it?

SLIDE 6

Some loss functions:

◮ The square loss:

ℓ(y, w · x) = (y − w · x)²

◮ The logistic loss:

ℓ_logistic(y, w · x) = log(1 + exp(−y w · x)).

◮ They both “upper bound” the mistake rate (a quick numeric comparison follows below).
◮ Instead, let's care about “regression,” where y is real-valued.
◮ What if we have multiple classes (not just binary classification)?
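
As a small illustration of these relaxations, here is a minimal NumPy sketch (my own, not course code; all names are made up) comparing the 0-1 mistake indicator with the square and logistic losses on one example:

```python
import numpy as np

def zero_one_loss(y, score):
    """Mistake indicator: 1 if the prediction's sign disagrees with the label y in {-1, +1}."""
    return float(y * score <= 0)

def square_loss(y, score):
    """Square loss (y - w . x)^2."""
    return (y - score) ** 2

def logistic_loss(y, score):
    """Logistic loss log(1 + exp(-y (w . x)))."""
    return np.log1p(np.exp(-y * score))

w = np.array([0.5, -1.0])
x = np.array([2.0, 2.0])
y = 1.0
score = w @ x                       # w . x = -1.0
print(zero_one_loss(y, score),      # 1.0: the prediction's sign disagrees with y
      square_loss(y, score),        # 4.0
      logistic_loss(y, score))      # log(1 + e) ~ 1.313
```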

SLIDE 7

Least squares: let’s minimize it!

◮ The optimization problem:

min_w (1/N) Σ_{n=1}^N (y_n − w · x_n)² = min_w (1/N) ‖Y − Xw‖²

where Y is an n-vector and X is our n × d data matrix.

◮ The solution is the least squares estimator:

w_least squares = (X⊤X)⁻¹ X⊤Y
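
A minimal NumPy sketch of computing this estimator (my own illustration with synthetic data, not course code); solving the normal equations X⊤X w = X⊤Y avoids forming the inverse explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 100, 5
X = rng.normal(size=(N, d))                    # N x d data matrix
w_true = rng.normal(size=d)
Y = X @ w_true + 0.1 * rng.normal(size=N)      # noisy targets

# Least squares estimator: solve (X^T X) w = X^T Y.
w_ls = np.linalg.solve(X.T @ X, X.T @ Y)

# Equivalent, and more robust numerically: a dedicated least-squares solver.
w_alt, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(np.allclose(w_ls, w_alt))                # True
```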

SLIDE 8

Matrix calculus proof: scratch space
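
The slide leaves this as scratch space; a standard derivation (my reconstruction, not the lecture's handwritten work) sets the gradient of the squared error to zero:

∇_w ‖Y − Xw‖² = −2 X⊤(Y − Xw) = 0
⟹ X⊤X w = X⊤Y
⟹ w_least squares = (X⊤X)⁻¹ X⊤Y   (when X⊤X is invertible)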

SLIDE 9

Matrix calculus proof: scratch space

SLIDE 10

Let’s remember our linear system solving!

SLIDE 11

Today

SLIDE 12

Least squares: What could go wrong?!

◮ The optimization problem:

min_w (1/N) Σ_{n=1}^N (y_n − w · x_n)² = min_w (1/N) ‖Y − Xw‖²

where Y is an n-vector and X is our n × d data matrix.

◮ The solution is the least squares estimator:

w_least squares = (X⊤X)⁻¹ X⊤Y

SLIDE 13

Least squares: What could go wrong?!

◮ The optimization problem:

min_w (1/N) Σ_{n=1}^N (y_n − w · x_n)² = min_w (1/N) ‖Y − Xw‖²

where Y is an n-vector and X is our n × d data matrix.

◮ The solution is the least squares estimator:

w_least squares = (X⊤X)⁻¹ X⊤Y

What if d is bigger than n? Even if not?

SLIDE 14

What could go wrong?

Suppose d > n: What about n > d?

SLIDE 15

What could go wrong?

Suppose d > n: What about n > d?

◮ What happens if features are very correlated?

(e.g., rows/columns in our matrix are co-linear; see the sketch below.)
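
A small NumPy sketch (my own, not from the lecture) of both failure modes: when d > n, or when columns are co-linear, X⊤X is singular or badly conditioned, so the plain inverse in the estimator above cannot be trusted:

```python
import numpy as np

rng = np.random.default_rng(0)

# Case 1: more features than examples (d > n), so X^T X cannot have full rank d.
n, d = 5, 10
X_wide = rng.normal(size=(n, d))
print(np.linalg.matrix_rank(X_wide.T @ X_wide))    # at most n = 5, not d = 10

# Case 2: co-linear columns make X^T X singular (huge condition number).
x1 = rng.normal(size=100)
X_colinear = np.column_stack([x1, 2 * x1, rng.normal(size=100)])
print(np.linalg.cond(X_colinear.T @ X_colinear))   # astronomically large
```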

SLIDE 16

linear system solving: scratch space

SLIDE 17

A fix: Regularization

◮ Regularize the optimization problem:

min_w (1/N) Σ_{n=1}^N (y_n − w · x_n)² + λ‖w‖² = min_w (1/N) ‖Y − Xw‖² + λ‖w‖²

◮ This particular case: “Ridge” regression, also known as Tikhonov regularization.
◮ The solution is the ridge regression estimator:

w_ridge = ((1/N) X⊤X + λI)⁻¹ (1/N) X⊤Y
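
A minimal NumPy sketch of the ridge estimator above (my own illustration with made-up data). For any λ > 0 the matrix (1/N) X⊤X + λI is invertible, so the solution exists even in the d > n and co-linear cases from the previous slides:

```python
import numpy as np

def ridge(X, Y, lam):
    """Ridge estimator: ((1/N) X^T X + lam * I)^{-1} (1/N) X^T Y."""
    N, d = X.shape
    A = X.T @ X / N + lam * np.eye(d)
    b = X.T @ Y / N
    return np.linalg.solve(A, b)

rng = np.random.default_rng(0)
N, d = 5, 10                        # d > N: plain least squares would fail here
X = rng.normal(size=(N, d))
Y = rng.normal(size=N)
print(ridge(X, Y, lam=0.1))         # still a well-defined d-dimensional solution
```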

SLIDE 18

The “general” approach

◮ The regularized optimization problem:

min_w (1/N) Σ_{n=1}^N ℓ(y_n, w · x_n) + R(w)

◮ Penalize some w more than others.

Example: R(w) = ‖w‖².

How do we find a solution quickly?

SLIDE 19

Remember: convexity

SLIDE 20

Gradient Descent

◮ Want to solve:

min_z F(z)

◮ How should we update z?

SLIDE 21

Gradient Descent

Data: function F : R^d → R, number of iterations K, step sizes η(1), . . . , η(K)
Result: z ∈ R^d
initialize: z(0) = 0;
for k ∈ {1, . . . , K} do
    z(k) = z(k−1) − η(k) · ∇_z F(z(k−1));
end
return z(K);
Algorithm 1: GradientDescent
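
A direct Python transcription of Algorithm 1 (a sketch; the function and argument names are mine, and the gradient ∇F is passed in explicitly rather than derived from F):

```python
import numpy as np

def gradient_descent(grad_F, d, K, step_sizes):
    """Algorithm 1: z(k) = z(k-1) - eta(k) * grad F(z(k-1)), starting from z(0) = 0."""
    z = np.zeros(d)                               # initialize: z(0) = 0
    for k in range(K):
        z = z - step_sizes[k] * grad_F(z)         # one gradient step
    return z

# Example: minimize F(z) = ||z - c||^2, whose gradient is 2 (z - c).
c = np.array([3.0, -1.0])
z_hat = gradient_descent(lambda z: 2 * (z - c), d=2, K=100, step_sizes=[0.1] * 100)
print(z_hat)                                      # close to c = [3, -1]
```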

SLIDE 22

Gradient Descent: Convergence

◮ Let z∗ = argmin_z F(z) denote the global minimum.
◮ Let z(k) be our parameter after k updates.
◮ Thm: Suppose F is convex and “L-smooth”. Using a fixed step size η ≤ 1/L, we have:

F(z(k)) − F(z∗) ≤ ‖z(0) − z∗‖² / (η · k)

That is, the convergence rate is O(1/k).
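
A quick numerical check of this guarantee on the least squares objective (my own sketch, not course code; for F(z) = (1/N)‖Y − Xz‖² a valid smoothness constant is L = 2 λ_max(X⊤X)/N, and the fixed step size is taken as η = 1/L):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 200, 5
X = rng.normal(size=(N, d))
Y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=N)

F = lambda z: np.mean((Y - X @ z) ** 2)            # F(z) = (1/N) ||Y - Xz||^2
grad_F = lambda z: -2 * X.T @ (Y - X @ z) / N
L = 2 * np.linalg.eigvalsh(X.T @ X).max() / N      # smoothness constant of F
eta = 1 / L

z_star = np.linalg.solve(X.T @ X, X.T @ Y)         # exact minimizer
for K in (10, 100, 1000):
    z = np.zeros(d)
    for _ in range(K):
        z = z - eta * grad_F(z)
    print(K, F(z) - F(z_star))                     # suboptimality shrinks at least as fast as 1/K
```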

SLIDE 23

Smoothness and Gradient Descent Convergence

◮ Smooth functions: for all z, z′

‖∇F(z) − ∇F(z′)‖ ≤ L ‖z − z′‖

◮ Proof idea (the two steps are sketched below):

  1. If our gradient is large, we will make good progress decreasing our function value.
  2. If our gradient is small, our value must already be near the optimal value.
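
Filling in those two steps with the standard inequalities (my reconstruction; the slide left space for handwritten work):

1. From L-smoothness, a gradient step with η ≤ 1/L decreases the objective:
   F(z − η ∇F(z)) ≤ F(z) − (η/2) ‖∇F(z)‖²

2. From convexity and Cauchy–Schwarz:
   F(z) − F(z∗) ≤ ∇F(z) · (z − z∗) ≤ ‖∇F(z)‖ ‖z − z∗‖

So a large gradient forces progress, a small gradient forces near-optimality, and combining the two gives the O(1/k) rate.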
