SLIDE 1

CS6501: Deep Learning for Visual Recognition

Stochastic Gradient Descent (SGD)

SLIDE 2

Today’s Class

Stochastic Gradient Descent (SGD)

  • SGD Recap
  • Regression vs Classification
  • Generalization / Overfitting / Underfitting
  • Regularization
  • Momentum Updates / ADAM Updates
SLIDE 3

Our function L(w)

L(w) = 3 + (w − 4)^2

SLIDE 4

Our function L(w)

L(w) = 3 + (w − 4)^2

Easy way to find the minimum (and the maximum): find where the derivative is zero.

dL/dw = 2(w − 4)

This is zero when w = 4.

SLIDE 5

Our function L(w)

L(w) = 3 + (w − 4)^2

But this is not easy for complex functions:

L(w_1, w_2, …, w_12) = −logsoftmax(g(w_1, w_2, …, w_12, x_1))[label_1]
                     − logsoftmax(g(w_1, w_2, …, w_12, x_2))[label_2]
                     − …
                     − logsoftmax(g(w_1, w_2, …, w_12, x_n))[label_n]

SLIDE 6

Our function L(w)

L(w) = 3 + (w − 4)^2

Or even for simpler functions:

L(x) = e^(−x) + x^2
dL(x)/dx = −e^(−x) + 2x
0 = −e^(−x) + 2x

How do you find x?
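Setting this derivative to zero gives an equation with no closed-form solution, so the root has to be found numerically. A minimal sketch in plain Python (the bracket [0, 1] is an assumption, justified by the sign change of the derivative across it):

```python
import math

def dL(x):
    # derivative of L(x) = e^(-x) + x^2
    return -math.exp(-x) + 2 * x

# dL(0) = -1 < 0 and dL(1) = 2 - 1/e > 0, so a root lies in [0, 1].
lo, hi = 0.0, 1.0
for _ in range(60):            # bisection: halve the bracket each iteration
    mid = 0.5 * (lo + hi)
    if dL(mid) < 0:
        lo = mid
    else:
        hi = mid

print(lo)  # ~0.3517, the minimizer of L(x)
```

Gradient descent, introduced next, avoids even this root-finding step: it only evaluates the derivative and follows it downhill.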

SLIDE 7

Gradient Descent (GD) (idea)

[Figure: plot of L(w) versus w, with the current point marked at w = 12]

  • 1. Start with a random value of w (e.g. w = 12)

  • 2. Compute the gradient (derivative) of L(w) at the point w = 12 (e.g. dL/dw = 6)
  • 3. Recompute w as: w = w – lambda * (dL / dw)
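A minimal sketch of these three steps on the running example L(w) = 3 + (w − 4)^2; the step size (the slide's lambda) is an assumed value of 0.1:

```python
def dL(w):
    return 2 * (w - 4)        # derivative of L(w) = 3 + (w - 4)^2

w = 12.0                      # 1. start with an initial value (fixed here)
lam = 0.1                     # assumed step size lambda
for _ in range(100):
    grad = dL(w)              # 2. compute the gradient at the current w
    w = w - lam * grad        # 3. recompute w as w = w - lambda * (dL/dw)

print(w)                      # approaches the minimizer w = 4
```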

SLIDE 8

Gradient Descent (GD) (idea)

[Figure: same plot; after one update the current point has moved to w = 10]

  • 2. Compute the gradient (derivative) of L(w) at the current point (now w = 10)
  • 3. Recompute w as: w = w – lambda * (dL / dw)

SLIDE 9

Gradient Descent (GD) (idea)

[Figure: same plot; after another update the current point has moved to w = 8]

  • 2. Compute the gradient (derivative) of L(w) at the current point (now w = 8)
  • 3. Recompute w as: w = w – lambda * (dL / dw)

SLIDE 10

Gradient Descent (GD)

L(w, b) = Σ_{i=1..n} −logsoftmax(g(x_i; w, b))[label_i]

λ = 0.01
Initialize w and b randomly
for e = 0, num_epochs do
    Compute: ∂L(w, b)/∂w and ∂L(w, b)/∂b
    Update w: w = w − λ ∂L(w, b)/∂w
    Update b: b = b − λ ∂L(w, b)/∂b
    Print: L(w, b)   // Useful to see if this is becoming smaller or not.
end
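As an illustration only (a sketch, not the course's reference code), here is this full-batch loop in NumPy for a tiny softmax model g(x) = Wx + b; the toy data, sizes, and epoch count are all assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))             # 100 examples, 5 features (toy data)
labels = rng.integers(0, 3, size=100)     # one of 3 categories per example
W, b = np.zeros((5, 3)), np.zeros(3)      # initialize w and b (zeros for simplicity)
lam = 0.01

for e in range(500):                      # for e = 0, num_epochs do
    scores = X @ W + b                    # g(x; W, b) for every example
    scores -= scores.max(axis=1, keepdims=True)          # for numerical stability
    p = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)   # softmax
    L = -np.log(p[np.arange(len(X)), labels]).sum()      # the loss above
    dscores = p.copy()
    dscores[np.arange(len(X)), labels] -= 1.0            # gradient of -logsoftmax
    W -= lam * (X.T @ dscores)            # w = w - lambda * dL/dw
    b -= lam * dscores.sum(axis=0)        # b = b - lambda * dL/db
    if e % 100 == 0:
        print(L)                          # useful to see if this is becoming smaller
```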

SLIDE 11

Gradient Descent (GD)

(Same full-batch loop as above.) Computing the gradient over all n training examples for every single update is expensive.

SLIDE 12

(mini-batch) Stochastic Gradient Descent (SGD)

L(w, b) = Σ_{i∈B} −logsoftmax(g(x_i; w, b))[label_i]   (sum over a mini-batch B only)

λ = 0.01
Initialize w and b randomly
for e = 0, num_epochs do
    for b = 0, num_batches do
        Compute: ∂L(w, b)/∂w and ∂L(w, b)/∂b
        Update w: w = w − λ ∂L(w, b)/∂w
        Update b: b = b − λ ∂L(w, b)/∂b
        Print: L(w, b)   // Useful to see if this is becoming smaller or not.
    end
end
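Compared with the full-batch loop, only the inner loop is new: each update uses the gradient over a single mini-batch B. A runnable sketch under the same assumed toy setup as before:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
labels = rng.integers(0, 3, size=100)
W, b = np.zeros((5, 3)), np.zeros(3)
num_epochs, batch_size, lam = 20, 32, 0.01

def grads(Xb, yb, W, b):
    # gradient of sum_{i in B} -logsoftmax(Xb @ W + b)[y_i]
    s = Xb @ W + b
    s -= s.max(axis=1, keepdims=True)
    p = np.exp(s) / np.exp(s).sum(axis=1, keepdims=True)
    p[np.arange(len(Xb)), yb] -= 1.0
    return Xb.T @ p, p.sum(axis=0)

for e in range(num_epochs):                   # for e = 0, num_epochs do
    order = rng.permutation(len(X))           # visit examples in random order
    for s0 in range(0, len(X), batch_size):   # for b = 0, num_batches do
        B = order[s0:s0 + batch_size]         # the indices i in this batch B
        dW, db = grads(X[B], labels[B], W, b)
        W -= lam * dW                         # w = w - lambda * dL/dw
        b -= lam * db                         # b = b - lambda * dL/db
```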

SLIDE 13

(mini-batch) Stochastic Gradient Descent (SGD)

(Same mini-batch loop as above.) With |B| = 1, i.e. one example per batch, this is the classic stochastic gradient descent.

SLIDE 14

Regression vs Classification

Regression

  • Labels are continuous variables – e.g. distance.
  • Losses: Distance-based losses, e.g. sum of distances to true values.
  • Evaluation: Mean distances, correlation coefficients, etc.

Classification

  • Labels are discrete variables (1 out of K categories).
  • Losses: Cross-entropy loss, margin losses, logistic regression (binary cross-entropy).
  • Evaluation: Classification accuracy, etc.
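To make the two loss families concrete, a small sketch computing a distance-based regression loss and a cross-entropy classification loss; every number here is an illustrative assumption:

```python
import numpy as np

# Regression: continuous labels, distance-based loss.
y_true = np.array([1.5, 2.0, 3.7])            # e.g. true distances
y_pred = np.array([1.4, 2.3, 3.5])
mse = np.mean((y_pred - y_true) ** 2)         # mean squared distance

# Classification: discrete labels (1 out of K categories), cross-entropy loss.
logits = np.array([[2.0, 0.5, -1.0]])         # scores for K = 3 categories
label = np.array([0])                         # the correct category
p = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
cross_entropy = -np.log(p[0, label[0]])

print(mse, cross_entropy)
```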

SLIDE 15

Linear Regression – 1 output, 1 input

[Figure: scatter plot of data points (x_1, y_1), …, (x_8, y_8) in the (x, y) plane]

SLIDE 16

Linear Regression – 1 output, 1 input

[Figure: the same scatter plot of (x_i, y_i)]

Model: ŷ(x) = wx + b

SLIDE 17

Linear Regression – 1 output, 1 input

(Same figure and model as the previous slide: ŷ(x) = wx + b.)

SLIDE 18

Linear Regression – 1 output, 1 input

[Figure: the same scatter plot of (x_i, y_i)]

Model: ŷ(x) = wx + b
Loss: L(w, b) = Σ_{i=1..n} (ŷ(x_i) − y_i)^2
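A minimal sketch fitting ŷ(x) = wx + b by gradient descent on this squared-error loss; the synthetic (x_i, y_i) below are assumptions standing in for the points in the figure:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=8)                        # the x_i
y = 1.7 * x + 0.5 + rng.normal(scale=0.05, size=8)   # noisy y_i (assumed data)

w, b, lam = 0.0, 0.0, 0.05
for _ in range(5000):
    err = (w * x + b) - y          # y_hat(x_i) - y_i for all i
    dw = 2 * np.sum(err * x)       # dL/dw for L(w, b) = sum_i (y_hat(x_i) - y_i)^2
    db = 2 * np.sum(err)           # dL/db
    w -= lam * dw
    b -= lam * db

print(w, b)                        # close to the generating values 1.7 and 0.5
```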

SLIDE 19

Quadratic Regression

[Figure: the same scatter plot of (x_i, y_i)]

Model: ŷ(x) = w_1 x^2 + w_2 x + b
Loss: L(w, b) = Σ_{i=1..n} (ŷ(x_i) − y_i)^2

SLIDE 20

n-polynomial Regression

[Figure: the same scatter plot of (x_i, y_i)]

Model: ŷ(x) = w_n x^n + ⋯ + w_1 x + b
Loss: L(w, b) = Σ_{i=1..n} (ŷ(x_i) − y_i)^2
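For any fixed degree n the model is linear in its weights, so the same squared-error loss can also be minimized in one shot with a least-squares solve; a sketch using NumPy (np.vander builds the feature matrix with columns [x^n, …, x, 1]; the toy data is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=8)
y = np.sin(2 * x) + rng.normal(scale=0.1, size=8)    # assumed toy data

degree = 3
A = np.vander(x, degree + 1)                          # columns [x^3, x^2, x, 1]
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)        # minimizes sum (y_hat_i - y_i)^2
print(coeffs)   # [w_3, w_2, w_1, b] of y_hat(x) = w_3 x^3 + w_2 x^2 + w_1 x + b
```

Gradient descent on the same loss converges to this solution; the higher the degree, the more exactly the curve can thread the training points, which is where the next slide picks up.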

SLIDE 21

Overfitting

[Figure (from Christopher M. Bishop, Pattern Recognition and Machine Learning): three fits to the same data]

  • ŷ is linear: Loss(w) is high → Underfitting (high bias)
  • ŷ is cubic: Loss(w) is low → a good fit
  • ŷ is a polynomial of degree 9: Loss(w) is zero! → Overfitting (high variance)

SLIDE 22

Regularization

  • Large weights lead to large variance, i.e. the model fits the training data too strongly.

  • Solution: Minimize the loss but also try to keep the weight values small by doing the following:

minimize L(w, b) + α Σ_j |w_j|^2

SLIDE 23

Regularization

  • Large weights lead to large variance, i.e. the model fits the training data too strongly.

  • Solution: Minimize the loss but also try to keep the weight values small by doing the following:

minimize L(w, b) + α Σ_j |w_j|^2

The α Σ_j |w_j|^2 part is the regularizer term, e.g. the L2 regularizer.

SLIDE 24

SGD with Regularization (L2)

L_reg(w, b) = L(w, b) + α Σ_j |w_j|^2

λ = 0.01
Initialize w and b randomly
for e = 0, num_epochs do
    for b = 0, num_batches do
        Compute: ∂L(w, b)/∂w and ∂L(w, b)/∂b
        Update w: w = w − λ ∂L(w, b)/∂w − λαw
        Update b: b = b − λ ∂L(w, b)/∂b
        Print: L(w, b)   // Useful to see if this is becoming smaller or not.
    end
end
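The regularizer changes the update only through the extra −λαw term (often called weight decay). A self-contained sketch on an assumed linear-regression loss; the variable names and constants are mine, not the slides':

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

w, b = np.zeros(3), 0.0
lam, alpha = 0.01, 0.1            # learning rate and L2 coefficient (assumed values)

for e in range(2000):
    err = X @ w + b - y
    dw = 2 * (X.T @ err) / len(X)           # data-loss gradient dL/dw
    db = 2 * err.mean()                      # dL/db
    w = w - lam * dw - lam * alpha * w       # extra -lam*alpha*w from the L2 term
    b = b - lam * db                         # bias not regularized in the objective

print(w)   # shrunk slightly toward 0 relative to the unregularized fit
```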

SLIDE 25

Revisiting Another Problem with SGD

(Same regularized mini-batch loop as above.)

The mini-batch gradients are only approximations to the true gradient of L(w, b) over the whole training set.

SLIDE 26

Revisiting Another Problem with SGD

(Same regularized mini-batch loop as above.)

This could lead to “un-learning” what has been learned in some previous steps of training.

SLIDE 27

Solution: Momentum Updates

(Same regularized mini-batch loop as above.)

Idea: keep track of previous gradients in an accumulator variable, and use a weighted average with the current gradient.

SLIDE 28

Solution: Momentum Updates

L_reg(w, b) = L(w, b) + α Σ_j |w_j|^2

λ = 0.01, β = 0.9
Initialize w and b randomly; keep a global accumulator v (initialized to 0)
for e = 0, num_epochs do
    for b = 0, num_batches do
        Compute: ∂L(w, b)/∂w
        Compute: v = βv + ∂L(w, b)/∂w + αw
        Update w: w = w − λv   (the update for b is analogous)
        Print: L(w, b)   // Useful to see if this is becoming smaller or not.
    end
end

Keep track of previous gradients in the accumulator v, and use a weighted average with the current gradient.
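A minimal sketch of this momentum update on the earlier toy loss L(w) = 3 + (w − 4)^2; the toy loss and the constants are assumptions, and the accumulator v persists across steps exactly as in the pseudocode:

```python
import numpy as np

lam, beta, alpha = 0.01, 0.9, 0.0   # step size, momentum, L2 term (off here)
w = np.array([12.0])
v = np.zeros_like(w)                # the "global" accumulator from the slide

def dL(w):
    return 2 * (w - 4)              # gradient of L(w) = 3 + (w - 4)^2

for step in range(300):
    v = beta * v + dL(w) + alpha * w    # v = beta*v + dL/dw + alpha*w
    w = w - lam * v                     # w = w - lambda * v

print(w)   # approaches 4; past gradients smooth out step-to-step noise
```

Because v averages recent gradients, a single noisy mini-batch gradient can no longer yank the weights far off course, which is exactly the "un-learning" problem the previous slides raised.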

SLIDE 29

More on Momentum

https://distill.pub/2017/momentum/

SLIDE 30

Questions?
