
Stochastic Gradient Descent (SGD) - Today's Class - PowerPoint PPT Presentation

CS6501: Deep Learning for Visual Recognition. Today's class: Stochastic Gradient Descent (SGD) - SGD recap, regression vs. classification, generalization / overfitting / underfitting, regularization.


  1. CS6501: Deep Learning for Visual Recognition – Stochastic Gradient Descent (SGD)

  2. Today's Class: Stochastic Gradient Descent (SGD)
• SGD recap
• Regression vs classification
• Generalization / overfitting / underfitting
• Regularization
• Momentum updates / ADAM updates

  3. Our function L(w): L(w) = 3 + (w - 4)^2

  4. Our function L(w): L(w) = 3 + (w - 4)^2. Easy way to find the minimum (and maximum): find where dL(w)/dw = 0. Here dL(w)/dw = 2(w - 4), which is zero when w = 4.
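As a quick sanity check of that derivation, the snippet below (a sketch using sympy, not part of the lecture) differentiates the toy loss and solves dL/dw = 0 symbolically.

```python
import sympy as sp

# Symbolically verify the minimum of L(w) = 3 + (w - 4)^2.
w = sp.symbols('w')
L = 3 + (w - 4) ** 2
dL_dw = sp.diff(L, w)        # 2*(w - 4)
print(sp.solve(dL_dw, w))    # [4], the point where the derivative vanishes
```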

  5. Our function L(w): L(w) = 3 + (w - 4)^2. But this is not easy for complex functions, e.g. L(w_1, w_2, ..., w_12) = -logsoftmax(f(w_1, ..., w_12, x_1), label_1) - logsoftmax(f(w_1, ..., w_12, x_2), label_2) - ... - logsoftmax(f(w_1, ..., w_12, x_n), label_n)

  6. Our function L(w): L(w) = 3 + (w - 4)^2. Or even for simpler functions: L(x) = e^(-x) + x^2, so dL(x)/dx = -e^(-x) + 2x. Setting this to zero: how do you find x?
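Setting -e^(-x) + 2x = 0 has no closed-form solution; a generic numerical root finder can locate x, as in the hedged sketch below (the lecture's own answer, gradient descent, starts on the next slide).

```python
import numpy as np
from scipy.optimize import brentq

# dL/dx = -exp(-x) + 2x changes sign on [0, 1], so a bracketing root finder works.
x_star = brentq(lambda x: -np.exp(-x) + 2 * x, 0.0, 1.0)
print(x_star)   # ~0.3517, the minimizer of L(x) = exp(-x) + x^2
```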

  7. Gradient Descent (GD) (idea)
1. Start with a random value of w (e.g. w = 12).
2. Compute the gradient (derivative) of L(w) at the point w = 12 (e.g. dL/dw = 6).
3. Recompute w as: w = w - lambda * (dL/dw).
[Plot of L(w) with the current point at w = 12]

  8. Gradient Descent (GD) (idea)
2. Compute the gradient (derivative) of L(w) at the current point.
3. Recompute w as: w = w - lambda * (dL/dw).
[Plot of L(w) with the current point now at w = 10]

  9. Gradient Descent (GD) (idea)
2. Compute the gradient (derivative) of L(w) at the current point.
3. Recompute w as: w = w - lambda * (dL/dw).
[Plot of L(w) with the current point now at w = 8]
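A minimal sketch of the three steps above in Python, run on the toy loss L(w) = 3 + (w - 4)^2 from slide 3; the starting point w = 12 comes from the slides, while the learning rate and number of steps are illustrative choices.

```python
# Gradient descent on the toy loss L(w) = 3 + (w - 4)^2.
def L(w):
    return 3 + (w - 4) ** 2

def dL_dw(w):
    return 2 * (w - 4)              # analytic derivative of L

w = 12.0                            # step 1: start from some initial value
lam = 0.1                           # learning rate (lambda); illustrative choice
for step in range(25):
    w = w - lam * dL_dw(w)          # steps 2-3: compute the gradient, move against it
    print(f"step {step:2d}  w = {w:.4f}  L(w) = {L(w):.4f}")
# w converges toward the minimizer w = 4, where dL/dw = 0.
```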

  10. Gradient Descent (GD)
lambda = 0.01
L(w, b) = Σ_{i ∈ D} -log f_{label_i}(w, b)   (sum over the whole dataset D)
Initialize w and b randomly
for e = 0, num_epochs do
    Compute: dL(w, b)/dw and dL(w, b)/db
    Update w: w = w - lambda dL(w, b)/dw
    Update b: b = b - lambda dL(w, b)/db
    Print: L(w, b)   // useful to see if this is becoming smaller or not
end

  11. Gradient Descent (GD) – expensive
lambda = 0.01
L(w, b) = Σ_{i ∈ D} -log f_{label_i}(w, b)   (expensive: the sum runs over the entire dataset)
Initialize w and b randomly
for e = 0, num_epochs do
    Compute: dL(w, b)/dw and dL(w, b)/db
    Update w: w = w - lambda dL(w, b)/dw
    Update b: b = b - lambda dL(w, b)/db
    Print: L(w, b)   // useful to see if this is becoming smaller or not
end

  12. (mini-batch) Stochastic Gradient Descent (SGD)
lambda = 0.01
L(w, b) = Σ_{i ∈ B} -log f_{label_i}(w, b)   (sum over a mini-batch B)
Initialize w and b randomly
for e = 0, num_epochs do
    for b = 0, num_batches do
        Compute: dL(w, b)/dw and dL(w, b)/db
        Update w: w = w - lambda dL(w, b)/dw
        Update b: b = b - lambda dL(w, b)/db
        Print: L(w, b)   // useful to see if this is becoming smaller or not
    end
end

  13. (mini-batch) Stochastic Gradient Descent (SGD)
lambda = 0.01
L(w, b) = Σ_{i ∈ B} -log f_{label_i}(w, b)   (the special case |B| = 1 updates on a single example at a time)
Initialize w and b randomly
for e = 0, num_epochs do
    for b = 0, num_batches do
        Compute: dL(w, b)/dw and dL(w, b)/db
        Update w: w = w - lambda dL(w, b)/dw
        Update b: b = b - lambda dL(w, b)/db
        Print: L(w, b)   // useful to see if this is becoming smaller or not
    end
end
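To make the pseudocode concrete, here is a hedged sketch of mini-batch SGD for a small logistic-regression classifier on synthetic data; the data, batch size, and number of epochs are illustrative assumptions rather than anything specified in the slides. Setting batch_size = 1 gives the |B| = 1 case from slide 13.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                    # 1000 examples, 5 features (made up)
true_w = rng.normal(size=5)
y = (X @ true_w > 0).astype(float)                # binary labels

def loss_and_grads(w, b, Xb, yb):
    """Cross-entropy loss and its gradients on one mini-batch."""
    p = 1.0 / (1.0 + np.exp(-(Xb @ w + b)))       # predicted P(y = 1)
    eps = 1e-12
    loss = -np.mean(yb * np.log(p + eps) + (1 - yb) * np.log(1 - p + eps))
    dw = Xb.T @ (p - yb) / len(yb)
    db = np.mean(p - yb)
    return loss, dw, db

lam = 0.01                                        # learning rate, as in the slides
batch_size = 32                                   # illustrative; batch_size = 1 gives classic SGD
w, b = np.zeros(5), 0.0
for epoch in range(10):
    order = rng.permutation(len(X))               # shuffle once per epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        loss, dw, db = loss_and_grads(w, b, X[idx], y[idx])
        w = w - lam * dw                          # Update w
        b = b - lam * db                          # Update b
    print(f"epoch {epoch}  last-batch loss = {loss:.4f}")   # should tend to shrink
```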

  14. Regression vs Classification
Regression:
• Labels are continuous variables, e.g. distance.
• Losses: distance-based losses, e.g. sum of distances to true values.
• Evaluation: mean distances, correlation coefficients, etc.
Classification:
• Labels are discrete variables (1 out of K categories).
• Losses: cross-entropy loss, margin losses, logistic regression (binary cross-entropy).
• Evaluation: classification accuracy, etc.
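A tiny, hedged illustration of the two loss families in the comparison above, computed on made-up numbers: a sum-of-squared-distances loss for regression and a cross-entropy loss for 1-of-K classification.

```python
import numpy as np

# Regression: a distance-based loss (sum of squared distances to the true values).
y_true = np.array([2.0, 0.5, -1.0])
y_pred = np.array([1.8, 0.7, -0.5])
squared_error = np.sum((y_pred - y_true) ** 2)

# Classification: cross-entropy between the correct class and the predicted
# class probabilities (labels are 1 out of K = 3 categories here).
labels = np.array([0, 2])                         # correct class index per example
probs = np.array([[0.7, 0.2, 0.1],                # predicted distributions (rows sum to 1)
                  [0.1, 0.3, 0.6]])
cross_entropy = -np.mean(np.log(probs[np.arange(len(labels)), labels]))

print(squared_error, cross_entropy)
```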

  15. Linear Regression – 1 output, 1 input
[Scatter plot of data points (x_1, y_1), ..., (x_n, y_n) in the (x, y) plane]

  16. Linear Regression – 1 output, 1 input
[Same scatter plot of data points (x_i, y_i)]
Model: ŷ = wx + b

  17. Linear Regression – 1 output, 1 input
[Same scatter plot of data points (x_i, y_i)]
Model: ŷ = wx + b

  18. Linear Regression – 1 output, 1 input
[Same scatter plot of data points (x_i, y_i)]
Model: ŷ = wx + b
Loss: L(w, b) = Σ_{i=1}^{n} (ŷ_i - y_i)^2
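A hedged sketch of fitting this model by gradient descent on the summed squared error; the synthetic data, learning rate, and iteration count are illustrative choices, not values from the lecture.

```python
import numpy as np

# Fit y_hat = w*x + b by gradient descent on L(w, b) = sum_i (y_hat_i - y_i)^2.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.5 * x + 1.0 + rng.normal(scale=1.0, size=50)   # noisy line (made-up data)

w, b = 0.0, 0.0
lam = 1e-4                                           # small step size for the summed loss
for step in range(5000):
    y_hat = w * x + b
    resid = y_hat - y
    dw = 2 * np.sum(resid * x)                       # dL/dw
    db = 2 * np.sum(resid)                           # dL/db
    w -= lam * dw
    b -= lam * db
print(w, b, np.sum(resid ** 2))   # w, b approach the slope and intercept of the data
```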

  19. Quadratic Regression
[Same scatter plot of data points (x_i, y_i)]
Model: ŷ = w_2 x^2 + w_1 x + b
Loss: L(w, b) = Σ_{i=1}^{n} (ŷ_i - y_i)^2

  20. n-polynomial Regression
[Same scatter plot of data points (x_i, y_i)]
Model: ŷ = w_n x^n + ⋯ + w_1 x + b
Loss: L(w, b) = Σ_i (ŷ_i - y_i)^2
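To see how the degree changes the fit (and to set up the next slide), here is a hedged sketch that fits polynomials of degree 1, 3, and 9 to ten noisy points; it uses a closed-form least-squares solve instead of the gradient-descent loop from the earlier slides, and the data are invented for illustration.

```python
import numpy as np

# Degree-n polynomial regression: y_hat = w_n x^n + ... + w_1 x + b.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=10)   # made-up noisy curve

def fit_poly(x, y, degree):
    """Least-squares fit of a degree-`degree` polynomial."""
    A = np.vander(x, degree + 1)            # columns: x^degree, ..., x, 1
    coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coeffs

for degree in (1, 3, 9):
    coeffs = fit_poly(x, y, degree)
    resid = np.vander(x, degree + 1) @ coeffs - y
    print(f"degree {degree}: training loss = {np.sum(resid ** 2):.4f}")
# Degree 1 leaves a high training loss, degree 3 a low one, and degree 9 drives it
# to ~0 on these 10 points -- the underfitting/overfitting pattern on the next slide.
```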

  21. Overfitting
• ŷ is linear: Loss(w) is high – underfitting (high bias).
• ŷ is cubic: Loss(w) is low.
• ŷ is a polynomial of degree 9: Loss(w) is zero! – overfitting (high variance).
Christopher M. Bishop – Pattern Recognition and Machine Learning

  22. Regularization
• Large weights lead to large variance, i.e. the model fits the training data too strongly.
• Solution: minimize the loss but also try to keep the weight values small by minimizing
  L(w, b) + α Σ_i |w_i|^2

  23. Regularization
• Large weights lead to large variance, i.e. the model fits the training data too strongly.
• Solution: minimize the loss but also try to keep the weight values small by minimizing
  L(w, b) + α Σ_i |w_i|^2
  where α Σ_i |w_i|^2 is the regularizer term, e.g. the L2 regularizer.

  24. SGD with Regularization (L2)
lambda = 0.01
L'(w, b) = L(w, b) + α Σ_i |w_i|^2
Initialize w and b randomly
for e = 0, num_epochs do
    for b = 0, num_batches do
        Compute: dL(w, b)/dw and dL(w, b)/db
        Update w: w = w - lambda dL(w, b)/dw - lambda α w
        Update b: b = b - lambda dL(w, b)/db
        Print: L(w, b)   // useful to see if this is becoming smaller or not
    end
end
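The only change relative to plain SGD is the extra shrinkage term in the weight update. The sketch below isolates that one step; the gradient of α Σ_i w_i^2 is 2αw, and here, matching the update shown above, the factor of 2 is absorbed into α. The default values and the choice not to regularize the bias are illustrative assumptions.

```python
import numpy as np

# One SGD step with L2 regularization (weight decay), mirroring the update above:
#   w <- w - lam * dL/dw - lam * alpha * w
def sgd_step_l2(w, b, dL_dw, dL_db, lam=0.01, alpha=1e-4):
    """lam is the learning rate, alpha the regularization strength
    (both illustrative defaults)."""
    w_new = w - lam * dL_dw - lam * alpha * w   # extra shrinkage pulls w toward 0
    b_new = b - lam * dL_db                     # the bias is typically left unregularized
    return w_new, b_new

# Example: a single update on made-up gradients.
w = np.array([0.5, -1.2, 3.0])
b = 0.1
w, b = sgd_step_l2(w, b, dL_dw=np.array([0.2, -0.1, 0.05]), dL_db=0.03)
print(w, b)
```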

  25. Revisiting Another Problem with SGD
Note: the mini-batch gradients computed below are only approximations to the true gradient with respect to the full loss L(w, b).
lambda = 0.01
L'(w, b) = L(w, b) + α Σ_i |w_i|^2
Initialize w and b randomly
for e = 0, num_epochs do
    for b = 0, num_batches do
        Compute: dL(w, b)/dw and dL(w, b)/db
        Update w: w = w - lambda dL(w, b)/dw - lambda α w
        Update b: b = b - lambda dL(w, b)/db
        Print: L(w, b)   // useful to see if this is becoming smaller or not
    end
end

  26. Revisiting Another Problem with SGD
Note: this could lead to "un-learning" what has been learned in some previous steps of training.
lambda = 0.01
L'(w, b) = L(w, b) + α Σ_i |w_i|^2
Initialize w and b randomly
for e = 0, num_epochs do
    for b = 0, num_batches do
        Compute: dL(w, b)/dw and dL(w, b)/db
        Update w: w = w - lambda dL(w, b)/dw - lambda α w
        Update b: b = b - lambda dL(w, b)/db
        Print: L(w, b)   // useful to see if this is becoming smaller or not
    end
end

  27. Solution: Momentum Updates
Idea: keep track of previous gradients in an accumulator variable, and use a weighted average with the current gradient.
lambda = 0.01
L'(w, b) = L(w, b) + α Σ_i |w_i|^2
Initialize w and b randomly
for e = 0, num_epochs do
    for b = 0, num_batches do
        Compute: dL(w, b)/dw and dL(w, b)/db
        Update w: w = w - lambda dL(w, b)/dw - lambda α w
        Update b: b = b - lambda dL(w, b)/db
        Print: L(w, b)   // useful to see if this is becoming smaller or not
    end
end

  28. Solution: Momentum Updates
Idea: keep track of previous gradients in an accumulator variable, and use a weighted average with the current gradient.
lambda = 0.01, beta = 0.9
L'(w, b) = L(w, b) + α Σ_i |w_i|^2
Initialize w and b randomly; initialize the (global) accumulator v
for e = 0, num_epochs do
    for b = 0, num_batches do
        Compute: dL(w, b)/dw
        Compute: v = beta v + dL(w, b)/dw + α w
        Update w: w = w - lambda v
        Print: L(w, b)   // useful to see if this is becoming smaller or not
    end
end
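A minimal sketch of the momentum update on the toy loss L(w) = 3 + (w - 4)^2; lambda and beta follow the values on the slide, while the regularizer term α w is dropped to keep the example one-dimensional and simple.

```python
# SGD with momentum, mirroring slide 28:
#   v <- beta * v + dL/dw      (exponentially weighted accumulation of past gradients)
#   w <- w - lambda * v
def dL_dw(w):
    return 2.0 * (w - 4.0)       # gradient of L(w) = 3 + (w - 4)^2

lam, beta = 0.01, 0.9            # learning rate and momentum, as on the slide
w, v = 12.0, 0.0                 # v is the accumulator, initialized to zero
for step in range(300):
    v = beta * v + dL_dw(w)      # weighted average of the current and past gradients
    w = w - lam * v
print(w)                         # approaches the minimizer w = 4
```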

  29. More on Momentum https://distill.pub/2017/momentum/

  30. Questions?
