SLIDE 1

CS6501: Deep Learning for Visual Recognition

Stochastic Gradient Descent (SGD)

SLIDE 2

Today’s Class

Stochastic Gradient Descent (SGD)

  • SGD Recap
  • Regression vs Classification
  • Generalization / Overfitting / Underfitting
  • Regularization
  • Momentum Updates / ADAM Updates
SLIDE 3

Our function L(w)

L(w) = 3 + (w − 4)^2

SLIDE 4

Our function L(w)

L(w) = 3 + (w − 4)^2

Easy way to find the minimum (and the maximum): find where the derivative is zero.

dL/dw = 2(w − 4)

This is zero when w = 4.

SLIDE 5

Our function L(w)

L(w) = 3 + (w − 4)^2

But this is not easy for complex functions:

L(w_1, w_2, …, w_12) = −logsoftmax(g(w_1, w_2, …, w_12, x_1))[label_1]
                     − logsoftmax(g(w_1, w_2, …, w_12, x_2))[label_2]
                     − …
                     − logsoftmax(g(w_1, w_2, …, w_12, x_n))[label_n]

SLIDE 6

Our function L(w)

L(w) = 3 + (w − 4)^2

Or even for simpler functions:

L(x) = e^(−x) + x^2
dL(x)/dx = −e^(−x) + 2x
0 = −e^(−x) + 2x

How do you find x?
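Setting this derivative to zero gives an equation with no closed-form solution, so the root has to be found numerically. A minimal sketch in plain Python (the bracket [0, 1] is an assumption, justified by the sign change of the derivative across it):

```python
import math

def dL(x):
    # derivative of L(x) = e^(-x) + x^2
    return -math.exp(-x) + 2 * x

# dL(0) = -1 < 0 and dL(1) = 2 - 1/e > 0, so a root lies in [0, 1].
lo, hi = 0.0, 1.0
for _ in range(60):            # bisection: halve the bracket each iteration
    mid = 0.5 * (lo + hi)
    if dL(mid) < 0:
        lo = mid
    else:
        hi = mid

print(lo)  # ~0.3517, the minimizer of L(x)
```

Gradient descent, introduced next, avoids even this root-finding step: it only evaluates the derivative and follows it downhill.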

SLIDE 7

Gradient Descent (GD) (idea)

[Figure: plot of L(w) versus w, with the current point marked at w = 12]

  • 1. Start with a random value of w (e.g. w = 12)

  • 2. Compute the gradient (derivative) of L(w) at the point w = 12 (e.g. dL/dw = 6)
  • 3. Recompute w as: w = w – lambda * (dL / dw)
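A minimal sketch of these three steps on the running example L(w) = 3 + (w − 4)^2; the step size (the slide's lambda) is an assumed value of 0.1:

```python
def dL(w):
    return 2 * (w - 4)        # derivative of L(w) = 3 + (w - 4)^2

w = 12.0                      # 1. start with an initial value (fixed here)
lam = 0.1                     # assumed step size lambda
for _ in range(100):
    grad = dL(w)              # 2. compute the gradient at the current w
    w = w - lam * grad        # 3. recompute w as w = w - lambda * (dL/dw)

print(w)                      # approaches the minimizer w = 4
```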

SLIDE 8

Gradient Descent (GD) (idea)

[Figure: same plot; after one update the current point has moved to w = 10]

  • 2. Compute the gradient (derivative) of L(w) at the current point (now w = 10)
  • 3. Recompute w as: w = w – lambda * (dL / dw)

SLIDE 9

Gradient Descent (GD) (idea)

[Figure: same plot; after another update the current point has moved to w = 8]

  • 2. Compute the gradient (derivative) of L(w) at the current point (now w = 8)
  • 3. Recompute w as: w = w – lambda * (dL / dw)

SLIDE 10

Gradient Descent (GD)

L(w, b) = Σ_{i=1..n} −logsoftmax(g(x_i; w, b))[label_i]

λ = 0.01
Initialize w and b randomly
for e = 0, num_epochs do
    Compute: ∂L(w, b)/∂w and ∂L(w, b)/∂b
    Update w: w = w − λ ∂L(w, b)/∂w
    Update b: b = b − λ ∂L(w, b)/∂b
    Print: L(w, b)   // Useful to see if this is becoming smaller or not.
end
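As an illustration only (a sketch, not the course's reference code), here is this full-batch loop in NumPy for a tiny softmax model g(x) = Wx + b; the toy data, sizes, and epoch count are all assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))             # 100 examples, 5 features (toy data)
labels = rng.integers(0, 3, size=100)     # one of 3 categories per example
W, b = np.zeros((5, 3)), np.zeros(3)      # initialize w and b (zeros for simplicity)
lam = 0.01

for e in range(500):                      # for e = 0, num_epochs do
    scores = X @ W + b                    # g(x; W, b) for every example
    scores -= scores.max(axis=1, keepdims=True)          # for numerical stability
    p = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)   # softmax
    L = -np.log(p[np.arange(len(X)), labels]).sum()      # the loss above
    dscores = p.copy()
    dscores[np.arange(len(X)), labels] -= 1.0            # gradient of -logsoftmax
    W -= lam * (X.T @ dscores)            # w = w - lambda * dL/dw
    b -= lam * dscores.sum(axis=0)        # b = b - lambda * dL/db
    if e % 100 == 0:
        print(L)                          # useful to see if this is becoming smaller
```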

SLIDE 11

Gradient Descent (GD)

(Same full-batch loop as above.) Computing the gradient over all n training examples for every single update is expensive.

SLIDE 12

(mini-batch) Stochastic Gradient Descent (SGD)

L(w, b) = Σ_{i∈B} −logsoftmax(g(x_i; w, b))[label_i]   (sum over a mini-batch B only)

λ = 0.01
Initialize w and b randomly
for e = 0, num_epochs do
    for b = 0, num_batches do
        Compute: ∂L(w, b)/∂w and ∂L(w, b)/∂b
        Update w: w = w − λ ∂L(w, b)/∂w
        Update b: b = b − λ ∂L(w, b)/∂b
        Print: L(w, b)   // Useful to see if this is becoming smaller or not.
    end
end
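Compared with the full-batch loop, only the inner loop is new: each update uses the gradient over a single mini-batch B. A runnable sketch under the same assumed toy setup as before:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
labels = rng.integers(0, 3, size=100)
W, b = np.zeros((5, 3)), np.zeros(3)
num_epochs, batch_size, lam = 20, 32, 0.01

def grads(Xb, yb, W, b):
    # gradient of sum_{i in B} -logsoftmax(Xb @ W + b)[y_i]
    s = Xb @ W + b
    s -= s.max(axis=1, keepdims=True)
    p = np.exp(s) / np.exp(s).sum(axis=1, keepdims=True)
    p[np.arange(len(Xb)), yb] -= 1.0
    return Xb.T @ p, p.sum(axis=0)

for e in range(num_epochs):                   # for e = 0, num_epochs do
    order = rng.permutation(len(X))           # visit examples in random order
    for s0 in range(0, len(X), batch_size):   # for b = 0, num_batches do
        B = order[s0:s0 + batch_size]         # the indices i in this batch B
        dW, db = grads(X[B], labels[B], W, b)
        W -= lam * dW                         # w = w - lambda * dL/dw
        b -= lam * db                         # b = b - lambda * dL/db
```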

SLIDE 13

(mini-batch) Stochastic Gradient Descent (SGD)

(Same mini-batch loop as above.) With |B| = 1, i.e. one example per batch, this is the classic stochastic gradient descent.

SLIDE 14

Regression vs Classification

Regression

  • Labels are continuous variables – e.g. distance.
  • Losses: Distance-based losses, e.g. sum of distances to true values.
  • Evaluation: Mean distances, correlation coefficients, etc.

Classification

  • Labels are discrete variables (1 out of K categories).
  • Losses: Cross-entropy loss, margin losses, logistic regression (binary cross-entropy).
  • Evaluation: Classification accuracy, etc.
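To make the two loss families concrete, a small sketch computing a distance-based regression loss and a cross-entropy classification loss; every number here is an illustrative assumption:

```python
import numpy as np

# Regression: continuous labels, distance-based loss.
y_true = np.array([1.5, 2.0, 3.7])            # e.g. true distances
y_pred = np.array([1.4, 2.3, 3.5])
mse = np.mean((y_pred - y_true) ** 2)         # mean squared distance

# Classification: discrete labels (1 out of K categories), cross-entropy loss.
logits = np.array([[2.0, 0.5, -1.0]])         # scores for K = 3 categories
label = np.array([0])                         # the correct category
p = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
cross_entropy = -np.log(p[0, label[0]])

print(mse, cross_entropy)
```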

SLIDE 15

Linear Regression – 1 output, 1 input

[Figure: scatter plot of data points (x_1, y_1), …, (x_8, y_8) in the (x, y) plane]

SLIDE 16

Linear Regression – 1 output, 1 input

[Figure: the same scatter plot of (x_i, y_i)]

Model: ŷ(x) = wx + b

SLIDE 17

Linear Regression – 1 output, 1 input

(Same figure and model as the previous slide: ŷ(x) = wx + b.)

SLIDE 18

Linear Regression – 1 output, 1 input

[Figure: the same scatter plot of (x_i, y_i)]

Model: ŷ(x) = wx + b
Loss: L(w, b) = Σ_{i=1..n} (ŷ(x_i) − y_i)^2
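A minimal sketch fitting ŷ(x) = wx + b by gradient descent on this squared-error loss; the synthetic (x_i, y_i) below are assumptions standing in for the points in the figure:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=8)                        # the x_i
y = 1.7 * x + 0.5 + rng.normal(scale=0.05, size=8)   # noisy y_i (assumed data)

w, b, lam = 0.0, 0.0, 0.05
for _ in range(5000):
    err = (w * x + b) - y          # y_hat(x_i) - y_i for all i
    dw = 2 * np.sum(err * x)       # dL/dw for L(w, b) = sum_i (y_hat(x_i) - y_i)^2
    db = 2 * np.sum(err)           # dL/db
    w -= lam * dw
    b -= lam * db

print(w, b)                        # close to the generating values 1.7 and 0.5
```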

SLIDE 19

Quadratic Regression

[Figure: the same scatter plot of (x_i, y_i)]

Model: ŷ(x) = w_1 x^2 + w_2 x + b
Loss: L(w, b) = Σ_{i=1..n} (ŷ(x_i) − y_i)^2

SLIDE 20

n-polynomial Regression

[Figure: the same scatter plot of (x_i, y_i)]

Model: ŷ(x) = w_n x^n + ⋯ + w_1 x + b
Loss: L(w, b) = Σ_{i=1..n} (ŷ(x_i) − y_i)^2
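For any fixed degree n the model is linear in its weights, so the same squared-error loss can also be minimized in one shot with a least-squares solve; a sketch using NumPy (np.vander builds the feature matrix with columns [x^n, …, x, 1]; the toy data is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=8)
y = np.sin(2 * x) + rng.normal(scale=0.1, size=8)    # assumed toy data

degree = 3
A = np.vander(x, degree + 1)                          # columns [x^3, x^2, x, 1]
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)        # minimizes sum (y_hat_i - y_i)^2
print(coeffs)   # [w_3, w_2, w_1, b] of y_hat(x) = w_3 x^3 + w_2 x^2 + w_1 x + b
```

Gradient descent on the same loss converges to this solution; the higher the degree, the more exactly the curve can thread the training points, which is where the next slide picks up.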

SLIDE 21

Overfitting

[Figure (from Christopher M. Bishop, Pattern Recognition and Machine Learning): three fits to the same data]

  • ŷ is linear: Loss(w) is high → Underfitting (high bias)
  • ŷ is cubic: Loss(w) is low → a good fit
  • ŷ is a polynomial of degree 9: Loss(w) is zero! → Overfitting (high variance)

SLIDE 22

Regularization

  • Large weights lead to large variance, i.e. the model fits the training data too strongly.

  • Solution: Minimize the loss but also try to keep the weight values small by doing the following:

minimize L(w, b) + α Σ_j |w_j|^2

SLIDE 23

Regularization

  • Large weights lead to large variance, i.e. the model fits the training data too strongly.

  • Solution: Minimize the loss but also try to keep the weight values small by doing the following:

minimize L(w, b) + α Σ_j |w_j|^2

The α Σ_j |w_j|^2 part is the regularizer term, e.g. the L2 regularizer.

SLIDE 24

SGD with Regularization (L2)

L_reg(w, b) = L(w, b) + α Σ_j |w_j|^2

λ = 0.01
Initialize w and b randomly
for e = 0, num_epochs do
    for b = 0, num_batches do
        Compute: ∂L(w, b)/∂w and ∂L(w, b)/∂b
        Update w: w = w − λ ∂L(w, b)/∂w − λαw
        Update b: b = b − λ ∂L(w, b)/∂b
        Print: L(w, b)   // Useful to see if this is becoming smaller or not.
    end
end
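The regularizer changes the update only through the extra −λαw term (often called weight decay). A self-contained sketch on an assumed linear-regression loss; the variable names and constants are mine, not the slides':

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

w, b = np.zeros(3), 0.0
lam, alpha = 0.01, 0.1            # learning rate and L2 coefficient (assumed values)

for e in range(2000):
    err = X @ w + b - y
    dw = 2 * (X.T @ err) / len(X)           # data-loss gradient dL/dw
    db = 2 * err.mean()                      # dL/db
    w = w - lam * dw - lam * alpha * w       # extra -lam*alpha*w from the L2 term
    b = b - lam * db                         # bias not regularized in the objective

print(w)   # shrunk slightly toward 0 relative to the unregularized fit
```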

SLIDE 25

Revisiting Another Problem with SGD

(Same regularized mini-batch loop as above.)

The mini-batch gradients are only approximations to the true gradient of L(w, b) over the whole training set.

SLIDE 26

Revisiting Another Problem with SGD

(Same regularized mini-batch loop as above.)

This could lead to “un-learning” what has been learned in some previous steps of training.

SLIDE 27

Solution: Momentum Updates

(Same regularized mini-batch loop as above.)

Idea: keep track of previous gradients in an accumulator variable, and use a weighted average with the current gradient.

SLIDE 28

Solution: Momentum Updates

L_reg(w, b) = L(w, b) + α Σ_j |w_j|^2

λ = 0.01, β = 0.9
Initialize w and b randomly; keep a global accumulator v (initialized to 0)
for e = 0, num_epochs do
    for b = 0, num_batches do
        Compute: ∂L(w, b)/∂w
        Compute: v = βv + ∂L(w, b)/∂w + αw
        Update w: w = w − λv   (the update for b is analogous)
        Print: L(w, b)   // Useful to see if this is becoming smaller or not.
    end
end

Keep track of previous gradients in the accumulator v, and use a weighted average with the current gradient.
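A minimal sketch of this momentum update on the earlier toy loss L(w) = 3 + (w − 4)^2; the toy loss and the constants are assumptions, and the accumulator v persists across steps exactly as in the pseudocode:

```python
import numpy as np

lam, beta, alpha = 0.01, 0.9, 0.0   # step size, momentum, L2 term (off here)
w = np.array([12.0])
v = np.zeros_like(w)                # the "global" accumulator from the slide

def dL(w):
    return 2 * (w - 4)              # gradient of L(w) = 3 + (w - 4)^2

for step in range(300):
    v = beta * v + dL(w) + alpha * w    # v = beta*v + dL/dw + alpha*w
    w = w - lam * v                     # w = w - lambda * v

print(w)   # approaches 4; past gradients smooth out step-to-step noise
```

Because v averages recent gradients, a single noisy mini-batch gradient can no longer yank the weights far off course, which is exactly the "un-learning" problem the previous slides raised.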

SLIDE 29

More on Momentum

https://distill.pub/2017/momentum/

SLIDE 30

Questions?
