Fitting Neural Networks: Gradient Descent and Stochastic Gradient Descent


SLIDE 1

CS109A Introduction to Data Science

Pavlos Protopapas, Kevin Rader and Chris Tanner

Fitting Neural Networks: Gradient Descent and Stochastic Gradient Descent

SLIDE 2

New requirement for the final project: for the first time ever, researchers who submit papers to NeurIPS or other conferences must now state the “potential broader impact of their work” on society. The CS109A final project will also include the same requirement: the “potential broader impact of your work”. A guide to writing the impact statement: https://medium.com/@BrentH/suggestions-for-writing-neurips-2020-broader-impacts-statements-121da1b765bf

SLIDE 3

SLIDE 4

Outline

  • Gradient Descent
  • Stochastic Gradient Descent

SLIDE 5

Considerations

  • We still need to calculate the derivatives.
  • We need to know what the learning rate is, or how to set it.
  • Local vs. global minima.
  • The full likelihood function involves summing up all the individual ‘errors’; unless you are a statistician, this can mean hundreds of thousands of examples.


SLIDE 7

Calculate the Derivatives

Can we do it? Wolfram Alpha can do it for us! We need a formalism to deal with these derivatives. Example: Logistic Regression derivatives

SLIDE 8

Chain Rule

Chain rule for computing gradients. For a scalar chain $y = h(x)$, $z = g(y) = g(h(x))$:

$$\frac{\partial z}{\partial x} = \frac{\partial z}{\partial y}\,\frac{\partial y}{\partial x}$$

For vector-valued intermediates, $\mathbf{y} = h(\mathbf{x})$, $z = g(\mathbf{y}) = g(h(\mathbf{x}))$:

$$\frac{\partial z}{\partial x_i} = \sum_j \frac{\partial z}{\partial y_j}\,\frac{\partial y_j}{\partial x_i}$$

For longer chains:

$$\frac{\partial z}{\partial x_i} = \sum_{j_1}\cdots\sum_{j_m} \frac{\partial z}{\partial y_{j_1}}\cdots\frac{\partial y_{j_m}}{\partial x_i}$$
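To make the formalism concrete, here is a minimal sketch, not from the slides, of the chain rule applied to a toy composition; the functions $h$ and $g$ are invented for the example.

```python
import numpy as np

# Chain rule on z = g(h(x)) with h(x) = x**2 and g(y) = sin(y),
# so dz/dx = cos(x**2) * 2x.

def forward_and_grad(x):
    y = x ** 2            # intermediate: y = h(x)
    z = np.sin(y)         # output:       z = g(y)
    dz_dy = np.cos(y)     # local derivative of g at y
    dy_dx = 2 * x         # local derivative of h at x
    return z, dz_dy * dy_dx   # chain rule: product of local derivatives

x = 1.3
z, grad = forward_and_grad(x)

# Sanity check against a finite-difference approximation.
eps = 1e-6
fd = (np.sin((x + eps) ** 2) - np.sin((x - eps) ** 2)) / (2 * eps)
print(grad, fd)   # the two numbers should agree closely
```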

SLIDE 9

Logistic Regression derivatives

For logistic regression, the negative log of the likelihood is:

$$\mathcal{L} = \sum_i \mathcal{L}_i = -\sum_i \log p_i = -\sum_i \left[ y_i \log q_i + (1 - y_i) \log(1 - q_i) \right]$$

where $q_i = \dfrac{1}{1 + e^{-W^T X_i}}$, so that

$$\mathcal{L}_i = -y_i \log \frac{1}{1 + e^{-W^T X_i}} - (1 - y_i) \log\left(1 - \frac{1}{1 + e^{-W^T X_i}}\right)$$

To simplify the analysis, let us split it into two parts: $\mathcal{L}_i = \mathcal{L}_i^A + \mathcal{L}_i^B$. The derivative with respect to $W$ is then:

$$\frac{\partial \mathcal{L}}{\partial W} = \sum_i \frac{\partial \mathcal{L}_i}{\partial W} = \sum_i \left( \frac{\partial \mathcal{L}_i^A}{\partial W} + \frac{\partial \mathcal{L}_i^B}{\partial W} \right)$$

SLIDE 10

$$\mathcal{L}_i^A = -y_i \log \frac{1}{1 + e^{-W^T X_i}}$$

Dropping the example subscript $i$, build $\mathcal{L}^A$ from intermediate variables and take the partial derivative of each link, with respect to $W$ for the first one:

| Variable | Partial derivative | Substituted |
|---|---|---|
| $o_1 = -W^T X$ | $\partial o_1 / \partial W = -X$ | $-X$ |
| $o_2 = e^{o_1} = e^{-W^T X}$ | $\partial o_2 / \partial o_1 = e^{o_1}$ | $e^{-W^T X}$ |
| $o_3 = 1 + o_2 = 1 + e^{-W^T X}$ | $\partial o_3 / \partial o_2 = 1$ | $1$ |
| $o_4 = 1 / o_3 = \frac{1}{1 + e^{-W^T X}} = q$ | $\partial o_4 / \partial o_3 = -1 / o_3^2$ | $-\frac{1}{(1 + e^{-W^T X})^2}$ |
| $o_5 = \log o_4 = \log q$ | $\partial o_5 / \partial o_4 = 1 / o_4$ | $1 + e^{-W^T X}$ |
| $\mathcal{L}^A = -y\, o_5$ | $\partial \mathcal{L}^A / \partial o_5 = -y$ | $-y$ |

Multiplying the links of the chain:

$$\frac{\partial \mathcal{L}^A}{\partial W} = \frac{\partial \mathcal{L}^A}{\partial o_5} \frac{\partial o_5}{\partial o_4} \frac{\partial o_4}{\partial o_3} \frac{\partial o_3}{\partial o_2} \frac{\partial o_2}{\partial o_1} \frac{\partial o_1}{\partial W} = -y\, X\, \frac{e^{-W^T X}}{1 + e^{-W^T X}}$$
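A quick numerical check of this result, a sketch not found in the slides: compare the closed form against a finite-difference approximation.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=3)
X = rng.normal(size=3)
y = 1.0

def loss_A(W):
    # L^A = -y * log(1 / (1 + exp(-W^T X)))
    return -y * np.log(1.0 / (1.0 + np.exp(-W @ X)))

# Closed form from the table: dL^A/dW = -y * X * e^{-W^T X} / (1 + e^{-W^T X})
e = np.exp(-W @ X)
grad_closed = -y * X * e / (1.0 + e)

# Finite-difference gradient, one coordinate at a time.
eps = 1e-6
grad_fd = np.array([
    (loss_A(W + eps * np.eye(3)[i]) - loss_A(W - eps * np.eye(3)[i])) / (2 * eps)
    for i in range(3)
])
print(np.allclose(grad_closed, grad_fd, atol=1e-6))  # True
```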

SLIDE 11

$$\mathcal{L}_i^B = -(1 - y_i) \log\left[1 - \frac{1}{1 + e^{-W^T X_i}}\right]$$

The same construction, again dropping the subscript $i$:

| Variable | Partial derivative | Substituted |
|---|---|---|
| $o_1 = -W^T X$ | $\partial o_1 / \partial W = -X$ | $-X$ |
| $o_2 = e^{o_1} = e^{-W^T X}$ | $\partial o_2 / \partial o_1 = e^{o_1}$ | $e^{-W^T X}$ |
| $o_3 = 1 + o_2 = 1 + e^{-W^T X}$ | $\partial o_3 / \partial o_2 = 1$ | $1$ |
| $o_4 = 1 / o_3 = \frac{1}{1 + e^{-W^T X}} = q$ | $\partial o_4 / \partial o_3 = -1 / o_3^2$ | $-\frac{1}{(1 + e^{-W^T X})^2}$ |
| $o_5 = 1 - o_4 = 1 - \frac{1}{1 + e^{-W^T X}}$ | $\partial o_5 / \partial o_4 = -1$ | $-1$ |
| $o_6 = \log o_5 = \log(1 - q)$ | $\partial o_6 / \partial o_5 = 1 / o_5$ | $\frac{1 + e^{-W^T X}}{e^{-W^T X}}$ |
| $\mathcal{L}^B = -(1 - y)\, o_6$ | $\partial \mathcal{L}^B / \partial o_6 = -(1 - y)$ | $-(1 - y)$ |

Multiplying the links of the chain:

$$\frac{\partial \mathcal{L}^B}{\partial W} = \frac{\partial \mathcal{L}^B}{\partial o_6} \frac{\partial o_6}{\partial o_5} \frac{\partial o_5}{\partial o_4} \frac{\partial o_4}{\partial o_3} \frac{\partial o_3}{\partial o_2} \frac{\partial o_2}{\partial o_1} \frac{\partial o_1}{\partial W} = (1 - y)\, X\, \frac{1}{1 + e^{-W^T X}}$$
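Adding the two parts is a step the slides leave implicit, but it follows directly from the results on slides 10 and 11: since $q_i = 1/(1 + e^{-W^T X_i})$ and $1 - q_i = e^{-W^T X_i}/(1 + e^{-W^T X_i})$,

$$\frac{\partial \mathcal{L}_i}{\partial W} = \frac{\partial \mathcal{L}_i^A}{\partial W} + \frac{\partial \mathcal{L}_i^B}{\partial W} = -y_i (1 - q_i)\, X_i + (1 - y_i)\, q_i\, X_i = (q_i - y_i)\, X_i$$

which is the familiar logistic-regression gradient.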

SLIDE 12

Considerations

  • We still need to calculate the derivatives.
  • We need to know what the learning rate is, or how to set it.
  • Local vs. global minima.
  • The full likelihood function involves summing up all the individual ‘errors’; unless you are a statistician, this can mean hundreds of thousands of examples.

SLIDE 13

Learning Rate

Our choice of the learning rate has a significant impact on the performance of gradient descent.

When the learning rate $\eta$ is too small, the algorithm makes very little progress. When $\eta$ is too large, the algorithm may overshoot the minimum and oscillate wildly. When $\eta$ is appropriate, the algorithm will find the minimum. The algorithm converges!
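A minimal sketch of the three regimes on the toy loss $\ell(w) = w^2$ (invented for the example; its gradient is $2w$):

```python
# Gradient descent on l(w) = w**2, gradient 2w, starting from w = 1.
def descend(eta, steps=20):
    w = 1.0
    for _ in range(steps):
        w = w - eta * 2 * w   # gradient step
    return w

print(descend(0.001))  # too small: barely moves away from 1.0
print(descend(1.1))    # too large: overshoots, |w| blows up
print(descend(0.1))    # appropriate: converges towards the minimum at 0
```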

SLIDE 14

How can we tell when gradient descent is converging? We visualize the loss function at each step of gradient descent. This is called the trace plot. Three typical cases:

  • The loss is mostly oscillating between values rather than converging.
  • The loss is decreasing throughout training, but it does not look like the descent has hit the bottom.
  • The loss has decreased significantly during training; towards the end it stabilizes and can’t decrease further.
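A sketch of how such a trace plot might be produced (matplotlib assumed; the toy loss is the same $\ell(w) = w^2$ as above):

```python
import matplotlib.pyplot as plt

def descend_with_trace(eta, steps=50):
    """Gradient descent on l(w) = w**2, recording the loss at every step."""
    w, losses = 1.0, []
    for _ in range(steps):
        losses.append(w ** 2)   # record current loss
        w -= eta * 2 * w        # gradient step
    return losses

for eta in (0.001, 1.02, 0.1):  # too small / too large / appropriate
    plt.plot(descend_with_trace(eta), label=f"eta = {eta}")
plt.xlabel("step")
plt.ylabel("loss")
plt.legend()
plt.show()
```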

SLIDE 15

Learning Rate

There are many alternative methods that address how to set or adjust the learning rate, using the first or second derivatives and/or momentum. More on this later.
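The slides defer the details, but as a preview, here is a minimal sketch of one such method, gradient descent with momentum (a standard formulation, not taken from these slides):

```python
# Gradient descent with momentum on l(w) = w**2.
eta, beta = 0.1, 0.9   # learning rate and momentum coefficient
w, v = 1.0, 0.0
for _ in range(100):
    grad = 2 * w          # gradient of l(w) = w**2
    v = beta * v + grad   # accumulate a moving average of gradients
    w = w - eta * v       # step along the accumulated direction
print(w)                  # close to the minimum at 0
```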

SLIDE 16

Considerations

  • We still need to calculate the derivatives.
  • We need to know what the learning rate is, or how to set it.
  • Local vs. global minima.
  • The full likelihood function involves summing up all the individual ‘errors’; unless you are a statistician, this can mean hundreds of thousands of examples.

SLIDE 17

Local vs Global Minima

If we choose the learning rate $\eta$ correctly, then gradient descent will converge to a stationary point. But will this point be a global minimum? If the function is convex, then the stationary point will be a global minimum.

SLIDE 18

Local vs Global Minima

No guarantee that we get the global minimum. Question: what would be a good strategy?

  • Random restarts (a sketch follows below)
  • Add noise to the loss function
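A minimal sketch of the random-restarts strategy (illustrative; the non-convex loss below is invented for the example): run gradient descent from several random starting points and keep the best result.

```python
import numpy as np

def loss(w):
    return np.sin(3 * w) + w ** 2        # non-convex: several local minima

def grad(w):
    return 3 * np.cos(3 * w) + 2 * w

def gradient_descent(w0, eta=0.01, steps=500):
    w = w0
    for _ in range(steps):
        w -= eta * grad(w)
    return w

rng = np.random.default_rng(0)
# Random restarts: many starting points, keep the lowest final loss.
candidates = [gradient_descent(w0) for w0 in rng.uniform(-3, 3, size=10)]
best = min(candidates, key=loss)
print(best, loss(best))
```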

SLIDE 19

Considerations

  • We still need to calculate the derivatives.
  • We need to know what the learning rate is, or how to set it.
  • Local vs. global minima.
  • The full likelihood function involves summing up all the individual ‘errors’; unless you are a statistician, this can mean hundreds of thousands of examples.

SLIDE 20

Batch and Stochastic Gradient Descent

Instead of using all the examples for every step, use a subset of them (a batch). For each iteration $k$, use the following loss function to derive the derivatives; it is an approximation to the full loss function.

Full loss:

$$\mathcal{L} = -\sum_i \left[ y_i \log q_i + (1 - y_i) \log(1 - q_i) \right]$$

Batch loss at iteration $k$, summing only over the batch $B_k$:

$$\mathcal{L}_k = -\sum_{i \in B_k} \left[ y_i \log q_i + (1 - y_i) \log(1 - q_i) \right]$$
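A sketch of a single batch gradient step for logistic regression in NumPy (illustrative; it reuses the gradient $(q_i - y_i) X_i$ derived on slides 10 and 11, averaged over the batch rather than summed):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100_000, 5                     # many examples, as in the motivation
X = rng.normal(size=(n, d))
true_W = rng.normal(size=d)
y = (X @ true_W > 0).astype(float)    # synthetic 0/1 labels

W = np.zeros(d)
eta, batch_size = 0.1, 32

# One batch step: estimate the gradient on a random subset only.
idx = rng.choice(n, size=batch_size, replace=False)
Xb, yb = X[idx], y[idx]
q = 1.0 / (1.0 + np.exp(-Xb @ W))     # predicted probabilities on the batch
grad = Xb.T @ (q - yb) / batch_size   # mean of (q_i - y_i) X_i over the batch
W = W - eta * grad
```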

SLIDE 21

SLIDE 22

SLIDE 23

(Handwritten sketch of stochastic gradient descent: take the data points one at a time; for each one, calculate the loss $\mathcal{L}$, then the gradient $\partial \mathcal{L} / \partial w$, then update $w \leftarrow w - \eta\, \partial \mathcal{L} / \partial w$. Going through the complete data once is one epoch; then reshuffle the data and repeat.)
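The same recipe in code, a minimal sketch assuming the logistic-regression gradient from the earlier slides:

```python
import numpy as np

def sgd(X, y, epochs=5, eta=0.1, seed=0):
    """Plain stochastic gradient descent: one example per update."""
    rng = np.random.default_rng(seed)
    W = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(X)):        # reshuffle each epoch
            q = 1.0 / (1.0 + np.exp(-X[i] @ W))  # prediction for one example
            W -= eta * (q - y[i]) * X[i]         # single-example gradient step
    return W
```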

SLIDE 24

SLIDE 25

SLIDE 26

Batch and Stochastic Gradient Descent

(Figure, animated across slides 26 to 34: trace of the loss $L$ over the iterations, comparing the full likelihood with the batch likelihood.)
