CS109A Introduction to Data Science
Pavlos Protopapas, Kevin Rader and Chris Tanner
Fitting Neural Networks: Gradient Descent and Stochastic Gradient Descent
New requirement for the final project: for the first time ever, researchers who submit papers to NeurIPS are required to include a statement on the "potential broader impact of their work" on society. The CS109A final project will include the same requirement: a statement on the potential broader impact of your work. A guide to writing the impact statement: https://medium.com/@BrentH/suggestions-for-writing-neurips-2020-broader-impacts-statements-121da1b765bf
Outline
Considerations
Gradient descent raises several considerations:
- Can we calculate the derivatives of the loss function with respect to the network weights?
- How do we choose the learning rate?
- Will we reach a local minimum or the global minimum?
- The loss function includes a sum over all individual 'errors'. Unless you are a statistician, this sometimes means hundreds of thousands of examples.
Calculate the Derivatives
Can we do it? Wolfram Alpha can do it for us! Still, we need a formalism to deal with these derivatives. Example: logistic regression derivatives.
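As a quick illustration of handing the differentiation to a computer algebra system (a minimal sketch, not part of the original slides; it uses sympy and treats the weight, feature, and label as scalars):

```python
import sympy as sp

# Scalar logistic regression: q = 1 / (1 + exp(-w * x)), label z in {0, 1}
w, x, z = sp.symbols('w x z')
q = 1 / (1 + sp.exp(-w * x))

# Negative log-likelihood of a single observation
L = -(z * sp.log(q) + (1 - z) * sp.log(1 - q))

# Derivative with respect to the weight w
dL_dw = sp.simplify(sp.diff(L, w))
print(dL_dw)  # mathematically equal to (q - z) * x
```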
Chain Rule
Chain rule for computing gradients.

Scalar case: if $z = h(y)$ and $\mathcal{L} = g(z) = g(h(y))$, then
$$\frac{\partial \mathcal{L}}{\partial y} = \frac{\partial \mathcal{L}}{\partial z}\,\frac{\partial z}{\partial y}.$$

Vector case: if $\boldsymbol{z} = h(\boldsymbol{y})$ and $\mathcal{L} = g(\boldsymbol{z}) = g(h(\boldsymbol{y}))$, then
$$\frac{\partial \mathcal{L}}{\partial y_i} = \sum_j \frac{\partial \mathcal{L}}{\partial z_j}\,\frac{\partial z_j}{\partial y_i}.$$

More generally, chaining through intermediate variables $y_{j_1}, \dots, y_{j_m}$:
$$\frac{\partial z}{\partial x_i} = \sum_{j_1}\cdots\sum_{j_m}\frac{\partial z}{\partial y_{j_1}}\cdots\frac{\partial y_{j_m}}{\partial x_i}.$$
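A quick numerical sanity check of the chain rule (a sketch, not from the slides; the composed functions are made up for illustration): multiply the two local derivatives and compare against a finite-difference estimate.

```python
import numpy as np

# Composition: z = h(y) = y**2,  L = g(z) = sin(z)
h = lambda y: y ** 2
g = lambda z: np.sin(z)

def dL_dy_chain(y):
    z = h(y)
    dL_dz = np.cos(z)      # g'(z)
    dz_dy = 2 * y          # h'(y)
    return dL_dz * dz_dy   # chain rule: dL/dy = dL/dz * dz/dy

def dL_dy_numeric(y, eps=1e-6):
    return (g(h(y + eps)) - g(h(y - eps))) / (2 * eps)

y0 = 1.3
print(dL_dy_chain(y0), dL_dy_numeric(y0))  # the two values should agree closely
```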
Logistic Regression derivatives
For logistic regression, the negative log of the likelihood is
$$\mathcal{L} = \sum_i \mathcal{L}_i = -\sum_i \log p_i = -\sum_i \left[ z_i \log q_i + (1 - z_i)\log(1 - q_i) \right],$$
where $q_i = \dfrac{1}{1 + e^{-W^{T}X_i}}$, so
$$\mathcal{L}_i = -z_i \log \frac{1}{1 + e^{-W^{T}X_i}} - (1 - z_i)\log\!\left(1 - \frac{1}{1 + e^{-W^{T}X_i}}\right).$$

To simplify the analysis, split each term into two parts, $\mathcal{L}_i = \mathcal{L}_i^{A} + \mathcal{L}_i^{B}$, so that the derivative with respect to $W$ is
$$\frac{\partial \mathcal{L}}{\partial W} = \sum_i \frac{\partial \mathcal{L}_i}{\partial W} = \sum_i\left(\frac{\partial \mathcal{L}_i^{A}}{\partial W} + \frac{\partial \mathcal{L}_i^{B}}{\partial W}\right).$$
First part, $\mathcal{L}^{A} = -z \log \dfrac{1}{1 + e^{-W^{T}X}}$ (dropping the index $i$). Build it up from intermediate variables and differentiate each step; each partial derivative is written first in terms of the intermediate variables and then with the substitutions made:

$o_1 = -W^{T}X$, $\quad \dfrac{\partial o_1}{\partial W} = -X$
$o_2 = e^{o_1} = e^{-W^{T}X}$, $\quad \dfrac{\partial o_2}{\partial o_1} = e^{o_1} = e^{-W^{T}X}$
$o_3 = 1 + o_2 = 1 + e^{-W^{T}X}$, $\quad \dfrac{\partial o_3}{\partial o_2} = 1$
$o_4 = \dfrac{1}{o_3} = \dfrac{1}{1 + e^{-W^{T}X}} = q$, $\quad \dfrac{\partial o_4}{\partial o_3} = -\dfrac{1}{o_3^{2}} = -\dfrac{1}{(1 + e^{-W^{T}X})^{2}}$
$o_5 = \log o_4 = \log q$, $\quad \dfrac{\partial o_5}{\partial o_4} = \dfrac{1}{o_4} = 1 + e^{-W^{T}X}$
$\mathcal{L}^{A} = -z\, o_5$, $\quad \dfrac{\partial \mathcal{L}^{A}}{\partial o_5} = -z$

Multiplying along the chain,
$$\frac{\partial \mathcal{L}^{A}}{\partial W} = \frac{\partial \mathcal{L}^{A}}{\partial o_5}\frac{\partial o_5}{\partial o_4}\frac{\partial o_4}{\partial o_3}\frac{\partial o_3}{\partial o_2}\frac{\partial o_2}{\partial o_1}\frac{\partial o_1}{\partial W} = -z\,\frac{e^{-W^{T}X}}{1 + e^{-W^{T}X}}\,X.$$
Second part, $\mathcal{L}^{B} = -(1 - z)\log\!\left[1 - \dfrac{1}{1 + e^{-W^{T}X}}\right]$. The same construction, with two extra intermediate variables:

$o_1 = -W^{T}X$, $\quad \dfrac{\partial o_1}{\partial W} = -X$
$o_2 = e^{o_1} = e^{-W^{T}X}$, $\quad \dfrac{\partial o_2}{\partial o_1} = e^{o_1} = e^{-W^{T}X}$
$o_3 = 1 + o_2 = 1 + e^{-W^{T}X}$, $\quad \dfrac{\partial o_3}{\partial o_2} = 1$
$o_4 = \dfrac{1}{o_3} = \dfrac{1}{1 + e^{-W^{T}X}} = q$, $\quad \dfrac{\partial o_4}{\partial o_3} = -\dfrac{1}{o_3^{2}} = -\dfrac{1}{(1 + e^{-W^{T}X})^{2}}$
$o_5 = 1 - o_4 = 1 - q$, $\quad \dfrac{\partial o_5}{\partial o_4} = -1$
$o_6 = \log o_5 = \log(1 - q)$, $\quad \dfrac{\partial o_6}{\partial o_5} = \dfrac{1}{o_5} = \dfrac{1 + e^{-W^{T}X}}{e^{-W^{T}X}}$
$\mathcal{L}^{B} = -(1 - z)\, o_6$, $\quad \dfrac{\partial \mathcal{L}^{B}}{\partial o_6} = -(1 - z)$

Multiplying along the chain,
$$\frac{\partial \mathcal{L}^{B}}{\partial W} = \frac{\partial \mathcal{L}^{B}}{\partial o_6}\frac{\partial o_6}{\partial o_5}\frac{\partial o_5}{\partial o_4}\frac{\partial o_4}{\partial o_3}\frac{\partial o_3}{\partial o_2}\frac{\partial o_2}{\partial o_1}\frac{\partial o_1}{\partial W} = (1 - z)\,\frac{1}{1 + e^{-W^{T}X}}\,X.$$
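Adding the two parts gives $\dfrac{\partial \mathcal{L}_i}{\partial W} = (q_i - z_i)\,X_i$. A small numerical check of this result (a minimal sketch in NumPy, not from the slides; the weights, features, and label below are made up):

```python
import numpy as np

def grad_analytic(W, X, z):
    """Analytic gradient of L_i with respect to W: (q - z) * X."""
    q = 1.0 / (1.0 + np.exp(-W @ X))
    return (q - z) * X

def grad_numeric(W, X, z, eps=1e-6):
    """Central finite-difference check of the same gradient."""
    def loss(w):
        q = 1.0 / (1.0 + np.exp(-w @ X))
        return -(z * np.log(q) + (1 - z) * np.log(1 - q))
    g = np.zeros_like(W)
    for j in range(len(W)):
        dw = np.zeros_like(W)
        dw[j] = eps
        g[j] = (loss(W + dw) - loss(W - dw)) / (2 * eps)
    return g

W = np.array([0.5, -1.2])   # weights (made up)
X = np.array([1.0, 2.0])    # features (made up)
z = 1.0                     # label
print(grad_analytic(W, X, z))
print(grad_numeric(W, X, z))   # should agree to ~1e-6
```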
Learning Rate
Our choice of the learning rate has a significant impact on the performance of gradient descent. When the learning rate is too small, the algorithm makes very little progress. When it is too large, the algorithm may overshoot the minimum and oscillate wildly. When it is chosen appropriately, the algorithm finds the minimum: the algorithm converges.
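A minimal sketch of this behavior (not from the slides; the quadratic loss, starting point, and learning rates are made up for illustration):

```python
def gradient_descent(lr, steps=25, w0=5.0):
    """Minimize f(w) = w**2 with plain gradient descent; return the loss trace."""
    w = w0
    trace = [w ** 2]
    for _ in range(steps):
        grad = 2 * w        # f'(w)
        w = w - lr * grad   # gradient descent update
        trace.append(w ** 2)
    return trace

for lr in (0.01, 1.05, 0.1):   # too small, too large, appropriate
    print(f"learning rate {lr}: final loss = {gradient_descent(lr)[-1]:.4f}")
```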
How can we tell when gradient descent is converging? We visualize the loss at each step of gradient descent; this is called a trace plot. Typical behaviors:
- The loss mostly oscillates between values rather than converging.
- The loss decreases throughout training, but it does not look like the descent has reached the bottom yet.
- The loss decreases significantly during training and stabilizes towards the end, at which point it cannot decrease further.
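A trace plot is simply the loss recorded at every step; a sketch with matplotlib (not from the slides), reusing the toy quadratic loss from the previous snippet:

```python
import matplotlib.pyplot as plt

def loss_trace(lr, steps=25, w0=5.0):
    """Record the loss f(w) = w**2 at every gradient descent step."""
    w, trace = w0, []
    for _ in range(steps):
        trace.append(w ** 2)
        w -= lr * 2 * w   # gradient descent update
    return trace

for lr in (0.01, 0.1):
    plt.plot(loss_trace(lr), label=f"learning rate {lr}")
plt.xlabel("iteration")
plt.ylabel("loss")
plt.legend()
plt.title("Trace plot of the loss during gradient descent")
plt.show()
```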
There are many alternative methods that address how to set or adjust the learning rate, using first or second derivatives and/or momentum. More on this later.
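One of the simplest such variants is gradient descent with momentum; a sketch (not the course's implementation), again on a toy quadratic loss:

```python
def momentum_descent(lr=0.1, beta=0.9, steps=200, w0=5.0):
    """Gradient descent with momentum on f(w) = w**2."""
    w, v = w0, 0.0
    for _ in range(steps):
        grad = 2 * w
        v = beta * v + grad   # exponentially weighted accumulation of gradients
        w = w - lr * v        # update along the accumulated direction
    return w

print(momentum_descent())   # ends up very close to the minimum at w = 0
```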
Local vs Global Minima
If we choose the learning rate correctly, then gradient descent will converge to a stationary point. But will this point be a global minimum? If the function is convex, then the stationary point will be a global minimum.
In general, there is no guarantee that we reach the global minimum. Question: what would be a good strategy?
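One common strategy (a sketch of one possible answer, not stated on the slide) is random restarts: run gradient descent from several random initializations and keep the solution with the lowest loss.

```python
import numpy as np

def f(w):
    """A non-convex toy loss with several local minima (made up)."""
    return np.sin(3 * w) + 0.1 * w ** 2

def df(w):
    """Its derivative."""
    return 3 * np.cos(3 * w) + 0.2 * w

def descend(w0, lr=0.01, steps=500):
    w = w0
    for _ in range(steps):
        w -= lr * df(w)   # plain gradient descent
    return w

rng = np.random.default_rng(0)
starts = rng.uniform(-5, 5, size=10)      # several random initializations
solutions = [descend(w0) for w0 in starts]
best = min(solutions, key=f)              # keep the lowest-loss stationary point
print(best, f(best))
```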
Batch and Stochastic Gradient Descent
Instead of using all of the examples for every step, use a subset (a mini-batch). For each iteration $k$, use the following loss function to derive the derivatives:
$$\mathcal{L}_k = -\sum_{i \in B_k}\left[ z_i \log q_i + (1 - z_i)\log(1 - q_i) \right],$$
which is an approximation to the full loss function
$$\mathcal{L} = -\sum_i \left[ z_i \log q_i + (1 - z_i)\log(1 - q_i) \right].$$
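A small numerical illustration of how well a mini-batch approximates the full loss (a sketch with made-up synthetic data; per-example averages are used so the two numbers are directly comparable):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 10_000, 5
X = rng.normal(size=(n, d))                                               # synthetic features
true_W = rng.normal(size=d)
z = (rng.uniform(size=n) < 1 / (1 + np.exp(-X @ true_W))).astype(float)  # synthetic labels

def mean_loss(W, X, z):
    """Negative log-likelihood averaged over the examples given."""
    q = 1 / (1 + np.exp(-X @ W))
    return -np.mean(z * np.log(q) + (1 - z) * np.log(1 - q))

W = rng.normal(size=d)                           # an arbitrary weight vector
batch = rng.choice(n, size=128, replace=False)   # one mini-batch B_k
print(mean_loss(W, X, z))                        # full loss
print(mean_loss(W, X[batch], z[batch]))          # mini-batch approximation
```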
Take the data and split it into mini-batches. For each batch: calculate the loss $\mathcal{L}_k$, compute its gradient $\dfrac{\partial \mathcal{L}_k}{\partial W}$, and update the weights, $W \leftarrow W - \lambda \dfrac{\partial \mathcal{L}_k}{\partial W}$, where $\lambda$ is the learning rate. One pass through the complete data is one epoch. Then reshuffle the data and repeat.
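A sketch of that loop in NumPy for the logistic regression loss above (an assumed implementation, not the course's code; the synthetic data at the bottom is made up):

```python
import numpy as np

def sgd_logistic(X, z, lr=0.1, batch_size=64, epochs=20, seed=0):
    """Mini-batch stochastic gradient descent for logistic regression."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(n)                     # reshuffle the data
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]      # one mini-batch B_k
            q = 1 / (1 + np.exp(-X[idx] @ W))
            grad = X[idx].T @ (q - z[idx]) / len(idx)  # dL_k/dW, averaged over the batch
            W -= lr * grad                             # one update per batch
    return W

# Tiny synthetic example (made-up data)
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
true_W = np.array([1.5, -2.0, 0.5])
z = (rng.uniform(size=1000) < 1 / (1 + np.exp(-X @ true_W))).astype(float)
print(sgd_logistic(X, z))   # roughly recovers true_W
```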
[Figure: successive steps of stochastic gradient descent, showing the loss $\mathcal{L}$ against the weights for both the full likelihood and the batch likelihood used at each step; each mini-batch gives a slightly different approximation to the full likelihood.]