SLIDE 1

Stochastic Gradient Descent


Tufts COMP 135: Introduction to Machine Learning https://www.cs.tufts.edu/comp/135/2020f/

Prof. Mike Hughes

Many slides attributable to: Erik Sudderth (UCI), Emily Fox (UW), Finale Doshi-Velez (Harvard); James, Witten, Hastie, Tibshirani (ISL/ESL books)
SLIDE 2

Objectives Today (day 12) Stochastic Gradient Descent

  • Review: Gradient Descent
  • Repeatedly step downhill until converged
  • Review: Training Neural Nets with Backprop
  • Backprop = chain rule plus dynamic programming
  • L-BFGS: How to step in a better direction?
  • Stochastic Gradient Descent: How to go fast?


Mike Hughes - Tufts COMP 135 - Fall 2020

SLIDE 3

Review: Gradient Descent in 1D


input: initial θ ∈ ℝ
input: step size α ∈ ℝ⁺
while not converged:
    θ ← θ − α (d/dθ) J(θ)

Q: Which direction to step?
A: Straight downhill (steepest descent at current location)

Q: How far to step in that direction?
A: Step size parameter picked in advance, unaware of current location
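The update rule above can be sketched in a few lines of Python. This is a minimal illustration, not course code: it assumes a toy objective J(θ) = θ², whose derivative is 2θ and whose minimum is θ = 0.

```python
# Minimal sketch of 1D gradient descent on the toy objective J(theta) = theta^2.
# Its derivative is dJ/dtheta = 2 * theta, and its minimum is at theta = 0.
def gradient_descent_1d(dJ_dtheta, theta_init, alpha, n_steps):
    theta = theta_init
    for _ in range(n_steps):
        theta = theta - alpha * dJ_dtheta(theta)  # step straight downhill
    return theta

theta_final = gradient_descent_1d(lambda th: 2.0 * th, theta_init=5.0, alpha=0.1, n_steps=100)
```

With α = 0.1 each step shrinks θ by a factor of 0.8, so after 100 steps θ is essentially 0.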

SLIDE 4

input: initial θ ∈ ℝ^D
input: step size α ∈ ℝ⁺
while not converged:
    θ ← θ − α ∇_θ J(θ)

Review: Gradient Descent in 2D+


Q: Which direction to step?
A: Straight downhill (steepest descent at current location)

Q: How far to step in that direction?
A: Step size parameter picked in advance, unaware of current location

gradient = vector of partial derivatives

θ ← θ − α ∇_θ J(θ)

∇_θ J(θ) = [ ∂J/∂θ_0, ∂J/∂θ_1, … ]^T

Step length: α · ||∇_θ J(θ)||
slide-5
SLIDE 5

Review: Step size matters

Even in one dimension, it is tough to select the step size. Recommendations:

  • Try multiple values
  • Might need different sizes at different locations

[Figure: plots of g(y) versus y for different step sizes]

SLIDE 6

Review: Neural Net as computational graph

Two directions of propagation:

  • Forward: compute loss
  • Backward: compute grad
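The two passes can be illustrated on a hypothetical two-node graph. The tiny model ŷ = w·x and all names below are made up for illustration, not the course's network:

```python
# Toy computational graph: yhat = w * x (node 1), loss = (yhat - y)^2 (node 2).
# Forward pass computes the loss; backward pass applies the chain rule.
def forward(x, y, w):
    yhat = w * x              # node 1
    loss = (yhat - y) ** 2    # node 2
    return loss, yhat

def backward(x, y, w, yhat):
    dloss_dyhat = 2.0 * (yhat - y)  # local gradient at node 2
    dloss_dw = dloss_dyhat * x      # chain rule back through node 1
    return dloss_dw

loss, yhat = forward(x=2.0, y=1.0, w=3.0)
grad_w = backward(2.0, 1.0, 3.0, yhat)
```

Caching yhat from the forward pass and reusing it in the backward pass is the "dynamic programming" part of backprop.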


SLIDE 7

Review: Training Neural Nets


Training Objective:

    min_w Σ_{n=1}^{N} E(y_n, ŷ(x_n, w))

Gradient Descent Algorithm:

w = initialize_weights_at_random_guess(random_state=0)
while not converged:
    total_grad_wrt_w = zeros_like(w)
    for n in 1, 2, …, N:
        loss[n], grad_wrt_w[n] = forward_and_backward_prop(x[n], y[n], w)
        total_grad_wrt_w += grad_wrt_w[n]
    w = w - alpha * total_grad_wrt_w

How to pick step size reliably? How to go fast on big datasets?
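The loop above can be made runnable on a toy 1D regression problem. The model ŷ = w·x, the squared-error loss, and all data below are illustrative stand-ins, not the course's neural network:

```python
import numpy as np

# Runnable version of the full-dataset gradient descent loop above,
# on toy data generated with true weight 2.0.
def forward_and_backward_prop(x_n, y_n, w):
    yhat = w * x_n
    loss = (y_n - yhat) ** 2
    grad_wrt_w = -2.0 * (y_n - yhat) * x_n
    return loss, grad_wrt_w

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2.0 * x                      # true weight is 2.0
w, alpha = 0.0, 0.005
for _ in range(200):             # "while not converged", simplified to a fixed budget
    total_grad_wrt_w = 0.0
    for n in range(len(x)):
        _, grad_n = forward_and_backward_prop(x[n], y[n], w)
        total_grad_wrt_w += grad_n
    w = w - alpha * total_grad_wrt_w
```

Note each pass over the data costs O(N) calls to forward/backward propagation, which motivates the questions above.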

SLIDE 8

Step size strategy: Slow decay


input: initial w ∈ ℝ^D
input: initial step size s_0 ∈ ℝ⁺
while not converged:
    w ← w − s_t ∇_w L(w)
    s_t ← decay(s_0, t)
    t ← t + 1

Linear decay: α_t = s_0 / (k t)
Exponential decay: α_t = s_0 e^{−k t}
(t : number of steps)

Often helpful, but requires tuning and is hard to get right!
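The two schedules are easy to write down directly. This sketch assumes the forms above, with s_0 the initial step size and k > 0 a decay rate; the specific values below are arbitrary:

```python
import math

# The two decay schedules as functions of the step counter t (t >= 1).
# s0 is the initial step size; k > 0 controls how quickly it shrinks.
def linear_decay(s0, k, t):
    return s0 / (k * t)

def exponential_decay(s0, k, t):
    return s0 * math.exp(-k * t)

alpha_lin = linear_decay(s0=1.0, k=1.0, t=10)       # 1/10
alpha_exp = exponential_decay(s0=1.0, k=0.1, t=10)  # e^{-1}
```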
SLIDE 9

Q: How far to step?
A: Line search

Find good step size for current location

Search for the best scalar s >= 0, such that:


Goal:           min_x f(x)
Step direction: Δx = −∇_x f(x)
Line search:    s* = arg min_{s ≥ 0} f(x + s Δx)

Possible step lengths: s = 0.5, s = 1.3, s = 5.1

In Python code: scipy.optimize.line_search

Can be expensive, but often worth it
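A sketch of calling SciPy's line search on a simple quadratic. The function f(x) = xᵀx, its gradient, and the starting point are made-up examples:

```python
import numpy as np
from scipy.optimize import line_search

# Ask SciPy for a good step size along the downhill direction of f(x) = x . x.
def f(x):
    return float(x @ x)

def grad_f(x):
    return 2.0 * x

xk = np.array([5.0])
pk = -grad_f(xk)                                 # steepest-descent direction
alpha_star = line_search(f, grad_f, xk, pk)[0]   # first return value is the step size
```

The returned step satisfies the Wolfe conditions, so f(x + s·Δx) is guaranteed to be smaller than f(x).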

SLIDE 10


Q: Better direction to step than straight downhill?
A: Yes. Modify direction using second-order derivatives.

min_θ J(θ)

1-D, 1st order only (steepest descent direction):  Δθ = −J′(θ)
1-D, using 2nd order (Newton descent direction):   Δθ = −J′(θ) / J″(θ)
2-D+, 1st order only:                              Δθ = −∇_θ J(θ)
2-D+, using 2nd order (Newton):                    Δθ = −H(θ)^{−1} ∇_θ J(θ)

H is the Hessian matrix for J: a D × D matrix of all second-order partial derivatives.
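For a one-dimensional quadratic, the Newton step Δθ = −J′(θ)/J″(θ) lands on the minimum in a single update. The objective below is an arbitrary illustrative example:

```python
# One Newton step in 1D: theta <- theta - J'(theta) / J''(theta).
# Example objective J(theta) = (theta - 3)^2, so J' = 2(theta - 3) and J'' = 2.
def newton_step(theta, J_prime, J_double_prime):
    return theta - J_prime(theta) / J_double_prime(theta)

theta = newton_step(10.0, lambda t: 2.0 * (t - 3.0), lambda t: 2.0)
```

Starting from θ = 10, one step reaches the minimizer θ = 3 exactly, because the quadratic's curvature is constant.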
SLIDE 11

L-BFGS: Smarter Gradient Descent

Approximate second-order method:

  • Computes first-order gradient vector exactly on provided training dataset
  • Computes efficient approximation of Hessian via recent history of steps


L-BFGS: Limited-Memory Broyden–Fletcher–Goldfarb–Shanno (BFGS)

Q: Which direction to step?
A: Downhill, adjusted by curvature at current location

Q: How far to step in that direction?
A: Efficient line search; step size adjusted to current location (as implemented in SciPy)
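SciPy's implementation can be invoked through scipy.optimize.minimize with method 'L-BFGS-B'. The objective below is a made-up bowl with its minimum at [1, 1]:

```python
import numpy as np
from scipy.optimize import minimize

# Run SciPy's L-BFGS on J(w) = sum((w - 1)^2); the exact minimizer is w = [1, 1].
def J(w):
    return float(np.sum((w - 1.0) ** 2))

def grad_J(w):
    return 2.0 * (w - 1.0)

result = minimize(J, x0=np.zeros(2), jac=grad_J, method="L-BFGS-B")
```

Note we supply only first-order information (jac); the Hessian approximation is built internally from the recent history of steps.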

SLIDE 12

Objectives Today (day 12) Stochastic Gradient Descent


  • Review: Gradient Descent
  • Repeatedly step downhill until converged
  • Review: Training Neural Nets with Backprop
  • Backprop = chain rule plus dynamic programming
  • L-BFGS: How to step in a better direction?
  • Stochastic Gradient Descent: How to go fast?
SLIDE 13

Stochastic Estimate of Loss Function

  • Standard “full-dataset” objective
  • Rewrite as an “expected value”


L(w) = (1/N) Σ_{n=1}^{N} L_n(x_n, y_n, w)

L(w) = E_{x_i, y_i ∼ Unif({x_n, y_n}_{n=1}^{N})} [ L_i(x_i, y_i, w) ]

Empirical distribution over our N training examples: each index i selected with probability 1/N.

SLIDE 14

Stochastic Estimate of Loss Function

  • Standard “full-dataset” objective
  • Rewrite as an “expected value”
  • Approximate with one randomly-drawn sample


L(w) = (1/N) Σ_{n=1}^{N} L_n(x_n, y_n, w)

L(w) = E_{x_i, y_i ∼ Unif({x_n, y_n}_{n=1}^{N})} [ L_i(x_i, y_i, w) ]

L(w) ≈ L_i(x_i, y_i, w),   where x_i, y_i ∼ Unif({x_n, y_n}_{n=1}^{N})

Each index i selected with probability 1/N.

SLIDE 15

Stochastic Estimate of Gradient

  • Standard “full-dataset” gradient
  • Approximate with one randomly-drawn sample

Each index i selected with probability 1/N.

∇_w L(w) = (1/N) Σ_{n=1}^{N} ∇_w L_n(x_n, y_n, w)

∇_w L(w) ≈ ∇_w L_i(x_i, y_i, w),   where x_i, y_i ∼ Unif({x_n, y_n}_{n=1}^{N})
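The single-example estimate is unbiased: averaging it over all indices i, each selected with probability 1/N, reproduces the full-dataset gradient exactly. A quick numerical check, using a toy squared-error loss with ŷ = w·x and made-up data:

```python
import numpy as np

# Compare the full-dataset gradient to the expectation of the
# single-example gradient under Unif over indices (each with prob 1/N).
rng = np.random.default_rng(0)
N = 8
x, y, w = rng.normal(size=N), rng.normal(size=N), 0.5

def grad_single(i):
    return -2.0 * (y[i] - w * x[i]) * x[i]   # gradient of (y_i - w x_i)^2 wrt w

full_grad = np.mean([grad_single(n) for n in range(N)])
expected_stochastic = sum(grad_single(i) * (1.0 / N) for i in range(N))
```

The two quantities agree exactly, which is the unbiasedness claim on the next slide.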

SLIDE 16

Gradient Descent using Noisy Estimates of the “True” Gradient


Intuition: As long as each noisy step takes us in a direction that is correct on average, we will, over many steps, make progress in minimizing the loss.

Formal guarantees: Our Monte Carlo estimate of the gradient is unbiased, so its expected value is exactly equal to the true whole-dataset gradient.

SLIDE 17


Stochastic gradient descent (SGD), using one example at a time:

input: initial w ∈ ℝ^D
input: step size α ∈ ℝ⁺
while not converged:
    x_i, y_i ∼ Unif({x_n, y_n}_{n=1}^{N})
    w ← w − α ∇_w L(x_i, y_i, w)

Should we only use one example i to estimate gradient?
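A runnable sketch of this loop on a toy 1D regression problem. The data y = 2x, the per-example loss L(x_i, y_i, w) = (y_i − w·x_i)², and the step size are all made-up illustrations:

```python
import numpy as np

# SGD with one example at a time: each update uses a noisy,
# single-example estimate of the gradient.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x                                   # true weight is 2.0
w, alpha = 0.0, 0.05

for _ in range(2000):                         # "while not converged", simplified
    i = rng.integers(len(x))                  # draw x_i, y_i uniformly from the dataset
    grad_i = -2.0 * (y[i] - w * x[i]) * x[i]  # noisy single-example gradient
    w = w - alpha * grad_i
```

Each update costs O(1) instead of O(N), which is the "go fast" payoff; the price is noise in each step.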

SLIDE 18


SGD with minibatches of size B

input: initial w ∈ ℝ^D
input: step size α ∈ ℝ⁺
while not converged:
    {x_b, y_b}_{b=1}^{B} ∼ Unif({x_n, y_n}_{n=1}^{N}, size=B, replace=False)
    w ← w − α · (1/B) Σ_{b=1}^{B} ∇_w L(x_b, y_b, w)

B = 1 recovers the previous slide. B = N recovers standard GD. In between: trade off the quality of the estimate against its cost.
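The same toy problem as before, now with minibatches of size B. Data, step size, and batch size are illustrative values:

```python
import numpy as np

# Minibatch SGD on toy data y = 2x: average the per-example gradients
# over B examples drawn without replacement, then take one step.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x                                               # true weight is 2.0
w, alpha, B = 0.0, 0.1, 10

for _ in range(500):
    idx = rng.choice(len(x), size=B, replace=False)       # the batch {x_b, y_b}
    grad = np.mean(-2.0 * (y[idx] - w * x[idx]) * x[idx]) # averaged gradient
    w = w - alpha * grad
```

Averaging over B examples reduces the variance of each step relative to B = 1, at B times the per-step cost.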

SLIDE 19

Objectives Today (day 12) Stochastic Gradient Descent


  • Review: Gradient Descent
  • Repeatedly step downhill until converged
  • Review: Training Neural Nets with Backprop
  • Backprop = chain rule plus dynamic programming
  • Line Search: How to take a step of good size?
  • L-BFGS: How to step in a better direction?
  • Stochastic Gradient Descent: How to go fast?
SLIDE 20

Breakout to Lab

Warning: Notation can be confusing

  • Alpha in these slides refers to step size (aka learning rate)
  • In sklearn's MLPClassifier, alpha refers to a different hyperparameter: the scalar strength of a small L2 penalty on the magnitudes of the weights
  • To set step size in sklearn: learning_rate_init=0.5
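A minimal sketch of setting both hyperparameters side by side. The tiny XOR-style dataset and the specific values are arbitrary illustrations:

```python
from sklearn.neural_network import MLPClassifier

# learning_rate_init is the step size from these slides;
# alpha is sklearn's L2 penalty strength -- a different knob entirely.
clf = MLPClassifier(
    solver="sgd",
    learning_rate_init=0.5,   # step size (the "alpha" of these slides)
    alpha=0.0001,             # L2 penalty strength (sklearn's "alpha")
    hidden_layer_sizes=(8,),
    max_iter=100,
    random_state=0,
)
clf.fit([[0, 0], [0, 1], [1, 0], [1, 1]], [0, 1, 1, 0])
```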
