Stochastic Gradient Descent

  1. Tufts COMP 135: Introduction to Machine Learning
     https://www.cs.tufts.edu/comp/135/2020f/
     Stochastic Gradient Descent
     Many slides attributable to: Prof. Mike Hughes; Erik Sudderth (UCI); Emily Fox (UW); Finale Doshi-Velez (Harvard); James, Witten, Hastie, Tibshirani (ISL/ESL books)

  2. Objectives Today (day 12): Stochastic Gradient Descent
     • Review: Gradient Descent (repeatedly step downhill until converged)
     • Review: Training Neural Nets with Backprop (backprop = chain rule plus dynamic programming)
     • L-BFGS: how to step in a better direction?
     • Stochastic Gradient Descent: how to go fast?

  3. Review: Gradient Descent in 1D

     input: initial $\theta \in \mathbb{R}$
     input: step size $\alpha \in \mathbb{R}^+$
     while not converged:
         $\theta \leftarrow \theta - \alpha \frac{d}{d\theta} J(\theta)$

     Q: Which direction to step?
     A: Straight downhill (steepest descent at the current location).

     Q: How far to step in that direction?
     A: A distance of $\alpha \cdot \left| \frac{d}{d\theta} J(\theta) \right|$.

     Note: the step size parameter is picked in advance, unaware of the current location.
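
The loop above translates directly into a few lines of Python. Below is a minimal sketch, treating "converged" as the derivative shrinking below a tolerance; the names gradient_descent_1d, grad_J, and tol are illustrative, not from the course materials.

    # Minimal 1-D gradient descent sketch (illustrative names, not course code).
    def gradient_descent_1d(grad_J, theta_init, alpha, tol=1e-6, max_iters=10000):
        """Minimize J by repeatedly stepping against its derivative grad_J."""
        theta = theta_init
        for _ in range(max_iters):
            g = grad_J(theta)
            if abs(g) < tol:           # "converged": derivative is nearly zero
                break
            theta = theta - alpha * g  # step of length alpha * |dJ/dtheta|, downhill
        return theta

    # Example: J(theta) = (theta - 3)**2 has derivative 2 * (theta - 3), minimum at 3.
    print(gradient_descent_1d(lambda t: 2.0 * (t - 3.0), theta_init=0.0, alpha=0.1))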

  4. Review: Gradient Descent in 2D+

     The gradient is the vector of all $D$ partial derivatives:
         $\nabla_\theta J(\theta) = \left[ \frac{\partial J}{\partial \theta_0}, \frac{\partial J}{\partial \theta_1}, \ldots, \frac{\partial J}{\partial \theta_{D-1}} \right]^\top$

     input: initial $\theta \in \mathbb{R}^D$
     input: step size $\alpha \in \mathbb{R}^+$
     while not converged:
         $\theta \leftarrow \theta - \alpha \nabla_\theta J(\theta)$

     Q: Which direction to step?
     A: Straight downhill (steepest descent at the current location).

     Q: How far to step in that direction?
     A: A distance of $\alpha \cdot \| \nabla_\theta J(\theta) \|$.

     Note: the step size parameter is picked in advance, unaware of the current location.
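
The same sketch generalizes to D dimensions with NumPy: the scalar derivative becomes the gradient vector and the absolute value becomes the norm. Again an illustrative sketch, not course code.

    import numpy as np

    def gradient_descent(grad_J, theta_init, alpha, tol=1e-6, max_iters=10000):
        """Minimize J over R^D by stepping against its gradient vector."""
        theta = np.asarray(theta_init, dtype=float)
        for _ in range(max_iters):
            g = grad_J(theta)
            if np.linalg.norm(g) < tol:  # step length alpha * ||grad|| is nearly zero
                break
            theta = theta - alpha * g    # straight downhill in all D coordinates at once
        return theta

    # Example: bowl J(theta) = theta[0]**2 + 10 * theta[1]**2, minimum at the origin.
    grad = lambda th: np.array([2.0 * th[0], 20.0 * th[1]])
    print(gradient_descent(grad, theta_init=[1.0, 1.0], alpha=0.05))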

  5. Review: Step size matters

     Even in one dimension, it is tough to select a step size.

     [Figure: plots of an objective g(y) versus y, showing gradient steps under different step sizes.]

     Recommendations:
     • Try multiple values.
     • You might need different step sizes at different locations.
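
Following the "try multiple values" advice can be as simple as a sweep. This sketch reuses the hypothetical gradient_descent_1d helper from above on J(theta) = (theta - 3)**2, where the error shrinks by a factor of |1 - 2 * alpha| per step: alpha = 1.0 oscillates forever, alpha = 0.5 lands exactly on the minimum, and alpha = 0.001 crawls.

    # Step-size sweep on J(theta) = (theta - 3)**2, derivative 2 * (theta - 3).
    for alpha in [1.0, 0.5, 0.1, 0.001]:
        theta = gradient_descent_1d(lambda t: 2.0 * (t - 3.0),
                                    theta_init=0.0, alpha=alpha)
        print(f"alpha={alpha}: final theta={theta:.6f}")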

  6. Review: Neural Net as a computational graph

     Two directions of propagation:
     • Forward: compute the loss
     • Backward: compute the gradient
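
To make the two directions concrete, here is a toy example for the smallest possible graph, a one-weight linear model with squared error (an illustrative sketch, not a slide from the course). The forward pass computes and caches intermediate values; the backward pass reuses them to apply the chain rule, which is exactly the dynamic programming mentioned in the objectives.

    def forward_and_backward(x, w, y):
        """Forward: yhat = w * x, loss = (y - yhat)**2. Backward: chain rule."""
        yhat = w * x                       # forward pass, cached for reuse
        loss = (y - yhat) ** 2
        d_loss_d_yhat = -2.0 * (y - yhat)  # backward pass reuses yhat
        d_loss_d_w = d_loss_d_yhat * x     # chain rule: d(yhat)/d(w) = x
        return loss, d_loss_d_w

    print(forward_and_backward(x=2.0, w=0.5, y=3.0))  # (4.0, -8.0)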

  7. Review: Training Neural Nets

     Training objective:
         $\min_{w} \sum_{n=1}^{N} E\big(y_n, \hat{y}(x_n, w)\big)$

     Gradient descent algorithm (pseudocode):

         w = initialize_weights_at_random_guess(random_state=0)
         while not converged:
             total_grad_wrt_w = zeros_like(w)
             for n in range(N):  # one full pass over all N examples per update
                 loss_n, grad_n = forward_and_backward_prop(x[n], y[n], w)
                 total_grad_wrt_w += grad_n
             w = w - alpha * total_grad_wrt_w

     Q: How to pick the step size reliably?
     Q: How to go fast on big datasets?
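
The second question is the one stochastic gradient descent answers: instead of summing gradients over all N examples before every update, update after each example, visited in random order. Below is a hedged sketch of one epoch, assuming the same per-example forward_and_backward_prop routine as in the pseudocode above.

    import numpy as np

    def sgd_epoch(x, y, w, alpha, forward_and_backward_prop, rng):
        """One pass over the data with one cheap update per example."""
        for n in rng.permutation(len(x)):  # visit examples in random order
            loss_n, grad_n = forward_and_backward_prop(x[n], y[n], w)
            w = w - alpha * grad_n         # update immediately, no full-dataset sum
        return w

    # Usage sketch: w = sgd_epoch(x, y, w, alpha=0.01,
    #                             forward_and_backward_prop=...,
    #                             rng=np.random.default_rng(0))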
