SLIDE 1

Stochastic Gradient Descent


Tufts COMP 135: Introduction to Machine Learning https://www.cs.tufts.edu/comp/135/2020f/

Prof. Mike Hughes

Many slides attributable to: Erik Sudderth (UCI), Emily Fox (UW), Finale Doshi-Velez (Harvard); James, Witten, Hastie, Tibshirani (ISL/ESL books)
SLIDE 2

Objectives Today (day 12) Stochastic Gradient Descent

  • Review: Gradient Descent
  • Repeatedly step downhill until converged
  • Review: Training Neural Nets with Backprop
  • Backprop = chain rule plus dynamic programming
  • L-BFGS: How to step in a better direction?
  • Stochastic Gradient Descent: How to go fast?


Mike Hughes - Tufts COMP 135 - Fall 2020

SLIDE 3

Review: Gradient Descent in 1D


input: initial θ ∈ ℝ
input: step size α ∈ ℝ⁺
while not converged:
    θ ← θ − α (d/dθ) J(θ)

Q: Which direction to step?
A: Straight downhill (steepest descent at current location)

Q: How far to step in that direction?
A: Step size parameter picked in advance, unaware of current location
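The update rule above can be sketched in a few lines of Python. This is a minimal illustration, not course code: it assumes a toy objective J(θ) = θ², whose derivative is 2θ and whose minimum is θ = 0.

```python
# Minimal sketch of 1D gradient descent on the toy objective J(theta) = theta^2.
# Its derivative is dJ/dtheta = 2 * theta, and its minimum is at theta = 0.
def gradient_descent_1d(dJ_dtheta, theta_init, alpha, n_steps):
    theta = theta_init
    for _ in range(n_steps):
        theta = theta - alpha * dJ_dtheta(theta)  # step straight downhill
    return theta

theta_final = gradient_descent_1d(lambda th: 2.0 * th, theta_init=5.0, alpha=0.1, n_steps=100)
```

With α = 0.1 each step shrinks θ by a factor of 0.8, so after 100 steps θ is essentially 0.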

SLIDE 4

input: initial θ ∈ ℝ^D
input: step size α ∈ ℝ⁺
while not converged:
    θ ← θ − α ∇_θ J(θ)

Review: Gradient Descent in 2D+


Q: Which direction to step?
A: Straight downhill (steepest descent at current location)

Q: How far to step in that direction?
A: Step size parameter picked in advance, unaware of current location

gradient = vector of partial derivatives

θ ← θ − α ∇_θ J(θ)

∇_θ J(θ) = [ ∂J/∂θ_0, ∂J/∂θ_1, … ]^T

Step length: α · ||∇_θ J(θ)||
slide-5
SLIDE 5

Review: Step size matters

Even in one dimension, it is tough to select the step size. Recommendations:

  • Try multiple values
  • Might need different sizes at different locations

[Figure: plots of g(y) versus y for different step sizes]

SLIDE 6

Review: Neural Net as computational graph

Two directions of propagation:

  • Forward: compute loss
  • Backward: compute grad
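The two passes can be illustrated on a hypothetical two-node graph. The tiny model ŷ = w·x and all names below are made up for illustration, not the course's network:

```python
# Toy computational graph: yhat = w * x (node 1), loss = (yhat - y)^2 (node 2).
# Forward pass computes the loss; backward pass applies the chain rule.
def forward(x, y, w):
    yhat = w * x              # node 1
    loss = (yhat - y) ** 2    # node 2
    return loss, yhat

def backward(x, y, w, yhat):
    dloss_dyhat = 2.0 * (yhat - y)  # local gradient at node 2
    dloss_dw = dloss_dyhat * x      # chain rule back through node 1
    return dloss_dw

loss, yhat = forward(x=2.0, y=1.0, w=3.0)
grad_w = backward(2.0, 1.0, 3.0, yhat)
```

Caching yhat from the forward pass and reusing it in the backward pass is the "dynamic programming" part of backprop.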


SLIDE 7

Review: Training Neural Nets


Training Objective:

    min_w Σ_{n=1}^{N} E(y_n, ŷ(x_n, w))

Gradient Descent Algorithm:

w = initialize_weights_at_random_guess(random_state=0)
while not converged:
    total_grad_wrt_w = zeros_like(w)
    for n in 1, 2, …, N:
        loss[n], grad_wrt_w[n] = forward_and_backward_prop(x[n], y[n], w)
        total_grad_wrt_w += grad_wrt_w[n]
    w = w - alpha * total_grad_wrt_w

How to pick step size reliably? How to go fast on big datasets?
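The loop above can be made runnable on a toy 1D regression problem. The model ŷ = w·x, the squared-error loss, and all data below are illustrative stand-ins, not the course's neural network:

```python
import numpy as np

# Runnable version of the full-dataset gradient descent loop above,
# on toy data generated with true weight 2.0.
def forward_and_backward_prop(x_n, y_n, w):
    yhat = w * x_n
    loss = (y_n - yhat) ** 2
    grad_wrt_w = -2.0 * (y_n - yhat) * x_n
    return loss, grad_wrt_w

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2.0 * x                      # true weight is 2.0
w, alpha = 0.0, 0.005
for _ in range(200):             # "while not converged", simplified to a fixed budget
    total_grad_wrt_w = 0.0
    for n in range(len(x)):
        _, grad_n = forward_and_backward_prop(x[n], y[n], w)
        total_grad_wrt_w += grad_n
    w = w - alpha * total_grad_wrt_w
```

Note each pass over the data costs O(N) calls to forward/backward propagation, which motivates the questions above.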

SLIDE 8

Step size strategy: Slow decay


input: initial w ∈ ℝ^D
input: initial step size s_0 ∈ ℝ⁺
while not converged:
    w ← w − s_t ∇_w L(w)
    s_t ← decay(s_0, t)
    t ← t + 1

Linear decay: α_t = s_0 / (k t)
Exponential decay: α_t = s_0 e^{−k t}
(t : number of steps)

Often helpful, but requires tuning and is hard to get right!
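The two schedules are easy to write down directly. This sketch assumes the forms above, with s_0 the initial step size and k > 0 a decay rate; the specific values below are arbitrary:

```python
import math

# The two decay schedules as functions of the step counter t (t >= 1).
# s0 is the initial step size; k > 0 controls how quickly it shrinks.
def linear_decay(s0, k, t):
    return s0 / (k * t)

def exponential_decay(s0, k, t):
    return s0 * math.exp(-k * t)

alpha_lin = linear_decay(s0=1.0, k=1.0, t=10)       # 1/10
alpha_exp = exponential_decay(s0=1.0, k=0.1, t=10)  # e^{-1}
```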
SLIDE 9

Q: How far to step?
A: Line search

Find good step size for current location

Search for the best scalar s >= 0, such that:


Goal:           min_x f(x)
Step direction: Δx = −∇_x f(x)
Line search:    s* = arg min_{s ≥ 0} f(x + s Δx)

Possible step lengths: s = 0.5, s = 1.3, s = 5.1

In Python code: scipy.optimize.line_search

Can be expensive, but often worth it
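A sketch of calling SciPy's line search on a simple quadratic. The function f(x) = xᵀx, its gradient, and the starting point are made-up examples:

```python
import numpy as np
from scipy.optimize import line_search

# Ask SciPy for a good step size along the downhill direction of f(x) = x . x.
def f(x):
    return float(x @ x)

def grad_f(x):
    return 2.0 * x

xk = np.array([5.0])
pk = -grad_f(xk)                                 # steepest-descent direction
alpha_star = line_search(f, grad_f, xk, pk)[0]   # first return value is the step size
```

The returned step satisfies the Wolfe conditions, so f(x + s·Δx) is guaranteed to be smaller than f(x).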

SLIDE 10


Q: Better direction to step than straight downhill?
A: Yes. Modify direction using second-order derivatives.

min_θ J(θ)

1-D, 1st order only (steepest descent direction):  Δθ = −J′(θ)
1-D, using 2nd order (Newton descent direction):   Δθ = −J′(θ) / J″(θ)
2-D+, 1st order only:                              Δθ = −∇_θ J(θ)
2-D+, using 2nd order (Newton):                    Δθ = −H(θ)^{−1} ∇_θ J(θ)

H is the Hessian matrix for J: a D × D matrix of all second-order partial derivatives.
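For a one-dimensional quadratic, the Newton step Δθ = −J′(θ)/J″(θ) lands on the minimum in a single update. The objective below is an arbitrary illustrative example:

```python
# One Newton step in 1D: theta <- theta - J'(theta) / J''(theta).
# Example objective J(theta) = (theta - 3)^2, so J' = 2(theta - 3) and J'' = 2.
def newton_step(theta, J_prime, J_double_prime):
    return theta - J_prime(theta) / J_double_prime(theta)

theta = newton_step(10.0, lambda t: 2.0 * (t - 3.0), lambda t: 2.0)
```

Starting from θ = 10, one step reaches the minimizer θ = 3 exactly, because the quadratic's curvature is constant.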
SLIDE 11

L-BFGS: Smarter Gradient Descent

Approximate second-order method:

  • Computes first-order gradient vector exactly on provided training dataset
  • Computes efficient approximation of Hessian via recent history of steps


L-BFGS: Limited-Memory Broyden–Fletcher–Goldfarb–Shanno (BFGS)

Q: Which direction to step?
A: Downhill, adjusted by curvature at current location

Q: How far to step in that direction?
A: Efficient line search; step size adjusted to current location (as implemented in SciPy)
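SciPy's implementation can be invoked through scipy.optimize.minimize with method 'L-BFGS-B'. The objective below is a made-up bowl with its minimum at [1, 1]:

```python
import numpy as np
from scipy.optimize import minimize

# Run SciPy's L-BFGS on J(w) = sum((w - 1)^2); the exact minimizer is w = [1, 1].
def J(w):
    return float(np.sum((w - 1.0) ** 2))

def grad_J(w):
    return 2.0 * (w - 1.0)

result = minimize(J, x0=np.zeros(2), jac=grad_J, method="L-BFGS-B")
```

Note we supply only first-order information (jac); the Hessian approximation is built internally from the recent history of steps.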

SLIDE 12

Objectives Today (day 12) Stochastic Gradient Descent


  • Review: Gradient Descent
  • Repeatedly step downhill until converged
  • Review: Training Neural Nets with Backprop
  • Backprop = chain rule plus dynamic programming
  • L-BFGS: How to step in a better direction?
  • Stochastic Gradient Descent: How to go fast?
SLIDE 13

Stochastic Estimate of Loss Function

  • Standard “full-dataset” objective
  • Rewrite as an “expected value”


L(w) = (1/N) Σ_{n=1}^{N} L_n(x_n, y_n, w)

L(w) = E_{x_i, y_i ∼ Unif({x_n, y_n}_{n=1}^{N})} [ L_i(x_i, y_i, w) ]

Empirical distribution over our N training examples: each index i selected with probability 1/N.

SLIDE 14

Stochastic Estimate of Loss Function

  • Standard “full-dataset” objective
  • Rewrite as an “expected value”
  • Approximate with one randomly-drawn sample


L(w) = (1/N) Σ_{n=1}^{N} L_n(x_n, y_n, w)

L(w) = E_{x_i, y_i ∼ Unif({x_n, y_n}_{n=1}^{N})} [ L_i(x_i, y_i, w) ]

L(w) ≈ L_i(x_i, y_i, w),   where x_i, y_i ∼ Unif({x_n, y_n}_{n=1}^{N})

Each index i selected with probability 1/N.

SLIDE 15

Stochastic Estimate of Gradient

  • Standard “full-dataset” gradient
  • Approximate with one randomly-drawn sample

Each index i selected with probability 1/N.

∇_w L(w) = (1/N) Σ_{n=1}^{N} ∇_w L_n(x_n, y_n, w)

∇_w L(w) ≈ ∇_w L_i(x_i, y_i, w),   where x_i, y_i ∼ Unif({x_n, y_n}_{n=1}^{N})
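The single-example estimate is unbiased: averaging it over all indices i, each selected with probability 1/N, reproduces the full-dataset gradient exactly. A quick numerical check, using a toy squared-error loss with ŷ = w·x and made-up data:

```python
import numpy as np

# Compare the full-dataset gradient to the expectation of the
# single-example gradient under Unif over indices (each with prob 1/N).
rng = np.random.default_rng(0)
N = 8
x, y, w = rng.normal(size=N), rng.normal(size=N), 0.5

def grad_single(i):
    return -2.0 * (y[i] - w * x[i]) * x[i]   # gradient of (y_i - w x_i)^2 wrt w

full_grad = np.mean([grad_single(n) for n in range(N)])
expected_stochastic = sum(grad_single(i) * (1.0 / N) for i in range(N))
```

The two quantities agree exactly, which is the unbiasedness claim on the next slide.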

SLIDE 16

Gradient Descent using Noisy Estimates of the “True” Gradient


Intuition: As long as each noisy step takes us in a direction that is correct on average, we will, over many steps, make progress in minimizing the loss.

Formal guarantees: Our Monte Carlo estimate of the gradient is unbiased, so its expected value is exactly equal to the true whole-dataset gradient.

SLIDE 17


Stochastic gradient descent (SGD), using one example at a time:

input: initial w ∈ ℝ^D
input: step size α ∈ ℝ⁺
while not converged:
    x_i, y_i ∼ Unif({x_n, y_n}_{n=1}^{N})
    w ← w − α ∇_w L(x_i, y_i, w)

Should we only use one example i to estimate gradient?
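A runnable sketch of this loop on a toy 1D regression problem. The data y = 2x, the per-example loss L(x_i, y_i, w) = (y_i − w·x_i)², and the step size are all made-up illustrations:

```python
import numpy as np

# SGD with one example at a time: each update uses a noisy,
# single-example estimate of the gradient.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x                                   # true weight is 2.0
w, alpha = 0.0, 0.05

for _ in range(2000):                         # "while not converged", simplified
    i = rng.integers(len(x))                  # draw x_i, y_i uniformly from the dataset
    grad_i = -2.0 * (y[i] - w * x[i]) * x[i]  # noisy single-example gradient
    w = w - alpha * grad_i
```

Each update costs O(1) instead of O(N), which is the "go fast" payoff; the price is noise in each step.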

SLIDE 18


SGD with minibatches of size B

input: initial w ∈ ℝ^D
input: step size α ∈ ℝ⁺
while not converged:
    {x_b, y_b}_{b=1}^{B} ∼ Unif({x_n, y_n}_{n=1}^{N}, size=B, replace=False)
    w ← w − α · (1/B) Σ_{b=1}^{B} ∇_w L(x_b, y_b, w)

B = 1 recovers the previous slide. B = N recovers standard GD. In between: trade off the quality of the estimate against its cost.
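The same toy problem as before, now with minibatches of size B. Data, step size, and batch size are illustrative values:

```python
import numpy as np

# Minibatch SGD on toy data y = 2x: average the per-example gradients
# over B examples drawn without replacement, then take one step.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x                                               # true weight is 2.0
w, alpha, B = 0.0, 0.1, 10

for _ in range(500):
    idx = rng.choice(len(x), size=B, replace=False)       # the batch {x_b, y_b}
    grad = np.mean(-2.0 * (y[idx] - w * x[idx]) * x[idx]) # averaged gradient
    w = w - alpha * grad
```

Averaging over B examples reduces the variance of each step relative to B = 1, at B times the per-step cost.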

SLIDE 19

Objectives Today (day 12) Stochastic Gradient Descent


  • Review: Gradient Descent
  • Repeatedly step downhill until converged
  • Review: Training Neural Nets with Backprop
  • Backprop = chain rule plus dynamic programming
  • Line Search: How to take a step of good size?
  • L-BFGS: How to step in a better direction?
  • Stochastic Gradient Descent: How to go fast?
SLIDE 20

Breakout to Lab

Warning: Notation can be confusing

  • Alpha in these slides refers to step size (aka learning rate)
  • In sklearn's MLPClassifier, alpha refers to a different hyperparameter: the scalar strength of a small L2 penalty on the magnitudes of the weights
  • To set step size in sklearn: learning_rate_init=0.5
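A minimal sketch of setting both hyperparameters side by side. The tiny XOR-style dataset and the specific values are arbitrary illustrations:

```python
from sklearn.neural_network import MLPClassifier

# learning_rate_init is the step size from these slides;
# alpha is sklearn's L2 penalty strength -- a different knob entirely.
clf = MLPClassifier(
    solver="sgd",
    learning_rate_init=0.5,   # step size (the "alpha" of these slides)
    alpha=0.0001,             # L2 penalty strength (sklearn's "alpha")
    hidden_layer_sizes=(8,),
    max_iter=100,
    random_state=0,
)
clf.fit([[0, 0], [0, 1], [1, 0], [1, 1]], [0, 1, 1, 0])
```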
