  1. Adaptive Gradient Methods And Beyond
     Liangchen Luo, Peking University, Beijing
     luolc.witty@gmail.com
     March 2019

  2. From SGD to Adam
     ● SGD (Robbins & Monro, 1951)
       ○ + Momentum (Qian, 1999)
       ○ + Nesterov (Nesterov, 1983)
     ● AdaGrad (Duchi et al., 2011)
     ● RMSprop (Tieleman & Hinton, 2012)
     ● Adam (Kingma & Lei Ba, 2015)

  3. Stochastic Gradient Descent (Robbins & Monro, 1951)
     H. Robbins, S. Monro. A stochastic approximation method. Ann. Math. Stat., 1951.
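
The update rule shown on this slide did not survive extraction. As a reminder, vanilla SGD moves the parameters against a stochastic gradient, θ_{t+1} = θ_t − α g_t. A minimal NumPy sketch (function and variable names are illustrative, not from the slides):

```python
import numpy as np

def sgd_step(theta, grad, lr=0.1):
    """One vanilla SGD update: theta <- theta - lr * grad."""
    return theta - lr * grad

# Toy usage: minimize f(theta) = ||theta||^2, whose gradient is 2 * theta.
theta = np.array([1.0, -2.0])
for _ in range(200):
    theta = sgd_step(theta, 2 * theta, lr=0.1)  # exact gradient stands in for a stochastic one
print(theta)  # close to the minimizer [0, 0]
```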

  4. SGD with Momentum (Qian, 1999)
     In the original paper vs. the actual implementation in PyTorch (see the sketch below).
     Ning Qian. On the momentum term in gradient descent learning algorithms. Neural Networks, 1999.
     Figure source: https://www.willamette.edu/~gorr/classes/cs449/momrate.html
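
The two update rules were figures on the slide and are not in this transcript. The contrast the slide draws is the standard one: Qian's formulation scales only the gradient by the learning rate inside the velocity, whereas PyTorch's torch.optim.SGD accumulates raw gradients in a buffer and scales the whole buffer at update time. A sketch under that reading (names are illustrative):

```python
def momentum_step_qian(theta, v, grad, lr=0.01, mu=0.9):
    """Classical momentum: v <- mu * v - lr * grad; theta <- theta + v."""
    v = mu * v - lr * grad
    return theta + v, v

def momentum_step_pytorch_style(theta, buf, grad, lr=0.01, mu=0.9):
    """PyTorch-style momentum: buf <- mu * buf + grad; theta <- theta - lr * buf."""
    buf = mu * buf + grad
    return theta - lr * buf, buf
```

With a fixed learning rate the two variants are equivalent up to a rescaling of the velocity; they behave differently once the learning rate is changed during training.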

  5. Nesterov Accelerated Gradient (Nesterov, 1983)
     Y. Nesterov. A method for unconstrained convex minimization problem with the rate of convergence O(1/k^2). Doklady AN USSR, 1983.
     Figure source: http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf
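
The slide's formula is also missing from the transcript. In the form commonly used in deep learning, Nesterov momentum evaluates the gradient at the look-ahead point θ + μv rather than at θ. A sketch under that assumption (grad_fn is an illustrative callable returning the gradient at a point):

```python
def nesterov_step(theta, v, grad_fn, lr=0.01, mu=0.9):
    """Nesterov momentum: look ahead by the velocity, take the gradient there,
    then update the velocity and the parameters."""
    g = grad_fn(theta + mu * v)   # gradient at the look-ahead point
    v = mu * v - lr * g
    return theta + v, v
```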

  6. AdaGrad (Duchi et al., 2011)
     G_t is a diagonal matrix where each diagonal element G_{t,ii} is the sum of the squares of the gradients w.r.t. θ_i up to time step t.
     J. Duchi, E. Hazan, Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. JMLR, 2011.
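
The update itself was a figure on the slide. In the usual per-coordinate form it divides the step size by the square root of the accumulated squared gradients. A minimal sketch (eps and the names are illustrative):

```python
import numpy as np

def adagrad_step(theta, accum, grad, lr=0.01, eps=1e-8):
    """AdaGrad: accumulate squared gradients and scale each coordinate's
    step by 1 / sqrt(accumulated sum)."""
    accum = accum + grad ** 2
    theta = theta - lr * grad / (np.sqrt(accum) + eps)
    return theta, accum
```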

  7. RMSprop (Tieleman & Hinton, 2012)
     Use an exponential moving average instead of the sum used in AdaGrad.
     http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf
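
A sketch of that change relative to the AdaGrad step above; the decay rate rho and eps are illustrative defaults:

```python
import numpy as np

def rmsprop_step(theta, sq_avg, grad, lr=0.001, rho=0.9, eps=1e-8):
    """RMSprop: replace AdaGrad's running sum with an exponential moving
    average of squared gradients, so old gradients are gradually forgotten."""
    sq_avg = rho * sq_avg + (1.0 - rho) * grad ** 2
    theta = theta - lr * grad / (np.sqrt(sq_avg) + eps)
    return theta, sq_avg
```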

  8. Adam (Kingma & Lei Ba, 2015)
     Combines a momentum-style first-moment estimate with an RMSprop-style second-moment estimate, plus bias correction of both moments.
     D. P. Kingma, J. Lei Ba. Adam: A method for stochastic optimization. ICLR, 2015.
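
A compact sketch of the Adam update with the bias-correction terms the slide highlights; the hyperparameter defaults are the usual ones from the paper:

```python
import numpy as np

def adam_step(theta, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step (t is the 1-based step count)."""
    m = beta1 * m + (1.0 - beta1) * grad          # first-moment (momentum) estimate
    v = beta2 * v + (1.0 - beta2) * grad ** 2     # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)                # bias correction: m and v start at zero,
    v_hat = v / (1.0 - beta2 ** t)                # so early estimates are biased toward zero
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```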

  9. Adaptive Methods: Pros
     ● Faster training speed
     ● Smoother learning curve
     ● Easier to choose hyperparameters (Kingma & Lei Ba, 2015)
     ● Better when data are very sparse (Dean et al., 2012)

  10. Adaptive Methods: Cons
      ● Worse performance on unseen data, i.e. the dev/test set (Wilson et al., 2017)
      ● Convergence issue caused by non-decreasing learning rates (Reddi et al., 2018)
      ● Convergence issue caused by extreme learning rates (Luo et al., 2019)

  11. Worse Performance on Unseen Data (Wilson et al., 2017)
      The authors construct a binary classification example in which different algorithms, initialized from the same point, can find entirely different solutions; in particular, adaptive methods find a solution with worse out-of-sample error than SGD.
      A. C. Wilson, R. Roelofs, M. Stern, N. Srebro, B. Recht. The marginal value of adaptive gradient methods in machine learning. NeurIPS, 2017.

  12. (figure slide)

  13. (figure slide)

  14. Convergence Issue Caused by Non-Decreasing Learning Rates (Reddi et al., 2018)
      The following quantity is always greater than or equal to zero for SGD, but not necessarily for Adam and RMSprop, so their effective learning rates are not guaranteed to be non-increasing. The authors prove that this can result in undesirable convergence behavior in certain cases.
      S. J. Reddi, S. Kale, S. Kumar. On the convergence of Adam and beyond. ICLR, 2018.
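
The quantity the slide refers to (its formula was an image and is missing from this transcript) is, in the notation of Reddi et al. (2018):

$$\Gamma_{t+1} = \frac{\sqrt{V_{t+1}}}{\alpha_{t+1}} - \frac{\sqrt{V_{t}}}{\alpha_{t}}$$

It is positive semi-definite for SGD and AdaGrad, but can be indefinite for Adam and RMSprop.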

  15. Convergence Issue Caused by Extreme Learning Rates (Luo et al., 2019)
      The authors demonstrate that extremely large and extremely small learning rates appear when the model is close to convergence, and prove that this can lead to undesirable convergence behavior for Adam and RMSprop in certain cases, regardless of the initial step size.
      Liangchen Luo, Yuanhao Xiong, Yan Liu, Xu Sun. Adaptive gradient methods with dynamic bound of learning rate. ICLR, 2019.

  16. Proposals for Improvement
      ● AMSGrad (Reddi et al., 2018)
      ● AdaBound (Luo et al., 2019)
      ● AdaShift (Zhou et al., 2019)
      ● Padam (Chen & Gu, 2019)
      ● NosAdam (Huang et al., 2019)

  17. AMSGrad (Reddi et al., 2018)
      Guarantees a non-negative Γ_{t+1}, i.e. non-increasing effective learning rates.
      S. J. Reddi, S. Kale, S. Kumar. On the convergence of Adam and beyond. ICLR, 2018.
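
AMSGrad's change to Adam is a one-line modification: track the running maximum of the second-moment estimate and use that maximum in the denominator, so the per-coordinate step size can only shrink. A sketch built on the adam_step above (names are illustrative; bias-correction details vary across implementations, and this sketch corrects only the first moment):

```python
import numpy as np

def amsgrad_step(theta, m, v, v_max, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """AMSGrad: like Adam, but normalize by the running max of the
    second-moment estimate so the effective learning rate never increases."""
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    v_max = np.maximum(v_max, v)                  # the key difference from Adam
    m_hat = m / (1.0 - beta1 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_max) + eps)
    return theta, m, v, v_max
```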

  18. (figure slide)

  19. (figure slide) Figure source: https://fdlm.github.io/post/amsgrad/

  20. AdaBound (Luo et al., 2019)
      Consider applying the following operation in Adam, which clips the learning rate element-wise so that the output is constrained to lie in [η_l, η_u]. It follows that SGD(M) and Adam can be viewed as the two extreme cases:
      ● For SGD(M): η_l = η_u = α* (a constant learning rate)
      ● For Adam: η_l = 0, η_u = ∞ (no bound at all)
      Liangchen Luo, Yuanhao Xiong, Yan Liu, Xu Sun. Adaptive gradient methods with dynamic bound of learning rate. ICLR, 2019.
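
A sketch of the clipping operation the slide refers to: the Adam-style per-coordinate step size α/√v is clipped into [η_l, η_u] before being applied (names and eps are illustrative):

```python
import numpy as np

def bounded_step_size(lr, v, eta_l, eta_u, eps=1e-8):
    """Element-wise clip of the Adam-style step size lr / sqrt(v) into [eta_l, eta_u]."""
    return np.clip(lr / (np.sqrt(v) + eps), eta_l, eta_u)

# eta_l == eta_u == alpha_star recovers an SGD-style constant step;
# eta_l == 0 with eta_u == np.inf leaves the Adam step untouched.
```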

  21. Applying Bounds on Learning Rates
      Employ η_l(t) and η_u(t) as functions of t instead of constant lower and upper bounds, where
      ● η_l(t) is an increasing function that starts from 0 and converges to α* asymptotically;
      ● η_u(t) is a decreasing function that starts from ∞ and converges to α* asymptotically.
      (See the sketch below for one concrete choice of bound functions.)
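
One concrete pair of schedules with these properties; the functional form and the gamma parameter are illustrative choices, not necessarily the ones in the paper:

```python
def bound_functions(step, final_lr=0.1, gamma=1e-3):
    """Example bound schedules: eta_l rises from 0 toward final_lr,
    eta_u falls from +inf toward final_lr, so the clipped step
    transitions smoothly from Adam-like to SGD-like behavior."""
    eta_l = final_lr * (1.0 - 1.0 / (gamma * step + 1.0))
    eta_u = final_lr * (1.0 + 1.0 / (gamma * step))
    return eta_l, eta_u

# At step = 1 the interval is wide (eta_l near 0, eta_u large);
# as step grows, both bounds approach final_lr and the update degenerates to SGD(M).
print(bound_functions(1))        # roughly (0.0001, 100.1) -> almost unconstrained Adam
print(bound_functions(100000))   # both close to 0.1 -> essentially SGD with lr = 0.1
```

Plugging these into the clipping operation from the previous slide gives an optimizer that behaves like Adam early in training and gradually transitions toward SGD(M).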

  22. (figure slide)

  23. (figure slide)

  24. (figure slide)

  25. (figure slide)

  26. The Robustness of AdaBound

  27. (figure slide)

  28. (figure slide)

  29. The Limitations of AdaBound
      ● Based on the assumption that SGD would perform better than Adam w.r.t. the final generalization ability
      ● The form of the bound functions
        ○ gamma as a function of the expected global step?
        ○ other functions?
      ● Fixed final learning rate
        ○ how to determine the final learning rate automatically?

  30. Any questions?
