SLIDE 1

Adaptive Gradient Methods And Beyond

Liangchen Luo

Peking University, Beijing luolc.witty@gmail.com

March, 2019

SLIDE 2

From SGD to Adam

  • SGD (Robbins & Monro, 1951)

○ + Momentum (Qian, 1999)
○ + Nesterov (Nesterov, 1983)

  • AdaGrad (Duchi et al., 2011)
  • RMSprop (Tieleman & Hinton, 2012)
  • Adam (Kingma & Lei Ba, 2015)

SLIDE 3

Stochastic Gradient Descent (Robbins & Monro, 1951)
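For reference, the plain SGD update, writing α_t for the step size and g_t for a stochastic gradient of the loss f_t at θ_{t-1} (generic notation, not necessarily the slide's symbols):

    g_t = \nabla_{\theta} f_t(\theta_{t-1}), \qquad \theta_t = \theta_{t-1} - \alpha_t \, g_t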


  • H. Robbins, S. Monro. A stochastic approximation method. Ann. Math. Stat., 1951.
SLIDE 4

SGD with Momentum (Qian, 1999)

In the original paper:

Actual implementation in PyTorch:
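The slide compares the two formulations above; a minimal Python sketch of each, with illustrative names (lr, mu, grad) rather than the PyTorch source:

    def momentum_step_paper(theta, v, grad, lr=0.01, mu=0.9):
        # Original formulation (Qian, 1999): the learning rate scales only the
        # gradient term inside the velocity update.
        #   v_t = mu * v_{t-1} - lr * g_t ;  theta_t = theta_{t-1} + v_t
        v = mu * v - lr * grad
        return theta + v, v

    def momentum_step_pytorch_style(theta, buf, grad, lr=0.01, mu=0.9):
        # PyTorch-style formulation: the buffer accumulates raw gradients and
        # the learning rate scales the whole step.
        #   b_t = mu * b_{t-1} + g_t ;  theta_t = theta_{t-1} - lr * b_t
        buf = mu * buf + grad
        return theta - lr * buf, buf

The two variants trace the same trajectory up to a rescaling of the momentum buffer, but their hyperparameters are not directly interchangeable.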

  • N. Qian. On the momentum term in gradient descent learning algorithms. Neu. Net., 1999.

Figure source: https://www.willamette.edu/~gorr/classes/cs449/momrate.html

SLIDE 5

Nesterov Accelerated Gradient (Nesterov, 1983)
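For reference, a common momentum-style form of the Nesterov accelerated gradient, with μ the momentum coefficient (generic notation):

    v_t = \mu v_{t-1} - \alpha \nabla f(\theta_{t-1} + \mu v_{t-1}), \qquad \theta_t = \theta_{t-1} + v_t

The only change from classical momentum is that the gradient is evaluated at the look-ahead point θ_{t-1} + μ v_{t-1}.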


  • Y. Nesterov. A method for unconstrained convex minimization problem with the rate of convergence O(1/k^2). Doklady AN USSR, 1983.

Figure source: http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf

SLIDE 6

AdaGrad (Duchi et al., 2011)

G_t is a diagonal matrix where each diagonal element G_{t,ii} is the sum of the squares of the gradients w.r.t. θ_i up to time step t.
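In element-wise form, the AdaGrad update (ε is a small smoothing constant; generic notation):

    \theta_{t+1,i} = \theta_{t,i} - \frac{\alpha}{\sqrt{G_{t,ii} + \epsilon}} \, g_{t,i}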


  • J. Duchi, E. Hazan, Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. JMLR, 2011.
SLIDE 7

RMSprop (Tieleman & Hinton, 2012)

Use an exponential moving average instead of the sum used in AdaGrad.
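Concretely, with decay rate γ (typically 0.9; generic notation):

    E[g^2]_t = \gamma \, E[g^2]_{t-1} + (1 - \gamma) \, g_t^2, \qquad \theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{E[g^2]_t + \epsilon}} \, g_t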

http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf

SLIDE 8

Adam (Kingma & Lei Ba, 2015)


bias correction
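The "bias correction" annotation refers to the m̂_t and v̂_t terms; for reference, the full Adam update (β₁, β₂ are the exponential decay rates; generic notation):

    m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2
    \hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} \quad \text{(bias correction)}
    \theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{\hat{v}_t} + \epsilon} \, \hat{m}_t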

  • D. P. Kingma, J. Lei Ba. Adam: A method for stochastic optimization. ICLR, 2015.
SLIDE 9

Adaptive Methods: Pros

  • Faster training speed
  • Smoother learning curve
  • Easier to choose hyperparameters (Kingma & Lei Ba, 2015)
  • Better when data are very sparse (Dean et al., 2012)

SLIDE 10

Adaptive Methods: Cons

  • Worse performance on unseen data (viz. dev/test set; Wilson et al., 2017)
  • Convergence issue caused by non-decreasing learning rates (Reddi et al., 2018)
  • Convergence issue caused by extreme learning rates (Luo et al., 2019)

SLIDE 11

Worse Performance on Unseen Data (Wilson et al., 2017)

The authors construct a binary classification example where different algorithms can find entirely different solutions when initialized from the same point; in particular, the adaptive methods find a solution with worse out-of-sample error than SGD.


  • A. C. Wilson, R. Roelofs, M. Stern, N. Srebro, B. Recht. The marginal value of adaptive gradient methods in machine learning. NeurIPS, 2017.
SLIDE 12

SLIDE 13

SLIDE 14

Convergence Issue Caused by Non-Decreasing Learning Rates (Reddi et al., 2018)

The following quantity is always greater than or equal to zero for SGD, but this is not necessarily the case for Adam and RMSprop, which translates to non-decreasing learning rates. The authors prove that this can result in undesirable convergence behavior in certain cases.
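The quantity in question, with V_t the second-moment estimate and α_t the step size (notation generic, following Reddi et al., 2018 up to symbols):

    \Gamma_{t+1} = \frac{\sqrt{V_{t+1}}}{\alpha_{t+1}} - \frac{\sqrt{V_t}}{\alpha_t}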


  • S. J. Reddi, S. Kale, S. Kumar. On the convergence of Adam and beyond. ICLR, 2018.
SLIDE 15

Convergence Issue Caused by Extreme Learning Rates (Luo et al., 2019)


The authors demonstrate the existence of extreme learning rates when the model is close to convergence, and prove that this can lead to undesirable convergence behavior for Adam and RMSprop in certain cases, regardless of the initial step size.

  • L. Luo, Y. Xiong, Y. Liu, X. Sun. Adaptive gradient methods with dynamic bound of learning rate. ICLR, 2019.

SLIDE 16

Proposals for Improvement

  • AMSGrad (Reddi et al., 2018)
  • AdaBound (Luo et al., 2019)
  • AdaShift (Zhou et al., 2019)
  • Padam (Chen & Gu, 2019)
  • NosAdam (Huang et al., 2019)

SLIDE 17

AMSGrad (Reddi et al., 2018)


  • Guarantee a non-negative Γ_t, viz. non-increasing learning rates (see the sketch below)
  • S. J. Reddi, S. Kale, S. Kumar. On the convergence of Adam and beyond. ICLR, 2018.
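A sketch of the AMSGrad modification: the denominator uses a running element-wise maximum of the second-moment estimate, which keeps the effective learning rate from increasing (generic notation):

    \hat{v}_t = \max(\hat{v}_{t-1}, v_t), \qquad \theta_{t+1} = \theta_t - \frac{\alpha_t}{\sqrt{\hat{v}_t}} \, m_t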
SLIDE 18

SLIDE 19

Figure source: https://fdlm.github.io/post/amsgrad/

SLIDE 20

AdaBound (Luo et al., 2019)

Consider applying the following operation in Adam, which clips the learning rate element-wise such that the output is constrained to lie in [η_l, η_u].


It follows that SGD(M) and Adam can be considered as special cases of this formulation: for SGD(M), η_l = η_u = α* (a constant); for Adam, η_l = 0 and η_u = ∞.
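A minimal NumPy sketch of the clipped step; names (lr, eta_l, eta_u, etc.) are illustrative, and bias correction is omitted for brevity, so this shows the mechanics rather than the paper's exact algorithm:

    import numpy as np

    def clipped_adam_step(theta, m, v, grad, lr=1e-3, beta1=0.9, beta2=0.999,
                          eps=1e-8, eta_l=0.0, eta_u=np.inf):
        # Illustrative sketch; not the authors' implementation.
        m = beta1 * m + (1 - beta1) * grad        # first-moment estimate
        v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment estimate
        # Element-wise learning rate, clipped to [eta_l, eta_u].
        # eta_l = eta_u = const recovers SGD(M); eta_l = 0, eta_u = inf recovers Adam.
        step_size = np.clip(lr / (np.sqrt(v) + eps), eta_l, eta_u)
        return theta - step_size * m, m, v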

  • L. Luo, Y. Xiong, Y. Liu, X. Sun. Adaptive gradient methods with dynamic bound of learning rate. ICLR, 2019.

SLIDE 21

Employ η_l(t) and η_u(t) as functions of t instead of constant lower and upper bounds, where

  • η_l(t) is an increasing function that starts from 0 and converges to α* asymptotically;
  • η_u(t) is a decreasing function that starts from ∞ and converges to α* asymptotically.


Applying the bound on learning rates:
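A sketch of the bounded step together with example bound functions satisfying the properties above; the specific functional forms and the parameter γ (how fast the bounds tighten) are illustrative assumptions, not necessarily the paper's exact choice:

    \hat{\eta}_t = \mathrm{Clip}\!\left(\frac{\alpha}{\sqrt{V_t}},\; \eta_l(t),\; \eta_u(t)\right), \qquad \theta_{t+1} = \theta_t - \hat{\eta}_t \odot m_t
    \eta_l(t) = \alpha^{*}\!\left(1 - \frac{1}{\gamma t + 1}\right), \qquad \eta_u(t) = \alpha^{*}\!\left(1 + \frac{1}{\gamma t}\right)

Here α* is the final (SGD-like) learning rate; as t grows, both bounds converge to α* and the method smoothly transitions from Adam-like to SGD(M)-like behavior.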

SLIDE 22

SLIDE 23

SLIDE 24

SLIDE 25

SLIDE 26

The Robustness of AdaBound

SLIDE 27

SLIDE 28

SLIDE 29

The Limitations of AdaBound

  • Based on the assumption that SGD would perform better than Adam w.r.t. the final generalization ability
  • The form of the bound functions
○ gamma as a function of the expected global step?
○ other functions?
  • Fixed final learning rate
○ how to determine the final learning rate automatically?

SLIDE 30

Any questions?