Adaptive Gradient Methods And Beyond
Liangchen Luo
Peking University, Beijing luolc.witty@gmail.com
March, 2019
From SGD to Adam
○ SGD (Robbins & Monro, 1951)
○ + Momentum (Qian, 1999)
○ + Nesterov (Nesterov, 1983)
○ AdaGrad
Actual implementation in PyTorch: the velocity accumulates the raw gradient, and the learning rate is applied only when the parameters are updated.
In the original paper (Qian, 1999): the learning rate is folded into the velocity, which is then added directly to the parameters.
Ning Qian. On the momentum term in gradient descent learning algorithms. Neural Networks, 1999.
Figure source: https://www.willamette.edu/~gorr/classes/cs449/momrate.html
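A minimal sketch contrasting the two formulations (the gradient function `grad` and the hyperparameter values below are illustrative placeholders, not taken from the slides):

```python
import numpy as np

def momentum_original(theta, grad, alpha=0.01, mu=0.9, steps=100):
    # Qian (1999) style: the learning rate is folded into the velocity,
    # and the velocity is added to the parameters.
    v = np.zeros_like(theta)
    for _ in range(steps):
        v = mu * v - alpha * grad(theta)
        theta = theta + v
    return theta

def momentum_pytorch_style(theta, grad, alpha=0.01, mu=0.9, steps=100):
    # PyTorch style: the velocity accumulates raw gradients,
    # and the learning rate is applied only at the parameter update.
    v = np.zeros_like(theta)
    for _ in range(steps):
        v = mu * v + grad(theta)
        theta = theta - alpha * v
    return theta
```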
Figure source: http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf
AdaGrad: G_t is a diagonal matrix where each diagonal element G_{t,ii} is the sum of the squares of the gradients w.r.t. θ_i up to time step t.
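A minimal per-coordinate sketch of the AdaGrad update (the gradient function, ε, and the step size are illustrative placeholders):

```python
import numpy as np

def adagrad(theta, grad, alpha=0.01, eps=1e-8, steps=100):
    G = np.zeros_like(theta)                      # running sum of squared gradients
    for _ in range(steps):
        g = grad(theta)
        G = G + g ** 2                            # accumulates forever, never decays
        theta = theta - alpha * g / (np.sqrt(G) + eps)
    return theta
```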
RMSprop: use an exponential moving average of the squared gradients instead of the sum used in AdaGrad.
Source: http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf
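A sketch of the change relative to the AdaGrad snippet above (β and the other values are placeholders):

```python
import numpy as np

def rmsprop(theta, grad, alpha=0.001, beta=0.9, eps=1e-8, steps=100):
    v = np.zeros_like(theta)                      # EMA of squared gradients
    for _ in range(steps):
        g = grad(theta)
        v = beta * v + (1 - beta) * g ** 2        # exponential moving average, not a sum
        theta = theta - alpha * g / (np.sqrt(v) + eps)
    return theta
```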
Adam: combine momentum with RMSprop-style second-moment scaling, and apply bias correction to the first- and second-moment estimates.
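A sketch of the Adam update with the bias-correction step made explicit (the hyperparameter values are the commonly used defaults, shown here as placeholders):

```python
import numpy as np

def adam(theta, grad, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=100):
    m = np.zeros_like(theta)                      # first moment (EMA of gradients)
    v = np.zeros_like(theta)                      # second moment (EMA of squared gradients)
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)              # bias correction: m and v start at zero,
        v_hat = v / (1 - beta2 ** t)              # so early estimates are biased toward zero
        theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta
```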
The authors (Wilson et al., 2017) construct a binary classification example where different algorithms find entirely different solutions when initialized from the same point; in particular, the adaptive methods find a solution with worse out-of-sample error than SGD.
The following quantity is always greater than or equal to zero for SGD, while this is not necessarily the case for Adam and RMSprop, which means the effective learning rate can increase over time in some cases.
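Presumably the quantity in question is the one from the AMSGrad analysis (Reddi et al., On the Convergence of Adam and Beyond, ICLR 2018):

Γ_{t+1} = √V_{t+1} / α_{t+1} − √V_t / α_t

Γ_{t+1} ⪰ 0 means the effective per-coordinate learning rate α_t / √V_t does not increase from step t to t+1; for Adam and RMSprop, Γ_{t+1} can be indefinite, so the effective learning rate may grow.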
The authors demonstrate the existence of extreme learning rates when the model is close to convergence, and prove that this can lead to undesirable convergence behavior for Adam and RMSprop in certain cases, regardless of the initial step size.
Liangchen Luo, Yuanhao Xiong, Yan Liu, Xu Sun. Adaptive gradient methods with dynamic bound of learning rate. ICLR, 2019.
AMSGrad: keep the element-wise maximum of past second-moment estimates, v̂_t = max(v̂_{t−1}, v_t), and use v̂_t in the denominator, which guarantees a non-negative Γ_t.
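A sketch of the AMSGrad modification relative to the Adam snippet above (bias correction omitted, as is common in statements of AMSGrad; hyperparameters are placeholder defaults):

```python
import numpy as np

def amsgrad(theta, grad, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=100):
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    v_hat = np.zeros_like(theta)                  # running element-wise maximum of v
    for _ in range(steps):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        v_hat = np.maximum(v_hat, v)              # the denominator can never shrink,
        theta = theta - alpha * m / (np.sqrt(v_hat) + eps)   # so the step size never grows
    return theta
```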
Figure source: https://fdlm.github.io/post/amsgrad/
Consider applying the following operation in Adam, which clips the learning rate element-wise so that the output is constrained to lie in [η_l, η_u]: Clip(α/√V_t, η_l, η_u).
It follows that SGD(M) and Adam can be regarded as the two extreme cases of this operation. For SGD(M): η_l = η_u = α* (a constant learning rate). For Adam: η_l = 0 and η_u = +∞ (no bound at all).
Liangchen Luo, Yuanhao Xiong, Yan Liu, Xu Sun. Adaptive gradient methods with dynamic bound of learning rate. ICLR, 2019.
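A small numeric illustration of the clipping operation and its two extremes (the values of α, V_t, and the bounds are made up for the example):

```python
import numpy as np

def clipped_step(alpha, v, eta_l, eta_u, eps=1e-8):
    """Clip(alpha / sqrt(v), eta_l, eta_u), applied element-wise."""
    return np.clip(alpha / (np.sqrt(v) + eps), eta_l, eta_u)

v = np.array([1e-8, 1.0, 1e8])                    # per-coordinate second-moment estimates
print(clipped_step(0.1, v, 0.0, np.inf))          # Adam: no effective bound
print(clipped_step(0.1, v, 0.01, 0.01))           # SGD(M): every coordinate steps by 0.01
print(clipped_step(0.1, v, 0.001, 1.0))           # in between: extreme rates are tamed
```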
Employ η_l(t) and η_u(t) as functions of t instead of constant lower and upper bounds, where η_l(t) is non-decreasing, starts from 0, and converges to α*, and η_u(t) is non-increasing, starts from +∞, and converges to α*.
The resulting variants, AdaBound (from Adam) and AMSBound (from AMSGrad), apply this dynamic bound on the learning rates at every update step.
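A minimal sketch of an Adam-style update with such dynamic bounds (the specific bound functions and the knob `gamma` controlling how fast they converge to the final rate α* are illustrative choices, not copied from the slides):

```python
import numpy as np

def bounded_adam(theta, grad, alpha=0.001, alpha_star=0.1, gamma=1e-3,
                 beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        eta_l = alpha_star * (1 - 1 / (gamma * t + 1))   # starts near 0, rises to alpha_star
        eta_u = alpha_star * (1 + 1 / (gamma * t))       # starts huge, falls to alpha_star
        step = np.clip(alpha / (np.sqrt(v) + eps), eta_l, eta_u)
        theta = theta - step * m                         # behaves like Adam early, SGD(M) late
    return theta
```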
○ generalization ability
○ gamma as a function of the expected global step?
○ how to determine the final learning rate automatically?