CS 6316 Machine Learning
Gradient Descent
Yangfeng Ji
Department of Computer Science University of Virginia
Overview
1. Gradient Descent
2. Stochastic Gradient Descent
3. SGD with Momentum
4. Adaptive Learning Rates

1 Gradient Descent
Consider a model with parameter vector $\theta$ and a training set of $m$ examples. Batch gradient descent updates $\theta$ by moving against the average gradient of the per-example losses:

$$\theta' = \theta - \frac{\alpha}{m} \sum_{i=1}^{m} \nabla_\theta \ell(\theta; x^{(i)}, y^{(i)})$$

where $\alpha$ is the learning rate.
Figure: gradient descent trajectories on the example function $f(x) = x_1^2 + 10 x_2^2$.
Gradient descent can be derived from a local quadratic approximation of $f$ around the current point $\theta$:

$$f(\theta') \approx f(\theta) + \nabla f(\theta)^{T}(\theta' - \theta) + \frac{1}{2\eta}(\theta' - \theta)^{T}(\theta' - \theta)$$

Minimizing the right-hand side with respect to $\theta'$ gives

$$\frac{\partial f(\theta')}{\partial \theta'} \approx \nabla f(\theta) + \frac{1}{\eta}(\theta' - \theta) = 0 \quad\Rightarrow\quad \theta' = \theta - \eta \cdot \nabla f(\theta) \tag{16}$$
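As a minimal sketch of the update rule in Eq. (16) in Python (the step size and iteration count are illustrative assumptions, not values from the slides):

```python
import numpy as np

def gradient_descent(grad_f, theta0, eta=0.05, num_steps=200):
    """Repeatedly apply theta' = theta - eta * grad_f(theta)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(num_steps):
        theta = theta - eta * grad_f(theta)
    return theta

# Example with the slides' test function f(x) = x1^2 + 10*x2^2
grad_f = lambda x: np.array([2.0 * x[0], 20.0 * x[1]])
print(gradient_descent(grad_f, [1.0, 1.0]))  # converges toward (0, 0)
```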
Figure: gradient descent on $f(\theta) = (\theta_1^2 + \theta_2^2)/2$ with different learning rates: (a) too small, (b) too large, (c) just right.
One principled way to pick the step size is exact line search: choose the step that minimizes $f$ along the negative gradient direction,

$$\eta^{(t)} = \arg\min_{s \ge 0} f\big(\theta^{(t)} - s \nabla f(\theta^{(t)})\big)$$
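A crude numerical approximation of this line search evaluates $f$ along the ray for a grid of candidate step sizes and keeps the best one (a sketch; the grid of candidates is an illustrative choice):

```python
import numpy as np

def grid_line_search(f, theta, grad, candidates=np.logspace(-4, 0, 50)):
    """Approximate argmin over s >= 0 of f(theta - s * grad) on a grid of s values."""
    losses = [f(theta - s * grad) for s in candidates]
    return candidates[int(np.argmin(losses))]
```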
2 Stochastic Gradient Descent

Given a training set $\{(x^{(i)}, y^{(i)})\}_{i=1}^{m}$, the loss function is

$$L(\theta) = \frac{1}{m}\sum_{i=1}^{m} \ell(\theta; x^{(i)}, y^{(i)})$$

Computing the full gradient of $L(\theta)$ requires a pass over all $m$ examples. Stochastic gradient descent (SGD) instead updates the parameters with the gradient of a single randomly sampled example (or a small mini-batch) at each step.
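A minimal sketch of SGD for a linear model with squared-error loss (the model, loss, and learning rate are illustrative assumptions, not from the slides):

```python
import numpy as np

def sgd(X, y, eta=0.01, num_epochs=10, seed=0):
    """SGD for linear least squares: loss(theta) = mean over i of (x_i . theta - y_i)^2 / 2."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(num_epochs):
        for i in rng.permutation(len(y)):          # visit examples in random order
            grad_i = (X[i] @ theta - y[i]) * X[i]  # gradient of one example's loss
            theta -= eta * grad_i                  # update from a single example
    return theta
```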
SGD is particularly attractive for
◮ learning with large-scale data
◮ models with lots of parameters
To guarantee convergence, the sequence of learning rates $\eta^{(t)}$ should satisfy the standard stochastic approximation conditions (Bottou, 1998):

$$\sum_{t=1}^{\infty} \eta^{(t)} = \infty, \qquad \sum_{t=1}^{\infty} \big(\eta^{(t)}\big)^{2} < \infty$$
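For example, a schedule of the form $\eta^{(t)} = \eta_0 / t$ satisfies both conditions, since the harmonic series diverges while $\sum_t 1/t^2$ converges. A minimal sketch ($\eta_0$ is an illustrative choice):

```python
def lr_schedule(t, eta0=1.0):
    """eta_t = eta0 / t: the sum of eta_t diverges while the sum of eta_t^2 converges."""
    return eta0 / t

# Usage inside an SGD loop (see the sketch above): at step t,
#   theta -= lr_schedule(t) * grad_i
```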
3 SGD with Momentum

Figure: the effect of momentum in SGD is to reduce the fluctuation of the updates: (a) SGD without momentum; (b) SGD with momentum. (Credit: Genevieve B. Orr)
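The slides' exact momentum equations are not recoverable from this extraction, so the sketch below follows the classical formulation $v \leftarrow \mu v - \eta \nabla f(\theta)$, $\theta \leftarrow \theta + v$, where $\mu$ is the momentum coefficient; the hyperparameter values are illustrative:

```python
import numpy as np

def sgd_momentum(grad_f, theta0, eta=0.01, mu=0.9, num_steps=500):
    """SGD with classical momentum: v is a decaying accumulation of past gradients."""
    theta = np.asarray(theta0, dtype=float)
    v = np.zeros_like(theta)
    for _ in range(num_steps):
        v = mu * v - eta * grad_f(theta)  # decayed velocity minus the new gradient step
        theta = theta + v
    return theta
```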
Figure: momentum updates on the example function $f(x) = x_1^2 + 10 x_2^2$. Note: the arrows show the opposite direction of the gradient.
4 Adaptive Learning Rates

AdaGrad (Duchi et al., 2011) adapts the learning rate of each parameter $\theta_k$ separately. Let $g_k^{(i)}$ denote the $k$-th component of the gradient at step $i$. AdaGrad accumulates the squared gradients,

$$r_k^{(t)} = \sum_{i=1}^{t} \big(g_k^{(i)}\big)^{2}$$

and scales each coordinate's update by the inverse square root of this sum:

$$\theta_k^{(t)} = \theta_k^{(t-1)} - \frac{\eta_0}{\sqrt{r_k^{(t)}} + \epsilon}\, g_k^{(t)}$$
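A per-coordinate sketch of the AdaGrad update in code ($\eta_0 = 0.1$ and $\epsilon = 10^{-8}$ are illustrative defaults):

```python
import numpy as np

def adagrad(grad_f, theta0, eta0=0.1, eps=1e-8, num_steps=500):
    """AdaGrad: scale each coordinate by the root of its accumulated squared gradients."""
    theta = np.asarray(theta0, dtype=float)
    r = np.zeros_like(theta)
    for _ in range(num_steps):
        g = grad_f(theta)
        r += g ** 2                             # accumulate squared gradients
        theta -= eta0 / (np.sqrt(r) + eps) * g  # per-coordinate scaled step
    return theta
```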
A drawback of AdaGrad is that $r_k^{(t)}$ only grows, so the effective learning rate can shrink too quickly. RMSProp (Hinton et al., 2012) replaces the cumulative sum with an exponentially decaying average, so that old gradients are gradually forgotten:

$$r_k^{(t)} = \rho\, r_k^{(t-1)} + (1 - \rho)\big(g_k^{(t)}\big)^{2}$$

$$\theta_k^{(t)} = \theta_k^{(t-1)} - \frac{\eta_0}{\sqrt{r_k^{(t)}} + \epsilon}\, g_k^{(t)}$$
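Relative to the AdaGrad sketch, the only change is the accumulator line (a sketch; $\rho = 0.9$ is the usual default):

```python
import numpy as np

def rmsprop(grad_f, theta0, eta0=0.01, rho=0.9, eps=1e-8, num_steps=500):
    """RMSProp: like AdaGrad, but with a decaying average of squared gradients."""
    theta = np.asarray(theta0, dtype=float)
    r = np.zeros_like(theta)
    for _ in range(num_steps):
        g = grad_f(theta)
        r = rho * r + (1 - rho) * g ** 2        # forget old gradients geometrically
        theta -= eta0 / (np.sqrt(r) + eps) * g
    return theta
```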
Adam (Kingma and Ba, 2014) combines momentum with RMSProp-style adaptive scaling, adding a bias correction for the two moving averages:

$$v_k^{(t)} = \mu\, v_k^{(t-1)} + (1 - \mu)\, g_k^{(t-1)} \tag{41}$$

$$r_k^{(t)} = \rho\, r_k^{(t-1)} + (1 - \rho)\big[g_k^{(t-1)}\big]^{2} \tag{42}$$

$$\hat{v}_k^{(t)} = \frac{v_k^{(t)}}{1 - \mu^{t}} \tag{43}$$

$$\hat{r}_k^{(t)} = \frac{r_k^{(t)}}{1 - \rho^{t}} \tag{44}$$

$$\theta_k^{(t)} = \theta_k^{(t-1)} - \frac{\eta_0\, \hat{v}_k^{(t)}}{\sqrt{\hat{r}_k^{(t)}} + \epsilon} \tag{45}$$

The default values of $\mu$ and $\rho$ are 0.9 and 0.999, respectively.
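Equations (41)–(45) translate almost line by line into code (a sketch; $\mu$ and $\rho$ use the defaults above, while $\eta_0$ and $\epsilon$ are illustrative choices):

```python
import numpy as np

def adam(grad_f, theta0, eta0=0.001, mu=0.9, rho=0.999, eps=1e-8, num_steps=1000):
    """Adam: momentum (v) plus an adaptive scale (r), both bias-corrected."""
    theta = np.asarray(theta0, dtype=float)
    v = np.zeros_like(theta)  # first-moment estimate, Eq. (41)
    r = np.zeros_like(theta)  # second-moment estimate, Eq. (42)
    for t in range(1, num_steps + 1):
        g = grad_f(theta)
        v = mu * v + (1 - mu) * g
        r = rho * r + (1 - rho) * g ** 2
        v_hat = v / (1 - mu ** t)    # bias correction, Eq. (43)
        r_hat = r / (1 - rho ** t)   # bias correction, Eq. (44)
        theta -= eta0 * v_hat / (np.sqrt(r_hat) + eps)  # update, Eq. (45)
    return theta
```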
References
Bottou, L. (1998). Online learning and stochastic approximations. On-Line Learning in Neural Networks, 17(9):142.
Bottou, L., Curtis, F. E., and Nocedal, J. (2018). Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311.
Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159.
Hinton, G., Srivastava, N., and Swersky, K. (2012). Neural networks for machine learning, lecture 6a: Overview of mini-batch gradient descent.
Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
LeCun, Y. A., Bottou, L., Orr, G. B., and Müller, K.-R. (2012). Efficient backprop. In Neural Networks: Tricks of the Trade, pages 9–48. Springer.