Training DNNs: Basic Methods
Ju Sun
Computer Science & Engineering, University of Minnesota, Twin Cities
March 3, 2020
Supervised learning as function approximation. Underlying true function: f_0. Training data: {(x_i, y_i)}, with y_i ≈ f_0(x_i).
Learning: min_{f∈H} ∑_i ℓ(y_i, f(x_i)) over a hypothesis class H; for DNNs, H is parameterized by the network weights W, so the optimization is over W.
(Figure credit: aria42.com)
(Figure credit: [Baydin et al., 2017])
(Figure credit: Stanford CS231N)
Sigmoid activation: σ(x) = 1/(1 + e^{−x}).
Output activations: a sigmoid 1/(1 + e^{−x}) for binary labels; for n-class labels encoded as one-hot vectors (0, …, 0, 1, 0, …, 0) (with n−k 0's after the 1 for class k), a softmax output z ↦ e^{z_p} / ∑_j e^{z_j}.
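A minimal sketch (an illustration, not from the slides; the logits and shapes are made up) of the softmax map and a one-hot label, using PyTorch's built-in torch.softmax:

import torch

z = torch.tensor([2.0, 0.5, -1.0])   # logits z from the last layer (3 classes)
p = torch.softmax(z, dim=0)          # p_j = exp(z_j) / sum_j exp(z_j)
print(p, p.sum())                    # entries in (0, 1), summing to 1

y = torch.tensor([0.0, 1.0, 0.0])    # one-hot label: a single 1, the rest 0's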
Choice of loss ℓ:
– Regression: ℓ_2 (common, torch.nn.MSELoss), or ℓ_1 (for robustness, e.g. torch.nn.L1Loss).
– Classification: ℓ_2 or cross-entropy, ℓ(y, ŷ) = −∑_i y_i log ŷ_i, with one-hot labels (a single 1 and n−k 0's); label smoothing replaces the 0's with small ε's (n−k ε's).
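A minimal usage sketch (assumed, not from the slides; shapes and data are arbitrary) of the loss choices above with PyTorch's built-in criteria:

import torch
import torch.nn as nn

pred = torch.randn(4, 3)                    # network outputs for 4 samples
target = torch.randn(4, 3)                  # regression targets
mse = nn.MSELoss()(pred, target)            # l2 loss (common)
l1 = nn.L1Loss()(pred, target)              # l1 loss (more robust to outliers)

logits = torch.randn(4, 3)                  # unnormalized class scores
labels = torch.tensor([0, 2, 1, 0])         # class indices for 4 samples
ce = nn.CrossEntropyLoss()(logits, labels)  # softmax + cross-entropy in one call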
Training is empirical risk minimization:
min_W (1/m) ∑_{i=1}^m ℓ(y_i, DNN_W(x_i)) → E_{x,y} ℓ(y, DNN_W(x)),
i.e., the average loss over the m training samples approximates the expected loss over the data distribution.
The same sample-average approximation holds for derivatives:
(1/m) ∑_{i=1}^m ∇_W ℓ(y_i, DNN_W(x_i)) → E_{x,y} ∇_W ℓ(y, DNN_W(x)), and a mini-batch average (1/|J|) ∑_{j∈J} ∇_W ℓ(y_j, DNN_W(x_j)) estimates the full gradient;
(1/m) ∑_{i=1}^m ∇²_W ℓ(y_i, DNN_W(x_i)) → E_{x,y} ∇²_W ℓ(y, DNN_W(x)), and (1/|J|) ∑_{j∈J} ∇²_W ℓ(y_j, DNN_W(x_j)) estimates the full Hessian.
Mini-batch stochastic gradient descent (SGD): write the objective as (1/m) ∑_{i=1}^m f(w; ξ_i) over samples ξ_i. At step k, draw a random batch J_k ⊂ {1, …, m} and update
w_{k+1} = w_k − t_k (1/|J_k|) ∑_{j∈J_k} ∇_w f(w_k; ξ_j).
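A minimal numpy sketch (an assumption, not the slides' code) of one mini-batch SGD step for a generic per-sample loss f(w; ξ); grad_sample and data are placeholders the caller supplies:

import numpy as np

def sgd_step(w, grad_sample, data, batch_size, step):
    """One update w <- w - t * (1/|J|) * sum_{j in J} grad f(w; xi_j)."""
    idx = np.random.choice(len(data), size=batch_size, replace=False)
    g = np.mean([grad_sample(w, data[j]) for j in idx], axis=0)  # mini-batch gradient
    return w - step * g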
Example: least squares min_w ‖y − Xw‖²_2, where X ∈ R^{10000×500}, y ∈ R^{10000}, w ∈ R^{500}. The full gradient 2X^⊺(Xw − y) touches all 10000 rows of X at every iteration, whereas a mini-batch gradient touches only B of them.
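A rough sketch (assumed; random data standing in for the slide's example) contrasting the per-iteration work of the full gradient and a mini-batch gradient on this least-squares problem:

import numpy as np

m, n, B = 10000, 500, 64
X, y = np.random.randn(m, n), np.random.randn(m)
w = np.zeros(n)

full_grad = 2 * X.T @ (X @ w - y) / m             # touches all m = 10000 rows: O(mn) work
J = np.random.choice(m, size=B, replace=False)
mini_grad = 2 * X[J].T @ (X[J] @ w - y[J]) / B    # touches only B = 64 rows: O(Bn) work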
Why subsampling is reasonable: the mini-batch average (1/|J|) ∑_{j∈J} ∇_w f(w; ξ_j) is an unbiased estimate of the full gradient (1/m) ∑_{i=1}^m ∇_w f(w; ξ_i), i.e., it reduces m to B (batch size). The same trick applies when evaluating the objective (1/m) ∑_{i=1}^m f(w; ξ_i) or a line-search objective of the form (1/m) ∑_{i=1}^m f(w − t g; ξ_i).
Classical convergence results for SGD require diminishing step sizes with ∑_k t_k = ∞ and ∑_k t_k² < ∞, e.g., t_k = c/(k + 1).
(Figure credit: Princeton ELE522)
(Figure credit: Stanford CS231N)
Example: on the poorly scaled quadratic f(x) = x_1² + 4x_2², a single step size cannot fit both coordinates well, which motivates adapting the step size per coordinate.
AdaGrad: accumulate squared gradients coordinate-wise, s_k = ∑_{j=1}^k g_j² (that is, (s_k)_i = ∑_{j=1}^k g²_{i,j} for coordinate i), and update
w_{k+1} = w_k − t · g_k / √(s_k + ε), with the division and square root taken elementwise.
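A minimal numpy sketch (an assumption, not from the slides) of the AdaGrad update; the test objective below is the quadratic f(x) = x_1² + 4x_2² used earlier:

import numpy as np

def adagrad(grad, w0, step=0.5, eps=1e-8, iters=200):
    w, s = w0.astype(float).copy(), np.zeros_like(w0, dtype=float)
    for _ in range(iters):
        g = grad(w)
        s += g ** 2                       # running sum of squared gradients, per coordinate
        w -= step * g / np.sqrt(s + eps)  # per-coordinate scaled step
    return w

quad_grad = lambda x: np.array([2.0 * x[0], 8.0 * x[1]])   # gradient of x1^2 + 4 x2^2
print(adagrad(quad_grad, np.array([4.0, 1.0])))            # should approach (0, 0)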
Problem with AdaGrad: s_k only grows, so the scaled steps g_k/√(s_k + ε) become small when k is large. RMSProp instead uses an exponentially weighted accumulation,
s_k ⇐ g_k² + β g²_{k−1} + β² g²_{k−2} + …,
equivalently s_k = β s_{k−1} + g_k², so old squared gradients are gradually forgotten and the effective step sizes do not shrink to zero.
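The same sketch with the running sum replaced by an exponentially weighted accumulator; note that common implementations (e.g. PyTorch's RMSprop) use the normalized form s ⇐ β s + (1 − β) g², which differs from the slide's weighted sum only by the constant factor (1 − β). Again an illustration, not the slides' code:

import numpy as np

def rmsprop(grad, w0, step=0.05, beta=0.9, eps=1e-8, iters=200):
    w, s = w0.astype(float).copy(), np.zeros_like(w0, dtype=float)
    for _ in range(iters):
        g = grad(w)
        s = beta * s + (1.0 - beta) * g ** 2   # forget old squared gradients geometrically
        w -= step * g / np.sqrt(s + eps)       # scaling no longer decays to zero with k
    return w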
(Figure credit: Stanford CS231N)
Back to DNNs: the training objective is
min_W (1/m) ∑_{i=1}^m ℓ(y_i, σ(W_k σ(W_{k−1} … (W_1 x_i)))),
a deep composition of linear maps W_j and elementwise nonlinearities σ.
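A minimal sketch (assumed) of the nested forward map x ↦ σ(W_k σ(W_{k−1} … (W_1 x))) as a loop over the weight matrices:

import torch

def forward(x, Ws, sigma=torch.sigmoid):
    """Apply x -> sigma(W x) once per layer, innermost (W_1) first."""
    for W in Ws:
        x = sigma(W @ x)
    return x

Ws = [torch.randn(8, 5), torch.randn(8, 8), torch.randn(3, 8)]  # illustrative sizes
out = forward(torch.randn(5), Ws)   # a vector in R^3 with entries in (0, 1)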
Gradients of the training objective can be computed layer by layer. For a two-layer example with squared loss,
f(W_1, W_2) = (1/m) ∑_{i=1}^m ‖y_i − W_2 σ(W_1 x_i)‖²_2,
∇_{W_2} f = −(2/m) ∑_i (y_i − W_2 σ(W_1 x_i)) σ(W_1 x_i)^⊺,
∇_{W_1} f = −(2/m) ∑_i [W_2^⊺ (y_i − W_2 σ(W_1 x_i)) ⊙ σ′(W_1 x_i)] x_i^⊺.
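A sanity-check sketch (an assumption, not from the slides) comparing the hand-derived two-layer gradients above against PyTorch autograd on random data; all sizes are illustrative:

import torch

torch.manual_seed(0)
m, d_in, d_hid, d_out = 5, 4, 3, 2
X = torch.randn(m, d_in)                  # rows are the x_i
Y = torch.randn(m, d_out)                 # rows are the y_i
W1 = torch.randn(d_hid, d_in, requires_grad=True)
W2 = torch.randn(d_out, d_hid, requires_grad=True)

A = torch.sigmoid(X @ W1.T)               # rows are sigma(W_1 x_i)
R = Y - A @ W2.T                          # residuals y_i - W_2 sigma(W_1 x_i)
f = (R ** 2).sum() / m                    # (1/m) sum_i ||.||_2^2
f.backward()

grad_W2 = -(2.0 / m) * R.T @ A                          # -(2/m) sum_i r_i a_i^T
grad_W1 = -(2.0 / m) * ((R @ W2) * A * (1 - A)).T @ X   # sigma'(z) = sigma(z)(1 - sigma(z))
print(torch.allclose(W2.grad, grad_W2, atol=1e-6))      # expect True
print(torch.allclose(W1.grad, grad_W1, atol=1e-6))      # expect True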
Initialization matters: we want the intermediate outputs for the x_i's to stay “well-scaled” across layers, neither exploding nor vanishing.
Balancing the forward pass (which involves d_in) against the backward pass (which, due to its role in the gradient, involves d_out) also suggests zero-mean weights with 2/(d_in + d_out)-variance (Xavier/Glorot initialization). For example: Gaussian entries N(0, 2/(d_in + d_out)), or uniform entries on [−√(6/(d_in + d_out)), √(6/(d_in + d_out))].
For ReLU networks, a similar argument gives zero-mean weights with 2/d_in-variance (Kaiming/He initialization). For example: Gaussian entries N(0, 2/d_in), or uniform entries on [−√(6/d_in), √(6/d_in)]. A related analysis suggests variance c/(d_in d_out) for some constant c [Defazio and Bottou, 2019].
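A minimal PyTorch sketch (assumed; the layer sizes are arbitrary) of the two schemes via torch.nn.init:

import torch.nn as nn

layer = nn.Linear(500, 100)                # d_in = 500, d_out = 100

nn.init.xavier_normal_(layer.weight)       # N(0, 2/(d_in + d_out))
nn.init.xavier_uniform_(layer.weight)      # uniform on +/- sqrt(6/(d_in + d_out))

nn.init.kaiming_normal_(layer.weight)      # N(0, 2/d_in), intended for ReLU layers
nn.init.zeros_(layer.bias)                 # biases commonly start at zero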
References

[Anil et al., 2020] Anil, R., Gupta, V., Koren, T., Regan, K., and Singer, Y. (2020). Second order optimization made practical. arXiv:2002.09018.
[Arjovsky et al., 2016] Arjovsky, M., Shah, A., and Bengio, Y. (2016). Unitary evolution recurrent neural networks. In International Conference on Machine Learning, pages 1120–1128.
[Bansal et al., 2018] Bansal, N., Chen, X., and Wang, Z. (2018). Can we gain more from orthogonality regularizations in training deep CNNs? In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 4266–4276. Curran Associates Inc.
[Baydin et al., 2017] Baydin, A. G., Pearlmutter, B. A., Radul, A. A., and Siskind, J. M. (2017). Automatic differentiation in machine learning: a survey. Journal of Machine Learning Research, 18(1):5595–5637.
[Bottou and Bousquet, 2008] Bottou, L. and Bousquet, O. (2008). The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems, pages 161–168.
[Byrd et al., 2016] Byrd, R. H., Hansen, S. L., Nocedal, J., and Singer, Y. (2016). A stochastic quasi-Newton method for large-scale optimization. SIAM Journal on Optimization, 26(2):1008–1031.
[Chauhan et al., 2018] Chauhan, V. K., Sharma, A., and Dahiya, K. (2018). Stochastic trust region inexact Newton method for large-scale machine learning. arXiv:1812.10426.
[Choromanska et al., 2015] Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B., and LeCun, Y. (2015). The loss surfaces of multilayer networks. In Artificial Intelligence and Statistics, pages 192–204.
[Curtis and Shi, 2019] Curtis, F. E. and Shi, R. (2019). A fully stochastic second-order trust region method. arXiv:1911.06920.
[Dauphin et al., 2014] Dauphin, Y. N., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., and Bengio, Y. (2014). Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems, pages 2933–2941.
[Defazio and Bottou, 2019] Defazio, A. and Bottou, L. (2019). Scaling laws for the principled design, initialization and preconditioning of ReLU networks. arXiv:1906.04267.
[Lezcano-Casado and Martínez-Rubio, 2019] Lezcano-Casado, M. and Martínez-Rubio, D. (2019). Cheap orthogonal constraints in neural networks: A simple parametrization of the orthogonal and unitary group. arXiv:1901.08428.
[Li et al., 2020] Li, J., Fuxin, L., and Todorovic, S. (2020). Efficient Riemannian …
[Martens and Grosse, 2015] Martens, J. and Grosse, R. (2015). Optimizing neural networks with Kronecker-factored approximate curvature. In International Conference on Machine Learning, pages 2408–2417.
[Pascanu et al., 2014] Pascanu, R., Dauphin, Y. N., Ganguli, S., and Bengio, Y. (2014). On the saddle point problem for non-convex optimization. arXiv:1405.4604.
[Roosta-Khorasani and Mahoney, 2018] Roosta-Khorasani, F. and Mahoney, M. W. (2018). Sub-sampled Newton methods. Mathematical Programming, 174(1-2):293–326.
[Staib et al., 2020] Staib, M., Reddi, S. J., Kale, S., Kumar, S., and Sra, S. (2020). Escaping saddle points with adaptive gradient methods. arXiv:1901.09149.
[Sun, 2019] Sun, R. (2019). Optimization for deep learning: theory and algorithms. arXiv:1912.08957.