Global convergence of gradient descent for non-convex learning problems
Francis Bach, INRIA - École Normale Supérieure, Paris, France
Joint work with Lénaïc Chizat. Institut Henri Poincaré.
Parametric supervised machine learning
- Data: n observations (x_i, y_i), i = 1, ..., n
- Prediction function h(x, θ) parameterized by θ ∈ R^d; for a multi-layer neural network,
    h(x, θ) = θ_m⊤ σ(θ_{m−1}⊤ σ(· · · θ_2⊤ σ(θ_1⊤ x)))
- Empirical risk minimization:
    min_{θ ∈ R^d} (1/n) Σ_{i=1}^n ℓ(y_i, h(x_i, θ))
- Stochastic gradient descent (SGD): at iteration t, sample i(t) uniformly in {1, ..., n} and update
    θ_t = θ_{t−1} − γ_t ∇_θ ℓ(y_{i(t)}, h(x_{i(t)}, θ_{t−1}))
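As a concrete illustration of the update above, here is a minimal NumPy sketch of SGD on an empirical risk; the least-squares loss and the toy linear predictor are assumptions chosen only to keep the example self-contained, not the setting of the talk.

```python
import numpy as np

def sgd(grad_loss, theta0, X, y, step=0.1, iters=1000, seed=0):
    """Plain SGD: at each step, sample one index i(t) uniformly and move
    against the gradient of the corresponding per-example loss."""
    rng = np.random.default_rng(seed)
    theta = theta0.astype(float).copy()
    n = len(y)
    for _ in range(iters):
        i = rng.integers(n)  # i(t) uniform in {0, ..., n-1}
        theta -= step * grad_loss(theta, X[i], y[i])
    return theta

# Toy least-squares example: per-example loss (1/2) (y_i - theta^T x_i)^2.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
theta_star = rng.normal(size=5)
y = X @ theta_star + 0.1 * rng.normal(size=200)
grad = lambda th, x, yi: (th @ x - yi) * x  # gradient of the per-example loss
theta_hat = sgd(grad, np.zeros(5), X, y)
print(np.linalg.norm(theta_hat - theta_star))  # small: SGD ends near theta_star
```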
[Figure: comparison of optimization methods AFG, L-BFGS, SG, ASG, IAG and SAG-LS; convergence measured against the target accuracy ε, with iteration counts scaling with 1/ε.]
Single hidden layer neural networks
- Special case of two layers: the prediction function expands as a sum over hidden units,
    θ_2⊤ σ(θ_1⊤ x) = Σ_{i=1}^m θ_2(i) · σ(θ_1(i)⊤ x),
  where m now denotes the number of hidden neurons, θ_1(i) the input weights of unit i and θ_2(i) its output weight.
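A small NumPy sketch of this expansion, with each hidden unit contributing one term of the sum; the ReLU activation and the random weights are assumptions made purely for illustration.

```python
import numpy as np

def relu(u):
    return np.maximum(u, 0.0)

def predict(x, theta1, theta2):
    """Single hidden layer network as a sum over the m hidden units:
    h(x) = sum_i theta2[i] * sigma(theta1[i] . x)."""
    return sum(theta2[i] * relu(theta1[i] @ x) for i in range(len(theta2)))

m, d = 4, 3
rng = np.random.default_rng(0)
theta1 = rng.normal(size=(m, d))  # theta1[i]: input weights of unit i
theta2 = rng.normal(size=m)       # theta2[i]: output weight of unit i
x = rng.normal(size=d)
# Same value as the vectorized form theta2 . sigma(theta1 x):
assert np.isclose(predict(x, theta1, theta2), theta2 @ relu(theta1 @ x))
```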
From neurons to measures
- Collect the parameters of hidden unit i into a vector w_i ∈ W (its input and output weights).
- Represent the network by the empirical measure of its neurons,
    µ = (1/m) Σ_{i=1}^m δ_{w_i},
  a probability measure on the parameter space W.
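In code, the empirical measure is simply the list of particles w_1, ..., w_m, and integrating a test function against it is an average over neurons; a minimal sketch (the Gaussian particles are placeholders):

```python
import numpy as np

def integrate(f, particles):
    """Integral of f against µ = (1/m) Σ δ_{w_i}: average f over the particles."""
    return np.mean([f(w) for w in particles], axis=0)

particles = np.random.default_rng(0).normal(size=(5, 3))  # m = 5 neurons, W = R^3
print(integrate(lambda w: w ** 2, particles))             # component-wise second moments of µ
```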
Wasserstein distance
- For probability measures µ and ν on W, the squared 2-Wasserstein distance is
    W_2^2(µ, ν) = inf_{γ∈Π(µ,ν)} ∫ ‖w − w′‖^2 dγ(w, w′),
  where Π(µ, ν) denotes the set of couplings of µ and ν (joint distributions with marginals µ and ν).
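A small numerical sketch of this definition in the special case of two uniform discrete measures with the same number of atoms: the infimum over couplings is then attained at a permutation, which a linear assignment solver finds. SciPy is used here only as a convenient illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def w2_uniform(xs, ys):
    """W_2 between µ = (1/m) Σ δ_{x_i} and ν = (1/m) Σ δ_{y_j} (same m):
    the optimal coupling is a permutation, found by linear assignment."""
    cost = np.sum((xs[:, None, :] - ys[None, :, :]) ** 2, axis=-1)  # squared distances
    rows, cols = linear_sum_assignment(cost)
    return np.sqrt(cost[rows, cols].mean())

rng = np.random.default_rng(0)
xs = rng.normal(size=(6, 2))
print(w2_uniform(xs, xs + 1.0))  # translating every atom by (1, 1) gives W_2 = sqrt(2)
```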
Prediction as a linear function of the measure
- Writing Ψ(w_i) for the contribution of neuron i, the prediction function is
    h = (1/m) Σ_{i=1}^m Ψ(w_i) = ∫ Ψ(w) dµ(w),
  so h depends "linearly" on the measure µ.
- The risk h ↦ R(h) therefore becomes a function of the probability measure µ.