Tutorial on Gradient methods for non-convex problems
Part 1
Guillaume Garrigos – November 28th – ENS
What can we expect? Does my algorithm converge?
Does lim_{k→+∞} y_k exist?
What is the nature of the limit: global/local minima? Saddle?
Gradient descent: y_{k+1} = y_k − λ_k ∇f(y_k), where f : R^n → R is of class C^{1,1}_L.

Proposition
Let 0 < λ_k < 2/L. Then:
1) f(y_k) is decreasing;
2) if y_{k_n} → y∞ then ∇f(y∞) = 0;
3) isolated local minima are attractive.

[Prop. 1.2.3, 1.2.5 & Ex. 1.2.18] Bertsekas, Nonlinear Programming, 1999.
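The proposition is easy to check numerically. A minimal sketch, on an assumed toy function and step size (not from the slides): run y_{k+1} = y_k − λ∇f(y_k) on the non-convex f(y) = (y₁² − 1)² + y₂² and watch f(y_k) decrease and ∇f(y_k) vanish.

```python
import numpy as np

# Assumed toy example (not from the slides): f(y) = (y1^2 - 1)^2 + y2^2,
# a non-convex C^{1,1} function with critical points (0,0), (1,0), (-1,0).
def f(y):
    return (y[0]**2 - 1.0)**2 + y[1]**2

def grad_f(y):
    return np.array([4.0 * y[0] * (y[0]**2 - 1.0), 2.0 * y[1]])

lam = 0.05                     # step size, below 2/L on the region visited
y = np.array([1.5, 1.0])
values = [f(y)]
for _ in range(200):
    y = y - lam * grad_f(y)    # y_{k+1} = y_k - lam * grad f(y_k)
    values.append(f(y))

# 1) f(y_k) is decreasing;  2) the limit point is critical
assert all(values[k+1] <= values[k] + 1e-15 for k in range(len(values) - 1))
assert np.linalg.norm(grad_f(y)) < 1e-6
```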
But y_k can have no limit!! Non-convergence does not come from a lack of regularity, but rather from wildness.

y_{k+1} = y_k − λ_k ∇f(y_k)

[Ex. 3] Palis, de Melo, Geometric Theory of Dynamical Systems: An Introduction, 1982.
H.B. Curry, The method of steepest descent for nonlinear minimization problems, 1944.
f : R^n → R is of class C^{1,1}_L.

A sufficient condition for convergence of the trajectory is finite length:
∫₀^{+∞} ‖ẏ(u)‖ du < +∞.

Finite length is strictly stronger than convergence: y_k = Σ_{n=1}^{k} (−1)^n/n → −log(2), but Σ ‖y_{k+1} − y_k‖ = Σ 1/n = +∞.
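This example can be checked numerically (a sketch; the truncation K is arbitrary):

```python
import math

# y_k = sum_{n=1}^k (-1)^n / n converges to -log(2),
# yet the path length sum |y_{k+1} - y_k| = sum 1/n diverges (harmonic series).
K = 200000
y, length = 0.0, 0.0
for n in range(1, K + 1):
    step = (-1.0)**n / n
    y += step                  # iterates converge
    length += abs(step)        # ...but the travelled length blows up like log K

assert abs(y - (-math.log(2.0))) < 1e-5
assert length > math.log(K)    # harmonic partial sum H_K > log(K)
```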
When is ∫₀^{+∞} ‖ẏ(u)‖ du < +∞ for the gradient flow ẏ(u) + ∇f(y(u)) = 0?

Set t_0 = f(y_0) and t_∞ = lim_{u→+∞} f(y(u)).

Reparametrize the curve by its function values, z(t) = y(σ⁻¹(t)), so that
ż(t) = ∇f(z(t)) / ‖∇f(z(t))‖².

Then the length equals ∫_{t_∞}^{t_0} 1/‖∇f(z(t))‖ dt (ignoring ∇f(z(t)) = 0). Finite interval!
So ∫₀^{+∞} ‖ẏ(u)‖ du = ∫_{t_∞}^{t_0} 1/‖∇f(z(t))‖ dt. Is it finite?

It is if ‖∇f(z(t))‖ ≥ σ > 0, i.e. under sharpness.

More generally, suppose 1/‖∇f(z(t))‖ ≤ φ′(t) for some concave φ ≥ 0;
then the length is ≤ φ(t_0) − φ(t_∞) ≤ φ(t_0).
Along the trajectory this reads φ′(f(y(u)))‖∇f(y(u))‖ ≥ 1, i.e. φ∘f is sharp: ‖∇(φ∘f)(y)‖ ≥ 1.

Definition
We say that f is Łojasiewicz at a critical point y∞ if there is a concave φ ≥ 0 such that
φ′(f(y) − f(y∞)) ‖∇f(y)‖ ≥ 1 for all y ∈ B(y∞, ε) with f(y∞) < f(y) < f(y∞) + η.

Typical case φ(t) ∼ t^{1/p} (f is p-Łojasiewicz): c (f(y) − f(y∞))^{(p−1)/p} ≤ ‖∇f(y)‖.
Theorem (convergence)
Let f be Łojasiewicz and λ_k ∈ ]0, 2/L[. If (y_k), with y_{k+1} = y_k − λ_k ∇f(y_k), is bounded, then it converges to some critical point y∞.

Łojasiewicz, Sur les trajectoires du gradient d'une fonction analytique, 1984.
Absil, Mahony, Andrews, Convergence of the Iterates of Descent Methods for Analytic Cost Functions, 2005.
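A quick numerical illustration on an assumed polynomial (hence semi-algebraic, hence Łojasiewicz) example with a whole circle of minimizers, f(x) = (‖x‖² − 1)²: the bounded iterates settle at a single critical point instead of drifting along the valley.

```python
import numpy as np

# Assumed example: f(x) = (||x||^2 - 1)^2 on R^2, minimized on the unit circle.
def grad_f(x):
    return 4.0 * (x @ x - 1.0) * x

lam = 0.02
x = np.array([2.0, 0.5])
traj = [x.copy()]
for _ in range(500):
    x = x - lam * grad_f(x)
    traj.append(x.copy())

# the sequence converges to ONE point of the circle of minimizers
assert abs(np.linalg.norm(traj[-1]) - 1.0) < 1e-6
assert np.linalg.norm(traj[-1] - traj[-2]) < 1e-8
```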
Theorem (capture)
Let f be Łojasiewicz and λ_k ∈ ]0, 2/L[. For every y∞ ∈ argmin_loc f, if y_0 ∼ y∞ then y_k converges to some ŷ∞ ∈ argmin_loc f.
(f : R^n → R of class C^{1,1}_L, y_{k+1} = y_k − λ_k ∇f(y_k).)
Sketch of proof: show that the drop of φ(f(y_k) − f(y∞)) controls the steplengths ‖y_{k+1} − y_k‖.

φ(f(y_k) − f(y∞)) − φ(f(y_{k+1}) − f(y∞))
  ≥ φ′(f(y_k) − f(y∞)) (f(y_k) − f(y_{k+1}))               because φ is concave
  ≥ φ′(f(y_k) − f(y∞)) m_{λ,L} ‖y_{k+1} − y_k‖²             with the Descent Lemma
  = φ′(f(y_k) − f(y∞)) M_{λ,L} ‖y_{k+1} − y_k‖ ‖∇f(y_k)‖    since y_{k+1} − y_k = −λ_k ∇f(y_k)
  ≥ M_{λ,L} ‖y_{k+1} − y_k‖                                  by the Łojasiewicz inequality,

with constants m_{λ,L}, M_{λ,L} > 0 depending only on the steps and L. Summing over k gives Σ ‖y_{k+1} − y_k‖ < +∞: the sequence has finite length, hence converges.
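The chain of inequalities can be verified numerically. A sketch on the simplest assumed 2-Łojasiewicz example f(y) = y², with φ(t) = √t, where φ′(f(y))|f′(y)| = 1 exactly:

```python
import math

# f(y) = y^2, phi(t) = sqrt(t): phi'(y^2) * |2y| = (1/(2|y|)) * 2|y| = 1,
# so f is Łojasiewicz at 0. Check the per-step drop and the length bound.
L = 2.0                                    # Lipschitz constant of f'
lam = 0.25                                 # step in ]0, 2/L[
m = (1.0 / lam) * (1.0 - lam * L / 2.0)    # Descent Lemma constant
M = m * lam                                # phi drop >= M * |y_{k+1} - y_k|

f = lambda y: y * y
phi = lambda t: math.sqrt(t)

y = 1.0
length = 0.0
for _ in range(100):
    y_next = y - lam * 2.0 * y
    # per-step: phi(f(y_k)) - phi(f(y_{k+1})) >= M |y_{k+1} - y_k|
    assert phi(f(y)) - phi(f(y_next)) >= M * abs(y_next - y) - 1e-12
    length += abs(y_next - y)
    y = y_next

# telescoping: the total length is bounded by phi(f(y_0)) / M
assert length <= phi(f(1.0)) / M + 1e-9
```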
Bolte, Daniilidis, Ley, Mazet, Characterizations of Łojasiewicz inequalities [...], 2010.
Examples, and a counter-example (figures on slide).
Łojasiewicz, Ensembles semi-analytiques, 1965.
Kurdyka, On gradients of functions definable in o-minimal structures, 1998.
Bolte, Daniilidis, Lewis, Shiota, Clarke Subgradients of Stratifiable Functions, 2007.
Theorem
Any analytic function is p-Łojasiewicz at its critical points.
Examples of semi-algebraic functions.

Theorem ("Tarski-Seidenberg")
The class of semi-algebraic functions is stable under the usual operations: sums, products, compositions, partial minimization, etc.

Coste, An Introduction to O-minimal Geometry, 2000.
Counter-examples of semi-algebraic functions: e.g. exp and log are not semi-algebraic.

Theorem
There exists a class of functions (an o-minimal structure) which contains them and keeps the same stability and Łojasiewicz properties.

Speissegger, The Pfaffian closure of an o-minimal structure, 1999.
Take-home message: virtually any function you can think of is Łojasiewicz, as long as it does not involve something too wild, like y ↦ sin(y) on the whole line.

So gradient descent "always converges" to a critical point.

Polyak, Gradient methods for the minimisation of functionals, 1963.
Theorem (p = 2)
Let f be globally 2-Łojasiewicz, λ ∈ ]0, 2/L[, and y_k → y∞. Then we have linear convergence:
f(y_{k+1}) − f(y∞) ≤ r (f(y_k) − f(y∞)), where r ∈ [0, 1[ (r = 1 − c²/(2L) when λ = 1/L).

Attouch, Bolte, On the convergence of the proximal algorithm for nonsmooth functions [...], 2009.
Chouzenoux, Pesquet, Repetti, A block coordinate variable metric forward-backward algorithm, 2014.
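The linear rate is visible on an assumed quadratic example (quadratics are globally 2-Łojasiewicz): for f(y) = y₁² + 10y₂² we have ‖∇f(y)‖² ≥ 4 f(y), i.e. c² = 4, and L = 20.

```python
import numpy as np

# Assumed example f(y) = y1^2 + 10 y2^2: ||grad f||^2 >= 4 f, L = 20,
# so with lam = 1/L the predicted ratio is r = 1 - c^2/(2L) = 0.9.
A = np.array([1.0, 10.0])
f = lambda y: float(np.sum(A * y * y))
grad = lambda y: 2.0 * A * y

L = 20.0
lam = 1.0 / L
y = np.array([1.0, 1.0])
vals = [f(y)]
for _ in range(60):
    y = y - lam * grad(y)
    vals.append(f(y))

ratios = [vals[k + 1] / vals[k] for k in range(60)]
assert all(r <= 0.9 + 1e-12 for r in ratios)   # linear: f_{k+1} <= 0.9 f_k
```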
Theorem (p > 2)
Let f be globally p-Łojasiewicz, λ ∈ ]0, 2/L[, and y_k → y∞. Then we have sublinear convergence:
f(y_k) − f(y∞) = O(k^{−p/(p−2)}).
Note: p/(p−2) → +∞ when p → 2, and p/(p−2) → 1 when p → ∞.
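The sublinear rate can also be observed numerically. A sketch on the assumed example f(y) = y⁴, which is 4-Łojasiewicz at 0, so the predicted rate is k^{−p/(p−2)} = k^{−2}:

```python
# Assumed example: f(y) = y^4, p = 4, predicted rate f(y_k) = O(k^{-2}).
lam = 0.1
y = 1.0
vals = [y**4]
for _ in range(100000):
    y = y - lam * 4.0 * y**3   # gradient step on f(y) = y^4
    vals.append(y**4)

# k^2 * f(y_k) stabilizes to a constant, confirming the k^{-2} rate
checks = [vals[k] * k**2 for k in (100, 1000, 10000, 100000)]
assert max(checks) < 3.0 and min(checks) > 0.5
```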
What about other methods?
General descent methods:
Attouch, Bolte, Svaiter, Convergence of descent methods for semi-algebraic and tame problems [...], 2013.
Inertial methods:
ÿ(u) + γ(u) ẏ(u) + ∇f(y(u)) = 0
y_{k+1} = z_k − λ ∇f(z_k), with z_k = y_k + (1/(1+γλ)) (y_k − y_{k−1})

Bégout, Bolte, Jendoubi, On damped second-order gradient systems, 2015 (+ refs within!).
Li et al., Convergence Analysis of Proximal Gradient with Momentum for Nonconvex Optimization, 2017.
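A sketch of this inertial scheme on the earlier toy function (the values of γ and λ are assumed here, not taken from the references):

```python
import numpy as np

# Heavy-ball style iteration: extrapolate to z_k, then gradient step at z_k.
# Assumed toy function f(y) = (y1^2 - 1)^2 + y2^2.
def grad_f(y):
    return np.array([4.0 * y[0] * (y[0]**2 - 1.0), 2.0 * y[1]])

lam, gamma = 0.05, 5.0
y_prev = np.array([1.5, 1.0])
y = y_prev.copy()
for _ in range(2000):
    z = y + (1.0 / (1.0 + gamma * lam)) * (y - y_prev)   # momentum step
    y_prev, y = y, z - lam * grad_f(z)                   # gradient step at z_k

assert np.linalg.norm(grad_f(y)) < 1e-6   # converged to a critical point
```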
Also trust-region methods, Landweber iterations. Unknown for Newton/BFGS.

Frankel, Garrigos, Peypouquet, Splitting Methods with Variable Metric [...], 2015. + Absil et al., and many others.

Stochastic versions:
Reddi, Hefny, Sra, Poczos, Smola, Stochastic variance reduction for nonconvex optimization, 2016.
Karimi, Nutini, Schmidt, Linear Convergence of Gradient and Proximal-Gradient [...], 2016.
Lei et al., Stochastic Gradient Descent for Nonconvex Learning without Bounded Gradient Assumptions, 2019.
Second question: lim_{k→+∞} y_k exists; is it a global/local minimum, or a saddle point?
Linear case: B a symmetric operator, ẏ(u) + B y(u) = 0.
Examples: B = diag(1, 1), B = diag(−1, −1), B = diag(1, −1).

Definition
Let ȳ be an equilibrium of the system. Define S(ȳ) = { y : y(u) → ȳ with y(0) = y }.

Theorem
S(ȳ) ⊂ ⊕_{λ>0} E_λ(B).

Corollary
If λ_min(B) < 0, then S(ȳ) has Lebesgue measure 0.
Nonlinear gradient flow: ẏ(u) + ∇f(y(u)) = 0, with f : R^n → R of class C².

Theorem (Stable Manifold Lemma)
S(ȳ) is a submanifold of dimension at most that of ⊕_{λ>0} E_λ(∇²f(ȳ)).

Corollary
If λ_min(∇²f(ȳ)) < 0, then S(ȳ) has Lebesgue measure 0.

Perron, Die Stabilitätsfrage bei Differentialgleichungen, 1930.
Smale, Differentiable dynamical systems, 1967.
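The linear picture can be simulated directly. A sketch on the slide's saddle example B = diag(1, −1): the exact solution of ẏ + By = 0 is y(u) = (y₁(0)e^{−u}, y₂(0)e^{u}), so the stable set S(0) is the line y₂ = 0, which has Lebesgue measure zero.

```python
import numpy as np

# B = diag(1, -1): solutions y(u) = (y1(0) e^{-u}, y2(0) e^{u}).
def flow(y0, u):
    return np.array([y0[0] * np.exp(-u), y0[1] * np.exp(u)])

u = 40.0
rng = np.random.default_rng(0)
escaped = sum(
    np.linalg.norm(flow(rng.standard_normal(2), u)) > 1.0
    for _ in range(1000)
)

assert escaped == 1000                                        # random starts all escape
assert np.linalg.norm(flow(np.array([5.0, 0.0]), u)) < 1e-12  # the stable line y2 = 0
```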
The 3 kinds of critical points ȳ of the flow ẏ(u) + ∇f(y(u)) = 0 (f of class C²):
- ∇²f(ȳ) > 0: attractive;
- λ_min(∇²f(ȳ)) < 0: repulsive;
- λ_min(∇²f(ȳ)) = 0: ???
Heavy ball: ÿ(u) + γ ẏ(u) + ∇f(y(u)) = 0, f of class C². The same Stable Manifold Lemma and its Corollary hold.

Goudou, Munier, The gradient and heavy ball with friction dynamical systems: the quasiconvex case, 2007.
Theorem (discrete case)
Let λ ∈ ]0, 1/L[ and f of class C^{1,1}_L ∩ C². Then S(ȳ) is a submanifold of dimension at most that of ⊕_{λ>0} E_λ(∇²f(ȳ)).

Corollary
If λ_min(∇²f(ȳ)) < 0, then S(ȳ) has Lebesgue measure 0.

Lee, Simchowitz, Jordan, Recht, Gradient Descent Converges to Minimizers, 2016.
Corollary
For y_{k+1} = y_k − λ ∇f(y_k): if f has no degenerate critical points and is Łojasiewicz, then y_k converges a.s. to a local minimum under random initialization.

It is time now for examples.
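The corollary can be illustrated numerically on an assumed toy function with a strict saddle, f(x) = (x₁² − 1)² + x₂²: the critical points are the saddle (0,0), with Hessian diag(−4, 2), and the two minima (±1, 0).

```python
import numpy as np

# Assumed toy function f(x) = (x1^2 - 1)^2 + x2^2: with random x_0,
# gradient descent should (a.s.) end at a minimum, never at the saddle (0,0).
def grad_f(x):
    return np.array([4.0 * x[0] * (x[0]**2 - 1.0), 2.0 * x[1]])

lam = 0.05
rng = np.random.default_rng(1)
for _ in range(50):
    x = rng.uniform(-2.0, 2.0, size=2)
    for _ in range(2000):
        x = x - lam * grad_f(x)
    # every run lands on a local minimum (+-1, 0), never on the saddle
    assert abs(abs(x[0]) - 1.0) < 1e-6 and abs(x[1]) < 1e-6
```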
Some problems have no degenerate critical points, like the matrix factorization problem, a.k.a. the two-layer linear neural network:
min_{X ∈ R^{n×r}} f(X) = ‖X Xᵀ − M‖²

Li et al., Symmetry, saddle points, and global geometry of nonconvex matrix factorization, 2016.
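A sketch of gradient descent on an assumed instance of this problem (the target M, step size, and initialization scale are illustrative choices, not from the slides): with a small generic initialization, GD finds a global minimizer.

```python
import numpy as np

# Symmetric matrix factorization: min_X ||X X^T - M||_F^2, X in R^{n x r},
# with M positive semidefinite of rank r (assumed instance).
n, r = 3, 2
M = np.diag([2.0, 1.0, 0.0])      # rank-2 PSD target

def grad(X):
    return 4.0 * (X @ X.T - M) @ X    # gradient of ||XX^T - M||_F^2

rng = np.random.default_rng(0)
X = 0.1 * rng.standard_normal((n, r))   # small random initialization
lam = 0.02
for _ in range(20000):
    X = X - lam * grad(X)

assert np.linalg.norm(X @ X.T - M) < 1e-6   # X X^T recovers M (X up to rotation)
```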
Corollary
For the noisy gradient method y_{k+1} = y_k − λ ∇f(y_k) + ε_k, the above result remains true. The noise here must be isotropic! This is not the case for SGD (noise proportional to eigenvalues), but the proof can be adapted for RKHS learning with a loss such that ℓ″ = O(ℓ′).

Daneshmand et al., Escaping saddles with stochastic gradients, 2018.
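The role of the noise is easy to see on the same assumed toy saddle, now starting exactly at the saddle point: plain GD is stuck there forever (the gradient is zero), while isotropic noise lets the iterates escape towards a minimum.

```python
import numpy as np

# Assumed toy saddle f(x) = (x1^2 - 1)^2 + x2^2, started EXACTLY at (0,0).
def grad_f(x):
    return np.array([4.0 * x[0] * (x[0]**2 - 1.0), 2.0 * x[1]])

lam, sigma = 0.05, 1e-3
rng = np.random.default_rng(3)

x = np.zeros(2)
for _ in range(4000):
    x = x - lam * grad_f(x) + sigma * rng.standard_normal(2)   # noisy GD

x_plain = np.zeros(2)
for _ in range(4000):
    x_plain = x_plain - lam * grad_f(x_plain)                  # plain GD

assert np.linalg.norm(x_plain) == 0.0      # never left the saddle
assert abs(abs(x[0]) - 1.0) < 0.1          # escaped towards +-1
```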
Conclusion. Does lim_{k→+∞} y_k exist? Is the limit a global/local minimum, or a saddle? It depends strongly on the geometry of f (Łojasiewicz property, degeneracy of its critical points) and on the initialization.

Yes, this bold statement will be my conclusion.