 
Exact rate of Nesterov Scheme
Vasileios Apidopoulos, Jean-François Aujol, Charles Dossal, Aude Rondepierre
INSA de Toulouse, Institut de Mathématiques de Toulouse
January 2019, MIA Conference
The setting
Minimize a differentiable function
Let $F$ be a convex differentiable function from $\mathbb{R}^n$ to $\mathbb{R}$ whose gradient is $L$-Lipschitz, having at least one minimizer $x^*$.
We want to build an efficient sequence to estimate
  $\arg\min_{x \in \mathbb{R}^n} F(x)$  (1)
Gradient descent
Explicit Gradient Descent
Let $F$ be a convex differentiable function from $\mathbb{R}^n$ to $\mathbb{R}$ whose gradient is $L$-Lipschitz, having at least one minimizer $x^*$.
Gradient descent: for $h < \frac{2}{L}$,
  $x_{n+1} = x_n - h \nabla F(x_n)$  (2)
The sequence $(x_n)_{n \in \mathbb{N}}$ converges to a minimizer of $F$ and
  $F(x_n) - F(x^*) \le \frac{\|x_0 - x^*\|^2}{2hn}$  (3)
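A minimal sketch of iteration (2) with a numerical check of bound (3). The diagonal quadratic test function, the starting point and the step $h = 1/L$ are assumptions chosen for illustration; they are not taken from the slides.

```python
def gradient_descent(grad, x0, h, n_iter):
    """Iterate x_{n+1} = x_n - h * grad(x_n) and return x_0, ..., x_{n_iter}."""
    xs = [list(x0)]
    for _ in range(n_iter):
        x = xs[-1]
        g = grad(x)
        xs.append([xi - h * gi for xi, gi in zip(x, g)])
    return xs


# Hypothetical test problem: F(x) = 0.5 * sum(c_i * x_i^2),
# whose unique minimizer is x* = 0 with F(x*) = 0 and L = max(c).
c = [1.0, 0.1, 0.01]
L = max(c)


def F(x):
    return 0.5 * sum(ci * xi * xi for ci, xi in zip(c, x))


def grad_F(x):
    return [ci * xi for ci, xi in zip(c, x)]


x0 = [1.0, 1.0, 1.0]
h = 1.0 / L  # assumed step size, inside the range h < 2 / L
n_iter = 200
iterates = gradient_descent(grad_F, x0, h, n_iter)

# Bound (3): F(x_n) - F(x*) <= ||x_0 - x*||^2 / (2 h n), with x* = 0 here.
dist0_sq = sum(xi * xi for xi in x0)
bound_holds = all(
    F(iterates[n]) <= dist0_sq / (2.0 * h * n) for n in range(1, n_iter + 1)
)
print("bound (3) holds on this example:", bound_holds)
```

On this example the gap decays much faster than the worst-case $1/n$ bound, which is the kind of behaviour the growth conditions introduced later are meant to explain.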
Nesterov inertial scheme
Nesterov scheme: for $h < \frac{1}{L}$ and $\alpha \ge 3$,
  $x_{n+1} = y_n - h \nabla F(y_n)$ with $y_n = x_n + \frac{n}{n+\alpha}(x_n - x_{n-1})$  (4)
  $F(x_n) - F(x^*) = O\left(\frac{1}{n^2}\right)$  (5)
Nesterov (84) proposes $\alpha = 3$.
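A minimal sketch of the inertial scheme, written as a gradient step taken at the extrapolated point $x_n + \frac{n}{n+\alpha}(x_n - x_{n-1})$. The indexing convention ($x_{-1} = x_0$, recursion started at $n = 0$), the quadratic test function and the step $h = 0.9/L$ are assumptions made for illustration.

```python
def nesterov(grad, x0, h, alpha, n_iter):
    """Return x_0, ..., x_{n_iter} for the inertial scheme.

    Assumed convention (not fixed by the slides): x_{-1} = x_0 and the
    recursion starts at n = 0, so the first step is a plain gradient step.
    """
    x_prev, x = list(x0), list(x0)
    xs = [list(x0)]
    for n in range(n_iter):
        beta = n / (n + alpha)  # inertial coefficient n / (n + alpha)
        y = [xi + beta * (xi - pi) for xi, pi in zip(x, x_prev)]
        g = grad(y)
        x_prev, x = x, [yi - h * gi for yi, gi in zip(y, g)]
        xs.append(x)
    return xs


# Hypothetical ill-conditioned quadratic: minimizer x* = 0, F(x*) = 0.
c = [1.0, 0.01, 0.0001]
L = max(c)


def F(x):
    return 0.5 * sum(ci * xi * xi for ci, xi in zip(c, x))


def grad_F(x):
    return [ci * xi for ci, xi in zip(c, x)]


x0 = [1.0, 1.0, 1.0]
h = 0.9 / L   # assumed step size, satisfies h < 1 / L
alpha = 3.0   # Nesterov's original choice
n_iter = 500
xs = nesterov(grad_F, x0, h, alpha, n_iter)

# The O(1/n^2) rate means n^2 * (F(x_n) - F(x*)) stays bounded.
scaled_gaps = [n * n * F(xs[n]) for n in range(1, n_iter + 1)]
print("max of n^2 * (F(x_n) - F(x*)):", max(scaled_gaps))
```

Printing `scaled_gaps` for several values of `alpha` is a quick way to see how the friction parameter changes the decay on a given problem.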
The questions
- Can we be more precise than $O\left(\frac{1}{n^2}\right)$ with more information on $F$?
- Is Nesterov really an acceleration of gradient descent?
The answers
- Yes... with strong convexity: Su et al. (15), Attouch et al. (17).
- We give a more accurate answer for more general geometry.
- In many numerical problems Nesterov is more efficient, but not always.
- The real answer is ... Nesterov may or may not be more efficient than GD.
Outline
- Gradient descent and growth condition.
- State of the art on Nesterov scheme.
- New rates for Nesterov schemes.
- Proofs coming from an ODE study.
Gradient Descent and Geometry
Growth condition
A function $F$ satisfies condition $L(\gamma)$ if there exists $K > 0$ such that for all $x \in \mathbb{R}^n$
  $d(x, X^*)^\gamma \le K (F(x) - F(x^*))$  (6)
where $X^*$ denotes the set of minimizers of $F$.
Theorem (Garrigos et al.)
- If $F$ satisfies condition $L(\gamma)$ with $\gamma > 2$, then
  $F(x_n) - F(x^*) = O\left(\frac{1}{n^{\frac{\gamma}{\gamma-2}}}\right)$  (7)
- If $F$ satisfies condition $L(2)$, then there exists $a > 0$ such that
  $F(x_n) - F(x^*) = O\left(e^{-an}\right)$  (8)
Geometric convergence of GD with $L(2)$
  $F(x_n) - F(x^*) \le \frac{\|x_0 - x^*\|^2}{2hn}$ and $\|x - x^*\|^2 \le K (F(x) - F(x^*))$
No memory algorithm $\Rightarrow$ $\forall j \le n$,
  $F(x_n) - F(x^*) \le \frac{\|x_{n-j} - x^*\|^2}{2hj} \le \frac{K}{2hj} (F(x_{n-j}) - F(x^*))$
If $j \ge \frac{K}{h}$, then $\frac{K}{2hj} \le \frac{1}{2}$ and
  $F(x_n) - F(x^*) \le \frac{1}{2} (F(x_{n-j}) - F(x^*))$
Conclusion: the decay is geometric.
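The halving argument can be checked numerically. The sketch below assumes a diagonal quadratic $F(x) = \frac{1}{2}\sum_i c_i x_i^2$, for which $\|x - x^*\|^2 \le K (F(x) - F(x^*))$ holds with $K = 2 / \min_i c_i$; the function, the step $h = 1/L$ and the starting point are illustrative choices only.

```python
import math

# Hypothetical test problem: F(x) = 0.5 * sum(c_i * x_i^2), x* = 0, F(x*) = 0.
c = [1.0, 0.1]
L, mu = max(c), min(c)
K = 2.0 / mu          # F(x) >= (mu / 2) ||x||^2, hence ||x||^2 <= K * F(x)
h = 1.0 / L           # assumed step size
j = math.ceil(K / h)  # window length j >= K / h from the slide


def F(x):
    return 0.5 * sum(ci * xi * xi for ci, xi in zip(c, x))


# Run gradient descent and record the gap F(x_n) - F(x*) at every step.
x = [1.0, 1.0]
values = [F(x)]
for _ in range(5 * j):
    x = [xi - h * ci * xi for ci, xi in zip(c, x)]
    values.append(F(x))

# Every window of j iterations should at least halve the gap.
halved = all(values[n] <= 0.5 * values[n - j] for n in range(j, len(values)))
print("window length j =", j, "| gap halves over every window:", halved)
```

Halving the gap every $j$ iterations gives $F(x_n) - F(x^*) \le 2^{-\lfloor n/j \rfloor} (F(x_0) - F(x^*))$, which is the geometric decay stated in the conclusion.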
Back to Nesterov scheme
State of the art
Nesterov scheme: for $h < \frac{1}{L}$ and $\alpha \ge 3$,
  $x_{n+1} = y_n - h \nabla F(y_n)$ with $y_n = x_n + \frac{n}{n+\alpha}(x_n - x_{n-1})$  (9)
  $F(x_n) - F(x^*) = O\left(\frac{1}{n^2}\right)$  (10)
Chambolle, D. (14) and Attouch, Peypouquet (15):
  $\alpha > 3 \Rightarrow$ convergence of $(x_n)_{n \ge 1}$ and $F(x_n) - F(x^*) = o\left(\frac{1}{n^2}\right)$
If $\alpha \le 3$, Apidopoulos et al. and Attouch et al. (17):
  $F(x_n) - F(x^*) = O\left(\frac{1}{n^{\frac{2\alpha}{3}}}\right)$  (11)
Nesterov, with strong convexity
Theorem: Su, Boyd, Candès (15), Attouch, Cabot (17)
If $F$ satisfies $L(2)$ and has a unique minimizer, then $\forall \alpha > 0$
  $F(x_n) - F(x^*) = O\left(\frac{1}{n^{\frac{2\alpha}{3}}}\right)$  (12)
Another geometrical condition
Flatness condition
$F$ satisfies condition $H(\gamma)$ if $\forall x \in \mathbb{R}^n$ and all $x^* \in X^*$
  $F(x) - F(x^*) \le \frac{1}{\gamma} \langle \nabla F(x), x - x^* \rangle$  (13)
Flatness and growth properties
- If $(F - F^*)^{\frac{1}{\gamma}}$ is convex, then $F$ satisfies $H(\gamma)$.
- If $F$ satisfies $H(\gamma)$, then there exists $K_2 > 0$ such that
  $F(x) - F(x^*) \le K_2 \, d(x, X^*)^\gamma$  (14)
- If $F(x) = \|x - x^*\|^r$ with $r > 1$, then $F$ satisfies $H(\gamma)$ for all $\gamma \in [1, r]$ ... and $L(p)$ for all $p \ge \gamma$.
- If $F$ satisfies $L(2)$ and $\nabla F$ is $L$-Lipschitz, then $F$ satisfies $H(\gamma)$ for some $\gamma > 1$ depending on $K$ and $L$.
Nesterov, flatness may improve convergence rate
Theorem: Apidopoulos et al. (18)
Let $F$ be a differentiable convex function whose gradient is $L$-Lipschitz.
1. If $F$ satisfies $H(\gamma)$ with $\gamma > 1$:
   - if $\alpha \le 1 + \frac{2}{\gamma}$, then
     $F(x_n) - F(x^*) = O\left(\frac{1}{n^{\frac{2\gamma\alpha}{\gamma+2}}}\right)$  (15)
   - if $\alpha > 1 + \frac{2}{\gamma}$ (and thus if $\alpha = 3$), then
     $F(x_n) - F(x^*) = o\left(\frac{1}{n^2}\right)$  (16)
     and the sequence $(x_n)_{n \ge 1}$ converges.
2. If $F$ satisfies $L(2)$, the previous points apply for some $\gamma > 1$.
Nesterov, rate for sharp functions
Theorem for sharp functions, Apidopoulos et al. (18)
If $F$ satisfies $L(2)$ and $H(\gamma)$ and has a unique minimizer $x^*$, then $\forall \alpha > 0$
  $F(x_n) - F(x^*) = O\left(\frac{1}{n^{\frac{2\gamma\alpha}{\gamma+2}}}\right)$  (17)
Comments
- For $\gamma = 1$ we recover the decay $O\left(\frac{1}{n^{\frac{2\alpha}{3}}}\right)$ of Attouch and Cabot.
- For quadratic functions, $\gamma = 2$ and thus we get $O\left(\frac{1}{n^\alpha}\right)$.
- Since $\nabla F$ is $L$-Lipschitz, $F$ satisfies $H(\gamma)$ for some $\gamma > 1$, and thus $\frac{2\gamma\alpha}{\gamma+2} > \frac{2\alpha}{3}$.
- For $F(x) = \|Ax - y\|^2$ the decay is $O\left(\frac{1}{n^\alpha}\right)$.
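The exponent in (17) is easy to tabulate. The helper below is a small illustrative sketch (the function name is hypothetical) that evaluates $\frac{2\gamma\alpha}{\gamma+2}$ and reproduces the two special cases listed in the comments.

```python
def sharp_rate_exponent(gamma, alpha):
    """Exponent r such that bound (17) reads F(x_n) - F(x*) = O(1 / n^r)."""
    if gamma < 1.0 or alpha <= 0.0:
        raise ValueError("expected gamma >= 1 and alpha > 0")
    return 2.0 * gamma * alpha / (gamma + 2.0)


alpha = 3.0
# gamma = 1: the exponent reduces to 2 * alpha / 3.
exponent_gamma_1 = sharp_rate_exponent(1.0, alpha)
# gamma = 2 (quadratic case): the exponent reduces to alpha.
exponent_gamma_2 = sharp_rate_exponent(2.0, alpha)

for gamma in (1.0, 1.5, 2.0):
    print("gamma =", gamma, "-> exponent", sharp_rate_exponent(gamma, alpha))
```

Since $\frac{2\gamma}{\gamma+2}$ increases with $\gamma$, a larger flatness parameter gives a larger exponent for the same $\alpha$.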
Nesterov, rate for flat functions
Theorem for flat functions, Apidopoulos (18)
If $F$ satisfies $H(\gamma)$ and $L(\gamma)$ with $\gamma > 2$, if $F$ has a unique minimizer and if $\alpha > \frac{\gamma+2}{\gamma-2}$, then
  $F(x_n) - F(x^*) = O\left(\frac{1}{n^{\frac{2\gamma}{\gamma-2}}}\right)$  (18)
Gradient descent rate
If $F$ satisfies $L(\gamma)$ with $\gamma > 2$, then
  $F(x_n) - F(x^*) = O\left(\frac{1}{n^{\frac{\gamma}{\gamma-2}}}\right)$  (19)
Nesterov continuous and discrete
Discretization of an ODE, Su, Boyd and Candès (15)
The scheme defined by
  $x_{n+1} = y_n - h \nabla F(y_n)$ with $y_n = x_n + \frac{n}{n+\alpha}(x_n - x_{n-1})$  (20)
is a discretization of a solution of
  $\ddot{x}(t) + \frac{\alpha}{t} \dot{x}(t) + \nabla F(x(t)) = 0$  (ODE)
with $\dot{x}(t_0) = 0$.
Motion of a solid in a potential field with a vanishing viscosity $\frac{\alpha}{t}$.
Advantages of the discrete setting
1. A simpler Lyapunov analysis, better insight
2. Optimality of bounds
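A short worked computation showing why the scheme discretizes the ODE. It uses the usual identification $t = n\sqrt{h}$ and $x_n \approx x(n\sqrt{h})$, which is an assumption added here and is not stated on the slide.

```latex
% Substitute y_n into x_{n+1} = y_n - h \nabla F(y_n), then subtract
% x_n + (x_n - x_{n-1}) from both sides:
\begin{align*}
x_{n+1} - 2x_n + x_{n-1}
  &= -\frac{\alpha}{n+\alpha}\,(x_n - x_{n-1}) - h\,\nabla F(y_n).
\end{align*}
% Divide by h and split the factor 1/h as (1/\sqrt{h}) \cdot (1/\sqrt{h}):
\begin{align*}
\frac{x_{n+1} - 2x_n + x_{n-1}}{h}
  + \frac{\alpha}{(n+\alpha)\sqrt{h}} \cdot \frac{x_n - x_{n-1}}{\sqrt{h}}
  + \nabla F(y_n) &= 0.
\end{align*}
% With time step \sqrt{h} and t = n\sqrt{h}:
%   (x_{n+1} - 2x_n + x_{n-1}) / h       \approx \ddot{x}(t),
%   (x_n - x_{n-1}) / \sqrt{h}           \approx \dot{x}(t),
%   \alpha / ((n+\alpha)\sqrt{h}) = \alpha / (t + \alpha\sqrt{h}) \to \alpha / t
% as h \to 0, which gives \ddot{x}(t) + (\alpha/t)\,\dot{x}(t) + \nabla F(x(t)) = 0.
```

Under this identification, $n$ iterations correspond to time $t = n\sqrt{h}$, so a rate $O(1/t^2)$ for the ODE matches a rate $O(1/n^2)$ for the scheme.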
Nesterov, Continuous vs discrete
  $\ddot{x}(t) + \frac{\alpha}{t} \dot{x}(t) + \nabla F(x(t)) = 0$  (ODE)
Nesterov, Continuous
If $F$ is convex and if $\alpha \ge 3$, the solution of (ODE) satisfies
  $F(x(t)) - F(x^*) = O\left(\frac{1}{t^2}\right)$  (21)
  $x_{n+1} = y_n - h \nabla F(y_n)$ with $y_n = x_n + \frac{n}{n+\alpha}(x_n - x_{n-1})$
Nesterov, Discrete
If $F$ is convex and if $\alpha \ge 3$, the sequence $(x_n)_{n \ge 1}$ satisfies
  $F(x_n) - F(x^*) = O\left(\frac{1}{n^2}\right)$  (22)
Nesterov, Proof of the continuous theorem
We define
  $E(t) = t^2 (F(x(t)) - F(x^*)) + \frac{1}{2} \|(\alpha - 1)(x(t) - x^*) + t \dot{x}(t)\|^2$
Using (ODE) and the following convex inequality
  $F(x(t)) - F(x^*) \le \langle x(t) - x^*, \nabla F(x(t)) \rangle$
we get
  $E'(t) \le (3 - \alpha)\, t\, (F(x(t)) - F(x^*))$  (23)
1. If $\alpha \ge 3$, $\forall t \ge t_0$, $t^2 (F(x(t)) - F(x^*)) \le E(t_0)$
2. If $\alpha > 3$, $\int_{t_0}^{+\infty} (\alpha - 3)\, t\, (F(x(t)) - F(x^*))\, dt \le E(t_0)$
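The step from the definition of $E$ to inequality (23) can be spelled out as follows. This is a sketch of the standard computation, writing $x$ for $x(t)$ and $F^*$ for $F(x^*)$; the last line uses $\alpha \ge 1$ so that the convex inequality can be multiplied by $\alpha - 1$.

```latex
\begin{align*}
E'(t) &= 2t\,(F(x) - F^*) + t^2 \langle \nabla F(x), \dot{x} \rangle
         + \big\langle (\alpha-1)(x - x^*) + t\dot{x},\;
                       \alpha \dot{x} + t \ddot{x} \big\rangle \\
      % (ODE) multiplied by t gives  \alpha \dot{x} + t \ddot{x} = -t \nabla F(x)
      &= 2t\,(F(x) - F^*) + t^2 \langle \nabla F(x), \dot{x} \rangle
         - t \big\langle (\alpha-1)(x - x^*) + t\dot{x},\; \nabla F(x) \big\rangle \\
      &= 2t\,(F(x) - F^*) - (\alpha-1)\, t\, \langle x - x^*, \nabla F(x) \rangle \\
      &\le 2t\,(F(x) - F^*) - (\alpha-1)\, t\, (F(x) - F^*)
       = (3 - \alpha)\, t\, (F(x) - F^*).
\end{align*}
```

For $\alpha \ge 3$ the right-hand side is nonpositive, so $E$ is nonincreasing and $t^2 (F(x(t)) - F(x^*)) \le E(t) \le E(t_0)$; integrating the inequality over $[t_0, +\infty)$ gives the second point.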
Nesterov, Proof of the discrete theorem
We define
  $E_n = n^2 (F(x_n) - F(x^*)) + \frac{1}{2h} \|(\alpha - 1)(x_n - x^*) + n (x_n - x_{n-1})\|^2$
Using the definition of $(x_n)_{n \ge 1}$ and the following convex inequality
  $F(x_n) - F(x^*) \le \langle x_n - x^*, \nabla F(x_n) \rangle$
we get
  $E_{n+1} - E_n \le (3 - \alpha)\, n\, (F(x_n) - F(x^*))$  (24)
1. If $\alpha \ge 3$, $\forall n \ge 1$, $n^2 (F(x_n) - F(x^*)) \le E_1$
2. If $\alpha > 3$, $\sum_{n \ge 1} (\alpha - 3)\, n\, (F(x_n) - F(x^*)) \le E_1$
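A numerical sketch of the discrete Lyapunov argument. The slide states the energy with weight $n^2$ without fixing an indexing convention; the code below assumes the convention $x_{-1} = x_0$ with the recursion started at $n = 0$, and for that convention uses the shifted weight $(n + \alpha - 1)^2$. The weight shift, the quadratic test function, $h = 0.9/L$ and $\alpha = 4$ are all assumptions made for illustration.

```python
def discrete_energy(n, x, x_prev, x_star, gap, h, alpha):
    """Energy (n + alpha - 1)^2 * gap + ||(alpha-1)(x - x*) + n (x - x_prev)||^2 / (2h).

    The weight (n + alpha - 1)^2 replaces the n^2 of the slide; this shift
    is an assumption tied to the indexing used in this sketch.
    """
    v = [(alpha - 1.0) * (xi - si) + n * (xi - pi)
         for xi, pi, si in zip(x, x_prev, x_star)]
    return (n + alpha - 1.0) ** 2 * gap + sum(vi * vi for vi in v) / (2.0 * h)


# Hypothetical quadratic: F(x) = 0.5 * sum(c_i * x_i^2), x* = 0, F(x*) = 0.
c = [1.0, 0.01]
L = max(c)
x_star = [0.0, 0.0]


def F(x):
    return 0.5 * sum(ci * xi * xi for ci, xi in zip(c, x))


h = 0.9 / L   # assumed step size, h < 1 / L
alpha = 4.0   # any alpha >= 3
x_prev, x = [1.0, 1.0], [1.0, 1.0]
energies = []
for n in range(300):
    energies.append(discrete_energy(n, x, x_prev, x_star, F(x), h, alpha))
    beta = n / (n + alpha)
    y = [xi + beta * (xi - pi) for xi, pi in zip(x, x_prev)]
    x_prev, x = x, [yi - h * ci * yi for ci, yi in zip(c, y)]

# The energy should never increase (small tolerance for rounding errors).
nonincreasing = all(e1 <= e0 + 1e-9 for e0, e1 in zip(energies, energies[1:]))
print("energy is nonincreasing on this run:", nonincreasing)
```

Because the energy dominates $(n + \alpha - 1)^2 (F(x_n) - F(x^*))$, its monotonicity is what yields the $O\left(\frac{1}{n^2}\right)$ rate.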
Nesterov, Proof of convergence rate
1. We define, for $(p, \xi, \lambda) \in \mathbb{R}^3$,
   $H(t) = t^p \left( t^2 (F(x(t)) - F(x^*)) + \frac{1}{2} \|\lambda (x(t) - x^*) + t \dot{x}(t)\|^2 + \frac{\xi}{2} \|x(t) - x^*\|^2 \right)$
2. We choose $(p, \xi, \lambda) \in \mathbb{R}^3$ depending on the hypotheses to ensure that $H$ is bounded. $H$ may not be nonincreasing.
3. We deduce that there is $A \in \mathbb{R}$ such that
   $t^{2+p} (F(x(t)) - F(x^*)) \le A - t^p \frac{\xi}{2} \|x(t) - x^*\|^2$
4. If $\xi \ge 0$, then $F(x(t)) - F(x^*) = O\left(\frac{1}{t^{p+2}}\right)$.
5. If $\xi \le 0$, we must use condition $L(\gamma)$ to conclude.
Nesterov, Example
Theorem: Su, Boyd, Candès (15)
If $F$ is convex and $\alpha \ge 3$, then
  $F(x(t)) - F(x^*) = O\left(\frac{1}{t^2}\right)$  (25)
Proof: $p = 0$, $\lambda = \alpha - 1$, $\xi = 0$.
Theorem: Aujol, D., Rondepierre (18)
If $F$ is convex, satisfies $H(\gamma)$ and $L(2)$, and has a unique minimizer, then
  $F(x(t)) - F(x^*) = O\left(\frac{1}{t^{\frac{2\alpha\gamma}{\gamma+2}}}\right)$  (26)
Proof: $p = \frac{2\alpha\gamma}{\gamma+2} - 2$, $\lambda = \frac{2\alpha}{\gamma+2}$, $\xi = \lambda(\lambda + 1 - \alpha)$.
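The parameter choice of the second proof can be evaluated for concrete values. The helper below is an illustrative sketch (the function name is hypothetical); it returns $(p, \lambda, \xi)$ and shows that $p + 2$ recovers the exponent in (26).

```python
def lyapunov_parameters(alpha, gamma):
    """Return (p, lam, xi) as chosen in the proof of bound (26)."""
    lam = 2.0 * alpha / (gamma + 2.0)
    p = 2.0 * alpha * gamma / (gamma + 2.0) - 2.0
    xi = lam * (lam + 1.0 - alpha)
    return p, lam, xi


# Example values (chosen for illustration): alpha = 3, gamma = 2.
p, lam, xi = lyapunov_parameters(alpha=3.0, gamma=2.0)
print("p =", p, "lambda =", lam, "xi =", xi)
print("rate exponent p + 2 =", p + 2.0)
```

For these values $\xi$ is negative, so the term $\frac{\xi}{2}\|x(t) - x^*\|^2$ cannot simply be dropped from $H$; this is where the growth condition $L(2)$ enters the proof.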