SLIDE 1

Exact rate of Nesterov Scheme

Vasileios Apidopoulos, Jean-François Aujol, Charles Dossal, Aude Rondepierre

INSA de Toulouse, Institut de Mathématiques de Toulouse

January 2019, MIA Conference

SLIDE 2

The setting

Minimize a differentiable function
Let F be a convex differentiable function from $\mathbb{R}^n$ to $\mathbb{R}$ whose gradient is L-Lipschitz, having at least one minimizer $x^*$. We want to build an efficient sequence to estimate

$$\operatorname*{arg\,min}_{x \in \mathbb{R}^n} F(x) \qquad (1)$$

SLIDE 3

Gradient descent

Explicit Gradient Descent
Let F be a convex differentiable function from $\mathbb{R}^n$ to $\mathbb{R}$ whose gradient is L-Lipschitz, having at least one minimizer $x^*$. Gradient descent: for $h < \frac{2}{L}$,

$$x_{n+1} = x_n - h\nabla F(x_n) \qquad (2)$$

The sequence $(x_n)_{n\in\mathbb{N}}$ converges to a minimizer of F and

$$F(x_n) - F(x^*) \le \frac{\|x_0 - x^*\|^2}{2hn} \qquad (3)$$
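Scheme (2) and the bound (3) can be checked on a toy problem. The quadratic F, step size, and starting point below are illustrative assumptions for the demo, not taken from the slides:

```python
import numpy as np

def gradient_descent(grad_F, x0, h, n_iter):
    """Explicit gradient descent: x_{n+1} = x_n - h * grad_F(x_n)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        x = x - h * grad_F(x)
    return x

# Toy example: F(x) = 0.5 ||x||^2, so grad F(x) = x, L = 1 and x* = 0.
F = lambda x: 0.5 * np.dot(x, x)
grad_F = lambda x: x
x0, h, n = np.array([3.0, -4.0]), 0.3, 50   # h = 0.3 < 2/L = 2

x_n = gradient_descent(grad_F, x0, h, n)
bound = np.dot(x0, x0) / (2 * h * n)        # ||x0 - x*||^2 / (2 h n), cf. (3)
assert F(x_n) <= bound
```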

SLIDE 4

Nesterov inertial scheme

Nesterov Scheme: for $h < \frac{1}{L}$ and $\alpha \ge 3$,

$$x_{n+1} = y_n - h\nabla F(y_n) \quad \text{with} \quad y_n = x_n + \frac{n}{n+\alpha}(x_n - x_{n-1}) \qquad (4)$$

$$F(x_n) - F(x^*) = O\!\left(\frac{1}{n^2}\right) \qquad (5)$$

Nesterov (84) proposes α = 3.
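A minimal sketch of scheme (4), again on an illustrative toy quadratic (the function, step size, starting point, and iteration count are assumptions for the demo):

```python
import numpy as np

def nesterov(grad_F, x0, h, alpha, n_iter):
    """Nesterov scheme (4): y_n = x_n + n/(n + alpha) * (x_n - x_{n-1}),
    then x_{n+1} = y_n - h * grad_F(y_n), started with x_1 = x_0."""
    x_prev = np.asarray(x0, dtype=float)
    x = x_prev.copy()
    for n in range(1, n_iter + 1):
        y = x + (n / (n + alpha)) * (x - x_prev)
        x_prev, x = x, y - h * grad_F(y)
    return x

# Toy example: F(x) = 0.5 ||x||^2 (L = 1, minimizer x* = 0)
F = lambda x: 0.5 * np.dot(x, x)
x_n = nesterov(lambda x: x, np.array([3.0, -4.0]), h=0.5, alpha=3.0, n_iter=200)
# Consistent with the O(1/n^2) rate (5), the error is tiny after 200 steps
assert F(x_n) < 1e-2
```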

SLIDE 5

The questions

Can we be more precise than $O(1/n^2)$ with more information on F?
Is Nesterov really an acceleration of Gradient descent?

The answers

Yes... with strong convexity: Su et al. (15), Attouch et al. (17). We give a more accurate answer for more general geometry.
In many numerical problems Nesterov is more efficient, but not always. The real answer is... Nesterov may be more efficient than GD, or not.
SLIDE 6

Outline

• Gradient descent and growth condition.
• State of the art on the Nesterov scheme.
• New rates for Nesterov schemes.
• Proofs coming from an ODE study.

SLIDE 7

Gradient Descent and Geometry

Growth condition
A function F satisfies condition L(γ) if there exists $K > 0$ such that for all $x \in \mathbb{R}^n$

$$d(x, X^*)^\gamma \le K\,(F(x) - F(x^*)) \qquad (6)$$

Theorem (Garrigos et al.)
If F satisfies condition L(γ) with γ > 2 then

$$F(x_n) - F(x^*) = O\!\left(\frac{1}{n^{\frac{\gamma}{\gamma-2}}}\right) \qquad (7)$$

If F satisfies condition L(2) then there exists $a > 0$ such that

$$F(x_n) - F(x^*) = O\!\left(e^{-an}\right) \qquad (8)$$
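The γ > 2 case can be observed empirically. Below, F(x) = x⁴ satisfies L(4) (with K = 1, since d(x, X*)⁴ = x⁴ = F(x) − F(x*)), and the gradient-descent error decays with a log-log slope close to −γ/(γ−2) = −2. The step size and iteration range are illustrative assumptions:

```python
import numpy as np

# F(x) = x^4 satisfies L(4): d(x, X*)^4 <= 1 * (F(x) - F(x*)) with X* = {0}.
F = lambda x: x**4
grad_F = lambda x: 4 * x**3

x, h = 1.0, 0.05
vals = []
for n in range(1, 2001):
    x -= h * grad_F(x)
    vals.append(F(x))

# Empirical decay exponent of F(x_n) between n = 1000 and n = 2000;
# it comes out close to -gamma/(gamma-2) = -2 for gamma = 4
slope = np.log(vals[1999] / vals[999]) / np.log(2000 / 1000)
```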

SLIDE 8

Geometric convergence of GD with L(2)

$$F(x_n) - F(x^*) \le \frac{\|x_0 - x^*\|^2}{2hn} \quad \text{and} \quad \|x - x^*\|^2 \le K\,(F(x) - F(x^*))$$

No-memory algorithm ⇒ ∀j ≤ n,

$$F(x_n) - F(x^*) \le \frac{\|x_{n-j} - x^*\|^2}{2hj} \le \frac{K}{2hj}\,(F(x_{n-j}) - F(x^*))$$

If

$$\frac{K}{2hj} \le \frac{1}{2} \iff j \ge \frac{K}{h},$$

then

$$F(x_n) - F(x^*) \le \frac{F(x_{n-j}) - F(x^*)}{2}$$

Conclusion: the decay is geometric.
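The halving argument predicts geometric decay, and on a quadratic the per-step ratio of F(x_n) − F(x*) is in fact exactly constant. The function and step below are illustrative assumptions:

```python
# On F(x) = 0.5 x^2 (which satisfies L(2) with K = 2), gradient descent
# gives x_{n+1} = (1 - h) x_n, so F(x_{n+1}) / F(x_n) = (1 - h)^2: geometric.
F = lambda x: 0.5 * x * x
grad_F = lambda x: x

h, x = 0.1, 5.0
vals = [F(x)]
for _ in range(100):
    x -= h * grad_F(x)
    vals.append(F(x))

ratios = [vals[n + 1] / vals[n] for n in range(100)]
assert all(abs(r - (1 - h) ** 2) < 1e-12 for r in ratios)
```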

SLIDE 9

Back to Nesterov scheme

State of the art
Nesterov Scheme: for $h < \frac{1}{L}$ and $\alpha \ge 3$,

$$x_{n+1} = y_n - h\nabla F(y_n) \quad \text{with} \quad y_n = x_n + \frac{n}{n+\alpha}(x_n - x_{n-1}) \qquad (9)$$

$$F(x_n) - F(x^*) = O\!\left(\frac{1}{n^2}\right) \qquad (10)$$

Chambolle, D. (14) and Attouch, Peypouquet (15): $\alpha > 3$ ⇒ convergence of $(x_n)_{n\ge 1}$ and $F(x_n) - F(x^*) = o\!\left(\frac{1}{n^2}\right)$.

If $\alpha \le 3$, Apidopoulos et al. and Attouch et al. (17):

$$F(x_n) - F(x^*) = O\!\left(\frac{1}{n^{\frac{2\alpha}{3}}}\right) \qquad (11)$$
SLIDE 10

Nesterov, with strong convexity

Theorem (Su, Boyd, Candès (15); Attouch, Cabot (17))
If F satisfies L(2) and has a unique minimizer, then ∀α > 0

$$F(x_n) - F(x^*) = O\!\left(\frac{1}{n^{\frac{2\alpha}{3}}}\right) \qquad (12)$$
SLIDE 11

Another geometrical condition

Flatness condition
F satisfies condition H(γ) if ∀x ∈ $\mathbb{R}^n$ and all $x^* \in X^*$

$$F(x) - F(x^*) \le \frac{1}{\gamma}\,\langle \nabla F(x), x - x^* \rangle \qquad (13)$$

Flatness and growth properties
• If $(F - F^*)^{\frac{1}{\gamma}}$ is convex, then F satisfies H(γ).
• If F satisfies H(γ) then there exists $K_2 > 0$ such that

$$F(x) - F(x^*) \le K_2\, d(x, X^*)^\gamma \qquad (14)$$

• If $F(x) = \|x - x^*\|^r$, with r > 1, F satisfies H(γ) for all γ ∈ [1, r] ... and L(p) for all p ≥ γ.
• If F satisfies L(2) and ∇F is L-Lipschitz then F satisfies $H\!\left(1 + \frac{L}{2K_2}\right)$.
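The flatness inequality (13) can be verified directly for the model example F(x) = |x − x*|^r: there ⟨∇F(x), x − x*⟩ = r(F(x) − F(x*)), so H(γ) holds exactly when γ ≤ r. A numerical check on a grid (r, x*, and the grid are illustrative assumptions):

```python
import numpy as np

# Check H(gamma):  F(x) - F(x*) <= (1/gamma) <grad F(x), x - x*>
# for F(x) = |x - x*|^r with r = 4 and x* = 0, on a grid of points.
r, x_star = 4.0, 0.0
F = lambda x: np.abs(x - x_star) ** r
grad_F = lambda x: r * np.abs(x - x_star) ** (r - 1) * np.sign(x - x_star)

xs = np.linspace(-2, 2, 401)
for gamma in (1.0, 2.0, 4.0):          # any gamma in [1, r] should work
    lhs = F(xs)
    rhs = grad_F(xs) * (xs - x_star) / gamma
    assert np.all(lhs <= rhs + 1e-12), gamma

# gamma = 5 > r fails at every x != x*
gamma = 5.0
assert np.any(F(xs) > grad_F(xs) * (xs - x_star) / gamma)
```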

SLIDE 12

Nesterov, flatness may improve convergence rate

Theorem: Apidopoulos et al. (18)
Let F be a differentiable convex function whose gradient is L-Lipschitz.

1. If F satisfies H(γ), with γ > 1:

   1. if $\alpha \le 1 + \frac{2}{\gamma}$, then

   $$F(x_n) - F(x^*) = O\!\left(\frac{1}{n^{\frac{2\gamma\alpha}{\gamma+2}}}\right) \qquad (15)$$

   2. if $\alpha > 1 + \frac{2}{\gamma}$, and thus in particular if α = 3, then

   $$F(x_n) - F(x^*) = o\!\left(\frac{1}{n^2}\right) \qquad (16)$$

   and the sequence $(x_n)_{n\ge 1}$ converges.

2. If F satisfies L(2), the previous points apply for some γ > 1.

SLIDE 13

Nesterov, rate for sharp functions

Theorem for sharp functions, Apidopoulos et al. (18)
If F satisfies L(2), H(γ) and has a unique minimizer $x^*$, then ∀α > 0

$$F(x_n) - F(x^*) = O\!\left(\frac{1}{n^{\frac{2\gamma\alpha}{\gamma+2}}}\right) \qquad (17)$$

Comments
• For γ = 1 we recover the decay $O\!\left(\frac{1}{n^{\frac{2\alpha}{3}}}\right)$: Attouch and Cabot.
• For quadratic functions, γ = 2 and thus we get $O\!\left(\frac{1}{n^{\alpha}}\right)$.
• Since ∇F is L-Lipschitz (and F satisfies L(2)), F satisfies H(γ) for some γ > 1, and thus $\frac{2\gamma\alpha}{\gamma+2} > \frac{2\alpha}{3}$.
• For $F(x) = \|Ax - y\|^2$ the decay is $O\!\left(\frac{1}{n^{\alpha}}\right)$.
SLIDE 14

Nesterov, rate for flat functions

Theorem for flat functions, Apidopoulos et al. (18)
If F satisfies H(γ) and L(γ) with γ > 2, if F has a unique minimizer and if $\alpha > \frac{\gamma+2}{\gamma-2}$, then

$$F(x_n) - F(x^*) = O\!\left(\frac{1}{n^{\frac{2\gamma}{\gamma-2}}}\right) \qquad (18)$$

Gradient descent rate
If F satisfies L(γ) with γ > 2,

$$F(x_n) - F(x^*) = O\!\left(\frac{1}{n^{\frac{\gamma}{\gamma-2}}}\right) \qquad (19)$$
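For a concrete flat example, F(x) = x⁴ satisfies H(4) and L(4) (γ = 4), so with α > (γ+2)/(γ−2) = 3 Nesterov is predicted to decay like n^{−4}, versus n^{−2} for gradient descent. A rough numerical comparison (step size, α, starting point, and iteration count are illustrative assumptions):

```python
import numpy as np

F = lambda x: x**4
grad_F = lambda x: 4 * x**3
h, n_iter = 0.05, 2000

# Gradient descent, predicted rate n^{-gamma/(gamma-2)} = n^-2
x = 0.5
for _ in range(n_iter):
    x -= h * grad_F(x)
gd_val = F(x)

# Nesterov with alpha = 4 > (gamma+2)/(gamma-2) = 3, predicted rate n^-4
alpha, x_prev = 4.0, 0.5
xn = x_prev
for n in range(1, n_iter + 1):
    y = xn + (n / (n + alpha)) * (xn - x_prev)
    x_prev, xn = xn, y - h * grad_F(y)
nest_val = F(xn)

# Nesterov ends far below gradient descent, consistent with n^-4 vs n^-2
assert nest_val < gd_val
```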
SLIDE 15

Nesterov, continuous and discrete

Discretization of an ODE, Su, Boyd and Candès (15)
The scheme defined by

$$x_{n+1} = y_n - h\nabla F(y_n) \quad \text{with} \quad y_n = x_n + \frac{n}{n+\alpha}(x_n - x_{n-1}) \qquad (20)$$

is a discretization of a solution of

$$\ddot{x}(t) + \frac{\alpha}{t}\,\dot{x}(t) + \nabla F(x(t)) = 0 \qquad \text{(ODE)}$$

with $\dot{x}(t_0) = 0$: the motion of a solid in a potential field with a vanishing viscosity α/t.

Advantages of the discrete setting
1. A simpler Lyapunov analysis, better insight
2. Optimality of bounds
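A crude sketch of the ODE on a toy potential shows $t^2(F(x(t)) - F(x^*))$ staying bounded, consistent with the O(1/t²) rate; the integrator, step, horizon, and quadratic F are illustrative assumptions:

```python
# Semi-implicit Euler for  x'' + (alpha/t) x' + grad F(x) = 0,  x'(t0) = 0,
# with F(x) = 0.5 x^2 (x* = 0) and alpha = 3.
alpha, t0, dt, T = 3.0, 1.0, 1e-3, 50.0
F = lambda x: 0.5 * x * x
grad_F = lambda x: x

t, x, v = t0, 2.0, 0.0
max_scaled = 0.0
while t < T:
    v += dt * (-(alpha / t) * v - grad_F(x))   # acceleration from the ODE
    x += dt * v
    t += dt
    max_scaled = max(max_scaled, t**2 * F(x))

# t^2 (F(x(t)) - F*) stays bounded along the trajectory (here below the
# continuous Lyapunov bound t0^2 F(x0) + 0.5 |(alpha-1) x0|^2 = 2 + 8 = 10)
assert max_scaled <= 10.0
```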

SLIDE 16

Nesterov, Continuous vs discrete

$$\ddot{x}(t) + \frac{\alpha}{t}\,\dot{x}(t) + \nabla F(x(t)) = 0 \qquad \text{(ODE)}$$

Nesterov, Continuous
If F is convex and if α ≥ 3, the solution of (ODE) satisfies

$$F(x(t)) - F(x^*) = O\!\left(\frac{1}{t^2}\right) \qquad (21)$$

$$x_{n+1} = y_n - h\nabla F(y_n) \quad \text{with} \quad y_n = x_n + \frac{n}{n+\alpha}(x_n - x_{n-1})$$

Nesterov, Discrete
If F is convex and if α ≥ 3, the sequence $(x_n)_{n\ge 1}$ satisfies

$$F(x_n) - F(x^*) = O\!\left(\frac{1}{n^2}\right) \qquad (22)$$
SLIDE 17

Nesterov, Proof of the continuous theorem

We define

$$\mathcal{E}(t) = t^2\,(F(x(t)) - F(x^*)) + \frac{1}{2}\,\|(\alpha-1)(x(t) - x^*) + t\,\dot{x}(t)\|^2$$

Using (ODE) and the following convex inequality

$$F(x(t)) - F(x^*) \le \langle x(t) - x^*, \nabla F(x(t)) \rangle$$

we get

$$\mathcal{E}'(t) \le (3-\alpha)\,t\,(F(x(t)) - F(x^*)) \qquad (23)$$

1. If α ≥ 3, ∀t ≥ t₀: $t^2(F(x(t)) - F(x^*)) \le \mathcal{E}(t_0)$.
2. If α > 3: $\int_{t_0}^{+\infty} (\alpha-3)\,t\,(F(x(t)) - F(x^*))\,dt \le \mathcal{E}(t_0)$.

SLIDE 18

Nesterov, Proof of the discrete theorem

We define

$$E_n = n^2\,(F(x_n) - F(x^*)) + \frac{1}{2h}\,\|(\alpha-1)(x_n - x^*) + n(x_n - x_{n-1})\|^2$$

Using the definition of $(x_n)_{n\ge 1}$ and the following convex inequality

$$F(x_n) - F(x^*) \le \langle x_n - x^*, \nabla F(x_n) \rangle$$

we get

$$E_{n+1} - E_n \le (3-\alpha)\,n\,(F(x_n) - F(x^*)) \qquad (24)$$

1. If α ≥ 3, ∀n ≥ 1: $n^2(F(x_n) - F(x^*)) \le E_1$.
2. If α > 3: $\sum_{n\ge 1} (\alpha-3)\,n\,(F(x_n) - F(x^*)) \le E_1$.
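The discrete conclusion $n^2(F(x_n) - F(x^*)) \le E_1$ can be checked directly along the iterates. The quadratic F and the parameters below are illustrative assumptions:

```python
import numpy as np

# Check: for alpha >= 3,  n^2 (F(x_n) - F(x*)) <= E_1  for every n, where
# E_n = n^2 (F(x_n) - F*) + (1/2h) ||(alpha-1)(x_n - x*) + n (x_n - x_{n-1})||^2.
F = lambda x: 0.5 * np.dot(x, x)     # F(x) = 0.5 ||x||^2, x* = 0, L = 1
grad_F = lambda x: x
h, alpha = 0.5, 3.0                  # h < 1/L = 1
x_star = np.zeros(2)

x_prev = np.array([3.0, -4.0])
x = x_prev.copy()
E1, worst = None, 0.0
for n in range(1, 501):
    y = x + (n / (n + alpha)) * (x - x_prev)
    x_prev, x = x, y - h * grad_F(y)
    E_n = n**2 * F(x) + np.sum(((alpha - 1) * (x - x_star)
                                + n * (x - x_prev))**2) / (2 * h)
    if n == 1:
        E1 = E_n
    worst = max(worst, n**2 * F(x))

assert worst <= E1
```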

SLIDE 19

Nesterov, Proof of convergence rate

1. We define, for $(p, \xi, \lambda) \in \mathbb{R}^3$,

$$\mathcal{H}(t) = t^p\left(t^2\,(F(x(t)) - F(x^*)) + \frac{1}{2}\,\|\lambda(x(t) - x^*) + t\,\dot{x}(t)\|^2 + \frac{\xi}{2}\,\|x(t) - x^*\|^2\right)$$

2. We choose $(p, \xi, \lambda) \in \mathbb{R}^3$ depending on the hypotheses to ensure that $\mathcal{H}$ is bounded. $\mathcal{H}$ may not be non-increasing.

3. We deduce there is $A \in \mathbb{R}$ such that

$$t^{2+p}\,(F(x(t)) - F(x^*)) \le A - t^p\,\frac{\xi}{2}\,\|x(t) - x^*\|^2$$

4. If ξ ≥ 0 then $F(x(t)) - F(x^*) = O\!\left(\frac{1}{t^{p+2}}\right)$.

5. If ξ < 0 we must use condition L(γ) to conclude.

SLIDE 20

Nesterov, Example

Theorem (Su, Boyd, Candès (15))
If F is convex and α ≥ 3,

$$F(x(t)) - F(x^*) = O\!\left(\frac{1}{t^2}\right) \qquad (25)$$

Proof: p = 0, λ = α − 1, ξ = 0.

Theorem (Aujol, D., Rondepierre (18))
If F is convex, satisfies H(γ) and L(2), and has a unique minimizer,

$$F(x(t)) - F(x^*) = O\!\left(\frac{1}{t^{\frac{2\alpha\gamma}{\gamma+2}}}\right) \qquad (26)$$

Proof: $p = \frac{2\alpha\gamma}{\gamma+2} - 2$, $\lambda = \frac{2\alpha}{\gamma+2}$, $\xi = \lambda(\lambda + 1 - \alpha)$.