L'accélération de Nesterov est-elle vraiment une accélération ? (Is Nesterov's acceleration really an acceleration?)


  1. L'accélération de Nesterov est-elle vraiment une accélération ? (Is Nesterov's acceleration really an acceleration?) Jean-François Aujol (1), in collaboration with Vassilis Apidopoulos (1), Charles Dossal (2) and Aude Rondepierre (2). (1) IMB, Université de Bordeaux; (2) INSA Toulouse, IMT. 27 May 2019.

  2. Setting. Minimize a differentiable function: let F be a convex differentiable function from $\mathbb{R}^n$ to $\mathbb{R}$ whose gradient is L-Lipschitz, having at least one minimizer $x^*$. We want to build an efficient sequence to estimate
$$\arg\min_{x \in \mathbb{R}^n} F(x)$$

  3. Gradient descent. Explicit gradient descent: let F be a convex differentiable function from $\mathbb{R}^n$ to $\mathbb{R}$ whose gradient is L-Lipschitz, having at least one minimizer $x^*$. Gradient descent: for $h < \frac{2}{L}$,
$$x_{n+1} = x_n - h \nabla F(x_n)$$
The sequence $(x_n)_{n \in \mathbb{N}}$ converges to a minimizer of F, and
$$F(x_n) - F(x^*) \leq \frac{\|x_0 - x^*\|^2}{2hn}$$
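A minimal sketch of this scheme in Python (illustrative only; the quadratic test function and the step size are choices of mine, not from the talk):

```python
import numpy as np

# Explicit gradient descent on a toy convex quadratic F(x) = 0.5 x^T A x - b^T x,
# whose gradient A x - b is L-Lipschitz with L = lambda_max(A).
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 0.0])
F = lambda x: 0.5 * x @ A @ x - b @ x
gradF = lambda x: A @ x - b

L = np.linalg.eigvalsh(A).max()
h = 1.0 / L                      # any step h < 2/L guarantees convergence
x = np.zeros(2)
for n in range(200):
    x = x - h * gradF(x)         # x_{n+1} = x_n - h grad F(x_n)

x_star = np.linalg.solve(A, b)   # unique minimizer of this quadratic
print(F(x) - F(x_star))          # should respect the ||x_0 - x*||^2 / (2hn) bound
```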

  4. Gradient descent. Inertial gradient descent: let F be a convex differentiable function from $\mathbb{R}^n$ to $\mathbb{R}$ whose gradient is L-Lipschitz, having at least one minimizer $x^*$. Inertial gradient descent: for $h < \frac{1}{L}$,
$$y_n = x_n + \alpha (x_n - x_{n-1}), \qquad x_{n+1} = y_n - h \nabla F(y_n)$$
If $\alpha \in [0, 1]$, the sequence $(x_n)_{n \in \mathbb{N}}$ converges to a minimizer of F, and
$$F(x_n) - F(x^*) = O\left(\frac{1}{n}\right)$$
(see the sketch after slide 5, which covers both this scheme and Nesterov's).

  5. Nesterov inertial scheme. A specific class of inertial gradient schemes: Nesterov scheme, for $h < \frac{1}{L}$ and $\alpha \geq 3$:
$$y_n = x_n + \frac{n}{n+\alpha} (x_n - x_{n-1}), \qquad x_{n+1} = y_n - h \nabla F(y_n)$$
$$F(x_n) - F(x^*) = O\left(\frac{1}{n^2}\right)$$
Nesterov (84) proposes $\alpha = 3$.
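The schemes of slides 4 and 5 differ only in the momentum coefficient, so one Python sketch covers both (illustrative; the toy quadratic is the same one as above, a choice of mine):

```python
import numpy as np

# Same toy quadratic as in the gradient-descent sketch.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 0.0])
gradF = lambda x: A @ x - b
L = np.linalg.eigvalsh(A).max()

def inertial_gradient(gradF, x0, h, momentum, n_iter):
    """y_n = x_n + momentum(n) (x_n - x_{n-1});  x_{n+1} = y_n - h grad F(y_n)."""
    x_prev = x = x0
    for n in range(1, n_iter + 1):
        y = x + momentum(n) * (x - x_prev)
        x_prev, x = x, y - h * gradF(y)
    return x

# Slide 4: constant momentum alpha in [0, 1]           -> F(x_n) - F* = O(1/n).
# Slide 5: Nesterov momentum n/(n + alpha), alpha >= 3 -> F(x_n) - F* = O(1/n^2).
x_const = inertial_gradient(gradF, np.zeros(2), 1.0 / L, lambda n: 0.5, 500)
x_nest = inertial_gradient(gradF, np.zeros(2), 1.0 / L, lambda n: n / (n + 3.0), 500)
```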

  6. Introduction. Non-smooth convex optimization. [Figure: (a) input y, motion blur + noise (σ = 2); (b) convergence profiles of ISTA and FISTA over 300 iterations, on a logarithmic scale; (c) deconvolution with ISTA(300)+UDWT; (d) deconvolution with FISTA(300)+UDWT.]
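The figure compares ISTA and FISTA, the proximal counterparts of the gradient and Nesterov schemes for composite problems $\min_x f(x) + \lambda \|x\|_1$. A minimal sketch of both iterations (illustrative; it uses a random least-squares term rather than the deblurring problem of the figure):

```python
import numpy as np

# ISTA / FISTA sketch for min_x 0.5 ||A x - y||^2 + lam ||x||_1.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 100))
y = rng.standard_normal(50)
lam = 0.1
L = np.linalg.norm(A, 2) ** 2            # Lipschitz constant of the smooth part
soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)  # prox of t||.||_1

def ista(n_iter):
    x = np.zeros(100)
    for _ in range(n_iter):
        x = soft(x - A.T @ (A @ x - y) / L, lam / L)   # forward-backward step
    return x

def fista(n_iter, alpha=3.0):
    x_prev = x = np.zeros(100)
    for n in range(1, n_iter + 1):
        z = x + (n / (n + alpha)) * (x - x_prev)       # Nesterov extrapolation
        x_prev, x = x, soft(z - A.T @ (A @ z - y) / L, lam / L)
    return x
```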

  7. Introduction. Some questions: How do we prove these decay rates? What is the role of the inertial parameter α? Can we get more accurate rates than $O\left(\frac{1}{n^2}\right)$ with more information on F (i.e., assuming more than convexity of F)? Are these bounds tight? Is the Nesterov scheme really an acceleration of gradient descent?

  8. Outline. 1. Introduction. 2. Geometry assumptions. 3. Geometry for Nesterov inertial scheme. 4. ODEs and Lyapunov functions. 5. The non-differentiable setting. 6. Finite error analysis.

  9. Outline (recalled at the start of part 2: Geometry assumptions).

  10. Convex functions. Definition: F is a convex function if
$$\forall (x, y), \ \forall \lambda \in [0, 1], \quad F(\lambda x + (1 - \lambda) y) \leq \lambda F(x) + (1 - \lambda) F(y)$$
Properties of differentiable convex functions: F is bounded below by its affine approximation,
$$\forall (x, y), \quad F(y) \geq F(x) + \langle \nabla F(x), y - x \rangle$$
If $x \mapsto \nabla F(x)$ is L-Lipschitz, F is bounded above by its quadratic approximation,
$$\forall (x, y), \quad F(y) \leq F(x) + \langle \nabla F(x), y - x \rangle + \frac{L}{2} \|x - y\|^2$$
In particular, $F(x) - F(x^*) \leq \frac{L}{2} \|x - x^*\|^2$.

  11. Classical geometric assumptions. Strong convexity:
$$F(y) \geq F(x) + \langle \nabla F(x), y - x \rangle + \frac{\mu}{2} \|x - y\|^2$$
Strong minimizer ($x^*$ the minimizer of F):
$$\|x - x^*\|^2 \leq \frac{2}{\mu} \left( F(x) - F(x^*) \right)$$
In both cases, we have uniqueness of the minimizer. Remark: if F is strongly convex with L-Lipschitz gradient, then
$$\frac{\mu}{2} \|x - x^*\|^2 \leq F(x) - F(x^*) \leq \frac{L}{2} \|x - x^*\|^2$$
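A concrete instance (an example of mine, not on the slide): for the quadratic $F(x) = \frac{1}{2} \langle Ax, x \rangle$ with A symmetric positive definite, we have $x^* = 0$, $\mu = \lambda_{\min}(A)$, $L = \lambda_{\max}(A)$, and the sandwich above is exactly the Rayleigh quotient bound
$$\frac{\mu}{2} \|x\|^2 \leq \frac{1}{2} \langle Ax, x \rangle \leq \frac{L}{2} \|x\|^2$$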

  12. Refined geometric assumptions. Growth condition (sharpness): let $X^*$ be the set of minimizers of F. A function F satisfies condition $L(p)$ if there exists $K > 0$ such that for all $x \in \mathbb{R}^n$,
$$d(x, X^*)^p \leq K \left( F(x) - F(x^*) \right)$$
The smaller p, the sharper F. Remark: $L(2)$ together with uniqueness of the minimizer is equivalent to the strong minimizer condition. Remark: if F, convex with L-Lipschitz gradient, satisfies the growth condition $L(p)$ for some $p > 0$, then necessarily $p \geq 2$.

  13. Another geometrical condition. Flatness condition: let $X^*$ be the set of minimizers of F. F satisfies condition $H(\gamma)$ if for all $x \in \mathbb{R}^n$ and all $x^* \in X^*$,
$$F(x) - F(x^*) \leq \frac{1}{\gamma} \langle \nabla F(x), x - x^* \rangle$$
Flatness properties: if $(F - F^*)^{1/\gamma}$ is convex, then F satisfies $H(\gamma)$. If F satisfies $H(\gamma)$, then there exists $K_2 > 0$ such that
$$F(x) - F(x^*) \leq K_2 \, d(x, X^*)^{\gamma}$$
The hypothesis $H(\gamma)$ can be seen as a "flatness" condition on the function F, in the sense that it ensures that F is sufficiently flat (at least as flat as $x \mapsto |x|^{\gamma}$) in the neighborhood of its minimizers. The larger γ, the flatter F.
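As a concrete check (an example of mine): for $F(x) = \|x\|^{\gamma}$ with $\gamma > 1$, so that $X^* = \{0\}$ and $F^* = 0$, the gradient is $\nabla F(x) = \gamma \|x\|^{\gamma - 2} x$, hence
$$\langle \nabla F(x), x - x^* \rangle = \gamma \|x\|^{\gamma} = \gamma \left( F(x) - F^* \right)$$
so $H(\gamma)$ holds with equality; consistently, $(F - F^*)^{1/\gamma} = \|x\|$ is convex.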

  14. Flatness and growth geometrical conditions. Flatness and growth properties: if $F(x) = \|x - x^*\|^r$ with $r > 1$, then F satisfies $H(\gamma)$ for all $\gamma \in [1, r]$ and $L(p)$ for all $p \geq r$ (a direct computation is given below). If F satisfies $H(\gamma)$ and $L(p)$, then $p \geq \gamma$. If F satisfies $L(2)$ and $\nabla F$ is L-Lipschitz, then F satisfies $H(\gamma)$ for some $\gamma > 1$ depending on K and L. Remark: for explicit gradient descent, only the sharpness assumption $L(p)$ is used; for inertial methods, the flatness condition $H(\gamma)$ also plays a key role.
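The first property can be checked directly (my computation, for the H part): for $F(x) = \|x - x^*\|^r$ one has $\langle \nabla F(x), x - x^* \rangle = r \, \|x - x^*\|^r = r \left( F(x) - F^* \right)$, so for every $\gamma \in [1, r]$,
$$F(x) - F^* = \frac{1}{r} \langle \nabla F(x), x - x^* \rangle \leq \frac{1}{\gamma} \langle \nabla F(x), x - x^* \rangle$$
which is exactly $H(\gamma)$.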

  15. Łojasiewicz property and growth condition. Łojasiewicz property: a differentiable function $F : \mathbb{R}^n \to \mathbb{R}$ is said to have the Łojasiewicz property with exponent $\theta \in [0, 1)$ if, for any critical point $x^*$, there exist $c > 0$ and $\varepsilon > 0$ such that
$$\forall x \in B(x^*, \varepsilon), \quad \|\nabla F(x)\| \geq c \, |F(x) - F^*|^{\theta}$$
Lemma: let $F : \mathbb{R}^n \to \mathbb{R}$ be a convex differentiable function. Then F has the Łojasiewicz property with exponent $\theta \in [0, 1)$ if and only if F satisfies the growth condition $L(r)$ with $\theta = 1 - \frac{1}{r}$.
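For intuition (an example of mine): for $F(x) = |x|^r$ with $r \geq 2$, so that $F^* = 0$,
$$|F'(x)| = r \, |x|^{r-1} = r \left( |x|^r \right)^{1 - 1/r} = r \, |F(x) - F^*|^{1 - 1/r}$$
so the Łojasiewicz inequality holds with $\theta = 1 - \frac{1}{r}$ and $c = r$; and F satisfies the growth condition $L(r)$ with $K = 1$, since $d(x, X^*)^r = |x|^r = F(x) - F^*$.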

  16. Gradient descent and geometry. Growth condition: a function F satisfies condition $L(p)$ if there exists $K > 0$ such that for all $x \in \mathbb{R}^n$, $d(x, X^*)^p \leq K (F(x) - F(x^*))$. Theorem (Garrigos et al., 2017, gradient descent): if F satisfies condition $L(p)$ with $p > 2$, then
$$F(x_n) - F(x^*) = O\left(\frac{1}{n^{p/(p-2)}}\right)$$
If F satisfies condition $L(2)$, then there exists $a > 0$ such that
$$F(x_n) - F(x^*) = O\left(e^{-an}\right)$$
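A quick numerical sanity check of the sublinear regime (a sketch of mine, not from the talk): run gradient descent on the toy function $F(x) = |x|^p$ with $p = 4$, which satisfies $L(4)$, and compare the observed decay with the predicted $O(n^{-p/(p-2)}) = O(n^{-2})$.

```python
import numpy as np

# Gradient descent on F(x) = |x|**p, which satisfies the growth condition L(p);
# the theorem predicts F(x_n) - F* = O(n**(-p/(p-2))).
p = 4.0
F = lambda x: abs(x) ** p
gradF = lambda x: p * np.sign(x) * abs(x) ** (p - 1)

x, h = 1.0, 0.01               # start where grad F is Lipschitz; small fixed step
vals = []
for n in range(100000):
    x = x - h * gradF(x)       # explicit gradient step
    vals.append(F(x))

# Estimate the decay exponent from the tail of the log-log curve.
ns = np.arange(1, len(vals) + 1)
tail = slice(len(vals) // 2, None)
slope = np.polyfit(np.log(ns[tail]), np.log(np.array(vals)[tail]), 1)[0]
print(f"observed exponent {slope:.2f}, predicted -p/(p-2) = {-p / (p - 2):.2f}")
```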

  17. Geometric convergence of GD with L(2). We combine
$$F(x_n) - F(x^*) \leq \frac{\|x_0 - x^*\|^2}{2hn} \quad \text{and} \quad \|x - x^*\|^2 \leq K \left( F(x) - F(x^*) \right)$$
Since the algorithm has no memory, the first bound can be restarted from any iterate: for all $j \leq n$,
$$F(x_n) - F(x^*) \leq \frac{\|x_{n-j} - x^*\|^2}{2hj} \leq \frac{K}{2hj} \left( F(x_{n-j}) - F(x^*) \right)$$
Since $\frac{K}{2hj} \leq \frac{1}{2} \iff j \geq \frac{K}{h}$, taking $j \geq \frac{K}{h}$ gives
$$F(x_n) - F(x^*) \leq \frac{1}{2} \left( F(x_{n-j}) - F(x^*) \right)$$
Conclusion: the decay is geometric.

  18. Nesterov scheme for strongly convex functions. Nesterov inertial scheme: for $h < \frac{1}{L}$ and $\alpha_n = \frac{n}{n + \alpha}$ with $\alpha \geq 3$:
$$y_n = x_n + \alpha_n (x_n - x_{n-1}), \qquad x_{n+1} = y_n - h \nabla F(y_n)$$
$$F(x_n) - F(x^*) = O\left(\frac{1}{n^2}\right)$$
Nesterov scheme for strongly convex functions: for $h < \frac{1}{L}$ and $\rho = \frac{\mu}{L}$:
$$y_n = x_n + \frac{1 - \sqrt{\rho}}{1 + \sqrt{\rho}} (x_n - x_{n-1}), \qquad x_{n+1} = y_n - h \nabla F(y_n)$$
$$F(x_n) - F(x^*) = O\left((1 - \sqrt{\rho})^n\right)$$
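In the strongly convex variant the momentum is the constant $\frac{1 - \sqrt{\rho}}{1 + \sqrt{\rho}}$; here is a sketch on a quadratic where µ and L are the extreme eigenvalues (illustrative; the test problem is my choice):

```python
import numpy as np

# Nesterov's scheme for a strongly convex quadratic: mu and L are the extreme
# eigenvalues of A, rho = mu / L, and the momentum is constant.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 0.0])
gradF = lambda x: A @ x - b

eigs = np.linalg.eigvalsh(A)
mu, L = eigs.min(), eigs.max()
rho = mu / L
beta = (1 - np.sqrt(rho)) / (1 + np.sqrt(rho))   # constant momentum coefficient

x_prev = x = np.zeros(2)
for n in range(200):
    y = x + beta * (x - x_prev)
    x_prev, x = x, y - gradF(y) / L              # step h = 1/L
# Expected: F(x_n) - F* = O((1 - sqrt(rho))**n), i.e. linear (geometric) decay.
```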

  19. Outline (recalled at the start of part 3: Geometry for Nesterov inertial scheme).

  20. Back to the Nesterov scheme. State of the art. Nesterov scheme, for $h < \frac{1}{L}$ and $\alpha \geq 3$:
$$x_{n+1} = y_n - h \nabla F(y_n), \quad \text{with } y_n = x_n + \frac{n}{n + \alpha} (x_n - x_{n-1})$$
$$F(x_n) - F(x^*) = O\left(\frac{1}{n^2}\right)$$
Chambolle-Dossal (14) and Attouch-Peypouquet (15): $\alpha > 3$ implies convergence of $(x_n)_{n \geq 1}$ and $F(x_n) - F(x^*) = o\left(\frac{1}{n^2}\right)$. If $\alpha \leq 3$ (Apidopoulos et al. and Attouch et al. (17)):
$$F(x_n) - F(x^*) = O\left(\frac{1}{n^{2\alpha/3}}\right)$$

  21. Nesterov, with strong convexity. Theorem (Su, Boyd, Candès (15); Attouch, Cabot (17)): if F satisfies $L(2)$ and has a unique minimizer, then for all $\alpha > 0$,
$$F(x_n) - F(x^*) = O\left(\frac{1}{n^{2\alpha/3}}\right)$$

  22. Geometrical conditions. Growth condition: a function F satisfies condition $L(p)$ if there exists $K > 0$ such that for all $x \in \mathbb{R}^n$,
$$d(x, X^*)^p \leq K \left( F(x) - F(x^*) \right)$$
Flatness condition: F satisfies condition $H(\gamma)$ if for all $x \in \mathbb{R}^n$ and all $x^* \in X^*$,
$$F(x) - F(x^*) \leq \frac{1}{\gamma} \langle \nabla F(x), x - x^* \rangle$$

  23. Nesterov: flatness may improve the convergence rate. Theorem (Aujol et al. (18)): let F be a differentiable convex function whose gradient is L-Lipschitz, and suppose F satisfies $H(\gamma)$ with $\gamma > 1$. Then:
1. if $\alpha \leq 1 + \frac{2}{\gamma}$,
$$F(x_n) - F(x^*) = O\left(\frac{1}{n^{2\gamma\alpha/(\gamma+2)}}\right)$$
2. if $\alpha > 1 + \frac{2}{\gamma}$ (and thus in particular if $\alpha = 3$, since $\gamma > 1$),
$$F(x_n) - F(x^*) = o\left(\frac{1}{n^2}\right)$$
and the sequence $(x_n)_{n \geq 1}$ converges. Moreover, if F satisfies $L(2)$, then there exists $\gamma > 1$ such that F satisfies $H(\gamma)$.
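An illustrative experiment for the first regime (a sketch of mine; the step size, horizon, and tightness of the comparison are not guaranteed): run the Nesterov scheme on $F(x) = |x|^{\gamma}$, which satisfies $H(\gamma)$, with a small $\alpha \leq 1 + \frac{2}{\gamma}$, and compare the observed decay with the exponent $2\gamma\alpha/(\gamma+2)$ from the bound.

```python
import numpy as np

# Nesterov iterations on F(x) = |x|**gamma (which satisfies H(gamma)),
# with alpha <= 1 + 2/gamma, compared against the rate 2*gamma*alpha/(gamma+2).
gamma, alpha, h = 4.0, 1.0, 0.01
F = lambda x: abs(x) ** gamma
gradF = lambda x: gamma * np.sign(x) * abs(x) ** (gamma - 1)

x_prev = x = 1.0
vals = []
for n in range(1, 200001):
    y = x + (n / (n + alpha)) * (x - x_prev)     # inertial extrapolation
    x_prev, x = x, y - h * gradF(y)
    vals.append(F(x))

ns = np.arange(1, len(vals) + 1)
tail = slice(len(vals) // 2, None)
slope = np.polyfit(np.log(ns[tail]), np.log(np.array(vals)[tail]), 1)[0]
print(f"observed n-exponent {slope:.2f}; "
      f"bound 2*gamma*alpha/(gamma+2) = {2 * gamma * alpha / (gamma + 2):.2f}")
```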

  24. Nesterov: flatness may improve the convergence rate. [Figure: decay rate $r(\alpha, \gamma) = \frac{2\alpha\gamma}{\gamma+2}$ as a function of $\alpha$, for $\alpha \leq 1 + \frac{2}{\gamma}$, when F satisfies $H(\gamma)$, for four values of γ: $\gamma_1 = 1.5$ (dashed line), $\gamma_2 = 2$ (solid line), $\gamma_3 = 3$ (dotted line), $\gamma_4 = 5$ (dash-dotted line).]
