Restarting accelerated gradient methods with a rough strong - PowerPoint PPT Presentation

Setup Restarting FISTA Restarting APPROX Adaptive restart Restarting accelerated gradient methods with a rough strong convexity estimate Olivier Fercoq Joint work with Zheng Qu 20 March 2017 1/28

Setup Restarting FISTA Restarting APPROX Adaptive restart Minimisation of composite functions Minimise the “strongly” convex composite function F x ∈ R N { F ( x ) = f ( x ) + ψ ( x ) } min • f : R N → R , convex, differentiable, with L -Lipschitz gradient • ψ : R N → R ∪ { + ∞} , convex, with simple proximal operator y ∈ R N ψ ( y ) + 1 2 � x − y � 2 prox ψ ( x ) = arg min L • F = f + g features some kind of strong convexity 2/28

Setup Restarting FISTA Restarting APPROX Adaptive restart The local error bound property Let X ∗ be the set of optimal solutions such that ∀ x ∗ ∈ X ∗ , ∀ x ∈ R n , F ∗ = F ( x ∗ ) ≤ F ( x ). Assumption There exists s > 0 and µ F ( s ) > 0 such that if dist L ( x , X ∗ ) ≤ s , F ( x ) ≥ F ∗ + µ F ( s ) dist L ( x , X ∗ ) 2 2 Examples: - F ( x ) = φ ( Ax ) with ∇ 2 φ ( x ) > 0, ∀ x 2 � Ax − b � 2 + λ � x � 1 - F ( x ) = 1 Local error bound for s > 0 ⇒ local error bound ∀ compact set 3/28

Setup Restarting FISTA Restarting APPROX Adaptive restart Algorithms: FISTA Choose x 0 ∈ dom ψ . Set θ 0 = 1 and z 0 = x 0 . for k ≥ 0 do y k = (1 − θ k ) x k + θ k z k � �∇ f ( y k ) , x − y k � + 1 2 � x − y k � 2 � x k +1 = arg min x ∈ R N L + ψ ( x ) z k +1 = z k + 1 θ k ( x k +1 − y k ) √ θ 4 k +4 θ 2 k − θ 2 θ k +1 = k 2 end for 4/28

Setup Restarting FISTA Restarting APPROX Adaptive restart Algorithms: APG Choose x 0 ∈ dom ψ . Set θ 0 = 1 and z 0 = x 0 . for k ≥ 0 do y k = (1 − θ k ) x k + θ k z k �∇ f ( y k ) , z − y k � + θ k � 2 � z − z k � 2 � z k +1 = arg min z ∈ R N L + ψ ( z ) x k +1 = y k + θ k ( z k +1 − z k ) √ θ 4 k +4 θ 2 k − θ 2 θ k +1 = k 2 end for 5/28

Setup Restarting FISTA Restarting APPROX Adaptive restart Algorithms: APPROX Choose x 0 ∈ dom ψ . Set θ 0 = τ n and z 0 = x 0 . for k ≥ 0 do y k = (1 − θ k ) x k + θ k z k Randomly generate S k ∼ ˆ S for i ∈ S k do k � + θ k nv i z i �∇ i f ( y k ) , z − y i 2 τ | z − z i k | 2 + ψ i ( z ) � � k +1 = arg min z ∈ R ni end for x k +1 = y k + n τ θ k ( z k +1 − z k ) √ θ 4 k +4 θ 2 k − θ 2 θ k +1 = k 2 end for 6/28

Setup Restarting FISTA Restarting APPROX Adaptive restart Accelerated gradient methods µ F = 0 µ F > 0 is known FISTA Beck & Teboulle Vandenberghe APG Nesterov Nesterov dual APG Nesterov Nesterov APPROX Fercoq & Richt´ arik Lin, Lu & Xiao O (1 − √ µ F ) k ) O (1 / k 2 ) The algorithms that guarantee linear convergence depend explicitly on µ F (e.g. θ k = √ µ F , ∀ k ) 7/28

Setup Restarting FISTA Restarting APPROX Adaptive restart Restart when µ F is known Proposition (Nesterov: Conditional restarting at x k ) Let ( x k , z k ) be the iterates of FISTA. We have F ( x k ) − F ( x ∗ ) ≤ θ 2 k − 1 ( F ( x 0 ) − F ( x ∗ )) . µ F Moreover, given α < 1 , if � 1 k ≥ 2 − 1 , αµ F then F ( x k ) − F ( x ∗ ) ≤ α ( F ( x 0 ) − F ( x ∗ )) . 8/28

Setup Restarting FISTA Restarting APPROX Adaptive restart FISTA with restart Choose x 0 ∈ dom ψ . Set θ 0 = 1 and z 0 = x 0 . for k ≥ 0 do y k = (1 − θ k ) x k + θ k z k �∇ f ( y k ) , x − y k � + 1 � 2 � x − y k � 2 � x k +1 = arg min x ∈ R N L + ψ ( x ) z k +1 = z k + 1 θ k ( x k +1 − y k ) √ θ 4 k +4 θ 2 k − θ 2 θ k +1 = k 2 � � � 1 if k ≡ 0 mod 2 αµ F − 1 then θ k +1 = θ 0 z k +1 = x k +1 end if end for Issue: the algorithm still depends on µ F 9/28

Setup Restarting FISTA Restarting APPROX Adaptive restart Methods when µ F is not known • Dual APG with adaptive restart [Nesterov] 1. Start with x 0 and an estimate µ of µ F . 2. Perform periodic restart as if µ were smaller than µ F 3. If the “gradient” is not small enough at the time of restart, decrease µ and go back to step 1. → Annoying transient phase (go back to x 0 ) • Heuristic adaptive restart [O’Donoghue & Candes] - If F ( x k +1 ) > F ( x k ), then restart → Works well in practice but no guarantee 10/28

Setup Restarting FISTA Restarting APPROX Adaptive restart Our goal • Perform periodic restart with an arbitrary frequency • Show convergence at a linear rate • Result for FISTA, APG and APPROX 11/28

Setup Restarting FISTA Restarting APPROX Adaptive restart Complexity without restart Proposition The iterates of FISTA and APG satisfy for all k ≥ 1 , 1 ( F ( x k ) − F ∗ ) + 1 2 dist L ( z k , X ∗ ) 2 ≤ 1 2 dist L ( x 0 , X ∗ ) 2 θ 2 k − 1 1 2 dist L ( x k , X ∗ ) 2 ≤ 1 2 dist L ( x 0 , X ∗ ) 2 → First inequality is a direct consequence of classical results using dist L ( z k , X ∗ ) ≤ � z k − x ∗ � L → The second is a stability result 12/28

Setup Restarting FISTA Restarting APPROX Adaptive restart Unconditional restarting Theorem (Restarting for FISTA and APG) Let ( x k , z k ) be the iterates of FISTA or APG. Let σ ∈ [0 , 1] and ¯ x k = (1 − σ ) x k + σ z k . We have for µ F = µ F (dist L ( x 0 , X ∗ )) , 1 x k , X ∗ ) 2 ≤ 1 � σ, 1 − σµ F � dist L ( x 0 , X ∗ ) 2 2 dist L (¯ 2 max θ 2 k − 1 13/28

Setup Restarting FISTA Restarting APPROX Adaptive restart Proof 1 x k , X ∗ ) 2 ≤ 1 − σ dist L ( x k , X ∗ ) 2 + σ 2 dist L ( z k , X ∗ ) 2 2 dist L (¯ 2 definition of ¯ x k = (1 − σ ) x k + σ z k 14/28

Setup Restarting FISTA Restarting APPROX Adaptive restart Proof 1 x k , X ∗ ) 2 ≤ 1 − σ dist L ( x k , X ∗ ) 2 + σ 2 dist L ( z k , X ∗ ) 2 2 dist L (¯ 2 2 dist( x k , X ∗ ) 2 + θ 2 � 1 1 − σ − σµ F σ � µ F � 2dist( x k , X ∗ ) 2 + dist( z k , X ∗ ) 2 � k − 1 = θ 2 θ 2 2 k − 1 k − 1 Rearrange 14/28

Setup Restarting FISTA Restarting APPROX Adaptive restart Proof 1 x k , X ∗ ) 2 ≤ 1 − σ dist L ( x k , X ∗ ) 2 + σ 2 dist L ( z k , X ∗ ) 2 2 dist L (¯ 2 2 dist( x k , X ∗ ) 2 + θ 2 � 1 1 − σ − σµ F σ � µ F � 2dist( x k , X ∗ ) 2 + dist( z k , X ∗ ) 2 � k − 1 = θ 2 θ 2 2 k − 1 k − 1 F ( x k ) − F ∗ + θ 2 0 , 1 − σ − σµ F � 1 σ � 2dist( x k , X ∗ ) 2 + � dist( z k , X ∗ ) 2 � k − 1 ≤ max θ 2 θ 2 2 k − 1 k − 1 max(0 , x ) ≥ x and local error bound 14/28

Setup Restarting FISTA Restarting APPROX Adaptive restart Proof 1 x k , X ∗ ) 2 ≤ 1 − σ dist L ( x k , X ∗ ) 2 + σ 2 dist L ( z k , X ∗ ) 2 2 dist L (¯ 2 2 dist( x k , X ∗ ) 2 + θ 2 � 1 1 − σ − σµ F σ � µ F � 2dist( x k , X ∗ ) 2 + dist( z k , X ∗ ) 2 � k − 1 = θ 2 θ 2 2 k − 1 k − 1 F ( x k ) − F ∗ + θ 2 0 , 1 − σ − σµ F � 1 σ � 2dist( x k , X ∗ ) 2 + � dist( z k , X ∗ ) 2 � k − 1 ≤ max θ 2 θ 2 2 k − 1 k − 1 1 0 , 1 − σ − σµ F � 1 2 dist L ( x 0 , X ∗ ) 2 + σ � x k , X ∗ ) 2 ≤ max 2 dist L ( x 0 , X ∗ ) 2 2 dist L (¯ θ 2 k − 1 σ, 1 − σµ F � 1 � 2 dist L ( x 0 , X ∗ ) 2 = max θ 2 k − 1 Complexity of FISTA/APG + stability 14/28

Setup Restarting FISTA Restarting APPROX Adaptive restart Nb iters to reach F ( x k ) − F ( x ∗ ) ≤ 10 − 10 min x ∈ R N 1 2 � Ax − b � 2 2 + λ � x � 1 , N = 4 (iris dataset) 10 − 3 10 − 4 10 − 5 10 − 6 10 − 8 µ est 1 0.1 0.01 Dual APG with 447 398 265 162 163 163 163 156 adaptive restart FISTA- µ 751 352 170 173 264 291 277 277 FISTA restarted: at x , Proposition 751 687 297 160 198 278 278 278 at ¯ x , Theorem 633 274 168 211 278 278 278 278 if F ( x k +1 ) > F ( x k ) 121 APG- µ 751 351 340 882 2580 7453 > 1e4 > 1e4 APG restarted: at x , Proposition 751 684 297 189 311 894 1471 4488 at ¯ x , Theorem 632 275 173 281 794 1310 3977 > 1e4 if F ( x k +1 ) > F ( x k ) > 1e4 751: Proximal gradient > 1e4 : APG 15/28

Setup Restarting FISTA Restarting APPROX Adaptive restart Restarting Accelerated coordinate descent Expected separable overapproximation ( E [ | ˆ S | ] = τ ): � � S ] )] ≤ F ( x k ) + τ �∇ f ( x k ) , h � + 1 2 � h � 2 E [ F ( x + h [ ˆ v n Choose x 0 ∈ dom ψ . Set θ 0 = τ n and z 0 = x 0 . for k ≥ 0 do y k = (1 − θ k ) x k + θ k z k Randomly generate S k ∼ ˆ S for i ∈ S k do k � + θ k nv i k | 2 + ψ i ( z ) z i � �∇ i f ( y k ) , z − y i 2 τ | z − z i � k +1 = arg min z ∈ R ni end for x k +1 = y k + n τ θ k ( z k +1 − z k ) √ θ 4 k +4 θ 2 k − θ 2 θ k +1 = k 2 end for 16/28

Setup Restarting FISTA Restarting APPROX Adaptive restart Complexity of APPROX without restart ∆( x ) = 1 − θ 0 ( F ( x ) − F ∗ ) + 1 dist v ( x , X ∗ ) 2 θ 2 2 θ 2 0 0 Proposition The iterates of APPROX satisfy for all k ≥ 1 , 1 ( F ( x k ) − F ∗ ) + 1 � dist v ( z k , X ∗ ) 2 � E ≤ ∆( x 0 ) θ 2 2 θ 2 0 k − 1 k γ i E [ F ( x i ) − F ∗ ] + 1 − θ 0 � k E [∆( x k )] ≤ ∆( x 0 ) − E [ F ( x k ) − F ∗ ] θ 2 θ 2 i − 1 0 i =0 where γ i i γ i i γ i k ≥ 0 , � k = 1 and x k = � k z i 17/28

Restarting accelerated gradient methods with a rough strong - PowerPoint PPT Presentation

Setup Restarting FISTA Restarting APPROX Adaptive restart Restarting accelerated gradient methods with a rough strong convexity estimate Olivier Fercoq Joint work with Zheng Qu 20 March 2017 1/28 Setup Restarting FISTA Restarting APPROX

Rough paths methods 1: Introduction Samy Tindel Purdue University University of Aarhus 2016

Rough paths methods 1: Introduction Samy Tindel University of Lorraine at Nancy KU - Probability

Accelerated Reader What is Accelerated Reader? Accelerated Reader is the number one software

Super League The strategy of restarting and rebranding

Gradient Analysis NMDS Indirect Gradient Analysis NMDS Direct Gradient Analysis Objective:

Conjugate Gradient (CG) Majid Lesani Alireza Masoum Overview Backpropagation Gradient

Applied Machine Learning Gradient Descent Methods Siamak Ravanbakhsh COMP 551 (Fall 2020)

T urbocharging Monte Carlo pricing under rough volatility Mikko Pakkanen Department of

NVGRAPH,FIREHOSE,PAGERANK GPU ACCELERATED ANALYTICS NOV 2016 Joe Eaton Ph.D. Accelerated

How to use Gradient and Multi-Texture 1. Many situations, we need use the gradient texture for our

CSC2541 Lecture 5 Natural Gradient Roger Grosse Roger Grosse CSC2541 Lecture 5 Natural Gradient

CS 6316 Machine Learning Gradient Descent Yangfeng Ji Department of Computer Science University

Accelerated primal-dual methods for linearly constrained convex problems Yangyang Xu SIAM

Rough Case Rough Case Study Plan Study Plan Responsible NORAD institutions Funding

The microstructural foundations of rough volatility Omar El Euch and Mathieu Rosenbaum Ecole

= Pull- -Off Force Off Force JKR Pull- -Off Off Pull Pull 2 0 . 6 / F W d F

Basics of Numerical Optimization: Iterative Methods Ju Sun Computer Science & Engineering

Overview Motivation and Introduction Solving CMPs A heuristic Application Implementation

The moment-LP and moment-SOS approaches Jean B. Lasserre LAAS-CNRS and Institute of Mathematics,

Lecture 4.4: Finitely generated abelian groups Matthew Macauley Department of Mathematical

Linear Convergence of Randomized Primal-Dual Coordinate Method for Large-scale Linear Constrained

CS 744: PYTORCH Shivaram Venkataraman Fall 2020 ADMINISTRIVIA week ) ( Monday 10/5 next

Anderson Accelerated Douglas-Rachford Splitting Anqi Fu Junzi Zhang Stephen Boyd EE & ICME

Classification of Poincar e inequalities and PI-rectifiablity Sylvester ErikssonBique

Restarting accelerated gradient methods with a rough strong - PowerPoint PPT Presentation

Setup Restarting FISTA Restarting APPROX Adaptive restart Restarting accelerated gradient methods with a rough strong convexity estimate Olivier Fercoq Joint work with Zheng Qu 20 March 2017 1/28 Setup Restarting FISTA Restarting APPROX

Rough paths methods 1: Introduction Samy Tindel Purdue University University of Aarhus 2016

Rough paths methods 1: Introduction Samy Tindel University of Lorraine at Nancy KU - Probability

Accelerated Reader What is Accelerated Reader? Accelerated Reader is the number one software

Super League The strategy of restarting and rebranding

Gradient Analysis NMDS Indirect Gradient Analysis NMDS Direct Gradient Analysis Objective:

Conjugate Gradient (CG) Majid Lesani Alireza Masoum Overview Backpropagation Gradient

Applied Machine Learning Gradient Descent Methods Siamak Ravanbakhsh COMP 551 (Fall 2020)

T urbocharging Monte Carlo pricing under rough volatility Mikko Pakkanen Department of

NVGRAPH,FIREHOSE,PAGERANK GPU ACCELERATED ANALYTICS NOV 2016 Joe Eaton Ph.D. Accelerated

How to use Gradient and Multi-Texture 1. Many situations, we need use the gradient texture for our

CSC2541 Lecture 5 Natural Gradient Roger Grosse Roger Grosse CSC2541 Lecture 5 Natural Gradient

CS 6316 Machine Learning Gradient Descent Yangfeng Ji Department of Computer Science University

Accelerated primal-dual methods for linearly constrained convex problems Yangyang Xu SIAM

Rough Case Rough Case Study Plan Study Plan Responsible NORAD institutions Funding

The microstructural foundations of rough volatility Omar El Euch and Mathieu Rosenbaum Ecole

= Pull- -Off Force Off Force JKR Pull- -Off Off Pull Pull 2 0 . 6 / F W d F

Basics of Numerical Optimization: Iterative Methods Ju Sun Computer Science &amp; Engineering

Overview Motivation and Introduction Solving CMPs A heuristic Application Implementation

The moment-LP and moment-SOS approaches Jean B. Lasserre LAAS-CNRS and Institute of Mathematics,

Lecture 4.4: Finitely generated abelian groups Matthew Macauley Department of Mathematical

Linear Convergence of Randomized Primal-Dual Coordinate Method for Large-scale Linear Constrained

CS 744: PYTORCH Shivaram Venkataraman Fall 2020 ADMINISTRIVIA week ) ( Monday 10/5 next

Anderson Accelerated Douglas-Rachford Splitting Anqi Fu Junzi Zhang Stephen Boyd EE &amp; ICME

Classification of Poincar e inequalities and PI-rectifiablity Sylvester ErikssonBique

Basics of Numerical Optimization: Iterative Methods Ju Sun Computer Science & Engineering

Anderson Accelerated Douglas-Rachford Splitting Anqi Fu Junzi Zhang Stephen Boyd EE & ICME