Restarting accelerated gradient methods with a rough strong convexity estimate

Olivier Fercoq, joint work with Zheng Qu. 20 March 2017. Sections: Setup, Restarting FISTA, Restarting APPROX, Adaptive restart.


  1. Restarting accelerated gradient methods with a rough strong convexity estimate. Olivier Fercoq, joint work with Zheng Qu. 20 March 2017.

  2. Minimisation of composite functions

     Minimise the “strongly” convex composite function $F$:

     $$\min_{x \in \mathbb{R}^N} \{ F(x) = f(x) + \psi(x) \}$$

     • $f : \mathbb{R}^N \to \mathbb{R}$, convex, differentiable, with $L$-Lipschitz gradient
     • $\psi : \mathbb{R}^N \to \mathbb{R} \cup \{+\infty\}$, convex, with simple proximal operator
       $$\operatorname{prox}_\psi(x) = \arg\min_{y \in \mathbb{R}^N} \; \psi(y) + \frac{1}{2}\|x - y\|_L^2$$
     • $F = f + \psi$ features some kind of strong convexity
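For the $\ell_1$ penalty that appears in the examples of this deck, the proximal operator has a closed form (soft-thresholding). The following is a minimal sketch, assuming $\psi(y) = \lambda\|y\|_1$ and a scalar-weighted norm $\|x\|_L^2 = L\|x\|_2^2$; the function name `prox_l1` is hypothetical and not taken from the slides.

```python
def prox_l1(x, lam, L):
    """Prox of psi(y) = lam * ||y||_1 in the L-weighted norm.

    Solves argmin_y lam*||y||_1 + (L/2)*||x - y||_2^2 coordinate by
    coordinate (assumption: ||.||_L^2 = L * ||.||_2^2 with scalar L > 0).
    """
    t = lam / L  # soft-thresholding level
    return [xi - t if xi > t else xi + t if xi < -t else 0.0 for xi in x]


print(prox_l1([3.0, -0.5, 1.0], lam=2.0, L=2.0))  # -> [2.0, 0.0, 0.0]
```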

  3. The local error bound property

     Let $X^*$ be the set of optimal solutions, so that $\forall x^* \in X^*$, $\forall x \in \mathbb{R}^n$, $F^* = F(x^*) \le F(x)$.

     Assumption. There exist $s > 0$ and $\mu_F(s) > 0$ such that if $\operatorname{dist}_L(x, X^*) \le s$, then

     $$F(x) \ge F^* + \frac{\mu_F(s)}{2} \operatorname{dist}_L(x, X^*)^2$$

     Examples:
     - $F(x) = \varphi(Ax)$ with $\nabla^2 \varphi(x) > 0$, $\forall x$
     - $F(x) = \frac{1}{2}\|Ax - b\|_2^2 + \lambda\|x\|_1$

     Local error bound for some $s > 0$ ⇒ local error bound on every compact set

  4. Algorithms: FISTA

     Choose $x_0 \in \operatorname{dom} \psi$. Set $\theta_0 = 1$ and $z_0 = x_0$.
     for $k \ge 0$ do
         $y_k = (1 - \theta_k) x_k + \theta_k z_k$
         $x_{k+1} = \arg\min_{x \in \mathbb{R}^N} \left\{ \langle \nabla f(y_k), x - y_k \rangle + \frac{1}{2}\|x - y_k\|_L^2 + \psi(x) \right\}$
         $z_{k+1} = z_k + \frac{1}{\theta_k}(x_{k+1} - y_k)$
         $\theta_{k+1} = \frac{\sqrt{\theta_k^4 + 4\theta_k^2} - \theta_k^2}{2}$
     end for
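The FISTA loop above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' code: the test problem is a hypothetical diagonal lasso $F(x) = \sum_i \frac{a_i}{2}(x_i - b_i)^2 + \lambda\|x\|_1$, the norm is taken as $\|x\|_L^2 = L\|x\|_2^2$ with $L = \max_i a_i$, and the names `soft` and `fista` are made up for the example.

```python
import math


def soft(v, t):
    """Scalar soft-thresholding: prox of t*|.| at v."""
    return v - t if v > t else v + t if v < -t else 0.0


def fista(a, b, lam, x0, iters):
    """FISTA for F(x) = sum_i a_i/2*(x_i-b_i)^2 + lam*||x||_1 (toy problem)."""
    L = max(a)  # Lipschitz constant of the gradient of the smooth part
    x, z, theta = list(x0), list(x0), 1.0
    for _ in range(iters):
        y = [(1 - theta) * xi + theta * zi for xi, zi in zip(x, z)]
        grad = [ai * (yi - bi) for ai, yi, bi in zip(a, y, b)]
        # Proximal gradient step from y with step size 1/L.
        x_new = [soft(yi - gi / L, lam / L) for yi, gi in zip(y, grad)]
        # z update uses theta_k, i.e. the value before it is refreshed.
        z = [zi + (xn - yi) / theta for zi, xn, yi in zip(z, x_new, y)]
        x = x_new
        theta = (math.sqrt(theta**4 + 4 * theta**2) - theta**2) / 2
    return x


# Minimiser of this toy problem is [0.95, -1.5] (soft-threshold b_i at lam/a_i).
print(fista([1.0, 0.1], [1.0, -2.0], 0.05, [0.0, 0.0], 2000))
```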

  5. Algorithms: APG

     Choose $x_0 \in \operatorname{dom} \psi$. Set $\theta_0 = 1$ and $z_0 = x_0$.
     for $k \ge 0$ do
         $y_k = (1 - \theta_k) x_k + \theta_k z_k$
         $z_{k+1} = \arg\min_{z \in \mathbb{R}^N} \left\{ \langle \nabla f(y_k), z - y_k \rangle + \frac{\theta_k}{2}\|z - z_k\|_L^2 + \psi(z) \right\}$
         $x_{k+1} = y_k + \theta_k (z_{k+1} - z_k)$
         $\theta_{k+1} = \frac{\sqrt{\theta_k^4 + 4\theta_k^2} - \theta_k^2}{2}$
     end for
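APG differs from FISTA in that the proximal step is taken on $z$ with the larger step $1/(\theta_k L)$, and $x$ is then obtained by interpolation. A minimal sketch on a hypothetical diagonal lasso problem, again assuming $\|x\|_L^2 = L\|x\|_2^2$; the names `soft` and `apg` are illustrative only.

```python
import math


def soft(v, t):
    """Scalar soft-thresholding: prox of t*|.| at v."""
    return v - t if v > t else v + t if v < -t else 0.0


def apg(a, b, lam, x0, iters):
    """APG for F(x) = sum_i a_i/2*(x_i-b_i)^2 + lam*||x||_1 (toy problem)."""
    L = max(a)
    x, z, theta = list(x0), list(x0), 1.0
    for _ in range(iters):
        y = [(1 - theta) * xi + theta * zi for xi, zi in zip(x, z)]
        grad = [ai * (yi - bi) for ai, yi, bi in zip(a, y, b)]
        step = 1.0 / (theta * L)  # prox step on z uses weight theta_k * L
        z_new = [soft(zi - step * gi, step * lam) for zi, gi in zip(z, grad)]
        x = [yi + theta * (zn - zi) for yi, zn, zi in zip(y, z_new, z)]
        z = z_new
        theta = (math.sqrt(theta**4 + 4 * theta**2) - theta**2) / 2
    return x


# Minimiser of this toy problem is [0.95, -1.5].
print(apg([1.0, 0.1], [1.0, -2.0], 0.05, [0.0, 0.0], 2000))
```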

  6. Algorithms: APPROX

     Choose $x_0 \in \operatorname{dom} \psi$. Set $\theta_0 = \frac{\tau}{n}$ and $z_0 = x_0$.
     for $k \ge 0$ do
         $y_k = (1 - \theta_k) x_k + \theta_k z_k$
         Randomly generate $S_k \sim \hat{S}$
         for $i \in S_k$ do
             $z_{k+1}^i = \arg\min_{z \in \mathbb{R}^{n_i}} \left\{ \langle \nabla_i f(y_k), z - y_k^i \rangle + \frac{\theta_k n v_i}{2\tau} |z - z_k^i|^2 + \psi_i(z) \right\}$
         end for
         $x_{k+1} = y_k + \frac{n}{\tau} \theta_k (z_{k+1} - z_k)$
         $\theta_{k+1} = \frac{\sqrt{\theta_k^4 + 4\theta_k^2} - \theta_k^2}{2}$
     end for
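A rough sketch of the APPROX loop for the simplest sampling, $\tau = 1$ (one coordinate drawn uniformly per iteration). Everything specific here is an assumption made for illustration: the diagonal lasso test problem, the choice $v_i = a_i$ (the coordinate-wise Lipschitz constants of that problem), the fixed random seed, and the names `approx_tau1` and `objective`. The full vectors $y_k$ and $x_{k+1}$ are formed explicitly for readability, which a practical implementation would avoid.

```python
import math
import random


def soft(v, t):
    """Scalar soft-thresholding: prox of t*|.| at v."""
    return v - t if v > t else v + t if v < -t else 0.0


def objective(a, b, lam, x):
    """F(x) = sum_i a_i/2*(x_i-b_i)^2 + lam*||x||_1."""
    smooth = sum(0.5 * ai * (xi - bi) ** 2 for ai, xi, bi in zip(a, x, b))
    return smooth + lam * sum(abs(xi) for xi in x)


def approx_tau1(a, b, lam, x0, iters, seed=0):
    """APPROX with tau = 1 and v_i = a_i on a diagonal lasso (toy problem)."""
    n = len(a)
    rng = random.Random(seed)
    x, z, theta = list(x0), list(x0), 1.0 / n  # theta_0 = tau / n
    for _ in range(iters):
        y = [(1 - theta) * xi + theta * zi for xi, zi in zip(x, z)]
        i = rng.randrange(n)              # S_k = {i}, uniform sampling
        g = a[i] * (y[i] - b[i])          # i-th partial derivative of f at y
        w = theta * n * a[i]              # theta_k * n * v_i / tau
        z_new_i = soft(z[i] - g / w, lam / w)
        x = list(y)                       # x_{k+1} = y_k + (n/tau)*theta_k*(z_{k+1}-z_k)
        x[i] += n * theta * (z_new_i - z[i])
        z[i] = z_new_i
        theta = (math.sqrt(theta**4 + 4 * theta**2) - theta**2) / 2
    return x


a, b, lam = [1.0, 0.1], [1.0, -2.0], 0.05
x = approx_tau1(a, b, lam, [0.0, 0.0], 5000)
print(x, objective(a, b, lam, x))  # optimal value of this toy problem is 0.13625
```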

  7. Accelerated gradient methods

     | Algorithm | $\mu_F = 0$ | $\mu_F > 0$ is known |
     |---|---|---|
     | FISTA | Beck & Teboulle | Vandenberghe |
     | APG | Nesterov | Nesterov |
     | Dual APG | Nesterov | Nesterov |
     | APPROX | Fercoq & Richtárik | Lin, Lu & Xiao |
     | Rate | $O(1/k^2)$ | $O((1 - \sqrt{\mu_F})^k)$ |

     The algorithms that guarantee linear convergence depend explicitly on $\mu_F$ (e.g. $\theta_k = \sqrt{\mu_F}$, $\forall k$).

  8. Restart when $\mu_F$ is known

     Proposition (Nesterov: Conditional restarting at $x_k$). Let $(x_k, z_k)$ be the iterates of FISTA. We have

     $$F(x_k) - F(x^*) \le \frac{\theta_{k-1}^2}{\mu_F} \left( F(x_0) - F(x^*) \right).$$

     Moreover, given $\alpha < 1$, if

     $$k \ge 2\sqrt{\frac{1}{\alpha \mu_F}} - 1,$$

     then $F(x_k) - F(x^*) \le \alpha \left( F(x_0) - F(x^*) \right)$.
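The second claim follows from the first because the sequence satisfies $\theta_k \le 2/(k+2)$. A small numerical check of this, as a sketch: the values of `mu_F` and `alpha` below are arbitrary illustrative choices, and the helper names are hypothetical.

```python
import math


def theta_sequence(n):
    """Return [theta_0, ..., theta_{n-1}] with theta_0 = 1."""
    thetas = [1.0]
    for _ in range(n - 1):
        t = thetas[-1]
        thetas.append((math.sqrt(t**4 + 4 * t**2) - t**2) / 2)
    return thetas


def restart_period(mu_F, alpha):
    """Smallest integer k with k >= 2*sqrt(1/(alpha*mu_F)) - 1."""
    return math.ceil(2 * math.sqrt(1 / (alpha * mu_F)) - 1)


mu_F, alpha = 0.01, 0.5                 # illustrative values only
K = restart_period(mu_F, alpha)         # 28 for these values
factor = theta_sequence(K)[K - 1] ** 2 / mu_F
print(K, factor, factor <= alpha)       # the factor theta_{K-1}^2 / mu_F is below alpha
```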

  9. FISTA with restart

     Choose $x_0 \in \operatorname{dom} \psi$. Set $\theta_0 = 1$ and $z_0 = x_0$.
     for $k \ge 0$ do
         $y_k = (1 - \theta_k) x_k + \theta_k z_k$
         $x_{k+1} = \arg\min_{x \in \mathbb{R}^N} \left\{ \langle \nabla f(y_k), x - y_k \rangle + \frac{1}{2}\|x - y_k\|_L^2 + \psi(x) \right\}$
         $z_{k+1} = z_k + \frac{1}{\theta_k}(x_{k+1} - y_k)$
         $\theta_{k+1} = \frac{\sqrt{\theta_k^4 + 4\theta_k^2} - \theta_k^2}{2}$
         if $k \equiv 0 \mod \left\lceil 2\sqrt{\frac{1}{\alpha \mu_F}} - 1 \right\rceil$ then
             $\theta_{k+1} = \theta_0$
             $z_{k+1} = x_{k+1}$
         end if
     end for

     Issue: the algorithm still depends on $\mu_F$
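A sketch of the restarted loop, with the restart period passed in as a plain integer so that it can come either from the formula above or from a rough guess. The diagonal lasso test problem, the convention $\|x\|_L^2 = L\|x\|_2^2$, and the names `soft` and `fista_restart` are assumptions made for the example.

```python
import math


def soft(v, t):
    """Scalar soft-thresholding: prox of t*|.| at v."""
    return v - t if v > t else v + t if v < -t else 0.0


def fista_restart(a, b, lam, x0, iters, period):
    """FISTA with periodic restart on a diagonal lasso (toy problem)."""
    L = max(a)
    x, z, theta = list(x0), list(x0), 1.0
    for k in range(iters):
        y = [(1 - theta) * xi + theta * zi for xi, zi in zip(x, z)]
        grad = [ai * (yi - bi) for ai, yi, bi in zip(a, y, b)]
        x_new = [soft(yi - gi / L, lam / L) for yi, gi in zip(y, grad)]
        z = [zi + (xn - yi) / theta for zi, xn, yi in zip(z, x_new, y)]
        x = x_new
        theta = (math.sqrt(theta**4 + 4 * theta**2) - theta**2) / 2
        if k % period == 0:        # restart: reset momentum at the current x
            theta, z = 1.0, list(x)
    return x


# Here mu_F = min(a)/max(a) = 0.1 in the L-weighted norm; alpha = 0.5 gives
# period = ceil(2*sqrt(1/(0.5*0.1)) - 1) = 8.
period = math.ceil(2 * math.sqrt(1 / (0.5 * 0.1)) - 1)
print(period, fista_restart([1.0, 0.1], [1.0, -2.0], 0.05, [0.0, 0.0], 400, period))
```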

  10. Methods when $\mu_F$ is not known

     • Dual APG with adaptive restart [Nesterov]
       1. Start with $x_0$ and an estimate $\mu$ of $\mu_F$.
       2. Perform periodic restart as if $\mu$ were smaller than $\mu_F$.
       3. If the “gradient” is not small enough at the time of restart, decrease $\mu$ and go back to step 1.
       → Annoying transient phase (go back to $x_0$)
     • Heuristic adaptive restart [O’Donoghue & Candes]
       - If $F(x_{k+1}) > F(x_k)$, then restart
       → Works well in practice but no guarantee
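The function-value heuristic in the second bullet can be sketched as follows. This is only an illustration on a hypothetical diagonal lasso problem, with $\|x\|_L^2 = L\|x\|_2^2$ assumed; `fista_adaptive` and `objective` are made-up names, and, as the slide notes, the scheme carries no convergence guarantee.

```python
import math


def soft(v, t):
    """Scalar soft-thresholding: prox of t*|.| at v."""
    return v - t if v > t else v + t if v < -t else 0.0


def objective(a, b, lam, x):
    """F(x) = sum_i a_i/2*(x_i-b_i)^2 + lam*||x||_1."""
    smooth = sum(0.5 * ai * (xi - bi) ** 2 for ai, xi, bi in zip(a, x, b))
    return smooth + lam * sum(abs(xi) for xi in x)


def fista_adaptive(a, b, lam, x0, iters):
    """FISTA that restarts whenever the objective increases (heuristic)."""
    L = max(a)
    x, z, theta = list(x0), list(x0), 1.0
    restarts = 0
    for _ in range(iters):
        y = [(1 - theta) * xi + theta * zi for xi, zi in zip(x, z)]
        grad = [ai * (yi - bi) for ai, yi, bi in zip(a, y, b)]
        x_new = [soft(yi - gi / L, lam / L) for yi, gi in zip(y, grad)]
        z = [zi + (xn - yi) / theta for zi, xn, yi in zip(z, x_new, y)]
        theta = (math.sqrt(theta**4 + 4 * theta**2) - theta**2) / 2
        if objective(a, b, lam, x_new) > objective(a, b, lam, x):
            theta, z = 1.0, list(x_new)   # F went up: drop the momentum
            restarts += 1
        x = x_new
    return x, restarts


x, restarts = fista_adaptive([1.0, 0.1], [1.0, -2.0], 0.05, [0.0, 0.0], 1000)
print(x, restarts)  # minimiser of this toy problem is [0.95, -1.5]
```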

  11. Our goal

     • Perform periodic restart with an arbitrary frequency
     • Show convergence at a linear rate
     • Result for FISTA, APG and APPROX

  12. Complexity without restart

     Proposition. The iterates of FISTA and APG satisfy, for all $k \ge 1$,

     $$\frac{1}{\theta_{k-1}^2} \left( F(x_k) - F^* \right) + \frac{1}{2} \operatorname{dist}_L(z_k, X^*)^2 \le \frac{1}{2} \operatorname{dist}_L(x_0, X^*)^2$$

     $$\frac{1}{2} \operatorname{dist}_L(x_k, X^*)^2 \le \frac{1}{2} \operatorname{dist}_L(x_0, X^*)^2$$

     → The first inequality is a direct consequence of classical results, using $\operatorname{dist}_L(z_k, X^*) \le \|z_k - x^*\|_L$
     → The second is a stability result

  13. Unconditional restarting

     Theorem (Restarting for FISTA and APG). Let $(x_k, z_k)$ be the iterates of FISTA or APG. Let $\sigma \in [0, 1]$ and $\bar{x}_k = (1 - \sigma) x_k + \sigma z_k$. We have, for $\mu_F = \mu_F(\operatorname{dist}_L(x_0, X^*))$,

     $$\frac{1}{2} \operatorname{dist}_L(\bar{x}_k, X^*)^2 \le \frac{1}{2} \max\left( \sigma, \; 1 - \frac{\sigma \mu_F}{\theta_{k-1}^2} \right) \operatorname{dist}_L(x_0, X^*)^2$$
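For any $\sigma \in (0, 1)$ and any $k \ge 1$ the factor in the theorem is strictly below 1, which is what allows restarting at an arbitrary frequency. A short sketch that evaluates the factor; the values of `sigma` and `mu_F` are arbitrary illustrative choices and `contraction_factor` is a hypothetical helper name.

```python
import math


def contraction_factor(k, sigma, mu_F):
    """max(sigma, 1 - sigma*mu_F/theta_{k-1}^2) for a restart after k iterations."""
    theta = 1.0                     # theta_0
    for _ in range(k - 1):          # advance to theta_{k-1}
        theta = (math.sqrt(theta**4 + 4 * theta**2) - theta**2) / 2
    return max(sigma, 1 - sigma * mu_F / theta**2)


for k in (1, 5, 20, 50):
    # Short periods still contract, only slowly; long periods saturate at sigma.
    print(k, contraction_factor(k, sigma=0.5, mu_F=0.01))
```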

  14-17. Proof

     By the definition $\bar{x}_k = (1 - \sigma) x_k + \sigma z_k$:

     $$\frac{1}{2} \operatorname{dist}_L(\bar{x}_k, X^*)^2 \le \frac{1 - \sigma}{2} \operatorname{dist}_L(x_k, X^*)^2 + \frac{\sigma}{2} \operatorname{dist}_L(z_k, X^*)^2$$

     Rearrange:

     $$= \left( 1 - \sigma - \frac{\sigma \mu_F}{\theta_{k-1}^2} \right) \frac{1}{2} \operatorname{dist}_L(x_k, X^*)^2 + \sigma \left( \frac{1}{\theta_{k-1}^2} \frac{\mu_F}{2} \operatorname{dist}_L(x_k, X^*)^2 + \frac{1}{2} \operatorname{dist}_L(z_k, X^*)^2 \right)$$

     Use $\max(0, x) \ge x$ and the local error bound:

     $$\le \max\left( 0, \; 1 - \sigma - \frac{\sigma \mu_F}{\theta_{k-1}^2} \right) \frac{1}{2} \operatorname{dist}_L(x_k, X^*)^2 + \sigma \left( \frac{1}{\theta_{k-1}^2} \left( F(x_k) - F^* \right) + \frac{1}{2} \operatorname{dist}_L(z_k, X^*)^2 \right)$$

     Use the complexity of FISTA/APG and stability:

     $$\le \max\left( 0, \; 1 - \sigma - \frac{\sigma \mu_F}{\theta_{k-1}^2} \right) \frac{1}{2} \operatorname{dist}_L(x_0, X^*)^2 + \frac{\sigma}{2} \operatorname{dist}_L(x_0, X^*)^2 = \max\left( \sigma, \; 1 - \frac{\sigma \mu_F}{\theta_{k-1}^2} \right) \frac{1}{2} \operatorname{dist}_L(x_0, X^*)^2$$

  18. Number of iterations to reach $F(x_k) - F(x^*) \le 10^{-10}$

     Problem: $\min_{x \in \mathbb{R}^N} \frac{1}{2}\|Ax - b\|_2^2 + \lambda\|x\|_1$, $N = 4$ (iris dataset)

     | $\mu_{\text{est}}$ | 1 | 0.1 | 0.01 | $10^{-3}$ | $10^{-4}$ | $10^{-5}$ | $10^{-6}$ | $10^{-8}$ |
     |---|---|---|---|---|---|---|---|---|
     | Dual APG with adaptive restart | 447 | 398 | 265 | 162 | 163 | 163 | 163 | 156 |
     | FISTA-$\mu$ | 751 | 352 | 170 | 173 | 264 | 291 | 277 | 277 |
     | FISTA restarted at $x$ (Proposition) | 751 | 687 | 297 | 160 | 198 | 278 | 278 | 278 |
     | FISTA restarted at $\bar{x}$ (Theorem) | 633 | 274 | 168 | 211 | 278 | 278 | 278 | 278 |
     | APG-$\mu$ | 751 | 351 | 340 | 882 | 2580 | 7453 | > 1e4 | > 1e4 |
     | APG restarted at $x$ (Proposition) | 751 | 684 | 297 | 189 | 311 | 894 | 1471 | 4488 |
     | APG restarted at $\bar{x}$ (Theorem) | 632 | 275 | 173 | 281 | 794 | 1310 | 3977 | > 1e4 |

     Restart if $F(x_{k+1}) > F(x_k)$ (no $\mu_{\text{est}}$ involved): FISTA 121, APG > 1e4.

     Legend: 751 is the iteration count of proximal gradient; > 1e4 is that of APG.

  19. Restarting accelerated coordinate descent

     Expected separable overapproximation ($\mathbb{E}[|\hat{S}|] = \tau$):

     $$\mathbb{E}\left[ F(x + h_{[\hat{S}]}) \right] \le F(x_k) + \frac{\tau}{n} \left( \langle \nabla f(x_k), h \rangle + \frac{1}{2}\|h\|_v^2 \right)$$

     Choose $x_0 \in \operatorname{dom} \psi$. Set $\theta_0 = \frac{\tau}{n}$ and $z_0 = x_0$.
     for $k \ge 0$ do
         $y_k = (1 - \theta_k) x_k + \theta_k z_k$
         Randomly generate $S_k \sim \hat{S}$
         for $i \in S_k$ do
             $z_{k+1}^i = \arg\min_{z \in \mathbb{R}^{n_i}} \left\{ \langle \nabla_i f(y_k), z - y_k^i \rangle + \frac{\theta_k n v_i}{2\tau} |z - z_k^i|^2 + \psi_i(z) \right\}$
         end for
         $x_{k+1} = y_k + \frac{n}{\tau} \theta_k (z_{k+1} - z_k)$
         $\theta_{k+1} = \frac{\sqrt{\theta_k^4 + 4\theta_k^2} - \theta_k^2}{2}$
     end for

  20. Complexity of APPROX without restart

     $$\Delta(x) = \frac{1 - \theta_0}{\theta_0^2} \left( F(x) - F^* \right) + \frac{1}{2\theta_0^2} \operatorname{dist}_v(x, X^*)^2$$

     Proposition. The iterates of APPROX satisfy, for all $k \ge 1$,

     $$\mathbb{E}\left[ \frac{1}{\theta_{k-1}^2} \left( F(x_k) - F^* \right) + \frac{1}{2\theta_0^2} \operatorname{dist}_v(z_k, X^*)^2 \right] \le \Delta(x_0)$$

     $$\mathbb{E}[\Delta(x_k)] \le \Delta(x_0) - \sum_{i=0}^{k} \frac{\gamma_k^i}{\theta_{i-1}^2} \mathbb{E}[F(x_i) - F^*] + \frac{1 - \theta_0}{\theta_0^2} \mathbb{E}[F(x_k) - F^*]$$

     where $\gamma_k^i \ge 0$, $\sum_i \gamma_k^i = 1$ and $x_k = \sum_i \gamma_k^i z_i$.
