Self-Concordant Analysis of Frank-Wolfe Algorithms



  1. Self-concordant analysis of Frank-Wolfe algorithms
Pavel Dvurechensky (1), Shimrit Shtern (2), Mathias Staudigl (3), Petr Ostroukhov (4), Kamil Safin (4)
(1) WIAS, (2) The Technion, (3) Maastricht University, (4) Moscow Institute of Physics and Technology
ICML 2020, July 12-July 18
Outline: Overview, Development of the algorithms, The Algorithms, Conclusion

  2. Self-concordant minimization
We consider the optimization problem
$\min_{x \in \mathcal{X}} f(x)$  (P)
where $\mathcal{X} \subset \mathbb{R}^n$ is convex and compact, and $f : \mathbb{R}^n \to (-\infty, \infty]$ is convex and thrice continuously differentiable on the open set $\mathrm{dom}\, f = \{x : f(x) < \infty\}$.
Given the large-scale nature of optimization problems in machine learning, first-order methods are the method of choice.

  3. Frank-Wolfe methods
Because of their great scalability and sparsity properties, Frank-Wolfe (FW) methods (Frank & Wolfe, 1956) have received a lot of attention in ML.
Convergence guarantees require Lipschitz continuous gradients, or finite curvature constants of f (Jaggi, 2013).
Even for well-conditioned (Lipschitz smooth and strongly convex) problems, only sublinear convergence rates are guaranteed in general.
[Figure: FW iterates $x^{(0)}, x^{(t)}, x^{(t+1)}$ moving toward $x^*$ along the search vertices $s_t$.]

  4. Many canonical ML problems do not have Lipschitz gradients
Portfolio optimization: $f(x) = -\sum_{t=1}^{T} \ln(\langle r_t, x\rangle)$, $x \in \mathcal{X} = \{x \in \mathbb{R}^n_{+} : \sum_{i=1}^{n} x_i = 1\}$.
Covariance estimation: $f(X) = -\ln(\det(X)) + \mathrm{tr}(\hat{\Sigma} X)$, $X \in \mathcal{X} = \{X \in \mathbb{R}^{n \times n}_{\mathrm{sym},+} : \|\mathrm{Vec}(X)\|_1 \le R\}$.
Poisson inverse problem: $f(x) = \sum_{i=1}^{m} \langle w_i, x\rangle - \sum_{i=1}^{m} y_i \ln(\langle w_i, x\rangle)$, $x \in \mathcal{X} = \{x \in \mathbb{R}^n : \|x\|_1 \le R\}$.
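For concreteness, here is a minimal NumPy sketch of the portfolio and Poisson objectives above; the function names and the data arguments `returns`, `W`, `y` are illustrative placeholders. The gradients blow up near the boundary of the feasible set, which is why no global Lipschitz constant for the gradient exists.

```python
import numpy as np

def portfolio_loss(x, returns):
    """f(x) = -sum_t ln(<r_t, x>); `returns` is the T x n matrix of returns r_t."""
    return -np.sum(np.log(returns @ x))

def portfolio_grad(x, returns):
    """grad f(x) = -sum_t r_t / <r_t, x>; it blows up as any <r_t, x> -> 0."""
    return -returns.T @ (1.0 / (returns @ x))

def poisson_loss(x, W, y):
    """f(x) = sum_i <w_i, x> - y_i ln(<w_i, x>); W is m x n, y the observed counts."""
    z = W @ x
    return np.sum(z - y * np.log(z))
```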

  5. Main Results
All these functions are self-concordant (SC) and have no Lipschitz continuous gradient, so the standard analysis does not apply.
Result 1: We give a unified analysis of provably convergent FW algorithms minimizing SC functions.
Result 2: Based on the theory of Local Linear Optimization Oracles (LLOO) (Lan, 2013; Garber & Hazan, 2016), we construct linearly convergent variants of our base algorithms.

  6. Vanilla FW
The analysis of FW involves
(a) a search direction $s(x) = \arg\min_{s \in \mathcal{X}} \langle \nabla f(x), s\rangle$;
(b) as merit function the gap function $\mathrm{gap}(x) = \langle \nabla f(x), x - s(x)\rangle$.
Standard Frank-Wolfe method: if $\mathrm{gap}(x_k) > \varepsilon$ then
1. obtain $s_k = s(x_k)$;
2. set $x_{k+1} = x_k + \alpha_k (s_k - x_k)$ for some $\alpha_k \in [0, 1]$.
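As a reference point, a minimal Python sketch of this template; `lmo`, `grad_f`, and `step_size` are user-supplied placeholders for a concrete feasible set (for the probability simplex, for instance, the LMO simply returns the vertex $e_i$ with $i = \arg\min_i [\nabla f(x)]_i$).

```python
import numpy as np

def frank_wolfe(grad_f, lmo, x0, step_size, eps=1e-6, max_iter=10_000):
    """Standard Frank-Wolfe loop driven by the gap function.

    lmo(g)  returns  argmin_{s in X} <g, s>          (search direction s(x))
    step_size(k, x, s, gap)  returns  alpha_k in [0, 1]
    """
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = grad_f(x)
        s = lmo(g)                       # s(x) = argmin_{s in X} <grad f(x), s>
        gap = g @ (x - s)                # gap(x) = <grad f(x), x - s(x)>
        if gap <= eps:                   # stopping test from the slide
            break
        alpha = step_size(k, x, s, gap)
        x = x + alpha * (s - x)          # stays feasible since alpha in [0, 1]
    return x
```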

  7. SC optimization: Definition of SC functions
Let $f : \mathbb{R}^n \to (-\infty, +\infty]$ be a $C^3(\mathrm{dom}\, f)$ convex function with $\mathrm{dom}\, f$ an open set in $\mathbb{R}^n$. Then $f$ is SC if
$|\varphi'''(t)| \le M\, \varphi''(t)^{3/2}$
for $\varphi(t) = f(x + tv)$, with $x \in \mathrm{dom}\, f$, $v \in \mathbb{R}^n$, and $x + tv \in \mathrm{dom}\, f$.
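For example, the univariate barrier $f(x) = -\ln(x)$ on $\mathrm{dom}\, f = (0, \infty)$, the building block of the portfolio and Poisson objectives above, is SC with $M = 2$: with $\varphi(t) = -\ln(x + tv)$ one has $\varphi''(t) = v^2/(x+tv)^2$ and $\varphi'''(t) = -2v^3/(x+tv)^3$, hence $|\varphi'''(t)| = 2|v|^3/(x+tv)^3 = 2\,\varphi''(t)^{3/2}$.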

  8. SC optimization: Self-concordant functions
Self-concordant (SC) functions have been developed within the field of interior-point methods (Nesterov & Nemirovski, 1994).
Starting with Bach (2010), they have gained a lot of interest in machine learning and statistics (see e.g. Tran-Dinh, Kyrillidis & Cevher; Sun & Tran-Dinh, 2018; Ostrovskii & Bach, 2018).
MATLAB toolbox: SCOPT.

  9. Adaptive Frank-Wolfe methods: Basic estimates of SC functions
For all $x, \tilde{x} \in \mathrm{dom}\, f$ we have the following bounds on function values:
$f(\tilde{x}) \ge f(x) + \langle \nabla f(x), \tilde{x} - x\rangle + \frac{4}{M^2}\,\omega(d(x, \tilde{x}))$
$f(\tilde{x}) \le f(x) + \langle \nabla f(x), \tilde{x} - x\rangle + \frac{4}{M^2}\,\omega_*(d(x, \tilde{x}))$
where $\omega(t) := t - \ln(1+t)$, $\omega_*(t) := -t - \ln(1-t)$, and $d(x, y) := \frac{M}{2}\|y - x\|_x = \frac{M}{2}\bigl(D^2 f(x)[y - x, y - x]\bigr)^{1/2}$.
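A small NumPy helper mirroring these quantities; this is a sketch that assumes the Hessian $\nabla^2 f(x)$ is available as a dense matrix `H`.

```python
import numpy as np

def omega(t):
    """omega(t) = t - ln(1 + t), defined for t > -1."""
    return t - np.log1p(t)

def omega_star(t):
    """omega*(t) = -t - ln(1 - t), defined for t < 1."""
    return -t - np.log1p(-t)

def local_distance(M, H, x, y):
    """d(x, y) = (M/2) * ||y - x||_x = (M/2) * sqrt(D^2 f(x)[y - x, y - x])."""
    v = y - x
    return 0.5 * M * np.sqrt(v @ H @ v)
```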

  10. Algorithm 1
Let $x^+_t = x + t(s(x) - x)$, $t > 0$. We obtain the non-Euclidean descent inequality
$f(x^+_t) \le f(x) + \langle \nabla f(x), x^+_t - x\rangle + \frac{4}{M^2}\,\omega_*(t\, e(x)) \le f(x) - \eta_x(t)$
for $t \in (0, 1/e(x))$, where $e(x) = \frac{M}{2}\|s(x) - x\|_x$.
Optimizing the per-iteration decrease with respect to $t$ leads to
$\alpha(x) = \min\{1, t(x)\}, \qquad t(x) = \frac{\mathrm{gap}(x)}{e(x)\bigl(\mathrm{gap}(x) + \frac{4}{M^2}\, e(x)\bigr)}$.
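A sketch of this step-size rule in Python, reusing the local-norm quantity $e(x)$; it assumes the Hessian `H` at the current iterate is available.

```python
import numpy as np

def adaptive_step(M, H, grad, x, s):
    """alpha(x) = min{1, t(x)} with t(x) = gap(x) / (e(x) * (gap(x) + (4/M^2) e(x))).

    Here e(x) = (M/2) * ||s(x) - x||_x is measured in the local Hessian norm.
    """
    d = s - x
    gap = grad @ (x - s)                 # gap(x) = <grad f(x), x - s(x)>
    e = 0.5 * M * np.sqrt(d @ H @ d)     # e(x) = (M/2) ||s(x) - x||_x
    if e == 0.0:                         # degenerate direction: full step is safe
        return 1.0
    t = gap / (e * (gap + 4.0 / M**2 * e))
    return min(1.0, t)
```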

  11. Iteration complexity
Define the approximation error $h_k = f(x_k) - f^*$. Let $S(x^0) = \{x \in \mathcal{X} : f(x) \le f(x^0)\}$ and $L_{\nabla f} = \max_{x \in S(x^0)} \lambda_{\max}(\nabla^2 f(x))$.
Theorem. For given $\varepsilon > 0$, define $N_\varepsilon(x^0) = \min\{k \ge 0 : h_k \le \varepsilon\}$. Then
$N_\varepsilon(x^0) \le \frac{1}{a}\ln\!\bigl(\frac{h_0}{b}\bigr) + (1 + \ln(2))\,\frac{L_{\nabla f}\,\mathrm{diam}(\mathcal{X})^2}{a\,\varepsilon}$,
where $a = \min\bigl\{\frac{1 - \ln(2)}{2},\, \frac{2(1 - \ln(2))}{M\sqrt{L_{\nabla f}}\,\mathrm{diam}(\mathcal{X})}\bigr\}$ and $b = \frac{1}{2}\, L_{\nabla f}\,\mathrm{diam}(\mathcal{X})^2$.

  12. Algorithm 2: Backtracking variant of FW
Let $Q(x_k, t, \mu) := f(x_k) - t\,\mathrm{gap}(x_k) + \frac{t^2 \mu}{2}\,\|s(x_k) - x_k\|^2$.
On $S(x^0) := \{x \in \mathcal{X} : f(x) \le f(x^0)\}$, we have $f(x_k + t(s_k - x_k)) \le Q(x_k, t, L_{\nabla f})$.
Problem: $L_{\nabla f}$ is hard to estimate and numerically large.
Solution: A backtracking procedure allows us to find a local estimate for the unknown $L_{\nabla f}$ (see also Pedregosa et al., 2020).

  13. Backtracking procedure to find the local Lipschitz constant
Algorithm 1: Function step(f, v, x, g, L)
  Choose $\gamma_u > 1$, $\gamma_d < 1$
  Choose $\mu \in [\gamma_d L, L]$
  $\alpha \leftarrow \min\{ g / (\mu \|v\|^2), 1 \}$
  while $f(x + \alpha v) > Q(x, \alpha, \mu)$ do
    $\mu \leftarrow \gamma_u \mu$
    $\alpha \leftarrow \min\{ g / (\mu \|v\|^2), 1 \}$
  end while
  Return $\alpha, \mu$
We have, for all $t \in [0, 1]$,
$f(x_{k+1}) \le f(x_k) - t\,\mathrm{gap}(x_k) + \frac{t^2 L_k}{2}\,\|s_k - x_k\|^2$,
where $L_k$ is obtained from Algorithm 1.
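A hedged Python sketch of this subroutine; the defaults for `gamma_u`, `gamma_d` and the cap on trials are illustrative choices, not taken from the paper.

```python
def step_backtracking(f, v, x, g, L, gamma_u=2.0, gamma_d=0.9, max_trials=100):
    """Backtracking search for a local Lipschitz estimate mu and step size alpha.

    v = s(x_k) - x_k is the FW direction, g = gap(x_k), L is the previous estimate.
    Returns (alpha, mu) such that f(x + alpha * v) <= Q(x, alpha, mu).
    """
    sq = float(v @ v)                            # ||v||^2 (x, v are NumPy arrays)
    mu = gamma_d * L                             # initial guess mu in [gamma_d * L, L]
    alpha = min(g / (mu * sq), 1.0)
    Q = lambda a, m: f(x) - a * g + 0.5 * a * a * m * sq   # quadratic model Q(x, a, m)
    for _ in range(max_trials):
        if f(x + alpha * v) <= Q(alpha, mu):     # sufficient decrease achieved
            break
        mu *= gamma_u                            # otherwise increase the estimate
        alpha = min(g / (mu * sq), 1.0)
    return alpha, mu
```

The returned mu is passed back as L at the next iteration, so the estimate tracks the local curvature along the iterates.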

  14. Main Result
Theorem. Let $(x_k)_k$ be generated by the backtracking variant of FW using Algorithm 1 as a subroutine. Then
$h_k \le \frac{2\,\mathrm{gap}(x^0)}{(k+1)(k+2)} + \bar{L}_k\,\frac{k\,\mathrm{diam}(\mathcal{X})^2}{(k+1)(k+2)}$,
where $\bar{L}_k := \frac{1}{k}\sum_{i=0}^{k-1} L_i$.

  15. Linear Convergence: Linearly convergent FW variant
Definition (Garber & Hazan, 2016). A procedure $A(x, r, c)$, where $x \in \mathcal{X}$, $r > 0$, $c \in \mathbb{R}^n$, is an LLOO with parameter $\rho \ge 1$ for the polytope $\mathcal{X}$ if $A(x, r, c)$ returns a point $s \in \mathcal{X}$ such that for all $y \in B_r(x) \cap \mathcal{X}$
$\langle c, y\rangle \ge \langle c, s\rangle$ and $\|x - s\| \le \rho r$.
Such oracles exist for any compact polyhedral domain. A particularly simple implementation exists for simplex-like domains.

  16. Linear Convergence
Let $\sigma_f = \min_{x \in S(x^0)} \lambda_{\min}(\nabla^2 f(x))$.
Theorem (simplified version). Given a polytope $\mathcal{X}$ with an LLOO $A(x, r, c)$ for each $x \in \mathcal{X}$, $r \in (0, \infty)$, $c \in \mathbb{R}^n$, let
$\bar{\alpha} := \min\bigl\{\frac{\sigma_f}{6 L_{\nabla f}\,\rho^2}, 1\bigr\} \cdot \frac{1}{1 + \frac{M}{2}\sqrt{L_{\nabla f}}\,\mathrm{diam}(\mathcal{X})}$.
Then $h_k \le \mathrm{gap}(x^0)\,\exp(-k\bar{\alpha}/2)$.
In the paper we present a version of this theorem without knowledge of $L_{\nabla f}$.

  17. Linear Convergence: Numerical Performance
Portfolio optimization: $f(x) = -\sum_{t=1}^{T} \ln(\langle r_t, x\rangle)$, $\mathcal{X} = \{x \in \mathbb{R}^n_{+} : \sum_{i=1}^{n} x_i = 1\}$.
Poisson inverse problem: $f(x) = \sum_{i=1}^{m} \langle w_i, x\rangle - \sum_{i=1}^{m} y_i \ln(\langle w_i, x\rangle)$, $x \in \mathcal{X} = \{x \in \mathbb{R}^n : \|x\|_1 \le R\}$.
Figure: Portfolio optimization (right), Poisson inverse problem (left).

  18. Conclusion
We derived various novel FW schemes with provable convergence guarantees for self-concordant minimization.
Future directions of research include the following:
Generalized self-concordant minimization (Sun & Tran-Dinh, 2018)
Stochastic oracles
Inertial effects in algorithm design (conditional gradient sliding; Lan & Zhou, 2016)
