 
Revisiting Frank-Wolfe: Projection-Free Sparse Convex Optimization
Martin Jaggi, Ecole Polytechnique
ICML Spotlight Presentation, 2013/06/19
Constrained Convex Optimization
$\min_{x \in D} f(x)$, with $D \subset \mathbb{R}^d$
[Figure: the objective $f(x)$ over the domain $D \subset \mathbb{R}^d$]
An iterative algorithm
[Figure, shown over three slide builds: $f(x)$ over the domain $D \subset \mathbb{R}^d$ with the current iterate $x$]
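The Linearized Problem
$\min_{s' \in D} \; f(x) + \langle s' - x, \nabla f(x) \rangle$
[Figure: $f(x)$, the iterate $x$ and the minimizer $s$ of the linearized problem over $D \subset \mathbb{R}^d$]

Algorithm 1 Frank-Wolfe
  Let $x^{(0)} \in D$
  for $k = 0 \dots K$ do
    Compute $s := \arg\min_{s' \in D} \langle s', \nabla f(x^{(k)}) \rangle$
    Let $\gamma := \frac{2}{k+2}$
    Update $x^{(k+1)} := (1 - \gamma)\, x^{(k)} + \gamma s$
  end for

Convergence: $O\left(\frac{1}{k}\right)$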
[Slide build: the same linearized problem and Algorithm 1, overlaid with an image of the first page of Marguerite Frank and Philip Wolfe, "An Algorithm for Quadratic Programming" (Princeton University, 1956). Legible abstract text: "A finite iteration method for calculating the solution of quadratic programming problems is described. Extensions to more general non-linear problems are suggested."]
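The steps of Algorithm 1 can be sketched in a few lines of plain Python. This is a minimal illustration and not the author's implementation: the names `frank_wolfe` and `simplex_lmo`, the choice of the unit simplex as $D$, and the quadratic objective $f(x) = \frac{1}{2}\lVert x - b \rVert^2$ are all assumptions made for the example.

```python
def frank_wolfe(grad, lmo, x0, num_iters):
    """Frank-Wolfe with the step size gamma = 2 / (k + 2) from Algorithm 1."""
    x = list(x0)
    for k in range(num_iters):
        s = lmo(grad(x))                 # solve the linearized problem over D
        gamma = 2.0 / (k + 2)
        # Convex combination keeps the iterate inside D, so no projection is needed.
        x = [(1 - gamma) * xi + gamma * si for xi, si in zip(x, s)]
    return x


def simplex_lmo(g):
    """argmin over the unit simplex of <s, g>: the vertex at the smallest entry of g."""
    i = min(range(len(g)), key=lambda j: g[j])
    return [1.0 if j == i else 0.0 for j in range(len(g))]


# Hypothetical objective: f(x) = 0.5 * ||x - b||^2 with b inside the simplex,
# so the optimum is x* = b and grad f(x) = x - b.
b = [0.2, 0.5, 0.3]
grad = lambda x: [xi - bi for xi, bi in zip(x, b)]

x = frank_wolfe(grad, simplex_lmo, x0=[1.0, 0.0, 0.0], num_iters=200)
```

The only domain-specific piece is the function passed as `lmo`; swapping it changes the domain without touching the main loop.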
Two kinds of first-order methods

| | Frank-Wolfe | Gradient Descent |
|---|---|---|
| Iteration cost | (approx.) solve linearized problem on $D$ | projection back to $D$ |
| Iterates (in terms of used vertices) | sparse ✓ | dense ✗ |

[Figure: $f(x)$, the iterate $x$ and the gradient $\nabla f(x)$ over $D \subset \mathbb{R}^d$]
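The sparsity claim can be checked on a small example: each Frank-Wolfe step adds at most one vertex of $D$ to the iterate. The sketch below assumes the $\ell_1$-ball as domain (vertices $\pm e_i$) and a made-up dense target `b`; the helper name `l1_ball_lmo` is hypothetical.

```python
def l1_ball_lmo(g, radius=1.0):
    """argmin over the l1-ball of <s, g>: a signed, scaled coordinate vector."""
    i = max(range(len(g)), key=lambda j: abs(g[j]))
    s = [0.0] * len(g)
    s[i] = -radius if g[i] > 0 else radius
    return s


# Hypothetical dense target b in R^50; f(x) = 0.5 * ||x - b||^2, so grad f(x) = x - b.
b = [((7 * i) % 11 - 5) / 5.0 for i in range(50)]
x = [0.0] * 50
nonzeros_per_iterate = []

for k in range(10):
    g = [xi - bi for xi, bi in zip(x, b)]
    s = l1_ball_lmo(g)                   # one vertex of the l1-ball
    gamma = 2.0 / (k + 2)
    x = [(1 - gamma) * xi + gamma * si for xi, si in zip(x, s)]
    # Count the coordinates touched so far; each step can add at most one.
    nonzeros_per_iterate.append(sum(1 for xi in x if xi != 0.0))
```

After step `k` (counting from zero) the iterate has at most `k + 1` non-zero coordinates, even though the target `b` has many more non-zero entries.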
Contributions
• Stronger Convergence Results: primal-dual analysis with certificates for approximation quality
  Primal Rate: $f(x^{(k)}) - f(x^*) \le \frac{2 C_f}{k+2}(1+\delta)$
  Primal-Dual Rate: $g(x^{(\hat{k})}) \le \frac{7 C_f}{k+2}(1+\delta)$
  Holds for all algorithm variants
• Approximate Subproblems and inexact gradients (and domains)
• Affine Invariance
• Optimality in Terms of Sparsity
[Figure: $f(x)$ and the duality gap $g(x)$ over $D \subset \mathbb{R}^d$]
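The certificate $g(x)$ is the duality gap, usually defined as $g(x) := \max_{s \in D} \langle x - s, \nabla f(x) \rangle$. By convexity it upper-bounds the primal error $f(x) - f(x^*)$, and it comes as a by-product of the linearized subproblem. The sketch below assumes the unit simplex as $D$ and a toy quadratic with known optimum value 0; all names are illustrative.

```python
def simplex_lmo(g):
    """argmin over the unit simplex of <s, g>."""
    i = min(range(len(g)), key=lambda j: g[j])
    return [1.0 if j == i else 0.0 for j in range(len(g))]


def duality_gap(x, g, s):
    """g(x) = <x - s, grad f(x)>, where s minimizes <s, grad f(x)> over D."""
    return sum((xi - si) * gi for xi, si, gi in zip(x, s, g))


# Hypothetical objective f(x) = 0.5 * ||x - b||^2 with b in the simplex, so f(x*) = 0.
b = [0.2, 0.5, 0.3]
f = lambda x: 0.5 * sum((xi - bi) ** 2 for xi, bi in zip(x, b))
grad = lambda x: [xi - bi for xi, bi in zip(x, b)]

x = [1.0, 0.0, 0.0]
history = []                             # (primal error, duality gap) per iterate

for k in range(200):
    g = grad(x)
    s = simplex_lmo(g)
    history.append((f(x), duality_gap(x, g, s)))   # gap reuses s at no extra cost
    gamma = 2.0 / (k + 2)
    x = [(1 - gamma) * xi + gamma * si for xi, si in zip(x, s)]

best_gap = min(gap for _, gap in history)
```

Since the gap needs no knowledge of $x^*$, it can serve as a stopping criterion in practice, for instance by stopping once `duality_gap` falls below a chosen tolerance.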
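Applications
$D := \mathrm{conv}(\mathcal{A})$

Some Atomic Domains Suitable for Frank-Wolfe

| $X$ | Atoms $\mathcal{A}$ | $D = \mathrm{conv}(\mathcal{A})$ | $\sup_{s \in D} \langle s, y \rangle$ | Complexity of one Frank-Wolfe iteration |
|---|---|---|---|---|
| $\mathbb{R}^n$ | Sparse Vectors | $\lVert . \rVert_1$-ball | $\lVert y \rVert_\infty$ | $O(n)$ |
| $\mathbb{R}^n$ | Sign-Vectors | $\lVert . \rVert_\infty$-ball | $\lVert y \rVert_1$ | $O(n)$ |
| $\mathbb{R}^n$ | $\ell_p$-Sphere | $\lVert . \rVert_p$-ball | $\lVert y \rVert_q$ | $O(n)$ |
| $\mathbb{R}^n$ | Sparse Non-neg. Vectors | Simplex $\Delta_n$ | $\max_i \{ y_i \}$ | $O(n)$ |
| $\mathbb{R}^n$ | Latent Group Sparse Vec. | $\lVert . \rVert_{\mathcal{G}}$-ball | $\max_{g \in \mathcal{G}} \lVert y_{(g)} \rVert_g^*$ | $\sum_{g \in \mathcal{G}} \lvert g \rvert$ |
| $\mathbb{R}^{m \times n}$ | Matrix Trace Norm | $\lVert . \rVert_{tr}$-ball | $\lVert y \rVert_{op} = \sigma_1(y)$ | $\tilde{O}(N_f / \sqrt{\varepsilon'})$ (Lanczos) |
| $\mathbb{R}^{m \times n}$ | Matrix Operator Norm | $\lVert . \rVert_{op}$-ball | $\lVert y \rVert_{tr} = \lVert (\sigma_i(y)) \rVert_1$ | SVD |
| $\mathbb{R}^{m \times n}$ | Schatten Matrix Norms | $\lVert (\sigma_i(.)) \rVert_p$-ball | $\lVert (\sigma_i(y)) \rVert_q$ | SVD |
| $\mathbb{R}^{m \times n}$ | Matrix Max-Norm | $\lVert . \rVert_{\max}$-ball | | $\tilde{O}(N_f (n+m)^{1.5} / \varepsilon'^{2.5})$ |
| $\mathbb{R}^{n \times n}$ | Permutation Matrices | Birkhoff polytope | | $O(n^3)$ |
| $\mathbb{R}^{n \times n}$ | Rotation Matrices | | | SVD (Procrustes prob.) |
| $\mathbb{S}^{n \times n}$ | Rank-1 PSD matrices of unit trace | $\{ x \succeq 0, \mathrm{Tr}(x) = 1 \}$ | $\lambda_{\max}(y)$ | $\tilde{O}(N_f / \sqrt{\varepsilon'})$ (Lanczos) |
| $\mathbb{S}^{n \times n}$ | PSD matrices of bounded diagonal | $\{ x \succeq 0, x_{ii} \le 1 \}$ | | $\tilde{O}(N_f n^{1.5} / \varepsilon'^{2.5})$ |

Table 1: Some examples of atomic domains suitable for optimization using the Frank-Wolfe algorithm. Here SVD refers to the complexity of computing a singular value decomposition, which is $O(\min\{ mn^2, m^2 n \})$. $N_f$ is the number of non-zero entries in the gradient of the objective function $f$, and $\varepsilon' = \frac{2 \delta C_f}{k+2}$ is the required accuracy for the linear subproblems. For any $p \in [1, \infty]$, the conjugate value $q$ is meant to satisfy $\frac{1}{p} + \frac{1}{q} = 1$, allowing $q = \infty$ for $p = 1$ and vice versa.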
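For the first vector rows of the table, the maximizer of $\langle s, y \rangle$ over $D$ has a closed form that costs one pass over $y$. The following sketch shows three of them on a made-up vector `y`; the function names are hypothetical, and unit-radius domains are assumed.

```python
def inner(s, y):
    return sum(si * yi for si, yi in zip(s, y))


def l1_ball_argmax(y):
    """Sparse vectors: a signed coordinate vector at the largest |y_i|."""
    i = max(range(len(y)), key=lambda j: abs(y[j]))
    s = [0.0] * len(y)
    s[i] = 1.0 if y[i] >= 0 else -1.0
    return s


def linf_ball_argmax(y):
    """Sign vectors: match the sign of every entry of y."""
    return [1.0 if yi >= 0 else -1.0 for yi in y]


def simplex_argmax(y):
    """Sparse non-negative vectors: the vertex at the largest y_i."""
    i = max(range(len(y)), key=lambda j: y[j])
    return [1.0 if j == i else 0.0 for j in range(len(y))]


y = [0.5, -2.0, 1.5]                     # made-up input vector
values = {
    "l1_ball": inner(l1_ball_argmax(y), y),      # equals the max-norm of y
    "linf_ball": inner(linf_ball_argmax(y), y),  # equals the l1-norm of y
    "simplex": inner(simplex_argmax(y), y),      # equals max_i y_i
}
```

In Frank-Wolfe itself the subproblem is a minimization of $\langle s, \nabla f(x) \rangle$, which is the same computation applied to $y = -\nabla f(x)$.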
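Factorized Matrix Domains
Example (trace norm):
$D := \mathrm{conv}\left( \left\{ u v^T \;\middle|\; u \in \mathbb{R}^n, \lVert u \rVert_2 = 1, \; v \in \mathbb{R}^m, \lVert v \rVert_2 = 1 \right\} \right)$
In general:
$D := \mathrm{conv}\left( \left\{ u v^T \;\middle|\; u \in \mathcal{A}_{left}, \; v \in \mathcal{A}_{right} \right\} \right)$, with $\mathcal{A}_{left} \subset \mathbb{R}^n$ and $\mathcal{A}_{right} \subset \mathbb{R}^m$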
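For the trace-norm example, the atom $u v^T$ that maximizes $\langle u v^T, Y \rangle = u^T Y v$ is given by a top singular vector pair of $Y$. The sketch below uses plain power iteration as a simple stand-in for the Lanczos method named in Table 1. It assumes a gap between the two largest singular values and a start vector that is not orthogonal to the top right singular vector; the function name and the test matrix are made up.

```python
import math


def top_singular_pair(Y, num_iters=50):
    """Approximate (sigma_1, u, v) of Y by power iteration on Y^T Y."""
    rows, cols = len(Y), len(Y[0])
    v = [1.0 / math.sqrt(cols)] * cols
    for _ in range(num_iters):
        u = [sum(Y[i][j] * v[j] for j in range(cols)) for i in range(rows)]  # u = Y v
        w = [sum(Y[i][j] * u[i] for i in range(rows)) for j in range(cols)]  # w = Y^T u
        norm = math.sqrt(sum(wj * wj for wj in w))
        v = [wj / norm for wj in w]
    u = [sum(Y[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    sigma = math.sqrt(sum(ui * ui for ui in u))
    u = [ui / sigma for ui in u]
    return sigma, u, v


# Made-up 3 x 2 matrix with singular values 5 and 1.
Y = [[4.0, 0.0],
     [3.0, 0.0],
     [0.0, 1.0]]

sigma, u, v = top_singular_pair(Y)
atom = [[ui * vj for vj in v] for ui in u]       # rank-1 atom u v^T
value = sum(atom[i][j] * Y[i][j] for i in range(3) for j in range(2))
```

Only the two factors `u` and `v` need to be stored per step, so after `k` steps the iterate has rank at most `k` and can be kept in factorized form.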