Block-Coordinate Frank-Wolfe Optimization


  1. Block-Coordinate Frank-Wolfe Optimization with applications to structured prediction. Martin Jaggi, CMAP, École Polytechnique, Paris. Optimization and Big Data Workshop, Edinburgh, May 2, 2013. Co-authors: Simon Lacoste-Julien, Mark Schmidt and Patrick Pletscher.

  2. Outline
  • Two old first-order optimization algorithms:
    • Coordinate descent
    • The Frank-Wolfe algorithm
  • Duality for constrained convex optimization
  • Combining Frank-Wolfe and coordinate descent
  • Applications: large-margin prediction
    • binary SVMs
    • structural SVMs

  3. Coordinate Descent

  4. Coordinate Descent. Minimize f(x) over x ∈ R^d by updating one coordinate at a time. Selection of the next coordinate (a sketch follows below):
  • the one of steepest descent
  • cyclic (hard to analyze!)
  • random sampling
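
To make the random-sampling variant concrete, here is a minimal coordinate descent sketch on a toy quadratic; the objective, the matrix Q, and the exact per-coordinate minimization step are illustrative assumptions, not from the slides.

```python
# Coordinate descent with uniform random coordinate selection on the
# toy quadratic f(x) = 0.5 * x^T Q x - b^T x (Q, b are assumed here).
import numpy as np

rng = np.random.default_rng(0)
d = 5
M = rng.standard_normal((d, d))
Q = M @ M.T + d * np.eye(d)       # positive definite => f is convex
b = rng.standard_normal(d)

def grad_i(x, i):
    return Q[i] @ x - b[i]        # i-th partial derivative of f

x = np.zeros(d)
for k in range(2000):
    i = rng.integers(d)           # random sampling of the next coordinate
    x[i] -= grad_i(x, i) / Q[i, i]  # exact minimization along coordinate i

print(x - np.linalg.solve(Q, b))  # should be close to the zero vector
```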

  5. The Frank-Wolfe Algorithm, Frank and Wolfe (1956). Setting: minimize f(x) over a compact convex domain D ⊂ R^d.

  6.–9. [Figure: four animation frames illustrating the constrained problem min_{x ∈ D} f(x) over the domain D ⊂ R^d]

  10. The Linearized Problem: min_{s′ ∈ D} f(x) + ⟨s′ − x, ∇f(x)⟩.

  Algorithm 1: Frank-Wolfe
    for k = 0 … K do
      Compute s := argmin_{s′ ∈ D} ⟨s′, ∇f(x^(k))⟩
      Let γ := 2/(k + 2)
      Update x^(k+1) := (1 − γ) x^(k) + γ s
    end for
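
A minimal sketch of Algorithm 1 for the case where D is the probability simplex, so the linearized problem is solved by picking the vertex e_i with the smallest gradient entry; the quadratic objective is an assumed toy example.

```python
# Frank-Wolfe (Algorithm 1) on the probability simplex.
import numpy as np

def frank_wolfe(grad, d, K=500):
    x = np.ones(d) / d                # feasible start inside the simplex
    for k in range(K):
        g = grad(x)
        i = np.argmin(g)              # vertex s = e_i minimizes <s, g> over D
        gamma = 2.0 / (k + 2)         # step size from Algorithm 1
        x = (1 - gamma) * x           # x := (1 - gamma) x + gamma e_i
        x[i] += gamma
    return x

# Example: min ||x - c||^2 over the simplex (c chosen arbitrarily).
c = np.array([0.1, 0.7, 0.2, -0.3])
print(frank_wolfe(lambda x: 2 * (x - c), d=4))
```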

  11. The linearized problem min_{s′ ∈ D} f(x) + ⟨s′ − x, ∇f(x)⟩, compared with gradient descent:
  • Cost per step: Frank-Wolfe (approximately) solves the linearized problem on D; gradient descent requires a projection back onto D.
  • Sparse solutions (in terms of used vertices): Frank-Wolfe ✓, gradient descent ✗.

  12. Some Examples of Atomic Domains Suitable for Frank-Wolfe Optimization

  Space | Atoms A | D = conv(A) | sup_{s ∈ D} ⟨s, y⟩ | Complexity of one Frank-Wolfe iteration
  R^n | Sparse vectors | ‖·‖₁-ball | ‖y‖_∞ | O(n)
  R^n | Sign vectors | ‖·‖_∞-ball | ‖y‖₁ | O(n)
  R^n | ℓ_p-sphere | ‖·‖_p-ball | ‖y‖_q | O(n)
  R^n | Sparse non-neg. vectors | simplex Δ_n | max_i {y_i} | O(n)
  R^n | Latent group sparse vectors | ‖·‖_G-ball | max_{g ∈ G} ‖y_(g)‖*_g | O(Σ_{g ∈ G} |g|)
  R^{m×n} | Matrix trace norm | ‖·‖_tr-ball | ‖y‖_op = σ₁(y) | Õ(N_f / √ε′) (Lanczos)
  R^{m×n} | Matrix operator norm | ‖·‖_op-ball | ‖y‖_tr = ‖(σ_i(y))‖₁ | SVD
  R^{m×n} | Schatten matrix norms | ‖(σ_i(·))‖_p-ball | ‖(σ_i(y))‖_q | SVD
  R^{m×n} | Matrix max-norm | ‖·‖_max-ball | | Õ(N_f (n+m)^1.5 / ε′^2.5)
  R^{n×n} | Permutation matrices | Birkhoff polytope | | O(n³)
  R^{n×n} | Rotation matrices | | | SVD (Procrustes problem)
  S^{n×n} | Rank-1 PSD matrices of unit trace | {x ⪰ 0, Tr(x) = 1} | λ_max(y) | Õ(N_f / √ε′) (Lanczos)
  S^{n×n} | PSD matrices of bounded diagonal | {x ⪰ 0, x_ii ≤ 1} | | Õ(N_f n^1.5 / ε′^2.5)

  Table 1: Some examples of atomic domains suitable for optimization using the Frank-Wolfe algorithm [J. 2013]. Here SVD refers to the complexity of computing a singular value decomposition, which is O(min{mn², m²n}). N_f is the number of non-zero entries in the gradient of the objective function f, and ε′ = 2δC_f/(k+2) is the required accuracy for the linear subproblems. For any p ∈ [1, ∞], the conjugate value q is meant to satisfy 1/p + 1/q = 1, allowing q = ∞ for p = 1 and vice versa. [Dudík et al. 2011; Tewari et al. 2011; J. 2011]
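
To illustrate a few rows of the table, here are hedged sketches of the linear minimization oracle argmin_{s ∈ D} ⟨s, y⟩ (the minimization twin of the sup shown above) for the ℓ₁-ball, the simplex, and the trace-norm ball; the function names are ours, not from the slides.

```python
import numpy as np

def lmo_l1_ball(y, radius=1.0):
    # Sparse vectors: minimizer is -radius * sign(y_i) e_i at the largest |y_i|.
    i = np.argmax(np.abs(y))
    s = np.zeros_like(y)
    s[i] = -radius * np.sign(y[i])
    return s                                   # O(n)

def lmo_simplex(y):
    # Sparse non-negative vectors: minimizer is the vertex e_i, i = argmin_i y_i.
    s = np.zeros_like(y)
    s[np.argmin(y)] = 1.0
    return s                                   # O(n)

def lmo_trace_ball(Y, radius=1.0):
    # Trace-norm ball: minimizer is -radius * u1 v1^T from the top singular
    # pair; per the table, practice uses Lanczos rather than a full SVD.
    U, _, Vt = np.linalg.svd(Y)
    return -radius * np.outer(U[:, 0], Vt[0])
```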

  13. Convergence of Frank-Wolfe for min_{x ∈ D} f(x):
  • Primal convergence: algorithms obtain f(x^(k)) − f(x*) ≤ O(1/k) after k steps. [Frank & Wolfe 1956]
  • Primal-dual convergence: algorithms obtain gap(x^(k)) ≤ O(1/k) after k steps. [Clarkson 2008, J. 2013]

  14. A Simple Optimization Duality. Original problem: min_{x ∈ D} f(x). The dual value: ω(x) := f(x) + min_{s′ ∈ D} ⟨s′ − x, ∇f(x)⟩, giving the duality gap gap(x) := f(x) − ω(x). Weak duality: ω(x) ≤ f(x*) ≤ f(x′) for any x, x′ ∈ D.
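
Both quantities come for free once the linear subproblem is solved. Here is a small sketch for the simplex domain (toy objective assumed), evaluating ω(x) and gap(x) = f(x) − ω(x) as defined on this slide.

```python
import numpy as np

def dual_value_and_gap(f, grad, x):
    g = grad(x)
    s = np.zeros_like(x)
    s[np.argmin(g)] = 1.0            # argmin of <s', g> over simplex vertices
    omega = f(x) + (s - x) @ g       # omega(x) = f(x) + min_{s'} <s' - x, grad f(x)>
    gap = f(x) - omega               # gap(x) = <x - s, grad f(x)> >= 0
    return omega, gap

c = np.array([0.2, 0.5, 0.3])
f = lambda x: np.sum((x - c) ** 2)
grad = lambda x: 2 * (x - c)
print(dual_value_and_gap(f, grad, np.ones(3) / 3))  # omega lower-bounds f(x*)
```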

  15. Block-Separable Optimization Problems: min_{x ∈ D^(1) × ··· × D^(n)} f(x), where x = (x_(1), …, x_(n)) and each block domain D^(i) ⊂ R^{d_i}.

  16. Two algorithms on the product domain D = D^(1) × ··· × D^(n):

  Algorithm 2: Uniform Coordinate Descent
    Let x^(0) ∈ D
    for k = 0 … K do
      Pick i ∈ [n] uniformly at random
      Compute s_(i) := argmin_{s_(i) ∈ D^(i)} ⟨s_(i), ∇_(i) f(x^(k))⟩ + (L_i/2) ‖s_(i) − x_(i)^(k)‖²
      Update x_(i)^(k+1) := x_(i)^(k) + (s_(i) − x_(i)^(k))
    end for

  Algorithm 3: Block-Coordinate “Frank-Wolfe”
    Let x^(0) ∈ D
    for k = 0 … K do
      Pick i ∈ [n] uniformly at random
      Compute s_(i) := argmin_{s_(i) ∈ D^(i)} ⟨s_(i), ∇_(i) f(x^(k))⟩
      Let γ := 2n/(k + 2n), or optimize γ by line-search
      Update x_(i)^(k+1) := x_(i)^(k) + γ (s_(i) − x_(i)^(k))
    end for

  Coordinate descent: Nesterov (2012); Richtárik, Takáč (2012), “huge-scale” coordinate descent.

  Theorem: the block-coordinate Frank-Wolfe algorithm obtains accuracy ≤ O(2n/(k + 2n)) after k steps (also in the duality gap, and with inexact subproblems). Hidden constant: the curvature Σ_i C_f^(i) ≤ Σ_i L_i diam²(D^(i)).
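
A minimal sketch of Algorithm 3 on a product of n probability simplices, with the block oracle again solved by picking the best vertex of the chosen block; the separable quadratic objective is an assumed toy example.

```python
# Block-coordinate Frank-Wolfe (Algorithm 3) on n simplices in R^d.
import numpy as np

rng = np.random.default_rng(0)
n, d = 10, 4                          # n blocks, each a simplex in R^d
C = rng.standard_normal((n, d))

def grad_block(X, i):
    return 2 * (X[i] - C[i])          # block-i gradient of sum_i ||x_(i) - c_i||^2

X = np.ones((n, d)) / d               # feasible start: uniform in every block
for k in range(5000):
    i = rng.integers(n)               # pick a block i uniformly at random
    g = grad_block(X, i)
    s = np.zeros(d)
    s[np.argmin(g)] = 1.0             # block linear minimization oracle
    gamma = 2 * n / (k + 2 * n)       # step size from the theorem's rate
    X[i] = (1 - gamma) * X[i] + gamma * s

print(np.sum((X - C) ** 2))           # final objective value
```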

  17. Applications: Large Margin Prediction
  • Binary support vector machine (no bias); also: ranking SVM
  • Margin constraint: ⟨w, φ(x_i)⟩ y_i ≥ 1 − ξ_i
  • Primal problem: min_w (λ/2) ‖w‖² + (1/n) Σ_{i=1}^n max{0, 1 − ⟨w, φ(x_i)⟩ y_i}
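
For concreteness, a small sketch of evaluating this primal objective (regularized average hinge loss); the array names are illustrative assumptions.

```python
import numpy as np

def svm_primal_objective(w, Phi, y, lam):
    # Phi: n x d matrix with rows phi(x_i); y in {-1, +1}^n.
    margins = y * (Phi @ w)                 # <w, phi(x_i)> y_i
    hinge = np.maximum(0.0, 1.0 - margins)  # max{0, 1 - margin}
    return lam / 2 * w @ w + hinge.mean()   # (lam/2)||w||^2 + average hinge loss
```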

  18. Binary SVM, with margin constraints ⟨w, φ(x_i)⟩ y_i ≥ 1 − ξ_i.

  Primal: min_{w ∈ R^d} (λ/2) ‖w‖² + (1/n) Σ_{i=1}^n max{0, 1 − ⟨w, φ(x_i)⟩ y_i}
  • d-dimensional, unconstrained
  • non-smooth, strongly convex

  Dual: min_{α ∈ R^n} f(α) := (λ/2) ‖Aα‖² − bᵀα  s.t. 0 ≤ α_i ≤ 1 ∀ i ∈ [n], where the i-th column of A is built from φ(x_i) y_i
  • n-dimensional, box-constrained
  • smooth, not strongly convex
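
A hedged sketch of evaluating this dual objective. The scaling of A's columns by 1/(λn) and b_i = 1/n is an assumption we choose so that w = Aα recovers the primal weights (consistent with the correspondence on slide 21); the slide itself only specifies the column structure.

```python
import numpy as np

def svm_dual_objective(alpha, Phi, y, lam):
    n = len(y)
    A = (Phi * y[:, None]).T / (lam * n)  # column i: phi(x_i) y_i / (lam n)  [assumed scaling]
    b = np.full(n, 1.0 / n)               # b_i = 1/n                          [assumed scaling]
    w = A @ alpha                         # primal-dual map w = A alpha
    return lam / 2 * w @ w - b @ alpha    # f(alpha) = (lam/2)||A alpha||^2 - b^T alpha
```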

  19. Structural SVM. “Joint” feature map φ: X × Y → R^d; large-margin “separation”: ⟨w, φ(x_i, y_i) − φ(x_i, y)⟩ ≥ L(y, y_i) − ξ_i ∀ y.
  Primal problem: min_{w ∈ R^d} (λ/2) ‖w‖² + (1/n) Σ_{i=1}^n max_{y ∈ Y} { L(y_i, y) − ⟨w, φ(x_i, y_i) − φ(x_i, y)⟩ }, where the term inside the max corresponds to the (i, y)-th column of A.
  [Figure: feature maps φ(image, label) for handwritten digits, e.g. φ(·, 2), φ(·, 7), φ(·, 4), φ(·, 0), φ(·, 1), φ(·, 3)]

  20. Structural SVM (continued). Same setup: feature map φ: X × Y → R^d and separation ⟨w, φ(x_i, y_i) − φ(x_i, y)⟩ ≥ L(y, y_i) − ξ_i ∀ y. The inner maximization in the primal problem, max_{y ∈ Y} { L(y_i, y) − ⟨w, φ(x_i, y_i) − φ(x_i, y)⟩ } (the (i, y)-th column of A), is solved by a decoding oracle. Example: sequence labeling of the handwritten word “donaudampfschifffahrtsgesellschaftskapitän”, where the output space has |Y| = 26^42 labelings.
  [Figure: feature maps φ(word image, labeling) for candidate labelings of a handwritten word]
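
How can a max over |Y| = 26^42 labelings be computed? When the joint feature map and the loss decompose over sequence positions (per-letter features, Hamming loss), loss-augmented decoding splits into independent per-position maximizations; with pairwise transition features it would be the Viterbi algorithm instead. The sketch below assumes the simple decomposable case; all names are illustrative.

```python
import numpy as np

def decode(scores, y_true):
    # scores[t, c]: <w, per-position features of letter c at position t>.
    # Hamming loss adds 1 for every position where y differs from y_true,
    # so the loss-augmented max decomposes position by position.
    T, _ = scores.shape
    aug = scores + 1.0
    aug[np.arange(T), y_true] -= 1.0   # no loss incurred for the true letter
    return aug.argmax(axis=1)          # best (loss-augmented) letter per position

scores = np.random.default_rng(0).standard_normal((5, 26))  # T = 5 positions
print(decode(scores, y_true=np.array([3, 0, 11, 11, 14])))
```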

  21. Primal-dual correspondence: w = Aα.

  Binary SVM
  Primal: min_{w ∈ R^d} (λ/2) ‖w‖² + (1/n) Σ_{i=1}^n max{0, 1 − ⟨w, φ(x_i)⟩ y_i}
  Dual: min_{α ∈ R^n} f(α) := (λ/2) ‖Aα‖² − bᵀα  s.t. 0 ≤ α_i ≤ 1 ∀ i ∈ [n]  (i-th column of A from example i)

  Structural SVM
  Primal: min_{w ∈ R^d} (λ/2) ‖w‖² + (1/n) Σ_{i=1}^n max_{y ∈ Y} { L(y_i, y) − ⟨w, φ(x_i, y_i) − φ(x_i, y)⟩ }
  Dual: min_{α ∈ R^{n·|Y|}} f(α) := (λ/2) ‖Aα‖² − bᵀα  s.t. Σ_{y ∈ Y} α_i(y) = 1 ∀ i ∈ [n] and α_i(y) ≥ 0 ∀ i ∈ [n], ∀ y ∈ Y  ((i, y)-th column of A)

  22. Binary SVM: primal and dual as above (primal: d-dimensional, unconstrained, non-smooth, strongly convex; dual: n-dimensional, box-constrained, smooth, not strongly convex).

  Optimization algorithms:
  • Primal, batch: subgradient descent, O(R²/(λε))
  • Dual, batch: Frank-Wolfe = cutting plane (SVM-light), cost n per iteration
  • Primal, online: stochastic subgradient (SGD, Pegasos)
  • Dual, online: coordinate descent (Hsieh 2008) = block-coordinate descent = block-coordinate Frank-Wolfe, cost 1 per iteration
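
A hedged sketch of the dual coordinate-descent step (in the spirit of Hsieh et al. 2008) for f(α) = (λ/2)‖Aα‖² − bᵀα on the box [0,1]^n, keeping w = Aα up to date so each update touches only one example; the same scaling assumptions as in the earlier dual sketch are used.

```python
import numpy as np

def dual_cd_epoch(alpha, w, Phi, y, lam, seed=0):
    n = len(y)
    for i in np.random.default_rng(seed).permutation(n):
        g = (y[i] * (Phi[i] @ w) - 1.0) / n            # partial derivative of f
        q = Phi[i] @ Phi[i] / (lam * n * n)            # curvature lam * ||A_i||^2
        new_ai = np.clip(alpha[i] - g / q, 0.0, 1.0)   # exact step, clipped to the box
        w += (new_ai - alpha[i]) * y[i] * Phi[i] / (lam * n)  # maintain w = A alpha
        alpha[i] = new_ai
    return alpha, w
```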

  23. Structural SVM: primal and dual as on slide 21 (primal: d-dimensional, unconstrained, non-smooth, strongly convex; dual: n·|Y|-dimensional, block-constrained, smooth, not strongly convex).

  Optimization algorithms:
  • Primal, batch: subgradient descent, O(R²/(λε))
  • Dual, batch: Frank-Wolfe = cutting plane (SVM-struct), cost n per iteration
  • Primal, online: stochastic subgradient (SGD, Pegasos)
  • Dual, online: block coordinate descent (Nesterov) = block-coordinate Frank-Wolfe, cost 1 per iteration
  [Figure: feature maps φ(·, y) for handwritten digit images, as on slide 19]
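
Putting the pieces together, a hedged skeleton of block-coordinate Frank-Wolfe on the structural SVM dual, maintained directly in primal variables via w = Aα (in the spirit of Lacoste-Julien et al. 2013); psi, loss, and the loss-augmented decode oracle are problem-specific arguments we assume here, and line search is omitted for brevity.

```python
import numpy as np

def bcfw_ssvm(data, psi, loss, decode, lam, K, seed=0):
    # data: list of (x_i, y_i); psi(x, y) -> feature vector in R^d.
    # decode(w, x, y_i) solves the loss-augmented maximization of slide 20,
    # which is exactly the block linear minimization oracle of Algorithm 3.
    n = len(data)
    d = psi(*data[0]).shape[0]
    w = np.zeros(d)
    W = np.zeros((n, d))                 # block contributions; w = sum_i W[i]
    rng = np.random.default_rng(seed)
    for k in range(K):
        i = rng.integers(n)              # pick a block (training example) u.a.r.
        x, y = data[i]
        y_star = decode(w, x, y)         # block oracle via loss-augmented decoding
        w_s = (psi(x, y) - psi(x, y_star)) / (lam * n)  # chosen corner's contribution
        gamma = 2 * n / (k + 2 * n)      # step size (or line search in practice)
        w_new = (1 - gamma) * W[i] + gamma * w_s
        w += w_new - W[i]                # incremental update keeps w = A alpha
        W[i] = w_new
    return w
```

Each iteration calls the decoding oracle once, matching the “cost 1 per iteration” entry above, while the batch methods need n oracle calls per step.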
