

SLIDE 1

Block-Coordinate Frank-Wolfe Optimization
with applications to structured prediction

Martin Jaggi, CMAP, École Polytechnique, Paris
Optimization and Big Data Workshop, Edinburgh, 2 May 2013

Co-Authors: Simon Lacoste-Julien, Mark Schmidt and Patrick Pletscher

SLIDE 2

Outline

  • Two Old First-Order Optimization Algorithms
      • Coordinate Descent
      • The Frank-Wolfe Algorithm
  • Duality for Constrained Convex Optimization
  • Combining Frank-Wolfe and Coordinate Descent
  • Applications: Large Margin Prediction
      • binary SVMs
      • structural SVMs

SLIDE 3

Coordinate Descent

SLIDE 4

Coordinate Descent (on R^d)

Selection of the next coordinate:

  • the coordinate of steepest descent
  • cyclic order (hard to analyze!)
  • random sampling

(figure: the graph of f(x) with a one-dimensional coordinate slice through the current iterate x)
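To make the random-sampling rule concrete, here is a minimal coordinate-descent sketch in Python; the function names, the fixed step size, and the quadratic example are illustrative assumptions, not from the slides.

```python
import numpy as np

def coordinate_descent(grad_i, x0, step=0.1, iters=2000, seed=0):
    """Minimize a smooth f over R^d by single-coordinate updates.

    grad_i(x, i) is assumed to return the i-th partial derivative of f
    at x. A fixed step size is used for simplicity; per-coordinate
    Lipschitz step sizes are the usual refinement.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(iters):
        i = rng.integers(len(x))       # random sampling rule from the slide
        x[i] -= step * grad_i(x, i)    # move along coordinate i only
    return x

# Example: f(x) = 0.5 * ||x - c||^2, so the i-th partial is x_i - c_i.
c = np.array([1.0, -2.0, 3.0])
print(coordinate_descent(lambda x, i: x[i] - c[i], np.zeros(3)))  # ≈ c
```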

SLIDE 5

The Frank-Wolfe Algorithm
Frank and Wolfe (1956)

Setting: a compact convex domain D ⊂ R^d.

SLIDES 6–9

min_{x ∈ D} f(x),   D ⊂ R^d

(figures: four animation frames of the same illustration, the graph of f over the domain D with the current iterate x)

SLIDE 10

The Linearized Problem

min_{s′ ∈ D}  f(x) + ⟨s′ − x, ∇f(x)⟩

(figure: f(x), its linearization at x, and the linear minimizer s over D ⊂ R^d)

Algorithm 1: Frank-Wolfe
  for k = 0 … K do
    Compute s := argmin_{s′ ∈ D} ⟨s′, ∇f(x^(k))⟩
    Let γ := 2/(k+2)
    Update x^(k+1) := (1 − γ) x^(k) + γ s
  end for
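Algorithm 1 is easy to instantiate once the linear subproblem is cheap. Below is a minimal sketch on the probability simplex, where the linearized problem is solved by picking the vertex with the smallest gradient entry; the function names and the quadratic test objective are illustrative assumptions.

```python
import numpy as np

def frank_wolfe_simplex(grad, d, K=2000):
    """Frank-Wolfe (Algorithm 1) on the probability simplex Δ_d.

    argmin_{s ∈ Δ_d} <s, ∇f(x)> is attained at the vertex e_i with
    i = argmin_i ∇f(x)_i, so every iterate stays feasible as a convex
    combination of vertices and no projection is ever needed.
    """
    x = np.full(d, 1.0 / d)                # feasible starting point
    for k in range(K):
        g = grad(x)
        s = np.zeros(d)
        s[np.argmin(g)] = 1.0              # vertex minimizing the linearization
        gamma = 2.0 / (k + 2.0)            # step size from Algorithm 1
        x = (1.0 - gamma) * x + gamma * s  # convex update
    return x

# Example: minimize ||x - c||^2 over the simplex (c already lies in it).
c = np.array([0.1, 0.7, 0.2])
print(frank_wolfe_simplex(lambda x: 2.0 * (x - c), d=3))  # ≈ c
```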

SLIDE 11

The Linearized Problem

min_{s′ ∈ D}  f(x) + ⟨s′ − x, ∇f(x)⟩,   D ⊂ R^d

(figure: f(x), the iterate x, and the gradient ∇f(x))

Cost per step:
  • Frank-Wolfe: (approximately) solve the linearized problem on D; gives sparse solutions (in terms of used vertices)
  • Gradient Descent: projection back to D

SLIDE 12

Some Examples of Atomic Domains Suitable for Frank-Wolfe
[Dudík et al. 2011, Tewari et al. 2011, J. 2011, J. 2013]

Complexity of one Frank-Wolfe iteration:

| X       | Atoms A                          | D = conv(A)        | sup_{s∈D} ⟨s, y⟩      | Complexity                 |
| R^n     | Sparse vectors                   | ‖·‖₁-ball          | ‖y‖_∞                 | O(n)                       |
| R^n     | Sign vectors                     | ‖·‖_∞-ball         | ‖y‖₁                  | O(n)                       |
| R^n     | ℓ_p-sphere                       | ‖·‖_p-ball         | ‖y‖_q                 | O(n)                       |
| R^n     | Sparse non-neg. vectors          | Simplex Δ_n        | max_i {y_i}           | O(n)                       |
| R^n     | Latent group sparse vectors      | ‖·‖_G-ball         | max_{g∈G} ‖y_(g)‖*_g  | O(∑_{g∈G} |g|)             |
| R^{m×n} | Matrix trace norm                | ‖·‖_tr-ball        | ‖y‖_op = σ₁(y)        | Õ(N_f /√ε′) (Lanczos)      |
| R^{m×n} | Matrix operator norm             | ‖·‖_op-ball        | ‖y‖_tr = ‖σ(y)‖₁      | SVD                        |
| R^{m×n} | Schatten matrix norms            | ‖σ(·)‖_p-ball      | ‖σ(y)‖_q              | SVD                        |
| R^{m×n} | Matrix max-norm                  | ‖·‖_max-ball       |                       | Õ(N_f (n+m)^1.5 / ε′^2.5)  |
| R^{n×n} | Permutation matrices             | Birkhoff polytope  |                       | O(n³)                      |
| R^{n×n} | Rotation matrices                |                    |                       | SVD (Procrustes prob.)     |
| S^{n×n} | Rank-1 PSD matrices, unit trace  | {x ⪰ 0, Tr(x)=1}   | λ_max(y)              | Õ(N_f /√ε′) (Lanczos)      |
| S^{n×n} | PSD matrices, bounded diagonal   | {x ⪰ 0, x_ii ≤ 1}  |                       | Õ(N_f n^1.5 / ε′^2.5)      |

Table 1: Some examples of atomic domains suitable for optimization using the Frank-Wolfe algorithm. Here SVD refers to the complexity of computing a singular value decomposition, which is O(min{mn², m²n}). N_f is the number of non-zero entries in the gradient of the objective function f, and ε′ = 2δC_f/(k+2) is the required accuracy for the linear subproblems. For any p ∈ [1, ∞], the conjugate value q is meant to satisfy 1/p + 1/q = 1, allowing q = ∞ for p = 1 and vice versa.
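The per-iteration costs in the table come from how cheap each domain's linear minimization oracle is. The sketch below spells out two rows under stated assumptions: the ‖·‖₁-ball oracle is a single pass over the gradient (an O(n) row), and the trace-norm-ball oracle needs only the top singular pair (a full SVD is used here for brevity where Lanczos would suffice).

```python
import numpy as np

def lmo_l1_ball(y, r=1.0):
    """argmin_{||s||_1 <= r} <s, y>: one signed vertex of the scaled
    cross-polytope, found in O(n) via the largest |y_i|."""
    i = np.argmax(np.abs(y))
    s = np.zeros_like(y)
    s[i] = -r * np.sign(y[i])
    return s

def lmo_trace_norm_ball(Y, r=1.0):
    """argmin_{||S||_tr <= r} <S, Y>: the rank-1 matrix -r * u1 v1^T built
    from the top singular pair of Y (optimal value -r * σ1(Y))."""
    U, _, Vt = np.linalg.svd(Y)
    return -r * np.outer(U[:, 0], Vt[0, :])
```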

SLIDE 13

The Linearized Problem

min_{s′ ∈ D}  f(x) + ⟨s′ − x, ∇f(x)⟩,   D ⊂ R^d

  • Primal convergence: the algorithm obtains f(x^(k)) − f(x*) ≤ O(1/k) after k steps. [Frank & Wolfe 1956]
  • Primal-dual convergence: the algorithm obtains gap(x^(k)) ≤ O(1/k) after k steps. [Clarkson 2008, J. 2013]

SLIDE 14

A Simple Optimization Duality

Original problem:  min_{x ∈ D} f(x),   D ⊂ R^d

The dual value:  ω(x) := min_{s′ ∈ D} f(x) + ⟨s′ − x, ∇f(x)⟩

Duality gap:  gap(x) := f(x) − ω(x)

Weak duality:  ω(x) ≤ f(x*) ≤ f(x′)  for any x, x′ ∈ D

(figure: f over D with its linearization at x; gap(x) is the vertical distance between f(x) and ω(x))
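The same linearization yields a certificate that is computable at every iterate. A minimal sketch, assuming a callable `lmo` that returns argmin_{s ∈ D} ⟨s, g⟩ (as in the table of atomic domains above):

```python
import numpy as np

def fw_duality_gap(x, g, lmo):
    """gap(x) = f(x) - ω(x) = <x - s, ∇f(x)>, where s solves the
    linearized problem. By weak duality this upper-bounds f(x) - f(x*),
    so it serves directly as a stopping criterion."""
    s = lmo(g)                       # one call to the linear oracle
    return float(np.dot(x - s, g))   # always >= 0 at a feasible x
```

Plugging in `lmo_l1_ball` from above gives a ready stopping criterion for ℓ₁-constrained problems.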

SLIDE 15

Block-Separable Optimization Problems

min_{x ∈ D^(1) × ··· × D^(n)} f(x),   x = (x_(1), …, x_(n)),   x_(i) ∈ D^(i) ⊂ R^{d_i}

(figure: the domain as a product D^(1) × ··· × D^(n) ⊂ R^{d_1} × ··· × R^{d_n}, with f seen block-wise in the coordinates x_(1), …, x_(n))

SLIDE 16

Algorithm 2: Uniform Coordinate Descent
  Let x^(0) ∈ D
  for k = 0 … K do
    Pick i uniformly at random in [n]
    Compute s_(i) := argmin_{s_(i) ∈ D^(i)} ⟨s_(i), ∇_(i) f(x^(k))⟩ + (L_i/2) ‖s_(i) − x_(i)‖²
    Update x^(k+1)_(i) := x^(k)_(i) + (s_(i) − x^(k)_(i))
  end for

Theorem: the algorithm obtains accuracy O(2n/(k+2n)) after k steps.
[Nesterov (2012); Richtárik, Takáč (2012), "huge-scale" coordinate descent]

Algorithm 3: Block-Coordinate "Frank-Wolfe"
  Let x^(0) ∈ D
  for k = 0 … K do
    Pick i uniformly at random in [n]
    Compute s_(i) := argmin_{s_(i) ∈ D^(i)} ⟨s_(i), ∇_(i) f(x^(k))⟩
    Let γ := 2n/(k+2n), or optimize γ by line-search
    Update x^(k+1)_(i) := x^(k)_(i) + γ (s_(i) − x^(k)_(i))
  end for

Hidden constant: curvature ≤ ∑_i L_f diam²(D^(i))
(also in the duality gap, and with inexact subproblems)

(figures: block-wise view of f over the product domain, with one block x_(i) highlighted)
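Algorithm 3 only touches one block per iteration. Here is a minimal sketch on a product of simplices, matching the block-separable setting of the previous slide; `grad_block` and the block dimensions are illustrative assumptions.

```python
import numpy as np

def bcfw_product_of_simplices(grad_block, dims, K=20000, seed=0):
    """Block-Coordinate Frank-Wolfe (Algorithm 3) on D = Δ_{d_1} × ... × Δ_{d_n}.

    grad_block(x, i) is assumed to return the partial gradient ∇_(i) f(x).
    Each step solves the linear subproblem on a single block's simplex
    and leaves all other blocks untouched.
    """
    rng = np.random.default_rng(seed)
    n = len(dims)
    x = [np.full(d, 1.0 / d) for d in dims]   # feasible start in every block
    for k in range(K):
        i = rng.integers(n)                   # pick a block uniformly at random
        g = grad_block(x, i)
        s = np.zeros(dims[i])
        s[np.argmin(g)] = 1.0                 # block-wise linear minimizer
        gamma = 2.0 * n / (k + 2.0 * n)       # step size from Algorithm 3
        x[i] = (1.0 - gamma) * x[i] + gamma * s
    return x
```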

SLIDE 17

Applications: Large Margin Prediction

  • Binary Support Vector Machine (no bias)
  • also: Ranking SVM

Margin constraints:  y_i ⟨w, φ(x_i)⟩ ≥ 1 − ξ_i

primal problem:

min_w  (λ/2) ‖w‖² + (1/n) ∑_{i=1}^n max{0, 1 − y_i ⟨w, φ(x_i)⟩}

SLIDE 18

Binary SVM

Margin constraints:  y_i ⟨w, φ(x_i)⟩ ≥ 1 − ξ_i

primal (d-dim; unconstrained; non-smooth, strongly convex):

min_{w ∈ R^d}  (λ/2) ‖w‖² + (1/n) ∑_{i=1}^n max{0, 1 − y_i ⟨w, φ(x_i)⟩}
    (the term y_i φ(x_i) gives the i-th column of A)

dual (n-dim; box-constrained; smooth, not strongly convex):

min_{α ∈ R^n}  f(α) := (λ/2) ‖Aα‖² − b^T α   s.t.  0 ≤ α_i ≤ 1  ∀i ∈ [n]
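On this dual, one training example per block means a single coordinate, and the block LMO over the interval [0, 1] reduces to a sign test on one partial derivative. A minimal sketch of one such step, using the slide's f(α) = (λ/2)‖Aα‖² − bᵀα and the correspondence w = Aα from the slides that follow; the function name and argument layout are illustrative.

```python
import numpy as np

def bcfw_svm_dual_step(alpha, A, b, lam, i, k):
    """One block-coordinate Frank-Wolfe step on the binary SVM dual
    over the box 0 <= alpha_i <= 1, with one coordinate per block."""
    n = len(alpha)
    w = A @ alpha                           # primal correspondence w = A alpha
    grad_i = lam * (A[:, i] @ w) - b[i]     # i-th partial of (λ/2)||Aα||² − bᵀα
    s_i = 0.0 if grad_i > 0 else 1.0        # LMO over the interval [0, 1]
    gamma = 2.0 * n / (k + 2.0 * n)         # block step size
    alpha[i] += gamma * (s_i - alpha[i])    # update coordinate i only
    return alpha
```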

SLIDE 19

Structural SVM

φ : X × Y → R^d   "joint" feature map

large margin "separation":  ⟨w, φ(x_i, y_i) − φ(x_i, y)⟩ ≥ L(y, y_i) − ξ_i  ∀y

primal problem:

min_{w ∈ R^d}  (λ/2) ‖w‖² + (1/n) ∑_{i=1}^n max_{y ∈ Y} { L(y_i, y) − ⟨w, φ(x_i, y_i) − φ(x_i, y)⟩ }
    (the term φ(x_i, y_i) − φ(x_i, y) gives the (i, y)-th column of A)

(figure: feature maps φ(·, y) of the same input for several candidate labels y)

SLIDE 20

Structural SVM

φ : X × Y → R^d   "joint" feature map

large margin "separation":  ⟨w, φ(x_i, y_i) − φ(x_i, y)⟩ ≥ L(y, y_i) − ξ_i  ∀y

primal problem:

min_{w ∈ R^d}  (λ/2) ‖w‖² + (1/n) ∑_{i=1}^n max_{y ∈ Y} { L(y_i, y) − ⟨w, φ(x_i, y_i) − φ(x_i, y)⟩ }
    (the term φ(x_i, y_i) − φ(x_i, y) gives the (i, y)-th column of A)

The inner maximization over y ∈ Y is the decoding oracle. The label space is exponentially large: for sequence labeling of an image of the word "unexpected", candidate outputs y include "unexpected", "nuexpcted", "aaaaaaa", "uxtecpsss"; for "donaudampfschifffahrtsgesellschaftskapitän" (42 letters), |Y| = 26^42.
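For the structural SVM, the block LMO is exactly one call to this loss-augmented decoding oracle, which is what makes block-coordinate Frank-Wolfe practical here. A minimal sketch of one step in the primal representation maintained by the BCFW paper (keeping w = Σ_i w_i); all names, and `decode` in particular, are illustrative assumptions:

```python
def bcfw_struct_svm_step(w, w_i, i, k, n, lam, decode, phi, x_i, y_i):
    """One block-coordinate Frank-Wolfe step for the structural SVM.

    decode(w, x_i, y_i) is assumed to solve the loss-augmented decoding
    problem argmax_{y in Y} L(y_i, y) - <w, phi(x_i, y_i) - phi(x_i, y)>,
    i.e. the inner maximization of the primal objective above.
    """
    y_hat = decode(w, x_i, y_i)                    # one oracle call
    w_s = (phi(x_i, y_i) - phi(x_i, y_hat)) / (lam * n)
    gamma = 2.0 * n / (k + 2.0 * n)                # or optimize by line-search
    w_i_new = (1.0 - gamma) * w_i + gamma * w_s    # update block i's contribution
    w_new = w - w_i + w_i_new                      # keep w = sum_i w_i consistent
    return w_new, w_i_new
```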

SLIDE 21

Binary SVM vs. Structural SVM

Binary SVM primal:

min_{w ∈ R^d}  (λ/2) ‖w‖² + (1/n) ∑_{i=1}^n max{0, 1 − y_i ⟨w, φ(x_i)⟩}
    (y_i φ(x_i): i-th column of A)

Binary SVM dual:

min_{α ∈ R^n}  f(α) := (λ/2) ‖Aα‖² − b^T α   s.t.  0 ≤ α_i ≤ 1  ∀i ∈ [n]

Structural SVM primal:

min_{w ∈ R^d}  (λ/2) ‖w‖² + (1/n) ∑_{i=1}^n max_{y ∈ Y} { L(y_i, y) − ⟨w, φ(x_i, y_i) − φ(x_i, y)⟩ }
    (φ(x_i, y_i) − φ(x_i, y): (i, y)-th column of A)

Structural SVM dual:

min_{α ∈ R^{n·|Y|}}  f(α) := (λ/2) ‖Aα‖² − b^T α
    s.t.  ∑_{y ∈ Y} α_i(y) = 1  ∀i ∈ [n]   and   α_i(y) ≥ 0  ∀i ∈ [n], ∀y ∈ Y

primal-dual correspondence:  w = Aα

SLIDE 22

Optimization Algorithms: Binary SVM

primal (d-dim; unconstrained; non-smooth, strongly convex):
min_{w ∈ R^d}  (λ/2) ‖w‖² + (1/n) ∑_{i=1}^n max{0, 1 − y_i ⟨w, φ(x_i)⟩}
    (y_i φ(x_i): i-th column of A)

dual (n-dim; box-constrained; smooth, not strongly convex):
min_{α ∈ R^n}  f(α) := (λ/2) ‖Aα‖² − b^T α   s.t.  0 ≤ α_i ≤ 1  ∀i ∈ [n]

batch (cost n per iteration):
  • primal: subgradient descent
  • dual: Frank-Wolfe = cutting plane (SVM-light)

online (cost 1 per iteration) — rate O(R²/(λε)):
  • primal: stochastic subgradient (SGD, Pegasos)
  • dual: coordinate descent (Hsieh 2008) = block-coordinate descent = block-coordinate Frank-Wolfe

SLIDE 23

Optimization Algorithms: Structural SVM

primal (d-dim; unconstrained; non-smooth, strongly convex):
min_{w ∈ R^d}  (λ/2) ‖w‖² + (1/n) ∑_{i=1}^n max_{y ∈ Y} { L(y_i, y) − ⟨w, φ(x_i, y_i) − φ(x_i, y)⟩ }
    (φ(x_i, y_i) − φ(x_i, y): (i, y)-th column of A)

dual (n·|Y|-dim; block-constrained; smooth, not strongly convex):
min_{α ∈ R^{n·|Y|}}  f(α) := (λ/2) ‖Aα‖² − b^T α
    s.t.  ∑_{y ∈ Y} α_i(y) = 1  ∀i ∈ [n]   and   α_i(y) ≥ 0  ∀i ∈ [n], ∀y ∈ Y

batch (cost n per iteration):
  • primal: subgradient descent
  • dual: Frank-Wolfe = cutting plane (SVM-struct)

online (cost 1 per iteration) — rate O(R²/(λε)):
  • primal: stochastic subgradient (SGD, Pegasos)
  • dual: block coordinate descent (Nesterov); block-coordinate Frank-Wolfe

SLIDE 24

Experimental Results

(figures: primal suboptimality for problem (1) vs. effective passes, comparing BCFW, SSG, online-EG, FW, and cutting plane)
  (a) OCR dataset, λ = 0.01
  (b) OCR dataset, λ = 0.001
  (c) OCR dataset, λ = 1/n
  (d) CoNLL dataset, λ = 1/n
  (e) Test error for λ = 1/n on CoNLL
  (f) Matching dataset, λ = 0.001

| dataset  | task                  | n    | d       |
| OCR      | sequence labeling     | 6251 | 4028    |
| CoNLL    | POS sequence labeling | 8936 | 1643026 |
| Matching | word alignment        | 5000 | 82      |

SLIDE 25

Thanks!

Co-Authors: Simon Lacoste-Julien, Mark Schmidt and Patrick Pletscher

References:
  • Lacoste-Julien, S.*, Jaggi, M.*, Schmidt, M., & Pletscher, P. Block-Coordinate Frank-Wolfe Optimization for Structural SVMs. ICML 2013.
  • Jaggi, M. Revisiting Frank-Wolfe: Projection-Free Sparse Convex Optimization. ICML 2013.

SLIDE 26

Related Work

Table 1. Convergence rates given in the number of calls to the oracles for different optimization algorithms for the structural SVM objective (1) in the case of a Markov random field structure, to reach a specific accuracy ε measured for different types of gaps, in terms of the number of training examples n, regularization parameter λ, size of the label space |Y|, and maximum feature norm R := max_{i,y} ‖ψ_i(y)‖₂ (some minor terms were ignored for succinctness). Table inspired by Zhang et al. (2011). Notice that only stochastic subgradient and our proposed algorithm have rates independent of n.

| Optimization algorithm                                | Online | Primal/Dual   | Type of guarantee    | Oracle type         | # Oracle calls               |
| dual extragradient (Taskar et al., 2006)              | no     | primal-"dual" | saddle point gap     | Bregman projection  | O(nR log|Y| / (λε))          |
| online exponentiated gradient (Collins et al., 2008)  | yes    | dual          | expected dual error  | expectation         | O((n + log|Y|) R² / (λε))    |
| excessive gap reduction (Zhang et al., 2011)          | no     | primal-dual   | duality gap          | expectation         | O(nR √(log|Y| / (λε)))       |
| BMRM (Teo et al., 2010)                               | no     | primal        | ≥ primal error       | maximization        | O(nR² / (λε))                |
| 1-slack SVM-Struct (Joachims et al., 2009)            | no     | primal-dual   | duality gap          | maximization        | O(nR² / (λε))                |
| stochastic subgradient (Shalev-Shwartz et al., 2010)  | yes    | primal        | primal error w.h.p.  | maximization        | Õ(R² / (λε))                 |
| this paper: stochastic block-coordinate Frank-Wolfe   | yes    | primal-dual   | expected duality gap | maximization        | O(R² / (λε))  (Thm. 3)       |
SLIDE 27

Experimental Results (with averaging)

| dataset  | task                  | n    | d       |
| OCR      | sequence labeling     | 6251 | 4028    |
| CoNLL    | POS sequence labeling | 8936 | 1643026 |
| Matching | word alignment        | 5000 | 82      |

(figures: primal suboptimality for problem (1) vs. effective passes, comparing BCFW, BCFW-wavg, SSG, SSG-wavg, online-EG, FW, and cutting plane)
  (a) OCR dataset, λ = 0.01
  (b) OCR dataset, λ = 0.001
  (c) OCR dataset, λ = 1/n
  (d) CoNLL dataset, λ = 1/n
  (e) Test error for λ = 1/n on CoNLL
  (f) Matching dataset, λ = 0.001