SLIDE 1

Block Conditional Gradient Algorithms

E. Pauwels, joint work with A. Beck and S. Sabach.

GdT Mathématiques de l'apprentissage, September 24, 2015

SLIDE 2

Context: large scale convex optimization

Two old ideas have received renewed attention in the past years:

Block decomposition:
$$x = \begin{pmatrix} x_1 \\ \vdots \\ x_N \end{pmatrix}$$

Linear oracles:
$$\min_{x \in X} \langle x, c \rangle$$

  • Coordinate descent: large dimension, distributed data.
  • Conditional gradient: "complex constraints", primal-dual interpretation.

Theoretical properties and empirical performances?

SLIDE 3

Scope of the presentation

Most results in the literature hold for random block selection rules. Lacoste-Julien and co-authors analyzed the random block conditional gradient method (RBCG):

◮ Block-Coordinate Frank-Wolfe Optimization for Structural SVMs (ICML 2013)

We propose a convergence analysis for the cyclic block variant (CBCG).

This presentation focuses on machine learning related aspects:
  • General introduction to linear oracle based optimization methods.
  • Specification to (regularized) empirical risk minimization (ERM).
  • Details about the application to structured SVM (Taskar et al., 2003; Tsochantaridis et al., 2005).

SLIDE 4

Outline

  • 1. Context
  • 2. Conditional Gradient algorithm
  • 3. CG and convex duality
  • 4. Block CG and L2 regularized ERM
  • 5. Results

SLIDE 5

Main idea

Optimization setting: $f : \mathbb{R}^n \to \mathbb{R}$ is convex, $C^1$ with $L$-Lipschitz gradient over $X \subset \mathbb{R}^n$, which is convex and compact.
$$\bar f := \min_{x \in X} f(x)$$

Start with $x^0 \in X$ and iterate:
$$p^k \in \operatorname{argmax}_{y \in X} \langle \nabla f(x^k), x^k - y \rangle$$
$$x^{k+1} = (1 - \alpha_k) x^k + \alpha_k p^k, \qquad 0 \le \alpha_k \le 1$$

Step size choices:
  • Open loop: $\alpha_k = \frac{2}{k+2}$
  • Exact line search: $x^{k+1} = \operatorname{argmin}_{y \in [x^k, p^k]} f(y)$
  • Approximate line search: $x^{k+1} = \operatorname{argmin}_{y \in [x^k, p^k]} Q(x^k, y)$

where $f(y) \le Q(x, y) := f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2}\|y - x\|_2^2$ (tangent quadratic upper bound, descent lemma).
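As a concrete illustration, here is a minimal sketch of the iteration in Python with the open-loop step size. The quadratic objective and the box constraint set below are illustrative assumptions; any gradient/linear-oracle pair would do.

```python
import numpy as np

def conditional_gradient(grad, linear_oracle, x0, n_iter=100):
    """Conditional gradient with the open-loop step size alpha_k = 2/(k+2).

    grad(x) returns the gradient of f at x; linear_oracle(g) returns a
    minimizer of <g, y> over X, i.e. a maximizer of <g, x - y>.
    """
    x = x0.copy()
    for k in range(n_iter):
        p = linear_oracle(grad(x))        # p^k: one linear oracle call
        alpha = 2.0 / (k + 2)             # open-loop step size
        x = (1 - alpha) * x + alpha * p   # convex combination stays in X
    return x

# Illustrative instance: f(x) = 0.5 * ||Ax - b||^2 over the cube [0, 1]^n.
rng = np.random.default_rng(0)
n = 50
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)
grad = lambda x: A.T @ (A @ x - b)
oracle = lambda g: (g < 0).astype(float)  # LP over a box: pick extreme points
x_cg = conditional_gradient(grad, oracle, np.full(n, 0.5))
```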

SLIDE 6

Historical remarks

Fifty years ago:
  • First appearance for quadratic programs (Frank, Wolfe, 1956).
  • $f(x^k) - \bar f = O(1/k)$ (Polyak, Dunn, Dem'yanov, ..., 60's).
  • For any $\epsilon > 0$, the rate cannot be $O(1/k^{1+\epsilon})$ (Canon, Cullum, Polyak, 60's).

Recent developments (illustrations follow):
  • Revival for large scale problems.
  • Primal-dual interpretation (Bach 2015) and convergence analysis (Jaggi 2013).
  • Block decomposition variants (Lacoste-Julien et al. 2013).

SLIDE 7

Why is it interesting?

$O(1/k^2)$ can be achieved by using projections (Beck, Teboulle 2009), so conditional gradient does not compete in practice. In some situations, however, projection does not constitute a practical alternative, whereas linear programs on convex compact sets attain their value at extreme points.

Trace norm: for $M \in \mathbb{R}^{m \times n}$, $\|M\|_* = \sum_i \sigma_i$, where $\{\sigma_i\}$ is the set of singular values of $M$.
  • Projection on the trace norm ball is a thresholding of singular values → full SVD.
  • Linear programming on the trace norm ball is finding the largest singular value → leading singular vectors.
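To make the comparison concrete, a sketch of the trace norm linear oracle is below; it only needs the leading singular triplet (computed here with `scipy.sparse.linalg.svds`), whereas the projection would need a full SVD plus spectrum thresholding. The function name and `radius` argument are illustrative.

```python
import numpy as np
from scipy.sparse.linalg import svds

def trace_norm_linear_oracle(G, radius=1.0):
    """Solve min <G, M> over the ball ||M||_* <= radius.

    The minimum is attained at the rank-one extreme point
    -radius * u1 v1^T, where (u1, v1) are the leading singular vectors
    of G, so only the top singular triplet is needed (no full SVD).
    """
    u, s, vt = svds(G, k=1)
    return -radius * np.outer(u[:, 0], vt[0, :])
```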

SLIDE 8

Outline

  • 1. Context
  • 2. Conditional Gradient algorithm
  • 3. CG and convex duality
  • 4. Block CG and L2 regularized ERM
  • 5. Results

SLIDE 9

Convex duality

Recall that $X$ is convex and compact. Define its support function $g : \mathbb{R}^n \to \mathbb{R}$:
$$g : w \mapsto \max_{x \in X} \langle x, w \rangle$$

Given $A \in \mathbb{R}^{n \times m}$ and $b \in \mathbb{R}^n$, consider the problems
$$\bar p = \min_{w \in \mathbb{R}^m} \frac{1}{2}\|w\|_2^2 + g(-Aw + b) \quad (= P(w))$$
$$\bar d = \min_{x \in X} \frac{1}{2}\|A^T x\|_2^2 - \langle x, b \rangle \quad (= D(x))$$

Weak duality: for any $w \in \mathbb{R}^m$ and $x \in X$, $P(w) + D(x) \ge 0$. Strong duality holds: $\bar p + \bar d = 0$.
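Weak duality can be checked in one line from the definition of the support function, since $g(-Aw + b) \ge \langle x, -Aw + b \rangle$ for every $x \in X$:
$$P(w) + D(x) \ge \frac{1}{2}\|w\|_2^2 + \langle x, -Aw + b \rangle + \frac{1}{2}\|A^T x\|_2^2 - \langle x, b \rangle = \frac{1}{2}\|w - A^T x\|_2^2 \ge 0.$$
Equality requires $w = A^T x$, which is exactly the primal-dual mapping $w^k = A^T x^k$ used on the next slide.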

SLIDE 10

Primal subgradient and dual conditional gradient

$$g : w \mapsto \max_{x \in X} \langle x, w \rangle \qquad (x \in \operatorname{argmax} \Leftrightarrow x \in \partial g(w))$$
$$\bar p = \min_{w \in \mathbb{R}^m} \frac{1}{2}\|w\|_2^2 + g(-Aw + b) \quad (= P(w)), \qquad \bar d = \min_{x \in X} \frac{1}{2}\|A^T x\|_2^2 - \langle x, b \rangle \quad (= D(x))$$

A conditional gradient step in the dual computes
$$p^k \in \operatorname{argmax}_{y \in X} \langle AA^T x^k - b, x^k - y \rangle,$$
and the attained value is exactly the duality gap:
$$\max_{y \in X} \langle AA^T x^k - b, x^k - y \rangle = \|A^T x^k\|_2^2 - \langle b, x^k \rangle + g(-AA^T x^k + b) = P(A^T x^k) + D(x^k).$$

Consider the primal variable $w^k = A^T x^k$: we have $p^k \in \partial g(-Aw^k + b)$ and
$$w^{k+1} - w^k = \alpha_k A^T(p^k - x^k) \in -\alpha_k \, \partial P(w^k).$$

Implicit subgradient steps in the primal!
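Since the maximized linear form equals $P(A^T x^k) + D(x^k)$, every conditional gradient step yields a duality gap certificate at no extra cost. A minimal sketch (the `linear_oracle` callable standing in for the set $X$ is an assumption):

```python
import numpy as np

def duality_gap(A, b, x, linear_oracle):
    """Duality gap P(A^T x) + D(x) from a single linear oracle call.

    g = nabla D(x) = A A^T x - b, and max_y <g, x - y> over X equals the
    gap, so the oracle output doubles as a stopping-criterion certificate.
    """
    g = A @ (A.T @ x) - b
    p = linear_oracle(g)          # argmin over X of <g, y>
    return float(g @ (x - p))
```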

SLIDE 11

Primal subgradient and dual conditional gradient

  • The primal-dual interpretation holds in much more general settings (Bach 2015).
  • Primal-dual convergence analysis: $\min_{i=1,\ldots,k} P(w^i) + D(x^i) = O(1/k)$ (Jaggi 2013).
  • Automatic step size tuning for subgradient descent in the primal.

SLIDE 12

Outline

  • 1. Context
  • 2. Conditional Gradient algorithm
  • 3. CG and convex duality
  • 4. Block CG and L2 regularized ERM
  • 5. Results

SLIDE 13

L2 regularized ERM

Consider a problem of the form:
$$\bar p = \min_{w \in \mathbb{R}^m} \frac{\lambda}{2}\|w\|_2^2 + \frac{1}{N}\sum_{i=1}^N g(-A_i w + b_i) \quad (= P(w))$$
$$\bar d = \min_{x_i \in X,\ i=1,\ldots,N} \frac{\lambda}{2}\Big\|\frac{1}{\lambda N}\sum_{i=1}^N A_i^T x_i\Big\|_2^2 - \frac{1}{N}\sum_{i=1}^N \langle x_i, b_i \rangle \quad (= D(x))$$

Binary SVM: dataset $(a_i, l_i) \in \mathbb{R}^m \times \{-1, 1\}$, $i = 1, \ldots, N$:
$$P(w) = \frac{\lambda}{2}\|w\|_2^2 + \frac{1}{N}\sum_{i=1}^N \max(0, 1 - l_i a_i^T w)$$

Prediction: $l(a, w) = \operatorname{argmax}_{l \in \{-1,1\}} l\, a^T w = \operatorname{sign}(a^T w)$. The hinge loss is a convex surrogate of the empirical risk $\frac{1}{N}\sum_{i=1}^N \mathbb{1}\left(l(a_i, w) \ne l_i\right)$.
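To see how the binary SVM fits this template, note that the hinge loss is the support function of the segment $X = [0, 1]$ evaluated at $1 - l_i a_i^T w$, i.e. with $A_i = l_i a_i^T$ and $b_i = 1$:
$$\max(0,\ 1 - l_i a_i^T w) = \max_{x_i \in [0,1]} x_i \left(1 - l_i a_i^T w\right) = g(-A_i w + b_i).$$
The dual blocks therefore live in segments, which is why the dual constraint set of the binary SVM is a box.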

SLIDE 14

L2 regularized ERM: dual block conditional gradient

The dual has a separable block structure: $x_i \in X$, $i = 1, \ldots, N$. Start with $x_i^0 \in X$, $i = 1, \ldots, N$, and iterate for $k \in \mathbb{N}$ and $i = 1, \ldots, N$:
$$p_i^k \in \operatorname{argmax}_{y \in X} \langle \nabla_{x_i} D(x^k), x_i^k - y \rangle$$
$$x_i^{k+1} = (1 - \alpha_i^k) x_i^k + \alpha_i^k p_i^k, \qquad 0 \le \alpha_i^k \le 1$$

Mainly three ways to choose blocks:
  • Uniformly at random (Lacoste-Julien et al. 2013).
  • Cyclic (Beck et al. 2015).
  • Essentially cyclic, "random permutation" (Beck et al. 2015).

Primal interpretation: a subgradient method (stochastic, cyclic, etc.), with $p_i^k \in \partial g(-A_i w^k + b_i)$.
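As an illustration, here is a minimal cyclic BCG sketch for the binary SVM dual above, where each block is the segment $[0, 1]$ and the primal iterate is maintained through the mapping $w = \frac{1}{\lambda N}\sum_i x_i l_i a_i$ (per the scaling assumed in the dual two slides back); the open-loop step is refreshed once per pass, and all names are illustrative.

```python
import numpy as np

def cbcg_binary_svm(a, l, lam=0.1, n_epochs=50):
    """Cyclic block conditional gradient on the binary SVM dual.

    a: (N, m) features; l: (N,) labels in {-1, +1}. Each dual block
    x_i lies in [0, 1]; w = (1 / (lam * N)) * sum_i x_i * l_i * a_i.
    """
    N, m = a.shape
    x = np.zeros(N)
    w = np.zeros(m)
    for epoch in range(n_epochs):
        alpha = 2.0 / (epoch + 2)                   # open-loop step per pass
        for i in range(N):                          # cyclic block selection
            grad_i = (l[i] * (a[i] @ w) - 1.0) / N  # nabla_{x_i} D(x)
            p_i = 1.0 if grad_i < 0 else 0.0        # linear oracle on [0, 1]
            step = alpha * (p_i - x[i])
            x[i] += step
            w += step * l[i] * a[i] / (lam * N)     # keep w in sync with x
    return w, x
```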
SLIDE 15

Structured output learning and structured SVM

Dataset: $(a_i, l_i) \in A \times L$, $i = 1, \ldots, N$, where $L$ is discrete and structured:
  • Feature function: $\phi : A \times L \to \mathbb{R}^m$
  • Prediction: $l(a, w) = \operatorname{argmax}_{l \in L} \langle w, \phi(a, l) \rangle$
  • Risk function: $\Delta : L^2 \to \mathbb{R}_+$
  • Empirical risk: $w \mapsto \sum_{i=1}^N \Delta(l_i, l(a_i, w))$

Convex relaxation:
$$w \mapsto \sum_{i=1}^N \max_{l \in L} \left\{\Delta(l_i, l) - \langle w, \phi(a_i, l_i) - \phi(a_i, l) \rangle\right\}$$

Structured SVM:
$$\min_{w \in \mathbb{R}^m} \frac{\lambda}{2}\|w\|^2 + \sum_{i=1}^N \max_{l \in L} \left\{\Delta(l_i, l) - \langle w, \phi(a_i, l_i) - \phi(a_i, l) \rangle\right\}$$

Binary SVM: $L = \{-1, 1\}$, $\phi(a, l) = la$, and $\Delta$ is the 0-1 loss. Prediction is a sign (optimize over a set of size 2). The dual constraint set is a box (product of segments).

Label sequence learning: $L$ is the set of possible words over an alphabet, $\phi$ is inspired by HMMs (unary and binary terms over a chain), and $\Delta$ is the Hamming distance. Prediction (or decoding) is done by dynamic programming (the Viterbi algorithm). The dual constraint set is a product of simplices (of size $|L|$).
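For label sequence learning, the block linear oracle of the dual amounts to loss-augmented decoding, $\operatorname{argmax}_{l \in L} \{\Delta(l_i, l) + \langle w, \phi(a_i, l) \rangle\}$, and the Hamming distance decomposes over positions, so the same Viterbi recursion applies. A sketch under assumed `unary`/`binary` score tables (illustrative names, not from the deck):

```python
import numpy as np

def loss_augmented_viterbi(unary, binary, y_true):
    """Loss-augmented decoding for a chain model with Hamming loss:
    argmax_y sum_t unary[t, y_t] + sum_t binary[y_t, y_{t+1}] + Hamming(y, y_true).

    unary: (T, S) per-position state scores <w, phi>; binary: (S, S)
    transition scores; y_true: (T,) gold labels.
    """
    T, S = unary.shape
    aug = unary + 1.0                      # Hamming: +1 for every state ...
    aug[np.arange(T), y_true] -= 1.0       # ... except the gold one
    score = np.empty((T, S))
    back = np.zeros((T, S), dtype=int)
    score[0] = aug[0]
    for t in range(1, T):
        cand = score[t - 1][:, None] + binary   # (prev state, current state)
        back[t] = np.argmax(cand, axis=0)
        score[t] = cand[back[t], np.arange(S)] + aug[t]
    y = np.zeros(T, dtype=int)
    y[-1] = int(np.argmax(score[-1]))
    for t in range(T - 1, 0, -1):               # backtrack the best path
        y[t - 1] = back[t, y[t]]
    return y
```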

SLIDE 16

Outline

  • 1. Context
  • 2. Conditional Gradient algorithm
  • 3. CG and convex duality
  • 4. Block CG and L2 regularized ERM
  • 5. Results

SLIDE 17

Convergence rates

$\tilde k$: number of effective passes through the $N$ blocks. The rates are given for the duality gap. $B$: diameter of the dual constraint set $X \times X \times \ldots \times X$. $L$: Lipschitz modulus of $\nabla D$.

Random block: the rate relates to an expectation (Lacoste-Julien et al. 2013):
$$O\left(\frac{1}{\tilde k}\left(L B^2 + D(x^0)\right)\right)$$

Cyclic block: deterministic rate (Beck et al. 2015):
  • Approximate line search: $O\left(\frac{1}{\tilde k}\, L B^2 N \frac{L}{\beta}\right)$
  • Open loop, $\alpha_i^{\tilde k} = \frac{2}{\tilde k + 2}$: $O\left(\frac{1}{\tilde k}\, L B^2 \sqrt{N}\right)$

where $\beta$ is the smallest block Lipschitz modulus of $\nabla D$ (variations constrained to a single block).

SLIDE 18

Results on synthetic problems

1000 random QPs over the unit cube in $\mathbb{R}^{100}$ (normalized).

[Figure: objective measure $f$ (log scale, $10^{-6}$ to $10^0$) versus number of passes $k$; one panel per step-size rule (predefined step, backtracking, exact line search); methods compared: CBCG-P, CBCG-C, RBCG, CG.]

SLIDE 19

Results on structural SVM

Handwritten word recognition.

[Figure: duality gap (log scale) versus number of passes $k$; one panel per regularization level $\lambda \in \{0.001, 0.01, 0.1\}$; methods compared: CBCG-C, CBCG-P, RBCG.]

SLIDE 20

Conclusion regarding the cyclic block selection rule

One of the few attempts to analyse essentially cyclic methods. There remains a huge gap in the guarantees compared to random selection, yet the method is efficient in practice.

Future directions: closing the gap between theory and practice, linear convergence, exact line search, inexact oracles.

SLIDE 21

General conclusion

Nice duality between constraint block decomposition and sequential methods for sums. Conditional gradient is "bad", but it is good in settings for which nothing else is affordable.