
Smooth Constrained Convex Minimization via Conditional Gradients



  1. Smooth Constrained Convex Minimization via Conditional Gradients
Sebastian Pokutta
H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology
Center for Machine Learning @ GT (ML@GT)
Algorithm and Randomness Center (ARC)
Tokyo, 03/2019
twitter: @spokutta

  2. Joint Work with… (in random order)
Thomas Kerdreux, Alexandre D'Aspremont, Gábor Braun, Cyrille Combettes, Swati Gupta, Robert Hildebrand, Stephen Wright, Yi Zhou, Dan Tu, Daniel Zink, George Lan

  3. (Constrained) Convex Optimization
Convex optimization: given a feasible region $P$, solve the optimization problem
$$\min_{x \in P} f(x),$$
where $f$ is a convex function (+ extra properties).
Our setup (source: [Jaggi 2013]):
1. Access to $P$. Linear Optimization (LO) oracle: given a linear objective $c$, return $x \leftarrow \arg\min_{v \in P} c^\top v$.
2. Access to $f$. First-Order (FO) oracle: given $x$, return $\nabla f(x)$ and $f(x)$.
=> Complexity of convex optimization relative to the LO/FO oracle.
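To make the two oracles concrete, here is a minimal Python sketch (my own illustration, not code from the talk) for a Lasso-type instance: a quadratic loss as the FO oracle and the scaled $\ell_1$-ball as the LO oracle; the function names and the radius `tau` are placeholders.

```python
import numpy as np

# Feasible region P = {x : ||x||_1 <= tau}, objective f(x) = 0.5 * ||A x - b||^2.

def fo_oracle(A, b, x):
    """First-Order oracle: return f(x) and grad f(x) for the quadratic loss."""
    r = A @ x - b
    return 0.5 * float(r @ r), A.T @ r

def lo_oracle(c, tau):
    """Linear Optimization oracle over the scaled l1-ball:
    argmin_{||v||_1 <= tau} <c, v> is attained at a signed, scaled unit vector."""
    v = np.zeros_like(c, dtype=float)
    i = int(np.argmax(np.abs(c)))
    v[i] = -tau * np.sign(c[i])
    return v
```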

  4. Why would you care for constrained convex optimization?
The setup captures various problems in machine learning, e.g.:
1. OCR (structured SVM training): marginal polytope over the chain graph of the letters of a word, quadratic loss
2. Video co-localization: flow polytope, quadratic loss
3. Lasso: scaled $\ell_1$-ball, quadratic loss (regression)
4. Regression over structured objects: regression over the convex hull of combinatorial atoms
5. Approximation of distributions: Bayesian inference, sequential kernel herding, …

  5. Smooth Convex Optimization 101

  6. Basic notions
Let $f: \mathbb{R}^n \to \mathbb{R}$ be a function. We will use the following basic concepts:
Smoothness. $f(y) \le f(x) + \nabla f(x)^\top (y - x) + \frac{L}{2}\|y - x\|^2$
Convexity. $f(y) \ge f(x) + \nabla f(x)^\top (y - x)$
Strong convexity. $f(y) \ge f(x) + \nabla f(x)^\top (y - x) + \frac{\mu}{2}\|y - x\|^2$
=> How to use these for optimization is unclear so far. Next step: operationalize the notions!
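As a quick worked example (not on the original slide), the quadratic $f(x) = \frac{1}{2}\|x\|^2$ satisfies all three notions with $L = \mu = 1$, since its second-order expansion is exact:
$$f(y) = \tfrac{1}{2}\|y\|^2 = \tfrac{1}{2}\|x\|^2 + x^\top (y - x) + \tfrac{1}{2}\|y - x\|^2 = f(x) + \nabla f(x)^\top (y - x) + \tfrac{1}{2}\|y - x\|^2.$$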

  7. Measures of Progress: Smoothness and Idealized Gradient Descent
Consider an iterative algorithm of the form $x_{t+1} \leftarrow x_t - \gamma_t d_t$.
By definition of smoothness:
$$f(x_t) - f(x_{t+1}) \ge \gamma_t \nabla f(x_t)^\top d_t - \gamma_t^2 \frac{L}{2}\|d_t\|^2.$$
Smoothness induces primal progress. Optimizing the right-hand side:
$$f(x_t) - f(x_{t+1}) \ge \frac{\big(\nabla f(x_t)^\top d_t\big)^2}{2L\|d_t\|^2} \qquad \text{for } \gamma_t^* = \frac{\nabla f(x_t)^\top d_t}{L\|d_t\|^2}.$$
Idealized Gradient Descent (IGD). Choose $d_t \leftarrow x_t - x^*$ (non-deterministic!):
$$f(x_t) - f(x_{t+1}) \ge \frac{\big(\nabla f(x_t)^\top (x_t - x^*)\big)^2}{2L\|x_t - x^*\|^2} \qquad \text{for } \gamma_t^* = \frac{\nabla f(x_t)^\top (x_t - x^*)}{L\|x_t - x^*\|^2}.$$
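A minimal sketch (my notation, not code from the talk) of this "short step" rule, assuming an upper bound `L` on the smoothness constant is available:

```python
import numpy as np

def short_step(grad, d, L):
    """Step size maximizing the smoothness progress bound for direction d.

    Returns gamma* = <grad, d> / (L * ||d||^2) and the guaranteed primal
    progress <grad, d>^2 / (2 * L * ||d||^2).  Illustrative only; L is an
    assumed smoothness bound for f.
    """
    gd = float(grad @ d)
    dd = float(d @ d)
    return gd / (L * dd), gd ** 2 / (2.0 * L * dd)
```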

  8. Measures of Optimality: Convexity
Recall convexity: $f(y) \ge f(x) + \nabla f(x)^\top (y - x)$.
Primal bound from convexity. With $x \leftarrow x_t$ and $y \leftarrow x^* \in \arg\min_{x \in P} f(x)$:
$$h_t := f(x_t) - f(x^*) \le \nabla f(x_t)^\top (x_t - x^*).$$
Plugging this into the progress from IGD, together with $\|x_t - x^*\| \le \|x_0 - x^*\|$:
$$f(x_t) - f(x_{t+1}) \ge \frac{\big(\nabla f(x_t)^\top (x_t - x^*)\big)^2}{2L\|x_t - x^*\|^2} \ge \frac{h_t^2}{2L\|x_0 - x^*\|^2}.$$
Rearranging provides contraction and convergence rate:
$$h_{t+1} \le h_t \left(1 - \frac{h_t}{2L\|x_0 - x^*\|^2}\right) \quad\Rightarrow\quad h_T \le \frac{2L\|x_0 - x^*\|^2}{T + 4}.$$

  9. Measures of Optimality: Strong Convexity
Recall strong convexity: $f(y) \ge f(x) + \nabla f(x)^\top (y - x) + \frac{\mu}{2}\|y - x\|^2$.
Primal bound from strong convexity. With $x \leftarrow x_t$ and $y \leftarrow x_t - \gamma (x_t - x^*)$:
$$h_t := f(x_t) - f(x^*) \le \frac{\big(\nabla f(x_t)^\top (x_t - x^*)\big)^2}{2\mu\|x_t - x^*\|^2}.$$
Plugging this into the progress from IGD:
$$f(x_t) - f(x_{t+1}) \ge \frac{\big(\nabla f(x_t)^\top (x_t - x^*)\big)^2}{2L\|x_t - x^*\|^2} \ge \frac{\mu}{L} h_t.$$
Rearranging provides contraction and convergence rate:
$$h_{t+1} \le h_t \left(1 - \frac{\mu}{L}\right) \quad\Rightarrow\quad h_T \le e^{-\frac{\mu}{L} T} \, h_0.$$
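A one-line consequence worth spelling out (my addition, not on the slide): the linear rate means the iteration count is governed by the condition number $L/\mu$,
$$h_T \le e^{-\frac{\mu}{L} T} h_0 \le \varepsilon \quad \text{whenever} \quad T \ge \frac{L}{\mu} \log \frac{h_0}{\varepsilon}.$$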

  10. From IGD to actual algorithms
Consider an algorithm of the form $x_{t+1} \leftarrow x_t - \gamma_t d_t$.
Scaling condition (Scaling). Show that there exist $\alpha_t$ with
$$\frac{\nabla f(x_t)^\top d_t}{\|d_t\|} \ge \alpha_t \, \frac{\nabla f(x_t)^\top (x_t - x^*)}{\|x_t - x^*\|}.$$
=> We lose an $\alpha_t^2$ factor in iteration $t$; bounds and rates follow immediately.
Example. (Vanilla) gradient descent with $d_t \leftarrow \nabla f(x_t)$:
$$\frac{\nabla f(x_t)^\top d_t}{\|d_t\|} = \|\nabla f(x_t)\| \ge 1 \cdot \frac{\nabla f(x_t)^\top (x_t - x^*)}{\|x_t - x^*\|}$$
by Cauchy-Schwarz, i.e., (Scaling) holds with $\alpha_t = 1$.
=> TODAY: no more convergence proofs, just establishing (Scaling).
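Putting the template together: a minimal sketch (illustration only, assuming a known smoothness bound `L`) of vanilla gradient descent, for which the short step $\gamma_t^*$ reduces to $1/L$:

```python
import numpy as np

def gradient_descent(grad_f, x0, L, num_steps=100):
    """Vanilla gradient descent x_{t+1} = x_t - (1/L) * grad_f(x_t).

    Instance of the template x_{t+1} = x_t - gamma_t * d_t with
    d_t = grad_f(x_t), for which (Scaling) holds with alpha_t = 1 and the
    short step gamma_t* = <grad, d> / (L * ||d||^2) equals 1/L.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(num_steps):
        x = x - grad_f(x) / L  # direction d_t = grad f(x_t), short step 1/L
    return x

# Example usage on least squares, f(x) = 0.5 * ||A x - b||^2, with L = ||A||_2^2:
# grad_f = lambda x: A.T @ (A @ x - b)
```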

  11. Conditional Gradients (a.k.a. Frank-Wolfe Algorithm)

  12. Conditional Gradients a.k.a. Frank-Wolfe Algorithm
Advantages (source: [Jaggi 2013]):
1. Extremely simple and robust: no complicated data structures to maintain
2. Easy to implement: requires only a linear optimization oracle (first-order method)
3. Projection-free: feasibility via the linear optimization oracle
4. Sparse distributions over vertices: the solution is a convex combination of vertices (enables sampling)
Disadvantages:
1. Suboptimal convergence rate of $O(1/T)$ in the worst case
=> Despite the suboptimal rate, often used because of its simplicity.

  13. Conditional Gradients a.k.a. Frank-Wolfe Algorithm
[Figure: a few Frank-Wolfe iterations on a polytope; starting from $x_1 = v_1$, the negative gradients $-\nabla f(x_1)$ and $-\nabla f(x_2)$ select the vertices $v_2$ and $v_3$, yielding the iterates $x_2$ and $x_3$.]
Note: A) iterates are formed as convex combinations of vertices; B) the vertices used to write an iterate form its "active set".

  14. Conditional Gradients a.k.a. Frank-Wolfe Algorithm
Establishing (Scaling). The FW algorithm takes the direction $d_t = x_t - v_t$ with $v_t \leftarrow \arg\min_{v \in P} \nabla f(x_t)^\top v$ (source: [Jaggi 2013]). Observe that
$$\nabla f(x_t)^\top (x_t - v_t) \ge \nabla f(x_t)^\top (x_t - x^*),$$
since $v_t$ minimizes the linear objective over $P$ and $x^* \in P$. Hence, with $\alpha_t = \|x_t - x^*\|/D$, where $D$ is the diameter of $P$:
$$\frac{\nabla f(x_t)^\top (x_t - v_t)}{\|x_t - v_t\|} \ge \frac{\|x_t - x^*\|}{D} \cdot \frac{\nabla f(x_t)^\top (x_t - x^*)}{\|x_t - x^*\|}.$$
=> This $\alpha_t$ is sufficient for $O(1/t)$ convergence, but can we do better?
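A minimal sketch of the vanilla Frank-Wolfe loop in this notation (my own illustration, not code from the talk; `grad_f` and `lmo` are assumed callables, e.g. the $\ell_1$-ball oracle sketched earlier, and the agnostic step size $2/(t+2)$ is one standard choice):

```python
import numpy as np

def frank_wolfe(grad_f, lmo, x0, num_steps=100):
    """Vanilla Frank-Wolfe / conditional gradients (illustrative sketch).

    grad_f(x) returns the gradient of f at x; lmo(c) returns
    argmin_{v in P} <c, v>.  Projection-free: each iterate is a convex
    combination of vertices of P and hence feasible.
    """
    x = np.asarray(x0, dtype=float)
    for t in range(num_steps):
        g = grad_f(x)
        v = lmo(g)                         # FW vertex v_t
        gamma = 2.0 / (t + 2.0)            # agnostic step size; line search also works
        x = (1.0 - gamma) * x + gamma * v  # move toward v_t along d_t = x_t - v_t
    return x
```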

  15. The strongly convex case: linear convergence in special cases
If $f$ is strongly convex we would expect a linear rate of convergence.
Obstacle. In
$$\frac{\nabla f(x_t)^\top (x_t - v_t)}{\|x_t - v_t\|} \ge \frac{\|x_t - x^*\|}{D} \cdot \frac{\nabla f(x_t)^\top (x_t - x^*)}{\|x_t - x^*\|}$$
the factor $\|x_t - x^*\|/D$ can become arbitrarily small as $x_t \to x^*$.
Special case: $x^* \in \operatorname{rel.int}(P)$, say $B(x^*, 2r) \subseteq P$. Then:
Theorem [Marcotte, Guélat '86]. After a few iterations,
$$\frac{\nabla f(x_t)^\top (x_t - v_t)}{\|x_t - v_t\|} \ge \frac{r}{D} \cdot \frac{\nabla f(x_t)^\top (x_t - x^*)}{\|x_t - x^*\|},$$
and linear convergence follows via (Scaling).

  16. The strongly convex case: is linear convergence possible in general?
(Vanilla) Frank-Wolfe cannot achieve linear convergence in general:
Theorem [Wolfe '70]. Suppose $x^*$ lies on the boundary of $P$. Then for any $\varepsilon > 0$ and for infinitely many $t$:
$$f(x_t) - f(x^*) \ge \frac{1}{t^{1+\varepsilon}}.$$
[Wolfe '70] proposed Away Steps. Issue: zig-zagging (because of first-order optimality).

  17. The strongly convex case: linear convergence in general
First linear convergence result (in general) due to [Garber, Hazan '13]:
1. Simulates (theoretically efficiently) a stronger oracle rather than using Away Steps
2. The involved constants are extremely large => the algorithm is unimplementable
Linear convergence for implementable variants due to [Lacoste-Julien, Jaggi '15]:
1. (Dominating) away steps are enough
2. Includes most known variants: Away-Step FW, Pairwise CG, Fully-Corrective FW, Wolfe's algorithm, …
3. Key ingredient: there exists $w(P)$ (depending on the polytope $P$ only!) such that
$$\nabla f(x_t)^\top (a_t - v_t) \ge w(P) \cdot \frac{\nabla f(x_t)^\top (x_t - x^*)}{\|x_t - x^*\|}$$
($d_t = a_t - v_t$ is basically the direction that either variant dominates)
=> Linear convergence via (Scaling)
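For illustration, a compact sketch of the away-step variant (my own code under the notation above, not from the talk; `grad_f`, `lmo`, and a smoothness bound `L` are assumed inputs, and practical implementations typically use a line search instead of the capped short step used here):

```python
import numpy as np

def away_step_frank_wolfe(grad_f, lmo, x0_vertex, L, num_steps=100):
    """Away-Step Frank-Wolfe (illustrative sketch).

    The iterate is maintained as a convex combination over an active set of
    vertices; each iteration either moves toward the FW vertex or away from
    the worst vertex currently in the active set, whichever direction aligns
    better with the negative gradient.
    """
    x = np.asarray(x0_vertex, dtype=float)
    active = {tuple(x): 1.0}                       # vertex -> convex weight
    for _ in range(num_steps):
        g = grad_f(x)
        v = lmo(g)                                 # FW vertex
        d_fw = v - x
        a_key = max(active, key=lambda u: float(g @ np.array(u)))
        a = np.array(a_key)                        # away vertex (worst active vertex)
        d_away = x - a
        if float(-g @ d_fw) >= float(-g @ d_away):
            d, gamma_max, key, is_away = d_fw, 1.0, tuple(v), False
        else:
            w_a = active[a_key]
            d, gamma_max, key, is_away = d_away, w_a / (1.0 - w_a), a_key, True
        dd = float(d @ d)
        if dd == 0.0:
            break                                  # zero FW gap: x is optimal
        gamma = min(gamma_max, float(-g @ d) / (L * dd))   # capped short step
        x = x + gamma * d
        if is_away:                                # x <- (1 + gamma) x - gamma a
            active = {k: w * (1.0 + gamma) for k, w in active.items()}
            active[key] -= gamma
        else:                                      # x <- (1 - gamma) x + gamma v
            active = {k: w * (1.0 - gamma) for k, w in active.items()}
            active[key] = active.get(key, 0.0) + gamma
        active = {k: w for k, w in active.items() if w > 1e-12}
    return x
```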

  18. Many more variants and results…
Recently there has been a lot of work on conditional gradients, e.g.:
1. Linear convergence for conditional gradient sliding [Lan, Zhou '14]
2. Linear convergence for (some) non-strongly convex functions [Beck, Shtern '17]
3. Online FW [Hazan, Kale '12; Chen et al. '18]
4. Stochastic FW [Reddi et al. '16] and variance-reduced stochastic FW [Hazan, Luo '16; Chen et al. '18]
5. In-face directions [Freund, Grigas '15]
6. Improved convergence under sharpness [Kerdreux, D'Aspremont, P. '18]
… and many more!
=> Very competitive and versatile in real-world applications

  19. Revisiting Conditional Gradients
