 
Smooth Constraint Convex Minimization via Conditional Gradients
Sebastian Pokutta
H. Milton Stewart School of Industrial and Systems Engineering
Center for Machine Learning @ GT (ML@GT)
Algorithm and Randomness Center (ARC)
Georgia Institute of Technology
Tokyo, 03/2019
twitter: @spokutta

Joint Work with… (in random order)
Alexandre D’Aspremont, Thomas Kerdreux, Gábor Braun, Cyrille Combettes, Swati Gupta, Stephen Wright, Daniel Zink, Robert Hildebrand, Yi Zhou, Dan Tu, George Lan

(Constraint) Convex Optimization
Convex Optimization: Given a feasible region $P$, solve the optimization problem
$\min_{x \in P} f(x)$,
where $f$ is a convex function (+ extra properties).
Our setup.
1. Access to $P$. Linear Optimization (LO) oracle: Given linear objective $c$, return $x \leftarrow \operatorname{argmin}_{v \in P} c^\top v$
2. Access to $f$. First-Order (FO) oracle: Given $x$, return $\nabla f(x)$ and $f(x)$
=> Complexity of convex optimization relative to LO/FO oracle
[Figure. Source: [Jaggi 2013]]

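The two oracles can be sketched in a few lines. This is only an illustration under assumptions that are not in the slides: the feasible region is taken to be the probability simplex, the objective is the quadratic $f(x) = \frac{1}{2}\|x - b\|^2$, and the names `lo_oracle_simplex` and `make_fo_oracle` are hypothetical.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def lo_oracle_simplex(c: Vector) -> Vector:
    """LO oracle for the probability simplex (assumed region):
    returns a vertex e_i minimizing the linear objective c^T v."""
    best = min(range(len(c)), key=lambda i: c[i])
    return [1.0 if i == best else 0.0 for i in range(len(c))]


def make_fo_oracle(b: Vector) -> Callable[[Vector], Tuple[Vector, float]]:
    """FO oracle for the illustrative objective f(x) = 0.5 * ||x - b||^2."""
    def fo_oracle(x: Vector) -> Tuple[Vector, float]:
        grad = [xi - bi for xi, bi in zip(x, b)]   # gradient of f at x
        value = 0.5 * sum(g * g for g in grad)     # function value f(x)
        return grad, value
    return fo_oracle


# One round of oracle calls: query the FO oracle, then minimize the
# linearization over the simplex with the LO oracle.
fo = make_fo_oracle([0.2, 0.3, 0.5])
grad, value = fo([1.0, 0.0, 0.0])
vertex = lo_oracle_simplex(grad)
```

An algorithm in this model touches the feasible region only through `lo_oracle_simplex` and the objective only through `fo_oracle`, which is what makes the complexity statements oracle-relative.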
Why would you care for constraint convex optimization?
Setup captures various problems in Machine Learning, e.g.:
1. OCR (Structured SVM Training)
   1. Marginal polytope over chain graph of letters of word and quadratic loss
2. Video Co-Localization
   1. Flow polytope and quadratic loss
3. Lasso
   1. Scaled $\ell_1$-ball and quadratic loss (regression)
4. Regression over structured objects
   1. Regression over convex hull of combinatorial atoms
5. Approximation of distributions
   1. Bayesian inference, sequential kernel herding, …

Smooth Convex Optimization 101

Basic notions
Let $f: \mathbb{R}^n \to \mathbb{R}$ be a function. We will use the following basic concepts:
Smoothness. $f(y) \leq f(x) + \nabla f(x)^\top (y - x) + \frac{L}{2} \|x - y\|^2$
Convexity. $f(y) \geq f(x) + \nabla f(x)^\top (y - x)$
Strong Convexity. $f(y) \geq f(x) + \nabla f(x)^\top (y - x) + \frac{\mu}{2} \|x - y\|^2$
=> Use for optimization unclear. Next step: Operationalize notions!

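A quick numerical check can make the three inequalities concrete. The sketch below assumes a diagonal quadratic $f(x) = \frac{1}{2}\sum_i a_i x_i^2$ (not from the slides), for which the smoothness constant is $\max_i a_i$ and the strong convexity constant is $\min_i a_i$; all function names are hypothetical.

```python
from typing import Dict, List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(ui * vi for ui, vi in zip(u, v))


def f(x: Vector, a: Vector) -> float:
    """Illustrative objective f(x) = 0.5 * sum_i a_i * x_i^2."""
    return 0.5 * sum(ai * xi * xi for ai, xi in zip(a, x))


def grad_f(x: Vector, a: Vector) -> Vector:
    return [ai * xi for ai, xi in zip(a, x)]


def linearization_gap(x: Vector, y: Vector, a: Vector) -> float:
    """f(y) - f(x) - grad f(x)^T (y - x), the quantity all three notions bound."""
    diff = [yi - xi for xi, yi in zip(x, y)]
    return f(y, a) - f(x, a) - dot(grad_f(x, a), diff)


def check_notions(x: Vector, y: Vector, a: Vector, tol: float = 1e-12) -> Dict[str, bool]:
    """Check smoothness, convexity, and strong convexity at the pair (x, y)."""
    L, mu = max(a), min(a)          # constants of this particular quadratic
    diff = [yi - xi for xi, yi in zip(x, y)]
    sq = dot(diff, diff)
    gap = linearization_gap(x, y, a)
    return {
        "smooth": gap <= 0.5 * L * sq + tol,
        "convex": gap >= -tol,
        "strongly_convex": gap >= 0.5 * mu * sq - tol,
    }


result = check_notions([1.0, -2.0], [0.5, 3.0], [1.0, 4.0])
```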
Measures of Progress: Smoothness and Idealized Gradient Descent
Consider an iterative algorithm of the form: $x_{t+1} \leftarrow x_t - \eta_t d_t$
By definition of smoothness.
$f(x_t) - f(x_{t+1}) \geq \eta_t \nabla f(x_t)^\top d_t - \eta_t^2 \frac{L}{2} \|d_t\|^2$
Smoothness induces primal progress. Optimizing right-hand side:
$f(x_t) - f(x_{t+1}) \geq \frac{(\nabla f(x_t)^\top d_t)^2}{2L \|d_t\|^2}$ for $\eta_t^* = \frac{\nabla f(x_t)^\top d_t}{L \|d_t\|^2}$
Idealized Gradient Descent (IGD). Choose $d_t \leftarrow x_t - x^*$ (non-deterministic!)
$f(x_t) - f(x_{t+1}) \geq \frac{(\nabla f(x_t)^\top (x_t - x^*))^2}{2L \|x_t - x^*\|^2}$ for $\eta_t^* = \frac{\nabla f(x_t)^\top (x_t - x^*)}{L \|x_t - x^*\|^2}$

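The progress guarantee can be checked on a small example. The following sketch computes the step size that maximizes the smoothness lower bound and compares the guaranteed progress with the actual decrease. The quadratic objective, the starting point, and the name `optimal_step` are assumptions made for illustration; the minimizer is $x^* = 0$, so the IGD direction is simply the current point.

```python
from typing import List, Tuple

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(ui * vi for ui, vi in zip(u, v))


def optimal_step(grad: Vector, d: Vector, L: float) -> Tuple[float, float]:
    """Step size maximizing the smoothness lower bound, and the progress it guarantees."""
    gd = dot(grad, d)
    dd = dot(d, d)
    return gd / (L * dd), gd * gd / (2.0 * L * dd)


# Hypothetical example: f(x) = 0.5 * (x_1^2 + 4 * x_2^2), so L = 4 and x* = 0.
a = [1.0, 4.0]
L = max(a)


def f(x: Vector) -> float:
    return 0.5 * sum(ai * xi * xi for ai, xi in zip(a, x))


x = [2.0, 1.0]
grad = [ai * xi for ai, xi in zip(a, x)]
d = list(x)                                   # IGD direction x_t - x*, with x* = 0
eta, guaranteed = optimal_step(grad, d, L)
x_next = [xi - eta * di for xi, di in zip(x, d)]
actual = f(x) - f(x_next)                     # never smaller than `guaranteed`
```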
Measures of Optimality: Convexity
Recall convexity: $f(y) \geq f(x) + \nabla f(x)^\top (y - x)$
Primal bound from Convexity. $x \leftarrow x_t$ and $y \leftarrow x^* \in \operatorname{argmin}_{x \in P} f(x)$:
$h_t := f(x_t) - f(x^*) \leq \nabla f(x_t)^\top (x_t - x^*)$
Plugging this into the progress from IGD and $\|x_t - x^*\| \leq \|x_0 - x^*\|$.
$f(x_t) - f(x_{t+1}) \geq \frac{(\nabla f(x_t)^\top (x_t - x^*))^2}{2L \|x_t - x^*\|^2} \geq \frac{h_t^2}{2L \|x_0 - x^*\|^2}$
Rearranging provides contraction and convergence rate.
$h_{t+1} \leq h_t \cdot \left(1 - \frac{h_t}{2L \|x_0 - x^*\|^2}\right) \Rightarrow h_T \leq \frac{2L \|x_0 - x^*\|^2}{T + 4}$

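To see how the contraction turns into the stated rate, one can iterate the recursion at equality, which is its worst case. The sketch below writes $C = 2L\|x_0 - x^*\|^2$ and starts from $h_0 = C/4$; that starting value is an assumption here (it holds, for instance, when $\nabla f(x^*) = 0$, by smoothness). The function name and the numeric constants are illustrative.

```python
from typing import List


def worst_case_gaps(C: float, T: int) -> List[float]:
    """Iterate h_{t+1} = h_t * (1 - h_t / C) at equality, starting from h_0 = C / 4."""
    h = C / 4.0
    gaps = [h]
    for _ in range(T):
        h = h * (1.0 - h / C)
        gaps.append(h)
    return gaps


# Hypothetical constants: L = 1 and ||x_0 - x*|| = 2, hence C = 2 * L * 2^2 = 8.
C = 8.0
T = 50
gaps = worst_case_gaps(C, T)

# Compare every gap with the claimed bound C / (t + 4).
bound_holds = all(gaps[t] <= C / (t + 4) + 1e-12 for t in range(T + 1))
```

The comparison succeeds because $h \mapsto h(1 - h/C)$ is increasing on $[0, C/2]$ and $(t+3)(t+5) \leq (t+4)^2$, which is the induction step behind the $T + 4$ in the denominator.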
Measures of Optimality: Strong Convexity
Recall strong convexity: $f(y) \geq f(x) + \nabla f(x)^\top (y - x) + \frac{\mu}{2} \|x - y\|^2$
Primal bound from Strong Convexity. $x \leftarrow x_t$ and $y \leftarrow x_t - \gamma (x_t - x^*)$
$h_t := f(x_t) - f(x^*) \leq \frac{(\nabla f(x_t)^\top (x_t - x^*))^2}{2\mu \|x_t - x^*\|^2}$
Plugging this into the progress from IGD.
$f(x_t) - f(x_{t+1}) \geq \frac{(\nabla f(x_t)^\top (x_t - x^*))^2}{2L \|x_t - x^*\|^2} \geq \frac{\mu}{L} h_t$
Rearranging provides contraction and convergence rate.
$h_{t+1} \leq h_t \cdot \left(1 - \frac{\mu}{L}\right) \Rightarrow h_T \leq e^{-\frac{\mu}{L} T} \cdot h_0$

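The linear rate can be observed by running IGD on a small strongly convex problem. This is a sketch under assumptions not in the slides: the objective is the diagonal quadratic $f(x) = \frac{1}{2}\sum_i a_i x_i^2$ with minimizer $x^* = 0$ (so the primal gap equals $f(x_t)$), and `igd_gaps` is a hypothetical helper.

```python
import math
from typing import List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(ui * vi for ui, vi in zip(u, v))


def igd_gaps(a: Vector, x0: Vector, T: int) -> List[float]:
    """Run IGD on f(x) = 0.5 * sum_i a_i x_i^2 (minimizer x* = 0); return primal gaps."""
    L = max(a)
    x = list(x0)
    gaps = [0.5 * sum(ai * xi * xi for ai, xi in zip(a, x))]
    for _ in range(T):
        grad = [ai * xi for ai, xi in zip(a, x)]
        d = list(x)                       # idealized direction x_t - x*
        dd = dot(d, d)
        if dd == 0.0:                     # already at the minimizer
            break
        eta = dot(grad, d) / (L * dd)     # optimal step from the smoothness bound
        x = [xi - eta * di for xi, di in zip(x, d)]
        gaps.append(0.5 * sum(ai * xi * xi for ai, xi in zip(a, x)))
    return gaps


a = [1.0, 4.0]                            # mu = 1, L = 4
mu, L = min(a), max(a)
gaps = igd_gaps(a, [2.0, 1.0], 10)

contracts = all(gaps[t + 1] <= (1.0 - mu / L) * gaps[t] + 1e-12
                for t in range(len(gaps) - 1))
final_bound = math.exp(-mu / L * 10) * gaps[0]
```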
From IGD to actual algorithms
Consider an algorithm of the form: $x_{t+1} \leftarrow x_t - \eta_t d_t$
Scaling condition (Scaling). Show there exist $\alpha_t$ with
$\frac{\nabla f(x_t)^\top d_t}{\|d_t\|} \geq \alpha_t \frac{\nabla f(x_t)^\top (x_t - x^*)}{\|x_t - x^*\|}$
=> Lose an $\alpha_t^2$ factor in iteration $t$. Bounds and rates follow immediately.
Example. (Vanilla) Gradient Descent with $d_t \leftarrow \nabla f(x_t)$
$\frac{\nabla f(x_t)^\top d_t}{\|d_t\|} = \|\nabla f(x_t)\|_2 \geq 1 \cdot \frac{\nabla f(x_t)^\top (x_t - x^*)}{\|x_t - x^*\|}$
=> TODAY: No more convergence proofs. Just establishing (Scaling).

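The scaling condition is easy to evaluate numerically for a given direction. The sketch below computes the largest factor for which (Scaling) holds at a point, and compares gradient descent with the idealized direction. The quadratic objective, the test point, and the name `scaling_ratio` are assumptions for illustration, and the point is assumed not to be optimal so that the denominator is positive.

```python
import math
from typing import List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(ui * vi for ui, vi in zip(u, v))


def norm(u: Vector) -> float:
    return math.sqrt(dot(u, u))


def scaling_ratio(grad: Vector, d: Vector, x: Vector, x_star: Vector) -> float:
    """Largest factor alpha for which (Scaling) holds for direction d at x."""
    diff = [xi - si for xi, si in zip(x, x_star)]
    lhs = dot(grad, d) / norm(d)           # normalized progress along d
    rhs = dot(grad, diff) / norm(diff)     # normalized progress along x - x*
    return lhs / rhs


# Hypothetical example: f(x) = 0.5 * (x_1^2 + 4 * x_2^2), minimizer x* = 0.
a = [1.0, 4.0]
x = [2.0, 1.0]
x_star = [0.0, 0.0]
grad = [ai * xi for ai, xi in zip(a, x)]

alpha_gd = scaling_ratio(grad, grad, x, x_star)    # gradient descent direction
alpha_igd = scaling_ratio(grad, x, x, x_star)      # idealized direction x - x*
```

By the Cauchy-Schwarz inequality the gradient direction always yields a ratio of at least 1, which is why vanilla gradient descent inherits the IGD rates.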
Conditional Gradients (a.k.a. Frank-Wolfe Algorithm)

Conditional Gradients a.k.a. Frank-Wolfe Algorithm
1. Advantages
   1. Extremely simple and robust: no complicated data structures to maintain
   2. Easy to implement: requires only a linear optimization oracle (first order method)
   3. Projection-free: feasibility via linear optimization oracle
   4. Sparse distributions over vertices: optimal solution is convex comb. (enables sampling)
2. Disadvantages
   1. Suboptimal convergence rate of $O(1/T)$ in the worst case
=> Despite suboptimal rate often used because of simplicity
[Figure. Source: [Jaggi 2013]]

Conditional Gradients a.k.a. Frank-Wolfe Algorithm
[Figure: Frank-Wolfe iterates $x_1 = v_1$, $x_2$, $x_3$ with vertices $v_2$, $v_3$ and directions $-\nabla f(x_1)$, $-\nabla f(x_2)$]
Note:
A) Points are formed as convex combinations of vertices
B) Vertices used to write point => "Active sets"

Conditional Gradients a.k.a. Frank-Wolfe Algorithm
Establishing (Scaling). FW algorithm takes direction $d_t = x_t - v_t$. Observe
$\nabla f(x_t)^\top (x_t - v_t) \geq \nabla f(x_t)^\top (x_t - x^*)$
Hence with $\alpha_t = \frac{\|x_t - x^*\|}{D}$ with $D$ diameter of $P$:
$\frac{\nabla f(x_t)^\top (x_t - v_t)}{\|x_t - v_t\|} \geq \frac{\|x_t - x^*\|}{D} \cdot \frac{\nabla f(x_t)^\top (x_t - x^*)}{\|x_t - x^*\|}$
=> This $\alpha_t$ is sufficient for $O(1/t)$ convergence but better??
[Figure. Source: [Jaggi 2013]]

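A minimal sketch of the vanilla Frank-Wolfe loop follows. Several choices here are assumptions rather than content of the slides: the feasible region is the probability simplex, the objective is $f(x) = \frac{1}{2}\|x - b\|^2$ with smoothness constant 1, and the step size is the smoothness-optimal step clipped to 1 (other rules such as $2/(t+2)$ are also common). The name `frank_wolfe_simplex` is hypothetical.

```python
from typing import List, Tuple

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(ui * vi for ui, vi in zip(u, v))


def frank_wolfe_simplex(b: Vector, T: int, tol: float = 1e-12) -> Tuple[Vector, List[Tuple[float, float]]]:
    """Vanilla Frank-Wolfe for f(x) = 0.5 * ||x - b||^2 over the probability simplex.

    Returns the final iterate and a history of (f(x_t), Frank-Wolfe gap) pairs.
    """
    n = len(b)
    L = 1.0                                   # smoothness constant of this f
    x = [1.0] + [0.0] * (n - 1)               # start at the vertex e_1
    history = []
    for _ in range(T):
        grad = [xi - bi for xi, bi in zip(x, b)]            # FO oracle
        value = 0.5 * dot(grad, grad)
        i = min(range(n), key=lambda j: grad[j])            # LO oracle on the simplex
        v = [1.0 if j == i else 0.0 for j in range(n)]
        d = [xi - vi for xi, vi in zip(x, v)]               # direction d_t = x_t - v_t
        gap = dot(grad, d)                                  # upper bound on the primal gap
        history.append((value, gap))
        if gap <= tol:
            break
        eta = min(1.0, gap / (L * dot(d, d)))               # clipped short step (assumed rule)
        x = [xi - eta * di for xi, di in zip(x, d)]         # stays a convex combination
    return x, history


x_final, history = frank_wolfe_simplex([0.2, 0.3, 0.5], 200)
```

Because each update moves towards a vertex returned by the LO oracle with a step in $[0, 1]$, every iterate stays feasible without any projection.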
The strongly convex case: Linear convergence in special cases
If $f$ is strongly convex we would expect a linear rate of convergence.
Obstacle. The factor $\|x_t - x^*\|$ shrinks as $x_t \to x^*$:
$\frac{\nabla f(x_t)^\top (x_t - v_t)}{\|x_t - v_t\|} \geq \frac{\|x_t - x^*\|}{D} \cdot \frac{\nabla f(x_t)^\top (x_t - x^*)}{\|x_t - x^*\|}$
Special case $x^* \in \operatorname{rel.int}(P)$, say $B(x^*, 2r) \subseteq P$. Then:
Theorem [Marcotte, Guélat ‘86]. After a few iterations
$\frac{\nabla f(x_t)^\top (x_t - v_t)}{\|x_t - v_t\|} \geq \frac{r}{D} \cdot \frac{\nabla f(x_t)^\top (x_t - x^*)}{\|x_t - x^*\|}$
and linear convergence follows via (Scaling).

The strongly convex case: Is linear convergence in general possible?
(Vanilla) Frank-Wolfe cannot achieve linear convergence in general:
Theorem [Wolfe ‘70]. $x^*$ on boundary of $P$. For any $\delta > 0$ for infinitely many $t$:
$f(x_t) - f(x^*) \geq \frac{1}{t^{1+\delta}}$
Issue: zig-zagging (b/c first order opt)
[Wolfe ‘70] proposed Away Steps

The strongly convex case: Linear convergence in general
First linear convergence result (in general) due to [Garber, Hazan ‘13]
1. Simulating (theoretically efficiently) a stronger oracle rather than using Away Steps
2. Involved constants are extremely large => algorithm unimplementable
Linear convergence for implementable variants due to [Lacoste-Julien, Jaggi ‘15]
1. (Dominating) Away-steps are enough
2. Includes most known variants: Away-Step FW, Pairwise CG, Fully-Corrective FW, Wolfe’s algorithm, …
3. Key ingredient: There exists $w(P)$ (depending on polytope $P$ (only!)) s.t.
   $\nabla f(x_t)^\top (a_t - v_t) \geq w(P) \frac{\nabla f(x_t)^\top (x_t - x^*)}{\|x_t - x^*\|}$
   ($d_t = a_t - v_t$ is basically the direction that either variant dominates)
=> Linear convergence via (Scaling)

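An away-step variant can be sketched compactly when the feasible region is the probability simplex, because the weight of vertex $e_i$ in the active set is just the coordinate $x_i$. Everything specific below is an assumption for illustration: the simplex, the objective $f(x) = \frac{1}{2}\|x - b\|^2$ with smoothness constant 1, the clipped short step, and the name `away_step_fw_simplex`. It is not presented as the exact algorithm analyzed in the cited papers.

```python
from typing import List, Tuple

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(ui * vi for ui, vi in zip(u, v))


def away_step_fw_simplex(b: Vector, T: int, tol: float = 1e-12) -> Tuple[Vector, List[float]]:
    """Away-step Frank-Wolfe sketch for f(x) = 0.5 * ||x - b||^2 over the simplex."""
    n = len(b)
    L = 1.0                                        # smoothness constant of this f
    x = [0.0] * (n - 1) + [1.0]                    # start at the vertex e_n
    values = []
    for _ in range(T):
        grad = [xi - bi for xi, bi in zip(x, b)]
        values.append(0.5 * dot(grad, grad))
        fw = min(range(n), key=lambda j: grad[j])              # Frank-Wolfe vertex v_t
        active = [j for j in range(n) if x[j] > 0.0]           # active set of x_t
        away = max(active, key=lambda j: grad[j])              # away vertex a_t
        gx = dot(grad, x)
        fw_gap = gx - grad[fw]                                 # grad^T (x_t - v_t)
        away_gap = grad[away] - gx                             # grad^T (a_t - x_t)
        if fw_gap <= tol:
            break
        is_away = away_gap > fw_gap
        if is_away:
            d = [(1.0 if j == away else 0.0) - x[j] for j in range(n)]
            eta_max = x[away] / (1.0 - x[away])                # keeps the weight of a_t >= 0
            g = away_gap
        else:
            d = [x[j] - (1.0 if j == fw else 0.0) for j in range(n)]
            eta_max = 1.0
            g = fw_gap
        eta = min(eta_max, g / (L * dot(d, d)))                # clipped short step
        x = [x[j] - eta * d[j] for j in range(n)]
        if is_away and eta == eta_max:
            x[away] = 0.0                                      # drop step: remove a_t exactly
    return x, values


b = [0.5, 0.5, 0.0]                                # minimizer lies on the boundary
x_final, values = away_step_fw_simplex(b, 1000)
final_value = 0.5 * sum((xi - bi) ** 2 for xi, bi in zip(x_final, b))
```

The away step moves mass off the active vertex that is worst with respect to the current gradient, which is what counters the zig-zagging of the vanilla method when the minimizer sits on the boundary.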
Many more variants and results…
Recently there has been a lot of work on Conditional Gradients, e.g.,
1. Linear convergence for conditional gradient sliding [Lan, Zhou ‘14]
2. Linear convergence for (some) non-strongly convex functions [Beck, Shtern ‘17]
3. Online FW [Hazan, Kale ‘12, Chen et al ‘18]
4. Stochastic FW [Reddi et al ‘16] and Variance-Reduced Stochastic FW [Hazan, Luo ’16, Chen et al ‘18]
5. In-face directions [Freund, Grigas ‘15]
6. Improved convergence under sharpness [Kerdreux, D’Aspremont, P. ‘18]
… and many more!!
=> Very competitive and versatile in real-world applications

Revisiting Conditional Gradients