
Smooth Constrained Convex Minimization via Conditional Gradients



  1. Smooth Constrained Convex Minimization via Conditional Gradients
Sebastian Pokutta
H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology
Center for Machine Learning @ GT (ML@GT)
Algorithm and Randomness Center (ARC)
Tokyo, 03/2019
twitter: @spokutta

  2. Joint Work with… (in random order)
Thomas Kerdreux, Alexandre D'Aspremont, Gábor Braun, Cyrille Combettes, Swati Gupta, Robert Hildebrand, Stephen Wright, Yi Zhou, Dan Tu, Daniel Zink, George Lan

  3. (Constrained) Convex Optimization
Convex optimization: given a feasible region $P$, solve the optimization problem
$$\min_{x \in P} f(x),$$
where $f$ is a convex function (+ extra properties).
Our setup (source: [Jaggi 2013]):
1. Access to $P$. Linear Optimization (LO) oracle: given a linear objective $c$, return $x \leftarrow \arg\min_{v \in P} c^\top v$.
2. Access to $f$. First-Order (FO) oracle: given $x$, return $\nabla f(x)$ and $f(x)$.
=> Complexity of convex optimization relative to the LO/FO oracle.
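To make the two oracles concrete, here is a minimal Python sketch (my own illustration, not code from the talk) for a Lasso-type instance: a quadratic loss as the FO oracle and the scaled $\ell_1$-ball as the LO oracle; the function names and the radius `tau` are placeholders.

```python
import numpy as np

# Feasible region P = {x : ||x||_1 <= tau}, objective f(x) = 0.5 * ||A x - b||^2.

def fo_oracle(A, b, x):
    """First-Order oracle: return f(x) and grad f(x) for the quadratic loss."""
    r = A @ x - b
    return 0.5 * float(r @ r), A.T @ r

def lo_oracle(c, tau):
    """Linear Optimization oracle over the scaled l1-ball:
    argmin_{||v||_1 <= tau} <c, v> is attained at a signed, scaled unit vector."""
    v = np.zeros_like(c, dtype=float)
    i = int(np.argmax(np.abs(c)))
    v[i] = -tau * np.sign(c[i])
    return v
```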

  4. Why would you care for constrained convex optimization?
The setup captures various problems in machine learning, e.g.:
1. OCR (structured SVM training): marginal polytope over the chain graph of the letters of a word, quadratic loss
2. Video co-localization: flow polytope, quadratic loss
3. Lasso: scaled $\ell_1$-ball, quadratic loss (regression)
4. Regression over structured objects: regression over the convex hull of combinatorial atoms
5. Approximation of distributions: Bayesian inference, sequential kernel herding, …

  5. Smooth Convex Optimization 101

  6. Basic notions
Let $f: \mathbb{R}^n \to \mathbb{R}$ be a function. We will use the following basic concepts:
Smoothness. $f(y) \le f(x) + \nabla f(x)^\top (y - x) + \frac{L}{2}\|y - x\|^2$
Convexity. $f(y) \ge f(x) + \nabla f(x)^\top (y - x)$
Strong convexity. $f(y) \ge f(x) + \nabla f(x)^\top (y - x) + \frac{\mu}{2}\|y - x\|^2$
=> How to use these for optimization is unclear so far. Next step: operationalize the notions!
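As a quick worked example (not on the original slide), the quadratic $f(x) = \frac{1}{2}\|x\|^2$ satisfies all three notions with $L = \mu = 1$, since its second-order expansion is exact:
$$f(y) = \tfrac{1}{2}\|y\|^2 = \tfrac{1}{2}\|x\|^2 + x^\top (y - x) + \tfrac{1}{2}\|y - x\|^2 = f(x) + \nabla f(x)^\top (y - x) + \tfrac{1}{2}\|y - x\|^2.$$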

  7. Measures of Progress: Smoothness and Idealized Gradient Descent
Consider an iterative algorithm of the form $x_{t+1} \leftarrow x_t - \gamma_t d_t$.
By definition of smoothness:
$$f(x_t) - f(x_{t+1}) \ge \gamma_t \nabla f(x_t)^\top d_t - \gamma_t^2 \frac{L}{2}\|d_t\|^2.$$
Smoothness induces primal progress. Optimizing the right-hand side:
$$f(x_t) - f(x_{t+1}) \ge \frac{\big(\nabla f(x_t)^\top d_t\big)^2}{2L\|d_t\|^2} \qquad \text{for } \gamma_t^* = \frac{\nabla f(x_t)^\top d_t}{L\|d_t\|^2}.$$
Idealized Gradient Descent (IGD). Choose $d_t \leftarrow x_t - x^*$ (non-deterministic!):
$$f(x_t) - f(x_{t+1}) \ge \frac{\big(\nabla f(x_t)^\top (x_t - x^*)\big)^2}{2L\|x_t - x^*\|^2} \qquad \text{for } \gamma_t^* = \frac{\nabla f(x_t)^\top (x_t - x^*)}{L\|x_t - x^*\|^2}.$$
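A minimal sketch (my notation, not code from the talk) of this "short step" rule, assuming an upper bound `L` on the smoothness constant is available:

```python
import numpy as np

def short_step(grad, d, L):
    """Step size maximizing the smoothness progress bound for direction d.

    Returns gamma* = <grad, d> / (L * ||d||^2) and the guaranteed primal
    progress <grad, d>^2 / (2 * L * ||d||^2).  Illustrative only; L is an
    assumed smoothness bound for f.
    """
    gd = float(grad @ d)
    dd = float(d @ d)
    return gd / (L * dd), gd ** 2 / (2.0 * L * dd)
```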

  8. Measures of Optimality: Convexity
Recall convexity: $f(y) \ge f(x) + \nabla f(x)^\top (y - x)$.
Primal bound from convexity. With $x \leftarrow x_t$ and $y \leftarrow x^* \in \arg\min_{x \in P} f(x)$:
$$h_t := f(x_t) - f(x^*) \le \nabla f(x_t)^\top (x_t - x^*).$$
Plugging this into the progress from IGD, together with $\|x_t - x^*\| \le \|x_0 - x^*\|$:
$$f(x_t) - f(x_{t+1}) \ge \frac{\big(\nabla f(x_t)^\top (x_t - x^*)\big)^2}{2L\|x_t - x^*\|^2} \ge \frac{h_t^2}{2L\|x_0 - x^*\|^2}.$$
Rearranging provides contraction and convergence rate:
$$h_{t+1} \le h_t \left(1 - \frac{h_t}{2L\|x_0 - x^*\|^2}\right) \quad\Rightarrow\quad h_T \le \frac{2L\|x_0 - x^*\|^2}{T + 4}.$$

  9. Measures of Optimality: Strong Convexity
Recall strong convexity: $f(y) \ge f(x) + \nabla f(x)^\top (y - x) + \frac{\mu}{2}\|y - x\|^2$.
Primal bound from strong convexity. With $x \leftarrow x_t$ and $y \leftarrow x_t - \gamma (x_t - x^*)$:
$$h_t := f(x_t) - f(x^*) \le \frac{\big(\nabla f(x_t)^\top (x_t - x^*)\big)^2}{2\mu\|x_t - x^*\|^2}.$$
Plugging this into the progress from IGD:
$$f(x_t) - f(x_{t+1}) \ge \frac{\big(\nabla f(x_t)^\top (x_t - x^*)\big)^2}{2L\|x_t - x^*\|^2} \ge \frac{\mu}{L} h_t.$$
Rearranging provides contraction and convergence rate:
$$h_{t+1} \le h_t \left(1 - \frac{\mu}{L}\right) \quad\Rightarrow\quad h_T \le e^{-\frac{\mu}{L} T} \, h_0.$$
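A one-line consequence worth spelling out (my addition, not on the slide): the linear rate means the iteration count is governed by the condition number $L/\mu$,
$$h_T \le e^{-\frac{\mu}{L} T} h_0 \le \varepsilon \quad \text{whenever} \quad T \ge \frac{L}{\mu} \log \frac{h_0}{\varepsilon}.$$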

  10. From IGD to actual algorithms
Consider an algorithm of the form $x_{t+1} \leftarrow x_t - \gamma_t d_t$.
Scaling condition (Scaling). Show that there exist $\alpha_t$ with
$$\frac{\nabla f(x_t)^\top d_t}{\|d_t\|} \ge \alpha_t \, \frac{\nabla f(x_t)^\top (x_t - x^*)}{\|x_t - x^*\|}.$$
=> We lose an $\alpha_t^2$ factor in iteration $t$; bounds and rates follow immediately.
Example. (Vanilla) gradient descent with $d_t \leftarrow \nabla f(x_t)$:
$$\frac{\nabla f(x_t)^\top d_t}{\|d_t\|} = \|\nabla f(x_t)\| \ge 1 \cdot \frac{\nabla f(x_t)^\top (x_t - x^*)}{\|x_t - x^*\|}$$
by Cauchy-Schwarz, i.e., (Scaling) holds with $\alpha_t = 1$.
=> TODAY: no more convergence proofs, just establishing (Scaling).
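Putting the template together: a minimal sketch (illustration only, assuming a known smoothness bound `L`) of vanilla gradient descent, for which the short step $\gamma_t^*$ reduces to $1/L$:

```python
import numpy as np

def gradient_descent(grad_f, x0, L, num_steps=100):
    """Vanilla gradient descent x_{t+1} = x_t - (1/L) * grad_f(x_t).

    Instance of the template x_{t+1} = x_t - gamma_t * d_t with
    d_t = grad_f(x_t), for which (Scaling) holds with alpha_t = 1 and the
    short step gamma_t* = <grad, d> / (L * ||d||^2) equals 1/L.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(num_steps):
        x = x - grad_f(x) / L  # direction d_t = grad f(x_t), short step 1/L
    return x

# Example usage on least squares, f(x) = 0.5 * ||A x - b||^2, with L = ||A||_2^2:
# grad_f = lambda x: A.T @ (A @ x - b)
```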

  11. Conditional Gradients (a.k.a. Frank-Wolfe Algorithm)

  12. Conditional Gradients a.k.a. Frank-Wolfe Algorithm
Advantages (source: [Jaggi 2013]):
1. Extremely simple and robust: no complicated data structures to maintain
2. Easy to implement: requires only a linear optimization oracle (first-order method)
3. Projection-free: feasibility via the linear optimization oracle
4. Sparse distributions over vertices: the solution is a convex combination of vertices (enables sampling)
Disadvantages:
1. Suboptimal convergence rate of $O(1/T)$ in the worst case
=> Despite the suboptimal rate, often used because of its simplicity.

  13. Conditional Gradients a.k.a. Frank-Wolfe Algorithm
[Figure: a few Frank-Wolfe iterations on a polytope; starting from $x_1 = v_1$, the negative gradients $-\nabla f(x_1)$ and $-\nabla f(x_2)$ select the vertices $v_2$ and $v_3$, yielding the iterates $x_2$ and $x_3$.]
Note: A) iterates are formed as convex combinations of vertices; B) the vertices used to write an iterate form its "active set".

  14. Conditional Gradients a.k.a. Frank-Wolfe Algorithm
Establishing (Scaling). The FW algorithm takes the direction $d_t = x_t - v_t$ with $v_t \leftarrow \arg\min_{v \in P} \nabla f(x_t)^\top v$ (source: [Jaggi 2013]). Observe that
$$\nabla f(x_t)^\top (x_t - v_t) \ge \nabla f(x_t)^\top (x_t - x^*),$$
since $v_t$ minimizes the linear objective over $P$ and $x^* \in P$. Hence, with $\alpha_t = \|x_t - x^*\|/D$, where $D$ is the diameter of $P$:
$$\frac{\nabla f(x_t)^\top (x_t - v_t)}{\|x_t - v_t\|} \ge \frac{\|x_t - x^*\|}{D} \cdot \frac{\nabla f(x_t)^\top (x_t - x^*)}{\|x_t - x^*\|}.$$
=> This $\alpha_t$ is sufficient for $O(1/t)$ convergence, but can we do better?
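A minimal sketch of the vanilla Frank-Wolfe loop in this notation (my own illustration, not code from the talk; `grad_f` and `lmo` are assumed callables, e.g. the $\ell_1$-ball oracle sketched earlier, and the agnostic step size $2/(t+2)$ is one standard choice):

```python
import numpy as np

def frank_wolfe(grad_f, lmo, x0, num_steps=100):
    """Vanilla Frank-Wolfe / conditional gradients (illustrative sketch).

    grad_f(x) returns the gradient of f at x; lmo(c) returns
    argmin_{v in P} <c, v>.  Projection-free: each iterate is a convex
    combination of vertices of P and hence feasible.
    """
    x = np.asarray(x0, dtype=float)
    for t in range(num_steps):
        g = grad_f(x)
        v = lmo(g)                         # FW vertex v_t
        gamma = 2.0 / (t + 2.0)            # agnostic step size; line search also works
        x = (1.0 - gamma) * x + gamma * v  # move toward v_t along d_t = x_t - v_t
    return x
```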

  15. The strongly convex case: linear convergence in special cases
If $f$ is strongly convex we would expect a linear rate of convergence.
Obstacle. In
$$\frac{\nabla f(x_t)^\top (x_t - v_t)}{\|x_t - v_t\|} \ge \frac{\|x_t - x^*\|}{D} \cdot \frac{\nabla f(x_t)^\top (x_t - x^*)}{\|x_t - x^*\|}$$
the factor $\|x_t - x^*\|/D$ can become arbitrarily small as $x_t \to x^*$.
Special case: $x^* \in \operatorname{rel.int}(P)$, say $B(x^*, 2r) \subseteq P$. Then:
Theorem [Marcotte, Guélat '86]. After a few iterations,
$$\frac{\nabla f(x_t)^\top (x_t - v_t)}{\|x_t - v_t\|} \ge \frac{r}{D} \cdot \frac{\nabla f(x_t)^\top (x_t - x^*)}{\|x_t - x^*\|},$$
and linear convergence follows via (Scaling).

  16. The strongly convex case: is linear convergence possible in general?
(Vanilla) Frank-Wolfe cannot achieve linear convergence in general:
Theorem [Wolfe '70]. Suppose $x^*$ lies on the boundary of $P$. Then for any $\varepsilon > 0$ and for infinitely many $t$:
$$f(x_t) - f(x^*) \ge \frac{1}{t^{1+\varepsilon}}.$$
[Wolfe '70] proposed Away Steps. Issue: zig-zagging (because of first-order optimality).

  17. The strongly convex case: linear convergence in general
First linear convergence result (in general) due to [Garber, Hazan '13]:
1. Simulates (theoretically efficiently) a stronger oracle rather than using Away Steps
2. The involved constants are extremely large => the algorithm is unimplementable
Linear convergence for implementable variants due to [Lacoste-Julien, Jaggi '15]:
1. (Dominating) away steps are enough
2. Includes most known variants: Away-Step FW, Pairwise CG, Fully-Corrective FW, Wolfe's algorithm, …
3. Key ingredient: there exists $w(P)$ (depending on the polytope $P$ only!) such that
$$\nabla f(x_t)^\top (a_t - v_t) \ge w(P) \cdot \frac{\nabla f(x_t)^\top (x_t - x^*)}{\|x_t - x^*\|}$$
($d_t = a_t - v_t$ is basically the direction that either variant dominates)
=> Linear convergence via (Scaling)
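For illustration, a compact sketch of the away-step variant (my own code under the notation above, not from the talk; `grad_f`, `lmo`, and a smoothness bound `L` are assumed inputs, and practical implementations typically use a line search instead of the capped short step used here):

```python
import numpy as np

def away_step_frank_wolfe(grad_f, lmo, x0_vertex, L, num_steps=100):
    """Away-Step Frank-Wolfe (illustrative sketch).

    The iterate is maintained as a convex combination over an active set of
    vertices; each iteration either moves toward the FW vertex or away from
    the worst vertex currently in the active set, whichever direction aligns
    better with the negative gradient.
    """
    x = np.asarray(x0_vertex, dtype=float)
    active = {tuple(x): 1.0}                       # vertex -> convex weight
    for _ in range(num_steps):
        g = grad_f(x)
        v = lmo(g)                                 # FW vertex
        d_fw = v - x
        a_key = max(active, key=lambda u: float(g @ np.array(u)))
        a = np.array(a_key)                        # away vertex (worst active vertex)
        d_away = x - a
        if float(-g @ d_fw) >= float(-g @ d_away):
            d, gamma_max, key, is_away = d_fw, 1.0, tuple(v), False
        else:
            w_a = active[a_key]
            d, gamma_max, key, is_away = d_away, w_a / (1.0 - w_a), a_key, True
        dd = float(d @ d)
        if dd == 0.0:
            break                                  # zero FW gap: x is optimal
        gamma = min(gamma_max, float(-g @ d) / (L * dd))   # capped short step
        x = x + gamma * d
        if is_away:                                # x <- (1 + gamma) x - gamma a
            active = {k: w * (1.0 + gamma) for k, w in active.items()}
            active[key] -= gamma
        else:                                      # x <- (1 - gamma) x + gamma v
            active = {k: w * (1.0 - gamma) for k, w in active.items()}
            active[key] = active.get(key, 0.0) + gamma
        active = {k: w for k, w in active.items() if w > 1e-12}
    return x
```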

  18. Many more variants and results…
Recently there has been a lot of work on conditional gradients, e.g.:
1. Linear convergence for conditional gradient sliding [Lan, Zhou '14]
2. Linear convergence for (some) non-strongly convex functions [Beck, Shtern '17]
3. Online FW [Hazan, Kale '12; Chen et al. '18]
4. Stochastic FW [Reddi et al. '16] and variance-reduced stochastic FW [Hazan, Luo '16; Chen et al. '18]
5. In-face directions [Freund, Grigas '15]
6. Improved convergence under sharpness [Kerdreux, D'Aspremont, P. '18]
… and many more!
=> Very competitive and versatile in real-world applications

  19. Revisiting Conditional Gradients
