SLIDE 1

Incremental Methods for Additive Convex Cost Minimization: Deterministic vs Randomized Variants

Mert Gurbuzbalaban (Rutgers) joint work with

  • A. Ozdaglar (MIT), P. Parrilo (MIT), D. Vanli (MIT)

DIMACS Workshop, August 2017

SLIDE 2

Introduction

Additive Cost Problems

We consider optimization problems with an objective function given by the sum of a large number of component functions:

min_{x ∈ Rn} f(x) = Σ_{i=1}^m fi(x),

where fi : Rn → R, i = 1, . . . , m, are convex functions. These arise in several important contexts.

SLIDE 3

Introduction

Examples of Additive Cost Problems

Empirical Risk Minimization: Given data {(xi, yi)}_{i=1}^m, where xi ∈ Rn is a feature vector and yi ∈ R is the target output, solve

min_{θ ∈ Rn} (1/m) Σ_{i=1}^m L(yi, xi, θ) + pen(θ).

Examples: LASSO, support vector machines, logistic regression, classification...

Minimization of an Expected Value (Stochastic Programming):

min_{x ∈ X} E[F(x, w)]  (w: random variable taking a large finite number of values).

Distributed Optimization in Networks:

fi(x): local objective function of node i (privately known by node i).

[Figure: network of m nodes with local objectives f1(x1, . . . , xn), f2(x1, . . . , xn), . . . , fm(x1, . . . , xn).]

SLIDE 4

Introduction

Incremental Methods

We focus on problems where the number of component functions m is large, so that a full (sub)gradient step, ∇f(x) = Σ_{i=1}^m ∇fi(x), is very costly.

This motivates incremental algorithms, which process the component functions sequentially and make reasonable progress with cheaper "incremental" steps. They are also well-suited to problems where:

  • fi(x) is distributed and locally known by agents.
  • fi(x) becomes known sequentially over time, in an online manner.

Incremental Gradient (IG): Each (outer) iteration k consists of a cycle with m subiterations: for k ≥ 1,

x^k_{i+1} = x^k_i − αk ∇fi(x^k_i),  for i = 1, 2, . . . , m,

where αk is a stepsize.
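To make the cycle above concrete, here is a minimal Python sketch of one IG run (an illustration, not from the slides); the callable list `grads` and the schedule αk = R/k^s are assumptions for the example.

```python
import numpy as np

def incremental_gradient(grads, x0, num_cycles, R=1.0, s=1.0):
    """IG sketch: in cycle k, step along each component gradient in
    cyclic order with the diminishing stepsize alpha_k = R / k**s."""
    x = np.asarray(x0, dtype=float).copy()
    for k in range(1, num_cycles + 1):
        alpha = R / k**s              # stepsize alpha_k
        for g in grads:               # one cycle = m subiterations
            x = x - alpha * g(x)
    return x

# Example usage on two 1-d quadratics f1(x) = (x+1)^2/2, f2(x) = (x-1)^2/2:
print(incremental_gradient([lambda x: x + 1, lambda x: x - 1], [0.5], 1000))
```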

SLIDE 5

Introduction

Order for Processing Component Functions

Deterministic orders:
  • Cyclic order: Incremental Gradient.
  • Fixed arbitrary order in each cycle.

Random orders:
  • Sample with replacement: Stochastic Gradient Descent (SGD).
  • Sample without replacement: Random Reshuffling (RR).

Network-imposed orders:
  • Deterministic with network structure.
  • Random (next component function sampled from a neighborhood): Markov Randomized Incremental Methods.

SLIDE 6

Introduction

This Talk

We study the Incremental Gradient (IG) method for deterministic orders. For smooth, strongly convex functions, we show an O(1/k) rate in distances [an O(1/k²) rate in function values]. This improves on the existing O(1/√k) result (for nonsmooth functions). Achieving this rate with IG involves knowing the strong convexity constant.

We then focus on random orders, in particular Random Reshuffling (RR). RR is numerically observed to outperform SGD, yet there were no analytical results. We show a Θ(1/k^{2s}) rate, s ∈ (1/2, 1), with probability one in function values, improving on the existing Ω(1/k) min-max rate of SGD. Achieving this rate involves a stepsize αk = 1/k^s and properly averaging the iterates.

As a special case of IG, we study coordinate descent methods. We provide linear rate results and problem classes for which any cyclic order is faster than randomized order, both asymptotically and non-asymptotically in the worst case. We also characterize the best deterministic order.

SLIDE 7

Incremental Gradient Method

Incremental (Sub)Gradient method

Prominent algorithm that appears in many contexts:
  • Backpropagation algorithm for training neural networks.
  • Kaczmarz method for solving linear systems of equations a_i^T x = b_i.
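The Kaczmarz connection is easy to make concrete: cyclic Kaczmarz is IG applied to fi(x) = (a_i^T x − bi)² / (2‖ai‖²) with unit stepsize. A minimal sketch under that assumption (my own illustration, not from the slides):

```python
import numpy as np

def kaczmarz(A, b, x0, sweeps=100):
    """Cyclic Kaczmarz sweep: project the iterate onto the hyperplane
    a_i^T x = b_i of each equation in turn."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(sweeps):
        for i in range(A.shape[0]):
            a = A[i]
            x -= (a @ x - b[i]) / (a @ a) * a   # exact projection onto row i
    return x
```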

SLIDE 8

Incremental Gradient Method

Literature: Incremental (Sub)gradient Optimization

Deterministic order: convergence analysis under various conditions.
  • Textbooks by Bertsekas, Polyak, Shor, ...
  • Differentiable problems: [Luo 91], [Luo and Tseng 94], [Mangasarian and Solodov 94], [Bertsekas 97], [Solodov 98], [Tseng 98], ...
  • Non-differentiable problems: [Nedic, Bertsekas 00], [Kiwiel 2004], ...

Best known rate: distk ≤ O(1/√k) under a strong-convexity-type condition.

Question: Can we achieve better rates when the functions fi are smooth?

SLIDE 9

Incremental Gradient Method

Incremental Gradient with Smoothness

Assumptions:

1. (Strong convexity + differentiability) Each fi is convex and C² on Rn. The sum f is c-strongly convex, i.e., f(x) − (c/2)‖x‖² is convex.

2. (Lipschitz gradients) There exists a constant Li > 0 such that ‖∇fi(x) − ∇fi(y)‖ ≤ Li ‖x − y‖ for all x, y and i = 1, 2, . . . , m. Then f has Lipschitz gradients with constant at most L = Σ_i Li.

3. (Subgradient boundedness) ‖g‖ ≤ G for all g ∈ ∂fi(x^k_i), i = 1, 2, . . . , m, k = 1, 2, . . . .

SLIDE 10

Incremental Gradient Method

Convergence Rate of IG with Smoothness

Theorem (Gurbuzbalaban, Ozdaglar, Parrilo 15)

Suppose Assumptions 1, 2 and 3 hold. Consider the IG method with stepsize αk = R/k. If R > 1/c, then

distk ≤ (LmGR² / (Rc − 1)) · 1/k + o(1/k).

This rate result is highly dependent on the choice of stepsize, i.e., on knowledge of the strong convexity constant c of f.

Similar problems with 1/k-decay stepsizes are widely noted in the stochastic approximation and stochastic gradient descent literatures: [Chung 53], [Frees and Ruppert 87], [Nemirovsky, Juditsky, Lan, and Shapiro 09], [Bach and Moulines 11], [Bach 13].

SLIDE 11

Incremental Gradient Method

Convergence Rate of IG with Smoothness

Example: Let fi(x) = x²/20 for i = 1, 2, x ∈ R. Then m = 2, c = 1/5 and x∗ = 0. Take R = 1, which corresponds to the stepsize 1/k. The IG iterations are

x^{k+1}_1 = (1 − 1/(10k))² x^k_1.

If x^1_1 = 1, a simple analysis shows x^k_1 = distk > Ω(1/k^{1/5}), i.e., much slower than the 1/k rate above.

The stepsize αk = Θ(1/k^s), s ∈ (0, 1), does not require adaptation to the strong convexity constant, providing robust rate guarantees.

Theorem (Gurbuzbalaban, Ozdaglar, Parrilo 15)

Suppose Assumptions 1, 2 and 3 hold. Consider the IG method with stepsize αk = R/k^s, s ∈ (0, 1), with R > 0. Then

distk ≤ (LmGR/c) · 1/k^s + o(1/k^s).
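A quick numerical check of the example above (my own sketch, assuming the recursion as written): iterating x^{k+1}_1 = (1 − 1/(10k))² x^k_1 and rescaling by k^{1/5} should approach a positive constant, confirming the Θ(1/k^{1/5}) decay.

```python
# Verify dist_k = Theta(1/k^(1/5)) for the two-quadratic example.
x = 1.0
for k in range(1, 10**6 + 1):
    x *= (1 - 1 / (10 * k))**2          # one IG cycle (two inner steps)
    if k in (10**3, 10**4, 10**5, 10**6):
        print(k, x * k**0.2)            # roughly constant => x_k ~ 1/k^(1/5)
```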

SLIDE 12

Incremental Gradient Method

Quadratics: Order-Dependent Upper Bounds

Consider the IG method with an arbitrary deterministic order σ (a fixed permutation of {1, 2, . . . , m}) and stepsize αk = R/k^s, s ∈ (0, 1).

Theorem (Gurbuzbalaban, Ozdaglar, Parrilo 2015)

For each i, let fi : Rn → R be a quadratic function of the form

fi(x) = (1/2) x^T Pi x − qi^T x + ri,

where Pi is a symmetric square matrix, qi is a column vector and ri is a scalar. Suppose f is strongly convex with constant c. Then

distk ≤ (R Mσ / c) · 1/k^s + o(1/k^s),  where  Mσ = ‖ Σ_{1≤i<j≤m} P_{σ(j)} ∇f_{σ(i)}(x∗) ‖.

Note that Mσ ≤ Σ_{j=1}^m j L_{σ(j)} G ≤ LmG.

This suggests processing the functions with higher Lipschitz constants first.

SLIDE 13

Random Orders

Random Orders: SGD vs RR

Much empirical evidence shows that RR outperforms SGD, yet there were no analytical results.

Figure: Classification of RCV1 documents belonging to class CCAT. Left: SGD achieves its Ω(1/k) rate. Right: Random Reshuffling achieves a rate of ∼ 1/k² [Bottou 09].

Long-standing open problem: characterizing the convergence rate of RR [Bertsekas 99], [Bottou 09], [Recht, Re 2012, 2013]. The analysis is hard because of dependencies of the gradient errors within and across cycles.

SLIDE 14

Random Orders

SGD: Revived Interest

Vast literature going back to [Robbins, Monro 51], [Kiefer, Wolfowitz 52]. Popular in machine learning applications due to its scalability and robustness. Active area of research, with more recent work on achievable rates, more robust variants and second-order versions: [Ruppert 88], [Polyak 90], [Polyak, Juditsky 92], [Bottou, LeCun 05], [Nemirovski, Juditsky, Lan, Shapiro 09], [Hazan, Kale 11], [Rakhlin, Shamir, Sridharan 12], [Bach, Moulines 11], [Byrd, Hansen, Nocedal, Singer 14], [Hardt, Recht, Singer 15], ...

SLIDE 15

Random Orders

Convergence Rate of SGD

For strongly convex functions, SGD has Ω(1/k) min-max lower bounds for stochastic convex optimization [Nemirovski, Yudin 83], [Agarwal et al. 12]. Polyak-Ruppert averaging is one way of achieving this lower bound:

  • Choose a larger stepsize αk = R/k^s with s ∈ (1/2, 1).
  • Take the time average of the iterates: x̄k = (x1 + x2 + · · · + xk)/k.

Averaged Stochastic Gradient Descent:

Theorem (Polyak, Juditsky 92)

k^{1/2} (x̄k − x∗) →_D N(0, σ)  ⟹  ∼ 1/k rate for function values.

SLIDE 16

Random Reshuffling

Convergence Rate of SGD and RR

Under Assumptions 1, 2 plus some technical conditions, we have:

Averaged Stochastic Gradient Descent:

Theorem (Polyak, Juditsky 92)

k^{1/2} (x̄k − x∗) →_D N(0, σ)  ⟹  ∼ 1/k rate for function values.

Random Reshuffling (RR):

Theorem (Gurbuzbalaban, Ozdaglar, Parrilo 15 (simplified))

k^s (x̄k − x∗) → ∇²f(x∗)^{−1} θ∗ with probability one, for the fixed vector

θ∗ = −(1/2) Σ_{i=1}^m ∇²fi(x∗) ∇fi(x∗)  and s ∈ (1/2, 1)

⟹ ∼ 1/k^{2s} faster rate for function values. Also, ‖θ∗‖ ≤ LG (no additional factor of m).

SLIDE 17

Random Reshuffling

Illustration on a simple example

Two quadratics: f1(x) = (1/2)(x + 1)², f2(x) = (1/2)(x − 1)². Here, θ∗ = 0.

Figure: Left: Histograms of the approximation error ∆k = x̄k − x∗ for SGD and RR. Right, top: Histogram of k^s ∆k for RR, which converges to 0 since θ∗ = 0. Right, bottom: Histogram of k^{1/2} ∆k for SGD, which is asymptotically normal.
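A small simulation sketch of this example (illustrative, not from the slides; the trial count and s = 0.75 are arbitrary choices): with f1, f2 as above and αk = 1/k^s, the averaged RR iterates concentrate much more tightly around x∗ = 0 than the averaged SGD iterates, qualitatively reproducing the histograms.

```python
import numpy as np

rng = np.random.default_rng(0)
grads = [lambda x: x + 1.0, lambda x: x - 1.0]   # f1(x)=(x+1)^2/2, f2(x)=(x-1)^2/2
s, K = 0.75, 2000

def averaged_error(reshuffle):
    x, total = 0.5, 0.0
    for k in range(1, K + 1):
        alpha = 1.0 / k**s
        # RR: permutation per cycle; SGD: sample with replacement
        idx = rng.permutation(2) if reshuffle else rng.integers(0, 2, size=2)
        for i in idx:
            x -= alpha * grads[i](x)
        total += x
    return abs(total / K)                        # |x_bar_K - x*|, x* = 0

trials = 200
print("RR :", np.mean([averaged_error(True) for _ in range(trials)]))
print("SGD:", np.mean([averaged_error(False) for _ in range(trials)]))
```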

SLIDE 18

Random Reshuffling

Intuition: Bias-Variance Trade-Off

SGD samples the index ik uniformly and independently at iteration k:

x^{k+1} = x^k − αk ∇f_{ik}(x^k) = x^k − αk (∇f(x^k) + E^k),

where E^k is the gradient error at iteration k. For the two-quadratic example above, E^k = ±1 with probability 1/2, so E(E^k) = 0 and var(E^k) = 1. The error sequence E^k is a martingale difference sequence.

SLIDE 19

Random Reshuffling

Intuition: Bias-Variance Trade-Off

RR on the same example: the outer iterates satisfy

x^{k+1}_1 = x^k_1 − αk (∇f1(x^k_1) + ∇f2(x^k_1) + e^k),

e^k = ∇f2(x^k_2) − ∇f2(x^k_1)  if σk = (1, 2),
e^k = ∇f1(x^k_2) − ∇f1(x^k_1)  if σk = (2, 1).

By gradient Lipschitzness: e^k = O(αk), E(e^k) = 0, var(e^k) = O(α²k).

The RR error has reduced variance, but the error sequence e^k is not a martingale difference sequence, due to correlations among the inner iterates.

SLIDE 20

Random Reshuffling

Intuition: Bias-Variance Trade-Off

x^{k+1}_1 = x^k_1 − αk (∇f1(x^k_1) + ∇f2(x^k_1) + e^k),

e^k = ∇f2(x^k_2) − ∇f2(x^k_1)  if σk = (1, 2),
e^k = ∇f1(x^k_2) − ∇f1(x^k_1)  if σk = (2, 1)

⟹ e^k = αk v^k + O(α²k), where the remainder (a term proportional to αk (x^k_1 − x∗)) is O(α²k) by the cyclic analysis, and v^k = v(σk) is a sequence independent over cycles.

By gradient Lipschitzness: e^k = O(αk), E(e^k) = 0, var(e^k) = O(α²k).

The RR error has reduced variance, but the error sequence e^k is not a martingale difference sequence, due to correlations among the inner iterates.

SLIDE 21

Random Reshuffling

Proof Sketch (specialize to quadratics):

Evolution of the outer RR iterates is given by

(x^k_1 − x^{k+1}_1) / αk = ∇f(x^k_1) + e^k,

where e^k is the cycle gradient error. Averaging both sides and using ∇f(x^j_1) = H∗ (x^j_1 − x∗) (with H∗ = ∇²f(x∗)),

Ik := (1/k) Σ_{j=0}^{k−1} (x^j_1 − x^{j+1}_1) / αj = (1/k) Σ_{j=0}^{k−1} [H∗ (x^j_1 − x∗) + e^j].

Equivalently,

x̄k − x∗ = −H∗^{−1} ᾱk · (Σ_j e^j) / (Σ_j αj) + H∗^{−1} Ik,

where ᾱk = (Σ_j αj)/k = O(1/k^s) is the averaged stepsize, (Σ_j e^j)/(Σ_j αj) → θ∗ a.s., and Ik = O(log k / k).

SLIDE 22

Random Reshuffling

Proof Sketch (specialize to quadratics):

The O(log k / k) term follows from the deterministic IG results and "lots of algebra".

(Σ_j e^j)/(Σ_j αj) → θ∗ a.s. follows from decomposing the cycle gradient error as

e^k = αk v^k + O(α²k),

where v^k is a sequence independent over cycles with

E[v^k] := θ∗ = (1/2) Σ_{i=1}^m ∇²fi(x∗) ∇fi(x∗).

By the strong law of large numbers, (1/k) Σ_j e^j/αj → E[v^k] a.s., implying almost sure convergence of the weighted version (Σ_j e^j)/(Σ_j αj).

SLIDE 23

Random Reshuffling

Accelerating RR Further: Bias Removal

Bottleneck term: the deterministic bias

bias(k) := ᾱk H∗^{−1} θ∗,  θ∗ = −(1/2) Σ_{i=1}^m ∇²fi(x∗) ∇fi(x∗).

Estimate the bias in the last cycle and subtract it to get a 1/k² rate in function values!

Figure: Histograms of the suboptimality of the function values for a fixed number of cycles. Orange: Accelerated RR; Blue: RR.

SLIDE 24

Random Reshuffling

Special Case of IG: Coordinate Descent

For f : Rn → R convex and smooth, we consider unconstrained problems:

min_{x ∈ Rn} f(x).

CD algorithm: At each iteration k, select an index ik and approximately minimize the objective in the ik-th coordinate:

x^{k+1} = x^k − ηk [∇f(x^k)]_{ik} e_{ik},

where ηk is the stepsize, [∇f(x)]_{ik} =: ∇f_{ik}(x) is the ik-th component of the gradient ∇f(x), and e_{ik} = [0, 0, . . . , 1, 0, . . . , 0]^T is the ik-th coordinate vector.

CD methods have a long history in optimization; their convergence properties were studied extensively from the late 70s to the 90s: [Bertsekas, Tsitsiklis 89], [Bertsekas 99], [Tseng, Luo 92], [Grippo, Sciandrone 99], [Auslender 76]. There has been a resurgence of interest because of their applicability in machine learning and large-scale data analysis, and their superior empirical performance.
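A minimal sketch of the CD step above (illustrative, not from the slides; the constant stepsize and seed are assumptions): cyclic order gives CCD, while sampling ik with replacement gives RCD.

```python
import numpy as np

def coordinate_descent(grad, x0, eta=1.0, cycles=100, random_order=False, seed=0):
    """CD sketch: each inner step moves along a single coordinate of the
    gradient; cyclic order = CCD, with-replacement sampling = RCD."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    n = x.size
    for _ in range(cycles):
        order = rng.integers(0, n, size=n) if random_order else range(n)
        for i in order:
            x[i] -= eta * grad(x)[i]    # update coordinate i only
    return x
```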

SLIDE 25

Random Reshuffling

Recent Work

Choice of order ik:

Deterministic order: Cyclic Coordinate Descent (CCD).
  • [Beck, Tetruashvili 13], [Sun, Hong 15]: global rate estimates, which suggest CCD is O(n²) times slower than RCD for strongly convex f. Puzzling in view of the empirically faster performance of CCD over RCD on various problems.
  • [Sun, Ye 16]: provided a quadratic problem for which the O(n²) gap of [Beck, Tetruashvili 13] is achieved.

Random orders: Random CD (RCD), Randomly Permuted CD (RPCD).
  • [Nesterov 12]: provided the first global non-asymptotic convergence rates of RCD for convex and smooth problems.
  • [Lee, Wright 16]: tight analysis of RPCD on the quadratic example of [Sun, Ye 16].

These results suggest that CCD is slower than RCD with respect to the scaling in n. Active research area, including [Richtarik, Takac 11], [Scutari et al 14], [Wright 15], [Saha, Tewari 10], [Wang, Lin], [Nesterov, Stich 17], [Liu, Wright 16], [Lin, Lu, Xiao 14], [Hong et al 13], [Nutini et al. 15], [Necoara et al 11], ...

SLIDE 26

Random Reshuffling

Setup

We focus on convex quadratic problems

min_{x ∈ Rn} (1/2) x^T A x,  where A ∈ Rn×n.  (1)

Assumption 1. (i) A is invertible, i.e., µ := λmin(A) > 0. (ii) The diagonals of A are all normalized to one: Ai,i = 1 for i = 1, 2, . . . , n (2) (this is not restrictive, as we can always put A into this form by scaling x).

By (i), problem (1) has a unique solution at x∗ = 0. Let C and R be the iteration matrices of CCD and RCD. We consider two problem classes: (i) A is an M-matrix, i.e., the off-diagonal entries of A are nonpositive (e.g., solving Laplacian-like systems); (ii) A is a 2-cyclic matrix.

SLIDE 27

Random Reshuffling

CD Iterations: Close-up

CCD iterations:

Rewrite A = I − L − L^T, where −L is the strictly lower triangular part of A. With the standard cyclic rule 1, . . . , n (i.e., ik = k (mod n) + 1):

x^{(ℓ+1)n}_CCD = C x^{ℓn}_CCD,  where C = (I − L)^{−1} L^T.  (3)

This is equivalent to one iteration of the Gauss-Seidel method for Ax = 0.

RCD iterations:

ik is random (sampled with replacement). The iterates evolve in expectation as

E x^{(ℓ+1)n}_RCD = R E x^{ℓn}_RCD,  with R := (I − A/n)^n.  (4)
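These iteration matrices are straightforward to form numerically. A sketch under Assumption 1 (unit diagonal; my own code, not from the slides):

```python
import numpy as np

def iteration_matrices(A):
    """C = (I - L)^{-1} L^T for CCD and R = (I - A/n)^n for expected RCD,
    where A = I - L - L^T has unit diagonal."""
    n = A.shape[0]
    L = -np.tril(A, k=-1)                                  # strictly lower part
    C = np.linalg.solve(np.eye(n) - L, -np.triu(A, k=1))   # (I - L)^{-1} L^T
    R = np.linalg.matrix_power(np.eye(n) - A / n, n)
    return C, R

# The asymptotic rates are then -log of the spectral radii:
spectral_radius = lambda M: max(abs(np.linalg.eigvals(M)))
```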

SLIDE 28

Random Reshuffling

Asymptotic Rate of Convergence - I

We use the notion of worst-case asymptotic convergence rate, studied extensively in the literature on iterative algorithms [Ortega, Rheinboldt 70], [Varga 09], [Bertsekas, Tsitsiklis 89]. The worst-case reduction in distance to optimality for CCD is

sup_{x^0} ‖x^{ℓn}_CCD − x∗‖ / ‖x^0_CCD − x∗‖ = ‖C^ℓ‖,  with  ‖C^ℓ‖^{1/ℓ} → ρ(C) as ℓ → ∞,

where ρ(·) is the spectral radius. The worst-case asymptotic convergence rate is then

Rate(CCD) := lim_{ℓ→∞} sup_{x^0_CCD ∈ Rn} (−1/ℓ) log ( ‖x^{ℓn}_CCD − x∗‖ / ‖x^0_CCD − x∗‖ ) = −log ρ(C).

SLIDE 29

Random Reshuffling

Asymptotic Rate of Convergence - II

For RCD, we analogously define

Rate(RCD) := lim_{ℓ→∞} sup_{x^0_RCD ∈ Rn} (−1/ℓ) log ( ‖E(x^{ℓn}_RCD) − x∗‖ / ‖x^0_RCD − x∗‖ ) = −log ρ(R).

The convergence of the expected distance to the optimal solution, ‖E(x^{ℓn}_RCD) − x∗‖, has been studied in the literature [Sun, Ye 16]. Our results generalize to other notions of convergence, such as the convergence of E‖x^{ℓn}_RCD − x∗‖².

Question: When does CCD converge faster than RCD asymptotically, i.e., when is ρ(C) < ρ(R)?

SLIDE 30

Random Reshuffling

A Motivating Example

Consider the 4 × 4 symmetric matrix satisfying Assumption 1 with µ = 1/2:

A = [  1      0    −1/4  −1/4
       0      1    −1/4  −1/4
     −1/4   −1/4    1      0
     −1/4   −1/4    0      1  ].  (5)

Then the CCD iteration matrix has the explicit form

C = [ 0  0  1/4  1/4
      0  0  1/4  1/4
      0  0  1/8  1/8
      0  0  1/8  1/8 ].

We check: ρ(C) = 1/4 and ρ(R) = ρ((I − A/4)^4) = (1 − µ/4)^4 ≥ 1 − µ = 1/2. Therefore,

Rate(CCD)/Rate(RCD) = −log ρ(C) / −log ρ(R) ≥ −log(1/4) / −log(1/2) = 2.

Question: Is there a more general class of such examples?
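The numbers above are easy to verify numerically (a sketch assuming the reconstructed A; illustrative, not from the slides):

```python
import numpy as np

A = np.array([[1.0, 0.0, -0.25, -0.25],
              [0.0, 1.0, -0.25, -0.25],
              [-0.25, -0.25, 1.0, 0.0],
              [-0.25, -0.25, 0.0, 1.0]])
n = A.shape[0]
L = -np.tril(A, -1)                                   # A = I - L - L^T
C = np.linalg.solve(np.eye(n) - L, -np.triu(A, 1))    # CCD matrix (I-L)^{-1} L^T
R = np.linalg.matrix_power(np.eye(n) - A / n, n)      # expected RCD matrix
rho = lambda M: max(abs(np.linalg.eigvals(M)))
print(rho(C), rho(R))                    # 0.25 and (7/8)^4 ~ 0.586
print(np.log(rho(C)) / np.log(rho(R)))   # rate ratio ~ 2.6 >= 2
```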

SLIDE 31

Random Reshuffling

Convergence Rate of RCD

Lemma

Suppose Assumption 1 holds. Then the RCD algorithm satisfies

ρ(R) = (1 − µ/n)^n ≥ 1 − µ.

Proof: By Assumption 1, µ > 0 and tr(A) = n, which implies that all eigenvalues of the matrix A/n are in the interval (0, 1). Hence,

ρ(R) = λmax((I − A/n)^n) = (1 − λmin(A)/n)^n = (1 − µ/n)^n.

SLIDE 32

Random Reshuffling

M-Matrices

Definition (M-matrix)

A real matrix A with Ai,j ≤ 0 for all i ≠ j is an M-matrix if A is nonsingular and A^{−1} ≥ 0. M-matrices arise in many contexts in optimization and iterative algorithms, e.g., minimization of quadratic forms of graph Laplacians for spectral partitioning and semi-supervised learning.

Definition (Irreducibility)

A matrix A is irreducible if it is not similar via a permutation to a block upper triangular matrix (with more than one block of positive size). Irreducibility is a key condition for Perron-Frobenius theory.

SLIDE 33

Random Reshuffling

Spectral Radius of CCD Iteration Matrix for M-Matrices

Theorem

Suppose Assumption 1 holds and A is an irreducible M-matrix. Then the iteration matrix of the CCD algorithm satisfies

(1 − µ)² ≤ ρ(C) ≤ (1 − µ)/(1 + µ),  (6)

where the inequality on the left holds with equality if and only if A is a consistently ordered matrix.

Definition (Consistent Ordering (simplified form))

If the eigenvalues of Bα = αL + (1/α) L^T are independent of α, then A is said to be consistently ordered.

Since ρ(R) ≥ 1 − µ by Lemma 1, this theorem implies ρ(C) < ρ(R). To prove the lower bound of this theorem, we use a modified version of a key result from [Varga 2009, Lemma 4.12].

SLIDE 34

Random Reshuffling

Proof Sketch of Lower Bound in Inequality (6):

Similar to the 4 × 4 example, one can show C = (I − L)^{−1} L^T ≥ 0. By the Perron-Frobenius Theorem, λ = ρ(C) and there exists z ≥ 0 such that

Cz = λz  ⟺  (λL + L^T) z = λz  ⟺  ρ(λL + L^T) = λ.

It therefore suffices to solve the equation ρ(λL + L^T) = λ, where ρ(λL + L^T) = √λ ρ(B_{√λ}).

Lemma (Varga 2009, Lemma 4.12)

Consider Bα = αL + (1/α) L^T for α ∈ (0, 1].
1. If A is consistently ordered, then by definition ρ(B_{√λ}) is a constant.
2. Otherwise, ρ(B_{√λ}) is strictly decreasing on λ ∈ (0, 1].

Proof idea: Using the Perron-Frobenius Theorem, ρ(Bα) = lim_{t→∞} [tr(B^t_α)]^{1/t}. Compute the diagonals [B^t_α]_{i,i} as a sum over all possible walks from i to itself in t steps.

SLIDE 35

Random Reshuffling

Proof of Varga’s Lemma

Since Bα ≥ 0 and Bα is irreducible, the largest eigenvalue of Bα has multiplicity 1. Therefore,

ρ(Bα) = lim_{t→∞} [tr(B^t_α)]^{1/t}.

How do we find the diagonal entries of B^t_α? Consider the graph induced by the matrix Bα and a walk w over edges (i_s, i_{s+1})_{s=0}^{t−1} such that i_0 = i_t = i and [Bα]_{i_s, i_{s+1}} > 0 for all s. The weight of this walk is φα(w) = α^{c_w} φ1(w), where c_w ∈ Z and

φ1(w) = Π_{s=0}^{t−1} [B1]_{i_s, i_{s+1}}.

Define the reversed walk w′ with edges (i_{s+1}, i_s)_{s=0}^{t−1}. Then [B^t_α]_{i,i} contains the weights of both w and w′ as summands. Hence,

[B^t_α]_{i,i} = Σ_{all valid walks w} ((α^{|c_w|} + α^{−|c_w|}) / 2) φ1(w).

SLIDE 36

Random Reshuffling

Proof Sketch of Lower Bound in Inequality (6):

Similar to the 4 × 4 example, one can show C = (I − L)^{−1} L^T ≥ 0. By the Perron-Frobenius Theorem, λ = ρ(C) and there exists z ≥ 0 such that

Cz = λz  ⟺  (λL + L^T) z = λz  ⟺  ρ(λL + L^T) = λ.

It therefore suffices to solve the equation ρ(λL + L^T) = λ, where ρ(λL + L^T) = √λ ρ(B_{√λ}).

We conclude by invoking Varga's lemma.

SLIDE 37

Random Reshuffling

Convergence Rate of CCD for M-Matrices

Corollary

Suppose Assumption 1 holds and A is an irreducible M-matrix. Then the CCD and RCD methods satisfy

1 < νn < Rate(CCD)/Rate(RCD) ≤ 2νn,  where  νn := log(1 − µ) / (n log(1 − µ/n)).

νn is a monotonically increasing function of n, with ν1 = 1 and lim_{n→∞} νn = −log(1 − µ)/µ > 1. For any µ ≤ 1/2, we have νn ∈ [1, 3/2).

Corollary

Suppose Assumption 1 holds and A is an irreducible M-matrix with n ≥ 2. Then the CCD and RCD methods satisfy lim_{µ→0+} Rate(CCD)/Rate(RCD) = 2.

CCD has a better asymptotic worst-case convergence rate than RCD. We quantify the amount of rate improvement and when it is achievable.

SLIDE 38

Random Reshuffling

Cyclic Matrices

Definition

A matrix H is 2-cyclic if there exists a permutation matrix P such that

P H P^T = D + [ 0   B1
                B2  0  ],  (7)

where the diagonal null submatrices are square and D is a diagonal matrix.

Let H be a 2-cyclic matrix satisfying (7). Then the graph induced by the matrix H − D is periodic with period 2. This definition was first introduced in [Young 50], where it had the alternative name Property A. It was extended to the class of p-cyclic matrices, p ≥ 2, in [Varga 59].

What is the relationship between 2-cyclic matrices and consistently ordered matrices?

Lemma ([Young 71])

A matrix H is 2-cyclic if and only if there exists a permutation matrix P such that P H P^T is consistently ordered.

SLIDE 39

Random Reshuffling

Convergence Rate of CCD for Cyclic Matrices

Theorem

Suppose Assumption 1 holds and A is a consistently ordered 2-cyclic matrix. Then the spectral radius of the CCD iteration matrix is

ρ(C) = (1 − µ)².

Corollary

Suppose Assumption 1 holds and A is a consistently ordered 2-cyclic matrix with n ≥ 2. Then the asymptotic worst-case rates of CCD and RCD satisfy

Rate(CCD)/Rate(RCD) = 2νn,  where  νn := log(1 − µ) / (n log(1 − µ/n)) > 1.

The asymptotic worst-case convergence rate of CCD is more than 2 times faster than that of RCD.

SLIDE 40

Random Reshuffling

Numerical Experiments

We consider the consistently ordered 2-cyclic matrix A = I − L − L^T, where L is the matrix whose strictly lower-left (n/2) × (n/2) block is all ones, scaled by 1/n (so that µ = 1/2).

For n = 50, the constant νn can be calculated as

2νn = 2 log(1 − µ) / (n log(1 − µ/n)) = 2 log(0.5) / (50 log(1 − 1/100)) ≈ 2.76.

Convergence to x∗. Left: consistent ordering; Right: inconsistent ordering.

Figure: log ‖xℓ − x∗‖ versus number of epochs ℓ under worst-case initialization, for CCD, RCD and expected RCD. Left panel: consistent ordering; Right panel: inconsistent ordering.
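The predicted ratio can be reproduced directly from the iteration matrices (a sketch assuming the block form of L above; illustrative, not from the slides):

```python
import numpy as np

n = 50
L = np.zeros((n, n))
L[n // 2:, :n // 2] = 1.0 / n              # all-ones lower block, scaled by 1/n
A = np.eye(n) - L - L.T                    # consistently ordered 2-cyclic matrix
C = np.linalg.solve(np.eye(n) - L, L.T)    # CCD matrix (I-L)^{-1} L^T
R = np.linalg.matrix_power(np.eye(n) - A / n, n)
rho = lambda M: max(abs(np.linalg.eigvals(M)))
print(rho(C))                              # (1 - mu)^2 = 0.25 with mu = 1/2
print(np.log(rho(C)) / np.log(rho(R)))     # rate ratio ~ 2.76
```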

SLIDE 41

Random Reshuffling

Other related and future work

For diagonally dominant matrices, we can show CCD is faster than RCD in a non-asymptotic sense, and we can relax the assumption on the sign of the off-diagonal entries.

Applications:
  • Gaussian Belief Propagation: our class (M-matrices) corresponds to non-frustrated models.
  • Solving Laplacian systems, consensus.

Aggregated methods:
  • Deterministic Incremental Aggregated Gradient [M.G., Ozdaglar, Parrilo 15]: remember the past and work with delayed gradients. Analyzed as a dynamical system with delays; we prove linear convergence. Suitable for distributed optimization over networks.
  • Proximal Aggregated Gradient Methods [D. Vanli; supervisors: M.G., Ozdaglar]: the rate depends linearly on the condition number and on m.

SLIDE 42

Conclusions

Conclusions

We analyzed deterministic incremental algorithms for solving additive convex cost optimization problems under smoothness assumptions, and presented new rate results for a variety of stepsize rules and arbitrary orders.

We used these results to study the random reshuffling method and presented the first analytical results for its convergence rate, which is faster than that of SGD.

We provided problem classes for which CCD (or CD with any deterministic order) is faster than RCD, including a family of examples for which CCD is asymptotically faster than RCD by a factor of at least two in any dimension n. We characterized the best deterministic order (the one leading to the maximum improvement in convergence rate). For diagonally dominant A, we obtain similar non-asymptotic results.

Reference: When Cyclic Coordinate Descent Beats Randomized Coordinate Descent (joint work with D. Vanli and A. Ozdaglar), submitted.

Thanks for your attention!

SLIDE 43

Appendix

SLIDE 44

Proof of Upper Bound

Using the same Perron-Frobenius argument, (λL + L^T) z = λz implies

λ z^T L z + z^T L^T z = λ,  since ‖z‖ = 1.

Defining β = z^T L z = z^T L^T z, we get

λ = β / (1 − β).  (8)

Since ρ(L + L^T) = ρ(I − A) = 1 − µ, for any ‖y‖ = 1 we have y^T (L + L^T) y ≤ 1 − µ. Picking y = z yields 2β ≤ 1 − µ. Using this in (8), we get

λ ≤ (1 − µ) / (1 + µ).

SLIDE 45

Conclusions

Conclusions

We analyzed deterministic incremental algorithms for solving additive convex cost optimization problems under smoothness assumptions, and presented new rate results for a variety of stepsize rules and arbitrary orders. We used these results to study the random reshuffling method and presented the first analytical results for its convergence rate, which is faster than SGD. We also analyzed the deterministic incremental aggregated gradient method and presented a new explicit linear rate result.

This is a fertile research area with significant impact in various application domains, including large-scale networks and data processing.

Thank You!

SLIDE 46

Conclusions

Convergence Mechanism – I

Figure: Illustration with one-dimensional quadratics [Bertsekas 15]. Far-out region: all individual gradients are almost as effective as the full gradient, pointing in the right direction. Confusion region: gradients are not aligned, and oscillations arise.

SLIDE 47

Conclusions

Convergence Mechanism – II

The choice of stepsize αk plays an important role in the performance of incremental methods. A decaying stepsize is essential for global convergence to an optimal solution of the global objective function f(x) [Luo 91]:

αk → 0,  Σ_k αk = ∞.

A constant (small) stepsize ensures convergence to a neighborhood of the optimal solution [Solodov 98], [Nedic, Bertsekas 00]; the iterates may converge to a limit cycle [Kohonen 74].

SLIDE 48

Conclusions

Analysis of Incremental Gradient – I

We analyze the method as a gradient method with error:

x^{k+1}_1 = x^k_1 − αk (∇f(x^k_1) − e^k),  e^k = Σ_{i=1}^m (∇fi(x^k_1) − ∇fi(x^k_i)).

Using smoothness, we replace ∇f(x^k_1) = Ak (x^k_1 − x∗), where

Ak = ∫_0^1 ∇²f(x∗ + τ(x^k_1 − x∗)) dτ,

and write, for distk = ‖x^k_1 − x∗‖,

dist_{k+1} ≤ ‖I − αk Ak‖ distk + αk ‖e^k‖.

We use gradient Lipschitzness and boundedness to control the gradient error: ‖e^k‖ ≤ αk L m G. Using the strong convexity bound and αk = R/k, we have, for k ≥ RL,

dist_{k+1} ≤ (1 − Rc/k) distk + LmGR²/k².

SLIDE 49

Conclusions

Analysis of Incremental Gradient – II

Lemma (Chung 53, Polyak 87)

Let uk ≥ 0 be a sequence of real numbers. Assume there exists k0 such that

u_{k+1} ≤ (1 − a/k) uk + d/k^{s+1}  for all k ≥ k0,

where d > 0, a > 0 and s > 0 are real scalars. Then:

uk ≤ d (a − s)^{−1} k^{−s} + o(k^{−s})  for a > s;
uk = O(k^{−a})  for a < s.

For s = 1, the recursion can be approximated as

u_{k+1} = Π_{l=1}^k (1 − a/l) u1 + Σ_{j=1}^k [ Π_{l=j+1}^{k−1} (1 − a/l) ] d/j²,

uk ≈ (1/k^a) u1  [transient term]  +  (d/(a − 1)) (1/k)  [accumulated error].
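A quick numerical sanity check of the a > s case (my own sketch; the values a = 3, s = 1, d = 2 are arbitrary, and the lemma then predicts uk ≈ d/(a − s) · 1/k = 1/k):

```python
# Run the recursion u_{k+1} = (1 - a/k) u_k + d/k^(s+1) and rescale by k.
a, s, d = 3.0, 1.0, 2.0
u, k0, K = 1.0, 4, 10**6          # start at k0 with 1 - a/k > 0
for k in range(k0, K):
    u = (1 - a / k) * u + d / k**(s + 1)
print(u * K)                       # -> d/(a - s) = 1.0
```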

SLIDE 50

Conclusions

Incremental (Sub)Gradient method

min_x f(x) = Σ_{i=1}^m fi(x)  s.t. x ∈ Rn.

Idea: Sequentially take steps along the (sub)gradients of the component functions fi. Each (outer) iteration consists of a cycle with m subiterations: for k ≥ 1,

x^k_{i+1} = x^k_i − αk g^k_i,  for i = 1, 2, . . . , m,

where g^k_i ∈ ∂fi(x^k_i) is a subgradient of fi at x^k_i, and αk is a stepsize.

Outer iteration: x^{k+1}_1 := x^k_{m+1} = x^k_1 − αk Σ_{i=1}^m g^k_i.

[Figure: one cycle of the method: x^k_1 → x^k_2 → x^k_3 → · · · → x^k_m → x^k_{m+1} = x^{k+1}_1.]