SLIDE 1

Optimal convergence rates for distributed optimization

Francis Bach — Inria - École Normale Supérieure, Paris

Joint work with K. Scaman, S. Bubeck, Y.-T. Lee and L. Massoulié
LCCC Workshop - June 2017

SLIDE 2

Motivations

Typical Machine Learning setting

◮ Empirical risk minimization:

$$\min_{\theta\in\mathbb{R}^d}\ \frac{1}{m}\sum_{i=1}^m \ell(x_i, y_i;\theta) \;+\; c\|\theta\|_2^2$$

◮ Large scale learning systems handle massive amounts of data
◮ Requires multiple machines to train the model

SLIDE 3

Motivations

Typical Machine Learning setting

◮ Empirical risk minimization: logistic regression (a small NumPy sketch of this objective follows below)

$$\min_{\theta\in\mathbb{R}^d}\ \frac{1}{m}\sum_{i=1}^m \log\!\bigl(1+\exp(-y_i x_i^\top\theta)\bigr) \;+\; c\|\theta\|_2^2$$

◮ Large scale learning systems handle massive amounts of data
◮ Requires multiple machines to train the model

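A minimal NumPy sketch of this regularized logistic-regression objective and its gradient; the variable names X, y, c and the helper names are illustrative assumptions, not from the slides:

import numpy as np

def logistic_objective(theta, X, y, c):
    # (1/m) sum_i log(1 + exp(-y_i x_i^T theta)) + c ||theta||_2^2
    margins = y * (X @ theta)
    return np.mean(np.logaddexp(0.0, -margins)) + c * theta @ theta

def logistic_gradient(theta, X, y, c):
    m = X.shape[0]
    margins = y * (X @ theta)
    weights = 1.0 / (1.0 + np.exp(margins))      # sigmoid(-margin) for each sample
    return -(X.T @ (y * weights)) / m + 2.0 * c * theta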

SLIDE 4-5

Optimization with a single machine

“Best” convergence rate for strongly convex and smooth functions

◮ Number of iterations to reach a precision ε > 0 (Nesterov, 2004): $\Theta\!\left(\sqrt{\kappa}\,\ln\frac{1}{\varepsilon}\right)$, where κ is the condition number of the function to optimize (a minimal single-machine sketch follows below).
◮ Consequence of $f(\theta_t) - f(\theta^*) \le \beta\,(1-1/\sqrt{\kappa})^{t}\,\|\theta_0-\theta^*\|^2$
◮ ...but each iteration requires m gradients to compute!

Upper and lower bounds of complexity

$$\inf_{\text{algorithms}}\ \ \sup_{\text{functions}}\ \ \#\text{iterations to reach }\varepsilon$$

◮ Upper bound: exhibit an algorithm (here, Nesterov acceleration)
◮ Lower bound: exhibit a hard function on which all algorithms fail
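A minimal sketch of Nesterov's accelerated gradient method achieving this rate; the step size 1/β and momentum (√κ − 1)/(√κ + 1) match the master/slave algorithm later in the deck, while the function interface is an illustrative assumption:

import numpy as np

def nesterov_agd(grad, theta0, alpha, beta, iters):
    # accelerated gradient descent for an alpha-strongly convex, beta-smooth function
    kappa = beta / alpha
    eta = 1.0 / beta
    mu = (np.sqrt(kappa) - 1.0) / (np.sqrt(kappa) + 1.0)
    theta, y_prev = theta0.copy(), theta0.copy()
    for _ in range(iters):
        y = theta - eta * grad(theta)            # gradient step
        theta = (1.0 + mu) * y - mu * y_prev     # momentum extrapolation
        y_prev = y
    return y_prev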

SLIDE 6

Distributing information on a network

Centralized algorithms

◮ “Master/slave”
◮ Minimal number of communication steps = Diameter ∆

Decentralized algorithms

◮ Gossip algorithms (Boyd et al., 2006; Shah, 2009)
◮ Mixing time of the Markov chain on the graph ≈ inverse of the second smallest eigenvalue γ of the Laplacian


SLIDE 7-8

Goals of this work

Beyond single machine optimization

◮ Can we improve on $\Theta\!\left(m\sqrt{\kappa}\,\ln\frac{1}{\varepsilon}\right)$?
◮ Is the speed-up linear?
◮ How does a limited bandwidth affect optimization algorithms?

Extending optimization theory to distributed architectures

◮ Optimal convergence rates of first-order distributed methods,
◮ Optimal algorithms achieving this rate,
◮ Beyond flat (totally connected) architectures (Arjevani and Shamir, 2015),
◮ Explicit dependence on optimization parameters and graph parameters.

SLIDE 9-11

Distributed optimization setting

Optimization problem

Let fi be α-strongly convex and β-smooth functions. We consider minimizing the average of the local functions:

$$\min_{\theta\in\mathbb{R}^d}\ \bar f(\theta) \;=\; \frac{1}{n}\sum_{i=1}^n f_i(\theta)$$

◮ Machine learning: distributed observations

Optimization procedures

We consider distributed first-order optimization procedures: access to gradients (or to gradients of the Fenchel conjugates).

Network communications

Let G = (V, E) be a connected simple graph of n computing units and diameter ∆, each having access to a function fi(θ) over θ ∈ Rd.

SLIDE 12

Strong convexity and smoothness

Strong convexity

A function f is α-strongly convex iff ∀x, y ∈ Rd, $f(y) \ge f(x) + \nabla f(x)^\top (y-x) + \frac{\alpha}{2}\|y-x\|_2^2$.

Smoothness

A function f is β-smooth iff ∀x, y ∈ Rd, $f(y) \le f(x) + \nabla f(x)^\top (y-x) + \frac{\beta}{2}\|y-x\|_2^2$.

Notations

◮ κl = β/α: (local) condition number of each fi,
◮ κg = βg/αg: (global) condition number of f̄,
◮ κg ≤ κl, with equality if all the functions fi are equal.

SLIDE 13

Communication network

Assumptions

◮ Each local computation takes a unit of time,
◮ Each communication between neighbors takes a time τ,
◮ Actions may be performed in parallel and asynchronously.

SLIDE 14

Distributed optimization algorithms

Black-box procedures

We consider distributed algorithms verifying the following constraints:

1. Local memory: each node i can store past values in an internal memory $\mathcal{M}_{i,t}\subset\mathbb{R}^d$ at time t ≥ 0, with $\mathcal{M}_{i,t}\subset\mathcal{M}^{\mathrm{comp}}_{i,t}\cup\mathcal{M}^{\mathrm{comm}}_{i,t}$ and $\theta_{i,t}\in\mathcal{M}_{i,t}$.

2. Local computation: each node i can, at time t, compute the gradient of its local function $\nabla f_i(\theta)$ or of its Fenchel conjugate $\nabla f_i^*(\theta)$, where $f^*(\theta)=\sup_x\, x^\top\theta - f(x)$ (a worked quadratic example follows below):

$$\mathcal{M}^{\mathrm{comp}}_{i,t} = \mathrm{Span}\bigl(\{\theta,\ \nabla f_i(\theta),\ \nabla f_i^*(\theta)\ :\ \theta\in\mathcal{M}_{i,t-1}\}\bigr).$$

3. Local communication: each node i can, at time t, share a value with all or part of its neighbors:

$$\mathcal{M}^{\mathrm{comm}}_{i,t} = \mathrm{Span}\Bigl(\bigcup_{(i,j)\in E}\mathcal{M}_{j,t-\tau}\Bigr).$$

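For illustration (an added example, not from the slides): for quadratic local functions the conjugate gradient used above is available in closed form,

$$f_i(\theta) = \tfrac{1}{2}\,\theta^\top A_i\theta - b_i^\top\theta \ \ (A_i \succ 0)
\quad\Longrightarrow\quad
f_i^*(x) = \tfrac{1}{2}\,(x+b_i)^\top A_i^{-1}(x+b_i),
\qquad
\nabla f_i^*(x) = A_i^{-1}(x+b_i).$$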

SLIDE 15-16

Centralized vs. decentralized architectures

Centralized communication

◮ One master machine is responsible for sending requests and synchronizing computation,
◮ Slave machines perform computations upon request and send the result to the master.

Decentralized communication

◮ All machines perform local computations and share values with their neighbors,
◮ Local averaging is performed through gossip (Boyd et al., 2006).
◮ Node i receives $\sum_j W_{ij}x_j = (Wx)_i$, where W verifies:
  1. W is an n × n symmetric matrix,
  2. W is defined on the edges of the network: $W_{ij} \neq 0$ only if i = j or (i, j) ∈ E,
  3. W is positive semi-definite,
  4. The kernel of W is the set of constant vectors: Ker(W) = Span(1), where 1 = (1, ..., 1)⊤.
◮ Let γ(W) = λn−1(W)/λ1(W) be the (normalized) eigengap of W (a construction sketch follows below).
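A minimal sketch of one valid choice of gossip matrix, the graph Laplacian, together with its eigengap; the edge-list interface and the path-graph example are illustrative assumptions:

import numpy as np

def gossip_matrix(edges, n):
    # graph Laplacian: symmetric, PSD, and its kernel is the constant vectors for a connected graph
    W = np.zeros((n, n))
    for i, j in edges:
        W[i, j] -= 1.0
        W[j, i] -= 1.0
        W[i, i] += 1.0
        W[j, j] += 1.0
    ev = np.linalg.eigvalsh(W)          # eigenvalues in ascending order
    gamma = ev[1] / ev[-1]              # eigengap: second smallest / largest eigenvalue
    return W, gamma

# example: a path graph on 5 nodes (small eigengap, large diameter)
W, gamma = gossip_matrix([(0, 1), (1, 2), (2, 3), (3, 4)], n=5)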

SLIDE 17-18

Lower bound on convergence rate

Theorem 1 (SBBLM, 2017)

Let G be a graph of diameter ∆ > 0 and size n > 0, and βg ≥ αg > 0. There exist n functions fi : ℓ2 → R such that f̄ is αg-strongly convex and βg-smooth, and for any t ≥ 0 and any black-box procedure one has, for all i ∈ {1, ..., n},

$$\bar f(\theta_{i,t}) - \bar f(\theta^*) \ \ge\ \frac{\alpha_g}{2}\left(1 - \frac{4}{\sqrt{\kappa_g}}\right)^{1+\frac{t}{1+\Delta\tau}}\|\theta_{i,0}-\theta^*\|^2.$$

Take-home message

For any graph of diameter ∆ and any black-box procedure, there exist functions fi such that the time to reach a precision ε > 0 is lower bounded by

$$\Omega\!\left(\sqrt{\kappa_g}\,(1+\Delta\tau)\,\ln\frac{1}{\varepsilon}\right)$$

◮ Extends the totally connected result of Arjevani & Shamir (2015)
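Not on the slide, but it follows directly from Theorem 1: the per-iteration bound converts into the stated time lower bound by requiring the right-hand side to drop below ε (up to the $\frac{\alpha_g}{2}\|\theta_{i,0}-\theta^*\|^2$ factor),

$$\left(1-\frac{4}{\sqrt{\kappa_g}}\right)^{1+\frac{t}{1+\Delta\tau}} \le \varepsilon
\quad\Longrightarrow\quad
t \ \ge\ (1+\Delta\tau)\left(\frac{\ln(1/\varepsilon)}{-\ln(1-4/\sqrt{\kappa_g})} - 1\right)
\ =\ \Omega\!\left(\sqrt{\kappa_g}\,(1+\Delta\tau)\,\ln\frac{1}{\varepsilon}\right),$$

using $-\ln(1-4/\sqrt{\kappa_g}) \approx 4/\sqrt{\kappa_g}$ for large κg.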

SLIDE 19-21

Proof warm-up: single machine

◮ Simplification: ℓ2 instead of Rd.
◮ Goal: design a worst-case convex function f.
◮ From Nesterov (2004), Bubeck (2015):

$$f(\theta) = \frac{\alpha(\kappa-1)}{8}\left(\theta^\top A\theta - 2\theta_1\right) + \frac{\alpha}{2}\|\theta\|_2^2$$

with A the infinite tridiagonal matrix with 2 on the diagonal and −1 on the upper and lower diagonals; note that $\theta^\top A\theta = \theta_1^2 + \sum_{i\ge 1}(\theta_i - \theta_{i+1})^2$.
◮ Fact 1: $0 \preceq A \preceq 4I$, so f is α-strongly convex and β-smooth.
◮ Fact 2: starting from θ0 = 0, after t gradient steps θt is supported on the first t coordinates, so $\|\theta_t-\theta^*\|^2 \ge \sum_{i>t}(\theta^*_i)^2$ (a small numerical check follows below).
◮ This gives the lower bound $f(\theta_t)-f(\theta^*) \ \ge\ \frac{\alpha}{2}\left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^{2t}\|\theta_0-\theta^*\|^2$ after some computations.
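A small numerical check of Fact 2 on a finite truncation of the worst-case function; the truncation dimension, condition number and number of steps are illustrative assumptions:

import numpy as np

d, alpha, kappa = 50, 1.0, 100.0
beta = alpha * kappa

# finite truncation of the tridiagonal matrix A (2 on the diagonal, -1 off-diagonal)
A = 2 * np.eye(d) - np.eye(d, k=1) - np.eye(d, k=-1)
e1 = np.zeros(d)
e1[0] = 1.0

def grad_f(theta):
    # gradient of f(theta) = alpha(kappa-1)/8 (theta^T A theta - 2 theta_1) + alpha/2 ||theta||^2
    return alpha * (kappa - 1) / 8 * (2 * A @ theta - 2 * e1) + alpha * theta

theta = np.zeros(d)
for t in range(1, 11):
    theta = theta - (1.0 / beta) * grad_f(theta)
    nonzero = np.flatnonzero(np.abs(theta) > 1e-12)
    assert nonzero.max() <= t - 1       # theta_t is supported on the first t coordinates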

SLIDE 22

Proof sketch (1)

◮ Simplification: ℓ2 instead of Rd.
◮ Extremal nodes: i0 and i1 at distance ∆.
◮ Functions to optimize: splitting the usual Nesterov function,

$$f_i(\theta)=\begin{cases}
\dfrac{\alpha}{2}\|\theta\|_2^2 + \dfrac{n(\beta-\alpha)}{8}\left(\theta^\top M_1\theta - \theta_1\right) & \text{if } i=i_0,\\[6pt]
\dfrac{\alpha}{2}\|\theta\|_2^2 + \dfrac{n(\beta-\alpha)}{8}\,\theta^\top M_2\theta & \text{if } i=i_1,\\[6pt]
\dfrac{\alpha}{2}\|\theta\|_2^2 & \text{otherwise,}
\end{cases}$$

where $M_1:\ell_2\to\ell_2$ is the infinite block-diagonal matrix with $\begin{pmatrix}1 & -1\\ -1 & 1\end{pmatrix}$ on the diagonal, and $M_2 = \begin{pmatrix}1 & 0\\ 0 & M_1\end{pmatrix}$.

◮ Optimal value: $\theta^*_k = \left(\dfrac{\sqrt{\beta}-\sqrt{\alpha}}{\sqrt{\beta}+\sqrt{\alpha}}\right)^k$.

SLIDE 23

Proof sketch (2)

SLIDE 24

Proof sketch (3)

◮ If θi,0 = 0, each local computation can only increase the number of non-zero dimensions by one.
◮ ∇fi0(θi0,t) increases odd dimensions, ∇fi1(θi1,t) increases even dimensions.
◮ ∆ communication steps are required to communicate between i0 and i1.
◮ θi,t,k ≠ 0 only after at least k computation steps and k∆ communication steps.
◮ f̄ is α-strongly convex and β-smooth, and

$$\bar f(\theta_{i,t}) - \bar f(\theta^*) \ \ge\ \frac{\alpha}{2}\,\|\theta_{i,t}-\theta^*\|_2^2 \ \ge\ \frac{\alpha}{2}\sum_{k=k_{i,t}+1}^{+\infty}(\theta^*_k)^2,$$

where $k_{i,t} = \max\{k\in\mathbb{N} : \exists\,\theta\in\mathcal{M}_{i,t}\ \text{s.t.}\ \theta_k\neq 0\} \ \le\ \dfrac{t+\Delta\tau}{1+\Delta\tau}$.


SLIDE 25-30

Simple is good...!

Master/slave algorithm

Simple master/slave distribution of Nesterov's accelerated gradient descent (a minimal simulation sketch follows below).

Input: number of iterations T > 0, communication network G, η = 1/βg, µ = (√κg − 1)/(√κg + 1)
Output: θT
1: Compute a spanning tree T on G
2: θ0 = 0, y0 = 0
3: for t = 0 to T − 1 do
4:   Send θt to all nodes through T
5:   ∇f̄(θt) = aggregateGradients(θt)
6:   yt+1 = θt − η ∇f̄(θt)
7:   θt+1 = (1 + µ) yt+1 − µ yt
8: end for
9: return θT

Convergence rate

◮ Each iteration requires a time 1 + 2∆τ,
◮ Reaches a precision ε > 0 in time $O\!\left(\sqrt{\kappa_g}\,(1+\Delta\tau)\,\ln\frac{1}{\varepsilon}\right)$.
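A minimal single-process simulation of this master/slave scheme; the quadratic local functions and the helper aggregate_gradients are illustrative assumptions standing in for the actual network communication:

import numpy as np

def aggregate_gradients(theta, local_grads):
    # the master collects the slaves' local gradients and averages them
    return np.mean([g(theta) for g in local_grads], axis=0)

def master_slave_agd(local_grads, d, alpha_g, beta_g, T):
    kappa_g = beta_g / alpha_g
    eta = 1.0 / beta_g
    mu = (np.sqrt(kappa_g) - 1.0) / (np.sqrt(kappa_g) + 1.0)
    theta, y_prev = np.zeros(d), np.zeros(d)
    for _ in range(T):
        grad = aggregate_gradients(theta, local_grads)   # one round trip over the spanning tree
        y = theta - eta * grad
        theta = (1.0 + mu) * y - mu * y_prev
        y_prev = y
    return theta

# example: 4 nodes, each holding f_i(theta) = 0.5 ||theta - b_i||^2 (the average is minimized at the mean of the b_i)
rng = np.random.default_rng(0)
bs = [rng.normal(size=5) for _ in range(4)]
grads = [lambda th, b=b: th - b for b in bs]
theta_hat = master_slave_agd(grads, d=5, alpha_g=1.0, beta_g=1.0, T=50)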

SLIDE 31-32

Drawbacks

Drawbacks of this approach

◮ Not robust to changes in the connectivity of the network,
◮ Requires waiting for all machines to compute their local gradients.

A natural solution: decentralized algorithms

◮ Asynchronous computations,
◮ Machines do not wait for one another,
◮ Communication is not interrupted by a change in the network.

SLIDE 33

Related works

Large literature for decentralized optimization

◮ Distributed SGD (Nedic & Ozdaglar, 2009): $O\!\left(\frac{n^3R^2L^2}{\varepsilon^2}\right)$
◮ Decentralized dual averaging (Duchi et al., 2012): $O\!\left(\frac{R^2L^2}{\gamma(W)\,\varepsilon^2}\right)$
◮ D-ADMM (Boyd et al., 2011; Wei & Ozdaglar, 2012; Shi et al., 2014; Iutzeler et al., 2016): $O\!\left(\frac{2\kappa_l^2}{1+4\kappa_l^2}\,\gamma(W)^{-1}\ln\frac{1}{\varepsilon}\right)$
◮ EXTRA algorithm (Shi et al., 2015; Mokhtari & Ribeiro, 2016): ∃δ > 0 s.t. $O\!\left(\delta\,\ln\frac{1}{\varepsilon}\right)$
◮ Augmented Lagrangians (Jakovetić et al., 2015): $O\!\left(\frac{2\kappa_l^2}{1+4\kappa_l^2}\,\gamma(W)^{-1}\ln\frac{1}{\varepsilon}\right)$
◮ DIGing (Nedich et al., 2016): $O\!\left(n^{4.5}\,\kappa_l^{1.5}\,\ln\frac{1}{\varepsilon}\right)$
◮ ...


SLIDE 34-36

Decentralized algorithms

Optimal convergence rate?

◮ Decentralized convergence rates usually depend on the (normalized) eigengap γ(W),
◮ For simple graphs (linear graphs, regular graphs), $\Delta \approx \frac{1}{\sqrt{\gamma(W)}}$, where W is the Laplacian matrix,
◮ Can we have $\Theta\!\left(\sqrt{\kappa_g}\left(1+\frac{\tau}{\sqrt{\gamma(W)}}\right)\ln\frac{1}{\varepsilon}\right)$?
◮ No! Sometimes $\frac{1}{\sqrt{\gamma(W)}} \approx \frac{\Delta}{\ln n}$ (Ramanujan graphs and Erdős-Rényi random networks), so such a rate would beat the diameter-based lower bound of Theorem 1...

Optimal algorithm?

◮ We can achieve this rate if we replace κg by κl ≥ κg,
◮ Based on a double acceleration: accelerated gradient descent and accelerated gossip!

SLIDE 37-39

Lower bound on convergence rate

Theorem 2 (SBBLM, 2017)

Let α, β > 0 and γ ∈ (0, 1]. There exist a gossip matrix W of eigengap γ(W) = γ, and α-strongly convex and β-smooth functions fi : ℓ2 → R such that, for any t ≥ 0 and any black-box procedure using W, one has, for all i ∈ {1, ..., n},

$$\bar f(\theta_{i,t}) - \bar f(\theta^*) \ \ge\ \frac{3\alpha}{2}\left(1 - \frac{16}{\sqrt{\kappa_l}}\right)^{1+\frac{t}{1+\frac{\tau}{5\sqrt{\gamma}}}}\|\theta_{i,0}-\theta^*\|^2.$$

Take-home message

For any γ > 0, there exist a gossip matrix W of eigengap γ and functions fi such that the time to reach a precision ε > 0 is lower bounded by

$$\Omega\!\left(\sqrt{\kappa_l}\left(1+\frac{\tau}{\sqrt{\gamma}}\right)\ln\frac{1}{\varepsilon}\right)$$

◮ The naive algorithm does not work!

SLIDE 40-41

Reformulation of the optimization problem

◮ Using the gossip matrix to ensure equality of all the θi (Jakovetić et al., 2015),

$$\min_{\theta\in\mathbb{R}^d}\ \bar f(\theta) \;=\; \min_{\Theta\in\mathbb{R}^{d\times n}\,:\,\Theta\sqrt{W}=0}\ F(\Theta),
\qquad\text{where } F(\Theta)=\frac{1}{n}\sum_{i=1}^n f_i(\theta_i),\ \ \Theta=(\theta_1,\dots,\theta_n)\in\mathbb{R}^{d\times n}$$

◮ Dual version: $\max_{\lambda\in\mathbb{R}^{d\times n}}\ -F^*(\lambda\sqrt{W})$
◮ Gradient descent in the dual: $\lambda_{t+1} = \lambda_t - \eta\,\nabla F^*(\lambda_t\sqrt{W})\,\sqrt{W}$; the change of variable $y_t = \lambda_t\sqrt{W}$ leads to $y_{t+1} = y_t - \eta\,\nabla F^*(y_t)\,W$.

SLIDE 42-43

A double acceleration: (1) accelerated gradient descent

◮ The dual problem $\max_{\lambda\in\mathbb{R}^{d\times n}} -F^*(\lambda\sqrt{W})$ is an unconstrained strongly convex and smooth problem with condition number $\frac{\kappa_l}{\gamma(W)}$ (a short derivation follows below).
◮ Nesterov's accelerated gradient descent reaches a precision ε > 0 in $O\!\left(\sqrt{\frac{\kappa_l}{\gamma(W)}}\,(1+\tau)\,\ln\frac{1}{\varepsilon}\right)$.
◮ Optimal w.r.t. the communication time... but not in the number of gradient steps.
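A brief justification of this condition number (an added note, using standard conjugacy facts): if each fi is α-strongly convex and β-smooth, then fi* is (1/β)-strongly convex and (1/α)-smooth; composing with the linear map λ ↦ λ√W therefore gives, on the orthogonal complement of the kernel of √W,

$$\frac{\lambda_{n-1}(W)}{\beta}\,I \ \preceq\ \nabla^2\!\left[F^*(\lambda\sqrt{W})\right] \ \preceq\ \frac{\lambda_{1}(W)}{\alpha}\,I,
\qquad\text{hence}\qquad
\kappa_{\mathrm{dual}} \ =\ \frac{\lambda_1(W)/\alpha}{\lambda_{n-1}(W)/\beta} \ =\ \frac{\kappa_l}{\gamma(W)}.$$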

SLIDE 44-46

A double acceleration: (2) accelerated gossip

◮ Only one gossip step per local computation: suboptimal when τ ≪ 1!
◮ Accelerated gossip: replacing W by a polynomial PK(W).
◮ Cao et al. (2006), Kokiopoulou and Frossard (2009), Cavalcante et al. (2011)
◮ Chebyshev polynomials lead to the best convergence rates:

$$P_K(x) = 1 - \frac{T_K(c_2(1-x))}{T_K(c_2)},\qquad c_2 = \frac{1+\gamma}{1-\gamma},$$

where the Chebyshev polynomials TK are defined by $T_0(x)=1$, $T_1(x)=x$, and $T_{k+1}(x) = 2x\,T_k(x) - T_{k-1}(x)$ for all k ≥ 1.
◮ With $K = \left\lceil \frac{1}{\sqrt{\gamma(W)}}\right\rceil$, reaches a precision ε > 0 in time

$$O\!\left(\sqrt{\frac{\kappa_l}{\gamma(P_K(W))}}\,(1+K\tau)\,\ln\frac{1}{\varepsilon}\right) \;=\; O\!\left(\sqrt{\kappa_l}\left(1+\frac{\tau}{\sqrt{\gamma}}\right)\ln\frac{1}{\varepsilon}\right)$$

(a sketch of this accelerated gossip recursion follows below).
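A minimal sketch of the accelerated (Chebyshev) gossip step, matching the accGossip recursion of the next slide; rows of x are node values, c2 follows the slide, and the scaling c3 = 2/((1+γ)λ1(W)) is an assumed normalization:

import numpy as np

def acc_gossip(x, W, K, gamma):
    # applies the Chebyshev polynomial: returns x - T_K(c2(I - c3 W)) x / T_K(c2)
    lam1 = np.linalg.eigvalsh(W)[-1]              # largest eigenvalue of W
    c2 = (1.0 + gamma) / (1.0 - gamma)
    c3 = 2.0 / ((1.0 + gamma) * lam1)             # assumed scaling of W
    a_prev, a = 1.0, c2
    x_prev, x_cur = x, c2 * (x - c3 * (W @ x))
    for _ in range(1, K):
        a_prev, a = a, 2.0 * c2 * a - a_prev
        x_prev, x_cur = x_cur, 2.0 * c2 * (x_cur - c3 * (W @ x_cur)) - x_prev
    return x - x_cur / a                          # acts like a gossip matrix whose eigengap is bounded below by a constant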

SLIDE 47

Optimal decentralized algorithm

Multi-step Dual Accelerated (MSDA)

Input: gossip matrix W ∈ Rn×n, T > 0
Output: θi,T, for i = 1, ..., n
1: x0 = 0, y0 = 0
2: for t = 0 to T − 1 do
3:   θi,t = ∇fi*(xi,t), for all i = 1, ..., n
4:   yt+1 = xt − η accGossip(Θt, W, K)
5:   xt+1 = (1 + µ) yt+1 − µ yt
6: end for

1: procedure accGossip(x, W, K)
2:   a0 = 1, a1 = c2
3:   x0 = x, x1 = c2 x (I − c3 W)
4:   for k = 1 to K − 1 do
5:     ak+1 = 2 c2 ak − ak−1
6:     xk+1 = 2 c2 xk (I − c3 W) − xk−1
7:   end for
8:   return x0 − xK / aK
9: end procedure

(A Python sketch of this procedure follows below.)
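A minimal Python sketch of MSDA for quadratic local functions fi(θ) = ½‖θ − bi‖² (so that ∇fi*(x) = x + bi), reusing the acc_gossip sketch above; the step size η and momentum µ are left as parameters here (the paper derives their exact values):

import numpy as np

def msda(B, W, K, gamma, eta, mu, T):
    # B has one row b_i per node; x, y are (n, d) arrays of dual variables
    n, d = B.shape
    x = np.zeros((n, d))
    y_prev = np.zeros((n, d))
    for _ in range(T):
        theta = x + B                                    # theta_{i,t} = grad f_i^*(x_{i,t})
        y = x - eta * acc_gossip(theta, W, K, gamma)     # accelerated gossip on the stacked iterates
        x = (1.0 + mu) * y - mu * y_prev                 # momentum extrapolation
        y_prev = y
    return x + B                                         # primal estimates theta_i at each node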

SLIDE 48

Experiments: logistic regression

Optimization problem

$$\min_{\theta\in\mathbb{R}^d}\ \frac{1}{m}\sum_{i=1}^m \ln\!\bigl(1 + e^{-y_i\,X_i^\top\theta}\bigr) \;+\; c\|\theta\|_2^2$$

Communication network

◮ Left: Erdős-Rényi random graph of 100 nodes and average degree 6,
◮ Right: square grid of 10 × 10 nodes (a network-generation sketch follows below).

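A small sketch generating the two communication networks of this experiment and comparing their (Laplacian) eigengaps; the graph sizes follow the slide, while the construction details and the connectivity of the sampled Erdős-Rényi graph are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)

def laplacian_eigengap(adj):
    L = np.diag(adj.sum(axis=1)) - adj
    ev = np.linalg.eigvalsh(L)
    return ev[1] / ev[-1]                  # gamma(W): second smallest / largest eigenvalue

# Erdos-Renyi graph: 100 nodes, edge probability tuned for average degree ~ 6 (assumed connected)
n = 100
upper = np.triu(rng.random((n, n)) < 6.0 / (n - 1), k=1).astype(float)
A_er = upper + upper.T

# 10 x 10 square grid
side = 10
A_grid = np.zeros((side * side, side * side))
for r in range(side):
    for c in range(side):
        i = r * side + c
        if c + 1 < side:
            A_grid[i, i + 1] = A_grid[i + 1, i] = 1.0
        if r + 1 < side:
            A_grid[i, i + side] = A_grid[i + side, i] = 1.0

print(laplacian_eigengap(A_er), laplacian_eigengap(A_grid))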

SLIDE 49-50

Conclusion

◮ First optimal convergence rates for distributed optimization in networks,
◮ Optimal centralized convergence rate: $\Theta\!\left(\sqrt{\kappa_g}\,(1+\Delta\tau)\,\ln\frac{1}{\varepsilon}\right)$,
◮ Optimal decentralized convergence rate: $\Theta\!\left(\sqrt{\kappa_l}\left(1+\frac{\tau}{\sqrt{\gamma}}\right)\ln\frac{1}{\varepsilon}\right)$.

Extensions

◮ Beyond strong convexity, stochastic problems
◮ Asynchronous algorithms
◮ Decentralized rate in κg?
◮ Primal-only optimal decentralized algorithm,
◮ Composite functions $f_i(\theta) = g_i(B_i\theta) + c\|\theta\|^2$
◮ Approximation of the proximal point algorithm
◮ Time-varying networks, delays, failures, etc.

SLIDE 51

Thank you!
