Optimal convergence rates for distributed optimization


  1. Optimal convergence rates for distributed optimization
     Francis Bach, Inria - École Normale Supérieure, Paris
     Joint work with K. Scaman, S. Bubeck, Y.-T. Lee and L. Massoulié
     LCCC Workshop - June 2017

  2. Motivations: Typical Machine Learning setting
     ◮ Empirical risk minimization:
           min_{θ ∈ R^d} (1/m) Σ_{i=1}^m ℓ(x_i, y_i; θ) + (c/2) ‖θ‖²
     ◮ Large-scale learning systems handle massive amounts of data
     ◮ Requires multiple machines to train the model

  3. Motivations: Typical Machine Learning setting
     ◮ Empirical risk minimization, e.g. logistic regression:
           min_{θ ∈ R^d} (1/m) Σ_{i=1}^m log(1 + exp(−y_i x_i^⊤ θ)) + (c/2) ‖θ‖²
     ◮ Large-scale learning systems handle massive amounts of data
     ◮ Requires multiple machines to train the model
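A minimal NumPy sketch of the regularized logistic-regression objective above and its gradient; the names logistic_objective, X, y and c are illustrative placeholders, not code or data from the talk.

```python
import numpy as np

def logistic_objective(theta, X, y, c):
    """(1/m) Σ_i log(1 + exp(-y_i x_iᵀθ)) + (c/2)‖θ‖², with labels y_i in {-1, +1}."""
    margins = y * (X @ theta)
    return np.mean(np.logaddexp(0.0, -margins)) + 0.5 * c * np.dot(theta, theta)

def logistic_gradient(theta, X, y, c):
    """Gradient of the objective above; each term only uses the sample (x_i, y_i)."""
    m = X.shape[0]
    margins = y * (X @ theta)
    weights = -y / (1.0 + np.exp(margins))   # derivative of the logistic loss w.r.t. x_iᵀθ
    return X.T @ weights / m + c * theta

# Tiny usage example on random data (illustrative only).
rng = np.random.default_rng(0)
X, y = rng.standard_normal((100, 5)), rng.choice([-1.0, 1.0], size=100)
print(logistic_objective(np.zeros(5), X, y, c=0.1))   # log(2) ≈ 0.693 at θ = 0
```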

  4. Optimization with a single machine
     "Best" convergence rate for strongly convex and smooth functions
     ◮ Number of iterations to reach a precision ε > 0 (Nesterov, 2004):
           Θ(√κ ln(1/ε))
       where κ is the condition number of the function to optimize.
     ◮ Consequence of f(θ_t) − f(θ*) ≤ β (1 − 1/√κ)^t ‖θ_0 − θ*‖²
     ◮ ...but each iteration requires m gradients to compute!

  5. Optimization with a single machine
     "Best" convergence rate for strongly convex and smooth functions
     ◮ Number of iterations to reach a precision ε > 0 (Nesterov, 2004):
           Θ(√κ ln(1/ε))
       where κ is the condition number of the function to optimize.
     ◮ Consequence of f(θ_t) − f(θ*) ≤ β (1 − 1/√κ)^t ‖θ_0 − θ*‖²
     ◮ ...but each iteration requires m gradients to compute!
     Upper and lower bounds of complexity
           inf_{algorithms} sup_{functions} #iterations to reach ε
     ◮ Upper bound: exhibit an algorithm (here Nesterov acceleration)
     ◮ Lower bound: exhibit a hard function on which all algorithms fail
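To make the rate concrete, here is a sketch of Nesterov's accelerated gradient method for an α-strongly-convex, β-smooth function, in the constant-momentum form whose suboptimality contracts roughly like (1 − 1/√κ)^t; the function names and the quadratic test problem are illustrative assumptions, not code from the talk.

```python
import numpy as np

def nesterov(grad, theta0, alpha, beta, n_iters):
    """Accelerated gradient descent for an alpha-strongly-convex, beta-smooth function."""
    kappa = beta / alpha
    momentum = (np.sqrt(kappa) - 1.0) / (np.sqrt(kappa) + 1.0)
    theta, y = theta0.copy(), theta0.copy()
    for _ in range(n_iters):
        theta_next = y - grad(y) / beta                       # gradient step at the extrapolated point
        y = theta_next + momentum * (theta_next - theta)      # momentum extrapolation
        theta = theta_next
    return theta

# Toy quadratic with condition number κ = 100 (illustrative only).
H = np.diag(np.linspace(1.0, 100.0, 50))
theta_star = np.ones(50)
sol = nesterov(lambda t: H @ (t - theta_star), np.zeros(50), alpha=1.0, beta=100.0, n_iters=200)
print(np.linalg.norm(sol - theta_star))   # small after roughly √κ · ln(1/ε) iterations
```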

  6. Distributing information on a network
     Centralized algorithms
     ◮ "Master/slave" architecture
     ◮ Minimal number of communication steps = diameter Δ
     Decentralized algorithms
     ◮ Gossip algorithms (Boyd et al., 2006; Shah, 2009)
     ◮ Mixing time of the Markov chain on the graph ≈ inverse of the second smallest eigenvalue γ of the Laplacian
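The two graph quantities that appear here, the diameter Δ and the second smallest eigenvalue γ of the Laplacian, are easy to compute numerically. A short sketch using networkx on a ring graph; the graph and library choice are my own, not from the talk.

```python
import numpy as np
import networkx as nx

# Illustrative example: a ring of 20 nodes.
G = nx.cycle_graph(20)

delta = nx.diameter(G)                            # Δ: worst-case number of hops between two nodes
L = nx.laplacian_matrix(G).toarray().astype(float)
eigvals = np.sort(np.linalg.eigvalsh(L))          # ascending; eigvals[0] ≈ 0 for a connected graph
gamma = eigvals[1]                                # second smallest Laplacian eigenvalue

print(f"diameter Δ = {delta}, second smallest Laplacian eigenvalue γ ≈ {gamma:.4f}")
# Gossip averaging mixes in roughly 1/γ communication rounds on this graph.
```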

  7. Goals of this work
     Beyond single-machine optimization
     ◮ Can we improve on Θ(m √κ ln(1/ε))?
     ◮ Is the speed-up linear?
     ◮ How does a limited bandwidth affect optimization algorithms?

  8. Goals of this work
     Beyond single-machine optimization
     ◮ Can we improve on Θ(m √κ ln(1/ε))?
     ◮ Is the speed-up linear?
     ◮ How does a limited bandwidth affect optimization algorithms?
     Extending optimization theory to distributed architectures
     ◮ Optimal convergence rates of first-order distributed methods,
     ◮ Optimal algorithms achieving this rate,
     ◮ Beyond flat (totally connected) architectures (Arjevani and Shamir, 2015),
     ◮ Explicit dependence on optimization parameters and graph parameters.

  9. Distributed optimization setting
     Optimization problem
     Let the f_i be α-strongly convex and β-smooth functions. We consider minimizing the average of the local functions:
           min_{θ ∈ R^d} f̄(θ) = (1/n) Σ_{i=1}^n f_i(θ)
     ◮ Machine learning: distributed observations

  10. Distributed optimization setting
     Optimization problem
     Let the f_i be α-strongly convex and β-smooth functions. We consider minimizing the average of the local functions:
           min_{θ ∈ R^d} f̄(θ) = (1/n) Σ_{i=1}^n f_i(θ)
     ◮ Machine learning: distributed observations
     Optimization procedures
     We consider distributed first-order optimization procedures: access to gradients (or gradients of the Fenchel conjugates).

  11. Distributed optimization setting
     Optimization problem
     Let the f_i be α-strongly convex and β-smooth functions. We consider minimizing the average of the local functions:
           min_{θ ∈ R^d} f̄(θ) = (1/n) Σ_{i=1}^n f_i(θ)
     ◮ Machine learning: distributed observations
     Optimization procedures
     We consider distributed first-order optimization procedures: access to gradients (or gradients of the Fenchel conjugates).
     Network communications
     Let G = (V, E) be a connected simple graph of n computing units and diameter Δ, each having access to a function f_i(θ) over θ ∈ R^d.
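A minimal sketch of this setting with quadratic local functions: each node can only evaluate the gradient of its own f_i, while the global objective f̄ averages them. All names and the choice of local quadratics are illustrative assumptions, not code from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
n_nodes, d = 5, 10
A = [rng.standard_normal((20, d)) for _ in range(n_nodes)]   # local data held by node i
b = [rng.standard_normal(20) for _ in range(n_nodes)]

def local_grad(i, theta):
    """Gradient of f_i(θ) = ½‖A_i θ − b_i‖², the only oracle node i can query locally."""
    return A[i].T @ (A[i] @ theta - b[i])

def global_grad(theta):
    """Gradient of the average objective f̄ = (1/n) Σ_i f_i (requires communication)."""
    return sum(local_grad(i, theta) for i in range(n_nodes)) / n_nodes

theta = np.zeros(d)
print(np.linalg.norm(global_grad(theta)))   # evaluating ∇f̄ needs one gradient from every node
```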

  12. Strong convexity and smoothness
     Strong convexity
     A function f is α-strongly convex if and only if ∀ x, y ∈ R^d,
           f(y) ≥ f(x) + ∇f(x)^⊤ (y − x) + (α/2) ‖y − x‖².
     Smoothness
     A convex function f is β-smooth if and only if ∀ x, y ∈ R^d,
           f(y) ≤ f(x) + ∇f(x)^⊤ (y − x) + (β/2) ‖y − x‖².
     Notations
     ◮ κ_l = β/α: (local) condition number of each f_i,
     ◮ κ_g = β_g/α_g: (global) condition number of f̄,
     ◮ κ_g ≤ κ_l, with equality if all the functions f_i are equal.
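A small numerical check of local versus global condition numbers on random quadratics (my own toy example, not from the talk): for f_i(θ) = ½ θ^⊤ H_i θ the strong-convexity and smoothness constants are the extreme eigenvalues of the Hessians, and the average objective is never worse conditioned than the pair (α, β) shared by all the f_i.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_nodes = 8, 4
Hs = []
for _ in range(n_nodes):
    M = rng.standard_normal((d, d))
    Hs.append(M @ M.T + 0.1 * np.eye(d))              # random positive-definite Hessians

alphas = [np.linalg.eigvalsh(H)[0] for H in Hs]        # strong-convexity constants of the f_i
betas = [np.linalg.eigvalsh(H)[-1] for H in Hs]        # smoothness constants of the f_i

kappa_l = max(betas) / min(alphas)                     # κ_l = β/α valid for every f_i
H_bar = sum(Hs) / n_nodes                              # Hessian of the average objective f̄
eig_bar = np.linalg.eigvalsh(H_bar)
kappa_g = eig_bar[-1] / eig_bar[0]                     # κ_g = β_g / α_g

print(f"κ_l ≈ {kappa_l:.1f}, κ_g ≈ {kappa_g:.1f}")     # κ_g ≤ κ_l
```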

  13. Communication network
     Assumptions
     ◮ Each local computation takes one unit of time,
     ◮ Each communication between neighbors takes a time τ,
     ◮ Actions may be performed in parallel and asynchronously.

  14. Distributed optimization algorithms
     Black-box procedures
     We consider distributed algorithms verifying the following constraints:
     1. Local memory: each node i can store past values in an internal memory M_{i,t} ⊂ R^d at time t ≥ 0:
           M_{i,t} ⊂ M^comp_{i,t} ∪ M^comm_{i,t},   θ_{i,t} ∈ M_{i,t}.
     2. Local computation: each node i can, at time t, compute the gradient of its local function ∇f_i(θ) or of its Fenchel conjugate ∇f_i^*(θ), where f^*(θ) = sup_x x^⊤ θ − f(x):
           M^comp_{i,t} = Span({θ, ∇f_i(θ), ∇f_i^*(θ) : θ ∈ M_{i,t−1}}).
     3. Local communication: each node i can, at time t, share a value with all or part of its neighbors:
           M^comm_{i,t} = Span(∪_{(i,j) ∈ E} M_{j,t−τ}).
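To see these constraints in action, here is a minimal synchronous simulation in which every node alternates a local gradient step (local computation) and an average with its ring neighbors (local communication). The quadratic local functions, step size and ring topology are illustrative assumptions; this plain scheme is only meant to illustrate the black-box model, not one of the optimal algorithms from the talk.

```python
import numpy as np

rng = np.random.default_rng(2)
n_nodes, d = 4, 3
targets = rng.standard_normal((n_nodes, d))     # f_i(θ) = ½‖θ − targets[i]‖²; minimizer of f̄ is the mean

def local_grad(i, theta_i):
    return theta_i - targets[i]

neighbors = {i: [(i - 1) % n_nodes, (i + 1) % n_nodes] for i in range(n_nodes)}  # ring topology

theta = np.zeros((n_nodes, d))                  # one local iterate per node
for _ in range(500):
    # local computation: one gradient step on the local function only
    theta = theta - 0.1 * np.stack([local_grad(i, theta[i]) for i in range(n_nodes)])
    # local communication: average the iterate with the neighbors' iterates
    theta = np.stack([(theta[i] + sum(theta[j] for j in neighbors[i])) / 3.0
                      for i in range(n_nodes)])

# With a constant step size this simple scheme only reaches a neighborhood
# (of size O(step size)) of the global minimizer, but every node gets close to it.
print(np.abs(theta - targets.mean(axis=0)).max())
```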

  15. Centralized vs. decentralized architectures
     Centralized communication
     ◮ One master machine is responsible for sending requests and synchronizing computation,
     ◮ Slave machines perform computations upon request and send the result to the master.

  16. Centralized vs. decentralized architectures
     Centralized communication
     ◮ One master machine is responsible for sending requests and synchronizing computation,
     ◮ Slave machines perform computations upon request and send the result to the master.
     Decentralized communication
     ◮ All machines perform local computations and share values with their neighbors,
     ◮ Local averaging is performed through gossip (Boyd et al., 2006).
     ◮ Node i receives Σ_j W_ij x_j = (W x)_i, where W verifies:
        1. W is an n × n symmetric matrix,
        2. W is defined on the edges of the network: W_ij ≠ 0 only if i = j or (i, j) ∈ E,
        3. W is positive semi-definite,
        4. The kernel of W is the set of constant vectors: Ker(W) = Span(1), where 1 = (1, ..., 1)^⊤.
     ◮ Let γ(W) = λ_{n−1}(W) / λ_1(W) be the (normalized) eigengap of W.
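The graph Laplacian is one standard matrix satisfying these four conditions (symmetric, supported on the edges, positive semi-definite, kernel Span(1) for a connected graph), so it can serve as the gossip matrix W. A short sketch computing its normalized eigengap γ(W); the graph and library choice are my own, not from the talk.

```python
import numpy as np
import networkx as nx

G = nx.path_graph(10)                                 # a line graph of 10 nodes
W = nx.laplacian_matrix(G).toarray().astype(float)    # Laplacian used as the gossip matrix

eigvals = np.sort(np.linalg.eigvalsh(W))              # ascending: eigvals[0] ≈ 0
gamma = eigvals[1] / eigvals[-1]                       # γ(W) = λ_{n-1}(W) / λ_1(W)
print(f"normalized eigengap γ(W) ≈ {gamma:.4f}")

# (W x)_i = Σ_j W_ij x_j only involves node i and its neighbors, so applying W to the
# vector of local values corresponds to exactly one round of neighbor communication.
```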

  17. Lower bound on convergence rate
     Theorem 1 (SBBLM, 2017)
     Let G be a graph of diameter Δ > 0 and size n > 0, and β_g ≥ α_g > 0. There exist n functions f_i : ℓ_2 → R such that f̄ is α_g-strongly convex and β_g-smooth, and for any t ≥ 0 and any black-box procedure one has, for all i ∈ {1, ..., n},
           f̄(θ_{i,t}) − f̄(θ*) ≥ (α_g/2) (1 − 4/√κ_g)^{1 + t/(1 + Δτ)} ‖θ_{i,0} − θ*‖².

  18. Lower bound on convergence rate
     Theorem 1 (SBBLM, 2017)
     Let G be a graph of diameter Δ > 0 and size n > 0, and β_g ≥ α_g > 0. There exist n functions f_i : ℓ_2 → R such that f̄ is α_g-strongly convex and β_g-smooth, and for any t ≥ 0 and any black-box procedure one has, for all i ∈ {1, ..., n},
           f̄(θ_{i,t}) − f̄(θ*) ≥ (α_g/2) (1 − 4/√κ_g)^{1 + t/(1 + Δτ)} ‖θ_{i,0} − θ*‖².
     Take-home message
     For any graph of diameter Δ and any black-box procedure, there exist functions f_i such that the time to reach a precision ε > 0 is lower bounded by
           Ω(√κ_g (1 + Δ τ) ln(1/ε))
     ◮ Extends the totally connected result of Arjevani & Shamir (2015)
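A quick numerical reading of the take-home bound, with constants dropped, to see how the condition number, diameter and communication time combine; the specific values of κ_g, Δ, τ and ε below are arbitrary illustrations, not figures from the talk.

```python
import numpy as np

def lower_bound_time(kappa_g, delta, tau, eps):
    """Order of the lower bound: sqrt(κ_g) · (1 + Δ τ) · ln(1/ε), constants omitted."""
    return np.sqrt(kappa_g) * (1.0 + delta * tau) * np.log(1.0 / eps)

for tau in (0.0, 0.1, 1.0, 10.0):
    t = lower_bound_time(kappa_g=100.0, delta=5, tau=tau, eps=1e-6)
    print(f"τ = {tau:>4}: lower bound ≈ {t:.0f} time units")
# Slow communication (large τ) multiplies the whole time budget by roughly Δ τ.
```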

  19. Proof warm-up: single machine
     ◮ Simplification: ℓ_2 instead of R^d.
     ◮ Goal: design a worst-case convex function f.
     ◮ From Nesterov (2004), Bubeck (2015):
           f(θ) = (α(κ − 1)/8) (θ^⊤ A θ − 2 θ_1) + (α/2) ‖θ‖²
       with A the infinite tridiagonal matrix with 2 on the diagonal and −1 on the upper and lower diagonals.
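A finite-dimensional truncation of this worst-case function is easy to inspect numerically; the dimension and parameter values below are my own choices for illustration (the talk works directly in ℓ_2).

```python
import numpy as np

d, alpha, kappa = 200, 1.0, 100.0

# Truncated tridiagonal matrix A: 2 on the diagonal, -1 on the upper and lower diagonals.
A = 2 * np.eye(d) - np.eye(d, k=1) - np.eye(d, k=-1)

def f(theta):
    """Truncation of Nesterov's worst-case function from the slide."""
    return alpha * (kappa - 1) / 8 * (theta @ A @ theta - 2 * theta[0]) \
        + alpha / 2 * np.dot(theta, theta)

# Its Hessian is α(κ-1)/4 · A + α·I, whose eigenvalues lie in (α, ακ), so the
# condition number of the truncation approaches κ as d grows.
H = alpha * (kappa - 1) / 4 * A + alpha * np.eye(d)
eigs = np.linalg.eigvalsh(H)
print(f"condition number of the truncation ≈ {eigs[-1] / eigs[0]:.1f} (target κ = {kappa})")
```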
