Communication trade-offs for synchronized distributed SGD with large step size


  1. Communication trade-offs for synchronized distributed SGD with large step size. Aymeric DIEULEVEUT, EPFL, MLO, 17 November 2017. Joint work with Kumar Kshitij Patel.

  2. Outline. 1. Stochastic gradient descent: supervised machine learning; setting, assumptions and proof techniques. 2. Synchronized distributed SGD: from mini-batch averaging to model averaging. 3. Optimality of Local-SGD.

  3-5. Stochastic Gradient Descent. [Figure: iterates $\theta_0, \theta_1, \dots, \theta_n$ converging to $\theta_\star$.] Goal: $\min_{\theta \in \mathbb{R}^d} F(\theta)$, given unbiased gradient estimates $g_n$, where $\theta_\star := \arg\min_{\mathbb{R}^d} F(\theta)$. Key algorithm: Stochastic Gradient Descent (SGD) (Robbins and Monro, 1951): $\theta_k = \theta_{k-1} - \eta_k\, g_k(\theta_{k-1})$, with $\mathbb{E}[g_k(\theta_{k-1}) \mid \mathcal{F}_{k-1}] = F'(\theta_{k-1})$ for a filtration $(\mathcal{F}_k)_{k \ge 0}$, and $\theta_k$ is $\mathcal{F}_k$-measurable.
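
As a concrete reading of the recursion above, here is a minimal Python/NumPy sketch of SGD driven by a user-supplied stochastic gradient oracle. The names (`sgd`, `grad_oracle`, `step_sizes`) and the NumPy setting are illustrative assumptions, not part of the slides.

```python
import numpy as np

def sgd(grad_oracle, theta0, step_sizes, n_steps, rng):
    """Minimal SGD sketch: theta_k = theta_{k-1} - eta_k * g_k(theta_{k-1}).

    grad_oracle(theta, k, rng) should return an unbiased estimate of F'(theta);
    step_sizes(k) returns the step size eta_k for iteration k >= 1.
    """
    theta = np.asarray(theta0, dtype=float).copy()
    iterates = [theta.copy()]
    for k in range(1, n_steps + 1):
        g = grad_oracle(theta, k, rng)       # unbiased gradient estimate g_k(theta_{k-1})
        theta = theta - step_sizes(k) * g    # SGD update with step size eta_k
        iterates.append(theta.copy())
    return np.array(iterates)
```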

  6-7. Supervised Machine Learning. We define the risk (generalization error) as $R(\theta) := \mathbb{E}_\rho\big[\ell(Y, \langle \theta, \Phi(X)\rangle)\big]$, and the empirical risk (training error) as $\hat{R}(\theta) = \frac{1}{n}\sum_{i=1}^{n} \ell(y_i, \langle \theta, \Phi(x_i)\rangle)$. For example, least-squares regression: $\min_{\theta \in \mathbb{R}^d} \frac{1}{2n}\sum_{i=1}^{n} \big(y_i - \langle \theta, \Phi(x_i)\rangle\big)^2 + \mu\,\Omega(\theta)$, and logistic regression: $\min_{\theta \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^{n} \log\big(1 + \exp(-y_i \langle \theta, \Phi(x_i)\rangle)\big) + \mu\,\Omega(\theta)$.
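
For instance, single-example stochastic gradients of the (unregularized) least-squares and logistic objectives above might look as follows, taking $\Phi(x) = x$ for simplicity; `lstsq_grad`, `logistic_grad` and `make_oracle` are hypothetical helper names for this sketch, compatible with the `sgd` sketch above.

```python
import numpy as np

def lstsq_grad(theta, x_i, y_i):
    # Gradient of 0.5 * (y_i - <theta, x_i>)^2 with respect to theta.
    return -(y_i - x_i @ theta) * x_i

def logistic_grad(theta, x_i, y_i):
    # Gradient of log(1 + exp(-y_i * <theta, x_i>)), with labels y_i in {-1, +1}.
    return -y_i * x_i / (1.0 + np.exp(y_i * (x_i @ theta)))

def make_oracle(X, y, grad_fn):
    # Wraps a per-example gradient into an oracle for the sgd() sketch above:
    # each call samples one (x_i, y_i) uniformly at random.
    def grad_oracle(theta, k, rng):
        i = rng.integers(len(y))
        return grad_fn(theta, X[i], y[i])
    return grad_oracle
```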

  8-10. Polyak-Ruppert averaging. [Figure: iterates $\theta_0, \theta_1, \theta_2, \dots, \theta_n$ and their average $\bar\theta_n$ around $\theta_\star$.] Introduced by Polyak and Juditsky (1992) and Ruppert (1988): $\bar\theta_n = \frac{1}{n+1}\sum_{k=0}^{n} \theta_k$. Offline averaging reduces the effect of the noise. Online computation of the average: $\bar\theta_{n+1} = \frac{n+1}{n+2}\,\bar\theta_n + \frac{1}{n+2}\,\theta_{n+1}$.
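
The averaged iterate can be maintained online alongside SGD at negligible cost; a minimal sketch (a running-mean update equivalent to the formula above, with illustrative names):

```python
import numpy as np

def sgd_with_averaging(grad_oracle, theta0, step_sizes, n_steps, rng):
    # Runs SGD and maintains the Polyak-Ruppert average
    # theta_bar_n = (1 / (n + 1)) * sum_{k=0}^{n} theta_k  online.
    theta = np.asarray(theta0, dtype=float).copy()
    theta_bar = theta.copy()                     # average after 0 steps is theta_0
    for k in range(1, n_steps + 1):
        g = grad_oracle(theta, k, rng)
        theta = theta - step_sizes(k) * g
        # Running mean: after this line theta_bar averages theta_0, ..., theta_k.
        theta_bar = theta_bar + (theta - theta_bar) / (k + 1)
    return theta, theta_bar
```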

  11-13. Assumptions. Recursion: $\theta_k = \theta_{k-1} - \eta_k\, g_k(\theta_{k-1})$; goal: $\min_\theta F(\theta)$.
     A1 [Strong convexity] The function $F$ is strongly convex with convexity constant $\mu > 0$.
     A2 [Smoothness and regularity] The function $F$ is three times continuously differentiable with uniformly bounded second and third derivatives: $\sup_{\theta \in \mathbb{R}^d} \|F^{(2)}(\theta)\| < L$ and $\sup_{\theta \in \mathbb{R}^d} \|F^{(3)}(\theta)\| < M$. In particular, $F$ is $L$-smooth.
     Or: Q1 [Quadratic function] There exists a positive definite matrix $\Sigma \in \mathbb{R}^{d \times d}$ such that $F$ is the quadratic function $\theta \mapsto \|\Sigma^{1/2}(\theta - \theta_\star)\|^2 / 2$.
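
For intuition on how Q1 relates to A1 and A2 (a standard computation, not spelled out on the slide):

```latex
% In the quadratic case Q1, F(\theta) = \tfrac{1}{2}\|\Sigma^{1/2}(\theta - \theta_\star)\|^2, so
\[
  F'(\theta) = \Sigma\,(\theta - \theta_\star), \qquad
  F''(\theta) = \Sigma, \qquad
  F^{(3)} \equiv 0 ,
\]
% and A1, A2 hold with \mu = \lambda_{\min}(\Sigma) > 0, L = \lambda_{\max}(\Sigma), and M = 0.
```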

  14. Which step size would you use? (Smooth functions.) Candidate step sizes: $\eta_k \equiv \eta_0$ (constant), $\eta_k = 1/\sqrt{k}$, $\eta_k = 1/(\mu k)$; function classes: convex, strongly convex, quadratic.
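
The three candidate schedules from the quiz, written as functions that could be passed to the `sgd` sketch above (an illustrative snippet, not from the slides):

```python
def constant_step(eta0):
    return lambda k: eta0             # eta_k = eta_0: constant ("large") step size

def inv_sqrt_step(eta0=1.0):
    return lambda k: eta0 / k ** 0.5  # eta_k = eta_0 / sqrt(k)

def strongly_convex_step(mu):
    return lambda k: 1.0 / (mu * k)   # eta_k = 1 / (mu * k), for mu-strongly convex F
```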

  15-16. Classical bound: Lyapunov approach.
     $\mathbb{E}\big[\|\theta_{k+1} - \theta_\star\|^2 \mid \mathcal{F}_k\big] \le \|\theta_k - \theta_\star\|^2 - 2\eta_k \langle F'(\theta_k), \theta_k - \theta_\star\rangle + \eta_k^2\, \mathbb{E}\big[\|g_k(\theta_k)\|^2 \mid \mathcal{F}_k\big]$
     $\le \|\theta_k - \theta_\star\|^2 - 2\eta_k (1 - \eta_k L)\, \langle F'(\theta_k), \theta_k - \theta_\star\rangle + \eta_k^2\, \mathbb{E}\big[\|g_k(\theta_\star)\|^2 \mid \mathcal{F}_k\big]$
     $\eta_k \big(F(\theta_k) - F(\theta_\star)\big) \le (1 - \eta_k \mu)\, \|\theta_k - \theta_\star\|^2 - \mathbb{E}\big[\|\theta_{k+1} - \theta_\star\|^2 \mid \mathcal{F}_k\big] + \eta_k^2\, \mathbb{E}\big[\|g_k(\theta_\star)\|^2 \mid \mathcal{F}_k\big]$
     Conclusion: with $\eta_k = \frac{1}{\mu k}$, a telescoping sum plus Jensen's inequality give $\mathbb{E}\big[F(\bar\theta_k) - F(\theta_\star)\big] \le O(1/(\mu k))$.
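
Where the first line comes from, in the slide's indexing ($\theta_{k+1} = \theta_k - \eta_k g_k(\theta_k)$): expand the square and use unbiasedness of the gradient estimate. This is my reconstruction of the omitted step.

```latex
% Expanding the SGD update theta_{k+1} = theta_k - eta_k * g_k(theta_k):
\[
  \|\theta_{k+1} - \theta_\star\|^2
  = \|\theta_k - \theta_\star\|^2
    - 2\eta_k \,\langle g_k(\theta_k),\, \theta_k - \theta_\star \rangle
    + \eta_k^2 \,\|g_k(\theta_k)\|^2 .
\]
% Taking conditional expectations and using unbiasedness,
% \mathbb{E}[ g_k(\theta_k) \mid \mathcal{F}_k ] = F'(\theta_k), gives the first
% inequality (as an equality, before further bounding):
\[
  \mathbb{E}\big[\|\theta_{k+1} - \theta_\star\|^2 \mid \mathcal{F}_k\big]
  = \|\theta_k - \theta_\star\|^2
    - 2\eta_k \,\langle F'(\theta_k),\, \theta_k - \theta_\star \rangle
    + \eta_k^2 \,\mathbb{E}\big[\|g_k(\theta_k)\|^2 \mid \mathcal{F}_k\big].
\]
```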

  17-19. Trivial case: decaying step sizes are not that great! Consider least squares: $y_i = \theta_\star^\top x_i + \varepsilon_i$, $\varepsilon_i \overset{i.i.d.}{\sim} \mathcal{N}(0, \sigma^2)$. Start with $\theta_0 = \theta_\star$. Then $\bar\theta_k - \theta_\star = \frac{1}{k}\sum_{i=1}^{k} \eta_i\, M_i^k\, \varepsilon_i$, a weighted average of the noise variables. Even with a large constant step size $\eta_i \equiv \eta$, the CLT is enough to control this! Tight control is much easier on the stochastic process $\theta_k - \theta_\star$ than through the "Lyapunov approach".
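
A small simulation sketch of this trivial case (dimension, number of steps, noise level, and the step size 0.05 are arbitrary illustrative choices): run constant-step-size SGD on least squares starting from $\theta_0 = \theta_\star$ and observe that the averaged iterate stays within a CLT-scale $O(\sigma/\sqrt{k})$ neighbourhood of $\theta_\star$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_steps, eta, sigma = 5, 20000, 0.05, 1.0

theta_star = rng.normal(size=d)
theta = theta_star.copy()          # start exactly at the optimum
theta_bar = theta.copy()

for k in range(1, n_steps + 1):
    x = rng.normal(size=d)                      # fresh Gaussian input x_i
    y = x @ theta_star + sigma * rng.normal()   # y_i = theta_star^T x_i + eps_i
    g = -(y - x @ theta) * x                    # stochastic gradient of 0.5 * (y_i - <theta, x_i>)^2
    theta = theta - eta * g                     # constant ("large") step size
    theta_bar = theta_bar + (theta - theta_bar) / (k + 1)

# theta_bar - theta_star is an average of weighted noise terms, so its norm
# should be of order sigma / sqrt(n_steps) rather than of order eta.
print(np.linalg.norm(theta_bar - theta_star), sigma / np.sqrt(n_steps))
```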

  20-24. Other proof: introduce a decomposition. Original proof of averaging in Polyak and Juditsky (1992). Rearranging the SGD recursion $\theta_{k-1} - \theta_k = \eta_k\, g_k(\theta_{k-1})$ (add and subtract $F'(\theta_{k-1})$, then linearize $F'$ around $\theta_\star$):
     $\eta_k\, F''(\theta_\star)(\theta_{k-1} - \theta_\star) = \theta_{k-1} - \theta_k - \eta_k \big(g_k(\theta_{k-1}) - F'(\theta_{k-1})\big) - \eta_k \big(F'(\theta_{k-1}) - F''(\theta_\star)(\theta_{k-1} - \theta_\star)\big)$.
     Thus, for $\eta_k \equiv \eta$, averaging over $k = 1, \dots, K$ (here $\bar\theta_K$ averages $\theta_0, \dots, \theta_{K-1}$):
     $F''(\theta_\star)\big(\bar\theta_K - \theta_\star\big) = \frac{\theta_0 - \theta_K}{\eta K} - \frac{1}{K}\sum_{k=1}^{K} \big(g_k(\theta_{k-1}) - F'(\theta_{k-1})\big) - \frac{1}{K}\sum_{k=1}^{K} \big(F'(\theta_{k-1}) - F''(\theta_\star)(\theta_{k-1} - \theta_\star)\big)$,
     i.e. initial condition, noise, and non-quadratic residual, which yields a tight control of $\|F''(\theta_\star)(\bar\theta_K - \theta_\star)\|$. Correct control of the noise for smooth and strongly convex functions: all step sizes $\eta_n = C n^{-\alpha}$ with $\alpha \in (1/2, 1)$ lead to $O(n^{-1})$. LMS algorithm: constant step size $\to$ statistical optimality.
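
A rough summary (my own sketch of the standard Polyak-Juditsky-type argument, with constants and exact moment assumptions omitted; $\sigma^2$ denotes a bound on the gradient-noise variance) of how the three terms are controlled for a constant step size $\eta$:

```latex
\[
  \underbrace{\frac{\|\theta_0 - \theta_K\|}{\eta K}}_{\text{initial condition}}
    = O\!\Big(\frac{1}{\eta K}\Big),
  \qquad
  \underbrace{\Big\|\frac{1}{K}\sum_{k=1}^{K}\big(g_k(\theta_{k-1}) - F'(\theta_{k-1})\big)\Big\|}_{\text{noise: average of martingale increments}}
    = O_{\mathbb{P}}\!\Big(\frac{\sigma}{\sqrt{K}}\Big),
\]
% and, by a Taylor expansion of F' around \theta_\star together with A2,
\[
  \big\|F'(\theta_{k-1}) - F''(\theta_\star)(\theta_{k-1} - \theta_\star)\big\|
  \le \frac{M}{2}\,\|\theta_{k-1} - \theta_\star\|^2 ,
\]
% so the non-quadratic residual is driven by the average squared distance of the
% iterates to \theta_\star.
```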
