Communication trade-offs for synchronized distributed SGD with large step size


  1. Communication trade-offs for synchronized distributed SGD with large step size. Aymeric DIEULEVEUT, EPFL, MLO, 17 November 2017. Joint work with Kumar Kshitij Patel.

  2. Outline. 1. Stochastic gradient descent: supervised machine learning; setting, assumptions and proof techniques. 2. Synchronized distributed SGD: from mini-batch averaging to model averaging. 3. Optimality of Local-SGD.

  3-5. Stochastic Gradient Descent. [Figure: iterates $\theta_0, \theta_1, \dots, \theta_n$ converging to $\theta_\star$.] Goal: $\min_{\theta \in \mathbb{R}^d} F(\theta)$, given unbiased gradient estimates $g_n$, where $\theta_\star := \arg\min_{\mathbb{R}^d} F(\theta)$. Key algorithm: Stochastic Gradient Descent (SGD) (Robbins and Monro, 1951): $\theta_k = \theta_{k-1} - \eta_k\, g_k(\theta_{k-1})$, with $\mathbb{E}[g_k(\theta_{k-1}) \mid \mathcal{F}_{k-1}] = F'(\theta_{k-1})$ for a filtration $(\mathcal{F}_k)_{k \ge 0}$, and $\theta_k$ is $\mathcal{F}_k$-measurable.
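
As a concrete reading of the recursion above, here is a minimal Python/NumPy sketch of SGD driven by a user-supplied stochastic gradient oracle. The names (`sgd`, `grad_oracle`, `step_sizes`) and the NumPy setting are illustrative assumptions, not part of the slides.

```python
import numpy as np

def sgd(grad_oracle, theta0, step_sizes, n_steps, rng):
    """Minimal SGD sketch: theta_k = theta_{k-1} - eta_k * g_k(theta_{k-1}).

    grad_oracle(theta, k, rng) should return an unbiased estimate of F'(theta);
    step_sizes(k) returns the step size eta_k for iteration k >= 1.
    """
    theta = np.asarray(theta0, dtype=float).copy()
    iterates = [theta.copy()]
    for k in range(1, n_steps + 1):
        g = grad_oracle(theta, k, rng)       # unbiased gradient estimate g_k(theta_{k-1})
        theta = theta - step_sizes(k) * g    # SGD update with step size eta_k
        iterates.append(theta.copy())
    return np.array(iterates)
```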

  6-7. Supervised Machine Learning. We define the risk (generalization error) as $R(\theta) := \mathbb{E}_\rho\big[\ell(Y, \langle \theta, \Phi(X)\rangle)\big]$, and the empirical risk (training error) as $\hat{R}(\theta) = \frac{1}{n}\sum_{i=1}^{n} \ell(y_i, \langle \theta, \Phi(x_i)\rangle)$. For example, least-squares regression: $\min_{\theta \in \mathbb{R}^d} \frac{1}{2n}\sum_{i=1}^{n} \big(y_i - \langle \theta, \Phi(x_i)\rangle\big)^2 + \mu\,\Omega(\theta)$, and logistic regression: $\min_{\theta \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^{n} \log\big(1 + \exp(-y_i \langle \theta, \Phi(x_i)\rangle)\big) + \mu\,\Omega(\theta)$.
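
For instance, single-example stochastic gradients of the (unregularized) least-squares and logistic objectives above might look as follows, taking $\Phi(x) = x$ for simplicity; `lstsq_grad`, `logistic_grad` and `make_oracle` are hypothetical helper names for this sketch, compatible with the `sgd` sketch above.

```python
import numpy as np

def lstsq_grad(theta, x_i, y_i):
    # Gradient of 0.5 * (y_i - <theta, x_i>)^2 with respect to theta.
    return -(y_i - x_i @ theta) * x_i

def logistic_grad(theta, x_i, y_i):
    # Gradient of log(1 + exp(-y_i * <theta, x_i>)), with labels y_i in {-1, +1}.
    return -y_i * x_i / (1.0 + np.exp(y_i * (x_i @ theta)))

def make_oracle(X, y, grad_fn):
    # Wraps a per-example gradient into an oracle for the sgd() sketch above:
    # each call samples one (x_i, y_i) uniformly at random.
    def grad_oracle(theta, k, rng):
        i = rng.integers(len(y))
        return grad_fn(theta, X[i], y[i])
    return grad_oracle
```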

  8-10. Polyak-Ruppert averaging. [Figure: iterates $\theta_0, \theta_1, \theta_2, \dots, \theta_n$ and their average $\bar\theta_n$ around $\theta_\star$.] Introduced by Polyak and Juditsky (1992) and Ruppert (1988): $\bar\theta_n = \frac{1}{n+1}\sum_{k=0}^{n} \theta_k$. Offline averaging reduces the effect of the noise. Online computation of the average: $\bar\theta_{n+1} = \frac{n+1}{n+2}\,\bar\theta_n + \frac{1}{n+2}\,\theta_{n+1}$.
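
The averaged iterate can be maintained online alongside SGD at negligible cost; a minimal sketch (a running-mean update equivalent to the formula above, with illustrative names):

```python
import numpy as np

def sgd_with_averaging(grad_oracle, theta0, step_sizes, n_steps, rng):
    # Runs SGD and maintains the Polyak-Ruppert average
    # theta_bar_n = (1 / (n + 1)) * sum_{k=0}^{n} theta_k  online.
    theta = np.asarray(theta0, dtype=float).copy()
    theta_bar = theta.copy()                     # average after 0 steps is theta_0
    for k in range(1, n_steps + 1):
        g = grad_oracle(theta, k, rng)
        theta = theta - step_sizes(k) * g
        # Running mean: after this line theta_bar averages theta_0, ..., theta_k.
        theta_bar = theta_bar + (theta - theta_bar) / (k + 1)
    return theta, theta_bar
```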

  11-13. Assumptions. Recursion: $\theta_k = \theta_{k-1} - \eta_k\, g_k(\theta_{k-1})$; goal: $\min_\theta F(\theta)$.
     A1 [Strong convexity] The function $F$ is strongly convex with convexity constant $\mu > 0$.
     A2 [Smoothness and regularity] The function $F$ is three times continuously differentiable with uniformly bounded second and third derivatives: $\sup_{\theta \in \mathbb{R}^d} \|F^{(2)}(\theta)\| < L$ and $\sup_{\theta \in \mathbb{R}^d} \|F^{(3)}(\theta)\| < M$. In particular, $F$ is $L$-smooth.
     Or: Q1 [Quadratic function] There exists a positive definite matrix $\Sigma \in \mathbb{R}^{d \times d}$ such that $F$ is the quadratic function $\theta \mapsto \|\Sigma^{1/2}(\theta - \theta_\star)\|^2 / 2$.
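
For intuition on how Q1 relates to A1 and A2 (a standard computation, not spelled out on the slide):

```latex
% In the quadratic case Q1, F(\theta) = \tfrac{1}{2}\|\Sigma^{1/2}(\theta - \theta_\star)\|^2, so
\[
  F'(\theta) = \Sigma\,(\theta - \theta_\star), \qquad
  F''(\theta) = \Sigma, \qquad
  F^{(3)} \equiv 0 ,
\]
% and A1, A2 hold with \mu = \lambda_{\min}(\Sigma) > 0, L = \lambda_{\max}(\Sigma), and M = 0.
```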

  14. Which step size would you use? (Smooth functions.) Candidate step sizes: $\eta_k \equiv \eta_0$ (constant), $\eta_k = 1/\sqrt{k}$, $\eta_k = 1/(\mu k)$; function classes: convex, strongly convex, quadratic.
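
The three candidate schedules from the quiz, written as functions that could be passed to the `sgd` sketch above (an illustrative snippet, not from the slides):

```python
def constant_step(eta0):
    return lambda k: eta0             # eta_k = eta_0: constant ("large") step size

def inv_sqrt_step(eta0=1.0):
    return lambda k: eta0 / k ** 0.5  # eta_k = eta_0 / sqrt(k)

def strongly_convex_step(mu):
    return lambda k: 1.0 / (mu * k)   # eta_k = 1 / (mu * k), for mu-strongly convex F
```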

  15-16. Classical bound: Lyapunov approach.
     $\mathbb{E}\big[\|\theta_{k+1} - \theta_\star\|^2 \mid \mathcal{F}_k\big] \le \|\theta_k - \theta_\star\|^2 - 2\eta_k \langle F'(\theta_k), \theta_k - \theta_\star\rangle + \eta_k^2\, \mathbb{E}\big[\|g_k(\theta_k)\|^2 \mid \mathcal{F}_k\big]$
     $\le \|\theta_k - \theta_\star\|^2 - 2\eta_k (1 - \eta_k L)\, \langle F'(\theta_k), \theta_k - \theta_\star\rangle + \eta_k^2\, \mathbb{E}\big[\|g_k(\theta_\star)\|^2 \mid \mathcal{F}_k\big]$
     $\eta_k \big(F(\theta_k) - F(\theta_\star)\big) \le (1 - \eta_k \mu)\, \|\theta_k - \theta_\star\|^2 - \mathbb{E}\big[\|\theta_{k+1} - \theta_\star\|^2 \mid \mathcal{F}_k\big] + \eta_k^2\, \mathbb{E}\big[\|g_k(\theta_\star)\|^2 \mid \mathcal{F}_k\big]$
     Conclusion: with $\eta_k = \frac{1}{\mu k}$, a telescoping sum plus Jensen's inequality give $\mathbb{E}\big[F(\bar\theta_k) - F(\theta_\star)\big] \le O(1/(\mu k))$.
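
Where the first line comes from, in the slide's indexing ($\theta_{k+1} = \theta_k - \eta_k g_k(\theta_k)$): expand the square and use unbiasedness of the gradient estimate. This is my reconstruction of the omitted step.

```latex
% Expanding the SGD update theta_{k+1} = theta_k - eta_k * g_k(theta_k):
\[
  \|\theta_{k+1} - \theta_\star\|^2
  = \|\theta_k - \theta_\star\|^2
    - 2\eta_k \,\langle g_k(\theta_k),\, \theta_k - \theta_\star \rangle
    + \eta_k^2 \,\|g_k(\theta_k)\|^2 .
\]
% Taking conditional expectations and using unbiasedness,
% \mathbb{E}[ g_k(\theta_k) \mid \mathcal{F}_k ] = F'(\theta_k), gives the first
% inequality (as an equality, before further bounding):
\[
  \mathbb{E}\big[\|\theta_{k+1} - \theta_\star\|^2 \mid \mathcal{F}_k\big]
  = \|\theta_k - \theta_\star\|^2
    - 2\eta_k \,\langle F'(\theta_k),\, \theta_k - \theta_\star \rangle
    + \eta_k^2 \,\mathbb{E}\big[\|g_k(\theta_k)\|^2 \mid \mathcal{F}_k\big].
\]
```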

  17-19. Trivial case: decaying step sizes are not that great! Consider least squares: $y_i = \theta_\star^\top x_i + \varepsilon_i$, $\varepsilon_i \overset{i.i.d.}{\sim} \mathcal{N}(0, \sigma^2)$. Start with $\theta_0 = \theta_\star$. Then $\bar\theta_k - \theta_\star = \frac{1}{k}\sum_{i=1}^{k} \eta_i\, M_i^k\, \varepsilon_i$, a weighted average of the noise variables. Even with a large constant step size $\eta_i \equiv \eta$, the CLT is enough to control this! Tight control is much easier on the stochastic process $\theta_k - \theta_\star$ than through the "Lyapunov approach".
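
A small simulation sketch of this trivial case (dimension, number of steps, noise level, and the step size 0.05 are arbitrary illustrative choices): run constant-step-size SGD on least squares starting from $\theta_0 = \theta_\star$ and observe that the averaged iterate stays within a CLT-scale $O(\sigma/\sqrt{k})$ neighbourhood of $\theta_\star$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_steps, eta, sigma = 5, 20000, 0.05, 1.0

theta_star = rng.normal(size=d)
theta = theta_star.copy()          # start exactly at the optimum
theta_bar = theta.copy()

for k in range(1, n_steps + 1):
    x = rng.normal(size=d)                      # fresh Gaussian input x_i
    y = x @ theta_star + sigma * rng.normal()   # y_i = theta_star^T x_i + eps_i
    g = -(y - x @ theta) * x                    # stochastic gradient of 0.5 * (y_i - <theta, x_i>)^2
    theta = theta - eta * g                     # constant ("large") step size
    theta_bar = theta_bar + (theta - theta_bar) / (k + 1)

# theta_bar - theta_star is an average of weighted noise terms, so its norm
# should be of order sigma / sqrt(n_steps) rather than of order eta.
print(np.linalg.norm(theta_bar - theta_star), sigma / np.sqrt(n_steps))
```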

  20-24. Other proof: introduce a decomposition. Original proof of averaging in Polyak and Juditsky (1992). Rearranging the SGD recursion $\theta_{k-1} - \theta_k = \eta_k\, g_k(\theta_{k-1})$ (add and subtract $F'(\theta_{k-1})$, then linearize $F'$ around $\theta_\star$):
     $\eta_k\, F''(\theta_\star)(\theta_{k-1} - \theta_\star) = \theta_{k-1} - \theta_k - \eta_k \big(g_k(\theta_{k-1}) - F'(\theta_{k-1})\big) - \eta_k \big(F'(\theta_{k-1}) - F''(\theta_\star)(\theta_{k-1} - \theta_\star)\big)$.
     Thus, for $\eta_k \equiv \eta$, averaging over $k = 1, \dots, K$ (here $\bar\theta_K$ averages $\theta_0, \dots, \theta_{K-1}$):
     $F''(\theta_\star)\big(\bar\theta_K - \theta_\star\big) = \frac{\theta_0 - \theta_K}{\eta K} - \frac{1}{K}\sum_{k=1}^{K} \big(g_k(\theta_{k-1}) - F'(\theta_{k-1})\big) - \frac{1}{K}\sum_{k=1}^{K} \big(F'(\theta_{k-1}) - F''(\theta_\star)(\theta_{k-1} - \theta_\star)\big)$,
     i.e. initial condition, noise, and non-quadratic residual, which yields a tight control of $\|F''(\theta_\star)(\bar\theta_K - \theta_\star)\|$. Correct control of the noise for smooth and strongly convex functions: all step sizes $\eta_n = C n^{-\alpha}$ with $\alpha \in (1/2, 1)$ lead to $O(n^{-1})$. LMS algorithm: constant step size $\to$ statistical optimality.
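
A rough summary (my own sketch of the standard Polyak-Juditsky-type argument, with constants and exact moment assumptions omitted; $\sigma^2$ denotes a bound on the gradient-noise variance) of how the three terms are controlled for a constant step size $\eta$:

```latex
\[
  \underbrace{\frac{\|\theta_0 - \theta_K\|}{\eta K}}_{\text{initial condition}}
    = O\!\Big(\frac{1}{\eta K}\Big),
  \qquad
  \underbrace{\Big\|\frac{1}{K}\sum_{k=1}^{K}\big(g_k(\theta_{k-1}) - F'(\theta_{k-1})\big)\Big\|}_{\text{noise: average of martingale increments}}
    = O_{\mathbb{P}}\!\Big(\frac{\sigma}{\sqrt{K}}\Big),
\]
% and, by a Taylor expansion of F' around \theta_\star together with A2,
\[
  \big\|F'(\theta_{k-1}) - F''(\theta_\star)(\theta_{k-1} - \theta_\star)\big\|
  \le \frac{M}{2}\,\|\theta_{k-1} - \theta_\star\|^2 ,
\]
% so the non-quadratic residual is driven by the average squared distance of the
% iterates to \theta_\star.
```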
