Trading Redundancy for Communication: Speeding up Distributed SGD for Non-convex Optimization

Farzin Haddadpour, joint work with Mohammad Mahdi Kamani, Mehrdad Mahdavi, and Viveck Cadambe

  1. Trading Redundancy for Communication: Speeding up Distributed SGD for Non-convex Optimization Farzin Haddadpour

  2. Joint work with Mohammad Mahdi Kamani, Mehrdad Mahdavi, and Viveck Cadambe

  3. Goal: solving $\min_{x} f(x) = \sum_{i} f_i(x)$

  4. SGD: $x^{(t+1)} = x^{(t)} - \frac{\eta}{|\xi^{(t)}|}\,\nabla f(x^{(t)}; \xi^{(t)})$

  5. Parallelization due to computational cost. Distributed SGD: $x^{(t+1)} = x^{(t)} - \frac{\eta}{p}\sum_{j=1}^{p}\frac{1}{|\xi_j^{(t)}|}\,\nabla f(x^{(t)}; \xi_j^{(t)})$
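To make the two update rules above concrete, here is a minimal NumPy sketch of synchronous distributed SGD on an assumed toy least-squares problem split across p workers; the data, step size, and helper names are illustrative and not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares problem split across p workers; sizes and constants
# below are hypothetical, only to illustrate the update rules above.
p, n, d = 3, 100, 5
A = [rng.normal(size=(n, d)) for _ in range(p)]
b = [Aj @ np.ones(d) + 0.1 * rng.normal(size=n) for Aj in A]

def minibatch_grad(x, j, batch_size=10):
    """(1/|xi_j|) * grad f(x; xi_j): averaged mini-batch gradient at worker j."""
    idx = rng.choice(n, size=batch_size, replace=False)
    Aj, bj = A[j][idx], b[j][idx]
    return Aj.T @ (Aj @ x - bj) / batch_size

def distributed_sgd_step(x, eta):
    """One synchronous step: every worker sends its mini-batch gradient,
    the server averages the p gradients and updates the shared model."""
    return x - eta * np.mean([minibatch_grad(x, j) for j in range(p)], axis=0)

x = np.zeros(d)
for t in range(200):
    x = distributed_sgd_step(x, eta=0.05)
```

Each iteration costs one round of communication (every worker ships a gradient), which is exactly what the next slide identifies as the bottleneck.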

  6. Communication is the bottleneck.

  7. Communication cost: number of bits per iteration. Remedy: gradient-compression-based techniques.

  8. Communication cost: number of bits per iteration (remedy: gradient compression) and number of rounds (remedy: local SGD with periodic averaging).

  9. Local SGD with periodic averaging:
$$
x_j^{(t+1)} =
\begin{cases}
\dfrac{1}{p}\displaystyle\sum_{j=1}^{p}\left[x_j^{(t)} - \eta\,\tilde{g}_j^{(t)}\right], & \text{if } \tau \mid t \quad \text{(a) averaging step},\\[6pt]
x_j^{(t)} - \eta\,\tilde{g}_j^{(t)}, & \text{otherwise} \quad \text{(b) local update}.
\end{cases}
$$

  10. Local SGD with periodic averaging, illustrated for $p = 3$, $\tau = 1$: workers $W_1, W_2, W_3$ perform an averaging step (a) after every local update. [Figure: communication pattern among the three workers.]

  11. Local SGD with periodic averaging, $p = 3$, $\tau = 1$ versus $p = 3$, $\tau = 3$: with $\tau = 3$ the workers $W_1, W_2, W_3$ take three local updates (b) between consecutive averaging steps (a), reducing the number of communication rounds. [Figure: communication patterns for the two settings.]
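A single-process simulation of the periodic-averaging rule on slides 9-11 might look as follows; the quadratic local objectives and the values of p, tau, eta, and T are assumptions made only for this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative setup: p workers, each with its own toy least-squares objective.
# p, tau, eta, and T are assumptions for the sketch, not values from the paper.
p, n, d, tau, eta, T = 3, 100, 5, 3, 0.05, 120
A = [rng.normal(size=(n, d)) for _ in range(p)]
b = [Aj @ np.ones(d) for Aj in A]

def noisy_grad(x, j, batch_size=10):
    """Stochastic gradient estimate g~_j of worker j's local objective."""
    idx = rng.choice(n, size=batch_size, replace=False)
    return A[j][idx].T @ (A[j][idx] @ x - b[j][idx]) / batch_size

x = [np.zeros(d) for _ in range(p)]                 # one local model per worker
for t in range(1, T + 1):
    # (b) local update on every worker
    x = [x[j] - eta * noisy_grad(x[j], j) for j in range(p)]
    # (a) averaging step whenever tau divides t
    if t % tau == 0:
        avg = np.mean(x, axis=0)
        x = [avg.copy() for _ in range(p)]
```

Setting tau = 1 recovers fully synchronous distributed SGD, while larger tau cuts the number of communication rounds to T/tau.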

  12. Convergence analysis of local SGD with periodic averaging.

Table 1: Comparison of different SGD-based algorithms.

  Strategy              Convergence error                      Assumptions         Comm. rounds ($T/\tau$)
  SGD                   $O(1/\sqrt{pT})$                       i.i.d. & b.g.       $T$
  [Yu et al.]           $O(1/\sqrt{pT})$                       i.i.d. & b.g.       $O(p^{3/4}T^{3/4})$
  [Wang & Joshi]        $O(1/\sqrt{pT})$                       i.i.d.              $O(p^{3/2}T^{1/2})$
  RI-SGD ($\tau$, $q$)  $O(1/\sqrt{pT}) + O((1 - q/p)\beta)$   non-i.i.d. & b.d.   $O(p^{3/2}T^{1/2})$

b.g.: bounded gradient, $\|g_i\|_2^2 \le G$. Unbiased gradient estimation: $\mathbb{E}[\tilde{g}_j] = g_j$.

  13. Convergence analysis of local SGD with periodic averaging (Table 1 above). Open questions: A. Residual error is observed in practice, but a theoretical understanding is missing. B. How can we capture this in the convergence analysis? C. Is there any solution to improve it?

  14. Insufficiency of the convergence analysis. A. Residual error is observed in practice, but a theoretical understanding is missing: unbiased gradient estimation does not hold.

  15. B. How can we capture this in the convergence analysis? Analysis based on biased gradients (our work).

  16. C. Any solution to improve it? Redundancy (our work).

  17. Redundancy-infused local SGD (RI-SGD). The data is partitioned as $D = D_1 \cup D_2 \cup D_3$. Baseline local SGD ($p = 3$, $\tau = 3$): worker $W_j$ trains only on its own chunk $D_j$. [Figure: workers $W_1, W_2, W_3$, each holding one chunk.]

  18. Redundancy-infused local SGD (RI-SGD) with explicit redundancy ($q = 2$, $p = 3$, $\tau = 3$): compared with local SGD ($p = 3$, $\tau = 3$), each worker now stores $q = 2$ of the $p = 3$ data chunks (e.g., $W_1$: $D_1, D_2$; $W_2$: $D_2, D_3$; $W_3$: $D_3, D_1$). [Figure: data placement under local SGD vs. RI-SGD across the three workers.]
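A sketch of how the redundancy on slide 18 could be infused into the local-SGD loop above; the circular chunk placement and all constants are assumptions made for illustration, not the paper's exact construction.

```python
import numpy as np

rng = np.random.default_rng(2)

# Redundancy-infused local SGD sketch on toy data chunks D_1..D_p.
# p, q, tau, eta, T and the placement rule are illustrative assumptions.
p, q, n, d, tau, eta, T = 3, 2, 100, 5, 3, 0.05, 120
A = [rng.normal(size=(n, d)) for _ in range(p)]      # data chunks D_1..D_p
b = [Aj @ np.ones(d) for Aj in A]

# Worker j stores q of the p chunks: j, j+1, ..., j+q-1 (mod p).
placement = [[(j + s) % p for s in range(q)] for j in range(p)]

def redundant_grad(x, j, batch_size=10):
    """Mini-batch gradient averaged over the q chunks stored at worker j."""
    g = np.zeros(d)
    for c in placement[j]:
        idx = rng.choice(n, size=batch_size, replace=False)
        g += A[c][idx].T @ (A[c][idx] @ x - b[c][idx]) / batch_size
    return g / q

x = [np.zeros(d) for _ in range(p)]
for t in range(1, T + 1):
    x = [x[j] - eta * redundant_grad(x[j], j) for j in range(p)]  # local updates
    if t % tau == 0:                                              # periodic averaging
        avg = np.mean(x, axis=0)
        x = [avg.copy() for _ in range(p)]
```

With q = p every worker sees all chunks and the local models stay close even without communication; q = 1 recovers the plain local SGD placement of slide 17.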

  19. Comparing RI-SGD with other schemes. Assumption (biased gradients), b.d.: bounded inner product of gradients, $\langle g_i, g_j \rangle \le \beta$. Redundancy: $q$ is the number of data chunks stored at each worker node.

  20. Comparing RI-SGD with other schemes (Table 1, slide 12): under non-i.i.d. data and the bounded-inner-product assumption, RI-SGD ($\tau$, $q$) converges at rate $O(1/\sqrt{pT}) + O((1 - q/p)\beta)$ with $O(p^{3/2}T^{1/2})$ communication rounds.
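Reading the RI-SGD bound from Table 1 at the two extremes of the redundancy parameter $q$ (constants and step-size conditions suppressed; this is plain arithmetic on the stated rate, not an additional result from the paper):

```latex
\[
\underbrace{O\!\left(\frac{1}{\sqrt{pT}}\right)}_{\text{vanishes as } T \to \infty}
\;+\;
\underbrace{O\!\left(\Bigl(1-\frac{q}{p}\Bigr)\beta\right)}_{\text{residual error}},
\qquad
\begin{cases}
q = 1: & \text{residual } O\!\left(\frac{p-1}{p}\,\beta\right)\ \text{(each worker holds a single chunk)},\\[4pt]
q = p: & \text{residual } 0\ \text{(full redundancy: every worker holds all chunks)}.
\end{cases}
\]
```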

  21. Advantages of RI-SGD: 1. Speed-up not only due to the larger effective mini-batch size, but also due to increased intra-gradient diversity. 2. Fault tolerance. 3. Extension to heterogeneous mini-batch sizes and possible application to federated optimization.

  22. Faster convergence: experiments on ImageNet (top figures) and CIFAR-100 (bottom figures).

  23. Increasing intra-gradient diversity: experiments on CIFAR-10.

  24. Fault tolerance: experiments on CIFAR-10.

  25. For more details, please come to my poster session: Wed, Jun 12th, 6:30-9:00 PM, Pacific Ballroom #185. Thanks for your attention!
