Trading Redundancy for Communication: Speeding up Distributed SGD for Non-convex Optimization

Farzin Haddadpour
Joint work with Mohammad Mahdi Kamani, Mehrdad Mahdavi, and Viveck Cadambe
Goal: Solving

$$\min_{x} f(x) \triangleq \sum_{i} f_i(x)$$
SGD:

$$x^{(t+1)} = x^{(t)} - \eta \, \frac{1}{|\xi^{(t)}|} \, \nabla f\big(x^{(t)};\, \xi^{(t)}\big)$$
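A minimal NumPy sketch of this mini-batch update; the least-squares objective and the sampling routine below are illustrative assumptions, not part of the talk.

```python
import numpy as np

def sgd_step(x, data, batch_size, eta, grad_fn):
    # Sample a mini-batch xi^(t) and average the per-example gradients over it.
    idx = np.random.choice(len(data), size=batch_size, replace=False)
    g = np.mean([grad_fn(x, data[i]) for i in idx], axis=0)
    # x^(t+1) = x^(t) - eta * (1/|xi^(t)|) * (sum of gradients over the batch)
    return x - eta * g

# Illustrative objective: f_i(x) = 0.5 * (a_i^T x - b_i)^2
def grad_example(x, sample):
    a, b = sample
    return (a @ x - b) * a

data = [(np.random.randn(5), np.random.randn()) for _ in range(100)]
x = np.zeros(5)
for t in range(200):
    x = sgd_step(x, data, batch_size=8, eta=0.05, grad_fn=grad_example)
```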
Parallelization due to computational cost → Distributed SGD:

$$x^{(t+1)} = x^{(t)} - \frac{\eta}{p} \sum_{j=1}^{p} \frac{1}{|\xi_j^{(t)}|} \, \nabla f\big(x^{(t)};\, \xi_j^{(t)}\big)$$
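A single-process simulation of this fully synchronous update (a real system would all-reduce the gradients across workers); it plugs into the same illustrative `grad_example`/`data` setup as the previous sketch, which is an assumption, not the paper's code.

```python
import numpy as np

def distributed_sgd_step(x, worker_data, batch_size, eta, grad_fn):
    # Each of the p workers computes a mini-batch gradient on its own data shard.
    worker_grads = []
    for shard in worker_data:
        idx = np.random.choice(len(shard), size=batch_size, replace=False)
        g_j = np.mean([grad_fn(x, shard[i]) for i in idx], axis=0)
        worker_grads.append(g_j)
    # Server step: x^(t+1) = x^(t) - (eta/p) * sum_j g_j
    return x - eta * np.mean(worker_grads, axis=0)
```

Every iteration exchanges gradients between all p workers and the server, which is exactly the communication cost the next slides address.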
Communication is the bottleneck.
Communication cost has two components:
- Number of bits per iteration → gradient compression based techniques (see the sketch below)
- Number of rounds → local SGD with periodic averaging
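The first bullet can be made concrete with top-k sparsification, one common compression scheme; this particular compressor is an illustrative assumption, the talk does not commit to a specific technique.

```python
import numpy as np

def topk_compress(g, k):
    # Keep only the k largest-magnitude coordinates of the gradient, so a
    # worker sends k (index, value) pairs instead of the full dense vector.
    idx = np.argpartition(np.abs(g), -k)[-k:]
    return idx, g[idx]

def topk_decompress(idx, values, dim):
    # Rebuild a dense (sparse-valued) gradient on the receiving side.
    g_hat = np.zeros(dim)
    g_hat[idx] = values
    return g_hat
```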
Local SGD with periodic averaging
$$x_j^{(t+1)} =
\begin{cases}
\dfrac{1}{p}\displaystyle\sum_{j=1}^{p}\Big[x_j^{(t)} - \eta\,\tilde{g}_j^{(t)}\Big] & \text{if } \tau \mid t \quad \text{(a) averaging step}\\[6pt]
x_j^{(t)} - \eta\,\tilde{g}_j^{(t)} & \text{otherwise} \quad \text{(b) local update}
\end{cases}$$
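A single-process simulation of this update rule: every worker takes local steps (b), and the models are averaged (a) whenever τ divides t. The data partitioning, loss, and hyperparameters are illustrative assumptions.

```python
import numpy as np

def local_sgd(worker_data, grad_fn, dim, T, tau, eta, batch_size):
    p = len(worker_data)
    models = [np.zeros(dim) for _ in range(p)]   # one model copy x_j per worker
    for t in range(1, T + 1):
        # (b) Local update on every worker: x_j <- x_j - eta * g~_j
        for j, shard in enumerate(worker_data):
            idx = np.random.choice(len(shard), size=batch_size, replace=False)
            g_j = np.mean([grad_fn(models[j], shard[i]) for i in idx], axis=0)
            models[j] = models[j] - eta * g_j
        # (a) Averaging step: communicate only when tau divides t
        if t % tau == 0:
            avg = np.mean(models, axis=0)
            models = [avg.copy() for _ in range(p)]
    return np.mean(models, axis=0)
```

Only T/τ of the T iterations trigger communication, which is the round count reported in Table 1 below.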
Illustration (p = 3 workers): with τ = 1, the averaging step (a) follows every local update; with τ = 3, each worker performs τ = 3 local updates (b) between consecutive averaging steps (a).
Convergence Analysis of Local SGD with periodic averaging

Table 1: Comparison of different SGD-based algorithms.

Strategy           | Convergence error                     | Assumptions        | Comm. rounds (T/τ)
SGD                | $O(1/\sqrt{pT})$                      | i.i.d. & b.g.      | $T$
[Yu et al.]        | $O(1/\sqrt{pT})$                      | i.i.d. & b.g.      | $O(p^{3/4} T^{1/4})$
[Wang & Joshi]     | $O(1/\sqrt{pT})$                      | i.i.d.             | $O(p^{3/2} T^{1/2})$
RI-SGD $(\tau, q)$ | $O(1/\sqrt{pT}) + O((1 - q/p)\beta)$  | non-i.i.d. & b.d.  | $O(p^{3/2} T^{1/2})$

b.g.: bounded gradient, $\|g_i\|_2^2 \leq G$.
Unbiased gradient estimation: $\mathbb{E}[\tilde{g}_j] = g_j$.
Insufficiency of the convergence analysis

- A. A residual error is observed in practice, but a theoretical understanding is missing → unbiased gradient estimation does not hold.
- B. How can we capture this in the convergence analysis? → Analysis based on biased gradients (our work).
- C. Is there any solution to improve it? → Redundancy (our work).
Redundancy-infused local SGD (RI-SGD)

The data is split into chunks: $D = D_1 \cup D_2 \cup D_3$.

Illustration (p = 3, τ = 3): under local SGD, each worker $W_j$ trains only on its own chunk $D_j$; under RI-SGD with q = 2, each worker holds q = 2 of the p = 3 chunks, introducing explicit redundancy.
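A sketch of one way this explicit redundancy could be laid out: split the data into p chunks and give worker j the q chunks $D_j, \dots, D_{j+q-1}$ (cyclically), so q = 1 recovers plain local SGD and q = p gives full replication. The cyclic placement is an assumption for illustration, not necessarily the paper's exact scheme.

```python
def ri_sgd_assignment(chunks, q):
    # chunks: list of p data chunks D_1, ..., D_p
    # worker j (0-indexed) receives chunks j, j+1, ..., j+q-1 modulo p
    p = len(chunks)
    assignment = []
    for j in range(p):
        local = []
        for r in range(q):
            local.extend(chunks[(j + r) % p])   # redundant copy of a neighbor's chunk
        assignment.append(local)
    return assignment

# Example matching the slide: p = 3 chunks, q = 2 chunks per worker
D = [["D1-samples"], ["D2-samples"], ["D3-samples"]]
print(ri_sgd_assignment(D, q=2))
# -> worker 1: D1 and D2, worker 2: D2 and D3, worker 3: D3 and D1
```

Each worker then runs the local SGD loop above on its enlarged shard; per Table 1, increasing q shrinks the residual term $O((1 - q/p)\beta)$.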
Comparing RI-SGD with other schemes (see Table 1)

- Assumption (b.d.): bounded inner product of gradients, $\langle g_i, g_j \rangle \leq \beta$.
- The analysis is based on biased gradients (unbiasedness is not assumed).
- Redundancy: $q$ is the number of data chunks held at each worker node.
- 1. The speed-up is not only due to a larger effective mini-batch size, but also due to increased intra-gradient diversity.
- 2. Fault tolerance.
- 3. Extension to heterogeneous mini-batch sizes and possible application to federated optimization.