Trading Redundancy for Communication: Speeding up Distributed SGD for Non-convex Optimization

Farzin Haddadpour
Joint work with Mohammad Mahdi Kamani, Mehrdad Mahdavi, and Viveck Cadambe
Goal: Solving

$$\min_{x} f(x) \triangleq \sum_{i} f_i(x)$$
SGD:

$$x^{(t+1)} = x^{(t)} - \eta \, \frac{1}{|\xi^{(t)}|} \, \nabla f\big(x^{(t)};\, \xi^{(t)}\big)$$
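A minimal NumPy sketch of this mini-batch update; the least-squares objective and the sampling routine below are illustrative assumptions, not part of the talk.

```python
import numpy as np

def sgd_step(x, data, batch_size, eta, grad_fn):
    # Sample a mini-batch xi^(t) and average the per-example gradients over it.
    idx = np.random.choice(len(data), size=batch_size, replace=False)
    g = np.mean([grad_fn(x, data[i]) for i in idx], axis=0)
    # x^(t+1) = x^(t) - eta * (1/|xi^(t)|) * (sum of gradients over the batch)
    return x - eta * g

# Illustrative objective: f_i(x) = 0.5 * (a_i^T x - b_i)^2
def grad_example(x, sample):
    a, b = sample
    return (a @ x - b) * a

data = [(np.random.randn(5), np.random.randn()) for _ in range(100)]
x = np.zeros(5)
for t in range(200):
    x = sgd_step(x, data, batch_size=8, eta=0.05, grad_fn=grad_example)
```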
Parallelization due to computational cost → Distributed SGD:

$$x^{(t+1)} = x^{(t)} - \frac{\eta}{p} \sum_{j=1}^{p} \frac{1}{|\xi_j^{(t)}|} \, \nabla f\big(x^{(t)};\, \xi_j^{(t)}\big)$$
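A single-process simulation of this fully synchronous update (a real system would all-reduce the gradients across workers); it plugs into the same illustrative `grad_example`/`data` setup as the previous sketch, which is an assumption, not the paper's code.

```python
import numpy as np

def distributed_sgd_step(x, worker_data, batch_size, eta, grad_fn):
    # Each of the p workers computes a mini-batch gradient on its own data shard.
    worker_grads = []
    for shard in worker_data:
        idx = np.random.choice(len(shard), size=batch_size, replace=False)
        g_j = np.mean([grad_fn(x, shard[i]) for i in idx], axis=0)
        worker_grads.append(g_j)
    # Server step: x^(t+1) = x^(t) - (eta/p) * sum_j g_j
    return x - eta * np.mean(worker_grads, axis=0)
```

Every iteration exchanges gradients between all p workers and the server, which is exactly the communication cost the next slides address.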
Communication is the bottleneck.
Communication cost has two components:
- Number of bits per iteration → gradient compression based techniques (see the sketch below)
- Number of rounds → local SGD with periodic averaging
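The first bullet can be made concrete with top-k sparsification, one common compression scheme; this particular compressor is an illustrative assumption, the talk does not commit to a specific technique.

```python
import numpy as np

def topk_compress(g, k):
    # Keep only the k largest-magnitude coordinates of the gradient, so a
    # worker sends k (index, value) pairs instead of the full dense vector.
    idx = np.argpartition(np.abs(g), -k)[-k:]
    return idx, g[idx]

def topk_decompress(idx, values, dim):
    # Rebuild a dense (sparse-valued) gradient on the receiving side.
    g_hat = np.zeros(dim)
    g_hat[idx] = values
    return g_hat
```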
Local SGD with periodic averaging
$$x_j^{(t+1)} =
\begin{cases}
\dfrac{1}{p}\displaystyle\sum_{j=1}^{p}\Big[x_j^{(t)} - \eta\,\tilde{g}_j^{(t)}\Big] & \text{if } \tau \mid t \quad \text{(a) averaging step}\\[6pt]
x_j^{(t)} - \eta\,\tilde{g}_j^{(t)} & \text{otherwise} \quad \text{(b) local update}
\end{cases}$$
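A single-process simulation of this update rule: every worker takes local steps (b), and the models are averaged (a) whenever τ divides t. The data partitioning, loss, and hyperparameters are illustrative assumptions.

```python
import numpy as np

def local_sgd(worker_data, grad_fn, dim, T, tau, eta, batch_size):
    p = len(worker_data)
    models = [np.zeros(dim) for _ in range(p)]   # one model copy x_j per worker
    for t in range(1, T + 1):
        # (b) Local update on every worker: x_j <- x_j - eta * g~_j
        for j, shard in enumerate(worker_data):
            idx = np.random.choice(len(shard), size=batch_size, replace=False)
            g_j = np.mean([grad_fn(models[j], shard[i]) for i in idx], axis=0)
            models[j] = models[j] - eta * g_j
        # (a) Averaging step: communicate only when tau divides t
        if t % tau == 0:
            avg = np.mean(models, axis=0)
            models = [avg.copy() for _ in range(p)]
    return np.mean(models, axis=0)
```

Only T/τ of the T iterations trigger communication, which is the round count reported in Table 1 below.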
Illustration (p = 3 workers): with τ = 1, the averaging step (a) follows every local update; with τ = 3, each worker performs τ = 3 local updates (b) between consecutive averaging steps (a).
Convergence Analysis of Local SGD with periodic averaging

Table 1: Comparison of different SGD-based algorithms.

Strategy           | Convergence error                     | Assumptions        | Comm. rounds (T/τ)
SGD                | $O(1/\sqrt{pT})$                      | i.i.d. & b.g.      | $T$
[Yu et al.]        | $O(1/\sqrt{pT})$                      | i.i.d. & b.g.      | $O(p^{3/4} T^{1/4})$
[Wang & Joshi]     | $O(1/\sqrt{pT})$                      | i.i.d.             | $O(p^{3/2} T^{1/2})$
RI-SGD $(\tau, q)$ | $O(1/\sqrt{pT}) + O((1 - q/p)\beta)$  | non-i.i.d. & b.d.  | $O(p^{3/2} T^{1/2})$

b.g.: bounded gradient, $\|g_i\|_2^2 \leq G$.
Unbiased gradient estimation: $\mathbb{E}[\tilde{g}_j] = g_j$.
Insufficiency of the convergence analysis

- A. A residual error is observed in practice, but a theoretical understanding is missing → unbiased gradient estimation does not hold.
- B. How can we capture this in the convergence analysis? → Analysis based on biased gradients (our work).
- C. Is there any solution to improve it? → Redundancy (our work).
Redundancy-infused local SGD (RI-SGD)

The data is split into chunks: $D = D_1 \cup D_2 \cup D_3$.

Illustration (p = 3, τ = 3): under local SGD, each worker $W_j$ trains only on its own chunk $D_j$; under RI-SGD with q = 2, each worker holds q = 2 of the p = 3 chunks, introducing explicit redundancy.
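A sketch of one way this explicit redundancy could be laid out: split the data into p chunks and give worker j the q chunks $D_j, \dots, D_{j+q-1}$ (cyclically), so q = 1 recovers plain local SGD and q = p gives full replication. The cyclic placement is an assumption for illustration, not necessarily the paper's exact scheme.

```python
def ri_sgd_assignment(chunks, q):
    # chunks: list of p data chunks D_1, ..., D_p
    # worker j (0-indexed) receives chunks j, j+1, ..., j+q-1 modulo p
    p = len(chunks)
    assignment = []
    for j in range(p):
        local = []
        for r in range(q):
            local.extend(chunks[(j + r) % p])   # redundant copy of a neighbor's chunk
        assignment.append(local)
    return assignment

# Example matching the slide: p = 3 chunks, q = 2 chunks per worker
D = [["D1-samples"], ["D2-samples"], ["D3-samples"]]
print(ri_sgd_assignment(D, q=2))
# -> worker 1: D1 and D2, worker 2: D2 and D3, worker 3: D3 and D1
```

Each worker then runs the local SGD loop above on its enlarged shard; per Table 1, increasing q shrinks the residual term $O((1 - q/p)\beta)$.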
Comparing RI-SGD with other schemes (see Table 1)

- Assumption (b.d.): bounded inner product of gradients, $\langle g_i, g_j \rangle \leq \beta$.
- The analysis is based on biased gradients (unbiasedness is not assumed).
- Redundancy: $q$ is the number of data chunks held at each worker node.
- 1. The speed-up is not only due to a larger effective mini-batch size, but also due to increased intra-gradient diversity.
- 2. Fault tolerance.
- 3. Extension to heterogeneous mini-batch sizes and possible application to federated optimization.