Distributed Learning over Unreliable Networks
Chen Yu, Hanlin Tang, Cedric Renggli, Simon Kassing, Ankit Singla, Dan Alistarh, Ce Zhang, Ji Liu
Presenter: Chen Yu
[Diagram: servers (Server 1, Server 2) exchanging gradients with Workers 1–3]
AllReduce SGD

    x_{t+1} = x_t - \frac{1}{n} \sum_{i=1}^{n} \nabla F(x_t; \xi_t^{(i)})
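As a concrete sketch (the function name and the explicit learning rate `lr` are assumptions; the slide folds the step size into the update), one AllReduce SGD step averages the workers' stochastic gradients and applies the same update on every worker:

```python
import numpy as np

def allreduce_sgd_step(xs, grads, lr=0.1):
    """One AllReduce SGD step: all n workers average their stochastic
    gradients, then every worker applies the identical update
    x_{t+1} = x_t - lr * (1/n) * sum_i grad_i."""
    avg_grad = np.mean(grads, axis=0)   # the AllReduce: sum over workers / n
    return [x - lr * avg_grad for x in xs]
```

Because every worker sees the same averaged gradient, all model replicas stay identical; this is exactly what breaks once some gradient packets never arrive.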
Unreliable Network

[Diagram: AllReduce between servers (Server 1, Server 2) and Workers 1–3 over an unreliable network]
[Figure: training loss vs. epoch for AllReduce SGD at 95% and 99% packet arrival rates vs. baseline]
Sharing Gradients Won’t Work
Local Update:

    v_t^{(i)} = x_t^{(i)} - \gamma g_t^{(i)}

Local Partition:

    v_t^{(i)} = \big( (v_t^{(i,1)})^\top, (v_t^{(i,2)})^\top, \cdots, (v_t^{(i,n)})^\top \big)^\top

Robust Averaging:

    \tilde{v}_t^{(j)} = \frac{1}{|\mathcal{O}_t^{(j)}|} \sum_{i \in \mathcal{O}_t^{(j)}} v_t^{(i,j)}

Model Update:

    x_{t+1}^{(i,j)} = \begin{cases} \tilde{v}_t^{(j)} & j \in \tilde{\mathcal{O}}_t^{(i)} \\ v_t^{(i,j)} & j \notin \tilde{\mathcal{O}}_t^{(i)} \end{cases}
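The steps above (local update, partition, robust averaging, model update) can be simulated end to end. This is a sketch under assumed names (`rps_step`, with `arrival` = 1 − p as the per-packet arrival probability), not the paper's implementation:

```python
import numpy as np

def rps_step(xs, grads, lr, arrival, rng):
    """One sketched step: each worker takes a local SGD step, partitions
    its model into n blocks, block j is averaged at worker j over the
    copies that actually arrive, and dropped blocks fall back to the
    sender's local copy."""
    n = len(xs)
    vs = [x - lr * g for x, g in zip(xs, grads)]        # local update
    blocks = [np.array_split(v, n) for v in vs]         # local partition
    # robust averaging: block j averaged over copies that reach worker j
    avg = [np.mean([blocks[i][j] for i in range(n)
                    if i == j or rng.random() < arrival], axis=0)
           for j in range(n)]
    # model update: use the averaged block j if it arrives back at
    # worker i, otherwise keep the local copy v_t^{(i,j)}
    return [np.concatenate([avg[j] if (i == j or rng.random() < arrival)
                            else blocks[i][j] for j in range(n)])
            for i in range(n)]
```

With `arrival = 1.0` this reduces exactly to AllReduce SGD; under drops, each worker's model mixes averaged and local blocks, which is what the convergence result has to control.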
Convergence Rate:

    \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\,\|\nabla f(x_t)\|^2 \lesssim \frac{(\sigma + \zeta)\,(1 + p(1-p))}{(1-p)\sqrt{nT}} + \frac{1}{T}
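Reading the p-dependent factor (1 + p(1 − p))/(1 − p) off the bound at face value, a quick numeric check (hypothetical helper name) shows how gently the leading term degrades at small drop rates:

```python
def degradation_factor(p):
    """Multiplier on the (sigma + zeta)/sqrt(nT) leading term as a
    function of the packet drop rate p, per the stated bound."""
    return (1 + p * (1 - p)) / (1 - p)

for p in (0.0, 0.01, 0.05, 0.5):
    print(f"p = {p:4.2f}: factor = {degradation_factor(p):.3f}")
```

For example, p = 0.01 inflates the constant by only about 2%, consistent with the experiments where training at 99% arrival tracks the baseline.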
Assumptions:

- p: packet drop rate; T: total iterations
- f is non-convex with an L-Lipschitz gradient
- Bounded data variance: \mathbb{E}_{\xi \sim \mathcal{D}_i} \|\nabla F(x; \xi) - \nabla f_i(x)\|^2 \le \sigma^2, \ \forall i, \forall x
- Bounded dataset difference: \frac{1}{n} \sum_{i=1}^{n} \|\nabla f_i(x) - \nabla f(x)\|^2 \le \zeta^2, \ \forall x
Experiments: 16 NVIDIA TITAN Xp GPUs, ResNet-110 on CIFAR-10

RPS is Robust; Standard SGD is Vulnerable
[Figure: training loss vs. epoch for RPS at 60%, 80%, 90%, 95%, and 99% arrival rates vs. baseline]
[Figure: training loss vs. epoch for standard SGD at 95% and 99% arrival rates vs. baseline]
Welcome to Pacific Ballroom, Poster #97, for more details.