SLIDE 1

Distributed Learning over Unreliable Networks

Chen Yu, Hanlin Tang, Cedric Renggli, Simon Kassing, Ankit Singla, Dan Alistarh, Ce Zhang, Ji Liu

Presenter: Chen Yu

SLIDE 2

[Diagram: Server 1, Server 2, Server 3 connected to Worker 1, Worker 2, Worker 3.]

AllReduce SGD

$$x_{t+1} = x_t - \frac{\gamma}{n} \sum_{i=1}^{n} \nabla F\big(x_t;\, \xi_t^{(i)}\big)$$

AllReduce
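A minimal sketch of this step, assuming PyTorch with `torch.distributed` already initialized; `model`, `loss_fn`, and `batch` are hypothetical placeholders, not names from the talk:

```python
import torch
import torch.distributed as dist

def allreduce_sgd_step(model, loss_fn, batch, lr):
    """One synchronous data-parallel SGD step."""
    inputs, targets = batch
    model.zero_grad()
    loss_fn(model(inputs), targets).backward()
    world_size = dist.get_world_size()
    with torch.no_grad():
        for param in model.parameters():
            # AllReduce sums each gradient across the n workers ...
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            # ... and dividing by n yields the averaged gradient.
            param.grad /= world_size
            # x_{t+1} = x_t - (gamma / n) * sum_i grad_i
            param -= lr * param.grad
```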

SLIDE 3

Unreliable Network

[Diagram: Server 1, Server 2, Server 3 and Worker 1, Worker 2, Worker 3 communicating over an unreliable network.]

[Figure: training loss vs. epoch when sharing gradients over a lossy network; curves for 95% arrival, 99% arrival, and the lossless baseline.]

Sharing Gradients Won’t Work
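A toy simulation of why this fails, a sketch under assumed settings (a quadratic objective, three workers, independent packet drops; none of this is the paper's exact experiment): when gradient packets drop, each worker averages a different subset of gradients, so replicas that should stay identical drift apart.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim, steps, lr, arrival = 3, 4, 100, 0.1, 0.95

# Every worker holds a replica; with lossless AllReduce they stay equal.
models = [np.zeros(dim) for _ in range(n)]

for t in range(steps):
    # Noisy gradients of the toy objective f(x) = ||x - 1||^2 / 2.
    grads = [m - 1.0 + 0.1 * rng.standard_normal(dim) for m in models]
    for i in range(n):
        # Worker i averages only the gradients whose packets arrived.
        received = [grads[j] for j in range(n)
                    if j == i or rng.random() < arrival]
        models[i] -= lr * np.mean(received, axis=0)

# Drift between replicas that gradient sharing assumes are identical.
print(max(np.linalg.norm(m - models[0]) for m in models))
```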

SLIDE 4

Reliable Parameter Server (RPS)

High-level idea: share models instead of gradients.

Local Update:

$$v_t^{(i)} = x_t^{(i)} - \gamma g_t^{(i)}$$

Local Partition:

$$v_t^{(i)} = \left( \big(v_t^{(i,1)}\big)^\top, \big(v_t^{(i,2)}\big)^\top, \cdots, \big(v_t^{(i,n)}\big)^\top \right)^\top$$

Robust Averaging:

$$\tilde{v}_t^{(i)} = \frac{1}{\big|\mathcal{O}_t^{(i)}\big|} \sum_{j \in \mathcal{O}_t^{(i)}} v_t^{(j,i)}$$

Model Update:

$$x_{t+1}^{(i,j)} = \begin{cases} \tilde{v}_t^{(j)} & j \in \tilde{\mathcal{O}}_t^{(i)} \\ v_t^{(i,j)} & j \notin \tilde{\mathcal{O}}_t^{(i)} \end{cases}$$
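A minimal simulation sketch of one RPS step following the four equations above. The `(n, d)` array layout, equal-sized blocks, and independent per-packet drops with probability `p` are illustrative assumptions, not the paper's exact protocol:

```python
import numpy as np

def rps_step(models, grads, lr, p, rng):
    """One RPS iteration; `models` and `grads` are (n, d) arrays whose
    d coordinates are split into n blocks, block i served by worker i."""
    n, d = models.shape
    blocks = np.array_split(np.arange(d), n)

    # Local update: v_t^(i) = x_t^(i) - lr * g_t^(i).
    v = models - lr * grads
    x_next = v.copy()

    for i in range(n):
        # O_t^(i): workers whose block-i packet reached server i.
        arrived = [j for j in range(n) if j == i or rng.random() > p]
        # Robust averaging over the copies that actually arrived.
        avg_block = v[np.ix_(arrived, blocks[i])].mean(axis=0)
        for j in range(n):
            # Broadcast back with an independent drop per worker; a worker
            # that misses the averaged block keeps its local copy v_t^(j,i).
            if j == i or rng.random() > p:
                x_next[j, blocks[i]] = avg_block
    return x_next
```

Because a worker that misses a packet falls back on its own local block instead of stalling, every iteration completes no matter how many packets are lost.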

SLIDE 5

Convergence Rate

$$\frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\,\big\|\nabla f(x_t)\big\|^2 \lesssim \frac{(\sigma + \zeta)\big(1 + p(1-p)\big)}{(1-p)\sqrt{nT}} + \frac{1}{T}$$
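Read as reconstructed above, packet loss only inflates the leading $1/\sqrt{nT}$ term by the factor $(1 + p(1-p))/(1-p)$, which stays close to 1 for realistic drop rates; a quick check (plain Python, drop rates chosen for illustration):

```python
# Slowdown factor multiplying the 1/sqrt(nT) term of the bound.
for p in (0.01, 0.05, 0.10, 0.20, 0.40):
    factor = (1 + p * (1 - p)) / (1 - p)
    print(f"p = {p:.2f}  ->  factor = {factor:.3f}")
# p = 0.01 gives ~1.02; even p = 0.20 only gives 1.45.
```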

Assumptions:

- p: packet drop rate; T: total number of iterations.
- f is non-convex with an L-Lipschitz gradient.
- Bounded data variance: $\mathbb{E}_{\xi \sim \mathcal{D}_i} \|\nabla F(x; \xi) - \nabla f_i(x)\|^2 \le \sigma^2, \ \forall i, \forall x$.
- Bounded dataset difference: $\frac{1}{n} \sum_{i=1}^{n} \|\nabla f_i(x) - \nabla f(x)\|^2 \le \zeta^2, \ \forall x$.

SLIDE 6

Experiments

Setup: 16 NVIDIA TITAN Xp GPUs, ResNet-110 on CIFAR-10.

[Figure: RPS is robust. Training loss vs. epoch under 60%, 80%, 90%, 95%, and 99% packet arrival rates and the lossless baseline.]

[Figure: Standard SGD is vulnerable. Training loss vs. epoch under 95% and 99% packet arrival rates and the lossless baseline.]

SLIDE 7

Thanks

Come to Pacific Ballroom #97 to see the poster for more details.