DoubleSqueeze: Parallel Stochastic Gradient Descent with Double-pass Error-Compensated Compression


SLIDE 1

DoubleSqueeze:

Parallel Stochastic Gradient Descent with Double-pass Error-Compensated Compression

Hanlin Tang, Xiangru Lian, Chen Yu, Tong Zhang, Ji Liu

Presenter: Xiangru Lian

SLIDE 2

[Figure: a parameter server receives gradients g^(1), g^(2), g^(3) from Workers 1–3 and broadcasts the averaged gradient ḡ back to them.]

Compressed SGD (existing algorithms):

  x_{t+1} = x_t − (γ/n) ∑_{i=1}^n C_ω[g^(i)]

Compression operator C_ω: 1-bit quantization, clipping, top-k sparsification.
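The three compression operators listed above can be sketched as follows; this is a minimal NumPy illustration, and the function names and the scaling choice in the 1-bit quantizer are assumptions, not taken from the paper:

```python
import numpy as np

def one_bit_quantize(g):
    """1-bit quantization: transmit only signs, scaled by the mean magnitude."""
    return np.sign(g) * np.mean(np.abs(g))

def clip(g, c):
    """Clipping: bound every coordinate to [-c, c]."""
    return np.clip(g, -c, c)

def top_k_sparsify(g, k):
    """Top-k sparsification: keep the k largest-magnitude entries, zero the rest."""
    out = np.zeros_like(g)
    idx = np.argpartition(np.abs(g), -k)[-k:]
    out[idx] = g[idx]
    return out

g = np.array([1.2, -0.3, 0.05, -2.0])
print(one_bit_quantize(g))   # signs of g scaled by mean |g|
print(top_k_sparsify(g, 2))  # only the two largest-magnitude entries survive
```

Each operator trades gradient fidelity for communication volume: signs need 1 bit per coordinate, and top-k sends only k index/value pairs.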

SLIDE 3

Compressed SGD introduces error:

  1.2 → 1; error = −0.2

We can do better by compensating this error at the next step: instead of compressing Next_Grad, compress Next_Grad − error.
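The compensation step above can be sketched with a toy rounding compressor; the `compress` function and the gradient stream are illustrative choices, not from the paper (the residual is stored with the opposite sign of the slide's `error`, so adding it matches `Next_Grad − error`):

```python
def compress(x):
    """Toy compressor matching the slide's example: round to the nearest integer (1.2 -> 1)."""
    return float(round(x))

grads = [1.2] * 5   # an illustrative stream of gradients
error = 0.0         # stored compression residual
sent = []
for g in grads:
    v = compress(g + error)   # compress the compensated gradient
    error = g + error - v     # carry what was lost into the next step
    sent.append(v)

# Without compensation every step sends 1 and loses 0.2 forever; with
# compensation the residual accumulates until an extra unit is sent,
# so the total transmitted stays close to the true total:
print(sum(sent), sum(grads))
```

This is the key intuition: the compression error is not discarded but fed back, so it cancels out on average over time.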

SLIDE 4

DoubleSqueeze

High level: compensate the compression error on both the workers and the server.

On worker i:

  g^(i) ← ∇F(x; ξ^(i)),  v^(i) ← C_ω[g^(i) + δ^(i)],  δ^(i) ← g^(i) + δ^(i) − v^(i)

On the server:

  ḡ ← (1/n) ∑_{i=1}^n v^(i),  v̄ ← C_ω[ḡ + δ̄],  δ̄ ← ḡ + δ̄ − v̄

On all workers (model update):

  x ← x − γ v̄
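One iteration of these update rules can be sketched as below; this is a simulation sketch, assuming top-k sparsification as a stand-in for C_ω, with illustrative function names, learning rate, and test problem:

```python
import numpy as np

def topk(x, k=2):
    """Stand-in compressor C_omega: top-k sparsification."""
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -k)[-k:]
    out[idx] = x[idx]
    return out

def doublesqueeze_step(x, grads, worker_err, server_err, lr=0.1, C=topk):
    """One iteration with error compensation on the workers AND the server."""
    v = []
    for i, g in enumerate(grads):                # on worker i
        v_i = C(g + worker_err[i])               # v^(i) = C[g^(i) + delta^(i)]
        worker_err[i] = g + worker_err[i] - v_i  # delta^(i) update
        v.append(v_i)
    g_bar = np.mean(v, axis=0)                   # server averages compressed gradients
    v_bar = C(g_bar + server_err)                # second, server-side compression pass
    server_err[:] = g_bar + server_err - v_bar   # delta_bar update
    return x - lr * v_bar                        # x <- x - gamma * v_bar

# Illustration: minimize f(x) = ||x||^2 / 2 with 2 workers; the gradient is x itself.
x = np.array([1.0, -2.0, 0.5, 3.0])
worker_err = [np.zeros(4) for _ in range(2)]     # one residual per worker
server_err = np.zeros(4)
for _ in range(200):
    x = doublesqueeze_step(x, [x.copy(), x.copy()], worker_err, server_err)
print(np.linalg.norm(x))
```

Even though each pass transmits only 2 of the 4 coordinates, the two layers of error feedback keep every coordinate's residual and eventually apply it, so the iterate is still driven toward the minimizer.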

SLIDE 5

Convergence Rate

DoubleSqueeze:

  𝔼‖∇f(x_T)‖² ≲ 1/T + σ/√(nT) + ζ^(2/3)/T^(2/3)

Compressed SGD:

  𝔼‖∇f(x_T)‖² ≲ 1/T + σ/√(nT) + ζ/√T

Assumptions:

- f is non-convex with an L-Lipschitz gradient;
- 𝔼_{ξ∼𝒟_i}‖∇F(x; ξ) − ∇f_i(x)‖² ≤ σ², ∀i, ∀x (bounded gradient variance);
- ‖C_ω[x] − x‖² ≤ ζ² (bounded compression error);
- T: total number of iterations.

SLIDE 6

Experiments

ResNet-18 on CIFAR-10. 8 Nvidia 1080Ti GPUs. 1 GPU per worker.

1-bit quantization:

[Figure: training loss vs. epoch (convergence) and per-epoch time vs. bandwidth (1/MB), comparing Vanilla SGD, DoubleSqueeze, MEM-SGD, and QSGD.]

Top-k sparsification:

[Figure: training loss vs. epoch (convergence) and per-epoch time vs. bandwidth (1/MB), comparing Vanilla SGD, DoubleSqueeze, MEM-SGD, and Top-k SGD.]

SLIDE 7

Thanks

Welcome to Pacific Ballroom #99 to see the poster for more details.