DoubleSqueeze:
Parallel Stochastic Gradient Descent with Double-pass Error-Compensated Compression
Hanlin Tang, Xiangru Lian, Chen Yu, Tong Zhang, Ji Liu
Presenter: Xiangru Lian
Compressed SGD (existing algorithms)
[Diagram: a central parameter server exchanging gradients with Workers 1, 2, and 3]
$$x_{t+1} = x_t - \frac{\gamma}{n} \sum_{i=1}^{n} C_\omega\left[g^{(i)}\right]$$
Compression operator $C_\omega$: 1-bit quantization, clipping, top-k sparsification.
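As a concrete illustration (not the authors' implementation), here is a minimal NumPy sketch of two common choices of $C_\omega$, 1-bit quantization and top-k sparsification, together with the compressed-SGD update above. The function names and the mean-magnitude scaling in the 1-bit quantizer are illustrative assumptions, and gradients are treated as flat 1-D vectors.

```python
import numpy as np

def onebit_quantize(x):
    """1-bit quantization: keep only the sign of each entry, rescaled by the
    mean absolute value so the compressed vector has comparable magnitude."""
    return np.mean(np.abs(x)) * np.sign(x)

def topk_sparsify(x, k):
    """Top-k sparsification: keep the k largest-magnitude entries, zero the rest."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

def compressed_sgd_step(x, worker_grads, lr, compress):
    """Plain compressed SGD: x_{t+1} = x_t - (gamma/n) * sum_i C_w[g^(i)]."""
    return x - lr * np.mean([compress(g) for g in worker_grads], axis=0)
```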
Compressed SGD introduces compression error. We can do better by compensating for this error:
[Diagram: the error from compressing the current gradient is carried over and compensated when the next gradient is compressed]
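The single-sided version of this idea (error feedback on one worker) fits in a few lines. A minimal sketch, assuming `grad_fn` returns a stochastic gradient $\nabla F(x; \xi)$ and `compress` is any compression operator $C_\omega$; both names are illustrative:

```python
import numpy as np

def error_compensated_sgd(x, lr, num_steps, grad_fn, compress):
    """Error feedback on a single worker: whatever the compressor discards is
    stored in `delta` and added back to the next gradient before compressing."""
    delta = np.zeros_like(x)
    for _ in range(num_steps):
        g = grad_fn(x)              # stochastic gradient at the current model
        v = compress(g + delta)     # compress the error-compensated gradient
        delta = (g + delta) - v     # residual: what compression threw away
        x = x - lr * v              # update with the compressed vector
    return x
```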
High Level: Compensating Error for Both Server and Workers
Worker $i$:
$$g^{(i)} \leftarrow \nabla F(x; \xi^{(i)}), \quad v^{(i)} \leftarrow C_\omega\left[g^{(i)} + \delta^{(i)}\right], \quad \delta^{(i)} \leftarrow g^{(i)} + \delta^{(i)} - v^{(i)}$$
Server:
$$\bar{g} \leftarrow \frac{1}{n} \sum_{i=1}^{n} v^{(i)}, \quad \bar{v} \leftarrow C_\omega\left[\bar{g} + \bar{\delta}\right], \quad \bar{\delta} \leftarrow \bar{g} + \bar{\delta} - \bar{v}$$
On all workers (model update):
$$x \leftarrow x - \gamma \bar{v}$$
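Putting the two sides together, here is a minimal single-process NumPy sketch of one DoubleSqueeze iteration matching the update rules above. `worker_grad_fns` (one stochastic-gradient callable per worker) and `compress` are assumed inputs; a real implementation would run the worker loop in parallel and communicate $v^{(i)}$ and $\bar{v}$ over the network.

```python
import numpy as np

def doublesqueeze_step(x, lr, worker_grad_fns, worker_deltas, server_delta, compress):
    """One DoubleSqueeze iteration: error-compensated compression on every
    worker (uplink) and on the server (downlink), then a model update."""
    # Workers: compress (gradient + local error), keep the new residual.
    vs = []
    for i, grad_fn in enumerate(worker_grad_fns):
        g = grad_fn(x)                                  # g^(i) = grad F(x; xi^(i))
        v = compress(g + worker_deltas[i])              # v^(i) = C_w[g^(i) + delta^(i)]
        worker_deltas[i] = g + worker_deltas[i] - v     # delta^(i) update
        vs.append(v)

    # Server: average the received messages, then compress with its own error buffer.
    g_bar = np.mean(vs, axis=0)                         # g_bar = (1/n) sum_i v^(i)
    v_bar = compress(g_bar + server_delta)              # v_bar = C_w[g_bar + delta_bar]
    server_delta = g_bar + server_delta - v_bar         # delta_bar update

    # All workers apply the same compressed update.
    x = x - lr * v_bar
    return x, worker_deltas, server_delta
```

Compensating on the server as well means the compression error of the broadcast (downlink) message is corrected over time, not only the error in the workers' uplink messages.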
Assumptions:
Non-convex objective $f(x)$ with $L$-Lipschitz gradient;
$\mathbb{E}_{\xi \sim \mathcal{D}_i} \left\| \nabla F(x; \xi) - \nabla f_i(x) \right\|^2 \le \sigma^2, \ \forall i, \forall x$ (bounded gradient variance);
$\left\| C_\omega[x] - x \right\|^2 \le \zeta^2$ (bounded compression error).
T: Total Iterations
[Convergence-rate bound for DoubleSqueeze] vs. [convergence-rate bound for compressed SGD]
ResNet-18 on CIFAR-10. 8 Nvidia 1080Ti GPUs. 1 GPU per worker.
1-bit Quantization:
[Plot: training loss vs. epoch for vanilla SGD, DoubleSqueeze, MEM-SGD, and QSGD]
Top-k Sparsification:
[Plot: training loss vs. epoch for vanilla SGD, DoubleSqueeze, MEM-SGD, and Top-k SGD]
[Plots: per-epoch wall-clock time (seconds) vs. bandwidth (1/MB), comparing vanilla SGD, DoubleSqueeze, MEM-SGD, Top-k SGD, and QSGD; panels are arranged in two columns, convergence rate and per-epoch time]
Welcome to Pacific Ballroom, poster #99, for more details.