signSGD: Compressed optimisation for non-convex problems

Jeremy Bernstein (Caltech), Yu-Xiang Wang (UCSB/Amazon), Kamyar Azizzadenesheli (UCI), Anima Anandkumar (Caltech/Amazon)
➤ Snap gradient components to ±1
➤ Reduces communication time
➤ Realistic for deep learning
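In code, the snapping is just a sign: a minimal NumPy sketch (the toy quadratic loss and the learning rate are illustrative, not from the talk):

```python
import numpy as np

def signsgd_step(w, grad, lr):
    """One signSGD step: move every coordinate by a fixed amount lr,
    in the direction given by the sign of its (stochastic) gradient."""
    return w - lr * np.sign(grad)

# toy quadratic f(w) = 0.5 * ||w||^2, whose gradient is w itself
w = np.array([3.0, -2.0, 0.5])
for _ in range(100):
    w = signsgd_step(w, grad=w, lr=0.01)
```

Because only the sign of each component is kept, a worker needs to communicate just one bit per parameter.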
OUTLINE
➤ Why care about signSGD?
➤ Theoretical convergence results
➤ Empirical characterisation of the neural net landscape
➤ Imagenet results
GRADIENT COMPRESSION… WHY CARE?
[Diagram: a parameter server coordinating worker GPUs (GPU 1, GPU 2, …), each holding a shard of the data]
DISTRIBUTED SGD
[Diagram: each GPU holds 1/2 of the data and sends its stochastic gradient g to the parameter server, which returns the sum ∑g]
SIGNSGD WITH MAJORITY VOTE
[Diagram: each GPU holds 1/2 of the data and sends sign(g) to the parameter server, which returns sign[∑ sign(g)], the coordinate-wise majority vote]
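The two-way protocol in the diagram can be sketched in a few lines (variable names are hypothetical; the server logic is just the sign of the summed signs):

```python
import numpy as np

def majority_vote_step(w, worker_grads, lr):
    """signSGD with majority vote: each worker sends sign(g) to the
    parameter server (1 bit per coordinate); the server broadcasts
    sign(sum of signs), the coordinate-wise majority decision."""
    votes = sum(np.sign(g) for g in worker_grads)  # workers -> server
    return w - lr * np.sign(votes)                 # server -> workers

w = np.zeros(3)
worker_grads = [np.array([1.0, -2.0, 0.3]),
                np.array([0.5, 1.0, -0.2]),
                np.array([2.0, -0.7, 0.4])]
w = majority_vote_step(w, worker_grads, lr=0.1)
# majority signs are (+1, -1, +1), so w moves to (-0.1, 0.1, -0.1)
```

Communication is 1 bit per coordinate in both directions, in contrast to the 32-bit floats of distributed SGD.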
COMPRESSION SAVINGS OF MAJORITY VOTE
[Bar chart: # bits per component per iteration — 32 for SGD (float32 gradients) vs 1 for majority vote]
SIGNSGD IS A SPECIAL CASE OF ADAM
➤ With Adam's hyperparameters set to β₁ = β₂ = ε = 0, the Adam update g/√(g²) is exactly sign(g), i.e. signSGD
➤ Signum = signSGD with momentum (take the sign of the momentum)
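The reduction can be checked numerically: a minimal Adam sketch (bias correction omitted for clarity; the degenerate hyperparameter values below demonstrate the collapse to sign(g), they are not recommended settings):

```python
import numpy as np

def adam_direction(g, m, v, beta1, beta2, eps):
    """One Adam step direction: first-moment estimate divided by the
    square root of the second-moment estimate (bias correction omitted)."""
    m = beta1 * m + (1 - beta1) * g       # exponential average of gradients
    v = beta2 * v + (1 - beta2) * g**2    # exponential average of squared gradients
    return m / (np.sqrt(v) + eps)

g = np.array([0.3, -1.7, 0.02])
d = adam_direction(g, m=np.zeros(3), v=np.zeros(3),
                   beta1=0.0, beta2=0.0, eps=0.0)
# with beta1 = beta2 = eps = 0, d equals sign(g)
```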
ADAM… WHY CARE?
[Bar chart: # of Google Scholar citations — SGD (Robbins & Monro), Adam (Kingma & Ba), the Turing test (Turing)]
UNIFYING ADAPTIVE GRADIENT METHODS + COMPRESSION
➤ Sign descent: weak theoretical foundation, yet incredibly popular (e.g. Adam)
➤ Compressed descent: weak theoretical foundation; practitioners take pains to correct bias; empirically successful
➤ Sign-based gradient compression? Need to build a theory
DOES SIGNSGD EVEN CONVERGE?

What might we fear?
➤ Might not converge at all
➤ Might have horrible dimension dependence
➤ Majority vote may give no speedup from adding extra machines

Compression can be a free lunch. Our results:
➤ It does converge
➤ We characterise functions where signSGD & majority vote are as nice as SGD
➤ We suggest these functions are typical in deep learning
SINGLE WORKER RESULTS

Define:
➤ $f_*$: objective function lower bound
➤ $\vec{\sigma}$: coordinate-wise variance bound
➤ $\vec{L}$: coordinate-wise gradient Lipschitz constants
➤ $K$: number of iterations
➤ $N$: number of backpropagations

Under these assumptions, SGD gets rate

$$\mathbb{E}\left[\frac{1}{K}\sum_{k=0}^{K-1}\|g_k\|_2^2\right] \le \frac{1}{\sqrt{N}}\left[2\|\vec{L}\|_\infty(f_0 - f_*) + \|\vec{\sigma}\|_2^2\right]$$

and signSGD gets rate

$$\mathbb{E}\left[\frac{1}{K}\sum_{k=0}^{K-1}\|g_k\|_1\right]^2 \le \frac{1}{\sqrt{N}}\left[\|\vec{L}\|_1\left(f_0 - f_* + \tfrac{1}{2}\right) + 2\|\vec{\sigma}\|_1\right]^2$$
Comparing the two rates in $d$ dimensions: $\|\vec{L}\|_1 \le d\,\|\vec{L}\|_\infty$, $\|\vec{\sigma}\|_1 \le \sqrt{d}\,\|\vec{\sigma}\|_2$ and $\|\vec{g}_k\|_1 \le \sqrt{d}\,\|\vec{g}_k\|_2$, so the signSGD rate matches the SGD rate up to at most these dimension-dependent factors; how much of them signSGD actually pays depends on how dense the gradients, noise and curvature are.
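The dimension factors above are attained exactly by fully dense vectors: for $v = (c, c, \dots, c) \in \mathbb{R}^d$ with $c > 0$,

```latex
\|v\|_1 = dc, \qquad \|v\|_2 = \sqrt{d}\,c, \qquad \|v\|_\infty = c,
\qquad\text{hence}\qquad \|v\|_1 = d\,\|v\|_\infty = \sqrt{d}\,\|v\|_2 .
```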
MULTI WORKER RESULTS

Same assumptions as before ($f_*$: objective function lower bound; $\vec{\sigma}$: coordinate-wise variance bound; $\vec{L}$: coordinate-wise gradient Lipschitz constants). With $M$ workers, SGD gets rate

$$\mathbb{E}\left[\frac{1}{K}\sum_{k=0}^{K-1}\|g_k\|_2^2\right] \le \frac{1}{\sqrt{N}}\left[2\|\vec{L}\|_\infty(f_0 - f_*) + \frac{\|\vec{\sigma}\|_2^2}{M}\right]$$

and majority vote, if the gradient noise is unimodal symmetric, gets

$$\mathbb{E}\left[\frac{1}{K}\sum_{k=0}^{K-1}\|g_k\|_1\right]^2 \le \frac{1}{\sqrt{N}}\left[\|\vec{L}\|_1\left(f_0 - f_* + \tfrac{1}{2}\right) + \frac{2\|\vec{\sigma}\|_1}{\sqrt{M}}\right]^2$$

Since the majority vote bound is squared, both methods enjoy the same linear variance reduction in $M$.
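The unimodal-symmetric noise condition is what makes extra workers help: each worker's sign is right more often than wrong, so the majority is wrong ever more rarely. A quick Monte Carlo sketch (the Gaussian noise model and true gradient value 0.2 are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def wrong_sign_rate(M, trials=20_000):
    """Monte Carlo estimate of how often the coordinate-wise majority
    vote over M workers disagrees with the true gradient sign (+1),
    when each worker sees the true gradient 0.2 plus Gaussian noise."""
    g = 0.2 + rng.normal(size=(trials, M))   # per-worker stochastic gradients
    votes = np.sign(np.sign(g).sum(axis=1))  # majority decision per trial
    return float(np.mean(votes != 1.0))
```

With $M$ odd there are no ties, and the error rate shrinks as workers are added, which is the speedup the theorem quantifies.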
CHARACTERISING THE DEEP LEARNING LANDSCAPE EMPIRICALLY
➤ signSGD cares about gradient density. A natural measure of density is $\phi(v) = \|v\|_1^2 / (d\,\|v\|_2^2)$: it equals 1 for a fully dense $v$ and is ≈ 0 for a fully sparse $v$.
➤ Majority vote cares about noise symmetry. For large enough mini-batch size, unimodal symmetric gradient noise is reasonable by the Central Limit Theorem.
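The density measure is easy to check numerically (a minimal sketch assuming the definition above; the vector sizes are illustrative):

```python
import numpy as np

def density(v):
    """phi(v) = ||v||_1^2 / (d * ||v||_2^2): equals 1 for a fully
    dense vector and 1/d (about 0) for a 1-sparse vector."""
    d = v.size
    return np.sum(np.abs(v))**2 / (d * np.sum(v**2))

dense = np.ones(1000)   # every component has the same magnitude
sparse = np.zeros(1000)
sparse[0] = 5.0         # a single non-zero component
```

Here `density(dense)` is exactly 1, while `density(sparse)` is 1/1000.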
SIGNUM IS COMPETITIVE ON IMAGENET
➤ Performance very similar to Adam
➤ May want to switch to SGD towards the end of training?
DOES MAJORITY VOTE WORK?
[Experiment by Jiawei Zhao (NUAA): Cifar-10, Resnet-18]