SLIDE 1

signSGD: Compressed optimisation for non-convex problems

Jeremy Bernstein (Caltech) · Yu-Xiang Wang (UCSB/Amazon) · Kamyar Azizzadenesheli (UCI) · Anima Anandkumar (Caltech/Amazon)

SLIDE 2

signSGD: Compressed optimisation for non-convex problems

➤ Snap gradient components to ±1
➤ Reduces communication time
➤ Realistic for deep learning
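To make the idea concrete, here is a minimal sketch of the update in NumPy; the toy quadratic objective, learning rate and noise scale are illustrative choices of mine, not from the talk.

```python
import numpy as np

def signsgd_step(x, g, lr=1e-2):
    """One signSGD step: descend along the sign of the stochastic gradient.

    Each component is snapped to +1 or -1, so a worker needs only one
    bit per component to communicate it.
    """
    return x - lr * np.sign(g)

# Toy usage: minimise f(x) = ||x||^2 / 2 from noisy gradients g = x + noise.
rng = np.random.default_rng(0)
x = rng.standard_normal(10)
for _ in range(500):
    g = x + 0.1 * rng.standard_normal(10)  # stochastic gradient of f
    x = signsgd_step(x, g)
print(np.linalg.norm(x))  # driven close to the minimum at 0, up to the noise floor
```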

SLIDE 3

STRUCTURE

➤ Why care about signSGD?
➤ Theoretical convergence results
➤ Empirical characterisation of the neural net landscape
➤ ImageNet results

SLIDE 4

GRADIENT COMPRESSION… WHY CARE?

[Diagram: a parameter server connected to GPU 1 and GPU 2, each holding 1/2 of the data.]

SLIDE 5

GRADIENT COMPRESSION… WHY CARE?

[Diagram: a parameter server connected to many GPUs (GPU 1 through GPU 7), with the links labelled "COMPRESS?".]

SLIDE 6

DISTRIBUTED SGD

[Diagram: GPU 1 and GPU 2, each with 1/2 of the data, send their stochastic gradients g to the parameter server; the server aggregates them as Σg and sends the sum back.]

SLIDE 7

SIGNSGD WITH MAJORITY VOTE

[Diagram: GPU 1 and GPU 2, each with 1/2 of the data, send sign(g) to the parameter server; the server broadcasts back the elementwise majority decision sign(Σ sign(g)).]
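A sketch of the aggregation rule in NumPy (function names are mine): each worker transmits one bit per component up to the server, and the server broadcasts one bit per component back down.

```python
import numpy as np

def majority_vote_step(x, worker_grads, lr=1e-2):
    """One step of signSGD with majority vote.

    Workers send sign(g) (1 bit per component) to the parameter server.
    The server takes the elementwise majority decision sign(sum of signs)
    and broadcasts it back, again 1 bit per component.
    """
    votes = sum(np.sign(g) for g in worker_grads)  # server tallies sign votes
    return x - lr * np.sign(votes)                 # workers apply the majority sign

# Toy usage: 3 workers, each with an independent noisy gradient of the same f.
rng = np.random.default_rng(1)
x = rng.standard_normal(10)
grads = [x + 0.1 * rng.standard_normal(10) for _ in range(3)]
x = majority_vote_step(x, grads)
```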

SLIDE 8

COMPRESSION SAVINGS OF MAJORITY VOTE

[Bar chart: # bits per component per iteration. SGD sends a 32-bit float per component; majority vote sends 1 bit per component in each direction.]

SLIDE 9

SIGNSGD IS A SPECIAL CASE OF ADAM

➤ signSGD is Adam with β₁ = β₂ = ε = 0: the Adam update m/(√v + ε) collapses to g/|g| = sign(g)
➤ Signum (sign momentum): take the sign of the momentum instead of the raw gradient
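A sketch of the relationship in NumPy (names and the usage values are mine): zeroing Adam's β₁, β₂ and ε gives the update g/|g| = sign(g), i.e. signSGD, while Signum takes the sign of a momentum buffer instead of the raw gradient.

```python
import numpy as np

def signum_step(x, m, g, lr=1e-3, beta=0.9):
    """One Signum step: signSGD applied to the momentum of the gradient.

    beta = 0 recovers plain signSGD. Compare Adam, whose update is
    m / (sqrt(v) + eps): with beta1 = beta2 = eps = 0 it becomes
    g / |g| = sign(g), so signSGD sits inside the Adam family.
    """
    m = beta * m + (1 - beta) * g  # exponential moving average of gradients
    return x - lr * np.sign(m), m

# Usage: carry the momentum buffer m across iterations.
x, m = np.ones(5), np.zeros(5)
g = np.array([0.3, -0.2, 0.1, 1.5, -0.7])  # stand-in stochastic gradient
x, m = signum_step(x, m, g)
```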

SLIDE 10

ADAM… WHY CARE?

[Bar chart: # Google Scholar citations for SGD (Robbins & Monro), Adam (Kingma & Ba) and the Turing test (Turing).]

SLIDE 11

UNIFYING ADAPTIVE GRADIENT METHODS + COMPRESSION

➤ Sign descent: weak theoretical foundation, yet incredibly popular (e.g. Adam)
➤ Compressed descent: weak theoretical foundation, takes pains to correct bias, yet empirically successful
➤ So: sign-based gradient compression?

Need to build a theory

SLIDE 12

STRUCTURE

➤ Why care about signSGD?
➤ Theoretical convergence results
➤ Empirical characterisation of the neural net landscape
➤ ImageNet results

SLIDE 13

DOES SIGNSGD EVEN CONVERGE?

What might we fear?

➤ Might not converge at all
➤ Might have horrible dimension dependence
➤ Majority vote may give no speedup from adding extra machines

Our results

➤ It does converge
➤ We characterise functions where signSGD & majority vote are as nice as SGD
➤ We suggest these functions are typical in deep learning

Compression can be a free lunch

SLIDE 14

SINGLE WORKER RESULTS

Define:

➤ f* : objective function lower bound
➤ σ⃗ : coordinate-wise variance bound
➤ L⃗ : coordinate-wise gradient Lipschitz constants
➤ K : number of iterations
➤ N : number of backpropagations

Under these assumptions, SGD gets rate

$$\mathbb{E}\left[\frac{1}{K}\sum_{k=0}^{K-1}\|g_k\|_2^2\right] \;\le\; \frac{1}{\sqrt{N}}\left[2\,\|\vec{L}\|_\infty\,(f_0 - f_*) + \|\vec{\sigma}\|_2^2\right]$$

signSGD gets rate

$$\mathbb{E}\left[\frac{1}{K}\sum_{k=0}^{K-1}\|g_k\|_1\right]^2 \;\le\; \frac{1}{\sqrt{N}}\left[\sqrt{\|\vec{L}\|_1}\left(f_0 - f_* + \tfrac{1}{2}\right) + 2\,\|\vec{\sigma}\|_1\right]^2$$

SLIDE 15

SINGLE WORKER RESULTS

(Same bounds as the previous slide, annotated to compare dimension dependence: for a d-dimensional vector, $\|\vec{L}\|_1 \le d\,\|\vec{L}\|_\infty$, $\|\vec{\sigma}\|_1 \le \sqrt{d}\,\|\vec{\sigma}\|_2$ and $\|\vec{g}_k\|_1 \le \sqrt{d}\,\|\vec{g}_k\|_2$, with equality when the vectors are fully dense. So the signSGD constants match the SGD ones when gradients, noise and curvature are dense.)

SLIDE 16

MULTI WORKER RESULTS

Same assumptions as before. With M workers, SGD gets rate

$$\mathbb{E}\left[\frac{1}{K}\sum_{k=0}^{K-1}\|g_k\|_2^2\right] \;\le\; \frac{1}{\sqrt{N}}\left[2\,\|\vec{L}\|_\infty\,(f_0 - f_*) + \frac{\|\vec{\sigma}\|_2^2}{M}\right]$$

and, if the gradient noise is unimodal symmetric, majority vote gets

$$\mathbb{E}\left[\frac{1}{K}\sum_{k=0}^{K-1}\|g_k\|_1\right]^2 \;\le\; \frac{1}{\sqrt{N}}\left[\sqrt{\|\vec{L}\|_1}\left(f_0 - f_* + \tfrac{1}{2}\right) + \frac{2\,\|\vec{\sigma}\|_1}{\sqrt{M}}\right]^2$$

Both variance terms shrink with M, so majority vote enjoys the same speedup from extra machines as full-precision distributed SGD.

SLIDE 17

STRUCTURE

➤ Why care about signSGD?
➤ Theoretical convergence results
➤ Empirical characterisation of the neural net landscape
➤ ImageNet results

SLIDE 18

CHARACTERISING THE DEEP LEARNING LANDSCAPE EMPIRICALLY

➤ signSGD cares about gradient density. A natural measure of the density of a vector v is $\phi(v) := \|v\|_1^2 / (d\,\|v\|_2^2)$: it equals 1 for a fully dense v and is ≈ 0 for a fully sparse v (see the sketch after this list).
➤ Majority vote cares about noise symmetry. For large enough mini-batch size, unimodal symmetric gradient noise is reasonable by the Central Limit Theorem.
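A sketch of that density measure in NumPy; the function name and the example vectors are mine.

```python
import numpy as np

def density(v):
    """Gradient density phi(v) = ||v||_1^2 / (d * ||v||_2^2).

    phi = 1 when v is fully dense (all components equal in magnitude)
    and phi = 1/d ~ 0 when v is fully sparse (one nonzero component).
    """
    d = v.size
    return np.linalg.norm(v, 1) ** 2 / (d * np.linalg.norm(v) ** 2)

print(density(np.ones(1000)))    # 1.0   : fully dense
e1 = np.zeros(1000); e1[0] = 3.0
print(density(e1))               # 0.001 : fully sparse (= 1/d)
```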

SLIDE 19

STRUCTURE

➤ Why care about signSGD?
➤ Theoretical convergence results
➤ Empirical characterisation of the neural net landscape
➤ ImageNet results

SLIDE 20

SIGNUM IS COMPETITIVE ON IMAGENET

➤ Performance very similar to Adam
➤ May want to switch to SGD towards the end?

SLIDE 21

DOES MAJORITY VOTE WORK?

Experiments by Jiawei Zhao (NUAA): CIFAR-10, ResNet-18.

SLIDE 22

Poster tonight! 6:15–9 PM @ Hall B #72