SLIDE 1

signSGD: Compressed optimisation for non-convex problems

Jeremy Bernstein (Caltech) · Yu-Xiang Wang (UCSB/Amazon) · Kamyar Azizzadenesheli (UCI) · Anima Anandkumar (Caltech/Amazon)

SLIDE 2

signSGD: Compressed optimisation for non-convex problems

➤ Snap gradient components to ±1
➤ Reduces communication time
➤ Realistic for deep learning
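To make the idea concrete, here is a minimal sketch of the update in NumPy; the toy quadratic objective, learning rate and noise scale are illustrative choices of mine, not from the talk.

```python
import numpy as np

def signsgd_step(x, g, lr=1e-2):
    """One signSGD step: descend along the sign of the stochastic gradient.

    Each component is snapped to +1 or -1, so a worker needs only one
    bit per component to communicate it.
    """
    return x - lr * np.sign(g)

# Toy usage: minimise f(x) = ||x||^2 / 2 from noisy gradients g = x + noise.
rng = np.random.default_rng(0)
x = rng.standard_normal(10)
for _ in range(500):
    g = x + 0.1 * rng.standard_normal(10)  # stochastic gradient of f
    x = signsgd_step(x, g)
print(np.linalg.norm(x))  # driven close to the minimum at 0, up to the noise floor
```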

SLIDE 3

STRUCTURE

➤ Why care about signSGD?
➤ Theoretical convergence results
➤ Empirical characterisation of the neural net landscape
➤ ImageNet results

SLIDE 4

GRADIENT COMPRESSION… WHY CARE?

[Diagram: a parameter server connected to GPU 1 and GPU 2, each holding 1/2 of the data.]

SLIDE 5

GRADIENT COMPRESSION… WHY CARE?

[Diagram: a parameter server connected to many GPUs (GPU 1 through GPU 7), with the links labelled "COMPRESS?".]

SLIDE 6

DISTRIBUTED SGD

[Diagram: GPU 1 and GPU 2, each with 1/2 of the data, send their stochastic gradients g to the parameter server; the server aggregates them as Σg and sends the sum back.]

SLIDE 7

SIGNSGD WITH MAJORITY VOTE

[Diagram: GPU 1 and GPU 2, each with 1/2 of the data, send sign(g) to the parameter server; the server broadcasts back the elementwise majority decision sign(Σ sign(g)).]
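A sketch of the aggregation rule in NumPy (function names are mine): each worker transmits one bit per component up to the server, and the server broadcasts one bit per component back down.

```python
import numpy as np

def majority_vote_step(x, worker_grads, lr=1e-2):
    """One step of signSGD with majority vote.

    Workers send sign(g) (1 bit per component) to the parameter server.
    The server takes the elementwise majority decision sign(sum of signs)
    and broadcasts it back, again 1 bit per component.
    """
    votes = sum(np.sign(g) for g in worker_grads)  # server tallies sign votes
    return x - lr * np.sign(votes)                 # workers apply the majority sign

# Toy usage: 3 workers, each with an independent noisy gradient of the same f.
rng = np.random.default_rng(1)
x = rng.standard_normal(10)
grads = [x + 0.1 * rng.standard_normal(10) for _ in range(3)]
x = majority_vote_step(x, grads)
```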

SLIDE 8

COMPRESSION SAVINGS OF MAJORITY VOTE

[Bar chart: # bits per component per iteration. SGD sends a 32-bit float per component; majority vote sends 1 bit per component in each direction.]

SLIDE 9

SIGNSGD IS A SPECIAL CASE OF ADAM

➤ signSGD is Adam with β₁ = β₂ = ε = 0: the Adam update m/(√v + ε) collapses to g/|g| = sign(g)
➤ Signum (sign momentum): take the sign of the momentum instead of the raw gradient
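A sketch of the relationship in NumPy (names and the usage values are mine): zeroing Adam's β₁, β₂ and ε gives the update g/|g| = sign(g), i.e. signSGD, while Signum takes the sign of a momentum buffer instead of the raw gradient.

```python
import numpy as np

def signum_step(x, m, g, lr=1e-3, beta=0.9):
    """One Signum step: signSGD applied to the momentum of the gradient.

    beta = 0 recovers plain signSGD. Compare Adam, whose update is
    m / (sqrt(v) + eps): with beta1 = beta2 = eps = 0 it becomes
    g / |g| = sign(g), so signSGD sits inside the Adam family.
    """
    m = beta * m + (1 - beta) * g  # exponential moving average of gradients
    return x - lr * np.sign(m), m

# Usage: carry the momentum buffer m across iterations.
x, m = np.ones(5), np.zeros(5)
g = np.array([0.3, -0.2, 0.1, 1.5, -0.7])  # stand-in stochastic gradient
x, m = signum_step(x, m, g)
```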

SLIDE 10

ADAM… WHY CARE?

[Bar chart: # Google Scholar citations for SGD (Robbins & Monro), Adam (Kingma & Ba) and the Turing test (Turing).]

SLIDE 11

UNIFYING ADAPTIVE GRADIENT METHODS + COMPRESSION

➤ Sign descent: weak theoretical foundation, yet incredibly popular (e.g. Adam)
➤ Compressed descent: weak theoretical foundation, takes pains to correct bias, yet empirically successful
➤ So: sign-based gradient compression?

Need to build a theory

SLIDE 12

STRUCTURE

➤ Why care about signSGD?
➤ Theoretical convergence results
➤ Empirical characterisation of the neural net landscape
➤ ImageNet results

SLIDE 13

DOES SIGNSGD EVEN CONVERGE?

What might we fear?

➤ Might not converge at all
➤ Might have horrible dimension dependence
➤ Majority vote may give no speedup from adding extra machines

Our results

➤ It does converge
➤ We characterise functions where signSGD & majority vote are as nice as SGD
➤ We suggest these functions are typical in deep learning

Compression can be a free lunch

SLIDE 14

SINGLE WORKER RESULTS

Define:

➤ f* : objective function lower bound
➤ σ⃗ : coordinate-wise variance bound
➤ L⃗ : coordinate-wise gradient Lipschitz constants
➤ K : number of iterations
➤ N : number of backpropagations

Under these assumptions, SGD gets rate

$$\mathbb{E}\left[\frac{1}{K}\sum_{k=0}^{K-1}\|g_k\|_2^2\right] \;\le\; \frac{1}{\sqrt{N}}\left[2\,\|\vec{L}\|_\infty\,(f_0 - f_*) + \|\vec{\sigma}\|_2^2\right]$$

signSGD gets rate

$$\mathbb{E}\left[\frac{1}{K}\sum_{k=0}^{K-1}\|g_k\|_1\right]^2 \;\le\; \frac{1}{\sqrt{N}}\left[\sqrt{\|\vec{L}\|_1}\left(f_0 - f_* + \tfrac{1}{2}\right) + 2\,\|\vec{\sigma}\|_1\right]^2$$

SLIDE 15

SINGLE WORKER RESULTS

(Same bounds as the previous slide, annotated to compare dimension dependence: for a d-dimensional vector, $\|\vec{L}\|_1 \le d\,\|\vec{L}\|_\infty$, $\|\vec{\sigma}\|_1 \le \sqrt{d}\,\|\vec{\sigma}\|_2$ and $\|\vec{g}_k\|_1 \le \sqrt{d}\,\|\vec{g}_k\|_2$, with equality when the vectors are fully dense. So the signSGD constants match the SGD ones when gradients, noise and curvature are dense.)

SLIDE 16

MULTI WORKER RESULTS

Same assumptions as before. With M workers, SGD gets rate

$$\mathbb{E}\left[\frac{1}{K}\sum_{k=0}^{K-1}\|g_k\|_2^2\right] \;\le\; \frac{1}{\sqrt{N}}\left[2\,\|\vec{L}\|_\infty\,(f_0 - f_*) + \frac{\|\vec{\sigma}\|_2^2}{M}\right]$$

and, if the gradient noise is unimodal symmetric, majority vote gets

$$\mathbb{E}\left[\frac{1}{K}\sum_{k=0}^{K-1}\|g_k\|_1\right]^2 \;\le\; \frac{1}{\sqrt{N}}\left[\sqrt{\|\vec{L}\|_1}\left(f_0 - f_* + \tfrac{1}{2}\right) + \frac{2\,\|\vec{\sigma}\|_1}{\sqrt{M}}\right]^2$$

Both variance terms shrink with M, so majority vote enjoys the same speedup from extra machines as full-precision distributed SGD.

SLIDE 17

STRUCTURE

➤ Why care about signSGD?
➤ Theoretical convergence results
➤ Empirical characterisation of the neural net landscape
➤ ImageNet results

SLIDE 18

CHARACTERISING THE DEEP LEARNING LANDSCAPE EMPIRICALLY

➤ signSGD cares about gradient density. A natural measure of the density of a vector v is $\phi(v) := \|v\|_1^2 / (d\,\|v\|_2^2)$: it equals 1 for a fully dense v and is ≈ 0 for a fully sparse v (see the sketch after this list).
➤ Majority vote cares about noise symmetry. For large enough mini-batch size, unimodal symmetric gradient noise is reasonable by the Central Limit Theorem.
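A sketch of that density measure in NumPy; the function name and the example vectors are mine.

```python
import numpy as np

def density(v):
    """Gradient density phi(v) = ||v||_1^2 / (d * ||v||_2^2).

    phi = 1 when v is fully dense (all components equal in magnitude)
    and phi = 1/d ~ 0 when v is fully sparse (one nonzero component).
    """
    d = v.size
    return np.linalg.norm(v, 1) ** 2 / (d * np.linalg.norm(v) ** 2)

print(density(np.ones(1000)))    # 1.0   : fully dense
e1 = np.zeros(1000); e1[0] = 3.0
print(density(e1))               # 0.001 : fully sparse (= 1/d)
```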

SLIDE 19

STRUCTURE

➤ Why care about signSGD?
➤ Theoretical convergence results
➤ Empirical characterisation of the neural net landscape
➤ ImageNet results

SLIDE 20

SIGNUM IS COMPETITIVE ON IMAGENET

➤ Performance very similar to Adam
➤ May want to switch to SGD towards the end?

SLIDE 21

DOES MAJORITY VOTE WORK?

Experiments by Jiawei Zhao (NUAA): CIFAR-10, ResNet-18.

SLIDE 22

Poster tonight! 6:15–9 PM @ Hall B #72