SWALP: Stochastic Weight Averaging in Low-Precision Training



SLIDE 1

SWALP: Stochastic Weight Averaging
 in Low-Precision Training

Guandao Yang, Tianyi Zhang, Polina Kirichenko, Junwen Bai, Andrew Gordon Wilson, Christopher De Sa

SLIDE 2

Low-precision Computation
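Low-precision training typically stores weights and gradients in a fixed-point format and rounds stochastically so the quantization is unbiased in expectation. A minimal sketch of such a quantizer (our own illustration, not code from the paper; `word_len` and `frac_len` are assumed parameter names):

```python
import numpy as np

def quantize_stochastic(x, word_len=8, frac_len=6, rng=None):
    """Fixed-point quantization with stochastic rounding.

    Rounds x up with probability equal to its fractional remainder,
    so the result is unbiased: E[quantize(x)] = x (within range).
    """
    rng = np.random.default_rng() if rng is None else rng
    delta = 2.0 ** -frac_len                  # quantization gap between representable values
    scaled = np.asarray(x, dtype=float) / delta
    floor = np.floor(scaled)
    # Round up with probability (scaled - floor), down otherwise.
    rounded = floor + (rng.random(floor.shape) < (scaled - floor))
    # Clip to the range of a signed word_len-bit fixed-point number.
    lo, hi = -2.0 ** (word_len - 1), 2.0 ** (word_len - 1) - 1
    return np.clip(rounded, lo, hi) * delta
```

With `frac_len=6` the gap is δ = 2⁻⁶; every output is a multiple of δ, and averaging many quantizations of the same input recovers it.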

SLIDE 3

We study how to leverage low-precision training to

  • obtain a high-accuracy model.

Problem Statement

SLIDE 4

We study how to leverage low-precision training to

  • obtain a high-accuracy model.

The output model can be higher-precision.

Problem Statement

SLIDE 5
SLIDE 6
SLIDE 7

SLIDE 8

SWALP

[Diagram: the SGD-LP model alongside the SWALP model]

SLIDE 9

SWALP

[Diagram: the SGD-LP model is updated each iteration]

SLIDE 10

SWALP

[Diagram: every c iterations, the SGD-LP weights are averaged into the SWALP model]

SLIDE 11

SWALP

[Diagram: the SGD-LP model updates every iteration; every c iterations its weights are averaged, infrequently, into the SWALP model]
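The scheme on this slide — low-precision SGD steps, with the iterates folded into a full-precision running average every c iterations — can be written as a short loop. This is a minimal sketch, not the paper's implementation; the names `grad_fn`, `cycle`, and `quantize` are our own, and passing `quantize=None` degenerates to full-precision SGD with averaging:

```python
import numpy as np

def swalp(grad_fn, w0, lr=0.1, steps=1000, cycle=10, quantize=None):
    """Sketch of the SWALP loop: low-precision SGD plus infrequent,
    high-precision weight averaging every `cycle` iterations."""
    quantize = quantize if quantize is not None else (lambda w: w)  # identity = full precision
    w = quantize(np.asarray(w0, dtype=float))
    w_avg = np.zeros_like(w)
    n_avg = 0
    for t in range(1, steps + 1):
        w = quantize(w - lr * grad_fn(w))   # SGD-LP step: update, then re-quantize
        if t % cycle == 0:                  # every c iterations: fold into the average
            n_avg += 1
            w_avg += (w - w_avg) / n_avg    # running mean kept in full precision
    return w_avg                            # the averaged model can be higher-precision
```

Because the average is accumulated outside the low-precision format, the returned model is not restricted to the quantization grid that the SGD-LP iterates live on.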

SLIDE 12

Convergence Analysis

Let T be the number of iterations.
 
 Theorem 1 (quadratic)
 SWALP converges to the optimal solution
 at an O(1/T) rate.

SLIDE 13

Convergence Analysis

Let T be the number of iterations.
 
 Theorem 1 (quadratic)
 SWALP converges to the optimal solution
 at an O(1/T) rate. SWALP has the same convergence rate
 as full-precision SGD.

SLIDE 14

Convergence Analysis

Let δ be the quantization gap.
 
 Theorem 2 (strongly convex)
 The expected distance between the SWALP solution
 and the optimal one is bounded by O(δ^2).

SLIDE 15

Convergence Analysis

Let δ be the quantization gap.
 
 Theorem 2 (strongly convex)
 The expected distance between the SWALP solution
 and the optimal one is bounded by O(δ^2).

  • The best bound for SGD-LP is O(δ) (Li et al., NeurIPS 2017).

  • SWALP requires half the number of bits to reduce the noise ball by the same factor.
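One way to read the bit-saving claim, assuming (our illustration) a fixed-point format with f fractional bits, so that the quantization gap is δ = 2⁻ᶠ:

```latex
\delta = 2^{-f}
\quad\Rightarrow\quad
\text{SGD-LP noise ball: } O(\delta) = O(2^{-f}),
\qquad
\text{SWALP noise ball: } O(\delta^2) = O(2^{-2f}).
```

Hence reaching a target error ε takes about log₂(1/ε) fractional bits for SGD-LP, but only half that, ½·log₂(1/ε), for SWALP.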

SLIDE 16

Experiments

SLIDE 17

Experiments

[Results table: values 1.3, 2.9, 0.8, 2.3; row and column labels not recovered]

SLIDE 18

Experiments

SLIDE 19

Poster @ Pacific Ballroom #58

SWALP code
QPyTorch: A Low-Precision Framework