COMPRESSING GRADIENT OPTIMIZERS VIA COUNT-SKETCHES




SLIDE 1

COMPRESSING GRADIENT OPTIMIZERS VIA COUNT-SKETCHES

Ryan Spring, Anastasios Kyrillidis, Vijai Mohan, Anshumali Shrivastava Rice University, Amazon Search ICML 2019

Count-Sketch Optimizers 6/11/2019

SLIDE 2

Deep Learning is Resource Intensive

Training deep learning models requires large amounts of time and resources

SLIDE 3

Data-Parallelism for faster training!

A key tool for reducing training time is to increase the batch size

SLIDE 4

Data Parallelism – Memory Limitations

Increasing the batch size requires significant amounts of memory

SLIDE 5

Faster Training vs. Expressive Model

Sacrifice batch size for a larger, more expressive model

SLIDE 6

Pesky Popular Optimizers

  • The auxiliary parameters used by popular optimizers aggravate the memory issue
  • e.g., Adam, RMSProp, Adagrad, Momentum
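To make the memory cost concrete, here is a rough sketch of how much extra state each optimizer keeps. It assumes the textbook form of each optimizer (one auxiliary tensor per parameter for Momentum, Adagrad, and RMSProp; two for Adam) and 32-bit values. The names `AUX_TENSORS` and `aux_memory_mb` and the 100-million-parameter model are hypothetical, chosen only for illustration.

```python
# Per-parameter auxiliary tensors kept by the standard form of each optimizer
# (assumption: no extra options such as momentum added to RMSProp).
AUX_TENSORS = {"SGD": 0, "Momentum": 1, "Adagrad": 1, "RMSProp": 1, "Adam": 2}


def aux_memory_mb(num_params, optimizer, bytes_per_value=4):
    """Memory in MB (2**20 bytes) used by the optimizer's auxiliary variables."""
    return AUX_TENSORS[optimizer] * num_params * bytes_per_value / 2**20


# Hypothetical model with 100 million parameters stored as 32-bit floats.
num_params = 100_000_000
overhead = {name: aux_memory_mb(num_params, name) for name in AUX_TENSORS}
```

Under these assumptions, Adam's auxiliary state is twice the size of Momentum's, and both scale linearly with the number of model parameters.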

SLIDE 7

Optimizers – A Concrete Example

  • Training BERT Transformer on Nvidia V100 16GB*
  • SGD: 10,800 MB, Adam: 13,362 MB
  • Auxiliary variables require 2,562 MB extra memory!

*Using Activation Checkpointing and Mixed Precision Training
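The arithmetic behind the figures above can be checked in a few lines. The two memory totals are taken from the slide; the relative overhead is a derived number, not one reported by the authors.

```python
sgd_mb = 10_800   # peak memory with SGD, from the slide
adam_mb = 13_362  # peak memory with Adam, from the slide

extra_mb = adam_mb - sgd_mb         # memory attributable to Adam's auxiliary variables
extra_fraction = extra_mb / sgd_mb  # overhead relative to the SGD footprint
print(f"{extra_mb} MB extra ({extra_fraction:.1%} over SGD)")
```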

SLIDE 8

Our Goal

  • Compress the auxiliary variables
  • Maintain convergence rate and accuracy of the full-sized optimizer

SLIDE 9

Count-Sketches to the Rescue!

  • Solution: Compress the auxiliary variables with count-sketches
  • Intuition: Map multiple model parameters to the same parameter in the count-sketch
  • Outcome: Free memory for more expressive model and/or larger batch size
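The intuition above can be illustrated with a toy count-sketch in NumPy. This is a minimal sketch of the generic data structure, not the authors' implementation. The class name `CountSketch`, the universal hash family, and the sizes used are all assumptions made for illustration. Each item is hashed to one counter per row with a random sign, and a query returns the median of the signed counters.

```python
import numpy as np


class CountSketch:
    """Toy count-sketch: `depth` hash rows, each holding `width` counters."""

    PRIME = 2**31 - 1  # Mersenne prime for the (a * i + b) mod p hash family

    def __init__(self, depth, width, seed=0):
        rng = np.random.default_rng(seed)
        self.width = width
        self.table = np.zeros((depth, width), dtype=np.float64)
        # Random coefficients: (a, b) pick the bucket, (c, d) pick the sign.
        self.a, self.c = rng.integers(1, self.PRIME, size=(2, depth, 1))
        self.b, self.d = rng.integers(0, self.PRIME, size=(2, depth, 1))

    def _buckets_and_signs(self, idx):
        idx = np.asarray(idx, dtype=np.int64)[None, :]  # shape (1, k)
        buckets = ((self.a * idx + self.b) % self.PRIME) % self.width
        signs = 1 - 2 * (((self.c * idx + self.d) % self.PRIME) % 2)  # +1 or -1
        return buckets, signs  # each of shape (depth, k)

    def update(self, idx, values):
        """Add values[j] to item idx[j]; colliding items share a counter."""
        buckets, signs = self._buckets_and_signs(idx)
        rows = np.broadcast_to(np.arange(len(self.table))[:, None], buckets.shape)
        np.add.at(self.table, (rows, buckets), signs * np.asarray(values))

    def query(self, idx):
        """Estimate each item's value as the median over the rows."""
        buckets, signs = self._buckets_and_signs(idx)
        rows = np.arange(len(self.table))[:, None]
        return np.median(signs * self.table[rows, buckets], axis=0)


sketch = CountSketch(depth=3, width=16)  # 48 counters in total
sketch.update([7], [3.0])
sketch.update([7], [1.5])
estimate = sketch.query([7])  # only item 7 was inserted, so this estimate is exact
```

When many items share the sketch, collisions add noise to each row. The random signs make that noise cancel in expectation, and taking the median across rows limits the effect of any single bad collision.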

SLIDE 10

Highlighted Result - LSTM – LM1B

Metric            Adam      Count-Sketch
Time (Hrs)        5.28      5.42
Size (MB)         10,813    7,693
Test Perplexity   39.90     40.55

  • Count-Sketch optimizer used 5x fewer parameters
  • Upshot: Reduced memory usage with minimal accuracy or performance loss
SLIDE 11

Please visit the poster today! 6:30pm @ Pacific Ballroom #83
GitHub: https://github.com/rdspring1/Count-Sketch-Optimizers