Crossbow: Scaling Deep Learning on Multi-GPU Servers - Peter Pietzuch (PowerPoint presentation)



SLIDE 1

Crossbow: Scaling Deep Learning on Multi-GPU Servers

Peter Pietzuch

with Alexandros Koliousis, Luo Mai, Pijika Watcharapichat, Matthias Weidlich, Paolo Costa Imperial College London http://lsds.doc.ic.ac.uk <prp@imperial.ac.uk>

CASTOR Software Days – Stockholm, Sweden – October 2019

Large-Scale Data & Systems Group

SLIDE 2

Machine Learning with Deep Neural Networks (DNN)

Revolutionised solutions in vision, speech recognition, and more. DNN models are trained by giving examples (instead of being programmed).

[Figure: DNNs learn mappings from examples, e.g. audio → words ("hello audience"), text → topics, images → labels]

Peter Pietzuch - Imperial College London 2

When DNN output is wrong, tweak its parameters

SLIDE 3

Training DNNs

  • Obtain DNN model that minimises classification error
  • Use Stochastic Gradient Descent (SGD) for training:
    1. Begin with a random model
    2. Consider a mini-batch of training data
    3. Iteratively calculate gradients & update model parameters w

[Figure: error surface over model parameters w; starting from a random model, SGD converges towards the lowest-error (optimal) point]
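The three steps above can be sketched as a plain mini-batch SGD loop. This is a generic illustration on a synthetic linear-regression task, not Crossbow code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic task: recover true_w from noisy examples
true_w = np.array([2.0, -1.0])
X = rng.normal(size=(1024, 2))
y = X @ true_w + 0.01 * rng.normal(size=1024)

w = rng.normal(size=2)   # 1. begin with a random model
lr, b = 0.1, 32          # learning rate and mini-batch size

for step in range(300):
    idx = rng.integers(0, len(X), size=b)    # 2. draw a mini-batch
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ w - yb) / b      # 3. gradient of the error...
    w -= lr * grad                           #    ...and parameter update
```

With a small batch the updates are frequent but noisy; with a large batch each update averages many gradients, which is the trade-off the following slides explore.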


SLIDE 4

Training DNNs on GPUs

  • GPUs are good at parallelising gradient computation


SLIDE 5

Training DNNs in Parallel with GPUs

  • With large datasets, speed up by calculating gradients on multiple GPUs
  • Every GPU has model replica with a copy of model parameters (or weights)

[Figure: a mini-batch of training data is split into shards 1, 2, 3; GPUs 1, 2, 3 each hold a model replica and compute a gradient on their shard]

But model replicas would diverge over time…


SLIDE 6

Model Synchronisation among GPUs

  • Parameter server: Maintains global model

[Figure: GPUs 1, 2, 3 each hold a model replica; their gradients flow to a parameter server that maintains the global model]

  • GPUs:
    1. Send gradients to update the global model
    2. Synchronise local model replicas with the global model
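The two synchronisation steps can be sketched as follows. This is a minimal, hypothetical parameter-server interface (class and method names are illustrative), with numpy arrays standing in for GPU tensors:

```python
import numpy as np

class ParameterServer:
    """Maintains the global model; replicas push gradients, pull weights."""
    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim)
        self.lr = lr

    def push(self, gradients):
        # 1. GPUs send gradients, which update the global model
        self.w -= self.lr * np.mean(gradients, axis=0)

    def pull(self):
        # 2. GPUs synchronise their local replicas with the global model
        return self.w.copy()

ps = ParameterServer(dim=2)
replicas = [ps.pull() for _ in range(3)]         # one replica per GPU

# One synchronous step: each "GPU" contributes a gradient from its shard
gradients = [np.array([0.3, -0.1]) * (1 + 0.1 * i) for i in range(3)]
ps.push(gradients)
replicas = [ps.pull() for _ in replicas]         # all replicas agree again
```

Because every replica is overwritten by the pulled global state, replicas never diverge, but they also explore no further than the single global model.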


SLIDE 7

The Problem with Large Batch Sizes

Training with large mini-batches is bad for your health. More importantly, it’s bad for your test error. Friends don’t let friends use mini-batches larger than 32.

Yann LeCun

@ylecun

2:00 PM – 26 Apr 2018


SLIDE 8

Why Use Large Batch Sizes?

[Figure: gradients computed on a batch of the dataset are averaged into the weights; with more GPUs, a batch becomes a bigger batch, then an even bigger batch]

Keep work per GPU constant to scale

E.g. ~32 to 256 labelled images


SLIDE 9

What is the Best Batch Size on a GPU?

  • ResNet-32 on NVIDIA Titan X GPU

[Chart: time to accuracy (sec) vs. batch size b for TensorFlow]

  Batch size b             32    64   128   256   512   1024
  Time to accuracy (sec)   1134  361  302   354   379   445

SLIDE 10

Training DNNs Favours Small Batch Sizes

We want frequent, less “noisy” updates

[Figure: with small batches, each GPU computes a gradient and updates the weights frequently; with a large batch, gradients are averaged into a single, less frequent update]

SLIDE 11

Statistical Efficiency Needs Small Batch Sizes

[Chart: test accuracy (%) vs. epochs for batch sizes b=64 to b=4096; small batch sizes reach higher accuracy than large batch sizes]

SLIDE 12

Hardware Efficiency Needs Large Batch Sizes

Keep work per GPU constant → increase batch size with #GPUs

[Figure: with two GPUs, each processes a batch of size ½ b, computes a gradient, averages gradients with the other GPU, and updates its replica]

SLIDE 13

Tension between Hardware & Statistical Efficiency

  • Practitioners increase batch size due to hardware efficiency
  • But best batch size depends on both hardware & statistical efficiency


Training with large mini-batches is bad for your health. More importantly, it’s bad for your test error. Friends don’t let friends use mini-batches larger than 32.

Yann LeCun

@ylecun

2:00 PM – 26 Apr 2018


SLIDE 14

Current Practice: Hyper-Parameter Tuning

  • Adjust hyper-parameters (e.g. learning rate, momentum) to avoid a reduction in statistical efficiency

  • Linear scaling rule:

"When mini-batch size is multiplied by k, multiply learning rate by k”

  • Goyal et al. (2017)
  • Drawbacks:
    – Manual, labour-intensive process
    – Highly model-specific: not portable and does not work for some models
    – Less effective for very large batch sizes…
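The linear scaling rule is trivial to state in code. A minimal helper, using the commonly cited reference point from Goyal et al. (learning rate 0.1 at batch size 256) in the usage example:

```python
def scaled_learning_rate(base_lr, base_batch, batch):
    """Linear scaling rule (Goyal et al., 2017): when the mini-batch
    size is multiplied by k, multiply the learning rate by k."""
    k = batch / base_batch
    return base_lr * k

# e.g. scaling ResNet-50's lr of 0.1 at batch 256 up to batch 8,192
lr = scaled_learning_rate(0.1, 256, 8192)   # k = 32, so lr = 3.2
```

As the next slide shows, this rule breaks down for very large batches, which is part of Crossbow's motivation.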


SLIDE 15

Limits of Hyper-Parameter Tuning

“When mini-batch size is multiplied by k, multiply learning rate by k”

[Chart: ResNet-50, 32 images/GPU; Top-1 validation error vs. mini-batch size from 256 to 65,536; error stays low up to 8,192 and rises sharply beyond]

SLIDE 16

Fundamental Challenge of GPU Scaling

“If batch size could be made arbitrarily large while still training effectively, then training is amenable to standard weak scaling approaches. However, if the training rate of some models is restricted to small batch sizes, then we will need to find other algorithmic and architectural approaches to their acceleration.”

– J. Dean, D. Patterson et al., “A New Golden Age in Computer Architecture”, IEEE Micro, 2018

  • How to design a deep learning system that scales training with multiple GPUs, even when the preferred batch size is small?


SLIDE 17

(1) How to increase hardware efficiency with small batches?
(2) How to synchronise model replicas?
(3) How to reduce scheduling & synchronisation overheads?

[Figure: (1) tasks on one GPU sharing a pool of reusable data buffers; (2) model replicas on GPUs 1 and 2 synchronised with a global model]

SLIDE 18

Problem: Small Batch Sizes Underutilise GPUs

[Chart: CDF (%) of GPU occupancy (%); with small batches, resources are under-used much of the time]

SLIDE 19

How to Process Small Batches Efficiently?

One batch per GPU → Not enough data and instruction parallelism for every operator

[Figure: with one batch per GPU, GPU utilisation (%) stays well below 100% across operations]

SLIDE 20

Idea: Train Multiple Model Replicas per GPU

One learning process (or learner) per GPU stream

[Figure: on one GPU, a scheduler dispatches multiple learners, one per stream; each learner computes gradients on its own batch and updates its own weights]

SLIDE 21

Effect of Training Multiple Model Replicas per GPU

  • But now we must synchronise a large number of learners/model replicas...

[Chart: throughput increase (%) vs. batch size b; multiple learners per GPU regain the under-used resources, especially for small batch sizes]

SLIDE 22

(1) How to increase efficiency with small batches? (2) How to synchronise model replicas?

[Figure: tasks on one GPU; model replicas on GPUs 1 and 2 synchronised with a global model]

  • Train multiple model replicas per GPU

SLIDE 23

Problem: Why not Synchronous Parallel SGD?

All learners always start from the same point, so there is limited exploration of the parameter space


SLIDE 24

Idea: Maintain Independent Model Replicas

  • Benefits:
    – Increased exploration of space through parallelism
    – Each model replica uses a small batch size

[Figure: from the initial weights, replicas X and Y follow their own trajectories around the average model trajectory]

SLIDE 25

Crossbow: Synchronous Model Averaging

Allow learners to diverge, but correct their trajectories based on the average model. Accelerate the average model's trajectory with momentum to find minima faster.

[Figure: corrections pull each replica's trajectory towards the momentum-accelerated average model]
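In the spirit of the slide, a sketch of synchronous model averaging: each replica takes its own SGD step, then receives a correction proportional to its distance from the average model, while the average model advances with momentum. The constants and the exact update rule here are illustrative (related to elastic averaging SGD), not Crossbow's implementation:

```python
import numpy as np

rng = np.random.default_rng(2)

true_w = np.array([0.5, -2.0])
X = rng.normal(size=(512, 2))
y = X @ true_w

m = 4                                        # number of learners
replicas = [rng.normal(size=2) for _ in range(m)]
z = np.mean(replicas, axis=0)                # average (central) model
v = np.zeros(2)                              # momentum of the average model
lr, alpha, mu = 0.1, 0.1, 0.9                # illustrative constants

for step in range(400):
    for i, w in enumerate(replicas):
        idx = rng.integers(0, len(X), size=16)
        Xb, yb = X[idx], y[idx]
        grad = 2 * Xb.T @ (Xb @ w - yb) / 16
        corr = alpha * (w - z)               # correction towards the average
        replicas[i] = w - lr * grad - corr
    # momentum-accelerated trajectory of the average model
    v = mu * v + alpha * sum(w - z for w in replicas)
    z = z + v
```

Unlike a parameter server, no replica is ever overwritten by the global state: replicas keep exploring independently while the corrections keep them from diverging.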


SLIDE 26

GPUs with Synchronous Model Averaging

[Figure: GPUs 1, 2, and 3, each running several learners with their own model replicas]

  • Synchronously apply corrections to model replicas


SLIDE 27

GPUs with Synchronous Model Averaging

[Figure: as before, but each GPU now keeps a reference model alongside its learner replicas, and an average model is maintained across GPUs]

  • Synchronously apply corrections to model replicas


SLIDE 28

GPUs with Synchronous Model Averaging

[Figure: each GPU's learner replicas synchronise through its reference model and the shared average model]

Synchronous Model Averaging

  • Ensures consistent view of average model
  • Takes GPU bandwidth into account during synchronisation


SLIDE 29

(1) How to increase efficiency with small batches? → Train multiple model replicas per GPU
(2) How to synchronise model replicas? → Use synchronous model averaging
(3) How to reduce scheduling and synchronisation overheads?

[Figure: tasks on one GPU drawing from a pool of reusable data buffers; model replicas on GPUs 1 and 2 synchronised with a global model]

SLIDE 30

Crossbow Architecture

[Figure: Crossbow architecture; data pre-processors read the dataset and fill input batches; a task scheduler takes dataflows from ready queues and dispatches them to learner and synchronisation streams over the model replicas on GPUs 1 and 2, guided by an auto-tuner]

  • ~24K lines of Java and ~15K lines of C/C++
  • Integration with TensorFlow
  • Open source: github.com/lsds/crossbow

SLIDE 31

Efficient Task Scheduling

  • Execute compute and synchronisation tasks
  • Fine-grained concurrency
  • Need efficient scheduler to feed all GPUs with tasks

[Figure: the task scheduler, driven by multiple threads, takes input batches from the data pre-processors and dataflows from ready queues, and assigns them to model replicas and learner streams, guided by the auto-tuner]

SLIDE 32

Interleaving Compute & Synchronisation Tasks

[Figure: timeline of interleaved tasks on GPUs 1 and 2 across batches N, N+1, N+2; worker threads schedule compute-gradient, average-gradients, elastic-average, and update-replica tasks from replica and synchronisation queues; new replica queues are created on the fly, the next available replica is scheduled as soon as a GPU is ready to compute another gradient, the device model is updated via mapped memory, and a monitor feeds the auto-tuner]

SLIDE 33

Auto-Tuning the Number of Model Replicas

  • Monitor training throughput
  • Dynamically adjust number of learners
  • Uses object pooling & lazy materialization

[Figure: the auto-tuner adjusts the number of model replicas and learner streams fed by the task manager, input batches, and data pre-processors]
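A sketch of the auto-tuning idea: increase the number of learners while measured throughput improves, and settle just before the first drop. The policy and all names here are illustrative, not Crossbow's actual algorithm:

```python
def tune_learners(measure, max_learners=8):
    """Grow the learner count while training throughput keeps improving;
    stop at the count just before throughput first drops."""
    best_n, best_tp = 1, measure(1)
    for n in range(2, max_learners + 1):
        tp = measure(n)
        if tp <= best_tp:
            break                    # more learners no longer help
        best_n, best_tp = n, tp
    return best_n

# Hypothetical throughput curve (images/sec) that peaks at 4 learners
curve = {1: 100, 2: 180, 3: 240, 4: 260, 5: 250, 6: 230, 7: 220, 8: 210}
chosen = tune_learners(curve.__getitem__)
```

In the real system the measurements arrive online during training, so the tuner must also be able to shrink the pool again; object pooling and lazy materialization keep adding and removing replicas cheap.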

SLIDE 34

Experimental Evaluation


SLIDE 35

Does Crossbow Train Effectively with Small Batch Sizes?

Multiple learners per GPU improve hardware efficiency

[Chart: time to test accuracy (sec) for TensorFlow (b=64 and b=256) vs. Crossbow (b=64 with 1 learner and b=64 with 4 learners)]

  • ResNet-32 with ImageNet dataset on 1 Titan X GPU

SLIDE 36

Does Crossbow Scale to Multiple GPUs?

[Chart: time to test accuracy, TTA(53%) (sec), vs. number of GPUs for TensorFlow, Crossbow with one learner (m=1), and Crossbow]

Synchronous model averaging improves statistical efficiency

  • ResNet-50 with ImageNet dataset on 8 Titan X GPUs

SLIDE 37

Does Crossbow Train Effectively Across Models?

Training with multiple learners always better than training with large batches

[Chart: speed-up over TensorFlow on 1 GPU and 8 GPUs for LeNet, ResNet-32, VGG, ResNet-50, and ResNet-101; speed-ups range from 1.2x to 4.3x]

SLIDE 38

  • ResNet-50 with ImageNet dataset on Titan X GPUs

What is the Statistical Efficiency with Many Learners?

[Chart: test accuracy (%) vs. epochs for m=1, 2, 4, 8, 16, and 32 learners]

Increasing parallelism has less effect on statistical efficiency

SLIDE 39

Crossbow: Scaling GPU Deep Learning

  • Need to make training throughput independent from hyper-parameters
    – Rethink the design of future deep learning systems
  • Crossbow: scaling DNN training with small batch sizes on many GPUs
    – Multiple model replicas per GPU for high hardware efficiency
    – Synchronous model averaging for high statistical efficiency
  • Exciting research challenges for next-generation deep learning systems


Peter Pietzuch https://lsds.doc.ic.ac.uk — prp@imperial.ac.uk

Thank You — Any Questions?

github.com/lsds/crossbow