Crossbow: Scaling Deep Learning on Multi-GPU Servers - Peter Pietzuch (PowerPoint presentation)



SLIDE 1

Crossbow: Scaling Deep Learning on Multi-GPU Servers

Peter Pietzuch

with Alexandros Koliousis, Luo Mai, Pijika Watcharapichat, Matthias Weidlich, Paolo Costa Imperial College London http://lsds.doc.ic.ac.uk <prp@imperial.ac.uk>

CASTOR Software Days – Stockholm, Sweden – October 2019

Large-Scale Data & Systems Group

SLIDE 2

Machine Learning with Deep Neural Networks (DNN)

Revolutionised solutions in vision, speech recognition, and more. DNN models are trained by giving examples (instead of being programmed).

[Figure: DNNs learn mappings from examples, e.g. audio → words ("hello audience"), text → topics, images → labels]

Peter Pietzuch - Imperial College London 2

When DNN output is wrong, tweak its parameters

SLIDE 3

Training DNNs

  • Obtain DNN model that minimises classification error
  • Use Stochastic Gradient Descent (SGD) for training:
    1. Begin with a random model
    2. Consider a mini-batch of training data
    3. Iteratively calculate gradients & update model parameters w

[Figure: error surface over model parameters w; starting from a random model, SGD converges towards the lowest-error (optimal) point]
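The three steps above can be sketched as a plain mini-batch SGD loop. This is a generic illustration on a synthetic linear-regression task, not Crossbow code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic task: recover true_w from noisy examples
true_w = np.array([2.0, -1.0])
X = rng.normal(size=(1024, 2))
y = X @ true_w + 0.01 * rng.normal(size=1024)

w = rng.normal(size=2)   # 1. begin with a random model
lr, b = 0.1, 32          # learning rate and mini-batch size

for step in range(300):
    idx = rng.integers(0, len(X), size=b)    # 2. draw a mini-batch
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ w - yb) / b      # 3. gradient of the error...
    w -= lr * grad                           #    ...and parameter update
```

With a small batch the updates are frequent but noisy; with a large batch each update averages many gradients, which is the trade-off the following slides explore.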


SLIDE 4

Training DNNs on GPUs

  • GPUs are good at parallelising gradient computation


SLIDE 5

Training DNNs in Parallel with GPUs

  • With large datasets, speed up by calculating gradients on multiple GPUs
  • Every GPU has model replica with a copy of model parameters (or weights)

[Figure: a mini-batch of training data is split into shards 1, 2, 3; GPUs 1, 2, 3 each hold a model replica and compute a gradient on their shard]

But model replicas would diverge over time…


SLIDE 6

Model Synchronisation among GPUs

  • Parameter server: Maintains global model

[Figure: GPUs 1, 2, 3 each hold a model replica; their gradients flow to a parameter server that maintains the global model]

  • GPUs:
    1. Send gradients to update the global model
    2. Synchronise local model replicas with the global model
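The two synchronisation steps can be sketched as follows. This is a minimal, hypothetical parameter-server interface (class and method names are illustrative), with numpy arrays standing in for GPU tensors:

```python
import numpy as np

class ParameterServer:
    """Maintains the global model; replicas push gradients, pull weights."""
    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim)
        self.lr = lr

    def push(self, gradients):
        # 1. GPUs send gradients, which update the global model
        self.w -= self.lr * np.mean(gradients, axis=0)

    def pull(self):
        # 2. GPUs synchronise their local replicas with the global model
        return self.w.copy()

ps = ParameterServer(dim=2)
replicas = [ps.pull() for _ in range(3)]         # one replica per GPU

# One synchronous step: each "GPU" contributes a gradient from its shard
gradients = [np.array([0.3, -0.1]) * (1 + 0.1 * i) for i in range(3)]
ps.push(gradients)
replicas = [ps.pull() for _ in replicas]         # all replicas agree again
```

Because every replica is overwritten by the pulled global state, replicas never diverge, but they also explore no further than the single global model.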


SLIDE 7

The Problem with Large Batch Sizes

Training with large mini-batches is bad for your health. More importantly, it’s bad for your test error. Friends don’t let friends use mini-batches larger than 32.

Yann LeCun

@ylecun

2:00 PM – 26 Apr 2018


SLIDE 8

Why Use Large Batch Sizes?

[Figure: gradients computed on a batch of the dataset are averaged into the weights; with more GPUs, a batch becomes a bigger batch, then an even bigger batch]

Keep work per GPU constant to scale

E.g. ~32 to 256 labelled images


SLIDE 9

What is the Best Batch Size on a GPU?

  • ResNet-32 on NVIDIA Titan X GPU

[Chart: time to accuracy (sec) vs. batch size b for TensorFlow]

  Batch size b             32    64   128   256   512   1024
  Time to accuracy (sec)   1134  361  302   354   379   445

SLIDE 10

Training DNNs Favours Small Batch Sizes

We want frequent, less “noisy” updates

[Figure: with small batches, each GPU computes a gradient and updates the weights frequently; with a large batch, gradients are averaged into a single, less frequent update]

SLIDE 11

Statistical Efficiency Needs Small Batch Sizes

[Chart: test accuracy (%) vs. epochs for batch sizes b=64 to b=4096; small batch sizes reach higher accuracy than large batch sizes]

SLIDE 12

Hardware Efficiency Needs Large Batch Sizes

Keep work per GPU constant → increase batch size with #GPUs

[Figure: with two GPUs, each processes a batch of size ½ b, computes a gradient, averages gradients with the other GPU, and updates its replica]

SLIDE 13

Tension between Hardware & Statistical Efficiency

  • Practitioners increase batch size due to hardware efficiency
  • But best batch size depends on both hardware & statistical efficiency


Training with large mini-batches is bad for your health. More importantly, it’s bad for your test error. Friends don’t let friends use mini-batches larger than 32.

Yann LeCun

@ylecun

2:00 PM – 26 Apr 2018


SLIDE 14

Current Practice: Hyper-Parameter Tuning

  • Adjust hyper-parameters (e.g. learning rate, momentum) to avoid a reduction in statistical efficiency

  • Linear scaling rule:

"When mini-batch size is multiplied by k, multiply learning rate by k”

  • Goyal et al. (2017)
  • Drawbacks:
    – Manual, labour-intensive process
    – Highly model-specific: not portable and does not work for some models
    – Less effective for very large batch sizes…
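The linear scaling rule is trivial to state in code. A minimal helper, using the commonly cited reference point from Goyal et al. (learning rate 0.1 at batch size 256) in the usage example:

```python
def scaled_learning_rate(base_lr, base_batch, batch):
    """Linear scaling rule (Goyal et al., 2017): when the mini-batch
    size is multiplied by k, multiply the learning rate by k."""
    k = batch / base_batch
    return base_lr * k

# e.g. scaling ResNet-50's lr of 0.1 at batch 256 up to batch 8,192
lr = scaled_learning_rate(0.1, 256, 8192)   # k = 32, so lr = 3.2
```

As the next slide shows, this rule breaks down for very large batches, which is part of Crossbow's motivation.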


SLIDE 15

Limits of Hyper-Parameter Tuning

“When mini-batch size is multiplied by k, multiply learning rate by k”

[Chart: ResNet-50, 32 images/GPU; Top-1 validation error vs. mini-batch size from 256 to 65,536; error stays low up to 8,192 and rises sharply beyond]

SLIDE 16

Fundamental Challenge of GPU Scaling

“If batch size could be made arbitrarily large while still training effectively, then training is amenable to standard weak scaling approaches. However, if the training rate of some models is restricted to small batch sizes, then we will need to find other algorithmic and architectural approaches to their acceleration.”

– J. Dean, D. Patterson et al., “A New Golden Age in Computer Architecture”, IEEE Micro, 2018

  • How to design a deep learning system that scales training with multiple GPUs, even when the preferred batch size is small?


SLIDE 17

(1) How to increase hardware efficiency with small batches?
(2) How to synchronise model replicas?
(3) How to reduce scheduling & synchronisation overheads?

[Figure: (1) tasks on one GPU sharing a pool of reusable data buffers; (2) model replicas on GPUs 1 and 2 synchronised with a global model]

SLIDE 18

Problem: Small Batch Sizes Underutilise GPUs

[Chart: CDF (%) of GPU occupancy (%); with small batches, resources are under-used much of the time]

SLIDE 19

How to Process Small Batches Efficiently?

One batch per GPU → Not enough data and instruction parallelism for every operator

[Figure: with one batch per GPU, GPU utilisation (%) stays well below 100% across operations]

SLIDE 20

Idea: Train Multiple Model Replicas per GPU

One learning process (or learner) per GPU stream

[Figure: on one GPU, a scheduler dispatches multiple learners, one per stream; each learner computes gradients on its own batch and updates its own weights]

SLIDE 21

Effect of Training Multiple Model Replicas per GPU

  • But now we must synchronise a large number of learners/model replicas...

[Chart: throughput increase (%) vs. batch size b; multiple learners per GPU regain the under-used resources, especially for small batch sizes]

SLIDE 22

(1) How to increase efficiency with small batches? (2) How to synchronise model replicas?

[Figure: tasks on one GPU; model replicas on GPUs 1 and 2 synchronised with a global model]

  • Train multiple model replicas per GPU

SLIDE 23

Problem: Why not Synchronous Parallel SGD?

All learners always start from the same point, so there is limited exploration of the parameter space


SLIDE 24

Idea: Maintain Independent Model Replicas

  • Benefits:
    – Increased exploration of space through parallelism
    – Each model replica uses a small batch size

[Figure: from the initial weights, replicas X and Y follow their own trajectories around the average model trajectory]

SLIDE 25

Crossbow: Synchronous Model Averaging

Allow learners to diverge, but correct their trajectories based on the average model. Accelerate the average model's trajectory with momentum to find minima faster.

[Figure: corrections pull each replica's trajectory towards the momentum-accelerated average model]
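In the spirit of the slide, a sketch of synchronous model averaging: each replica takes its own SGD step, then receives a correction proportional to its distance from the average model, while the average model advances with momentum. The constants and the exact update rule here are illustrative (related to elastic averaging SGD), not Crossbow's implementation:

```python
import numpy as np

rng = np.random.default_rng(2)

true_w = np.array([0.5, -2.0])
X = rng.normal(size=(512, 2))
y = X @ true_w

m = 4                                        # number of learners
replicas = [rng.normal(size=2) for _ in range(m)]
z = np.mean(replicas, axis=0)                # average (central) model
v = np.zeros(2)                              # momentum of the average model
lr, alpha, mu = 0.1, 0.1, 0.9                # illustrative constants

for step in range(400):
    for i, w in enumerate(replicas):
        idx = rng.integers(0, len(X), size=16)
        Xb, yb = X[idx], y[idx]
        grad = 2 * Xb.T @ (Xb @ w - yb) / 16
        corr = alpha * (w - z)               # correction towards the average
        replicas[i] = w - lr * grad - corr
    # momentum-accelerated trajectory of the average model
    v = mu * v + alpha * sum(w - z for w in replicas)
    z = z + v
```

Unlike a parameter server, no replica is ever overwritten by the global state: replicas keep exploring independently while the corrections keep them from diverging.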


SLIDE 26

GPUs with Synchronous Model Averaging

[Figure: GPUs 1, 2, and 3, each running several learners with their own model replicas]

  • Synchronously apply corrections to model replicas


SLIDE 27

GPUs with Synchronous Model Averaging

[Figure: as before, but each GPU now keeps a reference model alongside its learner replicas, and an average model is maintained across GPUs]

  • Synchronously apply corrections to model replicas


SLIDE 28

GPUs with Synchronous Model Averaging

[Figure: each GPU's learner replicas synchronise through its reference model and the shared average model]

Synchronous Model Averaging

  • Ensures consistent view of average model
  • Takes GPU bandwidth into account during synchronisation


SLIDE 29

(1) How to increase efficiency with small batches? → Train multiple model replicas per GPU
(2) How to synchronise model replicas? → Use synchronous model averaging
(3) How to reduce scheduling and synchronisation overheads?

[Figure: tasks on one GPU drawing from a pool of reusable data buffers; model replicas on GPUs 1 and 2 synchronised with a global model]

SLIDE 30

Crossbow Architecture

[Figure: Crossbow architecture; data pre-processors read the dataset and fill input batches; a task scheduler takes dataflows from ready queues and dispatches them to learner and synchronisation streams over the model replicas on GPUs 1 and 2, guided by an auto-tuner]

  • ~24K lines of Java and ~15K lines of C/C++
  • Integration with TensorFlow
  • Open source: github.com/lsds/crossbow

SLIDE 31

Efficient Task Scheduling

  • Execute compute and synchronisation tasks
  • Fine-grained concurrency
  • Need efficient scheduler to feed all GPUs with tasks

[Figure: the task scheduler, driven by multiple threads, takes input batches from the data pre-processors and dataflows from ready queues, and assigns them to model replicas and learner streams, guided by the auto-tuner]

SLIDE 32

Interleaving Compute & Synchronisation Tasks

[Figure: timeline of interleaved tasks on GPUs 1 and 2 across batches N, N+1, N+2; worker threads schedule compute-gradient, average-gradients, elastic-average, and update-replica tasks from replica and synchronisation queues; new replica queues are created on the fly, the next available replica is scheduled as soon as a GPU is ready to compute another gradient, the device model is updated via mapped memory, and a monitor feeds the auto-tuner]

SLIDE 33

Auto-Tuning the Number of Model Replicas

  • Monitor training throughput
  • Dynamically adjust number of learners
  • Uses object pooling & lazy materialization

[Figure: the auto-tuner adjusts the number of model replicas and learner streams fed by the task manager, input batches, and data pre-processors]
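A sketch of the auto-tuning idea: increase the number of learners while measured throughput improves, and settle just before the first drop. The policy and all names here are illustrative, not Crossbow's actual algorithm:

```python
def tune_learners(measure, max_learners=8):
    """Grow the learner count while training throughput keeps improving;
    stop at the count just before throughput first drops."""
    best_n, best_tp = 1, measure(1)
    for n in range(2, max_learners + 1):
        tp = measure(n)
        if tp <= best_tp:
            break                    # more learners no longer help
        best_n, best_tp = n, tp
    return best_n

# Hypothetical throughput curve (images/sec) that peaks at 4 learners
curve = {1: 100, 2: 180, 3: 240, 4: 260, 5: 250, 6: 230, 7: 220, 8: 210}
chosen = tune_learners(curve.__getitem__)
```

In the real system the measurements arrive online during training, so the tuner must also be able to shrink the pool again; object pooling and lazy materialization keep adding and removing replicas cheap.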

SLIDE 34

Experimental Evaluation


SLIDE 35

Does Crossbow Train Effectively with Small Batch Sizes?

Multiple learners per GPU improve hardware efficiency

[Chart: time to test accuracy (sec) for TensorFlow (b=64 and b=256) vs. Crossbow (b=64 with 1 learner and b=64 with 4 learners)]

  • ResNet-32 with ImageNet dataset on 1 Titan X GPU

SLIDE 36

Does Crossbow Scale to Multiple GPUs?

[Chart: time to test accuracy, TTA(53%) (sec), vs. number of GPUs for TensorFlow, Crossbow with one learner (m=1), and Crossbow]

Synchronous model averaging improves statistical efficiency

  • ResNet-50 with ImageNet dataset on 8 Titan X GPUs

SLIDE 37

Does Crossbow Train Effectively Across Models?

Training with multiple learners always better than training with large batches

[Chart: speed-up over TensorFlow on 1 GPU and 8 GPUs for LeNet, ResNet-32, VGG, ResNet-50, and ResNet-101; speed-ups range from 1.2x to 4.3x]

SLIDE 38

  • ResNet-50 with ImageNet dataset on Titan X GPUs

What is the Statistical Efficiency with Many Learners?

[Chart: test accuracy (%) vs. epochs for m=1, 2, 4, 8, 16, and 32 learners]

Increasing parallelism has less effect on statistical efficiency

SLIDE 39

Crossbow: Scaling GPU Deep Learning

  • Need to make training throughput independent from hyper-parameters
    – Rethink the design of future deep learning systems
  • Crossbow: scaling DNN training with small batch sizes on many GPUs
    – Multiple model replicas per GPU for high hardware efficiency
    – Synchronous model averaging for high statistical efficiency
  • Exciting research challenges for next-generation deep learning systems


Peter Pietzuch https://lsds.doc.ic.ac.uk — prp@imperial.ac.uk

Thank You — Any Questions?

github.com/lsds/crossbow