IBM Research AI
Efficient Communication Library for Large-Scale Deep Learning
Mar 26, 2018 Minsik Cho (minsikcho@us.ibm.com)
Deep Learning Is Changing Our Lives
– Automotive/transportation
– Security/public safety
– Medicine and Biology
– Media and Entertainment
– Consumer Web, Mobile, Retail
Deep Learning Workflow
[Diagram: Training — training data (grouped in large minibatches) flows through forward and backward passes that compute the error; the loop advances to the next minibatch and the next epoch until it yields a trained model. Inference — a single forward pass on input data produces an application-dependent action. A minimal training-loop sketch in code follows this slide.]
– Latency to model: typically days to train complex models; limited by training compute throughput
– Latency to action: typically milliseconds to complete the full inference workflow; limited by the latency of batching (to enable efficient inference) + inference compute + the resultant action
[Diagram: Inference pipeline — individual input data (e.g., from microservices) is batched (smaller, varied batch sizes; application-dependent), run through inference, with model conversion/retraining needed if training and inference precisions differ.]
Training latency (latency to model) is my focus.
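To make the training loop above concrete, here is a minimal sketch in plain Python. It is illustrative only: the toy linear model, learning rate, and synthetic data are my stand-ins, not anything from this deck.

```python
# Minimal sketch of the minibatch training loop (illustrative stand-in).
import random

def forward(w, x):              # toy linear model: prediction = w * x
    return w * x

def backward(w, x, y):          # gradient of squared error w.r.t. w
    return 2 * (forward(w, x) - y) * x

data = [(x, 3.0 * x) for x in range(100)]   # synthetic samples, target w = 3
w, lr, batch_size = 0.0, 1e-4, 32

for epoch in range(10):                         # "next epoch"
    random.shuffle(data)
    for i in range(0, len(data), batch_size):   # "next minibatch"
        batch = data[i:i + batch_size]
        grad = sum(backward(w, x, y) for x, y in batch) / len(batch)
        w -= lr * grad                          # error drives the update
print("trained weight:", w)                     # -> close to 3.0
```

Each pass over all minibatches is one epoch; the error from the forward pass drives the backward pass and the weight update.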
Advances in Computation for Deep Learning
[Charts: growth in compute performance for deep learning across CPU/GPU/FPGA (sources: P. Goldsborough; MichaelGalloy.com)]
Motivation: OK, ever-faster computation. Is this enough?
– Batch size = 32 (limited by GPU memory)
– Iteration time = 300 ms
– #iterations per epoch = 38,000
– Total training time for 100 epochs = 13.2 days (see the arithmetic check below)
– Total training time for 100 epochs = 35.2 days
– 1.2M samples is still toy scale
– Computation scaling alone is not fast enough
– I cannot wait for days to get my model trained!
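The 13.2-day figure follows from simple arithmetic (38,000 iterations per epoch is roughly 1.2M samples / batch size 32):

```python
# Back-of-the-envelope check of the training-time numbers above.
iteration_time_s = 0.300            # 300 ms per iteration
iters_per_epoch = 38_000            # ~1.2M samples / batch size 32
epochs = 100

total_s = iteration_time_s * iters_per_epoch * epochs
print(f"{total_s / 86_400:.1f} days")   # -> 13.2 days
```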
Faster Training Time with Distributed Deep Learning
Learning runs with POWER8
What will you do? Iterate more and create more accurate models? Create more models? Both?
[Timeline: twelve consecutive 4-hour learning runs]
Distributed Deep Learning
[Diagram: parallelization approaches (source: P. Goldsborough)]
– Data parallelism via parameter server: exchanges gradients/weights (10 MB–1 GB)
– Model parallelism: complex partitioning; anything may need to be exchanged
– Data parallelism via allreduce (simulated in the sketch below)
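The allreduce variant is the one DDL builds on later in this deck. A single-process simulation of the ring allreduce idea (the bandwidth-optimal algorithm popularized by Baidu, mentioned on the results slide) shows why it scales: each learner moves only about 2(p-1)/p of the gradient vector, nearly independent of p. This simulation is my sketch, not DDL's implementation:

```python
import numpy as np

def ring_allreduce(grads):
    """Simulate ring allreduce on one process: every learner ends up
    with the elementwise sum of all p gradients. Per learner, only
    ~2*(p-1)/p of the vector crosses the wire, independent of p."""
    p = len(grads)
    chunks = [np.array_split(g.astype(float), p) for g in grads]

    # Reduce-scatter: after p-1 ring steps, learner i holds the
    # complete sum of chunk (i+1) % p.
    for s in range(p - 1):
        for i in range(p):
            c = (i - s - 1) % p
            chunks[i][c] = chunks[i][c] + chunks[(i - 1) % p][c]

    # Allgather: circulate each completed chunk around the ring.
    for s in range(p - 1):
        for i in range(p):
            c = (i + 1 - s) % p
            chunks[(i + 1) % p][c] = chunks[i][c].copy()

    return [np.concatenate(ch) for ch in chunks]

# 4 learners with constant gradients 1, 2, 3, 4 -> every learner gets 10.
out = ring_allreduce([np.full(8, i + 1) for i in range(4)])
assert all(np.allclose(o, 10.0) for o in out)
```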
Communication: Overhead
[Diagram: each learner processes its own 32-image minibatch in parallel, then all learners sync gradients before the next iteration.]
– Computation cost remains constant
– Communication cost increases with more learners/GPUs (toy cost model below)
Remedies:
– Increase computation
– Faster communication
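A toy cost model makes the tension explicit. The numbers below are my illustrative assumptions, not IBM data: 300 ms compute from the earlier slide, 400 MB of gradients (within the 10 MB–1 GB range above), 10 GB/s links, and the standard ring-allreduce bandwidth term 2(p-1)/p · N/B:

```python
# Toy per-iteration cost model (illustrative assumptions, not IBM data).
def iteration_time(p, compute_s=0.300, grad_bytes=400e6,
                   bw_bytes_s=10e9, latency_s=50e-6):
    # Ring allreduce: ~2*(p-1)/p of the gradients per learner, plus
    # 2*(p-1) latency-bound steps that grow with the learner count.
    allreduce_s = (2 * (p - 1) / p) * grad_bytes / bw_bytes_s \
                  + 2 * (p - 1) * latency_s
    return compute_s + allreduce_s

for p in (1, 4, 16, 64):
    t = iteration_time(p)
    print(f"{p:3d} learners: {t * 1e3:6.1f} ms/iter, "
          f"efficiency {0.300 / t:.0%}")
```

Compute stays at 300 ms while the synchronization term keeps growing; that growing term is exactly the overhead the following slides attack.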
Advances in Communication for Deep Learning
– Computation is still ahead
– Data perhaps grows much faster
Designing Large-scale Deep Learning
Computation:
– Model depth
– GPU throughput
– Faster algorithms
– Mini-batch size
Communication:
– Gradient count
– Network BW
– Faster algorithms
The goal is a good balance between the two.
IBM PowerAI DDL (Distributed Deep Learning Library)
– MPI-like interface for easy integration (integration sketch below)
– Enables deep learning software to scale to 100s of servers with CPUs/GPUs
– Works across a variety of system sizes
– Works with a variety of network types and switch topologies
– Plans an efficient communication pattern over a hierarchical network environment
– Actual point-to-point data transfer via NCCL or MPI
– Supported: Caffe, TensorFlow, Chainer, Torch
– Ongoing: Caffe2, PyTorch, Keras (TF backend)
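The deck does not show DDL's actual API, but an "MPI-like interface" for data-parallel training typically boils down to one collective per iteration. Here is a hedged stand-in using mpi4py (my choice of library, not DDL's real interface):

```python
# Hypothetical integration sketch; mpi4py stands in for DDL's
# MPI-like interface, which this deck does not show.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
size = comm.Get_size()

grad = np.random.rand(1_000_000)           # this learner's local gradients
summed = np.empty_like(grad)

comm.Allreduce(grad, summed, op=MPI.SUM)   # one collective per iteration
avg = summed / size                        # identical update on every learner

if comm.Get_rank() == 0:
    print(f"averaged gradients across {size} learners")
```

Launched as, e.g., `mpirun -np 16 python train.py`; the framework integrations listed above hide this call inside their gradient-exchange hooks.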
DDL: Topology-aware communication
[Diagram: hosts A, B, C, D connected through switch0 and switch1; max bandwidth 10 Gbytes/sec, max sustained bandwidth 100 Gbytes/sec]
– A, B, C, D broadcast to all others
– Suffers from congestion
– Suffers from lower BW
[Naive schedule: A->B B->C C->D C->D B->C D->A C->D D->A A->B A->B D->A B->C]
DDL: Topology-aware communication
[Diagram: hosts A, B, C, D in box0–box3, attached across switch0 and switch1]
– System-specific network
– Application-specific traffic
– To minimize bus contention
– To minimize crossing of lower-BW links (counted in the sketch below)
[Suboptimal schedule: A->B B->C C->D C->D B->C D->A C->D D->A A->B A->B D->A B->C
Optimal schedule: B->C D->A A->D C->B A->B C->D B->A D->C C->D A->B D->C B->A]
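To see why the second schedule is better, one can simply count how many transfers cross the slower inter-switch link. A small sketch, assuming A and B sit under switch0 and C and D under switch1 (my reading of the diagram, not stated explicitly in the deck):

```python
# Score transfer schedules by slow cross-switch hops (illustrative;
# the A,B | C,D switch placement is my assumption from the diagram).
SWITCH = {"A": 0, "B": 0, "C": 1, "D": 1}

def cross_switch(schedule):
    pairs = [hop.split("->") for hop in schedule.split()]
    return sum(1 for src, dst in pairs if SWITCH[src] != SWITCH[dst])

suboptimal = "A->B B->C C->D C->D B->C D->A C->D D->A A->B A->B D->A B->C"
optimal    = "B->C D->A A->D C->B A->B C->D B->A D->C C->D A->B D->C B->A"

print(cross_switch(suboptimal), cross_switch(optimal))  # -> 6 4
```

Same number of transfers in both schedules, but under this placement the optimal one sends a third fewer of them over the slow inter-switch links.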
DDL: Problem Definition and Solution
– Input: the network topology with its various bandwidths
– Formulated as a min-cost multi-commodity flow problem (formulation sketched below)
– NP-hard in general, but easily solved when the graph is small (i.e., 4 vertices)
– If the cluster/cloud provides such a topology, DDL performs very well
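The deck does not spell the formulation out; for reference, a textbook min-cost multi-commodity flow model (my sketch of what is likely meant, not DDL's exact model) looks like:

```latex
% Graph G=(V,E) with edge costs w_e and capacities (bandwidths) c_e;
% commodity k ships demand d_k from s_k to t_k; \delta^{+}(v) and
% \delta^{-}(v) are the edges leaving and entering vertex v.
\begin{align*}
\min\; & \sum_{e \in E} \sum_{k} w_e f_e^k \\
\text{s.t.}\;
 & \sum_{k} f_e^k \le c_e && \forall e \in E \quad \text{(bandwidth)} \\
 & \sum_{e \in \delta^{+}(v)} f_e^k - \sum_{e \in \delta^{-}(v)} f_e^k
   = \begin{cases} d_k & v = s_k \\ -d_k & v = t_k \\ 0 & \text{otherwise} \end{cases}
   && \forall v \in V,\ \forall k \quad \text{(conservation)} \\
 & f_e^k \ge 0 && \forall e, k
\end{align*}
```

The fractional LP is polynomially solvable; it is the integral (unsplittable-flow) version that is NP-hard, which is harmless here since the graphs are tiny (e.g., 4 vertices).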
DDL: How well it performs on Caffe2
– 3 racks with 16 hosts each, connected through 10 GB/s IB
– Each host has 4 P100-SXM2 GPUs with CUDA 8 and cuDNN 5
Baselines (timing-harness sketch below):
– MPI_Allreduce
– Ring (allreduce from Baidu, Feb 2017)
– GLOO (from Facebook): NCCL + ib_verbs
[Charts: DDL scaling results vs. the baselines above]
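For context on how such comparisons are typically measured, here is a minimal allreduce timing harness with mpi4py (my sketch, not the benchmark IBM used):

```python
# Minimal allreduce timing harness (illustrative, not IBM's benchmark).
import time
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
buf = np.random.rand(25_000_000)     # ~200 MB of float64 "gradients"
out = np.empty_like(buf)
reps = 10

comm.Barrier()                       # start all learners together
t0 = time.perf_counter()
for _ in range(reps):
    comm.Allreduce(buf, out, op=MPI.SUM)
comm.Barrier()
avg_s = (time.perf_counter() - t0) / reps

if comm.Get_rank() == 0:
    print(f"{avg_s * 1e3:.1f} ms per {buf.nbytes / 1e9:.1f} GB allreduce")
```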
Comparison with NCCL 2.1.x Allreduce (POWER)
– Exploiting in-system topology
– Exploiting in/cross-system topology
Comparison with NCCL 2.1.x Allreduce (X86)
– No in-system topology
– Exploiting cross-system topology
Conclusion
– Network hierarchy
– Multi-tier bandwidth
BACKUP