SLIDE 1

IBM Research AI

Efficient Communication Library for Large-Scale Deep Learning

Mar 26, 2018 Minsik Cho (minsikcho@us.ibm.com)

SLIDE 2

Deep Learning Is Changing Our Life

  • Automotive/transportation
  • Security/public safety
  • Medicine and Biology
  • Media and Entertainment
  • Consumer Web, Mobile, Retail

SLIDE 3

Deep Learning Workflow

[Workflow diagram, reconstructed]

Training: training data (grouped in large minibatches) is fed through a forward/backward loop that computes the error, advancing to the next minibatch and the next epoch until a trained model emerges.

  • Latency to model: typically days to train complex models
  • Limited by training compute throughput

Inference: individual input data (e.g., from microservices) is batched, run through a forward pass, and converted into an application-dependent action. Batch sizes are smaller and varied, and application-dependent. Model conversion/retraining is needed if the training and inference precisions differ.

  • Latency to action: typically ms to complete the full inference workflow
  • Limited by the latency of batching (to enable efficient inference) + inference compute + the resultant action

The training side is my focus.

SLIDE 4

Advances in Computation for Deep Learning

  • 10-100 TFLOPS
  • Very good scaling over the last 15 years

[Charts: compute performance trends for GPU/FPGA; sources: P. Goldsborough, MichaelGalloy.com]

SLIDE 5

Motivation: OK, we have ever-faster computation. Is that enough?

  • ImageNet1K: 1.2M images, 1K classes, Resnet101
    – Batch size = 32 (limited by GPU memory)
    – Iteration time = 300 ms
    – #iterations per epoch = 38,000
    – Total training time for 100 epochs = 13.2 days (sanity-checked below)
  • ImageNet22K: 7.5M images, 22K classes, Resnet101
    – Total training time for 100 epochs = 35.2 days
  • No, it is NOT
    – 1.2M samples are still at toy scale
    – Computation scaling is not fast enough for the data explosion and growing model complexity
    – Innovation will take too long, or may even stop at some point
    – I cannot wait for days to get my model trained!
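The ImageNet1K wall-clock estimate follows directly from the numbers above; a quick sanity check in Python:

```python
# Sanity check of the ImageNet1K/Resnet101 estimate above (all inputs from the slide).
images = 1_200_000      # ImageNet1K training set
batch_size = 32         # limited by GPU memory
iter_time_s = 0.300     # 300 ms per iteration
epochs = 100

iters_per_epoch = images / batch_size                         # 37,500 ~ the "38,000" on the slide
total_days = iters_per_epoch * iter_time_s * epochs / 86_400  # seconds per day
print(f"{iters_per_epoch:,.0f} iterations/epoch -> {total_days:.1f} days")
# 37,500 iterations/epoch -> 13.0 days (13.2 with the rounded 38,000 figure)
```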

SLIDE 6

Faster Training Time with Distributed Deep Learning

[Figure: a 9-day image-recognition training run vs. many 4-hour distributed learning runs with Power8, a 54x speedup]

What will you do? Iterate more and create more accurate models? Create more models? Both?

SLIDE 7

Distributed Deep Learning

[Diagrams after P. Goldsborough: parameter-server vs. allreduce data parallelism, and model parallelism]

  • Data parallelism with a parameter server: gradients/weights (10MB-1GB) flow through central servers
  • Data parallelism with allreduce: gradients/weights (10MB-1GB) are exchanged directly among learners (a minimal sketch follows below)
  • Model parallelism: complex partitioning; anything may need to be communicated
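To make the allreduce flavor concrete, here is a minimal sketch of one data-parallel training loop using mpi4py; this is my illustration of the general pattern, not DDL's API:

```python
# Minimal sketch of data-parallel SGD with allreduce (illustrative; real systems
# use NCCL/MPI/DDL under the hood). Run with e.g. `mpirun -np 4 python train.py`.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
nlearners = comm.Get_size()

weights = np.zeros(1000, dtype=np.float32)   # toy model: one flat weight vector
lr = 0.01

def local_gradient(w):
    # Stand-in for forward/backward on this learner's own minibatch.
    return np.random.randn(*w.shape).astype(np.float32)

for step in range(100):
    grad = local_gradient(weights)
    avg = np.empty_like(grad)
    comm.Allreduce(grad, avg, op=MPI.SUM)    # sum gradients across all learners
    weights -= lr * (avg / nlearners)        # every learner applies the same update
```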

SLIDE 8

Communication : Overhead

[Figure: each learner computes on its own 32-image minibatch, then all learners sync; adding learners adds more 32-image minibatches per sync]

  • In weak scaling
    – The computation cost per learner remains constant
    – The communication cost increases with more learners/GPUs
  • The computation/communication ratio is the key to large-scale deep learning (a toy cost model follows below)
    – Increase computation per learner
    – Make communication faster
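A back-of-the-envelope model of why the ratio degrades under weak scaling; the gradient size, bandwidth, and compute time below are assumptions chosen for illustration, and the ring-allreduce column shows how a smarter collective shrinks the cost:

```python
# Toy weak-scaling cost model: compute per iteration stays constant, but a naive
# gradient exchange (every learner sends its full gradients to every other one)
# grows linearly with the learner count. All constants are illustrative.
GRAD_BYTES = 100e6   # 100 MB of gradients (within the 10MB-1GB range on slide 7)
BW = 10e9            # 10 GB/s link bandwidth (assumed)
COMPUTE_S = 0.300    # fixed per-iteration compute time under weak scaling

for p in (2, 4, 16, 64):
    naive_s = (p - 1) * GRAD_BYTES / BW            # naive exchange, serialized per link
    ring_s = 2 * (p - 1) / p * GRAD_BYTES / BW     # bandwidth-optimal ring allreduce
    eff = COMPUTE_S / (COMPUTE_S + naive_s)        # fraction of time spent computing
    print(f"{p:3d} learners: naive {naive_s*1e3:6.1f} ms (compute share {eff:.0%}), "
          f"ring {ring_s*1e3:5.1f} ms")
```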

SLIDE 9

Advances in Communication for Deep Learning

  • Still scaling, but not fast enough
    – Computation is still ahead
    – Data perhaps grows much faster

SLIDE 10

Designing Large-scale Deep Learning

  • Model/Application
    – Deeper/wider model to increase compute time
    – Smaller gradient count to reduce communication time
  • System
    – Balance network and computing resources
    – Select the mini-batch size to adjust the compute/communication ratio (see the sketch below)
    – A larger mini-batch size lowers the share of time spent communicating
    – Too big a mini-batch size can hurt convergence and accuracy
    – Network-topology-aware communication

[Figure: a good balance between computation (model depth, GPU throughput, faster algorithms, mini-batch size) and communication (gradient count, network bandwidth, faster algorithms)]
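The mini-batch knob is easy to quantify. A minimal sketch, assuming the per-image compute time implied by slide 5 and a fixed (assumed) 30 ms gradient exchange per iteration:

```python
# Illustrative mini-batch trade-off: compute grows with the mini-batch size while
# the gradient volume exchanged per iteration stays fixed.
PER_IMAGE_S = 0.300 / 32   # per-image compute time from slide 5 (300 ms / 32 images)
COMM_S = 0.030             # assumed fixed 30 ms gradient exchange per iteration

for mbs in (32, 64, 128, 256):
    compute_s = PER_IMAGE_S * mbs
    comm_share = COMM_S / (compute_s + COMM_S)
    print(f"mini-batch {mbs:3d}: compute {compute_s*1e3:6.1f} ms, "
          f"communication share {comm_share:.0%}")
```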

SLIDE 11

IBM PowerAI DDL (Distributed Deep Learning Library)

  • Collective communication library for distributed deep learning
    – MPI-like interface for easy integration (see the sketch after this list)
    – Enables deep learning software to scale to 100s of servers with CPUs/GPUs
    – Works across a variety of system sizes
    – Works with a variety of network types and switch topologies
  • DDL orchestrates the data communication
    – Plans an efficient communication pattern for a hierarchical network environment
    – Actual point-to-point data transfer is via NCCL or MPI
  • Currently integrated into
    – Supported: Caffe, Tensorflow, Chainer, Torch
    – Ongoing: Caffe2, PyTorch, Keras (TF-backend)
  • Currently US patent-pending
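Since the interface is MPI-like, integration into a framework can be imagined as swapping the gradient-sync call. Everything below is hypothetical, invented for illustration; DDL's real API may differ:

```python
# Hypothetical integration sketch: what an MPI-like DDL interface could look like
# from a framework's point of view. The class and its methods are invented for
# illustration only -- they are NOT DDL's actual API.
import numpy as np

class FakeDDL:
    """Stand-in for an MPI-like communication backend (single-process demo)."""
    def init(self): pass
    def num_learners(self): return 1
    def allreduce(self, buf):       # in-place sum across learners, MPI-like semantics
        pass                        # a real backend would exchange `buf` here

ddl = FakeDDL()
ddl.init()

def sync_gradients(grads):
    """Average each gradient array across all learners after backprop."""
    for g in grads:
        ddl.allreduce(g)
        g /= ddl.num_learners()
    return grads

grads = [np.ones(4, dtype=np.float32)]
sync_gradients(grads)               # with one learner this is a no-op average
```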
SLIDE 12

DDL : Topology-aware communication

[Figure: learners A, B, C, D attached under switch0 and switch1; link labels: max bandwidth 10 Gbytes/sec, max sustained bandwidth 100 Gbytes/sec]

  • Example: A, B, C, D each broadcast to all the others
    – A naive schedule (A->B, B->C, C->D, D->A, ...) suffers from congestion
    – It also suffers from repeatedly crossing the lower-bandwidth links

SLIDE 13

DDL : Topology-aware communication

[Figure: learners A, B, C, D in box0-box3 under switch0 and switch1, showing a suboptimal and an optimal transfer schedule side by side]

  • It's a mapping problem
    – System-specific network
    – Application-specific traffic
  • DDL schedules differently (a hierarchical sketch follows below)
    – To minimize bus contention
    – To minimize crossings of the lower-bandwidth links
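One common topology-aware pattern, shown here as my illustration of the general idea rather than DDL's actual algorithm, is a hierarchical allreduce: reduce within each box over the fast local links, allreduce once among box leaders over the slow cross-switch links, then broadcast back inside each box:

```python
# Hierarchical allreduce sketch (illustrative, not DDL's actual algorithm).
# Learners in the same box reduce locally first, one leader per box does the
# cross-switch allreduce, then the result is broadcast back inside each box.
import numpy as np
from mpi4py import MPI

world = MPI.COMM_WORLD
GPUS_PER_BOX = 4                                  # assumed box size
box_id = world.Get_rank() // GPUS_PER_BOX

# Split into per-box communicators (fast links) and a leaders-only one (slow links).
box = world.Split(color=box_id, key=world.Get_rank())
is_leader = (box.Get_rank() == 0)
leaders = world.Split(color=0 if is_leader else MPI.UNDEFINED, key=box_id)

grad = np.random.randn(1000).astype(np.float32)
local_sum = np.empty_like(grad)
box.Reduce(grad, local_sum, op=MPI.SUM, root=0)   # step 1: in-box reduce (fast)
if is_leader:
    global_sum = np.empty_like(grad)
    leaders.Allreduce(local_sum, global_sum, op=MPI.SUM)  # step 2: cross-switch, once
    local_sum = global_sum
box.Bcast(local_sum, root=0)                      # step 3: in-box broadcast (fast)
grad = local_sum / world.Get_size()               # averaged gradient, identical everywhere
```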

SLIDE 14

DDL : Problem Definition and Solution

  • Assumption
    – A network topology with various bandwidths
  • Problem definition
    – A min-cost multi-commodity flow problem
    – NP-hard in general, but easily solved when the graph is small (e.g., 4 vertices); a brute-force sketch follows below
  • DDL solves a typical case/topology offline
    – If the cluster/cloud provides such a topology, DDL performs very well
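To see why small instances are tractable, here is a brute-force sketch. It is a deliberate simplification of the min-cost multi-commodity flow formulation: it merely enumerates ring orderings of 4 learners and picks one that minimizes crossings of the slow cross-switch link:

```python
# Brute-force sketch: choose the ring ordering of 4 learners that minimizes
# traversals of the slow cross-switch link. A toy stand-in for the min-cost
# multi-commodity flow problem, feasible only at tiny graph sizes.
from itertools import permutations

SWITCH = {"A": 0, "B": 0, "C": 1, "D": 1}   # which switch each learner hangs off

def slow_crossings(ring):
    # Count ring edges whose endpoints sit under different switches.
    return sum(SWITCH[a] != SWITCH[b]
               for a, b in zip(ring, ring[1:] + ring[:1]))

best = min(permutations("ABCD"), key=slow_crossings)
print(best, slow_crossings(best))   # e.g. ('A', 'B', 'C', 'D') crosses only twice
```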

SLIDE 15

DDL : How well it performs on Caffe2

  • 48 IBM S822LC servers running PPC64LE RHEL
    – 3 racks with 16 hosts each, connected through 10 GB/s InfiniBand
    – Each host has 4 P100-SXM2 GPUs with CUDA 8, CUDNN 5
  • Comparing algorithms on Resnet50 + Imagenet1K (preloaded to RAMDisk), mini-batch size = 32
    – MPI_Allreduce
    – Ring (the allreduce from Baidu, Feb 2017)
    – GLOO (from Facebook): NCCL + ib_verb

[Charts: scaling comparison, with the DDL results marked on each]

SLIDE 16

Comparison with NCCL 2.1.x Allreduce (POWER)

  • IBM P9 Newell Systems (NVLink) with V100s
  • 100Gbps InfiniBand

[Charts: "Exploiting in-system topology" vs. "Exploiting in/cross-system topology"]

SLIDE 17

Comparison with NCCL 2.1.x Allreduce (X86)

  • X86 Systems (PCIe) with P100s
  • 10Gbps Ethernet

[Charts: "No in-system topology" vs. "Exploiting cross-system topology"]

SLIDE 18

Conclusion

  • DDL is a topology-aware communication library in PowerAI
  • DDL delivers industry-best performance by exploiting
    – The network hierarchy
    – Multi-tier bandwidth
  • DDL is well suited to common distributed training in cloud environments
SLIDE 19

BACKUP