  1. IBM Research AI. Efficient Communication Library for Large-Scale Deep Learning. Mar 26, 2018. Minsik Cho (minsikcho@us.ibm.com)

  2. Deep Learning Changing Our Life: automotive/transportation, medicine and biology, security/public safety, consumer web/mobile/retail, media and entertainment

  3. IBM Deep Learning Workflow
     • Training (this is my focus): latency to model is typically days to train complex models, limited by training compute throughput. Training loops forward and backward passes over large minibatches, minibatch after minibatch and epoch after epoch; conversion/retraining is needed if the training and inference precisions differ.
     • Inference: input data (smaller, varied, application-dependent batch sizes, e.g., from microservices) is batched, run through a forward pass, and turned into an action. Latency to action is typically milliseconds to complete the full inference workflow, limited by the latency of batching (to enable efficient inference) plus inference compute plus the resultant action.

  4. Advances in Computation for Deep Learning [P. Goldsborough] [MichaelGalloy.com]
     • GPU/FPGA accelerators deliver 10-100 TFLOPS
     • Very good scaling for the last 15 years

  5. Motivation: OK, ever-faster computation. Is this enough?
     • ImageNet1K: 1.2M images, 1K classes, ResNet-101
       – Batch size = 32 (limited by GPU memory)
       – Iteration time = 300 ms
       – Iterations per epoch = 38,000
       – Total training time for 100 epochs = 13.2 days (see the worked numbers below)
     • ImageNet22K: 7.5M images, 22K classes, ResNet-101
       – Total training time for 100 epochs = 35.2 days
     • No, it is NOT enough
       – 1.2M samples are still at toy scale
       – Computation scaling is not keeping up with the data explosion and model complexity
       – Innovation will take too long, or even stop at some point
       – I cannot wait for days to get my model trained!
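
To make the slide's arithmetic concrete, here is a minimal back-of-the-envelope check in plain Python, using only the numbers quoted above for the ImageNet1K case (single learner, ResNet-101, batch size 32); it reproduces the roughly 13-day figure.

```python
# Back-of-the-envelope training-time estimate using the ImageNet1K numbers
# quoted on the slide: 1.2M images, batch size 32, 300 ms per iteration.
images = 1_200_000           # ImageNet1K training images
batch_size = 32              # limited by GPU memory
iter_time_s = 0.300          # seconds per iteration (forward + backward + update)
epochs = 100

iters_per_epoch = images / batch_size            # ~37,500 (the slide rounds to 38,000)
total_days = iters_per_epoch * iter_time_s * epochs / 86_400
print(f"{iters_per_epoch:,.0f} iterations/epoch, {total_days:.1f} days for {epochs} epochs")
# prints roughly 13 days, in line with the ~13.2 days on the slide
```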

  6. Faster Training Time with Distributed Deep Learning
     • What will you do when an image-recognition learning run with POWER8 that used to take 9 days finishes in 4 hours (a 54x speedup)?
       – Iterate more and create more accurate models?
       – Create more models?
       – Both?

  7. Distributed Deep Learning
     • Gradients/weights exchanged per step are large: 10 MB-1 GB [P. Goldsborough]
     • Data parallelism with a parameter server
     • Model parallelism (complex partitioning)
     • Data parallelism with allreduce (a minimal sketch follows below)
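
As a minimal illustration of the allreduce flavor of data parallelism (a toy NumPy simulation of the learners, not DDL's or any framework's actual API; sizes are made up): each learner computes gradients on its own minibatch, the gradients are summed and averaged across learners, and every replica applies the identical update.

```python
import numpy as np

# Toy simulation of allreduce-based data parallelism with 4 learners.
num_learners = 4
num_params = 1_000_000       # stand-in for the model's gradient/weight buffer
lr = 0.01

weights = np.zeros(num_params, dtype=np.float32)              # replicated on every learner
local_grads = [np.random.randn(num_params).astype(np.float32)
               for _ in range(num_learners)]                   # one gradient per learner

# The "allreduce": sum everyone's gradients, then average.
avg_grad = np.sum(local_grads, axis=0) / num_learners

# Every learner applies the same averaged update, so the replicas stay in sync.
weights -= lr * avg_grad
print("update norm:", np.linalg.norm(lr * avg_grad))
```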

  8. Communication: Overhead
     • Each learner processes its own 32-image minibatch, then all learners synchronize gradients before the next minibatch
     • In weak scaling
       – Computation cost remains constant
       – Communication cost increases with more learners/GPUs
     • The computation/communication ratio is the key for large-scale deep learning (a rough cost model follows below)
       – Increase computation
       – Faster communication
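
A rough cost model with assumed numbers (a 100 MB gradient buffer, a 10 Gb/s link, and the naive every-learner-broadcasts-to-all pattern that appears in the topology example later in the deck) shows how the communication share grows with the learner count while per-learner compute stays flat:

```python
# Weak-scaling sketch with assumed numbers: per-learner compute is fixed
# (same 32-image minibatch per GPU), while a naive scheme where every
# learner broadcasts its gradients to all others sends (N - 1) * grad_bytes
# per learner, so communication grows with the number of learners.
grad_bytes = 100e6           # 100 MB of gradients (within the 10 MB-1 GB range above)
link_bw = 10e9 / 8           # 10 Gb/s link, in bytes per second
compute_s = 0.300            # fixed per-iteration compute time

for n in (2, 4, 8, 16):
    comm_s = (n - 1) * grad_bytes / link_bw
    share = comm_s / (compute_s + comm_s)
    print(f"N={n:2d}: compute {compute_s:.2f}s, comm {comm_s:.2f}s, comm share {share:.0%}")
```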

  9. Advances in Communication for Deep Learning
     • Still scaling, but not fast enough
       – Computation is still ahead
       – Data perhaps grows much faster

  10. Designing Large-scale Deep Learning
      • Balance computation (model depth, GPU throughput, faster algorithms) against communication (gradient count, network bandwidth), with the mini-batch size as the knob between them
      • Model/Application
        – Deeper/wider model to increase compute time
        – Smaller gradient count to reduce communication time
      • System
        – Balance network and computing resources
        – Select the mini-batch size to adjust the ratio: a larger mini-batch lowers it, but too big a mini-batch can hurt convergence and accuracy (see the sketch below)
        – Network-topology-aware communication
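
Extending the same assumed numbers, the sketch below illustrates the mini-batch knob: the gradient volume exchanged per iteration is fixed by the model's parameter count, while compute grows with the per-GPU mini-batch, so a larger mini-batch lowers the communication-to-computation ratio.

```python
# Illustration with assumed numbers: communication per iteration is fixed by
# the gradient size, compute scales with the mini-batch size, so a larger
# per-GPU mini-batch lowers the communication/computation ratio.
grad_bytes = 100e6                  # fixed per-iteration gradient volume
link_bw = 10e9 / 8                  # 10 Gb/s link, in bytes per second
compute_per_image_s = 0.300 / 32    # derived from the 300 ms / 32-image numbers above

comm_s = grad_bytes / link_bw       # independent of the mini-batch size
for mbs in (16, 32, 64, 128):
    compute_s = compute_per_image_s * mbs
    print(f"mini-batch {mbs:3d}: comm/compute = {comm_s / compute_s:.2f}")
```

The trade-off on the slide still applies: the ratio only falls as long as the larger mini-batch does not hurt convergence and accuracy.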

  11. IBM PowerAI DDL (Distributed Deep Learning Library)
      • Collaborative communication library for distributed deep learning
        – MPI-like interface for easy integration (a generic sketch of the pattern follows below)
        – Enables deep learning software to scale to 100s of servers with CPUs/GPUs
        – Works across a variety of system sizes
        – Works with a variety of network types and switch topologies
      • DDL orchestrates the data communication
        – Plans an efficient communication pattern on a hierarchical network environment
        – Actual point-to-point data transfer via NCCL or MPI
      • Currently integrated into
        – Supported: Caffe, TensorFlow, Chainer, Torch
        – Ongoing: Caffe2, PyTorch, Keras (TF backend)
      • Currently US patent-pending
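
Since the slide describes an MPI-like interface with the actual point-to-point transfers delegated to NCCL or MPI, here is a generic sketch of that integration pattern using standard MPI via mpi4py; this is not DDL's own API, only the shape of the collective call a framework would make on its gradient buffer.

```python
# Generic gradient-allreduce step with standard MPI (mpi4py), the kind of
# collective an MPI-like interface wraps; not DDL's API, just the pattern.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

local_grad = np.random.randn(1_000_000).astype(np.float32)   # this learner's gradients
global_grad = np.empty_like(local_grad)

# Sum gradients across all learners, then average locally.
comm.Allreduce(local_grad, global_grad, op=MPI.SUM)
global_grad /= size

if rank == 0:
    print(f"averaged gradients across {size} learners")
```

Launched with, e.g., mpirun -np 4 python allreduce_step.py; a real integration would pass the framework's actual gradient buffers rather than random data.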

  12. DDL: Topology-aware communication
      • Example topology: four learners A, B, C, D on hosts behind switch0 and switch1, with a maximum sustained bandwidth of 100 Gbytes/sec inside a system but only about 10 Gbytes/sec across the switch level
      • Example traffic: A, B, C, D each broadcast to all the others (A->B, B->C, C->D, D->A, and so on, all at once)
        – Suffers from congestion
        – Suffers from the lower cross-switch bandwidth

  13. DDL: Topology-aware communication (continued)
      • The figure contrasts a suboptimal mapping, which pushes several simultaneous transfers (A->B, B->C, C->D, D->A) across the slow switch links between box0-box3, with an optimal mapping (e.g., B->A, A->D, D->C, C->B) that keeps more traffic local
      • It is a mapping problem
        – System-specific network topology
        – Application-specific traffic
      • DDL does it differently
        – To minimize bus contention
        – To minimize crossing of lower-bandwidth links

  14. DDL: Problem Definition and Solution
      • Assumption
        – A network topology with various bandwidths
      • Problem definition
        – A min-cost multi-commodity flow problem
        – NP-hard in general, but easy to solve when the graph is small (e.g., 4 vertices); a toy brute-force sketch follows below
      • DDL solves a typical case/topology offline
        – If the cluster/cloud provides such a topology, it performs very well
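
To give a feel for why a 4-vertex instance is easy, here is a toy brute-force sketch (not DDL's actual formulation or solver): it enumerates ring orderings of four learners split across two switches (an assumed placement for illustration) and keeps the ordering whose neighbor-to-neighbor transfers cross the slow inter-switch link the fewest times.

```python
# Toy illustration of solving a tiny topology offline (not DDL's solver):
# brute-force every ring ordering of 4 learners and keep the one whose
# ring edges cross the slow inter-switch link the fewest times.
from itertools import permutations

switch_of = {"A": 0, "B": 0, "C": 1, "D": 1}   # assumed placement: A, B behind switch0; C, D behind switch1

def slow_link_crossings(ring):
    # Count ring edges whose endpoints sit behind different switches.
    return sum(switch_of[u] != switch_of[v]
               for u, v in zip(ring, ring[1:] + ring[:1]))

best = min(permutations("ABCD"), key=slow_link_crossings)
print("best ring:", " -> ".join(best + (best[0],)),
      "| slow-link crossings:", slow_link_crossings(best))
```

With only four vertices there are just 24 orderings to check, which is why a typical topology can be solved offline.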

  15. DDL: How well it performs on Caffe2
      • 48 IBM S822LC servers (PPC64LE, RHEL)
        – 3 racks with 16 hosts each, connected through 10 GB/s InfiniBand
        – Each host has 4 P100-SXM2 GPUs with CUDA 8 and cuDNN 5
      • Algorithms compared against DDL on ResNet-50 + ImageNet1K (preloaded to a RAM disk), mini-batch size 32
        – MPI_Allreduce
        – Ring (allreduce from Baidu, Feb 2017)
        – GLOO (from Facebook): NCCL + ib_verbs

  16. Comparison with NCCL 2.1.x Allreduce (POWER)
      • Chart panels: exploiting in-system topology; exploiting in/cross-system topology
      • IBM POWER9 Newell systems (NVLink) with V100 GPUs
      • 100 Gb/s InfiniBand

  17. Comparison with NCCL 2.1.x Allreduce (x86)
      • Chart panels: no in-system topology; exploiting cross-system topology
      • x86 systems (PCIe) with P100 GPUs
      • 10 Gb/s Ethernet

  18. Conclusion
      • DDL is a topology-aware communication library in PowerAI
      • DDL delivers industry-best performance by exploiting
        – The network hierarchy
        – Multi-tier bandwidth
      • DDL is well suited to common distributed training in cloud environments

  19. BACKUP
