Fast and Generic Collectives for Distributed ML


  1. Fast and Generic Collectives for Distributed ML. Guanhua Wang, Shivaram Venkataraman, Amar Phanishayee, Jorgen Thelin, Nikhil R. Devanur, Ion Stoica

  2. DNNs empower state-of-the-art results across many different applications: image classification, robot control, game playing, speech recognition.

  3–4. Speed-up DNN training: Data Parallelism. [Chart: data-parallel training speed-up on the ImageNet-1K dataset*.] Data parallelism significantly reduces training time. (* https://software.intel.com/en-us/articles/caffe-training-on-multi-node-distributed-memory-systems)

  5. Speed-up DNN training: Data Parallelism. Each worker computes a local gradient (∇W1, ∇W2, ∇W3, ∇W4 in the figure); model synchronization then aggregates them: ∇W = ∇W1 + ∇W2 + ⋯ + ∇WN.
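
To make the synchronization step concrete, here is a minimal PyTorch sketch (not code from the talk) of how data-parallel workers could sum their gradients with an all-reduce after each backward pass. It assumes one process per GPU and that `torch.distributed` has already been initialized with the NCCL backend (e.g., via `torchrun`).

```python
import torch
import torch.distributed as dist

def synchronize_gradients(model: torch.nn.Module) -> None:
    """Sum gradients across all workers, then average them locally."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # All-reduce implements the slide's ∇W = ∇W1 + ∇W2 + ... + ∇WN.
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size  # average so the learning rate stays comparable
```

In practice, frameworks such as PyTorch's DistributedDataParallel fuse these per-parameter calls into larger buckets so the interconnect sees fewer, bigger transfers.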

  6–8. Despite many performance optimizations, model synchronization is a big overhead in data-parallel training on cloud servers: multi-GPU scaling with TensorFlow shows >50% communication overhead*, and data-parallel training on multi-GPU servers with PyTorch shows up to 90% communication overhead^. To alleviate communication bottlenecks, there have recently been big improvements in hardware and software. (*Horovod: fast and easy distributed deep learning in TensorFlow, arXiv:1802.05799, 2018. ^PipeDream: Generalized Pipeline Parallelism for DNN Training, SOSP 2019.)

  9. State of the art (hardware): NVIDIA DGX-1, NVIDIA DGX-2.

  10–12. What is inside?
  • Computation: NVIDIA P100, 5.3 Tera-FLOPs double precision; NVIDIA V100, 7.8 Tera-FLOPs double precision.
  • Faster interconnects: PCIe 3.0 (x16), ~10 GB/s, shared; NVLink, point-to-point, 1st gen (P100) ~18 GB/s, 2nd gen (V100) ~23 GB/s; NVSwitch, a fully connected crossbar with 6x NVLink 2nd-gen bandwidth, ~130 GB/s.
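
As a rough illustration of why these link speeds matter (my own back-of-the-envelope numbers, not from the slides), the sketch below estimates how long one full set of fp32 gradients for a ResNet-50-sized model (~25.6M parameters, ~102 MB) takes to cross each interconnect, ignoring protocol overhead and collective-algorithm efficiency.

```python
# Illustrative only: bandwidths are the approximate figures quoted on the slide.
GRADIENT_BYTES = 25.6e6 * 4  # ~102 MB of fp32 gradients (ResNet-50-sized model)

links_gb_per_s = {
    "PCIe 3.0 x16 (shared)": 10,
    "NVLink 1st gen (P100)": 18,
    "NVLink 2nd gen (V100)": 23,
    "NVSwitch crossbar": 130,
}

for name, bandwidth in links_gb_per_s.items():
    millis = GRADIENT_BYTES / (bandwidth * 1e9) * 1e3
    print(f"{name:>22}: {millis:5.1f} ms per full-gradient traversal")
```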

  13. State of the art (software): NCCL (NVIDIA Collective Communication Library), ring-based collective communication protocols.
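
For reference, this is roughly how an application hands a broadcast to NCCL through PyTorch's `torch.distributed` front end; NCCL then runs its ring-based protocol underneath. A minimal sketch, assuming one process per GPU launched with something like `torchrun`.

```python
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")      # rank/world size come from the launcher
torch.cuda.set_device(dist.get_rank())

buf = torch.empty(1024, device="cuda")
if dist.get_rank() == 0:
    buf.uniform_()                           # GPU0 (the root) fills the buffer

dist.broadcast(buf, src=0)                   # NCCL propagates GPU0's data to every GPU
```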

  14–20. Ring-based collectives (e.g., Broadcast). [Figure, animated across slides 14–20: a 4-GPU topology (GPU0–GPU3) arranged into a ring; a broadcast from GPU0 forwards data hop by hop around the ring until every GPU holds a copy.]
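
The sketch below simulates the pipelined ring broadcast that slides 14–20 animate, under an assumed ring order of 0 → 1 → 2 → 3 → 0 (the ring order drawn on the slide may differ): the message is split into chunks, and at every step each GPU forwards its newest chunk to its ring successor, so the broadcast finishes in (chunks + GPUs − 2) steps.

```python
def ring_broadcast_steps(num_gpus: int = 4, num_chunks: int = 8, root: int = 0) -> int:
    """Count the steps a pipelined ring broadcast needs to reach every GPU."""
    has = {g: set() for g in range(num_gpus)}   # chunks each GPU currently holds
    has[root] = set(range(num_chunks))
    steps = 0
    while any(len(has[g]) < num_chunks for g in range(num_gpus)):
        sends = []
        for g in range(num_gpus):
            nxt = (g + 1) % num_gpus            # assumed ring order 0 -> 1 -> 2 -> 3 -> 0
            missing = sorted(has[g] - has[nxt])
            if missing:
                sends.append((nxt, missing[0])) # each GPU forwards one chunk per step
        for nxt, chunk in sends:                # apply all transfers "simultaneously"
            has[nxt].add(chunk)
        steps += 1
    return steps

print(ring_broadcast_steps())  # 10 steps: num_chunks + num_gpus - 2 = 8 + 4 - 2
```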

  21. State of the art (software), recap: NCCL (NVIDIA Collective Communication Library), ring-based collective communication protocols.

  22–23. Can these hardware and software improvements alleviate the communication bottleneck in data-parallel training? Not really.

  24–28. High communication overheads persist even with state-of-the-art hardware (NVLink) and software (NCCL). There are many different 4-GPU allocations within a server, and the overhead varies between the highest- and lowest-overhead allocations. [Chart: cross-GPU communication measured as the percentage of total epoch time when running within a single 8-GPU DGX-1 box.] The high communication overhead is consistent across different numbers of workers and for a range of DNNs, and it becomes more pronounced with increasing GPU computation power. We need faster collective communication protocols.

  29. Talk Outline • Motivation • Challenges to achieving faster collective communication • Design • Evaluation

  30–31. Challenge 1: Different server configurations. [Figure: 8-GPU NVLink topologies of DGX1-P100 (NVLink 1st gen, ~18 GB/s) and DGX1-V100 (NVLink 2nd gen, ~23 GB/s).] Protocols need to be topology-aware to effectively use hardware links.
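
One practical way to obtain the topology information such a protocol needs is NVIDIA's `nvidia-smi topo -m` matrix, which reports how each GPU pair is connected; a minimal sketch of shelling out to it is below (parsing the matrix is left out), assuming an NVIDIA driver is installed.

```python
import subprocess

# Print the GPU interconnect matrix; entries such as NV1/NV2 indicate NVLink
# connections, while PIX/PXB/PHB/NODE/SYS indicate PCIe or system paths.
result = subprocess.run(["nvidia-smi", "topo", "-m"], capture_output=True, text=True)
print(result.stdout)
```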

  32–33. Challenge 2: Link heterogeneity. [Figure: NVLink topology vs. PCIe topology of the same 8-GPU server.] Ring-based collectives can only utilize homogeneous links. Why not heterogeneous links? Because the slowest link caps the whole ring: PCIe, for example, would become the bottleneck if included in an NVLink ring.
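
A toy calculation (my numbers, using the approximate bandwidths from the earlier slides) shows why mixing link types into one ring does not help: a ring's steady-state throughput is capped by its slowest link, so a shared ~10 GB/s PCIe hop drags an otherwise ~23 GB/s NVLink ring down to PCIe speed.

```python
def ring_throughput_gbps(link_bandwidths_gbps):
    # In steady state every hop carries the same traffic, so the slowest link wins.
    return min(link_bandwidths_gbps)

print(ring_throughput_gbps([23, 23, 23, 23]))  # all-NVLink ring: ~23 GB/s
print(ring_throughput_gbps([23, 23, 23, 10]))  # one PCIe hop: capped at ~10 GB/s
```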

  34–36. Challenge 3: Fragmentation in multi-tenant clusters. [Chart: within each 8-GPU server, the number of GPUs allocated to 40,000 multi-GPU jobs at Microsoft.] Examples of fragmented allocation (an 8-GPU job split across 2 servers): 3 + 5, 2 + 6. Why fragmentation? Many cluster schedulers are not topology-aware, and without support for efficient migration, DNN jobs must embrace fragmentation to avoid queuing delays. A fragmented allocation can leave an irregular topology in which no NVLink ring exists, and existing solutions (NCCL) fall back to PCIe if they cannot form an NVLink ring.
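
To see why a fragmented allocation breaks ring-based schemes, the sketch below brute-force checks whether a given GPU subset can be ordered into a cycle using only NVLink edges; the example wiring is hypothetical, not the actual DGX-1 layout.

```python
from itertools import permutations

def has_nvlink_ring(gpus, nvlink_edges):
    """Return True if the GPUs can be arranged into a cycle of NVLink edges."""
    edges = {frozenset(e) for e in nvlink_edges}
    first, rest = gpus[0], list(gpus[1:])
    for order in permutations(rest):
        cycle = (first, *order, first)
        if all(frozenset(pair) in edges for pair in zip(cycle, cycle[1:])):
            return True
    return False

# Hypothetical fragment: a 4-GPU chain 0-1-2-3 with no NVLink closing the cycle.
print(has_nvlink_ring([0, 1, 2, 3], [(0, 1), (1, 2), (2, 3)]))          # False -> fall back to PCIe
print(has_nvlink_ring([0, 1, 2, 3], [(0, 1), (1, 2), (2, 3), (3, 0)]))  # True  -> NVLink ring exists
```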

  37–38. Can we do better than the state of the art? Ring-based collective communication protocols struggle with topology heterogeneity (1. different server configurations, 2. link heterogeneity, 3. fragmentation in multi-tenant clusters); BLINK is our answer to these three challenges.

  39. Talk Outline • Motivation • Challenges to achieving high-performance collective communication: 1. Different server configurations, 2. Link heterogeneity, 3. Fragmentation in multi-tenant clusters • Design • Evaluation
