SLIDE 1

NV-Group: Link-Efficient Reductions for Distributed Deep Learning on Modern Dense GPU Systems

Ching-Hsiang Chu, Pouya Kousha, Ammar A. Awan, Kawthar Shafie Khorassani, Hari Subramoni, and Dhabaleswar K. (DK) Panda

{chu.368, kousha.2, awan.10, shafiekhorassani.1}@osu.edu {subramon, panda}@cse.ohio-state.edu

Network-Based Computing Laboratory, The Ohio State University

SLIDE 2


Outline

  • Introduction
– Trends in Modern HPC systems
– Allreduce for Distributed Deep Learning on Dense-GPU systems
  • Research Challenge
  • Proposed Designs: NV-Group Allreduce
  • Performance Evaluation
  • Concluding Remarks

SLIDE 3


Trends in Modern HPC Architecture: Heterogeneous

  • Multi-/many-core processors
  • Accelerators: high compute density, high performance/watt
  • High-performance interconnects: InfiniBand, Omni-Path, EFA (<1 µs latency, 200 Gbps+ bandwidth)
  • Node-local storage: SSD, NVMe-SSD, NVRAM

  • #1 Fugaku (158,976 nodes with A64FX ARM CPU, a GPU-like processor)
  • #2 Summit (27,648 GPUs)
  • #3 Sierra (17,280 GPUs)
  • #6 HPC5 (7,280 GPUs)
  • #7 Selene, NVIDIA DGX SuperPOD (2,240 GPUs)
  • #14 Lassen (2,664 GPUs)

https://www.top500.org/

SLIDE 4


Trends in Modern Large-scale Dense-GPU Systems

  • Scale-up (up to 150 GB/s)
– PCIe, NVLink/NVSwitch
– Infinity Fabric, Xe Link
  • Scale-out (up to 25 GB/s)
– InfiniBand, Omni-Path, Ethernet
– Cray Slingshot

[Images: NVIDIA DGX machine; IBM Power System AC922 node of the #1 Summit system]

SLIDE 5


GPU-enabled Distributed Deep Learning

  • Easy-to-use and high-performance frameworks
  • Wide range of applications
– Image Classification
– Speech Recognition
– Self-driving Cars
– Healthcare
– Climate Analytics

Kurth T., Treichler S., Romero J., Mudigonda M., Luehr N., Phillips E., Mahesh A., Matheson M., Deslippe J., Fatica M., Houston M. "Exascale Deep Learning for Climate Analytics," SC 2018 (Gordon Bell Prize): 999 PetaFlop/s sustained and 1.13 ExaFlop/s peak FP16 performance over 4,560 nodes (27,360 GPUs).

SLIDE 6


Reduction Operations for Distributed Deep Learning

  • Distributed deep learning training with data parallelism
– Uses Allreduce operations to exchange and update gradients, weights, etc. (a minimal sketch of this step follows below)
  • State-of-the-art ring-based Allreduce for GPUs*
– Pros: contention-free
– Cons: not scalable

https://www.oreilly.com/ideas/distributed-tensorflow
Ben-Nun T., Hoefler T. "Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis," arXiv:1802.09941, 2018.

*Please refer to the paper for the analysis of more algorithms.
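For readers less familiar with data-parallel training, here is a minimal sketch of the gradient-averaging step performed after every backward pass. It is not the authors' implementation: average_gradients and the scale kernel are illustrative names, and the sketch assumes a CUDA-aware MPI library (such as MVAPICH2-GDR) so that a device buffer can be passed to MPI_Allreduce directly.

    // Sketch: per-iteration gradient averaging with Allreduce (CUDA + MPI).
    #include <mpi.h>
    #include <cuda_runtime.h>

    // Divide every element by the number of ranks to turn the sum into an average.
    __global__ void scale(float *buf, size_t n, float factor)
    {
        size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
        if (i < n) buf[i] *= factor;
    }

    // d_grad is a device buffer holding this rank's local gradients.
    void average_gradients(float *d_grad, size_t count, MPI_Comm comm)
    {
        int world;
        MPI_Comm_size(comm, &world);

        // Sum the gradient buffers of all ranks; a CUDA-aware MPI moves the data
        // over NVLink/InfiniBand without staging it through host memory.
        MPI_Allreduce(MPI_IN_PLACE, d_grad, (int)count, MPI_FLOAT, MPI_SUM, comm);

        // Average, then let the optimizer apply the update.
        scale<<<(unsigned)((count + 255) / 256), 256>>>(d_grad, count, 1.0f / world);
        cudaDeviceSynchronize();
    }

Every rank calls this after its backward pass; this Allreduce is exactly the operation whose ring-based implementation the following slides show to be link-inefficient on dense GPU nodes.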

SLIDE 7


Motivation

  • Ring-based Allreduce cannot efficiently utilize NVLinks

[Chart: Allreduce throughput (GB/s) measured on each NVLink pair (GPU–CPU and GPU–GPU) of an IBM AC922 node, comparing SpectrumMPI-10.3, OpenMPI-4.0.3, MV2-GDR-2.3, and NCCL-2.6]

* Profiling tool: P. Kousha et al., Designing a Profiling and Visualization Tool for Scalable and In-Depth Analysis of High-Performance GPU Clusters, HiPC 2019.

SLIDE 8


Outline

  • Introduction
  • Research Challenge
  • Proposed Designs: NV-Group
  • Performance Evaluation
  • Concluding Remarks

SLIDE 9


Broad Research Challenge

How to design a link-efficient Allreduce algorithm that maximizes the utilization of the available hardware communication channels to boost the performance of distributed DL training on emerging dense GPU systems?

SLIDE 10


Outline

  • Introduction
  • Research Challenge
  • Proposed Designs: NV-Group Allreduce
  • Performance Evaluation
  • Concluding Remarks

SLIDE 11


Overview of the Proposed NVGroup Allreduce

  • 1. Forming NV-Groups
– Treat multiple GPUs as one
  • 2. Cooperative reduction kernel within an NV-Group
– Persistent GPU kernels
– Exploit load-store primitives over NVLinks
– High-occupancy kernel
  • 3. Communication across NV-Groups
– Contention-free communication over the slower IB network

SLIDE 12


Forming NV-Groups

  • Topology detection and GPU grouping
– Discover which GPUs are fully connected by NVLink, using tools such as hwloc[1] and NVML[2] (see the sketch below)
– Create logical GPU groups, e.g., an MPI group or communicator

[1] https://www.open-mpi.org/projects/hwloc/ [2] NVIDIA Management Library (NVML), https://developer.nvidia.com/nvidia-management-library-nvml
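A minimal sketch of this grouping step is shown below, under stated assumptions: it is not the MVAPICH2-GDR code, build_nv_group is a hypothetical helper, and deriving group_id from the discovered NVLink topology (e.g., one group per CPU socket on Summit) is left out; only the NVML link query and the MPI communicator split are illustrated.

    #include <mpi.h>
    #include <nvml.h>

    /* Count the active NVLink lanes of the local GPU via NVML [2]. */
    static unsigned active_nvlinks(unsigned gpu_index)
    {
        nvmlDevice_t dev;
        unsigned active = 0;
        nvmlDeviceGetHandleByIndex(gpu_index, &dev);
        for (unsigned link = 0; link < NVML_NVLINK_MAX_LINKS; ++link) {
            nvmlEnableState_t state;
            if (nvmlDeviceGetNvLinkState(dev, link, &state) == NVML_SUCCESS &&
                state == NVML_FEATURE_ENABLED)
                ++active;
        }
        return active;
    }

    /* Split 'world' into one communicator per NV-Group. group_id identifies the
     * island of NVLink-connected GPUs this rank's GPU belongs to (e.g., the CPU
     * socket on Summit); deriving it from the full topology is omitted here. */
    MPI_Comm build_nv_group(MPI_Comm world, unsigned local_gpu, int group_id)
    {
        int rank;
        MPI_Comm_rank(world, &rank);
        nvmlInit();

        /* A GPU with no active NVLink lane cannot join an NV-Group; give it a
         * singleton color (assumes group_id stays below 1 << 20). */
        int color = (active_nvlinks(local_gpu) > 0) ? group_id : (1 << 20) + rank;

        MPI_Comm nv_group;
        MPI_Comm_split(world, color, rank, &nv_group);
        nvmlShutdown();
        return nv_group;
    }

Ranks passing the same color end up in one communicator, which then serves as the NV-Group for the cooperative reduction kernel described next.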

SLIDE 13


Cooperative Reduction Kernel within NV-Group

  • CPU creates a work queue for each Cooperative Thread Array (CTA or block)
  • Persistent GPU kernel (a sketch follows the diagram below):
1) Poll the individual work queue
2) Reduce the data chunks
– Reduce-scatter among GPUs
– Direct load-store over NVLink
3) Signal CPU upon completion
– Implicit synchronization[1]

[1] Ching-Hsiang Chu et al., "Designing High-Performance In-Memory Key-Value Operations with Persistent GPU Kernels and OpenSHMEM," OpenSHMEM 2018.

[Diagram: two GPUs (GPU 0, GPU 1), each running two persistent CTAs (CTA 0, CTA 1); the CPUs post chunk descriptors (Chunk 0-0, 0-1, 1-0, 1-1) to per-CTA work queues in shared system memory, and the CTAs reduce the chunks between GPU 0 and GPU 1]
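The sketch below illustrates the persistent-kernel idea under stated assumptions: WorkItem and persistent_reduce are illustrative names rather than the MVAPICH2-GDR internals, the work queues are assumed to live in host-mapped (shared system) memory, and the peer buffer is assumed to be directly load/store-accessible over NVLink (e.g., mapped via CUDA IPC).

    #include <cuda_runtime.h>
    #include <stddef.h>

    // One descriptor per chunk, posted by the CPU into host-mapped memory.
    struct WorkItem {
        volatile int  ready;     // CPU sets to 1 when the chunk may be reduced
        volatile int  done;      // kernel sets to 1 once the reduction has finished
        volatile int  exit;      // CPU sets to 1 to terminate the persistent kernel
        const float  *peer_src;  // peer GPU buffer (NVLink load/store accessible)
        float        *local_dst; // local destination buffer
        size_t        count;     // number of elements in the chunk
    };

    __global__ void persistent_reduce(WorkItem *queues /* one entry per CTA */)
    {
        WorkItem *q = &queues[blockIdx.x];       // this CTA's private work queue
        for (;;) {
            // 1) Poll the individual work queue (one thread spins, the rest wait).
            if (threadIdx.x == 0)
                while (!q->ready && !q->exit) { /* spin */ }
            __syncthreads();
            if (q->exit) return;

            // 2) Reduce the chunk: direct loads from the peer GPU over NVLink.
            for (size_t i = threadIdx.x; i < q->count; i += blockDim.x)
                q->local_dst[i] += q->peer_src[i];
            __syncthreads();

            // 3) Signal the CPU that this chunk is complete.
            if (threadIdx.x == 0) {
                __threadfence_system();          // make results visible to CPU and peers
                q->done = 1;
                q->ready = 0;
            }
            __syncthreads();
        }
    }

On the host side, the CPU fills peer_src, local_dst, and count, sets ready, and later polls done; because the kernel never exits between chunks there is no repeated launch overhead, and the ordering imposed by the queue provides the implicit synchronization referenced above.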

SLIDE 14


Cooperative Reduction Kernel – Efficiency

  • High-occupancy kernel with low register pressure* (an occupancy check is sketched below)
– CPU coordinates the topology and communication paths
– Enables all threads for reduction operations
  • Frees resources for applications
– Low SM consumption, hence low scheduling overhead
– Enables overlap opportunities

[Chart: "Efficiency of Reduction Kernel (higher the better)" – reduction throughput (GB/s) vs. number of CTAs (1–16) for NVGroup (1,024 threads per CTA) and NCCL-2.6 (256 threads per CTA); annotation: links have been saturated]

* https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#execution-configuration-optimizations
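As a concrete illustration of the occupancy argument (and of the execution-configuration guidance cited in the footnote above), the self-contained sketch below checks how many 1,024-thread CTAs of a simple stand-in reduction kernel fit on one SM; add_reduce is only a placeholder for the cooperative kernel, not the NVGroup code.

    // Sketch: checking occupancy of a large-CTA reduction kernel with the CUDA
    // occupancy API. add_reduce stands in for the cooperative NVGroup kernel.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void add_reduce(float *dst, const float *src, size_t n)
    {
        for (size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
             i < n; i += (size_t)gridDim.x * blockDim.x)
            dst[i] += src[i];   // simple elementwise reduction step, few registers
    }

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        // How many 1,024-thread CTAs of this kernel can be resident per SM?
        int blocks_per_sm = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, add_reduce,
                                                      /*blockSize=*/1024,
                                                      /*dynamicSMemSize=*/0);

        double occupancy = 100.0 * blocks_per_sm * 1024 / prop.maxThreadsPerMultiProcessor;
        printf("%d CTAs/SM across %d SMs -> %.0f%% occupancy\n",
               blocks_per_sm, prop.multiProcessorCount, occupancy);
        return 0;
    }

Keeping register pressure low is what lets a 1,024-thread CTA stay fully resident, so only a few CTAs (and thus few SMs) are needed to saturate the NVLinks, leaving the remaining SMs free for the application, which is the overlap opportunity mentioned above.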

SLIDE 15


Group-Wise Communication – CPU-GPU Cooperation

  • CPU processes (a host-side sketch follows below)
– Inter-group communication
– Ring-based Reduce-Scatter + Allgather over IB or X-Bus
– Offload reduction to the NV-Group
  • GPUs (NV-Group)
– Process the operations requested by the CPU
– Direct Reduce-Scatter or Allgather over NVLink

[Diagram: four NV-Groups (Group-1 to Group-4), each containing GPU 0, GPU 1, and GPU 2, cooperating on the group-wise Allreduce]

* Please check out the paper for more optimization techniques.
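A rough host-side sketch of this CPU-GPU cooperation is given below, under stated assumptions: nvgroup_reduce_scatter, nvgroup_allgather, and nvgroup_allreduce are hypothetical helper names (the first two stand for posting work items to the persistent kernel and are stubbed out), the inter-group step is shown as a plain MPI_Allreduce rather than the explicit ring reduce-scatter + allgather, and remainder handling for uneven shards is omitted.

    #include <mpi.h>
    #include <stddef.h>

    // Hypothetical helpers that post work items to the persistent GPU kernel
    // (stubbed out here; the real reduction happens on the GPU side).
    static void nvgroup_reduce_scatter(float *d_buf, size_t count, MPI_Comm nv_group)
    { (void)d_buf; (void)count; (void)nv_group; /* post chunks to CTA work queues */ }
    static void nvgroup_allgather(float *d_buf, size_t count, MPI_Comm nv_group)
    { (void)d_buf; (void)count; (void)nv_group; /* post chunks to CTA work queues */ }

    // nv_group:    ranks whose GPUs form one NV-Group (see "Forming NV-Groups")
    // cross_group: ranks holding the same position in every NV-Group
    void nvgroup_allreduce(float *d_buf, size_t count,
                           MPI_Comm nv_group, MPI_Comm cross_group)
    {
        int grp_size, grp_rank;
        MPI_Comm_size(nv_group, &grp_size);
        MPI_Comm_rank(nv_group, &grp_rank);

        // 1) Intra-group reduce-scatter over NVLink, offloaded to the GPU kernel:
        //    afterwards each GPU owns the fully reduced values of one shard.
        nvgroup_reduce_scatter(d_buf, count, nv_group);

        // 2) Inter-group reduction of the owned shard, driven by the CPU over IB
        //    (or X-Bus); the paper uses a contention-free ring
        //    reduce-scatter + allgather for this step.
        size_t shard = count / grp_size;            // uneven remainder omitted
        MPI_Allreduce(MPI_IN_PLACE, d_buf + grp_rank * shard, (int)shard,
                      MPI_FLOAT, MPI_SUM, cross_group);

        // 3) Intra-group allgather over NVLink so every GPU holds the full result.
        nvgroup_allgather(d_buf, count, nv_group);
    }

Because the expensive intra-group reduction is offloaded to the persistent kernel, the CPU is free to drive the slower inter-group ring without idling the GPUs, and the shards it exchanges are already reduced within each group.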
SLIDE 16


Outline

  • Introduction
  • Research Challenge
  • Proposed Designs: NV-Group
  • Performance Evaluation
  • Concluding Remarks

SLIDE 17


Experimental Environments

Systems: #1 Summit, #10 Lassen*, NVIDIA DGX-2

– CPU model: IBM POWER9 (AC922) on Summit and Lassen; Intel Skylake on the DGX-2
– System memory: 512 GB (Summit), 256 GB (Lassen), 1.5 TB (DGX-2)
– GPU model: NVIDIA Volta V100 x 6 (Summit), x 4 (Lassen), x 16 (DGX-2)
– Interconnect between CPU & GPU: 2-lane NVLink (Summit), 3-lane NVLink (Lassen), PCIe Gen3 (DGX-2)
– Interconnect between GPUs: NVLink (Summit, Lassen); 6-lane NVLink & NVSwitch (DGX-2)
– Interconnect between nodes: dual-rail Mellanox IB EDR (Summit, Lassen); Mellanox IB EDR x 8, unused (DGX-2)
– NVIDIA driver & CUDA versions: 418.116 & 10.1.243 (Summit, Lassen); 410.48 & 10.1.243 (DGX-2)

  • Libraries: SpectrumMPI v10.3.1, OpenMPI v4.0.3+UCX v1.8.0, MVAPICH2-GDR v2.3, NCCL v2.6
  • Benchmarks: OSU Micro-Benchmark (OMB) & modified nccl-test
  • Applications: Horovod v0.19 with TensorFlow v1.14 & PyTorch v1.5

*Please refer to the paper for the thorough performance comparison

SLIDE 18


Overview of the MVAPICH2 Project

  • High Performance open-source MPI Library
  • Support for multiple interconnects

– InfiniBand, Omni-Path, Ethernet/iWARP, RDMA over Converged Ethernet (RoCE), and AWS EFA

  • Support for multiple platforms

– x86, OpenPOWER, ARM, Xeon-Phi, GPGPUs

  • Started in 2001, first open-source version demonstrated at SC ‘02
  • Supports the latest MPI-3.1 standard
  • http://mvapich.cse.ohio-state.edu
  • Additional optimized versions for different systems/environments:

– MVAPICH2-X (Advanced MPI + PGAS), since 2011
– MVAPICH2-GDR with support for NVIDIA GPGPUs, since 2014
– MVAPICH2-MIC with support for Intel Xeon-Phi, since 2014
– MVAPICH2-Virt with virtualization support, since 2015
– MVAPICH2-EA with support for Energy-Awareness, since 2015
– MVAPICH2-Azure for Azure HPC IB instances, since 2019
– MVAPICH2-X-AWS for AWS HPC+EFA instances, since 2019

  • Tools:

– OSU MPI Micro-Benchmarks (OMB), since 2003
– OSU InfiniBand Network Analysis and Monitoring (INAM), since 2015

  • Used by more than 3,100 organizations in 89 countries
  • More than 772,000 (> 0.7 million) downloads from the OSU site directly

  • Empowering many TOP500 clusters (June 2020 ranking)

– 4th, 10,649,600-core (Sunway TaihuLight) at NSC, Wuxi, China
– 8th, 448,448 cores (Frontera) at TACC
– 12th, 391,680 cores (ABCI) in Japan
– 18th, 570,020 cores (Nurion) in South Korea
– and many others

  • Available with software stacks of many vendors and Linux distros (RedHat, SuSE, OpenHPC, and Spack)

  • Partner in the 8th ranked TACC Frontera system
  • Empowering Top500 systems for more than 15 years


SLIDE 19


Evaluation of Link Utilization on Summit

  • The NVGroup design utilizes all NVLinks

[Chart: "Throughput of Allreduce operation on IBM AC922 machine" – throughput (GB/s) on each NVLink pair (GPU–CPU and GPU–GPU), comparing SpectrumMPI-10.3, OpenMPI-4.0.3, MV2-GDR-2.3, NCCL-2.6, and the proposed design]

SLIDE 20


Allreduce Benchmarking on Summit

[Charts: (a) bandwidth (GB/s) on 1,536 GPUs for 32 MB–256 MB messages, NVGroup up to 1.7X better than NCCL 2.6; (b) latency (µs) on 1,536 GPUs for 4 B–16 KB messages, NVGroup up to 1.6X better than NCCL 2.6; (c) bandwidth for a 128 MB message on 24–1,536 GPUs, NVGroup up to 1.7X better than SpectrumMPI 10.3.1, OpenMPI 4.0.3, and NCCL 2.6]

  • NVGroup yields 40% and 30% better latency and bandwidth, respectively, than NCCL
  • NVGroup scales significantly better than production libraries on the Summit system

*Please refer to the paper for the thorough comparison

SLIDE 21


Distributed Deep Learning Training on DGX-2 Machine

  • ResNet-50 Training using Horovod with TensorFlow

– Synthetic ImageNet dataset with batch size 64 and 100 batches

  • NVGroup achieves higher throughput due to its high-occupancy reduction kernel

[Charts: ResNet-50 training throughput (images per second) and scaling efficiency (%) on 1–16 GPUs of the DGX-2, comparing NCCL-2.6, the proposed design, and the ideal scaling; the proposed design is up to 9% higher in throughput and reaches about 93% scaling efficiency]

SLIDE 22

Distributed Deep Learning Training on Summit

  • ResNet-50 training using Horovod with TensorFlow and PyTorch
– Synthetic ImageNet dataset with batch size 64 and 100 batches

[Charts: training throughput (images per second) on 6–384 Summit GPUs with TensorFlow v1.14 and with PyTorch v1.5, comparing SpectrumMPI, MV2-GDR-2.3, NCCL-2.6, and the proposed design; annotations: 5% higher (TensorFlow) and 1.48X higher (PyTorch) for the proposed design]

SLIDE 23


Outline

  • Introduction
  • Research Challenge
  • Proposed Designs
  • Performance Evaluation
  • Concluding Remarks

SLIDE 24


Concluding Remarks

  • Allreduce operations dominate the performance of distributed deep learning training with data parallelism
  • State-of-the-art ring-based Allreduce fails to efficiently utilize the interconnects on modern dense GPU systems
  • The proposed NVGroup Allreduce can
– Maximize NVLink utilization on dense-GPU systems
– Perform cooperative reduction on GPUs
– Achieve faster distributed DL training on dense-GPU systems such as #1 Summit
  • Publicly available since the MVAPICH2-GDR 2.3.2 release!
– http://mvapich.cse.ohio-state.edu/

SLIDE 25


Thank You!

Network-Based Computing Laboratory http://nowlab.cse.ohio-state.edu/

{chu.368, kousha.2, awan.10, shafiekhorassani.1}@osu.edu {subramon, panda}@cse.ohio-state.edu

The High-Performance MPI/PGAS Project: http://mvapich.cse.ohio-state.edu/
The High-Performance Deep Learning Project: http://hidl.cse.ohio-state.edu/
The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/
