SLIDE 1

NV-Group: Link-Efficient Reductions for Distributed Deep Learning on Modern Dense GPU Systems

Ching-Hsiang Chu, Pouya Kousha, Ammar A. Awan, Kawthar Shafie Khorassani, Hari Subramoni, and Dhabaleswar K. (DK) Panda

{chu.368, kousha.2, awan.10, shafiekhorassani.1}@osu.edu {subramon, panda}@cse.ohio-state.edu

Network-Based Computing Laboratory, The Ohio State University

SLIDE 2


Outline

  • Introduction
– Trends in Modern HPC systems
– Allreduce for Distributed Deep Learning on Dense-GPU systems
  • Research Challenge
  • Proposed Designs: NV-Group Allreduce
  • Performance Evaluation
  • Concluding Remarks

SLIDE 3


Trends in Modern HPC Architecture: Heterogeneous

  • Multi-/many-core processors
  • Accelerators: high compute density, high performance/watt
  • High-performance interconnects: InfiniBand, Omni-Path, EFA (<1 µs latency, 200 Gbps+ bandwidth)
  • Node-local storage: SSD, NVMe-SSD, NVRAM

  • #1 Fugaku (158,976 nodes with A64FX ARM CPU, a GPU-like processor)
  • #2 Summit (27,648 GPUs)
  • #3 Sierra (17,280 GPUs)
  • #6 HPC5 (7,280 GPUs)
  • #7 Selene, NVIDIA DGX SuperPOD (2,240 GPUs)
  • #14 Lassen (2,664 GPUs)

https://www.top500.org/

SLIDE 4


Trends in Modern Large-scale Dense-GPU Systems

  • Scale-up (up to 150 GB/s)
– PCIe, NVLink/NVSwitch
– Infinity Fabric, Xe Link
  • Scale-out (up to 25 GB/s)
– InfiniBand, Omni-Path, Ethernet
– Cray Slingshot

[Images: NVIDIA DGX machine; IBM Power System AC922 node of the #1 Summit system]

SLIDE 5


GPU-enabled Distributed Deep Learning

  • Easy-to-use and high-performance frameworks
  • Wide range of applications
– Image Classification
– Speech Recognition
– Self-driving Cars
– Healthcare
– Climate Analytics

Kurth T., Treichler S., Romero J., Mudigonda M., Luehr N., Phillips E., Mahesh A., Matheson M., Deslippe J., Fatica M., Houston M. "Exascale Deep Learning for Climate Analytics," SC 2018 (Gordon Bell Prize): 999 PetaFlop/s sustained and 1.13 ExaFlop/s peak FP16 performance over 4,560 nodes (27,360 GPUs).

SLIDE 6


Reduction Operations for Distributed Deep Learning

  • Distributed deep learning training with data parallelism
– Uses Allreduce operations to exchange and update gradients, weights, etc. (a minimal sketch of this step follows below)
  • State-of-the-art ring-based Allreduce for GPUs*
– Pros: contention-free
– Cons: not scalable

https://www.oreilly.com/ideas/distributed-tensorflow
Ben-Nun T., Hoefler T. "Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis," arXiv:1802.09941, 2018.

*Please refer to the paper for the analysis of more algorithms.
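For readers less familiar with data-parallel training, here is a minimal sketch of the gradient-averaging step performed after every backward pass. It is not the authors' implementation: average_gradients and the scale kernel are illustrative names, and the sketch assumes a CUDA-aware MPI library (such as MVAPICH2-GDR) so that a device buffer can be passed to MPI_Allreduce directly.

    // Sketch: per-iteration gradient averaging with Allreduce (CUDA + MPI).
    #include <mpi.h>
    #include <cuda_runtime.h>

    // Divide every element by the number of ranks to turn the sum into an average.
    __global__ void scale(float *buf, size_t n, float factor)
    {
        size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
        if (i < n) buf[i] *= factor;
    }

    // d_grad is a device buffer holding this rank's local gradients.
    void average_gradients(float *d_grad, size_t count, MPI_Comm comm)
    {
        int world;
        MPI_Comm_size(comm, &world);

        // Sum the gradient buffers of all ranks; a CUDA-aware MPI moves the data
        // over NVLink/InfiniBand without staging it through host memory.
        MPI_Allreduce(MPI_IN_PLACE, d_grad, (int)count, MPI_FLOAT, MPI_SUM, comm);

        // Average, then let the optimizer apply the update.
        scale<<<(unsigned)((count + 255) / 256), 256>>>(d_grad, count, 1.0f / world);
        cudaDeviceSynchronize();
    }

Every rank calls this after its backward pass; this Allreduce is exactly the operation whose ring-based implementation the following slides show to be link-inefficient on dense GPU nodes.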

SLIDE 7


Motivation

  • Ring-based Allreduce cannot efficiently utilize NVLinks

[Chart: Allreduce throughput (GB/s) measured on each NVLink pair (GPU–CPU and GPU–GPU) of an IBM AC922 node, comparing SpectrumMPI-10.3, OpenMPI-4.0.3, MV2-GDR-2.3, and NCCL-2.6]

* Profiling tool: P. Kousha et al., Designing a Profiling and Visualization Tool for Scalable and In-Depth Analysis of High-Performance GPU Clusters, HiPC 2019.

SLIDE 8


Outline

  • Introduction
  • Research Challenge
  • Proposed Designs: NV-Group
  • Performance Evaluation
  • Concluding Remarks

SLIDE 9


Broad Research Challenge

How to design a link-efficient Allreduce algorithm that maximizes the utilization of the available hardware communication channels to boost the performance of distributed DL training on emerging dense GPU systems?

SLIDE 10


Outline

  • Introduction
  • Research Challenge
  • Proposed Designs: NV-Group Allreduce
  • Performance Evaluation
  • Concluding Remarks

SLIDE 11


Overview of the Proposed NVGroup Allreduce

  • 1. Forming NV-Groups
– Treat multiple GPUs as one
  • 2. Cooperative reduction kernel within an NV-Group
– Persistent GPU kernels
– Exploit load-store primitives over NVLinks
– High-occupancy kernel
  • 3. Communication across NV-Groups
– Contention-free communication over the slower IB network

SLIDE 12


Forming NV-Groups

  • Topology detection and GPU grouping
– Discover which GPUs are fully connected by NVLink, using tools such as hwloc[1] and NVML[2] (see the sketch below)
– Create logical GPU groups, e.g., an MPI group or communicator

[1] https://www.open-mpi.org/projects/hwloc/ [2] NVIDIA Management Library (NVML), https://developer.nvidia.com/nvidia-management-library-nvml
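A minimal sketch of this grouping step is shown below, under stated assumptions: it is not the MVAPICH2-GDR code, build_nv_group is a hypothetical helper, and deriving group_id from the discovered NVLink topology (e.g., one group per CPU socket on Summit) is left out; only the NVML link query and the MPI communicator split are illustrated.

    #include <mpi.h>
    #include <nvml.h>

    /* Count the active NVLink lanes of the local GPU via NVML [2]. */
    static unsigned active_nvlinks(unsigned gpu_index)
    {
        nvmlDevice_t dev;
        unsigned active = 0;
        nvmlDeviceGetHandleByIndex(gpu_index, &dev);
        for (unsigned link = 0; link < NVML_NVLINK_MAX_LINKS; ++link) {
            nvmlEnableState_t state;
            if (nvmlDeviceGetNvLinkState(dev, link, &state) == NVML_SUCCESS &&
                state == NVML_FEATURE_ENABLED)
                ++active;
        }
        return active;
    }

    /* Split 'world' into one communicator per NV-Group. group_id identifies the
     * island of NVLink-connected GPUs this rank's GPU belongs to (e.g., the CPU
     * socket on Summit); deriving it from the full topology is omitted here. */
    MPI_Comm build_nv_group(MPI_Comm world, unsigned local_gpu, int group_id)
    {
        int rank;
        MPI_Comm_rank(world, &rank);
        nvmlInit();

        /* A GPU with no active NVLink lane cannot join an NV-Group; give it a
         * singleton color (assumes group_id stays below 1 << 20). */
        int color = (active_nvlinks(local_gpu) > 0) ? group_id : (1 << 20) + rank;

        MPI_Comm nv_group;
        MPI_Comm_split(world, color, rank, &nv_group);
        nvmlShutdown();
        return nv_group;
    }

Ranks passing the same color end up in one communicator, which then serves as the NV-Group for the cooperative reduction kernel described next.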

SLIDE 13


Cooperative Reduction Kernel within NV-Group

  • CPU creates a work queue for each Cooperative Thread Array (CTA or block)
  • Persistent GPU kernel (a sketch follows the diagram below):
1) Poll the individual work queue
2) Reduce the data chunks
– Reduce-scatter among GPUs
– Direct load-store over NVLink
3) Signal CPU upon completion
– Implicit synchronization[1]

[1] Ching-Hsiang Chu et al., "Designing High-Performance In-Memory Key-Value Operations with Persistent GPU Kernels and OpenSHMEM," OpenSHMEM 2018.

[Diagram: two GPUs (GPU 0, GPU 1), each running two persistent CTAs (CTA 0, CTA 1); the CPUs post chunk descriptors (Chunk 0-0, 0-1, 1-0, 1-1) to per-CTA work queues in shared system memory, and the CTAs reduce the chunks between GPU 0 and GPU 1]
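The sketch below illustrates the persistent-kernel idea under stated assumptions: WorkItem and persistent_reduce are illustrative names rather than the MVAPICH2-GDR internals, the work queues are assumed to live in host-mapped (shared system) memory, and the peer buffer is assumed to be directly load/store-accessible over NVLink (e.g., mapped via CUDA IPC).

    #include <cuda_runtime.h>
    #include <stddef.h>

    // One descriptor per chunk, posted by the CPU into host-mapped memory.
    struct WorkItem {
        volatile int  ready;     // CPU sets to 1 when the chunk may be reduced
        volatile int  done;      // kernel sets to 1 once the reduction has finished
        volatile int  exit;      // CPU sets to 1 to terminate the persistent kernel
        const float  *peer_src;  // peer GPU buffer (NVLink load/store accessible)
        float        *local_dst; // local destination buffer
        size_t        count;     // number of elements in the chunk
    };

    __global__ void persistent_reduce(WorkItem *queues /* one entry per CTA */)
    {
        WorkItem *q = &queues[blockIdx.x];       // this CTA's private work queue
        for (;;) {
            // 1) Poll the individual work queue (one thread spins, the rest wait).
            if (threadIdx.x == 0)
                while (!q->ready && !q->exit) { /* spin */ }
            __syncthreads();
            if (q->exit) return;

            // 2) Reduce the chunk: direct loads from the peer GPU over NVLink.
            for (size_t i = threadIdx.x; i < q->count; i += blockDim.x)
                q->local_dst[i] += q->peer_src[i];
            __syncthreads();

            // 3) Signal the CPU that this chunk is complete.
            if (threadIdx.x == 0) {
                __threadfence_system();          // make results visible to CPU and peers
                q->done = 1;
                q->ready = 0;
            }
            __syncthreads();
        }
    }

On the host side, the CPU fills peer_src, local_dst, and count, sets ready, and later polls done; because the kernel never exits between chunks there is no repeated launch overhead, and the ordering imposed by the queue provides the implicit synchronization referenced above.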

SLIDE 14


Cooperative Reduction Kernel – Efficiency

  • High-occupancy kernel with low register pressure* (an occupancy check is sketched below)
– CPU coordinates the topology and communication paths
– Enables all threads for reduction operations
  • Frees resources for applications
– Low SM consumption, hence low scheduling overhead
– Enables overlap opportunities

[Chart: "Efficiency of Reduction Kernel (higher the better)" – reduction throughput (GB/s) vs. number of CTAs (1–16) for NVGroup (1,024 threads per CTA) and NCCL-2.6 (256 threads per CTA); annotation: links have been saturated]

* https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#execution-configuration-optimizations
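As a concrete illustration of the occupancy argument (and of the execution-configuration guidance cited in the footnote above), the self-contained sketch below checks how many 1,024-thread CTAs of a simple stand-in reduction kernel fit on one SM; add_reduce is only a placeholder for the cooperative kernel, not the NVGroup code.

    // Sketch: checking occupancy of a large-CTA reduction kernel with the CUDA
    // occupancy API. add_reduce stands in for the cooperative NVGroup kernel.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void add_reduce(float *dst, const float *src, size_t n)
    {
        for (size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
             i < n; i += (size_t)gridDim.x * blockDim.x)
            dst[i] += src[i];   // simple elementwise reduction step, few registers
    }

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        // How many 1,024-thread CTAs of this kernel can be resident per SM?
        int blocks_per_sm = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, add_reduce,
                                                      /*blockSize=*/1024,
                                                      /*dynamicSMemSize=*/0);

        double occupancy = 100.0 * blocks_per_sm * 1024 / prop.maxThreadsPerMultiProcessor;
        printf("%d CTAs/SM across %d SMs -> %.0f%% occupancy\n",
               blocks_per_sm, prop.multiProcessorCount, occupancy);
        return 0;
    }

Keeping register pressure low is what lets a 1,024-thread CTA stay fully resident, so only a few CTAs (and thus few SMs) are needed to saturate the NVLinks, leaving the remaining SMs free for the application, which is the overlap opportunity mentioned above.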

SLIDE 15


Group-Wise Communication – CPU-GPU Cooperation

  • CPU processes (a host-side sketch follows below)
– Inter-group communication
– Ring-based Reduce-Scatter + Allgather over IB or X-Bus
– Offload reduction to the NV-Group
  • GPUs (NV-Group)
– Process the operations requested by the CPU
– Direct Reduce-Scatter or Allgather over NVLink

[Diagram: four NV-Groups (Group-1 to Group-4), each containing GPU 0, GPU 1, and GPU 2, cooperating on the group-wise Allreduce]

* Please check out the paper for more optimization techniques.
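A rough host-side sketch of this CPU-GPU cooperation is given below, under stated assumptions: nvgroup_reduce_scatter, nvgroup_allgather, and nvgroup_allreduce are hypothetical helper names (the first two stand for posting work items to the persistent kernel and are stubbed out), the inter-group step is shown as a plain MPI_Allreduce rather than the explicit ring reduce-scatter + allgather, and remainder handling for uneven shards is omitted.

    #include <mpi.h>
    #include <stddef.h>

    // Hypothetical helpers that post work items to the persistent GPU kernel
    // (stubbed out here; the real reduction happens on the GPU side).
    static void nvgroup_reduce_scatter(float *d_buf, size_t count, MPI_Comm nv_group)
    { (void)d_buf; (void)count; (void)nv_group; /* post chunks to CTA work queues */ }
    static void nvgroup_allgather(float *d_buf, size_t count, MPI_Comm nv_group)
    { (void)d_buf; (void)count; (void)nv_group; /* post chunks to CTA work queues */ }

    // nv_group:    ranks whose GPUs form one NV-Group (see "Forming NV-Groups")
    // cross_group: ranks holding the same position in every NV-Group
    void nvgroup_allreduce(float *d_buf, size_t count,
                           MPI_Comm nv_group, MPI_Comm cross_group)
    {
        int grp_size, grp_rank;
        MPI_Comm_size(nv_group, &grp_size);
        MPI_Comm_rank(nv_group, &grp_rank);

        // 1) Intra-group reduce-scatter over NVLink, offloaded to the GPU kernel:
        //    afterwards each GPU owns the fully reduced values of one shard.
        nvgroup_reduce_scatter(d_buf, count, nv_group);

        // 2) Inter-group reduction of the owned shard, driven by the CPU over IB
        //    (or X-Bus); the paper uses a contention-free ring
        //    reduce-scatter + allgather for this step.
        size_t shard = count / grp_size;            // uneven remainder omitted
        MPI_Allreduce(MPI_IN_PLACE, d_buf + grp_rank * shard, (int)shard,
                      MPI_FLOAT, MPI_SUM, cross_group);

        // 3) Intra-group allgather over NVLink so every GPU holds the full result.
        nvgroup_allgather(d_buf, count, nv_group);
    }

Because the expensive intra-group reduction is offloaded to the persistent kernel, the CPU is free to drive the slower inter-group ring without idling the GPUs, and the shards it exchanges are already reduced within each group.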
SLIDE 16


Outline

  • Introduction
  • Research Challenge
  • Proposed Designs: NV-Group
  • Performance Evaluation
  • Concluding Remarks

SLIDE 17


Experimental Environments

Systems: #1 Summit, #10 Lassen*, NVIDIA DGX-2

– CPU model: IBM POWER9 (AC922) on Summit and Lassen; Intel Skylake on the DGX-2
– System memory: 512 GB (Summit), 256 GB (Lassen), 1.5 TB (DGX-2)
– GPU model: NVIDIA Volta V100 x 6 (Summit), x 4 (Lassen), x 16 (DGX-2)
– Interconnect between CPU & GPU: 2-lane NVLink (Summit), 3-lane NVLink (Lassen), PCIe Gen3 (DGX-2)
– Interconnect between GPUs: NVLink (Summit, Lassen); 6-lane NVLink & NVSwitch (DGX-2)
– Interconnect between nodes: dual-rail Mellanox IB EDR (Summit, Lassen); Mellanox IB EDR x 8, unused (DGX-2)
– NVIDIA driver & CUDA versions: 418.116 & 10.1.243 (Summit, Lassen); 410.48 & 10.1.243 (DGX-2)

  • Libraries: SpectrumMPI v10.3.1, OpenMPI v4.0.3+UCX v1.8.0, MVAPICH2-GDR v2.3, NCCL v2.6
  • Benchmarks: OSU Micro-Benchmark (OMB) & modified nccl-test
  • Applications: Horovod v0.19 with TensorFlow v1.14 & PyTorch v1.5

*Please refer to the paper for the thorough performance comparison

SLIDE 18


Overview of the MVAPICH2 Project

  • High Performance open-source MPI Library
  • Support for multiple interconnects

– InfiniBand, Omni-Path, Ethernet/iWARP, RDMA over Converged Ethernet (RoCE), and AWS EFA

  • Support for multiple platforms

– x86, OpenPOWER, ARM, Xeon-Phi, GPGPUs

  • Started in 2001, first open-source version demonstrated at SC ‘02
  • Supports the latest MPI-3.1 standard
  • http://mvapich.cse.ohio-state.edu
  • Additional optimized versions for different systems/environments:

– MVAPICH2-X (Advanced MPI + PGAS), since 2011
– MVAPICH2-GDR with support for NVIDIA GPGPUs, since 2014
– MVAPICH2-MIC with support for Intel Xeon-Phi, since 2014
– MVAPICH2-Virt with virtualization support, since 2015
– MVAPICH2-EA with support for Energy-Awareness, since 2015
– MVAPICH2-Azure for Azure HPC IB instances, since 2019
– MVAPICH2-X-AWS for AWS HPC+EFA instances, since 2019

  • Tools:

– OSU MPI Micro-Benchmarks (OMB), since 2003
– OSU InfiniBand Network Analysis and Monitoring (INAM), since 2015

  • Used by more than 3,100 organizations in 89 countries
  • More than 772,000 (> 0.7 million) downloads from the OSU site directly

  • Empowering many TOP500 clusters (June 2020 ranking)

– 4th, 10,649,600-core (Sunway TaihuLight) at NSC, Wuxi, China
– 8th, 448,448 cores (Frontera) at TACC
– 12th, 391,680 cores (ABCI) in Japan
– 18th, 570,020 cores (Nurion) in South Korea
– and many others

  • Available with software stacks of many vendors and Linux distros (RedHat, SuSE, OpenHPC, and Spack)

  • Partner in the 8th ranked TACC Frontera system
  • Empowering Top500 systems for more than 15 years


SLIDE 19


Evaluation of Link Utilization on Summit

  • The NVGroup design utilizes all NVLinks

[Chart: "Throughput of Allreduce operation on IBM AC922 machine" – throughput (GB/s) on each NVLink pair (GPU–CPU and GPU–GPU), comparing SpectrumMPI-10.3, OpenMPI-4.0.3, MV2-GDR-2.3, NCCL-2.6, and the proposed design]

SLIDE 20


Allreduce Benchmarking on Summit

[Charts: (a) bandwidth (GB/s) on 1,536 GPUs for 32 MB–256 MB messages, NVGroup up to 1.7X better than NCCL 2.6; (b) latency (µs) on 1,536 GPUs for 4 B–16 KB messages, NVGroup up to 1.6X better than NCCL 2.6; (c) bandwidth for a 128 MB message on 24–1,536 GPUs, NVGroup up to 1.7X better than SpectrumMPI 10.3.1, OpenMPI 4.0.3, and NCCL 2.6]

  • NVGroup yields 40% and 30% better latency and bandwidth, respectively, than NCCL
  • NVGroup scales significantly better than production libraries on the Summit system

*Please refer to the paper for the thorough comparison

SLIDE 21


Distributed Deep Learning Training on DGX-2 Machine

  • ResNet-50 Training using Horovod with TensorFlow

– Synthetic ImageNet dataset with batch size 64 and 100 batches

  • NVGroup achieves higher throughput due to its high-occupancy reduction kernel

[Charts: ResNet-50 training throughput (images per second) and scaling efficiency (%) on 1–16 GPUs of the DGX-2, comparing NCCL-2.6, the proposed design, and the ideal scaling; the proposed design is up to 9% higher in throughput and reaches about 93% scaling efficiency]

SLIDE 22

Distributed Deep Learning Training on Summit

  • ResNet-50 training using Horovod with TensorFlow and PyTorch
– Synthetic ImageNet dataset with batch size 64 and 100 batches

[Charts: training throughput (images per second) on 6–384 Summit GPUs with TensorFlow v1.14 and with PyTorch v1.5, comparing SpectrumMPI, MV2-GDR-2.3, NCCL-2.6, and the proposed design; annotations: 5% higher (TensorFlow) and 1.48X higher (PyTorch) for the proposed design]

SLIDE 23


Outline

  • Introduction
  • Research Challenge
  • Proposed Designs
  • Performance Evaluation
  • Concluding Remarks

SLIDE 24


Concluding Remarks

  • Allreduce operations dominate the performance of distributed deep learning training with data parallelism
  • State-of-the-art ring-based Allreduce fails to efficiently utilize the interconnects on modern dense GPU systems
  • The proposed NVGroup Allreduce can
– Maximize NVLink utilization on dense-GPU systems
– Perform cooperative reduction on GPUs
– Achieve faster distributed DL training on dense-GPU systems such as #1 Summit
  • Publicly available since the MVAPICH2-GDR 2.3.2 release!
– http://mvapich.cse.ohio-state.edu/

SLIDE 25


Thank You!

Network-Based Computing Laboratory http://nowlab.cse.ohio-state.edu/

{chu.368, kousha.2, awan.10, shafiekhorassani.1}@osu.edu {subramon, panda}@cse.ohio-state.edu

The High-Performance MPI/PGAS Project: http://mvapich.cse.ohio-state.edu/
The High-Performance Deep Learning Project: http://hidl.cse.ohio-state.edu/
The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/
