SLIDE 1

High-Performance Adaptive MPI Derived Datatype Communication for Modern Multi-GPU Systems

Ching-Hsiang Chu, Jahanzeb Maqbool Hashmi, Kawthar Shafie Khorassani, Hari Subramoni, Dhabaleswar K. (DK) Panda
The Ohio State University

{chu.368, hashmi.29, shafiekhorassani.1}@osu.edu, {subramon, panda}@cse.ohio-state.edu

http://www.cse.ohio-state.edu/~panda

SLIDE 2

  • Introduction
  • Problem Statement
  • Proposed Designs
  • Performance Evaluation
  • Concluding Remarks

Outline

SLIDE 3

  • Multi-core/many-core technologies
  • Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
  • Solid State Drives (SSDs), Non-Volatile Random-Access Memory (NVRAM), NVMe-SSD
  • Multiple Accelerators (NVIDIA GPGPUs and Intel Xeon Phi) connected by PCIe/NVLink interconnects
  • Available on HPC Clouds, e.g., Amazon EC2, NSF Chameleon, Microsoft Azure, etc.

Drivers of Modern HPC Cluster Architectures

[Figure: Multi-core processors; accelerators/coprocessors (high compute density, high performance/watt, >1 TFlop DP on a chip); high-performance interconnects (InfiniBand, <1 usec latency, 200 Gbps bandwidth); SSD, NVMe-SSD, NVRAM]

Example systems: K Computer, Sunway TaihuLight, Summit, Sierra

SLIDE 4

  • Wide usage of MPI derived datatypes for non-contiguous data transfer (a minimal usage sketch appears at the end of this slide)

– Requires low-latency and high-overlap datatype processing

Non-contiguous Data Transfer for HPC Applications

M. Martinasso, G. Kwasniewski, S. R. Alam, T. C. Schulthess, and T. Hoefler, “A PCIe Congestion-Aware Performance Model for Densely Populated Accelerator Servers,” SC 2016.

Weather Simulation: COSMO model

Mike Clark, “GPU Computing with QUDA,” Developer Technology Group, https://www.olcf.ornl.gov/wp-content/uploads/2013/02/Clark_M_LQCD.pdf

Quantum Chromodynamics
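To make the usage concrete, here is a minimal sketch (not taken from any of the cited applications) of how a strided face of a 3D field residing in GPU memory can be described with an MPI derived datatype and exchanged through a CUDA-aware MPI library; the grid dimensions and the two-rank setup are illustrative assumptions.

```cuda
#include <mpi.h>
#include <cuda_runtime.h>

/* Illustrative sketch: describe one YZ face of a 3D field as a strided
 * MPI derived datatype and exchange it directly from GPU memory through
 * a CUDA-aware MPI library. Dimensions and the two-rank setup are
 * hypothetical example values. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int NX = 512, NY = 512, NZ = 256;          /* local grid size */
    double *d_field, *d_halo;
    cudaMalloc((void **)&d_field, sizeof(double) * NX * NY * NZ);
    cudaMalloc((void **)&d_halo,  sizeof(double) * NX * NY * NZ);

    /* One YZ face: NY*NZ blocks of 1 element, each NX doubles apart. */
    MPI_Datatype face;
    MPI_Type_vector(NY * NZ, 1, NX, MPI_DOUBLE, &face);
    MPI_Type_commit(&face);

    /* With a CUDA-aware MPI, GPU pointers are passed directly; the
     * library is responsible for handling the non-contiguous layout. */
    int peer = (rank + 1) % 2;                        /* assumes 2 ranks */
    MPI_Sendrecv(d_field, 1, face, peer, 0,
                 d_halo,  1, face, peer, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Type_free(&face);
    cudaFree(d_field);
    cudaFree(d_halo);
    MPI_Finalize();
    return 0;
}
```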

SLIDE 5

  • GPU kernel-based packing/unpacking [1-3]

– High-throughput memory access
– Leverage GPUDirect RDMA capability
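As a rough illustration of this approach (not the actual kernel from [1-3] or from any MPI library), a minimal CUDA packing kernel that gathers a strided vector layout into a contiguous staging buffer before it is handed to the network; the layout parameters are hypothetical.

```cuda
// Minimal sketch of a GPU packing kernel for a strided (vector) layout.
// Each thread copies one element from the strided source into the
// contiguous staging buffer; count/blocklen/stride are illustrative.
__global__ void pack_vector(const double *__restrict__ src,
                            double *__restrict__ packed,
                            int count, int blocklen, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int total = count * blocklen;
    if (i < total) {
        int block  = i / blocklen;           // which strided block
        int offset = i % blocklen;           // element within the block
        packed[i] = src[block * stride + offset];
    }
}

// Host side: launch enough threads to cover the packed size; the
// contiguous 'packed' buffer can then be sent (e.g., via GPUDirect RDMA).
//   pack_vector<<<(total + 255) / 256, 256>>>(d_src, d_packed,
//                                             count, blocklen, stride);
```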


State-of-the-art MPI Derived Datatype Processing

[1] R. Shi, X. Lu, S. Potluri, K. Hamidouche, J. Zhang, and D. K. Panda, "HAND: A Hybrid Approach to Accelerate Non-contiguous Data Movement Using MPI Datatypes on GPU Clusters," ICPP 2014.
[2] C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee, H. Subramoni, and D. K. Panda, "Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems," IPDPS 2016.
[3] W. Wu, G. Bosilca, R. vandeVaart, S. Jeaugey, and J. Dongarra, "GPU-Aware Non-contiguous Data Movement In Open MPI," HPDC 2016.

SLIDE 6

Overhead of MPI Datatype Processing

[Chart: Latency (us, log scale) of contiguous vs. derived datatype (DDT) transfers in MVAPICH2-GDR and OpenMPI, for NAS kernels [32,16,16] (3.28 KB) and [512,512,256] (1012 KB) and specfem3D_cm kernels [1957x245] (25.8 KB) and [11797x3009] (173.51 KB)]

  • Significant overhead when moving non-contiguous GPU-resident data

– Wasting cycles
– Extra data copies
– High latency

Expensive Packing/Unpacking Operations in GPU-Aware MPI

337X worse!

Data transfer between two NVIDIA K80 GPUs with PCIe link

SLIDE 7

  • Primary overheads

– Packing/unpacking
– CPU-GPU synchronization
– GPU driver overhead

  • Can we reduce or eliminate the expensive packing/unpacking operations?

Analysis of Packing/Unpacking Operations in GPU-Aware MPI

[Chart: Time breakdown (%) of datatype processing for NAS (3.28 KB, 1012 KB) and specfem3D_cm (25.8 KB, 173.51 KB) in MVAPICH2-GDR 2.3.1 and OpenMPI 4.0.1 + UCX 1.5.1; components: CUDA driver overhead, pack/unpack kernels, memory allocation, CUDA synchronization, others]

Data transfer between two NVIDIA K80 GPUs with PCIe link

SLIDE 8

  • Introduction
  • Problem Statement
  • Proposed Designs
  • Performance Evaluation
  • Concluding Remarks

Outline

SLIDE 9

  • How can we exploit the load-store based remote memory access model over high-performance interconnects like PCIe and NVLink to achieve “packing-free” non-contiguous data transfers for GPU-resident data?
  • Can we propose new designs that mitigate the overheads of existing approaches and offer optimal performance for GPU-based derived datatype transfers when packing/unpacking is inevitable?
  • How can we design an adaptive MPI communication runtime that dynamically employs the optimal DDT processing mechanism for diverse application scenarios?

Problem Statement

SLIDE 10

  • Introduction
  • Problem Statement
  • Proposed Designs

– Zero-copy non-contiguous data movement over NVLink/PCIe
– One-shot packing/unpacking
– Adaptive MPI derived datatype processing

  • Performance Evaluation
  • Concluding Remarks

Outline

SLIDE 11

  • Direct link such as PCIe/NVLink is available between two GPUs
  • Efficient datatype layout exchange and cache
  • Load-store data movement

Overview of Zero-copy Datatype Transfer

SLIDE 12

  • Convert the IOV list to a displacement list (illustrated in the sketch below)

– Improved reusability
– One-time effort

  • Cache the datatype layout on the shared system memory

– Accessible within the node without extra copies

Zero-copy Datatype Transfer: Enhanced Layout Cache
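A minimal sketch of the idea (not MVAPICH2's actual internal data structures): flatten an IOV-style description of a datatype into a compact displacement/length table, built once and placed in memory that is shareable within the node; the struct and function names are hypothetical.

```cuda
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical types: an IOV-style entry and a cached layout entry. */
struct iov_entry { void  *base; size_t len; };
struct ddt_entry { size_t disp; size_t len; };

/* One-time conversion of an IOV list into a reusable displacement list.
 * The table lives in a shared anonymous mapping so later transfers can
 * reuse it without rebuilding or copying the layout description. */
static struct ddt_entry *cache_layout(const struct iov_entry *iov, int n,
                                      const void *buf_base)
{
    struct ddt_entry *tbl = (struct ddt_entry *)
        mmap(NULL, (size_t)n * sizeof(struct ddt_entry),
             PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (tbl == MAP_FAILED)
        return NULL;
    for (int i = 0; i < n; i++) {
        /* Store displacements relative to the user buffer base. */
        tbl[i].disp = (size_t)((const char *)iov[i].base -
                               (const char *)buf_base);
        tbl[i].len  = iov[i].len;
    }
    return tbl;
}
```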

SLIDE 13

  • Exploiting load-store capability of modern interconnects

– Eliminate extra data copies and expensive packing/unpacking processing

Zero-copy Datatype Transfer: Copy vs. Load-Store

[Figure: existing packing scheme (copy between source and destination GPU memory over PCIe/NVLink) vs. proposed packing-free scheme (load-store between source and destination GPU memory over PCIe/NVLink)]
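A rough sketch of the load-store idea using plain CUDA peer access (illustrative only, not the runtime's actual implementation): after peer access is enabled, a kernel on the source GPU stores each element of a strided layout directly into the destination GPU's buffer, so no contiguous packing buffer is ever materialized.

```cuda
#include <cuda_runtime.h>

/* Each thread copies one element of a strided (vector) layout directly
 * from local GPU memory into the peer GPU's memory over NVLink/PCIe. */
__global__ void zero_copy_vector(const double *__restrict__ src,
                                 double *__restrict__ dst_peer,
                                 int count, int blocklen, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count * blocklen) {
        int b = i / blocklen, o = i % blocklen;
        dst_peer[b * stride + o] = src[b * stride + o];   /* remote store */
    }
}

/* Host side (error checks omitted; device IDs and layout parameters are
 * hypothetical): enable peer access once, then launch from the source GPU.
 *
 *   cudaSetDevice(src_dev);
 *   cudaDeviceEnablePeerAccess(dst_dev, 0);
 *   int total = count * blocklen;
 *   zero_copy_vector<<<(total + 255) / 256, 256>>>(d_src, d_dst_on_peer,
 *                                                  count, blocklen, stride);
 */
```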

SLIDE 14

  • Packing/unpacking is inevitable if there is no direct link
  • Direct packing/unpacking between CPU and GPU memory to avoid extra copies

One-shot Packing/Unpacking Mechanism

  • 1. GDRCopy-based

– CPU-driven, low-latency, copy-based scheme

  • 2. Kernel-based

– GPU-driven, high-throughput, load-store-based scheme

[Figure: data staged through system memory between source and destination GPU memory over PCIe/NVLink (no direct GPU-GPU link)]
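As a sketch of the kernel-based one-shot idea, under the assumption of CUDA mapped (zero-copy) host memory (an illustration only, not the GDRCopy-based path or the runtime's actual code): a single kernel reads the packed data straight from host memory and scatters it into the non-contiguous GPU destination, avoiding an intermediate staging copy on the GPU.

```cuda
#include <cuda_runtime.h>

/* One-shot unpack: read the packed buffer (resident in host memory mapped
 * into the GPU address space) and scatter it directly into the strided GPU
 * destination, with no intermediate GPU-side staging buffer. */
__global__ void oneshot_unpack(const double *__restrict__ host_packed,
                               double *__restrict__ dst,
                               int count, int blocklen, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count * blocklen) {
        int b = i / blocklen, o = i % blocklen;
        dst[b * stride + o] = host_packed[i];  /* load over PCIe, store locally */
    }
}

/* Host side (illustrative): the packed staging buffer is allocated as
 * mapped, pinned host memory so the kernel can dereference it directly.
 *
 *   double *h_packed, *d_packed_view;
 *   cudaHostAlloc((void **)&h_packed, bytes, cudaHostAllocMapped);
 *   cudaHostGetDevicePointer((void **)&d_packed_view, h_packed, 0);
 *   int total = count * blocklen;
 *   oneshot_unpack<<<(total + 255) / 256, 256>>>(d_packed_view, d_dst,
 *                                                count, blocklen, stride);
 */
```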

SLIDE 15

  • Availability of GPUDirect peer access and GPUDirect RDMA
  • Latency- or throughput-oriented communication pattern

Adaptive Selection


[Figure: decision tree selecting among the three proposed schemes and the state-of-the-art kernel-based scheme, depending on direct-link availability (GPUDirect peer access, i.e., whether a direct link exists) and GPUDirect RDMA support]
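A compact sketch of what such adaptive selection could look like in code; the predicate names, threshold, and enum are hypothetical and only mirror the decision factors listed on this slide, not the exact MVAPICH2-GDR logic.

```cuda
#include <stddef.h>

/* Hypothetical scheme identifiers for the paths discussed in this talk. */
typedef enum {
    DDT_ZCOPY,            /* proposed: packing-free load-store transfer  */
    DDT_ONESHOT_GDRCOPY,  /* proposed: CPU-driven one-shot pack/unpack   */
    DDT_ONESHOT_KERNEL,   /* proposed: GPU-driven one-shot pack/unpack   */
    DDT_PACK_KERNEL       /* state-of-the-art kernel-based packing       */
} ddt_scheme_t;

/* Select a scheme from the factors on this slide: direct-link availability
 * (GPUDirect peer access), GPUDirect RDMA support, and whether the message
 * size suggests a latency- or throughput-oriented pattern. */
static ddt_scheme_t select_ddt_scheme(int peer_access_ok, int gdr_ok,
                                      size_t msg_size, size_t latency_threshold)
{
    if (peer_access_ok)
        return DDT_ZCOPY;                   /* direct PCIe/NVLink path   */
    if (gdr_ok && msg_size <= latency_threshold)
        return DDT_ONESHOT_GDRCOPY;         /* small, latency-sensitive  */
    if (gdr_ok)
        return DDT_ONESHOT_KERNEL;          /* large, throughput-bound   */
    return DDT_PACK_KERNEL;                 /* no direct link, no GDR    */
}
```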

SLIDE 16

  • Introduction
  • Problem Statement
  • Proposed Designs
  • Performance Evaluation
  • Concluding Remarks

Outline

SLIDE 17

Experimental Environments

                              Cray CS-Storm               NVIDIA DGX-2
CPU model                     Intel Haswell               Intel Skylake
System memory                 256 GB                      1.5 TB
GPUs                          8 NVIDIA Tesla K80          16 NVIDIA Tesla V100
Interconnects                 PCIe Gen3, Mellanox IB FDR  NVLink/NVSwitch, Mellanox IB EDR x 8 (unused)
OS & compiler version         RHEL 7.3 & GCC 4.8.5        Ubuntu 18.04 & GCC 7.3.0
NVIDIA driver & CUDA version  410.79 & 9.2.148            410.48 & 9.2.148

  • Benchmarks: Modified DDTBench to use GPU-resident data
  • NAS_MG, MILC, Specfem3D_cm, and Specfem3D_oc
  • Application kernels
  • COSMO model & Jacobi Method
  • Baseline: MVAPICH2-GDR 2.3.1

SLIDE 18

  • High Performance open-source MPI Library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)

– MVAPICH (MPI-1) and MVAPICH2 (MPI-2.2 and MPI-3.1); started in 2001, first version available in 2002
– MVAPICH2-X (MPI + PGAS), available since 2011
– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
– Support for Virtualization (MVAPICH2-Virt), available since 2015
– Support for Energy-Awareness (MVAPICH2-EA), available since 2015
– Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015
– Used by more than 3,050 organizations in 89 countries
– More than 615,000 (> 0.6 million) downloads from the OSU site directly

– Empowering many TOP500 clusters (Jun ‘19 ranking)

  • 3rd, 10,649,600-core (Sunway TaihuLight) at National Supercomputing Center in Wuxi, China
  • 5th, 448,448 cores (Frontera) at TACC
  • 8th, 391,680 cores (ABCI) in Japan
  • 15th, 570,020 cores (Nurion) in South Korea and many others

– Available with software stacks of many vendors and Linux Distros (RedHat, SuSE, and OpenHPC)

– http://mvapich.cse.ohio-state.edu

  • Empowering Top500 systems for over a decade

Overview of the MVAPICH2 Project

Partner in the TACC Frontera System

SLIDE 19

  • Zero-copy performs the best in almost all cases!

Evaluation of Zero-copy Design: Dense Layout


Please refer to the paper for more performance comparison!

[Charts: Log latency (us) vs. application kernels and data sizes (NAS: 3360, 62496, 256032, 1036320; MILC: 12288, 24576, 49152, 98304) for Baseline, DDT-OSP-KERNEL, DDT-OSP-GDRCOPY, and DDT-ZCPY, on Cray CS-Storm (two GPUs through a PCIe switch) and NVIDIA DGX-2 (two GPUs through NVLink/NVSwitch); annotated improvements of 24X and 4.7X]

SLIDE 20

  • Zero-copy performs the best in all cases by avoiding unnecessary data copies and CPU-GPU synchronization

Evaluation of Zero-copy Design: Sparse Layout


[Charts: Log latency (us) vs. application kernels and data sizes (specfem3D_oc: 1972, 3508, 7588, 12900; specfem3D_cm: 26424, 55224, 106248, 177672) for Baseline, DDT-OSP-KERNEL, DDT-OSP-GDRCOPY, and DDT-ZCPY, on Cray CS-Storm (two GPUs through a PCIe switch) and NVIDIA DGX-2 (two GPUs through NVLink/NVSwitch); annotated improvements of 520X and 369X]

SLIDE 21

[Chart (dense layout): log latency (us) for NAS (3360, 62496, 256032, 1036320) and MILC (12288, 24576, 49152, 98304) kernel data sizes, comparing Baseline, DDT-OSP-KERNEL, and DDT-OSP-GDRCOPY]


Evaluation of One-shot Packing Design

Please refer to the paper for more performance comparison!

  • GDRCopy-based scheme performs better for dense layout
  • Kernel-based scheme performs better for sparse layout

Platform: Cray CS-Storm; Two GPUs on different sockets without direct link

[Chart (distributed/sparse layout): log latency (us) for specfem3D_oc (1972, 3508, 7588, 12900) and specfem3D_cm (26424, 55224, 106248, 177672) kernel data sizes, comparing Baseline, DDT-OSP-KERNEL, and DDT-OSP-GDRCOPY; annotated improvements of 78X and 8X]

SLIDE 22

Evaluation of Applications

  • COSMO model (https://github.com/cosunae/HaloExchangeBenchmarks)
  • Jacobi (2D stencil computation)

[Chart: GFLOPS vs. number of GPUs (2, 4, 8, 16) for Baseline and Proposed-Adaptive]
[Chart: Execution time (s) vs. number of GPUs (16, 32, 64) for Baseline and Proposed-Adaptive]

Platforms: Cray CS-Storm (8 NVIDIA K80 GPUs per node) and NVIDIA DGX-2 (16 NVIDIA V100 GPUs per node)
Annotated improvements: 3.4X, 15X, and 13%

SLIDE 23

  • Introduction
  • Problem Statement
  • Proposed Designs
  • Performance Evaluation
  • Concluding Remarks

Outline

SLIDE 24

  • Non-contiguous data communication is common in many HPC applications

– However, it is not optimized in current GPU-aware MPI implementations

  • Proposed designs significantly reduce the packing overhead

– Zero-copy design eliminates expensive packing/unpacking operations
– One-shot design avoids extra data copies
– Adaptive scheme dynamically selects the optimal communication paths

  • Publicly available since MVAPICH2-GDR 2.3.2 release

– http://mvapich.cse.ohio-state.edu/

Concluding Remarks

SLIDE 25

Thank You!

Network-Based Computing Laboratory http://nowlab.cse.ohio-state.edu/

panda@cse.ohio-state.edu

The High-Performance MPI/PGAS Project: http://mvapich.cse.ohio-state.edu/
The High-Performance Deep Learning Project: http://hidl.cse.ohio-state.edu/
The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/
