SLIDE 1

MVAPICH2-GDR: Pushing the Frontier of MPI Libraries Enabling GPUDirect Technologies

GPU Technology Conference (GTC 2018)

Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: panda@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~panda

Hari Subramoni
The Ohio State University
E-mail: subramon@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~subramon

SLIDE 2: Outline

  • Overview of the MVAPICH2 Project
  • MVAPICH2-GPU with GPUDirect-RDMA (GDR)
  • Current Features
    – Multi-stream Communication for IPC
    – CMA-based Intra-node Host-to-Host Communication Support
    – Maximal Overlap in MPI Datatype Processing
    – Efficient Support for Managed Memory
    – Streaming Support with InfiniBand Multicast and GDR
    – Support for Deep Learning
    – Support for OpenPOWER with NVLink
    – Support for Containers
  • Upcoming Features
    – CMA-based Intra-node Collective Communication Support
    – XPMEM-based Collective Communication Support
    – Optimized Collectives for Deep Learning
    – Out-of-core Processing for Deep Learning
  • Conclusions

SLIDE 3: Overview of the MVAPICH2 Project

  • High Performance open-source MPI Library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)

– MVAPICH (MPI-1) and MVAPICH2 (MPI-2.2 and MPI-3.1): started in 2001, first version available in 2002
– MVAPICH2-X (MPI + PGAS): available since 2011
– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC): available since 2014
– Support for Virtualization (MVAPICH2-Virt): available since 2015
– Support for Energy-Awareness (MVAPICH2-EA): available since 2015
– Support for InfiniBand Network Analysis and Monitoring (OSU INAM): available since 2015

– Used by more than 2,875 organizations in 86 countries
– More than 460,000 (> 0.46 million) downloads directly from the OSU site

– Empowering many TOP500 clusters (Nov ‘17 ranking)

  • 1st, 10,649,600-core (Sunway TaihuLight) at National Supercomputing Center in Wuxi, China
  • 9th, 556,104 cores (Oakforest-PACS) in Japan
  • 12th, 368,928-core (Stampede2) at TACC
  • 17th, 241,108-core (Pleiades) at NASA
  • 48th, 76,032-core (Tsubame 2.5) at Tokyo Institute of Technology

– Available with software stacks of many vendors and Linux Distros (RedHat and SuSE)

– http://mvapich.cse.ohio-state.edu

  • Empowering Top500 systems for over a decade
SLIDE 4: MVAPICH2 Release Timeline and Downloads

[Figure: cumulative downloads from the OSU site, Sep 2004 – Jan 2018, annotated with release milestones: MV 0.9.4, MV2 0.9.0, MV2 0.9.8, MV2 1.0, MV 1.0, MV2 1.0.3, MV 1.1, MV2 1.4, MV2 1.5, MV2 1.6, MV2 1.7, MV2 1.8, MV2 1.9, MV2-GDR 2.0b, MV2-MIC 2.0, MV2-GDR 2.3a, MV2-X 2.3b, MV2-Virt 2.2, MV2 2.3rc1, OSU INAM 0.9.3]

SLIDE 5: MVAPICH2 Architecture

High Performance Parallel Programming Models
– Message Passing Interface (MPI)
– PGAS (UPC, OpenSHMEM, CAF, UPC++)
– Hybrid: MPI + X (MPI + PGAS + OpenMP/Cilk)

High Performance and Scalable Communication Runtime: Diverse APIs and Mechanisms
– Point-to-point primitives, collectives algorithms, energy-awareness, remote memory access, I/O and file systems, fault tolerance, virtualization, active messages, job startup, introspection & analysis

Support for Modern Networking Technology (InfiniBand, iWARP, RoCE, Omni-Path) and Modern Multi-/Many-core Architectures (Intel Xeon, OpenPOWER, Xeon Phi (MIC, KNL*), NVIDIA GPGPU)
– Transport protocols: RC, XRC, UD, DC
– Modern features: UMR, ODP*, SR-IOV, Multi-Rail
– Transport mechanisms: Shared Memory, CMA, IVSHMEM
– Modern features: MCDRAM*, NVLink*, CAPI*

* Upcoming

SLIDE 6: MVAPICH2 Software Family

High-Performance Parallel Programming Libraries
– MVAPICH2: Support for InfiniBand, Omni-Path, Ethernet/iWARP, and RoCE
– MVAPICH2-X: Advanced MPI features, OSU INAM, PGAS (OpenSHMEM, UPC, UPC++, and CAF), and MPI+PGAS programming models with a unified communication runtime
– MVAPICH2-GDR: Optimized MPI for clusters with NVIDIA GPUs
– MVAPICH2-Virt: High-performance and scalable MPI for hypervisor- and container-based HPC clouds
– MVAPICH2-EA: Energy-aware and high-performance MPI
– MVAPICH2-MIC: Optimized MPI for clusters with Intel KNC

Microbenchmarks
– OMB: Microbenchmark suite to evaluate MPI and PGAS (OpenSHMEM, UPC, and UPC++) libraries for CPUs and GPUs

Tools
– OSU INAM: Network monitoring, profiling, and analysis for clusters with MPI and scheduler integration
– OEMT: Utility to measure the energy consumption of MPI applications

SLIDE 7: MVAPICH2-GDR: Optimizing MPI Data Movement on GPU Clusters

[Figure: two nodes (Node 0, Node 1), each with dual-socket CPUs connected by QPI, multiple GPUs attached over PCIe, and an IB HCA; memory buffers can reside on any of them]

  • 1. Intra-GPU
  • 2. Intra-Socket GPU-GPU
  • 3. Inter-Socket GPU-GPU
  • 4. Inter-Node GPU-GPU
  • 5. Intra-Socket GPU-Host
  • 6. Inter-Socket GPU-Host
  • 7. Inter-Node GPU-Host
  • 8. Inter-Node GPU-GPU with IB adapter on remote socket
  • ... and more

  • NVLink is leading to even more paths
  • For each path, different schemes apply: shared memory, IPC, GPUDirect RDMA, pipelining, ...
  • Critical for runtimes to optimize data movement while hiding the complexity
  • GPUs are connected as PCIe devices: flexibility, but also complexity

SLIDE 8: GPU-Aware (CUDA-Aware) MPI Library: MVAPICH2-GPU

At Sender:    MPI_Send(s_devbuf, size, ...);
At Receiver:  MPI_Recv(r_devbuf, size, ...);

Inside MVAPICH2:
  • Standard MPI interfaces used for unified data movement
  • Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
  • Overlaps data movement from the GPU with RDMA transfers

High Performance and High Productivity
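Below is a minimal, self-contained sketch of this pattern, assuming two ranks, one GPU per rank, and a CUDA-aware build such as MVAPICH2-GDR (run with MV2_USE_CUDA=1); error checking is omitted for brevity.

    /* CUDA-aware MPI: pass device pointers straight to MPI_Send/MPI_Recv. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        cudaSetDevice(0);                      /* one GPU per rank assumed */

        const int n = 1 << 20;
        float *devbuf;                         /* s_devbuf / r_devbuf from the slide */
        cudaMalloc((void **)&devbuf, n * sizeof(float));

        if (rank == 0)
            MPI_Send(devbuf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(devbuf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        cudaFree(devbuf);
        MPI_Finalize();
        return 0;
    }

Via Unified Virtual Addressing the library detects that the buffer is device-resident and selects the appropriate path (GDR, IPC, or pipelining) internally.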

SLIDE 9: CUDA-Aware MPI: MVAPICH2-GDR 1.8–2.3 Releases

  • Support for MPI communication from NVIDIA GPU device memory
  • High-performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host, and Host-GPU)
  • High-performance intra-node point-to-point communication for nodes with multiple GPU adapters (GPU-GPU, GPU-Host, and Host-GPU)
  • Taking advantage of CUDA IPC (available since CUDA 4.1) for intra-node communication across multiple GPU adapters per node
  • Optimized and tuned collectives for GPU device buffers
  • MPI datatype support for point-to-point and collective communication from GPU device buffers
  • Unified memory
SLIDE 10: Using the MVAPICH2-GPUDirect Version

  • MVAPICH2-2.3 with GDR support can be downloaded from https://mvapich.cse.ohio-state.edu/download/mvapich2gdr/
  • System software requirements
    – Mellanox OFED 3.2 or later
    – NVIDIA Driver 367.48 or later
    – NVIDIA CUDA Toolkit 7.5/8.0/9.0 or later
    – Plugin for GPUDirect RDMA: http://www.mellanox.com/page/products_dyn?product_family=116
    – Strongly recommended: GDRCOPY module from NVIDIA (https://github.com/NVIDIA/gdrcopy)
  • Contact the MVAPICH help list with any questions related to the package: mvapich-help@cse.ohio-state.edu

SLIDE 11: MVAPICH2-GDR 2.3a

  • Released on 11/09/2017
  • Major Features and Enhancements
    – Based on MVAPICH2 2.2
    – Support for CUDA 9.0
    – Added support for Volta (V100) GPUs
    – Support for OpenPOWER with NVLink
    – Efficient multiple-CUDA-stream-based IPC communication for multi-GPU systems with and without NVLink
    – Enhanced performance of GPU-based point-to-point communication
    – Leverages the Linux Cross Memory Attach (CMA) feature for enhanced host-based communication
    – Enhanced performance of MPI_Allreduce for GPU-resident data
    – InfiniBand Multicast (IB-MCAST) based designs for GPU-based broadcast and streaming applications
      • Basic support for IB-MCAST designs with GPUDirect RDMA
      • Advanced support for zero-copy IB-MCAST designs with GPUDirect RDMA
      • Advanced reliability support for IB-MCAST designs
    – Efficient broadcast designs for Deep Learning applications
    – Enhanced collective tuning on Xeon, OpenPOWER, and NVIDIA DGX-1 systems

SLIDE 12: Optimized MVAPICH2-GDR Design

[Figure: GPU-GPU inter-node latency, bandwidth, and bi-directional bandwidth vs. message size (1 B – 8 KB), MV2 (no GDR) vs. MV2-GDR 2.3a]

  • Latency down to 1.88 us, up to 11X lower than without GDR
  • Roughly 9x–10x higher uni- and bi-directional bandwidth

Testbed: MVAPICH2-GDR 2.3a; Intel Haswell (E5-2687W @ 3.10 GHz) node, 20 cores; NVIDIA Volta V100 GPU; Mellanox ConnectX-4 EDR HCA; CUDA 9.0; Mellanox OFED 4.0 with GPUDirect RDMA

SLIDE 13: RoCE and Optimized Collectives Support

  • RoCE v1 and v2 support
  • RDMA_CM connection support
  • CUDA-aware collective tuning
    – Point-to-point tuning (available since MVAPICH2-GDR 2.0)
      • Tuned thresholds for the different communication patterns and features
      • Depends on the system configuration (CPU, HCA, and GPU models)
    – Tuning framework for GPU-based collectives
      • Selects the best algorithm depending on message size, system size, and system configuration
      • Support for Bcast and Gather operations on different GDR-enabled systems
      • Available since the MVAPICH2-GDR 2.2RC1 release

SLIDE 14: Application-Level Evaluation (HOOMD-blue)

  • Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
  • HOOMD-blue version 1.0.5
  • GDRCOPY enabled: MV2_USE_CUDA=1 MV2_IBA_HCA=mlx5_0 MV2_IBA_EAGER_THRESHOLD=32768 MV2_VBUF_TOTAL_SIZE=32768 MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768 MV2_USE_GPUDIRECT_GDRCOPY=1 MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384

[Figure: average time steps per second (TPS) vs. number of processes (4–32) for 64K-particle and 256K-particle runs, MV2 vs. MV2+GDR; about 2X higher TPS with GDR in both cases]

SLIDE 15: Application-Level Evaluation (COSMO) and Weather Forecasting in Switzerland

[Figure: normalized execution time vs. number of GPUs on the CSCS GPU cluster (16–96 GPUs) and the Wilkes GPU cluster (4–32 GPUs), comparing Default, Callback-based, and Event-based designs]

  • 2X improvement on 32 GPUs
  • 30% improvement on 96 GPUs (8 GPUs/node)
  • C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee, H. Subramoni, and D. K. Panda, Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems, IPDPS'16

On-going collaboration with CSCS and MeteoSwiss (Switzerland) on co-designing MV2-GDR and the COSMO application

COSMO model: http://www2.cosmo-model.org/content/tasks/operational/meteoSwiss/

SLIDE 16: Outline (repeated; see Slide 2 for the full outline)

SLIDE 17: Multi-stream Communication using CUDA IPC on OpenPOWER and DGX-1

  • Up to 16% higher device-to-device (D2D) bandwidth on OpenPOWER with the NVLink interconnect
  • Up to 30% higher D2D bandwidth on DGX-1 with NVLink

[Figure: point-to-point D2D bandwidth vs. message size, 1-stream vs. 4-streams CUDA IPC design; up to 16% better on OpenPOWER (128 KB – 4 MB), up to 30% better on DGX-1 (16 KB – 4 MB)]

Available with MVAPICH2-GDR 2.3a

SLIDE 18: CMA-based Intra-node Host-to-Host Communication Support

[Figure: intra-node point-to-point host-to-host (H2H) bandwidth and latency vs. message size, MV2-GDR without CMA vs. with CMA]

  • Up to 30% lower host-to-host (H2H) latency and 30% higher H2H bandwidth

Testbed: MVAPICH2-GDR 2.3a; Intel Broadwell (E5-2680 v4 @ 2.40 GHz) node, 28 cores; NVIDIA Tesla K80 GPU; Mellanox ConnectX-4 EDR HCA; CUDA 8.0; Mellanox OFED 4.0 with GPUDirect RDMA
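For reference, CMA is the Linux cross-memory-attach facility (the process_vm_readv/process_vm_writev system calls), which lets one process copy directly from a peer's address space in a single copy instead of staging through a shared-memory buffer. The sketch below illustrates only the underlying primitive; it is not MVAPICH2's internal code, and remote_pid/remote_addr are assumed to be exchanged out of band (e.g., during connection setup).

    #define _GNU_SOURCE
    #include <sys/uio.h>      /* process_vm_readv */
    #include <unistd.h>

    /* Hypothetical helper: single-copy read of len bytes from a peer process. */
    ssize_t cma_read(pid_t remote_pid, void *local_buf,
                     void *remote_addr, size_t len)
    {
        struct iovec local  = { .iov_base = local_buf,   .iov_len = len };
        struct iovec remote = { .iov_base = remote_addr, .iov_len = len };

        /* Copies directly from the peer's address space: one copy,
         * no intermediate shared-memory staging. */
        return process_vm_readv(remote_pid, &local, 1, &remote, 1, 0);
    }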

SLIDE 19: Non-contiguous Data Exchange

  • Multi-dimensional data
    – Row-based organization
    – Contiguous in one dimension
    – Non-contiguous in the other dimensions
  • Halo data exchange
    – Duplicate the boundary
    – Exchange the boundary in each iteration

[Figure: halo data exchange between neighboring sub-domains]

SLIDE 20: MPI Datatype Support in MVAPICH2

  • Datatype support in MPI
    – Operate on customized datatypes to improve productivity
    – Enables the MPI library to optimize non-contiguous data

At Sender:

    MPI_Type_vector(n_blocks, n_elements, stride, old_type, &new_type);
    MPI_Type_commit(&new_type);
    ...
    MPI_Send(s_buf, size, new_type, dest, tag, MPI_COMM_WORLD);

  • Inside MVAPICH2
    – Uses datatype-specific CUDA kernels to pack data in chunks
    – Efficiently moves data between nodes using RDMA
    – In progress: currently optimizes vector and hindexed datatypes
    – Transparent to the user

H. Wang, S. Potluri, D. Bureddy, C. Rosales, and D. K. Panda, GPU-aware MPI on RDMA-Enabled Clusters: Design, Implementation and Evaluation, IEEE Transactions on Parallel and Distributed Systems, Vol. 25, No. 10, pp. 2595-2605, Oct 2014.
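As a concrete illustration of the sender-side fragment above, the hypothetical helper below sends one column of a row-major N x N matrix that resides in GPU memory; with MVAPICH2-GDR the non-contiguous column is packed by a CUDA kernel and moved with RDMA, transparently to the caller.

    #include <mpi.h>

    /* Send column `col` of a row-major N x N matrix held in device memory. */
    void send_column(float *d_matrix, int N, int col, int dest)
    {
        MPI_Datatype column_t;

        /* N blocks of 1 element, stride N elements apart: one matrix column. */
        MPI_Type_vector(N, 1, N, MPI_FLOAT, &column_t);
        MPI_Type_commit(&column_t);

        MPI_Send(d_matrix + col, 1, column_t, dest, 0, MPI_COMM_WORLD);

        MPI_Type_free(&column_t);
    }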

SLIDE 21: MPI Datatype Processing (Computation Optimization)

  • Comprehensive support
    – Targeted kernels for regular datatypes: vector, subarray, indexed_block
    – Generic kernels for all other irregular datatypes
  • Separate non-blocking stream for kernels launched by the MPI library
    – Avoids stream conflicts with application kernels
  • Flexible set of parameters for users to tune the kernels
    – Vector: MV2_CUDA_KERNEL_VECTOR_TIDBLK_SIZE, MV2_CUDA_KERNEL_VECTOR_YSIZE
    – Subarray: MV2_CUDA_KERNEL_SUBARR_TIDBLK_SIZE, MV2_CUDA_KERNEL_SUBARR_XDIM, MV2_CUDA_KERNEL_SUBARR_YDIM, MV2_CUDA_KERNEL_SUBARR_ZDIM
    – Indexed_block: MV2_CUDA_KERNEL_IDXBLK_XDIM
SLIDE 22: MPI Datatype Processing (Communication Optimization)

Common scenario (wastes computing resources on both the CPU and the GPU):

    /* A, B, C, D contain non-contiguous MPI datatypes */
    MPI_Isend(A, ..., datatype, ...);
    MPI_Isend(B, ..., datatype, ...);
    MPI_Isend(C, ..., datatype, ...);
    MPI_Isend(D, ..., datatype, ...);
    ...
    MPI_Waitall(...);

SLIDE 23: Outline (repeated; see Slide 2 for the full outline)

SLIDE 24: Initial (Basic) Support for GPU Managed Memory

  • With CUDA 6.0, NVIDIA introduced CUDA Managed (or Unified) Memory, allowing a common memory allocation for GPU and CPU through the cudaMallocManaged() call
  • Significant productivity benefits due to the abstraction of explicit allocation and cudaMemcpy()
  • Extended MVAPICH2 to perform communication directly from managed buffers (available since MVAPICH2-GDR 2.2b)
  • OSU Micro-benchmarks extended to evaluate the performance of point-to-point and collective communication using managed buffers (available since OMB 5.2)
  • D. S. Banerjee, K. Hamidouche, and D. K. Panda, Designing High Performance Communication Runtime for GPU Managed Memory: Early Experiences, GPGPU-9 Workshop, held in conjunction with PPoPP '16

[Figure: 2D stencil halo exchange time vs. total dimension size (halo width 1), device buffers vs. managed buffers]
SLIDE 25: Enhanced Support for Intra-node Managed Memory

  • CUDA Managed memory => no memory pin-down
    – No IPC support for intra-node communication
    – No GDR support for inter-node communication
  • Initial, basic support in MVAPICH2-GDR
    – For both intra- and inter-node, "pipeline through" host memory
  • Enhanced intra-node managed memory support using IPC
    – Double-buffering, pair-wise IPC-based scheme
    – Brings IPC performance to managed memory
    – High performance and high productivity
    – 2.5X improvement in bandwidth
  • Available since MVAPICH2-GDR 2.2RC1

[Figure: latency (4 KB – 4 MB) and bandwidth (32 KB – 2 MB) for managed buffers, Enhanced design vs. MV2-GDR 2.2b; up to 2.5X higher bandwidth]

SLIDE 26: Outline (repeated; see Slide 2 for the full outline)

SLIDE 27: Streaming Applications

  • Streaming applications on HPC systems
    – 1. Communication (MPI): broadcast-type operations
    – 2. Computation (CUDA): multiple GPU nodes as workers

[Figure: a data source sender performs real-time streaming into HPC resources used for real-time analytics; data streaming-like broadcast operations feed worker nodes, each with a CPU and multiple GPUs]

SLIDE 28: Hardware Multicast-based Broadcast

[Figure: source node (IB HCA, CPU, GPU) sends header + data through an IB switch to destinations 1..N]

  • 1. IB Gather + GDR Read
  • 2. IB Hardware Multicast
  • 3. IB Scatter + GDR Write
  • For GPU-resident data, uses
    – GPUDirect RDMA (GDR)
    – InfiniBand Hardware Multicast (IB-MCAST)
  • Overheads
    – IB UD limit
    – GDR limit

A. Venkatesh, H. Subramoni, K. Hamidouche, and D. K. Panda, "A High Performance Broadcast Design with Hardware Multicast and GPUDirect RDMA for Streaming Applications on InfiniBand Clusters," HiPC 2014, Dec 2014.

SLIDE 29: Optimized Broadcast Send

MPI_Bcast(d_out, ...);

  • Prepare an intermediate buffer (im_buf)
    – Page-locked (pinned) host buffer: fast device-to-host data movement
    – Allocated at the initialization phase: low overhead
  • Stream data through the host
    – Fine-tuned chunked data
    – Asynchronous copy operations
  • Three-stage pipeline: 1. Data Preparation, 2. IB Gather, 3. IB Hardware Multicast

[Figure: source-side pipeline moving chunks of the GPU buffer d_out through the pinned host buffer im_buf to the IB HCA and IB switch]

C.-H. Chu, X. Lu, A. A. Awan, H. Subramoni, J. Hashmi, B. Elton, and D. K. Panda, "Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning," ICPP 2017, Aug 14-17, 2017.
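From the application's point of view, the whole pipeline stays hidden behind MPI_Bcast on a device buffer. A hedged sketch of the streaming pattern, with illustrative names and sizes (d_buf stands for the GPU-resident frame buffer, i.e., d_out on the source and d_in on the workers):

    #include <mpi.h>

    /* Repeatedly broadcast GPU-resident chunks from `root` to all workers.
     * MVAPICH2-GDR can service each call with IB-MCAST + GDR internally. */
    void stream_broadcast(float *d_buf, int chunk_elems, int nchunks,
                          int root, MPI_Comm comm)
    {
        for (int i = 0; i < nchunks; i++) {
            /* Source: d_buf holds the next frame (produced by a CUDA kernel
             * or a copy). Workers: d_buf is the pre-posted receive buffer. */
            MPI_Bcast(d_buf, chunk_elems, MPI_FLOAT, root, comm);
        }
    }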

SLIDE 30: Optimized Broadcast Receive

MPI_Bcast(d_in, ...);

  • Zero-copy broadcast receive
    – Pre-posted user buffer (d_in)
    – Avoids additional data movement
    – Leverages the IB Scatter and GDR features
  • Low latency
  • Frees up PCIe resources for applications

[Figure: IB Hardware Multicast through the switch, with IB Scatter (GDR Write) delivering header + data directly into d_in on destinations 1..N]

C.-H. Chu, X. Lu, A. A. Awan, H. Subramoni, J. Hashmi, B. Elton, and D. K. Panda, "Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning," ICPP 2017, Aug 14-17, 2017.

SLIDE 31: Broadcast on Multi-GPU Systems

  • Proposed intra-node topology-aware broadcast
    – CUDA Inter-Process Communication (IPC)

[Figure: multicast steps deliver data from the source to one GPU per node; cudaMemcpy (device-to-device) over IPC distributes it to the remaining GPUs (GPU 0 .. GPU N) within each node]

C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, "Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters," SBAC-PAD'16, Oct 26-28, 2016.

Available in MVAPICH2-GDR 2.3a

SLIDE 32: Streaming Benchmark @ CSCS (88 GPUs)

  • IB-MCAST + GDR + topology-aware IPC-based schemes
    – Up to 58% and 79% latency reduction for small and large messages, respectively

[Figure: broadcast latency vs. message size, MCAST-GDR-OPT vs. MCAST-GDR; 58% lower for small messages (1 B – 16 KB), 79% lower for large messages (32 KB – 4 MB)]

C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, "Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters," SBAC-PAD'16, Oct 26-28, 2016.

SLIDE 33: Application-based Evaluation: CUDA-Aware CNTK

  • @ RI2 cluster, 16 GPUs, 1 GPU/node, with the CUDA-Aware Microsoft Cognitive Toolkit (CA-CNTK) [2]
  • Reduces latency by up to 24%, 16%, and 18% for the AlexNet, VGG, and ResNet-50 models, respectively
  • Higher improvement can be observed for larger system sizes

[Figure: speedup of MV2-GDR-Knomial, MV2-GDR-Ring, and MV2-MCAST-GDR-Opt on 8 and 16 GPUs for AlexNet, VGG, and ResNet-50; higher is better]

[1] C.-H. Chu, X. Lu, A. A. Awan, H. Subramoni, J. Hashmi, B. Elton, and D. K. Panda, Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning, ICPP'17.
[2] D. S. Banerjee, K. Hamidouche, and D. K. Panda, Re-Designing CNTK Deep Learning Framework on Modern GPU Enabled Clusters, IEEE CloudCom'16.

SLIDE 34: Outline (repeated; see Slide 2 for the full outline)

SLIDE 35: Deep Learning: New Challenges for MPI Runtimes

  • Deep Learning frameworks are a different game altogether
    – Unusually large message sizes (on the order of megabytes)
    – Most communication based on GPU buffers
  • Existing state of the art
    – cuDNN, cuBLAS, NCCL --> scale-up performance
    – NCCL2, CUDA-Aware MPI --> scale-out performance, for small and medium message sizes only!
  • Proposed: can we co-design the MPI runtime (MVAPICH2-GDR) and the DL framework (Caffe) to achieve both?
    – Efficient overlap of computation and communication
    – Efficient large-message communication (reductions)
    – What application co-designs are needed to exploit communication-runtime co-designs?

[Figure: scale-up vs. scale-out performance landscape placing cuDNN, cuBLAS, NCCL, NCCL2, MPI, gRPC, and Hadoop, with the proposed co-designs targeting both axes]

A. A. Awan, K. Hamidouche, J. M. Hashmi, and D. K. Panda, S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters, Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '17).

SLIDE 36: Large-Message Optimized Collectives for Deep Learning

  • MV2-GDR provides optimized collectives for large message sizes
  • Optimized Reduce, Allreduce, and Bcast
  • Good scaling with a large number of GPUs
  • Available since MVAPICH2-GDR 2.2 GA

[Figure: latency of Reduce (192 GPUs, 2–128 MB; 64 MB, 128–192 GPUs), Allreduce (64 GPUs, 2–128 MB; 128 MB, 16–64 GPUs), and Bcast (64 GPUs, 2–128 MB; 128 MB, 16–64 GPUs)]
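The DL use case these collectives target reduces, at the application level, to a single call on a large GPU-resident buffer. A hedged sketch with illustrative names (d_grad is assumed to be a device-resident gradient buffer of `count` floats):

    #include <mpi.h>

    /* Sum gradients across all ranks, in place, directly from GPU memory. */
    void allreduce_gradients(float *d_grad, int count, MPI_Comm comm)
    {
        MPI_Allreduce(MPI_IN_PLACE, d_grad, count, MPI_FLOAT, MPI_SUM, comm);
        /* Averaging (dividing by the number of ranks) is typically done
         * afterwards on the GPU, e.g., by a small scaling kernel. */
    }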

SLIDE 37: MVAPICH2: Allreduce Comparison with Baidu and OpenMPI (8 GPUs)

  • Initial evaluation shows promising performance gains for MVAPICH2-GDR 2.3a*
  • 8 GPUs (2 nodes): MVAPICH2-GDR vs. Baidu-allreduce and OpenMPI 3.0

[Figure: allreduce latency vs. message size (4 B – 512 MB) for MVAPICH2, BAIDU, and OPENMPI; annotations: ~10X better, MV2 is ~20% better than Baidu, ~3.5X better, OpenMPI is ~4X slower than Baidu]

*Available with MVAPICH2-GDR 2.3a

SLIDE 38: MVAPICH2: Allreduce Comparison with Baidu and OpenMPI (16 GPUs)

  • 16 GPUs (4 nodes): MVAPICH2-GDR vs. Baidu-allreduce and OpenMPI 3.0

[Figure: allreduce latency vs. message size (4 B – 512 MB) for MVAPICH2, BAIDU, and OPENMPI; annotations: ~30X better, MV2 is ~2X better than Baidu, ~10X better, OpenMPI is ~5X slower than Baidu, ~4X better]

*Available with MVAPICH2-GDR 2.3a

SLIDE 39: OSU-Caffe: Scalable Deep Learning

  • Caffe: a flexible and layered Deep Learning framework
  • Benefits and weaknesses
    – Multi-GPU training within a single node
    – Performance degradation for GPUs across different sockets
    – Limited scale-out
  • OSU-Caffe: MPI-based parallel training
    – Enables scale-up (within a node) and scale-out (across multi-GPU nodes)
    – Scale-out on 64 GPUs for training the CIFAR-10 network on the CIFAR-10 dataset
    – Scale-out on 128 GPUs for training GoogLeNet on the ImageNet dataset

[Figure: GoogLeNet (ImageNet) training time on 8–128 GPUs for Caffe, OSU-Caffe (1024), and OSU-Caffe (2048); annotation: invalid use case]

OSU-Caffe publicly available from http://hidl.cse.ohio-state.edu/
Support on OpenPOWER will be available soon

SLIDE 40: Outline (repeated; see Slide 2 for the full outline)

SLIDE 41: Point-to-Point Host-Level Performance on OpenPOWER

Platform: OpenPOWER (POWER8, ppc64le) CPU with a Mellanox EDR (MT4115) HCA; MVAPICH2-GDR 2.3a

[Figure: intra- and inter-node host-level latency, bandwidth, and bi-directional bandwidth vs. message size]

  • Intra-node: ~0.5 us latency, ~30 GB/s bandwidth, ~60 GB/s bi-directional bandwidth
  • Inter-node: ~2.3 us latency, ~12 GB/s bandwidth, ~24 GB/s bi-directional bandwidth

SLIDE 42: Device-to-Device Performance on OpenPOWER (NVLink + Pascal)

Platform: OpenPOWER (ppc64le) nodes with dual-socket CPUs, 4 Pascal P100-SXM GPUs, and an EDR InfiniBand interconnect

[Figure: intra-node D2D latency and bandwidth (intra-socket over NVLink vs. inter-socket) and inter-node D2D latency and bandwidth vs. message size]

  • Intra-node bandwidth: 33.9 GB/s (NVLink); intra-node latency: 14.6 us (without GPUDirect RDMA)
  • Inter-node bandwidth: 11.9 GB/s (EDR); inter-node latency: 23.8 us (without GPUDirect RDMA)

Available in MVAPICH2-GDR 2.3a

SLIDE 43: Outline (repeated; see Slide 2 for the full outline)

SLIDE 44: Container Support

  • Increasing trend to provide container support for MPI libraries
    – Ease of build, portability, reproducibility
  • MVAPICH2-GDR 2.3a provides container (Docker) support
  • More details are available in the MVAPICH2-GDR User Guide: http://mvapich.cse.ohio-state.edu/userguide/gdr/2.3a/
  • Synergistic with NVIDIA's HPC Container Maker and hpccm efforts (https://github.com/NVIDIA/hpc-container-maker)

SLIDE 45: MVAPICH2-GDR on Containers with Negligible Overhead

[Figure: GPU-GPU inter-node latency, bandwidth, and bi-directional bandwidth vs. message size, Docker vs. native; the curves are nearly identical]

Testbed: MVAPICH2-GDR 2.3a; Intel Haswell (E5-2687W @ 3.10 GHz) node, 20 cores; NVIDIA Volta V100 GPU; Mellanox ConnectX-4 EDR HCA; CUDA 9.0; Mellanox OFED 4.0 with GPUDirect RDMA

SLIDE 46: Outline (repeated; see Slide 2 for the full outline)

SLIDE 47: Scalable Host-Based Collectives with CMA (Intra-node Reduce & Alltoall)

  • Up to 5X and 3X performance improvement by MVAPICH2 for small and large messages, respectively

[Figure: Reduce and Alltoall latency vs. message size (4 B – 4 KB and 8 KB – 1 MB) for MVAPICH2-GDR-Next, SpectrumMPI-10.1.0.2, and OpenMPI-3.0.0 (Nodes=1, PPN=20); annotations: 3.6X, 5.2X, 3.2X, 3.3X, 3.2X, 1.4X, 1.3X, 1.2X]

SLIDE 48: Scalable Host-Based Collectives with CMA (Intra-node Gather & Scatter)

  • Up to 24X and 15X performance improvement by MVAPICH2 for medium and large messages, respectively

[Figure: Gather and Scatter latency vs. message size (4 B – 4 KB and 8 KB – 1 MB) for MVAPICH2-GDR-Next, SpectrumMPI-10.1.0.2, and OpenMPI-3.0.0 (Nodes=1, PPN=20); annotations: 24.5X, 5.6X, 8.3X, 15.6X, 4.9X, 1.5X, 6.4X, 6.7X]

SLIDE 49: Scalable Host-Based Collectives with CMA (Multi-node Reduce & Alltoall)

  • Up to 12.4X and 8.5X performance improvement by MVAPICH2 for small and large messages, respectively

[Figure: Reduce and Alltoall latency vs. message size (4 B – 4 KB and 8 KB – 1 MB) for MVAPICH2-GDR-Next and OpenMPI-3.0.0 (Nodes=4, PPN=20); annotations: 12.4X, 1.9X, 8.5X]

SLIDE 50: Scalable Host-Based Collectives with CMA (Multi-node Gather & Scatter)

  • Up to 17.8X and 3.9X improvement with MVAPICH2-GDR-Next for medium and large messages, respectively

[Figure: Gather and Scatter latency vs. message size (4 B – 4 KB and 8 KB – 1 MB) for MVAPICH2-GDR-Next and OpenMPI-3.0.0 (Nodes=4, PPN=20); annotations: 17.8X, 3.9X, 1.6X]

SLIDE 51: Outline (repeated; see Slide 2 for the full outline)

SLIDE 52: Optimized All-Reduce with XPMEM (Nodes=1-2)

  • Optimized MPI All-Reduce design in MVAPICH2
    – Up to 2X performance improvement over Spectrum MPI and 4X over OpenMPI for intra-node

Optimized runtime parameters: MV2_CPU_BINDING_POLICY=hybrid MV2_HYBRID_BINDING_POLICY=bunch

[Figure: All-Reduce latency vs. message size (16 KB – 2 MB) for MVAPICH2-GDR-Next, SpectrumMPI-10.1.0, and OpenMPI-3.0.0, at Nodes=1 and Nodes=2, PPN=20; annotations: 3X, 34%, 2X, 4X, 48%, 3.3X, 2X, 2X]

SLIDE 53: Optimized All-Reduce with XPMEM (Nodes=3-4)

  • Optimized MPI All-Reduce design in MVAPICH2
    – Up to 2X performance improvement over OpenMPI for inter-node

Optimized runtime parameters: MV2_CPU_BINDING_POLICY=hybrid MV2_HYBRID_BINDING_POLICY=bunch

[Figure: All-Reduce latency vs. message size (16 KB – 2 MB) for MVAPICH2-GDR-Next and OpenMPI-3.0.0, at Nodes=3 and Nodes=4, PPN=20; annotations: 42%, 27%, 2X, 35%]

SLIDE 54: Optimized Reduce with XPMEM (Nodes=1-2)

  • Optimized MPI Reduce design in MVAPICH2
    – Up to 3.1X performance improvement over OpenMPI at scale, and up to 37% over Spectrum MPI intra-node

Optimized runtime parameters: MV2_CPU_BINDING_POLICY=hybrid MV2_HYBRID_BINDING_POLICY=bunch

[Figure: Reduce latency vs. message size for MVAPICH2-GDR-Next, SpectrumMPI-10.1.0, and OpenMPI-3.0.0, at Nodes=1 and Nodes=2, PPN=20; annotations: 26%, 2.8X, 27%, 1.9X, 37%, 5.6X, 93%, 3.1X]

SLIDE 55: Optimized Reduce with XPMEM (Nodes=3-4)

  • Optimized MPI Reduce design in MVAPICH2
    – Up to 7X performance improvement over OpenMPI at scale

Optimized runtime parameters: MV2_CPU_BINDING_POLICY=hybrid MV2_HYBRID_BINDING_POLICY=bunch

[Figure: Reduce latency vs. message size for MVAPICH2-GDR-Next and OpenMPI-3.0.0, at Nodes=3 and Nodes=4, PPN=20; annotations: 4X, 6X, 7X, 4X]

SLIDE 56: Outline (repeated; see Slide 2 for the full outline)

SLIDE 57: MVAPICH2-GDR vs. NCCL2 – Broadcast Operation

  • Optimized designs in MVAPICH2-GDR 2.3b* offer better or comparable performance for most cases
  • MPI_Bcast (MVAPICH2-GDR) vs. ncclBcast (NCCL2) on 16 K80 GPUs

[Figure: broadcast latency vs. message size, NCCL2 vs. MVAPICH2-GDR; ~10X better with 1 GPU/node, ~4X better with 2 GPUs/node]

Platform: Intel Xeon (Broadwell) nodes with dual-socket CPUs, 2 K80 GPUs per node, and an EDR InfiniBand interconnect
*Will be available with the upcoming MVAPICH2-GDR 2.3b

SLIDE 58: MVAPICH2-GDR vs. NCCL2 – Reduce Operation

  • Optimized designs in MVAPICH2-GDR 2.3b* offer better or comparable performance for most cases
  • MPI_Reduce (MVAPICH2-GDR) vs. ncclReduce (NCCL2) on 16 GPUs

[Figure: reduce latency vs. message size, MVAPICH2-GDR vs. NCCL2; ~2.5X better for small messages (4 B – 64 KB), ~5X better for large messages]

Platform: Intel Xeon (Broadwell) nodes with dual-socket CPUs, 1 K80 GPU per node, and an EDR InfiniBand interconnect
*Will be available with the upcoming MVAPICH2-GDR 2.3b

SLIDE 59: MVAPICH2-GDR vs. NCCL2 – Allreduce Operation

  • Optimized designs in MVAPICH2-GDR 2.3b* offer better or comparable performance for most cases
  • MPI_Allreduce (MVAPICH2-GDR) vs. ncclAllReduce (NCCL2) on 16 GPUs

[Figure: allreduce latency vs. message size, MVAPICH2-GDR vs. NCCL2; ~3X better for small messages (4 B – 64 KB), ~1.2X better for large messages]

Platform: Intel Xeon (Broadwell) nodes with dual-socket CPUs, 1 K80 GPU per node, and an EDR InfiniBand interconnect
*Will be available with the upcoming MVAPICH2-GDR 2.3b

SLIDE 60: Outline (repeated; see Slide 2 for the full outline)

SLIDE 61: Out-of-Core Deep Neural Network Training with Caffe

  • Large DNNs cannot be trained on GPUs due to memory limitations!
    – ResNet-50 is the state-of-the-art DNN architecture for image recognition, but current frameworks can only go up to a small batch size of 45
    – Next-generation models like Neural Machine Translation (NMT) are extremely large, consist of billions of parameters, and require even more memory
    – Can we design out-of-core DNN training support using new software features in CUDA 8/9 and hardware mechanisms in Pascal/Volta GPUs?
  • The general intuition is that managed allocations "will be" slow!
    – The proposed framework, OC-Caffe (Out-of-Core Caffe), shows the potential of managed-memory designs to provide performance with negligible or no overhead
    – In addition to out-of-core training support, productivity can be greatly enhanced in DL framework design by using the new Unified Memory features

Submission under review

SLIDE 62: Performance Trends for OC-Caffe

  • Comparable performance to Caffe-Default for "in-memory" batch sizes
  • OC-Caffe-Opt: up to 5X improvement over Intel MKL-optimized CPU-based AlexNet training, on a Volta V100 GPU with CUDA 9 and cuDNN 7

[Figure: training performance for trainable (in-memory) and out-of-core (over-subscription) batch sizes]

OC-Caffe will be released by the HiDL team @ OSU: hidl.cse.ohio-state.edu
Submission under review

SLIDE 63: Performance Trends for OC-Caffe (contd.)

  • OC-Caffe-Opt: up to 80% better than Intel-optimized CPU Caffe for ResNet-50 training, on a Volta V100 GPU with CUDA 9 and cuDNN 7
  • OC-Caffe allows efficient scale-up on a DGX-1 system with Pascal P100 GPUs, CUDA 9, and cuDNN 7

[Figure: out-of-core (over-subscription) performance and scale-up on DGX-1]

Submission under review

SLIDE 64: Can Unified Memory also Simplify Framework Design?

  • Enhanced and simplified the Caffe framework
  • Simplified the Layer class and all inherited classes (e.g., ConvolutionLayer)
  • Removed almost all of the memory allocation, movement, and state-management code in the SyncedMemory and Blob classes
  • An estimated 3,000 lines of repetitive and error-prone code can be eliminated
  • Developers can add new inherited Layer classes in a much simpler manner
    – E.g., implement a single Forward() function instead of separate forward_gpu() and forward_cpu() functions

Existing design:

    class ConvolutionLayer {
     public:
      void cpu_data();          void gpu_data();
      void cpu_diff();          void gpu_diff();
      void mutable_cpu_data();  void mutable_gpu_data();
      void mutable_cpu_diff();  void mutable_gpu_diff();
      void Forward_cpu();       void Forward_gpu();
      void forward_cpu_gemm();  void forward_gpu_gemm();
      void forward_cpu_bias();  void forward_gpu_bias();
      void Backward_cpu();      void Backward_gpu();
      void backward_cpu_gemm(); void backward_gpu_gemm();
      void backward_cpu_bias(); void backward_gpu_bias();
    };

Proposed high-productivity design based on managed memory allocation and data movement:

    class ConvolutionLayer {
     public:
      void data();
      void diff();
      void mutable_data();
      void mutable_diff();
      void Forward();
      void forward_gemm();
      void forward_bias();
      void Backward();
      void backward_gemm();
      void backward_bias();
    };

Submission under review

SLIDE 65: Outline (repeated; see Slide 2 for the full outline)

SLIDE 66: Conclusions

  • The MVAPICH2-GDR MPI library optimizes MPI communication on InfiniBand and RoCE (v1 and v2) clusters with GPUs, on both x86 and OpenPOWER platforms
  • Provides optimized designs for point-to-point two-sided and one-sided communication, datatype processing, and collective operations
  • Takes advantage of CUDA features like IPC and the GPUDirect RDMA family
  • Allows flexible solutions for streaming applications with GPUs
  • Provides optimized solutions for High-Performance Deep Learning (HiDL) frameworks and applications

SLIDE 67: Join Us for a Connect with the Experts Session

  • CE8118 – Connect with the Experts: MVAPICH for GPU
  • Session schedule: Wednesday, Mar 28, 3:00 PM - 4:00 PM, LL Pod A
  • Session speakers
    – Dhabaleswar Panda – Professor and University Distinguished Scholar, OSU
    – Davide Rossetti – Senior Software Engineer, NVIDIA
    – Hari Subramoni – Research Scientist, OSU
  • Session description: MVAPICH is a well-established MPI library supporting GPUDirect. Meet with the experts to discuss how you can optimize your HPC and AI applications using GPUDirect RDMA with MVAPICH-GDR, and learn about its newest feature, MVAPICH-GDS, which supports GPUDirect Async.

SLIDE 68: Personnel Acknowledgments

Current Students (Graduate)

  • A. Awan (Ph.D.)

  • R. Biswas (M.S.)

  • M. Bayatpour (Ph.D.)

  • S. Chakraborthy (Ph.D.)

  • C.-H. Chu (Ph.D.)

  • S. Guganani (Ph.D.)

Past Students

  • A. Augustine (M.S.)

  • P. Balaji (Ph.D.)

  • S. Bhagvat (M.S.)

  • A. Bhat (M.S.)

  • D. Buntinas (Ph.D.)

  • L. Chai (Ph.D.)

  • B. Chandrasekharan (M.S.)

  • N. Dandapanthula (M.S.)

  • V. Dhanraj (M.S.)

  • T. Gangadharappa (M.S.)

  • K. Gopalakrishnan (M.S.)

  • R. Rajachandrasekar (Ph.D.)

  • G. Santhanaraman (Ph.D.)

  • A. Singh (Ph.D.)

  • J. Sridhar (M.S.)

  • S. Sur (Ph.D.)

  • H. Subramoni (Ph.D.)

  • K. Vaidyanathan (Ph.D.)

  • A. Vishnu (Ph.D.)

  • J. Wu (Ph.D.)

  • W. Yu (Ph.D.)

Past Research Scientist

  • K. Hamidouche

  • S. Sur

Past Post-Docs

  • D. Banerjee

  • X. Besseron

  • H.-W. Jin

  • W. Huang (Ph.D.)

  • W. Jiang (M.S.)

  • J. Jose (Ph.D.)

  • S. Kini (M.S.)

  • M. Koop (Ph.D.)

  • K. Kulkarni (M.S.)

  • R. Kumar (M.S.)

  • S. Krishnamoorthy (M.S.)

  • K. Kandalla (Ph.D.)

  • M. Li (Ph.D.)

  • P. Lai (M.S.)

  • J. Liu (Ph.D.)

  • M. Luo (Ph.D.)

  • A. Mamidala (Ph.D.)

  • G. Marsh (M.S.)

  • V. Meshram (M.S.)

  • A. Moody (M.S.)

  • S. Naravula (Ph.D.)

  • R. Noronha (Ph.D.)

  • X. Ouyang (Ph.D.)

  • S. Pai (M.S.)

  • S. Potluri (Ph.D.)

  • J. Hashmi (Ph.D.)

  • H. Javed (Ph.D.)

  • P. Kousha (Ph.D.)

  • D. Shankar (Ph.D.)

  • H. Shi (Ph.D.)

  • J. Zhang (Ph.D.)

  • J. Lin

  • M. Luo

  • E. Mancini

Current Research Scientists

  • X. Lu

  • H. Subramoni

Past Programmers

  • D. Bureddy

  • J. Perkins

Current Research Specialist

  • J. Smith

  • M. Arnold

  • S. Marcarelli

  • J. Vienne

  • H. Wang

Current Post-doc

  • A. Ruhela

Current Students (Undergraduate)

  • N. Sarkauskas (B.S.)
SLIDE 69

Thank You!

Network-Based Computing Laboratory http://nowlab.cse.ohio-state.edu/

panda@cse.ohio-state.edu, subramon@cse.ohio-state.edu

The High-Performance MPI/PGAS Project: http://mvapich.cse.ohio-state.edu/
The High-Performance Deep Learning Project: http://hidl.cse.ohio-state.edu/