SLIDE 1

MVAPICH2-GDR: Pushing the Frontier of HPC and Deep Learning

GPU Technology Conference (GTC) 2017

by

Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: panda@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~panda

Hari Subramoni
The Ohio State University
E-mail: subramon@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~subramon

SLIDE 2

Outline

  • Overview of the MVAPICH2 Project
  • MVAPICH2-GPU with GPUDirect-RDMA (GDR)
  • What’s new with MVAPICH2-GDR
  • Efficient MPI-3 Non-Blocking Collective support
  • Maximal overlap in MPI Datatype Processing
  • Efficient Support for Managed Memory
  • RoCE and Optimized Collective
  • Initial support for GPUDirect Async feature
  • Efficient Deep Learning with MVAPICH2-GDR
  • OpenACC-Aware support
  • Conclusions

SLIDE 3

Overview of the MVAPICH2 Project

  • High Performance open-source MPI Library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)

– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), started in 2001, first version available in 2002
– MVAPICH2-X (MPI + PGAS), available since 2011
– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
– Support for Virtualization (MVAPICH2-Virt), available since 2015
– Support for Energy-Awareness (MVAPICH2-EA), available since 2015
– Support for InfiniBand Network Analysis and Monitoring (OSU INAM), available since 2015

– Used by more than 2,750 organizations in 84 countries
– More than 416,000 (> 0.4 million) downloads from the OSU site directly

– Empowering many TOP500 clusters (Nov ‘16 ranking)

  • 1st, 10,649,600-core (Sunway TaihuLight) at National Supercomputing Center in Wuxi, China
  • 13th, 241,108-core (Pleiades) at NASA
  • 17th, 462,462-core (Stampede) at TACC
  • 40th, 74,520-core (Tsubame 2.5) at Tokyo Institute of Technology

– Available with software stacks of many vendors and Linux Distros (RedHat and SuSE)

– http://mvapich.cse.ohio-state.edu

  • Empowering Top500 systems for over a decade

– System-X from Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops) ->
– Sunway TaihuLight (1st in Jun '16, 10M cores, 100 PFlops)

SLIDE 4

MVAPICH2 Release Timeline and Downloads

[Chart: cumulative number of downloads from the OSU site, Sep 2004 - Mar 2017, annotated with release milestones from MV 0.9.4 and MV2 0.9.0 through MV2-GDR 2.2rc1, MV2-X 2.2, and MV2 2.3a]

SLIDE 5

MVAPICH2 Software Family

Requirements                                                    MVAPICH2 Library to use
MPI with IB, iWARP and RoCE                                     MVAPICH2
Advanced MPI, OSU INAM, PGAS and MPI+PGAS with IB and RoCE      MVAPICH2-X
MPI with IB & GPU                                               MVAPICH2-GDR
MPI with IB & MIC                                               MVAPICH2-MIC
HPC Cloud with MPI & IB                                         MVAPICH2-Virt
Energy-aware MPI with IB, iWARP and RoCE                        MVAPICH2-EA

SLIDE 6

Architecture of MVAPICH2 Software Family

High Performance Parallel Programming Models
  - Message Passing Interface (MPI)
  - PGAS (UPC, OpenSHMEM, CAF, UPC++)
  - Hybrid --- MPI + X (MPI + PGAS + OpenMP/Cilk)

High Performance and Scalable Communication Runtime
  Diverse APIs and Mechanisms: point-to-point primitives, collectives algorithms, energy-awareness, remote memory access, I/O and file systems, fault tolerance, virtualization, active messages, job startup, introspection & analysis

Support for Modern Networking Technology (InfiniBand, iWARP, RoCE, Omni-Path)
Support for Modern Multi-/Many-core Architectures (Intel Xeon, OpenPOWER, Xeon Phi (MIC, KNL), NVIDIA GPGPU)
  - Transport Protocols: RC, XRC, UD, DC
  - Modern Features: UMR, ODP, SR-IOV, Multi-Rail
  - Transport Mechanisms: Shared Memory, CMA, IVSHMEM
  - Modern Features: MCDRAM*, NVLink*, CAPI*

* Upcoming

SLIDE 7

Optimizing MPI Data Movement on GPU Clusters

[Diagram: two nodes (Node 0 and Node 1), each with two CPUs connected by QPI, GPUs attached over PCIe, memory buffers on both hosts and GPUs, and an IB HCA linking the nodes]

  • 1. Intra-GPU
  • 2. Intra-Socket GPU-GPU
  • 3. Inter-Socket GPU-GPU
  • 4. Inter-Node GPU-GPU
  • 5. Intra-Socket GPU-Host
  • 6. Inter-Socket GPU-Host
  • 7. Inter-Node GPU-Host
  • 8. Inter-Node GPU-GPU with IB adapter on remote socket

and more . . .

  • For each path, different schemes: shared memory, IPC, GPUDirect RDMA, pipelining, ...
  • Critical for runtimes to optimize data movement while hiding the complexity
  • Connected as PCIe devices – flexibility, but also complexity

SLIDE 8

GPU-Aware (CUDA-Aware) MPI Library: MVAPICH2-GPU

At Sender:
    MPI_Send(s_devbuf, size, …);

At Receiver:
    MPI_Recv(r_devbuf, size, …);

inside MVAPICH2

  • Standard MPI interfaces used for unified data movement
  • Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
  • Overlaps data movement from GPU with RDMA transfers

High Performance and High Productivity
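
To make the interfaces above concrete, here is a minimal, self-contained sketch of a GPU-to-GPU message: rank 0 sends a device buffer and rank 1 receives into one, with no explicit cudaMemcpy. It assumes a CUDA-aware MPI such as MVAPICH2-GDR, at least two ranks each with a GPU; the buffer size and tag are arbitrary choices for illustration.

    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int n = 1 << 20;                 /* 1M floats, arbitrary size */
        float *devbuf;
        cudaMalloc((void **)&devbuf, n * sizeof(float));

        if (rank == 0) {
            cudaMemset(devbuf, 1, n * sizeof(float));   /* fill the send buffer with a placeholder bit pattern */
            /* Pass the device pointer directly; the CUDA-aware library moves
               the data (GDR or pipelining) without a host staging copy. */
            MPI_Send(devbuf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(devbuf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d floats into GPU memory\n", n);
        }

        cudaFree(devbuf);
        MPI_Finalize();
        return 0;
    }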

SLIDE 9

CUDA-Aware MPI: MVAPICH2-GDR 1.8-2.3 Releases

  • Support for MPI communication from NVIDIA GPU device memory
  • High performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host and Host-GPU)
  • High performance intra-node point-to-point communication for multi-GPU adapters/node (GPU-GPU, GPU-Host and Host-GPU)
  • Taking advantage of CUDA IPC (available since CUDA 4.1) in intra-node communication for multiple GPU adapters/node
  • Optimized and tuned collectives for GPU device buffers
  • MPI datatype support for point-to-point and collective communication from GPU device buffers

SLIDE 10

Installing MVAPICH2-GDR

  • MVAPICH2-2.2 with GDR support can be downloaded from https://mvapich.cse.ohio-state.edu/download/mvapich2gdr/

  • Please select the best matching package for your system
  • We have most of the common combinations. If you do not find your match here, please email OSU with the following details:
    – OS version
    – OFED version
    – CUDA version
    – Compiler (GCC, Intel and PGI)
  • Install instructions
    • With root permissions:
      • On the default path: rpm -Uvh --nodeps mvapich2-gdr-cuda7.0-gnu-2.2-0.3.el6.x86_64.rpm
      • On a specific path: rpm --prefix /custom/install/prefix -Uvh --nodeps mvapich2-gdr-cuda7.0-gnu-2.2-0.3.el6.x86_64.rpm
    • Without root permissions:
      • rpm2cpio mvapich2-gdr-cuda7.0-gnu-2.2-0.3.el6.x86_64.rpm | cpio -id
  • For more details on the installation process, refer to:

http://mvapich.cse.ohio-state.edu/userguide/gdr/2.2#_installing_mvapich2_gdr_library

SLIDE 11

Performance of MVAPICH2-GPU with GPU-Direct RDMA (GDR)

[Charts: GPU-GPU inter-node latency (down to 2.18 us), bandwidth, and bi-directional bandwidth vs. message size, comparing MV2-GDR 2.2, MV2-GDR 2.0b, and MV2 without GDR; annotated improvements of roughly 10x-11x over MV2 without GDR and 2x-3x over MV2-GDR 2.0b]

Platform: MVAPICH2-GDR-2.2, Intel Ivy Bridge (E5-2680 v2) node with 20 cores, NVIDIA Tesla K40c GPU, Mellanox ConnectX-4 EDR HCA, CUDA 8.0, Mellanox OFED 3.0 with GPU-Direct-RDMA

SLIDE 12

Application-Level Evaluation (HOOMD-blue)

  • Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
  • HOOMD-blue version 1.0.5
  • GDRCOPY enabled:
        MV2_USE_CUDA=1 MV2_IBA_HCA=mlx5_0 MV2_IBA_EAGER_THRESHOLD=32768
        MV2_VBUF_TOTAL_SIZE=32768 MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768
        MV2_USE_GPUDIRECT_GDRCOPY=1 MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384

[Charts: average time steps per second (TPS) vs. number of processes (4-32) for 64K-particle and 256K-particle runs; MV2+GDR delivers about 2X higher TPS than MV2]

SLIDE 13

Full and Efficient MPI-3 RMA Support

[Chart: GPU-GPU small-message one-sided (RMA) latency (us) vs. message size (2 bytes - 8K); about 6X improvement, down to 2.88 us]

Platform: MVAPICH2-GDR-2.2, Intel Ivy Bridge (E5-2680 v2) node with 20 cores, NVIDIA Tesla K40c GPU, Mellanox Connect-IB Dual-FDR HCA, CUDA 7, Mellanox OFED 2.4 with GPU-Direct-RDMA
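
As an illustration of this one-sided path, here is a minimal sketch of MPI-3 RMA on GPU buffers, assuming a CUDA-aware MPI that accepts device pointers as window memory (as MVAPICH2-GDR provides): every rank exposes a device buffer through a window, and rank 0 MPI_Puts from its own device buffer into rank 1's. The buffer size and fence-based synchronization are illustrative choices, not the benchmark configuration above.

    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int n = 4096;                       /* element count, arbitrary */
        double *devbuf;
        cudaMalloc((void **)&devbuf, n * sizeof(double));

        /* Expose the device buffer in an RMA window. */
        MPI_Win win;
        MPI_Win_create(devbuf, n * sizeof(double), sizeof(double),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        MPI_Win_fence(0, win);
        if (rank == 0) {
            /* One-sided put from rank 0's GPU buffer into rank 1's GPU buffer. */
            MPI_Put(devbuf, n, MPI_DOUBLE, 1, 0, n, MPI_DOUBLE, win);
        }
        MPI_Win_fence(0, win);

        MPI_Win_free(&win);
        cudaFree(devbuf);
        MPI_Finalize();
        return 0;
    }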

SLIDE 14

Performance of MVAPICH2-GDR with GPU-Direct RDMA and Multi-Rail Support

[Charts: GPU-GPU inter-node uni-directional and bi-directional MPI bandwidth vs. message size (1 byte - 4 MB) for MV2-GDR 2.1 and MV2-GDR 2.1 RC2 with multi-rail support; peaks of 8,759 MB/s (uni-directional) and 15,111 MB/s (bi-directional), with annotated improvements of 40% and 20%]

Platform: MVAPICH2-GDR-2.2b, Intel Ivy Bridge (E5-2680 v2) node with 20 cores, NVIDIA Tesla K40c GPU, Mellanox Connect-IB Dual-FDR HCA, CUDA 7, Mellanox OFED 2.4 with GPU-Direct-RDMA

SLIDE 15

Outline

  • Overview of the MVAPICH2 Project
  • MVAPICH2-GPU with GPUDirect-RDMA (GDR)
  • What’s new with MVAPICH2-GDR
  • Efficient MPI-3 Non-Blocking Collective support
  • Maximal overlap in MPI Datatype Processing
  • Efficient Support for Managed Memory
  • RoCE and Optimized Collective
  • Initial support for GPUDirect Async feature
  • Efficient Deep Learning with MVAPICH2-GDR
  • OpenACC-Aware support
  • Conclusions

SLIDE 16

Non-Blocking Collectives (NBC) using Core-Direct Offload

  • A. Venkatesh, K. Hamidouche, H. Subramoni, and D. K. Panda, Offloaded GPU Collectives using CORE-Direct and CUDA Capabilities on IB Clusters, HiPC 2015
  • MPI NBC decouples the initiation (Ialltoall) and completion (Wait) phases and provides overlap potential (Ialltoall + compute + Wait), but the CPU largely drives progress inside Wait (=> close to 0 overlap); see the sketch after this list
  • The CORE-Direct feature provides true overlap by letting the application specify a list of network tasks a priori, which the NIC progresses instead of the CPU (hence freeing it)
  • We propose a design that combines GPUDirect RDMA and CORE-Direct features to provide efficient support of CUDA-Aware NBC collectives on GPU buffers
  • Overlap communication with CPU computation
  • Overlap communication with GPU computation
  • Extend OMB with CUDA-Aware NBC benchmarks to evaluate
  • Latency
  • Overlap on both CPU and GPU
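
A minimal sketch of the pattern described above, assuming a CUDA-aware MPI with NBC support: issue MPI_Ialltoall on GPU buffers, keep the GPU busy with independent work, then complete the collective. Buffer sizes and the dummy kernel are placeholders; with the CORE-Direct/GDR design, the Wait ideally finds the communication already progressed by the NIC.

    #include <mpi.h>
    #include <cuda_runtime.h>

    __global__ void compute(float *x, int n)          /* stand-in application kernel */
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] = x[i] * 2.0f + 1.0f;
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int size;
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int per_peer = 64 * 1024;               /* floats per destination, arbitrary */
        float *sendbuf, *recvbuf, *work;
        cudaMalloc((void **)&sendbuf, (size_t)size * per_peer * sizeof(float));
        cudaMalloc((void **)&recvbuf, (size_t)size * per_peer * sizeof(float));
        cudaMalloc((void **)&work,    per_peer * sizeof(float));

        /* 1. Initiate the non-blocking all-to-all on device buffers. */
        MPI_Request req;
        MPI_Ialltoall(sendbuf, per_peer, MPI_FLOAT,
                      recvbuf, per_peer, MPI_FLOAT, MPI_COMM_WORLD, &req);

        /* 2. Overlap: independent GPU work proceeds while the collective progresses. */
        compute<<<(per_peer + 255) / 256, 256>>>(work, per_peer);

        /* 3. Complete the collective before touching recvbuf. */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        cudaDeviceSynchronize();

        cudaFree(sendbuf); cudaFree(recvbuf); cudaFree(work);
        MPI_Finalize();
        return 0;
    }
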
SLIDE 17

CUDA-Aware Non-Blocking Collectives

[Charts: medium/large message overlap (%) vs. message size (4K - 1M) for Ialltoall and Igather on 64 GPU nodes, with 1 process/node and 2 processes/node (1 process/GPU)]

Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB). Available since MVAPICH2-GDR 2.2b

  • A. Venkatesh, K. Hamidouche, H. Subramoni, and D. K. Panda, Offloaded GPU Collectives using CORE-Direct and CUDA Capabilities on IB Clusters, HiPC 2015

SLIDE 18

Outline

  • Overview of the MVAPICH2 Project
  • MVAPICH2-GPU with GPUDirect-RDMA (GDR)
  • What’s new with MVAPICH2-GDR
  • Efficient MPI-3 Non-Blocking Collective support
  • Maximal overlap in MPI Datatype Processing
  • Efficient Support for Managed Memory
  • RoCE and Optimized Collective
  • Initial support for GPUDirect Async feature
  • Efficient Deep Learning with MVAPICH2-GDR
  • OpenACC-Aware support
  • Conclusions

SLIDE 19

Non-contiguous Data Exchange

  • Multi-dimensional data
  • Row-based organization
  • Contiguous in one dimension
  • Non-contiguous in the other dimensions
  • Halo data exchange
  • Duplicate the boundary
  • Exchange the boundary in each iteration

[Figure: halo data exchange between neighboring domains]

SLIDE 20

MPI Datatype support in MVAPICH2

  • Datatypes support in MPI

    – Operate on customized datatypes to improve productivity
    – Enable the MPI library to optimize non-contiguous data

At Sender:

    MPI_Type_vector(n_blocks, n_elements, stride, old_type, &new_type);
    MPI_Type_commit(&new_type);
    ...
    MPI_Send(s_buf, size, new_type, dest, tag, MPI_COMM_WORLD);
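
A hedged, self-contained variant of the sender fragment above: a vector datatype describing one non-contiguous column of a row-major N x N matrix that lives in GPU memory, which the library can then pack with its datatype-specific CUDA kernels. The matrix size and destination rank are illustrative choices.

    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int N = 1024;                       /* matrix dimension, arbitrary */
        double *mat;
        cudaMalloc((void **)&mat, (size_t)N * N * sizeof(double));

        /* One column of a row-major N x N matrix: N blocks of 1 element, stride N. */
        MPI_Datatype column;
        MPI_Type_vector(N, 1, N, MPI_DOUBLE, &column);
        MPI_Type_commit(&column);

        if (rank == 0) {
            /* Send the last column directly from device memory; the library
               packs the strided data on the GPU before moving it. */
            MPI_Send(mat + (N - 1), 1, column, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(mat + (N - 1), 1, column, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }

        MPI_Type_free(&column);
        cudaFree(mat);
        MPI_Finalize();
        return 0;
    }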

  • Inside MVAPICH2
  • Use datatype specific CUDA Kernels to pack data in chunks
  • Efficiently move data between nodes using RDMA
  • In progress - currently optimizes vector and hindexed datatypes
  • Transparent to the user
  • H. Wang, S. Potluri, D. Bureddy, C. Rosales, and D. K. Panda, GPU-aware MPI on RDMA-Enabled Clusters: Design, Implementation and Evaluation, IEEE Transactions on Parallel and Distributed Systems, Vol. 25, No. 10, pp. 2595-2605, Oct 2014

SLIDE 21

MPI Datatype Processing (Computation Optimization)

  • Comprehensive support
  • Targeted kernels for regular datatypes - vector, subarray, indexed_block
  • Generic kernels for all other irregular datatypes
  • Separate non-blocking stream for kernels launched by MPI library
  • Avoids stream conflicts with application kernels
  • Flexible set of parameters for users to tune kernels
  • Vector
  • MV2_CUDA_KERNEL_VECTOR_TIDBLK_SIZE
  • MV2_CUDA_KERNEL_VECTOR_YSIZE
  • Subarray
  • MV2_CUDA_KERNEL_SUBARR_TIDBLK_SIZE
  • MV2_CUDA_KERNEL_SUBARR_XDIM
  • MV2_CUDA_KERNEL_SUBARR_YDIM
  • MV2_CUDA_KERNEL_SUBARR_ZDIM
  • Indexed_block
  • MV2_CUDA_KERNEL_IDXBLK_XDIM
SLIDE 22

MPI Datatype Processing (Communication Optimization)

Common Scenario:

    MPI_Isend(A, ..., datatype, ...);
    MPI_Isend(B, ..., datatype, ...);
    MPI_Isend(C, ..., datatype, ...);
    MPI_Isend(D, ..., datatype, ...);
    ...
    MPI_Waitall(...);

    /* A, B, ... contain non-contiguous MPI datatypes */

Waste of computing resources on CPU and GPU

SLIDE 23

Application-Level Evaluation (Cosmo) and Weather Forecasting in Switzerland

[Charts: normalized execution time vs. number of GPUs for Default, Callback-based, and Event-based designs - CSCS GPU cluster (16-96 GPUs) and Wilkes GPU cluster (4-32 GPUs)]

  • 2X improvement on 32 GPU nodes
  • 30% improvement on 96 GPU nodes (8 GPUs/node)
  • C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee, H. Subramoni, and D. K. Panda, Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems, IPDPS '16

On-going collaboration with CSCS and MeteoSwiss (Switzerland) in co-designing MV2-GDR and the COSMO application

COSMO model: http://www2.cosmo-model.org/content/tasks/operational/meteoSwiss/

SLIDE 24

Outline

  • Overview of the MVAPICH2 Project
  • MVAPICH2-GPU with GPUDirect-RDMA (GDR)
  • What’s new with MVAPICH2-GDR
  • Efficient MPI-3 Non-Blocking Collective support
  • Maximal overlap in MPI Datatype Processing
  • Efficient Support for Managed Memory
  • RoCE and Optimized Collective
  • Initial support for GPUDirect Async feature
  • Efficient Deep Learning with MVAPICH2-GDR
  • OpenACC-Aware support
  • Conclusions

SLIDE 25

Initial (Basic) Support for GPU Managed Memory

  • With CUDA 6.0, NVIDIA introduced CUDA Managed (or Unified) Memory, allowing a common memory allocation for GPU or CPU through the cudaMallocManaged() call
  • Significant productivity benefits due to abstraction of explicit allocation and cudaMemcpy()
  • Extended MVAPICH2 to perform communication directly from managed buffers (available since MVAPICH2-GDR 2.2b); see the sketch at the end of this slide
  • OSU Micro-benchmarks extended to evaluate the performance of point-to-point and collective communication using managed buffers
  • Available since OMB 5.2
  • D. S. Banerjee, K. Hamidouche, and D. K. Panda, Designing High Performance Communication Runtime for GPU Managed Memory: Early Experiences, GPGPU-9 Workshop, held in conjunction with PPoPP '16

[Chart: 2D stencil halo exchange time (ms) vs. total dimension size (1 byte - 16K) for halo width 1, comparing device and managed buffers]
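
A minimal sketch of the productivity benefit described above, assuming a managed-memory-aware MPI such as MVAPICH2-GDR 2.2b or newer: the same cudaMallocManaged() pointer is written by a GPU kernel and then handed straight to MPI_Send/MPI_Recv, with no explicit cudaMemcpy. Sizes and the toy kernel are placeholders.

    #include <mpi.h>
    #include <cuda_runtime.h>

    __global__ void init(float *x, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] = (float)i;
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int n = 1 << 16;                                  /* arbitrary element count */
        float *buf;
        cudaMallocManaged((void **)&buf, n * sizeof(float));    /* one allocation usable by CPU and GPU */

        if (rank == 0) {
            init<<<(n + 255) / 256, 256>>>(buf, n);
            cudaDeviceSynchronize();
            /* The managed pointer goes straight into MPI; the runtime decides
               how to stage or pipeline it (host path, or IPC/GDR in newer designs). */
            MPI_Send(buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }

        cudaFree(buf);
        MPI_Finalize();
        return 0;
    }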

SLIDE 26

Enhanced Support for Intra-node Managed Memory

  • CUDA Managed memory => no memory pin-down
  • No IPC support for intra-node communication
  • No GDR support for inter-node communication
  • Initial and basic support in MVAPICH2-GDR
  • For both intra- and inter-node, use "pipeline through" host memory
  • Enhance intra-node managed memory to use IPC
  • Double buffering pair-wise IPC-based scheme
  • Brings IPC performance to managed memory
  • High performance and high productivity
  • 2.5X improvement in bandwidth
  • Available since MVAPICH2-GDR 2.2RC1

[Charts: intra-node latency (us) and bandwidth (MB/s) vs. message size for the enhanced design vs. MV2-GDR 2.2b; up to 2.5X higher bandwidth]

SLIDE 27

Enhanced Support for Inter-node Managed Memory

  • Enhance inter-node managed memory to use GDR
  • SGL-based scheme:
  • Eager pre-allocated GPU VBUFs
  • Register the GPU vbufs with the HCA
  • Scatter-Gather List to combine control and data messages
  • Brings GDR performance to managed memory
  • High performance and high productivity
  • Up to 25% improvement for small and medium messages (micro-benchmark)
  • Up to 1.92X improvement for the GPU-LBM application
  • Will be available in a future MVAPICH2-GDR release
  • K. Hamidouche, A. A. Awan, A. Venkatesh, and D. K. Panda, "CUDA M3: Designing Efficient CUDA Managed Memory-Aware MPI by Exploiting GDR and IPC," 2016 IEEE 23rd International Conference on High Performance Computing (HiPC), Hyderabad, 2016

SLIDE 28

Outline

  • Overview of the MVAPICH2 Project
  • MVAPICH2-GPU with GPUDirect-RDMA (GDR)
  • What’s new with MVAPICH2-GDR
  • Efficient MPI-3 Non-Blocking Collective support
  • Maximal overlap in MPI Datatype Processing
  • Efficient Support for Managed Memory
  • RoCE and Optimized Collective
  • Initial support for GPUDirect Async feature
  • Efficient Deep Learning with MVAPICH2-GDR
  • OpenACC-Aware support
  • Conclusions

SLIDE 29

RoCE and Optimized Collectives Support

  • RoCE V1 and V2 support
  • RDMA_CM connection support
  • CUDA-Aware Collective Tuning
    – Point-to-point tuning (available since MVAPICH2-GDR 2.0)
      • Tuned thresholds for the different communication patterns and features
      • Depending on the system configuration (CPU, HCA and GPU models)
    – Tuning framework for GPU-based collectives
      • Selects the best algorithm depending on message size, system size and system configuration
      • Support for Bcast and Gather operations for different GDR-enabled systems
      • Available since the MVAPICH2-GDR 2.2RC1 release

SLIDE 30

Outline

  • Overview of the MVAPICH2 Project
  • MVAPICH2-GPU with GPUDirect-RDMA (GDR)
  • What’s new with MVAPICH2-GDR
  • Efficient MPI-3 Non-Blocking Collective support
  • Maximal overlap in MPI Datatype Processing
  • Efficient Support for Managed Memory
  • RoCE and Optimized Collective
  • Initial support for GPUDirect Async feature
  • Efficient Deep Learning with MVAPICH2-GDR
  • OpenACC-Aware support
  • Conclusions

SLIDE 31

Overview of GPUDirect aSync (GDS) Feature: Current MPI+CUDA interaction

    CUDA_Kernel_a<<<...>>>(A, ..., stream1);
    cudaStreamSynchronize(stream1);
    MPI_Isend(A, ..., req1);
    MPI_Wait(req1);
    CUDA_Kernel_b<<<...>>>(B, ..., stream1);

[Diagram: GPU, CPU and HCA timelines - 100% CPU control]

(A compilable sketch of this pattern appears after the bullets below.)

  • Limit the throughput of a GPU
  • Limit the asynchronous progress
  • Waste CPU cycles
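
For reference, a minimal compilable sketch of the conventional pattern criticized above: the CPU must synchronize the stream before posting the send and must wait for completion before launching the next kernel, so every step funnels through the host. Kernel bodies, sizes, and the two-rank setup are placeholders.

    #include <mpi.h>
    #include <cuda_runtime.h>

    __global__ void kernel_a(float *a, int n) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) a[i] += 1.0f; }
    __global__ void kernel_b(float *b, int n) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) b[i] *= 2.0f; }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int n = 1 << 20;
        float *A, *B;
        cudaMalloc((void **)&A, n * sizeof(float));
        cudaMalloc((void **)&B, n * sizeof(float));

        cudaStream_t stream1;
        cudaStreamCreate(&stream1);

        if (rank == 0) {
            kernel_a<<<(n + 255) / 256, 256, 0, stream1>>>(A, n);
            cudaStreamSynchronize(stream1);              /* CPU blocks until the GPU is done */

            MPI_Request req1;
            MPI_Isend(A, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD, &req1);
            MPI_Wait(&req1, MPI_STATUS_IGNORE);          /* CPU drives communication progress */

            kernel_b<<<(n + 255) / 256, 256, 0, stream1>>>(B, n);   /* only now is the next kernel launched */
            cudaStreamSynchronize(stream1);
        } else if (rank == 1) {
            MPI_Recv(A, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }

        cudaStreamDestroy(stream1);
        cudaFree(A);
        cudaFree(B);
        MPI_Finalize();
        return 0;
    }
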
SLIDE 32

MVAPICH2-GDS: Decouple GPU Control Flow from CPU

    CUDA_Kernel_a<<<...>>>(A, ..., stream1);
    MPI_Isend(A, ..., req1, stream1);
    MPI_Wait(req1, stream1);          (non-blocking from the CPU)
    CUDA_Kernel_b<<<...>>>(B, ..., stream1);

[Diagram: GPU, CPU and HCA timelines - the CPU offloads the compute, communication and synchronization tasks to the GPU]

  • CPU is out of the critical path
  • Tight interaction between GPU and HCA
  • Hide the overhead of kernel launch
  • Requires MPI semantics extensions
  • All operations are asynchronous from CPU
  • Extend MPI semantics with Stream-based semantics

SLIDE 33

MVAPICH2-GDS: Preliminary Results of Micro-benchmarks

[Charts: latency (us) vs. message size for Default and Enhanced-GDS - latency-oriented (Send+kernel and Recv+kernel) and throughput-oriented (back-to-back) benchmarks]

  • Latency oriented: able to hide the kernel launch overhead
    – 25% improvement at 256 bytes compared to the default behavior
  • Throughput oriented: asynchronously offload the communication and computation tasks to the queue
    – 14% improvement at 1KB message size
    – Requires some tuning; better performance expected for applications with different kernels

Platform: Intel Sandy Bridge, NVIDIA K20 and Mellanox FDR HCA

SLIDE 34

MVAPICH2-GDS: Results of 1D Stencil Kernels

[Charts: latency (us) vs. message size for Default MPI and MPI-GDS - point-to-point (kernels+Send and Recv+kernels) and collective (Broadcast+kernels); annotated speedups of 1.4X and 1.8X]

  • Point-to-point: hides the kernel launch overhead
    – 30% improvement at 16K bytes compared to traditional MPI+CUDA programs
  • Collective: overlaps computation with GPU communication (GDS)
    – 36% improvement at 64K bytes compared to traditional MPI+CUDA programs

Platform: Intel Broadwell CPU, NVIDIA K80 and Mellanox EDR HCA

Akshay Venkatesh, Ching-Hsiang Chu, Khaled Hamidouche, Sreeram Potluri, Davide Rossetti, and Dhabaleswar K. Panda, MPI-GDS: High Performance MPI Designs with GPUDirect-aSync for CPU-GPU Control Flow Decoupling, ICPP 2017, Bristol, UK, Aug 14-17, 2017 (Accepted)

SLIDE 35

Outline

  • Overview of the MVAPICH2 Project
  • MVAPICH2-GPU with GPUDirect-RDMA (GDR)
  • What’s new with MVAPICH2-GDR
  • Efficient MPI-3 Non-Blocking Collective support
  • Maximal overlap in MPI Datatype Processing
  • Efficient Support for Managed Memory
  • RoCE and Optimized Collective
  • Initial support for GPUDirect Async feature
  • Efficient Deep Learning with MVAPICH2-GDR
  • OpenACC-Aware support
  • Conclusions

SLIDE 36

Deep Learning Frameworks

  • Many Deep Learning (DL) frameworks have emerged
    – Berkeley Caffe
    – Google TensorFlow
    – Microsoft CNTK
    – Facebook Torch
    – Facebook Caffe2
  • Broadly, DL frameworks are being developed along two directions
    1. The HPC eco-system: MPI-based Deep Learning
    2. The enterprise eco-system: BigData-based Deep Learning

Courtesy: https://www.microway.com/hpc-tech-tips/deep-learning-frameworks-survey-tensorflow-torch-theano-caffe-neon-ibm-machine-learning-stack/

SLIDE 37

Parallel Training: Scale-up and Scale-out

  • Scale-up: intra-node performance
    – Many improvements like:
      • NVIDIA cuDNN, cuBLAS, NCCL, etc.
  • Scale-out: inter-node performance
    – DL frameworks - single-node only
    – Distributed (parallel) training - an emerging trend
      • S-Caffe or OSU-Caffe - MPI-based
      • Microsoft CNTK - MPI-based
      • Google TensorFlow - gRPC-based
      • Facebook Caffe2 - hybrid

[Figure: libraries positioned by scale-up performance (cuDNN, cuBLAS, NCCL) vs. scale-out performance (gRPC, Hadoop, MPI)]

SLIDE 38

Broad Challenge

How to efficiently scale out a Deep Learning (DL) framework and take advantage of heterogeneous High Performance Computing (HPC) resources?

SLIDE 39

New Challenges for MPI Runtimes

  • Deep Learning frameworks are a different game altogether
    – Unusually large message sizes (order of megabytes)
    – Most communication based on GPU buffers
  • How to address these newer requirements?
    – GPU-specific communication libraries (NCCL)
      • NVIDIA's NCCL library provides inter-GPU communication
    – CUDA-Aware MPI (MVAPICH2-GDR)
      • Provides support for GPU-based communication
  • Can we exploit CUDA-Aware MPI and NCCL to support Deep Learning applications?

[Figure: hierarchical communication (Knomial + NCCL ring)]

SLIDE 40

Efficient Broadcast: MVAPICH2-GDR and NCCL

  • NCCL has some limitations
    – Only works within a single node; thus, no scale-out across multiple nodes
    – Degradation across the IOH (socket) for scale-up (within a node)
  • We propose an optimized MPI_Bcast
    – Communication of very large GPU buffers (order of megabytes)
    – Scale-out on a large number of dense multi-GPU nodes
  • Hierarchical communication that efficiently exploits:
    – CUDA-Aware MPI_Bcast in MV2-GDR
    – NCCL Broadcast primitive
    (an application-level sketch of this hierarchical idea appears below)

Performance benefits: OSU Micro-benchmarks
[Chart: broadcast time (seconds) vs. number of GPUs (2-64) for MV2-GDR and MV2-GDR-Opt; up to 2.2X improvement]

Performance benefits: Microsoft CNTK DL framework (25% avg. improvement)

  • A. Awan, K. Hamidouche, A. Venkatesh, and D. K. Panda, Efficient Large Message Broadcast using NCCL and CUDA-Aware MPI for Deep Learning, The 23rd European MPI Users' Group Meeting (EuroMPI 16), Sep 2016 [Best Paper Runner-Up]
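
The hierarchical idea can be sketched at the application level as follows; this is an illustration of the concept, not the internal MVAPICH2-GDR design (which performs the combination inside MPI_Bcast). One leader per node takes part in a CUDA-aware inter-node MPI_Bcast, then NCCL's ring broadcast fans the data out to the other GPUs of the node. Communicator names, the buffer size, and the one-GPU-per-rank assumption are illustrative, and error handling is omitted.

    #include <mpi.h>
    #include <nccl.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int world_rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

        /* Split into one communicator per node and one across node leaders. */
        MPI_Comm node_comm, leader_comm;
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &node_comm);
        int node_rank, node_size;
        MPI_Comm_rank(node_comm, &node_rank);
        MPI_Comm_size(node_comm, &node_size);
        MPI_Comm_split(MPI_COMM_WORLD, node_rank == 0 ? 0 : MPI_UNDEFINED, world_rank, &leader_comm);

        cudaSetDevice(node_rank);                    /* assumes one GPU per local rank */
        const size_t count = 32UL * 1024 * 1024;     /* 128 MB of floats, arbitrary */
        float *buf;
        cudaMalloc((void **)&buf, count * sizeof(float));

        /* Bootstrap a per-node NCCL communicator (unique id distributed over MPI). */
        ncclUniqueId id;
        if (node_rank == 0) ncclGetUniqueId(&id);
        MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, node_comm);
        ncclComm_t nccl_comm;
        ncclCommInitRank(&nccl_comm, node_size, id, node_rank);
        cudaStream_t stream;
        cudaStreamCreate(&stream);

        /* Step 1: inter-node broadcast among node leaders (CUDA-aware MPI, GDR path). */
        if (node_rank == 0)
            MPI_Bcast(buf, (int)count, MPI_FLOAT, 0, leader_comm);

        /* Step 2: intra-node broadcast from the leader over NCCL's ring. */
        ncclBcast(buf, count, ncclFloat, 0, nccl_comm, stream);
        cudaStreamSynchronize(stream);

        ncclCommDestroy(nccl_comm);
        cudaFree(buf);
        MPI_Finalize();
        return 0;
    }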

SLIDE 41

Large Message Optimized Collectives for Deep Learning

  • MV2-GDR provides optimized collectives for large message sizes
  • Optimized Reduce, Allreduce, and Bcast
  • Good scaling with a large number of GPUs
  • Available with MVAPICH2-GDR 2.2 GA
    (an illustrative Allreduce call on a GPU gradient buffer is sketched below)

[Charts: latency (ms) of large-message GPU collectives - Reduce at 192 GPUs (2-128 MB) and at 64 MB (128-192 GPUs), Allreduce at 64 GPUs (2-128 MB) and at 128 MB (16-64 GPUs), Bcast at 64 GPUs (2-128 MB) and at 128 MB (16-64 GPUs)]
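
The typical use of these collectives in data-parallel training is an Allreduce over a gradient buffer that lives on the GPU. A minimal hedged sketch, with the buffer size chosen to match the large-message regime shown above and the averaging step left as a comment:

    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        /* A 64 MB "gradient" buffer in GPU memory (16M floats). */
        const int count = 16 * 1024 * 1024;
        float *grad;
        cudaMalloc((void **)&grad, (size_t)count * sizeof(float));

        /* Sum the gradients of all ranks in place, directly on device buffers;
           the library selects its large-message GPU Allreduce algorithm underneath. */
        MPI_Allreduce(MPI_IN_PLACE, grad, count, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

        /* Dividing by the number of ranks (e.g., with a small scaling kernel) would follow here. */

        cudaFree(grad);
        MPI_Finalize();
        return 0;
    }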

SLIDE 42

OSU-Caffe: Scalable Deep Learning

  • Caffe: a flexible and layered Deep Learning framework
  • Benefits and weaknesses
    – Multi-GPU training within a single node
    – Performance degradation for GPUs across different sockets
    – Limited scale-out
  • OSU-Caffe: MPI-based parallel training
    – Enables scale-up (within a node) and scale-out (across multi-GPU nodes)
    – Scale-out training of the GoogLeNet network on the ImageNet dataset

[Chart: training time (seconds) vs. number of GPUs (8-128) for GoogLeNet (ImageNet) on 128 GPUs, comparing Caffe, OSU-Caffe (1024), and OSU-Caffe (2048); multi-node runs are an invalid use case for stock Caffe]

OSU-Caffe is publicly available from: http://hidl.cse.ohio-state.edu

  • A. A. Awan, K. Hamidouche, J. Hashmi, and D. K. Panda, S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters, PPoPP '17, Feb 2017

SLIDE 43

Outline

  • Overview of the MVAPICH2 Project
  • MVAPICH2-GPU with GPUDirect-RDMA (GDR)
  • What’s new with MVAPICH2-GDR
  • Efficient MPI-3 Non-Blocking Collective support
  • Maximal overlap in MPI Datatype Processing
  • Efficient Support for Managed Memory
  • RoCE and Optimized Collective
  • Initial support for GPUDirect Async feature
  • Efficient Deep Learning with MVAPICH2-GDR
  • OpenACC-Aware support
  • Conclusions

SLIDE 44

OpenACC-Aware MPI

  • acc_malloc to allocate device memory
    – No changes to MPI calls
    – MVAPICH2 detects the device pointer and optimizes data movement
  • acc_deviceptr to get the device pointer (in OpenACC 2.0)
    – Enables MPI communication from memory allocated by the compiler when it is available in OpenACC 2.0 implementations
    – MVAPICH2 will detect the device pointer and optimize communication
  • Delivers the same performance as with CUDA

Using acc_malloc:

    A = acc_malloc(sizeof(int) * N);
    ......
    #pragma acc parallel loop deviceptr(A)
    ...  // compute for loop
    MPI_Send(A, N, MPI_INT, 0, 1, MPI_COMM_WORLD);
    ......
    acc_free(A);

Using acc_deviceptr:

    A = malloc(sizeof(int) * N);
    ......
    #pragma acc data copyin(A) ...
    {
        #pragma acc parallel loop ...
        // compute for loop
        MPI_Send(acc_deviceptr(A), N, MPI_INT, 0, 1, MPI_COMM_WORLD);
    }
    ......
    free(A);
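
Putting the first fragment above into a complete two-rank program, as a hedged sketch (it assumes an OpenACC compiler such as PGI and a CUDA-aware MPI like MVAPICH2-GDR; N, the tag, and the loop body are placeholders):

    #include <mpi.h>
    #include <openacc.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int N = 1 << 20;
        int *A = (int *)acc_malloc(sizeof(int) * N);    /* device allocation */

        if (rank == 0) {
            #pragma acc parallel loop deviceptr(A)
            for (int i = 0; i < N; i++)
                A[i] = i;                               /* compute directly on the device */

            /* Device pointer handed straight to MPI; MVAPICH2 detects it. */
            MPI_Send(A, N, MPI_INT, 1, 1, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(A, N, MPI_INT, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }

        acc_free(A);
        MPI_Finalize();
        return 0;
    }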

SLIDE 45

Outline

  • Overview of the MVAPICH2 Project
  • MVAPICH2-GPU with GPUDirect-RDMA (GDR)
  • What’s new with MVAPICH2-GDR
  • Efficient MPI-3 Non-Blocking Collective support
  • Maximal overlap in MPI Datatype Processing
  • Efficient Support for Managed Memory
  • RoCE and Optimized Collective
  • Initial support for GPUDirect Async feature
  • Efficient Deep Learning with MVAPICH2-GDR
  • OpenACC-Aware support
  • Conclusions

SLIDE 46

Conclusions

  • MVAPICH2 optimizes MPI communication on InfiniBand and RoCE clusters with GPUs
  • Provides optimized designs for point-to-point two-sided and one-sided communication, datatype processing and collective operations
  • Efficient and maximal overlap for MPI-3 NBC collectives
  • Delivers high performance and high productivity with support for the latest NVIDIA GPUs and InfiniBand/RoCE adapters
  • Looking forward to next-generation designs with GPUDirect Async (GDS) and application domains like Deep Learning
  • Users are strongly encouraged to use the latest MVAPICH2-GDR release to avail themselves of all features and performance benefits

SLIDE 47

A Follow-up Talk on PGAS/OpenSHMEM

  • S7324 - Bringing NVIDIA GPUs to the PGAS/OpenSHMEM World: Challenges and Solutions
    – Day: Today, 05/11
    – Time: 15:00 - 15:25
    – Location: Room 211B

SLIDE 48

Acknowledgments

  • Dr. Davide Rossetti
  • Dr. Sreeram Potluri
  • Filippo Spiga and Stuart Rankin, HPCS, University of Cambridge (Wilkes Cluster)

SLIDE 49

Thank You!

panda@cse.ohio-state.edu
subramon@cse.ohio-state.edu

The MVAPICH2 Project
http://mvapich.cse.ohio-state.edu/

Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/

The High-Performance Deep Learning Project
http://hidl.cse.ohio-state.edu/