MVAPICH2-GDR: High-Performance and Scalable CUDA-Aware MPI Library for HPC and AI
GPU Technology Conference (GTC 2019)
Dhabaleswar K. (DK) Panda and Hari Subramoni, The Ohio State University
GTC 2019 2 Network Based Computing Laboratory
Outline
- Overview of the MVAPICH2 Project
- MVAPICH2-GPU with GPUDirect-RDMA (GDR)
- Current Features
  - Multi-stream Communication for IPC
  - CMA-based Intra-node Host-to-Host Communication Support
  - Maximal Overlap in MPI Datatype Processing
  - Efficient Support for Managed Memory
  - Streaming Support with InfiniBand Multicast and GDR
  - Support for Deep Learning
  - Support for OpenPOWER with NVLink
  - Support for Containers
- Upcoming Features
  - CMA-based Intra-node Collective Communication Support
  - XPMEM-based Collective Communication Support
  - Optimized Datatype Processing
  - Out-of-core Processing for Deep Learning
- Conclusions
Overview of the MVAPICH2 Project
- High Performance open-source MPI Library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
– MVAPICH (MPI-1) and MVAPICH2 (MPI-2.2 and MPI-3.1): started in 2001, first version available in 2002
– MVAPICH2-X (MPI + PGAS): available since 2011
– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC): available since 2014
– Support for Virtualization (MVAPICH2-Virt): available since 2015
– Support for Energy-Awareness (MVAPICH2-EA): available since 2015
– Support for InfiniBand Network Analysis and Monitoring (OSU INAM): available since 2015
– Used by more than 2,975 organizations in 88 countries
– More than 529,000 (> 0.5 million) downloads from the OSU site directly
– Empowering many TOP500 clusters (Nov ‘18 ranking)
- 3rd ranked 10,649,640-core cluster (Sunway TaihuLight) at NSC, Wuxi, China
- 14th, 556,104 cores (Oakforest-PACS) in Japan
- 17th, 367,024 cores (Stampede2) at TACC
- 27th, 241,108 cores (Pleiades) at NASA, and many others
– Available with software stacks of many vendors and Linux Distros (RedHat, SuSE, and OpenHPC)
– http://mvapich.cse.ohio-state.edu
- Empowering Top500 systems for over a decade
Partner in the upcoming TACC Frontera System
MVAPICH2 Release Timeline and Downloads
[Chart: cumulative number of downloads from Sep 2004 to Nov 2018, growing past 500,000, annotated with release milestones from MV 0.9.4 and MV2 0.9.0 through MV2 2.3.1, MV2-X 2.3rc1, MV2-GDR 2.3, MV2-Virt 2.2, and OSU INAM 0.9.4]
Architecture of MVAPICH2 Software Family
[Diagram: layered architecture]
- High-performance parallel programming models: Message Passing Interface (MPI), PGAS (UPC, OpenSHMEM, CAF, UPC++), and Hybrid MPI + X (MPI + PGAS + OpenMP/Cilk)
- High-performance and scalable communication runtime with diverse APIs and mechanisms: point-to-point primitives, collective algorithms, energy-awareness, remote memory access, I/O and file systems, fault tolerance, virtualization, active messages, job startup, introspection & analysis
- Support for modern networking technology (InfiniBand, iWARP, RoCE, Omni-Path) and modern multi-/many-core architectures (Intel Xeon, OpenPOWER, Xeon Phi, ARM, NVIDIA GPGPU)
  - Transport protocols: RC, XRC, UD, DC; modern features: UMR, ODP, SR-IOV, multi-rail
  - Transport mechanisms: shared memory, CMA, IVSHMEM, XPMEM; modern features: MCDRAM*, NVLink, CAPI* (* upcoming)
MVAPICH2 Software Family
High-Performance Parallel Programming Libraries
- MVAPICH2 – Support for InfiniBand, Omni-Path, Ethernet/iWARP, and RoCE
- MVAPICH2-X – Advanced MPI features, OSU INAM, PGAS (OpenSHMEM, UPC, UPC++, and CAF), and MPI+PGAS programming models with a unified communication runtime
- MVAPICH2-GDR – Optimized MPI for clusters with NVIDIA GPUs and for GPU-enabled Deep Learning applications
- MVAPICH2-Virt – High-performance and scalable MPI for hypervisor- and container-based HPC clouds
- MVAPICH2-EA – Energy-aware, high-performance MPI
- MVAPICH2-MIC – Optimized MPI for clusters with Intel KNC
Microbenchmarks
- OMB – Microbenchmark suite to evaluate MPI and PGAS (OpenSHMEM, UPC, and UPC++) libraries for CPUs and GPUs
Tools
- OSU INAM – Network monitoring, profiling, and analysis for clusters with MPI and scheduler integration
- OEMT – Utility to measure the energy consumption of MPI applications
[Diagram: two nodes, each with CPUs connected by QPI, GPUs attached as PCIe devices, memory buffers on host and device, and an IB adapter linking Node 0 and Node 1]
- 1. Intra-GPU
- 2. Intra-Socket GPU-GPU
- 3. Inter-Socket GPU-GPU
- 4. Inter-Node GPU-GPU
- 5. Intra-Socket GPU-Host
- 6. Inter-Socket GPU-Host
- 7. Inter-Node GPU-Host
- 8. Inter-Node GPU-GPU with IB adapter on remote socket
- and more . . .
- NVLink is leading to even more paths
- For each path, different schemes: shared memory, IPC, GPUDirect RDMA, pipelining, …
- GPUs are connected as PCIe devices – flexibility, but also complexity
- Critical for runtimes to optimize data movement while hiding this complexity
MVAPICH2-GDR: Optimizing MPI Data Movement on GPU Clusters
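The path taxonomy above can be sketched as decision logic. The following is a hypothetical illustration (path labels, thresholds, and scheme names are invented for this sketch, not MVAPICH2-GDR internals) of how a CUDA-aware runtime might classify a transfer and pick a data-movement scheme:

```python
# Hypothetical sketch: classify a GPU/host transfer path and pick a scheme.
# All thresholds and scheme names are illustrative, not the library's tuned defaults.

def classify_path(src, dst):
    """src/dst: dicts with 'node', 'socket', 'device' ('gpu'|'host'), optional 'gpu_id'."""
    if src["node"] != dst["node"]:
        return "inter-node"
    if (src["device"] == "gpu" and dst["device"] == "gpu"
            and src.get("gpu_id") == dst.get("gpu_id")):
        return "intra-gpu"
    if src["socket"] == dst["socket"]:
        return "intra-socket"
    return "inter-socket"

def pick_scheme(path, size, gdr_limit=32 * 1024):
    if path == "intra-gpu":
        return "cudaMemcpyDeviceToDevice"       # stays on one device
    if path in ("intra-socket", "inter-socket"):
        # small: staged copy through shared memory; large: CUDA IPC mapping
        return "cuda-ipc" if size >= 8 * 1024 else "shared-memory-copy"
    # inter-node: GDR for small messages, host-staged pipeline beyond the limit
    return "gpudirect-rdma" if size <= gdr_limit else "host-staged-pipeline"
```

A real runtime layers tuned, per-platform thresholds on top of exactly this kind of classification.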
At Sender:
    MPI_Send(s_devbuf, size, …);
At Receiver:
    MPI_Recv(r_devbuf, size, …);
(device-buffer handling happens inside MVAPICH2)
- Standard MPI interfaces used for unified data movement
- Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
- Overlaps data movement from the GPU with RDMA transfers
High Performance and High Productivity
GPU-Aware (CUDA-Aware) MPI Library: MVAPICH2-GPU
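The overlap idea can be sketched with a toy scheduler: split the message into chunks so the device-to-host copy of chunk i+1 can proceed while chunk i is on the wire. The chunk size and operation names here are illustrative, not the library's tuned values:

```python
# Toy model of the pipelined GPU send inside a CUDA-aware MPI library:
# the D2H copy of each chunk (cudaMemcpyAsync in reality) is issued before
# the RDMA send of the previous chunk, so copy and network transfer overlap.

def pipeline_schedule(total_bytes, chunk=256 * 1024):
    chunks = [(off, min(chunk, total_bytes - off))
              for off in range(0, total_bytes, chunk)]
    schedule = []
    for idx, (off, length) in enumerate(chunks):
        schedule.append(("copy_d2h", off, length))
        if idx > 0:
            prev_off, prev_len = chunks[idx - 1]
            # this send overlaps the copy issued just above
            schedule.append(("rdma_send", prev_off, prev_len))
    off, length = chunks[-1]
    schedule.append(("rdma_send", off, length))  # drain the last chunk
    return schedule

sched = pipeline_schedule(1 << 20)  # 1 MB message -> four 256 KB chunks
```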
CUDA-Aware MPI: MVAPICH2-GDR 1.8-2.3.1 Releases
- Support for MPI communication from NVIDIA GPU device memory
- High-performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host, and Host-GPU)
- High-performance intra-node point-to-point communication for multiple GPU adapters/node (GPU-GPU, GPU-Host, and Host-GPU)
- Taking advantage of CUDA IPC (available since CUDA 4.1) for intra-node communication with multiple GPU adapters/node
- Optimized and tuned collectives for GPU device buffers
- MPI datatype support for point-to-point and collective communication from GPU device buffers
- Unified memory
- MVAPICH2-GDR 2.3.1 requires the following software to be installed on your system:
  1. Mellanox OFED 3.2 or later
  2. NVIDIA Driver 367.48 or later
  3. NVIDIA CUDA Toolkit 7.5 or later
  4. NVIDIA Peer Memory (nv_peer_mem) module to enable GPUDirect RDMA (GDR) support
  5. GDRCOPY library by NVIDIA (strongly recommended for best performance): https://github.com/NVIDIA/gdrcopy
- Comprehensive instructions are available in the MVAPICH2-GDR User Guide:
  – http://mvapich.cse.ohio-state.edu/userguide/gdr/
MVAPICH2-GDR: Pre-requisites for OpenPOWER & x86 Systems
- Simple installation steps for both systems
- Pick the right MVAPICH2-GDR RPM from the Downloads page:
  – http://mvapich.cse.ohio-state.edu/downloads/
  – e.g. http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3/mofed4.5/mvapich2-gdr-mcast.cuda10.0.mofed4.5.gnu4.8.5-2.3-1.el7.x86_64.rpm (== <mv2-gdr-rpm-name>.rpm)

$ wget http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3/<mv2-gdr-rpm-name>.rpm
Root users:
$ rpm -Uvh --nodeps <mv2-gdr-rpm-name>.rpm
Non-root users:
$ rpm2cpio <mv2-gdr-rpm-name>.rpm | cpio -id
- Contact MVAPICH help list with any questions related to the package
mvapich-help@cse.ohio-state.edu
MVAPICH2-GDR: Download and Setup on OpenPOWER & x86 Systems
- Released on 03/16/2019
- Major features and enhancements:
  – Based on MVAPICH2 2.3.1
  – Enhanced intra-node and inter-node point-to-point performance for DGX-2 and IBM POWER8/POWER9 systems
  – Enhanced Allreduce performance for DGX-2 and IBM POWER8/POWER9 systems
  – Enhanced small-message performance for CUDA-Aware MPI_Put and MPI_Get
  – Support for PGI 18.10
  – Flexible support for running TensorFlow (Horovod) jobs
  – Added support for Volta (V100) GPUs
  – Support for OpenPOWER with NVLink
  – Efficient multi-CUDA-stream-based IPC communication for multi-GPU systems with and without NVLink
  – Leverages the Linux Cross Memory Attach (CMA) feature for enhanced host-based communication
  – InfiniBand Multicast (IB-MCAST) based designs for GPU-based broadcast and streaming applications
  – Efficient broadcast designs for Deep Learning applications
MVAPICH2-GDR 2.3.1
Optimized MVAPICH2-GDR Design
[Charts: GPU-GPU inter-node latency, bandwidth, and bi-directional bandwidth vs. message size (1 B–8 KB), MV2 (no GDR) vs. MV2-GDR 2.3.1 – latency down to 1.85 us (11X better), bandwidth up to 9X better, bi-directional bandwidth up to 10X better]
Platform: Intel Haswell (E5-2687W @ 3.10 GHz) node with 20 cores, NVIDIA Volta V100 GPU, Mellanox ConnectX-4 EDR HCA, CUDA 10.0, Mellanox OFED 4.2 with GPUDirect RDMA
RoCE and Optimized Collectives Support
- RoCE v1 and v2 support
- RDMA_CM connection support
- CUDA-Aware collective tuning
  – Point-to-point tuning (available since MVAPICH2-GDR 2.0)
    - Tuned thresholds for the different communication patterns and features
    - Depends on the system configuration (CPU, HCA, and GPU models)
  – Tuning framework for GPU-based collectives
    - Selects the best algorithm depending on message size, system size, and system configuration
    - Support for Bcast and Gather operations on different GDR-enabled systems
    - Available since the MVAPICH2-GDR 2.2RC1 release
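A tuning framework of this kind boils down to a lookup table keyed by message size and system size. The table below is invented for illustration (the real tables are per-platform and tuned per release); it only demonstrates the selection mechanism:

```python
# Illustrative collective-tuning table: pick a broadcast algorithm from
# (message size, number of processes). Entries are made up for this sketch.

BCAST_TABLE = [
    # (max_message_size, max_num_procs, algorithm)
    (8 * 1024,     float("inf"), "knomial"),
    (512 * 1024,   64,           "binomial-gdr"),
    (float("inf"), float("inf"), "scatter-allgather-pipelined"),
]

def select_bcast_algo(msg_size, num_procs):
    # First matching row wins; the last row is a catch-all fallback.
    for max_size, max_procs, algo in BCAST_TABLE:
        if msg_size <= max_size and num_procs <= max_procs:
            return algo
    return BCAST_TABLE[-1][2]
```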
Application-Level Evaluation (HOOMD-blue)
- Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
- HOOMD-blue version 1.0.5
- GDRCOPY enabled: MV2_USE_CUDA=1 MV2_IBA_HCA=mlx5_0 MV2_IBA_EAGER_THRESHOLD=32768 MV2_VBUF_TOTAL_SIZE=32768 MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768 MV2_USE_GPUDIRECT_GDRCOPY=1 MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384
[Charts: average time steps per second (TPS) vs. number of processes (4–32) for 64K- and 256K-particle runs, MV2 vs. MV2+GDR; ~2X improvement in both cases]
Application-Level Evaluation (Cosmo) and Weather Forecasting in Switzerland
[Charts: normalized execution time vs. number of GPUs on the CSCS GPU cluster (16–96 GPUs) and the Wilkes GPU cluster (4–32 GPUs); Default vs. Callback-based vs. Event-based designs]
- 2X improvement on 32 GPU nodes
- 30% improvement on 96 GPU nodes (8 GPUs/node)
- C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee, H. Subramoni, and D. K. Panda, Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems, IPDPS'16
On-going collaboration with CSCS and MeteoSwiss (Switzerland) in co-designing MV2-GDR and the COSMO application
COSMO model: http://www2.cosmo-model.org/content/tasks/operational/meteoSwiss/
Multi-stream Communication using CUDA IPC on OpenPOWER and DGX-1
- Up to 16% higher Device to Device (D2D) bandwidth on OpenPOWER + NVLink inter-connect
- Up to 30% higher D2D bandwidth on DGX-1 with NVLink
[Charts: point-to-point device-to-device (D2D) bandwidth vs. message size, 1-stream vs. 4-streams CUDA IPC designs; 16% better on OpenPOWER (128 KB–4 MB) and 30% better on DGX-1 (16 KB–4 MB)]
Available since MVAPICH2-GDR-2.3a
CMA-based Intra-node Host-to-Host Communication Support
[Charts: intra-node point-to-point host-to-host (H2H) latency and bandwidth, MV2-GDR with vs. without CMA]
- Up to 30% lower host-to-host (H2H) latency and 30% higher H2H bandwidth
Platform: MVAPICH2-GDR 2.3a on an Intel Broadwell (E5-2680 v4 @ 2.40 GHz) node with 28 cores, NVIDIA Tesla K80 GPU, Mellanox ConnectX-4 EDR HCA, CUDA 8.0, Mellanox OFED 4.0 with GPUDirect RDMA
Non-contiguous Data Exchange
- Multi-dimensional data
  - Row-based organization
  - Contiguous in one dimension
  - Non-contiguous in the other dimensions
- Halo data exchange
  - Duplicate the boundary
  - Exchange the boundary in each iteration
MPI Datatype support in MVAPICH2
- Datatype support in MPI
  – Operate on customized datatypes to improve productivity
  – Enable the MPI library to optimize non-contiguous data
At Sender:
    MPI_Type_vector(n_blocks, n_elements, stride, old_type, &new_type);
    MPI_Type_commit(&new_type);
    …
    MPI_Send(s_buf, size, new_type, dest, tag, MPI_COMM_WORLD);
- Inside MVAPICH2
  - Uses datatype-specific CUDA kernels to pack data in chunks
  - Efficiently moves data between nodes using RDMA
  - In progress: currently optimizes vector and hindexed datatypes
  - Transparent to the user
- H. Wang, S. Potluri, D. Bureddy, C. Rosales and D. K. Panda, GPU-aware MPI on RDMA-Enabled Clusters: Design, Implementation and Evaluation, IEEE Transactions on Parallel and Distributed Systems, Vol. 25, No. 10, pp. 2595-2605, Oct 2014.
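As a rough analogue of the pack step, here is what a vector-datatype pack does, in plain Python (on the GPU, a kernel performs the same gather in parallel); the argument names follow MPI_Type_vector above:

```python
# Pure-Python analogue of a vector-datatype pack: gather n_blocks blocks of
# n_elements each, separated by `stride` elements, into a contiguous buffer.

def pack_vector(buf, n_blocks, n_elements, stride):
    packed = []
    for b in range(n_blocks):
        start = b * stride
        packed.extend(buf[start:start + n_elements])
    return packed

# A 4x4 row-major matrix; one column is a vector type with
# n_blocks=4, n_elements=1, stride=4.
matrix = list(range(16))
col0 = pack_vector(matrix, n_blocks=4, n_elements=1, stride=4)  # [0, 4, 8, 12]
```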
MPI Datatype Processing (Computation Optimization)
- Comprehensive support
- Targeted kernels for regular datatypes - vector, subarray, indexed_block
- Generic kernels for all other irregular datatypes
- Separate non-blocking stream for kernels launched by MPI library
- Avoids stream conflicts with application kernels
- Flexible set of parameters for users to tune kernels
- Vector
- MV2_CUDA_KERNEL_VECTOR_TIDBLK_SIZE
- MV2_CUDA_KERNEL_VECTOR_YSIZE
- Subarray
- MV2_CUDA_KERNEL_SUBARR_TIDBLK_SIZE
- MV2_CUDA_KERNEL_SUBARR_XDIM
- MV2_CUDA_KERNEL_SUBARR_YDIM
- MV2_CUDA_KERNEL_SUBARR_ZDIM
- Indexed_block
- MV2_CUDA_KERNEL_IDXBLK_XDIM
MPI Datatype Processing (Communication Optimization)
Common scenario (wastes computing resources on both CPU and GPU):
    MPI_Isend(A, …, datatype, …);
    MPI_Isend(B, …, datatype, …);
    MPI_Isend(C, …, datatype, …);
    MPI_Isend(D, …, datatype, …);
    …
    MPI_Waitall(…);
*A, B, … contain non-contiguous MPI datatypes
Enhanced Support for Intra-node Unified Memory
- CUDA Unified Memory (UM) => no memory pin-down
  - No IPC support for intra-node communication
  - No GDR support for inter-node communication
- Initial, basic support in MVAPICH2-GDR
  - For both intra- and inter-node transfers, data is "pipelined through" host memory
- Enhanced intra-node UM design using IPC
  - Double-buffered, pair-wise IPC-based scheme
  - Brings IPC performance to UM
  - High performance and high productivity
  - Available since MVAPICH2-GDR 2.2RC1
- K. Hamidouche, A. Awan, A. Venkatesh, and D. K. Panda, CUDA M3: Designing Efficient CUDA Managed Memory-aware MPI by Exploiting GDR and IPC, HiPC '16
[Chart: latency on K80 with MV2-GDR]
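The double-buffering scheme can be sketched in a few lines: two staging buffers alternate so that, in the real design, the fill of one overlaps the drain of the other. Here the two steps run sequentially for clarity, and the buffer size is arbitrary:

```python
# Sketch of a double-buffered, pair-wise staging copy: the producer fills
# one buffer while (in the real IPC design) the consumer drains the other.

def double_buffered_copy(src, buf_size):
    staging = [[None] * buf_size, [None] * buf_size]  # the two staging buffers
    dst, active = [], 0
    for off in range(0, len(src), buf_size):
        chunk = src[off:off + buf_size]
        staging[active][:len(chunk)] = chunk      # producer fills buffer `active`
        dst.extend(staging[active][:len(chunk)])  # consumer drains it (overlapped in reality)
        active ^= 1                               # swap buffers for the next chunk
    return dst
```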
Characterizing Unified Memory-aware MPI on Modern GPUs
- Improved UM support in Pascal and Volta GPUs through:
  - Advanced GPU page-fault engines
  - The cudaMemPrefetchAsync and cudaMemAdvise APIs, which provide more control over UM data placement
- Are the UM designs developed during the Kepler era still valid?
  - Carried out an in-depth characterization
- Our characterization studies show:
  - The UM designs from the Kepler era are still valid
  - They are 4.2X and 2.8X better in latency compared to MVAPICH2-GDR and Open MPI, respectively
- K. V. Manian, A. Awan, A. Ruhela, C. Chu, H. Subramoni and D. K. Panda, Characterizing CUDA Unified Memory (UM)-Aware MPI Designs on Modern GPU Architectures, GPGPU '19 Workshop, in conjunction with ASPLOS '19, April '19
[Charts: on V100 with MV2-GDR and OMPI; on V100 with MV2-GDR]
Streaming Applications
- Streaming applications on HPC systems
  1. Communication (MPI): broadcast-type operations
  2. Computation (CUDA): multiple GPU nodes as workers
[Diagram: a data source streams in real time to a sender, which uses data-streaming-like broadcast operations to feed multiple CPU + multi-GPU worker nodes (HPC resources for real-time analytics)]
Hardware Multicast-based Broadcast
[Diagram: a source node (IB HCA, CPU, GPU) sends header + data through an IB switch to destinations 1..N: 1. IB gather + GDR read at the source, 2. IB hardware multicast through the switch, 3. IB scatter + GDR write at each destination]
- For GPU-resident data, uses
  – GPUDirect RDMA (GDR)
  – InfiniBand hardware multicast (IB-MCAST)
- Overheads
  – IB UD message-size limit
  – GDR limit
- A. Venkatesh, H. Subramoni, K. Hamidouche, and D. K. Panda, "A High Performance Broadcast Design with Hardware Multicast and GPUDirect RDMA for Streaming Applications on InfiniBand Clusters," HiPC 2014, Dec 2014.
Available since MVAPICH2-GDR 2.3a
Streaming Benchmark @ CSCS (88 GPUs)
[Charts: broadcast latency vs. message size (1 B–16 KB and 32 KB–4 MB), MCAST-GDR vs. MCAST-GDR-OPT]
- IB-MCAST + GDR + topology-aware IPC-based schemes
  – Up to 58% and 79% latency reduction for small and large messages, respectively
C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, "Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters," SBAC-PAD'16, Oct. 26-28, 2016.
Application-based Evaluation: CUDA-Aware CNTK
[Chart: speedup of MV2-GDR-Knomial, MV2-GDR-Ring, and MV2-MCAST-GDR-Opt for AlexNet, VGG, and ResNet-50 on 8 and 16 GPUs; higher is better]
- @ RI2 cluster, 16 GPUs, 1 GPU/node
  – CUDA-Aware Microsoft Cognitive Toolkit (CA-CNTK) [2]
- Reduces latency by up to 24%, 16%, and 18% for the AlexNet, VGG, and ResNet models, respectively [1]
- Higher improvement can be observed at larger system sizes
[1] C.-H. Chu, X. Lu, A. A. Awan, H. Subramoni, J. Hashmi, B. Elton, and D. K. Panda, Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning, ICPP'17.
[2] D. S. Banerjee, K. Hamidouche, and D. K. Panda, Re-Designing CNTK Deep Learning Framework on Modern GPU Enabled Clusters, IEEE CloudCom'16
Research Poster (P9242)
Deep Learning: New Challenges for Runtimes
- Scale-up: intra-node communication
  – Many improvements, e.g.:
    - NVIDIA cuDNN, cuBLAS, NCCL, etc.
    - CUDA 9 cooperative groups
- Scale-out: inter-node communication
  – DL frameworks – most are optimized for single-node use only
  – Distributed (parallel) training is an emerging trend
    - OSU-Caffe – MPI-based
    - Microsoft CNTK – MPI/NCCL2
    - Google TensorFlow – gRPC-based/MPI/NCCL2
    - Facebook Caffe2 – hybrid (NCCL2/Gloo/MPI)
[Diagram: scale-up vs. scale-out performance; cuDNN and MKL-DNN rank high on scale-up, gRPC and Hadoop on scale-out, NCCL2 in between, with MPI desired to deliver both]
Data Parallel Deep Learning and MPI Collectives
[Diagram: data-parallel training loop over 4 GPUs – 1. data propagation: MPI_Bcast of a packed_comm_buff with parameters L1, L2, …, Ln from GPU 0; 2. forward/backward pass on each GPU; 3. gradient aggregation: MPI_Reduce of each packed_reduce_buff to GPU 0, then ApplyUpdates, looped per iteration]
- Major MPI collectives involved in designing distributed frameworks:
  - MPI_Bcast – required for DNN parameter exchange
  - MPI_Reduce – needed for gradient accumulation from multiple solvers
  - MPI_Allreduce – use just one Allreduce instead of Reduce and Broadcast
- A. A. Awan, K. Hamidouche, J. M. Hashmi, and D. K. Panda, S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters, in Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '17)
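The Allreduce substitution rests on a simple identity: an Allreduce over the gradients produces the same result on every rank as a Reduce to the root followed by a Bcast of the result. A plain-Python check with toy stand-ins for the collectives:

```python
# Toy stand-ins for the collectives, to show Allreduce == Reduce + Bcast.
# `bufs` holds one gradient buffer per rank (per GPU).

def reduce_(bufs, root=0):
    summed = [sum(vals) for vals in zip(*bufs)]  # element-wise sum across ranks
    out = [None] * len(bufs)
    out[root] = summed                           # only the root holds the result
    return out

def bcast(bufs, root=0):
    return [list(bufs[root]) for _ in bufs]      # everyone gets the root's buffer

def allreduce(bufs):
    return [[sum(vals) for vals in zip(*bufs)] for _ in bufs]

grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]     # per-GPU gradient buffers
assert allreduce(grads) == bcast(reduce_(grads)) # the identity the slide uses
```

In practice a single fused Allreduce also halves the number of collective calls and lets the library pick a bandwidth-optimal ring or tree algorithm.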
MVAPICH2-GDR: Allreduce Comparison with Baidu and OpenMPI
- 16 GPUs (4 nodes): MVAPICH2-GDR vs. Baidu-allreduce and OpenMPI 3.0
[Charts: Allreduce latency vs. message size across small (4 B–256 KB), medium (512 KB–4 MB), and large (8 MB–512 MB) ranges – MVAPICH2 is up to ~30X, ~10X, and ~4X better across the three ranges; MV2 is ~2X better than Baidu, and OpenMPI is ~5X slower than Baidu]
*Available since MVAPICH2-GDR 2.3a
MVAPICH2-GDR vs. NCCL2 – Allreduce Operation
- Optimized designs since MVAPICH2-GDR 2.3 offer better/comparable performance for most cases
- MPI_Allreduce (MVAPICH2-GDR) vs. ncclAllreduce (NCCL2) on 16 GPUs
[Charts: Allreduce latency vs. message size, MVAPICH2-GDR 2.3.1 vs. NCCL2 – ~1.2X better at large messages and ~3X better at small messages (4 B–64 KB)]
Platform: Intel Xeon (Broadwell) nodes with dual-socket CPUs, 1 K80 GPU, and an EDR InfiniBand interconnect
MVAPICH2-GDR vs. NCCL2 – Allreduce Operation (DGX-2)
- Optimized designs in MVAPICH2-GDR 2.3.1 offer better/comparable performance for most cases
- MPI_Allreduce (MVAPICH2-GDR) vs. ncclAllreduce (NCCL2) on 1 DGX-2 node (16 Volta GPUs)
[Charts: Allreduce latency vs. message size, MVAPICH2-GDR 2.3.1 vs. NCCL-2.3 – ~1.7X better at large messages and ~2.5X better at small messages (8 B–128 KB)]
Platform: Nvidia DGX-2 system (16 Nvidia Volta GPUs connected with NVSwitch), CUDA 9.2
Scalable TensorFlow using Horovod, MPI, and NCCL
- Efficient Allreduce is crucial for Horovod's overall training performance
  – Both MPI and NCCL designs are available
- We have evaluated Horovod extensively and compared it against a wide range of designs using gRPC and gRPC extensions
- MVAPICH2-GDR achieved up to 90% scaling efficiency for ResNet-50 training on 64 Pascal GPUs
- A. A. Awan, J. Bedorf, C.-H. Chu, H. Subramoni and D. K. Panda, "Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation," (to be presented) CCGrid '19. https://arxiv.org/abs/1810.11112
Distributed Training with TensorFlow and MVAPICH2-GDR
- ResNet-50 training using the TensorFlow benchmark on 1 DGX-2 node (8 Volta GPUs)
[Charts: images per second and scaling efficiency (%) vs. number of GPUs (1–8), NCCL-2.3 vs. MVAPICH2-GDR-Next; up to 7.5% higher throughput with MVAPICH2-GDR]
Scaling efficiency = (actual throughput / ideal throughput at scale) × 100%
Platform: Nvidia DGX-2 system (16 Nvidia Volta GPUs connected with NVSwitch), CUDA 9.2
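The scaling-efficiency formula above as a small helper; the throughput figures in the example are made up for illustration:

```python
# Scaling efficiency = actual throughput / (single-GPU throughput * num_gpus) * 100%,
# i.e. the denominator is the ideal throughput at that scale.

def scaling_efficiency(throughput, single_gpu_throughput, num_gpus):
    ideal = single_gpu_throughput * num_gpus
    return 100.0 * throughput / ideal

# e.g. 8 GPUs delivering 2880 img/s against 400 img/s on one GPU -> 90%
eff = scaling_efficiency(2880, 400, 8)
```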
OSU-Caffe: Scalable Deep Learning
- Caffe: a flexible and layered Deep Learning framework
- Benefits and weaknesses
  – Multi-GPU training within a single node
  – Performance degradation for GPUs across different sockets
  – Limited scale-out
- OSU-Caffe: MPI-based parallel training
  – Enables scale-up (within a node) and scale-out (across multi-GPU nodes)
  – Scale-out on 64 GPUs for training the CIFAR-10 network on the CIFAR-10 dataset
  – Scale-out on 128 GPUs for training the GoogLeNet network on the ImageNet dataset
[Chart: GoogLeNet (ImageNet) training time on 8–128 GPUs, Caffe vs. OSU-Caffe (1024) vs. OSU-Caffe (2048); plain Caffe is an invalid use case at these scales]
OSU-Caffe is publicly available from http://hidl.cse.ohio-state.edu/; support on OpenPOWER will be available soon
Point-to-Point Host-level Performance on OpenPOWER
[Charts: intra-node and inter-node latency, bandwidth, and bi-directional bandwidth for MVAPICH2-GDR 2.3.1 – intra-node: ~0.5 us latency, ~30 GB/s bandwidth, ~60 GB/s bi-directional bandwidth; inter-node: ~2.3 us latency, ~12 GB/s bandwidth, ~24 GB/s bi-directional bandwidth]
Platform: OpenPOWER (POWER8, ppc64le) CPU with a Mellanox EDR (MT4115) HCA
Device-to-Device Performance on OpenPOWER (NVLink2 + Volta)
[Charts: intra-node latency and bandwidth (intra-socket vs. inter-socket) and inter-node latency and bandwidth vs. message size]
- Intra-node bandwidth: 70.4 GB/s for 128 MB messages (via NVLink2)
- Intra-node latency: 5.36 us (without GDRCopy)
- Inter-node latency: 5.66 us (without GDRCopy)
- Inter-node bandwidth: 23.7 GB/s (2-port EDR)
Platform: OpenPOWER (POWER9, ppc64le) nodes with dual-socket CPUs, 4 Volta V100 GPUs, and a 2-port EDR InfiniBand interconnect
Available since MVAPICH2-GDR 2.3a
Container Support
- Increasing trend to provide container support for MPI libraries
  - Ease of build
  - Portability
  - Reproducibility
- MVAPICH2-GDR 2.3.1 provides container (Docker) support
- More details are available in the MVAPICH2-GDR User Guide:
  - http://mvapich.cse.ohio-state.edu/userguide/gdr/
- Synergistic with the HPC Container Maker (hpccm) effort by NVIDIA (https://github.com/NVIDIA/hpc-container-maker)
MVAPICH2-GDR in Containers with Negligible Overhead
[Charts: GPU-GPU inter-node latency, bandwidth, and bi-directional bandwidth, Docker vs. native; the curves nearly overlap]
Platform: MVAPICH2-GDR 2.3.1 on an Intel Haswell (E5-2687W @ 3.10 GHz) node with 20 cores, NVIDIA Volta V100 GPU, Mellanox ConnectX-4 EDR HCA, CUDA 9.0, Mellanox OFED 4.0 with GPUDirect RDMA
Scalable Host-based Collectives on OpenPOWER with CMA (Intra-node Reduce & Alltoall)
(Nodes=1, PPN=20)
[Charts: Alltoall and Reduce latency vs. message size (4 B-4 KB small, 8 KB-1 MB large) for MVAPICH2-GDR-Next, SpectrumMPI-10.1.0.2, and OpenMPI-3.0.0; annotated speedups of 3.6X, 5.2X, 3.2X, and 3.3X (small messages) and 3.2X, 1.4X, 1.3X, and 1.2X (large messages)]
- Up to 5X and 3X performance improvement by MVAPICH2 for small and large messages, respectively
Scalable Host-based Collectives on OpenPOWER with CMA (Multi-node Reduce & Alltoall)
(Nodes=4, PPN=20)
[Charts: Alltoall and Reduce latency vs. message size (4 B-4 KB small, 8 KB-1 MB large) for MVAPICH2-GDR-Next and OpenMPI-3.0.0; annotated speedups of 12.4X, 1.9X, and 8.5X]
- Up to 12.4X and 8.5X performance improvement by MVAPICH2 for small and large messages, respectively
Shared Address Space (XPMEM-based) Collectives
- Offload reduction computation and communication to peer MPI ranks
– Every peer has direct "load/store" access to the other peers' buffers
– Multiple pseudo-roots independently carry out intra- and inter-node reductions
– Reduced data is put directly into the root's receive buffer
- True "zero-copy" design for Allreduce and Reduce
– No copies required during the entire reduction operation
– Scalable to multiple nodes
- Zero contention overheads, as data movement happens in user space
- J. Hashmi, S. Chakraborty, M. Bayatpour, H. Subramoni, and D. Panda, "Designing Efficient Shared Address Space Reduction Collectives for Multi-/Many-cores," International Parallel & Distributed Processing Symposium (IPDPS '18), May 2018
- Available since MVAPICH2-X 2.3rc1
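The pseudo-root scheme above can be sketched in plain Python (a hedged illustration of the idea, not MVAPICH2 code): several "pseudo-roots" each reduce a disjoint slice of every rank's buffer straight into the root's receive buffer, mimicking XPMEM's direct load/store access with no intermediate copies.

```python
from concurrent.futures import ThreadPoolExecutor

def xpmem_style_reduce(rank_buffers, recv_buffer, n_pseudo_roots=4):
    """Each pseudo-root reduces a disjoint chunk of every rank's buffer
    directly into the root's receive buffer (no intermediate copies)."""
    n = len(recv_buffer)
    chunk = (n + n_pseudo_roots - 1) // n_pseudo_roots

    def reduce_chunk(r):
        lo, hi = r * chunk, min((r + 1) * chunk, n)
        for i in range(lo, hi):
            # "load" from each peer's buffer, "store" into the root's buffer
            recv_buffer[i] = sum(buf[i] for buf in rank_buffers)

    with ThreadPoolExecutor(n_pseudo_roots) as pool:
        pool.map(reduce_chunk, range(n_pseudo_roots))

# 4 ranks contributing buffers of 1s, 2s, 3s, and 4s
bufs = [[r + 1] * 8 for r in range(4)]
out = [0] * 8
xpmem_style_reduce(bufs, out)
print(out)  # every element reduces to 1 + 2 + 3 + 4 = 10
```

Because each pseudo-root owns a disjoint index range, the workers never contend for the same output element, which mirrors the "zero contention" claim above.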
Benefits of XPMEM-based MPI_Bcast
- 28 MPI processes on a single dual-socket Broadwell E5-2680v4 (2 x 14-core)
[Chart: latency vs. message size (4 KB-4 MB) for Intel MPI 2018, OpenMPI 3.0.1, MV2X-2.3rc1 (CMA Coll), and MV2X-2.3rc2 (XPMEM Coll); up to 5X over OpenMPI]
Benefits of XPMEM-based MPI_Scatter
- High cache locality and contention-free access compared to CMA
[Chart: latency vs. message size (512 B-4 MB) for Intel MPI 2018, OpenMPI 3.0.1, MV2X-2.3rc1 (CMA Coll), and MV2X-2.3rc2 (XPMEM Coll); ~28X better than state-of-the-art]
Optimized All-Reduce with XPMEM
[Charts: latency vs. message size (16 KB-2 MB) for MVAPICH2-GDR-Next, SpectrumMPI-10.1.0, and OpenMPI-3.0.0; panels for (Nodes=1, PPN=20) and (Nodes=2, PPN=20); annotated improvements: 2X, 3X, 4X, 3.3X, 34%, 48%]
- Optimized MPI All-Reduce design in MVAPICH2
– Up to 2X performance improvement over Spectrum MPI and 4X over OpenMPI for intra-node
- Optimized runtime parameters: MV2_CPU_BINDING_POLICY=hybrid, MV2_HYBRID_BINDING_POLICY=bunch
Application-Level Benefits of XPMEM-Based Collectives
- Up to 20% benefit over Intel MPI for CNTK DNN training using AllReduce
- Up to 27% benefit over Intel MPI and up to 15% improvement over MVAPICH2 for the MiniAMR application kernel
[Charts: execution time vs. number of processes for Intel MPI, MVAPICH2, and MVAPICH2-XPMEM; CNTK AlexNet training (Broadwell, batch size=default, iterations=50, ppn=28; 9%-20% improvement) and MiniAMR (Broadwell, ppn=16; 15%-27% improvement)]
MVAPICH2-GDR: Enhanced Derived Datatype Processing
- Kernel-based and GDRCOPY-based one-shot packing for inter-socket and inter-node communication
- Zero-copy (packing-free) for GPUs with peer-to-peer direct access over PCIe/NVLink
[Chart: GPU-based DDTBench mimicking the MILC communication kernel; MILC speedup of MVAPICH2-GDR-Next over OpenMPI 4.0.0 and MVAPICH2-GDR 2.3 across problem sizes; platform: NVIDIA DGX-2 (16 Volta GPUs connected with NVSwitch), CUDA 9.2]
[Chart: execution time of the COSMO model communication kernel on 8-64 GPUs; MVAPICH2-GDR-Next improved ~10X over MVAPICH2-GDR 2.3; platform: Cray CS-Storm (8 NVIDIA Tesla K80 GPUs per node), CUDA 8.0]
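To make the "one-shot packing" idea concrete, here is a minimal host-side sketch (an illustration with assumed parameters, not MVAPICH2 internals) of packing an MPI_Type_vector-style strided layout into a contiguous buffer in a single pass; the zero-copy path above skips this step entirely when GPUs have peer-to-peer direct access.

```python
def pack_vector(buf, count, blocklen, stride):
    """One-shot pack of an MPI_Type_vector-like layout: `count` blocks of
    `blocklen` elements, with block starts `stride` elements apart."""
    packed = []
    for b in range(count):
        base = b * stride
        packed.extend(buf[base:base + blocklen])  # single pass over the layout
    return packed

# A 4x4 row-major matrix; pack a 2-element-wide column halo (one block per row)
matrix = list(range(16))
halo = pack_vector(matrix, count=4, blocklen=2, stride=4)
print(halo)  # [0, 1, 4, 5, 8, 9, 12, 13]
```

In MVAPICH2-GDR the analogous packing loop runs as a GPU kernel (or via GDRCOPY) so the whole non-contiguous layout is gathered in one shot rather than element by element.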
Scalability and Large (Out-of-core) Models?
- Large DNNs cannot be trained on GPUs due to memory limitations!
– ResNet-50 for image recognition: current frameworks can only go up to a small batch size of 45
– Next-generation models like Neural Machine Translation (NMT) are extremely large, consist of billions of parameters, and require even more memory
– Can we design out-of-core DNN training support using new software features in CUDA 8/9 and hardware mechanisms in Pascal/Volta GPUs?
- The general intuition is that managed allocations "will be" slow!
– The proposed framework, OC-Caffe (Out-of-Core Caffe), shows that managed-memory designs can deliver performance with negligible/no overhead
- OC-Caffe-Opt: up to 80% better than Intel-optimized CPU Caffe for ResNet-50 training on the Volta V100 GPU with CUDA 9 and cuDNN 7
- A. A. Awan, C.-H. Chu, H. Subramoni, X. Lu, and D. K. Panda, "OC-DNN: Exploiting Advanced Unified Memory Capabilities in CUDA 9 and Volta GPUs for Out-of-Core DNN Training," HiPC '18; Research Poster (P9243)
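The out-of-core idea can be illustrated with a host-side analog (a hedged sketch using `mmap`, not CUDA managed memory itself): data far larger than the working set is streamed through a fixed-size window, touching one chunk at a time, much as managed memory pages model data between host and GPU on demand.

```python
import array
import mmap
import os
import tempfile

def out_of_core_sum(path, chunk_elems=1 << 16):
    """Stream a file of float64s in fixed-size windows, holding only one
    chunk in the working set at a time -- an analog of on-demand paging."""
    total = 0.0
    itemsize = 8  # bytes per float64
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        n = len(mm) // itemsize
        for start in range(0, n, chunk_elems):
            stop = min(start + chunk_elems, n)
            chunk = array.array("d")
            chunk.frombytes(mm[start * itemsize:stop * itemsize])
            total += sum(chunk)  # stand-in for a training step on this chunk
    return total

# Write 100,000 float64s, then reduce them without loading all at once
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    array.array("d", [1.0] * 100_000).tofile(tmp)
print(out_of_core_sum(tmp.name))  # 100000.0
os.unlink(tmp.name)
```

The CUDA version replaces `mmap` with `cudaMallocManaged` allocations that the driver migrates automatically; the point of OC-Caffe is that, with prefetching and placement hints, this migration need not cost performance.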
Conclusions
- The MVAPICH2-GDR MPI library optimizes MPI communication on InfiniBand and RoCE (v1 and v2) clusters with GPUs, on both x86 and OpenPOWER platforms (including NVLink)
- Provides optimized designs for point-to-point two-sided and one-sided communication, datatype processing, and collective operations
- Takes advantage of CUDA features such as IPC and the GPUDirect RDMA family
- Allows flexible solutions for streaming applications with GPUs
- Provides optimized solutions for both HPC and High-Performance Deep Learning (HiDL) frameworks and applications
- Upcoming releases will support these advanced designs
Please join us for more events!
- Research Posters (Monday, March 18, 06:00 PM-08:00 PM, SJCC Upper Concourse):
– P9243 - Exploiting CUDA Unified Memory for Efficient Out-of-Core DNN Training
– P9242 - Exploiting GPUDirect Technology and Hardware Multicast for Streaming and Deep Learning Applications
- Talk (Tuesday, March 19, 03:00 PM-03:50 PM, SJCC Room 211A, Concourse Level): S9476 - MVAPICH2-GDR: High-Performance and Scalable CUDA-Aware MPI Library for HPC and AI
- Instructor-Led Training (Wednesday, March 20, 08:00 AM-10:00 AM, SJCC Room LL21D, Lower Level): L9121 - How to Boost the Performance of HPC/AI Applications Using MVAPICH2 Library
Personnel Acknowledgments
- Current Students (Graduate): A. Awan (Ph.D.), M. Bayatpour (Ph.D.), S. Chakraborthy (Ph.D.), C.-H. Chu (Ph.D.), S. Guganani (Ph.D.), J. Hashmi (Ph.D.), A. Jain (Ph.D.), K. S. Khorassani (Ph.D.), P. Kousha (Ph.D.), D. Shankar (Ph.D.)
- Current Students (Undergraduate): V. Gangal (B.S.), M. Haupt (B.S.), N. Sarkauskas (B.S.), A. Yeretzian (B.S.)
- Current Research Scientist: H. Subramoni
- Current Research Asst. Professor: X. Lu
- Current Post-docs: A. Ruhela, K. Manian
- Current Research Specialist: J. Smith
- Past Students: A. Augustine (M.S.), P. Balaji (Ph.D.), R. Biswas (M.S.), S. Bhagvat (M.S.), A. Bhat (M.S.), D. Buntinas (Ph.D.), L. Chai (Ph.D.), B. Chandrasekharan (M.S.), N. Dandapanthula (M.S.), V. Dhanraj (M.S.), T. Gangadharappa (M.S.), K. Gopalakrishnan (M.S.), W. Huang (Ph.D.), W. Jiang (M.S.), J. Jose (Ph.D.), S. Kini (M.S.), M. Koop (Ph.D.), K. Kulkarni (M.S.), R. Kumar (M.S.), S. Krishnamoorthy (M.S.), K. Kandalla (Ph.D.), M. Li (Ph.D.), P. Lai (M.S.), J. Liu (Ph.D.), M. Luo (Ph.D.), A. Mamidala (Ph.D.), G. Marsh (M.S.), V. Meshram (M.S.), A. Moody (M.S.), S. Naravula (Ph.D.), R. Noronha (Ph.D.), X. Ouyang (Ph.D.), S. Pai (M.S.), S. Potluri (Ph.D.), R. Rajachandrasekar (Ph.D.), G. Santhanaraman (Ph.D.), A. Singh (Ph.D.), J. Sridhar (M.S.), S. Sur (Ph.D.), H. Subramoni (Ph.D.), K. Vaidyanathan (Ph.D.), A. Vishnu (Ph.D.), J. Wu (Ph.D.), W. Yu (Ph.D.), J. Zhang (Ph.D.)
- Past Research Scientists: K. Hamidouche, S. Sur
- Past Post-Docs: D. Banerjee, X. Besseron, H.-W. Jin, J. Lin, M. Luo, E. Mancini, S. Marcarelli, J. Vienne, H. Wang
- Past Programmers: D. Bureddy, J. Perkins
- Past Research Specialist: M. Arnold
Multiple Positions Available in My Group
- Looking for bright and enthusiastic personnel to join as:
– PhD Students
– Post-Doctoral Researchers
– MPI Programmer/Software Engineer
– Hadoop/Big Data Programmer/Software Engineer
– Deep Learning and Cloud Programmer/Software Engineer
- If interested, please send an e-mail to panda@cse.ohio-state.edu
Thank You!
panda@cse.ohio-state.edu, subramon@cse.ohio-state.edu
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The High-Performance MPI/PGAS Project: http://mvapich.cse.ohio-state.edu/
The High-Performance Deep Learning Project: http://hidl.cse.ohio-state.edu/