High-Performance Broadcast Designs for Streaming Applications on Multi-GPU InfiniBand Clusters
GPU Technology Conference (GTC 2017)
Dhabaleswar K. (DK) Panda, The Ohio State University
E-mail: panda@cse.ohio-state.edu
- Examples: surveillance, habitat monitoring, etc.
- Require efficient transport of data from/to distributed sources/sinks
- Sensitive to latency and throughput metrics
- Require HPC resources to efficiently carry out compute-intensive tasks
Streaming Applications
- Pipelined data-parallel compute phases
  - Form the crux of streaming applications and lend themselves to GPGPUs
- Data distribution to GPGPU sites
  - Over PCIe within the node
  - Over InfiniBand interconnects across nodes
- Back-to-back broadcast operations
  – Key dictator of the throughput of streaming applications
Nature of Streaming Applications
[Figure: a data source streams in real time to a data distributor on HPC resources, which performs data streaming-like broadcast operations to worker nodes, each with a CPU and multiple GPUs.]
Drivers of Modern HPC Cluster Architectures
- Multi-core/many-core technologies
- Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
- Solid State Drives (SSDs), Non-Volatile Random-Access Memory (NVRAM), NVMe-SSD
- Accelerators (NVIDIA GPGPUs and Intel Xeon Phi)
- Available on HPC Clouds, e.g., Amazon EC2, NSF Chameleon, Microsoft Azure, etc.
[Figure: building blocks of modern HPC clusters — multi-core processors; high-performance interconnects (InfiniBand: <1 µsec latency, 100 Gbps bandwidth); accelerators/coprocessors (high compute density, high performance/watt, >1 TFlop DP on a chip); SSD, NVMe-SSD, NVRAM. Pictured systems: Tianhe-2, Titan, K Computer, Sunway TaihuLight.]
- 187 IB Clusters (37%) in the Nov’16 Top500 list
– (http://www.top500.org)
- Installations in the Top 50 (15 systems):
Large-scale InfiniBand Installations
- 241,108 cores (Pleiades) at NASA/Ames (13th)
- 220,800 cores (Pangea) in France (16th)
- 462,462 cores (Stampede) at TACC (17th)
- 144,900 cores (Cheyenne) at NCAR/USA (20th)
- 72,800 cores Cray CS-Storm in US (25th)
- 72,800 cores Cray CS-Storm in US (26th)
- 124,200 cores (Topaz) SGI ICE at ERDC DSRC in US (27th)
- 60,512 cores (DGX SATURNV) at NVIDIA/USA (28th)
- 72,000 cores (HPC2) in Italy (29th)
- 152,692 cores (Thunder) at AFRL/USA (32nd)
- 147,456 cores (SuperMUC) in Germany (36th)
- 86,016 cores (SuperMUC Phase 2) in Germany (37th)
- 74,520 cores (Tsubame 2.5) at Japan/GSIC (40th)
- 194,616 cores (Cascade) at PNNL (44th)
- 76,032 cores (Makman-2) at Saudi Aramco (49th)
- 72,000 cores (Prolix) at Meteo France, France (50th)
- 73,440 cores (Beaufix2) at Meteo France, France (51st)
- 42,688 cores (Lomonosov-2) at Russia/MSU (52nd)
- 60,240 cores SGI ICE X at JAEA Japan (54th)
- and many more!
- Introduced in Oct 2000
- High Performance Point-to-point Data Transfer
– Interprocessor communication and I/O
– Low latency (<1.0 microsec), high bandwidth (up to 25 GigaBytes/sec -> 200 Gbps), and low CPU utilization (5-10%)
- Multiple Features
– Offloaded Send/Recv, RDMA Read/Write, Atomic Operations
– Hardware Multicast support through Unreliable Datagram (UD)
- A message sent from a single source can reach all destinations in a single pass over the network through switch-based replication
- Restricted to one MTU
- Large messages need to be sent in a chunked manner (see the sketch below)
- Reliability needs to be addressed
- Leading to big changes in designing HPC clusters, file systems, cloud computing systems, grid computing systems, etc.
InfiniBand Networking Technology
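Since a multicast packet is limited to one MTU, a large broadcast payload has to be split before it is handed to the UD multicast group. A minimal sketch of that chunking loop is shown below; send_mcast_packet() is a hypothetical placeholder for posting one UD send to the multicast group (not the actual MVAPICH2 internals), and reliability is left to an upper layer.

```c
#include <stddef.h>

/* Hypothetical helper: posts one UD send (at most mtu bytes) to the
 * multicast group; switch-based replication delivers it to all receivers. */
extern void send_mcast_packet(const void *chunk, size_t len);

/* Split a large broadcast payload into MTU-sized multicast packets.
 * UD multicast is unreliable, so lost packets must be detected and
 * retransmitted by an upper layer (e.g., NACK- or RMA-based recovery). */
void mcast_send_chunked(const char *buf, size_t nbytes, size_t mtu)
{
    for (size_t off = 0; off < nbytes; off += mtu) {
        size_t len = (nbytes - off < mtu) ? (nbytes - off) : mtu;
        send_mcast_packet(buf + off, len);
    }
}
```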
InfiniBand Hardware Multicast Example
Multicast-aware CPU-Based MPI_Bcast on Stampede using MVAPICH2 (6K nodes with 102K cores)
[Charts: MPI_Bcast latency (µs) of Default vs. Multicast-based designs — small and large messages at 102,400 cores (latency vs. message size), and 16-byte and 32 KByte messages (latency vs. number of nodes). System: ConnectX-3 FDR (54 Gbps), 2.7 GHz dual octa-core (Sandy Bridge) Intel, PCIe Gen3, Mellanox IB FDR switch.]
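The measurement above is a plain MPI_Bcast on host buffers; the multicast-aware path is chosen inside MVAPICH2 and needs no source changes. A minimal latency loop in the spirit of such a benchmark (message size and iteration count are illustrative) might look like this:

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    int rank, iters = 1000;
    size_t size = 512;                          /* message size in bytes (illustrative) */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char *buf = malloc(size);
    memset(buf, rank, size);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++)
        MPI_Bcast(buf, (int)size, MPI_CHAR, 0, MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("average MPI_Bcast latency: %.2f us\n", (t1 - t0) * 1e6 / iters);

    free(buf);
    MPI_Finalize();
    return 0;
}
```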
- Before CUDA 4: Additional copies
- Low performance and low productivity
- After CUDA 4: Host-based pipeline
- Unified Virtual Address
- Pipeline CUDA copies with IB transfers
- High performance and high productivity
- After CUDA 5.5: GPUDirect RDMA support
- GPU to GPU direct transfer
- Bypass the host memory
- Hybrid design to avoid PCI bottlenecks
GPUDirect RDMA (GDR) and CUDA-Aware MPI
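To illustrate the difference CUDA-awareness makes, the sketch below contrasts passing a device pointer directly to MPI (as a CUDA-aware library such as MVAPICH2-GDR allows, with pipelining/GDR handled internally) against explicit host staging; the function name and the cuda_aware flag are illustrative, not library API.

```c
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdlib.h>

/* Broadcast a GPU-resident buffer of 'size' bytes from 'root'. */
void bcast_gpu_buffer(void *d_buf, size_t size, int root, MPI_Comm comm, int cuda_aware)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    if (cuda_aware) {
        /* CUDA-aware MPI (e.g., MVAPICH2-GDR): hand the device pointer to MPI;
         * the library internally pipelines host-staged copies or uses GPUDirect RDMA. */
        MPI_Bcast(d_buf, (int)size, MPI_CHAR, root, comm);
    } else {
        /* Without CUDA-aware MPI, the application stages through host memory itself. */
        void *h_buf = malloc(size);
        if (rank == root)
            cudaMemcpy(h_buf, d_buf, size, cudaMemcpyDeviceToHost);
        MPI_Bcast(h_buf, (int)size, MPI_CHAR, root, comm);
        if (rank != root)
            cudaMemcpy(d_buf, h_buf, size, cudaMemcpyHostToDevice);
        free(h_buf);
    }
}
```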
Performance of MVAPICH2-GDR with GPUDirect RDMA (GDR)
[Charts: GPU-GPU inter-node latency, bandwidth, and bi-bandwidth vs. message size for MV2-GDR 2.2, MV2-GDR 2.0b, and MV2 without GDR; small-message latency down to 2.18 µs, with roughly 2x-11x improvements over MV2 without GDR. Testbed: MVAPICH2-GDR 2.2, Intel Ivy Bridge (E5-2680 v2) 20-core node, NVIDIA Tesla K40c GPU, Mellanox ConnectX-4 EDR HCA, CUDA 8.0, Mellanox OFED 3.0 with GPU-Direct-RDMA.]
More details in the 2:00pm session today:
S7356 - MVAPICH2-GDR: Pushing the Frontier of HPC and Deep Learning
- Host-Staged Multicast (HSM): traditional short-message broadcast operation between GPUs
  - Data copied from GPU to host memory
  - Uses InfiniBand UD-based hardware multicast
- Sub-optimal use of the near-scale-invariant UD-multicast performance
- PCIe resources are wasted and the benefits of multicast are nullified
- GPUDirect RDMA capabilities unused
Multicasting Data from one GPU to other GPUs: Shortcomings
- Can we design a GPU broadcast mechanism that delivers low latency and high throughput for streaming applications?
- Can we combine GDR and MCAST features to
  - Achieve the best performance
  - Free up the Host-Device PCIe bandwidth for application needs
- Can such a design be extended to support heterogeneous configurations?
- Host-to-Device (H2D): Most common in streaming applications
- E.g., Camera connected to host and devices used for computation
- Device-to-device (D2D)
- Device-to-Host (D2H)
- Can we design an efficient MCAST-based broadcast for multi-GPU systems?
- Can we design efficient reliability support on top of the UD-based MCAST broadcast?
- How much performance benefit can be achieved with the new designs?
Problem Statement
- Copy user GPU data to host buffers
- Perform Multicast
- Copy data back to user GPU buffer
[Diagram: user GPU data is copied via cudaMemcpy into a host vbuf, multicast by the HCA over the network, and copied back into the user GPU buffer at the receivers.]
- Drawbacks:
  - cudaMemcpy dictates performance
  - Requires PCIe Host-Device resources
Existing Protocol for GPU Multicast
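In code, the host-staged path looks roughly as follows. ud_multicast_send()/ud_multicast_recv() are hypothetical placeholders for the UD hardware-multicast step over a pre-registered host vbuf; the blocking cudaMemcpy calls are exactly what dictates performance and consumes PCIe Host-Device resources.

```c
#include <cuda_runtime.h>
#include <stddef.h>

/* Hypothetical placeholders for the UD hardware-multicast step over a
 * pre-registered host staging buffer (vbuf). */
extern void ud_multicast_send(const void *vbuf, size_t len);
extern void ud_multicast_recv(void *vbuf, size_t len);

/* Root: user GPU buffer -> host vbuf -> multicast. */
void hsm_bcast_root(const void *d_user, void *vbuf, size_t len)
{
    cudaMemcpy(vbuf, d_user, len, cudaMemcpyDeviceToHost);  /* blocking; uses PCIe D-H */
    ud_multicast_send(vbuf, len);
}

/* Receiver: multicast -> host vbuf -> user GPU buffer. */
void hsm_bcast_leaf(void *d_user, void *vbuf, size_t len)
{
    ud_multicast_recv(vbuf, len);
    cudaMemcpy(d_user, vbuf, len, cudaMemcpyHostToDevice);  /* blocking; uses PCIe H-D */
}
```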
[Diagram: the same staging path through a host vbuf, but the GPU-Host copies are performed with GDRCOPY instead of cudaMemcpy before and after the multicast.]
- Drawbacks:
  - The D-H operation limits performance
  - Can we avoid GDRCOPY for D-H copies?
Enhanced Solution #1: GDRCOPY-based design
- Copy user GPU data to host buffers
- Using GDRCOPY module*
- Perform Multicast
- Copy data back to user GPU buffer
- Using GDRCOPY module
*https://github.com/NVIDIA/gdrcopy
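A rough sketch of an H-D copy through the GDRCOPY module is shown below, with API names as in recent gdrcopy releases; in a real design the pin/map would be done once at buffer registration time and reused, and the GPU address must satisfy gdrcopy's page-alignment requirements.

```c
#include <cuda.h>
#include <gdrapi.h>      /* https://github.com/NVIDIA/gdrcopy */
#include <stddef.h>

/* CPU-driven host-to-device copy of a small message via a gdrcopy BAR mapping
 * (avoids launching a DMA, which keeps small-message latency low). */
int gdrcopy_h2d(CUdeviceptr d_buf, const void *h_src, size_t len)
{
    gdr_t g = gdr_open();
    if (!g) return -1;

    gdr_mh_t mh;
    void *map = NULL;
    if (gdr_pin_buffer(g, d_buf, len, 0, 0, &mh) != 0) { gdr_close(g); return -1; }
    if (gdr_map(g, mh, &map, len) != 0) { gdr_unpin_buffer(g, mh); gdr_close(g); return -1; }

    gdr_copy_to_mapping(mh, map, h_src, len);   /* CPU stores into the mapped GPU memory */

    gdr_unmap(g, mh, map, len);
    gdr_unpin_buffer(g, mh);
    gdr_close(g);
    return 0;
}
```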
- Copy user GPU data to host buffers
- Using loopback scheme
- Perform Multicast
- Copy back the data to GPU
- Using GDRCOPY scheme
[Diagram: the D-H copy into the host vbuf is performed via an IB loopback operation, the multicast is issued from the host, and the H-D copy at the receiver uses GDRCOPY.]
- Good performance for both H-D and D-H copies
- Good performance expected only for small messages
- Still uses the PCIe H-D resources
Enhanced Solution #2: (GDRCOPY + Loopback)-based design
- How to design an efficient and reliable broadcast operation from host to device for streaming applications on multi-GPU node systems?
- Challenges
  - How to handle the heterogeneity of the configuration, including H2D broadcast?
  - Can we have topology-aware broadcast designs on multi-GPU nodes?
  - Can we enhance the reliability support for streaming applications?
  - Can we mimic such behavior at the benchmark level?
    - Mimic the need for PCIe H-D resources at the application level
    - Demonstrate the benefits of such designs on such application patterns
Can we do Better?
- Handling Efficient Broadcast on Multi-GPU Node Systems
  - C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, "Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters," SBAC-PAD '16, Oct 2016.
- Providing Reliability Support
  - C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, "Efficient Reliability Support for Hardware Multicast-based Broadcast in GPU-enabled Streaming Applications," COMHPC 2016 (SC Workshop), Nov 2016.
- Optimizing Broadcast for Multi-source Streaming
  - C.-H. Chu, X. Lu, A. Awan, H. Subramoni, J. Hashmi, B. Elton, and D. K. Panda, "Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning," accepted for presentation at the Int'l Conference on Parallel Processing (ICPP '17), Aug 2017.
Three Major Solutions
- Combining MCAST+GDR hardware features for heterogeneous configurations:
  – Source on the Host and destination on the Device
  – SL design: Scatter at the destination
    - Source: Data and Control on the Host
    - Destinations: Data on the Device and Control on the Host
  – Combines IB MCAST and GDR features at the receivers
  – CUDA IPC-based topology-aware intra-node broadcast
  – Minimizes use of PCIe resources
  – Maximizes availability of PCIe Host-Device resources
SL-based Design for Heterogeneous Configuration (H2D)
[Diagram: the source node multicasts data and control from host memory through the IB switch; at each destination node, the IB SL step delivers the data directly to the GPU (via GDR) and the control information to the host.]
C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, "Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters," SBAC-PAD '16, Oct 2016.
- Socket-based leaders (1 HCA per socket)
- Control synchronization through host shared memory
  - Polling on a shared flag
  - Reading the buffer addresses
- IPC read of the GPU data (see the sketch below)
  - Direct (RMA semantics) IPC read
  - IPC reads with other access patterns in the future
    – K-nomial tree, ring structure
Intra-node Topology-aware (Hybrid SL+IPC) Design for Multi-GPU Node Configuration
[Diagram: the source node reaches the leader GPU on each node through the IB switch via SL MCAST; within a node, the payload is propagated from the leader to GPUs 0..N via IPC/cudaMemcpy (D2D).]
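The intra-node IPC read can be sketched with the CUDA IPC API as follows; how the handle and ready flag travel through host shared memory is omitted, and the function names here are illustrative rather than MVAPICH2 internals.

```c
#include <cuda_runtime.h>
#include <stddef.h>

/* Leader process: export an IPC handle for the GPU buffer holding the multicast
 * payload, to be published (with a "data ready" flag) in host shared memory. */
cudaIpcMemHandle_t export_leader_buffer(void *d_leader_buf)
{
    cudaIpcMemHandle_t handle;
    cudaIpcGetMemHandle(&handle, d_leader_buf);
    return handle;
}

/* Non-leader process on the same node: open the leader's handle and perform a
 * direct (RMA-semantics) device-to-device IPC read into its own GPU buffer. */
void ipc_read_from_leader(void *d_my_buf, cudaIpcMemHandle_t handle, size_t len)
{
    void *d_leader_buf = NULL;
    cudaIpcOpenMemHandle(&d_leader_buf, handle, cudaIpcMemLazyEnablePeerAccess);
    cudaMemcpy(d_my_buf, d_leader_buf, len, cudaMemcpyDeviceToDevice);
    cudaIpcCloseMemHandle(d_leader_buf);
}
```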
- Redesigned broadcast benchmark with the root buffer on the Host and non-root buffers on the Device
- Inter-node experiments @ Wilkes cluster, 32 GPUs, 1 GPU/node
SL-based Design for H-D Heterogeneous Support
[Charts: broadcast latency (µs) vs. message size (small: 1B-16K, large: 32K-4M) for SL-MCAST, SGL-MCAST, and Host-MCAST; lower is better; callouts: 56% and 39% latency reduction.]
- Evaluate H-D heterogeneous support
- Mixed inter-node and intra-node experiments @ CSCS cluster, 88 GPUs, 8 NVIDIA K80 GPUs per node
Evaluation of the Topology-aware (SL+IPC) Design
[Charts: broadcast latency (µs) vs. message size (small and large) for IPC SL-MCAST and SHMEM SL-MCAST; lower is better; callouts: 58% and 79% latency reduction.]
- Inter-node experiments @ Wilkes cluster, 32 GPUs, 1 GPU/node
– 1K byte messages
Scalability Evaluation of the Proposed Design
[Chart: broadcast latency (µs) vs. system size (2-32 GPU nodes) for SL-MCAST, SGL-MCAST, and Host-MCAST with 1K-byte messages.]
- Maintains good scalability while yielding up to 64% latency reduction
- Mimic the behavior of streaming applications @ CSCS cluster, 88 GPUs, 8 NVIDIA K80 GPUs per node
  – Broadcast operations overlapped with application-level Host-Device transfers
Benefits of the Availability of Host-Device PCIe Resources
[Chart: achieved application-level Host-Device throughput (GB/s) vs. broadcast message size (1B-4M) for IPC SL-MCAST and SHMEM SL-MCAST; higher is better; up to 3.2X improvement.]
C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, "Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters," SBAC-PAD '16, Oct 2016.
Maintain near-peak throughput over all message sizes
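One way to mimic this pattern at the benchmark level is to keep application-level H2D copies in flight on a CUDA stream while broadcasts proceed, and report the copy throughput the broadcast design leaves available. The sketch below is illustrative (not the exact benchmark used); the host buffer is assumed to be pinned and the MPI library CUDA-aware.

```c
#include <mpi.h>
#include <cuda_runtime.h>

/* Overlap broadcasts of d_bcast with application-level H2D copies on a separate
 * stream and return the achieved copy throughput in GB/s. */
double overlapped_h2d_throughput(void *d_bcast, size_t bcast_len,
                                 void *d_app, const void *h_app /* pinned */,
                                 size_t app_len, int iters, int root, MPI_Comm comm)
{
    cudaStream_t s;
    cudaStreamCreate(&s);

    MPI_Barrier(comm);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        cudaMemcpyAsync(d_app, h_app, app_len, cudaMemcpyHostToDevice, s);
        MPI_Bcast(d_bcast, (int)bcast_len, MPI_CHAR, root, comm);  /* CUDA-aware MPI */
        cudaStreamSynchronize(s);
    }
    double t1 = MPI_Wtime();

    cudaStreamDestroy(s);
    return (double)app_len * iters / (t1 - t0) / 1e9;
}
```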
- Remote Memory Access (RMA)-based design
  – The sender maintains a backup buffer for the MCAST packets
    - The sender is not interrupted
  – The receiver performs an RMA Get operation on the sender's backup buffer to retrieve lost MCAST packets (see the sketch below)
Efficient Reliability Support for MCAST-based broadcast
[Diagram: timeline of the broadcast sender and receiver; when the receiver's MPI layer detects a timeout, it issues an RMA Get through its IB HCA to the sender's backup buffer, without involving the sender's MPI layer.]
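Expressed with MPI one-sided operations, the recovery path looks like the sketch below: the sender exposes its backup buffer in an RMA window, and a receiver that times out pulls the missing packet with MPI_Get under a passive-target lock, so the sender's progress is never interrupted. Function names and the offset bookkeeping are illustrative.

```c
#include <mpi.h>
#include <stddef.h>

/* Sender: expose the backup copy of the multicast packets in an RMA window. */
MPI_Win expose_backup_buffer(void *backup, size_t len, MPI_Comm comm)
{
    MPI_Win win;
    MPI_Win_create(backup, (MPI_Aint)len, 1, MPI_INFO_NULL, comm, &win);
    return win;
}

/* Receiver: on a timeout, fetch the lost packet straight from the sender's
 * backup buffer (passive target, no action required from the sender). */
void recover_lost_packet(void *dst, size_t pkt_len, size_t pkt_offset,
                         int sender_rank, MPI_Win win)
{
    MPI_Win_lock(MPI_LOCK_SHARED, sender_rank, 0, win);
    MPI_Get(dst, (int)pkt_len, MPI_BYTE, sender_rank,
            (MPI_Aint)pkt_offset, (int)pkt_len, MPI_BYTE, win);
    MPI_Win_unlock(sender_rank, win);
}
```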
- Evaluate the RMA-based reliability support for the SL-based MCAST design @ CSCS cluster, 88 GPUs, 8 NVIDIA K80 GPUs per node
  – Negligible overhead
  – The RMA-based design performs better than the NACK-based scheme for large messages
Evaluation: Efficient Reliability Design
[Charts: broadcast latency (µs) vs. message size (small: 1B-8K, large: 16K-4M) for w/o reliability, NACK-based, and RMA-based schemes.]
Latency reduction compared to the existing NACK-based scheme
Benefits of the RMA-based Reliability Design
[Charts: latency normalized to the NACK-based scheme vs. message size (small and large) at packet error rates of 0.01%, 0.1%, and 1%.]
Normalized to SL-based MCAST with NACK-based retransmission scheme
Latency reduction vs. the NACK-based scheme, by message size and error rate:
- 0.01% error rate: 16% (8 KB), 31% (128 KB), 11% (2 MB)
- 0.1% error rate: 21% (8 KB), 36% (128 KB), 19% (2 MB)
- 1% error rate: 24% (8 KB), 21% (128 KB), 10% (2 MB)
C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, "Efficient Reliability Support for Hardware Multicast-based Broadcast in GPU-enabled Streaming Applications," COMHPC 2016 (SC Workshop), Nov 2016.
- Optimizing MCAST+GDR broadcast:
  – Source and destination buffers are on the GPU device
    - Typically very large messages (>1 MB)
  – Pipelining data from Device to Host (see the sketch below)
    - Avoid the GDR read limit
    - Leverage the high-performance SL design
  – Combines IB MCAST and GDR features
  – Minimizes use of PCIe resources on the receiver side
  – Maximizes availability of PCIe Host-Device resources
Optimized Design for Multi-Source Streaming
[Diagram: the source node pipelines data from GPU to host; the IB HCA gathers header and data, multicasts them through the IB switch, and at each destination node scatters the header to the host and writes the data directly to the GPU via GDR.]
Steps: 1. Pipelined data movement; 2. IB gather; 3. IB hardware multicast; 4. IB scatter + GDR write.
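The pipelining on the source node can be sketched with double buffering: the asynchronous D2H copy of chunk i+1 overlaps with the multicast of chunk i. mcast_chunk() is a hypothetical stand-in for the SL/GDR multicast of one staged chunk, and the staging buffer is assumed to be pinned host memory holding two chunks.

```c
#include <cuda_runtime.h>
#include <stddef.h>

/* Hypothetical helper: multicast one host-staged chunk (IB gather + hardware
 * multicast + scatter/GDR write on the receivers). */
extern void mcast_chunk(const void *host_chunk, size_t len);

/* Root side of the streaming broadcast for a large GPU-resident message:
 * stage the payload to the host in chunks and overlap staging with multicast. */
void pipelined_gpu_bcast_root(const char *d_src, char *h_stage /* pinned, 2*chunk bytes */,
                              size_t total, size_t chunk)
{
    cudaStream_t s;
    cudaStreamCreate(&s);
    size_t nchunks = (total + chunk - 1) / chunk;

    /* Stage chunk 0. */
    size_t len0 = (total < chunk) ? total : chunk;
    cudaMemcpyAsync(h_stage, d_src, len0, cudaMemcpyDeviceToHost, s);

    for (size_t i = 0; i < nchunks; i++) {
        size_t off = i * chunk;
        size_t len = (total - off < chunk) ? total - off : chunk;

        cudaStreamSynchronize(s);               /* chunk i is now in host memory */

        if (i + 1 < nchunks) {                  /* prefetch chunk i+1 (overlaps below) */
            size_t noff = (i + 1) * chunk;
            size_t nlen = (total - noff < chunk) ? total - noff : chunk;
            cudaMemcpyAsync(h_stage + ((i + 1) % 2) * chunk, d_src + noff, nlen,
                            cudaMemcpyDeviceToHost, s);
        }
        mcast_chunk(h_stage + (i % 2) * chunk, len);  /* multicast chunk i */
    }
    cudaStreamDestroy(s);
}
```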
- Pipelined MCAST+GDR Design
– Pipelines data from Device to Host on the source node
- Streaming broadcast
– Leverages high-performance SL-based design
- High Scalability
- High overlap between multiple broadcast calls
Analysis of the Optimized Design
- @ OSU RI2 cluster, 16 NVIDIA K80 GPUs, 1 GPU/node
[Chart: broadcast latency (µs) vs. message size (4K-16M) for MV2-GDR-Knomial, MV2-GDR-Ring, MCAST-GDR, and MCAST-GDR-Opt; lower is better.]
Benchmark Evaluation
[Chart: latency (µs) of a 2 MB broadcast vs. number of GPU nodes (2-16) for the same four schemes; MCAST-GDR-Opt stays near-constant and shows up to 65% lower latency.]
- Provides near-constant latency across system sizes
- Reduces latency by up to 65% for large messages
- @ OSU RI2 cluster, 16 NVIDIA K80 GPUs, 1 GPU/node
– Microsoft Cognitive Toolkit (CNTK) with CUDA-Aware MPI*
Application Evaluation: Deep Learning Frameworks
[Charts: average training time per epoch (s) for the AlexNet and VGG models on 8 and 16 GPU nodes, compared against MV2-GDR-Knomial and MV2-GDR-Ring; lower is better; callouts: 15% and 24% (AlexNet), 6% and 15% (VGG).]
- Reduces training time by up to 24% (AlexNet) and 15% (VGG)
  – Average training time of one epoch
- Higher improvement can be observed at larger system sizes
*D. Banerjee, K. Hamidouche, D. Panda, Re-designing CNTK Deep Learning Framework on Modern GPU Enabled Clusters, IEEE CloudCom'16, Dec 2016.
- IB MCAST feature provides high scalability and low latency
- NVIDIA GDR feature provides a direct access between IB and GPUs
- MVAPICH2-GDR provides schemes to efficiently broadcast from/to GPU memories using host-staged techniques
- Presented a set of designs to couple GDR and IB MCAST features for
- Heterogeneous Systems
- Multi-GPU systems
- Single-source and Multi-source Streaming
- New designs will be available in future MVAPICH2-GDR library
Conclusions
Two Additional Talks
- S7356 - MVAPICH2-GDR: Pushing the Frontier of HPC and Deep Learning
– Day: Today, 05/11 – Time: 14:00 - 14:50 – Location: Room 211B
- S7324 - Bringing NVIDIA GPUs to the PGAS/OpenSHMEM World: Challenges and Solutions
– Day: Today, 05/11 – Time: 15:00 - 15:25 – Location: Room 211B