High Performance Broadcast with GPUDirect RDMA and InfiniBand Hardware Multicast for Streaming Applications
GTC 2015

Presented By
Dhabaleswar K. (DK) Panda
The Ohio State University
Email: panda@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~panda
Outline
- Introduction
- Motivation and Problem Statement
- Design Considerations
- Proposed Approach
- Results
- Conclusion and Future Work
Streaming Applications
- Examples: surveillance, habitat monitoring, etc.
- Require efficient transport of data from/to distributed sources/sinks
- Sensitive to latency and throughput metrics
- Require HPC resources to efficiently carry out compute-intensive tasks
HPC Landscape
- Proliferation of multi-petaflop systems
- Heterogeneity in compute resources with GPGPUs
- High-performance interconnects with RDMA capabilities to host and GPU memories
- Streaming applications leverage such resources
Nature of Streaming Applications
- Pipelined data-parallel compute phases that form the crux of streaming applications lend themselves to GPGPUs
- Data distribution to GPGPU sites occurs over PCIe within the node and over InfiniBand interconnects across nodes
- Broadcast operation is a key determinant of the throughput of streaming applications
- Reduced latency for each operation
- Support for multiple back-to-back operations
- More critical with accelerators

Courtesy: Agarwalla, Bikash, et al., "Streamline: A scheduling heuristic for streaming applications on the grid," Electronic Imaging 2006.
Shortcomings of Existing GPU Broadcast
- Traditional short-message broadcast operation between GPU buffers involves a Host-Staged Multicast (HSM)
- Data is copied from GPU buffers to host memory
- Multicast uses InfiniBand Unreliable Datagram (UD)-based hardware multicast
- Sub-optimal use of the near-scale-invariant UD multicast performance
- PCIe resources are wasted and the benefits of multicast are nullified
- GPUDirect RDMA capabilities remain unused
Problem Statement
- Can we design a GPU broadcast mechanism that completely avoids host staging for streaming applications?
- Can we harness the capabilities of GPUDirect RDMA (GDR)?
- Can we overcome the limitations of UD transport and realize the true potential of multicast for GPU buffers?
- Succinctly, how do we multicast GPU data using GDR efficiently?
Outline
- Introduction
- Motivation and Problem Statement
- Design Considerations
  - Critical Factors
- Proposed Approach
- Results
- Conclusion and Future Work
Factors to Consider for an Efficient GPU Multicast
- Goal is to multicast GPU data in less time than the host-staged multicast (~20us)
- Cost of cudaMemcpy is ~8us for short messages, for host->GPU, GPU->host, and GPU->GPU transfers (see the timing sketch below)
- cudaMemcpy costs and memory registration costs determine the viability of a multicast protocol for GPU buffers
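To make the cudaMemcpy figure concrete, a minimal timing sketch along the following lines (an illustration, not code from the talk; message size and iteration count are arbitrary) reproduces the short-message copy cost:

```c
/* Illustrative micro-benchmark (not from the talk): times short
 * device-to-host cudaMemcpy calls to show the ~8us per-copy cost. */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    const size_t msg   = 8;      /* short message, 8 bytes (arbitrary) */
    const int    iters = 1000;
    void *dbuf, *hbuf;

    cudaMalloc(&dbuf, msg);
    cudaMallocHost(&hbuf, msg);  /* pinned host memory */

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int i = 0; i < iters; i++)
        cudaMemcpy(hbuf, dbuf, msg, cudaMemcpyDeviceToHost);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("avg D->H cudaMemcpy latency: %.2f us\n", ms * 1000.0f / iters);

    cudaFree(dbuf);
    cudaFreeHost(hbuf);
    return 0;
}
```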
Outline
- Introduction
- Motivation and Problem Statement
- Design Considerations
  - Eager Protocol
  - Rendezvous Protocol
- Proposed Approach
- Results
- Conclusion and Future Work
Eager Protocol for GPU Multicast
- Copy user GPU data to host eager buffers
- Perform the multicast and copy back to the GPU at the destinations (sketched below)
- cudaMemcpy dictates performance
- Similar variation observed with eager buffers placed on the GPU
- Header encoding is expensive
(Figure: data path across GPU, Host (user and eager buffers), HCA, and the network; steps labeled CUDAMEMCPY and MCAST)
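The eager/host-staged path can be mimicked at the application level with a hedged sketch like the following (function and buffer names are illustrative; inside the MPI library the broadcast step is an IB UD hardware multicast rather than an MPI_Bcast call). It shows why two cudaMemcpy calls sit on the critical path:

```c
/* Application-level analogue of the host-staged eager path (not the
 * MVAPICH2 internals): GPU data is staged through a persistent pinned
 * host eager buffer, broadcast, then copied back to device memory. */
#include <mpi.h>
#include <cuda_runtime.h>

/* gpu_buf: user data on the GPU; eager_buf: pre-allocated pinned host
 * buffer; root: broadcast source rank. */
void eager_gpu_bcast(void *gpu_buf, void *eager_buf, size_t len,
                     int root, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    if (rank == root)              /* copy 1: stage GPU data on the host */
        cudaMemcpy(eager_buf, gpu_buf, len, cudaMemcpyDeviceToHost);

    /* inside the library this step is the UD hardware multicast */
    MPI_Bcast(eager_buf, (int)len, MPI_BYTE, root, comm);

    if (rank != root)              /* copy 2: move data into GPU memory */
        cudaMemcpy(gpu_buf, eager_buf, len, cudaMemcpyHostToDevice);
}
```

The two copies correspond to the ~8us cudaMemcpy costs noted earlier, which is what keeps the host-staged latency near ~20us.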
Rendezvous Protocol for GPU Multicast
- Register user GPU data and start an RTS multicast with control info (registration sketched below)
- Confirm ready receivers (≡ a 0-byte gather)
- Perform the data multicast
- Registration cost and gather limitations
- Handshake for each operation is not required for streaming applications, which are error tolerant
(Figure: data path across GPU, Host (user buffer), HCA, and the network; registration followed by GATHER, INFO MCAST, and DATA MCAST steps)
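The key enabler for avoiding host staging is that, with the GPUDirect RDMA kernel module loaded, a device buffer can be registered directly with the HCA. A minimal sketch of that registration step follows (illustrative; it assumes an already-created ibverbs protection domain, and the per-operation RTS/gather/data steps are only outlined in comments):

```c
/* Sketch of GPUDirect RDMA registration for a rendezvous-style path
 * (illustrative; assumes the nv_peer_mem module is loaded and an
 * existing ibverbs protection domain 'pd'). */
#include <infiniband/verbs.h>
#include <cuda_runtime.h>

struct ibv_mr *register_gpu_buffer(struct ibv_pd *pd, size_t len,
                                   void **gpu_buf_out)
{
    void *gpu_buf = NULL;
    cudaMalloc(&gpu_buf, len);

    /* Registration is the per-buffer cost discussed earlier; in a
     * rendezvous protocol it sits on the critical path unless cached.
     * Per-operation flow (not shown):
     *   1. root multicasts an RTS carrying control info
     *   2. receivers confirm readiness (a 0-byte gather to the root)
     *   3. root multicasts the data from the registered GPU buffer   */
    struct ibv_mr *mr = ibv_reg_mr(pd, gpu_buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    *gpu_buf_out = gpu_buf;
    return mr;
}
```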
Orchestration of GDR-SGL-MCAST (GSM)
- One-time registration of the window of persistent buffers in streaming apps
- Combine control and user data at the source and scatter them at the destinations using the Scatter-Gather-List (SGL) abstraction (see the sketch below)
- Scheme lends itself to the pipelined phases abundant in streaming applications and avoids stressing PCIe
(Figure: GPU and Host (control buffer) feed the HCA, which gathers both into one MCAST packet over the network and scatters them at the destinations)
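A hedged sketch of the gather step is shown below: one UD multicast send work request whose scatter-gather list names a host-resident control header and a GDR-registered GPU payload, so the HCA assembles the packet without any host staging. It assumes a UD QP already attached to the multicast group, an address handle for the group, and previously registered memory regions (all names are illustrative):

```c
/* Illustrative gather step of GDR-SGL-MCAST: combine a host control
 * header and a GPU user buffer in one scatter-gather list on a UD
 * multicast send (setup of the QP, AH, and MRs is assumed). */
#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

int post_sgl_mcast(struct ibv_qp *qp, struct ibv_ah *mcast_ah,
                   uint32_t mcast_qkey,
                   struct ibv_mr *ctrl_mr, size_t ctrl_len,
                   struct ibv_mr *gpu_mr,  size_t data_len)
{
    struct ibv_sge sge[2];
    struct ibv_send_wr wr, *bad_wr = NULL;

    /* entry 0: control header in host memory */
    sge[0].addr   = (uintptr_t)ctrl_mr->addr;
    sge[0].length = (uint32_t)ctrl_len;
    sge[0].lkey   = ctrl_mr->lkey;

    /* entry 1: user payload in GPU memory (GPUDirect RDMA registered) */
    sge[1].addr   = (uintptr_t)gpu_mr->addr;
    sge[1].length = (uint32_t)data_len;
    sge[1].lkey   = gpu_mr->lkey;

    memset(&wr, 0, sizeof(wr));
    wr.sg_list           = sge;
    wr.num_sge           = 2;
    wr.opcode            = IBV_WR_SEND;
    wr.send_flags        = IBV_SEND_SIGNALED;
    wr.wr.ud.ah          = mcast_ah;
    wr.wr.ud.remote_qpn  = 0xFFFFFF;      /* well-known multicast QPN */
    wr.wr.ud.remote_qkey = mcast_qkey;

    return ibv_post_send(qp, &wr, &bad_wr);
}
```

On the receive side the same abstraction scatters the incoming packet: the posted receive work request can carry one host SGE for the 40-byte GRH plus control header and one GPU SGE for the payload, landing the user data directly in device memory.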
Experiment Setup
- Experiments were run on Wilkes @ University of Cambridge
- 12-core Ivy Bridge Intel(R) Xeon(R) E5-2630 @ 2.60 GHz with 64 GB RAM
- FDR ConnectX2 HCAs
- NVIDIA K20c GPUs
- Mellanox OFED version MLNX_OFED_LINUX-2.1-1.0.6, which provides the required GPUDirect RDMA (GDR) support
- Baseline Host-Staged MCAST uses MVAPICH2-GDR (http://mvapich.cse.ohio-state.edu/downloads)
- GDR-SGL-MCAST is based on MVAPICH2-GDR
Host-Staged MCAST and GDR-SGL MCAST Latency (<= 8 nodes)
- GSM latency ≤ ~10us vs. HSM latency ≤ ~23us
- Small latency increase with scale
(Figure: latency of GDR-SGL-MCAST (GSM) vs. Host-Staged-MCAST (HSM))

A. Venkatesh, H. Subramoni, K. Hamidouche, and D. K. Panda, "A High Performance Broadcast Design with Hardware Multicast and GPUDirect RDMA for Streaming Applications on InfiniBand Clusters," IEEE International Conference on High Performance Computing (HiPC '14), Dec 2014.
Host-Staged MCAST and GDR-SGL MCAST Latency (<= 64 nodes)
- Both GSM and HSM continue to show near-scale-invariant latency, with a 60% improvement for GSM (8 bytes)
Host-Staged MCAST and GDR-SGL MCAST: Streaming Benchmark
- Based on a synthetic benchmark that mimics broadcast patterns in streaming applications (a rough analogue is sketched below)
- Long window of persistent m-byte buffers with 1,000 back-to-back multicast operations issued
- Execution time reduces by 3x-4x
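A rough analogue of this benchmark (not the authors' code; window size and message size below are placeholders) can be written against any CUDA-aware MPI such as MVAPICH2-GDR, where device pointers are passed directly to MPI_Bcast:

```c
/* Rough analogue of the streaming benchmark described above: a long
 * window of persistent m-byte GPU buffers broadcast back-to-back.
 * Requires a CUDA-aware MPI (e.g. MVAPICH2-GDR). */
#include <stdio.h>
#include <mpi.h>
#include <cuda_runtime.h>

#define WINDOW 64      /* persistent buffers in the window (placeholder) */
#define ITERS  1000    /* back-to-back multicast operations */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    size_t m = 1024;                  /* message size in bytes (placeholder) */
    void  *win[WINDOW];
    for (int i = 0; i < WINDOW; i++)  /* allocated (and registered) once, reused */
        cudaMalloc(&win[i], m);

    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++)
        MPI_Bcast(win[i % WINDOW], (int)m, MPI_BYTE, 0, MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        printf("avg bcast time: %.2f us\n", (t1 - t0) * 1e6 / ITERS);

    for (int i = 0; i < WINDOW; i++)
        cudaFree(win[i]);
    MPI_Finalize();
    return 0;
}
```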
Conclusion and Future Work
- Designed an efficient GPU data broadcast for streaming applications which uses the near-constant-latency hardware multicast feature and GPUDirect RDMA
- Proposed a new methodology which overcomes the performance challenges posed by the UD transport
- Benefits shown with latency benchmarks and a throughput benchmark that mimics streaming-application communication
- Future work: exploration of NVIDIA's Fastcopy module for MPI_Bcast
One More Talk
Learn about recent advances and upcoming features in the CUDA-aware MVAPICH2-GPU library:
- S5461 - Latest Advances in MVAPICH2 MPI Library for NVIDIA GPU Clusters with InfiniBand
- Thursday, 03/19 (Today)
- Time: 17:00–17:50
- Room 212 B