Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters


  1. Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters
     Ching-Hsiang Chu¹, Khaled Hamidouche¹, Hari Subramoni¹, Akshay Venkatesh¹, Bracy Elton², and Dhabaleswar K. (DK) Panda¹
     ¹ Department of Computer Science and Engineering, The Ohio State University
     ² Engility Corporation
     DISTRIBUTION STATEMENT A. Approved for public release; distribution is unlimited. 88ABW-2016-5574

  2. Outline
     • Introduction
     • Proposed Designs
     • Performance Evaluation
     • Conclusion and Future Work

  3. Drivers of Modern HPC Cluster Architectures
     • Multi-core processors are ubiquitous
     • InfiniBand is very popular in HPC clusters: high-performance interconnects with <1 µs latency and >100 Gbps bandwidth
     • Accelerators/coprocessors (high compute density, high performance/watt, >1 Tflop/s DP on a chip) are becoming common in high-end systems
     ➠ Pushing the envelope towards Exascale computing
     [Pictured systems: Tianhe-2, Stampede, Titan, Tianhe-1A]

  4. IB and GPU in HPC Systems
     • Growth of IB and GPU clusters in the last 3 years
       – IB is the major commodity network adapter used
       – NVIDIA GPUs accelerate 18% of the top 50 systems on the "Top 500" list as of June 2016
     [Chart: system share (%) of InfiniBand and GPU clusters in the Top 500, June 2013 to June 2016; data from the Top500 list (http://www.top500.org)]

  5. Motivation
     • Streaming applications on HPC systems
       1. Communication (MPI): data streaming-like broadcast operations from a sender to the workers
       2. Computation (CUDA): multiple GPU nodes as workers
     • Examples
       – Deep learning frameworks
       – Proton computed tomography (pCT)
     [Diagram: a real-time streaming data source feeds a sender, which broadcasts to worker nodes (each with a CPU and multiple GPUs) serving as HPC resources for real-time analytics]

  6. Motivation
     • Streaming applications on HPC systems
       1. Communication: heterogeneous broadcast-type operations
          – Data usually come from a live source and are stored in host memory
          – Data need to be sent to remote GPU memories for computing
     • This requires moving data from host memory to remote GPU memories, i.e., a host-to-device (H-D) heterogeneous broadcast ⇒ performance bottleneck

  7. Motivation
     • Requirements for streaming applications on HPC systems
       – Low latency, high throughput, and scalability
       – Freeing up Peripheral Component Interconnect Express (PCIe) bandwidth for application needs
     [Diagram: data streaming-like broadcast operations to worker nodes, each with a CPU and multiple GPUs]

  8. Motivation – Technologies We Have
     • NVIDIA GPUDirect [1]
       – Uses remote direct memory access (RDMA) transfers between GPUs and other PCIe devices ⇒ GDR
       – Peer-to-peer transfers between GPUs
       – and more…
     • InfiniBand (IB) hardware multicast (IB MCAST) [2]
       – Enables efficient designs of homogeneous broadcast operations
         • Host-to-Host [3]
         • GPU-to-GPU [4]
     References:
     [1] https://developer.nvidia.com/gpudirect
     [2] G. F. Pfister, "An Introduction to the InfiniBand Architecture," High Performance Mass Storage and Parallel I/O, Chapter 42, pp. 617-632, Jun 2001.
     [3] J. Liu, A. R. Mamidala, and D. K. Panda, "Fast and Scalable MPI-level Broadcast using InfiniBand's Hardware Multicast Support," in IPDPS 2004, p. 10, April 2004.
     [4] A. Venkatesh, H. Subramoni, K. Hamidouche, and D. K. Panda, "A High Performance Broadcast Design with Hardware Multicast and GPUDirect RDMA for Streaming Applications on InfiniBand Clusters," in HiPC 2014, Dec 2014.

  9. Problem Statement
     • Can we design a high-performance heterogeneous broadcast for streaming applications that supports host-to-device broadcast operations?
     • Can we also design an efficient broadcast for multi-GPU systems?
     • Can we combine GPUDirect and IB technologies to
       – avoid extra data movements and achieve better performance?
       – increase the host-device (H-D) PCIe bandwidth available for application use?

  10. Outline
     • Introduction
     • Proposed Designs
       – Heterogeneous Broadcast with GPUDirect RDMA (GDR) and InfiniBand (IB) Hardware Multicast
       – Intra-node Topology-Aware Broadcast for Multi-GPU Systems
     • Performance Evaluation
     • Conclusion and Future Work

  11. Proposed Heterogeneous Broadcast
     • Key requirement of IB MCAST: the control header must be stored in host memory
     • SL-based approach: combine the CUDA GDR and IB MCAST features
       – Take advantage of the IB Scatter-Gather List (SGL) feature to multicast two separate addresses (control header on the host, data on the GPU) in a single IB message (see the sketch below)
       – Let IB read/write directly from/to GPU memory via GDR ⇒ low-latency, zero-copy schemes
       – Avoid extra copies between host and GPU ⇒ frees up PCIe bandwidth for application needs
       – Employ the IB MCAST feature ⇒ increases scalability
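     (Illustrative sketch only, not the authors' implementation: a two-entry scatter-gather list that multicasts a host-resident control header together with a GPU-resident payload in one IB UD send. The resources pd, ud_qp, mcast_ah, and mcast_qkey are assumed to have been set up elsewhere, i.e., the multicast group joined and the QP attached with ibv_attach_mcast; registering the GPU buffer with ibv_reg_mr assumes GPUDirect RDMA support (nv_peer_mem) is loaded.)

      #include <infiniband/verbs.h>
      #include <stdint.h>
      #include <string.h>

      /* Post one UD multicast send whose scatter-gather list combines a
       * control header in host memory with a payload already resident in
       * GPU memory (gpu_buf is a cudaMalloc'd pointer). */
      static int post_sl_mcast(struct ibv_pd *pd, struct ibv_qp *ud_qp,
                               struct ibv_ah *mcast_ah, uint32_t mcast_qkey,
                               void *ctrl_hdr, size_t ctrl_len,
                               void *gpu_buf, size_t data_len)
      {
          /* GDR: ibv_reg_mr on device memory requires the nv_peer_mem module. */
          struct ibv_mr *ctrl_mr = ibv_reg_mr(pd, ctrl_hdr, ctrl_len,
                                              IBV_ACCESS_LOCAL_WRITE);
          struct ibv_mr *gpu_mr  = ibv_reg_mr(pd, gpu_buf, data_len,
                                              IBV_ACCESS_LOCAL_WRITE);
          if (!ctrl_mr || !gpu_mr)
              return -1;

          struct ibv_sge sge[2] = {
              { .addr = (uintptr_t)ctrl_hdr, .length = (uint32_t)ctrl_len,
                .lkey = ctrl_mr->lkey },
              { .addr = (uintptr_t)gpu_buf,  .length = (uint32_t)data_len,
                .lkey = gpu_mr->lkey },
          };

          struct ibv_send_wr wr, *bad_wr = NULL;
          memset(&wr, 0, sizeof(wr));
          wr.sg_list           = sge;        /* control + GPU data, ONE message */
          wr.num_sge           = 2;
          wr.opcode            = IBV_WR_SEND;
          wr.send_flags        = IBV_SEND_SIGNALED;
          wr.wr.ud.ah          = mcast_ah;   /* address handle of the MCAST group */
          wr.wr.ud.remote_qpn  = 0xFFFFFF;   /* multicast QPN */
          wr.wr.ud.remote_qkey = mcast_qkey;

          /* UD sends are limited to the path MTU, so a real design chunks large
           * streaming payloads and adds a reliability layer on top of MCAST. */
          return ibv_post_send(ud_qp, &wr, &bad_wr);
      }

     On the receive side, a matching two-entry receive SGL (its first entry also leaving room for the 40-byte UD GRH) lets the HCA place the control header in host memory and the payload directly into the GPU buffer.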

  12. Proposed Heterogeneous Broadcast
     • Overview of the SL-based approach
     [Diagram: the source node holds the control header (C) in host memory and the data in GPU memory; the IB SL step gathers both into one message, which IB multicast delivers through the IB switch to nodes 1..N, where each IB HCA writes the control header to host memory and the data directly to the GPU]

  13. Broadcast on Multi-GPU Systems
     • Existing two-level approach
       – Inter-node: can apply the proposed SL-based design
       – Intra-node: uses host-based shared memory
     • Issues of the host-to-device (H-D) cudaMemcpy in the intra-node step (see the sketch below):
       1. It is expensive
       2. It consumes PCIe bandwidth between the CPU and the GPUs
     [Diagram: within each node, the multicast data lands in host memory and GPU 0..GPU N each pull it with cudaMemcpy (Host ↔ Device)]
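     (For contrast, a rough, illustrative sketch of this host-staged intra-node path; names are hypothetical. Every GPU-owning rank in the node pulls the same payload out of a host shared-memory buffer with its own H-D copy, so N GPUs cost N trips over the host-GPU PCIe links.)

      #include <cuda_runtime.h>
      #include <stddef.h>

      /* Existing approach: the payload sits in a host shared-memory segment
       * and each rank copies it down to its own GPU buffer. Illustrative only. */
      void host_staged_intranode_bcast(const void *shm_payload, size_t len,
                                       void *my_gpu_buf /* cudaMalloc'd */)
      {
          /* Every rank issues its own Host-to-Device copy, all of them
           * competing for the same CPU-GPU PCIe bandwidth. */
          cudaMemcpy(my_gpu_buf, shm_payload, len, cudaMemcpyHostToDevice);
      }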

  14. Broadcast on Multi-GPU Systems
     • Proposed intra-node topology-aware broadcast
       – Uses CUDA Inter-Process Communication (IPC)
     [Diagram: within each node, the leader GPU (GPU 0) receives the multicast data; GPU 1..GPU N copy it with cudaMemcpy (Device ↔ Device) over CUDA IPC]

  15. Broadcast on Multi-GPU Systems
     • Proposed intra-node topology-aware broadcast
       – The leader keeps a copy of the data (CopyBuf)
       – Synchronization between GPUs uses a one-byte flag in shared memory on the host
       – Non-leaders copy the data into their RecvBuf using CUDA IPC ⇒ frees up PCIe bandwidth (see the sketch below)
     • Other topology-aware designs
       – Ring, k-nomial, etc.
       – Dynamic tuning and selection
     [Diagram: a host shared-memory region coordinates GPU 0 (leader, CopyBuf) and GPU 1..GPU N (RecvBuf), which pull the data via IPC]
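     (A minimal CUDA IPC sketch of the leader/non-leader exchange, assuming the IPC handle and the one-byte flag live in a host shared-memory segment visible to all ranks on the node; the layout and names are illustrative, not the paper's implementation.)

      #include <cuda_runtime.h>
      #include <stddef.h>

      /* Host shared-memory region used for intra-node coordination. */
      struct node_shm {
          cudaIpcMemHandle_t handle;  /* exported handle of the leader's CopyBuf */
          volatile char      ready;   /* one-byte flag: data is valid */
      };

      /* Leader: keep a copy of the broadcast data on its GPU and publish it. */
      void leader_publish(struct node_shm *shm, void *d_copybuf)
      {
          cudaIpcGetMemHandle(&shm->handle, d_copybuf);
          __sync_synchronize();           /* make the handle visible first */
          shm->ready = 1;                 /* one-byte synchronization flag */
      }

      /* Non-leader (a different process): map the leader's buffer and copy
       * device-to-device, without staging the payload through host memory. */
      void nonleader_copy(struct node_shm *shm, void *d_recvbuf, size_t len)
      {
          while (!shm->ready)             /* spin on the host-resident flag */
              ;
          void *d_leader = NULL;
          cudaIpcOpenMemHandle(&d_leader, shm->handle,
                               cudaIpcMemLazyEnablePeerAccess);
          cudaMemcpy(d_recvbuf, d_leader, len, cudaMemcpyDeviceToDevice);
          cudaIpcCloseMemHandle(d_leader);
      }

     In a real library the opened handle would be cached and reused across messages rather than opened and closed per broadcast.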

  16. Outline
     • Introduction
     • Proposed Designs
     • Performance Evaluation
       – OSU Micro-Benchmark (OMB) level evaluation
       – Streaming benchmark level evaluation
     • Conclusion and Future Work

  17. Experimental Environments
     • Systems
       1. Wilkes cluster @ University of Cambridge (http://www.hpc.cam.ac.uk/services/wilkes)
          – 2 NVIDIA K20c GPUs per node
          – Used up to 32 GPU nodes
       2. CSCS cluster @ Swiss National Supercomputing Centre (http://www.cscs.ch/computers/kesch_escha/index.html)
          – Cray CS-Storm system
          – 8 NVIDIA K80 GPU cards per node (= 16 NVIDIA Kepler GK210 GPU chips per node)
          – Used up to 88 NVIDIA K80 GPU cards (176 GPU chips) over 11 nodes
     • Benchmarks
       – Modified Ohio State University (OSU) Micro-Benchmark (OMB) (http://mvapich.cse.ohio-state.edu/benchmarks/)
         • osu_bcast (MPI_Bcast latency test), modified to support heterogeneous broadcast
       – Streaming benchmark (see the sketch below)
         • Mimics real streaming applications: continuously broadcasts data from a source to GPU-based compute nodes
         • Includes a computation phase that involves host-to-device and device-to-host copies
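     (A simplified, hypothetical version of such a streaming-benchmark loop with a CUDA-aware MPI library; it is not the modified OMB code. The root broadcasts each chunk from host memory, workers receive it directly into GPU memory, and a computation phase issues H-D and D-H copies.)

      #include <mpi.h>
      #include <cuda_runtime.h>
      #include <stdio.h>
      #include <stdlib.h>

      #define MSG_SIZE  (1 << 20)   /* illustrative 1 MiB chunks */
      #define NUM_ITERS 1000

      int main(int argc, char **argv)
      {
          MPI_Init(&argc, &argv);
          int rank;
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);

          void *host_src = malloc(MSG_SIZE);   /* stand-in for live-source data */
          void *host_out = malloc(MSG_SIZE);
          void *d_recv, *d_work;
          cudaMalloc(&d_recv, MSG_SIZE);
          cudaMalloc(&d_work, MSG_SIZE);

          double t0 = MPI_Wtime();
          for (int i = 0; i < NUM_ITERS; i++) {
              /* Heterogeneous broadcast: the root passes a host pointer, the
               * workers a device pointer; a CUDA-aware MPI handles the rest. */
              void *buf = (rank == 0) ? host_src : d_recv;
              MPI_Bcast(buf, MSG_SIZE, MPI_CHAR, 0, MPI_COMM_WORLD);

              if (rank != 0) {
                  /* Computation-phase stand-in: H-D and D-H copies that
                   * compete with the broadcast for PCIe bandwidth. */
                  cudaMemcpy(d_work, host_out, MSG_SIZE, cudaMemcpyHostToDevice);
                  cudaMemcpy(host_out, d_recv, MSG_SIZE, cudaMemcpyDeviceToHost);
              }
          }
          if (rank == 0)
              printf("avg time per broadcast: %f s\n",
                     (MPI_Wtime() - t0) / NUM_ITERS);

          cudaFree(d_recv); cudaFree(d_work);
          free(host_src);   free(host_out);
          MPI_Finalize();
          return 0;
      }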
