
SLIDE 1

Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters

Ching-Hsiang Chu (1), Khaled Hamidouche (1), Hari Subramoni (1), Akshay Venkatesh (1), Bracy Elton (2) and Dhabaleswar K. (DK) Panda (1)

(1) Department of Computer Science and Engineering, The Ohio State University
(2) Engility Corporation

DISTRIBUTION STATEMENT A. Approved for public release; distribution is unlimited. 88ABW-2016-5574

SLIDE 2

SBAC-PAD 2016, Network Based Computing Laboratory

Outline

  • Introduction
  • Proposed Designs
  • Performance Evaluation
  • Conclusion and Future Work

SLIDE 3

Drivers of Modern HPC Cluster Architectures

  • Multi-core processors are ubiquitous
  • InfiniBand is very popular in HPC clusters
  • Accelerators/Coprocessors are becoming common in high-end systems

➠ Pushing the envelope towards Exascale computing

Figure: the three drivers, with example systems Tianhe-2, Titan, Stampede and Tianhe-1A

  • Multi-core Processors
  • High Performance Interconnects (InfiniBand): <1 µs latency, >100 Gbps bandwidth
  • Accelerators / Coprocessors: high compute density, high performance/watt, >1 Tflop/s DP on a chip

SLIDE 4

IB and GPU in HPC Systems

  • Growth of IB and GPU clusters in the last 3 years
    – IB is the major commodity network adapter used
    – NVIDIA GPUs boost 18% of the top 50 of the "Top 500" systems as of June 2016

System share in Top 500 (%)

  List       GPU Cluster   InfiniBand Cluster
  June-13        8.4             41
  Nov-13         7.8             41.4
  June-14        9               44.4
  Nov-14         9.8             44.8
  June-15       10.4             51.8
  Nov-15        13.8             47.4
  June-16       13.2             40.8

Data from Top500 list (http://www.top500.org)

SLIDE 5

Motivation

  • Streaming applications on HPC systems
    1. Communication (MPI)
       • Broadcast-type operations
    2. Computation (CUDA)
       • Multiple GPU nodes as workers
    – Examples
       • Deep learning frameworks
       • Proton computed tomography (pCT)

Figure: HPC resources for real-time analytics. A data source streams in real time to a sender, which issues data streaming-like broadcast operations to the worker nodes (each shown with a CPU and two GPUs).

SLIDE 6

Motivation

  • Streaming applications on HPC systems
    1. Communication: heterogeneous broadcast-type operations
       • Data are usually from a live source and stored in Host memory
       • Data need to be sent to remote GPU memories for computing

Figure: real-time streaming into the sender, followed by data streaming-like broadcast operations.

Requires data movement from Host memory to remote GPU memories, i.e., host-device (H-D) heterogeneous broadcast ⇒ Performance bottleneck

SLIDE 7

Motivation

  • Requirements for streaming applications on HPC systems
    – Low latency, high throughput and scalability
    – Free up Peripheral Component Interconnect Express (PCIe) bandwidth for application needs

Figure: data streaming-like broadcast operations to the worker nodes (each shown with a CPU and two GPUs).

SLIDE 8

Motivation – Technologies we have

  • NVIDIA GPUDirect [1]
    – Use remote direct memory access (RDMA) transfers between GPUs and other PCIe devices ⇒ GDR
    – Peer-to-peer transfers between GPUs
    – and more…
  • InfiniBand (IB) hardware multicast (IB MCAST) [2]
    – Enables efficient designs of homogeneous broadcast operations
      • Host-to-Host [3]
      • GPU-to-GPU [4]

[1] https://developer.nvidia.com/gpudirect
[2] G. F. Pfister, "An Introduction to the InfiniBand Architecture," High Performance Mass Storage and Parallel I/O, Chapter 42, pp. 617-632, Jun 2001.
[3] J. Liu, A. R. Mamidala, and D. K. Panda, "Fast and Scalable MPI-level Broadcast using InfiniBand's Hardware Multicast Support," in IPDPS 2004, p. 10, April 2004.
[4] A. Venkatesh, H. Subramoni, K. Hamidouche, and D. K. Panda, "A High Performance Broadcast Design with Hardware Multicast and GPUDirect RDMA for Streaming Applications on InfiniBand Clusters," in HiPC 2014, Dec 2014.

SLIDE 9

Problem Statement

  • Can we design a high-performance heterogeneous broadcast for streaming applications?
    • Supports Host-to-Device broadcast operations
  • Can we also design an efficient broadcast for multi-GPU systems?
  • Can we combine GPUDirect and IB technologies to
    • Avoid extra data movements to achieve better performance
    • Increase available Host-Device (H-D) PCIe bandwidth for application use

SLIDE 10

Outline

  • Introduction
  • Proposed Designs
    – Heterogeneous Broadcast with GPUDirect RDMA (GDR) and InfiniBand (IB) Hardware Multicast
    – Intra-node Topology-Aware Broadcast for Multi-GPU Systems
  • Performance Evaluation
  • Conclusion and Future Work

SLIDE 11

Proposed Heterogeneous Broadcast

  • Key requirement of IB MCAST
    – Control header needs to be stored in host memory
  • SL-based approach: Combine CUDA GDR and IB MCAST features
    – Also, take advantage of the IB Scatter-Gather List (SGL) feature:
      • Multicast two separate addresses (control on the host + data on GPU) in a single IB message
    – Directly IB read/write from/to GPU using the GDR feature ⇒ low-latency zero-copy based schemes
    – Avoiding the extra copy between Host and GPU ⇒ frees up PCIe bandwidth resource for application needs
    – Employing the IB MCAST feature increases scalability
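The scatter-gather idea on this slide can be illustrated with a minimal sketch in plain Python. It is not the InfiniBand verbs API and not the authors' implementation: `gather` and `scatter` are hypothetical helper names, and ordinary `bytearray` objects stand in for the host-resident control header and the GPU-resident data buffer. The point is only that one message can be assembled from, and delivered into, two separate memory regions without an intermediate staging copy in the application.

```python
def gather(segments):
    """Sender side: build one wire message from several separate buffers."""
    return b"".join(bytes(segment) for segment in segments)


def scatter(message, buffers):
    """Receiver side: split one wire message across the buffers, in order."""
    if len(message) != sum(len(buf) for buf in buffers):
        raise ValueError("message length does not match the receive buffers")
    offset = 0
    for buf in buffers:
        size = len(buf)
        buf[:] = message[offset:offset + size]  # in-place fill of this region
        offset += size


# Stand-ins (assumptions): a 4-byte control header kept in host memory and
# an 8-byte payload that would live in GPU memory.
host_control = bytearray(b"HDR0")
gpu_data = bytearray(b"payload!")

wire_message = gather([host_control, gpu_data])

recv_control = bytearray(4)  # receiver's host-side control buffer
recv_data = bytearray(8)     # receiver's GPU-side data buffer
scatter(wire_message, [recv_control, recv_data])
```

In the design described on the slide, the two-entry list is handled by the IB hardware, so the control header and the data reach their separate destinations from a single multicast message.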

SLIDE 12

Proposed Heterogeneous Broadcast

  • Overview of SL-based approach

Figure: the Source (CPU, GPU, IB HCA) is connected through the IB Switch to Node 1 … Node N (each with CPU, GPU and IB HCA). Labels in the diagram: control header (C), Data, IB SL step, Multicast steps.

SLIDE 13

Broadcast on Multi-GPU systems

  • Existing two-level approach
    – Inter-node: Can apply the proposed SL-based design
    – Intra-node: Use host-based shared memory

Issues of H-D cudaMemcpy:

  1. Expensive
  2. Consumes PCIe bandwidth between CPU and GPUs!

Figure: the Source (CPU, GPU) sends through the IB Switch to Node 1 … Node N in the multicast steps; inside a node, the CPU distributes the data to GPU 0, GPU 1, …, GPU N with cudaMemcpy (Host ↔ Device).
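The PCIe cost named above can be made concrete with a deliberately simplified counting model. It is an illustration, not a measurement: it assumes that in the host-staged scheme every GPU on the node performs its own Host-to-Device copy of the full message, and that in a device-to-device scheme the data never passes through host memory. The function name and the scheme labels are hypothetical.

```python
def host_device_bytes(num_gpus, msg_bytes, scheme):
    """Bytes copied between host memory and GPUs on one node per broadcast."""
    if scheme == "shmem":
        # Host-staged: each GPU pulls its own copy out of host shared memory.
        return num_gpus * msg_bytes
    if scheme == "ipc":
        # Device-to-device: no copy is staged through host memory.
        return 0
    raise ValueError("unknown scheme: " + scheme)


# Example: a node with 16 GPU chips receiving one 4 MB message.
staged = host_device_bytes(16, 4 * 1024 * 1024, "shmem")
direct = host_device_bytes(16, 4 * 1024 * 1024, "ipc")
```

Under these assumptions the host-staged scheme moves 64 MB across the host-device path for a single 4 MB broadcast, which is the bandwidth an application would otherwise have for its own copies.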

SLIDE 14

Broadcast on Multi-GPU systems

  • Proposed Intra-node Topology-Aware Broadcast
    – CUDA InterProcess Communication (IPC)

Figure: the Source (CPU, GPU) sends through the IB Switch to Node 1 … Node N in the multicast steps; inside a node, the data moves between GPU 0, GPU 1, …, GPU N with cudaMemcpy (Device ↔ Device).

SLIDE 15

Broadcast on Multi-GPU systems

  • Proposed Intra-node Topology-Aware Broadcast
    – Leader keeps a copy of the data
    – Synchronization between GPUs
      • Use a one-byte flag in shared memory on host
    – Non-leaders copy the data using CUDA IPC ⇒ Frees up PCIe bandwidth resource
  • Other Topology-Aware designs
    – Ring, K-nomial, etc.
    – Dynamic tuning selection

Figure: GPU 0 holds RecvBuf and CopyBuf; GPU 1 … GPU N each hold a RecvBuf and copy from GPU 0 over IPC; the Shared Memory Region resides in Host Memory.
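The leader/flag/copy sequence on this slide can be sketched as a small simulation. This is not CUDA code and not the authors' implementation: Python threads stand in for the per-GPU processes, `bytearray` copies stand in for CUDA IPC device-to-device copies, and a `threading.Condition` guards the one-byte flag so the example does not busy-wait. The name `intra_node_bcast` is hypothetical.

```python
import threading


def intra_node_bcast(payload, num_gpus):
    """Simulate one intra-node broadcast; returns each GPU's receive buffer."""
    copy_buf = bytearray(len(payload))  # leader's CopyBuf (on GPU 0)
    recv_bufs = [bytearray(len(payload)) for _ in range(num_gpus)]
    flag = bytearray(1)                 # one-byte flag in host shared memory
    flag_changed = threading.Condition()

    def leader():
        recv_bufs[0][:] = payload       # data arrives in GPU 0's RecvBuf
        copy_buf[:] = recv_bufs[0]      # leader keeps a copy for its peers
        with flag_changed:
            flag[0] = 1                 # publish: the copy is ready
            flag_changed.notify_all()

    def non_leader(rank):
        with flag_changed:
            flag_changed.wait_for(lambda: flag[0] == 1)
        recv_bufs[rank][:] = copy_buf   # stands in for the IPC copy

    workers = [threading.Thread(target=non_leader, args=(rank,))
               for rank in range(1, num_gpus)]
    workers.append(threading.Thread(target=leader))
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return [bytes(buf) for buf in recv_bufs]


received = intra_node_bcast(b"frame-0", 4)
```

The only traffic through host memory in this scheme is the one-byte flag; the payload itself moves from the leader's buffer to the other GPUs' buffers.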

SLIDE 16

Outline

  • Introduction
  • Proposed Designs
  • Performance Evaluation
    – OSU Micro-Benchmark (OMB) level evaluation
    – Streaming benchmark level evaluation
  • Conclusion and Future Work

SLIDE 17

Experimental Environments

  • 1. Wilkes cluster @ University of Cambridge
    – http://www.hpc.cam.ac.uk/services/wilkes
    – 2 NVIDIA K20c GPUs per node
    – Used up to 32 GPU nodes
  • 2. CSCS cluster @ Swiss National Supercomputing Centre
    – http://www.cscs.ch/computers/kesch_escha/index.html
    – Cray CS-Storm system
    – 8 NVIDIA K80 GPU cards per node (= 16 NVIDIA Kepler GK210 GPU chips per node)
    – Used up to 88 NVIDIA K80 GPU cards (176 GPU chips) over 11 nodes
  • Modified Ohio State University (OSU) Micro-Benchmark (OMB)
    – http://mvapich.cse.ohio-state.edu/benchmarks/
      • osu_bcast - MPI_Bcast Latency Test
    – Modified to support heterogeneous broadcast
  • Streaming benchmark
    – Mimics real streaming applications
    – Continuously broadcasts data from a source to GPU-based compute nodes
    – Includes a computation phase that involves host-to-device and device-to-host copies

SLIDE 18

Overview of the MVAPICH2 Project

  • High Performance open-source MPI Library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE)
    – MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), Available since 2002
    – MVAPICH2-X (MPI + PGAS), Available since 2011
    – Support for GPGPUs (MVAPICH2-GDR), Available since 2014
    – Support for MIC (MVAPICH2-MIC), Available since 2014
    – Support for Virtualization (MVAPICH2-Virt), Available since 2015
    – Support for Energy-Awareness (MVAPICH2-EA), Available since 2015
    – Used by more than 2,675 organizations in 83 countries
    – More than 391,000 (> 0.39 million) downloads from the OSU site directly
    – Empowering many TOP500 clusters (June '16 ranking)
      • 12th ranked 462,462-core cluster (Stampede) at TACC
      • 15th ranked 185,344-core cluster (Pleiades) at NASA
      • 31st ranked 74,520-core cluster (Tsubame 2.5) at Tokyo Institute of Technology
    – Available with software stacks of many vendors and Linux Distros (RedHat and SuSE)
    – http://mvapich.cse.ohio-state.edu
  • Empowering Top500 systems for over a decade
    – System-X from Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 Tflop/s) ⇒
    – Stampede at TACC (12th in June 2016, 462,462 cores, 5.168 Pflop/s)

SLIDE 19

OMB – Heterogeneous Inter-node Broadcast @ Wilkes

  • Compared proposed SL-based design to homogeneous broadcast designs with explicit data transfers
  • Reduces latency by up to 56% and 39% for small and large messages, respectively
    – No extra data transfers between Host and GPU memories

Figure (small messages, 1 B to 16 KB): Latency (μs) vs. Message Size (Bytes) for SL-MCAST, GPU-MCAST and Host-MCAST; annotated 56%.

Figure (large messages, 32 KB to 4 MB): Latency (μs) vs. Message Size (Bytes) for SL-MCAST, GPU-MCAST and Host-MCAST; annotated 39%.
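The slides report results both as percentage reductions and as "X" factors, so a short conversion helper may be useful when comparing them. It relies only on the identity speedup = 1 / (1 - reduction), which holds when the reduction is measured against the baseline latency; the function name is hypothetical and nothing here comes from the benchmark itself.

```python
def speedup_from_reduction(reduction):
    """Convert a fractional latency reduction (0.56 for 56%) to a speedup factor."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in the range [0, 1)")
    return 1.0 / (1.0 - reduction)


small_message_speedup = speedup_from_reduction(0.56)
large_message_speedup = speedup_from_reduction(0.39)
```

A 56% latency reduction therefore corresponds to roughly a 2.27x speedup, and a 39% reduction to roughly 1.64x.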

SLIDE 20

OMB – SL-based Approach

  • Inter-node Broadcast on Wilkes
    – IB Hardware Multicast provides good scalability

Figure: Latency (μs) vs. System size (Number of GPU nodes: 2, 4, 8, 16, 32) for SL-MCAST, GPU-MCAST and Host-MCAST.

SLIDE 21

OMB – Inter- and Intra-node Broadcast @ CSCS

  • SL-based inter-node + Topology-aware intra-node on CSCS
    – Up to 58% and 79% reduction for small and large messages
      • No extra data transfers between Host and GPU memories

Figure (small messages, 1 B to 16 KB): Latency (μs) vs. Message Size (Bytes) for IPC SL-MCAST and SHMEM SL-MCAST; annotated 58%.

Figure (large messages, 32 KB to 4 MB): Latency (μs) vs. Message Size (Bytes) for IPC SL-MCAST and SHMEM SL-MCAST; annotated 79%.

SLIDE 22

Streaming Benchmark – Execution Time @ CSCS

  • Utilizes IPC-based Device-to-Device data transfer for streaming applications on multi-GPU systems
    – Up to 26% and 67% improvement for small and large messages

Figure (small messages, 1 B to 16 KB): Execution Time (s) vs. Message Size (Bytes) for IPC SL-MCAST and SHMEM SL-MCAST; annotated 26%.

Figure (large messages, 32 KB to 4 MB): Execution Time (s) vs. Message Size (Bytes) for IPC SL-MCAST and SHMEM SL-MCAST; annotated 67%.

SLIDE 23

Streaming Benchmark – Throughput @ CSCS

  • Increases availability of PCIe Host-Device Resources
    – Utilize IPC-based Device-to-Device data transfers
    – Free up PCIe bandwidth resources between Host and Devices for applications

Figure: PCIe Throughput (GB/s) vs. Message Size (Bytes), 1 B to 4 MB, for IPC SL-MCAST and SHMEM SL-MCAST; annotated 3.2X.

SLIDE 24

Outline

  • Introduction
  • Proposed Designs
  • Performance Evaluation
  • Conclusion and Future Work

SLIDE 25

Conclusion

  • Combines NVIDIA GPUDirect technology and InfiniBand (IB) hardware multicast for GPU-enabled streaming applications
  • Further proposes an intra-node topology-aware scheme that exploits CUDA IPC for multi-GPU systems
    – Achieves 2X improvement over state-of-the-art schemes with Ohio State University (OSU) Micro-Benchmarks (OMBs)
    – Achieves up to a 67% improvement in execution time and 3.5X throughput in a synthetic streaming benchmark
    – Indicates that applying this approach to a streaming application, such as proton computed tomography (pCT) or a deep learning framework, is promising

SLIDE 26

Future Work

  • Include in future releases of the MVAPICH2-GDR library
  • Improve reliability
  • Evaluate effectiveness with streaming applications, such as proton computed tomography (pCT) and deep learning frameworks
  • Extend the designs to other collective operations as well as non-blocking operations
    – Allreduce, gather, etc.

SLIDE 27

Thank You!

Ching-Hsiang Chu
chu.368@osu.edu

Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/

The MVAPICH2 Project
http://mvapich.cse.ohio-state.edu/

This project is supported under the United States Department of Defense (DOD) High Performance Computing Modernization Program (HPCMP) User Productivity Enhancement and Technology Transfer (PETTT) activity (Contract No. GS04T09DBC0017 Engility Corporation). The opinions expressed herein are those of the authors and do not necessarily reflect the views of the DOD or the employer of the author.

SLIDE 28

Streaming Benchmark

  • Mimics behavior of a streaming application
    – Continuously broadcasts data from a source to GPU-based compute nodes
    – Includes a computation phase that involves host-to-device and device-to-host copies

/* h_buf and d_buf: buffers in Host and GPU memory. */
for iter = 0 to max_iter do
    cudaMemcpyAsync(..., cudaMemcpyHostToDevice, cpy_stream);
    if rank == root then
        MPI_Bcast(h_buf, ...);    /* the root broadcasts from Host memory */
    else
        MPI_Bcast(d_buf, ...);    /* the workers receive into GPU memory */
    end if
    dummy_kernel<<<...>>>(d_buf, ...);
    cudaMemcpyAsync(..., cudaMemcpyDeviceToHost, cpy_stream);
end for