

SLIDE 1: Efficient Reliability Support for Hardware Multicast-based Broadcast in GPU-enabled Streaming Applications

¹Ching-Hsiang Chu, ¹Khaled Hamidouche, ¹Hari Subramoni, ¹Akshay Venkatesh, ²Bracy Elton and ¹Dhabaleswar K. (DK) Panda
¹Department of Computer Science and Engineering, The Ohio State University
²Engility Corporation

SLIDE 2: Outline

  • Introduction
  • Proposed Designs
  • Performance Evaluation
  • Conclusion and Future Work


SLIDE 3: Drivers of Modern HPC Cluster Architectures

  • Multi-core processors are ubiquitous
  • InfiniBand (IB) is very popular in HPC clusters
  • Accelerators/Coprocessors are becoming common in high-end systems

➠ Pushing the envelope towards Exascale computing

[Figure: multi-core processors; high-performance interconnects – InfiniBand (<1 µs latency, >100 Gbps bandwidth); accelerators/coprocessors (high compute density, high performance/watt, >1 Tflop/s DP on a chip); photos of Tianhe-2, Titan, Stampede, and Tianhe-1A]

SLIDE 4: IB and GPU in HPC Systems

  • Growth of IB and GPU clusters in the last 3 years

– IB is the major commodity network adapter used
– NVIDIA GPUs power 18% of the top 50 systems of the “Top 500” list as of June 2016

System share in the Top 500 (%):

                     Jun 2013  Nov 2013  Jun 2014  Nov 2014  Jun 2015  Nov 2015  Jun 2016
GPU cluster              8.4       7.8       9.0       9.8      10.4      13.8      13.2
InfiniBand cluster      41.0      41.4      44.4      44.8      51.8      47.4      40.8

Data from Top500 list (http://www.top500.org)

SLIDE 5: Motivation

  • Streaming applications on HPC systems involve:
    1. Communication (MPI): a pipeline of broadcast-type operations
    2. Computation (CUDA): multiple GPU nodes as workers
    (this broadcast-then-compute pattern is sketched in code after the figure below)
  – Examples
    • Deep learning frameworks
    • Proton computed tomography (pCT)

[Figure: a data source performs real-time streaming to a data distributor, which issues data streaming-like broadcast operations to HPC resources for real-time analytics: worker nodes, each with a CPU and two GPUs]
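The broadcast-then-compute pattern above can be made concrete with a minimal sketch. It assumes a CUDA-aware MPI library (such as MVAPICH2-GDR) so GPU buffers can be passed to MPI_Bcast directly; the kernel, frame size, and iteration count are illustrative placeholders, not part of the original benchmark.

    /* Minimal sketch of a streaming worker: a pipeline of broadcast-type
     * operations (MPI) followed by computation (CUDA).  Assumes a
     * CUDA-aware MPI library such as MVAPICH2-GDR; process_frame() and
     * FRAME_SIZE are illustrative placeholders.  Compile with nvcc.   */
    #include <mpi.h>
    #include <cuda_runtime.h>

    #define FRAME_SIZE (1 << 20)   /* 1 MB frame, illustrative only */

    __global__ void process_frame(char *frame, size_t len)
    {
        /* Hypothetical computation phase: touch every byte of the frame. */
        size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
        if (i < len)
            frame[i] += 1;
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        char *d_frame;
        cudaMalloc((void **)&d_frame, FRAME_SIZE);

        for (int iter = 0; iter < 1000; iter++) {
            /* 1. Communication: the data distributor (rank 0) broadcasts
             *    the next frame directly into GPU memory on every worker. */
            MPI_Bcast(d_frame, FRAME_SIZE, MPI_CHAR, 0, MPI_COMM_WORLD);

            /* 2. Computation: each worker processes the frame on its GPU. */
            process_frame<<<(FRAME_SIZE + 255) / 256, 256>>>(d_frame, FRAME_SIZE);
            cudaDeviceSynchronize();
        }

        cudaFree(d_frame);
        MPI_Finalize();
        return 0;
    }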

SLIDE 6: Communication for Streaming Applications

  • High-performance heterogeneous broadcast*
    – Leverages NVIDIA GPUDirect and IB hardware multicast (MCAST) features (see the verbs-level sketch at the end of this slide)
    – Eliminates unnecessary data staging through host memory

*Ching-Hsiang Chu, Khaled Hamidouche, Hari Subramoni, Akshay Venkatesh, Bracy Elton, and D. K. Panda, “Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters,” SBAC-PAD’16, Oct 2016.

[Figure: the source node pushes data from its CPU/GPU through its IB HCA to an IB switch, which multicasts it to the IB HCAs of nodes 1..N; multicast steps are followed by an IB SL step that delivers the data to each node's GPU]

IB HCA: InfiniBand Host Channel Adapter
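For readers unfamiliar with the ingredients named above, the following hedged verbs-level sketch shows how a UD queue pair joins an IB hardware multicast group and how a GPU buffer can be registered for GPUDirect RDMA. It illustrates the underlying mechanisms only; it is not MVAPICH2-GDR code, and error handling and multicast group-join management are omitted.

    /* Hedged sketch of the building blocks: an ibverbs UD queue pair
     * attached to an IB multicast group, plus a GPU buffer registered
     * for GPUDirect RDMA.  Not MVAPICH2-GDR code.                     */
    #include <stddef.h>
    #include <infiniband/verbs.h>
    #include <cuda_runtime.h>

    /* With the GPUDirect RDMA kernel module loaded, device memory can
     * be registered with the HCA like ordinary host memory.           */
    struct ibv_mr *register_gpu_buffer(struct ibv_pd *pd, size_t len)
    {
        void *d_buf = NULL;
        cudaMalloc(&d_buf, len);
        return ibv_reg_mr(pd, d_buf, len,
                          IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ);
    }

    /* Attach a UD QP to the hardware multicast group identified by
     * (mgid, mlid) so the HCA replicates packets in hardware.         */
    int join_mcast_group(struct ibv_qp *ud_qp,
                         const union ibv_gid *mgid, uint16_t mlid)
    {
        return ibv_attach_mcast(ud_qp, mgid, mlid);
    }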

SLIDE 7: Limitations of the Existing Scheme

  • IB hardware multicast significantly improves performance; however, it is an Unreliable Datagram (UD)-based scheme
    ⇒ Reliability must be handled explicitly
  • Existing Negative ACKnowledgement (NACK)-based design (modeled in code below)
    – The sender must stall to check for incoming NACK packets
      ⇒ Breaks the pipeline of broadcast operations
    – Re-multicasts packets even when this is unnecessary for some receivers
      ⇒ Wastes network resources and degrades throughput/bandwidth
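To make the stall concrete, here is a hedged C model of the sender loop implied by these bullets; every function is a hypothetical stand-in, not an MVAPICH2 API.

    /* Illustrative model of the existing NACK-based scheme; all
     * functions are hypothetical stand-ins, not MVAPICH2 APIs.       */
    #include <stdbool.h>
    #include <stddef.h>

    void mcast_send(const void *frame, size_t len);  /* UD hardware multicast */
    bool nack_window_open(void);   /* true while NACKs may still arrive */
    bool nack_received(void);      /* true if any receiver reported loss */

    void nack_based_bcast(const void *frames[], size_t len, int n)
    {
        for (int i = 0; i < n; i++) {
            mcast_send(frames[i], len);

            /* The sender stalls here to watch for NACKs before issuing
             * the next frame: this breaks the broadcast pipeline.     */
            while (nack_window_open()) {
                if (nack_received()) {
                    /* The frame is re-multicast to the whole group even
                     * if only one receiver lost it, wasting bandwidth. */
                    mcast_send(frames[i], len);
                }
            }
        }
    }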

SLIDE 8: Problem Statement

  • How can we provide reliability support, while leveraging UD-based IB hardware multicast, to achieve high-performance broadcast for GPU-enabled streaming applications?
  • The design should:
    – Maintain the pipeline of broadcast operations
    – Minimize the consumption of Peripheral Component Interconnect Express (PCIe) resources

SLIDE 9: Outline

  • Introduction
  • Proposed Designs
    – Remote Memory Access (RMA)-based Design
  • Performance Evaluation
  • Conclusion and Future Work

SLIDE 10: Overview: RMA-based Reliability Design

  • Goals of the proposed design
    – Allow receivers to retrieve lost MCAST packets through RMA operations without interrupting the sender
      ⇒ Maintains pipelining of broadcast operations
      ⇒ Minimizes consumption of PCIe resources
  • Major benefit of MPI-3 Remote Memory Access (RMA)*
    – Supports one-sided communication ⇒ the broadcast sender is not interrupted
  • Major challenge
    – How and where receivers can retrieve the correct MCAST packets through RMA operations

*“MPI Forum”, http://mpi-forum.org/

SLIDE 11: Implementing MPI_Bcast: Sender Side

  • Maintains an additional window over a circular backup buffer for MCAST packets
  • Exposes this window to the other processes in the MCAST group, e.g., via MPI_Win_create
  • Uses an additional helper thread to copy MCAST packets into the backup buffer ⇒ the copy overlaps with broadcast communication (see the sketch below)
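A minimal sketch of this sender-side design, assuming MPI-3 RMA support and an MPI library initialized with MPI_THREAD_MULTIPLE for the helper thread; the buffer sizes and function names are illustrative, not MVAPICH2 internals.

    /* Minimal sketch of the sender side, assuming MPI-3 RMA support;
     * FRAME_SIZE, BACKUP_SLOTS, and backup_store() are illustrative,
     * not MVAPICH2 internals.                                         */
    #include <mpi.h>
    #include <string.h>

    #define FRAME_SIZE   8192   /* size of one MCAST packet (assumed)  */
    #define BACKUP_SLOTS 64     /* depth of the circular backup buffer */

    static char    backup[BACKUP_SLOTS][FRAME_SIZE];  /* circular backup buffer */
    static MPI_Win backup_win;

    /* Expose the circular backup buffer to every process in the MCAST
     * group, so receivers can later read it with one-sided RMA Get.   */
    void create_backup_window(MPI_Comm mcast_comm)
    {
        MPI_Win_create(backup, (MPI_Aint)sizeof(backup), FRAME_SIZE,
                       MPI_INFO_NULL, mcast_comm, &backup_win);
    }

    /* Invoked from a helper thread for each outgoing MCAST packet: the
     * copy into the backup buffer overlaps with broadcast communication
     * instead of delaying the sender's critical path.                 */
    void backup_store(const void *pkt, unsigned seq)
    {
        memcpy(backup[seq % BACKUP_SLOTS], pkt, FRAME_SIZE);
    }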

SLIDE 12: Implementing MPI_Bcast: Receiver Side

  • When a receiver experiences a timeout (a lost MCAST packet):
    – It performs an RMA Get operation on the sender's backup buffer to retrieve the lost MCAST packets (see the sketch below)
    – The sender is not interrupted

[Figure: timeline in which a broadcast receiver detects a timeout and issues an RMA Get through the IB HCAs to the broadcast sender's backup buffer, while the sender proceeds uninterrupted]
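A matching sketch of the recovery path, reusing the illustrative backup_win, FRAME_SIZE, and BACKUP_SLOTS names from the sender sketch above:

    /* Minimal sketch of the receiver-side recovery path; all names
     * are illustrative and mirror the sender sketch.                  */
    void recover_lost_packet(void *dst, unsigned seq, int sender_rank)
    {
        /* The window's displacement unit is FRAME_SIZE, so the target
         * displacement is simply the slot index of the lost packet.   */
        MPI_Aint slot = seq % BACKUP_SLOTS;

        /* One-sided Get: fetches the packet from the sender's backup
         * buffer without involving the sender's CPU.                  */
        MPI_Win_lock(MPI_LOCK_SHARED, sender_rank, 0, backup_win);
        MPI_Get(dst, FRAME_SIZE, MPI_CHAR,
                sender_rank, slot, FRAME_SIZE, MPI_CHAR, backup_win);
        MPI_Win_unlock(sender_rank, backup_win);  /* completes the Get */
    }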

SLIDE 13: Backup Buffer Requirements

  • Large enough to keep MCAST packets available until they are needed
  • As small as possible to limit the memory footprint

X > C × (L × BW) / F

where X is the number of backup buffer slots, F the frame size (the size of a single MCAST packet), BW the bandwidth, C a constant, and L the round-trip time between sender and receiver.
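A hedged worked example, with all numbers assumed for illustration (they are not from the slides): take BW = 100 Gb/s = 12.5 GB/s, L = 2 µs, F = 8 KB, and a safety constant C = 2. Then

    X > C × (L × BW) / F = 2 × (2 µs × 12.5 GB/s) / 8192 B = 2 × 25000 B / 8192 B ≈ 6.1

so the backup buffer would need at least 7 frame-sized slots under these assumptions.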

SLIDE 14: Proposed RMA-based Reliability Design

  • Pros:
    – The broadcast sender is not involved in retransmission, i.e., the pipeline of broadcast operations is maintained
      ⇒ High throughput, high scalability
    – No extra MCAST operation, i.e., consumption of PCIe resources is minimized
      ⇒ Low overhead, low latency
  • Cons:
    – Congestion may occur when multiple receivers issue RMA Get operations to retrieve the same data in extremely unreliable networks (highly unlikely for IB clusters)

SLIDE 15: Outline

  • Introduction
  • Proposed Designs
  • Performance Evaluation
    – Experimental Environments
    – Streaming Benchmark Level Evaluation
  • Conclusion and Future Work

SLIDE 16: Experimental Environments

  • 1. RI2 cluster @ The Ohio State University*
    – Mellanox EDR InfiniBand HCAs
    – 2 NVIDIA K80 GPUs per node
    – Used up to 16 GPU nodes
  • 2. CSCS cluster @ Swiss National Supercomputing Centre
    (http://www.cscs.ch/computers/kesch_escha/index.html)
    – Mellanox FDR InfiniBand HCAs
    – Cray CS-Storm system, 8 NVIDIA K80 GPU cards per node
    – Used up to 88 NVIDIA K80 GPU cards over 11 nodes
  • Modified Ohio State University (OSU) Micro-Benchmark (OMB)*
    (http://mvapich.cse.ohio-state.edu/benchmarks/)
    – osu_bcast, the MPI_Bcast latency test, modified to support heterogeneous broadcast
  • Streaming benchmark
    – Mimics real streaming applications
    – Continuously broadcasts data from a source to GPU-based compute nodes
    – Includes a computation phase that involves host-to-device and device-to-host copies

*Results from RI2 and OMB are omitted in this presentation due to time constraints

SLIDE 17: Overview of the MVAPICH2 Project

  • High-performance open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE)
    – MVAPICH (MPI-1) and MVAPICH2 (MPI-2.2 and MPI-3.0), available since 2002
    – MVAPICH2-X (MPI + PGAS), available since 2011
    – Support for GPGPUs (MVAPICH2-GDR), available since 2014
    – Support for MIC (MVAPICH2-MIC), available since 2014
    – Support for virtualization (MVAPICH2-Virt), available since 2015
    – Support for energy-awareness (MVAPICH2-EA), available since 2015
    – Used by more than 2,675 organizations in 83 countries
    – More than 400,000 (> 0.4 million) downloads directly from the OSU site
    – Empowering many TOP500 clusters (June 2016 ranking):
      • 12th-ranked 462,462-core cluster (Stampede) at TACC
      • 15th-ranked 185,344-core cluster (Pleiades) at NASA
      • 31st-ranked 74,520-core cluster (Tsubame 2.5) at Tokyo Institute of Technology
    – Available with the software stacks of many vendors and Linux distros (RedHat and SuSE)
    – http://mvapich.cse.ohio-state.edu
  • Empowering Top500 systems for over a decade
    – From System-X at Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 Tflop/s) to Stampede at TACC (12th in June 2016, 462,462 cores, 5.168 Pflop/s)

SLIDE 18: Evaluation: Overhead

  • Negligible overhead compared to the existing NACK-based design
  • The RMA-based design outperforms the NACK-based scheme for large messages
  • A helper thread performs backups of MCAST packets in the background

[Charts: broadcast latency (µs) vs. message size, for 1 B–8 KB and 16 KB–4 MB, comparing w/o reliability, NACK, and RMA]

SLIDE 19: Evaluation: Latency on Streaming Benchmark

Latency reduction of the proposed RMA-based design compared to the existing NACK-based scheme

[Charts: normalized latency vs. message size, for 1 B–8 KB and 16 KB–4 MB, at modeled error rates of 0.01%, 0.1%, and 1%]

Normalized to SL-based MCAST with NACK-based retransmission scheme

Latency reduction by message size and modeled error rate:

Modeled Error Rate    8 KB    128 KB    2 MB
0.01%                  16%       31%     11%
0.1%                   21%       36%     19%
1%                     24%       21%     10%

SLIDE 20: Evaluation: Broadcast Rate (Throughput)

  • Equal to or better than the leading NACK-based design for different message sizes and error rates
  • Consistently yields a higher broadcast rate (up to 56%) than the existing NACK-based design

[Chart: broadcast rate, normalized to the SL-based MCAST with NACK-based retransmission scheme, vs. message size (1 B–4 MB), at modeled error rates of 0.01%, 0.1%, and 1%]

SLIDE 21: Outline

  • Introduction
  • Proposed Designs
  • Performance Evaluation
  • Conclusion and Future Work


SLIDE 22: Conclusion

  • Proposed an RMA-based reliability design on top of IB hardware multicast-based broadcast for streaming applications
    – Maintains pipelining of broadcast operations
    – Minimizes consumption of PCIe resources
    – Provides good performance on streaming benchmarks, which is promising for real streaming applications
  • Future work
    – Include the proposed design in future releases of the MVAPICH2-GDR library
    – Evaluate its effectiveness with real streaming applications

SLIDE 23: Thank You!


Ching-Hsiang Chu

chu.368@osu.edu

Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The MVAPICH2 Project: http://mvapich.cse.ohio-state.edu/

This project is supported under the United States Department of Defense (DOD) High Performance Computing Modernization Program (HPCMP) User Productivity Enhancement and Technology Transfer (PETTT) activity (Contract No. GS04T09DBC0017 through Engility Corporation). The opinions expressed herein are those of the authors and do not necessarily reflect the views of the DOD or the employers of the authors.