

SLIDE 1

Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning

Ching-Hsiang Chu¹, Xiaoyi Lu¹, Ammar A. Awan¹, Hari Subramoni¹, Jahanzeb Hashmi¹, Bracy Elton² and Dhabaleswar K. (DK) Panda¹

¹Department of Computer Science and Engineering, The Ohio State University
²Engility Corporation

SLIDE 2: Outline

  • Introduction
    – Deep Learning on GPU and InfiniBand (IB) Clusters
    – Multi-Source Broadcast-Type Operations for Deep Learning
  • Analysis
  • Proposed Design
    – Streaming-based design with IB multicast and NVIDIA GPUDirect features
  • Performance Evaluation
  • Conclusion and Future Work

SLIDE 3: Trends in Modern HPC Architecture

  • Multi-core/many-core technologies
  • High Performance Interconnects
  • Accelerators/Coprocessors are becoming common in high-end systems
  • High Performance Storage and Compute devices

[Figure: building blocks of modern HPC systems]
  – Accelerators/coprocessors: high compute density, high performance per watt, > 1 Tflop/s double precision on a chip
  – High-performance interconnects (InfiniBand (IB), Omni-Path): < 1 μs latency, 100 Gbps bandwidth
  – Multi-core processors; high-performance storage (SSD, NVMe-SSD, NVRAM)

Example systems: Tianhe-2, Titan, K Computer, Sunway TaihuLight

SLIDE 4: GPU in HPC Systems

[Chart: system count of NVIDIA Fermi, Kepler, and Pascal GPU systems in the Top500, June 2014 to June 2017]

  • Growth of GPU clusters in the last 3 years

– NVIDIA GPUs boost many Top 500 and Green 500 systems

  • “Top 13 systems on the latest Green500 are all equipped with the P100 hardware”*


*Data collected from http://top500.org

SLIDE 5: Architectures for Deep Learning (DL)

[Figure: evolution of DL architectures, all connected by IB networks]
  – Past and current trend: multi-core CPUs within a node; multi-core CPUs across nodes; multi-core CPUs + single GPU across nodes; multi-core CPUs + multi-GPU within a node (e.g., NVIDIA DGX-1 systems)
  – Near future: multi-core CPUs + multi-GPU across nodes

SLIDE 6: High-Performance Deep Learning

  • Computation using GPUs
  • Communication using MPI
    – Exchanging partial gradients after each minibatch
    – All-to-all (multi-source) communication, e.g., MPI_Bcast (see the sketch below)
  • Challenges
    – High computation-communication overlap
    – Good scalability for upcoming large-scale GPU clusters
    – No application-level modification

[Figure: four GPU nodes, each broadcasting its partial gradients to the other three]
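The multi-source exchange in the figure can be driven by one MPI_Bcast per source. Below is a minimal sketch, assuming a CUDA-aware MPI library (such as MVAPICH2-GDR) that accepts GPU pointers directly; the function and buffer names are illustrative, not from the talk:

```c
/* Minimal sketch of the multi-source broadcast pattern: after a minibatch,
 * every rank broadcasts its partial gradients to all other ranks.
 * Assumes a CUDA-aware MPI (e.g., MVAPICH2-GDR), so device pointers can
 * be passed to MPI_Bcast directly. Names here are illustrative. */
#include <mpi.h>

void exchange_gradients(float *d_local_grad,   /* this rank's gradients (GPU memory) */
                        float **d_remote_grad, /* one receive buffer per peer (GPU)  */
                        int count, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    /* One broadcast per source: rank r is the root of the r-th broadcast,
     * so every rank acts as both a sender and a receiver (all-to-all). */
    for (int root = 0; root < size; root++) {
        float *buf = (root == rank) ? d_local_grad : d_remote_grad[root];
        MPI_Bcast(buf, count, MPI_FLOAT, root, comm);
    }
}
```

With blocking MPI_Bcast calls the per-source broadcasts serialize; overlapping the concurrent sources is exactly what the streaming design proposed later recovers inside the library.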

SLIDE 7: Outline

  • Introduction
  • Analysis
    – Existing Designs
    – Problem Statement
  • Proposed Design
  • Performance Evaluation
  • Conclusion and Future Work

SLIDE 8: Evaluation Parameters

[Figure: data path between GPU, CPU, and IB HCA; the host-memory and PCIe bandwidths far exceed the GDR read bandwidth, i.e., B_H ≫ B_G and B_PCIe ≫ B_G]

Notation | Meaning | Unit
n        | Number of processes | N/A
m        | Number of broadcast sources | N/A
t_s      | Setup time for sending data | sec
t_o(n)   | Overhead for issuing an IB-MCAST packet | sec
M        | Original message size | bytes
C        | Size of a data chunk | bytes
U        | Maximum Transmission Unit (MTU) for IB-MCAST, provided by the hardware manufacturer | bytes
B_H      | Bandwidth of reading host memory | bytes/sec
B_G      | Bandwidth of reading GPU memory (NVIDIA GPUDirect RDMA) | bytes/sec
B_PCIe   | PCIe bandwidth between host and GPU memory | bytes/sec

[Figure: a message of size M is divided into chunks of size C, each packetized into IB-MCAST MTUs of size U]
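Spelled out, the decomposition in the figure relates the three sizes as follows (a restatement of the notation above, not an additional model from the talk):

\[
\text{message of size } M \;\longrightarrow\; \left\lceil \frac{M}{C} \right\rceil \text{ chunks of size } C \;\longrightarrow\; \left\lceil \frac{C}{U} \right\rceil \text{ IB-MCAST packets of size } U \text{ per chunk.}
\]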

SLIDE 9: Ring-based Broadcast

  • Direct
  • Pipeline
  • Staging

[Figure: ring-based broadcast; data flows from the Source through Destinations 1, 2, and 3 in turn, each hop involving a GDR read, a network transfer, and a GDR write]

Poor scalability: every variant's latency grows linearly with the number of processes n:

\[
\begin{aligned}
\text{Direct:} &\quad (n-1)\left(t_s + \frac{M}{B_G}\right)\\[2pt]
\text{Pipeline:} &\quad \left(\frac{M}{C} + n - 2\right)\left(t_s + \frac{C}{B_G}\right)\\[2pt]
\text{Staging:} &\quad \frac{M}{B_{PCIe}} + (n-1)\left(t_s + \frac{M}{B_H}\right)
\end{aligned}
\]

SLIDE 10: K-nomial-based Broadcast

  • Direct
  • Pipeline
  • Staging

Non-optimized scalability: latency still grows with log_k n:

[Figure: k-nomial (tree-based) broadcast from the Source to Destinations 1, 2, and 3; each edge involves a GDR read, a network transfer, and a GDR write]

\[
\begin{aligned}
\text{Direct:} &\quad \log_k n \times \left(t_s + \frac{M}{B_G}\right)\\[2pt]
\text{Pipeline:} &\quad \frac{M}{C} \times \log_k n \times \left(t_s + \frac{C}{B_G}\right)\\[2pt]
\text{Staging:} &\quad \frac{M}{B_{PCIe}} + \log_k n \times \left(t_s + \frac{M}{B_H}\right)
\end{aligned}
\]

SLIDE 11: Hardware Multicast-based Broadcast*

[Figure: the Source's HCA reads GPU memory (IB gather + GDR read), the IB switch replicates the packet to all destinations (IB hardware multicast), and each destination's HCA writes it into GPU memory (IB scatter + GDR write)]

  • For GPU-resident data, the design uses
    – GPUDirect RDMA (GDR)
    – InfiniBand hardware multicast (IB-MCAST)
  • Three steps: 1. IB gather + GDR read; 2. IB hardware multicast; 3. IB scatter + GDR write
  • Overheads
    – IB UD limit: multicast runs over unreliable datagrams of at most one MTU
    – GDR limit: reading GPU memory over PCIe is the bandwidth bottleneck

*A. Venkatesh, H. Subramoni, K. Hamidouche, and D. K. Panda, "A High Performance Broadcast Design with Hardware Multicast and GPUDirect RDMA for Streaming Applications on InfiniBand Clusters," in HiPC 2014, Dec 2014.

Latency model:

\[
\frac{M}{U} \times \left(t_s + t_o(n) + \frac{U}{B_G}\right)
\]
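Side by side, the three models explain the scalability ranking (this comparison is a gloss on the formulas above):

\[
T_{\text{ring}} = \Theta(n), \qquad T_{\text{k-nomial}} = \Theta(\log_k n), \qquad T_{\text{IB-MCAST}} = \frac{M}{U}\left(t_s + t_o(n) + \frac{U}{B_G}\right),
\]

where only the per-packet issue overhead \(t_o(n)\), later estimated as roughly \(\frac{1}{\beta}\ln n\), depends on the number of processes. The multicast latency is therefore nearly constant in \(n\); its remaining bottleneck is the GDR read term \(U/B_G\), which the proposed design removes.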

SLIDE 12: Problem Statement

  • How can IB-MCAST, GDR, and other advanced GPU features be leveraged to design efficient and scalable broadcast operations for large messages on GPU clusters?
  • How can high overlap and scalability be achieved for multi-source broadcast operations?
  • What are the attainable theoretical and practical performance benefits for deep learning applications?

SLIDE 13: Outline

  • Introduction
  • Analysis
  • Proposed Design
    – Streaming-based design with IB multicast and NVIDIA GPUDirect features
  • Performance Evaluation
  • Conclusion and Future Work

SLIDE 14: Overview of Proposed Streaming Design

  • Optimized broadcast send operation
    – Streams GPU-resident data through host memory
    – Leverages InfiniBand hardware multicast
      Ø Low latency: avoids the GDR read limit
      Ø Overlaps data transfers within and across nodes
  • Optimized broadcast receive operation
    – Zero-copy scheme leveraging the GDR feature
      Ø Low latency: avoids unnecessary data transfers

SLIDE 15: Optimized Broadcast Send

  • Preparing an intermediate buffer (im_buf)
    – Page-locked (pinned) host buffer
      Ø Fast device-to-host data movement
    – Allocated at the initialization phase
      Ø Low overhead
  • Streaming data through the host
    – Fine-tuned chunk size
    – Asynchronous copy operations
      Ø Three-stage pipeline (see the sketch below)

[Figure: the source calls MPI_Bcast(d_out, …); each chunk undergoes 1. data preparation (GPU d_out → host im_buf), 2. IB gather, and 3. IB hardware multicast through the IB switch]
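A minimal sketch of the streaming send under these assumptions: chunks move from the GPU into pinned im_buf via cudaMemcpyAsync while previously staged chunks are multicast. The function post_mcast_send() is a hypothetical stand-in for the IB-gather and IB-MCAST steps inside the MPI library, assumed to return once its buffer may be reused:

```c
/* Sketch: stream a GPU-resident message through pinned host buffers in
 * chunks, overlapping the D2H copy of chunk i+1 with the multicast of
 * chunk i. post_mcast_send() is hypothetical (stands in for the IB gather
 * + IB-MCAST steps) and is assumed to block until its buffer is reusable. */
#include <cuda_runtime.h>
#include <stddef.h>

void post_mcast_send(const void *buf, size_t len);   /* hypothetical */

void streaming_bcast_send(const char *d_out, size_t msg_size, size_t chunk)
{
    char *im_buf[2];
    cudaStream_t stream[2];

    /* Pinned (page-locked) intermediate buffers: fast async D2H copies.
     * The real design allocates these once, at initialization. */
    for (int i = 0; i < 2; i++) {
        cudaMallocHost((void **)&im_buf[i], chunk);
        cudaStreamCreate(&stream[i]);
    }

    size_t nchunks = (msg_size + chunk - 1) / chunk;
    size_t len0 = (msg_size < chunk) ? msg_size : chunk;
    cudaMemcpyAsync(im_buf[0], d_out, len0,
                    cudaMemcpyDeviceToHost, stream[0]);  /* prime pipeline */

    for (size_t i = 0; i < nchunks; i++) {
        size_t off = i * chunk;
        size_t len = (msg_size - off < chunk) ? msg_size - off : chunk;

        if (i + 1 < nchunks) {          /* stage chunk i+1 while i drains */
            size_t noff = (i + 1) * chunk;
            size_t nlen = (msg_size - noff < chunk) ? msg_size - noff : chunk;
            cudaMemcpyAsync(im_buf[(i + 1) & 1], d_out + noff, nlen,
                            cudaMemcpyDeviceToHost, stream[(i + 1) & 1]);
        }

        cudaStreamSynchronize(stream[i & 1]); /* chunk i is now on host   */
        post_mcast_send(im_buf[i & 1], len);  /* multicast it to the group */
    }

    for (int i = 0; i < 2; i++) {
        cudaStreamDestroy(stream[i]);
        cudaFreeHost(im_buf[i]);
    }
}
```

Reading each chunk from im_buf rather than from GPU memory is what replaces the U/B_G term in the latency model with U/B_H, sidestepping the GDR read limit.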

SLIDE 16: Optimized Broadcast Receive

  • Zero-copy broadcast receive
    – Pre-posted user buffer (d_in)
    – Avoids additional data movement
    – Leverages the IB scatter and GDR write features
      Ø Low latency
      Ø Frees up PCIe resources for applications

[Figure: each destination calls MPI_Bcast(d_in, …); the IB hardware multicast fans out through the IB switch, and IB scatter (GDR write) places the payload directly into d_in in GPU memory on Destinations 1 through N]
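On the receive side, the zero-copy idea is that the user's GPU buffer itself is registered with the HCA and attached to the multicast group, so incoming packets land in GPU memory via GDR write with no host staging. A rough ibverbs-flavored sketch, assuming a GPUDirect-RDMA-capable stack (nv_peer_mem) so that ibv_reg_mr() accepts a cudaMalloc'ed pointer; QP creation, joining the multicast group via the subnet manager, and error handling are elided:

```c
/* Sketch: pre-post a GPU buffer as the receive target of an IB UD QP that
 * is attached to the multicast group. Assumes GPUDirect RDMA so GPU memory
 * registers like host memory; setup and error handling are omitted. */
#include <infiniband/verbs.h>
#include <cuda_runtime.h>
#include <stdint.h>

int post_gpu_recv(struct ibv_pd *pd, struct ibv_qp *qp,
                  const union ibv_gid *mcast_gid, uint16_t mcast_lid,
                  size_t len)
{
    void *d_in;
    cudaMalloc(&d_in, len);   /* the pre-posted user buffer (GPU memory) */

    /* Register GPU memory with the HCA; with nv_peer_mem this creates a
     * GDR mapping so the HCA can write into the GPU directly. */
    struct ibv_mr *mr = ibv_reg_mr(pd, d_in, len, IBV_ACCESS_LOCAL_WRITE);
    if (!mr)
        return -1;

    /* Attach the UD QP to the IB-MCAST group. */
    if (ibv_attach_mcast(qp, mcast_gid, mcast_lid))
        return -1;

    /* Pre-post the receive: multicast packets are scattered straight into
     * GPU memory (GDR write). Note: on UD QPs the first 40 bytes of the
     * posted buffer receive the GRH, so the payload lands at offset 40. */
    struct ibv_sge sge = {
        .addr   = (uintptr_t)d_in,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };
    struct ibv_recv_wr wr  = { .wr_id = 1, .sg_list = &sge, .num_sge = 1 };
    struct ibv_recv_wr *bad = NULL;
    return ibv_post_recv(qp, &wr, &bad);
}
```

In MVAPICH2-GDR all of this is hidden behind MPI_Bcast; the sketch only illustrates why the destination needs no extra copy or PCIe staging traffic.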

SLIDE 17: Overlap Opportunities

[Timeline figure: Nodes A, B, and C each broadcast concurrently; on every node the GPU, CPU, and HCA activities (cudaMemcpyAsync, cudaStreamSynchronize, IB hardware multicast, GDR write) overlap both within a node and across nodes]

With this overlap, only the first chunk's PCIe copy is exposed, and the per-broadcast latency becomes:

\[
\frac{C}{B_{PCIe}} + \frac{M}{U} \times \left(t_s + t_o(n) + \frac{U}{B_H}\right)
\]
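Relative to the direct GDR-read multicast of Slide 11, the streaming design trades the GDR read term for a host read plus a single staged chunk (a gloss on the two formulas, using B_H ≫ B_G):

\[
\frac{M}{U}\left(t_s + t_o(n) + \frac{U}{B_G}\right) \;\;\longrightarrow\;\; \frac{C}{B_{PCIe}} + \frac{M}{U}\left(t_s + t_o(n) + \frac{U}{B_H}\right),
\]

so the only non-overlapped cost added by host staging is the first chunk's PCIe copy, \(C/B_{PCIe}\).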

SLIDE 18: Outline

  • Introduction
  • Analysis
  • Proposed Design
  • Performance Evaluation
    – OSU Micro-Benchmarks (OMB)
    – Deep Learning Framework
  • Conclusion and Future Work

SLIDE 19: Overview of the MVAPICH2 Project

  • High-performance, open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
    – MVAPICH (MPI-1) and MVAPICH2 (MPI-2.2 and MPI-3.0); started in 2001, first version available in 2002
    – MVAPICH2-X (MPI + PGAS), available since 2011
    – Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
    – Support for virtualization (MVAPICH2-Virt), available since 2015
    – Support for energy awareness (MVAPICH2-EA), available since 2015
    – Support for InfiniBand network analysis and monitoring (OSU INAM), available since 2015
    – Used by more than 2,775 organizations in 85 countries
    – More than 420,000 (> 0.4 million) downloads directly from the OSU site
    – Empowering many TOP500 clusters (June '17 ranking):
      Ø 1st: 10,649,600-core Sunway TaihuLight at the National Supercomputing Center in Wuxi, China
      Ø 15th: 241,108-core Pleiades at NASA
      Ø 20th: 462,462-core Stampede at TACC
      Ø 44th: 74,520-core Tsubame 2.5 at Tokyo Institute of Technology
    – Available with the software stacks of many vendors and Linux distros (RedHat and SuSE)
    – http://mvapich.cse.ohio-state.edu
  • Empowering Top500 systems for over a decade
    – From System-X at Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops) to Sunway TaihuLight (1st in Jun '16, 10M cores, 100 PFlops)

SLIDE 20: Experimental Environments

  • RI2 cluster @ The Ohio State University
    – Two 14-core Intel Xeon E5-2680 v4 (Broadwell) processors per node
    – One NVIDIA K80 GPU per node; up to 16 GPU nodes used
    – One single-port InfiniBand EDR HCA per node
    – Mellanox SB7790 and SB7800 InfiniBand switches
  • OSU Micro-Benchmarks (OMB): http://mvapich.cse.ohio-state.edu/benchmarks/
    – osu_bcast: MPI_Bcast latency test
  • Deep learning framework: CUDA-Aware Microsoft Cognitive Toolkit (CA-CNTK)*
    – AlexNet and VGG models with the ImageNet dataset

*D. S. Banerjee, K. Hamidouche and D. K. Panda, "Re-Designing CNTK Deep Learning Framework on Modern GPU Enabled Clusters," IEEE CloudCom, Luxembourg City, 2016, pp. 144-151.

SLIDE 21: Benchmark Evaluation

  • @ RI2 cluster, 16 GPUs, 1 GPU/node

[Plot: broadcast latency (μs, log scale) vs. message size from 4K to 16M bytes for MV2-GDR-Knomial, MV2-GDR-Ring, MCAST-GDR, and MCAST-GDR-Opt; lower is better; the earlier MCAST-GDR hits the GDR read limit at large messages]

[Plot: broadcast latency (μs, log scale) for a 2 MB message on 2, 4, 8, and 16 GPU nodes; MCAST-GDR-Opt stays near-constant while the other schemes grow]

  • Provides near-constant latency across system sizes
  • Reduces latency by up to 65% for large messages

SLIDE 22: Deep Learning Framework Evaluation

  • @ RI2 cluster, 16 GPUs, 1 GPU/node:
    – CUDA-Aware Microsoft Cognitive Toolkit (CA-CNTK) without modification

[Plots: training time (s) on 8 and 16 GPU nodes for the AlexNet model (left) and the VGG model (right) with MV2-GDR-Knomial, MV2-GDR-Ring, and MCAST-GDR-Opt; lower is better; annotated improvements of 15% and 24% for AlexNet, 6% and 15% for VGG]

  • Reduces training time by up to 24% for the AlexNet model and 15% for the VGG model
  • Higher improvement is expected at larger system sizes
SLIDE 23: Performance Prediction

  • Based on the architecture of the RI2 cluster

[Plots: model-based latency estimates (s, log scale) vs. number of broadcast sources for the K-nomial-based, Ring-based, and MCAST-GDR-Opt schemes, alongside experimental measurements for 2 to 16 sources; the models are within 10% error of the experiments]

Model parameters: M = 2 MB, C = 512 KB, U = 4 KB, B_H ≈ 100 Gbps, B_PCIe = 8 Gbps, and t_o(n) ≈ (1/β) ln n with 15 ≤ β ≤ 20.
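To see how the model turns into numbers, here is a small sketch (not from the talk) that evaluates the streaming design's predicted latency with the parameters above. The setup time t_s and the unit of t_o(n) are not pinned down on the slide, so both are assumptions here:

```c
/* Sketch (not from the talk): plug the slide's parameters into the
 * streaming-broadcast model  T = C/B_PCIe + (M/U) * (t_s + t_o(n) + U/B_H).
 * ASSUMPTIONS: t_s = 1 us, and t_o(n) = ln(n)/beta is taken in
 * microseconds; the slide specifies neither. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    const double M      = 2.0 * 1024 * 1024;  /* message size M: 2 MB       */
    const double C      = 512.0 * 1024;       /* chunk size C: 512 KB       */
    const double U      = 4.0 * 1024;         /* IB-MCAST MTU U: 4 KB       */
    const double B_H    = 100e9 / 8.0;        /* 100 Gbps in bytes/sec      */
    const double B_PCIe = 8e9 / 8.0;          /* 8 Gbps in bytes/sec        */
    const double t_s    = 1e-6;               /* assumed setup time (s)     */
    const double beta   = 17.5;               /* middle of the 15..20 range */

    for (int n = 2; n <= 16; n *= 2) {
        double t_o = 1e-6 * log(n) / beta;    /* per-packet MCAST overhead */
        double T   = C / B_PCIe + (M / U) * (t_s + t_o + U / B_H);
        printf("n = %2d processes: predicted broadcast latency = %.0f us\n",
               n, T * 1e6);
    }
    return 0;
}
```

Under these assumptions the prediction lands around a millisecond for a 2 MB broadcast, roughly the scale shown in the prediction plots, and it varies only slightly with n because only t_o(n) depends on the process count.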

SLIDE 24: Outline

  • Introduction
  • Analysis
  • Proposed Design
  • Performance Evaluation
  • Conclusion and Future Work


SLIDE 25: Conclusion

  • Proposed efficient broadcast schemes that leverage the GDR and IB-MCAST features for deep learning applications
    – Optimized streaming design for large-message transfers
  • Provided and evaluated analytical models that capture the essential performance behavior of alternative broadcast schemes on GPU clusters
    Ø These features are included in the latest release of the MVAPICH2-GDR library

SLIDE 26: Future Work

  • Extend the design to other broadcast-based collective algorithms as well as non-blocking operations
    – Allreduce, Allgather, and so on
  • Evaluate the proposed design on upcoming larger-scale GPU clusters

SLIDE 27: Thank You!

Ching-Hsiang Chu, Xiaoyi Lu, Ammar A. Awan, Hari Subramoni, Jahanzeb Hashmi, Bracy Elton and Dhabaleswar K. (DK) Panda

{chu.368, lu.932, awan.10, subramoni.1, hashmi.29}@osu.edu bracy.elton@engilitycorp.com, panda@cse.ohio-state.edu

Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The MVAPICH2 Project: http://mvapich.cse.ohio-state.edu/

This project is supported under the United States Department of Defense (DOD) High Performance Computing Modernization Program (HPCMP) User Productivity Enhancement and Technology Transfer (PETTT) activity (Contract No. GS04T09DBC0017, Engility Corporation). The opinions expressed herein are those of the authors and do not necessarily reflect the views of the DOD or the employer of the author.

SLIDE 28: MCAST-based Broadcast

  • NVIDIA GPUDirect [1]
    – Remote direct memory access (RDMA) transfers between GPUs and other PCIe devices ⇒ GDR
    – and more…
  • InfiniBand (IB) hardware multicast (IB-MCAST) [2]
    – Enables efficient designs of broadcast operations
      Ø Host-based [3]
      Ø GPU-based [4]

[1] https://developer.nvidia.com/gpudirect
[2] G. F. Pfister, "An Introduction to the InfiniBand Architecture," High Performance Mass Storage and Parallel I/O, Chapter 42, pp. 617-632, Jun 2001.
[3] J. Liu, A. R. Mamidala, and D. K. Panda, "Fast and Scalable MPI-level Broadcast using InfiniBand's Hardware Multicast Support," in IPDPS 2004, p. 10, April 2004.
[4] A. Venkatesh, H. Subramoni, K. Hamidouche, and D. K. Panda, "A High Performance Broadcast Design with Hardware Multicast and GPUDirect RDMA for Streaming Applications on InfiniBand Clusters," in HiPC 2014, Dec 2014.