

SLIDE 1

CUDA Kernel based Collective Reduction Operations on Large-scale GPU Clusters

Ching-Hsiang Chu, Khaled Hamidouche, Akshay Venkatesh, Ammar Ahmad Awan and Dhabaleswar K. (DK) Panda
Speaker: Sourav Chakraborty
Network-Based Computing Laboratory, Department of Computer Science and Engineering, The Ohio State University

SLIDE 2

Outline

  • Introduction
  • Proposed Designs
  • Performance Evaluation
  • Conclusion

SLIDE 3

Drivers of Modern HPC Cluster Architectures

  • Multi-core processors are ubiquitous
  • InfiniBand very popular in HPC clusters
  • Accelerators/Coprocessors becoming common in high-end systems
  • Pushing the envelope for Exascale computing

[Figure: multi-core processors, accelerators/coprocessors (high compute density, high performance/watt, >1 TFlop/s double precision on a chip) and high-performance InfiniBand interconnects (<1 us latency, >100 Gbps bandwidth), as deployed in systems such as Tianhe-2, Titan, Stampede and Tianhe-1A]

SLIDE 4

Accelerators in HPC Systems

  • Growth of accelerator-enabled clusters in the last 3 years

– 22% of Top 50 clusters are boosted by NVIDIA GPUs in Nov'15 – from the Top500 list (http://www.top500.org)

[Chart: system count of accelerator-enabled Top500 systems from June 2013 to Nov 2015, broken down by NVIDIA Kepler, NVIDIA Fermi and Intel Xeon Phi]

SLIDE 5

Motivation – Collectives in Applications

  • Scientific parallel applications spend a considerable amount of time in collective communication operations

– E.g. deep learning frameworks such as Caffe

[Diagram: GPU Node 1 through GPU Node N exchanging data with MPI_Bcast/MPI_Scatter and MPI_Gather/MPI_Reduce, interleaved with GPU computations]

SLIDE 6

Motivation – Collective Reduction Operations

  • Scientific parallel applications spend a considerable amount of time in collective communication operations

– Pure communication collectives: Broadcast, Gather, Scatter…
– Compute-oriented collectives: Reduce, Allreduce, Scan
– The communication part is highly optimized

  • Compute-oriented collective operations are not fully optimized for GPU clusters

– The CPU does all the work
– GPU resources are not fully utilized

SLIDE 7

Motivation – Powerful GPU Resources

  • Fast computation

– Massive parallelism

  • Efficient communication

– NVIDIA GPUDirect RDMA

  • GPU features are not being utilized for all collectives
  • Can we leverage these features to further optimize the compute-oriented collectives on GPU clusters?

http://www.nvidia.com/object/gpu-accelerated-computing.html
https://developer.nvidia.com/gpudirect

SLIDE 8

Problem Statement

  • How to eliminate explicit data movements between host and GPU memories?

– cudaMemcpy calls are expensive!

  • How to handle the GPU-to-GPU communication after the computations finish?
  • When to use the GPU for compute-oriented collective operations?

– Launching kernels brings overhead; how to minimize it?

SLIDE 9

Overview

  • Design a framework that exploits CUDA kernels to efficiently handle compute-oriented collectives
  • Propose extensions to the existing collective algorithms to make them GPU-aware, compute-oriented algorithms

– MPI_Reduce, MPI_Allreduce and MPI_Scan

  • Detailed analysis and evaluation on different GPU systems, including a Cray CS-Storm system

SLIDE 10

Outline

  • Introduction
  • Proposed Designs
  • Performance Evaluation
  • Conclusion

SLIDE 11

Design Consideration

  • Existing designs

1. Explicitly copy the data from GPU memory to host memory
2. Host-to-host communication to the remote processes
3. Perform the computation on the CPU
4. Explicitly copy the data from host memory back to GPU memory

  • Proposed designs

1. GPU-to-GPU communication

  • NVIDIA GPUDirect RDMA (GDR)
  • Pipelining through the host for large messages

2. Perform the computation on the GPU

  • Efficient CUDA kernels

[Diagram: Node A and Node B, each with CPU, host memory, GPU, PCIe and IB adapter; the numbered steps above are marked on the data paths]
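
As a point of reference, here is a minimal sketch of the existing host-staged path for a single reduction (buffer and function names are illustrative, not taken from MVAPICH2; error checking omitted):

#include <mpi.h>
#include <cuda_runtime.h>
#include <vector>

// Host-staged MPI_Reduce for MPI_SUM on floats: the four steps listed above.
void host_staged_reduce(const float *d_sendbuf, float *d_recvbuf,
                        int count, int root, MPI_Comm comm) {
    std::vector<float> h_send(count), h_recv(count);
    // (1) Explicit copy from GPU memory to host memory (expensive for large counts)
    cudaMemcpy(h_send.data(), d_sendbuf, count * sizeof(float), cudaMemcpyDeviceToHost);
    // (2) + (3) Host-to-host communication and CPU computation inside MPI_Reduce
    MPI_Reduce(h_send.data(), h_recv.data(), count, MPI_FLOAT, MPI_SUM, root, comm);
    // (4) Explicit copy of the reduced result from host memory back to GPU memory
    int rank; MPI_Comm_rank(comm, &rank);
    if (rank == root)
        cudaMemcpy(d_recvbuf, h_recv.data(), count * sizeof(float), cudaMemcpyHostToDevice);
}

The proposed designs remove steps (1) and (4) by communicating GPU buffers directly (GDR, or pipelined through the host for large messages) and replace the CPU computation with a CUDA kernel such as the vector addition shown on the last slide.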

SLIDE 12

K-nomial MPI_Reduce

  • Tree-based k-nomial algorithm

– Only the non-leaf nodes perform the reduction operation

  • Pros & Cons

– Load balance, efficient/scalable communication
– Higher average latency

[Diagram: k-nomial reduction tree over ranks 1–7, with communication steps [1], [2], [3]]
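
For the binomial case (k = 2), the tree can be sketched with point-to-point calls as below; the combine() step is where the proposed designs substitute a CUDA kernel for the CPU loop. This is a generic sketch under that assumption, not MVAPICH2 code:

#include <mpi.h>

// Combine src into dst; BR-DD would launch a CUDA kernel here instead of a CPU loop.
static void combine(float *dst, const float *src, int count) {
    for (int i = 0; i < count; ++i) dst[i] += src[i];
}

// Binomial-tree MPI_Reduce sketch for MPI_SUM on floats, rooted at rank 0.
// buf holds this rank's partial result; tmp is scratch space of the same size.
void binomial_reduce(float *buf, float *tmp, int count, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    for (int mask = 1; mask < size; mask <<= 1) {
        if (rank & mask) {                  // leaf at this step: send partial result, done
            MPI_Send(buf, count, MPI_FLOAT, rank - mask, 0, comm);
            return;
        } else if (rank + mask < size) {    // non-leaf: receive and combine
            MPI_Recv(tmp, count, MPI_FLOAT, rank + mask, 0, comm, MPI_STATUS_IGNORE);
            combine(buf, tmp, count);
        }
    }
}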

SLIDE 13

Cost Analysis

  • Host-based Binomial-Reduce (Default)

$\log_k(n) \times [\vartheta \times \mathit{Comm}_{Host}(M) + \mathit{Comp}_{Host}(M)] + 2 \times \mathit{Copy}(M)$

– Expensive cudaMemcpy before/after the reduction operation; relatively slow computation on the CPU; fast host-to-host communication

  • GPU-based Binomial-Reduce (BR-DD)

$\log_k(n) \times [\vartheta \times \mathit{Comm}_{GDR}(M) + \mathit{Overhead}_{GPU}(M) + \mathit{Comp}_{GPU}(M)]$

– Fast, highly parallelized computation on the GPU; overhead of launching CUDA kernels (~10 us); GDR-based GPU-to-GPU communication
– $n$: number of processes, $M$: message size, $\vartheta$: a constant that depends on the tree initialization
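
One way to read these models: BR-DD is expected to overtake the host-based default roughly when

$$\log_k(n)\,\big[\mathit{Overhead}_{GPU}(M) + \mathit{Comp}_{GPU}(M)\big] \;<\; \log_k(n)\,\big[\mathit{Comp}_{Host}(M) + \vartheta\,(\mathit{Comm}_{Host}(M) - \mathit{Comm}_{GDR}(M))\big] + 2\,\mathit{Copy}(M)$$

Since the copy terms and the CPU/GPU compute gap grow with the message size $M$ while the kernel-launch overhead stays roughly constant (~10 us), the inequality favors BR-DD for large messages, which matches the evaluation results later in the talk.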

SLIDE 14

Gather-first MPI_Reduce

  • Gather-first algorithm

– The root gathers all the data and performs the computation

  • Since only the root needs the final result
  • Pros & Cons

– Low computation overhead
– Poor scalability

[Diagram: ranks 1–7 all send their data directly to the root]
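
A gather-first sketch: the root collects every contribution with a single MPI_Gather and then performs one combine pass. In the GPU-based variants (GR-HH, GR-HD/GR-DH, GR-DD) the gathered buffer would sit in device memory and combine() would be a CUDA kernel; the code below is a generic CPU illustration with illustrative names:

#include <mpi.h>
#include <vector>

// Combine src into dst; the GPU-based variants would launch a CUDA kernel here.
static void combine(float *dst, const float *src, int count) {
    for (int i = 0; i < count; ++i) dst[i] += src[i];
}

// Gather-first MPI_Reduce sketch for MPI_SUM on floats, rooted at rank 0.
void gather_reduce(const float *sendbuf, float *recvbuf, int count, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    std::vector<float> gathered(rank == 0 ? (size_t)size * count : 0);
    // Every rank sends once; only the root receives all (size * count) elements.
    MPI_Gather(sendbuf, count, MPI_FLOAT,
               gathered.data(), count, MPI_FLOAT, 0, comm);
    if (rank == 0) {
        for (int i = 0; i < count; ++i) recvbuf[i] = gathered[i];
        for (int r = 1; r < size; ++r)               // (n - 1) combine steps at the root
            combine(recvbuf, gathered.data() + (size_t)r * count, count);
    }
}

Only one kernel launch is needed at the root in the GR-DD case, which is why its cost model on the next slide charges the GPU launch overhead once rather than log(n) times.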

SLIDE 15

Cost Analysis

  • Host-based Gather and Reduce (GR-H-HH)

$(n - 1) \times [\mathit{Comm}_{Host}(M) + \mathit{Comp}_{Host}(M)] + 2 \times \mathit{Copy}(M)$

  • Host-based Gather, GPU-based Reduce (GR-HH)

$(n - 1) \times (\mathit{Comm}_{Host}(M) + \mathit{Overhead}_{GPU}(M) + \mathit{Comp}_{GPU}(M) + 2 \times \mathit{Copy}(M))$

  • GPU-based Gather and Reduce (GR-DD)

$(n - 1) \times \mathit{Comm}_{GDR}(M) + \mathit{Overhead}_{GPU}(M) + \mathit{Comp}_{GPU}(M)$

– Could suffer from scalability issues → good for small messages or small scale
– Less affected by the CUDA kernel launch overhead → good for small messages

SLIDE 16

GPU-based MPI_Allreduce and MPI_Scan

  • Recursive doubling algorithm

– Every process needs to perform the computation

  • Pros & Cons

– Load balance, efficient/scalable communication
– Higher average latency

[Diagram: recursive-doubling exchange pattern over ranks 1–7, with steps [1], [2], [3]]
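
A recursive-doubling sketch for a power-of-two number of ranks: every rank exchanges data with a partner and combines at each of the log2(n) steps, so every rank ends up with the full result (Allreduce). In the RD-DD variant each exchange would go over GDR and combine() would be a CUDA kernel; names below are illustrative:

#include <mpi.h>

// Combine src into dst; RD-DD would launch a CUDA kernel here instead of a CPU loop.
static void combine(float *dst, const float *src, int count) {
    for (int i = 0; i < count; ++i) dst[i] += src[i];
}

// Recursive-doubling MPI_Allreduce sketch for MPI_SUM on floats.
// Assumes the number of ranks is a power of two.
void recursive_doubling_allreduce(float *buf, float *tmp, int count, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    for (int mask = 1; mask < size; mask <<= 1) {
        int partner = rank ^ mask;              // pairwise exchange partner at this step
        MPI_Sendrecv(buf, count, MPI_FLOAT, partner, 0,
                     tmp, count, MPI_FLOAT, partner, 0,
                     comm, MPI_STATUS_IGNORE);
        combine(buf, tmp, count);               // buf now holds the partial sum of 2*mask ranks
    }
}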

SLIDE 17

Cost Analysis

  • GPU-based Recursive Doubling (RD-DD)

$\log_2(n) \times [\vartheta \times \mathit{Comm}_{GDR}(M) + \mathit{Overhead}_{GPU}(M) + \mathit{Comp}_{GPU}(M)]$

– Per-step costs are the same as in the BR-DD MPI_Reduce analysis

  • GPU-based Binomial-Reduce-Broadcast (BRB-DD)

$\log_k(n) \times [2 \times \vartheta \times \mathit{Comm}_{GDR}(M) + \mathit{Overhead}_{GPU}(M) + \mathit{Comp}_{GPU}(M)]$

SLIDE 18

Alternative and Extended Designs

| Communication         | Computation | Design            | Algorithm             | Benefit                      |
|-----------------------|-------------|-------------------|-----------------------|------------------------------|
| Host<->Host           | CPU         | BR-H-HH (Default) | Binomial-Reduce       | Large scale, small messages  |
| Host<->Host           | CPU         | RD-H-HH (Default) | Recursive doubling    | Large scale, small messages  |
| Host<->Host           | CPU         | GR-H-HH           | Gather-Reduce         | Small scale, small messages  |
| Host<->Host           | GPU         | GR-HH             | Gather-Reduce         | Small scale, small messages  |
| Host<->Device (GDR)   | GPU         | GR-HD / GR-DH     | Gather-Reduce         | Small scale, small messages  |
| Device<->Device (GDR) | GPU         | GR-DD             | Gather-Reduce         | Small scale, small messages  |
| Device<->Device (GDR) | GPU         | BR-DD             | Binomial-Reduce       | Large messages for any scale |
| Device<->Device (GDR) | GPU         | BRB-DD            | Binomial-Reduce-Bcast | Large messages for any scale |
| Device<->Device (GDR) | GPU         | RD-DD             | Recursive doubling    | Large messages for any scale |
| Host<->Device (GDR)   | GPU         | RD-HD / RD-DH     | Recursive doubling    | Large messages for any scale |

SLIDE 19

Outline

  • Introduction
  • Proposed Designs
  • Performance Evaluation
  • Conclusion

SLIDE 20

Overview of the MVAPICH2 Project

  • High Performance open-source MPI Library for InfiniBand, 10-40Gig/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE)

– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), available since 2002
– MVAPICH2-X (MPI + PGAS), available since 2011

– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), Available since 2014

– Support for Virtualization (MVAPICH2-Virt), available since 2015
– Support for Energy-Awareness (MVAPICH2-EA), available since 2015
– Used by more than 2,550 organizations in 79 countries
– More than 360,000 (> 0.36 million) downloads from the OSU site directly

– Empowering many TOP500 clusters (Nov ‘15 ranking)

  • 10th ranked 519,640-core cluster (Stampede) at TACC
  • 13th ranked 185,344-core cluster (Pleiades) at NASA
  • 25th ranked 76,032-core cluster (Tsubame 2.5) at Tokyo Institute of Technology and many others

– Available with software stacks of many vendors and Linux Distros (RedHat and SuSE)

– http://mvapich.cse.ohio-state.edu

  • Empowering Top500 systems for over a decade

– System-X from Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops)
– Stampede at TACC (10th in Nov'15, 519,640 cores, 5.168 PFlops)

SLIDE 21

Experimental Environments

  • 1. Wilkes cluster @ University of Cambridge

– 2 NVIDIA K20c GPUs per node
– Used for inter-node experiments

  • Up to 32 GPU nodes

  • 2. CSCS cluster @ Swiss National Supercomputing Centre

– Cray CS-Storm system
– 8 NVIDIA K80 GPUs per node (= 16 NVIDIA K40 GPUs per node)
– Used for intra-node experiments

  • Up to 96 GPUs over 11 nodes

SLIDE 22

Evaluation – MPI_Reduce @ Wilkes (32 GPUs)

[Charts: MPI_Reduce latency (us) vs. message size (4 B – 8 KB and larger), comparing Default, BR-DD, GR-DD, GR-HD, GR-HH and GR-H-HH]

– Gather-first approaches win for small messages
– The k-nomial GPU-based approach wins for large messages

SLIDE 23

Evaluation – MPI_Reduce @ CSCS (96 GPUs)

[Charts: MPI_Reduce latency (us) vs. message size (4 B – 4 KB and 16 KB – 4 MB), comparing Default, BR-DD, GR-DD, GR-HD, GR-HH and GR-H-HH]

– Gather-first approaches win for small messages
– The k-nomial GPU-based approach wins for large messages

SLIDE 24

Evaluation – MPI_Allreduce

[Charts: MPI_Allreduce latency (ms) vs. message size (96 GPUs @ CSCS) and vs. system size (2–32 nodes @ Wilkes), comparing Default, RD-DD and BRB-DD; good scalability]

SLIDE 25

Evaluation – MPI_Scan

[Charts: MPI_Scan latency (ms) vs. system size (2–32 nodes, 2 MB message) and vs. message size (64 KB – 4 MB), comparing Default, RD-DD and RD-HD; 96 GPUs @ CSCS and 32 GPUs @ Wilkes; good scalability]

SLIDE 26

Prediction

  • Use the proposed analytical models to predict the performance for large-scale GPU clusters

[Charts: model-based estimation vs. experimental results on 32 GPUs of the Wilkes cluster, and predicted latency (us) vs. message size (4 B – 4 MB) for 1024 GPUs, comparing Default with RD-DD/BR-DD]

SLIDE 27

Conclusion

  • CUDA kernel based designs significantly improve the performance of compute-oriented collective operations

– MPI_Reduce, MPI_Allreduce and MPI_Scan
– CUDA kernel based reduction operations → fast computation
– The GPUDirect feature → efficient GPU-to-GPU communication

  • Future work

– Performing application-level evaluation

  • Deep learning frameworks such as Caffe

– Will be included in a future release of the MVAPICH2-GDR library

SLIDE 28

Thank You!

Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The MVAPICH2 Project: http://mvapich.cse.ohio-state.edu/

SLIDE 29

Reduction Operations on GPU

  • CUDA Kernels

– Example: vector addition for the MPI_SUM operation

template <class T>
__global__ void vector_addition(T *dst, T *src, size_t count) {
    // Grid-stride loop: each thread starts at its global index and strides by the
    // total number of threads, so any grid size covers all `count` elements.
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    for (; i < count; i += (size_t)blockDim.x * gridDim.x)
        dst[i] += src[i];   // element-wise MPI_SUM: dst = dst + src
}
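
A hypothetical host-side launch of this kernel on two device buffers (the grid/block sizes and the helper name launch_sum are illustrative; a tuned library would pick sizes per message and typically use CUDA streams):

#include <cuda_runtime.h>

// Reduce d_src into d_dst on the GPU: d_dst[i] += d_src[i] for count elements.
void launch_sum(float *d_dst, float *d_src, size_t count) {
    if (count == 0) return;
    int threads = 256;
    int blocks  = (int)((count + threads - 1) / threads);
    if (blocks > 1024) blocks = 1024;        // cap the grid; the grid-stride loop covers the rest
    vector_addition<<<blocks, threads>>>(d_dst, d_src, count);
    cudaDeviceSynchronize();                 // make the result visible before the next communication step
}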

More information about optimizing your CUDA kernels:
http://developer.download.nvidia.com/books/cuda-by-example/cuda-by-example-sample.pdf
http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html