Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems


SLIDE 1

Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems

Ching-Hsiang Chu, Khaled Hamidouche, Akshay Venkatesh, Dip S. Banerjee, Hari Subramoni and Dhabaleswar K. (DK) Panda, Network-based Computing Laboratory, Department of Computer Science and Engineering, The Ohio State University

SLIDE 2

Outline

  • Introduction
  • Proposed Designs
  • Performance Evaluation
  • Conclusion

SLIDE 3

Drivers of Modern HPC Cluster Architectures

  • Multi-core processors are ubiquitous
  • InfiniBand is very popular in HPC clusters
  • Accelerators/Coprocessors becoming common in high-end systems
  • Pushing the envelope for Exascale computing

[Figure: multi-core processors; high-performance interconnects (InfiniBand, <1 us latency, >100 Gbps bandwidth); accelerators/coprocessors (high compute density, high performance/watt, >1 Tflop/s double precision on a chip); example systems: Tianhe-2, Titan, Stampede, Tianhe-1A]

SLIDE 4

Accelerators in HPC Systems

  • Growth of accelerator-enabled clusters in the last 3 years
    – 22% of the Top 50 clusters are boosted by NVIDIA GPUs in Nov '15
    – From the Top500 list (http://www.top500.org)

[Chart: system count of accelerator-enabled systems from June 2013 to Nov 2015, broken down into NVIDIA Kepler, NVIDIA Fermi and Intel Xeon Phi]

SLIDE 5

Motivation

  • Parallel applications on GPU clusters
    – CUDA (Compute Unified Device Architecture):
      • Kernel computation on NVIDIA GPUs
    – CUDA-Aware MPI (Message Passing Interface):
      • Communications across processes/nodes
      • Non-blocking communication to overlap with CUDA kernels

MPI_Isend(Buf1, ..., request1);
MPI_Isend(Buf2, ..., request2);
/* Independent computations on CPU/GPU */
MPI_Wait(request1, status1);
MPI_Wait(request2, status2);
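To make the pattern above concrete, here is a minimal sketch of CUDA-aware non-blocking communication overlapped with independent GPU work; the kernel name my_kernel, the buffer names and the launch geometry are illustrative assumptions, not part of the original slides.

#include <mpi.h>
#include <cuda_runtime.h>

/* Hedged sketch: overlap independent GPU work with CUDA-aware non-blocking
   sends. my_kernel, d_buf1, d_buf2, d_work and n are placeholders. */
__global__ void my_kernel(double *buf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] += 1.0;          /* stand-in for independent GPU work */
}

void overlapped_sends(double *d_buf1, double *d_buf2, double *d_work,
                      int n, int peer)
{
    MPI_Request reqs[2];
    MPI_Status  stats[2];

    /* CUDA-aware MPI accepts device pointers directly */
    MPI_Isend(d_buf1, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(d_buf2, n, MPI_DOUBLE, peer, 1, MPI_COMM_WORLD, &reqs[1]);

    /* independent computations on the GPU proceed while MPI progresses */
    my_kernel<<<(n + 255) / 256, 256>>>(d_work, n);

    MPI_Waitall(2, reqs, stats);       /* complete the sends */
    cudaDeviceSynchronize();           /* complete the overlapped kernel */
}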

SLIDE 6

Motivation

  • Use of non-contiguous data is becoming common
    – Easy to represent complex data structures
      • MPI Datatypes
    – E.g., fluid dynamics, image processing…
  • What if the data are in GPU memory?
    1. Copy the data to the CPU to perform the packing/unpacking
       • Slower for large messages
       • Data movements between GPU and CPU are expensive
    2. Utilize a GPU kernel to perform the packing/unpacking*
       • No explicit copies, faster for large messages

*R. Shi et al., “HAND: A Hybrid Approach to Accelerate Non-contiguous Data Movement Using MPI Datatypes on GPU Clusters,” in 43rd ICPP, Sept 2014, pp. 221–230.
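As an illustration of approach 2 above, a simplified device-side packing kernel for a strided (vector-like) layout might look like the sketch below; the kernel name pack_vector, its arguments and the launch configuration are assumptions for illustration, not the implementation from the paper.

/* Hedged sketch of approach 2: pack a strided GPU layout into a contiguous
   staging buffer directly on the device, avoiding an explicit copy to the
   host. pack_vector, d_src, d_pack, count and stride are placeholders. */
__global__ void pack_vector(const double *src, double *dst,
                            int count, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count)
        dst[i] = src[i * stride];      /* gather one strided element per thread */
}

/* launch example:
   pack_vector<<<(count + 255) / 256, 256>>>(d_src, d_pack, count, stride);
   the contiguous d_pack buffer can then be handed to the communication layer. */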

SLIDE 7

Motivation – Non-Contiguous Data Movement in MPI

Common Scenario:

MPI_Isend(Buf1, ..., req1);
MPI_Isend(Buf2, ..., req2);
/* Application work on the CPU/GPU */
MPI_Waitall(req, …);

*Buf1, Buf2, … contain non-contiguous MPI Datatypes

[Timeline figure: the packing/unpacking for each Isend and the subsequent communication are serialized, wasting computing resources on both the CPU and GPU]

SLIDE 8

Problem Statement

  • Low overlap between CPU and GPU for applications
    – Packing/Unpacking operations are serialized
  • CPU/GPU resources are not fully utilized
    – GPU threads remain idle for most of the time
    – Low utilization, low efficiency

[Radar chart: Overlap, Productivity, Performance and Resource Utilization for the Proposed, User Naive and User Advanced approaches; farther from the center is better]

Can we have designs to leverage new GPU technology to address these issues?

SLIDE 9

Goals of this work

  • Propose new designs that leverage new NVIDIA GPU technologies
    Ø Hyper-Q technology (Multi-Streaming)
    Ø CUDA Event and Callback
  • Achieving
    Ø High performance and resource utilization for applications
    Ø High productivity for developers

SLIDE 10

Outline

  • Introduction
  • Proposed Designs
    – Event-based
    – Callback-based
  • Performance Evaluation
  • Conclusion

SLIDE 11

Overview

[Timeline figure comparing the Existing Design and the Proposed Design: in the existing design each Isend initiates its packing kernel on a stream, waits for the kernel (WFK) and only then starts the send, so Isend(1), Isend(2) and Isend(3) proceed serially; in the proposed design the packing kernels are initiated back-to-back on streams and the WFK/send steps are overlapped with progress, so the proposed design is expected to finish earlier than the existing one]

SLIDE 12

Event-based Design

  • CUDA Event Management
    – Provides a mechanism to signal when tasks have occurred in a CUDA stream
  • Basic design idea
    1. CPU launches a CUDA packing/unpacking kernel
    2. CPU creates a CUDA event and then returns immediately
       • GPU sets the status as ‘completed’ when the kernel is completed
    3. In MPI_Wait/MPI_Waitall:
       • CPU queries the events when the packed/unpacked data is required for communication
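A minimal sketch of this event-based flow is shown below, assuming a hypothetical pack_kernel and a simplified progress loop; it illustrates the idea only and is not the MVAPICH2-GDR implementation.

/* Hedged sketch of the event-based design: launch the packing kernel,
   record an event, return immediately, and let MPI_Wait poll the event
   before handing the packed buffer to the network. pack_kernel, d_src,
   d_pack, count and start_network_send() are hypothetical placeholders. */
cudaStream_t stream;
cudaEvent_t  pack_done;
int threads = 256, blocks = (count + 255) / 256;

cudaStreamCreate(&stream);
cudaEventCreate(&pack_done);

/* steps 1-2: inside MPI_Isend, launch packing and record the event */
pack_kernel<<<blocks, threads, 0, stream>>>(d_src, d_pack, count);
cudaEventRecord(pack_done, stream);
/* MPI_Isend returns here without blocking */

/* step 3: inside MPI_Wait/MPI_Waitall, poll the event, then send */
while (cudaEventQuery(pack_done) != cudaSuccess) {
    /* make progress on other requests */
}
start_network_send(d_pack, count);     /* hypothetical send of the packed data */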

SLIDE 13

Event-based Design

[Sequence diagram across HCA, CPU and GPU: each MPI_Isend() launches pack_kernel1<<< >>>, pack_kernel2<<< >>>, pack_kernel3<<< >>> on the GPU and records an event with cudaEventRecord(); in MPI_Waitall() the CPU queries the events and makes progress, the HCA performs the sends, and the requests complete when the send completions arrive]

SLIDE 14

Event-based Design

  • Major benefits
    – Overlap between CPU communication and GPU packing kernel
    – GPU resources are highly utilized
  • Limitation
    – CPU is required to keep checking the status of the event
      • Lower CPU utilization

MPI_Isend(Buf1, ..., request1);
MPI_Isend(Buf2, ..., request2);
MPI_Wait(request1, status1);
MPI_Wait(request2, status2);

SLIDE 15

Callback-based Design

  • CUDA Stream Callback
    – Launching work automatically on the CPU when something has completed on the CUDA stream
    – Restrictions:
      • Callbacks are processed by a driver thread, where no CUDA APIs can be called
      • Overhead when initializing the callback function
  • Basic design idea
    1. CPU launches a CUDA packing/unpacking kernel
    2. CPU adds a Callback function and then returns immediately
    3. Callback function wakes up a helper thread to process the communication
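A minimal sketch of this callback-based flow, assuming a pthread-based helper thread and a hypothetical pack_kernel, is given below; because no CUDA APIs may be called from the callback, it only signals the helper thread. This is illustrative only, not the MVAPICH2-GDR implementation.

#include <pthread.h>
#include <cuda_runtime.h>

/* Hedged sketch of the callback-based design: the CUDA callback runs on a
   driver thread where CUDA APIs are not allowed, so it only signals a helper
   thread, which then hands the packed buffer to the network.
   pack_kernel and start_network_send() are hypothetical placeholders. */
static pthread_mutex_t lock  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  ready = PTHREAD_COND_INITIALIZER;
static int pack_complete = 0;

static void CUDART_CB on_pack_done(cudaStream_t s, cudaError_t status, void *arg)
{
    pthread_mutex_lock(&lock);
    pack_complete = 1;                  /* signal only; no CUDA calls here */
    pthread_cond_signal(&ready);
    pthread_mutex_unlock(&lock);
}

static void *helper_thread(void *arg)
{
    pthread_mutex_lock(&lock);
    while (!pack_complete)
        pthread_cond_wait(&ready, &lock);
    pthread_mutex_unlock(&lock);
    /* start_network_send(d_pack, count): process the communication */
    return NULL;
}

/* inside MPI_Isend (steps 1-2): launch the kernel, add the callback, return
   pack_kernel<<<blocks, threads, 0, stream>>>(d_src, d_pack, count);
   cudaStreamAddCallback(stream, on_pack_done, NULL, 0);                 */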

SLIDE 16

Callback-based Design

[Sequence diagram across HCA, CPU (main, helper and callback threads) and GPU: each MPI_Isend() launches pack_kernel1<<< >>>, pack_kernel2<<< >>>, pack_kernel3<<< >>> and registers a callback with addCallback(); the main thread continues with CPU computations and MPI_Waitall(), each callback wakes the helper thread, the HCA performs the sends, and the requests complete when the send completions arrive]

SLIDE 17

Callback-based Design

  • Major benefits
    – Overlap between CPU communication and GPU packing kernel
    – Overlap between CPU communication and other computations
    – Higher CPU and GPU utilization

MPI_Isend(Buf1, ..., &requests[0]);
MPI_Isend(Buf2, ..., &requests[1]);
MPI_Isend(Buf3, ..., &requests[2]);
// Application work on the CPU
MPI_Waitall(requests, status);

SLIDE 18

Outline

  • Introduction
  • Proposed Designs
  • Performance Evaluation
    – Benchmark
    – HaloExchange-based Application Kernel
  • Conclusion

SLIDE 19

Overview of the MVAPICH2 Project

  • High Performance open-source MPI Library for InfiniBand, 10-40Gig/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE)
    – MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), available since 2002
    – MVAPICH2-X (MPI + PGAS), available since 2011
    – Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
    – Support for Virtualization (MVAPICH2-Virt), available since 2015
    – Support for Energy-Awareness (MVAPICH2-EA), available since 2015
    – Used by more than 2,575 organizations in 80 countries
    – More than 376,000 (0.37 million) downloads from the OSU site directly
    – Empowering many TOP500 clusters (Nov ‘15 ranking)
      • 10th ranked 519,640-core cluster (Stampede) at TACC
      • 13th ranked 185,344-core cluster (Pleiades) at NASA
      • 25th ranked 76,032-core cluster (Tsubame 2.5) at Tokyo Institute of Technology, and many others
    – Available with software stacks of many vendors and Linux distros (RedHat and SuSE)
    – http://mvapich.cse.ohio-state.edu
  • Empowering Top500 systems for over a decade
    – System-X from Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops) ->
    – Stampede at TACC (10th in Nov ‘15, 519,640 cores, 5.168 PFlops)

SLIDE 20

Experimental Environments

  1. Wilkes cluster @ University of Cambridge
     – 2 NVIDIA K20c GPUs per node
     – Up to 32 GPU nodes
  2. CSCS cluster @ Swiss National Supercomputing Centre
     – Cray CS-Storm system
     – 8 NVIDIA K80 GPUs per node
     – Up to 96 GPUs over 12 nodes

SLIDE 21

Benchmark-level Evaluation - Performance

  • Modified ‘CUDA-Aware’ DDTBench (http://htor.inf.ethz.ch/research/datatypes/ddtbench/)

[Chart: normalized execution time vs. input size for the NAS_MG_y, SPECFEM3D_OC, WRF_sa and SPECFEM3D_CM tests, comparing Default, Event-based and Callback-based designs; lower is better; speedups of 2.7X, 3.4X, 2.6X and 1.5X are highlighted]

SLIDE 22

Benchmark-level Evaluation - Overlap

  • Modified ‘CUDA-Aware’ DDTBench for the NAS_MG_y test
    – Injected dummy computations

MPI_Isend(Buf1, ..., request1);
MPI_Isend(Buf2, ..., request2);
MPI_Isend(Buf3, ..., request3);
Dummy_comp(); // Application work on the CPU
MPI_Waitall(requests, status);

[Chart: achieved overlap (%) vs. input size for Default, Event-based and Callback-based designs; higher is better]

SLIDE 23

Application-level Evaluation - Halo Data Exchange

  • MeteoSwiss weather forecasting COSMO* application kernel @ CSCS cluster
  • Multi-dimensional data
    – Contiguous on one dimension
    – Non-contiguous on other dimensions
  • Halo data exchange
    – Duplicate the boundary
    – Exchange the boundary

*http://www.cosmo-model.org/
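To make the halo-exchange pattern concrete, below is a small sketch that exchanges one non-contiguous boundary column of a 2-D, row-major grid using an MPI vector datatype; the grid layout, dimensions and the left neighbor rank are illustrative assumptions rather than the COSMO kernel itself.

#include <mpi.h>

/* Hedged sketch: exchange the non-contiguous left boundary column of an
   nx-by-ny row-major grid with the left neighbor. grid may be a device
   pointer when a CUDA-aware MPI is used. Illustrative only. */
void halo_exchange_left(double *grid, int nx, int ny, int left)
{
    MPI_Datatype column;
    MPI_Request  reqs[2];

    /* one grid column: nx blocks of 1 double with stride ny */
    MPI_Type_vector(nx, 1, ny, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    MPI_Isend(&grid[1], 1, column, left, 0, MPI_COMM_WORLD, &reqs[0]); /* first interior column */
    MPI_Irecv(&grid[0], 1, column, left, 0, MPI_COMM_WORLD, &reqs[1]); /* ghost column */

    /* interior computation (e.g., a CUDA kernel) can overlap here */

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    MPI_Type_free(&column);
}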

SLIDE 24

Application-level (HaloExchange) Evaluation

[Charts: normalized execution time vs. number of GPUs for Default, Callback-based and Event-based designs on the CSCS GPU cluster (16-96 GPUs) and the Wilkes GPU cluster (4-32 GPUs); lower is better; up to 2X and 1.6X improvements are highlighted]

MPI_Isend(Buf1, ..., request1);
MPI_Isend(Buf2, ..., request2);
// Computations on GPU
MPI_Wait(request1, status1);
MPI_Wait(request2, status2);

SLIDE 25

Conclusion

  • Proposed designs can improve the overall performance and utilization of the CPU as well as the GPU
    – Event-based design: overlapping CPU communication with GPU computation
    – Callback-based design: further overlapping with CPU computation
  • Future Work
    – Non-blocking collective operations
    – Contiguous data movements
    – Next-generation GPU architectures
    – Will be available in the MVAPICH2-GDR library

SLIDE 26


Thank You!

Ching-Hsiang Chu

chu.368@osu.edu

The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The MVAPICH2 Project: http://mvapich.cse.ohio-state.edu/

SLIDE 27

Motivation – NVIDIA GPU Feature

  • NVIDIA CUDA Hyper-Q (Multi-stream) technology
    – Multiple CPU threads/processes can launch kernels on a single GPU simultaneously
    – Increasing GPU utilization and reducing CPU idle times

[Image: http://www.hpc.co.jp/images/hyper-q.png]
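As a brief illustration of the multi-stream idea, the sketch below launches independent kernels on separate CUDA streams so that a Hyper-Q capable GPU can execute them concurrently; small_kernel, the buffer array and the launch geometry are illustrative assumptions.

#include <cuda_runtime.h>

/* Hedged sketch: independent kernels issued on different streams may run
   concurrently on a Hyper-Q capable GPU. small_kernel and bufs are placeholders. */
__global__ void small_kernel(double *buf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 2.0;           /* stand-in for packing work */
}

void launch_on_streams(double **bufs, int n, int nstreams)
{
    cudaStream_t streams[8];            /* assume nstreams <= 8 for this sketch */

    for (int s = 0; s < nstreams; s++) {
        cudaStreamCreate(&streams[s]);
        small_kernel<<<(n + 255) / 256, 256, 0, streams[s]>>>(bufs[s], n);
    }
    for (int s = 0; s < nstreams; s++) {
        cudaStreamSynchronize(streams[s]);
        cudaStreamDestroy(streams[s]);
    }
}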

SLIDE 28

Motivation – Non-Contiguous Data Movement in MPI

Manual packing/unpacking:

sbuf = malloc(…); rbuf = malloc(…);
/* Packing */
for (i = 1; i < n; i += 2)
    sbuf[i] = matrix[i][0];
MPI_Send(sbuf, n, MPI_DOUBLE, …);
MPI_Recv(rbuf, n, MPI_DOUBLE, …);
/* Unpacking */
for (i = 1; i < n; i += 2)
    matrix[i][0] = rbuf[i];
free(sbuf); free(rbuf);

Using MPI Datatypes:

MPI_Datatype nt;
MPI_Type_vector(1, 1, n, MPI_DOUBLE, &nt);
MPI_Type_commit(&nt);
MPI_Send(matrix, 1, nt, …);
MPI_Recv(matrix, 1, nt, …);

  • No explicit copies in applications
    Ø Better performance
  • Less code
    Ø Higher productivity