slide-1
SLIDE 1

S6357: Towards Efficient Communication Methods and Models for Scalable GPU-Centric Computing Systems

Holger Fröning, Computer Engineering Group, Ruprecht-Karls University of Heidelberg
GPU Technology Conference 2016

slide-2
SLIDE 2

About us

  • JProf. Dr. Holger Fröning
  • PI of Computer Engineering Group, ZITI, Ruprecht-Karls University of Heidelberg
  • http://www.ziti.uni-heidelberg.de/compeng
  • Research: Application-specific computing under hard power and energy constraints (HW/SW), future emerging technologies

  • High-performance computing (traditional and emerging)
  • GPU computing (heterogeneity & massive concurrency)
  • High-performance analytics (scalable graph computations)
  • Emerging technologies (approximate computing and stacked memory)
  • Reconfigurable logic (digital design and high-level synthesis)
  • Current collaborations
  • Nvidia Research, US; University of Castilla-La Mancha, Albacete, Spain; CERN, Switzerland; SAP, Germany; Georgia Institute of Technology, US; Technical University of Valencia, Spain; TU Graz, Austria; various companies


slide-3
SLIDE 3

The Problem

  • GPUs are powerful high-core-count devices, but only for in-core computations

  • But many workloads cannot be satisfied by a single GPU
  • Technical computing, graph computations, data warehousing, molecular dynamics, quantum chemistry, particle physics, deep learning, spiking neural networks

  • => Multi-GPU, at node level and at cluster level
  • Hybrid programming models
  • While single GPUs are rather simple to program, interactions between multiple GPUs dramatically push complexity!
  • This talk: how good are GPUs at sourcing/sinking network traffic, how should one orchestrate communication, and what do we need for the best performance/energy efficiency?


slide-4
SLIDE 4

Review: Messaging-based Communication

  • Usually Send/Receive or Put/Get
  • MPI as de-facto standard
  • Work request descriptors
  • Issued to the network device
  • Target node, source pointer, length, tag, communication method, ...
  • Irregular accesses, little concurrency
  • Memory registration
  • OS & driver interactions
  • Consistency by polling on completion notifications (a hedged verbs sketch follows this list)
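For concreteness, here is a minimal sketch of posting one such work request with InfiniBand verbs and polling for its completion. It assumes an already set-up queue pair qp, completion queue cq, a memory region mr from ibv_reg_mr(), and a buffer buf of length len; error handling is omitted.

#include <infiniband/verbs.h>
#include <stdint.h>

// Build the descriptor (work request) on the CPU...
struct ibv_sge sge = {
    .addr   = (uintptr_t)buf,  // must point into registered memory
    .length = len,
    .lkey   = mr->lkey,        // local key obtained from ibv_reg_mr()
};
struct ibv_send_wr wr = {
    .wr_id      = 1,                  // user tag, echoed in the completion
    .sg_list    = &sge,
    .num_sge    = 1,
    .opcode     = IBV_WR_SEND,
    .send_flags = IBV_SEND_SIGNALED,  // request a completion notification
};
struct ibv_send_wr *bad_wr;

// ...issue it to the network device...
ibv_post_send(qp, &wr, &bad_wr);

// ...and establish consistency by polling on completion notifications.
struct ibv_wc wc;
while (ibv_poll_cq(cq, 1, &wc) == 0)
    ;  // spin until the completion arrives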


slide-5
SLIDE 5

Lena Oden, Holger Fröning, Infiniband-Verbs on GPU: A case study of controlling an Infiniband network device from the GPU, International Journal of High Performance Computing Applications, Special Issue on Applications for the Heterogeneous Computing Era, Sage Publications, 2015.

Beyond CPU-centric communication


[Figure: source node and target node, each with GPU, CPU, NIC, PCIe root, GPU memory, and host memory; annotated "100x"]

Start-up latency: ~1.5 usec (CPU-controlled) vs. ~15 usec (GPU-controlled Put/Get, IBVERBS)

“… a bad semantic match between communication primitives required by the application and those provided by the network.” - DOE Subcommittee Report, Top Ten Exascale Research Challenges. 02/10/2014


GPUs rather incompatible with messaging:

  • Constructing descriptors (work requests)
  • Registering memory
  • Polling
  • Controlling networking devices

slide-6
SLIDE 6

Communication orchestration - how to source and sink network traffic

slide-7
SLIDE 7

Example application: 3D stencil code

  • Himeno 3D stencil code
  • Solving a Poisson equation using 2D CTAs (marching planes)
  • Multiple iterations using iterative kernel launches
  • Multi-GPU: inter-block and inter-GPU dependencies
  • Dependencies => communication
  • Inter-block: device synchronization required among adjacent CTAs
  • Inter-GPU: all CTAs participate in communication (sourcing and sinking) => device synchronization required


[Figure: control flow using CPU-controlled communication for the 3D Himeno stencil code with 2D CTAs; a hedged sketch follows]
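A minimal sketch of this control flow, assuming illustrative names such as stencil_kernel and exchange_halos (not the talk's actual code):

// One CPU-controlled multi-GPU iteration of the stencil (hedged sketch).
for (int it = 0; it < iterations; ++it) {
    // Compute the local planes with 2D CTAs.
    stencil_kernel<<<grid, block, 0, stream>>>(p_new, p_old);
    cudaStreamSynchronize(stream);  // CPU waits for the kernel to finish
    exchange_halos(p_new);          // e.g. D2H copy, MPI_Sendrecv, H2D copy
    float *tmp = p_old; p_old = p_new; p_new = tmp;  // swap buffers
}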

slide-8
SLIDE 8

Different forms of communication control for an example stencil code


[Figures: control flow using in-kernel synchronization; control flow using stream synchronization, with and without nested parallelism]
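For intuition, a minimal sketch of an in-kernel, device-wide barrier built on global atomics; this is our illustration, not the paper's implementation. The serialization on the atomic counter is the kind of effect behind the instruction replays reported on the next slide.

// Hedged sketch: single-use device-wide barrier over a global atomic counter.
__device__ int g_arrived = 0;

__device__ void device_barrier(int num_ctas) {
    __syncthreads();                            // sync within the CTA first
    if (threadIdx.x == 0 && threadIdx.y == 0) {
        atomicAdd(&g_arrived, 1);               // announce this CTA's arrival
        while (atomicAdd(&g_arrived, 0) < num_ctas)
            ;                                   // spin until all CTAs arrived
    }
    __syncthreads();                            // release the CTA's threads
}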

slide-9
SLIDE 9

Performance comparison - execution time


0" 0,1" 0,2" 0,3" 0,4" 0,5" 0,6" 0,7" 0,8" 0,9" 1" 256x256x256" 256x256x512" 256x256x1024" 512x512x256" 512x512x512" 512x512x640" 640x640x128" 640x640x256" 640x640x386" Perfomance** Rela-ve*to*hybrid*approach* in0kernel0sync" stream0sync" device0sync"

  • CPU-controlled still fastest
  • Backed up by previous experiments
  • In-kernel synchronization slowest
  • Communication overhead increases with problem size: more CTAs, more device synchronization
  • ~28% of all instructions have to be replayed, likely due to serialization (use of atomics)
  • Stream synchronization a good option
  • Difference to device synchronization is the overhead of nested parallelism
  • Device synchronization most flexible regarding control flow
  • Communication as device function or as independent kernel
  • Flexibility in kernel launch configuration

Lena Oden, Benjamin Klenk, Holger Fröning, Analyzing GPU-controlled Communication and Dynamic Parallelism in Terms of Performance and Energy, Elsevier Journal of Parallel Computing (ParCo), 2016.

slide-10
SLIDE 10

Performance comparison - energy consumption


0" 0,2" 0,4" 0,6" 0,8" 1" 1,2" 256x256x256" 256x256x512" 256x256x1024" 512x512x256" 512x512x512" 512x512x640" 640x640x128" 640x640x256" 640x640x386" Energy'consump.on' Rela.ve'to'hybrid'approach' stream2sync" device2sync" in2kernel2sync"

  • Benefits for stream/device synchronization, as the CPU is put into sleep mode
  • 10% less energy consumption
  • CPU: 20-25 W saved
  • In-kernel synchronization saves much more total power, but the increased execution time results in higher energy consumption
  • Likely bad GPU utilization

Lena Oden, Benjamin Klenk, Holger Fröning, Analyzing GPU-controlled Communication and Dynamic Parallelism in Terms of Performance and Energy, Elsevier Journal of Parallel Computing (ParCo), 2016.

slide-11
SLIDE 11

Communication orchestration - Take aways

  • CPU-controlled communication is still fastest, independent of different orchestration optimizations
  • GPU-controlled communication: intra-GPU synchronization between the individual CTAs is most important for performance
  • Stream synchronization most promising
  • Otherwise replay overhead due to serialization
  • Dedicated communication kernels or functions are highly recommended
  • Either device functions for a master kernel (nested parallelism), or communication kernels in the same stream (issued by the CPU)
  • Bypassing CPUs has substantial energy advantages
  • Decrease polling rates, or use interrupt-based CUDA events (see the sketch after the polling loop below)!
  • More room for optimizations left


while (cudaStreamQuery(stream) == cudaErrorNotReady)
    usleep(sleeptime);  // throttled polling: CPU sleeps between stream queries
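A minimal sketch of the recommended interrupt-based alternative, using standard CUDA events with blocking synchronization (the API is real, the usage here is our illustration):

cudaEvent_t done;
cudaEventCreateWithFlags(&done, cudaEventBlockingSync);  // blocking-sync event
cudaEventRecord(done, stream);  // marker after the last communication kernel
cudaEventSynchronize(done);     // calling thread blocks, CPU can sleep
cudaEventDestroy(done);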

slide-12
SLIDE 12

GGAS: Fast GPU-controlled traffic sourcing and sinking

slide-13
SLIDE 13

GGAS – Global GPU Address Spaces

  • Forwarding load/store operations to global addresses

  • Address translation and target identification
  • Special hardware support required (NIC)
  • Severe limitations for full coherence and strong consistency
  • Well known for CPU-based distributed shared memory
  • Reverting to highly relaxed consistency models can be a solution


Holger Fröning and Heiner Litz, Efficient Hardware Support for the Partitioned Global Address Space, 10th Workshop on Communication Architecture for Clusters (CAC2010), co-located with 24th International Parallel and Distributed Processing Symposium (IPDPS 2010), April 19, 2010, Atlanta, Georgia.

slide-14
SLIDE 14


GGAS – thread-collaborative BSP-like communication

Lena Oden and Holger Fröning, GGAS: Global GPU Address Spaces for Efficient Communication in Heterogeneous Clusters, IEEE International Conference on Cluster Computing 2013, September 23-27, 2013, Indianapolis, US.

[Figure: BSP-like timeline across GPU 0 and GPU 1 (remote): computation, communication using collective remote stores, global barrier, then computation continues]

slide-15
SLIDE 15

GGAS – current programming model using mailboxes


<snip>
...
remMailbox[getProcess(index)][tid] = data[tid];  // thread-collaborative stores into the remote mailbox
__threadfence_system();                          // memory fence: order payload stores before the flag
remoteEndFlag[getProcess(index)][0] = 1;         // raise the completion flag at the receiver
__ggas_barrier();                                // global barrier across participating GPUs
...
<snip>
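The slide shows only the sending side. A hedged sketch of the matching receiver under the same mailbox model; the names localMailbox and localEndFlag, and the flag layout, are our assumptions (flags assumed declared volatile so the polling loop re-reads memory):

while (localEndFlag[src][0] == 0)
    ;                                // poll the flag raised by the remote store
data[tid] = localMailbox[src][tid];  // each thread drains its mailbox slot
__ggas_barrier();                    // rejoin the global BSP schedule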

slide-16
SLIDE 16

GGAS Prototype


Remote load latency (Virtex-6): 1.44-1.9 usec (CPU/GPU)

[Figure: Node #0 (source) issues loads/stores: source-local address, target node determination, address calculation to a global address; loss-less and in-order packet forwarding; at Node #1 (target, host memory): address calculation to the target-local address, with source tag management providing the return route]

  • FPGA-based network prototype
  • Xilinx Virtex-6
  • 64-bit data paths at 156 MHz = 1.248 GB/s (theoretical peak)
  • PCIe G1/G2
  • 4 network links (torus topology)
slide-17
SLIDE 17

GGAS – Microbenchmarking

  • GPU-to-GPU streaming
  • Prototype system consisting of Nvidia K20c & dual Intel Xeon E5
  • Relative results applicable to technology-related performance improvements
  • MPI
  • CPU-controlled: D2H, MPI send/recv, H2D
  • GGAS
  • GPU-controlled: GDDR to GDDR, remote stores
  • RMA: Remote Memory Access
  • Put/Get-based, CPU-to-CPU (host) resp. GPU-to-GPU (direct)


[Chart annotations: GGAS latency starting at 1.9 usec; P2P PCIe issue]

slide-18
SLIDE 18

Allreduce – Power and Energy analysis

Lena Oden, Benjamin Klenk and Holger Fröning, Energy-Efficient Collective Reduce and Allreduce Operations on Distributed GPUs, 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid2014), May 26-29, 2014, Chicago, IL, US.


[Chart: power over time for GGAS vs. MPI; for this case, 50% of the energy is saved]

slide-19
SLIDE 19

Analyzing Communication Models for Thread-parallel Processors

slide-20
SLIDE 20

Communication Models

  • MPI
  • CPU-controlled
  • De-facto standard, widely used, heavily optimized (for CPUs)
  • GGAS (using mailboxes with send/receive semantics)
  • GPU-controlled
  • Communication by forwarding load/store operations using global address spaces
  • Completely in line with the GPU execution model => highly thread-parallel
  • Main drawback is reduced overlap
  • RMA: Remote Memory Access
  • GPU-controlled
  • Put/Get operations of the custom interconnect
  • Communication engine designed for HPC
  • GPUs have to construct/interpret descriptors (which was very crucial for the IBVERBS experiment)


slide-21
SLIDE 21

Completing GGAS with Put/Get operations

  • Descriptor-based Put/Get operations
  • Completely asynchronous
  • Work request with
  • Type
  • Local/Remote pointers
  • Credentials
  • Notification requests
  • Notification
  • Completion information with reference to the work request

  • Key is a simple descriptor format (a hedged struct sketch follows)
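As a rough illustration, such a work request could be expressed as a small C struct. The field names and bit widths follow the custom descriptor format shown on the next slide, but this packing is our sketch, not the exact wire format:

#include <stdint.h>

// Hedged sketch of a simple Put/Get work request descriptor.
struct put_get_wr {
    uint8_t  cmd;           // type: Put or Get (CMD, 4 bit on the wire)
    uint8_t  noti;          // notification request flags (NOTI, 3 bit)
    uint16_t dest_node;     // target node (16 bit)
    uint8_t  dest_vpid;     // credentials: destination VPID (8 bit)
    uint32_t payload_size;  // payload size in bytes (23 bit on the wire)
    uint64_t read_addr;     // source pointer (64 bit)
    uint64_t write_addr;    // destination pointer (64 bit)
};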


slide-22
SLIDE 22

[Figure: IBVERBS descriptor format vs. custom descriptor format as doubleword/bit layouts; recoverable fields: Read Address (64 bit), Write Address (64 bit), Destination Node (16 bit), Payload Size in bytes (23 bit), CMD (4 bit), NOTI (3 bit), Destination VPID (8 bit)]

Put/Get: Different Work Request Queue Implementations

  • Descriptor format and queue organization matters
  • Explicit/implicit trigger, conversion effort, descriptor complexity
  • FPGA implementation: we could even change the descriptor format



slide-23
SLIDE 23

Example applications

  • Testing is hard, as applications have to be re-written for GPU-centric communication
  • Set of 4 workloads implemented: N-Body, Global Sum, Himeno stencil, Random Access


slide-24
SLIDE 24

[Chart: performance normalized to MPI (0.0-3.0) on 2-12 nodes for the benchmarks nbody_small, nbody_large, sum_small, sum_large, himeno, and randomAccess; series: GGAS, RMA]

Performance comparison - execution time


Benjamin Klenk, Lena Oden, Holger Fröning, Analyzing Communication Models for Distributed Thread-Collaborative Processors in Terms of Energy and Time, 2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2015), Philadelphia, PA, March 29-31, 2015.

  • 2-12 nodes (each 2x Intel Ivy Bridge, Nvidia K20, FPGA network)
  • Normalized to MPI: >1 = better performance, <1 = worse performance
slide-25
SLIDE 25

Performance comparison - energy consumption


Benjamin Klenk, Lena Oden, Holger Fröning, Analyzing Communication Models for Distributed Thread-Collaborative Processors in Terms of Energy and Time, 2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2015), Philadelphia, PA, March 29-31, 2015.

[Chart: energy consumption normalized to MPI (0.0-1.2) for NB-S, NB-L, sum-S, sum-L, RA, Himeno, and the average; series: GGAS, RMA]

  • 12 nodes (each 2x Intel Ivy Bridge, Nvidia K20, Extoll FPGA)
  • Normalized to MPI: <1 = better energy consumption, >1 = worse energy consumption
slide-26
SLIDE 26

Performance comparison - observations


Observations, each applying to a subset of the benchmarks (N-Body, Himeno Stencil, Global Sum, Random Access):

  • RMA and MPI offer a better exploitation of overlap possibilities
  • GGAS performs outstandingly for small payloads, as no indirections are required, such as context switches to the CPU or work request issues to the NIC
  • The PCIe peer-to-peer read problem results in MPI performing better than RMA or GGAS for large payload sizes
  • GGAS in combination with RMA outperforms MPI substantially (without the PCIe peer-to-peer read limitation)
  • In practice, the execution time has an essential influence on energy consumption
  • Accesses to host memory contribute significantly to DRAM and CPU socket power
  • Bypassing a component like a CPU can save enough power to compensate for a longer execution time, resulting in energy savings
  • Staging copies contribute significantly to both CPU and GPU power, due to the involved software stacks and active DMA controllers, respectively
  • For irregular communication patterns or small payloads, GGAS saves both time and energy

slide-27
SLIDE 27

Related effort: Simplified multi-GPU programming

slide-28
SLIDE 28
Towards simplified multi-GPU programming

  • Single-GPU programming based on CUDA/OpenCL is a prime example of a BSP execution model
  • Exposes large amounts of structured parallelism, rather easy to use
  • Multi-GPU programming is becoming more important, but adds huge amounts of complexity
  • Processor aggregation is easy
  • Memory aggregation is challenging
  • Local/remote bandwidth disparity
  • UVA (Unified Virtual Addressing) works out-of-the-box => poor performance
  • Solution: use a compiler-based approach to automatically partition regular GPU code


slide-29
SLIDE 29

GPU Mekong

  • Tool stack based on the LLVM tool chain
  • Code analysis
  • Automated decision making
  • Code transformations for automated partitioning and data movements
  • Regularity
  • Input/output data, dimensionality
  • Addition of a super-block ID
  • Index modification (see the sketch after this list)
  • Executed kernels
  • Iterative execution?
  • Multi-device initialization
  • Data distribution
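To make the super-block ID and index modification steps concrete, a hedged sketch of the kind of rewrite such a pass might emit (illustrative only, not Mekong's actual transformation):

// Original single-GPU indexing:
//   int i = blockIdx.x * blockDim.x + threadIdx.x;
// After partitioning, each device executes one contiguous super-block of CTAs:
__global__ void kernel_partitioned(float *out, const float *in,
                                   int super_block_id, int blocks_per_dev) {
    int global_block = super_block_id * blocks_per_dev + blockIdx.x;
    int i = global_block * blockDim.x + threadIdx.x;  // device-global index
    out[i] = 2.0f * in[i];                            // unchanged kernel body
}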
slide-30
SLIDE 30

GPU Mekong

  • Funded by a Google Research Award
  • Under heavy development
  • https://sites.google.com/site/gpumekong
  • Initial results for a 16-GPU matrix multiply


Alexander Matz, Mark Hummel, Holger Fröning, Exploring LLVM Infrastructure for Simplified Multi-GPU Programming, Ninth International Workshop on Programmability and Architectures for Heterogeneous Multicores (MULTIPROG-2016), in conjunction with HiPEAC 2016, Prague, Czech Republic, Jan. 18, 2016.

[Charts: speedup over the number of GPUs (2-8 and 4-16) for NxN matrix multiply, N = 4096, 8192, 12288, 16384, 20480, 24576]

slide-31
SLIDE 31

Wrapping up

slide-32
SLIDE 32

Summary

  • GPUs as first-class citizens in a peer networking environment, capable of sourcing and sinking traffic
  • Traditional messaging libraries and models poorly match the GPU's thread-collaborative execution model
  • Also, GPUs require fast communication paths with minimal latency/maximum message rate, combined with high-overlap paths like Put/Get
  • However: the semantic gap between architecture and user is growing
  • We need automated tooling to close this gap
  • Unified communication models
  • Automatically selecting the right communication method and path between heterogeneous computing units => flexibility in scheduling kernels
  • Multiple communication models will dramatically push complexity, too
  • CACM 03/2015, John Hennessy: it's the era of software


Specialized processors like GPUs require specialized communication models/methods

slide-33
SLIDE 33

Thank you!

Credits
Contributions: Lena Oden (former PhD student), Benjamin Klenk (PhD student), Daniel Schlegel (graduate student), Günther Schindler (graduate student)
Discussions: Sudha Yalamanchili (Georgia Tech), Jeff Young (Georgia Tech), Larry Dennison (Nvidia), Hans Eberle (Nvidia)
Sponsoring: Nvidia, Xilinx, German Excellence Initiative, Google

[Figure: current main interactions]
