

SLIDE 1

S7300: MANAGED COMMUNICATION FOR MULTI-GPU SYSTEMS

Holger Fröning1, Benjamin Klenk1, Hans Eberle2 & Larry Dennison2

1 Ruprecht-Karls University of Heidelberg, Germany 2 NVIDIA Research

http://www.ziti.uni-heidelberg.de/compeng holger.froening@ziti.uni-heidelberg.de GTC 2017, May 8, 2017

SLIDE 2

ABOUT US & TODAY

Performance and productivity for future and emerging technologies under hard power and energy constraints

Rather unusual hardware engineers

Sold on BSP styles of computing for data-intensive problems

Strong computer engineering background, focus on low-level software layers
High-performance analytics & high-performance computing

Today’s talk

An update on our work on GPU-centric communication

2

SLIDE 3

GPU APPLICATIONS

“Regular” algorithms: scientific/technical, HPC, machine learning

Mostly dense matrix: FFT, matrix-matrix multiplication, N-body, convolution, (deep) neural networks, finite-difference codes (PDE solvers)
Excellent understanding in the community

"Irregular" algorithms: most algorithms outside computational science

Organized around pointer-based data structures
Data mining, Bayesian inference, compilers, functional interpreters, max-flow, n-body methods (Barnes-Hut, fast multipole), mesh refinement, graphics (ray tracing), event-driven simulation, relational join (databases), ...

3

Partly by Keshav Pingali et al., Amorphous Data-parallelism, Technical Report TR-09-05, U. Texas at Austin, 2009
David Kaeli, How Can GPUs Become First-Class Computing Devices?, William & Mary Computer Science Colloquium, October 26, 2016

SLIDE 4

NOTE ON DEEP LEARNING

4

Greg Diamos, HPC Opportunities in Deep Learning, Stanford Computer Systems Colloquium, October 5, 2016

[Diagram: training pipeline: training dataset -> shuffle -> mini-batch -> forward prop -> back prop -> optimizer; annotations: data parallelism, sequential dependence, model parallelism]

Training: 20 EFLOPs @ 10 TFLOP/s = 23 days

SLIDE 5

REMINDER: BULK-SYNCHRONOUS PARALLEL

In 1990, Valiant already described GPU computing pretty well

Superstep: compute, communicate, synchronize

Parallel slackness: # of virtual processors v, physical processors p

v = 1: not viable
v = p: unpromising wrt optimality
v >> p: scheduling and pipelining

Extremely scalable
A GPU is an (almost) perfect BSP processor

5

Leslie G. Valiant, A bridging model for parallel computation, Communications of the ACM, Volume 33, Issue 8, Aug. 1990

[Diagram: GPU as BSP processor: SMs connected through address-sliced XBARs to L2 slices]
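Valiant's superstep structure can be sketched in a few lines (an illustrative Python simulation, not the slide's code; names are invented). Each superstep computes on all virtual processors, delivers all messages at once, then hits an implicit barrier, and parallel slackness appears as v virtual processors multiplexed over p workers:

```python
# Minimal BSP superstep sketch: v virtual processors over p physical workers,
# each superstep is compute -> communicate -> synchronize (barrier).

def bsp_run(v, p, steps, compute, state, inboxes):
    for _ in range(steps):
        outbound = []
        # compute phase: each physical worker processes v/p virtual processors
        for worker in range(p):
            for vp in range(worker, v, p):
                msgs = compute(vp, state, inboxes[vp])
                outbound.extend(msgs)          # (dest, value) pairs
        # communication phase: deliver all messages at once
        inboxes = [[] for _ in range(v)]
        for dest, value in outbound:
            inboxes[dest].append(value)
        # synchronization: the barrier is implicit at the end of the loop body
    return state, inboxes

# Example: each virtual processor accumulates and tells its neighbor
def compute(vp, state, inbox):
    state[vp] += 1 + sum(inbox)
    return [((vp + 1) % len(state), 1)]

state, _ = bsp_run(v=8, p=2, steps=3, compute=compute,
                   state=[0] * 8, inboxes=[[] for _ in range(8)])
```

Because each virtual processor only reads its own inbox during the compute phase, the worker-to-virtual-processor mapping never changes the result, which is exactly what makes the oversubscribed v >> p regime schedulable.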

SLIDE 6

TRANSITIONING TO MULTI-GPU IS FUNDAMENTAL

Transition from SMP to NUMA

Reasons: multi-GPU systems, multi-chip modules, heterogeneous memory, tiled layout

Beauty of BSP is lost

Kernel launch orchestration
Data movement operations
Naming a physical resource is disgusting

Compute stack lacks NUMA support

Programming models
Abstractions
Consistency model

6

[Diagram: three GPU modules, each with SMs, address-sliced XBARs and L2 slices]

SLIDE 7

ADDRESSING NUMA

Analyzing NUMA latency effects

Observations on PCIe: huge local/remote penalty, unloaded/loaded penalty

NVLINK changes the regime: strong and dynamic NUMA effects

Publicization/privatization concept

=> Managed communication

Examples: MPI, TCP/IP, active messages, various more …

7

Read latency [usec], Pascal-class, 2x GPU, PCIe:
          unloaded   loaded     factor
  local    0.250      0.461     1.8
  peer     1.311      1.378     1.1
  host     0.838      1.004     1.2
  factor   ~3.3-5.2   ~2.2-3.0

Bandwidth [GB/s], Pascal-class:
  local    480
  remote    16
  factor    30

SLIDE 8

REST OF THIS TALK

Background
Understanding massively-parallel communication
GPU-centric (but unmanaged) communication
Introducing MANTARO
Use cases for work execution

8

SLIDE 9

BACKGROUND

9

SLIDE 10

COMMUNICATION MODELS

Plain load/store (LD/ST) - de-facto standard in shared memory systems

Never designed for communication
Can be fast for SMP, but often unknown costs for NUMA
Assumes a perfectly timed load seeing a store

Message passing (MP) - de-facto standard in HPC

Various p2p and collective functions
Mainly send/recv semantics used - ease-of-use
Overhead due to functionality & guarantees: copying, matching, progress, ordering

Many more

Active messages - latency tolerance becomes a programming/compiling concern
One-sided communication (put/get) - never say receive

10

[Diagram: LD/ST: threads T0 and T1 communicate through shared memory via a store and a correspondingly timed load. MP: processes P0 and P1 with local memories; send(X, 1, tag) is matched to recv(Y, 0, tag)]
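The contrast between the two models can be sketched in a few lines (a hypothetical Python illustration; all names are invented). LD/ST has no matching metadata at all, while send/recv carries destination and tag so the runtime can copy and match on the receiver's behalf:

```python
# Illustrative contrast of the two communication models (hypothetical sketch).
from collections import deque

# Plain LD/ST: communication is just a store that a later load happens to see;
# nothing tells the reader *when* the value is ready.
shared = {}
shared["X"] = 42          # "store" by thread T0
value = shared.get("X")   # "load" by thread T1; only correct if timed right

# Message passing: send/recv carry explicit matching metadata (dest, tag),
# so the runtime can copy, match, and order messages for the receiver.
mailbox = deque()

def send(dest, buf, tag):
    mailbox.append((dest, tag, list(buf)))   # copy into the runtime's buffer

def recv(src, tag):
    for i, (d, t, data) in enumerate(mailbox):
        if t == tag:                         # match on tag (source match elided)
            del mailbox[i]
            return data
    return None

send(dest=1, buf=[1, 2, 3], tag=7)
result = recv(src=0, tag=7)
```

The copy, match, and ordering steps are exactly the "functionality & guarantees" overhead the slide attributes to message passing.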

SLIDE 11

GPU COMMUNICATION TODAY

Standard: context switch to CPU

Limited to coarse-grain communication
Kernel-completion boundaries

Related work explores CPU helper threads

#GPU entities >> #CPU entities
Applicability depends on communication pattern [DGCN, dCUDA, ...]

11

[Diagram: CPU-controlled GPU communication. Sender: GPU - PCIe - CPU - PCIe - NIC; receiver mirrored across the network. Steps: (1) finish kernel, copy data; (2) MPI send, copy data, network packet; receiver: copy data, recv, copy data, start kernel; (x) completion. Legend: CUDA stack, MPI stack, computation, possible overlap]

SLIDE 12

UPSHOT: CPU BYPASS HELPS

GPU-to-GPU streaming

Prototype system consisting of NVIDIA K20c, dual Intel Xeon E5, custom FPGA network

12

SLIDE 13

UNDERSTANDING MASSIVELY-PARALLEL COMMUNICATION

13

Do we need fine-grain privatization?

SLIDE 14

APPROACH

Characteristics of massively parallel communication

Analyzing large-scale HPC applications

DOE Exascale MPI proxy app traces

~1/2 TB analyzed (25+ TB available online)

14

Application (suite)        Pattern            Ranks
MOCFE (CESAR)              Nearest neighbor   64; 256; 1024
NEKBONE (CESAR)            Nearest neighbor   64; 256; 1024
CNS (EXACT)                Nearest neighbor   64; 256
CNS Large (EXACT)          Nearest neighbor   64; 256; 1024
MultiGrid (EXACT)          Nearest neighbor   64; 256
MultiGrid Large (EXACT)    Nearest neighbor   64; 256; 1024
LULESH (EXMATEX)           Nearest neighbor   64; 512
CMC 2D (EXMATEX)           Nearest neighbor   64; 256; 1024
AMG (DF)                   Nearest neighbor   216; 1728; 13824
AMG Boxlib (DF)            Irregular          64; 1728
BIGFFT (DF)                Many-to-many       100; 1024; 10000
BIGFFT Medium (DF)         Many-to-many       100; 1024; 10000
Crystal Router (DF)        Staged all-to-all  10; 100

SLIDE 15

Observations

Structured patterns

APPLICATION CHARACTERISTICS

15

Neighbor Many-to-many All-to-all Irregular

SLIDE 16

Observations

Structured patterns
Collectives for synchronization, point-to-point for communication

APPLICATION CHARACTERISTICS

16

SLIDE 17

Observations

Structured patterns
Collectives for synchronization, point-to-point for communication
Most messages are surprisingly small

APPLICATION CHARACTERISTICS

17

SLIDE 18

Observations

Structured patterns
Collectives for synchronization, point-to-point for communication
Most messages are surprisingly small
Few communication peers

APPLICATION CHARACTERISTICS

18

Job size (ranks)   Min     Median   Max
[0:63]             3.1 %   28.1 %   40.6 %
[64:127]           6.0 %   12.0 %   15.2 %
[128:255]          0.6 %    7.8 %   26.4 %
[256:511]          3.7 %    5.4 %    7.1 %
[512:1023]         0.4 %    2.0 %    7.0 %
[1024:2047]        1.3 %    2.0 %    4.6 %
[8192:16383]       0.1 %    0.2 %    0.7 %

Communication peers as percentage of all ranks

SLIDE 19

Observations

Structured patterns
Collectives for synchronization, point-to-point for communication
Most messages are surprisingly small
Few communication peers

Insights on communication

Selective, structured and fine-grained
Little/no use of advanced MPI features
Irregular applications will further push requirements

APPLICATION CHARACTERISTICS

19

Benjamin Klenk, Holger Fröning, An Overview of MPI Characteristics of Exascale Proxy Applications, International Supercomputing Conference (ISC) 2017 (accepted for publication & best paper finalist)

[Table repeated from slide 14: applications, communication patterns, and rank counts]

SLIDE 20

GPU-CENTRIC (BUT UNMANAGED) COMMUNICATION

20

Addressing the need for privatization

SLIDE 21

GPU-CENTRIC TRAFFIC SOURCING & SINKING

GGAS: GPU-centric send/receive

Thread-collective data movement
Complete CPU bypass

Cons

Special hardware support required
Reduced overlap

GRMA: GPU-centric put/get

Key is simple descriptor format

Cons

Special hardware support required
Indirection to issue work

21

Lena Oden and Holger Fröning, GGAS: Global GPU Address Spaces for Efficient Communication in Heterogeneous Clusters, IEEE CLUSTER 2013

[Diagram: GGAS path: GPU - PCIe - NIC - network - NIC - PCIe - GPU; a store to the global address space becomes a network packet and a store directly into remote GPU memory, bypassing both CPUs. Legend: CUDA stack, MPI stack, computation, possible overlap]

SLIDE 22

MICRO-BENCHMARK PERFORMANCE

GPU-to-GPU streaming

Prototype system consisting of NVIDIA K20c, dual Intel Xeon E5, custom network

MPI

CPU-controlled: D2H, MPI send/recv, H2D

Others

GPU-controlled, bypassing CPU

Results do not cover overheads regarding issue & completion

22

SLIDE 23

[Chart: application performance normalized to MPI (0.0-3.0) over 2-12 nodes for nbody_small, nbody_large, sum_small, sum_large, himeno, randomAccess; series: GGAS, RMA]

APPLICATION-LEVEL PERFORMANCE

Experiments with applications rewritten for GPU-centric communication

12 nodes (each 2x Intel Ivy Bridge, NVIDIA K20, FPGA network)

Specialized communication always faster than MPI

But can we also get the convenience of managed communication?

23

Benjamin Klenk, Lena Oden, Holger Fröning, Analyzing Communication Models for Distributed Thread-Collaborative Processors in Terms of Energy and Time, 2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2015)

SLIDE 24

INTRODUCING MANTARO

24

GPU-centric privatization

SLIDE 25

A MANY-CORE MESSAGE PROCESSOR

Transforming an SM into a message-parallel processor (SOFTNIC)

Building blocks to support send/recv, put/get, active messages, ...
Layered on top of LD/ST over global address spaces
Or interfacing to a NIC/CPU

Managed communication

Buffer management
Protocol selection
Scheduling data transfers
Choosing communication paths
Asynchronous communication

Adaptable (reprogrammable) to workload
Scalable with flows and GPUs

25

[Diagram: GPU with compute grid(s) of CTAs and a SOFTNIC grid over an NVLink fabric. SOFTNIC components: work request queue, event queue, worker warp pool (single or multi-CTA), supervisor warp, event aggregation & notification, tag matching, queue management, egress path, connection & registration handlers, ingress buffers, collective & AM handlers; backed by GPU memory]

SLIDE 26

FLEXIBLE & COMPOSABLE

Flexible: who sources/sinks traffic?

Threads, warps, CTAs or kernels

Flexible: what is the model?

Send/recv, put/get, active messages?

Flexible: which data path?

LD/ST or DMA engines

Composable using building blocks

Three fundamental tasks

1. Work generation
2. Work execution
3. Work completion

26

[Diagram repeated: SOFTNIC architecture (see slide 25)]

SLIDE 27

WORK GENERATION

Warp-parallel queue

Collaborative enqueue of 1-32 elements
Avoids branch divergence
Warp-parallel except for pointer update

Building block for various uses

Entities: warps, CTAs, or kernels
Shared, global or remote memory

Communication as a sequence of queues
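A CPU-side sketch of this collaborative enqueue idea (an illustrative Python simulation with invented names; the real implementation would use warp vote/prefix intrinsics and an atomic tail update):

```python
# Sketch of a warp-parallel collaborative enqueue: all active lanes enqueue in
# one step; only the tail pointer update is serialized, mirroring the slide's
# "warp-parallel except for pointer update".

def warp_enqueue(queue, tail, active_mask, items):
    n = bin(active_mask).count("1")       # how many lanes participate
    base = tail                           # single pointer update for the warp
    tail += n
    for lane in range(32):                # conceptually concurrent lanes
        if active_mask & (1 << lane):
            # lane's slot = base + number of active lanes below it
            offset = bin(active_mask & ((1 << lane) - 1)).count("1")
            queue[base + offset] = items[lane]
    return tail

queue = [None] * 64
items = [f"msg{lane}" for lane in range(32)]
tail = warp_enqueue(queue, 0, active_mask=0b1011, items=items)
```

Because every active lane derives a distinct slot from the same mask, the enqueue is divergence-free and the slots stay densely packed even when only a subset of lanes participates.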

27

SLIDE 28

WORK COMPLETION

Notifications have to be found quickly

Tables are very handy
Parallel search, low administration overhead
Messaging operations return a pointer to the table entry

Aggregating notifications

Reducing table contention
Reducing time to find all notifications

Issues with current GPUs

Preemption & scheduling

28

compute_stencil (..) {
  ...
  exchange_halo(top, noti);
  exchange_halo(bot, noti);
  exchange_halo(left, noti);
  exchange_halo(right, noti);
  ...
  wait( noti == 4 );
}

SLIDE 29

USE CASES FOR WORK EXECUTION

29

MPI-like send/recv Active messages

SLIDE 30

REMINDER: MESSAGE MATCHING USING MPI

Match one send to one receive based on the {communicator, sender, tag} tuple

Wildcards on sender and tag possible
Messages can arrive unexpectedly
Messages stay in order

MPI internally maintains lists for pre-posted receives and unexpected messages

Queue length and search depth of importance

30

MPI_(I)Send( <buffer>, <count>, <type>, <dest.>, <tag>, <comm>, ...)
MPI_(I)Recv( <buffer>, <count>, <type>, <source>, <tag>, <comm>, ...)
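The matching rule can be modeled in a few lines (an illustrative Python sketch with invented names, not MPI's implementation; the wildcard and unexpected-queue behavior follows the description above):

```python
# MPI-style matching: search the posted-receive list in order; anything that
# arrives without a matching receive goes to the unexpected-message queue.

ANY_SOURCE = ANY_TAG = -1   # stand-ins for MPI_ANY_SOURCE / MPI_ANY_TAG

posted = []      # pre-posted receives: (comm, source, tag)
unexpected = []  # messages that arrived before a matching recv

def matches(recv, msg):
    comm, source, tag = recv
    m_comm, m_source, m_tag = msg
    return (comm == m_comm
            and source in (ANY_SOURCE, m_source)
            and tag in (ANY_TAG, m_tag))

def on_message(msg):
    for i, r in enumerate(posted):   # in-order search preserves MPI ordering
        if matches(r, msg):
            return posted.pop(i)     # matched: consume the posted receive
    unexpected.append(msg)           # otherwise queue as unexpected
    return None

posted.append((0, ANY_SOURCE, 42))   # recv from any source, tag 42
hit = on_message((0, 3, 42))         # send from rank 3, tag 42
miss = on_message((0, 3, 7))         # no matching recv -> unexpected
```

The linear scan over both lists is why queue length and search depth matter for matching performance, as the slide notes.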


SLIDE 31

CPU MATCHING PERFORMANCE

31

Best case (forward) Average case (random)

SLIDE 32

MASSIVELY-PARALLEL TAG MATCHING

32

Parallelization

Vote matrix
Shared memory
Hierarchical approach

1. Multi-warp scan, based on __ballot
2. Single-warp reduction of the column vector to a single vote, based on __ballot, __ffs and bit masking
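A CPU simulation of the ballot-based vote step (illustrative only; __ballot and __ffs are modeled here with plain bit operations, and the multi-warp hierarchy is collapsed to a single warp):

```python
# Simulated warp vote: each of 32 lanes checks one queued tag against the
# incoming tag in parallel; the warp then picks the first match.

def ballot(predicates):
    """Pack one predicate bit per lane into a 32-bit mask, like __ballot."""
    mask = 0
    for lane, p in enumerate(predicates):
        if p:
            mask |= 1 << lane
    return mask

def ffs(mask):
    """1-based index of the least-significant set bit, like CUDA's __ffs (0 if none)."""
    return (mask & -mask).bit_length()

queued_tags = [5, 9, 7, 9] + [0] * 28   # one queue entry per lane
incoming = 9
votes = ballot([t == incoming for t in queued_tags])
first_match = ffs(votes) - 1            # lane index of the first matching entry
```

Selecting the lowest set bit is what preserves queue order among simultaneous matches, which the matrix-based variants rely on.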

SLIDE 33

RELAXING MATCHING SEMANTICS

No unexpected messages

No compaction (10% perf.)
No unnecessary propagation of unmatched elements

No source wildcards

Rank partitioning
Multiple matrices

No ordering

Hash tables
Constant insert and search time complexity

Two orders of magnitude speedup

33

Benjamin Klenk, Holger Fröning, Hans Eberle, Larry Dennison, Relaxations for High-Performance Message Passing on Massively Parallel SIMT Processors, IPDPS 2017 (accepted for publication & best paper award)

Wildcards  Ordering  Unexpected msgs  Partitioning  Data structure  Performance [matches/s]  User implications
yes        yes       yes              no            matrix          < 6M                     none (MPI-like)
yes        yes       no               no            matrix          ~ 6M                     medium
no         yes       yes              yes           matrix          < 60M                    low
no         yes       no               yes           matrix          ~ 60M                    medium
no         no        yes              yes           hash table      < 500M                   high
no         no        no               yes           hash table      ~ 500M                   high
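The hash-table rows rest on a simple observation: once wildcards and ordering are dropped, the {source, tag} pair is an exact key. A minimal sketch (hypothetical Python with invented names, not the paper's GPU code):

```python
# Relaxed matching path: with no wildcards and no ordering, the {source, tag}
# key goes straight into a hash table, giving constant expected
# insert/search time instead of a scan over an ordered queue.

pending_recvs = {}   # (source, tag) -> receive buffer id

def post_recv(source, tag, buf):
    pending_recvs[(source, tag)] = buf

def match_send(source, tag):
    # O(1) expected lookup; a wildcard would force scanning every entry
    return pending_recvs.pop((source, tag), None)

post_recv(source=3, tag=42, buf="buf0")
post_recv(source=5, tag=7, buf="buf1")
matched = match_send(3, 42)      # exact key hit
unmatched = match_send(9, 9)     # no posted receive for this key
```

This is why source wildcards have to go first: a wildcard key cannot be hashed, so it would reintroduce the linear search the relaxation removes.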

SLIDE 34

ACTIVE MESSAGES

Heavily used in task-based programming models
Map nicely to irregular applications

Work lists
Coalescing/aggregation
Possibly sorting for locality maximization

Different forms of execution in Mantaro

Inline (thread warp): limited to max. 32 threads
Inline (complete CTA): stalls communication
Kernel launch: high costs (NVIDIA's Dynamic Parallelism feature)
Registered and pre-launched kernel (persistent threads)
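The registered-handler idea behind these execution forms can be sketched as follows (a hypothetical Python model with invented names; the persistent-threads variant would poll a queue instead of draining a list):

```python
# Minimal active-message sketch: handlers are registered up front, and each
# incoming message names its handler plus a payload.

handlers = {}

def register(handler_id, fn):
    handlers[handler_id] = fn

def am_dispatch(incoming, state):
    # a persistent receiver loop would poll here; we just drain a list
    for handler_id, payload in incoming:
        handlers[handler_id](state, payload)

def am_update(state, payload):
    index, delta = payload
    state[index] += delta          # the work runs where the data lives

register("update", am_update)

state = [0, 0, 0, 0]
am_dispatch([("update", (1, 5)), ("update", (3, 2)), ("update", (1, 1))], state)
```

Registering handlers ahead of time is what lets the receiver execute work inline (warp or CTA) without a kernel launch on the critical path.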

34

SLIDE 35

RANDOM-ACCESS BENCHMARK

35

Daniel Schlegel, Active Messaging in Autonomous GPU Networks, Master thesis, Ruprecht-Karls University of Heidelberg, Germany, 2016.

Part of HPCC benchmark suite (CPU version)

http://icl.cs.utk.edu/hpcc/

Ported to a GPU version

Data-driven memory accesses distributed over multiple GPUs
Many fine-grain interactions
Buckets aggregate update operations

Performance

PCIe-connected K80 GPUs
Up to 1 GUPS, good scalability
Similar to an equivalent CPU system (192 MPI ranks, 104 SMs total)
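The bucket aggregation idea can be sketched as follows (an illustrative Python model with invented names, not the ported benchmark): updates are staged into per-destination buckets and flushed in bulk, trading per-update messages for fewer, larger ones.

```python
# Bucketed update aggregation for a GUPS-style random-access workload.

def bucketed_updates(updates, num_owners, table_size, flush_threshold=4):
    table = [0] * table_size
    buckets = [[] for _ in range(num_owners)]
    sent_batches = 0

    def flush(owner):
        nonlocal sent_batches
        if buckets[owner]:
            sent_batches += 1          # one "message" carries many updates
            for index in buckets[owner]:
                table[index] ^= 1      # the remote update itself
            buckets[owner].clear()

    for index in updates:
        owner = index * num_owners // table_size   # who owns this table slice
        buckets[owner].append(index)
        if len(buckets[owner]) >= flush_threshold:
            flush(owner)
    for owner in range(num_owners):    # drain remaining partial buckets
        flush(owner)
    return table, sent_batches

table, batches = bucketed_updates([0, 5, 1, 6, 2, 7, 3, 4],
                                  num_owners=2, table_size=8)
```

Here eight single-word updates collapse into two batched messages, which is the kind of fine-grain-to-bulk conversion that makes the benchmark scale over PCIe.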

SLIDE 36

WRAPPING UP

36

SLIDE 37

COMMUNICATION MODEL PROPOSALS

Send/recv communication

No ordering guarantees, limited use of wildcards

Asynchronous communication
Consistency control: blocking wait, or pre-registered, deferred actions depending on completion

Active messages

Pre-registered functions
Different forms of execution, possibly dynamically determined
Data placement determines place of execution

37

Point-to-point source:
  mantaro_send (dest, &buf, tag, &handle, ...);
Point-to-point sink:
  mantaro_recv (src, &buf, tag, &handle, ...);
Collective ops:
  mantaro_barrier (group, &handle, ...);
  mantaro_all2all (tag, &handle, ...);
Synchronization:
  mantaro_wait (&handle);  /* blocking wait */
  mantaro_defer (&handle, &action);
Function registration:
  // base handler class w/ virtual functions
  class AMUpdate : public Mantaro::AMBase {...}
AM send:
  mantaro_am_send (AMUpdate_msg, &buf, ...);
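The wait/defer distinction can be modeled in a few lines (a hypothetical sketch of the proposed handle semantics, not Mantaro code; all names are invented):

```python
# Completion styles: block on a handle, or pre-register a deferred action
# that the runtime fires when the operation completes.

class Handle:
    def __init__(self):
        self.done = False
        self.deferred = []

    def defer(self, action):
        # register an action to run on completion instead of blocking
        if self.done:
            action()
        else:
            self.deferred.append(action)

    def complete(self):
        # called by the runtime when the operation finishes
        self.done = True
        for action in self.deferred:
            action()
        self.deferred.clear()

log = []
h = Handle()
h.defer(lambda: log.append("halo exchanged"))  # deferred action, no blocking
h.complete()                                   # completion triggers it
```

Deferred actions keep the issuing threads running, which matters on a GPU where blocking a warp on a handle stalls far more parallel work than blocking one CPU thread would.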

SLIDE 38

SUMMARY

Managed communication addresses

Strong NUMA effects of multi-GPU systems
The need for fine-grain, selective communication

Mantaro: a many-core message-parallel processor

Capable of handling massively parallel communication
Flexible and adaptable
A tool to explore GPU communication

Issues/limitations

Inter-CTA latency, progress guarantees, preemption & memory management, execution launch costs
We believe GPUs will continue to evolve

38

[Diagram repeated: SOFTNIC architecture (see slide 25)]