

SLIDE 1

S7300: MANAGED COMMUNICATION FOR MULTI-GPU SYSTEMS

Holger Fröning1, Benjamin Klenk1, Hans Eberle2 & Larry Dennison2

1 Ruprecht-Karls University of Heidelberg, Germany 2 NVIDIA Research

http://www.ziti.uni-heidelberg.de/compeng holger.froening@ziti.uni-heidelberg.de GTC 2017, May 8, 2017

SLIDE 2

ABOUT US & TODAY

Performance and productivity for future and emerging technologies under hard power and energy constraints

Rather unusual hardware engineers

Sold on BSP styles of computing for data-intensive problems

Strong computer engineering background, focus on low-level software layers
High-performance analytics & high-performance computing

Today’s talk

An update on our work on GPU-centric communication

2

SLIDE 3

GPU APPLICATIONS

“Regular” algorithms: scientific/technical, HPC, machine learning

Mostly dense matrix: FFT, matrix-matrix multiplication, N-body, convolution, (deep) neural networks, finite-difference codes (PDE solvers)
Excellent understanding in the community

"Irregular" algorithms: most algorithms outside computational science

Organized around pointer-based data structures
Data mining, Bayesian inference, compilers, functional interpreters, max-flow, n-body methods (Barnes-Hut, fast multipole), mesh refinement, graphics (ray tracing), event-driven simulation, relational join (databases), ...

3

Partly by Keshav Pingali et al., Amorphous Data-parallelism, Technical Report TR-09-05, U. Texas at Austin, 2009
David Kaeli, How Can GPUs Become First-Class Computing Devices?, William & Mary Computer Science Colloquium, October 26, 2016

SLIDE 4

NOTE ON DEEP LEARNING

4

Greg Diamos, HPC Opportunities in Deep Learning, Stanford Computer Systems Colloquium, October 5, 2016

[Diagram: training pipeline: training dataset -> shuffle -> mini-batch -> forward prop -> back prop -> optimizer; annotations: data parallelism, sequential dependence, model parallelism]

Training: 20 EFLOPs @ 10 TFLOP/s = 23 days

SLIDE 5

REMINDER: BULK-SYNCHRONOUS PARALLEL

In 1990, Valiant already described GPU computing pretty well

Superstep: compute, communicate, synchronize

Parallel slackness: # of virtual processors v, physical processors p

v = 1: not viable
v = p: unpromising wrt optimality
v >> p: scheduling and pipelining

Extremely scalable
A GPU is an (almost) perfect BSP processor

5

Leslie G. Valiant, A bridging model for parallel computation, Communications of the ACM, Volume 33, Issue 8, Aug. 1990

[Diagram: GPU as BSP processor: SMs connected through address-sliced XBARs to L2 slices]
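Valiant's superstep structure can be sketched in a few lines (an illustrative Python simulation, not the slide's code; names are invented). Each superstep computes on all virtual processors, delivers all messages at once, then hits an implicit barrier, and parallel slackness appears as v virtual processors multiplexed over p workers:

```python
# Minimal BSP superstep sketch: v virtual processors over p physical workers,
# each superstep is compute -> communicate -> synchronize (barrier).

def bsp_run(v, p, steps, compute, state, inboxes):
    for _ in range(steps):
        outbound = []
        # compute phase: each physical worker processes v/p virtual processors
        for worker in range(p):
            for vp in range(worker, v, p):
                msgs = compute(vp, state, inboxes[vp])
                outbound.extend(msgs)          # (dest, value) pairs
        # communication phase: deliver all messages at once
        inboxes = [[] for _ in range(v)]
        for dest, value in outbound:
            inboxes[dest].append(value)
        # synchronization: the barrier is implicit at the end of the loop body
    return state, inboxes

# Example: each virtual processor accumulates and tells its neighbor
def compute(vp, state, inbox):
    state[vp] += 1 + sum(inbox)
    return [((vp + 1) % len(state), 1)]

state, _ = bsp_run(v=8, p=2, steps=3, compute=compute,
                   state=[0] * 8, inboxes=[[] for _ in range(8)])
```

Because each virtual processor only reads its own inbox during the compute phase, the worker-to-virtual-processor mapping never changes the result, which is exactly what makes the oversubscribed v >> p regime schedulable.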

SLIDE 6

TRANSITIONING TO MULTI-GPU IS FUNDAMENTAL

Transition from SMP to NUMA

Reasons: multi-GPU systems, multi-chip modules, heterogeneous memory, tiled layout

Beauty of BSP is lost

Kernel launch orchestration
Data movement operations
Naming a physical resource is disgusting

Compute stack lacks NUMA support

Programming models
Abstractions
Consistency model

6

[Diagram: three GPU modules, each with SMs, address-sliced XBARs and L2 slices]

SLIDE 7

ADDRESSING NUMA

Analyzing NUMA latency effects

Observations on PCIe: huge local/remote penalty, unloaded/loaded penalty

NVLINK changes the regime: strong and dynamic NUMA effects

Publicization/privatization concept

=> Managed communication

Examples: MPI, TCP/IP, active messages, various more …

7

Read latency [usec], Pascal-class, 2x GPU, PCIe:
          unloaded   loaded     factor
  local    0.250      0.461     1.8
  peer     1.311      1.378     1.1
  host     0.838      1.004     1.2
  factor   ~3.3-5.2   ~2.2-3.0

Bandwidth [GB/s], Pascal-class:
  local    480
  remote    16
  factor    30

SLIDE 8

REST OF THIS TALK

Background
Understanding massively-parallel communication
GPU-centric (but unmanaged) communication
Introducing MANTARO
Use cases for work execution

8

SLIDE 9

BACKGROUND

9

SLIDE 10

COMMUNICATION MODELS

Plain load/store (LD/ST) - de-facto standard in shared memory systems

Never designed for communication
Can be fast for SMP, but often unknown costs for NUMA
Assumes a perfectly timed load seeing a store

Message passing (MP) - de-facto standard in HPC

Various p2p and collective functions
Mainly send/recv semantics used - ease-of-use
Overhead due to functionality & guarantees: copying, matching, progress, ordering

Many more

Active messages - latency tolerance becomes a programming/compiling concern
One-sided communication (put/get) - never say receive

10

[Diagram: LD/ST: threads T0 and T1 communicate through shared memory via a store and a correspondingly timed load. MP: processes P0 and P1 with local memories; send(X, 1, tag) is matched to recv(Y, 0, tag)]
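The contrast between the two models can be sketched in a few lines (a hypothetical Python illustration; all names are invented). LD/ST has no matching metadata at all, while send/recv carries destination and tag so the runtime can copy and match on the receiver's behalf:

```python
# Illustrative contrast of the two communication models (hypothetical sketch).
from collections import deque

# Plain LD/ST: communication is just a store that a later load happens to see;
# nothing tells the reader *when* the value is ready.
shared = {}
shared["X"] = 42          # "store" by thread T0
value = shared.get("X")   # "load" by thread T1; only correct if timed right

# Message passing: send/recv carry explicit matching metadata (dest, tag),
# so the runtime can copy, match, and order messages for the receiver.
mailbox = deque()

def send(dest, buf, tag):
    mailbox.append((dest, tag, list(buf)))   # copy into the runtime's buffer

def recv(src, tag):
    for i, (d, t, data) in enumerate(mailbox):
        if t == tag:                         # match on tag (source match elided)
            del mailbox[i]
            return data
    return None

send(dest=1, buf=[1, 2, 3], tag=7)
result = recv(src=0, tag=7)
```

The copy, match, and ordering steps are exactly the "functionality & guarantees" overhead the slide attributes to message passing.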

SLIDE 11

GPU COMMUNICATION TODAY

Standard: context switch to CPU

Limited to coarse-grain communication
Kernel-completion boundaries

Related work explores CPU helper threads

#GPU entities >> #CPU entities
Applicability depends on communication pattern [DGCN, dCUDA, ...]

11

[Diagram: CPU-controlled GPU communication. Sender: GPU - PCIe - CPU - PCIe - NIC; receiver mirrored across the network. Steps: (1) finish kernel, copy data; (2) MPI send, copy data, network packet; receiver: copy data, recv, copy data, start kernel; (x) completion. Legend: CUDA stack, MPI stack, computation, possible overlap]

SLIDE 12

UPSHOT: CPU BYPASS HELPS

GPU-to-GPU streaming

Prototype system consisting of NVIDIA K20c, dual Intel Xeon E5, custom FPGA network

12

SLIDE 13

UNDERSTANDING MASSIVELY-PARALLEL COMMUNICATION

13

Do we need fine-grain privatization?

SLIDE 14

APPROACH

Characteristics of massively parallel communication

Analyzing large-scale HPC applications

DOE Exascale MPI proxy app traces

~1/2 TB analyzed (25+ TB available online)

14

Application (suite)        Pattern            Ranks
MOCFE (CESAR)              Nearest neighbor   64; 256; 1024
NEKBONE (CESAR)            Nearest neighbor   64; 256; 1024
CNS (EXACT)                Nearest neighbor   64; 256
CNS Large (EXACT)          Nearest neighbor   64; 256; 1024
MultiGrid (EXACT)          Nearest neighbor   64; 256
MultiGrid Large (EXACT)    Nearest neighbor   64; 256; 1024
LULESH (EXMATEX)           Nearest neighbor   64; 512
CMC 2D (EXMATEX)           Nearest neighbor   64; 256; 1024
AMG (DF)                   Nearest neighbor   216; 1728; 13824
AMG Boxlib (DF)            Irregular          64; 1728
BIGFFT (DF)                Many-to-many       100; 1024; 10000
BIGFFT Medium (DF)         Many-to-many       100; 1024; 10000
Crystal Router (DF)        Staged all-to-all  10; 100

SLIDE 15

Observations

Structured patterns

APPLICATION CHARACTERISTICS

15

Neighbor Many-to-many All-to-all Irregular

SLIDE 16

Observations

Structured patterns
Collectives for synchronization, point-to-point for communication

APPLICATION CHARACTERISTICS

16

SLIDE 17

Observations

Structured patterns
Collectives for synchronization, point-to-point for communication
Most messages are surprisingly small

APPLICATION CHARACTERISTICS

17

SLIDE 18

Observations

Structured patterns
Collectives for synchronization, point-to-point for communication
Most messages are surprisingly small
Few communication peers

APPLICATION CHARACTERISTICS

18

Job size (ranks)   Min     Median   Max
[0:63]             3.1 %   28.1 %   40.6 %
[64:127]           6.0 %   12.0 %   15.2 %
[128:255]          0.6 %    7.8 %   26.4 %
[256:511]          3.7 %    5.4 %    7.1 %
[512:1023]         0.4 %    2.0 %    7.0 %
[1024:2047]        1.3 %    2.0 %    4.6 %
[8192:16383]       0.1 %    0.2 %    0.7 %

Communication peers as percentage of all ranks

SLIDE 19

Observations

Structured patterns
Collectives for synchronization, point-to-point for communication
Most messages are surprisingly small
Few communication peers

Insights on communication

Selective, structured and fine-grained
Little/no use of advanced MPI features
Irregular applications will further push requirements

APPLICATION CHARACTERISTICS

19

Benjamin Klenk, Holger Fröning, An Overview of MPI Characteristics of Exascale Proxy Applications, International Supercomputing Conference (ISC) 2017 (accepted for publication & best paper finalist)

[Table repeated from slide 14: applications, communication patterns, and rank counts]

SLIDE 20

GPU-CENTRIC (BUT UNMANAGED) COMMUNICATION

20

Addressing the need for privatization

SLIDE 21

GPU-CENTRIC TRAFFIC SOURCING & SINKING

GGAS: GPU-centric send/receive

Thread-collective data movement
Complete CPU bypass

Cons

Special hardware support required
Reduced overlap

GRMA: GPU-centric put/get

Key is simple descriptor format

Cons

Special hardware support required
Indirection to issue work

21

Lena Oden and Holger Fröning, GGAS: Global GPU Address Spaces for Efficient Communication in Heterogeneous Clusters, IEEE CLUSTER 2013

[Diagram: GGAS path: GPU - PCIe - NIC - network - NIC - PCIe - GPU; a store to the global address space becomes a network packet and a store directly into remote GPU memory, bypassing both CPUs. Legend: CUDA stack, MPI stack, computation, possible overlap]

SLIDE 22

MICRO-BENCHMARK PERFORMANCE

GPU-to-GPU streaming

Prototype system consisting of NVIDIA K20c, dual Intel Xeon E5, custom network

MPI

CPU-controlled: D2H, MPI send/recv, H2D

Others

GPU-controlled, bypassing CPU

Results do not cover overheads regarding issue & completion

22

SLIDE 23

[Chart: application performance normalized to MPI (0.0-3.0) over 2-12 nodes for nbody_small, nbody_large, sum_small, sum_large, himeno, randomAccess; series: GGAS, RMA]

APPLICATION-LEVEL PERFORMANCE

Experiments with applications rewritten for GPU-centric communication

12 nodes (each 2x Intel Ivy Bridge, NVIDIA K20, FPGA network)

Specialized communication always faster than MPI

But can we also get the convenience of managed communication?

23

Benjamin Klenk, Lena Oden, Holger Fröning, Analyzing Communication Models for Distributed Thread-Collaborative Processors in Terms of Energy and Time, 2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2015)

SLIDE 24

INTRODUCING MANTARO

24

GPU-centric privatization

SLIDE 25

A MANY-CORE MESSAGE PROCESSOR

Transforming an SM into a message-parallel processor (SOFTNIC)

Building blocks to support send/recv, put/get, active messages, ...
Layered on top of LD/ST over global address spaces
Or interfacing to a NIC/CPU

Managed communication

Buffer management
Protocol selection
Scheduling data transfers
Choosing communication paths
Asynchronous communication

Adaptable (reprogrammable) to workload
Scalable with flows and GPUs

25

[Diagram: GPU with compute grid(s) of CTAs and a SOFTNIC grid over an NVLink fabric. SOFTNIC components: work request queue, event queue, worker warp pool (single or multi-CTA), supervisor warp, event aggregation & notification, tag matching, queue management, egress path, connection & registration handlers, ingress buffers, collective & AM handlers; backed by GPU memory]

SLIDE 26

FLEXIBLE & COMPOSABLE

Flexible: who sources/sinks traffic?

Threads, warps, CTAs or kernels

Flexible: what is the model?

Send/recv, put/get, active messages?

Flexible: which data path?

LD/ST or DMA engines

Composable using building blocks

Three fundamental tasks

1. Work generation
2. Work execution
3. Work completion

26

[Diagram repeated: SOFTNIC architecture (see slide 25)]

SLIDE 27

WORK GENERATION

Warp-parallel queue

Collaborative enqueue of 1-32 elements
Avoids branch divergence
Warp-parallel except for pointer update

Building block for various uses

Entities: warps, CTAs, or kernels
Shared, global or remote memory

Communication as a sequence of queues
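A CPU-side sketch of this collaborative enqueue idea (an illustrative Python simulation with invented names; the real implementation would use warp vote/prefix intrinsics and an atomic tail update):

```python
# Sketch of a warp-parallel collaborative enqueue: all active lanes enqueue in
# one step; only the tail pointer update is serialized, mirroring the slide's
# "warp-parallel except for pointer update".

def warp_enqueue(queue, tail, active_mask, items):
    n = bin(active_mask).count("1")       # how many lanes participate
    base = tail                           # single pointer update for the warp
    tail += n
    for lane in range(32):                # conceptually concurrent lanes
        if active_mask & (1 << lane):
            # lane's slot = base + number of active lanes below it
            offset = bin(active_mask & ((1 << lane) - 1)).count("1")
            queue[base + offset] = items[lane]
    return tail

queue = [None] * 64
items = [f"msg{lane}" for lane in range(32)]
tail = warp_enqueue(queue, 0, active_mask=0b1011, items=items)
```

Because every active lane derives a distinct slot from the same mask, the enqueue is divergence-free and the slots stay densely packed even when only a subset of lanes participates.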

27

SLIDE 28

WORK COMPLETION

Notifications have to be found quickly

Tables are very handy
Parallel search, low administration overhead
Messaging operations return a pointer to the table entry

Aggregating notifications

Reducing table contention
Reducing time to find all notifications

Issues with current GPUs

Preemption & scheduling

28

compute_stencil (..) {
  ...
  exchange_halo(top, noti);
  exchange_halo(bot, noti);
  exchange_halo(left, noti);
  exchange_halo(right, noti);
  ...
  wait( noti == 4 );
}

SLIDE 29

USE CASES FOR WORK EXECUTION

29

MPI-like send/recv Active messages

SLIDE 30

REMINDER: MESSAGE MATCHING USING MPI

Match one send to one receive based on the {communicator, sender, tag} tuple

Wildcards on sender and tag possible
Messages can arrive unexpectedly
Messages stay in order

MPI internally maintains lists for pre-posted receives and unexpected messages

Queue length and search depth of importance

30

MPI_(I)Send( <buffer>, <count>, <type>, <dest.>, <tag>, <comm>, ...)
MPI_(I)Recv( <buffer>, <count>, <type>, <source>, <tag>, <comm>, ...)
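The matching rule can be modeled in a few lines (an illustrative Python sketch with invented names, not MPI's implementation; the wildcard and unexpected-queue behavior follows the description above):

```python
# MPI-style matching: search the posted-receive list in order; anything that
# arrives without a matching receive goes to the unexpected-message queue.

ANY_SOURCE = ANY_TAG = -1   # stand-ins for MPI_ANY_SOURCE / MPI_ANY_TAG

posted = []      # pre-posted receives: (comm, source, tag)
unexpected = []  # messages that arrived before a matching recv

def matches(recv, msg):
    comm, source, tag = recv
    m_comm, m_source, m_tag = msg
    return (comm == m_comm
            and source in (ANY_SOURCE, m_source)
            and tag in (ANY_TAG, m_tag))

def on_message(msg):
    for i, r in enumerate(posted):   # in-order search preserves MPI ordering
        if matches(r, msg):
            return posted.pop(i)     # matched: consume the posted receive
    unexpected.append(msg)           # otherwise queue as unexpected
    return None

posted.append((0, ANY_SOURCE, 42))   # recv from any source, tag 42
hit = on_message((0, 3, 42))         # send from rank 3, tag 42
miss = on_message((0, 3, 7))         # no matching recv -> unexpected
```

The linear scan over both lists is why queue length and search depth matter for matching performance, as the slide notes.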


SLIDE 31

CPU MATCHING PERFORMANCE

31

Best case (forward) Average case (random)

SLIDE 32

MASSIVELY-PARALLEL TAG MATCHING

32

Parallelization

Vote matrix
Shared memory
Hierarchical approach

1. Multi-warp scan, based on __ballot
2. Single-warp reduction of the column vector to a single vote, based on __ballot, __ffs and bit masking
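A CPU simulation of the ballot-based vote step (illustrative only; __ballot and __ffs are modeled here with plain bit operations, and the multi-warp hierarchy is collapsed to a single warp):

```python
# Simulated warp vote: each of 32 lanes checks one queued tag against the
# incoming tag in parallel; the warp then picks the first match.

def ballot(predicates):
    """Pack one predicate bit per lane into a 32-bit mask, like __ballot."""
    mask = 0
    for lane, p in enumerate(predicates):
        if p:
            mask |= 1 << lane
    return mask

def ffs(mask):
    """1-based index of the least-significant set bit, like CUDA's __ffs (0 if none)."""
    return (mask & -mask).bit_length()

queued_tags = [5, 9, 7, 9] + [0] * 28   # one queue entry per lane
incoming = 9
votes = ballot([t == incoming for t in queued_tags])
first_match = ffs(votes) - 1            # lane index of the first matching entry
```

Selecting the lowest set bit is what preserves queue order among simultaneous matches, which the matrix-based variants rely on.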

SLIDE 33

RELAXING MATCHING SEMANTICS

No unexpected messages

No compaction (10% perf.)
No unnecessary propagation of unmatched elements

No source wildcards

Rank partitioning
Multiple matrices

No ordering

Hash tables
Constant insert and search time complexity

Two orders of magnitude speedup

33

Benjamin Klenk, Holger Fröning, Hans Eberle, Larry Dennison, Relaxations for High-Performance Message Passing on Massively Parallel SIMT Processors, IPDPS 2017 (accepted for publication & best paper award)

Wildcards  Ordering  Unexpected msgs  Partitioning  Data structure  Performance [matches/s]  User implications
yes        yes       yes              no            matrix          < 6M                     none (MPI-like)
yes        yes       no               no            matrix          ~ 6M                     medium
no         yes       yes              yes           matrix          < 60M                    low
no         yes       no               yes           matrix          ~ 60M                    medium
no         no        yes              yes           hash table      < 500M                   high
no         no        no               yes           hash table      ~ 500M                   high
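The hash-table rows rest on a simple observation: once wildcards and ordering are dropped, the {source, tag} pair is an exact key. A minimal sketch (hypothetical Python with invented names, not the paper's GPU code):

```python
# Relaxed matching path: with no wildcards and no ordering, the {source, tag}
# key goes straight into a hash table, giving constant expected
# insert/search time instead of a scan over an ordered queue.

pending_recvs = {}   # (source, tag) -> receive buffer id

def post_recv(source, tag, buf):
    pending_recvs[(source, tag)] = buf

def match_send(source, tag):
    # O(1) expected lookup; a wildcard would force scanning every entry
    return pending_recvs.pop((source, tag), None)

post_recv(source=3, tag=42, buf="buf0")
post_recv(source=5, tag=7, buf="buf1")
matched = match_send(3, 42)      # exact key hit
unmatched = match_send(9, 9)     # no posted receive for this key
```

This is why source wildcards have to go first: a wildcard key cannot be hashed, so it would reintroduce the linear search the relaxation removes.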

SLIDE 34

ACTIVE MESSAGES

Heavily used in task-based programming models
Map nicely to irregular applications

Work lists
Coalescing/aggregation
Possibly sorting for locality maximization

Different forms of execution in Mantaro

Inline (thread warp): limited to max. 32 threads
Inline (complete CTA): stalls communication
Kernel launch: high costs (NVIDIA's Dynamic Parallelism feature)
Registered and pre-launched kernel (persistent threads)
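The registered-handler idea behind these execution forms can be sketched as follows (a hypothetical Python model with invented names; the persistent-threads variant would poll a queue instead of draining a list):

```python
# Minimal active-message sketch: handlers are registered up front, and each
# incoming message names its handler plus a payload.

handlers = {}

def register(handler_id, fn):
    handlers[handler_id] = fn

def am_dispatch(incoming, state):
    # a persistent receiver loop would poll here; we just drain a list
    for handler_id, payload in incoming:
        handlers[handler_id](state, payload)

def am_update(state, payload):
    index, delta = payload
    state[index] += delta          # the work runs where the data lives

register("update", am_update)

state = [0, 0, 0, 0]
am_dispatch([("update", (1, 5)), ("update", (3, 2)), ("update", (1, 1))], state)
```

Registering handlers ahead of time is what lets the receiver execute work inline (warp or CTA) without a kernel launch on the critical path.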

34

SLIDE 35

RANDOM-ACCESS BENCHMARK

35

Daniel Schlegel, Active Messaging in Autonomous GPU Networks, Master thesis, Ruprecht-Karls University of Heidelberg, Germany, 2016.

Part of HPCC benchmark suite (CPU version)

http://icl.cs.utk.edu/hpcc/

Ported to a GPU version

Data-driven memory accesses distributed over multiple GPUs
Many fine-grain interactions
Buckets aggregate update operations

Performance

PCIe-connected K80 GPUs
Up to 1 GUPS, good scalability
Similar to an equivalent CPU system (192 MPI ranks, 104 SMs total)
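The bucket aggregation idea can be sketched as follows (an illustrative Python model with invented names, not the ported benchmark): updates are staged into per-destination buckets and flushed in bulk, trading per-update messages for fewer, larger ones.

```python
# Bucketed update aggregation for a GUPS-style random-access workload.

def bucketed_updates(updates, num_owners, table_size, flush_threshold=4):
    table = [0] * table_size
    buckets = [[] for _ in range(num_owners)]
    sent_batches = 0

    def flush(owner):
        nonlocal sent_batches
        if buckets[owner]:
            sent_batches += 1          # one "message" carries many updates
            for index in buckets[owner]:
                table[index] ^= 1      # the remote update itself
            buckets[owner].clear()

    for index in updates:
        owner = index * num_owners // table_size   # who owns this table slice
        buckets[owner].append(index)
        if len(buckets[owner]) >= flush_threshold:
            flush(owner)
    for owner in range(num_owners):    # drain remaining partial buckets
        flush(owner)
    return table, sent_batches

table, batches = bucketed_updates([0, 5, 1, 6, 2, 7, 3, 4],
                                  num_owners=2, table_size=8)
```

Here eight single-word updates collapse into two batched messages, which is the kind of fine-grain-to-bulk conversion that makes the benchmark scale over PCIe.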

SLIDE 36

WRAPPING UP

36

SLIDE 37

COMMUNICATION MODEL PROPOSALS

Send/recv communication

No ordering guarantees, limited use of wildcards

Asynchronous communication
Consistency control: blocking wait, or pre-registered, deferred actions depending on completion

Active messages

Pre-registered functions
Different forms of execution, possibly dynamically determined
Data placement determines place of execution

37

Point-to-point source:
  mantaro_send (dest, &buf, tag, &handle, ...);
Point-to-point sink:
  mantaro_recv (src, &buf, tag, &handle, ...);
Collective ops:
  mantaro_barrier (group, &handle, ...);
  mantaro_all2all (tag, &handle, ...);
Synchronization:
  mantaro_wait (&handle);  /* blocking wait */
  mantaro_defer (&handle, &action);
Function registration:
  // base handler class w/ virtual functions
  class AMUpdate : public Mantaro::AMBase {...}
AM send:
  mantaro_am_send (AMUpdate_msg, &buf, ...);
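The wait/defer distinction can be modeled in a few lines (a hypothetical sketch of the proposed handle semantics, not Mantaro code; all names are invented):

```python
# Completion styles: block on a handle, or pre-register a deferred action
# that the runtime fires when the operation completes.

class Handle:
    def __init__(self):
        self.done = False
        self.deferred = []

    def defer(self, action):
        # register an action to run on completion instead of blocking
        if self.done:
            action()
        else:
            self.deferred.append(action)

    def complete(self):
        # called by the runtime when the operation finishes
        self.done = True
        for action in self.deferred:
            action()
        self.deferred.clear()

log = []
h = Handle()
h.defer(lambda: log.append("halo exchanged"))  # deferred action, no blocking
h.complete()                                   # completion triggers it
```

Deferred actions keep the issuing threads running, which matters on a GPU where blocking a warp on a handle stalls far more parallel work than blocking one CPU thread would.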

SLIDE 38

SUMMARY

Managed communication addresses

Strong NUMA effects of multi-GPU systems
The need for fine-grain, selective communication

Mantaro: a many-core message-parallel processor

Capable of handling massively parallel communication
Flexible and adaptable
A tool to explore GPU communication

Issues/limitations

Inter-CTA latency, progress guarantees, preemption & memory management, execution launch costs
We believe GPUs will continue to evolve

38

[Diagram repeated: SOFTNIC architecture (see slide 25)]