slide-1
SLIDE 1

S6357: Towards Efficient Communication Methods and Models for Scalable GPU-Centric Computing Systems

Holger Fröning, Computer Engineering Group, Ruprecht-Karls University of Heidelberg
GPU Technology Conference 2016

slide-2
SLIDE 2

About us

  • JProf. Dr. Holger Fröning
  • PI of Computer Engineering Group, ZITI, Ruprecht-Karls University of Heidelberg
  • http://www.ziti.uni-heidelberg.de/compeng
  • Research: Application-specific computing under hard power and energy constraints (HW/SW), future emerging technologies

  • High-performance computing (traditional and emerging)
  • GPU computing (heterogeneity & massive concurrency)
  • High-performance analytics (scalable graph computations)
  • Emerging technologies (approximate computing and stacked memory)
  • Reconfigurable logic (digital design and high-level synthesis)
  • Current collaborations
  • Nvidia Research, US; University of Castilla-La Mancha, Albacete, Spain; CERN, Switzerland; SAP, Germany; Georgia Institute of Technology, US; Technical University of Valencia, Spain; TU Graz, Austria; various companies


slide-3
SLIDE 3

The Problem

  • GPUs are powerful high-core-count devices, but only for in-core computations

  • But many workloads cannot be satisfied by a single GPU
  • Technical computing, graph computations, data warehousing, molecular dynamics, quantum chemistry, particle physics, deep learning, spiking neural networks

  • => Multi-GPU, at node level and at cluster level
  • Hybrid programming models
  • While single GPUs are rather simple to program, interactions between multiple GPUs dramatically push complexity!
  • This talk: how good are GPUs at sourcing/sinking network traffic, how should one orchestrate communication, and what do we need for the best performance/energy efficiency?


slide-4
SLIDE 4

Review: Messaging-based Communication

  • Usually Send/Receive or Put/Get
  • MPI as de-facto standard
  • Work request descriptors
  • Issued to the network device
  • Target node, source pointer, length, tag, communication method, ...
  • Irregular accesses, little concurrency
  • Memory registration
  • OS & driver interactions
  • Consistency by polling on completion notifications (a hedged verbs sketch follows this list)
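For concreteness, here is a minimal sketch of posting one such work request with InfiniBand verbs and polling for its completion. It assumes an already set-up queue pair qp, completion queue cq, a memory region mr from ibv_reg_mr(), and a buffer buf of length len; error handling is omitted.

#include <infiniband/verbs.h>
#include <stdint.h>

// Build the descriptor (work request) on the CPU...
struct ibv_sge sge = {
    .addr   = (uintptr_t)buf,  // must point into registered memory
    .length = len,
    .lkey   = mr->lkey,        // local key obtained from ibv_reg_mr()
};
struct ibv_send_wr wr = {
    .wr_id      = 1,                  // user tag, echoed in the completion
    .sg_list    = &sge,
    .num_sge    = 1,
    .opcode     = IBV_WR_SEND,
    .send_flags = IBV_SEND_SIGNALED,  // request a completion notification
};
struct ibv_send_wr *bad_wr;

// ...issue it to the network device...
ibv_post_send(qp, &wr, &bad_wr);

// ...and establish consistency by polling on completion notifications.
struct ibv_wc wc;
while (ibv_poll_cq(cq, 1, &wc) == 0)
    ;  // spin until the completion arrives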


slide-5
SLIDE 5

Lena Oden, Holger Fröning, Infiniband-Verbs on GPU: A case study of controlling an Infiniband network device from the GPU, International Journal of High Performance Computing Applications, Special Issue on Applications for the Heterogeneous Computing Era, Sage Publications, 2015.

Beyond CPU-centric communication


[Figure: source node and target node, each with GPU, CPU, NIC, PCIe root, GPU memory, and host memory; annotated "100x"]

Start-up latency: ~1.5 usec (CPU-controlled) vs. ~15 usec (GPU-controlled Put/Get, IBVERBS)

“… a bad semantic match between communication primitives required by the application and those provided by the network.” - DOE Subcommittee Report, Top Ten Exascale Research Challenges. 02/10/2014


GPUs rather incompatible with messaging:

  • Constructing descriptors (work requests)
  • Registering memory
  • Polling
  • Controlling networking devices

slide-6
SLIDE 6

Communication orchestration - how to source and sink network traffic

slide-7
SLIDE 7

Example application: 3D stencil code

  • Himeno 3D stencil code
  • Solving a Poisson equation using 2D CTAs (marching planes)
  • Multiple iterations using iterative kernel launches
  • Multi-GPU: inter-block and inter-GPU dependencies
  • Dependencies => communication
  • Inter-block: device synchronization required among adjacent CTAs
  • Inter-GPU: all CTAs participate in communication (sourcing and sinking) => device synchronization required


[Figure: control flow using CPU-controlled communication for the 3D Himeno stencil code with 2D CTAs; a hedged sketch follows]
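A minimal sketch of this control flow, assuming illustrative names such as stencil_kernel and exchange_halos (not the talk's actual code):

// One CPU-controlled multi-GPU iteration of the stencil (hedged sketch).
for (int it = 0; it < iterations; ++it) {
    // Compute the local planes with 2D CTAs.
    stencil_kernel<<<grid, block, 0, stream>>>(p_new, p_old);
    cudaStreamSynchronize(stream);  // CPU waits for the kernel to finish
    exchange_halos(p_new);          // e.g. D2H copy, MPI_Sendrecv, H2D copy
    float *tmp = p_old; p_old = p_new; p_new = tmp;  // swap buffers
}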

slide-8
SLIDE 8

Different forms of communication control for an example stencil code


[Figures: control flow using in-kernel synchronization; control flow using stream synchronization, with and without nested parallelism]
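For intuition, a minimal sketch of an in-kernel, device-wide barrier built on global atomics; this is our illustration, not the paper's implementation. The serialization on the atomic counter is the kind of effect behind the instruction replays reported on the next slide.

// Hedged sketch: single-use device-wide barrier over a global atomic counter.
__device__ int g_arrived = 0;

__device__ void device_barrier(int num_ctas) {
    __syncthreads();                            // sync within the CTA first
    if (threadIdx.x == 0 && threadIdx.y == 0) {
        atomicAdd(&g_arrived, 1);               // announce this CTA's arrival
        while (atomicAdd(&g_arrived, 0) < num_ctas)
            ;                                   // spin until all CTAs arrived
    }
    __syncthreads();                            // release the CTA's threads
}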

slide-9
SLIDE 9

Performance comparison - execution time


0" 0,1" 0,2" 0,3" 0,4" 0,5" 0,6" 0,7" 0,8" 0,9" 1" 256x256x256" 256x256x512" 256x256x1024" 512x512x256" 512x512x512" 512x512x640" 640x640x128" 640x640x256" 640x640x386" Perfomance** Rela-ve*to*hybrid*approach* in0kernel0sync" stream0sync" device0sync"

  • CPU-controlled still fastest
  • Backed up by previous experiments
  • In-kernel synchronization slowest
  • Communication overhead increases with problem size: more CTAs, more device synchronization
  • ~28% of all instructions have to be replayed, likely due to serialization (use of atomics)
  • Stream synchronization a good option
  • Difference to device synchronization is the overhead of nested parallelism
  • Device synchronization most flexible regarding control flow
  • Communication as device function or as independent kernel
  • Flexibility in kernel launch configuration

Lena Oden, Benjamin Klenk, Holger Fröning, Analyzing GPU-controlled Communication and Dynamic Parallelism in Terms of Performance and Energy, Elsevier Journal of Parallel Computing (ParCo), 2016.

slide-10
SLIDE 10

Performance comparison - energy consumption


0" 0,2" 0,4" 0,6" 0,8" 1" 1,2" 256x256x256" 256x256x512" 256x256x1024" 512x512x256" 512x512x512" 512x512x640" 640x640x128" 640x640x256" 640x640x386" Energy'consump.on' Rela.ve'to'hybrid'approach' stream2sync" device2sync" in2kernel2sync"

  • Benefits for stream/device synchronization, as the CPU is put into sleep mode
  • 10% less energy consumption
  • CPU: 20-25 W saved
  • In-kernel synchronization saves much more total power, but the increased execution time results in higher energy consumption
  • Likely bad GPU utilization

Lena Oden, Benjamin Klenk, Holger Fröning, Analyzing GPU-controlled Communication and Dynamic Parallelism in Terms of Performance and Energy, Elsevier Journal of Parallel Computing (ParCo), 2016.

slide-11
SLIDE 11

Communication orchestration - Take aways

  • CPU-controlled communication is still fastest, independent of different orchestration optimizations
  • GPU-controlled communication: intra-GPU synchronization between the individual CTAs is most important for performance
  • Stream synchronization most promising
  • Otherwise replay overhead due to serialization
  • Dedicated communication kernels or functions are highly recommended
  • Either device functions for a master kernel (nested parallelism), or communication kernels in the same stream (issued by the CPU)
  • Bypassing CPUs has substantial energy advantages
  • Decrease polling rates, or use interrupt-based CUDA events (see the sketch after the polling loop below)!
  • More room for optimizations left


while (cudaStreamQuery(stream) == cudaErrorNotReady)
    usleep(sleeptime);  // throttled polling: CPU sleeps between stream queries
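A minimal sketch of the recommended interrupt-based alternative, using standard CUDA events with blocking synchronization (the API is real, the usage here is our illustration):

cudaEvent_t done;
cudaEventCreateWithFlags(&done, cudaEventBlockingSync);  // blocking-sync event
cudaEventRecord(done, stream);  // marker after the last communication kernel
cudaEventSynchronize(done);     // calling thread blocks, CPU can sleep
cudaEventDestroy(done);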

slide-12
SLIDE 12

GGAS: Fast GPU-controlled traffic sourcing and sinking

slide-13
SLIDE 13

GGAS – Global GPU Address Spaces

  • Forwarding load/store operations to global addresses

  • Address translation and target identification
  • Special hardware support required (NIC)
  • Severe limitations for full coherence and strong consistency
  • Well known for CPU-based distributed shared memory
  • Reverting to highly relaxed consistency models can be a solution


Holger Fröning and Heiner Litz, Efficient Hardware Support for the Partitioned Global Address Space, 10th Workshop on Communication Architecture for Clusters (CAC2010), co-located with 24th International Parallel and Distributed Processing Symposium (IPDPS 2010), April 19, 2010, Atlanta, Georgia.

slide-14
SLIDE 14


GGAS – thread-collaborative BSP-like communication

Lena Oden and Holger Fröning, GGAS: Global GPU Address Spaces for Efficient Communication in Heterogeneous Clusters, IEEE International Conference on Cluster Computing 2013, September 23-27, 2013, Indianapolis, US.

[Figure: BSP-like timeline across GPU 0 and GPU 1 (remote): computation, communication using collective remote stores, global barrier, then computation continues]

slide-15
SLIDE 15

GGAS – current programming model using mailboxes


<snip>
...
remMailbox[getProcess(index)][tid] = data[tid];  // thread-collaborative stores into the remote mailbox
__threadfence_system();                          // memory fence: order payload stores before the flag
remoteEndFlag[getProcess(index)][0] = 1;         // raise the completion flag at the receiver
__ggas_barrier();                                // global barrier across participating GPUs
...
<snip>
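The slide shows only the sending side. A hedged sketch of the matching receiver under the same mailbox model; the names localMailbox and localEndFlag, and the flag layout, are our assumptions (flags assumed declared volatile so the polling loop re-reads memory):

while (localEndFlag[src][0] == 0)
    ;                                // poll the flag raised by the remote store
data[tid] = localMailbox[src][tid];  // each thread drains its mailbox slot
__ggas_barrier();                    // rejoin the global BSP schedule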

slide-16
SLIDE 16

GGAS Prototype


Remote load latency (Virtex-6): 1.44-1.9 usec (CPU/GPU)

[Figure: Node #0 (source) issues loads/stores: source-local address, target node determination, address calculation to a global address; loss-less and in-order packet forwarding; at Node #1 (target, host memory): address calculation to the target-local address, with source tag management providing the return route]

  • FPGA-based network prototype
  • Xilinx Virtex-6
  • 64-bit data paths at 156 MHz = 1.248 GB/s (theoretical peak)
  • PCIe G1/G2
  • 4 network links (torus topology)
slide-17
SLIDE 17

GGAS – Microbenchmarking

  • GPU-to-GPU streaming
  • Prototype system consisting of Nvidia K20c & dual Intel Xeon E5
  • Relative results applicable to technology-related performance improvements
  • MPI
  • CPU-controlled: D2H, MPI send/recv, H2D
  • GGAS
  • GPU-controlled: GDDR to GDDR, remote stores
  • RMA: Remote Memory Access
  • Put/Get-based, CPU-to-CPU (host) resp. GPU-to-GPU (direct)


[Chart annotations: GGAS latency starting at 1.9 usec; P2P PCIe issue]

slide-18
SLIDE 18

Allreduce – Power and Energy analysis

Lena Oden, Benjamin Klenk and Holger Fröning, Energy-Efficient Collective Reduce and Allreduce Operations on Distributed GPUs, 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid2014), May 26-29, 2014, Chicago, IL, US.


[Chart: power over time for GGAS vs. MPI; for this case, 50% of the energy is saved]

slide-19
SLIDE 19

Analyzing Communication Models for Thread-parallel Processors

slide-20
SLIDE 20

Communication Models

  • MPI
  • CPU-controlled
  • De-facto standard, widely used, heavily optimized (for CPUs)
  • GGAS (using mailboxes with send/receive semantics)
  • GPU-controlled
  • Communication by forwarding load/store operations using global address spaces
  • Completely in line with the GPU execution model => highly thread-parallel
  • Main drawback is reduced overlap
  • RMA: Remote Memory Access
  • GPU-controlled
  • Put/Get operations of the custom interconnect
  • Communication engine designed for HPC
  • GPUs have to construct/interpret descriptors (which was very crucial for the IBVERBS experiment)


slide-21
SLIDE 21

Completing GGAS with Put/Get operations

  • Descriptor-based Put/Get operations
  • Completely asynchronous
  • Work request with
  • Type
  • Local/Remote pointers
  • Credentials
  • Notification requests
  • Notification
  • Completion information with reference to the work request

  • Key is a simple descriptor format (a hedged struct sketch follows)
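As a rough illustration, such a work request could be expressed as a small C struct. The field names and bit widths follow the custom descriptor format shown on the next slide, but this packing is our sketch, not the exact wire format:

#include <stdint.h>

// Hedged sketch of a simple Put/Get work request descriptor.
struct put_get_wr {
    uint8_t  cmd;           // type: Put or Get (CMD, 4 bit on the wire)
    uint8_t  noti;          // notification request flags (NOTI, 3 bit)
    uint16_t dest_node;     // target node (16 bit)
    uint8_t  dest_vpid;     // credentials: destination VPID (8 bit)
    uint32_t payload_size;  // payload size in bytes (23 bit on the wire)
    uint64_t read_addr;     // source pointer (64 bit)
    uint64_t write_addr;    // destination pointer (64 bit)
};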


slide-22
SLIDE 22

[Figure: IBVERBS descriptor format vs. custom descriptor format as doubleword/bit layouts; recoverable fields: Read Address (64 bit), Write Address (64 bit), Destination Node (16 bit), Payload Size in bytes (23 bit), CMD (4 bit), NOTI (3 bit), Destination VPID (8 bit)]

Put/Get: Different Work Request Queue Implementations

  • Descriptor format and queue organization matters
  • Explicit/implicit trigger, conversion effort, descriptor complexity
  • FPGA implementation: we could even change the descriptor format



slide-23
SLIDE 23

Example applications

  • Testing is hard, as applications have to be re-written for GPU-centric communication
  • Set of 4 workloads implemented: N-Body, Global Sum, Himeno stencil, Random Access


slide-24
SLIDE 24

[Chart: performance normalized to MPI (0.0-3.0) on 2-12 nodes for the benchmarks nbody_small, nbody_large, sum_small, sum_large, himeno, and randomAccess; series: GGAS, RMA]

Performance comparison - execution time


Benjamin Klenk, Lena Oden, Holger Fröning, Analyzing Communication Models for Distributed Thread-Collaborative Processors in Terms of Energy and Time, 2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2015), Philadelphia, PA, March 29-31, 2015.

  • 2-12 nodes (each 2x Intel Ivy Bridge, Nvidia K20, FPGA network)
  • Normalized to MPI: >1 = better performance, <1 = worse performance
slide-25
SLIDE 25

Performance comparison - energy consumption


Benjamin Klenk, Lena Oden, Holger Fröning, Analyzing Communication Models for Distributed Thread-Collaborative Processors in Terms of Energy and Time, 2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2015), Philadelphia, PA, March 29-31, 2015.

[Chart: energy consumption normalized to MPI (0.0-1.2) for NB-S, NB-L, sum-S, sum-L, RA, Himeno, and the average; series: GGAS, RMA]

  • 12 nodes (each 2x Intel Ivy Bridge, Nvidia K20, Extoll FPGA)
  • Normalized to MPI: <1 = better energy consumption, >1 = worse energy consumption
slide-26
SLIDE 26

Performance comparison - observations


Observations, each applying to a subset of the benchmarks (N-Body, Himeno Stencil, Global Sum, Random Access):

  • RMA and MPI offer a better exploitation of overlap possibilities
  • GGAS performs outstandingly for small payloads, as no indirections are required, such as context switches to the CPU or work request issues to the NIC
  • The PCIe peer-to-peer read problem results in MPI performing better than RMA or GGAS for large payload sizes
  • GGAS in combination with RMA outperforms MPI substantially (without the PCIe peer-to-peer read limitation)
  • In practice, the execution time has an essential influence on energy consumption
  • Accesses to host memory contribute significantly to DRAM and CPU socket power
  • Bypassing a component like a CPU can save enough power to compensate for a longer execution time, resulting in energy savings
  • Staging copies contribute significantly to both CPU and GPU power, due to the involved software stacks and active DMA controllers, respectively
  • For irregular communication patterns or small payloads, GGAS saves both time and energy

slide-27
SLIDE 27

Related effort: Simplified multi-GPU programming

slide-28
SLIDE 28
Towards simplified multi-GPU programming

  • Single-GPU programming based on CUDA/OpenCL is a prime example of a BSP execution model
  • Exposes large amounts of structured parallelism, rather easy to use
  • Multi-GPU programming is becoming more important, but adds huge amounts of complexity
  • Processor aggregation is easy
  • Memory aggregation is challenging
  • Local/remote bandwidth disparity
  • UVA (Unified Virtual Addressing) works out-of-the-box => poor performance
  • Solution: use a compiler-based approach to automatically partition regular GPU code


slide-29
SLIDE 29

GPU Mekong

  • Tool stack based on the LLVM tool chain
  • Code analysis
  • Automated decision making
  • Code transformations for automated partitioning and data movements
  • Regularity
  • Input/output data, dimensionality
  • Addition of a super-block ID
  • Index modification (see the sketch after this list)
  • Executed kernels
  • Iterative execution?
  • Multi-device initialization
  • Data distribution
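To make the super-block ID and index modification steps concrete, a hedged sketch of the kind of rewrite such a pass might emit (illustrative only, not Mekong's actual transformation):

// Original single-GPU indexing:
//   int i = blockIdx.x * blockDim.x + threadIdx.x;
// After partitioning, each device executes one contiguous super-block of CTAs:
__global__ void kernel_partitioned(float *out, const float *in,
                                   int super_block_id, int blocks_per_dev) {
    int global_block = super_block_id * blocks_per_dev + blockIdx.x;
    int i = global_block * blockDim.x + threadIdx.x;  // device-global index
    out[i] = 2.0f * in[i];                            // unchanged kernel body
}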
slide-30
SLIDE 30

GPU Mekong

  • Funded by a Google Research Award
  • Under heavy development
  • https://sites.google.com/site/gpumekong
  • Initial results for a 16-GPU matrix multiply


Alexander Matz, Mark Hummel, Holger Fröning, Exploring LLVM Infrastructure for Simplified Multi-GPU Programming, Ninth International Workshop on Programmability and Architectures for Heterogeneous Multicores (MULTIPROG-2016), in conjunction with HiPEAC 2016, Prague, Czech Republic, Jan. 18, 2016.

[Charts: speedup over the number of GPUs (2-8 and 4-16) for NxN matrix multiply, N = 4096, 8192, 12288, 16384, 20480, 24576]

slide-31
SLIDE 31

Wrapping up

slide-32
SLIDE 32

Summary

  • GPUs as first-class citizens in a peer networking environment, capable of sourcing and sinking traffic
  • Traditional messaging libraries and models poorly match the GPU's thread-collaborative execution model
  • Also, GPUs require fast communication paths with minimal latency/maximum message rate, combined with high-overlap paths like Put/Get
  • However: the semantic gap between architecture and user is growing
  • We need automated tooling to close this gap
  • Unified communication models
  • Automatically selecting the right communication method and path between heterogeneous computing units => flexibility in scheduling kernels
  • Multiple communication models will dramatically push complexity, too
  • CACM 03/2015, John Hennessy: it's the era of software


Specialized processors like GPUs require specialized communication models/methods

slide-33
SLIDE 33

Thank you!

Credits
Contributions: Lena Oden (former PhD student), Benjamin Klenk (PhD student), Daniel Schlegel (graduate student), Günther Schindler (graduate student)
Discussions: Sudha Yalamanchili (Georgia Tech), Jeff Young (Georgia Tech), Larry Dennison (Nvidia), Hans Eberle (Nvidia)
Sponsoring: Nvidia, Xilinx, German Excellence Initiative, Google

[Figure: current main interactions]
