S6357 Towards Efficient Communication Methods and Models for Scalable GPU-Centric Computing Systems - PowerPoint PPT Presentation


  1. S6357 Towards Efficient Communication Methods and Models for Scalable GPU-Centric Computing Systems
  Holger Fröning, Computer Engineering Group, Ruprecht-Karls University of Heidelberg
  GPU Technology Conference 2016

  2. About us
  • JProf. Dr. Holger Fröning
    • PI of Computer Engineering Group, ZITI, Ruprecht-Karls University of Heidelberg
    • http://www.ziti.uni-heidelberg.de/compeng
  • Research: application-specific computing under hard power and energy constraints (HW/SW), future emerging technologies
    • High-performance computing (traditional and emerging)
    • GPU computing (heterogeneity & massive concurrency)
    • High-performance analytics (scalable graph computations)
    • Emerging technologies (approximate computing and stacked memory)
    • Reconfigurable logic (digital design and high-level synthesis)
  • Current collaborations: Nvidia Research, US; University of Castilla-La Mancha, Albacete, Spain; CERN, Switzerland; SAP, Germany; Georgia Institute of Technology, US; Technical University of Valencia, Spain; TU Graz, Austria; various companies

  3. The Problem
  • GPUs are powerful high-core-count devices, but only for in-core computations
  • Many workloads cannot be satisfied by a single GPU
    • Technical computing, graph computations, data warehousing, molecular dynamics, quantum chemistry, particle physics, deep learning, spiking neural networks
  • => Multi-GPU, at node level and at cluster level
    • Hybrid programming models
    • While single GPUs are rather simple to program, interactions between multiple GPUs dramatically push complexity!
  • This talk: how good are GPUs at sourcing/sinking network traffic, how should one orchestrate communication, and what do we need for the best performance and energy efficiency?

  4. Review: Messaging-based Communication
  • Usually Send/Receive or Put/Get
    • MPI as de-facto standard
  • Work request descriptors
    • Issued to the network device
    • Target node, source pointer, length, tag, communication method, ...
    • Irregular accesses, little concurrency
  • Memory registration
    • OS & driver interactions
  • Consistency by polling on completion notifications
  (A sketch of this descriptor-and-polling pattern follows below.)
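To make the descriptor-based model concrete, below is a minimal sketch (not from the talk) of posting one send work request and polling its completion with the InfiniBand verbs API; the queue pair qp, completion queue cq, and registered region mr are assumed to have been set up beforehand.

    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <string.h>

    /* Sketch: build one work request descriptor, hand it to the NIC,
       and poll the completion queue. All setup (qp, cq, mr) is assumed. */
    int post_and_wait(struct ibv_qp *qp, struct ibv_cq *cq,
                      struct ibv_mr *mr, void *buf, uint32_t len)
    {
        struct ibv_sge sge;                /* where the payload lives */
        memset(&sge, 0, sizeof(sge));
        sge.addr   = (uintptr_t)buf;
        sge.length = len;
        sge.lkey   = mr->lkey;             /* key from prior memory registration */

        struct ibv_send_wr wr, *bad_wr;    /* the descriptor itself */
        memset(&wr, 0, sizeof(wr));
        wr.wr_id      = 1;
        wr.sg_list    = &sge;
        wr.num_sge    = 1;
        wr.opcode     = IBV_WR_SEND;
        wr.send_flags = IBV_SEND_SIGNALED; /* ask for a completion notification */

        if (ibv_post_send(qp, &wr, &bad_wr))
            return -1;                     /* issue to the network device */

        struct ibv_wc wc;                  /* consistency by polling */
        int n;
        do { n = ibv_poll_cq(cq, 1, &wc); } while (n == 0);
        return (n > 0 && wc.status == IBV_WC_SUCCESS) ? 0 : -1;
    }

Every step here (filling structs, ringing the doorbell, spinning on the completion queue) is a serial, irregular operation, which is exactly what maps poorly onto thousands of GPU threads.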

  5. Beyond CPU-centric communication
  [Figure: source and target node block diagrams (CPU, host memory, PCIe root, NIC, GPU, GPU memory); start-up latency of 1.5 usec for CPU-controlled messaging vs. 15 usec for GPU-controlled Put/Get (IBVERBS), with a 100x penalty called out]
  • GPUs are rather incompatible with messaging:
    • Constructing descriptors (work requests)
    • Registering memory
    • Polling
    • Controlling networking devices
  • "... a bad semantic match between communication primitives required by the application and those provided by the network." - DOE Subcommittee Report, Top Ten Exascale Research Challenges, 02/10/2014
  Lena Oden, Holger Fröning, Infiniband-Verbs on GPU: A case study of controlling an Infiniband network device from the GPU, International Journal of High Performance Computing Applications, Special Issue on Applications for the Heterogeneous Computing Era, Sage Publications, 2015.

  6. Communication orchestration - how to source and sink network traffic

  7. Example application: 3D stencil code
  • Himeno 3D stencil code
    • Solving a Poisson equation using 2D CTAs (marching planes)
    • Multiple iterations using iterative kernel launches
  • Multi-GPU: inter-block and inter-GPU dependencies
    • Dependencies => communication
    • Inter-block: device synchronization required among adjacent CTAs
    • Inter-GPU: all CTAs participate in communication (sourcing and sinking) => device synchronization required
  [Figures: 3D Himeno stencil code with 2D CTAs; control flow using CPU-controlled communication]
  (A sketch of the CPU-controlled control flow follows below.)
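A minimal sketch of the CPU-controlled ("hybrid") control flow for such a stencil is shown below; the kernel name stencil_plane, the halo buffers, and the MPI partner rank are illustrative assumptions, not names from the talk.

    #include <mpi.h>
    #include <cuda_runtime.h>

    // Hypothetical stencil kernel; the body is omitted here.
    __global__ void stencil_plane(const float *in, float *out,
                                  int nx, int ny, int nz);

    // CPU-controlled loop: the CPU drives every compute and communication step.
    void run_hybrid(float *d_in, float *d_out,
                    float *d_halo_send, float *d_halo_recv,
                    float *h_halo_send, float *h_halo_recv,
                    size_t halo_bytes, int halo_elems, int neighbor,
                    int nx, int ny, int nz, int iterations,
                    dim3 grid, dim3 block)
    {
        for (int it = 0; it < iterations; ++it) {
            stencil_plane<<<grid, block>>>(d_in, d_out, nx, ny, nz);
            cudaDeviceSynchronize();                 // inter-block sync point

            // Halo exchange staged through the host: D2H -> MPI -> H2D
            cudaMemcpy(h_halo_send, d_halo_send, halo_bytes,
                       cudaMemcpyDeviceToHost);
            MPI_Sendrecv(h_halo_send, halo_elems, MPI_FLOAT, neighbor, 0,
                         h_halo_recv, halo_elems, MPI_FLOAT, neighbor, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            cudaMemcpy(d_halo_recv, h_halo_recv, halo_bytes,
                       cudaMemcpyHostToDevice);

            float *tmp = d_in; d_in = d_out; d_out = tmp;  // ping-pong buffers
        }
    }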

  8. Different forms of communication control for an example stencil code
  [Figures: control flow using in-kernel synchronization; control flow using stream synchronization (with/without nested parallelism)]
  (Sketches of these variants follow below.)
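A sketch of the stream-synchronized variant, under the same illustrative names as above: the CPU only enqueues kernels, and ordering within the stream replaces explicit device-wide synchronization on the host.

    #include <cuda_runtime.h>

    __global__ void stencil_plane(const float *in, float *out,
                                  int nx, int ny, int nz);
    // comm_kernel stands for a GPU-side routine that sources/sinks the traffic.
    __global__ void comm_kernel(float *d_halo_send, float *d_halo_recv);

    void run_stream_sync(float *d_in, float *d_out,
                         float *d_halo_send, float *d_halo_recv,
                         int nx, int ny, int nz, int iterations,
                         dim3 grid, dim3 block, cudaStream_t stream)
    {
        for (int it = 0; it < iterations; ++it) {
            // Both kernels go into the same stream; stream order guarantees
            // the compute step completes before communication starts.
            stencil_plane<<<grid, block, 0, stream>>>(d_in, d_out, nx, ny, nz);
            comm_kernel<<<grid, block, 0, stream>>>(d_halo_send, d_halo_recv);
            float *tmp = d_in; d_in = d_out; d_out = tmp;
        }
        cudaStreamSynchronize(stream);  // single host-side sync at the end
    }

The in-kernel variant instead fuses compute and communication into one persistent kernel and synchronizes all CTAs on the device itself; the barrier sketch after the next slide shows the core of that mechanism.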

  9. Performance comparison - execution time
  [Chart: performance relative to the hybrid approach for problem sizes 256x256x256 through 640x640x386; series: in-kernel sync, stream sync, device sync]
  • CPU-controlled still fastest
    • Backed up by previous experiments
  • In-kernel synchronization slowest
    • Communication overhead increases with problem size: more CTAs, more device synchronization
    • ~28% of all instructions have to be replayed, likely due to serialization (use of atomics; see the barrier sketch below)
  • Stream synchronization a good option
    • Difference to device synchronization is the overhead of nested parallelism
  • Device synchronization most flexible regarding control flow
    • Communication as device function or as independent kernel
    • Flexibility in kernel launch configuration
  Lena Oden, Benjamin Klenk, Holger Fröning, Analyzing GPU-controlled Communication and Dynamic Parallelism in Terms of Performance and Energy, Elsevier Journal of Parallel Computing (ParCo), 2016.
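The in-kernel variant needs a device-wide barrier among all CTAs, which on this hardware generation had to be built from atomics. The following is a textbook atomic-counter barrier (after Xiao/Feng; not the authors' exact code), and it shows where the serialization comes from: every CTA funnels through a single atomic counter.

    #include <cuda_runtime.h>

    // Valid only for a 1D grid whose CTAs are all resident simultaneously.
    __device__ volatile unsigned int barrier_count = 0;
    __device__ volatile unsigned int barrier_gen   = 0;

    __device__ void global_barrier()
    {
        __syncthreads();                      // settle threads within the CTA
        if (threadIdx.x == 0) {
            __threadfence();                  // publish this CTA's prior writes
            unsigned int gen = barrier_gen;
            // All CTAs serialize on this one atomic; the contention shows
            // up as replayed instructions in the profiler.
            if (atomicInc((unsigned int *)&barrier_count, gridDim.x - 1)
                    == gridDim.x - 1)
                barrier_gen = gen + 1;        // last arrival releases everyone
            else
                while (barrier_gen == gen) ;  // spin until released
        }
        __syncthreads();
    }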

  10. Performance comparison - energy consumption
  [Chart: energy consumption relative to the hybrid approach for problem sizes 256x256x256 through 640x640x386; series: stream sync, device sync, in-kernel sync]
  • Benefits for stream/device synchronization as the CPU is put into sleep mode
    • 10% less energy consumption
    • CPU: 20-25W saved
  • In-kernel synchronization saves much more total power, but the execution time increase results in higher energy consumption
    • Likely bad GPU utilization
  Lena Oden, Benjamin Klenk, Holger Fröning, Analyzing GPU-controlled Communication and Dynamic Parallelism in Terms of Performance and Energy, Elsevier Journal of Parallel Computing (ParCo), 2016.

  11. Communication orchestration - Takeaways
  • CPU-controlled communication is still fastest - independent of different orchestration optimizations
  • GPU-controlled communication: intra-GPU synchronization between the individual CTAs is most important for performance
    • Stream synchronization most promising
    • Otherwise replay overhead due to serialization
  • Dedicated communication kernels or functions are highly recommended
    • Either device functions for a master kernel (nested parallelism), or communication kernels in the same stream (issued by the CPU)
  • Bypassing CPUs has substantial energy advantages
    • Decrease polling rates, or use interrupt-based CUDA events! (see the sketch below)
    • More room for optimizations left

    while (cudaStreamQuery(stream) == cudaErrorNotReady)
        usleep(sleeptime);  // reduced-rate polling keeps the CPU mostly idle
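The interrupt-based alternative mentioned above can be expressed with blocking-sync CUDA events; a minimal sketch using the standard runtime API (not code from the talk):

    #include <cuda_runtime.h>

    // Instead of polling, block the CPU thread on an interrupt-driven event.
    // cudaEventBlockingSync makes cudaEventSynchronize yield the CPU
    // rather than spin, which is where the energy savings come from.
    void wait_without_polling(cudaStream_t stream)
    {
        cudaEvent_t done;
        cudaEventCreateWithFlags(&done, cudaEventBlockingSync |
                                        cudaEventDisableTiming);
        cudaEventRecord(done, stream);  // marks the end of the enqueued work
        cudaEventSynchronize(done);     // CPU sleeps until the GPU signals
        cudaEventDestroy(done);
    }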

  12. GGAS: Fast GPU-controlled traffic sourcing and sinking

  13. GGAS – Global GPU Address Spaces
  • Forwarding load/store operations to global addresses
    • Address translation and target identification
    • Special hardware support required (NIC)
  • Severe limitations for full coherence and strong consistency
    • Well known for CPU-based distributed shared memory
    • Reverting to highly relaxed consistency models can be a solution
  Holger Fröning and Heiner Litz, Efficient Hardware Support for the Partitioned Global Address Space, 10th Workshop on Communication Architecture for Clusters (CAC2010), co-located with 24th International Parallel and Distributed Processing Symposium (IPDPS 2010), April 19, 2010, Atlanta, Georgia.

  14. GGAS – thread-collaborative BSP-like communication
  [Figure: GPU 0 and GPU 1 (remote) alternate between phases: Computation -> Communication using collective remote stores -> Computation -> Global barrier -> Continue]
  Lena Oden and Holger Fröning, GGAS: Global GPU Address Spaces for Efficient Communication in Heterogeneous Clusters, IEEE International Conference on Cluster Computing 2013, September 23-27, 2013, Indianapolis, US.

  15. GGAS – current programming model using mailboxes

    <snip>
    ...
    remMailbox[getProcess(index)][tid] = data[tid];  // remote store into the target's mailbox
    __threadfence_system();                          // memory fence: make the data visible first
    remoteEndFlag[getProcess(index)][0] = 1;         // then raise the completion flag
    __ggas_barrier();                                // global barrier across GPUs
    ...
    <snip>

  (A more complete, hypothetical sketch of this pattern follows below.)
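GGAS extends this store/fence/flag idiom across nodes with NIC support; within a single node, the same pattern can be reproduced over PCIe peer access with stock CUDA. The sketch below is such a single-node, single-CTA illustration (producer/consumer names and buffers are assumptions, not GGAS API):

    #include <cuda_runtime.h>

    // Producer: store into a buffer that lives on the peer GPU, fence,
    // then raise a flag -- the same idiom the mailbox snippet above uses.
    __global__ void producer(float *remote_buf, volatile int *remote_flag,
                             const float *data, int n)
    {
        for (int i = threadIdx.x; i < n; i += blockDim.x)
            remote_buf[i] = data[i];   // remote stores over PCIe
        __syncthreads();               // all stores of this CTA are issued
        if (threadIdx.x == 0) {
            __threadfence_system();    // order the data ahead of the flag
            *remote_flag = 1;          // signal "message complete"
        }
    }

    // Consumer (runs on the peer GPU): spin on the flag, then read.
    __global__ void consumer(const float *buf, volatile int *flag,
                             float *out, int n)
    {
        if (threadIdx.x == 0)
            while (*flag == 0) ;       // poll until the producer signals
        __syncthreads();               // release the rest of the CTA
        for (int i = threadIdx.x; i < n; i += blockDim.x)
            out[i] = buf[i];
    }

Host code would first enable peer access on both devices (cudaDeviceEnablePeerAccess) and launch producer and consumer on different GPUs; GGAS performs the same trick across nodes by letting the NIC translate global addresses.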

  16. GGAS Prototype
  • FPGA-based network prototype
    • Xilinx Virtex-6
    • 64-bit data paths, 156 MHz = 1.248 GB/s (theoretical peak)
    • PCIe Gen1/Gen2
    • 4 network links (torus topology)
  • Remote load latency (Virtex-6): 1.44 - 1.9 usec (CPU/GPU)
  [Figure: Node #0 (source) issues loads/stores to Node #1's (target) host memory; the pipeline translates source-local address -> global address -> target-local address, with target node determination, source tag management, address calculation on both sides, loss-less and in-order packet forwarding, and a return route]

  17. GGAS – Microbenchmarking
  • GPU-to-GPU streaming
    • Prototype system consisting of an Nvidia K20c & dual Intel Xeon E5
    • Relative results remain applicable under technology-related performance improvements
  • MPI
    • CPU-controlled: D2H, MPI send/recv, H2D (the ping-pong sketch below measures this path)
  • GGAS
    • GPU-controlled: GDDR to GDDR, remote stores
    • GGAS latency starting at 1.9 usec
  • RMA: Remote Memory Access
    • Put/Get-based; CPU-to-CPU (host) or GPU-to-GPU (direct), respectively
  [Charts: streaming results; a P2P PCIe issue is called out]
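For reference, such latency numbers are usually obtained from a ping-pong microbenchmark; below is a minimal sketch of one for the CPU-controlled staged path (all names illustrative, not the authors' benchmark code):

    #include <mpi.h>
    #include <cuda_runtime.h>

    // Ping-pong over the staged path (D2H, MPI, H2D on each side).
    // Half the averaged round-trip time approximates the one-way latency.
    double pingpong_latency(float *d_buf, float *h_buf, int bytes,
                            int rank, int peer, int iters)
    {
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; ++i) {
            if (rank == 0) {
                cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);
                MPI_Send(h_buf, bytes, MPI_BYTE, peer, 0, MPI_COMM_WORLD);
                MPI_Recv(h_buf, bytes, MPI_BYTE, peer, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
            } else {
                MPI_Recv(h_buf, bytes, MPI_BYTE, peer, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
                cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);
                MPI_Send(h_buf, bytes, MPI_BYTE, peer, 0, MPI_COMM_WORLD);
            }
        }
        return (MPI_Wtime() - t0) / (2.0 * iters);  // one-way latency in seconds
    }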
