Energy issues of GPU computing clusters

  1. Energy issues of GPU computing clusters. Stéphane Vialle, SUPELEC, UMI GT CNRS 2958 & AlGorille INRIA Project Team. EJC, 19-20/11/2012, Lyon, France

  2. What does « using a GPU cluster » mean? Programming a cluster of « CPU + GPU » nodes:
• Implementing message passing + multithreading + vectorization
• Long and difficult code development and maintenance → how many software engineers can do it?
Computing nodes requiring more electrical power (watts):
• A CPU + (powerful) GPU dissipates more electrical power than a CPU alone
• May force an upgrade of the electrical network and of the power subscription → can generate extra costs!
But we expect:
• to run faster, and/or
• to save energy (watt-hours)

  3. 1 - First experiment: « happily parallel » application
• Asian option pricer (independent Monte Carlo simulations)
• Rigorous parallel random number generation
2008: Lokman Abas-Turki, Stéphane Vialle, Bernard Lapeyre

  4. 1 - Happily parallel application. Application: « Asian option pricer », independent Monte Carlo trajectory computations.
Coarse grain parallelism on the cluster:
• distribution of data to each computing node
• local and independent computations on each node
• collection of partial results and a small final computation
Fine grain parallelism on each node:
• local data transfer to GPU memory
• local parallel computation on the GPU
• local result transfer from the GPU to the CPU memory
→ Coarse and fine grain parallel codes can be developed separately (nice!)
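A minimal sketch of this two-level decomposition, assuming MPI for the coarse grain and CUDA for the fine grain (the kernel, its placeholder payload, and all names below are hypothetical illustrations, not the pricer's actual code):

```cuda
// Hypothetical two-level skeleton: MPI distributes work across nodes,
// a CUDA kernel runs the per-node Monte Carlo trajectories.
#include <mpi.h>
#include <cuda_runtime.h>
#include <cstdio>

__global__ void monte_carlo(float *partial, int nTraj) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nTraj)
        partial[i] = 1.0f;   // placeholder for one simulated trajectory payoff
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int nTraj = 1 << 16;                // trajectories per node
    float *d_partial;
    cudaMalloc(&d_partial, nTraj * sizeof(float));

    // Fine grain: local, independent GPU computation.
    monte_carlo<<<(nTraj + 255) / 256, 256>>>(d_partial, nTraj);

    // Local result transfer from GPU to CPU memory, then local sum.
    float *h_partial = new float[nTraj];
    cudaMemcpy(h_partial, d_partial, nTraj * sizeof(float),
               cudaMemcpyDeviceToHost);
    double localSum = 0.0;
    for (int i = 0; i < nTraj; ++i) localSum += h_partial[i];

    // Coarse grain: collect partial results, small final computation on P0.
    double globalSum = 0.0;
    MPI_Reduce(&localSum, &globalSum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("price estimate = %f\n", globalSum / (double)(nTraj * size));

    delete[] h_partial;
    cudaFree(d_partial);
    MPI_Finalize();
    return 0;
}
```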

  5. 1 - Happily parallel application. Long work to design rigorous parallel random number generation on the GPUs.
Execution scheme (coarse grain = nodes PC 0 … PC P-1, fine grain = GPU cores):
1 - Input data reading on P0 (input data files)
2 - Input data broadcast from P0 to all nodes
3 - Parallel and independent RNG initialization (coarse and fine grain)
4 - Parallel and independent Monte Carlo computations (coarse and fine grain)
5 - Parallel and independent partial results computation (coarse and fine grain)
6 - Partial results reduction on P0 and final price computation
7 - Print results and performances
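The per-thread RNG initialization of step 3 can be sketched with cuRAND's device API; the seed/subsequence scheme below is a standard cuRAND idiom and only stands in for the rigorous generator partitioning the authors designed:

```cuda
#include <curand_kernel.h>

// One RNG state per GPU thread. Same seed everywhere, but a distinct
// subsequence id (per-node base + thread id) keeps the streams independent
// across GPU cores and across cluster nodes.
__global__ void init_rng(curandState *states, unsigned long long seed,
                         unsigned long long nodeBase, int nThreads) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nThreads)
        curand_init(seed, nodeBase + i, 0, &states[i]);
}

// Each thread draws from its own stream, working on a register copy of the
// state and writing it back so the next kernel call continues the sequence.
__global__ void draw_normals(curandState *states, float *out, int nThreads) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nThreads) {
        curandState s = states[i];
        out[i] = curand_normal(&s);   // one N(0,1) sample for the trajectories
        states[i] = s;
    }
}
```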

  6. 1 - Happily parallel application. Comparison to a multi-core CPU cluster (using all CPU cores): 16 GPU nodes run 2.83 times faster than 256 CPU nodes.
Testbeds:
• CPU cluster: 256 INTEL dual-core nodes, 1 CISCO 256-port Gigabit Ethernet switch
• GPU cluster: 16 INTEL dual-core nodes, 1 GPU (GeForce 8800 GT) per node, 1 DELL 24-port Gigabit Ethernet switch
[Figure: « Asian Pricer on clusters of GPU and CPU », total execution time (s) vs number of nodes, log-log scale, for 512x512, 512x1024 and 1024x1024 problem sizes on 1-core CPU, CPU-node and GPU-node configurations]

  7. 1 - Happily parallel application. Comparison to a multi-core CPU cluster (using all CPU cores): 16 GPU nodes consume 28.3 times less energy than 256 CPU nodes, so the GPU cluster is 2.83 × 28.3 ≈ 80.1 times more efficient overall (speed gain × energy gain).
[Figure: « Asian Pricer on clusters of GPU and CPU », consumed watt-hours vs number of nodes (1 to 256), for 512x512, 512x1024 and 1024x1024 problem sizes on CPU and GPU clusters]

  8. 2 - « Real parallel » code experiments: parallel codes including communications
• 2D relaxation (frontier exchange)
• 3D transport PDE solver
2009: Sylvain Contassot-Vivier, Stéphane Vialle, Thomas Jost, Wilfried Kirschenmann

  9. 2 - Parallel codes including comms. These algorithms remain synchronous and deterministic, but coarse and fine grained parallel codes have to be jointly designed → developments become more complex.
Typical iteration pattern:
• internode CPU communications
• local CPU → GPU data transfers
• local GPU computations
• local GPU → CPU partial result transfers
• local CPU computations (not adapted to GPU processing)
• internode CPU communications
• local CPU → GPU data transfers
• …
More synchronization issues between CPU and GPU tasks, and more complex buffer and index management: one piece of data has a global index, a node CPU-buffer index, a node GPU-buffer index, and a fast-shared-memory index in a sub-part of the GPU…
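One iteration of this pattern might look like the following sketch (blocking MPI, pageable-memory cudaMemcpy, hypothetical names), which makes the extra CPU ↔ GPU transfer steps explicit:

```cuda
#include <mpi.h>
#include <cuda_runtime.h>

// Plain Jacobi step: read from 'in', write to 'out' (two buffers avoid races).
__global__ void relax(const float *in, float *out, int nx, int ny) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x > 0 && x < nx - 1 && y > 0 && y < ny - 1)
        out[y * nx + x] = 0.25f * (in[y * nx + x - 1] + in[y * nx + x + 1] +
                                   in[(y - 1) * nx + x] + in[(y + 1) * nx + x]);
}

// One synchronous iteration, seen from one node: d_in holds ny rows of nx
// floats, rows 0 and ny-1 being ghost rows; 'below'/'above' are neighbour
// ranks (MPI_PROC_NULL at the domain ends). Caller swaps d_in/d_out after.
void iterate(float *d_in, float *d_out, float *h_lo, float *h_hi,
             int nx, int ny, int below, int above) {
    // 1. Local GPU -> CPU transfers of the two frontier rows.
    cudaMemcpy(h_lo, d_in + 1 * nx,        nx * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_hi, d_in + (ny - 2) * nx, nx * sizeof(float), cudaMemcpyDeviceToHost);

    // 2. Internode CPU communications: swap frontiers with both neighbours.
    MPI_Sendrecv_replace(h_lo, nx, MPI_FLOAT, below, 0, below, 1,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv_replace(h_hi, nx, MPI_FLOAT, above, 1, above, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    // 3. Local CPU -> GPU transfers of the received halos into the ghost rows.
    cudaMemcpy(d_in,                 h_lo, nx * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_in + (ny - 1) * nx, h_hi, nx * sizeof(float), cudaMemcpyHostToDevice);

    // 4. Local GPU computation on the refreshed sub-domain.
    relax<<<dim3((nx + 15) / 16, (ny + 15) / 16), dim3(16, 16)>>>(d_in, d_out, nx, ny);
}
```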

  10. 2 - Parallel codes including comms. Developments become (really) more complex → fewer software engineers can develop and maintain parallel code including communications on a GPU cluster.
The GPU accelerates only some parts of the code, and requires more data transfer overheads (CPU ↔ GPU):
→ Is it possible to speed up on a GPU cluster?
→ Is it possible to speed up enough to save energy?

  11. 2 - Parallel codes including comms.
[Figures: execution time (s) and energy (Wh) vs number of nodes (1 to 100), log scale, for the three applications (Pricer with parallel MC, synchronous PDE solver, Jacobi relaxation) on three platforms: monocore-CPU cluster, multicore-CPU cluster, manycore-GPU cluster]

  12. 2 - Parallel codes including comms. Remark: which comparison? Which reference?
[Figures: speedup (SU) and energy gain (EG) of the n-core GPU cluster measured against three references: 1 monocore CPU node (2D relaxation), 1 multicore CPU node (Jacobi relaxation), and the n-node multicore CPU cluster (Jacobi relaxation)]
You have a GPU cluster → you have a CPU cluster! You succeeded in programming a GPU cluster → you can program a CPU cluster!
→ Compare a GPU cluster to a CPU cluster (not to one CPU core…) when possible: the comparison will be really different.

  13. 2 - Parallel codes including comms. Temporal gain (speedup) and energy gain of the GPU cluster vs the CPU cluster:
[Figures: speedup and energy gain of the GPU cluster vs the multicore CPU cluster, as a function of the number of nodes: Pricer with parallel MC (OK), synchronous PDE solver (hum…), Jacobi relaxation (oops…)]
Up to 16 nodes this GPU cluster is more interesting than our CPU cluster, but its interest decreases…

  14. 2 - Parallel codes including comms. Cost model, CPU cluster vs GPU cluster:
• Computations: if the algorithm is adapted to the GPU architecture, T-comput-GPU << T-comput-CPU; else, do not use GPUs!
• Communications: T-comm-CPU = T-comm-MPI, while T-comm-GPU = T-transfer-GPUtoCPU + T-comm-MPI + T-transfer-CPUtoGPU, so T-comm-GPU ≥ T-comm-CPU.
• Total time: T-GPUcluster <?> T-CPUcluster
→ For a given problem scaled on a GPU cluster, T-comm becomes strongly dominant and the GPU cluster's interest decreases.
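Written out under the assumption that computation and communication do not overlap, the slide's model (in its own T-notation) and the break-even condition it implies are:

```latex
% Per-iteration cost model from the slide (no computation/communication overlap):
\begin{align*}
T_{\text{CPUcluster}} &= T_{\text{comput-CPU}} + T_{\text{comm-MPI}} \\
T_{\text{GPUcluster}} &= T_{\text{comput-GPU}} + T_{\text{transfer-GPUtoCPU}}
                         + T_{\text{comm-MPI}} + T_{\text{transfer-CPUtoGPU}}
\end{align*}
% The GPU cluster wins iff its computation saving exceeds the extra transfers:
\begin{align*}
T_{\text{GPUcluster}} < T_{\text{CPUcluster}}
\iff
T_{\text{comput-CPU}} - T_{\text{comput-GPU}}
> T_{\text{transfer-GPUtoCPU}} + T_{\text{transfer-CPUtoGPU}}
\end{align*}
```

As nodes are added for a fixed problem size, the computation saving on the left shrinks while the transfer terms per frontier stay roughly constant, so the inequality eventually fails: exactly the decreasing interest observed above.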

  15. 3 - Asynchronous parallel code experiments (asynchronous algorithm & asynchronous implementation):
• 3D transport PDE solver
2009-2010: Sylvain Contassot-Vivier, Stéphane Vialle

  16. 3 - Async. parallel codes on GPU cluster. Asynchronous algorithms provide implicit overlapping of communications and computations, and communications are costly in GPU clusters.
→ Asynchronous code should improve execution on GPU clusters, especially on heterogeneous GPU clusters.
BUT:
• only some iterative algorithms can be turned into asynchronous algorithms;
• convergence detection is more complex and requires more communications (than with a synchronous algorithm);
• some extra iterations are required to achieve the same accuracy.
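A minimal sketch of the asynchronous scheme, using non-blocking MPI so a node never waits for its neighbour's frontier (hypothetical names; relax is e.g. the Jacobi kernel sketched earlier, and the real solver's convergence detection protocol is omitted):

```cuda
#include <mpi.h>
#include <cuda_runtime.h>

__global__ void relax(const float *in, float *out, int nx, int ny);  // e.g. the Jacobi kernel above

// One asynchronous iteration, receive side only (the send side is symmetric:
// MPI_Isend the frontier row, MPI_Test the previous request before reuse).
void async_iterate(float *d_in, float *d_out, float *h_halo,
                   int nx, int ny, int neighbour, MPI_Request *recvReq) {
    int arrived = 0;
    MPI_Test(recvReq, &arrived, MPI_STATUS_IGNORE);   // never block on the neighbour
    if (arrived) {
        // Fresh frontier available: push it into the ghost row, re-post the recv.
        cudaMemcpy(d_in, h_halo, nx * sizeof(float), cudaMemcpyHostToDevice);
        MPI_Irecv(h_halo, nx, MPI_FLOAT, neighbour, 0, MPI_COMM_WORLD, recvReq);
    }
    // Compute with whatever halo version is currently on the GPU: nodes drift
    // out of lockstep, implicitly overlapping communications and computations.
    relax<<<dim3((nx + 15) / 16, (ny + 15) / 16), dim3(16, 16)>>>(d_in, d_out, nx, ny);
    // (Global convergence detection and extra-iteration control omitted.)
}
```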

  17. 3 - Async. parallel codes on GPU cluster. Remark: asynchronous code on a GPU cluster has an awful complexity. Starting from the available synchronous PDE solver on a GPU cluster (previous work), reaching an operational asynchronous PDE solver on a GPU cluster took:
• 2 senior researchers in parallel computing
• 1 year of work
The most complex debugging we have ever achieved! … and how do we « validate » the code?

  18. 3 - Async. parallel codes on GPU cluster. Execution time using 2 GPU clusters of Supelec:
• 17 nodes: Xeon dual-core + GT8800 GPUs
• 16 nodes: Nehalem quad-core + GT285 GPUs
• 2 interconnected Gigabit switches
[Figures: execution time T-exec (s) vs number of fast nodes, synchronous vs asynchronous GPU versions]

  19. 3 - Async. parallel codes on GPU cluster. Speedup vs 1 GPU:
• the asynchronous version achieves a more regular speedup;
• the asynchronous version achieves a better speedup on a high number of nodes.
[Figures: speedup vs sequential, as a function of the number of fast nodes, for the synchronous and asynchronous GPU cluster versions, each relative to 1 GPU]

  20. 3 - Async. parallel codes on GPU cluster. Energy consumption:
• the synchronous and asynchronous energy consumption curves are (just) different.
[Figures: energy consumption (Wh) vs number of fast nodes, synchronous vs asynchronous GPU cluster versions]
