

  1. CPU+GPU Load Balance Guided by Execution Time Prediction
  Jean-François Dollinger, Vincent Loechner. Inria CAMUS, ICube Lab., University of Strasbourg.
  jean-francois.dollinger@inria.fr, vincent.loechner@inria.fr. 19 January 2015

  2. Outline
  1 Introduction
  2 Prediction: Overview, Code generation, Profiling
  3 Runtime CPU + GPU
  4 Conclusion

  3. Introduction
  Achieving and predicting performance on CPU/GPU is difficult. It is sensitive to:
  • Input dataset (CUDA grid size, cache effects)
  • Compiler optimizations (unrolling, fission)
  • Cloud infrastructures
  • Hardware availability
  • Efficient resource exploitation

  4. Introduction
  Because of dynamic behaviors, compilers miss performance opportunities:
  • PLUTO
  • PPCG
  • Par4All
  • OpenACC/HMPP: manual tuning
  → Automatic methods are the way to go (typical use case)
  → Our interest: polyhedral codes

  5. Introduction
  How to get performance?
  • Run the right code on the right PU (Processing Unit)
  • Select the best code version on each given PU
  • Ensure load balance between PUs
  → Multi-versioning + runtime code selection = win

  6. Outline: 1 Introduction; 2 Prediction (Overview, Code generation, Profiling); 3 Runtime CPU + GPU; 4 Conclusion

  7. Prediction Overview
  Multi-versioning: performance factors
  • Static factors (instructions)
  • External dynamic factors (scheduler)
  • Internal dynamic factors (cache effects, memory contention)

  8. Prediction Overview: Static code generation
  [Toolchain diagram: a loop nest delimited by #pragma scop / #pragma endscop is extracted and chunked; PLUTO parallelizes it for the CPU (#pragma omp parallel for) and PPCG generates CUDA versions (version 0, 1, 2, ...); templates are built per version; offline profiling fills a memcpy bandwidth table and kernel ranking tables; at runtime, calls to schedule(...) and dispatch(...) in the application binary use the predicted memcpy and kernel durations to pick and launch version chunks on the CPU and GPUs.]
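
  The diagram's input is plain C: a static control loop nest delimited by scop pragmas. A minimal sketch of such an input (our own illustration, reusing the gemm kernel from the experiments, not the authors' exact code):

    /* A polyhedral loop nest marked for extraction; PLUTO emits an
       OpenMP CPU version and PPCG emits CUDA versions of the same scop. */
    #include <stddef.h>

    void gemm(size_t n, double alpha, double beta,
              double A[n][n], double B[n][n], double C[n][n])
    {
    #pragma scop
        for (size_t i = 0; i < n; i++)
            for (size_t j = 0; j < n; j++) {
                C[i][j] *= beta;
                for (size_t k = 0; k < n; k++)
                    C[i][j] += alpha * A[i][k] * B[k][j];
            }
    #pragma endscop
    }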

  9. Prediction Overview
  Pedro framework [Benoit Pradelle et al., 2011]
  • Multi-versioning of polyhedral loop nests
  • Target: multicore CPUs

  10. Outline: 1 Introduction; 2 Prediction (Overview, Code generation, Profiling); 3 Runtime CPU + GPU; 4 Conclusion

  11. Prediction: Code generation
  A code version is defined by:
  • Block size
  • Tile size
  • Schedule
  → controlled by PPCG command-line options
  PPCG, a source-to-source compiler:
  • Transforms C to CUDA
  • Generates Ehrhart polynomials and loop nest parameters
  Python scripts fill the templates in C code.
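
  An Ehrhart polynomial counts the integer points of a parametric loop nest, which lets the runtime evaluate iteration counts without executing the loop. A minimal sketch (hand-written for a triangular nest, not actual PPCG output):

    #include <assert.h>

    /* Iteration count of the triangular nest
         for (i = 0; i < n; i++)
           for (j = 0; j <= i; j++) ...
       Its Ehrhart polynomial is N(n) = n*(n+1)/2. */
    static long triangle_iterations(long n)
    {
        return n > 0 ? n * (n + 1) / 2 : 0;
    }

    int main(void)
    {
        assert(triangle_iterations(4) == 10); /* 1+2+3+4 = 10 points */
        return 0;
    }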

  12. Outline: 1 Introduction; 2 Prediction (Overview, Code generation, Profiling); 3 Runtime CPU + GPU; 4 Conclusion

  13. Prediction
  Data transfers: host ↔ device
  • Parameter: message size
  • Asymmetric and non-uniform bandwidth
  Code simulation:
  • Parameters: number of CUDA blocks, sequential parameters
  • Load balance
  • Memory contention
  How to model the performance curves? Affine interval detection (see the sketch below).
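
  One way to read "affine interval detection": the profiled transfer curve is cut into intervals on which transfer time is affine in the message size, so prediction is a table lookup plus one affine evaluation. A minimal sketch under that assumption; the breakpoints and coefficients below are hypothetical, a real table comes from offline profiling:

    #include <stddef.h>

    /* Within [lo, hi) bytes, predicted time is alpha*size + beta (us). */
    struct affine_interval {
        size_t lo, hi;
        double alpha, beta;
    };

    static const struct affine_interval host_to_dev[] = {
        {     0,    65536, 0.0009, 12.0 }, /* latency-dominated   */
        { 65536, 16777216, 0.0003, 45.0 }, /* bandwidth-dominated */
    };

    static double predict_transfer_us(size_t bytes)
    {
        size_t count = sizeof host_to_dev / sizeof *host_to_dev;
        for (size_t i = 0; i < count; i++)
            if (bytes >= host_to_dev[i].lo && bytes < host_to_dev[i].hi)
                return host_to_dev[i].alpha * (double)bytes
                     + host_to_dev[i].beta;
        /* Extrapolate with the last interval for huge messages. */
        return host_to_dev[count - 1].alpha * (double)bytes
             + host_to_dev[count - 1].beta;
    }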

  14. Prediction: Testbed
  1st test platform:
  • 2 Nvidia GTX 590 (16 SMs × 32 SPs)
  • Asus P8P67-Pro (PCIe 2, x8 per card)
  • Core i7 2700K, stock clocks
  2nd test platform:
  • Nvidia GTX 680 (8 SMs × 192 SPs)
  • Asus P8P67-Deluxe (PCIe 2, x16)
  • Core i7 2600

  15. Prediction: Data transfers (testbed 1)
  [Plot: measured ("real") vs. profiled bandwidth (MB/s) as a function of message size (KB, log scale, 10^-3 to 10^7), for dev→host and host→dev transfers on the GTX 590.]

  16. Prediction: Data transfers (testbed 2)
  [Plot: measured ("real") vs. profiled bandwidth (MB/s) as a function of message size (KB, log scale, 10^-3 to 10^7), for dev→host and host→dev transfers on the GTX 680.]

  17. Prediction: Kernel simulation (testbed 1)
  [Plot: gemm 32x16 on the GTX 590; real vs. profiled execution time per iteration (ns, log scale) as a function of the number of blocks (10^0 to 10^5).]

  18. Prediction: Kernel simulation (testbed 1)
  [Plot: syr2k on the GTX 590; real vs. modeled execution time per iteration (ns, log scale) as a function of the parameter size p (0 to 1536). The model is affine per interval: e_i = p_i·β + u_i.]
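
  The fitted model is affine in the parameter, e_i = p_i·β + u_i, with β and u estimated per detected interval from profiled samples. A minimal sketch of fitting and applying such a model (function names are ours):

    /* Per-iteration time model e(p) = beta*p + u, fitted from two
       profiled (parameter, time) samples taken inside one interval. */
    struct affine_model { double beta, u; };

    static struct affine_model fit_two_points(double p0, double e0,
                                              double p1, double e1)
    {
        struct affine_model m;
        m.beta = (e1 - e0) / (p1 - p0);
        m.u = e0 - m.beta * p0;
        return m;
    }

    /* Predicted kernel time: per-iteration time times the iteration
       count given by the Ehrhart polynomial. */
    static double predict_kernel_ns(struct affine_model m,
                                    double p, long iterations)
    {
        return (m.beta * p + m.u) * (double)iterations;
    }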

  19. Outline: 1 Introduction; 2 Prediction (Overview, Code generation, Profiling); 3 Runtime CPU + GPU; 4 Conclusion

  20. Runtime CPU + GPU
  The outermost parallel loop is split into chunks:
  • Each chunk is assigned to one PU
  • PU performance differs → ensure load balance
  Multi-versioning:
  • Code optimized towards its target (PLUTO + PPCG)
  • Multiple code versions (combined)
  Two components:
  • Scheduler: predicts the execution time of chunks [B. Pradelle et al.] + [J-F. Dollinger et al.] and adjusts chunk sizes
  • Dispatcher

  21. Runtime CPU + GPU: Scheduler functioning
  1. T_0 = t_0 · Card(D_0) ≈ t_1 · Card(D_1) ≈ … ≈ t_n · Card(D_n)
  2. Each T_i must tend to (1/n) · Σ_{i=0..n−1} t_i · Card(D_i) = T_all / n
  3. t_i = f(G_i, seq) on a GPU
  4. t_i = g(P_i, S_i) on a CPU
  Inefficient PUs are eliminated (see the sketch below).
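
  Equalizing the products t_i · Card(D_i) amounts to giving each PU a share of the iterations inversely proportional to its predicted per-iteration time. A minimal sketch of that split (our formulation, not the authors' code):

    #include <stddef.h>

    /* Split `total` outer iterations so every PU's predicted chunk
       time t[i] * card[i] is (approximately) equal:
       card[i] is proportional to 1/t[i]. */
    static void balance_chunks(const double *t, size_t n,
                               long total, long *card)
    {
        double inv_sum = 0.0;
        for (size_t i = 0; i < n; i++)
            inv_sum += 1.0 / t[i];

        long assigned = 0;
        for (size_t i = 0; i < n; i++) {
            card[i] = (long)((double)total * (1.0 / t[i]) / inv_sum);
            assigned += card[i];
        }

        /* Give the rounding leftovers to the fastest PU. */
        size_t fastest = 0;
        for (size_t i = 1; i < n; i++)
            if (t[i] < t[fastest])
                fastest = i;
        card[fastest] += total - assigned;
    }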

  22. Runtime CPU + GPU
  [Two charts: per-PU workload proportion and execution time proportion (−1 to 1) over scheduler steps 1 to 6, illustrating how the chunk adjustment converges toward balance.]

  23. Runtime CPU + GPU (speedup relative to one PU)
  [Bar chart: speedups (up to ~30×) of CPU, 1 GPU, CPU+1GPU, CPU+2GPUs, CPU+3GPUs, and CPU+4GPUs on the benchmarks gemm, 2mm, 3mm, syrk, syr2k, doitgen, gemver, mvt, and gesummv.]

  24. Runtime CPU + GPU (load imbalance)
  [Bar chart: load imbalance (0 to 1) of CPU+1GPU, CPU+2GPUs, CPU+3GPUs, and CPU+4GPUs on the same benchmarks.]

  25. Runtime Multiversioning CPU + GPU (speedup relative to the worst version)
  [Bar chart: speedups (up to ~6×) of syr2k code versions c1 through c9 and the combined selection "all", on CPU, 1 GPU, CPU+1GPU, CPU+2GPUs, CPU+3GPUs, and CPU+4GPUs.]

  26. Runtime Multiversioning CPU + GPU (imbalance)
  [Bar chart: load imbalance (0 to 1) of syr2k versions c1 through c9 and the combined selection "all", on CPU+1GPU, CPU+2GPUs, CPU+3GPUs, and CPU+4GPUs.]

  27. Conclusion
  Framework capabilities:
  • Execution time prediction
  • Fastest version selection
  • CPU vs. GPU competition
  • CPU + GPU joint usage
  Future work:
  • Energy consumption

  28. Outline: 1 Introduction; 2 Prediction (Overview, Code generation, Profiling); 3 Runtime CPU + GPU; 4 Conclusion

  29. Annex
  Offline profiling: ranking table

  Number of threads | version 1 | version 2 | version 3
                  1 |     40 ms |     55 ms |     32 ms
                  2 |     32 ms |     28 ms |     17 ms
                  3 |     22 ms |     15 ms |      9 ms
                  4 |     14 ms |      7 ms |      8 ms

  Online prediction: execution time computation (see the sketch below)
  observation = {2000, 600, 300, 300}
  prediction(version 1) = (2000 − 600)·40 + (600 − 300)·32 + 0·22 + 300·14 = 69800 ms
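
  Reading the observation vector as "how many outer iterations execute while at least k threads are active", the prediction weights each level's iteration share by that level's profiled time. A minimal sketch reproducing the 69800 ms figure above (the data layout and the per-iteration reading of the table are our assumptions):

    #include <stdio.h>

    #define NLEVELS 4

    /* Ranking table column for version 1: profiled time (ms) per
       outer iteration, indexed by active thread count minus one. */
    static const double version1_ms[NLEVELS] = { 40, 32, 22, 14 };

    /* obs[k] = outer iterations executed with >= k+1 threads active. */
    static double predict_ms(const long obs[NLEVELS],
                             const double per_iter_ms[NLEVELS])
    {
        double total = 0.0;
        for (int k = 0; k < NLEVELS; k++) {
            long next = (k + 1 < NLEVELS) ? obs[k + 1] : 0;
            total += (double)(obs[k] - next) * per_iter_ms[k];
        }
        return total;
    }

    int main(void)
    {
        const long obs[NLEVELS] = { 2000, 600, 300, 300 };
        /* (2000-600)*40 + (600-300)*32 + 0*22 + 300*14 = 69800 */
        printf("%.0f ms\n", predict_ms(obs, version1_ms));
        return 0;
    }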

  30. Annex
  The algorithm stages:
  • Init.: distribute iterations equitably amongst the PUs
  • Repeat 10 times:
    • compute each chunk's execution time T_i
    • r_i = T_i / T_all
    • adjust chunk sizes according to r_i (see the sketch below)
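
  A minimal sketch of that refinement loop, under our reading that each chunk is rescaled toward the ideal share 1/n and then renormalized to cover all iterations; the time model is a toy stand-in for the profiled models f and g:

    #include <stddef.h>
    #include <stdio.h>

    /* Toy stand-in for the profiled models: PU i takes speed[i]
       time units per iteration (the values here are invented). */
    static const double speed[] = { 1.0, 0.25, 0.25 };

    static double predict_chunk_time(size_t pu, long card)
    {
        return speed[pu] * (double)card;
    }

    static void refine(size_t n, long total, long *card)
    {
        for (size_t i = 0; i < n; i++)          /* init: equal split */
            card[i] = total / (long)n;
        card[0] += total % (long)n;

        for (int step = 0; step < 10; step++) { /* "repeat 10 times" */
            double T[8], T_all = 0.0;           /* assumes n <= 8 */
            for (size_t i = 0; i < n; i++) {
                T[i] = predict_chunk_time(i, card[i]);
                T_all += T[i];
            }
            /* Rescale each chunk toward the ideal share 1/n of the
               total time, then renormalize to `total` iterations. */
            double scaled[8], scaled_sum = 0.0;
            for (size_t i = 0; i < n; i++) {
                double r = T[i] / T_all;        /* r_i = T_i / T_all */
                scaled[i] = (double)card[i] / ((double)n * r);
                scaled_sum += scaled[i];
            }
            long assigned = 0;
            for (size_t i = 0; i < n; i++) {
                card[i] = (long)((double)total * scaled[i] / scaled_sum);
                assigned += card[i];
            }
            card[0] += total - assigned;        /* rounding drift */
        }
    }

    int main(void)
    {
        long card[3];
        refine(3, 9000, card);
        /* The slow PU gets 1000 iterations, each fast PU 4000. */
        printf("%ld %ld %ld\n", card[0], card[1], card[2]);
        return 0;
    }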
