  1. Hints to improve automatic load balancing with LeWI for hybrid applications. Marta Garcia, Jesus Labarta, Julita Corbalan. Journal of Parallel and Distributed Computing, Volume 74, Issue 9, September 2014. 1 / 27

  2. Motivation: loss of efficiency; hybrid programming models (MPI + X); manual tuning of parallel codes (load balancing, data redistribution). 2 / 27

  3. The X (in this paper):
     SMPSs (SMPSuperscalar): task as the basic element; taskifiable functions and their parameters are annotated (in/out/inout); a task graph tracks dependencies; the number of threads may change at any time.
     OpenMP: directives to annotate parallel code; fork/join model with shared memory; the number of threads may change only between parallel regions.
     3 / 27
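A minimal sketch (not from the paper) of how the two models express the same kernel; the OpenMP part is standard, while the SMPSs annotation is quoted from memory in a comment and should be checked against the SMPSs manual:

```c
/* Sketch only: contrasts the fork/join OpenMP style with the SMPSs task
 * style described on the slide.  axpy_block and the "#pragma css task"
 * syntax are illustrative assumptions, not code from the paper. */
#include <omp.h>

#define N 1024
static double a[N], b[N];

/* SMPSs style (per the slide): the annotated function is the task, the
 * in/out/inout parameters give the dependencies, the runtime builds a task
 * graph, and the thread count can change at any time:
 *
 *   #pragma css task input(src) inout(dst)
 *   void axpy_block(double *src, double *dst);
 */

void axpy_openmp(void)
{
    /* OpenMP style: fork/join over a parallel region; the thread count is
     * fixed for the duration of the region and can only change between
     * regions. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        b[i] += 2.0 * a[i];
}
```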

  4. DLB and LeWI. DLB (Dynamic Load Balancing): "runtime interposition to [...] intercept MPI calls"; balances load on the inner level (OpenMP/SMPSs); offers several load-balancing algorithms. LeWI (Lend CPU when Idle): the CPUs of a rank sitting in a blocking MPI call are idle, so they are lent to other ranks and recovered after the MPI call completes (see the sketch below). 4 / 27
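A minimal sketch of the interposition mechanism the slide quotes, using the standard MPI profiling interface (PMPI); it is not the actual DLB implementation, and lend_cpus()/reclaim_cpus() are hypothetical placeholders for whatever the runtime does to hand cores to another rank on the same node:

```c
/* Interposition sketch: a wrapper library defines MPI_Recv and forwards to
 * the real call through PMPI_Recv, lending the rank's cores while it blocks.
 * lend_cpus() / reclaim_cpus() are stubs standing in for the (assumed)
 * DLB-side logic. */
#include <mpi.h>

static void lend_cpus(void)    { /* hand my idle cores to a neighbour rank */ }
static void reclaim_cpus(void) { /* take my cores back                     */ }

int MPI_Recv(void *buf, int count, MPI_Datatype type, int src,
             int tag, MPI_Comm comm, MPI_Status *status)
{
    lend_cpus();     /* rank is about to block: its CPUs would sit idle */
    int err = PMPI_Recv(buf, count, type, src, tag, comm, status);
    reclaim_cpus();  /* blocking call finished: recover the lent CPUs   */
    return err;
}
```

Preloading such a wrapper library (e.g. via LD_PRELOAD) is the usual way this kind of interposition is applied without modifying the application.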

  5. LeWI (a) No load balancing. (b) LeWI algorithm with SMPSs. (c) LeWI algorithm with OpenMP. 5 / 27

  6. Approach “ Extensive performance evaluation ” “ Modeling parallelization characteristics that limit the automatic load balancing potential ” “ Improving automatic load balancing ” 6 / 27

  7. Performance evaluation. Platform: MareNostrum 2, nodes with 2 × IBM PowerPC 970MP (2 cores each) and 8 GiB RAM; Linux 2.6.5-7.244-pseries64; MPICH; IBM XL C/C++ compiler w/o optimizations.
     Metrics:
     Speedup = serial_execution_time / parallel_execution_time
     Efficiency = useful_cpu_time / (elapsed_time * cpus), where useful_cpu_time = cpu_time - (mpi_time + openmp/smpss_time + dlb_time)
     CPUs_used: number of CPUs simultaneously running application code.
     Workloads: 3 benchmarks + 2 real applications. 7 / 27
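A worked instance of the efficiency metric, with invented numbers purely for illustration (not measurements from the paper):

```latex
% Hypothetical: 4 CPUs, elapsed_time = 100 s, total cpu_time = 400 s,
% of which 60 s is spent inside the MPI / OpenMP-SMPSs / DLB runtimes.
\[
  \mathit{useful\_cpu\_time} = 400 - 60 = 340\,\mathrm{s}, \qquad
  \mathit{Efficiency} = \frac{340}{100 \times 4} = 0.85
\]
```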

  8. PILS (Parallel ImbaLance Simulation): synthetic benchmark whose core is "floating point operations without data involved". Tunable parameters: programming model (MPI, MPI + OpenMP, MPI + SMPSs); load distribution; parallelism grain (= 1 / number of parallel regions); iterations. A sketch of such a kernel follows. 8 / 27
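A minimal PILS-like kernel, a sketch assuming only what the slide states: pure floating-point work with no data, an uneven load per MPI rank, and a parallelism grain that splits each rank's work into 1/grain parallel regions. The names, the load formula, and the parameter values are invented for illustration; the real PILS benchmark is the one described in the paper.

```c
/* Synthetic-imbalance sketch in MPI + OpenMP (not the actual PILS source). */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

static double burn(long flops)            /* data-free floating-point work */
{
    double x = 0.0;
    for (long i = 0; i < flops; i++)
        x += 1e-9 * (double)(i % 7);
    return x;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int    iterations = 10;
    const double grain      = 0.5;            /* 1/grain parallel regions   */
    const long   base_flops = 100000000L;
    long my_flops = base_flops * (rank + 1);  /* imbalanced load per rank   */

    int regions = (int)(1.0 / grain);
    double sink = 0.0;

    for (int it = 0; it < iterations; it++) {
        for (int r = 0; r < regions; r++) {
            #pragma omp parallel reduction(+:sink)
            {
                long chunk = my_flops / regions / omp_get_num_threads();
                sink += burn(chunk);
            }
        }
        MPI_Barrier(MPI_COMM_WORLD);   /* ranks with less work idle here */
    }

    if (rank == 0) printf("checksum %g\n", sink);
    MPI_Finalize();
    return 0;
}
```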

  9. PILS 9 / 27

  10. Parallelism Grain 10 / 27

  11. Other Codes. Benchmarks: BT-MZ (block tri-diagonal solver), LUB (LU matrix factorization). Applications: Gromacs (molecular dynamics, originally MPI-only), Gadget (cosmological N-body/SPH simulation; SPH = smoothed-particle hydrodynamics). 11 / 27

  12. Other Codes (summary table):
     Application | Original version          | MPI + OpenMP | MPI + SMPSs | Executed in nodes (cpus)
     PILS        | MPI + OpenMP, MPI + SMPSs | X            | X           | 1 (4)
     BT-MZ       | MPI + OpenMP              | X            | X           | 1, 2, 4 (4, 8, 16)
     LUB         | MPI + OpenMP, MPI + SMPSs | X            | X           | 1, 2, 4 (4, 8, 16)
     Gromacs     | MPI                       |              | X           | 1–64 (4–256)
     Gadget      | MPI                       |              | X           | 200 (800)
     11 / 27

  13. PILS, 2 and 4 MPI processes 12 / 27

  14. BT-MZ; 1 node 13 / 27

  15. BT-MZ; 2,4 nodes; Class C 14 / 27

  16. BT-MZ; 1 node; 4 MPI processes 15 / 27

  17. LUB; 1 node; Block size 200 . . . . 16 / 27

  18. Gromacs; 1–64 nodes + Details for 16 nodes 17 / 27

  19. Gromacs; Efficiency + CPUs used per Node 18 / 27

  20. Gadget; 200 nodes 19 / 27

  21. Factors limiting performance improvement with LeWI: "Parallelism Grain in OpenMP applications"; "Task duration in SMPSs applications"; "Distribution of MPI processes among computation nodes". A sketch of the OpenMP grain issue follows. 20 / 27
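A sketch (mine, not the paper's code) of why parallelism grain limits LeWI under OpenMP: the thread count can only change when a new parallel region starts, so lent CPUs go unused if the whole computation is one long region. The splitting factor and the per-row work are invented for illustration.

```c
/* Hypothetical illustration: the same loop opened as one parallel region
 * versus several shorter ones.  With LeWI, only the finer-grained version
 * can pick up CPUs lent by another rank part-way through the work. */
#include <omp.h>

#define N 100000
static double row_result[N];
static void process_row(int i) { row_result[i] = 0.5 * i; }  /* stand-in work */

void coarse_grain(void)
{
    /* One region: the thread count chosen at entry is kept for all N rows,
     * so CPUs lent mid-way cannot be used. */
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < N; i++)
        process_row(i);
}

void finer_grain(int chunks)
{
    /* 'chunks' regions: each region re-evaluates the available thread
     * count, so lent CPUs are picked up at the next chunk boundary
     * (remainder handling omitted for brevity). */
    int rows_per_chunk = N / chunks;
    for (int c = 0; c < chunks; c++) {
        #pragma omp parallel for schedule(dynamic)
        for (int i = c * rows_per_chunk; i < (c + 1) * rows_per_chunk; i++)
            process_row(i);
    }
}
```

The trade-off, which the modified-LUB slides below examine, is that very small regions add fork/join overhead, so the grain cannot be shrunk arbitrarily.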

  22. Parallelism Grain 21 / 27

  23. Modified Parallelism Grain in LUB 22 / 27

  24. Performance of Modified LUB 23 / 27

  25. Rank Distribution — BT-MZ 24 / 27

  26. Rank Distribution — Gromacs 25 / 27

  27. Rank Distribution — Gadget Total 26 / 27

  28. Conclusion. Summary: DLB/LeWI can improve performance transparently; inter-node load imbalances are not handled; granularity of parallelism and rank placement are important factors; the optimal configuration differs with and without DLB/LeWI. 27 / 27

  29. Conclusion (cont.). Discussion points: interaction with MPI; benchmark coverage (1.5 of 3 NPB-MZ, arbitrary load distribution); how to find "the right" granularity. 27 / 27
