1. Enabling Low-Overhead Hybrid MPI/OpenMP Parallelism with MPC
   Patrick Carribault, Marc Pérache and Hervé Jourdren
   CEA, DAM, DIF, F-91297 Arpajon, France
   IWOMP'2010, June 15th 2010

2. Introduction/Context
   - HPC architecture: the Petaflop/s era
     - Multicore processors as basic building blocks
     - Clusters of ccNUMA nodes
   - Parallel programming models
     - MPI: distributed-memory model
     - OpenMP: shared-memory model
   - Hybrid MPI/OpenMP (or mixed-mode programming)
     - A promising solution (benefits from both models for data parallelism)
     - How to hybridize an application?
   - Contributions
     - Approaches for hybrid programming
     - A unified MPI/OpenMP framework (MPC) for lower hybrid overhead
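
To make the mixed-mode idea concrete, here is a minimal hybrid MPI/OpenMP "hello world" (our illustrative sketch, not taken from the talk): each MPI task opens an OpenMP parallel region, so both models coexist in one binary.

```c
/* Minimal hybrid MPI/OpenMP sketch. Build with something like:
 * mpicc -fopenmp hybrid.c -o hybrid */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;

    /* Request FUNNELED support: only the master thread of each task
     * will issue MPI calls in this sketch. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel   /* shared-memory parallelism inside the task */
    printf("MPI task %d: OpenMP thread %d of %d\n",
           rank, omp_get_thread_num(), omp_get_num_threads());

    MPI_Finalize();
    return 0;
}
```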

3. Outline
   - Introduction/Context
   - Hybrid MPI/OpenMP Programming
     - Overview
     - Extended taxonomy
   - MPC Framework
     - OpenMP runtime implementation
     - Hybrid optimization
   - Experimental Results
     - OpenMP performance
     - Hybrid performance
   - Conclusion & Future Work

4. Hybrid MPI/OpenMP Programming: Overview
   - MPI (Message Passing Interface)
     - Inter-node communication
     - Implicit locality
     - But: needless data duplication and needless shared-memory transfers
   - OpenMP
     - Fully exploits shared-memory data parallelism
     - No standard for inter-node execution
     - No standard for data locality (ccNUMA nodes)
   - Hybrid programming
     - Mix MPI and OpenMP inside one application
     - Benefits from both pure-MPI and pure-OpenMP modes

5. Hybrid MPI/OpenMP Programming: Approaches
   - Traditional approaches: one execution flow per core
     - E.g., MPI for inter-node communication, OpenMP otherwise
     - E.g., exploiting each multicore CPU socket with OpenMP
   - Oversubscribing approaches: several execution flows per core
     - Load balancing across the whole node
     - Adaptive behavior between parallel regions
   - Mixing depth (see the sketch after this slide)
     - Communications outside parallel regions: network bandwidth saturation
     - Communications inside parallel regions: requires MPI thread safety
   - Extended taxonomy from [Hager09]
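
The mixing-depth distinction maps directly onto MPI's thread-support levels. The sketch below (our illustration, with hypothetical function names) contrasts the two: communication outside parallel regions needs only MPI_THREAD_FUNNELED, while per-thread communication inside a region needs MPI_THREAD_MULTIPLE.

```c
#include <mpi.h>
#include <omp.h>

/* Shallow mixing: MPI calls only outside parallel regions;
 * MPI_THREAD_FUNNELED support is sufficient. */
static void comm_outside(int rank, double *buf, int n)
{
    if (rank == 0)
        MPI_Send(buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    #pragma omp parallel for        /* compute phase: no MPI inside */
    for (int i = 0; i < n; i++)
        buf[i] *= 2.0;
}

/* Deep mixing: every OpenMP thread communicates its own slice, one
 * message stream per thread (distinguished by the tag).
 * Requires MPI_THREAD_MULTIPLE. */
static void comm_inside(int rank, double *buf, int n)
{
    #pragma omp parallel
    {
        int tid   = omp_get_thread_num();
        int nthr  = omp_get_num_threads();
        int chunk = n / nthr;               /* each thread owns a slice */
        double *mine = buf + tid * chunk;

        if (rank == 0)
            MPI_Send(mine, chunk, MPI_DOUBLE, 1, tid, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(mine, chunk, MPI_DOUBLE, 0, tid, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    }
}
```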

6. Hybrid MPI/OpenMP Extended Taxonomy
   [Figure: taxonomy diagram of hybrid MPI/OpenMP approaches]

7. MPC Framework
   - User-level thread library [EuroPar'08]
     - Pthreads API, debugging with GDB [MTAAP'2010]
   - Thread-based MPI [EuroPVM/MPI'09]
     - MPI 1.3 compliant
     - Optimized to save memory
     - NUMA-aware memory allocator (for multithreaded applications)
   - Contribution: hybrid representation inside MPC
     - Implementation of an OpenMP runtime (2.5 compliant)
     - Compiler support via a patched GCC (4.3.X and 4.4.X)
     - Optimizations for hybrid applications:
       - Efficient oversubscribed OpenMP (more threads than cores)
       - Unified representation of MPI tasks and OpenMP threads
       - Scheduler-integrated polling methods
       - Message-buffer privatization and parallel message reception

8. MPC’s Hybrid Execution Model (Fully Hybrid)
   - Application with 1 MPI task per node

9. MPC’s Hybrid Execution Model (Fully Hybrid)
   - Initialization of OpenMP regions (on the whole node)

10. MPC’s Hybrid Execution Model (Fully Hybrid)
   - Entering an OpenMP parallel region with 6 threads

11. MPC’s Hybrid Execution Model (Simple Mixed)
   - 2 MPI tasks + an OpenMP parallel region with 4 threads (on 2 cores; sketched in the code below)
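
A minimal sketch of the Simple Mixed configuration above (our illustration, assuming the program is launched with 2 MPI tasks pinned to 2 cores): each task requests 4 OpenMP threads although it owns fewer cores, so the runtime must multiplex several execution flows per core; MPC's user-level threads target exactly this oversubscribed case.

```c
/* "Simple Mixed" oversubscription sketch: more OpenMP threads than
 * owned cores per MPI task. Illustrative, not MPC source code. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    omp_set_num_threads(4);          /* more threads than owned cores */

    #pragma omp parallel
    printf("task %d: flow %d of %d\n",
           rank, omp_get_thread_num(), omp_get_num_threads());

    MPI_Finalize();
    return 0;
}
```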

12. Experimental Environment
   - Architecture
     - Dual-socket quad-core Nehalem-EP machine
     - 24 GB of memory, Linux 2.6.31 kernel
   - Programming-model implementations
     - MPI: MPC, IntelMPI 3.2.1, MPICH2 1.1, OpenMPI 1.3.3
     - OpenMP: MPC, ICC 11.1, GCC 4.3.0 and 4.4.0, SunCC 5.1
   - Best option combination for each implementation
     - OpenMP thread pinning (KMP_AFFINITY, GOMP_CPU_AFFINITY)
     - OpenMP wait policy (OMP_WAIT_POLICY, SUN_MP_THR_IDLE=spin)
     - MPI task placement (I_MPI_PIN_DOMAIN=omp)
   - Benchmarks
     - EPCC suite (pure OpenMP / fully hybrid) [Bull et al. 01]
     - Microbenchmarks for mixed-mode OpenMP/MPI [Bull et al. IWOMP'09]

13. EPCC: OpenMP Parallel-Region Overhead
   [Chart: execution time (µs, 0-50) vs. number of threads (1-8) for MPC, ICC 11.1, GCC 4.3.0, GCC 4.4.0 and SunCC 5.1]

14. EPCC: OpenMP Parallel-Region Overhead (cont.)
   [Chart: execution time (µs, 0-5) vs. number of threads (1-8) for MPC, ICC 11.1, GCC 4.4.0 and SunCC 5.1]

15. EPCC: OpenMP Parallel-Region Overhead (cont.)
   [Chart: execution time (µs, 0-350) vs. number of threads (8-64, oversubscribed on 8 cores) for MPC, ICC 11.1, GCC 4.4.0 and SunCC 5.1]

16. Hybrid Funneled Ping-Pong (1KB)
   [Chart: ratio (log scale, 1-1000) vs. number of OpenMP threads (2-8) for MPC, IntelMPI, MPICH2/GCC 4.4.0, MPICH2/ICC 11.1, OpenMPI/GCC 4.4.0 and OpenMPI/ICC 11.1]
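
For reference, the funneled ping-pong measured above works roughly as follows (our reconstruction in the spirit of [Bull et al. IWOMP'09]; timing and verification are omitted): all threads touch the message buffer, but only the master thread exchanges it, so MPI_THREAD_FUNNELED support suffices.

```c
/* Simplified funneled ping-pong sketch. */
#include <mpi.h>
#include <omp.h>

#define MSG_BYTES 1024   /* 1KB message, as in the plot above */
#define REPS      1000

void funneled_pingpong(int rank, char *buf)
{
    #pragma omp parallel
    {
        for (int r = 0; r < REPS; r++) {
            #pragma omp for                 /* all threads pack the buffer */
            for (int i = 0; i < MSG_BYTES; i++)
                buf[i] = (char)r;
            /* implicit barrier here: buffer is ready before the exchange */

            #pragma omp master              /* only the master communicates */
            {
                if (rank == 0) {
                    MPI_Send(buf, MSG_BYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                    MPI_Recv(buf, MSG_BYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                } else {
                    MPI_Recv(buf, MSG_BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                    MPI_Send(buf, MSG_BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
                }
            }
            #pragma omp barrier             /* threads wait for the exchange */
        }
    }
}
```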

17. Hybrid Multiple Ping-Pong (1KB)
   [Chart: ratio (0-20) vs. number of OpenMP threads (2-4) for MPC, IntelMPI, MPICH2/GCC 4.4.0 and MPICH2/ICC 11.1]

18. Hybrid Multiple Ping-Pong (1KB) (cont.)
   [Chart: ratio (0-60) vs. number of OpenMP threads (2-8) for MPC, IntelMPI, MPICH2/GCC 4.4.0 and MPICH2/ICC 11.1]
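
A sketch of the multiple ping-pong pattern behind the last two plots (again our reconstruction, simplified, assuming both tasks spawn the same number of threads): every OpenMP thread runs its own concurrent ping-pong, distinguished by message tag. This requires MPI_THREAD_MULTIPLE and stresses the MPI library's internal locking, which is where MPC's message-buffer privatization and parallel message reception are meant to pay off.

```c
/* Simplified multiple ping-pong sketch: one concurrent message stream
 * per OpenMP thread, each with a private buffer. */
#include <mpi.h>
#include <omp.h>
#include <stdlib.h>

#define MSG_BYTES 1024
#define REPS      1000

void multiple_pingpong(int rank)
{
    #pragma omp parallel
    {
        int tag = omp_get_thread_num();   /* one message stream per thread */
        char *buf = malloc(MSG_BYTES);    /* private buffer: no sharing */

        for (int r = 0; r < REPS; r++) {
            if (rank == 0) {
                MPI_Send(buf, MSG_BYTES, MPI_CHAR, 1, tag, MPI_COMM_WORLD);
                MPI_Recv(buf, MSG_BYTES, MPI_CHAR, 1, tag, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else {
                MPI_Recv(buf, MSG_BYTES, MPI_CHAR, 0, tag, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, MSG_BYTES, MPI_CHAR, 0, tag, MPI_COMM_WORLD);
            }
        }
        free(buf);
    }
}
```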

19. Hybrid Multiple Ping-Pong (1MB)
   [Chart: ratio (0-3.5) vs. number of OpenMP threads (2-8) for MPC, IntelMPI, MPICH2/GCC 4.4.0 and MPICH2/ICC 11.1]

20. Alternating (MPI Tasks Waiting)
   [Chart: ratio (0-9) vs. number of OpenMP threads (2-16) for MPC, IntelMPI, MPICH2 1.1/GCC 4.4.0, MPICH2 1.1/ICC 11.1 and OpenMPI/GCC 4.4.0]
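
The alternating test interleaves phases where one MPI task computes with OpenMP while the node's other tasks block in MPI. A rough sketch of such a pattern follows (our illustration; the actual benchmark differs in detail): if the waiting tasks busy-poll, they steal cycles from the active task's threads, which is the overhead plotted above.

```c
/* Alternating sketch: the node's MPI tasks take turns running an OpenMP
 * region while the other tasks block in MPI_Recv or MPI_Barrier. */
#include <mpi.h>
#include <omp.h>

void alternating(int rank, int ntasks, double *a, int n)
{
    int token = 0;

    for (int turn = 0; turn < ntasks; turn++) {
        if (rank == turn) {
            /* my turn: use the whole node with OpenMP */
            #pragma omp parallel for
            for (int i = 0; i < n; i++)
                a[i] = a[i] * 0.5 + 1.0;
            if (turn + 1 < ntasks)   /* hand the node to the next task */
                MPI_Send(&token, 1, MPI_INT, turn + 1, 0, MPI_COMM_WORLD);
        } else if (rank == turn + 1) {
            /* next task blocks here until the node is free */
            MPI_Recv(&token, 1, MPI_INT, turn, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }
    }
    /* tasks that already had their turn wait here */
    MPI_Barrier(MPI_COMM_WORLD);
}
```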

21. Conclusion
   - Mixing MPI + OpenMP is a promising solution for next-generation computer architectures; the challenge is avoiding large overheads
   - Contributions
     - Taxonomy of hybrid approaches
     - MPC: a framework unifying both programming models, with lower hybrid overhead
     - Fully compliant with MPI 1.3 and OpenMP 2.5 (with a patched GCC)
     - Freely available at http://mpc.sourceforge.net (version 2.0)
     - Contact: patrick.carribault@cea.fr or marc.perache@cea.fr
   - Future work
     - Optimization of the OpenMP runtime (e.g., NUMA-aware barrier)
     - OpenMP 3.0 (tasks)
     - Thread/data affinity (thread placement, data locality)
     - Tests on large applications
