Enabling Low-Overhead Hybrid MPI/OpenMP Parallelism with MPC


slide-1
SLIDE 1

IWOMP'2010, June 15th 2010

Enabling Low-Overhead Hybrid MPI/OpenMP Parallelism with MPC

Patrick Carribault, Marc Pérache and Hervé Jourdren

CEA, DAM, DIF, F-91297 Arpajon, France

slide-2
SLIDE 2

Introduction/Context


  • HPC Architecture: Petaflop/s Era

Multicore processors as basic blocks
Clusters of ccNUMA nodes

  • Parallel programming models

MPI: distributed-memory model
OpenMP: shared-memory model

  • Hybrid MPI/OpenMP (or mixed-mode programming)

Promising solution (benefit from both models for data parallelism)
How to hybridize an application?

  • Contributions

Approaches for hybrid programming
Unified MPI/OpenMP framework (MPC) for lower hybrid overhead

slide-3
SLIDE 3

Outline


  • Introduction/Context
  • Hybrid MPI/OpenMP Programming

Overview
Extended taxonomy

  • MPC Framework

OpenMP runtime implementation
Hybrid optimization

  • Experimental Results

OpenMP performance
Hybrid performance

  • Conclusion & Future Work
slide-4
SLIDE 4

Hybrid MPI/OpenMP Programming Overview


  • MPI (Message Passing Interface)

Inter-node communication
Implicit locality
Unnecessary data duplication and shared-memory transfers

  • OpenMP

Fully exploit shared-memory data parallelism
No inter-node standard
No data-locality standard (ccNUMA node)

  • Hybrid Programming

Mix MPI and OpenMP inside an application
Benefit from Pure-MPI and Pure-OpenMP modes
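
To make the mixed model concrete, here is a minimal hybrid sketch (illustrative only, not code from the talk): each MPI task opens an OpenMP parallel region for the shared-memory work it owns. The FUNNELED thread level is an assumption that only the master thread of each task will call MPI.

    /* Minimal hybrid MPI+OpenMP sketch (illustrative). */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank;

        /* Request FUNNELED support: only the master thread of each
         * MPI task performs MPI calls. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        #pragma omp parallel
        {
            /* Shared-memory data parallelism inside the MPI task. */
            printf("MPI task %d: OpenMP thread %d of %d\n",
                   rank, omp_get_thread_num(), omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }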

slide-5
SLIDE 5

Hybrid MPI/OpenMP Programming Approaches


  • Traditional Approaches

Exploit one core with one execution flow
E.g., MPI for inter-node communication, OpenMP otherwise
E.g., socket exploitation with OpenMP on multicore CPUs

  • Oversubscribing Approaches

Exploit one core with several execution flows
Load balancing on the whole node
Adaptive behavior between parallel regions

  • Mixing Depth

Communications outside parallel regions: network-bandwidth saturation issue
Communications inside parallel regions: MPI thread-safety requirement
Extended taxonomy from [Hager09]
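
The two mixing depths can be sketched as follows (a hedged illustration, not code from the talk; compute() and exchange() are hypothetical helpers). Communications either stay outside the OpenMP regions, or move inside them, in which case the MPI library must provide the corresponding thread-support level from MPI_Init_thread.

    /* Mixing-depth sketch; compute() and exchange() are placeholders. */
    #include <mpi.h>
    #include <omp.h>

    void compute(double *buf, int n)      /* placeholder shared-memory work */
    {
        #pragma omp for
        for (int i = 0; i < n; i++) buf[i] *= 2.0;
    }

    void exchange(double *buf, int n, int peer)   /* halo-style swap */
    {
        MPI_Sendrecv_replace(buf, n, MPI_DOUBLE, peer, 0, peer, 0,
                             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    void comm_outside(double *buf, int n, int peer)
    {
        /* Communications outside parallel regions: a single execution
         * flow per MPI task drives the network. */
        #pragma omp parallel
        compute(buf, n);
        exchange(buf, n, peer);
    }

    void comm_inside(double *buf, int n, int peer)
    {
        /* Communications inside parallel regions: needs at least
         * MPI_THREAD_FUNNELED (master-only MPI calls), or
         * MPI_THREAD_MULTIPLE if any thread may call MPI. */
        #pragma omp parallel
        {
            compute(buf, n);
            #pragma omp barrier
            #pragma omp master
            exchange(buf, n, peer);
        }
    }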

slide-6
SLIDE 6

Hybrid MPI/OpenMP Extended Taxonomy


slide-7
SLIDE 7

MPC Framework


  • User-level thread library [EuroPar’08]

Pthreads API, debugging with GDB [MTAAP’2010]

  • Thread-based MPI [EuroPVM/MPI’09]

MPI 1.3 compliant
Optimized to save memory

  • NUMA-aware memory allocator (for multithreaded applications)
  • Contribution: Hybrid representation inside MPC

Implementation of OpenMP Runtime (2.5 compliant)
Compiler part w/ patched GCC (4.3.X and 4.4.X)
Optimizations for hybrid applications:
Efficient oversubscribed OpenMP (more threads than cores)
Unified representation of MPI tasks and OpenMP threads
Scheduler-integrated polling methods
Message-buffer privatization and parallel message reception
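
One of the listed features, oversubscribed OpenMP, simply means requesting more threads than cores. A trivial sketch (the 4x factor is arbitrary, not an MPC recommendation):

    /* Oversubscription sketch: more OpenMP threads than cores. */
    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        int cores = omp_get_num_procs();

        /* Ask for four threads per core; a user-level thread scheduler
         * such as MPC's aims to keep the extra switching cheap. */
        omp_set_num_threads(4 * cores);

        #pragma omp parallel
        {
            #pragma omp single
            printf("%d threads on %d cores\n",
                   omp_get_num_threads(), cores);
        }
        return 0;
    }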

slide-8
SLIDE 8

MPC’s Hybrid Execution Model (Fully Hybrid)


Application with 1 MPI task per node

slide-9
SLIDE 9

MPC’s Hybrid Execution Model (Fully Hybrid)


Initialization of OpenMP regions (on the whole node)

slide-10
SLIDE 10

MPC’s Hybrid Execution Model (Fully Hybrid)


Entering OpenMP parallel region w/ 6 threads

slide-11
SLIDE 11

MPC’s Hybrid Execution Model (Simple Mixed)


2 MPI tasks + OpenMP parallel region w/ 4 threads (on 2 cores)

slide-12
SLIDE 12

Experimental Environment


  • Architecture

Dual-socket quad-core Nehalem-EP machine
24 GB of memory, Linux 2.6.31 kernel

  • Programming model implementations

MPI: MPC, IntelMPI 3.2.1, MPICH2 1.1, OpenMPI 1.3.3
OpenMP: MPC, ICC 11.1, GCC 4.3.0 and 4.4.0, SunCC 5.1
Best option combination:
OpenMP thread pinning (KMP_AFFINITY, GOMP_CPU_AFFINITY)
OpenMP wait policy (OMP_WAIT_POLICY, SUN_MP_THR_IDLE=spin)
MPI task placement (I_MPI_PIN_DOMAIN=omp)

  • Benchmarks

EPCC suite (pure OpenMP / fully hybrid) [Bull et al. 01]
Microbenchmarks for mixed-mode OpenMP/MPI [Bull et al. IWOMP’09]

slide-13
SLIDE 13

EPCC: OpenMP Parallel-Region Overhead


[Chart: execution time (µs) vs. number of threads (1, 2, 4, 8); series: MPC, ICC 11.1, GCC 4.3.0, GCC 4.4.0, SunCC 5.1]
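
For reference, the EPCC suite estimates this overhead roughly by timing a fixed workload run sequentially and then wrapped in a parallel construct, and reporting the difference. A sketch in the same spirit (delay() and the counts are placeholders, not the actual EPCC code):

    /* EPCC-style parallel-region overhead estimate (sketch). */
    #include <omp.h>
    #include <stdio.h>

    #define OUTER 1000

    static void delay(int length)        /* busy work of known cost */
    {
        volatile double a = 0.0;
        for (int i = 0; i < length; i++) a += i * 0.5;
    }

    int main(void)
    {
        const int length = 500;
        double t0, ref, par;

        /* Reference: the workload run by a single thread. */
        t0 = omp_get_wtime();
        for (int k = 0; k < OUTER; k++)
            delay(length);
        ref = (omp_get_wtime() - t0) / OUTER;

        /* Same workload executed by every thread of a parallel region. */
        t0 = omp_get_wtime();
        for (int k = 0; k < OUTER; k++) {
            #pragma omp parallel
            delay(length);
        }
        par = (omp_get_wtime() - t0) / OUTER;

        printf("parallel-region overhead ~ %g us\n", (par - ref) * 1e6);
        return 0;
    }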

slide-14
SLIDE 14

EPCC: OpenMP Parallel-Region Overhead (cont.)


[Chart: execution time (µs) vs. number of threads (1, 2, 4, 8); series: MPC, ICC 11.1, GCC 4.4.0, SunCC 5.1]

slide-15
SLIDE 15

EPCC: OpenMP Parallel-Region Overhead (cont.)


[Chart: execution time (µs) vs. number of threads (8, 16, 32, 64); series: MPC, ICC 11.1, GCC 4.4.0, SunCC 5.1]

slide-16
SLIDE 16

Hybrid Funneled Ping-Pong (1KB)


[Chart: ratio vs. number of OpenMP threads (2, 4, 8); series: MPC, IntelMPI, MPICH2/GCC 4.4.0, MPICH2/ICC 11.1, OpenMPI/GCC 4.4.0, OpenMPI/ICC 11.1]
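
In the funneled ping-pong from the mixed-mode suite, all OpenMP threads touch the message buffer, but only the master thread calls MPI. A hedged sketch (buffer size and pairing are placeholders, not the benchmark's actual code):

    /* Funneled ping-pong sketch between MPI ranks 0 and 1. */
    #include <mpi.h>
    #include <omp.h>

    #define COUNT 256                 /* 256 x 4-byte ints ~ 1 KB */

    void pingpong_funneled(int rank)
    {
        int buf[COUNT];               /* shared by the team below */

        #pragma omp parallel
        {
            /* Every thread writes its share of the outgoing message. */
            #pragma omp for
            for (int i = 0; i < COUNT; i++) buf[i] = i;

            /* Only the master thread communicates (MPI_THREAD_FUNNELED). */
            #pragma omp master
            {
                if (rank == 0) {
                    MPI_Send(buf, COUNT, MPI_INT, 1, 0, MPI_COMM_WORLD);
                    MPI_Recv(buf, COUNT, MPI_INT, 1, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                } else if (rank == 1) {
                    MPI_Recv(buf, COUNT, MPI_INT, 0, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                    MPI_Send(buf, COUNT, MPI_INT, 0, 0, MPI_COMM_WORLD);
                }
            }
            #pragma omp barrier

            /* Every thread reads back the returned message. */
            #pragma omp for
            for (int i = 0; i < COUNT; i++) { volatile int v = buf[i]; (void)v; }
        }
    }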

slide-17
SLIDE 17

Hybrid Multiple Ping-Pong (1KB)


[Chart: ratio vs. number of OpenMP threads (2, 4); series: MPC, IntelMPI, MPICH2/GCC 4.4.0, MPICH2/ICC 11.1]
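
The multiple ping-pong variant has each OpenMP thread perform its own exchange concurrently, which in standard MPI terms calls for MPI_THREAD_MULTIPLE support. A hedged sketch with thread ids used as tags (placeholders, not the benchmark's actual code; both ranks are assumed to run the same number of OpenMP threads):

    /* Multiple ping-pong sketch: each OpenMP thread of rank 0 exchanges
     * its own ~1 KB message with the matching thread of rank 1. */
    #include <mpi.h>
    #include <omp.h>

    #define COUNT 256                 /* 256 x 4-byte ints ~ 1 KB per thread */

    void pingpong_multiple(int rank)
    {
        #pragma omp parallel
        {
            int buf[COUNT];
            int tag = omp_get_thread_num();   /* pair threads by id */

            for (int i = 0; i < COUNT; i++) buf[i] = tag;

            if (rank == 0) {
                MPI_Send(buf, COUNT, MPI_INT, 1, tag, MPI_COMM_WORLD);
                MPI_Recv(buf, COUNT, MPI_INT, 1, tag, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, COUNT, MPI_INT, 0, tag, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, COUNT, MPI_INT, 0, tag, MPI_COMM_WORLD);
            }
        }
    }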

slide-18
SLIDE 18

Hybrid Multiple Ping-Pong (1KB) (cont.)


[Chart: ratio vs. number of OpenMP threads (2, 4, 8); series: MPC, IntelMPI, MPICH2/GCC 4.4.0, MPICH2/ICC 11.1]

slide-19
SLIDE 19

Hybrid Multiple Ping-Pong (1MB)


[Chart: ratio vs. number of OpenMP threads (2, 4, 8); series: MPC, IntelMPI, MPICH2/GCC 4.4.0, MPICH2/ICC 11.1]

slide-20
SLIDE 20

Alternating (MPI Tasks Waiting)


[Chart: ratio vs. number of OpenMP threads (2, 4, 8, 16); series: MPC, IntelMPI, MPICH2 1.1/GCC 4.4.0, MPICH2 1.1/ICC 11.1, OpenMPI/GCC 4.4.0]

slide-21
SLIDE 21

Conclusion


  • Mixing MPI+OpenMP is a promising solution for next-generation computer architectures

How to avoid large overhead?

  • Contributions

Taxonomy of hybrid approaches
MPC: a framework unifying both programming models
Lower hybrid overhead
Fully compliant with MPI 1.3 and OpenMP 2.5 (with patched GCC)
Freely available at http://mpc.sourceforge.net (version 2.0)
Contact: patrick.carribault@cea.fr or marc.perache@cea.fr

  • Future Work

Optimization of the OpenMP runtime (e.g., NUMA-aware barrier)
OpenMP 3.0 support (tasks)
Thread/data affinity (thread placement, data locality)
Tests on large applications