All the things you need to know about Intel MPI Library
Jerome Vienne viennej@tacc.utexas.edu
Texas Advanced Computing Center The University of Texas at Austin Austin, TX November 12th, 2016
A Heterogeneous Environment
▶ CPUs (Number of cores, Cache sizes, Frequency) ▶ Memory (Amount, Frequency) ▶ Network Speed (10,20,40 … Gbit/s) ▶ Size of the job ▶ Type of code: Hybrid (ex: OpenMP+MPI) or Pure MPI
All the things you need to know about Intel MPI Library | November 12th, 2016 | 2
▶ Why ? Because the number of combinations is too large. ▶ Are these choices optimal for my application ? Not necessarily. ▶ Can we change them ? Yes, this is why we are there.
▶ ”How to tune MPI” cannot be found easily inside books. ▶ Show that MPI libraries are not black boxes. ▶ Describe concepts that are common inside MPI libraries. ▶ Understand the difference between MPI libraries. ▶ Provide some useful commands for Intel MPI. ▶ Result: Help you to reduce the time and memory footprint of your application.
▶ Talk based on Intel MPI (few references to MVAPICH2 and Open MPI).
▶ All experiments were done on the Stampede supercomputer at TACC.
▶ Tuning options are specific to an MPI library! But the concepts are common to all of them.
▶ Options can have counter-effects! ▶ MPI libraries have a lot of options for tuning; we will only cover some of them.
▶ Tuning can be time consuming, but in the long term it can be worth the effort.
▶ The Choice of the Benchmark
▶ Profiling
▶ Hostfile
▶ Process Placement
▶ Inter-node Point-to-Point Optimization
▶ Intra-node Point-to-Point Optimization
▶ Collective Tuning
▶ To conclude
▶ Intel MPI: Intel MPI Benchmarks (IMB) ▶ MVAPICH2: OSU Micro-Benchmarks (OMB)
▶ Both are communication intensive, without computation ▶ Results depend on your application ▶ The best benchmark is your application!
▶ Originally known as the Pallas MPI Benchmarks (PMB) ▶ Supports Point-to-Point and Collective operations ▶ One program with a lot of options for the classical MPI functions
▶ The root changes after each iteration for collectives
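A minimal IMB run looks like the following sketch (the hostfile name and process counts are illustrative; IMB-MPI1 and the benchmark names PingPong and Allreduce are as shipped with Intel MPI):

```shell
# Ping-pong between two nodes: one rank per node
mpiexec -f hosts -ppn 1 -n 2 ./IMB-MPI1 PingPong

# Allreduce across 32 ranks, 16 ranks per node
mpiexec -f hosts -ppn 16 -n 32 ./IMB-MPI1 Allreduce
```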
[Figure: IMB results, time (us) vs. message size (4 B–4 MB), MVAPICH2 2.2 vs. Intel MPI 2017]
▶ Very simple to use ▶ Supports Point-to-Point and Collective operations ▶ Multiple programs with simple options ▶ Keeps the same root during all iterations, and uses a barrier
[Figures: OMB results, time (us) vs. message size (4 B–1 MB), MVAPICH2 2.2 vs. Intel MPI 2017]
▶ Don’t trust them! ▶ They have different behaviors: so, KNOW your benchmark! ▶ They do not necessarily give you the best results by default. ▶ Be sure that you tune things correctly if you want to compare MPI libraries.
▶ Collective tuning for a particular benchmark/application could hurt the performance of another one.
▶ To identify which MPI functions are used, you have two options:
▶ Look at the code ▶ Profile your application
▶ Profiling provides you all the information regarding MPI usage
▶ It can be integrated in the MPI library (ex: Intel MPI) ▶ A lot of tools can help you to profile your application (TAU, …)
mpiexec -genv I_MPI_STATS ipm -genv I_MPI_STATS_FILE myprofile.txt …
▶ MPI Performance Snapshot (MPS) ▶ Intel Trace Analyzer and Collector (ITAC)
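As a sketch, an ITAC trace can be collected directly from the launcher (this assumes ITAC is installed and loaded; the application name is a placeholder):

```shell
# Collect a trace of a 16-rank run; produces an .stf file
# that can be opened in the Trace Analyzer GUI
mpiexec -trace -n 16 ./a.out
```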
▶ The hostfile provides the list of nodes that will be used ▶ Depending on the MPI library, the same hostfile could lead to a different process placement, and very different performance
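As a sketch of why the same node list can behave differently (node names are placeholders): Intel MPI's Hydra launcher reads one line per node, while MVAPICH2's mpirun_rsh treats each line as one rank slot and cycles through the file:

```shell
# Hypothetical hostfile with two nodes
cat > hosts <<EOF
c401-101
c401-102
EOF

# Intel MPI (Hydra): -f names the hostfile, -ppn sets ranks per node
mpiexec -f hosts -ppn 16 -n 32 ./a.out

# MVAPICH2 (mpirun_rsh): each hostfile line is one rank slot,
# so ranks are distributed round-robin over the two lines above
mpirun_rsh -np 32 -hostfile hosts ./a.out
```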
First example:
▶ Default: 176 sec.
▶ Correct Hostfile: 176 sec.
▶ + Process Placement: 19 sec.
Second example:
▶ Default: 51 sec.
▶ Correct Hostfile/Command: …
Binding options offered by an MPI library: No Binding; Preset Binding (Compact or Scatter); User-defined Binding (by Socket or by Core).
Pair   Latency 0 B (us)   Latency 8 KB (us)   Description
2-4    0.20               1.77                Same socket, shared L3, best performance
0-2    0.20               1.80                Same socket, shared L3, but core 0 handles interrupts
2-10   0.41               2.52                Different sockets, does not share L3
0-8    0.42               2.53                Different sockets, does not share L3, core 0 handles interrupts
▶ Default Mapping: 176 seconds ▶ Optimal Mapping: 19 seconds
[Figure: default placement of 4 MPI ranks on a two-socket node (sockets S1 and S2, HCA attached to one socket): (a) MVAPICH2 2.1, (b) Open MPI 1.8.8, (c) Intel MPI 5.0.2]
[Figure: latency (us) vs. message size (4 B–16 KB), rank pinned on core 0 vs. core 8]
[Figure: default placement of 8 MPI ranks on a two-socket node: (a) MVAPICH2 2.1 and (c) Intel MPI 5.0.2 fill cores compactly, while (b) Open MPI 1.8.8 alternates ranks between the two sockets (0,2,4,6 vs. 1,3,5,7)]
▶ I_MPI_PIN_PROCESSOR_LIST: Define a processor subset and the mapping of MPI processes within this subset
▶ I_MPI_PIN_DOMAIN (For Hybrid code)
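A sketch of these two Intel MPI variables in practice (the core ranges and rank counts are illustrations, not recommendations):

```shell
# Pin 8 ranks to the first 8 physical cores, in order
mpiexec -genv I_MPI_PIN_PROCESSOR_LIST 0-7 -n 8 ./a.out

# Hybrid MPI+OpenMP: give each rank a socket-sized domain,
# so its OpenMP threads stay within one socket
mpiexec -genv I_MPI_PIN_DOMAIN socket -n 2 ./a.out
```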
▶ MV2_CPU_BINDING_POLICY=bunch|scatter ▶ MV2_CPU_BINDING_LEVEL=core|socket|numanode ▶ Manual: MV2_CPU_MAPPING=0:8:9-15:1-7
▶ Intel MPI: -print-rank-map ▶ MVAPICH2: MV2_SHOW_CPU_BINDING=1 ▶ OpenMPI: --report-bindings
c421-502$ mpirun_rsh -np 2 -hostfile hosts MV2_CPU_MAPPING=2-4:10-12 MV2_SHOW_CPU_BINDING=1 ./osu_latency
RANK:0 CPU_SET: 2 3 4
RANK:1 CPU_SET: 10 11 12
▶ There are multiple different protocols for sending messages. ▶ We will only focus here on the eager and rendezvous protocols. ▶ The switch point between these two protocols is called the eager threshold.
▶ It is an implementation technique; it is not part of the MPI standard.
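Intel MPI exposes the switch point through environment variables, as in this sketch (the 128 KB value is an illustration; measure before adopting it):

```shell
# Raise the eager/rendezvous switch point to 128 KB (in bytes)
export I_MPI_EAGER_THRESHOLD=131072            # inter-node threshold
export I_MPI_INTRANODE_EAGER_THRESHOLD=131072  # intra-node threshold
mpiexec -n 16 ./a.out
```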
[Figure: point-to-point time (us) vs. message size (4 B–4 MB), eager vs. rendezvous protocol]
▶ Shared Memory ▶ Kernel Assisted
▶ Used by all MPI implementations ▶ Double-copy implementation involving a shared buffer space
▶ Best approach for small messages ▶ Not ideal for large messages (ties down the CPU, cache pollution)
▶ Single copy mechanism ▶ Preferred approach for medium or large messages ▶ You need to use:
▶ Kernel module: LiMIC or KNEM ▶ Kernel feature: CMA
▶ Cross Memory Attach ▶ Introduced with Linux kernel 3.2, and back-ported to some older distribution kernels
▶ Available on Stampede; supported by Intel MPI (since 5.0u2) and by MVAPICH2
▶ CMA is enabled automatically for large messages in recent versions
▶ For short messages (eager protocol): Shared Memory is better ▶ For large messages (rendezvous protocol): Kernel assisted is better
▶ I_MPI_SHM_LMT= shm | direct | off
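A sketch comparing the two large-message transfer modes with the OSU bandwidth benchmark (2 ranks on the same node; benchmark path is a placeholder):

```shell
# Double-copy via the shared buffer
mpiexec -genv I_MPI_SHM_LMT shm    -n 2 ./osu_bw

# Kernel-assisted single copy (CMA)
mpiexec -genv I_MPI_SHM_LMT direct -n 2 ./osu_bw
```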
[Figure: intra-node bandwidth (MB/s) vs. message size (4 B–4 MB), I_MPI_SHM_LMT=shm vs. direct]
▶ Used when a communication involves more than 2 MPI tasks ▶ Behind each collective there are many algorithms (binomial tree, …)
▶ The choice of the algorithm depends on many parameters
▶ Default tuning is not necessarily the best one for you. ▶ MPI libraries provide mechanisms to select which algorithms to use.
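In Intel MPI this selection is done through the I_MPI_ADJUST family, as in this sketch (the algorithm IDs below are illustrations, not recommendations; the ID-to-algorithm mapping is in the Intel MPI reference manual):

```shell
# Force one Allreduce algorithm for all message sizes
export I_MPI_ADJUST_ALLREDUCE=2

# Select Bcast algorithms per message-size range (bytes)
export I_MPI_ADJUST_BCAST="1:0-8192;3:8193-2147483647"

mpiexec -n 64 ./a.out
```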
[Figure: collective latency (us) vs. message size (4 B–1 MB), reference tuning vs. algorithms 1–3]
[Figure: collective latency (us) vs. message size (4 B–1 MB), reference tuning vs. algorithms 1–3, second case]
▶ Read the documentation :) ▶ MPI libraries behave differently ▶ Mechanisms exist to easily improve the performance of your application
▶ Mechanisms also exist to reduce the memory footprint of your application
▶ Don’t be afraid to ask for help