SLIDE 1

Using SimGrid to Evaluate the Impact of AMPI Load Balancing In a Geophysics HPC Application

Rafael Keller Tesser⋆, Philippe O. A. Navaux⋆, Lucas Mello Schnorr⋆, Arnaud Legrand†

⋆: UFRGS GPPD/Inf, Porto Alegre, Brazil †: CNRS/Inria POLARIS, Grenoble, France

14th Charm++ Workshop, Urbana, April 2016

1 / 21

SLIDE 2

Outline

1 Context: Improving the Performance of Iterative Unbalanced Applications
2 SimGrid and SMPI in a Nutshell
3 A Simulation-Based Methodology
4 Experimental Results
  • Validation
  • Investigating AMPI parameters
5 Conclusion

2 / 21

SLIDE 3

Context

Parallel HPC applications are often written with MPI, which is based on a regular SPMD programming model.

  • Many of these applications are iterative, and this paradigm is well suited to balanced applications;
  • Unbalanced applications:
    • may resort to static load-balancing techniques (at the application level);
    • or not... (the load imbalance comes from the nature of the input data and evolves over time and space; e.g., Ondes3D). Handling this at the application level is just a nightmare.

A possible approach is to use over-decomposition and dynamic process-level load balancing, as proposed by AMPI/Charm++.
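The gain from over-decomposition can be illustrated with a toy Python sketch (all costs and counts are hypothetical, not Ondes3D data): decomposing the domain into more chunks than processes gives the runtime something to redistribute, which a coarse static split cannot offer.

```python
# Toy illustration of over-decomposition (hypothetical costs).
# The same uneven work is mapped onto 2 processes either as one
# contiguous block per process, or as 8 chunks that can be
# redistributed (here, a simple round-robin redistribution).

chunk_costs = [9, 9, 1, 1, 1, 1, 1, 1]  # uneven per-chunk work

# Coarse static decomposition: one contiguous half per process.
coarse = [sum(chunk_costs[:4]), sum(chunk_costs[4:])]

# Over-decomposition: 8 virtual processes, redistributed round-robin.
fine = [sum(chunk_costs[0::2]), sum(chunk_costs[1::2])]

print(coarse)  # [20, 4]  -> makespan 20, one process nearly idle
print(fine)    # [12, 12] -> makespan 12, balanced in this toy case
```

In real AMPI runs the redistribution is done by a load balancer at run time; the point here is only that finer chunks give the balancer something to move.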

3 / 21

SLIDE 4

Ondes3D, a Seismic Wave Propagation Simulator

  • Developed by BRGM [Aochi et al. 2013];
  • Used to predict the consequences of future earthquakes.

Many sources of load imbalance:

  • absorbing boundary conditions (tasks at the borders perform more computation);
  • variation in the constitutive laws of different geological layers (different equations);
  • propagation of the shockwave in space and time.

Mesh-partitioning techniques and quasi-static load-balancing algorithms are thus ineffective.

4 / 21

SLIDE 5

AMPI can be quite effective

[Chart: average execution times over 500 time-steps, as a function of the number of chunks (288, 2304, 4608), for MPI (No LB), AMPI (No LB), RefineLB, NucoLB, HwTopoLB, HierarchicalLB 1, and HierarchicalLB 2; annotated percentages: 32.12%, 33.97%, 34.72%, 36.58%]

Based on the Mw 6.6, 2007 Niigata Chuetsu-Oki, Japan, earthquake [Aochi et al., ICCS 2013]

  • Full problem (6000 time-steps): 162 minutes on 32 nodes (Intel Harpertown processors)

5 / 21


SLIDE 7

Challenges

Finding the best load-balancing parameters:

  • Which load balancer is the most suited?
  • How many iterations should be grouped together? (migration frequency)
  • How many VPs? (decomposition level)
  • Load-balancing benefit vs. application communication overhead and LB overhead
  • ...

And preparing for AMPI is not free:

  • need to write data-serialization code;
  • engaging in such an approach without knowing how much there is to gain can be deterring.

Goal

Propose a sound methodology for investigating the performance improvement of irregular applications through over-decomposition.

6 / 21

SLIDE 8

Outline

1 Context: Improving the Performance of Iterative Unbalanced Applications
2 SimGrid and SMPI in a Nutshell
3 A Simulation-Based Methodology
4 Experimental Results
  • Validation
  • Investigating AMPI parameters
5 Conclusion

7 / 21

SLIDE 9

SimGrid

[Workflow diagram: trace an unmodified MPI application once on a simple cluster (mpirun, TAU, PAPI) to obtain a time-independent trace; describe the target platform ("model the machine of your dreams") in XML, including a hierarchical network topology (1G/10G links with bandwidth limiters); SMPI then replays the trace as many times as you want, with simulated or emulated computations and simulated communications, producing a timed trace, a simulated execution time (here, 43.232 seconds), and visualizations with Pajé/Triva.]

Time-independent trace (one action per rank per line):

  0 compute 1e6
  0 send 1 1e6
  0 recv 3 1e6
  1 recv 0 1e6
  ...

Timed trace (timestamped actions with durations):

  [0.001000] 0 compute 1e6 0.01000
  [0.010028] 0 send 1 1e6 0.009028
  [0.040113] 0 recv 3 1e6 0.030085
  [0.010028] 1 recv 0 1e6 0.010028
  ...

Platform description (XML):

  <?xml version="1.0"?>
  <!DOCTYPE platform SYSTEM "simgrid.dtd">
  <platform version="3">
    <cluster id="griffon" prefix="griffon-" suffix=".grid5000.fr"
             radical="1-144" power="286.087kf" bw="125MBps" lat="24us"
             bb_bw="1.25GBps" bb_lat="0" sharing_policy="FULLDUPLEX"/>
  </platform>

On-line: simulate/emulate unmodified complex applications

  • possible memory folding and shadow execution;
  • handles non-deterministic applications.

Off-line: trace replay.

  • SimGrid: a 15-year-old collaboration between France, the US, the UK, Austria, ...
  • flow-level models that account for topology and contention;
  • SMPI supports both trace replay and direct emulation;
  • embeds 100+ collective-communication algorithms.
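The off-line replay idea can be sketched in a few lines of Python (a toy cost model with a made-up flop rate, latency, and bandwidth; SimGrid's actual flow-level models are far more sophisticated):

```python
# Toy replay of a time-independent trace (a sketch, not SMPI itself).
# Each trace line is "<rank> <action> ..."; compute volumes are in
# flops and message sizes in bytes. The platform parameters below
# are invented for the example.

FLOPS = 1e9           # simulated host speed, flop/s (assumption)
LAT, BW = 1e-5, 1e8   # link latency (s) and bandwidth (B/s) (assumption)

def replay(trace):
    clock = {}  # per-rank simulated clock, in seconds
    for line in trace.strip().splitlines():
        parts = line.split()
        rank, action = int(parts[0]), parts[1]
        if action == "compute":
            clock[rank] = clock.get(rank, 0.0) + float(parts[2]) / FLOPS
        else:  # send/recv: both ends pay latency + size/bandwidth here;
               # a real simulator synchronizes matching send/recv pairs
            clock[rank] = clock.get(rank, 0.0) + LAT + float(parts[3]) / BW
    return max(clock.values())  # simulated makespan

trace = """
0 compute 1e6
0 send 1 1e6
1 recv 0 1e6
1 compute 1e6
"""
print(replay(trace))  # simulated makespan, in seconds
```

Changing FLOPS, LAT, or BW replays the same trace on a different virtual platform, which is exactly the property the methodology exploits.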

8 / 21

SLIDE 10

Outline

1 Context: Improving the Performance of Iterative Unbalanced Applications
2 SimGrid and SMPI in a Nutshell
3 A Simulation-Based Methodology
4 Experimental Results
  • Validation
  • Investigating AMPI parameters
5 Conclusion

9 / 21

SLIDE 11

Principle

Approach:

1 Implement various load-balancing algorithms in SMPI;
2 Capture a time-independent trace (a faithful application profile):
  • two alternatives:
    – standard tracing: parallel/fast, but requires more resources;
    – emulation (smpicc/smpirun): runs on a single host, but slow;
  • add a fake call to MPI_Migrate where needed;
  • track how much memory is used by each VP and use it as an upper bound on the migration cost;
  • this may take some time, but requires minimal modification to (and knowledge of) the application;
3 Replay the trace as often as wished, playing with the different parameters (LB, frequency, topology, ...).
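For intuition, the two Charm++ balancers evaluated later can be sketched conceptually in Python (simplified re-implementations of the ideas, not the actual Charm++ code): GreedyLB remaps every object from scratch, while RefineLB only moves objects away from overloaded processors, so it migrates much less data.

```python
# Conceptual sketches of the two balancing strategies (simplified;
# not the actual Charm++ implementations).

def greedy_lb(loads, nprocs):
    """GreedyLB idea: remap all objects, heaviest first, each to the
    currently least-loaded processor."""
    totals = [0.0] * nprocs
    mapping = [0] * len(loads)
    for obj, load in sorted(enumerate(loads), key=lambda x: -x[1]):
        pe = totals.index(min(totals))
        mapping[obj] = pe
        totals[pe] += load
    return mapping, totals

def refine_lb(loads, mapping, nprocs, tol=1.05):
    """RefineLB idea: only move objects off processors whose load
    exceeds tol * average, so few objects migrate."""
    totals = [0.0] * nprocs
    for obj, pe in enumerate(mapping):
        totals[pe] += loads[obj]
    avg = sum(totals) / nprocs
    moved = 0
    for obj, pe in enumerate(mapping):
        if totals[pe] > tol * avg:
            dest = totals.index(min(totals))
            if totals[dest] + loads[obj] < totals[pe]:
                totals[pe] -= loads[obj]
                totals[dest] += loads[obj]
                mapping[obj] = dest
                moved += 1
    return totals, moved
```

The migration count returned by refine_lb is what makes the frequency/overhead trade-off of the later slides interesting: GreedyLB balances better but may move almost everything.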

10 / 21

SLIDE 12

Principle

Key questions:

  • How do we know whether our simulations are faithful?
  • How do we understand where a mismatch comes from? (VP scheduling, LB implementation, trace capture, network, ...)

10 / 21

SLIDE 13

Evaluation Challenge

No LB vs. GreedyLB: simple Gantt charts are not very informative.

11 / 21


SLIDE 15

Outline

1 Context: Improving the Performance of Iterative Unbalanced Applications
2 SimGrid and SMPI in a Nutshell
3 A Simulation-Based Methodology
4 Experimental Results
  • Validation
  • Investigating AMPI parameters
5 Conclusion

12 / 21

SLIDE 16

Description of the experiments

Scenarios: two different earthquake simulations:

  • Niigata-ken Chuetsu-Oki: 2007, Mw 6.6, Japan; 500 time-steps; dimensions 300×300×150;
  • Ligurian: 1887, Mw 6.3, north-western Italy; 300 time-steps; dimensions 500×350×130.

Load balancers: no load balancing vs. GreedyLB vs. RefineLB.

Hardware resources: Parapluie cluster from Grid'5000:

  • 2 × AMD Opteron™ 6164 HE, 24 cores, 1.7GHz, InfiniBand;
  • plus my own laptop (Intel Core™ i7-4610M, 2 cores, 3GHz).

13 / 21

SLIDE 17

Chuetsu-Oki simulation - 64 VPs and 16 processes

Detailed View

[Heatmap: load (color scale, 0.4–1.0) of each of the 16 resources over 500 iterations, real execution (AMPI) vs. simulation (SMPI), under None, GreedyLB, and RefineLB]

14 / 21

SLIDE 18

Chuetsu-Oki simulation - 64 VPs and 16 processes

Space-Aggregated View (5–10 runs for each configuration)

[Chart: average load over time, AMPI (real) vs. SMPI (simulated), under None, GreedyLB, and RefineLB]

  • The simulated load behaves very similarly to the real-life (RL) one;
  • GreedyLB is the best choice in both simulation and RL;
  • There is still some mismatch in terms of makespan.

14 / 21

SLIDE 19

Ligurian simulation - 64 VPs and 16 processes

Detailed View

[Heatmap: load (color scale, 0.4–1.0) of each of the 16 resources over 300 iterations, real execution (AMPI) vs. simulation (SMPI), under None, GreedyLB, and RefineLB]

15 / 21

SLIDE 20

Ligurian simulation - 64 VPs and 16 processes

Space-Aggregated View

[Chart: average load over time, AMPI (real) vs. SMPI (simulated), under None, GreedyLB, and RefineLB]

  • Once again, the simulated and RL loads behave similarly;
  • RefineLB is the best choice in both simulation and RL;
  • There is a mismatch in the timings between simulation and RL.

15 / 21

SLIDE 21

Impact of the LB frequency (simulation)

  • Call MPI_Migrate on every iteration when capturing the trace;
  • Change the load-balancing frequency in simulation.

[Chart: makespan (s) vs. load-balancing interval (10, 20, 30, 40 iterations) for the None, GreedyLB, and RefineLB heuristics]

Use RefineLB every 10 or 20 iterations.

  • Trace capture time: 10 (XP) × 5 hours ≈ 50 hours
  • Simulation time: 10 × 3 (heuristics) × 4 (freq.) × 200 sec ≈ 6h40m
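The time accounting above is plain arithmetic and easy to check:

```python
# Check of the slide's time accounting: 10 trace captures (XP) of
# about 5 hours each, vs. 10 traces x 3 heuristics x 4 frequencies
# x about 200 s of simulation per replay.

capture_hours = 10 * 5            # 50 hours of trace capture
sim_seconds = 10 * 3 * 4 * 200    # 24000 s of simulation
hours, minutes = sim_seconds // 3600, (sim_seconds % 3600) // 60
print(capture_hours)              # 50
print(hours, minutes)             # 6 40
```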

16 / 21

SLIDE 22

Impact of the decomposition level (in Simulation)

[Chart: makespan (s) vs. VP count (32, 48, 64, 128, 256) for the None, GreedyLB, and RefineLB heuristics]

Use either GreedyLB with 32 VPs or RefineLB with 64 VPs.

  • Trace capture time: 5 (VP) × 5 (XP) × 5 hours ≈ 5 days
  • Simulation time: 5 × 5 × 3 (heuristics) × 200 sec ≈ 4.1 hours

17 / 21

SLIDE 23

Impact of the decomposition level (real exp.)

[Chart: makespan (s) vs. VP count (32, 48, 64, 128, 256) for the None, GreedyLB, and RefineLB heuristics, from real executions]

Same conclusion... in only ≈ 29 hours, but on a 16-node cluster.

18 / 21

SLIDE 24

Outline

1 Context: Improving the Performance of Iterative Unbalanced Applications
2 SimGrid and SMPI in a Nutshell
3 A Simulation-Based Methodology
4 Experimental Results
  • Validation
  • Investigating AMPI parameters
5 Conclusion

19 / 21

SLIDE 25

Conclusion

  • This is still ongoing work... Any comments are welcome!
  • Simulation of over-decomposition-based dynamic load balancing:
    • good results in terms of load distribution;
    • some inaccuracy in terms of total makespan.
  • Visualizing the evolution of resource usage is:
    • quite useful to compare simulation with real life;
    • and to compare different load-balancing heuristics.
  • We need to devise some way to speed up trace collection:
    • to facilitate the analysis of different over-decomposition levels;
    • is there some way to get similar input traces straight from Charm++/AMPI?

20 / 21

SLIDE 26

Acknowledgements

This research was partially supported by:

  • CNPq: PhD scholarship at the Post-Graduate Program in Computer Science (PPGC) at UFRGS;
  • CAPES-COFECUB: part of this work was conducted during a sandwich doctorate at the Laboratoire d'Informatique de Grenoble, supported by the International Cooperation Program CAPES/COFECUB and financed by CAPES within the Ministry of Education of Brazil;
  • HPC4E: this research has received funding from the EU H2020 Programme and from MCTI/RNP-Brazil under the HPC4E Project, grant agreement number 689772.

21 / 21