SLIDE 1

Catching Idlers with Ease: A Lightweight Wait-State Profiler for MPI Programs

Guoyong Mao, David Böhme, Markus Geimer, Marc-André Hermanns, Daniel Lorenz, and Felix Wolf
Petascale Tools Workshop, Madison, WI, USA, August 4, 2014

SLIDE 2

Late sender

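[Figure: time-line of processes A and B with a Send and the matching Recv; the interval the receiver spends blocked is marked as waiting time.]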
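As a point of comparison for the rest of the talk, the trace-based view of a late sender can be sketched in a few lines of Python. This is only an illustration: the function name and the timestamps are hypothetical, and it assumes both enter times are available from a trace.

```python
def late_sender_wait(recv_enter: float, send_enter: float) -> float:
    """Time a receiver spends blocked because the matching Send starts later.

    Both arguments are hypothetical trace timestamps in seconds.
    """
    # If the Send started first, the receiver does not wait on the sender.
    return max(0.0, send_enter - recv_enter)


# Receiver enters Recv at t=1.0 s, sender enters Send at t=3.5 s.
print(late_sender_wait(recv_enter=1.0, send_enter=3.5))  # 2.5
```

Computing this requires matching each Recv with its Send, which is the information a trace provides and a plain profile does not.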

Daniel Lorenz et al., Petascale Workshop, August 4, 2014 2

SLIDE 3

Wait at NxN

[Figure: time-line of processes A, B, and C, each calling Allgather at a different time; the processes that arrive early show waiting time.]
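A minimal sketch of the same pattern in Python, assuming the enter time of every process into the collective is known (the process names and timestamps are made up for illustration):

```python
def wait_at_nxn(enter_times: dict) -> dict:
    """Waiting time per process in an N-to-N collective such as Allgather.

    Each process waits until the last participant has entered the call.
    """
    last_arrival = max(enter_times.values())
    return {proc: last_arrival - t for proc, t in enter_times.items()}


# Hypothetical enter times in seconds for three processes.
waits = wait_at_nxn({"A": 1.0, "B": 2.0, "C": 4.0})
print(waits)  # {'A': 3.0, 'B': 2.0, 'C': 0.0}
```

The last process to arrive has no waiting time, which matches the two waiting intervals in a three-process picture.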

SLIDE 4

What we want to know

[Figure: four calls, each with its processing time; two of them also contain wait time.]

SLIDE 5

What we measure

[Figure: the same four calls, each shown only as a single execution time.]

SLIDE 6

The minimum idea

[Figure: the four execution times, with the shortest one marked as the minimal execution time.]

SLIDE 7

The minimum idea

[Figure: the minimal execution time is used as the estimated processing time for all four calls; whatever exceeds it is the estimated wait time, shown next to the actual processing and wait times.]
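The idea can be written down in a few lines. The sketch below assumes a list of measured execution times for comparable calls; the function name and the numbers are illustrative only.

```python
def estimate_wait_times(execution_times: list) -> list:
    """Treat the shortest observed execution as pure processing time.

    Everything above that minimum is attributed to waiting.
    """
    estimated_processing = min(execution_times)
    return [t - estimated_processing for t in execution_times]


# Hypothetical execution times in seconds for four calls.
print(estimate_wait_times([4.0, 1.0, 1.5, 3.0]))  # [3.0, 0.0, 0.5, 2.0]
```

The estimate is only as good as the minimum: it is exact when at least one call had no wait time at all.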

SLIDE 8

Considered parameters

We consider
  MPI function
  Message size
  Receiver rank
Other possible parameters
  Sender rank
  Data type
Tradeoff between
  Number of samples needed for a meaningful minimum, and the amount of data
  Number of parameters considered
Need to find the relevant parameters.
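One way to picture the tradeoff is through the key under which a minimum is stored. The sketch below is an assumption, not the tool's data structure: the names `MinKey`, `size_class`, and `make_key` are hypothetical, and the power-of-two bucketing of message sizes is just one plausible way to keep the number of keys small.

```python
from typing import NamedTuple


class MinKey(NamedTuple):
    """Hypothetical key for one stored minimum."""
    mpi_function: str
    size_class: int
    receiver_rank: int


def size_class(num_bytes: int) -> int:
    """Assumed bucketing: sizes with the same bit length share a class."""
    return num_bytes.bit_length()


def make_key(mpi_function: str, num_bytes: int, receiver_rank: int) -> MinKey:
    return MinKey(mpi_function, size_class(num_bytes), receiver_rank)


# 1024 and 1500 bytes share a class, so their samples feed the same minimum.
print(make_key("MPI_Recv", 1024, 3) == make_key("MPI_Recv", 1500, 3))  # True
# 2048 bytes falls into the next class and gets its own minimum.
print(make_key("MPI_Recv", 2048, 3) == make_key("MPI_Recv", 1500, 3))  # False
```

Adding a field such as the sender rank makes each minimum more specific but leaves fewer samples per key.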

SLIDE 9

Algorithm

For every combination of MPI function, message size class, and process, record the minimum execution time.

For every combination of MPI call path and message size class, record the number of visits and the total execution time.

At the end of the profiling run, subtract the minimum from the execution time for every visit to calculate the wait time.
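The steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual implementation: the class name `MinimumProfiler` is hypothetical, one instance stands for one process (so the process part of the key is implicit), and a call path is assumed to be a tuple of names whose last element is the MPI function.

```python
from collections import defaultdict


class MinimumProfiler:
    """Per-process sketch of the minimum method (hypothetical names)."""

    def __init__(self):
        self.minimum = {}                # (function, size class) -> min time
        self.visits = defaultdict(int)   # (call path, size class) -> count
        self.total = defaultdict(float)  # (call path, size class) -> sum

    def record(self, call_path, size_class, exec_time):
        function = call_path[-1]  # assumed: last entry is the MPI function
        mkey = (function, size_class)
        self.minimum[mkey] = min(exec_time, self.minimum.get(mkey, exec_time))
        ckey = (call_path, size_class)
        self.visits[ckey] += 1
        self.total[ckey] += exec_time

    def wait_times(self):
        """Subtracting the minimum once per visit equals total - visits * min."""
        result = {}
        for (call_path, size_class), total in self.total.items():
            minimum = self.minimum[(call_path[-1], size_class)]
            visits = self.visits[(call_path, size_class)]
            result[(call_path, size_class)] = total - visits * minimum
        return result


profiler = MinimumProfiler()
x_path = ("main", "x_solve", "MPI_Recv")
y_path = ("main", "y_solve", "MPI_Recv")
profiler.record(x_path, 10, 2.0)
profiler.record(x_path, 10, 5.0)
profiler.record(y_path, 10, 3.0)
print(profiler.wait_times())
# {(x_path, 10): 3.0, (y_path, 10): 1.0}
```

Only three numbers per key are kept during the run, so the per-visit subtraction can be deferred to the end without storing individual events.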

SLIDE 10

[Figure: per-call overhead increase compared to profiling overhead without wait-state analysis, in percent.]

SLIDE 11

Accuracy MPI_Recv

[Figure: wait ratio for MPI_Recv as reported by Scalasca and by the minimum method, on JUQUEEN and JUROPA.]

SLIDE 12

Accuracy MPI_Wait

[Figure: wait ratio for MPI_Wait as reported by Scalasca and by the minimum method, on JUQUEEN and JUROPA. Call sites shown: make_id_list, ks_congrad, path_product, u_shift_fermion, comm_embed, rev_comm_rho, rev_commnct, tf_ad_splitting, parallel, rsl_lite_exch_y, rsl_lite_exch_x, x_solve, y_solve, z_solve, resid, rprj3, psinv, interp, rhs, x_solve, y_solve, z_solve, resid, rprj3, psinv, iterp.]

SLIDE 14

Accuracy MPI_Waitall

[Figure: wait ratio for MPI_Waitall as reported by Scalasca and by the minimum method, on JUQUEEN and JUROPA. Call sites shown: bndry_3d; bndry_2d; solvers,pcg; commnct; copy_faces; copy_faces; x_solve; y_solve; z_solve; copy_faces; copy_faces; x_solve; y_solve; z_solve.]

SLIDE 15

Non-blocking communication

[Figure: time-line of processes A and B showing Isend, Wait, and Isend calls.]

SLIDE 16

Scalasca detects no wait state

[Figure: time-line of processes A and B showing Isend, Wait, and Isend calls.]

SLIDE 17

Minimum approach does calculate wait states

[Figure: time-line of processes A and B showing Isend, Wait, and Isend calls.]

SLIDE 18

But is this wrong for performance analysis?

[Figure: time-line of processes A and B showing Isend, Wait, and Isend calls, annotated "Latency = Possible overlap time".]

SLIDE 19

Detailed example from SP

SLIDE 20

Wait time according to Scalasca

SLIDE 21

Wait time according to minimum method

SLIDE 22

Jitter may cause a slightly higher wait time

[Figure: estimated processing time and estimated wait time compared with the actual processing and wait times of four calls.]

SLIDE 23

Accuracy MPI_Waitall

[Figure: wait ratio for MPI_Waitall as reported by Scalasca and by the minimum method, on JUQUEEN and JUROPA. Call sites shown: bndry_3d; bndry_2d; solvers,pcg; commnct; copy_faces; copy_faces; x_solve; y_solve; z_solve; copy_faces; copy_faces; x_solve; y_solve; z_solve.]

SLIDE 24

Static imbalance

[Figure: four calls that all contain wait time; the estimated processing time therefore includes some wait time, and the estimated wait time is too small.]

SLIDE 25

Static imbalance

Calculating global minima could resolve process-local static imbalances
  Reduction operation after measurement
  No dilation at measurement time
  Lose sender/receiver parameterization of minima
For collective operations, global minima were better
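A small sketch of why a global minimum helps, assuming each process has already collected its local minima in a dictionary. The function name and the numbers are hypothetical; in a real MPI run the merge would be a minimum reduction across processes after measurement (for example with `MPI_MIN`), here it is simulated with plain lists.

```python
def global_minima(per_process_minima: list) -> dict:
    """Merge local minima by taking the smallest value per key."""
    merged = {}
    for minima in per_process_minima:
        for key, value in minima.items():
            merged[key] = min(value, merged.get(key, value))
    return merged


key = ("MPI_Allgather", 10)  # (function, size class), illustrative only
local = [
    {key: 1.0},  # rank 0 always arrives last and never waits
    {key: 4.0},  # rank 1 always arrives early and waits in every call
]
merged = global_minima(local)

# Rank 1 observes 4.0 s for a call. Its local minimum hides the imbalance:
print(4.0 - local[1][key])  # 0.0
# The global minimum exposes it:
print(4.0 - merged[key])    # 3.0
```

Because the merge runs once after the measurement, it adds no dilation while the application is being profiled.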

SLIDE 26

Accuracy for Wait at NxN

[Figure: wait ratio for Wait at NxN as reported by Scalasca and by the minimum method, on JUROPA and JUQUEEN. Call sites shown: get_max_recvs; solvers,pcg; tf_controle; glbl_int_sum; EP; trnspse_x_yz; glbl_int_sum.]

SLIDE 27

Conclusion (1)

The minimum method works for estimating wait time in blocking and non-blocking communication
  For blocking communication, results are similar to Scalasca
  For non-blocking communication, wait times in Waitall do not match the Scalasca analysis
Low runtime overhead
  No trace recording or piggybacking
May not produce 100% accurate numbers, but
  Sufficient accuracy to locate performance problems
  Points to places where we might want to investigate further with trace analysis

SLIDE 28

Conclusion (2)

Detecting a good minimum is crucial
  Static imbalance
  Tradeoff between number of parameters and number of samples
  Jitter may lead to a minor increase in the measured wait time
For non-blocking communication
  Counts possible overlap time
  Might be larger than pure Late Sender time
  Isn’t this even more accurate for estimating the optimization potential?

SLIDE 29

Reference

Guoyong Mao, David Böhme, Marc-André Hermanns, Markus Geimer, Daniel Lorenz, Felix Wolf: Catching Idlers with Ease: A Lightweight Wait-State Profiler for MPI Programs. In: EuroMPI ’14: Proc. of the 21st European MPI Users’ Group Meeting, Kyoto, Japan, Sep. 9-12, 2014