Catching Idlers with Ease: A Lightweight Wait-State Profiler for MPI Programs
Guoyong Mao, David Böhme, Markus Geimer, Marc-André Hermanns, Daniel Lorenz and Felix Wolf
Petascale Tools Workshop, Madison, WI, USA, August 4, 2014
Late sender
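[Figure: timeline of processes A and B with a Send and a Recv; the interval the receiver spends blocked before the message is sent is marked as waiting time.]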
Daniel Lorenz et al., Petascale Workshop, August 4, 2014 2
Wait at NxN
[Figure: timeline of processes A, B and C, each calling Allgather; two intervals are marked as waiting time.]
What we want to know
[Figure: four calls, each with its processing time; two of them also contain wait time.]
What we measure
[Figure: the same four calls, each labeled only with its total execution time.]
The minimum idea
[Figure: four execution times, with the shortest one marked as the minimal execution time.]
The minimum idea
[Figure: four calls with their processing and wait times, alongside the estimated processing time and the estimated wait time.]
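The minimum idea can be sketched in a few lines. This is only an illustration under simple assumptions: the function name and the sample timings are hypothetical, and all samples are assumed to belong to the same kind of call.

```python
def estimate_wait_times(execution_times):
    """Treat the smallest observed execution time as the estimated
    processing time and the remainder of each call as estimated wait time."""
    estimated_processing = min(execution_times)
    return [t - estimated_processing for t in execution_times]


# Hypothetical execution times (in seconds) of four calls to one MPI function.
measured = [5.0, 2.0, 2.0, 7.0]
waits = estimate_wait_times(measured)
print(waits)
```

Calls that take exactly the minimum get an estimated wait time of zero; everything above the minimum is attributed to waiting.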
Considered parameters
- We consider
  - MPI function
  - Message size
  - Receiver rank
- Other possible parameters
  - Sender rank
  - Data type
- Tradeoff between
  - Number of samples for a meaningful minimum and amount of data
  - Number of parameters considered
- Need to find the relevant parameters.
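The tradeoff in the list above can be illustrated with a small sketch. The call records and the helper `samples_per_key` are hypothetical; the point is only that each additional parameter splits the samples across more keys, leaving fewer samples per minimum.

```python
from collections import Counter

# Hypothetical call records:
# (MPI function, message size in bytes, receiver rank, sender rank)
calls = [
    ("MPI_Recv", 1024, 0, 1),
    ("MPI_Recv", 1024, 0, 2),
    ("MPI_Recv", 1024, 0, 3),
    ("MPI_Recv", 1024, 1, 0),
]


def samples_per_key(records, key_fields):
    """Count the samples that fall under each key when only the
    selected field positions are used as parameters."""
    return Counter(tuple(r[i] for i in key_fields) for r in records)


coarse = samples_per_key(calls, (0, 1, 2))   # function, size, receiver
fine = samples_per_key(calls, (0, 1, 2, 3))  # additionally the sender rank
print(coarse)
print(fine)
```

With the sender rank added, every key in this example is backed by a single sample, so its minimum is just that sample.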
Algorithm
For every combination of
- MPI function
- Message size class
- Process
record the
- Minimum execution time
For every combination of MPI call path and message size class record the
- Number of visits
- Total execution time
At the end of the profiling run, subtract the minimum from the execution time for every visit to calculate the wait time.
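The bookkeeping described on this slide could look roughly like the sketch below. It is not the tool's implementation: the class `WaitStateProfile`, the bit-length bucketing in `size_class`, and the representation of a call path as a tuple ending in the MPI function name are all assumptions made for illustration.

```python
from collections import defaultdict


def size_class(num_bytes):
    """Assumed bucketing: sizes with the same bit length share a class
    (for example, 1024 to 2047 bytes all map to class 11)."""
    return num_bytes.bit_length()


class WaitStateProfile:
    """Profile data of a single process (hypothetical structure)."""

    def __init__(self, process):
        self.process = process
        # (MPI function, size class, process) -> minimum execution time
        self.minimum = {}
        # (call path, size class) -> [number of visits, total execution time]
        self.stats = defaultdict(lambda: [0, 0.0])

    def record(self, call_path, num_bytes, exec_time):
        function = call_path[-1]  # assumed: last frame is the MPI function
        cls = size_class(num_bytes)
        key = (function, cls, self.process)
        self.minimum[key] = min(exec_time, self.minimum.get(key, exec_time))
        entry = self.stats[(call_path, cls)]
        entry[0] += 1
        entry[1] += exec_time

    def wait_times(self):
        """Subtracting the minimum once per visit is the same as
        total execution time minus visits times minimum."""
        result = {}
        for (call_path, cls), (visits, total) in self.stats.items():
            minimum = self.minimum[(call_path[-1], cls, self.process)]
            result[(call_path, cls)] = total - visits * minimum
        return result


profile = WaitStateProfile(process=0)
solver = ("main", "solve", "MPI_Recv")
setup = ("main", "setup", "MPI_Recv")
profile.record(solver, 1024, 4.0)
profile.record(solver, 1024, 2.0)
profile.record(setup, 1500, 1.0)
waits = profile.wait_times()
print(waits)
```

Because the minimum is keyed by MPI function rather than by call path, the fast call from `setup` also lowers the baseline used for the calls from `solve`.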
Per-call overhead increase compared to profiling overhead w/o wait state analysis (%)
[Figure: chart of the per-call overhead increase (%); axis ticks from 50 to 250.]
Accuracy MPI_Recv
[Figure: wait ratio of MPI_Recv as reported by Scalasca and by the minimum method, on JUQUEEN and JUROPA; axis ticks from 0.02 to 0.18.]
Accuracy MPI_Wait
[Figure: wait ratio of MPI_Wait per call site as reported by Scalasca and by the minimum method, on JUQUEEN and JUROPA; axis ticks from 0.02 to 0.16. Call sites, in plot order: make_id_list, ks_congrad, path_product, u_shift_fermion, comm_embed, rev_comm_rho, rev_commnct, tf_ad_splitting, parallel, rsl_lite_exch_y, rsl_lite_exch_x, x_solve, y_solve, z_solve, resid, rprj3, psinv, interp, rhs, x_solve, y_solve, z_solve, resid, rprj3, psinv, interp.]
Accuracy MPI_Waitall
[Figure: wait ratio of MPI_Waitall per call site as reported by Scalasca and by the minimum method, on JUQUEEN and JUROPA; axis ticks from 0.02 to 0.14. Call sites, in plot order: bndry_3d, bndry_2d, solvers,pcg, commnct, copy_faces, copy_faces, x_solve, y_solve, z_solve, copy_faces, copy_faces, x_solve, y_solve, z_solve.]
Non-blocking communication
[Figure: timeline of processes A and B with Isend and Wait calls.]
Scalasca detects no wait state
[Figure: timeline of processes A and B with Isend and Wait calls.]
Minimum approach does calculate wait states
[Figure: timeline of processes A and B with Isend and Wait calls.]
But is this wrong for performance analysis?
[Figure: timeline of processes A and B with Isend and Wait calls, annotated "Latency = Possible overlap time".]
Detailed example from SP
Wait time according to Scalasca
Wait time according to minimum method
Jitter may cause a slightly higher wait time
[Figure: four calls with their processing and wait times, alongside the estimated processing time and the estimated wait time.]
Accuracy MPI_Waitall
[Figure: the MPI_Waitall accuracy chart shown earlier, repeated.]
Static imbalance
[Figure: processing time and wait time of four calls, with the estimated processing time and an estimated wait time labeled "too small".]
Static imbalance
- Calculating global minima could resolve process-local static imbalances
  - Reduction operation after measurement
  - No dilation at measurement time
  - Lose sender/receiver parameterization of minima
- For collective operations, global minima were better
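A post-measurement reduction to global minima might look like the sketch below. The per-process tables are simulated as plain dictionaries and the function name is hypothetical; in an MPI tool the same step could be a minimum reduction across ranks (for instance `MPI_Allreduce` with `MPI_MIN`), which is an assumption here rather than something the slides state.

```python
def reduce_global_minima(per_process_minima):
    """Merge per-process tables keyed by (function, size class, process)
    into one table keyed by (function, size class), keeping the smallest
    value. The process part of the key is dropped in the result."""
    global_min = {}
    for table in per_process_minima:
        for (function, cls, _process), value in table.items():
            key = (function, cls)
            global_min[key] = min(value, global_min.get(key, value))
    return global_min


# Hypothetical static imbalance: rank 1 waits in every call,
# so its local minimum already contains wait time.
local_tables = [
    {("MPI_Recv", 11, 0): 1.0},
    {("MPI_Recv", 11, 1): 3.0},
]
merged = reduce_global_minima(local_tables)
print(merged)
```

Since the merge runs after measurement, it adds no overhead while the application is being profiled, but any per-process distinction in the minima is gone afterwards.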
Accuracy for Wait at NxN
[Figure: wait ratio for Wait at NxN per call site as reported by Scalasca and by the minimum method, on JUROPA and JUQUEEN; axis ticks from 0.05 to 0.35. Call sites, in plot order: get_max_recvs, solvers,pcg, tf_controle, glbl_int_sum, EP, trnspse_x_yz, glbl_int_sum.]
Conclusion (1)
- Minimum method works for estimating wait time in blocking and non-blocking communication
  - For blocking communication, results are similar to Scalasca
  - For non-blocking communication, wait times in Waitall do not match the Scalasca analysis
- Low runtime overhead
  - No trace recording or piggybacking
- May not produce 100% accurate numbers, but
  - Sufficient accuracy to locate performance problems
  - Points to places where we might want to investigate further with trace analysis
Conclusion (2)
- Detection of a good minimum is crucial
  - Static imbalance
  - Tradeoff between number of parameters and number of samples
- Jitter may lead to a minor increase in the measured wait time
- For non-blocking communication
  - Count possible overlap time
  - Might be larger than pure Late Sender time
  - Isn’t this an even more accurate estimate of the optimization potential?
Reference
Guoyong Mao, David Böhme, Marc-André Hermanns, Markus Geimer, Daniel Lorenz, Felix Wolf: Catching Idlers with Ease: A Lightweight Wait-State Profiler for MPI Programs. In: EuroMPI ’14: Proc. of the 21st European MPI Users’ Group Meeting, Kyoto, Japan, Sep. 9-12, 2014