SLIDE 1

Catching Idlers with Ease: A Lightweight Wait-State Profiler for MPI Programs

Guoyong Mao, David Böhme, Markus Geimer, Marc-André Hermanns, Daniel Lorenz and Felix Wolf Petascale Tools Workshop, Madison, WI, USA, August 4, 2014

SLIDE 2

Late sender

[Figure: timeline (axes: time, processes) of processes A and B with a Send and a Recv; the waiting time is marked.]

Daniel Lorenz et al., Petascale Workshop, August 4, 2014

SLIDE 3

Wait at NxN

[Figure: timeline (axes: time, processes) of processes A, B, and C, each calling Allgather; two waiting times are marked.]

SLIDE 4

What we want to know

[Figure: four bars labeled "Processing time"; two "Wait time" segments are marked.]

SLIDE 5

What we measure

[Figure: four bars labeled "Execution time".]

SLIDE 6

The minimum idea

[Figure: four bars labeled "Execution time"; the "Minimal execution time" is marked.]

SLIDE 7

The minimum idea

[Figure: four bars labeled "Processing time" with two "Wait time" segments, annotated with the "Estimated processing time" and the "Estimated wait time".]
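The idea on slides 6 and 7 can be written compactly. The symbols below are notation introduced here for illustration and do not appear on the slides. For one group of comparable MPI calls with measured execution times t_1, ..., t_n, the smallest observed time serves as the processing-time estimate, and everything above it is counted as waiting:

```latex
\hat{p} = \min_{1 \le i \le n} t_i,
\qquad
\hat{w}_i = t_i - \hat{p},
\qquad
\hat{W} = \sum_{i=1}^{n} \hat{w}_i = \sum_{i=1}^{n} t_i - n\,\hat{p}
```

Here \hat{p} is the estimated processing time, \hat{w}_i the estimated wait time of the i-th call, and \hat{W} the total estimated wait time of the group. The estimate is only as good as the minimum: if every call in the group contains some waiting, \hat{p} is too large and \hat{W} is too small.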

SLIDE 8

Considered parameters

  • We consider
      • MPI function
      • Message size
      • Receiver rank
  • Other possible parameters
      • Sender rank
      • Data type
  • Tradeoff between
      • Number of samples for a meaningful minimum and amount of data
      • Number of parameters considered
  • Need to find the relevant parameters

SLIDE 9

Algorithm

For every combination of

  • MPI function
  • Message size class
  • Process

record the

  • Minimum execution time

For every combination of MPI call path and message size class record the

  • Number of visits
  • Total execution time

At the end of the profiling run, subtract the minimum from the execution time for every visit to calculate the wait time.
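The steps above can be sketched as follows. This is a minimal single-process illustration, not the authors' implementation: the class name `MinimumProfiler`, the power-of-two `size_class` binning, and the convention that a call path is a tuple ending in the MPI function name are all assumptions made for the example. The "process" dimension of the minimum is implicit because each process would hold its own profiler instance.

```python
import math
from collections import defaultdict


def size_class(nbytes: int) -> int:
    """Hypothetical binning: class k holds sizes in [2**(k-1), 2**k); 0 bytes is class 0."""
    return nbytes.bit_length()


class MinimumProfiler:
    """Per-process sketch of the minimum method (all names are illustrative)."""

    def __init__(self):
        # (MPI function, size class) -> minimum execution time seen on this process
        self.min_time = defaultdict(lambda: math.inf)
        # (call path, size class) -> [number of visits, total execution time]
        self.stats = defaultdict(lambda: [0, 0.0])

    def record(self, call_path, nbytes, exec_time):
        """Called once per MPI call; call_path is a tuple ending in the MPI function name."""
        cls = size_class(nbytes)
        key = (call_path[-1], cls)
        self.min_time[key] = min(self.min_time[key], exec_time)
        entry = self.stats[(call_path, cls)]
        entry[0] += 1
        entry[1] += exec_time

    def wait_times(self):
        """At the end of the run: total time minus (visits * minimum) per call path."""
        result = {}
        for (call_path, cls), (visits, total) in self.stats.items():
            minimum = self.min_time[(call_path[-1], cls)]
            result[(call_path, cls)] = total - visits * minimum
        return result


if __name__ == "__main__":
    profiler = MinimumProfiler()
    profiler.record(("main", "solve", "MPI_Recv"), 1024, 2.0)
    profiler.record(("main", "solve", "MPI_Recv"), 1024, 5.0)
    profiler.record(("main", "exchange", "MPI_Recv"), 1500, 3.0)
    for key, wait in profiler.wait_times().items():
        print(key, wait)
```

Subtracting `visits * minimum` from the total is equivalent to subtracting the minimum from every visit, which is why only a count and a sum need to be kept per call path while the program runs.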

SLIDE 10

Per-call overhead increase compared to profiling overhead without wait-state analysis (%)

[Figure: bar chart; axis ticks from 50 to 250.]

SLIDE 11

Accuracy MPI_Recv

[Figure: bar chart of the wait ratio (axis ticks 0.02 to 0.18) comparing Scalasca with the minimum method on JUQUEEN and JUROPA.]

SLIDE 12

Accuracy MPI_Wait

[Figure: bar chart of the wait ratio (axis ticks 0.02 to 0.16) comparing Scalasca with the minimum method on JUQUEEN and JUROPA. Functions on the horizontal axis: make_id_list, ks_congrad, path_product, u_shift_fermion, comm_embed, rev_comm_rho, rev_commnct, tf_ad_splitting, parallel, rsl_lite_exch_y, rsl_lite_exch_x, x_solve, y_solve, z_solve, resid, rprj3, psinv, interp, rhs, x_solve, y_solve, z_solve, resid, rprj3, psinv, iterp.]

SLIDE 14

Accuracy MPI_Waitall

[Figure: bar chart of the wait ratio (axis ticks 0.02 to 0.14) comparing Scalasca with the minimum method on JUQUEEN and JUROPA. Functions on the horizontal axis: bndry_3d, bndry_2d, solvers,pcg, commnct, copy_faces, copy_faces, x_solve, y_solve, z_solve, copy_faces, copy_faces, x_solve, y_solve, z_solve.]

SLIDE 15

Non-blocking communication

[Figure: timeline (axes: time, processes) of processes A and B showing Isend, Wait, and Isend operations.]

SLIDE 16

Scalasca detects no wait state

[Figure: timeline (axes: time, processes) of processes A and B showing Isend, Wait, and Isend operations.]

SLIDE 17

Minimum approach does calculate wait states

[Figure: timeline (axes: time, processes) of processes A and B showing Isend, Wait, and Isend operations.]

SLIDE 18

But is this wrong for performance analysis?

[Figure: timeline (axes: time, processes) of processes A and B showing Isend, Wait, and Isend operations, annotated "Latency = Possible overlap time".]

SLIDE 19

Detailed example from SP

SLIDE 20

Wait time according to Scalasca

SLIDE 21

Wait time according to minimum method

SLIDE 22

Jitter may cause a slightly higher wait time

[Figure: four bars labeled "Processing time" with two "Wait time" segments, annotated with the "Estimated processing time" and the "Estimated wait time".]

SLIDE 23

Accuracy MPI_Waitall

[Figure: same bar chart as on slide 14 (wait ratio for MPI_Waitall, Scalasca compared with the minimum method, on JUQUEEN and JUROPA).]

SLIDE 24

Static imbalance

[Figure: four bars labeled "Processing time", each with a "Wait time" segment, annotated with the "Estimated processing time" and the note "Estimated wait time too small".]

SLIDE 25

Static imbalance

  • Calculating global minima could resolve process-local static imbalances
  • Reduction operation after measurement
  • No dilation at measurement time
  • Lose the sender/receiver parameterization of the minima
  • For collective operations, global minima were better
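The effect of a global minimum can be illustrated with a small sketch. It is an assumption-laden illustration, not the tool's code: the per-process tables are plain dictionaries, the function names are invented for the example, and the merge loop stands in for the post-measurement reduction, which in a real MPI tool would presumably be a minimum reduction across processes (for example `MPI_Allreduce` with `MPI_MIN`).

```python
import math


def global_minima(per_process_minima):
    """Element-wise minimum over per-process tables; stands in for a min-reduction."""
    merged = {}
    for table in per_process_minima:
        for key, value in table.items():
            merged[key] = min(merged.get(key, math.inf), value)
    return merged


def estimated_wait(visits, total_time, minimum):
    """Wait estimate of the minimum method: total time minus visits times minimum."""
    return total_time - visits * minimum


if __name__ == "__main__":
    key = ("MPI_Allgather", 11)  # (MPI function, hypothetical message size class)
    # Rank 1 always arrives early, so even its fastest call contains waiting time.
    local = [{key: 1.0}, {key: 4.0}]
    merged = global_minima(local)
    # Rank 1 made 10 calls that took 45.0 time units in total.
    print("local estimate :", estimated_wait(10, 45.0, local[1][key]))
    print("global estimate:", estimated_wait(10, 45.0, merged[key]))
```

With only its local minimum, rank 1 attributes most of its waiting to processing. With the merged minimum, the statically imbalanced part shows up as wait time. Because the merge happens after the run, it adds nothing to the cost of each measured call.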

SLIDE 26

Accuracy for Wait at NxN

[Figure: bar chart of the wait ratio (axis ticks 0.05 to 0.35) comparing Scalasca with the minimum method on JUROPA and JUQUEEN. Functions on the horizontal axis: get_max_recvs, solvers,pcg, tf_controle, glbl_int_sum, EP, trnspse_x_yz, glbl_int_sum.]

SLIDE 27

Conclusion (1)

  • Minimum method works for the estimation of blocking and non-blocking communication
      • For blocking communication, results are similar to Scalasca
      • For non-blocking communication, the wait times in Waitall do not match the Scalasca analysis
  • Low runtime overhead
      • No trace recording or piggybacking
  • May not produce 100% accurate numbers, but
      • Sufficient accuracy to locate performance problems
      • Points to places where we might want to investigate further with trace analysis

SLIDE 28

Conclusion (2)

  • Detection of a good minimum is crucial
      • Static imbalance
      • Tradeoff between number of parameters and number of samples
      • Jitter may lead to a minor increase of the measured wait state
  • For non-blocking communication
      • Count possible overlap time
      • Might be larger than pure Late Sender time
      • Isn't this an even more accurate estimate of the optimization potential?

SLIDE 29

Reference

Guoyong Mao, David Böhme, Marc-André Hermanns, Markus Geimer, Daniel Lorenz, Felix Wolf: Catching Idlers with Ease: A Lightweight Wait-State Profiler for MPI Programs. In: EuroMPI '14: Proc. of the 21st European MPI Users' Group Meeting, Kyoto, Japan, Sep. 9-12, 2014
