Scientific Benchmarking of Parallel Computing Systems: Paper Reading



SLIDE 1

Scientific Benchmarking of Parallel Computing Systems

Paper Reading Group Torsten Hoefler Roberto Belli Presents: Maksym Planeta 21.12.2015

SLIDE 2

Table of Contents

Introduction
State of the practice
The rules
  Use speedup with Care
  Do not cherry-pick
  Summarize data with Care
  Report variability of measurements
  Report distribution of measurements
  Compare data with Care
  Choose percentiles with Care
  Design interpretable measurements
  Use performance modeling
  Graph the results
Conclusion

SLIDE 3

Table of Contents

Introduction
State of the practice
The rules
  Use speedup with Care
  Do not cherry-pick
  Summarize data with Care
  Report variability of measurements
  Report distribution of measurements
  Compare data with Care
  Choose percentiles with Care
  Design interpretable measurements
  Use performance modeling
  Graph the results
Conclusion

SLIDE 4

Reproducibility

◮ machines are unique
◮ machines age quickly
◮ relevant configuration is volatile

SLIDE 5

Interpretability

◮ Weaker than reproducibility
◮ Describes an experiment in an understandable way
◮ Allows readers to draw their own conclusions and generalize the results

SLIDE 6

Frequently wrongly answered questions

◮ How many iterations do I have to run per measurement?
◮ How many measurements should I run?
◮ Once I have all data, how do I summarize it into a single number?
◮ How do I measure time in a parallel system?

SLIDE 7

Performance report

High-Performance Linpack (HPL)

run on 64 nodes (N=314k) of the Piz Daint system during normal operation achieved 77.38 Tflop/s.

SLIDE 8

Performance report

High-Performance Linpack (HPL)

run on 64 nodes (N=314k) of the Piz Daint system during normal operation achieved 77.38 Tflop/s. Theoretical peak is 94.5 Tflop/s... the benchmark achieves 81.8% of peak performance
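The percent-of-peak figure follows directly from the two numbers above:

```python
achieved = 77.38  # Tflop/s, HPL result reported on the slide
peak = 94.5       # Tflop/s, theoretical peak of the 64 nodes

percent_of_peak = 100.0 * achieved / peak
print(f"{percent_of_peak:.2f}% of peak")  # 81.88%, which the slide rounds to 81.8%
```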

SLIDE 9

Performance report

High-Performance Linpack (HPL)

run on 64 nodes (N=314k) of the Piz Daint system during normal operation achieved 77.38 Tflop/s. Theoretical peak is 94.5 Tflop/s... the benchmark achieves 81.8% of peak performance

Problems

  • 1. What was the influence of OS noise?
  • 2. How typical is this run?
  • 3. How does it compare to other systems?
SLIDE 10

It’s worth a thousand words

[Figure 1: Distribution of completion times for 50 HPL runs. A density plot over completion times (280 to 340 s) marks the min, median, arithmetic mean, 95% quantile, and max, plus the 99% CI of the median; the corresponding rates range from 61.23 to 77.38 Tflop/s.]

SLIDE 11

Table of Contents

Introduction
State of the practice
The rules
  Use speedup with Care
  Do not cherry-pick
  Summarize data with Care
  Report variability of measurements
  Report distribution of measurements
  Compare data with Care
  Choose percentiles with Care
  Design interpretable measurements
  Use performance modeling
  Graph the results
Conclusion

SLIDE 12

The survey

◮ Pick papers from SC, PPoPP, and HPDC
◮ Evaluate how results are reported, along several aspects
◮ Categorize each aspect as covered, not applicable, or missed

SLIDE 13

Experiment report

Experimental design

  • 1. Hardware
    1.1 Processor Model / Accelerator (79/95)
    1.2 RAM Size / Type / Bus Infos (26/95)
    1.3 NIC Model / Network Infos (60/95)
  • 2. Software
    2.1 Compiler Version / Flags (35/95)
    2.2 Kernel / Libraries Version (20/95)
    2.3 Filesystem / Storage (12/95)
  • 3. Configuration
    3.1 Software and Input (48/95)
    3.2 Measurement Setup (50/95)
    3.3 Code Available Online (7/95)

Data Analysis

  • 1. Results
SLIDE 14

Experiment report

Experimental design

  • 1. Hardware
  • 2. Software
  • 3. Configuration

Data Analysis

  • 1. Results
    1.1 Mean (51/95)
    1.2 Best / Worst Performance (13/95)
    1.3 Rank Based Statistics (9/95)
    1.4 Measure of Variation (17/95)

SLIDE 15

Outcome

◮ Benchmarking is important
◮ Study of 120 papers from three conferences (25 were not applicable)
◮ Benchmarking is usually done wrong
◮ Advises researchers on how to do a better job

If supercomputing benchmarking and performance analysis is to be taken seriously, the community needs to agree on a common set of standards for measuring, reporting, and interpreting performance results.

SLIDE 16

Table of Contents

Introduction
State of the practice
The rules
  Use speedup with Care
  Do not cherry-pick
  Summarize data with Care
  Report variability of measurements
  Report distribution of measurements
  Compare data with Care
  Choose percentiles with Care
  Design interpretable measurements
  Use performance modeling
  Graph the results
Conclusion

SLIDE 17

Use speedup with Care

When publishing parallel speedup, report if the base case is a single parallel process or best serial execution, as well as the absolute execution performance of the base case.
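A tiny sketch of why the baseline matters (all timings and names below are hypothetical, not from the paper):

```python
# Hypothetical timings in seconds.
t_best_serial = 12.0   # best serial implementation
t_one_process = 15.0   # the parallel code run with a single process
t_parallel_32 = 0.5    # the parallel code on 32 processes

speedup_vs_serial = t_best_serial / t_parallel_32  # 24.0
speedup_vs_self = t_one_process / t_parallel_32    # 30.0

# The two baselines give different answers, so a report must state which
# baseline is used and its absolute performance (here, 12.0 s).
print(speedup_vs_serial, speedup_vs_self)
```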

SLIDE 18

because speedup may be ambiguous

◮ Is it against the best possible serial implementation?
◮ Or is it just the parallel implementation on a single processor?

SLIDE 19

because speedup may be misleading

◮ Higher on slow processors
◮ Lower on fast processors

SLIDE 20

because speedup may be misleading

◮ Higher on slow processors
◮ Lower on fast processors

Thus,

◮ Speedup on one computer can't be compared with speedup on another computer.
◮ Better to avoid speedup

SLIDE 21

Do not cherry-pick

Specify the reason for only reporting subsets of standard benchmarks or applications or not using all system resources.

SLIDE 22

Do not cherry-pick

Specify the reason for only reporting subsets of standard benchmarks or applications or not using all system resources.

◮ Use the whole node to utilize all available resources

SLIDE 23

Do not cherry-pick

Specify the reason for only reporting subsets of standard benchmarks or applications or not using all system resources.

◮ Use the whole node to utilize all available resources
◮ Use the whole benchmark/application, not only kernels

SLIDE 24

Summarize data with Care

Use the arithmetic mean only for summarizing costs. Use the harmonic mean for summarizing rates. Avoid summarizing ratios; summarize the costs or rates that the ratios are based on instead. Only if these are not available, use the geometric mean for summarizing ratios.

SLIDE 25

Mean

  • 1. If all measurements are weighted equally, use the arithmetic mean (absolute values):

    x̄ = (1/n) · Σᵢ₌₁ⁿ xᵢ

  • 2. If the denominator has the primary semantic meaning, use the harmonic mean (rates):

    x̄⁽ʰ⁾ = n / (Σᵢ₌₁ⁿ 1/xᵢ)

  • 3. Ratios may be summarized by using the geometric mean:

    x̄⁽ᵍ⁾ = (∏ᵢ₌₁ⁿ xᵢ)^(1/n)
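As a quick illustration, the three means can be computed in a few lines of Python (the sample values are made up):

```python
from math import prod
from statistics import fmean, harmonic_mean

xs = [2.0, 4.0, 8.0]  # illustrative measurements

arithmetic = fmean(xs)                 # for costs: (2 + 4 + 8) / 3
harmonic = harmonic_mean(xs)           # for rates: 3 / (1/2 + 1/4 + 1/8)
geometric = prod(xs) ** (1 / len(xs))  # for ratios: (2 * 4 * 8)^(1/3) ≈ 4

print(arithmetic, harmonic, geometric)
```

Note how the three means disagree even on this tiny sample, which is exactly why the rule asks authors to state which one they used.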

SLIDE 26

do not use geometric mean

the geometric mean has no simple interpretation and should thus be used with greatest care

SLIDE 27

do not use geometric mean

the geometric mean has no simple interpretation and should thus be used with greatest care It can be interpreted as a log-normalized average

SLIDE 28

and tell what you use

51 papers use summarizing...

SLIDE 29

and tell what you use

51 papers use summarizing... four of these specify the exact averaging method...

SLIDE 30

and tell what you use

51 papers use summarizing... four of these specify the exact averaging method... one paper correctly specifies the use of the harmonic mean...

SLIDE 31

and tell what you use

51 papers use summarizing... four of these specify the exact averaging method... one paper correctly specifies the use of the harmonic mean... two papers report that they use the geometric mean

SLIDE 32

and tell what you use

51 papers use summarizing... four of these specify the exact averaging method... one paper correctly specifies the use of the harmonic mean... two papers report that they use the geometric mean, both without a good reason.

SLIDE 33

Report variability of measurements

Report if the measurement values are deterministic. For nondeterministic data, report confidence intervals of the measurement.
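The rule does not prescribe a method; one common nonparametric choice is a percentile-bootstrap CI. A minimal sketch, with invented latency values (the function name and data are illustrative, not from the paper):

```python
import random
from statistics import median

def bootstrap_ci_median(samples, level=0.95, reps=2000, rng=None):
    """Percentile-bootstrap confidence interval for the median of `samples`."""
    rng = rng or random.Random(0)  # fixed seed only to keep the sketch reproducible
    meds = sorted(median(rng.choices(samples, k=len(samples)))
                  for _ in range(reps))
    lo = meds[int((1 - level) / 2 * reps)]
    hi = meds[int((1 + level) / 2 * reps) - 1]
    return lo, hi

# Hypothetical latency measurements (us).
data = [1.62, 1.58, 1.71, 1.66, 1.90, 1.60, 1.65, 1.73, 1.59, 1.68]
lo, hi = bootstrap_ci_median(data)
print(f"median = {median(data):.3f}, 95% CI = [{lo:.2f}, {hi:.2f}]")
```

Reporting the interval alongside the point estimate makes it visible how trustworthy the number is.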

SLIDE 34

Dangerous variations

Measurements may be very unpredictable on HPC systems. In fact, this problem is so severe that several large procurements specified upper bounds on performance variations as part of the vendor’s deliverables.

SLIDE 35

Report distribution of measurements

Do not assume normality of collected data (e.g., based on the number of samples) without diagnostic checking.

SLIDE 36

Q-Q plot
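As a numeric sketch of what a Q-Q plot checks: sample quantiles are compared against the quantiles of a fitted normal distribution, and systematic deviation from the line y = x argues against assuming normality. The sample data and function name below are invented for illustration:

```python
from statistics import NormalDist, fmean, stdev

def qq_points(samples):
    """Pairs of (theoretical, observed) quantiles for a normal Q-Q plot."""
    xs = sorted(samples)
    n = len(xs)
    ref = NormalDist(fmean(xs), stdev(xs))  # normal fitted to the sample
    # Plotting positions (i + 0.5) / n, a common textbook choice.
    return [(ref.inv_cdf((i + 0.5) / n), x) for i, x in enumerate(xs)]

# Illustrative sample; the heavy right tail (3.0) lands far above the line.
sample = [1.1, 1.2, 1.2, 1.3, 1.4, 1.4, 1.5, 1.6, 1.8, 3.0]
for theo, obs in qq_points(sample):
    print(f"{theo:6.2f}  {obs:5.2f}")
```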

SLIDE 37

Parametric measurements

                         Parametric         Non-parametric
Assumed distribution     Normal             Any
Assumed variance         Homogeneous        Any
Usual central measure    Mean               Any
Data set relationships   Independent        Any¹
Type of data             Interval or Ratio  Ordinal, Nominal, Interval, Ratio
Conclusion               More powerful      Conservative

¹ The paper says the opposite.

SLIDE 38

Compare data with Care

Compare nondeterministic data in a statistically sound way, e.g., using non-overlapping confidence intervals or ANOVA. None of the 95 analyzed papers compared medians in a statistically sound way.
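One sound procedure in the spirit of this rule: compute a CI for each median and claim a difference only when the intervals do not overlap. The sketch below uses a percentile bootstrap; the system names echo the deck's figures, but the latency samples are made up:

```python
import random
from statistics import median

def ci_median(samples, level=0.99, reps=2000, rng=None):
    """Percentile-bootstrap CI for the median (seeded for reproducibility)."""
    rng = rng or random.Random(1)
    meds = sorted(median(rng.choices(samples, k=len(samples)))
                  for _ in range(reps))
    return meds[int((1 - level) / 2 * reps)], meds[int((1 + level) / 2 * reps) - 1]

def medians_differ(a, b):
    """True only when the two 99% CIs do not overlap."""
    lo_a, hi_a = ci_median(a)
    lo_b, hi_b = ci_median(b)
    return hi_a < lo_b or hi_b < lo_a

# Hypothetical latencies (us) for two systems.
dora = [1.60, 1.62, 1.61, 1.63, 1.60, 1.64, 1.62, 1.61, 1.63, 1.62]
pilatus = [1.78, 1.80, 1.79, 1.82, 1.77, 1.81, 1.80, 1.79, 1.83, 1.78]
print(medians_differ(dora, pilatus))  # True: the intervals are far apart
```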

SLIDE 39

Mean vs. Median

[Figure 3: Significance of latency results on two systems, Piz Dora and Pilatus. Density plots of latency measurements (roughly 1.5 to 2.0) with the arithmetic mean, median, and 99% CIs of each; Piz Dora spans 1.57 to 7.2, Pilatus 1.48 to 11.59. The medians differ significantly even though many of the 1M measurements overlap.]

SLIDE 40

Choose percentiles with Care

Carefully investigate if measures of central tendency such as mean or median are useful to report. Some problems, such as worst-case latency, may require other percentiles.
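A small example of why the choice of percentile matters: with a 5% slow tail, the median and the 99th percentile tell very different stories. The values are hypothetical:

```python
from statistics import quantiles

# Hypothetical latencies (ms): 95 fast requests and a 5% slow tail.
latencies = [1.0] * 95 + [50.0] * 5

cuts = quantiles(latencies, n=100)  # 99 percentile cut points
p50, p99 = cuts[49], cuts[98]

print(p50, p99)  # 1.0 50.0 -- the median completely hides the tail
```

For a worst-case latency requirement, reporting only a central tendency here would be badly misleading.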

SLIDE 41

Piz Dora vs Pilatus

[Figure 4: Quantile regression comparison of the latencies, comparing Pilatus (base case or intercept) with Piz Dora. Top panel: Piz Dora (intercept), quantiles 0.1 to 0.9, latency 1.6 to 2.0. Bottom panel: Pilatus (difference to Piz Dora), quantiles 0.1 to 0.9, difference 0.0 to 0.2.]

SLIDE 42

Design interpretable measurements

Document all varying factors and their levels as well as the complete experimental setup (e. g., software, hardware, techniques) to facilitate reproducibility and provide interpretability.

SLIDE 43

Fix environments

  • 1. Fix environment parameters

    If controlling a certain parameter is not possible, then we suggest randomization following standard textbook procedures.

  • 2. Document the setup

    For parallel time measurements, report all measurement, (optional) synchronization, and summarization techniques.

SLIDE 44

Particular parameters may be very important

[Figure 5: 1,000 MPI_Reduce runs for different process counts (2 to 64), plotting completion time (us); power-of-two process counts are plotted separately from the others.]

SLIDE 45

Use performance modeling

If possible, show upper performance bounds to facilitate interpretability of the measured results.

SLIDE 46

Interpretable speedup graph

[Figure: Interpretable performance graphs for 4 to 32 processes: (a) Time and (b) Speedup, each showing the measurement result together with a serial overheads bound, an ideal linear bound, and a parallel overheads bound.]

Parallel overheads bound (based on Amdahl's law):

t(p) = 10 ns             if p ≤ 8
       0.1 ms · log₂ p   if 8 < p ≤ 16
       0.17 ms · log₂ p  if 16 < p
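The piecewise bound on this slide can be transcribed directly as a function; the conversion of the mixed ns/ms units into seconds is my assumption:

```python
from math import log2

def parallel_overhead_bound(p):
    """Completion-time bound (seconds) from the slide's piecewise model."""
    if p <= 8:
        return 10e-9            # 10 ns
    elif p <= 16:
        return 0.1e-3 * log2(p)   # 0.1 ms * log2(p)
    else:
        return 0.17e-3 * log2(p)  # 0.17 ms * log2(p)

for p in (4, 8, 16, 32):
    print(p, parallel_overhead_bound(p))
```

Plotting such a bound next to the measurements, as the figure does, lets readers judge how close the implementation comes to what the model allows.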

SLIDE 47

Graph the results

Plot as much information as needed to interpret the experimental results. Only connect measurements by lines if they indicate trends and the interpolation is valid.

SLIDE 48

Use appropriate tool

◮ Box plots
◮ Histograms
◮ Violin plots
◮ Plot summary statistics
◮ Plot CIs
◮ Combinations of all

[Figure (c), Box and Violin Plots: box plot, violin plot, and combined plot of latency (us, 1.75 to 2.50), annotating the 1st and 4th quartiles, the 1.5 IQR whiskers, the mean, the median, and the 95% CI of the median.]

SLIDE 49

Table of Contents

Introduction
State of the practice
The rules
  Use speedup with Care
  Do not cherry-pick
  Summarize data with Care
  Report variability of measurements
  Report distribution of measurements
  Compare data with Care
  Choose percentiles with Care
  Design interpretable measurements
  Use performance modeling
  Graph the results
Conclusion

SLIDE 50

Conclusion

◮ Important problem
◮ Good introduction
◮ Some of the claims have no obvious conclusion