Five ways not to fool yourself (Tim Harris, 23-Jun-18)



SLIDE 1

Tim Harris 23-Jun-18

Five ways not to fool yourself

SLIDE 2

Five ways not to fool yourself

“A pragmatic implementation of non-blocking linked lists”, Tim Harris, DISC 2001

SLIDE 3

Five ways not to fool yourself

  • 1. Measure as you go
SLIDE 4–6

Starting and stopping work

  • How much work to do?

– Too little: results dominated by start-up effects. Normalized metrics vary as you vary the duration.
– OK: results not sensitive to the exact choice of settings. Confirm this: double / halve the duration with no change in the results.
– Unnecessarily long: deters experimentation, and risks errors from mixing up results from different runs.

[Chart: throughput over time, contrasting long runs with short runs]
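The "double / halve with no change" check can live in the harness itself. A minimal Python sketch; `workload_op` and `duration_insensitive` are hypothetical names, and the workload body stands in for the code under test:

```python
import time

def workload_op():
    # Hypothetical stand-in for one operation of the code under test.
    sum(range(100))

def ops_per_second(duration_s):
    """Run workload_op repeatedly for roughly duration_s; report throughput
    using the measured elapsed time, not the configured duration."""
    start = time.perf_counter()
    deadline = start + duration_s
    ops = 0
    while time.perf_counter() < deadline:
        workload_op()
        ops += 1
    return ops / (time.perf_counter() - start)

def duration_insensitive(base_s=0.2, tolerance=0.25):
    """True if halving and doubling the duration leaves throughput within
    the given relative tolerance of the base run -- the 'OK' regime."""
    base = ops_per_second(base_s)
    return all(
        abs(ops_per_second(d) - base) / base < tolerance
        for d in (base_s / 2, base_s * 2)
    )
```

If `duration_insensitive` comes back False, the runs are still dominated by start-up effects (or by noise) and the duration needs revisiting.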

SLIDE 7

Constant load versus constant work

  • Constant load: fixed set of threads active throughout the measurement interval. Measure the work they do.
  • Constant work: fixed amount of work (e.g., loop iterations). Measure the time taken to perform it. Vary the number of threads.
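The two methodologies correspond to two harness shapes, sketched below in illustrative Python. The `op` argument stands in for one operation of the structure under test; in CPython the GIL serializes the Python-level work, so this shows the harness structure, not real scaling:

```python
import threading
import time

def constant_load(n_threads, interval_s, op):
    """Constant load: a fixed set of threads runs for a fixed interval;
    measure how much work they complete."""
    counts = [0] * n_threads
    deadline = time.perf_counter() + interval_s

    def worker(i):
        while time.perf_counter() < deadline:
            op()
            counts[i] += 1

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(counts)  # total work completed in the interval

def constant_work(n_threads, total_ops, op):
    """Constant work: a fixed amount of work is split across the threads;
    measure the time taken to complete it (vary n_threads across runs)."""
    share = total_ops // n_threads

    def worker():
        for _ in range(share):
            op()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start  # elapsed time for the fixed work
```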

SLIDE 8

Plot what you measure, not what you configure

  • “Bind threads 1 per socket” → have each thread report where it is running
  • “Run for 10s” → record the time at start and end
  • “Use 50% reads” → report the measured #reads/#ops
  • “Distribute memory across the machine” → report the actual locations and page sizes used
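The same discipline can be encoded in the harness: return what was measured alongside what was configured. A small illustrative sketch (the read/write "work" here is simulated, and `run_measured` is a hypothetical name):

```python
import random
import time

def run_measured(configured_read_fraction=0.5, configured_duration_s=0.1):
    """Report the measured duration and measured read fraction, not the
    configured values -- the two can and do differ."""
    reads = writes = 0
    start = time.perf_counter()            # record actual start time
    deadline = start + configured_duration_s
    while time.perf_counter() < deadline:
        if random.random() < configured_read_fraction:
            reads += 1                     # count the reads actually issued
        else:
            writes += 1
    end = time.perf_counter()              # record actual end time
    total = reads + writes
    return {
        "measured_duration_s": end - start,
        "measured_read_fraction": reads / total if total else 0.0,
        "total_ops": total,
    }
```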

SLIDE 9

Five ways not to fool yourself

  • 1. Measure as you go
  • 2. Include lightweight sanity checks
SLIDE 10

Be skeptical about the results

SLIDE 11

Be skeptical about the results

  • Is the harness running what you intend it to run?

– Incorrect algorithms are often faster
– Good practice: do not print any output until you have confidence in the result

SLIDE 12

Be skeptical about the results

  • Does the data structure pass simple checks?

– Start with N items, insert P, delete M, check that we have N+P-M at the end
– Suppose we are building a balanced binary tree – is it actually balanced at the end?
– Suppose we have a vector of N items and swap pairs of items – do we have N distinct items at the end?
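The first and third checks are cheap to write down. An illustrative Python version, using a set and a list as stand-ins for the structures under test (function names are hypothetical):

```python
import random

def size_check(n=100, p=30, m=20):
    """Start with N items, insert P, delete M; the structure must hold
    exactly N + P - M items at the end."""
    s = set(range(n))
    for k in range(n, n + p):                 # insert P fresh keys
        s.add(k)
    for k in random.sample(sorted(s), m):     # delete M existing keys
        s.remove(k)
    assert len(s) == n + p - m, "items lost or duplicated"
    return len(s)

def swap_check(n=50, swaps=200):
    """Swap random pairs in a vector of N distinct items; there must
    still be N distinct items at the end."""
    v = list(range(n))
    for _ in range(swaps):
        i, j = random.randrange(n), random.randrange(n)
        v[i], v[j] = v[j], v[i]
    assert len(set(v)) == n, "items lost or duplicated by swapping"
    return len(set(v))
```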

SLIDE 13

Five ways not to fool yourself

  • 1. Measure as you go
  • 2. Include lightweight sanity checks
  • 3. Understand the simple cases first
SLIDE 14

[Chart: normalized throughput (0.0–0.8) vs. threads (1–128)]

Skip-list, 100 % read only, 2*Haswell

Normalize to optimized sequential code (and report absolute baseline). Self-relative scaling is almost never a good metric to use.

Why isn’t this a horizontal line?
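Why self-relative scaling misleads can be shown with two lines of arithmetic. The numbers below are illustrative, not measured:

```python
def normalized_throughput(parallel_ops_per_s, baseline_ops_per_s):
    """Normalize against an optimized sequential baseline (and report
    the absolute baseline alongside the ratio)."""
    return parallel_ops_per_s / baseline_ops_per_s

seq_baseline = 100.0    # optimized sequential code, ops/s (assumed)
slow_1t = 10.0          # a slow parallel implementation on 1 thread
slow_8t = 80.0          # the same implementation on 8 threads

self_relative = slow_8t / slow_1t                           # 8.0: "perfect scaling"
vs_baseline = normalized_throughput(slow_8t, seq_baseline)  # 0.8: still slower
```

Self-relative scaling reports a perfect 8x; against the sequential baseline, the 8-thread run still loses.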

SLIDE 15

Skip-list, 100 % read only, 2*Haswell

[Chart: normalized throughput (0.0–1.0) vs. threads (1–128), now flat]

  • Fixed: run without Turbo Boost. (The previous slide’s curve was measured with Turbo Boost.)

SLIDE 16

Five ways not to fool yourself

  • 1. Measure as you go
  • 2. Include lightweight sanity checks
  • 3. Understand the simple cases first
  • 4. Look beyond timing
SLIDE 17

Look beyond timing

  • Try to link:

– Performance measurements from an experiment
– Measurements of resource use during the experiment
– Differences between the algorithms being executed

SLIDE 18

Resource utilization

  • Examine the use of significant resources in the machine

– Bandwidth to and from memory
– Bandwidth use on the interconnect
– Instruction execution rate

  • Clock frequency and power settings
  • Look for evidence of bad behavior

– High page fault rate (i.e., going to disk)
– High TLB miss rate

SLIDE 19

Thread placement

  • Choice between OS-controlled threading and pinning
  • Real workloads run with OS-controlled threading

– …but OS-controlled threading can be sensitive to blocking / wake-up behavior, thread creation order, prior machine state, ….

  • Deliberately explore different pinned placements, and quantify the impact

– Are differences between algorithms consistent across these runs?

  • In experiments compare:

– OS-controlled scheduling (report the OS version)
– Different pinning choices (how many sockets used, how many cores per socket, what order are h/w threads used?)
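Where pinned placements are explored programmatically, one option on Linux is `os.sched_setaffinity`; other platforms fall back to OS-controlled placement. The function name here is illustrative:

```python
import os

def pin_to_cpus(cpus):
    """Pin the current process to the given CPU set where supported
    (Linux); return the resulting affinity, or None when pinning is
    unavailable or not permitted, so the run proceeds OS-scheduled."""
    if hasattr(os, "sched_setaffinity"):
        try:
            os.sched_setaffinity(0, cpus)
            return os.sched_getaffinity(0)   # report what we actually got
        except OSError:
            return None                      # placement not permitted
    return None  # e.g. macOS / Windows: no affinity API in os

# Deliberately explore placements, e.g. all threads on one socket versus
# spread across two, and check that differences between algorithms are
# consistent across placements (CPU numbering is machine-specific):
#   pin_to_cpus({0, 1, 2, 3})  versus  pin_to_cpus({0, 4, 8, 12})
```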

SLIDE 20

Memory placement

  • How are we distributing memory across sockets?
  • How is the load distributed over memory channels?
  • How is memory being allocated / deallocated?
SLIDE 21

Unfairness

  • Look across all of the threads: did they complete the same amount of work?
  • Trade-offs between unfairness and aggregate throughput

– Unfairness may correlate with better LLC behavior
– Threads running nearby synchronize more quickly, and get to complete more work

  • Whether we care about unfairness in itself depends on the workload

– Threads serving different clients: may want even response time
– Threads completing a batch of work: just care about overall completion time
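A per-thread breakdown makes the skew visible where an aggregate hides it. Illustrative Python; the operation counts are made-up, echoing the shape of the test-and-test-and-set result on the next slide:

```python
def fairness_report(ops_per_thread):
    """Summarize how evenly the work was spread across threads;
    ops_per_thread[0] is taken to be the main thread."""
    main = ops_per_thread[0]
    return {
        "normalized_to_main": [ops / main for ops in ops_per_thread],
        "max_over_min": max(ops_per_thread) / min(ops_per_thread),
    }

# Made-up counts: one thread near the lock holder's socket completes
# vastly more operations than the rest.
report = fairness_report([1000, 45000, 980, 1020])
```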

SLIDE 22

Unfairness: simple test-and-test-and-set lock

  • 2-socket Haswell, threads pinned sequentially to cores in both sockets

[Chart: operations per thread, normalized to the main thread, by h/w thread number (0..36). The skew is 45x, not 45%!]

SLIDE 23

Five ways not to fool yourself

  • 1. Measure as you go
  • 2. Include lightweight sanity checks
  • 3. Understand the simple cases first
  • 4. Look beyond timing
  • 5. Move toward production settings
SLIDE 24

Concluding comments

  • We optimize for what we measure, or measure what we optimized

– Why pick specific workloads (read/write mix, key space, …)?
– Does the choice reflect an important workload?
– Are results sensitive to the choice?

  • Be careful about averages

– As with fairness over threads, an average over time hides details
– Even if you do not plot all the results, examine trends over time, variability, etc.

  • Be careful about trade-offs

– Is a new system strictly better, or exploring a new point in a trade-off?
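The "averages hide details" point is easy to demonstrate: two runs with identical overall means can behave very differently over time. The throughput samples below are illustrative, not measured:

```python
import statistics

def windowed_summary(samples, window):
    """Per-window means plus overall mean and variability -- the
    structure that a single average over the whole run hides."""
    window_means = [
        statistics.mean(samples[i:i + window])
        for i in range(0, len(samples), window)
    ]
    return {
        "overall_mean": statistics.mean(samples),
        "window_means": window_means,
        "stdev": statistics.stdev(samples),
    }

# A run whose throughput decays and a steady run share the same mean:
degrading = windowed_summary([100, 100, 60, 60, 20, 20], window=2)
steady = windowed_summary([60, 60, 60, 60, 60, 60], window=2)
# degrading["window_means"] -> [100, 60, 20]; steady -> [60, 60, 60]
```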

SLIDE 25

Further reading

  • Books

– Huff & Geis – “How to Lie with Statistics”
– Jain – “The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling”
– Tufte – “The Visual Display of Quantitative Information”

  • Papers and articles

– Bailey – “Twelve Ways to Fool the Masses”
– Fleming & Wallace – “How not to lie with statistics: the correct way to summarize benchmark results”
– Heiser – “Systems Benchmarking Crimes”
– Hoefler & Belli – “Scientific Benchmarking of Parallel Computing Systems”