Duet Benchmarking: Improving Measurement Accuracy in the Cloud




slide-1
SLIDE 1

Duet Benchmarking

Improving Measurement Accuracy in the Cloud

Lubomír Bulej Vojtěch Horký Petr Tůma François Farquet Aleksandar Prokopec

slide-2
SLIDE 2

Software Regression Testing … of Performance

slide-3
SLIDE 3

Common Testing Pipeline

Code Repository (commit, commit, commit) → commit hook → Check Out → Build → Run Tests → Pass or fail verdict

Write Code → Write More Code → Write Even More Code

slide-4
SLIDE 4

Project Context

Graal Java JIT+AOT Compiler

  • Currently ~5 merge commits per day
  • Bare minimum testing JDK 8 + JDK 11
  • Running ~60 standard benchmarks
  • Minimum warm up time 5 minutes
  • Minimum 10 executions

5 x 2 x 60 x 5 x 10 = 30000 minutes = ~21 days

  • Skip some commits
  • Skip some benchmarks
  • Use more machines
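The estimate above can be reproduced with a few lines of Python. This is only a sketch of the slide's arithmetic; the variable names are illustrative, and the figures are the ones listed under Project Context.

```python
# Figures from the Project Context slide; variable names are illustrative.
commits_per_day = 5     # merge commits per day
jdk_versions = 2        # JDK 8 + JDK 11
benchmarks = 60         # standard benchmarks
warmup_minutes = 5      # minimum warm up time per execution
executions = 10         # minimum executions per benchmark

total_minutes = (commits_per_day * jdk_versions * benchmarks
                 * warmup_minutes * executions)
machine_days = total_minutes / (60 * 24)

print(f"{total_minutes} minutes = ~{machine_days:.1f} machine-days per day of commits")
```

Since the warm up time and execution count are minimums, the result is a lower bound: one day of commits already needs about three weeks of time on a single machine.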

slide-5
SLIDE 5

Where to Go for More Machines ? … to the Cloud !

slide-6
SLIDE 6

Cloud Resource Sharing

Amazon Elastic Compute Cloud (EC2) Instance Types

  • t3.nano 2 vCPU @ 5% power, 512MB RAM
  • t3.medium 2 vCPU @ 20% power, 4GB RAM
  • m5.large 2 vCPU 8GB RAM
  • ...

Or you can forgo the virtualization

  • m5.metal 96 threads 48 cores 384GB RAM
  • Likely the same Intel Xeon Platinum 8175M

Envelope estimate

  • CPU 48 cores / 5% = 960 instances
  • RAM 384 GB / 512 MB = 768 instances
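The envelope estimate can be checked as follows. This is a sketch that simply mirrors the slide's division of one m5.metal host by the t3.nano figures; it says nothing about how Amazon actually places instances, and the variable names are illustrative.

```python
# Host and instance figures from the slide; names are illustrative.
host_cores = 48
nano_cpu_percent = 5            # t3.nano baseline share, as used in the slide
host_ram_mb = 384 * 1024
nano_ram_mb = 512

instances_by_cpu = host_cores * 100 // nano_cpu_percent
instances_by_ram = host_ram_mb // nano_ram_mb

print(instances_by_cpu, instances_by_ram)
```

Either way the estimate lands at several hundred small instances potentially sharing one physical machine.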

This might perhaps somewhat disrupt measurements

slide-7
SLIDE 7

… Effect of Resource Sharing

99% CI for the mean is ~61% bigger

slide-8
SLIDE 8

… Effect of Resource Sharing

99% CI for the mean is ~1800% bigger

slide-9
SLIDE 9

Resource Management … Should Be Fair !

slide-10
SLIDE 10

Is Resource Management Fair ?

Hyperthreading

  • Intel says it “maximizes use of execution units”

Bursty processor scheduling

  • Amazon says “one CPU credit is equal to 100% utilization for one minute” (in any combination) and “credits are accrued and spent at millisecond resolution”
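Read literally, the quoted definition makes credit use the product of vCPU count, utilization, and time. The sketch below assumes exactly that reading; the function name `credits_spent` is hypothetical and is not part of any Amazon API.

```python
def credits_spent(vcpus, utilization_percent, minutes):
    """CPU credits used, assuming 1 credit = 1 vCPU at 100% for 1 minute."""
    return vcpus * utilization_percent * minutes / 100

# "In any combination": each of these uses one credit.
print(credits_spent(1, 100, 1))
print(credits_spent(1, 50, 2))
print(credits_spent(2, 25, 2))

# Hourly budget implied by the t3.nano line of the slide (2 vCPU @ 5%).
print(credits_spent(2, 5, 60))
```

Under this reading, an instance that runs flat out during a benchmark spends its budget far faster than the baseline replenishes it, which is one way bursty scheduling can show up in measurements.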

Memory caches ? Memory bandwidth ? Thermal budget ?

Would it be fine if some instances were systematically disadvantaged ?

slide-11
SLIDE 11

Two Measurements In Parallel

Measured on GitLab CI

Both workloads fluctuate together

slide-12
SLIDE 12

How To Use This ?

Look at ratios instead of absolute values

  • Assumes effects are multiplicative
  • Ratios are what people want to know

“We want to reliably detect 5% slowdowns ...”
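A small simulation illustrates why ratios help when interference is multiplicative and hits both workloads at once. Everything here is a made-up model rather than data from the talk: the base times, the noise ranges, and the 5% difference are assumptions chosen for illustration.

```python
import random
import statistics

random.seed(1)

BASE_A = 100.0   # hypothetical run time of workload A
BASE_B = 105.0   # hypothetical workload B, 5% slower than A

times_a, times_b = [], []
for _ in range(1000):
    shared = random.uniform(1.0, 2.0)   # interference seen by both workloads
    times_a.append(BASE_A * shared * random.uniform(0.99, 1.01))  # private noise
    times_b.append(BASE_B * shared * random.uniform(0.99, 1.01))

# Pair the measurements taken at the same moment and look at their ratio.
ratios = [b / a for a, b in zip(times_a, times_b)]

def rel_spread(xs):
    """Standard deviation relative to the mean."""
    return statistics.stdev(xs) / statistics.mean(xs)

print(f"relative spread of A:   {rel_spread(times_a):.3f}")
print(f"relative spread of B/A: {rel_spread(ratios):.3f}")
print(f"mean ratio B/A:         {statistics.mean(ratios):.3f}")
```

In this model the shared factor cancels in each paired ratio, so the spread of B/A depends only on the small private noise, and the 5% slowdown stays visible even though the absolute times vary widely.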

Confidence intervals using bootstrap

Compare with sequential measurements

  • Confidence interval width relative to mean
  • Not quite apples-to-apples but gives some intuition
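
One way to obtain such intervals is a percentile bootstrap over the per-pair ratios. The sketch below is a generic version under stated assumptions and is not necessarily the exact procedure used in the paper; the function names, the resample count, and the sample data are all hypothetical.

```python
import random
import statistics

def bootstrap_ci(samples, confidence=0.99, resamples=2000, rng=None):
    """Percentile bootstrap confidence interval for the mean of samples."""
    rng = rng or random.Random(0)   # fixed seed keeps the sketch repeatable
    n = len(samples)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(samples, k=n)) for _ in range(resamples)
    )
    tail = (1.0 - confidence) / 2.0
    lo = means[round(tail * resamples)]
    hi = means[round((1.0 - tail) * resamples) - 1]
    return lo, hi

def relative_width(samples, **kwargs):
    """Confidence interval width relative to the sample mean."""
    lo, hi = bootstrap_ci(samples, **kwargs)
    return (hi - lo) / statistics.fmean(samples)

# Hypothetical per-pair run time ratios from a duet run.
ratios = [1.04, 1.06, 1.05, 1.07, 1.03, 1.05, 1.06, 1.04, 1.05, 1.05]
lo, hi = bootstrap_ci(ratios)
print(f"99% CI for the mean ratio: [{lo:.3f}, {hi:.3f}]")
print(f"relative width: {relative_width(ratios):.3f}")
```

Computing `relative_width` for both the duet ratios and the sequential measurements gives the kind of comparison described above, with the same caveat that it is not quite apples-to-apples.
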
slide-13
SLIDE 13

How Much More Accurate ?

  • ScalaBench ~2.3x, SPEC CPU ~27x
  • ScalaBench ~9.1x
  • ScalaBench ~12x, SPEC CPU ~24x

slide-14
SLIDE 14

… More Done

  • Does duet benchmarking work because of synchronized interference ?
  • Does duet benchmarking address interference due to resource sharing ?
  • Does duet benchmarking measure performance differences accurately ?
  • …

slide-15
SLIDE 15

Thank You !

Complete paper at https://arxiv.org/abs/2001.05811

For more information visit http://d3s.mff.cuni.cz