
SLIDE 1

What Exactly do we Mean by JIT Warmup?

Edd Barrett, Carl Friedrich Bolz, Rebecca Killick (Lancaster), Vincent Knight (Cardiff), Sarah Mount, Laurence Tratt Software Development Team April 20, 2016

1 / 40 http://soft-dev.org/

SLIDE 2

Agenda

1. JIT Warmup Background
2. The Back-Story
3. The Warmup Experiment v2.0
4. Results
5. Automated Analyses
6. Conclusion and Future Work

SLIDE 3

JIT Warmup Background

SLIDE 4

JIT Warmup Background

Informally: the time taken for a JIT-compiled VM to reach peak performance.
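A minimal sketch of what warmup means operationally: time successive in-process iterations of a benchmark inside one VM process, and look for the early iterations being slower than the later, peak ones. The function names and toy workload here are illustrative, not from the experiment.

```python
import time

def time_iterations(bench_fn, iters=50):
    """Time each in-process iteration of a benchmark; under the classic
    warmup model the early timings are slower than the later, peak ones."""
    timings = []
    for _ in range(iters):
        start = time.monotonic()
        bench_fn()
        timings.append(time.monotonic() - start)
    return timings

# A toy workload standing in for a real benchmark body.
timings = time_iterations(lambda: sum(i * i for i in range(10_000)))
print(len(timings))  # 50
```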

SLIDE 5

JIT Warmup Background

SLIDE 10

JIT Warmup Background

SLIDE 14

Why is Warmup Important?

  • Warmup contributes to overall performance.
  • Long warmup is bad for user-facing and short-lived programs.
  • VM authors report peak performance.

SLIDE 15

The Back-Story

SLIDE 16

The Back-Story

We have a hunch that warmup is longer than people expect. We have some preliminary ideas to improve warmup.

SLIDE 17

The Back-Story

Goal: measure how long modern JITs take to warm up.

SLIDE 18

The Warmup Experiment v1.0

  • Microbenchmarks.
  • Reasonable number of repetitions: 10 process executions, 50 in-process iterations.
  • Run on various VMs.
  • Plot and report warmup time.
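The v1.0 design above can be sketched as a harness: each process execution starts a fresh VM, which then runs the benchmark for a fixed number of in-process iterations. This is our simplified illustration (the child process is the same interpreter running a toy workload), not Krun or the real experiment code.

```python
import json
import subprocess
import sys

# Child program: one process execution. A fresh interpreter times 50
# in-process iterations of a toy benchmark and prints the timings.
CHILD = """
import json, time
timings = []
for _ in range(50):                    # 50 in-process iterations
    t0 = time.monotonic()
    sum(i * i for i in range(10_000))  # stand-in benchmark body
    timings.append(time.monotonic() - t0)
print(json.dumps(timings))
"""

def run_experiment(process_execs=10):
    """Collect timings from `process_execs` fresh-VM process executions."""
    results = []
    for _ in range(process_execs):
        out = subprocess.run([sys.executable, "-c", CHILD],
                             capture_output=True, text=True, check=True)
        results.append(json.loads(out.stdout))
    return results

results = run_experiment(process_execs=2)  # 2 rather than 10, for a quick demo
print(len(results), len(results[0]))
```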

SLIDE 19

The Back-Story: Weird Results

Many benchmarks don't warm up under the classic model.

SLIDE 20

The Back-Story

New goal: try to understand why we see "weird" results.

SLIDE 21

The Warmup Experiment v2.0

SLIDE 22

Microbenchmarks Revisited

  • CFG determinism: each run takes the same path through the CFG.
  • Checksums: ensure different languages do the same work, and make it harder for VMs to optimise away the whole benchmark.

Code for the microbenchmarks: https://github.com/softdevteam/warmup_experiment
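The checksum idea can be sketched like this: the benchmark kernel folds all of its work into a single value that is compared against a recorded expected result, so a clever VM cannot dead-code-eliminate the loop without the mismatch being noticed. The kernel and constants here are our illustration, not one of the real microbenchmarks.

```python
def toy_kernel(n):
    """A stand-in benchmark body whose result depends on every iteration."""
    checksum = 0
    for i in range(1, n + 1):
        checksum = (checksum * 31 + i * i) % 1_000_000_007
    return checksum

# In the real setup the expected checksum is recorded once per benchmark and
# hard-coded, so every language implementation must do the same work.
EXPECTED = toy_kernel(1000)

def run_iteration():
    result = toy_kernel(1000)
    if result != EXPECTED:
        raise AssertionError("benchmark computed the wrong result")
    return result

print(run_iteration() == EXPECTED)  # True
```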

SLIDE 23

The Benchmark Runner Revisited: Krun

A benchmark runner that aims to control sources of variation, with respect to memory limits, I/O, system state, . . .

https://github.com/softdevteam/krun

SLIDE 24

VMs

  • Graal-0.13
  • HHVM-3.12.0
  • JRuby/Truffle (recent git version)
  • Hotspot-8u72b15
  • LuaJIT-2.0.4
  • PyPy-4.0.1
  • V8-4.9.385.21
  • GCC-4.9.3 (not really a VM)

Same GCC across the board; minor VM patching.

SLIDE 25

Machines

  • Linux-Debian8/i7-4790K, 24GiB RAM
  • Linux-Debian8/i7-4790, 32GiB RAM
  • OpenBSD-5.8/i7-4790, 32GiB RAM

"Turbo boost" disabled. SSH blocked from non-local machines. Daemons disabled (e.g. cron, smtpd).

SLIDE 26

Run for Longer

Run many more in-process iterations (2,000). Plot the results and see whether classic warmup now appears.

SLIDE 27

Results

SLIDE 28

Classical Warmup

[Plot: time (s) per in-process iteration (1–2000); Richards, Graal, Linux1/i7-4790K, Process execution #3; inset of iterations 1–9]

SLIDE 29

Classical Warmup

[Plot: Fasta, V8, Linux2/i7-4790, Process execution #1]

SLIDE 30

Classical Warmup

[Plot: Spectral Norm, PyPy, Linux1/i7-4790K, Process execution #7]

SLIDE 31

Classical Warmup

[Plots: Fasta, V8, Process execution #1 on Linux1/i7-4790K and on Linux2/i7-4790]

(Different machines)

SLIDE 32

Slowdown

[Plot: Fannkuch Redux, LuaJIT, OpenBSD/i7-4790, Process execution #10]

SLIDE 33

Slowdown

[Plot: Richards, Hotspot, Linux2/i7-4790, Process execution #2]

SLIDE 34

Cycles

[Plot: Fannkuch Redux, Hotspot, Linux1/i7-4790K, Process execution #1]

SLIDE 35

Cycles

[Plot: Fannkuch Redux, Hotspot, OpenBSD/i7-4790, Process execution #4; inset of iterations 250–600]

SLIDE 36

Cycles

[Plot: Binary Trees, PyPy, Linux2/i7-4790, Process execution #1; inset of iterations 200–240]

SLIDE 37

Changing Phases

[Plot: Fasta, LuaJIT, OpenBSD/i7-4790, Process execution #5]

SLIDE 38

Vastly Inconsistent Process-executions

[Plots: Fasta, PyPy, Linux2/i7-4790, Process executions #3 and #4]

(Same machine)

SLIDE 39

Vastly Inconsistent Process-executions

[Plots: Binary Trees, C, Process execution #1 on Linux2/i7-4790 and on OpenBSD/i7-4790]

(Different machines; the bouncing-ball pattern is Linux-specific)

SLIDE 40

Full Results

https://archive.org/download/softdev_warmup_experiment_artefacts/v0.2/

  • all_graphs.pdf: all plots in one huge PDF.
  • warmup_results*.json.bz2: raw results.

SLIDE 41

Automated Analyses

SLIDE 42

Automated Analyses: Outlier Detection

[Plots: Spectral Norm, PyPy, Linux1/i7-4790K, Process executions #1 and #2; measurement outliers highlighted]

SLIDE 43

Automated Analyses: Outlier Detection

  • Outliers: measurements outside 5σ of a rolling average.
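The 5σ rule above can be sketched as follows: flag any iteration that falls more than five standard deviations from the mean of a rolling window around it. The window size and exact windowing scheme here are our simplification, not Krun's precise settings.

```python
import statistics

def find_outliers(timings, window=200, threshold=5.0):
    """Indices of measurements more than `threshold` standard deviations
    from the mean of the surrounding window (the point itself excluded)."""
    outliers = []
    for i, t in enumerate(timings):
        lo = max(0, i - window // 2)
        hi = min(len(timings), i + window // 2)
        neighbours = timings[lo:i] + timings[i + 1:hi]
        mean = statistics.mean(neighbours)
        sd = statistics.pstdev(neighbours)
        if abs(t - mean) > threshold * sd:
            outliers.append(i)
    return outliers

data = [0.47] * 100
data[40] = 0.95  # inject a single measurement spike
print(find_outliers(data))  # [40]
```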

SLIDE 44

Automated Analyses: Outlier Detection

[Plots: Spectral Norm, PyPy, Linux1/i7-4790K, Process executions #1 and #2; recurring outliers marked: unique outliers (0.05%), common outliers (0.40%)]

SLIDE 45

Automated Analyses: Change-point Analysis
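As a sketch of the change-point idea: find the point where the statistical properties of the timing series shift, e.g. a mean shift marking the end of warmup. The single-split search below is our simplified illustration; the actual analysis draws on the change-point literature (e.g. Killick's work), not this exact method.

```python
import statistics

def sse(segment):
    """Sum of squared deviations from the segment mean."""
    m = statistics.mean(segment)
    return sum((x - m) ** 2 for x in segment)

def best_changepoint(series):
    """Index of the single split that most reduces total squared error."""
    total = sse(series)
    best_gain, best_idx = 0.0, None
    for i in range(2, len(series) - 1):
        gain = total - sse(series[:i]) - sse(series[i:])
        if gain > best_gain:
            best_gain, best_idx = gain, i
    return best_idx

# A run that "warms up": a slow first phase, then a faster steady state.
series = [1.0] * 30 + [0.5] * 70
print(best_changepoint(series))  # 30
```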


SLIDE 48

Conclusion and Future Work

SLIDE 49

Conclusion

We can't rely on the classical warmup model.

SLIDE 50

Future Work

  • Extend automated analyses.
  • More {benchmarks, VMs, arches, OSs}.
  • Try to assign meaning to artefacts in plots: e.g. is that spike at x = 78 actually {GC, compilation, . . . }?
  • Memory consumption over time: correlation with iteration times?
  • Look at hardware performance counters?
  • What else?

SLIDE 51

References

  • JIT Warmup Blows Hot and Cold. E. Barrett, C. F. Bolz, R. Killick, V. Knight, S. Mount and L. Tratt.
  • Rigorous Benchmarking in Reasonable Time. T. Kalibera and R. Jones.
  • Specialising Dynamic Techniques for Implementing the Ruby Programming Language. C. Seaton (Chapter 4).
  • Quantifying Performance Changes with Effect Size Confidence Intervals. T. Kalibera and R. Jones.

SLIDE 52

References

  • NO_HZ: Reducing Scheduling-Clock Ticks. Linux Kernel Documentation.
  • Intel P-state driver. Linux Kernel Documentation.
  • malloc.conf(5). OpenBSD Manual Pages.

SLIDE 53

That's a wrap. Thanks!

[Plots: Fannkuch Redux, Hotspot, Linux1/i7-4790K, Process execution #1; Fasta, LuaJIT, OpenBSD/i7-4790, Process execution #5]


SLIDE 55

Krun Controls

Platform-independent controls:

  • Minimises I/O.
  • Consistently limits heap and stack ulimits.
  • Drops privileges to a fresh krun UNIX account.
  • Automatically reboots the system prior to each process execution.
  • Checks dmesg for changes after each process execution.
  • Checks the system is at the same temperature for each process execution.

SLIDE 56

Krun Controls

Linux controls. Krun checks that:

  • Intel P-state support is disabled in the kernel.
  • The performance governor is used.
  • The kernel is "tickless" (NO_HZ_FULL_ALL).
  • The perf sample rate is the lowest possible (1Hz).
  • ASLR is disabled.

(Note: Linux ignores ulimits.)
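A sketch of the style of these pre-flight checks, written against file contents rather than live /sys and /proc paths so the logic is easy to see. The helper names are ours, not Krun's API; the paths in the comments are the standard Linux locations for these settings.

```python
def governor_is_performance(scaling_governor: str) -> bool:
    # Contents of /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor.
    return scaling_governor.strip() == "performance"

def aslr_disabled(randomize_va_space: str) -> bool:
    # Contents of /proc/sys/kernel/randomize_va_space ("0" means off).
    return randomize_va_space.strip() == "0"

def perf_rate_is_minimal(max_sample_rate: str) -> bool:
    # Contents of /proc/sys/kernel/perf_event_max_sample_rate, in Hz.
    return int(max_sample_rate.strip()) <= 1

checks = [governor_is_performance("performance\n"),
          aslr_disabled("0\n"),
          perf_rate_is_minimal("1\n")]
print(all(checks))  # True
```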

SLIDE 57

Krun Controls

OpenBSD controls. Krun checks that:

  • Malloc flags are the least noise-inducing.
  • apmd is running in performance mode.

On OpenBSD:

  • We can't disable ASLR.
  • We can't disable ticks.
  • We can't disable P-states in software.
  • There is no kernel profiler (good for us).