Privet! 2004 - 2010: Performance Engineer, Software Engineer @ MySQL - - PowerPoint PPT Presentation

Santa Clara, California | April 23rd – 25th, 2018

Benchmark Noise Reduction: How to Configure Your Machines for Stable Results

Hi!

2004 - 2010: Performance Engineer, Software Engineer @ MySQL AB / Sun Microsystems / Oracle

2010 - 2015: Principal Software Engineer, Project Lead @ Percona

2015 - NOW(): MySQL/InnoDB Performance Expert @ Cavium

sysbench maintainer

Why reduce benchmark noise?

CPU frequency scaling

Modern CPU cores can scale their frequencies up and down based on temperature, load and OS power-saving policies.

This is the most easily identified problem (and the most frequently hit too!):
 $ grep MHz /proc/cpuinfo
 $ lscpu
 $ cpupower -c all frequency-info
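A minimal sketch of the first check in bash, assuming an x86 machine whose /proc/cpuinfo has "cpu MHz" lines (many ARM kernels do not print them). The function name cpu_mhz_summary is hypothetical; it only condenses the grep output into a count, minimum, maximum and average so that runs and machines are easier to compare. The values are a snapshot of the moment the file is read.

```bash
#!/usr/bin/env bash
# Hypothetical helper: summarize the "cpu MHz" lines of a cpuinfo-style file
# (default: /proc/cpuinfo). A wide min/max spread on an idle or uniformly
# loaded machine suggests that frequency scaling is active.
cpu_mhz_summary() {
    awk -F: '
        /^cpu MHz/ {
            mhz = $2 + 0              # " 2400.000" -> 2400
            if (n == 0 || mhz < min) min = mhz
            if (n == 0 || mhz > max) max = mhz
            sum += mhz
            n++
        }
        END {
            if (n == 0) { print "no \"cpu MHz\" lines found"; exit 1 }
            printf "cpus=%d min=%.0f max=%.0f avg=%.0f\n", n, min, max, sum / n
        }
    ' "${1:-/proc/cpuinfo}"
}
```

Example: `. ./cpufreq-check.sh && cpu_mhz_summary` (the file name is illustrative). If frequency scaling is the source of noise, the spread typically narrows once the steps below are applied.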

Balancing power and performance:
 P-states apply to busy cores: P0 is maximum performance, Pn (n > 0) is reduced voltage/frequency
 C-states apply to idle cores: C0 is not sleeping, Cn (n > 0) are deeper sleep levels
Higher-numbered P- and C-states are a major source of noise in benchmarks
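To see which C-states a machine actually offers, the cpuidle sysfs directory lists each idle state with its name and exit latency; deeper states generally take longer to wake up from, which is where the jitter comes from. A bash sketch, with a hypothetical list_cstates function and a SYSFS_CPU override added only so it can be tried against a scratch directory; `cpupower idle-info` reports similar information.

```bash
#!/usr/bin/env bash
# Hypothetical helper: list the idle states (C-states) the cpuidle subsystem
# exposes for one CPU, with each state's exit latency in microseconds.
# SYSFS_CPU is an assumption added for testing; it defaults to the real path.
list_cstates() {
    local cpu="${1:-0}" state
    local base="${SYSFS_CPU:-/sys/devices/system/cpu}/cpu${cpu}/cpuidle"
    for state in "$base"/state[0-9]*; do
        [ -d "$state" ] || continue    # no cpuidle driver: the glob stayed literal
        printf '%s %s latency=%sus\n' \
            "${state##*/}" "$(cat "$state/name")" "$(cat "$state/latency")"
    done
}
```

Note that the glob sorts state10 before state2 on CPUs with many states; the output is meant for reading, not parsing.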

Turbo mode

Turbo Boost™ in Intel CPUs, with similar technologies from other vendors and in other architectures
 dynamic overclocking
 the increased frequency is limited by hardware limits and by the number of currently active cores
 complicates core-to-core and scalability comparisons

CPU frequency scaling: What You Can Do

Disable higher P-states by setting the CPU frequency governor to performance:
 echo performance | sudo tee \
 /sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_governor
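A slightly more defensive version of the same step, sketched in bash: it writes the governor for every CPU and then reads the values back, so a write that did not take effect is noticed. The function names and the SYSFS_CPU override are hypothetical; on a real system the writes need root, and the governor has to be listed in that CPU's scaling_available_governors or the kernel rejects it.

```bash
#!/usr/bin/env bash
# Hypothetical helpers: switch every CPU to a cpufreq governor and print what
# each CPU reports afterwards. SYSFS_CPU is an assumption added for testing.
set_governor() {
    local gov="$1" f
    for f in "${SYSFS_CPU:-/sys/devices/system/cpu}"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -e "$f" ] || continue          # glob did not match: no cpufreq driver
        echo "$gov" > "$f" || return 1   # fails if the governor is unavailable
    done
}

show_governors() {
    local f cpu
    for f in "${SYSFS_CPU:-/sys/devices/system/cpu}"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -e "$f" ] || continue
        cpu="${f%/cpufreq/*}"            # .../cpu3/cpufreq/scaling_governor -> .../cpu3
        printf '%s %s\n' "${cpu##*/}" "$(cat "$f")"
    done
}
```

Example (as root, file name illustrative): `. ./governor.sh; set_governor performance; show_governors`.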
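Disable higher C-states via PM QoS (the request lasts only while /dev/cpu_dma_latency is held open, hence the background cat; run as root):
 (echo 0; cat) > /dev/cpu_dma_latency &
or use pmqos-static.py from tuned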
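The background `cat` in the PM QoS command exists only to keep /dev/cpu_dma_latency open, because the kernel withdraws the request when the descriptor is closed. A bash sketch that does the same without a stopped background job, by holding the file open on a descriptor of the current shell; the function names and the CPU_DMA_LATENCY override are hypothetical. It writes the value as a 10-character hex string, one of the formats the PM QoS device accepts: text is parsed as hexadecimal, and a write of exactly 4 bytes is taken as a raw 32-bit integer instead.

```bash
#!/usr/bin/env bash
# Hypothetical helpers: keep a CPU latency PM QoS request active from the
# current shell. The kernel drops the request as soon as the descriptor is
# closed, so fd 9 stays open until release_cpu_latency is called.
# CPU_DMA_LATENCY is an assumption added for testing.
hold_cpu_latency() {
    local usec="${1:-0}"
    exec 9> "${CPU_DMA_LATENCY:-/dev/cpu_dma_latency}" || return 1
    # Written as "0x%08x" so the value is never mistaken for a raw 4-byte int.
    printf '0x%08x' "$usec" >&9
}

release_cpu_latency() {
    exec 9>&-
}
```

Example (in a root shell, file name illustrative): `. ./pmqos.sh; hold_cpu_latency 0`, run the benchmark, then `release_cpu_latency`.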

Disable Turbo Boost:
 with intel_pstate:
  echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
 without intel_pstate, use Model-Specific Registers and msr-tools:
  wrmsr -a 0x1a0 0x4000850089
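A bash sketch of the same choice, assuming an Intel CPU for the MSR path. It also checks /sys/devices/system/cpu/cpufreq/boost, which the acpi-cpufreq driver provides (writing 0 turns boost off). For the MSR fallback it sets bit 38 of IA32_MISC_ENABLE (0x1a0), the turbo-disable bit, on top of each CPU's current value instead of writing one constant everywhere: 0x4000850089 is exactly 0x850089 with bit 38 set, so that constant presumably came from reading one particular machine. The function names are hypothetical; running disable_turbo needs root, the msr module and msr-tools.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for turning Turbo Boost off (run disable_turbo as root).

# Bit 38 of MSR 0x1a0 (IA32_MISC_ENABLE) is the turbo-disable bit on Intel CPUs.
TURBO_DISABLE_BIT=$((1 << 38))

# Input: a hex value as printed by rdmsr (no 0x prefix).
# Output: the same value with the turbo-disable bit set, other bits untouched.
turbo_off_value() {
    printf '0x%x\n' $(( 0x$1 | TURBO_DISABLE_BIT ))
}

disable_turbo() {
    local sys=/sys/devices/system/cpu d cur
    # intel_pstate: 1 in no_turbo disables turbo.
    if [ -w "$sys/intel_pstate/no_turbo" ]; then
        echo 1 > "$sys/intel_pstate/no_turbo"
        return
    fi
    # acpi-cpufreq: 0 in the global boost file disables boost.
    if [ -w "$sys/cpufreq/boost" ]; then
        echo 0 > "$sys/cpufreq/boost"
        return
    fi
    # Fallback: read-modify-write the MSR on every online CPU with msr-tools.
    modprobe msr || return 1
    for d in /dev/cpu/[0-9]*; do
        cur=$(rdmsr -p "${d##*/}" 0x1a0) || return 1
        wrmsr -p "${d##*/}" 0x1a0 "$(turbo_off_value "$cur")" || return 1
    done
}
```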

CPU scheduler

CPU scheduler tuning

More an art than a science:
 $ sysctl -a | grep sched | grep -cv domain
 14
Inadequate settings may result in more context switches and cache misses
There's no universal solution; this is what I use for sysbench OLTP:
 CFS (the default) is best
 disable autogrouping:
  sysctl kernel.sched_autogroup_enabled=0
 raise the minimum granularity from its default (here to 5 ms):
  sysctl kernel.sched_min_granularity_ns=5000000
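Because these settings are global and outlive the benchmark, it helps to record the old values before changing them. A bash sketch with hypothetical helpers apply_sysctls and restore_sysctls that write directly to /proc/sys (which is what `sysctl key=value` does) and skip keys the running kernel does not have; that matters here because on newer kernels some scheduler knobs, including sched_min_granularity_ns, are no longer under /proc/sys. PROC_SYS is an override added only for testing.

```bash
#!/usr/bin/env bash
# Hypothetical helpers: apply "key=value" sysctl settings for a benchmark run,
# saving the previous values first so the machine can be restored afterwards.
# PROC_SYS is an assumption added for testing; it defaults to /proc/sys.

# kernel.sched_autogroup_enabled -> /proc/sys/kernel/sched_autogroup_enabled
sysctl_path() {
    printf '%s/%s\n' "${PROC_SYS:-/proc/sys}" "$(printf '%s' "$1" | tr . /)"
}

apply_sysctls() {
    local backup="$1" kv key path
    shift
    : > "$backup"                        # start with an empty backup file
    for kv in "$@"; do
        key="${kv%%=*}"
        path=$(sysctl_path "$key")
        if [ ! -e "$path" ]; then
            echo "skipping $key: not present on this kernel" >&2
            continue
        fi
        printf '%s=%s\n' "$key" "$(cat "$path")" >> "$backup"
        echo "${kv#*=}" > "$path" || return 1
    done
}

restore_sysctls() {
    local kv
    while IFS= read -r kv; do
        echo "${kv#*=}" > "$(sysctl_path "${kv%%=*}")"
    done < "$1"
}
```

Example (as root, file names illustrative): `. ./sysctl-helpers.sh; apply_sysctls /tmp/sysctl.bak kernel.sched_autogroup_enabled=0 kernel.sched_min_granularity_ns=5000000`, and after the run `restore_sysctls /tmp/sysctl.bak`.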

Memory management

Address space layout randomization

Addresses of program code, libraries and data are different on each invocation
Enabled by default
Can change performance several-fold: "Causes of Performance Instability due to Code Placement in X86"
To disable: sysctl kernel.randomize_va_space=0
Security feature, don't try this at home in production!
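A bash sketch for checking the current ASLR mode, plus a per-process alternative to the system-wide sysctl: util-linux `setarch` with `-R` starts a single command with address randomization turned off, leaving the rest of the system protected. The function names and the PROC_SYS test override are hypothetical; a server started by an init system is only affected if it is launched through such a wrapper.

```bash
#!/usr/bin/env bash
# Hypothetical helpers around ASLR for benchmark runs.
# kernel.randomize_va_space: 0 = off, 1 = stack/mmap/VDSO randomized,
# 2 = additionally randomize the heap (brk); 2 is the default on most systems.
aslr_mode_name() {
    case "$1" in
        0) echo "disabled" ;;
        1) echo "partial (stack, mmap, VDSO)" ;;
        2) echo "full (stack, mmap, VDSO, heap)" ;;
        *) echo "unknown" ;;
    esac
}

show_aslr() {
    local v
    v=$(cat "${PROC_SYS:-/proc/sys}/kernel/randomize_va_space") || return 1
    echo "randomize_va_space=$v: $(aslr_mode_name "$v")"
}

# Per-process alternative: run one command with address randomization off.
run_without_aslr() {
    setarch "$(uname -m)" -R "$@"
}
```

Example: `show_aslr`, or `run_without_aslr ./my_benchmark` (the benchmark path is illustrative).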

NUMA

NUMA auto-balancing:
 moves memory and tasks to avoid remote memory access
 works by unmapping pages and handling page faults
 potentially improves performance at the cost of latency jitter
 enabled by default on most systems
Disable for benchmarks: sysctl kernel.numa_balancing=0
Don't forget about innodb_numa_interleave=1 in my.cnf
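One way to see whether auto-balancing was active during a run is to compare the NUMA counters in /proc/vmstat before and after it: numa_hint_faults counts the faults taken on deliberately unmapped pages, and numa_pages_migrated counts pages moved between nodes. A bash sketch with a hypothetical numa_balancing_delta function; the counters only exist on kernels built with NUMA balancing support.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print how many NUMA-balancing hint faults and page
# migrations happened between two snapshots of /proc/vmstat.
numa_balancing_delta() {
    awk '
        $1 == "numa_hint_faults" || $1 == "numa_pages_migrated" {
            if (FILENAME == ARGV[1]) start[$1] = $2           # "before" snapshot
            else printf "%s %d\n", $1, $2 - start[$1]         # "after" minus "before"
        }
    ' "$1" "$2"
}
```

Example: `cat /proc/vmstat > /tmp/vmstat.before`, run the benchmark, `cat /proc/vmstat > /tmp/vmstat.after`, then `numa_balancing_delta /tmp/vmstat.before /tmp/vmstat.after`. With kernel.numa_balancing=0 the deltas are expected to stay at zero.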

Swap

Minimize (not disable) swapping: sysctl vm.swappiness=1
 see "In defence of swap: common misconceptions" by Chris Down
innodb_numa_interleave=1 in my.cnf
To ensure allocation fairness between nodes:
 sync; sysctl vm.drop_caches=3
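A bash sketch for the last item: drop_caches frees only clean page cache and reclaimable slab objects (3 means both), which is why `sync` runs first, and the per-node MemFree values in sysfs show whether nodes start the run on an equal footing. The function names and the SYSFS_NODE/PROC_SYS overrides are hypothetical; writing drop_caches needs root.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for checking per-node free memory around cache drops.
# SYSFS_NODE and PROC_SYS are assumptions added for testing.

# Print free memory per NUMA node in MiB, from lines such as
# "Node 0 MemFree:         2097152 kB".
node_free_mib() {
    local f
    for f in "${SYSFS_NODE:-/sys/devices/system/node}"/node[0-9]*/meminfo; do
        [ -e "$f" ] || continue
        awk '$3 == "MemFree:" { printf "node%s %d\n", $2, $4 / 1024 }' "$f"
    done
}

# Flush dirty pages, then drop the page cache, dentries and inodes.
drop_caches() {
    sync
    echo 3 > "${PROC_SYS:-/proc/sys}/vm/drop_caches"
}
```

Example (as root): `node_free_mib; drop_caches; node_free_mib` before each run.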

Transparent Huge Pages

Disable:
 echo never > /sys/kernel/mm/transparent_hugepage/enabled
 echo never > /sys/kernel/mm/transparent_hugepage/defrag
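The active THP mode is the bracketed word in each of these files, e.g. `always madvise [never]`. A bash sketch with hypothetical thp_mode and disable_thp helpers (the THP_DIR override exists only for testing); like the commands above, the change does not survive a reboot, for which the transparent_hugepage=never boot parameter is the usual route.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for Transparent Huge Pages.
# THP_DIR is an assumption added for testing.

# Print the active mode of a THP knob ("enabled" or "defrag"):
# the kernel marks it with brackets, e.g. "always madvise [never]".
thp_mode() {
    sed -n 's/.*\[\(.*\)\].*/\1/p' "${THP_DIR:-/sys/kernel/mm/transparent_hugepage}/$1"
}

# Same as the two echo commands above (needs root on a real system).
disable_thp() {
    local knob
    for knob in enabled defrag; do
        echo never > "${THP_DIR:-/sys/kernel/mm/transparent_hugepage}/$knob" || return 1
    done
}
```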

Memory allocators

MySQL is a heavy malloc() user
glibc/jemalloc/tcmalloc performance and scalability heavily depend on the version:
 glibc is improving, but… scalability and fragmentation are a problem in LTS distributions
 tcmalloc is good enough in LTS distributions, broken in Ubuntu Artful
 jemalloc gives the most stable and consistent results, with sane default behavior
For benchmarks, make sure to use the same version of the same library with the same settings!
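To confirm which allocator a running mysqld really uses, one option is to look for the allocator's shared library in the process memory map. A bash sketch with a hypothetical detect_allocator function; it cannot see an allocator that was linked statically, and how the library gets loaded (LD_PRELOAD, mysqld_safe --malloc-lib, or linking) is left to your setup.

```bash
#!/usr/bin/env bash
# Hypothetical helper: report which malloc implementation a process has mapped,
# by looking for a jemalloc or tcmalloc shared library in a maps file
# (normally /proc/<pid>/maps).
detect_allocator() {
    local maps="$1" lib
    lib=$(grep -oE '/[^ ]*lib(jemalloc|tcmalloc[_a-z]*)\.so[.0-9]*' "$maps" | sort -u | head -n 1)
    if [ -n "$lib" ]; then
        echo "$lib"
    else
        echo "glibc malloc (no jemalloc/tcmalloc mapped)"
    fi
}
```

Example: `detect_allocator /proc/$(pidof mysqld)/maps`.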

Spectre and Meltdown mitigations

Major headache for benchmarks
Overhead of up to hundreds of percent, depending on:
 the workload
 kernel version
 CPU microcode version
 compiler version and flags
No runtime tuning
Make sure mitigations are as close as possible between compared systems
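Since there is no runtime knob, the practical step is to record what each machine is running and diff the results. A bash sketch with a hypothetical mitigation_report function that prints the kernel release, the x86 microcode revision from /proc/cpuinfo and every entry of /sys/devices/system/cpu/vulnerabilities (on kernels that provide it); the VULN_DIR and CPUINFO overrides exist only for testing. Compiler version and flags of the benchmarked binaries still have to be compared separately.

```bash
#!/usr/bin/env bash
# Hypothetical helper: one line per item (kernel, microcode, each CPU
# vulnerability status) so reports from two machines can be compared with diff.
# VULN_DIR and CPUINFO are assumptions added for testing.
mitigation_report() {
    local dir="${VULN_DIR:-/sys/devices/system/cpu/vulnerabilities}" f
    echo "kernel: $(uname -r)"
    # First "microcode : 0x..." line; absent on non-x86 systems.
    awk -F': *' '/^microcode/ { print "microcode: " $2; exit }' "${CPUINFO:-/proc/cpuinfo}"
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        echo "${f##*/}: $(cat "$f")"
    done
}
```

Example: run `mitigation_report > "$(hostname).txt"` on each machine, then `diff hostA.txt hostB.txt`.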

sysbench tune

 $ sysbench tune list
 $ sudo sysbench tune apply --profile=mysqlbench
Inspired by tuned, python-perf system and Krun, but more portable
Already available from rocks.sysbench.io
Pull requests are welcome!

Summary

Achieving stable and consistent benchmark results is getting increasingly hard
All existing knowledge is fragmented and mostly scratches the surface
YMMV, test on your workloads
Feedback on sysbench tune is welcome!

Links

Victor Stinner’s articles on benchmark stability (with focus on microbenchmarks): https://vstinner.readthedocs.io/benchmark.html
Brendan Gregg’s talk on tuning production cloud instances: "How Netflix Tunes EC2 Instances for Performance"
Øystein Grøvlen’s post on improving stability of MySQL benchmarks: https://oysteing.blogspot.com/2017/01/improving-stability-of-mysql-single.html
Upcoming series of posts in my blog: http://kaamos.me/blog

Thank You!

Questions?