Accurate emulation of CPU performance Tomasz Buchert 1 Lucas Nussbaum - - PowerPoint PPT Presentation



SLIDE 1

Accurate emulation of CPU performance

Tomasz Buchert (1), Lucas Nussbaum (2), Jens Gustedt (1)

(1) INRIA Nancy – Grand Est
(2) LORIA / Nancy-Université

SLIDE 2

Validation of distributed systems

Approaches:

  • Theoretical approach (paper and pencil)
      For: the most general results and understanding
      Against: very hard (leads to unsolvability results)

  • Experimentation (real application on a real environment)
      For: realistic context, credibility
      Against: difficulty of preparation and control, questionable reproducibility

  • Simulation (modeled application inside a modeled environment)
      For: very simple and perfectly reproducible
      Against: experimental bias, possibly unrealistic

  • Emulation (real application inside a modeled environment)
      For: control over the experiment parameters
      Against: difficult

Tomasz Buchert, Lucas Nussbaum and Jens Gustedt Accurate emulation of CPU performance 2 / 20

SLIDE 3

Emulation

The perfect emulated environment should emulate (independently):

  • Network bandwidth, latency, topology
  • Performance and number of CPUs
  • Memory capabilities
  • Background noise (network, CPU, faults)

Already implemented in Wrekavoc, a tool to define and control the heterogeneity of a cluster (but not perfect yet!)

In this talk, however, we specifically concentrate on emulation of the CPU.

SLIDE 5

CPU emulation

Various elements of CPU architecture could be emulated:

  • speed
  • number of cores
  • sizes and properties of caches (and topology thereof)
  • memory access speed (especially for NUMA systems)

In this talk, we will talk about degradation of CPU speed.

SLIDE 7

An example

[Figure: four CPUs, CPU 1 to CPU 4, labelled 50 %, 50 %, 70 % and 30 %, each with an "Unused" portion]

(1) Controlling the speed of each CPU/core independently

SLIDE 8

An example (continued)

[Figure: four CPUs, CPU 1 to CPU 4, labelled 50 %, 50 %, 70 % and 30 %, each with an "Unused" portion]

(2) Being able to create separated scheduling zones

SLIDE 9

Dynamic frequency scaling (CPU-Freq)

Also known as Intel Enhanced SpeedStep or AMD Cool’n’Quiet

Hardware solution to reduce:

  • heat
  • noise
  • power usage

For:

  • no overhead of emulation
  • completely unintrusive
  • meaningful CPU time measure

Against:

  • only a finite set of different frequency levels
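
Because only a finite set of levels exists, a requested emulated speed can only be approximated by the closest level. The sketch below illustrates that rounding step. It assumes frequencies in kHz, in the whitespace-separated format that some Linux cpufreq drivers expose through the sysfs file scaling_available_frequencies (not every driver provides it). The function names and the sample levels are hypothetical, not taken from the talk.

```python
def parse_levels(text):
    """Parse a whitespace-separated list of frequencies (kHz) into a sorted list."""
    return sorted(int(token) for token in text.split())


def nearest_level(target_khz, available_khz):
    """Return the available frequency level closest to the requested one."""
    if not available_khz:
        raise ValueError("no frequency levels available")
    return min(available_khz, key=lambda level: abs(level - target_khz))


# Hypothetical level list; on a real machine it might be read from
# /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies
# (assumption: the driver exposes that file).
levels = parse_levels("2933000 2800000 2667000 2533000 1600000")

# A 2.0 GHz request cannot be met exactly; the closest level is used instead.
print(nearest_level(2000000, levels))  # -> 1600000
```

The gap between the request (2.0 GHz) and the level actually chosen (1.6 GHz) is the coarse granularity listed under "Against".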

SLIDE 10

CPU-Lim

Method available in Wrekavoc

Algorithm:

  • if CPU usage ≥ threshold → send SIGSTOP to the process
  • if CPU usage < threshold → send SIGCONT to the process

CPU usage = (CPU time of the process) / (process lifetime)

For:

  • easy and almost POSIX-compliant

Against:

  • intrusive and unscalable
  • decision based on one process instead of global CPU usage
  • sleeping is indistinguishable from preemption
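
The CPU-Lim rule can be sketched in a few lines. This is an illustration of the rule as stated on the slide, not the Wrekavoc implementation. The function names are hypothetical, and how the CPU time and lifetime of the process are obtained is left to the caller.

```python
import os
import signal


def cpu_usage(cpu_time_s, lifetime_s):
    """CPU usage as defined on the slide: CPU time divided by process lifetime."""
    if lifetime_s <= 0:
        return 0.0
    return cpu_time_s / lifetime_s


def decide(cpu_time_s, lifetime_s, threshold):
    """Pick the signal the rule prescribes for one process."""
    if cpu_usage(cpu_time_s, lifetime_s) >= threshold:
        return signal.SIGSTOP
    return signal.SIGCONT


def control_step(pid, cpu_time_s, lifetime_s, threshold):
    """Apply one control decision to a process (hypothetical helper)."""
    chosen = decide(cpu_time_s, lifetime_s, threshold)
    os.kill(pid, chosen)
    return chosen


# A process that computed for 6 s out of 10 s exceeds a 50 % threshold.
print(decide(6.0, 10.0, 0.5) == signal.SIGSTOP)  # -> True
# A process that slept for 8 s and computed for 2 s looks "under budget".
print(decide(2.0, 10.0, 0.5) == signal.SIGCONT)  # -> True
```

The second case shows the last drawback in the list: time spent sleeping lowers the usage ratio exactly as time spent stopped does, so a process that waited earlier may later run unthrottled.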


SLIDE 11

Fracas

  • Based on an idea from KRASH (a load injection tool)
  • Uses Linux Cgroups and the Completely Fair Scheduler
  • A predefined portion of the CPU is given to tasks burning CPU
  • All other processes are given the remaining CPU time

[Figure: on each of Core 1, Core 2 and Core 3, a CPU burner runs alongside the emulated processes]

For:

  • unintrusive
  • scalable

Against:

  • unportable to other systems
  • sensitive to the configuration of the scheduler
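
The portion of the CPU handed to the burner follows from the target speed. The sketch below assumes that delivered performance scales linearly with the share of CPU time, so emulating frequency f on a core running at f_max leaves f / f_max to the emulated processes. It also assumes that the two groups are given relative weights in that proportion. Both are illustrative assumptions, as are the function names and the weight total; the talk does not give Fracas's actual cgroup settings.

```python
def burner_fraction(emulated_ghz, max_ghz):
    """Fraction of one core's time the CPU burner should consume."""
    if not 0 < emulated_ghz <= max_ghz:
        raise ValueError("emulated frequency must be in (0, max_ghz]")
    return 1.0 - emulated_ghz / max_ghz


def share_weights(emulated_ghz, max_ghz, total=10000):
    """Split a hypothetical weight total between burner and emulated group.

    Returns (burner_weight, emulated_weight). Real cgroup interfaces impose
    their own minimum and maximum weights, which this sketch ignores.
    """
    burner = round(burner_fraction(emulated_ghz, max_ghz) * total)
    return burner, total - burner


# Emulating a 2.4 GHz CPU on a 3.0 GHz core: the burner takes about 20 %.
print(share_weights(2.4, 3.0))  # -> (2000, 8000)
```

A weight-based split only yields these proportions while both groups are always runnable, which the burner guarantees on its side by never sleeping.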

SLIDE 13

Fracas and latency of the scheduler

[Figure: GFLOP/s (y-axis) against CPU frequency in GHz (x-axis), one curve per scheduler latency: 0.1 ms, 1 ms, 10 ms, 100 ms, 1000 ms]

The smaller the latency, the better the emulation

SLIDE 14

Evaluation

Based on different types of work:

  • CPU intensive (Linpack benchmark)
  • IO bound
  • multiprocessing
  • multithreading
  • memory speed (STREAM benchmark)

Methodology:

  • X-axis: emulated frequency
  • Y-axis: speed perceived by the benchmark
  • each test repeated 10 times, results = average
  • 95% confidence interval using Student's t-distribution

Evaluation performed on the Grid’5000 platform:

  • nodes with two quad-core Intel Xeon X5570 processors
  • nodes with a pair of single-core AMD Opteron 252 processors
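
The averaging and confidence interval for one data point (10 repetitions) can be computed as sketched below. This is an illustration of the standard formula, not the authors' script. The function name and the sample values are made up, and the critical value 2.262 is the two-sided 95 % Student's t quantile for 9 degrees of freedom.

```python
import math
import statistics

# Two-sided 95 % critical values of Student's t-distribution,
# keyed by degrees of freedom. Only the case needed here is listed.
T_CRITICAL_95 = {9: 2.262}


def mean_ci95(samples):
    """Return (mean, half_width) of the 95 % confidence interval."""
    n = len(samples)
    dof = n - 1
    if dof not in T_CRITICAL_95:
        raise ValueError("no critical value tabulated for %d samples" % n)
    mean = statistics.mean(samples)
    std_error = statistics.stdev(samples) / math.sqrt(n)
    return mean, T_CRITICAL_95[dof] * std_error


# Made-up measurements in GFLOP/s, for illustration only.
runs = [5.1, 5.0, 5.2, 4.9, 5.1, 5.0, 5.3, 5.0, 4.8, 5.1]
mean, half_width = mean_ci95(runs)
print("%.3f +/- %.3f GFLOP/s" % (mean, half_width))
```

A method with higher variance between repetitions produces a wider interval for the same number of runs.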

SLIDE 15

Grid’5000

9 sites, 1600 machines

Lille, Rennes, Orsay, Nancy, Bordeaux, Lyon, Grenoble, Toulouse, Sophia

Dedicated to research on distributed systems and HPC

SLIDE 16

CPU intensive work

[Figure: GFLOP/s (y-axis) against emulated CPU frequency in GHz (x-axis) for CPU-Freq, CPU-Lim and Fracas]

CPU-Lim is less predictable (the outcome has higher variance)

SLIDE 17

IO-bound work

[Figure: loops/s (y-axis) against emulated CPU frequency in GHz (x-axis) for CPU-Freq, CPU-Lim and Fracas]

CPU-Lim gives an (unfair) advantage to IO-bound tasks

SLIDE 18

Multiprocessing

[Figure: loops/s (y-axis) against emulated CPU frequency in GHz (x-axis) for CPU-Freq, CPU-Lim and Fracas]

Fracas can’t emulate CPU for multitask computation

SLIDE 19

Multithreading

[Figure: loops/s (y-axis) against emulated CPU frequency in GHz (x-axis) for CPU-Freq, CPU-Lim and Fracas]

CPU-Lim controls processes instead of scheduling entities

SLIDE 20

Memory speed

[Figure: memory throughput, labelled GB/s (y-axis), against emulated CPU frequency in GHz (x-axis) for CPU-Freq, CPU-Lim and Fracas]

Memory speed is affected differently by each method

SLIDE 21

Summary of the evaluation

CPU-Freq:

  • very good results
  • coarse granularity

CPU-Lim:

  • not scalable due to implementation, intrusive
  • higher variance
  • controls processes, not threads

Fracas:

  • good behavior for a single-task workload
  • scalable
  • bad behavior for multitask workload

SLIDE 22

Future work

  • Explore other approaches
  • Improve Fracas to cover multitasking
  • Emulate memory bandwidth
  • Emulate other aspects of CPU
  • Integrate Fracas into Wrekavoc
  • Take over the world :)

SLIDE 23

Conclusions

  • Presented Fracas, a method for CPU performance emulation based on Linux cgroups
  • Compared with CPU-Freq and CPU-Lim (Wrekavoc)
  • Evaluated experimentally on Grid’5000
  • None of the methods is perfect:
      • CPU-Freq: coarse grained
      • CPU-Lim: implementation problems, not scalable
      • Fracas: works perfectly in the single thread/process case, needs work in the multithread/process case

Questions?
