Functional System Simulation with SimuBoost GI Fachgruppentreffen - - PowerPoint PPT Presentation


Towards Scalable Parallelization of Functional System Simulation with SimuBoost. GI Fachgruppentreffen Betriebssysteme (BS) 2016. Marc Rittinghaus, Frank Bellosa. Operating Systems Group, Department of Computer Science.


SLIDE 1

KIT – University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association OPERATING SYSTEMS GROUP DEPARTMENT OF COMPUTER SCIENCE

www.kit.edu

Towards Scalable Parallelization of Functional System Simulation with SimuBoost

GI Fachgruppentreffen Betriebssysteme (BS) 2016 Marc Rittinghaus, Frank Bellosa

[Figure: SimuBoost architecture in three phases.
  Phase 1: virtualization (Node 0 and Node 1, one virtualization per core) records non-deterministic events and writes checkpoints/logs to central storage.
  Phase 2: simulation nodes simulate intervals 0 to 7 and write trace data through SimuTrace to storage providers.
  Phase 3: an analysis node reads trace data through SimuTrace and runs custom analyses that produce the analysis results.
  Components shown: virtualization node, management node, simulation nodes, analysis node, input processor, central storage (virtualization logs, checkpoints, simulation traces).]

SLIDE 2

Operating Systems Group Department of Computer Science

Motivation

  • Study properties of redundant memory contents [Miller13]
      • Origin? Lifetime? Sharing possible?
      • Analyze memory contents after each modification
      • But: Analysis should not affect workload
  • Analyze memory access patterns on system interfaces [Jurczyk13, Wilhelm15]
      • Detect vulnerabilities in Windows 8 and Xen (CVE-2015-8550)
      • Trace individual memory reads and writes

Marc Rittinghaus - SimuBoost

We want detailed runtime information

SLIDE 3

Motivation

Operating system research

  • Debugging
  • Application, OS, and hardware interaction
  • Malware and vulnerabilities

Functional Full System Simulation

But: It is slow

Approach         System   Slowdown
Virtualization   KVM      ~1x
Simulation       QEMU     ~100x
Simulation       Simics   ~1000x

Average slowdowns for: kernel build, SPECint_base06, LAMMPS

  • Not practical for long-running workloads
  • Loss of interactivity (users and remote hosts)

SLIDE 4

Basic Acceleration Approach

(1) Split simulation into time intervals
(2) Simulate intervals simultaneously

  • Does not trade accuracy for speed
  • Applicable to single-CPU simulations
  • Scales with run-time of workload

  • How to bootstrap the simulation of i[1..n]?
  • Still no interactivity

SLIDE 5

SimuBoost

Leverage fast virtualization

  • Checkpoints at interval boundaries bootstrap simulations
  • Hardware acceleration provides full interactivity
  • Speed difference drives parallelization

[Figure: timeline t. The virtualization node (vNode) runs intervals i[0] … i[k] … i[n]; Node 0 simulates i[0], Node k simulates i[k], Node n simulates i[n].]
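
The idea on this slide can be illustrated with a minimal sketch. It is not SimuBoost code: the "guest" is a hypothetical toy machine whose whole state is one integer, `virtualize` stands in for the fast hardware-assisted run, and `simulate_interval` stands in for the slow functional simulation. All names are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def step(state: int) -> int:
    """One deterministic 'instruction' of a toy machine (stand-in for a guest)."""
    return (state * 1103515245 + 12345) % 2**31


def run(state: int, steps: int) -> int:
    """Execute `steps` instructions starting from `state`."""
    for _ in range(steps):
        state = step(state)
    return state


def virtualize(initial: int, interval_len: int, n_intervals: int) -> list[int]:
    """Fast pass: run the workload and keep a checkpoint at every interval boundary."""
    checkpoints = [initial]
    for _ in range(n_intervals):
        checkpoints.append(run(checkpoints[-1], interval_len))
    return checkpoints


def simulate_interval(checkpoint: int, interval_len: int) -> int:
    """Slow pass for one interval, bootstrapped from its checkpoint.
    A real simulator would also emit trace data here."""
    return run(checkpoint, interval_len)


def simulate_parallel(checkpoints: list[int], interval_len: int, workers: int = 4) -> list[int]:
    """Simulate all intervals concurrently; interval k starts from checkpoints[k]."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda c: simulate_interval(c, interval_len), checkpoints[:-1]))


checkpoints = virtualize(initial=42, interval_len=1000, n_intervals=8)
end_states = simulate_parallel(checkpoints, interval_len=1000)
# Each simulated interval must end in the state recorded by the next checkpoint.
assert end_states == checkpoints[1:]
```

Because every interval starts from a recorded checkpoint, no simulation has to wait for the one before it, which is what makes the parallel pass possible.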

SLIDE 6

SimuBoost

Leverage fast virtualization

  • Checkpoints at interval boundaries bootstrap simulations
  • Hardware virtualization provides full interactivity
  • Speed difference drives parallelization

[Figure: timeline t. The virtualization node (vNode) runs intervals i[0] … i[k] … i[n]; Node 0 simulates i[0], Node k simulates i[k], Node n simulates i[n].]

Challenges: Preserve interactivity and speedup

(1) Fast Checkpoint Creation: <100ms [RbMiller68]
(2) Fast Checkpoint Distribution

SLIDE 7

Stop-And-Copy

[Figure: virtualization timeline; the VM is suspended between i[k] and i[k+1].]

SLIDE 8

Stop-And-Copy

[Figure: virtualization timeline; while the VM is suspended between i[k] and i[k+1], VM RAM is copied into a checkpoint.]

SLIDE 9

Stop-And-Copy

[Figure: virtualization timeline; VM suspended between i[k] and i[k+1]; checkpoint.]

SLIDE 10

Stop-And-Copy

[Figure: virtualization timeline; the VM is suspended between i[k] and i[k+1], and the suspension is the downtime. Annotation: 30% speedup loss.]

[Plot: downtime (ms) over memory size (MiB) for pts_build_linux_kernel.]

Memory Size (MiB)   Downtime (ms)
256                 101
512                 193
1024                301
2048                825
4096                1555
8192                2667
16384               4321

  • Downtime depends on VM size
  • Not suited for interactive use
  • Limited parallelization

We need to drastically speed up checkpointing

SLIDE 11

Incremental Stop-And-Copy

Observation: Only some data modified per interval

[Figure: VM RAM over the virtualization timeline, i[k] to i[k+1].]

Workload                 Modified data
pts_build_linux_kernel   22000 pages/s (85 MiB/s)
spec_jbb                 53000 pages/s (200 MiB/s)
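
The page rates and bandwidths on this slide can be related with a short conversion. The sketch assumes 4 KiB pages, which the slide does not state; `PAGE_SIZE` and the function name are illustrative.

```python
PAGE_SIZE = 4096  # bytes; assumed page size, not stated on the slide


def pages_per_s_to_mib_per_s(pages_per_s: float, page_size: int = PAGE_SIZE) -> float:
    """Convert a page modification rate into a data rate in MiB/s."""
    return pages_per_s * page_size / 2**20


for workload, rate in {"pts_build_linux_kernel": 22000, "spec_jbb": 53000}.items():
    print(f"{workload}: {pages_per_s_to_mib_per_s(rate):.1f} MiB/s")
```

Under that assumption the results come out at roughly 86 MiB/s and 207 MiB/s, in line with the rounded figures on the slide.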

SLIDE 12

Incremental Stop-And-Copy

Idea: Save only modified data

  • Track dirty pages via page protections
  • Use previous checkpoints to get unmodified data

[Figure: virtualization timeline; while the VM is suspended between i[k] and i[k+1], VM RAM is copied into a checkpoint.]

SLIDE 13

Incremental Stop-And-Copy

Idea: Save only modified data

  • Track dirty pages via page protections
  • Use previous checkpoints to get unmodified data

[Figure: virtualization timeline; VM suspended between i[k] and i[k+1]; checkpoint.]
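
A minimal sketch of the incremental idea follows. It models guest RAM as a Python list and tracks dirty pages in an explicit `write` method; a real hypervisor would detect writes through page protections, as the slide says. The class and method names are hypothetical.

```python
class IncrementalCheckpointer:
    """Toy model of incremental stop-and-copy checkpointing."""

    def __init__(self, num_pages: int):
        self.ram = [b"\x00"] * num_pages
        self.dirty = set(range(num_pages))  # the first checkpoint must save everything
        self.checkpoints = []               # one {page_index: content} delta per checkpoint

    def write(self, page: int, data: bytes) -> None:
        """Guest write; stands in for a write fault on a protected page."""
        self.ram[page] = data
        self.dirty.add(page)

    def checkpoint(self) -> int:
        """Copy only the dirty pages while the VM is stopped; return how many were copied."""
        delta = {p: self.ram[p] for p in self.dirty}
        self.checkpoints.append(delta)
        self.dirty.clear()
        return len(delta)

    def restore(self, index: int) -> list:
        """Rebuild full RAM for checkpoint `index`, taking unmodified pages
        from the previous checkpoints."""
        ram = [None] * len(self.ram)
        for delta in self.checkpoints[: index + 1]:
            for page, data in delta.items():
                ram[page] = data
        return ram


cp = IncrementalCheckpointer(num_pages=8)
full = cp.checkpoint()          # initial checkpoint copies all 8 pages
cp.write(3, b"A")
cp.write(5, b"B")
second = cp.checkpoint()        # copies only pages 3 and 5
cp.write(3, b"C")
third = cp.checkpoint()         # copies only page 3
```

The amount copied per checkpoint now follows the number of modified pages rather than the VM size, which is why the downtime becomes less dependent on memory size.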

SLIDE 14

Incremental Stop-And-Copy

  • Reduced downtime
  • Less dependent on VM size

[Figure: virtualization timeline; VM suspended between i[k] and i[k+1]. Annotation: Saved downtime.]

[Plot: downtime (ms, axis 50 to 200) over memory size (MiB, 256 to 16384) for pts_build_linux_kernel (interval = 16000 ms).]

SLIDE 15

Incremental Stop-And-Copy

  • Reduced downtime
  • Less dependent on VM size

But: Downtime depends on

  • Interval length
  • Workload

[Figure: virtualization timeline; VM suspended between i[k] and i[k+1]. Annotation: Saved downtime.]

[Plot: downtime (ms, axis 50 to 400) over interval length (ms; 100, 500, 1000, 2000, 4000, 8000, 16000) for idle, pts_apache, pts_build_linux_kernel, pts_postmark, spec_jbb, stress.]

SLIDE 16

Incremental Stop-And-Copy

  • Reduced downtime
  • Less dependent on VM size

But: Downtime depends on

  • Interval length
  • Workload

But: Downtime strongly fluctuates

[Figure: virtualization timeline; VM suspended between i[k] and i[k+1]. Annotation: Saved downtime.]

[Plot: downtime (ms, axis 50 to 300) over checkpoint index for spec_jbb (interval = 500 ms). Annotations: mean = 77; 25% above 100ms; 60% above 100ms.]

We need to further speed up checkpointing

SLIDE 17

Incremental Copy-On-Write

[Figure: VM RAM over the virtualization timeline, i[k] to i[k+1].]

SLIDE 18

Incremental Copy-On-Write

Idea: Save modified pages asynchronously

  • Use write-protection to prevent modification

[Figure: at the boundary between i[k] and i[k+1], pages in VM RAM are write-protected.]

SLIDE 19

Incremental Copy-On-Write

Idea: Save modified pages asynchronously

  • Use write-protection to prevent modification

[Figure: VM RAM over the virtualization timeline, i[k] to i[k+1].]

SLIDE 20

Incremental Copy-On-Write

Idea: Save modified pages asynchronously

  • Use write-protection to prevent modification
  • Copy and release protection on page fault

[Figure: VM RAM, checkpoint, and a page fault during i[k+1].]
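
The copy-on-write scheme can be sketched as follows. This is a toy model under stated assumptions, not the SimuBoost implementation: write-protection is a Python set, a "page fault" is a check inside `write`, and the asynchronous saver is an explicit `background_copy` call rather than a separate thread. All names are hypothetical.

```python
class CowCheckpointer:
    """Toy model of incremental copy-on-write checkpointing."""

    def __init__(self, num_pages: int):
        self.ram = [b"\x00"] * num_pages
        self.dirty = set(range(num_pages))  # pages modified since the last checkpoint
        self.protected = set()              # pages still owed to the checkpoint in progress
        self.pending = {}                   # checkpoint being assembled: {page: content}
        self.checkpoints = []
        self.in_progress = False

    def begin_checkpoint(self) -> None:
        """Short downtime: only write-protect the dirty pages, copy nothing yet."""
        self.finish_checkpoint()            # complete the previous checkpoint first
        self.protected = set(self.dirty)
        self.dirty.clear()
        self.pending = {}
        self.in_progress = True

    def _save(self, page: int) -> None:
        """Copy one page into the checkpoint and release its protection."""
        self.pending[page] = self.ram[page]
        self.protected.discard(page)

    def write(self, page: int, data: bytes) -> None:
        """Guest write; a write to a protected page 'faults' and saves the old content first."""
        if page in self.protected:
            self._save(page)
        self.ram[page] = data
        self.dirty.add(page)

    def background_copy(self, max_pages: int) -> None:
        """Asynchronous saver: copy some protected pages while the guest keeps running."""
        for page in sorted(self.protected)[:max_pages]:
            self._save(page)

    def finish_checkpoint(self) -> None:
        """Save whatever is still protected and store the finished checkpoint."""
        if not self.in_progress:
            return
        for page in list(self.protected):
            self._save(page)
        self.checkpoints.append(self.pending)
        self.pending = {}
        self.in_progress = False


cow = CowCheckpointer(num_pages=4)
cow.write(0, b"old")
cow.begin_checkpoint()      # pages 0..3 become write-protected
cow.write(0, b"new")        # fault: b"old" is saved before the write goes through
cow.background_copy(2)      # saver copies pages 1 and 2
cow.finish_checkpoint()     # page 3 is copied, checkpoint is complete
```

The checkpoint still contains the content as of the interval boundary (`b"old"` for page 0) even though the guest kept running, and the only work done while the guest is stopped is setting the protections.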

SLIDE 21

Incremental Copy-On-Write

  • Drastically reduced downtime
  • Page faults do not impede interactivity
  • Less dependent on
      • Interval length
      • Workload

[Plot: downtime (ms, axis 10 to 100) over interval length (ms; tick labels truncated in the source) for idle, pts_apache, pts_build_linux_kernel, pts_postmark, spec_jbb, stress.]

SLIDE 22

Incremental Copy-On-Write

  • Drastically reduced downtime
  • Page faults do not impede interactivity
  • Less dependent on
      • Interval length
      • Workload
  • Almost constant downtime

[Plot: downtime (ms, axis 10 to 100) over checkpoint index for spec_jbb (interval = 500 ms). Annotation: mean = 7.]

We can do checkpointing fast enough

SLIDE 23

Checkpoint Distribution – The Naïve Way

  • Nodes request full checkpoints from central server
  • But: Central server becomes bottleneck
      • Limits parallelization and speedup

[Figure: the virtualization host serves checkpoints 1 to 4 to Node 1 to Node 4 and is marked as the bottleneck.]

SLIDE 24

SimuBoost Evaluation

Prototype: 1GiB RAM, 1s intervals, 4 simulation nodes

  • SimuBoost delivers predicted speedup [Rittinghaus13]
  • But: Saturates 10 Gbit Ethernet
  • Need to avoid single bottleneck

[Plot: speedup factor (axis 1.0 to 4.0) over interval length (ms; 1000, 2000, 4000, 6000, 8000) for build-linux-kernel compared with the analytical model.]
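
A back-of-the-envelope calculation shows why the prototype configuration stresses the network. The sketch assumes that one full RAM image is transferred per interval, as in the naïve distribution scheme, and ignores protocol overhead; the function name is illustrative.

```python
def checkpoint_bandwidth_gbit_s(ram_gib: float, interval_s: float) -> float:
    """Network rate (Gbit/s) needed if one full RAM image is transferred per interval."""
    bits_per_checkpoint = ram_gib * 2**30 * 8
    return bits_per_checkpoint / interval_s / 1e9


# Prototype configuration from the slide: 1 GiB RAM, 1 s intervals.
required = checkpoint_bandwidth_gbit_s(ram_gib=1, interval_s=1.0)
print(f"{required:.2f} Gbit/s")
```

Under these assumptions the required rate is about 8.6 Gbit/s, which leaves little headroom on a 10 Gbit link before any overhead is counted.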

SLIDE 25

Future Checkpoint Distribution

Idea: Only send new data

  • Deduplicate and compress data
  • Use distributed file system (e.g., Ceph [Weil06])
  • Append new data to global file
  • Checkpoint = Map of VM addresses to offsets in file

[Figure: the virtualization host and Node 1 to Node 4, each node with its own FS cache.]

                               pts_build_linux_kernel     spec_jbb
Modified data                  22000 pages/s (85 MiB/s)   53000 pages/s (200 MiB/s)
New data (row label presumed)  5000 pages/s (20 MiB/s)    16000 pages/s (65 MiB/s)
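
The "checkpoint = map of VM addresses to offsets" idea can be sketched in a few lines. This is an illustration only: the global file is an in-memory `bytearray` rather than a file on a distributed file system, deduplication uses SHA-256 content hashes (an assumption, the slide names no method), and compression is left out. `DedupStore` and `make_checkpoint` are hypothetical names.

```python
import hashlib


class DedupStore:
    """Append-only 'global file' that stores each distinct page content once."""

    def __init__(self, page_size: int = 4096):
        self.page_size = page_size
        self.data = bytearray()
        self.offset_by_hash = {}

    def put(self, page: bytes) -> int:
        """Return the offset of `page`, appending it only if the content is new."""
        digest = hashlib.sha256(page).digest()
        offset = self.offset_by_hash.get(digest)
        if offset is None:
            offset = len(self.data)
            self.data += page
            self.offset_by_hash[digest] = offset
        return offset

    def get(self, offset: int) -> bytes:
        return bytes(self.data[offset:offset + self.page_size])


def make_checkpoint(store: DedupStore, modified_pages: dict, previous: dict = None) -> dict:
    """Checkpoint = map from guest page address to offset in the store.
    Unmodified pages keep the offsets of the previous checkpoint."""
    checkpoint = dict(previous or {})
    for addr, content in modified_pages.items():
        checkpoint[addr] = store.put(content)
    return checkpoint


store = DedupStore(page_size=4)  # tiny pages keep the example readable
cp0 = make_checkpoint(store, {0: b"AAAA", 4: b"BBBB", 8: b"AAAA"})
cp1 = make_checkpoint(store, {4: b"CCCC"}, previous=cp0)
```

Here the two identical pages in `cp0` share one offset, and `cp1` adds only one new page to the store, so a node that already cached the earlier data would need to fetch just that page.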

SLIDE 26

Conclusion

  • Slowdown of Functional Full System Simulation: >100x
  • SimuBoost: Accelerate simulation
      • Run workload with fast virtualization
      • Take checkpoints in regular intervals
      • Start parallel simulations on checkpoints
  • Challenges
      • Fast checkpoint creation: Incremental Copy-On-Write
      • Fast checkpoint distribution: Distributed file system

[Figure: timeline t. The virtualization node (vNode) runs intervals i[0] … i[k] … i[n]; Node 0 simulates i[0], Node k simulates i[k], Node n simulates i[n].]


SLIDE 28

Deterministic Replay

Non-deterministic events (e.g., interrupts, timing instructions)

  • …appear at equal points in the instruction stream
  • …produce same data output

(1) Trap and log non-deterministic events in the hypervisor
(2) Precisely replay events in the simulation

Existing work: Retrace [Sheldon07], V2E [Yan12]

[Figure, without replay: Node 0 virtualizes i[1] and i[2]; Node 1 simulates i[1] and Node 2 simulates i[2]; an interrupt is delivered at a different point in the simulation. States mismatch.]

[Figure, with log (1) and replay (2): the interrupt logged on Node 0 is replayed in the simulation. States match.]
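
The two steps can be illustrated with a toy record/replay loop. The "CPU" is a hypothetical deterministic state machine, events are identified by their instruction count, and `live_source` stands in for real interrupt or timer sources; none of these names come from SimuBoost, Retrace, or V2E.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    instr_count: int  # position in the instruction stream
    value: int        # data delivered by the event (e.g. a timestamp)


def execute(num_instr: int, events_at) -> int:
    """Toy deterministic CPU. `events_at(i)` returns a value to inject
    before instruction i, or None if no event occurs there."""
    state = 1
    for i in range(num_instr):
        injected = events_at(i)
        if injected is not None:
            state ^= injected
        state = (state * 31 + 7) % 1_000_003
    return state


def record(num_instr: int, live_source):
    """Virtualization pass: log every non-deterministic event with its position."""
    log = []

    def events_at(i):
        value = live_source(i)
        if value is not None:
            log.append(Event(i, value))
        return value

    return execute(num_instr, events_at), log


def replay(num_instr: int, log) -> int:
    """Simulation pass: inject the logged events at the same instruction counts."""
    by_position = {e.instr_count: e.value for e in log}
    return execute(num_instr, by_position.get)


rng = random.Random()  # unseeded on purpose: the live run is non-deterministic


def live_source(i):
    return rng.randrange(256) if rng.random() < 0.01 else None


recorded_state, event_log = record(10_000, live_source)
replayed_state = replay(10_000, event_log)
assert replayed_state == recorded_state
```

Replaying the same values at the same instruction counts is what makes the simulated interval end in the state the next checkpoint expects.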

SLIDE 29

Speedup and Scalability

Right interval length is crucial

  • Too short (a): Checkpoint time dominates
  • Too long (c): Little parallelization, long simulation of final interval

Example scenario:

  • 100ms downtime, 8% logging, 100x slowdown
  • Optimal interval length: 2s
  • Best possible speedup for 1h workload: 84x @ 90 nodes (94% parallel efficiency)

Near linear speedup possible

[Figure: virtualization and simulation timelines for three interval lengths a), b), c). Labels: L, L_opt, t_c, t_i, s_sim, s_log, total run-time T_ps.]
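
The trade-off can be made concrete with a simplified run-time model. The model below is an assumption of this sketch and not necessarily the analytical model of [Rittinghaus13]: the parallel run takes the virtualization time with logging, plus one checkpoint downtime per interval, plus the simulation of the final interval, and enough nodes are assumed to be available. The symbol names follow the labels on the slide.

```python
import math


def parallel_runtime(T: float, L: float, t_c: float, s_log: float, s_sim: float) -> float:
    """T: native workload run-time (s), L: interval length (s), t_c: downtime per
    checkpoint (s), s_log: logging overhead as a fraction, s_sim: simulation slowdown."""
    virtualization = T * (1 + s_log)     # workload under virtualization with event logging
    downtime = virtualization / L * t_c  # one checkpoint per interval
    last_interval = s_sim * L            # the final interval can only start at the very end
    return virtualization + downtime + last_interval


def speedup(T: float, L: float, t_c: float, s_log: float, s_sim: float) -> float:
    """Speedup over a conventional serial simulation taking s_sim * T."""
    return s_sim * T / parallel_runtime(T, L, t_c, s_log, s_sim)


def optimal_interval(T: float, t_c: float, s_log: float, s_sim: float) -> float:
    """Interval length minimising downtime + last_interval (derivative set to zero)."""
    return math.sqrt(T * (1 + s_log) * t_c / s_sim)


# Example scenario from the slide: 1 h workload, 100 ms downtime, 8% logging, 100x slowdown.
L_opt = optimal_interval(T=3600, t_c=0.1, s_log=0.08, s_sim=100)
best = speedup(T=3600, L=2.0, t_c=0.1, s_log=0.08, s_sim=100)
print(f"L_opt = {L_opt:.2f} s, speedup at L = 2 s: {best:.1f}x")
```

Even this simplified model gives an optimal interval length of roughly 2 s and a speedup of roughly 84x for the example scenario. It also shows both failure modes: shrinking `L` inflates the downtime term, while growing `L` inflates the final-interval term.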

SLIDE 30

Selected Previous Research

  • Workload Reduction
      • MinneSPEC [KleinOsowski02]
  • Simulate samples and extrapolate
      • Truncated Execution
      • SimPoints [Sherwood02]
      • SMARTS [Wunderlich03]
  • Improve simulation engine
      • Optimize engine: below 5x speedup mark
      • Parallelize simulation of vCPUs [Ding11]
  • Divide simulation time
      • For microarchitectural simulations: DiST [Girbal03]

SLIDE 31

References

[Miller13] K. Miller et al. XLH: More effective memory deduplication scanners through cross-layer hints. USENIX, 2013
[Wilhelm15] F. Wilhelm. Tracing Privileged Memory Accesses to Discover Software Vulnerabilities. Master Thesis, KIT, 2015
[Jurczyk13] M. Jurczyk et al. Bochspwn: Exploiting Kernel Race Conditions Found via Memory Access Patterns. 2013
[Rittinghaus13] M. Rittinghaus. SimuBoost: Scalable Parallelization of Functional System Simulation. WODA, 2013
[Weil06] S. A. Weil et al. Ceph: A Scalable, High-Performance Distributed File System. OSDI, 2006
[Bellard05] F. Bellard. Qemu: A Fast and Portable Dynamic Translator. USENIX, 2005
[Magnusson02] P. Magnusson et al. Simics: A Full System Simulation Platform. Computer, 35(2), 2002
[Sherwood02] T. Sherwood et al. Automatically Characterizing Large Scale Program Behavior. ACM SIGARCH, 30(5), 2002
[Ding11] J. Ding et al. PQEMU: A Parallel System Emulator Based on QEMU. ICPADS, 2011
[Wunderlich03] R. E. Wunderlich et al. SMARTS: Accelerating Microarchitecture Simulation via Rigorous Statistical Sampling. Computer Architecture, 2003
[Girbal03] S. Girbal et al. DiST: A Simple, Reliable and Scalable Method to Significantly Reduce Processor Architecture Simulation Time. SIGMETRICS, 31(1), 2003
[KleinOsowski02] A. J. KleinOsowski et al. MinneSPEC: A New SPEC Benchmark Workload for Simulation-Based Computer Architecture Research. IEEE Computer Architecture Letters 1.1, 2002
[Sheldon07] M. Sheldon et al. Retrace: Collecting Execution Trace With Virtual Machine Deterministic Replay. MoBS, 2007
[Yan12] L. Yan et al. V2E: Combining Hardware Virtualization and Software Emulation for Transparent and Extensible Malware Analysis. VEE, 2012
[RbMiller68] Robert B. Miller. Response Time in Man-Computer Conversational Transactions. 1968