SimBench: A Portable Benchmarking Methodology for Full-System Simulators




slide-1
SLIDE 1

SimBench

A Portable Benchmarking Methodology for Full-System Simulators

Harry Wagstaff, Bruno Bodin, Tom Spink, Björn Franke

Institute for Computing Systems Architecture, University of Edinburgh

ISPASS 2017

1

slide-2
SLIDE 2

  • Motivation
    • Instruction Set Simulation
    • Evaluating Simulation Tools
  • The SimBench Methodology
    • Overview
    • Implementation
    • Porting SimBench
  • Evaluation

2

slide-3
SLIDE 3

Instruction Set Simulation

Instruction Set Simulation is used in a wide variety of contexts:

Motivation Instruction Set Simulation 3

slide-4
SLIDE 4

Instruction Set Simulation

Instruction Set Simulation is used in a wide variety of contexts:

  • Design Space Exploration
    • Gem5
    • Multi2Sim

Motivation Instruction Set Simulation 3

slide-5
SLIDE 5

Instruction Set Simulation

Instruction Set Simulation is used in a wide variety of contexts:

  • Design Space Exploration
  • Software Development
    • QEMU
    • Android Emulator

Motivation Instruction Set Simulation 3

slide-6
SLIDE 6

Instruction Set Simulation

Instruction Set Simulation is used in a wide variety of contexts:

  • Design Space Exploration
  • Software Development
  • Backwards Compatibility
    • Apple Rosetta
    • Nintendo NES Classic

Motivation Instruction Set Simulation 3

slide-7
SLIDE 7

Instruction Set Simulation

Instruction Set Simulation is used in a wide variety of contexts:

  • Design Space Exploration
  • Software Development
  • Backwards Compatibility

Simulators can be broken up into several categories:

  • Functional Only
  • Profiling/Performance Modelling
  • User-Mode
  • Full System

Motivation Instruction Set Simulation 3

slide-8
SLIDE 8

User-Mode Simulation

  • Execute a single binary
  • Emulate System Calls
  • Simplified Memory System

[Diagram components: Simulated CPU; Syscall Emulation Layer; Flat Memory; Host Console; Host Timers; Host File System]

Motivation Instruction Set Simulation 4

slide-9
SLIDE 9

Full-System Simulation

  • Boot out of reset
  • System-mode Instructions
  • Memory Translation
  • Asynchronous Interrupts
  • External Devices

[Diagram components: Simulated CPU; Simulated MMU; Physical Memory; Simulated Serial Port (Host Console); Simulated Timers (Host Timers); Simulated Storage Device (Host File System)]

Motivation Instruction Set Simulation 5

slide-10
SLIDE 10

Fast Simulation

We want to make our simulators go fast!

To be clear, we want the simulator itself to be efficient.

Motivation Instruction Set Simulation 6

slide-11
SLIDE 11

Fast Simulation

We want to make our simulators go fast!

To be clear, we want the simulator itself to be efficient.

How?

  • Dynamic Binary Translation
  • Memory-Related Techniques
  • Efficient Interrupt Modelling
  • Control Flow Handling

Motivation Instruction Set Simulation 6

slide-12
SLIDE 12

Fast Simulation

We want to make our simulators go fast!

To be clear, we want the simulator itself to be efficient.

How?

  • Dynamic Binary Translation
  • Memory-Related Techniques
  • Efficient Interrupt Modelling
  • Control Flow Handling

How do we evaluate the effectiveness of these techniques in spite of complex interactions?

Motivation Instruction Set Simulation 6

slide-13
SLIDE 13

Typical Benchmarking Approaches

Most common approaches fall into two categories:

  • Large, complex macrobenchmark suites
  • Small, targeted but ad-hoc microbenchmarks

Both approaches are problematic for several reasons.

Motivation Evaluating Simulation Tools 7

slide-14
SLIDE 14

Macrobenchmarking

The most common approach is to use large macrobenchmarks. However, this presents several problems:

  • Benchmark runtimes are long
  • Difficult to perform detailed analysis
  • Aggregated runtimes may hide interesting results

[Figure: Speedup by QEMU version (1.7.0 to 2.5.0-rc2) for sjeng, mcf and SPEC (overall); y-axis from 0.7 to 1.4]
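The point about aggregated runtimes can be made concrete with a small sketch. The runtimes below are invented purely for illustration (they are not measurements from the talk), and the function and array names are hypothetical. The sketch compares the speedup computed from a summed suite runtime with the speedups of the individual benchmarks.

```c
#include <assert.h>
#include <stddef.h>

/* Speedup of one measurement: old runtime divided by new runtime. */
static double speedup(double old_secs, double new_secs)
{
    assert(new_secs > 0.0);
    return old_secs / new_secs;
}

/* Speedup of a whole suite when only the summed runtime is reported. */
static double aggregate_speedup(const double *old_secs,
                                const double *new_secs, size_t n)
{
    double old_total = 0.0, new_total = 0.0;
    for (size_t i = 0; i < n; ++i) {
        old_total += old_secs[i];
        new_total += new_secs[i];
    }
    return speedup(old_total, new_total);
}

/* Invented runtimes in seconds, chosen so that the two changes cancel out. */
static const double example_old[] = { 10.0, 10.0 };
static const double example_new[] = {  5.0, 15.0 };
```

With these made-up inputs the first benchmark becomes twice as fast and the second becomes slower, yet the aggregate speedup is exactly 1.0, so a report of the total alone would show no change at all.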

Motivation Evaluating Simulation Tools 8

slide-15
SLIDE 15

Ad-Hoc Microbenchmarking

Many papers also attempt to use ad-hoc microbenchmarks to assess performance. However, these also have problems:

  • Source code may be unavailable, damaging reproducibility
  • Secondary effects may not be taken into account

Motivation Evaluating Simulation Tools 9

slide-16
SLIDE 16

Sum Two Arrays - Source Code

    float out[ARRAY_SIZE];
    float a[ARRAY_SIZE];
    float b[ARRAY_SIZE];
    ...
    void foo() {
        for (int i = 0; i < ARRAY_SIZE; ++i) {
            out[i] = a[i] + b[i];
        }
    }

Motivation Evaluating Simulation Tools 10

slide-17
SLIDE 17

Sum Two Arrays - Assembly Code

    foo:
       0:  add      r3, r0, #512    ; 0x200
       4:  vldmia   r0!, {s15}
       8:  vldmia   r1!, {s14}
       c:  cmp      r0, r3
      10:  vadd.f32 s15, s15, s14
      14:  vstmia   r2!, {s15}
      18:  bne      4 <foo+0x4>
      1c:  bx       lr

Motivation Evaluating Simulation Tools 11

slide-18
SLIDE 18

Sum Two Arrays - Assembly Code

    foo:
       0:  add      r3, r0, #512    ; 0x200
       4:  vldmia   r0!, {s15}
       8:  vldmia   r1!, {s14}
       c:  cmp      r0, r3
      10:  vadd.f32 s15, s15, s14
      14:  vstmia   r2!, {s15}
      18:  bne      4 <foo+0x4>
      1c:  bx       lr

  ✘ Instruction Page Fault
  ✘ Code Generation

Instruction Page fault, Code generation

Motivation Evaluating Simulation Tools 11

slide-19
SLIDE 19

Sum Two Arrays - Assembly Code

    foo:
       0:  add      r3, r0, #512    ; 0x200
       4:  vldmia   r0!, {s15}
       8:  vldmia   r1!, {s14}
       c:  cmp      r0, r3
      10:  vadd.f32 s15, s15, s14
      14:  vstmia   r2!, {s15}
      18:  bne      4 <foo+0x4>
      1c:  bx       lr

  ✘ Instruction Page Fault
  ✘ Code Generation
  ✘ Cold Memory Access
  ✘ Data Page Fault

Cold memory access, Data Page fault

Motivation Evaluating Simulation Tools 11

slide-20
SLIDE 20

Sum Two Arrays - Assembly Code

    foo:
       0:  add      r3, r0, #512    ; 0x200
       4:  vldmia   r0!, {s15}
       8:  vldmia   r1!, {s14}
       c:  cmp      r0, r3
      10:  vadd.f32 s15, s15, s14
      14:  vstmia   r2!, {s15}
      18:  bne      4 <foo+0x4>
      1c:  bx       lr

  ✘ Instruction Page Fault
  ✘ Code Generation
  ✘ Cold Memory Access
  ✘ Data Page Fault
  ✘ Hot Memory Access

Hot memory access

Motivation Evaluating Simulation Tools 11

slide-21
SLIDE 21

Sum Two Arrays - Assembly Code

    foo:
       0:  add      r3, r0, #512    ; 0x200
       4:  vldmia   r0!, {s15}
       8:  vldmia   r1!, {s14}
       c:  cmp      r0, r3
      10:  vadd.f32 s15, s15, s14
      14:  vstmia   r2!, {s15}
      18:  bne      4 <foo+0x4>
      1c:  bx       lr

  ✘ Instruction Page Fault
  ✘ Code Generation
  ✘ Cold Memory Access
  ✘ Data Page Fault
  ✘ Hot Memory Access
  ✓ FP Operation

FP Operation

Motivation Evaluating Simulation Tools 11

slide-22
SLIDE 22

Sum Two Arrays - Assembly Code

    foo:
       0:  add      r3, r0, #512    ; 0x200
       4:  vldmia   r0!, {s15}
       8:  vldmia   r1!, {s14}
       c:  cmp      r0, r3
      10:  vadd.f32 s15, s15, s14
      14:  vstmia   r2!, {s15}
      18:  bne      4 <foo+0x4>
      1c:  bx       lr

  ✘ Instruction Page Fault
  ✘ Code Generation
  ✘ Cold Memory Access
  ✘ Data Page Fault
  ✘ ✘ Hot Memory Access
  ✓ FP Operation

Hot memory access

Motivation Evaluating Simulation Tools 11

slide-23
SLIDE 23

Sum Two Arrays - Assembly Code

    foo:
       0:  add      r3, r0, #512    ; 0x200
       4:  vldmia   r0!, {s15}
       8:  vldmia   r1!, {s14}
       c:  cmp      r0, r3
      10:  vadd.f32 s15, s15, s14
      14:  vstmia   r2!, {s15}
      18:  bne      4 <foo+0x4>
      1c:  bx       lr

  ✘ Instruction Page Fault
  ✘ Code Generation
  ✘ Cold Memory Access
  ✘ Data Page Fault
  ✘ ✘ Hot Memory Access
  ✓ FP Operation
  ✘ Direct Control Flow
  ✘ Interrupt

Direct control flow, Interrupt

Motivation Evaluating Simulation Tools 11

slide-24
SLIDE 24

Sum Two Arrays - Assembly Code

    foo:
       0:  add      r3, r0, #512    ; 0x200
       4:  vldmia   r0!, {s15}
       8:  vldmia   r1!, {s14}
       c:  cmp      r0, r3
      10:  vadd.f32 s15, s15, s14
      14:  vstmia   r2!, {s15}
      18:  bne      4 <foo+0x4>
      1c:  bx       lr

  ✘ Instruction Page Fault
  ✘ ✘ Code Generation
  ✘ Cold Memory Access
  ✘ Data Page Fault
  ✘ ✘ Hot Memory Access
  ✓ FP Operation
  ✘ Direct Control Flow
  ✘ Interrupt

Code generation

Motivation Evaluating Simulation Tools 11

slide-25
SLIDE 25

Sum Two Arrays - Assembly Code

    foo:
       0:  add      r3, r0, #512    ; 0x200
       4:  vldmia   r0!, {s15}
       8:  vldmia   r1!, {s14}
       c:  cmp      r0, r3
      10:  vadd.f32 s15, s15, s14
      14:  vstmia   r2!, {s15}
      18:  bne      4 <foo+0x4>
      1c:  bx       lr

  ✘ Instruction Page Fault
  ✘ ✘ ✘ Code Generation
  ✘ Cold Memory Access
  ✘ Data Page Fault
  ✘ ✘ Hot Memory Access
  ✓ FP Operation
  ✘ Direct Control Flow
  ✘ Interrupt

Code generation

Motivation Evaluating Simulation Tools 11

slide-26
SLIDE 26

Sum Two Arrays - Assembly Code

    foo:
       0:  add      r3, r0, #512    ; 0x200
       4:  vldmia   r0!, {s15}
       8:  vldmia   r1!, {s14}
       c:  cmp      r0, r3
      10:  vadd.f32 s15, s15, s14
      14:  vstmia   r2!, {s15}
      18:  bne      4 <foo+0x4>
      1c:  bx       lr

  ✘ Instruction Page Fault
  ✘ ✘ ✘ Code Generation
  ✘ Cold Memory Access
  ✘ Data Page Fault
  ✘ ✘ Hot Memory Access
  ✓ FP Operation
  ✘ Direct Control Flow
  ✘ ✘ Interrupt
  ✘ Indirect Control Flow

Indirect control flow, Interrupt

Motivation Evaluating Simulation Tools 11

slide-27
SLIDE 27

SimBench

We present SimBench, which is designed to address some of these problems.

The SimBench Methodology Overview 12

slide-28
SLIDE 28

SimBench

We present SimBench, which is designed to address some of these problems.

SimBench...
  • Contains a range of targeted microbenchmarks

The SimBench Methodology Overview 12

slide-29
SLIDE 29

SimBench

We present SimBench, which is designed to address some of these problems.

SimBench...
  • Contains a range of targeted microbenchmarks
  • Is aimed at full-system simulation techniques

The SimBench Methodology Overview 12

slide-30
SLIDE 30

SimBench

We present SimBench, which is designed to address some of these problems.

SimBench...
  • Contains a range of targeted microbenchmarks
  • Is aimed at full-system simulation techniques
  • Is easily portable to new platforms and architectures

The SimBench Methodology Overview 12

slide-31
SLIDE 31

SimBench

We present SimBench, which is designed to address some of these problems.

SimBench...
  • Contains a range of targeted microbenchmarks
  • Is aimed at full-system simulation techniques
  • Is easily portable to new platforms and architectures
  • Runs ‘bare-metal’ on the guest (i.e., without an OS)

The SimBench Methodology Overview 12

slide-32
SLIDE 32

[Diagram: Guest System containing the Benchmarks, the Platform Library and the Arch. Library; Host System containing the Timing Application]

The SimBench Methodology Overview 13

slide-33
SLIDE 33

[Diagram: Guest System containing the Benchmarks, the Platform Library and the Arch. Library; Host System containing the Timing Application]

The SimBench Methodology Overview 14

slide-34
SLIDE 34

Categories

SimBench currently features five categories of benchmark:

  • Code Generation
  • Control Flow
  • Exception Handling
  • I/O
  • Memory System

A total of 18 benchmarks are in the suite.
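One way to picture the suite is as a table mapping each benchmark to its category. The sketch below is only an illustration: the `struct benchmark` layout, the enum and the helper are hypothetical and are not taken from the SimBench sources. The benchmark names are the ones listed on the "Benchmarks" extra slide of this deck.

```c
#include <assert.h>
#include <stddef.h>

/* The five categories named on the slide. */
enum category {
    CAT_CODE_GENERATION,
    CAT_CONTROL_FLOW,
    CAT_EXCEPTION_HANDLING,
    CAT_IO,
    CAT_MEMORY_SYSTEM,
    CAT_COUNT
};

/* Hypothetical registry entry; the real data structures may differ. */
struct benchmark {
    const char *name;
    enum category category;
};

static const struct benchmark benchmarks[] = {
    { "Small Blocks",                CAT_CODE_GENERATION },
    { "Large Blocks",                CAT_CODE_GENERATION },
    { "Inter-Page Direct",           CAT_CONTROL_FLOW },
    { "Inter-Page Indirect",         CAT_CONTROL_FLOW },
    { "Intra-Page Direct",           CAT_CONTROL_FLOW },
    { "Intra-Page Indirect",         CAT_CONTROL_FLOW },
    { "Memory Mapped Device",        CAT_IO },
    { "Coprocessor Access",          CAT_IO },
    { "Data Access Fault",           CAT_EXCEPTION_HANDLING },
    { "Instruction Access Fault",    CAT_EXCEPTION_HANDLING },
    { "Undefined Instruction",       CAT_EXCEPTION_HANDLING },
    { "System Call",                 CAT_EXCEPTION_HANDLING },
    { "External Software Interrupt", CAT_EXCEPTION_HANDLING },
    { "Cold Memory Access",          CAT_MEMORY_SYSTEM },
    { "Hot Memory Access",           CAT_MEMORY_SYSTEM },
    { "Nonprivileged Access",        CAT_MEMORY_SYSTEM },
    { "TLB Eviction",                CAT_MEMORY_SYSTEM },
    { "TLB Flush",                   CAT_MEMORY_SYSTEM },
};

#define NUM_BENCHMARKS (sizeof benchmarks / sizeof benchmarks[0])

/* Count the benchmarks that belong to one category. */
static size_t count_in_category(enum category c)
{
    assert(c < CAT_COUNT);
    size_t n = 0;
    for (size_t i = 0; i < NUM_BENCHMARKS; ++i) {
        if (benchmarks[i].category == c)
            ++n;
    }
    return n;
}
```

Keeping the category next to each name would let a runner report results per category as well as per benchmark.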

The SimBench Methodology Overview 15

slide-35
SLIDE 35

Implementation

The benchmarks are implemented entirely in C:

  • No inline assembly in the benchmarks
  • No weird intrinsics or builtins
  • Some extensions used for code alignment

The SimBench Methodology Implementation 16

slide-36
SLIDE 36

Implementation - Defeating Optimisations

We want to preserve interesting operations while still using our compiler optimisations:

  • Use of empty volatile assembly statements
  • Volatile memory accesses where necessary
  • Indirect jumps via opaque pointers where necessary
  • Benchmarks split into many objects to defeat inlining

    for (int i = 0; i < ITERATIONS; ++i) {
        asm volatile ("" ::: "memory");  /* empty statement that the compiler must keep */
    }

The SimBench Methodology Implementation 17

slide-37
SLIDE 37

Implementation - Defeating Optimisations

We want to preserve interesting operations while still using our compiler optimisations:

  • Use of empty volatile assembly statements
  • Volatile memory accesses where necessary
  • Indirect jumps via opaque pointers where necessary
  • Benchmarks split into many objects to defeat inlining

    static volatile uint32_t value = 0;
    ...
    value = value;
    value = value;
    value = value;

The SimBench Methodology Implementation 17

slide-38
SLIDE 38

Implementation - Defeating Optimisations

We want to preserve interesting operations while still using our compiler optimisations:

  • Use of empty volatile assembly statements
  • Volatile memory accesses where necessary
  • Indirect jumps via opaque pointers where necessary
  • Benchmarks split into many objects to defeat inlining

    static benchmark_kernel_t fn_table[];
    ...
    while (iterations-- > 0) {
        fn_table[i % FN_TABLE_SIZE]();
    }

The SimBench Methodology Implementation 17

slide-39
SLIDE 39

Implementation - Defeating Optimisations

We want to preserve interesting operations while still using our compiler optimisations:

  • Use of empty volatile assembly statements
  • Volatile memory accesses where necessary
  • Indirect jumps via opaque pointers where necessary
  • Benchmarks split into many objects to defeat inlining

inter-page-direct.c:

    void ipd_fn_10();
    ...
    ipd_fn_10();

inter-page-direct-10.c:

    void ipd_fn_9();
    void ipd_fn_10() { ipd_fn_9(); }
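Three of the techniques listed above can be combined in one small, self-contained sketch: a volatile counter, an indirect call through a pointer the compiler cannot see through, and an empty volatile assembly statement. This is not SimBench's code. The kernel functions, `opaque_table` and `run_indirect` are hypothetical names, and the `__asm__ volatile` statement assumes a GCC- or Clang-compatible compiler.

```c
#include <assert.h>
#include <stdint.h>

#define FN_TABLE_SIZE 4

typedef void (*benchmark_kernel_t)(void);

/* Volatile counter: every read and write must actually be performed. */
static volatile uint32_t calls = 0;

static void kernel_a(void) { calls = calls + 1; }
static void kernel_b(void) { calls = calls + 1; }
static void kernel_c(void) { calls = calls + 1; }
static void kernel_d(void) { calls = calls + 1; }

static benchmark_kernel_t fn_table[FN_TABLE_SIZE] = {
    kernel_a, kernel_b, kernel_c, kernel_d
};

/* The pointer itself is volatile, so it is reloaded on every iteration and
 * the compiler cannot assume which function a given call will reach. */
static benchmark_kernel_t * volatile opaque_table = fn_table;

static uint32_t run_indirect(uint32_t iterations)
{
    calls = 0;
    for (uint32_t i = 0; i < iterations; ++i) {
        opaque_table[i % FN_TABLE_SIZE]();   /* indirect call */
        __asm__ volatile ("" ::: "memory");  /* empty compiler barrier */
    }
    assert(calls == iterations);  /* sanity check: one call per iteration */
    return calls;
}
```

Calling `run_indirect(1000)` should perform 1000 real indirect calls even at high optimisation levels, which is the behaviour a control-flow benchmark needs to preserve.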

The SimBench Methodology Implementation 17

slide-40
SLIDE 40

[Diagram: Guest System containing the Benchmarks, the Platform Library and the Arch. Library; Host System containing the Timing Application]

The SimBench Methodology Porting SimBench 18

slide-41
SLIDE 41

Porting To A New Platform

To port SimBench to a new platform (where the architecture is already supported), four main components are required:

  • A simple UART driver
  • An IRQ Controller driver
  • A description of the memory layout
  • A linker script

[Figure: Lines of code (C and ASM) for the ARM and x86 platform support; axis from 50 to 300]
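To give a feel for how small the first and third components can be, here is a minimal sketch of a polled UART transmit routine together with a memory layout description. Everything in it is an assumption made for illustration: the register layout, the `UART_TX_READY` bit, and the `struct platform` fields are hypothetical and do not describe any real device or SimBench's actual platform interface. The IRQ controller driver and the linker script are not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register layout of a very simple memory-mapped UART. */
struct uart_regs {
    volatile uint32_t data;    /* write a byte here to transmit it */
    volatile uint32_t status;  /* bit 0 set: transmitter is ready */
};

#define UART_TX_READY 0x1u

/* Hypothetical platform description: where the UART and RAM live. */
struct platform {
    struct uart_regs *uart;
    uintptr_t ram_base;
    uintptr_t ram_size;
};

/* Busy-wait until the transmitter is ready, then send one character. */
static void uart_putc(const struct platform *p, char c)
{
    assert(p != NULL && p->uart != NULL);
    while ((p->uart->status & UART_TX_READY) == 0) {
        /* spin: no interrupts or timers are needed for output */
    }
    p->uart->data = (uint32_t)(unsigned char)c;
}
```

On a real platform `uart` would point at the device's fixed physical address. On the host it can point at an ordinary struct, which makes the routine easy to exercise without a simulator.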

The SimBench Methodology Porting SimBench 19

slide-42
SLIDE 42

Porting To A New Architecture

Porting to a new architecture is somewhat more complex.

  • Bring system out of reset
  • Manage page tables
  • Manage interrupt/exception vectors

[Figure: Lines of code (C and ASM) for the ARM and x86 architecture support; axis from 500 to 2000]

The SimBench Methodology Porting SimBench 20

slide-43
SLIDE 43

[Diagram: Guest System containing the Benchmarks, the Platform Library and the Arch. Library; Host System containing the Timing Application]

The SimBench Methodology Porting SimBench 21

slide-44
SLIDE 44

Timing

SimBench uses the Host system to collect benchmark timings:

  • We don’t trust the timing accuracy of the Guest
  • Guest might report simulated cycles instead of real time
  • Only needs platform-specific UART rather than Timer+IRQ+handler+...

[Timeline diagram: the Guest runs Startup, Prepare Benchmark, Benchmark Kernel and Cleanup Benchmark in turn. The Guest sends '[' and ']' around the benchmark, and the Host times the benchmark between the two markers.]
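The host-side half of this scheme can be sketched as a function that watches the guest's output for the two marker characters and measures the host time between them. This is a minimal sketch, not the actual SimBench timing application: `time_benchmark` and `now_secs` are hypothetical names, the guest output is assumed to arrive on an ordinary `FILE *` stream, and `clock_gettime` assumes a POSIX host.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Current host time in seconds, from a monotonic clock. */
static double now_secs(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Return the host seconds elapsed between reading '[' and the following ']'
 * on the guest's output stream, or -1.0 if the pair never appears. */
static double time_benchmark(FILE *guest_output)
{
    assert(guest_output != NULL);
    int started = 0;
    double start = 0.0;
    int c;
    while ((c = fgetc(guest_output)) != EOF) {
        if (c == '[' && !started) {
            started = 1;
            start = now_secs();        /* benchmark kernel begins */
        } else if (c == ']' && started) {
            return now_secs() - start; /* benchmark kernel ends */
        }
    }
    return -1.0;
}
```

Because the clock is read on the host, the result is real elapsed time even if the guest's own timers count simulated cycles. Startup and cleanup output outside the markers is ignored.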

The SimBench Methodology Porting SimBench 22

slide-45
SLIDE 45

Evaluation

We ran SimBench on a variety of platforms:

  • A range of full system simulators including QEMU
  • A virtualized system, using KVM
  • Natively on real hardware

Evaluation 23

slide-46
SLIDE 46

Categories - ARM Guest

Simulators on x86 Host, QEMU-KVM and Hardware on ARM Host

[Figure: Speedup over QEMU-DBT (axis from 0.01 to 100) for QEMU-DBT, SimIt-ARM, Gem5, QEMU-KVM and Hardware, by category: Code Generation, Control Flow, Exception Handling, I/O, Memory System]

Evaluation 24

slide-47
SLIDE 47

Categories

[Figure: Control Flow, QEMU-ARM on x86. Speedup by QEMU version (v1.7.0 to v2.5.0-rc2) for Same-Page-Indirect, Inter-Page-Indirect, Same-Page-Direct and Inter-Page-Direct; y-axis from 0.4 to 1.1]

Evaluation 25

slide-48
SLIDE 48

Categories

This represents 13% of executed instructions!

[Figure: Control Flow, QEMU-ARM on x86. Speedup by QEMU version (v1.7.0 to v2.5.0-rc2) for Same-Page-Indirect, Inter-Page-Indirect, Same-Page-Direct and Inter-Page-Direct; y-axis from 0.4 to 1.1]

Evaluation 26

slide-49
SLIDE 49

Summary

In summary:

  • Existing benchmarking methods have several shortcomings

  • SimBench is capable of addressing these shortcomings
  • But SimBench does not replace application benchmarks

Future Work:

  • Additional categories of benchmark
  • Ports to new architectures & platforms
  • Improve robustness of existing benchmarks

Summary 27

slide-50
SLIDE 50

Thanks for Listening! Any Questions?

SimBench is available now at http://bitbucket.org/simbench/simbench

28

slide-51
SLIDE 51

Benchmarks

  • Code Generation
    • Small Blocks
    • Large Blocks
  • Control Flow
    • Inter-Page Direct
    • Inter-Page Indirect
    • Intra-Page Direct
    • Intra-Page Indirect
  • I/O
    • Memory Mapped Device
    • Coprocessor Access
  • Exception Handling
    • Data Access Fault
    • Instruction Access Fault
    • Undefined Instruction
    • System Call
    • External Software Interrupt
  • Memory
    • Cold Memory Access
    • Hot Memory Access
    • Nonprivileged Access
    • TLB Eviction
    • TLB Flush
Extra Slides Benchmarks 29

slide-52
SLIDE 52

Evaluation Platforms

Machine      ODROID-XU3              HP z440
CPU          Exynos 5422             Xeon E5-1620 v3
CPU GHz      2.0 (A15), 1.4 (A7)     3.5 (3.6 Boost)
Memory       2GB                     16GB
Compiler     gcc 4.8.2               gcc 5.3.1
OS Name      Ubuntu 14.04            Fedora 21
OS Kernel    3.10.53                 4.1.13

Extra Slides Benchmarks 30