SLIDE 1

Automatic Synthesis of High-Speed Processor Simulators

Martin Burtscher and Ilya Ganusov

Computer Systems Laboratory

Cornell University

SLIDE 2

Motivation

Processor simulators are invaluable tools

They allow us to cheaply and quickly test ideas

Problem

Portable simulators tend to be slow

Fast simulators are complex and either require access to source code or symbol table info, are ISA specific (non-portable), need dynamic compilation support, or perturb the simulation

SLIDE 3

Functional Simulation

Simulates correct behavior but not timing

Used for prototyping, trace generation, etc.

Needed for fast forwarding (sampling)

Integral part of cycle-accurate simulators

Average fast-forwarding and simulation time for SPECcpu2000 with early SimPoints

sim-fast + sim-mase: 1.9h + 1.25h = 3.15h

SyntSim + sim-mase: 0.25h + 1.25h = 1.5h
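
The two totals above can be turned into an end-to-end speedup with a few lines of C. This is only a worked calculation on the slide's numbers; the function name and parameter names are made up for illustration.

```c
/* End-to-end speedup when only the fast-forwarding phase gets faster.
   All arguments are in hours; `detailed` is the cycle-accurate phase. */
static double end_to_end_speedup(double ff_old, double ff_new, double detailed)
{
    return (ff_old + detailed) / (ff_new + detailed);
}
```

Calling end_to_end_speedup(1.9, 0.25, 1.25) divides 3.15h by 1.5h, which is about 2.1. Fast forwarding alone is 1.9h / 0.25h = 7.6 times faster, but the unchanged 1.25h of detailed simulation limits the overall gain.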

SLIDE 4

Contributions

Goal

Develop a functional simulator that is simple, portable, and fast (+ supports instrumentation)

Our approach: SyntSim

Before every run, statically synthesize a simulator that is optimized for the given binary

Combine interpreted- and compiled-mode simulation for speed and simplicity

Perform other important optimizations

SLIDE 5

SyntSim’s Features

Simplicity

Only a little more complex than an interpreter

Even works with stripped executables

Easy to add code to simulate caches, etc.

Portability

Emits C source code

Does not perturb simulation

Performance

Only 6.6x slower than native execution on SPECcpu2000 reference runs (geo. mean)

SLIDE 6

Interpreted-Mode Simulation

Instruction example

addq r7, 200, r22

Interpreted code

Slow simulation speed

Handles all adds in all programs

Compiled once

    inst = mem[pc];
    op = inst >> 26;
    switch (op) {
      case ALUop:
        rsrc = (inst >> 21) & 31;
        imm = (inst >> 13) & 255;
        func = (inst >> 5) & 255;
        rdst = inst & 31;
        switch (func) {
          case AddI:
            reg[rdst] = reg[rsrc] + imm;
            pc++;

[Figure: instruction word with fields op, rsrc, imm, func, rdst]
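
The decode-and-dispatch steps on this slide can be sketched as a self-contained C function. This is a minimal sketch and not SyntSim's actual interpreter: the Cpu struct, step(), encode_alu_imm(), and the numeric values of ALU_OP and ADD_I are hypothetical, and only the field positions follow the slide.

```c
#include <stdint.h>

/* Hypothetical opcode and function codes, chosen only for illustration. */
enum { ALU_OP = 0x10, ADD_I = 0xA0 };

typedef struct {
    uint64_t reg[32];
    uint64_t pc;            /* index into mem, counted in instruction words */
} Cpu;

/* Pack the fields in the slide's layout: op | rsrc | imm | func | rdst. */
static uint32_t encode_alu_imm(uint32_t func, uint32_t rsrc,
                               uint32_t imm, uint32_t rdst)
{
    return ((uint32_t)ALU_OP << 26) | ((rsrc & 31u) << 21) |
           ((imm & 255u) << 13) | ((func & 255u) << 5) | (rdst & 31u);
}

/* Interpret one instruction. Returns 0 on success, -1 if unsupported. */
static int step(Cpu *cpu, const uint32_t *mem)
{
    uint32_t inst = mem[cpu->pc];
    uint32_t op = inst >> 26;

    switch (op) {
    case ALU_OP: {
        /* Every field is re-extracted each time the instruction executes. */
        uint32_t rsrc = (inst >> 21) & 31;
        uint32_t imm  = (inst >> 13) & 255;
        uint32_t func = (inst >> 5) & 255;
        uint32_t rdst = inst & 31;

        switch (func) {
        case ADD_I:
            cpu->reg[rdst] = cpu->reg[rsrc] + imm;
            cpu->pc++;
            return 0;
        default:
            return -1;
        }
    }
    default:
        return -1;
    }
}
```

With reg[7] holding 5, stepping over encode_alu_imm(ADD_I, 7, 200, 22) leaves 205 in reg[22] and advances pc by one. The same step() code serves every add in every program, which is why it is compiled only once but pays the decoding cost on each executed instruction.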

SLIDE 7

Compiled-Mode Simulation

Instruction example

addq r7, 200, r22

Translated code

Fast simulation speed

Only handles this add in this program

Incurs synthesis and compilation overhead

    reg[22] = reg[7] + 200;

Optimizations

No decoding

Hardcoded indices and immediates

Other optims.

[Figure: instruction word with fields op, rsrc, imm, func, rdst]
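
One way to picture the translation step is a function that decodes the instruction once, at synthesis time, and emits a C statement with the register indices and the immediate hardcoded. This is a sketch under assumptions, not SyntSim's code generator: translate() and the values of ALU_OP and ADD_I are hypothetical, and the field positions are taken from the interpreted-mode slide.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical opcode and function codes, chosen only for illustration. */
enum { ALU_OP = 0x10, ADD_I = 0xA0 };

/* Write the C statement for one instruction into out.
   Returns the number of characters written, or -1 if the instruction is
   not handled or the buffer is too small. */
static int translate(uint32_t inst, char *out, size_t cap)
{
    /* Decoding happens here, once, instead of on every simulated execution. */
    uint32_t op   = inst >> 26;
    uint32_t rsrc = (inst >> 21) & 31;
    uint32_t imm  = (inst >> 13) & 255;
    uint32_t func = (inst >> 5) & 255;
    uint32_t rdst = inst & 31;

    if (op == ALU_OP && func == ADD_I) {
        int n = snprintf(out, cap, "reg[%u] = reg[%u] + %u;",
                         (unsigned)rdst, (unsigned)rsrc, (unsigned)imm);
        return (n >= 0 && (size_t)n < cap) ? n : -1;
    }
    return -1;
}
```

For an instruction word encoding rsrc 7, imm 200, and rdst 22, the emitted text is the statement shown on the slide, reg[22] = reg[7] + 200;. The C compiler then sees only constants, so it can optimize the statement together with its neighbors.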

SLIDE 8

Mixed-Mode Simulation

Combine interpreted and compiled mode

Translating the 15% most frequently executed static instructions suffices to run 99.9% of the dynamic instructions in compiled mode

Remaining instructions are interpreted

Translating only frequently executed instructions

Much shorter compilation time

Smaller executable (better i-cache performance)
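
Choosing which static instructions to translate from a profile can be sketched as follows: sort the instructions by execution count and take the hottest ones until the requested share of dynamic instructions is covered. This is an illustrative sketch, not SyntSim's selection code; ProfileEntry, select_hot(), and the profile layout are assumptions.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint64_t addr;     /* address of the static instruction */
    uint64_t count;    /* how often it executed in the profile run */
} ProfileEntry;

/* qsort comparator: larger execution counts first. */
static int by_count_desc(const void *a, const void *b)
{
    const ProfileEntry *x = a;
    const ProfileEntry *y = b;
    return (x->count < y->count) - (x->count > y->count);
}

/* Sorts prof hottest-first and returns how many leading entries must be
   translated so that at least `coverage` (a fraction such as 0.999) of all
   dynamic instructions run in compiled mode. */
static size_t select_hot(ProfileEntry *prof, size_t n, double coverage)
{
    uint64_t total = 0, covered = 0;
    size_t k = 0;

    for (size_t i = 0; i < n; i++)
        total += prof[i].count;

    qsort(prof, n, sizeof prof[0], by_count_desc);

    while (k < n && (double)covered < coverage * (double)total) {
        covered += prof[k].count;
        k++;
    }
    return k;
}
```

The first k entries after the call are the candidates for compiled mode, and everything else is left to the interpreter. A small k keeps the generated C file short, which is where the shorter compilation time and the smaller executable come from.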

SLIDE 9

SyntSim’s Operation

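[Figure: block diagram of SyntSim's operation]

Inputs: program executable, optional profile, instruction definitions (C code), user options

Example instruction definitions: add: D=A+B; sub: D=A-B; bne: if (A) goto B; …

SyntSim code generator and optimizer

Output: high-speed simulator (C code) consisting of a compiled-mode simulator and an interpreter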

SLIDE 10

Compiled-Mode Simulator

    static void RunCompiled()
    {
      forever {
        switch (pc/4) {
          case 0x4800372c:
            r[2] = RdMem8(r[1]-30768);       // 12000dcb8: ldq r2, -30768(r1)
            s1 = r[0]; s2 = r[4];            // 12000dcd0: cmplt r0, r4, r0
            r[0] = 0; if (s1<s2) r[0] = 1;
            ic += 3;
            if (0!=r[0]) goto L12000dcf0;    // 12000dcdc: bne r0, 12000dcf0
            ic += 1;
            goto L12000c970;                 // 12000dcec: br r31, 12000c970
          L12000dcf0:
            ic += 1;
            pc = r[26]&(~3ULL);              // 12000dcf0: ret r31, (r26), 1
            icnt[fnc(lasttarget)] += ic;
            ic = 0;
            lasttarget = pc;
            break;
          default:
            RunInterpreted();
        } // switch
      } // forever
    }

SLIDE 11

Related Work

MINT 1994

Dyn. decompile short code sequences into fncs

QPT/EEL 1994/1995

Rewrite executable, use quite precise algorithm for indirect branches, need dyn. translation

SuperSim 1996

Static decompilation into C, fully labeled

UQBT 2000

Decompilation into special high-level language, static hooks to interpret untranslated code

SLIDE 12

Evaluation Methodology

System

750MHz 64-bit Alpha 21264A

64kB L1, 8MB L2, 2GB RAM

Tru64 UNIX V5.1

Benchmarks

20 SPECcpu2000 programs, highly optimized

All F77 and C programs except perlbmk

Full test, train, and reference runs

SLIDE 13

Profile vs. Heuristic Performance

Runtimes include

Synthesis time (0.08s)

Compilation time (33s)

Simulation time (3160s)

Profile based

6.6x gmean slowdown

(2x to 16x)

Heuristic based

8.7x gmean slowdown

(2.2x to 66x)

[Figure: bar chart of slowdown relative to native execution for gzip, vpr, gcc, mcf, crafty, parser, gap, vortex, bzip2, twolf, mesa, art, equake, ammp, wupwise, swim, mgrid, applu, sixtrack, apsi, and the geometric mean. Series: ref runs with ref profiles, ref runs with heuristics. Two off-scale bars are labeled 32.8 and 66.3.]

SLIDE 14

Mixed-Mode Performance

Observations

Better profiles help

Pure compiled mode is slower than mixed mode with good profile (24% on train runs)

Best c/i ratio decreases with quality of profile

99.9% compiled mode is best with self profile (15% of static instrs)

[Figure: slowdown relative to native execution (y-axis, 6.0 to 10.0) versus percent interpreted instructions, dynamic (x-axis, 0.001 to 10). Series: ref runs with test profiles, ref runs with train profiles, ref runs with ref profiles, ref runs with heuristics.]

SLIDE 15

Comparison with Interpreters

SyntSim’s interpreter

2.5x faster than sim-fast

Mixed mode

19x faster than sim-fast

8x faster on ref runs than SyntSim interpreter (3.6x to 14x)

7x faster on train runs

3.7x faster on test runs

[Figure: bar chart of slowdown relative to native execution for gzip, gcc, mcf, parser, vortex, bzip2, twolf, mesa, art, equake, ammp, mgrid, applu, and the geometric mean. Series: ref runs using mixed mode, ref runs using interpreter, ref runs using sim-fast. One off-scale bar is labeled 282.]

SLIDE 16

Comparison with ATOM

Adding instrumentation

Identical C code

Instruction count (ic)

Mem hierarchy (memh)

Branch predictor (bp)

Results

ic: ATOM is 2x faster

rest: SyntSim is 2.6x faster than ATOM

[Figure: bar chart of slowdown relative to native execution (y-axis, 5 to 35) for the configurations ic, ic+memh, and ic+memh+bp. Series: SyntSim, ATOM.]

SLIDE 17

Conclusions

Presented a fully automated technique to statically create fast yet portable simulators

Interleaves compiled- and interpreted-mode simulation for speed and simplicity

Only 6.6x slower than native execution

Only 13x slowdown when counting instructions and simulating a memory hierarchy and a branch predictor (warmup)