slide-1
SLIDE 1

A Universal Parallel Front End for Execution Driven Microarchitecture Simulation

Chad D. Kersey Sudhakar Yalamanchili

Georgia Institute of Technology

Arun Rodrigues

Sandia National Laboratories

slide-14
SLIDE 14

Outline

◮ Introduction
    ◮ Simulators
    ◮ Front Ends
◮ The QSim Simulator Front End
    ◮ API Overview
    ◮ How is QSim “Universal”?
    ◮ How is it Parallel?
    ◮ How does it Perform?
    ◮ How are QEMU and QSim Related?
◮ Back Ends
◮ Summary
◮ Acknowledgements

slide-15
SLIDE 15

Introduction

slide-19
SLIDE 19

Introduction– The Ubiquity of Simulation

◮ Simulation is a requirement of architecture research.
◮ Few architecture researchers have access to the resources needed to create full-scale prototypes.
◮ Those with the resources would prefer not to spend them building incremental prototypes.
◮ Even if they would, the turn-around time for building a new CPU, even using pre-designed components, would be very long.

slide-23
SLIDE 23

Introduction– The Simulation Gap

[Figure: axes Time and Simulation Complexity; curves Capacity and Demand.]

Reasons for the simulation gap:

◮ Parallel simulation is hard, so we use serial simulators for parallel machines.
◮ Developments in computer architecture tend to be additive, but we keep building simulators from scratch.

slide-30
SLIDE 30

Introduction– Narrowing the Gap

Ways to narrow the simulation gap:

◮ Spend less time researching architecture and more time developing simulators?
    ◮ Probably would not be well-received by the architecture community.
◮ Increase simulator throughput so more simulations can be run in a reasonable amount of time.
    ◮ Parallelize them.
◮ Find ways to make simulator development more efficient.

If we make simulator development more efficient, we increase the rate at which simulation capacity can grow.

slide-36
SLIDE 36

Introduction– Front Ends

What is a front end?

◮ Most simulators are broken into a front end and a back end by their designers.
◮ The front end handles the execution of instructions (making sure the register state is correct).
    ◮ Because instruction sets are very complex, front ends are usually created either by adapting an existing emulator or by avoiding emulation entirely and tracing native execution.
◮ The back end handles timing, power, and other metrics (for example, how long an instruction took to clear the pipeline).
    ◮ Back ends are the part that implements the logic that makes a simulator unique.
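
The split described above can be sketched in a few lines. This is only an illustration under assumptions, not code from QSim: the toy two-instruction set, the class names `FrontEnd` and `BackEnd`, and the cycle counts are all invented for the example. The front end keeps register state correct; the back end only decides how long each instruction took.

```python
from dataclasses import dataclass


@dataclass
class Inst:
    op: str   # "add" or "load" in this toy instruction set (hypothetical)
    dst: int  # destination register index
    src: int  # source register index (add) or memory address (load)


class FrontEnd:
    """Functional model: keeps register and memory state correct."""

    def __init__(self, memory):
        self.regs = [0] * 4
        self.memory = memory

    def execute(self, inst):
        if inst.op == "add":
            self.regs[inst.dst] += self.regs[inst.src]
        elif inst.op == "load":
            self.regs[inst.dst] = self.memory[inst.src]
        else:
            raise ValueError(f"unknown op {inst.op!r}")
        return inst  # hand the executed instruction to the back end


class BackEnd:
    """Timing model: decides how long each instruction took."""

    LATENCY = {"add": 1, "load": 3}  # assumed cycle counts

    def __init__(self):
        self.cycles = 0

    def account(self, inst):
        self.cycles += self.LATENCY[inst.op]


def simulate(program, memory):
    front, back = FrontEnd(memory), BackEnd()
    for inst in program:
        back.account(front.execute(inst))
    return front.regs, back.cycles


program = [Inst("load", 0, 0), Inst("load", 1, 1), Inst("add", 0, 1)]
regs, cycles = simulate(program, memory=[5, 7])
```

Swapping in a different `BackEnd` changes the timing result without touching the functional state, which is the property the front end/back end split is meant to provide.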

slide-39
SLIDE 39

Introduction– The Ideal Front End

[Figure: block diagram with Emulator, Trace Writer, Trace, Trace Reader, Back-End, and Results blocks.]

Ideally:

◮ Each front end and back end need be written only once, after which they can be used in any combination, like compiler front ends and back ends.
◮ No additional code would need to be written to adapt general-purpose emulators for simulation duty.

slide-42
SLIDE 42

Introduction– Real Front Ends

[Figure: block diagram with Emulator, Internal API, Custom Shim, Trace Writer, Trace, Trace Reader, Compatibility Layers, Back-End, and Results blocks.]

Realistically:

◮ Much custom code needs to be written to adapt most off-the-shelf emulators as simulator front ends.
◮ Each time this is done, yet another simulator-specific front end is created.

slide-46
SLIDE 46

Introduction– Front Ends

How do we make new simulator front ends closer to the ideal?

◮ Provide an API that does not change unnecessarily because of the type of front end or the instruction set.
◮ Enable the construction of multithreaded simulators.
◮ Provide sufficient control and detail in the API to make it usable with most back ends.

slide-47
SLIDE 47

The QSim Simulator Front End

slide-52
SLIDE 52

QSim

◮ We have built a simulator front end based on QEMU[1] that aims to address these issues. It currently:
    ◮ Runs unmodified binaries on a lightly modified Linux guest.
    ◮ Enables the construction of multithreaded simulators.
    ◮ Has full support for 32-bit x86 guests. (The port to 64-bit x86 is weeks from completion; the port to ARM is in the early coding stages.)
    ◮ Enables parallel simulation.

[1] http://www.qemu.org

slide-53
SLIDE 53

QSim– API Overview

[Figure: a simplified diagram of the QSim API, showing Set callback, Unset callback, Run, and Callbacks between the Back End and QSim.]

run(i, j)              Advance guest CPU i by j instructions.
set_*_callback(x)      Set callbacks.
unset_*_callback(h)    Unset callbacks by handle.

Callback types include: instruction, register access, memory access, interrupt.
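
To make the calling pattern concrete, here is a minimal mock of a callback-driven front end. It is a sketch under assumptions and is not the QSim library: the class `MockFrontEnd`, the method names `set_inst_callback` and `unset_inst_callback`, and the callback signature `(cpu, icount)` are hypothetical stand-ins for the `run`, set, and unset operations listed above.

```python
import itertools


class MockFrontEnd:
    """Toy stand-in for a callback-driven front end; not the QSim library."""

    def __init__(self, n_cpus):
        self.icount = [0] * n_cpus      # instructions executed per guest CPU
        self._inst_callbacks = {}       # handle -> callable(cpu, icount)
        self._handles = itertools.count(1)

    def set_inst_callback(self, fn):
        """Register an instruction callback and return its handle."""
        handle = next(self._handles)
        self._inst_callbacks[handle] = fn
        return handle

    def unset_inst_callback(self, handle):
        """Remove a previously registered callback by handle."""
        del self._inst_callbacks[handle]

    def run(self, i, j):
        """Advance guest CPU i by j instructions, firing callbacks for each."""
        for _ in range(j):
            self.icount[i] += 1
            for fn in list(self._inst_callbacks.values()):
                fn(i, self.icount[i])


seen = []
fe = MockFrontEnd(n_cpus=2)
h = fe.set_inst_callback(lambda cpu, n: seen.append((cpu, n)))
fe.run(0, 3)                 # callback fires three times
fe.unset_inst_callback(h)
fe.run(0, 2)                 # guest still advances, but nothing is recorded
```

The back end drives the guest by calling `run` and observes it only through callbacks, so the same back-end code would work against any front end that offers these three operations.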

slide-56
SLIDE 56

QSim– How is it “Universal”?

Typical emulator used as a simulator front end:

[Figure: block diagram with Emulator, Internal API, Custom Shim, Back-End, and Results blocks.]

◮ Off-the-shelf open-source emulators like QEMU provide most of the code needed to build a front end but are incomplete.
◮ Simulation projects like PTLSim and FAST have modified QEMU heavily to create their front ends.

slide-59
SLIDE 59

QSim– How is it “Universal”?

QSim:

[Figure: block diagram with Emulator, Custom Shim, QSim, API, Back-End, and Results blocks.]

◮ QSim modifies QEMU yet again, but with compatibility with a diverse set of back ends in mind.
◮ Similarly, the API is the same no matter what the instruction set.

slide-63
SLIDE 63

QSim– Parallelism

◮ The run() function can be called simultaneously for two different guest CPUs from different host threads.
◮ This enables parallel emulation of multithreaded guests, as well as multithreaded back ends.
◮ Since back ends tend to use more CPU time than front ends, thread safety is more important than parallel emulation (both are provided by QSim).
◮ Up to 512 guest cores have been demonstrated running on up to 512 host threads.
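
The calling pattern in the first bullet, one host thread per guest CPU with each thread calling run() on its own CPU, might look like the sketch below. Everything here is an assumption made for illustration: `MockFrontEnd` is a toy with per-CPU counters, not QSim, and the quantum size is arbitrary. Python threads show the structure only; they do not demonstrate the speedup a native library could achieve.

```python
import threading


class MockFrontEnd:
    """Toy front end with independent per-CPU state (hypothetical)."""

    def __init__(self, n_cpus):
        self.icount = [0] * n_cpus

    def run(self, i, j):
        # Each guest CPU owns its own slot, so threads driving different
        # CPUs never write to the same element.
        for _ in range(j):
            self.icount[i] += 1


def drive(fe, cpu, quanta, quantum):
    """Advance one guest CPU in fixed-size quanta from its own host thread."""
    for _ in range(quanta):
        fe.run(cpu, quantum)


fe = MockFrontEnd(n_cpus=4)
threads = [
    threading.Thread(target=drive, args=(fe, cpu, 10, 100))
    for cpu in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

A multithreaded back end would do its timing work inside `drive`, between calls to `run`, which is why thread safety of the front end matters even when emulation itself is not the bottleneck.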

slide-64
SLIDE 64

QSim– Software Architecture

slide-65
SLIDE 65

QSim– Parallel Scaling

slide-66
SLIDE 66

QSim– Performance

The following represents the performance of QSim with empty callbacks. With typical simulation speeds measured in thousands of instructions per second, QSim will not likely be the bottleneck.

Benchmark           Slowdown    MIPS
swaptions           259x        18.5
mtgl-bfs            387x        36.6
ocean-non-contig    267x        40.7
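
As a rough check on the bottleneck claim, the arithmetic can be worked through for the figures above. The back-end rate of 100 thousand instructions per second is a hypothetical value chosen to match "thousands of instructions per second", and the model assumes front-end and back-end work for each instruction happen one after the other.

```python
# Front-end rates from the figures above, in millions of instructions per second.
FRONT_END_MIPS = {"swaptions": 18.5, "mtgl-bfs": 36.6, "ocean-non-contig": 40.7}

# Assumed back-end rate: 100 thousand instructions per second (hypothetical).
BACK_END_IPS = 100e3


def front_end_share(front_ips, back_ips):
    """Fraction of per-instruction time spent in the front end."""
    t_front = 1.0 / front_ips
    t_back = 1.0 / back_ips
    return t_front / (t_front + t_back)


shares = {
    name: front_end_share(mips * 1e6, BACK_END_IPS)
    for name, mips in FRONT_END_MIPS.items()
}
for name, share in shares.items():
    print(f"{name}: {share:.2%} of simulation time in the front end")
```

Under these assumptions the front end accounts for less than 1% of run time on all three benchmarks, so a faster front end would barely change total simulation time.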

slide-72
SLIDE 72

QSim– Relation to QEMU

How are QEMU and QSim related?

QEMU               QSim
Emulator           Simulator Front-End
Standalone         Built on QEMU
Full-system        CPU and RAM only
CPUs serialized    CPUs in parallel
Program            Library

slide-73
SLIDE 73

Back Ends

slide-76
SLIDE 76

Back Ends– The Canonical Example

[Figure: block diagram with Host Threads 1 to 3, a Parallel Discrete Event Simulation Engine, CPU Components, QSim, and Other Components.]

This is the kind of simulation QSim was designed for.

◮ QSim feeds instructions into CPU timing models that are part of a larger simulation infrastructure.
◮ A parallel discrete event simulation engine keeps track of events and ensures correctness.
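
The arrangement above can be sketched as a small event-driven loop. This is a simplified illustration, not the engine the slide refers to: it is serial rather than parallel, the instruction stream is a plain list standing in for a front end, and the names `EventQueue`, `CpuComponent`, and the latencies are assumptions.

```python
import heapq


class EventQueue:
    """Serial discrete event engine: pops events in timestamp order."""

    def __init__(self):
        self._heap = []
        self._seq = 0   # tie-breaker so equal timestamps stay FIFO
        self.now = 0

    def schedule(self, delay, action):
        heapq.heappush(self._heap, (self.now + delay, self._seq, action))
        self._seq += 1

    def run(self):
        while self._heap:
            self.now, _, action = heapq.heappop(self._heap)
            action()


class CpuComponent:
    """Toy timing model: one instruction in flight at a time."""

    LATENCY = {"alu": 1, "mem": 4}  # assumed cycle counts

    def __init__(self, name, queue, instructions):
        self.name, self.queue = name, queue
        self.pending = list(instructions)  # stands in for the front end's stream
        self.retired = []                  # (cycle, kind) per completed instruction
        queue.schedule(0, self.issue)

    def issue(self):
        if self.pending:
            kind = self.pending.pop(0)
            self.queue.schedule(self.LATENCY[kind], lambda: self.retire(kind))

    def retire(self, kind):
        self.retired.append((self.queue.now, kind))
        self.issue()


queue = EventQueue()
cpu0 = CpuComponent("cpu0", queue, ["alu", "mem", "alu"])
cpu1 = CpuComponent("cpu1", queue, ["mem", "mem"])
queue.run()
```

The event queue is what keeps the two CPU components ordered in simulated time; a parallel engine has to preserve the same ordering guarantees across host threads.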

slide-82
SLIDE 82

Back Ends– Others

Other back ends that have been built for QSim include:

◮ A binary trace writer, which was built along with a trace reader library that exports the QSim API.
◮ A serial universal processor emulator, simplesim.
    ◮ A demonstration vehicle; breaks instructions into plausible micro-ops regardless of instruction set.
◮ An interactive OS/application debugger.
◮ Visualization utilities.
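
The trace writer and reader pairing in the first bullet can be illustrated with a short sketch. The record layout, class names, and callback signature are all hypothetical and chosen for brevity; the point is only that the reader replays a recorded trace through the same kind of callback interface a live front end would offer, so a back end need not know which one it is attached to.

```python
import io
import struct

# Hypothetical record layout: cpu id, instruction address, instruction length.
RECORD = struct.Struct("<HQB")


class TraceWriter:
    """Back end: an instruction callback that appends records to a binary stream."""

    def __init__(self, stream):
        self.stream = stream

    def on_inst(self, cpu, addr, length):
        self.stream.write(RECORD.pack(cpu, addr, length))


class TraceReader:
    """Replays a trace through a callback interface like a live front end's."""

    def __init__(self, stream):
        self.stream = stream
        self._callbacks = []

    def set_inst_callback(self, fn):
        self._callbacks.append(fn)

    def run(self, n):
        """Replay up to n records; return how many were replayed."""
        done = 0
        while done < n:
            raw = self.stream.read(RECORD.size)
            if len(raw) < RECORD.size:
                break  # end of trace
            for fn in self._callbacks:
                fn(*RECORD.unpack(raw))
            done += 1
        return done


buf = io.BytesIO()
writer = TraceWriter(buf)
for rec in [(0, 0x1000, 3), (1, 0x2000, 5), (0, 0x1003, 2)]:
    writer.on_inst(*rec)

buf.seek(0)
replayed = []
reader = TraceReader(buf)
reader.set_inst_callback(lambda cpu, addr, length: replayed.append((cpu, addr, length)))
count = reader.run(10)
```

In this simplified version `run` takes only a record count, because the recorded trace already fixes the interleaving of guest CPUs.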

slide-87
SLIDE 87

Summary

This has been:

◮ An appeal for consistent front end/back end APIs.
◮ A plea for parallel front ends.
◮ A look at how we might narrow the simulation gap.
◮ An invitation to try QSim[2].

[2] http://www.cdkersey.com/qsim-web/

slide-88
SLIDE 88

Acknowledgements

The authors are grateful to Paolo Faraboschi and Daniel Ortega for their suggestions and guidance in getting QSim started. This work was supported by the National Science Foundation under grant CNS855110, Sandia National Laboratories, and HP Laboratories.