Using Reconfigurable Logic to Simulate Computer Systems



SLIDE 1

Using Reconfigurable Logic to Simulate Computer Systems

Derek Chiou

University of Texas at Austin Electrical and Computer Engineering

Supported in part by DOE, NSF, SRC, Bluespec, Intel, Xilinx, IBM, and Freescale

SLIDE 2

Rand’s Talk

Cycle-poor simulators

Cycle-rich hardware

6/6/2011 Derek Chiou of UTAustin for IWLS 2011 2

SLIDE 3

Fast, Accurate Simulator?

Fast simulator is easy
  • Several that are within a factor of 10 of reality
  • No performance/power prediction
Accurate is inherently slow, lots of details
  • Intel/AMD arch simulators 100K-1M times slower than real hardware
  • RTL simulators 1B times slower than real hardware
Only way to have a fast, accurate simulator is aggressive (10K+) parallelization
  • Multicore not sufficient
  • FPGAs?
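To put the slowdown factors above in perspective, the following arithmetic sketch converts them into the wall-clock time needed to simulate one second of target execution. The helper name `host_seconds` and the labels are illustrative; the factors are the ones quoted on the slide.

```python
# Wall-clock time needed to simulate one second of target execution
# at the slowdown factors quoted on the slide.
SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY


def host_seconds(target_seconds: float, slowdown: float) -> float:
    """Host time is target time multiplied by the slowdown factor."""
    return target_seconds * slowdown


slowdowns = {
    "fast simulator (10x)": 10,
    "arch simulator, low end (100K x)": 1e5,
    "arch simulator, high end (1M x)": 1e6,
    "RTL simulator (1B x)": 1e9,
}

for name, factor in slowdowns.items():
    secs = host_seconds(1.0, factor)
    print(f"{name}: {secs:,.0f} s = {secs / SECONDS_PER_DAY:.2f} days")
```

At a 1B slowdown, one target second costs roughly three decades of host time, which is the motivation for the 10K+ parallelization argument.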

SLIDE 4

How to Apply FPGAs?

Emulation/Prototyping
  • Port RTL to FPGAs
  • Issues: late; RTL not designed for FPGA; not that fast (10K slower than hardware); lots of FPGA resources
Port software simulators to FPGAs
  • New simulator architectures for FPGAs
  • C-to-gates doesn’t work well for simulators

SLIDE 5

Functional/Timing Partitioned Simulators

Simulator partitioned into two
  • Partitions change roughly independently, reducing cost of change
Functional model
  • Executes functionality of target system
  • E.g., ISA, peripheral functionality
  • Implement x86 once, reuse many times
Timing model
  • Models time of target system
  • E.g., caches, pipelining

[Figure: pipeline diagrams with stages F, D, E, M, W and a cache ($)]

SLIDE 6

Timing-Directed

Timing model calls functional model at appropriate target time
  • Ensures functionality performed in the correct order
Requires very frequent communication
  • For FPGA implementation, both functional and timing need to be implemented on FPGA for performance
  • Intel/MIT HAsim, Berkeley RAMP-Gold

[Figure: pipeline diagrams with stages F, D, E, M, W and a cache ($)]

SLIDE 7

Another Way?

Difficult to implement full ISA on FPGA
  • Intel has implemented x86 roughly 3 times on FPGA with full RTL
Software functional models very fast, very complete
  • Boot full operating systems, run unmodified code
Functional first: Functional model (FM) executes, feeds trace to timing model (TM)
  • All information that TM needs (opcode, register names, addresses, etc.) can be computed by FM

[Figure: FM and TM connected by Trace and Flow Control]
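The functional-first arrangement can be sketched in a few lines. This is a toy illustration under stated assumptions, not the simulator described in the talk: the four-instruction ISA, the `TraceEntry` fields, the 1-cycle/10-cycle cost model, and the bounded trace buffer standing in for flow control are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class TraceEntry:
    opcode: str
    dst: Optional[str] = None
    addr: Optional[int] = None


def functional_model(program, regs, mem):
    """FM: execute each instruction and yield what the TM needs to know."""
    for op, *args in program:
        if op == "li":                      # rd <- immediate
            rd, imm = args
            regs[rd] = imm
            yield TraceEntry(op, dst=rd)
        elif op == "load":                  # rd <- mem[addr]
            rd, addr = args
            regs[rd] = mem.get(addr, 0)
            yield TraceEntry(op, dst=rd, addr=addr)
        elif op == "store":                 # mem[addr] <- rs
            rs, addr = args
            mem[addr] = regs[rs]
            yield TraceEntry(op, addr=addr)
        elif op == "add":                   # rd <- rs1 + rs2
            rd, rs1, rs2 = args
            regs[rd] = regs[rs1] + regs[rs2]
            yield TraceEntry(op, dst=rd)


def timing_cost(entry, cached):
    """TM: charge target cycles; the first touch of an address is a miss."""
    if entry.addr is None:
        return 1
    cost = 1 if entry.addr in cached else 10
    cached.add(entry.addr)
    return cost


def simulate(program, buffer_capacity=4):
    regs, mem, cached = {}, {}, set()
    fm = functional_model(program, regs, mem)
    buffer, cycles, fm_done = deque(), 0, False
    while not fm_done or buffer:
        # Flow control: the FM runs ahead only while the buffer has room.
        while not fm_done and len(buffer) < buffer_capacity:
            entry = next(fm, None)
            if entry is None:
                fm_done = True
            else:
                buffer.append(entry)
        # The TM consumes one trace entry and accounts for its target time.
        if buffer:
            cycles += timing_cost(buffer.popleft(), cached)
    return cycles, regs


PROGRAM = [
    ("li", "r1", 5),
    ("store", "r1", 100),
    ("load", "r2", 100),
    ("add", "r3", "r1", "r2"),
    ("load", "r4", 200),
]
cycles, regs = simulate(PROGRAM)
print(cycles, regs["r3"])
```

Note that the TM never computes a value: it only consumes opcodes and addresses from the trace, which is why the FM can stay in software while the TM is moved elsewhere.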

SLIDE 8

Parallelize Functional First

Parallelize between FM/TM
  • Minimized round-trip communication between FM & TM (just flow control) maximizes parallel performance
Parallelize TM by implementing in FPGA
  • TM bottleneck, small, lots of fine-grained communication
  • FPGA is excellent at fine-grained communication needed by timing model
  • FM on CPU runs very fast
Result is a fast simulator
  • 10MIPS-100MIPS to simulate single core target

[Figure: FM and TM connected by Trace and Flow Control; hosts labeled CPU, CPU, FPGA]

SLIDE 9

But, Functional First Is Inaccurate

FM executes first without timing information
Functional accuracy dependent on timing
  • Shared memory accesses highly dependent on timing of loads: FM executes load/store in different order than TM
  • Branch mispredictions and resolution highly dependent on timing: wrong path instructions pollute pipeline, caches
Timing dependent on accurate functionality
Inaccurate even for unicore target

SLIDE 10

Example: Dekker’s Algorithm

Core0               Core1
10: M[0] = 1        20: M[1] = 1
11: R0 = M[1]       21: R0 = M[0]
12: BR R0!=0 15     22: BR R0!=0 25
13: M[CS] = 0       23: M[CS] = 1
14: BR END          24: BR END
15: M[0] = 0        25: M[1] = 0
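The listing above can be run under different interleavings to see why execution order matters. The sketch below is an illustrative interpreter, assuming sequentially consistent memory and one instruction per schedule step; the tuple encoding of the instructions and the `run` helper are hypothetical.

```python
# Two-core program from the slide, encoded as {pc: instruction}.
PROGRAMS = {
    0: {10: ("store", 0, 1), 11: ("load", 1), 12: ("br_nonzero", 15),
        13: ("store", "CS", 0), 14: ("end",), 15: ("store", 0, 0)},
    1: {20: ("store", 1, 1), 21: ("load", 0), 22: ("br_nonzero", 25),
        23: ("store", "CS", 1), 24: ("end",), 25: ("store", 1, 0)},
}


def run(schedule):
    """Execute one instruction of the named core per schedule step.

    Returns the core id written to M[CS], or None if neither core
    entered the critical section.
    """
    mem = {0: 0, 1: 0, "CS": None}
    pc = {0: 10, 1: 20}
    r0 = {0: 0, 1: 0}
    done = {0: False, 1: False}
    for core in schedule:
        if done[core]:
            continue
        op = PROGRAMS[core][pc[core]]
        next_pc = pc[core] + 1
        if op[0] == "store":
            mem[op[1]] = op[2]
        elif op[0] == "load":
            r0[core] = mem[op[1]]
        elif op[0] == "br_nonzero" and r0[core] != 0:
            next_pc = op[1]
        elif op[0] == "end":
            done[core] = True
        pc[core] = next_pc
        if next_pc not in PROGRAMS[core]:
            done[core] = True
    return mem["CS"]


print("core 0 first:", run([0] * 5 + [1] * 5))
print("core 1 first:", run([1] * 5 + [0] * 5))
print("alternating: ", run([0, 1] * 4))
```

The same program yields three different outcomes depending only on the order in which loads and stores reach memory, which is the order a functional model running ahead of the timing model does not know.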

SLIDE 11

Functional First: Core 0 Gets Lock

[Figure: FM Core 0 and FM Core 1 feed functional traces to TM Core 0 and TM Core 1, which share Memory. Core 0 trace: P0.0| 10: M[0]=, P0.1| 11: R0=M[1], P0.2| 12: BR!=0 15, P0.3| 13: M[CS]=. Core 1 trace: P1.0| 20: M[1]=, P1.1| 21: R1=M[0], P1.2| 22: BR!=0 25.]

5/10/2010 Derek Chiou of UTAustin at Stanford 11

SLIDE 12

What if on Target, Core1 Gets Lock?

Functional Trace Is Target Incorrect!
  • How to detect?
  • How to correct?
Traditional solution is to avoid functional-first when accuracy important
  • timing-directed, execute-in-execute

[Figure: FM Core 0 and FM Core 1 feed functional traces to TM Core 0 and TM Core 1, which share Memory. Core 0 trace: P0.0| 10: M[0]=, P0.1| 11: R0=M[1], P0.2| 12: BR!=0 15, P0.3| 13: M[CS]. Core 1 trace: P1.0| 20: M[1]=, P1.1| 21: (R0=M[0]), P1.2| 22: BR!=0 25, P1.3| 25: M[1].]

SLIDE 13

Solution: Speculative Functional First

CAL 2009, Patent: 7,840,397

Assume we have target-correct values
  • Easy to get functional load/store values (hard to get exec order)
  • Load values (and store values) provided in functional trace
  • Compare target load value with functional load value to detect
  • Have target-correct value to correct when necessary
  • Rollback functional model, change value, replay, regenerating trace including addresses, stored values
  • Differs from traditional parallelization techniques (e.g., PDES) that use order
How do we get target-correct values?
  • Target Memory Oracle (TMO) models target-correct memory values
  • TMO read at target time with target-correct address
  • TMO written at target-correct time with target-correct address, data
  • Won’t execute in timing model until address/data values correct
Speculatively execute functionally, produce functional values, correct when wrong
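The detect/rollback/replay loop can be illustrated on the two-core Dekker example. This is a minimal sketch under stated assumptions, not the patented mechanism or the authors' implementation: the instruction encoding, `fm_run`, and `simulate` are hypothetical, the target order is supplied as a list, and rollback is simplified by resynchronizing the FM memory from the TMO rather than keeping an undo log.

```python
from collections import deque

# Two-core program from the Dekker's Algorithm slide, as {pc: instruction}.
PROGRAMS = {
    0: {10: ("store", 0, 1), 11: ("load", 1), 12: ("br_nonzero", 15),
        13: ("store", "CS", 0), 14: ("end",), 15: ("store", 0, 0)},
    1: {20: ("store", 1, 1), 21: ("load", 0), 22: ("br_nonzero", 25),
        23: ("store", "CS", 1), 24: ("end",), 25: ("store", 1, 0)},
}


def fm_run(core, pc, fmem):
    """FM: functionally execute `core` from `pc` to the end; return its trace.

    Restarts always begin at a load, which overwrites the only register,
    so this toy needs no register checkpoint.
    """
    trace, r0 = [], 0
    while pc in PROGRAMS[core]:
        op = PROGRAMS[core][pc]
        entry = {"pc": pc, "kind": op[0]}
        next_pc = pc + 1
        if op[0] == "store":
            fmem[op[1]] = op[2]
            entry.update(addr=op[1], value=op[2])
        elif op[0] == "load":
            r0 = fmem[op[1]]                 # speculative functional value
            entry.update(addr=op[1], value=r0)
        elif op[0] == "br_nonzero" and r0 != 0:
            next_pc = op[1]
        trace.append(entry)
        if op[0] == "end":
            break
        pc = next_pc
    return trace


def simulate(target_order):
    """TM: retire one trace entry of the named core per step of target_order."""
    fmem = {0: 0, 1: 0, "CS": None}          # functional model's memory
    tmo = dict(fmem)                         # Target Memory Oracle
    # The FM runs ahead: core 0 to completion, then core 1.
    traces = {c: deque(fm_run(c, start, fmem)) for c, start in ((0, 10), (1, 20))}
    rollbacks = 0
    for core in target_order:
        if not traces[core]:
            continue
        e = traces[core].popleft()
        if e["kind"] == "store":
            tmo[e["addr"]] = e["value"]      # written at target-correct time
        elif e["kind"] == "load" and tmo[e["addr"]] != e["value"]:
            rollbacks += 1                   # functional load value was wrong
            # Simplification: resynchronize FM memory from the TMO, then
            # replay from the load to regenerate the rest of the trace.
            fmem.clear()
            fmem.update(tmo)
            replayed = fm_run(core, e["pc"], fmem)
            traces[core] = deque(replayed[1:])   # the load itself has retired
    return tmo["CS"], rollbacks


print(simulate([0] * 5 + [1] * 5))   # target order agrees with the FM order
print(simulate([1] * 5 + [0] * 5))   # target order disagrees: rollbacks occur
```

When the target order matches the order the FM happened to use, no load is mispredicted and nothing is replayed. When Core 1 goes first on the target, each core's load is caught against the TMO and its trace is regenerated, so the final memory reflects the target order.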

SLIDE 14

Speculative Functional First and Oracles

[Figure: FM Core 0 and FM Core 1 feed traces annotated with load values to TM Core 0 and TM Core 1, which share Memory (values shown: 1, 1). Core 0 trace: P0.0| 10: M[0]=1, P0.1| 11: (R0=M[1])==0, P0.2| 12: BR!=0 15, P0.3| 13: M[CS]=0. Core 1 trace: P1.0| 20: M[1]=1, P1.1| 21: (R0=M[0])==1, P1.2| 22: BR!=0 25, P1.3| 25: M[1]=0, P1.3| 23: M[CS]=20.]

SLIDE 15

A Video

SLIDE 16

Conclusions

Fast computer system simulators would be really useful for architecture, verification, debug
FPGA-based simulators can help achieve speed
  • Several ways to attack the problem
Could be used for hardware/software codesign, performance/power tuning
Current work:
  • Accurate power models at same speed
  • 5% cycle-by-cycle RMS error for ARM A8, Freescale superscalar core (FPL 2010)
  • Automatically transforming simulator description to implementation (DAC 2011)
Biggest Issue
  • FPGA design still hard, need to simplify for faster development

SLIDE 17

Acknowledgements

Students
  • Hari Angepat (FM-MP)
  • Ram Chakravarthy (parallelizing FM-MP)
  • Dam Sunwoo (FM-UP, Power), now at ARM Research
  • Nikhil Patil (TM, tools, FAST2Imp)
  • Gene Wu (FM, Power)
  • Yi Yuan (TM, Reliability)
  • Dan Zhang (TMO)
  • Xiaoyu Ma (MP TM)
  • Maysam Lavasani (Magilla)
Funding, Equipment
  • DOE, NSF, SRC
  • Intel, Xilinx, IBM, Freescale
Software, tools
  • Bluespec, Xilinx
Open-source full system simulators
  • QEMU, Bochs
