Computer Architecture Lab at
RAMP Retreat, June 2009
The Open Source ProtoFlex Simulator
Eric S. Chung, Michael K. Papamichael, James C. Hoe, Babak Falsafi, Ken Mai
– Project started (circa 2007) to build scalable, full-system multiprocessor simulators using FPGAs
– Functional simulator for N-way UltraSPARC III server (~50-90 MIPS)
– Using hybrid simulation, runs real server apps + Solaris OS
– Employs multithreading to virtualize the number of CPUs per FPGA core
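The idea behind hybrid simulation can be sketched in a few lines of Python: common-case behavior stays on the fast FPGA path, and anything else falls back to software. This is a minimal illustration only. The instruction classes, the `FPGA_CLASSES` set, and the `HybridSim` type are hypothetical names, not the actual ProtoFlex partition.

```python
from dataclasses import dataclass

# Hypothetical set of instruction classes handled on the FPGA fast path.
# The real partition is defined by the ProtoFlex design, not by this sketch.
FPGA_CLASSES = {"alu", "load", "store", "branch"}


@dataclass
class HybridSim:
    fpga_count: int = 0
    software_count: int = 0

    def execute(self, instr_class: str) -> str:
        """Route one instruction to the FPGA path or the software fallback."""
        if instr_class in FPGA_CLASSES:
            self.fpga_count += 1
            return "fpga"
        self.software_count += 1
        return "software"


sim = HybridSim()
trace = ["alu", "load", "fp", "alu", "store", "io", "branch"]
placements = [sim.execute(c) for c in trace]
print(placements, sim.fpga_count, sim.software_count)
```

The payoff depends on the fallback being rare: the fewer instructions that leave the FPGA path, the closer the simulator runs to the speed of the FPGA core.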
Hybrid Simulation Virtualization
– Demonstration of FPGAs as a viable architecture research vehicle
– Facilitate adoption of hybrid simulation & host multithreading
– Encourage building on top of our work
– Bluespec source HDL, Verilog and pre-generated netlists for SPARCV9 CPU model + interfaces
– XUPV5 Reference Design for EDK 10.1
– Virtutech Simics plug-ins for hybrid simulation
– Top-level SW controller, user command-line interface
– Documentation through online wiki
– Software user interface similar to Simics
– Applications load directly from Simics checkpoints
– Standard simulation features: state viewing, scripting, single-stepping, checkpointing, terminal, profiling/monitoring
[Figure: system diagram. Labels: Linux PC (PFMON, Simics (I/O), user interface); Ethernet; FPGA (FPGA core, main memory, PowerPC or MicroBlaze).]
– 64-bit SPARCV9 ISA + US III extensions
– 8 register windows, 4 global register files
– 512-entry D-TLB, 128-entry I-TLB
– 14-stage, multi-threaded pipeline, switches context on each cycle
– On Virtex-5, synthesizes at ~148 MHz with XST; placed & routed at 100 MHz
– Parameterized non-blocking caches
– FP + rare MMU instructions are SW-emulated by a nearby MicroBlaze
– 100% mirrors Virtutech Simics model
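Switching context on each cycle can be pictured as a round-robin scheduler over the ready contexts. The Python sketch below is an assumption-laden illustration: the function name `interleave`, the `stalled` parameter, and the skip-if-stalled policy are hypothetical and are not taken from the ProtoFlex source.

```python
from collections import deque


def interleave(num_contexts: int, cycles: int, stalled=frozenset()):
    """Return the context issued on each cycle, round-robin over ready contexts.

    Contexts listed in `stalled` (for example, waiting on a cache miss) are
    skipped. A cycle with no ready context issues None (a pipeline bubble).
    """
    ready = deque(c for c in range(num_contexts) if c not in stalled)
    issued = []
    for _ in range(cycles):
        if not ready:
            issued.append(None)
            continue
        ctx = ready.popleft()
        issued.append(ctx)
        ready.append(ctx)  # rotate to the back of the queue
    return issued


schedule = interleave(num_contexts=4, cycles=8)
schedule_with_stall = interleave(num_contexts=4, cycles=6, stalled={2})
print(schedule)
print(schedule_with_stall)
```

With every context ready, each one issues once every `num_contexts` cycles, which is how a single pipeline can stand in for several target CPUs.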
[Figure: CPU pipeline block diagram. Labeled blocks: Context Scheduler; I-Fetch Address Generate; I-Fetch Tag Check; I-TLB Stages 1-2; US III Decoder; 64-bit ALU Stages 1-2; D-TLB Stages 1-3; D-Cache Address Generate; D-Cache Tag Check; Multi-Cycle Instruction Unit; Writeback; Nonblocking I-cache (BRAM); Nonblocking D-cache (BRAM); Integer RF (BRAM); Arbiter to DDR Memory.]
– Synthesizes up to 148MHz using standard tools (ISE XST)
– 23.5 KLUTs (11.3% LX330T)
– 120 BRAMs for 16-context configuration (37% LX330T)
– Paging structures out to SRAM or DRAM can reduce BRAM usage significantly
– Will release in future updates
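The utilization percentages above can be cross-checked with simple arithmetic. The device totals in this Python sketch (207,360 LUTs and 324 block RAMs for the Virtex-5 LX330T) are assumptions not stated on the slides and should be confirmed against the Xilinx data sheet.

```python
# Figures quoted on the slides.
luts_used = 23_500
brams_used = 120  # 16-context configuration

# Assumed Virtex-5 LX330T device totals (not stated on the slides).
LX330T_LUTS = 207_360
LX330T_BRAMS = 324

lut_pct = 100 * luts_used / LX330T_LUTS
bram_pct = 100 * brams_used / LX330T_BRAMS
print(f"LUTs: {lut_pct:.1f}%  BRAMs: {bram_pct:.1f}%")
```

Under those assumed totals the results agree with the quoted 11.3% and 37%, and they show that block RAM, not logic, is the resource that runs out first as contexts are added.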
– Counters, histograms
– Roll your own
– Collect dynamic traces
– Feed traces to functional-first timing model
– Use MicroBlaze (or PPC) to monitor core/memory state
– Non-intrusive profiling without changes to target SW
[Figure: instrumentation diagram. Labels: counters, histogram trackers, timing model, FPGA hard/soft core (PowerPC or MicroBlaze).]
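A histogram tracker of the kind named above can be modeled in software as a set of bucketed counters. This Python sketch is only a behavioral illustration; the class name, the fixed bucket width, and the sample latencies are hypothetical, and the hardware version would use fixed-size counter arrays instead of a dictionary.

```python
from collections import Counter


class HistogramTracker:
    """Counts observed values in fixed-width buckets."""

    def __init__(self, bucket_width: int):
        self.bucket_width = bucket_width
        self.buckets = Counter()

    def record(self, value: int) -> None:
        # Integer division selects the bucket the value falls into.
        self.buckets[value // self.bucket_width] += 1

    def snapshot(self) -> dict:
        """Map each bucket's lower bound to its count, in ascending order."""
        return {b * self.bucket_width: n for b, n in sorted(self.buckets.items())}


latency = HistogramTracker(bucket_width=10)
for cycles in [3, 7, 12, 18, 25, 41, 44]:  # made-up sample latencies
    latency.record(cycles)
hist = latency.snapshot()
print(hist)
```

Because recording only increments a counter, a tracker like this can observe the pipeline without changing the target software.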
– Functional-first CMP cache coherency model for first-order timing models and functional warming [TRETS’09]
– Real-time stack trace profiling
– CMP interconnect model (in progress)
– Realistic CPU traffic generators (in progress)
– Oracle TPC-C, IBM DB/2 TPC-C, TPC-H, SPEC2K
[Figure: Piranha CMP cache (first-order timing model). Outputs: statistics + warmed coherency & tag states.]
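In a functional-first arrangement, the functional simulator produces a stream of memory references and a separate model consumes it to keep tag state warm and gather statistics. The Python sketch below uses a direct-mapped, tags-only cache as a deliberately simple stand-in. It is not the Piranha model and has no coherence; the class name, geometry, and address trace are all hypothetical.

```python
class TagArray:
    """Direct-mapped cache model that stores tags only, no data."""

    def __init__(self, num_sets: int, line_bytes: int):
        self.num_sets = num_sets
        self.line_bytes = line_bytes
        self.tags = [None] * num_sets
        self.hits = 0
        self.misses = 0

    def access(self, addr: int) -> bool:
        """Consume one address from the functional trace; return True on a hit."""
        line = addr // self.line_bytes
        index = line % self.num_sets
        tag = line // self.num_sets
        if self.tags[index] == tag:
            self.hits += 1
            return True
        self.tags[index] = tag  # fill on miss, evicting the previous line
        self.misses += 1
        return False


cache = TagArray(num_sets=4, line_bytes=64)
trace = [0x000, 0x004, 0x040, 0x100, 0x000, 0x044]  # made-up addresses
results = [cache.access(a) for a in trace]
print(results, cache.hits, cache.misses)
```

The functional side never waits on this model, so the trace can be replayed or sampled later while the tag state stays warm.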
– 4-8 weeks learning curve for normal HDL users
– Once learned, easier to read/modify than conventional RTL
– Requires BSV compiler (free for academics)
– Paper in MEMOCODE’09 describes BSV coding/validation of core
2-stage ALU:

rule split_ALU_pipeline (True);
   …
   p1 = piperegs[DECODE];
   piperegs[ALU1] <= doALUStage1(p1, alu_ifc);
   p2 = piperegs[ALU1];
   piperegs[ALU2] <= doALUStage2(p2, alu_ifc);
   …
endrule

1-stage ALU:

rule merged_ALU_pipeline (True);
   …
   p1 = piperegs[DECODE];
   p_tmp1 = doALUStage1(p1, alu_ifc);
   p_tmp2 = doALUStage2(p_tmp1, alu_ifc);
   piperegs[ALU] <= p_tmp2;
   …
endrule
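The difference between the two rules can be modeled in Python by treating each rule firing as one clock step in which all registers are read before any is written. The stage functions below are hypothetical stand-ins for doALUStage1 and doALUStage2, and the register dictionaries are a simplification of the piperegs array.

```python
def alu_stage1(x):
    """Hypothetical stand-in for doALUStage1."""
    return x + 1


def alu_stage2(x):
    """Hypothetical stand-in for doALUStage2."""
    return x * 2


def step_split(regs):
    """One clock of the 2-stage version: every register updates from old values."""
    return {
        "DECODE": None,
        "ALU1": None if regs["DECODE"] is None else alu_stage1(regs["DECODE"]),
        "ALU2": None if regs["ALU1"] is None else alu_stage2(regs["ALU1"]),
    }


def step_merged(regs):
    """One clock of the 1-stage version: both functions chain combinationally."""
    value = regs["DECODE"]
    return {
        "DECODE": None,
        "ALU": None if value is None else alu_stage2(alu_stage1(value)),
    }


split_after_1 = step_split({"DECODE": 5, "ALU1": None, "ALU2": None})
split_after_2 = step_split(split_after_1)
merged_after_1 = step_merged({"DECODE": 5, "ALU": None})
print(split_after_2["ALU2"], merged_after_1["ALU"])
```

Both versions compute the same value; the split version takes two clocks and a shorter critical path, while the merged version takes one clock and a longer combinational path.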
– Number of CPU contexts
– Cache sizes
– Merge/split pipeline stages
– Enable/disable modules for profiling & debugging
– Clock frequency (tested from 10 MHz to 100 MHz)
– Set optimal LUTRAM size (16 = V2P, 64 = V5)
– Choose LUTRAMs or BRAMs for any CPU state
– UDP or TCP/IP (for PFMON-to-FPGA communication)
– XUPv5, BEE2
– Inexpensive (~$750), easily accessible
– Standard tool flows (EDK, ISE)
– Reference design portable to other platforms: just drop in our ‘pcores’
– Future ports to BEE3 & Xilinx Accelerated Computing Platform (ACP)
– Plan to release with future updates
[Figure: reference design block diagram. Labels: LX110T, PLB, DRAM, SRAM.]
– connects to PLB & NPI
– Core 51% (76 out of 148)
– Rest 30% (45 out of 148)
– Orchestrates communication between Simics & BlueSPARC
– Provides CLI interface to simulator (like Simics Console)
– Handles I/O, FPGA Core and memory initialization
– Bluespec compiler: ~30 minutes
– Xilinx EDK: ~3 hours
– Stream memory image over Ethernet: ~5 minutes (for 512MB image)
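The streaming time above implies an effective transfer rate that is easy to work out. This Python sketch assumes "512MB" means 512 × 2^20 bytes and treats "~5 minutes" as exactly 300 seconds, so the result is a rough estimate only.

```python
image_bytes = 512 * 2**20  # assumption: 512 MB taken as 512 * 2**20 bytes
seconds = 5 * 60           # assumption: "~5 minutes" taken as exactly 300 s

mbytes_per_s = image_bytes / seconds / 2**20
mbits_per_s = image_bytes * 8 / seconds / 1e6
print(f"{mbytes_per_s:.2f} MB/s, about {mbits_per_s:.1f} Mbit/s")
```

Under those assumptions the link sustains roughly 1.7 MB/s, or about 14 Mbit/s.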
– Due to BRAM limits, only 4 CPU contexts (compared to 16 on BEE2’s V2P70)
– Slow PC-to-FPGA latency via LwIP/Ethernet (1 key/sec in terminal)
– Limited DDR2 RAM capacity (up to 2GB)
– Our XUPv5 design still has much room for improvement
– Useful for familiarizing with ProtoFlex tools
– Can still run multithreaded workloads + perform monitoring
– Easily portable to more powerful platforms
– All source code will be available in 1 week (June)
– Send email to echung@ece.cmu.edu for more info
– GNU GPL
– Documentation: www.ece.cmu.edu/~protoflex
– Support mailing list/forum will be available soon
– Funding for this work provided by: NSF CCF-0811702, NSF CNS-0509356, C2S2 Marco Center, and Sun Microsystems
– Paul Hartke & Xilinx for FPGA donations & email support
– Hybrid Simulation and FPGA Core Multithreading
A Complexity-Effective Architecture for Accelerating Full-System Multiprocessor Simulations Using FPGAs [FPGA’08]
– Functional-First, First-Order Timing Model for a CMP cache
ProtoFlex: Towards Scalable, Full-System Multiprocessor Simulations Using FPGAs [TRETS’09]
– FPGA Core Design & Validation Using Flight Data Recorder
Implementing a High-performance Multithreaded Microprocessor: A Case Study in High-level Design and Validation [MEMOCODE’09]