Optimizing Under Abstraction:
Using Prefetching to Improve FPGA Performance
Hsin-Jung Yang†, Kermin E. Fleming‡, Michael Adler‡, and Joel Emer†‡
† Massachusetts Institute of Technology ‡ Intel Corporation
September 3rd, FPL 2013
User Program A FPGA A Ethernet SRAM User Program B FPGA B DRAM
circuit verification algorithm acceleration
User Program A FPGA A FPGA C Ethernet SRAM User Program B FPGA B DRAM
DRAM PCIE Ethernet SRAM User Program B SRAM SRAM DRAM LUTs SRAM
DRAM PCIE Ethernet SRAM User Program B’ SRAM SRAM DRAM LUTs SRAM
C++/Python/Perl Application Software Library Operating System (config. A) Memory Device CPU Processor A
C++/Python/Perl Application Software Library Operating System (config. B) Memory’ Device’ CPU’ Processor B
Ethernet SRAM Interface User Program FPGA A Abstraction (config. A)
Interface User Program FPGA B Abstraction (config. B) PCIe DRAM Unused resources
Interface User Program FPGA B PCIe DRAM Abstraction (config. B) Optimization
Client RAM Block Client RAM Block Client RAM Block
[Timing diagram: signals clk, addr, din, wen, dout; address A1 on addr, data D1 on dout]
interface MEMORY_INTERFACE
  input:
    readReq(addr);
    write(addr, din);
    // dout is available at the next cycle of readReq
    readResp() if (readReq fired previous cycle);
endinterface
Client RAM Block Client RAM Block Client RAM Block
[Timing diagram: signals clk, addr, din, wen, dout, valid; address A1 on addr, data D1 on dout when valid is asserted]
interface MEMORY_INTERFACE
  input:
    readReq(addr);
    write(addr, din);
    // dout is available when response is ready
    readResp() if (valid == True);
endinterface
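The latency-insensitive interface above can be illustrated with a small software model. This is only a sketch: the class name, the `latency_of` callback, and the `tick` method are hypothetical and stand in for whatever the real memory does; the point is that the client waits on `valid` rather than counting cycles.

```python
from collections import deque


class LatencyInsensitiveMemory:
    """Toy model of a memory whose read latency is not fixed."""

    def __init__(self, size, latency_of):
        self.data = [0] * size
        self.latency_of = latency_of   # hypothetical: addr -> cycles until ready
        self.pending = deque()         # in-order queue of [cycles_left, value]

    def write(self, addr, din):
        self.data[addr] = din

    def read_req(self, addr):
        # The value is captured at request time; the response arrives later.
        self.pending.append([self.latency_of(addr), self.data[addr]])

    def tick(self):
        """Advance one clock cycle."""
        for entry in self.pending:
            if entry[0] > 0:
                entry[0] -= 1

    @property
    def valid(self):
        # A response is ready when the oldest request has finished waiting.
        return bool(self.pending) and self.pending[0][0] == 0

    def read_resp(self):
        if not self.valid:
            raise RuntimeError("readResp called while valid is False")
        return self.pending.popleft()[1]


def read_blocking(mem, addr):
    """Issue a read and wait on valid; returns (data, cycles waited)."""
    mem.read_req(addr)
    cycles = 0
    while not mem.valid:
        mem.tick()
        cycles += 1
    return mem.read_resp(), cycles
```

Because the client only checks `valid`, the same client code works whether a read takes one cycle or many, which is what lets the memory implementation change underneath the interface.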
Client Scratchpad Client Private Cache Scratchpad Interface Client Private Cache Scratchpad Interface Client Private Cache Scratchpad Interface Connector Scratchpad Controller Host Memory Platform Central Cache
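The hierarchy in this diagram (private cache, then the platform central cache, then host memory) can be sketched as a chain of lookups. The class names, the LRU replacement, and the write-through policy below are assumptions made for illustration; the slide does not state which policies the real scratchpad uses, and only a single client is modeled.

```python
from collections import OrderedDict


class HostMemory:
    """Backing store at the bottom of the hierarchy."""

    def __init__(self):
        self.store = {}
        self.reads = 0

    def read(self, addr):
        self.reads += 1
        return self.store.get(addr, 0)

    def write(self, addr, value):
        self.store[addr] = value


class Cache:
    """Small LRU, write-through cache in front of a lower level (assumed policies)."""

    def __init__(self, capacity, lower):
        self.capacity = capacity
        self.lower = lower
        self.lines = OrderedDict()
        self.hits = 0
        self.misses = 0

    def _fill(self, addr, value):
        self.lines[addr] = value
        self.lines.move_to_end(addr)
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)   # evict least recently used

    def read(self, addr):
        if addr in self.lines:
            self.hits += 1
            self.lines.move_to_end(addr)
            return self.lines[addr]
        self.misses += 1
        value = self.lower.read(addr)        # forward the miss downward
        self._fill(addr, value)
        return value

    def write(self, addr, value):
        self._fill(addr, value)
        self.lower.write(addr, value)


# One client's view: private cache -> central cache -> host memory.
host = HostMemory()
central = Cache(capacity=4, lower=host)
private = Cache(capacity=2, lower=central)
```

A read that misses in the small private cache may still hit in the central cache, so only misses at both levels pay the cost of going to host memory.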
            Static Prefetching           Dynamic Prefetching
Platform    Processor       FPGA         Processor               FPGA
How?        User/Compiler   User         Hardware manufacturer   Compiler

No code change
High prefetch accuracy
No instruction overhead
Runtime information
            Tag      Previous Address   Stride   State
learner 1   0xa001   0x1008             4        Steady
learner 2   0xa002   0x2000                      Initial
learner 3
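A learner table with these fields can be sketched in software as follows. This is a simplified illustration, not the hardware design from the talk: the `Transient` state, the eviction policy, the `degree` parameter, and the treatment of the tag as an opaque stream identifier are all assumptions; only the `Initial` and `Steady` states and the four fields come from the slide.

```python
from dataclasses import dataclass

INITIAL, TRANSIENT, STEADY = "Initial", "Transient", "Steady"


@dataclass
class Learner:
    tag: int
    prev_addr: int
    stride: int = 0
    state: str = INITIAL


class StridePrefetcher:
    def __init__(self, num_learners=32, degree=1):
        self.num_learners = num_learners
        self.degree = degree          # assumed: how many strides ahead to prefetch
        self.table = {}               # tag -> Learner, in allocation order

    def access(self, tag, addr):
        """Train on one demand access; return the addresses to prefetch."""
        learner = self.table.get(tag)
        if learner is None:
            if len(self.table) >= self.num_learners:
                # Assumed policy: evict the oldest-allocated learner.
                self.table.pop(next(iter(self.table)))
            self.table[tag] = Learner(tag, addr)
            return []
        stride = addr - learner.prev_addr
        if stride == learner.stride and stride != 0:
            learner.state = STEADY    # same nonzero stride seen twice in a row
        else:
            learner.stride = stride   # new candidate stride, not yet confirmed
            learner.state = TRANSIENT
        learner.prev_addr = addr
        if learner.state == STEADY:
            return [addr + learner.stride * i for i in range(1, self.degree + 1)]
        return []
```

Feeding the accesses 0x1000, 0x1004, 0x1008 under tag 0xa001 leaves that learner with previous address 0x1008, stride 4, and state Steady, and a single access to 0x2000 under tag 0xa002 leaves a second learner in the Initial state, matching the two rows shown.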
Client Scratchpad Client Private Cache Scratchpad Interface Connector Scratchpad Controller Host Memory Platform Central Cache Prefetcher Private Cache Scratchpad Interface Prefetcher Private Cache Scratchpad Interface Prefetcher Client Client
Issued prefetch
  To Memory
    Usable
      Timely
      Late
    Useless
  Dropped
    Dropped by busy
    Dropped by hit
Untimely Untimely
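One way to read the categories on this slide is as a decision procedure applied to each issued prefetch. The sketch below is an assumed interpretation: the record fields are hypothetical, the order of the two "dropped" checks is arbitrary, and "timely" is taken to mean that the prefetched data arrived no later than the first demand access, which the slide does not define.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class PrefetchRecord:
    hit_in_cache: bool                  # line already cached when the prefetch was issued
    queue_busy: bool                    # request could not be sent when it was issued
    fill_cycle: Optional[int] = None    # cycle the prefetched data arrived
    demand_cycle: Optional[int] = None  # cycle of the first demand access, if any


def classify(rec):
    """Map one issued prefetch to a leaf category (assumed definitions)."""
    if rec.hit_in_cache:
        return "dropped by hit"
    if rec.queue_busy:
        return "dropped by busy"
    # From here on the prefetch went to memory.
    if rec.demand_cycle is None:
        return "useless"                # fetched but never demanded
    if rec.fill_cycle is not None and rec.fill_cycle <= rec.demand_cycle:
        return "timely"
    return "late"                       # demanded before the data arrived


def summarize(records):
    """Count how many issued prefetches fall into each category."""
    return Counter(classify(r) for r in records)
```

Tallying the categories this way separates prefetches that never reached memory from those that did, and among the latter separates wasted bandwidth from prefetches that were useful but issued too late.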
[Figure: preload and compute phases]
                          Slice Registers   Slice LUTs   BRAM   fmax
32 learners, LUTRAM       333               1045                127 MHz
32 learners, BRAM         419               1275         2      131 MHz
H.264, Baseline Profile   60770             86364        99     80 MHz