Optimizing Under Abstraction: Using Prefetching to Improve FPGA Performance - PowerPoint PPT Presentation



SLIDE 1

Optimizing Under Abstraction:

Using Prefetching to Improve FPGA Performance

Hsin-Jung Yang†, Kermin E. Fleming‡, Michael Adler‡, and Joel Emer†‡

† Massachusetts Institute of Technology ‡ Intel Corporation

September 3rd, FPL 2013

SLIDE 2

Motivation

  • Moore’s Law

– Increasing FPGA size and capability

  • Use case for FPGA:

[Diagram: User Program A running on FPGA A]


SLIDE 7

Motivation

  • Moore’s Law

– Increasing FPGA size and capability

  • Use case for FPGA:

[Diagram: User Program A (circuit verification) on FPGA A with Ethernet and SRAM; User Program B (algorithm acceleration) on FPGA B with DRAM; porting B to FPGA C, which offers DRAM, PCIe, Ethernet, multiple SRAMs, and LUTs, yields User Program B']

SLIDE 8

Abstraction

[Diagram: software stack on Processor A: C++/Python/Perl application, software library, operating system (config. A), running over the CPU, memory, and devices]

  • Goal: making FPGAs easier to use
SLIDE 9

Abstraction

[Diagram: the same software stack on Processor B: application, software library, operating system (config. B), running over CPU', Memory', and Device']

  • Goal: making FPGAs easier to use
SLIDE 10

Abstraction

[Diagram: User Program on FPGA A over an interface and abstraction layer (config. A) with Ethernet and SRAM, shown beside the Processor B software stack]

  • Goal: making FPGAs easier to use
SLIDE 11

Abstraction

[Diagram: User Program on FPGA B over an interface and abstraction layer (config. B) with PCIe and DRAM; some FPGA resources remain unused]

  • Goal: making FPGAs easier to use
SLIDE 12
  • Goal: making FPGAs easier to use
  • Optimization under abstraction

– Automatically accelerate FPGA applications – Provide FREE performance gain

Abstraction

[Diagram: User Program on FPGA B over the abstraction layer (config. B), with an optimization layer using the PCIe and DRAM resources]

SLIDE 13

Memory Abstraction

[Diagram: three clients, each wired to its own block RAM]

FPGA Block RAMs

[Waveform: block RAM ports addr, din, wen, dout, clocked by clk; address A1 produces dout D1 on the next cycle]

interface MEMORY_INTERFACE
  input:
    readReq(addr);
    write(addr, din);
  output:
    // dout is available at the next cycle of readReq
    readResp() if (readReq fired previous cycle);
endinterface

SLIDE 14

Memory Abstraction

[Diagram: three clients, each wired to its own block RAM]

FPGA Block RAMs

[Waveform: ports addr, din, wen, dout, valid, clocked by clk; dout D1 for address A1 arrives when valid is asserted]

interface MEMORY_INTERFACE
  input:
    readReq(addr);
    write(addr, din);
  output:
    // dout is available when the response is ready
    readResp() if (valid == True);
endinterface
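Taken together, the two slides describe a shift from fixed-latency block RAM to a variable-latency, valid-signaled memory. As an illustration only, here is a small Python model of that interface; the class name, the fixed latency, and the FIFO of in-flight reads are our assumptions, not LEAP's API:

```python
from collections import deque

class ScratchpadModel:
    """Toy model of the valid-signaled memory interface: readResp()
    only fires once the oldest outstanding response is ready."""
    def __init__(self, latency=3):
        self.mem = {}                   # backing store
        self.latency = latency          # assumed fixed; real latency varies
        self.inflight = deque()         # (cycle when valid, data)
        self.cycle = 0

    def write(self, addr, din):
        self.mem[addr] = din

    def read_req(self, addr):
        # The response becomes valid `latency` cycles from now.
        self.inflight.append((self.cycle + self.latency, self.mem.get(addr)))

    def read_resp(self):
        # Guarded like `readResp() if (valid == True)` on the slide.
        if self.inflight and self.inflight[0][0] <= self.cycle:
            return self.inflight.popleft()[1]
        return None                     # response not valid yet

    def tick(self):
        self.cycle += 1
```

The guard is the point: a client written against this interface keeps working unchanged whatever the actual latency is, which is what lets an optimizer change the memory system underneath it.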

SLIDE 15

[Diagram: LEAP scratchpad hierarchy: each client has a scratchpad interface and private cache; a connector links them to the scratchpad controller, the platform central cache, and host memory]

LEAP Scratchpads

Scratchpads

  • M. Adler et al., “LEAP Scratchpads,” in FPGA, 2011.
SLIDE 16

[Diagram: LEAP scratchpad hierarchy: each client has a scratchpad interface and private cache; a connector links them to the scratchpad controller, the platform central cache, and host memory]

LEAP Scratchpads

Processor Scratchpads

  • M. Adler et al., “LEAP Scratchpads,” in FPGA, 2011.
SLIDE 17

Scratchpad Optimization

Automatically accelerate memory-using FPGA programs

  • Reduce scratchpad latency
  • Leverage unused resources
  • Learn from optimization techniques in processors

– Larger caches, greater associativity – Better cache policies – Cache prefetching

SLIDE 19

Talk Outline

  • Motivation
  • Introduction to LEAP Scratchpads
  • Prefetching in FPGAs vs. in processors
  • Scratchpad Prefetcher Microarchitecture
  • Evaluation and Prefetch Optimization
  • Conclusion
SLIDE 20

Comparison of prefetching techniques and platforms

Prefetching Techniques

                      Static Prefetching               Dynamic Prefetching
Platform              Processor       FPGA             Processor              FPGA
How?                  User/Compiler   User             Hardware manufacturer  Compiler

Criteria compared (the per-cell check marks did not survive transcription): no code change, high prefetch accuracy, no instruction overhead, runtime information.

SLIDE 22

Dynamic Prefetching in Processors

Classic processor dynamic prefetching policies

  • When to prefetch

– Prefetch on cache miss – Prefetch on cache miss and prefetch hit

  • Also called tagged prefetch
  • What to prefetch

– Always prefetch next memory block – Learn stride-access patterns
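To make the "tagged prefetch" bullet concrete, here is an illustrative Python sketch (function names and the set-based cache model are ours) of next-block tagged prefetching: prefetch on a miss, and again on the first demand hit to a block that a prefetch brought in, so a sequential stream misses only once.

```python
def access(cache, prefetched, addr, issue_prefetch):
    """One cache access under tagged next-block prefetching.
    Returns True on a hit, False on a miss."""
    if addr in cache:
        if addr in prefetched:           # first demand hit on a prefetched block
            prefetched.discard(addr)     # clear the tag
            issue_prefetch(addr + 1)     # keep running ahead of the stream
        return True
    issue_prefetch(addr + 1)             # miss: also prefetch the next block
    cache.add(addr)                      # demand fetch fills the cache
    return False

def make_prefetcher(cache, prefetched):
    """Toy fill path: a prefetch installs the block and tags it."""
    def issue(addr):
        cache.add(addr)
        prefetched.add(addr)
    return issue
```

Walking addresses 0, 1, 2, ... through this model shows the tagged policy's appeal for streaming: only the very first access misses, and each prefetch hit triggers the next prefetch.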

SLIDE 23

Dynamic Prefetching in Processors

Stride prefetching

  • L1 cache: PC-based stride prefetching
  • L2 cache: address-based stride prefetching

(fully associative cache of learners)

           Tag      Previous Address   Stride   State
learner 1  0xa001   0x1008             4        Steady
learner 2  0xa002   0x2000             -        Initial
learner 3  -        -                  -        -
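A single learner entry from the table above can be sketched as follows. This is an illustrative Python model with a simplified two-state (Initial/Steady) machine; real stride prefetchers often add more states and confidence counters.

```python
class StrideLearner:
    """One stride-table entry (Tag, Previous Address, Stride, State)."""
    def __init__(self, tag, addr):
        self.tag = tag
        self.prev = addr                 # previous address seen for this tag
        self.stride = None               # no stride learned yet
        self.state = "Initial"

    def observe(self, addr):
        """Feed the next address; returns True when it is safe to prefetch."""
        stride = addr - self.prev
        if stride == self.stride:
            self.state = "Steady"        # same stride confirmed again
        else:
            self.stride = stride         # new pattern: start relearning
            self.state = "Initial"
        self.prev = addr
        return self.state == "Steady"
```

In hardware the learners are direct-mapped by address rather than held in a fully associative (CAM-like) structure, for the FPGA-efficiency reasons discussed on the following slides.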

SLIDE 24

Dynamic Prefetching on FPGAs

  • Easier:

– Cleaner streaming memory accesses – No need for PC as a filter to separate streams – Fixed (usually plenty of) resources

  • Harder:

– Back-to-back memory accesses – Inefficient to implement CAM

SLIDE 25

Dynamic Prefetching on FPGAs

  • Easier:

– Cleaner streaming memory accesses – No need for PC as a filter to separate streams – Fixed (usually plenty of) resources

  • Harder:

– Back-to-back memory accesses – Inefficient to implement CAM

Scratchpad prefetcher uses address-based stride prefetching with a larger set of direct-mapped learners

SLIDE 26

Talk Outline

  • Motivation
  • Introduction to LEAP Scratchpads
  • Prefetching in FPGAs vs. in processors
  • Scratchpad Prefetcher Design
  • Evaluation and Prefetch Optimization
  • Conclusion
SLIDE 27

Scratchpad Prefetcher

[Diagram: scratchpad hierarchy without prefetchers: clients, scratchpad interfaces, private caches, connector, scratchpad controller, platform central cache, host memory]

SLIDE 28

Scratchpad Prefetcher

[Diagram: the same scratchpad hierarchy with a prefetcher added alongside each client's private cache]

SLIDE 29

Scratchpad Prefetching Policy

  • When to prefetch

– Cache line miss / prefetch hit – Prefetcher learns the stride pattern

  • What to prefetch

– Prefetch address: P = L + s * d – Cache line address: L – Learned stride: s – Look-ahead distance: d
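The "what to prefetch" rule is just the arithmetic above; a minimal sketch (the function name is ours):

```python
def prefetch_addr(line_addr, stride, distance):
    """P = L + s * d: cache-line address L, learned stride s,
    look-ahead distance d."""
    return line_addr + stride * distance
```

For example, a stream currently at line 0x1000 with learned stride 4 and look-ahead distance 8 prefetches line 0x1020.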

SLIDE 31

Scratchpad Prefetching Policy

  • Look-ahead distance:

– Small distance: limited prefetch benefit – Large distance: risk of cache pollution – Suitable distance for different programs & platforms?

SLIDE 32

Scratchpad Prefetching Policy

  • Look-ahead distance:

– Small distance: limited prefetch benefit – Large distance: risk of cache pollution – Suitable distance for different programs & platforms?

Dynamically adjust look-ahead distance


SLIDE 38

Scratchpad Prefetching Policy

  • Look-ahead distance:

– Small distance: limited prefetch benefit – Large distance: risk of cache pollution – Suitable distance for different programs & platforms?

Dynamically adjust look-ahead distance

Classification of issued prefetches:

Issued prefetch
  – Dropped: by busy / by hit
  – To Memory:
      – Useless (untimely)
      – Usable: Timely / Late (untimely)
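These categories are exactly the feedback a distance controller needs. One plausible control loop (our sketch, not necessarily the paper's exact rule) in Python: widen the distance when prefetches come back late, narrow it when they turn out useless and pollute the cache.

```python
def adjust_distance(d, timely, late, useless, d_min=1, d_max=64):
    """Adapt the look-ahead distance d from per-interval prefetch counters.
    The doubling/halving step and the bounds are illustrative choices."""
    if late > timely:                 # responses arrive after the demand access
        return min(d * 2, d_max)      # look further ahead
    if useless > timely:              # prefetched lines evicted before use
        return max(d // 2, d_min)     # back off to limit cache pollution
    return d                          # distance looks right; keep it
```

Evaluating the counters over fixed intervals (rather than per request) keeps the control loop cheap in hardware and smooths out noise.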

SLIDE 39

Talk Outline

  • Motivation
  • Introduction to LEAP Scratchpads
  • Prefetching in FPGAs vs. in processors
  • Scratchpad Prefetcher Microarchitecture
  • Evaluation and Prefetch Bandwidth Control
  • Conclusion
SLIDE 41

Evaluation

  • Blocked matrix-matrix multiplication (MMM)
  • Prefetching has larger gains in smaller matrices.
  • Prefetching helps in edge conditions.
  • N. Dave et al., “Hardware Acceleration of Matrix Multiplication on a Xilinx FPGA,” in MEMOCODE, 2007.
SLIDE 46

Memory Bandwidth Control

  • MMM with memory bandwidth control

Prefetcher automatically stops issuing requests when there are too many requests in flight
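The rule can be sketched as a simple credit counter (the class name and threshold are illustrative, not from the paper): a prefetch is dropped as "busy" whenever the in-flight count has reached the cap, so demand requests keep their share of memory bandwidth.

```python
class PrefetchThrottle:
    """Drop prefetches when too many memory requests are in flight."""
    def __init__(self, max_inflight=16):
        self.max_inflight = max_inflight  # illustrative cap
        self.inflight = 0

    def try_issue_prefetch(self):
        if self.inflight >= self.max_inflight:
            return False                  # memory busy: drop the prefetch
        self.inflight += 1                # take a credit
        return True

    def on_response(self):
        self.inflight -= 1                # return the credit
```

Dropping (rather than queueing) throttled prefetches matches the "dropped by busy" category of the classification slides: a stale prefetch is cheap to regenerate later from the learner state.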

SLIDE 48

Prefetch Performance Summary

  • K. Fleming et al., “H.264 Decoder: A Case Study in Multiple Design Points,” in MEMOCODE, 2008.
SLIDE 49

Prefetcher Resource Utilization

Area of different prefetching logic implementations

                          Slice Registers   Slice LUTs   BRAM   fmax
32 learners, LUTRAM       333               1045         -      127 MHz
32 learners, BRAM         419               1275         2      131 MHz
H.264, Baseline Profile   60770             86364        99     80 MHz

  • Area requirements of FPGA prefetching are small.

― less than 0.5% of total chip area on the ML605 board

SLIDE 50

Conclusion

  • FPGA programs are hard to write
  • FPGA programs usually do not fully utilize resources
  • Optimizations under abstraction leverage unused resources and provide FREE performance gain
  • Adding prefetching to LEAP Scratchpads speeds up existing streaming applications

– No program code changes in target design – 15% average runtime improvement

  • There are many other possible optimizations to the FPGA memory system

SLIDE 51

Thank You