SLIDE 1

Scavenger:

Automating the Construction of Application-Optimized Memory Hierarchies

Hsin-Jung Yang†, Kermin E. Fleming‡, Michael Adler‡, Felix Winterstein§, and Joel Emer†*

† Massachusetts Institute of Technology, ‡ Intel Corporation § European Space Agency, *NVIDIA Research

September 3rd, FPL 2015

SLIDE 2

Abstraction

• Abstraction hides implementation details and provides good programmability

[Figure: software stack (C/Python application, operating system, instruction set architecture, CPU, I/O, memory) alongside the FPGA hardware stack (user program, LUTs, SRAM, DRAM, PCIe), with the programmer facing the FPGA directly]

• Implementation details are handled by programmers (FPGA side)
• Hardware can be optimized for the target application (FPGA side)
• Hardware is optimized for a set of applications and fixed at design time (processor side)

SLIDE 3

Abstraction

• Abstraction hides implementation details and provides good programmability

[Figure: the FPGA user program now sits on top of a memory and communication abstraction layer]

• Platform hardware can be optimized for the target application


• Hardware is optimized for a set of applications and fixed at design time

SLIDE 4

Application-Optimized Memory Subsystems

• Goal: build the “best” memory subsystem for a given application
  – What is the “best”?
    • The memory subsystem which minimizes the execution time
  – How?
    • A clean memory abstraction
    • A rich set of memory building blocks
    • Intelligent algorithms to analyze programs and automatically compose memory hierarchies

SLIDE 5

Observation

• Many FPGA programs do not consume all the available block RAMs (BRAMs)
  – Design difficulty
  – Same program ported from smaller FPGAs to larger ones
• Goal: utilize spare BRAMs to improve program performance

SLIDE 6

LEAP Memory Abstraction


interface MEM_IFC#(type t_ADDR, type t_DATA);
    method Action readReq(t_ADDR addr);
    method Action write(t_ADDR addr, t_DATA din);
    method ActionValue#(t_DATA) readResp();
endinterface

[Figure: a user engine connects through the LEAP memory interface to a LEAP memory block]

  • Simple memory interface
  • Arbitrary data size
  • Private address space
  • “Unlimited” storage
  • Automatic caching
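
As a minimal sketch (not taken from the slides; the module name, address range, and 32-bit types are assumptions for illustration), a client of the MEM_IFC interface above might stream read requests and accumulate the responses as follows:

    // Hypothetical client of the MEM_IFC interface declared above.
    module mkSumClient#(MEM_IFC#(Bit#(32), Bit#(32)) mem) (Empty);
        Reg#(Bit#(32)) reqAddr <- mkReg(0);
        Reg#(Bit#(32)) sum     <- mkReg(0);

        // Issue one read request per cycle over the first 1024 words.
        rule issueRead (reqAddr < 1024);
            mem.readReq(reqAddr);
            reqAddr <= reqAddr + 1;
        endrule

        // Consume responses as they arrive; buffering, ordering, and
        // caching are handled behind the interface.
        rule collectResp;
            let d <- mem.readResp();
            sum <= sum + d;
        endrule
    endmodule

Because the interface hides the backing storage, the same client works whether the data happens to live in BRAM, on-board DRAM, or host memory.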
SLIDE 7

LEAP Scratchpad


[Figure: LEAP scratchpad hierarchy (multiple clients behind the scratchpad interface) drawn as an analogy to a processor's application / L1 cache / L2 cache / memory hierarchy]

• On-chip SRAM
• On-board DRAM
• M. Adler et al., “LEAP Scratchpads,” in FPGA, 2011.
SLIDE 8

LEAP Memory is Customizable


• Highly parametric (sketched as a configuration record after this list)
  – Cache capacity
  – Cache associativity
  – Cache word size
  – Number of cache ports
• Enable specific features/optimizations only when necessary
  – Private/coherent caches for private/shared memory
  – Prefetching
  – Cache hierarchy topology
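
As a rough illustration (hypothetical type and field names, not the actual LEAP parameter interface), the knobs above could be gathered into a single configuration record that a cache constructor consumes:

    // Hypothetical configuration record; names are illustrative only.
    typedef struct {
        Integer numSets;         // sets per cache
        Integer associativity;   // ways per set
        Integer wordBytes;       // cache word size in bytes
        Integer numPorts;        // number of client ports
        Bool    enablePrefetch;  // optional prefetcher
        Bool    coherent;        // coherent (shared) vs. private cache
    } CacheConfig;

    // Example: an 8192-set, 4-way cache with 8-byte words and one port.
    CacheConfig cfgExample = CacheConfig {
        numSets: 8192, associativity: 4, wordBytes: 8,
        numPorts: 1, enablePrefetch: False, coherent: False
    };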

SLIDE 9

Utilizing Spare Block RAMs

• Many FPGA programs do not consume all the BRAMs
• Goal: utilize all spare BRAMs in the LEAP memory hierarchy
• Problem: need to build very large caches

SLIDE 10

Cache Scalability Issue


• Simply scaling up BRAM-based structures may have a negative impact on operating frequency
  – BRAMs are distributed across the chip, increasing wire delay

SLIDE 11

Cache Scalability Issue


• Solution: trade latency for frequency (see the bank-select sketch after this list)
  – Multi-banked BRAM structure
  – Pipelining relieves timing pressure
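
As a minimal sketch of the banked organization (bank count and index width are illustrative, not values from the paper), the low bits of a set index can select the BRAM bank while the remaining bits index within that bank:

    // Illustrative sizes: 4 banks, 16K sets in total.
    typedef 4  NUM_BANKS;
    typedef 14 SET_IDX_BITS;

    // Split a set index into (bank select, index within the bank).
    function Tuple2#(Bit#(TLog#(NUM_BANKS)),
                     Bit#(TSub#(SET_IDX_BITS, TLog#(NUM_BANKS))))
             splitSetIndex(Bit#(SET_IDX_BITS) setIdx);
        Bit#(TLog#(NUM_BANKS)) bankSel = truncate(setIdx);  // low bits
        Bit#(TSub#(SET_IDX_BITS, TLog#(NUM_BANKS))) bankIdx = truncateLSB(setIdx);  // high bits
        return tuple2(bankSel, bankIdx);
    endfunction

Each bank can then be a separately pipelined BRAM structure, which keeps wires short and relieves timing pressure at the cost of extra access latency.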

SLIDE 12

Cache Scalability Issue


  • Solution: trade latency for frequency
SLIDE 13

Banked Cache Overhead


  • Simple kernel (hit rate=100%)

[Figure: overhead comparison for latency-oriented vs. throughput-oriented applications]

SLIDE 14

Banked Cache Overhead


  • Simple kernel (hit rate=69%)
SLIDE 15

Results: Scaling Private Caches


  • Case study: Merger (an HLS kernel)

Merger has 4 partitions: each connects to a LEAP scratchpad and forms a sorted linked list from a stream of random values.

SLIDE 16

Private or Shared Cache?


  • We can now build large caches
  • Where should we allocate spare BRAMs?

– Option 1: Large private caches
  – Option 2: A large shared cache at the next level

  • Many applications have multiple memory clients

– Different working set sizes and runtime memory footprints

SLIDE 17

Adding a Shared Cache


[Figure: a shared on-chip cache, sized to consume all extra BRAMs, is added at the scratchpad controller on the FPGA, backed by the central cache (DRAM) and host memory]

SLIDE 18

Automated Optimization


[Figure: automated optimization flow involving user kernel generation (Bluespec, Verilog, HLS kernel), BRAM usage estimation (using a pre-built database and user-specified frequency and memory demands such as cache capacity), shared cache construction, LEAP platform construction, and the FPGA tool chain]
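
To make the BRAM usage estimation step concrete, here is a rough sketch (assuming 36Kb BRAM blocks and a fixed per-word tag overhead; this is not the estimator the tool actually uses) of how a cache configuration's BRAM cost might be approximated:

    // Rough BRAM count for a cache: data bits plus tag bits, rounded
    // up to whole 36Kb blocks. All values are elaboration-time Integers.
    function Integer estimateBRAMs(Integer numSets, Integer associativity,
                                   Integer wordBytes, Integer tagBits);
        Integer bitsPerEntry = wordBytes * 8 + tagBits;
        Integer totalBits    = numSets * associativity * bitsPerEntry;
        Integer bramBits     = 36 * 1024;  // one 36Kb block
        return div(totalBits + bramBits - 1, bramBits);  // round up
    endfunction

An estimate along these lines, combined with the user's frequency and capacity requirements, is what lets the flow size the shared cache to the BRAMs left over after the user kernel is accounted for.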

SLIDE 19

Results: Shared Cache


  • Case study: Filter (an HLS kernel)

– Filtering algorithm for K-means clustering
  – 8 partitions: each uses 3 LEAP Scratchpads

[Figure: results for shared cache configurations of 8192 sets / 4 ways, 16384 sets / 2 ways, 8192 sets / 2 ways, and 4096 sets / 1 way]

SLIDE 20

Conclusion


• It is possible to exploit unused resources to construct memory systems that accelerate the user program.
• We propose microarchitecture changes that allow large on-chip caches to run at high frequency.
• We take steps toward automating the construction of memory hierarchies based on program resource utilization and frequency requirements.
• Future work:
  – Program analysis
  – Energy study

SLIDE 21

Thank You
