


  1. Scavenger: Automating the Construction of Application-Optimized Memory Hierarchies
     Hsin-Jung Yang†, Kermin E. Fleming‡, Michael Adler‡, Felix Winterstein§, and Joel Emer†*
     †Massachusetts Institute of Technology, ‡Intel Corporation, §European Space Agency, *NVIDIA Research
     FPL 2015, September 3rd

  2. Abstraction
     • Abstraction hides implementation details and provides good programmability.
     • Processor stack: the programmer writes C/Python software; the application runs on an operating system, above the instruction set architecture, above the hardware (CPU, memory, I/O). Implementation details are hidden from the programmer, but the hardware is optimized for a set of applications and fixed at design time.
     • FPGA stack: the user program sits directly on LUTs, SRAM, PCIe, and DRAM. Implementation details are handled by the programmer, but the hardware can be optimized for the target application.

  3. Abstraction
     • Abstraction hides implementation details and provides good programmability.
     • Inserting an abstraction layer (memory, communication) between the FPGA user program and the platform (LUTs, SRAM, PCIe, DRAM) hides the implementation details, just as the processor stack does.
     • Processor: hardware is optimized for a set of applications and fixed at design time. FPGA: the platform hardware can be optimized for the target application.

  4. Application-Optimized Memory Subsystems
     • Goal: build the “best” memory subsystem for a given application.
       – What is the “best”? The memory subsystem that minimizes execution time.
       – How?
         • A clean memory abstraction
         • A rich set of memory building blocks
         • Intelligent algorithms to analyze programs and automatically compose memory hierarchies

  5. Observation
     • Many FPGA programs do not consume all the available block RAMs (BRAMs), whether because of design difficulty or because the same program is ported from a smaller FPGA to a larger one.
     • Goal: utilize the spare BRAMs to improve program performance.

  6. LEAP Memory Abstraction
     A user engine connects to a LEAP memory block through a simple interface:
     • Simple memory interface
     • Arbitrary data size
     • Private address space
     • “Unlimited” storage
     • Automatic caching

     LEAP memory interface (Bluespec):

        interface MEM_IFC#(type t_ADDR, type t_DATA);
           method Action readReq(t_ADDR addr);
           method Action write(t_ADDR addr, t_DATA din);
           method ActionValue#(t_DATA) readResp();
        endinterface
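     To make the interface concrete, here is a minimal client sketch in Bluespec. This is our illustration, not code from the talk: the module name mkMemClient, the address and data widths, and the doubling of the data are invented, and the memory instance is assumed to be constructed elsewhere and passed in.

        // Minimal sketch (not from the talk): stream reads through a
        // LEAP memory and write each value back doubled. The memory
        // instance is assumed to be constructed elsewhere.
        module mkMemClient#(MEM_IFC#(Bit#(20), Bit#(64)) mem) (Empty);
           Reg#(Bit#(20)) reqAddr  <- mkReg(0);
           Reg#(Bit#(20)) respAddr <- mkReg(0);

           // Issue one read request per cycle; the response latency is
           // variable (L1 hit vs. on-board DRAM vs. host memory).
           rule issueRead;
              mem.readReq(reqAddr);
              reqAddr <= reqAddr + 1;
           endrule

           // Responses return in request order; consume each one and
           // write the doubled value back to the same address.
           rule consumeResp;
              let d <- mem.readResp();
              mem.write(respAddr, d << 1);
              respAddr <= respAddr + 1;
           endrule
        endmodule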

  7. LEAP Scratchpad
     • Just as a processor application is backed by a cache hierarchy and main memory, each scratchpad client connects through the scratchpad interface to an L1 cache in on-chip SRAM, backed by an L2 cache in on-board DRAM, backed in turn by host memory.
     • M. Adler et al., “LEAP Scratchpads,” in FPGA, 2011.
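     Instantiating a scratchpad looks roughly like the sketch below, following the LEAP Scratchpads paper; the exact constructor signature and configuration type in the LEAP sources may differ, and the identifier is a placeholder.

        // Approximate instantiation, after the LEAP Scratchpads paper.
        // `VDEV_SCRATCH_SORT is a placeholder scratchpad identifier;
        // the real constructor's arguments may differ.
        MEM_IFC#(Bit#(20), Bit#(64)) sortMem <-
           mkScratchpad(`VDEV_SCRATCH_SORT, defaultValue);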

  8. LEAP Memory is Customizable
     • Highly parametric:
       – Cache capacity
       – Cache associativity
       – Cache word size
       – Number of cache ports
     • Enable specific features/optimizations only when necessary (see the sketch below):
       – Private/coherent caches for private/shared memory
       – Prefetching
       – Cache hierarchy topology
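     As a hedged illustration of what “highly parametric” means in practice, the structure below collects such knobs in one place. The type and its field names are invented for this writeup and do not match LEAP’s actual configuration structures.

        // Hypothetical illustration only: field names are invented and
        // do not match LEAP's real configuration types.
        typedef struct {
           Integer cacheEntries;    // private-cache capacity (lines)
           Integer associativity;   // 1 = direct-mapped, n = n-way
           Integer wordBytes;       // cache word size
           Integer numPorts;        // number of cache ports
           Bool    coherent;        // coherent caches for shared memory
           Bool    prefetch;        // enable prefetching only when useful
        } MEM_HIERARCHY_CONF;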

  9. Utilizing Spare Block RAMs
     • Many FPGA programs do not consume all the BRAMs.
     • Goal: utilize all spare BRAMs in the LEAP memory hierarchy.
     • Problem: this requires building very large caches.

  10. Cache Scalability Issue
     • Simply scaling up BRAM-based structures can hurt operating frequency: the BRAMs become distributed across the chip, increasing wire delay.

  11. Cache Scalability Issue
     • Solution: trade latency for frequency (a code sketch of the idea follows slide 12).
       – Multi-banked BRAM structure
       – Pipelining relieves timing pressure

  12. Cache Scalability Issue
     • Solution: trade latency for frequency. (Figure on slide.)
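     The banking idea can be sketched in Bluespec as follows. This is our minimal reconstruction of the technique, not Scavenger’s code: the interface and module names are invented, the store is read-only for brevity, and it uses the standard BRAM, Vector, and FIFO libraries. Low address bits select one of four small BRAMs, each with its own registered port, so no single RAM structure has to span the chip; a FIFO of bank tags keeps responses in request order.

        import BRAM::*;
        import ClientServer::*;
        import DefaultValue::*;
        import FIFO::*;
        import GetPut::*;
        import Vector::*;

        typedef 4 NUM_BANKS;   // illustrative bank count

        interface BANKED_STORE;
           method Action readReq(Bit#(16) addr);
           method ActionValue#(Bit#(64)) readResp();
        endinterface

        module mkBankedStore (BANKED_STORE);
           // One small BRAM per bank, each with its own registered port.
           Vector#(NUM_BANKS, BRAM1Port#(Bit#(14), Bit#(64))) banks <-
              replicateM(mkBRAM1Server(defaultValue));

           // Tags of in-flight requests, so responses come back in
           // request order even though banks respond independently.
           FIFO#(Bit#(2)) inflight <- mkSizedFIFO(8);

           method Action readReq(Bit#(16) addr);
              Bit#(2)  bank = truncate(addr);     // low bits pick the bank
              Bit#(14) idx  = truncateLSB(addr);  // high bits index within it
              banks[bank].portA.request.put(BRAMRequest {
                 write: False, responseOnWrite: False,
                 address: idx, datain: ? });
              inflight.enq(bank);
           endmethod

           method ActionValue#(Bit#(64)) readResp();
              let bank = inflight.first(); inflight.deq();
              let d <- banks[bank].portA.response.get();
              return d;
           endmethod
        endmodule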

  13. Banked Cache Overhead
     • Simple kernel (hit rate = 100%); overhead shown for latency-oriented and throughput-oriented applications.

  14. Banked Cache Overhead
     • Simple kernel (hit rate = 69%).

  15. Results: Scaling Private Caches
     • Case study: Merger (an HLS kernel) has 4 partitions; each connects to a LEAP scratchpad and forms a sorted linked list from a stream of random values.

  16. Private or Shared Cache?
     • We can now build large caches. Where should we allocate the spare BRAMs?
       – Option 1: large private caches
       – Option 2: a large shared cache at the next level
     • Many applications have multiple memory clients with different working-set sizes and runtime memory footprints.

  17. Adding a Shared Cache
     • A shared on-chip cache, sized to consume all extra BRAMs, is placed between the scratchpad controller and the central cache in on-board DRAM; the central cache on the FPGA is in turn backed by host memory.

  18. Automated Optimization
     • Build flow: user kernel generation (Bluespec, Verilog, HLS kernel) → LEAP platform construction → BRAM usage estimation (from a pre-built database) → shared cache construction → FPGA tool chain.
     • The user supplies the target frequency and memory demands (e.g., cache capacity); a sizing sketch follows.
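     To give a flavor of the sizing step, here is a hypothetical back-of-the-envelope version of it as a Bluespec static-elaboration function. The function, its inputs, and the power-of-two policy are all invented for illustration; Scavenger’s real estimator works from a pre-built database of platform builds.

        // Hypothetical sketch only: pick the largest power-of-two
        // number of sets whose BRAM cost fits in the spare budget.
        function Integer sharedCacheSets(Integer totalBRAMs,
                                         Integer usedBRAMs,
                                         Integer bramsPerSet);
           Integer spare = totalBRAMs - usedBRAMs;
           Integer sets  = 1;
           while (sets * 2 * bramsPerSet <= spare)
              sets = sets * 2;
           return sets;
        endfunction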

  19. Results: Shared Cache
     • Case study: Filter (an HLS kernel), a filtering algorithm for K-means clustering, with 8 partitions; each partition uses 3 LEAP scratchpads.
     • Shared-cache configurations evaluated: 16384 sets × 2 ways, 8192 sets × 4 ways, 8192 sets × 2 ways, 4096 sets × 1 way.

  20. Conclusion
     • It is possible to exploit unused resources to construct memory systems that accelerate the user program.
     • We propose microarchitectural changes that let large on-chip caches run at high frequency.
     • We take steps toward automating the construction of memory hierarchies based on program resource utilization and frequency requirements.
     • Future work: program analysis, energy study.

  21. Thank You
