

  1. DynaBurst: Dynamically Assembling DRAM Bursts over a Multitude of Random Accesses
  Mikhail Asiatici and Paolo Ienne
  Processor Architecture Laboratory (LAP), School of Computer and Communication Sciences, EPFL
  FPL 2019, Barcelona, Spain, 11 September 2019

  2. Motivation
  [Diagram: several accelerators, each with a local memory, connected to external memory through a DDRx controller.]
  Read accesses: regular, predictable, local reuse

  3. Motivation
  [Diagram: accelerators with 64-bit ports connected through a miss-optimized memory system (nonblocking cache) to a DDRx controller with a 512-bit port; a DDR burst is 8 beats.]
  Read accesses: irregular, short, pattern unknown at compile time
  M. Asiatici and P. Ienne, “Stop Crying Over Your Cache Miss Rate: Handling Efficiently Thousands of Outstanding Misses in FPGAs”, ISFPGA 2019

  4. Limitations of Prior Work
  (1) Useless with multiple, narrow ports:
  - Reuse opportunities are rare
  - Most of the DDR burst content is wasted

  5. Limitations of Prior Work
  [Diagram: per-bank request queues in the DDRx controller; each bank has a 1 kB row buffer; requests hitting different rows of the same bank cause row conflicts.]
  - The controller reorders requests...
  - ...but it has a limited view of future requests (10s)
  The access pattern sent to the memory controller matters!


  7. Limitations of Prior Work
  (1) Useless with multiple, narrow ports
  (2) No care given to the access pattern sent to the memory controller
  → Up to 60% of bandwidth lost to row conflicts

  8. Key Idea: Bursts of Memory Requests
  (1) Bursts increase reuse opportunities and use larger portions of DDR bursts
  (2) The access pattern becomes locally sequential and we make use of larger portions of DDR rows, increasing available bandwidth

  9. Outline
  • Nonblocking Caches and Miss-Optimized Memory Systems
  • Top-Level Architecture
  • Handling Bursts
  • Experimental Setup
  • Results
  • Conclusion

  10. Nonblocking Caches
  MSHR = Miss Status Holding Register
  1 MSHR ↔ 1 memory request = 1 cache line; 1 subentry ↔ 1 incoming request
  [Diagram: a cache array and an MSHR array; requests to 0x1004 and 0x100C both miss on line 0x100 and share one MSHR with two subentries.]
  Primary miss: allocate an MSHR, allocate a subentry, send the memory request to external memory
  Secondary miss: allocate a subentry only; no new memory request
  • MSHRs provide reuse without having to store the cache line → same result, smaller area
  • More MSHRs can be better than a larger cache
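The primary/secondary miss bookkeeping above can be sketched in software. This is a hypothetical illustration, not the authors' RTL; the names `MSHR` and `handle_miss` are mine.

```python
# Illustrative model of MSHR-based miss handling in a nonblocking cache.
# A primary miss allocates an MSHR and issues one memory request; a
# secondary miss to the same line only records a subentry, so the reuse
# costs no extra external-memory traffic.

class MSHR:
    def __init__(self, tag):
        self.tag = tag          # cache-line tag this MSHR tracks
        self.subentries = []    # one subentry per pending incoming request

def handle_miss(mshr_array, tag, offset, req_id, send_memory_request):
    mshr = mshr_array.get(tag)
    if mshr is None:                          # primary miss
        mshr = MSHR(tag)
        mshr_array[tag] = mshr
        send_memory_request(tag)              # exactly one request per MSHR
    mshr.subentries.append((offset, req_id))  # both cases add a subentry
    return mshr
```

Two misses to line 0x100 issue a single memory request and leave two subentries waiting for the same response.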

  11. Scaling Up Miss Handling
  Traditional nonblocking caches: fully-associative MSHR array; subentry slots statically assigned to MSHRs
  Miss-Optimized Memory System [1]: cuckoo hash tables (hash functions h_0 ... h_{d-1}) in BRAMs for the MSHR buffer; dynamic subentry allocation in the subentry buffer
  → Efficient storage and lookup of 10,000s of MSHRs and subentries
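A minimal sketch of the cuckoo-hashing idea the MSHR buffer relies on: each key has d candidate slots (here d = 2), so a lookup needs only a fixed number of probes, which maps well to BRAM. Table sizes, hash constants, and class names are illustrative, not those of the actual design.

```python
# Two-way cuckoo hash table: a key can live in exactly one slot per table,
# so lookup is two probes in the worst case. Insertion may evict an
# existing entry, which then retries in its other table.

class CuckooTable:
    def __init__(self, size=8, max_kicks=32):
        self.tables = [[None] * size, [None] * size]
        self.size = size
        self.max_kicks = max_kicks

    def _h(self, i, key):
        # Two cheap multiplicative hashes (arbitrary odd constants).
        return (key * (2654435761 if i == 0 else 40503)) % self.size

    def lookup(self, key):
        for i in (0, 1):                      # one probe per table
            e = self.tables[i][self._h(i, key)]
            if e is not None and e[0] == key:
                return e[1]
        return None

    def insert(self, key, value):
        entry, i = (key, value), 0
        for _ in range(self.max_kicks):
            slot = self._h(i, entry[0])
            entry, self.tables[i][slot] = self.tables[i][slot], entry
            if entry is None:
                return True
            i ^= 1                            # evicted entry tries the other table
        return False                          # insertion failed: structure full
```

The bounded-probe lookup is what makes such tables attractive for storing tens of thousands of MSHR tags in hardware.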

  12. Outline
  • Nonblocking Caches and Miss-Optimized Memory System
  • Top-Level Architecture
  • Handling Bursts
  • Experimental Setup
  • Results
  • Conclusion

  13. Top-Level Architecture
  [Diagram: each accelerator reaches a multi-ported memory interface backed by a cache, an MSHR buffer, a subentry buffer, and a data buffer; a crossbar connects all banks to the external memory controller. Example: a request with ID 56 for address 0x1004 misses, is recorded under tag 0x100 in the MSHR buffer with its offset and ID as a subentry, and is served when line 0x100 returns from memory.]

  14. What’s New in DynaBurst
  • Multi-ported buffers: flip-flops → LUTRAM, BRAM
  • Variable-length bursts

  15. Outline
  • Nonblocking Caches and Miss-Optimized Memory System
  • Top-Level Architecture
  • Handling Bursts
  • Experimental Setup
  • Results
  • Conclusion

  16. From Single Requests to Bursts
  [Diagram: memory space; one MSHR covers one cache line.]
  1 MSHR: 1 memory request; 1 cache line = 1 memory request
  1 subentry: 1 incoming request

  17. From Single Requests to Bursts
  [Diagram: memory space; one MSHR now covers a fixed-length burst of N cache lines.]
  1 MSHR: 1 burst = N memory requests
  1 subentry: 1 incoming request
  Problem: data wastage

  18. From Single Requests to Bursts
  [Diagram: memory space; each MSHR owns a grouping region, and the actual burst covers only the requested lines within it.]
  1 MSHR: 1 burst = 1–N memory requests
  1 subentry: 1 incoming request

  19. From Single Requests to Bursts
  1 MSHR: 1 burst = 1–N memory requests; 1 subentry: 1 incoming request
  On a new request:
  1) Does an MSHR exist?
  2a) If yes: is the request covered by the current bounds of burst 0?
  3a) If not: is burst 0 still in the output request queue (so its bounds can still be extended)?
  2b) Is burst 1 valid? 3b) If yes: is the request covered by the current bounds of burst 1? 4b) If not: is burst 1 still in the output queue?
  2c) Is burst 2 valid? 3c) If yes: is the request covered by the current bounds of burst 2? 4c) If not: is burst 2 still in the output queue? ...
  → Large area and delay overheads
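The lookup chain this slide spells out can be sketched as software (a naive model; all names and structures here are hypothetical). Every additional burst tracked per MSHR adds another comparator stage to this chain, which is where the area and delay overheads come from.

```python
# Naive multi-burst lookup per MSHR. For each tracked burst in turn:
# if the new line falls inside its bounds, the request is covered; if the
# burst has not yet left the output queue, its bounds can still be
# widened; otherwise move on to the next burst, and finally open a new one.

def covers(burst, line):
    lo, hi = burst["bounds"]
    return lo <= line <= hi

def handle_request(mshr, line, output_queue, new_burst):
    for burst in mshr["bursts"]:              # one comparator stage per burst
        if covers(burst, line):
            return "covered"                  # just add a subentry
        if burst in output_queue:             # not sent yet: extend bounds
            lo, hi = burst["bounds"]
            burst["bounds"] = (min(lo, line), max(hi, line))
            return "extended"
    mshr["bursts"].append(new_burst(line))    # nothing matched: new burst
    return "new-burst"
```

DynaBurst avoids walking such a chain by keeping a single burst per MSHR and invalidating it when it can no longer be updated (next slide).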

  20. Burst Invalidations
  A burst is updatable during N_Q / (N_Q + N_mem) of its lifetime (see paper)
  Capacities: output request queue N_Q ≈ 1,000–10,000; memory pipeline N_mem ≈ 10–100
  1 MSHR: 1 burst = 1–N memory requests; 1 subentry: 1 incoming request
  Invalidated burst still pending in DDR memory: ignore the first response
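The fraction on this slide is garbled in the source; reading it as N_Q / (N_Q + N_mem), which matches the "N Q" and "N mem" residue and the capacity figures given (this reading is my assumption), the arithmetic works out as follows:

```python
# Assumed interpretation: a burst can be updated while it waits in the
# output request queue (capacity ~1,000-10,000) but not once it is in the
# memory pipeline (~10-100 stages), so the updatable share of its
# lifetime is N_Q / (N_Q + N_mem).

def updatable_fraction(n_queue, n_mem_pipeline):
    return n_queue / (n_queue + n_mem_pipeline)

# Even with the least favorable values of the slide's ranges, the burst
# stays updatable for most of its lifetime:
print(round(updatable_fraction(1000, 100), 3))   # 0.909
```

This is why a deep output queue makes burst assembly effective: almost every request arrives while its burst is still mutable.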

  21. Outline
  • Nonblocking Caches and Miss-Optimized Memory System
  • Top-Level Architecture
  • Handling Bursts
  • Experimental Setup
  • Results
  • Conclusion

  22. Board: Xilinx ZC706
  • XC7Z045 Zynq-7000 FPGA
  • 1 GB of DDR3 on the processing system (PS) side: 3.9 GB/s through four 64-bit ports at 150 MHz
  • 1 GB of DDR3 on the programmable logic (PL) side: 12.0 GB/s through one 512-bit port at 200 MHz

  23. Accelerators: Compressed Sparse Row SpMV
  • 15 benchmarks from SuiteSparse (https://sparse.tamu.edu/), including web, social, and road networks, and linear programming
  • Single-precision floating-point values; vector size: 1.7–91 MB (BRAM size: 2.39 MB)
  • This work is not about optimized SpMV! We aim for a generic architectural solution
  • Why SpMV?
  - Representative of latency-tolerant, bandwidth-bound applications with various degrees of locality
  - Important kernel in many applications
  - Several sparse graph algorithms can be mapped to it
  A. Ashari et al., “Fast Sparse Matrix-Vector Multiplication on GPUs for Graph Applications”, SC 2014
  J. Kepner and J. Gilbert, “Graph Algorithms in the Language of Linear Algebra”, SIAM 2011
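For reference, CSR SpMV itself is short; the point of the benchmark is the `x[col_idx[k]]` gather, whose address pattern depends on the matrix structure and is unknown at compile time. A plain software version (not the accelerator implementation):

```python
# Compressed Sparse Row matrix-vector multiply. row_ptr delimits each
# row's slice of col_idx/values; reading x[col_idx[k]] is the irregular,
# data-dependent access that the memory system must absorb.

def csr_spmv(row_ptr, col_idx, values, x):
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[k] * x[col_idx[k]]   # random access into the vector
        y.append(acc)
    return y
```

When the vector x (up to 91 MB here) far exceeds on-chip BRAM (2.39 MB), these gathers become cache misses with little short-term reuse, which is exactly the workload DynaBurst targets.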

  24. PL and PS Systems
  PL system (high bandwidth, single wide port to PL DDR):
  - Same as in our previous work
  - 4 accelerators and banks
  - 200 MHz
  PS system (low bandwidth, 4 narrow ports to PS DDR):
  - 8 accelerators and banks
  - 150 MHz

  25. Design Space Exploration
                                 PL systems (4 banks)        PS systems (8 banks)
  Total cache size (KB)          0, 128, 256, 512, 1024      0, 64, 128, 256, 512
  Maximum burst length           2, 4, 8, 16                 2, 4, 8, 16
  Miss handling (6 subentries/row):
  - Small                        2k MSHRs, 12k subentries    4k MSHRs, 24k subentries
  - Medium                       6k MSHRs, 48k subentries    8k MSHRs, 48k subentries
  - Large                        16k MSHRs, 96k subentries   16k MSHRs, 96k subentries
  Baselines: other generic memory systems for irregular access patterns
  • Our prior work (single-request): each design point compared to the same cache and miss-handling configuration
  • Traditional nonblocking cache with associative MSHRs (16 MSHRs + 8 subentries each, per bank): each design point compared to the traditional cache with the closest BRAM utilization
