
DynaBurst: Dynamically Assemblying DRAM Bursts over a Multitude of Random Accesses
Mikhail Asiatici and Paolo Ienne
Processor Architecture Laboratory (LAP), School of Computer and Communication Sciences, EPFL
FPL 2019, Barcelona, Spain, 11 September 2019

Motivation
[Diagram: four accelerators, each with its own local memory, and external memory behind a DDRx memory controller]
• Read accesses: regular, predictable, local reuse

Motivation
[Diagram: four accelerators, a miss-optimized memory system (nonblocking cache), a DDRx memory controller, and memory; labels: 64 bits, 512 bits, 8 beats]
• Read accesses: irregular, short, pattern unknown at compile time
[1] M. Asiatici and P. Ienne, “Stop Crying Over Your Cache Miss Rate: Handling Efficiently Thousands of Outstanding Misses in FPGAs,” ISFPGA 2019

Limitations of Prior Work
[Diagram: four accelerators, miss-optimized memory system (nonblocking cache), DDRx memory controller, memory]
(1) Useless with multiple, narrow ports
- Reuse opportunities are rare
- Most of DDR burst content is wasted

Limitations of Prior Work
[Diagram: four accelerators, DDRx memory controller, and memory banks with a 1 kB row buffer; a row conflict is marked]
- Request reordering
- …but limited view of future requests (10s)
The access pattern sent to the memory controller matters!
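
To make the row-conflict point concrete, here is a toy model that counts how often an access lands on a bank whose open row differs from the requested one. Only the 1 kB row buffer comes from the slide; the bank count, the row-interleaved address mapping, and the function name `row_conflicts` are assumptions made for illustration, not the behaviour of any specific DDR controller.

```python
import random

ROW_BYTES = 1024   # row buffer size quoted on the slide (1 kB)
NUM_BANKS = 8      # assumed bank count, for illustration only


def row_conflicts(addresses):
    """Count accesses to a bank whose currently open row is a different row."""
    open_row = {}  # bank -> row currently held in that bank's row buffer
    conflicts = 0
    for addr in addresses:
        row_index = addr // ROW_BYTES
        bank = row_index % NUM_BANKS    # assumed row-interleaved bank mapping
        row = row_index // NUM_BANKS
        if bank in open_row and open_row[bank] != row:
            conflicts += 1
        open_row[bank] = row
    return conflicts


random.seed(0)
addrs = [random.randrange(0, 1 << 24, 64) for _ in range(10_000)]
random_conflicts = row_conflicts(addrs)          # irregular order
sorted_conflicts = row_conflicts(sorted(addrs))  # locally sequential order
print(random_conflicts, sorted_conflicts)
```

In this model, the same set of addresses causes fewer conflicts when issued in sorted order, which is the effect a reordering window of only tens of requests cannot fully capture.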


Limitations of Prior Work
[Diagram: four accelerators, miss-optimized memory system (nonblocking cache), DDRx memory controller, memory]
(1) Useless with multiple, narrow ports
(2) No care given to access pattern to the memory controller
→ Up to 60% of bandwidth lost to row conflicts

Key Idea: Bursts of Memory Requests
[Diagram: four accelerators, DynaBurst, DDRx memory controller, memory with row buffer]
(1) Useless with multiple, narrow ports
→ Bursts increase reuse opportunities and use larger portions of DDR bursts
(2) No care given to access pattern to the memory controller
→ Access pattern becomes locally sequential and we make use of larger portions of DDR rows, increasing available bandwidth

Outline
• Nonblocking Caches and Miss-Optimized Memory Systems
• Top-Level Architecture
• Handling Bursts
• Experimental Setup
• Results
• Conclusion

Nonblocking Caches
MSHR = Miss Status Holding Register
• 1 MSHR ↔ 1 memory request = 1 cache line
• 1 subentry ↔ 1 incoming request
[Diagram: cache array (tag, data) and MSHR array (tag, subentries). Requests 0x1004 and 0x100C miss on line 0x100; the MSHR with tag 0x100 holds subentries 4 and C. External memory returns 0x100: 0x36C2156B751D4EBB940316495CB, and the responses 0x156B and 0xEBB9 are sent back.]
Primary miss:
• allocate MSHR
• allocate subentry
• send memory request
Secondary miss:
• allocate subentry
Takeaways:
• MSHRs provide reuse without having to store the cache line → same result, smaller area
• More MSHRs can be better than a larger cache
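
The primary/secondary miss steps above can be sketched as a small software model. This is only an illustration, not the hardware design: the class name `MshrFile` and its methods are hypothetical, and the 16-byte line size is an assumption chosen so that 0x1004 and 0x100C share tag 0x100, as in the slide's example.

```python
LINE_BYTES = 16  # assumed line size: 0x1004 and 0x100C then share tag 0x100


class MshrFile:
    """Toy MSHR array: one MSHR per outstanding line, one subentry per request."""

    def __init__(self):
        self.mshrs = {}            # line tag -> list of (request id, byte offset)
        self.memory_requests = []  # tags sent to external memory, in order

    def miss(self, req_id, addr):
        tag, offset = divmod(addr, LINE_BYTES)
        if tag not in self.mshrs:
            # Primary miss: allocate MSHR and send one memory request.
            self.mshrs[tag] = []
            self.memory_requests.append(tag)
            kind = "primary"
        else:
            # Secondary miss: the line is already on its way, no new request.
            kind = "secondary"
        self.mshrs[tag].append((req_id, offset))  # allocate subentry
        return kind

    def fill(self, tag, line):
        """Memory response: serve every pending subentry, then free the MSHR."""
        return [(req_id, line[offset]) for req_id, offset in self.mshrs.pop(tag)]


m = MshrFile()
print(m.miss(0, 0x1004), m.miss(1, 0x100C), m.memory_requests)
print(m.fill(0x100, bytes(range(16))))
```

Both requests are served from a single memory request without the line ever being stored in a cache, which is the reuse the slide refers to.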

Scaling Up Miss Handling
Traditional nonblocking caches:
• Fully-associative array
• Subentry slots statically assigned to MSHRs
Miss-Optimized Memory System [1]:
• MSHR buffer: cuckoo hash tables in BRAMs (hash functions h_0 … h_{d-1})
• Subentry buffer: dynamic subentry allocation
→ efficient storage and lookup of 10,000s of MSHRs and subentries
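
For readers unfamiliar with cuckoo hashing, the following is a minimal software sketch of the general technique: d tables, one hash function per table, and displacement of the occupant on collision. It is not the BRAM implementation from [1]; the class `CuckooTable`, its parameters, and the use of Python's built-in `hash` as a stand-in for the hardware hash functions are all assumptions.

```python
class CuckooTable:
    """Toy d-way cuckoo hash table keyed by line tag (illustrative only)."""

    def __init__(self, ways=2, slots=64, max_kicks=32):
        self.ways, self.slots, self.max_kicks = ways, slots, max_kicks
        self.tables = [[None] * slots for _ in range(ways)]

    def _index(self, way, key):
        # Stand-in for hash function h_way; real hardware uses its own hashes.
        return hash((way + 1, key)) % self.slots

    def lookup(self, key):
        # A key can only live in one slot per way, so lookup probes d slots.
        for way in range(self.ways):
            entry = self.tables[way][self._index(way, key)]
            if entry is not None and entry[0] == key:
                return entry[1]
        return None

    def insert(self, key, value):
        """Place (key, value); return None on success, or the entry left
        without a slot after max_kicks displacements."""
        entry, way = (key, value), 0
        for _ in range(self.max_kicks):
            idx = self._index(way, entry[0])
            # Swap: store our entry, pick up whatever occupied the slot.
            entry, self.tables[way][idx] = self.tables[way][idx], entry
            if entry is None:
                return None
            way = (way + 1) % self.ways  # retry the displaced entry elsewhere
        return entry


table = CuckooTable()
table.insert(0x100, "MSHR for line 0x100")
print(table.lookup(0x100), table.lookup(0x200))
```

The appeal for hardware is that a lookup touches exactly d fixed locations, one per table, instead of comparing against every entry as a fully-associative array does.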

Outline
• Nonblocking Caches and Miss-Optimized Memory Systems
• Top-Level Architecture
• Handling Bursts
• Experimental Setup
• Results
• Conclusion

Top-Level Architecture
[Diagram: accelerators, crossbar, banks (each with a cache, MSHR buffer, subentry buffer, and data buffer), multi-ported memory interface, external memory controller; hit and miss paths are marked. Example: a request with ID 56 and address 0x1004 misses on tag 0x100, and memory returns 0x100: 0x36C2156B751D4EBB940316495CB.]

What’s New in DynaBurst
• Multi-ported memory interface
• Flip-flops → LUTRAM, BRAM
• Variable-length bursts
[Diagram: accelerators, crossbar, banks (each with a cache, MSHR buffer, subentry buffer, and data buffer), multi-ported memory interface, external memory controller]

From Single Requests to Bursts
• 1 MSHR: 1 memory request
• 1 subentry: 1 incoming request
• 1 cache line = 1 memory request
[Diagram: memory space divided into cache lines, with one MSHR mapped to one cache line]

From Single Requests to Bursts
• 1 MSHR: 1 burst = N memory requests
• 1 subentry: 1 incoming request
• Problem: data wastage
[Diagram: memory space divided into cache lines, with one MSHR mapped to a burst of N cache lines]

From Single Requests to Bursts
• 1 MSHR: 1 burst = 1–N memory requests
• 1 subentry: 1 incoming request
[Diagram: memory space divided into cache lines; one MSHR maps to a grouping region, and the actual burst covers only part of that region]
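
One way to picture a variable-length burst inside a grouping region is the sketch below. It assumes, for illustration only, that the actual burst is the contiguous span from the lowest to the highest requested line in the region, and it ignores timing entirely (whether the burst has already left the output queue). `LINES_PER_REGION`, `BurstMshr`, and `assemble` are hypothetical names, not taken from the slides.

```python
LINES_PER_REGION = 8  # assumed maximum burst length N, in cache lines


class BurstMshr:
    """One MSHR per grouping region; tracks the span of lines requested so far."""

    def __init__(self, first_line):
        self.region = first_line // LINES_PER_REGION
        self.lo = self.hi = first_line % LINES_PER_REGION

    def add(self, line):
        assert line // LINES_PER_REGION == self.region
        offset = line % LINES_PER_REGION
        # Grow the burst bounds just enough to cover the new request.
        self.lo, self.hi = min(self.lo, offset), max(self.hi, offset)

    def burst(self):
        """Return (first line, length in lines) of the actual burst."""
        start = self.region * LINES_PER_REGION + self.lo
        return start, self.hi - self.lo + 1


def assemble(lines):
    """Group requested line numbers into one burst per grouping region."""
    mshrs = {}
    for line in lines:
        region = line // LINES_PER_REGION
        if region in mshrs:
            mshrs[region].add(line)
        else:
            mshrs[region] = BurstMshr(line)
    return sorted(m.burst() for m in mshrs.values())


print(assemble([10, 13, 11, 40]))  # → [(10, 4), (40, 1)]
```

Lines 10, 11, and 13 fall in the same region and become one 4-line burst, while the isolated request to line 40 stays a single-line request, so no full N-line burst is fetched for it.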

From Single Requests to Bursts
• 1 MSHR: 1 burst = 1–N memory requests
• 1 subentry: 1 incoming request
[Diagram: memory space, MSHR, output request queue, DDR memory; 1 cache line marked]
One burst per MSHR. On a new request:
1) MSHR exists?
2) If yes: request covered by current burst bounds?
3) If not: burst still in the output queue?
Multiple bursts per MSHR. On a new request:
1) MSHR exists?
2a) If yes: request covered by current burst 0 bounds?
3a) If not: burst 0 still in the output queue?
2b) Is burst 1 valid?
3b) If yes: request covered by current burst 1 bounds?
4b) If not: burst 1 still in the output queue?
2c) Is burst 2 valid?
3c) If yes: request covered by current burst 2 bounds?
4c) If not: burst 2 still in the output queue?
…
→ Large area and delay overheads

Burst Invalidations
• 1 MSHR: 1 burst = 1–N memory requests
• 1 subentry: 1 incoming request
• Burst updatable during [fraction lost in extraction; the surviving symbols are N_Q and N_mem] of its lifetime
• Capacities:
- Queue: ≈ 1,000 – 10,000
- Memory pipeline: ≈ 10 – 100
• [Second expression lost in extraction] (see paper)
• Invalidated burst pending: ignore first response
[Diagram: memory space, MSHR, output request queue, DDR memory]

Board: Xilinx ZC706
• XC7Z045 Zynq-7000 FPGA
• 1 GB of DDR3 on processing system (PS) side
- 3.9 GB/s through four 64-bit ports at 150 MHz
• 1 GB of DDR3 on programmable logic (PL) side
- 12.0 GB/s through one 512-bit port at 200 MHz

Accelerators: Compressed Sparse Row SpMV
• This work is not about optimized SpMV!
- We aim for a generic architectural solution
• Why SpMV?
- Representative of latency-tolerant, bandwidth-bound applications with various degrees of locality
- Important kernel in many applications
- Several sparse graph algorithms can be mapped to it
• 15 benchmarks from SuiteSparse
- Include web, social, road networks, and linear programming
• Single-precision floating point values
• Vector size: 1.7–91 MB (BRAM size: 2.39 MB)
A. Ashari et al., “Fast Sparse Matrix-Vector Multiplication on GPUs for graph applications,” SC 2014
https://sparse.tamu.edu/
J. Kepner and J. Gilbert, “Graph Algorithms in the Language of Linear Algebra,” SIAM 2011
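
As a reminder of why SpMV stresses the memory system, here is a minimal software reference of the CSR kernel. It is a textbook formulation, not the accelerator used in the experiments; the function name `spmv_csr` and the array names `row_ptr`, `col_idx`, and `values` are conventional but assumed here.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in Compressed Sparse Row form."""
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        # Nonzeros of row r occupy positions row_ptr[r] .. row_ptr[r + 1] - 1.
        for k in range(row_ptr[r], row_ptr[r + 1]):
            # values and col_idx are streamed sequentially; the read of
            # x[col_idx[k]] is the irregular, data-dependent access.
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


# A = [[1, 0, 2],
#      [0, 3, 0],
#      [4, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
print(spmv_csr(row_ptr, col_idx, values, [1.0, 2.0, 3.0]))  # → [7.0, 6.0, 19.0]
```

When the vector is larger than on-chip memory, as with the 1.7–91 MB vectors listed above, the reads of `x` go to DRAM with a pattern known only at run time.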

PL and PS Systems
PL system (high bandwidth, single wide port):
- PL DDR
- Same as in our previous work
- 4 accelerators and banks
- 200 MHz
PS system (low bandwidth, 4 narrow ports):
- PS DDR
- 8 accelerators and banks
- 150 MHz

Design Space Exploration

                                    PL systems (4 banks)        PS systems (8 banks)
Total cache size (KB)               0, 128, 256, 512, 1024      0, 64, 128, 256, 512
Maximum burst length                2, 4, 8, 16                 2, 4, 8, 16
Miss handling (6 subentries/row)
- Small                             2k MSHR, 12k subentries     4k MSHR, 24k subentries
- Medium                            6k MSHR, 48k subentries     8k MSHR, 48k subentries
- Large                             16k MSHR, 96k subentries    16k MSHR, 96k subentries

• Baselines: other generic memory systems for irregular access patterns
• Our prior work (single-request)
- Each design point compared to same cache and miss handling configuration
• Traditional nonblocking cache with associative MSHRs
- 16 MSHRs + 8 subentries each, per bank
- Each design point compared to traditional cache with closest BRAM utilization
