DynaBurst: Dynamically Assemblying DRAM Bursts over a Multitude of Random Accesses




SLIDE 1

DynaBurst: Dynamically Assemblying DRAM Bursts over a Multitude of Random Accesses

Mikhail Asiatici and Paolo Ienne
Processor Architecture Laboratory (LAP)
School of Computer and Communication Sciences, EPFL
FPL 2019, Barcelona, Spain
11 September 2019

SLIDE 2

Motivation

Read accesses: regular, predictable local reuse

[Diagram: DDRx memory, memory controller, four accelerators, each with a local memory]

SLIDE 3

Motivation

Read accesses: irregular, short, pattern unknown at compile time

[Diagram: DDRx memory (8 beats × 64 bits = 512 bits), memory controller, miss-optimized memory system (nonblocking cache), four accelerators]

  • M. Asiatici and P. Ienne, “Stop Crying Over Your Cache Miss Rate: Handling Efficiently Thousands of Outstanding Misses in FPGAs” ISFPGA 2019

SLIDE 4

Limitations of Prior Work

(1) Useless with multiple, narrow ports

  • Reuse opportunities are rare
  • Most of DDR burst content is wasted

[Diagram: DDRx memory, memory controller, miss-optimized memory system (nonblocking cache), four accelerators]

  • M. Asiatici and P. Ienne, “Stop Crying Over Your Cache Miss Rate: Handling Efficiently Thousands of Outstanding Misses in FPGAs” ISFPGA 2019

SLIDE 5

Limitations of Prior Work

[Diagram: four accelerators, memory controller, DDRx memory with banks and 1 kB row buffers; a row conflict is marked]

  • Request reordering
  • …but limited view of future requests (10s)

The access pattern sent to the memory controller matters!

  • M. Asiatici and P. Ienne, “Stop Crying Over Your Cache Miss Rate: Handling Efficiently Thousands of Outstanding Misses in FPGAs” ISFPGA 2019

SLIDE 6

Limitations of Prior Work

[Diagram: four accelerators, memory controller, DDRx memory with banks and 1 kB row buffers; a row conflict is marked]

  • M. Asiatici and P. Ienne, “Stop Crying Over Your Cache Miss Rate: Handling Efficiently Thousands of Outstanding Misses in FPGAs” ISFPGA 2019

  • Request reordering
  • …but limited view of future requests (10s)

The access pattern sent to the memory controller matters!
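To illustrate why the access pattern matters, here is a toy sketch, not a model of any real DDRx controller. It assumes a single bank with one open 1 kB row and no request reordering; `count_row_misses` and the random address range are made up for the example. It compares the same set of addresses issued in random order and in sorted (locally sequential) order.

```python
import random

ROW_BYTES = 1024  # row buffer size quoted on the slide (1 kB)


def count_row_misses(addresses, row_bytes=ROW_BYTES):
    """Count accesses that target a row other than the currently open one.

    Toy model: one bank, one open row, no reordering. Real DDRx controllers
    have several banks and reorder a small window of requests.
    """
    open_row = None
    misses = 0
    for addr in addresses:
        row = addr // row_bytes
        if row != open_row:
            misses += 1       # the row must be (re)opened before the access
            open_row = row
    return misses


rng = random.Random(0)                       # fixed seed, arbitrary choice
random_pattern = [rng.randrange(1 << 20) for _ in range(4096)]
sorted_pattern = sorted(random_pattern)      # same addresses, locally sequential

print("random order:", count_row_misses(random_pattern))
print("sorted order:", count_row_misses(sorted_pattern))
```

In this model the sorted order opens each distinct row exactly once, so it can never do worse than the random order on the same addresses.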

SLIDE 7

Limitations of Prior Work

(1) Useless with multiple, narrow ports

(2) No care given to the access pattern sent to the memory controller → Up to 60% of bandwidth lost to row conflicts

[Diagram: DDRx memory, memory controller, miss-optimized memory system (nonblocking cache), four accelerators]

  • M. Asiatici and P. Ienne, “Stop Crying Over Your Cache Miss Rate: Handling Efficiently Thousands of Outstanding Misses in FPGAs” ISFPGA 2019

SLIDE 8

Key Idea: Bursts of Memory Requests

(1) Useless with multiple, narrow ports
→ Bursts increase reuse opportunities and use larger portions of DDR bursts

(2) No care given to the access pattern sent to the memory controller
→ Access pattern becomes locally sequential and we make use of larger portions of DDR rows, increasing available bandwidth

[Diagram: DDRx memory with row buffer, memory controller, DynaBurst, four accelerators]

SLIDE 9

Outline

  • Nonblocking Caches and Miss-Optimized Memory Systems
  • Top-Level Architecture
  • Handling Bursts
  • Experimental Setup
  • Results
  • Conclusion

SLIDE 10

Nonblocking Caches

MSHR = Miss Status Holding Register

1 MSHR ↔ 1 memory request = 1 cache line
1 subentry ↔ 1 incoming request

Primary miss

  • allocate MSHR
  • allocate subentry
  • send memory request

Secondary miss

  • allocate subentry

[Diagram: requests to 0x1004 and 0x100C miss in the cache array (tag, data); the MSHR array (tag, subentries) holds tag 0x100 with subentries 4 and C; external memory returns the cache line for 0x100]

  • MSHRs provide reuse without having to store the cache line → same result, smaller area
  • More MSHRs can be better than a larger cache
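The primary and secondary miss rules above can be sketched in software as follows. This is a minimal illustration, not the hardware design: the class and method names are hypothetical, and the 16-byte line size is an assumption chosen so that 0x1004 and 0x100C share tag 0x100 as on the slide.

```python
LINE_BYTES = 16  # assumed line size, so that 0x1004 and 0x100C share tag 0x100


class NonblockingCache:
    """Toy nonblocking cache: one MSHR per outstanding line, one subentry per request."""

    def __init__(self):
        self.lines = {}            # tag -> cache line data
        self.mshrs = {}            # tag -> list of subentries (request ID, offset)
        self.memory_requests = []  # tags sent to external memory

    def read(self, req_id, addr):
        tag, offset = divmod(addr, LINE_BYTES)
        if tag in self.lines:
            return "hit"
        if tag in self.mshrs:
            # Secondary miss: only a subentry is allocated
            self.mshrs[tag].append((req_id, offset))
            return "secondary miss"
        # Primary miss: allocate MSHR and subentry, send one memory request
        self.mshrs[tag] = [(req_id, offset)]
        self.memory_requests.append(tag)
        return "primary miss"

    def memory_response(self, tag, data):
        """Serve every subentry waiting on this line, then free the MSHR."""
        self.lines[tag] = data
        waiting = self.mshrs.pop(tag)
        return [(req_id, data[offset:offset + 4]) for req_id, offset in waiting]


cache = NonblockingCache()
print(cache.read(1, 0x1004))   # first request to the line
print(cache.read(2, 0x100C))   # second request to the same line, still pending
print(cache.memory_requests)   # a single memory request serves both
```

Both requests are answered from one memory response, which is the reuse the slide attributes to MSHRs.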

SLIDE 11

Scaling Up Miss Handling

Traditional nonblocking caches

  • Fully-associative array
  • Subentry slots statically assigned to MSHRs

Miss-Optimized Memory System [1]

  • Cuckoo hash tables in BRAMs (hash functions h0 … hd-1)
  • Dynamic subentry allocation (MSHR buffer, subentry buffer)

→ efficient storage and lookup of 10,000s of MSHRs and subentries
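As a rough software picture of a cuckoo hash table with d ways (h0 … hd-1), consider the sketch below. It is an illustration under loud assumptions: the class name, the eviction policy, and especially the toy hash family (successive base-`slots` digits of the key, chosen only so the example can be traced by hand) are all made up here and are not taken from the slides or the paper.

```python
class CuckooTable:
    """Toy d-way cuckoo hash table: each key has one candidate slot per way."""

    def __init__(self, ways=2, slots_per_way=4, max_kicks=8):
        self.ways = ways
        self.slots = slots_per_way
        self.max_kicks = max_kicks
        self.tables = [[None] * slots_per_way for _ in range(ways)]

    def _index(self, way, key):
        # Toy hash family (assumption): way w uses the w-th base-`slots` digit
        return (key // self.slots ** way) % self.slots

    def lookup(self, key):
        # Only `ways` slots need to be checked, one per table
        for way in range(self.ways):
            entry = self.tables[way][self._index(way, key)]
            if entry is not None and entry[0] == key:
                return entry[1]
        return None

    def insert(self, key, value):
        """Insert a key that is not yet present; returns False if it gives up."""
        entry = (key, value)
        victim_way = 0
        for _ in range(self.max_kicks):
            for way in range(self.ways):
                idx = self._index(way, entry[0])
                if self.tables[way][idx] is None:
                    self.tables[way][idx] = entry
                    return True
            # All candidate slots are full: evict one occupant and re-place it
            idx = self._index(victim_way, entry[0])
            entry, self.tables[victim_way][idx] = self.tables[victim_way][idx], entry
            victim_way = (victim_way + 1) % self.ways
        return False  # the last evicted entry is dropped; hardware would stall instead


table = CuckooTable()
for tag in (0, 4, 5, 20):      # 20 collides with 0 and 4 and forces one eviction
    table.insert(tag, hex(tag))
print([table.lookup(tag) for tag in (0, 4, 5, 20)])
```

The appeal for on-chip storage is that a lookup reads a fixed, small number of slots rather than comparing against every entry as a fully-associative array does.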

SLIDE 12

Outline

  • Nonblocking Caches and Miss-Optimized Memory System
  • Top-Level Architecture
  • Handling Bursts
  • Experimental Setup
  • Results
  • Conclusion

SLIDE 13

Top-Level Architecture

[Diagram: accelerators, crossbar, banks (each with cache, MSHR buffer, subentry buffer, and data buffer), multi-ported memory interface, external memory controller. Example request shown: ID 56, address 0x1004; hit and miss paths are marked, with tag 0x100 on the miss path and the line for 0x100 (tag, data) returned]

SLIDE 14

What’s New in DynaBurst

[Diagram: accelerators, crossbar, banks (each with cache, MSHR buffer, subentry buffer, and data buffer), multi-ported memory interface, external memory controller]

  • Flip-flops → LUTRAM, BRAM
  • Multi-ported
  • Variable-length bursts

SLIDE 15

Outline

  • Nonblocking Caches and Miss-Optimized Memory System
  • Top-Level Architecture
  • Handling Bursts
  • Experimental Setup
  • Results
  • Conclusion

SLIDE 16

From Single Requests to Bursts

1 cache line = 1 memory request

1 MSHR: 1 memory request
1 subentry: 1 incoming request

[Diagram: memory space, MSHR]

SLIDE 17

From Single Requests to Bursts

1 MSHR: 1 burst = N memory requests
1 subentry: 1 incoming request

Problem: Data wastage

[Diagram: memory space, 1 cache line, MSHR]

SLIDE 18

From Single Requests to Bursts

1 MSHR: 1 burst = 1–N memory requests
1 subentry: 1 incoming request

[Diagram: memory space, 1 cache line, MSHR; labels: grouping region, actual burst]
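The relation between a grouping region and the actual burst can be sketched as below. This is only an illustration of the idea of a burst of 1–N cache lines: the 64-byte line size, the maximum burst length N = 8, the aligned-region assumption, and both function names are hypothetical choices, not details given on the slide.

```python
LINE_BYTES = 64  # assumed cache-line size
MAX_BURST = 8    # assumed maximum burst length N, in cache lines


def region_and_offset(addr):
    """Map a byte address to (grouping region, line offset inside the region)."""
    line = addr // LINE_BYTES
    return line // MAX_BURST, line % MAX_BURST


def trimmed_burst(addresses):
    """Smallest contiguous burst covering all requests of one grouping region.

    Returns (first line offset, burst length in cache lines).
    """
    located = [region_and_offset(a) for a in addresses]
    if len({region for region, _ in located}) != 1:
        raise ValueError("addresses span more than one grouping region")
    offsets = [offset for _, offset in located]
    return min(offsets), max(offsets) - min(offsets) + 1


# Two requests in the same region: lines 1 and 3 of region 8
first, length = trimmed_burst([0x1040, 0x10C0])
print(f"burst starts at line {first}, length {length} (fixed-length would fetch {MAX_BURST})")
```

In this example a fixed-length burst would fetch all 8 lines of the region, while the trimmed burst fetches 3.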

SLIDE 19

From Single Requests to Bursts

1 MSHR: 1 burst = 1–N memory requests
1 subentry: 1 incoming request

On a new request:
1) MSHR exists?
2) If yes: request covered by current burst bounds?
3) If not: burst still in the output queue?

On a new request:
1) MSHR exists?
2a) If yes: request covered by current burst 0 bounds?
3a) If not: burst 0 still in the output queue?
2b) Is burst 1 valid?
3b) If yes: request covered by current burst 1 bounds?
4b) If not: burst 1 still in the output queue?
2c) Is burst 2 valid?
3c) If yes: request covered by current burst 2 bounds?
4c) If not: burst 2 still in the output queue?
…

→ Large area and delay overheads

[Diagram: memory space, 1 cache line, MSHR, output request queue, DDR memory]

SLIDE 20

Burst Invalidations

1 MSHR: 1 burst = 1–N memory requests
1 subentry: 1 incoming request

Burst updatable during … of its lifetime

Invalidated burst pending: ignore first response

Capacities:

  • Queue (NQ): ≈ 1,000 – 10,000
  • Memory pipeline (Nmem): ≈ 10 – 100
  • (see paper)

[Diagram: memory space, MSHR, output request queue, DDR memory; labels: I, NQ, Nmem]
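The checks on a new request and the invalidation rule can be put together in a small sketch. It is one possible reading of the slides, not the authors' implementation: all names are hypothetical, a single burst per MSHR is assumed, and the choice to re-enqueue a widened burst after an invalidation is an assumption (the slides only state that the first response of an invalidated burst is ignored).

```python
from dataclasses import dataclass, field


@dataclass
class BurstMSHR:
    first: int                  # first line offset of the burst in its grouping region
    last: int                   # last line offset of the burst
    in_queue: bool = True       # True until the burst leaves the output request queue
    stale_responses: int = 0    # responses to ignore because their burst was invalidated
    subentries: list = field(default_factory=list)


def handle_request(mshrs, output_queue, region, offset, req_id):
    mshr = mshrs.get(region)
    if mshr is None:                          # 1) no MSHR yet: allocate, enqueue a burst
        mshrs[region] = BurstMSHR(offset, offset, subentries=[req_id])
        output_queue.append(region)
        return "allocate"
    mshr.subentries.append(req_id)
    if mshr.first <= offset <= mshr.last:     # 2) covered by the current burst bounds
        return "covered"
    mshr.first = min(mshr.first, offset)
    mshr.last = max(mshr.last, offset)
    if mshr.in_queue:                         # 3) still queued: widen the burst in place
        return "updated"
    mshr.stale_responses += 1                 # already issued: its response will be ignored
    mshr.in_queue = True                      # assumption: a widened burst is re-enqueued
    output_queue.append(region)
    return "invalidated"


def issue_next(mshrs, output_queue):
    """Send the oldest queued burst to memory; it can no longer be updated in place."""
    region = output_queue.pop(0)
    mshrs[region].in_queue = False
    return region, mshrs[region].first, mshrs[region].last


mshrs, queue = {}, []
print(handle_request(mshrs, queue, 8, 1, "a"))
print(handle_request(mshrs, queue, 8, 3, "b"))
print(issue_next(mshrs, queue))
print(handle_request(mshrs, queue, 8, 6, "c"))
```

With a queue far deeper than the memory pipeline, most requests would arrive while the burst is still queued, so the invalidation path would be the rare case.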

SLIDE 21

Outline

  • Nonblocking Caches and Miss-Optimized Memory System
  • Top-Level Architecture
  • Handling Bursts
  • Experimental Setup
  • Results
  • Conclusion

SLIDE 22

Board: Xilinx ZC706

  • XC7Z045 Zynq-7000 FPGA
  • 1 GB of DDR3 on processing system (PS) side
  • 3.9 GB/s through four 64-bit ports at 150 MHz
  • 1 GB of DDR3 on programmable logic (PL) side
  • 12.0 GB/s through one 512-bit port at 200 MHz
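For orientation, the raw port-side bandwidth implied by the port widths and clocks above can be computed as ports × bytes per cycle × clock frequency. The helper below is a hypothetical sketch; the slide does not say how the 3.9 GB/s and 12.0 GB/s figures were obtained, so the raw products are shown only as upper bounds on what the ports themselves could carry.

```python
def port_bandwidth_gbps(ports, width_bits, clock_mhz):
    """Raw port-side bandwidth in GB/s: ports x bytes per cycle x cycles per second."""
    return ports * (width_bits // 8) * clock_mhz * 1e6 / 1e9


ps_raw = port_bandwidth_gbps(ports=4, width_bits=64, clock_mhz=150)
pl_raw = port_bandwidth_gbps(ports=1, width_bits=512, clock_mhz=200)

print(f"PS side: {ps_raw:.1f} GB/s raw, 3.9 GB/s quoted on the slide")
print(f"PL side: {pl_raw:.1f} GB/s raw, 12.0 GB/s quoted on the slide")
```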

SLIDE 23

Accelerators: Compressed Sparse Row SpMV

  • This work is not about optimized SpMV!
  • We aim for a generic architectural solution
  • Why SpMV?
  • Representative of latency-tolerant, bandwidth-bound applications with various degrees of locality
  • Important kernel in many applications
  • Several sparse graph algorithms can be mapped to it
  • 15 benchmarks from SuiteSparse (https://sparse.tamu.edu/)
  • Include web, social, road networks, and linear programming
  • Single-precision floating point values
  • Vector size: 1.7–91 MB (BRAM size: 2.39 MB)

  • A. Ashari et al. “Fast Sparse Matrix-Vector Multiplication on GPUs for graph applications” SC 2014
  • J. Kepner and J. Gilbert “Graph Algorithms in the Language of Linear Algebra” SIAM 2011
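For readers unfamiliar with the kernel, a plain software version of CSR SpMV (y = A·x) is sketched below. It is a textbook formulation rather than the accelerator's design, and the function name and the small example matrix are illustrative. The point of interest is the read of `x[col_idx[k]]`: its position depends on the matrix data, which is what makes the vector accesses irregular.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Sparse matrix-vector product y = A @ x with A in compressed sparse row form."""
    y = []
    for row in range(len(row_ptr) - 1):
        acc = 0.0
        # Nonzeros of this row are stored contiguously: a regular, streaming access
        for k in range(row_ptr[row], row_ptr[row + 1]):
            # x is read at a data-dependent index: an irregular access
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


# A = [[1, 0, 2],
#      [0, 0, 3],
#      [4, 5, 0]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 2, 0, 1]
values = [1.0, 2.0, 3.0, 4.0, 5.0]

print(spmv_csr(row_ptr, col_idx, values, [1.0, 2.0, 3.0]))
```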

SLIDE 24

PL and PS Systems

PL system (PL DDR: high bandwidth, single wide port)

  • Same as in our previous work
  • 4 accelerators and banks
  • 200 MHz

PS system (PS DDR: low bandwidth, 4 narrow ports)

  • 8 accelerators and banks
  • 150 MHz

  • M. Asiatici and P. Ienne, “Stop Crying Over Your Cache Miss Rate: Handling Efficiently Thousands of Outstanding Misses in FPGAs” ISFPGA 2019

SLIDE 25

Design Space Exploration

  • Baselines: other generic memory systems for irregular access pattern
  • Our prior work (single-request)
  • Each design point compared to same cache and miss handling configuration
  • Traditional nonblocking cache with associative MSHRs
  • 16 MSHRs + 8 subentries each, per bank
  • Each design point compared to traditional cache with closest BRAM utilization

Parameter                          PL systems (4 banks)        PS systems (8 banks)
Total cache size (KB)              0, 128, 256, 512, 1024      0, 64, 128, 256, 512
Maximum burst length               2, 4, 8, 16
Miss handling (6 subentries/row)
  Small                            2k MSHR, 12k subentries     4k MSHR, 24k subentries
  Medium                           6k MSHR, 48k subentries     8k MSHR, 48k subentries
  Large                            16k MSHR, 96k subentries    16k MSHR, 96k subentries

  • M. Asiatici and P. Ienne, “Stop Crying Over Your Cache Miss Rate: Handling Efficiently Thousands of Outstanding Misses in FPGAs” ISFPGA 2019

SLIDE 26

Outline

  • Nonblocking Caches and Miss-Optimized Memory System
  • Top-Level Architecture
  • Handling Bursts
  • Experimental Setup
  • Results
  • Conclusion

SLIDE 27

Impact of Maximum Burst Length and of Burst Trimming

  • Bursts much more useful on PS system
  • Some benefit on specific design points on PL system
  • Variable-length always better than fixed-length
  • Short bursts: minimize data wastage
  • Long bursts: maximize reuse opportunities, DDR burst, and DDR row usage
  • Optimum is somewhere in the middle

We’ll now consider only best max burst lengths

SLIDE 28

Cache and Miss Handling Size Exploration

  • Bursts required to make miss-optimized memory systems cost-effective
  • Bursts further improve performance on most of the design points

We’ll now look into these points

SLIDE 29

Speedup on Individual Benchmarks

  • Most effective where traditional cache performance was lower
  • Up to 3.4x speedup at similar area cost where caches are ineffective
  • Bursts further improve most of single-request results

SLIDE 30

Resource Utilization

Overhead vs single-request

             BRAM        Slices
PS system    0 – 15 %    30 – 40 %
PL system    2 – 15 %    10 – 15 %

SLIDE 31

Conclusion

  • Caches are not effective when read accesses are irregular and short
  • Single-request miss-optimized memory systems
  • Reuse same memory request among multiple incoming requests
  • Useful only when memory controllers have wide ports
  • DynaBurst
  • Merges incoming requests into bursts of memory requests
  • Controller with narrow ports: up to 3.4x speedup compared to cache with similar area
  • Controller with wide port: up to 2.4x speedup with < 15% area overhead compared to prior work

SLIDE 32

Thank you!

https://github.com/m-asiatici/dynaburst

SLIDE 33

Backup

SLIDE 34

Filling the Output Queue

Memory bound → incoming request rate > memory request rate

MSHRs:

  • Allocated on primary miss
  • Deallocated on memory response

Memory response rate = memory request rate

SLIDE 35

Filling the Output Queue

[Plot over time t of NMSHR and NQ; slopes labelled (primary − memory) and (in − memory); levels marked NMSHR,eq, NQ,eq = NMSHR,eq − Nmem, and Nmem]

NMSHR,eq < NMSHR,max: memory is not a bottleneck any more

NMSHR,eq > NMSHR,max:

  • Stalls decrease incoming rate until NMSHR,eq = NMSHR,max
  • NQ,eq = NMSHR,max − Nmem

Probability that a burst can be updated:

$$\frac{N_Q}{N_Q + N_{mem}} = \frac{N_{MSHR,max} - N_{mem}}{N_{MSHR,max}} \approx 1 \quad (N_{MSHR,max} \gg N_{mem})$$
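To get a feel for the update probability, the short sketch below evaluates NQ / (NQ + Nmem) with NQ = NMSHR,max − Nmem. The function name and the specific numbers are illustrative assumptions, picked to be of the same order as the capacities quoted earlier in the deck (queue in the thousands, memory pipeline in the tens).

```python
def update_probability(n_mshr_max, n_mem):
    """N_Q / (N_Q + N_mem), with N_Q = N_MSHR,max - N_mem (memory-bound case)."""
    n_q = n_mshr_max - n_mem
    return n_q / (n_q + n_mem)


# Hypothetical capacities: memory pipeline of 64 entries, growing MSHR count
for n_mshr_max in (1024, 4096, 16384):
    print(n_mshr_max, round(update_probability(n_mshr_max, 64), 4))
```

The value approaches 1 as the number of MSHRs grows relative to the memory pipeline depth.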

SLIDE 36

Invalidations and Burst Usage