DRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric
Mingyu Gao, Christina Delimitrou, Dimin Niu, Krishna Malladi, Hongzhong Zheng, Bob Brennan, Christos Kozyrakis
ISCA – June 22, 2016
FPGA-Based Accelerators
Improve performance and energy efficiency
Good balance between flexibility (CPUs) and efficiency (ASICs)
Recently used for many datacenter apps
Pictures: Putnam, et al. A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services. ISCA’14
Deploy FPGAs in cost- and power-constrained systems:
Datacenter systems
Mobile systems
A high-density & low-power FPGA
Uses dense DRAM technology for lookup tables
DRAF vs. FPGA
[Figure: DRAM subarray, showing the MATs, sense amplifiers, row decoder, master wordline, local wordlines, and bitlines]
A DRAM subarray is naturally a lookup-table
[Figure: DRAM subarray as a lookup table, with the input at the row decoder and the output at the sense amplifiers]
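To make the subarray-as-lookup-table idea concrete, here is a minimal Python sketch. It assumes the input bits act as a row address and the data stored in the selected row is the output. The names `build_lut` and `lookup`, and the majority function used as an example, are illustrative and do not come from the talk.

```python
from itertools import product

def build_lut(func, num_inputs):
    """Precompute one stored row per input combination."""
    rows = []
    for bits in product((0, 1), repeat=num_inputs):
        rows.append(func(*bits))
    return rows

def lookup(rows, *bits):
    """Use the input bits as a row address (first bit is most significant)."""
    address = 0
    for bit in bits:
        address = (address << 1) | bit
    return rows[address]

# Hypothetical example function: 3-input majority vote.
majority = build_lut(lambda a, b, c: int(a + b + c >= 2), 3)
print(lookup(majority, 1, 0, 1))  # 1
```

Evaluating the function becomes a single indexed read, which is why a row-addressed memory array can stand in for a LUT.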
DRAM subarray: ~1k rows, ~10-bit input, ~8k-bit output, 10-30 ns delay
Mismatched LUT size: an 8192-bit LUT?
Slow speed: a LUT with 10 ns delay?
Destructive access: data lost after access?
[Figure: user clock vs. physical path through registers R1, R2 and LUTs L1-L4]
Explicit activation, restoration, and precharge operations
Issue of LUT chaining: order of LUT access
Must activate L2 after L1
Must activate L4 after both L2 & L3
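The need for explicit activation, restoration, and precharge operations can be illustrated with a toy behavioral model. This is a sketch under simplifying assumptions, not a circuit-accurate description: the class `Subarray` and its method names are hypothetical, and each row holds one opaque value.

```python
class Subarray:
    """Toy model of destructive DRAM row access."""

    def __init__(self, rows):
        self.cells = list(rows)   # stored row data
        self.row_buffer = None    # value held by the sense amplifiers
        self.open_row = None      # index of the activated row, if any

    def activate(self, row):
        if self.open_row is not None:
            raise RuntimeError("precharge required before the next activation")
        self.row_buffer = self.cells[row]
        self.cells[row] = None    # the read drains the cells: stored data is lost
        self.open_row = row
        return self.row_buffer

    def restore(self):
        if self.open_row is None:
            raise RuntimeError("no activated row to restore")
        self.cells[self.open_row] = self.row_buffer  # write sensed data back

    def precharge(self):
        self.row_buffer = None    # close the row and get ready for a new access
        self.open_row = None

lut = Subarray([0, 1, 1, 0])
print(lut.activate(2))  # 1
lut.restore()
lut.precharge()
print(lut.activate(2))  # 1
```

If `restore()` is skipped before `precharge()`, the next activation of that row returns `None` in this model, which mirrors the "data lost after access" concern.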
Basic Logic Element
Multi-Context Support
Timing
Same island layout and configurable interconnect as FPGA
DSP: in DRAM technology; slower, but not critical
Block RAM: uses DRAM arrays
CLB: contains multiple basic logic elements (BLEs); 7-10 bits input, 2-4 bits output
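As a rough sizing aid, a lookup table with k input bits and m output bits stores 2^k x m configuration bits. The sketch below applies this standard formula to the ends of the quoted ranges, assuming they describe a single logic element, and to a conventional 6-input, 1-output FPGA LUT for comparison. It is plain arithmetic rather than a figure from the talk; `lut_bits` is an illustrative name.

```python
def lut_bits(num_inputs, num_outputs):
    """Storage bits needed for a full truth table."""
    return (2 ** num_inputs) * num_outputs

print(lut_bits(6, 1))   # 64   (conventional 6-LUT)
print(lut_bits(7, 2))   # 256  (low end of the quoted ranges)
print(lut_bits(10, 4))  # 4096 (high end of the quoted ranges)
```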
[Figure: basic logic element built from a DRAM subarray, with added column logic and flip-flops (FFs)]
Narrower MAT: from 1k bits to 8-16 bits
Specialized column logic: better flexibility
Additional FFs & MUXs: registering & retiming
Single-MAT access
Multi-context
DRAF supports 8-16 contexts per chip
Instant switch between active contexts
Context uses
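The multi-context support described above can be sketched behaviorally as a logic element that holds several truth tables and selects one of them. The slides do not say how contexts are stored physically, so this models only the visible behavior; `MultiContextLUT` and its methods are hypothetical names.

```python
class MultiContextLUT:
    """Logic element with one truth table per context."""

    def __init__(self, tables):
        self.tables = [list(t) for t in tables]
        self.active = 0

    def switch(self, context):
        if not 0 <= context < len(self.tables):
            raise ValueError("no such context")
        self.active = context  # only the selector changes; no table is rewritten

    def lookup(self, address):
        return self.tables[self.active][address]

# Context 0 holds a 2-input AND, context 1 holds a 2-input XOR.
ble = MultiContextLUT([[0, 0, 0, 1], [0, 1, 1, 0]])
print(ble.lookup(3))  # 1
ble.switch(1)
print(ble.lookup(3))  # 0
```

Because every configuration is already resident, switching costs a selector update instead of a reconfiguration pass.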
[Figure: user clock vs. physical path through registers R1, R2 and LUTs L1-L4]
Issue of LUT chaining: order of LUT access
Solution: phases, assigned in a way similar to critical path finding
[Figure: phase timeline, with LUTs assigned to Phase 0, Phase 1, and Phase 2]
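One way to picture phase assignment is as longest-path levelization over the LUT dependency graph, the same computation that underlies critical path finding. The sketch below is a plausible reading of the slide, not the authors' algorithm: each LUT gets a phase one greater than the latest phase among the LUTs feeding it. The netlist follows the L1-L4 ordering constraints stated earlier; treating L1 and L3 as having no LUT predecessors is an assumption, and `assign_phases` is an illustrative name.

```python
from graphlib import TopologicalSorter

def assign_phases(preds):
    """preds maps each LUT to the LUTs whose outputs it consumes."""
    phases = {}
    # static_order() yields every LUT after all of its predecessors.
    for lut in TopologicalSorter(preds).static_order():
        feeding = (phases[p] for p in preds.get(lut, ()))
        phases[lut] = 1 + max(feeding, default=-1)
    return phases

netlist = {"L1": [], "L2": ["L1"], "L3": [], "L4": ["L2", "L3"]}
phases = assign_phases(netlist)
print(sorted(phases.items()))
```

With this netlist, L1 and L3 land in phase 0, L2 in phase 1, and L4 in phase 2, so every LUT is activated only after all of its inputs are valid.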
Issue: precharge and restore delays
Solution: 3-way delay overlapping
Performance gap between DRAF and FPGA reduces from >10x to 2-4x
[Figure: timing of chained LUT-1 and LUT-2 (ACT, PRE, RST, and wire delay) before and after overlapping, with the saved delay marked]
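The slides do not spell out which three delays overlap. One plausible reading, assumed here, is that restoration of the source LUT, the wire delay to the next LUT, and precharge of the destination LUT proceed in parallel, so each hop costs an activation plus the longest of the three instead of their sum. The delay values are made-up placeholders chosen for easy arithmetic, not measurements from the talk, and the function names are illustrative.

```python
def serial_hop(act, rst, pre, wire):
    """Per-hop delay when every operation waits for the previous one."""
    return act + rst + pre + wire

def overlapped_hop(act, rst, pre, wire):
    """Per-hop delay when restore, precharge, and wire delay run in parallel."""
    return act + max(rst, pre, wire)

delays = dict(act=5.0, rst=8.0, pre=5.0, wire=2.0)  # hypothetical values, in ns
print(serial_hop(**delays))      # 20.0
print(overlapped_hop(**delays))  # 13.0
```

Under this model the saving per hop is the sum of the two shorter delays among restore, precharge, and wire.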
Challenges and solutions
multi-context BLE
phase-based timing
3-way delay overlapping
Other design features (see paper)
Area, power, performance against FPGA and CPU
Synthesize, place & route with Yosys + VTR
CACTI-3DD with 45 nm power and area models
Comparisons
18 accelerator designs
[Figure: chip area (mm²) and peak chip power (W) vs. logic capacity (in million 6-LUT equivalents), FPGA vs. DRAF]
10x area improvement
50x peak power reduction
8-context DRAF occupies 19% less area than 1-context FPGA
[Figure: normalized minimum bounding area, split into FPGA logic, FPGA routing, DRAF logic, and DRAF routing, for aes, backprop, gemm, gmm, harris, stemmer, stencil, viterbi, and editdist]
Figure annotations: inefficient use of larger DRAM LUT; exp/log functions
Use one context in DRAF
DRAF consumes 1/3 the power of FPGA and 15% less energy
[Figure: normalized power, split into FPGA logic, FPGA routing, DRAF logic, and DRAF routing, for aes, backprop, gemm, gmm, harris, stemmer, stencil, viterbi, and editdist]
DRAF is 2.7x slower than FPGA
DRAF is 13.5x faster than CPU, 3.4x faster than ideal 4-core
[Figure: normalized throughput of CPU, 4 CPU, FPGA, and DRAF for aes, backprop, gemm, gmm, harris, stemmer, stencil, and viterbi]
Figure annotations: exp/log functions; efficient line buffer
DRAF: high-density and low-power reconfigurable fabric
DRAF targets cost and power constrained applications
DRAF trades off some performance for area & power efficiency
Questions?