HRL: Efficient and Flexible Reconfigurable Logic for Near-Data Processing
Mingyu Gao and Christos Kozyrakis Stanford University http://mast.stanford.edu HPCA – March 14, 2016
Near-Data Processing (NDP)
End of Dennard scaling: energy-bound systems
In-memory analytics: MapReduce, graph processing, deep neural networks, …
3D stacking: HMC, HBM
Figs: www.extremetech.com www.cisl.columbia.edu/grads/tuku/research/ www.oceanaute.blogspot.com/2015/06/how-to-shuffle-sort-mapreduce.html
Area-efficient
Power-efficient
Flexible
Programmable cores [IRAM, FlexRAM, NDC, TOP-PIM]
FPGA (fine-grained) [Active Pages]
CGRA (coarse-grained) [NDA]
ASIC [MSA, LiM]
[Table: the logic options above compared on area efficiency, power efficiency, and flexibility]
FPGA
CGRA
Heterogeneity: achieve the best of FPGA and CGRA
Outline: Motivation; NDP System Design; Heterogeneous Reconfigurable Logic (HRL); Evaluation; Conclusions
Multiple stacks linked to host processor through serial links
Host processor: multi-core chip with cache hierarchy; runs code with high temporal locality
Memory stack with NDP capability: runs memory-intensive code
High-speed serial links connect the host processor and the memory stacks
[Figure: memory stack; labels: DRAM die, logic die, NoC, vault, vault logic, bank, channel]
Vault:
Vault logic:
Processing phase: PEs run tasks independently and in parallel
Communication phase: data exchange and synchronization between PEs [PACT’15]
[Figure: tasks assigned to groups of PEs; labels: Task, PEs, Process, Local Buffer, Pull]
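The two-phase execution model can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `run_iteration`, the `process` callback, and the modulo mapping of destinations to PEs are all assumptions made for the example.

```python
from collections import defaultdict


def run_iteration(tasks, process, num_pes):
    """Run one iteration: a processing phase, then a communication phase.

    tasks   -- list of per-PE task lists (hypothetical representation)
    process -- callback returning (destination, value) pairs for a task
    """
    # Processing phase: each PE runs its own tasks independently and
    # buffers its outputs locally, grouped by destination PE.
    local_buffers = [defaultdict(list) for _ in range(num_pes)]
    for pe in range(num_pes):
        for task in tasks[pe]:
            for dest, value in process(task):
                # Assumed mapping: destination id modulo the PE count.
                local_buffers[pe][dest % num_pes].append(value)

    # Communication phase: every PE pulls the data addressed to it
    # from the local buffers of all producers.
    inbox = [[] for _ in range(num_pes)]
    for consumer in range(num_pes):
        for producer in range(num_pes):
            inbox[consumer].extend(local_buffers[producer][consumer])
    return inbox
```

In this sketch the loop over PEs is sequential; on real hardware the processing phase would run in parallel, and the end of the pull loop stands in for the synchronization point.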
[Figure: vault logic; labels: Global Controller (Host Processor), Task Queue, Generic Load/Store Unit, Scratchpad Buffer, DMA, DMA Ctrl, PEs, Output Queues (output write queues), Mem Ctrl, Router, to local memory, to remote memory; legend: User-defined Logic, Fixed Logic]
Handles task control and data communication
Allows the use of reconfigurable or custom PEs
Task queue: input & output data info from host or other PEs
Scratchpad buffer: cached data (e.g., graph vertices)
DMA: streaming data (e.g., graph edge list)
Output queues: separate queue for each consumer
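As a rough software analogy for these components, the sketch below models the task queue, scratchpad buffer, DMA stream, and per-consumer output queues. The class `VaultLogicSketch`, its method names, and the `scatter` helper are hypothetical and only illustrate how the pieces relate; they do not describe the actual hardware interfaces.

```python
from collections import deque


class VaultLogicSketch:
    """Toy model of the fixed logic around the PEs (names are assumptions)."""

    def __init__(self, consumers):
        self.task_queue = deque()   # input & output data info per task
        self.scratchpad = {}        # cached data, e.g. graph vertices
        # One separate output queue for each consumer.
        self.output_queues = {c: deque() for c in consumers}

    def submit(self, task):
        self.task_queue.append(task)

    def stream(self, data):
        # Stand-in for the DMA engine: delivers streaming data in order.
        yield from data

    def emit(self, consumer, value):
        self.output_queues[consumer].append(value)


def scatter(vault, edges):
    """Stream an edge list and forward cached vertex values to consumers."""
    for src, consumer in vault.stream(edges):
        vault.emit(consumer, vault.scratchpad[src])
```

For example, with vertices `{"a": 1.0, "b": 2.0}` in the scratchpad and the edge list `[("a", 0), ("b", 1), ("a", 1)]`, calling `scatter` leaves one value in the queue for consumer 0 and two in the queue for consumer 1.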
Outline: Motivation; NDP System Design; Heterogeneous Reconfigurable Logic (HRL); Evaluation; Conclusions
Fine-grained + coarse-grained reconfigurable blocks: area efficiency and flexibility
Static interconnects: power efficiency
Special blocks for branches & irregular data layout: flexibility
Compute throughput per Watt: 2.2x over FPGA, 1.7x over CGRA
[Figure: HRL array; a grid of FU, CLB, and OMB blocks connected by 16-bit data routing tracks and 1-bit control routing tracks, with control and data inputs and outputs]
FU: CGRA-style functional unit
CLB: FPGA-style configurable block
OMB: Output MUX Block
Flexible IO interface
[Figure: routing detail for FU, MXB, and CLB blocks; pin labels: in, INA, INB, sel, OUT, A, B, Y; legend: 16-bit data net tracks, 1-bit control net tracks, block output to routing track connection, routing switch box, block input MUX connection]
Fully static for low power
Separate data and control to reduce area
16-bit data, bus-based
1-bit control, few tracks
No switches between the two networks
Connection box and switch
[Figure: IO blocks; data types shown: 48-bit fixed-point, 16-bit short, 32-bit int; Control IO on 1-bit control tracks, Data IO on 16-bit data tracks]
Low cost and sufficiently flexible even for irregular data
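One way to picture how wider data types fit on 16-bit data tracks is to split each value into 16-bit lanes. The sketch below is only an illustration under stated assumptions: the helper names `to_lanes` and `from_lanes`, the unsigned encoding, and the least-significant-lane-first order are choices made for the example, not details taken from the HRL design.

```python
LANE_BITS = 16                      # width of one data routing track
LANE_MASK = (1 << LANE_BITS) - 1


def to_lanes(value, width_bits):
    """Split an unsigned value into 16-bit lanes, least significant first."""
    if width_bits % LANE_BITS != 0:
        raise ValueError("width must be a multiple of 16 bits")
    if not 0 <= value < (1 << width_bits):
        raise ValueError("value does not fit in the given width")
    return [(value >> shift) & LANE_MASK
            for shift in range(0, width_bits, LANE_BITS)]


def from_lanes(lanes):
    """Reassemble a value from lanes produced by to_lanes."""
    value = 0
    for index, lane in enumerate(lanes):
        value |= (lane & LANE_MASK) << (index * LANE_BITS)
    return value
```

Under these assumptions a 16-bit short occupies one lane, a 32-bit int two lanes, and a 48-bit fixed-point value three lanes.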
Outline: Motivation; NDP System Design; Heterogeneous Reconfigurable Logic (HRL); Evaluation; Conclusions
Workloads
Technology
Tools
Same logic capacity for each type of array
[Figure: single-array area (mm²) for FPGA, CGRA, and HRL, broken down into logic, routing, data routing, and control routing]
Coarse-grained FUs improve area efficiency
[Figure: power (mW) of FPGA, CGRA, and HRL on KC1 to KC11 and on average]
HRL: static but flexible routing
CGRA: high power on circuit-switched network
FPGA: half the frequency
[Figure: normalized perf/Watt of FPGA, CGRA, and HRL on KC1 to KC11 and on average]
HRL: 2.2x FPGA, 1.7x CGRA on perf/Watt
ASIC represents the upper bound of efficiency
Cores, FPGA, CGRA only match 30% to 80% of ASIC
HRL has 92% of ASIC performance on average
[Figure: normalized performance of Cores, FPGA, CGRA, HRL, and ASIC on GroupBy, Hist, LinReg, PageRank, SSSP, CC, ConvNet, MLP, and dA; workloads are split into "memory bandwidth saturated" and "memory bandwidth not saturated" groups]
NDP logic requirements: area + power efficiency, flexibility
Heterogeneous reconfigurable logic (HRL)
HRL for in-memory analytics
Questions?