
SLIDE 1

HRL: Efficient and Flexible Reconfigurable Logic for Near-Data Processing

Mingyu Gao and Christos Kozyrakis
Stanford University
http://mast.stanford.edu
HPCA – March 14, 2016

SLIDE 2

PIM is Coming Back …

Three trends drive Near-Data Processing (NDP):

End of Dennard scaling: energy-bound systems
In-memory analytics: MapReduce, graph processing, deep neural networks, …
3D stacking: HMC, HBM

Figs: www.extremetech.com, www.cisl.columbia.edu/grads/tuku/research/, www.oceanaute.blogspot.com/2015/06/how-to-shuffle-sort-mapreduce.html

SLIDE 3

NDP Logic Requirements

• Area-efficient

High processing throughput to match the high memory bandwidth
128 GBps per 50 mm² stack → > 32 Gflops → > 0.6 Gflops/mm²

• Power-efficient

Thermal constraints limit clock frequency
5 W per stack → 100 mW/mm²

• Flexible

Must amortize manufacturing cost through reuse across apps
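
The per-area targets above follow from simple division. The sketch below reproduces that arithmetic, assuming one floating-point operation per 4 bytes streamed from memory. That ratio is an assumption chosen because it yields the slide's 32 Gflops figure; the function and variable names are illustrative only.

```python
# Back-of-the-envelope NDP logic budget for one memory stack.
STACK_BANDWIDTH_GBPS = 128.0  # from the slide: 128 GBps per stack
STACK_AREA_MM2 = 50.0         # from the slide: 50 mm2 per stack
STACK_POWER_W = 5.0           # from the slide: 5 W per stack
BYTES_PER_FLOP = 4.0          # assumption: one operation per 4-byte word


def ndp_budget(bandwidth_gbps, area_mm2, power_w, bytes_per_flop):
    """Return the throughput and power density the logic must meet."""
    gflops = bandwidth_gbps / bytes_per_flop
    return {
        "gflops": gflops,
        "gflops_per_mm2": gflops / area_mm2,
        "mw_per_mm2": power_w * 1000.0 / area_mm2,
    }


budget = ndp_budget(STACK_BANDWIDTH_GBPS, STACK_AREA_MM2,
                    STACK_POWER_W, BYTES_PER_FLOP)
for name, value in budget.items():
    print(f"{name}: {value:.2f}")
```

Under these assumptions the logic must sustain 32 Gflops, about 0.64 Gflops/mm², within 100 mW/mm².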


SLIDE 4

NDP Logic Options

Programmable cores [IRAM, FlexRAM, NDC, TOP-PIM]
FPGA (fine-grained) [Active Pages]
CGRA (coarse-grained) [NDA]
ASIC [MSA, LiM]

Each option is compared on area efficiency, power efficiency, and flexibility.

SLIDE 5

Reconfigurable Logic Challenges

• FPGA

Area overhead due to support for bit-level configuration

• CGRA

Traditional CGRAs: limited flexibility in interconnects, only for regular computation patterns
DySER [HPCA’11] and NDA [HPCA’15]: high power due to circuit-switched routing; inefficient for branches and irregular data layouts

Heterogeneity: achieve the best of FPGA and CGRA

SLIDE 6

Outline

• Motivation
• NDP System Design
• Heterogeneous Reconfigurable Logic (HRL)
• Evaluation
• Conclusions

SLIDE 7

Overall System Architecture

Multiple stacks linked to host processor through serial links

[Figure: host processor connected to memory stacks over high-speed serial links]

Host processor: multi-core chip with cache hierarchy; runs code with high temporal locality
Memory stack with NDP capability: runs memory-intensive code

SLIDE 8

NDP Stack

[Figure: NDP stack of DRAM dies on a logic die. Labels: Bank, Channel, Vault, Vault Logic, NoC]

Vault:

Vertical channel
Dedicated memory controller
8 – 16 vaults per stack
vs. DDR3 channel
10x bandwidth (160 GBps)
3-5x power improvement

Vault logic:

Multiple PEs + control logic
NoC to interconnect vaults

SLIDE 9

Iterative Execution Flow

• Processing phase: PEs run tasks independently and in parallel
• Communication phase: data exchange and sync between PEs [PACT’15]

Communication within and across stacks
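
The two phases can be sketched as a bulk-synchronous loop. This is a simplified software model, not the hardware mechanism. The names `run_iterations` and `ring_step` are hypothetical, and the exchange is modeled as a plain hand-off of messages rather than the pull step labeled in the slide's figure.

```python
from collections import defaultdict

NUM_PES = 4  # illustrative PE count, not taken from the slides


def run_iterations(num_pes, initial_items, process, num_iterations):
    """Alternate processing and communication phases.

    initial_items: dict mapping PE id -> list of items in its local buffer.
    process(pe, item): returns a list of (destination PE, message) pairs.
    """
    inbox = {pe: list(initial_items.get(pe, [])) for pe in range(num_pes)}
    for _ in range(num_iterations):
        # Processing phase: each PE reads only its own local buffer.
        outbox = defaultdict(list)
        for pe in range(num_pes):
            for item in inbox[pe]:
                for dest, message in process(pe, item):
                    outbox[dest].append(message)
        # Communication phase: messages become next iteration's inputs.
        # Swapping the buffers here acts as the synchronization point.
        inbox = {pe: outbox[pe] for pe in range(num_pes)}
    return inbox


def ring_step(pe, item):
    """Toy task: increment the item and send it to the next PE."""
    return [((pe + 1) % NUM_PES, item + 1)]


final = run_iterations(NUM_PES, {0: [0]}, ring_step, num_iterations=3)
print(final)
```

In this toy run a single item starts on PE 0 and, after three iterations, sits in the local buffer of PE 3 with value 3.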

[Figure: tasks running on PEs across iterations. Labels: Process, Local Buffer, Pull]

SLIDE 10

Vault Logic

[Figure: vault logic block diagram, split into fixed logic and user-defined logic. Labels: Global Controller (Host Processor), Task Queue with (p, addr, sz) entries, PEs, Generic Load/Store Unit, Scratchpad Buffer, DMA, DMA Ctrl, Output Queues, Output write queues, Mem Ctrl, Router, to local memory, to remote memory]

• Handles task control and data communication
• Allows the use of reconfigurable or custom PEs

Input & output data info from host or other PEs
Cached data (e.g., graph vertices)
Streaming data (e.g., graph edge list)
Separate queue for each consumer

Coalesce writebacks
In-place combine ops
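
One way to picture the output path is a separate queue per consumer that merges writes to the same address before they leave the vault. The sketch below is a software analogy under that reading. The class names, the dictionary-based queue, and the additive combine function are assumptions for illustration, not the hardware design.

```python
class OutputQueue:
    """Hypothetical write queue for one consumer.

    Writes to an address that is already pending are merged with the
    supplied combine function instead of being queued again.
    """

    def __init__(self, combine):
        self.combine = combine
        self.pending = {}  # address -> combined value

    def write(self, addr, value):
        if addr in self.pending:
            # In-place combine: fold the new value into the pending one.
            self.pending[addr] = self.combine(self.pending[addr], value)
        else:
            self.pending[addr] = value

    def drain(self):
        """Return coalesced (address, value) pairs and empty the queue."""
        writes = sorted(self.pending.items())
        self.pending = {}
        return writes


class VaultOutput:
    """Hypothetical set of output queues, one per consumer."""

    def __init__(self, consumers, combine):
        self.queues = {c: OutputQueue(combine) for c in consumers}

    def emit(self, consumer, addr, value):
        self.queues[consumer].write(addr, value)


# Example: partial sums sent to the same vertex are merged before draining.
out = VaultOutput(["vault0", "vault1"], combine=lambda a, b: a + b)
out.emit("vault1", 0x10, 0.25)
out.emit("vault1", 0x10, 0.5)
out.emit("vault1", 0x20, 1.0)
print(out.queues["vault1"].drain())
```

Three emitted updates leave the queue as two writes, which is the traffic reduction that coalescing aims for.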

SLIDE 11

Outline

• Motivation
• NDP System Design
• Heterogeneous Reconfigurable Logic (HRL)
• Evaluation
• Conclusions

SLIDE 12

HRL Features

• Fine-grained + coarse-grained reconfigurable blocks: area efficiency and flexibility

LUTs for flexible control
ALUs for efficient arithmetic

• Static interconnects: power efficiency

Wide network for data
Separate and narrow network for control

• Special blocks for branches & irregular data layout: flexibility

Compute throughput per Watt: 2.2x over FPGA, 1.7x over CGRA

SLIDE 13

HRL Array: Logic Blocks

[Figure: HRL array made of FUs, CLBs, and OMBs, connected by 16-bit data routing tracks and 1-bit control routing tracks, with control and data inputs and outputs]

CGRA-style functional unit

Efficient 48-bit arithmetic/logic ops
Registers for pipelining and retiming

FPGA-style configurable block

LUTs for embedded control logic
Special functions: sigmoid, tanh, etc.

Output MUX Block

Configurable MUXes (tree, cascading, parallel)
Put close to output, low cost and flexible

Flexible IO Interface

Simple but flexible alignment
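
The division of labor among the blocks can be illustrated in software: an FU produces a data value and a 1-bit flag, a LUT in a CLB combines control bits, and an output MUX selects the result. The sketch below is a behavioral analogy only. All function names are hypothetical, and the guarded-ReLU example is an assumption chosen to show a branch turned into a select.

```python
def make_lut(truth_table):
    """Build a k-input LUT from a truth table of length 2**k.

    The first input bit is the least-significant bit of the table index.
    """
    size = len(truth_table)
    if size == 0 or size & (size - 1):
        raise ValueError("truth table length must be a power of two")
    num_inputs = size.bit_length() - 1

    def lut(*bits):
        if len(bits) != num_inputs:
            raise ValueError(f"expected {num_inputs} input bits")
        index = sum(bit << i for i, bit in enumerate(bits))
        return truth_table[index]

    return lut


def mux2(sel, a, b):
    """2:1 MUX: output a when sel is 0, b when sel is 1."""
    return b if sel else a


def mux_tree(sels, inputs):
    """Tree of 2:1 MUXes; sels[0] drives the level closest to the inputs."""
    if len(inputs) != 1 << len(sels):
        raise ValueError("need 2**len(sels) inputs")
    level = list(inputs)
    for sel in sels:
        level = [mux2(sel, level[i], level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]


# Control logic: select the constant only if the input is valid AND negative.
and_lut = make_lut([0, 0, 0, 1])


def guarded_relu(x, valid):
    negative = 1 if x < 0 else 0      # stand-in for an FU comparison flag
    sel = and_lut(valid, negative)    # CLB-style control logic
    return mux2(sel, x, 0)            # OMB-style data selection


print(guarded_relu(-5, 1), guarded_relu(7, 1), guarded_relu(-5, 0))
print(mux_tree([1, 0], [10, 20, 30, 40]))
```

Keeping the select logic in 1-bit LUTs and the data path in wide MUXes mirrors the slide's split between control and data.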

SLIDE 14

HRL Array: Routing


SLIDE 15

HRL Array: Routing

[Figure: routing detail around an FU, an MXB, and a CLB. Legend: 16-bit data net tracks; 1-bit control net tracks; block output to routing track connection; routing switch box; block input MUX connection]

Fully static for low power
Separate data and control to reduce area
16-bit data, bus-based
1-bit control, few tracks
No switches between the two networks
Connection box and switch

SLIDE 16

HRL Array: IO


SLIDE 17

HRL Array: IO

[Figure: Ctrl IO attached to the 1-bit control tracks and Data IO attached to the 16-bit data tracks. Example data fields: 48-bit fixed-point, 16-bit short, 32-bit int]

Control IO

Connects to control tracks
Same as FPGA

Data IO

Connects to data net
Simple 16-bit chunk alignment

Low cost and sufficiently flexible even for irregular data
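
To make the 16-bit chunk alignment concrete, the sketch below splits fields of the widths shown on the slide (48-bit fixed-point, 16-bit short, 32-bit int) into 16-bit chunks, one per data track, and reassembles them. It is a software illustration under assumptions: values are treated as unsigned, chunks are ordered least-significant first, and all function names are hypothetical.

```python
CHUNK_BITS = 16
CHUNK_MASK = (1 << CHUNK_BITS) - 1


def to_chunks(value, width_bits):
    """Split an unsigned value into 16-bit chunks, least-significant first."""
    if width_bits % CHUNK_BITS != 0:
        raise ValueError("field width must be a multiple of 16 bits")
    if not 0 <= value < (1 << width_bits):
        raise ValueError("value does not fit in the field width")
    return [(value >> (CHUNK_BITS * i)) & CHUNK_MASK
            for i in range(width_bits // CHUNK_BITS)]


def from_chunks(chunks):
    """Reassemble a value from 16-bit chunks, least-significant first."""
    value = 0
    for i, chunk in enumerate(chunks):
        value |= chunk << (CHUNK_BITS * i)
    return value


def pack_record(fields):
    """Lay out (value, width_bits) fields as a flat list of 16-bit chunks."""
    tracks = []
    for value, width_bits in fields:
        tracks.extend(to_chunks(value, width_bits))
    return tracks


def unpack_record(tracks, widths):
    """Recover field values from chunks, given each field's width in bits."""
    values, pos = [], 0
    for width_bits in widths:
        count = width_bits // CHUNK_BITS
        values.append(from_chunks(tracks[pos:pos + count]))
        pos += count
    return values


record = [(0x123456789ABC, 48), (0x00FF, 16), (0xDEADBEEF, 32)]
tracks = pack_record(record)
print([hex(t) for t in tracks])
print([hex(v) for v in unpack_record(tracks, [48, 16, 32])])
```

Because every field is a whole number of 16-bit chunks, a mixed record like this one occupies six data tracks with no bit-level shifting.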

SLIDE 18

Outline

• Motivation
• NDP System Design
• Heterogeneous Reconfigurable Logic (HRL)
• Evaluation
• Conclusions

SLIDE 19

Methodology

• Workloads

3 analytics frameworks: MapReduce, graph, DNN
9 representative applications, 11 kernel circuits (KCs)

• Technology

45 nm area and power model
CGRA: DySER as in NDA [HPCA’11, HPCA’15]
FPGA: Xilinx Virtex-6

• Tools

Synthesize, place & route by Yosys + VTR
System simulation by zsim

SLIDE 20

Array Area

• Same logic capacity for each type of array

[Figure: single array area (mm²) for FPGA, CGRA, and HRL, broken down into Logic, Routing, Data Routing, and Ctrl Routing]

Coarse-grained FUs improve area efficiency

SLIDE 21

Array Power

[Figure: array power (mW) of FPGA, CGRA, and HRL for KC1 to KC11 and their average]

HRL: static but flexible routing
CGRA: high power on circuit-switched network
FPGA: half the frequency

SLIDE 22

Vault Power Efficiency

[Figure: normalized perf/Watt of FPGA, CGRA, and HRL for KC1 to KC11 and their average]

HRL: 2.2x FPGA, 1.7x CGRA on perf/Watt

SLIDE 23

Overall Performance

• ASIC represents the upper bound of efficiency
• Cores, FPGA, CGRA only match 30% to 80% of ASIC
• HRL has 92% of ASIC performance on average

[Figure: normalized performance of Cores, FPGA, CGRA, HRL, and ASIC on GroupBy, Hist, LinReg, PageRank, SSSP, CC, ConvNet, MLP, and dA, with applications grouped by whether memory bandwidth is saturated or not saturated]

SLIDE 24

Conclusions

• NDP logic requirements: area + power efficiency, flexibility
• Heterogeneous reconfigurable logic (HRL)

Fine-grained + coarse-grained logic blocks
Static and separate data and control networks
Special blocks for branching and layout management
Vault logic handles communication and control

• HRL for in-memory analytics

2.2x performance/Watt over FPGA and 1.7x over CGRA
92% of ASIC performance on average


SLIDE 25

Thanks!

Questions?