Logic for Near-Data Processing, Mingyu Gao and Christos Kozyrakis



SLIDE 1

HRL: Efficient and Flexible Reconfigurable Logic for Near-Data Processing

Mingyu Gao and Christos Kozyrakis Stanford University http://mast.stanford.edu HPCA – March 14, 2016

SLIDE 2

PIM is Coming Back …

Drivers of Near-Data Processing (NDP):

• End of Dennard scaling: energy-bound systems
• In-memory analytics: MapReduce, graph processing, deep neural networks, …
• 3D stacking: HMC, HBM

Figs: www.extremetech.com, www.cisl.columbia.edu/grads/tuku/research/, www.oceanaute.blogspot.com/2015/06/how-to-shuffle-sort-mapreduce.html

SLIDE 3

NDP Logic Requirements

• Area-efficient

  • High processing throughput to match the high memory bandwidth
  • 128 GBps per 50 mm² stack → > 32 Gflops → > 0.6 Gflops/mm²

• Power-efficient

  • Thermal constraints limit clock frequency
  • 5 W per stack → 100 mW/mm²

• Flexible

  • Must amortize manufacturing cost through reuse across apps
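
The numbers in this slide follow from simple arithmetic. The sketch below reproduces them under one assumption that the slide does not state: 4 bytes of memory traffic per floating-point operation, which is the ratio that turns 128 GBps into 32 Gflops. The function and constant names are illustrative only.

```python
def required_gflops(bandwidth_gbps, bytes_per_flop=4.0):
    """Compute throughput needed to keep up with the memory bandwidth.

    bytes_per_flop=4 is an assumption (one 32-bit operand per operation),
    chosen because it reproduces the 32 Gflops figure on the slide.
    """
    return bandwidth_gbps / bytes_per_flop


def density(total, area_mm2):
    """Spread a per-stack budget evenly over the logic die area."""
    return total / area_mm2


STACK_AREA_MM2 = 50.0    # from the slide
BANDWIDTH_GBPS = 128.0   # from the slide
POWER_BUDGET_W = 5.0     # from the slide

gflops = required_gflops(BANDWIDTH_GBPS)                      # Gflops per stack
gflops_per_mm2 = density(gflops, STACK_AREA_MM2)              # Gflops/mm2
mw_per_mm2 = density(POWER_BUDGET_W * 1000, STACK_AREA_MM2)   # mW/mm2
```

With these inputs the sketch yields 32 Gflops, 0.64 Gflops/mm² (the slide rounds this to 0.6), and 100 mW/mm².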

SLIDE 4

NDP Logic Options

• Programmable cores [IRAM, FlexRAM, NDC, TOP-PIM]
• FPGA (fine-grained) [Active Pages]
• CGRA (coarse-grained) [NDA]
• ASIC [MSA, LiM]

Each option is rated on area efficiency, power efficiency, and flexibility (shown as check and cross marks on the slide).

SLIDE 5

Reconfigurable Logic Challenges

• FPGA
  • Area overhead due to support for bit-level configuration

• CGRA
  • Traditional CGRAs
    • Limited flexibility in interconnects, only for regular computation patterns
  • DySER [HPCA’11] and NDA [HPCA’15]
    • High power due to circuit-switched routing
    • Inefficient for branches and irregular data layouts

Heterogeneity: achieve the best of FPGA and CGRA

SLIDE 6

Outline

• Motivation
• NDP System Design
• Heterogeneous Reconfigurable Logic (HRL)
• Evaluation
• Conclusions

SLIDE 7

Overall System Architecture

Multiple stacks linked to host processor through serial links

Figure labels: Host Processor, Memory Stack, High-Speed Serial Link

  • Host processor: multi-core chip with cache hierarchy; runs code with high temporal locality
  • Memory stack with NDP capability: runs memory-intensive code

SLIDE 8

NDP Stack

Figure labels: Vault Logic, NoC, Logic Die, DRAM Die, Bank, Channel, Vault

Vault:

  • Vertical channel
  • Dedicated memory controller
  • 8-16 vaults per stack
  • vs. DDR3 channel
    • 10x bandwidth (160 GBps)
    • 3-5x power improvement

Vault logic:

  • Multiple PEs + control logic
  • NoC to interconnect vaults

SLIDE 9

Iterative Execution Flow

• Processing phase: PEs run tasks independently and in parallel
• Communication phase: data exchange and sync b/w PEs [PACT’15]

  • Communication within and across stacks
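
The two phases can be sketched as a loop that alternates independent per-PE work with a message exchange. This is a simplified illustration, not the authors' runtime: the `process` callback, the message format, and the toy task are all hypothetical, and messages are delivered directly rather than modelling the pull-based exchange shown in the figure.

```python
from collections import defaultdict


def run_iterations(local_data, process, num_iters):
    """Alternate processing and communication phases.

    local_data: dict mapping a PE id to the list of values it holds.
    process:    hypothetical task function (pe_id, values) that returns a
                list of (dest_pe, value) messages.
    """
    for _ in range(num_iters):
        # Processing phase: each PE touches only its own data, so these
        # calls are independent and could run in parallel.
        outboxes = {pe: process(pe, vals) for pe, vals in local_data.items()}

        # Communication phase: deliver every message before the next
        # iteration starts; finishing delivery acts as the sync point.
        inboxes = defaultdict(list)
        for msgs in outboxes.values():
            for dest, value in msgs:
                inboxes[dest].append(value)
        local_data = {pe: inboxes.get(pe, []) for pe in local_data}
    return local_data


def pass_to_next(pe, vals, num_pes=4):
    # Toy task: sum the local values and send the result to the next PE.
    return [((pe + 1) % num_pes, sum(vals))]


result = run_iterations({0: [1], 1: [2], 2: [3], 3: [4]}, pass_to_next, 2)
```

After two iterations each value has moved two PEs along the ring, so `result` is `{0: [3], 1: [4], 2: [1], 3: [2]}`.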

Figure labels: Task, PEs, Process, Local Buffer, Pull

SLIDE 10

Vault Logic

• Handles task control and data communication
• Allows the use of reconfigurable or custom PEs

Figure labels: Global Controller (Host Processor), Task Queue (entries: p, addr, sz), PEs, Generic Load/Store Unit, Scratchpad Buffer, DMA, DMA Ctrl, Output Queues, Output write queues, Mem Ctrl, Router, To local memory, To remote memory. Legend: User-defined Logic, Fixed Logic.

Figure annotations:

  • Input & output data info from host or other PEs
  • Cached data (e.g., graph vertices)
  • Streaming data (e.g., graph edge list)
  • Separate queue for each consumer
    • Coalesce writebacks
    • In-place combine ops
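
The output-queue annotations (a separate queue per consumer, coalesced writebacks, in-place combine ops) can be illustrated with a small software model. This is a sketch under assumptions, not the hardware design: the class name, the method names, and the use of addition as the combine operation are all hypothetical.

```python
import operator


class OutputQueues:
    """One write queue per consumer; updates to the same address are merged."""

    def __init__(self, combine):
        self.combine = combine   # hypothetical combine op, e.g. operator.add
        self.queues = {}         # consumer id -> {addr: pending value}

    def push(self, consumer, addr, value):
        pending = self.queues.setdefault(consumer, {})
        if addr in pending:
            # In-place combine: fold the new value into the queued entry.
            pending[addr] = self.combine(pending[addr], value)
        else:
            pending[addr] = value

    def drain(self, consumer):
        """Return one coalesced writeback per address and empty the queue."""
        pending = self.queues.pop(consumer, {})
        return sorted(pending.items())


queues = OutputQueues(combine=operator.add)
queues.push(consumer=1, addr=0x10, value=2)
queues.push(consumer=1, addr=0x10, value=3)   # merged with the entry above
queues.push(consumer=1, addr=0x20, value=7)
queues.push(consumer=2, addr=0x10, value=5)   # different consumer, own queue

to_consumer_1 = queues.drain(1)
to_consumer_2 = queues.drain(2)
```

Three pushes to consumer 1 become two writebacks, because the two updates to address `0x10` are combined before they leave the queue.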
SLIDE 11

Outline

• Motivation
• NDP System Design
• Heterogeneous Reconfigurable Logic (HRL)
• Evaluation
• Conclusions

SLIDE 12

HRL Features

• Fine-grained + coarse-grained reconfigurable blocks (area efficiency and flexibility)

  • LUTs for flexible control
  • ALUs for efficient arithmetic

• Static interconnects (power efficiency)

  • Wide network for data
  • Separate and narrow network for control

• Special blocks for branches & irregular data layout (flexibility)

Compute throughput per Watt: 2.2x over FPGA, 1.7x over CGRA

SLIDE 13

HRL Array: Logic Blocks

Figure labels: Control Input, Data Input, FU, OMB, CLB, 16-bit data routing tracks, 1-bit control routing tracks, Control Output, Data Output

CGRA-style functional unit (FU)

  • Efficient 48-bit arithmetic/logic ops
  • Registers for pipelining and retiming

FPGA-style configurable logic block (CLB)

  • LUTs for embedded control logic
  • Special functions: sigmoid, tanh, etc.

Output MUX Block (OMB)

  • Configurable MUXes (tree, cascading, parallel)
  • Put close to output, low cost and flexible

Flexible IO Interface

  • Simple but flexible alignment
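
One plausible way the three block types cooperate on a branch is sketched below: an FU computes both candidate results and a comparison flag, a CLB LUT derives the select bit, and an OMB MUX picks the output. This mapping is an illustration under assumptions, not the authors' circuit; the helper names `lut`, `mux2`, `mux_tree`, and `clipped` are hypothetical.

```python
def lut(truth_table, *bits):
    """FPGA-style k-input LUT: the input bits index a truth table."""
    index = 0
    for b in bits:
        index = (index << 1) | (b & 1)
    return truth_table[index]


def mux2(sel, a, b):
    """2:1 MUX on a data word: returns a when sel is 0, b when sel is 1."""
    return b if sel else a


def mux_tree(sels, inputs):
    """Tree of 2:1 MUXes; needs len(inputs) == 2 ** len(sels).

    sels[0] drives the first (leaf) level of the tree.
    """
    level = list(inputs)
    for sel in sels:
        level = [mux2(sel, level[i], level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]


AND2 = [0, 0, 0, 1]   # truth table of a 2-input AND


def clipped(x, threshold, enable):
    """y = x - threshold if (x > threshold and enable) else x."""
    diff = x - threshold               # FU: arithmetic
    gt = 1 if x > threshold else 0     # FU: comparison, 1-bit control output
    sel = lut(AND2, gt, enable)        # CLB: control logic in a LUT
    return mux2(sel, x, diff)          # OMB: select the branch outcome
```

Both branch outcomes are computed every time and the MUX discards one, which keeps the datapath static at the cost of some redundant work.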
SLIDE 14

HRL Array: Routing

Figure labels: Control Input, Data Input, FU, OMB, CLB, 16-bit data routing tracks, 1-bit control routing tracks, Control Output, Data Output

SLIDE 15

HRL Array: Routing

Figure labels: FU, MXB, CLB; ports: in, out, INA, INB, sel, OUT, A, B, Y

Legend: 16-bit data net tracks, 1-bit control net tracks, block output to routing track connection, routing switch box, block input MUX connection

• Fully static for low power
• Separate data and control to reduce area
  • 16-bit data, bus-based
  • 1-bit control, few tracks
  • No switches b/w two networks
• Connection box and switch

SLIDE 16

HRL Array: IO

Figure labels: Control Input, Data Input, FU, OMB, CLB, 16-bit data routing tracks, 1-bit control routing tracks, Control Output, Data Output

SLIDE 17

HRL Array: IO

Figure labels: 48-bit fixed-point, 16-bit short, 32-bit int (shown twice); Ctrl IO, 1-bit control tracks, 16-bit data tracks, Data IO

Control IO

  • Connects to control tracks
  • Same as FPGA

Data IO

  • Connects to data net
  • Simple 16-bit chunk alignment

Low cost and sufficiently flexible even for irregular data
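
The 16-bit chunk alignment can be pictured as splitting every field into 16-bit pieces that map onto the 16-bit data tracks. The sketch below uses the three field widths from the figure (48-bit fixed-point, 16-bit short, 32-bit int); the helper names and the least-significant-chunk-first ordering are assumptions made for illustration.

```python
CHUNK_BITS = 16
CHUNK_MASK = (1 << CHUNK_BITS) - 1


def to_chunks(value, width_bits):
    """Split an unsigned value into 16-bit chunks, least significant first."""
    if width_bits % CHUNK_BITS != 0:
        raise ValueError("field width must be a multiple of 16 bits")
    return [(value >> shift) & CHUNK_MASK
            for shift in range(0, width_bits, CHUNK_BITS)]


def from_chunks(chunks):
    """Reassemble a value from 16-bit chunks, least significant first."""
    value = 0
    for i, chunk in enumerate(chunks):
        value |= (chunk & CHUNK_MASK) << (i * CHUNK_BITS)
    return value


def unpack_record(chunks, field_widths):
    """Route consecutive 16-bit chunks of a record to fields of given widths."""
    fields, pos = [], 0
    for width in field_widths:
        n = width // CHUNK_BITS
        fields.append(from_chunks(chunks[pos:pos + n]))
        pos += n
    return fields


# A record with a 48-bit fixed-point value, a 16-bit short, and a 32-bit int.
record = (to_chunks(0x123456789ABC, 48)
          + to_chunks(0xBEEF, 16)
          + to_chunks(0xCAFEF00D, 32))
fields = unpack_record(record, [48, 16, 32])
```

Because every field is a whole number of chunks, mixed-width records need only chunk-level steering rather than bit-level shifting.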

SLIDE 18

Outline

• Motivation
• NDP System Design
• Heterogeneous Reconfigurable Logic (HRL)
• Evaluation
• Conclusions

SLIDE 19

Methodology

• Workloads

  • 3 analytics frameworks: MapReduce, graph, DNN
  • 9 representative applications, 11 kernel circuits (KCs)

• Technology

  • 45 nm area and power model
  • CGRA: DySER as in NDA [HPCA’11, HPCA’15]
  • FPGA: Xilinx Virtex-6

• Tools

  • Synthesize, place & route by Yosys + VTR
  • System simulation by zsim

SLIDE 20

Array Area

• Same logic capacity for each type of array
• Coarse-grained FUs improve area efficiency

Figure: single array area (mm²) for FPGA, CGRA, and HRL, broken down into Logic, Routing, Data Routing, and Ctrl Routing.

SLIDE 21

Array Power

Figure: array power (mW) for FPGA, CGRA, and HRL on KC1 to KC11 and their average.

• HRL: static but flexible routing
• CGRA: high power on circuit-switched network
• FPGA: half the frequency

SLIDE 22

Vault Power Efficiency

Figure: normalized perf/Watt for FPGA, CGRA, and HRL on KC1 to KC11 and their average.

• HRL: 2.2x FPGA, 1.7x CGRA on perf/Watt

SLIDE 23

Overall Performance

• ASIC represents the upper bound of efficiency
• Cores, FPGA, CGRA only match 30% to 80% of ASIC
• HRL has 92% of ASIC performance on average

Figure: normalized performance of Cores, FPGA, CGRA, HRL, and ASIC on GroupBy, Hist, LinReg, PageRank, SSSP, CC, ConvNet, MLP, and dA. Applications are grouped into "memory bandwidth saturated" and "memory bandwidth not saturated".

SLIDE 24

Conclusions

• NDP logic requirements: area + power efficiency, flexibility
• Heterogeneous reconfigurable logic (HRL)

  • Fine-grained + coarse-grained logic blocks
  • Static and separate data and control networks
  • Special blocks for branching and layout management
  • Vault logic handles communication and control

• HRL for in-memory analytics

  • 2.2x performance/Watt over FPGA and 1.7x over CGRA
  • 92% of ASIC performance on average

SLIDE 25

Thanks!

Questions?