DRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric

SLIDE 1

DRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric

Mingyu Gao, Christina Delimitrou, Dimin Niu, Krishna Malladi, Hongzhong Zheng, Bob Brennan, Christos Kozyrakis

ISCA – June 22, 2016

SLIDE 2

FPGA-Based Accelerators

q Improve performance and energy efficiency q Good balance between flexibility (CPUs) and efficiency (ASICs) q Recently used for many datacenter apps

  • Image/video processing, websearch, neural networks, …

Pictures: Putnam et al., A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services, ISCA'14

SLIDE 3

Motivation

q Deploy FPGAs in cost & power constrained systems q Datacenter systems

  • High-density FPGAs for large accelerators for multiple apps
  • Low-power FPGAs to simplify integration in servers and racks

q Mobile systems

  • High-density FPGAs for accelerators for multiple apps
  • Low-power FPGAs for low cost and long battery life

SLIDE 4

DRAF in a Nutshell

q A high-density & low-power FPGA

  • Bit-level reconfigurable, just like conventional FPGAs

q Uses dense DRAM technology for lookup tables

  • Replacing the SRAM technology in conventional FPGAs

q DRAF vs. FPGA

  • 10 – 100x logic density
  • 1/3 power consumption
  • Multi-context support with fast context switch

SLIDE 5

Challenges of Building DRAM-based FPGAs

SLIDE 6

DRAM Array Structure

[Figure: DRAM subarray composed of MATs, showing the master wordline, row decoder, local wordlines, bitlines, and sense-amps, with Input and Output marked]

A DRAM subarray is naturally a lookup table

SLIDE 7

Challenges

[Figure: the same DRAM subarray (MATs, master wordline, row decoder, local wordlines, bitlines, sense-amps), annotated with ~10-bit input, ~1k rows, ~8k-bit output, and 10-30 ns delay]

• Mismatched LUT size: an 8192-bit LUT?
• Slow speed: a LUT with 10 ns delay?
• Destructive access: data lost after access?
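
The idea that a DRAM array can serve as a lookup table, and why destructive access complicates it, can be sketched with a toy model. This is only an illustration under simplifying assumptions: the class name `DramLut` and its methods are hypothetical, one stored word stands for one DRAM row, and timing is ignored. A ~10-bit input would select one of 2**10 = 1024 such rows.

```python
class DramLut:
    """Toy model of a DRAM row array used as a lookup table."""

    def __init__(self, truth_table):
        # One stored word per input combination (one DRAM row per address).
        self.rows = list(truth_table)
        self.row_buffer = None  # sense-amp contents after activation
        self.open_row = None

    def activate(self, address):
        # Activation senses the row into the sense-amps and, in this
        # model, destroys the value held in the cells.
        self.row_buffer = self.rows[address]
        self.rows[address] = None
        self.open_row = address
        return self.row_buffer

    def restore(self):
        # Write the sensed value back into the cells.
        self.rows[self.open_row] = self.row_buffer

    def precharge(self):
        # Close the row so that the next activation can start.
        self.row_buffer = None
        self.open_row = None

    def lookup(self, address):
        # A safe lookup serializes all three operations.
        value = self.activate(address)
        self.restore()
        self.precharge()
        return value


# 2-input XOR stored as a 4-row, 1-bit-wide table.
xor_lut = DramLut([0, 1, 1, 0])
```

Calling `xor_lut.lookup(0b01)` repeatedly keeps returning the stored bit, whereas an `activate` followed by `precharge` without `restore` leaves that row empty in this model.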

SLIDE 8

Destructive Access

• Explicit activation, restoration, and precharge operations

  • Longer access delay due to serialization

• Issue of LUT chaining: order of LUT access

  • Must activate L2 after L1
  • Must activate L4 after both L2 & L3

[Figure: user clock vs. physical path, with labels R1, R2, and L1-L4]

SLIDE 9

DRAF Architecture

Basic Logic Element, Multi-Context Support, Timing

SLIDE 10

DRAF Overview

q Same island layout and configurable interconnect as FPGA

10 ¡

DSP

In DRAM technology Slower but not critical

Block RAM

Uses DRAM arrays

CLB

Contains multiple basic logic elements (BLEs)

SLIDE 11

Basic Logic Element

• 7-10 bits input, 2-4 bits output
• Narrower MAT: from 1k bits to 8-16 bits
• Specialized column logic: better flexibility
• Additional FFs & MUXs: registering & retiming, single-MAT access, multi-context

[Figure: BLE built from a subarray of MATs (master wordline, row decoder, local wordlines, bitlines, sense-amps) with column logic and FFs]

SLIDE 12

Multi-Context Support

q DRAF supports 8-16 contexts per chip

  • Context: one MAT per BLE
  • Efficient use of MATs with little area and power overhead

q Instant switch between active contexts

  • Similar to context-switch between processes on CPU

q Context uses

  • One context per accelerator design or application
  • One context per part of a very large accelerator design
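
A minimal sketch of the "one MAT per BLE per context" idea, assuming a context switch amounts to changing which MAT is selected. The class `MultiContextBle` and its method names are hypothetical and are not taken from the paper.

```python
class MultiContextBle:
    """Toy BLE holding one lookup table (MAT) per context."""

    def __init__(self, num_inputs, num_contexts):
        self.num_inputs = num_inputs
        # One table per context; each table has 2**num_inputs entries.
        self.mats = [[0] * (1 << num_inputs) for _ in range(num_contexts)]
        self.active = 0

    def program(self, context, truth_table):
        if len(truth_table) != 1 << self.num_inputs:
            raise ValueError("truth table size does not match input width")
        self.mats[context] = list(truth_table)

    def switch_context(self, context):
        # Switching only changes which MAT is selected; no data moves.
        if not 0 <= context < len(self.mats):
            raise IndexError("no such context")
        self.active = context

    def lookup(self, address):
        return self.mats[self.active][address]


ble = MultiContextBle(num_inputs=2, num_contexts=8)
ble.program(0, [0, 0, 0, 1])  # context 0: 2-input AND
ble.program(1, [0, 1, 1, 0])  # context 1: 2-input XOR
```

With both contexts programmed, `ble.switch_context(1)` changes the function computed by `ble.lookup` without reloading any table, which is the property the slide compares to a CPU context switch.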

SLIDE 13

Timing – Destructive Access

• Issue of LUT chaining: order of LUT access
• Solution: phases (similar to critical path finding)

[Figure: user clock vs. physical path (labels R1, R2, L1-L4) with a phase timeline cycling through Phase 0, Phase 1, and Phase 2]
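
One way to read "similar to critical path finding" is as a longest-path computation over the LUT dependency graph: a LUT's phase is one more than the latest phase among the LUTs feeding it. The sketch below assumes that reading. The function `assign_phases` is hypothetical, and the graph encodes only the constraints stated on the earlier slide (L2 after L1, L4 after L2 and L3), with L1 and L3 assumed to have no LUT predecessors.

```python
from graphlib import TopologicalSorter


def assign_phases(deps):
    """Map each LUT to the earliest phase in which it may be activated.

    deps maps a LUT name to the set of LUTs whose outputs it consumes.
    """
    phases = {}
    # static_order() yields every LUT after all of its predecessors.
    for lut in TopologicalSorter(deps).static_order():
        preds = deps.get(lut, ())
        phases[lut] = 1 + max((phases[p] for p in preds), default=-1)
    return phases


chain = {"L1": set(), "L2": {"L1"}, "L3": set(), "L4": {"L2", "L3"}}
phases = assign_phases(chain)
# phases == {"L1": 0, "L3": 0, "L2": 1, "L4": 2}
```

Under these assumptions the example needs three phases (0, 1, 2), matching the three phases shown in the slide's timeline.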

SLIDE 14

Timing – Latency Optimization

• Issue: precharge and restore delays
• Solution: 3-way delay overlapping

  • Hide PRE/RST delays with wire propagation delay

• Performance gap between DRAF and FPGA reduces from >10x to 2-4x

[Figure: timelines of ACT, PRE, RST, and Wire for LUT-1 and LUT-2, shown twice, with the saved delay marked]
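
The effect of overlapping can be illustrated with a simple delay model. It assumes that, after each activation, restore, precharge, and wire propagation proceed concurrently, so only the slowest of the three is exposed; this is one reading of "3-way", not a confirmed description of the paper's scheme. The function names and delay values are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def serialized_delay(t_act, t_rst, t_pre, t_wire, num_luts):
    # Every LUT performs ACT, RST, PRE back to back, then drives the wire.
    return num_luts * (t_act + t_rst + t_pre + t_wire)


def overlapped_delay(t_act, t_rst, t_pre, t_wire, num_luts):
    # RST, PRE, and wire propagation run concurrently after each ACT,
    # so only the slowest of the three adds to the path delay.
    return num_luts * (t_act + max(t_rst, t_pre, t_wire))


# Hypothetical delays in ns, for illustration only.
delays = dict(t_act=2.0, t_rst=1.5, t_pre=1.0, t_wire=1.5)
before = serialized_delay(num_luts=4, **delays)  # 4 * 6.0 = 24.0
after = overlapped_delay(num_luts=4, **delays)   # 4 * 3.5 = 14.0
saved = before - after                           # 10.0
```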

SLIDE 15

Summary

q Challenges à solutions

  • Mismatch LUT size

à multi-context BLE

  • Destructive access

à phase-based timing

  • Slow speed

à 3-way delay overlapping

q Other design features (see paper)

  • Sense-amp as register
  • Time-multiplexed routing
  • Handling DRAM Refresh

SLIDE 16

Evaluation

Area, power, performance against FPGA and CPU

SLIDE 17

Methodology

q Synthesize, place & route with Yosys + VTR q CACTI-3DD with 45 nm power and area models q Comparisons

  • 70 mm2 FPGA based on Xilinx Virtex-6
  • 70 mm2 DRAF device, 8-context
  • Intel Xeon E5-2630 multi-core processor (2.3 GHz)

q 18 accelerator designs

  • MachSuite, Sirius, Vivado HLS Video Library, VTR benchsuite
  • Web service, image processing, analytics, neural networks, …

SLIDE 18

DRAF Chip Area & Power

[Figure: two log-scale plots against logic capacity (in million 6-LUT equivalents), FPGA vs. DRAF: chip area (mm²) and peak chip power (W)]

• 10x area improvement
• 50x peak power reduction

SLIDE 19

FPGA vs. DRAF (Area)

q 8-context DRAF occupies 19% less area than 1-context FPGA

  • 10x area efficiency: 8 designs in less silicon area than 1 design before
  • But only one context can be active at a time
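
The 10x figure follows from the two numbers on this slide. A short check, assuming area is normalized to the 1-context FPGA and that each context holds one design (variable names are illustrative):

```python
fpga_area = 1.0                      # 1-context FPGA, normalized
draf_area = fpga_area * (1 - 0.19)   # 19% less area, from the slide
draf_contexts = 8

# Designs held per unit area, relative to the FPGA's single design.
density_gain = (draf_contexts / draf_area) / (1 / fpga_area)
# 8 / 0.81 is roughly 9.9, which the slide rounds to 10x
```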

[Figure: normalized minimum bounding area per benchmark (aes, backprop, gemm, gmm, harris, stemmer, stencil, viterbi, editdist), broken down into FPGA logic, FPGA routing, DRAF logic, and DRAF routing. Annotations: inefficient use of larger DRAM LUT; exp/log functions]

SLIDE 20

FPGA vs. DRAF (Power)

q Use one context in DRAF q DRAF consumes 1/3 power of FPGA and 15% less energy

  • Note: current CAD tools are less efficient with DRAF
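
Power and energy are linked through runtime (energy = power × time), so the two figures here can be cross-checked against the 2.7x slowdown reported on the Performance slide. This is a rough consistency check only: it assumes the averages can be combined as simple ratios, which need not hold across benchmarks.

```python
slowdown = 2.7            # DRAF runtime / FPGA runtime (Performance slide)
energy_ratio = 1 - 0.15   # DRAF energy / FPGA energy (this slide)

# energy = power * time, so power ratio = energy ratio / runtime ratio
implied_power_ratio = energy_ratio / slowdown
# roughly 0.31, close to the 1/3 stated on this slide
```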

[Figure: normalized power per benchmark (aes, backprop, gemm, gmm, harris, stemmer, stencil, viterbi, editdist), broken down into FPGA logic, FPGA routing, DRAF logic, and DRAF routing]

SLIDE 21

Performance

q DRAF is 2.7x slower than FPGA q DRAF is 13.5x faster than CPU, 3.4x faster than ideal 4-core

[Figure: normalized throughput (log scale) per benchmark (aes, backprop, gemm, gmm, harris, stemmer, stencil, viterbi) for CPU, 4 CPU, FPGA, and DRAF. Annotations: exp/log functions; efficient line buffer]

SLIDE 22

Conclusions

q DRAF: high-density and low-power reconfigurable fabric

  • Based on dense DRAM technology
  • Optimized timing + multi-context support

q DRAF targets cost and power constrained applications

  • E.g., datacenters and mobile systems

q DRAF trades off some performance for area & power efficiency

  • 10x smaller area, 3x less power, and 2.7x slower than FPGA
  • Still 13x speedup over Xeon cores

SLIDE 23

Thanks!

Questions?

SLIDE 24

Backup

SLIDE 25

Design flow

q Verilog/VHDL programming and similar synthesis flow

  • DRAF has the same primitives (LUT, FF, DSP, BRAM) as FPGA

q Specific tweaks

  • Wider LUT: more efficient packing
  • Optimize for latency rather than area
  • Routing delay is easier to handle
  • Additional timing requirements, e.g. phase, etc.
SLIDE 26

Multi-Context

q Why not do multi-context in SRAM FPGAs? q Store contexts in-place

  • High area overhead, can be use to implement more normal LUTs
  • In DRAF: little overhead due to dense DRAM MAT array

q On-chip backup storage

  • Significant context switch overheads in power and latency
  • In DRAF: zero latency and power for context switch
SLIDE 27

Design Exploration

q Lots of data in paper q Main tradeoff is between area and latency

  • Larger LUT: better area, worse latency
  • Smaller LUT: worse area, better latency

q A major limitation is the CAD tool

  • Cannot efficiently map applications to large LUTs

q Final LUT size

  • 7-input, 2-output, 8-context
  • 64 rows, 32 columns, 2048-bit subarray
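
The final LUT parameters are mutually consistent, as a short calculation shows. The bit counts come from the slide. The split of the 7 input bits into row-address and column-select bits is an assumption made here for illustration, not something the slide states.

```python
inputs, outputs, contexts = 7, 2, 8
rows, columns = 64, 32

bits_per_context = (1 << inputs) * outputs       # 128 entries x 2 bits = 256
total_bits = bits_per_context * contexts         # 256 x 8 = 2048
columns_per_context = columns // contexts        # 32 / 8 = 4 columns per MAT

# Assumed address split: 6 bits pick one of 64 rows, and the remaining
# bit picks which 2 of the 4 columns form the 2-bit output.
row_address_bits = rows.bit_length() - 1         # 6
column_select_bits = inputs - row_address_bits   # 1
```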
SLIDE 28
SLIDE 29

DRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric

Mingyu Gao, Christina Delimitrou, Dimin Niu, Krishna Malladi, Hongzhong Zheng, Bob Brennan, Christos Kozyrakis

Session 8A, Wednesday 9am

SLIDE 30

The Need for High-density & Low-power FPGAs

q FPGA accelerators improve performance and energy efficiency

  • Recently used for many datacenter apps (Microsoft, Baidu, …)

q Datacenter systems

  • Need high-density FPGAs for large accelerators for multiple apps
  • Need low-power FPGAs to simplify integration in servers and racks

q Mobile systems

  • Need high-density FPGAs for accelerators for multiple apps
  • Need low-power FPGAs for low cost and long battery life
SLIDE 31

DRAF: A High-density & Low-power FPGA

q Based on dense DRAM arrays instead of SRAM LUTs

  • 10-100x density of convectional FPGAs
  • 1/3 power consumption of convectional FPGAs
  • 13x speedup over Xeon cores

q Come to the talk to learn about

  • Dense, slow DRAM arrays as small, fast LUTs
  • Phase-based timing to address the problem of destructive reads
  • Multi-context support with instantaneous context switch

q Session 8A, Wednesday 9am