
DRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric
Mingyu Gao, Christina Delimitrou, Dimin Niu, Krishna Malladi, Hongzhong Zheng, Bob Brennan, Christos Kozyrakis
ISCA, June 22, 2016

FPGA-Based Accelerators
• Improve performance and energy efficiency
• Good balance between flexibility (CPUs) and efficiency (ASICs)
• Recently used for many datacenter apps
  o Image/video processing, web search, neural networks, …
Pictures: Putnam, et al. A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services. ISCA’14

Motivation
• Deploy FPGAs in cost & power constrained systems
• Datacenter systems
  o High-density FPGAs for large accelerators for multiple apps
  o Low-power FPGAs to simplify integration in servers and racks
• Mobile systems
  o High-density FPGAs for accelerators for multiple apps
  o Low-power FPGAs for low cost and long battery life

DRAF in a Nutshell
• A high-density & low-power FPGA
  o Bit-level reconfigurable, just like conventional FPGAs
• Uses dense DRAM technology for lookup tables
  o Replacing the SRAM technology in conventional FPGAs
• DRAF vs. FPGA
  o 10-100x logic density
  o 1/3 power consumption
  o Multi-context support with fast context switch

Challenges of Building DRAM-based FPGAs

DRAM Array Structure
[Figure: DRAM subarray. Labels: input, row decoder, master wordline, local wordline, bitline, MAT, sense-amp, output]
• A DRAM subarray is naturally a lookup table
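
The lookup-table view of a subarray can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the authors' design: the class `SubarrayLUT` and its methods are hypothetical names, the row decoder is modeled as plain list indexing, and each row stores one output word.

```python
class SubarrayLUT:
    """Toy model: each row of the array holds one truth-table output word."""

    def __init__(self, input_bits, output_bits):
        self.input_bits = input_bits
        self.output_bits = output_bits
        # One row per input combination, as selected by the row decoder.
        self.rows = [0] * (1 << input_bits)

    def program(self, func):
        """Fill every row with func(address), truncated to the output width."""
        mask = (1 << self.output_bits) - 1
        for addr in range(len(self.rows)):
            self.rows[addr] = func(addr) & mask

    def read(self, addr):
        """The input bits act as the row address; the row content is the output."""
        return self.rows[addr]


# A 4-input, 1-output LUT programmed as a parity function (illustrative only).
lut = SubarrayLUT(input_bits=4, output_bits=1)
lut.program(lambda a: bin(a).count("1") & 1)
```

The model ignores timing and the destructive nature of DRAM reads, which the following slides address.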

Challenges
• Mismatched LUT size: ~1k rows (~10-bit input) and ~8k-bit output per MAT. An 8192-bit LUT?
• Slow speed: 10-30 ns delay. A LUT with 10 ns delay?
• Destructive access: data lost after access?
[Figure: DRAM subarray (master wordline, row decoder, local wordline, bitline, MAT, sense-amp) annotated with the three challenges]

Destructive Access
• Explicit activation, restoration, and precharge operations
  o Longer access delay due to serialization
• Issue of LUT chaining: order of LUT access
  o Must activate L2 after L1
  o Must activate L4 after both L2 & L3
[Figure: LUTs L1-L4 with R1 and R2 along the physical path, shown against the user clock]

DRAF Architecture
Basic Logic Element, Multi-Context Support, Timing

DRAF Overview
• Same island layout and configurable interconnect as FPGA
• CLB: contains multiple basic logic elements (BLEs)
• DSP: in DRAM technology; slower but not critical
• Block RAM: uses DRAM arrays

Basic Logic Element
• 7-10 bits input, 2-4 bits output
• Narrower MAT: bitline width reduced from 1k bits to 8-16 bits
• Specialized column logic: better flexibility
• Additional FFs & MUXs: registering & retiming
• Single-MAT access: multi-context
[Figure: BLE built from a DRAM subarray (master wordline, row decoder, local wordline, MATs, sense-amps) with column logic and 4x2 FFs]

Multi-Context Support
• DRAF supports 8-16 contexts per chip
  o Context: one MAT per BLE
  o Efficient use of MATs with little area and power overhead
• Instant switch between active contexts
  o Similar to context-switch between processes on CPU
• Context uses
  o One context per accelerator design or application
  o One context per part of a very large accelerator design
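
The idea of one MAT per context can be illustrated with a small software model. This is a minimal sketch under assumptions, not the hardware described in the talk: `MultiContextBLE` and its methods are hypothetical names, each MAT is modeled as a Python list, and a context switch is modeled as changing which list is read.

```python
class MultiContextBLE:
    """Toy BLE holding one lookup table (MAT) per context."""

    def __init__(self, num_contexts, input_bits):
        self.mats = [[0] * (1 << input_bits) for _ in range(num_contexts)]
        self.active = 0

    def program(self, context, func):
        """Load the truth table of one context without touching the others."""
        mat = self.mats[context]
        for addr in range(len(mat)):
            mat[addr] = func(addr)

    def switch(self, context):
        """A context switch only changes which MAT is selected."""
        if not 0 <= context < len(self.mats):
            raise ValueError("no such context")
        self.active = context

    def evaluate(self, addr):
        return self.mats[self.active][addr]


ble = MultiContextBLE(num_contexts=8, input_bits=2)
ble.program(0, lambda a: (a >> 1) & (a & 1))  # context 0: AND of the two input bits
ble.program(1, lambda a: (a >> 1) ^ (a & 1))  # context 1: XOR of the two input bits
```

Because every context keeps its own table resident, switching does not require reloading a configuration, which is the property the slide compares to a process context switch on a CPU.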

Timing – Destructive Access
• Issue of LUT chaining: order of LUT access
• Solution: phases, similar to critical path finding
[Figure: physical path with L1 and L3 in Phase 0, L2 in Phase 1, and L4 in Phase 2, plus R1 and R2; timeline showing Phase 0, Phase 1, Phase 2 within one user clock]
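
One way to read "similar to critical path finding" is as a longest-path levelization of the LUT dependency graph. The sketch below makes that assumption explicit; it is not taken from the paper. The function `assign_phases` and the `fanin` dictionary are hypothetical, and the graph is assumed to be acyclic between registers.

```python
from functools import lru_cache


def assign_phases(fanin):
    """Give each LUT a phase one larger than the latest phase among its inputs.

    fanin maps a LUT name to the LUTs it depends on combinationally.
    LUTs fed only by registers or primary inputs have no entries and get phase 0.
    """

    @lru_cache(maxsize=None)
    def phase(lut):
        preds = fanin.get(lut, ())
        return 0 if not preds else 1 + max(phase(p) for p in preds)

    return {lut: phase(lut) for lut in fanin}


# Dependencies from the slide: L2 after L1, L4 after both L2 and L3.
fanin = {"L1": (), "L3": (), "L2": ("L1",), "L4": ("L2", "L3")}
phases = assign_phases(fanin)
```

With these dependencies the assignment matches the figure: L1 and L3 land in phase 0, L2 in phase 1, and L4 in phase 2.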

Timing – Latency Optimization
• Issue: precharge and restore delays
• Solution: 3-way delay overlapping
  o Hide PRE/RST delays with wire propagation delay
• Performance gap between DRAF and FPGA reduces from >10x to 2-4x
[Figure: timelines of LUT-1 and LUT-2 (PRE, ACT, RST) and the wire between them, serialized versus overlapped, showing the saved delay]
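
The effect of overlapping can be sketched with a simple delay model. Everything here is an assumption for illustration: the two functions are hypothetical, the overlapped case assumes that the restore of one LUT, the wire propagation, and the precharge of the next LUT proceed in parallel, and the delay values are placeholders rather than numbers from the talk.

```python
def serialized_delay(num_luts, pre, act, rst, wire):
    """Every LUT runs PRE, ACT, RST back to back, with a wire between LUTs."""
    return num_luts * (pre + act + rst) + (num_luts - 1) * wire


def overlapped_delay(num_luts, pre, act, rst, wire):
    """Between two LUTs, RST of the first, the wire, and PRE of the next overlap."""
    hop = max(rst, wire, pre)
    return pre + num_luts * act + (num_luts - 1) * hop + rst


# Placeholder delays in ns, chosen only for illustration (not from the slides).
delays = dict(pre=2.0, act=3.0, rst=3.0, wire=3.0)
serial = serialized_delay(2, **delays)
overlapped = overlapped_delay(2, **delays)
saved = serial - overlapped
```

In this model only the activations and the longest of the three overlapped delays remain on the chained path, so the saving grows with the number of LUTs in the chain.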

Summary
• Challenges → solutions
  o Mismatched LUT size → multi-context BLE
  o Destructive access → phase-based timing
  o Slow speed → 3-way delay overlapping
• Other design features (see paper)
  o Sense-amp as register
  o Time-multiplexed routing
  o Handling DRAM refresh

Evaluation
Area, power, performance against FPGA and CPU

Methodology
• Synthesize, place & route with Yosys + VTR
• CACTI-3DD with 45 nm power and area models
• Comparisons
  o 70 mm² FPGA based on Xilinx Virtex-6
  o 70 mm² DRAF device, 8-context
  o Intel Xeon E5-2630 multi-core processor (2.3 GHz)
• 18 accelerator designs
  o MachSuite, Sirius, Vivado HLS Video Library, VTR benchmark suite
  o Web service, image processing, analytics, neural networks, …

DRAF Chip Area & Power
• 10x area improvement
• 50x peak power reduction
[Figure: two plots, chip area (mm²) and peak chip power (W) on logarithmic axes, versus logic capacity (in million 6-LUT equivalents, 0 to 1.5), for FPGA and DRAF]

FPGA vs. DRAF (Area)
• 8-context DRAF occupies 19% less area than 1-context FPGA
  o 10x area efficiency: 8 designs in less silicon area than 1 design before
  o But only one context can be active at a time
[Figure: normalized minimum bounding area for aes, backprop, gemm, gmm, harris, stemmer, stencil, viterbi, editdist; bars split into FPGA Logic, FPGA Routing, DRAF Logic, DRAF Routing. Annotations: "Inefficient use of larger DRAM LUT", "exp/log functions"]
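
The 10x area-efficiency figure follows from the two numbers on this slide. The short calculation below is a worked check, assuming "19% less area" means the 8-context DRAF occupies 0.81 of the FPGA's area; the variable names are only illustrative.

```python
contexts = 8                # designs resident in one DRAF device
relative_area = 1 - 0.19    # DRAF area relative to the 1-context FPGA
designs_per_fpga_area = contexts / relative_area  # designs held per unit of FPGA area
```

The result is a little under 10 designs per unit of FPGA area, which the slide rounds to 10x.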

FPGA vs. DRAF (Power)
• Use one context in DRAF
• DRAF consumes 1/3 power of FPGA and 15% less energy
  o Note: current CAD tools are less efficient with DRAF
[Figure: normalized power for aes, backprop, gemm, gmm, harris, stemmer, stencil, viterbi, editdist; bars split into FPGA Logic, FPGA Routing, DRAF Logic, DRAF Routing]

Performance
• DRAF is 2.7x slower than FPGA
• DRAF is 13.5x faster than CPU, 3.4x faster than ideal 4-core
[Figure: normalized throughput (logarithmic axis) for aes, backprop, gemm, gmm, harris, stemmer, stencil, viterbi; series: CPU, 4 CPU, FPGA, DRAF. Annotations: "Efficient line buffer", "exp/log functions"]
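
The two CPU comparisons are consistent with each other if "CPU" is a single core and the ideal 4-core case scales linearly. Both of those are assumptions made here for the check, not statements from the slide.

```python
speedup_vs_one_core = 13.5   # DRAF vs. CPU, from the slide
ideal_cores = 4              # assumed perfect linear scaling across 4 cores
speedup_vs_ideal_4core = speedup_vs_one_core / ideal_cores
```

Under those assumptions the quotient rounds to the 3.4x quoted on the slide.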

Conclusions
• DRAF: high-density and low-power reconfigurable fabric
  o Based on dense DRAM technology
  o Optimized timing + multi-context support
• DRAF targets cost and power constrained applications
  o E.g., datacenters and mobile systems
• DRAF trades off some performance for area & power efficiency
  o 10x smaller area, 3x less power, and 2.7x slower than FPGA
  o Still 13x speedup over Xeon cores

Thanks! Questions?
