Programmable Hardware Acceleration
Vinay Gangadhar
PhD Final Examination Thursday, Nov 16th, 2017
Advisor: Karu Sankaralingam Committee: Mark Hill, Mikko Lipasti, David Wood, Dimitris Papailiopoulos
Traditional Multicore → Application domain specialization
Domains: Image Processing, Neural Approx., Graph Traversal, AI, Scan, Sort, Reg. Expr., Deep Neural, Stencil
Fixed-function accelerators for a specific domain: Domain Specific Accelerators (DSAs)
+ High Efficiency: 10–1000x Performance/Power, Performance/Area
Examples: NVIDIA DGX-1 AI Accelerator & NVDLA architecture, Movidius Myriad VPU, Google TPU
DSAs
Domains: Image Processing, Neural Approx., Graph Traversal, AI, Scan, Sort, Reg. Expr., Deep Neural, Stencil (e.g., H.265/H.266 codecs)
– Not re-configurable
– High design and fabrication cost
– Deployed across Server, Mobile, and IoT
Source: Malitel Consulting
Domains: Query Processing, Image Processing, Automated Driving, Compression, Regex Matching, Deep Neural
Convert 100+ accelerators into 1 programmable accelerator fabric with a standard programming and threading interface
A generic programmable hardware accelerator matching the efficiency of Domain Specific Accelerators (DSAs) with an efficient hardware-software interface
Source: Malitel Consulting
Domain-Specific Accelerators (DSAs)
Domains: Image Processing, Neural Approx., Graph Traversal, AI, Scan, Sort, Reg. Expr., Deep Neural, Stencil
Commonality in DSAs?
Specialization Principles → Micro-Architectural Mechanisms
Goal (energy-efficient computing):
– Programmability / re-configurability features
– A general set of micro-architectural mechanisms with efficiency close to DSAs/ASICs, while retaining programmability
Programmable Hardware Accelerator
Specialization Principles
Architecture with Flexible Hardware-Software Programming Interface
Trivial adaptation to new algorithms/applications
Programmable or Re-Configurable Specialized Architecture
Exploiting specialization principles to build a programmable hardware accelerator matching the efficiency of DSAs, with an efficient accelerator hardware-software (ISA) interface
A programmable hardware accelerator nearing the efficiency of a domain-specific accelerator (DSA) is feasible to build by:
– exploiting the identified specialization principles
– supporting any typical accelerator application
Modeling Programmable Hardware Acceleration
– Micro-architectural mechanisms to exploit the specialization principles: the GenAccel Model
– Evaluation of the GenAccel Model against four DSAs
Architectural Realization with Stream-Dataflow Acceleration
– Stream-dataflow accelerator architecture with: programming abstractions and execution model, ISA interface
– An efficient architectural realization of the stream-dataflow accelerator: Softbrain
– Evaluation of Softbrain against state-of-the-art DSA solutions
*Published in HPCA 2016, IEEE Micro Top Picks 2017
Embodiment of principles in DSAs
Five specialization principles (for speedup and energy): Computation, Data Reuse, Concurrency, Coordination, Communication
[Figure: accelerator attached to a core via the system bus, caches, and memory]
[Figure: spatial fabric of FUs and switches embodying the principles: Computation, Data Reuse, Concurrency, Coordination, Communication]
Domains: Linear Algebra, Neural Approx., Graph Traversal, AI, Scan, Sort, Reg. Expr., Deep Neural, Stencil
DSAs integrated with the host system (cores + cache)
[Figure: FU/switch fabric annotated with the five principles: Computation, Data Reuse, Concurrency, Coordination, Communication]
[Figure: FU/switch fabric embodying the five principles: Computation, Data Reuse, Concurrency, Coordination, Communication]
Domains: Linear Algebra, Neural Approx., Graph Traversal, AI, Scan, Sort, Reg. Expr., Deep Neural, Stencil
Four domains studied in depth: Deep Neural, Stencil, Neural Approx., Database
[Figure: NPU high-level organization – eight PEs with in/out FIFOs, bus scheduler, and a general-purpose processor; each processing unit has a weight buffer, FIFOs, output buffer, controller, accumulator register, sigmoid unit, and multiply-add]
Principles: Computation, Data Reuse, Concurrency, Coordination, Communication
Principles embodied in each tile: Computation, Data Reuse, Concurrency, Coordination, Communication
[Figure: GenAccel tiles – each tile has a low-power core (with D$), a spatial fabric of FUs and switches (S), a scratchpad, DMA, and input/output interfaces to memory]
GenAccel Model: low-power core | spatial fabric | scratchpad | DMA
Principles: Computation, Data Reuse, Concurrency, Coordination, Communication
GenAccel fabric: a programmable hardware template for specialization
Provisioned for a single domain: GAN (Neural Approx.), GAC (Stencil), GAD (Deep Neural), GAQ (Database)
Provisioned for multiple application domains: GAB (GABalanced)
*Figures not to scale
Performance: trace-driven simulator + application-specific modeling
Power & Area: synthesized modules, CACTI and McPAT
Other tradeoffs possible (power, area, energy, etc.)
Comparison points:
GAN (1 Unit) vs. NPU
GAC (1 Unit) vs. Conv.
GAD (8 Units) vs. DianNao
GAQ (4 Units) vs. Q100
GAB (8 Units) vs. all four DSAs
Baseline – 4 wide OOO core (Intel 3770K)
[Charts: speedup (GeoMean) of each domain-provisioned GA versus its DSA (NPU, Conv., DianNao, Q100); bars stack LP core + SFUs (+comp.), SIMD (+concur.), Multi-Tile (+concur.), Spatial (+comm.), GA (+reuse.), with the DSA GeoMean shown for comparison]
Domain-provisioned GenAccel (GA): GAC vs. Conv. (1 Unit), GAN vs. NPU (1 Unit), GAD vs. DianNao (8 Units), GAQ vs. Q100 (4 Units)
Normalized area of GA vs. each DSA: 1.2x, 1.7x, 3.8x, 0.5x (GAN, GAC, GAD, GAQ)
Normalized power: 2x, 3.6x, 4.1x, 0.6x
*Detailed area breakdown in backup
* Still provisioned to match the performance of each DSA
[Charts: balanced GenAccel (GAB) normalized power and area – 0.6x area and 2.5x power relative to the combined DSAs]
Design of a generic programmable accelerator (GenAccel Model), and the tradeoffs considered
Exploiting specialization principles to build a programmable hardware accelerator matching the efficiency of DSAs, with an efficient accelerator hardware-software (ISA) interface
Modeling Programmable Hardware Acceleration
– Micro-architectural mechanisms to exploit the specialization principles: the GenAccel Model
– Evaluation of the GenAccel Model against four DSAs
Architectural Realization with Stream-Dataflow Acceleration
– Stream-dataflow accelerator architecture with: programming abstractions and execution model, ISA interface
– An efficient architectural realization of the stream-dataflow accelerator: Softbrain
– Evaluation of Softbrain against state-of-the-art DSA solutions
*Published in ISCA 2017, Submitted to IEEE Micro Top-Picks 2018
Accelerator workload properties:
– Regular streaming memory accesses with straightforward patterns
– Computationally intensive, with long execution phases
– Ample data-level parallelism with a large datapath
– Small instruction footprints with simple control flow
Stream-Dataflow acceleration:
– Instantiates the hardware primitives from the GenAccel model
– A high-performance compute substrate with dataflow and stream specialization components
– Exposes a novel stream-dataflow ISA interface for programming the accelerator
– Abstracts typical accelerator computation phases
– Hardware-software interface exposes the parallelism available in these phases
– Coordination and data consistency
[Figure: stream-dataflow model – a dataflow graph fed by memory streams (from memory), reuse streams (local storage), and recurrence streams, with results streamed back to memory; dataflow computation + stream patterns and interface]
[Figure: programmable stream-dataflow accelerator – a re-configurable computation fabric executes the DFG iteratively; input/output/recurring data streams connect it to the memory/cache hierarchy, and reuse streams connect it to local storage (the programmable scratchpad)]
– Dataflow-based firing through vector ports
– Dataflow graph (DFG) with input/output vector ports and widths, e.g. A(3), Acc(1), B(3), Out(3), R(1)
– Data streamed from memory and stored back to memory
– Local storage (programmable scratchpad) lets data be reused again
– Explicit data movement commands and barriers
[Figure: DFG with memory, reuse (local storage), and recurrence streams]
Stream command structure – Source: memory address, local storage address, or DFG port; Access Pattern; Destination: memory address, local storage address, or DFG port
[Figure: stream-dataflow execution timeline – read data, compute, and write data phases overlap in time; a read barrier and an all barrier enforce ordering]
Tiny H/W-S/W interface: 10–1000x Performance/Power or Performance/Area, but completely loses generality/programmability
H/W-S/W interface ↔ H/W parameters
Can specialized programs be adapted in a domain-agnostic way with this interface?
SD_Config – configuration data stream for the dataflow computation fabric (CGRA)
Barriers: SD_Barrier_Scratch_Rd, SD_Barrier_Scratch_Wr, SD_Barrier_All
Stream commands move data between Memory, Local Storage (Scratchpad), and the Compute Fabric:
– Source/Dest parameters: address (memory or local_storage), DFG port number
– Pattern parameters: access_size, stride_size, num_strides
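To make the command anatomy concrete, a stream command can be viewed as a small descriptor. A minimal C sketch, assuming illustrative field names rather than the actual ISA encoding:

  /* Hypothetical descriptor for one stream-dataflow command.
     Field names are illustrative assumptions, not the real encoding. */
  #include <stdint.h>

  typedef enum { SD_LOC_MEMORY, SD_LOC_SCRATCH, SD_LOC_PORT } sd_loc_t;

  typedef struct {
      sd_loc_t src_kind;    /* source: memory, local storage, or DFG port */
      uint64_t src;         /* source address or DFG port number */
      sd_loc_t dst_kind;    /* destination: memory, local storage, or DFG port */
      uint64_t dst;         /* destination address or DFG port number */
      uint32_t access_size; /* contiguous bytes moved per stride */
      uint32_t stride_size; /* byte distance between consecutive accesses */
      uint32_t num_strides; /* number of strided accesses to issue */
  } sd_stream_cmd_t;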
Source and destination: Memory, Local Storage, or DFG Port
Pattern parameters: start address, stride, access size, number of strides
Example: mem_addr = 0xA, memory_stride = 8, num_strides = 2, access_size = 4
Example access patterns: Linear, Strided, Overlapped, Repeating, Offset-Indirect, 2D Direct Streams, 2D Indirect Streams
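As a sanity check on the example above, here is a minimal sketch (my own illustration, not hardware code) of the byte addresses such a strided stream generates:

  /* Enumerate the bytes touched by a strided stream. With mem_addr = 0xA,
     stride = 8, num_strides = 2, access_size = 4, this prints
     0xA..0xD and 0x12..0x15. */
  #include <stdio.h>
  #include <stdint.h>

  int main(void) {
      uint64_t mem_addr = 0xA;
      uint32_t stride = 8, num_strides = 2, access_size = 4;
      for (uint32_t s = 0; s < num_strides; ++s)
          for (uint32_t b = 0; b < access_size; ++b)
              printf("0x%llx\n",
                     (unsigned long long)(mem_addr + (uint64_t)s * stride + b));
      return 0;
  }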
for i = 1 to 100:
    ... = a[2*i];
    ... = b[i];
    c[b[i]] = ...

Stream encoding – direct: <address, access_size, stride_size, length>; indirect: <stream_start, offset_address>
E.g.: <a, 1, 2, 100>, <b, 1, 1, 100>, IND<[prev], c, 100>
[Figure: accesses to a, b, and c over time]
Dataflow graph: vector inputs A[0:2] and B[0:2], output C – specified in a domain-specific language (DSL)
for (int i = 0; i < N; i++) { c += a[i] * b[i]; }

Stream encoding:
Put a[0:N] -> P1
Put b[0:N] -> P2
Recur P3, N-1
Get P3 -> c

Dataflow encoding: ports P1 and P2 feed the multiply-accumulate DFG; P3 carries the accumulated result
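Using the SD_* command style from the classifier example later in this talk, the same dot product might look roughly like the following sketch (port names and stream parameters are my assumptions):

  // Hypothetical stream-dataflow dot product in the SD_* command style.
  // Assumes 16-bit elements, four per 64b word, as in the classifier example.
  SD_CONFIG(dotprod_cfg, sizeof(dotprod_cfg)); // load the multiply-accumulate DFG
  SD_MEM_PORT(a, 8, 8, N / 4, Port_A);         // stream a[0:N] to input port P1
  SD_MEM_PORT(b, 8, 8, N / 4, Port_B);         // stream b[0:N] to input port P2
  SD_CONST(Port_acc, 0, 1);                    // seed the accumulator with zero
  SD_PORT_PORT(Port_out, N - 1, Port_acc);     // recur partial sums N-1 times (P3)
  SD_PORT_MEM(Port_out, 2, 2, 1, &c);          // write the final sum to memory
  SD_BARRIER_ALL();                            // wait for all streams to drain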
Stream-Dataflow ISA
– Expresses access patterns efficiently: address generation hardware becomes much simpler
– Encodes alias-free memory requests: implicit coalescing for concurrent memory accesses
– Abstracts the implementation details
A New ISA Paradigm for Acceleration
– Embodies the specialization principles and execution model
– Without requiring complex micro-architecture techniques for performance: VLIW, SIMT and SIMD have their own drawbacks for accelerators
– Enables a 'hardened' ASIC compute substrate implementation
– Separates the memory interface primitives and their interaction
[Figure: local storage (scratchpad), ASIC hardware for computation, memory]
– Builds on the hardware mechanisms explored in the GenAccel model (*IEEE Micro Top Picks 2017: Domain Specialization is Generally Unnecessary for Accelerators)
– Comparison with data-parallel architectures (with less power, area and control overheads)
[Figure: principle-to-mechanism mapping – Computation: problem-specific FUs; Data Reuse: scratchpad; Concurrency: multiple tiles; Coordination: low-power core; Communication: spatial fabric (CGRA)]
Data-parallel architecture organizations:
– SIMD & short-vector SIMD: control core, vector register file, SIMD vector units, sub-SIMD
– SIMT: warp scheduler + vector dispatch, large register file + scratchpad, vector lanes, memory coalescer
– Vector Thread: control core + vector dispatch, scalar dispatch, register file, vector lanes, vector fetch support
– Spatial Dataflow: distributed PEs, scalar dispatch
[Table residue: the architectures are compared along Addressing & Communication (address generation, gather instructions, bandwidth for local accesses), Resource Utilization & Latency Hiding (instructions, register file & cache pressure, dispatchers and re-ordering), and Irregular Execution Support (pipeline hardware support for diverged vector threads)]
Stream-Dataflow organization: command core, memory interface, scratchpad, vector interfaces, coarse-grained reconfigurable architecture (CGRA)
– Dataflow datapath for pipelined concurrent execution
– Avoids multi-threading, reducing cache pressure, by using a multi-ported scratchpad
– Flexible memory access support, not tied to one application domain
[Figure: Softbrain – CGRA spatial fabric of FUs and switches with input/output vector port interfaces (512b-wide ports, 64b elements); memory stream engine to/from the memory hierarchy; scratchpad stream engine; recurrence stream engine; indirect vector port interface; example DFG ports A(3), Acc(1), B(3), Out(3), R(1)]
– CGRA for data-parallel execution
– Scratchpad stream-engine for data-locality and data-reuse
– Memory stream-engine for streaming in and out of the accelerator
– Recurrence stream-engine for recurrent data streams
– Indirect vector ports for computed addresses (indirect loads/stores)
[Figure: Softbrain control path – a tiny in-order core (I$/D$) issues stream commands through a stream command dispatcher to the stream engines and the CGRA spatial fabric]
– Coarse-grained stream commands issued by the core through a command queue
– Stream-dataflow interface exposed to a general-purpose programmable core
– Keeps the accelerator design decoupled from the core
Put a[0:N] -> P1
Put b[0:N] -> P2
Recur P3, N-1
Get P3 -> c
Multi-Tile Stream-Dataflow Accelerator: multiple tiles share the memory/cache hierarchy, with independent stream-dataflow kernels mapped to each tile
Programming interface:
– Simple dataflow language for the DFG
– Coarse-grained stream commands using the stream-interface
[Figure: the dataflow graph (input ports, CGRA instructions, output ports) maps onto the execution resources – tiny in-order core, scratchpad, memory, and input/output ports]
#define Ni 8
#define Nn 8
// synapse and neurons – 2 bytes
uint16_t synapse[Nn][Ni];
uint16_t neuron_i[Ni];
uint16_t neuron_n[Nn];

for (n = 0; n < Nn; n++) {
  sum = 0;
  for (i = 0; i < Ni; i++) {
    sum += synapse[n][i] * neuron_i[i];
  }
  neuron_n[n] = sigmoid(sum);
}

[Figure: input neurons (Ni) connect through synapses (Nn x Ni) to output neurons (Nn); each output is a sum passed through a sigmoid]
DFG for the computation sum += synapse[n][i] * neuron_i[i]:

Input: do_sig
Input: acc
Input: N
Input: S
M = Mul16x4(N, S)
R = Red16x4(M, acc)
Output: out

Ports: N – input neuron (neuron_i) port; S – synapses (synapse) port; do_sig – input sigmoid predicate port; acc – input accumulate port
Compilation + spatial scheduling produce class_cfg (configuration data for the CGRA)
Stream program for neuron_n[n] = sigmoid(sum):

// Configure the CGRA
SD_CONFIG(class_cfg, sizeof(class_cfg));
// Stream the data from memory to ports
SD_MEM_PORT(synapse, 8, 8, Ni * Nn / 4, Port_S);
SD_MEM_PORT(neuron_i, 8, 8, Ni / 4, Port_N);
for (n = 0; n < Nn / nthreads; n++) {
  // Stream the constant values to constant ports
  SD_CONST(Port_acc, 0, 1);
  SD_CONST(Port_do_sig, 0, Ni - 1);
  // Recur the computed data back for accumulation
  SD_PORT_PORT(Port_out, N - 1, Port_acc);
  // Sigmoid computation and output neuron written
  SD_CONST(Port_do_sig, 1, 1);
  SD_PORT_MEM(Port_out, 2, 2, 1, &neuron_n[n]);
}
SD_BARRIER_ALL();

class_cfg: configuration data for the CGRA, produced by compilation + spatial scheduling
Goals:
– Increase performance [CGRA instructions / cycle]
– Increase throughput [graph computation instances per cycle]
Bottlenecks:
– Computations per size of dataflow graph
– General core (for issuing streams)
– Memory/cache bandwidth
– Recurrence serialization overhead
Optimizations:
– Increase computation through loop unrolling/vectorization
– Increase the "length" of streams
– Use the scratchpad for data-reuse
– Increase parallel computations (tiling)
– Simple vector-port to stream-engine scoreboard mechanism
– Barriers cover scratchpad as well as memory state
– Little additional accelerator logic
Memory Stream-Engine (MSE) and Scratchpad Stream-Engine (SSE):
– Interface to the cache/memory hierarchy
– Handle the high-bandwidth memory accesses
Stream-engine controller with a stream request pipeline: generates addresses and moves data to and from the vector ports
[Figure: Softbrain micro-architecture – a RISCV Rocket core (I$/D$ req/resp) issues SD commands; a stream dispatcher with a VP scoreboard and resource status checker tracks free/busy engines and routes commands (SSE read/write, MSE read/write, RSE, CGRA config) to the scratchpad stream engines (reads and writes), the memory stream engines (reads and writes, to the cache/memory hierarchy), and the recurrence stream engine; SCR-to-MSE writes and tag invalidation; input/output data vector ports and indirect load/store vector ports feed the CGRA. Legend: black = data lines, green = control/commands; boxes = control, state storage/SRAM, datapath]
Evaluation toolchain:
– Software stack: stream-dataflow code (C/C++) + DFG file → DFG compiler (ILP solver) and RISCV GCC → RISCV binary, Config., DFG.h
– Hardware: accelerator model configuration → Chisel parameterizable accelerator implementation (Softbrain RTL) → Chisel-generated Verilog → synthesis with Synopsys DC
– RISCV ISA accelerator cycle-level simulator for performance evaluation
Workloads:
– Deep Neural Networks (DNN): for the domain-provisioned comparison
– MachSuite accelerator workloads: for comparison with application-specific accelerators
Comparisons:
– Domain-provisioned Softbrain vs. the DianNao DSA
– Broadly provisioned Softbrain vs. ASIC design points, with Aladdin*-generated performance, power and area
Estimates: synthesized area and power; CACTI for cache and SRAM
*Shao et al. – Aladdin: A Pre-RTL, Power-Performance Accelerator Simulator Enabling Large Design Space Exploration of Customized Architectures
[Chart: speedup relative to OOO4 on DNN workloads (log scale, 1–1000) for Softbrain and DianNao; peak bars annotated 298x and 191x]
Components | Area (mm2) @ 28nm | Power (mW)
Rocket Core (16KB I$ + D$) | 0.16 | 39.1
CGRA Network | 0.12 | 31.2
FUs (5 x 4) | 0.04 | 24.4
Total CGRA | 0.16 | 55.6
5 x Stream Engines | 0.02 | 18.3
Scratchpad (4KB) | 0.1 | 2.6
Vector Ports (Input & Output) | 0.03 | 1
Softbrain Unit | 0.47 | 119.3
8 Softbrain Units | 3.76 | 954.4
DianNao DSA | 2.16 | 418.3
Softbrain / DianNao Overhead | 1.74 | 2.28
Aladdin*-generated ASIC design points, with resources constrained to be within ~15% of Softbrain performance for an iso-performance analysis
*Shao et al. – Aladdin: A Pre-RTL, Power-Performance Accelerator Simulator Enabling Large Design Space Exploration of Customized Architectures
[Chart: speedup relative to OOO4 on MachSuite workloads – Softbrain 2.59x vs. ASIC 2.67x]
[Charts (GM): power efficiency relative to OOO4 – Softbrain 11x vs. ASIC 18x; energy efficiency relative to OOO4 – Softbrain 31x vs. ASIC 48x; ASIC area relative to Softbrain – 0.14x]
Stream-Dataflow Execution Model – Abstracts typical accelerator computation phases using a dataflow graph
Stream-Dataflow ISA Encoding and Hardware-Software Interface – Exposes parallelism available in these phases
CGRA and vector ports for pipelined vector-dataflow computation
Highly parallel stream-engines for low-power stream communication
Matches performance of domain provisioned accelerator (DianNao DSA) with ~2x overheads in area and power
Compared to application specific designs (ASICs), Softbrain has ~2x overheads in power and ~8x in area
Exploiting specialization principles to build a programmable hardware accelerator matching the efficiency of DSAs, with an efficient accelerator hardware-software (ISA) interface
Programmable Hardware Acceleration: breaking the limits of acceleration
– General specialization primitives
– Efficient accelerator hardware implementation
– Retains generality compared to ASICs
– Configuration cache for the CGRA to switch between DFGs
– Dynamic deadlock detection for buffer overflow
– Concurrent execution of different sets of streams (of different DFGs)
– Allow vector ports to run out-of-order, reducing overall latency
– Smart Memories, CHARM, CAMEL, MorphoSys, XLOOPS, Maven-VT
– GPPs are inefficient and need specialization – Hameed et al.; trace processing – BERET; transparent specialization – CCA, CRIB, etc.
– Composite Cores, DySER, Cambricon
– RSVP architecture, Imagine, Triggered Instructions, MAD, CoRAM++
79
11/16/2017
Lead developer and contributor to an open-source hardware GPGPU – MIAOW
– AMD Southern Islands based RTL implementation of a GPGPU, able to execute unmodified AMD APP OpenCL kernels
– Published in [ACM TACO 2015, HOTCHIPS 2015, COOLCHIPS 2015, HiPEAC 2016]
A hybrid architecture aimed at exploiting ILP in irregular applications
– Lead developer of the micro-architecture of the dataflow offload engine – Specialized Engine for Explicit Dataflow (SEED)
– Published in [ISCA 2015, IEEE Micro Top Picks 2016]
A position article on the advantages of open-source hardware for hardware innovation
– Huge believer in open-source hardware and contribution
– To be published in IEEE Computer 2017
Programmable Hardware Accelerator (GenAccel)
Domain-provisioned GenAccel: Deep Neural
Balanced GenAccel: Stencil, Sort, Scan, AI
*Figures not to scale
Principles: Computation, Data Reuse, Concurrency, Coordination, Communication
[Figure: NPU high-level organization – processing engines with in/out FIFOs, bus scheduler, and a general-purpose processor; each engine has a weight buffer, FIFOs, output buffer, controller, accumulator register, sigmoid unit, and multiply-add]
– Implicit communication
– Low-power control logic
Domains: DNN, Database Streaming, Neural Approx., Convolution
Modeling methodology (simulation). Inputs:
– The algorithm to specialize, in the form of C code/binary
– Potential core types, CGRA sizes, any specialized instructions
– Degree of memory customization (which memory accesses are to be specialized, either with DMA or scratchpad)
Output: single-core perf./energy for "Pareto-optimal" designs
– Use profiling information to determine the parallel portion of the algorithm (or ask the user to indicate or estimate it)
– Use simple Amdahl's law to get a performance estimate
– Use execution time, a single-core energy estimate, and a static power estimate to get an overall energy estimate
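A minimal sketch of that Amdahl's-law estimate (my own illustration; p is the profiled parallel fraction, s the modeled speedup on it):

  /* Amdahl's-law performance estimate for the modeling step above. */
  double amdahl_speedup(double p, double s) {
      /* p: parallelizable (accelerated) fraction of execution time
         s: speedup achieved on that fraction */
      return 1.0 / ((1.0 - p) + p / s);
  }
  /* Example: amdahl_speedup(0.9, 50.0) is roughly 8.5x overall. */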
GenAccel design flow:
– The hardware architect/designer turns performance requirements and hardware constraints (area goal, power goal) into design decisions: FU types, number of FUs, spatial fabric size, number of GenAccel tiles
– For each application: write a control program (C program + annotations) and a datapath program (spatial scheduling)
– Runtime configuration (serial): configure for App. 1, run App. 1, configure for App. 2, (etc.)
– Runtime configuration (parallel): configure for and run App. 1, App. 2, and App. 3 concurrently
Pragmas:

#pragma genaccel cores 2
#pragma reuse-scratchpad weights
void nn_layer(int num_in, int num_out, const float* weights,
              const float* in, const float* out) {
  for (int j = 0; j < num_out; ++j) {
    for (int i = 0; i < num_in; ++i) {
      ...
    }
  }
}

[Figure: LSSD tile – low-power core (D$), spatial fabric (multiply/add tree with reduction Ʃ), scratchpad, DMA, input/output interfaces, memory]
Compilation for LSSD: loop parallelize, insert communication, modulo schedule; resize computation (unroll), extract computation subgraph, spatial schedule; insert data transfers
Design | Concurrency (CGRA) | Computation | Communication | Data Reuse | GenAccel Units
GAN | 24-tile CGRA (8 Mul, 8 Add, 1 Sigmoid) | 2k x 32b sigmoid lookup table | 32b CGRA; 256b SRAM interface | 2k x 32b weight buffer | 1
GAC | 64-tile CGRA (32 Mul/Shift, 32 Add/logic) | Standard 16b FUs | 16b CGRA; 512b SRAM interface | 512 x 16b SRAM for inputs | 1
GAD | 64-tile CGRA (32 Mul, 32 Add, 2 Sigmoid) | Piecewise linear sigmoid unit | 32b CGRA; 512b SRAM interface | 2k x 16b SRAMs for inputs | 8
GAQ | 32-tile CGRA (16 ALU, 4 Agg, 4 Join) | Join + Filter units | 64b CGRA; 256b SRAM interface | SRAMs for buffering | 4
GAB | 32-tile CGRA (combination of above) | Combination of above FUs | 64b CGRA; 512b SRAM interface | 4KB SRAM | 8
Mul: Multiplier, Add: Adder
Principle | Synthesis-time | Run-time
Concurrency | Number of GenAccel units | Power-gating unused GenAccel units
Computation | Spatial fabric FU mix | Scheduling of spatial fabric and core
Communication | Enabling spatial datapath elements & SRAM interface widths | Configuration of spatial datapath, switches and ports; memory access pattern
Data Reuse | Scratchpad (SRAM) size | Scratchpad used as DMA/reuse buffer
[Chart: GAN vs. NPU speedup over the baseline 4-wide OOO core (Intel 3770K) on fft (1-4-4-2), inversek2j (2-8-2), jmeint (18-32-8-2), jpeg (64-16-64), kmeans (6-8-4-1), sobel (9-8-1), and the geometric mean; bars stack LP Core + Sig. (+comp.), SIMD (+concur.), Spatial (+comm.), GA (+reuse.) against NPU (DSA)]
Algorithm/concurrency specialization insights across NPU, Q100, DianNao, and the Convolution Engine:
– NPU: massive benefits from straightforward algorithm parallelization; some benefit from vector and bit-width specialization.
– Q100: massive benefit from avoiding data copying; significant benefit from algorithmic modifications to improve concurrency.
– DianNao: some benefit from the specialized weight buffer and inter-layer broadcast.
– Convolution Engine: some benefit from exposing concurrency/reuse; some benefit from specialized shift registers and the graph fusion unit.
Overall, specialization of the hardware is never the sole factor, and rarely the larger factor.
[Chart: GAC vs. Conv. (1 Tile) – speedup on IME, DOG, EXTR., FME and the geometric mean; bars stack LP core + FUs (+comp.), SIMD (+concur.), Spatial (+comm.), GA (+reuse.)]
[Chart: GAD vs. DianNao (8 Tiles) – speedup on conv1, pool1, class1, conv2, conv3, pool3, class3, conv4, conv5, pool5 and GeoMean; includes 8-Tile (+concur.) and DianNao (domain-accel)]
[Chart: GAQ vs. Q100 (4 Tiles) – speedup on queries q1–q17 and GM; includes 4-Tile (+concur.) and Q100 (domain-accel)]
Baseline – 4-wide OOO core (Intel 3770K)
Domain | Design | Area (mm2) | Power (mW)
Neural Approx. | GAN | 0.37 | 149
Neural Approx. | NPU | 0.30 | 74
Stencil | GAC | 0.15 | 108
Stencil | Conv. | 0.08 | 30
Deep Neural | GAD | 2.11 | 867
Deep Neural | DianNao | 0.56 | 213
Database Streaming | GAQ | 1.78 | 519
Database Streaming | Q100 | 3.69 | 870
Balanced | GAB | 2.74 | 352

*Intel Ivybridge 3770K CPU, 1 core: Area – 12.9mm2 | Power – 4.95W (source: http://www.anandtech.com/show/5771/the-intel-ivy-bridge-core-i7-3770k-review/3; estimates from die-photo analysis and block diagrams from wccftech.com)
*Intel Ivybridge 3770K iGPU, 1 execution lane: Area – 5.75mm2; AMD Kaveri APU Tahiti-based GPU, 1 CU: Area – 5.02mm2
GAN: 1.2x more area, 2x more power than its DSA (NPU)
GAC: 1.7x more area, 3.6x more power than its DSA (Conv.)
GAD: 3.8x more area, 4.1x more power than its DSA (DianNao)
GAQ: 0.5x the area, 0.6x the power of its DSA (Q100)
Four domain-provisioned GAs combined: 2.7x more area, 2.4x more power than the DSAs
Single balanced GAB: 0.6x the area, 2.5x more power than the combined DSAs
– A Scalable High-Bandwidth Architecture for Lossless Compression on FPGAs, Fowers et al.
– IBM PowerEN regular expression engine – DFA-based codes
Energy analysis model:
– S: accelerator's speedup; U: accelerator utilization; t: execution time
– Overall energy of the computation executed on the system combines system power, core power, and accelerator power
– Pcore: 5W, Psys: 5W, Pacc: 0.1–5W (*power numbers are example representation)
[Figure: baseline OOO core vs. system with accelerator (GenAccel or DSA) – system bus, caches, memory]
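The slide names the parameters but not the formula; one natural combination of S, U, t and the three power terms (my assumption, not stated in the talk) is:

\[
E_{\text{baseline}} = t\,(P_{\text{sys}} + P_{\text{core}}), \qquad
E_{\text{accel}} \approx \frac{t}{S}\,\bigl(P_{\text{sys}} + U\,P_{\text{acc}} + (1-U)\,P_{\text{core}}\bigr)
\]

That is, the accelerated run finishes S times sooner, with the accelerator drawing power for the utilized fraction U and the core for the remainder.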
Speedup_GA = Speedup_DSA (speedup w.r.t. OOO)
[Charts: energy efficiency of the DSA over OOO, and of GenAccel over OOO, versus accelerator speedup w.r.t. the OOO core, for U = 1, 0.95, 0.9, 0.75; GenAccel modeled with a 500mW (5x) power overhead]
Baseline – 4-wide OOO core. Efficiency gains of GenAccel and the DSA are almost similar, and at higher speedups both get "capped" due to the large system power.
[Chart: energy efficiency of the DSA over GenAccel (1.00–1.12) versus accelerator speedup w.r.t. the OOO core (10–50), for U = 1, 0.95, 0.9, 0.75; Speedup_GA = Speedup_DSA]

The DSA's relative energy-efficiency factor
\[ F = \frac{1/E_{\text{DSA}}}{1/E_{\text{GenAccel}}} = \frac{E_{\text{GenAccel}}}{E_{\text{DSA}}} \]
is no more than 10% even at 100% utilization.
At lower speedups, the DSA's energy-efficiency gains are 6–10% over GenAccel; at higher speedups, the DSA's benefit is less than 5%.
[Chart: energy efficiency of DianNao over GenAccel (1.00–1.14) versus accelerator speedup w.r.t. OOO, for U = 1, 0.95, 0.9, 0.75; Speedup_GA = Speedup_DianNao]
accelerator power == core power
Example: C[i] = A[i] * B[i]
Stream commands in program order:
C1) Mem -> Scratch
C2) Scratch Wr Barrier
C3) Scratch -> Port A
C4) Mem -> Port B
C5) Port C -> Mem
C6) Mem -> Port B
C7) All Barrier
[Figure: timeline – A and B map to two input scalar vector ports, C to an output scalar vector port, X to a multiplier on the CGRA substrate; tracks command generation, scratchpad processing, CGRA fabric state, and low-power core state over time. Legend: enqueued, dispatched, resource idle, resource in use, all data at destination, barrier dependency]
Original program:
for (int i = 0; i < N; i++) { dot_prod += a[i] * b[i]; }

Scalar (~2N instructions):
for (i = 0 to N) { Send a[i] -> P1; Send b[i] -> P2 }
Get P3 -> result

Vector (~2N/vec_len instructions):
for (i = 0 to N, i += vec_len) { Send a[i:i+vec_len] -> P1; Send b[i:i+vec_len] -> P2 }
Get P3 -> result

Stream-Dataflow (~3 instructions):
Send a[i:i+N] -> P1
Send b[i:i+N] -> P2
Get P3 -> result

Computation graph: ports P1 and P2 feed a multiply-accumulate; P3 returns the result
Google TPU ISA
– Intended to be a programmable ISA with low instruction overheads within its domain
– Huge performance benefit for neural network applications
– Reduced latency for inference [< 7ms]
– ISA restricted heavily to certain types of computations [Read_Host_Memory, Read_Weights, MatrixMultiply/Convolve, Activate, Write_Host_Memory]
– Bandwidth to the host can become the bottleneck
– PCIe interconnect to the host: 12.5GB/s effective bandwidth
– 65K operations per cycle using a 256 x 256 systolic array 2D pipeline
– Use as long streams as possible
– Keep computation instances > 2 x number of commands

// One short stream command per iteration:
for (int i = 0; i < 128; ++i) {
  SB_MEM_PORT(array[i], stride_size, acc_size, num_times, Port);
  ...
}

// Better: double the stream length, halving the command count:
for (int i = 0; i < 128; i += 2) {
  SB_MEM_PORT(array[i], stride_size, acc_size, num_times*2, Port);
  ...
}

// Best: one long stream covering all iterations:
SB_MEM_PORT(array[0], stride_size, acc_size, num_times*128, Port);
for (int i = 0; i < 128; ++i) {
  ...
}
Memory and scratchpad bandwidth:
– The scratchpad can feed one 8-wide port, two 4-wide ports, or four 2-wide ports
– Use scratch streams to supplement memory streams
All bytes of every touched cache line COUNT towards bandwidth.
Example stream: access_size = 16 bytes, stride_size = 24 bytes (the address pattern alternates 16 used bytes with 8-byte gaps); cache line size: 64 bytes.
HINT 1: Don't use access patterns with "gaps" smaller than the cache line size.
HINT 2: Try to align accesses with cache line boundaries.
Version 1 – neuron_i streamed from memory each time:
SD_Config(classifier_cfg, sizeof(classifier_config));
SD_Mem_Port(synapse, 8, 8, Ni * Nn/4, Port_S);
SD_Mem_Port(neuron_i, Ni * 2, Ni * 2, Ni, Port_N);
for (n = 0; n < Nn; n++) {
  SD_Const_Port(0, 1, Port_acc);
  SD_Const_Port(0, Ni - 1, Port_do_sig);
  SD_Port_Port(Port_out, Ni - 1, Port_acc);
  SD_Const_Port(1, 1, Port_do_sig);
  SD_Port_Mem(Port_out, 1, &neuron_n[n]);
}
SD_Barrier_All();

Version 2 – neuron_i staged in the scratchpad and reused:
SD_Config(classifier_cfg, sizeof(cfg));
SD_Mem_Port(synapse, 8, 8, Ni * Nn/4, Port_S);
SD_Mem_Scratch(neuron_i, Ni * 2, Ni * 2, 1, 0);
SD_Barrier_Scratch_Wr();
SD_Scratch_Port(0, Ni * 2, Ni * 2, 1, Port_N);
for (n = 0; n < Nn; n++) {
  SD_Const_Port(0, 1, Port_acc);
  SD_Const_Port(0, Ni/4 - 1, Port_do_sig);
  SD_Const_Port(1, 1, Port_do_sig);
  SD_Port_Port(Port_out, Ni/4 - 1, Port_acc);
  SD_Port_Mem(Port_out, 1, &neuron_n[i]);
}
SD_Barrier_All();
[Figure: latencies of accelerator "phases" – ~15 cycles within the fabric, ~100 cycles to memory (or ~20 cycles from cache)]
Dot product of arrays B and A with a single accumulator chain:
[Figure: reduction tree over A[0..15] x B[0..15], with a serial carry between successive groups]
Latency = 15 cycles; instances / cycle = 1/15
With two independent accumulator chains (Carry1, Carry2):
[Figure: the same reduction split across two interleaved carry chains]
Latency = 15 cycles; instances / cycle = 2/15
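Put as a rule of thumb (my summary of the two figures above), the serial carry chain, not the tree depth, bounds throughput:

\[
\text{instances per cycle} \;=\; \frac{\text{number of independent carry chains}}{\text{DFG latency}}
\;=\; \frac{1}{15} \ \text{with one chain}, \quad \frac{2}{15} \ \text{with two}
\]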
Tiled version:

SD_Config(classifier_cfg);
SD_Mem_Scratch(neuron_i, 0, Ni*2, 1, 0);
SD_Barrier_Scratch_Write();
for (n = 0; n < Nn; n += tile_h) {
  SD_Constant(0, tile_h, Port_acc);
  for (i = 0; i < Ni; i += tile_w) {
    if (!last_iter) {
      SD_Constant(0, tile_h, Port_do_sig);
      SD_Port_Port(Port_out, tile_h, Port_acc);
    } else {
      SD_Constant(0, tile_h, Port_do_sig);
      SD_Port_Mem(Port_out, 1, &neuron_n[i]);
    }
    SD_Scratch_Port(i*2, 0, 8*tile_w, 1, Port_N);
    SD_Mem_Port(&synapse[n][i], 2*Ni, 8*tile_w, tile_h, Port_S);
  }
}
SD_Barrier_All();

[Figure: input neurons (Ni), output neurons (Nn), and synapses (Nn x Ni), tiled by tile_w x tile_h]
[Figure: 3x3 stencil array applied to an input array, producing an output array]

for (r = 0; r < row_size - 2; r++) {
  for (c = 0; c < col_size - 2; c++) {
    temp = (TYPE)0;
    for (k1 = 0; k1 < 3; k1++) {      // row access
      for (k2 = 0; k2 < 3; k2++) {    // column access
        mul = filter[k1*3 + k2] * orig[(r+k1)*col_size + c+k2];
        temp += mul;
      }
    }
    sol[(r*col_size) + c] = temp;
  }
}
Stream-dataflow version:

for (r = 0; r < row_size - 2; r++) {
  for (c = 0; c < col_size - 2; c++) {
    SD_Constant(P_stencil_sb_carry, 1, 1);
    for (k1 = 0; k1 < 3; k1++) {
      SD_Mem_Port(orig + (r + k1) * col_size + c, sizeof(TYPE), sizeof(TYPE), 4, P_stencil_sb_I);
      SD_Mem_Port(filter + (k1 * 3), sizeof(TYPE), sizeof(TYPE), 4, P_stencil_sb_F);
    }
    SD_Port_Port(P_stencil_sb_R, P_stencil_sb_carry, 2);
    SD_Port_Mem(P_stencil_sb_R, sizeof(TYPE), sizeof(TYPE), 1, sol + (r * col_size) + c);
  }
}
SD_Barrier_All();
Tiled version with scratchpad staging:

SD_Config(classifier_cfg, sizeof(cfg));
SD_Mem_Scratch(neuron_i, Ni * 2, Ni * 2, 1, 0);
SD_Barrier_Scratch_Wr();
for (n = 0; n < Nn; n += tile_h) {
  SD_Const_Port(0, tile_h, Port_acc);
  for (i = 0; i < Ni; i += tile_w) {
    if (!last_iter) {
      SD_Const_Port(0, tile_h, Port_do_sig);
      SD_Port_Port(Port_out, tile_h, Port_acc);
    } else {
      SD_Const_Port(0, tile_h, Port_do_sig);
      SD_Port_Mem(Port_out, 1, &neuron_n[i]);
    }
    SD_Scratch_Port(i * 2, 8 * tile_w, 8 * tile_w, 1, Port_N);
    SD_Mem_Port(&synapse[n][i], 2 * Ni, 8 * tile_w, tile_h, Port_S);
  }
}
SD_Barrier_All();

[Figure: input neurons (Ni), output neurons (Nn), synapses (Nn x Ni), tiled by tile_w x tile_h]
[Figure: CGRA spatial fabric with input/output vector port interfaces; vector offsets 0–7]
– 4-entry vector port (512b or 64B wide), each element 8B (64b)
– Can store an entire cache line in a cycle (8 wide)
– Vector ports connect to CGRA links; the mapping is done by hardware architects and recorded as the Softbrain hardware parameter model
– The scheduler/compiler maps software DFG ports to hardware vector ports
– Enables variable-width SIMD execution

Example vector port to CGRA links mapping, [VPORT_Num]: [Offset]:[CGRA Link Num]:
VPORT_IN 0: 0:2, 1:5, 2:8, 3:11, 4:17, 5:20, 6:23, 7:26
VPORT_IN 1: 0:4, 1:7, 2:10, 3:16, 4:19, 5:22, 6:25, 7:31
VPORT_OUT 0: 0:1, 1:3, 2:5, 3:6, 4:8, 5:9, 6:11, 7:12
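One plausible way to record such a mapping in the hardware parameter model (my sketch; the actual file format is not shown in the talk):

  /* Hypothetical encoding of the vector-port-to-CGRA-link mapping above. */
  typedef struct { int offset; int cgra_link; } vp_map_entry;

  static const vp_map_entry VPORT_IN0[8] =
      { {0,2}, {1,5}, {2,8}, {3,11}, {4,17}, {5,20}, {6,23}, {7,26} };
  static const vp_map_entry VPORT_IN1[8] =
      { {0,4}, {1,7}, {2,10}, {3,16}, {4,19}, {5,22}, {6,25}, {7,31} };
  static const vp_map_entry VPORT_OUT0[8] =
      { {0,1}, {1,3}, {2,5}, {3,6}, {4,8}, {5,9}, {6,11}, {7,12} };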
[Chart: speedup (log scale, 1–1000) – Softbrain, DianNao, GPU]
[Chart: power efficiency relative to OOO4 (log scale) – Softbrain vs. ASIC]
[Chart: energy efficiency relative to OOO4 (log scale)]
NPU, Convolution Engine, Q100, DianNao
Specialization spectrum: ASICs, FPGAs – more gains the lower you go
Source: Bob Brodersen, Berkeley Wireless group