SLIDE 1

TANGRAM: Optimized Coarse-Grained Dataflow for Scalable NN Accelerators

Mingyu Gao, Xuan Yang, Jing Pu, Mark Horowitz, Christos Kozyrakis

Stanford University Tsinghua University Google

ASPLOS – April 2019

SLIDE 2

Neural Networks (NNs)

• Unprecedented accuracy for challenging applications
  • Fully-connected (MLPs), Convolutional (CNNs), Recurrent (LSTMs) NNs
• Inference: layer-wise processing on directed acyclic graphs (DAGs)


[Figure: three example NN structures. Convolutional NN: a chain Conv → Conv → Conv → FC → FC from din to dout. LSTM Cell: I-Gate, F-Gate, and O-Gate (FC layers) combine x_t, h_{t-1}, and c_{t-1} through elementwise ×, +, and tanh into c_t and h_t. Inception Module: parallel 1×1 Conv, 3×3 Conv, 5×5 Conv, and 3×3 Pool branches (with 1×1 Conv reductions) between din and dout.]

SLIDE 3

NN Accelerators

• Domain-specific processing engine
  • An array of specialized processing elements (PEs)
  • On-chip register files and SRAMs
  • ~100x performance and energy efficiency over general-purpose processors
• DianNao/Cambricon, Google TPU, Eyeriss, Cnvlutin, EIE, …


[Figure: NN processing engine — a 4×4 array of PEs, each containing an ALU and a register file, backed by a global buffer.]

SLIDE 4

Scaling NN Performance

• Use more PEs & more on-chip buffers
• Monolithic engine
  ✗ Low resource utilization
  ✗ Long array buses
  ✗ Far from SRAM
• Tiled architecture (the focus of our work)
  ✓ Mostly local data transfers
  ✓ Easy to scale up/down
  ? Dataflow scheduling (the open question this work addresses)


[Figure: a monolithic PE array with one global buffer and two memory channels (Mem 0–1), versus a tiled architecture: a grid of tiles, each pairing a GBuf with a PE array, surrounded by four memory channels (Mem 0–3).]

SLIDE 5

TANGRAM: Optimizing Coarse-Grained Dataflow

• Intra-layer parallelism: buffer sharing dataflow
  • Reuse data across engines → higher energy efficiency
  • Avoid on-chip data duplication → smaller buffer area
• Inter-layer pipelining: fine-grained data forwarding & pipelining of complex DAGs
  • Reduce pipeline stalls → higher throughput
  • Temporarily store only the forwarded data → smaller buffer area


SLIDE 6

Intra-Layer Parallelism


SLIDE 7

Parallelizing a Single Layer

• Inefficient buffer use for shared data
  ✗ Replicated buffered data (area)
  ✗ Data reuse limited to within each tile (energy)
• ALL parallelization schemes share some data!

[Figure: Ifmaps (Nb × Ni) * Weights (Ni × No) = Ofmaps (Nb × No).]

foreach b in batch Nb
  foreach ifmap i in Ni
    foreach ofmap o in No
      // 2D conv
      O[b][o] += I[b][i] * W[o][i]

[Figure: four tiles compute O[0][0], O[0][1], O[1][0], O[1][1]; the ifmap blocks I[0][0:1], I[1][0:1] and weight blocks W[0][0:1], W[1][0:1] are each buffered in two tiles.]
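To make the replication concrete, here is a minimal Python sketch (the 2×2 tile grid and the counting are illustrative assumptions, not the paper's model): assigning one output block O[b][o] per tile forces every ifmap block and every weight block to be buffered by two tiles.

# Hypothetical 2x2 example: one tile per output block O[b][o].
Nb, No = 2, 2
tiles = [(b, o) for b in range(Nb) for o in range(No)]

ifmap_copies = {b: 0 for b in range(Nb)}    # copies of I[b][0:Ni]
weight_copies = {o: 0 for o in range(No)}   # copies of W[o][0:Ni]
for b, o in tiles:
    ifmap_copies[b] += 1    # tile (b, o) must buffer all ifmaps of batch b
    weight_copies[o] += 1   # ... and all weights of ofmap o

print(ifmap_copies)    # {0: 2, 1: 2}: each ifmap block held by 2 tiles
print(weight_copies)   # {0: 2, 1: 2}: each weight block held by 2 tiles

Choosing a different partitioning only shifts which operand is shared; it cannot remove the sharing, which is the slide's point.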

SLIDE 8

Optimizing Dataflow for Shared Data

q Skew computation order of engines

  • All engines start in parallel à high throughput

q Rotate buffered data between engines

  • Fully reuse shared data à low energy
  • No on-chip data duplication à low area


[Figure: the same four tiles computing O[0][0], O[0][1], O[1][0], O[1][1], but each now buffers a single I block and a single W block at a time; the blocks rotate between tiles across time steps instead of being replicated.]
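A minimal simulation of the skew-and-rotate schedule (the engine count n and the chunk indexing are illustrative assumptions, not the paper's formulation): at step t, engine x works on chunk (x + t) mod n, so every engine is busy from step 0 and each chunk visits every engine exactly once.

# Hypothetical ring of n engines rotating n shared data chunks.
n = 4
for t in range(n):
    # (engine, chunk) pairs at this step; every engine is busy.
    print(f"step {t}:", [(x, (x + t) % n) for x in range(n)])
# Across the n steps, each chunk is buffered by each engine exactly once:
# at step t+1, engine x holds the chunk that engine (x + 1) % n held at
# step t, so data rotates one hop around the ring with no duplication.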

SLIDE 9

Buffer Sharing Dataflow

• Unify the distributed buffers into one ideal large buffer
  • Efficiently store and reuse data
• Formalized as loop transformations
  • A mapping from (tile coordinate x, time step t) to the index i of the data to buffer (sketched below)
  • See the paper for the detailed math
• Easy to implement
  • The buffer controller fetches from memory or from other tiles
  • No changes to the dataflow within a tile
• Supports all parallelization schemes (including hybrids)
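In the simplest case of n engines on a ring with uniform rotation (a sketch consistent with the simulation above, not the paper's general formulation), the mapping is

  i(x, t) = (x + t) mod n

so tile x fetches its chunk from memory only at t = 0 and thereafter receives each chunk from a neighboring tile's buffer.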


SLIDE 10

Inter-Layer Pipelining


SLIDE 11

Pipelining Multiple Layers

• Pros: avoid off-chip accesses for intermediate data
  • Save DRAM bandwidth and energy
• Cons: uses resources less efficiently
  • Long delays: pipeline filling/draining due to inter-layer data dependencies
  • Large SRAM buffers: must store the entire intermediate data

[Figure: multiple layers mapped onto different regions of the tile array and pipelined.]

SLIDE 12

Fine-Grained Data Forwarding

• Forward each subset of data to the next layer as soon as it is ready
  • Reduces pipeline stalls: the next layer starts earlier
  • Reduces buffer capacity: only the subset currently being forwarded is stored
• Requires matched access patterns between adjacent layers

foreach b in batch Nb
  foreach ifmap i in Ni
    foreach ofmap o in No
      // 2D conv
      O[b][o] += I[b][i] * W[o][i]

The batch loop carries no inter-layer dependencies, so different images are trivially pipelined.

[Figure: timelines of ifmaps read and ofmaps produced over time under each loop order.]

With the ofmap-outer order, a layer completes its ofmaps one at a time:

foreach ofmap o in No
  foreach ifmap i in Ni
    // 2D conv

With the ifmap-outer order, a layer finishes reading each ifmap one at a time:

foreach ifmap i in Ni
  foreach ofmap o in No
    // 2D conv
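A back-of-the-envelope timing sketch (the one-time-unit-per-fmap cost model is an illustrative assumption, not the paper's simulator) shows why matched orders matter: the consumer can start after one forwarded fmap instead of waiting for the entire intermediate layer.

# Producer emits n_fmaps intermediate fmaps, one per time unit; the
# consumer may start once `need` of its ifmaps are ready, then also
# takes one time unit per fmap.
def consumer_finish(n_fmaps, need):
    first_ready = need            # fmap f is ready at time f + 1
    return first_ready + n_fmaps  # start, then process n_fmaps fmaps

print(consumer_finish(8, need=1))  # matched orders (forwarding): 9
print(consumer_finish(8, need=8))  # mismatched orders (full buffer): 16

The buffer requirement shrinks the same way: one fmap in flight versus all n_fmaps of intermediate data.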

SLIDE 13

Alternate Layer Loop Ordering (ALLO)

[Figure: pipeline timelines for Layers 1–3 over time. Unoptimized: each layer delays its successor by ALL fmaps and buffers ALL fmaps. Optimized with ALLO: the delay and the buffering shrink to ONE fmap. The benefits apply to half of all layers, since the loop order must alternate between adjacent layers.]

SLIDE 14

Layer Pipelining for Complex NN DAGs

• A dataflow tool explores pipeline schedules across multiple layers
• Subject to design rules imposed by data dependency constraints
  • E.g., no multiple predecessor layers on-chip (see the sketch below)
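As a sketch of such a design-rule check (this is one reading of the example rule, with hypothetical layer names, not the tool's actual code), a candidate on-chip segment is rejected if any layer in it has two or more predecessors also in the segment:

# Reject a segment in which some layer has >= 2 on-chip predecessors.
def segment_ok(segment, preds):
    """segment: set of layer names on-chip; preds: layer -> predecessor set."""
    return all(len(preds.get(layer, set()) & segment) <= 1 for layer in segment)

# Hypothetical Inception-style DAG: two branches joined by a concat.
preds = {"branch1": {"input"}, "branch2": {"input"},
         "concat": {"branch1", "branch2"}}
print(segment_ok({"input", "branch1", "branch2"}, preds))   # True
print(segment_ok({"branch1", "branch2", "concat"}, preds))  # False: concat
# has two predecessor layers inside the same on-chip segment.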

[Figure: three complex DAGs annotated with pipeline regions R0–R6: a ResNet module (1×1, 3×3, 1×1 Convs plus a shortcut add), an Inception module (parallel 1×1/3×3/5×5 Convs and a 3×3 Pool), and an LSTM cell (I/F/O gates, FC, elementwise ×, +, tanh).]

SLIDE 15

Evaluation Results


SLIDE 16

Modeling Methodology

• State-of-the-art NNs
  • CNNs: AlexNet, VGGNet, GoogLeNet, ResNet
  • MLPs & LSTMs: medium and large scales
• Hardware
  • Inference engine per tile: Eyeriss [ISCA'16], 8 × 8 PEs, 32 kB buffer, 500 MHz
  • Off-chip memory: LPDDR3-1600, 4 channels
  • Overall chip: 16 × 16 tiles = 16384 PEs + 8 MB SRAM, 90 mm² at 28 nm


SLIDE 17

Overall Comparison

• Base tiled vs. monolithic: 3.6x performance, but 7% worse energy
  • Less flexible and less efficient use of on-chip SRAM buffers
• TANGRAM: 2x over the base tiled design, and outperforms the monolithic design

[Figure: normalized time and energy for Monolithic, Base Tiled, and TANGRAM across AlexNet, VGGNet, GoogLeNet, ResNet, MLP-M, MLP-L, LSTM-M, and LSTM-L.]

SLIDE 18

Intra- vs. Inter-Layer Optimizations

• Intra-layer: buffer sharing
  • AlexNet: fits large fmaps on-chip
  • MLP-L: enables weight pinning
• Inter-layer: ALLO + pipelining of complex DAGs
  • Helps AlexNet, GoogLeNet & LSTM-M most
  • Linear NNs benefit less

[Figure: normalized energy on AlexNet, GoogLeNet, MLP-L, and LSTM-M for TANGRAM, TANGRAM without the intra-layer optimization, and TANGRAM without the inter-layer optimization; one clipped bar reaches 4.59.]

SLIDE 19

Summary

• Efficiently scales NN acceleration
  • Coarse-grained parallel dataflow on tiled architectures
  • Optimized tiled architectures outperform monolithic engines
• TANGRAM: dataflow optimizations
  • Intra-layer buffer sharing
  • Inter-layer pipelining with fine-grained data forwarding
  • Pipelining of complex NN DAGs
• Dataflow scheduling tool open-sourced
  • https://github.com/stanford-mast/nn_dataflow


Thank you!