  1. TANGRAM: Optimized Coarse-Grained Dataflow for Scalable NN Accelerators
     Mingyu Gao, Xuan Yang, Jing Pu, Mark Horowitz, Christos Kozyrakis
     Stanford University / Tsinghua University / Google
     ASPLOS, April 2019

  2. Neural Networks (NNs)
     • Unprecedented accuracy for challenging applications
       - Fully-connected (MLPs), convolutional (CNNs), and recurrent (LSTMs) NNs
     • Inference: layer-wise processing over directed acyclic graphs (DAGs)
     [Figure: example layer DAGs: a convolutional NN, an LSTM cell with input/forget/output gates, and an Inception module with parallel 1×1, 3×3, and 5×5 conv branches.]
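Layer-wise DAG processing simply means each layer runs once all of its predecessors have produced their outputs. A minimal sketch (the graph structure and layer names below are illustrative, not from the talk):

    # Layer-wise inference over a DAG of layers: run each layer once all of
    # its predecessors are done. Graph and layer names are illustrative.
    from graphlib import TopologicalSorter

    deps = {                      # layer -> set of predecessor layers
        "conv1": set(),
        "conv3x3": {"conv1"},
        "conv5x5": {"conv1"},
        "concat": {"conv3x3", "conv5x5"},
    }

    def run_layer(name, inputs):
        # Stand-in for the real conv/FC/pool computation.
        return f"out({name})"

    outputs = {}
    for layer in TopologicalSorter(deps).static_order():
        outputs[layer] = run_layer(layer, [outputs[p] for p in deps[layer]])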

  3. NN Accelerators
     • Domain-specific processing engines
       - An array of specialized processing elements (PEs)
       - On-chip register files and SRAMs
       - 100x performance and energy efficiency
     • DianNao/Cambricon, Google TPU, Eyeriss, Cnvlutin, EIE, …
     [Figure: an NN processing engine: a 4×4 PE array with a global buffer; each PE contains an ALU and a register file.]

  4. Scaling NN Performance
     • Use more PEs & more on-chip buffers
     • Monolithic engine of PEs
       ✗ Low resource utilization
       ✗ Long array buses
       ✗ Far from SRAM
     • Tiled architecture (focus of our work)
       ✓ Mostly local data transfers
       ✓ Easy to scale up/down
       ? Dataflow scheduling
     [Figure: a monolithic PE array with one global buffer vs. a tiled architecture of many small arrays, each with its own GBuf, attached to memory channels Mem 0-3.]

  5. TANGRAM: Optimizing Coarse-Grained Dataflow
     • Intra-layer parallelism: Buffer Sharing Dataflow
       - Reuse data across engines → higher energy efficiency
       - Avoid on-chip data duplication → smaller buffer area
     • Inter-layer pipelining: fine-grained data forwarding & pipelining of complex DAGs
       - Reduce pipeline stalls → higher throughput
       - Temporarily store forwarded data → smaller buffer area
     [Figure: a tiled accelerator (GBuf + Array tiles) on which these dataflow optimizations apply.]

  6. Intra-Layer Parallelism

  7. Parallelizing a Single Layer

     foreach b in batch Nb:
       foreach ifmap i in Ni:
         foreach ofmap o in No:
           O[b][o] += I[b][i] * W[o][i]   // 2D conv

     [Figure: Ofmaps = Ifmaps * Weights, partitioned across a 2×2 engine array; shared slices such as I[b][0:1] and W[o][0:1] are buffered in multiple tiles.]

     • Inefficient buffer use for shared data
       ✗ Replicated buffered data (area)
       ✗ Data reuse limited to within each tile (energy)
     • ALL parallelization schemes share some data!
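A runnable rendering of the loop nest above (NumPy with SciPy's correlate2d for the per-pair 2D convolution; the array sizes are illustrative assumptions):

    import numpy as np
    from scipy.signal import correlate2d   # 2D conv for one (ifmap, ofmap) pair

    Nb, Ni, No, H, R = 2, 3, 4, 8, 3              # illustrative sizes
    I = np.random.rand(Nb, Ni, H, H)              # ifmaps
    W = np.random.rand(No, Ni, R, R)              # weights
    O = np.zeros((Nb, No, H - R + 1, H - R + 1))  # ofmaps ('valid' conv output)

    for b in range(Nb):             # foreach b in batch Nb
        for i in range(Ni):         # foreach ifmap i in Ni
            for o in range(No):     # foreach ofmap o in No
                O[b][o] += correlate2d(I[b][i], W[o][i], mode="valid")

Any way of splitting these loops across tiles leaves some slice of I or W needed by more than one tile, which is exactly the sharing the next slide exploits.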

  8. Optimizing Dataflow for Shared Data
     [Figure: four engines in a ring, each buffering one slice of I and W; the slices rotate between neighbors each step, so every O[x][y] eventually sees every shared slice.]
     • Skew the computation order of the engines
       - All engines start in parallel → high throughput
     • Rotate buffered data between engines
       - Fully reuse shared data → low energy
       - No on-chip data duplication → low area
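A minimal sketch of the rotation (a hypothetical ring of N tiles with the simple schedule chunk = (x + t) mod N; the paper's schedules are more general): each tile computes with the chunk it currently buffers, then passes it to its neighbor, so every chunk visits every tile without ever being duplicated.

    N = 4                       # tiles in a ring (illustrative)

    for t in range(N):          # N rotation steps so every tile sees every chunk
        for x in range(N):
            k = (x + t) % N     # chunk buffered by tile x at step t (skewed start)
            print(f"t={t}: tile {x} computes with chunk {k}")
        # Each tile then forwards its chunk to the next tile in the ring,
        # which is exactly what the (x + t) mod N indexing expresses.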

  9. Buffer Sharing Dataflow
     • Unify the distributed buffers into one ideal large buffer
       - Efficiently store and reuse data
     • Formalized as loop transformations
       - (tile coordinate x, time step t) → index i of the data to buffer
       - See the paper for the detailed math
     • Easy to implement
       - The buffer controller fetches from memory or from other tiles
       - No changes to the dataflow within a tile
     • Supports all parallelization schemes (including hybrid)
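A quick check of the "ideal large buffer" property under the illustrative mapping i(x, t) = (x + t) mod N from the sketch above (again an assumption, not the paper's general transformation): at every step the N per-tile buffers jointly hold N distinct chunks, so together they behave as one buffer of N chunks with no duplication.

    N = 8

    def buffered_index(x, t):
        return (x + t) % N      # which chunk tile x buffers at step t

    for t in range(N):
        held = {buffered_index(x, t) for x in range(N)}
        assert held == set(range(N))   # all chunks present, none duplicated
    print("each step: N distinct chunks across N tiles, i.e., one unified buffer")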

  10. Inter-Layer Pipelining

  11. Pipelining Multiple Layers
     [Figure: the tile array divided into regions, with Layers 1-4 each assigned a region and executing as pipeline stages.]
     • Pros: avoid off-chip accesses for intermediate data
       - Save DRAM bandwidth and energy
     • Cons: use resources less efficiently
       - Long delays: pipeline filling/draining due to inter-layer data dependencies
       - Large SRAM buffers: must store the entire intermediate data
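Back-of-the-envelope arithmetic for the buffer cost (all sizes assumed for illustration): buffering the entire intermediate data between two stages dwarfs buffering a single fmap, which motivates the fine-grained forwarding on the next slides.

    fmap_h = fmap_w = 56       # assumed fmap dimensions
    n_fmaps = 256              # assumed number of intermediate fmaps
    bytes_per = 2              # 16-bit values

    all_fmaps = fmap_h * fmap_w * n_fmaps * bytes_per   # whole layer output
    one_fmap = fmap_h * fmap_w * bytes_per              # one fmap at a time
    print(f"all fmaps: {all_fmaps / 1024:.0f} KB, one fmap: {one_fmap / 1024:.1f} KB")
    # -> all fmaps: 1568 KB vs. one fmap: 6.1 KB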

  12. Fine-Grained Data Forwarding
     • Forward each subset of data to the next layer as soon as it is ready
       - Reduce pipeline stalls: the next layer starts earlier
       - Reduce buffer capacity: store only the subset currently being forwarded
     • Requires matched access patterns between adjacent layers

     foreach b in batch Nb:              // no dependencies across b; trivially pipelined
       // Producer layer, ofmap-major: emits ofmaps 0, 1, 2, … as each completes
       foreach ofmap o in No:
         foreach ifmap i in Ni:
           O[b][o] += I[b][i] * W[o][i]  // 2D conv
       // Consumer layer, ifmap-major: reads ifmaps 0, 1, 2, … in arrival order
       foreach ifmap i in Ni:
         foreach ofmap o in No:
           …                             // 2D conv

     [Figure: producer ofmap and consumer ifmap timelines; pipelining works when the producer emits fmaps in the order the consumer reads them.]
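A minimal sketch of forwarding (illustrative: Python threads and a bounded queue stand in for on-chip forwarding between tile regions): the consumer starts on the first fmap immediately, and the buffer only ever holds the one fmap in flight.

    import queue
    import threading

    No = 4                            # ofmaps produced by the producer layer
    fifo = queue.Queue(maxsize=1)     # buffer for ONE fmap, not all of them

    def producer():
        for o in range(No):
            fmap = f"fmap{o}"         # stand-in for real computation
            fifo.put(fmap)            # forward as soon as it is ready
        fifo.put(None)                # end-of-stream marker

    t = threading.Thread(target=producer)
    t.start()
    while (fmap := fifo.get()) is not None:
        print("consumer layer processes", fmap)   # starts after fmap0, not all No
    t.join()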

  13. Alternate Layer Loop Ordering (ALLO)
     [Figure: fmap timelines for Layers 1-3, unoptimized vs. optimized. Unoptimized: each layer buffers ALL fmaps and delays the next layer by ALL fmaps. Optimized: adjacent layers alternate loop order, so a layer buffers only ONE fmap and delays the next layer by only ONE fmap. The benefits apply to half of all layers.]
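A schematic sketch of the alternation (illustrative, with scalar stand-ins for fmaps): the producer runs ofmap-major and hands over each completed fmap; the consumer runs ifmap-major, holding one incoming fmap at a time while accumulating partial sums for all of its own ofmaps.

    Ni = No = 3                       # illustrative fmap counts

    def producer_layer():             # ofmap-major: finish ofmap o, then move on
        for o in range(No):
            yield o                   # ofmap o of layer k is ifmap o of layer k+1

    partial = [0] * No                # consumer's partial ofmaps
    for i in producer_layer():        # ifmap-major: consume fmaps in arrival order
        for o in range(No):
            partial[o] += 1           # stand-in for: ofmap o += conv(ifmap i, W[o][i])
        # ifmap i can now be discarded; only ONE forwarded fmap was buffered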

  14. Layer Pipelining for Complex NN DAGs
     • A dataflow tool explores pipeline schedules across multiple layers
     • Subject to design rules imposed by data dependency constraints
       - E.g., no multiple predecessor layers on-chip
     [Figure: an Inception module, a ResNet module, and an LSTM cell with their layers grouped into on-chip pipeline segments R0-R6 that respect the design rules.]
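A sketch of how such a rule check might look in a scheduling tool (hypothetical dict-based DAG and rule encoding; the paper's tool applies its own rule set): within one on-chip segment, a layer may have at most one of its predecessors in the same segment.

    deps = {                              # illustrative DAG: layer -> predecessors
        "a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"],
    }

    def segment_ok(segment):
        # At most one on-chip predecessor per layer within the segment.
        return all(sum(p in segment for p in deps[l]) <= 1 for l in segment)

    print(segment_ok({"a", "b", "d"}))    # True: d has only b in the segment
    print(segment_ok({"b", "c", "d"}))    # False: d has both b and c in it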

  15. Evaluation Results

  16. Modeling Methodology
     • State-of-the-art NNs
       - CNNs: AlexNet, VGGNet, GoogLeNet, ResNet
       - MLPs & LSTMs: medium and large scales
     • Hardware
       - Inference engine: Eyeriss [ISCA'16], 8 × 8 PEs, 32 kB buffer, 500 MHz
       - Off-chip memory: LPDDR3-1600, 4 channels
       - Overall chip: 16 × 16 tiles
         • 16,384 PEs + 8 MB SRAM
         • 90 mm² at 28 nm

  17. Overall Comparison
     [Figure: normalized Energy and Time for AlexNet, VGGNet, GoogLeNet, ResNet, MLP-M/L, and LSTM-M/L, comparing Monolithic, Base Tiled, and TANGRAM.]
     • Base tiled vs. monolithic: 3.6x performance, 7% worse energy
       - Less flexible and less efficient use of on-chip SRAM buffers
     • TANGRAM: 2x over base tiled, and outperforms monolithic

  18. Intra- vs. Inter-Layer Optimizations
     [Figure: normalized energy for AlexNet, GoogLeNet, MLP-L, and LSTM-M, comparing TANGRAM against variants without the intra-layer (w/o Intra) and without the inter-layer (w/o Inter) optimizations; one clipped bar reaches 4.59.]
     • Intra-layer: buffer sharing
       - AlexNet: fits large fmaps on-chip
       - MLP-L: enables weight pinning
     • Inter-layer: ALLO + complex DAGs
       - Helps AlexNet, GoogLeNet & LSTM-M
       - Linear NNs benefit less

  19. Summary
     • Efficiently scale NN acceleration
       - Coarse-grained parallel dataflow on tiled architectures
       - Optimized tiled architectures outperform monolithic engines
     • TANGRAM: dataflow optimizations
       - Intra-layer buffer sharing
       - Inter-layer pipelining with fine-grained data forwarding
       - Pipelining of complex NN DAGs
     • Dataflow scheduling tool open sourced
       - https://github.com/stanford-mast/nn_dataflow
     Thank you!
