SLIDE 1

TANGRAM: Optimized Coarse-Grained Dataflow for Scalable NN Accelerators

Mingyu Gao, Xuan Yang, Jing Pu, Mark Horowitz, Christos Kozyrakis

Stanford University Tsinghua University Google

ASPLOS – April 2019

SLIDE 2

Neural Networks (NNs)

• Unprecedented accuracy for challenging applications
  • Fully-connected (MLPs), Convolutional (CNNs), Recurrent (LSTMs) NNs
• Inference: layer-wise processing on directed acyclic graphs (DAGs)


[Figure: three example NN structures. Convolutional NN: a chain Conv → Conv → Conv → FC → FC from din to dout. LSTM Cell: I-Gate, F-Gate, and O-Gate (FC layers) combine x_t, h_{t-1}, and c_{t-1} through elementwise ×, +, and tanh into c_t and h_t. Inception Module: parallel 1×1 Conv, 3×3 Conv, 5×5 Conv, and 3×3 Pool branches (with 1×1 Conv reductions) between din and dout.]

SLIDE 3

NN Accelerators

• Domain-specific processing engine
  • An array of specialized processing elements (PEs)
  • On-chip register files and SRAMs
  • ~100x performance and energy efficiency over general-purpose processors
• DianNao/Cambricon, Google TPU, Eyeriss, Cnvlutin, EIE, …


[Figure: NN processing engine — a 4×4 array of PEs, each containing an ALU and a register file, backed by a global buffer.]

SLIDE 4

Scaling NN Performance

• Use more PEs & more on-chip buffers
• Monolithic engine
  ✗ Low resource utilization
  ✗ Long array buses
  ✗ Far from SRAM
• Tiled architecture (the focus of our work)
  ✓ Mostly local data transfers
  ✓ Easy to scale up/down
  ? Dataflow scheduling (the open question this work addresses)


[Figure: a monolithic PE array with one global buffer and two memory channels (Mem 0–1), versus a tiled architecture: a grid of tiles, each pairing a GBuf with a PE array, surrounded by four memory channels (Mem 0–3).]

SLIDE 5

TANGRAM: Optimizing Coarse-Grained Dataflow

• Intra-layer parallelism: buffer sharing dataflow
  • Reuse data across engines → higher energy efficiency
  • Avoid on-chip data duplication → smaller buffer area
• Inter-layer pipelining: fine-grained data forwarding & pipelining of complex DAGs
  • Reduce pipeline stalls → higher throughput
  • Temporarily store only the forwarded data → smaller buffer area


SLIDE 6

Intra-Layer Parallelism


SLIDE 7

Parallelizing a Single Layer

• Inefficient buffer use for shared data
  ✗ Replicated buffered data (area)
  ✗ Data reuse limited to within each tile (energy)
• ALL parallelization schemes share some data!

[Figure: Ifmaps (Nb × Ni) * Weights (Ni × No) = Ofmaps (Nb × No).]

foreach b in batch Nb
  foreach ifmap i in Ni
    foreach ofmap o in No
      // 2D conv
      O[b][o] += I[b][i] * W[o][i]

[Figure: four tiles compute O[0][0], O[0][1], O[1][0], O[1][1]; the ifmap blocks I[0][0:1], I[1][0:1] and weight blocks W[0][0:1], W[1][0:1] are each buffered in two tiles.]
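To make the replication concrete, here is a minimal Python sketch (the 2×2 tile grid and the counting are illustrative assumptions, not the paper's model): assigning one output block O[b][o] per tile forces every ifmap block and every weight block to be buffered by two tiles.

# Hypothetical 2x2 example: one tile per output block O[b][o].
Nb, No = 2, 2
tiles = [(b, o) for b in range(Nb) for o in range(No)]

ifmap_copies = {b: 0 for b in range(Nb)}    # copies of I[b][0:Ni]
weight_copies = {o: 0 for o in range(No)}   # copies of W[o][0:Ni]
for b, o in tiles:
    ifmap_copies[b] += 1    # tile (b, o) must buffer all ifmaps of batch b
    weight_copies[o] += 1   # ... and all weights of ofmap o

print(ifmap_copies)    # {0: 2, 1: 2}: each ifmap block held by 2 tiles
print(weight_copies)   # {0: 2, 1: 2}: each weight block held by 2 tiles

Choosing a different partitioning only shifts which operand is shared; it cannot remove the sharing, which is the slide's point.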

SLIDE 8

Optimizing Dataflow for Shared Data

q Skew computation order of engines

  • All engines start in parallel à high throughput

q Rotate buffered data between engines

  • Fully reuse shared data à low energy
  • No on-chip data duplication à low area


[Figure: the same four tiles computing O[0][0], O[0][1], O[1][0], O[1][1], but each now buffers a single I block and a single W block at a time; the blocks rotate between tiles across time steps instead of being replicated.]
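A minimal simulation of the skew-and-rotate schedule (the engine count n and the chunk indexing are illustrative assumptions, not the paper's formulation): at step t, engine x works on chunk (x + t) mod n, so every engine is busy from step 0 and each chunk visits every engine exactly once.

# Hypothetical ring of n engines rotating n shared data chunks.
n = 4
for t in range(n):
    # (engine, chunk) pairs at this step; every engine is busy.
    print(f"step {t}:", [(x, (x + t) % n) for x in range(n)])
# Across the n steps, each chunk is buffered by each engine exactly once:
# at step t+1, engine x holds the chunk that engine (x + 1) % n held at
# step t, so data rotates one hop around the ring with no duplication.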

SLIDE 9

Buffer Sharing Dataflow

• Unify the distributed buffers into one ideal large buffer
  • Efficiently store and reuse data
• Formalized as loop transformations
  • A mapping from (tile coordinate x, time step t) to the index i of the data to buffer (sketched below)
  • See the paper for the detailed math
• Easy to implement
  • The buffer controller fetches from memory or from other tiles
  • No changes to the dataflow within a tile
• Supports all parallelization schemes (including hybrids)
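In the simplest case of n engines on a ring with uniform rotation (a sketch consistent with the simulation above, not the paper's general formulation), the mapping is

  i(x, t) = (x + t) mod n

so tile x fetches its chunk from memory only at t = 0 and thereafter receives each chunk from a neighboring tile's buffer.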


SLIDE 10

Inter-Layer Pipelining


SLIDE 11

Pipelining Multiple Layers

• Pros: avoid off-chip accesses for intermediate data
  • Save DRAM bandwidth and energy
• Cons: uses resources less efficiently
  • Long delays: pipeline filling/draining due to inter-layer data dependencies
  • Large SRAM buffers: must store the entire intermediate data

[Figure: multiple layers mapped onto different regions of the tile array and pipelined.]

SLIDE 12

Fine-Grained Data Forwarding

• Forward each subset of data to the next layer as soon as it is ready
  • Reduces pipeline stalls: the next layer starts earlier
  • Reduces buffer capacity: only the subset currently being forwarded is stored
• Requires matched access patterns between adjacent layers

foreach b in batch Nb
  foreach ifmap i in Ni
    foreach ofmap o in No
      // 2D conv
      O[b][o] += I[b][i] * W[o][i]

The batch loop carries no inter-layer dependencies, so different images are trivially pipelined.

[Figure: timelines of ifmaps read and ofmaps produced over time under each loop order.]

With the ofmap-outer order, a layer completes its ofmaps one at a time:

foreach ofmap o in No
  foreach ifmap i in Ni
    // 2D conv

With the ifmap-outer order, a layer finishes reading each ifmap one at a time:

foreach ifmap i in Ni
  foreach ofmap o in No
    // 2D conv
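A back-of-the-envelope timing sketch (the one-time-unit-per-fmap cost model is an illustrative assumption, not the paper's simulator) shows why matched orders matter: the consumer can start after one forwarded fmap instead of waiting for the entire intermediate layer.

# Producer emits n_fmaps intermediate fmaps, one per time unit; the
# consumer may start once `need` of its ifmaps are ready, then also
# takes one time unit per fmap.
def consumer_finish(n_fmaps, need):
    first_ready = need            # fmap f is ready at time f + 1
    return first_ready + n_fmaps  # start, then process n_fmaps fmaps

print(consumer_finish(8, need=1))  # matched orders (forwarding): 9
print(consumer_finish(8, need=8))  # mismatched orders (full buffer): 16

The buffer requirement shrinks the same way: one fmap in flight versus all n_fmaps of intermediate data.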

SLIDE 13

Alternate Layer Loop Ordering (ALLO)

[Figure: pipeline timelines for Layers 1–3 over time. Unoptimized: each layer delays its successor by ALL fmaps and buffers ALL fmaps. Optimized with ALLO: the delay and the buffering shrink to ONE fmap. The benefits apply to half of all layers, since the loop order must alternate between adjacent layers.]

SLIDE 14

Layer Pipelining for Complex NN DAGs

• A dataflow tool explores pipeline schedules across multiple layers
• Subject to design rules imposed by data dependency constraints
  • E.g., no multiple predecessor layers on-chip (see the sketch below)
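As a sketch of such a design-rule check (this is one reading of the example rule, with hypothetical layer names, not the tool's actual code), a candidate on-chip segment is rejected if any layer in it has two or more predecessors also in the segment:

# Reject a segment in which some layer has >= 2 on-chip predecessors.
def segment_ok(segment, preds):
    """segment: set of layer names on-chip; preds: layer -> predecessor set."""
    return all(len(preds.get(layer, set()) & segment) <= 1 for layer in segment)

# Hypothetical Inception-style DAG: two branches joined by a concat.
preds = {"branch1": {"input"}, "branch2": {"input"},
         "concat": {"branch1", "branch2"}}
print(segment_ok({"input", "branch1", "branch2"}, preds))   # True
print(segment_ok({"branch1", "branch2", "concat"}, preds))  # False: concat
# has two predecessor layers inside the same on-chip segment.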

[Figure: three complex DAGs annotated with pipeline regions R0–R6: a ResNet module (1×1, 3×3, 1×1 Convs plus a shortcut add), an Inception module (parallel 1×1/3×3/5×5 Convs and a 3×3 Pool), and an LSTM cell (I/F/O gates, FC, elementwise ×, +, tanh).]

SLIDE 15

Evaluation Results


SLIDE 16

Modeling Methodology

• State-of-the-art NNs
  • CNNs: AlexNet, VGGNet, GoogLeNet, ResNet
  • MLPs & LSTMs: medium and large scales
• Hardware
  • Inference engine per tile: Eyeriss [ISCA'16], 8 × 8 PEs, 32 kB buffer, 500 MHz
  • Off-chip memory: LPDDR3-1600, 4 channels
  • Overall chip: 16 × 16 tiles = 16384 PEs + 8 MB SRAM, 90 mm² at 28 nm


SLIDE 17

Overall Comparison

• Base tiled vs. monolithic: 3.6x performance, but 7% worse energy
  • Less flexible and less efficient use of on-chip SRAM buffers
• TANGRAM: 2x over the base tiled design, and outperforms the monolithic design

[Figure: normalized time and energy for Monolithic, Base Tiled, and TANGRAM across AlexNet, VGGNet, GoogLeNet, ResNet, MLP-M, MLP-L, LSTM-M, and LSTM-L.]

SLIDE 18

Intra- vs. Inter-Layer Optimizations

• Intra-layer: buffer sharing
  • AlexNet: fits large fmaps on-chip
  • MLP-L: enables weight pinning
• Inter-layer: ALLO + pipelining of complex DAGs
  • Helps AlexNet, GoogLeNet & LSTM-M most
  • Linear NNs benefit less

[Figure: normalized energy on AlexNet, GoogLeNet, MLP-L, and LSTM-M for TANGRAM, TANGRAM without the intra-layer optimization, and TANGRAM without the inter-layer optimization; one clipped bar reaches 4.59.]

SLIDE 19

Summary

• Efficiently scales NN acceleration
  • Coarse-grained parallel dataflow on tiled architectures
  • Optimized tiled architectures outperform monolithic engines
• TANGRAM: dataflow optimizations
  • Intra-layer buffer sharing
  • Inter-layer pipelining with fine-grained data forwarding
  • Pipelining of complex NN DAGs
• Dataflow scheduling tool open-sourced
  • https://github.com/stanford-mast/nn_dataflow


Thank you!