TANGRAM: Optimized Coarse-Grained Dataflow for Scalable NN Accelerators
Mingyu Gao, Xuan Yang, Jing Pu, Mark Horowitz, Christos Kozyrakis
Stanford University Tsinghua University Google
ASPLOS – April 2019
Neural Networks (NNs)
• Unprecedented accuracy for challenging applications
• Inference: layer-wise processing on directed acyclic graphs (DAGs)
[Figure: three NN structures — a Convolutional NN (Conv → Conv → Conv → FC → FC, din → dout), an LSTM Cell (I-Gate, F-Gate, O-Gate with elementwise ×/+ and tanh over ct−1, ht−1, xt), and an Inception Module (parallel 1×1, 3×3, 5×5 Conv and 3×3 Pool branches, din → dout)]
• Domain-specific processing engine
• DianNao/Cambricon, Google TPU, Eyeriss, Cnvlutin, EIE, …
[Figure: NN Processing Engine — a 4×4 array of Processing Elements (ALU + register file) fed by a Global Buffer]
• Use more PEs & more on-chip buffers
• Monolithic engine
  ✗ Low resource utilization
  ✗ Long array buses
  ✗ Far from SRAM
• Tiled architecture (the focus of our work)
  ✓ Mostly local data transfers
  ✓ Easy to scale up/down
  ? Dataflow scheduling
[Figure: a monolithic array with one global buffer vs. a tiled architecture of GBuf + Array tiles connected to Mem 0–3]
• Intra-layer parallelism: Buffer Sharing Dataflow
  → higher energy efficiency
  → smaller buffer area
• Inter-layer pipelining: fine-grained data forwarding & pipelining of complex DAGs
  → higher throughput
  → smaller buffer area
[Figure: four tiles, each pairing a Global Buffer (GBuf) with a PE Array]
• Inefficient buffer use for shared data
  ✗ Replicated buffered data (area)
  ✗ Data reuse limited to within each tile (energy)
• ALL parallelization schemes share some data!
[Figure: Ofmaps (Nb × No) = Ifmaps (Nb × Ni) × Weights (Ni × No)]
foreach b in batch Nb:
  foreach ifmap i in Ni:
    foreach ofmap o in No:
      O[b][o] += I[b][i] * W[o][i]  // 2D conv

[Figure: with batch and ofmap parallelism, four engines buffer the distinct ofmaps O[0][0], O[0][1], O[1][0], O[1][1], but replicate the ifmaps I[0][0:1], I[1][0:1] and weights W[0][0:1], W[1][0:1]]
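To make the loop nest concrete, here is a minimal NumPy sketch with small made-up sizes; `conv2d_valid` is a local helper written for this sketch, not part of TANGRAM:

```python
import numpy as np

def conv2d_valid(x, w):
    """Plain 2D cross-correlation ('valid' padding), enough for this sketch."""
    K = w.shape[0]
    out = np.zeros((x.shape[0] - K + 1, x.shape[1] - K + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(x[r:r + K, c:c + K] * w)
    return out

Nb, Ni, No, H, W_, K = 2, 2, 2, 8, 8, 3        # illustrative sizes
rng = np.random.default_rng(0)
I = rng.random((Nb, Ni, H, W_))                # ifmaps
W = rng.random((No, Ni, K, K))                 # weights
O = np.zeros((Nb, No, H - K + 1, W_ - K + 1))  # ofmaps

# The slide's triple loop nest: every (b, o) ofmap accumulates over ALL
# Ni ifmaps, so ifmaps and weights are shared by engines parallelized
# over the batch (b) and ofmap (o) dimensions.
for b in range(Nb):
    for i in range(Ni):
        for o in range(No):
            O[b][o] += conv2d_valid(I[b][i], W[o][i])
```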
• Skew the computation order of the engines
• Rotate buffered data between engines
[Figure: after skewing, each engine buffers only one ifmap chunk and one weight chunk at a time (e.g. I[0][0] with W[0][0]) and passes them to a neighbor each step, eliminating replication]
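A toy software model of the skew-and-rotate idea; the engine count, chunk values, and the scalar multiply standing in for a 2D convolution are all illustrative:

```python
# Toy model of Buffer Sharing: Ne engines are parallelized over ofmaps.
# Each engine buffers only ONE shared ifmap chunk at a time; after every
# step the chunks rotate to the next engine, so nothing is replicated.
Ne = 4                                  # engines (illustrative)
Ni = 4                                  # shared ifmap chunks, one per engine
ifmaps = [float(i + 1) for i in range(Ni)]
weights = [[(o + 1) * (i + 1) for i in range(Ni)] for o in range(Ne)]
partial = [0.0] * Ne                    # one ofmap accumulator per engine

held = list(range(Ne))                  # skewed start: engine e buffers chunk e
for step in range(Ni):
    for e in range(Ne):
        i = held[e]
        partial[e] += ifmaps[i] * weights[e][i]  # multiply stands in for 2D conv
    held = [(h + 1) % Ni for h in held]          # rotate chunks between engines

# After Ni steps every engine has seen every chunk exactly once, so the
# result matches the unrotated computation.
expected = [sum(ifmaps[i] * weights[o][i] for i in range(Ni)) for o in range(Ne)]
assert partial == expected
```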
• Unify distributed buffers into one ideal large buffer
• Formalized as loop transformations
• Easy to implement
• Supports all parallelization schemes (including hybrid)
• Pros: avoids off-chip accesses for intermediate data
• Cons: utilizes resources less efficiently
[Figure: Layers 1–4 mapped spatially across tile groups and pipelined]
• Forward each subset of data to the next layer as soon as it is ready
• Requires matched access patterns between adjacent layers
The loop nest has no cross-iteration dependencies, so its loops can be freely reordered:

foreach b in batch Nb:
  foreach ifmap i in Ni:
    foreach ofmap o in No:
      O[b][o] += I[b][i] * W[o][i]  // 2D conv

With the ofmap loop outermost, a layer finishes its ofmaps one at a time:

foreach ofmap o in No:
  foreach ifmap i in Ni:
    // 2D conv

With the ifmap loop outermost, a layer consumes its ifmaps one at a time:

foreach ifmap i in Ni:
  foreach ofmap o in No:
    // 2D conv

Pairing an ofmap-major producer with an ifmap-major consumer matches the access patterns of adjacent layers, so each fmap can be forwarded as soon as it is finished.
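As a software analogy (not the hardware mechanism itself), the matched loop orders behave like a generator pipeline in which each fmap is forwarded the moment it is finished; all names and values here are illustrative:

```python
def producer(num_ofmaps):
    """Ofmap-major layer: finishes and forwards one ofmap at a time."""
    for o in range(num_ofmaps):
        fmap = [o * 10 + k for k in range(4)]  # stand-in for a computed ofmap
        yield fmap                             # forwarded as soon as it is ready

def consumer(stream):
    """Ifmap-major layer: starts on each fmap the moment it arrives."""
    return [sum(fmap) for fmap in stream]      # stand-in for the next layer's work

results = consumer(producer(3))
```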
[Figure: pipeline timelines for Layers 1–3. Unoptimized, each layer must buffer and wait for ALL fmaps of its predecessor before starting; optimized with alternating loop orders, each layer buffers and waits for only ONE fmap. The benefits apply to half of all layers.]
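The buffering difference can be sketched with a toy timeline model (a simplification, not the actual scheduler): when adjacent loop orders match, a layer boundary holds one fmap at a time; when they do not, the consumer cannot start until the producer has emitted everything:

```python
def boundary_high_water(num_fmaps, matched):
    """Max fmaps ever buffered at one layer boundary in a toy timeline."""
    buffer, high = [], 0
    for f in range(num_fmaps):
        buffer.append(f)                 # producer finishes fmap f
        high = max(high, len(buffer))
        if matched:
            buffer.pop(0)                # consumer drains it immediately

    return high

assert boundary_high_water(64, matched=True) == 1    # ALLO: buffer ONE fmap
assert boundary_high_water(64, matched=False) == 64  # mismatch: buffer ALL fmaps
```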
• A dataflow tool explores pipeline schedules across multiple layers
• Subject to design rules imposed by data dependency constraints
[Figure: pipelined segments, with tile regions R0–R6 assigned to the layers of a ResNet module (1×1/3×3/1×1 Convs), an Inception Module (parallel 1×1, 3×3, 5×5 Convs and 3×3 Pool), and an LSTM Cell (I/F/O gates, FC, tanh)]
• State-of-the-art NNs
• Hardware
• Base tiled vs. monolithic: 3.6× performance, 7% worse energy
• TANGRAM: 2× over base tiled; outperforms monolithic
[Charts: normalized time and energy for AlexNet, VGGNet, GoogLeNet, ResNet, MLPs, and LSTMs, comparing Monolithic, Base Tiled, and TANGRAM]
• Intra-layer: Buffer Sharing
• Inter-layer: ALLO + complex DAGs
[Chart: energy of TANGRAM vs. TANGRAM w/o Intra and w/o Inter optimizations for AlexNet, GoogLeNet, MLP-L, and LSTM-M; one bar reaches 4.59×]
• Efficiently scale NN acceleration
• TANGRAM: dataflow optimizations
• Dataflow scheduling tool open-sourced