TETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory
Mingyu Gao, Jing Pu, Xuan Yang, Mark Horowitz, Christos Kozyrakis
Stanford University ASPLOS – April 2017
Neural Networks (NNs)
Unprecedented accuracy for challenging applications
From a system perspective: compute and memory intensive
[Figure: NNs deliver classification, recognition, control, prediction, and optimization (e.g., an image labeled "Dog"), and run on platforms ranging from multi-cores, GPUs, FPGAs, and ASICs to clusters.]
[Figure: a convolutional layer: Ni ifmaps are convolved with Ni × No filters and accumulated into No ofmaps.]
    foreach b in batch Nb
      foreach ifmap u in Ni
        foreach ofmap v in No
          // 2D convolution
          O(v,b) += I(u,b) * W(u,v) + B(v)

    foreach b in batch Nb
      foreach neuron x in Nx
        foreach neuron y in Ny
          // Matrix multiply
          O(y,b) += I(x,b) * W(x,y) + B(y)
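As a sanity check, here is a minimal runnable NumPy version of the convolutional loop nest above; all shapes (batch, fmap, and filter sizes) are illustrative assumptions, not values from the talk.

    # Minimal runnable NumPy version of the 2D-convolution loop nest.
    # All shapes below are illustrative assumptions.
    import numpy as np

    Nb, Ni, No = 2, 3, 4                 # batch, #ifmaps, #ofmaps
    H, W, K = 8, 8, 3                    # fmap height/width, filter size
    rng = np.random.default_rng(0)
    I  = rng.random((Nb, Ni, H, W))      # ifmaps
    Wt = rng.random((Ni, No, K, K))      # filters
    B  = rng.random(No)                  # biases
    Ho, Wo = H - K + 1, W - K + 1        # "valid" output size
    O  = np.zeros((Nb, No, Ho, Wo))      # ofmaps

    for b in range(Nb):                  # foreach b in batch Nb
        for v in range(No):              # foreach ofmap v in No
            for u in range(Ni):          # foreach ifmap u in Ni
                for y in range(Ho):      # 2D convolution: O += I * W
                    for x in range(Wo):
                        O[b, v, y, x] += np.sum(I[b, u, y:y+K, x:x+K] * Wt[u, v])
            O[b, v] += B[v]              # add the bias once per ofmap

The fully connected nest is the same pattern with the 2D convolution replaced by a scalar multiply-accumulate.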
Spatial architectures of PEs
[Figure: spatial architecture: a 2D array of processing elements (PEs), each with an ALU and a register file, fed by a global buffer backed by main memory.]
Large memory footprints and bandwidth requirements limit scalability for future NNs
Large on-chip buffers: area inefficiency
Multiple DRAM channels: energy inefficiency
State-of-the-art NN accelerator with 400 PEs
[Figure: power (W) and peak DRAM bandwidth (GB/s) versus PE count from 100 to 400, with power broken down into PE/reg dynamic, buffer dynamic, DRAM dynamic, and total static.]
3D memory offers opportunities for NN acceleration; key questions include how to partition NN computations across it.
[Figure: Micron's Hybrid Memory Cube: DRAM dies stacked on a logic die, connected by TSVs and organized into vaults (channels) of banks.]
NN acceleration with 3D memory: high performance and low energy
Hardware architecture: alleviate bandwidth pressure
Software optimizations: optimize buffer use, efficient parallel processing
Associate one NN engine with each vault
NoC + routers for accesses to remote vaults
All vaults can process NN computations in parallel
[Figure: per-vault NN engine on the logic die: a PE array with a global buffer, memory controller, and router; the router serves accesses to the local vault and routes requests to remote vaults.]
Larger PE arrays with smaller SRAM buffers
[Figure: normalized runtime and energy breakdown (PE dynamic, reg/buf dynamic, DRAM dynamic, total static) across design points from 36 PEs with a 467 kB buffer to 224 PEs with a 72 kB buffer.]
196 PEs with a 133 kB buffer (a 1:1 PE-to-buffer area ratio) achieve better performance and energy
Move simple accumulation logic close to DRAM banks
[Figure: baseline: the PE array reads Y from memory, computes Y += W * X, and writes Y back (two ofmap transfers per update); with in-memory accumulation, the PE array sends only ΔY = W * X and the update Y += ΔY is applied next to the DRAM banks (one transfer per update).]
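A back-of-envelope sketch of the traffic saved; the spill count below is an arbitrary assumption, and the point is simply that each spilled partial-sum update costs one DRAM access instead of two.

    # Why in-memory accumulation halves ofmap DRAM traffic: each partial-sum
    # update that spills off-chip normally costs a read and a write of Y,
    # but only a single write of dY when the += happens at the DRAM bank.
    spills = 1_000_000                  # arbitrary number of spilled updates
    baseline = 2 * spills               # read Y, then write back Y += W * X
    in_memory = 1 * spills              # send dY = W * X; bank does Y += dY
    print(f"ofmap traffic ratio: {in_memory / baseline}")   # 0.5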
Dataflow scheduling is critical for maximizing on-chip data reuse to save energy
Mapping: execute the 2D convolutions on the PE array
Ordering: loop blocking and reordering
Small buffers limit reuse opportunities; bypass orderings (IW, OW, IO) stream two of the three data types directly between DRAM and the register files, reserving the global buffer for the third
[Figure: OW bypass ordering: ifmaps are held on-chip in the global buffer in chunks (chunk 0, 1, 2), while ofmaps and filters move between off-chip memory and the register files, bypassing the global buffer.]
Bypass ordering schedules are analytically derived, and are near-optimal compared to schedules found with exhaustive search.
$$
\min_{u_b,\,u_i}\; B_{\mathrm{DRAM}} = 2\,O_b O_o T_o\, u_i + O_b O_i T_i + O_o O_i T_w\, u_b
\quad\text{s.t.}\quad
\frac{O_b}{u_b}\cdot\frac{O_i}{u_i}\cdot T_i \le T_{\mathrm{buf}},\;
1 \le u_b \le O_b,\;
1 \le u_i \le O_i
$$
NN        Runtime gap (w.r.t. optimal)   Energy gap (w.r.t. optimal)
AlexNet   1.48%                          1.86%
ZFNet     1.55%                          1.83%
VGG16     0.16%                          0.20%
VGG19     0.13%                          0.16%
ResNet    2.91%                          0.78%
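For intuition, a brute-force sketch of the minimization above: it searches the two blocking factors u_b and u_i under the buffer-capacity constraint. TETRIS solves this analytically; the exhaustive search and all parameter values here are illustrative only.

    # Brute-force minimizer for the B_DRAM expression above. Shown only to
    # make the trade-off concrete; parameter values are illustrative.
    def min_bdram(Ob, Oi, Oo, Ti, To, Tw, Tbuf):
        best = None
        for ub in range(1, Ob + 1):
            for ui in range(1, Oi + 1):
                # the buffered chunk must fit in the global buffer
                if (Ob / ub) * (Oi / ui) * Ti > Tbuf:
                    continue
                b = 2 * Ob * Oo * To * ui + Ob * Oi * Ti + Oo * Oi * Tw * ub
                if best is None or b < best[0]:
                    best = (b, ub, ui)
        return best

    # Example: 16 batch blocks, 64 ifmaps, 128 ofmaps, 1 kB fmaps,
    # 9-word filters, 64 kB buffer (all assumed values)
    print(min_bdram(Ob=16, Oi=64, Oo=128, Ti=1024, To=1024, Tw=9, Tbuf=64 * 1024))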
Option 1: fmap partitioning
[Figure: fmap partitioning: each of vaults 0 to 3 holds one spatial tile of every fmap in layers i and i+1.]
Process NN computations in parallel in all vaults
Option 2: output partitioning
[Figure: output partitioning: the ofmaps of layers i and i+1 are divided among vaults 0 to 3, so each vault computes a different subset of ofmaps.]
Process NN computations in parallel in all vaults
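A toy sketch contrasting the two options across four vaults; the 2x2 spatial tiling and the fmap dimensions are illustrative assumptions.

    # Toy contrast of the two partitioning options across 4 vaults.
    import numpy as np

    No, H, W = 8, 16, 16                 # ofmaps per layer, fmap size (assumed)

    # Option 1: fmap partitioning -- every vault computes all No ofmaps,
    # but only one 8x8 spatial tile of each (a 2x2 grid of tiles).
    fmap_tiles = [(slice(r, r + 8), slice(c, c + 8))
                  for r in (0, 8) for c in (0, 8)]

    # Option 2: output partitioning -- every vault computes full fmaps,
    # but only No/4 of the ofmaps.
    ofmap_subsets = np.array_split(np.arange(No), 4)

    for vault in range(4):
        rows, cols = fmap_tiles[vault]
        print(f"vault {vault}: fmap tile rows {rows.start}:{rows.stop}, "
              f"cols {cols.start}:{cols.stop} | ofmaps {ofmap_subsets[vault].tolist()}")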
Combine fmap partitioning and output partitioning
Difficulty: a large space of per-layer partitioning choices
A greedy algorithm reduces the search time to linear in the number of layers
Bypass ordering is used to quickly estimate the total DRAM accesses of each candidate
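A sketch of such a greedy, layer-by-layer search; the layer list, option names, and toy cost function (standing in for the bypass-ordering DRAM-access estimate) are all illustrative assumptions.

    # Greedy layer-by-layer choice of partitioning: work is linear in #layers.
    def greedy_partition(layers, options, cost):
        schedule, prev = [], None
        for layer in layers:
            # pick the cheapest option for this layer given the previous choice
            best = min(options, key=lambda opt: cost(layer, opt, prev))
            schedule.append((layer, best))
            prev = best
        return schedule

    layers = ["conv1", "conv2", "conv3", "fc1"]     # illustrative network
    options = ["fmap", "output", "hybrid"]          # per-layer choices

    # Toy cost: a base term plus a penalty for switching partitioning,
    # mimicking extra remote-vault traffic when layouts change between layers.
    def cost(layer, opt, prev):
        base = {"fmap": 3, "output": 2, "hybrid": 1}[opt]
        return base + (0 if prev in (None, opt) else 4)

    print(greedy_partition(layers, options, cost))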
Evaluated with state-of-the-art NNs on 2D and 3D accelerators with one or more NN engines
Up to 37% performance improvement with TETRIS
[Figure: runtime of a single 3D (TETRIS) engine normalized to a 2D baseline for AlexNet, ZFNet, VGG16, VGG19, and ResNet.]
Large NNs benefit more!
35–40% energy reduction with TETRIS
[Figure: energy of a single 3D (TETRIS) engine normalized to a 2D baseline for AlexNet, ZFNet, VGG16, VGG19, and ResNet, broken down into PE, reg/buf, DRAM, and NoC dynamic energy plus total static energy.]
Savings come from SRAM and DRAM dynamic energy, and from static energy.
4 2D engines: 34 mm², pin constrained (4 LPDDR3 channels)
16 3D engines: 56 mm², area constrained (16 HMC vaults)
4.1x performance gain, 2x compute density
[Figure: runtime of 4 2D engines (2D-4) vs. 16 3D engines (3D-16), normalized, for AlexNet, ZFNet, VGG16, VGG19, and ResNet.]
1.5x lower energy
4x the computation costs only 2.7x the power
[Figure: energy of 4 2D engines (2D-4) vs. 16 3D engines (3D-16), normalized, for AlexNet, ZFNet, VGG16, VGG19, and ResNet, broken down into PE, reg/buf, DRAM, and NoC dynamic energy plus total static energy.]
A scalable and efficient NN accelerator using 3D memory
Hardware features: rebalanced PE arrays and buffers, in-memory accumulation
Software features: bypass ordering, hybrid partitioning
Scheduling exploration tool
Questions?