CNVLUTIN: Ineffectual-neuron-free DNN computing
- J. Albericio, P. Judd, T. Hetherington*,
- T. Aamodt*, N. E. Jerger, A. Moshovos
Please cite the original source.
DNNs = SIMD Heaven
[figure: 100's to 1000's of parallel multiplications (x) per layer]
CNVLUTIN: Smarter SIMD
52% performance, 2x ED2P, on out-of-the-box networks
Outline
Design
What’s a CNN?
10’s of layers
[example input image: a Korean mask!]
Neurons (Input) → Synapses (Filters) → Neurons (Output)
A typical CNN layer: Convolution (inner products) → ReLU (negatives to 0) → Pool (data size reduction)
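As an illustrative aside (not from the slides): a toy 1-D sketch of the Convolution → ReLU stages above, showing how ReLU manufactures the runtime zeroes the next slides measure. Function names, sizes, and values are hypothetical, not the paper's dataflow.

```python
# Toy 1-D sketch of a typical CNN layer stage: convolution (inner
# products) followed by ReLU (negatives to 0). Illustrative only.

def conv1d(neurons, synapses):
    """Each output neuron is an inner product of the filter with a
    window of the input."""
    k = len(synapses)
    return [sum(neurons[i + j] * synapses[j] for j in range(k))
            for i in range(len(neurons) - k + 1)]

def relu(values):
    """ReLU clamps negatives to 0 -- the source of the runtime zeroes."""
    return [max(0, v) for v in values]

def zero_fraction(values):
    """Fraction of zero neurons fed to the next layer's multipliers."""
    return sum(1 for v in values if v == 0) / len(values)

neurons = [1, -2, 3, 0, -1, 2]
synapses = [1, 0, -1]                  # a tiny hypothetical filter
out = relu(conv1d(neurons, synapses))
print(out, zero_fraction(out))         # [0, 0, 4, 0] 0.75
```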
Time spent in convolutions

Lots of Runtime Zeroes
[chart: fraction of zero neurons in multiplications for Alexnet, Google, NiN, VGG19, VGG_M, VGG_S, AVG; y-axis 0 to 0.6]
Waste of time and energy!
Dynamically generated = not predictable

How to compute DNNs: DaDianNao*
*Chen et al., MICRO 2014
Neurons stream into Filter 0 … Filter 15.

Processing in DaDianNao
16 Neuron Lanes, and 16 Synapse Lanes per filter (Filter 0 … Filter 15).
Each cycle multiplies corresponding neuron and synapse elements, all lanes in lock-step.

Zero-skipping in DaDianNao?
Zero removal compacts each neuron stream, but then the lanes can no longer proceed in lock-step!

CNVLUTIN: Decoupling Lanes
DaDianNao ties all 16 Neuron Lanes to one shared stream; CNVLUTIN splits the unit into Subunit 0 … Subunit 15, each with its own Neuron Lane and one Synapse Lane per filter (Filter 0, Filter 1, …, Filter 15).
Each non-zero neuron is paired with an offset; the offset selects the matching synapse in every filter, so each lane skips its zeroes independently.
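The decoupled-lane idea can be sketched as follows (illustrative only; toy sizes, hypothetical helper names, not the hardware):

```python
# Sketch of CNVLUTIN's decoupled lanes: each neuron lane works on a
# zero-free stream of (value, offset) pairs, and the offset picks the
# matching synapse in every filter. Streams have different lengths,
# so lanes advance independently instead of in lock-step.

def zero_free(neurons):
    """Drop zeroes, keeping (value, original_position) pairs."""
    return [(v, i) for i, v in enumerate(neurons) if v != 0]

def lane_products(stream, filters):
    """One subunit: each surviving neuron multiplies the synapse at
    its offset in every filter."""
    return [[v * f[off] for f in filters] for v, off in stream]

neurons = [0, 3, 0, 2]                  # one lane's slice of the input
filters = [[1, 1, 1, 1], [2, 2, 2, 2]]  # two tiny hypothetical filters
stream = zero_free(neurons)             # 2 work items instead of 4
print(lane_products(stream, filters))   # [[3, 6], [2, 4]]
```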
CNVLUTIN: Ineffectual-neuron Filtering
Between Layer i and Layer i+1: eDRAM → Encoder → eDRAM Neurons → Dispatcher.
The Encoder produces the ZF (zero-free) format: neurons are grouped into bricks (Brick 0, Brick 1, Brick 2, …); the Offset Unit packs only each brick's non-zero neurons into buffers, each paired with the offset of its original position (e.g., packed neurons 7, 6, 5, 2, 1 plus their offsets); zeroes are dropped.
CNVLUTIN: Computation Slicing
Packed neurons are dispatched to Neuron Lane 0, Neuron Lane 1, …, Neuron Lane 15.

Methodology

Area
Only +4.5% area overhead
Speedup: ineffectual = 0
[chart: speedup over DaDianNao for Alexnet, Google, NiN, VGG19, VGG_M, VGG_S, Geo; higher is better]
1.37x performance on average
Loosening the Ineffectual Neuron Criterion
“If all you have is a hammer, everything looks like a nail” (Maslow’s hammer)
So far CNVLUTIN skips only exact zeroes.
Example neuron values: 37 13 10 15 1 123 7 1 3 1 20 18 31 33
Example criterion: consider a neuron ineffectual if value < 2
Speedup: ineffectual ≥ 0
[chart: speedup for Alexnet, Google, NiN, VGG19, VGG_M, VGG_S, Geo; higher is better]
1.52x performance, no accuracy loss
Loosening the Ineffectual Neuron Criterion
Example neuron values: 37 13 10 15 1 123 7 1 3 1 20 18 31 33
ineffectual if value < 2: skips the three 1’s
ineffectual if value < 8: also skips the 7 and the 3
ineffectual if value < 32: also skips 13, 10, 15, 20, 18, and 31
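The thresholded criterion is simple enough to sketch directly; the values are the slide's example activations, the function name is a hypothetical label:

```python
# Sketch of the loosened criterion: treat |value| < threshold as
# ineffectual. Raising the threshold skips more multiplications,
# at some potential accuracy cost.

values = [37, 13, 10, 15, 1, 123, 7, 1, 3, 1, 20, 18, 31, 33]

def effectual(vals, threshold):
    """Neurons that still reach the multipliers under this criterion."""
    return [v for v in vals if abs(v) >= threshold]

for t in (2, 8, 32):
    kept = effectual(values, t)
    print(f"threshold {t:2}: {len(kept)}/{len(values)} neurons computed")
```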
Trading accuracy for performance
Up to +60% performance by loosening the criterion
CNVLUTIN: Smarter SIMD
52% performance, 2x ED2P, no accuracy loss
Our Approach
Value-Aware Deep Learning Acceleration:
arXiv (a while ago): Reduced-Precision Strategies for Bounded Memory in DNNs
ICS 2016: Proteus: Exploiting Numerical Precision Variability in DNNs
Today: CNVLUTIN: Ineffectual-Neuron-Free DNN Computing
CAL (to appear): performance that scales linearly with the required numerical precision
More things coming soon :-)
Additional Material
Execution Activity Breakdown

Cnvlutin: Maintaining Wide NM Accesses
Dispatcher: reads neuron bricks (up to 16 neurons each) and maintains one per Neuron Lane.
Brick buffers are filled from NM Bank 0 … NM Bank 15, and neuron blocks are broadcast to the lanes.
#1: Partition NM into 16 slices over 16 banks
#2: Fetch and maintain one container per slice (a container holds up to 16 non-zero neurons)
#3: Keep the neuron lanes (Neuron Lane 0 … Neuron Lane 15) supplied with one neuron per cycle
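Steps #1 to #3 above amount to per-lane queues drained one entry per cycle; a toy sketch with 4 lanes and hypothetical (value, offset) contents:

```python
# Sketch of the dispatcher's job: one container of packed neurons is
# kept per NM slice, and each cycle every lane that still has work
# pops one neuron. Lanes drain at different rates because the
# zero-free streams have different lengths. Toy 4-lane version.
from collections import deque

containers = [deque([(3, 0), (2, 5)]),            # lane 0
              deque([(1, 2)]),                    # lane 1
              deque([(4, 1), (6, 3), (9, 7)]),    # lane 2
              deque([(5, 4)])]                    # lane 3

def step(containers):
    """One cycle: each non-empty lane consumes one (value, offset)."""
    return [c.popleft() if c else None for c in containers]

print(step(containers))  # [(3, 0), (1, 2), (4, 1), (5, 4)]
print(step(containers))  # [(2, 5), None, (6, 3), None]
```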
Cnvlutin: Summary
[diagram: Subunit 0 … Subunit 15; each subunit has an SB (eDRAM) with one Synapse Lane per filter, an NBin with Neuron Lanes and Offsets, multipliers and an adder, and an NBout plus encoder connected to the central eDRAM]
Decoupled Neuron Lanes: neuron + coordinate, proceeding independently
1-wide Neuron Lanes; 16-synapse Synapse Lanes
Partitioned SB: 16-wide accesses, 1 synapse per filter