Riptide: Fast End-to-End Binarized Neural Networks
Josh Fromm, Meghan Cowan, Matthai Philipose, Luis Ceze, and Shwetak Patel
Motivation: modern deep networks demand large compute and memory budgets (Canziani et al., “An Analysis of Deep Neural Network Models for Practical Applications.” 2016).
Binarization: quantize values to a single bit, so that multiplication becomes just XNOR, and networks can still train to convergence.

64 floats (1.2, 3.12, -11.2, 3.4, -2.12, -132.1, …, 0.2, -121.1, …)  →  64 bits (0b110100…1 = 0xD0…)

For ±1 vectors packed 64 to a machine word, the dot product reduces to an XNOR and a popcount:

A[:64] · W[:64] == 2 * popc(A_b XNOR W_b) - 64

// Full precision: 2N ops per output
float x[N], w[N]; float y = 0;
for (int i = 0; i < N; i++) y += x[i] * w[i];

// Binarized: ~3N/64 ops per output
uint64_t x_b[N/64], w_b[N/64]; int y = 0;
for (int i = 0; i < N/64; i++) y += 2 * popc(~(x_b[i] ^ w_b[i])) - 64;

2N ops → ~3N/64 ops: ~40x faster and 32x smaller, but typically ~10% accuracy loss.

Rastegari et al., “XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks.” 2016
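As a quick sanity check, here is a minimal, self-contained C program (my own illustration, not from the talk; it assumes the GCC/Clang __builtin_popcountll intrinsic) verifying that xnor-popcount reproduces the ±1 dot product:

#include <stdio.h>
#include <stdint.h>

int main(void) {
    /* 64 ±1 values, encoded as bits: 1 -> +1, 0 -> -1 */
    uint64_t a_bits = 0xD00DFACE12345678ULL;
    uint64_t w_bits = 0x0F0F00FFCAFEBABEULL;

    /* Reference: elementwise ±1 dot product */
    int ref = 0;
    for (int i = 0; i < 64; i++) {
        int a = ((a_bits >> i) & 1) ? 1 : -1;
        int w = ((w_bits >> i) & 1) ? 1 : -1;
        ref += a * w;
    }

    /* Binary version: matches minus mismatches via xnor + popcount */
    int fast = 2 * __builtin_popcountll(~(a_bits ^ w_bits)) - 64;

    printf("reference = %d, xnor-popcount = %d\n", ref, fast);
    return 0;
}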
Runtime: Unoptimized Binary Network, 1904 ms vs. Full Precision Baseline, 380 ms. A naive binary implementation is slower than the well-optimized full-precision baseline.
Why is this hard?
- CPUs have no native support for low-bit data types (uint1, uint2), so we must work on packed data.
- Optimizations must be implemented from scratch and tuned for a specific CPU.
- There are no optimized linear algebra libraries like BLAS to leverage for binary operations, while the full-precision baselines enjoy optimized libraries and hardware support for conventional deep learning, and are incredibly well optimized.
The majority of work on binarization is simulated rather than deployed, so it remains unclear whether binary networks are actually fast in practice and which optimizations matter. To address these questions, we introduce Riptide.
Riptide: a one-stop solution for training and deploying fast binary networks on a variety of hardware platforms. It handles quantization during training, removes all floating-point arithmetic from binary models, and generates fast device code, while maintaining state-of-the-art accuracy.
Anatomy of a binarized layer:

Full precision: features (float array) convolved with kernels (float array) via a Multiply Accumulate, producing activations (float array).

Binarized: a quantizer QA converts the features to an int array and a quantizer QW converts the kernels to an int array; the Multiply Accumulate is replaced by a Bitserial Accumulate, and the activations come out as an int array.
Two 1-bit quantization functions:

Unipolar quantization: ŷ = (y > 0), giving values in {0, 1}: pattern-match (1) or no pattern-match (0).
Bipolar quantization: ŷ = sign(y), giving values in {-1, 1}: correlation (1) or inverse-correlation (-1).
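As a concrete sketch (my own C illustration with hypothetical helper names, assuming 64-element tiles), the two quantizers pack into a 64-bit word the same way; they differ only in how the bit is derived and, crucially, in what a 0 bit later means:

#include <stdint.h>

/* Unipolar: y -> {0, 1}; the stored bit IS the value. */
static uint64_t quantize_unipolar(const float y[64]) {
    uint64_t bits = 0;
    for (int i = 0; i < 64; i++)
        bits |= (uint64_t)(y[i] > 0.0f) << i;
    return bits;
}

/* Bipolar: y -> sign(y) in {-1, +1}; bit 1 encodes +1, bit 0 encodes -1. */
static uint64_t quantize_bipolar(const float y[64]) {
    uint64_t bits = 0;
    for (int i = 0; i < 64; i++)
        bits |= (uint64_t)(y[i] >= 0.0f) << i;
    return bits;
}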
Zhou et al., “DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients.” 2016
Mixed polarity is a problem. Multiply unipolar activations against bipolar weights bit by bit:

A (unipolar): 1  1  0  1  0  1 …
W (bipolar):  0  1  0  0  1  0 …   (bit 1 encodes +1, bit 0 encodes -1)
Expected:    -1  1  0 -1  0 -1 …

Because a 0 bit means the value 0 in unipolar data but -1 in bipolar data, the multiple meanings of 0 bits make mixed-polarity products unimplementable with a plain xnor-popcount.
Solution: count the bit multiplications whose product should be +1, then subtract those whose product should be -1.
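In packed form this rule needs just two masked popcounts. A minimal sketch (my own C illustration, assuming the bit encodings above and the GCC/Clang popcount builtin):

#include <stdint.h>

/* Mixed-polarity dot product of 64 packed elements:
 * a: unipolar bits (bit i = value of a_i, 0 or 1)
 * w: bipolar bits  (bit i = 1 means +1, 0 means -1) */
static inline int mixed_polarity_dot(uint64_t a, uint64_t w) {
    int plus  = __builtin_popcountll(a & w);   /* products equal to +1 */
    int minus = __builtin_popcountll(a & ~w);  /* products equal to -1 */
    return plus - minus;                       /* zeros (a_i = 0) drop out */
}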
Multi-bit quantization: ŷ = linear(y), snapping values onto evenly spaced levels (the slide's example shows levels such as .3 and .6).
Zhou et al., “DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients.” 2016
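DoReFa-style k-bit linear quantization rounds to one of 2^k evenly spaced levels; a sketch (my own illustration, assuming inputs already clipped to [0, 1], which is how DoReFa applies it to activations):

#include <math.h>

/* k-bit linear quantization: round y in [0, 1] to the nearest of 2^k levels. */
static float quantize_linear(float y, int k) {
    float levels = (float)((1 << k) - 1);  /* e.g. k = 2 -> levels 0, 1/3, 2/3, 1 */
    if (y < 0.0f) y = 0.0f;
    if (y > 1.0f) y = 1.0f;
    return roundf(y * levels) / levels;
}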
Half-wave Gaussian quantization: ŷ = HWGQ(y), with non-uniformly spaced levels (e.g. .2, .6, 1.1, 2.1).
Cai et al., “Deep Learning with Low Precision by Half-wave Gaussian Quantization.” 2017
Problem: each HWGQ level is identified by a unique bit pair (00, 01, 10, 11) rather than by a weighted combination of bits (01 + 10 ≠ 11). Those unique bit combinations are lost during popcount, so HWGQ is incompatible with bitserial computation.
The resulting quantized layer: QA and QW produce int features and kernels, and the convolution runs as a Bitwise Accumulate over packed bits. Supported options:
- 1-bit bipolar quantization, or N-bit linear bipolar or unipolar quantization;
- xnor-popcount for bipolar operands, or mixed-polarity popcount otherwise.
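With N-bit linear quantization the integer dot product decomposes over bit planes, so the 1-bit popcount primitive still applies. A sketch for unipolar operands (my own illustration; bipolar or mixed polarity would swap in the xnor or mixed-polarity kernels shown earlier):

#include <stdint.h>

/* Bitserial dot product of 64 elements quantized to 'bits' unipolar bits.
 * x[b], w[b] hold bit-plane b (bit i of x[b] = bit b of element i). */
static inline int bitserial_dot(const uint64_t *x, const uint64_t *w, int bits) {
    int acc = 0;
    for (int bx = 0; bx < bits; bx++)
        for (int bw = 0; bw < bits; bw++)
            acc += __builtin_popcountll(x[bx] & w[bw]) << (bx + bw);
    return acc;  /* equals sum_i x_i * w_i for integer-valued x, w */
}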
Network diagram: one full-precision Conv layer, with the remaining layers binarized (QConv, QConv, QConv, QConv, QDense).
Each binary layer is surrounded by full-precision “glue” operations. Per layer, the binary convolution itself costs roughly NKKFHWC/43 ops, while the glue costs: BatchNorm 4HWF, Dequantize HWF, WeightScale 4HWF, Activation HWF, Quantize 5HWF, Bitpack 3HWF. This glue must be made cheap for decent end-to-end speedups.
WeightScale (Rastegari et al., “XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks.” 2016): the QConv output is rescaled by the mean absolute value of the layer's full-precision weights:

β_l = mean(|W_l|)
ŷ = β_l · a

where a is the output of the binary convolution.
BatchNorm (Ioffe et al., “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.” 2015):

μ_l = (1/N) Σᵢ₌₁ᴺ aᵢ
σ_l² = (1/N) Σᵢ₌₁ᴺ (aᵢ − μ_l)²
âᵢ = (aᵢ − μ_l) / √(σ_l² + ε)
To add a constant to a binarized tensor, we must use Fixed Point Quantization (FPQ) with the same number of bits and the same scale as the tensor.
1 1 1 1 0 1 1 0 …   an N-bit input to the next layer, extended with wb fractional bits (1/2 + 1/4 + 1/8 + ⋯)

B = N + wb
S = 1 + Σᵢ₌₁^wb [1/(2ᴺ − 1)] (1/2)ⁱ = 1 + [1/(2ᴺ − 1)](1 − 1/2^wb)
μ̂ = FPQ(μ, B, S)

This is the fused glue operation. All terms are constant at runtime except the activation a, so it requires only two integer operations.
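A plausible realization of those two operations (my own sketch, not the paper's exact kernel): with μ̂ pre-quantized offline into the accumulator's fixed-point format, the runtime work per value is one integer add and one shift:

#include <stdint.h>

/* Fused glue sketch: 'acc' is the integer output of a binary convolution and
 * 'mu_hat' is the batch-norm centering term, pre-quantized with FPQ to the
 * same B = N + wb bit fixed-point format. Dropping the wb fractional bits
 * returns an N-bit value for the next layer's quantizer. */
static inline int32_t fused_glue(int32_t acc, int32_t mu_hat, int wb) {
    return (acc + mu_hat) >> wb;  /* two integer ops: add, shift */
}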
Traditional Binary Network: QConv + BatchNorm (4HWF) + Dequantize (HWF) + WeightScale (4HWF) + Activation (HWF) + Quantize (5HWF) + Bitpack (3HWF). Total = 18HWF of glue.

Fully Binarized Network: QConv + Fused Glue (2HWF) + Clip (HWF) + Bitpack (3HWF). Total = 6HWF of glue.
Accuracy matches state-of-the-art techniques, and higher bitwidths reach higher accuracies, as expected.
Deployment target: Raspberry Pi (ARM Cortex-A53).
TVM: an optimizing deep learning compiler (Chen et al., “TVM: An Automated End-to-End Optimizing Compiler for Deep Learning.” 2018).
- Separates compute from implementation: a declaration in the Tensor Expression Language plus a schedule.
- Schedules expose knobs that are tuned for the target backend, defining a schedule optimization space.
- AutoTVM searches this space to optimize tensor operators (Chen et al., “Learning to Optimize Tensor Programs.” 2018).
Bitpack Fusion

[Figure: dataflow of a binary convolution. Bit-packed Int-N activations (NHWC/8 bytes) feed a BinaryConv whose Int-16 popcount accumulation produces 2NHWC bytes; a fused shift/scale requantizes to Int-N values held in Int-16 words with the upper bits unused (another 2NHWC bytes); a final Bit Pack step emits Int-N bit-packed outputs (NHWC/8 bytes). Fusing the shift/scale and bit pack into the convolution keeps the wide intermediates out of memory.]
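A drastically simplified sketch of the fused loop (my own illustration with hypothetical names, shown for 1-bit unipolar outputs rather than the general Int-N case): requantization and bit packing happen while each accumulator is still in a register, so the wide intermediates never touch memory:

#include <stdint.h>

/* Fused binary conv: popcount-accumulate, shift/scale, and bit pack in one
 * pass. 'out_bits' must be zero-initialized; n_out outputs, n_words packed
 * words per dot product. Unfused code would write every accumulator to
 * memory and re-read it twice (requantize pass, pack pass). */
void binary_conv_fused(const uint64_t *x, const uint64_t *w,
                       uint64_t *out_bits, int n_out, int n_words, int shift) {
    for (int o = 0; o < n_out; o++) {
        int acc = 0;
        for (int i = 0; i < n_words; i++)  /* xnor-popcount accumulation */
            acc += 2 * __builtin_popcountll(~(x[i] ^ w[o * n_words + i])) - 64;
        int bit = (acc >> shift) > 0;      /* fused shift/scale + 1-bit quantize */
        out_bits[o / 64] |= (uint64_t)bit << (o % 64);  /* fused bit pack */
    }
}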
Ablations (speedup over the full-precision baseline):
- With a real end-to-end implementation, we can study which optimizations have a significant impact on performance; much of the gain is due to fewer memory accesses.
- Fused glue layers have a significant impact on performance, giving roughly a 2X speedup, as predicted by the complexity analysis.
- Unfused bitpacking takes a comparable amount of time as the core compute layers; bitpack fusion almost completely removes this cost.
- Unipolar quantization incurs more memory overhead and has twice as many operations, yet it is only marginally slower than bipolar quantization.
Riptide fuses glue operations across layers and compiles the entire network of binarizable layers, yielding fast end-to-end binary networks without sacrificing accuracy.
Code:
Paper: