SLIDE 1

Riptide: Fast End-to-End Binarized Neural Networks

Josh Fromm, Meghan Cowan, Matthai Philipose, Luis Ceze, and Shwetak Patel

SLIDE 2

Canziani et al., “An analysis of deep neural network models for practical applications.” 2016

SLIDE 3
  • Quantize floats to +/-1, e.g. 1.122 * -3.112 ==> 1 * -1
  • Notice:
  • 1 * 1 = 1
  • 1 * -1 = -1
  • -1 * 1 = -1
  • -1 * -1 = 1
  • Replacing -1 with 0, this is just XNOR
  • Retrain the model to convergence

[Figure: 64 floats (1.2, 3.12, -11.2, 3.4, -2.12, -132.1, …, 0.2, -121.1, …) quantized and packed into a single word (0b110100…1, i.e. 0xD0…): 64 floats become 64 bits.]

A[:64] · W[:64] == 2 * popc(A_bits XNOR W_bits) - 64, where A_bits and W_bits are the 64 packed sign bits

1-bit Matrix Operations

Rastegari et al., “Xnor-net: Imagenet classification using binary convolutional neural networks.” 2016
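To make the identity concrete, here is a minimal C sketch (ours, not from the slides) that packs +/-1 values into a 64-bit word and evaluates the dot product with XNOR and popcount; __builtin_popcountll stands in for the hardware popcount instruction.

#include <stdint.h>

/* Pack 64 +/-1 values into one word: bit i set means x[i] == +1. */
uint64_t pack64(const float *x) {
    uint64_t bits = 0;
    for (int i = 0; i < 64; i++)
        if (x[i] > 0) bits |= (uint64_t)1 << i;
    return bits;
}

/* Dot product of two 64-element +/-1 vectors from their packed bits:
   matches minus mismatches = 2 * popcount(xnor) - 64. */
int dot64(uint64_t a, uint64_t w) {
    return 2 * __builtin_popcountll(~(a ^ w)) - 64;
}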

SLIDE 4

Full precision:
    float x[], y[], w[];
    ...
    for i in 1…N:
        y[j] += x[i] * w[i];

Binary:
    unsigned long x_b[], y[], w_b[];
    ...
    for i in 1…N/64:
        y[j] += 2*popc(not(x_b[i] xor w_b[i])) - 64;

2N ops vs. 3N/64 ops, i.e. 2N / (3N/64) ≈ 43, so roughly ~40x faster and 32x smaller weights. Typically lose ~10% accuracy.

1-bit Matrix Operations: Cost/Benefit

Rastegari et al., “Xnor-net: Imagenet classification using binary convolutional neural networks.” 2016

SLIDE 5

32x smaller

1-bit Matrix Operations: Cost/Benefit

Rastegari et al., “Xnor-net: Imagenet classification using binary convolutional neural networks.” 2016

SLIDE 6

~40x faster

1-bit Matrix Operations: Cost/Benefit


SLIDE 7

1-bit Matrix Operations: Cost/Benefit


Runtime: 380 ms (Unoptimized Binary Network) vs. 1904 ms (Full Precision Baseline)

SLIDE 8

Implementation Challenges

  • CPUs have no native support for low-bit data types (e.g. uint1, uint2), so we need to work on packed data
  • Optimizations must be implemented from scratch and tuned for a specific CPU
  • No optimized linear algebra libraries like BLAS to leverage, unlike conventional deep learning, which has optimized libraries and hardware support
  • Baselines are incredibly well optimized

SLIDE 9

Are Binary Networks Actually Fast?

The majority of work on binarization is simulated.

  • Which binarization techniques can be implemented efficiently?
  • What are the runtime bottlenecks in a binary model?
  • How do I deploy a fast binary model on my platform?

To address these questions we introduce Riptide.


SLIDE 10


A one-stop solution to training and deploying fast binary networks on a variety of hardware platforms.

  • Addresses implementation issues in mixed polarity quantization
  • Introduces the Fused Glue operation, removing all floating-point arithmetic from binary models
  • Provides high-performance bitserial operators through TVM
  • Yields 4-12X speedups across various models and bitwidths while maintaining state-of-the-art accuracy
  • Available open-source today at github.com/jwfromm/Riptide
SLIDE 11

Implementing Binary Layers

[Diagram: features (float array) ⊛ kernels (float array) = activations (float array), computed with multiply-accumulate.]

SLIDE 12

Implementing Binary Layers

[Diagram: same as above, but the input features (float array) now pass through a quantizer QA to become an int array before the multiply-accumulate.]

SLIDE 13

Implementing Binary Layers

[Diagram: both inputs are now quantized: features pass through QA and kernels pass through QW, giving int arrays on both sides of the multiply-accumulate.]

SLIDE 14

Implementing Binary Layers

[Diagram: the fully binarized layer: quantized features (int array, via QA) and quantized kernels (int array, via QW) are combined with a bitserial accumulate, producing activations as an int array.]

SLIDE 15

Quantization Polarity

Bipolar Quantization
  • Quantization function: Q(y) = sign(y), with values {-1, +1}
  • Implemented with bitwise-xnor and popcount
  • Well-suited for weights, which represent correlation (1) or inverse-correlation (-1)

Unipolar Quantization
  • Quantization function: Q(y) = (y > 0), with values {0, 1}
  • Implemented with bitwise-and and popcount
  • Well-suited for activations, which represent pattern-match (1) or no pattern-match (0)
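In C-like terms (a sketch of ours, not the slides' code), the two polarities reduce to different one-word bit kernels:

#include <stdint.h>

/* bipolar x bipolar (+1/-1 values): matches minus mismatches */
int dot_bipolar(uint64_t a, uint64_t w) {
    return 2 * __builtin_popcountll(~(a ^ w)) - 64;
}

/* unipolar x unipolar (0/1 values): count positions where both bits are 1 */
int dot_unipolar(uint64_t a, uint64_t w) {
    return __builtin_popcountll(a & w);
}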

SLIDE 16

Quantization Polarity

  • XnorNet (all bipolar) -> 44.2% accuracy
  • DorefaNet (bipolar weights, unipolar activations) -> 50.0% accuracy

Zhou et al., “DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients.” 2016

[Example: A (unipolar): 1 1 0 1 0 1 …; W (bipolar): 0 1 0 0 1 0 …; expected elementwise products: 0 1 0 -1 0 -1 …]

Because a 0 bit means different things under the two polarities, mixed polarity is not implementable with a single xnor- or and-popcount.

SLIDE 17

Mixed Polarity Operation


Count the bit multiplications whose output should be +1, then subtract the cases whose output should be -1.

  • Enables mixed polarity binary networks (see the sketch after this list)
  • Doubles the amount of inner-loop compute but does not require additional memory operations
  • Mixed polarity may offer compelling points on the speedup-versus-accuracy curve compared to pure bipolar
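A minimal sketch of that inner loop (assuming unipolar activation bits and bipolar weight bits; names are ours): one and-popcount counts the products that should be +1, a second counts those that should be -1.

#include <stdint.h>

/* a: unipolar activation bits (1 => value 1, 0 => value 0)
   w: bipolar weight bits      (1 => +1,      0 => -1)      */
int dot_mixed(uint64_t a, uint64_t w) {
    int pos = __builtin_popcountll(a &  w);   /* products that should be +1 */
    int neg = __builtin_popcountll(a & ~w);   /* products that should be -1 */
    return pos - neg;   /* twice the bitwise work, no extra memory traffic */
}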
SLIDE 18

Multibit Quantization

Quantization function: Q(y) = linear(y)

[Figure: uniformly spaced quantization levels, e.g. 0.3, 0.6, ….]

  • Translates naturally to an integer representation (see the sketch below)
  • Does not necessarily fit the activation distribution

Zhou et al., “DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients.” 2016
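A minimal sketch of n-bit linear quantization in this style (the clamp range and function name are our assumptions): clamp to [0, 1], then round onto 2^n - 1 uniform steps, which maps directly to an integer code.

#include <math.h>
#include <stdint.h>

/* Quantize y to an n-bit unsigned integer code on a uniform [0, 1] grid. */
uint32_t quantize_linear(float y, int n) {
    float levels  = (float)((1u << n) - 1);
    float clamped = fminf(fmaxf(y, 0.0f), 1.0f);
    return (uint32_t)lrintf(clamped * levels);   /* code in 0 .. 2^n - 1 */
}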

SLIDE 19

Multibit Quantization

Quantization function: Q(y) = HWGQ(y)

Cai et al., “Deep Learning with Low Precision by Half-wave Gaussian Quantization.” 2017

[Figure: non-uniform quantization levels, e.g. 0.2, 0.6, 1.1, 2.1.]

  • Better fit for a Gaussian activation distribution
  • Not implementable with bitserial operations
SLIDE 20

Multibit Quantization

Cai et al., “Deep Learning with Low Precision by Half-wave Gaussian Quantization.” 2017

Each HWGQ level (0.2, 0.6, 1.1, 2.1) is identified by a unique bit pair (00, 01, 10, 11) rather than by a weighted sum of its bits (01 + 10 ≠ 11), and those unique bit combinations are lost during popcount.

SLIDE 21

Implementing Binary Layers

[Diagram: quantized features (int array, via QA) and quantized kernels (int array, via QW) produce activations (int array) through a bitwise accumulate; example layer dimensions 128 x 128 x 256.]

  • QW: 1-bit bipolar quantization
  • QA: N-bit linear bipolar or unipolar quantization
  • Accumulation: xnor-popcount / mixed-polarity popcount
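Putting the pieces together, a sketch (ours) of a bitserial dot product for N-bit unipolar activations against 1-bit bipolar weights: each activation bit-plane is combined with the weight bits using the mixed-polarity kernel and weighted by its place value. Buffer layout and names are illustrative.

#include <stdint.h>

/* a_planes[b] holds bit b of 64 consecutive activations; w holds 64 bipolar weight bits. */
int dot_bitserial(const uint64_t *a_planes, int nbits, uint64_t w) {
    int acc = 0;
    for (int b = 0; b < nbits; b++) {
        int pos = __builtin_popcountll(a_planes[b] &  w);
        int neg = __builtin_popcountll(a_planes[b] & ~w);
        acc += (pos - neg) << b;   /* weight bit-plane b by 2^b */
    }
    return acc;
}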

SLIDE 22

Implementing Binary Models

[Diagram: a full-precision network (Conv/Dense layers) alongside its binary counterpart (QConv/QDense layers).]

SLIDE 23

Implementing Binary Models

[Diagram: in the binary network, each QConv is surrounded by floating-point "glue" layers: BatchNorm, Dequantize, WeightScale, Activation, Quantize, and Bitpack.]

Computational complexity (per output tensor of size HWF):
  • BatchNorm: 4HWF
  • Dequantize: HWF
  • WeightScale: 4HWF
  • Activation: HWF
  • Quantize: 5HWF
  • Bitpack: 3HWF
  • QConv: NKKFHWC / 43 (the full-precision convolution cost divided by the ~43x binary speedup)
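As a rough illustration (our numbers, not the slide's): for a 3x3 binary convolution with C = F = 256 channels, the core compute is about K*K*C/43 ≈ 9*256/43 ≈ 54 operations per output element, while the glue layers add 4 + 1 + 4 + 1 + 5 + 3 = 18, roughly a quarter of the total; for a thinner layer with C = 64, the glue (18) actually exceeds the binary convolution (≈ 13).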

SLIDE 24

Estimated Impact of Glue Layers

  • The impact of glue layers is too high
  • We must derive binarized glue for decent end-to-end speedups

SLIDE 25

Weight Scaling

β = mean(|W|)
y = β · a, where a is the output of the binary convolution (QConv)

  • Introduced in XnorNet
  • Allows scale of weights to be preserved
  • Brought accuracy from 27% to 44%
  • Now used ubiquitously

Rastegari et al., “Xnor-net: Imagenet classification using binary convolutional neural networks.” 2016

SLIDE 26

Quantized Weight Scaling

β = mean(|W|)
y = β · a, with β replaced by AP2(β) so the multiply becomes a shift

Rastegari et al., “Xnor-net: Imagenet classification using binary convolutional neural networks.” 2016

  • Use the approximate power of 2 (AP2) of the scale (sketched below)
  • Replaces the multiply with a bitwise shift
  • Constant at inference time
  • Requires only a single instruction
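A small sketch of the AP2 idea (function names are ours): the scale exponent is rounded to the nearest power of two offline and applied to the integer accumulator at inference time with a single shift.

#include <math.h>
#include <stdint.h>

/* Offline: nearest power-of-two exponent of the weight scale beta = mean(|W|). */
int ap2_exponent(float beta) {
    return (int)lrintf(log2f(beta));
}

/* Inference: scale an integer accumulator by 2^e with one shift. */
int32_t apply_scale(int32_t acc, int e) {
    return e >= 0 ? acc << e : acc >> -e;   /* arithmetic shift on most targets */
}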
SLIDE 27

BatchNormalization

  • Centers and scales output activations
  • Essential for quantization, used in all binary techniques
  • Must derive quantized versions of both centering and scaling

μ = (1/N) Σ_i a_i
σ² = (1/N) Σ_i (a_i - μ)²
â_i = (a_i - μ) / sqrt(σ² + ε)

Ioffe et al., “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.” 2015

SLIDE 28

Binary Scaling

  • We can simply compute the AP2 of the standard deviation, so the division by σ becomes a bitwise shift

SLIDE 29

Binary Center

To add a constant to a binarized tensor, we must use Fixed Point Quantization (FPQ) with the same number of bits and the same scale.

[Figure: a packed bit string (1 1 1 1 0 1 1 0 …) forming the N-bit input to the next layer, followed by wb fractional bits with place values 1/2, 1/4, 1/8, ….]

B = N + wb
S = 1 + Σ_{i=1..wb} [1/(2^N - 1)] · (1/2)^i = 1 + [1/(2^N - 1)] · (1 - 1/2^wb)
μ̂ = FPQ(μ, B, S)

SLIDE 30

Fused Glue Operation

B = N + wb
S = 1 + [1/(2^N - 1)] · (1 - 1/2^wb)
μ̂ = FPQ(μ, B, S)

This is the fused glue operation. All terms are constant at runtime except the activation a, and it requires only two integer operations.
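A minimal sketch (ours) of what the fused glue reduces to at runtime: the FPQ'd batch-norm center and the AP2 shift (folding the weight scale and 1/σ) are precomputed offline, so each accumulator needs only an integer add and a shift. All names are illustrative.

#include <stdint.h>

typedef struct {
    int32_t center;   /* batch-norm center, FPQ'd to the layer's bits and scale */
    int     shift;    /* AP2 exponent folding the weight scale and 1/sigma      */
} fused_glue_t;

/* Two integer operations per activation; clipping and bit-packing follow. */
static inline int32_t fused_glue(int32_t acc, fused_glue_t g) {
    return (acc + g.center) >> g.shift;
}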

SLIDE 31

Fully Binarized Network

Traditional Binary Network glue (per QConv): BatchNorm 4HWF + Dequantize HWF + WeightScale 4HWF + Activation HWF + Quantize 5HWF + Bitpack 3HWF = 18HWF total

Fully Binarized Network glue (per QConv): Fused Glue 2HWF + Clip HWF + Bitpack 3HWF = 6HWF total

  • 3X fewer glue operations
  • No floating-point data
  • No multiplication or division
SLIDE 32

FBN Accuracy

  • Our system is comparable to state-of-the-art techniques
  • Unipolar quantization yields higher accuracies, as expected
  • Effective across various models

SLIDE 33

Measurement Platform


Raspberry Pi ARM Cortex-A53

  • Widely available and inexpensive
  • Representative of IoT devices (Qualcomm Snapdragons, Azure Sphere)
  • Resource constrained / in need of acceleration
SLIDE 34

Separates computation and implementation into a declaration and a schedule. Schedules contain knobs that are tuned for the backend.

TVM: an optimizing deep learning compiler.

[Diagram: the Tensor Expression Language defines operators; AutoTVM optimizes tensor operators by searching the schedule optimization space.]

Chen et al., “TVM: An Automated End-to-End Optimizing Compiler for Deep Learning.” 2018

SLIDE 35

TVM Schedule Intrinsics

  • Tiling: break computation into chunks for better locality
  • Vectorization: use hardware SIMD instructions for more efficient operation execution
  • Parallelization: leverage MIMD facilities such as multiple cores
  • Loop Unrolling: replicate the body of loops to reduce overhead (see the sketch below)

Chen et al., “Learning to Optimize Tensor Programs.” 2019
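As a plain-C illustration of what these transformations do to an inner loop (hand-written by us, not TVM-generated code): the accumulation is tiled into cache-friendly chunks and the innermost loop is unrolled; a vectorizing compiler can then map the unrolled body onto SIMD lanes.

#include <stdint.h>

#define TILE 256   /* illustrative tile size; this is the kind of knob AutoTVM tunes */

/* Accumulate xnor-popcount over nwords packed words (assumes nwords % TILE == 0). */
int64_t popcount_accumulate_tiled(const uint64_t *a, const uint64_t *w, int nwords) {
    int64_t total = 0;
    for (int t = 0; t < nwords; t += TILE) {              /* tiling */
        int64_t acc = 0;
        for (int i = t; i < t + TILE; i += 4) {           /* unrolled by 4 */
            acc += __builtin_popcountll(~(a[i + 0] ^ w[i + 0]));
            acc += __builtin_popcountll(~(a[i + 1] ^ w[i + 1]));
            acc += __builtin_popcountll(~(a[i + 2] ^ w[i + 2]));
            acc += __builtin_popcountll(~(a[i + 3] ^ w[i + 3]));
        }
        total += acc;
    }
    return total;   /* matching-bit count; rescale to a +/-1 dot with 2*total - 64*nwords */
}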

SLIDE 36

Fast Popcount

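The slide's details are not preserved in this transcript; as an illustration, one standard way to speed up popcount on ARM NEON (which may or may not be exactly what Riptide does) is to take per-byte popcounts with vcnt and defer the horizontal reduction with pairwise widening accumulates.

#include <arm_neon.h>
#include <stdint.h>

/* xnor-popcount over `bytes` bytes of packed data (assumes bytes % 16 == 0).
   Note: the uint16 lanes can saturate on very long rows; widen periodically. */
uint32_t xnor_popcount_neon(const uint8_t *a, const uint8_t *w, int bytes) {
    uint16x8_t acc = vdupq_n_u16(0);
    for (int i = 0; i < bytes; i += 16) {
        uint8x16_t x = vmvnq_u8(veorq_u8(vld1q_u8(a + i), vld1q_u8(w + i)));  /* xnor */
        acc = vpadalq_u8(acc, vcntq_u8(x));   /* per-byte popcount, widen-accumulate */
    }
    return vaddvq_u32(vpaddlq_u16(acc));      /* horizontal sum (AArch64) */
}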

SLIDE 37

[Diagram: the binary convolution pipeline. Int-N bit-packed activations feed BinaryConv, which produces Int-16 popcount accumulations (2NHWC bytes); a fused shift/scale produces Int-16 quantized pre-packed values, and a separate Bit Pack stage compresses them into Int-N bit-packed outputs (a small fraction of the Int-16 buffer size).]

SLIDE 38

Bitpack Fusion

[Diagram: the same pipeline with the shift/scale and Bit Pack stages fused into the BinaryConv epilogue, so the large Int-16 intermediate buffers (2NHWC bytes) are never written out to memory; only the Int-N bit-packed activations and outputs are materialized.]
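A sketch (ours) of the bit-packing step that the fusion pushes into the convolution's epilogue, assuming a bit-plane layout: bit b of 64 consecutive quantized activations is gathered into one 64-bit word per plane.

#include <stdint.h>

/* Pack bit-plane `bit` of 64 n-bit quantized activations into one word. */
uint64_t pack_bitplane(const uint8_t *q, int bit) {
    uint64_t word = 0;
    for (int i = 0; i < 64; i++)
        word |= (uint64_t)((q[i] >> bit) & 1) << i;
    return word;
}

When this step runs fused with the convolution, the Int-16 accumulations never round-trip through memory, which is where the fusion's speedup comes from (see Slide 39).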

SLIDE 39

Impact of Optimizations

  • The combination of TVM optimizations gives a 12X speedup over the baseline
  • Each optimization has a significant impact
  • Speedups from bitpack fusion are due to fewer memory operations
  • With a high-quality implementation, we can study our design choices

SLIDE 40

Optimization Ablation Study

  • Removing any optimization has a significant impact on performance
  • Using fused glue gives nearly a 2X speedup, as predicted by the op-count estimates

SLIDE 41

Glue Layer Impact

  • Glue consistently takes a similar amount of time as the core compute layers
  • Our fused glue operation almost completely removes this cost

SLIDE 42

Impact of Polarity

  • The baseline is near optimal
  • Quantized layers have much more memory overhead
  • Although unipolar quantization has twice as many operations, it is only marginally slower than bipolar quantization

SLIDE 43

Cumulative Speedup


SLIDE 44

Layerwise Speedup

  • Speedup is not consistent across layers
  • It may be possible to design a network of binarizable layers

SLIDE 45

Thank You!

Code: github.com/jwfromm/Riptide

Paper:

SLIDE 46

Backup Slides


SLIDE 47