Using Binarized Neural Networks to Compress DNNs
Xianda (Bryce) Xu
xxu373@wisc.edu November 7th
Why is model compression so important?

Problem 1: Computation Cost
$A = \sigma(X_i W^T + B)$
Multiplication is energy- and time-consuming!

Problem 2: Memory Cost
Figure 1. AlexNet architecture: ~60M parameters.
At float-32 precision, 60M parameters = 240 MB of memory!

However, energy and memory are limited on mobile and embedded devices!
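To make the memory arithmetic concrete (the 1-bit figure below is my own extrapolation from the 32× compression discussed later):

$$60\,\mathrm{M\ params} \times 4\ \mathrm{bytes\ (float\text{-}32)} = 240\ \mathrm{MB}, \qquad 60\,\mathrm{M\ params} \times \tfrac{1}{8}\ \mathrm{byte\ (1\text{-}bit)} \approx 7.5\ \mathrm{MB}$$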
AlexNet on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC):
Top-1 accuracy: 57.1%, Top-5 accuracy: 80.2%
Existing approaches to model compression:
- Removing unimportant parameters (pruning)
- Applying Singular Value Decomposition (SVD) to W (see the sketch after this list)
- Using the combined output of several large networks to train a simpler model
- Finding an efficient representation of each parameter
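As a rough sketch of the SVD idea (the matrix shapes and rank here are hypothetical, not taken from any of the cited work):

```python
import numpy as np

# Hypothetical fully-connected weight matrix W (out_features x in_features)
W = np.random.randn(512, 1024).astype(np.float32)

# Keep only the top-k singular values/vectors: a rank-k approximation of W
k = 64
U, S, Vt = np.linalg.svd(W, full_matrices=False)
W_low_rank = (U[:, :k] * S[:k]) @ Vt[:k, :]

# Storage drops from 512*1024 values to roughly 512*k + k*1024 values
print("relative error:", np.linalg.norm(W - W_low_rank) / np.linalg.norm(W))
```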
Binarization compresses the network by 32× in theory (1 bit per weight instead of 32).
Multiply-accumulate operations can be replaced by XNOR and bit-count (popcount) operations.
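A minimal sketch of why this replacement works: with weights and activations in {-1, +1} packed as bits {0, 1}, a dot product reduces to XNOR plus a bit-count (the bit-packing below is my own illustration, not code from the paper):

```python
import numpy as np

def binary_dot(x_bits, w_bits, n):
    """Dot product of two length-n {-1,+1} vectors packed into integers.
    XNOR marks positions where the signs agree; dot = agreements - disagreements."""
    xnor = ~(x_bits ^ w_bits) & ((1 << n) - 1)   # mask to n bits
    agreements = bin(xnor).count("1")            # bit-count (popcount)
    return 2 * agreements - n

x = np.array([+1, -1, -1, +1, +1, -1, +1, -1])
w = np.array([+1, +1, -1, -1, +1, -1, -1, -1])
x_bits = int("".join("1" if v > 0 else "0" for v in x), 2)
w_bits = int("".join("1" if v > 0 else "0" for v in w), 2)
assert binary_dot(x_bits, w_bits, len(x)) == int(np.dot(x, w))
```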
Problem 1: How to binarize?
Though Stochastic Binarization seems more reasonable, we prefer deterministic binarization for its efficiency.
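A minimal PyTorch sketch of the two options, following the standard BNN formulation (deterministic sign, and stochastic binarization using the hard-sigmoid probability); the function names are mine:

```python
import torch

def binarize_deterministic(x):
    # sign(x), with sign(0) mapped to +1
    return torch.where(x >= 0, torch.ones_like(x), -torch.ones_like(x))

def binarize_stochastic(x):
    # P(x_b = +1) = hard_sigmoid(x) = clip((x + 1) / 2, 0, 1)
    p = torch.clamp((x + 1) / 2, 0, 1)
    return torch.where(torch.rand_like(x) < p,
                       torch.ones_like(x), -torch.ones_like(x))
```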
Problem 2: When to binarize?
- We do not binarize the input, but we do binarize the Ws (weights) and As (activations).
- We binarize all the Ws and As.
- We binarize the Ws, and only binarize the outputs during training.
- We do not binarize the gradients in back-propagation, but we have to clip the real-valued weights when we update them (as sketched below).
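A minimal sketch of the clipping step after a weight update (the toy model and training step are illustrative, not the repo's code; in a real BNN the real-valued "shadow" weights are the ones being clipped):

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 4)   # stand-in for the real-valued weights of a BNN
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x, y = torch.randn(16, 8), torch.randn(16, 4)
loss = nn.functional.mse_loss(model(x), y)
loss.backward()
optimizer.step()

# Clip the real-valued weights to [-1, 1] after the update,
# so they stay in the range where the STE still passes gradients through.
with torch.no_grad():
    for p in model.parameters():
        p.clamp_(-1.0, 1.0)
```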
Problem 3: How to do back-propagation?

Recall: we calculate the gradient of the loss function $L$ with respect to $I_l$, the input of the $l$-th layer:

$g_l = \dfrac{\partial L}{\partial I_l}$

The layer activation gradients: $g_{l-1} = g_l W_l$
The layer weight gradients: $g_{W_l} = g_l^{\top} I_{l-1}$
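A small numpy sketch of these two formulas (the shapes are hypothetical; the convention follows the forward pass $A = \sigma(X_i W^T + B)$ above):

```python
import numpy as np

batch, n_in, n_out = 32, 256, 128
I_prev = np.random.randn(batch, n_in)   # I_{l-1}: input to layer l
W = np.random.randn(n_out, n_in)        # W_l, used as I_l = I_{l-1} W_l^T
g = np.random.randn(batch, n_out)       # g_l = dL/dI_l

g_prev = g @ W        # activation gradients g_{l-1}, shape (batch, n_in)
g_W = g.T @ I_prev    # weight gradients g_{W_l}, shape (n_out, n_in)
```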
But since we use a binarizing (sign) function, its gradient is zero almost everywhere! The solution: the Straight-Through Estimator (STE).
Yoshua Bengio et al. Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation (15 Aug 2013).
For the last layer:
$g_{a_L} = \dfrac{\partial C}{\partial a_L}$

Adapted for the hidden layers (back-propagating through $\mathrm{sign}(x)$):

STE: $g_{a_k} = g_{a_k^b} \circ \mathbf{1}_{|a_k| \le 1}$
Back Batch Norm: $g_{s_k} = \mathrm{BackBatchNorm}(g_{a_k})$
Activation gradients: $g_{a_{k-1}^b} = g_{s_k} W_k^b$
Weight gradients: $g_{W_k^b} = g_{s_k}^{\top} a_{k-1}^b$

The STE is exactly the gradient of $\mathrm{Htanh}(x)$, so we use $\mathrm{Htanh}(x)$ as our activation function:
$\mathrm{Htanh}(x) = \mathrm{Clip}(x, -1, 1)$
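A minimal PyTorch sketch of the STE as described here: sign(x) in the forward pass, and the gradient of Htanh(x) (identity masked by |x| ≤ 1) in the backward pass. The class name is mine, not from the repo:

```python
import torch

class BinarizeSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.where(x >= 0, torch.ones_like(x), -torch.ones_like(x))

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # STE = gradient of Htanh(x) = Clip(x, -1, 1):
        # pass the gradient through where |x| <= 1, zero it elsewhere.
        return grad_output * (x.abs() <= 1).to(grad_output.dtype)

x = torch.randn(5, requires_grad=True)
y = BinarizeSTE.apply(x)
y.sum().backward()
print(x.grad)   # 1 where |x| <= 1, 0 elsewhere
```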
The architecture of the BNN in this paper (trained on CIFAR-10).
https://github.com/brycexu/BinarizedNeuralNetwork/tree/master/SecondTry
The block (see the sketch below).
In this paper: validation accuracy 89%.
My experiment: training accuracy 95%, validation accuracy 84%.
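For reference, a hedged PyTorch sketch of what one such block might look like (convolution, then batch norm, then binarized activation); this is my reconstruction of a typical BNN block, not the exact code in the repo, and weight binarization is omitted for brevity:

```python
import torch
import torch.nn as nn

class BinaryActivation(nn.Module):
    """Sign activation with a straight-through gradient (gradient of Htanh)."""
    def forward(self, x):
        binary = torch.where(x >= 0, torch.ones_like(x), -torch.ones_like(x))
        clipped = torch.clamp(x, -1, 1)               # Htanh(x)
        return clipped + (binary - clipped).detach()  # forward: sign, backward: Htanh gradient

def bnn_block(in_ch, out_ch):
    # A typical BNN ordering: Conv -> BatchNorm -> Binarize
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        BinaryActivation(),
    )
```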
BNNs always have larger output changes, which makes them more susceptible to input perturbations. BNNs are also hard to optimize due to problems such as gradient mismatch. This is because the effective activation function in a fixed-point network is a non-differentiable function. That is why we cannot apply ReLU in a BNN!

Darryl D. Lin et al. Overcoming Challenges in Fixed Point Training of Deep Convolutional Networks (8 Jul 2016).
Research shows that having more bits for the activations improves a model's robustness.
Research shows that a higher learning rate can cause turbulence inside the model, so a BNN needs finer tuning.
More networks per bit? Or more bits per network?