
SLIDE 1

True Gradient-Based Training of Deep Binary Activated Neural Networks via Continuous Binarization

Charbel Sakr*#, Jungwook Choi+, Zhuo Wang+, Kailash Gopalakrishnan+, Naresh Shanbhag*

*University of Illinois at Urbana-Champaign   +IBM T.J. Watson Research Center   #Work done at IBM

Acknowledgment:

  • This work was supported in part by Systems on Nanoscale Information fabriCs (SONIC), one of the six SRC STARnet Centers, sponsored by MARCO and DARPA.
  • This work is supported in part by the IBM-ILLINOIS Center for Cognitive Computing Systems Research (C3SR), a research collaboration as part of the IBM AI Horizons Network.

SLIDE 2

The Binarization Problem

  • Binarization of neural networks is a promising direction for complexity reduction.
  • Binary activation functions are unfortunately discontinuous: their derivative is zero almost everywhere.
  • Consequently, training networks with binary activations cannot rely on standard gradient-based learning (see the sketch below).
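To see the problem concretely, here is a minimal PyTorch sketch (PyTorch is our choice for illustration, not the authors' code): a hard 0/1 activation has zero derivative almost everywhere, so no gradient signal reaches the parameters below it.

```python
import torch

x = torch.randn(4, requires_grad=True)
# Hard 0/1 activation built from torch.sign, whose gradient is defined
# as zero everywhere (negatives map to 0, positives to 1).
y = (torch.sign(x) + 1) / 2
loss = y.sum()
loss.backward()
print(x.grad)   # tensor([0., 0., 0., 0.]): the gradient dies at the step
```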

SLIDE 3

Current Approach

  • Treat binary activations as stochastic units (Bengio, 2013).
  • Use a straight-through estimator (STE) of the gradient; a sketch follows below.
  • This was shown to enable training of binary networks (Hubara et al., Rastegari et al., etc.).
  • However, it often comes at the cost of an accuracy loss compared to floating point.
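For reference, a minimal PyTorch sketch of a clipped STE in the style of Hubara et al.; the threshold at zero, the {0, 1} output, and the |x| ≤ 1 gradient window are common conventions assumed here, not details taken from this paper.

```python
import torch

class BinarizeSTE(torch.autograd.Function):
    # Forward: hard 0/1 threshold. Backward: pretend the activation was
    # (roughly) the identity, passing the gradient through where |x| <= 1.
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return (x > 0).float()

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return grad_output * (x.abs() <= 1).float()

x = torch.randn(5, requires_grad=True)
y = BinarizeSTE.apply(x)
y.sum().backward()
print(x.grad)   # nonzero inside the clipping window, despite the hard forward
```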

SLIDE 4

Proposed Method

  • Start with a clipping activation function and learn it to become binary:

    PCF(x) = Clip(x/m + α/2, 0, α)

  • A smaller m means a steeper slope: as m shrinks, the activation function naturally approaches a binarization function, as the sketch below illustrates.
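A small sketch of this activation, assuming the reconstructed form PCF(x) = Clip(x/m + α/2, 0, α); the function name and the test values are illustrative.

```python
import torch

def pcf(x, m, alpha=1.0):
    # Parametric clipping function: a ramp of slope 1/m clipped to [0, alpha].
    # As m -> 0, this approaches the scaled binary step alpha * 1{x >= 0}.
    return torch.clamp(x / m + alpha / 2, min=0.0, max=alpha)

x = torch.linspace(-1.0, 1.0, steps=5)
for m in (1.0, 0.1, 0.001):
    print(m, pcf(x, m))   # the transition region shrinks as m decreases
```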

SLIDE 5

Caveat: Bottleneck Effect

  • The activations cannot all be learned simultaneously, due to the bottleneck effect in the backward computations.
  • Hence, we learn the slopes one layer at a time, as sketched below.

[Diagram: derivatives of the cross-entropy loss flowing backward through the layers to the input]
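A schematic training schedule under this constraint; `PCFLayer`, its slope parameter `m`, and the freezing policy are hypothetical illustrations of the one-layer-at-a-time idea, not the authors' implementation.

```python
import torch
import torch.nn as nn

class PCFLayer(nn.Module):
    """Linear layer followed by a PCF with a learnable slope m (hypothetical)."""
    def __init__(self, d_in, d_out, alpha=1.0):
        super().__init__()
        self.lin = nn.Linear(d_in, d_out)
        self.m = nn.Parameter(torch.tensor(1.0))
        self.alpha = alpha

    def forward(self, x):
        return torch.clamp(self.lin(x) / self.m + self.alpha / 2, 0.0, self.alpha)

layers = nn.ModuleList([PCFLayer(16, 16) for _ in range(3)])

for l in range(len(layers)):
    # Freeze the layers that have already become (effectively) binary ...
    for frozen in layers[:l]:
        for p in frozen.parameters():
            p.requires_grad_(False)
    # ... then run a training phase here that drives layers[l].m toward zero
    # while updating the still-trainable downstream layers.
```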

SLIDE 6

Justification via induction

Step 1 (base case): a baseline network with clipping activation functions.

Input features → Clipping → Clipping → Clipping → Output

(The arrows denote layer-wise operations on the input.)

SLIDE 7

Justification via induction

Step 1 (base case): a baseline network with clipping activation functions.

Input features → Clipping → Clipping → Clipping → Output

Step 2: Replace the first layer's activation with binarization and stop learning the first layer.

Input features → Binary → Clipping → Clipping → Output

(The arrows denote layer-wise operations on the input.)

SLIDE 8

Justification via induction

Step 1 (base case): a baseline network with clipping activation functions.

Input features → Clipping → Clipping → Clipping → Output

Step 2: Replace the first layer's activation with binarization and stop learning the first layer.

Input features → Binary → Clipping → Clipping → Output

Equivalently, we obtain a new network whose inputs are binary (made concrete in the sketch below):

Binary inputs → Clipping → Clipping → Output

(The arrows denote layer-wise operations on the input.)
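The induction step can be made concrete: once the first layer is binary and frozen, its outputs can be precomputed, leaving a strictly shorter network over binary inputs. A small sketch, with `alpha`, the layer, and the batch as illustrative stand-ins:

```python
import torch
import torch.nn as nn

alpha = 1.0
layer1 = nn.Linear(16, 16)     # first layer, already binarized and frozen
x_batch = torch.randn(8, 16)

# With a binary activation, layer 1's output lies in {0, alpha}^16 for every
# input, so it can be precomputed once:
with torch.no_grad():
    binary_inputs = (layer1(x_batch) > 0).float() * alpha

# The remaining layers now form a standalone network trained on
# `binary_inputs`: exactly the base case with one fewer layer.
```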

SLIDE 9

Justification via induction

Step 1 (base case): a baseline network with clipping activation functions.

Input features → Clipping → Clipping → Clipping → Output

Step 2: Replace the first layer's activation with binarization and stop learning the first layer; equivalently, a new network over binary inputs:

Binary inputs → Clipping → Clipping → Output

Step L − 1: binary inputs and an even shorter network.

Binary inputs → Clipping → Output

(The arrows denote layer-wise operations on the input.)

SLIDE 10

Justification via induction

Step 1 (base case): a baseline network with clipping activation functions.

Input features → Clipping → Clipping → Clipping → Output

Step 2: Replace the first layer's activation with binarization and stop learning the first layer; equivalently, a new network over binary inputs:

Binary inputs → Clipping → Clipping → Output

Step L − 1: binary inputs and an even shorter network.

Binary inputs → Clipping → Output

Step L: binary inputs and a very short network.

Binary inputs → Output

(The arrows denote layer-wise operations on the input.)

SLIDE 11

Justification via induction

Step 1 (base case): a baseline network with clipping activation functions.

Input features → Clipping → Clipping → Clipping → Output

Step 2: Replace the first layer's activation with binarization and stop learning the first layer; equivalently, a new network over binary inputs:

Binary inputs → Clipping → Clipping → Output

Step L − 1: binary inputs and an even shorter network.

Binary inputs → Clipping → Output

Step L: binary inputs and a very short network.

Binary inputs → Output

Further analysis required!

(The arrows denote layer-wise operations on the input.)

SLIDE 12

Analysis

  • The mean squared error of approximating the PCF by an SBAF decreases linearly in m.

Hence, the perturbation magnitude decreases with m (a short calculation follows below).

PCF: Parametric Clipping Function. SBAF: Scaled Binary Activation Function.
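A sketch of the calculation behind the linear-in-m claim, assuming the PCF form above and reading the SBAF as SBAF(x) = α·1{x ≥ 0}; the two functions then differ only on the transition region |x| ≤ mα/2:

```latex
\int_{-m\alpha/2}^{\,m\alpha/2}\big(\mathrm{PCF}(x)-\mathrm{SBAF}(x)\big)^{2}\,dx
  \;=\; 2\int_{0}^{m\alpha/2}\Big(\frac{\alpha}{2}-\frac{x}{m}\Big)^{2}dx
  \;=\; \frac{m\,\alpha^{3}}{12}
```

which indeed shrinks linearly in m for a fixed clipping level α.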

SLIDE 13

Analysis

  • The mean squared error of approximating the PCF by an SBAF decreases linearly in m.
  • There is no mismatch between using the SBAF and the PCF provided the perturbation magnitude is bounded.

Hence, the perturbation magnitude decreases with m. Backtracking: small m → small perturbation → less mismatch.

PCF: Parametric Clipping Function. SBAF: Scaled Binary Activation Function.

SLIDE 14

Analysis

  • The mean squared error of approximating the PCF by an SBAF decreases linearly in m.
  • There is no mismatch between using the SBAF and the PCF provided the perturbation magnitude is bounded.

Hence, the perturbation magnitude decreases with m. Backtracking: small m → small perturbation → less mismatch.

How do we make m small?

PCF: Parametric Clipping Function. SBAF: Scaled Binary Activation Function.

SLIDE 15

Regularization

  • We add a regularization term λ on m when learning it (L2 and/or L1 regularization).
  • The optimal λ value is usually found by tuning (a common issue with regularization).
  • We have some empirical guidelines from our experiments:
    – Layer type 1: fully connected
    – Layer type 2: convolution preceding a convolution
    – Layer type 3: convolution preceding a pooling layer
    – We have observed that λ1 > λ2 > λ3 is a good strategy; see the sketch below.
SLIDE 16

Convergence

  • Blue curve: the network obtained by binarizing activations up to layer l.
  • Orange curve: the completely binary-activated network.
  • As training evolves, the network becomes completely binary and the two curves meet.
  • The final accuracy is very close to the initial one, i.e., the baseline.
SLIDE 17

Comparison with STE

  • Our method consistently outperforms binarization via the STE.

[Table: summary of test errors]

SLIDE 18

Conclusion & Future Work

  • We presented a novel method for binarizing the activations of deep neural networks:
    → The method leverages true gradient-based learning.
    → Consequently, the obtained results consistently outperform conventional binarization via the STE.
  • Future work:
    → Experiments on larger datasets.
    → Combining the proposed activation binarization with weight binarization.
    → Extension to multi-bit activations.

SLIDE 19

Thank you!