Training Binary Neural Networks Using the Bayesian Learning Rule

SLIDE 1

Training Binary Neural Networks Using the Bayesian Learning Rule


Thirty-seventh International Conference on Machine Learning (ICML 2020)

Xiangming Meng (RIKEN AIP), Presenter

Roman Bachmann (EPFL)

Mohammad Emtiyaz Khan (RIKEN AIP)

SLIDE 2

Binary Neural Networks (BiNN)

  • BiNN: Neural Networks with binary weights
  • Much faster and much smaller [1,2]
  • Difficult to optimize in theory (discrete optimization)
  • But easy in practice: just use SGD with the “Straight-through estimator (STE)”!
  • It is a mystery why this works [3]
  • Are there any principled approaches to explain this?


  • 1. Courbariaux et al. Training deep neural networks with binary weights during propagations. NeurIPS, 2015.
  • 2. Courbariaux et al. Binarized neural networks. arXiv:1602.02830, 2016.
  • 3. Yin, P. et al. Understanding straight-through estimator in training activation quantized neural nets. arXiv, 2019.

SLIDE 3

Our Contribution: Training BiNN using Bayes

  • We show that by using the Bayesian Learning Rule [1,2] (natural-gradient variational inference), we can justify such previous approaches
  • Main point: optimize the parameters of a Bernoulli distribution (a continuous optimization problem)
  • The Bayesian approach gives us an estimate of uncertainty, which can be used for continual learning [3]


  • 1. Khan, M. E. and Rue, H. Learning-algorithms from Bayesian principles. arXiv, 2019.
  • 2. Khan, M. E. and Lin, W. Conjugate-computation variational inference. AISTATS, 2017.
  • 3. Kirkpatrick, J. et al. Overcoming catastrophic forgetting in neural networks. PNAS, 114(13):3521–3526, 2017.

SLIDE 4

Training BiNN is a Discrete Optimization problem!


[Figure: a neural network with binary weights, mapping the input to the output and a loss.]

SLIDE 5

Training BiNN is a Discrete Optimization problem!

  • Easy in practice: SGD with the “Straight-through estimator (STE)” [1]


[Figure: a neural network with binary weights (input, output, loss); the binary weights are obtained from real-valued “latent” weights.]

  • 1. Bengio et al. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv:1308.3432, 2013.
  • 2. Helwegen et al. Latent weights do not exist: Rethinking binarized neural network optimization. arXiv preprint arXiv:1906.02107, 2019.
  • 3. Yin, P. et al., Understanding straight-through estimator in training activation quantized neural nets. arXiv, 2019.


SLIDE 6

Training BiNN is a Discrete Optimization problem!

  • Easy in practice: SGD with the “Straight-through estimator (STE)” [1]
  • Helwegen et al. [2] argued that the “latent” weights are not weights but “inertia”, and proposed the Binary Optimizer (Bop)


[Figure: a neural network with binary weights (input, output, loss); the binary weights are obtained from real-valued “latent” weights.]

  • 1. Bengio et al. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv:1308.3432, 2013.
  • 2. Helwegen et al. Latent weights do not exist: Rethinking binarized neural network optimization. arXiv preprint arXiv:1906.02107, 2019.
  • 3. Yin, P. et al., Understanding straight-through estimator in training activation quantized neural nets. arXiv, 2019.


SLIDE 7

Training BiNN is a Discrete Optimization problem!

  • Easy in practice: SGD with the “Straight-through estimator (STE)” [1]
  • Helwegen et al. [2] argued that the “latent” weights are not weights but “inertia”, and proposed the Binary Optimizer (Bop)


[Figure: a neural network with binary weights (input, output, loss); the binary weights are obtained from real-valued “latent” weights.]

  • 1. Bengio et al. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv:1308.3432, 2013.
  • 2. Helwegen et al. Latent weights do not exist: Rethinking binarized neural network optimization. arXiv preprint arXiv:1906.02107, 2019.
  • 3. Yin, P. et al., Understanding straight-through estimator in training activation quantized neural nets. arXiv, 2019.


  • Open question: why does this work? [3] (a minimal STE sketch follows below)
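To make the recipe concrete, here is a minimal NumPy sketch of SGD with the straight-through estimator on a toy logistic-regression model with binary weights. The data, model, and learning rate are illustrative assumptions, not the setup used in the paper.

```python
# Minimal STE sketch: binarize the "latent" real-valued weights in the forward
# pass, then copy the gradient straight through onto the latent weights as if
# the sign() function were the identity. Toy data; not the paper's setup.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                              # toy inputs
y = (X @ np.sign(rng.normal(size=10)) > 0).astype(float)    # toy labels

w_latent = rng.normal(scale=0.1, size=10)   # real-valued "latent" weights
lr = 0.1

for step in range(100):
    w_bin = np.sign(w_latent)                      # forward pass: binary weights
    p = 1.0 / (1.0 + np.exp(-(X @ w_bin)))         # sigmoid predictions
    grad_wbin = X.T @ (p - y) / len(y)             # gradient w.r.t. binary weights
    w_latent -= lr * grad_wbin                     # STE: apply it to the latent weights

w_bin = np.sign(w_latent)
p = 1.0 / (1.0 + np.exp(-(X @ w_bin)))
print("train accuracy with binarized weights:", ((p > 0.5) == y).mean())
```

The real-valued w_latent above is exactly the quantity that Helwegen et al. [2] reinterpret as “inertia” rather than as weights.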
SLIDE 8
BayesBiNN

  • Problem reformulation: optimize a distribution over the weights [1,2]
  • Main point: optimize the parameters of a Bernoulli distribution (a continuous optimization problem)

min_{q(w)}  𝔼_{q(w)}[ℓ(w)] + KL(q(w) ‖ p(w))

where q(w) is the posterior approximation over the weights, ℓ is the loss, and p(w) is the prior distribution.

  • 1. Zellner, A. Optimal information processing and Bayes’s theorem. The American Statistician, 42(4):278–280, 1988.
  • 2. Bissiri et al. A general framework for updating belief distributions. Journal of the Royal Statistical Society, 78(5):1103–1130, 2016.

SLIDE 9
BayesBiNN

  • Problem reformulation: optimize a distribution over the weights [1,2]
  • Main point: optimize the parameters of a Bernoulli distribution (a continuous optimization problem)

min_{q(w)}  𝔼_{q(w)}[ℓ(w)] + KL(q(w) ‖ p(w))

  • q(w) is chosen to be a mean-field Bernoulli distribution (a small helper sketch follows below):

q(w) = ∏_{i=1}^{D} p_i^{(1 + w_i)/2} (1 − p_i)^{(1 − w_i)/2} = ∏_{i=1}^{D} exp[λ_i φ(w_i) − A(λ_i)],   w_i ∈ {−1, +1}

where p_i is the probability of w_i = +1 and the natural parameters are λ_i := (1/2) log(p_i / (1 − p_i)).

  • 1. Zellner, A. Optimal information processing and Bayes’s theorem. The American Statistician, 42(4):278–280, 1988.
  • 2. Bissiri et al. A general framework for updating belief distributions. Journal of the Royal Statistical Society, 78(5):1103–1130, 2016.
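As a small illustration of this parameterization, the helpers below (assumed names, not from the paper's released code) convert between the probability p_i, the natural parameter λ_i, and the mean tanh(λ_i), and sample binary weights from q(w).

```python
# Small helper sketch for the mean-field Bernoulli parameterization above
# (assumed helper names, not from the paper's released code).
import numpy as np

def lam_from_prob(p):
    """Natural parameter: lambda_i = 0.5 * log(p_i / (1 - p_i))."""
    return 0.5 * np.log(p / (1.0 - p))

def prob_from_lam(lam):
    """Inverse map: p_i = sigmoid(2 * lambda_i)."""
    return 1.0 / (1.0 + np.exp(-2.0 * lam))

def mean_from_lam(lam):
    """Expectation parameter: mu_i = E[w_i] = 2 * p_i - 1 = tanh(lambda_i)."""
    return np.tanh(lam)

def sample_weights(lam, rng):
    """Draw binary weights w_i in {-1, +1} from q(w)."""
    p = prob_from_lam(lam)
    return np.where(rng.random(lam.shape) < p, 1.0, -1.0)

rng = np.random.default_rng(0)
lam = rng.normal(size=5)
print(prob_from_lam(lam))       # probabilities of w_i = +1
print(mean_from_lam(lam))       # means in (-1, 1)
print(sample_weights(lam, rng))
```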

SLIDE 10
BayesBiNN

  • The Bayesian learning rule [1] (natural-gradient variational inference):

λ ← (1 − α) λ − α [∇_μ 𝔼_{q(w)}[ℓ(w)] − λ_0]

where λ is the natural parameter of q(w), μ is the expectation parameter of q(w), λ_0 is the natural parameter of the prior p(w), and α is the learning rate.

  • 1. Khan, M. E. and Rue, H. Learning-algorithms from Bayesian principles. arXiv, 2019.
  • 2. Maddison et al. The concrete distribution: A continuous relaxation of discrete random variables. arXiv:1611.00712, 2016.
  • 3. Jang et al. Categorical reparameterization with Gumbel-softmax. arXiv:1611.01144, 2016.

SLIDE 11
BayesBiNN

  • The Bayesian learning rule [1] (natural-gradient variational inference):

λ ← (1 − α) λ − α [∇_μ 𝔼_{q(w)}[ℓ(w)] − λ_0]

where λ is the natural parameter of q(w), μ is the expectation parameter of q(w), λ_0 is the natural parameter of the prior p(w), and α is the learning rate.

  • How to compute the natural gradient ∇_μ 𝔼_{q(w)}[ℓ(w)]?

  • 1. Khan, M. E. and Rue, H. Learning-algorithms from Bayesian principles. arXiv, 2019.
  • 2. Maddison et al. The concrete distribution: A continuous relaxation of discrete random variables. arXiv:1611.00712, 2016.
  • 3. Jang et al. Categorical reparameterization with Gumbel-softmax. arXiv:1611.01144, 2016.

SLIDE 12
BayesBiNN

  • The Bayesian learning rule [1] (natural-gradient variational inference):

λ ← (1 − α) λ − α [∇_μ 𝔼_{q(w)}[ℓ(w)] − λ_0]

where λ is the natural parameter of q(w), μ is the expectation parameter of q(w), λ_0 is the natural parameter of the prior p(w), and α is the learning rate.

  • Using the Gumbel-softmax trick [2,3], we can approximate the natural gradient by the minibatch gradient (a minimal one-step sketch follows below):

∇_μ 𝔼_{q(w)}[ℓ(w)] ≈ s ⊙ g

where g is the minibatch gradient (easy to compute!) and s is a scale vector.

  • 1. Khan, M. E. and Rue, H. Learning-algorithms from Bayesian principles. arXiv, 2019.
  • 2. Maddison et al. The concrete distribution: A continuous relaxation of discrete random variables. arXiv:1611.00712, 2016.
  • 3. Jang et al. Categorical reparameterization with Gumbel-softmax. arXiv:1611.01144, 2016.
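Putting these pieces together, here is a minimal one-update-per-iteration sketch of a BayesBiNN-style training loop on toy logistic-regression data. The temperature, learning rate, clipping, and the omission of the dataset-size scaling of the loss are simplifying assumptions; the scale vector s is obtained from the chain rule through the tanh relaxation.

```python
# Minimal BayesBiNN-style update sketch (toy data and hyperparameters are
# illustrative assumptions; the exact loss scaling in the paper may differ).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                              # toy inputs
y = (X @ np.sign(rng.normal(size=10)) > 0).astype(float)    # toy labels

lam = np.zeros(10)          # natural parameters of q(w)
lam0 = np.zeros(10)         # natural parameters of the prior (uniform prior)
alpha, tau = 0.1, 1.0       # learning rate and temperature (assumed values)

for step in range(200):
    # Gumbel-softmax / concrete relaxation of the binary weights
    u = rng.random(10)
    delta = 0.5 * np.log(u / (1.0 - u))        # logistic noise
    w_r = np.tanh((lam + delta) / tau)         # relaxed weights in (-1, 1)

    # Minibatch gradient g of the loss w.r.t. the relaxed weights
    p = 1.0 / (1.0 + np.exp(-(X @ w_r)))
    g = X.T @ (p - y) / len(y)

    # Scale vector from the chain rule through the relaxation:
    # dw_r/dmu = (1 - w_r^2) / (tau * (1 - tanh(lam)^2))
    s = (1.0 - w_r**2) / (tau * (1.0 - np.tanh(lam)**2))

    # Bayesian learning rule update of the natural parameters
    lam = (1.0 - alpha) * lam - alpha * (s * g - lam0)
    lam = np.clip(lam, -10.0, 10.0)            # keep the toy sketch numerically stable

w_hat = np.sign(lam)                           # most probable binary weights
p = 1.0 / (1.0 + np.exp(-(X @ w_hat)))
print("train accuracy with sign(lambda):", ((p > 0.5) == y).mean())
```

With a uniform prior (λ_0 = 0), the update is an exponential moving average of the scaled minibatch gradients, which is what later connects BayesBiNN to the “exponential average” used in Bop.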

SLIDE 13

BayesBiNN Justifies Some Previous Methods


  • Main point 1: STE works as a special case of BayesBiNN as τ → 0

Note that the natural parameter λ in BayesBiNN plays the role of the “latent” weights w_r.

SLIDE 14

BayesBiNN Justifies Some Previous Methods


  • Main point 1: STE works as a special case of BayesBiNN as τ → 0

Note that the natural parameter λ in BayesBiNN plays the role of the “latent” weights w_r.

SLIDE 15

BayesBiNN Justifies Some Previous Methods


  • Main point 1: STE works as a special case of BayesBiNN as τ → 0
  • Main point 2: BayesBiNN justifies the “exponential average” used in Bop

Note that the natural parameter λ in BayesBiNN plays the role of the “latent” weights w_r. (A small numerical check of the τ → 0 limit follows below.)
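The sketch below is a tiny numerical check, with arbitrary values of λ and δ, that the relaxed weight tanh((λ + δ)/τ) hardens to sign(λ + δ) as τ → 0, which is the limit in which BayesBiNN reduces to an STE-style update.

```python
# Tiny numerical check: as tau -> 0, tanh((lambda + delta) / tau) approaches
# sign(lambda + delta), i.e. the relaxed weights become hard binary weights.
# The values of lambda and delta below are arbitrary illustrations.
import numpy as np

lam = np.array([0.8, -0.3, 1.5, -2.0])
delta = np.array([0.1, -0.2, 0.05, 0.3])     # one fixed logistic-noise sample

for tau in [1.0, 0.1, 0.01]:
    w_r = np.tanh((lam + delta) / tau)
    print(f"tau={tau:5.2f}  w_r={np.round(w_r, 3)}")
print("sign(lam + delta) =", np.sign(lam + delta))
```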

SLIDE 16
Uncertainty Estimation

  • Main point: BayesBiNN obtains uncertainty estimates around the classification boundaries
  • STE finds a deterministic boundary

p̂_k ← (1/C) ∑_{c=1}^{C} p(y = k | x, w^(c)),   w^(c) ~ q(w),   C = 10

(a minimal sketch of this Monte Carlo average follows below)

[Figure: classification on the two moons dataset.]

  • Open-source code available: https://github.com/team-approx-bayes/BayesBiNN
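A minimal sketch of the Monte Carlo predictive average above; predict_proba is a toy placeholder model standing in for the binary neural network.

```python
# Minimal sketch of the Monte Carlo predictive average: draw C binary weight
# samples from q(w) and average the class probabilities.
import numpy as np

def predict_proba(x, w):
    """Toy class-probability model p(y = 1 | x, w); placeholder for the BiNN."""
    return 1.0 / (1.0 + np.exp(-(x @ w)))

def mc_predict(x, lam, C=10, rng=None):
    """Average predictions over C binary weight samples w ~ q(w)."""
    rng = rng or np.random.default_rng(0)
    p_plus = 1.0 / (1.0 + np.exp(-2.0 * lam))    # probability of w_i = +1
    probs = []
    for _ in range(C):
        w = np.where(rng.random(lam.shape) < p_plus, 1.0, -1.0)   # w ~ q(w)
        probs.append(predict_proba(x, w))
    return np.mean(probs, axis=0)                # (1/C) * sum_c p(y | x, w^(c))

lam = np.array([2.0, -0.1, 0.5])                 # example natural parameters
x = np.array([[1.0, 0.2, -0.5]])
print(mc_predict(x, lam))                        # averaged predictive probability
```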

SLIDE 17

[Figure: predictive uncertainty on the two moons dataset, BayesBiNN vs. STE.]

  • Open-source Code Available : https://github.com/team-approx-bayes/BayesBiNN
SLIDE 18
Uncertainty Provided by BayesBiNN Enables Continual Learning

  • Continual learning (CL): sequentially learning new tasks without forgetting old ones [1]
  • Common method for overcoming catastrophic forgetting: regularizing the weights
  • But it is unclear how to regularize the binary weights of a BiNN when using STE/Bop
  • Main point: BayesBiNN enables continual learning for BiNN, using the intrinsic KL divergence as regularization

SLIDE 19

Uncertainty Provided by BayesBiNN Enables Continual Learning

  • Continual learning (CL): sequentially learning new tasks without forgetting old ones [1]
  • Common method for overcoming catastrophic forgetting: regularizing the weights
  • But it is unclear how to regularize the binary weights of a BiNN when using STE/Bop
  • Main point: BayesBiNN enables continual learning for BiNN, using the intrinsic KL divergence as regularization
  • In BayesBiNN, there is one natural solution using the KL divergence

Independent learning (the prior is the uniform distribution p(w)):

min_{q_t(w)}  𝔼_{q_t(w)}[ ∑_{i ∈ D_t} ℓ(y_i^t, f_w(x_i^t)) ] + KL(q_t(w) ‖ p(w))

  • 1. Kirkpatrick, J. et al. Overcoming catastrophic forgetting in neural networks. PNAS, 114(13):3521–3526, 2017.

SLIDE 20

Uncertainty Provided by BayesBiNN Enables Continual Learning

  • Continual learning (CL): sequentially learning new tasks without forgetting old ones [1]
  • Common method for overcoming catastrophic forgetting: regularizing the weights
  • But it is unclear how to regularize the binary weights of a BiNN when using STE/Bop
  • Main point: BayesBiNN enables continual learning for BiNN, using the intrinsic KL divergence as regularization
  • In BayesBiNN, there is one natural solution using the KL divergence

Continual learning (the prior is q_{t−1}(w), the posterior approximation after task t − 1):

min_{q_t(w)}  𝔼_{q_t(w)}[ ∑_{i ∈ D_t} ℓ(y_i^t, f_w(x_i^t)) ] + KL(q_t(w) ‖ q_{t−1}(w))

  • 1. Kirkpatrick, J. et al. Overcoming catastrophic forgetting in neural networks. PNAS, 114(13):3521–3526, 2017.

SLIDE 21

Uncertainty Provided by BayesBiNN Enables Continual Learning

  • Continual learning (CL): sequentially learning new tasks without forgetting old ones [1]
  • Common method for overcoming catastrophic forgetting: regularizing the weights
  • But it is unclear how to regularize the binary weights of a BiNN when using STE/Bop
  • Main point: BayesBiNN enables continual learning for BiNN, using the intrinsic KL divergence as regularization
  • In BayesBiNN, there is one natural solution using the KL divergence (see the sketch below)

Continual learning (the prior is q_{t−1}(w), the posterior approximation after task t − 1):

min_{q_t(w)}  𝔼_{q_t(w)}[ ∑_{i ∈ D_t} ℓ(y_i^t, f_w(x_i^t)) ] + KL(q_t(w) ‖ q_{t−1}(w))

The resulting update replaces the prior's natural parameter with λ_{t−1}, the natural parameter learned after task t − 1:

λ_t ← (1 − α) λ_t − α [s ⊙ g − λ_{t−1}]

  • 1. Kirkpatrick, J. et al. Overcoming catastrophic forgetting in neural networks. PNAS, 114(13):3521–3526, 2017.
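A minimal sketch of the continual-learning variant: the only change from the earlier single-task sketch is that the prior's natural parameter is the posterior natural parameter learned on the previous task. The grad_fn argument, the hyperparameters, and the task chaining shown in the usage comment are illustrative placeholders, not the paper's exact recipe.

```python
# Minimal continual-learning sketch: regularize task t towards the posterior of
# task t-1 by using lam_prior as the prior's natural parameter in the update.
import numpy as np

def bayesbinn_task(lam_prior, grad_fn, alpha=0.1, tau=1.0, steps=100, seed=0):
    """Run BayesBiNN-style updates for one task, starting from (and regularized
    towards) the previous task's natural parameters lam_prior."""
    rng = np.random.default_rng(seed)
    lam = lam_prior.copy()
    for _ in range(steps):
        u = rng.random(lam.shape)
        delta = 0.5 * np.log(u / (1.0 - u))                  # logistic noise
        w_r = np.tanh((lam + delta) / tau)                   # relaxed binary weights
        g = grad_fn(w_r)                                     # minibatch gradient on this task
        s = (1.0 - w_r**2) / (tau * (1.0 - np.tanh(lam)**2))
        # KL(q_t || q_{t-1}) regularization shows up as lam_prior in the update:
        lam = (1.0 - alpha) * lam - alpha * (s * g - lam_prior)
        lam = np.clip(lam, -10.0, 10.0)                      # numerical safety for the sketch
    return lam

# Hypothetical usage: lambda = 0 (uniform prior) for the first task,
# then chain posteriors across tasks:
#   lam1 = bayesbinn_task(np.zeros(D), grad_task1)
#   lam2 = bayesbinn_task(lam1, grad_task2)
```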

SLIDE 22

Uncertainty Provided by BayesBiNN Enables Continual Learning

  • Main point: BayesBiNN avoids the catastrophic forgetting problem

[Figure: test accuracy on permuted MNIST during sequential training on task 1, task 2, and task 3, comparing our method against catastrophic forgetting of task 1. Note: for other tasks, refer to the paper.]
SLIDE 23

Uncertainty Provided by BayesBiNN Enables Continual Learning

  • Main point: BayesBiNN avoids the catastrophic forgetting problem
  • As the number of tasks increases, the distribution over the binary weights becomes more and more deterministic

[Figure: test accuracy on permuted MNIST during sequential training on task 1, task 2, and task 3, comparing our method against catastrophic forgetting of task 1. Note: for other tasks, refer to the paper.]

  • Open-source code available: https://github.com/team-approx-bayes/BayesBiNN

SLIDE 24


Summary

  • BiNN: neural networks with binary weights
  • Much faster and much smaller, but difficult to optimize
  • Gradient-based methods work well but are not well understood
  • We proposed a principled approach to train BiNN using the Bayesian Learning Rule, which justifies some previous approaches
  • The Bayesian approach also gives us an estimate of uncertainty, which can be used for continual learning

SLIDE 25

Thank you! Q&A
