SLIDE 1

Open Problems in Deep Learning: A Bayesian solution

Dmitry P. Vetrov
Research professor at HSE, Head of Bayesian methods research group
http://bayesgroup.ru

SLIDE 2

Idea of the talk

SLIDE 3

Deep Learning

  • Revolution in machine learning
  • Deep neural networks approach human intelligence on a number of problems
  • They may solve quite non-standard problems such as image2caption and artistic style transfer

SLIDE 4

Open problems in Deep Learning

  • Overfitting
    Neural networks are prone to catastrophic overfitting on noisy data
  • Interpretability
    Nobody knows HOW a neural network makes its decisions – crucial for healthcare and finance. Legislative restrictions are expected
  • Uncertainty estimation
    Current neural networks are very over-confident even when they make mistakes. In many applications (e.g. self-driving cars) it is important to estimate the uncertainty of a prediction
  • Adversarial examples
    Neural networks can be easily fooled by barely visible perturbations of the data

SLIDES 5-8

Bayesian framework

  • Treats everything as a random variable
  • Allows us to encode our ignorance in terms of distributions
  • Makes use of the Bayes theorem (stated below)
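The slides only name the theorem; for reference, its standard statement for parameters \theta and observed data D is

    p(\theta \mid D) \;=\; \frac{p(D \mid \theta)\, p(\theta)}{p(D)},
    \qquad
    p(D) \;=\; \int p(D \mid \theta)\, p(\theta)\, d\theta .

The prior p(\theta) encodes our ignorance before seeing the data; the posterior p(\theta \mid D) encodes what remains of it afterwards.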

SLIDE 9

Frequentist vs. Bayesian frameworks

SLIDE 10

Frequentist vs. Bayesian frameworks

  • It can be shown that the posterior concentrates at the maximum-likelihood estimate as the data outgrow the parameters (the limit is written out below)
  • In other words, the frequentist framework is a limit case of the Bayesian one!
  • The number of tunable parameters d in modern ML models is comparable with the size n of the training data
  • We have no choice but to be Bayesian!
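The formula the first bullet refers to is lost in this transcript; the usual statement of the limit, with n data points and d parameters, is

    \lim_{n/d \to \infty} p(\theta \mid D) \;=\; \delta\!\left(\theta - \theta_{\mathrm{ML}}\right),
    \qquad
    \theta_{\mathrm{ML}} \;=\; \arg\max_{\theta} p(D \mid \theta) .

Intuitively, the log-posterior is a sum of n likelihood terms plus a single fixed log-prior term, so the likelihood dominates as n/d grows and the posterior collapses onto the maximum-likelihood point. When d is comparable to n, as the next bullet says, the limit does not apply and the two frameworks genuinely differ.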

SLIDE 11

Bayesian neural networks

  • In Bayesian DNNs we treat the weights ω of the neural network as random variables
  • First we define a reasonable prior p(ω)
  • Next, given the training data D, we perform Bayesian inference and derive the posterior p(ω | D) (a minimal sketch follows)
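A toy numpy version of these three steps (illustrative, not from the talk; the architecture, prior, and noise level are assumptions), computing the unnormalized log-posterior log p(ω | D) ∝ log p(D | ω) + log p(ω) for a one-hidden-layer regression network:

    import numpy as np

    # Minimal sketch: the weights ω live in the flat 7-vector w
    rng = np.random.default_rng(0)
    X = rng.standard_normal((20, 1))                 # toy training inputs
    y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(20)

    def forward(w, X):
        W1, b1, W2, b2 = w[:2], w[2:4], w[4:6], w[6]
        h = np.tanh(X @ W1[None, :] + b1)            # hidden layer, 2 units
        return h @ W2 + b2

    def log_posterior(w, sigma=0.1):
        log_prior = -0.5 * np.sum(w ** 2)            # prior p(ω) = N(0, I)
        resid = y - forward(w, X)
        log_lik = -0.5 * np.sum(resid ** 2) / sigma ** 2  # Gaussian noise model
        return log_lik + log_prior                   # ∝ log p(ω | D)

    print(log_posterior(rng.standard_normal(7)))

The normalizing constant p(D) is intractable for real networks, which is why the approximate-inference machinery on the following slides is needed.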

SLIDES 12-16

Advantages of the Bayesian framework

  • Regularization
    Prevents overfitting on the training data, because the prior does not allow the parameters to be tuned too aggressively
  • Extensibility
    Bayesian inference results in a posterior, which can then be used as the prior for the next model
  • Ensembling
    The posterior distribution over the weights defines an ensemble of neural networks rather than a single network
  • Model selection
    Automatically selects the simplest possible model that explains the observed data, thus implementing Occam's razor
  • Scalability
    Stochastic variational inference allows us to approximate posteriors using deep neural networks (a minimal sketch follows)
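A dependency-free toy version of stochastic variational inference (illustrative, not from the talk): fit q(ω) = N(μ, σ²) to the posterior of a 1-D Gaussian-mean model – prior ω ~ N(0, 1), observations ~ N(ω, 1) – by Monte-Carlo ascent on the ELBO with the reparameterization trick ω = μ + σε. The conjugate model is chosen so the answer can be checked; finite differences with shared random numbers stand in for the autodiff a real BNN would use:

    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.normal(2.0, 1.0, size=100)            # toy observations

    def elbo(mu, log_sigma, eps):
        sigma = np.exp(log_sigma)
        w = mu + sigma * eps                         # reparameterized samples
        log_lik = np.mean([-0.5 * np.sum((data - wi) ** 2) for wi in w])
        kl = 0.5 * (mu ** 2 + sigma ** 2 - 1.0) - log_sigma  # KL(q || prior)
        return log_lik - kl                          # Monte-Carlo ELBO

    mu, log_sigma, lr, h = 0.0, 0.0, 1e-3, 1e-4
    for step in range(2000):
        eps = rng.standard_normal(32)                # shared across both evals
        g_mu = (elbo(mu + h, log_sigma, eps) - elbo(mu - h, log_sigma, eps)) / (2 * h)
        g_ls = (elbo(mu, log_sigma + h, eps) - elbo(mu, log_sigma - h, eps)) / (2 * h)
        mu, log_sigma = mu + lr * g_mu, log_sigma + lr * g_ls

    n = data.size                                    # exact conjugate posterior
    print(mu, np.exp(log_sigma))                     # fitted q(ω)
    print(n * data.mean() / (n + 1), (1.0 / (n + 1)) ** 0.5)

The same recipe – sample, estimate the ELBO, follow its stochastic gradient – is what scales to millions of network weights.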

SLIDE 17

Dropout

  • Purely heuristic regularization procedure
  • Injects either Bernoulli or Gaussian noise into the weights during training
  • The magnitudes of the noise are set manually (both noise types are sketched below)
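A minimal sketch of the two noise types (illustrative; p and alpha below are exactly the manually chosen magnitudes being criticized):

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.standard_normal((4, 3))                  # some layer's weights
    p, alpha = 0.5, 0.5                              # hand-picked noise levels

    # Bernoulli dropout: zero a weight with probability p,
    # rescaling by 1/(1-p) to keep the expectation unchanged
    mask = rng.binomial(1, 1 - p, size=W.shape) / (1 - p)
    W_bernoulli = W * mask

    # Gaussian dropout: multiply by noise with mean 1 and variance alpha
    noise = 1.0 + np.sqrt(alpha) * rng.standard_normal(W.shape)
    W_gaussian = W * noise

    print(W_bernoulli, W_gaussian, sep="\n\n")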

SLIDE 18

Bayesian dropout

  • Theoretically justified procedure
  • Corresponds to training a Bayesian ensemble under a specific but interpretable prior
  • Allows dropout rates to be defined automatically (see the sketch below)
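The key identity (assuming the Gaussian-dropout formulation of variational dropout, as in Kingma et al., 2015; the numbers below are illustrative): multiplying a weight by N(1, α) noise is the same as sampling it from q(ω) = N(θ, αθ²), so α becomes a variational parameter learned by the same ELBO ascent as in the earlier sketch, and the dropout rate falls out of it:

    import numpy as np

    rng = np.random.default_rng(0)
    theta = rng.standard_normal((4, 3))              # variational weight means
    log_alpha = np.full((4, 3), -1.0)                # learned, not hand-picked

    eps = rng.standard_normal(theta.shape)
    w = theta * (1.0 + np.sqrt(np.exp(log_alpha)) * eps)   # w ~ q(ω)

    # Equivalent per-weight Bernoulli dropout rate: p = α / (1 + α)
    p = np.exp(log_alpha) / (1.0 + np.exp(log_alpha))
    print(p)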

SLIDE 19

Visualization

[Figure: LeNet-5, convolutional layer; LeNet-5, fully-connected layer (100 x 100 patch)]

SLIDE 20

Avoiding narrow extrema

  • [Stochastic variational] Bayesian inference corresponds to the injection of noise into the gradients
  • The larger the noise, the lower the spatial resolution
  • A Bayesian DNN simply DOES NOT SEE narrow local minima (a toy illustration follows)
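A toy numeric illustration (not from the talk) of the claim: averaging the loss over weight noise – which is effectively what noisy-gradient training optimizes – erases a narrow minimum while leaving a wide one in place:

    import numpy as np

    rng = np.random.default_rng(0)

    def loss(w):
        narrow = -1.5 * np.exp(-(w / 0.05) ** 2)         # deep but very narrow
        wide = -1.0 * np.exp(-((w - 3.0) / 1.0) ** 2)    # shallower but wide
        return narrow + wide

    def smoothed_loss(w, sigma=0.3, n=10000):
        eps = sigma * rng.standard_normal(n)             # weight noise
        return loss(w + eps).mean()                      # E[loss(w + ε)]

    ws = np.linspace(-1.0, 5.0, 601)
    raw = loss(ws)
    smooth = np.array([smoothed_loss(w) for w in ws])
    print("raw argmin:   ", ws[raw.argmin()])     # ≈ 0.0: narrow minimum wins
    print("smooth argmin:", ws[smooth.argmin()])  # ≈ 3.0: it has vanished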

SLIDE 21

Avoiding catastrophic overfitting

  • Bayesian model selection procedures effectively apply the well-known Occam's razor (the criterion is written out below)
  • They search for the simplest model capable of explaining the training data
  • If there are no dependencies between inputs and outputs, a Bayesian DNN will never be able to learn them, since there always exists a simpler NULL-model

"With all things being equal, the simplest explanation tends to be the right one."
– William of Ockham
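The criterion implementing the razor is the model evidence (marginal likelihood): models are compared by how much probability they assign to the observed data with the parameters integrated out,

    p(D \mid M) \;=\; \int p(D \mid \omega, M)\, p(\omega \mid M)\, d\omega,
    \qquad
    M^{*} \;=\; \arg\max_{M} p(D \mid M) .

A flexible model spreads its prior mass over many possible datasets, so when the data carry no input-output dependency the simpler NULL-model assigns them higher evidence and wins.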

SLIDE 22

Ensembles of ML algorithms

  • If we have several ML algorithms, their average is generally better than the application of the single best one
  • The problem is that we need to train and keep them all in memory
  • Such a technique is not scalable!
  • Bayesian ensembles are very compact (yet consist of a continuum of elements): you only need to sample from the posterior (a minimal sketch follows)

[Figure: accuracy of single algorithms vs. the single best algorithm vs. the ensemble]
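A minimal sketch (illustrative, not from the talk) of why the Bayesian ensemble is compact: nothing is stored but the posterior itself, and predictions average over samples drawn from it, p(y | x, D) ≈ (1/S) Σ_s p(y | x, ω_s). The toy "network" and the faked posterior samples are assumptions:

    import numpy as np

    rng = np.random.default_rng(0)

    def predict(w, x):
        return 1.0 / (1.0 + np.exp(-x @ w))          # sigmoid "network"

    w_map = np.array([1.0, -2.0, 0.5])               # hypothetical point estimate
    samples = w_map + 0.3 * rng.standard_normal((100, 3))  # stand-in for p(ω|D)

    x = np.array([0.2, 0.1, -0.4])
    p_single = predict(w_map, x)
    p_ensemble = np.mean([predict(w, x) for w in samples])
    print(p_single, p_ensemble)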

SLIDE 23

Real data example

SLIDE 24

Robustness to adversarial attacks

  • Adversarial examples are another problem with DNNs
  • Single DNNs are very sensitive to adversarial attacks
  • Ensembles of a continuum of DNNs almost cannot be fooled (a sketch of the attack follows)

[Figure: a slightly perturbed "panda" image is classified as "gibbon"]
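For concreteness, a minimal sketch (illustrative, not from the talk) of the FGSM-style perturbation behind the panda/gibbon picture, on a toy logistic model: nudge the input in the direction that most increases the loss, x_adv = x + ε · sign(∇_x loss):

    import numpy as np

    def predict(w, x):
        return 1.0 / (1.0 + np.exp(-x @ w))          # toy logistic "network"

    def input_grad(w, x, y):
        return (predict(w, x) - y) * w               # ∇_x of the log-loss

    w = np.array([2.0, -1.0])                        # hypothetical weights
    x, y, eps = np.array([1.0, 1.0]), 1.0, 0.25
    x_adv = x + eps * np.sign(input_grad(w, x, y))
    print(predict(w, x), "->", predict(w, x_adv))    # confidence drops

    # Against a Bayesian ensemble the attacker must fool the posterior-
    # averaged prediction E_{ω~p(ω|D)}[p(y | x, ω)] rather than a single
    # network, which is the robustness claim made on this slide.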

SLIDE 25

Setting desirable properties

By selecting the proper prior we may encourage desired properties in a Bayesian DNN:

  • Sparsity (compression; see the pruning sketch after this list)
  • Group sparsity (acceleration)
  • Rich ensembles (improved final accuracy, better uncertainty estimation)
  • Reliability (robustness to adversarial attacks)
  • Interpretability (hard attention maps)

Techniques expected to become Bayesian soon:

  • GANs
  • Normalization algorithms (batchnorm, weightnorm, etc.)
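How a sparsity-inducing prior turns into compression (assuming the sparse variational dropout recipe, e.g. Molchanov et al., 2017; the threshold and numbers are illustrative): training drives the per-weight noise-to-signal ratio α of unneeded weights to be huge, and those weights are pruned because the posterior says they are pure noise:

    import numpy as np

    rng = np.random.default_rng(0)
    theta = rng.standard_normal(10)                  # posterior weight means
    log_alpha = rng.uniform(-5.0, 5.0, size=10)      # learned log noise ratios

    keep = log_alpha < 3.0                # common rule: prune where α >> 1
    w_compressed = np.where(keep, theta, 0.0)
    print("kept", keep.sum(), "of", theta.size, "weights")
    print(w_compressed)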

SLIDE 26

Conclusions

  • The Bayesian framework is extremely powerful and extends the ML toolbox
  • We do have scalable algorithms for approximate Bayesian inference
  • Bayes + Deep Learning =
  • Even the first attempts at NeuroBayesian inference give impressive results
  • Summer school on NeuroBayesian methods, August 2018, Moscow, http://deepbayes.ru