  1. Open Problems in Deep Learning: A Bayesian solution. Dmitry P. Vetrov, Research professor at HSE, Head of the Bayesian methods research group, http://bayesgroup.ru

  2. Idea of the talk

  3. Deep Learning • A revolution in machine learning • Deep neural networks approach human-level performance on a number of problems • They can solve quite non-standard problems such as image2caption and artistic style transfer

  4. Open problems in Deep Learning • Overfitting Neural networks are prone to catastrophic overfitting on noisy data • Interpretability Nobody knows HOW a neural network makes its decisions – crucial for healthcare and finance, where legislative restrictions are expected • Uncertainty estimation Current neural networks are very over-confident even when they make mistakes; in many applications (e.g. self-driving cars) it is important to estimate the uncertainty of a prediction • Adversarial examples Neural networks can easily be fooled by barely visible perturbations of the data

  5. Bayesian framework • Treats everything as a random variable • Allows us to encode our ignorance in terms of distributions • Makes use of Bayes' theorem (written out below)
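
For reference, the theorem the bullets rely on, written out in notation of my own choosing (θ for the unknown parameters, D for the observed data):

```latex
p(\theta \mid D) \;=\; \frac{p(D \mid \theta)\, p(\theta)}{p(D)},
\qquad
p(D) \;=\; \int p(D \mid \theta)\, p(\theta)\, d\theta .
```

The prior p(θ) encodes what we know (or do not know) before seeing the data; the posterior p(θ | D) is the updated state of knowledge.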


  9. Frequentist vs. Bayesian frameworks

  10. Frequentist vs. Bayesian frameworks • It can be shown that the posterior concentrates at the maximum-likelihood estimate when the number of observations n is much larger than the number of parameters d (illustrated below) • In other words, the frequentist framework is a limiting case of the Bayesian one! • In modern ML models the number of tunable parameters is comparable with the size of the training data, d ≈ n • We have no choice but to be Bayesian!
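
A tiny self-contained illustration of that limit (my own coin-flip example, not from the slides): with a uniform prior over the bias of a coin, the posterior mean approaches the maximum-likelihood estimate and the posterior width shrinks as n grows.

```python
import numpy as np

# Beta-Bernoulli model: uniform prior Beta(1, 1) over the coin bias.
# As n grows, the posterior concentrates at the maximum-likelihood estimate,
# i.e. the frequentist answer is recovered in the limit.
rng = np.random.default_rng(0)
true_p = 0.3
for n in [10, 100, 10_000]:
    x = rng.binomial(1, true_p, size=n)
    heads = int(x.sum())
    post_mean = (1 + heads) / (2 + n)                      # mean of Beta(1+heads, 1+n-heads)
    post_std = np.sqrt(post_mean * (1 - post_mean) / (3 + n))
    mle = heads / n
    print(f"n={n:6d}  MLE={mle:.3f}  posterior={post_mean:.3f} ± {post_std:.3f}")
```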

  11. Bayesian Neural networks • In Bayesian DNNs we treat the weights w of the neural network as random variables • First we define a reasonable prior p(w) • Next, given the training data, we perform Bayesian inference and derive the posterior over the weights (written out below)
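
Under the same notational assumption (w for the weights, (X, Y) for the training data, x* for a new input), the posterior and the predictive distribution it induces are:

```latex
p(w \mid X, Y) \;=\; \frac{p(Y \mid X, w)\, p(w)}{\int p(Y \mid X, w')\, p(w')\, dw'},
\qquad
p(y^\ast \mid x^\ast, X, Y) \;=\; \int p(y^\ast \mid x^\ast, w)\, p(w \mid X, Y)\, dw .
```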

  16. Advantages of Bayesian framework • Regularization Prevents overfitting on the training data, because the prior does not allow the parameters to be tuned too aggressively • Extensibility Bayesian inference results in a posterior, which can then be used as the prior for the next model • Ensembling The posterior distribution over the weights defines an ensemble of neural networks rather than a single network • Model selection Automatically selects the simplest possible model that explains the observed data, thus implementing Occam's razor • Scalability Stochastic variational inference allows posteriors to be approximated using deep neural networks (a minimal sketch follows this list)
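
A minimal sketch of the stochastic variational inference idea for a single Bayesian layer, assuming a mean-field Gaussian posterior, a standard normal prior and the reparameterization trick; the PyTorch-style module and all names are my illustration, not code from the talk.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BayesianLinear(nn.Module):
    """Mean-field Gaussian posterior q(w) = N(mu, sigma^2) over the weights."""
    def __init__(self, n_in, n_out):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(n_out, n_in))
        self.log_sigma = nn.Parameter(torch.full((n_out, n_in), -3.0))

    def forward(self, x):
        # Reparameterization trick: w = mu + sigma * eps with eps ~ N(0, I),
        # so gradients flow through mu and log_sigma.
        sigma = self.log_sigma.exp()
        w = self.mu + sigma * torch.randn_like(sigma)
        return F.linear(x, w)

    def kl(self):
        # KL(q(w) || p(w)) in closed form for the standard normal prior N(0, I).
        sigma2 = (2 * self.log_sigma).exp()
        return 0.5 * (sigma2 + self.mu ** 2 - 1 - 2 * self.log_sigma).sum()

# Training maximizes the evidence lower bound (ELBO),
#   E_q[log p(Y | X, w)] - KL(q || p),
# estimated on mini-batches of data -- hence "stochastic" variational inference.
```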

  17. Dropout • A purely heuristic regularization procedure • Injects either Bernoulli or Gaussian noise into the weights during training • The magnitudes of the noise are set manually (see the sketch below)
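
A toy sketch of the two noise types the slide mentions, applied multiplicatively to activations (applying them to the weights is analogous); the rate p and all names are my own choices for illustration.

```python
import numpy as np

def bernoulli_dropout(h, p, rng):
    # Keep each unit with probability 1 - p; rescale so the expectation is unchanged.
    mask = rng.binomial(1, 1 - p, size=h.shape) / (1 - p)
    return h * mask

def gaussian_dropout(h, p, rng):
    # Multiplicative Gaussian noise with the matching variance alpha = p / (1 - p).
    alpha = p / (1 - p)
    return h * rng.normal(1.0, np.sqrt(alpha), size=h.shape)

rng = np.random.default_rng(0)
h = rng.normal(size=(4, 3))   # some activations
p = 0.5                       # dropout rate, set by hand
print(bernoulli_dropout(h, p, rng))
print(gaussian_dropout(h, p, rng))
```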

  18. Bayesian dropout • A theoretically justified procedure • Corresponds to training a Bayesian ensemble under a specific but interpretable prior • Allows the dropout rates to be determined automatically (see the sketch below)
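
A hedged sketch of the key idea behind variational (Bayesian) dropout: make the Gaussian dropout noise level alpha a trainable parameter for every weight, so the dropout rate is learned rather than set by hand. It is written in the spirit of Kingma et al. (2015) and Molchanov et al. (2017); the module, its initialization, and the quoted KL constants are my simplified rendering, not the talk's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VariationalDropoutLinear(nn.Module):
    """Gaussian dropout with a learnable noise level alpha for every weight."""
    def __init__(self, n_in, n_out):
        super().__init__()
        self.w = nn.Parameter(0.01 * torch.randn(n_out, n_in))
        self.log_alpha = nn.Parameter(torch.full((n_out, n_in), -3.0))

    def forward(self, x):
        # Local reparameterization: sample the noisy pre-activations directly.
        mean = F.linear(x, self.w)
        var = F.linear(x ** 2, self.log_alpha.exp() * self.w ** 2)
        return mean + var.clamp(min=1e-8).sqrt() * torch.randn_like(mean)

    def kl(self):
        # Approximate KL to the log-uniform prior (polynomial fit reported in
        # Molchanov et al., 2017); large alpha means the weight is effectively
        # dropped, which is how the dropout rate gets set automatically.
        k1, k2, k3 = 0.63576, 1.87320, 1.48695
        neg_kl = (k1 * torch.sigmoid(k2 + k3 * self.log_alpha)
                  - 0.5 * F.softplus(-self.log_alpha) - k1)
        return -neg_kl.sum()
```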

  19. Visualization • LeNet-5: fully-connected layer • LeNet-5: convolutional layer (100 × 100 patch)

  20. Avoiding narrow extrema • [Stochastic variational] Bayesian inference corresponds to the injection of noise into the gradients • The larger the noise, the lower the spatial resolution • A Bayesian DNN simply DOES NOT SEE narrow local minima (toy illustration below)
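
A toy 1-D illustration (my own construction, not from the talk) of why noise hides narrow minima: averaging the loss over weight noise erases a narrow, deep dip while a wide basin survives.

```python
import numpy as np

def loss(w):
    # Wide basin near w = -2, plus a very narrow but deeper dip near w = 2.
    wide = 1 - np.exp(-(w + 2) ** 2 / 2.0)
    narrow = -1.5 * np.exp(-(w - 2) ** 2 / 0.001)
    return wide + narrow + 1.5

def smoothed_loss(grid, sigma, n_samples=10_000, seed=0):
    # Expected loss under weight noise w + eps, eps ~ N(0, sigma^2):
    # this is what a noisy (variational) objective effectively optimizes.
    eps = np.random.default_rng(seed).normal(0.0, sigma, size=n_samples)
    return np.array([loss(w + eps).mean() for w in grid])

grid = np.linspace(-4, 4, 801)
print("plain loss   : argmin at w =", grid[loss(grid).argmin()])                 # the narrow dip wins
print("smoothed loss: argmin at w =", grid[smoothed_loss(grid, 0.3).argmin()])   # the wide basin wins
```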

  21. Avoiding catastrophic overfitting • Bayesian model selection procedures effectively apply the well-known Occam's razor • They search for the simplest model capable of explaining the training data • If there are no dependencies between inputs and outputs, a Bayesian DNN will never "learn" them, since there always exists a simpler NULL model (toy illustration below) "With all things being equal, the simplest explanation tends to be the right one." William of Ockham
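
A tiny worked example (my own, not from the talk) of how the marginal likelihood implements Occam's razor: on purely random binary labels the simpler NULL model (a fair coin) has higher evidence; only when real structure is present does the richer model win.

```python
import numpy as np
from math import lgamma, log

def log_evidence_null(y):
    # NULL model: every label is a fair coin flip, p(y_i = 1) = 0.5.
    return len(y) * log(0.5)

def log_evidence_bernoulli(y):
    # Richer model: unknown bias with a uniform prior, integrated out analytically:
    # p(y) = B(k + 1, n - k + 1), where k is the number of ones.
    n, k = len(y), int(np.sum(y))
    return lgamma(k + 1) + lgamma(n - k + 1) - lgamma(n + 2)

rng = np.random.default_rng(0)
datasets = {
    "random labels": rng.integers(0, 2, size=200),    # no structure: NULL model wins
    "biased labels": rng.binomial(1, 0.9, size=200),  # real structure: richer model wins
}
for name, y in datasets.items():
    print(f"{name}:  null = {log_evidence_null(y):.1f}   bernoulli = {log_evidence_bernoulli(y):.1f}")
```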

  22. Ensembles of ML algorithms • If we have several ML algorithms, their average is generally better than the single best one • The problem is that we need to train all of them and keep all of them in memory • Such a technique does not scale! • Bayesian ensembles are very compact (yet consist of a continuum of elements) – you only need to sample from the posterior (sketch below) (Plot on the slide: accuracy of the single algorithms, the single best one, and the ensemble.)
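
A minimal sketch (illustration only; the toy posterior and all names are mine) of how a Bayesian ensemble prediction is formed: draw weight samples from the approximate posterior and average the resulting predictive distributions, instead of storing many separately trained networks.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def predict_ensemble(x, sample_weights, n_samples=100):
    """p(y | x) ~= (1/S) sum_s p(y | x, w_s), with w_s drawn from the posterior."""
    probs = 0.0
    for _ in range(n_samples):
        w = sample_weights()          # one member of the "continuum" ensemble
        probs = probs + softmax(x @ w)
    return probs / n_samples

# Toy "posterior": a Gaussian around a point estimate (stand-in for a learned q(w)).
rng = np.random.default_rng(1)
w_hat = rng.normal(size=(5, 3))
sample = lambda: w_hat + 0.1 * rng.normal(size=w_hat.shape)
x = rng.normal(size=(4, 5))
print(predict_ensemble(x, sample))
```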

  23. Real data example

  24. Robustness to adversarial attacks • Adversarial examples are another problem for DNNs • Single DNNs are very sensitive to adversarial attacks • Ensembles over a continuum of DNNs are almost impossible to fool (sketch below) (The slide shows the classic "panda" → "gibbon" adversarial example.)
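
The panda-to-gibbon image is the classic illustration of the fast gradient sign method (FGSM). A minimal PyTorch-style sketch of the attack and of ensemble averaging, assuming a differentiable `model` that returns logits and a hypothetical `predict_with(w, x)` that runs the network with a given weight sample; this is my illustration, not the talk's code.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, eps=0.01):
    # One-step, barely visible perturbation in the direction that increases the loss.
    x_adv = x.clone().detach().requires_grad_(True)
    F.cross_entropy(model(x_adv), y).backward()
    return (x_adv + eps * x_adv.grad.sign()).detach()

def ensemble_predict(weight_samples, predict_with, x):
    # Averaging predictions over many posterior weight samples smooths the
    # decision function, which is what makes the Bayesian ensemble harder to fool.
    probs = torch.stack([F.softmax(predict_with(w, x), dim=-1) for w in weight_samples])
    return probs.mean(dim=0)
```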

  25. Setting desirable properties By selecting a proper prior we may encourage desired properties in a Bayesian DNN: • Sparsity (compression) • Group sparsity (acceleration) • Rich ensembles (improved final accuracy, better uncertainty estimation) • Reliability (robustness to adversarial attacks) • Interpretability (hard attention maps) Techniques expected to become Bayesian soon: • GANs • Normalization algorithms (batchnorm, weightnorm, etc.)

  26. Conclusions • The Bayesian framework is extremely powerful and extends the ML toolbox • We do have scalable algorithms for approximate Bayesian inference • Bayes + Deep Learning = • Even the first attempts at NeuroBayesian inference give impressive results • Summer school on NeuroBayesian methods, August 2018, Moscow, http://deepbayes.ru
