Bayesian Deep Learning and Restricted Boltzmann Machines


  1. Bayesian Deep Learning and Restricted Boltzmann Machines. Narada Warakagoda, Forsvarets Forskningsinstitutt, ndw@ffi.no. November 1, 2018.

  2. Overview: (1) Probability Review, (2) Bayesian Deep Learning, (3) Restricted Boltzmann Machines.

  3. Probability Review

  4. Probability and Statistics Basics
     Normal (Gaussian) distribution: $p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}\,|\boldsymbol{\Sigma}|^{1/2}} \exp\!\left(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right) = \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$
     Categorical distribution: $P(x) = \prod_{i=1}^{k} p_i^{[x=i]}$
     Sampling: $\mathbf{x} \sim p(\mathbf{x})$
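To make the two distributions concrete, here is a small NumPy sketch (not part of the original slides; all numbers are arbitrary placeholders) that draws a sample from $\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, evaluates its log-density with the formula above, and draws a sample from a categorical distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Multivariate Gaussian N(mu, Sigma): sample and evaluate the log-density.
mu = np.array([0.0, 1.0])                        # illustrative mean
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])                   # illustrative covariance
x = rng.multivariate_normal(mu, Sigma)           # x ~ N(mu, Sigma)

d = len(mu)
diff = x - mu
log_p = (-0.5 * diff @ np.linalg.solve(Sigma, diff)
         - 0.5 * d * np.log(2 * np.pi)
         - 0.5 * np.log(np.linalg.det(Sigma)))   # log p(x) from the slide's formula

# Categorical distribution P(x = i) = p_i: draw one of k classes.
p = np.array([0.2, 0.5, 0.3])
i = rng.choice(len(p), p=p)                      # x ~ Categorical(p)
print(x, np.exp(log_p), i)
```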

  5. Probability and Statistics Basics
     Independent variables: $p(\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_k) = \prod_{i=1}^{k} p(\mathbf{x}_i)$
     Expectation: $\mathbb{E}_{p(\mathbf{x})} f(\mathbf{x}) = \int f(\mathbf{x})\, p(\mathbf{x})\, d\mathbf{x}$, or for discrete variables $\mathbb{E}_{P(\mathbf{x})} f(\mathbf{x}) = \sum_{i=1}^{k} f(\mathbf{x}_i)\, P(\mathbf{x}_i)$
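A minimal worked example of the discrete expectation (illustrative values only, not from the slides):

```python
import numpy as np

# Discrete expectation E_P[f(x)] = sum_i f(x_i) P(x_i)
x_vals = np.array([0.0, 1.0, 2.0])
P = np.array([0.2, 0.5, 0.3])        # probabilities, must sum to 1
f = lambda x: x ** 2

expectation = np.sum(f(x_vals) * P)  # = 0*0.2 + 1*0.5 + 4*0.3 = 1.7
print(expectation)
```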

  6. Kullback-Leibler Distance
     $\mathrm{KL}(q(\mathbf{x}) \,\|\, p(\mathbf{x})) = \mathbb{E}_{q(\mathbf{x})} \log \frac{q(\mathbf{x})}{p(\mathbf{x})} = \int \left[\, q(\mathbf{x}) \log q(\mathbf{x}) - q(\mathbf{x}) \log p(\mathbf{x}) \,\right] d\mathbf{x}$
     For the discrete case: $\mathrm{KL}(Q(\mathbf{x}) \,\|\, P(\mathbf{x})) = \sum_{i=1}^{k} \left[\, Q(\mathbf{x}_i) \log Q(\mathbf{x}_i) - Q(\mathbf{x}_i) \log P(\mathbf{x}_i) \,\right]$
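A short sketch of the discrete KL distance, using two made-up distributions; it also shows that the distance is not symmetric:

```python
import numpy as np

def kl_divergence(Q, P):
    """Discrete KL(Q || P) = sum_i [ Q_i log Q_i - Q_i log P_i ]."""
    Q, P = np.asarray(Q, float), np.asarray(P, float)
    return np.sum(Q * (np.log(Q) - np.log(P)))

Q = [0.4, 0.4, 0.2]   # approximating distribution
P = [0.5, 0.3, 0.2]   # target distribution
print(kl_divergence(Q, P))   # >= 0, and 0 only when Q == P
print(kl_divergence(P, Q))   # note: KL is not symmetric
```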

  7. Bayesian Deep Learning

  8. Bayesian Statistics
     Joint distribution: $p(\mathbf{x}, \mathbf{y}) = p(\mathbf{x} \mid \mathbf{y})\, p(\mathbf{y})$
     Marginalization: $p(\mathbf{x}) = \int p(\mathbf{x}, \mathbf{y})\, d\mathbf{y}$ and $P(\mathbf{x}) = \sum_{\mathbf{y}} P(\mathbf{x}, \mathbf{y})$
     Conditional distribution: $p(\mathbf{x} \mid \mathbf{y}) = \frac{p(\mathbf{x}, \mathbf{y})}{p(\mathbf{y})} = \frac{p(\mathbf{y} \mid \mathbf{x})\, p(\mathbf{x})}{\int p(\mathbf{y} \mid \mathbf{x})\, p(\mathbf{x})\, d\mathbf{x}}$
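The following toy check (made-up joint table, not from the slides) verifies marginalization and Bayes' rule numerically on a small discrete case:

```python
import numpy as np

# Toy joint distribution P(x, y) over 2 x-values and 3 y-values (illustrative numbers).
P_xy = np.array([[0.10, 0.20, 0.10],
                 [0.25, 0.15, 0.20]])           # rows: x, columns: y; sums to 1

P_x = P_xy.sum(axis=1)                          # marginalization over y
P_y = P_xy.sum(axis=0)                          # marginalization over x
P_x_given_y = P_xy / P_y                        # conditional P(x | y), columnwise

# Bayes' rule: P(x | y) = P(y | x) P(x) / sum_x P(y | x) P(x)
P_y_given_x = P_xy / P_x[:, None]
bayes = P_y_given_x * P_x[:, None] / (P_y_given_x * P_x[:, None]).sum(axis=0)
print(np.allclose(P_x_given_y, bayes))          # True
```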

  9. Statistical View of Neural Networks
     Prediction: $p(\mathbf{y} \mid \mathbf{x}, \mathbf{w}) = \mathcal{N}(\mathbf{f}_{\mathbf{w}}(\mathbf{x}), \boldsymbol{\Sigma})$
     Classification: $P(y \mid \mathbf{x}, \mathbf{w}) = \prod_{i=1}^{k} f_{\mathbf{w},i}(\mathbf{x})^{[y=i]}$
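A minimal sketch of this view, assuming a stand-in one-hidden-layer network $f_{\mathbf{w}}$ with randomly initialized weights (not specified on the slide): its output is used as the mean of a Gaussian for regression, or pushed through a softmax to give the categorical probabilities $P(y = i \mid \mathbf{x}, \mathbf{w})$.

```python
import numpy as np

rng = np.random.default_rng(0)

def f_w(x, w):
    # Stand-in one-hidden-layer network f_w(x).
    h = np.tanh(w["W1"] @ x + w["b1"])
    return w["W2"] @ h + w["b2"]

w = {"W1": rng.normal(size=(16, 3)), "b1": np.zeros(16),
     "W2": rng.normal(size=(4, 16)), "b2": np.zeros(4)}
x = rng.normal(size=3)

# Regression: p(y | x, w) = N(f_w(x), Sigma), here with Sigma fixed to the identity.
mean = f_w(x, w)

# Classification: a softmax of the network output gives P(y = i | x, w).
logits = f_w(x, w)
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(mean, probs)
```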

  10. Training Criteria
      Maximum Likelihood (ML): $\hat{\mathbf{w}} = \arg\max_{\mathbf{w}} p(\mathbf{Y} \mid \mathbf{X}, \mathbf{w})$
      Maximum A Posteriori (MAP): $\hat{\mathbf{w}} = \arg\max_{\mathbf{w}} p(\mathbf{Y}, \mathbf{w} \mid \mathbf{X}) = \arg\max_{\mathbf{w}} p(\mathbf{Y} \mid \mathbf{X}, \mathbf{w})\, p(\mathbf{w})$
      Bayesian: $p(\mathbf{w} \mid \mathbf{Y}, \mathbf{X}) = \frac{p(\mathbf{Y} \mid \mathbf{X}, \mathbf{w})\, p(\mathbf{w})}{\int p(\mathbf{Y} \mid \mathbf{X}, \mathbf{w})\, p(\mathbf{w})\, d\mathbf{w}}$
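As a hedged illustration (assuming a Gaussian likelihood, a Gaussian prior, and a linear model, none of which are fixed by the slide), the MAP criterion becomes the familiar "squared error plus L2 penalty" loss, and dropping the prior term recovers ML:

```python
import numpy as np

# MAP: argmax_w p(Y | X, w) p(w)  <=>  minimize NLL(w) + (-log prior(w)).
def neg_log_likelihood(w, X, Y, sigma=1.0):
    residuals = Y - X @ w                      # linear model, for illustration only
    return 0.5 * np.sum(residuals ** 2) / sigma ** 2

def neg_log_prior(w, tau=1.0):
    return 0.5 * np.sum(w ** 2) / tau ** 2     # Gaussian prior N(0, tau^2 I)

def map_objective(w, X, Y):
    return neg_log_likelihood(w, X, Y) + neg_log_prior(w)   # drop the prior term for ML
```

The fully Bayesian criterion, in contrast, keeps the entire posterior $p(\mathbf{w} \mid \mathbf{Y}, \mathbf{X})$ rather than a single maximizing $\mathbf{w}$.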

  11. Motivation for Bayesian Approach

  12. Motivation for Bayesian Approach

  13. Uncertainty with Bayesian Approach
      Not only the prediction/classification, but also its uncertainty can be calculated.
      Since we have $p(\mathbf{w} \mid \mathbf{Y}, \mathbf{X})$, we can sample $\mathbf{w}$ and use each sample as the network parameters when calculating the prediction/classification $p(\tilde{y} \mid \tilde{x}, \mathbf{w})$ (i.e. the network output for a given input).
      The prediction/classification is the mean of $p(\tilde{y} \mid \tilde{x}, \mathbf{w})$:
      $p_{\text{out}} = p(\tilde{y} \mid \tilde{x}, \mathbf{Y}, \mathbf{X}) = \int p(\tilde{y} \mid \tilde{x}, \mathbf{w})\, p(\mathbf{w} \mid \mathbf{Y}, \mathbf{X})\, d\mathbf{w}$
      The uncertainty of the prediction/classification is the variance of $p(\tilde{y} \mid \tilde{x}, \mathbf{w})$:
      $\mathrm{Var}(p(\tilde{y} \mid \tilde{x}, \mathbf{w})) = \int \left[\, p(\tilde{y} \mid \tilde{x}, \mathbf{w}) - p_{\text{out}} \,\right]^2 p(\mathbf{w} \mid \mathbf{Y}, \mathbf{X})\, d\mathbf{w}$
      Uncertainty is important in safety-critical applications (e.g. self-driving cars, medical diagnosis, military applications).
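A sketch of how this mean and variance would be estimated in practice, assuming posterior samples of $\mathbf{w}$ are already available (here they are just placeholder random vectors) and using a stand-in network:

```python
import numpy as np

rng = np.random.default_rng(0)

def predict(x, w):
    # Stand-in "network": any function of the input and the sampled weights.
    return np.tanh(w @ x)

posterior_samples = rng.normal(size=(1000, 3))   # placeholder for samples of w ~ p(w | Y, X)
x_new = np.array([0.5, -1.0, 2.0])

outputs = np.array([predict(x_new, w) for w in posterior_samples])
p_out = outputs.mean()          # prediction = posterior mean of the network output
uncertainty = outputs.var()     # uncertainty = posterior variance of the network output
print(p_out, uncertainty)
```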

  14. Other Advantages of Bayesian Approach
      - Natural interpretation for regularization
      - Model selection
      - Input data selection (active learning)

  15. Main Challenge of Bayesian Approach
      We must calculate, for the continuous case,
      $p(\mathbf{w} \mid \mathbf{Y}, \mathbf{X}) = \frac{p(\mathbf{Y} \mid \mathbf{X}, \mathbf{w})\, p(\mathbf{w})}{\int p(\mathbf{Y} \mid \mathbf{X}, \mathbf{w})\, p(\mathbf{w})\, d\mathbf{w}}$
      and, for the discrete case,
      $P(\mathbf{w} \mid \mathbf{Y}, \mathbf{X}) = \frac{p(\mathbf{Y} \mid \mathbf{X}, \mathbf{w})\, P(\mathbf{w})}{\sum_{\mathbf{w}} p(\mathbf{Y} \mid \mathbf{X}, \mathbf{w})\, P(\mathbf{w})}$
      Calculating the denominator is often intractable. E.g. consider a weight vector $\mathbf{w}$ of 100 elements, each of which can take two values: there are then $2^{100} \approx 1.27 \times 10^{30}$ different weight vectors. Compare this with the universe's age of 13.7 billion years (about $4.3 \times 10^{17}$ seconds): even evaluating billions of weight vectors per second would not come close to enumerating them all. We need approximations.

  16. Different Approaches
      - Monte Carlo techniques (e.g. Markov Chain Monte Carlo, MCMC)
      - Variational inference
      - Introducing random elements in training (e.g. dropout)

  17. Advantages and Disadvantages of Different Approaches
      Markov Chain Monte Carlo (MCMC): asymptotically exact, but computationally expensive.
      Variational inference: no guarantee of exactness, but the possibility of faster computation.

  18. Monte Carlo Techniques
      We are interested in
      $p_{\text{out}} = \mathrm{Mean}(p(\tilde{y} \mid \tilde{x}, \mathbf{w})) = p(\tilde{y} \mid \tilde{x}, \mathbf{Y}, \mathbf{X}) = \int p(\tilde{y} \mid \tilde{x}, \mathbf{w})\, p(\mathbf{w} \mid \mathbf{Y}, \mathbf{X})\, d\mathbf{w}$
      $\mathrm{Var}(p(\tilde{y} \mid \tilde{x}, \mathbf{w})) = \int \left[\, p(\tilde{y} \mid \tilde{x}, \mathbf{w}) - p_{\text{out}} \,\right]^2 p(\mathbf{w} \mid \mathbf{Y}, \mathbf{X})\, d\mathbf{w}$
      Both are integrals of the type
      $I = \int F(\mathbf{w})\, p(\mathbf{w} \mid \mathcal{D})\, d\mathbf{w}$
      where $\mathcal{D} = (\mathbf{Y}, \mathbf{X})$ is the training data. Approximate the integral by sampling $\mathbf{w}_i$ from $p(\mathbf{w} \mid \mathcal{D})$:
      $I \approx \frac{1}{L} \sum_{i=1}^{L} F(\mathbf{w}_i)$
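A minimal numerical check of the estimator $I \approx \frac{1}{L}\sum_{i} F(\mathbf{w}_i)$, with the posterior replaced by a standard normal so that the exact answer is known (this stand-in is an assumption, not something from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Monte Carlo estimate of I = ∫ F(w) p(w) dw with w_i ~ p(w).
F = lambda w: w ** 2                 # any function of the weights
samples = rng.standard_normal(10_000)
I_hat = F(samples).mean()            # (1/L) * sum_i F(w_i)
print(I_hat)                         # ≈ 1.0, the exact second moment of N(0, 1)
```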

  19. Monte Carlo Techniques
      Challenge: we don't have the posterior
      $p(\mathbf{w} \mid \mathcal{D}) = p(\mathbf{w} \mid \mathbf{Y}, \mathbf{X}) = \frac{p(\mathbf{Y} \mid \mathbf{X}, \mathbf{w})\, p(\mathbf{w})}{\int p(\mathbf{Y} \mid \mathbf{X}, \mathbf{w})\, p(\mathbf{w})\, d\mathbf{w}}$
      "Solution": use importance sampling, drawing the samples from a proposal distribution $q(\mathbf{w})$:
      $I = \int F(\mathbf{w}) \frac{p(\mathbf{w} \mid \mathcal{D})}{q(\mathbf{w})}\, q(\mathbf{w})\, d\mathbf{w} \approx \frac{1}{L} \sum_{i=1}^{L} F(\mathbf{w}_i) \frac{p(\mathbf{w}_i \mid \mathcal{D})}{q(\mathbf{w}_i)}$
      Problem: we still do not have $p(\mathbf{w} \mid \mathcal{D})$.

  20. Monte Carlo Techniques
      Problem: we still do not have $p(\mathbf{w} \mid \mathcal{D})$.
      Solution: use the unnormalized posterior $\tilde{p}(\mathbf{w} \mid \mathcal{D}) = p(\mathbf{Y} \mid \mathbf{X}, \mathbf{w})\, p(\mathbf{w})$, where the normalization factor is $Z = \int p(\mathbf{Y} \mid \mathbf{X}, \mathbf{w})\, p(\mathbf{w})\, d\mathbf{w}$, such that $p(\mathbf{w} \mid \mathcal{D}) = \tilde{p}(\mathbf{w} \mid \mathcal{D}) / Z$.
      The integral can then be calculated with
      $I \approx \frac{\sum_{i=1}^{L} F(\mathbf{w}_i)\, \tilde{p}(\mathbf{w}_i \mid \mathcal{D}) / q(\mathbf{w}_i)}{\sum_{i=1}^{L} \tilde{p}(\mathbf{w}_i \mid \mathcal{D}) / q(\mathbf{w}_i)}$
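A sketch of this self-normalized importance-sampling estimator, with a stand-in unnormalized target (an unnormalized Gaussian) in place of $p(\mathbf{Y} \mid \mathbf{X}, \mathbf{w})\, p(\mathbf{w})$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Self-normalized importance sampling with an unnormalized target p_tilde.
p_tilde = lambda w: np.exp(-0.5 * (w - 2.0) ** 2)                     # stand-in unnormalized posterior
q_pdf = lambda w: np.exp(-0.5 * (w / 3.0) ** 2) / (3.0 * np.sqrt(2 * np.pi))  # proposal q = N(0, 3^2)
F = lambda w: w                                                        # estimate the posterior mean

w_samples = rng.normal(0.0, 3.0, size=50_000)      # w_i ~ q(w)
weights = p_tilde(w_samples) / q_pdf(w_samples)    # importance weights p_tilde / q
I_hat = np.sum(F(w_samples) * weights) / np.sum(weights)
print(I_hat)                                       # ≈ 2.0, the mean of the target
```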

  21. Weakness of Importance Sampling
      The proposal distribution must be close to the non-zero areas of the original distribution $p(\mathbf{w} \mid \mathcal{D})$.
      In neural networks, $p(\mathbf{w} \mid \mathcal{D})$ is typically small except in a few narrow areas.
      Blind sampling from $q(\mathbf{w})$ therefore has a high chance of producing samples that fall outside the non-zero areas of $p(\mathbf{w} \mid \mathcal{D})$.
      We must actively try to get samples that lie close to $p(\mathbf{w} \mid \mathcal{D})$. Markov Chain Monte Carlo (MCMC) is such a technique.

  22. Metropolis Algorithm
      The Metropolis algorithm is an example of MCMC.
      Draw samples repeatedly from the random walk $\mathbf{w}_{t+1} = \mathbf{w}_t + \boldsymbol{\epsilon}$, where $\boldsymbol{\epsilon}$ is a small random vector, $\boldsymbol{\epsilon} \sim q(\boldsymbol{\epsilon})$ (e.g. Gaussian noise).
      The sample drawn at step $t$ is accepted or rejected based on the ratio $\tilde{p}(\mathbf{w}_t \mid \mathcal{D}) / \tilde{p}(\mathbf{w}_{t-1} \mid \mathcal{D})$:
      - If $\tilde{p}(\mathbf{w}_t \mid \mathcal{D}) > \tilde{p}(\mathbf{w}_{t-1} \mid \mathcal{D})$, accept the sample.
      - If $\tilde{p}(\mathbf{w}_t \mid \mathcal{D}) < \tilde{p}(\mathbf{w}_{t-1} \mid \mathcal{D})$, accept the sample with probability $\tilde{p}(\mathbf{w}_t \mid \mathcal{D}) / \tilde{p}(\mathbf{w}_{t-1} \mid \mathcal{D})$.
      If the sample is accepted, use it for calculating $I$. The same formula as before can be used:
      $I \approx \frac{\sum_{i=1}^{L} F(\mathbf{w}_i)\, \tilde{p}(\mathbf{w}_i \mid \mathcal{D}) / q(\mathbf{w}_i)}{\sum_{i=1}^{L} \tilde{p}(\mathbf{w}_i \mid \mathcal{D}) / q(\mathbf{w}_i)}$
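A sketch of the random-walk Metropolis sampler described above, with a stand-in unnormalized posterior (a 2-D Gaussian) in place of the network's $\tilde{p}(\mathbf{w} \mid \mathcal{D})$; step size and step count are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_p_tilde(w):
    # Stand-in log unnormalized posterior; in practice log p(Y|X,w) + log p(w).
    return -0.5 * np.sum(w ** 2)

def metropolis(n_steps=10_000, step_size=0.5, dim=2):
    w = np.zeros(dim)                      # initial weight vector
    samples = []
    for _ in range(n_steps):
        proposal = w + step_size * rng.standard_normal(dim)   # w_t = w_{t-1} + eps
        log_ratio = log_p_tilde(proposal) - log_p_tilde(w)
        # Accept if the ratio exceeds 1; otherwise accept with probability equal to it.
        if np.log(rng.uniform()) < log_ratio:
            w = proposal
        samples.append(w)
    return np.array(samples)

samples = metropolis()
print(samples.mean(axis=0), samples.var(axis=0))   # ≈ [0, 0] and ≈ [1, 1] for this target
```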

  23. Other Monte Carlo and Related Techniques
      - Hybrid Monte Carlo (Hamiltonian Monte Carlo): similar to the Metropolis algorithm, but uses gradient information rather than a random walk.
      - Simulated annealing

  24. Variational Inference
      Goal: computation of the posterior $p(\mathbf{w} \mid \mathcal{D})$ over the parameters $\mathbf{w}$ of the neural network given the data $\mathcal{D} = (\mathbf{Y}, \mathbf{X})$. But this computation is often intractable.
      Idea: find a distribution $q(\mathbf{w})$ from a family of distributions $Q$ such that $q(\mathbf{w})$ closely approximates $p(\mathbf{w} \mid \mathcal{D})$.
      How do we measure the distance between $q(\mathbf{w})$ and $p(\mathbf{w} \mid \mathcal{D})$? With the Kullback-Leibler distance $\mathrm{KL}(q(\mathbf{w}) \,\|\, p(\mathbf{w} \mid \mathcal{D}))$.
      The problem can be formulated as
      $\hat{p}(\mathbf{w} \mid \mathcal{D}) = \arg\min_{q(\mathbf{w}) \in Q} \mathrm{KL}\left(q(\mathbf{w}) \,\|\, p(\mathbf{w} \mid \mathcal{D})\right)$
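A toy sketch of this formulation, under the assumption that the "posterior" is a known density on a 1-D grid and that $Q$ is the family of Gaussians $\mathcal{N}(m, s^2)$; brute-force grid search over $(m, s)$ stands in for the actual optimization:

```python
import numpy as np

# Discretized stand-in posterior p(w|D) on a 1-D grid (a Gaussian centered at 1.5).
w = np.linspace(-6, 6, 1201)
dw = w[1] - w[0]
p = np.exp(-0.5 * ((w - 1.5) / 0.7) ** 2)
p /= p.sum() * dw                               # normalize on the grid

def gaussian(w, m, s):
    return np.exp(-0.5 * ((w - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

# Minimize KL(q || p) over the family Q = { N(m, s^2) } by grid search.
best = None
for m in np.linspace(-3, 3, 61):
    for s in np.linspace(0.2, 2.0, 37):
        q = gaussian(w, m, s)
        kl = np.sum(q * (np.log(q + 1e-12) - np.log(p + 1e-12))) * dw
        if best is None or kl < best[0]:
            best = (kl, m, s)

print(best)   # the minimizing (m, s) comes out close to (1.5, 0.7)
```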
