SLIDE 1

Bayesian Deep Learning and Restricted Boltzmann Machines

Narada Warakagoda

Forsvarets Forskningsinstitutt ndw@ffi.no

November 1, 2018

Narada Warakagoda (FFI) Short title November 1, 2018 1 / 56

SLIDE 2

Overview

1 Probability Review
2 Bayesian Deep Learning
3 Restricted Boltzmann Machines

SLIDE 3

Probability Review

SLIDE 4

Probability and Statistics Basics

Normal (Gaussian) Distribution

p(x) = 1 / ((2π)^(d/2) |Σ|^(1/2)) · exp(−(1/2)(x − µ)ᵀ Σ⁻¹ (x − µ)) = N(µ, Σ)

Categorical Distribution

P(x) = ∏_{i=1}^{k} p_i^{[x=i]}

Sampling

x ∼ p(x)
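As an illustrative aside (not part of the slides), the two distributions and the sampling step above can be sketched in NumPy; the particular µ, Σ and p values below are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

# Multivariate normal N(mu, Sigma): density and sampling.
mu = np.array([0.0, 1.0])
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])

def gaussian_pdf(x, mu, Sigma):
    d = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm

x = rng.multivariate_normal(mu, Sigma)     # x ~ p(x)

# Categorical distribution: P(x) = prod_i p_i^[x=i]
p = np.array([0.2, 0.5, 0.3])              # event probabilities, sum to 1
sample = rng.choice(len(p), p=p)           # x ~ P(x)
```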

SLIDE 5

Probability and Statistics Basics

Independent variables

p(x_1, x_2, · · · , x_k) = ∏_{i=1}^{k} p(x_i)

Expectation

E_{p(x)} f(x) = ∫ f(x) p(x) dx

or, for discrete variables,

E_{P(x)} f(x) = ∑_{i=1}^{k} f(x_i) P(x_i)

SLIDE 6

Kullback Leibler Distance

KL(q(x) || p(x)) = E_{q(x)} log [q(x)/p(x)]
 = ∫ [q(x) log q(x) − q(x) log p(x)] dx

For the discrete case

KL(Q(x) || P(x)) = ∑_{i=1}^{k} [Q(x_i) log Q(x_i) − Q(x_i) log P(x_i)]
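The discrete KL sum translates directly into code. A minimal sketch assuming NumPy, with the usual convention that terms with Q(x_i) = 0 contribute zero:

```python
import numpy as np

def kl_divergence(Q, P):
    """Discrete KL(Q || P) = sum_i Q_i (log Q_i - log P_i)."""
    Q, P = np.asarray(Q, float), np.asarray(P, float)
    mask = Q > 0                      # terms with Q_i = 0 contribute 0
    return float(np.sum(Q[mask] * (np.log(Q[mask]) - np.log(P[mask]))))

kl_divergence([0.5, 0.5], [0.5, 0.5])   # 0.0 for identical distributions
```

Note that KL is non-negative and zero only when the two distributions are identical, which is what makes it usable as a distance-like measure (it is not symmetric, however).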

SLIDE 7

Bayesian Deep Learning

SLIDE 8

Bayesian Statistics

Joint distribution

p(x, y) = p(x|y) p(y)

Marginalization

p(x) = ∫ p(x, y) dy
P(x) = ∑_y P(x, y)

Conditional distribution

p(x|y) = p(x, y) / p(y) = p(y|x) p(x) / ∫ p(y|x) p(x) dx

SLIDE 9

Statistical view of Neural Networks

Prediction

p(y|x, w) = N(f_w(x), Σ)

Classification

P(y|x, w) = ∏_{i=1}^{k} f_w^i(x)^{[y=i]}

SLIDE 10

Training Criteria

Maximum Likelihood (ML)

w = arg max_w p(Y|X, w)

Maximum A Posteriori (MAP)

w = arg max_w p(Y, w|X) = arg max_w p(Y|X, w) p(w)

Bayesian

p(w|Y, X) = p(Y|X, w) p(w) / p(Y|X) = p(Y|X, w) p(w) / ∫ p(Y|X, w) p(w) dw

SLIDE 11

Motivation for Bayesian Approach

SLIDE 12

Motivation for Bayesian Approach

SLIDE 13

Uncertainty with Bayesian Approach

Not only the prediction/classification itself, but also its uncertainty can be calculated.

Since we have p(w|Y, X), we can sample w and use each sample as the network parameters when calculating the prediction/classification p(y|x, w) (i.e. the network output for a given input).

The prediction/classification is the mean of p(y|x, w):

p_out = p(y|x, Y, X) = ∫ p(y|x, w) p(w|Y, X) dw

The uncertainty of the prediction/classification is the variance of p(y|x, w):

Var(p(y|x, w)) = ∫ [p(y|x, w) − p_out]² p(w|Y, X) dw

Uncertainty is important in safety-critical applications (eg: self-driving cars, medical diagnosis, military applications).

SLIDE 14

Other Advantages of Bayesian Approach

Natural interpretation for regularization Model selection Input data selection (active learning)

SLIDE 15

Main Challenge of Bayesian Approach

We calculate

For the continuous case:

p(w|Y, X) = p(Y|X, w) p(w) / ∫ p(Y|X, w) p(w) dw

For the discrete case:

P(w|Y, X) = p(Y|X, w) P(w) / ∑_w p(Y|X, w) P(w)

Calculating the denominator is often intractable.

Eg: Consider a weight vector w of 100 elements, each of which can take two values. Then there are 2¹⁰⁰ ≈ 1.27 × 10³⁰ different weight vectors. Even enumerating one vector per nanosecond would take on the order of 10¹³ years; compare this with the universe's age of 13.7 billion years.

We need approximations.

SLIDE 16

Different Approaches

Monte Carlo techniques (Eg: Markov Chain Monte Carlo - MCMC)
Variational Inference
Introducing random elements in training (eg: Dropout)

SLIDE 17

Advantages and Disadvantages of Different Approaches

Markov Chain Monte Carlo - MCMC

Asymptotically exact
Computationally expensive

Variational Inference

No guarantee of exactness
Possibility of faster computation

SLIDE 18

Monte Carlo Techniques

We are interested in

p_out = Mean(p(y|x, w)) = p(y|x, Y, X) = ∫ p(y|x, w) p(w|Y, X) dw

Var(p(y|x, w)) = ∫ [p(y|x, w) − p_out]² p(w|Y, X) dw

Both are integrals of the type

I = ∫ F(w) p(w|D) dw

where D = (Y, X) is the training data. Approximate the integral by sampling w_i from p(w|D):

I ≈ (1/L) ∑_{i=1}^{L} F(w_i)
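The estimate I ≈ (1/L) ∑ F(w_i) can be illustrated on a toy case where, unlike a real posterior, p(w|D) is a known Gaussian we can sample directly; this setup is an assumption made purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setting: pretend the posterior p(w|D) is N(0, 1) so it can be sampled
# directly, and estimate I = E[F(w)] for F(w) = w^2 (true value: 1).
def F(w):
    return w ** 2

L = 200_000
w_samples = rng.standard_normal(L)     # w_i ~ p(w|D)
I_hat = np.mean(F(w_samples))          # I ≈ (1/L) sum_i F(w_i)
```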

SLIDE 19

Monte Carlo techniques

Challenge: We don’t have the posterior

p(w|D) = p(w|Y, X) = p(Y|X, w) p(w) / ∫ p(Y|X, w) p(w) dw

”Solution”: Use importance sampling, drawing from a proposal distribution q(w):

I = ∫ F(w) [p(w|D)/q(w)] q(w) dw ≈ (1/L) ∑_{i=1}^{L} F(w_i) p(w_i|D)/q(w_i)

Problem: We still do not have p(w|D)

SLIDE 20

Monte Carlo Techniques

Problem: We still do not have p(w|D)
Solution: Use the unnormalized posterior

p̃(w|D) = p(Y|X, w) p(w)

where the normalization factor is

Z = ∫ p(Y|X, w) p(w) dw

such that p(w|D) = p̃(w|D)/Z. The integral can then be calculated with

I ≈ [∑_{i=1}^{L} F(w_i) p̃(w_i|D)/q(w_i)] / [∑_{i=1}^{L} p̃(w_i|D)/q(w_i)]
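The self-normalized estimate above can be sketched on a toy unnormalized target; the particular target and proposal below are illustrative assumptions, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(1)

# Self-normalized importance sampling with an unnormalized target.
# Toy target: p~(w) = exp(-w^2/2), i.e. N(0,1) without its normalizer.
# Proposal: q = N(0, 2^2), deliberately wider than the target.
def p_tilde(w):
    return np.exp(-0.5 * w ** 2)

def q_pdf(w):
    return np.exp(-0.5 * (w / 2.0) ** 2) / (2.0 * np.sqrt(2 * np.pi))

def F(w):
    return w ** 2

L = 200_000
w = rng.normal(0.0, 2.0, size=L)          # w_i ~ q(w)
weights = p_tilde(w) / q_pdf(w)           # unnormalized importance weights
I_hat = np.sum(F(w) * weights) / np.sum(weights)
```

The division by ∑ weights is what removes the unknown normalizer Z: it cancels from numerator and denominator.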

SLIDE 21

Weakness of Importance Sampling

The proposal distribution must be close to the non-zero areas of the original distribution p(w|D). In neural networks, p(w|D) is typically small except in a few narrow areas. Samples drawn blindly from q(w) therefore have a high chance of falling outside the non-zero areas of p(w|D). We must actively try to obtain samples that lie where p(w|D) is large. Markov Chain Monte Carlo (MCMC) is one such technique.

SLIDE 22

Metropolis Algorithm

The Metropolis algorithm is an example of MCMC.

Draw samples repeatedly from the random walk

w_{t+1} = w_t + ε

where ε is a small random vector, ε ∼ q(ε) (eg: Gaussian noise).

The sample drawn at step t is accepted or rejected based on the ratio p̃(w_t|D)/p̃(w_{t−1}|D):

If p̃(w_t|D) > p̃(w_{t−1}|D), accept the sample
If p̃(w_t|D) < p̃(w_{t−1}|D), accept the sample with probability p̃(w_t|D)/p̃(w_{t−1}|D)

If the sample is accepted, use it for calculating I.

The same formula as before can be used for calculating I:

I ≈ [∑_{i=1}^{L} F(w_i) p̃(w_i|D)/q(w_i)] / [∑_{i=1}^{L} p̃(w_i|D)/q(w_i)]
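A minimal sketch of the Metropolis loop on a toy unnormalized density (the target here is an illustrative assumption). Note one standard detail beyond the slide: on rejection, the chain keeps its previous state as the next sample:

```python
import numpy as np

rng = np.random.default_rng(2)

# Metropolis sampling from an unnormalized density p~(w).
# Toy target: p~(w) = exp(-w^2/2), a standard normal up to a constant.
def p_tilde(w):
    return np.exp(-0.5 * w ** 2)

def metropolis(n_samples, step=1.0):
    w = 0.0
    samples = []
    for _ in range(n_samples):
        w_new = w + rng.normal(0.0, step)        # random-walk proposal
        ratio = p_tilde(w_new) / p_tilde(w)
        if ratio > 1 or rng.random() < ratio:    # accept with prob min(1, ratio)
            w = w_new
        samples.append(w)                        # on rejection, keep old state
    return np.array(samples)

samples = metropolis(100_000)
I_hat = np.mean(samples ** 2)   # E[w^2] under the target is 1
```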

SLIDE 23

Other Monte Carlo and Related Techniques

Hybrid Monte Carlo (Hamiltonian Monte Carlo)

Similar to the Metropolis algorithm
But uses gradient information rather than a random walk

Simulated Annealing

SLIDE 24

Variational Inference

Goal: computation of the posterior p(w|D), i.e. the parameters of the neural network w given data D = (Y, X). But this computation is often intractable.

Idea: find a distribution q(w) from a family of distributions Q such that q(w) closely approximates p(w|D).

How to measure the distance between q(w) and p(w|D)? The Kullback-Leibler distance KL(q(w) || p(w|D)).

The problem can be formulated as

p̂(w|D) = arg min_{q(w)∈Q} KL(q(w) || p(w|D))

SLIDE 25

Minimizing KL Distance

Using the definition of the KL distance

KL(q(w) || p(w|D)) = ∫ q(w) ln [q(w)/p(w|D)] dw

We cannot minimize this directly, because we do not know p(w|D). But we can manipulate it further and transform it into an equivalent optimization problem involving a quantity known as the Evidence Lower Bound (ELBO).

SLIDE 26

Evidence Lower Bound (ELBO)

KL(q(w) || p(w|D)) = ∫ q(w) ln [q(w)/p(w|D)] dw
 = ∫ q(w) ln [q(w) p(D) / p(w, D)] dw
 = ∫ q(w) ln [q(w)/p(w, D)] dw + ∫ q(w) ln p(D) dw
 = E_{q(w)} ln [q(w)/p(w, D)] + ln p(D) ∫ q(w) dw

Rearranging, and using ∫ q(w) dw = 1:

ln p(D) = E_{q(w)} ln [p(w, D)/q(w)] + KL(q(w) || p(w|D))

Since ln p(D) is constant, minimizing KL(q(w) || p(w|D)) is equivalent to maximizing the ELBO, E_{q(w)} ln [p(w, D)/q(w)].

SLIDE 27

Another Look at ELBO

ELBO = E_{q(w)} ln [p(w, D)/q(w)]
 = ∫ q(w) ln p(w, D) dw − ∫ q(w) ln q(w) dw
 = ∫ q(w) ln [p(D|w) p(w)] dw − ∫ q(w) ln q(w) dw
 = ∫ q(w) ln p(D|w) dw − ∫ q(w) ln [q(w)/p(w)] dw
 = E_{q(w)} ln p(D|w) − KL(q(w) || p(w))

We maximize the ELBO with respect to q(w):

Maximizing the first term, E_{q(w)} ln p(D|w), rewards q(w)’s ability to explain the training data.
Minimizing the second term, KL(q(w) || p(w)), keeps q(w) close to the prior p(w).

SLIDE 28

Outline of Procedure with ELBO

Start with the ELBO

ELBO = L = E_{q(w)} ln [p(w, D)/q(w)] = E_{q(w)} [ln p(w, D) − ln q(w)]

Rewrite with the parameter λ of q(w) and expand the expectation

L(λ) = ∫ ln[p(w, D)] q(w, λ) dw − ∫ ln[q(w, λ)] q(w, λ) dw

Maximize L(λ) with respect to λ

λ⋆ = arg max_λ L(λ)

Use the q optimized with respect to λ as the posterior: q(w, λ⋆) ≈ p(w|D)

SLIDE 29

How to Maximize ELBO

Analytical methods are not practical for deep neural networks
We resort to gradient methods with Monte Carlo sampling
We discuss two methods:

Black box variational inference: based on the log-derivative trick
Bayes by Backprop: based on the re-parameterization trick

SLIDE 30

Black Box Variational Inference

Start with the ELBO:

L(λ) = ∫ ln[p(w, D)] q(w, λ) dw − ∫ ln[q(w, λ)] q(w, λ) dw

Differentiate with respect to λ:

∇_λ L(λ) = ∫ ln[p(w, D)] ∇_λ[q(w, λ)] dw − ∫ ln[q(w, λ)] ∇_λ[q(w, λ)] dw − ∫ ∇_λ[ln q(w, λ)] q(w, λ) dw

The last term is zero (Can you prove it?)

SLIDE 31

Black Box Variational Inference

Now we have

∇_λ L(λ) = ∫ ln[p(w, D)] ∇_λ[q(w, λ)] dw − ∫ ln[q(w, λ)] ∇_λ[q(w, λ)] dw
 = ∫ (ln[p(w, D)] − ln[q(w, λ)]) ∇_λ[q(w, λ)] dw

We want to write this as an expectation with respect to q. Use the log-derivative trick:

∇_λ[q(w, λ)] = ∇_λ[ln q(w, λ)] q(w, λ)

SLIDE 32

Black Box Variational Inference

Now we get

∇_λ L(λ) = ∫ ln[p(w, D)] ∇_λ[ln q(w, λ)] q(w, λ) dw − ∫ ln[q(w, λ)] ∇_λ[ln q(w, λ)] q(w, λ) dw

Rearranging terms

∇_λ L(λ) = ∫ (ln[p(w, D)] − ln q(w, λ)) ∇_λ[ln q(w, λ)] q(w, λ) dw

This is the same as an expectation with respect to q

∇_λ L(λ) = E_{q(w,λ)} [ (ln[p(w, D)] − ln q(w, λ)) ∇_λ[ln q(w, λ)] ]

SLIDE 33

BBVI optimization procedure

1 Assume a distribution q(w, λ) parameterized by λ.
2 Draw S samples of w from the distribution using the current value λ = λ_t.
3 Estimate the gradient of the ELBO using the sample values:

∇_λ L̂(λ) = (1/S) ∑_{s=1}^{S} (ln[p(w_s, D)] − ln q(w_s, λ)) ∇_λ[ln q(w_s, λ)]

4 Update λ: λ_{t+1} = λ_t + ρ ∇_λ L̂(λ)
5 Repeat from step 2.
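The steps above can be sketched on a toy one-dimensional model. This is a hedged illustration, not the slides' own experiment: the joint ln p(w, D) = −(w − 2)²/2 and the Gaussian family q = N(mu, sigma²) with λ = (mu, log_sigma) are assumptions chosen so the true posterior, N(2, 1), is known:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy joint: ln p(w, D) = -(w - 2)^2 / 2 (up to a constant); the true
# posterior is N(2, 1).  Variational family: q(w, λ) = N(mu, sigma^2).
def log_p_joint(w):
    return -0.5 * (w - 2.0) ** 2

def log_q(w, mu, log_sigma):
    z = (w - mu) / np.exp(log_sigma)
    return -0.5 * z ** 2 - log_sigma - 0.5 * np.log(2 * np.pi)

def grad_log_q(w, mu, log_sigma):
    # Score function ∇_λ ln q(w, λ) for λ = (mu, log_sigma).
    z = (w - mu) / np.exp(log_sigma)
    return np.array([z / np.exp(log_sigma), z ** 2 - 1.0])

mu, log_sigma = 0.0, 0.0
S, rho = 256, 0.02
for t in range(4000):
    w = mu + np.exp(log_sigma) * rng.standard_normal(S)        # step 2
    f = log_p_joint(w) - log_q(w, mu, log_sigma)
    grad = (grad_log_q(w, mu, log_sigma) * f).mean(axis=1)     # step 3
    mu, log_sigma = mu + rho * grad[0], log_sigma + rho * grad[1]  # step 4
```

After the loop, (mu, exp(log_sigma)) should be near the true posterior parameters (2, 1), though with visible sampling noise, which previews the variance issue discussed on SLIDE 36.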

SLIDE 34

Bayes by Backprop

Try to approximate the ELBO directly by sampling from q(w, λ):

ELBO = L(λ) = E_{q(w,λ)} [ln p(w, D) − ln q(w, λ)]

with the estimate

L̂(λ) = (1/S) ∑_{s=1}^{S} [ln p(w_s, D) − ln q(w_s, λ)]

But we need ∇_λ L̂(λ), and we cannot differentiate L̂(λ) because it is not a smooth function of λ (the samples w_s depend on λ through the sampling step).

Use the re-parameterization trick:

w_s = w(λ, ε_s)

where ε_s is drawn from, for example, a standard Gaussian distribution.

SLIDE 35

Bayes by BackProp (BbB)

The estimated ELBO is now

L̂(λ) = (1/S) ∑_{s=1}^{S} [ln p(w(λ, ε_s), D) − ln q(w(λ, ε_s), λ)]

This is a smooth function of λ and can be differentiated:

∇_λ L̂(λ) = (1/S) ∑_{s=1}^{S} [ (∂L̂_s/∂w)(∂w/∂λ) + ∂L̂_s/∂λ ]

where

L̂_s = ln p(w(λ, ε_s), D) − ln q(w(λ, ε_s), λ)

Once the gradients are known, the optimum λ⋆, and hence q(w, λ⋆), can be found by gradient ascent on L̂.
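The re-parameterized gradient can be sketched on the same kind of toy model as before; the joint ln p(w, D) = −(w − 2)²/2 and the Gaussian family with w = mu + sigma·ε are illustrative assumptions, and the gradient expressions below follow from that specific choice:

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy joint: ln p(w, D) = -(w - 2)^2 / 2 + const; true posterior N(2, 1).
# q = N(mu, sigma^2), reparameterized as w = mu + sigma * eps, eps ~ N(0,1).
# With this substitution, -ln q(w(λ,eps), λ) = eps^2/2 + log(sigma) + const,
# so the ELBO estimate is a smooth function of λ = (mu, log_sigma).
mu, log_sigma = 0.0, 0.0
S, rho = 64, 0.05
for t in range(1500):
    eps = rng.standard_normal(S)
    sigma = np.exp(log_sigma)
    w = mu + sigma * eps                           # w_s = w(λ, eps_s)
    dlogp = -(w - 2.0)                             # d/dw ln p(w, D)
    grad_mu = np.mean(dlogp)                       # chain rule: ∂w/∂mu = 1
    grad_ls = np.mean(dlogp * sigma * eps) + 1.0   # ∂w/∂log_sigma = sigma*eps,
    mu += rho * grad_mu                            # plus d(log sigma)/d(log_sigma)
    log_sigma += rho * grad_ls
```

In line with SLIDE 36, this estimator is typically much less noisy than the score-function version: the gradient of ln p flows through the samples themselves rather than being reweighted after the fact.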

SLIDE 36

Performance of BBVI and BbB

Both methods estimate approximate gradients by sampling
High variance of the estimated gradients is a problem
In practice, these algorithms need modifications to tackle the high variance
BbB tends to give lower-variance estimates than BBVI

SLIDE 37

Bayesian Deep Learning through Randomization in Training

Stochastic gradient descent and Dropout can be given Bayesian interpretations.
The dropout procedure at test time can be used for estimating the uncertainty of model outputs (Monte Carlo Dropout):

Enable dropout and feed the network S times with the data, collecting the outputs f(s), s = 1, 2, · · · , S

Output variance = (1/S) ∑_s (f(s) − f̄)², where f̄ = (1/S) ∑_s f(s)
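A minimal Monte Carlo Dropout sketch for a tiny hand-built network; the weights, input, dropout rate, and network shape below are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)

# Tiny 1-hidden-layer network with dropout kept ON at test time.
W1, b1 = rng.normal(size=(16, 4)), np.zeros(16)
W2, b2 = rng.normal(size=(1, 16)), np.zeros(1)

def forward(x, p_drop=0.5):
    h = np.maximum(0.0, W1 @ x + b1)         # ReLU hidden layer
    mask = rng.random(h.shape) >= p_drop     # fresh dropout mask per pass
    h = h * mask / (1.0 - p_drop)            # inverted-dropout scaling
    return (W2 @ h + b2)[0]

x = np.array([0.5, -1.0, 0.3, 2.0])
S = 1000
outs = np.array([forward(x) for _ in range(S)])  # S stochastic forward passes
f_bar = outs.mean()                              # prediction
uncertainty = outs.var()                         # output variance
```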

SLIDE 38

Restricted Boltzmann Machines

SLIDE 39

Stochastic Neurons

We consider stochastic binary neurons, i.e. y can be either 1 or 0:

p(y = 1) = σ(b + ∑_i w_i x_i)
p(y = 0) = 1 − p(y = 1)
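A stochastic binary neuron is a one-liner once the sigmoid is in place; the weights and input below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(6)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def stochastic_neuron(x, w, b):
    """Return y in {0, 1} with p(y=1) = sigmoid(b + w·x)."""
    p1 = sigmoid(b + np.dot(w, x))
    return int(rng.random() < p1)

y = stochastic_neuron(x=np.array([1.0, 0.0]), w=np.array([2.0, -1.0]), b=0.5)
```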

SLIDE 40

Boltzmann Machine

A Boltzmann machine is a recurrent network with stochastic neurons
Weights are symmetrical
At equilibrium, the relationships of the neuron outputs can be represented using an undirected graphical model

SLIDE 41

Restricted Boltzmann Machine (RBM)

Neurons are divided into two groups: visible and hidden
Restricted architecture: no connections within the visible group or within the hidden group
Network parameters:

Bias vector of the hidden units, b = [b_1, b_2, · · · , b_H]
Bias vector of the visible units, c = [c_1, c_2, · · · , c_V]
Connection weights, W = {w_{i,j}}

Network values are binary random vectors: v = [v_1, v_2, · · · , v_V] and h = [h_1, h_2, · · · , h_H]

SLIDE 42

How are the network parameters and values related?

Through the definition of an energy function. In an RBM, the energy function is defined as

E(v, h) = −hᵀWv − cᵀv − bᵀh

We assign probabilities to (v, h) based on the Boltzmann distribution

p(v, h) = exp(−E(v, h)) / Z

where

Z = ∑_{v′,h′} exp(−E(v′, h′))
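For a tiny RBM the energy and the Boltzmann probabilities can be computed by brute force; the sizes and random parameters below are illustrative, and enumerating Z like this is only feasible because the model is tiny (which is exactly the intractability point made on the next slide):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(7)

# Tiny RBM: V=3 visible units, H=2 hidden units, random parameters.
V, H = 3, 2
W = rng.normal(scale=0.5, size=(H, V))
b = rng.normal(scale=0.1, size=H)      # hidden biases
c = rng.normal(scale=0.1, size=V)      # visible biases

def energy(v, h):
    return -(h @ W @ v + c @ v + b @ h)

# Partition function Z by enumerating all 2^V * 2^H binary states.
states = [(np.array(v), np.array(h))
          for v in product([0, 1], repeat=V)
          for h in product([0, 1], repeat=H)]
Z = sum(np.exp(-energy(v, h)) for v, h in states)

def p(v, h):
    return np.exp(-energy(v, h)) / Z
```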

SLIDE 43

What can we do with RBM?

Assume that the network parameters W W W ,b b b,c c c are known. Can we calculate the probability of a given pair of vectors (ˆ v v v, ˆ h h h)?

This is generally not tractable, because calculating Z requires to sum all combinations v and h values.

Can we calculate the probability of h h h given v v v or vice-versa?

Yes, this is ”inference” and possible.

Assume that a data set of v v v vectors given. Can we estimate the network parameters W W W ,b b b,c c c ?

Yes, this is training and possible

SLIDE 44

Inference

We want to find p(h h h|v v v) assuming W W W ,b b b,c c c are known. We start with the Bayes rule p(h h h|v v v) = p(h h h|v v v)

  • h′

h′ h′ p(h′

h′ h′,v′ v′ v′) = exp

  • h

h hTW W Wv v v + c c cTv v v + b b bTh h h

  • /Z
  • h′

h′ h′∈{0,1}H exp

  • h′

h′ h′TW W Wv′ v′ v′ + c c cTv′ v′ v′ + b b bTh′ h′ h′ /Z Canceling common factors and expanding vector-matrix multiplication as a summation p(h h h|v v v) = exp

  • j (hjW

W W jv v v + bjhj)

  • h′

1∈{0,1}

  • h′

2∈{0,1} . . .

h′

H∈{0,1} exp(

j(h′ jW

W W jv v v + bjh′

j))

SLIDE 45

Inference

We want to find p(h|v), assuming W, b, c are known. Writing the exponential of sums as a product of exponentials:

p(h|v) = ∏_j exp(h_j W_j v + b_j h_j) / [∑_{h′_1∈{0,1}} ∑_{h′_2∈{0,1}} · · · ∑_{h′_H∈{0,1}} ∏_j exp(h′_j W_j v + b_j h′_j)]

 = ∏_j exp(h_j W_j v + b_j h_j) / [(∑_{h′_1∈{0,1}} exp(h′_1 W_1 v + b_1 h′_1)) · · · (∑_{h′_H∈{0,1}} exp(h′_H W_H v + b_H h′_H))]

 = ∏_j exp(h_j W_j v + b_j h_j) / ∏_j ∑_{h′_j∈{0,1}} exp(h′_j W_j v + b_j h′_j)

 = ∏_j exp(h_j W_j v + b_j h_j) / ∏_j (1 + exp(W_j v + b_j))

 = ∏_j [ exp(h_j W_j v + b_j h_j) / (1 + exp(W_j v + b_j)) ]

since each inner sum over h′_j ∈ {0,1} evaluates to 1 + exp(W_j v + b_j).

This implies that the calculation of p(h|v) is tractable.

SLIDE 46

Inference

Let’s try to interpret

p(h|v) = ∏_j [ exp(h_j W_j v + b_j h_j) / (1 + exp(W_j v + b_j)) ]

Consider the quantity

q(h_j) = exp(h_j W_j v + b_j h_j) / (1 + exp(W_j v + b_j))

q(h_j) takes two values, q(0) and q(1), and these values sum to 1. Therefore it is a probability measure on h_j.

Since we assumed v is given, q(h_j) is actually p(h_j|v). A simple manipulation shows that

p(h_j = 1|v) = σ(W_j v + b_j)

i.e. the activation function of a stochastic neuron.
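The factorized inference rule is a single matrix-vector product followed by a sigmoid; the parameter values and visible vector below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(8)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Factorized RBM inference: p(h_j = 1 | v) = sigmoid(W_j v + b_j).
V, H = 4, 3
W = rng.normal(scale=0.5, size=(H, V))
b = np.zeros(H)
v = np.array([1, 0, 1, 1])

p_h = sigmoid(W @ v + b)                        # vector of p(h_j = 1 | v)
h_sample = (rng.random(H) < p_h).astype(int)    # one sample of h given v
```

Sampling v given h works the same way with Wᵀ and the visible biases c, which is what makes Gibbs sampling between the two layers cheap.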

SLIDE 47

Training

We consider maximum likelihood training with a given dataset {v_1, v_2, . . . , v_N}, with respect to the log likelihood

L = log ∏_{i=1}^{N} p(v_i) = ∑_{i=1}^{N} log p(v_i)

We use gradient ascent on L and therefore calculate ∂L/∂θ, the gradient of L with respect to a model parameter θ.

We derive the gradient for a single sample:

∂ log p(v) / ∂θ

SLIDE 48

Gradients

By definition we know that

p(v, h) = exp(−E(v, h)) / Z   (1)

where

Z = ∑_{v′,h′} exp(−E(v′, h′))   (2)

Therefore

p(v) = ∑_h p(v, h) = ∑_h exp(−E(v, h)) / Z   (3)

Take the log and differentiate with respect to θ:

∂ log p(v)/∂θ = ∂ log ∑_h exp(−E(v, h))/∂θ − ∂ log Z/∂θ   (4)

SLIDE 49

Gradients

Consider the first term:

∂ log ∑_h exp(−E(v, h))/∂θ = −[∑_h exp(−E(v, h)) ∂E(v, h)/∂θ] / [∑_{h′} exp(−E(v, h′))]   (5)
 = −∑_h [exp(−E(v, h)) / ∑_{h′} exp(−E(v, h′))] ∂E(v, h)/∂θ   (6)

But dividing equation 1 by equation 3 we get

p(v, h)/p(v) = p(h|v) = exp(−E(v, h)) / ∑_{h′} exp(−E(v, h′))   (7)

Substituting equation 7 into equation 6:

∂ log ∑_h exp(−E(v, h))/∂θ = −∑_h p(h|v) ∂E(v, h)/∂θ   (8)

SLIDE 50

Gradients

Consider the second term in equation 4 and substitute for Z from equation 2:

∂ log Z/∂θ = ∂ log ∑_{v′,h′} exp(−E(v′, h′))/∂θ   (9)
 = −[∑_{v′,h′} exp(−E(v′, h′)) ∂E(v′, h′)/∂θ] / [∑_{v′,h′} exp(−E(v′, h′))]   (10)
 = −∑_{v,h} [exp(−E(v, h)) / ∑_{v′,h′} exp(−E(v′, h′))] ∂E(v, h)/∂θ   (11)

From equations 1 and 2 it is clear that exp(−E(v, h)) / ∑_{v′,h′} exp(−E(v′, h′)) is p(v, h). Therefore

∂ log Z/∂θ = −∑_{v,h} p(v, h) ∂E(v, h)/∂θ   (12)

SLIDE 51

Gradients

From equations 4, 8 and 12:

∂ log p(v)/∂θ = −∑_h p(h|v) ∂E(v, h)/∂θ + ∑_{v,h} p(v, h) ∂E(v, h)/∂θ   (13)

∂ log p(v)/∂θ = −E_{p(h|v)}[∂E(v, h)/∂θ] + E_{p(v,h)}[∂E(v, h)/∂θ]   (14)

The first term of equation 14:

Known as the positive phase
Depends on the training data
Can be computed exactly

The second term of equation 14:

Known as the negative phase
Independent of the training data; completely model dependent
Must be estimated through Gibbs sampling and a procedure known as Contrastive Divergence
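As a hedged sketch of how the two phases are used in practice, here is one CD-1 parameter update for a binary RBM. The shapes, learning rate, and the choice of one Gibbs step (CD-1) are illustrative; for the bilinear energy above, ∂E/∂W = −h vᵀ, so the update contrasts ⟨h vᵀ⟩ statistics from the data (positive phase) with those after one Gibbs step (approximate negative phase):

```python
import numpy as np

rng = np.random.default_rng(9)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# One CD-1 update for a binary RBM (illustrative sizes and learning rate).
V, H, lr = 6, 4, 0.1
W = rng.normal(scale=0.1, size=(H, V))
b, c = np.zeros(H), np.zeros(V)

def cd1_step(v0):
    # Positive phase: hidden statistics given the data vector v0.
    ph0 = sigmoid(W @ v0 + b)
    h0 = (rng.random(H) < ph0).astype(float)
    # Negative phase approximation: one Gibbs step v0 -> h0 -> v1 -> p(h1).
    pv1 = sigmoid(W.T @ h0 + c)
    v1 = (rng.random(V) < pv1).astype(float)
    ph1 = sigmoid(W @ v1 + b)
    # Positive-phase statistics minus negative-phase statistics.
    dW = np.outer(ph0, v0) - np.outer(ph1, v1)
    return dW, ph0 - ph1, v0 - v1

v0 = rng.integers(0, 2, size=V).astype(float)
dW, db, dc = cd1_step(v0)
W += lr * dW
b += lr * db
c += lr * dc
```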

SLIDE 52

Applications of RBMs

Deep belief networks
Collaborative filtering

SLIDE 53

Deep Belief Networks

Method for initializing a multilayer network:

1 Train an RBM with training data
2 Initialize the current layer with the trained parameters
3 Present training data to the RBM and sample the hidden layer values
4 Use the hidden layer values as training data and repeat from step 1.

SLIDE 54

Collaborative Filtering

Application in recommendation systems. Eg: movie rating/recommendation
Different users rate different items (eg: movies) using a rating scale such as 1 to 5
The problem is to estimate the rating of an unrated item for a given user

SLIDE 55

Collaborative filtering with RBM

Train a different RBM for each user, but share weights across users
Visible units correspond to the ratings given to each movie
In training, movies with missing ratings are omitted
To predict a missing rating, find p(h|v) and then compute p(v|h) for the missing visible unit

SLIDE 56

The End
