Bayesian Feedforward Neural Networks
Seung-Hoon Na, Chonbuk National University
Neural networks compared to GPs
- Neural networks: a nonlinear generalization of GLMs
- Here, defined as a logistic regression model applied to the output of another logistic regression model
- To make the connection between GPs and NNs [Neal '96], now consider a neural network for regression with one hidden layer
Neural networks compared to GPs
- Use the priors on the weights:
(Figure: a one-hidden-layer network with input x, hidden units h_1, ..., h_J with their activations y_1, ..., y_J, and weights w_j, v_jk.)
Neural networks compared to GPs
- Let σ_v scale as 1/√H: since more hidden units will increase the input to the final node, we should scale down the magnitude of the output weights
- Then, as the number of hidden units H → ∞, we get a Gaussian process (see the sketch below)
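A minimal numerical sketch of this limit, assuming tanh hidden units, standard normal input-to-hidden weights, and the 1/√H scaling of the output-weight prior; the names and constants here are illustrative rather than taken from the slides.

```python
import numpy as np

# Sketch of [Neal '96]: one-hidden-layer regression network
#   f(x) = b + sum_j v_j * tanh(u_j * x + a_j)
# with assumed priors b ~ N(0, sigma_b^2), v_j ~ N(0, sigma_v^2 / H),
# u_j, a_j ~ N(0, 1).  Scaling Var[v_j] by 1/H keeps the output variance
# finite, so as H -> infinity draws of f behave like GP samples.

def sample_nn_function(xs, H=10_000, sigma_b=1.0, sigma_v=1.0, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    b = rng.normal(0.0, sigma_b)                       # output bias
    v = rng.normal(0.0, sigma_v / np.sqrt(H), size=H)  # scaled output weights
    u = rng.normal(0.0, 1.0, size=H)                   # input-to-hidden weights
    a = rng.normal(0.0, 1.0, size=H)                   # hidden biases
    hidden = np.tanh(np.outer(xs, u) + a)              # (N, H) hidden activations
    return b + hidden @ v                              # (N,) function values

xs = np.linspace(-3, 3, 50)
draws = np.stack([sample_nn_function(xs, rng=np.random.default_rng(s))
                  for s in range(5)])                  # 5 approximate GP draws
print(draws.shape)  # (5, 50)
```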
Neural networks compared to GPs
- If we use the error function g(x; u) = erf(u_0 + Σ_j u_j x_j) as the activation / transfer function, and choose u ∼ N(0, Σ)
- Then the covariance kernel becomes [Williams '98]:
  k(x, x') = (2/π) sin⁻¹( 2 x̃ᵀ Σ x̃' / √((1 + 2 x̃ᵀ Σ x̃)(1 + 2 x̃'ᵀ Σ x̃')) ),  where x̃ = (1, x₁, ..., x_D)
- This is a true "neural network" kernel
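A small sketch of this kernel, assuming an isotropic prior Σ = σ²I over the augmented input-to-hidden weights; the function name and the σ² parameter are illustrative.

```python
import numpy as np

# "Neural network" covariance of [Williams '98] for erf hidden units:
#   k(x, x') = (2/pi) * arcsin( 2 xt^T S xt'
#              / sqrt((1 + 2 xt^T S xt) * (1 + 2 xt'^T S xt')) )
# where xt = (1, x) is the augmented input and S is the prior covariance
# of the input-to-hidden weights (assumed here to be sigma2 * I).

def nn_kernel(X1, X2, sigma2=1.0):
    X1a = np.hstack([np.ones((X1.shape[0], 1)), X1])   # augment with bias term
    X2a = np.hstack([np.ones((X2.shape[0], 1)), X2])
    S = sigma2 * np.eye(X1a.shape[1])
    num = 2.0 * X1a @ S @ X2a.T
    d1 = 1.0 + 2.0 * np.einsum('ij,jk,ik->i', X1a, S, X1a)
    d2 = 1.0 + 2.0 * np.einsum('ij,jk,ik->i', X2a, S, X2a)
    return (2.0 / np.pi) * np.arcsin(num / np.sqrt(np.outer(d1, d2)))

X = np.linspace(-2, 2, 5).reshape(-1, 1)
print(nn_kernel(X, X).round(3))   # 5x5 covariance matrix
```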
Feedforward neural networks
- NN with two layers for a regression problem
- g: a non-linear activation or transfer function
- z(x) = φ(x, V): called the hidden layer
- NN for binary classification
- NN for multi-output regression
- NN for multi-class classification
Feedforward neural networks
A neural network with one hidden layer.
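A minimal sketch of the two-layer network shown above, under assumed shapes (V is H×D, w has length H) and tanh as the transfer function g; the binary-classification variant only adds a sigmoid output.

```python
import numpy as np

# Forward pass of a one-hidden-layer network (assumed shapes and names):
# z = g(V x) is the hidden layer, a = w^T z the pre-synaptic output of the
# final layer; regression returns a directly, binary classification
# squashes it through a sigmoid.

def forward(x, V, w, task='regression'):
    z = np.tanh(V @ x)                        # hidden layer z = g(V x)
    a = w @ z                                 # final-layer pre-activation
    if task == 'binary':
        return 1.0 / (1.0 + np.exp(-a))       # p(y = 1 | x)
    return a                                  # f(x, w) for regression

rng = np.random.default_rng(0)
D, H = 3, 5
x = rng.normal(size=D)
V, w = rng.normal(size=(H, D)), rng.normal(size=H)
print(forward(x, V, w), forward(x, V, w, task='binary'))
```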
Bayesian neural networks
- Use prior of the form:
- where w represents all the weights combined
- Posterior can be approximated:
Bayesian neural networks
- A second-order Taylor series approximation of E(w) around its minimum w_MAP (the MAP estimate), where the linear term vanishes:
  E(w) ≈ E(w_MAP) + ½ (w − w_MAP)ᵀ H (w − w_MAP)
- H is the Hessian of E
- Using the quadratic approximation, the posterior becomes Gaussian:
  p(w | D) ≈ N(w | w_MAP, H⁻¹)
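A toy sketch of this Laplace step, with an assumed stand-in objective E(w) and a finite-difference Hessian in place of a real network loss.

```python
import numpy as np
from scipy.optimize import minimize

# Laplace approximation sketch: expand E(w) = -log p(w | D) + const around
# its minimum w_MAP, so p(w | D) ~= N(w | w_MAP, H^{-1}) with H the Hessian
# of E.  The quadratic E below is only a toy stand-in for a BNN objective.

def E(w):
    return np.sum((w - np.array([1.0, -2.0])) ** 2) + 0.5 * np.sum(w ** 2)

def numerical_hessian(f, w, eps=1e-4):
    d = len(w)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            ei, ej = np.eye(d)[i] * eps, np.eye(d)[j] * eps
            H[i, j] = (f(w + ei + ej) - f(w + ei - ej)
                       - f(w - ei + ej) + f(w - ei - ej)) / (4 * eps ** 2)
    return H

w_map = minimize(E, np.zeros(2)).x          # MAP estimate
H = numerical_hessian(E, w_map)             # Hessian at the mode
cov = np.linalg.inv(H)                      # Gaussian posterior covariance
print(w_map.round(3), cov.round(3))
```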
Bayesian neural networks
- Parameter posterior for classification
- The same as the regression case, except β = 1 and E_D is a cross-entropy error of the form
  E_D(w) = −Σ_n [ y_n ln ŷ_n + (1 − y_n) ln(1 − ŷ_n) ]
- Predictive posterior for regression
- The posterior predictive density is not analytically tractable because of the nonlinearity of f(x, w)
- Let us construct a first-order Taylor series approximation around the mode:
  f(x, w) ≈ f(x, w_MAP) + gᵀ(w − w_MAP),  where g = ∇_w f(x, w) evaluated at w_MAP
Bayesian neural networks
- Predictive posterior for regression
- We now have a linear-Gaussian model with a Gaussian prior on the weights
- The predictive variance depends on the input x (sketched below):
  p(y | x, D) ≈ N(y | f(x, w_MAP), σ²(x)),  σ²(x) = β⁻¹ + gᵀ H⁻¹ g
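A sketch of these predictive moments, assuming the network f, its gradient with respect to the weights, w_MAP, the inverse Hessian, and the noise precision β are already available; the toy f below is illustrative.

```python
import numpy as np

# Linearised predictive density: f(x, w) ~= f(x, w_MAP) + g(x)^T (w - w_MAP)
# gives p(y | x, D) ~= N(y | f(x, w_MAP), 1/beta + g(x)^T A^{-1} g(x)),
# with A the Hessian of E at w_MAP.  All inputs are assumed given.

def predictive_moments(x, f, grad_f, w_map, A_inv, beta):
    mean = f(x, w_map)
    g = grad_f(x, w_map)                     # gradient w.r.t. the weights
    var = 1.0 / beta + g @ A_inv @ g         # input-dependent variance
    return mean, var

# toy model that is linear in w, so the Taylor step happens to be exact
f = lambda x, w: w[0] + w[1] * np.tanh(x)
grad_f = lambda x, w: np.array([1.0, np.tanh(x)])
m, v = predictive_moments(0.5, f, grad_f,
                          w_map=np.array([0.1, 2.0]),
                          A_inv=0.05 * np.eye(2), beta=10.0)
print(m, v)
```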
Bayesian neural networks
- The posterior predictive density for an MLP with 3 hidden nodes, trained on 16 data points
- The dashed green line: the true function
- The solid red line: the posterior mean prediction
Bayesian neural networks
- Predictive posterior for classification
- Approximate p(y | x, D) in the case of binary classification
- The situation is similar to the case of logistic regression, except that in addition the posterior predictive mean is a non-linear function of w
- p(y = 1 | x, D) = ∫ σ(a(x, w)) p(w | D) dw, where a(x, w) is the pre-synaptic output of the final layer
Bayesian neural networks
- Predictive posterior for classification
- The posterior predictive for the output, using the approximation:
  p(y = 1 | x, D) ≈ σ( κ(σ_a²) μ_a ),  κ(σ²) = (1 + π σ² / 8)^(−1/2)
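A sketch of this moderated output, assuming the usual probit-style correction κ(σ²) = (1 + πσ²/8)^(−1/2) applied to a Gaussian approximation a ∼ N(μ_a, σ_a²) of the pre-synaptic output.

```python
import numpy as np

# Moderated sigmoid output: with a ~ N(mu_a, sigma2_a) for the final-layer
# pre-activation, the logistic-Gaussian integral is approximated by
#   p(y = 1 | x, D) ~= sigmoid(kappa(sigma2_a) * mu_a),
#   kappa(s2) = (1 + pi * s2 / 8) ** -0.5.

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def moderated_output(mu_a, sigma2_a):
    kappa = 1.0 / np.sqrt(1.0 + np.pi * sigma2_a / 8.0)
    return sigmoid(kappa * mu_a)

print(moderated_output(2.0, 0.0), moderated_output(2.0, 9.0))
# larger predictive variance pulls the probability back towards 0.5
```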
Dropout as a Bayesian Approximation [Gal and Ghahramani '16]
- The statement
- A neural network with arbitrary depth and non-linearities, with dropout applied before every weight layer, is mathematically equivalent to an approximation to the probabilistic deep Gaussian process (Damianou & Lawrence, 2013)
- The notations
- ŷ: the output of a NN model with L layers and a loss function E(·, ·), such as the softmax loss or the Euclidean (square) loss
- W_i (of dimensions K_i × K_{i−1}): the NN's weight matrix at the i-th layer
- b_i: the bias vector at the i-th layer
Dropout as a Bayesian Approximation [Gal and Ghahramani '16]
- L2 regularisation of NN
- The deep Gaussian process
- assume we are given a covariance function of the form
  K(x, y) = ∫ p(w) p(b) σ(wᵀ x + b) σ(wᵀ y + b) dw db
- a deep GP with L layers and covariance function K(x, y) can be approximated by placing a variational distribution over each component of a spectral decomposition of the GPs' covariance functions
Dropout as a Bayesian Approximation
- The predictive probability of the deep GP model,
  p(y | x, X, Y) = ∫ p(y | x, ω) p(ω | X, Y) dω,
  is intractable
- Now, ω = {W_i}_{i=1}^L, where each W_i is a random matrix of dims K_i × K_{i−1} for each GP layer, and each row of W_i ∼ p(w)
- m_i: a vector of dims K_i for each GP layer
Dropout as a Bayesian Approximation [Gal and Ghahramani '16]
- To approximate the intractable posterior p(ω | X, Y), we define q(ω) (sketched below):
  W_i = M_i · diag([z_{i,j}]_{j=1}^{K_{i−1}}),  z_{i,j} ∼ Bernoulli(p_i)  for i = 1, ..., L, j = 1, ..., K_{i−1}
- Minimise the KL divergence between the approximate posterior q(ω) and the posterior of the full deep GP
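A sketch of sampling a weight matrix from this q(ω), assuming M_i is the variational parameter matrix and p_i is the probability of keeping a unit (one minus the dropout rate).

```python
import numpy as np

# q(W_i): W_i = M_i * diag(z_i), z_{i,j} ~ Bernoulli(p_i).  Each column of
# the sampled weight matrix is either kept (the column of M_i) or zeroed
# out, which is exactly what dropout does to the layer's input units.

def sample_q_W(M, p_keep, rng):
    # M: (K_i, K_{i-1}) variational parameters; p_keep: keep probability
    z = rng.binomial(1, p_keep, size=M.shape[1])   # one z per input unit
    return M * z                                   # same as M @ np.diag(z)

rng = np.random.default_rng(0)
M = rng.normal(size=(4, 3))
print(sample_q_W(M, p_keep=0.5, rng=rng))
```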
Dropout as a Bayesian Approximation [Gal and Ghahramani '16]
- Approximate the first term of the KL using a single Monte Carlo sample ω̂ ∼ q(ω):
- Approximate the second term of the KL to:
- Thus, the approximated KL objective (see the sketch below):
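A sketch of the resulting objective for the Euclidean (square) loss, with an assumed weight-decay coefficient lam standing in for the constants the paper derives.

```python
import numpy as np

# Approximated KL objective: ordinary dropout training with L2 weight decay,
#   L ~ (1/N) * sum_n E(y_n, yhat_n) + lam * sum_i (||W_i||^2 + ||b_i||^2).
# `weights` is an assumed list of (W_i, b_i) pairs from the network.

def dropout_objective(y, y_hat, weights, lam):
    data_term = np.mean((y - y_hat) ** 2)          # Euclidean (square) loss
    reg_term = sum(np.sum(W ** 2) + np.sum(b ** 2) for W, b in weights)
    return data_term + lam * reg_term

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
print(dropout_objective(np.ones(10), 0.9 * np.ones(10), [(W1, b1)], lam=1e-3))
```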
Dropout as a Bayesian Approximation [Gal and Ghahramani '16]
- Approximate predictive distribution
- MC dropout for approximation
- Sample T sets of vectors of realisations from the Bernoulli distribution
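A sketch of the MC dropout estimate of the predictive mean, assuming `forward_with_dropout` is the trained network with its Bernoulli masks resampled on every call; the stand-in network below is only for illustration.

```python
import numpy as np

# MC dropout: keep dropout active at test time, run T stochastic forward
# passes, and average the outputs to approximate the predictive mean.

def mc_dropout_mean(forward_with_dropout, x_star, T=100):
    samples = np.stack([forward_with_dropout(x_star) for _ in range(T)])
    return samples.mean(axis=0), samples

rng = np.random.default_rng(0)
dummy_forward = lambda x: np.tanh(x) + 0.1 * rng.normal()   # stand-in stochastic net
mean, samples = mc_dropout_mean(dummy_forward, x_star=np.array([0.3]), T=100)
print(mean, samples.std(axis=0))
```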
Dropout as a Bayesian Approximation [Gal and Ghahramani '16]
- Model uncertainty (estimating the second raw moment)
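A sketch of the matching uncertainty estimate from the same T stochastic passes, assuming a known model precision τ.

```python
import numpy as np

# Model uncertainty via the second raw moment of the T passes:
#   Var[y* | x*] ~= 1/tau + (1/T) * sum_t y_t**2 - mean**2   (element-wise).
# tau is an assumed, known model (noise) precision.

def mc_dropout_variance(samples, tau):
    mean = samples.mean(axis=0)
    second_moment = (samples ** 2).mean(axis=0)
    return 1.0 / tau + second_moment - mean ** 2

samples = np.random.default_rng(0).normal(1.0, 0.3, size=(100, 1))
print(mc_dropout_variance(samples, tau=25.0))   # roughly 1/25 + 0.3**2
```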
Dropout as a Bayesian Approximation [Gal and Ghahramani '16]
A scatter of 100 forward passes