Bayesian Feedforward Neural Networks


  1. Bayesian Feedforward Neural Networks. Seung-Hoon Na, Chonbuk National University

  2. Neural networks compared to GPs • Neural networks: a nonlinear generalization of GLMs • Here, the model is defined by a logistic regression model applied to the output of another logistic regression model – To make the connection between GPs and NNs [Neal '96], consider a neural network for regression with one hidden layer

  3. Neural networks compared to GPs – [Figure: a one-hidden-layer network; inputs x_1, …, x_n feed the hidden units g_1, g_2, … (the hidden unit activations) through weights u, and the hidden units feed the output y through weights v] • Use the following priors on the weights:

  4. Neural networks compared to GPs – Let the variance of the hidden-to-output weights scale as σ_v²/H, where H is the number of hidden units • Since more hidden units will increase the input to the final node, we should scale down the magnitude of the weights; as H → ∞, we then get a Gaussian process
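
A quick numerical illustration of this scaling argument (not on the slides): sampling many random one-hidden-layer networks whose hidden-to-output weights have variance σ_v²/H shows the output at a fixed input settling to a roughly Gaussian distribution with stable variance as H grows. All names and constants below are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_nn_outputs(x, H, n_nets=5000, sigma_v=1.0, sigma_u=1.0):
        # f(x) = sum_j v_j * tanh(u_j . [1, x]) with v_j ~ N(0, sigma_v^2 / H):
        # scaling the output-weight variance by 1/H keeps the variance of the
        # sum finite, so by the CLT f(x) tends to a Gaussian as H grows.
        x_tilde = np.array([1.0, float(x)])                 # bias input plus scalar x
        u = sigma_u * rng.standard_normal((n_nets, H, 2))   # input-to-hidden weights
        v = (sigma_v / np.sqrt(H)) * rng.standard_normal((n_nets, H))
        hidden = np.tanh(u @ x_tilde)                       # (n_nets, H) hidden activations
        return np.sum(v * hidden, axis=1)                   # one output f(x) per network

    for H in (1, 10, 1000):
        f = sample_nn_outputs(x=0.5, H=H)
        print("H=%4d  mean=%+.3f  var=%.3f" % (H, f.mean(), f.var()))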

  5. Neural networks compared to GPs • If we use the error function erf(·) as the activation / transfer function and choose a zero-mean Gaussian prior on the input-to-hidden weights • Then the covariance kernel [Williams '98]:  This is a true "neural network" kernel
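
The kernel itself was an image on the slide; the sketch below uses the standard form of Williams' (1998) neural-network covariance for erf hidden units, k(x, x') = (2/π) arcsin( 2 x̃ᵀΣx̃' / √((1 + 2 x̃ᵀΣx̃)(1 + 2 x̃'ᵀΣx̃')) ), with x̃ = (1, x) the bias-augmented input and Σ the prior covariance of the input-to-hidden weights. The example Σ is arbitrary.

    import numpy as np

    def nn_kernel(x1, x2, Sigma):
        # "Neural network" covariance (Williams, 1998) for erf hidden units.
        # Sigma is the prior covariance of the input-to-hidden weights,
        # including the bias component.
        x1t = np.concatenate(([1.0], np.atleast_1d(x1)))   # x~ = (1, x)
        x2t = np.concatenate(([1.0], np.atleast_1d(x2)))
        num = 2.0 * x1t @ Sigma @ x2t
        den = np.sqrt((1.0 + 2.0 * x1t @ Sigma @ x1t) *
                      (1.0 + 2.0 * x2t @ Sigma @ x2t))
        return (2.0 / np.pi) * np.arcsin(num / den)

    Sigma = np.diag([1.0, 3.0])        # illustrative prior variances (bias, weight)
    print(nn_kernel(0.2, -0.7, Sigma))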

  6. Feedforward neural networks • NN with two layers for a regression problem – g: a non-linear activation or transfer function – z(x) = φ(x, V): called the hidden layer • NN for binary classification • NN for multi-output regression • NN for multi-class classification
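
A small numpy sketch of the forward pass just described, under the common choice g = tanh: the hidden layer z(x) = g(Vx) feeds a linear output for regression, or a sigmoid for the binary-classification variant. The weights V, w here are random placeholders with toy sizes.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def mlp_forward(x, V, w, task="regression"):
        # One-hidden-layer network: z(x) = g(V x), output a = w . z(x).
        # For "binary" the output is squashed through a sigmoid.
        z = np.tanh(V @ x)          # hidden layer activations
        a = w @ z                   # pre-synaptic (pre-activation) output
        return sigmoid(a) if task == "binary" else a

    V = np.random.randn(3, 2)       # 3 hidden units, 2 inputs (toy sizes)
    w = np.random.randn(3)
    x = np.array([0.5, -1.0])
    print(mlp_forward(x, V, w), mlp_forward(x, V, w, task="binary"))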

  7. Feedforward neural networks A neural network with one hidden layer.

  8. Bayesian neural networks • Use a prior of the form: – where w represents all the weights combined • The posterior can then be approximated:

  9. Bayesian neural networks • A second-order Taylor series approximation of E(w) around its minimum (the MAP) – A is the Hessian of E • Using this quadratic approximation, the posterior becomes Gaussian:
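
A schematic numpy sketch of this Laplace step, assuming an energy function E(w) (the negative log joint) and its minimiser w_map are already available; the Hessian is taken by finite differences purely for illustration, and the quadratic toy energy is only there to check the code.

    import numpy as np

    def numerical_hessian(E, w, eps=1e-4):
        # Central finite-difference Hessian of the scalar energy E at w.
        d = w.size
        H = np.zeros((d, d))
        for i in range(d):
            for j in range(d):
                def E_shift(si, sj):
                    v = w.copy()
                    v[i] += si * eps
                    v[j] += sj * eps
                    return E(v)
                H[i, j] = (E_shift(1, 1) - E_shift(1, -1)
                           - E_shift(-1, 1) + E_shift(-1, -1)) / (4 * eps ** 2)
        return H

    def laplace_posterior(E, w_map):
        # Gaussian (Laplace) approximation: p(w|D) ~= N(w_map, A^{-1}),
        # with A the Hessian of E at the MAP.
        A = numerical_hessian(E, w_map)
        return w_map, np.linalg.inv(A)

    # Toy check: for a quadratic energy the approximation is exact.
    A_true = np.array([[3.0, 1.0], [1.0, 2.0]])
    E = lambda w: 0.5 * w @ A_true @ w
    mean, cov = laplace_posterior(E, np.zeros(2))
    print(cov, np.linalg.inv(A_true))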

  10. Bayesian neural networks • Parameter posterior for classification – The same as the regression case, except β = 1 and E_D is a cross-entropy error of the form: • Predictive posterior for regression – The posterior predictive density is not analytically tractable because of the nonlinearity of f(x, w) – Let us construct a first-order Taylor series approximation around the mode:

  11. Bayesian neural networks • Predictive posterior for regression – We now have a linear-Gaussian model with a Gaussian prior on the weights – The predictive variance depends on the input x:
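
A hedged numpy sketch of the linearised predictive density, assuming the Laplace posterior covariance Σ and the network function f(x, w) are given: with g = ∂f/∂w at w_MP, the prediction is N(y | f(x, w_MP), σ²_noise + gᵀΣg). The toy one-hidden-unit network, the numerical gradient, and all numbers are illustrative.

    import numpy as np

    def predictive_regression(x, w_map, Sigma_post, f, sigma2_noise, eps=1e-5):
        # Linearised predictive density: mean f(x, w_map),
        # variance sigma2_noise + g' Sigma_post g, g = df/dw at w_map.
        d = w_map.size
        g = np.zeros(d)
        for i in range(d):
            w_hi = w_map.copy(); w_hi[i] += eps
            w_lo = w_map.copy(); w_lo[i] -= eps
            g[i] = (f(x, w_hi) - f(x, w_lo)) / (2 * eps)   # finite-difference gradient
        mean = f(x, w_map)
        var = sigma2_noise + g @ Sigma_post @ g
        return mean, var

    # Toy network with a single hidden unit: f(x, w) = w[1] * tanh(w[0] * x)
    f = lambda x, w: w[1] * np.tanh(w[0] * x)
    mean, var = predictive_regression(1.5, np.array([0.8, 1.2]),
                                      0.1 * np.eye(2), f, sigma2_noise=0.05)
    print(mean, var)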

  12. Bayesian neural networks • The posterior predictive density for an MLP with 3 hidden nodes, trained on 16 data points – The dashed green line: the true function – The solid red line: the posterior mean prediction

  13. Bayesian neural networks • Predictive posterior for classification – Approximate p(y|x, D) in the case of binary classification – The situation is similar to the case of logistic regression, except that in addition the posterior predictive mean is a non-linear function of w – where a(x, w) is the pre-synaptic output of the final layer

  14. Bayesian neural networks • Predictive posterior for classification – The posterior predictive for the output: p(y = 1 | x, D) ≈ sigm(κ(σ²) a_MP(x)) – Using the approximation ∫ sigm(a) N(a | μ, σ²) da ≈ sigm(κ(σ²) μ)
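
A minimal sketch of this "moderated output", using the standard κ(σ²) = (1 + πσ²/8)^(−1/2) from the sigmoid-Gaussian convolution approximation; a_map and var_a (the mean and variance of the pre-synaptic output a(x, w) under the Gaussian posterior) are assumed given, for example from the linearisation of the previous slides.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def moderated_output(a_map, var_a):
        # Probit-style approximation: p(y=1|x,D) ~= sigm(kappa(var) * a_map),
        # with kappa(var) = (1 + pi * var / 8)^{-1/2}.
        kappa = 1.0 / np.sqrt(1.0 + np.pi * var_a / 8.0)
        return sigmoid(kappa * a_map)

    # The larger the posterior variance of a(x, w), the more the
    # prediction is pushed towards 0.5.
    for var_a in (0.0, 1.0, 10.0):
        print(var_a, moderated_output(a_map=2.0, var_a=var_a))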

  15. Dropout as a Bayesian Approximation [Gal and Ghahramani '16] • The statement – A neural network of arbitrary depth and non-linearities, with dropout applied before every weight layer, is mathematically equivalent to an approximation to the probabilistic deep Gaussian process (Damianou & Lawrence, 2013) • The notations – ŷ: the output of a NN model with L layers and a loss function E(·,·) such as the softmax loss or the Euclidean (square) loss – W_i ∈ ℝ^(K_i × K_{i−1}): the NN's weight matrix at the i-th layer – b_i: the bias vector at the i-th layer

  16. Dropout as a Bayesian Approximation [Gal and Ghahramani '16] • L2 regularisation of the NN objective: • The deep Gaussian process – Assume we are given a covariance function of the form: – A deep GP with L layers and covariance function K(x, y) can be approximated by placing a variational distribution over each component of a spectral decomposition of the GPs' covariance functions

  17. Dropout as a Bayesian Approximation • Now, W_i is a random matrix of dims K_i × K_{i−1}, where each row of W_i is distributed according to p(w), for each GP layer • The predictive probability of the deep GP model is intractable:

  18. Dropout as a Bayesian Approximation [Gal and Ghahramani '16] • To approximate the intractable posterior, we define an approximating variational distribution: • Minimise the KL divergence between the approximate posterior and the posterior of the full deep GP

  19. The approximating distribution over each weight layer: W_i = M_i · diag([z_{i,j}]), with z_{i,j} ∼ Bernoulli(p_i) – [Figure: the binary vector z_i masks (zeroes out) columns of the variational parameter matrix M_i]
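
If the reconstruction above is right, a draw from q(W_i) is just the variational parameter matrix M_i with randomly zeroed columns, which is exactly dropping units of the previous layer. A minimal numpy sketch; the names M_i and p_keep are mine, and the shapes are toy.

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_dropout_weight(M_i, p_keep):
        # One draw from q(W_i): W_i = M_i * diag(z_i), z_ij ~ Bernoulli(p_keep).
        # Zeroing the j-th column of M_i drops the j-th unit of the previous
        # layer, i.e. standard dropout applied before this weight layer.
        K_prev = M_i.shape[1]
        z = rng.binomial(1, p_keep, size=K_prev)   # one Bernoulli per input unit
        return M_i * z                             # broadcasting = right-multiply by diag(z)

    M = rng.standard_normal((4, 3))                # variational parameters M_i
    print(sample_dropout_weight(M, p_keep=0.5))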

  20. Dropout as a Bayesian Approximation [Gal and Ghahramani '16] • Approximate the first term of the KL using a single sample: • Approximate the second term of the KL as: • Thus, the approximated KL objective:
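
The approximated KL objective coincides, up to constants, with the usual dropout training objective: a data-fit loss evaluated with dropout masks sampled per example, plus L2 weight decay on the variational parameters. A hedged numpy sketch for the square-loss case with a single Monte Carlo mask per example; the network sizes, p_keep and weight_decay values are illustrative, not the paper's.

    import numpy as np

    rng = np.random.default_rng(0)

    def dropout_objective(X, Y, M1, M2, p_keep, weight_decay):
        # Square-error loss with dropout applied before each weight layer,
        # plus L2 regularisation of the variational parameters M1, M2.
        N = X.shape[0]
        loss = 0.0
        for x, y in zip(X, Y):
            z0 = x * rng.binomial(1, p_keep, size=x.shape)   # dropout on the input
            h = np.tanh(M1 @ z0)
            z1 = h * rng.binomial(1, p_keep, size=h.shape)   # dropout on the hidden layer
            y_hat = M2 @ z1
            loss += np.sum((y - y_hat) ** 2)
        reg = weight_decay * (np.sum(M1 ** 2) + np.sum(M2 ** 2))
        return loss / N + reg

    X = rng.standard_normal((8, 2)); Y = rng.standard_normal((8, 1))
    M1, M2 = rng.standard_normal((5, 2)), rng.standard_normal((1, 5))
    print(dropout_objective(X, Y, M1, M2, p_keep=0.8, weight_decay=1e-3))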

  21. Dropout as a Bayesian Approximation [Gal and Ghahramani ‘16] • Approximate predictive distribution • MC dropout for approximation – Sample T sets of vectors of realisations from the Bernoulli distribution
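
A minimal numpy sketch of MC dropout at test time for a toy one-hidden-layer network: keep the Bernoulli masks switched on, run T stochastic forward passes, and average them to approximate the predictive mean. All names and sizes (M1, M2, p_keep, T) are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)

    def mc_dropout_predict(x, M1, M2, p_keep, T=100):
        # MC dropout: average T stochastic forward passes (dropout kept on
        # at test time) to approximate the predictive mean E[y* | x*].
        samples = []
        for _ in range(T):
            z1 = rng.binomial(1, p_keep, size=M1.shape[1])
            h = np.tanh((M1 * z1) @ x)                 # dropout before layer 1
            z2 = rng.binomial(1, p_keep, size=M2.shape[1])
            samples.append((M2 * z2) @ h)              # dropout before layer 2
        samples = np.array(samples)                    # shape (T, output_dim)
        return samples.mean(axis=0), samples

    M1, M2 = rng.standard_normal((5, 2)), rng.standard_normal((1, 5))
    mean, samples = mc_dropout_predict(np.array([0.3, -1.2]), M1, M2, p_keep=0.8)
    print(mean, samples.std(axis=0))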

  22. Dropout as a Bayesian Approximation [Gal and Ghahramani ‘16] • Model uncertainty (estimating the second raw moment)
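
Following the paper's estimator, the predictive variance adds the inverse model precision τ⁻¹ to the sample second raw moment of the T stochastic passes minus the squared predictive mean; in the paper τ is tied to the dropout probability, prior length-scale and weight decay (τ = p·l²/(2Nλ)), whereas the sketch below simply takes τ as an input. The `samples` array is any (T, D) stack of stochastic forward passes, e.g. from the MC-dropout sketch above.

    import numpy as np

    def mc_dropout_uncertainty(samples, tau):
        # Var[y*] ~= tau^{-1} I + (1/T) sum_t y_t y_t^T - E[y*] E[y*]^T :
        # sample second raw moment minus squared mean, plus inverse precision.
        T, D = samples.shape
        mean = samples.mean(axis=0)
        second_raw = samples.T @ samples / T           # (1/T) sum_t y_t y_t^T
        return np.eye(D) / tau + second_raw - np.outer(mean, mean)

    # Demo with synthetic forward-pass samples (any (T, D) array works):
    samples = np.random.default_rng(1).normal(2.0, 0.3, size=(100, 1))
    print(mc_dropout_uncertainty(samples, tau=10.0))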

  23. Dropout as a Bayesian Approximation [Gal and Ghahramani ‘16] A scatter of 100 forward passes

  24. Dropout as a Bayesian Approximation [Gal and Ghahramani ‘16]
