Bayesian Feedforward Neural Networks
Seung-Hoon Na, Chonbuk National University
Neural networks compared to GPs
- Neural networks: a nonlinear generalization of GLMs
- Here, defined as a logistic regression model applied to the output of another logistic regression model
- To make the connection between GPs and NNs [Neal '96], now consider a neural network for regression with one hidden layer
Neural networks compared to GPs
- Use the priors on the weights:
(Figure: a one-hidden-layer network with input x, hidden units h_1, ..., h_J with their activations y_1, ..., y_J, and weights w_j, v_jk.)
Neural networks compared to GPs
- Let σ_v scale as 1/√H: since more hidden units will increase the input to the final node, we should scale down the magnitude of the output weights
- Then, as the number of hidden units H → ∞, we get a Gaussian process (see the sketch below)
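A minimal numerical sketch of this limit, assuming tanh hidden units, standard normal input-to-hidden weights, and the 1/√H scaling of the output-weight prior; the names and constants here are illustrative rather than taken from the slides.

```python
import numpy as np

# Sketch of [Neal '96]: one-hidden-layer regression network
#   f(x) = b + sum_j v_j * tanh(u_j * x + a_j)
# with assumed priors b ~ N(0, sigma_b^2), v_j ~ N(0, sigma_v^2 / H),
# u_j, a_j ~ N(0, 1).  Scaling Var[v_j] by 1/H keeps the output variance
# finite, so as H -> infinity draws of f behave like GP samples.

def sample_nn_function(xs, H=10_000, sigma_b=1.0, sigma_v=1.0, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    b = rng.normal(0.0, sigma_b)                       # output bias
    v = rng.normal(0.0, sigma_v / np.sqrt(H), size=H)  # scaled output weights
    u = rng.normal(0.0, 1.0, size=H)                   # input-to-hidden weights
    a = rng.normal(0.0, 1.0, size=H)                   # hidden biases
    hidden = np.tanh(np.outer(xs, u) + a)              # (N, H) hidden activations
    return b + hidden @ v                              # (N,) function values

xs = np.linspace(-3, 3, 50)
draws = np.stack([sample_nn_function(xs, rng=np.random.default_rng(s))
                  for s in range(5)])                  # 5 approximate GP draws
print(draws.shape)  # (5, 50)
```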
Neural networks compared to GPs
- If we use the error function g(x; u) = erf(u_0 + Σ_j u_j x_j) as the activation / transfer function, and choose u ∼ N(0, Σ)
- Then the covariance kernel becomes [Williams '98]:
  k(x, x') = (2/π) sin⁻¹( 2 x̃ᵀ Σ x̃' / √((1 + 2 x̃ᵀ Σ x̃)(1 + 2 x̃'ᵀ Σ x̃')) ),  where x̃ = (1, x₁, ..., x_D)
- This is a true "neural network" kernel
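A small sketch of this kernel, assuming an isotropic prior Σ = σ²I over the augmented input-to-hidden weights; the function name and the σ² parameter are illustrative.

```python
import numpy as np

# "Neural network" covariance of [Williams '98] for erf hidden units:
#   k(x, x') = (2/pi) * arcsin( 2 xt^T S xt'
#              / sqrt((1 + 2 xt^T S xt) * (1 + 2 xt'^T S xt')) )
# where xt = (1, x) is the augmented input and S is the prior covariance
# of the input-to-hidden weights (assumed here to be sigma2 * I).

def nn_kernel(X1, X2, sigma2=1.0):
    X1a = np.hstack([np.ones((X1.shape[0], 1)), X1])   # augment with bias term
    X2a = np.hstack([np.ones((X2.shape[0], 1)), X2])
    S = sigma2 * np.eye(X1a.shape[1])
    num = 2.0 * X1a @ S @ X2a.T
    d1 = 1.0 + 2.0 * np.einsum('ij,jk,ik->i', X1a, S, X1a)
    d2 = 1.0 + 2.0 * np.einsum('ij,jk,ik->i', X2a, S, X2a)
    return (2.0 / np.pi) * np.arcsin(num / np.sqrt(np.outer(d1, d2)))

X = np.linspace(-2, 2, 5).reshape(-1, 1)
print(nn_kernel(X, X).round(3))   # 5x5 covariance matrix
```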
Feedforward neural networks
- NN with two layers for a regression problem
- g: a non-linear activation or transfer function
- z(x) = φ(x, V): called the hidden layer
- NN for binary classification
- NN for multi-output regression
- NN for multi-class classification
Feedforward neural networks
A neural network with one hidden layer.
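A minimal sketch of the two-layer network shown above, under assumed shapes (V is H×D, w has length H) and tanh as the transfer function g; the binary-classification variant only adds a sigmoid output.

```python
import numpy as np

# Forward pass of a one-hidden-layer network (assumed shapes and names):
# z = g(V x) is the hidden layer, a = w^T z the pre-synaptic output of the
# final layer; regression returns a directly, binary classification
# squashes it through a sigmoid.

def forward(x, V, w, task='regression'):
    z = np.tanh(V @ x)                        # hidden layer z = g(V x)
    a = w @ z                                 # final-layer pre-activation
    if task == 'binary':
        return 1.0 / (1.0 + np.exp(-a))       # p(y = 1 | x)
    return a                                  # f(x, w) for regression

rng = np.random.default_rng(0)
D, H = 3, 5
x = rng.normal(size=D)
V, w = rng.normal(size=(H, D)), rng.normal(size=H)
print(forward(x, V, w), forward(x, V, w, task='binary'))
```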
Bayesian neural networks
- Use prior of the form:
- where w represents all the weights combined
- Posterior can be approximated:
Bayesian neural networks
- A second-order Taylor series approximation of E(w) around its minimum w_MAP (the MAP estimate), where the linear term vanishes:
  E(w) ≈ E(w_MAP) + ½ (w − w_MAP)ᵀ H (w − w_MAP)
- H is the Hessian of E
- Using the quadratic approximation, the posterior becomes Gaussian:
  p(w | D) ≈ N(w | w_MAP, H⁻¹)
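A toy sketch of this Laplace step, with an assumed stand-in objective E(w) and a finite-difference Hessian in place of a real network loss.

```python
import numpy as np
from scipy.optimize import minimize

# Laplace approximation sketch: expand E(w) = -log p(w | D) + const around
# its minimum w_MAP, so p(w | D) ~= N(w | w_MAP, H^{-1}) with H the Hessian
# of E.  The quadratic E below is only a toy stand-in for a BNN objective.

def E(w):
    return np.sum((w - np.array([1.0, -2.0])) ** 2) + 0.5 * np.sum(w ** 2)

def numerical_hessian(f, w, eps=1e-4):
    d = len(w)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            ei, ej = np.eye(d)[i] * eps, np.eye(d)[j] * eps
            H[i, j] = (f(w + ei + ej) - f(w + ei - ej)
                       - f(w - ei + ej) + f(w - ei - ej)) / (4 * eps ** 2)
    return H

w_map = minimize(E, np.zeros(2)).x          # MAP estimate
H = numerical_hessian(E, w_map)             # Hessian at the mode
cov = np.linalg.inv(H)                      # Gaussian posterior covariance
print(w_map.round(3), cov.round(3))
```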
Bayesian neural networks
- Parameter posterior for classification
- The same as the regression case, except β = 1 and E_D is a cross-entropy error of the form
  E_D(w) = −Σ_n [ y_n ln ŷ_n + (1 − y_n) ln(1 − ŷ_n) ]
- Predictive posterior for regression
- The posterior predictive density is not analytically tractable because of the nonlinearity of f(x, w)
- Let us construct a first-order Taylor series approximation around the mode:
  f(x, w) ≈ f(x, w_MAP) + gᵀ(w − w_MAP),  where g = ∇_w f(x, w) evaluated at w_MAP
Bayesian neural networks
- Predictive posterior for regression
- We now have a linear-Gaussian model with a Gaussian prior on the weights
- The predictive variance depends on the input x (sketched below):
  p(y | x, D) ≈ N(y | f(x, w_MAP), σ²(x)),  σ²(x) = β⁻¹ + gᵀ H⁻¹ g
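A sketch of these predictive moments, assuming the network f, its gradient with respect to the weights, w_MAP, the inverse Hessian, and the noise precision β are already available; the toy f below is illustrative.

```python
import numpy as np

# Linearised predictive density: f(x, w) ~= f(x, w_MAP) + g(x)^T (w - w_MAP)
# gives p(y | x, D) ~= N(y | f(x, w_MAP), 1/beta + g(x)^T A^{-1} g(x)),
# with A the Hessian of E at w_MAP.  All inputs are assumed given.

def predictive_moments(x, f, grad_f, w_map, A_inv, beta):
    mean = f(x, w_map)
    g = grad_f(x, w_map)                     # gradient w.r.t. the weights
    var = 1.0 / beta + g @ A_inv @ g         # input-dependent variance
    return mean, var

# toy model that is linear in w, so the Taylor step happens to be exact
f = lambda x, w: w[0] + w[1] * np.tanh(x)
grad_f = lambda x, w: np.array([1.0, np.tanh(x)])
m, v = predictive_moments(0.5, f, grad_f,
                          w_map=np.array([0.1, 2.0]),
                          A_inv=0.05 * np.eye(2), beta=10.0)
print(m, v)
```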
Bayesian neural networks
- The posterior predictive density for an MLP with 3 hidden nodes, trained on 16 data points
- The dashed green line: the true function
- The solid red line: the posterior mean prediction
Bayesian neural networks
- Predictive posterior for classification
- Approximate p(y | x, D) in the case of binary classification
- The situation is similar to the case of logistic regression, except that in addition the posterior predictive mean is a non-linear function of w
- p(y = 1 | x, D) = ∫ σ(a(x, w)) p(w | D) dw, where a(x, w) is the pre-synaptic output of the final layer
Bayesian neural networks
- Predictive posterior for classification
- The posterior predictive for the output, using the approximation:
  p(y = 1 | x, D) ≈ σ( κ(σ_a²) μ_a ),  κ(σ²) = (1 + π σ² / 8)^(−1/2)
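A sketch of this moderated output, assuming the usual probit-style correction κ(σ²) = (1 + πσ²/8)^(−1/2) applied to a Gaussian approximation a ∼ N(μ_a, σ_a²) of the pre-synaptic output.

```python
import numpy as np

# Moderated sigmoid output: with a ~ N(mu_a, sigma2_a) for the final-layer
# pre-activation, the logistic-Gaussian integral is approximated by
#   p(y = 1 | x, D) ~= sigmoid(kappa(sigma2_a) * mu_a),
#   kappa(s2) = (1 + pi * s2 / 8) ** -0.5.

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def moderated_output(mu_a, sigma2_a):
    kappa = 1.0 / np.sqrt(1.0 + np.pi * sigma2_a / 8.0)
    return sigmoid(kappa * mu_a)

print(moderated_output(2.0, 0.0), moderated_output(2.0, 9.0))
# larger predictive variance pulls the probability back towards 0.5
```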
Dropout as a Bayesian Approximation [Gal and Ghahramani '16]
- The statement
- A neural network with arbitrary depth and non-linearities, with dropout applied before every weight layer, is mathematically equivalent to an approximation to the probabilistic deep Gaussian process (Damianou & Lawrence, 2013)
- The notations
- ŷ: the output of a NN model with L layers and a loss function E(·, ·), such as the softmax loss or the Euclidean (square) loss
- W_i (of dimensions K_i × K_{i−1}): the NN's weight matrix at the i-th layer
- b_i: the bias vector at the i-th layer
Dropout as a Bayesian Approximation [Gal and Ghahramani '16]
- L2 regularisation of NN
- The deep Gaussian process
- assume we are given a covariance function of the form
  K(x, y) = ∫ p(w) p(b) σ(wᵀ x + b) σ(wᵀ y + b) dw db
- a deep GP with L layers and covariance function K(x, y) can be approximated by placing a variational distribution over each component of a spectral decomposition of the GPs' covariance functions
Dropout as a Bayesian Approximation
- The predictive probability of the deep GP model,
  p(y | x, X, Y) = ∫ p(y | x, ω) p(ω | X, Y) dω,
  is intractable
- Now, ω = {W_i}_{i=1}^L, where each W_i is a random matrix of dims K_i × K_{i−1} for each GP layer, and each row of W_i ∼ p(w)
- m_i: a vector of dims K_i for each GP layer
Dropout as a Bayesian Approximation [Gal and Ghahramani '16]
- To approximate the intractable posterior p(ω | X, Y), we define q(ω) (sketched below):
  W_i = M_i · diag([z_{i,j}]_{j=1}^{K_{i−1}}),  z_{i,j} ∼ Bernoulli(p_i)  for i = 1, ..., L, j = 1, ..., K_{i−1}
- Minimise the KL divergence between the approximate posterior q(ω) and the posterior of the full deep GP
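A sketch of sampling a weight matrix from this q(ω), assuming M_i is the variational parameter matrix and p_i is the probability of keeping a unit (one minus the dropout rate).

```python
import numpy as np

# q(W_i): W_i = M_i * diag(z_i), z_{i,j} ~ Bernoulli(p_i).  Each column of
# the sampled weight matrix is either kept (the column of M_i) or zeroed
# out, which is exactly what dropout does to the layer's input units.

def sample_q_W(M, p_keep, rng):
    # M: (K_i, K_{i-1}) variational parameters; p_keep: keep probability
    z = rng.binomial(1, p_keep, size=M.shape[1])   # one z per input unit
    return M * z                                   # same as M @ np.diag(z)

rng = np.random.default_rng(0)
M = rng.normal(size=(4, 3))
print(sample_q_W(M, p_keep=0.5, rng=rng))
```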
Dropout as a Bayesian Approximation [Gal and Ghahramani '16]
- Approximate the first term of the KL using a single Monte Carlo sample ω̂ ∼ q(ω):
- Approximate the second term of the KL to:
- Thus, the approximated KL objective (see the sketch below):
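A sketch of the resulting objective for the Euclidean (square) loss, with an assumed weight-decay coefficient lam standing in for the constants the paper derives.

```python
import numpy as np

# Approximated KL objective: ordinary dropout training with L2 weight decay,
#   L ~ (1/N) * sum_n E(y_n, yhat_n) + lam * sum_i (||W_i||^2 + ||b_i||^2).
# `weights` is an assumed list of (W_i, b_i) pairs from the network.

def dropout_objective(y, y_hat, weights, lam):
    data_term = np.mean((y - y_hat) ** 2)          # Euclidean (square) loss
    reg_term = sum(np.sum(W ** 2) + np.sum(b ** 2) for W, b in weights)
    return data_term + lam * reg_term

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
print(dropout_objective(np.ones(10), 0.9 * np.ones(10), [(W1, b1)], lam=1e-3))
```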
Dropout as a Bayesian Approximation [Gal and Ghahramani '16]
- Approximate predictive distribution
- MC dropout for approximation
- Sample T sets of vectors of realisations from the Bernoulli distribution
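A sketch of the MC dropout estimate of the predictive mean, assuming `forward_with_dropout` is the trained network with its Bernoulli masks resampled on every call; the stand-in network below is only for illustration.

```python
import numpy as np

# MC dropout: keep dropout active at test time, run T stochastic forward
# passes, and average the outputs to approximate the predictive mean.

def mc_dropout_mean(forward_with_dropout, x_star, T=100):
    samples = np.stack([forward_with_dropout(x_star) for _ in range(T)])
    return samples.mean(axis=0), samples

rng = np.random.default_rng(0)
dummy_forward = lambda x: np.tanh(x) + 0.1 * rng.normal()   # stand-in stochastic net
mean, samples = mc_dropout_mean(dummy_forward, x_star=np.array([0.3]), T=100)
print(mean, samples.std(axis=0))
```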
Dropout as a Bayesian Approximation [Gal and Ghahramani '16]
- Model uncertainty (estimating the second raw moment)
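A sketch of the matching uncertainty estimate from the same T stochastic passes, assuming a known model precision τ.

```python
import numpy as np

# Model uncertainty via the second raw moment of the T passes:
#   Var[y* | x*] ~= 1/tau + (1/T) * sum_t y_t**2 - mean**2   (element-wise).
# tau is an assumed, known model (noise) precision.

def mc_dropout_variance(samples, tau):
    mean = samples.mean(axis=0)
    second_moment = (samples ** 2).mean(axis=0)
    return 1.0 / tau + second_moment - mean ** 2

samples = np.random.default_rng(0).normal(1.0, 0.3, size=(100, 1))
print(mc_dropout_variance(samples, tau=25.0))   # roughly 1/25 + 0.3**2
```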
Dropout as a Bayesian Approximation [Gal and Ghahramani '16]
A scatter of 100 forward passes