SLIDE 1

Bayesian feedforward Neural networks

Seung-Hoon Na Chonbuk National University

SLIDE 2

Neural networks compared to GPs

  • Neural networks: Nonlinear generalization of GLMs
  • Here, defined by a logistic regression model applied to the outputs of another logistic regression model

– To make the connection between GPs and NNs [Neal 96], now consider a neural network for regression with one hidden layer
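A minimal sketch of that one-hidden-layer construction; the symbol choices (b, v_j, u_j, H) are assumed here rather than taken from the slide:

f(x) = b + \sum_{j=1}^{H} v_j \, h(x; u_j)

where h(·; u_j) is the j-th hidden unit with input-to-hidden weights u_j, b is the output bias, and v_j are the hidden-to-output weights.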

SLIDE 3

Neural networks compared to GPs

  • Use the priors on the weights:

[Figure: one-hidden-layer network diagram with hidden units h_1, …, h_n, outputs y_1, …, y_o, and weights w, v]

Hidden unit activation
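A sketch of the zero-mean priors typically placed on these weights in Neal's construction; the variance symbols σ_b, σ_v and the generic prior p(u) are assumptions here:

b \sim \mathcal{N}(0, \sigma_b^2), \qquad v_j \sim \mathcal{N}(0, \sigma_v^2), \qquad u_j \sim p(u) \ \text{(i.i.d.)}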

SLIDE 4

Neural networks compared to GPs

– Let the prior standard deviation of the hidden-to-output weights shrink as the number of hidden units H grows (see the sketch below)

  • since more hidden units will increase the input to the final node, we should scale down the magnitude of the weights; in the limit of infinitely many hidden units we then get a Gaussian process
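A sketch of the scaling argument; the constant ω is an assumption here:

\sigma_v = \omega \, H^{-1/2}

With this choice the variance of f(x) stays finite as H grows, and by the central limit theorem the prior over f converges, as H → ∞, to a Gaussian process with zero mean and covariance k(x, x') = \sigma_b^2 + \omega^2 \, \mathbb{E}_{u}\!\left[ h(x; u)\, h(x'; u) \right].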

SLIDE 5

Neural networks compared to GPs

  • If we use the error function h(x; u) = erf(u_0 + uᵀx) as the activation / transfer function and choose a Gaussian prior u ∼ N(0, Σ)

  • Then the covariance kernel [Williams '98]:
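A sketch of that kernel in its standard form; x̃ = (1, x)ᵀ denotes the input augmented with a 1:

k_{NN}(x, x') = \frac{2}{\pi} \sin^{-1}\!\left( \frac{2\, \tilde{x}^\top \Sigma\, \tilde{x}'}{\sqrt{(1 + 2\, \tilde{x}^\top \Sigma\, \tilde{x})\,(1 + 2\, \tilde{x}'^\top \Sigma\, \tilde{x}')}} \right)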

This is a true "neural network" kernel

SLIDE 6

Feedforward neural networks

  • NN with two layers for a regression problem (see the sketch after this list)

– g: a non-linear activation or transfer function
– z(x) = g(Vx): called the hidden layer

  • NN for binary classification
  • NN for multi-output regression
  • NN for multi-class classification
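A minimal sketch of the listed two-layer models, using the notation z(x) = g(Vx) from above; the symbols w, W, σ² are assumptions here:

Regression:              p(y | x, θ) = \mathcal{N}(y \mid w^\top z(x), \sigma^2)
Binary classification:   p(y | x, θ) = \mathrm{Ber}(y \mid \mathrm{sigm}(w^\top z(x)))

For multi-output regression and multi-class classification, the output weight vector w is replaced by a matrix W, and sigm by the softmax function, respectively.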
SLIDE 7

Feedforward neural networks

A neural network with one hidden layer.

SLIDE 8

Bayesian neural networks

  • Use a prior of the form:

– where w represents all the weights combined

  • The posterior can then be approximated:
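A sketch of the standard setup behind these two bullets; the precision symbol α and the energy function E(w) are assumptions here:

p(w) = \mathcal{N}(w \mid 0, \alpha^{-1} I), \qquad p(w \mid \mathcal{D}) \;\propto\; p(\mathcal{D} \mid w)\, p(w) \;=\; \exp(-E(w))

where E(w) collects the negative log-likelihood and the negative log-prior.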
SLIDE 9

Bayesian neural networks

  • A second-order Taylor series approximation of E(w) around its minimum (the MAP estimate)

– H is the Hessian of E evaluated at the MAP estimate

  • Using the quadratic approximation, the posterior becomes Gaussian:
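A sketch of the resulting Laplace approximation, with H and w_MAP as above:

E(w) \approx E(w_{MAP}) + \tfrac{1}{2}\,(w - w_{MAP})^\top H\, (w - w_{MAP}) \quad\Longrightarrow\quad p(w \mid \mathcal{D}) \approx \mathcal{N}\big(w \mid w_{MAP},\, H^{-1}\big)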

SLIDE 10

Bayesian neural networks

  • Parameter posterior for classification
  • The same as the regression case, except that the noise-precision hyperparameter is fixed to 1 and the data term E_D is a cross-entropy error of the form

  • Predictive posterior for regression
  • The posterior predictive density is not analytically tractable because of the nonlinearity of the network function f(x, w)

  • Let us construct a first-order Taylor series approximation around the mode:
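A sketch of that linearization; g denotes the gradient of the network output with respect to the weights, evaluated at the MAP estimate (an assumed symbol):

f(x, w) \approx f(x, w_{MAP}) + g^\top (w - w_{MAP}), \qquad g = \nabla_w f(x, w)\big|_{w = w_{MAP}}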

SLIDE 11

Bayesian neural networks

  • Predictive posterior for regression

– We now have a linear-Gaussian model with a Gaussian prior on the weights
– The predictive variance depends on the input x:
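A sketch of the resulting predictive density; β is the assumed observation-noise precision and H the Hessian from the Laplace approximation:

p(y \mid x, \mathcal{D}) \approx \mathcal{N}\big(y \mid f(x, w_{MAP}),\; \sigma^2(x)\big), \qquad \sigma^2(x) = \beta^{-1} + g^\top H^{-1} g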

SLIDE 12

Bayesian neural networks

  • The posterior predictive density for an MLP with 3 hidden nodes, trained on 16 data points

– The dashed green line: the true function
– The solid red line: the posterior mean prediction

SLIDE 13

Bayesian neural networks

  • Predictive posterior for classification

– Approximate p(y | x, D) in the case of binary classification

  • The situation is similar to the case of logistic regression, except that in addition the posterior predictive mean is a non-linear function of the weights w

  • where a(x, w) is the pre-synaptic (pre-activation) output of the final layer
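A sketch of the integral being approximated; the symbols a_MAP and σ_a² are assumptions here, obtained from the linearised model above:

p(y = 1 \mid x, \mathcal{D}) \approx \int \mathrm{sigm}(a)\, \mathcal{N}\big(a \mid a_{MAP}, \sigma_a^2\big)\, da, \qquad a_{MAP} = a(x, w_{MAP}), \quad \sigma_a^2 = g^\top H^{-1} g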
SLIDE 14

Bayesian neural networks

  • Predictive posterior for classification

– The posterior predictive for the output
– Using the standard probit-style approximation to the sigmoid–Gaussian integral, this gives the moderated output

p(y = 1 \mid x, \mathcal{D}) \approx \mathrm{sigm}\big(\kappa(\sigma_a^2)\, a_{MAP}\big), \qquad \kappa(\sigma^2) = \big(1 + \pi \sigma^2 / 8\big)^{-1/2}

SLIDE 15

Dropout as a Bayesian Approximation [Gal and Ghahramani '16]

  • The statement

– A neural network with arbitrary depth and non-linearities, with dropout applied before every weight layer, is mathematically equivalent to an approximation to the probabilistic deep Gaussian process (Damianou & Lawrence, 2013)

  • The notations

– ŷ: the output of a NN model with L layers and a loss function E(·, ·) such as the softmax loss or the Euclidean (squared) loss
– W_i ∈ ℝ^{K_i × K_{i−1}}: the NN's weight matrix at the i-th layer
– b_i: the bias vector at the i-th layer

SLIDE 16

Dropout as a Bayesian Approximation [Gal and Ghahramani '16]

  • L2 regularisation of NN
  • The deep Gaussian process

– assume we are given a covariance function of the form sketched below
– a deep GP with L layers and covariance function K(x, y) can be approximated by placing a variational distribution over each component of a spectral decomposition of the GPs' covariance functions
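A sketch of the two objects referenced here, following the forms used in [Gal and Ghahramani '16]; the symbols λ, σ(·), p(w), p(b) are notation assumed for this sketch:

\mathcal{L}_{dropout} = \frac{1}{N} \sum_{n=1}^{N} E(y_n, \hat{y}_n) + \lambda \sum_{i=1}^{L} \big( \|W_i\|_2^2 + \|b_i\|_2^2 \big)

K(x, y) = \int p(w)\, p(b)\; \sigma(w^\top x + b)\, \sigma(w^\top y + b)\; dw\, db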

SLIDE 17

Dropout as a Bayesian Approximation

  • The predictive probability of the deep GP model is intractable

  • Now, W_i is a random matrix of dimensions K_i × K_{i−1}
  • and m_i is a vector of dimensions K_i for each GP layer

where each row of W_i is distributed according to p(w)
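A sketch of that intractable predictive probability; ω = {W_i}_{i=1}^{L} collects the random weight matrices, and (X, Y) denotes the training data:

p(y^* \mid x^*, X, Y) = \int p(y^* \mid x^*, \omega)\, p(\omega \mid X, Y)\, d\omega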

SLIDE 18

Dropout as a Bayesian Approximation [Gal and Ghahramani '16]

  • To approximate the intractable posterior p(ω | X, Y), we define an approximating variational distribution q(ω) (sketched below)

  • Minimise the KL divergence between the approximate posterior and the posterior of the full deep GP
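A sketch of the factorised approximating distribution and the resulting objective; the variational parameters M_i, m_i, the dropout probabilities p_i, and the exact indexing are assumptions for this sketch:

W_i = M_i \cdot \mathrm{diag}\big([z_{i,j}]_{j=1}^{K_{i-1}}\big), \qquad z_{i,j} \sim \mathrm{Bernoulli}(p_i)

Minimising KL(q(ω) ‖ p(ω | X, Y)) is then equivalent to minimising

-\sum_{n=1}^{N} \int q(\omega)\, \log p(y_n \mid x_n, \omega)\, d\omega \;+\; \mathrm{KL}\big(q(\omega)\,\|\,p(\omega)\big)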

SLIDE 19

[Figure: schematic of the construction W_i = M_i · diag(z_i), with each column of M_i kept or zeroed by the corresponding Bernoulli variable z_{i,k}]

SLIDE 20

Dropout as a Bayesian Approximation [Gal and Ghahramani '16]

  • Approximate the first term of the KL objective using a single Monte Carlo sample:

  • Approximate the second term of the KL objective by:
  • Thus, the approximated KL objective (sketched below):
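A sketch of how these two approximations are commonly written; the prior length-scale l and the constants are assumptions here, so this is indicative rather than a verbatim reproduction of the paper:

-\int q(\omega)\, \log p(y_n \mid x_n, \omega)\, d\omega \;\approx\; -\log p(y_n \mid x_n, \hat{\omega}_n), \qquad \hat{\omega}_n \sim q(\omega)

\mathrm{KL}\big(q(W_i)\,\|\,p(W_i)\big) \;\approx\; \frac{p_i\, l^2}{2}\, \|M_i\|_2^2 + \frac{l^2}{2}\, \|m_i\|_2^2 + \text{const}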
SLIDE 21

Dropout as a Bayesian Approximation [Gal and Ghahramani '16]

  • Approximate predictive distribution
  • MC dropout for approximation

– Sample T sets of vectors of Bernoulli realisations, giving T dropped-out weight configurations, and average the corresponding forward passes (a code sketch follows)
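A minimal, self-contained Python sketch of MC dropout for a regression network; the tiny network, its (random, untrained) weights, and the values of p_keep and tau are all hypothetical, and the point is only the mechanic of keeping dropout active at test time and averaging T stochastic forward passes:

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical weights for a tiny one-hidden-layer regression MLP.
W1 = rng.normal(size=(20, 1)); b1 = np.zeros(20)
W2 = rng.normal(size=(1, 20)); b2 = np.zeros(1)
p_keep = 0.9   # keep probability (dropout rate = 1 - p_keep)
tau = 10.0     # assumed model precision; contributes the 1/tau noise term

def forward(x, drop=True):
    # One stochastic forward pass; the dropout mask is applied before the weight layer.
    h = np.maximum(0.0, W1 @ x + b1)               # hidden layer (ReLU)
    if drop:
        z = rng.binomial(1, p_keep, size=h.shape)  # Bernoulli realisations
        h = h * z / p_keep
    return W2 @ h + b2

def mc_dropout_predict(x, T=100):
    # Average T dropout forward passes: predictive mean and (moment-matched) variance.
    ys = np.stack([forward(x, drop=True) for _ in range(T)])
    mean = ys.mean(axis=0)
    var = 1.0 / tau + (ys ** 2).mean(axis=0) - mean ** 2
    return mean, var

mean, var = mc_dropout_predict(np.array([0.5]), T=200)
print(mean, var)

The variance line implements the second-raw-moment estimator summarised on the next slide.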

SLIDE 22

Dropout as a Bayesian Approximation [Gal and Ghahramani '16]

  • Model uncertainty (estimating the second raw moment):
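A sketch of the moment-matching estimator; τ denotes the model precision and ŷ_t(x*) the t-th stochastic forward pass (symbols assumed here):

\mathbb{E}\big[(y^*)^\top y^*\big] \;\approx\; \tau^{-1} I_D + \frac{1}{T} \sum_{t=1}^{T} \hat{y}_t(x^*)^\top \hat{y}_t(x^*)

\mathrm{Var}\big[y^*\big] \;\approx\; \tau^{-1} I_D + \frac{1}{T} \sum_{t=1}^{T} \hat{y}_t(x^*)^\top \hat{y}_t(x^*) \;-\; \mathbb{E}\big[y^*\big]^\top \mathbb{E}\big[y^*\big]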

SLIDE 23

Dropout as a Bayesian Approximation [Gal and Ghahramani '16]

A scatter of 100 forward passes

SLIDE 24

Dropout as a Bayesian Approximation [Gal and Ghahramani '16]