Bayesian neural networks: a function space view tour Yingzhen Li - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Bayesian neural networks: a function space view tour

Yingzhen Li Microsoft Research Cambridge

slide-2
SLIDE 2

Neural networks 101

Let’s say we want to classify different types of cats

  • x: input images; y: output label
  • build a neural network (with param. W ):

$$p(y \mid x, W) = \mathrm{softmax}(f_W(x))$$

"cat"

A typical neural network:

$$f_W(x) = W_L\,\phi\big(W_{L-1}\,\phi(\cdots\phi(W_1 x + b_1)\cdots) + b_{L-1}\big) + b_L$$

For the $l$-th layer: $h_l = \phi(W_l h_{l-1} + b_l)$, with $h_1 = \phi(W_1 x + b_1)$.

Parameters: $W = \{W_1, b_1, \ldots, W_L, b_L\}$; nonlinearity: $\phi(\cdot)$.
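As a rough illustration of the network definition above, the forward pass and softmax output could be sketched in NumPy as follows. The layer sizes, the tanh nonlinearity and the Gaussian random weights are assumptions made for this example only; they are not specified on the slide.

```python
import numpy as np

def softmax(logits):
    # Subtract the row-wise max for numerical stability; the result is unchanged.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def forward(x, params, phi=np.tanh):
    # params = [(W_1, b_1), ..., (W_L, b_L)]; phi is applied to every layer but the last.
    h = x
    for W, b in params[:-1]:
        h = phi(h @ W.T + b)          # h_l = phi(W_l h_{l-1} + b_l)
    W_L, b_L = params[-1]
    return h @ W_L.T + b_L            # f_W(x): the pre-softmax logits

rng = np.random.default_rng(0)
sizes = [4, 8, 8, 3]                  # hypothetical: 4 inputs, two hidden layers, 3 classes
params = [(rng.normal(size=(o, i)), rng.normal(size=o))
          for i, o in zip(sizes[:-1], sizes[1:])]
x = rng.normal(size=(5, 4))           # a batch of 5 inputs
probs = softmax(forward(x, params))   # p(y | x, W), one row per input
```

Each row of `probs` is a distribution over the (here three) class labels.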

slide-3
SLIDE 3

Neural networks 101

Let’s say we want to classify different types of cats

  • x: input images; y: output label
  • build a neural network (with param. W ):

$$p(y \mid x, W) = \mathrm{softmax}(f_W(x))$$

"cat"

Typical deep learning solution: train the neural network weights.

  • Maximum likelihood estimation (MLE) given a dataset $D = \{(x_n, y_n)\}_{n=1}^N$:

$$W^* = \arg\max_W \sum_{n=1}^N \log p(y_n \mid x_n, W)$$

(equivalently, minimise the negative log-likelihood)
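To make the MLE objective concrete, here is a minimal sketch that minimises the negative log-likelihood by gradient descent. For brevity it assumes a linear softmax model $f_W(x) = W^\top x$ rather than a deep network, and the toy data, step size and iteration count are arbitrary choices for illustration.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(W, X, y):
    # Negative log-likelihood: -sum_n log p(y_n | x_n, W).
    p = softmax(X @ W)
    return -np.log(p[np.arange(len(y)), y]).sum()

def nll_grad(W, X, y):
    # Gradient of the NLL for a linear softmax model: X^T (p - onehot(y)).
    p = softmax(X @ W)
    p[np.arange(len(y)), y] -= 1.0
    return X.T @ p

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)       # toy binary labels
W = np.zeros((2, 2))
nll_start = nll(W, X, y)
for _ in range(200):
    W -= 0.1 * nll_grad(W, X, y) / len(y)     # descending the NLL = ascending the log-likelihood
nll_end = nll(W, X, y)
```

With all-zero weights every class has probability 1/2, so `nll_start` equals $200 \log 2$; the descent steps then lower it.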

slide-4
SLIDE 4

Bayesian neural networks 101

Let’s say we want to classify different types of cats

  • x: input images; y: output label
  • build a neural network (with param. W ):

$$p(y \mid x, W) = \mathrm{softmax}(f_W(x))$$

"cat"

A Bayesian solution: put a prior distribution $p(W)$ over $W$.

  • Compute the posterior $p(W \mid D)$ given a dataset $D = \{(x_n, y_n)\}_{n=1}^N$:

$$p(W \mid D) \propto p(W) \prod_{n=1}^N p(y_n \mid x_n, W)$$

  • Bayesian predictive inference:

$$p(y^* \mid x^*, D) = \mathbb{E}_{p(W \mid D)}\left[p(y^* \mid x^*, W)\right]$$

slide-5
SLIDE 5

Bayesian neural networks 101

Let’s say we want to classify different types of cats

  • x: input images; y: output label
  • build a neural network (with param. W ):

$$p(y \mid x, W) = \mathrm{softmax}(f_W(x))$$

"cat"

In practice: $p(W \mid D)$ is intractable.

  • First find an approximation $q(W) \approx p(W \mid D)$
  • In prediction, do Monte Carlo sampling:

$$p(y^* \mid x^*, D) \approx \frac{1}{K} \sum_{k=1}^K p(y^* \mid x^*, W^k), \quad W^k \sim q(W)$$
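The Monte Carlo predictive average can be sketched as below. This assumes, purely for illustration, a factorised Gaussian $q(W)$ and a linear softmax model; the variational means and standard deviations are made-up placeholders rather than the result of any fitting procedure.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def predict_mc(x, q_mean, q_std, K, rng):
    # (1/K) sum_k p(y* | x*, W^k), with W^k ~ q(W) = N(q_mean, diag(q_std^2)).
    total = np.zeros((x.shape[0], q_mean.shape[1]))
    for _ in range(K):
        W_k = q_mean + q_std * rng.normal(size=q_mean.shape)   # one sample from q(W)
        total += softmax(x @ W_k)                               # p(y* | x*, W^k)
    return total / K

rng = np.random.default_rng(0)
q_mean = rng.normal(size=(4, 3))       # hypothetical variational means (4 inputs, 3 classes)
q_std = 0.5 * np.ones((4, 3))          # hypothetical variational standard deviations
x_star = rng.normal(size=(2, 4))
probs = predict_mc(x_star, q_mean, q_std, K=100, rng=rng)
```

Averaging probabilities (rather than logits) keeps each row of `probs` a valid distribution.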

slide-6
SLIDE 6

Applications of Bayesian neural networks

Detecting adversarial examples: Li and Gal 2017


slide-7
SLIDE 7

Applications of Bayesian neural networks

Image segmentation Kendall and Gal 2017


slide-8
SLIDE 8

Applications of Bayesian neural networks

Medical imaging (super resolution): Tanno et al. 2019


slide-9
SLIDE 9

Bayesian neural networks vs Gaussian processes

Why learn about BNNs at a summer school on GPs?

  • mean-field BNNs have GP limits
  • approximate inference on GPs has links to BNNs
  • approximate inference on BNNs can leverage GP techniques

Bayesian Deep Learning 4

slide-10
SLIDE 10

BNN → GP

slide-11
SLIDE 11

Bayesian neural networks → Gaussian process

Quick refresher: Central limit theorem

Theorem. Let $x_1, \ldots, x_N$ be i.i.d. samples from $p(x)$, where $p(x)$ has mean $\mu$ and covariance $\Sigma$. Then

$$\frac{1}{N} \sum_{n=1}^N x_n \xrightarrow{d} \mathcal{N}\left(\mu, \frac{1}{N}\Sigma\right), \quad N \to +\infty$$

slide-12
SLIDE 12

Bayesian neural networks → Gaussian process¹

Consider a one-hidden-layer BNN with mean-field prior and bounded nonlinearity:

$$f(x) = \sum_{m=1}^M v_m\,\phi(w_m^\top x + b_m)$$

$W = \{W_1, b, W_2\}$, $W_1 = [w_1, \ldots, w_M]^\top$, $b = [b_1, \ldots, b_M]$, $W_2 = [v_1, \ldots, v_M]$

Mean-field prior: $p(W) = p(W_1)\,p(b)\,p(W_2)$, with

$$p(W_1) = \prod_m p(w_m), \quad p(b) = \prod_m p(b_m), \quad p(W_2) = \prod_m p(v_m)$$

The same prior for each connection weight/bias: $p(w_i) = p(w_j)$, $p(b_i) = p(b_j)$, $p(v_i) = p(v_j)$, $\forall i, j$

¹ Radford Neal's derivation in his PhD thesis (1994)

slide-13
SLIDE 13

Bayesian neural networks → Gaussian process

Consider a one-hidden-layer BNN with mean-field prior and bounded nonlinearity:

$$f(x) = \sum_{m=1}^M v_m\,\phi(w_m^\top x + b_m)$$

The same prior for each connection weight/bias: $p(w_i) = p(w_j)$, $p(b_i) = p(b_j)$, $\forall i, j$

⇒ the same distribution of the hidden unit outputs:

$$h_i(x) \perp h_j(x) \ (i \neq j), \quad h_i(x) \overset{d}{=} h_j(x), \quad h_i(x) = \phi(w_i^\top x + b_i)$$

⇒ i.e. $h_1(x), \ldots, h_M(x)$ are i.i.d. samples from some implicitly defined distribution

slide-14
SLIDE 14

Bayesian neural networks → Gaussian process

Consider a one-hidden-layer BNN with mean-field prior and bounded nonlinearity:

$$f(x) = \sum_{m=1}^M v_m\,\phi(w_m^\top x + b_m)$$

Mean-field prior with the same distribution for the second-layer connection weights: $v_i \perp W_1, b$ and $p(v_i) = p(v_j)$, $\forall i, j$

$$\Rightarrow \quad v_i h_i(x) \perp v_j h_j(x) \ (i \neq j), \quad v_i h_i(x) \overset{d}{=} v_j h_j(x)$$

So $f(x)$ is a sum of i.i.d. random variables.

slide-15
SLIDE 15

Bayesian neural networks → Gaussian process

Consider a one-hidden-layer BNN with mean-field prior and bounded nonlinearity:

$$f(x) = \sum_{m=1}^M v_m\,\phi(w_m^\top x + b_m)$$

If we make $\mathbb{E}[v_m] = 0$ and $\mathbb{V}[v_m] = \sigma_v^2 / M$, so that the prior variance scales as $O(1/M)$:

$$\mathbb{E}[f(x)] = \sum_{m=1}^M \mathbb{E}[v_m]\,\mathbb{E}[h_m(x)] = 0$$

$$\mathbb{V}[f(x)] = \sum_{m=1}^M \mathbb{V}[v_m h_m(x)] = \sum_{m=1}^M \frac{\sigma_v^2}{M}\,\mathbb{E}[h_m(x)^2] = \sigma_v^2\,\mathbb{E}[h(x)^2]$$

slide-16
SLIDE 16

Bayesian neural networks → Gaussian process

Consider a one-hidden-layer BNN with mean-field prior and bounded nonlinearity:

$$f(x) = \sum_{m=1}^M v_m\,\phi(w_m^\top x + b_m)$$

If we make $\mathbb{E}[v_m] = 0$ and $\mathbb{V}[v_m] = \sigma_v^2 / M$, so that the prior variance scales as $O(1/M)$:

$$\mathrm{Cov}[f(x), f(x')] = \sum_{m=1}^M \frac{\sigma_v^2}{M}\,\mathbb{E}[h_m(x)\,h_m(x')] = \sigma_v^2\,\mathbb{E}[h(x)\,h(x')]$$

slide-17
SLIDE 17

Bayesian neural networks → Gaussian process

Consider a one-hidden-layer BNN with mean-field prior and bounded nonlinearity:

$$f(x) = \sum_{m=1}^M v_m\,\phi(w_m^\top x + b_m)$$

If we make $\mathbb{E}[v_m] = 0$ and $\mathbb{V}[v_m] = \sigma_v^2 / M$, so that the prior variance scales as $O(1/M)$, then by the CLT, as $M \to \infty$:

$$(f(x), f(x')) \xrightarrow{d} \mathcal{N}(0, K), \quad K(x, x') = \sigma_v^2\,\mathbb{E}[h(x)\,h(x')]$$

This holds for any $x, x'$ (and, by the same argument, for any finite set of inputs) ⇒ $f \sim \mathcal{GP}(0, K(x, x'))$
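The limit above can be checked numerically. The sketch below draws many one-hidden-layer networks from a mean-field prior and compares the sample covariance of $(f(x), f(x'))$ with a Monte Carlo estimate of $\sigma_v^2\,\mathbb{E}[h(x)h(x')]$. The standard normal priors on $w_m, b_m$, the tanh nonlinearity, the width and the two inputs are all assumptions chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
M, S = 200, 5000                  # hidden width and number of prior draws (arbitrary choices)
sigma_v = 1.5
x = np.array([0.3, -0.7])         # two scalar inputs, x and x'

# Mean-field prior: w_m, b_m ~ N(0, 1) and v_m ~ N(0, sigma_v^2 / M), independently.
w = rng.normal(size=(S, 1, M))
b = rng.normal(size=(S, 1, M))
v = rng.normal(scale=sigma_v / np.sqrt(M), size=(S, 1, M))

h = np.tanh(w * x[None, :, None] + b)     # h_m(x) for every draw, shape (S, 2, M)
f = (v * h).sum(axis=2)                   # f(x) and f(x') for every draw, shape (S, 2)

emp_cov = np.cov(f, rowvar=False)         # sample covariance of (f(x), f(x')) over draws
# Monte Carlo estimate of K = sigma_v^2 E[h(x) h(x')], pooling all draws and hidden units.
K = sigma_v**2 * np.einsum('snm,skm->nk', h, h) / (S * M)
```

Up to sampling noise, `emp_cov` and `K` should agree; a histogram of `f[:, 0]` would additionally look close to Gaussian for large `M`.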

slide-18
SLIDE 18

Bayesian neural networks → Gaussian process

Recent extensions of Radford Neal’s result:

  • deep and wide BNNs have GP limits
  • mean-field prior over weights
  • the activation function satisfies |φ(x)| ≤ c + A|x|
  • hidden layer widths strictly increasing to infinity

Matthews et al. 2018, Lee et al. 2018


slide-19
SLIDE 19

Bayesian neural networks → Gaussian process

Recent extensions of Radford Neal’s result:

  • Bayesian CNNs have GP limits
  • Convolution in CNN = fully connected layer applied to different locations in the image
  • # channels in CNN = # hidden units in fully connected NN

Garriga-Alonso et al. 2019, Novak et al. 2019

slide-20
SLIDE 20

GP → BNN

slide-21
SLIDE 21

Gaussian process → Bayesian neural networks

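Exact GP inference can be very expensive. Predictive inference for GP regression:

$$p(f_* \mid X_*, X, y) = \mathcal{N}\left(f_*;\ K_{*n}(K_{nn} + \sigma^2 I)^{-1} y,\ K_{**} - K_{*n}(K_{nn} + \sigma^2 I)^{-1} K_{n*}\right)$$

$(K_{nn})_{ij} = K(x_i, x_j)$, $K_{nn} \in \mathbb{R}^{N \times N}$

Inverting $K_{nn} + \sigma^2 I$ has $O(N^3)$ cost!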
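A minimal NumPy sketch of the predictive equations above, using a Cholesky factorisation instead of an explicit inverse. The squared-exponential kernel, its lengthscale, the noise variance and the toy data are assumptions for illustration; the slide does not fix a particular kernel.

```python
import numpy as np

def rbf(A, B, lengthscale=1.0):
    # Squared-exponential kernel K(x, x') = exp(-|x - x'|^2 / (2 l^2)).
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def gp_predict(X, y, X_star, noise_var=0.1):
    K_nn = rbf(X, X)
    K_sn = rbf(X_star, X)
    K_ss = rbf(X_star, X_star)
    # Cholesky factorisation of K_nn + sigma^2 I: this is the O(N^3) step.
    L = np.linalg.cholesky(K_nn + noise_var * np.eye(len(X)))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # (K_nn + sigma^2 I)^{-1} y
    V = np.linalg.solve(L, K_sn.T)                        # L^{-1} K_n*
    mean = K_sn @ alpha
    cov = K_ss - V.T @ V
    return mean, cov

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(30, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=30)
X_star = np.linspace(-3, 3, 5)[:, None]
mean, cov = gp_predict(X, y, X_star)
```

The factorisation is cubic in the number of training points, which is what motivates the approximations on the following slides.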

slide-22
SLIDE 22

Gaussian process → Bayesian neural networks

Quick refresher: Fourier (inverse) transform (normalisation constants omitted)

$$S(w) = \int s(t)\,e^{-itw}\,dt$$

$$s(t) = \int S(w)\,e^{itw}\,dw$$

slide-23
SLIDE 23

Gaussian process → Bayesian neural networks

Bochner's theorem (Fourier inverse transform)

Theorem. A (properly scaled) translation-invariant kernel $K(x, x') = K(x - x')$ can be represented as

$$K(x, x') = \mathbb{E}_{p(w)}\left[\sigma^2 e^{i w^\top (x - x')}\right]$$

for some distribution $p(w)$.

  • Real-valued kernel ⇒ $\mathbb{E}_{p(w)}\left[\sigma^2 e^{i w^\top (x - x')}\right] = \mathbb{E}_{p(w)}\left[\sigma^2 \cos(w^\top (x - x'))\right]$
  • $\cos(x - x') = 2\,\mathbb{E}_{p(b)}[\cos(x + b)\cos(x' + b)]$, $p(b) = \mathrm{Uniform}[0, 2\pi]$

Rahimi and Recht 2007

slide-24
SLIDE 24

Gaussian process → Bayesian neural networks

Bochner's theorem (Fourier inverse transform)

Theorem. A (properly scaled) translation-invariant kernel $K(x, x') = K(x - x')$ can be represented as

$$K(x, x') = \mathbb{E}_{p(w)p(b)}\left[\sigma^2 \cos(w^\top x + b)\cos(w^\top x' + b)\right]$$

for some distribution $p(w)$ and $p(b) = \mathrm{Uniform}[0, 2\pi]$.

  • Real-valued kernel ⇒ $\mathbb{E}_{p(w)}\left[\sigma^2 e^{i w^\top (x - x')}\right] = \mathbb{E}_{p(w)}\left[\sigma^2 \cos(w^\top (x - x'))\right]$
  • $\cos(x - x') = 2\,\mathbb{E}_{p(b)}[\cos(x + b)\cos(x' + b)]$, $p(b) = \mathrm{Uniform}[0, 2\pi]$

Rahimi and Recht 2007

slide-25
SLIDE 25

Gaussian process → Bayesian neural networks

Bochner's theorem (Fourier inverse transform)

Theorem. A (properly scaled) translation-invariant kernel $K(x, x') = K(x - x')$ can be represented as

$$K(x, x') = \mathbb{E}_{p(w)p(b)}\left[\sigma^2 \cos(w^\top x + b)\cos(w^\top x' + b)\right]$$

for some distribution $p(w)$ and $p(b) = \mathrm{Uniform}[0, 2\pi]$.

  • Monte Carlo approximation:

$$K(x, x') \approx \tilde{K}(x, x') = \frac{\sigma^2}{M} \sum_{m=1}^M \cos(w_m^\top x + b_m)\cos(w_m^\top x' + b_m), \quad w_m \sim p(w),\ b_m \sim p(b)$$

Rahimi and Recht 2007

slide-26
SLIDE 26

Gaussian process → Bayesian neural networks

Bochner's theorem (Fourier inverse transform)

Theorem. A (properly scaled) translation-invariant kernel $K(x, x') = K(x - x')$ can be represented as

$$K(x, x') = \mathbb{E}_{p(w)p(b)}\left[\sigma^2 \cos(w^\top x + b)\cos(w^\top x' + b)\right]$$

for some distribution $p(w)$ and $p(b) = \mathrm{Uniform}[0, 2\pi]$.

  • Monte Carlo approximation: define

$$h(x) = [h_1(x), \ldots, h_M(x)], \quad h_m(x) = \cos(w_m^\top x + b_m), \quad w_m \sim p(w),\ b_m \sim p(b)$$

$$\Rightarrow \quad \tilde{K}(x, x') = \frac{\sigma^2}{M}\,h(x)^\top h(x')$$

Rahimi and Recht 2007
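As a sanity check of the random feature construction, the sketch below approximates a unit-variance squared-exponential kernel, for which $p(w)$ is Gaussian with covariance $I / \ell^2$. The choice of kernel, lengthscale, input dimension and number of features is an assumption for this example. The factor 2 from the cosine identity is written out explicitly here as a $\sqrt{2/M}$ feature scaling, whereas the slides absorb it into the "properly scaled" $\sigma^2$.

```python
import numpy as np

def random_features(X, W, b):
    # sqrt(2 / M) * cos(W x + b), scaled so that Phi @ Phi.T approximates the kernel matrix.
    M = W.shape[0]
    return np.sqrt(2.0 / M) * np.cos(X @ W.T + b)

rng = np.random.default_rng(0)
D, M, lengthscale = 3, 20000, 1.0
W = rng.normal(scale=1.0 / lengthscale, size=(M, D))   # w_m ~ p(w) = N(0, I / l^2)
b = rng.uniform(0.0, 2.0 * np.pi, size=M)              # b_m ~ Uniform[0, 2 pi]

X = rng.normal(size=(6, D))
Phi = random_features(X, W, b)
K_approx = Phi @ Phi.T                                  # Monte Carlo kernel estimate
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K_exact = np.exp(-0.5 * d2 / lengthscale**2)            # exact squared-exponential kernel
max_err = np.abs(K_approx - K_exact).max()
```

The error shrinks at the usual Monte Carlo rate of roughly $1/\sqrt{M}$, so increasing `M` tightens the approximation.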

slide-27
SLIDE 27

Gaussian process → Bayesian neural networks

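Approximating the GP kernel with random feature expansions:

$$f \sim \mathcal{GP}(0, K(x, x')), \quad f \approx \tilde{f}, \quad \tilde{f} \sim \mathcal{GP}(0, \tilde{K}(x, x')), \quad \tilde{K}(x, x') = \frac{\sigma^2}{M}\,h(x)^\top h(x')$$

Weight space view ⇒ single-hidden-layer BNN:

$$\tilde{f} \sim \mathcal{GP}(0, \tilde{K}(x, x')) \iff \tilde{f}(x) = v^\top h(x), \quad v \sim p(v) = \mathcal{N}\left(0, \frac{\sigma^2}{M} I\right)$$

Adding components (increasing $M$) → adding hidden units in the BNN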

slide-28
SLIDE 28

Gaussian process → Bayesian neural networks

Deep GPs → deep BNNs with bottleneck layers:

Deep Gaussian process (Bui et al. 2016):

$$f(x) = f^{(L)} \circ f^{(L-1)} \circ \cdots \circ f^{(0)}(x), \quad f^{(i)} \sim \mathcal{GP}(0, K^{(i)}(x, x'))$$

Recall the weight space view, with $\tilde{K}(x, x') \approx K(x, x')$:

$$\tilde{f} \sim \mathcal{GP}(0, \tilde{K}(x, x')) \iff \tilde{f}(x) = v^\top \cos(W x + b), \quad W, b, v \sim p(W)\,p(b)\,p(v)$$

slide-29
SLIDE 29

Gaussian process → Bayesian neural networks

Deep GPs → deep BNNs with bottleneck layers:

Deep BNN approximation to the deep GP, $\tilde{f} \approx f$:

$$\tilde{f}(x) = \tilde{f}^{(L)} \circ \tilde{f}^{(L-1)} \circ \cdots \circ \tilde{f}^{(0)}(x), \quad \tilde{f}^{(i)}(x) = v_i^\top \cos(W_i x + b_i), \quad W_i, b_i, v_i \sim p(W_i)\,p(b_i)\,p(v_i)$$

$$\prod_{n=1}^N p(y_n \mid f(x_n))\,p(f) \approx \prod_{n=1}^N p(y_n \mid x_n, W)\,p(W)$$

Cutajar et al. 2017

slide-30
SLIDE 30

Gaussian process → Bayesian neural networks

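Deep GPs → deep BNNs with bottleneck layers:

  • Approximate inference for deep GPs: random feature expansion + approximate inference for BNNs:

$$p_{\mathrm{DGP}}(y^* \mid x^*, D) \approx p_{\mathrm{BNN}}(y^* \mid x^*, D) \approx \frac{1}{K} \sum_{k=1}^K p_{\mathrm{BNN}}(y^* \mid x^*, W^k), \quad W^k \sim q^*(W)$$

$q^*(W)$ is obtained by e.g. variational inference, maximising the evidence lower bound:

$$q^*(W) = \arg\max_{q(W)}\ \mathbb{E}_{q(W)}\left[\sum_{n=1}^N \log p_{\mathrm{BNN}}(y_n \mid x_n, W)\right] - \mathrm{KL}[q(W)\,\|\,p(W)]$$

Cutajar et al. 2017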

slide-31
SLIDE 31

BNN function-space inference

slide-32
SLIDE 32

BNN inference in function space?

  • weight space approximations can be inefficient
  • how to do function space inference for BNNs?

Ma et al. 2019, Foong et al. 2019


slide-33
SLIDE 33

Implicit Stochastic Processes

Definition: An implicit stochastic process (IP) is a collection of random variables $f(\cdot)$, such that any finite collection $\mathbf{f} = (f(x_1), \ldots, f(x_N))^\top$ has joint distribution implicitly defined by the following generative process:

$$z \sim p(z), \quad f(x_n) = g_\theta(x_n, z), \quad \forall\, x_n \in \mathbf{X}$$

A function distributed according to the above IP is denoted as $f(\cdot) \sim \mathcal{IP}(g_\theta(\cdot, \cdot), p_z)$.

slide-34
SLIDE 34

Implicit Stochastic Processes

z can be finite or infinite dimensional:

  • Finite-dimensional z: prove via the Kolmogorov extension theorem (marginalisation consistency & permutation invariance)

slide-35
SLIDE 35

Implicit Stochastic Processes

z can be finite or infinite dimensional:

  • Finite-dimensional z: prove via the Kolmogorov extension theorem (marginalisation consistency & permutation invariance)
  • Infinite-dimensional case (here $z = z(\cdot)$ is a random function), sufficient conditions:
    • $z(\cdot) \sim \mathcal{SP}(0, C(\cdot, \cdot))$ is a centered stochastic process on $L^2(\mathbb{R}^d)$
    • $g(x, z) = \phi\left(\int \sum_{m=0}^M K_m(x, x')\,z(x')\,dx'\right)$, $K_m \in L^2(\mathbb{R}^d \times \mathbb{R}^d)$, $|\phi(x)| \le A|x|$

Then $f(\cdot)$ is also a stochastic process. Proof: apply the Karhunen-Loève expansion and check convergence in $L^2(\mathbb{R}^d)$.

slide-36
SLIDE 36

Implicit Stochastic Processes

Examples: Bayesian NN, warped GP, neural sampler

IPs also include many simulators in physics, ecology, climate science...

slide-37
SLIDE 37

Implicit Process Regression

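Implicit process regression model:

$$f(\cdot) \sim \mathcal{IP}(g_\theta(\cdot, \cdot), p_z), \quad y = f(x) + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2)$$

  • Similar to GP regression, given a dataset $D = \{X, y\}$, we hope to compute

$$p(\mathbf{f} \mid X, y) \propto p(y \mid \mathbf{f})\,p(\mathbf{f} \mid X)$$

  • Then for predictive inference, compute

$$p(y^* \mid x^*, D) = \int p(y^* \mid f^*)\,p(f^* \mid X, y)\,df^*$$

This is intractable because the distribution $p(\mathbf{f})$ is unknown (variational inference cannot be used directly).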

slide-38
SLIDE 38

Variational Implicit Processes

Generalised wake-sleep applied to implicit processes

  • Sleep phase: approximate $p_\theta(y, \mathbf{f} \mid X) \approx q(y, \mathbf{f} \mid X)$
  • Wake phase: approximate $\log p_\theta(y \mid X) \approx \log q(y \mid X)$, then maximise w.r.t. $\theta$
  • Large-scale learning: spectral approximations lead to a Bayesian linear regression problem

Dayan et al. 1995

slide-39
SLIDE 39

Variational Implicit Processes

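Sleep phase:

  • Define $q_{\mathrm{GP}}(y, \mathbf{f} \mid X) = q(y \mid \mathbf{f})\,q_{\mathrm{GP}}(\mathbf{f} \mid X)$, with $q(y \mid \mathbf{f}) = p(y \mid \mathbf{f})$ (same likelihood term)
  • For any $X$, use $(y, \mathbf{f}) \sim p(y, \mathbf{f} \mid X)$ as targets to train $q$:

$$\min_q\ D_{\mathrm{KL}}\left[p(y, \mathbf{f} \mid X)\,\|\,q_{\mathrm{GP}}(y, \mathbf{f} \mid X)\right]$$

  • This reduces to matching mean & covariance functions (with finite function samples):

$$m^\star_{\mathrm{MLE}}(x) = \frac{1}{S} \sum_s f_s(x), \quad K^\star_{\mathrm{MLE}}(x_1, x_2) = \frac{1}{S} \sum_s \Delta_s(x_1)\,\Delta_s(x_2)$$

$$\Delta_s(x) = f_s(x) - m^\star_{\mathrm{MLE}}(x), \quad f_s(\cdot) \sim \mathcal{IP}(g_\theta(\cdot, \cdot), p_z)$$

$q^\star_{\mathrm{GP}}(\mathbf{f} \mid X, m^\star_{\mathrm{MLE}}, K^\star_{\mathrm{MLE}}, \theta)$ depends on $\theta$.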
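The moment-matching step of the sleep phase can be sketched as follows. The sampler below is a hypothetical stand-in for $\mathcal{IP}(g_\theta, p_z)$: a one-hidden-layer tanh network whose random weights play the role of $z$. The width, the number of samples $S$ and the input grid are arbitrary choices for illustration.

```python
import numpy as np

def sample_functions(X, S, rng, width=50):
    # Hypothetical implicit process: random network weights act as z ~ p(z).
    N, D = X.shape
    W1 = rng.normal(size=(S, D, width))
    b1 = rng.normal(size=(S, 1, width))
    v = rng.normal(scale=1.0 / np.sqrt(width), size=(S, width, 1))
    h = np.tanh(X[None] @ W1 + b1)       # hidden activations, shape (S, N, width)
    return (h @ v)[..., 0]               # F[s, n] = f_s(x_n), shape (S, N)

def empirical_gp_moments(F):
    # Empirical mean and covariance functions evaluated on the inputs.
    m = F.mean(axis=0)                   # m_MLE(x_n) = (1/S) sum_s f_s(x_n)
    Delta = F - m                        # Delta_s(x_n) = f_s(x_n) - m_MLE(x_n)
    K = Delta.T @ Delta / F.shape[0]     # K_MLE(x_i, x_j) = (1/S) sum_s Delta_s(x_i) Delta_s(x_j)
    return m, K

rng = np.random.default_rng(0)
X = np.linspace(-2, 2, 15)[:, None]
F = sample_functions(X, S=300, rng=rng)
m_mle, K_mle = empirical_gp_moments(F)
```

Because `K_mle` is built from `S` outer products, its rank is at most `S`, which is what makes the later linear-regression view possible.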

slide-40
SLIDE 40

Variational Implicit Processes

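Wake phase:

  • We want to maximise $\log p_\theta(y \mid X)$ w.r.t. $\theta$ (intractable)
  • Note that in the sleep step we are minimising the joint KL, and

$$D_{\mathrm{KL}}\left[p(y, \mathbf{f} \mid X)\,\|\,q_{\mathrm{GP}}(y, \mathbf{f} \mid X)\right] \ge D_{\mathrm{KL}}\left[p(y \mid X)\,\|\,q_{\mathrm{GP}}(y \mid X)\right]$$

  • Then we use $\log q^\star_{\mathrm{GP}}(y \mid X, \theta) \approx \log p_\theta(y \mid X)$
  • Note that $q^\star_{\mathrm{GP}}(y \mid X, \theta)$ depends on $\theta$ ⇒ just differentiate through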

slide-41
SLIDE 41

Variational Implicit Processes

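Wake phase: for large datasets GP inference is very expensive ($O(N^3)$).

Recall the kernel structure:

$$K^\star_{\mathrm{MLE}}(x_1, x_2) = \frac{1}{S} \sum_s \Delta_s(x_1)\,\Delta_s(x_2)$$

Random feature approximation:

$$\log q^\star_{\mathrm{GP}}(y \mid X, \theta) \approx \log \int \prod_n q^\star(y_n \mid x_n, a, \theta)\,p(a)\,da$$

$$q^\star(y_n \mid x_n, a, \theta) = \mathcal{N}\left(y_n;\ m^\star_{\mathrm{MLE}}(x_n) + \frac{1}{\sqrt{S}} \sum_s \Delta_s(x_n)\,a_s,\ \sigma^2\right), \quad p(a) = \mathcal{N}(a; 0, I)$$

i.e. Bayesian linear regression (BLR) on top of function samples.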
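To illustrate why the linear-regression view is cheaper, the sketch below evaluates the Gaussian log marginal likelihood in two ways: through an $S \times S$ system (using the Woodbury identity and the matrix determinant lemma) and through the direct $N \times N$ computation with kernel $K^\star_{\mathrm{MLE}}$. The random-walk "function samples", the targets and the noise variance are stand-ins invented for this example, not data from the talk.

```python
import numpy as np

def blr_log_marginal(y, m, Delta, noise_var):
    # Delta[s, n] = f_s(x_n) - m(x_n); features Phi = Delta^T / sqrt(S), weights a ~ N(0, I).
    S, N = Delta.shape
    Phi = Delta.T / np.sqrt(S)
    r = y - m
    A = noise_var * np.eye(S) + Phi.T @ Phi            # S x S system instead of N x N
    Pr = Phi.T @ r
    quad = (r @ r - Pr @ np.linalg.solve(A, Pr)) / noise_var          # Woodbury identity
    logdet = np.linalg.slogdet(A)[1] + (N - S) * np.log(noise_var)    # determinant lemma
    return -0.5 * (quad + logdet + N * np.log(2.0 * np.pi))

def gp_log_marginal(y, m, K, noise_var):
    # Direct O(N^3) evaluation of log N(y; m, K + sigma^2 I), for comparison.
    N = len(y)
    C = K + noise_var * np.eye(N)
    r = y - m
    quad = r @ np.linalg.solve(C, r)
    return -0.5 * (quad + np.linalg.slogdet(C)[1] + N * np.log(2.0 * np.pi))

rng = np.random.default_rng(0)
S, N = 10, 200
F = 0.1 * rng.normal(size=(S, N)).cumsum(axis=1)   # stand-in function samples (random walks)
m = F.mean(axis=0)
Delta = F - m
K = Delta.T @ Delta / S                            # K_MLE on the training inputs
y = rng.normal(size=N)                             # stand-in targets
lml_blr = blr_log_marginal(y, m, Delta, 0.5)
lml_gp = gp_log_marginal(y, m, K, 0.5)
```

The two values agree up to floating-point error, while the BLR route costs $O(NS^2 + S^3)$ rather than $O(N^3)$.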

slide-42
SLIDE 42

Some Experimental Results

[Figure: predictions on solar irradiance data (axes: x, y), showing train mean, interpolation mean, training samples and test samples. Panels: (a) VIP-BNN; (b) Variational dropout (VDO-BNN); (c) GP regression (GPR)]

Solar irradiance prediction:

  • Methods: VIP, VDO, GPR
  • Capturing the predictive mean: VIP > GPR
  • Uncertainty estimates: VIP > VDO

slide-43
SLIDE 43

Some Experimental Results

[Figure: test NLL and test RMSE for VIP, VDO-LSTM, α-LSTM, BB-α-BNN, VI-BNN, FITC-GP and Deep GP-EP]

VIP applied to Bayesian LSTM:

  • CEP data: >1 million datapoints; each x is a string representing a molecule
  • Goal: predict power conversion efficiency
  • Baselines: (deep) GP, BNN (hand-crafted features) & Bayesian LSTM (raw features directly), with different inference methods
  • VIP works significantly better for both NLL and RMSE.

slide-44
SLIDE 44

What we have covered today...

BNNs and GPs are good friends:

  • mean-field BNNs have GP limits
  • approximate inference on GPs has links to BNNs
  • approximate inference on BNNs can leverage GP techniques

Thank you!

slide-45
SLIDE 45

References

Neal 1994. Bayesian Learning for Neural Networks. PhD thesis.
Dayan et al. 1995. The Helmholtz machine. Neural Computation, 1995.
Rahimi and Recht 2007. Random Features for Large-Scale Kernel Machines. NeurIPS 2007.
Lázaro-Gredilla et al. 2010. Sparse spectrum Gaussian process regression. JMLR 2010.
Bui et al. 2016. Deep Gaussian Processes for Regression using Approximate Expectation Propagation. ICML 2016.
Li and Gal 2017. Dropout inference in Bayesian neural networks with alpha-divergences. ICML 2017.
Kendall and Gal 2017. What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? NeurIPS 2017.
Cutajar et al. 2017. Random Feature Expansions for Deep Gaussian Processes. ICML 2017.
Matthews et al. 2018. Gaussian Process Behaviour in Wide Deep Neural Networks. ICLR 2018.
Lee et al. 2018. Deep Neural Networks as Gaussian Processes. ICLR 2018.
Ma et al. 2019. Variational Implicit Processes. ICML 2019.
Tanno et al. 2019. Uncertainty Quantification in Deep Learning for Safer Neuroimage Enhancement. arXiv:1907.13418.
Foong et al. 2019. Pathologies of Factorised Gaussian and MC Dropout Posteriors in Bayesian Neural Networks. arXiv:1909.00719.