SLIDE 1

Bayesian Deep Learning and Restricted Boltzmann Machines

Narada Warakagoda

Forsvarets Forskningsinstitutt ndw@ffi.no

November 1, 2018

Narada Warakagoda (FFI) Short title November 1, 2018 1 / 56

SLIDE 2

Overview

1 Probability Review
2 Bayesian Deep Learning
3 Restricted Boltzmann Machines

SLIDE 3

Probability Review

SLIDE 4

Probability and Statistics Basics

Normal (Gaussian) Distribution

p(x) = 1 / ((2π)^(d/2) |Σ|^(1/2)) · exp(−(1/2)(x − µ)ᵀ Σ⁻¹ (x − µ)) = N(µ, Σ)

Categorical Distribution

P(x) = ∏_{i=1}^{k} p_i^{[x=i]}

Sampling

x ∼ p(x)
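As an illustrative aside (not part of the slides), the two distributions and the sampling step above can be sketched in NumPy; the particular µ, Σ and p values below are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

# Multivariate normal N(mu, Sigma): density and sampling.
mu = np.array([0.0, 1.0])
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])

def gaussian_pdf(x, mu, Sigma):
    d = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm

x = rng.multivariate_normal(mu, Sigma)     # x ~ p(x)

# Categorical distribution: P(x) = prod_i p_i^[x=i]
p = np.array([0.2, 0.5, 0.3])              # event probabilities, sum to 1
sample = rng.choice(len(p), p=p)           # x ~ P(x)
```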

SLIDE 5

Probability and Statistics Basics

Independent variables

p(x_1, x_2, · · · , x_k) = ∏_{i=1}^{k} p(x_i)

Expectation

E_{p(x)} f(x) = ∫ f(x) p(x) dx

or, for discrete variables,

E_{P(x)} f(x) = ∑_{i=1}^{k} f(x_i) P(x_i)

SLIDE 6

Kullback Leibler Distance

KL(q(x) || p(x)) = E_{q(x)} log [q(x)/p(x)]
 = ∫ [q(x) log q(x) − q(x) log p(x)] dx

For the discrete case

KL(Q(x) || P(x)) = ∑_{i=1}^{k} [Q(x_i) log Q(x_i) − Q(x_i) log P(x_i)]
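The discrete KL sum translates directly into code. A minimal sketch assuming NumPy, with the usual convention that terms with Q(x_i) = 0 contribute zero:

```python
import numpy as np

def kl_divergence(Q, P):
    """Discrete KL(Q || P) = sum_i Q_i (log Q_i - log P_i)."""
    Q, P = np.asarray(Q, float), np.asarray(P, float)
    mask = Q > 0                      # terms with Q_i = 0 contribute 0
    return float(np.sum(Q[mask] * (np.log(Q[mask]) - np.log(P[mask]))))

kl_divergence([0.5, 0.5], [0.5, 0.5])   # 0.0 for identical distributions
```

Note that KL is non-negative and zero only when the two distributions are identical, which is what makes it usable as a distance-like measure (it is not symmetric, however).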

SLIDE 7

Bayesian Deep Learning

SLIDE 8

Bayesian Statistics

Joint distribution

p(x, y) = p(x|y) p(y)

Marginalization

p(x) = ∫ p(x, y) dy
P(x) = ∑_y P(x, y)

Conditional distribution

p(x|y) = p(x, y) / p(y) = p(y|x) p(x) / ∫ p(y|x) p(x) dx

SLIDE 9

Statistical view of Neural Networks

Prediction

p(y|x, w) = N(f_w(x), Σ)

Classification

P(y|x, w) = ∏_{i=1}^{k} f_w^i(x)^{[y=i]}

SLIDE 10

Training Criteria

Maximum Likelihood (ML)

w = arg max_w p(Y|X, w)

Maximum A Posteriori (MAP)

w = arg max_w p(Y, w|X) = arg max_w p(Y|X, w) p(w)

Bayesian

p(w|Y, X) = p(Y|X, w) p(w) / p(Y|X) = p(Y|X, w) p(w) / ∫ p(Y|X, w) p(w) dw

SLIDE 11

Motivation for Bayesian Approach

SLIDE 12

Motivation for Bayesian Approach

SLIDE 13

Uncertainty with Bayesian Approach

Not only the prediction/classification itself, but also its uncertainty can be calculated.

Since we have p(w|Y, X), we can sample w and use each sample as the network parameters when calculating the prediction/classification p(y|x, w) (i.e. the network output for a given input).

The prediction/classification is the mean of p(y|x, w):

p_out = p(y|x, Y, X) = ∫ p(y|x, w) p(w|Y, X) dw

The uncertainty of the prediction/classification is the variance of p(y|x, w):

Var(p(y|x, w)) = ∫ [p(y|x, w) − p_out]² p(w|Y, X) dw

Uncertainty is important in safety-critical applications (eg: self-driving cars, medical diagnosis, military applications).

SLIDE 14

Other Advantages of Bayesian Approach

Natural interpretation for regularization Model selection Input data selection (active learning)

SLIDE 15

Main Challenge of Bayesian Approach

We calculate

For the continuous case:

p(w|Y, X) = p(Y|X, w) p(w) / ∫ p(Y|X, w) p(w) dw

For the discrete case:

P(w|Y, X) = p(Y|X, w) P(w) / ∑_w p(Y|X, w) P(w)

Calculating the denominator is often intractable.

Eg: Consider a weight vector w of 100 elements, each of which can take two values. Then there are 2¹⁰⁰ ≈ 1.27 × 10³⁰ different weight vectors. Even enumerating one vector per nanosecond would take on the order of 10¹³ years; compare this with the universe's age of 13.7 billion years.

We need approximations.

SLIDE 16

Different Approaches

Monte Carlo techniques (Eg: Markov Chain Monte Carlo - MCMC)
Variational Inference
Introducing random elements in training (eg: Dropout)

SLIDE 17

Advantages and Disadvantages of Different Approaches

Markov Chain Monte Carlo - MCMC

Asymptotically exact
Computationally expensive

Variational Inference

No guarantee of exactness
Possibility of faster computation

SLIDE 18

Monte Carlo Techniques

We are interested in

p_out = Mean(p(y|x, w)) = p(y|x, Y, X) = ∫ p(y|x, w) p(w|Y, X) dw

Var(p(y|x, w)) = ∫ [p(y|x, w) − p_out]² p(w|Y, X) dw

Both are integrals of the type

I = ∫ F(w) p(w|D) dw

where D = (Y, X) is the training data. Approximate the integral by sampling w_i from p(w|D):

I ≈ (1/L) ∑_{i=1}^{L} F(w_i)
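The estimate I ≈ (1/L) ∑ F(w_i) can be illustrated on a toy case where, unlike a real posterior, p(w|D) is a known Gaussian we can sample directly; this setup is an assumption made purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setting: pretend the posterior p(w|D) is N(0, 1) so it can be sampled
# directly, and estimate I = E[F(w)] for F(w) = w^2 (true value: 1).
def F(w):
    return w ** 2

L = 200_000
w_samples = rng.standard_normal(L)     # w_i ~ p(w|D)
I_hat = np.mean(F(w_samples))          # I ≈ (1/L) sum_i F(w_i)
```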

SLIDE 19

Monte Carlo techniques

Challenge: We don’t have the posterior

p(w|D) = p(w|Y, X) = p(Y|X, w) p(w) / ∫ p(Y|X, w) p(w) dw

”Solution”: Use importance sampling, drawing from a proposal distribution q(w):

I = ∫ F(w) [p(w|D)/q(w)] q(w) dw ≈ (1/L) ∑_{i=1}^{L} F(w_i) p(w_i|D)/q(w_i)

Problem: We still do not have p(w|D)

SLIDE 20

Monte Carlo Techniques

Problem: We still do not have p(w|D)
Solution: Use the unnormalized posterior

p̃(w|D) = p(Y|X, w) p(w)

where the normalization factor is

Z = ∫ p(Y|X, w) p(w) dw

such that p(w|D) = p̃(w|D)/Z. The integral can then be calculated with

I ≈ [∑_{i=1}^{L} F(w_i) p̃(w_i|D)/q(w_i)] / [∑_{i=1}^{L} p̃(w_i|D)/q(w_i)]
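The self-normalized estimate above can be sketched on a toy unnormalized target; the particular target and proposal below are illustrative assumptions, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(1)

# Self-normalized importance sampling with an unnormalized target.
# Toy target: p~(w) = exp(-w^2/2), i.e. N(0,1) without its normalizer.
# Proposal: q = N(0, 2^2), deliberately wider than the target.
def p_tilde(w):
    return np.exp(-0.5 * w ** 2)

def q_pdf(w):
    return np.exp(-0.5 * (w / 2.0) ** 2) / (2.0 * np.sqrt(2 * np.pi))

def F(w):
    return w ** 2

L = 200_000
w = rng.normal(0.0, 2.0, size=L)          # w_i ~ q(w)
weights = p_tilde(w) / q_pdf(w)           # unnormalized importance weights
I_hat = np.sum(F(w) * weights) / np.sum(weights)
```

The division by ∑ weights is what removes the unknown normalizer Z: it cancels from numerator and denominator.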

SLIDE 21

Weakness of Importance Sampling

The proposal distribution must be close to the non-zero areas of the original distribution p(w|D). In neural networks, p(w|D) is typically small except in a few narrow areas. Samples drawn blindly from q(w) therefore have a high chance of falling outside the non-zero areas of p(w|D). We must actively try to obtain samples that lie where p(w|D) is large. Markov Chain Monte Carlo (MCMC) is one such technique.

SLIDE 22

Metropolis Algorithm

The Metropolis algorithm is an example of MCMC.

Draw samples repeatedly from the random walk

w_{t+1} = w_t + ε

where ε is a small random vector, ε ∼ q(ε) (eg: Gaussian noise).

The sample drawn at step t is accepted or rejected based on the ratio p̃(w_t|D)/p̃(w_{t−1}|D):

If p̃(w_t|D) > p̃(w_{t−1}|D), accept the sample
If p̃(w_t|D) < p̃(w_{t−1}|D), accept the sample with probability p̃(w_t|D)/p̃(w_{t−1}|D)

If the sample is accepted, use it for calculating I.

The same formula as before can be used for calculating I:

I ≈ [∑_{i=1}^{L} F(w_i) p̃(w_i|D)/q(w_i)] / [∑_{i=1}^{L} p̃(w_i|D)/q(w_i)]
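A minimal sketch of the Metropolis loop on a toy unnormalized density (the target here is an illustrative assumption). Note one standard detail beyond the slide: on rejection, the chain keeps its previous state as the next sample:

```python
import numpy as np

rng = np.random.default_rng(2)

# Metropolis sampling from an unnormalized density p~(w).
# Toy target: p~(w) = exp(-w^2/2), a standard normal up to a constant.
def p_tilde(w):
    return np.exp(-0.5 * w ** 2)

def metropolis(n_samples, step=1.0):
    w = 0.0
    samples = []
    for _ in range(n_samples):
        w_new = w + rng.normal(0.0, step)        # random-walk proposal
        ratio = p_tilde(w_new) / p_tilde(w)
        if ratio > 1 or rng.random() < ratio:    # accept with prob min(1, ratio)
            w = w_new
        samples.append(w)                        # on rejection, keep old state
    return np.array(samples)

samples = metropolis(100_000)
I_hat = np.mean(samples ** 2)   # E[w^2] under the target is 1
```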

SLIDE 23

Other Monte Carlo and Related Techniques

Hybrid Monte Carlo (Hamiltonian Monte Carlo)

Similar to the Metropolis algorithm
But uses gradient information rather than a random walk

Simulated Annealing

SLIDE 24

Variational Inference

Goal: computation of the posterior p(w|D), i.e. the parameters of the neural network w given data D = (Y, X). But this computation is often intractable.

Idea: find a distribution q(w) from a family of distributions Q such that q(w) closely approximates p(w|D).

How to measure the distance between q(w) and p(w|D)? The Kullback-Leibler distance KL(q(w) || p(w|D)).

The problem can be formulated as

p̂(w|D) = arg min_{q(w)∈Q} KL(q(w) || p(w|D))

SLIDE 25

Minimizing KL Distance

Using the definition of the KL distance

KL(q(w) || p(w|D)) = ∫ q(w) ln [q(w)/p(w|D)] dw

We cannot minimize this directly, because we do not know p(w|D). But we can manipulate it further and transform it into an equivalent optimization problem involving a quantity known as the Evidence Lower Bound (ELBO).

SLIDE 26

Evidence Lower Bound (ELBO)

KL(q(w) || p(w|D)) = ∫ q(w) ln [q(w)/p(w|D)] dw
 = ∫ q(w) ln [q(w) p(D) / p(w, D)] dw
 = ∫ q(w) ln [q(w)/p(w, D)] dw + ∫ q(w) ln p(D) dw
 = E_{q(w)} ln [q(w)/p(w, D)] + ln p(D) ∫ q(w) dw

Rearranging, and using ∫ q(w) dw = 1:

ln p(D) = E_{q(w)} ln [p(w, D)/q(w)] + KL(q(w) || p(w|D))

Since ln p(D) is constant, minimizing KL(q(w) || p(w|D)) is equivalent to maximizing the ELBO, E_{q(w)} ln [p(w, D)/q(w)].

SLIDE 27

Another Look at ELBO

ELBO = E_{q(w)} ln [p(w, D)/q(w)]
 = ∫ q(w) ln p(w, D) dw − ∫ q(w) ln q(w) dw
 = ∫ q(w) ln [p(D|w) p(w)] dw − ∫ q(w) ln q(w) dw
 = ∫ q(w) ln p(D|w) dw − ∫ q(w) ln [q(w)/p(w)] dw
 = E_{q(w)} ln p(D|w) − KL(q(w) || p(w))

We maximize the ELBO with respect to q(w):

Maximizing the first term, E_{q(w)} ln p(D|w), rewards q(w)’s ability to explain the training data.
Minimizing the second term, KL(q(w) || p(w)), keeps q(w) close to the prior p(w).

SLIDE 28

Outline of Procedure with ELBO

Start with the ELBO

ELBO = L = E_{q(w)} ln [p(w, D)/q(w)] = E_{q(w)} [ln p(w, D) − ln q(w)]

Rewrite with the parameter λ of q(w) and expand the expectation

L(λ) = ∫ ln[p(w, D)] q(w, λ) dw − ∫ ln[q(w, λ)] q(w, λ) dw

Maximize L(λ) with respect to λ

λ⋆ = arg max_λ L(λ)

Use the q optimized with respect to λ as the posterior: q(w, λ⋆) ≈ p(w|D)

SLIDE 29

How to Maximize ELBO

Analytical methods are not practical for deep neural networks
We resort to gradient methods with Monte Carlo sampling
We discuss two methods:

Black box variational inference: based on the log-derivative trick
Bayes by Backprop: based on the re-parameterization trick

SLIDE 30

Black Box Variational Inference

Start with the ELBO:

L(λ) = ∫ ln[p(w, D)] q(w, λ) dw − ∫ ln[q(w, λ)] q(w, λ) dw

Differentiate with respect to λ:

∇_λ L(λ) = ∫ ln[p(w, D)] ∇_λ[q(w, λ)] dw − ∫ ln[q(w, λ)] ∇_λ[q(w, λ)] dw − ∫ ∇_λ[ln q(w, λ)] q(w, λ) dw

The last term is zero (Can you prove it?)

SLIDE 31

Black Box Variational Inference

Now we have

∇_λ L(λ) = ∫ ln[p(w, D)] ∇_λ[q(w, λ)] dw − ∫ ln[q(w, λ)] ∇_λ[q(w, λ)] dw
 = ∫ (ln[p(w, D)] − ln[q(w, λ)]) ∇_λ[q(w, λ)] dw

We want to write this as an expectation with respect to q. Use the log-derivative trick:

∇_λ[q(w, λ)] = ∇_λ[ln q(w, λ)] q(w, λ)

SLIDE 32

Black Box Variational Inference

Now we get

∇_λ L(λ) = ∫ ln[p(w, D)] ∇_λ[ln q(w, λ)] q(w, λ) dw − ∫ ln[q(w, λ)] ∇_λ[ln q(w, λ)] q(w, λ) dw

Rearranging terms

∇_λ L(λ) = ∫ (ln[p(w, D)] − ln q(w, λ)) ∇_λ[ln q(w, λ)] q(w, λ) dw

This is the same as an expectation with respect to q

∇_λ L(λ) = E_{q(w,λ)} [ (ln[p(w, D)] − ln q(w, λ)) ∇_λ[ln q(w, λ)] ]

SLIDE 33

BBVI optimization procedure

1 Assume a distribution q(w, λ) parameterized by λ.
2 Draw S samples of w from the distribution using the current value λ = λ_t.
3 Estimate the gradient of the ELBO using the sample values:

∇_λ L̂(λ) = (1/S) ∑_{s=1}^{S} (ln[p(w_s, D)] − ln q(w_s, λ)) ∇_λ[ln q(w_s, λ)]

4 Update λ: λ_{t+1} = λ_t + ρ ∇_λ L̂(λ)
5 Repeat from step 2.
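The steps above can be sketched on a toy one-dimensional model. This is a hedged illustration, not the slides' own experiment: the joint ln p(w, D) = −(w − 2)²/2 and the Gaussian family q = N(mu, sigma²) with λ = (mu, log_sigma) are assumptions chosen so the true posterior, N(2, 1), is known:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy joint: ln p(w, D) = -(w - 2)^2 / 2 (up to a constant); the true
# posterior is N(2, 1).  Variational family: q(w, λ) = N(mu, sigma^2).
def log_p_joint(w):
    return -0.5 * (w - 2.0) ** 2

def log_q(w, mu, log_sigma):
    z = (w - mu) / np.exp(log_sigma)
    return -0.5 * z ** 2 - log_sigma - 0.5 * np.log(2 * np.pi)

def grad_log_q(w, mu, log_sigma):
    # Score function ∇_λ ln q(w, λ) for λ = (mu, log_sigma).
    z = (w - mu) / np.exp(log_sigma)
    return np.array([z / np.exp(log_sigma), z ** 2 - 1.0])

mu, log_sigma = 0.0, 0.0
S, rho = 256, 0.02
for t in range(4000):
    w = mu + np.exp(log_sigma) * rng.standard_normal(S)        # step 2
    f = log_p_joint(w) - log_q(w, mu, log_sigma)
    grad = (grad_log_q(w, mu, log_sigma) * f).mean(axis=1)     # step 3
    mu, log_sigma = mu + rho * grad[0], log_sigma + rho * grad[1]  # step 4
```

After the loop, (mu, exp(log_sigma)) should be near the true posterior parameters (2, 1), though with visible sampling noise, which previews the variance issue discussed on SLIDE 36.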

SLIDE 34

Bayes by Backprop

Try to approximate the ELBO directly by sampling from q(w, λ):

ELBO = L(λ) = E_{q(w,λ)} [ln p(w, D) − ln q(w, λ)]

with the estimate

L̂(λ) = (1/S) ∑_{s=1}^{S} [ln p(w_s, D) − ln q(w_s, λ)]

But we need ∇_λ L̂(λ), and we cannot differentiate L̂(λ) because it is not a smooth function of λ (the samples w_s depend on λ through the sampling step).

Use the re-parameterization trick:

w_s = w(λ, ε_s)

where ε_s is drawn from, for example, a standard Gaussian distribution.

SLIDE 35

Bayes by BackProp (BbB)

The estimated ELBO is now

L̂(λ) = (1/S) ∑_{s=1}^{S} [ln p(w(λ, ε_s), D) − ln q(w(λ, ε_s), λ)]

This is a smooth function of λ and can be differentiated:

∇_λ L̂(λ) = (1/S) ∑_{s=1}^{S} [ (∂L̂_s/∂w)(∂w/∂λ) + ∂L̂_s/∂λ ]

where

L̂_s = ln p(w(λ, ε_s), D) − ln q(w(λ, ε_s), λ)

Once the gradients are known, the optimum λ⋆, and hence q(w, λ⋆), can be found by gradient ascent on L̂.
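The re-parameterized gradient can be sketched on the same kind of toy model as before; the joint ln p(w, D) = −(w − 2)²/2 and the Gaussian family with w = mu + sigma·ε are illustrative assumptions, and the gradient expressions below follow from that specific choice:

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy joint: ln p(w, D) = -(w - 2)^2 / 2 + const; true posterior N(2, 1).
# q = N(mu, sigma^2), reparameterized as w = mu + sigma * eps, eps ~ N(0,1).
# With this substitution, -ln q(w(λ,eps), λ) = eps^2/2 + log(sigma) + const,
# so the ELBO estimate is a smooth function of λ = (mu, log_sigma).
mu, log_sigma = 0.0, 0.0
S, rho = 64, 0.05
for t in range(1500):
    eps = rng.standard_normal(S)
    sigma = np.exp(log_sigma)
    w = mu + sigma * eps                           # w_s = w(λ, eps_s)
    dlogp = -(w - 2.0)                             # d/dw ln p(w, D)
    grad_mu = np.mean(dlogp)                       # chain rule: ∂w/∂mu = 1
    grad_ls = np.mean(dlogp * sigma * eps) + 1.0   # ∂w/∂log_sigma = sigma*eps,
    mu += rho * grad_mu                            # plus d(log sigma)/d(log_sigma)
    log_sigma += rho * grad_ls
```

In line with SLIDE 36, this estimator is typically much less noisy than the score-function version: the gradient of ln p flows through the samples themselves rather than being reweighted after the fact.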

SLIDE 36

Performance of BBVI and BbB

Both methods estimate approximate gradients by sampling
High variance of the estimated gradients is a problem
In practice, these algorithms need modifications to tackle the high variance
BbB tends to give lower-variance estimates than BBVI

SLIDE 37

Bayesian Deep Learning through Randomization in Training

Stochastic gradient descent and Dropout can be given Bayesian interpretations.
The dropout procedure at test time can be used for estimating the uncertainty of model outputs (Monte Carlo Dropout):

Enable dropout and feed the network S times with the data, collecting the outputs f(s), s = 1, 2, · · · , S

Output variance = (1/S) ∑_s (f(s) − f̄)², where f̄ = (1/S) ∑_s f(s)
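A minimal Monte Carlo Dropout sketch for a tiny hand-built network; the weights, input, dropout rate, and network shape below are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)

# Tiny 1-hidden-layer network with dropout kept ON at test time.
W1, b1 = rng.normal(size=(16, 4)), np.zeros(16)
W2, b2 = rng.normal(size=(1, 16)), np.zeros(1)

def forward(x, p_drop=0.5):
    h = np.maximum(0.0, W1 @ x + b1)         # ReLU hidden layer
    mask = rng.random(h.shape) >= p_drop     # fresh dropout mask per pass
    h = h * mask / (1.0 - p_drop)            # inverted-dropout scaling
    return (W2 @ h + b2)[0]

x = np.array([0.5, -1.0, 0.3, 2.0])
S = 1000
outs = np.array([forward(x) for _ in range(S)])  # S stochastic forward passes
f_bar = outs.mean()                              # prediction
uncertainty = outs.var()                         # output variance
```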

SLIDE 38

Restricted Boltzmann Machines

SLIDE 39

Stochastic Neurons

We consider stochastic binary neurons, i.e. y can be either 1 or 0:

p(y = 1) = σ(b + ∑_i w_i x_i)
p(y = 0) = 1 − p(y = 1)
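A stochastic binary neuron is a one-liner once the sigmoid is in place; the weights and input below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(6)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def stochastic_neuron(x, w, b):
    """Return y in {0, 1} with p(y=1) = sigmoid(b + w·x)."""
    p1 = sigmoid(b + np.dot(w, x))
    return int(rng.random() < p1)

y = stochastic_neuron(x=np.array([1.0, 0.0]), w=np.array([2.0, -1.0]), b=0.5)
```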

SLIDE 40

Boltzmann Machine

A Boltzmann machine is a recurrent network with stochastic neurons
Weights are symmetrical
At equilibrium, the relationships of the neuron outputs can be represented using an undirected graphical model

SLIDE 41

Restricted Boltzmann Machine (RBM)

Neurons are divided into two groups: visible and hidden
Restricted architecture: no connections within the visible group or within the hidden group
Network parameters:

Bias vector of the hidden units, b = [b_1, b_2, · · · , b_H]
Bias vector of the visible units, c = [c_1, c_2, · · · , c_V]
Connection weights, W = {w_{i,j}}

Network values are binary random vectors: v = [v_1, v_2, · · · , v_V] and h = [h_1, h_2, · · · , h_H]

SLIDE 42

How are the network parameters and values related?

Through the definition of an energy function. In an RBM, the energy function is defined as

E(v, h) = −hᵀWv − cᵀv − bᵀh

We assign probabilities to (v, h) based on the Boltzmann distribution

p(v, h) = exp(−E(v, h)) / Z

where

Z = ∑_{v′,h′} exp(−E(v′, h′))
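For a tiny RBM the energy and the Boltzmann probabilities can be computed by brute force; the sizes and random parameters below are illustrative, and enumerating Z like this is only feasible because the model is tiny (which is exactly the intractability point made on the next slide):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(7)

# Tiny RBM: V=3 visible units, H=2 hidden units, random parameters.
V, H = 3, 2
W = rng.normal(scale=0.5, size=(H, V))
b = rng.normal(scale=0.1, size=H)      # hidden biases
c = rng.normal(scale=0.1, size=V)      # visible biases

def energy(v, h):
    return -(h @ W @ v + c @ v + b @ h)

# Partition function Z by enumerating all 2^V * 2^H binary states.
states = [(np.array(v), np.array(h))
          for v in product([0, 1], repeat=V)
          for h in product([0, 1], repeat=H)]
Z = sum(np.exp(-energy(v, h)) for v, h in states)

def p(v, h):
    return np.exp(-energy(v, h)) / Z
```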

SLIDE 43

What can we do with RBM?

Assume that the network parameters W W W ,b b b,c c c are known. Can we calculate the probability of a given pair of vectors (ˆ v v v, ˆ h h h)?

This is generally not tractable, because calculating Z requires to sum all combinations v and h values.

Can we calculate the probability of h h h given v v v or vice-versa?

Yes, this is ”inference” and possible.

Assume that a data set of v v v vectors given. Can we estimate the network parameters W W W ,b b b,c c c ?

Yes, this is training and possible

SLIDE 44

Inference

We want to find p(h h h|v v v) assuming W W W ,b b b,c c c are known. We start with the Bayes rule p(h h h|v v v) = p(h h h|v v v)

  • h′

h′ h′ p(h′

h′ h′,v′ v′ v′) = exp

  • h

h hTW W Wv v v + c c cTv v v + b b bTh h h

  • /Z
  • h′

h′ h′∈{0,1}H exp

  • h′

h′ h′TW W Wv′ v′ v′ + c c cTv′ v′ v′ + b b bTh′ h′ h′ /Z Canceling common factors and expanding vector-matrix multiplication as a summation p(h h h|v v v) = exp

  • j (hjW

W W jv v v + bjhj)

  • h′

1∈{0,1}

  • h′

2∈{0,1} . . .

h′

H∈{0,1} exp(

j(h′ jW

W W jv v v + bjh′

j))

SLIDE 45

Inference

We want to find p(h|v), assuming W, b, c are known. Writing the exponential of sums as a product of exponentials:

p(h|v) = ∏_j exp(h_j W_j v + b_j h_j) / [∑_{h′_1∈{0,1}} ∑_{h′_2∈{0,1}} · · · ∑_{h′_H∈{0,1}} ∏_j exp(h′_j W_j v + b_j h′_j)]

 = ∏_j exp(h_j W_j v + b_j h_j) / [(∑_{h′_1∈{0,1}} exp(h′_1 W_1 v + b_1 h′_1)) · · · (∑_{h′_H∈{0,1}} exp(h′_H W_H v + b_H h′_H))]

 = ∏_j exp(h_j W_j v + b_j h_j) / ∏_j ∑_{h′_j∈{0,1}} exp(h′_j W_j v + b_j h′_j)

 = ∏_j exp(h_j W_j v + b_j h_j) / ∏_j (1 + exp(W_j v + b_j))

 = ∏_j [ exp(h_j W_j v + b_j h_j) / (1 + exp(W_j v + b_j)) ]

since each inner sum over h′_j ∈ {0,1} evaluates to 1 + exp(W_j v + b_j).

This implies that the calculation of p(h|v) is tractable.

SLIDE 46

Inference

Let’s try to interpret

p(h|v) = ∏_j [ exp(h_j W_j v + b_j h_j) / (1 + exp(W_j v + b_j)) ]

Consider the quantity

q(h_j) = exp(h_j W_j v + b_j h_j) / (1 + exp(W_j v + b_j))

q(h_j) takes two values, q(0) and q(1), and these values sum to 1. Therefore it is a probability measure on h_j.

Since we assumed v is given, q(h_j) is actually p(h_j|v). A simple manipulation shows that

p(h_j = 1|v) = σ(W_j v + b_j)

i.e. the activation function of a stochastic neuron.
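The factorized inference rule is a single matrix-vector product followed by a sigmoid; the parameter values and visible vector below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(8)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Factorized RBM inference: p(h_j = 1 | v) = sigmoid(W_j v + b_j).
V, H = 4, 3
W = rng.normal(scale=0.5, size=(H, V))
b = np.zeros(H)
v = np.array([1, 0, 1, 1])

p_h = sigmoid(W @ v + b)                        # vector of p(h_j = 1 | v)
h_sample = (rng.random(H) < p_h).astype(int)    # one sample of h given v
```

Sampling v given h works the same way with Wᵀ and the visible biases c, which is what makes Gibbs sampling between the two layers cheap.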

SLIDE 47

Training

We consider maximum likelihood training with a given dataset {v_1, v_2, . . . , v_N}, with respect to the log likelihood

L = log ∏_{i=1}^{N} p(v_i) = ∑_{i=1}^{N} log p(v_i)

We use gradient ascent on L and therefore calculate ∂L/∂θ, the gradient of L with respect to a model parameter θ.

We derive the gradient for a single sample:

∂ log p(v) / ∂θ

SLIDE 48

Gradients

By definition we know that

p(v, h) = exp(−E(v, h)) / Z   (1)

where

Z = ∑_{v′,h′} exp(−E(v′, h′))   (2)

Therefore

p(v) = ∑_h p(v, h) = ∑_h exp(−E(v, h)) / Z   (3)

Take the log and differentiate with respect to θ:

∂ log p(v)/∂θ = ∂ log ∑_h exp(−E(v, h))/∂θ − ∂ log Z/∂θ   (4)

SLIDE 49

Gradients

Consider the first term:

∂ log ∑_h exp(−E(v, h))/∂θ = −[∑_h exp(−E(v, h)) ∂E(v, h)/∂θ] / [∑_{h′} exp(−E(v, h′))]   (5)
 = −∑_h [exp(−E(v, h)) / ∑_{h′} exp(−E(v, h′))] ∂E(v, h)/∂θ   (6)

But dividing equation 1 by equation 3 we get

p(v, h)/p(v) = p(h|v) = exp(−E(v, h)) / ∑_{h′} exp(−E(v, h′))   (7)

Substituting equation 7 into equation 6:

∂ log ∑_h exp(−E(v, h))/∂θ = −∑_h p(h|v) ∂E(v, h)/∂θ   (8)

SLIDE 50

Gradients

Consider the second term in equation 4 and substitute for Z from equation 2:

∂ log Z/∂θ = ∂ log ∑_{v′,h′} exp(−E(v′, h′))/∂θ   (9)
 = −[∑_{v′,h′} exp(−E(v′, h′)) ∂E(v′, h′)/∂θ] / [∑_{v′,h′} exp(−E(v′, h′))]   (10)
 = −∑_{v,h} [exp(−E(v, h)) / ∑_{v′,h′} exp(−E(v′, h′))] ∂E(v, h)/∂θ   (11)

From equations 1 and 2 it is clear that exp(−E(v, h)) / ∑_{v′,h′} exp(−E(v′, h′)) is p(v, h). Therefore

∂ log Z/∂θ = −∑_{v,h} p(v, h) ∂E(v, h)/∂θ   (12)

SLIDE 51

Gradients

From equations 4, 8 and 12:

∂ log p(v)/∂θ = −∑_h p(h|v) ∂E(v, h)/∂θ + ∑_{v,h} p(v, h) ∂E(v, h)/∂θ   (13)

∂ log p(v)/∂θ = −E_{p(h|v)}[∂E(v, h)/∂θ] + E_{p(v,h)}[∂E(v, h)/∂θ]   (14)

The first term of equation 14:

Known as the positive phase
Depends on the training data
Can be computed exactly

The second term of equation 14:

Known as the negative phase
Independent of the training data; completely model dependent
Must be estimated through Gibbs sampling and a procedure known as Contrastive Divergence
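As a hedged sketch of how the two phases are used in practice, here is one CD-1 parameter update for a binary RBM. The shapes, learning rate, and the choice of one Gibbs step (CD-1) are illustrative; for the bilinear energy above, ∂E/∂W = −h vᵀ, so the update contrasts ⟨h vᵀ⟩ statistics from the data (positive phase) with those after one Gibbs step (approximate negative phase):

```python
import numpy as np

rng = np.random.default_rng(9)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# One CD-1 update for a binary RBM (illustrative sizes and learning rate).
V, H, lr = 6, 4, 0.1
W = rng.normal(scale=0.1, size=(H, V))
b, c = np.zeros(H), np.zeros(V)

def cd1_step(v0):
    # Positive phase: hidden statistics given the data vector v0.
    ph0 = sigmoid(W @ v0 + b)
    h0 = (rng.random(H) < ph0).astype(float)
    # Negative phase approximation: one Gibbs step v0 -> h0 -> v1 -> p(h1).
    pv1 = sigmoid(W.T @ h0 + c)
    v1 = (rng.random(V) < pv1).astype(float)
    ph1 = sigmoid(W @ v1 + b)
    # Positive-phase statistics minus negative-phase statistics.
    dW = np.outer(ph0, v0) - np.outer(ph1, v1)
    return dW, ph0 - ph1, v0 - v1

v0 = rng.integers(0, 2, size=V).astype(float)
dW, db, dc = cd1_step(v0)
W += lr * dW
b += lr * db
c += lr * dc
```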

SLIDE 52

Applications of RBMs

Deep belief networks
Collaborative filtering

SLIDE 53

Deep Belief Networks

Method for initializing a multilayer network:

1 Train an RBM with training data
2 Initialize the current layer with the trained parameters
3 Present training data to the RBM and sample the hidden layer values
4 Use the hidden layer values as training data and repeat from step 1.

SLIDE 54

Collaborative Filtering

Application in recommendation systems. Eg: movie rating/recommendation
Different users rate different items (eg: movies) using a rating scale such as 1 to 5
The problem is to estimate the rating of an unrated item for a given user

SLIDE 55

Collaborative filtering with RBM

Train a different RBM for each user, but share weights across users
Visible units correspond to the ratings given to each movie
In training, movies with missing ratings are omitted
To predict a missing rating, find p(h|v) and then compute p(v|h) for the missing visible unit

SLIDE 56

The End
