Bayesian Deep Learning and Restricted Boltzmann Machines
Narada Warakagoda
Forsvarets Forskningsinstitutt ndw@ffi.no
November 1, 2018
Narada Warakagoda (FFI) Short title November 1, 2018 1 / 56
Overview

1. Probability Review
2. Bayesian Deep Learning
3. Restricted Boltzmann Machines
Normal (Gaussian) Distribution

    p(x) = (2π)^{-d/2} |Σ|^{-1/2} exp( -(1/2) (x - µ)^T Σ^{-1} (x - µ) ),   written x ~ N(µ, Σ)

Categorical Distribution

    P(x) = ∏_{i=1}^{k} p_i^{[x=i]}

Sampling: x ~ p(x)
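These two distributions and the sampling operation x ~ p(x) can be exercised directly; a minimal NumPy sketch (the parameter values µ, Σ and p below are illustrative, not taken from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Multivariate Gaussian: x ~ N(mu, Sigma)
mu = np.array([0.0, 1.0])
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
x = rng.multivariate_normal(mu, Sigma, size=10_000)
print(x.mean(axis=0))   # close to mu
print(np.cov(x.T))      # close to Sigma

# Categorical: P(x = i) = p_i
p = np.array([0.2, 0.5, 0.3])
c = rng.choice(len(p), size=10_000, p=p)
print(np.bincount(c) / len(c))  # empirical frequencies, close to p
```
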
Independent variables

    p(x_1, x_2, ..., x_k) = ∏_{i=1}^{k} p(x_i)

Expectation

    E_{p(x)} f(x) = ∫ f(x) p(x) dx             (continuous case)

    E_{P(x)} f(x) = ∑_{i=1}^{k} f(x_i) P(x_i)  (discrete case)
Kullback-Leibler Divergence

    KL( q(x) || p(x) ) = E_{q(x)} log [ q(x) / p(x) ]
                       = ∫ [ q(x) log q(x) - q(x) log p(x) ] dx

For the discrete case

    KL( Q(x) || P(x) ) = ∑_{i=1}^{k} [ Q(x_i) log Q(x_i) - Q(x_i) log P(x_i) ]
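The discrete formula translates directly into code; a small sketch (the distributions q and p below are made-up examples):

```python
import numpy as np

def kl_discrete(q, p):
    """KL(Q||P) = sum_i Q_i (log Q_i - log P_i), for distributions on the same support."""
    q, p = np.asarray(q, float), np.asarray(p, float)
    return float(np.sum(q * (np.log(q) - np.log(p))))

q = [0.4, 0.6]
p = [0.5, 0.5]
print(kl_discrete(q, p))  # small positive number
print(kl_discrete(q, q))  # exactly 0.0: KL vanishes iff the distributions match
```

Note that KL is not symmetric: kl_discrete(q, p) and kl_discrete(p, q) generally differ.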
Joint distribution

    p(x, y) = p(x|y) p(y)

Marginalization

    p(x) = ∫ p(x, y) dy            P(x) = ∑_y P(x, y)

Conditional distribution (Bayes rule)

    p(x|y) = p(x, y) / p(y) = p(y|x) p(x) / ∫ p(y|x) p(x) dx
Prediction

    p(y|x, w) = N( f_w(x), Σ )

Classification

    P(y|x, w) = ∏_{i=1}^{k} f_w^i(x)^{[y=i]}

where f_w(x) is the output of the network with parameters w for input x.
Maximum Likelihood (ML)

    ŵ = arg max_w p(Y|X, w)

Maximum A-Posteriori (MAP)

    ŵ = arg max_w p(Y, w|X) = arg max_w p(Y|X, w) p(w)

Bayesian

    p(w|Y, X) = p(Y|X, w) p(w) / p(Y|X) = p(Y|X, w) p(w) / ∫ p(Y|X, w) p(w) dw
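To make the ML/MAP distinction concrete, here is a hedged toy example, assuming Gaussian data with a Gaussian prior on the unknown mean (a model chosen because both estimates have closed forms, not one taken from the slides). The MAP estimate is the ML estimate shrunk toward the prior mean:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative model: data x_i ~ N(w, sigma^2), prior w ~ N(0, tau^2).
sigma, tau = 1.0, 0.5
x = rng.normal(2.0, sigma, size=20)
n = len(x)

w_ml = x.mean()                                                   # arg max_w p(X|w)
w_map = (n / sigma**2) / (n / sigma**2 + 1 / tau**2) * x.mean()   # arg max_w p(X|w) p(w)

print(w_ml, w_map)  # MAP is shrunk toward the prior mean 0
```
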
Not only the prediction/classification, but also its uncertainty can be calculated.

Since we have p(w|Y, X), we can sample w and use each sample as network parameters when calculating the prediction/classification p(y|x, w) (i.e. the network output for a given input).

The prediction/classification is the mean of p(y|x, w):

    p_out = p(y|x, Y, X) = ∫ p(y|x, w) p(w|Y, X) dw

The uncertainty of the prediction/classification is the variance of p(y|x, w):

    Var( p(y|x, w) ) = ∫ [ p(y|x, w) - p_out ]^2 p(w|Y, X) dw

Uncertainty is important in safety-critical applications (e.g. self-driving cars, medical diagnosis, military applications).
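A sketch of this sampling recipe, assuming we already have draws from the posterior p(w|Y, X) (faked here with a Gaussian) and a deliberately tiny one-parameter "network"; all names and values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical setup: p(y=1|x, w) = sigmoid(w*x), with samples of w
# standing in for draws from the posterior p(w|Y, X).
w_samples = rng.normal(1.0, 0.4, size=5_000)
x = 2.0

p = sigmoid(w_samples * x)         # p(y=1|x, w) for each posterior sample
p_out = p.mean()                   # prediction: posterior mean
p_var = ((p - p_out) ** 2).mean()  # uncertainty: posterior variance
print(p_out, p_var)
```
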
- Natural interpretation for regularization
- Model selection
- Input data selection (active learning)
We calculate the posterior.

For the continuous case:

    p(w|Y, X) = p(Y|X, w) p(w) / ∫ p(Y|X, w) p(w) dw

For the discrete case:

    P(w|Y, X) = p(Y|X, w) P(w) / ∑_w p(Y|X, w) P(w)

Calculating the denominator is often intractable. E.g. consider a weight vector w of 100 elements, each of which can take two values: the sum then has 2^100 ≈ 1.3 × 10^30 terms; even at one term per nanosecond, evaluating it would take about 4 × 10^13 years. Compare this with the universe's age of 13.7 billion years.

We need approximations.
Approximation methods:

- Monte Carlo techniques (e.g. Markov Chain Monte Carlo, MCMC)
- Variational inference
- Introducing random elements in training (e.g. Dropout)
Markov Chain Monte Carlo (MCMC)

- Asymptotically exact
- Computationally expensive

Variational Inference

- No guarantee of exactness
- Possibility for faster computation
We are interested in

    p_out = Mean( p(y|x, w) ) = p(y|x, Y, X) = ∫ p(y|x, w) p(w|Y, X) dw

    Var( p(y|x, w) ) = ∫ [ p(y|x, w) - p_out ]^2 p(w|Y, X) dw

Both are integrals of the type

    I = ∫ F(w) p(w|D) dw

where D = (Y, X) is the training data. Approximate the integral by sampling w_i from p(w|D):

    I ≈ (1/L) ∑_{i=1}^{L} F(w_i)
Challenge: We don't have the posterior

    p(w|D) = p(w|Y, X) = p(Y|X, w) p(w) / ∫ p(Y|X, w) p(w) dw

"Solution": Use importance sampling, i.e. sample from a proposal distribution q(w):

    I = ∫ F(w) [ p(w|D) / q(w) ] q(w) dw ≈ (1/L) ∑_{i=1}^{L} F(w_i) p(w_i|D) / q(w_i)

Problem: We still do not have p(w|D).
Problem: We still do not have p(w|D).

Solution: Use the unnormalized posterior

    p̃(w|D) = p(Y|X, w) p(w)

with normalization factor

    Z = ∫ p(Y|X, w) p(w) dw

such that p(w|D) = p̃(w|D) / Z. The integral can then be calculated with the self-normalized estimator:

    I ≈ [ ∑_{i=1}^{L} F(w_i) p̃(w_i|D) / q(w_i) ] / [ ∑_{i=1}^{L} p̃(w_i|D) / q(w_i) ]
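A runnable sketch of the self-normalized estimator, on a deliberately simple stand-in problem (my choice, not the slides'): the unnormalized target is a standard normal without its constant, the proposal is N(0, 2^2), and F(w) = w^2, whose true expectation under the target is 1:

```python
import numpy as np

rng = np.random.default_rng(0)

# Unnormalized target: p_tilde(w) = exp(-w^2/2)  (a standard normal, constant unknown to us)
p_tilde = lambda w: np.exp(-0.5 * w**2)

# Proposal q(w) = N(0, s^2), wider than the target so its tails cover it
s = 2.0
q_pdf = lambda w: np.exp(-0.5 * (w / s) ** 2) / (s * np.sqrt(2 * np.pi))

w = rng.normal(0.0, s, size=50_000)   # samples from q
r = p_tilde(w) / q_pdf(w)             # unnormalized importance weights
F = w**2                              # E_p[w^2] = 1 for this target

I_hat = np.sum(F * r) / np.sum(r)     # self-normalized estimator
print(I_hat)  # close to 1.0
```
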
The proposal distribution must be close to the non-zero areas of the original distribution p(w|D). In neural networks, p(w|D) is typically small except in a few narrow areas, so blind sampling from q(w) has a high chance of producing samples that fall outside the non-zero areas of p(w|D). We must actively try to get samples that lie close to the mass of p(w|D). Markov Chain Monte Carlo (MCMC) is such a technique.
The Metropolis algorithm is an example of MCMC. Draw samples repeatedly from the random walk

    w_{t+1} = w_t + ε

where ε is a small random vector, ε ~ q(ε) (e.g. Gaussian noise). The sample drawn at step t is accepted or rejected based on the ratio p̃(w_t|D) / p̃(w_{t-1}|D):

- If p̃(w_t|D) > p̃(w_{t-1}|D), accept the sample.
- If p̃(w_t|D) < p̃(w_{t-1}|D), accept the sample with probability p̃(w_t|D) / p̃(w_{t-1}|D).

If the sample is accepted, use it for calculating I. We can use the same formula as before:

    I ≈ [ ∑_{i=1}^{L} F(w_i) p̃(w_i|D) / q(w_i) ] / [ ∑_{i=1}^{L} p̃(w_i|D) / q(w_i) ]
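A minimal one-dimensional Metropolis sampler under the same stand-in target as above (an unnormalized standard normal); working in log space avoids numerical overflow. The step size and sample counts are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Unnormalized posterior p_tilde(w): stand-in for p(Y|X,w) p(w).
log_p_tilde = lambda w: -0.5 * w**2

w = 0.0
samples = []
for t in range(20_000):
    w_new = w + rng.normal(0.0, 0.5)   # random-walk proposal w_{t+1} = w_t + eps
    # Accept with probability min(1, p_tilde(w_new) / p_tilde(w))
    if np.log(rng.uniform()) < log_p_tilde(w_new) - log_p_tilde(w):
        w = w_new
    samples.append(w)

samples = np.array(samples[5_000:])    # discard burn-in
print(samples.mean(), samples.var())   # close to 0 and 1 for this target
```
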
Hybrid Monte Carlo (Hamiltonian Monte Carlo)

- Similar to the Metropolis algorithm
- But uses gradient information rather than a random walk

Simulated Annealing
Goal: computation of the posterior p(w|D), i.e. the distribution of the neural network parameters w given data D = (Y, X). But this computation is often intractable.

Idea: find a distribution q(w) from a family of distributions Q such that q(w) closely approximates p(w|D).

How do we measure the distance between q(w) and p(w|D)? The Kullback-Leibler distance:

    p̂(w|D) = arg min_{q(w) ∈ Q} KL( q(w) || p(w|D) )
Using the definition of the KL distance:

    KL( q(w) || p(w|D) ) = ∫ q(w) ln [ q(w) / p(w|D) ] dw

We cannot minimize this directly, because we do not know p(w|D). But we can manipulate it further and transform it into an equivalent optimization problem involving a quantity known as the Evidence Lower Bound (ELBO).
    KL( q(w) || p(w|D) ) = ∫ q(w) ln [ q(w) / p(w|D) ] dw
                         = ∫ q(w) ln [ q(w) p(D) / p(w, D) ] dw
                         = ∫ q(w) ln [ q(w) / p(w, D) ] dw + ∫ q(w) ln p(D) dw
                         = E_{q(w)} ln [ q(w) / p(w, D) ] + ln p(D)

Rearranging,

    ln p(D) = E_{q(w)} ln [ p(w, D) / q(w) ] + KL( q(w) || p(w|D) )

Since ln p(D) is fixed, minimizing KL( q(w) || p(w|D) ) is equivalent to maximizing the first term, the ELBO.
    ELBO = E_{q(w)} ln [ p(w, D) / q(w) ]
         = ∫ q(w) ln p(w, D) dw - ∫ q(w) ln q(w) dw
         = ∫ q(w) ln [ p(D|w) p(w) ] dw - ∫ q(w) ln q(w) dw
         = ∫ q(w) ln p(D|w) dw - ∫ q(w) ln [ q(w) / p(w) ] dw
         = E_{q(w)} ln p(D|w) - KL( q(w) || p(w) )

Maximizing the first term E_{q(w)} ln p(D|w) improves q(w)'s ability to explain the training data. The second term KL( q(w) || p(w) ) penalizes q(w)'s distance to the prior p(w).
Start with the ELBO, with q parameterized by λ:

    ELBO = L = E_{q(w,λ)} ln [ p(w, D) / q(w, λ) ] = E_{q(w,λ)} [ ln p(w, D) - ln q(w, λ) ]

Expanding the expectation,

    L(λ) = ∫ ln[ p(w, D) ] q(w, λ) dw - ∫ ln[ q(w, λ) ] q(w, λ) dw

Maximize L(λ) with respect to λ:

    λ* = arg max_λ L(λ)

Use the q optimized with respect to λ as the posterior approximation: q(w, λ*) ≈ p(w|D).
Analytical methods are not practical for deep neural networks, so we resort to gradient methods with Monte Carlo sampling. We discuss two methods:

- Black Box Variational Inference (BBVI): based on the log derivative trick
- Bayes by Backprop (BbB): based on the re-parameterization trick
Start with the ELBO:

    L(λ) = ∫ ln[ p(w, D) ] q(w, λ) dw - ∫ ln[ q(w, λ) ] q(w, λ) dw

Differentiate with respect to λ:

    ∇_λ L(λ) = ∫ ln[ p(w, D) ] ∇_λ q(w, λ) dw - ∫ ln[ q(w, λ) ] ∇_λ q(w, λ) dw - ∫ ∇_λ q(w, λ) dw

The last term is zero, since ∫ q(w, λ) dw = 1 for every λ (can you prove it?).
Now we have

    ∇_λ L(λ) = ∫ ln[ p(w, D) ] ∇_λ q(w, λ) dw - ∫ ln[ q(w, λ) ] ∇_λ q(w, λ) dw
             = ∫ ( ln[ p(w, D) ] - ln[ q(w, λ) ] ) ∇_λ q(w, λ) dw

We want to write this as an expectation with respect to q. Use the log derivative trick:

    ∇_λ q(w, λ) = ∇_λ[ ln q(w, λ) ] q(w, λ)
Now we get

    ∇_λ L(λ) = ∫ ln[ p(w, D) ] ∇_λ[ ln q(w, λ) ] q(w, λ) dw - ∫ ln[ q(w, λ) ] ∇_λ[ ln q(w, λ) ] q(w, λ) dw

Rearranging terms,

    ∇_λ L(λ) = ∫ ( ln[ p(w, D) ] - ln q(w, λ) ) ∇_λ[ ln q(w, λ) ] q(w, λ) dw

This is the same as an expectation with respect to q:

    ∇_λ L(λ) = E_{q(w,λ)} [ ( ln[ p(w, D) ] - ln q(w, λ) ) ∇_λ ln q(w, λ) ]
1. Assume a distribution q(w, λ) parameterized by λ.
2. Draw S samples of w from the distribution using the current value λ = λ_t.
3. Estimate the gradient of the ELBO using the sample values:

       ∇_λ L̂(λ) = (1/S) ∑_{s=1}^{S} ( ln[ p(w_s, D) ] - ln q(w_s, λ) ) ∇_λ ln q(w_s, λ)

4. Update λ:

       λ_{t+1} = λ_t + ρ ∇_λ L̂(λ)

5. Repeat from step 2.
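The steps above can be sketched on a toy problem (all choices here are illustrative, not from the slides): take unnormalized log joint ln p(w, D) = -(w-3)^2/2, so the exact posterior is N(3, 1), and the family q(w; λ) = N(λ, 1), for which ∇_λ ln q = (w - λ):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy target: unnormalized log p(w, D) = -(w-3)^2/2  (exact posterior is N(3, 1)).
log_p = lambda w: -0.5 * (w - 3.0) ** 2
# Variational family q(w; lam) = N(lam, 1), so grad_lam log q = (w - lam).
log_q = lambda w, lam: -0.5 * (w - lam) ** 2 - 0.5 * np.log(2 * np.pi)

lam, rho, S = 0.0, 0.05, 200
for step in range(2000):
    w = rng.normal(lam, 1.0, size=S)                        # w_s ~ q(w; lam)
    grad = np.mean((log_p(w) - log_q(w, lam)) * (w - lam))  # score-function estimator
    lam += rho * grad                                       # gradient ascent on the ELBO
print(lam)  # close to 3.0, the true posterior mean
```
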
Try to approximate the ELBO directly by sampling from q(w, λ):

    ELBO = L(λ) = E_{q(w,λ)} [ ln p(w, D) - ln q(w, λ) ]

    L̂(λ) = (1/S) ∑_{s=1}^{S} [ ln p(w_s, D) - ln q(w_s, λ) ]

We cannot differentiate L̂(λ) with respect to λ, because the sampling of w_s makes it a non-smooth function of λ. Use the re-parameterization trick:

    w_s = w(λ, ε_s)

where ε_s is drawn from, for example, a standard Gaussian distribution.
The estimated ELBO is now

    L̂(λ) = (1/S) ∑_{s=1}^{S} [ ln p(w(λ, ε_s), D) - ln q(w(λ, ε_s), λ) ]

with gradient

    ∇_λ L̂(λ) = (1/S) ∑_{s=1}^{S} [ (∂L̂_s/∂w)(∂w/∂λ) + ∂L̂_s/∂λ ]

where

    L̂_s = ln p(w(λ, ε_s), D) - ln q(w(λ, ε_s), λ)

Once the gradients are known, the optimum λ* and hence q(w, λ*) can be found by gradient-based optimization.
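A matching sketch of the re-parameterization approach on the same toy target, with q(w; µ, σ) = N(µ, σ²) and w = µ + σε. One hedged simplification relative to the formula above: the entropy of the Gaussian q is differentiated analytically (the "+1" term below) instead of through samples of ln q, a common variance-reduction variant:

```python
import numpy as np

rng = np.random.default_rng(0)

# Same toy target: log p(w, D) = -(w-3)^2/2, exact posterior N(3, 1).
# Family q(w; mu, sigma) = N(mu, sigma^2), reparameterized as w = mu + sigma*eps.
mu, log_sig = 0.0, 0.0
rho, S = 0.05, 200
for step in range(2000):
    eps = rng.normal(size=S)                 # eps_s ~ N(0, 1)
    sig = np.exp(log_sig)
    w = mu + sig * eps                       # w_s = w(lam, eps_s), differentiable in (mu, log_sig)
    dlogp = -(w - 3.0)                       # d log p / dw
    g_mu = np.mean(dlogp)                    # chain rule: dw/dmu = 1
    g_ls = np.mean(dlogp * sig * eps) + 1.0  # dw/dlog_sig = sig*eps; +1 from the entropy of q
    mu += rho * g_mu
    log_sig += rho * g_ls
print(mu, np.exp(log_sig))  # close to the true posterior mean 3.0 and std 1.0
```
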
Both methods estimate approximate gradients by sampling. High variance of the estimated gradients is a problem, and in practice these algorithms need modifications to tackle it. Bayes by Backprop (BbB) tends to give lower-variance estimates than Black Box Variational Inference (BBVI).
Stochastic gradient descent and Dropout can be given Bayesian interpretations. The dropout procedure applied at test time can be used for estimating the uncertainty of model outputs (Monte Carlo Dropout):

Enable dropout and feed the network S times with the same data, collecting the outputs f(s), s = 1, ..., S. Then

    Output variance = (1/S) ∑_{s=1}^{S} ( f(s) - f̄ )^2    where    f̄ = (1/S) ∑_{s=1}^{S} f(s)
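A toy sketch of this procedure with an untrained two-layer network (the weights are random and purely illustrative); the point is only that dropout stays enabled at test time and each forward pass draws a fresh mask:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative fixed network: 1 input -> 20 ReLU hidden units -> 1 output.
W1 = rng.normal(size=(20, 1))
W2 = rng.normal(size=(1, 20))

def forward(x, keep=0.5):
    h = np.maximum(0.0, W1 @ x)               # ReLU hidden layer
    mask = rng.uniform(size=h.shape) < keep   # fresh dropout mask on every pass
    return float(W2 @ (h * mask) / keep)      # inverted-dropout scaling

x = np.array([1.0])
S = 1000
f = np.array([forward(x) for _ in range(S)])  # S stochastic forward passes
f_bar = f.mean()                              # prediction
var = ((f - f_bar) ** 2).mean()               # uncertainty estimate
print(f_bar, var)
```
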
We consider stochastic binary neurons, i.e. the output y can be either 1 or 0:

    p(y = 1) = σ( b + ∑_i w_i x_i )
    p(y = 0) = 1 - p(y = 1)
A Boltzmann machine is a recurrent network with stochastic neurons. The weights are symmetrical. At equilibrium, the relationships between the neuron outputs can be represented using an undirected graphical model.
Neurons are divided into two groups: visible and hidden. Restricted architecture: no connections within the visible group or within the hidden group.

Network parameters:

- Bias vector of the hidden units, b = [b_1, b_2, ..., b_H]
- Bias vector of the visible units, c = [c_1, c_2, ..., c_V]
- Connection weights, W = {w_ij}

Network values are binary random vectors: v = [v_1, v_2, ..., v_V] and h = [h_1, h_2, ..., h_H].
Probabilities are assigned through the definition of an energy function. In an RBM, the energy function is defined as

    E(v, h) = -h^T W v - c^T v - b^T h

We assign probabilities to (v, h) based on the Boltzmann distribution:

    p(v, h) = exp( -E(v, h) ) / Z    where    Z = ∑_{v', h'} exp( -E(v', h') )
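For an RBM small enough to enumerate, Z and the joint probabilities can be computed by brute force; a sketch with made-up parameters (V = 3 visible, H = 2 hidden units):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)

# Tiny RBM, small enough to enumerate Z = sum over all 2^(V+H) configurations.
V, H = 3, 2
W = rng.normal(scale=0.5, size=(H, V))
b = rng.normal(scale=0.5, size=H)   # hidden biases
c = rng.normal(scale=0.5, size=V)   # visible biases

def energy(v, h):
    return -(h @ W @ v + c @ v + b @ h)

Z = sum(np.exp(-energy(np.array(v), np.array(h)))
        for v in product([0, 1], repeat=V)
        for h in product([0, 1], repeat=H))

v = np.array([1, 0, 1])
h = np.array([0, 1])
p = np.exp(-energy(v, h)) / Z
print(p)  # a valid probability; all 2^(V+H) of them sum to 1
```
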
Assume that the network parameters W, b, c are known. Can we calculate the probability of a given pair of vectors (v̂, ĥ)?

- This is generally not tractable, because calculating Z requires summing over all combinations of v and h values.

Can we calculate the probability of h given v, or vice versa?

- Yes, this is "inference" and it is possible.

Assume that a data set of v vectors is given. Can we estimate the network parameters W, b, c?

- Yes, this is training and it is possible.
We want to find p(h|v), assuming W, b, c are known. We start with the definition of the conditional distribution:

    p(h|v) = p(v, h) / ∑_{h'} p(v, h')
           = [ exp( h^T W v + c^T v + b^T h ) / Z ] / [ ∑_{h' ∈ {0,1}^H} exp( h'^T W v + c^T v + b^T h' ) / Z ]

Canceling common factors and expanding the vector-matrix multiplication as a summation (W_j denotes the j-th row of W):

    p(h|v) = exp( ∑_j ( h_j W_j v + b_j h_j ) ) / [ ∑_{h'_1 ∈ {0,1}} ∑_{h'_2 ∈ {0,1}} ... ∑_{h'_H ∈ {0,1}} exp( ∑_j ( h'_j W_j v + b_j h'_j ) ) ]
Writing the exponential of sums as a product of exponentials:

    p(h|v) = ∏_j exp( h_j W_j v + b_j h_j ) / [ ∑_{h'_1 ∈ {0,1}} ∑_{h'_2 ∈ {0,1}} ... ∑_{h'_H ∈ {0,1}} ∏_j exp( h'_j W_j v + b_j h'_j ) ]
           = ∏_j exp( h_j W_j v + b_j h_j ) / [ ( ∑_{h'_1 ∈ {0,1}} exp( h'_1 W_1 v + b_1 h'_1 ) ) ... ( ∑_{h'_H ∈ {0,1}} exp( h'_H W_H v + b_H h'_H ) ) ]
           = ∏_j [ exp( h_j W_j v + b_j h_j ) / ∑_{h'_j ∈ {0,1}} exp( h'_j W_j v + b_j h'_j ) ]
           = ∏_j [ exp( h_j W_j v + b_j h_j ) / ( 1 + exp( W_j v + b_j ) ) ]

This implies that the calculation of p(h|v) is tractable.
Let's try to interpret

    p(h|v) = ∏_j [ exp( h_j W_j v + b_j h_j ) / ( 1 + exp( W_j v + b_j ) ) ]

Consider the quantity

    q(h_j) = exp( h_j W_j v + b_j h_j ) / ( 1 + exp( W_j v + b_j ) )

q(h_j) takes two values, q(0) and q(1), and these values sum to 1. Since we assumed v is given, q(h_j) is actually p(h_j|v). A simple manipulation shows that

    p(h_j = 1 | v) = σ( W_j v + b_j )

i.e. the activation function of a stochastic neuron.
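This result gives the usual vectorized inference step; a small sketch with illustrative parameters (V = 4 visible, H = 3 hidden units):

```python
import numpy as np

rng = np.random.default_rng(0)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Illustrative RBM parameters: H = 3 hidden, V = 4 visible units.
W = rng.normal(size=(3, 4))
b = rng.normal(size=3)

v = np.array([1, 0, 1, 1])
p_h = sigmoid(W @ v + b)   # p(h_j = 1 | v) for every hidden unit at once
print(p_h)

# Sampling the hidden vector given v:
h = (rng.uniform(size=3) < p_h).astype(int)
print(h)
```
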
We consider maximum likelihood training with a given dataset {v_1, v_2, ..., v_N}, with respect to the log likelihood

    L = log ∏_{i=1}^{N} p(v_i) = ∑_{i=1}^{N} log p(v_i)

We use gradient descent and therefore calculate ∂L/∂θ, the gradient of L with respect to a model parameter θ. We derive the gradient for a single sample:

    ∂ log p(v) / ∂θ
By definition we know that

    p(v, h) = exp( -E(v, h) ) / Z                                   (1)

where

    Z = ∑_{v', h'} exp( -E(v', h') )                                (2)

Therefore

    p(v) = ∑_h p(v, h) = ∑_h exp( -E(v, h) ) / Z                    (3)

Take the log and differentiate with respect to θ:

    ∂ log p(v) / ∂θ = ∂ log ∑_h exp( -E(v, h) ) / ∂θ - ∂ log Z / ∂θ    (4)
Consider the first term:

    ∂ log ∑_h exp( -E(v, h) ) / ∂θ
        = - [ ∑_h exp( -E(v, h) ) ∂E(v, h)/∂θ ] / ∑_h exp( -E(v, h) )        (5)
        = - ∑_h [ exp( -E(v, h) ) / ∑_h exp( -E(v, h) ) ] ∂E(v, h)/∂θ        (6)

But dividing equation 1 by equation 3 we get

    p(v, h) / p(v) = p(h|v) = exp( -E(v, h) ) / ∑_h exp( -E(v, h) )          (7)

Substituting equation 7 into equation 6:

    ∂ log ∑_h exp( -E(v, h) ) / ∂θ = - ∑_h p(h|v) ∂E(v, h)/∂θ                (8)
Consider the second term in equation 4 and substitute for Z from equation 2:

    ∂ log Z / ∂θ = ∂ log ∑_{v', h'} exp( -E(v', h') ) / ∂θ                                    (9)
                 = - [ ∑_{v', h'} exp( -E(v', h') ) ∂E(v', h')/∂θ ] / ∑_{v', h'} exp( -E(v', h') )   (10)
                 = - ∑_{v, h} [ exp( -E(v, h) ) / ∑_{v', h'} exp( -E(v', h') ) ] ∂E(v, h)/∂θ         (11)

From equations 1 and 2 it is clear that exp( -E(v, h) ) / ∑_{v', h'} exp( -E(v', h') ) is p(v, h). Therefore

    ∂ log Z / ∂θ = - ∑_{v, h} p(v, h) ∂E(v, h)/∂θ                                             (12)
From equations 4, 8 and 12:

    ∂ log p(v) / ∂θ = - ∑_h p(h|v) ∂E(v, h)/∂θ + ∑_{v, h} p(v, h) ∂E(v, h)/∂θ       (13)
                    = - E_{p(h|v)} [ ∂E(v, h)/∂θ ] + E_{p(v, h)} [ ∂E(v, h)/∂θ ]    (14)

The first term of equation 14:

- Known as the positive phase
- Depends on the training data
- Can be computed exactly

The second term of equation 14:

- Known as the negative phase
- Independent of the training data, completely model dependent
- Must be estimated through Gibbs sampling and a procedure known as Contrastive Divergence
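A hedged sketch of this training rule where the negative phase is approximated with a single Gibbs step (CD-1); the dataset, sizes, and hyperparameters below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Tiny RBM trained with CD-1 on a toy dataset of two binary patterns.
V, H, lr = 6, 4, 0.1
W = 0.01 * rng.normal(size=(H, V))
b = np.zeros(H)   # hidden biases
c = np.zeros(V)   # visible biases

data = np.array([[1, 1, 1, 0, 0, 0],
                 [0, 0, 0, 1, 1, 1]] * 50, dtype=float)

for epoch in range(200):
    for v0 in rng.permutation(data):
        ph0 = sigmoid(W @ v0 + b)                      # positive phase: p(h|v0), exact
        h0 = (rng.uniform(size=H) < ph0).astype(float)
        pv1 = sigmoid(W.T @ h0 + c)                    # one Gibbs step back to the visibles
        v1 = (rng.uniform(size=V) < pv1).astype(float)
        ph1 = sigmoid(W @ v1 + b)                      # negative phase statistics
        # Gradient ascent on log-likelihood, negative phase approximated by CD-1
        W += lr * (np.outer(ph0, v0) - np.outer(ph1, v1))
        b += lr * (ph0 - ph1)
        c += lr * (v0 - v1)

# After training, reconstructions should resemble the training patterns.
v = data[0]
h = (sigmoid(W @ v + b) > 0.5).astype(float)
recon = sigmoid(W.T @ h + c)
print(np.round(recon, 2))
```
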
Applications:

- Deep belief networks
- Collaborative filtering
A method for initializing a multilayer network:

1. Train an RBM with training data.
2. Initialize the current layer with the trained parameters.
3. Present training data to the RBM and sample the hidden layer values.
4. Use the hidden layer values as training data and repeat from step 1.
An application in recommendation systems, e.g. movie rating/recommendation. Different users rate different items (e.g. movies) using a rating scale such as 1 to 5. The problem is to estimate the rating a given user would assign to an unrated item.
Train a different RBM for each user, but share the weights across users. The visible units correspond to the ratings given to each movie. During training, movies with missing ratings are omitted. To predict a missing rating, compute p(h|v) and then map back to the visibles via p(v|h).