RBM, DBN, and DBM
- M. Soleymani
Sharif University of Technology Fall 2017 Slides are based on Salakhutdinov lectures, CMU 2017 and Hugo Larochelle’s class on Neural Networks: https://sites.google.com/site/deeplearningsummerschool2016/.
In a causal (directed) generative model, data is generated in two sequential steps:
– First pick the hidden states from p(h).
– Then pick the visible states from p(v|h).
The probability of generating a visible vector, v, is computed by summing over all possible hidden states: p(v) = Σ_h p(h) p(v|h).
This slide has been adapted from Hinton's lectures, "Neural Networks for Machine Learning", Coursera, 2015.
In a Boltzmann machine, the model is instead defined in terms of the energies of joint configurations of the visible and hidden units.
– We can simply define the probability to be p(v, h) ∝ e^(−E(v, h)).
A Restricted Boltzmann Machine (RBM) is an undirected graphical model with hidden and visible layers. The learnable parameters are the bias vectors b and c, which act linearly on the visible units v and the hidden units h respectively, and the weight matrix W, which models the interaction between them.
E(v, h) = −v^T W h − b^T v − c^T h = −Σ_{j,k} W_{jk} v_j h_k − Σ_j b_j v_j − Σ_k c_k h_k
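As an illustration (not from the original slides), here is a minimal numpy sketch of this energy function; the toy sizes and variable names are assumptions:

import numpy as np

def rbm_energy(v, h, W, b, c):
    # E(v, h) = -v^T W h - b^T v - c^T h for a binary RBM
    return -(v @ W @ h) - (b @ v) - (c @ h)

# Toy example (hypothetical sizes): 6 visible units, 4 hidden units
rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(6, 4))        # visible-by-hidden weight matrix
b = np.zeros(6)                                 # visible biases
c = np.zeros(4)                                 # hidden biases
v = rng.integers(0, 2, size=6).astype(float)    # a binary visible configuration
h = rng.integers(0, 2, size=4).astype(float)    # a binary hidden configuration
print(rbm_energy(v, h, W, b, c))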
All hidden units are conditionally independent given the visible units and vice versa.
RBM conditional distributions:
p(v | h) = Π_j p(v_j | h)
p(h | v) = Π_k p(h_k | v)
p(v_j = 1 | h) = σ( W_{j·} h + b_j )
p(h_k = 1 | v) = σ( W_{·k}^T v + c_k )
Larochelle et al., JMLR 2009
The effect of the latent variables can be appreciated by considering the marginal distribution over the visible units:
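Written out (this is the standard form implied by the energy above, added here since the formula did not survive extraction):

p(v) = Σ_h p(v, h) = (1/Z) Σ_h exp( −E(v, h) ) = (1/Z) exp( b^T v ) Π_k ( 1 + exp( c_k + W_{·k}^T v ) )

Equivalently, p(v) = exp( −F(v) ) / Z with free energy F(v) = −b^T v − Σ_k log( 1 + exp( c_k + W_{·k}^T v ) ); each hidden unit contributes an independent softplus term.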
Recall the energy function:
E(v, h) = −v^T W h − b^T v − c^T h = −Σ_{j,k} W_{jk} v_j h_k − Σ_j b_j v_j − Σ_k c_k h_k
– The second (negative phase) term is intractable due to the exponential number of joint configurations of the visible and hidden units.
∂/∂θ log p(v^(t)) = ∂/∂θ log Σ_h exp( v^(t)T W h + b^T v^(t) + c^T h ) − ∂/∂θ log Z
The first term is the positive phase; the second term is the negative phase, with
Z = Σ_v Σ_h exp( v^T W h + b^T v + c^T h )

For θ = W, the positive phase term gives:
∂/∂W log Σ_h exp( v^(t)T W h + b^T v^(t) + c^T h )
  = [ Σ_h v^(t) h^T exp( v^(t)T W h + b^T v^(t) + c^T h ) ] / [ Σ_h exp( v^(t)T W h + b^T v^(t) + c^T h ) ]
  = E_{h ~ p(h|v^(t))}[ v^(t) h^T ]
Maximizing with respect to the parameters θ = {W, b, c} gives:
∂/∂W_{jk} log p(v^(t)) = E[ v_j h_k | v = v^(t) ] − E[ v_j h_k ]
∂/∂c_k log p(v^(t)) = E[ h_k | v = v^(t) ] − E[ h_k ]
∂/∂b_j log p(v^(t)) = E[ v_j | v = v^(t) ] − E[ v_j ]
∂/∂W_{jk} log p(v^(t)) = E[ v_j h_k | v = v^(t) ] − E[ v_j h_k ]
E[ v_j h_k | v = v^(t) ] = E[ h_k | v = v^(t) ] v_j^(t) = v_j^(t) p(h_k = 1 | v^(t)) = v_j^(t) / ( 1 + exp( −Σ_j W_{jk} v_j^(t) − c_k ) )
(Running a Gibbs sampler over time can be used to get an estimate of the gradients.)
Positive statistic: the data-dependent expectation
E_{v,h}[ v h^T ] = Σ_{v,h} p(h | v) P_data(v) v h^T,
where the average over v is taken over the training examples.
Getting an unbiased sample of the second (model-dependent) term is very difficult. It can be done by starting at a random state of the visible units and performing block Gibbs MCMC sampling for a very long time.
Block Gibbs sampling:
  Initialize v^0 = v
  Sample h^0 from P(h | v^0)
  For t = 1, …, T:
    Sample v^t from P(v | h^{t−1})
    Sample h^t from P(h | v^t)
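A minimal numpy sketch of this block Gibbs sampler (illustrative only; it assumes the parameter shapes from the energy sketch earlier and reuses numpy as np):

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_sample(v0, W, b, c, T, rng):
    # Run T steps of block Gibbs sampling starting from the visible vector v0.
    # W: (num_visible, num_hidden), b: visible biases, c: hidden biases.
    v = v0.copy()
    h = (rng.random(W.shape[1]) < sigmoid(W.T @ v + c)).astype(float)    # h^0 ~ P(h | v^0)
    for _ in range(T):
        v = (rng.random(W.shape[0]) < sigmoid(W @ h + b)).astype(float)   # v^t ~ P(v | h^{t-1})
        h = (rng.random(W.shape[1]) < sigmoid(W.T @ v + c)).astype(float)  # h^t ~ P(h | v^t)
    return v, h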
E[ v_j h_k ] ≈ (1/N) Σ_{n=1}^{N} v_j^(n) h_k^(n),   with (v^(n), h^(n)) ~ p(v, h)
Repeat until convergence:
E[ v_j h_k ] ≈ (1/T) Σ_{t=1}^{T} v_j^(t,K) h_k^(t,K)
where v^(t,0) = v^(t),
  h^(t,l) ~ p(h | v = v^(t,l)) for l ≥ 0,
  v^(t,l) ~ p(v | h = h^(t,l−1)) for l ≥ 1.
Recall the conditional distributions:
p(v | h) = Π_j p(v_j | h)
p(h | v) = Π_k p(h_k | v)
p(v_j = 1 | h) = σ( W_{j·} h + b_j )
p(h_k = 1 | v) = σ( W_{·k}^T v + c_k )
Contrastive divergence (CD-k) will be used for training:
– For each training sample v^(t), generate a negative sample ṽ using k steps of Gibbs sampling starting at the point v^(t).
– Update the parameters using the approximate gradient ∂ log p(v^(t)) / ∂W:
  W ← W + α ( v^(t) ĥ(v^(t))^T − ṽ ĥ(ṽ)^T ),
  where ĥ(v) is the vector with elements ĥ(v)_k = p(h_k = 1 | v).
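Putting the update together, a hedged numpy sketch of one CD-k step for a single example (the helper names, and reusing sigmoid from the sampler sketch above, are assumptions):

def cd_k_update(v_data, W, b, c, k, alpha, rng):
    # Positive phase: hidden probabilities conditioned on the training example
    h_data = sigmoid(W.T @ v_data + c)
    # Negative phase: k steps of Gibbs sampling starting from the data
    v_model = v_data.copy()
    for _ in range(k):
        h_sample = (rng.random(W.shape[1]) < sigmoid(W.T @ v_model + c)).astype(float)
        v_model = (rng.random(W.shape[0]) < sigmoid(W @ h_sample + b)).astype(float)
    h_model = sigmoid(W.T @ v_model + c)
    # Parameter updates: positive statistic minus negative statistic
    W += alpha * (np.outer(v_data, h_data) - np.outer(v_model, h_model))
    b += alpha * (v_data - v_model)
    c += alpha * (h_data - h_model)
    return v_model  # the "reconstruction", useful for monitoring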
Since convergence to the equilibrium distribution takes time, good initialization can speed things up dramatically. Contrastive divergence uses a training example to initialize the visible units, then runs Gibbs sampling for only a few iterations (even k = 1), not to "equilibrium." This gives acceptable estimates of the expected values in the gradient update formula.
Larochelle et al., JMLR 2009
Ways to monitor training (besides gradient checks):
– we plot the average stochastic reconstruction error ‖v^(t) − ṽ‖ and see if it tends to decrease
– for inputs that correspond to images, we visualize the connections coming into each hidden unit as if they formed an image; this gives an idea of the type of visual feature each hidden unit detects
– we can also try to approximate the partition function Z and see whether the (approximated) NLL decreases
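For instance, the first diagnostic could be tracked like the following illustrative snippet, which assumes the cd_k_update sketch above and a batch of binary visible vectors (note that each call also performs one CD-1 update):

recon_errors = []
for v_t in batch:   # batch: iterable of binary visible vectors (assumed defined)
    v_recon = cd_k_update(v_t, W, b, c, k=1, alpha=0.01, rng=rng)   # CD-1 step + reconstruction
    recon_errors.append(np.linalg.norm(v_t - v_recon))
print("mean reconstruction error:", np.mean(recon_errors))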
Salakhutdinov, Murray, ICML 2008.
For real-valued visible units, add a quadratic term to the energy function:
E(v, h) = ½ v^T v − v^T W h − b^T v − c^T h
p(v | h) then becomes a Gaussian with mean b + W h and identity covariance matrix.
– subtracting the mean of each input
– dividing each input by the training set standard deviation
Greedy layer-wise training of deep architectures (2007)
– first layer: find hidden unit features that are more common in training inputs than in random inputs
– second layer: find combinations of hidden unit features that are more common than random hidden unit features
– third layer: find combinations of combinations of ...
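Schematically, the stacking could look like the following numpy sketch; train_rbm is a hypothetical helper that returns (W, b, c) for one layer, e.g. by looping the CD-k update above:

def greedy_pretrain(data, layer_sizes, train_rbm):
    # data: (num_examples, num_visible) binary matrix
    # layer_sizes: hidden-layer sizes, e.g. [500, 500, 2000]
    weights = []
    layer_input = data
    for num_hidden in layer_sizes:
        W, b, c = train_rbm(layer_input, num_hidden)   # train this layer's RBM
        weights.append((W, b, c))
        # the next RBM is trained on this layer's hidden activations
        layer_input = sigmoid(layer_input @ W + c)
    return weights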
reach better parameters
Tutorial on Deep Learning and Applications, Honglak Lee, NIPS 2010
Stacking RBMs: approximate posteriors q(h^(1) | v), then q(h^(1) | v) q(h^(2) | h^(1)).
Tutorial on Deep Learning and Applications, Honglak Lee, NIPS 2010
Important in the history of deep learning
The global fine-tuning uses backpropagation. Initially encoder and decoder networks use the same weights.
– add an output layer
– train the whole network using supervised learning
– forward propagation, backpropagation and update
– we call this last phase fine-tuning
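A compact numpy sketch of one such fine-tuning step for a one-hidden-layer network whose hidden weights W1, c1 come from a pre-trained RBM; the output layer V, d and the data names are illustrative assumptions:

def finetune_step(X, Y, W1, c1, V, d, lr):
    # X: (N, num_visible) inputs, Y: (N, num_classes) one-hot targets
    # Forward propagation
    H = sigmoid(X @ W1 + c1)                        # hidden activations
    logits = H @ V + d
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    P = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # softmax
    # Backpropagation of the cross-entropy loss
    dlogits = (P - Y) / X.shape[0]
    dV = H.T @ dlogits
    dd = dlogits.sum(axis=0)
    dH = dlogits @ V.T
    dW1 = X.T @ (dH * H * (1 - H))
    dc1 = (dH * H * (1 - H)).sum(axis=0)
    # Update
    V -= lr * dV; d -= lr * dd
    W1 -= lr * dW1; c1 -= lr * dc1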
– Use the DBN to initialize a multi-layer neural network.
– Maximize the conditional distribution:
p(y | x) for the specific task, on the training set.
– Hinton, Teh and Osindero suggested this procedure with RBMs:
– To recognize shapes, first learn to generate images (Hinton, 2006).
– Bengio, Lamblin, Popovici and Larochelle (stacked autoencoders) – Ranzato, Poultney, Chopra and LeCun (stacked sparse coding models)
Vincent et al., Extracting and Composing Robust Features with Denoising Autoencoders, 2008.
Hinton et al., Neural Computation, 2006.
– we obtain the greedy layer-wise pre-training procedure for neural networks
– in theory, if our approximation q(h^(1)|v) is very far from the true posterior, the bound might be very loose
– this only means we might not be improving the true likelihood
– we might still be extracting better features!
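For reference, the bound in question is the standard variational lower bound used to justify greedy stacking (written here in this deck's notation, not copied from the slides):

log p(v) ≥ Σ_{h^(1)} q(h^(1) | v) [ log p(v | h^(1)) + log p(h^(1)) ] + H( q(h^(1) | v) )

Training the second RBM as a model of samples h^(1) ~ q(h^(1) | v) improves the log p(h^(1)) term, and therefore the bound, even when q(h^(1) | v) is not the true posterior.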
– A fast learning algorithm for deep belief nets (Hinton, Teh, Osindero, 2006).
Lee et al., ICML 2009
– Projecting second-layer activations back into the image space.
Lee et al., Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations, ICML 2009.
Tutorial on Deep Learning and Applications, Honglak Lee, NIPS 2010
You can also run a DBN generator unsupervised:
https://www.youtube.com/watch?v=RSF5PbwKU3I
There are limitations on the types of structure that can be represented efficiently by a single layer of hidden variables. The DBM uses a similar idea, but with more layers; training becomes more complicated.
Deep Boltzmann Machine (DBM)
RBM vs. DBM
Conditional distributions remain factorized due to layering.
All connections are undirected; each unit receives both bottom-up and top-down input:
Unlike many existing feed-forward models: ConvNets (LeCun), HMAX (Poggio et al.), Deep Belief Nets (Hinton et al.).
Conditional distributions:
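For a DBM with two hidden layers (v, h^(1), h^(2)), the factorized conditionals take the standard form (filled in here because the original formulas did not survive extraction):

p(h_k^(1) = 1 | v, h^(2)) = σ( Σ_j W_{jk}^(1) v_j + Σ_m W_{km}^(2) h_m^(2) + c_k^(1) )
p(h_m^(2) = 1 | h^(1)) = σ( Σ_k W_{km}^(2) h_k^(1) + c_m^(2) )
p(v_j = 1 | h^(1)) = σ( Σ_k W_{jk}^(1) h_k^(1) + b_j )

Each layer is conditionally independent given its neighboring layers, combining bottom-up and top-down input.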
Typically trained with many examples.
DBMs have the potential of learning internal representations that become increasingly complex at higher layers.
– 60,000 training and 10,000 test examples
– 0.9 million parameters
– Gibbs sampler run for 100,000 steps
Tutorial on Deep Learning and Applications, Honglak Lee, NIPS 2010
Running a generator open-loop: https://www.youtube.com/watch?v=-l1QTbgLTyQ