12. Unsupervised Deep Learning
CS 535 Deep Learning, Winter 2018 Fuxin Li
With materials from Wanli Ouyang, Zsolt Kira, Lawrence Neal, Raymond Yeh, Junting Lou and Teck-Yian Lim
Autoencoders

[Figure: a three-layer network; inputs x1…x6 plus a bias unit (+1) in Layer 1, hidden units a1, a2, a3 plus a bias unit in Layer 2, and reconstructed outputs in Layer 3.]

An autoencoder is a network trained to reproduce its input (i.e., to learn the identity function). This has a trivial solution unless we either:
- use fewer units in Layer 2 than in the input (learn a compressed representation), or
- constrain the activations in Layer 2 to be sparse.
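As a concrete illustration, here is a minimal bottleneck autoencoder sketch in PyTorch (the layer sizes, optimizer, and dummy data are illustrative assumptions, not from the slides):

```python
# A minimal sketch: a bottleneck autoencoder trained to reproduce its input.
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, n_in=6, n_hidden=3):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_in, n_hidden), nn.Sigmoid())
        self.decoder = nn.Sequential(nn.Linear(n_hidden, n_in), nn.Sigmoid())

    def forward(self, x):
        a = self.encoder(x)      # compressed representation (Layer 2)
        return self.decoder(a)   # reconstruction of the input (Layer 3)

model = Autoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(32, 6)            # dummy batch of 6-D inputs
for _ in range(100):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), x)   # reconstruction error
    loss.backward()
    opt.step()
```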
Sparse Autoencoders (SAE)
Training Deep Sparse Autoencoders
Note: when training the next layer, we are reconstructing the previous layer's activations a at this point, not the input x.
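A sketch of this greedy layer-wise scheme (the hyperparameters, the L1 form of the sparsity penalty, and the dummy data are assumptions for illustration):

```python
# A minimal sketch of greedy layer-wise training of a stacked sparse
# autoencoder: each new layer reconstructs the *activations* a of the
# layer below, not the raw input x.
import torch
import torch.nn as nn
import torch.nn.functional as F

def train_sparse_layer(data, n_hidden, sparsity_weight=1e-3, steps=200):
    n_in = data.shape[1]
    enc = nn.Linear(n_in, n_hidden)
    dec = nn.Linear(n_hidden, n_in)
    opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)
    for _ in range(steps):
        opt.zero_grad()
        a = torch.sigmoid(enc(data))
        recon = dec(a)
        # reconstruction error + L1 penalty pushing hidden activations to be sparse
        loss = F.mse_loss(recon, data) + sparsity_weight * a.abs().mean()
        loss.backward()
        opt.step()
    return enc

x = torch.rand(256, 64)                 # dummy input data
enc1 = train_sparse_layer(x, 32)        # layer 1: reconstructs x
a1 = torch.sigmoid(enc1(x)).detach()
enc2 = train_sparse_layer(a1, 16)       # layer 2: reconstructs a1, not x
```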
[Figure: reconstructions comparing the real data, a 30-D deep autoencoder, and 30-D PCA.]
[Figure: deep autoencoder for image retrieval; a 32x32 color image (three 1024-pixel channels) is encoded through layers of 8192, 4096, 2048, 1024, and 512 units down to a 256-bit binary code.]
The encoder has about 67,000,000 parameters. It takes a few days on a GTX 285 GPU to train on two million images.
Reconstructions of 32x32 color images from 256-bit codes
[Figure: for each query image, neighbors retrieved using 256-bit codes vs. neighbors retrieved using Euclidean distance in pixel intensity space.]
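To make the retrieval mechanics concrete, a small sketch (with random stand-in codes; a real system would binarize the 256-unit code layer of the trained encoder) of ranking by Hamming distance:

```python
# A minimal sketch of retrieval with short binary codes: threshold the
# code-layer activations to bits, then rank database items by Hamming
# distance to the query code.
import numpy as np

rng = np.random.default_rng(0)
codes = rng.random((2000, 256)) > 0.5   # stand-in for binarized encoder outputs
query = rng.random(256) > 0.5

# Hamming distance = number of differing bits.
hamming = np.count_nonzero(codes != query, axis=1)
nearest = np.argsort(hamming)[:10]      # indices of the 10 closest images
print(nearest, hamming[nearest])
```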
Variational Autoencoders (VAE)

A VAE pairs one encoder with one decoder that generates from a unit Gaussian prior z ~ N(0, I). For the encoder distribution there is an inverse distribution represented by the decoder:

z | x ~ q(z | x),    x̂ | z ~ p(x | z)

Training the two jointly pushes the composition of encoder and decoder toward the identity on the data.
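A minimal VAE sketch in PyTorch, assuming 784-dimensional inputs and a 20-dimensional code (all sizes illustrative); the reparameterization trick makes sampling from q(z|x) differentiable:

```python
# A minimal VAE sketch: the encoder outputs the mean and log-variance of
# q(z|x), z is sampled with the reparameterization trick, and the decoder
# models p(x|z).
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, n_in=784, n_z=20):
        super().__init__()
        self.enc = nn.Linear(n_in, 400)
        self.mu, self.logvar = nn.Linear(400, n_z), nn.Linear(400, n_z)
        self.dec = nn.Sequential(nn.Linear(n_z, 400), nn.ReLU(),
                                 nn.Linear(400, n_in), nn.Sigmoid())

    def forward(self, x):
        h = F.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterize
        return self.dec(z), mu, logvar

def loss_fn(x, recon, mu, logvar):
    # reconstruction term + KL(q(z|x) || N(0, I))
    bce = F.binary_cross_entropy(recon, x, reduction='sum')
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return bce + kl

model = VAE()
x = torch.rand(16, 784)                 # dummy batch in [0, 1]
recon, mu, logvar = model(x)
loss = loss_fn(x, recon, mu, logvar)
loss.backward()
```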
Desiderata for generative models:
- Realistic generation
- Learning with conditional data
- Learning to encode
With a GAN we solve a two-player minimax game between a generator G and a discriminator D:

min_G max_D  E_x[log D(x)] + E_z[log(1 − D(G(z)))]

Evaluation caveats: the Inception score is computed over entire batches, not individual images, and a generator that simply memorized the training set would get a perfect Inception score. Generated samples should therefore also be compared with training set images.
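For concreteness, a sketch of one alternating training step for this objective (network shapes and hyperparameters are assumptions, and the generator uses the common non-saturating heuristic rather than the literal minimax loss):

```python
# A minimal sketch of one GAN training step: update D on real vs. fake,
# then update G to fool D.
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2),
                  nn.Linear(256, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

real = torch.rand(32, 784)                  # dummy batch of real images
z = torch.randn(32, 64)                     # latent noise
ones, zeros = torch.ones(32, 1), torch.zeros(32, 1)

# Discriminator step: push D(real) -> 1 and D(G(z)) -> 0.
opt_d.zero_grad()
d_loss = bce(D(real), ones) + bce(D(G(z).detach()), zeros)
d_loss.backward()
opt_d.step()

# Generator step: push D(G(z)) -> 1 (non-saturating heuristic).
opt_g.zero_grad()
g_loss = bce(D(G(z)), ones)
g_loss.backward()
opt_g.step()
```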
Boltzmann machine: a network of binary units x_i ∈ {0, 1} with symmetric weights θ = {w_ij}. The energy of a configuration and the probability assigned to it are

E(x; θ) = −Σ_{i<j} w_ij x_i x_j

P(x; θ) = f(x; θ) / Z(θ),  where f(x; θ) = e^{−E(x; θ)} and Z(θ) = Σ_x e^{−E(x; θ)}

Given training samples {x^m}, learning maximizes the log-likelihood Σ_m log P(x^m; θ).
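The partition function Z(θ) sums over all 2^n configurations, which is only computable for toy sizes; a brute-force sketch (random weights, assumed sizes) that shows why learning needs sampling:

```python
# A minimal sketch: compute the Boltzmann machine distribution exactly by
# enumerating all 2^n states. Feasible only for tiny n.
import itertools
import numpy as np

rng = np.random.default_rng(0)
n = 5
w = np.triu(rng.normal(size=(n, n)), 1)   # weights w_ij for i < j

def energy(x):
    return -x @ w @ x                     # E(x) = -sum_{i<j} w_ij x_i x_j

states = np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)
f = np.exp([-energy(x) for x in states])  # f(x) = e^{-E(x)}
Z = f.sum()                               # partition function
P = f / Z                                 # P(x) = e^{-E(x)} / Z
print(P.sum())                            # sanity check: 1.0
```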
Restricted Boltzmann Machines (RBM): we restrict the connectivity to make inference and learning easier.
- Only one layer of hidden units.
- No connections between hidden units.
In an RBM it only takes one step to reach thermal equilibrium when the visible units are clamped.
Because the hidden units are conditionally independent given the visible units, we can compute the exact value of:

p(h_j = 1 | v) = 1 / (1 + exp(−b_j − Σ_{i ∈ vis} v_i w_ij))

[Figure: bipartite RBM graph; a layer of hidden units j connected to a layer of visible units i, with no within-layer connections.]
Contrastive divergence (CD-1):

[Figure: alternating Gibbs updates; the data is clamped at t = 0 and the reconstruction appears at t = 1.]

Δw_ij = ε ( ⟨v_i h_j⟩^0 − ⟨v_i h_j⟩^1 )

Start with a training vector on the visible units. Update all the hidden units in parallel. Update all the visible units in parallel to get a "reconstruction". Update the hidden units again. ⟨v_i h_j⟩^0 is measured on the data and ⟨v_i h_j⟩^1 on the reconstruction. This is not following the gradient of the log likelihood, but it works well.
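A sketch of one CD-1 update in NumPy following this recipe (toy sizes and learning rate are assumptions):

```python
# A minimal sketch of one CD-1 update for an RBM.
import numpy as np

rng = np.random.default_rng(0)
n_v, n_h, eps = 6, 3, 0.1
W = 0.01 * rng.normal(size=(n_v, n_h))
b_v, b_h = np.zeros(n_v), np.zeros(n_h)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

v0 = (rng.random(n_v) > 0.5).astype(float)   # a training vector

# t = 0: sample hidden units from p(h_j = 1 | v)
p_h0 = sigmoid(b_h + v0 @ W)
h0 = (rng.random(n_h) < p_h0).astype(float)

# reconstruction: sample visibles, then hidden probabilities again (t = 1)
p_v1 = sigmoid(b_v + h0 @ W.T)
v1 = (rng.random(n_v) < p_v1).astype(float)
p_h1 = sigmoid(b_h + v1 @ W)

# CD-1 update: eps * (<v_i h_j>^0 - <v_i h_j>^1)
W += eps * (np.outer(v0, p_h0) - np.outer(v1, p_h1))
b_v += eps * (v0 - v1)
b_h += eps * (p_h0 - p_h1)
```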
Deep autoencoders are a nice way to do non-linear dimensionality reduction, but they used to be very difficult to optimize using backpropagation alone. We now have a much better way to optimize them: first pre-train a stack of RBMs, then unroll the stack into an encoder and decoder and fine-tune with backpropagation.

[Figure: a deep autoencoder for 28x28 images. Encoder: 28x28 input → 1000 neurons → 500 neurons → 250 neurons → 30 linear units, with weights W1, W2, W3, W4. Decoder: the transposed weights W4^T, W3^T, W2^T, W1^T map back through 250, 500, and 1000 neurons to a 28x28 reconstruction.]
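A sketch of the unrolling step, using random stand-ins for the pre-trained RBM weights and tying each decoder layer to the transpose of the corresponding encoder layer:

```python
# A minimal sketch: "unroll" stacked weight matrices into a deep autoencoder
# whose decoder uses the transposed encoder weights, ready for fine-tuning.
import torch
import torch.nn as nn

sizes = [784, 1000, 500, 250, 30]              # 28x28 input -> 30-D linear code
Ws = [nn.Parameter(0.01 * torch.randn(a, b))   # stand-ins for RBM weights
      for a, b in zip(sizes[:-1], sizes[1:])]

def encode(x):
    for i, W in enumerate(Ws):
        x = x @ W
        if i < len(Ws) - 1:
            x = torch.sigmoid(x)               # the code layer stays linear
    return x

def decode(code):
    h = code
    for W in reversed(Ws):
        h = torch.sigmoid(h @ W.t())           # tied (transposed) weights
    return h

x = torch.rand(8, 784)
loss = nn.functional.mse_loss(decode(encode(x)), x)
loss.backward()                                # fine-tune all Ws with backprop
```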
Belief nets: a belief net is a directed acyclic graph composed of random (stochastic) variables.

[Figure: random hidden causes with directed edges down to visible effects.]
[Figure: a deep belief net with visible layer v and hidden layers h1, h2, h3.]

P(v, h1, h2, h3) = p(v|h1) p(h1|h2) p(h2, h3)

The layers learn a feature hierarchy: pixels => edges => local shapes => object parts.
Learning: adjust the weights among the variables to make the network more likely to generate the observed data. This requires inference over the unobserved variables.

[Figure: the same deep belief net v, h1, h2, h3.]

P(v, h1, h2, h3) = p(v|h1) p(h1|h2) p(h2, h3)
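Generating from this factorization is ancestral sampling: Gibbs sample in the top-level RBM p(h2, h3), then sample downward. A toy sketch with random stand-in parameters:

```python
# A minimal sketch of sampling v from P(v,h1,h2,h3) = p(v|h1) p(h1|h2) p(h2,h3).
import numpy as np

rng = np.random.default_rng(0)
sizes = {'v': 8, 'h1': 6, 'h2': 4, 'h3': 4}
W_top = 0.5 * rng.normal(size=(sizes['h2'], sizes['h3']))  # top RBM p(h2, h3)
W2 = 0.5 * rng.normal(size=(sizes['h2'], sizes['h1']))     # h2 -> h1
W1 = 0.5 * rng.normal(size=(sizes['h1'], sizes['v']))      # h1 -> v

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample(p):
    return (rng.random(p.shape) < p).astype(float)

# Sample (h2, h3) from the top-level RBM by alternating Gibbs updates.
h2 = sample(0.5 * np.ones(sizes['h2']))
for _ in range(50):
    h3 = sample(sigmoid(h2 @ W_top))
    h2 = sample(sigmoid(h3 @ W_top.T))

# Directed ancestral passes: h2 -> h1 -> v.
h1 = sample(sigmoid(h2 @ W2))
v = sample(sigmoid(h1 @ W1))
print(v)
```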
Explaining away:

[Figure: left, a common cause C with children A and B; right, two hidden causes h11 and h12 with a common child x1.]

Given a common cause, the children are conditionally independent: P(A, B | C) = P(A | C) P(B | C). But given a common effect, the hidden causes become dependent: P(h11, h12 | x1) ≠ P(h11 | x1) P(h12 | x1). (An example from the manuscript.) Solution: complementary prior.
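A tiny brute-force demonstration of explaining away (the prior and the noisy-OR-style likelihood are made up for illustration): the joint posterior of the two causes differs from the product of their marginals once x1 is observed.

```python
# A minimal sketch: two independent binary causes h11, h12 of x1 become
# dependent once x1 is observed, shown by brute-force Bayes.
import itertools

p_h = 0.5                                   # prior p(h11=1) = p(h12=1) = 0.5
def p_x1_given(h11, h12):                   # made-up noisy-OR style likelihood
    return 1.0 - 0.9 ** (h11 + h12) * 0.99

# Joint posterior P(h11, h12 | x1 = 1) by enumeration.
joint = {}
for h11, h12 in itertools.product([0, 1], repeat=2):
    joint[(h11, h12)] = p_h * p_h * p_x1_given(h11, h12)
Z = sum(joint.values())
post = {k: v / Z for k, v in joint.items()}

m11 = post[(1, 0)] + post[(1, 1)]           # marginal P(h11=1 | x1=1)
m12 = post[(0, 1)] + post[(1, 1)]           # marginal P(h12=1 | x1=1)
print(post[(1, 1)], m11 * m12)              # unequal: the causes are dependent
```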
[Figure: a deep belief net with input x and hidden layers h1, h2, h3, h4, with layer sizes 2000, 1000, 500, 30.]

Inference problem (the problem from the manuscript): explaining away makes posterior inference intractable, and it gets worse as the network gets deeper. Solution: complementary prior.
Greedy layer-wise pre-training and fine tuning
[Figure: greedy stacking: train an RBM on x to get h1, then an RBM on h1 to get h2, then an RBM on h2 to get h3.]

P(hi = 1 | x) = σ(ci + Wi · x)
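A sketch of greedy stacking with a toy CD-1 inner loop (sizes, learning rate, and epochs are assumptions): each new RBM is trained on samples from the previous layer's p(h | x).

```python
# A minimal sketch of greedy layer-wise RBM stacking.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_rbm(data, n_h, eps=0.05, epochs=20):
    n_v = data.shape[1]
    W, b, c = 0.01 * rng.normal(size=(n_v, n_h)), np.zeros(n_v), np.zeros(n_h)
    for _ in range(epochs):
        for v0 in data:                        # one CD-1 step per vector
            p_h0 = sigmoid(c + v0 @ W)
            h0 = (rng.random(n_h) < p_h0).astype(float)
            v1 = (rng.random(n_v) < sigmoid(b + h0 @ W.T)).astype(float)
            p_h1 = sigmoid(c + v1 @ W)
            W += eps * (np.outer(v0, p_h0) - np.outer(v1, p_h1))
    return W, c

x = (rng.random((100, 20)) > 0.5).astype(float)        # dummy binary data
W1, c1 = train_rbm(x, 12)                              # first RBM: x -> h1
p_h1 = sigmoid(c1 + x @ W1)                            # p(h1 = 1 | x)
h1 = (rng.random(p_h1.shape) < p_h1).astype(float)
W2, c2 = train_rbm(h1, 8)                              # second RBM on h1 samples
```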
When we train the next RBM in the stack, we are optimizing the P(h1) term in the bound (1):

log P(x) ≥ Σ_{h1} { Q(h1|x) [ log P(h1) + log P(x|h1) ] − Q(h1|x) log Q(h1|x) }    (1)

[Figure: the stacked RBMs x → h1 → h2 → h3.]
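Bound (1) is the standard variational lower bound; for reference, a short derivation via Jensen's inequality:

```latex
% For any distribution Q(h_1 | x):
\begin{aligned}
\log P(x) &= \log \sum_{h_1} P(h_1)\, P(x \mid h_1) \\
          &= \log \sum_{h_1} Q(h_1 \mid x)\,
             \frac{P(h_1)\, P(x \mid h_1)}{Q(h_1 \mid x)} \\
          &\ge \sum_{h_1} Q(h_1 \mid x)
             \log \frac{P(h_1)\, P(x \mid h_1)}{Q(h_1 \mid x)}
             \quad \text{(Jensen's inequality)} \\
          &= \sum_{h_1} \Big\{ Q(h_1 \mid x)\big[\log P(h_1) + \log P(x \mid h_1)\big]
             - Q(h_1 \mid x) \log Q(h_1 \mid x) \Big\}.
\end{aligned}
```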
A deep enough sigmoid belief network can model any distribution over binary vectors [1], but the top layer should be big.
Copied from http://videolectures.net/mlss09uk_hinton_dbn/
[1] Sutskever, I. and Hinton, G. E., Deep Narrow Sigmoid Belief Networks are Universal Approximators. Neural Computation, 2007
Erhan et al., AISTATS 2009
[Figure: test-error curves comparing networks trained with pre-training and without pre-training (Erhan et al.).]
[Figure: two causal diagrams. Left: image → label. Right: stuff → image over a high-bandwidth pathway and stuff → label over a low-bandwidth pathway.]

If image-label pairs were generated the first way (the label computed directly from the image, for example "do the pixels have even parity?"), it would make sense to try to go straight from images to labels. If image-label pairs are generated the second way, it makes sense to first learn to recover the stuff that caused the image by inverting the high-bandwidth pathway.
Fine-tuning can then be done with a contrastive version of the wake-sleep algorithm.