Deep Unsupervised Learning
Russ Salakhutdinov
Machine Learning Department Carnegie Mellon University Canadian Institute of Advanced Research
1
Mining for Structure
Massive increase in both computational power and the amount of data available from the web, video cameras, and laboratory measurements.
Goal: discover underlying structure and statistical correlations from data in an unsupervised or semi-supervised way.
Application domains: Images & Video, Relational Data / Social Networks, Speech & Audio, Text & Language, Product Recommendation, Gene Expression, fMRI, Tumor region detection.
2
4
Personal assistants, self-driving cars, etc.
5
6
[Figure: 2-D codes learned for newswire stories from bag-of-words input, with topic clusters labeled Legal/Judicial, Leading Economic Indicators, European Community Monetary/Economic, Accounts/Earnings, Interbank Markets, Government Borrowings, Disasters and Accidents, Energy Markets.]
(Hinton & Salakhutdinov, Science 2006)
7
Unsupervised Learning
Non-probabilistic Models
Ø Sparse Coding
Ø Autoencoders
Ø Others (e.g. k-means)
Probabilistic (Generative) Models
Explicit Density p(x)
Ø Tractable Models: Fully observed Belief Nets, NADE, PixelRNN
Ø Non-Tractable Models: Boltzmann Machines, Variational Autoencoders, Helmholtz Machines, many others…
Implicit Density
Ø Generative Adversarial Networks
Ø Moment Matching Networks
8
Ø Sparse Coding
Ø Autoencoders
Ø Restricted Boltzmann Machines
Ø Deep Boltzmann Machines
Ø Helmholtz Machines / Variational Autoencoders
9
10
Slide Credit: Honglak Lee
11
12
13
Evaluating sparse coding features for object classification (9K images, 101 classes). Pipeline: Input Image → Learned bases → Features (coefficients) → Classification Algorithm (SVM).
Algorithm                          Accuracy
Baseline (Fei-Fei et al., 2004)    16%
PCA                                37%
Sparse Coding                      47%
Lee, Battle, Raina, Ng, 2006
Slide Credit: Honglak Lee
14
Interpreting sparse coding: the input x is mapped to sparse features a by an implicit nonlinear encoding a = f(x) (the result of solving the sparse inference problem), and reconstructed as x' by an explicit linear decoding x' = g(a).
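To make this concrete, here is a minimal NumPy sketch (not the algorithm from the slides) of sparse coding inference with ISTA: given a fixed dictionary D, it finds sparse coefficients a that reconstruct x, i.e. it plays the role of the implicit nonlinear encoding f(x), while decoding is the explicit linear map g(a) = Da. The dictionary, step count, and sparsity penalty are illustrative placeholders.

import numpy as np

def soft_threshold(z, t):
    # Elementwise shrinkage operator used by ISTA.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def sparse_encode_ista(x, D, lam=0.1, n_steps=100):
    """Minimize 0.5*||x - D a||^2 + lam*||a||_1 over the code a (ISTA)."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the smooth term's gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_steps):
        grad = D.T @ (D @ a - x)           # gradient of the reconstruction term
        a = soft_threshold(a - grad / L, lam / L)
    return a

rng = np.random.default_rng(0)
D = rng.normal(size=(64, 256))             # illustrative dictionary: 64-d patches, 256 atoms
D /= np.linalg.norm(D, axis=0)             # unit-norm columns
x = rng.normal(size=64)                    # a "patch" to encode
a = sparse_encode_ista(x, D)               # implicit nonlinear encoding f(x)
x_hat = D @ a                              # explicit linear decoding g(a)
print("nonzero coefficients:", np.count_nonzero(a))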
15
16
17
Autoencoder: the input image x is mapped to binary features z = σ(Wx) by the encoder, and the decoder reconstructs the image as Dz.
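A minimal PyTorch sketch of this kind of autoencoder (sigmoid encoder z = σ(Wx), linear decoder Dz, squared-error reconstruction loss); the layer sizes, optimizer, and random data are placeholders, not settings from the slides.

import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, n_in=784, n_hid=256):
        super().__init__()
        self.W = nn.Linear(n_in, n_hid)    # encoder weights W
        self.D = nn.Linear(n_hid, n_in)    # decoder weights D
    def forward(self, x):
        z = torch.sigmoid(self.W(x))       # (approximately binary) features z = sigma(Wx)
        return self.D(z), z                # linear reconstruction Dz

model = Autoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(128, 784)                   # stand-in for a batch of image vectors
for _ in range(10):
    x_hat, z = model(x)
    loss = ((x_hat - x) ** 2).mean()       # squared reconstruction error
    opt.zero_grad(); loss.backward(); opt.step()
print(loss.item())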
18
If the hidden and output layers are linear, the autoencoder will learn hidden units that are a linear function of the data and minimize the squared reconstruction error: the k hidden units will span the same space as the first k principal components, although the learned weight vectors may not be orthogonal. Here the input image x is mapped to linear features z = Wx and reconstructed by a linear decoder.
20
21
At training time
Kavukcuoglu, Ranzato, Fergus, LeCun, 2009
22
[Figure: stacked autoencoders — Input x → Features → Features → Class Labels; features are learned layer by layer from the input, with a classifier predicting class labels on top.]
24
[Figure: training a deep autoencoder (architecture 2000-1000-500-30). Pretraining: a stack of RBMs is learned one layer at a time. Unrolling: the stacked RBMs are unrolled into an encoder and a mirror-image decoder with tied weights (W and their transposes), with a 30-dimensional code layer in the middle. Finetuning: the whole encoder-decoder network is then fine-tuned with backpropagation.]
25
A deep autoencoder (2000-1000-500-30) was used to extract 30-D real-valued codes for Olivetti face patches.
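A sketch of the unrolled architecture in PyTorch with the layer sizes mentioned above. For simplicity it is trained end-to-end with backprop on random data, rather than with the layer-wise RBM pretraining described in the figure, so it illustrates the encoder-decoder shape rather than reproducing the experiment.

import torch
import torch.nn as nn

def mlp(sizes):
    # Stack of fully connected layers with sigmoid nonlinearities, linear output.
    layers = []
    for i in range(len(sizes) - 1):
        layers.append(nn.Linear(sizes[i], sizes[i + 1]))
        if i < len(sizes) - 2:
            layers.append(nn.Sigmoid())
    return nn.Sequential(*layers)

encoder = mlp([625, 2000, 1000, 500, 30])       # 30-D real-valued code layer
decoder = mlp([30, 500, 1000, 2000, 625])       # mirror-image decoder
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

x = torch.rand(64, 625)                         # stand-in for 25x25 face patches
for _ in range(5):
    code = encoder(x)
    x_hat = decoder(code)
    loss = ((x_hat - x) ** 2).mean()            # reconstruction error (fine-tuning objective)
    opt.zero_grad(); loss.backward(); opt.step()
print(code.shape, loss.item())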
26
The Reuters corpus of 804,414 newswire stories (randomly split into 402,207 training and 402,207 test) is used; each story is represented as a bag-of-words vector containing the counts of the 2000 most frequently used words in the training set.
[Figure: documents projected into a 2-D LSA space, compared with 2-D codes produced by the autoencoder; clusters labeled Legal/Judicial, Leading Economic Indicators, European Community Monetary/Economic, Accounts/Earnings, Interbank Markets, Government Borrowings, Disasters and Accidents, Energy Markets.]
(Hinton and Salakhutdinov, Science 2006)
27
Semantic hashing: learn a hashing function that maps each document to a point in a binary address space such that semantically similar documents are stored at nearby addresses. Similar documents can then be retrieved by examining the addresses that differ from the query address in only a few bits, with no search at all.
[Figure: semantic hashing function mapping documents into the document address space; nearby addresses hold semantically similar documents from clusters such as Accounts/Earnings, Government Borrowing, European Community Monetary/Economic, Disasters and Accidents, Energy Markets.]
(Salakhutdinov and Hinton, SIGIR 2007)
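A small NumPy sketch of the retrieval idea (not the SIGIR 2007 implementation): documents are stored in a hash table keyed by their binary code, and a query retrieves documents whose codes differ by at most one bit, with no search over the collection. The 20-bit codes here are random placeholders for codes produced by a trained autoencoder.

import numpy as np
from collections import defaultdict

n_bits = 20
rng = np.random.default_rng(0)
codes = rng.integers(0, 2, size=(100_000, n_bits))   # placeholder binary codes for 100k documents

# Build the document address space: address -> list of document ids.
table = defaultdict(list)
for doc_id, code in enumerate(codes):
    table[code.tobytes()].append(doc_id)

def retrieve(query_code, max_flips=1):
    """Return documents whose code is within Hamming distance max_flips of the query."""
    hits = list(table[query_code.tobytes()])
    if max_flips >= 1:
        for bit in range(n_bits):
            neighbor = query_code.copy()
            neighbor[bit] ^= 1                        # flip one bit -> a nearby address
            hits.extend(table[neighbor.tobytes()])
    return hits

print(len(retrieve(codes[0])))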
28
29
Ø Sparse Coding
Ø Autoencoders
Ø Restricted Boltzmann Machines
Ø Deep Boltzmann Machines
Ø Helmholtz Machines / Variational Autoencoders
30
Fully observed (autoregressive) models factorize the joint distribution with the chain rule:
p(x) = p(x_1) ∏_{i=2}^{n} p(x_i | x_1, …, x_{i−1})
Each conditional can be a complicated neural network (a minimal sketch follows the list below).
Ø NADE, RNADE (Larochelle et al., 2011)
Ø Pixel CNN (van den Oord et al., 2016)
Ø Pixel RNN (van den Oord et al., 2016)
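An illustrative NumPy sketch of the chain-rule factorization: each conditional p(x_i | x_<i) is modeled by a tiny logistic regression on the previous pixels. The random weights stand in for a trained NADE/PixelRNN-style network; this is a toy illustration of the factorization, not any of the models cited above.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n = 28 * 28                                  # number of binary "pixels"
W = rng.normal(scale=0.01, size=(n, n))      # W[i, :i] parameterizes p(x_i | x_<i)
b = np.zeros(n)

def log_likelihood(x):
    """log p(x) = sum_i log p(x_i | x_1, ..., x_{i-1}) for a binary vector x."""
    total = 0.0
    for i in range(n):
        p_i = sigmoid(W[i, :i] @ x[:i] + b[i])      # conditional Bernoulli probability
        total += np.log(p_i if x[i] == 1 else 1.0 - p_i)
    return total

x = rng.integers(0, 2, size=n)
print(log_likelihood(x))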
31
Restricted Boltzmann Machine: stochastic binary hidden variables are connected to stochastic binary visible variables (e.g. the pixels of an image). The joint distribution is defined by pair-wise visible-hidden terms and unary bias terms; RBMs are a special case of Markov random fields, Boltzmann machines, and log-linear models.
32
[Figure: observed data — a subset of 25,000 handwritten characters — and the learned weights W: a subset of 1000 features that look like “edges”. A new image is represented as a sparse combination of these learned features; the logistic function makes the model suitable for modeling binary images.]
33
Model learning: given a set of i.i.d. training examples (e.g. images over the visible variables), we want to learn the model parameters by maximizing the log-likelihood objective. The derivative of the log-likelihood involves an expectation under the model over all visible and hidden configurations, which is difficult to compute: there are exponentially many configurations.
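Because the exact gradient requires that intractable model expectation, it is usually approximated in practice. Below is a minimal NumPy sketch of one common approximation, CD-1 contrastive divergence, for a binary RBM; the sizes, learning rate, and random data are placeholders.

import numpy as np

rng = np.random.default_rng(0)
n_vis, n_hid, lr = 784, 500, 0.05
W = 0.01 * rng.normal(size=(n_vis, n_hid))
a, b = np.zeros(n_vis), np.zeros(n_hid)      # visible and hidden biases

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(v0):
    global W, a, b
    # Positive phase: hidden probabilities and samples given the data.
    ph0 = sigmoid(v0 @ W + b)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: one step of Gibbs sampling (reconstruction).
    pv1 = sigmoid(h0 @ W.T + a)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + b)
    # Approximate gradient: data-dependent statistics minus reconstruction statistics.
    W += lr * (v0.T @ ph0 - v1.T @ ph1) / v0.shape[0]
    a += lr * (v0 - v1).mean(axis=0)
    b += lr * (ph0 - ph1).mean(axis=0)

v_batch = (rng.random((64, n_vis)) < 0.5).astype(float)   # stand-in for binary images
for _ in range(10):
    cd1_update(v_batch)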
34
Learned features: ``topics''
Reuters dataset: 804,414 unlabeled newswire stories, bag-of-words representation. Most probable words for a few of the learned features:
  russian russia moscow yeltsin soviet
  clinton house president bill congress
  computer system product software develop
  trade country import world economy
  stock wall street point dow
[Figure: learned features (a subset out of 10,000) from a model trained on 4 million unlabelled images.]
35
36
Learned features: ``genre''
Netflix dataset: 480,189 users, 17,770 movies, over 100 million ratings. Multinomial visible units encode user ratings; binary hidden units encode user preferences (Salakhutdinov, Mnih, Hinton, ICML 2007). Movies most strongly associated with a few learned hidden features:
  Fahrenheit 9/11, Bowling for Columbine, The People vs. Larry Flynt, Canadian Bacon, La Dolce Vita
  Independence Day, The Day After Tomorrow, Con Air, Men in Black II, Men in Black
  Friday the 13th, The Texas Chainsaw Massacre, Children of the Corn, Child's Play, The Return of Michael Myers
  Scary Movie, Naked Gun, Hot Shots!, American Pie, Police Academy
37
Marginalizing over the hidden variables gives a Product of Experts: the joint distribution over words is a product of contributions from the learned topics. For example, the topics “government”, ”corruption” and ”mafia” can combine to give very high probability to the word “Silvio Berlusconi”. Most probable words for some learned topics:
  government authority power empire federation
  clinton house president bill congress
  bribery corruption dishonesty corrupt fraud
  mafia business gang mob insider
  stock wall street point dow …
38
[Figure: precision (%) versus recall (%) for document retrieval, comparing 50-D Replicated Softmax codes with 50-D LDA.]
39
Local vs. Distributed Representations
Ø Local methods (nearest neighbors, RBF SVMs, local density estimators): learned prototypes partition the input space into local regions, with separate parameters for each region, so the number of distinguishable regions grows only linearly with the number of parameters.
Ø Distributed methods (PCA, sparse coding, deep models): each parameter affects many regions of the input space, not just local ones, so the number of distinguishable regions can grow exponentially in the number of parameters.
[Figure: input space partitioned by binary features C1, C2, C3 into regions such as C1=1, C2=0, C3=0.]
Bengio, 2009, Foundations and Trends in Machine Learning
41
Ø Sparse Coding
Ø Autoencoders
Ø Restricted Boltzmann Machines
Ø Deep Boltzmann Machines
Ø Helmholtz Machines / Variational Autoencoders
42
Image model built from unlabeled inputs: Input: Pixels → Low-level features: edges → Higher-level features: combinations of edges. Learn simpler representations first, then compose more complex ones.
(Salakhutdinov 2008, Salakhutdinov & Hinton 2012)
44
Deep Boltzmann Machine: the input v and hidden layers h1, h2, h3 are coupled by model parameters W1, W2, W3. The bottom layer is the same as in an RBM, but inference for each hidden layer combines both bottom-up input from the layer below and top-down input from the layer above.
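A small NumPy sketch that makes the "bottom-up + top-down" point concrete: in a simplified mean-field inference loop for a three-hidden-layer model, the update for the middle layers combines input from the layer below and the layer above. The layer sizes and random weights are illustrative only, not the trained models from the slides.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_v, n_h1, n_h2, n_h3 = 784, 500, 500, 1000
W1 = 0.01 * rng.normal(size=(n_v, n_h1))
W2 = 0.01 * rng.normal(size=(n_h1, n_h2))
W3 = 0.01 * rng.normal(size=(n_h2, n_h3))

v = (rng.random(n_v) < 0.5).astype(float)       # observed input
mu1, mu2, mu3 = np.full(n_h1, 0.5), np.full(n_h2, 0.5), np.full(n_h3, 0.5)

for _ in range(25):                              # mean-field fixed-point iterations
    mu1 = sigmoid(v @ W1 + mu2 @ W2.T)           # bottom-up from v, top-down from h2
    mu2 = sigmoid(mu1 @ W2 + mu3 @ W3.T)         # bottom-up from h1, top-down from h3
    mu3 = sigmoid(mu2 @ W3)                      # top layer: bottom-up only

print(mu1[:5])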
45
46
[Figure: a deep model (v, h1, h2, h3 with weights W1, W2, W3) trained on face images; lower hidden layers learn object parts and higher layers learn groups of parts.]
(Lee, Grosse, Ranganath, Ng, ICML 2009)
47
(Lee, Grosse, Ranganath, Ng, ICML 2009)
48
Ø Sparse Coding
Ø Autoencoders
Ø Restricted Boltzmann Machines
Ø Deep Boltzmann Machines
Ø Helmholtz Machines / Variational Autoencoders
49
Helmholtz Machine: a generative process maps the top hidden layer down through h3 → h2 → h1 → v (the input data) using weights W3, W2, W1, while a separate recognition model runs in the opposite direction to perform approximate inference.
50
[Figure: side-by-side comparison of a Helmholtz Machine and a Deep Boltzmann Machine, both with layers v, h1, h2, h3 and weights W1, W2, W3; the Helmholtz machine is directed, with a separate approximate-inference (recognition) network, whereas the Deep Boltzmann Machine is undirected.]
51
Generative Process: the model generates the input data v by ancestral sampling through a cascade of stochastic layers h3 → h2 → h1 → v (parameters W3, W2, W1). Each conditional term may denote a complicated nonlinear relationship, and sampling and probability evaluation is tractable for each conditional.
52
Deterministic layers can be stacked between the stochastic layers; each such term then denotes a one-layer neural net, and sampling and probability evaluation remains tractable for each conditional of the stochastic layers.
53
A recognition network performs approximate inference, mapping the input data v up through h1, h2, h3 (weights W1, W2, W3). The difficulty: gradients of the variational bound are hard to estimate with respect to the recognition network (high-variance).
54
Each conditional of the recognition model is a Gaussian with mean and covariance computed from the state of the hidden units at the previous layer.
55
Reparameterization: the recognition distribution can be expressed as a deterministic encoder of the input plus noise, where the distribution of the noise does not depend on the model parameters.
56
Gradients can then be computed by backprop: for fixed noise, the mapping from the input to the hidden units h is a deterministic neural net, so the model trains like an autoencoder.
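A compact PyTorch sketch of the reparameterization idea for a single stochastic layer (a standard VAE, simplified from the multi-layer model in the slides): the recognition network outputs a Gaussian mean and variance, a sample is written as mean + std * eps, and gradients flow through this deterministic mapping by backprop. The layer sizes, Bernoulli likelihood, and random data are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, n_in=784, n_hid=200, n_z=20):
        super().__init__()
        self.enc = nn.Linear(n_in, n_hid)
        self.mu, self.logvar = nn.Linear(n_hid, n_z), nn.Linear(n_hid, n_z)
        self.dec = nn.Sequential(nn.Linear(n_z, n_hid), nn.Tanh(), nn.Linear(n_hid, n_in))

    def forward(self, x):
        h = torch.tanh(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        eps = torch.randn_like(mu)                    # noise whose distribution is parameter-free
        z = mu + torch.exp(0.5 * logvar) * eps        # reparameterized sample
        logits = self.dec(z)
        # Negative variational bound: reconstruction term + KL(q(z|x) || N(0, I)).
        rec = F.binary_cross_entropy_with_logits(logits, x, reduction="sum")
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        return (rec + kl) / x.shape[0]

model = VAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(128, 784).bernoulli()                  # stand-in for binary images
for _ in range(10):
    loss = model(x)
    opt.zero_grad(); loss.backward(); opt.step()
print(loss.item())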
57
Importance Weighted Autoencoders (same layered model: v, h1, h2, h3 with W1, W2, W3): draw several samples from the recognition network and average their unnormalized importance weights, giving a tighter lower bound on the log-likelihood.
Burda, Grosse, Salakhutdinov, 2015
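A short sketch of how such a k-sample bound can be computed from unnormalized importance weights, assuming a trained recognition and generative model are available to supply them: average the weights inside the log, implemented with a log-sum-exp for numerical stability. The random log-weights here are placeholders.

import math
import torch

def iwae_bound(log_w):
    """k-sample importance-weighted bound from log unnormalized importance weights.
    log_w[k, batch] holds log p(x, h_i) - log q(h_i | x) for samples h_i ~ q(h | x)."""
    k = log_w.shape[0]
    return torch.logsumexp(log_w, dim=0) - math.log(k)

# With k = 1 this is the usual single-sample variational bound; larger k tightens it.
log_w = torch.randn(5, 8)      # placeholder log-weights: k = 5 samples, batch of 8
print(iwae_bound(log_w))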
58
Stochastic Layer
Gregor et al., 2015 (Mansimov, Parisotto, Ba, Salakhutdinov, 2015)
59
A stop sign is flying in blue skies
A pale yellow school bus is flying in blue skies
A herd of elephants is flying in blue skies
A large commercial airplane is flying in blue skies
(Mansimov, Parisotto, Ba, Salakhutdinov, 2015)
60
A yellow school bus parked in the parking lot
A red school bus parked in the parking lot
A green school bus parked in the parking lot
A blue school bus parked in the parking lot
(Mansimov, Parisotto, Ba, Salakhutdinov, 2015)
61
62
A toilet seat sits open in the bathroom (Ask Google?)
A toilet seat sits open in the grass field
63
Ø Sparse Coding
Ø Autoencoders
Ø Restricted Boltzmann Machines
Ø Deep Boltzmann Machines
Ø Helmholtz Machines / Variational Autoencoders
64
Implicit density models: instead of defining an explicit density p(x), we only need to be able to sample from it.
65
A game between two networks:
Ø Discriminator D: receives a sample from the data distribution and a sample from the generator G, and tries to tell which is real.
Ø Generator G: trained to produce samples that are hard for D to distinguish from the real data.
“Generative Adversarial Networks”, Goodfellow et al., NIPS 2014
66
Slide Credit: Ian Goodfellow
67
Slide Credit: Ian Goodfellow
68
Slide Credit: Ian Goodfellow
69
min_G max_D V(D, G) = E_{x∼p_data(x)}[log D(x)] + E_{z∼p_z(z)}[log(1 − D(G(z)))]
Discriminator: classify real data as being real and generator samples as being fake (pushes V up). Generator: generate samples that D would classify as real (pushes V down).
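A minimal PyTorch sketch of this minimax game on toy 2-D data (not the DCGAN setup shown on the following slides): the discriminator maximizes log D(x) + log(1 − D(G(z))), and the generator here uses the common non-saturating variant, maximizing log D(G(z)). Network sizes, data distribution, and optimizers are illustrative assumptions.

import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 2))          # generator: z -> x
D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))          # discriminator: x -> logit
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(200):
    x_real = torch.randn(64, 2) + torch.tensor([2.0, 2.0])                # toy data distribution
    z = torch.randn(64, 8)                                                # prior samples p_z(z)

    # Discriminator step: classify real data as real, generator samples as fake.
    d_loss = bce(D(x_real), torch.ones(64, 1)) + bce(D(G(z).detach()), torch.zeros(64, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: produce samples that D classifies as real (non-saturating loss).
    g_loss = bce(D(G(z)), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

print(d_loss.item(), g_loss.item())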
70
(Radford et al 2015)
71
(Radford et al 2015)
72
(Salimans et al., 2016)
Training Samples
73
(Salimans et al., 2016)
Training Samples
74
Slide Credit: Ian Goodfellow
75