Marrying Graphical Models & Deep Learning
Max Welling, University of Amsterdam
UvA-Qualcomm QUVA Lab
Canadian Institute for Advanced Research (CIFAR)
Overview: Generative versus discriminative modeling
Estimates must be stable under resampling of the data, or we risk overfitting.
Source: http://scott.fortmann-roe.com/docs/BiasVariance.html
$P(\text{all}) = P(\text{traffic-jam}\,|\,\text{rush-hour},\text{bad-weather},\text{accident}) \times P(\text{sirens}\,|\,\text{accident}) \times P(\text{accident}\,|\,\text{bad-weather}) \times P(\text{bad-weather}) \times P(\text{rush-hour})$

Rush-hour independent of bad-weather $\Leftrightarrow \sum_{\text{traffic-jam},\,\text{sirens},\,\text{accident}} P(\text{all}) = P(\text{rush-hour})\,P(\text{bad-weather})$
Rush-hour independent of bad-weather
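A minimal sketch of this check in Python (all CPT numbers are invented for illustration): summing the joint over traffic-jam, sirens and accident leaves exactly P(rush-hour) x P(bad-weather), confirming the marginal independence read off from the graph.

```python
import itertools

# Hypothetical CPTs; the numbers are made up for illustration only.
def p_rush(r):    return 0.3 if r else 0.7          # P(rush-hour)
def p_weather(w): return 0.2 if w else 0.8          # P(bad-weather)
def p_acc(a, w):                                    # P(accident | bad-weather)
    pa = 0.3 if w else 0.05
    return pa if a else 1.0 - pa
def p_sirens(s, a):                                 # P(sirens | accident)
    ps = 0.9 if a else 0.1
    return ps if s else 1.0 - ps
def p_jam(j, r, w, a):                              # P(traffic-jam | r, w, a)
    pj = 0.1 + 0.3 * r + 0.2 * w + 0.35 * a
    return pj if j else 1.0 - pj

def p_all(j, s, a, w, r):
    return p_jam(j, r, w, a) * p_sirens(s, a) * p_acc(a, w) * p_weather(w) * p_rush(r)

# Summing out traffic-jam, sirens and accident leaves P(rush-hour)P(bad-weather).
for r, w in itertools.product([True, False], repeat=2):
    marginal = sum(p_all(j, s, a, w, r)
                   for j, s, a in itertools.product([True, False], repeat=3))
    assert abs(marginal - p_rush(r) * p_weather(w)) < 1e-12
```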
Source: Bishop
Directed graphs: A is independent of B given C when every path between A and B is blocked.
Undirected edges make (conditional) independence relationships easy to read off.
Probability distribution: $P(x) = \frac{1}{Z}\prod_c \psi_c(x_c)$, where $c$ ranges over the maximal cliques (the largest completely connected subgraphs).
Hammersley-Clifford theorem: if $P(x) > 0$ for all $x$, then the (conditional) independencies of $P$ match those of the graph.
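A small illustration of the clique factorization (the potential tables and variables are made up): a chain x1 - x2 - x3 has maximal cliques {x1,x2} and {x2,x3}, and brute-force enumeration confirms the conditional independence x1 ⊥ x3 | x2 implied by the graph.

```python
import itertools
import numpy as np

# Tiny undirected model over binary (x1, x2, x3) with maximal cliques
# {x1,x2} and {x2,x3}; the potential tables are illustrative.
psi_12 = np.array([[2.0, 1.0], [1.0, 3.0]])   # psi(x1, x2)
psi_23 = np.array([[1.0, 2.0], [2.0, 1.0]])   # psi(x2, x3)

# P(x) = (1/Z) * psi_12(x1,x2) * psi_23(x2,x3); compute Z by enumeration.
Z = sum(psi_12[x1, x2] * psi_23[x2, x3]
        for x1, x2, x3 in itertools.product([0, 1], repeat=3))

def P(x1, x2, x3):
    return psi_12[x1, x2] * psi_23[x2, x3] / Z

# The graph says x1 is independent of x3 given x2; check it numerically.
for x2 in [0, 1]:
    p2 = sum(P(a, x2, b) for a in [0, 1] for b in [0, 1])
    for x1, x3 in itertools.product([0, 1], repeat=2):
        joint = P(x1, x2, x3) / p2                       # P(x1, x3 | x2)
        m1 = sum(P(x1, x2, b) for b in [0, 1]) / p2      # P(x1 | x2)
        m3 = sum(P(a, x2, x3) for a in [0, 1]) / p2      # P(x3 | x2)
        assert abs(joint - m1 * m3) < 1e-12
```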
Two routes to approximate inference: Variational Inference and Sampling.
Figure: within the space of all probability distributions, the variational family Q contains the member q∗ closest to the true posterior p.
Generating independent samples: sample from a proposal distribution g and suppress samples with low p(θ|X), e.g. (a) rejection sampling, (b) importance sampling.
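A rejection-sampling sketch (the target, the proposal g and the bound M are my illustrative choices, not from the slides):

```python
import numpy as np

# Unnormalized target p(theta) = exp(-theta^2 / 2); proposal g = Uniform(-5, 5);
# M is chosen so that M * g(theta) >= p(theta) everywhere.
rng = np.random.default_rng(0)

def p_unnorm(theta):
    return np.exp(-0.5 * theta ** 2)

g_density = 1.0 / 10.0        # density of Uniform(-5, 5)
M = 10.0                      # M * g_density = 1 >= max p_unnorm

samples = []
while len(samples) < 1000:
    theta = rng.uniform(-5.0, 5.0)                      # sample from g
    if rng.uniform() < p_unnorm(theta) / (M * g_density):
        samples.append(theta)                           # low-p draws suppressed
print(np.mean(samples), np.std(samples))                # approx. 0 and 1
```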
Markov Chain Monte Carlo
Given the target distribution $S_0 = p(\theta|X)$, design a transition kernel $T(\theta_{t+1}|\theta_t)$ such that $p_t(\theta_t) \to S_0$ as $t \to \infty$. Run the chain $\theta_0 \to \theta_1 \to \dots$; the burn-in is thrown away, after which the $\theta_t$ are samples from $S_0$.
Autocorrelation time τ. Figure: two trace plots of the last position coordinate over 1000 iterations; with high τ successive samples are strongly correlated (slow mixing), with low τ they decorrelate quickly (fast mixing).
$I = \langle f \rangle_{S_0} \approx \hat{I} = \frac{1}{T}\sum_{t=1}^{T} f(\theta_t)$

$\text{Bias}(\hat{I}) = E[\hat{I} - I] = 0, \qquad \text{Var}(\hat{I}) = \frac{\tau\,\text{Var}(f)}{T}$

where τ is the autocorrelation time.
Transition kernel $T(\theta_{t+1}|\theta_t)$: propose $\theta' \sim q(\theta'|\theta_t)$, then apply an accept/reject test:

$\theta_{t+1} \leftarrow \begin{cases}\theta' & \text{with probability } P_a\\ \theta_t & \text{with probability } 1-P_a\end{cases}, \qquad P_a = \min\!\left(1,\; \frac{q(\theta_t|\theta')}{q(\theta'|\theta_t)}\,\frac{S_0(\theta')}{S_0(\theta_t)}\right)$

Intuition: is the new state more probable, and is it easy to come back to the current state?

For Bayesian posterior inference, evaluating $S_0(\theta) \propto \prod_{i=1}^{N} p(x_i|\theta)$ requires a pass over all N data cases.
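A minimal Metropolis-Hastings sketch in Python, using an illustrative 1-D standard-normal target and a symmetric Gaussian random-walk proposal (so the q-ratio cancels):

```python
import numpy as np

rng = np.random.default_rng(0)

def log_S0(theta):                      # unnormalized log target (illustrative)
    return -0.5 * theta ** 2

theta = 0.0
samples = []
for t in range(10000):
    theta_prime = theta + 0.5 * rng.standard_normal()   # propose
    # Symmetric proposal: q(theta|theta') / q(theta'|theta) = 1.
    log_Pa = min(0.0, log_S0(theta_prime) - log_S0(theta))
    if np.log(rng.uniform()) < log_Pa:                  # accept/reject test
        theta = theta_prime
    samples.append(theta)

burn_in = 1000                                          # throw away burn-in
print(np.mean(samples[burn_in:]), np.var(samples[burn_in:]))
```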
For Bayesian posterior inference this is unsatisfactory: 1) burn-in is unnecessarily slow; 2) the variance $\text{Var}[\hat{I}] \propto 1/T$ is too high.
Figure: samples as the stepsize ϵ decreases, moving from low variance but high bias (fast) to high variance but low bias (slow).
Figure: bias², variance, and risk $E[(I - \hat{I})^2]$ as a function of the stepsize ϵ, at a fixed computational time budget.
Gradient ascent + injected Gaussian noise → Langevin dynamics, followed by a Metropolis-Hastings accept step.
Stochastic gradient ascent (e.g. on mini-batches) + injected noise → Stochastic Gradient Langevin Dynamics, which drops the Metropolis-Hastings accept step (Welling & Teh 2011).
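A sketch of one SGLD update following the Welling & Teh (2011) recipe; `grad_log_prior` and `grad_log_lik` are assumed user-supplied, and the demo model is my illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(0)

def sgld_step(theta, minibatch, N, eps, grad_log_prior, grad_log_lik):
    n = len(minibatch)
    # Mini-batch estimate of the gradient of the log posterior,
    # rescaled by N/n so it stays unbiased.
    grad = grad_log_prior(theta) + (N / n) * sum(grad_log_lik(x, theta) for x in minibatch)
    noise = np.sqrt(eps) * rng.standard_normal()
    # Langevin update, no Metropolis-Hastings accept step for small eps.
    return theta + 0.5 * eps * grad + noise

# Demo: posterior over the mean of N(theta, 1) data with a N(0, 1) prior.
data = rng.standard_normal(1000) + 2.0          # N = 1000, true mean 2
theta, draws = 0.0, []
for t in range(5000):
    batch = rng.choice(data, size=32)
    theta = sgld_step(theta, batch, len(data), eps=1e-3,
                      grad_log_prior=lambda th: -th,
                      grad_log_lik=lambda x, th: x - th)
    draws.append(theta)
print(np.mean(draws[1000:]))                    # close to the posterior mean ~2
```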
Figures: SGLD sampling behavior for a large stepsize ϵ versus a small stepsize ϵ.
Figure: variational family Q, with parameters Φ, fit to the target distribution P.
Bound: $\log P(X|\Theta) \ge B(Q)$.
E-step: maximize the bound with respect to Q (variational inference).
M-step: maximize the bound with respect to Θ (approximate learning).
Gap: $KL[\,Q(Z)\,\|\,P(Z|X,\Theta)\,]$.
By sharing the variational parameters φ across data cases (an inference network), we can do very fast inference at test time (i.e. avoid iterative optimization of q_test(z)).
Figure: discriminative model, X → Y, modeling P(Y|X).
→ the "Variational Autoencoder" (VAE): replace the conditional distributions with deep neural networks!
Figure: "deepify" the recognition model and the generative model.
Figure: VAE computation graph. The recognition network Q maps x through deterministic NN nodes h to the parameters (μ, σ) of the unobserved stochastic node z; the generative network P maps z through deterministic nodes h to the parameters p of the distribution on x. Legend: deterministic NN node, unobserved stochastic node, deep neural net.
Naive Monte Carlo gradient: sample Z and subsample a mini-batch of X; the resulting estimator has very high variance.

$B(Q) = \sum_Z Q(Z|X,\Phi)\left(\log P(X|Z,\Theta) + \log P(Z) - \log Q(Z|X,\Phi)\right)$

$\nabla_\Phi B(Q) = \sum_Z Q(Z|X,\Phi)\,\nabla_\Phi \log Q(Z|X,\Phi)\left(\log P(X|Z,\Theta) + \log P(Z) - \log Q(Z|X,\Phi)\right)$

$\nabla_\Phi B(Q) \approx \frac{1}{N}\frac{1}{S}\sum_{i=1}^{N}\sum_{s=1}^{S} \nabla_\Phi \log Q(Z_{is}|X_i,\Phi)\left(\log P(X_i|Z_{is},\Theta) + \log P(Z_{is}) - \log Q(Z_{is}|X_i,\Phi)\right)$
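A sketch of this score-function (REINFORCE) estimator on a deliberately tiny model, q(z|x) = N(z; φ, 1), p(z) = N(0, 1), p(x|z) = N(x; z, 1), all of which are my illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_bound_score(x_batch, phi, S=10):
    grads = []
    for x in x_batch:
        for _ in range(S):
            z = phi + rng.standard_normal()              # z ~ q(z|x, phi)
            log_p_x_z = -0.5 * (x - z) ** 2              # up to constants
            log_p_z = -0.5 * z ** 2
            log_q = -0.5 * (z - phi) ** 2
            score = z - phi                              # grad_phi log q(z|x)
            grads.append(score * (log_p_x_z + log_p_z - log_q))
    return np.mean(grads), np.var(grads)

g, v = grad_bound_score(x_batch=[1.0], phi=0.0, S=1000)
print(g, v)   # unbiased, but the per-sample variance v is large
```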
Kingma 2013, Bengio 2013, Kingma & Welling 2014
Score-function estimator (high variance):
$\nabla_\mu \int dz\, \mathcal{N}_z(\mu,\sigma)\, z = \frac{1}{S}\sum_s z_s (z_s - \mu)/\sigma^2, \quad z_s \sim \mathcal{N}_z(\mu,\sigma)$

Reparameterized estimator (zero variance in this example):
$= \frac{1}{S}\sum_s 1, \quad \epsilon_s \sim \mathcal{N}_\epsilon(0,1),\; z = \mu + \sigma\epsilon$

In general:
$\nabla_\Phi B(\Theta,\Phi) = \nabla_\Phi \int dz\, Q_\Phi(z|x)\left[\log P_\Theta(x,z) - \log Q_\Phi(z|x)\right] \approx \nabla_\Phi\left[\log P_\Theta(x,z_s) - \log Q_\Phi(z_s|x)\right]_{z_s = g(\epsilon_s,\Phi)}, \quad \epsilon_s \sim P(\epsilon)$
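The same toy model with the reparameterization trick, z = μ + σε: the gradient now flows through the sample deterministically, and the estimator's variance drops sharply (σ is held fixed here for simplicity).

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_bound_reparam(x, mu, sigma=1.0, S=10):
    grads = []
    for _ in range(S):
        eps = rng.standard_normal()
        z = mu + sigma * eps                  # z_s = g(eps_s, Phi)
        # d/dmu [log p(x|z) + log p(z) - log q(z|x)], with dz/dmu = 1 and
        # log q(z|x) = -0.5*eps^2 - log sigma, which does not depend on mu:
        grads.append((x - z) + (-z))
    return np.mean(grads), np.var(grads)

g, v = grad_bound_reparam(x=1.0, mu=0.0, S=1000)
print(g, v)   # same expectation as the score estimator, far smaller variance
```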
Figure: semi-supervised VAE. Recognition model Q and generative model P with deterministic nodes h, latent z, and label y, a stochastic node that is only sometimes observed.
D.P. Kingma, D.J. Rezende, S. Mohamed, M. Welling, NIPS 2014
Objective: the normal VB objective plus a term boosting the influence of q(y|x).
Variational Auto-Encoder
N = 10^8-10^9 (typical deep learning) versus N = 100-1000 (many real problems). We must generalize much better to new, unknown situations (domain invariance): we need statistical efficiency as well as computational efficiency.
Black-box DNN/CNN versus structured knowledge: use physics, use causality, use expert knowledge.
Forward: filter, subsample, filter, nonlinearity, subsample, ..., classify. Backward: backpropagation (propagate the error signal backward).
It’s the same CNN in all cases: Inception-v3
However:
Computer-aided diagnosis. Autonomous driving.
Increased uncertainty away from data
Complex models can have lower marginal likelihood:
$P(X|M) = \int d\Theta\, P(X|\Theta,M)\,P(\Theta|M)$  (model evidence)

$P(\Theta|X,M) = \frac{P(X|\Theta,M)\,P(\Theta|M)}{P(X|M)}$  (posterior)

$P(x|X,M) = \int d\Theta\, P(x|\Theta,M)\,P(\Theta|X,M)$  (prediction)

$P(M|X) = \frac{P(X|M)\,P(M)}{P(X)}$  (model selection)

$P(X) = \sum_M P(X|M)\,P(M)$  (evidence)
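A worked example of this Occam effect (the coin models and numbers are mine, not from the slides): a simple fixed-bias model can have higher evidence than a flexible model that spreads its prior mass over many parameter values.

```python
from math import comb

# N coin flips with `heads` heads. M0: fair coin (theta fixed at 0.5).
# M1: unknown bias theta with a uniform prior.
N, heads = 10, 6

# P(X|M0) = 0.5^N for one particular sequence.
evidence_m0 = 0.5 ** N

# P(X|M1) = integral of theta^h (1-theta)^(N-h) dtheta
#         = Beta(h+1, N-h+1) = 1 / ((N+1) * C(N, h)).
evidence_m1 = 1.0 / ((N + 1) * comb(N, heads))

print(evidence_m0, evidence_m1)   # here the simpler model M0 wins
```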
$\log P(X) \ge \int d\Theta\, Q(\Theta)\left[\log P(X|\Theta) + \log P(\Theta) - \log Q(\Theta)\right] \equiv B(Q(\Theta)|X)$
Error loss ~ N; complexity loss ~ constant.
Flow of information: the signal in NNs is very robust to noise addition (e.g. dropout); "neurons" act as bottlenecks.
THE PLAN:
Naive Monte Carlo gradient: sample Θ and subsample a mini-batch of X; again, very high variance.

$B(Q(\Theta)|X) = \int d\Theta\, Q(\Theta)\left[\log P(X|\Theta) + \log P(\Theta) - \log Q(\Theta)\right]$

$\nabla_\Phi B = \int d\Theta\, Q_\Phi(\Theta)\,\nabla_\Phi \log Q_\Phi(\Theta)\left[\log P(X|\Theta) + \log P(\Theta) - \log Q_\Phi(\Theta)\right]$

$\nabla_\Phi B \approx \frac{1}{S}\sum_{s=1}^{S} \nabla_\Phi \log Q_\Phi(\Theta_s)\left[\frac{N}{n}\sum_{i=1}^{n}\log P(x_i|\Theta_s) + \log P(\Theta_s) - \log Q_\Phi(\Theta_s)\right]$
Using a single weight sample Θs for a whole mini-batch induces correlations between data cases, and thus high variance in the gradient.
Reparameterize B(X) (Kingma, Salimans & Welling 2015). For supervised learning, $P(X|\Theta) \to P(Y|W,X)$. Figure: X → (W1) → B → H = σ(B) → (W2) → F → Y. Instead of sampling the weights W, compute the distribution of the pre-activations B = AW (A the layer inputs) exactly, and apply the "normal" reparameterization trick to B. Conclusion: using this trick we can further reduce the variance in the gradients, and it is much less expensive than resampling all the weights independently per data case.
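A sketch of the local reparameterization step for one layer, under a factorized Gaussian posterior over the weights (shapes and names are illustrative):

```python
import numpy as np

# With Q(W) = N(mu_w, sigma_w^2) factorized per weight, the pre-activations
# B = A @ W are Gaussian per data case, so we sample B instead of W.
rng = np.random.default_rng(0)

def local_reparam_layer(A, mu_w, sigma_w):
    mean_B = A @ mu_w                         # E[B]  : (batch, out)
    var_B = (A ** 2) @ (sigma_w ** 2)         # Var[B]: (batch, out)
    eps = rng.standard_normal(mean_B.shape)   # independent noise per data case
    B = mean_B + np.sqrt(var_B) * eps         # reparameterized sample of B
    return np.maximum(B, 0.0)                 # H = sigma(B), here a ReLU

A = rng.standard_normal((32, 100))            # mini-batch of activations
H = local_reparam_layer(A, mu_w=rng.standard_normal((100, 50)),
                        sigma_w=0.1 * np.ones((100, 50)))
```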
Multiplicative dropout noise. If the posterior is chosen as $Q(w_{ij}) = \mathcal{N}(\mu_{ij}, \alpha\mu_{ij}^2)$, then B = AW is distributed as if A had been perturbed by multiplicative noise. Conclusion: by using a special form of posterior we simulate dropout noise, i.e. dropout can be understood as variational Bayesian inference with multiplicative noise.
Y. Gal & Z. Ghahramani (2016), "Dropout as a Bayesian approximation: Representing model uncertainty in deep learning"; S. Wang & C. Manning, "Fast dropout training".
Pair a log-uniform (improper) prior with the variational dropout posterior and learn the dropout rate α per weight. When α → ∞, the weight is pruned.
(Kingma, Salimans & Welling 2015; Molchanov, Ashukha & Vetrov 2017)
Conclusion: we can learn the dropout rates and prune unnecessary weights.
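A pruning sketch in this spirit: with Q(w) = N(μ, αμ²), a large learned α means the weight is dominated by noise; thresholding log α (the cutoff value here is illustrative) zeroes such weights.

```python
import numpy as np

def prune_mask(mu, log_alpha, threshold=3.0):
    # Keep a weight only if its learned log alpha is below the threshold.
    keep = log_alpha < threshold
    return np.where(keep, mu, 0.0), keep

mu = np.array([0.8, -0.01, 1.2])
log_alpha = np.array([-2.0, 5.0, 0.5])    # hypothetical learned values
w_pruned, keep = prune_mask(mu, log_alpha)
print(w_pruned, keep.mean())              # pruned weights, fraction kept
```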
Animation: Molchanov, D., Ashukha, A. and Vetrov, D.
Fully connected layer
Use a hierarchical prior: a prior-posterior pair with dropout (multiplicative) noise shared across all outgoing weights of a hidden unit (Louizos, Ullrich & Welling, 2017).
Conclusion: by using special, hierarchical priors we can prune hidden units instead of individual weights (which is much better for compression).
$P(W, z) = \prod_{\text{hidden units } i} p(z_i) \prod_{\text{units } j \text{ outgoing from } i} P(w_{ij}|z_i)$
(Louizos, Ullrich, Welling 2017, submitted)
Additional Bayesian bonus: by monitoring posterior fluctuations we can determine the required fixed-point precision per weight, so encoding is cheaper.