Bayesian Deep Learning
- Prof. Leal-Taixé and Prof. Niessner
Going full Bayesian

Bayes = probabilities; hypothesis = model.

Bayes' theorem:

p(H|E) = p(E|H) p(H) / p(E)

Evidence E = data; hypothesis H = model.
Applied to learning, the evidence is the data x and the hypothesis is the model with parameters θ:

p(θ|x) = p(x|θ) p(θ) / p(x)

– p(θ|x): posterior
– p(x|θ): likelihood of the data given the parameters
– p(θ): prior (no dependence on the data)
– p(x): evidence (the data)
Two ways to use Bayes' rule for learning (a toy sketch follows below):
– Finding a point estimate (MAP) → what we have been doing so far!
– Finding a probability distribution over θ → this lecture:

p(θ|x) = p(x|θ) p(θ) / p(x)
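To make the distinction concrete, here is a tiny toy sketch (my own example, not from the lecture): for coin flips with a Beta prior, the posterior is available in closed form, so we can compare the MAP point estimate with the full posterior distribution.

```python
from scipy.stats import beta

# Toy example (mine, not from the lecture): coin flips with a Beta prior.
# The posterior over theta is Beta(a + heads, b + tails) in closed form.
heads, tails = 7, 3
a, b = 2.0, 2.0                            # Beta(2, 2) prior p(theta)
posterior = beta(a + heads, b + tails)     # full posterior p(theta | data)

# MAP point estimate: the mode of the posterior, (alpha - 1)/(alpha + beta - 2)
theta_map = (a + heads - 1) / (a + heads + b + tails - 2)

print(theta_map)                           # ~0.667: a single number
print(posterior.mean(), posterior.std())   # ~0.643 +/- 0.124: a distribution
```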
Advantages of deep learning models
– Very expressive models
– Good for tasks such as classification, regression, sequence prediction
– Modular structure, efficient training, many tools
– Scales well with large amounts of data
Disadvantages…
– “Black-box” feeling
– We cannot judge how “confident” the model is about a decision
– We want to know what our models know and what they do not know
Bulldog? German shepherd? Chihuahua? What answer will my NN give?
Bulldog? German shepherd? Chihuahua? I would rather get the answer that my model is not certain about the type of dog breed.
– We want to know what our models know and what they do not know
– Decision making
– Learning from limited, noisy, and missing data
– Insights on why a model failed
– Finding a point estimate (MAP) → what we have been doing so far!
– Finding a probability distribution over θ
Image: https://medium.com/@joeDiHare/deep-bayesian-neural-networks-952763a9537
Place a probability distribution over the model parameters and see how this affects our model’s predictions.
Ideally, the network can answer: “I am not really sure.”
Kendall & Gal, “What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?”, NIPS 2017
We want a probability distribution over the model parameters θ.
How do we compute this?
p(θ|x) = p(x|θ) p(θ) / p(x) = p(x|θ) p(θ) / ∫θ p(x|θ) p(θ) dθ
The integral in the denominator is intractable: it runs over all possible combinations of parameter values. In practice, we resort to an approximation of the posterior:
Two approximation methods:
– Markov Chain Monte Carlo
– Variational inference
– Markov Chain Monte Carlo: build a chain of samples θt → θt+1 → θt+2 → … that converges to the posterior p(θ|x). Drawback: SLOW.
– Variational inference: find an approximation q(θ) that is close to the posterior: arg min KL(q(θ) || p(θ|x)).
Dropout [Srivastava et al. 2014]: during each forward pass, randomly set a fraction of the activations to zero (a sketch follows below).
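A minimal sketch of one dropout forward pass; the dropout rate p and the tensor sizes are illustrative choices, not from the lecture.

```python
import torch

# Minimal sketch of one (inverted) dropout forward pass.
p = 0.5
x = torch.randn(4, 8)                      # a batch of activations
mask = (torch.rand_like(x) > p).float()    # randomly switch units off
y = x * mask / (1 - p)                     # rescale so expected activation is unchanged
```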
Example features: furry, has two eyes, has a tail, has paws, has two ears.
– Redundant representations
– Base your scores on more features
Each dropout mask defines a different subnetwork (Model 1, Model 2, …), so training with dropout resembles training an ensemble of models.
– Find an approximation q(θ) that is close to the posterior
– The variational distribution is a Bernoulli distribution (where the states are “on” and “off”)
Y Gal, Z Ghahramani, “Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning”, ICML 2016
Find q(θ) via arg min KL(q(θ) || p(θ|x))
– Apply dropout after every weight layer
– Keep dropout turned on at test time
– Sampling is done in a Monte Carlo fashion, hence the name Monte Carlo dropout
– Sampling is done in a Monte Carlo fashion: draw θ̂_t ∼ q(θ), where q(θ) is the dropout distribution.
– For classification, average the softmax outputs of T stochastic forward passes:

p(y = c|x) ≈ (1/T) Σ_{t=1}^{T} Softmax(f_{θ̂_t}(x)),  θ̂_t ∼ q(θ)

(parameter sampling → NN forward pass → classification; a sketch follows below)
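A minimal MC dropout sketch in PyTorch. The architecture, dropout rate, and number of samples T are illustrative assumptions; the key point is keeping dropout active at test time and averaging the stochastic predictions.

```python
import torch
import torch.nn as nn

# Illustrative classifier with dropout after a weight layer.
model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(256, 10),
)

def mc_dropout_predict(model, x, T=50):
    model.train()  # .train() keeps the Dropout layers sampling at test time
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=-1) for _ in range(T)])
    return probs.mean(dim=0), probs.std(dim=0)   # predictive mean and uncertainty

mean, std = mc_dropout_predict(model, torch.randn(1, 784))
```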
Kendall & Gal, “What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?”, NIPS 2017
Autoencoders: encode the input into a low-dimensional latent vector and reconstruct it with the decoder.
[Figure: autoencoder. Input x → Conv encoder → latent z → Transpose Conv decoder (parameters θ) → reconstruction x̃]
[Figure: probabilistic view. Encoder qφ(z|x) with parameters φ; decoder pθ(x̃|z) with parameters θ]
Goal: sample from the latent distribution to generate new outputs!
[Figure: variational autoencoder. x → encoder (φ) → µz|x, Σz|x → sample z|x ∼ N(µz|x, Σz|x) → decoder (θ) → x̃]
The encoder (φ) outputs the mean µz|x and a diagonal covariance Σz|x of the latent distribution: z|x ∼ N(µz|x, Σz|x) (a sketch follows below).
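A sketch of a VAE encoder head; fully connected layers are used for brevity (the lecture's figures use conv layers), and all sizes here are illustrative.

```python
import torch
import torch.nn as nn

# Sketch of a VAE encoder that outputs the parameters of q(z|x).
class Encoder(nn.Module):
    def __init__(self, in_dim=784, hidden=256, latent=32):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, latent)       # mean of q(z|x)
        self.logvar = nn.Linear(hidden, latent)   # log of the diagonal covariance

    def forward(self, x):
        h = self.backbone(x)
        return self.mu(h), self.logvar(h)
```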
After sampling z|x ∼ N(µz|x, Σz|x), the decoder (θ) maps z to the reconstruction x̃.
Goal: we want to estimate the parameters θ of our generative model

pθ(x) = ∫z pθ(x|z) pθ(z) dz

– Prior pθ(z): Gaussian
– pθ(x|z): decoder (neural network)
– The integral over z is intractable to compute.
Goal: we want to estimate the parameters of our generative model. Idea: approximate the intractable posterior with the encoder qφ(z|x): x → encoder (φ) → µz|x, Σz|x → sample z → decoder pθ(x̃|z) → x̃.
log(pθ(xi)) = Ez∼qφ(z|xi)[log(pθ(xi))]
I draw samples of the latent variable z from my encoder
log pθ(xi) = Ez∼qφ(z|xi)[log pθ(xi)] = Ez∼qφ(z|xi)[log (pθ(xi|z) pθ(z) / pθ(z|xi))]

We expand using the latent variable, which will become useful to simplify the expressions later according to our AE formulation. Recall Bayes' rule:

pθ(z|x) = pθ(x|z) pθ(z) / pθ(x)
log pθ(xi) = Ez[log (pθ(xi|z) pθ(z) / pθ(z|xi) · qφ(z|xi) / qφ(z|xi))]   (multiply and divide by qφ(z|xi))
log pθ(xi) = Ez[log pθ(xi|z)] − Ez[log (qφ(z|xi) / pθ(z))] + Ez[log (qφ(z|xi) / pθ(z|xi))]
The last two terms are Kullback-Leibler divergences, which measure how similar two distributions are:

log pθ(xi) = Ez[log pθ(xi|z)] − Ez[log (qφ(z|xi) / pθ(z))] + Ez[log (qφ(z|xi) / pθ(z|xi))]
= Ez[log pθ(xi|z)] − KL(qφ(z|xi) || pθ(z)) + KL(qφ(z|xi) || pθ(z|xi))

– First term: reconstruction loss (how well my decoder reconstructs a data point given the latent vector z); we need to sample z.
– Second term: measures how good my latent distribution is with respect to my Gaussian prior.
– Third term: I still cannot express the shape of the true posterior pθ(z|xi), but a KL divergence is always ≥ 0.
Dropping the last (non-negative) KL term yields a lower bound, which we use as the loss function:

L(xi, φ, θ) = Ez[log pθ(xi|z)] − KL(qφ(z|xi) || pθ(z)),   log pθ(xi) ≥ L(xi, φ, θ)
We maximize the lower bound over the whole dataset to train the VAE (a sketch of the corresponding loss follows below):

φ∗, θ∗ = arg max_{φ,θ} Σ_{i=1}^{N} L(xi, φ, θ)
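A sketch of the negative ELBO for one batch. It assumes a decoder that outputs pixel probabilities `x_hat` and an encoder that outputs `mu` and `logvar` (these names are mine); minimizing this quantity maximizes the lower bound L(xi, φ, θ).

```python
import torch
import torch.nn.functional as F

# Negative ELBO for a Bernoulli decoder and a diagonal-Gaussian encoder.
def vae_loss(x_hat, x, mu, logvar):
    # Reconstruction term: -E_z[log p_theta(x|z)]
    recon = F.binary_cross_entropy(x_hat, x, reduction="sum")
    # KL(q_phi(z|x) || N(0, I)) in closed form for a diagonal Gaussian q
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```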
Ez[log pθ(xi|z)] − KL(qφ(z|xi) || pθ(z)) + KL(qφ(z|xi) || pθ(z|xi))

The KL term acts on the encoder (φ) outputs µz|x, Σz|x: it makes the posterior distribution close to the prior (close to a unit Gaussian distribution).
Putting the pipeline together: x → encoder (φ) → µz|x, Σz|x → sample z|x ∼ N(µz|x, Σz|x) → decoder (θ) → x̃.
The output is also parameterized: the decoder (θ) outputs µx|z, Σx|z, and the reconstruction is sampled as x|z ∼ N(µx|z, Σx|z).
The reconstruction term Ez[log pθ(xi|z)] maximizes the likelihood of reconstructing the input.
Reparameterization trick: sample ε from a fixed distribution (e.g., a unit Gaussian) and compute z = µz|x + Σz|x^{1/2} ε; this shift-and-scale of fixed Gaussians allows us to perform backpropagation through the sampling step (a sketch follows below).

Kingma & Welling, “Auto-Encoding Variational Bayes”, ICLR 2014
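A sketch of the reparameterization trick: the noise comes from a fixed unit Gaussian, so gradients can flow into `mu` and `logvar` (variable names are mine).

```python
import torch

# Reparameterization: shift and scale fixed unit-Gaussian noise.
def reparameterize(mu, logvar):
    std = torch.exp(0.5 * logvar)   # standard deviation from log-variance
    eps = torch.randn_like(std)     # fixed noise source, no learned parameters
    return mu + eps * std           # differentiable w.r.t. mu and logvar
```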
http://kvfrans.com/variational-autoencoders-explained/
Generating new data: sample z from the prior distribution (e.g., a unit Gaussian) and pass it through the decoder.
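A minimal generation sketch. The decoder below is an untrained stand-in with illustrative sizes; in practice you would use the trained decoder pθ(x|z).

```python
import torch
import torch.nn as nn

# Draw latent samples from the prior and decode them.
decoder = nn.Sequential(nn.Linear(32, 784), nn.Sigmoid())  # stand-in decoder
z = torch.randn(16, 32)    # 16 samples from the prior p(z) = N(0, I)
x_new = decoder(z)         # decoded samples, e.g., 16 generated images
```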
Each element of z encodes a different feature
Examples: degree of smile, head pose.
[Figure: reconstructions compared: Autoencoder vs. Variational Autoencoder vs. Ground Truth]
https://github.com/kvfrans/variational-autoencoder
Autoencoders:
– Reconstruct the input
– Unsupervised learning
– Latent space features are useful

Variational autoencoders additionally give us:
– A probability distribution in latent space (e.g., Gaussian)
– An interpretable latent space (head pose, smile)
– The ability to sample from the model to generate outputs
– Sohn, Kihyuk, Honglak Lee, and Xinchen Yan, “Learning Structured Output Representation using Deep Conditional Generative Models”, NIPS 2015
– Xinchen Yan, Jimei Yang, Kihyuk Sohn, Honglak Lee, “Attribute2Image: Conditional Image Generation from Visual Attributes”, ECCV 2016
– Jacob Walker, Carl Doersch, Abhinav Gupta, Martial Hebert, “An Uncertain Future: Forecasting from Static Images using Variational Autoencoders”, ECCV 2016
– Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, Ole Winther, “Autoencoding beyond pixels using a learned similarity metric”, ICML 2016
– Aditya Deshpande, Jiajun Lu, Mao-Chuang Yeh, David Forsyth, “Learning Diverse Image Colorization”, arXiv 2016
– Raymond Yeh, Ziwei Liu, Dan B Goldman, Aseem Agarwala, “Semantic Facial Expression Editing using Autoencoded Flow”, arXiv 2016
– Diederik P. Kingma, Danilo J. Rezende, Shakir Mohamed, Max Welling, “Semi-Supervised Learning with Deep Generative Models”, NIPS 2014