Bayesian Deep Learning, Prof. Leal-Taixé and Prof. Niessner (PowerPoint presentation)


SLIDE 1

Bayesian Deep Learning

SLIDE 2

Going Full Bayesian

  • Bayes = probabilities
  • Bayes' Theorem:

p(H|E) = p(E|H) p(H) / p(E)

  • Evidence = data; Hypothesis = model
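Plugging concrete numbers into Bayes' theorem makes the update tangible. A minimal sketch in Python; the prior and likelihood values are made up purely for illustration:

```python
# Bayes' theorem: p(H|E) = p(E|H) p(H) / p(E), with the evidence
# p(E) expanded over both hypotheses (H and not-H).
p_H = 0.3             # prior belief in the hypothesis (illustrative value)
p_E_given_H = 0.8     # likelihood of the evidence if H is true
p_E_given_notH = 0.1  # likelihood of the evidence if H is false

p_E = p_E_given_H * p_H + p_E_given_notH * (1 - p_H)  # total evidence
p_H_given_E = p_E_given_H * p_H / p_E                 # posterior

print(round(p_H_given_E, 4))  # → 0.7742
```

The evidence in the denominator is what later becomes intractable for neural networks, where it is an integral over all parameters rather than a two-term sum.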

SLIDE 3

Going Full Bayesian

  • Start with a prior on the model parameters: p(θ) (no dependence on the data)
  • Choose a statistical model: the likelihood p(x|θ)
  • Use the data to refine the prior, i.e., compute the posterior:

p(θ|x) = p(x|θ) p(θ) / p(x)

SLIDE 4

Going Full Bayesian

  • Start with a prior p(θ) on the model parameters
  • Choose a statistical model p(x|θ)
  • Use the data x to refine the prior, i.e., compute the posterior:

p(θ|x) = p(x|θ) p(θ) / p(x)

(prior p(θ), likelihood p(x|θ), posterior p(θ|x), data x)

SLIDE 5

Going Full Bayesian

  • 1. Learning: computing the posterior p(θ|x) = p(x|θ) p(θ) / p(x)

– Finding a point estimate (MAP) → what we have been doing so far!
– Finding a probability distribution over θ → this lecture
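The MAP-vs-distribution distinction can be seen in a conjugate toy model. A hedged sketch, assuming a Beta-Bernoulli coin (not from the slides): the Beta prior makes the posterior available in closed form, so the point estimate (MAP) and the full distribution's mean can both be read off directly.

```python
# Beta(a, b) prior on a coin's heads-probability θ; Bernoulli likelihood.
# The posterior is again a Beta, so "going full Bayesian" is exact here.
a, b = 2.0, 2.0       # prior pseudo-counts (illustrative)
heads, tails = 7, 3   # observed data

a_post, b_post = a + heads, b + tails                # posterior Beta(a', b')
map_estimate = (a_post - 1) / (a_post + b_post - 2)  # posterior mode = MAP point estimate
post_mean = a_post / (a_post + b_post)               # mean of the full posterior

print(round(map_estimate, 4), round(post_mean, 4))  # → 0.6667 0.6429
```

The two numbers differ: the point estimate throws away the spread of the posterior, which is exactly the information this lecture is after.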

SLIDE 6

What have we learned so far?

  • Advantages of deep learning models

– Very expressive models
– Good for tasks such as classification, regression, sequence prediction
– Modular structure, efficient training, many tools
– Scales well with large amounts of data

  • But we also have disadvantages…

– "Black-box" feeling
– We cannot judge how "confident" the model is about a decision

SLIDE 7

Modeling uncertainty

  • Wish list:

– We want to know what our models know and what they do not know

SLIDE 8

Modeling uncertainty

  • Example: I have built a dog breed classifier

(Figure: Bulldog, German shepherd, Chihuahua) What answer will my NN give?

SLIDE 9

Modeling uncertainty

  • Example: I have built a dog breed classifier

(Figure: Bulldog, German shepherd, Chihuahua) I would rather get as an answer that my model is not certain about the dog's breed

SLIDE 10

Modeling uncertainty

  • Wish list:

– We want to know what our models know and what they do not know

  • Why do we care?

– Decision making
– Learning from limited, noisy, and missing data
– Insights into why a model failed

SLIDE 11

Modeling uncertainty

  • Finding the posterior

– Finding a point estimate (MAP) → what we have been doing so far!
– Finding a probability distribution over θ

Image: https://medium.com/@joeDiHare/deep-bayesian-neural-networks-952763a9537

SLIDE 12

Modeling uncertainty

  • We can sample many times from the distribution and see how this affects our model's predictions
  • If the predictions are consistent, the model is confident

Image: https://medium.com/@joeDiHare/deep-bayesian-neural-networks-952763a9537
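The sampling idea above can be sketched with a toy one-parameter "network" and an assumed Gaussian posterior over its weight (all numbers are illustrative, not from the slides):

```python
import random
random.seed(0)

# Toy 1-D "network": y = w * x, with an (assumed) Gaussian posterior over w.
mu_w, sigma_w = 2.0, 0.1  # narrow posterior -> a confident model
x = 3.0

# Sample many weights from the posterior and collect the predictions.
preds = [random.gauss(mu_w, sigma_w) * x for _ in range(1000)]
mean_pred = sum(preds) / len(preds)
var_pred = sum((p - mean_pred) ** 2 for p in preds) / len(preds)

# Consistent predictions (small variance) = the model is confident here.
print(round(mean_pred, 2), round(var_pred, 3))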

SLIDE 13

Modeling uncertainty

(Figure: "I am not really sure")

Kendall & Gal. "What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?" NIPS 2017

SLIDE 14

Why?

SLIDE 15

How do we get the posterior?

  • Compute the posterior over the weights
  • The evidence p(x) is the probability of observing our data under all possible model parameters:

p(θ|x) = p(x|θ) p(θ) / p(x) = p(x|θ) p(θ) / ∫_θ p(x|θ) p(θ) dθ

  • How do we compute this?

SLIDE 16

How do we get the posterior?

p(θ|x) = p(x|θ) p(θ) / ∫_θ p(x|θ) p(θ) dθ

  • How do we compute this?
  • Denominator: we cannot compute it over all possible parameter combinations
  • Two ways to approximate the posterior:

– Markov Chain Monte Carlo
– Variational Inference

SLIDE 17

How do we get the posterior?

  • Markov Chain Monte Carlo (MCMC)

– A chain of samples θt → θt+1 → θt+2 … that converges to p(θ|x) (SLOW)

  • Variational Inference

– Find an approximation q(θ) that minimizes KL(q(θ) || p(θ|x))
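A minimal random-walk Metropolis sampler illustrates the MCMC idea: the chain θt → θt+1 → … eventually draws from the target posterior without ever evaluating the intractable normalizer. The unnormalized target here is a standard normal, chosen purely for illustration:

```python
import math
import random
random.seed(1)

def unnorm_post(theta):
    # Unnormalized posterior p(x|θ)p(θ): a standard normal, for illustration.
    return math.exp(-0.5 * theta * theta)

theta, samples = 5.0, []  # deliberately bad starting point, far from the mode
for t in range(20000):
    prop = theta + random.gauss(0.0, 1.0)  # random-walk proposal θt -> θ'
    # Metropolis acceptance: the normalizer cancels in the ratio.
    if random.random() < unnorm_post(prop) / unnorm_post(theta):
        theta = prop
    samples.append(theta)

burned = samples[2000:]  # drop the burn-in phase
mean = sum(burned) / len(burned)
print(round(mean, 2))    # ≈ 0, the posterior mean
```

The "SLOW" label on the slide is visible even here: thousands of correlated steps are needed for a one-dimensional target, and the cost grows sharply with dimension.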

SLIDE 18

Dropout for Bayesian Inference

SLIDE 19

Recall: Dropout

  • Disable a random set of neurons (typically 50%) in the forward pass

(Srivastava et al. 2014)

SLIDE 20

Recall: Dropout

  • Using half the network = half capacity → redundant representations

(Figure: features such as "furry", "has two eyes", "has a tail", "has paws", "has two ears")

SLIDE 21

Recall: Dropout

  • Using half the network = half capacity

– Redundant representations
– Base your scores on more features

  • Consider it as a model ensemble

SLIDE 22

Recall: Dropout

  • Two models in one

(Figure: Model 1, Model 2)

SLIDE 23

MC dropout

  • Variational Inference

– Find an approximation q(θ) that minimizes KL(q(θ) || p(θ|x))

  • Dropout training

– The variational distribution is a Bernoulli distribution (where the states are "on" and "off")

Y. Gal, Z. Ghahramani. "Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning." ICML 2016

SLIDE 24

MC dropout

  • 1. Train a model with dropout before every weight layer
  • 2. Apply dropout at test time

– Sampling is done in a Monte Carlo fashion, hence the name Monte Carlo dropout

SLIDE 25

MC dropout

– Sampling is done in a Monte Carlo fashion: draw θ̂t ∼ q(θ), where q(θ) is the dropout distribution, and average the network's softmax outputs:

p(y = c | x) ≈ (1/T) Σ_{t=1}^{T} Softmax(f_{θ̂t}(x))

(parameter sampling θ̂t ∼ q(θ); NN f; classification output)
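The Monte Carlo average above can be sketched with a hypothetical one-layer classifier whose hidden units are dropped out at test time. The weights and inputs are made-up numbers, with p_drop = 0.5 and inverted-dropout scaling:

```python
import math
import random
random.seed(0)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Toy classifier f_θ: logits = W h, where h is a hidden vector whose units
# are dropped (Bernoulli "on"/"off") at *test* time.
W = [[1.0, -1.0, 0.5, 0.2],
     [-0.5, 1.0, -0.2, 0.3]]  # 2 classes, 4 hidden units (illustrative)
h = [0.8, 0.1, 0.5, 0.9]

T = 1000
avg = [0.0, 0.0]
for t in range(T):
    mask = [1.0 if random.random() > 0.5 else 0.0 for _ in h]  # θ̂_t ~ q(θ)
    h_t = [hi * mi / 0.5 for hi, mi in zip(h, mask)]           # inverted-dropout scaling
    logits = [sum(wi * xi for wi, xi in zip(row, h_t)) for row in W]
    probs = softmax(logits)
    avg = [a + p / T for a, p in zip(avg, probs)]  # (1/T) Σ Softmax(f_θ̂t(x))

print([round(p, 2) for p in avg])  # approximate predictive distribution p(y=c|x)
```

The spread of the per-sample `probs` around `avg` is the model's uncertainty at this input; a deterministic forward pass would hide it.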

SLIDE 26

Measure your model's uncertainty

Kendall & Gal. "What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?" NIPS 2017

SLIDE 27

Variational Autoencoders

SLIDE 28

Recall: Autoencoders

  • Encode the input into a representation (bottleneck) and reconstruct it with the decoder

(Figure: x → Encoder (Conv) → z → Decoder (Transpose Conv) → x̃)

SLIDE 29

Variational Autoencoder

(Figure: x → Encoder qφ(z|x), parameters φ → z → Decoder pθ(x̃|z), parameters θ → x̃)

SLIDE 30

Variational Autoencoder

Goal: sample from the latent distribution to generate new outputs!

(Figure: x → Encoder (φ) → z → Decoder (θ) → x̃)

SLIDE 31

Variational Autoencoder

  • The latent space is now a distribution
  • Specifically, it is a Gaussian: z|x ∼ N(μz|x, Σz|x)

(Figure: x → Encoder (φ) → μz|x, Σz|x → Sample z → Decoder (θ) → x̃)

SLIDE 32

Variational Autoencoder

  • The latent space is now a distribution
  • Specifically, it is a Gaussian: z|x ∼ N(μz|x, Σz|x), with mean μz|x and diagonal covariance Σz|x

(Figure: x → Encoder (φ) → μz|x, Σz|x)

SLIDE 33

Variational Autoencoder

  • Training

(Figure: x → Encoder (φ) → μz|x, Σz|x → Sample z|x ∼ N(μz|x, Σz|x) → Decoder (θ) → x̃)

SLIDE 34

Variational Autoencoder

  • Test: sampling from the latent space

(Figure: Sample z ∼ N(μz|x, Σz|x) → Decoder (θ) → x̃)

SLIDE 35

VAE: training

  • Back to the Bayesian view for training

Goal: estimate the parameters θ of my generative model

pθ(x) = ∫_z pθ(x|z) pθ(z) dz

– pθ(z): prior = Gaussian
– pθ(x|z): decoder (neural network)
– Intractable to compute the output for every z
SLIDE 36

VAE: training

  • We approximate the intractable posterior with an encoder qφ(z|x)

Goal: estimate the parameters θ of my generative model

(Figure: x → Encoder (φ) → μz|x, Σz|x → Sample z → Decoder pθ(x̃|z) → x̃)

SLIDE 37

VAE: loss function

  • Loss function for a data point xi:

log pθ(xi) = E_{z∼qφ(z|xi)}[log pθ(xi)]

I draw samples of the latent variable z from my encoder

SLIDE 38

VAE: loss function

  • Loss function for a data point xi:

log pθ(xi) = E_{z∼qφ(z|xi)}[log pθ(xi)] = E_{z∼qφ(z|xi)}[log (pθ(xi|z) pθ(z) / pθ(z|xi))]

  • Bayes' rule, recall: pθ(z|x) = pθ(x|z) pθ(z) / pθ(x)
  • Using the latent variable will become useful to simplify the expressions later, according to our AE formulation

SLIDE 39

VAE: loss function

  • Loss function for a data point xi:

log pθ(xi) = E_{z∼qφ(z|xi)}[log (pθ(xi|z) pθ(z) / pθ(z|xi))]
           = E_z[log (pθ(xi|z) pθ(z) qφ(z|xi) / (pθ(z|xi) qφ(z|xi)))]

  • (log pθ(xi) is just a constant with respect to z, so wrapping it in the expectation is valid)
SLIDE 40

VAE: loss function

  • Loss function for a data point xi:

log pθ(xi) = E_z[log (pθ(xi|z) pθ(z) qφ(z|xi) / (pθ(z|xi) qφ(z|xi)))]
           = E_z[log pθ(xi|z)] − E_z[log (qφ(z|xi) / pθ(z))] + E_z[log (qφ(z|xi) / pθ(z|xi))]

  • Apply the logarithm and group as needed
SLIDE 41

VAE: loss function

  • Loss function for a data point xi:

= E_z[log pθ(xi|z)] − E_z[log (qφ(z|xi) / pθ(z))] + E_z[log (qφ(z|xi) / pθ(z|xi))]

  • The last two terms are Kullback-Leibler divergences, which measure how similar two distributions are
SLIDE 42

VAE: loss function

  • Loss function for a data point xi (as Kullback-Leibler divergences):

= E_z[log pθ(xi|z)] − E_z[log (qφ(z|xi) / pθ(z))] + E_z[log (qφ(z|xi) / pθ(z|xi))]
= E_z[log pθ(xi|z)] − KL(qφ(z|xi) || pθ(z)) + KL(qφ(z|xi) || pθ(z|xi))
SLIDE 43

VAE: loss function

  • Loss function for a data point xi:

= E_z[log pθ(xi|z)] − KL(qφ(z|xi) || pθ(z)) + KL(qφ(z|xi) || pθ(z|xi))

– E_z[log pθ(xi|z)]: reconstruction loss (how well my decoder reconstructs a data point given the latent vector z); we need to sample z
– KL(qφ(z|xi) || pθ(z)): measures how good my latent distribution is with respect to my Gaussian prior
– KL(qφ(z|xi) || pθ(z|xi)): I still cannot express the shape of this distribution, but I know it is ≥ 0

SLIDE 44

VAE: loss function

  • Loss function for a data point xi:

E_z[log pθ(xi|z)] − KL(qφ(z|xi) || pθ(z)) + KL(qφ(z|xi) || pθ(z|xi)), with the last KL term ≥ 0

  • Loss function (lower bound): L(xi, φ, θ) = E_z[log pθ(xi|z)] − KL(qφ(z|xi) || pθ(z)), so that log pθ(xi) ≥ L(xi, φ, θ)

SLIDE 45

VAE: loss function

  • Loss function (lower bound) for a data point xi:

L(xi, φ, θ) = E_z[log pθ(xi|z)] − KL(qφ(z|xi) || pθ(z))

  • Optimize:

φ*, θ* = arg max_{φ,θ} Σ_{i=1}^{N} L(xi, φ, θ)
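For a diagonal-Gaussian encoder and a unit-Gaussian prior, the KL term of the lower bound L(xi, φ, θ) has a closed form, KL = ½ Σ (σ² + μ² − 1 − log σ²). A sketch with hypothetical encoder outputs (the μ and log σ² values are made up):

```python
import math

# Closed-form KL( N(μ, diag(σ²)) || N(0, I) ): the prior-matching term of the
# lower bound, computed per latent dimension and summed.
def kl_to_unit_gaussian(mu, log_var):
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))

# Hypothetical encoder outputs for one data point x_i:
mu = [0.5, -0.3]
log_var = [-0.1, 0.2]

kl = kl_to_unit_gaussian(mu, log_var)
print(round(kl, 4))  # 0 only when q equals the unit-Gaussian prior
```

Because this term is available in closed form, only the reconstruction term E_z[log pθ(xi|z)] needs Monte Carlo samples of z during training.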

SLIDE 46

Variational Autoencoder

  • Training

E_z[log pθ(xi|z)] − KL(qφ(z|xi) || pθ(z)) + KL(qφ(z|xi) || pθ(z|xi))

– The KL(qφ(z|xi) || pθ(z)) term makes the posterior distribution close to the prior (close to a unit Gaussian)

(Figure: x → Encoder (φ) → μz|x, Σz|x)

SLIDE 47

Variational Autoencoder

  • Training

E_z[log pθ(xi|z)] − KL(qφ(z|xi) || pθ(z)) + KL(qφ(z|xi) || pθ(z|xi))

(Figure: x → Encoder (φ) → μz|x, Σz|x, with z|x ∼ N(μz|x, Σz|x))

SLIDE 48

Variational Autoencoder

  • Training

E_z[log pθ(xi|z)] − KL(qφ(z|xi) || pθ(z)) + KL(qφ(z|xi) || pθ(z|xi))

(Figure: x → Encoder (φ) → μz|x, Σz|x → Sample z|x ∼ N(μz|x, Σz|x))

SLIDE 49

Variational Autoencoder

  • Training

E_z[log pθ(xi|z)] − KL(qφ(z|xi) || pθ(z)) + KL(qφ(z|xi) || pθ(z|xi))

(Figure: x → Encoder (φ) → μz|x, Σz|x → Sample z → Decoder (θ) → x̃)

SLIDE 50

Variational Autoencoder

  • Training

E_z[log pθ(xi|z)] − KL(qφ(z|xi) || pθ(z)) + KL(qφ(z|xi) || pθ(z|xi))

– The output is also parameterized: x|z ∼ N(μx|z, Σx|z)

(Figure: z → Decoder (θ) → μx|z, Σx|z → Sample x̃)
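When the decoder output is parameterized as a diagonal Gaussian N(μx|z, Σx|z), the reconstruction term log pθ(x|z) is just the Gaussian log-density of the input under the decoder's output. A sketch with made-up decoder outputs for a three-pixel "image":

```python
import math

# log N(x; μ, diag(σ²)): the reconstruction term of the VAE lower bound
# for a Gaussian decoder, summed over dimensions.
def gaussian_log_likelihood(x, mu, log_var):
    return sum(
        -0.5 * (lv + math.log(2 * math.pi) + (xi - m) ** 2 / math.exp(lv))
        for xi, m, lv in zip(x, mu, log_var)
    )

# Hypothetical input and decoder outputs (illustrative numbers):
x = [0.9, 0.2, 0.4]
mu = [0.8, 0.25, 0.5]
log_var = [-2.0, -2.0, -2.0]

print(round(gaussian_log_likelihood(x, mu, log_var), 3))
```

Maximizing this term pushes μx|z toward the input; with a fixed variance it reduces to a scaled squared-error reconstruction loss.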

SLIDE 51

Variational Autoencoder

  • Training

E_z[log pθ(xi|z)] − KL(qφ(z|xi) || pθ(z)) + KL(qφ(z|xi) || pθ(z|xi))

– The E_z[log pθ(xi|z)] term maximizes the likelihood of reconstructing the input x̃

SLIDE 52

Variational Autoencoder

  • For more details and the mathematical derivation, see Kingma and Welling. "Auto-Encoding Variational Bayes." ICLR 2014
  • The reparameterization trick (expressing variables as Gaussians) allows us to perform backpropagation
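The reparameterization trick can be sketched in a few lines: sampling ε from a fixed N(0, I) and computing z = μ + σ·ε makes z a deterministic function of (μ, σ), so gradients can flow through the sampling step to the encoder parameters (the values below are illustrative):

```python
import random
random.seed(0)

# Reparameterization trick: instead of sampling z ~ N(μ, σ²) directly
# (not differentiable w.r.t. μ and σ), sample ε ~ N(0, I) and set
# z = μ + σ * ε, a deterministic, differentiable function of μ and σ.
def reparameterize(mu, sigma):
    return [m + s * random.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

mu, sigma = [0.5, -1.0], [0.1, 0.2]
zs = [reparameterize(mu, sigma) for _ in range(5000)]

# Sanity check: the samples still have the intended mean.
mean0 = sum(z[0] for z in zs) / len(zs)
print(round(mean0, 2))  # ≈ μ[0] = 0.5
```

All randomness lives in ε, which has no learnable parameters; that is what makes backpropagation through the latent sample possible.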

SLIDE 53

How about generating data?

  • Training as seen before

http://kvfrans.com/variational-autoencoders-explained/

SLIDE 54

How about generating data?

  • After training, generate random samples
  • Sample from the latent distribution (e.g., a unit Gaussian)

SLIDE 55

Generating data

  • Each element of z encodes a different feature

SLIDE 56

Generating data

(Figure: varying z changes the degree of smile and the head pose)

SLIDE 57

Autoencoder vs VAE

(Figure: reconstructions from an Autoencoder, a Variational Autoencoder, and the Ground Truth)

https://github.com/kvfrans/variational-autoencoder

SLIDE 58

Autoencoder Overview

  • Autoencoders (AE)

– Reconstruct the input
– Unsupervised learning
– Latent space features are useful

  • Variational Autoencoders (VAE)

– Probability distribution in latent space (e.g., Gaussian)
– Interpretable latent space (head pose, smile)
– Sample from the model to generate output

SLIDE 59

Next lectures

  • More on generative models
  • This Wednesday the 4th: first project presentations!

SLIDE 60

Other references

  • Conditional Variational Autoencoders:

– Sohn, Kihyuk, Honglak Lee, and Xinchen Yan. "Learning Structured Output Representation using Deep Conditional Generative Models." NIPS 2015.
– Xinchen Yan, Jimei Yang, Kihyuk Sohn, Honglak Lee. "Attribute2Image: Conditional Image Generation from Visual Attributes." ECCV 2016.

SLIDE 61

Other references

  • Interesting reads:

– Jacob Walker, Carl Doersch, Abhinav Gupta, Martial Hebert. "An Uncertain Future: Forecasting from Static Images using Variational Autoencoders." ECCV 2016.
– Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, Ole Winther. "Autoencoding beyond pixels using a learned similarity metric." ICML 2016.
– Aditya Deshpande, Jiajun Lu, Mao-Chuang Yeh, David Forsyth. "Learning Diverse Image Colorization." arXiv 2016.
– Raymond Yeh, Ziwei Liu, Dan B Goldman, Aseem Agarwala. "Semantic Facial Expression Editing using Autoencoded Flow." arXiv 2016.
– Diederik P. Kingma, Danilo J. Rezende, Shakir Mohamed, Max Welling. "Semi-Supervised Learning with Deep Generative Models." NIPS 2014.