Bayesian Deep Learning – Prof. Leal-Taixé – PowerPoint PPT Presentation



SLIDE 1

Bayesian Deep Learning

SLIDE 2

Going full Bayesian


  • Bayes = Probabilities
  • Bayes Theorem

Evidence = data
Hypothesis = model
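In symbols (a standard statement of Bayes' theorem matching the labels above; the formula itself is not from the extracted slide):

    p(\text{hypothesis} \mid \text{evidence}) = \frac{p(\text{evidence} \mid \text{hypothesis})\, p(\text{hypothesis})}{p(\text{evidence})}

Here the hypothesis is the model and the evidence is the data.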

SLIDE 3

Going full Bayesian


  • Start with a prior on the model parameters
  • Choose a statistical model
  • Use data to refine my prior, i.e., compute the posterior

[Equation annotations: the prior p(θ) has no dependence on the data; θ = model parameters, D = data]

SLIDE 4

Going full Bayesian


  • Start with a prior on the model parameters
  • Choose a statistical model
  • Use data to refine my prior, i.e., compute the posterior

p(θ | D) = p(D | θ) p(θ) / p(D)   — posterior = likelihood × prior / evidence, computed from the data

SLIDE 5

Going full Bayesian


  • 1. Learning: Computing the posterior

– Finding a point estimate (MAP) → what we have been doing so far!
– Finding a probability distribution over the weights → this lecture

SLIDE 6

What have we learned so far?

  • Advantages of Deep Learning models

– Very expressive models
– Good for tasks such as classification, regression, sequence prediction
– Modular structure, efficient training, many tools
– Scales well with large amounts of data

  • But we also have disadvantages…

– ”Black-box” feeling
– We cannot judge how “confident” the model is about a decision

SLIDE 7

Modeling uncertainty

  • Wish list:

– We want to know what our models know and what they do not know

SLIDE 8

Modeling uncertainty

  • Example: I have built a dog breed classifier

[Figure: dog photo. Classes: Bulldog, German shepherd, Chihuahua] What answer will my NN give?

SLIDE 9

Modeling uncertainty

  • Example: I have built a dog breed classifier

[Figure: dog photo. Classes: Bulldog, German shepherd, Chihuahua] I would rather get as an answer that my model is not certain about the dog breed

SLIDE 10

Modeling uncertainty

  • Wish list:

– We want to know what our models know and what they do not know

  • Why do we care?

– Decision making
– Learning from limited, noisy, and missing data
– Insights on why a model failed

SLIDE 11

Modeling uncertainty


  • Finding the posterior

– Finding a point estimate (MAP) → what we have been doing so far!
– Finding a probability distribution over the weights

Image: https://medium.com/@joeDiHare/deep-bayesian-neural-networks-952763a9537

SLIDE 12

Modeling uncertainty


  • We can sample many times from the distribution and see how this affects our model’s predictions
  • If the predictions are consistent, the model is confident

Image: https://medium.com/@joeDiHare/deep-bayesian-neural-networks-952763a9537

SLIDE 13

Modeling uncertainty


[Figure: the model answers “I am not really sure”]

Kendall & Gal. “What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?“ NIPS 2017

SLIDE 14

How do we get the posterior?

  • Compute the posterior over the weights
  • Probability of observing our data under all possible model parameters


How do we compute this?
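In standard notation (the slide's own equation image did not survive extraction, so this is a reconstruction): the posterior over the weights θ given data D is

    p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{p(\mathcal{D})}, \qquad p(\mathcal{D}) = \int p(\mathcal{D} \mid \theta)\, p(\theta)\, d\theta

The integral over all possible model parameters is what makes the denominator intractable for deep networks.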

SLIDE 15

How do we get the posterior?

  • How do we compute this?
  • Denominator = we cannot compute all possible combinations
  • Two ways to compute the approximation of the posterior:


– Markov Chain Monte Carlo
– Variational Inference

SLIDE 16

How do we get the posterior?

  • Markov Chain Monte Carlo (MCMC)
    – A chain of samples that converges to the posterior p(θ | D) → SLOW
  • Variational Inference
    – Find an approximation q(θ) that is close to the true posterior

SLIDE 17

Dropout for Bayesian Inference

SLIDE 18

Recall: Dropout

  • Disable a random set of neurons (typically 50%)

Srivastava et al. 2014

[Figure: forward pass with a random subset of neurons disabled]
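As a minimal sketch of what this does in a forward pass (illustrative PyTorch-style code, not from the slides; uses the standard “inverted dropout” scaling):

    import torch

    def dropout_forward(x, p=0.5, train=True):
        if not train:
            return x  # at standard test time, dropout is a no-op
        # keep each activation with probability 1-p, rescale the survivors
        # by 1/(1-p) so the expected activation is unchanged
        mask = (torch.rand_like(x) > p).float() / (1 - p)
        return x * mask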

SLIDE 19

Recall: Dropout

  • Using half the network = half capacity

[Figure: redundant representations — furry, has two eyes, has a tail, has paws, has two ears]

SLIDE 20

Recall: Dropout

  • Using half the network = half capacity
    – Redundant representations
    – Base your scores on more features

  • Consider it as model ensemble
SLIDE 21

Recall: Dropout

  • Two models in one

[Figure: two dropout masks of the same network = Model 1 and Model 2]

SLIDE 22

MC dropout

  • Variational Inference
    – Find an approximation q(θ) that is close to the true posterior
  • Dropout training
    – The variational distribution is a Bernoulli distribution (where the states are “on” and “off”)


Y Gal, Z Ghahramani, “Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning”, ICML 2016
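In the notation of Gal & Ghahramani (a sketch of their construction, reconstructed here rather than copied from the slide): the variational distribution over each weight matrix is

    \mathbf{W}_i = \mathbf{M}_i \cdot \mathrm{diag}(\mathbf{z}_i), \qquad z_{i,j} \sim \mathrm{Bernoulli}(p_i)

so each unit of the learned matrix M_i is randomly kept (“on”) or zeroed (“off”) — exactly what dropout does.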

SLIDE 23

MC dropout

  • 1. Train a model with dropout before every weight layer
  • 2. Apply dropout at test time
    – Sampling is done in a Monte Carlo fashion, hence the name Monte Carlo dropout


Y Gal, Z Ghahramani, “Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning”, ICML 2016
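A minimal sketch of step 2 in PyTorch (illustrative; `model` stands for any network trained with dropout, and we assume it contains no batch-norm layers so that calling .train() only re-enables dropout):

    import torch

    @torch.no_grad()
    def mc_dropout_predict(model, x, T=50):
        model.train()  # keep dropout active at test time
        # T stochastic forward passes = T samples from the dropout distribution
        probs = torch.stack([torch.softmax(model(x), dim=-1) for _ in range(T)])
        mean = probs.mean(dim=0)  # Monte Carlo estimate of the prediction
        var = probs.var(dim=0)    # disagreement across samples = uncertainty
        return mean, var

If the T passes agree (low variance), the model is confident; if they disagree, it is not.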

SLIDE 24

MC dropout

    – Sampling is done in a Monte Carlo fashion, e.g.,

      p(y* | x*, D) ≈ (1/T) Σ_{t=1}^{T} p(y* | x*, θ_t),   θ_t ~ q(θ),

      where q(θ) is the dropout distribution


Y Gal, Z Ghahramani, “Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning”, ICML 2016

[Figure: parameter sampling → NN → classification]

SLIDE 25

Measure your model’s uncertainty


Kendall & Gal. “What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?“ NIPS 2017

SLIDE 26

Another look

SLIDE 27

Let us take another look

  • We know it is intractable, we approximate it
  • The denominator expresses how my data is generated

SLIDE 28

Let us take another look

  • We assume that the data is generated by some random process, involving an unobserved continuous random (latent) variable z
  • Generation process: p_θ(x) = ∫ p_θ(x | z) p(z) dz
  • Posterior: p_θ(z | x) = p_θ(x | z) p(z) / p_θ(x)
SLIDE 29

Let us take another look

  • Variational Inference
    – Find an approximation q_φ(z | x) that is close to the true posterior
  • My approximation is parameterized by a model (the encoder network)
SLIDE 30

Variational Autoencoders

SLIDE 31

Recall: Autoencoders

  • Encode the input into a representation (bottleneck) and reconstruct it with the decoder


[Figure: conv encoder → bottleneck → transpose-conv decoder]
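A minimal fully-connected sketch of this idea (illustrative PyTorch; the slide draws a convolutional version):

    import torch.nn as nn

    class Autoencoder(nn.Module):
        def __init__(self, in_dim=784, bottleneck=32):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                         nn.Linear(256, bottleneck))
            self.decoder = nn.Sequential(nn.Linear(bottleneck, 256), nn.ReLU(),
                                         nn.Linear(256, in_dim))

        def forward(self, x):
            z = self.encoder(x)     # compress into the bottleneck code
            return self.decoder(z)  # reconstruct the input from the code

Training minimizes a reconstruction loss between the input and the output, e.g. mean squared error.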

SLIDE 32

Variational Autoencoder

[Figure: conv encoder → bottleneck → transpose-conv decoder]

SLIDE 33

Variational Autoencoder

  • Latent space is now a distribution
  • Specifically, it is a Gaussian

[Figure: encoder maps the input to a latent distribution]

SLIDE 34

Variational Autoencoder

  • Latent space is now a distribution
  • Specifically, it is a Gaussian

[Figure: encoder outputs a mean and a diagonal covariance]

SLIDE 35

Variational Autoencoder

  • Latent space is now a distribution
  • Specifically, it is a Gaussian

[Figure: encoder outputs a mean and a diagonal covariance]
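A sketch of the changed encoder head (illustrative PyTorch; instead of a single code, the encoder predicts the mean and the diagonal covariance of a Gaussian over z):

    import torch.nn as nn

    class VAEEncoder(nn.Module):
        def __init__(self, in_dim=784, z_dim=32):
            super().__init__()
            self.backbone = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU())
            self.mu = nn.Linear(256, z_dim)       # mean of q(z | x)
            self.logvar = nn.Linear(256, z_dim)   # log of the diagonal covariance

        def forward(self, x):
            h = self.backbone(x)
            return self.mu(h), self.logvar(h)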

SLIDE 36

Variational Autoencoder

  • Back to our Bayesian view, our generation process was: p_θ(x) = ∫ p_θ(x | z) p(z) dz
  • Which is the denominator of the posterior: p_θ(z | x) = p_θ(x | z) p(z) / p_θ(x)

I want to optimize p_θ(x), the likelihood of my data

SLIDE 37

Variational Autoencoder

  • Loss function for a data point x: start from log p_θ(x) = E_{z ~ q_φ(z | x)}[ log p_θ(x) ]

I draw samples of the latent variable z from my encoder

SLIDE 38

Variational Autoencoder

  • Loss function for a data point

Expand log p_θ(x) with Bayes’ rule, which brings in the posterior p_θ(z | x)

SLIDE 39

Variational Autoencoder

  • Loss function for a data point

log p_θ(x) does not depend on z — inside the expectation it is just a constant

SLIDE 40

Variational Autoencoder

  • Loss function for a data point
SLIDE 41

Variational Autoencoder

  • Loss function for a data point

Two of the terms can be grouped into Kullback-Leibler divergences

SLIDE 42

Variational Autoencoder

  • Loss function for a data point

Annotations: a reconstruction loss; a KL term that measures how good my latent distribution is with respect to my prior; and a KL term to the true posterior — I still cannot express the shape of that distribution, but I know the KL is ≥ 0
SLIDE 43

Variational Autoencoder

  • Loss function for a data point

Loss function (lower bound)

SLIDE 44

Variational Autoencoder

  • Loss function for a data point
  • Optimize the lower bound with respect to both the encoder and decoder parameters

Loss function (lower bound)
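In its standard written form (as in Kingma & Welling; reconstructed here because the slide's equation images did not survive extraction):

    \log p_\theta(x) \;\ge\; \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] \;-\; D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big)

The first term is the reconstruction loss; the second keeps the approximate posterior close to the prior.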

SLIDE 45

Variational Autoencoder

  • Training

[Figure: encoder] Make the posterior distribution close to the prior (close to a unit Gaussian distribution)

SLIDE 46

Variational Autoencoder

  • Training

[Figure: encoder]

SLIDE 47

Variational Autoencoder

  • Training

[Figure: encoder → sample z]

SLIDE 48

Variational Autoencoder

  • Training

[Figure: encoder → sample z → decoder]

SLIDE 49

Variational Autoencoder

  • Training

[Figure: sample z → decoder] The output is also parameterized (the decoder predicts the parameters of p(x | z))

SLIDE 50

Variational Autoencoder

  • Training

Maximize the likelihood of reconstructing the input

SLIDE 51

Variational Autoencoder

  • For more details and the mathematical derivation:
  • Reparameterization trick that allows us to backprop
  • Kingma and Welling. “Auto-Encoding Variational Bayes“. ICLR 2014
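A minimal sketch of the reparameterization trick and the resulting loss (illustrative PyTorch; `encoder` returns (mu, logvar) as in the sketch above, and `decoder` is a hypothetical reconstruction network):

    import torch
    import torch.nn.functional as F

    def vae_loss(encoder, decoder, x):
        mu, logvar = encoder(x)
        # Reparameterization trick: z is a deterministic, differentiable
        # function of (mu, logvar) plus parameter-free noise, so gradients
        # can flow back through the sampling step.
        eps = torch.randn_like(mu)
        z = mu + torch.exp(0.5 * logvar) * eps
        recon = decoder(z)
        recon_loss = F.mse_loss(recon, x, reduction='sum')
        # Closed-form KL(q(z|x) || N(0, I)) for a diagonal Gaussian.
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        return recon_loss + kl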

SLIDE 52

How about generating data?

  • Training as seen before

http://kvfrans.com/variational-autoencoders-explained/

SLIDE 53

How about generating data?

  • After training, generate random samples

Sample z from the prior distribution (e.g., a unit Gaussian)
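As a sketch (same hypothetical decoder as above), generation is just decoding prior samples:

    z = torch.randn(16, 32)   # 16 draws from the unit-Gaussian prior p(z)
    samples = decoder(z)      # each decoded z is a newly generated data point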

SLIDE 54

Generating data


Each element of z encodes a different feature

SLIDE 55

Generating data


[Figure: traversing z varies the degree of smile and the head pose]

slide-56
SLIDE 56

Autoencoder vs VAE

[Figure: reconstructions — Autoencoder vs Variational Autoencoder vs Ground Truth]

https://github.com/kvfrans/variational-autoencoder

SLIDE 57

Autoencoder Overview

  • Autoencoders (AE)

– Reconstruct input
– Unsupervised learning
– Latent space features are useful

  • Variational Autoencoders (VAE)

– Probability distribution in latent space (e.g., Gaussian)
– Sample from model to generate output

SLIDE 58

Autoencoder Overview

  • Autoencoders (AE)

– Reconstruct input
– Unsupervised learning
– Latent space features are useful

  • Variational Autoencoders (VAE)

– Probability distribution in latent space (e.g., Gaussian)
– Interpretable latent space (head pose, smile)
– Sample from model to generate output

SLIDE 59

Generative models

SLIDE 60

Taxonomy of generative models


Figure from Ian Goodfellow, Tutorial on Generative Adversarial Networks, 2017

SLIDE 61

Taxonomy of generative models


Figure from Ian Goodfellow, Tutorial on Generative Adversarial Networks, 2017

SLIDE 62

Taxonomy of generative models


Figure from Ian Goodfellow, Tutorial on Generative Adversarial Networks, 2017

Define a more tractable density function

SLIDE 63

Taxonomy of generative models


Figure from Ian Goodfellow, Tutorial on Generative Adversarial Networks, 2017

I do not care about the shape, I just want to sample!

SLIDE 64

Next lectures

  • Next Monday 10th, more on Generative models
  • 3rd round of presentations this Friday → you will receive feedback about the presentations

  • Keep working on the projects!
SLIDE 65

Other references

  • Conditional Variational Autoencoders:

– Sohn, Kihyuk, Honglak Lee, and Xinchen Yan. “Learning Structured Output Representation using Deep Conditional Generative Models.” Advances in Neural Information Processing Systems. 2015.
– Xinchen Yan, Jimei Yang, Kihyuk Sohn, Honglak Lee. “Attribute2Image: Conditional Image Generation from Visual Attributes.” ECCV, 2016.

SLIDE 66

Other references

  • Interesting read:

– Jacob Walker, Carl Doersch, Abhinav Gupta, Martial Hebert. “An Uncertain Future: Forecasting from Static Images using Variational Autoencoders.” ECCV, 2016.
– Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, Ole Winther. “Autoencoding beyond pixels using a learned similarity metric.” ICML, 2016.
– Aditya Deshpande, Jiajun Lu, Mao-Chuang Yeh, David Forsyth. “Learning Diverse Image Colorization.” arXiv, 2016.
– Raymond Yeh, Ziwei Liu, Dan B Goldman, Aseem Agarwala. “Semantic Facial Expression Editing using Autoencoded Flow.” arXiv, 2016.
– Diederik P. Kingma, Danilo J. Rezende, Shakir Mohamed, Max Welling. “Semi-Supervised Learning with Deep Generative Models.” NIPS, 2014.
