

SLIDE 1

Marrying Graphical Models & Deep Learning

Max Welling, University of Amsterdam

UvA-Qualcomm QUVA Lab

Canadian Institute for Advanced Research

SLIDE 2

Overview:

  • Machine Learning as Computational Statistics
  • Graphical Models:
    • Bayes nets
    • MRFs
    • Latent variable models
  • Inference:
    • Variational inference
    • MCMC
  • Learning:
    • EM
    • Amortized EM
    • Variational autoencoder
  • Generative versus discriminative modeling
  • Deep Learning:
    • CNNs
    • Dropout
    • Bayesian inference
    • Bayesian deep models
    • Compression

1

SLIDE 3

ML as Statistics

  • Data: a set of observations (unsupervised) or of input-target pairs (supervised).
  • Optimize an objective:
    • maximize the log-likelihood (unsupervised), or
    • minimize a loss (supervised).
  • ML is more than an optimization problem: it is a statistical inference problem.
  • E.g.: you should not optimize parameters more precisely than the scale at which the MLE fluctuates under resampling of the data, or you risk overfitting (see the sketch below).
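To make the last point concrete, here is a minimal sketch (not from the slides): it uses the bootstrap to estimate how much a maximum-likelihood estimate fluctuates under resampling of the data. All numbers are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=1.7, scale=2.0, size=200)          # a toy data set

# MLE of a Gaussian mean is the sample mean.
mle = x.mean()

# Bootstrap: recompute the MLE on resampled versions of the data.
boot = np.array([rng.choice(x, size=x.size, replace=True).mean()
                 for _ in range(1000)])

print(f"MLE = {mle:.3f}, fluctuation under resampling = {boot.std():.3f}")
# Optimizing the mean to a precision much finer than boot.std() would be fitting noise.
```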

2

SLIDE 4

Bias Variance Tradeoff

http://scott.fortmann-roe.com/docs/BiasVariance.html

3

SLIDE 5

Graphical Models

  • A graphical representation to concisely represent (conditional) independence relations between variables.
  • There is a one-to-one correspondence between the dependencies implied by the graph and the probabilistic model.
  • E.g. Bayes Nets

P(all) = P(traffic-jam | rush-hour, bad-weather, accident) × P(sirens | accident) × P(accident | bad-weather) × P(bad-weather) × P(rush-hour)

rush-hour independent of bad-weather  ⟺  Σ_{traffic-jam, sirens, accident} P(all) = P(rush-hour) P(bad-weather)
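A minimal sketch (not from the slides) of this factorization as a small lookup-table Bayes net; it sums out traffic-jam, sirens and accident and checks that the marginal factorizes into P(rush-hour) P(bad-weather). All probability values are invented for illustration.

```python
import itertools

# Made-up conditional probability tables for binary variables (1 = true, 0 = false).
def p_rush(r):
    return 0.3 if r else 0.7                      # P(rush-hour)

def p_weather(w):
    return 0.2 if w else 0.8                      # P(bad-weather)

def p_accident(a, w):                             # P(accident | bad-weather)
    p = 0.10 if w else 0.02
    return p if a else 1 - p

def p_sirens(s, a):                               # P(sirens | accident)
    p = 0.80 if a else 0.05
    return p if s else 1 - p

def p_jam(j, r, w, a):                            # P(traffic-jam | rush-hour, bad-weather, accident)
    p = 0.05 + 0.4 * r + 0.2 * w + 0.3 * a
    return p if j else 1 - p

def p_all(j, s, a, w, r):                         # the factorization from the slide
    return p_jam(j, r, w, a) * p_sirens(s, a) * p_accident(a, w) * p_weather(w) * p_rush(r)

# Summing out traffic-jam, sirens and accident leaves P(rush-hour) P(bad-weather),
# i.e. rush-hour and bad-weather are marginally independent.
for r, w in itertools.product([0, 1], repeat=2):
    marg = sum(p_all(j, s, a, w, r) for j, s, a in itertools.product([0, 1], repeat=3))
    assert abs(marg - p_rush(r) * p_weather(w)) < 1e-12
print("checked: sum over {traffic-jam, sirens, accident} of P(all) = P(rush-hour) P(bad-weather)")
```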

4

SLIDE 6

Rush-hour independent of bad-weather


5

SLIDE 7

Markov Random Fields

Source: Bishop

  • Undirected edges.
  • (Conditional) independence relationships are easy to read off: A is independent of B given C if all paths from A to B are blocked by C.
  • The probability distribution factorizes over maximal cliques (the largest completely connected subgraphs): $P(x) = \frac{1}{Z} \prod_{c} \psi_c(x_c)$.
  • Hammersley-Clifford Theorem: if P(x) > 0 for all x, then the (conditional) independencies of P match those of the graph.

6

SLIDE 8

Latent Variable Models

  • Introducing latent (unobserved) variables will dramatically increase the capacity of a model.
  • Problem: P(Z|X) is intractable for most nontrivial models

7

SLIDE 9

Approximate Inference

Two main families: variational inference and sampling.

[Figure: the variational family Q as a subset of all probability distributions, with q* the member of Q closest to the true posterior p.]

Variational inference:
  • Deterministic
  • Biased
  • Local minima
  • Easy to assess convergence

Sampling:
  • Stochastic (sample error)
  • Unbiased
  • Hard to mix between modes
  • Hard to assess convergence

8

SLIDE 10

Independence Samplers & MCMC

Generating independent samples: sample from a proposal g and suppress samples with low p(θ|X), e.g. (a) rejection sampling, (b) importance sampling.

  • Does not scale to high dimensions.

Markov Chain Monte Carlo:

  • Make steps by perturbing the previous sample.
  • The probability of visiting a state is equal to P(θ|X).

[Figure: a proposal distribution g overlaid on the target p(θ|X).]

9

SLIDE 11

Sampling 101 – What is MCMC?

Given a target distribution S0, design a transition kernel $T(\theta_{t+1}|\theta_t)$ such that $p_t(\theta_t) \rightarrow S_0$ as $t \rightarrow \infty$.

  • Burn-in: throw away the initial samples θ0, θ1, ... taken before the chain reaches S0.
  • After burn-in, the θt are (correlated) samples from S0.
  • Autocorrelation time τ: the number of steps between roughly independent samples.

[Figure: two trace plots of the last position coordinate over 1000 iterations, one chain with high τ and one with low τ.]

$$I = \langle f \rangle_{S_0} \approx \hat I = \frac{1}{T}\sum_{t=1}^{T} f(\theta_t), \qquad \mathrm{Bias}(\hat I) = E[\hat I - I] = 0, \qquad \mathrm{Var}(\hat I) = \frac{\tau\,\mathrm{Var}(f)}{T}$$
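A minimal sketch (not from the slides) of how τ enters the variance formula above; it uses an AR(1) chain with a known autocorrelation time as a stand-in for MCMC output and estimates τ and Var(Î) from the samples. The truncation rule for the autocorrelation sum is a crude choice made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# A correlated chain standing in for MCMC output: AR(1) with stationary distribution N(0, 1).
rho, T = 0.9, 50_000
theta = np.empty(T)
theta[0] = rng.normal()
for t in range(T - 1):
    theta[t + 1] = rho * theta[t] + np.sqrt(1 - rho**2) * rng.normal()

f = theta                 # estimate I = E[f(theta)] = 0 with f the identity function
fc = f - f.mean()

# Integrated autocorrelation time: tau = 1 + 2 * sum_k acf(k), summed until the acf dies out.
max_lag = 200
acf = np.array([np.dot(fc[:T - k], fc[k:]) / ((T - k) * fc.var()) for k in range(max_lag)])
cutoff = np.argmax(acf < 0.05) or max_lag       # crude truncation of the sum
tau = 1 + 2 * acf[1:cutoff].sum()

print(f"I_hat = {f.mean():.4f}")
print(f"estimated tau = {tau:.1f} (theory for this AR(1) chain: {(1 + rho) / (1 - rho):.1f})")
print(f"Var(I_hat) ≈ tau * Var(f) / T = {tau * f.var() / T:.2e}")
```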

10

SLIDE 12

Sampling 101 – Metropolis-Hastings

Transition kernel $T(\theta_{t+1}|\theta_t)$:

  • Propose: $\theta' \sim q(\theta'|\theta_t)$
  • Accept/reject test:

$$\theta_{t+1} \leftarrow \begin{cases} \theta' & \text{with probability } P_a \\ \theta_t & \text{with probability } 1 - P_a \end{cases}
\qquad P_a = \min\!\left(1,\; \frac{q(\theta_t|\theta')}{q(\theta'|\theta_t)}\,\frac{S_0(\theta')}{S_0(\theta_t)}\right)$$

  • Intuition: is the new state more probable? Is it easy to come back to the current state?
  • For Bayesian posterior inference, $S_0(\theta) \propto p(\theta)\prod_{i=1}^{N} p(x_i|\theta)$, so every step touches all N data points:
    1) burn-in is unnecessarily slow, and
    2) $\mathrm{Var}[\hat I] \propto 1/T$ is too high.
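A minimal sketch (not from the slides) of a random-walk Metropolis-Hastings sampler on a toy one-dimensional target; the target, step size and chain length are made up for illustration.

```python
import numpy as np

def metropolis_hastings(log_target, theta0, n_steps, step_size, rng):
    """Random-walk Metropolis: the proposal q(theta'|theta) = N(theta, step_size^2)
    is symmetric, so the ratio q(theta_t|theta')/q(theta'|theta_t) cancels in P_a."""
    theta = theta0
    samples = np.empty(n_steps)
    for t in range(n_steps):
        proposal = theta + step_size * rng.normal()
        log_pa = min(0.0, log_target(proposal) - log_target(theta))   # log P_a
        if np.log(rng.uniform()) < log_pa:                            # accept with probability P_a
            theta = proposal
        samples[t] = theta
    return samples

def log_target(th):
    # Unnormalized log density of a mixture of two unit-variance Gaussians at -2 and +2.
    return np.logaddexp(-0.5 * (th - 2.0) ** 2, -0.5 * (th + 2.0) ** 2)

rng = np.random.default_rng(0)
chain = metropolis_hastings(log_target, theta0=0.0, n_steps=20_000, step_size=1.0, rng=rng)
burn_in = 1_000                      # throw away samples taken before the chain reached S_0
print("mean under S_0 ≈", chain[burn_in:].mean())     # ≈ 0 by symmetry of the target
```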

11

SLIDE 13

Approximate MCMC

[Figure: samples from an approximate stationary distribution S_ε versus the exact target S_0. A fast, biased sampler gives low variance but high bias; a slow, exact sampler gives low bias but high variance. Decreasing ε trades variance for bias.]

12

SLIDE 14

Minimizing Risk

[Figure: bias², variance and risk as a function of ε, for a fixed computational time budget.]

$$\mathrm{Risk} = E\big[(I - \hat I)^2\big] = \mathrm{Bias}^2 + \mathrm{Variance} = \big(\langle f \rangle_P - \langle f \rangle_{P_\epsilon}\big)^2 + \sigma^2 \tau / T$$

  • Given finite sampling time, ε = 0 is not the optimal setting.

13

SLIDE 15

Stochastic Gradient Langevin Dynamics (Welling & Teh 2011)

[Figure: a 2x2 diagram relating four update rules. Gradient ascent becomes stochastic gradient ascent when the full-data gradient is replaced by a minibatch estimate. Langevin dynamics (a gradient step plus Gaussian noise, followed by a Metropolis-Hastings accept step) becomes stochastic gradient Langevin dynamics by the same substitution, with the Metropolis-Hastings accept step dropped.]
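A minimal sketch (not from the slides) of the SGLD update on a toy conjugate model: each step takes half a step size times a minibatch estimate of the gradient of the log posterior, adds Gaussian noise with variance equal to the step size, and uses no Metropolis-Hastings accept step. Model, data and step size are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: x_i ~ N(theta, 1), prior theta ~ N(0, 10).
N = 10_000
x = rng.normal(loc=1.0, scale=1.0, size=N)

theta, n, eps = 0.0, 100, 1e-4          # state, minibatch size, step size
samples = []
for t in range(5_000):
    batch = x[rng.choice(N, size=n, replace=False)]
    grad_log_prior = -theta / 10.0
    grad_log_lik = (N / n) * np.sum(batch - theta)          # minibatch estimate of the full gradient
    noise = np.sqrt(eps) * rng.normal()                     # injected Gaussian noise, variance eps
    theta = theta + 0.5 * eps * (grad_log_prior + grad_log_lik) + noise   # no MH accept step
    samples.append(theta)

# For this conjugate model the true posterior mean is close to x.mean().
print("SGLD posterior mean ≈", np.mean(samples[1_000:]), " (x.mean() =", x.mean(), ")")
```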

14

SLIDE 16

Demo: Stochastic Gradient LD

15

SLIDE 17

A Closer Look …

large stepsize

16

SLIDE 18

A Closer Look …

small stepsize

17

SLIDE 19

Demo SGLD: large stepsize

18

SLIDE 20

Demo SGLD: small stepsize

19

SLIDE 21

Variational Inference

  • Choose a tractable family of distributions Q (e.g. Gaussian, discrete).
  • Minimize over Q: $KL[\,Q(Z|\Phi)\,\|\,P(Z|X)\,]$.
  • Equivalent to maximizing the variational lower bound over $\Phi$:

$$\mathcal{B}(Q) = E_{Q(Z|\Phi)}\big[\log P(X|Z) + \log P(Z) - \log Q(Z|\Phi)\big] \le \log P(X)$$

[Figure: the variational family Q inside the set of all distributions, with the optimal Φ selecting the member of Q closest in KL to the target P.]

20

SLIDE 22

Learning: Expectation Maximization

E-step: $Q(Z) \leftarrow \arg\max_Q \mathcal{B}(Q, \Theta)$, i.e. set $Q(Z) = P(Z|X, \Theta)$ or approximate it (variational inference).

Bound: $\log P(X|\Theta) \ge \mathcal{B}(Q, \Theta) = E_{Q(Z)}\big[\log P(X, Z|\Theta) - \log Q(Z)\big]$

M-step: $\Theta \leftarrow \arg\max_\Theta \mathcal{B}(Q, \Theta)$ (approximate learning).

Gap: $\log P(X|\Theta) - \mathcal{B}(Q, \Theta) = KL[\,Q(Z)\,\|\,P(Z|X, \Theta)\,]$ (a worked numerical example follows below).
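A minimal sketch (not from the slides) of EM for a two-component Gaussian mixture, a case where the E-step posterior is tractable so no variational approximation is needed; the data and initialization are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.0, 700)])   # toy data

# Theta = (mixing weights, means, standard deviations), crudely initialized.
pi = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
sigma = np.array([1.0, 1.0])

for it in range(50):
    # E-step: Q(z_i = k) = responsibility of component k for data point i (exact posterior here).
    log_r = np.log(pi) - np.log(sigma) - 0.5 * ((x[:, None] - mu) / sigma) ** 2
    r = np.exp(log_r - log_r.max(axis=1, keepdims=True))
    r /= r.sum(axis=1, keepdims=True)

    # M-step: maximize the bound B(Q, Theta) over Theta with Q held fixed.
    Nk = r.sum(axis=0)
    pi = Nk / x.size
    mu = (r * x[:, None]).sum(axis=0) / Nk
    sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk)

print("weights:", pi.round(3), "means:", mu.round(3), "stds:", sigma.round(3))
```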

21

SLIDE 23

Amortized Inference

  • By making q(z|x) a function of x with shared parameters φ, we can do very fast inference at test time (i.e. we avoid iterative optimization of q_test(z) for every new data point).

22

SLIDE 24

Deep NN as a glorified conditional distribution

[Figure: a deep neural network mapping an input X to the parameters of a conditional distribution P(Y|X).]

23

SLIDE 25

The “Deepify” Operator

  • Find a graphical model with conditional distributions and replace those with a deep NN.
  • Logistic regression → deep NN (see the sketch below).
  • "Deep survival analysis": replace Cox's proportional hazard function with a deep NN.
  • Latent variable model: replace the generative and recognition models with deep NNs → the "Variational Autoencoder" (VAE).
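A minimal sketch (not from the slides) of the second bullet in PyTorch: the same Bernoulli likelihood P(y|x), first with a linear logit (logistic regression) and then "deepified" with a multi-layer logit; the input dimension and layer sizes are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# P(y=1 | x) with a linear logit: plain logistic regression.
logistic_regression = nn.Linear(20, 1)

# The "deepified" version: same Bernoulli likelihood, but the logit is a deep NN of x.
deep_net = nn.Sequential(
    nn.Linear(20, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 1),
)

x = torch.randn(8, 20)                       # toy minibatch
y = torch.randint(0, 2, (8, 1)).float()
nll = nn.BCEWithLogitsLoss()                 # negative Bernoulli log-likelihood
print("logistic:", nll(logistic_regression(x), y).item(),
      " deep:", nll(deep_net(x), y).item())
```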

24

SLIDE 26

Variational Autoencoder

[Figure: the latent variable model, with both the generative model p(x|z) and the recognition model q(z|x) "deepified" into neural networks.]

25

SLIDE 27

Deep Generative Model: The Variational Auto-Encoder

[Figure: the recognition model Q (x → deterministic hidden layers h → μ, σ of the latent code z) alongside the generative model P (z → deterministic hidden layers h → p(x)). Legend: deterministic NN node, unobserved stochastic node, observed stochastic node; both mappings are deep neural nets.]
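A minimal sketch (not from the slides) of such a VAE in PyTorch, assuming MNIST-sized inputs (784 pixels) and a 20-dimensional latent code; it combines the recognition network, a reparametrized sample of z (introduced two slides further on) and the negative ELBO in one module.

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=20, h_dim=400):
        super().__init__()
        # Recognition model Q(z|x): a deep net producing the mean and log-variance of z.
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.enc_mu = nn.Linear(h_dim, z_dim)
        self.enc_logvar = nn.Linear(h_dim, z_dim)
        # Generative model P(x|z): a deep net producing Bernoulli logits per pixel.
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(), nn.Linear(h_dim, x_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)     # reparametrized sample of z
        logits = self.dec(z)
        # Negative ELBO = expected reconstruction error + KL[Q(z|x) || N(0, I)].
        rec = nn.functional.binary_cross_entropy_with_logits(logits, x, reduction="sum")
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        return (rec + kl) / x.shape[0]

torch.manual_seed(0)
vae = VAE()
x = torch.rand(16, 784)              # stand-in for a minibatch of (soft-)binarized images
loss = vae(x)
loss.backward()
print("negative ELBO per example:", loss.item())
```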

26

SLIDE 28

Stochastic Variational Bayesian Inference

Sample Z (very high variance) and subsample a mini-batch of X:

$$\mathcal{B}(Q) = \sum_Z Q(Z|X, \Phi)\,\big(\log P(X|Z, \Theta) + \log P(Z) - \log Q(Z|X, \Phi)\big)$$

$$\nabla_\Phi \mathcal{B}(Q) = \sum_Z Q(Z|X, \Phi)\, \nabla_\Phi \log Q(Z|X, \Phi)\,\big(\log P(X|Z, \Theta) + \log P(Z) - \log Q(Z|X, \Phi)\big)$$

$$\nabla_\Phi \mathcal{B}(Q) \approx \frac{1}{N}\,\frac{1}{S} \sum_{i=1}^{N} \sum_{s=1}^{S} \nabla_\Phi \log Q(Z_{is}|X_i, \Phi)\,\big(\log P(X_i|Z_{is}, \Theta) + \log P(Z_{is}) - \log Q(Z_{is}|X_i, \Phi)\big)$$

27
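A minimal sketch (not from the slides) of the score-function (log-derivative) estimator used in the last line, on a toy problem where the true gradient is known; it makes the "very high variance" annotation measurable.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, S = 1.5, 10_000
z = rng.normal(loc=mu, scale=1.0, size=S)

# Score-function estimator of d/dmu E_{z~N(mu,1)}[z^2]:
#   grad = E[ f(z) * d/dmu log N(z; mu, 1) ] = E[ z^2 * (z - mu) ].
per_sample = z**2 * (z - mu)

print("true gradient           :", 2 * mu)              # d/dmu (mu^2 + 1) = 2 mu
print("score-function estimate :", per_sample.mean())
print("per-sample std          :", per_sample.std())
```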

SLIDE 29

Reducing the Variance: The Reparametrization Trick

28

Kingma 2013, Bengio 2013, Kingma & Welling 2014

  • Reparametrization: write $z = g(\epsilon, \Phi)$ with $\epsilon \sim P(\epsilon)$ independent of $\Phi$.
  • Applied to the VAE:

$$\nabla_\Phi \mathcal{B}(\Theta, \Phi) = \nabla_\Phi \int dz\, Q_\Phi(z|x)\,\big[\log P_\Theta(x, z) - \log Q_\Phi(z|x)\big] \approx \nabla_\Phi \big[\log P_\Theta(x, z_s) - \log Q_\Phi(z_s|x)\big]_{z_s = g(\epsilon_s, \Phi)}, \qquad \epsilon_s \sim P(\epsilon)$$

  • Example: estimating $\nabla_\mu \int dz\, \mathcal{N}_z(\mu, \sigma)\, z$
    • without reparametrization: $\frac{1}{S}\sum_s z_s (z_s - \mu)/\sigma^2$, with $z_s \sim \mathcal{N}_z(\mu, \sigma)$
    • with reparametrization: $\frac{1}{S}\sum_s 1$, with $\epsilon_s \sim \mathcal{N}_\epsilon(0, 1)$ and $z = \mu + \sigma\epsilon$
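The reparametrized counterpart of the previous sketch (again not from the slides, and self-contained): on the same toy problem the per-sample spread of the gradient estimate drops substantially.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, S = 1.5, 10_000                      # same toy problem: d/dmu E_{z~N(mu,1)}[z^2] = 2*mu

# Without reparametrization (score-function estimator):
z = rng.normal(loc=mu, scale=1.0, size=S)
score = z**2 * (z - mu)

# With reparametrization: z = mu + eps, eps ~ N(0,1), so d/dmu z^2 = 2*z at z = mu + eps.
eps = rng.normal(size=S)
reparam = 2 * (mu + eps)

print("true gradient:", 2 * mu)
print(f"score-function : mean {score.mean():.3f}, per-sample std {score.std():.2f}")
print(f"reparametrized : mean {reparam.mean():.3f}, per-sample std {reparam.std():.2f}")
```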

SLIDE 30

Semi-Supervised VAE I

[Figure: the recognition model Q and the generative model P, with deterministic layers h, latent code z, and a label y that is a sometimes-observed stochastic node.]

D.P. Kingma, D.J. Rezende, S. Mohamed, M. Welling, NIPS 2014

Objective: the normal VB objective, plus an extra term boosting the influence of the classifier q(y|x) on labelled data.

SLIDE 31
SLIDE 32

Discriminative or Generative?

  • Advantages of generative models:
    • Inject expert knowledge
    • Model causal relations
    • Interpretable
    • Data efficient
    • More robust to domain shift
    • Facilitate un/semi-supervised learning
  • Advantages of discriminative models:
    • Flexible map from input to target (low bias)
    • Efficient training algorithms available
    • Solve the problem you are evaluating on
    • Very successful and accurate!

Example methods, roughly ordered from discriminative to generative: Deep Learning, Kernel Methods, Random Forests, Boosting, Bayesian Networks, Probabilistic Programs, Simulator Models. The Variational Auto-Encoder sits in between.

SLIDE 33

Big N vs. Small N?

N = 10^8-10^9:

  • Customer Intelligence
  • Finance
  • Video/image
  • Internet of Things
  • → we need computational efficiency

N = 100-1000:

  • Healthcare (p >> N)
  • Generative, causal models generalize much better to new, unknown situations (domain invariance)
  • → we need statistical efficiency

32

SLIDE 34

Combining Generative and Discriminative Models

[Figure: a spectrum of model components, from "use physics, use causality, use expert knowledge" on the generative side to a black-box DNN/CNN on the discriminative side.]

SLIDE 35

Deep Convolutional Networks

Forward: filter, subsample, filter, nonlinearity, subsample, ..., classify.
Backward: backpropagation (propagate the error signal backward).

34

  • Input dimensions have "topology": 1D speech, 2D images, 3D MRI, 2+1D video, 4D fMRI. (A minimal sketch follows below.)
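A minimal sketch (not from the slides) of the forward and backward pass in PyTorch, assuming 28x28 grayscale inputs and 10 classes; filter sizes and channel counts are arbitrary.

```python
import torch
import torch.nn as nn

# Forward pass: filter, subsample, filter, nonlinearity, subsample, ..., classify.
cnn = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=5, padding=2), nn.ReLU(),   # filter + nonlinearity
    nn.MaxPool2d(2),                                          # subsample
    nn.Conv2d(16, 32, kernel_size=5, padding=2), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 7 * 7, 10),                                # classify into 10 classes
)

x = torch.randn(8, 1, 28, 28)          # a toy minibatch of 28x28 grayscale images
logits = cnn(x)
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 10, (8,)))
loss.backward()                        # backward pass: backpropagate the error signal
print(logits.shape)                    # torch.Size([8, 10])
```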
SLIDE 36

Dropout

35

SLIDE 37

Example: Dermatology

36

SLIDE 38

37

SLIDE 39

38

SLIDE 40

Example: Retinopathy

39

SLIDE 41

What do these Problems have in common?

It’s the same CNN in all cases: Inception-v3

40

SLIDE 42

So..., CNNs work really well.

However:

  • They are way too big
  • They consume too much energy
  • They use too much memory
  • → we need to make them more efficient!

41

SLIDE 43

Reasons for Bayesian Deep Learning

  • Automatic model selection / pruning
  • Automatic regularization
  • Realistic prediction uncertainty (important for decision making)

Examples: computer-aided diagnosis, autonomous driving.

SLIDE 44

Example

Increased uncertainty away from data

SLIDE 45

Bayesian Learning


Complex models can have lower marginal likelihood:

$$P(X|M) = \int d\Theta\, P(X|\Theta, M)\, P(\Theta|M) \quad \text{(model evidence)}$$

$$P(\Theta|X, M) = \frac{P(X|\Theta, M)\, P(\Theta|M)}{P(X|M)} \quad \text{(posterior)}$$

$$P(x|X, M) = \int d\Theta\, P(x|\Theta, M)\, P(\Theta|X, M) \quad \text{(prediction)}$$

$$P(M|X) = \frac{P(X|M)\, P(M)}{P(X)} \quad \text{(model selection)}$$

$$P(X) = \sum_M P(X|M)\, P(M) \quad \text{(evidence)}$$
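A minimal sketch (not from the slides) of the "complex models can have lower marginal likelihood" point, using a Beta-Bernoulli coin model where the evidence integral is available in closed form; the data are made up for illustration.

```python
import numpy as np
from scipy.special import betaln

# Data X: a particular sequence of 10 coin flips containing 5 heads.
n, k = 10, 5

# Model M0: a fair coin, theta fixed at 0.5 (no free parameters).
log_evidence_m0 = n * np.log(0.5)

# Model M1: unknown bias theta with a uniform Beta(1,1) prior; integrate theta out:
#   P(X|M1) = integral_0^1 theta^k (1-theta)^(n-k) dtheta = B(k+1, n-k+1).
log_evidence_m1 = betaln(k + 1, n - k + 1)

print("log P(X|M0) =", log_evidence_m0)      # simpler model
print("log P(X|M1) =", log_evidence_m1)      # more flexible model, lower evidence here
# The flexible model is penalized for spreading its prior mass over biases it did not need.
```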

SLIDE 46

Variational Bayes

$$\log P(X) \ge \int d\Theta\, Q(\Theta)\,\big[\log P(X|\Theta) + \log P(\Theta) - \log Q(\Theta)\big] \equiv \mathcal{B}(Q(\Theta)|X)$$

$$= E_{Q(\Theta)}[\log P(X|\Theta)] - KL[Q(\Theta)\,\|\,P(\Theta)]$$

45

SLIDE 47

Sparsifying & Compressing CNNs

  • DNNs are vastly overparameterized (e.g. distillation, Bucilua et al 2006).
  • Interpret variational bound as coding cost for data transmission (minimum description length)
  • Idea: learn a soft weight sharing prior, a.k.a. quantize the weights (Nowlan & Hinton 1991, Ullrich et al 2016)

$$\mathcal{B}(Q(\Theta)|X) = \underbrace{E_{Q(\Theta)}[\log P(X|\Theta)]}_{\text{error loss, } \sim N} - \underbrace{KL[Q(\Theta)\,\|\,P(\Theta)]}_{\text{complexity loss, } \sim \text{const.}}$$

46

SLIDE 48

Full Bayesian Deep Learning

THE PLAN:

  • Marginalize out the weights for the price of introducing stochastic hidden units.
  • Reinterpret the stochasticity on the hidden units as dropout noise.
  • Use sparsity-inducing priors to prune weights / hidden units.

(This works because the signal in NNs is very robust to noise addition (e.g. dropout), and "neurons" act as bottlenecks in the flow of information.)

SLIDE 49

Stochastic Variational Bayes

Sample $\Theta$ (very high variance) and subsample a mini-batch of X:

$$\mathcal{B}(Q(\Theta)|X) = \int d\Theta\, Q(\Theta)\,\big[\log P(X|\Theta) + \log P(\Theta) - \log Q(\Theta)\big]$$

$$\nabla_\Phi \mathcal{B} = \int d\Theta\, Q_\Phi(\Theta)\, \nabla_\Phi \log Q_\Phi(\Theta)\,\big[\log P(X|\Theta) + \log P(\Theta) - \log Q_\Phi(\Theta)\big]$$

$$\nabla_\Phi \mathcal{B} \approx \frac{1}{S} \sum_{s=1}^{S} \nabla_\Phi \log Q_\Phi(\Theta_s) \left[ \frac{N}{n} \sum_{i=1}^{n} \log P(x_i|\Theta_s) + \log P(\Theta_s) - \log Q_\Phi(\Theta_s) \right]$$

  • Reparametrization? Yes, but it is not enough: using the same sample $\Theta_s$ for all data cases $x_i$ in the minibatch induces correlations between data cases and thus high variance in the gradient.

48

SLIDE 50

Local Reparametrization

  • Write the likelihood in terms of the weight matrix: $P(X|\Theta) \rightarrow P(Y|W, X)$.
  • Reparametrize the pre-activations $F = XW$ directly: their distribution under $Q(W)$ can be computed exactly, so the hidden units become stochastic and correlated.
  • We draw different samples $F_{is}$ for different data cases in the minibatch (and this is much less expensive than resampling all the weights independently per data case).
  • Conclusion: using this trick we can further reduce the variance of the gradients.

Kingma, Salimans & Welling 2015
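A minimal sketch (not from the slides) of local reparametrization for a single linear layer in PyTorch, assuming a fully factorized Gaussian posterior over the weights; names, sizes and initial values are made up for illustration.

```python
import torch

def local_reparam_linear(x, w_mu, w_logvar):
    """Sample the pre-activations F = X W directly: under a fully factorized Gaussian
    posterior on W, each pre-activation is Gaussian with moments computed exactly."""
    f_mu = x @ w_mu                                      # mean of F
    f_var = (x ** 2) @ torch.exp(w_logvar)               # variance of F
    return f_mu + torch.sqrt(f_var) * torch.randn_like(f_mu)   # one sample per data case

# Toy layer: 8 data cases, 20 inputs, 30 hidden units.
x = torch.randn(8, 20)
w_mu = torch.randn(20, 30, requires_grad=True)
w_logvar = torch.full((20, 30), -5.0, requires_grad=True)

f = local_reparam_linear(x, w_mu, w_logvar)
h = torch.relu(f)                                        # hidden units H = sigma(F)
h.sum().backward()                                       # gradients flow to w_mu and w_logvar
print(f.shape)                                           # torch.Size([8, 30])
```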

SLIDE 51

Two Layers

[Figure: a two-layer network with inputs X, weight matrices W1 and W2, pre-activations F and B, hidden units H = σ(B), and outputs Y.]

Now use the “normal” reparameterization trick

SLIDE 52

Variational Dropout

  • Pre-activations: $B = AW$, with layer inputs A and weights W.
  • If the weight posterior has the multiplicative form $Q(w_{ij}) = \mathcal{N}(w_{ij};\, \mu_{ij},\, \alpha\,\mu_{ij}^2)$, then the pre-activations B behave as if multiplicative dropout noise had been applied to the layer.
  • Conclusion: by using a special form of posterior we simulate dropout noise, i.e. dropout can be understood as variational Bayesian inference with multiplicative noise.

Y Gal, Z Ghahramani 2016, Dropout as a Bayesian approximation: Representing model uncertainty in deep learning S Wang, C Manning, Fast dropout training

SLIDE 53

Sparsity Inducing Priors

  • Prior: an improper, sparsity-inducing prior over the weights (the log-uniform prior of the cited papers).
  • Posterior: the variational dropout posterior $Q(w_{ij}) = \mathcal{N}(\mu_{ij},\, \alpha_{ij}\,\mu_{ij}^2)$.
  • Learn the dropout rate $\alpha_{ij}$. When $\alpha_{ij}$ grows large, the weight is pruned.

(Kingma, Salimans, Welling 2015; Molchanov, Ashukha, Vetrov 2017)

Conclusion: we can learn the dropout rates and prune unnecessary weights.
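A minimal sketch (not from the slides) of the resulting pruning rule, assuming a per-weight variational dropout posterior N(μ, αμ²) whose log α values have already been learned; here they are random placeholders, and the threshold on log α is a commonly used cutoff rather than anything prescribed by the slide.

```python
import torch

# Per-weight variational dropout parameters (placeholder values for illustration):
# posterior Q(w) = N(mu, alpha * mu^2), with alpha learned per weight.
mu = torch.randn(20, 30)
log_alpha = torch.empty(20, 30).uniform_(-6.0, 6.0)

# Prune weights whose learned dropout rate is large: alpha >> 1 means the posterior
# is dominated by noise, so the weight carries almost no information.
threshold = 3.0
mask = (log_alpha < threshold).float()
w_pruned = mu * mask

sparsity = 1.0 - mask.mean().item()
print(f"pruned {100 * sparsity:.1f}% of the weights")
```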

SLIDE 54

Variational Dropout

Animation: Molchanov, D., Ashukha, A. and Vetrov, D.

SLIDE 55

Animation: Molchanov, D., Ashukha, A. and Vetrov, D.

Fully connected layer

54

SLIDE 56

Node (instead of Weight) Sparsification

Use hierarchical prior:

(dropout multiplicative noise)

Prior-posterior pair

(Louizos, Ullrich, Welling, 2017)

55

Conclusion: by using special, hierarchical priors we can prune hidden units instead of individual weights (which is much better for compression).

$$P(W, z) = \prod_{\text{hidden units } i} p(z_i) \prod_{\text{units } j \text{ outgoing from node } i} P(w_{ij} \mid z_i)$$

SLIDE 57

Preliminary Results

(Louizos, Ullrich, Welling 2017, submitted)

  • Compression rate of a factor 700x with no loss in accuracy!
  • Compression rates for node sparsity are higher because encoding is cheaper.
  • Additional Bayesian bonus: by monitoring the posterior fluctuations of the weights one can determine their fixed-point precision.

56

SLIDE 58

Conclusions

  • Deep learning is not a silver bullet: it is mainly very good at signal processing (auditory, image data)
  • Optimization plays an important role in getting good solutions (e.g. reducing variance gradients)
  • But… deep learning is more than optimization, it’s also statistics!
  • DL can be successfully combined with ”classical” graphical models (as a glorified conditional distribution)
  • Bayesian DL has an elegant interpretation as principled dropout
  • Bayesian DL is ideally suited for compression
  • There is a lot we do not understand about DL:
  • Why do they not overfit? (It is easy to get 0 training error on data with random labels.)
  • Why does SGD regularize so effectively?
  • Strange behavior in the face of adversarial examples
  • Huge over-parameterization (up to 400x)

57