
Marrying Graphical Models & Deep Learning - Max Welling

  1. Marrying Graphical Models & Deep Learning. Max Welling, University of Amsterdam, UvA-Qualcomm QUVA Lab, Canadian Institute for Advanced Research

  2. Overview: generative versus discriminative modeling; machine learning as computational statistics. Deep Learning: CNNs, Dropout, Bayesian deep models, compression. Graphical Models: Bayes nets, MRFs, latent variable models, Bayesian inference. Inference: variational inference, MCMC. Learning: EM, amortized EM, the variational autoencoder.

  3. ML as Statistics. Data: a set of observations. Optimize an objective: (unsupervised) maximize the log likelihood; (supervised) minimize a loss. ML is more than an optimization problem: it is a statistical inference problem. E.g., you should not optimize parameters more precisely than the scale at which the MLE fluctuates under resampling of the data, or you risk overfitting.
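The objectives referred to on this slide are not legible in the transcript; the standard forms they appear to describe are the following (a reconstruction, not the slide's exact notation):

```latex
% Unsupervised: maximize the log likelihood of the data
\theta^{*} \;=\; \arg\max_{\theta} \sum_{i=1}^{N} \log p(x_i \mid \theta)

% Supervised: minimize an empirical loss over input-target pairs
\theta^{*} \;=\; \arg\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} \ell\bigl(y_i, f(x_i; \theta)\bigr)
```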

  4. Bias-Variance Tradeoff. Source: http://scott.fortmann-roe.com/docs/BiasVariance.html

  5. Graphical Models. A graphical representation to concisely represent (conditional) independence relations between variables. There is a one-to-one correspondence between the dependencies implied by the graph and the probabilistic model. E.g., Bayes nets: P(all) = P(traffic-jam | rush-hour, bad-weather, accident) x P(sirens | accident) x P(accident | bad-weather) x P(bad-weather) x P(rush-hour). Rush-hour is independent of bad-weather: summing the joint over traffic-jam, sirens, and accident gives P(rush-hour, bad-weather) = P(rush-hour) P(bad-weather). (A numerical check follows below.)
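A minimal Python sketch of this factorization. The conditional probability tables are invented for illustration (the slide gives no numbers); only the graph structure is taken from the slide. Marginalizing out traffic-jam, sirens, and accident recovers P(rush-hour) P(bad-weather), confirming the stated independence.

```python
import itertools
import numpy as np

# Hypothetical conditional probability tables (the slide gives no numbers);
# every variable is binary: 1 = true, 0 = false.
p_rush = {1: 0.3, 0: 0.7}                 # P(rush-hour)
p_bad = {1: 0.2, 0: 0.8}                  # P(bad-weather)
p_acc1 = {1: 0.10, 0: 0.02}               # P(accident=1 | bad-weather)
p_sir1 = {1: 0.90, 0: 0.05}               # P(sirens=1 | accident)
p_jam1 = {(r, b, a): min(0.95, 0.05 + 0.4 * r + 0.2 * b + 0.5 * a)
          for r, b, a in itertools.product([0, 1], repeat=3)}  # P(jam=1 | r, b, a)

def joint(jam, sirens, acc, bad, rush):
    """P(all), factorized exactly as on the slide."""
    pj = p_jam1[(rush, bad, acc)] if jam else 1 - p_jam1[(rush, bad, acc)]
    ps = p_sir1[acc] if sirens else 1 - p_sir1[acc]
    pa = p_acc1[bad] if acc else 1 - p_acc1[bad]
    return pj * ps * pa * p_bad[bad] * p_rush[rush]

# Sum out traffic-jam, sirens, and accident: the marginal factorizes.
for rush, bad in itertools.product([0, 1], repeat=2):
    marg = sum(joint(j, s, a, bad, rush)
               for j, s, a in itertools.product([0, 1], repeat=3))
    assert np.isclose(marg, p_rush[rush] * p_bad[bad])
print("P(rush-hour, bad-weather) = P(rush-hour) P(bad-weather) for all values")
```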

  6. Rush-hour independent of bad-weather. (Figure.)

  7. Markov Random Fields (source: Bishop). Undirected edges. (Conditional) independence relationships are easy to read off: A is independent of B given C if all paths between A and B are blocked by C. The probability distribution factorizes over maximal cliques (a maximal clique is a largest completely connected subgraph). Hammersley-Clifford Theorem: if P(x) > 0 for all x, then all (conditional) independencies in P match those of the graph.
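The factorization the slide alludes to, written out in its standard form (the slide's own symbols are not legible in the transcript):

```latex
P(x) \;=\; \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(x_C),
\qquad
Z \;=\; \sum_{x} \prod_{C \in \mathcal{C}} \psi_C(x_C)
```

Here C ranges over the maximal cliques of the graph and the potentials ψ_C are non-negative functions of the variables in clique C.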

  8. Latent Variable Models. Introducing latent (unobserved) variables dramatically increases the capacity of a model. Problem: P(Z|X) is intractable for most nontrivial models.

  9. Approximate Inference. Variational inference: search within a tractable variational family Q for the member q* closest to p. Sampling: draw samples, in principle from the space of all probability distributions. Variational inference is deterministic but biased, can get stuck in local minima, and is easy to assess for convergence. Sampling is stochastic (sample error) but unbiased, has trouble mixing between modes, and is hard to assess for convergence.

  10. Independence Samplers & MCMC. Generating independent samples: sample from a proposal g and suppress samples with low p(θ|X), e.g. (a) rejection sampling, (b) importance sampling; this does not scale to high dimensions. Markov Chain Monte Carlo: make steps by perturbing the previous sample; the probability of visiting a state equals P(θ|X).
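A minimal rejection-sampling sketch in Python. The target and proposal are illustrative stand-ins, not anything specified on the slide: an unnormalized bimodal p(θ|X) and a wide Gaussian envelope g.

```python
import numpy as np

rng = np.random.default_rng(0)

def p_unnorm(theta):
    """Unnormalized target p(theta | X) -- an illustrative bimodal density."""
    return np.exp(-0.5 * (theta - 2.0) ** 2) + 0.5 * np.exp(-0.5 * (theta + 2.0) ** 2)

def g_pdf(theta, scale=4.0):
    """Wide Gaussian proposal g(theta) = N(0, scale^2)."""
    return np.exp(-0.5 * (theta / scale) ** 2) / (scale * np.sqrt(2.0 * np.pi))

def rejection_sample(n, scale=4.0, M=12.0):
    """Accept a proposal theta ~ g with probability p(theta) / (M * g(theta)).
    Requires p(theta) <= M * g(theta) everywhere; M = 12 suffices for this p and g."""
    samples = []
    while len(samples) < n:
        theta = rng.normal(0.0, scale)
        if rng.uniform() < p_unnorm(theta) / (M * g_pdf(theta, scale)):
            samples.append(theta)
    return np.array(samples)

draws = rejection_sample(5000)
print(draws.mean(), draws.std())  # samples concentrate around the two modes
```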

  11. Sampling 101 – What is MCMC? Given a target distribution S0, design a transition kernel T(θ_{t+1} | θ_t) such that p_t(θ_t) → S0 as t → ∞. Throw away the burn-in samples θ_0, θ_1, ..., then average the remaining T samples: Î = (1/T) Σ_{t=1}^{T} f(θ_t) ≈ ⟨f⟩_{S0}. The estimator is unbiased, Bias(Î) = E[Î − I] = 0, with Var(Î) = τ Var(f) / T, where τ is the autocorrelation time. (Figure: two traces of the last position coordinate over 1000 iterations, one with high τ and one with low τ.)

  12. Sampling 101 – Metropolis-Hastings. Transition kernel T(θ_{t+1} | θ_t): propose, then accept/reject. Propose θ' ∼ q(θ' | θ_t); accept with probability P_a = min{1, [q(θ_t | θ') S0(θ')] / [q(θ' | θ_t) S0(θ_t)]} (is the new state more probable, and is it easy to come back to the current state?). Set θ_{t+1} ← θ' with probability P_a and θ_{t+1} ← θ_t with probability 1 − P_a. For Bayesian posterior inference, S0(θ) ∝ p(θ) Π_{i=1}^{N} p(x_i | θ). Problems: 1) burn-in is unnecessarily slow; 2) Var[Î] ∝ 1/T is too high.
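A minimal random-walk Metropolis-Hastings sketch in Python. The model (x_i ~ N(θ, 1) with a broad Gaussian prior) is invented for illustration; with a symmetric proposal the q terms cancel from the acceptance probability.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_s0(theta, x, prior_std=10.0):
    """Unnormalized log posterior: log p(theta) + sum_i log p(x_i | theta)."""
    log_prior = -0.5 * (theta / prior_std) ** 2
    log_lik = -0.5 * np.sum((x - theta) ** 2)
    return log_prior + log_lik

def metropolis_hastings(x, n_steps=5000, step=0.5, theta0=0.0):
    """Random-walk MH: propose theta' = theta + noise, then accept/reject."""
    theta, samples = theta0, []
    log_p = log_s0(theta, x)
    for _ in range(n_steps):
        prop = theta + step * rng.normal()               # propose theta' ~ q(.|theta)
        log_p_prop = log_s0(prop, x)
        if np.log(rng.uniform()) < log_p_prop - log_p:   # accept/reject test
            theta, log_p = prop, log_p_prop
        samples.append(theta)
    return np.array(samples)

x = rng.normal(1.5, 1.0, size=100)                       # synthetic data
chain = metropolis_hastings(x)
print(chain[1000:].mean())                               # posterior mean after burn-in
```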

  13. Approximate MCMC. (Figure: a family of stationary distributions S_ϵ around the target S0. A large step size ϵ gives low variance (fast) but high bias; decreasing ϵ gives low bias but high variance (slow).)

  14. Minimizing Risk. Risk = Bias² + Variance: E[(I − Î)²] = (⟨f⟩_P − ⟨f⟩_{P_ϵ})² + σ²τ/T. Given finite sampling time, ϵ = 0 is not the optimal setting. (Figure: bias², variance and risk as a function of ϵ for several computational time budgets.)

  15. Stochastic Gradient Langevin Dynamics (Welling & Teh 2011). Gradient ascent becomes Langevin dynamics by injecting Gaussian noise into each step, followed by a Metropolis-Hastings accept step. Analogously, stochastic gradient ascent on mini-batches becomes Stochastic Gradient Langevin Dynamics; the Metropolis-Hastings accept step can be dropped as the step size goes to zero.
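A sketch of the SGLD update from Welling & Teh (2011) on a toy model (x_i ~ N(θ, 1) with a broad Gaussian prior; the model and all hyperparameters are illustrative, not from the slide). Each step uses a mini-batch gradient rescaled by N/n, plus injected Gaussian noise whose variance equals the step size.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: x_i ~ N(theta_true, 1).
N = 10_000
x = rng.normal(1.5, 1.0, size=N)

def grad_log_prior(theta, prior_std=10.0):
    return -theta / prior_std ** 2

def grad_log_lik(theta, batch):
    return np.sum(batch - theta)          # d/dtheta of -0.5*(x_i - theta)^2, summed

def sgld(n_steps=5000, batch_size=100, eps=1e-4, theta0=0.0):
    """SGLD: theta <- theta + eps/2 * (grad log p(theta)
                                       + N/n * sum_batch grad log p(x_i|theta))
                            + N(0, eps) noise."""
    theta, samples = theta0, []
    for _ in range(n_steps):
        batch = x[rng.integers(0, N, size=batch_size)]
        grad = grad_log_prior(theta) + (N / batch_size) * grad_log_lik(theta, batch)
        theta = theta + 0.5 * eps * grad + np.sqrt(eps) * rng.normal()
        samples.append(theta)
    return np.array(samples)

chain = sgld()
print(chain[1000:].mean(), chain[1000:].std())   # approximate posterior over theta
```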

  16. Demo: Stochastic Gradient Langevin Dynamics

  17. A Closer Look … large stepsize

  18. A Closer Look … small stepsize

  19. Demo SGLD: large stepsize

  20. Demo SGLD: small stepsize

  21. Variational Inference. Choose a tractable family of distributions Q (e.g. Gaussian, discrete). Minimize the divergence between Q and the true posterior P over the family; equivalently, maximize a lower bound on the log likelihood over the variational parameters Φ. (The standard objective is written out below.)
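The objective in standard notation, reconstructed to be consistent with the bound B(Q) that appears on slide 28 (the formulas on this slide itself are not legible in the transcript):

```latex
\Phi^{*} \;=\; \arg\min_{\Phi}\; \mathrm{KL}\bigl(Q(Z \mid X, \Phi)\,\|\,P(Z \mid X, \Theta)\bigr)
        \;=\; \arg\max_{\Phi}\; \mathcal{B}(Q),
\qquad
\mathcal{B}(Q) \;=\; \mathbb{E}_{Q(Z \mid X, \Phi)}\bigl[\log P(X \mid Z, \Theta) + \log P(Z) - \log Q(Z \mid X, \Phi)\bigr]
```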

  22. Learning: Expectation Maximization. The log likelihood splits into a bound plus a gap (see below). E-step: tighten the bound by optimizing over Q (variational inference). M-step: maximize the bound over the model parameters (approximate learning).
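The bound/gap decomposition this slide refers to, in standard notation (reconstructed; the slide's formulas are not legible):

```latex
\log P(X \mid \Theta)
  \;=\; \underbrace{\mathbb{E}_{Q(Z)}\!\left[\log \frac{P(X, Z \mid \Theta)}{Q(Z)}\right]}_{\text{bound}\ \mathcal{B}(Q)}
  \;+\; \underbrace{\mathrm{KL}\bigl(Q(Z)\,\|\,P(Z \mid X, \Theta)\bigr)}_{\text{gap}}
```

The E-step maximizes B(Q) over Q with Θ fixed (closing the gap when Q can represent the exact posterior); the M-step maximizes B(Q) over Θ with Q fixed.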

  23. Amortized Inference. By making q(z|x) a function of x and sharing the parameters φ across data points, we can do very fast inference at test time (i.e. avoid iterative optimization of q_test(z)).

  24. A deep NN as a glorified conditional distribution P(Y|X), mapping an input X to a distribution over the target Y.

  25. The "Deepify" Operator. Take a graphical model with conditional distributions and replace those conditionals with deep NNs. Logistic regression → deep NN. "Deep survival analysis": replace Cox's proportional hazard function with a deep NN. Latent variable model: replace the generative and recognition models with deep NNs → the "Variational Autoencoder" (VAE).

  26. Variational Autoencoder. (Figure: a latent variable model in which both the generative model p(x|z) and the recognition model q(z|x) are "deepified".)

  27. Deep Generative Model: The Variational Auto-Encoder. (Figure: the recognition model Q maps the observed stochastic node x through a deep neural net of deterministic nodes h to the parameters μ, σ of the unobserved stochastic node z; the generative model P maps z through another deep neural net back to x.)

  28. Stochastic Variational Bayesian Inference. B(Q) = Σ_Z Q(Z|X,Φ) [log P(X|Z,Θ) + log P(Z) − log Q(Z|X,Φ)]. Its gradient can be written with the score function: ∇_Φ B(Q) = Σ_Z Q(Z|X,Φ) ∇_Φ log Q(Z|X,Φ) [log P(X|Z,Θ) + log P(Z) − log Q(Z|X,Φ)]. Subsample a mini-batch of X and sample Z: ∇_Φ B(Q) ≈ (1/N)(1/S) Σ_{i=1}^{N} Σ_{s=1}^{S} ∇_Φ log Q(Z_is|X_i,Φ) [log P(X_i|Z_is,Θ) + log P(Z_is) − log Q(Z_is|X_i,Φ)]. This estimator has very high variance.

  29. Reducing the Variance: The Reparametrization Trick (Kingma 2013, Bengio 2013, Kingma & Welling 2014). Reparameterization: ∇_Φ B(Θ,Φ) = ∇_Φ ∫ dz Q_Φ(z|x) [log P_Θ(x,z) − log Q_Φ(z|x)]. Applied to the VAE: ∇_Φ B(Θ,Φ) ≈ ∇_Φ [log P_Θ(x,z_s) − log Q_Φ(z_s|x)] with z_s = g(ε_s, Φ), ε_s ∼ P(ε). Example: for ∇_µ ∫ dz N_z(µ,σ) z, the score-function estimator is (1/S) Σ_s z_s (z_s − µ)/σ² with z_s ∼ N_z(µ,σ), whereas with z = µ + σε, ε_s ∼ N_ε(0,1), the reparametrized estimator is (1/S) Σ_s 1 = 1.
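A numerical illustration of the slide's example in Python: both estimators target ∇_µ E_{z∼N(µ,σ)}[z] = 1, but the score-function estimator is noisy while the reparametrized one has zero variance for this toy objective (the values of µ, σ and the sample counts are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, S = 1.0, 2.0, 1000
# True gradient of E_{z ~ N(mu, sigma)}[z] with respect to mu is exactly 1.

def score_function_estimate():
    """Score-function estimator: (1/S) sum_s z_s (z_s - mu) / sigma^2."""
    z = rng.normal(mu, sigma, size=S)
    return np.mean(z * (z - mu) / sigma ** 2)

def reparam_estimate():
    """Reparametrized estimator: z = mu + sigma*eps, so dz/dmu = 1 for every sample."""
    eps = rng.normal(0.0, 1.0, size=S)
    z = mu + sigma * eps
    return np.mean(np.ones_like(z))

score = [score_function_estimate() for _ in range(500)]
repar = [reparam_estimate() for _ in range(500)]
print(np.std(score))   # noticeably larger than zero
print(np.std(repar))   # exactly zero for this toy objective
```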

  30. Semi-Supervised VAE (D.P. Kingma, D.J. Rezende, S. Mohamed, M. Welling, NIPS 2014). (Figure: recognition model Q and generative model P over the observed x, the latent z, and the label y, where y is a sometimes-observed stochastic node; the objective is the normal VB objective plus a term boosting the influence of q(y|x).)

  31. Discriminative or Generative? Discriminative: deep learning, kernel methods, random forests, boosting. Generative: variational auto-encoders, Bayesian networks, probabilistic programs, simulator models. Advantages of generative models: inject expert knowledge, model causal relations, interpretable, data efficient, more robust to domain shift, facilitate un/semi-supervised learning. Advantages of discriminative models: flexible map from input to target (low bias), efficient training algorithms available, they solve the problem you are evaluating on, very successful and accurate.

  32. Big N vs. Small N? Big N (N = 10^8–10^9: customer intelligence, finance, video/image, Internet of Things) calls for computational efficiency. Small N (N = 100–1000, e.g. healthcare with p >> N) calls for statistical efficiency: generative, causal models generalize much better to new, unknown situations (domain invariance).

  33. Combining Generative and Discriminative Models: combine black-box DNNs/CNNs with physics, causality, and expert knowledge.

  34. Deep Convolutional Networks. Input dimensions have "topology" (1D speech, 2D images, 3D MRI, 2+1D video, 4D fMRI). Forward pass: filter, subsample, filter, nonlinearity, subsample, ..., classify (a minimal sketch follows below). Backward pass: backpropagation (propagate the error signal backward).
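A bare-bones numpy sketch of one filter → nonlinearity → subsample → classify pass; the sizes and random weights are placeholders, not a trained network:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d(x, k):
    """Valid 2-D correlation of a single-channel image x with filter k."""
    H, W = x.shape
    kh, kw = k.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def subsample(x, s=2):
    """Max-pool with an s x s window and stride s."""
    H, W = x.shape
    return x[:H - H % s, :W - W % s].reshape(H // s, s, W // s, s).max(axis=(1, 3))

image = rng.normal(size=(28, 28))           # e.g. a grayscale digit
kernel = rng.normal(size=(3, 3))            # one filter (random stand-in)
weights = rng.normal(size=(13 * 13, 10))    # linear classifier over pooled features

h = np.maximum(conv2d(image, kernel), 0.0)  # filter, then nonlinearity (ReLU)
h = subsample(h)                            # subsample
logits = h.reshape(-1) @ weights            # ..., classify
print(logits.argmax())
```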

  35. Dropout
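The slide carries only the title; as a reminder of the mechanism, here is a sketch of one common variant (inverted dropout), which may differ in detail from the formulation shown on the slide:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p_drop=0.5, training=True):
    """Inverted dropout: zero each unit with probability p_drop during training
    and rescale the survivors so the expected activation is unchanged at test time."""
    if not training:
        return h
    mask = rng.uniform(size=h.shape) >= p_drop
    return h * mask / (1.0 - p_drop)

h = np.ones((4, 8))                  # activations of a hidden layer
print(dropout(h))                    # roughly half the units zeroed, the rest scaled by 2
print(dropout(h, training=False))    # identity at test time
```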

  36. Example: Dermatology

  37. (Figure.)

  38. (Figure.)

  39. Example: Retinopathy

  40. What do these problems have in common? It's the same CNN in all cases: Inception-v3.

  41. So..., CNNs work really well. However: they are way too big, they consume too much energy, and they use too much memory → we need to make them more efficient!

  42. Reasons for Bayesian Deep Learning: automatic model selection / pruning; automatic regularization; realistic prediction uncertainty, which is important for decision making (e.g. autonomous driving, computer-aided diagnosis).
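One simple way to obtain the prediction uncertainty this slide calls for is to keep dropout active at test time and average several stochastic forward passes (MC dropout; a technique in this spirit rather than anything specified on the slide). A sketch with placeholder random weights:

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny regression net with dropout kept active at test time; the weights
# here are random stand-ins rather than trained parameters.
W1, b1 = rng.normal(size=(1, 64)), np.zeros(64)
W2, b2 = rng.normal(size=(64, 1)), np.zeros(1)

def forward(x, p_drop=0.5):
    h = np.maximum(x @ W1 + b1, 0.0)
    mask = rng.uniform(size=h.shape) >= p_drop     # stochastic even at test time
    h = h * mask / (1.0 - p_drop)
    return h @ W2 + b2

x = np.array([[0.3]])
preds = np.array([forward(x) for _ in range(100)])  # 100 stochastic forward passes
print(preds.mean(), preds.std())  # mean prediction and a simple uncertainty estimate
```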

  43. Example: increased uncertainty away from the data.
