

SLIDE 1

Marrying Graphical Models & Deep Learning

Max Welling, University of Amsterdam

UvA-Qualcomm QUVA Lab

Canadian Institute for Advanced Research

SLIDE 2

Overview:

  • Machine Learning as Computational Statistics
  • Graphical Models:
    • Bayes nets
    • MRFs
    • Latent variable models
  • Inference:
    • Variational inference
    • MCMC
  • Learning:
    • EM
    • Amortized EM
    • Variational autoencoder
  • Generative versus discriminative modeling
  • Deep Learning:
    • CNNs
    • Dropout
    • Bayesian inference
    • Bayesian deep models
    • Compression

1

SLIDE 3

ML as Statistics

  • Data: a set of observations (unsupervised) or of input-target pairs (supervised).
  • Optimize an objective:
    • maximize the log-likelihood (unsupervised), or
    • minimize a loss (supervised).
  • ML is more than an optimization problem: it is a statistical inference problem.
  • E.g.: you should not optimize parameters more precisely than the scale at which the MLE fluctuates under resampling of the data, or you risk overfitting (see the sketch below).
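To make the last point concrete, here is a minimal sketch (not from the slides): it uses the bootstrap to estimate how much a maximum-likelihood estimate fluctuates under resampling of the data. All numbers are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=1.7, scale=2.0, size=200)          # a toy data set

# MLE of a Gaussian mean is the sample mean.
mle = x.mean()

# Bootstrap: recompute the MLE on resampled versions of the data.
boot = np.array([rng.choice(x, size=x.size, replace=True).mean()
                 for _ in range(1000)])

print(f"MLE = {mle:.3f}, fluctuation under resampling = {boot.std():.3f}")
# Optimizing the mean to a precision much finer than boot.std() would be fitting noise.
```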

2

SLIDE 4

Bias Variance Tradeoff

http://scott.fortmann-roe.com/docs/BiasVariance.html

3

SLIDE 5

Graphical Models

  • A graphical representation to concisely represent (conditional) independence relations between variables.
  • There is a one-to-one correspondence between the dependencies implied by the graph and the probabilistic model.
  • E.g. Bayes Nets

P(all) = P(traffic-jam | rush-hour, bad-weather, accident) × P(sirens | accident) × P(accident | bad-weather) × P(bad-weather) × P(rush-hour)

rush-hour independent of bad-weather  ⟺  Σ_{traffic-jam, sirens, accident} P(all) = P(rush-hour) P(bad-weather)
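A minimal sketch (not from the slides) of this factorization as a small lookup-table Bayes net; it sums out traffic-jam, sirens and accident and checks that the marginal factorizes into P(rush-hour) P(bad-weather). All probability values are invented for illustration.

```python
import itertools

# Made-up conditional probability tables for binary variables (1 = true, 0 = false).
def p_rush(r):
    return 0.3 if r else 0.7                      # P(rush-hour)

def p_weather(w):
    return 0.2 if w else 0.8                      # P(bad-weather)

def p_accident(a, w):                             # P(accident | bad-weather)
    p = 0.10 if w else 0.02
    return p if a else 1 - p

def p_sirens(s, a):                               # P(sirens | accident)
    p = 0.80 if a else 0.05
    return p if s else 1 - p

def p_jam(j, r, w, a):                            # P(traffic-jam | rush-hour, bad-weather, accident)
    p = 0.05 + 0.4 * r + 0.2 * w + 0.3 * a
    return p if j else 1 - p

def p_all(j, s, a, w, r):                         # the factorization from the slide
    return p_jam(j, r, w, a) * p_sirens(s, a) * p_accident(a, w) * p_weather(w) * p_rush(r)

# Summing out traffic-jam, sirens and accident leaves P(rush-hour) P(bad-weather),
# i.e. rush-hour and bad-weather are marginally independent.
for r, w in itertools.product([0, 1], repeat=2):
    marg = sum(p_all(j, s, a, w, r) for j, s, a in itertools.product([0, 1], repeat=3))
    assert abs(marg - p_rush(r) * p_weather(w)) < 1e-12
print("checked: sum over {traffic-jam, sirens, accident} of P(all) = P(rush-hour) P(bad-weather)")
```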

4

SLIDE 6

Rush-hour independent of bad-weather


5

SLIDE 7

Markov Random Fields

Source: Bishop

  • Undirected edges.
  • (Conditional) independence relationships are easy to read off: A is independent of B given C if all paths from A to B are blocked by C.
  • The probability distribution factorizes over maximal cliques (the largest completely connected subgraphs): $P(x) = \frac{1}{Z} \prod_{c} \psi_c(x_c)$.
  • Hammersley-Clifford Theorem: if P(x) > 0 for all x, then the (conditional) independencies of P match those of the graph.

6

SLIDE 8

Latent Variable Models

  • Introducing latent (unobserved) variables will dramatically increase the capacity of a model.
  • Problem: P(Z|X) is intractable for most nontrivial models

7

SLIDE 9

Approximate Inference

Two main families: variational inference and sampling.

[Figure: the variational family Q as a subset of all probability distributions, with q* the member of Q closest to the true posterior p.]

Variational inference:
  • Deterministic
  • Biased
  • Local minima
  • Easy to assess convergence

Sampling:
  • Stochastic (sample error)
  • Unbiased
  • Hard to mix between modes
  • Hard to assess convergence

8

SLIDE 10

Independence Samplers & MCMC

Generating independent samples: sample from a proposal g and suppress samples with low p(θ|X), e.g. (a) rejection sampling, (b) importance sampling.

  • Does not scale to high dimensions.

Markov Chain Monte Carlo:

  • Make steps by perturbing the previous sample.
  • The probability of visiting a state is equal to P(θ|X).

[Figure: a proposal distribution g overlaid on the target p(θ|X).]

9

SLIDE 11

Sampling 101 – What is MCMC?

Given a target distribution S0, design a transition kernel $T(\theta_{t+1}|\theta_t)$ such that $p_t(\theta_t) \rightarrow S_0$ as $t \rightarrow \infty$.

  • Burn-in: throw away the initial samples θ0, θ1, ... taken before the chain reaches S0.
  • After burn-in, the θt are (correlated) samples from S0.
  • Autocorrelation time τ: the number of steps between roughly independent samples.

[Figure: two trace plots of the last position coordinate over 1000 iterations, one chain with high τ and one with low τ.]

$$I = \langle f \rangle_{S_0} \approx \hat I = \frac{1}{T}\sum_{t=1}^{T} f(\theta_t), \qquad \mathrm{Bias}(\hat I) = E[\hat I - I] = 0, \qquad \mathrm{Var}(\hat I) = \frac{\tau\,\mathrm{Var}(f)}{T}$$
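A minimal sketch (not from the slides) of how τ enters the variance formula above; it uses an AR(1) chain with a known autocorrelation time as a stand-in for MCMC output and estimates τ and Var(Î) from the samples. The truncation rule for the autocorrelation sum is a crude choice made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# A correlated chain standing in for MCMC output: AR(1) with stationary distribution N(0, 1).
rho, T = 0.9, 50_000
theta = np.empty(T)
theta[0] = rng.normal()
for t in range(T - 1):
    theta[t + 1] = rho * theta[t] + np.sqrt(1 - rho**2) * rng.normal()

f = theta                 # estimate I = E[f(theta)] = 0 with f the identity function
fc = f - f.mean()

# Integrated autocorrelation time: tau = 1 + 2 * sum_k acf(k), summed until the acf dies out.
max_lag = 200
acf = np.array([np.dot(fc[:T - k], fc[k:]) / ((T - k) * fc.var()) for k in range(max_lag)])
cutoff = np.argmax(acf < 0.05) or max_lag       # crude truncation of the sum
tau = 1 + 2 * acf[1:cutoff].sum()

print(f"I_hat = {f.mean():.4f}")
print(f"estimated tau = {tau:.1f} (theory for this AR(1) chain: {(1 + rho) / (1 - rho):.1f})")
print(f"Var(I_hat) ≈ tau * Var(f) / T = {tau * f.var() / T:.2e}")
```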

10

SLIDE 12

Sampling 101 – Metropolis-Hastings

Transition kernel $T(\theta_{t+1}|\theta_t)$:

  • Propose: $\theta' \sim q(\theta'|\theta_t)$
  • Accept/reject test:

$$\theta_{t+1} \leftarrow \begin{cases} \theta' & \text{with probability } P_a \\ \theta_t & \text{with probability } 1 - P_a \end{cases}
\qquad P_a = \min\!\left(1,\; \frac{q(\theta_t|\theta')}{q(\theta'|\theta_t)}\,\frac{S_0(\theta')}{S_0(\theta_t)}\right)$$

  • Intuition: is the new state more probable? Is it easy to come back to the current state?
  • For Bayesian posterior inference, $S_0(\theta) \propto p(\theta)\prod_{i=1}^{N} p(x_i|\theta)$, so every step touches all N data points:
    1) burn-in is unnecessarily slow, and
    2) $\mathrm{Var}[\hat I] \propto 1/T$ is too high.
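A minimal sketch (not from the slides) of a random-walk Metropolis-Hastings sampler on a toy one-dimensional target; the target, step size and chain length are made up for illustration.

```python
import numpy as np

def metropolis_hastings(log_target, theta0, n_steps, step_size, rng):
    """Random-walk Metropolis: the proposal q(theta'|theta) = N(theta, step_size^2)
    is symmetric, so the ratio q(theta_t|theta')/q(theta'|theta_t) cancels in P_a."""
    theta = theta0
    samples = np.empty(n_steps)
    for t in range(n_steps):
        proposal = theta + step_size * rng.normal()
        log_pa = min(0.0, log_target(proposal) - log_target(theta))   # log P_a
        if np.log(rng.uniform()) < log_pa:                            # accept with probability P_a
            theta = proposal
        samples[t] = theta
    return samples

def log_target(th):
    # Unnormalized log density of a mixture of two unit-variance Gaussians at -2 and +2.
    return np.logaddexp(-0.5 * (th - 2.0) ** 2, -0.5 * (th + 2.0) ** 2)

rng = np.random.default_rng(0)
chain = metropolis_hastings(log_target, theta0=0.0, n_steps=20_000, step_size=1.0, rng=rng)
burn_in = 1_000                      # throw away samples taken before the chain reached S_0
print("mean under S_0 ≈", chain[burn_in:].mean())     # ≈ 0 by symmetry of the target
```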

11

SLIDE 13

Approximate MCMC

[Figure: samples from an approximate stationary distribution S_ε versus the exact target S_0. A fast, biased sampler gives low variance but high bias; a slow, exact sampler gives low bias but high variance. Decreasing ε trades variance for bias.]

12

SLIDE 14

Minimizing Risk

[Figure: bias², variance and risk as a function of ε, for a fixed computational time budget.]

$$\mathrm{Risk} = E\big[(I - \hat I)^2\big] = \mathrm{Bias}^2 + \mathrm{Variance} = \big(\langle f \rangle_P - \langle f \rangle_{P_\epsilon}\big)^2 + \sigma^2 \tau / T$$

  • Given finite sampling time, ε = 0 is not the optimal setting.

13

SLIDE 15

Stochastic Gradient Langevin Dynamics (Welling & Teh 2011)

[Figure: a 2x2 diagram relating four update rules. Gradient ascent becomes stochastic gradient ascent when the full-data gradient is replaced by a minibatch estimate. Langevin dynamics (a gradient step plus Gaussian noise, followed by a Metropolis-Hastings accept step) becomes stochastic gradient Langevin dynamics by the same substitution, with the Metropolis-Hastings accept step dropped.]
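A minimal sketch (not from the slides) of the SGLD update on a toy conjugate model: each step takes half a step size times a minibatch estimate of the gradient of the log posterior, adds Gaussian noise with variance equal to the step size, and uses no Metropolis-Hastings accept step. Model, data and step size are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: x_i ~ N(theta, 1), prior theta ~ N(0, 10).
N = 10_000
x = rng.normal(loc=1.0, scale=1.0, size=N)

theta, n, eps = 0.0, 100, 1e-4          # state, minibatch size, step size
samples = []
for t in range(5_000):
    batch = x[rng.choice(N, size=n, replace=False)]
    grad_log_prior = -theta / 10.0
    grad_log_lik = (N / n) * np.sum(batch - theta)          # minibatch estimate of the full gradient
    noise = np.sqrt(eps) * rng.normal()                     # injected Gaussian noise, variance eps
    theta = theta + 0.5 * eps * (grad_log_prior + grad_log_lik) + noise   # no MH accept step
    samples.append(theta)

# For this conjugate model the true posterior mean is close to x.mean().
print("SGLD posterior mean ≈", np.mean(samples[1_000:]), " (x.mean() =", x.mean(), ")")
```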

14

SLIDE 16

Demo: Stochastic Gradient LD

15

SLIDE 17

A Closer Look …

large stepsize

16

SLIDE 18

A Closer Look …

small stepsize

17

SLIDE 19

Demo SGLD: large stepsize

18

SLIDE 20

Demo SGLD: small stepsize

19

SLIDE 21

Variational Inference

  • Choose a tractable family of distributions Q (e.g. Gaussian, discrete).
  • Minimize over Q: $KL[\,Q(Z|\Phi)\,\|\,P(Z|X)\,]$.
  • Equivalent to maximizing the variational lower bound over $\Phi$:

$$\mathcal{B}(Q) = E_{Q(Z|\Phi)}\big[\log P(X|Z) + \log P(Z) - \log Q(Z|\Phi)\big] \le \log P(X)$$

[Figure: the variational family Q inside the set of all distributions, with the optimal Φ selecting the member of Q closest in KL to the target P.]

20

SLIDE 22

Learning: Expectation Maximization

E-step: $Q(Z) \leftarrow \arg\max_Q \mathcal{B}(Q, \Theta)$, i.e. set $Q(Z) = P(Z|X, \Theta)$ or approximate it (variational inference).

Bound: $\log P(X|\Theta) \ge \mathcal{B}(Q, \Theta) = E_{Q(Z)}\big[\log P(X, Z|\Theta) - \log Q(Z)\big]$

M-step: $\Theta \leftarrow \arg\max_\Theta \mathcal{B}(Q, \Theta)$ (approximate learning).

Gap: $\log P(X|\Theta) - \mathcal{B}(Q, \Theta) = KL[\,Q(Z)\,\|\,P(Z|X, \Theta)\,]$ (a worked numerical example follows below).
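A minimal sketch (not from the slides) of EM for a two-component Gaussian mixture, a case where the E-step posterior is tractable so no variational approximation is needed; the data and initialization are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.0, 700)])   # toy data

# Theta = (mixing weights, means, standard deviations), crudely initialized.
pi = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
sigma = np.array([1.0, 1.0])

for it in range(50):
    # E-step: Q(z_i = k) = responsibility of component k for data point i (exact posterior here).
    log_r = np.log(pi) - np.log(sigma) - 0.5 * ((x[:, None] - mu) / sigma) ** 2
    r = np.exp(log_r - log_r.max(axis=1, keepdims=True))
    r /= r.sum(axis=1, keepdims=True)

    # M-step: maximize the bound B(Q, Theta) over Theta with Q held fixed.
    Nk = r.sum(axis=0)
    pi = Nk / x.size
    mu = (r * x[:, None]).sum(axis=0) / Nk
    sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk)

print("weights:", pi.round(3), "means:", mu.round(3), "stds:", sigma.round(3))
```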

21

SLIDE 23

Amortized Inference

  • By making q(z|x) a function of x with shared parameters φ, we can do very fast inference at test time (i.e. we avoid iterative optimization of q_test(z) for every new data point).

22

SLIDE 24

Deep NN as a glorified conditional distribution

[Figure: a deep neural network mapping an input X to the parameters of a conditional distribution P(Y|X).]

23

SLIDE 25

The “Deepify” Operator

  • Find a graphical model with conditional distributions and replace those with a deep NN.
  • Logistic regression → deep NN (see the sketch below).
  • "Deep survival analysis": replace Cox's proportional hazard function with a deep NN.
  • Latent variable model: replace the generative and recognition models with deep NNs → the "Variational Autoencoder" (VAE).
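A minimal sketch (not from the slides) of the second bullet in PyTorch: the same Bernoulli likelihood P(y|x), first with a linear logit (logistic regression) and then "deepified" with a multi-layer logit; the input dimension and layer sizes are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# P(y=1 | x) with a linear logit: plain logistic regression.
logistic_regression = nn.Linear(20, 1)

# The "deepified" version: same Bernoulli likelihood, but the logit is a deep NN of x.
deep_net = nn.Sequential(
    nn.Linear(20, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 1),
)

x = torch.randn(8, 20)                       # toy minibatch
y = torch.randint(0, 2, (8, 1)).float()
nll = nn.BCEWithLogitsLoss()                 # negative Bernoulli log-likelihood
print("logistic:", nll(logistic_regression(x), y).item(),
      " deep:", nll(deep_net(x), y).item())
```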

24

SLIDE 26

Variational Autoencoder

[Figure: the latent variable model, with both the generative model p(x|z) and the recognition model q(z|x) "deepified" into neural networks.]

25

SLIDE 27

Deep Generative Model: The Variational Auto-Encoder

[Figure: the recognition model Q (x → deterministic hidden layers h → μ, σ of the latent code z) alongside the generative model P (z → deterministic hidden layers h → p(x)). Legend: deterministic NN node, unobserved stochastic node, observed stochastic node; both mappings are deep neural nets.]
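A minimal sketch (not from the slides) of such a VAE in PyTorch, assuming MNIST-sized inputs (784 pixels) and a 20-dimensional latent code; it combines the recognition network, a reparametrized sample of z (introduced two slides further on) and the negative ELBO in one module.

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=20, h_dim=400):
        super().__init__()
        # Recognition model Q(z|x): a deep net producing the mean and log-variance of z.
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.enc_mu = nn.Linear(h_dim, z_dim)
        self.enc_logvar = nn.Linear(h_dim, z_dim)
        # Generative model P(x|z): a deep net producing Bernoulli logits per pixel.
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(), nn.Linear(h_dim, x_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)     # reparametrized sample of z
        logits = self.dec(z)
        # Negative ELBO = expected reconstruction error + KL[Q(z|x) || N(0, I)].
        rec = nn.functional.binary_cross_entropy_with_logits(logits, x, reduction="sum")
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        return (rec + kl) / x.shape[0]

torch.manual_seed(0)
vae = VAE()
x = torch.rand(16, 784)              # stand-in for a minibatch of (soft-)binarized images
loss = vae(x)
loss.backward()
print("negative ELBO per example:", loss.item())
```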

26

SLIDE 28

Stochastic Variational Bayesian Inference

Sample Z (very high variance) and subsample a mini-batch of X:

$$\mathcal{B}(Q) = \sum_Z Q(Z|X, \Phi)\,\big(\log P(X|Z, \Theta) + \log P(Z) - \log Q(Z|X, \Phi)\big)$$

$$\nabla_\Phi \mathcal{B}(Q) = \sum_Z Q(Z|X, \Phi)\, \nabla_\Phi \log Q(Z|X, \Phi)\,\big(\log P(X|Z, \Theta) + \log P(Z) - \log Q(Z|X, \Phi)\big)$$

$$\nabla_\Phi \mathcal{B}(Q) \approx \frac{1}{N}\,\frac{1}{S} \sum_{i=1}^{N} \sum_{s=1}^{S} \nabla_\Phi \log Q(Z_{is}|X_i, \Phi)\,\big(\log P(X_i|Z_{is}, \Theta) + \log P(Z_{is}) - \log Q(Z_{is}|X_i, \Phi)\big)$$

27
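A minimal sketch (not from the slides) of the score-function (log-derivative) estimator used in the last line, on a toy problem where the true gradient is known; it makes the "very high variance" annotation measurable.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, S = 1.5, 10_000
z = rng.normal(loc=mu, scale=1.0, size=S)

# Score-function estimator of d/dmu E_{z~N(mu,1)}[z^2]:
#   grad = E[ f(z) * d/dmu log N(z; mu, 1) ] = E[ z^2 * (z - mu) ].
per_sample = z**2 * (z - mu)

print("true gradient           :", 2 * mu)              # d/dmu (mu^2 + 1) = 2 mu
print("score-function estimate :", per_sample.mean())
print("per-sample std          :", per_sample.std())
```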

SLIDE 29

Reducing the Variance: The Reparametrization Trick

28

Kingma 2013, Bengio 2013, Kingma & Welling 2014

  • Reparametrization: write $z = g(\epsilon, \Phi)$ with $\epsilon \sim P(\epsilon)$ independent of $\Phi$.
  • Applied to the VAE:

$$\nabla_\Phi \mathcal{B}(\Theta, \Phi) = \nabla_\Phi \int dz\, Q_\Phi(z|x)\,\big[\log P_\Theta(x, z) - \log Q_\Phi(z|x)\big] \approx \nabla_\Phi \big[\log P_\Theta(x, z_s) - \log Q_\Phi(z_s|x)\big]_{z_s = g(\epsilon_s, \Phi)}, \qquad \epsilon_s \sim P(\epsilon)$$

  • Example: estimating $\nabla_\mu \int dz\, \mathcal{N}_z(\mu, \sigma)\, z$
    • without reparametrization: $\frac{1}{S}\sum_s z_s (z_s - \mu)/\sigma^2$, with $z_s \sim \mathcal{N}_z(\mu, \sigma)$
    • with reparametrization: $\frac{1}{S}\sum_s 1$, with $\epsilon_s \sim \mathcal{N}_\epsilon(0, 1)$ and $z = \mu + \sigma\epsilon$
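The reparametrized counterpart of the previous sketch (again not from the slides, and self-contained): on the same toy problem the per-sample spread of the gradient estimate drops substantially.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, S = 1.5, 10_000                      # same toy problem: d/dmu E_{z~N(mu,1)}[z^2] = 2*mu

# Without reparametrization (score-function estimator):
z = rng.normal(loc=mu, scale=1.0, size=S)
score = z**2 * (z - mu)

# With reparametrization: z = mu + eps, eps ~ N(0,1), so d/dmu z^2 = 2*z at z = mu + eps.
eps = rng.normal(size=S)
reparam = 2 * (mu + eps)

print("true gradient:", 2 * mu)
print(f"score-function : mean {score.mean():.3f}, per-sample std {score.std():.2f}")
print(f"reparametrized : mean {reparam.mean():.3f}, per-sample std {reparam.std():.2f}")
```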

SLIDE 30

Semi-Supervised VAE I

[Figure: the recognition model Q and the generative model P, with deterministic layers h, latent code z, and a label y that is a sometimes-observed stochastic node.]

D.P. Kingma, D.J. Rezende, S. Mohamed, M. Welling, NIPS 2014

Objective: the normal VB objective, plus an extra term boosting the influence of the classifier q(y|x) on labelled data.

SLIDE 31
SLIDE 32

Discriminative or Generative?

  • Advantages of generative models:
    • Inject expert knowledge
    • Model causal relations
    • Interpretable
    • Data efficient
    • More robust to domain shift
    • Facilitate un/semi-supervised learning
  • Advantages of discriminative models:
    • Flexible map from input to target (low bias)
    • Efficient training algorithms available
    • Solve the problem you are evaluating on
    • Very successful and accurate!

Example methods, roughly ordered from discriminative to generative: Deep Learning, Kernel Methods, Random Forests, Boosting, Bayesian Networks, Probabilistic Programs, Simulator Models. The Variational Auto-Encoder sits in between.

SLIDE 33

Big N vs. Small N?

N = 10^8-10^9:

  • Customer Intelligence
  • Finance
  • Video/image
  • Internet of Things
  • → we need computational efficiency

N = 100-1000:

  • Healthcare (p >> N)
  • Generative, causal models generalize much better to new, unknown situations (domain invariance)
  • → we need statistical efficiency

32

SLIDE 34

Combining Generative and Discriminative Models

[Figure: a spectrum of model components, from "use physics, use causality, use expert knowledge" on the generative side to a black-box DNN/CNN on the discriminative side.]

SLIDE 35

Deep Convolutional Networks

Forward: filter, subsample, filter, nonlinearity, subsample, ..., classify.
Backward: backpropagation (propagate the error signal backward).

34

  • Input dimensions have "topology": 1D speech, 2D images, 3D MRI, 2+1D video, 4D fMRI. (A minimal sketch follows below.)
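A minimal sketch (not from the slides) of the forward and backward pass in PyTorch, assuming 28x28 grayscale inputs and 10 classes; filter sizes and channel counts are arbitrary.

```python
import torch
import torch.nn as nn

# Forward pass: filter, subsample, filter, nonlinearity, subsample, ..., classify.
cnn = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=5, padding=2), nn.ReLU(),   # filter + nonlinearity
    nn.MaxPool2d(2),                                          # subsample
    nn.Conv2d(16, 32, kernel_size=5, padding=2), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 7 * 7, 10),                                # classify into 10 classes
)

x = torch.randn(8, 1, 28, 28)          # a toy minibatch of 28x28 grayscale images
logits = cnn(x)
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 10, (8,)))
loss.backward()                        # backward pass: backpropagate the error signal
print(logits.shape)                    # torch.Size([8, 10])
```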
SLIDE 36

Dropout

35

SLIDE 37

Example: Dermatology

36

SLIDE 38

37

SLIDE 39

38

SLIDE 40

Example: Retinopathy

39

SLIDE 41

What do these Problems have in common?

It’s the same CNN in all cases: Inception-v3

40

SLIDE 42

So..., CNNs work really well.

However:

  • They are way too big
  • They consume too much energy
  • They use too much memory
  • → we need to make them more efficient!

41

SLIDE 43

Reasons for Bayesian Deep Learning

  • Automatic model selection / pruning
  • Automatic regularization
  • Realistic prediction uncertainty (important for decision making)

Examples: computer-aided diagnosis, autonomous driving.

SLIDE 44

Example

Increased uncertainty away from data

SLIDE 45

Bayesian Learning


Complex models can have lower marginal likelihood:

$$P(X|M) = \int d\Theta\, P(X|\Theta, M)\, P(\Theta|M) \quad \text{(model evidence)}$$

$$P(\Theta|X, M) = \frac{P(X|\Theta, M)\, P(\Theta|M)}{P(X|M)} \quad \text{(posterior)}$$

$$P(x|X, M) = \int d\Theta\, P(x|\Theta, M)\, P(\Theta|X, M) \quad \text{(prediction)}$$

$$P(M|X) = \frac{P(X|M)\, P(M)}{P(X)} \quad \text{(model selection)}$$

$$P(X) = \sum_M P(X|M)\, P(M) \quad \text{(evidence)}$$
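A minimal sketch (not from the slides) of the "complex models can have lower marginal likelihood" point, using a Beta-Bernoulli coin model where the evidence integral is available in closed form; the data are made up for illustration.

```python
import numpy as np
from scipy.special import betaln

# Data X: a particular sequence of 10 coin flips containing 5 heads.
n, k = 10, 5

# Model M0: a fair coin, theta fixed at 0.5 (no free parameters).
log_evidence_m0 = n * np.log(0.5)

# Model M1: unknown bias theta with a uniform Beta(1,1) prior; integrate theta out:
#   P(X|M1) = integral_0^1 theta^k (1-theta)^(n-k) dtheta = B(k+1, n-k+1).
log_evidence_m1 = betaln(k + 1, n - k + 1)

print("log P(X|M0) =", log_evidence_m0)      # simpler model
print("log P(X|M1) =", log_evidence_m1)      # more flexible model, lower evidence here
# The flexible model is penalized for spreading its prior mass over biases it did not need.
```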

SLIDE 46

Variational Bayes

$$\log P(X) \ge \int d\Theta\, Q(\Theta)\,\big[\log P(X|\Theta) + \log P(\Theta) - \log Q(\Theta)\big] \equiv \mathcal{B}(Q(\Theta)|X)$$

$$= E_{Q(\Theta)}[\log P(X|\Theta)] - KL[Q(\Theta)\,\|\,P(\Theta)]$$

45

SLIDE 47

Sparsifying & Compressing CNNs

  • DNNs are vastly overparameterized (e.g. distillation, Bucilua et al 2006).
  • Interpret variational bound as coding cost for data transmission (minimum description length)
  • Idea: learn a soft weight sharing prior, a.k.a. quantize the weights (Nowlan & Hinton 1991, Ullrich et al 2016)

$$\mathcal{B}(Q(\Theta)|X) = \underbrace{E_{Q(\Theta)}[\log P(X|\Theta)]}_{\text{error loss, } \sim N} - \underbrace{KL[Q(\Theta)\,\|\,P(\Theta)]}_{\text{complexity loss, } \sim \text{const.}}$$

46

SLIDE 48

Full Bayesian Deep Learning

THE PLAN:

  • Marginalize out the weights for the price of introducing stochastic hidden units.
  • Reinterpret the stochasticity on the hidden units as dropout noise.
  • Use sparsity-inducing priors to prune weights / hidden units.

(This works because the signal in NNs is very robust to noise addition (e.g. dropout), and "neurons" act as bottlenecks in the flow of information.)

SLIDE 49

Stochastic Variational Bayes

Sample $\Theta$ (very high variance) and subsample a mini-batch of X:

$$\mathcal{B}(Q(\Theta)|X) = \int d\Theta\, Q(\Theta)\,\big[\log P(X|\Theta) + \log P(\Theta) - \log Q(\Theta)\big]$$

$$\nabla_\Phi \mathcal{B} = \int d\Theta\, Q_\Phi(\Theta)\, \nabla_\Phi \log Q_\Phi(\Theta)\,\big[\log P(X|\Theta) + \log P(\Theta) - \log Q_\Phi(\Theta)\big]$$

$$\nabla_\Phi \mathcal{B} \approx \frac{1}{S} \sum_{s=1}^{S} \nabla_\Phi \log Q_\Phi(\Theta_s) \left[ \frac{N}{n} \sum_{i=1}^{n} \log P(x_i|\Theta_s) + \log P(\Theta_s) - \log Q_\Phi(\Theta_s) \right]$$

  • Reparametrization? Yes, but it is not enough: using the same sample $\Theta_s$ for all data cases $x_i$ in the minibatch induces correlations between data cases and thus high variance in the gradient.

48

SLIDE 50

Local Reparametrization

  • Write the likelihood in terms of the weight matrix: $P(X|\Theta) \rightarrow P(Y|W, X)$.
  • Reparametrize the pre-activations $F = XW$ directly: their distribution under $Q(W)$ can be computed exactly, so the hidden units become stochastic and correlated.
  • We draw different samples $F_{is}$ for different data cases in the minibatch (and this is much less expensive than resampling all the weights independently per data case).
  • Conclusion: using this trick we can further reduce the variance of the gradients.

Kingma, Salimans & Welling 2015
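A minimal sketch (not from the slides) of local reparametrization for a single linear layer in PyTorch, assuming a fully factorized Gaussian posterior over the weights; names, sizes and initial values are made up for illustration.

```python
import torch

def local_reparam_linear(x, w_mu, w_logvar):
    """Sample the pre-activations F = X W directly: under a fully factorized Gaussian
    posterior on W, each pre-activation is Gaussian with moments computed exactly."""
    f_mu = x @ w_mu                                      # mean of F
    f_var = (x ** 2) @ torch.exp(w_logvar)               # variance of F
    return f_mu + torch.sqrt(f_var) * torch.randn_like(f_mu)   # one sample per data case

# Toy layer: 8 data cases, 20 inputs, 30 hidden units.
x = torch.randn(8, 20)
w_mu = torch.randn(20, 30, requires_grad=True)
w_logvar = torch.full((20, 30), -5.0, requires_grad=True)

f = local_reparam_linear(x, w_mu, w_logvar)
h = torch.relu(f)                                        # hidden units H = sigma(F)
h.sum().backward()                                       # gradients flow to w_mu and w_logvar
print(f.shape)                                           # torch.Size([8, 30])
```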

SLIDE 51

Two Layers

[Figure: a two-layer network with inputs X, weight matrices W1 and W2, pre-activations F and B, hidden units H = σ(B), and outputs Y.]

Now use the “normal” reparameterization trick

SLIDE 52

Variational Dropout

  • Pre-activations: $B = AW$, with layer inputs A and weights W.
  • If the weight posterior has the multiplicative form $Q(w_{ij}) = \mathcal{N}(w_{ij};\, \mu_{ij},\, \alpha\,\mu_{ij}^2)$, then the pre-activations B behave as if multiplicative dropout noise had been applied to the layer.
  • Conclusion: by using a special form of posterior we simulate dropout noise, i.e. dropout can be understood as variational Bayesian inference with multiplicative noise.

Y Gal, Z Ghahramani 2016, Dropout as a Bayesian approximation: Representing model uncertainty in deep learning S Wang, C Manning, Fast dropout training

SLIDE 53

Sparsity Inducing Priors

  • Prior: an improper, sparsity-inducing prior over the weights (the log-uniform prior of the cited papers).
  • Posterior: the variational dropout posterior $Q(w_{ij}) = \mathcal{N}(\mu_{ij},\, \alpha_{ij}\,\mu_{ij}^2)$.
  • Learn the dropout rate $\alpha_{ij}$. When $\alpha_{ij}$ grows large, the weight is pruned.

(Kingma, Salimans, Welling 2015; Molchanov, Ashukha, Vetrov 2017)

Conclusion: we can learn the dropout rates and prune unnecessary weights.
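A minimal sketch (not from the slides) of the resulting pruning rule, assuming a per-weight variational dropout posterior N(μ, αμ²) whose log α values have already been learned; here they are random placeholders, and the threshold on log α is a commonly used cutoff rather than anything prescribed by the slide.

```python
import torch

# Per-weight variational dropout parameters (placeholder values for illustration):
# posterior Q(w) = N(mu, alpha * mu^2), with alpha learned per weight.
mu = torch.randn(20, 30)
log_alpha = torch.empty(20, 30).uniform_(-6.0, 6.0)

# Prune weights whose learned dropout rate is large: alpha >> 1 means the posterior
# is dominated by noise, so the weight carries almost no information.
threshold = 3.0
mask = (log_alpha < threshold).float()
w_pruned = mu * mask

sparsity = 1.0 - mask.mean().item()
print(f"pruned {100 * sparsity:.1f}% of the weights")
```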

SLIDE 54

Variational Dropout

Animation: Molchanov, D., Ashukha, A. and Vetrov, D.

SLIDE 55

Animation: Molchanov, D., Ashukha, A. and Vetrov, D.

Fully connected layer

54

SLIDE 56

Node (instead of Weight) Sparsification

Use hierarchical prior:

(dropout multiplicative noise)

Prior-posterior pair

(Louizos, Ullrich, Welling, 2017)

55

Conclusion: by using special, hierarchical priors we can prune hidden units instead of individual weights (which is much better for compression).

$$P(W, z) = \prod_{\text{hidden units } i} p(z_i) \prod_{\text{units } j \text{ outgoing from node } i} P(w_{ij} \mid z_i)$$

SLIDE 57

Preliminary Results

(Louizos, Ullrich, Welling 2017, submitted)

  • Compression rate of a factor 700x with no loss in accuracy!
  • Compression rates for node sparsity are higher because encoding is cheaper.
  • Additional Bayesian bonus: by monitoring the posterior fluctuations of the weights one can determine their fixed-point precision.

56

SLIDE 58

Conclusions

  • Deep learning is not a silver bullet: it is mainly very good at signal processing (auditory, image data)
  • Optimization plays an important role in getting good solutions (e.g. reducing variance gradients)
  • But… deep learning is more than optimization, it’s also statistics!
  • DL can be successfully combined with ”classical” graphical models (as a glorified conditional distribution)
  • Bayesian DL has an elegant interpretation as principled dropout
  • Bayesian DL is ideally suited for compression
  • There is a lot we do not understand about DL:
  • Why do they not overfit? (It is easy to get 0 training error on data with random labels.)
  • Why does SGD regularize so effectively?
  • Strange behavior in the face of adversarial examples
  • Huge over-parameterization (up to 400x)

57