Variational Inference for Bayesian Neural Networks Jesse - PowerPoint PPT Presentation

Variational Inference for Bayesian Neural Networks Jesse Bettencourt, Harris Chan, Ricky Chen, Elliot Creager, Wei Cui, Mo- hammad Firouzi, Arvid Frydenlund, Amanjit Singh Kainth, Xuechen Li, Jeff Wintersinger, Bowen Xu October 6, 2017 University of Toronto 1

Overview Variational Autoencoders Kingma and Welling, 2014. Auto-encoding variational Bayes. Variational Inference for BNNs Origins of VI: MDL Interpretation Hinton and van Camp, 1993. Keeping the neural networks simple by minimizing the description length of the weights. Practical VI for Neural Networks Graves, 2011. Practical variational inference for neural networks. Weight Uncertainty in Neural Networks Blundell et al., 2015. Weight uncertainty in neural networks. The Local Reparameterization Trick Kingma, Salimans, and Welling, 2015. Variational dropout and the local reparameterization trick. Sparsification Louizos et al., 2017. Bayesian compression for deep learning. 2

Variational Autoencoders (VAE)

From Autoencoders to Variational Autoencoders • Autoencoders (AE) • Neural network which reconstructs its own inputs, x • Learns useful latent representation, z • Regularized by bottleneck layer – compresses latent representation • Encoder f ( x ) → z and decoder g ( z ) → x • Compresses point in input space to point in latent space • Variational autoencoders (VAE) • Regularized by forcing z to be close to some given distribution • z ∼ N ( µ = 0 , σ 2 = 1), with diagonal covariance • Learn distribution over latent space • Compresses point in input space to distribution in latent space 3

Implementing a VAE Three implementation differences between a VAE and an AE 1. Our encoder network parameterizes a probability distribution • Normal distribution is parameterized by its means µ and variances σ 2 • Encoder f ( x ) → µ, σ 2 • Decoder g ( z ) → x , where z ∼ N ( µ, σ 2 ) 2. Need to sample z • Problem: Can not backpropagate through sampling z • Solution: reparameterization trick • z = µ + σ ∗ ǫ , where ǫ is a noise input variable and ǫ ∼ N (0 , 1) 3. We need to add a new term to the cost function • Reconstruction error (log-likelihood) • KL divergence between distribution of z and normal distribution • KL term acts as regularizer on z 4

Autoencoders Encoder Decoder x 1 x 1 z 1 x 2 x 2 z 2 x 3 x 3 z 3 x 4 x 4 Figure 1: Inputs are shown in blue and the latent representation is shown in red. 5

Variational Autoencoders Encoder Decoder µ 1 µ 2 x 1 x 1 σ 2 x 2 z 1 x 2 1 σ 2 x 3 z 2 x 3 2 ǫ 1 x 4 x 4 ǫ 2 Figure 2: Inputs, x , are shown in blue. The latent representation, z , is shown in red. The parameters, µ and σ 2 , of the normal distribution are shown in yellow. They are combined with the noise input, ǫ , by 6 z = µ + σ ∗ ǫ , shown in dashed lines.

Paper Results Figure 3: Sampled 2D latent space of MNIST. 7

The big picture of VAEs � • Goal: maximize p θ ( x ) = p θ ( x | z ) p ( z ) dz • Generative model intuition: if our model has high likelihood of reproducing the data it has seen, it also has high probability of producing samples similar to x , and low probability of producing dissimilar samples • How to proceed? Simple: choose p θ ( x | z ) st it’s continuous and easy to compute—then we can optimize via SGD • Examples from ”Tutorial on Variational Autoencoders” (Doersch 2016), arXiv:1606.05908 8

Defining a latent space • How do we define what information the latent z carries? • Naively, for MNIST, we might say one dimension conveys digit identity, another conveys stroke width, another stroke angle • But we’d rather have the network learn this • VAE solution: say there’s no simple interpretation of z • Instead, draw z from N (0 , I ), then map through a parameterized and sufficiently expressive function • Let p θ ( x | z ) � N ( x ; µ θ ( z ) , Σ θ ( z )), with µ θ ( · ) , Σ θ ( · ) as deterministic neural nets. • Now tune the parameters θ in order to maximize p θ ( x ). 9

Estimating p θ ( x ) is hard • To optimize p θ ( x ) via SGD we will need to compute it. • We could do Monte Carlo estimate of p θ ( x ) with z ∼ N (0 , I ), � and p θ ( x ) ≈ 1 i p θ ( x | z i ) n • But ... in high dimensions, we likely need extremely large n • Here, ( a ) is the original, ( b ) is a bad sample from model, and ( c ) is a good sample from model • Since p θ ( x | z ) = N ( x ; µ θ ( z ) , Σ θ ( z )) and with Σ θ ( z ) � σ 2 I , we have log p θ ( x ) ∝ − || µ θ ( z ) − x || 2 2 σ 2 • x b is subjectively “bad” but has distance relatively close to the original: || x b − x a || 2 2 = 0 . 0387 • x c is subjectively “good” (just x a shifted down & right by half-pixel), but scores poorly since || x c − x a || 2 10 2 = 0 . 2693

Sampling z values efficiently estimate p θ ( x ) • Conclusion: to reject bad samples like x b , we must set σ 2 to be extremely small • But this means that to get samples similar to x a , we’ll need to sample a huge number of z values • One solution: define better distance metric—but these are difficult to engineer • Better solution: sample only z that have non-negligible p θ ( z | x ) • For most z sampled from p ( z ), we have p θ ( x | z ) ≈ 0, so contribute almost nothing to p θ ( x ) estimate • Idea: define function q φ ( z | x ) that helps us sample z with non-negligible contribution to p θ ( x ) 11

What is Variational Inference? Posterior inference over z often intractable: p θ ( z | x ) = p θ ( x | z ) p ( z ) = p θ ( z , x ) p θ ( z , x ) = � p θ ( x ) p θ ( x ) z p θ ( x , z ) Want: Q – tractable family of distribution q φ ( z | x ) ∈ Q similar to p θ ( z | x ) Approximate posterior inference using q φ Idea: Inference → Optimization L ( x ; θ, φ ) 12

Measuring Similarity of Distributions Optimization objective must measure similarity between p θ and q φ . To capture this we use the Kullback-Leibler divergence: � q φ ( z | x ) log q φ ( z | x ) KL( q φ || p θ ) = p θ ( z | x ) z = E q log q φ ( z | x ) p θ ( z | x ) Divergence not distance: KL( q φ || p θ ) ≥ 0 KL( q φ || p θ ) = 0 ⇐ ⇒ q φ = p θ KL( q || p θ ) = KL( p θ || q φ ) KL is not symmetric! 13

Intuiting KL Divergence To get a feeling for what KL Divergence is doing: � q φ ( z | x ) log q φ ( z | x ) p θ ( z | x ) = E q φ log q φ ( z | x ) KL( q φ || p θ ) = p θ ( z | x ) z Consider these three cases: q is high & p is high q is high & p is low q is low 14

Isolating Intractability in KL-Divergence We can’t minimize the KL-Divergence directly: KL( q φ || p θ ) = E q φ log q φ ( z | x ) p θ ( z | x ) = E q φ log q φ ( z | x ) p θ ( x ) ( p θ ( z | x ) = p θ ( z , x ) p θ ( x ) ) p θ ( z , x ) = E q φ log q φ ( z | x ) p θ ( z , x ) + E q φ log p θ ( x ) = E q φ log q φ ( z | x ) p θ ( z , x ) + log p θ ( x ) 15

Isolating Intractability in KL-Divergence We have isolated the intractable evidence term in KL-Divergence! KL( q φ || p θ ) = ( E q φ log q φ ( z | x ) p θ ( z , x )) + log p θ ( x ) = −L ( x ; θ, φ ) + log p θ ( x ) Rearrange terms to express isolated intractable evidence: log p θ ( x ) = KL( q φ || p θ ) + L ( x ; θ, φ ) 16

Deriving a Variational Lower Bound Since KL-Divergence is non-negative: log p θ ( x ) = KL( q φ || p θ ) + L ( x ; θ, φ ) log p θ ( x ) ≥ L ( x ; θ, φ ) where L ( x ; θ, φ ) = − E q φ log q φ ( z | x ) p θ ( z , x ) A Variational Lower Bound on the intractable evidence term! This is also called the Evidence Lower Bound (ELBO). 17

Intuiting Variational Lower Bound Expand the derived variational lower bound: L ( x ; θ, φ ) = − E q φ [log q φ ( z | x ) p θ ( z , x )] = E q φ [log p θ ( x | z ) p ( z ) ] q φ ( z | x ) = E q φ [log p θ ( x | z ) + log p ( z ) − log q φ ( z | x )] p ( z ) = E q φ [log p θ ( x | z ) + log q φ ( z | x )] E q φ [log p θ ( x | z )] − KL( q φ ( z | x ) || p ( z )) = � �� Divergence from Prior Reconstruction Likelihood 18

Optimizing the ELBO in VAE To optimize the ELBO, L ( x ; θ, φ ) = E z ∼ q φ ( z | x ) [log p θ ( x | z )] − KL( q φ ( z | x ) || p ( z )) , � �� Divergence from prior; � R ( x ; θ,φ ) analytic expression by design Reconstruction likelihood we need to compute gradients ∇ θ L and ∇ φ L . • ∇ θ KL( · ) and ∇ φ KL( · ) by automatic differentiation • ∇ θ R ( x ; θ, φ ) by auto diff given samples z ∼ q φ ( z | x ) • ∇ φ R ( x ; θ, φ ) by reparameterization trick or other gradient estimator 19

Reparameterizing: a computation graph view With 1 q φ ( z | x ) � g ( φ, x , ǫ ): ∇ φ E z ∼ q φ ( z | x ) [ f ( z )] � = ∇ φ f ( z ) q φ ( z | x ) dz � ( rep . tr . ) ∇ φ = f ( g ( φ, x , ǫ )) p ( ǫ ) d ǫ = E p ( ǫ ) [ ∇ φ f ( g ( φ, x , ǫ ))] With ( rep . tr . ) due to Figure 4: from Kingma’s slides at | q φ ( z | x ) dz | = | p θ ( ǫ ) d ǫ | . This NIPS 2015 Workshop on Approx. permits a specific alteration to Inference the computation graph without introducing bias. 20

Variational Inference for Bayesian Neural Networks Jesse - PowerPoint PPT Presentation

Variational Inference for Bayesian Neural Networks Jesse Bettencourt, Harris Chan, Ricky Chen, Elliot Creager, Wei Cui, Mo- hammad Firouzi, Arvid Frydenlund, Amanjit Singh Kainth, Xuechen Li, Jeff Wintersinger, Bowen Xu October 6, 2017

Bayesian Methods for Neural Networks Readings: Bishop, Neural Networks for Pattern Recognition .

Deep Variational Inference FLARE Reading Group Presentation Wesley Tansey 9/28/2016 What is

Variational Inference for GPs: Presenters Group1: Stochastic variational inference. Slides 2 - 28

CS480/680 Machine Learning Lecture 11: February 11 th , 2020 Variational Inference Zahra

CS440/ECE448 Lecture 15: Bayesian Inference and Bayesian Learning Slides by Svetlana Lazebnik,

Variational Auto-encoders 2 VARIATIONAL AUTO-ENCODERS INTRODUCTION VARIATIONAL AUTO-ENCODERS

Rejection Sampling Variational Inference Karan Grewal CSC2547 / STA4273 Overview Variational

Variational Bayesian Inference for Parametric and Non-Parametric Regression with Missing Predictor

CS 331: Bayesian Networks 2 1 Bayesian Networks Youve heard about how Bayesian networks

Learning Neural Networks Learning Neural Networks Neural Networks can represent complex Neural

Inference in Bayesian networks Chapter 14.45 Chapter 14.45 1 Outline Exact inference

Neural Networks and Handwriting Recognition Background Neural Networks Neural Network Steven

CS 730/730W/830: Intro AI Bayesian Networks Approx. Inference Exact Inference 1 handout: slides

CS 730/830: Intro AI Bayesian Networks Approx. Inference Exact Inference Wheeler Ruml (UNH)

An Introduction to An Introduction to Variational Variational Methods for Graphical Models

Variational Russian Roulette for Variational Russian Roulette for Deep Bayesian Nonparametrics

Primary Care Research: No Longer Lost in Translation James W. Mold, MD, MPH George Lynn Cross

Class 20 Fault localization (contd) Test-data generation Exam review: Nov 3,

Focusing via display Giuseppe Greco - University of Utrecht - (work in progress with W. Fussner,

ACM Highlights Learning Center tools for professional development: http://learning.acm.org

CSE 573 Markov Decision Processes: Heuristic Search & Real-Time Dynamic Programming Slides

A Simple Obfuscation Scheme for Pattern-Matching with Wildcards Allison Bishop Lucas Kowalczyk

Crea&ng Change: Primary Preven&on of Sexual Violence

RF Strategy & the M uCool T est Area Meeting the RF in Magnetic Field Challenge Alan Bross

Sambuz

Useful Links

Newsletter

Mail Us

Variational Inference for Bayesian Neural Networks Jesse - PowerPoint PPT Presentation

Variational Inference for Bayesian Neural Networks Jesse Bettencourt, Harris Chan, Ricky Chen, Elliot Creager, Wei Cui, Mo- hammad Firouzi, Arvid Frydenlund, Amanjit Singh Kainth, Xuechen Li, Jeff Wintersinger, Bowen Xu October 6, 2017

Bayesian Methods for Neural Networks Readings: Bishop, Neural Networks for Pattern Recognition .

Deep Variational Inference FLARE Reading Group Presentation Wesley Tansey 9/28/2016 What is

Variational Inference for GPs: Presenters Group1: Stochastic variational inference. Slides 2 - 28

CS480/680 Machine Learning Lecture 11: February 11 th , 2020 Variational Inference Zahra

CS440/ECE448 Lecture 15: Bayesian Inference and Bayesian Learning Slides by Svetlana Lazebnik,

Variational Auto-encoders 2 VARIATIONAL AUTO-ENCODERS INTRODUCTION VARIATIONAL AUTO-ENCODERS

Rejection Sampling Variational Inference Karan Grewal CSC2547 / STA4273 Overview Variational

Variational Bayesian Inference for Parametric and Non-Parametric Regression with Missing Predictor

CS 331: Bayesian Networks 2 1 Bayesian Networks Youve heard about how Bayesian networks

Learning Neural Networks Learning Neural Networks Neural Networks can represent complex Neural

Inference in Bayesian networks Chapter 14.45 Chapter 14.45 1 Outline Exact inference

Neural Networks and Handwriting Recognition Background Neural Networks Neural Network Steven

CS 730/730W/830: Intro AI Bayesian Networks Approx. Inference Exact Inference 1 handout: slides

CS 730/830: Intro AI Bayesian Networks Approx. Inference Exact Inference Wheeler Ruml (UNH)

An Introduction to An Introduction to Variational Variational Methods for Graphical Models

Variational Russian Roulette for Variational Russian Roulette for Deep Bayesian Nonparametrics

Primary Care Research: No Longer Lost in Translation James W. Mold, MD, MPH George Lynn Cross

Class 20 Fault localization (contd) Test-data generation Exam review: Nov 3,

Focusing via display Giuseppe Greco - University of Utrecht - (work in progress with W. Fussner,

ACM Highlights Learning Center tools for professional development: http://learning.acm.org

CSE 573 Markov Decision Processes: Heuristic Search &amp; Real-Time Dynamic Programming Slides

A Simple Obfuscation Scheme for Pattern-Matching with Wildcards Allison Bishop Lucas Kowalczyk

Crea&amp;ng Change: Primary Preven&amp;on of Sexual Violence

RF Strategy &amp; the M uCool T est Area Meeting the RF in Magnetic Field Challenge Alan Bross

Sambuz

Useful Links

Newsletter

Mail Us

CSE 573 Markov Decision Processes: Heuristic Search & Real-Time Dynamic Programming Slides

Crea&ng Change: Primary Preven&on of Sexual Violence

RF Strategy & the M uCool T est Area Meeting the RF in Magnetic Field Challenge Alan Bross