A-NICE-MC: Adversarial Training for MCMC
Jiaming Song, Shengjia Zhao, Stefano Ermon
Stanford University
March 7, 2018

Table of contents
1. Motivation
2. Notations and Problem Setup
3. Adversarial Training for Markov Chains
4. Adversarial Training for MCMC
5. Experiments

Motivation

Bayesian Inference
Parameters $\theta$, observations $\mathcal{D}$:
Input: prior $p(\theta)$ and likelihood $p(\mathcal{D} \mid \theta)$
Output: posterior $p(\theta \mid \mathcal{D})$ through Bayes' rule:
$$p(\theta \mid \mathcal{D}) = \frac{p(\theta)\, p(\mathcal{D} \mid \theta)}{p(\mathcal{D})}$$
Problem: the marginal $p(\mathcal{D})$ is intractable!
Solutions: Variational Inference and Markov chain Monte Carlo
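To make the role of the marginal concrete, here is a minimal sketch that evaluates Bayes' rule on a grid for a hypothetical coin-flip model. The model (flat prior, 7 heads, 3 tails) and the names `grid_posterior`, `heads`, and `tails` are illustrative assumptions, not part of the talk. The evidence is a sum over every grid point, which is cheap in one dimension but scales exponentially with the number of parameters.

```python
def grid_posterior(heads, tails, num_points=101):
    """Posterior over a coin bias theta on a uniform grid, with a flat prior."""
    thetas = [i / (num_points - 1) for i in range(num_points)]
    prior = [1.0 / num_points] * num_points
    # Likelihood p(D | theta) of the observed flips (hypothetical model).
    likelihood = [t ** heads * (1.0 - t) ** tails for t in thetas]
    unnormalized = [p * l for p, l in zip(prior, likelihood)]
    # p(D): requires visiting every grid point, the step that becomes
    # intractable when theta is high-dimensional.
    evidence = sum(unnormalized)
    posterior = [u / evidence for u in unnormalized]
    return thetas, posterior


thetas, posterior = grid_posterior(heads=7, tails=3)
posterior_mean = sum(t * p for t, p in zip(thetas, posterior))
print(round(posterior_mean, 3))
```

With a grid of 101 points per dimension, a model with n parameters would need 101^n evaluations, which is why approximate inference is used instead.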

Bayesian Inference
Variational Inference: approximate the posterior with some tractable model and minimize its distance to the posterior.
Examples: mean field approximation [2]
Advantages: optimization is efficient
Drawbacks: performance limited by choice of model

Bayesian Inference
Markov chain Monte Carlo: approximate the posterior with particles sampled from a Markov chain with the desired stationary distribution.
Method: proposal for next particle + Metropolis-Hastings
Examples: Gibbs sampling [4], Hamiltonian Monte Carlo [8]
Advantages: reaches the true posterior asymptotically
Drawbacks: needs many samples to obtain good estimates

Deep Bayesian Learning
Variational Inference <- Deep Learning: ✓
• stochastic gradient descent as optimization algorithm
• expressive function approximations to represent model
Markov chain Monte Carlo <- Deep Learning: ✗
• cannot apply expressive function approximations directly
• proposals are hand-designed in general
• hard to evaluate / optimize metrics

Outline
Markov chain Monte Carlo + Deep Learning: ✓
We introduce A-NICE-MC, a new method for training flexible MCMC kernels.
• proposals are parameterized using (deep) neural networks
• use adversarial methods to train a Markov chain that
  • matches a target stationary distribution quickly (burn-in)
  • achieves low autocorrelation between samples (mixing)
• learned proposals are much more efficient than traditional ones

Notations and Problem Setup

Notations
A sequence of continuous random variables $\{x_t\}_{t=0}^{\infty}$ is drawn through the following Markov chain:
$$x_0 \sim \pi^0, \qquad x_{t+1} \sim T_\theta(x_{t+1} \mid x_t)$$
where
• $T_\theta(\cdot \mid x)$: a stochastic transition kernel parametrized by $\theta$
• $\pi^0$: some initial distribution for $x_0$
• $\pi_\theta^t$: state distribution at time $t$
$T_\theta$ is defined through an implicit generative model $f_\theta(\cdot \mid x, v)$, where $v \sim p(v)$ is an auxiliary random variable.
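The notation above can be read as a sampling loop. The sketch below is a toy instance, not the model from the talk: `f_theta` is a hypothetical linear map with additive noise, the initial distribution is assumed to be a wide Gaussian, and `theta = (a, s)` is an illustrative parameterization.

```python
import random


def f_theta(x, v, theta):
    """Hypothetical transition: linear map of the state plus scaled noise."""
    a, s = theta
    return a * x + s * v


def run_chain(theta, num_steps, rng):
    """Draw x_0 ~ pi^0, then x_{t+1} = f_theta(x_t, v) with v ~ p(v)."""
    x = rng.gauss(0.0, 5.0)          # x_0 ~ pi^0 (assumed N(0, 5^2))
    states = [x]
    for _ in range(num_steps):
        v = rng.gauss(0.0, 1.0)      # auxiliary variable v ~ p(v) = N(0, 1)
        x = f_theta(x, v, theta)
        states.append(x)
    return states


rng = random.Random(0)
states = run_chain(theta=(0.5, 1.0), num_steps=20000, rng=rng)

# For this toy kernel the stationary distribution is N(0, s^2 / (1 - a^2)).
tail = states[100:]                  # drop early states still influenced by pi^0
mean = sum(tail) / len(tail)
variance = sum((x - mean) ** 2 for x in tail) / len(tail)
```

Sampling only requires evaluating `f_theta`; the density of the state at time t is never computed, which is what makes the kernel implicit.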

Problem Setup
Let $p_d(x)$ be a target distribution over $x \in \mathbb{R}^n$, e.g.:
• an (intractable) posterior distribution
• a data distribution (which we can sample from)
Our objective is to find a $T_\theta$ such that:
1. Low bias: the stationary distribution is close to the target distribution (minimize $|\pi_\theta - p_d|$).
2. Efficiency: $\{\pi_\theta^t\}_{t=0}^{\infty}$ converges quickly (minimize $t$ such that $|\pi_\theta^t - p_d| < \delta$).
3. Low variance: samples from one chain $\{x_t\}_{t=0}^{\infty}$ should be as uncorrelated as possible (minimize autocorrelation of $\{x_t\}_{t=0}^{\infty}$).
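The third criterion can be measured directly from a chain. The following is a minimal sketch of a sample autocorrelation estimate; the function name `autocorrelation` and the two synthetic chains are assumptions made for illustration.

```python
import random


def autocorrelation(chain, lag):
    """Sample autocorrelation of a scalar chain at the given lag."""
    n = len(chain)
    mean = sum(chain) / n
    var = sum((x - mean) ** 2 for x in chain) / n
    cov = sum((chain[i] - mean) * (chain[i + lag] - mean)
              for i in range(n - lag)) / n
    return cov / var


rng = random.Random(0)

# Independent draws: the ideal case, autocorrelation near zero.
iid_chain = [rng.gauss(0.0, 1.0) for _ in range(20000)]

# A slowly mixing chain: each state keeps 90% of the previous one.
sticky_chain = [0.0]
for _ in range(19999):
    sticky_chain.append(0.9 * sticky_chain[-1] + rng.gauss(0.0, 1.0))

rho_iid = autocorrelation(iid_chain, lag=1)
rho_sticky = autocorrelation(sticky_chain, lag=1)
```

A chain with high autocorrelation needs many more steps to deliver the same amount of information as independent samples, which is why the objective asks for it to be minimized.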

Settings
Problem setup:
Input: a target distribution $p_d(x)$
Output: a transition kernel $T_\theta(\cdot \mid x)$
We consider two settings for specifying the target distribution.
• $p_d(x)$ is a data distribution (samples, no analytic expression)
• $p_d(x)$ is given by an analytic expression (up to normalization constant, no samples)

Adversarial Training for Markov Chains

Parametrized Markov Chains
Assume we have direct access to samples from $p_d(x)$, and the transition kernel $T_\theta(x_{t+1} \mid x_t)$ is the following implicit generative model:
$$v \sim p(v), \qquad x_{t+1} = f_\theta(x_t, v) \tag{1}$$
for which the stationary distribution $\pi_\theta(x)$ exists.
Goal: find $\theta$ such that $\pi_\theta(x)$ is close to $p_d$.

Training Markov Chains
Likelihood-based Approaches:
• the value of $\pi_\theta(x)$ is typically intractable to compute
• the marginal distribution $\pi_\theta^t(x)$ at time $t$ is also intractable (integration over all the possible paths)
Likelihood-free Approaches:
• sampling is easy for Markov chains!
• likelihood-free methods only require samples
• Example: Generative Adversarial Networks [5]

Generative Adversarial Networks
Generator $G(z)$: generates samples by transforming a noise variable $z \sim p(z)$ into $G(z)$
Discriminator $D(x)$: trained to distinguish between samples from the generator and samples from $p_d$.
This describes the following objective [1]:
$$\min_G \max_D V(D, G) = \min_G \max_D \; \mathbb{E}_{x \sim p_d}[D(x)] - \mathbb{E}_{z \sim p(z)}[D(G(z))] \tag{2}$$
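The value inside the min-max can be estimated from samples alone. The sketch below does this for a fixed discriminator and a fixed generator; both are hypothetical one-dimensional stand-ins (a `tanh` score and a shifted Gaussian), not the networks used in the talk, and no optimization is performed.

```python
import math
import random


def discriminator(x):
    """Hypothetical fixed critic D(x): higher scores for larger x."""
    return math.tanh(x - 1.0)


def generator(z, mu):
    """Hypothetical generator G(z): shifts the noise by mu."""
    return mu + z


def estimate_v(mu, num_samples, rng):
    """Monte Carlo estimate of E_{x~p_d}[D(x)] - E_{z~p(z)}[D(G(z))]."""
    real = [rng.gauss(2.0, 1.0) for _ in range(num_samples)]   # p_d assumed N(2, 1)
    fake = [generator(rng.gauss(0.0, 1.0), mu) for _ in range(num_samples)]
    real_score = sum(discriminator(x) for x in real) / num_samples
    fake_score = sum(discriminator(x) for x in fake) / num_samples
    return real_score - fake_score


rng = random.Random(0)
v_mismatched = estimate_v(mu=0.0, num_samples=20000, rng=rng)  # generator far from p_d
v_matched = estimate_v(mu=2.0, num_samples=20000, rng=rng)     # generator equals p_d
```

When the generator distribution equals the data distribution, the two expectations agree for any discriminator, so the estimate is close to zero; a mismatched generator leaves a gap that the discriminator can exploit.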

Likelihood-free Training for Markov Chains
In our settings:
• $p_d(x)$ is the empirical distribution from the samples ✓
• $G_\theta(z)$ is the stationary distribution? Approximate it with the state after $t$ steps? ✗
It is hard to sample from the stationary distribution or optimize through a long chain!

Conditions for Stationary Distribution
We consider two necessary conditions for $p_d$ to be a stationary distribution:
• $p_d$ should be close to $\pi^b$ for some time step $b$
• $p_d$ is a fixed point for the transition operator
We can construct an objective that can be optimized efficiently through the two conditions.

Markov GAN
Markov GAN (MGAN) objective:
$$\min_\theta \max_D \; \mathbb{E}_{x \sim p_d}[D(x)] - \lambda\, \mathbb{E}_{\bar{x} \sim \pi_\theta^b}[D(\bar{x})] - (1 - \lambda)\, \mathbb{E}_{x_d \sim p_d,\, \bar{x} \sim T_\theta^m(\bar{x} \mid x_d)}[D(\bar{x})] \tag{3}$$
where
• $\lambda \in (0, 1)$, $b \in \mathbb{N}^+$, $m \in \mathbb{N}^+$ are hyperparameters
• $\bar{x}$ denotes "fake" samples from the generator
• $T_\theta^m(x \mid x_d)$ denotes the distribution of $x$ when the transition kernel is applied $m$ times, starting from some "real" sample $x_d$

Markov GAN
Markov GAN (MGAN) objective:
$$\min_\theta \max_D \; \mathbb{E}_{x \sim p_d}[D(x)] - \underbrace{\lambda\, \mathbb{E}_{\bar{x} \sim \pi_\theta^b}[D(\bar{x})]}_{\text{converge to } p_d} - \underbrace{(1 - \lambda)\, \mathbb{E}_{x_d \sim p_d,\, \bar{x} \sim T_\theta^m(\bar{x} \mid x_d)}[D(\bar{x})]}_{\text{fixed point at } p_d} \tag{4}$$
We use two types of samples from the generator for training:
1. Samples after $b$ transitions, starting from $x_0 \sim \pi^0$.
2. Samples after $m$ transitions, starting from $x_d \sim p_d$.
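The two sample types can be collected as follows. This is a sketch under strong simplifying assumptions: the kernel is a hypothetical scalar map (not a neural network), the discriminator is fixed rather than trained, $p_d$ is taken to be N(0, 1), and $\pi^0$ is taken to be N(0, 3^2). It only evaluates the MGAN objective for a given kernel; it does not optimize it.

```python
import math
import random


def kernel(x, v, shift):
    """Hypothetical T_theta. With shift = 0 it leaves N(0, 1) invariant."""
    return 0.5 * x + math.sqrt(0.75) * v + shift


def apply_kernel(x, steps, shift, rng):
    """Apply the transition kernel `steps` times starting from x."""
    for _ in range(steps):
        x = kernel(x, rng.gauss(0.0, 1.0), shift)
    return x


def discriminator(x):
    """Hypothetical fixed critic D(x)."""
    return math.tanh(x - 0.5)


def mean_score(samples):
    return sum(discriminator(x) for x in samples) / len(samples)


def mgan_objective(shift, lam, b, m, num_samples, rng):
    real = [rng.gauss(0.0, 1.0) for _ in range(num_samples)]        # x ~ p_d
    # Type 1: b transitions starting from x_0 ~ pi^0.
    from_init = [apply_kernel(rng.gauss(0.0, 3.0), b, shift, rng)
                 for _ in range(num_samples)]
    # Type 2: m transitions starting from a real sample x_d ~ p_d.
    from_data = [apply_kernel(rng.gauss(0.0, 1.0), m, shift, rng)
                 for _ in range(num_samples)]
    return (mean_score(real)
            - lam * mean_score(from_init)
            - (1.0 - lam) * mean_score(from_data))


rng = random.Random(0)
obj_stationary = mgan_objective(0.0, lam=0.5, b=20, m=2, num_samples=10000, rng=rng)
obj_shifted = mgan_objective(-1.0, lam=0.5, b=20, m=2, num_samples=10000, rng=rng)
```

A kernel whose stationary distribution is $p_d$ makes both fake-sample terms match the real term, so the objective is near zero. A kernel that drifts away from $p_d$ is penalized by both terms, even though the second term only ever runs $m$ short steps.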

Justifications
Proposition. Consider a sequence of ergodic Markov chains over state space $S$. Define $\pi_n$ as the stationary distribution for the $n$-th Markov chain, and $\pi_n^t$ as the probability distribution at time step $t$ for the $n$-th chain. If the following two conditions hold:
1. $\exists\, b > 0$ such that the sequence $\{\pi_n^b\}_{n=1}^{\infty}$ converges to $p_d$ in total variation;
2. $\exists\, \epsilon > 0$, $\rho < 1$ such that $\exists\, M > 0$, $\forall\, m > M$, if $\|\pi_m^t - p_d\|_{TV} < \epsilon$, then $\|\pi_m^{t+1} - p_d\|_{TV} < \rho\, \|\pi_m^t - p_d\|_{TV}$;
then the sequence of stationary distributions $\{\pi_n\}_{n=1}^{\infty}$ converges to $p_d$ in total variation.

Sketch of Proof
Proof. The goal is to prove that $\forall\, \delta > 0$, $\exists\, N > 0$, $T > 0$, such that $\forall\, n > N$, $t > T$, $\|\pi_n^t - p_d\|_{TV} < \delta$.
• $\exists\, N > 0$, such that $\forall\, n > N$, $\|\pi_n^b - p_d\|_{TV} < \epsilon$ (Assumption 1).
• $\forall\, n > \max(N, M)$, $\forall\, \delta > 0$, $\exists\, T = b + \max(0, \lceil \log_\rho \delta - \log_\rho \epsilon \rceil) + 1$, such that $\forall\, t > T$, $\|\pi_n^t - p_d\|_{TV} < \delta$ (Assumption 2).
Hence the sequence $\{\pi_n\}_{n=1}^{\infty}$ converges to $p_d$ in total variation.
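The choice of $T$ in the second bullet follows from iterating the contraction in Assumption 2. The short derivation below spells this out; it assumes $0 < \rho < 1$ (the slide only states $\rho < 1$) and $n > \max(N, M)$.

```latex
% Assumes 0 < \rho < 1 and n > \max(N, M).
\begin{align*}
\|\pi_n^{b} - p_d\|_{TV} &< \epsilon
  && \text{(Assumption 1)} \\
\|\pi_n^{b+k} - p_d\|_{TV} &< \rho^{k}\, \epsilon
  && \text{(Assumption 2 applied $k$ times; every iterate stays below $\epsilon$)} \\
\rho^{k}\, \epsilon \le \delta
  &\iff k \ge \log_\rho \frac{\delta}{\epsilon}
   = \log_\rho \delta - \log_\rho \epsilon
  && \text{(the inequality flips because $\log_\rho$ is decreasing)}
\end{align*}
```

So $k = \max(0, \lceil \log_\rho \delta - \log_\rho \epsilon \rceil)$ steps after time $b$ are enough, where the $\max$ covers the case $\delta \ge \epsilon$ in which no further contraction is needed. Because each chain is ergodic, $\pi_n^t$ converges to $\pi_n$ as $t$ grows, so the bound on $\pi_n^t$ for all large $t$ carries over to the stationary distribution.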

Example: Generative Model for Images
We experiment with a distribution $p_d$ over images, such as digits (MNIST) and faces (CelebA), where $x_{t+1} = f_\theta(x_t, v)$ is defined as
$$z = \mathrm{encoder}_\theta(x_t), \qquad z' = \mathrm{ReLU}(z + \beta v), \qquad x_{t+1} = \mathrm{decoder}_\theta(z') \tag{5}$$
where $\beta$ is a hyperparameter we set to 0.1.
Figure 1: Visualizing samples of $\pi^1$ to $\pi^{50}$ (each row) from a model trained on the MNIST dataset. Consecutive samples can be related in label (red box), inclination (green box) or width (blue box).
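The structure of this transition can be sketched without any deep learning library. In the snippet below, `encoder` and `decoder` are hypothetical elementwise linear maps standing in for the trained networks, and the "image" is a short list of floats; only the three-step composition and the value of beta come from the slide.

```python
import random


def encoder(x):
    """Stand-in for encoder_theta (assumed: elementwise scaling)."""
    return [2.0 * xi for xi in x]


def decoder(z):
    """Stand-in for decoder_theta (assumed: inverse of the scaling)."""
    return [0.5 * zi for zi in z]


def relu(z):
    return [max(0.0, zi) for zi in z]


def transition(x_t, rng, beta=0.1):
    """One step x_{t+1} = decoder(ReLU(encoder(x_t) + beta * v))."""
    z = encoder(x_t)
    v = [rng.gauss(0.0, 1.0) for _ in z]             # v ~ N(0, I)
    z_prime = relu([zi + beta * vi for zi, vi in zip(z, v)])
    return decoder(z_prime)


rng = random.Random(0)
x = [0.2, 0.7, 0.0, 1.0]                             # toy "image"
trajectory = [x]
for _ in range(50):                                  # states x_1 .. x_50
    x = transition(x, rng)
    trajectory.append(x)
```

The noise is injected in the latent space rather than in pixel space, so a small beta yields consecutive samples that differ in a few latent attributes, consistent with the related consecutive samples described in the figure caption.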

Transition Probabilities on MNIST
We use a classifier to classify the generated images and evaluate the class transition probabilities $T_\theta(y_{t+1} \mid y_t)$.
Figure 2: The transition is not symmetric!
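Given the classifier's labels along a chain, the class transition probabilities can be estimated by counting consecutive label pairs. The sketch below assumes the labels are already available as a list of integers; the function name `transition_matrix` and the toy label sequence are illustrative.

```python
def transition_matrix(labels, num_classes):
    """Row-normalized counts of label transitions y_t -> y_{t+1}."""
    counts = [[0] * num_classes for _ in range(num_classes)]
    for current, following in zip(labels, labels[1:]):
        counts[current][following] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # Rows for classes that never occur as y_t are left as zeros.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


labels = [0, 1, 1, 0, 1]                     # toy predicted labels along one chain
T = transition_matrix(labels, num_classes=2)
is_symmetric = all(T[i][j] == T[j][i] for i in range(2) for j in range(2))
```

In this toy sequence, class 0 always moves to class 1, while class 1 moves back only half of the time, so the estimated matrix is not symmetric.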

Adversarial Training for MCMC

Analytical Target
Now consider the setting where the target distribution $p_d$ is specified by an analytical expression:
$$p_d(x) \propto \exp(-U(x)) \tag{6}$$
where
• $U(x)$ is a known energy function
• the normalization constant of $\exp(-U(x))$ is not available
There are two additional challenges:
• We want the stationary distribution to be exactly $p_d$
• We do not have direct access to samples from $p_d$

Metropolis-Hastings
We use ideas from the Markov chain Monte Carlo (MCMC) literature to address the first challenge.
Detailed Balance: $p_d(x)\, T_\theta(x' \mid x) = p_d(x')\, T_\theta(x \mid x')$ for all $x$ and $x'$.
Metropolis-Hastings:
• a sample $x'$ is first obtained from a proposal distribution $g_\theta(x' \mid x)$
• $x'$ is accepted with the following probability:
$$A_\theta(x' \mid x) = \min\left(1,\; \exp\big(U(x) - U(x')\big)\, \frac{g_\theta(x \mid x')}{g_\theta(x' \mid x)}\right) \tag{7}$$
Let $T_\theta(x' \mid x) = g_\theta(x' \mid x)\, A_\theta(x' \mid x)$; then the Markov chain has stationary distribution $p_d$ [6].
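The accept/reject rule can be sketched in a few lines. The example below uses a hand-written Gaussian random-walk proposal and the energy $U(x) = x^2 / 2$ (a standard normal target); both are assumptions chosen for illustration and are not the learned proposal from the talk. Only the energy is evaluated, so the normalization constant is never needed.

```python
import math
import random


def energy(x):
    """Assumed energy U(x) = x^2 / 2, i.e. p_d is a standard normal."""
    return 0.5 * x * x


def propose(x, rng, step=1.0):
    """Assumed proposal g(x' | x): Gaussian random walk."""
    return x + rng.gauss(0.0, step)


def log_proposal_density(x_to, x_from, step=1.0):
    """log g(x_to | x_from), up to a constant that cancels in the ratio."""
    return -0.5 * ((x_to - x_from) / step) ** 2


def mh_step(x, rng):
    """One Metropolis-Hastings transition targeting exp(-U(x))."""
    x_new = propose(x, rng)
    log_ratio = (energy(x) - energy(x_new)
                 + log_proposal_density(x, x_new)      # g(x | x')
                 - log_proposal_density(x_new, x))     # g(x' | x)
    accept_prob = math.exp(min(0.0, log_ratio))        # min(1, ratio)
    return x_new if rng.random() < accept_prob else x


rng = random.Random(0)
x = 3.0
samples = []
for _ in range(50000):
    x = mh_step(x, rng)
    samples.append(x)

kept = samples[1000:]                                  # discard burn-in
sample_mean = sum(kept) / len(kept)
sample_var = sum((s - sample_mean) ** 2 for s in kept) / len(kept)
```

For this symmetric proposal the two proposal terms cancel, but they are kept in the code because a general proposal $g_\theta$ requires them.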
