
slide-1
SLIDE 1

Layered Adaptive Importance Sampling

Presentation · June 2016

Luca Martino (King Juan Carlos University), Victor Elvira (The University of Edinburgh), David Luengo (Universidad Politécnica de Madrid)

slide-2
SLIDE 2

Layered Adaptive Importance Sampling

Luca Martino

FIRST PART (and brief intro of the second part) OF:

  • L. Martino, V. Elvira, D. Luengo, J. Corander. “Layered Adaptive Importance Sampling”, arXiv:1505.04732, 2015.

slide-3
SLIDE 3

Introduction and notation

◮ Bayesian inference:

  ◮ g(x): prior pdf.
  ◮ ℓ(y|x): likelihood function.
  ◮ Posterior pdf and marginal likelihood (evidence):

    $\bar{\pi}(x) = p(x|y) = \frac{\ell(y|x)\,g(x)}{Z(y)}, \qquad Z(y) = \int_{\mathcal{X}} \ell(y|x)\,g(x)\,dx.$

◮ In general, Z(y) is unknown, but we can evaluate $\pi(x) \propto \bar{\pi}(x)$, with

    $\pi(x) = \ell(y|x)\,g(x),$

  and we denote Z(y) simply as $Z = \int_{\mathcal{X}} \pi(x)\,dx$.

slide-4
SLIDE 4

Goal

◮ Our goal is to compute efficiently an integral w.r.t. the target pdf,

    $I = \frac{1}{Z} \int_{\mathcal{X}} f(x)\,\pi(x)\,dx, \qquad (1)$

  for instance, the MMSE estimator

    $x_{\text{MMSE}} = \frac{1}{Z} \int_{\mathcal{X}} x\,\pi(x)\,dx,$

  and the normalizing constant,

    $Z = \int_{\mathcal{X}} \pi(x)\,dx, \qquad (2)$

  via Monte Carlo.

slide-5
SLIDE 5

Monte Carlo approximation

◮ (Monte Carlo) IDEAL CASE: draw $x^{(m)} \sim \bar{\pi}(x)$, for m = 1, . . . , M, and

    $\hat{I} = \frac{1}{M} \sum_{m=1}^{M} f(x^{(m)}) \approx I.$

◮ However, in general:

  ◮ it is not possible to draw from $\bar{\pi}(x)$;
  ◮ even in this “ideal” case, it is not trivial to approximate Z, i.e., to find $\hat{Z} \approx Z$.
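
As a quick illustration of the ideal case (a toy sketch, not part of the original slides), assume a 1-D Gaussian target that we can actually sample from:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D target we can sample from directly: pi_bar(x) = N(x; 2, 1).
# We estimate I = E_pi[f(x)] with f(x) = x^2 (true value: 2^2 + 1 = 5).
M = 10_000
x = rng.normal(loc=2.0, scale=1.0, size=M)   # x^(m) ~ pi_bar(x)
I_hat = np.mean(x ** 2)                      # (1/M) * sum_m f(x^(m))
print(I_hat)                                 # close to 5
```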

slide-6
SLIDE 6

Proposal densities

◮ If drawing directly from the target $\bar{\pi}(x)$ is impossible, MC techniques use a simpler proposal density q(x) for generating random candidates, which are then filtered according to some suitable rule.

◮ The performance depends strictly on the choice of q(x):

  ◮ a better q(x) is one closer to $\bar{\pi}(x)$;
  ◮ proper tuning of the parameters;
  ◮ adaptive methods.

◮ Another strategy for increasing the robustness:

  ◮ combined use of several proposal pdfs $q_1, \ldots, q_N$.
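
For concreteness, a minimal single-proposal importance-sampling sketch (own toy example with an assumed bimodal target; not from the slides), where the "filtering rule" is the IS weighting:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Unnormalized toy target pi(x): sum of two Gaussian pdfs, so Z = 2.
def log_pi(x):
    return np.logaddexp(stats.norm.logpdf(x, -3.0, 1.0),
                        stats.norm.logpdf(x, 4.0, 1.5))

# Simple wide Gaussian proposal q(x) = N(0, 6^2).
M = 50_000
x = rng.normal(0.0, 6.0, size=M)                   # x^(m) ~ q(x)
logw = log_pi(x) - stats.norm.logpdf(x, 0.0, 6.0)  # log pi(x)/q(x)

Z_hat = np.mean(np.exp(logw))                      # estimate of Z (true value 2)
w = np.exp(logw - np.max(logw))                    # stabilized weights
I_hat = np.sum(w * x) / np.sum(w)                  # self-normalized estimate of E[x]
print(Z_hat, I_hat)                                # ~2 and ~0.5
```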

slide-7
SLIDE 7

In this work: brief sketch

◮ We design Adaptive Importance Sampling schemes using a population of different proposals $q_1, \ldots, q_N$, formed by two layers:

  1. Upper level - MCMC adaptation: the location parameters of the proposal pdfs are updated via (parallel or interacting) MCMC chains/transitions.
  2. Lower level - IS estimation: different ways of yielding IS estimators are considered.

◮ We mix the benefits of IS and MCMC methods:

  ◮ with MCMC → good explorative behavior;
  ◮ with IS → easy to estimate Z.

◮ It is also a way of exchanging information among parallel MCMC chains.

slide-8
SLIDE 8

First contribution

◮ HIERARCHICAL PROCEDURE

(for generating random candidates within MC methods)


slide-9
SLIDE 9

General hierarchical generation procedure

◮ Two independent levels/layers:

  1. For n = 1, . . . , N:
     1.1 (Upper level) Draw a possible location parameter $\mu_n \sim h(\mu)$.
     1.2 (Lower level) Draw $x_n^{(m)} \sim q(x|\mu_n, C) = q(x - \mu_n|C)$, with m = 1, . . . , M, and use the $x_n^{(m)}$'s as candidates inside a Monte Carlo method.

  2. Equivalent proposal density:

     $\widetilde{q}(x|C) = \int_{\mathcal{X}} q(x - \mu|C)\,h(\mu)\,d\mu, \qquad (3)$

     i.e., $x_n^{(m)} \sim \widetilde{q}(x|C)$.
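
A short sketch of this two-level generation under assumed toy choices (Gaussian h and q in 1-D; not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(2)

# Upper level: mu_n ~ h(mu); here h = N(0, 10^2) (an assumed toy prior).
# Lower level: x_n^(m) ~ q(x | mu_n, C) = N(x; mu_n, C).
N, M, C = 5, 100, 1.0
mu = rng.normal(0.0, 10.0, size=N)                    # mu_n ~ h(mu)
x = rng.normal(mu[:, None], np.sqrt(C), size=(N, M))  # x[n, m] ~ q(x | mu_n, C)

# Marginally, every x[n, m] is a draw from the equivalent proposal of Eq. (3),
# q_tilde(x|C) = int q(x - mu | C) h(mu) dmu, here N(0, 10^2 + C).
print(x.mean(), x.var())  # roughly 0 and roughly 101
```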

slide-10
SLIDE 10

Optimal “prior” h∗(µ) over the location parameters

◮ The desirable (best) scenario is:

    $\widetilde{q}(x|C) = \bar{\pi}(x).$

◮ In terms of the characteristic functions

    $Q(\nu|C) = \int_{\mathcal{X}} q(x|C)\,e^{i\nu x}\,dx, \quad H^*(\nu|C) = \int_{\mathcal{X}} h^*(x|C)\,e^{i\nu x}\,dx, \quad \bar{\Pi}(\nu) = \int_{\mathcal{X}} \bar{\pi}(x)\,e^{i\nu x}\,dx,$

  we have that the optimal prior is

    $H^*(\nu|C) = \frac{\bar{\Pi}(\nu)}{Q(\nu|C)}. \qquad (4)$

◮ In general, it is not possible to know $h^*(\mu|C)$ analytically, and thus an efficient approximation is called for.

slide-11
SLIDE 11

Alternative prior h(µ)

◮ Consider the following hierarchical procedure. For n = 1, . . . , N:

  1. Draw $\mu_n \sim h(\mu)$,
  2. Draw $x_n \sim q(x|\mu_n, C)$.

  For drawing the set $\{x_1, \ldots, x_N\}$ we use N different proposal pdfs, $q(x|\mu_1, C), \ldots, q(x|\mu_N, C)$.

◮ It is possible to interpret that the set $\{x_1, \ldots, x_N\}$ is distributed according to the following mixture,

    $\{x_1, \ldots, x_N\} \sim \Phi(x) = \frac{1}{N} \sum_{n=1}^{N} q(x|\mu_n, C), \qquad (5)$

  following the deterministic mixture argument [Owen00], [Elvira15], [Cornuet12].

slide-12
SLIDE 12

Alternative prior h(µ)

◮ The performance of the resulting MC method, where such a hierarchical procedure is applied, depends on how closely $\Phi(x)$ resembles $\bar{\pi}(x)$.

◮ A suitable alternative prior is $h(\mu) = \bar{\pi}(\mu)$, i.e., the prior h is exactly the target! Why? See below:

◮ Theoretical argument - kernel density estimation: if $\mu_n \sim \bar{\pi}(\mu)$, then $\Phi(x)$ can be interpreted as a kernel density estimate of $\bar{\pi}(x)$, where $q(x|\mu_1, C), \ldots, q(x|\mu_N, C)$ play the role of kernel functions.

slide-13
SLIDE 13

MCMC adaptation

◮ Clearly, we are not able to draw from $h(\mu) = \bar{\pi}(\mu)$ directly.

◮ We suggest using MCMC transitions with target $\bar{\pi}(\mu)$ for proposing the location parameters $\mu_1, \ldots, \mu_N$.

◮ For instance, we can use N parallel MCMC chains.

◮ Thus, in our case, we have MCMC schemes (upper level) driving a Multiple Adaptive Importance Sampling scheme (lower level).
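
As a rough illustration of the full two-layer scheme (a toy sketch under assumed choices: 1-D bimodal target, Gaussian random-walk Metropolis chains at the upper level, and plain standard IS weights at the lower level; the paper studies several weighting schemes):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

def log_pi(x):
    # Unnormalized toy target: sum of two Gaussian pdfs (so Z = 2).
    return np.logaddexp(stats.norm.logpdf(x, -3.0, 1.0),
                        stats.norm.logpdf(x, 3.0, 1.0))

N, T = 10, 500                       # chains / iterations
sigma_mcmc, sigma_prop = 3.0, 1.0    # RW step size, IS proposal scale
mu = rng.uniform(-1.0, 1.0, size=N)  # initial chain states = initial locations
xs, logws = [], []

for t in range(T):
    # Upper level: one RW Metropolis step per chain, with target pi_bar(mu).
    mu_prop = mu + sigma_mcmc * rng.standard_normal(N)
    accept = np.log(rng.uniform(size=N)) < log_pi(mu_prop) - log_pi(mu)
    mu = np.where(accept, mu_prop, mu)
    # Lower level: one IS sample per proposal q(x|mu_n, C), standard weights.
    x = mu + sigma_prop * rng.standard_normal(N)
    logw = log_pi(x) - stats.norm.logpdf(x, mu, sigma_prop)
    xs.append(x); logws.append(logw)

logw = np.concatenate(logws); x = np.concatenate(xs)
w = np.exp(logw - logw.max())
print(np.mean(np.exp(logw)))        # Z_hat, roughly 2
print(np.sum(w * x) / np.sum(w))    # posterior mean, roughly 0 by symmetry
```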

slide-14
SLIDE 14

Why do that?

◮ Improvement of the performance. Indeed:

  ◮ different well-known MC methods implicitly employ (or attempt to apply) this hierarchical procedure.

◮ In the paper, we describe a hierarchical interpretation of the Random Walk Metropolis (RWM) and Population Monte Carlo (PMC) techniques.

slide-15
SLIDE 15

Random Walk Metropolis (RWM) method

◮ One transition of the MH algorithm is given by:

  1. Draw x′ from a proposal pdf $q(x|x_{t-1}, C)$.
  2. Set $x_t = x'$ with probability

     $\alpha = \min\left[1, \frac{\pi(x')\,q(x_{t-1}|x', C)}{\pi(x_{t-1})\,q(x'|x_{t-1}, C)}\right],$

     otherwise set $x_t = x_{t-1}$ (with probability $1 - \alpha$).

◮ In a random walk (RW) proposal, $q(x|x_{t-1}, C) = q(x - x_{t-1}|C)$, the previous state $x_{t-1}$ plays the role of a location parameter of q.

◮ A RW proposal $q(x|x_{t-1}, C)$ is often used due to its explorative behavior, when no information about π(x) is available.
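
One such transition in code (a generic sketch; a symmetric Gaussian RW proposal is assumed, so the q-terms cancel in α):

```python
import numpy as np

def rwm_step(x, log_pi, scale, rng):
    """One Random Walk Metropolis transition with Gaussian RW proposal
    q(x'|x, C) = N(x'; x, scale^2); symmetry makes the q-terms cancel."""
    x_prop = x + scale * rng.standard_normal()
    log_alpha = min(0.0, log_pi(x_prop) - log_pi(x))  # log acceptance prob.
    if np.log(rng.uniform()) < log_alpha:
        return x_prop   # accept: x_t = x'
    return x            # reject: x_t = x_{t-1}
```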

slide-16
SLIDE 16

Hierarchical interpretation of RWM

Figure: Graphical representation of a RW proposal: (a) the proposal $q(x|x_{t-1}, C)$ centered at $x_{t-1}$, and (b) the proposal $q(x|x_t, C)$ centered at the new state $x_t$, both shown against the target π(x).

slide-17
SLIDE 17

Hierarchical interpretation of RWM

◮ Let us assume a “burn-in” length $T_b - 1$. Hence, considering an iteration $t \geq T_b$, we have $x_t \sim \bar{\pi}(x)$.

◮ For $t \geq T_b$, the probability of proposing a new sample using the RW proposal $q(x - x_{t-1}|C)$ can be written as

    $\widetilde{q}(x|C) = \int_{\mathcal{X}} q(x - x_{t-1}|C)\,\bar{\pi}(x_{t-1})\,dx_{t-1}, \qquad t \geq T_b.$

slide-18
SLIDE 18

Hierarchical interpretation of RWM

◮ The function $\widetilde{q}(x|C)$ is an equivalent independent proposal pdf of a RW proposal.

◮ It implies that the RW generating process is equivalent, for $t \geq T_b$, to the following hierarchical procedure:

  1. Draw $\mu' \sim \bar{\pi}(\mu)$,
  2. Draw $x' \sim q(x|\mu', C)$.

◮ This interpretation is useful for clarifying the main advantage of the RW approach, i.e., that the equivalent proposal $\widetilde{q}$ is a better choice than a roughly tuned independent proposal.

◮ The RW generating procedure thus indirectly includes certain information about the target.

slide-19
SLIDE 19

Hierarchical interpretation of RWM

◮ Denoting by $Z \sim \widetilde{q}(x|C)$, $S \sim q(x|\mu, C)$ (assuming $E[S] = \mu = 0$), and $X \sim \bar{\pi}(x)$, we can write

    $E[Z] = E[X], \qquad \Sigma_Z = C + \Sigma_X,$

  which are the mean and covariance matrix of Z with pdf $\widetilde{q}$.

Figure: Graphical representation of a RW proposal $q(x|x_{t-1}, C)$ and its equivalent independent pdf $\widetilde{q}(x|C)$, shown against the target π(x).

slide-20
SLIDE 20

Population Monte Carlo (PMC) method

◮ A standard PMC scheme [Cappe04] is an adaptive importance sampler using a cloud of proposal pdfs $q_1, \ldots, q_N$.

  1. For t = 1, . . . , T:
     1.1 Draw $x_{n,t} \sim q_n(x|\mu_{n,t-1}, C_n)$, for n = 1, . . . , N.
     1.2 Assign to each sample $x_{n,t}$ the weight

         $w_{n,t} = \frac{\pi(x_{n,t})}{q_n(x_{n,t}|\mu_{n,t-1}, C_n)}. \qquad (6)$

     1.3 Resampling: draw N independent samples $\mu_{1,t}, \ldots, \mu_{N,t}$ according to the particle approximation

         $\hat{\pi}_t^{(N)}(\mu) = \frac{1}{\sum_{n=1}^{N} w_{n,t}} \sum_{n=1}^{N} w_{n,t}\,\delta(\mu - x_{n,t}). \qquad (7)$

         Note that each $\mu_{n,t} \in \{x_{1,t}, \ldots, x_{N,t}\}$, with n = 1, . . . , N.

  2. Return all the pairs $\{x_{n,t}, w_{n,t}\}$, for all n and t.
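
A minimal sketch of this loop (own toy example: 1-D bimodal target and equal isotropic $C_n = \sigma^2$ assumed; not the reference implementation):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

def log_pi(x):  # unnormalized toy target (sum of two Gaussian pdfs)
    return np.logaddexp(stats.norm.logpdf(x, -5.0, 1.0),
                        stats.norm.logpdf(x, 5.0, 1.0))

N, T, sigma = 50, 20, 3.0
mu = rng.uniform(-10.0, 10.0, size=N)    # initial locations mu_{n,0}
all_pairs = []

for t in range(T):
    x = mu + sigma * rng.standard_normal(N)                  # step 1.1
    logw = log_pi(x) - stats.norm.logpdf(x, mu, sigma)       # step 1.2, Eq. (6)
    w = np.exp(logw - logw.max())
    mu = rng.choice(x, size=N, replace=True, p=w / w.sum())  # step 1.3, Eq. (7)
    all_pairs.append((x, logw))                              # step 2: keep all pairs
```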

slide-21
SLIDE 21

Hierarchical interpretation of PMC

◮ Fixing an iteration t, the generating procedure used in one iteration of PMC can be formulated in the hierarchical way:

  1. Draw N samples $\mu_{1,t-1}, \ldots, \mu_{N,t-1}$ from $\hat{\pi}_{t-1}^{(N)}(\mu)$, i.e., $\mu_{n,t-1} \sim \hat{\pi}_{t-1}^{(N)}(\mu)$.
  2. Draw $x_{n,t} \sim q_n(x|\mu_{n,t-1}, C_n)$, for n = 1, . . . , N.

◮ $\hat{\pi}_t^{(N)}(x)$ is a particle approximation of $\bar{\pi}(x)$ that improves as N grows.

◮ The performance of PMC depends on how well $\hat{\pi}_t^{(N)}$ approximates (the distribution of) $\bar{\pi}$.

slide-22
SLIDE 22

Equivalent proposal pdf for PMC

◮ (For obtaining the equivalent proposal $\widetilde{q}$, it is easier to invert the hierarchical procedure.)

◮ Denote by $x' \in \{x_1, \ldots, x_N\}$, with $x_n \sim q_n(x)$, a sample obtained after applying one multinomial resampling step; the pdf of $x'$ is given by

    $\widetilde{q}(x') = \int_{\mathcal{X}^N} \hat{\pi}^{(N)}(x') \left[\prod_{n=1}^{N} q_n(x_n)\right] dx_1 \cdots dx_N. \qquad (8)$

◮ Eq. (8) can be rewritten as

    $\widetilde{q}(x') = \pi(x') \sum_{j=1}^{N} \int_{\mathcal{X}^{N-1}} \frac{1}{N \hat{Z}} \left[\prod_{n=1,\, n \neq j}^{N} q_n(x_n)\right] d\mathbf{m}_{\neg j}, \qquad (9)$

  where $\mathbf{m}_{\neg j} = [x_1, \ldots, x_{j-1}, x_{j+1}, \ldots, x_N]^\top$ and $\hat{Z} \approx Z = \int_{\mathcal{X}} \pi(x)\,dx$.

slide-23
SLIDE 23

Equivalent proposal pdf for PMC

◮ For simplicity, we consider only two pdfs, $q_1(x)$ and $q_2(x)$, and draw M samples from each one.

◮ Thus, for depicting $\widetilde{q}(x)$, we consider the following procedure:

  1. For r = 1, . . . , R (different realizations):
     1.1 Draw $x_{i,m} \sim q_i(x)$, for i = 1, 2 and m = 1, . . . , M.
     1.2 Build $\hat{\pi}^{(2M)}(x)$ with weights $w_{i,m} = \frac{\pi(x_{i,m})}{q_i(x_{i,m})}$.
     1.3 (Resample once) Draw $x^{(r)} \sim \hat{\pi}^{(2M)}(x)$.

◮ We show the (interpolated) histogram of the draws $x^{(r)}$, r = 1, . . . , R.

◮ Clearly, $M \to +\infty$ ⟹ better approximation $\hat{\pi}^{(2M)}(x)$ ⟹ better $\widetilde{q}(x)$.
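
In code, this depiction procedure could look as follows (assumed toy target and Gaussian proposals):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

def pi_unnorm(x):
    return stats.norm.pdf(x, 0.0, 1.0)        # toy target

def one_realization(M):
    # 1.1 Draw M samples from each proposal, q1 = N(-2, 2^2), q2 = N(3, 2^2).
    x1 = rng.normal(-2.0, 2.0, M)
    x2 = rng.normal(3.0, 2.0, M)
    x = np.concatenate([x1, x2])
    q = np.concatenate([stats.norm.pdf(x1, -2.0, 2.0),
                        stats.norm.pdf(x2, 3.0, 2.0)])
    w = pi_unnorm(x) / q                       # 1.2 IS weights w_{i,m}
    return rng.choice(x, p=w / w.sum())        # 1.3 resample once: x^(r)

R, M = 5000, 10
draws = np.array([one_realization(M) for _ in range(R)])
# A histogram of `draws` depicts q_tilde(x); it approaches pi_bar(x) as M grows.
```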

slide-24
SLIDE 24

Equivalent proposal pdf for PMC

Figure: Interpolated histograms of the draws $x^{(r)}$, depicting $\widetilde{q}(x)$ for M = 1, M = 10, and M = 100.

slide-25
SLIDE 25

PI-MAIS adaptation versus PMC adaptation

Figure: Initial (squares) and final (circles) configurations of the location parameters of the proposal densities for (a) the standard PMC (N = 100, σ = 5) and (b) the PI-MAIS (N = 100, λ = 5) methods, in different specific runs.

slide-26
SLIDE 26

Second contribution

◮ So far, we have described the theoretical rationale for the adaptation by MCMC (Upper Level).

◮ Lower Level: GENERALIZED MULTIPLE IMPORTANCE SAMPLING SCHEMES.

◮ We provide a unified framework for adaptive MIS schemes.

slide-27
SLIDE 27

Multiple Importance Sampling

◮ IS with multiple proposals: consider for instance J = 2 proposal pdfs, $q_1(x)$ and $q_2(x)$. We wish to use them jointly.

◮ First valid approach: $x_1 \sim q_1(x)$ and $x_2 \sim q_2(x)$, and then

    $w_1 = \frac{\pi(x_1)}{q_1(x_1)}, \qquad w_2 = \frac{\pi(x_2)}{q_2(x_2)}.$

◮ Second valid approach: $x_1, x_2 \sim \frac{1}{2} q_1(x) + \frac{1}{2} q_2(x)$, and then

    $w_i = \frac{\pi(x_i)}{\frac{1}{2} q_1(x_i) + \frac{1}{2} q_2(x_i)}, \qquad i = 1, 2.$

◮ Third valid approach: $x_1 \sim q_1(x)$ and $x_2 \sim q_2(x)$, and then

    $w_i = \frac{\pi(x_i)}{\frac{1}{2} q_1(x_i) + \frac{1}{2} q_2(x_i)}, \qquad i = 1, 2,$

  known as the deterministic mixture approach; see [Owen00], [Elvira15], [Cornuet12].

◮ The third approach provides the best performance; the first one, the lowest computational cost.
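
A tiny numerical sketch of the first and third weighting rules (assumed toy target and Gaussian proposals; not from the slides):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)

def pi_unnorm(x):
    return stats.norm.pdf(x, 1.0, 1.0)           # toy target

q1 = lambda x: stats.norm.pdf(x, -1.0, 2.0)      # q1 = N(-1, 2^2)
q2 = lambda x: stats.norm.pdf(x, 2.0, 2.0)       # q2 = N(2, 2^2)
x1, x2 = rng.normal(-1.0, 2.0), rng.normal(2.0, 2.0)  # x1 ~ q1, x2 ~ q2

# First approach: each sample weighted by its own proposal only.
w1_std, w2_std = pi_unnorm(x1) / q1(x1), pi_unnorm(x2) / q2(x2)

# Third (deterministic mixture) approach: same samples, mixture denominator.
mix = lambda x: 0.5 * q1(x) + 0.5 * q2(x)
w1_dm, w2_dm = pi_unnorm(x1) / mix(x1), pi_unnorm(x2) / mix(x2)

print(w1_std, w2_std, w1_dm, w2_dm)
# The DM denominator is never smaller than half of each q_i, which keeps the
# weights bounded wherever the mixture covers the target, typically lowering
# the variance of the resulting estimators.
```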

slide-28
SLIDE 28

Multiple Importance Sampling

◮ And with J > 2 proposal pdfs? Consider

    $x_1 \sim q_1(x), \ldots, x_j \sim q_j(x), \ldots, x_J \sim q_J(x),$

  and the generalized IS weight

    $w_j = \frac{\pi(x_j)}{\Phi(x_j)}. \qquad (10)$

◮ Standard IS approach: $\Phi(x_j) = q_j(x_j)$.

◮ “Full” deterministic mixture IS (DM-MIS) approach:

    $\Phi(x_j) = \frac{1}{J} \sum_{k=1}^{J} q_k(x_j).$

◮ For J > 2, there is another alternative: the partial DM-MIS approach [Elvira15].

slide-29
SLIDE 29

MIS within adaptive methods

◮ Let us consider an adaptive IS algorithm using N proposals at each iteration t = 1, . . . , T:

    $t \;\rightarrow\; q_{1,t}(x), \ldots, q_{n,t}(x), \ldots, q_{N,t}(x),$
    $t+1 \;\rightarrow\; q_{1,t+1}(x), \ldots, q_{n,t+1}(x), \ldots, q_{N,t+1}(x),$
    $t+2 \;\rightarrow\; \ldots$

◮ There are J = NT proposal pdfs in total (adapted via MCMC, for instance).

◮ In this case, we have several possibilities:

    $x_{n,t} \sim q_{n,t}(x), \qquad w_{n,t} = \frac{\pi(x_{n,t})}{\Phi_{n,t}(x_{n,t})}.$

slide-30
SLIDE 30

Proposal pdfs spread in time-space

◮ For instance, $\Phi_{n,t}(x) = \psi(x)$ (full DM-MIS), $\Phi_{n,t}(x) = \xi_n(x)$ (partial DM-MIS over the iterations of the n-th proposal), or $\Phi_{n,t}(x) = \phi_t(x)$ (partial DM-MIS over the N proposals at iteration t).

Figure: J = NT proposal pdfs $q_{n,t}(x)$ used in the generalized adaptive multiple IS scheme, spread through the state space X (n = 1, . . . , N; rows) and adapted over time (t = 1, . . . , T; columns). The mixture ψ(x) involves the whole N × T grid, ξ_n(x) the n-th row, and φ_t(x) the t-th column.

slide-31
SLIDE 31

DM-MIS schemes in adaptive IS methods

◮ Full DM-MIS (involving all the J = NT proposals):

    $\Phi_{n,t}(x) = \psi(x) = \frac{1}{NT} \sum_{n=1}^{N} \sum_{t=1}^{T} q_{n,t}(x).$

◮ Partial DM-MIS (involving only the N proposals at iteration t), used in APIS [Martino15] and in certain PMC schemes:

    $\Phi_{n,t}(x) = \phi_t(x) = \frac{1}{N} \sum_{n=1}^{N} q_{n,t}(x).$

◮ Partial DM-MIS (considering the temporal evolution of the n-th proposal), used in AMIS [Cornuet12]:

    $\Phi_{n,t}(x) = \xi_n(x) = \frac{1}{T} \sum_{t=1}^{T} q_{n,t}(x).$
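
A sketch of the three denominators on a grid of assumed Gaussian proposals $q_{n,t}(x) = \mathcal{N}(x; \mu_{n,t}, \sigma^2)$ (toy locations, not from the slides):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Grid of J = N*T proposals q_{n,t}(x) = N(x; mu[n, t], sigma^2).
N, T, sigma = 4, 3, 1.0
mu = rng.normal(0.0, 5.0, size=(N, T))   # location of q_{n,t}

def q_all(x):
    # Matrix of q_{n,t}(x) evaluations, shape (N, T).
    return stats.norm.pdf(x, mu, sigma)

def phi_full(x):         # psi(x): mixture of all N*T proposals
    return q_all(x).mean()

def phi_spatial(x, t):   # phi_t(x): mixture of the N proposals at iteration t
    return q_all(x)[:, t].mean()

def phi_temporal(x, n):  # xi_n(x): mixture over the T iterations of proposal n
    return q_all(x)[n, :].mean()

x = 1.3
print(phi_full(x), phi_spatial(x, t=0), phi_temporal(x, n=0))
```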

slide-32
SLIDE 32

◮ Thank you very much!
◮ Any questions?

For further information, detailed descriptions of the algorithms, and numerical results, see:

  • L. Martino, V. Elvira, D. Luengo, J. Corander. “Layered Adaptive Importance Sampling”, arXiv:1505.04732, 2015.

slide-33
SLIDE 33

Main references

[Owen00] A. Owen, Y. Zhou. Safe and effective importance sampling. Journal of the American Statistical Association, 95(449):135-143, 2000.

[Elvira15] V. Elvira, L. Martino, D. Luengo, M. Bugallo. Efficient multiple importance sampling estimators. IEEE Signal Processing Letters, 22(10):1757-1761, 2015.

[Cornuet12] J. M. Cornuet, J. M. Marin, A. Mira, C. P. Robert. Adaptive multiple importance sampling. Scandinavian Journal of Statistics, 39(4):798-812, 2012.

[Martino15] L. Martino, V. Elvira, D. Luengo, J. Corander. An adaptive population importance sampler: Learning from the uncertainty. IEEE Transactions on Signal Processing (in press), 2015.

[Cappe04] O. Cappé, A. Guillin, J. M. Marin, C. P. Robert. Population Monte Carlo. Journal of Computational and Graphical Statistics, 13(4):907-929, 2004.
