
slide-1
SLIDE 1

Layered Adaptive Importance Sampling

Presentation · June 2016

Luca Martino (King Juan Carlos University), Victor Elvira (The University of Edinburgh), David Luengo (Universidad Politécnica de Madrid)

slide-2
SLIDE 2

Layered Adaptive Importance Sampling

Luca Martino

FIRST PART (and brief intro of the second part) OF:

  • L. Martino, V. Elvira, D. Luengo, J. Corander. “Layered Adaptive Importance Sampling”, arXiv:1505.04732, 2015.

slide-3
SLIDE 3

Introduction and notation

◮ Bayesian inference:

  ◮ g(x): prior pdf.
  ◮ ℓ(y|x): likelihood function.
  ◮ Posterior pdf and marginal likelihood (evidence):

    $\bar{\pi}(x) = p(x|y) = \frac{\ell(y|x)\,g(x)}{Z(y)}, \qquad Z(y) = \int_{\mathcal{X}} \ell(y|x)\,g(x)\,dx.$

◮ In general, Z(y) is unknown, but we can evaluate $\pi(x) \propto \bar{\pi}(x)$, with

    $\pi(x) = \ell(y|x)\,g(x),$

  and we denote Z(y) simply as $Z = \int_{\mathcal{X}} \pi(x)\,dx$.

slide-4
SLIDE 4

Goal

◮ Our goal is to compute efficiently an integral w.r.t. the target pdf,

    $I = \frac{1}{Z} \int_{\mathcal{X}} f(x)\,\pi(x)\,dx, \qquad (1)$

  for instance, the MMSE estimator

    $x_{\text{MMSE}} = \frac{1}{Z} \int_{\mathcal{X}} x\,\pi(x)\,dx,$

  and the normalizing constant,

    $Z = \int_{\mathcal{X}} \pi(x)\,dx, \qquad (2)$

  via Monte Carlo.

slide-5
SLIDE 5

Monte Carlo approximation

◮ (Monte Carlo) IDEAL CASE: draw $x^{(m)} \sim \bar{\pi}(x)$, for m = 1, . . . , M, and

    $\hat{I} = \frac{1}{M} \sum_{m=1}^{M} f(x^{(m)}) \approx I.$

◮ However, in general:

  ◮ it is not possible to draw from $\bar{\pi}(x)$;
  ◮ even in this “ideal” case, it is not trivial to approximate Z, i.e., to find $\hat{Z} \approx Z$.
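
As a quick illustration of the ideal case (a toy sketch, not part of the original slides), assume a 1-D Gaussian target that we can actually sample from:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D target we can sample from directly: pi_bar(x) = N(x; 2, 1).
# We estimate I = E_pi[f(x)] with f(x) = x^2 (true value: 2^2 + 1 = 5).
M = 10_000
x = rng.normal(loc=2.0, scale=1.0, size=M)   # x^(m) ~ pi_bar(x)
I_hat = np.mean(x ** 2)                      # (1/M) * sum_m f(x^(m))
print(I_hat)                                 # close to 5
```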

slide-6
SLIDE 6

Proposal densities

◮ If drawing directly from the target $\bar{\pi}(x)$ is impossible, MC techniques use a simpler proposal density q(x) for generating random candidates, which are then filtered according to some suitable rule.

◮ The performance depends strictly on the choice of q(x):

  ◮ a better q(x) is one closer to $\bar{\pi}(x)$;
  ◮ proper tuning of the parameters;
  ◮ adaptive methods.

◮ Another strategy for increasing the robustness:

  ◮ combined use of several proposal pdfs $q_1, \ldots, q_N$.
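
For concreteness, a minimal single-proposal importance-sampling sketch (own toy example with an assumed bimodal target; not from the slides), where the "filtering rule" is the IS weighting:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Unnormalized toy target pi(x): sum of two Gaussian pdfs, so Z = 2.
def log_pi(x):
    return np.logaddexp(stats.norm.logpdf(x, -3.0, 1.0),
                        stats.norm.logpdf(x, 4.0, 1.5))

# Simple wide Gaussian proposal q(x) = N(0, 6^2).
M = 50_000
x = rng.normal(0.0, 6.0, size=M)                   # x^(m) ~ q(x)
logw = log_pi(x) - stats.norm.logpdf(x, 0.0, 6.0)  # log pi(x)/q(x)

Z_hat = np.mean(np.exp(logw))                      # estimate of Z (true value 2)
w = np.exp(logw - np.max(logw))                    # stabilized weights
I_hat = np.sum(w * x) / np.sum(w)                  # self-normalized estimate of E[x]
print(Z_hat, I_hat)                                # ~2 and ~0.5
```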

slide-7
SLIDE 7

In this work: brief sketch

◮ We design Adaptive Importance Sampling schemes using a population of different proposals $q_1, \ldots, q_N$, formed by two layers:

  1. Upper level - MCMC adaptation: the location parameters of the proposal pdfs are updated via (parallel or interacting) MCMC chains/transitions.
  2. Lower level - IS estimation: different ways of yielding IS estimators are considered.

◮ We mix the benefits of IS and MCMC methods:

  ◮ with MCMC → good explorative behavior;
  ◮ with IS → easy to estimate Z.

◮ It is also a way of exchanging information among parallel MCMC chains.

slide-8
SLIDE 8

First contribution

◮ HIERARCHICAL PROCEDURE

(for generating random candidates within MC methods)


slide-9
SLIDE 9

General hierarchical generation procedure

◮ Two independent levels/layers:

  1. For n = 1, . . . , N:
     1.1 (Upper level) Draw a possible location parameter $\mu_n \sim h(\mu)$.
     1.2 (Lower level) Draw $x_n^{(m)} \sim q(x|\mu_n, C) = q(x - \mu_n|C)$, with m = 1, . . . , M, and use the $x_n^{(m)}$'s as candidates inside a Monte Carlo method.

  2. Equivalent proposal density:

     $\widetilde{q}(x|C) = \int_{\mathcal{X}} q(x - \mu|C)\,h(\mu)\,d\mu, \qquad (3)$

     i.e., $x_n^{(m)} \sim \widetilde{q}(x|C)$.
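
A short sketch of this two-level generation under assumed toy choices (Gaussian h and q in 1-D; not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(2)

# Upper level: mu_n ~ h(mu); here h = N(0, 10^2) (an assumed toy prior).
# Lower level: x_n^(m) ~ q(x | mu_n, C) = N(x; mu_n, C).
N, M, C = 5, 100, 1.0
mu = rng.normal(0.0, 10.0, size=N)                    # mu_n ~ h(mu)
x = rng.normal(mu[:, None], np.sqrt(C), size=(N, M))  # x[n, m] ~ q(x | mu_n, C)

# Marginally, every x[n, m] is a draw from the equivalent proposal of Eq. (3),
# q_tilde(x|C) = int q(x - mu | C) h(mu) dmu, here N(0, 10^2 + C).
print(x.mean(), x.var())  # roughly 0 and roughly 101
```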

slide-10
SLIDE 10

Optimal “prior” h∗(µ) over the location parameters

◮ The desirable (best) scenario is:

    $\widetilde{q}(x|C) = \bar{\pi}(x).$

◮ In terms of the characteristic functions

    $Q(\nu|C) = \int_{\mathcal{X}} q(x|C)\,e^{i\nu x}\,dx, \quad H^*(\nu|C) = \int_{\mathcal{X}} h^*(x|C)\,e^{i\nu x}\,dx, \quad \bar{\Pi}(\nu) = \int_{\mathcal{X}} \bar{\pi}(x)\,e^{i\nu x}\,dx,$

  we have that the optimal prior is

    $H^*(\nu|C) = \frac{\bar{\Pi}(\nu)}{Q(\nu|C)}. \qquad (4)$

◮ In general, it is not possible to know $h^*(\mu|C)$ analytically, and thus an efficient approximation is called for.

slide-11
SLIDE 11

Alternative prior h(µ)

◮ Consider the following hierarchical procedure. For n = 1, . . . , N:

  1. Draw $\mu_n \sim h(\mu)$,
  2. Draw $x_n \sim q(x|\mu_n, C)$.

  For drawing the set $\{x_1, \ldots, x_N\}$ we use N different proposal pdfs, $q(x|\mu_1, C), \ldots, q(x|\mu_N, C)$.

◮ It is possible to interpret that the set $\{x_1, \ldots, x_N\}$ is distributed according to the following mixture,

    $\{x_1, \ldots, x_N\} \sim \Phi(x) = \frac{1}{N} \sum_{n=1}^{N} q(x|\mu_n, C), \qquad (5)$

  following the deterministic mixture argument [Owen00], [Elvira15], [Cornuet12].

slide-12
SLIDE 12

Alternative prior h(µ)

◮ The performance of the resulting MC method, where such a hierarchical procedure is applied, depends on how closely $\Phi(x)$ resembles $\bar{\pi}(x)$.

◮ A suitable alternative prior is $h(\mu) = \bar{\pi}(\mu)$, i.e., the prior h is exactly the target! Why? See below:

◮ Theoretical argument - kernel density estimation: if $\mu_n \sim \bar{\pi}(\mu)$, then $\Phi(x)$ can be interpreted as a kernel density estimate of $\bar{\pi}(x)$, where $q(x|\mu_1, C), \ldots, q(x|\mu_N, C)$ play the role of kernel functions.

slide-13
SLIDE 13

MCMC adaptation

◮ Clearly, we are not able to draw from $h(\mu) = \bar{\pi}(\mu)$ directly.

◮ We suggest using MCMC transitions with target $\bar{\pi}(\mu)$ for proposing the location parameters $\mu_1, \ldots, \mu_N$.

◮ For instance, we can use N parallel MCMC chains.

◮ Thus, in our case, we have MCMC schemes (upper level) driving a Multiple Adaptive Importance Sampling scheme (lower level).
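
As a rough illustration of the full two-layer scheme (a toy sketch under assumed choices: 1-D bimodal target, Gaussian random-walk Metropolis chains at the upper level, and plain standard IS weights at the lower level; the paper studies several weighting schemes):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

def log_pi(x):
    # Unnormalized toy target: sum of two Gaussian pdfs (so Z = 2).
    return np.logaddexp(stats.norm.logpdf(x, -3.0, 1.0),
                        stats.norm.logpdf(x, 3.0, 1.0))

N, T = 10, 500                       # chains / iterations
sigma_mcmc, sigma_prop = 3.0, 1.0    # RW step size, IS proposal scale
mu = rng.uniform(-1.0, 1.0, size=N)  # initial chain states = initial locations
xs, logws = [], []

for t in range(T):
    # Upper level: one RW Metropolis step per chain, with target pi_bar(mu).
    mu_prop = mu + sigma_mcmc * rng.standard_normal(N)
    accept = np.log(rng.uniform(size=N)) < log_pi(mu_prop) - log_pi(mu)
    mu = np.where(accept, mu_prop, mu)
    # Lower level: one IS sample per proposal q(x|mu_n, C), standard weights.
    x = mu + sigma_prop * rng.standard_normal(N)
    logw = log_pi(x) - stats.norm.logpdf(x, mu, sigma_prop)
    xs.append(x); logws.append(logw)

logw = np.concatenate(logws); x = np.concatenate(xs)
w = np.exp(logw - logw.max())
print(np.mean(np.exp(logw)))        # Z_hat, roughly 2
print(np.sum(w * x) / np.sum(w))    # posterior mean, roughly 0 by symmetry
```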

slide-14
SLIDE 14

Why do that?

◮ Improvement of the performance. Indeed:

  ◮ different well-known MC methods implicitly employ (or attempt to apply) this hierarchical procedure.

◮ In the paper, we describe a hierarchical interpretation of the Random Walk Metropolis (RWM) and Population Monte Carlo (PMC) techniques.

slide-15
SLIDE 15

Random Walk Metropolis (RWM) method

◮ One transition of the MH algorithm is given by:

  1. Draw x′ from a proposal pdf $q(x|x_{t-1}, C)$.
  2. Set $x_t = x'$ with probability

     $\alpha = \min\left[1, \frac{\pi(x')\,q(x_{t-1}|x', C)}{\pi(x_{t-1})\,q(x'|x_{t-1}, C)}\right],$

     otherwise set $x_t = x_{t-1}$ (with probability $1 - \alpha$).

◮ In a random walk (RW) proposal, $q(x|x_{t-1}, C) = q(x - x_{t-1}|C)$, the previous state $x_{t-1}$ plays the role of a location parameter of q.

◮ A RW proposal $q(x|x_{t-1}, C)$ is often used due to its explorative behavior, when no information about π(x) is available.
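
One such transition in code (a generic sketch; a symmetric Gaussian RW proposal is assumed, so the q-terms cancel in α):

```python
import numpy as np

def rwm_step(x, log_pi, scale, rng):
    """One Random Walk Metropolis transition with Gaussian RW proposal
    q(x'|x, C) = N(x'; x, scale^2); symmetry makes the q-terms cancel."""
    x_prop = x + scale * rng.standard_normal()
    log_alpha = min(0.0, log_pi(x_prop) - log_pi(x))  # log acceptance prob.
    if np.log(rng.uniform()) < log_alpha:
        return x_prop   # accept: x_t = x'
    return x            # reject: x_t = x_{t-1}
```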

slide-16
SLIDE 16

Hierarchical interpretation of RWM

Figure: Graphical representation of a RW proposal: (a) the proposal $q(x|x_{t-1}, C)$ centered at $x_{t-1}$, and (b) the proposal $q(x|x_t, C)$ centered at the new state $x_t$, both shown against the target π(x).

slide-17
SLIDE 17

Hierarchical interpretation of RWM

◮ Let us assume a “burn-in” length $T_b - 1$. Hence, considering an iteration $t \geq T_b$, we have $x_t \sim \bar{\pi}(x)$.

◮ For $t \geq T_b$, the probability of proposing a new sample using the RW proposal $q(x - x_{t-1}|C)$ can be written as

    $\widetilde{q}(x|C) = \int_{\mathcal{X}} q(x - x_{t-1}|C)\,\bar{\pi}(x_{t-1})\,dx_{t-1}, \qquad t \geq T_b.$

slide-18
SLIDE 18

Hierarchical interpretation of RWM

◮ The function $\widetilde{q}(x|C)$ is an equivalent independent proposal pdf of a RW proposal.

◮ It implies that the RW generating process is equivalent, for $t \geq T_b$, to the following hierarchical procedure:

  1. Draw $\mu' \sim \bar{\pi}(\mu)$,
  2. Draw $x' \sim q(x|\mu', C)$.

◮ This interpretation is useful for clarifying the main advantage of the RW approach, i.e., that the equivalent proposal $\widetilde{q}$ is a better choice than a roughly tuned independent proposal.

◮ The RW generating procedure thus indirectly includes certain information about the target.

slide-19
SLIDE 19

Hierarchical interpretation of RWM

◮ Denoting by $Z \sim \widetilde{q}(x|C)$, $S \sim q(x|\mu, C)$ (assuming $E[S] = \mu = 0$), and $X \sim \bar{\pi}(x)$, we can write

    $E[Z] = E[X], \qquad \Sigma_Z = C + \Sigma_X,$

  which are the mean and covariance matrix of Z with pdf $\widetilde{q}$.

Figure: Graphical representation of a RW proposal $q(x|x_{t-1}, C)$ and its equivalent independent pdf $\widetilde{q}(x|C)$, shown against the target π(x).

slide-20
SLIDE 20

Population Monte Carlo (PMC) method

◮ A standard PMC scheme [Cappe04] is an adaptive importance sampler using a cloud of proposal pdfs $q_1, \ldots, q_N$.

  1. For t = 1, . . . , T:
     1.1 Draw $x_{n,t} \sim q_n(x|\mu_{n,t-1}, C_n)$, for n = 1, . . . , N.
     1.2 Assign to each sample $x_{n,t}$ the weight

         $w_{n,t} = \frac{\pi(x_{n,t})}{q_n(x_{n,t}|\mu_{n,t-1}, C_n)}. \qquad (6)$

     1.3 Resampling: draw N independent samples $\mu_{1,t}, \ldots, \mu_{N,t}$ according to the particle approximation

         $\hat{\pi}_t^{(N)}(\mu) = \frac{1}{\sum_{n=1}^{N} w_{n,t}} \sum_{n=1}^{N} w_{n,t}\,\delta(\mu - x_{n,t}). \qquad (7)$

         Note that each $\mu_{n,t} \in \{x_{1,t}, \ldots, x_{N,t}\}$, with n = 1, . . . , N.

  2. Return all the pairs $\{x_{n,t}, w_{n,t}\}$, for all n and t.
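
A minimal sketch of this loop (own toy example: 1-D bimodal target and equal isotropic $C_n = \sigma^2$ assumed; not the reference implementation):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

def log_pi(x):  # unnormalized toy target (sum of two Gaussian pdfs)
    return np.logaddexp(stats.norm.logpdf(x, -5.0, 1.0),
                        stats.norm.logpdf(x, 5.0, 1.0))

N, T, sigma = 50, 20, 3.0
mu = rng.uniform(-10.0, 10.0, size=N)    # initial locations mu_{n,0}
all_pairs = []

for t in range(T):
    x = mu + sigma * rng.standard_normal(N)                  # step 1.1
    logw = log_pi(x) - stats.norm.logpdf(x, mu, sigma)       # step 1.2, Eq. (6)
    w = np.exp(logw - logw.max())
    mu = rng.choice(x, size=N, replace=True, p=w / w.sum())  # step 1.3, Eq. (7)
    all_pairs.append((x, logw))                              # step 2: keep all pairs
```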

slide-21
SLIDE 21

Hierarchical interpretation of PMC

◮ Fixing an iteration t, the generating procedure used in one iteration of PMC can be formulated in the hierarchical way:

  1. Draw N samples $\mu_{1,t-1}, \ldots, \mu_{N,t-1}$ from $\hat{\pi}_{t-1}^{(N)}(\mu)$, i.e., $\mu_{n,t-1} \sim \hat{\pi}_{t-1}^{(N)}(\mu)$.
  2. Draw $x_{n,t} \sim q_n(x|\mu_{n,t-1}, C_n)$, for n = 1, . . . , N.

◮ $\hat{\pi}_t^{(N)}(x)$ is a particle approximation of $\bar{\pi}(x)$ that improves as N grows.

◮ The performance of PMC depends on how well $\hat{\pi}_t^{(N)}$ approximates (the distribution of) $\bar{\pi}$.

slide-22
SLIDE 22

Equivalent proposal pdf for PMC

◮ (For obtaining the equivalent proposal $\widetilde{q}$, it is easier to invert the hierarchical procedure.)

◮ Denote by $x' \in \{x_1, \ldots, x_N\}$, with $x_n \sim q_n(x)$, a sample obtained after applying one multinomial resampling step; the pdf of $x'$ is given by

    $\widetilde{q}(x') = \int_{\mathcal{X}^N} \hat{\pi}^{(N)}(x') \left[\prod_{n=1}^{N} q_n(x_n)\right] dx_1 \cdots dx_N. \qquad (8)$

◮ Eq. (8) can be rewritten as

    $\widetilde{q}(x') = \pi(x') \sum_{j=1}^{N} \int_{\mathcal{X}^{N-1}} \frac{1}{N \hat{Z}} \left[\prod_{n=1,\, n \neq j}^{N} q_n(x_n)\right] d\mathbf{m}_{\neg j}, \qquad (9)$

  where $\mathbf{m}_{\neg j} = [x_1, \ldots, x_{j-1}, x_{j+1}, \ldots, x_N]^\top$ and $\hat{Z} \approx Z = \int_{\mathcal{X}} \pi(x)\,dx$.

slide-23
SLIDE 23

Equivalent proposal pdf for PMC

◮ For simplicity, we consider only two pdfs, $q_1(x)$ and $q_2(x)$, and draw M samples from each one.

◮ Thus, for depicting $\widetilde{q}(x)$, we consider the following procedure:

  1. For r = 1, . . . , R (different realizations):
     1.1 Draw $x_{i,m} \sim q_i(x)$, for i = 1, 2 and m = 1, . . . , M.
     1.2 Build $\hat{\pi}^{(2M)}(x)$ with weights $w_{i,m} = \frac{\pi(x_{i,m})}{q_i(x_{i,m})}$.
     1.3 (Resample once) Draw $x^{(r)} \sim \hat{\pi}^{(2M)}(x)$.

◮ We show the (interpolated) histogram of the draws $x^{(r)}$, r = 1, . . . , R.

◮ Clearly, $M \to +\infty$ ⟹ better approximation $\hat{\pi}^{(2M)}(x)$ ⟹ better $\widetilde{q}(x)$.
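
In code, this depiction procedure could look as follows (assumed toy target and Gaussian proposals):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

def pi_unnorm(x):
    return stats.norm.pdf(x, 0.0, 1.0)        # toy target

def one_realization(M):
    # 1.1 Draw M samples from each proposal, q1 = N(-2, 2^2), q2 = N(3, 2^2).
    x1 = rng.normal(-2.0, 2.0, M)
    x2 = rng.normal(3.0, 2.0, M)
    x = np.concatenate([x1, x2])
    q = np.concatenate([stats.norm.pdf(x1, -2.0, 2.0),
                        stats.norm.pdf(x2, 3.0, 2.0)])
    w = pi_unnorm(x) / q                       # 1.2 IS weights w_{i,m}
    return rng.choice(x, p=w / w.sum())        # 1.3 resample once: x^(r)

R, M = 5000, 10
draws = np.array([one_realization(M) for _ in range(R)])
# A histogram of `draws` depicts q_tilde(x); it approaches pi_bar(x) as M grows.
```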

slide-24
SLIDE 24

Equivalent proposal pdf for PMC

Figure: Interpolated histograms of the draws $x^{(r)}$, depicting $\widetilde{q}(x)$ for M = 1, M = 10, and M = 100.

slide-25
SLIDE 25

PI-MAIS adaptation versus PMC adaptation

Figure: Initial (squares) and final (circles) configurations of the location parameters of the proposal densities for (a) the standard PMC (N = 100, σ = 5) and (b) the PI-MAIS (N = 100, λ = 5) methods, in different specific runs.

slide-26
SLIDE 26

Second contribution

◮ So far, we have described the theoretical rationale for the adaptation by MCMC (Upper Level).

◮ Lower Level: GENERALIZED MULTIPLE IMPORTANCE SAMPLING SCHEMES.

◮ We provide a unified framework for adaptive MIS schemes.

slide-27
SLIDE 27

Multiple Importance Sampling

◮ IS with multiple proposals: consider for instance J = 2 proposal pdfs, $q_1(x)$ and $q_2(x)$. We wish to use them jointly.

◮ First valid approach: $x_1 \sim q_1(x)$ and $x_2 \sim q_2(x)$, and then

    $w_1 = \frac{\pi(x_1)}{q_1(x_1)}, \qquad w_2 = \frac{\pi(x_2)}{q_2(x_2)}.$

◮ Second valid approach: $x_1, x_2 \sim \frac{1}{2} q_1(x) + \frac{1}{2} q_2(x)$, and then

    $w_i = \frac{\pi(x_i)}{\frac{1}{2} q_1(x_i) + \frac{1}{2} q_2(x_i)}, \qquad i = 1, 2.$

◮ Third valid approach: $x_1 \sim q_1(x)$ and $x_2 \sim q_2(x)$, and then

    $w_i = \frac{\pi(x_i)}{\frac{1}{2} q_1(x_i) + \frac{1}{2} q_2(x_i)}, \qquad i = 1, 2,$

  known as the deterministic mixture approach; see [Owen00], [Elvira15], [Cornuet12].

◮ The third approach provides the best performance; the first one, the lowest computational cost.
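
A tiny numerical sketch of the first and third weighting rules (assumed toy target and Gaussian proposals; not from the slides):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)

def pi_unnorm(x):
    return stats.norm.pdf(x, 1.0, 1.0)           # toy target

q1 = lambda x: stats.norm.pdf(x, -1.0, 2.0)      # q1 = N(-1, 2^2)
q2 = lambda x: stats.norm.pdf(x, 2.0, 2.0)       # q2 = N(2, 2^2)
x1, x2 = rng.normal(-1.0, 2.0), rng.normal(2.0, 2.0)  # x1 ~ q1, x2 ~ q2

# First approach: each sample weighted by its own proposal only.
w1_std, w2_std = pi_unnorm(x1) / q1(x1), pi_unnorm(x2) / q2(x2)

# Third (deterministic mixture) approach: same samples, mixture denominator.
mix = lambda x: 0.5 * q1(x) + 0.5 * q2(x)
w1_dm, w2_dm = pi_unnorm(x1) / mix(x1), pi_unnorm(x2) / mix(x2)

print(w1_std, w2_std, w1_dm, w2_dm)
# The DM denominator is never smaller than half of each q_i, which keeps the
# weights bounded wherever the mixture covers the target, typically lowering
# the variance of the resulting estimators.
```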

slide-28
SLIDE 28

Multiple Importance Sampling

◮ And with J > 2 proposal pdfs? Consider

    $x_1 \sim q_1(x), \ldots, x_j \sim q_j(x), \ldots, x_J \sim q_J(x),$

  and the generalized IS weight

    $w_j = \frac{\pi(x_j)}{\Phi(x_j)}. \qquad (10)$

◮ Standard IS approach: $\Phi(x_j) = q_j(x_j)$.

◮ “Full” deterministic mixture IS (DM-MIS) approach:

    $\Phi(x_j) = \frac{1}{J} \sum_{k=1}^{J} q_k(x_j).$

◮ For J > 2, there is another alternative: the partial DM-MIS approach [Elvira15].

slide-29
SLIDE 29

MIS within adaptive methods

◮ Let us consider an adaptive IS algorithm using N proposals at each iteration t = 1, . . . , T:

    $t \;\rightarrow\; q_{1,t}(x), \ldots, q_{n,t}(x), \ldots, q_{N,t}(x),$
    $t+1 \;\rightarrow\; q_{1,t+1}(x), \ldots, q_{n,t+1}(x), \ldots, q_{N,t+1}(x),$
    $t+2 \;\rightarrow\; \ldots$

◮ There are J = NT proposal pdfs in total (adapted via MCMC, for instance).

◮ In this case, we have several possibilities:

    $x_{n,t} \sim q_{n,t}(x), \qquad w_{n,t} = \frac{\pi(x_{n,t})}{\Phi_{n,t}(x_{n,t})}.$

slide-30
SLIDE 30

Proposal pdfs spread in time-space

◮ For instance, $\Phi_{n,t}(x) = \psi(x)$ (full DM-MIS), $\Phi_{n,t}(x) = \xi_n(x)$ (partial DM-MIS over the iterations of the n-th proposal), or $\Phi_{n,t}(x) = \phi_t(x)$ (partial DM-MIS over the N proposals at iteration t).

Figure: J = NT proposal pdfs $q_{n,t}(x)$ used in the generalized adaptive multiple IS scheme, spread through the state space X (n = 1, . . . , N; rows) and adapted over time (t = 1, . . . , T; columns). The mixture ψ(x) involves the whole N × T grid, ξ_n(x) the n-th row, and φ_t(x) the t-th column.

slide-31
SLIDE 31

DM-MIS schemes in adaptive IS methods

◮ Full DM-MIS (involving all the J = NT proposals):

    $\Phi_{n,t}(x) = \psi(x) = \frac{1}{NT} \sum_{n=1}^{N} \sum_{t=1}^{T} q_{n,t}(x).$

◮ Partial DM-MIS (involving only the N proposals at iteration t), used in APIS [Martino15] and in certain PMC schemes:

    $\Phi_{n,t}(x) = \phi_t(x) = \frac{1}{N} \sum_{n=1}^{N} q_{n,t}(x).$

◮ Partial DM-MIS (considering the temporal evolution of the n-th proposal), used in AMIS [Cornuet12]:

    $\Phi_{n,t}(x) = \xi_n(x) = \frac{1}{T} \sum_{t=1}^{T} q_{n,t}(x).$
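
A sketch of the three denominators on a grid of assumed Gaussian proposals $q_{n,t}(x) = \mathcal{N}(x; \mu_{n,t}, \sigma^2)$ (toy locations, not from the slides):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Grid of J = N*T proposals q_{n,t}(x) = N(x; mu[n, t], sigma^2).
N, T, sigma = 4, 3, 1.0
mu = rng.normal(0.0, 5.0, size=(N, T))   # location of q_{n,t}

def q_all(x):
    # Matrix of q_{n,t}(x) evaluations, shape (N, T).
    return stats.norm.pdf(x, mu, sigma)

def phi_full(x):         # psi(x): mixture of all N*T proposals
    return q_all(x).mean()

def phi_spatial(x, t):   # phi_t(x): mixture of the N proposals at iteration t
    return q_all(x)[:, t].mean()

def phi_temporal(x, n):  # xi_n(x): mixture over the T iterations of proposal n
    return q_all(x)[n, :].mean()

x = 1.3
print(phi_full(x), phi_spatial(x, t=0), phi_temporal(x, n=0))
```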

slide-32
SLIDE 32

◮ Thank you very much!
◮ Any questions?

For further information, detailed descriptions of the algorithms, and numerical results, see:

  • L. Martino, V. Elvira, D. Luengo, J. Corander. “Layered Adaptive Importance Sampling”, arXiv:1505.04732, 2015.

slide-33
SLIDE 33

Main references

[Owen00] A. Owen, Y. Zhou. Safe and effective importance sampling. Journal of the American Statistical Association, 95(449):135-143, 2000.

[Elvira15] V. Elvira, L. Martino, D. Luengo, M. Bugallo. Efficient multiple importance sampling estimators. IEEE Signal Processing Letters, 22(10):1757-1761, 2015.

[Cornuet12] J. M. Cornuet, J. M. Marin, A. Mira, C. P. Robert. Adaptive multiple importance sampling. Scandinavian Journal of Statistics, 39(4):798-812, 2012.

[Martino15] L. Martino, V. Elvira, D. Luengo, J. Corander. An adaptive population importance sampler: Learning from the uncertainty. IEEE Transactions on Signal Processing (in press), 2015.

[Cappe04] O. Cappé, A. Guillin, J. M. Marin, C. P. Robert. Population Monte Carlo. Journal of Computational and Graphical Statistics, 13(4):907-929, 2004.
