SLIDE 1

Tuning of pseudo-marginal MCMC

Alex Thiéry (National University of Singapore)

Joint work with C. Sherlock (Lancaster), G. Roberts (Warwick), J. Rosenthal (Toronto).
SLIDE 2

Pseudo-marginal MCMC

Target density π(x); positive unbiased estimate π̂(x).

Theorem (Beaumont 2003, Andrieu-Roberts 2009): the usual MCMC run with π̂(x) in place of π(x) is correct!

No free lunch: poor estimator π̂, poor mixing.

SLIDE 3

Wrong Pseudo-marginal MCMC

1. Start from x0 ∈ X.
2. Proposal: y ∼ q(xk, ·).
3. Noisy Metropolis-Hastings: α = 1 ∧ [π̂(y) q(y, xk)] / [π̂(xk) q(xk, y)], with fresh estimates π̂(y) and π̂(xk) computed at every iteration.
4. y → xk+1 with probability α.

Not correct. Not too bad, after correction. Notice that E[π̂(y)/π̂(x)] ≠ π(y)/π(x).
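As a concrete illustration, here is a minimal Python sketch of this "wrong" scheme, assuming a symmetric proposal so the q-ratio cancels; the names noisy_mh, pi_hat and propose are illustrative, with pi_hat(x, rng) standing for a hypothetical function returning a fresh positive unbiased estimate of π(x):

```python
import numpy as np

def noisy_mh(pi_hat, propose, x0, n_iter, rng):
    """'Wrong' noisy Metropolis-Hastings (symmetric proposal assumed):
    the estimate at the CURRENT point is recomputed with fresh noise at
    every iteration, so the chain does not leave pi exactly invariant."""
    x = x0
    chain = [x]
    for _ in range(n_iter):
        y = propose(x, rng)
        # fresh, independent estimates at both points: this is the bias
        alpha = min(1.0, pi_hat(y, rng) / pi_hat(x, rng))
        if rng.uniform() < alpha:
            x = y
        chain.append(x)
    return np.array(chain)
```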
SLIDE 4

Pseudo-marginal MCMC

1. Start from x0 ∈ X, estimator z0 = π̂(x0).
2. Proposal: y ∼ q(xk, ·), estimator zy = π̂(y).
3. Noisy Metropolis-Hastings: α = 1 ∧ [zy q(y, xk)] / [zk q(xk, y)].
4. (y, zy) → (xk+1, zk+1) with probability α.
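The only change from the previous slide is that the accepted estimate is stored and recycled. A minimal sketch, again assuming a symmetric proposal and the same hypothetical pi_hat(x, rng):

```python
import numpy as np

def pseudo_marginal_mh(pi_hat, propose, x0, n_iter, rng):
    """Pseudo-marginal Metropolis-Hastings (symmetric proposal assumed):
    the estimate z of the current point is drawn once, stored, and only
    replaced when a move is accepted; the augmented chain (x, z) is then
    exactly invariant for a distribution whose x-marginal is pi."""
    x, z = x0, pi_hat(x0, rng)      # z0 = pi_hat(x0)
    chain = [x]
    for _ in range(n_iter):
        y = propose(x, rng)
        zy = pi_hat(y, rng)         # fresh estimate at the proposal only
        if rng.uniform() < min(1.0, zy / z):
            x, z = y, zy            # carry the accepted estimate forward
        chain.append(x)
    return np.array(chain)
```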

SLIDE 5

Latent variables

Observation y, latent variable u, parameter x ∈ X:

π(x) ∝ π0(x) × ∫_U P(y | u, x) P(du | x)

Data augmentation: MCMC on X × U. High correlation between latent variable and parameter? High-dimensional latent variable u?

Pseudo-marginal: imitate MCMC on X only. If one can simulate u | x and compute P(y | u, x):

π̂(x) = π0(x) × (1/n) Σ_{k=1}^n P(y | uk, x)
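A minimal sketch of this averaged estimator; prior_density, sample_u and likelihood are placeholder callables standing for π0(x), P(du | x) and P(y | u, x):

```python
import numpy as np

def pi_hat(x, y, prior_density, sample_u, likelihood, n, rng):
    """Unbiased (up to the normalising constant) estimate of pi(x):
    average P(y | u, x) over n independent draws u_1, ..., u_n from
    P(du | x), then multiply by the prior density pi_0(x)."""
    draws = [likelihood(y, sample_u(x, rng), x) for _ in range(n)]
    return prior_density(x) * np.mean(draws)
```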

SLIDE 6

Unbiased estimators?

  • importance sampling
  • sequential importance sampling
  • particle filters
  • unbiased estimator of f(µ) from an unbiased estimator of µ: random truncation, Russian roulette (see the sketch below)
  • probabilistic interpretation of PDEs (Feynman-Kac): replace solving a PDE by simulating a few particles
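To make the random-truncation idea concrete, here is a sketch for the particular choice f(µ) = exp(µ), assuming iid non-negative unbiased estimates of µ are available (non-negativity keeps the estimator positive); the Poisson truncation is one standard choice, and the resulting estimator is unbiased but can have a very large variance:

```python
import numpy as np

def unbiased_exp(sample_mu, rng):
    """Single-term randomly truncated estimator of exp(mu).
    Write exp(mu) = sum_k mu^k / k!, pick the term K ~ Poisson(1),
    so P(K = k) = e^{-1} / k!, and divide its unbiased estimate
    X_1 * ... * X_K / K! by the selection probability; everything
    simplifies to e * X_1 * ... * X_K (an empty product equals 1)."""
    K = rng.poisson(1)
    return np.e * np.prod([sample_mu(rng) for _ in range(K)])
```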

SLIDE 7

Stickiness

Better estimator, better mixing [talk C. Andrieu].

π̂(x) = π0(x) × (1/n) Σ_{k=1}^n P(y | uk, x)

behaves like the marginal chain as Var(π̂(x)) → 0. If π̂(x) ≫ π(x), the parameter x is over-estimated: stuck!

Trade-off: MCMC mixing vs. estimator computational time.
  • good estimators take a lot of time to compute
  • good estimators yield good mixing

SLIDE 8

How to measure efficiency?

Integrated Autocorrelation Time (for which test function ϕ?):

IACT = 1 + 2 Σ_{k=1}^∞ Cor(ϕ(Xt), ϕ(Xt+k))

Spectral Gap, Log-Sobolev, mixing time: complicated!

The target distribution of (Xk, Wk) depends on the distribution of the noise: it is difficult to compare chains with different target distributions.
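A minimal sketch of estimating IACT from a single trace of ϕ(Xt); the truncation at the first non-positive autocorrelation estimate is a common heuristic, not something from the talk:

```python
import numpy as np

def iact(trace, max_lag=None):
    """Estimate IACT = 1 + 2 * sum_{k>=1} Cor(phi(X_t), phi(X_{t+k}))
    from a single trace of phi(X_t), truncating the sum at the first
    non-positive empirical autocorrelation."""
    x = np.asarray(trace, dtype=float) - np.mean(trace)
    n = len(x)
    var = np.dot(x, x) / n
    tau = 1.0
    for k in range(1, max_lag or n // 2):
        rho = np.dot(x[:-k], x[k:]) / ((n - k) * var)
        if rho <= 0.0:
            break
        tau += 2.0 * rho
    return tau
```

Comparing iact values across different test functions ϕ illustrates the slide's point that the answer depends on the choice of ϕ.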

SLIDE 9

A toy example

One-dimensional Random Walk Metropolis: y = x + λ ξ with ξ ∼ N(0, 1). Gaussian target π(x) ∝ exp(−x²/2). Estimator of the form

π̂(x) = π(x) × e^W

with W = σ Z − σ²/2 and Z ∼ N(0, 1), so that Var(W) = σ². The process (Xk, Wk) is a Markov chain.
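A minimal simulation sketch of this toy chain (the function name toy_chain and the default seed are illustrative choices):

```python
import numpy as np

def toy_chain(lam, sigma, n_iter=500, rng=None):
    """Pseudo-marginal RWM on the standard Gaussian target with
    log-normal noise: log pi_hat(x) = -x**2 / 2 + W, where
    W = sigma * Z - sigma**2 / 2 guarantees E[e^W] = 1."""
    rng = rng or np.random.default_rng(0)
    x = 0.0
    w = sigma * rng.standard_normal() - sigma**2 / 2
    xs = [x]
    for _ in range(n_iter):
        y = x + lam * rng.standard_normal()
        wy = sigma * rng.standard_normal() - sigma**2 / 2
        # log acceptance ratio: stored estimate vs fresh proposal estimate
        log_alpha = (-y**2 / 2 + wy) - (-x**2 / 2 + w)
        if np.log(rng.uniform()) < log_alpha:
            x, w = y, wy
        xs.append(x)
    return np.array(xs)
```

Running it with σ = 0, 1, 5, 10 and λ of order 1 reproduces the behaviour shown on the next slide.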

SLIDE 10

Influence of the noise σ

[Figure: four trace plots of X against iteration number (500 iterations), one per noise level sd = 0, 1, 5, 10; the range explored by the chain shrinks as the noise grows, and the sd = 10 chain barely moves.]

SLIDE 11

Influence of the jump size λ

[Figure: two plots as the jump size λ varies, IACT against mean acceptance probability and Spectral Gap against mean acceptance probability.]

SLIDE 12

Diffusion limit

High-dimensional state space, local-move MCMC: it looks like a diffusion from far away.

  • all of the above approaches coincide in the diffusion limit
  • intuitive understanding + simple conclusions
  • the process Wk is averaged out
  • relevant for moderate dimension
  • robust results (e.g. the 0.234 rule)

SLIDE 13

Goal

Tuning of pseudo-marginal RWM:

  • small jump size: explore slowly
  • large jump size: most proposals are rejected
  • cheap estimator: fast computation time, bad mixing
  • good estimator: expensive, good mixing

SLIDE 14

High Dimension + simple target distributions

Product-form target distribution in R^d:

π(x) = f(x1) × f(x2) × ... × f(xd)

Random Walk Metropolis: y = x + (µ/√d) ξ with ξ ∼ N(0, Id). Unbiased estimator:

π̂(x) = π(x) × e^W

Assumption: W is independent of x.
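A minimal simulation sketch of this setting, taking f standard Gaussian and log-normal noise as in the toy example; the function name, the choice of starting from the target, and the returned summaries are illustrative:

```python
import numpy as np

def product_target_chain(mu, sigma, d, n_iter, rng):
    """Pseudo-marginal RWM on the product target prod_i f(x_i) with f
    standard Gaussian, jump size mu / sqrt(d), and x-independent
    log-normal noise with Var(W) = sigma**2.  Returns the trace of the
    first coordinate and the mean acceptance probability."""
    x = rng.standard_normal(d)                    # start from the target
    w = sigma * rng.standard_normal() - sigma**2 / 2
    accepted = 0
    first_coord = [x[0]]
    for _ in range(n_iter):
        y = x + (mu / np.sqrt(d)) * rng.standard_normal(d)
        wy = sigma * rng.standard_normal() - sigma**2 / 2
        log_alpha = (-0.5 * y @ y + wy) - (-0.5 * x @ x + w)
        if np.log(rng.uniform()) < log_alpha:
            x, w = y, wy
            accepted += 1
        first_coord.append(x[0])
    return np.array(first_coord), accepted / n_iter
```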

SLIDE 15

Accelerated first-coordinate process:

V(d)(t) = X1(d)(⌊d t⌋)

V(d) is not Markovian. In the limit d → ∞ it is!
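Concretely, reusing the hypothetical product_target_chain sketched on the previous slide, V(d) is just the first coordinate read off every d steps:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 100
first_coord, acc = product_target_chain(mu=2.5, sigma=1.0, d=d,
                                        n_iter=50_000, rng=rng)
V = first_coord[::d]   # V(t) = X_1(floor(d * t)) at t = 0, 1, 2, ...
```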

SLIDE 16

Theorem (Sherlock-T-Roberts-Rosenthal). Weak convergence of V(d) on [0, T] towards the Langevin diffusion

dV = (1/2) J(µ) (log f)′(V) dt + √J(µ) dW

Sketch of proof:
  • look at the Markov chain (Xk, Wk) on the augmented space
  • compute the drift and variance (generator of the diffusion) along X only
  • two time scales, homogenization arguments: the X-process mixes after O(d) steps, the W-process mixes after O(1) steps; average out the W-process

SLIDE 17

Theorem (Sherlock-T-Roberts-Rosenthal). Weak convergence of V(d) on [0, T] towards the Langevin diffusion

dV = (1/2) J(µ) (log f)′(V) dt + √J(µ) dW

  • Speed function: the larger J(µ), the better the mixing.
  • Expected Squared Jump Distance: J(µ) = E‖Xk+1 − Xk‖².
  • Limiting mean acceptance probability: 0 < α(µ) < 1, with α(µ) → 0 as µ increases.
  • Almost closed-form expressions for α(µ) and J(µ) (see the Monte-Carlo sketch below).
  • If the noise W is fixed, the optimal µ is independent of f(x) dx.
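As a numerical illustration rather than the paper's exact expressions, these limiting quantities can be estimated by plain Monte Carlo under the Gaussian assumptions of the toy example (f standard Gaussian, log-normal noise); the N(+σ²/2, σ²) law used below for the current noise is assumed to be its e^w-tilted stationary distribution:

```python
import numpy as np

def speed_function(mu, sigma, n=10**6, rng=None):
    """Monte-Carlo estimate of the limiting ESJD and acceptance rate:
        J(mu) = mu**2 * E[ 1 ^ exp(-mu**2/2 + mu*Z + W' - W) ],
    with Z ~ N(0, 1), proposed noise W' ~ N(-sigma**2/2, sigma**2),
    and current noise W ~ N(+sigma**2/2, sigma**2)."""
    rng = rng or np.random.default_rng(1)
    z = rng.standard_normal(n)
    w_prop = sigma * rng.standard_normal(n) - sigma**2 / 2
    w_curr = sigma * rng.standard_normal(n) + sigma**2 / 2
    alpha = np.minimum(1.0, np.exp(-mu**2 / 2 + mu * z + w_prop - w_curr))
    return mu**2 * alpha.mean(), alpha.mean()   # (J(mu), acceptance)
```

With σ = 0 this reduces to the classical noiseless RWM expression, and scanning a grid of (µ, σ) values produces contour plots like the one on the next slide.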

SLIDE 18

Gaussian case: joint optimisation (µ, σ)

[Figure: contour plot of the speed function against jump size and estimator variance; contour levels 0.2, 0.4, 0.6.]

SLIDE 19

Gaussian case

[Figure: speed function (log scale, 10^-3 to 10^-0.5) against the noise level σ. Caption: Speed function vs noise.]

SLIDE 20

Discussion

Gaussian assumption: is it relevant? What do we want to optimise? How robust are the conclusions?

SLIDE 21

Thank You! Questions?