SLIDE 1 Tuning of pseudo-marginal MCMC
Alex Thiéry (National University of Singapore)
Joint work with C. Sherlock (Lancaster), G. Roberts (Warwick), J. Rosenthal (Toronto).
SLIDE 2
Pseudo-marginal MCMC
Target density π(x); positive unbiased estimator π̂(x) with E[π̂(x)] = π(x).
Theorem (Beaumont 2003; Andrieu–Roberts 2009): the usual MCMC algorithm run with π̂(x) in place of π(x) is still exact!
No free lunch: poor estimator π̂, poor mixing.
SLIDE 3 Wrong pseudo-marginal MCMC
1. Start from x0 ∈ X.
2. Proposal: y ∼ q(xk, ·).
3. Noisy Metropolis–Hastings: α = 1 ∧ [π̂(y) q(y, xk)] / [π̂(xk) q(xk, y)], with both estimates π̂(y) and π̂(xk) drawn afresh at each iteration.
4. y → xk+1 with probability α.
Not correct. Not too bad, after correction.
Notice that E[π̂(y)/π̂(xk)] ≠ π(y)/π(xk) in general: the ratio of two fresh estimates is a biased estimate of the true acceptance ratio.
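To see this bias concretely, here is a minimal numerical sketch of my own (using the log-normal noise model that appears later, on SLIDE 9): each estimate π̂(x) = π(x) exp(W) is unbiased, yet the expected ratio of two independent fresh estimates equals exp(σ²), not 1.

```python
# Illustration (not from the slides): multiplicative noise e^W with
# W = sigma*Z - sigma^2/2 gives unbiased estimates, E[e^W] = 1, but the
# ratio of two independent fresh estimates is biased:
# E[e^{W_y} / e^{W_x}] = E[e^{W_y}] * E[e^{-W_x}] = e^{sigma^2}.
import numpy as np

rng = np.random.default_rng(0)
sigma, n = 1.0, 10**6
w_x = sigma * rng.standard_normal(n) - sigma**2 / 2
w_y = sigma * rng.standard_normal(n) - sigma**2 / 2

print(np.exp(w_y).mean())         # ~ 1.00 : each estimate is unbiased
print(np.exp(w_y - w_x).mean())   # ~ e ~ 2.72 : the ratio is biased upward
```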
SLIDE 4 Pseudo-marginal MCMC
1. Start from x0 ∈ X, with estimate z0 = π̂(x0).
2. Proposal: y ∼ q(xk, ·), with fresh estimate zy = π̂(y).
3. Metropolis–Hastings: α = 1 ∧ [zy q(y, xk)] / [zk q(xk, y)].
4. (y, zy) → (xk+1, zk+1) with probability α.
Key difference with the noisy version: the current estimate zk is recycled, never recomputed (see the sketch below).
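A minimal sketch of the algorithm above, assuming a symmetric random-walk proposal so that the q-ratio cancels; `pi_hat` is a hypothetical user-supplied function standing for any positive unbiased estimator of π.

```python
# Minimal pseudo-marginal Metropolis-Hastings sketch (symmetric proposal).
# pi_hat(x, rng) must return a positive unbiased estimate of pi(x).
import numpy as np

def pseudo_marginal_mh(pi_hat, x0, jump, n_iter, rng):
    x, z = x0, pi_hat(x0, rng)                 # carry the estimate with the state
    chain = np.empty(n_iter)
    for k in range(n_iter):
        y = x + jump * rng.standard_normal()   # proposal y ~ q(x_k, .)
        z_y = pi_hat(y, rng)                   # fresh estimate at the proposal only
        if rng.random() * z < z_y:             # accept with prob. 1 ^ (z_y / z_k)
            x, z = y, z_y                      # the accepted estimate is recycled
        chain[k] = x
    return chain
```

Note the single call to `pi_hat` per iteration: the current estimate z is refreshed only upon acceptance, which is exactly what makes the chain exact on the augmented space.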
SLIDE 5 Latent variables
Observation y, latent variable u, parameter x ∈ X:
π(x) ∝ π0(x) × ∫ P(y | u, x) P(du | x)
Data augmentation: MCMC on X × U.
High correlation between latent variable and parameter? High-dimensional latent variable u?
Pseudo-marginal: imitate MCMC on X only.
If one can simulate u | x and compute P(y | u, x), use the unbiased Monte Carlo estimate
(1/n) Σ_{k=1}^n P(y | uk, x), with u1, …, un iid ∼ P(du | x).
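A sketch of this averaged latent-variable estimator; `prior`, `sample_u` and `lik` are hypothetical user-supplied functions standing for π0, P(du | x) and P(y | u, x).

```python
# Averaged latent-variable estimator: u_1..u_n iid ~ P(du | x),
# pi_hat(x) = pi_0(x) * (1/n) * sum_k P(y | u_k, x)
# (unbiased up to the normalising constant of pi).
import numpy as np

def pi_hat(x, y_obs, rng, n, prior, sample_u, lik):
    us = [sample_u(x, rng) for _ in range(n)]        # u_k ~ P(du | x)
    avg = np.mean([lik(y_obs, u, x) for u in us])    # (1/n) sum_k P(y | u_k, x)
    return prior(x) * avg
```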
SLIDE 6
Unbiased estimators?
Importance sampling, sequential importance sampling, particle filters.
Unbiased estimator of f(µ) from an unbiased estimator of µ: random truncation, Russian roulette (a sketch follows below).
Probabilistic interpretation of PDEs (Feynman–Kac): replace solving a PDE by simulating a few particles.
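As a generic sketch of the random-truncation trick (my own example, not the slides' construction): to estimate S = Σ_{k≥0} Δk unbiasedly, draw a random truncation level N with P(N ≥ k) = pk > 0 and return Σ_{k=0}^N Δk/pk; by Fubini (given absolute convergence) its expectation is S. Applied to a series expansion of f, this turns an unbiased estimator of µ into one of f(µ).

```python
# Russian-roulette / random-truncation sketch with geometric survival
# probabilities P(N >= k) = q**k (an illustrative choice).
import numpy as np

def roulette_estimate(delta, q, rng):
    """Unbiased estimate of sum_k delta(k), assuming absolute convergence."""
    total, k = 0.0, 0
    while True:
        total += delta(k) / q**k       # compensate term k by 1 / P(N >= k)
        if rng.random() > q:           # stop here: N = k
            return total
        k += 1

# Example: S = sum_k (1/2)^k = 2
rng = np.random.default_rng(1)
draws = [roulette_estimate(lambda k: 0.5**k, q=0.6, rng=rng) for _ in range(10**5)]
print(np.mean(draws))                  # ~ 2.0
```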
SLIDE 7 Stickiness
Better estimator, better mixing [talk C. Andrieu].
The chain using (1/n) Σ_{k=1}^n P(y | uk, x) behaves like the marginal chain as Var(π̂(x)) → 0.
If π̂(x) ≫ π(x), the density at x is over-estimated: the chain gets stuck there!
Trade-off between MCMC mixing and estimator computational time:
good estimators take a lot of time to compute;
good estimators yield good mixing.
SLIDE 8 How to measure efficiency?
Integrated Autocorrelation Time (for which test function ϕ?):
IACT = 1 + 2 Σ_{k=1}^∞ Cor(ϕ(Xt), ϕ(Xt+k))
Spectral gap, log-Sobolev, mixing time: complicated!
The target distribution of (Xk, Wk) depends on the distribution of the noise: difficult to compare chains with different target distributions.
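A sketch of an empirical IACT estimate for a scalar test function, truncating the sum at the first non-positive autocorrelation (one common heuristic among several window rules):

```python
# Empirical IACT = 1 + 2 * sum_{k>=1} acf(k), truncated at the first
# non-positive autocorrelation.
import numpy as np

def iact(samples):
    x = np.asarray(samples, dtype=float) - np.mean(samples)
    n = len(x)
    acov = np.correlate(x, x, mode="full")[n - 1:] / n   # autocovariances
    acf = acov / acov[0]                                 # autocorrelations
    tau = 1.0
    for k in range(1, n):
        if acf[k] <= 0.0:        # stop at the first non-positive lag
            break
        tau += 2.0 * acf[k]
    return tau
```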
SLIDE 9 A toy example
One-dimensional Random Walk Metropolis: y = x + λξ with ξ ∼ N(0, 1).
Gaussian target π(x) ∝ exp(−x²/2).
Estimator of the form π̂(x) = π(x) exp(W), with W = σZ − σ²/2 and Z ∼ N(0, 1), so that Var(W) = σ² and E[exp(W)] = 1.
The process (Xk, Wk) is a Markov chain.
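An end-to-end sketch of this toy chain, working on the log scale; the jump size 2.4 and the σ value are illustrative choices.

```python
# Toy pseudo-marginal RWM: target N(0,1), estimate pi_hat(x) = pi(x) e^W,
# W = sigma*Z - sigma^2/2.  On the log scale the acceptance ratio reads
# log alpha = (-y^2/2 + W_y) - (-x^2/2 + W_x).
import numpy as np

def toy_pm_rwm(jump, sigma, n_iter, rng):
    x = 0.0
    w = sigma * rng.standard_normal() - sigma**2 / 2   # noise carried with x
    chain = np.empty(n_iter)
    for k in range(n_iter):
        y = x + jump * rng.standard_normal()
        w_y = sigma * rng.standard_normal() - sigma**2 / 2
        log_alpha = (-y**2 / 2 + w_y) - (-x**2 / 2 + w)
        if np.log(rng.random()) < log_alpha:
            x, w = y, w_y
        chain[k] = x
    return chain

rng = np.random.default_rng(0)
chain = toy_pm_rwm(jump=2.4, sigma=1.0, n_iter=50_000, rng=rng)
print(chain.mean(), chain.var())   # ~ 0 and ~ 1: the marginal target is exact
```

Whatever σ is, the chain still targets N(0, 1) exactly; only the mixing degrades, which is what the trace plots on the next slide illustrate.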
SLIDE 10 Influence of the noise σ
[Figure: trace plots of X against iteration number (500 iterations) for noise sd = 0, 1, 5, 10.]
SLIDE 11 Influence of the jump size λ
[Figure: IACT against mean acceptance probability; spectral gap against mean acceptance probability.]
SLIDE 12
Diffusion limit
High-dimensional state space with local-move MCMC: looks like a diffusion from far away.
All the efficiency criteria above coincide when a diffusion limit exists:
intuitive understanding + simple conclusions;
the process Wk is averaged out;
relevant already for moderate dimension;
robust results (e.g. the 0.234 rule).
SLIDE 13
Goal
Tuning of pseudo-marginal RWM:
small jump size: explore slowly;
large jump size: most proposals are rejected;
cheap estimator: fast computation time, bad mixing;
good estimator: expensive, good mixing.
SLIDE 14 High dimension + simple target distributions
Product-form target distribution on R^d: π(x) = f(x1) × f(x2) × … × f(xd).
Random Walk Metropolis: y = x + (µ/√d) ξ with ξ ∼ N(0, Id).
Unbiased estimator π̂(x) = π(x) exp(W), as in the toy example.
Assumption: W is independent of x. (A simulation sketch follows below.)
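A simulation sketch of this setting, with standard Gaussian marginals f = N(0, 1) and the log-normal noise from the toy example (both illustrative choices), estimating the mean acceptance probability and the expected squared jump distance:

```python
# Product target pi(x) = prod_i f(x_i), f = N(0,1); proposal
# y = x + (mu/sqrt(d)) * xi; noise W = sigma*Z - sigma^2/2 independent of x.
import numpy as np

def acceptance_and_esjd(mu, sigma, d=100, n_iter=20_000, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(d)                         # start near stationarity
    w = sigma * rng.standard_normal() - sigma**2 / 2
    n_acc, jump2 = 0, 0.0
    for _ in range(n_iter):
        y = x + (mu / np.sqrt(d)) * rng.standard_normal(d)
        w_y = sigma * rng.standard_normal() - sigma**2 / 2
        log_alpha = (-(y @ y) / 2 + w_y) - (-(x @ x) / 2 + w)
        if np.log(rng.random()) < log_alpha:
            jump2 += float((y - x) @ (y - x))          # squared jump distance
            x, w = y, w_y
            n_acc += 1
    return n_acc / n_iter, jump2 / n_iter              # alpha(mu), J(mu)

print(acceptance_and_esjd(mu=2.4, sigma=0.0))   # noise-free: alpha near 0.234
print(acceptance_and_esjd(mu=2.4, sigma=1.5))   # noisy: smaller alpha and J
```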
SLIDE 15
Accelerated first coordinate process: V^{(d)}(t) = X_1^{(d)}(d × t).
V^{(d)} is not Markovian. In the limit d → ∞ it is!
SLIDE 16 Theorem (Sherlock-T-Roberts-Rosenthal)
Weak convergence of V^{(d)} on [0, T] towards the Langevin diffusion
dV = (1/2) J(µ) (log f)′(V) dt + √J(µ) dB
Sketch of proof:
look at the Markov chain (Xk, Wk) on the augmented space;
compute the drift and variance (the generator of the diffusion) along X only;
two time scales, handled by homogenization arguments:
the X-process mixes after O(d) steps, the W-process mixes after O(1) steps;
average out the W-process.
SLIDE 17 Theorem (Sherlock-T-Roberts-Rosenthal)
Weak convergence of V^{(d)} on [0, T] towards the Langevin diffusion
dV = (1/2) J(µ) (log f)′(V) dt + √J(µ) dB
Speed function: the larger J(µ), the better the mixing.
Expected Squared Jump Distance: J(µ) = E‖Xk+1 − Xk‖².
Limiting mean acceptance probability: 0 < α(µ) < 1, with α(µ) → 0 as µ increases.
Almost closed-form expressions for α(µ) and J(µ).
If the noise W is fixed, the optimal µ is independent of the target f(x) dx.
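For orientation (an addition, classical rather than from this slide): when σ = 0 the pseudo-marginal RWM reduces to plain RWM, for which Roberts, Gelman and Gilks (1997) give the closed forms below; this is the source of the 0.234 rule mentioned on SLIDE 12. Here I = E_f[((log f)′(X))²].

```latex
\alpha(\mu) = 2\,\Phi\!\left(-\frac{\mu\sqrt{I}}{2}\right),
\qquad
J(\mu) = \mu^2\,\alpha(\mu) = 2\mu^2\,\Phi\!\left(-\frac{\mu\sqrt{I}}{2}\right).
```

Both are maximised at µ ≈ 2.38/√I, where α(µ) ≈ 0.234; the noisy case σ > 0 modifies both expressions through the extra W-terms in the acceptance ratio.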
SLIDE 18 Gaussian case: joint optimisation (µ, σ)
[Figure: contour plot of the speed function as a function of jump size and estimator variance; contour levels 0.2, 0.4, 0.6.]
SLIDE 19 Gaussian case
[Figure: speed function against the noise σ, with σ on a log scale; title: Speed function vs noise.]
SLIDE 20
Discussion
Gaussian assumption: is it relevant?
What do we want to optimise?
How robust are the conclusions?
SLIDE 21
Thank You! Questions?