Introduction Optimal scaling of the transient phase of RWMH Longtime convergence of the nonlinear SDE Optimization strategies for the RWMH algorithm
Optimal scaling of the transient phase of Metropolis Hastings algorithms
Tony Lelièvre
Outline of the talk
Metropolis Hastings algorithm
The aim of the MH algorithm is to sample a target probability measure, say with density $p$ on $\mathbb{R}^n$. Algorithm: iterate on $k \ge 0$,
- Proposition: At time $k$, given $X^n_k$, propose a move to $\hat X^n_{k+1} \sim q(X^n_k, y)\,dy$, where $q(x, y)$ is a Markov density kernel on $\mathbb{R}^n$,
- Acceptance/Rejection: Accept the move ($X^n_{k+1} = \hat X^n_{k+1}$) with probability $\alpha(X^n_k, \hat X^n_{k+1})$, where
$$\alpha(x, y) := \frac{p(y)\,q(y, x)}{p(x)\,q(x, y)} \wedge 1.$$
Otherwise, reject the move ($X^n_{k+1} = X^n_k$).
$(X^n_k)_{k \ge 0}$ is a reversible Markov chain wrt $p(x)\,dx$.
The efficiency of the algorithm crucially depends on the choice of the proposal distribution q.
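The iteration above can be sketched in a few lines of Python. This is only an illustrative sketch: the names `metropolis_hastings`, `log_p`, `propose` and `log_q` are hypothetical, the densities are handled through their logarithms for numerical stability, and the usage example (standard Gaussian target in dimension 2, Gaussian random walk proposal with an arbitrary step 0.8) is an assumption made for the demonstration.

```python
import numpy as np

def metropolis_hastings(log_p, propose, log_q, x0, n_steps, rng):
    """Generic MH chain; log_q(x, y) is the log proposal density of y given x."""
    x = np.asarray(x0, dtype=float)
    chain = np.empty((n_steps + 1,) + x.shape)
    chain[0] = x
    n_accept = 0
    for k in range(n_steps):
        y = propose(x, rng)
        # log of p(y) q(y, x) / (p(x) q(x, y))
        log_ratio = log_p(y) + log_q(y, x) - log_p(x) - log_q(x, y)
        # accept with probability min(1, ratio)
        if np.log(rng.uniform()) <= min(0.0, log_ratio):
            x = y
            n_accept += 1
        chain[k + 1] = x
    return chain, n_accept / n_steps

# Usage (assumed example): standard Gaussian target, random walk proposal.
sigma = 0.8
log_p = lambda x: -0.5 * np.sum(x**2)
propose = lambda x, rng: x + sigma * rng.standard_normal(x.shape)
log_q = lambda x, y: -0.5 * np.sum((y - x) ** 2) / sigma**2  # symmetric in (x, y)
rng = np.random.default_rng(0)
chain, acc_rate = metropolis_hastings(log_p, propose, log_q, np.zeros(2), 5000, rng)
```

With a symmetric proposal the two `log_q` terms cancel, which is the random walk case discussed next.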
Metropolis Hastings algorithm
In the following, we focus on the Gaussian random walk proposal (RWM):
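- $\hat X^n_{k+1} = X^n_k + \sigma G_{k+1}$ where $(G_k)_{k \ge 1}$ i.i.d. $\sim \mathcal{N}_n(0, I_n)$
- $q(x, y) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left(-\frac{|x - y|^2}{2\sigma^2}\right) = q(y, x)$
- Acceptance probability $\alpha(x, y) = \frac{p(y)}{p(x)} \wedge 1$.
Another standard choice: one step of overdamped Langevin (MALA):
- $\hat X^n_{k+1} = X^n_k + \frac{\sigma^2}{2} (\nabla \ln p)(X^n_k) + \sigma G_{k+1}$ where $(G_k)_{k \ge 1}$ i.i.d. $\sim \mathcal{N}_n(0, I_n)$
- $q(x, y) \neq q(y, x)$.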
Question: How to choose σ as a function of the dimension n?
Previous work: Roberts, Gelman, Gilks 97
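Two fundamental assumptions:
- (H1) Product target: $p(x) = p(x_1, \dots, x_n) = \prod_{i=1}^n e^{-V(x_i)}$,
- (H2) Stationarity: $X^n_0 = (X^{1,n}_0, \dots, X^{n,n}_0) \sim p(x)\,dx$ and thus $\forall k$, $X^n_k = (X^{1,n}_k, \dots, X^{n,n}_k) \sim p(x)\,dx$.
Then, pick the first component $X^{1,n}_k$, choose $\sigma_n = \frac{\ell}{\sqrt{n}}$, and rescale the time accordingly (diffusive scaling) by considering $(X^{1,n}_{\lfloor nt \rfloor})_{t \ge 0}$.
Under regularity assumptions on $V$, as $n \to \infty$, $(X^{1,n}_{\lfloor nt \rfloor})_{t \ge 0} \overset{(d)}{\Rightarrow} (X_t)_{t \ge 0}$, the unique solution of the SDE
$$dX_t = -h(\ell)\,\frac{1}{2} V'(X_t)\,dt + \sqrt{h(\ell)}\,dB_t,$$
where
$$h(\ell) = 2\ell^2\, \Phi\left(-\frac{\ell \sqrt{\int_{\mathbb{R}} (V')^2 \exp(-V)}}{2}\right) \quad \text{with} \quad \Phi(x) = \int_{-\infty}^{x} e^{-\frac{y^2}{2}} \frac{dy}{\sqrt{2\pi}}.$$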
Previous work: Roberts, Gelman, Gilks 97
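Practical counterparts: (i) scaling of the proposal variance, (ii) scaling of the number of iterations.
Question: How to choose $\ell$?
- The function $\ell \mapsto h(\ell) = 2\ell^2\, \Phi\left(-\frac{\ell \sqrt{\int_{\mathbb{R}} (V')^2 \exp(-V)}}{2}\right)$ is maximum at $\ell^\star \simeq \frac{2.38}{\sqrt{\int_{\mathbb{R}} (V')^2 \exp(-V)}}$.
- Besides, the limiting average acceptance rate is
$$\mathbb{E}[\alpha(X^n_k, \hat X^n_{k+1})] = \int_{\mathbb{R}^n \times \mathbb{R}^n} \underbrace{\left(e^{\sum_{i=1}^n (V(x_i) - V(y_i))} \wedge 1\right)}_{\alpha(x, y)}\, q(x, y)\, e^{-\sum_{i=1}^n V(x_i)}\,dx\,dy \xrightarrow[n \to \infty]{} \mathrm{acc}(\ell) = 2\Phi\left(-\frac{\ell \sqrt{\int_{\mathbb{R}} (V')^2 \exp(-V)}}{2}\right) \in (0, 1).$$
Observe that $\mathrm{acc}(\ell^\star) \simeq 0.234$, whatever $V$. This justifies a constant acceptance rate strategy, with a target acceptance rate of approximately 25%.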
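The two numbers 2.38 and 0.234 can be recovered numerically from the formulas for $h(\ell)$ and $\mathrm{acc}(\ell)$. The sketch below is illustrative only: the function names are hypothetical, `I` stands for $\int_{\mathbb{R}} (V')^2 \exp(-V)$ and is passed as a plain number, and the maximizer is found by a simple grid search rather than a proper optimizer.

```python
import math
import numpy as np

def Phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def h(ell, I=1.0):
    """Speed of the limiting diffusion, h(l) = 2 l^2 Phi(-l sqrt(I) / 2)."""
    return 2.0 * ell**2 * Phi(-ell * math.sqrt(I) / 2.0)

def acc(ell, I=1.0):
    """Limiting average acceptance rate, acc(l) = 2 Phi(-l sqrt(I) / 2)."""
    return 2.0 * Phi(-ell * math.sqrt(I) / 2.0)

def ell_star(I=1.0, n_grid=100_000):
    """Grid-search maximizer of h; the grid is scaled by 1/sqrt(I)."""
    grid = np.linspace(0.01, 10.0 / math.sqrt(I), n_grid)
    values = [h(float(l), I) for l in grid]
    return float(grid[int(np.argmax(values))])

best = ell_star(1.0)
print(best, acc(best))
```

Changing `I` rescales `ell_star(I)` by $1/\sqrt{I}$ but leaves `acc(ell_star(I), I)` unchanged, which is the "whatever $V$" statement above.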
A few references
- (H1) + (H2) and various proposals: Gaussian RWM [Roberts, Gelman, Gilks 1997], MALA [Roberts, Rosenthal 1997], non-Gaussian RWM [Neal, Roberts 2011], RWM with discontinuous target [Neal, Roberts, Yuen 2012], multiple-try MCMC [Bédard, Douc, Moulines 2012], delayed rejection MCMC [Bédard, Douc, Moulines 2013], Hybrid Monte Carlo [Beskos, Pillai, Roberts, Sanz-Serna, Stuart 2013].
- Beyond (H1): independent but not identically distributed components, RWM [Bédard 2007, 2009]; finite range interactions [Breyer, Roberts 2000]; mean-field interaction [Breyer, Piccioni, Scarlatti 2004]; density w.r.t. i.i.d. [Beskos, Roberts, Stuart 2009]; infinite-dimensional target with density w.r.t. Gaussian field, RWM [Mattingly, Pillai, Stuart 2012], MALA [Pillai, Stuart, Thiery 2012].
- Beyond (H2): partial results for RWM and MALA with Gaussian target [Christensen, Roberts, Rosenthal 2005]; modified RWM for infinite-dimensional target with density w.r.t. Gaussian field [Pillai, Stuart, Thiery 2013].
Aim of this work: Study of the limit n → ∞ without the stationarity assumption (H2).
The limit n → ∞ without (H2)
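We consider the RWMH with target $p(x) = \prod_{i=1}^n \exp(-V(x_i))$: $(G^i_k)_{i, k \ge 1}$ are i.i.d. $\sim \mathcal{N}_1(0, 1)$, independent of $(U_k)_{k \ge 1}$ i.i.d. $\sim \mathcal{U}[0, 1]$, and
$$X^{i,n}_{k+1} = X^{i,n}_k + \frac{\ell}{\sqrt{n}}\, G^i_{k+1}\, 1_{A_{k+1}}, \quad 1 \le i \le n,$$
with
$$A_{k+1} = \left\{ U_{k+1} \le e^{\sum_{i=1}^n \left(V(X^{i,n}_k) - V\left(X^{i,n}_k + \frac{\ell}{\sqrt{n}} G^i_{k+1}\right)\right)} \right\}.$$
From now on, we assume that $V$ is $C^3$ with $V''$ and $V^{(3)}$ bounded.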
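The recursion above translates directly into code. The following is a minimal sketch, not the authors' implementation: the function name `rwmh_product_target` is hypothetical, and the usage example assumes a Gaussian target $V(x) = x^2/2$ (constant dropped), $n = 100$ components all started at 10, and a fixed $\ell = 2.38$, so that the chain starts far from stationarity.

```python
import numpy as np

def rwmh_product_target(V, x0, ell, n_steps, rng):
    """RWMH for the product target prod_i exp(-V(x_i)), proposal std ell / sqrt(n)."""
    x = np.array(x0, dtype=float)
    n = x.size
    sigma = ell / np.sqrt(n)
    mean_sq = np.empty(n_steps + 1)   # (1/n) sum_i (X_k^{i,n})^2 along the chain
    mean_sq[0] = np.mean(x**2)
    accepted = 0
    for k in range(n_steps):
        y = x + sigma * rng.standard_normal(n)
        log_ratio = np.sum(V(x) - V(y))
        # event A_{k+1}: U_{k+1} <= exp(sum_i (V(x_i) - V(y_i)))
        if np.log(rng.uniform()) <= log_ratio:
            x = y
            accepted += 1
        mean_sq[k + 1] = np.mean(x**2)
    return x, mean_sq, accepted / n_steps

# Usage (assumed example): transient phase for a Gaussian target.
V = lambda x: 0.5 * x**2
rng = np.random.default_rng(0)
x_final, mean_sq, acc_rate = rwmh_product_target(V, np.full(100, 10.0), 2.38, 2000, rng)
```

All components move together or not at all, since a single uniform draw $U_{k+1}$ decides acceptance for the whole vector.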
Theorem
Assume that
- 1. $m$ is a probability measure on $\mathbb{R}$ s.t. $\int_{\mathbb{R}} (V')^4(x)\, m(dx) < +\infty$,
- 2. $\forall n \ge 1$, $X^{1,n}_0, \dots, X^{n,n}_0$ are i.i.d. according to $m$.
Then the process $(X^{1,n}_{\lfloor nt \rfloor})_{t \ge 0}$ converges in distribution to the unique solution of the SDE nonlinear in the sense of McKean: $X_0 \sim m$,
$$dX_t = -G(a(t), b(t))\, V'(X_t)\,dt + \Gamma^{1/2}(a(t), b(t))\,dB_t$$
with $a(t) = \mathbb{E}[(V'(X_t))^2]$, $b(t) = \mathbb{E}[V''(X_t)]$, and the functions $\Gamma$ and $G$ given below.
The functions Γ and G
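$$\Gamma(a, b) = \begin{cases} \ell^2 \Phi\left(-\frac{\ell b}{2\sqrt{a}}\right) + \ell^2 e^{\frac{\ell^2 (a - b)}{2}} \Phi\left(\ell \left(\frac{b}{2\sqrt{a}} - \sqrt{a}\right)\right) & \text{if } a \in (0, +\infty), \\ \frac{\ell^2}{2} & \text{if } a = +\infty, \\ \ell^2 e^{-\frac{\ell^2 b^+}{2}} \text{ where } b^+ = \max(b, 0) & \text{if } a = 0, \end{cases}$$
$$G(a, b) = \begin{cases} \ell^2 e^{\frac{\ell^2 (a - b)}{2}} \Phi\left(\ell \left(\frac{b}{2\sqrt{a}} - \sqrt{a}\right)\right) & \text{if } a \in (0, +\infty), \\ 0 & \text{if } a = +\infty, \\ 1_{\{b > 0\}}\, \ell^2 e^{-\frac{\ell^2 b}{2}} & \text{if } a = 0. \end{cases}$$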
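The piecewise definitions of $\Gamma$ and $G$ can be transcribed as follows. This is a sketch under stated assumptions: the names `Gamma_fn` and `G_fn` are hypothetical, $\ell$ is passed as an explicit third argument, and no care is taken to avoid overflow of the exponential when $\ell^2 (a - b)$ is very large.

```python
import math

def Phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def G_fn(a, b, ell):
    """Drift coefficient G(a, b), including the boundary cases a = 0 and a = +inf."""
    if a == 0.0:
        return ell**2 * math.exp(-ell**2 * b / 2.0) if b > 0 else 0.0
    if math.isinf(a):
        return 0.0
    sa = math.sqrt(a)
    return ell**2 * math.exp(ell**2 * (a - b) / 2.0) * Phi(ell * (b / (2.0 * sa) - sa))

def Gamma_fn(a, b, ell):
    """Diffusion coefficient Gamma(a, b), including the boundary cases."""
    if a == 0.0:
        return ell**2 * math.exp(-ell**2 * max(b, 0.0) / 2.0)
    if math.isinf(a):
        return ell**2 / 2.0
    sa = math.sqrt(a)
    # for a in (0, +inf), Gamma is the first Phi term plus G
    return ell**2 * Phi(-ell * b / (2.0 * sa)) + G_fn(a, b, ell)

# Sanity check: on the diagonal a = b, Gamma(a, a) = 2 G(a, a) = 2 l^2 Phi(-l sqrt(a) / 2).
a, ell = 1.7, 1.3
print(Gamma_fn(a, a, ell), 2.0 * G_fn(a, a, ell), 2.0 * ell**2 * Phi(-ell * math.sqrt(a) / 2.0))
```

The diagonal identity printed at the end is the one that recovers the stationary dynamics with $h(\ell)$.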
Remarks
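- Limiting acceptance rate: $t \mapsto \mathbb{P}(A_{\lfloor nt \rfloor})$ converges to $t \mapsto \mathrm{acc}(a(t), b(t))$ where $a(t) = \mathbb{E}[(V'(X_t))^2]$, $b(t) = \mathbb{E}[V''(X_t)]$ and $\mathrm{acc}(a, b) = \frac{1}{\ell^2} \Gamma(a, b)$.
- Stationary case: If $m(dx) = e^{-V(x)}\,dx$, then $\forall t \ge 0$, $X_t \sim e^{-V(x)}\,dx$ and
$$a(t) = \mathbb{E}[(V'(X_t))^2] = \int_{\mathbb{R}} V' (V' e^{-V}) = \int_{\mathbb{R}} V' (-e^{-V})' = \int_{\mathbb{R}} V'' e^{-V} = \mathbb{E}[V''(X_t)] = b(t)$$
are constant. Using the fact that for $a > 0$, $\Gamma(a, a) = 2G(a, a) = 2\ell^2 \Phi\left(-\ell\sqrt{a}/2\right)$, we are back to the dynamics
$$dX_t = -h(\ell)\,\frac{1}{2} V'(X_t)\,dt + \sqrt{h(\ell)}\,dB_t \quad \text{with} \quad h(\ell) = 2\ell^2\, \Phi\left(-\frac{\ell}{2} \sqrt{\int_{\mathbb{R}} (V')^2 \exp(-V)}\right).$$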
Propagation of chaos
- One can actually prove a propagation of chaos result.
Definition
A sequence $(\chi^n_1, \dots, \chi^n_n)_{n \ge 1}$ of exchangeable random variables is said to be $\nu$-chaotic if for fixed $k \in \mathbb{N}^*$, the law of $(\chi^n_1, \dots, \chi^n_k)$ converges in distribution to $\nu^{\otimes k}$ as $n$ goes to $\infty$.
The processes $((X^{1,n}_{\lfloor nt \rfloor}, \dots, X^{n,n}_{\lfloor nt \rfloor})_{t \ge 0})_{n \ge 1}$ are $P$-chaotic where $P$ is the law of the unique solution to the SDE nonlinear in the sense of McKean: $X_0 \sim m$,
$$dX_t = -G(a(t), b(t))\, V'(X_t)\,dt + \Gamma^{1/2}(a(t), b(t))\,dB_t,$$
with $a(t) = \mathbb{E}[(V'(X_t))^2]$ and $b(t) = \mathbb{E}[V''(X_t)]$.
- The assumption on the initial condition may then be replaced by: the initial positions $(X^{1,n}_0, \dots, X^{n,n}_0)_{n \ge 1}$ are exchangeable, $m$-chaotic and s.t. $\sup_n \mathbb{E}[(V'(X^{1,n}_0))^4] < \infty$.
Proof
The proof is based on:
- A weak formulation of the nonlinear SDE (martingale problem)
- Tightness arguments
This is a mean field limit.
Invariant measure
We would like to understand the longtime behavior of the nonlinear SDE
$$dX_t = -G(a(t), b(t))\, V'(X_t)\,dt + \Gamma^{1/2}(a(t), b(t))\,dB_t,$$
where $a(t) = \mathbb{E}[(V'(X_t))^2]$ and $b(t) = \mathbb{E}[V''(X_t)]$.
Proposition
The probability measure $e^{-V(x)}\,dx$ is the unique invariant measure for this SDE.
Fokker-Planck equation
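Denoting by $\psi_t$ the density of $X_t$, one has
$$\partial_t \psi_t = \partial_x \left( G(a[\psi_t], b[\psi_t])\, V' \psi_t + \frac{1}{2} \Gamma(a[\psi_t], b[\psi_t])\, \partial_x \psi_t \right),$$
$$a[\psi_t] = \int_{\mathbb{R}} (V'(x))^2 \psi_t(x)\,dx, \qquad b[\psi_t] = \int_{\mathbb{R}} V''(x) \psi_t(x)\,dx.$$
Question 1: Does $\psi_t$ converge to $\psi_\infty = \exp(-V)$?
Question 2: Is it possible to optimize the convergence, by appropriately choosing $\ell$ (recall that the variance of the proposal is $\ell^2/n$, and thus that $\Gamma(a, b) = \Gamma(a, b, \ell)$ and $G(a, b) = G(a, b, \ell)$)?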
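One way to get a feeling for Question 1 is to simulate the nonlinear SDE with an interacting particle system, replacing $a(t)$ and $b(t)$ by empirical averages. The sketch below is only an illustration and not the method used in the talk: it uses an explicit Euler–Maruyama step, hypothetical names (`coefficients`, `simulate_mckean`), handles only the case $a \in (0, +\infty)$, and the usage example assumes the Gaussian target $V(x) = x^2/2$ with all particles started at 3 and $\ell = 1$.

```python
import math
import numpy as np

def Phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def coefficients(a, b, ell):
    """Gamma(a, b) and G(a, b) for a in (0, +inf); boundary cases are omitted."""
    sa = math.sqrt(a)
    G = ell**2 * math.exp(ell**2 * (a - b) / 2.0) * Phi(ell * (b / (2.0 * sa) - sa))
    Gamma = ell**2 * Phi(-ell * b / (2.0 * sa)) + G
    return Gamma, G

def simulate_mckean(dV, d2V, x0, ell, dt, n_steps, rng):
    """Particle approximation: a(t), b(t) are replaced by empirical means."""
    x = np.array(x0, dtype=float)
    a_hist = np.empty(n_steps + 1)
    for k in range(n_steps):
        a = float(np.mean(dV(x) ** 2))
        b = float(np.mean(d2V(x)))
        a_hist[k] = a
        Gamma, G = coefficients(a, b, ell)
        # Euler-Maruyama step for dX = -G V'(X) dt + sqrt(Gamma) dB
        x = x - G * dV(x) * dt + math.sqrt(Gamma * dt) * rng.standard_normal(x.size)
    a_hist[n_steps] = float(np.mean(dV(x) ** 2))
    return x, a_hist

# Usage (assumed example): Gaussian target, so V'(x) = x and V''(x) = 1.
rng = np.random.default_rng(1)
x_T, a_hist = simulate_mckean(lambda x: x, lambda x: np.ones_like(x),
                              np.full(2000, 3.0), 1.0, 0.01, 2000, rng)
```

For this target the invariant density is the standard Gaussian, so `a_hist` is expected to relax from 9 towards a value close to 1.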
Entropy techniques
To analyze the longtime behavior, we use entropy techniques.
Definition
The probability measure $\nu$ satisfies a log-Sobolev inequality with constant $\rho > 0$ (in short LSI($\rho$)) if, for any probability measure $\mu$ absolutely continuous wrt $\nu$,
$$H(\mu|\nu) \le \frac{1}{2\rho} I(\mu|\nu)$$
where
- $H(\mu|\nu) = \int \ln\left(\frac{d\mu}{d\nu}\right) d\mu$ is the Kullback-Leibler divergence (or relative entropy) of $\mu$ wrt $\nu$,
- $I(\mu|\nu) = \int \left|\nabla \ln\left(\frac{d\mu}{d\nu}\right)\right|^2 d\mu$ is the Fisher information of $\mu$ wrt $\nu$.
Roughly speaking, $e^{-V}$ satisfies LSI($\rho$) for some $\rho > 0$ if $V$ has at least quadratic growth at $\infty$. In the Gaussian case $V(x) = \frac{x^2 + \ln(2\pi)}{2}$, $\exp(-V)$ satisfies LSI(1).
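The Gaussian case can be checked by hand when $\mu$ is itself Gaussian, since both sides of the inequality then have closed forms. The sketch below is an illustration with hypothetical function names; it takes $\mu = \mathcal{N}(m, v)$ and $\nu = \mathcal{N}(0, 1)$, for which $\ln\frac{d\mu}{d\nu}$ has derivative $(1 - 1/v)(x - m) + m$.

```python
import math

def relative_entropy(m, v):
    """H(N(m, v) | N(0, 1)) = (v + m^2 - 1 - ln v) / 2."""
    return 0.5 * (v + m**2 - 1.0 - math.log(v))

def fisher_information(m, v):
    """I(N(m, v) | N(0, 1)) = E[((1 - 1/v)(X - m) + m)^2] = (v - 1)^2 / v + m^2."""
    return (v - 1.0) ** 2 / v + m**2

def lsi_gap(m, v, rho=1.0):
    """I / (2 rho) - H; nonnegative whenever LSI(rho) holds for this mu."""
    return fisher_information(m, v) / (2.0 * rho) - relative_entropy(m, v)

for m in (0.0, 0.5, 2.0):
    for v in (0.1, 1.0, 3.0):
        print(m, v, lsi_gap(m, v))
```

The gap vanishes for pure translations ($v = 1$), so the constant $\rho = 1$ cannot be improved for this target.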
Entropy techniques
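Recall the nonlinear FP equation:
$$\partial_t \psi_t = \partial_x \left( G(a[\psi_t], b[\psi_t])\, V' \psi_t + \frac{1}{2} \Gamma(a[\psi_t], b[\psi_t])\, \partial_x \psi_t \right).$$
We can prove exponential convergence of $\psi_t$ to the invariant density $\psi_\infty = e^{-V}$ in entropy.
Theorem
If $X_0$ admits a density $\psi_0$ s.t. $\mathbb{E}[(V'(X_0))^2] < +\infty$ and $H(\psi_0|\psi_\infty) < \infty$, then
$$\frac{d}{dt} H(\psi_t|\psi_\infty) \le -\frac{b[\psi_t]\, \Gamma(a[\psi_t], b[\psi_t]) - 2a[\psi_t]\, G(a[\psi_t], b[\psi_t])}{2(b[\psi_t] - a[\psi_t])}\, I(\psi_t|\psi_\infty) < 0.$$
If moreover $\psi_\infty = e^{-V}$ satisfies LSI($\rho$), then there exists a positive and non-increasing function $\lambda : [0, +\infty) \to (0, +\infty)$ such that
$$\forall t \ge 0, \quad H(\psi_t|\psi_\infty) \le e^{-t \lambda(H(\psi_0|\psi_\infty))}\, H(\psi_0|\psi_\infty).$$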
Elements of proof
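Writing $a, b$ for $a[\psi_t], b[\psi_t]$, one has
$$\frac{d}{dt} H(\psi_t|\psi_\infty) = \int_{\mathbb{R}} \partial_t \psi_t \ln \psi_t + \int_{\mathbb{R}} V \partial_t \psi_t = -\frac{\Gamma(a, b)}{2} I(\psi_t|\psi_\infty) + (a - b)^2\, \frac{2G(a, b) - \Gamma(a, b)}{2(b - a)},$$
where $\frac{2G(a, b) - \Gamma(a, b)}{2(b - a)} \ge 0$. Moreover,
$$(a - b)^2 = \left( \int_{\mathbb{R}} (V')^2 \psi_t - \int_{\mathbb{R}} V'' \psi_t \right)^2 = \left( \int_{\mathbb{R}} V' (V' \psi_t + \partial_x \psi_t) \right)^2 = \left( \int_{\mathbb{R}} V'\, \partial_x \ln(\psi_t / e^{-V})\, \psi_t \right)^2 \le a\, I(\psi_t|\psi_\infty).$$
Hence
$$\frac{d}{dt} H(\psi_t|\psi_\infty) \le -\frac{b\Gamma(a, b) - 2aG(a, b)}{2(b - a)}\, I(\psi_t|\psi_\infty).$$
If $\psi_\infty$ satisfies LSI($\rho$), then (i) $-I(\psi_t|\psi_\infty) \le -2\rho H(\psi_t|\psi_\infty)$ and (ii) using the fact that $t \mapsto H(\psi_t|\psi_\infty)$ is decreasing, $\forall t \ge 0$,
$$2\rho\, \frac{b\Gamma(a, b) - 2aG(a, b)}{2(b - a)} \ge \lambda(H(\psi_0|\psi_\infty)) > 0.$$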
Strategies to optimize the convergence of RWMH
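We want to choose $\ell$ in order to accelerate the convergence to equilibrium. Two natural strategies: (i) optimize the exponential rate of convergence to zero of $H(\psi_t|\psi_\infty)$, (ii) choose $\ell$ in order to obtain a constant average acceptance rate.
Preliminary remark: When $b \le 0$, one has $\frac{d}{dt} H(\psi_t|\psi_\infty) \le -\frac{\Gamma(a, b)}{2} \int_{\mathbb{R}} (\partial_x \ln \psi_t)^2 \psi_t$ with $\lim_{\ell \to \infty} \Gamma(a, b) = +\infty$. So one should choose $\ell$ as large as possible.
From now on, suppose that $b > 0$ (recall that in the longtime limit $b = a > 0$). We have:
$$\frac{d}{dt} H(\psi_t|\psi_\infty) \le -\frac{1}{2}\, \underbrace{\frac{b\Gamma(a, b) - 2aG(a, b)}{b - a}}_{\frac{1}{b} F\left(\frac{a}{b},\, \ell\sqrt{b}\right)}\, I(\psi_t|\psi_\infty) < 0,$$
where
$$F(s, \ell) = \begin{cases} \ell^2 e^{-\frac{\ell^2}{2}} & \text{if } s = 0, \\ 2\ell^2 \left( \left(1 + \frac{\ell^2}{4}\right) \Phi\left(-\frac{\ell}{2}\right) - \frac{\ell}{2\sqrt{2\pi}} e^{-\frac{\ell^2}{8}} \right) & \text{if } s = 1, \\ \frac{\ell^2}{1 - s} \left( \Phi\left(-\frac{\ell}{2\sqrt{s}}\right) + (1 - 2s)\, e^{\frac{\ell^2 (s - 1)}{2}} \Phi\left(\frac{\ell}{2\sqrt{s}} - \ell\sqrt{s}\right) \right) & \text{if } 0 < s \ne 1. \end{cases}$$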
Choice of ℓ maximizing the exponential rate of convergence
Lemma
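Let $b > 0$. Then $\ell \ge 0 \mapsto \frac{1}{b} F\left(\frac{a}{b}, \ell\sqrt{b}\right)$ admits a unique maximum at point $\tilde\ell^\star(a, b)$. Moreover $\tilde\ell^\star(a, b) = \frac{1}{\sqrt{b}}\, \ell^\star\left(\frac{a}{b}\right)$ where for any $s \ge 0$, $\ell^\star(s)$ realizes the unique maximum of $\ell \mapsto F(s, \ell)$. The function $s \mapsto \ell^\star(s)$ is continuous on $[0, +\infty)$ and
- $\tilde\ell^\star(a, b) \sim_{a/b \to 0} \frac{\ell^\star(0)}{\sqrt{b}} = \frac{\sqrt{2}}{\sqrt{b}}$.
- $\tilde\ell^\star(a, b) \sim_{a/b \to 1} \frac{\ell^\star(1)}{\sqrt{b}}$.
- $\tilde\ell^\star(a, b) \sim_{a/b \to +\infty} \frac{x^\star \sqrt{a}}{b}$ where $x^\star \simeq 1.22$.
Remark: Since $dV(X_t) = V'(X_t) \left( \sqrt{\Gamma(a, b)}\,dB_t - G(a, b) V'(X_t)\,dt \right) + \frac{1}{2} \Gamma(a, b) V''(X_t)\,dt$, we have $\frac{d}{dt} \mathbb{E}[V(X_t)] = \frac{1}{2} (b\Gamma(a, b) - 2aG(a, b))$ and $\tilde\ell^\star(a, b)$ also maximizes $\left|\frac{d}{dt} \mathbb{E}[V(X_t)]\right|$.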
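The function $s \mapsto \ell^\star(s)$ has no closed form in general, but it is easy to approximate. The sketch below is only an illustration: the names `F` and `ell_star` are hypothetical, the maximizer is located by a plain grid search on $[0, \ell_{\max}]$ with an arbitrary default $\ell_{\max} = 6$, and the exponential may overflow for large $s$, so it is meant for moderate values of $s$ only.

```python
import math
import numpy as np

def Phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def F(s, ell):
    """Rate function F(s, l), with the special cases s = 0 and s = 1."""
    if s == 0.0:
        return ell**2 * math.exp(-ell**2 / 2.0)
    if s == 1.0:
        return 2.0 * ell**2 * ((1.0 + ell**2 / 4.0) * Phi(-ell / 2.0)
                               - ell / (2.0 * math.sqrt(2.0 * math.pi)) * math.exp(-ell**2 / 8.0))
    rs = math.sqrt(s)
    bracket = (Phi(-ell / (2.0 * rs))
               + (1.0 - 2.0 * s) * math.exp(ell**2 * (s - 1.0) / 2.0)
               * Phi(ell / (2.0 * rs) - ell * rs))
    return ell**2 / (1.0 - s) * bracket

def ell_star(s, ell_max=6.0, n_grid=6001):
    """Grid-search maximizer of l -> F(s, l) on [0, ell_max]."""
    grid = np.linspace(0.0, ell_max, n_grid)
    values = [F(s, float(l)) for l in grid]
    return float(grid[int(np.argmax(values))])

print(ell_star(0.0), math.sqrt(2.0))   # the s = 0 maximizer is sqrt(2)
print(ell_star(1.0))
```

The scaled optimizer is then `ell_star(a / b) / math.sqrt(b)`, following the lemma.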
Figure: Solid line: the function $s \mapsto \ell^\star(s)$. Dashed line: the function $s \mapsto x^\star \sqrt{s}$.
Comparison with constant acceptance rate strategies
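Recall that the limiting mean acceptance rate is $\mathrm{acc}(a, b, \ell) = \frac{1}{\ell^2} \Gamma(a, b) = \mathcal{G}\left(\frac{a}{b}, \ell\sqrt{b}\right)$ where
$$\mathcal{G}(s, \ell) = \Phi\left(-\frac{\ell}{2\sqrt{s}}\right) + e^{\frac{\ell^2 (s - 1)}{2}}\, \Phi\left(\ell \left(\frac{1}{2\sqrt{s}} - \sqrt{s}\right)\right).$$
Lemma
For $s > 0$, the function $\ell \mapsto \mathcal{G}(s, \ell)$ is decreasing. Moreover, for $\alpha \in (0, 1)$, the unique $\ell$ s.t. $\mathrm{acc}(a, b, \ell) = \alpha$ is $\tilde\ell_\alpha(a, b) = \frac{1}{\sqrt{b}}\, \ell_\alpha\left(\frac{a}{b}\right)$ where $\ell_\alpha(s)$ is the unique solution to $\mathcal{G}(s, \ell_\alpha(s)) = \alpha$. Last,
- $\tilde\ell_\alpha(a, b) \sim_{a/b \to 0} \frac{\sqrt{-2\ln(\alpha)}}{\sqrt{b}}$.
- $\tilde\ell_\alpha(a, b) \sim_{a/b \to 1} \frac{\ell_\alpha(1)}{\sqrt{b}}$.
- $\tilde\ell_\alpha(a, b) \sim_{a/b \to \infty} -2\Phi^{-1}(\alpha)\, \frac{\sqrt{a}}{b}$.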
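Since the acceptance rate is decreasing in $\ell$, the scale achieving a prescribed rate $\alpha$ can be found by bisection. The following is a sketch with hypothetical names (`acc_rate`, `ell_alpha`); it works in the normalized variables $s = a/b$, $b = 1$, assumes the solution lies in $[0, 10]$, and is meant for moderate $s$ (the exponential can overflow for large $s$).

```python
import math

def Phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def acc_rate(s, ell):
    """Limiting mean acceptance rate for a / b = s > 0 and b = 1."""
    rs = math.sqrt(s)
    return (Phi(-ell / (2.0 * rs))
            + math.exp(ell**2 * (s - 1.0) / 2.0) * Phi(ell * (1.0 / (2.0 * rs) - rs)))

def ell_alpha(s, alpha, ell_hi=10.0, tol=1e-10):
    """Bisection for the unique l with acc_rate(s, l) = alpha (rate decreasing in l)."""
    lo, hi = 0.0, ell_hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if acc_rate(s, mid) > alpha:
            lo = mid   # acceptance still too high: increase l
        else:
            hi = mid
    return 0.5 * (lo + hi)

print(ell_alpha(1.0, 0.234))                            # stationary case, close to 2.38
print(ell_alpha(1e-4, 0.3), math.sqrt(-2.0 * math.log(0.3)))  # small a / b regime
```

The scaled value is `ell_alpha(a / b, alpha) / math.sqrt(b)`, as in the lemma.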
Comparison with constant acceptance rate strategies
Remark 1: Notice that $\tilde\ell^\star(a, b) = \frac{1}{\sqrt{b}}\, \ell^\star\left(\frac{a}{b}\right)$ and $\tilde\ell_\alpha(a, b) = \frac{1}{\sqrt{b}}\, \ell_\alpha\left(\frac{a}{b}\right)$ have the same scaling in $(a, b)$.
→ Constant acceptance rate strategy seems sensible.
Remark 2: Choice of $\alpha$: how to choose $\alpha$ to get $\tilde\ell^\star(a, b) \simeq \tilde\ell_\alpha(a, b)$?
- $a/b \to 0$: $\alpha = \frac{1}{e} \simeq 0.37$.
- $a/b \to 1$: $\alpha$ such that $\ell_\alpha(1) = \ell^\star(1)$, namely $\alpha \simeq 0.35$.
- $a/b \to \infty$: $\alpha = \Phi(-x^\star/2) \simeq 0.27$.
(Recall that the standard choice for the RWM under the stationarity assumption is $\alpha = 0.234$.)
→ Constant acceptance rate with $\alpha \in (1/4, 1/3)$ seems sensible.
Let us plot the relative difference in terms of exponential rate of convergence, for the three values $\alpha = \frac{1}{e} \simeq 0.37$, $\alpha \simeq 0.35$ and $\alpha = \Phi(-x^\star/2) \simeq 0.27$.
Figure (three panels, $b = 1$, $b = 0.1$, $b = 10$): $\frac{F\left(\frac{a}{b}, \tilde\ell^\star(a, b)\sqrt{b}\right) - F\left(\frac{a}{b}, \tilde\ell_\alpha(a, b)\sqrt{b}\right)}{F\left(\frac{a}{b}, \tilde\ell^\star(a, b)\sqrt{b}\right)}$ as a function of $a$ for $b = 1, 0.1, 10$ and $\alpha \simeq 0.27$ (solid line), $\alpha \simeq 0.35$ (dashed line), $\alpha = e^{-1} \simeq 0.37$ (dotted line).
→ $\alpha \simeq 0.27$ seems to be the best compromise.
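Gaussian target: $V(x) = \frac{1}{2}(x^2 + \ln(2\pi))$
Setting $m(t) \overset{\text{def}}{=} \mathbb{E}[X_t] = \int_{\mathbb{R}} x \psi_t(x)\,dx$ and $s(t) \overset{\text{def}}{=} \mathbb{E}[(X_t)^2] = \int_{\mathbb{R}} x^2 \psi_t(x)\,dx$, one has
$$H(\psi_t|\psi_\infty) = \frac{1}{2} \left( s(t) - \ln(s(t) - m(t)^2) - 1 \right),$$
$$\frac{d}{dt} H(\psi_t|\psi_\infty) = \frac{1}{2} \left( F(s, \ell)(1 - s) - \frac{F(s, \ell)(1 - s) + 2m^2 G(s, 1, \ell)}{s - m^2} \right),$$
where the $2m^2 G$ term comes from $\frac{d}{dt} m^2 = -2m^2 G(s, 1, \ell)$ and $\frac{d}{dt} s = F(s, \ell)(1 - s)$.
It is possible to compute numerically $\ell^{ent}(m, s)$ maximizing $\left|\frac{d}{dt} H(\psi_t|\psi_\infty)\right|$.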
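To assess the convergence, we compute
$$t_0 \mapsto \hat I^m_{t_0, t_0 + T} = \frac{1}{T} \sum_{k = t_0 + 1}^{t_0 + T} \frac{X^{1,n}_k + \dots + X^{n,n}_k}{n},$$
$$t_0 \mapsto \hat I^s_{t_0, t_0 + T} = \frac{1}{T} \sum_{k = t_0 + 1}^{t_0 + T} \frac{(X^{1,n}_k)^2 + \dots + (X^{n,n}_k)^2}{n}.$$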
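Given a stored trajectory, the two windowed estimators are plain averages over a block of rows. A minimal sketch, assuming the chain is kept as a NumPy array of shape `(n_iterations + 1, n)` whose row `k` holds $(X^{1,n}_k, \dots, X^{n,n}_k)$; the function name `windowed_estimators` is hypothetical.

```python
import numpy as np

def windowed_estimators(chain, t0, T):
    """Return (I_m, I_s) computed over iterations t0 + 1, ..., t0 + T."""
    window = chain[t0 + 1 : t0 + T + 1]
    I_m = float(np.mean(window))        # (1/T) sum_k (1/n) sum_i X_k^{i,n}
    I_s = float(np.mean(window**2))     # (1/T) sum_k (1/n) sum_i (X_k^{i,n})^2
    return I_m, I_s

# Tiny deterministic example: 6 iterations of a chain with n = 2 components.
chain = np.arange(12, dtype=float).reshape(6, 2)
I_m, I_s = windowed_estimators(chain, t0=1, T=2)   # uses rows 2 and 3
```

For the Gaussian target, the square biases plotted as functions of the burn-in time $t_0$ would then be `I_m**2` and `(I_s - 1.0)**2`.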
Figure (two panels, $\hat I^s$ and $\hat I^m$; horizontal axis: burn-in time, vertical axis: square bias): $t_0 \mapsto$ square bias of $(\hat I^s_{t_0, T + t_0}, \hat I^m_{t_0, T + t_0})$, $(X^{1,n}_0, \dots, X^{n,n}_0) = (10, \dots, 10)$, $n = 100$. Curves: $\ell = 2.38$, $\ell_{0.27}$-A, $\ell_{0.27}$-N, $\ell^{ent}$ and $\ell^\star$ ($\ell_{0.27}$-A: adaptive scaling Metropolis algorithm; $\ell_{0.27}$-N: numerical approximation of $\ell_{0.27}(s, 1)$).
Conclusions:
- 1. The constant $\ell$ strategy is bad;
- 2. The constant average acceptance rate strategy (using $\ell_\alpha$) leads to convergence curves very close to those of the optimal exponential rate of convergence strategy (using $\ell^\star$);
- 3. The optimal exponential rate of convergence strategy is as good as the best strategy one could design in terms of entropy decay (using $\ell^{ent}$).
Example of non Gaussian target
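$$V(x) = \begin{cases} (x - 1)^2 (x + 1)^2 & \text{if } |x| \le 1, \\ 4x^2 - 8|x| + 4 & \text{otherwise.} \end{cases}$$
- $I = \int_{\mathbb{R}} (V')^2 e^{-V} = 4.07$ so that $\frac{2.38}{\sqrt{I}} = 1.18$
- $X^{i,n}_0$ i.i.d. $\sim \mathcal{N}_1(1, 0.143)$ so that $\mathbb{E}[(V'(X^{1,n}_0))^2] = \mathbb{E}[V''(X^{1,n}_0)] = 5.24$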
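These constants can be approximated by simple quadrature. The sketch below rests on two assumptions that are not stated on the slide: that $e^{-V}$ is normalized to a probability density before computing $I$, and that 0.143 is the variance (not the standard deviation) of the initial Gaussian law. Under these assumptions the values agree with the quoted ones to about two digits; the variable names are hypothetical.

```python
import numpy as np

def V(x):
    ax = np.abs(x)
    return np.where(ax <= 1.0, (x - 1.0) ** 2 * (x + 1.0) ** 2, 4.0 * x**2 - 8.0 * ax + 4.0)

def dV(x):
    # derivative of (x^2 - 1)^2 inside [-1, 1], of 4 (|x| - 1)^2 outside
    return np.where(np.abs(x) <= 1.0, 4.0 * x * (x**2 - 1.0), 8.0 * x - 8.0 * np.sign(x))

def d2V(x):
    return np.where(np.abs(x) <= 1.0, 12.0 * x**2 - 4.0, 8.0)

# Riemann sums on a fine grid; integrands are negligible outside [-8, 8].
x = np.linspace(-8.0, 8.0, 160_001)
dx = x[1] - x[0]
weights = np.exp(-V(x))
Z = np.sum(weights) * dx                       # normalizing constant (assumption)
I = np.sum(dV(x) ** 2 * weights) * dx / Z      # int (V')^2 exp(-V), normalized

var0 = 0.143                                   # read as a variance (assumption)
gauss = np.exp(-(x - 1.0) ** 2 / (2.0 * var0)) / np.sqrt(2.0 * np.pi * var0)
a0 = np.sum(dV(x) ** 2 * gauss) * dx           # E[(V'(X_0))^2]
b0 = np.sum(d2V(x) * gauss) * dx               # E[V''(X_0)]

print(I, 2.38 / np.sqrt(I), a0, b0)
```

The initial law is thus chosen so that $a(0) \approx b(0)$, as in the stationary regime, while still being far from the target.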
Figure (two panels, $\hat I^s$ and $\hat I^m$; horizontal axis: burn-in time, vertical axis: square bias). Curves: $\ell = 1.18$, $\ell_{0.27}$, $\ell_{0.35}$ and $\ell^\star$.
The constant acceptance rate strategies are implemented using an adaptive scaling Metropolis algorithm.
References
- B. Jourdain, T. Lelièvre and B. Miasojedow, Optimal scaling for the transient phase of the random walk Metropolis algorithm: the mean-field limit, http://arxiv.org/abs/1210.7639.
- B. Jourdain, T. Lelièvre and B. Miasojedow, Optimal scaling for the transient phase of Metropolis Hastings algorithms: the longtime behavior, http://arxiv.org/abs/1212.5517.
Optimal scaling of the transient phase of MALA (1)
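Consider the MALA algorithm:
$$X^{i,n}_{k+1} = X^{i,n}_k + Z^{i,n}_{k+1}\, 1_{A_{k+1}}, \quad Z^{i,n}_{k+1} = \sigma_n G^i_{k+1} - \frac{\sigma_n^2}{2} V'(X^{i,n}_k), \quad 1 \le i \le n,$$
where
$$A_{k+1} = \left\{ U_{k+1} \le e^{\sum_{i=1}^n \left( V(X^{i,n}_k) - V(X^{i,n}_k + Z^{i,n}_{k+1}) + \frac{1}{2} \left[ (G^i_{k+1})^2 - \left( G^i_{k+1} - \frac{\sigma_n}{2} \left( V'(X^{i,n}_k) + V'(X^{i,n}_k + Z^{i,n}_{k+1}) \right) \right)^2 \right] \right)} \right\}.$$
For $\sigma_n = \frac{\ell}{n^{1/4}}$ and $((X^{1,n}_0, \dots, X^{n,n}_0))_{n \ge 1}$ $m$-chaotic, one expects propagation of chaos for the processes $((X^{1,n}_{\lfloor \sqrt{n} t \rfloor}, \dots, X^{n,n}_{\lfloor \sqrt{n} t \rfloor})_{t \ge 0})_{n \ge 1}$ to the law of
$$dX_t = \sqrt{w(t, \ell)}\,dB_t - w(t, \ell)\, \frac{1}{2} V'(X_t)\,dt, \quad X_0 \sim m(dx),$$
where
$$w(t, \ell) = \ell^2 \left( e^{\frac{\ell^4}{8} \mathbb{E}\left( ((V')^2 V'' + V^{(4)} - 2V^{(3)} V' - (V'')^2)(X_t) \right)} \wedge 1 \right).$$
Remark: If $V(x) = \frac{x^2 + \ln(2\pi)}{2}$, then
$$\frac{d}{dt} \mathbb{E}(X_t^2) = \ell^2 \left( e^{\frac{\ell^4}{8} (\mathbb{E}(X_t^2) - 1)} \wedge 1 \right) (1 - \mathbb{E}(X_t^2)),$$
[Christensen, Roberts, Rosenthal 2005].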
Optimal scaling of the transient phase of MALA (2)
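$$w(t, \ell) = \ell^2 \left( e^{\frac{\ell^4}{8} \mathbb{E}\left( ((V')^2 V'' + V^{(4)} - 2V^{(3)} V' - (V'')^2)(X_t) \right)} \wedge 1 \right)$$
- on time intervals such that $\mathbb{E}\left[ \left( (V')^2 V'' + V^{(4)} - 2V^{(3)} V' - (V'')^2 \right)(X_t) \right] < 0$, $\ell \mapsto w(t, \ell)$ is maximum at $\ell^\star \simeq \frac{1.42}{\mathbb{E}^{1/4}\left( (2V^{(3)} V' + (V'')^2 - (V')^2 V'' - V^{(4)})(X_t) \right)}$
- on time intervals such that $\mathbb{E}\left[ \left( (V')^2 V'' + V^{(4)} - 2V^{(3)} V' - (V'')^2 \right)(X_t) \right] = 0$ (this is in particular the case at equilibrium), the correct scaling [Roberts, Rosenthal 1998] is $\sigma_n = \frac{\ell}{n^{1/6}}$ and one obtains a diffusive limit for $(X^{1,n}_{\lfloor n^{1/3} t \rfloor})_{t \ge 0}$. At equilibrium, there exists an optimal $\ell = \ell^\star$ and $\mathrm{acc}(\ell^\star) = 0.574$.
- on time intervals such that $\mathbb{E}\left[ \left( (V')^2 V'' + V^{(4)} - 2V^{(3)} V' - (V'')^2 \right)(X_t) \right] > 0$, with the scaling $\sigma_n = \frac{\ell}{n^{1/4}}$, we have $w(t, \ell) = \ell^2 \to +\infty$ as $\ell \to +\infty$. One should take $\sigma_n \gg \frac{\ell}{n^{1/4}}$.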
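In the first regime the maximizer of $\ell \mapsto w(t, \ell)$ is explicit: writing $c < 0$ for the expectation frozen at a given time, $\ell^2 e^{\ell^4 c / 8}$ is maximal where $\ell^4 = 4/|c|$, that is $\ell = \sqrt{2}\, |c|^{-1/4}$, which is the constant quoted as 1.42. The sketch below checks this by grid search; the names `w` and `ell_star_mala` and the scalar `c` are hypothetical stand-ins for the quantities on the slide.

```python
import math
import numpy as np

def w(ell, c):
    """w = l^2 * min(exp(l^4 c / 8), 1), with c the expectation frozen at one time."""
    # exp is increasing, so min(exp(z), 1) = exp(min(z, 0)); this also avoids overflow
    return ell**2 * math.exp(min(ell**4 * c / 8.0, 0.0))

def ell_star_mala(c, n_grid=20_001):
    """Grid-search maximizer of l -> w(l, c) for c < 0."""
    if c >= 0:
        raise ValueError("w is unbounded in l when c >= 0")
    grid = np.linspace(0.0, 5.0 / abs(c) ** 0.25, n_grid)
    values = [w(float(l), c) for l in grid]
    return float(grid[int(np.argmax(values))])

for c in (-1.0, -16.0):
    l_opt = ell_star_mala(c)
    print(c, l_opt, math.sqrt(2.0) / abs(c) ** 0.25, math.exp(l_opt**4 * c / 8.0))
```

At the maximizer the factor $e^{\ell^4 c / 8}$ equals $e^{-1/2}$ regardless of $c$.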
Optimal scaling of the transient phase of MALA (3)
The case $\mathbb{E}\left[ \left( (V')^2 V'' + V^{(4)} - 2V^{(3)} V' - (V'')^2 \right)(X_t) \right] > 0$: one should take $\sigma_n$ going to zero as slowly as possible. Let us consider the Gaussian case $V(x) = (x^2 + \ln(2\pi))/2$, so that $\mathbb{E}\left[ \left( (V')^2 V'' + V^{(4)} - 2V^{(3)} V' - (V'')^2 \right)(X_t) \right] = \mathbb{E}(X_t^2 - 1)$.
Proposition
If the initial random variables $(X^{1,n}_0, \dots, X^{n,n}_0)$ are i.i.d. according to $m$ such that $\langle m, x^2 - 1 \rangle > 0$ and $\langle m, x^8 \rangle < +\infty$, and $\sigma_n$ satisfies
$$\lim_{n \to \infty} \sigma_n = 0 \quad \text{and} \quad \lim_{n \to \infty} n \sigma_n^2 = +\infty,$$
then the processes $((X^{1,n}_{\lfloor t/\sigma_n^2 \rfloor})_{t \ge 0}, \dots, (X^{n,n}_{\lfloor t/\sigma_n^2 \rfloor})_{t \ge 0})$ are $Q$-chaotic where $Q$ denotes the law of the Ornstein-Uhlenbeck process $dX_t = dB_t - \frac{X_t}{2}\,dt$, $X_0 \sim m$. Moreover, the limiting mean acceptance rate is 1.
Remark: this result still holds if $\lim_{n \to \infty} n \sigma_n^2 = 0$.
In summary
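| | RWMH | MALA |
|---|---|---|
| Equilibrium | $\sigma_n = \frac{\ell}{\sqrt{n}}$, $\mathrm{acc}(\ell^\star) = 0.234$ | $\sigma_n = \frac{\ell}{n^{1/6}}$, $\mathrm{acc}(\ell^\star) = 0.574$ |
| Transient | $\sigma_n = \frac{\ell}{\sqrt{n}}$, $\mathrm{acc}(\ell^\star) = 0.27$ | $\sigma_n = \frac{\ell}{n^{1/4}}$, optimal $\ell$ ??? |

In all cases, the associated timescale is the diffusive one: $\left( X^{1,n}_{\lfloor t/\sigma_n^2 \rfloor} \right)_{t \ge 0}$.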