SLIDE 1

Introduction Result Ideas of the proof Summary

Sampling, MCMC and Spectral Gaps in Infinite Dimensions

Martin Hairer, Andrew Stuart, Sebastian Vollmer

Department of Mathematics, University of Warwick

Sydney, 2012

SLIDE 2

1. Introduction: Notation, Target Measure, Spectral Gaps

2. Result: Key result, Dimension Dependent Results for the RWM, Preliminaries & Weak Harris Theorem

3. Ideas of the proof: d-contracting, Dimensionality

4. Summary

SLIDE 3

Notation

Given a measure $\mu$, generate $X_i$ such that

$$S_n(f) = \frac{1}{n}\sum_{j=1}^{n} f(X_j) \to \mathbb{E}_\mu[f(X)].$$

Complexity of an algorithm: number of necessary steps × cost of a step.

SLIDE 4

Target measure

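Assumption: The target measure $\mu$ has a density w.r.t. a Gaussian measure $\gamma$:

$$\mu(dx) = M \exp(-\Phi(x))\,\gamma(dx). \qquad (1)$$

Here $\gamma = \mathcal{N}(0, C)$ on a separable Hilbert space $H$, and $\{\varphi_n\}_{n\in\mathbb{N}}$ is an orthonormal basis of eigenvectors of $C$ with eigenvalues $\{\lambda_n^2\}_{n\in\mathbb{N}}$. The Karhunen-Loeve expansion then yields

$$\gamma = \mathcal{L}\Big(\sum_{i=1}^{\infty} \lambda_i \varphi_i \xi_i\Big), \quad \text{where } \xi_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, 1).$$

Example: Brownian motion on $[0, 1]$, with $\lambda_k^2 = \frac{1}{(k - \frac{1}{2})^2\pi^2}$ and $\varphi_k(t) = \sqrt{2}\sin\big((k - \tfrac{1}{2})\pi t\big)$:

$$B_t = \sum_{k=1}^{\infty} \frac{1}{(k - \frac{1}{2})\pi}\,\sqrt{2}\sin\big((k - \tfrac{1}{2})\pi t\big)\,\xi_k.$$

Using the projections $P_m$ onto $\mathrm{span}\{\varphi_i\}_{i=1}^{m}$, the $m$-dimensional approximations are

$$\gamma_m(dx) = \mathcal{L}\Big(\sum_{i=1}^{m} \lambda_i \varphi_i \xi_i\Big)(dx), \qquad \mu_m(dx) = M_m \exp(-\Phi(P_m x))\,\gamma_m(dx). \qquad (2)$$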
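
The truncated Karhunen-Loeve expansion can be tried out numerically for the Brownian motion example. The following is a minimal sketch, not part of the slides: the function names `kl_brownian_path` and `kl_variance` are hypothetical, and it assumes $\lambda_k = 1/((k - \frac{1}{2})\pi)$ and $\varphi_k(t) = \sqrt{2}\sin((k - \frac{1}{2})\pi t)$.

```python
import math
import random


def kl_brownian_path(m, times, rng):
    """Evaluate the m-term expansion sum_k lambda_k * phi_k(t) * xi_k at the given times."""
    xis = [rng.gauss(0.0, 1.0) for _ in range(m)]  # xi_k i.i.d. N(0, 1)
    path = []
    for t in times:
        value = 0.0
        for k, xi in enumerate(xis, start=1):
            freq = (k - 0.5) * math.pi
            lam = 1.0 / freq                            # lambda_k (assumed form)
            phi = math.sqrt(2.0) * math.sin(freq * t)   # phi_k(t) (assumed form)
            value += lam * phi * xi
        path.append(value)
    return path


def kl_variance(m, t):
    """Variance of the truncated expansion at time t: sum_k lambda_k^2 * phi_k(t)^2."""
    total = 0.0
    for k in range(1, m + 1):
        freq = (k - 0.5) * math.pi
        total += 2.0 * math.sin(freq * t) ** 2 / freq ** 2
    return total


rng = random.Random(0)
path = kl_brownian_path(200, [i / 10 for i in range(11)], rng)
```

As a sanity check, `kl_variance(m, t)` should approach $t$ (the variance of $B_t$) as $m$ grows.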

SLIDE 5

Metropolis-Hastings algorithms

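$\alpha(x, y)$: acceptance probability for a transition from $x$ to $y$ [Tierney, 1998].

Random Walk Metropolis (RWM) algorithm on $\mathbb{R}^m$:

$$Q(x, dy) = \mathcal{L}\big(x + \sqrt{2\delta}\,\xi\big)(dy) \quad \text{with } \xi \sim \gamma_m,$$

$$\alpha(x, y) = 1 \wedge \exp\Big(\Phi(x) - \Phi(y) + \tfrac{1}{2}\langle x, C^{-1}x\rangle - \tfrac{1}{2}\langle y, C^{-1}y\rangle\Big).$$

Preconditioned Crank-Nicolson (pCN) algorithm on $H$:

$$Q(x, dy) = \mathcal{L}\big((1 - 2\delta)^{1/2} x + \sqrt{2\delta}\,\xi\big)(dy) \quad \text{with } \xi \sim \gamma,$$

$$\alpha(x, y) = 1 \wedge \exp(\Phi(x) - \Phi(y)).$$

Transition kernel:

$$P(x, dz) = Q(x, dz)\,\alpha(x, z) + \delta_x(dz)\int (1 - \alpha(x, u))\,Q(x, du).$$

$P$ and $P_m$ denote the transition kernels for $\mu$ and $\mu_m$, respectively.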
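
The two proposals and acceptance rules can be sketched in a few lines. This is an illustrative sketch only, not the authors' implementation: it works in finitely many coordinates of the eigenbasis of $C$, so that $\xi \sim \gamma_m$ has independent components $\lambda_i \xi_i$, and the names `sample_gamma`, `accept`, `pcn_step` and `rwm_step` are hypothetical.

```python
import math
import random


def sample_gamma(lams, rng):
    """Draw xi ~ N(0, C) in the eigenbasis of C, where C has eigenvalues lams[i]**2."""
    return [lam * rng.gauss(0.0, 1.0) for lam in lams]


def accept(log_ratio, rng):
    """Accept with probability 1 ∧ exp(log_ratio)."""
    u = 1.0 - rng.random()  # uniform on (0, 1], so log(u) is finite
    return math.log(u) <= min(0.0, log_ratio)


def pcn_step(x, phi, lams, delta, rng):
    """One pCN step: propose (1 - 2δ)^{1/2} x + sqrt(2δ) ξ, accept w.p. 1 ∧ exp(Φ(x) - Φ(y))."""
    xi = sample_gamma(lams, rng)
    a, b = math.sqrt(1.0 - 2.0 * delta), math.sqrt(2.0 * delta)
    y = [a * xj + b * ej for xj, ej in zip(x, xi)]
    ok = accept(phi(x) - phi(y), rng)
    return (y, True) if ok else (list(x), False)


def rwm_step(x, phi, lams, delta, rng):
    """One RWM step: propose x + sqrt(2δ) ξ; the acceptance ratio also carries <., C^{-1} .> terms."""
    xi = sample_gamma(lams, rng)
    b = math.sqrt(2.0 * delta)
    y = [xj + b * ej for xj, ej in zip(x, xi)]

    def quad(v):
        # <v, C^{-1} v> in the eigenbasis of C
        return sum((vj / lam) ** 2 for vj, lam in zip(v, lams))

    ok = accept(phi(x) - phi(y) + 0.5 * quad(x) - 0.5 * quad(y), rng)
    return (y, True) if ok else (list(x), False)
```

Note that for $\Phi = 0$ the pCN step always accepts, whereas the RWM step can still reject because of the quadratic terms.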

SLIDE 6

Spectral Gaps

Definition

A Markov transition kernel $P$ with invariant measure $\mu$ has an $L^2_\mu$-spectral gap $1 - \beta$ iff

$$\beta = \sup_{f \in L^2_\mu} \frac{\|Pf - \mu(f)\|_2}{\|f - \mu(f)\|_2} < 1.$$

Proposition

[Kipnis and Varadhan, 1986] If $X_0 \sim \mu$, then for any $f \in L^2_\mu$ the sequence $f(X_n)$ satisfies a CLT with asymptotic variance

$$\sigma^2_{f,P} \le \frac{2\mu(f^2)}{1 - \beta}.$$

Proposition

[Rudolf, 2011] For $X_0 \sim \nu$ with $\nu$ absolutely continuous w.r.t. $\mu$, there is a non-asymptotic result of the form

$$\text{MSE}: \quad \mathbb{E}_{\nu,K}\,|S_n(f) - \mu(f)|^2 \lesssim \frac{2}{n(1 - \beta)}.$$

SLIDE 7

Key result

Theorem (Key Result)

1. The RWM algorithm has an $L^2$-spectral gap that decays to zero faster than any negative power of $m$.

2. If $\Phi$ is locally Lipschitz and satisfies a growth assumption, then the transition kernel $P$ of the pCN algorithm has an $L^2$-spectral gap that is bounded from below uniformly in $m$.

SLIDE 8

Dimension Dependent Results for the RWM

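Conductance:

$$C = \inf_{\mu(A) \le \frac{1}{2}} \frac{\int_A P(x, A^c)\,d\mu(x)}{\mu(A)}.$$

Relation to the spectral gap (cf. [Lawler and Sokal, 1988, Sinclair and Jerrum, 1989]):

$$\frac{C^2}{2} \le 1 - \beta \le 2C.$$

Proposition

For any Metropolis-Hastings transition kernel $P$ and any set $B$ with $\mu(B) \le \frac{1}{2}$,

$$1 - \beta \le 2 \sup_{x \in B} \alpha(x),$$

where $\alpha(x) = \int \alpha(x, y)\,Q(x, dy)$ is the probability of accepting a proposal from $x$.

Proof.

The algorithm started in $B$ can only move to $B^c$ if it accepts the move. Hence $P(x, B^c) \le \alpha(x)$.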
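
Written out, the proof amounts to the following chain of inequalities. This is a sketch under two assumptions that the slide leaves implicit: $\mu(B) > 0$, and $\alpha(x)$ denotes the overall probability of accepting a proposal from $x$.

```latex
1 - \beta \;\le\; 2C
  \;\le\; 2\,\frac{\int_B P(x, B^c)\,d\mu(x)}{\mu(B)}
  \;\le\; 2\,\frac{\int_B \alpha(x)\,d\mu(x)}{\mu(B)}
  \;\le\; 2 \sup_{x \in B} \alpha(x).
```

The first step is the upper conductance bound, the second uses that $B$ is one admissible set in the infimum, the third uses $P(x, B^c) \le \alpha(x)$, and the last bounds an average by a supremum.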

SLIDE 9

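Consider $\mu_m = \gamma_m = \mathcal{L}\big(\sum_{i=1}^{m} \frac{1}{i}\,\xi_i e_i\big)$ with $\xi_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, 1)$.

Theorem

Let $P_m$ be the Markov kernel of the RWM algorithm applied to $\gamma_m$.

Scaling of $\delta$                           Upper bound on spectral gap
$\delta_m \sim m^{-a}$, $a \in [0, 1)$        $1 - \beta_m \le K_p m^{-p}$ for any $p$
$\delta_m \sim m^{-a}$, $a \in [1, \infty)$   $1 - \beta_m \le K m^{-a/2}$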
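
To get a feeling for the degeneration, one can estimate the mean RWM acceptance probability for this Gaussian target with a fixed step size (the case $a = 0$). The sketch below is only a heuristic illustration, not part of the slides: it averages the acceptance probability over $x \sim \gamma_m$ rather than taking the supremum over a set $B$ as in the proposition, and the function name `mean_rwm_acceptance` and the value $\delta = 0.1$ are hypothetical choices.

```python
import math
import random


def mean_rwm_acceptance(m, delta, n_samples, rng):
    """Monte Carlo estimate of E[alpha(x, y)] for RWM on gamma_m with lambda_i = 1/i and Phi = 0."""
    total = 0.0
    for _ in range(n_samples):
        log_ratio = 0.0
        for i in range(1, m + 1):
            lam = 1.0 / i
            x_i = lam * rng.gauss(0.0, 1.0)                                  # x ~ gamma_m
            y_i = x_i + math.sqrt(2.0 * delta) * lam * rng.gauss(0.0, 1.0)   # RWM proposal
            # contribution of coordinate i to (1/2)<x, C^{-1}x> - (1/2)<y, C^{-1}y>
            log_ratio += 0.5 * (x_i ** 2 - y_i ** 2) / lam ** 2
        total += math.exp(min(0.0, log_ratio))
    return total / n_samples


rng = random.Random(0)
rates = {m: mean_rwm_acceptance(m, 0.1, 1000, rng) for m in (5, 50, 200)}
```

With $\delta$ held fixed, the estimated acceptance rate drops quickly as $m$ increases, in line with the first row of the table.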

SLIDE 10

Why is the small set/minorization approach (Meyn and Tweedie) not applicable?

Definition

A Markov chain $(X_t)$ with kernel $P$ is said to be $\psi$-irreducible if for a non-trivial Borel measure $\psi$ there is an $n \in \mathbb{N}$ s.t. $\psi(A) > 0 \Rightarrow P^n(x, A) > 0$ for all $x$.

Here $P(x, \cdot)$ and $P(y, \cdot)$ are mutually singular for some $x$ and $y$:

$$P(x, dz) = \alpha(x, z)\,\mathcal{N}\big((1 - 2\delta)^{1/2} x,\; 2\delta C\big)(dz) + \delta_x(dz)\,r(x),$$

$$P(y, dz) = \alpha(y, z)\,\mathcal{N}\big((1 - 2\delta)^{1/2} y,\; 2\delta C\big)(dz) + \delta_y(dz)\,r(y).$$

SLIDE 11

Preliminaries & Weak Harris Theorem

Definition

$d : H \times H \to \mathbb{R}_+$ is a distance-like function if it is symmetric, lower semi-continuous and $d(x, y) = 0 \Leftrightarrow x = y$.

Definition

The corresponding Wasserstein distance is given by

$$d(\nu_1, \nu_2) = \inf_{\pi \in \Gamma(\nu_1, \nu_2)} \int_{H^2} d(x, y)\,\pi(dx, dy)$$

with $\Gamma(\nu_1, \nu_2) = \{\pi \in \mathcal{M}(H^2) \mid P_i^{*}\pi = \nu_i\}$.

Definition

$P$ has a Wasserstein spectral gap if $\exists \lambda > 0$, $C > 0$ s.t.

$$d(\nu_1 P^n, \nu_2 P^n) \le C \exp(-\lambda n)\,d(\nu_1, \nu_2) \quad \text{for all } n \in \mathbb{N}.$$

SLIDE 12

Definition

$S \subset H$ is d-small if $\exists\, 0 < s < 1$ s.t. for all $x, y \in S$: $d(P(x, \cdot), P(y, \cdot)) \le s$.

If $S$ is a small set, then it is also d-small for $d(x, y) = \chi_{\{x \ne y\}}(x, y)$, where the Wasserstein distance coincides with the total variation distance: $\|P(x, \cdot) - P(y, \cdot)\|_{TV} \le 1 - s$.

Definition

$P$ is d-contracting if $\exists\, 0 < c < 1$ such that $d(x, y) < 1$ implies $d(P(x, \cdot), P(y, \cdot)) \le c \cdot d(x, y)$.

Definition

$V$ is a Lyapunov function for $P$ if $\exists K > 0$ and $0 \le l < 1$ s.t. $P^n V(x) \le l^n V(x) + K$ for all $x \in H$ and all $n \in \mathbb{N}$.

SLIDE 13

Weak Harris Theorem

Theorem (Weak Harris Theorem [Hairer et al., 2011])

If $\nu_1$ and $\nu_2$ are probability measures on $H$ and $d : H \times H \to [0, 1]$ is a distance-like function s.t.

1. $P$ has a Lyapunov function $V$;

2. $P$ is d-contracting;

3. the set $S = \{x \in H : V(x) \le 4K\}$ is d-small,

then

$$\tilde{d}(\nu_1 P^{\tilde{n}}, \nu_2 P^{\tilde{n}}) \le \tfrac{1}{2}\,\tilde{d}(\nu_1, \nu_2) \quad \text{with} \quad \tilde{d}(x, y) = \sqrt{d(x, y)\,(1 + V(x) + V(y))},$$

with $\tilde{n}(l, K, c, s)$ increasing in $l$, $K$, $c$ and $s$.

Moreover, if there is a complete metric $d_0$ such that $d_0 \le \sqrt{d}$ and $P$ is Feller, then there exists a unique invariant measure $\mu$ for $P$.

SLIDE 14

Wasserstein spectral gap

Why do we care?

CLT for $\tilde{d}$-Lipschitz functionals [Komorowski and Walczuk, 2011] (holds in the non-reversible case).

For reversible Markov processes:

Theorem (due to [Wang, 2003])

If $\mathrm{Lip}(\tilde{d}) \cap L^\infty_\mu$ is dense in $L^2_\mu$, then

Wasserstein spectral gap ⇒ $L^2$-spectral gap of the same size.

SLIDE 15

Globally Lipschitz log-density

Weak Harris Theorem for $d(x, y) = 1 \wedge \frac{\|x - y\|}{\epsilon}$

Assumption

There is an $r > 0$ and $\alpha_l > 0$ s.t. $\mathbb{P}\big(q_x(\xi) \text{ is accepted} \mid \|\xi\| \le r\big) \ge \alpha_l$.

Theorem

Assume that $\Phi$ has a global Lipschitz constant $L$ and that the Assumption above is satisfied. Then for $\epsilon$ small enough the pCN algorithm for $\mu$ ($\mu_m$) converges exponentially in

$$\tilde{d}(x, y) = \sqrt{d(x, y)\,(1 + V(x) + V(y))} \quad \text{with} \quad V(x) = \|x\|^i, \quad d(x, y) = 1 \wedge \frac{\|x - y\|}{\epsilon},$$

with an $m$-independent bound on the rate. Moreover, $\mu$ ($\mu_m$) is the unique invariant measure.

SLIDE 16

Basic coupling

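Recall d-contracting: $d(x, y) < 1$ implies $d(P(x, \cdot), P(y, \cdot)) \le c\,d(x, y)$ with $c < 1$.

Proposals from $x$ and $y$ for the same $\xi \sim \gamma$:

$$q_x(\xi) = (1 - 2\delta)^{1/2} x + \sqrt{2\delta}\,\xi, \qquad q_y(\xi) = (1 - 2\delta)^{1/2} y + \sqrt{2\delta}\,\xi.$$

Let $U$ be an independent uniform random variable on $[0, 1]$ and set

$$\tilde{x} = q_x(\xi)\,\chi_{[0, \alpha(x, q_x)]}(U) + x\,\chi_{(\alpha(x, q_x), 1]}(U),$$

$$\tilde{y} = q_y(\xi)\,\chi_{[0, \alpha(y, q_y)]}(U) + y\,\chi_{(\alpha(y, q_y), 1]}(U).$$

Then $P(x, \cdot) = \mathcal{L}(\tilde{x})$ and $P(y, \cdot) = \mathcal{L}(\tilde{y})$.

Basic coupling: $\pi_{\text{Basic}} = \mathcal{L}((\tilde{x}, \tilde{y}))$.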
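
The basic coupling is straightforward to simulate: both chains use the same noise $\xi$ and the same uniform $U$. The following is a minimal finite-dimensional sketch, assuming coordinates in the eigenbasis of $C$; the function name `coupled_pcn_step` and its arguments are hypothetical.

```python
import math
import random


def coupled_pcn_step(x, y, phi, lams, delta, rng):
    """One step of the basic coupling: both pCN chains share the noise xi and the uniform U."""
    xi = [lam * rng.gauss(0.0, 1.0) for lam in lams]  # shared xi ~ gamma (eigenbasis of C)
    u = rng.random()                                   # shared uniform U
    a, b = math.sqrt(1.0 - 2.0 * delta), math.sqrt(2.0 * delta)
    qx = [a * xj + b * ej for xj, ej in zip(x, xi)]    # q_x(xi)
    qy = [a * yj + b * ej for yj, ej in zip(y, xi)]    # q_y(xi)
    alpha_x = math.exp(min(0.0, phi(x) - phi(qx)))     # 1 ∧ exp(Φ(x) - Φ(q_x))
    alpha_y = math.exp(min(0.0, phi(y) - phi(qy)))     # 1 ∧ exp(Φ(y) - Φ(q_y))
    x_new = qx if u <= alpha_x else list(x)
    y_new = qy if u <= alpha_y else list(y)
    return x_new, y_new
```

When both chains accept, the shared noise cancels and the difference of the new states is $(1 - 2\delta)^{1/2}(x - y)$, which is the source of the contraction.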

SLIDE 17

d-contracting

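$$d(P(x, \cdot), P(y, \cdot)) = \inf_{\pi \in \Gamma(P(x,\cdot), P(y,\cdot))} \int d(a, b)\,d\pi \le \int d(a, b)\,d\pi_{\text{Basic}} = \mathbb{E}\,d(\tilde{x}, \tilde{y}).$$

Observation:

If both algorithms accept the proposal, then $\|\tilde{x} - \tilde{y}\| = \|q_x(\xi) - q_y(\xi)\| = (1 - 2\delta)^{1/2}\,\|x - y\|$.

If both algorithms reject the proposal, then $\|\tilde{x} - \tilde{y}\| = \|x - y\|$.

If one accepts and the other rejects, we use $d \le 1$ together with

$$\mathbb{P}(\text{only one accepts}) \le \int_H |\alpha(x, q_x(\xi)) - \alpha(y, q_y(\xi))|\,d\gamma(\xi)$$

$$\le \int_H |\Phi(q_x) - \Phi(q_y)| + |\Phi(x) - \Phi(y)|\,d\gamma(\xi)$$

$$\le 2L\,\|x - y\| \le 2L\epsilon\,d(x, y).$$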
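
The three cases can be combined into a single contraction estimate. The following is a sketch only: $p_{\mathrm{acc}}$ is a hypothetical symbol (not on the slides) for a lower bound on the probability that both chains accept, and $d(x, y) < 1$ is assumed, so that $\|x - y\| = \epsilon\,d(x, y)$.

```latex
\mathbb{E}\,d(\tilde{x}, \tilde{y})
  \;\le\; \mathbb{P}(\text{both accept})\,(1 - 2\delta)^{1/2}\,d(x, y)
        + \mathbb{P}(\text{both reject})\,d(x, y)
        + \mathbb{P}(\text{only one accepts})
  \;\le\; \Big[\,1 - p_{\mathrm{acc}}\big(1 - (1 - 2\delta)^{1/2}\big) + 2L\epsilon\,\Big]\,d(x, y).
```

Under these assumptions the bracket is strictly less than 1 as soon as $2L\epsilon < p_{\mathrm{acc}}\big(1 - (1 - 2\delta)^{1/2}\big)$, which matches the requirement that $\epsilon$ be small enough.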

SLIDE 18

d-smallness

Recall d-smallness: $S \subset H$ is d-small if $\exists\, 0 < s < 1$ s.t. for all $x, y \in S$: $d(P(x, \cdot), P(y, \cdot)) \le s$.

If the pCN chains started at $x$ and at $y$ both accept $n$ times in a row, then

$$d(X_n, Y_n) \le \frac{1}{\epsilon}\,\mathrm{diam}(S)\,(1 - 2\delta)^{n/2} \le \frac{1}{2}.$$

We take the basic coupling again and prove a lower bound for the probability of this event. This yields smallness.
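
The number of consecutive acceptances needed for the last inequality can be read off by solving $\frac{1}{\epsilon}\,\mathrm{diam}(S)\,(1 - 2\delta)^{n/2} \le \frac{1}{2}$ for $n$. A small sketch, where the function name `steps_until_half` and the example values are hypothetical:

```python
import math


def steps_until_half(diam, eps, delta):
    """Smallest n with (diam / eps) * (1 - 2*delta)**(n / 2) <= 1/2, for 0 < delta < 1/2."""
    ratio = diam / eps
    if ratio <= 0.5:
        return 0
    # Solve (n / 2) * log(1 - 2*delta) <= log(1 / (2 * ratio)) for n.
    return math.ceil(2.0 * math.log(2.0 * ratio) / -math.log(1.0 - 2.0 * delta))


n = steps_until_half(diam=10.0, eps=0.1, delta=0.1)
```

Since $n$ grows only logarithmically in $\mathrm{diam}(S)/\epsilon$, the event "both chains accept $n$ times in a row" involves a fixed, finite number of steps.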

SLIDE 19

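Dimensionality

Where do we get the $m$-independence from?

Lemma

Let $f : \mathbb{R} \to \mathbb{R}$ be monotone increasing. Then

$$\int f(\xi)\,d\gamma_m(\xi) \le \int f(\xi)\,d\gamma(\xi).$$

For $f = \chi_{B_R(0)^c}$: $\gamma_m(B_R(0)^c) \le \gamma(B_R(0)^c)$.

Proof.

The truncated Karhunen-Loeve expansion relates $\gamma_m$ and $\gamma$:

$$\sum_{i=1}^{m} \lambda_i \xi_i^2 \le \sum_{i=1}^{\infty} \lambda_i \xi_i^2.$$

Hence the result follows by monotonicity of the integral and of $f$:

$$\int f(\xi)\,d\gamma_m(\xi) = \mathbb{E}\Big(f\Big(\sum_{i=1}^{m} \lambda_i \xi_i^2\Big)\Big) \le \mathbb{E}\Big(f\Big(\sum_{i=1}^{\infty} \lambda_i \xi_i^2\Big)\Big) = \int f(\xi)\,d\gamma(\xi).$$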
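
The coupling argument in the proof can be mimicked numerically: reuse the same $\xi_i$ for every truncation level, so that the partial sums are ordered pathwise. This is a minimal sketch; the function name `tail_probabilities`, the weights $1/i^2$ and the threshold are hypothetical choices standing in for the coefficients of the sums above.

```python
import random


def tail_probabilities(weights, dims, threshold, n_samples, rng):
    """Estimate P(sum_{i<=m} w_i * xi_i^2 > threshold) for each m in dims,
    reusing the same xi for all m (the pathwise coupling from the proof)."""
    counts = {m: 0 for m in dims}
    for _ in range(n_samples):
        partial, i = 0.0, 0
        for m in sorted(dims):
            while i < m:
                partial += weights[i] * rng.gauss(0.0, 1.0) ** 2
                i += 1
            if partial > threshold:
                counts[m] += 1
    return {m: counts[m] / n_samples for m in dims}


weights = [1.0 / (i * i) for i in range(1, 101)]  # hypothetical summable weights
probs = tail_probabilities(weights, [1, 10, 100], 2.0, 5000, random.Random(0))
```

Because the partial sums only increase with $m$ on every sample, the estimated tail probabilities are non-decreasing in $m$ and are bounded by the corresponding probability for the full sum.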

SLIDE 20

Summary

We showed a dimension-independent Wasserstein and $L^2_\mu$ spectral gap for the pCN algorithm ⇒ CLT and a bound on the mean square error for the sample average.

The RWM degenerates in high dimensions, even for simple target measures.

We advocate the use of the weak Harris theorem, which is useful even in the non-reversible case.

Open: behavior at infinity. Maybe a two-norms approach.

SLIDE 21

Thank you for your attention!

Reference: Hairer, M., Stuart, A., and Vollmer, S. Spectral Gaps for a Metropolis-Hastings Algorithm in Infinite Dimensions. http://arxiv.org/abs/1202.0709, 2011.

SLIDE 22

References

Hairer, M., Mattingly, J. C., and Scheutzow, M. (2011). Asymptotic coupling and a general form of Harris’ theorem with applications to stochastic delay equations. Probab. Theory Related Fields, 149(1-2):223–259.

Kipnis, C. and Varadhan, S. R. S. (1986). Central limit theorem for additive functionals of reversible Markov processes and applications to simple exclusions. Communications in Mathematical Physics, 104(1):1–19.

Komorowski, T. and Walczuk, A. (2011). Central limit theorem for Markov processes with spectral gap in Wasserstein metric. Arxiv preprint arXiv:1102.18422.

Lawler, G. F. and Sokal, A. D. (1988). Bounds on the L2 spectrum for Markov chains and Markov processes: a generalization of Cheeger’s inequality. Transactions of the American Mathematical Society, 309(2).

Rudolf, D. (2011). Explicit error bounds for Markov chain Monte Carlo. PhD thesis, to appear in Dissertationes Mathematicae, Friedrich-Schiller-Universität Jena.

Sinclair, A. and Jerrum, M. (1989). Approximate counting, uniform generation and rapidly mixing Markov chains. Information and Computation, 82(1):93–133.

SLIDE 23

Tierney, L. (1998). A note on Metropolis-Hastings kernels for general state spaces. Ann. Appl. Probab., 8(1):1–9.

Wang, F.-Y. (2003). Functional inequalities for the decay of sub-Markov semi-groups. Potential Anal., 18(1):1–23.

SLIDE 24

Theorem

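Assume that $\exp(-\Phi)$ is integrable and that for all $\kappa$

$$\phi(r) = \sup_{x \ne y \in B_r(0)} \frac{|\Phi(x) - \Phi(y)|}{\|x - y\|} \le M_\kappa e^{\kappa r}.$$

Moreover, there exist $a \in (\frac{1}{2}, 1)$ and $R > 0$ s.t. for all $x \in B_R(0)^c$

$$\inf_{z \in B_{r\|x\|^a}((1 - 2\delta)^{1/2} x)} -\Phi(z) + \Phi(x) > \alpha_l.$$

Then for $\epsilon$ small enough the pCN algorithm for $\mu$ ($\mu_m$) converges exponentially in

$$\tilde{d}(x, y) = \sqrt{d(x, y)\,(1 + V(x) + V(y))} \quad \text{with} \quad V(x) = \exp(v\|x\|),$$

$$d(x, y) = 1 \wedge \inf_{L,\ \psi \in A(L, x, y)} \frac{1}{\epsilon}\int_0^L \exp(\eta\|\psi\|)\,dt$$

for $A(L, x, y) := \{\psi \in C^1([0, L], H),\ \psi(0) = x,\ \psi(L) = y,\ \|\dot{\psi}\| = 1\}$, and with an $m$-independent bound on the rate. Moreover, $\mu$ ($\mu_m$) is the unique invariant measure.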