Optimal scaling and convergence of Markov chain Monte Carlo methods - PowerPoint PPT Presentation


SLIDE 1

Optimal scaling and convergence of Markov chain Monte Carlo methods

Alain Durmus. Joint work with: Sylvain Le Corff, Éric Moulines, Gareth Roberts, Umut Şimşekli. February 16, 2016

Stochastic seminar, Helsinki University

SLIDE 2

1. Introduction
2. Optimal scaling of the symmetric RWM algorithm
3. Explicit bounds for the ULA algorithm

SLIDE 3

Introduction

Sampling distributions over high-dimensional state spaces has recently attracted a lot of research effort in the computational statistics and machine learning communities. Applications (non-exhaustive):

  • Bayesian inference for high-dimensional models and Bayesian nonparametrics;
  • Bayesian linear inverse problems (typically function-space problems);
  • aggregation of estimators and of experts.

SLIDE 4

Bayesian setting

A Bayesian model is specified by:

1. a prior distribution p on the parameter space, θ ∈ R^d;
2. the sampling distribution of the observed data conditional on the parameters, often termed the likelihood: Y ∼ L(·|θ).

Inference is based on the posterior distribution:

π(dθ) = p(dθ) L(Y|θ) / ∫ L(Y|u) p(du) .

In most cases the normalizing constant is not tractable:

π(dθ) ∝ p(dθ) L(Y|θ) .

SLIDE 5

Logistic and probit regression

Likelihood: binary regression set-up, in which the binary observations (responses) (Y_1, ..., Y_n) are conditionally independent Bernoulli random variables with success probability F(θ^T X_i), where

1. X_i is a d-dimensional vector of known covariates,
2. θ is a d-dimensional vector of unknown regression coefficients,
3. F is a distribution function.

Two important special cases:

1. probit regression: F is the standard normal distribution function;
2. logistic regression: F is the standard logistic distribution function, F(t) = e^t/(1 + e^t).

SLIDE 6

Logistic and probit regression (II)

The posterior density of θ is given, up to a proportionality constant, by π(θ|(Y, X)) ∝ exp(−U(θ)), where the potential U(θ) is given by

U(θ) = −∑_{i=1}^{n} {Y_i log F(θ^T X_i) + (1 − Y_i) log(1 − F(θ^T X_i))} + g(θ) ,

where g is minus the log density of the prior distribution (up to an additive constant). Two important cases:

  • Gaussian prior: g(θ) = (1/2)θ^T Σ θ, ridge regression.
  • Laplace prior: g(θ) = λ ∑_{i=1}^{d} |θ_i|, lasso regression.

A sketch of this potential and its gradient for the logistic case is given below.
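As a concrete illustration (not from the talk), here is a minimal NumPy sketch of U and ∇U for logistic regression with an isotropic Gaussian prior g(θ) = (λ/2)‖θ‖²; the function names and the parameter lam are illustrative choices:

```python
import numpy as np

def potential_logistic(theta, X, Y, lam=1.0):
    """Potential U(theta) for logistic regression with an isotropic
    Gaussian (ridge) prior g(theta) = (lam / 2) * ||theta||^2.

    X: (n, d) covariates, Y: (n,) binary responses in {0, 1}.
    Uses log(1 + exp(z)) = logaddexp(0, z) for numerical stability.
    """
    z = X @ theta
    # -sum_i { Y_i log F(z_i) + (1 - Y_i) log(1 - F(z_i)) }, F logistic
    nll = np.sum(np.logaddexp(0.0, z) - Y * z)
    return nll + 0.5 * lam * theta @ theta

def grad_potential_logistic(theta, X, Y, lam=1.0):
    """Gradient of U; reusable by the ULA and MALA sketches further below."""
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))   # F(theta^T X_i)
    return X.T @ (p - Y) + lam * theta
```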

SLIDE 7

Bayesian setting (II)

Bayesian decision theory relies on computing expectations:

π(f) = ∫_{R^d} f(θ) π(dθ) .

Generic problem: estimation of the integral π(f), where
  • π is known up to a multiplicative factor;
  • sampling directly from π is not an option.

A solution is to approximate E_π[f] by

n^{−1} ∑_{i=1}^{n} f(X_i) ,

where (X_i)_{i≥0} is a Markov chain associated with a Markov kernel P for which π is invariant.

SLIDE 8

Markov chain theory

Invariant probability measure: π is said to be an invariant probability measure for the Markov kernel P if X_0 ∼ π implies X_1 ∼ π.

Ergodic theorem (Meyn and Tweedie, 2003): if π is invariant then, under some conditions on P, for any f ∈ L¹(π),

(1/n) ∑_{i=1}^{n} f(X_i) → ∫ f(x) π(x) dx , π-a.s.

SLIDE 9

MCMC: rationale

To approximate π(f): find P with invariant measure π from which we can efficiently sample. MCMC methods are algorithms which aim to build such a kernel. One of the most famous examples: the Metropolis-Hastings algorithm.

SLIDE 10

The Metropolis-Hastings algorithm

Initial data: the target density π, a transition density q, X_0 ∼ µ_0. For k ≥ 0, given X_k:

1. Generate Y_{k+1} ∼ q(X_k, ·).
2. Set X_{k+1} = Y_{k+1} with probability α(X_k, Y_{k+1}), and X_{k+1} = X_k with probability 1 − α(X_k, Y_{k+1}), where

α(x, y) = 1 ∧ [π(y) q(y, x)] / [π(x) q(x, y)] .

π is invariant for the corresponding Markov kernel P.
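In code, the algorithm is a few lines per iteration. A generic sketch (not from the talk; log_pi, propose and log_q are illustrative names for user-supplied functions):

```python
import numpy as np

def metropolis_hastings(log_pi, propose, log_q, x0, n_iter, rng=None):
    """Generic Metropolis-Hastings sampler.

    log_pi : unnormalized log target density.
    propose: (x, rng) -> candidate y ~ q(x, .).
    log_q  : (x, y) -> log q(x, y), the proposal log density.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    chain = np.empty((n_iter, x.size))
    for k in range(n_iter):
        y = propose(x, rng)
        # log of alpha(x, y) = 1 ^ [pi(y) q(y, x)] / [pi(x) q(x, y)]
        log_alpha = log_pi(y) - log_pi(x) + log_q(y, x) - log_q(x, y)
        if np.log(rng.uniform()) < min(0.0, log_alpha):
            x = y
        chain[k] = x
    return chain
```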

SLIDE 11

Example: The symmetric Random Walk Metropolis algorithm

The Random Walk Metropolis proposal:
  • Y_{k+1} = X_k + σZ_{k+1}, with (Z_k)_{k≥1} an i.i.d. sequence of law N_d(0, Id_d);
  • q(x, y) = σ^{−d} φ_d((y − x)/σ), where φ_d is the standard Gaussian density on R^d;
  • since q is symmetric, α(x, y) = 1 ∧ π(y)/π(x).

A standalone code sketch follows.
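Since the proposal density cancels in the ratio, the sampler simplifies; a standalone sketch (not from the talk) that also tracks the empirical acceptance rate:

```python
import numpy as np

def rwm(log_pi, x0, sigma, n_iter, rng=None):
    """Symmetric Random Walk Metropolis: accept y = x + sigma * Z with
    probability 1 ^ pi(y) / pi(x)."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    chain = np.empty((n_iter, x.size))
    n_accept = 0
    for k in range(n_iter):
        y = x + sigma * rng.standard_normal(x.size)
        if np.log(rng.uniform()) < min(0.0, log_pi(y) - log_pi(x)):
            x = y
            n_accept += 1
        chain[k] = x
    return chain, n_accept / n_iter
```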

SLIDE 12

Study of MCMC methods: measures of efficiency

1. How do we measure the efficiency of MCMC methods?
2. Equivalent problem: quantifying the convergence of the Markov kernel P to its stationary distribution π.
3. We consider two criteria:
   • the asymptotic variance, which justifies optimal scaling results;
   • convergence in some metric on the set of probability measures.

SLIDE 13

1. Introduction
2. Optimal scaling of the symmetric RWM algorithm
3. Explicit bounds for the ULA algorithm

SLIDE 14

Behaviour of the RWM

Recall the RWM proposal: Y_{k+1} = X_k + σZ_{k+1}. On the one hand, σ should be as large as possible so that the chain explores the state space. On the other hand, σ should not be too large, otherwise α → 0.

SLIDE 15

Scaling problems

Question: how should σ depend on the dimension d?

We study the following very simple model. Consider π a one-dimensional positive density on R of the form π ∝ e^{−u}. Define the positive density on R^d given for all x ∈ R^d by

π_d(x) = ∏_{i=1}^{d} π(x_i) ∝ ∏_{i=1}^{d} e^{−u(x_i)} ,

where x_i is the i-th component of x.

SLIDE 16

Study of the acceptance ratio (I)

Recall π_d(x) = ∏_{i=1}^{d} π(x_i) ∝ ∏_{i=1}^{d} e^{−u(x_i)}.

Then the acceptance ratio can be written, for all x, y ∈ R^d, as

α(x, y) = 1 ∧ π_d(y)/π_d(x) = 1 ∧ exp( ∑_{i=1}^{d} u(x_i) − u(y_i) ) .

SLIDE 17

Study of the acceptance ratio (II)

Recall α(x, y) = 1 ∧ exp( ∑_{i=1}^{d} u(x_i) − u(y_i) ).

We want the acceptance ratio to stay inside (0, 1), bounded away from both endpoints, as the algorithm runs.

Let X^d_0 ∼ π_d and consider the proposal based on X^d_0: Y^d_1 = X^d_0 + σZ^d_1.

We consider the mean acceptance ratio, i.e. the quantity:

E[α(X^d_0, Y^d_1)] = E[α(X^d_0, X^d_0 + σZ^d_1)] = E[ 1 ∧ exp( ∑_{i=1}^{d} u(X^d_{0,i}) − u(X^d_{0,i} + σZ^d_{1,i}) ) ] .

SLIDE 18

Study of the acceptance ratio (III)

E[α(X^d_0, Y^d_1)] = E[ 1 ∧ exp( ∑_{i=1}^{d} u(X^d_{0,i}) − u(X^d_{0,i} + σZ^d_{1,i}) ) ] .

If u is C³, then a third-order Taylor expansion gives:

u(X^d_{0,i}) − u(X^d_{0,i} + σZ^d_{1,i}) = −σZ^d_{1,i} u′(X^d_{0,i}) − (σZ^d_{1,i})² u″(X^d_{0,i})/2 + O(σ³) .   (1)

Set now σ = ℓd^{−ξ}. By (1), if ξ < 1/2, then

lim inf_{d→+∞} ∑_{i=1}^{d} u(X^d_{0,i}) − u(X^d_{0,i} + ℓd^{−ξ}Z^d_{1,i}) = −∞

and therefore lim_{d→+∞} E[α(X^d_0, Y^d_1)] = 0.

SLIDE 19

Study of the acceptance ratio (IV)

E[α(X^d_0, Y^d_1)] = E[ 1 ∧ exp( ∑_{i=1}^{d} u(X^d_{0,i}) − u(X^d_{0,i} + σZ^d_{1,i}) ) ] .

If u is C³, then a third-order Taylor expansion gives:

u(X^d_{0,i}) − u(X^d_{0,i} + σZ^d_{1,i}) = −σZ^d_{1,i} u′(X^d_{0,i}) − (σZ^d_{1,i})² u″(X^d_{0,i})/2 + O(σ³) .   (2)

Set now σ = ℓd^{−ξ}. By (2), if ξ > 1/2, then

∑_{i=1}^{d} u(X^d_{0,i}) − u(X^d_{0,i} + ℓd^{−ξ}Z^d_{1,i}) → 0 in probability

and therefore lim_{d→+∞} E[α(X^d_0, Y^d_1)] = 1.

SLIDE 20

Study of the acceptance ratio (V)

E[α(X^d_0, Y^d_1)] = E[ 1 ∧ exp( ∑_{i=1}^{d} u(X^d_{0,i}) − u(X^d_{0,i} + σZ^d_{1,i}) ) ] .

If u is C³, then a third-order Taylor expansion gives:

u(X^d_{0,i}) − u(X^d_{0,i} + σZ^d_{1,i}) = −σZ^d_{1,i} u′(X^d_{0,i}) − (σZ^d_{1,i})² u″(X^d_{0,i})/2 + O(σ³) .   (3)

Set now σ = ℓd^{−1/2}. Then

∑_{i=1}^{d} u(X^d_{0,i}) − u(X^d_{0,i} + ℓd^{−1/2}Z^d_{1,i}) ⇒ G , as d → +∞ ,

where G ∼ N(−ℓ²I/2, ℓ²I), I = E[(u′(X^d_{0,1}))²], and therefore [Roberts, Gelman, Gilks, 1996]

lim_{d→+∞} E[α(X^d_0, Y^d_1)] = E[1 ∧ e^G] = 2Φ(−ℓ√I / 2) .
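This limit is easy to check numerically. A Monte Carlo sketch (not from the talk) for the product standard Gaussian target, for which u(t) = t²/2 and I = 1, so the mean acceptance ratio should approach 2Φ(−ℓ/2) as d grows:

```python
import numpy as np

def mean_acceptance_gaussian(d, ell, n_mc=100_000, rng=None):
    """Monte Carlo estimate of E[alpha(X0, X0 + sigma * Z)] for the product
    target pi_d = N(0, Id) (u(t) = t^2 / 2) with sigma = ell * d**(-1/2)."""
    rng = np.random.default_rng() if rng is None else rng
    sigma = ell * d ** -0.5
    x = rng.standard_normal((n_mc, d))            # X0 ~ pi_d
    z = rng.standard_normal((n_mc, d))
    # sum_i u(X_i) - u(X_i + sigma * Z_i) with u(t) = t^2 / 2
    log_ratio = 0.5 * np.sum(x**2 - (x + sigma * z) ** 2, axis=1)
    # 1 ^ e^G = exp(min(G, 0)), computed in log space for stability
    return np.mean(np.exp(np.minimum(log_ratio, 0.0)))
```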

SLIDE 21

From the MH algorithm to the LAN property

Problem: what happens if π is not continuously differentiable?

Recall that we want to control:

α(X^d_0, X^d_0 + σZ^d_1) = 1 ∧ π_d(X^d_0 + σZ^d_1)/π_d(X^d_0) = 1 ∧ ∏_{i=1}^{d} π(X^d_{0,i} + σZ^d_{1,i})/π(X^d_{0,i}) .

We recognize the likelihood ratio for a translation model. The issue of non-differentiability of π has also been raised for the LAN (local asymptotic normality) property of the likelihood ratio.

SLIDE 22

The LAN property (simplified)

Consider the translation model θ ↦ π(· + θ) on R, where π ∝ e^{−u} is still a positive one-dimensional density. Define the likelihood ratio (at 0):

r((x_i)_{1≤i≤N}, θ) = ∏_{i=1}^{N} π(x_i + θ)/π(x_i) .

The model is said to satisfy the LAN property if for all ℓ ∈ R, with θ = ℓ/√N,

log r((x_i)_{1≤i≤N}, θ) = S_N , where, as N → +∞,

S_N ⇒ N(−ς²ℓ²/2, (ςℓ)²) , ς² = ∫_R (u′(x))² π(x) dx .

SLIDE 23

DQM condition

If π is C³, then the LAN property is straightforward via a third-order Taylor expansion. Otherwise, Le Cam suggested considering the following condition: there exists φ ∈ L² such that

∫_R ( π^{1/2}(x + θ) − π^{1/2}(x) − φ(x)θ )² dx = o(θ²) as θ → 0 .

The model is then said to be differentiable in quadratic mean (DQM). If π is C² and positive, φ(x) = (1/2)(log π)′(x) π^{1/2}(x).

SLIDE 24

Assumptions

We assume that there exists a measurable function u̇ : R → R such that:

1. Differentiability in Lp-mean: there exist p > 4, C > 0 and β > 1 such that for all x ∈ R,

∫_R |u(y + x) − u(y) − x u̇(y)|^p π(y) dy ≤ C |x|^{pβ} .

2. This condition implies that θ ↦ π(· + θ) is DQM [D., Le Corff, Moulines, Roberts, 2016].

3. Moment condition: the function u̇ satisfies

∫_R |u̇(y)|^6 π(y) dy < +∞ .

SLIDE 25

Limiting acceptance ratio for non-smooth densities

Assume these conditions. Then we recover

∑_{i=1}^{d} u(X^d_{0,i}) − u(X^d_{0,i} + ℓd^{−1/2}Z^d_{1,i}) ⇒ G ,

where G ∼ N(−ℓ²I/2, ℓ²I), I = E[u̇(X^d_{0,1})²], and therefore [D., Le Corff, Moulines, Roberts, 2016]

lim_{d→+∞} E[α(X^d_0, Y^d_1)] = E[1 ∧ e^G] = 2Φ(−ℓ√I / 2) .

SLIDE 26

Scaling problems

Questions:

  • How should σ depend on the dimension d? Done: σ = ℓd^{−1/2}.
  • What does this tell us about the efficiency of the algorithm?
  • Can we optimize ℓ in a sensible way?
  • Can we characterize the optimal choice of ℓ by some intrinsic criterion independent of π?

For Metropolis-Hastings type algorithms, there are diffusion limits which answer these questions.

SLIDE 27

Efficiency of MCMC algorithms: asymptotic variance

Let (X_k)_{k≥0} be a Markov chain with invariant measure π. Under some conditions we have an LLN and a CLT: for suitable f,

(1/n) ∑_{i=1}^{n} f(X_i) → ∫ f(x)π(x)dx , a.s., as n → +∞ ,

√n ( (1/n) ∑_{i=1}^{n} f(X_i) − ∫ f(x)π(x)dx ) ⇒ N(0, σ²(f, P)) , as n → +∞ ,

where

σ²(f, P) = lim_{n→+∞} n Var_π( (1/n) ∑_{i=1}^{n} f(X_i) ) = Var_π{f(X_0)} + 2 ∑_{i≥1} Cov_π{f(X_0), f(X_i)} .
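In practice σ²(f, P) has to be estimated from a single trajectory; a standard batch-means sketch (not from the talk):

```python
import numpy as np

def asymptotic_variance_batch_means(fx, n_batches=50):
    """Batch-means estimate of sigma^2(f, P) from f(X_1), ..., f(X_n).

    fx: 1-D array of the f-values along a single chain trajectory.
    """
    fx = np.asarray(fx, dtype=float)
    n = len(fx) - len(fx) % n_batches       # drop the remainder
    b = n // n_batches                      # batch length
    batch_means = fx[:n].reshape(n_batches, b).mean(axis=1)
    # b * (sample variance of the batch means) estimates sigma^2(f, P)
    return b * batch_means.var(ddof=1)
```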

SLIDE 28

Expected Square Jump Distance

A common efficiency criterion: the ESJD, defined for a Markov chain in one dimension by

ESJD = E_π[(X_1 − X_0)²] .

Property: if f is a linear function,

maximizing the ESJD ⇔ minimizing Cov_π{f(X_1), f(X_0)} .

SLIDE 29

Efficiency of MH algorithms

  • Given f, the CLT allows us to compare two Markov kernels P1, P2:

σ²(f, P1) ≤ σ²(f, P2) ⇒ P1 is more efficient than P2 .

  • However, it can be hard to ensure, for all f, σ²(f, P1) ≤ σ²(f, P2).

SLIDE 30

Langevin diffusion

Let π be a probability measure on R^d of the form π ∝ e^{−U}, with potential U ∈ C¹(R^d). Consider the overdamped Langevin equation:

dY_t = −∇U(Y_t) dt + √2 dB_t ,

where (B_t)_{t≥0} is a d-dimensional Brownian motion. Under some conditions on U, (Y_t)_{t≥0} is ergodic with respect to π, and we again have an LLN and a CLT:

(1/t) ∫_0^t f(Y_s) ds → ∫ f(x)π(x)dx , a.s., as t → +∞ ,

√t ( (1/t) ∫_0^t f(Y_s) ds − ∫ f(x)π(x)dx ) ⇒ N(0, σ²(f, Y)) , as t → +∞ ,

where σ²(f, Y) = lim_{t→+∞} t Var_π( (1/t) ∫_0^t f(Y_s) ds ) .

SLIDE 31

Scaled Langevin equation

Consider the following scaled Langevin equation:

dY^c_t = −c∇U(Y^c_t) dt + √(2c) dB_t , for c > 0 .   (4)

The solutions of the scaled Langevin equation are sped-up versions of the unit-speed solution: t ↦ Y¹_{ct} solves (4). Indeed, substituting s = cu,

Y¹_{ct} = Y¹_0 − ∫_0^{ct} ∇U(Y¹_s) ds + √2 B_{ct} = Y¹_0 − ∫_0^t c ∇U(Y¹_{cu}) du + √(2c) B̃_t ,

with the Brownian motion B̃_t = c^{−1/2} B_{ct}.

SLIDE 32

Efficiency of Langevin solutions

Which c leads to the best convergence, i.e. minimizes σ²(f, (Y^c_t)_{t≥0})?

1. To reach equilibrium, it is sensible to speed up the diffusion: take c large.
2. Speeding up the diffusion is also justified by the variance in the CLT. Substituting u = cs,

σ²(f, (Y^c_t)_{t≥0}) = lim_{t→+∞} t Var_π( (1/t) ∫_0^t f(Y¹_{cs}) ds ) = c^{−1} lim_{t→+∞} ct Var_π( (1/(ct)) ∫_0^{ct} f(Y¹_u) du ) ,

so that

σ²(f, (Y^c_t)_{t≥0}) = c^{−1} σ²(f, (Y¹_t)_{t≥0}) .

Conclusion: the faster, the better, and this result holds for all f (under appropriate smoothness and moment conditions).

SLIDE 33

Action plan

  • Under some (strong) conditions, the MH iterates converge to a diffusion process.
  • Then tune the variance σ to optimize the speed of the limiting diffusion.

SLIDE 34

Scaling of the RWM (Roberts, Gelman and Gilks, 1997)

Assumption [controversial!]: π_d(x) = ∏_{i=1}^{d} π(x_i) ∝ ∏_{i=1}^{d} e^{−u(x_i)}.

Let {X^d_k, k ≥ 0} be the Markov chain produced by the RWM on R^d with target density π_d, X^d_0 ∼ π_d and

σ_d = ℓd^{−1/2} , ℓ > 0 .

Result: the first component, suitably rescaled in time, converges weakly,

{(X^d_{⌊td⌋,1})_{t≥0}, d ≥ 1} ⇒ (Y_t)_{t≥0} as d → +∞ ,

where (Y_t)_{t≥0} is a solution of the scaled Langevin equation:

dY_t = −h(ℓ) u′(Y_t) dt + (2h(ℓ))^{1/2} dB_t ,

for a function h(ℓ) known in closed form, which can be optimized.

SLIDE 35

Consequences on the tuning of the two algorithms

If the semigroup of the Langevin equation explores the invariant distribution in O(1) at stationarity, the RWM explores it in O(d) iterations.

To get the best mixing algorithm, tune the parameter ℓ so as to maximize the speed h(ℓ) (the RWM then approximates the fastest Langevin solution). The optimal ℓ is characterized by a mean acceptance rate of order ≈ 0.234.

Conclusion: while running your algorithm, tune ℓ so that the acceptance ratio has empirical mean ≈ 0.234. A sketch of such an adaptive tuning rule follows.
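One simple way to implement this rule is a Robbins-Monro adaptation of log σ during a warm-up phase; a sketch (not from the talk), using the classical 2.38/√d initial guess as an assumed starting point:

```python
import numpy as np

def tune_rwm_scale(log_pi, x0, n_warmup=50_000, target=0.234, rng=None):
    """Adapt the RWM scale sigma = exp(s) so that the mean acceptance
    rate approaches `target`, via a Robbins-Monro update."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    s = np.log(2.38 / np.sqrt(x.size))      # classical starting guess
    for k in range(1, n_warmup + 1):
        y = x + np.exp(s) * rng.standard_normal(x.size)
        log_alpha = min(0.0, log_pi(y) - log_pi(x))
        if np.log(rng.uniform()) < log_alpha:
            x = y
        # push the empirical acceptance rate towards the target
        s += k ** -0.6 * (np.exp(log_alpha) - target)
    return np.exp(s), x
```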

SLIDE 36

Extension of the result to non-smooth densities

Recall that it is assumed that π_d(x) = ∏_{i=1}^{d} π(x_i) ∝ ∏_{i=1}^{d} e^{−u(x_i)}.

The original result of Roberts et al. in addition assumes that u ∈ C³(R) (very smooth).

Our contribution (with S. Le Corff, É. Moulines and G. Roberts): extension to non-smooth u, where π is possibly non-differentiable at some points, or supported on an open interval of R.

The proof follows the ideas in (Jourdain, Lelièvre, Miasojedow, 2015).

SLIDE 37

Assumptions (II)

We assume that there exists a measurable function u̇ : R → R such that:

1. Differentiability in Lp-mean: there exist p > 4, C > 0 and β > 1 such that for all x ∈ R,

∫_R |u(y + x) − u(y) − x u̇(y)|^p π(y) dy ≤ C |x|^{pβ} .

2. Moment condition: the function u̇ satisfies

∫_R |u̇(y)|^6 π(y) dy < +∞ .

3. Smoothness condition: u̇ is almost everywhere continuous.

SLIDE 38

Optimal scaling results

1. For all d ≥ 1, let {X^{d,R}_k, k ≥ 0} be the Markov chain produced by the RWM on R^d with target density π_d, X^{d,R}_0 ∼ π_d and σ_d = ℓd^{−1/2}, ℓ > 0. Then, if the previous assumptions hold,

{(X^{d,R}_{⌊td⌋,1})_{t≥0}, d ≥ 1} ⇒ (Y_t)_{t≥0} as d → +∞ ,

a weak solution of the possibly singular scaled Langevin equation:

dY_t = −h_R(ℓ) u̇(Y_t) dt + (2h_R(ℓ))^{1/2} dB_t ,

for some function h_R(ℓ) which is explicit and can be optimized.

2. Extension to densities supported on an open interval I ⊂ R.

SLIDE 39

Simulation for beta distributions

Figure: Expected squared jump distance for the beta distribution with parameters (10, 10), as a function of the mean acceptance rate, for d = 10, 50, 100.

SLIDE 40

Simulation for the lasso logistic regression

Figure: Autocovariance function for different mean acceptance rates, for the lasso logistic regression. Data set: Musk, of dimension 167.

SLIDE 41

Work in progress: optimal scaling results for MALA applied to convex non-smooth densities, using proximal operators to improve the dependency on the dimension.

SLIDE 42

1. Introduction
2. Optimal scaling of the symmetric RWM algorithm
3. Explicit bounds for the ULA algorithm
   • The Unadjusted Langevin Algorithm
   • Explicit bounds for logconcave densities
   • Numerical Comparison of ULA and MALA


SLIDE 44

The Unadjusted Langevin Algorithm (ULA)

Langevin SDE:

dY_t = −∇U(Y_t) dt + √2 dB_t ,

where (B_t)_{t≥0} is a d-dimensional Brownian motion.

Idea: sample the diffusion paths, using for example the Euler-Maruyama (EM) scheme:

1. initial state X_0 ∼ µ_0;
2. for k ≥ 0, given X_k,

X_{k+1} = X_k − γ∇U(X_k) + √(2γ) Z_{k+1} ,

where
  • (Z_k)_{k≥1} is i.i.d. N(0, Id),
  • γ > 0 is a step size.
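The update above is one line of code per iteration; a sketch (not from the talk), where grad_U is the user-supplied gradient of the potential:

```python
import numpy as np

def ula(grad_U, x0, gamma, n_iter, rng=None):
    """Unadjusted Langevin Algorithm: Euler-Maruyama discretization of the
    Langevin SDE, with no Metropolis correction (hence a bias in gamma)."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    chain = np.empty((n_iter, x.size))
    for k in range(n_iter):
        noise = rng.standard_normal(x.size)
        x = x - gamma * grad_U(x) + np.sqrt(2.0 * gamma) * noise
        chain[k] = x
    return chain
```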

SLIDE 45

Discretized Langevin diffusion: constant step size

(X_k)_{k≥0} is a homogeneous Markov chain with Markov kernel R_γ. Under appropriate conditions, this Markov chain is irreducible and positive recurrent ❀ unique invariant distribution π_γ. Problem: π_γ ≠ π.

SLIDE 46

Convergence of Markov chains

Another measure of the efficiency of an MCMC method with Markov kernel P targeting π:

‖P^k(x, ·) − π‖_TV ≤ C(x) v(k) ,

where

1. the total variation distance is defined, for two probability measures µ, ν on R^d, by ‖µ − ν‖_TV = sup_{|f|≤1} |µ(f) − ν(f)|;
2. C(x) ≥ 0 captures the dependence on the initial condition;
3. ideally lim_{k→+∞} v(k) = 0 (or close to 0), with the best possible rate.

SLIDE 47

Weak error result for the ULA algorithm

Recall the ULA algorithm: X_{k+1} = X_k − γ∇U(X_k) + √(2γ) Z_{k+1}.

We have seen that the ULA algorithm is biased: the Markov chain (X_k)_{k≥0} has an invariant distribution π_γ ≠ π. However, (Talay and Tubaro, 1991) show that for U, f ∈ C^∞(R^d), under additional assumptions, there exists a constant C depending on f and π such that

∫_{R^d} f(x) dπ(x) − ∫_{R^d} f(x) dπ_γ(x) = Cγ + O(γ²) .

SLIDE 48

Discussion

The previous result is not quantitative: no explicit bounds. With É. Moulines, we aimed at giving computable bounds in total variation or Wasserstein distance, in particular to track the dependence on the dimension. We assume that U is continuously differentiable, convex and gradient Lipschitz. This completes and improves the results of (Dalalyan, 2014).

SLIDE 49

Notation and framework

Recall that

X_{k+1} = X_k − γ∇U(X_k) + √(2γ) Z_{k+1} , γ > 0 ,

and that R_γ denotes the Markov kernel associated with (X_k)_{k≥0}. We answer the following questions:

1. For a target precision ε > 0, can we find explicit γ > 0 and N ≥ 0 such that ‖δ_x R^n_γ − π‖_TV ≤ ε for all n ≥ N?

2. For all n ≥ 0, can we find explicit γ > 0 such that ‖δ_x R^n_γ − π‖_TV ≤ v(n) with lim_{n→∞} v(n) = 0?

SLIDE 50

1. Introduction
2. Optimal scaling of the symmetric RWM algorithm
3. Explicit bounds for the ULA algorithm
   • The Unadjusted Langevin Algorithm
   • Explicit bounds for logconcave densities
   • Numerical Comparison of ULA and MALA

SLIDE 51

Main result (I)

Assume that U is gradient Lipschitz and convex. Assume in addition that there exist η > 0 and R ≥ 0, independent of d, such that for all x, y ∈ R^d with ‖x − y‖ ≥ R,

⟨∇U(x) − ∇U(y), x − y⟩ ≥ η ‖x − y‖² .

Then, for all ε > 0, ‖δ_x R^n_γ − π‖_TV ≤ ε for γ well chosen and

n ≥ C d⁵ , for C ≥ 0 explicit and independent of d .

For all n ≥ 0, there exist C ≥ 0 explicit and independent of the dimension, and γ > 0, such that

‖δ_x R^n_γ − π‖_TV ≤ C log(n) d^{5/2} / n^{1/2} .

SLIDE 52

Discussion

This kind of result has been the subject of numerous papers concerning the RWM applied to logconcave targets: (A. Frieze, R. Kannan, and N. Polson, 1994), (A. Frieze and R. Kannan, 1999)...

The best results have been obtained in (L. Lovász and S. Vempala, 2007). They show that a sufficient number of iterations n for the RWM to achieve a target precision ε is of order n ≥ C d⁴. Moreover, their result does not assume that U is continuously differentiable.

SLIDE 53

Discussion (II)

But they assume that the target is well rounded: there exists C independent of the dimension such that

∫_{R^d} ‖x − ∫_{R^d} y dπ(y)‖² dπ(x) ≤ C d .

Our result does not require such an assumption. In fact, under this kind of assumption, we can show that for all ε > 0, ‖δ_x R^n_γ − π‖_TV ≤ ε for γ well chosen and

n ≥ C d³ , for C ≥ 0 explicit and independent of d .

SLIDE 54

Main result (III)

Assume that U ∈ C³(R^d) is strongly convex and gradient Lipschitz, and that there exists L̃ such that for all x, y ∈ R^d:

‖∇²U(x) − ∇²U(y)‖ ≤ L̃ ‖x − y‖ .

Then, for all ε > 0, ‖δ_x R^n_γ − π‖_TV ≤ ε for γ well chosen and

n ≥ C d^{1/2} log²(d) , for C ≥ 0 explicit and independent of d .

Almost sharp bounds for the Gaussian case!

For all n ≥ 0, there exist C ≥ 0 explicit and independent of the dimension, and γ > 0, such that

‖δ_x R^n_γ − π‖_TV ≤ C log(n) d^{1/2} log²(d) / n .

SLIDE 55

1. Introduction
2. Optimal scaling of the symmetric RWM algorithm
3. Explicit bounds for the ULA algorithm
   • The Unadjusted Langevin Algorithm
   • Explicit bounds for logconcave densities
   • Numerical Comparison of ULA and MALA

SLIDE 56

Metropolis-Adjusted Langevin Algorithm

To correct the invariant distribution, a Metropolis-Hastings step can be included ❀ Metropolis-Adjusted Langevin Algorithm (MALA) (Rossky, Doll, and Friedman 1978).

Algorithm:

1. Propose Y_{k+1} = X_k − γ∇U(X_k) + √(2γ) Z_{k+1}, Z_{k+1} ∼ N(0, Id).
2. Compute the acceptance ratio α_γ(X_k, Y_{k+1}),

α_γ(x, y) = 1 ∧ [π(y) q_γ(y, x)] / [π(x) q_γ(x, y)] , q_γ(x, y) ∝ exp(−‖y − x + γ∇U(x)‖² / (4γ)) .

3. Accept/reject the proposal.

Works which have studied this algorithm: (Roberts and Tweedie 1996), (Bou-Rabee and Hairer 2010), (Eberle 2014)... It is very difficult to analyze because of the behaviour of the acceptance ratio, which leads to very conservative bounds.
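A sketch of MALA (not from the talk), combining the ULA proposal with the Metropolis-Hastings correction so that π is exactly invariant; here log_pi equals −U up to an additive constant:

```python
import numpy as np

def mala(log_pi, grad_U, x0, gamma, n_iter, rng=None):
    """Metropolis-Adjusted Langevin Algorithm."""
    rng = np.random.default_rng() if rng is None else rng

    def log_q(x, y):
        # log density of y ~ N(x - gamma * grad_U(x), 2 * gamma * Id), up to a constant
        diff = y - x + gamma * grad_U(x)
        return -np.dot(diff, diff) / (4.0 * gamma)

    x = np.atleast_1d(np.asarray(x0, dtype=float))
    chain = np.empty((n_iter, x.size))
    for k in range(n_iter):
        y = x - gamma * grad_U(x) + np.sqrt(2.0 * gamma) * rng.standard_normal(x.size)
        log_alpha = log_pi(y) - log_pi(x) + log_q(y, x) - log_q(x, y)
        if np.log(rng.uniform()) < min(0.0, log_alpha):
            x = y
        chain[k] = x
    return chain
```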

SLIDE 57

Comparison of MALA and ULA (I)

We compare MALA and ULA for the logistic regression with Gaussian prior on five real data sets.

Data set               Observations   Covariates d
German credit          1000           25
Heart disease          270            14
Australian credit      690            35
Pima Indian diabetes   768            9
Musk                   476            167

Table: Dimensions of the data sets.

SLIDE 58

Comparison of MALA and ULA (II)

Define the marginal accuracy between two probability measures µ, ν on (R, B(R)) by

MA(µ, ν) = 1 − (1/2)‖µ − ν‖_TV .

We compare MALA and ULA on each data set by estimating, for each component i ∈ {1, ..., d}, the marginal accuracy between the d marginal empirical distributions and the d marginal posterior distributions.
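The marginal accuracy can be estimated from samples by discretizing both marginals on a common grid; a histogram-based sketch (not from the talk; n_bins is an arbitrary choice):

```python
import numpy as np

def marginal_accuracy(samples_a, samples_b, n_bins=100):
    """Estimate MA(mu, nu) = 1 - 0.5 * ||mu - nu||_TV from two 1-D samples."""
    samples_a = np.asarray(samples_a, dtype=float)
    samples_b = np.asarray(samples_b, dtype=float)
    lo = min(samples_a.min(), samples_b.min())
    hi = max(samples_a.max(), samples_b.max())
    p, edges = np.histogram(samples_a, bins=n_bins, range=(lo, hi))
    q, _ = np.histogram(samples_b, bins=edges)
    p = p / p.sum()
    q = q / q.sum()
    # TV distance between the discretized marginals is 0.5 * sum |p - q|
    return 1.0 - 0.5 * np.abs(p - q).sum()
```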

SLIDE 59

Comparison of MALA and ULA (III)

To estimate the d marginal posterior distributions, we run 2·10⁷ iterations of the Pólya-Gamma Gibbs sampler. Then 100 runs of MALA and ULA (10⁶ iterations per run) were performed. For MALA, the step size is chosen so that the acceptance probability is ≈ 0.5. For ULA, we choose the same constant step size as for MALA.

SLIDE 60

Comparison of MALA and ULA (IV)

Figure: Marginal accuracy across all dimensions (ULA vs MALA). Upper left: German credit data set. Upper right: Australian credit data set. Lower left: Heart disease data set. Lower right: Pima Indian diabetes data set.

SLIDE 61

Comparison of MALA and ULA (V)

Figure: Marginal accuracy across all dimensions for the Musk data set (ULA vs MALA).

SLIDE 62

Comparison of MALA and ULA (VI)

Figure: 2-dimensional histogram for the Musk data set.

SLIDE 63

Matrix factorization

We compare MALA and ULA on a matrix factorization problem for a link prediction application. Consider X an observed matrix, with missing entries, of size I × J. The model is, for observed indices (i, j),

X_{i,j} = ∑_{k=1}^{K} W_{i,k} H_{k,j} + Z_{i,j} ,

for K ≥ 1, with (Z_{i,j}) i.i.d. normal random variables N(0, σ_z).

SLIDE 64

Matrix Factorization (II)

The aim is then to infer the two matrices W and H, of dimensions I × K and K × J respectively, to predict the missing values of X. We take as prior distributions:

W_{i,k} ∼ N(0, σ_w) and H_{k,j} ∼ N(0, σ_h) .

Comparison of MALA and ULA on the MovieLens 1 Million dataset. A sketch of the gradient needed by ULA/MALA for this model follows.
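To run ULA or MALA on this model one only needs the gradient of the potential U(W, H) = −log posterior; a sketch (not from the talk), treating σ_z, σ_w, σ_h as variances, which is an assumption about the slide's notation, with the illustrative function name grad_U_mf:

```python
import numpy as np

def grad_U_mf(W, H, X, mask, sz=1.0, sw=100.0, sh=100.0):
    """Gradients of U(W, H) for the Gaussian matrix factorization model.

    X: (I, J) data matrix; mask: boolean (I, J), True where X is observed.
    U = ||mask * (X - W H)||^2 / (2 sz) + ||W||^2 / (2 sw) + ||H||^2 / (2 sh).
    """
    resid = np.where(mask, W @ H - X, 0.0)   # residuals on observed entries
    grad_W = resid @ H.T / sz + W / sw
    grad_H = W.T @ resid / sz + H / sh
    return grad_W, grad_H
```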

SLIDE 65

Matrix Factorization (III)

Parameters: σz = 1, σw = σh = 100

SLIDE 66

Thank you for your attention. Any questions?
