SLIDE 1

Non-asymptotic convergence bound for the Unadjusted Langevin Algorithm

Alain Durmus, Eric Moulines, Marcelo Pereyra

Telecom ParisTech, Ecole Polytechnique, Bristol University

November 23, 2016

LS3 seminar

SLIDE 2

1. Motivation
2. Framework
3. Sampling from strongly log-concave distribution
4. Computable bounds in total variation for super-exponential densities
5. Deviation inequalities
6. Non-smooth potentials

SLIDE 3

Introduction

Sampling distributions over high-dimensional state spaces has recently attracted a lot of research effort in the computational statistics and machine learning communities. Applications (non-exhaustive):

1. Bayesian inference for high-dimensional models and Bayesian nonparametrics
2. Bayesian linear inverse problems (typically function-space problems)
3. Aggregation of estimators and experts

Most of the sampling techniques known so far do not scale to high dimension; challenges are numerous in this area.

SLIDE 4

Bayesian setting (I)

In a Bayesian setting, a parameter $\beta \in \mathbb{R}^d$ is endowed with a prior distribution $p$ and the observations are given by a likelihood: $Y \sim L(\cdot|\beta)$. The inference is then based on the posterior distribution:

$$\pi(d\beta|Y) = \frac{p(d\beta)\,L(Y|\beta)}{\int L(Y|u)\,p(du)} .$$

In most cases the normalizing constant is not tractable:

$$\pi(d\beta|Y) \propto p(d\beta)\,L(Y|\beta) .$$

SLIDE 5

Bayesian setting (II)

Bayesian decision theory relies on computing expectations:

$$\int_{\mathbb{R}^d} f(\beta)\,\pi(d\beta|Y) \propto \int_{\mathbb{R}^d} f(\beta)\,L(Y|\beta)\,p(d\beta) .$$

Generic problem: estimation of an expectation $\mathbb{E}_\pi[f]$, where

• $\pi$ is known up to a multiplicative factor;
• sampling directly from $\pi$ is not an option;
• $\pi$ is high dimensional.

SLIDE 6

Logistic and probit regression

Likelihood: binary regression set-up in which the binary observations (responses) $(Y_1, \dots, Y_n)$ are conditionally independent Bernoulli random variables with success probability $F(\beta^T X_i)$, where

1. $X_i$ is a $d$-dimensional vector of known covariates,
2. $\beta$ is a $d$-dimensional vector of unknown regression coefficients,
3. $F$ is a distribution function.

Two important special cases:

1. probit regression: $F$ is the standard normal distribution function,
2. logistic regression: $F$ is the standard logistic distribution function, $F(t) = e^t/(1 + e^t)$.

SLIDE 7

A daunting problem ?

The posterior density of $\beta$ is given, up to a proportionality constant, by $\pi(\beta|(Y,X)) \propto \exp(-U(\beta))$, where the potential $U$ is

$$U(\beta) = -\sum_{i=1}^{n} \big\{ Y_i \log F(\beta^T X_i) + (1 - Y_i) \log(1 - F(\beta^T X_i)) \big\} + g(\beta) ,$$

where $g$ is the negative log density of the prior. Two important cases (a code sketch follows):

• Gaussian prior, $g(\beta) = (1/2)\,\beta^T \Sigma \beta$: ridge regression.
• Laplace prior, $g(\beta) = \lambda \sum_{i=1}^{d} |\beta_i|$: lasso regression.
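For concreteness, here is a minimal Python sketch (not from the original slides) of this potential and its gradient for the logistic link with a Gaussian prior; the data X, Y and the prior matrix Sigma are placeholders to be supplied by the user.

import numpy as np

def potential(beta, X, Y, Sigma):
    """U(beta) = -log-likelihood + (1/2) beta^T Sigma beta (Gaussian prior)."""
    t = X @ beta
    # log F(t) = -logaddexp(0, -t) and log(1 - F(t)) = -logaddexp(0, t)
    log_lik = -np.sum(Y * np.logaddexp(0.0, -t) + (1.0 - Y) * np.logaddexp(0.0, t))
    return -log_lik + 0.5 * beta @ (Sigma @ beta)

def grad_potential(beta, X, Y, Sigma):
    """grad U(beta) = X^T (F(X beta) - Y) + Sigma beta."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))   # logistic function F
    return X.T @ (p - Y) + Sigma @ beta

This potential is smooth and, when Sigma is positive definite, strongly convex, so it fits the framework developed in the following slides.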

SLIDE 8

New challenges

Problem: the number of predictor variables $d$ is large ($10^4$ and up). Examples: text categorization, genomics and proteomics (gene expression analysis). The most popular algorithms for Bayesian inference in binary regression models are based on data augmentation:

1. probit link: Albert and Chib (1993);
2. logistic link: Pólya-Gamma sampler, Polson and Scott (2012).

SLIDE 9

Data Augmentation algorithms (I)

Data Augmentation:

Instead of sampling $\pi(\beta|(X,Y))$, sample $\pi(\beta, W|(X,Y))$, a probability measure on $\mathbb{R}^{d_1} \times \mathbb{R}^{d_2}$, and take the marginal with respect to $\beta$. This is a typical application of the Gibbs sampler: sample in turn $\pi(\beta|(X,Y,W))$ and $\pi(W|(X,Y,\beta))$. The Gibbs sampler simulates a Markov chain $(\beta_k, W_k)_{k \ge 0}$ defined by: given $(\beta_k, W_k)$,

1. draw $W_{k+1} \sim \pi(\cdot|(\beta_k, X, Y))$;
2. draw $\beta_{k+1} \sim \pi(\cdot|(W_{k+1}, X, Y))$.

The target density $\pi(\beta, W|(X,Y))$ is invariant for the Markov chain $(\beta_k, W_k)_{k \ge 0}$. The choice of the data augmentation should make these two steps reasonably easy; a generic two-block sweep is sketched below.
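A minimal Python sketch of this two-block Gibbs sweep, assuming user-supplied samplers sample_W_given_beta and sample_beta_given_W for the two conditionals (both hypothetical placeholders):

import numpy as np

def data_augmentation_gibbs(beta0, sample_W_given_beta, sample_beta_given_W,
                            n_iter, rng=np.random.default_rng()):
    """Two-block Gibbs sampler: alternately draw W | beta and beta | W.

    The pair (beta_k, W_k) is a Markov chain leaving the joint posterior
    pi(beta, W | X, Y) invariant; the beta-marginal of the output
    approximates pi(beta | X, Y).
    """
    beta = beta0
    betas = []
    for _ in range(n_iter):
        W = sample_W_given_beta(beta, rng)     # e.g. Polya-Gamma draws
        beta = sample_beta_given_W(W, rng)     # e.g. a Gaussian full conditional
        betas.append(beta)
    return np.asarray(betas)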

SLIDE 10

Data Augmentation algorithms (II)

Question: control the distance between the law of $(\beta_n, W_n)$ and the stationary distribution $\pi(\beta, W|(X,Y))$.

Definition (Geometric ergodicity). We say that a Markov kernel $P$ on $(\mathbb{R}^d, \mathcal{B}(\mathbb{R}^d))$ is geometrically ergodic if there exists $\kappa \in (0,1)$ such that for all $n \ge 0$ and $x \in \mathbb{R}^d$,

$$\|P^n(x, \cdot) - \pi\|_{TV} \le C(x)\,\kappa^n ,$$

where for two probability measures $\mu, \nu$ on $\mathbb{R}^d$ we define $\|\mu - \nu\|_{TV} = \sup_{|f| \le 1} |\mu(f) - \nu(f)|$.

SLIDE 11

Data Augmentation algorithms (III)

The algorithm of Albert and Chib and the Pólya-Gamma sampler have been shown to be uniformly geometrically ergodic, BUT:

• the geometric rate of convergence is exponentially small in the dimension;
• the bounds do not allow one to construct honest confidence intervals or credible regions.

The algorithms are very demanding in terms of computational resources:

• applicable only when $d$ is small ($\approx 10$) to moderate ($\approx 100$), but certainly not when $d$ is large ($10^4$ or more);
• convergence time is prohibitive as soon as $d \ge 10^2$.

SLIDE 12

A daunting problem ?

In the case of ridge regression, the potential $U$ is smooth and strongly convex. In the case of lasso regression, the potential $U$ is non-smooth but still convex. A wealth of reasonably fast optimisation algorithms are available to solve these problems in high dimension.

SLIDE 13

1. Motivation
2. Framework
3. Sampling from strongly log-concave distribution
4. Computable bounds in total variation for super-exponential densities
5. Deviation inequalities
6. Non-smooth potentials

SLIDE 14

Framework

Denote by $\pi$ a target density w.r.t. the Lebesgue measure on $\mathbb{R}^d$, known up to a normalisation factor:

$$x \mapsto e^{-U(x)} \Big/ \int_{\mathbb{R}^d} e^{-U(y)}\,dy ,$$

where implicitly $d \gg 1$.

Assumption: $U$ is $L$-smooth, i.e. twice continuously differentiable and there exists a constant $L$ such that for all $x, y \in \mathbb{R}^d$, $\|\nabla U(x) - \nabla U(y)\| \le L\,\|x - y\|$. This condition can be weakened.

SLIDE 15

Langevin diffusion

Langevin SDE:

$$dY_t = -\nabla U(Y_t)\,dt + \sqrt{2}\,dB_t ,$$

where $(B_t)_{t \ge 0}$ is a $d$-dimensional Brownian motion. The associated semigroup $(P_t)_{t \ge 0}$ is Markov:

• aperiodic, strong Feller (all compact sets are small);
• reversible w.r.t. $\pi$, so $\pi \propto e^{-U}$ is its unique invariant probability measure.

For all $x \in \mathbb{R}^d$, $\lim_{t \to +\infty} \|\delta_x P_t - \pi\|_{TV} = 0$.

SLIDE 16

Discretized Langevin diffusion

Idea: sample the diffusion paths, using for example the Euler-Maruyama (EM) scheme:

$$X_{k+1} = X_k - \gamma_{k+1}\,\nabla U(X_k) + \sqrt{2\gamma_{k+1}}\,Z_{k+1} ,$$

where

• $(Z_k)_{k \ge 1}$ is i.i.d. $\mathcal{N}(0, \mathrm{I}_d)$;
• $(\gamma_k)_{k \ge 1}$ is a sequence of stepsizes, which can either be held constant or be chosen to decrease to 0 at a certain rate.

This recursion is closely related to the gradient algorithm; a minimal implementation is sketched below.
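A minimal Python sketch of this unadjusted Langevin algorithm (ULA), assuming grad_U is supplied by the user (for instance grad_potential from the logistic-regression sketch above):

import numpy as np

def ula(x0, grad_U, gammas, rng=np.random.default_rng()):
    """Euler-Maruyama discretization of the Langevin diffusion.

    x0     : initial point, shape (d,)
    grad_U : callable returning the gradient of the potential U
    gammas : iterable of stepsizes (constant or decreasing)
    Returns the whole trajectory, shape (len(gammas)+1, d).
    """
    x = np.asarray(x0, dtype=float)
    path = [x]
    for gamma in gammas:
        noise = rng.standard_normal(x.shape)
        x = x - gamma * grad_U(x) + np.sqrt(2.0 * gamma) * noise
        path.append(x)
    return np.array(path)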

SLIDE 17

Discretized Langevin diffusion: constant stepsize

When $\gamma_k \equiv \gamma$, $(X_k)_{k \ge 0}$ is a homogeneous Markov chain with Markov kernel $R_\gamma$. Under appropriate conditions, this Markov chain is irreducible and positive recurrent ❀ it has a unique invariant distribution $\pi_\gamma$. Problem: $\pi_\gamma \ne \pi$. A worked example follows.
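For intuition, a worked one-dimensional example (standard Gaussian target; elementary to check, not from the original slides) makes the bias explicit:

\[
U(x) = x^2/2 \;\Longrightarrow\; X_{k+1} = (1-\gamma)X_k + \sqrt{2\gamma}\,Z_{k+1} ,
\]

so in stationarity the chain is Gaussian with variance $v$ solving $v = (1-\gamma)^2 v + 2\gamma$, i.e.

\[
v = \frac{2}{2-\gamma} = \frac{1}{1-\gamma/2} > 1 = \operatorname{Var}_\pi(X) ,
\]

and the bias vanishes only as $\gamma \to 0$.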

SLIDE 18

The Euler-Maruyama (EM) Markov chain

When $(\gamma_k)_{k \ge 1}$ is nonincreasing and non-constant, $(X_k)_{k \ge 1}$ is an inhomogeneous Markov chain associated with the sequence of Markov kernels $(R_{\gamma_k})_{k \ge 1}$. We denote by $\delta_x Q^p_\gamma$ the law of $X_p$ started at $x$. The diffusion converges to the target distribution. Question: since the EM discretization approximates the diffusion, might we use it to sample from $\pi$?

• Control $\|\delta_x Q^p_\gamma - \pi\|_{TV}$?
• Obtain bounds with explicit dependence on the dimension $d$.

Previous works: Lamberton and Pagès (2002); Dalalyan (2014).

SLIDE 19

Metropolis-Adjusted Langevin Algorithm

To correct the bias in the invariant distribution, a Metropolis-Hastings step can be included ❀ Metropolis-Adjusted Langevin Algorithm (MALA).

• Key reference: Roberts and Tweedie (1996).

Algorithm (a code sketch follows):

1. Propose $Y_{k+1} = X_k - \gamma\,\nabla U(X_k) + \sqrt{2\gamma}\,Z_{k+1}$, with $Z_{k+1} \sim \mathcal{N}(0, \mathrm{I}_d)$.
2. Compute the acceptance ratio $\alpha_\gamma(X_k, Y_{k+1})$, where

$$\alpha_\gamma(x, y) = 1 \wedge \frac{\pi(y)\,r_\gamma(y, x)}{\pi(x)\,r_\gamma(x, y)} , \qquad r_\gamma(x, y) \propto e^{-\|y - x + \gamma \nabla U(x)\|^2/(4\gamma)} .$$

3. Accept / reject the proposal.
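A minimal Python sketch of one MALA transition under these definitions; U and grad_U are user-supplied, and the acceptance ratio is computed in log space for numerical stability:

import numpy as np

def mala_step(x, U, grad_U, gamma, rng=np.random.default_rng()):
    """One Metropolis-Adjusted Langevin transition for pi proportional to exp(-U)."""
    y = x - gamma * grad_U(x) + np.sqrt(2.0 * gamma) * rng.standard_normal(x.shape)

    def log_q(a, b):
        # log proposal density r_gamma(a, b), up to an additive constant
        diff = b - a + gamma * grad_U(a)
        return -np.dot(diff, diff) / (4.0 * gamma)

    # log alpha = log pi(y) - log pi(x) + log q(y, x) - log q(x, y)
    log_alpha = U(x) - U(y) + log_q(y, x) - log_q(x, y)
    if np.log(rng.uniform()) < log_alpha:
        return y, True    # accepted
    return x, False       # rejected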

SLIDE 20

MALA: pros and cons

MALA requires one gradient computation and one evaluation of the objective function at each iteration. Geometric convergence is established under the condition that, in the tail, the acceptance region is inwards:

$$\lim_{\|x\| \to \infty} \int_{A_\gamma(x)\,\Delta\,I(x)} r_\gamma(x, y)\,dy = 0 ,$$

where $I(x) = \{y,\ \|y\| \le \|x\|\}$ and $A_\gamma(x)$ is the acceptance region $A_\gamma(x) = \{y,\ \pi(x)\,r_\gamma(x, y) \le \pi(y)\,r_\gamma(y, x)\}$.

SLIDE 21

1. Motivation
2. Framework
3. Sampling from strongly log-concave distribution
   • Convergence in Wasserstein distance
   • Numerical experiments
4. Computable bounds in total variation for super-exponential densities
5. Deviation inequalities
6. Non-smooth potentials

SLIDE 22

Strongly convex potential

Assume that $U$ is strongly convex: there exists $m > 0$ such that for all $x, y \in \mathbb{R}^d$,

$$\langle \nabla U(x) - \nabla U(y),\, x - y \rangle \ge m\,\|x - y\|^2 .$$

This yields convergence in Wasserstein distance of the semigroup!

SLIDE 23

Wasserstein distance

Definition. For two probability measures $\mu, \nu$ on $\mathbb{R}^d$, define

$$W_2(\mu, \nu) = \inf_{(X,Y) \in \Pi(\mu,\nu)} \mathbb{E}^{1/2}\big[\|X - Y\|^2\big] ,$$

where $(X, Y) \in \Pi(\mu, \nu)$ if $X \sim \mu$ and $Y \sim \nu$. Note that by the Cauchy-Schwarz inequality, for all $f : \mathbb{R}^d \to \mathbb{R}$ with $\|f\|_{\mathrm{Lip}} \le 1$ and all $(X, Y) \in \Pi(\mu, \nu)$,

$$|\mu(f) - \nu(f)| \le \mathbb{E}^{1/2}\big[\|X - Y\|^2\big] , \qquad \text{hence} \qquad |\mu(f) - \nu(f)| \le W_2(\mu, \nu) .$$
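For intuition, a classical closed form worth keeping in mind (a standard fact, not from the original slides): for two Gaussian measures with a common covariance $\Sigma$,

\[
W_2^2\big(\mathcal{N}(m_1, \Sigma),\, \mathcal{N}(m_2, \Sigma)\big) = \|m_1 - m_2\|^2 ,
\]

attained by the synchronous coupling $X = m_1 + Z$, $Y = m_2 + Z$ with $Z \sim \mathcal{N}(0, \Sigma)$. Synchronous couplings of exactly this kind drive the proofs below.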

SLIDE 24

Wasserstein distance convergence

Theorem. Assume that $U$ is $L$-smooth and $m$-strongly convex. Then for all $x, y \in \mathbb{R}^d$ and $t \ge 0$,

$$W_2(\delta_x P_t, \delta_y P_t) \le e^{-mt}\,\|x - y\| .$$

The mixing rate depends only on the strong convexity constant.

SLIDE 25

Elements of proof

Consider the synchronous coupling driven by the same Brownian motion:

$$dY_t = -\nabla U(Y_t)\,dt + \sqrt{2}\,dB_t , \qquad d\tilde{Y}_t = -\nabla U(\tilde{Y}_t)\,dt + \sqrt{2}\,dB_t ,$$

where $(Y_0, \tilde{Y}_0) = (x, y)$. This SDE has a unique strong solution $(Y_t, \tilde{Y}_t)_{t \ge 0}$.

SLIDE 26

Elements of proof

With the synchronous coupling of the previous slide, consider $V_t = \|Y_t - \tilde{Y}_t\|^2$. Since

$$d\{Y_t - \tilde{Y}_t\} = -\big(\nabla U(Y_t) - \nabla U(\tilde{Y}_t)\big)\,dt ,$$

we get a very simple equation for $(V_t)_{t \ge 0}$:

$$dV_t = -2\,\big\langle \nabla U(Y_t) - \nabla U(\tilde{Y}_t),\, Y_t - \tilde{Y}_t \big\rangle\,dt .$$

SLIDE 27

Elements of proof

Integrating, we get

$$\|Y_t - \tilde{Y}_t\|^2 = \|Y_0 - \tilde{Y}_0\|^2 - 2\int_0^t \big\langle \nabla U(Y_s) - \nabla U(\tilde{Y}_s),\, Y_s - \tilde{Y}_s \big\rangle\,ds .$$

Since $U$ is strongly convex, $\langle \nabla U(y) - \nabla U(y'),\, y - y' \rangle \ge m\,\|y - y'\|^2$, which implies

$$\|Y_t - \tilde{Y}_t\|^2 \le \|Y_0 - \tilde{Y}_0\|^2 - 2m\int_0^t \|Y_s - \tilde{Y}_s\|^2\,ds .$$

SLIDE 28

Elements of proof

Starting from

$$\|Y_t - \tilde{Y}_t\|^2 \le \|Y_0 - \tilde{Y}_0\|^2 - 2m\int_0^t \|Y_s - \tilde{Y}_s\|^2\,ds ,$$

Grönwall's inequality yields

$$\|Y_t - \tilde{Y}_t\|^2 \le \|Y_0 - \tilde{Y}_0\|^2\,e^{-2mt} .$$

The proof follows since for all $t \ge 0$, the law of $(Y_t, \tilde{Y}_t)$ is a coupling between $\delta_x P_t$ and $\delta_y P_t$.

SLIDE 29

Theorem. Assume that $U$ is $L$-smooth and $m$-strongly convex. Then for any $x \in \mathbb{R}^d$ and $t \ge 0$,

$$W_2(\delta_x P_t, \pi) \le \Big( \|x - x^\star\| + \sqrt{d/m} \Big)\,e^{-mt} ,$$

where $x^\star = \arg\min_{x \in \mathbb{R}^d} U(x)$. The squared constant depends only linearly on the dimension $d$.

SLIDE 30

Elements of proof

Use the synchronous coupling

$$dY_t = -\nabla U(Y_t)\,dt + \sqrt{2}\,dB_t , \qquad d\tilde{Y}_t = -\nabla U(\tilde{Y}_t)\,dt + \sqrt{2}\,dB_t ,$$

where $Y_0 = x$ and $\tilde{Y}_0 \sim \pi$. Since $\pi$ is invariant for $(P_t)_{t \ge 0}$, $\tilde{Y}_t \sim \pi$ for all $t \ge 0$, showing that

$$W_2^2(\delta_x P_t, \pi) \le \mathbb{E}\big[\|Y_t - \tilde{Y}_t\|^2\big] \le e^{-2mt}\,\mathbb{E}\big[\|x - \tilde{Y}_0\|^2\big] .$$

The proof follows from $\mathbb{E}\big[\|x^\star - \tilde{Y}_0\|^2\big] \le d/m$.

SLIDE 31

A coupling proof

We now proceed to establish explicit bounds on $W_2(\mu_0 Q^n_\gamma, \pi)$ for $\mu_0 \in \mathcal{P}_2(\mathbb{R}^d)$. Since $\pi P_t = \pi$ for all $t \ge 0$, it suffices to bound $W_2(\mu_0 Q^n_\gamma, \nu_0 P_{\Gamma_n})$ with $\nu_0 \in \mathcal{P}_2(\mathbb{R}^d)$, where $\Gamma_n = \sum_{k=1}^n \gamma_k$, and take $\nu_0 = \pi$.

Idea: construct a coupling between the diffusion and the linear interpolation of the Euler discretization.

SLIDE 32

A coupling proof (II)

An obvious candidate is the synchronous coupling $(Y_t, \bar{Y}_t)_{t \ge 0}$, defined for all $n \ge 0$ and $t \in [\Gamma_n, \Gamma_{n+1})$ by

$$Y_t = Y_{\Gamma_n} - \int_{\Gamma_n}^t \nabla U(Y_s)\,ds + \sqrt{2}\,(B_t - B_{\Gamma_n}) , \qquad \bar{Y}_t = \bar{Y}_{\Gamma_n} - \nabla U(\bar{Y}_{\Gamma_n})\,(t - \Gamma_n) + \sqrt{2}\,(B_t - B_{\Gamma_n}) .$$

For all $n \ge 0$,

$$W_2^2\big(\mu_0 P_{\Gamma_n},\, \nu_0 Q^n_\gamma\big) \le \mathbb{E}\big[\|Y_{\Gamma_n} - \bar{Y}_{\Gamma_n}\|^2\big] .$$

SLIDE 33

Explicit bound in Wasserstein distance for the Euler discretisation

Theorem. Assume $U$ is $L$-smooth and $m$-strongly convex. Let $(\gamma_k)_{k \ge 1}$ be a nonincreasing sequence with $\gamma_1 \le 1/(m+L)$. (Optional assumption) $U \in C^3(\mathbb{R}^d)$ and there exists $\tilde{L}$ such that for all $x, y \in \mathbb{R}^d$, $\|\nabla^2 U(x) - \nabla^2 U(y)\| \le \tilde{L}\,\|x - y\|$. Then there exist sequences $\{u^{(1)}_n(\gamma),\, n \in \mathbb{N}\}$ and $\{u^{(2)}_n(\gamma),\, n \in \mathbb{N}\}$ (explicit expressions are available) such that for all $x \in \mathbb{R}^d$ and $n \ge 1$,

$$W_2^2\big(\delta_x Q^n_\gamma, \pi\big) \le u^{(1)}_n(\gamma) \int_{\mathbb{R}^d} \|y - x\|^2\,\pi(dy) + u^{(2)}_n(\gamma) .$$

SLIDE 34

Decreasing step sizes

If $\lim_{k \to +\infty} \gamma_k = 0$ and $\lim_{k \to +\infty} \Gamma_k = +\infty$, then

$$\lim_{n \to +\infty} W_2\big(\delta_x Q^n_\gamma, \pi\big) = 0 ,$$

with explicit control. For polynomially decreasing stepsizes $\gamma_k = \gamma_1 k^{-\alpha}$ with $\alpha \in (0, 1)$, the order of convergence of $W_2(\delta_x Q^n_\gamma, \pi)$ is $O(n^{-\alpha})$.
SLIDE 35

Constant step sizes

For any $\epsilon > 0$, the minimal number of iterations to achieve $W_2(\delta_x Q^p_\gamma, \pi) \le \epsilon$ is

$$p = O(\sqrt{d}\,\epsilon^{-1}) .$$

For a fixed number of iterations $p$, we can choose the stepsize $\gamma$ such that $W_2(\delta_x Q^p_\gamma, \pi) \le C\,p^{-1}$. For a given stepsize $\gamma$, letting $p \to +\infty$, we get $W_2(\pi_\gamma, \pi) \le C\gamma$.

SLIDE 36

From the Wasserstein distance to the TV

Theorem. If $U$ is $m$-strongly convex, then for all $x, y \in \mathbb{R}^d$,

$$\|P_t(x, \cdot) - P_t(y, \cdot)\|_{TV} \le 1 - 2\Phi\left( -\frac{\|x - y\|}{\sqrt{(4/m)(e^{2mt} - 1)}} \right) .$$

Proof: reflection coupling. Consequently, for all $x, y \in \mathbb{R}^d$,

$$\|P_t(x, \cdot) - P_t(y, \cdot)\|_{TV} \le \frac{\|x - y\|}{\sqrt{(2\pi/m)(e^{2mt} - 1)}} .$$
SLIDE 37

From the Wasserstein distance to the TV (II)

$$\|P_t(x, \cdot) - P_t(y, \cdot)\|_{TV} \le \frac{\|x - y\|}{\sqrt{(2\pi/m)(e^{2mt} - 1)}}$$

Consequences:

1. $(P_t)_{t \ge 0}$ converges exponentially fast to $\pi$ in total variation, at a rate $e^{-mt}$.
2. For all measurable $f : \mathbb{R}^d \to \mathbb{R}$ with $\sup |f| \le 1$, the map $x \mapsto P_t f(x)$ is Lipschitz with Lipschitz constant at most $1/\sqrt{(2\pi/m)(e^{2mt} - 1)}$.

SLIDE 38

Explicit bound in total variation

Theorem. Assume $U$ is $L$-smooth and $m$-strongly convex. Let $(\gamma_k)_{k \ge 1}$ be a nonincreasing sequence with $\gamma_1 \le 1/(m+L)$. (Optional assumption) $U \in C^3(\mathbb{R}^d)$ and there exists $\tilde{L}$ such that for all $x, y \in \mathbb{R}^d$, $\|\nabla^2 U(x) - \nabla^2 U(y)\| \le \tilde{L}\,\|x - y\|$. Then there exist sequences $\{\tilde{u}^{(1)}_n(\gamma),\, n \in \mathbb{N}\}$ and $\{\tilde{u}^{(2)}_n(\gamma),\, n \in \mathbb{N}\}$ such that for all $x \in \mathbb{R}^d$ and $n \ge 1$,

$$\|\delta_x Q^n_\gamma - \pi\|_{TV} \le \tilde{u}^{(1)}_n(\gamma) \int_{\mathbb{R}^d} \|y - x\|^2\,\pi(dy) + \tilde{u}^{(2)}_n(\gamma) .$$

SLIDE 39

Constant step sizes

For any $\epsilon > 0$, the minimal number of iterations to achieve $\|\delta_x Q^p_\gamma - \pi\|_{TV} \le \epsilon$ is

$$p = O\big(\sqrt{d}\,\log(d)\,\epsilon^{-1}\,|\log(\epsilon)|\big) .$$

For a given stepsize $\gamma$, letting $p \to +\infty$, we get $\|\pi_\gamma - \pi\|_{TV} \le C\gamma\,|\log(\gamma)|$.

SLIDE 40

Comparison of MALA and ULA (I)

We compare MALA and ULA for logistic regression with a Gaussian prior on five real data sets.

Data set                 Observations    Covariates d
German credit            1000            25
Heart disease            270             14
Australian credit        690             35
Pima Indian diabetes     768             9
Musk                     476             167

Table: Dimensions of the data sets.

SLIDE 41

Comparison of MALA and ULA (II)

Define the marginal accuracy between two probability measures $\mu, \nu$ on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ by

$$\mathrm{MA}(\mu, \nu) = 1 - \tfrac{1}{2}\,\|\mu - \nu\|_{TV} .$$

We compare MALA and ULA on each data set by estimating, for each component $i \in \{1, \dots, d\}$, the marginal accuracy between the marginal empirical distributions and the marginal posterior distributions.

SLIDE 42

Comparison of MALA and ULA (III)

To estimate the $d$ marginal posterior distributions, we run $2 \times 10^7$ iterations of the Pólya-Gamma Gibbs sampler. Then 100 runs of MALA and ULA ($10^6$ iterations per run) are performed. For MALA, the stepsize is chosen so that the acceptance probability is $\approx 0.5$. For ULA, we use the same constant stepsize as for MALA.

SLIDE 43

Comparison of MALA and ULA (IV)

Figure: Marginal accuracy across all dimensions, ULA vs. MALA. Upper left: German credit data set. Upper right: Australian credit data set. Lower left: Heart disease data set. Lower right: Pima Indian diabetes data set.

SLIDE 44

Comparison of MALA and ULA (V)

Figure: Marginal accuracy across all dimensions for the Musk data set, ULA vs. MALA.

SLIDE 45

Comparison of MALA and ULA (VI)

Figure: 2-dimensional histogram for the Musk data set.

SLIDE 46

1. Motivation
2. Framework
3. Sampling from strongly log-concave distribution
4. Computable bounds in total variation for super-exponential densities
5. Deviation inequalities
6. Non-smooth potentials

SLIDE 47

Convergence of the Euler discretization

Assume one of the following conditions:

• there exist $\alpha > 1$, $\rho > 0$ and $M_\rho \ge 0$ such that for all $y \in \mathbb{R}^d$ with $\|y\| \ge M_\rho$: $\langle \nabla U(y), y \rangle \ge \rho\,\|y\|^\alpha$;
• $U$ is convex.

If $\lim_{k \to +\infty} \gamma_k = 0$ and $\sum_k \gamma_k = +\infty$, then

$$\lim_{p \to +\infty} \|\delta_x Q^p_\gamma - \pi\|_{TV} = 0 ,$$

with computable bounds for the convergence¹.

¹ D., Moulines, Annals of Applied Probability, 2016.

SLIDE 48

Target precision ǫ: the convex case

Setting: $U$ is convex; constant stepsize. Optimal stepsize $\gamma$ and number of iterations $p$ to achieve $\epsilon$-accuracy in TV, $\|\delta_x Q^p_\gamma - \pi\|_{TV} \le \epsilon$:

      d            ε                         L
γ:    O(d⁻³)       O(ε² / log(ε⁻¹))          O(L⁻²)
p:    O(d⁵)        O(ε⁻² log²(ε⁻¹))          O(L²)

In the strongly convex case, the convergence of the semigroup of the diffusion to $\pi$ depends only on the strong convexity constant $m$. In the convex case, it depends on the dimension!

SLIDE 49

Strongly convex outside a ball potential

$U$ is convex everywhere and strongly convex outside a ball: there exist $R \ge 0$ and $m > 0$ such that for all $x, y \in \mathbb{R}^d$ with $\|x - y\| \ge R$,

$$\langle \nabla U(x) - \nabla U(y),\, x - y \rangle \ge m\,\|x - y\|^2 .$$

Eberle (2015) established that the convergence in Wasserstein distance does not depend on the dimension. D., Moulines (2016) established that the convergence of the semigroup to $\pi$ in TV does not depend on the dimension but only on $R$ ❀ new bounds which scale nicely with the dimension.

SLIDE 50

Dependence on the dimension

Setting: $U$ is convex and strongly convex outside a ball; constant stepsize. Optimal stepsize $\gamma$ and number of iterations $p$ to achieve $\epsilon$-accuracy in TV, $\|\delta_x Q^p_\gamma - \pi\|_{TV} \le \epsilon$:

      d              ε                       L           m            R
γ:    O(d⁻¹)         O(ε² / log(ε⁻¹))        O(L⁻²)      O(m)         O(R⁻⁴)
p:    O(d log(d))    O(ε⁻² log²(ε⁻¹))        O(L²)       O(m⁻²)       O(R⁸)

SLIDE 51

1. Motivation
2. Framework
3. Sampling from strongly log-concave distribution
4. Computable bounds in total variation for super-exponential densities
5. Deviation inequalities
6. Non-smooth potentials

SLIDE 52

Bounds for functionals

Let $f : \mathbb{R}^d \to \mathbb{R}$ be a Lipschitz function and $(X_k)_{k \ge 0}$ the Euler discretization of the Langevin diffusion. We approximate $\int_{\mathbb{R}^d} f(x)\,\pi(dx)$ by the weighted-average estimator

$$\hat{\pi}^N_n(f) = \sum_{k=N+1}^{N+n} \omega_{k,n}\,f(X_k) , \qquad \omega_{k,n} = \gamma_{k+1}\,\Gamma^{-1}_{N+2,N+n+1} , \qquad \Gamma_{s,t} = \sum_{k=s}^{t} \gamma_k ,$$

where $N \ge 0$ is the length of the burn-in period and $n \ge 1$ is the number of effective samples. Objective: compute an explicit bound for the mean squared error (MSE) of this estimator, defined by

$$\mathrm{MSE}_f(N, n) = \mathbb{E}_x\Big[ \big( \hat{\pi}^N_n(f) - \pi(f) \big)^2 \Big] .$$
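A minimal Python sketch of this weighted-average estimator on top of the ULA trajectory from the earlier sketch (f, the stepsizes and the burn-in length N are user-chosen; indexing follows path[k] = X_k and gammas[k-1] = gamma_k):

import numpy as np

def weighted_average(path, gammas, f, N):
    """Weighted-average estimator of pi(f) with burn-in N.

    Uses the samples X_{N+1}, ..., X_{K-1} (K = len(gammas)); each X_k
    is weighted proportionally to gamma_{k+1} = gammas[k].
    """
    K = len(gammas)                               # path has K+1 points X_0..X_K
    ks = range(N + 1, K)
    w = np.array([gammas[k] for k in ks], float)  # gamma_{k+1}
    w /= w.sum()                                  # normalization by the Gamma sum
    vals = np.array([f(path[k]) for k in ks])
    return float(w @ vals)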

SLIDE 53

The MSE can be decomposed into the sum of the squared bias and the variance:

$$\mathrm{MSE}_f(N, n) = \big( \mathbb{E}_x[\hat{\pi}^N_n(f)] - \pi(f) \big)^2 + \mathrm{Var}_x\big[ \hat{\pi}^N_n(f) \big] .$$

By definition of the Wasserstein distance,

$$\mathrm{Bias}^2 \le \|f\|^2_{\mathrm{Lip}} \sum_{k=N+1}^{N+n} \omega_{k,n}\,W_1^2(\delta_x Q^k_\gamma, \pi) ,$$

and

$$W_1^2(\delta_x Q^k_\gamma, \pi) \le 2\big( \|x - x^\star\|^2 + d/m \big)\,u^{(1)}_k(\gamma) + u^{(2,3)}_k(\gamma) .$$

SLIDE 54

MSE

Theorem. Assume that $U$ is $L$-smooth and $m$-strongly convex. Let $(\gamma_k)_{k \ge 1}$ be a nonincreasing sequence with $\gamma_1 \le 2/(m+L)$. Then for all $N \ge 0$, $n \ge 1$ and Lipschitz functions $f : \mathbb{R}^d \to \mathbb{R}$,

$$\mathrm{Var}_x\big[ \hat{\pi}^N_n(f) \big] \le 8\,\kappa^{-2}\,\|f\|^2_{\mathrm{Lip}}\,\Gamma^{-1}_{N+2,N+n+1}\,u^{(3)}_{N,n}(\gamma) ,$$

where

$$u^{(3)}_{N,n}(\gamma) \stackrel{\mathrm{def}}{=} 1 + \Gamma^{-1}_{N+2,N+n+1}\big( \kappa^{-1} + 2/(m+L) \big) .$$

The upper bound is independent of the dimension and allows one to construct honest confidence intervals.

SLIDE 55

Bound for the MSE

α = 0:          $\gamma_1^2 + (\gamma_1 n)^{-1} e^{-\kappa\gamma_1 N/2}$
α ∈ (0, 1/3):   $\gamma_1^2 n^{-2\alpha} + (\gamma_1 n^{1-\alpha})^{-1} e^{-\kappa\gamma_1 N^{1-\alpha}/(2(1-\alpha))}$
α = 1/3:        $\gamma_1^2 \log(n)\, n^{-2/3} + (\gamma_1 n^{2/3})^{-1} e^{-\kappa\gamma_1 N^{1/2}/4}$
α ∈ (1/3, 1):   $n^{\alpha-1}\gamma_1^2 + \gamma_1^{-1} e^{-\kappa\gamma_1 N^{1-\alpha}/(2(1-\alpha))}$
α = 1:          $(\log n)^{-1}\gamma_1^2 + \gamma_1^{-1} N^{-\gamma_1\kappa/2}$

Table: Bound for the MSE for $\gamma_k = \gamma_1 k^{-\alpha}$, for fixed $\gamma_1$ and $N$, with more regularity on $U$.

SLIDE 56

                Optimal choice of γ₁       Bound for the MSE
α = 0           $n^{-1/3}$                 $n^{-2/3}$
α ∈ (0, 1/2)    $n^{\alpha - 1/3}$         $n^{-2/3}$
α = 1/2         $(\log n)^{-1/3}$          $\log^{1/3}(n)\, n^{-2/3}$
α ∈ (1/2, 1)    $1/(m+L)$                  $n^{\alpha-1}$
α = 1           $1/(m+L)$                  $(\log n)^{-1}$

Table: Optimal choice of $\gamma_1$ and resulting bound for the MSE, for $\gamma_k = \gamma_1 k^{-\alpha}$ and fixed $n$, with more regularity on $U$.

SLIDE 57

1. Motivation
2. Framework
3. Sampling from strongly log-concave distribution
4. Computable bounds in total variation for super-exponential densities
5. Deviation inequalities
6. Non-smooth potentials

SLIDE 58

Non-smooth potentials

The target distribution has a density $\pi$ with respect to the Lebesgue measure on $\mathbb{R}^d$ of the form

$$x \mapsto e^{-U(x)} \Big/ \int_{\mathbb{R}^d} e^{-U(y)}\,dy , \qquad U = f + g ,$$

where $f : \mathbb{R}^d \to \mathbb{R}$ and $g : \mathbb{R}^d \to (-\infty, +\infty]$ are two lower-bounded, convex functions satisfying:

1. $f$ is continuously differentiable and gradient-Lipschitz with Lipschitz constant $L_f$, i.e. for all $x, y \in \mathbb{R}^d$, $\|\nabla f(x) - \nabla f(y)\| \le L_f\,\|x - y\|$;
2. $g$ is lower semi-continuous and $\int_{\mathbb{R}^d} e^{-g(y)}\,dy \in (0, +\infty)$.

SLIDE 59

Moreau-Yosida regularization

Let $h : \mathbb{R}^d \to (-\infty, +\infty]$ be an l.s.c. convex function and $\lambda > 0$. The $\lambda$-Moreau-Yosida envelope $h^\lambda : \mathbb{R}^d \to \mathbb{R}$ and the proximal operator $\mathrm{prox}^\lambda_h : \mathbb{R}^d \to \mathbb{R}^d$ associated with $h$ are defined for all $x \in \mathbb{R}^d$ by

$$h^\lambda(x) = \inf_{y \in \mathbb{R}^d} \big\{ h(y) + (2\lambda)^{-1}\,\|x - y\|^2 \big\} \le h(x) .$$

For every $x \in \mathbb{R}^d$, the minimum is achieved at a unique point, $\mathrm{prox}^\lambda_h(x)$, which is characterized by the inclusion

$$x - \mathrm{prox}^\lambda_h(x) \in \lambda\,\partial h\big(\mathrm{prox}^\lambda_h(x)\big) .$$

The Moreau-Yosida envelope is a regularized version of $h$, which approximates $h$ from below; a worked example follows.
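A minimal Python sketch (not from the original slides) of these objects for the standard case $h = \alpha\|\cdot\|_1$, whose proximal operator is the componentwise soft-thresholding map, a classical closed form:

import numpy as np

def prox_l1(x, lam, alpha=1.0):
    """Proximal operator of h = alpha * ||.||_1 with parameter lam:
    soft-thresholding, applied componentwise."""
    return np.sign(x) * np.maximum(np.abs(x) - lam * alpha, 0.0)

def my_envelope_l1(x, lam, alpha=1.0):
    """lambda-Moreau-Yosida envelope of h = alpha * ||.||_1, evaluated via
    h^lam(x) = h(p) + ||x - p||^2 / (2 lam) with p = prox (the Huber function)."""
    p = prox_l1(x, lam, alpha)
    return alpha * np.abs(p).sum() + np.sum((x - p) ** 2) / (2.0 * lam)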

SLIDE 60

Properties of proximal operators

As $\lambda \downarrow 0$, $h^\lambda$ converges pointwise to $h$, i.e. for all $x \in \mathbb{R}^d$, $h^\lambda(x) \uparrow h(x)$ as $\lambda \downarrow 0$. The function $h^\lambda$ is convex and continuously differentiable, with

$$\nabla h^\lambda(x) = \lambda^{-1}\big( x - \mathrm{prox}^\lambda_h(x) \big) .$$

The proximal operator is a monotone operator: for all $x, y \in \mathbb{R}^d$,

$$\big\langle \mathrm{prox}^\lambda_h(x) - \mathrm{prox}^\lambda_h(y),\, x - y \big\rangle \ge 0 ,$$

which implies that the Moreau-Yosida envelope is $\lambda^{-1}$-smooth: $\|\nabla h^\lambda(x) - \nabla h^\lambda(y)\| \le \lambda^{-1}\,\|x - y\|$ for all $x, y \in \mathbb{R}^d$.

SLIDE 61

MY regularized potential

If $g$ is not differentiable but the proximal operator associated with $g$ is available, its $\lambda$-Moreau-Yosida envelope $g^\lambda$ can be considered. This leads to the approximation $U^\lambda : \mathbb{R}^d \to \mathbb{R}$ of the potential, defined for all $x \in \mathbb{R}^d$ by

$$U^\lambda(x) = f(x) + g^\lambda(x) .$$

Theorem (D., Moulines, Pereyra, 2016). Under (H), for all $\lambda > 0$, $0 < \int_{\mathbb{R}^d} e^{-U^\lambda(y)}\,dy < +\infty$.

SLIDE 62

Some approximation results

Theorem. Assume (H).

1. $\lim_{\lambda \to 0} \|\pi_\lambda - \pi\|_{TV} = 0$.
2. Assume in addition that $g$ is Lipschitz. Then for all $\lambda > 0$, $\|\pi_\lambda - \pi\|_{TV} \le \lambda\,\|g\|^2_{\mathrm{Lip}}$.

SLIDE 63

The MYULA algorithm-I

Given a regularization parameter $\lambda > 0$ and a sequence of stepsizes $\{\gamma_k,\, k \in \mathbb{N}^*\}$, the algorithm produces the Markov chain $\{X^M_k,\, k \in \mathbb{N}\}$: for all $k \ge 0$,

$$X^M_{k+1} = X^M_k - \gamma_{k+1}\Big[ \nabla f(X^M_k) + \lambda^{-1}\big( X^M_k - \mathrm{prox}^\lambda_g(X^M_k) \big) \Big] + \sqrt{2\gamma_{k+1}}\,Z_{k+1} ,$$

where $\{Z_k,\, k \in \mathbb{N}^*\}$ is a sequence of i.i.d. $d$-dimensional standard Gaussian random variables.
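A minimal Python sketch of this MYULA recursion; grad_f, prox_g, lam, the stepsizes and the initial point are user-supplied (for $g = \alpha\|\cdot\|_1$, prox_g can be the soft-thresholding prox_l1 from the earlier sketch):

import numpy as np

def myula(x0, grad_f, prox_g, lam, gammas, rng=np.random.default_rng()):
    """Moreau-Yosida regularized ULA: ULA applied to U_lam = f + g^lam,
    whose gradient is grad_f(x) + (x - prox_g(x, lam)) / lam."""
    x = np.asarray(x0, dtype=float)
    path = [x]
    for gamma in gammas:
        drift = grad_f(x) + (x - prox_g(x, lam)) / lam
        x = x - gamma * drift + np.sqrt(2.0 * gamma) * rng.standard_normal(x.shape)
        path.append(x)
    return np.array(path)

# Example wiring (hypothetical values):
# path = myula(x0, grad_f, lambda x, lam: prox_l1(x, lam, alpha), 0.01, [1e-3] * 100000)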

SLIDE 64

The MYULA algorithm-II

ULA applied to $U^\lambda$ targets the smoothed distribution $\pi_\lambda$. To compute the expectation of a function $h : \mathbb{R}^d \to \mathbb{R}$ under $\pi$ from $\{X^M_k;\ 0 \le k \le n\}$, an importance sampling step is used to correct for the regularization. This step amounts to approximating $\int_{\mathbb{R}^d} h(x)\,\pi(x)\,dx$ by the weighted sum

$$S^h_n = \sum_{k=0}^{n} \omega_{k,n}\,h(X^M_k) , \qquad \omega_{k,n} = \frac{\gamma_k\,e^{\bar{g}^\lambda(X^M_k)}}{\sum_{j=0}^{n} \gamma_j\,e^{\bar{g}^\lambda(X^M_j)}} ,$$

where for all $x \in \mathbb{R}^d$,

$$\bar{g}^\lambda(x) = g^\lambda(x) - g(x) = g\big(\mathrm{prox}^\lambda_g(x)\big) - g(x) + (2\lambda)^{-1}\,\big\| x - \mathrm{prox}^\lambda_g(x) \big\|^2 .$$
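A minimal sketch of this importance-weighted estimator for the $\ell_1$ case, reusing prox_l1 and the MYULA trajectory from the sketches above (stepsizes are paired with states up to the indexing convention for $\gamma_k$; weights are computed in log space for stability):

import numpy as np

def myula_estimate(path, gammas, h, lam, alpha):
    """Importance-sampling correction from pi_lambda back to pi for
    g = alpha * ||.||_1: weight X_k by gamma_k * exp(gbar(X_k)),
    where gbar = g^lam - g <= 0."""
    def gbar(x):
        p = prox_l1(x, lam, alpha)              # from the earlier sketch
        return (alpha * np.abs(p).sum() - alpha * np.abs(x).sum()
                + np.sum((x - p) ** 2) / (2.0 * lam))

    log_w = np.array([np.log(gam) + gbar(x) for gam, x in zip(gammas, path)])
    log_w -= log_w.max()                        # stabilize before exponentiating
    w = np.exp(log_w)
    w /= w.sum()
    return sum(wk * h(xk) for wk, xk in zip(w, path))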

SLIDE 65

Image deconvolution

Objective: recover an original image $x \in \mathbb{R}^n$ from a blurred and noisy observed image $y \in \mathbb{R}^n$ related to $x$ by the linear observation model $y = Hx + w$, where $H$ is a linear operator representing the blur point spread function and $w$ is a Gaussian vector with zero mean and covariance matrix $\sigma^2 \mathrm{I}_n$. This inverse problem is usually ill-posed or ill-conditioned, so one exploits prior knowledge about $x$. One of the most widely used image priors for deconvolution problems is the improper total-variation prior, $\pi(x) \propto \exp(-\alpha\,\|\nabla_d x\|_1)$, where $\nabla_d$ denotes the discrete gradient operator that computes the vertical and horizontal differences between neighbouring pixels. The posterior is

$$\pi(x|y) \propto \exp\big( -\|y - Hx\|^2/(2\sigma^2) - \alpha\,\|\nabla_d x\|_1 \big) .$$

SLIDE 66

Figure: (a) Original Boat image (256 × 256 pixels), (b) blurred image, (c) MAP estimate.

SLIDE 67

Credibility intervals

Figure: (a) Pixel-wise 90% credibility intervals computed with proximal MALA (computing time 35 hours), (b) approximate intervals estimated with MYULA using λ = 0.01 (computing time 3.5 hours), (c) approximate intervals estimated with MYULA using λ = 0.1 (computing time 20 minutes).

SLIDE 68

Microscopy dataset


SLIDE 69

HPD credible region

Consider the $\ell_1$-norm prior, $\pi(x) \propto \exp(-\alpha\,\|x\|_1)$; then

$$\pi(x|y) \propto \exp\big( -\|y - Hx\|^2/(2\sigma^2) - \alpha\,\|x\|_1 \big) .$$

We want to test whether the molecules are indeed present in the true image (as opposed to being noise artefacts, for example), and to quantify the uncertainty about their position. For this, it is relevant to compute the HPD credible region

$$C^\star_\alpha = \{x : U(x) \le \eta_\alpha\} ,$$

with $\eta_\alpha \in \mathbb{R}$ chosen such that $\mathbb{P}(x \in C^\star_\alpha | y) = 1 - \alpha$ holds. Here we use $\alpha = 0.01$, corresponding to the 99% confidence level.

SLIDE 70

Comparison with PMALA

Figure: Microscopy experiment: (a) HPD region thresholds $\eta_\alpha$ for MYULA ($2 \times 10^6$ iterations, $\lambda = 1$, $\gamma = 0.6$) and PMALA ($2 \times 10^7$ iterations); (b) relative approximation error of MYULA.
