

SLIDE 1

Diffusions and their numerical approximation Applications of Langevin algorithms

Langevin Dynamics

Loucas Pillaud-Vivien November 7, 2019

Loucas Pillaud-Vivien Langevin Dynamics

SLIDE 2

Introduction

Sampling distributions over high-dimensional spaces is an important topic in computational statistics and machine learning. Example of application: Bayesian inference for high-dimensional models. Problems:

1. Most sampling techniques do not scale to high dimension: big d.

2. Nor to a large number of data points (recall HMC, which needs the full gradient): big N.

SLIDE 3

Example: Bayesian setting

A Bayesian model is specified by:

1. the sampling distribution of the observed data (the likelihood): Y ∼ L(·|θ);

2. a prior distribution p on the parameter space, θ ∈ Rd.

Inference is based on the posterior distribution

π(dθ) = p(dθ)L(Y|θ) / ∫Rd L(Y|u)p(du).

The normalizing constant is often not tractable (the integral is too high-dimensional); we can only compute π(dθ) ∝ p(dθ)L(Y|θ).
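This is why working up to a normalizing constant is enough in practice. A minimal sketch (the Gaussian prior and likelihood below are assumptions for illustration, not from the talk):

```python
import numpy as np

# Toy Bayesian model: prior theta ~ N(0, I_d), likelihood Y_i ~ N(theta, I_d).
# Sampling algorithms such as Langevin dynamics only ever need the unnormalized
# log-posterior (and its gradient); the normalizing integral over R^d is never
# computed.
def log_posterior_unnorm(theta, Y):
    log_prior = -0.5 * np.dot(theta, theta)       # log p(theta) + const
    log_lik = -0.5 * ((Y - theta) ** 2).sum()     # log L(Y | theta) + const
    return log_prior + log_lik

theta = np.zeros(3)
Y = np.ones((2, 3))                               # two observations in R^3
print(log_posterior_unnorm(theta, Y))             # -3.0
```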

SLIDE 4

Outline

1. Diffusions and their numerical approximation
   - Setting
   - Continuous-time Markov processes: diffusions
   - Discretized Langevin diffusion

2. Applications of Langevin algorithms
   - Sampling a strongly convex potential
   - Stochastic Gradient Langevin Dynamics
   - Non-convex learning via SGLD

SLIDE 5

Framework

We want to sample a measure that has a density with respect to Lebesgue, known up to a normalization factor:

dµ(x) = e−V(x)dx / ∫Rd e−V(y)dy.

We assume that V is L-smooth, i.e. continuously differentiable with ∃L > 0 such that ‖∇V(x) − ∇V(y)‖ ≤ L‖x − y‖.

SLIDE 6

Convergence to equilibrium for Diffusions

Let us consider the overdamped Langevin diffusion in Rd:

dXt = −∇V(Xt)dt + √2 dBt.

L-smoothness of V gives existence and uniqueness of a solution.

Stationary measure: dµ(x) = e−V(x)dx / ∫Rd e−V(y)dy.

Semi-group: Pt(f)(x) = E[f(Xt) | X0 = x] → the "law of Xt".

Infinitesimal generator: Lφ = ∆φ − ∇V · ∇φ.

One can verify that the semi-group follows the dynamics (d/dt)Pt(f) = LPt(f).

→ Question: at what speed does it converge?

SLIDE 7

Convergence to equilibrium for Diffusions

Theorem (Poincaré implies convergence to equilibrium). With the notation above, the following propositions are equivalent:

- µ satisfies a Poincaré inequality with constant CP;
- for all smooth f and all t ≥ 0, Varµ(Pt(f)) ≤ e−2t/CP Varµ(f).

Proof: by the integration-by-parts formula (µ is reversible),

−∫ f(Lg)dµ = ∫ ∇f · ∇g dµ = −∫ (Lf)g dµ,

hence

(d/dt)Varµ(Pt(f)) = (d/dt)∫ (Pt(f))² dµ = 2∫ Pt(f)(LPt(f))dµ = −2∫ ‖∇Pt(f)‖² dµ ≤ −(2/CP) Varµ(Pt(f)).

SLIDE 8

Poincaré inequalities: definition in modern language

Definition (Poincaré inequality). µ ∈ P(Rd) satisfies a Poincaré inequality with constant CP if

Varµ(f) ≤ CP ∫ ‖∇f‖² dµ,

for all (bounded) f: Rd → R of class C1. Recall that

Varµ(f) = ∫ f² dµ − (∫ f dµ)² = ∫ (f − ∫ f dµ)² dµ,

and that ∫ ‖∇f‖² dµ = E(f) is the Dirichlet energy.

Spectral interpretation: E(f) = ∫ ∇f · ∇f dµ = ∫ f(−Lf)dµ

→ 1/CP = λ2, the first non-trivial eigenvalue of −L.

SLIDE 9

Application to the Ornstein-Uhlenbeck process

The Ornstein-Uhlenbeck process follows the SDE in Rd:

dXt = −Xt dt + √2 dBt.

Denote by L the operator Lφ = ∆φ − x · ∇φ. Then:

1. For dµ(x) = (2π)−d/2 e−‖x‖²/2 dx, L is self-adjoint in L²(µ).

2. µ is the stationary measure of the O-U process.

3. µ satisfies a Poincaré inequality with constant 1.

4. For all smooth f and all t ≥ 0, Varµ(Pt(f)) ≤ e−2t Varµ(f).
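This sharp rate can be checked numerically. A minimal sketch using the exact O-U transition, with the test function f(x) = x (an assumption for illustration), for which Pt f(x) = e−t x and the bound is attained:

```python
import numpy as np

# For the O-U process dX = -X dt + sqrt(2) dB, the transition is exact:
# X_t | X_0 = x ~ N(e^{-t} x, 1 - e^{-2t}), with stationary measure N(0, 1).
# For f(x) = x we have P_t f(x) = e^{-t} x, so Var_mu(P_t f) = e^{-2t} Var_mu(f):
# the Poincare rate e^{-2t} is attained.
rng = np.random.default_rng(1)
x0 = rng.standard_normal(200_000)          # X_0 ~ mu = N(0, 1), d = 1
for t in (0.5, 1.0, 2.0):
    ptf = np.exp(-t) * x0                  # P_t f evaluated on the samples
    print(t, ptf.var(), np.exp(-2 * t))    # empirical vs predicted variance
```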

SLIDE 10

Poincaré inequalities

Long story short:

Poincaré inequality ⇐⇒ spectral gap for L ⇐⇒ exponential convergence for the diffusion.

SLIDE 14

Poincaré inequalities

For which distributions do they hold?

- When V is m-strongly convex: CP = 1/m (matching the linear convergence of gradient descent).
- When V is only convex: yes, but with no bound...
- A generic condition for a not-necessarily-convex potential: ½|∇V|² − ∆V ≥ α.
- For a mixture of Gaussians, CP explodes exponentially.

SLIDE 15

Ok, fine. But how do I get back to the real world and draw samples?

SLIDE 16

Discretized Langevin Diffusion

Idea: sample the diffusion paths using the Euler-Maruyama scheme:

dXt = −∇V(Xt)dt + √2 dBt   ⇝   Xk+1 = Xk − γk+1∇V(Xk) + √(2γk+1) ξk+1,

where

- (ξk)k is i.i.d. N(0, Id);
- (γk)k is a sequence of step sizes, either constant or decreasing to 0.

Note the similarity with gradient descent and its stochastic counterpart. This algorithm is referred to as the Unadjusted Langevin Algorithm (ULA), Langevin Monte Carlo, or Gradient Langevin Dynamics.
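The scheme is a few lines of code. A minimal sketch (the standard-Gaussian target V(x) = ‖x‖²/2 is an assumption chosen so the answer is known):

```python
import numpy as np

def ula(grad_V, x0, gamma, n_steps, rng):
    """Unadjusted Langevin Algorithm with constant step size:
    X_{k+1} = X_k - gamma * grad_V(X_k) + sqrt(2 * gamma) * xi_{k+1}."""
    x = np.array(x0, dtype=float)
    samples = np.empty((n_steps, x.size))
    for k in range(n_steps):
        x = x - gamma * grad_V(x) + np.sqrt(2.0 * gamma) * rng.standard_normal(x.size)
        samples[k] = x
    return samples

# Target N(0, I_2): V(x) = ||x||^2 / 2, so grad V(x) = x.
rng = np.random.default_rng(0)
samples = ula(lambda x: x, x0=np.zeros(2), gamma=0.1, n_steps=20_000, rng=rng)
print(samples[5_000:].mean(axis=0))  # near 0
print(samples[5_000:].var(axis=0))   # near 1, up to the O(gamma) bias of mu_gamma
```

Even on this toy target the chain samples µγ rather than µ: for this Gaussian the stationary variance of the iterates is 1/(1 − γ/2), a visible instance of the discretization bias discussed on the next slide.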

SLIDE 17

Discretized Langevin Diffusion: constant stepsize

When γk = γ for all k, (Xk)k is a homogeneous Markov chain with Markov kernel Rγ. Under some mild assumptions Rγ is irreducible and positive recurrent, and hence has an invariant distribution µγ ≠ µ. Typical questions:

- For a given precision ε, how do we choose the step size γ and the number of iterations n such that dist(δxRγⁿ, µ) ≤ ε?
- How do we choose x?
- How do we quantify dist(µγ, µ)?

SLIDE 18

Outline

1. Diffusions and their numerical approximation
   - Setting
   - Continuous-time Markov processes: diffusions
   - Discretized Langevin diffusion

2. Applications of Langevin algorithms
   - Sampling a strongly convex potential
   - Stochastic Gradient Langevin Dynamics
   - Non-convex learning via SGLD

SLIDE 19

Result for a strongly convex potential

Theorem (Durmus, Moulines 2016). Assume that V is m-strongly convex and L-smooth. Set γ ∈ (0, 1/(m + L)] and κ = mL/(m + L). Then for all x ∈ Rd,

W₂²(δxRγⁿ, π) ≤ 2(1 − κγ)ⁿ W₂²(δx, π) + C d γ.

Remarks:

- Bias + variance decomposition, as for SGD: a geometric rate, then the distance from µγ to µ.
- One may choose γ such that after n = Θ(d/ε²) iterations, W₂²(δxRγⁿ, π) ≤ ε.
- This gives an explicit way of choosing γ (it was a problem! see MALA).

SLIDE 20

Result for a strongly convex potential: remarks

Remarks:

- Exactly the same results hold in total variation (Dalalyan 2014) and in KL divergence (Bartlett et al. 2017).
- The same result holds with decreasing step sizes, and then there is no parameter to tune!
- Quadratic improvement by Jordan et al. 2018 by considering the underdamped Langevin diffusion (similar to HMC): for n = Θ(d/ε) iterations, W₂²(δxRγⁿ, π) ≤ ε (and strong convexity is needed only outside of a ball).

SLIDE 21

Grrrrr... But you know... I do not like to compute all the gradients...

SLIDE 22

Stochastic Gradient Langevin Dynamics (SGLD)

Recall: the ULA algorithm is a discretization of the overdamped Langevin diffusion, which leaves the target distribution µ invariant. To further reduce the computational cost, SGLD uses unbiased estimates of the gradient! Initially proposed by Welling, M. and Teh, Y.W. (2011).

SLIDE 23

SGLD algorithm

We are interested in situations where the distribution µ arises as the posterior in a Bayesian inference problem, with prior µ0 and a large number N ≫ 1 of i.i.d. observations zi with likelihoods p(zi|X):

dµ(X|z1, …, zN) ∝ dµ0(X) ∏_{i=1}^{N} p(zi|X).

Denote, for i ∈ {1, …, N},

Vi(X) = −log p(zi|X),   V0(X) = −log dµ0(X),   V = ∑_{i=0}^{N} Vi.

The cost of one ULA iteration is Nd, which is prohibitively large.

SLIDE 24

SGLD algorithm

Welling, M. and Teh, Y.W. suggested replacing ∇V with the unbiased estimate

∇V0 + (N/p) ∑_{i∈S} ∇Vi,

where S is a minibatch of size p. A single SGLD update is thus (cost pd):

Xk+1 = Xk − γ [∇V0(Xk) + (N/p) ∑_{i∈Sk+1} ∇Vi(Xk)] + √(2γ) Zk+1.

Same idea as SGD. Two sources of randomness: the gradient estimates and the Gaussian noise added to sample.
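A minimal sketch of the update (the conjugate-Gaussian model below is an assumption chosen so the posterior is known in closed form):

```python
import numpy as np

# Toy model: prior N(0, 1), observations z_i ~ N(theta, 1), so
# V0(x) = x^2 / 2, V_i(x) = (x - z_i)^2 / 2, and the posterior is
# N(sum(z) / (N + 1), 1 / (N + 1)).
rng = np.random.default_rng(2)
N, p, gamma = 100, 10, 1e-3
z = rng.normal(1.0, 1.0, size=N)

x, samples = 0.0, []
for k in range(20_000):
    S = rng.choice(N, size=p, replace=False)       # minibatch S_{k+1}
    grad = x + (N / p) * (x - z[S]).sum()          # grad V0 + (N/p) sum_i grad V_i
    x = x - gamma * grad + np.sqrt(2.0 * gamma) * rng.standard_normal()
    samples.append(x)

post_mean = z.sum() / (N + 1)
print(np.mean(samples[5_000:]), post_mean)         # the two agree closely
```

Each iteration touches only p = 10 of the N = 100 per-datum gradients. The minibatch noise inflates the spread of the samples around the posterior, which is exactly what the variance-reduced update on the following slide addresses.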

SLIDE 25

SGLD algorithm: need for variance reduction

Xk+1 = Xk − γ [∇V0(Xk) + (N/p) ∑_{i∈Sk+1} ∇Vi(Xk)] + √(2γ) Zk+1.

Two sources of noise. For γ = γ0/N:

1. The noise from the gradient estimates is too big ⇒ no sampling.

2. We need to decrease the variance: assume x* is the unique minimizer of V, and use

Xk+1 = Xk − γ [∇V0(Xk) − ∇V0(x*) + (N/p) ∑_{i∈Sk+1} (∇Vi(Xk) − ∇Vi(x*))] + √(2γ) Zk+1.

If γ = γ0/N, SGLD behaves like SGD; use variance control to sample. Precise analysis in Moulines et al. (2018).
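A minimal sketch of this control-variate update, on the same toy conjugate-Gaussian model as before (an assumption; here x* is available in closed form, whereas in general it requires a preliminary optimization pass):

```python
import numpy as np

# Toy model: prior N(0, 1), z_i ~ N(theta, 1); posterior N(sum(z)/(N+1), 1/(N+1)).
# Control variates: replace grad V_i(X_k) by grad V_i(X_k) - grad V_i(x*), where
# x* minimizes V. The estimate stays unbiased because grad V(x*) = 0. For these
# quadratic V_i the difference is (X_k - x*), independent of i, so the minibatch
# noise vanishes entirely; in general it is only damped near x*.
rng = np.random.default_rng(3)
N, p, gamma = 100, 10, 1e-3
z = rng.normal(1.0, 1.0, size=N)
x_star = z.sum() / (N + 1)                # argmin of V = V0 + sum_i V_i

x, samples = 0.0, []
for k in range(20_000):
    S = rng.choice(N, size=p, replace=False)
    # (x - x_star) is grad V0(X_k) - grad V0(x*); the sum is the minibatch part.
    grad = (x - x_star) + (N / p) * ((x - z[S]) - (x_star - z[S])).sum()
    x = x - gamma * grad + np.sqrt(2.0 * gamma) * rng.standard_normal()
    samples.append(x)

print(np.mean(samples[5_000:]), x_star)        # mean matches the posterior mean
print(np.var(samples[5_000:]), 1.0 / (N + 1))  # variance close to 1/(N+1), up to O(gamma) bias
```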

SLIDE 26

Non-convex Learning via SGLD

Classical learning problem: find the minimum of F(w) := EP[f(w, Z)], where f is not necessarily convex. Call Fz(w) := (1/n) ∑_{i=1}^{n} f(w, zi).

Consider the Langevin diffusion and its associated discretization:

dXt = −∇Fz(Xt)dt + √(2β−1) dBt,   Xk+1 = Xk − η∇f(Xk, zk) + √(2ηβ−1) ξk.

The chain converges to dµz(w) ∝ exp(−βFz(w)); when β ∼ 1/T is big, this measure concentrates around the minimizers of Fz, and hence of F.
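A minimal sketch of this concentration effect (the double-well objective is an assumption for illustration, and the full gradient is used in place of the stochastic one for simplicity):

```python
import numpy as np

# Double well F(w) = (w^2 - 1)^2, minimized at w = +/-1, with a barrier at 0.
# The discretized Langevin dynamics
#   X_{k+1} = X_k - eta * grad F(X_k) + sqrt(2 * eta / beta) * xi_k
# samples approximately from exp(-beta * F); for large beta the samples
# concentrate near the two minimizers.
def grad_F(w):
    return 4.0 * w * (w ** 2 - 1.0)

rng = np.random.default_rng(4)
eta, beta = 0.01, 10.0
w, ws = 0.0, []
for k in range(50_000):
    w = w - eta * grad_F(w) + np.sqrt(2.0 * eta / beta) * rng.standard_normal()
    ws.append(w)

print(np.mean(np.abs(ws[10_000:])))   # close to 1: the mass sits in the wells
```

At small β the same chain spreads over the whole line; at large β it also takes exponentially long to hop between the two wells, which is the metastability issue raised in the conclusion.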

SLIDE 27

Non-convex Learning via SGLD

Xk+1 = Xk − η∇f(Xk, zk) + √(2ηβ−1) ξk,

and (Xk) converges to dµz(w) ∝ exp(−βFz(w)), with β ∼ 1/T.

Theorem (Raginsky, Rakhlin, Telgarsky (2018)). For k ≳ ε−4 and η ≲ ε4,

EF(Xk) − F* ≤ cε + (β + d)²/n + d log(β + 1)/β.

Sketch of proof: control three terms:

- how far the chain is from the true diffusion and its invariant measure exp(−βFz(w));
- how far Fz is from F;
- how close a sample from exp(−βFz(w)) is to a minimizer of Fz, in terms of β.

SLIDE 28

Conclusion

We have seen how Langevin dynamics can be used to derive new algorithms for:

- sampling;
- Bayesian learning;
- non-convex optimization.

Problem with non-convexity: metastability of the Markov process → an old problem in computational chemistry. "Particles remain trapped in wells for a long time before escaping." There has been a huge effort in that community to tackle this problem. Inspiration for machine learning?
