

SLIDE 1

Non-Asymptotic Analysis of Fractional Langevin Monte Carlo for Non-Convex Optimization

Thanh Huy Nguyen, Umut Şimşekli, Gaël Richard

LTCI, Télécom Paris, Institut Polytechnique de Paris, France


SLIDES 2-5

Introduction

Non-convex optimization problem: $\min_x f(x)$

Fractional Langevin Algorithm (FLA) (Şimşekli, 2017):

$$W_{k+1} = W_k - \eta\, c_\alpha \nabla f(W_k) + \left(\eta/\beta\right)^{1/\alpha} \Delta L^\alpha_{k+1}$$

- $\{\Delta L^\alpha_k\}_{k \in \mathbb{N}_+}$: $\alpha$-stable random variables
- $\alpha \in (1, 2]$: the characteristic index; $c_\alpha$: a known constant

[Figure: densities of the $\alpha$-stable distribution and sample paths of $\alpha$-stable Lévy motion for $\alpha = 1.2,\ 1.6,\ 2.0$]

Generalizes Stochastic Gradient Langevin Dynamics ($\alpha = 2$) (Welling and Teh, 2011)

Strong links with SGD for Deep Neural Networks (Şimşekli et al., 2019)

Our Goal: Analyze $\mathbb{E}[f(W_k)] - f^\star$, where $f^\star = \min_x f(x)$
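As a concrete illustration, here is a minimal runnable sketch of the FLA update above. The toy objective, the choice $c_\alpha = 1$ (the paper only says $c_\alpha$ is a known constant), and all hyperparameters are illustrative assumptions; scipy's `levy_stable` draws the $\alpha$-stable increments.

```python
import numpy as np
from scipy.stats import levy_stable

# A minimal sketch of the FLA iteration, under assumed toy settings.

def grad_f(x):
    # Gradient of the toy non-convex f(x) = (1 + x**2)**0.75 + cos(2x).
    return 1.5 * x * (1.0 + x**2) ** (-0.25) - 2.0 * np.sin(2.0 * x)

def fla(alpha=1.6, eta=1e-2, beta=10.0, c_alpha=1.0, n_iters=5000, seed=0):
    rng = np.random.default_rng(seed)
    w = 3.0  # arbitrary initialization
    for _ in range(n_iters):
        # Symmetric alpha-stable increment Delta L^alpha_{k+1}; the second
        # argument is scipy's skewness parameter, not the inverse temperature.
        dL = levy_stable.rvs(alpha, 0.0, random_state=rng)
        w = w - eta * c_alpha * grad_f(w) + (eta / beta) ** (1.0 / alpha) * dL
    return w

print(fla())  # with alpha = 2 the noise is Gaussian, as in SGLD
```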

SLIDES 6-8

Method of Analysis

Define three stochastic processes:

$$dX_1(t) = -c_\alpha \nabla f(X_1(t^-))\,dt + \beta^{-1/\alpha}\,dL^\alpha(t),$$

$$dX_2(t) = -c_\alpha \sum_{j=0}^{\infty} \nabla f(X_2(j\eta))\,\mathbb{I}_{[j\eta,(j+1)\eta)}(t)\,dt + \beta^{-1/\alpha}\,dL^\alpha(t),$$

$$dX_3(t) = -\left[\mathcal{D}^{\alpha-2}_{x_i}\!\left(\phi(X_3(t^-))\,\frac{\partial f(X_3(t^-))}{\partial x_i}\right)\Big/\phi(X_3(t^-))\right]_{i=1}^{d}\,dt + \beta^{-1/\alpha}\,dL^\alpha(t).$$

- $\mathcal{D}$: Riesz fractional (directional) derivative
- $X_1$ is the continuous-time limit of the FLA algorithm
- $X_2$ is a linearly interpolated version of $W_k$: $X_2(k\eta) = W_k$ for all $k \in \mathbb{N}_+$
- $X_3$ admits $\pi \propto \exp(-\beta f(x))\,dx$ as its unique invariant distribution

Decompose the error $\mathbb{E} f(W_k) - f^*$ as:

$$\big[\mathbb{E} f(X_2(k\eta)) - \mathbb{E} f(X_1(k\eta))\big] + \big[\mathbb{E} f(X_1(k\eta)) - \mathbb{E} f(X_3(k\eta))\big] + \big[\mathbb{E} f(X_3(k\eta)) - \mathbb{E} f(\hat{W})\big] + \big[\mathbb{E} f(\hat{W}) - f^*\big]$$

- $\hat{W} \sim \pi \propto \exp(-\beta f(x))\,dx$
- Relate these terms to the Wasserstein distance between the processes (a numerical illustration follows below)
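The first bracket compares $W_k = X_2(k\eta)$ with the continuous-time process $X_1$ at the same horizon. Exact simulation of $X_1$ is unavailable, so as a rough proxy (my construction, not the paper's proof device), one can compare a coarse Euler chain against a much finer one and estimate the 1-D Wasserstein distance empirically; the toy objective, $c_\alpha = 1$, and all settings below are assumptions.

```python
import numpy as np
from scipy.stats import levy_stable, wasserstein_distance

# Monte Carlo illustration of the discretization gap in Wasserstein distance.
alpha, beta, c_alpha = 1.6, 10.0, 1.0
eta, k, refine, n_paths = 1e-2, 200, 20, 2000
grad_f = lambda x: 1.5 * x * (1 + x**2) ** (-0.25) - 2 * np.sin(2 * x)

def simulate(step, n_steps, rng):
    w = np.full(n_paths, 2.0)  # common initialization for all paths
    for _ in range(n_steps):
        dL = levy_stable.rvs(alpha, 0.0, size=n_paths, random_state=rng)
        w = w - step * c_alpha * grad_f(w) + (step / beta) ** (1 / alpha) * dL
    return w

rng = np.random.default_rng(0)
coarse = simulate(eta, k, rng)                  # ~ law of X_2(k*eta) = W_k
fine = simulate(eta / refine, k * refine, rng)  # proxy for law of X_1(k*eta)
print(wasserstein_distance(coarse, fine))
```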

SLIDES 9-11

Main Result

Main assumptions:

1) Hölder-continuous gradients: $c_\alpha\,\|\nabla f(x) - \nabla f(y)\| \le M\,\|x - y\|^{\gamma}$
2) Dissipativity: $c_\alpha\,\langle x, \nabla f(x)\rangle \ge m\,\|x\|^{1+\gamma} - b$

Theorem. For $0 < \eta < m/M^2$, there exists $C > 0$ such that:

$$\mathbb{E}[f(W_k)] - f^* \le C\Bigg(k^{1+\max\{\frac{1}{q},\,\gamma+\frac{\gamma}{q}\}}\,\eta^{\frac{1}{q}} + k^{1+\max\{\frac{1}{q},\,\gamma+\frac{\gamma}{q}\}}\,\eta^{\frac{1}{q}+\frac{\gamma}{\alpha q}}\,\frac{d}{\beta^{\frac{(q-1)\gamma}{\alpha q}}} + \frac{\beta b + d}{m}\,e^{-\frac{\lambda^* k \eta}{\beta}} + \frac{M c_\alpha^{-1}}{\beta^{\gamma+1}(1+\gamma)} + \frac{1}{\beta}\log\frac{\big(2e(b+\frac{d}{\beta})\big)^{\frac{d}{2}}\,\Gamma(\frac{d}{2}+1)\,\beta^{d}}{(dm)^{\frac{d}{2}}}\Bigg).$$

- Worse dependency on $\eta$ and $k$ than in the case $\alpha = 2$
- Requires a smaller $\eta$
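To get a feel for the theorem, the snippet below tabulates the three $k$- and $\eta$-dependent terms of the bound for placeholder constants ($C$, $\lambda^*$, and the problem constants are my illustrative choices; the theorem only asserts that such constants exist). It makes the "worse dependency on $\eta$ and $k$" tangible: the discretization terms grow with $k$, while the ergodicity term decays only as $k\eta$ grows, so $\eta$ must shrink as $k$ grows.

```python
import numpy as np

# Purely illustrative evaluation of the bound's k- and eta-dependent terms.
alpha, gamma, q = 1.6, 0.5, 1.2          # the assumptions require q < alpha
d, beta, m, b, lam = 2, 1.0, 1.0, 1.0, 0.1  # placeholder problem constants

def bound_terms(k, eta):
    ex = 1 + max(1 / q, gamma + gamma / q)
    discr1 = k**ex * eta ** (1 / q)
    discr2 = (k**ex * eta ** (1 / q + gamma / (alpha * q))
              * d / beta ** ((q - 1) * gamma / (alpha * q)))
    ergo = (beta * b + d) / m * np.exp(-lam * k * eta / beta)
    return discr1, discr2, ergo

for k, eta in [(10**3, 1e-6), (10**5, 1e-8), (10**7, 1e-10)]:
    print(k, eta, [f"{t:.3g}" for t in bound_terms(k, eta)])
```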

SLIDES 12-13

Additional Results

Posterior Sampling: sampling from $\pi \propto \exp(-\beta f(x))\,dx$

Stochastic Gradients: $f(x) := \frac{1}{n}\sum_{i=1}^{n} f^{(i)}(x)$, with $\nabla f$ approximated by $\nabla f_k(x) := \big(\sum_{i \in \Omega_k} \nabla f^{(i)}(x)\big)/n_s$ (a minibatch sketch follows below)

For more information/questions, come to our poster #198!
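A hedged sketch of the stochastic-gradient variant: plug the minibatch estimate $\nabla f_k$ into the FLA update. The component functions $f^{(i)}$ (a least-squares term plus a cosine ripple), $c_\alpha = 1$, and the hyperparameters are toy assumptions, not the paper's setup.

```python
import numpy as np
from scipy.stats import levy_stable

rng = np.random.default_rng(1)
n, ns, d = 1000, 32, 2                    # dataset size, |Omega_k|, dimension
A = rng.normal(size=(n, d))               # toy data defining the f_i
y = rng.normal(size=n)

def grad_f_batch(w, idx):
    # Average gradient over the minibatch Omega_k = idx, where
    # f_i(w) = 0.5*(A[i] @ w - y[i])**2 + sum(cos(w)).
    res = A[idx] @ w - y[idx]
    return A[idx].T @ res / len(idx) - np.sin(w)

alpha, eta, beta, c_alpha = 1.6, 1e-3, 10.0, 1.0
w = np.zeros(d)
for k in range(2000):
    idx = rng.choice(n, size=ns, replace=False)            # random Omega_k
    dL = levy_stable.rvs(alpha, 0.0, size=d, random_state=rng)
    w = w - eta * c_alpha * grad_f_batch(w, idx) + (eta / beta) ** (1 / alpha) * dL
print(w)
```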

NON-ASYMPTOTIC ANALYSIS OF FRACTIONAL LANGEVIN MONTE CARLO FOR NON-CONVEX OPTIMIZATION

Thanh Huy Nguyen¹, Umut Şimşekli¹, Gaël Richard¹. 1: LTCI, Télécom ParisTech, Université Paris-Saclay, 75013, Paris, France.

Supported by the French National Research Agency (ANR) as a part of the FBIMATRIX project (ANR-16-CE23-0014)

INTRODUCTION

Non-convex optimization problem: $\min_x f(x)$

Fractional Langevin Algorithm (FLA) [1]:

$$W_{k+1} = W_k - \eta\, c_\alpha \nabla f(W_k) + \left(\eta/\beta\right)^{1/\alpha} \Delta L^\alpha_{k+1}$$

- $\{\Delta L^\alpha_k\}_{k \in \mathbb{N}_+}$: $\alpha$-stable random variables
- $\alpha \in (1, 2]$: the characteristic index; $c_\alpha$: a known constant

[Figure: the $\alpha$-stable distribution and sample paths of $\alpha$-stable Lévy motion for $\alpha = 1.2,\ 1.6,\ 2.0$]

Generalizes Stochastic Gradient Langevin Dynamics [2] ($\alpha = 2$)

Strong links with SGD for Deep Neural Networks [3]; has better (empirical) generalization properties

Our Goal: Analyze the expected error $\mathbb{E}[f(W_k)] - f^\star$, where $f^\star = \min_x f(x)$

METHOD OF ANALYSIS

Define three stochastic processes:

$$dX_1(t) = b_1(X_1(t^-), \alpha)\,dt + \beta^{-1/\alpha}\,dL^\alpha(t),$$
$$dX_2(t) = b_2(X_2, \alpha)\,dt + \beta^{-1/\alpha}\,dL^\alpha(t),$$
$$dX_3(t) = b(X_3(t^-), \alpha)\,dt + \beta^{-1/\alpha}\,dL^\alpha(t),$$

with

$$b_1(x, \alpha) \triangleq -c_\alpha \nabla f(x), \qquad b_2(X_2, \alpha) \triangleq -c_\alpha \sum_{j=0}^{\infty} \nabla f(X_2(j\eta))\,\mathbb{I}_{[j\eta,(j+1)\eta)}(t),$$
$$(b(x, \alpha))_i \triangleq -\mathcal{D}^{\alpha-2}_{x_i}\!\left[\phi(x)\,\frac{\partial f(x)}{\partial x_i}\right]\Big/\phi(x).$$

- $\mathcal{D}$: Riesz fractional (directional) derivative
- $X_2(k\eta) = W_k$ for all $k \in \mathbb{N}_+$ (i.e., linear interpolation)
- $X_3$ targets $\pi \propto \exp(-\beta f(x))\,dx$

Decompose the error $\mathbb{E} f(W_k) - f^*$ as:

$$\big[\mathbb{E} f(X_2(k\eta)) - \mathbb{E} f(X_1(k\eta))\big] + \big[\mathbb{E} f(X_1(k\eta)) - \mathbb{E} f(X_3(k\eta))\big] + \big[\mathbb{E} f(X_3(k\eta)) - \mathbb{E} f(\hat{W})\big] + \big[\mathbb{E} f(\hat{W}) - f^*\big],$$

- $\hat{W} \sim \pi \propto \exp(-\beta f(x))\,dx$
- Relate these terms to the Wasserstein distance between the processes

ASSUMPTIONS & INTERMEDIATE RESULTS

Assumption 1: There exist constants $M > 0$ and $0 \le \gamma < 1$ such that $c_\alpha\,\|\nabla f(x) - \nabla f(y)\| \le M\,\|x - y\|^{\gamma}$ for all $x, y \in \mathbb{R}^d$.

Assumption 2: For some $m > 0$ and $b \ge 0$: $c_\alpha\,\langle x, \nabla f(x)\rangle \ge m\,\|x\|^{1+\gamma} - b$ for all $x \in \mathbb{R}^d$.

Assumption 3: There exist $p, q, p_1, q_1 > 0$ such that $q < \alpha$, $\gamma p < 1$, $\gamma q_1 < 1$, $(q-1)p_1 < 1$, and $1/p + 1/q = 1/p_1 + 1/q_1 = 1$.

Assumption 4: 1) For some $\bar\gamma \in [0, 1]$, $l_0 \ge 0$, $K_1, K_2 > 0$:
$$\frac{\langle b(x) - b(y),\, x - y\rangle}{\|x - y\|} \le \begin{cases} K_1\,\|x - y\|^{\bar\gamma}, & \|x - y\| < l_0, \\ -K_2\,\|x - y\|, & \|x - y\| \ge l_0. \end{cases}$$
2) For any coupling $P_t$ of $X_3(t)$ and $\hat W \sim \pi$, there exists $C^* > 0$ such that $\int \|X_3(t) - \hat W\|^{\hat\gamma}\,dP_t < C^*$ for all $t > 0$ and $\hat\gamma \in (0, \alpha)$.

Assumption 5: There exists $L > 0$ such that $L < m$ and $\sup_{x \in \mathbb{R}^d} \|c_\alpha \nabla f(x) + b(x, \alpha)\| \le L$.

Lemma 1. Let $V \sim \mu$, $W \sim \nu$, and $g \in C^1(\mathbb{R}^d, \mathbb{R})$. Assume that for some $c_1 > 0$, $c_2 \ge 0$, and $0 \le \gamma < 1$, $\|\nabla g(x)\| \le c_1\|x\|^{\gamma} + c_2$ for all $x \in \mathbb{R}^d$, and $\max\big\{(\mathbb{E}\|W\|^{\gamma p})^{\frac{1}{p}},\ (\mathbb{E}\|V\|^{\gamma p})^{\frac{1}{p}}\big\} < \infty$. Then we have:
$$\big|\mathbb{E}[g(V) - g(W)]\big| \le C\,\mathcal{W}_q(\mu, \nu),$$
for some $C > 0$.

Lemma 2. We have the following identity:
$$\mathcal{W}_\lambda(\mu^i_t, \mu^j_t) = \inf\left(\mathbb{E}\int_0^t \lambda\,\|\Delta X_{ij}(s)\|^{\lambda-2}\,\langle \Delta X_{ij}(s),\, \Delta b_{ij}(s^-)\rangle\,ds\right)^{1/\lambda},$$
where the infimum is taken over the couplings, $\Delta X_{ij}(s) \triangleq X_i(s) - X_j(s)$, and $\Delta b_{ij}(s^-) \triangleq b_i(X_i(s^-), \alpha) - b_j(X_j(s^-), \alpha)$.
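To make Assumptions 1 and 2 concrete, here is a small numerical sanity check under an assumed toy objective $f(x) = (1 + \|x\|^2)^{(1+\gamma)/2}$, which grows like $\|x\|^{1+\gamma}$; the choices $\gamma = 0.5$, $c_\alpha = 1$, and $m = b = 1$ are my illustrative guesses for this toy, not values from the paper.

```python
import numpy as np

# Sanity check of the Hölder-gradient and dissipativity assumptions
# for the toy f(x) = (1 + ||x||^2)**((1+g)/2), with g = 0.5.
g = 0.5

def grad_f(x):
    return (1 + g) * x * (1.0 + x @ x) ** ((g - 1) / 2)

rng = np.random.default_rng(0)
pairs = [(rng.normal(size=3) * s, rng.normal(size=3) * s)
         for s in (1.0, 10.0, 100.0) for _ in range(100)]

# Hölder continuity: ||grad f(x) - grad f(y)|| <= M * ||x - y||**g
M = max(np.linalg.norm(grad_f(x) - grad_f(y)) / np.linalg.norm(x - y) ** g
        for x, y in pairs)

# Dissipativity: <x, grad f(x)> >= m * ||x||**(1+g) - b, with m = b = 1
ok = all(x @ grad_f(x) >= np.linalg.norm(x) ** (1 + g) - 1.0 for x, _ in pairs)

print(f"empirical Hölder constant <= {M:.2f}; dissipativity holds: {ok}")
```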

MAIN RESULT

Theorem 1. For $0 < \eta < m/M^2$, there exists $C > 0$ such that:

$$\mathbb{E}[f(W_k)] - f^* \le C\Bigg(k^{1+\max\{\frac{1}{q},\,\gamma+\frac{\gamma}{q}\}}\,\eta^{\frac{1}{q}} + k^{1+\max\{\frac{1}{q},\,\gamma+\frac{\gamma}{q}\}}\,\eta^{\frac{1}{q}+\frac{\gamma}{\alpha q}}\,\frac{d}{\beta^{\frac{(q-1)\gamma}{\alpha q}}} + \frac{\beta b + d}{m}\,e^{-\frac{\lambda^* k \eta}{\beta}} + \frac{M c_\alpha^{-1}}{\beta^{\gamma+1}(1+\gamma)} + \frac{1}{\beta}\log\frac{\big(2e(b+\frac{d}{\beta})\big)^{\frac{d}{2}}\,\Gamma(\frac{d}{2}+1)\,\beta^{d}}{(dm)^{\frac{d}{2}}}\Bigg).$$

- Worse dependency on $\eta$ and $k$ than in the case $\alpha = 2$
- Requires a smaller $\eta$

ADDITIONAL RESULTS

Posterior Sampling: If our aim is only to draw samples from the distribution $\pi$, we have the following result.

Corollary 1. For $0 < \eta \le m/M^2$, the following bound holds:
$$\mathcal{W}_q(\mu^2_t, \pi) \le C\left(k^{\frac{\max\{2,\,q+\gamma\}}{q}}\,\eta^{\frac{1}{q}} + k^{\frac{\max\{2,\,q+\gamma\}}{q}}\,\eta^{\frac{1}{q}+\frac{\gamma}{q\alpha}}\,\beta^{-\frac{\gamma(q-1)}{q\alpha}}\,d^{\frac{1}{q}} + \beta\,e^{-\frac{\lambda^* k \eta}{\beta}}\right).$$

Stochastic Gradients: Assume $f(x) := \frac{1}{n}\sum_{i=1}^{n} f^{(i)}(x)$.

- Approximate $\nabla f$ by $\nabla f_k(x) := \big(\sum_{i \in \Omega_k} \nabla f^{(i)}(x)\big)/n_s$
- $\Omega_k$ is a random subset of $\{1, \ldots, n\}$ with $|\Omega_k| = n_s \ll n$.

Theorem 2. If there exists $\delta \in [0, 1)$ such that, for any $k$,
$$\mathbb{E}_{\Omega_k}\big\|c_\alpha(\nabla f(x) - \nabla f_k(x))\big\|^{q_1} \le \delta^{q_1} M^{q_1}\,\|x\|^{\gamma q_1}, \quad \forall x \in \mathbb{R}^d,$$
then we have the following bound:
$$\mathcal{W}^q_q(\mu^1_t, \mu^2_t) \le C\,(1+\delta)\left(k^2\eta + k^2\eta^{1+\gamma/\alpha}\,\beta^{-\gamma(q-1)/\alpha}\,d\right).$$
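A rough end-to-end check of the posterior-sampling claim in 1-D (my toy setup, not the paper's experiment): run FLA, then compare the empirical law of its iterates against inverse-CDF samples from $\pi \propto \exp(-\beta f)$ using the 1-Wasserstein distance. Per Corollary 1 the distance should be small once $k\eta$ is large and $\eta$ is small; with $\alpha < 2$ some bias remains, since only $X_3$, not $X_1$, leaves $\pi$ exactly invariant. The objective, $c_\alpha = 1$, and all hyperparameters are illustrative assumptions.

```python
import numpy as np
from scipy.stats import levy_stable, wasserstein_distance

alpha, beta, c_alpha, eta = 1.9, 2.0, 1.0, 1e-2
f = lambda x: (1 + x**2) ** 0.75 + np.cos(2 * x)          # toy non-convex f
grad_f = lambda x: 1.5 * x * (1 + x**2) ** (-0.25) - 2 * np.sin(2 * x)

# Reference samples from pi by inverting its CDF on a grid.
grid = np.linspace(-10, 10, 4001)
dens = np.exp(-beta * f(grid))
cdf = np.cumsum(dens); cdf /= cdf[-1]
rng = np.random.default_rng(0)
ref = np.interp(rng.uniform(size=5000), cdf, grid)

# FLA chain: discard a burn-in prefix, keep the rest as (correlated) samples.
n_iters, burn = 20000, 5000
dLs = levy_stable.rvs(alpha, 0.0, size=n_iters, random_state=rng)
w, chain = 0.0, []
for k in range(n_iters):
    w = w - eta * c_alpha * grad_f(w) + (eta / beta) ** (1 / alpha) * dLs[k]
    if k >= burn:
        chain.append(w)

print(wasserstein_distance(np.asarray(chain), ref))
```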

REFERENCES

[1] Şimşekli, U. "Fractional Langevin Monte Carlo: Exploring Lévy Driven Stochastic Differential Equations for Markov Chain Monte Carlo." ICML 2017.
[2] Raginsky, M., Rakhlin, A., Telgarsky, M. "Non-Convex Learning via Stochastic Gradient Langevin Dynamics: A Nonasymptotic Analysis." COLT 2017.
[3] Şimşekli, U., Sagun, L., Gürbüzbalaban, M. "A Tail-Index Analysis of Stochastic Gradient Noise in Deep Neural Networks." ICML 2019.
