SLIDE 1

Convergence and Efficiency of Adaptive Importance Sampling techniques with partial biasing

Gersende Fort
Institut de Mathématiques de Toulouse, CNRS, France

Joint work with B. Jourdain, T. Lelièvre and G. Stoltz

Talk based on the paper: G. Fort, B. Jourdain, T. Lelièvre, G. Stoltz, Convergence and Efficiency of Adaptive Importance Sampling techniques with partial biasing, J. Stat. Phys. (2018)

SLIDE 2

The goal

Assumption: let π dµ be a probability distribution on X ⊆ R^p, assumed to be highly metastable and (possibly) known up to a normalizing constant.

Question 1: how to design a Monte Carlo sampler for an approximation of ∫_X f π dµ?

Question 2: how to compute the free energy −ln ∫_{X_i} π dµ, for X_i ⊂ X?

In this talk: an approach by a Free Energy-based Adaptive Importance Sampling technique, which generalizes Wang-Landau, Self-Healing Umbrella Sampling and Well-tempered Metadynamics.

SLIDE 3

The intuition (1/3) - a family of auxiliary distributions

π(x) = (1/Z) exp(−V(x))

◮ The auxiliary distribution: choose a partition X_1, …, X_d of X.

θ_{*,i} := ∫_{X_i} π dµ

SLIDE 4

The intuition (1/3) - a family of auxiliary distributions

π(x) = (1/Z) exp(−V(x))

◮ The auxiliary distribution: choose a partition X_1, …, X_d of X and, for positive weights* θ = (θ_1, …, θ_d), set

π_θ(x) ∝ Σ_{i=1}^d 1_{X_i}(x) exp(−V(x) − ln θ_i)

◮ Property 1: ∀i ∈ {1, …, d}, ∫_{X_i} π_θ dµ ∝ θ_{*,i}/θ_i, where θ_{*,i} := ∫_{X_i} π dµ.

◮ Property 2: ∀i ∈ {1, …, d}, ∫_{X_i} π_{θ_*} dµ = 1/d.

* θ_i ∈ (0, 1), Σ_{i=1}^d θ_i = 1
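Properties 1 and 2 are easy to sanity-check numerically. Below is a minimal sketch on a discrete toy space, where µ is the counting measure so all integrals are sums; the setup, names and constants are mine, not from the talk.

```python
# Numerical check of Properties 1 and 2 on a discrete toy model
# (hypothetical setup): X = {0,...,99}, counting measure, d = 4 strata.
import numpy as np

rng = np.random.default_rng(0)
d, m = 4, 25                       # d strata of m points each
V = rng.normal(size=d * m)         # arbitrary potential values
pi = np.exp(-V); pi /= pi.sum()    # target pi, normalized
strata = np.repeat(np.arange(d), m)

theta_star = np.array([pi[strata == i].sum() for i in range(d)])

def pi_theta(theta):
    """pi_theta(x) proportional to exp(-V(x) - ln theta_i) on stratum i."""
    w = np.exp(-V) / theta[strata]
    return w / w.sum()

# Property 1: mass of stratum i under pi_theta is proportional to theta_star_i / theta_i
theta = rng.dirichlet(np.ones(d))
mass = np.array([pi_theta(theta)[strata == i].sum() for i in range(d)])
ratio = mass / (theta_star / theta)
assert np.allclose(ratio, ratio[0])      # the ratio is constant in i

# Property 2: under pi_{theta_*} every stratum has mass 1/d
mass_star = np.array([pi_theta(theta_star)[strata == i].sum() for i in range(d)])
assert np.allclose(mass_star, 1.0 / d)
print(mass_star)
```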

SLIDE 5

Intuition (2/3) - How to choose θ?

π_θ(x) ∝ Σ_{i=1}^d 1_{X_i}(x) exp(−V(x) − ln θ_i),  θ_{*,i} := ∫_{X_i} π dµ

◮ If θ = θ_*: efficient exploration under π_{θ_*} (each subset X_i has the same weight under π_{θ_*}), but poor ESS. The IS approximation becomes

∫_X f π dµ ≈ (d/N) Σ_{n=1}^N Σ_{i=1}^d 1_{X_i}(X_n) θ_{*,i} f(X_n)

◮ Choose ρ ∈ (0, 1) and set θ_*^ρ ∝ (θ_{*,1}^ρ, …, θ_{*,d}^ρ):

∫_X f π dµ ≈ (Σ_{i=1}^d θ_{*,i}^{1−ρ}) · (1/N) Σ_{n=1}^N Σ_{i=1}^d 1_{X_i}(X_n) θ_{*,i}^ρ f(X_n)

◮ But θ_* is unknown.
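To make the two displays above concrete, here is a sketch on the same kind of discrete toy, with θ_* assumed known, drawing i.i.d. samples from π_{θ_*^ρ} and comparing the partially biased IS estimate of ∫ f π dµ with the exact value; `is_estimate`, the test function and all constants are my own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 4, 25
V = rng.normal(size=d * m)
pi = np.exp(-V); pi /= pi.sum()
strata = np.repeat(np.arange(d), m)
theta_star = np.array([pi[strata == i].sum() for i in range(d)])
f = np.sin(np.linspace(0, 3, d * m))        # an arbitrary test function
exact = (f * pi).sum()

def is_estimate(rho, N=100_000):
    """Partially biased IS: sample from pi_{theta_*^rho}, reweight by
    (sum_j theta_*_j^{1-rho}) * theta_*_{i(x)}^rho, as in the display above."""
    q = np.exp(-V) / theta_star[strata] ** rho
    q /= q.sum()                            # proposal pi_{theta_*^rho}
    X = rng.choice(len(q), size=N, p=q)
    w = theta_star[strata[X]] ** rho * (theta_star ** (1 - rho)).sum()
    ef = w.mean() ** 2 / (w ** 2).mean()    # normalized ESS (cf. Slide 20)
    return (w * f[X]).mean(), ef

for rho in (0.0, 0.5, 1.0):
    est, ef = is_estimate(rho)
    print(f"rho={rho}: estimate={est:+.4f} (exact {exact:+.4f}), EF={ef:.3f}")
```

At ρ = 0 the weights are constant (EF = 1) but there is no flattening of the metastability; at ρ = 1 the exploration is flat across strata but the weights θ_{*,i} d degrade the ESS: this is the trade-off the partial biasing parameter ρ interpolates.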

SLIDE 6

Intuition (3/3) - Estimation of the free energy

θ_{*,i} := ∫_{X_i} π dµ ≈ θ_{n,i} := C_{n,i} / Σ_{j=1}^d C_{n,j}   ("normalized count of the visits to X_i")

◮ Exact sampling. If X_{n+1} ∼ π dµ: C_{n+1,i} = C_{n,i} + 1_{X_i}(X_{n+1}).

This yields, for all i = 1, …, d,

C_{n+1,i} = Σ_{k=1}^{n+1} 1_{X_i}(X_k),  S_{n+1} := Σ_{i=1}^d C_{n+1,i} = n + 1 = O(n),

θ_{n+1,i} = (1/(n+1)) Σ_{k=1}^{n+1} 1_{X_i}(X_k) = θ_{n,i} + (1/(n+1)) (1_{X_i}(X_{n+1}) − θ_{n,i}),

i.e. a Stochastic Approximation scheme with learning rate 1/S_{n+1} and limiting point θ_{*,i}.
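Writing out the one-step algebra behind the last display (using C_{n,i} = S_n θ_{n,i} and S_{n+1} = S_n + 1 = n + 1):

```latex
\theta_{n+1,i} \;=\; \frac{C_{n+1,i}}{S_{n+1}}
\;=\; \frac{S_n\,\theta_{n,i} + \mathbf{1}_{X_i}(X_{n+1})}{S_n + 1}
\;=\; \theta_{n,i} + \frac{1}{n+1}\,\bigl(\mathbf{1}_{X_i}(X_{n+1}) - \theta_{n,i}\bigr).
```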

SLIDE 7

Intuition (3/3) - Estimation of the free energy (continued)

◮ IS sampling. If X_{n+1} ∼ π_θ dµ: C_{n+1,i} = C_{n,i} + γ θ_i 1_{X_i}(X_{n+1}).

This yields, for all i = 1, …, d,

C_{n+1,i} = γ θ_i Σ_{k=1}^{n+1} 1_{X_i}(X_k),  S_{n+1} := Σ_{i=1}^d C_{n+1,i} = O_{w.p.1}(n),

θ_{n+1,i} = θ_{n,i} + (γ/S_{n+1}) H_i(θ_n, X_{n+1}) + O(1/n²),

i.e. a Stochastic Approximation scheme with learning rate γ/S_{n+1} and limiting point θ_{*,i}.
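The mean field H_i can be made explicit by the same one-step expansion as on the previous slide; this is my own sketch, exact as a one-step identity (the slide's O(1/n²) appears once 1/S_{n+1} is replaced by its deterministic equivalent), and written up to the paper's exact normalization of H:

```latex
\theta_{n+1,i} \;=\; \theta_{n,i} + \frac{\gamma}{S_{n+1}}
\Bigl(\underbrace{\theta_i\,\mathbf{1}_{X_i}(X_{n+1})
  - \theta_{n,i}\sum_{j=1}^d \theta_j\,\mathbf{1}_{X_j}(X_{n+1})}_{=:\,H_i(\theta_n,\,X_{n+1})}\Bigr).
```

Since E_{π_θ}[θ_j 1_{X_j}] ∝ θ_{*,j} (Property 1), the mean field vanishes at θ_n ∝ θ_*, which is why θ_* is the limiting point.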

SLIDE 8

Intuition (3/3) - Estimation of the free energy (continued)

◮ IS sampling with a leverage effect. If X_{n+1} ∼ π_θ dµ:

C_{n+1,i} = C_{n,i} + γ (S_n / g(S_n)) θ_i 1_{X_i}(X_{n+1}),  with lim_{s→+∞} g(s) = +∞ and lim inf_{s→+∞} s/g(s) > 0.

This yields S_{n+1} ↑ +∞,

(S_{n+1} − S_n)/S_n = (γ/g(S_n)) Σ_{i=1}^d θ_i 1_{X_i}(X_{n+1}),

θ_{n+1,i} = θ_{n,i} + (γ/g(S_n)) H_i(θ_n, X_{n+1}) + O(γ²/g²(S_n)),

i.e. a Stochastic Approximation scheme with learning rate γ/g(S_n) and limiting point θ_{*,i}.

SLIDE 9

Intuition (3/3) - Estimation of the free energy (continued)

◮ If g(s) = (ln(1 + s))^{α/(1−α)}, the learning rate is O(n^{−α}).

◮ Key property: if X_{n+1} ∈ X_i, then for any j ≠ i, π_{θ_{n+1}}(X_j) > π_{θ_n}(X_j): the probability of stratum #j increases.
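A heuristic for the O(n^{−α}) rate (my own sketch: treat S_n as the solution of the ODE matching the mean drift from the previous slide, with some constant c > 0):

```latex
\frac{d}{dn}\ln S \approx \frac{c\,\gamma}{g(S)}
 = \frac{c\,\gamma}{(\ln(1+S))^{\alpha/(1-\alpha)}}
\;\Longrightarrow\; (\ln S_n)^{1/(1-\alpha)} \approx \frac{c\,\gamma\,n}{1-\alpha}
\;\Longrightarrow\; g(S_n) = (\ln S_n)^{\alpha/(1-\alpha)} \propto n^{\alpha},
```

so the learning rate γ/g(S_n) is of order n^{−α}; the constant c is identified at the limit in Result 1 of Slide 13.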

SLIDE 10

The algorithm: Adaptive IS with partial biasing

◮ Fix ρ ∈ (0, 1) and α ∈ (1/2, 1). Set g(s) := (ln(1 + s))^{α/(1−α)}.

◮ Initialisation: X_0 ∈ X, a positive weight vector θ_0.

◮ Repeat, for n = 0, …, N − 1:
- sample X_{n+1} ∼ P_{θ_n^ρ}(X_n, ·), a Markov kernel invariant w.r.t. π_{θ_n^ρ} dµ;
- compute C_{n+1,i} = C_{n,i} + γ (S_n/g(S_n)) θ_{n,i}^ρ 1_{X_i}(X_{n+1}),  S_{n+1} = Σ_{i=1}^d C_{n+1,i},  θ_{n+1,i} = C_{n+1,i}/S_{n+1}.

◮ Return the sequence (θ_n)_n of estimates of θ_*, and the IS estimator (a code sketch follows below)

∫ f π dµ ≈ (1/N) Σ_{n=1}^N (Σ_{i=1}^d θ_{n−1,i}^{1−ρ}) (Σ_{i=1}^d 1_{X_i}(X_n) θ_{n−1,i}^ρ) f(X_n)
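A self-contained sketch of the full loop, entirely my own toy implementation: a 1D double-well potential on a grid with counting measure, and a symmetric random-walk Hastings-Metropolis kernel as in the assumptions of Slide 12. All names and constants are illustrative, not from the talk.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy setting (mine): 1D grid, double-well potential, d strata along x.
xs = np.linspace(-2.5, 2.5, 500)
V = 4.0 * (xs**2 - 1.0) ** 2                 # double-well potential
d = 5
strata = np.minimum((d * (xs - xs[0]) / (xs[-1] - xs[0])).astype(int), d - 1)

alpha, rho, gamma = 0.8, 0.5, 1.0            # design parameters of Slide 10
g = lambda s: np.log1p(s) ** (alpha / (1.0 - alpha))

def kernel_step(x, log_theta_rho):
    """One Hastings-Metropolis step, symmetric random-walk proposal,
    invariant w.r.t. pi_{theta^rho} prop. to exp(-V(x) - rho*ln theta_{i(x)})."""
    y = x + int(rng.integers(-20, 21))
    if y < 0 or y >= len(xs):
        return x                             # reject out-of-range proposals
    log_ratio = (-V[y] - log_theta_rho[strata[y]]) - (-V[x] - log_theta_rho[strata[x]])
    return y if np.log(rng.random()) < log_ratio else x

N = 200_000
x = 150                                      # X_0: start near the left minimum
C = np.full(d, 1.0)                          # positive initial weights theta_0
S = C.sum()
theta = C / S
is_sum = 0.0                                 # accumulates the IS estimator
f = xs                                       # test function f(x) = x

for n in range(N):
    tr = theta ** rho                        # theta_n^rho
    x = kernel_step(x, np.log(tr))           # X_{n+1} ~ P_{theta_n^rho}(X_n, .)
    # partially biased IS weight of X_{n+1} (Slide 10's estimator)
    is_sum += (theta ** (1.0 - rho)).sum() * tr[strata[x]] * f[x]
    # leverage update of the counts, then renormalize
    C[strata[x]] += gamma * (S / g(S)) * tr[strata[x]]
    S = C.sum()
    theta = C / S

pi = np.exp(-V); pi /= pi.sum()
theta_star = np.bincount(strata, weights=pi)
print("theta_N    :", np.round(theta, 3))
print("theta_star :", np.round(theta_star, 3))
print("IS estimate:", is_sum / N, "  exact:", float((pi * xs).sum()))
```

Note that the weight used for X_{n+1} is computed from θ_n, i.e. the value before the count update, matching the θ_{n−1,i} indexing of the returned estimator.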

SLIDE 11

Convergence results

1. The limiting behavior of the estimates (θ_n)_n
2. The limiting distribution of X_n
3. The limiting behavior of the IS estimator

SLIDE 12

Assumptions

1. On the target density and the strata X_i: sup_X π < ∞ and min_{1≤i≤d} θ_{*,i} > 0.
2. On the kernels P_θ: Hastings-Metropolis kernel, with a symmetric proposal q(x, y) dµ(y) such that inf_{X²} q > 0; for any compact subset K, there exist C and λ ∈ (0, 1) s.t. sup_{θ∈K} ‖P_θ^n(x, ·) − π_θ‖_TV ≤ C λ^n.
3. ρ ∈ (0, 1).
4. g(s) = (ln(1 + s))^{α/(1−α)} with α ∈ (1/2, 1).

SLIDE 13

Convergence results: on the sequence θ_n

◮ Recall: θ_{n+1} = θ_n + γ_{n+1} H(X_{n+1}, θ_n) + γ_{n+1}² Λ_{n+1}, with γ_{n+1} := γ/g(S_n), where
- (γ_n)_n is a positive random learning rate,
- sup_n Λ_{n+1} is bounded a.s.,
- ∫ H(·, θ) π_{θ^ρ} dµ = 0 iff θ = θ_*.

◮ Result 1: lim_n γ_n n^α = (1 − α)^α γ^{1−α} (Σ_{j=1}^d θ_{*,j}^{1−ρ})^α a.s.

◮ Result 2: lim_n θ_n = θ_* a.s.

SLIDE 14

Convergence results - on the samples X_n

◮ Recall: X_{n+1} ∼ P_{θ_n^ρ}(X_n, ·), with π_θ P_θ = π_θ.

◮ Result 1: for any bounded function f, lim_n E[f(X_n)] = ∫ f π_{θ_*^ρ} dµ.

◮ Result 2: for any bounded function f, lim_N (1/N) Σ_{n=1}^N f(X_n) = ∫ f π_{θ_*^ρ} dµ a.s.

SLIDE 15

Convergence results - on the IS estimator

◮ Result 1: for any bounded function f,

lim_N E[ (1/N) Σ_{n=1}^N f(X_n) (Σ_{j=1}^d θ_{n−1,j}^ρ 1_{X_j}(X_n)) (Σ_{j=1}^d θ_{n−1,j}^{1−ρ}) ] = ∫ f π dµ.

◮ Result 2: for any bounded function f, a.s.,

lim_N (1/N) Σ_{n=1}^N f(X_n) (Σ_{j=1}^d θ_{n−1,j}^ρ 1_{X_j}(X_n)) (Σ_{j=1}^d θ_{n−1,j}^{1−ρ}) = ∫ f π dµ.

SLIDE 16

Is it new?

◮ Theoretical contribution:
- Self-Healing Umbrella Sampling: ρ = 1 (no biasing intensity), g(s) = s (also covered by the theory; not detailed here).
- Well-tempered Metadynamics: ρ ∈ (0, 1) (biasing intensity), g(s) = s^{1−ρ} (also covered by the theory; not detailed here).

◮ Methodological contribution: the introduction of a function g(s) in the updating scheme of the estimator θ_n, allowing a random learning rate γ_n ∼ O_{w.p.1}(n^{−α}) for α ∈ (1/2, 1).

SLIDE 17

Is there a gain in such a self-tuned and partially biased algorithm?

[Figures: the toy example at β = 1 and β = 5]

Make the metastability larger by increasing β.

SLIDE 18

Case ρ ∈ [0, 1) (ρ = a on the plots) and α ∈ (1/2, 1) ⇒ γ_n = O_{w.p.1}(n^{−α})

Figure: exit time as a function of β, for ρ = a ∈ {0.2, 0.4, 0.6, 0.8, 1}. Left: α = 0.8. Right: α = 0.6.

Start from the left mode and compute the exit time T, i.e. the time to reach X_{n,1} > 1.
- T increases when β increases.
- For fixed β and ρ: T decreases when α decreases; a slowly decreasing learning rate is better.
- For fixed β and α: T decreases when ρ increases; a small (or no) bias is better.
- Linear fit, with a slope independent of ρ: ln T = c_ρ + (1 − α)^{−1} ln β.

SLIDE 19

Case ρ ∈ (0, 1) (ρ = a on the plots) and α = 1 ⇒ γ_n = O_{w.p.1}(n^{−1}) (the Well-tempered Metadynamics case)

Figure: Left: exit times for many values of ρ. Right: associated slopes, fitted by x ↦ 2.43(1 − x).

Exit time T:
- For fixed β: T decreases when ρ increases; a small bias is better.
- Linear fit: ln T = c + 2.43 (1 − ρ) β.

SLIDE 20

Normalized Effective Sample Size (EF)

Case γ_n = O(1/n^α) for α ∈ (1/2, 1), ρ ∈ [0, 1).

Figure: efficiency factors ρ ↦ EF(ρ) for various values of β (β ∈ {1, 2, 3, 5, 10}).

EF = (N^{−1} Σ_{n=1}^N w(X_n))² / (N^{−1} Σ_{n=1}^N w²(X_n)) ∈ [0, 1]

By definition, EF = 1 when the weights are constant. For fixed β, EF increases when ρ decreases; a strong bias is better.
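The EF above is straightforward to compute from stored IS weights; a small helper (mine, not from the talk), with the two boundary behaviors it should exhibit:

```python
import numpy as np

def efficiency_factor(w: np.ndarray) -> float:
    """Normalized effective sample size: (mean w)^2 / mean(w^2), in [0, 1]."""
    return float(np.mean(w) ** 2 / np.mean(w ** 2))

print(efficiency_factor(np.ones(1000)))     # constant weights: EF = 1.0
print(efficiency_factor(                    # heterogeneous weights: EF < 1
    np.random.default_rng(3).exponential(size=1000)))
```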

SLIDE 21

Conclusion

A new algorithm which:
- estimates the free energy of π by a Stochastic Approximation algorithm, where the stepsize sequence {γ_n, n ≥ 0} is tuned on the fly;
- provides an approximation of π by a set of weighted points with a controlled discrepancy of the weights;
- requires two design parameters (α, ρ) to be fixed by the user:
  · transient phase: ρ close to 1 and α close to 1/2;
  · at convergence: ρ close to 0 and α close to 1.

In the transient phase: far more efficient than Well-Tempered Metadynamics, SHUS and WL.