SLIDE 1
Thermodynamic Formalism and Uncertainty Quantification
Luc Rey-Bellet, University of Massachusetts Amherst
Quantissima III, Venice, August 2019
Work supported by NSF and AFOSR
SLIDE 2 Collaborators on related projects
- Paul Dupuis (Brown University),
- Markos Katsoulakis (UMass Amherst)
- Sung-Ha Hwang (KAIST)
- Peter Plechac (U. of Delaware)
- Yannis Pantazis (FORTH Crete)
- Jeremiah Birrell (UMass Amherst)
- Panagiota Birmpa (UMass Amherst)
- Konstantinos Gourgoulias (UMass Amherst)
- Jinchao Feng (UMass Amherst)
- Jie Wang (UMass Amherst)
- Sosung Baek (KAIST)
SLIDE 3
Some references:
[1] K. Chowdhary and P. Dupuis: Distinguishing and integrating aleatoric and epistemic variation in uncertainty quantification. ESAIM: M2AN, 47:635-662, 2013.
[2] R. Atar, K. Chowdhary, and P. Dupuis: Robust bounds on risk-sensitive functionals via Rényi divergence. SIAM/ASA Journal on UQ, 3:18-33, 2015.
[3] P. Dupuis, M. A. Katsoulakis, Y. Pantazis, and P. Plecháč: Path-Space Information Bounds for Uncertainty Quantification and Sensitivity Analysis of Stochastic Dynamics. SIAM/ASA Journal on UQ, 4(1):80-111, 2016.
[4] M. Katsoulakis, L. Rey-Bellet, and J. Wang: Scalable Information Inequalities for Uncertainty Quantification. J. Comp. Phys. 336, 513-545 (2017).
[5] K. Gourgoulias, M. Katsoulakis, L. Rey-Bellet, and J. Wang: How biased is your model? Concentration inequalities, information and model bias. To be published in IEEE Trans.
SLIDE 4
[6] P. Dupuis, M. Katsoulakis, Y. Pantazis, and L. Rey-Bellet: Sensitivity Analysis for Rare Events based on Rényi Divergence. To be published in Ann. Appl. Prob.
[7] J. Birrell and L. Rey-Bellet: Uncertainty Quantification for Markov Processes via Variational Principles and Functional Inequalities. Submitted. arXiv:1812.05174
[8] J. Birrell and L. Rey-Bellet: Concentration Inequalities and Performance Guarantees for Hypocoercive Samplers. Submitted. arXiv:1907.11973
[9] J. Birrell, M. Katsoulakis, and L. Rey-Bellet: Robustness of Dynamical Quantities of Interest via Goal-Oriented Information
[10] S. Baek, S.-H. Hwang, and L. Rey-Bellet: Thermodynamical Formalism and Uncertainty Quantification. In preparation.
- and several more to come.
SLIDE 5
UQ framework: Baseline model
→ Baseline model P (= probability measure on X). Think of it as a (tractable) model you use to compute or do analysis. Maybe obtained after inference and/or model reduction.
Most interesting: you should think of P as high-dimensional, e.g., Pν is the distribution of a process {X_t}_{0≤t<∞} with X_0 ∼ ν, or P is a Gibbs measure on Ω^{Z^d}.
In any case, we think there are possibly large uncertainties in the model (model-form uncertainties):
P IS NOT TO BE TRUSTED!!
SLIDE 6
UQ framework: Quantities of interest
Specific observables/statistics/quantities of interest = QoI
- EP[f] (expectation)
- VarP(f) (variance), or CovP(f,g)/√(VarP(f)VarP(g)) (correlation)
- ΛP,f(c) = log EP[e^{cf}] (risk-sensitive functional)
- log P(A) ∼ log e^{−I(A)/ε} (probability of some rare event)
- or maybe path-space QoI:
  EPν[∫_0^τ f(X_t) dt], where τ is a stopping time;
  (1/T)∫_0^T f(X_s) ds, that is, ergodic averages;
  EPν[∫_0^∞ e^{−λs} f(X_s) ds], that is, discounted observables.
SLIDE 7
UQ framework: Non-parametric stress tests
→ Family of alternative models Q. Think of it as describing the true but "unknowable" or partially known models. Set
Qη = {Q : Q is η-"close" to P}.
Given a QoI f, can one find uncertainty bounds or performance guarantees
inf_{Q∈Qη} EQ[f] ≤ EP[f] ≤ sup_{Q∈Qη} EQ[f]?
and similarly for other quantities? The bounds should be tight and computable (numerically or analytically).
→ Robustness, cf. book by Hansen (Nobel 2013) and Sargent (Nobel 2011).
→ Stress tests in operations research, finance, etc....
SLIDE 8
UQ framework: distances and divergences
Which measure of distance, or pseudo-distance (divergence), should we use?
→ Use information theory concepts to measure the information loss between Q and P.
- Relative entropy (a.k.a. Kullback-Leibler divergence):
R(Q||P) = EQ[log(dQ/dP)]
- Relative Rényi entropy (a.k.a. Rényi divergence): for α ≠ 0, 1,
Rα(Q||P) = (1/(α(α−1))) log EP[(dQ/dP)^α]
with
Rα(Q||P) → R(Q||P) as α → 1,  Rα(Q||P) → R(P||Q) as α → 0.
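The two limits above can be checked numerically on a finite state space; a minimal sketch (the distributions p, q are illustrative, not from the talk):

```python
import numpy as np

# R_alpha(Q||P) = 1/(alpha(alpha-1)) * log E_P[(dQ/dP)^alpha];
# alpha -> 1 recovers R(Q||P), alpha -> 0 recovers R(P||Q).
def kl(q, p):
    return float(np.sum(q * np.log(q / p)))

def renyi(q, p, alpha):
    return float(np.log(np.sum(p * (q / p) ** alpha)) / (alpha * (alpha - 1)))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

print(renyi(q, p, 1 + 1e-6), kl(q, p))   # alpha near 1: close to R(Q||P)
print(renyi(q, p, 1e-6), kl(p, q))       # alpha near 0: close to R(P||Q)
```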
SLIDE 9
UQ framework: distances and divergences
- Scalability: if Q^{0:T} and P^{0:T} are the distributions of the process restricted to the time window 0 to T then, typically,
Rα(Q^{0:T}||P^{0:T}) = O(T) as T → ∞,
i.e., information is additive. For the relative entropy we have the chain rule, which is even better (not asymptotic in T).
- Information processing inequality: if F is a sub-σ-algebra, then
Rα(Q|_F || P|_F) ≤ Rα(Q||P).
- What is the right divergence for the QoI?
- Not the whole story:
→ Heavy-tailed observables may require other entropies (f-divergences).
→ Wasserstein-type distances are needed if Q is not absolutely continuous w.r.t. P....
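The additivity of information can be illustrated with the simplest "process", IID sampling, where the product structure gives R(Q^{⊗N}||P^{⊗N}) = N·R(Q||P) exactly (a toy stand-in for the O(T) path-space scaling; the distributions are illustrative):

```python
import numpy as np

# Additivity of relative entropy over product measures:
# R(Q^N || P^N) = N * R(Q || P) for N iid copies.
def kl(q, p):
    return float(np.sum(q * np.log(q / p)))

def product_measure(p, n):
    out = p
    for _ in range(n - 1):
        out = np.outer(out, p).ravel()   # n-fold product, flattened
    return out

p = np.array([0.6, 0.4])
q = np.array([0.5, 0.5])
for n in (1, 2, 5):
    print(n, kl(product_measure(q, n), product_measure(p, n)), n * kl(q, p))
```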
SLIDE 10
What is wrong with CKP? Scalability
Csiszár-Kullback-Pinsker:
|EQ[f] − EP[f]| ≤ ‖f − EP[f]‖_∞ √(2R(Q||P))
Take e.g. Markov measures P = P^{0:T} and Q = Q^{0:T} and F_T = (1/T)∫_0^T f(X_s) ds. Then ‖F_T‖_∞ = ‖f‖_∞ = O(1) and R(Q^{0:T}||P^{0:T}) = O(T), and so
|E_{Q^{0:T}}[F_T] − E_{P^{0:T}}[F_T]| ≤ √(2R(Q^{0:T}||P^{0:T})) ‖F_T − EP[F_T]‖_∞ = O(√T).
CKP does not scale correctly! Note though that Var_{P^{0:T}}[F_T] = O(1/T), so one would need the variance instead of the sup norm.
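The failure to scale is already visible for IID sampling, a minimal sketch (illustrative distributions; the CKP bound uses ‖f‖_∞ as the sup-norm factor): the true bias of the empirical average is O(1) in N, while the CKP bound grows like √N.

```python
import numpy as np

# CKP bound vs actual bias for the empirical average F_N of N iid samples:
# |E_Q[F_N] - E_P[F_N]| = |E_Q[f] - E_P[f]| = O(1), CKP bound = O(sqrt(N)).
def kl(q, p):
    return float(np.sum(q * np.log(q / p)))

p = np.array([0.6, 0.4])
q = np.array([0.5, 0.5])
f = np.array([0.0, 1.0])          # bounded observable; F_N = mean of f(X_i)

bias = abs(np.dot(q - p, f))      # does not depend on N
for n in (1, 10, 100):
    ckp = np.max(np.abs(f)) * np.sqrt(2 * n * kl(q, p))
    print(n, bias, ckp)
```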
SLIDE 11
Gibbs variational principle, a.k.a. F = U − TS
- Relative entropy (a.k.a. Kullback-Leibler divergence):
R(Q||P) = EQ[log(dQ/dP)] if Q ≪ P, and +∞ otherwise.
R(Q||P) is a divergence, that is, R(Q||P) ≥ 0 and R(Q||P) = 0 if and only if Q = P.
- Gibbs variational principle for the relative entropy (convex duality):
log EP[e^f] = sup_Q {EQ[f] − R(Q||P)}
with the supremum attained if and only if
dQ = dQ_f = e^f dP / EP[e^f].
It plays a central role in statistical mechanics, in large deviation theory, and in dynamical systems.
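On a finite state space the variational principle can be verified directly, a minimal sketch (p and f are illustrative): the tilted measure Q_f attains the supremum, and random measures never beat it.

```python
import numpy as np

# Gibbs variational principle: log E_P[e^f] = sup_Q {E_Q[f] - R(Q||P)},
# attained at dQ_f = e^f dP / E_P[e^f].
rng = np.random.default_rng(0)
p = np.array([0.2, 0.5, 0.3])
f = np.array([1.0, -0.5, 2.0])

lhs = np.log(np.sum(p * np.exp(f)))
qf = p * np.exp(f) / np.sum(p * np.exp(f))     # the optimizer Q_f

def objective(q):
    return np.dot(q, f) - np.sum(q * np.log(q / p))

print(lhs, objective(qf))                      # equal at the optimizer
qs = rng.dirichlet(np.ones(3), size=1000)      # random alternative measures
print(max(objective(q) for q in qs) <= lhs + 1e-9)
```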
SLIDE 12
Gibbs information inequality
From the Gibbs variational principle, for any Q and c ≥ 0,
EQ[±cf] ≤ log EP[e^{±cf}] + R(Q||P).
Theorem (Gibbs information inequality):
− inf_{c>0} [Λ(−c) + R(Q||P)]/c ≤ EQ[f] − EP[f] ≤ inf_{c>0} [Λ(c) + R(Q||P)]/c
where we set
ΞP,f(η) ≡ inf_{c>0} [Λ(c) + η]/c,
Λ(c) = log EP[e^{c(f−EP[f])}] = log EP[e^{cf}] − cEP[f].
How good is it? (Long history... Dupuis; Bobkov; Boucheron, Lugosi, Massart; Breuer, Csiszár, etc...)
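The bound can be checked numerically on a finite space, a minimal sketch (illustrative P and f; the inf over c is taken on a grid): for exponentially tilted alternatives Q the bound is essentially attained.

```python
import numpy as np

# Gibbs information inequality: E_Q[f] - E_P[f] <= Xi_{P,f}(R(Q||P)),
# Xi(eta) = inf_{c>0} (Lambda(c)+eta)/c, checked on exponential tilts of P.
p = np.array([0.25, 0.25, 0.25, 0.25])
f = np.array([0.0, 1.0, 2.0, 3.0])

def Lambda(c):  # centered cumulant generating function under P
    return np.log(np.sum(p * np.exp(c * (f - np.dot(p, f)))))

def Xi(eta, cs=np.linspace(1e-3, 20.0, 20000)):
    return min((Lambda(c) + eta) / c for c in cs)

def kl(q, pp):
    return float(np.sum(q * np.log(q / pp)))

for a in (0.1, 0.5, 1.0):              # exponentially tilted alternatives
    q = p * np.exp(a * f) / np.sum(p * np.exp(a * f))
    print(np.dot(q - p, f), Xi(kl(q, p)))   # gap <= Xi, tight for tilts
```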
SLIDE 13
Properties of the Gibbs information inequality
ΞP,f(R(Q||P)) is a divergence, i.e., ΞP,f(η) ≥ 0 and ΞP,f(η) = 0 ⇔ η = 0, i.e., Q = P.
Moreover, the Gibbs information inequality is tight: given the family of alternative models Qη = {Q : R(Q||P) ≤ η} we have
ΞP,f(η) = max_{Q∈Qη} {EQ[f] − EP[f]}
and the maximum is attained at Qη ∈ Qη with
dQη/dP = e^{c(η)f}/EP[e^{c(η)f}]
with c(η) such that R(Qη||P) = η, and of course similarly for the min.
SLIDE 14
Concentration / UQ duality
Recall: if X_1, X_2, ... are IID copies with (centered) MGF Λ(c) for f(X), then by the Chernoff bound
P( (1/N)Σ_{i=1}^N f(X_i) − EP[f] > x ) ≤ e^{−NΛ*(x)}  (concentration)
and by the Cramér and Sanov theorems and the contraction principle
Λ*(x) = sup_c {xc − Λ(c)}  (Legendre transform)
      = inf_Q {R(Q||P) : EQ[f] − EP[f] = x}  ("entropy maximization")
versus (duality of optimization problems)
(Λ*)^{-1}_±(η) = inf_{c≥0} [Λ(±c) + η]/c = sup_Q {±(EQ[f] − EP[f]) : R(Q||P) = η}  (UQ bounds)
SLIDE 15
Linearization / Variance
Linearization: for small η = R(Q||P) one has the asymptotic expansion
ΞP,f(η) = √(2 VarP[f] η) + (γP(f)√(VarP[f])/3) η + O(η^{3/2})
where γP(f) = EP[(f−EP[f])³]/VarP[f]^{3/2} is the skewness.
→ For small perturbations of P, UQ is driven by CLT fluctuations: the linear regime.
→ For large perturbations of P, UQ is driven by rare events, or rather concentration of measure.
SLIDE 16
Markov processes: choosing the right path-space entropy
Baseline: Markov process X_t with path-space measure P^{0:T}.
Alternative: stochastic process Y_t with path-space measure Q^{0:T} (not necessarily Markovian!) and Q^{0:T} ≪ P^{0:T}.
The idea is to restrict the relative entropy to a sub-σ-algebra tailored to the observables at hand.
- Ergodic averages: apply the inequality to F_T = ∫_0^T f(X_t) dt:
EQ[F_T/T] − EP[F_T/T] ≤ inf_{c>0} [ (1/T) log EP[e^{c(F_T−EP[F_T])}] + (1/T) R(Q^{0:T}_{ν0} || P^{0:T}_{μ0}) ]/c
- Under suitable ergodicity assumptions for X_t the bounds scale as T → ∞. The important quantity is the relative entropy rate (it scales nicely with T, as we shall see later)...
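The relative entropy rate has a simple closed form for finite-state chains; a discrete-time sketch (the transition matrices are illustrative): r = Σ_x μQ(x) Σ_y Q(x,y) log(Q(x,y)/P(x,y)), and the path-space relative entropy grows linearly at exactly this rate when Q is started in stationarity.

```python
import itertools
import numpy as np

# Relative entropy rate for two 2-state Markov chains, and a brute-force
# check that R(Q^{0:T} || P^{0:T}) = T * rate (chain rule, stationary start).
P = np.array([[0.9, 0.1], [0.2, 0.8]])
Q = np.array([[0.85, 0.15], [0.3, 0.7]])

def stationary(M):
    w, v = np.linalg.eig(M.T)
    vec = np.real(v[:, np.argmax(np.real(w))])
    return vec / vec.sum()

mu = stationary(Q)
rate = sum(mu[x] * sum(Q[x, y] * np.log(Q[x, y] / P[x, y]) for y in range(2))
           for x in range(2))

def path_kl(T):   # R over paths of length T, both chains started from mu
    total = 0.0
    for path in itertools.product((0, 1), repeat=T + 1):
        qp = pp = mu[path[0]]
        for a, b in zip(path, path[1:]):
            qp, pp = qp * Q[a, b], pp * P[a, b]
        total += qp * np.log(qp / pp)
    return total

print(rate, path_kl(3) / 3)
```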
SLIDE 17
- Ergodic averages: statistical mechanics.
P = Gibbs measure on Ω^{Z^d} (Ω a finite set) with potential Φ. Q = any translation-invariant measure on Ω^{Z^d}.
r(Q||P) = lim_{V↗Z^d} (1/|V|) R(Q|_V || P|_V) always exists and is finite.
Theorem: for a (quasilocal) observable f,
− inf_{c>0} [λ(−c) + r(Q||P)]/c ≤ EQ[f] − EP[f] ≤ inf_{c>0} [λ(c) + r(Q||P)]/c
- λ(c) = P(Φ + cΨ_f) − P(Φ), the translated pressure (that is, with local Hamiltonian H_V + c Σ_{x∈V} τ_x(f)).
SLIDE 18
- Stopping time τ and QoI F_τ = ∫_0^τ f(X_t) dt.
It is natural to restrict the relative entropy to the σ-algebra F_τ:
EQ[F_τ] − EP[F_τ] ≤ inf_{c>0} [ log EP[e^{c(F_τ−EP[F_τ])}] + R(Q^{0:τ}||P^{0:τ}) ]/c
- Just stop the process....
- Discounted observable QoI G_λ(f) = ∫_0^∞ f(X_t) λe^{−λt} dt.
Define a new measure P_λ: X_t runs up to a random time T with exponential distribution with mean 1/λ. Then
R(Q_λ||P_λ) = ∫_0^∞ R(Q^{0:t}||P^{0:t}) λe^{−λt} dt  (discounted entropy)
EQ[G_λ(f)] − EP[G_λ(f)] ≤ inf_{c>0} [ log E_{P_λ}[e^{c(G_λ(f)−EP[G_λ(f)])}] + R(Q_λ||P_λ) ]/c
SLIDE 19
UQ for statistical estimators / mean-field formalism
How do we get UQ bounds for non-linear functionals of P, for example the variance or the skewness
VarP[f(X)],  γP[f] = EP[(f − EP[f(X)])³]/VarP[f(X)]^{3/2},
or more general statistical estimators?
A fundamental result in large deviations, the Laplace principle (Varadhan, Bryc, Dupuis-Ellis): the sequence S_N taking values in Y satisfies an LDP with rate function I(y) if and only if, for all Φ : Y → R bounded and continuous,
lim_{N→∞} (1/N) log EP[e^{NΦ(S_N)}] = sup_y {Φ(y) − I(y)}.
SLIDE 20
Example: UQ for the variance
Build a statistical estimator for the variance:
(1/N)Σ_{i=1}^N f(X_i)² − ((1/N)Σ_{i=1}^N f(X_i))² → VarP[f],
where the X_i are IID copies of X. Apply the Gibbs information inequality to the statistical estimator to find:
Theorem (Gibbs UQ bounds for the variance):
− inf_{c>0} [H(−c) + R(Q||P)]/c ≤ VarQ[f] − VarP[f] ≤ inf_{c>0} [H(c) + R(Q||P)]/c
where
H(c) = lim_{N→∞} (1/N) log E_{P^{0:N}}[ exp( c( Σ_{i=1}^N f(X_i)² − (1/N)(Σ_{i=1}^N f(X_i))² ) ) ].
SLIDE 21
Using the Laplace principle for the joint (f(X), f(X)²) one finds the convex function
H(c) = sup_{(u,v)∈R²} {c(v − u²) − I(u, v)}
where
Λ(α, β) = log EP[e^{αf+βf²}]  (cumulant generating function)
I(u, v) = sup_{α,β} {αu + βv − Λ(α, β)}  (rate function in Cramér's theorem).
The inequality is tight with optimizer
dQ_{α,β} = e^{αf+βf²}/EP[e^{αf+βf²}] dP
for suitable α and β such that R(Q_{α,β}||P) = η.
This generalizes to general statistical estimators.
SLIDE 22
Rare events and risk-sensitive functionals
UQ for rare events: P(A) ∼ e^{−I(A)/ε} (rare event probability). We really want to control I(A) = −ε log P(A).
More generally we consider risk-sensitive functionals log EP[e^{cf}] with c large (free energy).
Relative Rényi entropy (a.k.a. Rényi divergence): for α ≠ 0, 1,
Rα(Q||P) = (1/(α(α−1))) log EP[(dQ/dP)^α].
SLIDE 23
Variational principle for the relative Rényi entropy
An extension of the Gibbs variational principle, proved by Atar, Chowdhary, and Dupuis.
Relative Rényi entropy (a.k.a. Rényi divergence): for α ≠ 0, 1,
Rα(Q||P) = (1/(α(α−1))) log EP[(dQ/dP)^α].
- Rényi variational principle: for 0 < β < γ,
(1/γ) log EP[e^{γg}] = sup_Q { (1/β) log EQ[e^{βg}] − (1/(γ−β)) R_{γ/(γ−β)}(Q||P) }
and for 0 < γ < β,
(1/γ) log EP[e^{γg}] = inf_Q { (1/β) log EQ[e^{βg}] + (1/(β−γ)) R_{β/(β−γ)}(P||Q) }.
(Taking γ = 1 and β → 0 in the first formula recovers the Gibbs variational principle.)
SLIDE 24
UQ bounds for risk-sensitive functionals
sup_{β<γ} { (1/β) log EP[e^{βg}] − (1/(γ−β)) R_{γ/(γ−β)}(P||Q) } ≤ (1/γ) log EQ[e^{γg}]
(1/γ) log EQ[e^{γg}] ≤ inf_{β>γ} { (1/β) log EP[e^{βg}] + (1/(β−γ)) R_{β/(β−γ)}(Q||P) }
- You can prove similar tightness properties as well.
To treat rare events take g = −M 1_{A^c}, let M → ∞, and relabel the indices to get UQ bounds for rare events:
sup_{0<α<1} (1/α) log EP[(dQ/dP)^α | A] ≤ log Q(A) − log P(A) ≤ inf_{α>1} (1/α) log EP[(dQ/dP)^α | A]
Similar optimization problem as before.
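A finite-space sanity check of Rényi-type rare-event bounds in conditional form (the conditional-expectation form E_P[·|A] is an assumption of this reconstruction; it follows from Jensen's inequality applied conditionally on A; the distributions and event are illustrative):

```python
import numpy as np

# Check: sup_{0<a<1} (1/a) log E_P[(dQ/dP)^a | A] <= log Q(A) - log P(A)
#                                                 <= inf_{a>1} (1/a) log E_P[(dQ/dP)^a | A]
p = np.array([0.70, 0.25, 0.04, 0.01])
q = np.array([0.60, 0.30, 0.07, 0.03])
A = np.array([False, False, True, True])   # the "rare" event

gap = np.log(q[A].sum()) - np.log(p[A].sum())

def bound(a):
    cond = p[A] / p[A].sum()               # P( . | A)
    return np.log(np.sum(cond * (q[A] / p[A]) ** a)) / a

lower = max(bound(a) for a in np.linspace(0.05, 0.95, 19))
upper = min(bound(a) for a in np.linspace(1.05, 10.0, 200))
print(lower, gap, upper)
```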
SLIDE 25
Making it computable with concentration inequalities
Some examples (much more in Gourgoulias, Katsoulakis, R.-B., Wang):
- If a ≤ f ≤ b we have Hoeffding's inequality
Λ(c) ≤ c²(b−a)²/8 ≤ c²‖f − EP[f]‖²_∞/2
and then ΞP,f(η) ≤ (b−a)√(η/2)  (Csiszár-Kullback-Pinsker).
- If f is bounded and VarP[f] = σ², then we have the Bernstein inequality
Λ(c) ≤ c²σ²/(2(1 − c‖f − EP[f]‖_∞))
and then ΞP,f(η) ≤ √(2σ²η) + ‖f − EP[f]‖_∞ η.
This beats Pinsker if η is not too big (especially if σ² is small) and captures the exact small-η asymptotics.
- Many more: sharper inequalities for bounded f and others for Poissonian, Gaussian, exponential tails....
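The two bounds can be compared with the exact ΞP,f on a Bernoulli example (illustrative; the inf over c is taken on a grid), showing Bernstein beating Pinsker in the small-variance regime:

```python
import numpy as np

# Exact Xi_{P,f} for f = indicator under Bernoulli(pp), vs the Hoeffding/
# Pinsker bound (b-a)sqrt(eta/2) and the Bernstein-type bound
# sqrt(2 sigma^2 eta) + M eta, with M = ||f - E_P f||_inf.
pp = 0.05                                  # small-variance regime
sigma2 = pp * (1 - pp)
M = max(pp, 1 - pp)

cs = np.linspace(1e-4, 60.0, 600000)
def Lambda(c):                             # centered CGF of Bernoulli(pp)
    return np.log(1 - pp + pp * np.exp(c)) - c * pp

def Xi(eta):
    return np.min((Lambda(cs) + eta) / cs)

for eta in (0.01, 0.1):
    print(eta, Xi(eta), np.sqrt(eta / 2), np.sqrt(2 * sigma2 * eta) + M * eta)
```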
SLIDE 26
Steady-state UQ bounds for ergodic Markov processes
Consider ergodic averages (1/T)∫_0^T f(X_s) ds; using the Gibbs UQ bound one obtains the steady-state bias bound
−ξP,−f(r(Q||P)) ≤ lim_{T→∞} EQ[(1/T)∫_0^T f(Y_s) ds] − Eμ[f] ≤ ξP,f(r(Q||P))
where
ξP,f(η) = inf_{c>0} [λ(c) + η]/c,
λ(c) = lim_{T→∞} (1/T) log E_{P^{0:T}}[e^{c∫_0^T (f(X_s)−Eμ[f]) ds}],
η = r(Q||P) = lim_{T→∞} (1/T) R(Q^{0:T}||P^{0:T})  (relative entropy rate).
SLIDE 27
Coercive dynamics
Langevin equation: dX_t = (−∇V + J∇V) dt + √2 dW_t, for any antisymmetric J, has invariant measure dμ = Z^{−1}e^{−V} dx and generator
L = Δ − ∇V·∇ + J∇V·∇
with the last term antisymmetric.
- Main idea (from Liming Wu): bound the Feynman-Kac semigroup
e^{T(L+V)}h(x) = E_{P^{0:T}_{δx}}[e^{∫_0^T V(X_s) ds} h(X_T)]
- using the Lumer-Phillips theorem:
(1/T) log ‖e^{T(L+V)}‖_{L²(μ)} ≤ sup{ ⟨g, Lg⟩_{L²(μ)} + ∫ V|g|² dμ : ‖g‖₂ = 1 }.
See the works on concentration inequalities by Lezaud, Wu, Cattiaux, Guillin and collaborators, on which we rely here.
SLIDE 28
Poincaré inequalities and bounded f
Theorem: If we have a Poincaré inequality (spectral gap)
Varμ[f] ≤ −α⟨f, Lf⟩_{L²(μ)}, f ∈ D(L),
then for bounded f and general L
λ(c) ≤ c²αVarμ[f]/(1 − αc‖f − Eμ[f]‖_∞)  (Bernstein-type bound)
and then
ξP,f(η) ≤ 2√(αVarμ[f]η) + α‖f − Eμ[f]‖_∞ η.
Theorem: For symmetric L we have the sharper bound
λ(c) ≤ c²σ²(f)/(2(1 − αc‖f‖_∞))
(a Bernstein-type bound with σ²(f) the asymptotic variance), and then ξP,f(η) ≤ √(2σ²(f)η) + α‖f‖_∞ η (sharp for small η).
SLIDE 29
Log-Sobolev inequalities and unbounded f
Assume the stronger log-Sobolev inequality
Eμ[f² log(f²)] − Eμ[f²] log Eμ[f²] ≤ −β⟨f, Lf⟩_{L²(μ)}, f ∈ D(L).
Then using the Gibbs variational principle we get the bound
ξP,f(η) ≤ inf_{c>0} [Λμ,f(c) + βη]/c = Ξμ,f(βη).
- The only trace of the dynamics is left in the constant β.
The tail behavior of f in the stationary distribution determines the UQ; use another (static) concentration inequality.
If V(x) ∼ |x|^b (with the usual bounds on ∇V and ΔV...):
- Poincaré for b > 1, log-Sobolev for b > 2, so UQ bounds for V(X) itself.
For 1 < b ≤ 2 we can use F-Sobolev inequalities to consider unbounded f.
SLIDE 30
Hypocoercive samplers
Goal: to sample from ν(dq) ∝ e^{−βV(q)} dq by extending the phase space and sampling from the measure
μ(dp, dq) = ν(dq)π(dp) ∝ e^{−β(V(q)+p²/2m)} dp dq.
You can use other distributions of p too.
Why?: Add extra dimensions to escape your bad karma.... Make the dynamics irreversible to get faster (maybe).
- Ex 1: Langevin dynamics.
(1) dq_t = (p_t/m) dt,  dp_t = (−∇V(q_t) − γp_t/m) dt + √(2γ/β) dW_t
L = (p/m)^T∇_q − ∇V(q)^T∇_p − γ(p/m)^T∇_p + (γ/β)Δ_p
SLIDE 31
- Ex 2: Randomized Hamiltonian Monte-Carlo.
The particle follows the Hamiltonian equations of motion
dq_t = (p_t/m) dt,  dp_t = −∇V(q_t) dt
without noise or dissipation, for a random amount of time, at which we resample the momentum according to the stationary measure. With the projection Πf = ∫ f(p, q) dπ(p) the generator is
(2) L = (p/m)^T∇_q − ∇V(q)^T∇_p + λ(Π − I)
with the last term symmetric (S = S*).
SLIDE 32
- Ex 3: Bouncy particle sampler.
The particle follows straight lines for a random time. At updating times one either resamples the momentum according to the stationary measure or the particle "bounces", i.e., it undergoes a Newtonian elastic collision on the hyperplane tangential to the gradient of the energy, and the momentum is updated according to the rule
(3) r(q)p = p − 2 (p^T∇V(q)/‖∇V(q)‖²) ∇V(q),  Rf(p, q) = f(q, r(q)p)
(4) L = (p/m)^T∇_q + ((p/m)^T∇V(q))_+ (R − I) + λ(Π − I)
with the last term the noise.
- Zig-zag sampler..... etc...
- Temperature accelerated molecular dynamics
- Ask Gabriel Stoltz.
SLIDE 33
Hypocoercivity
Dolbeault-Mouhot-Schmeiser; Andrieu-Durmus-Nüsken-Roussel; Rousset-Stoltz-Trstanova; Olla, ... after many other works (Villani, Hérau-Nier, Hairer-Eckmann).
Idea: the dynamics is not coercive (no Poincaré inequality in L²(μ) for L), but there exists a scalar product, equivalent to the L²(μ) one, for which a Poincaré inequality holds:
⟨f, g⟩_ε = ⟨f, g⟩ + ε⟨f, (B + B*)g⟩,  B = (1 + (TΠ)*(TΠ))^{−1}(−TΠ)*
where T is the antisymmetric part of the generator.
Modified Poincaré inequality:
(5) ⟨−Lg, g⟩_ε ≥ Λ(ε) Varμ(g)
and Λ(ε) is explicitly expressed in terms of the Poincaré constant for ν(dq), the spectral gap of the noise operator, and the potential V....
SLIDE 34
Performance guarantees for hypocoercive samplers
New results (Jeremiah Birrell and L. R.-B.):
Theorem (Bernstein-type inequalities for hypocoercive samplers): For bounded f we have
P_{μ0}( (1/T)∫_0^T f(X_t) dt − ∫ f dμ ≥ r ) ≤ a(ε) exp( −T b(ε)Λ(ε)r² / (4Varμ[f] + 2c(ε)‖f − Eμ[f]‖_∞ r) )
- where a(ε), b(ε), c(ε) only depend on ε.
→ Explicit non-asymptotic confidence intervals for ∫ f dμ, i.e.
→ UQ bounds for alternative processes:
ξP,f(η) ≤ √(2d(ε)Λ(ε)^{−1}Varμ[f]η) + e(ε)Λ(ε)^{−1}‖f − Eμ[f]‖_∞ η
- where d(ε), e(ε) only depend on ε.