SLIDE 1
Thermodynamic Formalism and Uncertainty Quantification
Luc Rey-Bellet, University of Massachusetts Amherst
Quantissima III, Venice, August 2019
Work supported by NSF and AFOSR
SLIDE 2 Collaborators on related projects
- Paul Dupuis (Brown University),
- Markos Katsoulakis (UMass Amherst)
- Sung-Ha Hwang (KAIST)
- Peter Plechac (U. of Delaware)
- Yannis Pantazis (FORTH Crete)
- Jeremiah Birrell (UMass Amherst)
- Panagiota Birmpa (UMass Amherst)
- Konstantinos Gourgoulias (UMass Amherst)
- Jinchao Feng (UMass Amherst)
- Jie Wang (UMass Amherst)
- Sosung Baek (KAIST)
SLIDE 3
Some references:
[1] K. Chowdhary and P. Dupuis: Distinguishing and integrating aleatoric and epistemic variation in uncertainty quantification. ESAIM: M2AN, 47:635-662, 2013.
[2] R. Atar, K. Chowdhary, and P. Dupuis: Robust bounds on risk-sensitive functionals via Rényi divergence. SIAM/ASA Journal on UQ, 3:18-33, 2015.
[3] P. Dupuis, M. A. Katsoulakis, Y. Pantazis, and P. Plecháč: Path-Space Information Bounds for Uncertainty Quantification and Sensitivity Analysis of Stochastic Dynamics. SIAM/ASA Journal on UQ, 4(1):80-111, 2016.
[4] M. Katsoulakis, L. Rey-Bellet, and J. Wang: Scalable Information Inequalities for Uncertainty Quantification. J. Comp. Phys. 336, 513-545 (2017).
[5] K. Gourgoulias, M. Katsoulakis, L. Rey-Bellet, and J. Wang: How biased is your model? Concentration inequalities, information and model bias. To be published in IEEE Trans.
SLIDE 4
[6] P. Dupuis, M. Katsoulakis, Y. Pantazis, and L. Rey-Bellet: Sensitivity Analysis for Rare Events based on Rényi Divergence. To be published in Ann. Appl. Prob.
[7] J. Birrell and L. Rey-Bellet: Uncertainty Quantification for Markov Processes via Variational Principles and Functional Inequalities. Submitted. arXiv:1812.05174
[8] J. Birrell and L. Rey-Bellet: Concentration Inequalities and Performance Guarantees for Hypocoercive Samplers. Submitted. arXiv:1907.11973
[9] J. Birrell, M. Katsoulakis, and L. Rey-Bellet: Robustness of Dynamical Quantities of Interest via Goal-Oriented Information
[10] S. Baek, S.-H. Hwang, and L. Rey-Bellet: Thermodynamical Formalism and Uncertainty Quantification. In preparation.
- and several more to come.
SLIDE 5
UQ framework: Baseline model
→ Baseline model P (= probability measure on X). Think of it as a (tractable) model you use to compute or do analysis. Maybe obtained after inference and/or model reduction.
Most interesting: you should think of P as high-dimensional, e.g., Pν is the distribution of a process {X_t}_{0≤t<∞} with X_0 ∼ ν, or P is a Gibbs measure on Ω^{Z^d}.
In any case, we think there are possibly large uncertainties in the model (model-form uncertainties):
P IS NOT TO BE TRUSTED!!
SLIDE 6
UQ framework: Quantities of interest
Specific observables/statistics/quantities of interest = QoI
- EP[f] (expectation)
- VarP(f) (variance), or CovP(f,g)/√(VarP(f)VarP(g)) (correlation)
- ΛP,f(c) = log EP[e^{cf}] (risk-sensitive functional)
- log P(A) ∼ log e^{−I(A)/ε} (probability of some rare event)
- or maybe path-space QoI:
  EPν[∫_0^τ f(X_t) dt], where τ is a stopping time;
  (1/T)∫_0^T f(X_s) ds, that is, ergodic averages;
  EPν[∫_0^∞ e^{−λs} f(X_s) ds], that is, discounted observables.
SLIDE 7
UQ framework: Non-parametric stress tests
→ Family of alternative models Q. Think of it as describing the true but "unknowable" or partially known models. Set
Qη = {Q : Q is η-"close" to P}.
Given a QoI f, can one find uncertainty bounds or performance guarantees
inf_{Q∈Qη} EQ[f] ≤ EP[f] ≤ sup_{Q∈Qη} EQ[f]?
and similarly for other quantities? The bounds should be tight and computable (numerically or analytically).
→ Robustness, cf. book by Hansen (Nobel 2013) and Sargent (Nobel 2011).
→ Stress tests in operations research, finance, etc....
SLIDE 8
UQ framework: distances and divergences
Which measure of distance, or pseudo-distance (divergence), should we use?
→ Use information theory concepts to measure the information loss between Q and P.
- Relative entropy (a.k.a. Kullback-Leibler divergence):
R(Q||P) = EQ[log(dQ/dP)]
- Relative Rényi entropy (a.k.a. Rényi divergence): for α ≠ 0, 1,
Rα(Q||P) = (1/(α(α−1))) log EP[(dQ/dP)^α]
with
Rα(Q||P) → R(Q||P) as α → 1,  Rα(Q||P) → R(P||Q) as α → 0.
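The two limits above can be checked numerically on a finite state space; a minimal sketch (the distributions p, q are illustrative, not from the talk):

```python
import numpy as np

# R_alpha(Q||P) = 1/(alpha(alpha-1)) * log E_P[(dQ/dP)^alpha];
# alpha -> 1 recovers R(Q||P), alpha -> 0 recovers R(P||Q).
def kl(q, p):
    return float(np.sum(q * np.log(q / p)))

def renyi(q, p, alpha):
    return float(np.log(np.sum(p * (q / p) ** alpha)) / (alpha * (alpha - 1)))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

print(renyi(q, p, 1 + 1e-6), kl(q, p))   # alpha near 1: close to R(Q||P)
print(renyi(q, p, 1e-6), kl(p, q))       # alpha near 0: close to R(P||Q)
```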
SLIDE 9
UQ framework: distances and divergences
- Scalability: if Q^{0:T} and P^{0:T} are the distributions of the process restricted to the time window 0 to T then, typically,
Rα(Q^{0:T}||P^{0:T}) = O(T) as T → ∞,
i.e., information is additive. For the relative entropy we have the chain rule, which is even better (not asymptotic in T).
- Information processing inequality: if F is a sub-σ-algebra, then
Rα(Q|_F || P|_F) ≤ Rα(Q||P).
- What is the right divergence for the QoI?
- Not the whole story:
→ Heavy-tailed observables may require other entropies (f-divergences).
→ Wasserstein-type distances are needed if Q is not absolutely continuous w.r.t. P....
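The additivity of information can be illustrated with the simplest "process", IID sampling, where the product structure gives R(Q^{⊗N}||P^{⊗N}) = N·R(Q||P) exactly (a toy stand-in for the O(T) path-space scaling; the distributions are illustrative):

```python
import numpy as np

# Additivity of relative entropy over product measures:
# R(Q^N || P^N) = N * R(Q || P) for N iid copies.
def kl(q, p):
    return float(np.sum(q * np.log(q / p)))

def product_measure(p, n):
    out = p
    for _ in range(n - 1):
        out = np.outer(out, p).ravel()   # n-fold product, flattened
    return out

p = np.array([0.6, 0.4])
q = np.array([0.5, 0.5])
for n in (1, 2, 5):
    print(n, kl(product_measure(q, n), product_measure(p, n)), n * kl(q, p))
```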
SLIDE 10
What is wrong with CKP? Scalability
Csiszár-Kullback-Pinsker:
|EQ[f] − EP[f]| ≤ ‖f − EP[f]‖_∞ √(2R(Q||P))
Take e.g. Markov measures P = P^{0:T} and Q = Q^{0:T} and F_T = (1/T)∫_0^T f(X_s) ds. Then ‖F_T‖_∞ = ‖f‖_∞ = O(1) and R(Q^{0:T}||P^{0:T}) = O(T), and so
|E_{Q^{0:T}}[F_T] − E_{P^{0:T}}[F_T]| ≤ √(2R(Q^{0:T}||P^{0:T})) ‖F_T − EP[F_T]‖_∞ = O(√T).
CKP does not scale correctly! Note though that Var_{P^{0:T}}[F_T] = O(1/T), so one would need the variance instead of the sup norm.
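The failure to scale is already visible for IID sampling, a minimal sketch (illustrative distributions; the CKP bound uses ‖f‖_∞ as the sup-norm factor): the true bias of the empirical average is O(1) in N, while the CKP bound grows like √N.

```python
import numpy as np

# CKP bound vs actual bias for the empirical average F_N of N iid samples:
# |E_Q[F_N] - E_P[F_N]| = |E_Q[f] - E_P[f]| = O(1), CKP bound = O(sqrt(N)).
def kl(q, p):
    return float(np.sum(q * np.log(q / p)))

p = np.array([0.6, 0.4])
q = np.array([0.5, 0.5])
f = np.array([0.0, 1.0])          # bounded observable; F_N = mean of f(X_i)

bias = abs(np.dot(q - p, f))      # does not depend on N
for n in (1, 10, 100):
    ckp = np.max(np.abs(f)) * np.sqrt(2 * n * kl(q, p))
    print(n, bias, ckp)
```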
SLIDE 11
Gibbs variational principle, a.k.a. F = U − TS
- Relative entropy (a.k.a. Kullback-Leibler divergence):
R(Q||P) = EQ[log(dQ/dP)] if Q ≪ P, and +∞ otherwise.
R(Q||P) is a divergence, that is, R(Q||P) ≥ 0 and R(Q||P) = 0 if and only if Q = P.
- Gibbs variational principle for the relative entropy (convex duality):
log EP[e^f] = sup_Q {EQ[f] − R(Q||P)}
with the supremum attained if and only if
dQ = dQ_f = e^f dP / EP[e^f].
It plays a central role in statistical mechanics, in large deviation theory, and in dynamical systems.
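On a finite state space the variational principle can be verified directly, a minimal sketch (p and f are illustrative): the tilted measure Q_f attains the supremum, and random measures never beat it.

```python
import numpy as np

# Gibbs variational principle: log E_P[e^f] = sup_Q {E_Q[f] - R(Q||P)},
# attained at dQ_f = e^f dP / E_P[e^f].
rng = np.random.default_rng(0)
p = np.array([0.2, 0.5, 0.3])
f = np.array([1.0, -0.5, 2.0])

lhs = np.log(np.sum(p * np.exp(f)))
qf = p * np.exp(f) / np.sum(p * np.exp(f))     # the optimizer Q_f

def objective(q):
    return np.dot(q, f) - np.sum(q * np.log(q / p))

print(lhs, objective(qf))                      # equal at the optimizer
qs = rng.dirichlet(np.ones(3), size=1000)      # random alternative measures
print(max(objective(q) for q in qs) <= lhs + 1e-9)
```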
SLIDE 12
Gibbs information inequality
From the Gibbs variational principle, for any Q and c ≥ 0,
EQ[±cf] ≤ log EP[e^{±cf}] + R(Q||P).
Theorem (Gibbs information inequality):
− inf_{c>0} [Λ(−c) + R(Q||P)]/c ≤ EQ[f] − EP[f] ≤ inf_{c>0} [Λ(c) + R(Q||P)]/c
where we set
ΞP,f(η) ≡ inf_{c>0} [Λ(c) + η]/c,
Λ(c) = log EP[e^{c(f−EP[f])}] = log EP[e^{cf}] − cEP[f].
How good is it? (Long history... Dupuis; Bobkov; Boucheron, Lugosi, Massart; Breuer, Csiszár, etc...)
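The bound can be checked numerically on a finite space, a minimal sketch (illustrative P and f; the inf over c is taken on a grid): for exponentially tilted alternatives Q the bound is essentially attained.

```python
import numpy as np

# Gibbs information inequality: E_Q[f] - E_P[f] <= Xi_{P,f}(R(Q||P)),
# Xi(eta) = inf_{c>0} (Lambda(c)+eta)/c, checked on exponential tilts of P.
p = np.array([0.25, 0.25, 0.25, 0.25])
f = np.array([0.0, 1.0, 2.0, 3.0])

def Lambda(c):  # centered cumulant generating function under P
    return np.log(np.sum(p * np.exp(c * (f - np.dot(p, f)))))

def Xi(eta, cs=np.linspace(1e-3, 20.0, 20000)):
    return min((Lambda(c) + eta) / c for c in cs)

def kl(q, pp):
    return float(np.sum(q * np.log(q / pp)))

for a in (0.1, 0.5, 1.0):              # exponentially tilted alternatives
    q = p * np.exp(a * f) / np.sum(p * np.exp(a * f))
    print(np.dot(q - p, f), Xi(kl(q, p)))   # gap <= Xi, tight for tilts
```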
SLIDE 13
Properties of the Gibbs information inequality
ΞP,f(R(Q||P)) is a divergence, i.e., ΞP,f(η) ≥ 0 and ΞP,f(η) = 0 ⇔ η = 0, i.e., Q = P.
Moreover, the Gibbs information inequality is tight: given the family of alternative models Qη = {Q : R(Q||P) ≤ η} we have
ΞP,f(η) = max_{Q∈Qη} {EQ[f] − EP[f]}
and the maximum is attained at Qη ∈ Qη with
dQη/dP = e^{c(η)f}/EP[e^{c(η)f}]
with c(η) such that R(Qη||P) = η, and of course similarly for the min.
SLIDE 14
Concentration / UQ duality
Recall: if X_1, X_2, ... are IID copies with (centered) MGF Λ(c) for f(X), then by the Chernoff bound
P( (1/N)Σ_{i=1}^N f(X_i) − EP[f] > x ) ≤ e^{−NΛ*(x)}  (concentration)
and by the Cramér and Sanov theorems and the contraction principle
Λ*(x) = sup_c {xc − Λ(c)}  (Legendre transform)
      = inf_Q {R(Q||P) : EQ[f] − EP[f] = x}  ("entropy maximization")
versus (duality of optimization problems)
(Λ*)^{-1}_±(η) = inf_{c≥0} [Λ(±c) + η]/c = sup_Q {±(EQ[f] − EP[f]) : R(Q||P) = η}  (UQ bounds)
SLIDE 15
Linearization / Variance
Linearization: for small η = R(Q||P) one has the asymptotic expansion
ΞP,f(η) = √(2 VarP[f] η) + (γP(f)√(VarP[f])/3) η + O(η^{3/2})
where γP(f) = EP[(f−EP[f])³]/VarP[f]^{3/2} is the skewness.
→ For small perturbations of P, UQ is driven by CLT fluctuations: the linear regime.
→ For large perturbations of P, UQ is driven by rare events, or rather concentration of measure.
SLIDE 16
Markov processes: choosing the right path-space entropy
Baseline: Markov process X_t with path-space measure P^{0:T}.
Alternative: stochastic process Y_t with path-space measure Q^{0:T} (not necessarily Markovian!) and Q^{0:T} ≪ P^{0:T}.
The idea is to restrict the relative entropy to a sub-σ-algebra tailored to the observables at hand.
- Ergodic averages: apply the inequality to F_T = ∫_0^T f(X_t) dt:
EQ[F_T/T] − EP[F_T/T] ≤ inf_{c>0} [ (1/T) log EP[e^{c(F_T−EP[F_T])}] + (1/T) R(Q^{0:T}_{ν0} || P^{0:T}_{μ0}) ]/c
- Under suitable ergodicity assumptions for X_t the bounds scale as T → ∞. The important quantity is the relative entropy rate (it scales nicely with T, as we shall see later)...
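The relative entropy rate has a simple closed form for finite-state chains; a discrete-time sketch (the transition matrices are illustrative): r = Σ_x μQ(x) Σ_y Q(x,y) log(Q(x,y)/P(x,y)), and the path-space relative entropy grows linearly at exactly this rate when Q is started in stationarity.

```python
import itertools
import numpy as np

# Relative entropy rate for two 2-state Markov chains, and a brute-force
# check that R(Q^{0:T} || P^{0:T}) = T * rate (chain rule, stationary start).
P = np.array([[0.9, 0.1], [0.2, 0.8]])
Q = np.array([[0.85, 0.15], [0.3, 0.7]])

def stationary(M):
    w, v = np.linalg.eig(M.T)
    vec = np.real(v[:, np.argmax(np.real(w))])
    return vec / vec.sum()

mu = stationary(Q)
rate = sum(mu[x] * sum(Q[x, y] * np.log(Q[x, y] / P[x, y]) for y in range(2))
           for x in range(2))

def path_kl(T):   # R over paths of length T, both chains started from mu
    total = 0.0
    for path in itertools.product((0, 1), repeat=T + 1):
        qp = pp = mu[path[0]]
        for a, b in zip(path, path[1:]):
            qp, pp = qp * Q[a, b], pp * P[a, b]
        total += qp * np.log(qp / pp)
    return total

print(rate, path_kl(3) / 3)
```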
SLIDE 17
- Ergodic averages: statistical mechanics.
P = Gibbs measure on Ω^{Z^d} (Ω a finite set) with potential Φ. Q = any translation-invariant measure on Ω^{Z^d}.
r(Q||P) = lim_{V↗Z^d} (1/|V|) R(Q|_V || P|_V) always exists and is finite.
Theorem: for a (quasilocal) observable f,
− inf_{c>0} [λ(−c) + r(Q||P)]/c ≤ EQ[f] − EP[f] ≤ inf_{c>0} [λ(c) + r(Q||P)]/c
- λ(c) = P(Φ + cΨ_f) − P(Φ), the translated pressure (that is, with local Hamiltonian H_V + c Σ_{x∈V} τ_x(f)).
SLIDE 18
- Stopping time τ and QoI F_τ = ∫_0^τ f(X_t) dt.
It is natural to restrict the relative entropy to the σ-algebra F_τ:
EQ[F_τ] − EP[F_τ] ≤ inf_{c>0} [ log EP[e^{c(F_τ−EP[F_τ])}] + R(Q^{0:τ}||P^{0:τ}) ]/c
- Just stop the process....
- Discounted observable QoI G_λ(f) = ∫_0^∞ f(X_t) λe^{−λt} dt.
Define a new measure P_λ: X_t runs up to a random time T with exponential distribution with mean 1/λ. Then
R(Q_λ||P_λ) = ∫_0^∞ R(Q^{0:t}||P^{0:t}) λe^{−λt} dt  (discounted entropy)
EQ[G_λ(f)] − EP[G_λ(f)] ≤ inf_{c>0} [ log E_{P_λ}[e^{c(G_λ(f)−EP[G_λ(f)])}] + R(Q_λ||P_λ) ]/c
SLIDE 19
UQ for statistical estimators / mean-field formalism
How do we get UQ bounds for non-linear functionals of P, for example the variance or the skewness
VarP[f(X)],  γP[f] = EP[(f − EP[f(X)])³]/VarP[f(X)]^{3/2},
or more general statistical estimators?
A fundamental result in large deviations, the Laplace principle (Varadhan, Bryc, Dupuis-Ellis): the sequence S_N taking values in Y satisfies an LDP with rate function I(y) if and only if, for all Φ : Y → R bounded and continuous,
lim_{N→∞} (1/N) log EP[e^{NΦ(S_N)}] = sup_y {Φ(y) − I(y)}.
SLIDE 20
Example: UQ for the variance
Build a statistical estimator for the variance:
(1/N)Σ_{i=1}^N f(X_i)² − ((1/N)Σ_{i=1}^N f(X_i))² → VarP[f],
where the X_i are IID copies of X. Apply the Gibbs information inequality to the statistical estimator to find:
Theorem (Gibbs UQ bounds for the variance):
− inf_{c>0} [H(−c) + R(Q||P)]/c ≤ VarQ[f] − VarP[f] ≤ inf_{c>0} [H(c) + R(Q||P)]/c
where
H(c) = lim_{N→∞} (1/N) log E_{P^{0:N}}[ exp( c( Σ_{i=1}^N f(X_i)² − (1/N)(Σ_{i=1}^N f(X_i))² ) ) ].
SLIDE 21
Using the Laplace principle for the joint (f(X), f(X)²) one finds the convex function
H(c) = sup_{(u,v)∈R²} {c(v − u²) − I(u, v)}
where
Λ(α, β) = log EP[e^{αf+βf²}]  (cumulant generating function)
I(u, v) = sup_{α,β} {αu + βv − Λ(α, β)}  (rate function in Cramér's theorem).
The inequality is tight with optimizer
dQ_{α,β} = e^{αf+βf²}/EP[e^{αf+βf²}] dP
for suitable α and β such that R(Q_{α,β}||P) = η.
This generalizes to general statistical estimators.
SLIDE 22
Rare events and risk-sensitive functionals
UQ for rare events: P(A) ∼ e^{−I(A)/ε} (rare event probability). We really want to control I(A) = −ε log P(A).
More generally we consider risk-sensitive functionals log EP[e^{cf}] with c large (free energy).
Relative Rényi entropy (a.k.a. Rényi divergence): for α ≠ 0, 1,
Rα(Q||P) = (1/(α(α−1))) log EP[(dQ/dP)^α].
SLIDE 23
Variational principle for the relative Rényi entropy
An extension of the Gibbs variational principle, proved by Atar, Chowdhary, and Dupuis.
Relative Rényi entropy (a.k.a. Rényi divergence): for α ≠ 0, 1,
Rα(Q||P) = (1/(α(α−1))) log EP[(dQ/dP)^α].
- Rényi variational principle: for 0 < β < γ,
(1/γ) log EP[e^{γg}] = sup_Q { (1/β) log EQ[e^{βg}] − (1/(γ−β)) R_{γ/(γ−β)}(Q||P) }
and for 0 < γ < β,
(1/γ) log EP[e^{γg}] = inf_Q { (1/β) log EQ[e^{βg}] + (1/(β−γ)) R_{β/(β−γ)}(P||Q) }.
(Taking γ = 1 and β → 0 in the first formula recovers the Gibbs variational principle.)
SLIDE 24
UQ bounds for risk-sensitive functionals
sup_{β<γ} { (1/β) log EP[e^{βg}] − (1/(γ−β)) R_{γ/(γ−β)}(P||Q) } ≤ (1/γ) log EQ[e^{γg}]
(1/γ) log EQ[e^{γg}] ≤ inf_{β>γ} { (1/β) log EP[e^{βg}] + (1/(β−γ)) R_{β/(β−γ)}(Q||P) }
- You can prove similar tightness properties as well.
To treat rare events take g = −M 1_{A^c}, let M → ∞, and relabel the indices to get UQ bounds for rare events:
sup_{0<α<1} (1/α) log EP[(dQ/dP)^α | A] ≤ log Q(A) − log P(A) ≤ inf_{α>1} (1/α) log EP[(dQ/dP)^α | A]
Similar optimization problem as before.
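A finite-space sanity check of Rényi-type rare-event bounds in conditional form (the conditional-expectation form E_P[·|A] is an assumption of this reconstruction; it follows from Jensen's inequality applied conditionally on A; the distributions and event are illustrative):

```python
import numpy as np

# Check: sup_{0<a<1} (1/a) log E_P[(dQ/dP)^a | A] <= log Q(A) - log P(A)
#                                                 <= inf_{a>1} (1/a) log E_P[(dQ/dP)^a | A]
p = np.array([0.70, 0.25, 0.04, 0.01])
q = np.array([0.60, 0.30, 0.07, 0.03])
A = np.array([False, False, True, True])   # the "rare" event

gap = np.log(q[A].sum()) - np.log(p[A].sum())

def bound(a):
    cond = p[A] / p[A].sum()               # P( . | A)
    return np.log(np.sum(cond * (q[A] / p[A]) ** a)) / a

lower = max(bound(a) for a in np.linspace(0.05, 0.95, 19))
upper = min(bound(a) for a in np.linspace(1.05, 10.0, 200))
print(lower, gap, upper)
```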
SLIDE 25
Making it computable with concentration inequalities
Some examples (much more in Gourgoulias, Katsoulakis, R.-B., Wang):
- If a ≤ f ≤ b we have Hoeffding's inequality
Λ(c) ≤ c²(b−a)²/8 ≤ c²‖f − EP[f]‖²_∞/2
and then ΞP,f(η) ≤ (b−a)√(η/2)  (Csiszár-Kullback-Pinsker).
- If f is bounded and VarP[f] = σ², then we have the Bernstein inequality
Λ(c) ≤ c²σ²/(2(1 − c‖f − EP[f]‖_∞))
and then ΞP,f(η) ≤ √(2σ²η) + ‖f − EP[f]‖_∞ η.
This beats Pinsker if η is not too big (especially if σ² is small) and captures the exact small-η asymptotics.
- Many more: sharper inequalities for bounded f and others for Poissonian, Gaussian, exponential tails....
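The two bounds can be compared with the exact ΞP,f on a Bernoulli example (illustrative; the inf over c is taken on a grid), showing Bernstein beating Pinsker in the small-variance regime:

```python
import numpy as np

# Exact Xi_{P,f} for f = indicator under Bernoulli(pp), vs the Hoeffding/
# Pinsker bound (b-a)sqrt(eta/2) and the Bernstein-type bound
# sqrt(2 sigma^2 eta) + M eta, with M = ||f - E_P f||_inf.
pp = 0.05                                  # small-variance regime
sigma2 = pp * (1 - pp)
M = max(pp, 1 - pp)

cs = np.linspace(1e-4, 60.0, 600000)
def Lambda(c):                             # centered CGF of Bernoulli(pp)
    return np.log(1 - pp + pp * np.exp(c)) - c * pp

def Xi(eta):
    return np.min((Lambda(cs) + eta) / cs)

for eta in (0.01, 0.1):
    print(eta, Xi(eta), np.sqrt(eta / 2), np.sqrt(2 * sigma2 * eta) + M * eta)
```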
SLIDE 26
Steady-state UQ bounds for ergodic Markov processes
Consider ergodic averages (1/T)∫_0^T f(X_s) ds; using the Gibbs UQ bound one obtains the steady-state bias bound
−ξP,−f(r(Q||P)) ≤ lim_{T→∞} EQ[(1/T)∫_0^T f(Y_s) ds] − Eμ[f] ≤ ξP,f(r(Q||P))
where
ξP,f(η) = inf_{c>0} [λ(c) + η]/c,
λ(c) = lim_{T→∞} (1/T) log E_{P^{0:T}}[e^{c∫_0^T (f(X_s)−Eμ[f]) ds}],
η = r(Q||P) = lim_{T→∞} (1/T) R(Q^{0:T}||P^{0:T})  (relative entropy rate).
SLIDE 27
Coercive dynamics
Langevin equation: dX_t = (−∇V + J∇V) dt + √2 dW_t, for any antisymmetric J, has invariant measure dμ = Z^{−1}e^{−V} dx and generator
L = Δ − ∇V·∇ + J∇V·∇
with the last term antisymmetric.
- Main idea (from Liming Wu): bound the Feynman-Kac semigroup
e^{T(L+V)}h(x) = E_{P^{0:T}_{δx}}[e^{∫_0^T V(X_s) ds} h(X_T)]
- using the Lumer-Phillips theorem:
(1/T) log ‖e^{T(L+V)}‖_{L²(μ)} ≤ sup{ ⟨g, Lg⟩_{L²(μ)} + ∫ V|g|² dμ : ‖g‖₂ = 1 }.
See the works on concentration inequalities by Lezaud, Wu, Cattiaux, Guillin and collaborators, on which we rely here.
SLIDE 28
Poincaré inequalities and bounded f
Theorem: If we have a Poincaré inequality (spectral gap)
Varμ[f] ≤ −α⟨f, Lf⟩_{L²(μ)}, f ∈ D(L),
then for bounded f and general L
λ(c) ≤ c²αVarμ[f]/(1 − αc‖f − Eμ[f]‖_∞)  (Bernstein-type bound)
and then
ξP,f(η) ≤ 2√(αVarμ[f]η) + α‖f − Eμ[f]‖_∞ η.
Theorem: For symmetric L we have the sharper bound
λ(c) ≤ c²σ²(f)/(2(1 − αc‖f‖_∞))
(a Bernstein-type bound with σ²(f) the asymptotic variance), and then ξP,f(η) ≤ √(2σ²(f)η) + α‖f‖_∞ η (sharp for small η).
SLIDE 29
Log-Sobolev inequalities and unbounded f
Assume the stronger log-Sobolev inequality
Eμ[f² log(f²)] − Eμ[f²] log Eμ[f²] ≤ −β⟨f, Lf⟩_{L²(μ)}, f ∈ D(L).
Then using the Gibbs variational principle we get the bound
ξP,f(η) ≤ inf_{c>0} [Λμ,f(c) + βη]/c = Ξμ,f(βη).
- The only trace of the dynamics is left in the constant β.
The tail behavior of f in the stationary distribution determines the UQ; use another (static) concentration inequality.
If V(x) ∼ |x|^b (with the usual bounds on ∇V and ΔV...):
- Poincaré for b > 1, log-Sobolev for b > 2, so UQ bounds for V(X) itself.
For 1 < b ≤ 2 we can use F-Sobolev inequalities to consider unbounded f.
SLIDE 30
Hypocoercive samplers
Goal: to sample from ν(dq) ∝ e^{−βV(q)} dq by extending the phase space and sampling from the measure
μ(dp, dq) = ν(dq)π(dp) ∝ e^{−β(V(q)+p²/2m)} dp dq.
You can use other distributions of p too.
Why?: Add extra dimensions to escape your bad karma.... Make the dynamics irreversible to get faster (maybe).
- Ex 1: Langevin dynamics.
(1) dq_t = (p_t/m) dt,  dp_t = (−∇V(q_t) − γp_t/m) dt + √(2γ/β) dW_t
L = (p/m)^T∇_q − ∇V(q)^T∇_p − γ(p/m)^T∇_p + (γ/β)Δ_p
SLIDE 31
- Ex 2: Randomized Hamiltonian Monte-Carlo.
The particle follows the Hamiltonian equations of motion
dq_t = (p_t/m) dt,  dp_t = −∇V(q_t) dt
without noise or dissipation, for a random amount of time, at which we resample the momentum according to the stationary measure. With the projection Πf = ∫ f(p, q) dπ(p) the generator is
(2) L = (p/m)^T∇_q − ∇V(q)^T∇_p + λ(Π − I)
with the last term symmetric (S = S*).
SLIDE 32
- Ex 3: Bouncy particle sampler.
The particle follows straight lines for a random time. At updating times one either resamples the momentum according to the stationary measure or the particle "bounces", i.e., it undergoes a Newtonian elastic collision on the hyperplane tangential to the gradient of the energy, and the momentum is updated according to the rule
(3) r(q)p = p − 2 (p^T∇V(q)/‖∇V(q)‖²) ∇V(q),  Rf(p, q) = f(q, r(q)p)
(4) L = (p/m)^T∇_q + ((p/m)^T∇V(q))_+ (R − I) + λ(Π − I)
with the last term the noise.
- Zig-zag sampler..... etc...
- Temperature accelerated molecular dynamics
- Ask Gabriel Stoltz.
SLIDE 33
Hypocoercivity
Dolbeault-Mouhot-Schmeiser; Andrieu-Durmus-Nüsken-Roussel; Rousset-Stoltz-Trstanova; Olla, ... after many other works (Villani, Hérau-Nier, Hairer-Eckmann).
Idea: the dynamics is not coercive (no Poincaré inequality in L²(μ) for L), but there exists a scalar product, equivalent to the L²(μ) one, for which a Poincaré inequality holds:
⟨f, g⟩_ε = ⟨f, g⟩ + ε⟨f, (B + B*)g⟩,  B = (1 + (TΠ)*(TΠ))^{−1}(−TΠ)*
where T is the antisymmetric part of the generator.
Modified Poincaré inequality:
(5) ⟨−Lg, g⟩_ε ≥ Λ(ε) Varμ(g)
and Λ(ε) is explicitly expressed in terms of the Poincaré constant for ν(dq), the spectral gap of the noise operator, and the potential V....
SLIDE 34
Performance guarantees for hypocoercive samplers
New results (Jeremiah Birrell and L. R.-B.):
Theorem (Bernstein-type inequalities for hypocoercive samplers): For bounded f we have
P_{μ0}( (1/T)∫_0^T f(X_t) dt − ∫ f dμ ≥ r ) ≤ a(ε) exp( −T b(ε)Λ(ε)r² / (4Varμ[f] + 2c(ε)‖f − Eμ[f]‖_∞ r) )
- where a(ε), b(ε), c(ε) only depend on ε.
→ Explicit non-asymptotic confidence intervals for ∫ f dμ, i.e.
→ UQ bounds for alternative processes:
ξP,f(η) ≤ √(2d(ε)Λ(ε)^{−1}Varμ[f]η) + e(ε)Λ(ε)^{−1}‖f − Eμ[f]‖_∞ η
- where d(ε), e(ε) only depend on ε.