Optimal Bounds between f-Divergences and Integral Probability Metrics (PowerPoint presentation)



slide-1
SLIDE 1

Optimal Bounds between f-Divergences and Integral Probability Metrics

Rohit Agrawal (Harvard) Thibaut Horel (MIT)

slide-2
SLIDE 2

Motivation

[Figure: CDFs P(X ≤ x) over x ∈ [−40, 40], comparing the empirical dist. with a fitted normal dist.]

Is the empirical distribution approximately normal? What is the normal distribution best approximating it?




slide-6
SLIDE 6

Motivation

Typical learning procedure: given

  • observations X
  • a model class M
  • a cost function D(·‖·)

solve min_{Y ∈ M} D(X‖Y)

Example: if D(X‖Y) is the Kullback–Leibler divergence ⇒ maximum likelihood estimation
Problem: what statistical guarantees are implied by D(X‖Y) ≤ ε?
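A minimal numerical sketch of the "KL cost ⇒ maximum likelihood" claim, assuming a hypothetical Gaussian model class and synthetic observations: minimizing D(X̂‖N(μ, σ)) over (μ, σ), where X̂ is the empirical distribution, is the same as maximizing average log-likelihood, so the minimizer is the sample mean and standard deviation.

```python
import numpy as np

# Hypothetical observations X: 10,000 draws from N(3, 2^2)
rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=10_000)

# Minimizing D(X_hat || N(mu, sigma)) over the Gaussian class equals
# maximizing average log-likelihood: the optimum is the sample mean and std.
mu_hat, sigma_hat = x.mean(), x.std()
print(mu_hat, sigma_hat)
```

The recovered (μ̂, σ̂) should be close to the generating parameters (3, 2), illustrating that the KL projection onto the model class is the MLE.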

slide-7
SLIDE 7

Measures of similarity for random variables

How “close” to each other are X and Y?

φ-divergences: Dφ(X‖Y) = E_{y∼Y}[φ(P[X = y] / P[Y = y])], for convex φ with φ(1) = 0
  Ex: Kullback–Leibler (KL) div., χ²-div., Hellinger dist., α-div., etc.

Integral probability metrics: dF(X, Y) = sup_{f ∈ F} (E[f(X)] − E[f(Y)]), for a class F of “test” functions
  Ex: total variation dist., max. mean discrepancy, etc.
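Both definitions can be evaluated directly on a discrete support. The sketch below uses two hypothetical distributions p (law of X) and q (law of Y); for the IPM it takes F = {f with values in [−1, 1]}, for which the supremum is attained at f = sign(p − q) and equals twice the total variation distance.

```python
import numpy as np

# Hypothetical discrete distributions on a common 3-point support
p = np.array([0.2, 0.5, 0.3])  # law of X
q = np.array([0.4, 0.4, 0.2])  # law of Y

# phi-divergence with phi(x) = x*log(x), i.e. the KL divergence:
# D_phi(X||Y) = E_{y~Y}[phi(P[X = y] / P[Y = y])] = sum_y q_y * phi(p_y / q_y)
kl = np.sum(q * (p / q) * np.log(p / q))

# IPM over F = {f with values in [-1, 1]}: sup of E[f(X)] - E[f(Y)],
# attained at f = sign(p - q), equal to 2 * TV(p, q)
f_opt = np.sign(p - q)
ipm = np.dot(f_opt, p - q)
print(kl, ipm)
```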

slide-8
SLIDE 8

What is the best lower bound of Dφ(X‖Y) in terms of E[f(X)] − E[f(Y)]?

slide-9
SLIDE 9

Result

Theorem (Informal). There exists an explicit function Kf(Y): ℝ → ℝ associated with f(Y) inducing a correspondence between

  1. lower bounds Dφ(X‖Y) ≥ L(E[f(X)] − E[f(Y)]) for all X, and
  2. upper bounds Kf(Y)(t) ≤ B(t) for all t ∈ ℝ

Ex: for the KL divergence, Kf(Y) is the (centered) log moment-generating function



slide-12
SLIDE 12

Cumulant-generating function

For a given φ-divergence, define:

  • the convex conjugate φ⋆(y) = sup_{x ≥ 0} (x·y − φ(x))
  • the φ-cumulant-generating function of f(Y):

      Kf(Y)(t) = inf_{λ ∈ ℝ} E[φ⋆(t·f(Y) + λ) − t·f(Y) − λ]

Example: for the KL divergence, φ(x) = x log x and φ⋆(y) = exp(y − 1); we recover the (centered) cumulant-generating function

      Kf(Y)(t) = log E[exp(t·f(Y) − t·E[f(Y)])]
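The KL example can be checked numerically: for a hypothetical discrete f(Y), the infimum over λ (approximated here by a grid search) of E[φ⋆(t·f(Y) + λ) − t·f(Y) − λ] should match the centered cumulant-generating function at every t.

```python
import numpy as np

# Hypothetical discrete f(Y): values and their probabilities under Y
vals = np.array([-1.0, 0.0, 1.0])
probs = np.array([0.3, 0.4, 0.3])

phi_star = lambda y: np.exp(y - 1.0)  # conjugate of phi(x) = x*log(x)

def K(t, lams=np.linspace(-10, 10, 200_001)):
    # K_{f(Y)}(t) = inf_lambda E[phi*(t*f(Y) + lambda) - t*f(Y) - lambda],
    # with the infimum approximated over a fine lambda grid
    obj = (probs[:, None] * phi_star(t * vals[:, None] + lams)).sum(axis=0)
    return (obj - t * np.dot(probs, vals) - lams).min()

def centered_cgf(t):
    # log E[exp(t*f(Y))] - t*E[f(Y)]
    return np.log(np.dot(probs, np.exp(t * vals))) - t * np.dot(probs, vals)

for t in (-2.0, -0.5, 0.0, 1.0, 3.0):
    print(t, K(t), centered_cgf(t))
```

The two agree up to grid resolution, matching the closed-form computation on the slide (the optimal λ is 1 − log E[exp(t·f(Y))]).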


slide-14
SLIDE 14

Result

Theorem. The following are equivalent:

  1. Kf(Y)(t) ≤ B(t) for all t ∈ ℝ
  2. Dφ(X‖Y) ≥ B⋆(E[f(X)] − E[f(Y)]) for all X

where Kf(Y)(t) = inf_{λ ∈ ℝ} E[φ⋆(t·f(Y) + λ) − t·f(Y) − λ] and ⋆ denotes the convex conjugate

Key technique: use convex analysis to obtain variational representations of Dφ(X‖Y)
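A sketch of the direction 1 ⇒ 2 for the KL divergence, with a hypothetical f taking values in [−1, 1] and a fixed Y: Hoeffding's lemma gives Kf(Y)(t) ≤ B(t) = t²/2, whose conjugate is B⋆(s) = s²/2, so the theorem predicts D(X‖Y) ≥ (E[f(X)] − E[f(Y)])²/2 for every X.

```python
import numpy as np

# Fixed law of Y and hypothetical test-function values f in [-1, 1]
q = np.array([0.25, 0.25, 0.5])
f = np.array([-1.0, 0.3, 1.0])

# With B(t) = t^2/2 (Hoeffding for bounded f), the theorem gives
# D_KL(X||Y) >= B*(E[f(X)] - E[f(Y)]) = (E[f(X)] - E[f(Y)])^2 / 2 for all X.
rng = np.random.default_rng(0)
for _ in range(1000):
    p = rng.dirichlet(np.ones(3))    # a random law for X
    kl = np.sum(p * np.log(p / q))   # D_KL(X||Y)
    gap = np.dot(f, p - q)           # E[f(X)] - E[f(Y)]
    assert kl >= gap**2 / 2 - 1e-12
print("lower bound held for 1000 random X")
```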


slide-17
SLIDE 17

Applications and examples

  1. For the KL divergence, if f takes values in [−1, 1]:

      Kf(Y)(t) = log E[exp(t·f(Y) − t·E[f(Y)])] ≤ t²/2  (Hoeffding’s lemma)

      ⇒ D(X‖Y) ≥ (1/2)(E[f(X)] − E[f(Y)])²  (Pinsker’s inequality)

     Holds more generally if f(Y) is subgaussian.

  2. “Pinsker-type” inequalities for all α-divergences (Rényi divergences).

  3. Negative result: when lim_{x→∞} φ(x)/x < ∞, f(Y) unbounded ⇒ no nontrivial lower bound.
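The recovered Pinsker inequality can be sanity-checked numerically: since the supremum of E[f(X)] − E[f(Y)] over f with values in [−1, 1] equals 2·TV(X, Y), item 1 gives D(X‖Y) ≥ 2·TV(X, Y)² for arbitrary (here randomly drawn) discrete distributions.

```python
import numpy as np

# Check D_KL(X||Y) >= 2 * TV(X, Y)^2 on random pairs of discrete distributions
rng = np.random.default_rng(1)
for _ in range(1000):
    p = rng.dirichlet(np.ones(4))      # law of X
    q = rng.dirichlet(np.ones(4))      # law of Y
    kl = np.sum(p * np.log(p / q))     # D_KL(X||Y)
    tv = 0.5 * np.abs(p - q).sum()     # total variation distance
    assert kl >= 2 * tv**2 - 1e-12
print("Pinsker held for 1000 random pairs")
```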

slide-18
SLIDE 18

Conclusion

  • complete description of optimal lower bounds of φ-divergences in terms of IPMs
  • results of independent interest on topological properties of φ-divergences
  • tools and techniques could be more broadly applied

Thanks!
