Optimal Bounds between f-Divergences and Integral Probability Metrics (PowerPoint presentation)



slide-1
SLIDE 1

Optimal Bounds between f-Divergences and Integral Probability Metrics

Rohit Agrawal (Harvard) Thibaut Horel (MIT)

slide-2
SLIDE 2

Motivation

[Figure: CDFs P(X ≤ x) over x ∈ [−40, 40], comparing the empirical dist. with a fitted normal dist.]

Is the empirical distribution approximately normal? What is the normal distribution best approximating it?




slide-6
SLIDE 6

Motivation

Typical learning procedure: given

  • observations X
  • a model class M
  • a cost function D(·‖·)

solve min_{Y ∈ M} D(X‖Y)

Example: if D(X‖Y) is the Kullback–Leibler divergence ⇒ maximum likelihood estimation
Problem: what statistical guarantees are implied by D(X‖Y) ≤ ε?
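A minimal numerical sketch of the "KL cost ⇒ maximum likelihood" claim, assuming a hypothetical Gaussian model class and synthetic observations: minimizing D(X̂‖N(μ, σ)) over (μ, σ), where X̂ is the empirical distribution, is the same as maximizing average log-likelihood, so the minimizer is the sample mean and standard deviation.

```python
import numpy as np

# Hypothetical observations X: 10,000 draws from N(3, 2^2)
rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=10_000)

# Minimizing D(X_hat || N(mu, sigma)) over the Gaussian class equals
# maximizing average log-likelihood: the optimum is the sample mean and std.
mu_hat, sigma_hat = x.mean(), x.std()
print(mu_hat, sigma_hat)
```

The recovered (μ̂, σ̂) should be close to the generating parameters (3, 2), illustrating that the KL projection onto the model class is the MLE.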

slide-7
SLIDE 7

Measures of similarity for random variables

How “close” to each other are X and Y?

φ-divergences: Dφ(X‖Y) = E_{y∼Y}[φ(P[X = y] / P[Y = y])], for convex φ with φ(1) = 0
  Ex: Kullback–Leibler (KL) div., χ²-div., Hellinger dist., α-div., etc.

Integral probability metrics: dF(X, Y) = sup_{f ∈ F} (E[f(X)] − E[f(Y)]), for a class F of “test” functions
  Ex: total variation dist., max. mean discrepancy, etc.
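Both definitions can be evaluated directly on a discrete support. The sketch below uses two hypothetical distributions p (law of X) and q (law of Y); for the IPM it takes F = {f with values in [−1, 1]}, for which the supremum is attained at f = sign(p − q) and equals twice the total variation distance.

```python
import numpy as np

# Hypothetical discrete distributions on a common 3-point support
p = np.array([0.2, 0.5, 0.3])  # law of X
q = np.array([0.4, 0.4, 0.2])  # law of Y

# phi-divergence with phi(x) = x*log(x), i.e. the KL divergence:
# D_phi(X||Y) = E_{y~Y}[phi(P[X = y] / P[Y = y])] = sum_y q_y * phi(p_y / q_y)
kl = np.sum(q * (p / q) * np.log(p / q))

# IPM over F = {f with values in [-1, 1]}: sup of E[f(X)] - E[f(Y)],
# attained at f = sign(p - q), equal to 2 * TV(p, q)
f_opt = np.sign(p - q)
ipm = np.dot(f_opt, p - q)
print(kl, ipm)
```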

slide-8
SLIDE 8

What is the best lower bound of Dφ(X‖Y) in terms of E[f(X)] − E[f(Y)]?

slide-9
SLIDE 9

Result

Theorem (Informal). There exists an explicit function Kf(Y): ℝ → ℝ associated with f(Y) inducing a correspondence between

  1. lower bounds Dφ(X‖Y) ≥ L(E[f(X)] − E[f(Y)]) for all X, and
  2. upper bounds Kf(Y)(t) ≤ B(t) for all t ∈ ℝ

Ex: for the KL divergence, Kf(Y) is the (centered) log moment-generating function



slide-12
SLIDE 12

Cumulant-generating function

For a given φ-divergence, define:

  • the convex conjugate φ⋆(y) = sup_{x ≥ 0} (x·y − φ(x))
  • the φ-cumulant-generating function of f(Y):

      Kf(Y)(t) = inf_{λ ∈ ℝ} E[φ⋆(t·f(Y) + λ) − t·f(Y) − λ]

Example: for the KL divergence, φ(x) = x log x and φ⋆(y) = exp(y − 1); we recover the (centered) cumulant-generating function

      Kf(Y)(t) = log E[exp(t·f(Y) − t·E[f(Y)])]
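The KL example can be checked numerically: for a hypothetical discrete f(Y), the infimum over λ (approximated here by a grid search) of E[φ⋆(t·f(Y) + λ) − t·f(Y) − λ] should match the centered cumulant-generating function at every t.

```python
import numpy as np

# Hypothetical discrete f(Y): values and their probabilities under Y
vals = np.array([-1.0, 0.0, 1.0])
probs = np.array([0.3, 0.4, 0.3])

phi_star = lambda y: np.exp(y - 1.0)  # conjugate of phi(x) = x*log(x)

def K(t, lams=np.linspace(-10, 10, 200_001)):
    # K_{f(Y)}(t) = inf_lambda E[phi*(t*f(Y) + lambda) - t*f(Y) - lambda],
    # with the infimum approximated over a fine lambda grid
    obj = (probs[:, None] * phi_star(t * vals[:, None] + lams)).sum(axis=0)
    return (obj - t * np.dot(probs, vals) - lams).min()

def centered_cgf(t):
    # log E[exp(t*f(Y))] - t*E[f(Y)]
    return np.log(np.dot(probs, np.exp(t * vals))) - t * np.dot(probs, vals)

for t in (-2.0, -0.5, 0.0, 1.0, 3.0):
    print(t, K(t), centered_cgf(t))
```

The two agree up to grid resolution, matching the closed-form computation on the slide (the optimal λ is 1 − log E[exp(t·f(Y))]).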


slide-14
SLIDE 14

Result

Theorem. The following are equivalent:

  1. Kf(Y)(t) ≤ B(t) for all t ∈ ℝ
  2. Dφ(X‖Y) ≥ B⋆(E[f(X)] − E[f(Y)]) for all X

where Kf(Y)(t) = inf_{λ ∈ ℝ} E[φ⋆(t·f(Y) + λ) − t·f(Y) − λ] and ⋆ denotes the convex conjugate

Key technique: use convex analysis to obtain variational representations of Dφ(X‖Y)
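A sketch of the direction 1 ⇒ 2 for the KL divergence, with a hypothetical f taking values in [−1, 1] and a fixed Y: Hoeffding's lemma gives Kf(Y)(t) ≤ B(t) = t²/2, whose conjugate is B⋆(s) = s²/2, so the theorem predicts D(X‖Y) ≥ (E[f(X)] − E[f(Y)])²/2 for every X.

```python
import numpy as np

# Fixed law of Y and hypothetical test-function values f in [-1, 1]
q = np.array([0.25, 0.25, 0.5])
f = np.array([-1.0, 0.3, 1.0])

# With B(t) = t^2/2 (Hoeffding for bounded f), the theorem gives
# D_KL(X||Y) >= B*(E[f(X)] - E[f(Y)]) = (E[f(X)] - E[f(Y)])^2 / 2 for all X.
rng = np.random.default_rng(0)
for _ in range(1000):
    p = rng.dirichlet(np.ones(3))    # a random law for X
    kl = np.sum(p * np.log(p / q))   # D_KL(X||Y)
    gap = np.dot(f, p - q)           # E[f(X)] - E[f(Y)]
    assert kl >= gap**2 / 2 - 1e-12
print("lower bound held for 1000 random X")
```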


slide-17
SLIDE 17

Applications and examples

  1. For the KL divergence, if f takes values in [−1, 1]:

      Kf(Y)(t) = log E[exp(t·f(Y) − t·E[f(Y)])] ≤ t²/2  (Hoeffding’s lemma)

      ⇒ D(X‖Y) ≥ (1/2)(E[f(X)] − E[f(Y)])²  (Pinsker’s inequality)

     Holds more generally if f(Y) is subgaussian.

  2. “Pinsker-type” inequalities for all α-divergences (Rényi divergences).

  3. Negative result: when lim_{x→∞} φ(x)/x < ∞, f(Y) unbounded ⇒ no nontrivial lower bound.
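The recovered Pinsker inequality can be sanity-checked numerically: since the supremum of E[f(X)] − E[f(Y)] over f with values in [−1, 1] equals 2·TV(X, Y), item 1 gives D(X‖Y) ≥ 2·TV(X, Y)² for arbitrary (here randomly drawn) discrete distributions.

```python
import numpy as np

# Check D_KL(X||Y) >= 2 * TV(X, Y)^2 on random pairs of discrete distributions
rng = np.random.default_rng(1)
for _ in range(1000):
    p = rng.dirichlet(np.ones(4))      # law of X
    q = rng.dirichlet(np.ones(4))      # law of Y
    kl = np.sum(p * np.log(p / q))     # D_KL(X||Y)
    tv = 0.5 * np.abs(p - q).sum()     # total variation distance
    assert kl >= 2 * tv**2 - 1e-12
print("Pinsker held for 1000 random pairs")
```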

slide-18
SLIDE 18

Conclusion

  • complete description of optimal lower bounds of φ-divergences in terms of IPMs
  • results of independent interest on topological properties of φ-divergences
  • tools and techniques could be more broadly applied

Thanks!
