Limit Distributions for Smooth Total Variation and χ²-Divergence in High Dimensions

Ziv Goldfeld and Kengo Kato

Cornell University

The 2020 International Symposium on Information Theory, June 2020

Statistical Distances

Definition: Measure discrepancy between prob. distributions
δ : P(R^d) × P(R^d) → [0, ∞) s.t. δ(P, Q) = 0 ⟺ P = Q
If symmetric & δ(P, Q) ≤ δ(P, R) + δ(R, Q), then δ is a metric

Popular Examples:
◮ f-divergence: D_f(P‖Q) := E_Q[f(dP/dQ)], for convex f : R → [0, ∞)
  (KL divergence, total variation, χ²-divergence, etc.)
◮ p-Wasserstein dist.: W_p(P, Q) := (inf_{π∈Π(P,Q)} E_π[‖X − Y‖^p])^{1/p}, where Π(P, Q) is the set of couplings of P and Q
◮ Integral probability metrics: γ_F(P, Q) := sup_{f∈F} E_P[f] − E_Q[f]
  (W_1, TV, MMD, Dudley, Sobolev)
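
[Editorial illustration, not from the talk.] To ground the definitions, here is a minimal numerical sketch, assuming only standard numpy/scipy: TV between two explicit 1-D Gaussian densities via a Riemann sum, and W_1 between two empirical samples using scipy's 1-D routine.

```python
# Minimal sketch: evaluating two of the distances above for 1-D Gaussians,
# where simple estimators exist.
import numpy as np
from scipy.stats import norm, wasserstein_distance

rng = np.random.default_rng(0)

# Total variation between explicit densities via a Riemann sum:
# delta_TV(P, Q) = (1/2) * integral of |p(x) - q(x)| dx
x = np.linspace(-10, 10, 20001)
dx = x[1] - x[0]
p = norm.pdf(x, loc=0.0, scale=1.0)
q = norm.pdf(x, loc=1.0, scale=1.0)
tv = 0.5 * np.sum(np.abs(p - q)) * dx

# W1 between empirical samples: in d = 1, W1 equals the L1 distance
# between quantile functions, which scipy computes directly.
X = rng.normal(0.0, 1.0, size=5000)
Y = rng.normal(1.0, 1.0, size=5000)
w1 = wasserstein_distance(X, Y)

print(f"TV ~ {tv:.3f}")   # closed form: 2*Phi(1/2) - 1 ~ 0.383
print(f"W1 ~ {w1:.3f}")   # population value: |0 - 1| = 1
```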

Statistical Distances: Why are they useful?

Historically: Prob. theory, mathematical statistics, information theory
◮ Topological and metric structure of P(R^d)
◮ Inequalities (Pinsker, Talagrand, joint-range, etc.)
◮ Hypothesis testing, goodness-of-fit tests, etc.
◮ Fundamental performance limits of operational problems . . .
Recently: Variety of applications in machine learning
◮ Implicit generative modeling
◮ Barycenter computation
◮ Anomaly detection, model ensembling, etc.

Implicit (Latent Variable) Generative Models

Goal: Learn a model Q_θ ≈ P to approximate the data distribution
Method: Complicated transformation of a simple latent variable
◮ Latent variable Z ∼ Q_Z ∈ P(R^{d₀}), d₀ ≪ d
◮ Expand Z to R^d via a (random) transformation Q^{(θ)}_{X|Z}
⇒ Generative model: Q_θ(·) := ∫_{R^{d₀}} Q^{(θ)}_{X|Z}(·|z) dQ_Z(z)

Minimum Distance Estimation: Solve θ⋆ ∈ argmin_θ δ(P, Q_θ)

[Figure: a simple latent space mapped into the high-dimensional target space]
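
[Editorial illustration.] As a concrete toy instance of this construction, assuming a hypothetical map g_θ (not the talk's): take Q_Z = N(0, I_{d₀}) and let Q^{(θ)}_{X|Z} add Gaussian noise to a nonlinear push-forward of Z, so that Q_θ is supported near a low-dimensional manifold in R^d.

```python
# Minimal sketch of an implicit generative model: Z ~ N(0, I_{d0}) in a
# low-dimensional latent space, pushed through a nonlinear map g_theta into
# R^d, plus observation noise (the random part of Q_{X|Z}).
import numpy as np

rng = np.random.default_rng(1)
d0, d = 2, 10                       # latent dim << ambient dim

# "theta" = weights of a one-hidden-layer map g_theta: R^{d0} -> R^d
W1 = rng.normal(size=(32, d0))
W2 = rng.normal(size=(d, 32))

def sample_Q_theta(n, noise=0.1):
    """Draw n samples from Q_theta = int Q_{X|Z}(.|z) dQ_Z(z)."""
    Z = rng.normal(size=(n, d0))                     # Z ~ Q_Z
    H = np.tanh(Z @ W1.T)                            # deterministic part of g_theta
    X_mean = H @ W2.T
    return X_mean + noise * rng.normal(size=(n, d))  # random Q_{X|Z}

X = sample_Q_theta(1000)   # X_i ~ Q_theta, concentrated near a 2-D manifold in R^10
print(X.shape)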
Implicit (Latent Variable) Generative Models 2

Goal: Solve OPT := inf_θ δ(P, Q_θ) exactly (find θ⋆)
Estimation: We don't have P, but data
◮ {X_i}_{i=1}^n are i.i.d. samples from P ∈ P(R^d)
◮ Empirical distribution P_n := (1/n) Σ_{i=1}^n δ_{X_i} ⇒ inherently we work with δ(P_n, Q_θ)
Optimization: Can solve inf_θ δ(P_n, Q_θ) approximately
◮ Find θ̂_n s.t. δ(P_n, Q_{θ̂_n}) ≤ inf_θ δ(P_n, Q_θ) + ε
Generalization [Zhang et al.'18]: δ(P, Q_{θ̂_n}) − OPT ≤ 2δ(P_n, P) + ε
⇒ Boils down to an empirical approximation question under δ
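
[Editorial illustration.] A toy minimum distance estimator, under assumptions of my choosing (model Q_θ = N(θ, 1), data P = N(2, 1), δ = W_1, grid search over θ): fit θ by approximately minimizing δ(P_n, Q_θ), exactly the "+ ε" step above.

```python
# Toy minimum distance estimation sketch: Q_theta = N(theta, 1), delta = W1,
# P_n = empirical measure of n i.i.d. samples from the unknown P = N(2, 1).
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(2)
X = rng.normal(2.0, 1.0, size=2000)          # data ~ P; P itself is never used below

def delta_Pn_Qtheta(theta, m=4000):
    Y = rng.normal(theta, 1.0, size=m)       # samples from the model Q_theta
    return wasserstein_distance(X, Y)        # W1(P_n, Q_theta), up to MC error

thetas = np.linspace(-1.0, 5.0, 121)
vals = [delta_Pn_Qtheta(t) for t in thetas]
theta_hat = thetas[int(np.argmin(vals))]     # approximate argmin (the "+ eps" step)
print(f"theta_hat = {theta_hat:.2f}")        # lands near the true mean 2
```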

Empirical Approximation

Question: What can we say about δ(P_n, P)?
Problem 1: δ(P_n, P) can be ill-defined/vacuous when P ≪ Leb(R^d)
◮ KL or χ² divergence: D_KL(P_n‖P) = χ²(P_n‖P) = ∞
◮ Total variation: δ_TV(P_n, P) = 1 . . .
Solution 1: Use a more sophisticated estimate P̂_n of P (KDE, kNN, etc.)
Problem 2: Empirical approximation rates are n^{−1/d}
◮ f-divergence: [Nguyen-Wainwright-Jordan'15], [Kandasamy et al.'15]
◮ p-Wasserstein dist.: [Fournier-Guillin'15], [Singh-Póczos'18]
◮ Integral probability metrics: [Sriperumbudur et al.'09], [Liang'19]
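
[Editorial illustration.] One heuristic way to see the n^{−1/d} curse numerically: any coupling of P and P_n must transport a fresh X ∼ P to some atom of P_n, so W_1(P_n, P) ≥ E[min_i ‖X − X_i‖], and that nearest-atom distance scales like n^{−1/d}. A sketch, assuming P = Unif[0,1]^d:

```python
# Heuristic sketch of the n^{-1/d} curse: W1(P_n, P) >= E[min_i ||X - X_i||]
# for fresh X ~ P, since any coupling sends X to some atom of P_n.
# Watch the lower bound barely decay once d is moderately large.
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(3)
n = 10_000
for d in (1, 2, 5, 10):
    sample = rng.uniform(size=(n, d))        # X_1..X_n ~ P = Unif[0,1]^d
    fresh = rng.uniform(size=(2000, d))      # independent fresh draws from P
    dist, _ = cKDTree(sample).query(fresh)   # nearest-atom distances
    print(f"d={d:2d}: E[min_i ||X - X_i||] ~ {dist.mean():.4f}")
```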

Smooth Statistical Distances

Definition (ZG-Greenewald-Polyanskiy-Weed'19): For σ ≥ 0, the smooth δ statistical distance between P and Q is δ^(σ)(P, Q) := δ(P ∗ N_σ, Q ∗ N_σ), where N_σ := N(0, σ²I_d) is a d-dimensional isotropic Gaussian.

Interpretation: X ∼ P, Y ∼ Q and Z_1, Z_2 ∼ N_σ
◮ X ⊥ Z_1 ⇒ X + Z_1 ∼ P ∗ N_σ  &  Y ⊥ Z_2 ⇒ Y + Z_2 ∼ Q ∗ N_σ

Robustness to Supp. Mismatch: δ^(σ)(P, Q) < ∞, ∀ P, Q ∈ P(R^d)
Preserves metric structure: If δ is a metric on P(R^d), then so is δ^(σ)
◮ Pf. idea: Use characteristic functions Φ_P(t) := E_P[e^{i⟨t,X⟩}] and
  Φ_{P∗N_σ} = Φ_P · Φ_{N_σ}  &  Φ_{N_σ}(t) = e^{−σ²‖t‖²/2} ≠ 0, ∀t,
  so δ^(σ)(P, Q) = 0 forces Φ_P = Φ_Q, i.e., P = Q.
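
[Editorial illustration.] The support-mismatch robustness is easy to see numerically in d = 1 (my example, with point masses): δ_TV(δ_0, δ_{0.5}) = 1 no matter how close the atoms are, but the smooth version is strictly below 1 and decays as σ grows.

```python
# Sketch: Gaussian smoothing makes TV informative under support mismatch.
# P = point mass at 0, Q = point mass at 0.5: delta_TV(P, Q) = 1, but
# delta_TV^{(sigma)}(P, Q) = TV(N(0, s^2), N(0.5, s^2)) < 1 and -> 0 as s grows.
import numpy as np
from scipy.stats import norm

x = np.linspace(-8, 8, 40001)
dx = x[1] - x[0]
for s in (0.1, 0.5, 1.0, 2.0):
    p_s = norm.pdf(x, 0.0, s)            # (P * N_sigma) density
    q_s = norm.pdf(x, 0.5, s)            # (Q * N_sigma) density
    tv_s = 0.5 * np.sum(np.abs(p_s - q_s)) * dx
    print(f"sigma={s:3.1f}: smooth TV ~ {tv_s:.3f}")   # equals 2*Phi(0.25/s) - 1
```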

Smooth Statistical Distances: Empirical Approx.

High Level: Alleviate curse of dimensionality & get limit distributions

Theorem (ZG-Greenewald-Polyanskiy-Weed'19): For any d ≥ 1 and σ > 0, E[δ^(σ)(P_n, P)] → 0 at a dimension-free rate:
◮ Distance: E[δ^(σ)(P_n, P)] ≍ n^{−1/2} for δ^(σ)_TV and W_1^(σ)
◮ Distance²: E[δ^(σ)(P_n, P)] ≍ n^{−1} for (W_2^(σ))², D^(σ)_KL and χ²_σ
under a sub-Gaussian condition on P.

Theorem (ZG-Kato'20): For sub-Gaussian P: √n W_1^(σ)(P_n, P) →_D sup_{f∈Lip_1(R^d)} G^(σ)_P(f), ∀ d ≥ 1

◮ The limit distribution shows the n^{−1/2} rate is sharp
◮ The W_1 limit distribution is known only for d = 1 (where W_1(P_n, P) = ‖F_n − F‖_{L¹(R)})
◮ W_1^(σ) has a stable limit (characterized) in all d!
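
[Editorial illustration.] A Monte Carlo sanity check of the n^{−1/2} rate, under assumptions of my choosing (d = 1, P = N(0, 1), σ = 1, so P ∗ N_σ = N(0, 1 + σ²) in closed form):

```python
# Sketch: Monte Carlo check of the n^{-1/2} rate for smooth TV in d = 1.
# delta_TV^{(sigma)}(P_n, P) is computed on a grid from the two smoothed densities.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
sigma = 1.0
x = np.linspace(-10, 10, 2001)
dx = x[1] - x[0]
p_s = norm.pdf(x, 0.0, np.sqrt(1.0 + sigma**2))   # (P * N_sigma) density

def smooth_tv(sample):
    pn_s = norm.pdf(x[:, None], sample[None, :], sigma).mean(axis=1)  # P_n * N_sigma
    return 0.5 * np.sum(np.abs(pn_s - p_s)) * dx

for n in (100, 400, 1600):
    vals = [smooth_tv(rng.normal(size=n)) for _ in range(20)]
    print(f"n={n:4d}: E ~ {np.mean(vals):.4f}, sqrt(n)*E ~ {np.sqrt(n) * np.mean(vals):.3f}")
# sqrt(n) * E[delta_TV^{(sigma)}(P_n, P)] stabilizes, matching the n^{-1/2} rate.
```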

New Limit Distribution Results: Smooth TV

TV distance: δ_TV(P, Q) := E_Q[(1/2)|dP/dQ − 1|] = (1/2)‖p − q‖_1
Smooth TV: δ^(σ)_TV(P, Q) := δ_TV(P ∗ N_σ, Q ∗ N_σ) = (1/2)‖p_σ − q_σ‖_1

Theorem (ZG-Kato'20): For any d ≥ 1 and σ > 0, if
C^(TV)_{P,σ} := ∫_{R^d} √(Var_P(ϕ_σ(x − X))) dx < ∞,
where ϕ_σ is the N_σ density, then
√n δ^(σ)_TV(P_n, P) →_D (1/2)‖G^(σ)_P‖_1,
for a centered Gaussian process (G^(σ)_P(x))_{x∈R^d} with sample paths in L¹(R^d) and
E[G^(σ)_P(x) G^(σ)_P(y)] = Cov_P(ϕ_σ(x − X), ϕ_σ(y − X)).

Comments:
1. The n^{−1/2} rate is sharp for E[δ^(σ)_TV(P_n, P)], and a concentration inequality holds
2. The condition C^(TV)_{P,σ} < ∞ is sharp: lim inf_n √n E[δ^(σ)_TV(P_n, P)] ≥ C^(TV)_{P,σ}/√(2π)
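
[Editorial illustration.] The limit (1/2)‖G^(σ)_P‖_1 can be simulated directly from the covariance above. A sketch, under assumptions of my choosing (d = 1, P = N(0, 1), σ = 1, GP sampled on a truncated grid, covariance estimated by Monte Carlo over X ∼ P):

```python
# Sketch: simulate the limit (1/2)||G_P^{(sigma)}||_1 by sampling the Gaussian
# process on a grid from Cov_P(phi_sigma(x - X), phi_sigma(y - X)).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
sigma = 1.0
x = np.linspace(-6, 6, 201)
dx = x[1] - x[0]

X = rng.normal(size=20_000)                        # X ~ P = N(0, 1)
Phi = norm.pdf(x[:, None], X[None, :], sigma)      # rows: phi_sigma(x_j - X_i)
C = np.cov(Phi)                                    # MC estimate of the GP covariance

G = rng.multivariate_normal(np.zeros(len(x)), C, size=2000, check_valid="ignore")
limit = 0.5 * np.sum(np.abs(G), axis=1) * dx       # samples of (1/2)||G||_1
print(f"limit law: mean ~ {limit.mean():.3f}, "
      f"95% quantile ~ {np.quantile(limit, 0.95):.3f}")
# Compare against sqrt(n) * delta_TV^{(sigma)}(P_n, P) from the previous sketch.
```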

Proof Outline

PDFs: P_n ∗ ϕ_σ(x) = (1/n) Σ_{i=1}^n ϕ_σ(x − X_i) ⇒ P ∗ ϕ_σ(x) = E[P_n ∗ ϕ_σ(x)]
Smooth TV: Z_i(x) := ϕ_σ(x − X_i) − P ∗ ϕ_σ(x) & Z̄_n(x) := (1/√n) Σ_{i=1}^n Z_i(x)
⇒ √n δ^(σ)_TV(P_n, P) = (1/2)‖Z̄_n‖_1

Theorem (CLT in Banach Spaces): For p ∈ [1, ∞) and Z_1, . . . , Z_n i.i.d. L^p-valued centered RVs with Z̄_n = (1/√n) Σ_{i=1}^n Z_i:
P(‖Z_1‖_p > t) = o(t^{−2}) as t → ∞ & ∫_{R^d} (E|Z_1(x)|²)^{p/2} dx < ∞
⟺ Z̄_n converges in L^p to a centered Gaussian G with the same covariance as Z_1

Verify for p = 1: ‖Z_1‖_1 ≤ 2 & ∫_{R^d} (E[|Z_1(x)|²])^{1/2} dx < ∞ by assumption
⇒ Z̄_n →_w G, and by the CMT, √n δ^(σ)_TV(P_n, P) →_D (1/2)‖G‖_1 for G = G^(σ)_P
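
[Editorial illustration.] The key identity √n δ^(σ)_TV(P_n, P) = (1/2)‖Z̄_n‖_1 is just algebra, and easy to confirm on a grid (my sketch, again with d = 1, P = N(0, 1), σ = 1):

```python
# Sketch: numerically check sqrt(n) * deltaTV^{(sigma)}(P_n, P) = (1/2)||Zbar_n||_1
# on a grid, with Z_i(x) = phi_sigma(x - X_i) - P*phi_sigma(x).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)
sigma, n = 1.0, 500
x = np.linspace(-10, 10, 2001)
dx = x[1] - x[0]

X = rng.normal(size=n)
phi = norm.pdf(x[:, None], X[None, :], sigma)        # phi_sigma(x - X_i)
p_smooth = norm.pdf(x, 0.0, np.sqrt(1 + sigma**2))   # P * phi_sigma in closed form
Z = phi - p_smooth[:, None]                          # Z_i(x), centered summands
Zbar = Z.sum(axis=1) / np.sqrt(n)                    # (1/sqrt(n)) sum_i Z_i

lhs = np.sqrt(n) * 0.5 * np.sum(np.abs(phi.mean(axis=1) - p_smooth)) * dx
rhs = 0.5 * np.sum(np.abs(Zbar)) * dx
print(f"lhs = {lhs:.6f}, rhs = {rhs:.6f}")           # agree up to rounding
```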

New Limit Distribution Results: Smooth χ²

χ²-divergence: χ²(P‖Q) := E_Q[(dP/dQ − 1)²]
Smooth χ²: χ²_σ(P‖Q) := χ²(P ∗ N_σ ‖ Q ∗ N_σ) = ∫_{R^d} (p_σ(x)/q_σ(x) − 1)² q_σ(x) dx

Theorem (ZG-Kato'20): For any d ≥ 1 and σ > 0, if ∫_{R^d} Var_P(ϕ_σ(x − X)) / (P ∗ ϕ_σ(x)) dx < ∞, then
n χ²_σ(P_n‖P) →_D ∫_{R^d} (G^(σ)_P(x))² / (P ∗ ϕ_σ(x)) dx,
for the centered GP G^(σ)_P with the same covariance as before, s.t. G^(σ)_P / √(P ∗ ϕ_σ) has sample paths in L²(R^d).

Comments:
1. The condition holds for any β-sub-Gaussian P with β < σ/√2
2. Pf. like δ^(σ)_TV, but with Z_i(x) := ϕ_σ(x − X_i)/(P ∗ ϕ_σ(x)) − 1 and the CLT in L²(R^d, P ∗ N_σ)
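
[Editorial illustration.] The O(n^{−1}) scaling is also visible numerically: n · χ²_σ(P_n‖P) should stabilize in n. A sketch, under assumptions of my choosing (d = 1, P = N(0, 1), σ = 1, grid integration):

```python
# Sketch: compute n * chi^2_sigma(P_n || P) on a grid and watch its mean
# stabilize across n, as the limit theorem predicts.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
sigma = 1.0
x = np.linspace(-10, 10, 2001)
dx = x[1] - x[0]
p_s = norm.pdf(x, 0.0, np.sqrt(1 + sigma**2))       # (P * N_sigma) density

def n_chi2_sigma(n):
    X = rng.normal(size=n)
    pn_s = norm.pdf(x[:, None], X[None, :], sigma).mean(axis=1)   # P_n * N_sigma
    return n * np.sum((pn_s / p_s - 1.0) ** 2 * p_s) * dx

for n in (100, 400, 1600):
    vals = [n_chi2_sigma(n) for _ in range(30)]
    print(f"n={n:4d}: E[n * chi2_sigma] ~ {np.mean(vals):.3f}")
```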

Summary

Classic statistical distances: Rich history and modern applications
◮ Support mismatch issues
◮ Slow empirical approximation (n^{−1/d} rates)
Smooth statistical distances: Convolve distributions with N_σ
◮ Robust to mismatched support
◮ Inherits metric structure
◮ Fast (parametric) empirical convergence in all dimensions
◮ Limit distributions for scaled δ^(σ)(P_n, P) in all dimensions (W_1^(σ), δ^(σ)_TV, χ²_σ)
Applications: Based on the smooth statistical distance paradigm
◮ Generative modeling via minimum distance estimation
◮ Goodness-of-fit testing, two-sample testing, etc.

Thank you!