Limit Distributions for Smooth Total Variation and χ²-Divergence in High Dimensions

Ziv Goldfeld and Kengo Kato, Cornell University
The 2020 International Symposium on Information Theory (ISIT), June 2020
Statistical Distances

Definition: A statistical distance measures the discrepancy between probability distributions: δ : P(R^d) × P(R^d) → [0, ∞) such that δ(P, Q) = 0 ⟺ P = Q. If δ is also symmetric and satisfies δ(P, Q) ≤ δ(P, R) + δ(R, Q), then δ is a metric.

Popular examples:
◮ f-divergence: D_f(P‖Q) := E_Q[f(dP/dQ)], for convex f : R → [0, ∞) (KL divergence, total variation, χ²-divergence, etc.)
◮ p-Wasserstein distance: W_p(P, Q) := (inf_{π∈Π(P,Q)} E_π[‖X − Y‖^p])^{1/p}, where Π(P, Q) is the set of couplings of P and Q
◮ Integral probability metrics: γ_F(P, Q) := sup_{f∈F} E_P[f] − E_Q[f] (W₁, TV, MMD, Dudley, Sobolev)
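As a concrete aside (my illustration, not part of the talk), here is a minimal numpy sketch of two of these distances in the simplest settings where they admit closed forms: total variation between two pmfs on a common finite alphabet, and W₁ between equal-size samples in d = 1, where the optimal coupling matches order statistics. All function names are mine.

```python
import numpy as np

def tv_discrete(p, q):
    """Total variation between two pmfs on a common finite alphabet:
    delta_TV(P, Q) = 0.5 * sum_i |p_i - q_i|."""
    return 0.5 * np.abs(np.asarray(p) - np.asarray(q)).sum()

def w1_samples_1d(x, y):
    """1-Wasserstein distance between two equal-size samples in d = 1.
    For empirical measures with n atoms each, the optimal coupling
    matches order statistics: W1 = (1/n) * sum_i |x_(i) - y_(i)|."""
    return np.abs(np.sort(x) - np.sort(y)).mean()

rng = np.random.default_rng(0)
print(tv_discrete([0.2, 0.5, 0.3], [0.4, 0.4, 0.2]))   # 0.2
print(w1_samples_1d(rng.normal(0, 1, 1000),            # approx. the mean
                    rng.normal(1, 1, 1000)))           # shift, i.e. ~1.0
```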
Statistical Distances: Why are they useful?

Historically: probability theory, mathematical statistics, information theory
◮ Topological and metric structure of P(R^d)
◮ Inequalities (Pinsker, Talagrand, joint-range, etc.)
◮ Hypothesis testing, goodness-of-fit tests, etc.
◮ Fundamental performance limits of operational problems . . .

Recently: a variety of applications in machine learning
◮ Implicit generative modeling
◮ Barycenter computation
◮ Anomaly detection, model ensembling, etc.
Implicit (Latent Variable) Generative Models

Goal: Learn a model Qθ ≈ P to approximate the data distribution.

Method: Apply a complicated transformation to a simple latent variable.
◮ Latent variable Z ∼ Q_Z ∈ P(R^p), p ≪ d
◮ Expand Z to R^d via a (random) transformation Q^(θ)_{X|Z}
  ⟹ Generative model: Qθ(·) := ∫_{R^p} Q^(θ)_{X|Z}(·|z) dQ_Z(z)

Minimum Distance Estimation: Solve θ⋆ ∈ argmin_θ δ(P, Qθ)

[Figure: a low-dimensional latent space mapped by the generator into the target space R^d]
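As a toy illustration of this pipeline (my sketch, not the authors' code; the generator g_theta and all names are hypothetical), the following numpy snippet samples from a model Qθ obtained by pushing a low-dimensional latent Z through a parametric map and adding Gaussian noise, which plays the role of the random transformation Q^(θ)_{X|Z}:

```python
import numpy as np

rng = np.random.default_rng(1)
p, d = 2, 10          # latent dimension p << data dimension d

def g_theta(z, theta):
    """Hypothetical deterministic generator R^p -> R^d:
    a linear map followed by a pointwise nonlinearity."""
    W, b = theta
    return np.tanh(z @ W) + b

def sample_q_theta(theta, n, noise=0.1):
    """Draw n samples from Q_theta = int Q_{X|Z}(.|z) dQ_Z(z):
    sample the latent Z ~ Q_Z = N(0, I_p), push it through g_theta,
    and add Gaussian noise (the random channel Q_{X|Z})."""
    z = rng.normal(size=(n, p))                    # Z ~ Q_Z
    return g_theta(z, theta) + noise * rng.normal(size=(n, d))

theta = (rng.normal(size=(p, d)), rng.normal(size=d))
x = sample_q_theta(theta, n=5)
print(x.shape)        # (5, 10): samples living in the target space R^d
```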
Implicit (Latent Variable) Generative Models (cont.)

Goal: Solve OPT := inf_θ δ(P, Qθ) exactly (find θ⋆).

Estimation: We do not have P, only data:
◮ {X_i}_{i=1}^n are i.i.d. samples from P ∈ P(R^d)
◮ Empirical distribution P_n := (1/n) Σ_{i=1}^n δ_{X_i}
  ⟹ Inherently we work with δ(P_n, Qθ)

Optimization: Can solve inf_θ δ(P_n, Qθ) approximately:
◮ Find θ̂_n s.t. δ(P_n, Q_{θ̂_n}) ≤ inf_θ δ(P_n, Qθ) + ε

Generalization [Zhang et al.'18]: δ(P, Q_{θ̂_n}) − OPT ≤ 2δ(P_n, P) + ε
⟹ Boils down to an empirical approximation question under δ
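A hedged toy version of minimum distance estimation in d = 1 (my construction; the model family and grid search are illustrative, not the talk's method): take Qθ = N(θ, 1), δ = W₁, and pick θ̂_n as an approximate minimizer of δ(P_n, Qθ) over a grid, so θ̂_n satisfies the ε-suboptimality condition above up to grid and sampling error.

```python
import numpy as np

rng = np.random.default_rng(2)

def w1_samples_1d(x, y):
    """W1 between equal-size 1-d samples via sorted matching."""
    return np.abs(np.sort(x) - np.sort(y)).mean()

# Data: n i.i.d. samples from the unknown P (here P = N(1.5, 1)).
x_data = rng.normal(1.5, 1.0, size=2000)

# Model family {Q_theta = N(theta, 1)}: approximate inf_theta W1(P_n, Q_theta)
# by comparing P_n against samples drawn from each candidate Q_theta.
thetas = np.linspace(-3, 3, 121)
losses = [w1_samples_1d(x_data, theta + rng.normal(size=2000))
          for theta in thetas]
theta_hat = thetas[int(np.argmin(losses))]
print(theta_hat)   # close to the true mean 1.5
```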
Empirical Approximation

Question: What can we say about δ(P_n, P)?

Problem 1: δ(P_n, P) can be ill-defined or vacuous when P ≪ Leb(R^d):
◮ KL or χ² divergence: D_KL(P_n‖P) = χ²(P_n‖P) = ∞
◮ Total variation: δ_TV(P_n, P) = 1 (take A = {X_1, . . . , X_n}: P_n(A) = 1 while P(A) = 0) . . .

Solution 1: Use a more sophisticated estimate P̂_n of P (KDE, kNN, etc.)

Problem 2: Empirical approximation rates are n^{−1/d}:
◮ f-divergences: [Nguyen-Wainwright-Jordan'15], [Kandasamy et al.'15]
◮ p-Wasserstein distances: [Fournier-Guillin'15], [Singh-Póczos'18]
◮ Integral probability metrics: [Sriperumbudur et al.'09], [Liang'19]
Smooth Statistical Distances

Definition (ZG-Greenewald-Polyanskiy-Weed'19): For σ ≥ 0, the smooth δ statistical distance between P and Q is
    δ^(σ)(P, Q) := δ(P ∗ N_σ, Q ∗ N_σ),
where N_σ = N(0, σ²I_d) is a d-dimensional isotropic Gaussian.

Interpretation: Let X ∼ P, Y ∼ Q and Z_1, Z_2 ∼ N_σ. If X ⊥ Z_1 then X + Z_1 ∼ P ∗ N_σ, and if Y ⊥ Z_2 then Y + Z_2 ∼ Q ∗ N_σ, i.e., δ^(σ) compares P and Q after passage through an additive Gaussian noise channel.

[Figure: X and Y passed through the additive Gaussian noise channel N(0, σ²I_d)]

Robustness to support mismatch: δ^(σ)(P, Q) < ∞ for all P, Q ∈ P(R^d).

Preserves metric structure: If δ is a metric on P(R^d), then so is δ^(σ).
◮ Proof idea: use characteristic functions Φ_P(t) := E_P[e^{i⟨t,X⟩}] and
    Φ_{P∗N_σ} = Φ_P Φ_{N_σ},  with  Φ_{N_σ}(t) = e^{−σ²‖t‖²/2} ≠ 0, ∀t.
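The two equivalent views of smoothing are easy to see numerically in d = 1 (a sketch of my own): the density of P_n ∗ N_σ is the Gaussian mixture (1/n) Σ_i φ_σ(x − X_i), and a sample from it is a resampled data point plus independent noise Z ∼ N(0, σ²).

```python
import numpy as np

rng = np.random.default_rng(3)
sigma, n = 0.5, 500
x_data = rng.normal(0, 1, size=n)          # X_i ~ P (here P = N(0, 1))

def phi_sigma(u, sigma):
    """Gaussian N(0, sigma^2) density in d = 1."""
    return np.exp(-u**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

def smoothed_density(x, data, sigma):
    """Density of P_n * N_sigma: a mixture of n Gaussians at the data."""
    return phi_sigma(x[:, None] - data[None, :], sigma).mean(axis=1)

# Equivalent sampling view: X + Z with X ~ P_n (resampled), Z ~ N(0, sigma^2).
samples = rng.choice(x_data, size=10_000) + sigma * rng.normal(size=10_000)

grid = np.linspace(-4, 4, 9)
print(np.round(smoothed_density(grid, x_data, sigma), 3))
# Compare with a histogram estimate from the noisy samples at the same points.
hist, edges = np.histogram(samples, bins=80, range=(-4, 4), density=True)
print(np.round(np.interp(grid, 0.5 * (edges[:-1] + edges[1:]), hist), 3))
```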
Smooth Statistical Distances: Empirical Approximation

High level: Alleviate the curse of dimensionality & obtain limit distributions.

Theorem (ZG-Greenewald-Polyanskiy-Weed'19): For any d ≥ 1 and σ > 0, E[δ^(σ)(P_n, P)] → 0 at a dimension-free rate:
◮ Distance: E[δ^(σ)(P_n, P)] ≍ n^{−1/2} for δ^(σ)_TV and W^(σ)_1
◮ Distance²: E[δ^(σ)(P_n, P)] ≍ n^{−1} for (W^(σ)_2)², D^(σ)_KL and χ²_σ,
under a sub-Gaussian condition on P.

Theorem (ZG-Kato'20): For sub-Gaussian P:
    √n W^(σ)_1(P_n, P) →_D sup_{f∈Lip_1(R^d)} G^(σ)_P(f), ∀d ≥ 1.

Comments:
◮ The limit distribution shows that the n^{−1/2} rate is sharp.
◮ The (unsmoothed) W_1 limit distribution is known only for d = 1, where W_1(P_n, P) = ‖F_n − F‖_{L¹(R)}.
◮ W^(σ)_1 has a stable (characterized) limit in all d!
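A hedged Monte Carlo sanity check of the parametric rate (my own experiment, in d = 1 with P = N(0, 1), so P ∗ N_σ = N(0, 1 + σ²) is explicit): δ^(σ)_TV(P_n, P) is computed by quadrature of ½∫|p_{n,σ} − p_σ|, and √n · E[δ^(σ)_TV(P_n, P)] should stabilize as n grows.

```python
import numpy as np

rng = np.random.default_rng(4)
sigma = 1.0
grid = np.linspace(-8, 8, 801)
dx = grid[1] - grid[0]

def phi(u, s):
    return np.exp(-u**2 / (2 * s**2)) / np.sqrt(2 * np.pi * s**2)

# P = N(0, 1), so P * N_sigma = N(0, 1 + sigma^2) has an explicit density.
p_sigma = phi(grid, np.sqrt(1 + sigma**2))

def smooth_tv(data):
    """delta_TV^(sigma)(P_n, P) = 0.5 * int |P_n*phi_sigma - P*phi_sigma| dx,
    approximated by a Riemann sum on a fixed grid."""
    pn_sigma = phi(grid[:, None] - data[None, :], sigma).mean(axis=1)
    return 0.5 * np.abs(pn_sigma - p_sigma).sum() * dx

for n in [100, 400, 1600, 6400]:
    vals = [smooth_tv(rng.normal(size=n)) for _ in range(30)]
    print(n, round(np.sqrt(n) * np.mean(vals), 3))   # roughly constant in n
```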
New Limit Distribution Results: Smooth TV

TV distance: δ_TV(P, Q) := E_Q[½|dP/dQ − 1|] = ½‖p − q‖₁

Smooth TV: δ^(σ)_TV(P, Q) := δ_TV(P ∗ N_σ, Q ∗ N_σ) = ½‖p_σ − q_σ‖₁
(here φ_σ denotes the N(0, σ²I_d) density, and p_σ = p ∗ φ_σ, q_σ = q ∗ φ_σ)

Theorem (ZG-Kato'20): For any d ≥ 1 and σ > 0, if
    C^(TV)_{P,σ} := ∫_{R^d} √(Var_P(φ_σ(x − X))) dx < ∞,
then
    √n δ^(σ)_TV(P_n, P) →_D ½‖G^(σ)_P‖₁,
for a centered Gaussian process (G^(σ)_P(x))_{x∈R^d} with sample paths in L¹(R^d) and covariance
    E[G^(σ)_P(x) G^(σ)_P(y)] = Cov_P(φ_σ(x − X), φ_σ(y − X)).

Comments:
1. The n^{−1/2} rate is sharp for E[δ^(σ)_TV(P_n, P)], and a concentration inequality holds.
2. The C^(TV)_{P,σ} < ∞ condition is sharp: lim inf_n √n E[δ^(σ)_TV(P_n, P)] ≥ C^(TV)_{P,σ}/√(2π).
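The limit theorem can be checked by simulation in d = 1 (my sketch, with P = N(0, 1); the grid, sample sizes, and Monte Carlo covariance estimate are all illustrative choices): the left side draws replicates of √n δ^(σ)_TV(P_n, P) by quadrature; the right side draws ½‖G^(σ)_P‖₁ by sampling the Gaussian process on the same grid from its covariance Cov_P(φ_σ(x − X), φ_σ(y − X)).

```python
import numpy as np

rng = np.random.default_rng(5)
sigma, n, reps = 1.0, 2000, 200
grid = np.linspace(-6, 6, 241)
dx = grid[1] - grid[0]

def phi(u, s):
    return np.exp(-u**2 / (2 * s**2)) / np.sqrt(2 * np.pi * s**2)

p_sigma = phi(grid, np.sqrt(1 + sigma**2))        # density of P * N_sigma

# Replicates of sqrt(n) * delta_TV^(sigma)(P_n, P) via quadrature.
stats = []
for _ in range(reps):
    data = rng.normal(size=n)
    pn_sigma = phi(grid[:, None] - data[None, :], sigma).mean(axis=1)
    stats.append(np.sqrt(n) * 0.5 * np.abs(pn_sigma - p_sigma).sum() * dx)

# Replicates of 0.5 * ||G||_1: estimate the covariance kernel
# Cov_P(phi_sigma(x - X), phi_sigma(y - X)) on the grid by Monte Carlo,
# then draw the Gaussian process via an eigendecomposition (eigenvalues
# clipped at 0 to tolerate a numerically near-singular matrix).
X = rng.normal(size=20_000)
F = phi(grid[:, None] - X[None, :], sigma)        # rows = grid points
K = np.cov(F)
w, V = np.linalg.eigh(K)
L = V * np.sqrt(np.clip(w, 0.0, None))
G = rng.normal(size=(reps, len(grid))) @ L.T      # rows ~ N(0, K)
limit = 0.5 * np.abs(G).sum(axis=1) * dx

print(np.mean(stats).round(3), np.mean(limit).round(3))  # should be close
print(np.std(stats).round(3), np.std(limit).round(3))
```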
Proof Outline

PDFs: P_n ∗ φ_σ(x) = (1/n) Σ_{i=1}^n φ_σ(x − X_i) ⟹ P ∗ φ_σ(x) = E[P_n ∗ φ_σ(x)]

Smooth TV: Set Z_i(x) := φ_σ(x − X_i) − P ∗ φ_σ(x) and Z̄_n(x) := (1/√n) Σ_{i=1}^n Z_i(x)
⟹ √n δ^(σ)_TV(P_n, P) = ½‖Z̄_n‖₁

Theorem (CLT in Banach spaces): For p ∈ [1, ∞) and Z_1, . . . , Z_n i.i.d. centered L^p-valued random variables with Z̄_n = (1/√n) Σ_{i=1}^n Z_i:
    P(‖Z_1‖_p > t) = o(t^{−2}) as t → ∞  and  ∫_{R^d} (E[|Z_1(x)|²])^{p/2} dx < ∞
⟺ Z̄_n converges in L^p to a centered Gaussian G with the same covariance as Z_1.

Verify for p = 1: ‖Z_1‖₁ ≤ 2, and ∫_{R^d} √(E[|Z_1(x)|²]) dx < ∞ by assumption
⟹ Z̄_n →_w G, and by the continuous mapping theorem, √n δ^(σ)_TV(P_n, P) →_D ½‖G‖₁ with G = G^(σ)_P.
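Spelling out the key identity (a short derivation, consistent with the definitions above; the covariance of the limit follows because the Z_i are i.i.d. copies of Z_1):

```latex
\[
\sqrt{n}\,\delta^{(\sigma)}_{\mathrm{TV}}(P_n,P)
  = \frac{\sqrt{n}}{2}\int_{\mathbb{R}^d}\bigl|P_n*\varphi_\sigma(x) - P*\varphi_\sigma(x)\bigr|\,dx
  = \frac{1}{2}\int_{\mathbb{R}^d}\Bigl|\frac{1}{\sqrt{n}}\sum_{i=1}^{n} Z_i(x)\Bigr|\,dx
  = \frac{1}{2}\,\bigl\|\bar{Z}_n\bigr\|_{L^1(\mathbb{R}^d)},
\]
\[
\mathbb{E}\bigl[G^{(\sigma)}_P(x)\,G^{(\sigma)}_P(y)\bigr]
  = \mathbb{E}\bigl[Z_1(x)Z_1(y)\bigr]
  = \mathrm{Cov}_P\bigl(\varphi_\sigma(x-X),\,\varphi_\sigma(y-X)\bigr).
\]
```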
New Limit Distribution Results: Smooth χ²

χ²-divergence: χ²(P‖Q) := E_Q[(dP/dQ − 1)²]

Smooth χ²: χ²_σ(P‖Q) := χ²(P ∗ N_σ‖Q ∗ N_σ) = ∫_{R^d} (p_σ(x)/q_σ(x) − 1)² q_σ(x) dx

Theorem (ZG-Kato'20): For any d ≥ 1 and σ > 0, if ∫_{R^d} Var_P(φ_σ(x − X)) / (P ∗ φ_σ(x)) dx < ∞, then
    n χ²_σ(P_n‖P) →_D ∫_{R^d} (G^(σ)_P(x))² / (P ∗ φ_σ(x)) dx,
for the centered Gaussian process G^(σ)_P with the same covariance as before, such that G^(σ)_P/√(P ∗ φ_σ) has sample paths in L²(R^d).

Comments:
1. The condition holds for any β-sub-Gaussian P with β < σ/√2.
2. The proof mirrors the δ^(σ)_TV case, but with Z_i(x) := φ_σ(x − X_i)/(P ∗ φ_σ(x)) − 1 and the CLT in L²(R^d, P ∗ N_σ).
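Analogously to the smooth-TV case, the χ² statistic is exactly a squared norm of the normalized empirical process: with Z_i as in Comment 2, (1/n) Σ_i Z_i(x) = P_n ∗ φ_σ(x)/(P ∗ φ_σ(x)) − 1, so (a short derivation from the definitions above)

```latex
\[
n\,\chi^2_\sigma(P_n\|P)
  = n\int_{\mathbb{R}^d}\Bigl(\frac{P_n*\varphi_\sigma(x)}{P*\varphi_\sigma(x)} - 1\Bigr)^{2} P*\varphi_\sigma(x)\,dx
  = \int_{\mathbb{R}^d}\bar{Z}_n(x)^{2}\,P*\varphi_\sigma(x)\,dx
  = \bigl\|\bar{Z}_n\bigr\|^{2}_{L^2(\mathbb{R}^d,\;P*\mathcal{N}_\sigma)},
\]
```

and the CLT in L²(R^d, P ∗ N_σ) together with the continuous mapping theorem gives the stated limit.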
Summary

Classic statistical distances: Rich history and modern applications
◮ Support mismatch issues
◮ Slow empirical approximation rate n^{−1/d}

Smooth statistical distances: Convolve the distributions with N_σ
◮ Robust to mismatched supports
◮ Inherit the metric structure
◮ Fast (parametric) empirical convergence in all dimensions
◮ Limit distributions for scaled δ^(σ)(P_n, P) in all dimensions (W^(σ)_1, δ^(σ)_TV, χ²_σ)

Applications: Based on the smooth statistical distance paradigm
◮ Generative modeling via minimum distance estimation
◮ Goodness-of-fit testing, two-sample testing, etc.

Thank you!