

SLIDE 1

Comparing distributions: ℓ1 geometry improves kernel two-sample testing

  • M. Scetbon (1, 2)
  • G. Varoquaux (1)

(1) Inria, Université Paris-Saclay   (2) CREST, ENSAE

12 December 2019

1 / 11

SLIDE 2

Two collections of samples X, Y from unknown distributions P and Q.

[Figure: two example collections, labeled "McDonald's" and "KFC"]

Problem: Are the two sets of observations X and Y drawn from the same distribution?

2 / 11


SLIDE 4

Two-Sample Test

Test the null hypothesis H0 : P = Q against H1 : P ≠ Q

Samples: X = {x_i}_{i=1}^n ∼ P and Y = {y_i}_{i=1}^n ∼ Q

3 / 11


SLIDE 6

Gaussian Kernel: k_σ(x, y) = exp(−‖x − y‖² / (2σ²))

Empirical Mean Embeddings of P and Q:

  • µ̂P(T) = (1/n) Σ_{i=1}^n k(x_i, T)
  • µ̂Q(T) = (1/n) Σ_{j=1}^n k(y_j, T)

[Figure: the two empirical mean embeddings µ̂P and µ̂Q]

4 / 11
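The empirical mean embeddings above can be sketched in a few lines of NumPy. The sample sizes, the bandwidth σ = 1, and the two Gaussian test distributions below are illustrative assumptions, not values taken from the slides:

```python
import numpy as np

def gaussian_kernel(x, t, sigma=1.0):
    # k_sigma(x, t) = exp(-||x - t||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - t) ** 2, axis=-1) / (2.0 * sigma ** 2))

def mean_embedding(samples, t, sigma=1.0):
    # mu_hat(t) = (1/n) * sum_i k_sigma(x_i, t)
    return gaussian_kernel(samples, t, sigma).mean()

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(500, 2))  # X ~ P (here N(0, I), an illustrative choice)
Y = rng.normal(1.0, 1.0, size=(500, 2))  # Y ~ Q (a shifted Gaussian)
t = np.zeros(2)
print(mean_embedding(X, t), mean_embedding(Y, t))  # P puts more mass near t = 0
```

Each embedding value lies in (0, 1] for the Gaussian kernel, and evaluating both at the same location t already hints at where P and Q differ.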

SLIDE 7

Absolute difference of the Mean Embeddings:

  • S(T) = |µ̂P(T) − µ̂Q(T)|

[Figure: µ̂P, µ̂Q, and |µ̂P − µ̂Q|]

5 / 11

SLIDE 8

Absolute difference of the Mean Embeddings:

  • S(T) = |µ̂P(T) − µ̂Q(T)|
  • Test locations: (T_j)_{j=1}^J ∼ Γ

[Figure: test locations T_1, …, T_5 marked on |µ̂P − µ̂Q|]

6 / 11
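Evaluating S(T) at randomly drawn test locations can be sketched as follows; taking Γ = N(0, I), J = 5, and the sample distributions below are assumptions made for illustration:

```python
import numpy as np

def abs_embedding_diff(X, Y, T, sigma=1.0):
    # S(T_j) = |mu_hat_P(T_j) - mu_hat_Q(T_j)| at each test location T_j
    def embed(S):
        d2 = np.sum((S[:, None, :] - T[None, :, :]) ** 2, axis=-1)
        return np.exp(-d2 / (2.0 * sigma ** 2)).mean(axis=0)
    return np.abs(embed(X) - embed(Y))

rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, size=(500, 2))
Y = rng.normal(1.0, 1.0, size=(500, 2))
T = rng.normal(0.0, 1.0, size=(5, 2))  # (T_j)_{j=1}^J ~ Gamma, here Gamma = N(0, I)
S = abs_embedding_diff(X, Y, T)
print(S)  # one nonnegative value per location
```

Each entry of S is in [0, 1] since both embeddings are averages of kernel values in (0, 1].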

SLIDE 9

Test Statistic¹ with p ≥ 1:

  d_{ℓp,µ,J}(X, Y)^p := n^{p/2} Σ_{j=1}^J |µ̂P(T_j) − µ̂Q(T_j)|^p

These statistics are derived from metrics which metrize weak convergence:

  d_{Lp,µ}(P, Q) := ( ∫_{t∈ℝ^d} |µP(t) − µQ(t)|^p dΓ(t) )^{1/p}

Theorem (Weak Convergence): α_n →_D α ⇔ d_{Lp,µ}(α_n, α) → 0

  1. The case p = 2 has been studied by [1, 2].

7 / 11
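The statistic d_{ℓp,µ,J}(X, Y)^p = n^{p/2} Σ_j |µ̂P(T_j) − µ̂Q(T_j)|^p transcribes directly into NumPy. Equal sample sizes, Γ = N(0, I), and the bandwidth are illustrative assumptions:

```python
import numpy as np

def statistic(X, Y, T, p=1, sigma=1.0):
    # d_{lp,mu,J}(X, Y)^p = n^{p/2} * sum_j |mu_hat_P(T_j) - mu_hat_Q(T_j)|^p
    def embed(S):  # empirical mean embedding evaluated at each location T_j
        d2 = np.sum((S[:, None, :] - T[None, :, :]) ** 2, axis=-1)
        return np.exp(-d2 / (2.0 * sigma ** 2)).mean(axis=0)
    n = len(X)  # the slides assume equal sample sizes n for X and Y
    return n ** (p / 2.0) * np.sum(np.abs(embed(X) - embed(Y)) ** p)

rng = np.random.default_rng(2)
X = rng.normal(0.0, 1.0, size=(500, 2))
Y = rng.normal(1.0, 1.0, size=(500, 2))
T = rng.normal(0.0, 1.0, size=(5, 2))
print(statistic(X, Y, T, p=1), statistic(X, Y, T, p=2))
```

Setting p = 1 gives the ℓ1 statistic that the slides advocate; p = 2 recovers the statistic studied in [1, 2].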


SLIDE 12

[Figure: the empirical mean embeddings µ̂P, µ̂Q and their absolute difference |µ̂P − µ̂Q|]

8 / 11

SLIDE 13

Test of level α: Compute d_{ℓp,µ,J}(X, Y)^p and reject H0 if d_{ℓp,µ,J}(X, Y)^p > T_{α,p}, the 1 − α quantile of the asymptotic null distribution.

Proposition: ℓ1 geometry improves power
Let δ > 0. Under the alternative hypothesis H1, almost surely there exists N ≥ 1 such that for all n ≥ N, with probability 1 − δ:

  d_{ℓ2,µ,J}(X, Y)² > T_{α,2} ⇒ d_{ℓ1,µ,J}(X, Y) > T_{α,1}

9 / 11
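A runnable sketch of the level-α test. Since the slides do not spell out the asymptotic null distribution, this sketch estimates the threshold T_{α,p} by permutation instead, which is an assumption of convenience, not the authors' procedure:

```python
import numpy as np

def statistic(X, Y, T, p=1, sigma=1.0):
    # d_{lp,mu,J}(X, Y)^p = n^{p/2} * sum_j |mu_hat_P(T_j) - mu_hat_Q(T_j)|^p
    def embed(S):
        d2 = np.sum((S[:, None, :] - T[None, :, :]) ** 2, axis=-1)
        return np.exp(-d2 / (2.0 * sigma ** 2)).mean(axis=0)
    return len(X) ** (p / 2.0) * np.sum(np.abs(embed(X) - embed(Y)) ** p)

def two_sample_test(X, Y, T, p=1, sigma=1.0, alpha=0.05, n_perm=200, seed=0):
    # Reject H0 when the observed statistic exceeds the 1 - alpha quantile of
    # a permutation null (used here as a stand-in for the asymptotic null).
    rng = np.random.default_rng(seed)
    pooled, n = np.vstack([X, Y]), len(X)
    null = np.empty(n_perm)
    for b in range(n_perm):
        idx = rng.permutation(len(pooled))
        null[b] = statistic(pooled[idx[:n]], pooled[idx[n:]], T, p, sigma)
    return statistic(X, Y, T, p, sigma) > np.quantile(null, 1.0 - alpha)

rng = np.random.default_rng(3)
X = rng.normal(0.0, 1.0, size=(500, 2))
Y = rng.normal(0.75, 1.0, size=(500, 2))  # a clearly shifted alternative
T = rng.normal(0.0, 1.0, size=(5, 2))
print(two_sample_test(X, Y, T, p=1), two_sample_test(X, Y, T, p=2))
```

Under a shift this large with n = 500, both the ℓ1 (p = 1) and ℓ2 (p = 2) tests should reject; the proposition above concerns the finer regime where ℓ2 rejects but ℓ1 has extra power.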



SLIDE 16

Conclusion

  • Under the alternative hypothesis, an analytic kernel (e.g. the Gaussian kernel) guarantees dense differences between µP and µQ.
  • The ℓ1 geometry better captures these dense differences.
  • We have also considered statistics based on Smooth Characteristic Functions and obtained similar results.
  • Finally, we have normalized the tests to obtain a simple null distribution and to learn the locations where the distributions differ the most.

@ East Exhibition Hall B + C #6

10 / 11


SLIDE 20

References I

[1] K. P. Chwialkowski, A. Ramdas, D. Sejdinovic, and A. Gretton. Fast two-sample testing with analytic representations of probability measures. In Advances in Neural Information Processing Systems, pages 1981–1989, 2015.

[2] W. Jitkrittum, Z. Szabó, K. P. Chwialkowski, and A. Gretton. Interpretable distribution features with maximum testing power. In Advances in Neural Information Processing Systems, pages 181–189, 2016.

11 / 11