

slide-1
SLIDE 1

Lecture 2: Mappings of Probabilities to RKHS and Applications

MLSS Tübingen, 2015

Arthur Gretton Gatsby Unit, CSML, UCL

slide-2
SLIDE 2

Outline

  • Kernel metric on the space of probability measures

    – Function revealing differences in distributions
    – Distance between means in space of features (RKHS)
    – Independence measure: features of joint minus product of marginals

  • Characteristic kernels: feature space mappings of probabilities unique
  • Two-sample, independence tests for (almost!) any data type

    – distributions on strings, images, graphs, groups (rotation matrices), semigroups, ...

  • Advanced topics

    – testing on big data, kernel choice
    – Energy distance/distance covariance: special case of kernel statistic

slide-3
SLIDE 3

Feature mean difference

  • Simple example: 2 Gaussians with different means
  • Answer: t-test

[Figure: two Gaussian densities PX and QX with different means; x-axis X, y-axis prob. density]

slide-4
SLIDE 4

Feature mean difference

  • Two Gaussians with same means, different variance
  • Idea: look at difference in means of features of the RVs
  • In Gaussian case: second order features of form ϕ(x) = x²

[Figure: two Gaussian densities PX and QX with different variances; x-axis X, y-axis prob. density]

slide-5
SLIDE 5

Feature mean difference

  • Two Gaussians with same means, different variance
  • Idea: look at difference in means of features of the RVs
  • In Gaussian case: second order features of form ϕ(x) = x²

[Figure: two Gaussian densities PX and QX with different variances, alongside the densities of the feature X², which differ in mean]

slide-6
SLIDE 6

Feature mean difference

  • Gaussian and Laplace distributions
  • Same mean and same variance
  • Difference in means using higher order features ... RKHS

[Figure: Gaussian and Laplace densities PX and QX; x-axis X, y-axis prob. density]

slide-7
SLIDE 7

Probabilities in feature space: the mean trick

The kernel trick

  • Given x ∈ X for some set X, define feature map ϕ_x ∈ F,

    ϕ_x = [. . . ϕ_i(x) . . .] ∈ ℓ²

  • For positive definite k(x, x′),

    k(x, x′) = ⟨ϕ_x, ϕ_x′⟩_F

  • The kernel trick: ∀f ∈ F,

    f(x) = ⟨f, ϕ_x⟩_F

slide-8
SLIDE 8

Probabilities in feature space: the mean trick

The kernel trick

  • Given x ∈ X for some set X, define feature map ϕ_x ∈ F,

    ϕ_x = [. . . ϕ_i(x) . . .] ∈ ℓ²

  • For positive definite k(x, x′),

    k(x, x′) = ⟨ϕ_x, ϕ_x′⟩_F

  • The kernel trick: ∀f ∈ F,

    f(x) = ⟨f, ϕ_x⟩_F

The mean trick

  • Given P a Borel probability measure on X, define feature map µ_P ∈ F,

    µ_P = [. . . E_P[ϕ_i(x)] . . .]

  • For positive definite k(x, x′),

    E_{P,Q} k(x, y) = ⟨µ_P, µ_Q⟩_F for x ∼ P and y ∼ Q.

  • The mean trick (we call µ_P a mean/distribution embedding):

    E_P f(x) = ⟨µ_P, f⟩_F

slide-9
SLIDE 9

What does µP look like?

We plot the function µ_P

  • Mean embedding µ_P ∈ F:

    ⟨µ_P(·), f(·)⟩_F = E_P f(x).

  • What does the prob. feature map look like?

    µ_P(x) = ⟨µ_P(·), ϕ(x)⟩_F = ⟨µ_P(·), k(·, x)⟩_F = E_P k(x′, x) for x′ ∼ P. Expectation of the kernel!

  • Empirical estimate:

    µ̂_P(x) = (1/m) Σ_{i=1}^m k(x_i, x),   x_i ∼ P

slide-10
SLIDE 10

What does µP look like?

We plot the function µ_P

  • Mean embedding µ_P ∈ F:

    ⟨µ_P(·), f(·)⟩_F = E_P f(x).

  • What does the prob. feature map look like?

    µ_P(x) = ⟨µ_P(·), ϕ(x)⟩_F = ⟨µ_P(·), k(·, x)⟩_F = E_P k(x′, x) for x′ ∼ P. Expectation of the kernel!

  • Empirical estimate:

    µ̂_P(x) = (1/m) Σ_{i=1}^m k(x_i, x),   x_i ∼ P

[Figure: histogram of samples from P over X, with the corresponding empirical mean embedding µ̂_P overlaid]
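As a quick illustration of the empirical estimate µ̂_P above, here is a minimal NumPy sketch; the Gaussian kernel and all names here are illustrative assumptions rather than anything fixed by the slides:

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    # k(x, y) = exp(-(x - y)^2 / (2 sigma^2))
    return np.exp(-((x - y) ** 2) / (2 * sigma ** 2))

def empirical_mean_embedding(samples, query, sigma=1.0):
    # mu_hat_P(x) = (1/m) sum_i k(x_i, x), evaluated at each query point x
    return gaussian_kernel(samples[:, None], query[None, :], sigma).mean(axis=0)

rng = np.random.default_rng(0)
x_p = rng.normal(0.0, 1.0, size=500)          # samples from P
grid = np.linspace(-4.0, 4.0, 9)
print(empirical_mean_embedding(x_p, grid))    # a smoothed, density-like curve
```

Evaluated on a grid, µ̂_P looks like a kernel-smoothed version of the histogram in the figure above.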

slide-11
SLIDE 11

Does the feature space mean exist?

Does there exist an element µ_P ∈ F such that

E_P f(x) = E_P⟨f(·), ϕ(x)⟩_F = ⟨f(·), E_P ϕ(x)⟩_F = ⟨f(·), µ_P(·)⟩_F   ∀f ∈ F ?

slide-12
SLIDE 12

Does the feature space mean exist?

Does there exist an element µ_P ∈ F such that

E_P f(x) = E_P⟨f(·), ϕ(x)⟩_F = ⟨f(·), E_P ϕ(x)⟩_F = ⟨f(·), µ_P(·)⟩_F   ∀f ∈ F ?

Yes: you can exchange expectation and inner product (i.e. ϕ(x) is Bochner integrable [Steinwart and Christmann, 2008]) under the condition

‖E_P ϕ(x)‖_F ≤ E_P ‖ϕ(x)‖_F = E_P √k(x, x) < ∞

(e.g. any bounded kernel, such as the Gaussian, satisfies this).
slide-13
SLIDE 13

Function Showing Difference in Distributions

  • Are P and Q different?
slide-14
SLIDE 14

Function Showing Difference in Distributions

  • Are P and Q different?

[Figure: samples from P and Q, scattered over [0, 1] × [−1, 1]]

slide-15
SLIDE 15

Function Showing Difference in Distributions

  • Are P and Q different?

[Figure: samples from P and Q, scattered over [0, 1] × [−1, 1]]

slide-16
SLIDE 16

Function Showing Difference in Distributions

  • Maximum mean discrepancy: smooth function for P vs Q

MMD(P, Q; F) := sup_{f∈F} [E_P f(x) − E_Q f(y)].

[Figure: samples from P and Q with a smooth witness function f(x)]

slide-17
SLIDE 17

Function Showing Difference in Distributions

  • Maximum mean discrepancy: smooth function for P vs Q

MMD(P, Q; F) := sup_{f∈F} [E_P f(x) − E_Q f(y)].

[Figure: samples from P and Q with a smooth witness function f(x)]

slide-18
SLIDE 18

Function Showing Difference in Distributions

  • What if the function is not smooth?

MMD(P, Q; F) := sup_{f∈F} [E_P f(x) − E_Q f(y)].

[Figure: samples from P and Q with a bounded continuous (but non-smooth) witness function f(x)]

slide-19
SLIDE 19

Function Showing Difference in Distributions

  • What if the function is not smooth?

MMD(P, Q; F) := sup_{f∈F} [E_P f(x) − E_Q f(y)].

[Figure: samples from P and Q with a bounded continuous (but non-smooth) witness function f(x)]

slide-20
SLIDE 20

Function Showing Difference in Distributions

  • Maximum mean discrepancy: smooth function for P vs Q

MMD(P, Q; F) := sup_{f∈F} [E_P f(x) − E_Q f(y)].

  • Gauss P vs Laplace Q

[Figure: witness f for Gauss and Laplace densities, plotted together with the two densities over X]

slide-21
SLIDE 21

Function Showing Difference in Distributions

  • Maximum mean discrepancy: smooth function for P vs Q

MMD(P, Q; F) := sup_{f∈F} [E_P f(x) − E_Q f(y)].

  • Classical results: MMD(P, Q; F) = 0 iff P = Q, when

    – F = bounded continuous [Dudley, 2002]
    – F = bounded variation 1 (Kolmogorov metric) [Müller, 1997]
    – F = bounded Lipschitz (Earth mover’s distance) [Dudley, 2002]

slide-22
SLIDE 22

Function Showing Difference in Distributions

  • Maximum mean discrepancy: smooth function for P vs Q

MMD(P, Q; F) := sup_{f∈F} [E_P f(x) − E_Q f(y)].

  • Classical results: MMD(P, Q; F) = 0 iff P = Q, when

    – F = bounded continuous [Dudley, 2002]
    – F = bounded variation 1 (Kolmogorov metric) [Müller, 1997]
    – F = bounded Lipschitz (Earth mover’s distance) [Dudley, 2002]

  • MMD(P, Q; F) = 0 iff P = Q when F = the unit ball in a characteristic RKHS F (coming soon!)

[ISMB06, NIPS06a, NIPS07b, NIPS08a, JMLR10]

slide-23
SLIDE 23

Function Showing Difference in Distributions

  • Maximum mean discrepancy: smooth function for P vs Q

MMD(P, Q; F) := sup_{f∈F} [E_P f(x) − E_Q f(y)].

  • Classical results: MMD(P, Q; F) = 0 iff P = Q, when

    – F = bounded continuous [Dudley, 2002]
    – F = bounded variation 1 (Kolmogorov metric) [Müller, 1997]
    – F = bounded Lipschitz (Earth mover’s distance) [Dudley, 2002]

  • MMD(P, Q; F) = 0 iff P = Q when F = the unit ball in a characteristic RKHS F (coming soon!)

[ISMB06, NIPS06a, NIPS07b, NIPS08a, JMLR10]

How do smooth functions relate to feature maps?

slide-24
SLIDE 24

Function view vs feature mean view

  • The (kernel) MMD: [ISMB06, NIPS06a]

MMD²(P, Q; F) = [ sup_{f∈F} (E_P f(x) − E_Q f(y)) ]²

[Figure: witness f for Gauss and Laplace densities, plotted together with the two densities over X]

slide-25
SLIDE 25

Function view vs feature mean view

  • The (kernel) MMD: [ISMB06, NIPS06a]

MMD²(P, Q; F) = [ sup_{f∈F} (E_P f(x) − E_Q f(y)) ]²

    use E_P f(x) = ⟨µ_P, f⟩_F

slide-26
SLIDE 26

Function view vs feature mean view

  • The (kernel) MMD: [ISMB06, NIPS06a]

MMD²(P, Q; F) = [ sup_{f∈F} (E_P f(x) − E_Q f(y)) ]²
             = [ sup_{f∈F} ⟨f, µ_P − µ_Q⟩_F ]²

    use E_P f(x) = ⟨µ_P, f⟩_F

slide-27
SLIDE 27

Function view vs feature mean view

  • The (kernel) MMD: [ISMB06, NIPS06a]

MMD2(P, Q; F) =

  • sup

f∈F

[EPf(x) − EQf(y)] 2 =

  • sup

f∈F

f, µP − µQF 2 = µP − µQ2

F

use θF = sup

f∈F

f, θF Function view and feature view equivalent

slide-28
SLIDE 28

Empirical estimate of MMD

  • An unbiased empirical estimate: for {x_i}_{i=1}^m ∼ P and {y_i}_{i=1}^m ∼ Q,

MMD̂² = (1/(m(m−1))) Σ_{i=1}^m Σ_{j≠i} [k(x_i, x_j) + k(y_i, y_j)] − (1/m²) Σ_{i=1}^m Σ_{j=1}^m [k(y_i, x_j) + k(x_i, y_j)]
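A minimal NumPy sketch of this unbiased estimator, assuming a Gaussian kernel and equal sample sizes (both illustrative choices):

```python
import numpy as np

def mmd2_unbiased(x, y, sigma=1.0):
    # Unbiased MMD^2: within-sample sums exclude the diagonal (j != i);
    # the cross term keeps all m^2 pairs, exactly as in the formula above.
    def gram(a, b):
        return np.exp(-((a[:, None] - b[None, :]) ** 2) / (2 * sigma ** 2))

    m = len(x)
    k_xx, k_yy, k_xy = gram(x, x), gram(y, y), gram(x, y)
    within = (k_xx.sum() - np.trace(k_xx) + k_yy.sum() - np.trace(k_yy)) / (m * (m - 1))
    return within - 2.0 * k_xy.sum() / m ** 2

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, 200)                   # P: Gaussian
y = rng.laplace(0.0, 1.0 / np.sqrt(2.0), 200)   # Q: Laplace, same mean and variance
print(mmd2_unbiased(x, y))                      # positive in expectation
```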

slide-29
SLIDE 29

Empirical estimate of MMD

  • An unbiased empirical estimate: for {x_i}_{i=1}^m ∼ P and {y_i}_{i=1}^m ∼ Q,

MMD̂² = (1/(m(m−1))) Σ_{i=1}^m Σ_{j≠i} [k(x_i, x_j) + k(y_i, y_j)] − (1/m²) Σ_{i=1}^m Σ_{j=1}^m [k(y_i, x_j) + k(x_i, y_j)]

  • Proof:

‖µ_P − µ_Q‖²_F = ⟨µ_P − µ_Q, µ_P − µ_Q⟩_F
              = ⟨µ_P, µ_P⟩ + ⟨µ_Q, µ_Q⟩ − 2⟨µ_P, µ_Q⟩

slide-30
SLIDE 30

Empirical estimate of MMD

  • An unbiased empirical estimate: for {x_i}_{i=1}^m ∼ P and {y_i}_{i=1}^m ∼ Q,

MMD̂² = (1/(m(m−1))) Σ_{i=1}^m Σ_{j≠i} [k(x_i, x_j) + k(y_i, y_j)] − (1/m²) Σ_{i=1}^m Σ_{j=1}^m [k(y_i, x_j) + k(x_i, y_j)]

  • Proof:

‖µ_P − µ_Q‖²_F = ⟨µ_P − µ_Q, µ_P − µ_Q⟩_F
              = ⟨µ_P, µ_P⟩ + ⟨µ_Q, µ_Q⟩ − 2⟨µ_P, µ_Q⟩

slide-31
SLIDE 31

Empirical estimate of MMD

  • An unbiased empirical estimate: for {x_i}_{i=1}^m ∼ P and {y_i}_{i=1}^m ∼ Q,

MMD̂² = (1/(m(m−1))) Σ_{i=1}^m Σ_{j≠i} [k(x_i, x_j) + k(y_i, y_j)] − (1/m²) Σ_{i=1}^m Σ_{j=1}^m [k(y_i, x_j) + k(x_i, y_j)]

  • Proof:

‖µ_P − µ_Q‖²_F = ⟨µ_P − µ_Q, µ_P − µ_Q⟩_F
              = ⟨µ_P, µ_P⟩ + ⟨µ_Q, µ_Q⟩ − 2⟨µ_P, µ_Q⟩
              = E_P[µ_P(x)] + . . .

slide-32
SLIDE 32

Empirical estimate of MMD

  • An unbiased empirical estimate: for {x_i}_{i=1}^m ∼ P and {y_i}_{i=1}^m ∼ Q,

MMD̂² = (1/(m(m−1))) Σ_{i=1}^m Σ_{j≠i} [k(x_i, x_j) + k(y_i, y_j)] − (1/m²) Σ_{i=1}^m Σ_{j=1}^m [k(y_i, x_j) + k(x_i, y_j)]

  • Proof:

‖µ_P − µ_Q‖²_F = ⟨µ_P − µ_Q, µ_P − µ_Q⟩_F
              = ⟨µ_P, µ_P⟩ + ⟨µ_Q, µ_Q⟩ − 2⟨µ_P, µ_Q⟩
              = E_P[µ_P(x)] + . . .
              = E_P⟨µ_P(·), ϕ(x)⟩ + . . .

slide-33
SLIDE 33

Empirical estimate of MMD

  • An unbiased empirical estimate: for {x_i}_{i=1}^m ∼ P and {y_i}_{i=1}^m ∼ Q,

MMD̂² = (1/(m(m−1))) Σ_{i=1}^m Σ_{j≠i} [k(x_i, x_j) + k(y_i, y_j)] − (1/m²) Σ_{i=1}^m Σ_{j=1}^m [k(y_i, x_j) + k(x_i, y_j)]

  • Proof:

‖µ_P − µ_Q‖²_F = ⟨µ_P − µ_Q, µ_P − µ_Q⟩_F
              = ⟨µ_P, µ_P⟩ + ⟨µ_Q, µ_Q⟩ − 2⟨µ_P, µ_Q⟩
              = E_P[µ_P(x)] + . . .
              = E_P⟨µ_P(·), k(x, ·)⟩ + . . .

slide-34
SLIDE 34

Empirical estimate of MMD

  • An unbiased empirical estimate: for {x_i}_{i=1}^m ∼ P and {y_i}_{i=1}^m ∼ Q,

MMD̂² = (1/(m(m−1))) Σ_{i=1}^m Σ_{j≠i} [k(x_i, x_j) + k(y_i, y_j)] − (1/m²) Σ_{i=1}^m Σ_{j=1}^m [k(y_i, x_j) + k(x_i, y_j)]

  • Proof:

‖µ_P − µ_Q‖²_F = ⟨µ_P − µ_Q, µ_P − µ_Q⟩_F
              = ⟨µ_P, µ_P⟩ + ⟨µ_Q, µ_Q⟩ − 2⟨µ_P, µ_Q⟩
              = E_P[µ_P(x)] + . . .
              = E_P⟨µ_P(·), k(x, ·)⟩ + . . .
              = E_P k(x, x′) + E_Q k(y, y′) − 2E_{P,Q} k(x, y)

slide-35
SLIDE 35

Empirical estimate of MMD

  • An unbiased empirical estimate: for {x_i}_{i=1}^m ∼ P and {y_i}_{i=1}^m ∼ Q,

MMD̂² = (1/(m(m−1))) Σ_{i=1}^m Σ_{j≠i} [k(x_i, x_j) + k(y_i, y_j)] − (1/m²) Σ_{i=1}^m Σ_{j=1}^m [k(y_i, x_j) + k(x_i, y_j)]

  • Proof:

‖µ_P − µ_Q‖²_F = ⟨µ_P − µ_Q, µ_P − µ_Q⟩_F
              = ⟨µ_P, µ_P⟩ + ⟨µ_Q, µ_Q⟩ − 2⟨µ_P, µ_Q⟩
              = E_P[µ_P(x)] + . . .
              = E_P⟨µ_P(·), k(x, ·)⟩ + . . .
              = E_P k(x, x′) + E_Q k(y, y′) − 2E_{P,Q} k(x, y)

Then Ê k(x, x′) = (1/(m(m−1))) Σ_{i=1}^m Σ_{j≠i} k(x_i, x_j), and similarly for the remaining terms.

slide-36
SLIDE 36

MMD for independence: HSIC

  • Dependence measure: the Hilbert-Schmidt Independence Criterion [ALT05, NIPS07a, ALT07, ALT08, JMLR10]

Related to [Feuerverger, 1993] and [Székely and Rizzo, 2009, Székely et al., 2007]

HSIC(P_XY, P_X P_Y) := ‖µ_{P_XY} − µ_{P_X P_Y}‖²

slide-37
SLIDE 37

MMD for independence: HSIC

  • Dependence measure: the Hilbert-Schmidt Independence Criterion [ALT05, NIPS07a, ALT07, ALT08, JMLR10]

Related to [Feuerverger, 1993] and [Székely and Rizzo, 2009, Székely et al., 2007]

HSIC(P_XY, P_X P_Y) := ‖µ_{P_XY} − µ_{P_X P_Y}‖²

[Figure: the kernel on pairs is the product of the kernels on each component, κ((x, y), (x′, y′)) = k(x, x′) × l(y, y′), illustrated with text snippets x and dog images y]

slide-38
SLIDE 38

MMD for independence: HSIC

  • Dependence measure: the Hilbert-Schmidt Independence Criterion [ALT05, NIPS07a, ALT07, ALT08, JMLR10]

Related to [Feuerverger, 1993] and [Székely and Rizzo, 2009, Székely et al., 2007]

HSIC(P_XY, P_X P_Y) := ‖µ_{P_XY} − µ_{P_X P_Y}‖²

HSIC using expectations of kernels: define RKHS F on X with kernel k, RKHS G on Y with kernel l. Then

HSIC(P_XY, P_X P_Y) = E_{XY} E_{X′Y′} [k(x, x′) l(y, y′)] + E_X E_{X′} k(x, x′) E_Y E_{Y′} l(y, y′) − 2 E_{X′Y′} [E_X k(x, x′) E_Y l(y, y′)].
slide-39
SLIDE 39

HSIC: empirical estimate and intuition

!"#$%&'()#)&*+$,#&-"#.&-"%(+*"&/$0#1&2',&

  • "#34%#&'#5#%&"266$#%&-"2'&7"#'&0(//(7$'*&

2'&$'-#%#)8'*&)9#'-:&!"#3&'##,&6/#'-3&(0& #;#%9$)#1&2<(+-&2'&"(+%&2&,23&$0&6())$</#:& =&/2%*#&2'$.2/&7"(&)/$'*)&)/(<<#%1&#;+,#)&2& ,$)8'985#&"(+',3&(,(%1&2',&72'-)&'(-"$'*&.(%#&

  • "2'&-(&0(//(7&"$)&'()#:&!"#3&'##,&2&)$*'$>92'-&

2.(+'-&(0&#;#%9$)#&2',&.#'-2/&)8.+/28(':& !#;-&0%(.&,(*8.#:9(.&2',&6#?$',#%:9(.& @'(7'&0(%&-"#$%&9+%$()$-31&$'-#//$*#'9#1&2',& #;9#//#'-&9(..+'$928('&&)A$//)1&-"#&B252'#)#& <%##,&$)&6#%0#9-&$0&3(+&72'-&2&%#)6(')$5#1&& $'-#%2985#&6#-1&('#&-"2-&7$//&</(7&$'&3(+%&#2%& 2',&0(//(7&3(+&#5#%37"#%#:&

slide-40
SLIDE 40

HSIC: empirical estimate and intuition

!"#$%&'()#)&*+$,#&-"#.&-"%(+*"&/$0#1&2',&

  • "#34%#&'#5#%&"266$#%&-"2'&7"#'&0(//(7$'*&

2'&$'-#%#)8'*&)9#'-:&!"#3&'##,&6/#'-3&(0& #;#%9$)#1&2<(+-&2'&"(+%&2&,23&$0&6())$</#:& =&/2%*#&2'$.2/&7"(&)/$'*)&)/(<<#%1&#;+,#)&2& ,$)8'985#&"(+',3&(,(%1&2',&72'-)&'(-"$'*&.(%#&

  • "2'&-(&0(//(7&"$)&'()#:&!"#3&'##,&2&)$*'$>92'-&

2.(+'-&(0&#;#%9$)#&2',&.#'-2/&)8.+/28(':& !#;-&0%(.&,(*8.#:9(.&2',&6#?$',#%:9(.& @'(7'&0(%&-"#$%&9+%$()$-31&$'-#//$*#'9#1&2',& #;9#//#'-&9(..+'$928('&&)A$//)1&-"#&B252'#)#& <%##,&$)&6#%0#9-&$0&3(+&72'-&2&%#)6(')$5#1&& $'-#%2985#&6#-1&('#&-"2-&7$//&</(7&$'&3(+%&#2%& 2',&0(//(7&3(+&#5#%37"#%#:&

!" #"

slide-41
SLIDE 41

HSIC: empirical estimate and intuition

!"#$%&'()#)&*+$,#&-"#.&-"%(+*"&/$0#1&2',&

  • "#34%#&'#5#%&"266$#%&-"2'&7"#'&0(//(7$'*&

2'&$'-#%#)8'*&)9#'-:&!"#3&'##,&6/#'-3&(0& #;#%9$)#1&2<(+-&2'&"(+%&2&,23&$0&6())$</#:& =&/2%*#&2'$.2/&7"(&)/$'*)&)/(<<#%1&#;+,#)&2& ,$)8'985#&"(+',3&(,(%1&2',&72'-)&'(-"$'*&.(%#&

  • "2'&-(&0(//(7&"$)&'()#:&!"#3&'##,&2&)$*'$>92'-&

2.(+'-&(0&#;#%9$)#&2',&.#'-2/&)8.+/28(':& !#;-&0%(.&,(*8.#:9(.&2',&6#?$',#%:9(.& @'(7'&0(%&-"#$%&9+%$()$-31&$'-#//$*#'9#1&2',& #;9#//#'-&9(..+'$928('&&)A$//)1&-"#&B252'#)#& <%##,&$)&6#%0#9-&$0&3(+&72'-&2&%#)6(')$5#1&& $'-#%2985#&6#-1&('#&-"2-&7$//&</(7&$'&3(+%&#2%& 2',&0(//(7&3(+&#5#%37"#%#:&

!" #"

Empirical HSIC(PXY , PXPY ): 1 n2 (HKH ◦ HLH)++
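A minimal sketch of the biased empirical HSIC above, assuming Gaussian kernels on both X and Y (an illustrative choice):

```python
import numpy as np

def hsic_biased(x, y, sigma=1.0):
    # HSIC_b = (1/n^2) trace(K H L H), with centring matrix H = I - (1/n) 1 1^T
    def gram(a):
        return np.exp(-((a[:, None] - a[None, :]) ** 2) / (2 * sigma ** 2))

    n = len(x)
    h = np.eye(n) - np.ones((n, n)) / n
    return np.trace(gram(x) @ h @ gram(y) @ h) / n ** 2

rng = np.random.default_rng(2)
x = rng.normal(size=300)
print(hsic_biased(x, x ** 2))                  # dependent: clearly above zero
print(hsic_biased(x, rng.normal(size=300)))    # independent: near zero
```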

slide-42
SLIDE 42

Characteristic kernels (Via Fourier, on the torus T)

slide-43
SLIDE 43

Characteristic Kernels (via Fourier)

Reminder: characteristic kernel means MMD is a metric (MMD = 0 iff P = Q) [NIPS07b, JMLR10]

In the next slides:

  • 1. Characteristic property on [−π, π] with periodic boundary
  • 2. Characteristic property on R^d
slide-44
SLIDE 44

Characteristic Kernels (via Fourier)

Reminder: Fourier series

  • Function on [−π, π] with periodic boundary:

f(x) = Σ_{ℓ=−∞}^∞ f̂_ℓ exp(iℓx) = Σ_{ℓ=−∞}^∞ f̂_ℓ (cos(ℓx) + i sin(ℓx)).

[Figure: top hat function f(x) and its Fourier series coefficients f̂_ℓ]

slide-45
SLIDE 45

Characteristic Kernels (via Fourier)

Reminder: Fourier series of a translation-invariant kernel k(x, y) = k(x − y) = k(z),

k(z) = Σ_{ℓ=−∞}^∞ k̂_ℓ exp(iℓz).

E.g. the Gaussian-spectrum kernel

k(x) = (1/2π) ϑ(x/2π, iσ²/2π),   k̂_ℓ = (1/2π) exp(−σ²ℓ²/2),

where ϑ is the Jacobi theta function; k is close to a Gaussian when σ² is sufficiently narrower than [−π, π].

[Figure: the kernel k(x) and its Fourier series coefficients k̂_ℓ]

slide-46
SLIDE 46

Characteristic Kernels (via Fourier)

Maximum mean embedding via Fourier series:

  • Fourier series for P is the characteristic function φ̄_P
  • Fourier series for the mean embedding is the product of Fourier series! (convolution theorem)

µ_P(x) = E_{t∼P} k(x − t) = ∫_{−π}^{π} k(x − t) dP(t),   so   µ̂_{P,ℓ} = k̂_ℓ × φ̄_{P,ℓ}

slide-47
SLIDE 47

Characteristic Kernels (via Fourier)

Maximum mean embedding via Fourier series:

  • Fourier series for P is the characteristic function φ̄_P
  • Fourier series for the mean embedding is the product of Fourier series! (convolution theorem)

µ_P(x) = E_{t∼P} k(x − t) = ∫_{−π}^{π} k(x − t) dP(t),   so   µ̂_{P,ℓ} = k̂_ℓ × φ̄_{P,ℓ}

  • MMD can be written in terms of Fourier series:

MMD(P, Q; F) = ‖ Σ_{ℓ=−∞}^∞ (φ̄_{P,ℓ} − φ̄_{Q,ℓ}) k̂_ℓ exp(iℓx) ‖_F
slide-48
SLIDE 48

A simpler Fourier expression for MMD

  • From the previous slide,

MMD(P, Q; F) = ‖ Σ_{ℓ=−∞}^∞ (φ̄_{P,ℓ} − φ̄_{Q,ℓ}) k̂_ℓ exp(iℓx) ‖_F

  • The squared norm of a function f in F is:

‖f‖²_F = ⟨f, f⟩_F = Σ_{ℓ=−∞}^∞ |f̂_ℓ|² / k̂_ℓ.

  • Simple, interpretable expression for the squared MMD:

MMD²(P, Q; F) = Σ_{ℓ=−∞}^∞ [ |φ̄_{P,ℓ} − φ̄_{Q,ℓ}| k̂_ℓ ]² / k̂_ℓ = Σ_{ℓ=−∞}^∞ |φ̄_{P,ℓ} − φ̄_{Q,ℓ}|² k̂_ℓ
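A small numerical sketch of this expression: if P and Q differ only at a frequency ℓ where k̂_ℓ = 0, the squared MMD is blind to the difference. The coefficient choices below are illustrative assumptions:

```python
import numpy as np

# MMD^2 = sum_l |phi_P(l) - phi_Q(l)|^2 * k_hat(l), truncated to |l| <= 50
ells = np.arange(-50, 51)
phi_diff = np.zeros(ells.shape)
phi_diff[np.abs(ells) == 4] = 0.1          # P and Q differ only at frequency 4

sigma = 0.5
k_gauss = np.exp(-sigma ** 2 * ells ** 2 / 2) / (2 * np.pi)  # Gaussian-spectrum kernel
k_tri = np.sinc(ells / 4) ** 2             # triangle-like kernel: zeros at l = ±4, ±8, ...

print((phi_diff ** 2 * k_gauss).sum())     # > 0: difference detected
print((phi_diff ** 2 * k_tri).sum())       # = 0: kernel blind at this frequency
```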

slide-49
SLIDE 49

Example

  • Example: P differs from Q at one frequency

[Figure: densities P(x) and Q(x), identical except for a perturbation at one frequency]

slide-50
SLIDE 50

Characteristic Kernels (2)

  • Example: P differs from Q at (roughly) one frequency

[Figure: densities P(x) and Q(x), and (via the Fourier transform F) their Fourier coefficients φ_{P,ℓ} and φ_{Q,ℓ}]

slide-51
SLIDE 51

Characteristic Kernels (2)

  • Example: P differs from Q at (roughly) one frequency

[Figure: densities P(x) and Q(x), their Fourier coefficients φ_{P,ℓ} and φ_{Q,ℓ}, and the characteristic function difference φ_{P,ℓ} − φ_{Q,ℓ}, concentrated at one frequency]

slide-52
SLIDE 52

Example

Is the Gaussian-spectrum kernel characteristic?

[Figure: the kernel k(x) and its Fourier series coefficients k̂_ℓ, which are strictly positive at every frequency]

MMD²(P, Q; F) = Σ_{ℓ=−∞}^∞ |φ̄_{P,ℓ} − φ̄_{Q,ℓ}|² k̂_ℓ

slide-53
SLIDE 53

Example

Is the Gaussian-spectrum kernel characteristic? YES

[Figure: the kernel k(x) and its Fourier series coefficients k̂_ℓ, which are strictly positive at every frequency]

MMD²(P, Q; F) = Σ_{ℓ=−∞}^∞ |φ̄_{P,ℓ} − φ̄_{Q,ℓ}|² k̂_ℓ

slide-54
SLIDE 54

Example

Is the triangle kernel characteristic?

[Figure: the triangle kernel f(x) and its Fourier series coefficients f̂_ℓ, which vanish at some frequencies]

MMD²(P, Q; F) = Σ_{ℓ=−∞}^∞ |φ̄_{P,ℓ} − φ̄_{Q,ℓ}|² k̂_ℓ

slide-55
SLIDE 55

Example

Is the triangle kernel characteristic? NO

[Figure: the triangle kernel f(x) and its Fourier series coefficients f̂_ℓ, which vanish at some frequencies]

MMD²(P, Q; F) = Σ_{ℓ=−∞}^∞ |φ̄_{P,ℓ} − φ̄_{Q,ℓ}|² k̂_ℓ

slide-56
SLIDE 56

Characteristic kernels (via Fourier, on R^d)

slide-57
SLIDE 57

Characteristic Kernels (via Fourier)

  • Can we prove characteristic on R^d?
slide-58
SLIDE 58

Characteristic Kernels (via Fourier)

  • Can we prove characteristic on R^d?
  • Characteristic function of P via the Fourier transform:

φ_P(ω) = ∫_{R^d} e^{i x^⊤ ω} dP(x)
slide-59
SLIDE 59

Characteristic Kernels (via Fourier)

  • Can we prove characteristic on R^d?
  • Characteristic function of P via the Fourier transform:

φ_P(ω) = ∫_{R^d} e^{i x^⊤ ω} dP(x)

  • Translation invariant kernels: k(x, y) = k(x − y) = k(z)
  • Bochner’s theorem:

k(z) = ∫_{R^d} e^{−i z^⊤ ω} dΛ(ω)

    – Λ a finite non-negative Borel measure

slide-60
SLIDE 60

Characteristic Kernels (via Fourier)

  • Can we prove characteristic on R^d?
  • Characteristic function of P via the Fourier transform:

φ_P(ω) = ∫_{R^d} e^{i x^⊤ ω} dP(x)

  • Translation invariant kernels: k(x, y) = k(x − y) = k(z)
  • Bochner’s theorem:

k(z) = ∫_{R^d} e^{−i z^⊤ ω} dΛ(ω)

    – Λ a finite non-negative Borel measure

slide-61
SLIDE 61

Characteristic Kernels (via Fourier)

  • Fourier representation of MMD:

MMD²(P, Q; F) = ∫ |φ_P(ω) − φ_Q(ω)|² dΛ(ω)

    – φ_P characteristic function of P

Proof: using Bochner’s theorem (a) and Fubini’s theorem (b),

MMD²(P, Q) = ∫∫_{R^d} k(x − y) d(P − Q)(x) d(P − Q)(y)
       (a) = ∫∫ [ ∫_{R^d} e^{−i(x−y)^⊤ω} dΛ(ω) ] d(P − Q)(x) d(P − Q)(y)
       (b) = ∫ [ ∫_{R^d} e^{−ix^⊤ω} d(P − Q)(x) ] [ ∫_{R^d} e^{iy^⊤ω} d(P − Q)(y) ] dΛ(ω)
           = ∫ |φ_P(ω) − φ_Q(ω)|² dΛ(ω)
slide-62
SLIDE 62

Example

  • Example: P differs from Q at (roughly) one frequency

[Figure: densities P(X) and Q(X) over X ∈ [−10, 10]]

slide-63
SLIDE 63

Example

  • Example: P differs from Q at (roughly) one frequency

[Figure: densities P(X) and Q(X), and (via the Fourier transform F) their characteristic function magnitudes |φ_P| and |φ_Q| over frequency ω]

slide-64
SLIDE 64

Example

  • Example: P differs from Q at (roughly) one frequency

[Figure: densities P(X) and Q(X), their characteristic function magnitudes |φ_P| and |φ_Q|, and the characteristic function difference |φ_P − φ_Q|, concentrated near one frequency]

slide-65
SLIDE 65

Example

  • Example: P differs from Q at (roughly) one frequency

[Figure: Gaussian kernel spectrum overlaid on the difference |φ_P − φ_Q|, over frequency ω]

slide-66
SLIDE 66

Example

  • Example: P differs from Q at (roughly) one frequency

Characteristic

[Figure: the Gaussian kernel’s spectrum is everywhere positive, so it cannot miss the difference |φ_P − φ_Q|]

slide-67
SLIDE 67

Example

  • Example: P differs from Q at (roughly) one frequency

[Figure: sinc kernel spectrum overlaid on the difference |φ_P − φ_Q|, over frequency ω]

slide-68
SLIDE 68

Example

  • Example: P differs from Q at (roughly) one frequency

NOT characteristic

[Figure: the sinc kernel’s spectrum has bounded support, so it can miss the difference |φ_P − φ_Q| entirely]

slide-69
SLIDE 69

Example

  • Example: P differs from Q at (roughly) one frequency

[Figure: triangle (B-spline) kernel spectrum overlaid on the difference |φ_P − φ_Q|, over frequency ω]

slide-70
SLIDE 70

Example

  • Example: P differs from Q at (roughly) one frequency

???

[Figure: the triangle (B-spline) kernel’s spectrum has isolated zeros; does it miss the difference?]

slide-71
SLIDE 71

Example

  • Example: P differs from Q at (roughly) one frequency

Characteristic

[Figure: the triangle kernel’s spectrum vanishes only at isolated points, so it cannot miss the difference |φ_P − φ_Q| entirely]

slide-72
SLIDE 72

Summary: Characteristic Kernels

Characteristic kernel: (MMD = 0 iff P = Q) [NIPS07b, COLT08]

Main theorem: a translation invariant k is characteristic for prob. measures on R^d if and only if supp(Λ) = R^d (i.e. the spectrum may vanish only on a set with empty interior, e.g. a countable set)

[COLT08, JMLR10]

Corollary: continuous, compactly supported k are characteristic (since the Fourier spectrum Λ(ω) cannot be zero on an interval).

1-D proof sketch from [Mallat, 1999, Theorem 2.6]; proof on R^d via distribution theory in [Sriperumbudur et al., 2010, Corollary 10, p. 1535]

slide-73
SLIDE 73

k characteristic iff supp(Λ) = R^d

Proof: supp{Λ} = R^d ⟹ k characteristic.

Recall the Fourier definition of MMD: MMD²(P, Q) = ∫_{R^d} |φ_P(ω) − φ_Q(ω)|² dΛ(ω).

Characteristic functions φ_P(ω) and φ_Q(ω) are uniformly continuous, hence if their difference is non-zero anywhere it is non-zero on an open set, which supp{Λ} = R^d cannot miss.

A map φ_P is uniformly continuous if: ∀ε > 0, ∃δ > 0 such that ∀(ω₁, ω₂) ∈ Ω for which d(ω₁, ω₂) < δ, we have d(φ_P(ω₁), φ_P(ω₂)) < ε. Uniform: δ depends only on ε, not on ω₁, ω₂.

slide-74
SLIDE 74

k characteristic iff supp(Λ) = R^d

Proof: k characteristic ⟹ supp{Λ} = R^d. Proof by contrapositive: given supp{Λ} ⊊ R^d, there is an open interval U on which Λ is zero. Construct densities p(x), q(x) such that φ_P, φ_Q differ only inside U.

slide-75
SLIDE 75

Further extensions

  • Similar reasoning wherever extensions of Bochner’s theorem exist: [Fukumizu et al., 2009]

    – Locally compact Abelian groups (periodic domains, as we saw)
    – Compact, non-Abelian groups (orthogonal matrices)
    – The semigroup R₊ⁿ (histograms)

  • Related kernel statistics: the Fisher statistic [Harchaoui et al., 2008] (zero iff P = Q for characteristic kernels), other distances [Zhou and Chellappa, 2006] (not yet shown to establish whether P = Q), energy distances

slide-76
SLIDE 76

Statistical hypothesis testing

slide-77
SLIDE 77

Motivating question: differences in brain signals

The problem: Do local field potential (LFP) signals change when measured near a spike burst?

[Figure: LFP amplitude over time, near a spike burst and without a spike burst]

slide-78
SLIDE 78

Motivating question: differences in brain signals

The problem: Do local field potential (LFP) signals change when measured near a spike burst?

slide-79
SLIDE 79

Motivating question: differences in brain signals

The problem: Do local field potential (LFP) signals change when measured near a spike burst?

slide-80
SLIDE 80

Statistical test using MMD (1)

  • Two hypotheses:

    – H0: null hypothesis (P = Q)
    – H1: alternative hypothesis (P ≠ Q)

slide-81
SLIDE 81

Statistical test using MMD (1)

  • Two hypotheses:

    – H0: null hypothesis (P = Q)
    – H1: alternative hypothesis (P ≠ Q)

  • Observe samples x := {x1, . . . , xn} from P and y from Q
  • If empirical MMD(x, y; F) is

    – “far from zero”: reject H0
    – “close to zero”: accept H0

slide-82
SLIDE 82

Statistical test using MMD (2)

  • “far from zero” vs “close to zero” - threshold?
  • One answer: asymptotic distribution of MMD̂²

slide-83
SLIDE 83

Statistical test using MMD (2)

  • “far from zero” vs “close to zero” - threshold?
  • One answer: asymptotic distribution of MMD̂²
  • An unbiased empirical estimate (quadratic cost):

MMD̂² = (1/(n(n−1))) Σ_{i≠j} h((x_i, y_i), (x_j, y_j)),

where h((x_i, y_i), (x_j, y_j)) := k(x_i, x_j) − k(x_i, y_j) − k(y_i, x_j) + k(y_i, y_j)
slide-84
SLIDE 84

Statistical test using MMD (2)

  • “far from zero” vs “close to zero” - threshold?
  • One answer: asymptotic distribution of MMD̂²
  • An unbiased empirical estimate (quadratic cost):

MMD̂² = (1/(n(n−1))) Σ_{i≠j} h((x_i, y_i), (x_j, y_j)),

where h((x_i, y_i), (x_j, y_j)) := k(x_i, x_j) − k(x_i, y_j) − k(y_i, x_j) + k(y_i, y_j)

  • When P ≠ Q, asymptotically normal:

√n (MMD̂² − MMD²) ∼ N(0, σ_u²)

[Hoeffding, 1948, Serfling, 1980]

  • Expression for the variance: with z_i := (x_i, y_i),

σ_u² = 4 [ E_z (E_{z′} h(z, z′))² − (E_{z,z′} h(z, z′))² ]
slide-85
SLIDE 85

Statistical test using MMD (3)

  • Example: Laplace distributions with different variance

[Figure: two Laplace densities PX and QX with different variances, and the empirical distribution of MMD̂² under H1 with a Gaussian fit]

slide-86
SLIDE 86

Statistical test using MMD (4)

  • When P = Q, the U-statistic is degenerate: E_{z′}[h(z, z′)] = 0 [Anderson et al., 1994]
  • The distribution is

n MMD̂² ∼ Σ_{l=1}^∞ λ_l (z_l² − 2)

    where

    – z_l ∼ N(0, 2) i.i.d.
    – λ_l are the eigenvalues of the centred kernel k̃(x, x′):

∫_X k̃(x, x′) ψ_i(x) dP(x) = λ_i ψ_i(x′)

slide-87
SLIDE 87

Statistical test using MMD (4)

  • When P = Q, the U-statistic is degenerate: E_{z′}[h(z, z′)] = 0 [Anderson et al., 1994]
  • The distribution is

n MMD̂² ∼ Σ_{l=1}^∞ λ_l (z_l² − 2)

    where

    – z_l ∼ N(0, 2) i.i.d.
    – λ_l are the eigenvalues of the centred kernel k̃(x, x′):

∫_X k̃(x, x′) ψ_i(x) dP(x) = λ_i ψ_i(x′)

[Figure: MMD density under H0: the χ² sum matches the empirical PDF of n × MMD̂²]

slide-88
SLIDE 88

Statistical test using MMD (5)

  • Given P = Q, want threshold T such that P(MMD̂² > T) ≤ 0.05
  • MMD̂² = K̄_{P,P} + K̄_{Q,Q} − 2K̄_{P,Q} (averages of the within- and between-sample blocks of the joint Gram matrix)

[Figure: MMD density under H0 and H1, the 1−α null quantile, and the Type II error region]

slide-89
SLIDE 89

Statistical test using MMD (5)

  • Given P = Q, want threshold T such that P(MMD > T) ≤ 0.05
slide-90
SLIDE 90

Statistical test using MMD (5)

  • Given P = Q, want threshold T such that P(MMD̂² > T) ≤ 0.05
  • Permutation for empirical CDF [Arcones and Giné, 1992, Alba Fernández et al., 2008]
  • Pearson curves by matching first four moments [Johnson et al., 1994]
  • Large deviation bounds [Hoeffding, 1963, McDiarmid, 1989]
  • Consistent test using kernel eigenspectrum [NIPS09b]
slide-91
SLIDE 91

Statistical test using MMD (5)

  • Given P = Q, want threshold T such that P(MMD̂² > T) ≤ 0.05
  • Permutation for empirical CDF [Arcones and Giné, 1992, Alba Fernández et al., 2008]
  • Pearson curves by matching first four moments [Johnson et al., 1994]
  • Large deviation bounds [Hoeffding, 1963, McDiarmid, 1989]
  • Consistent test using kernel eigenspectrum [NIPS09b]

[Figure: CDF of the MMD under H0 and its Pearson curve fit]

slide-92
SLIDE 92

Approximate null distribution of MMD via permutation

Empirical MMD: let

w = (1, 1, 1, . . . , 1, −1, . . . , −1, −1, −1)^⊤   (n entries +1, then n entries −1)

MMD̂² = (1/n²) Σ_{i,j} ( [ K_{P,P}  K_{P,Q} ; K_{Q,P}  K_{Q,Q} ] ⊙ ww^⊤ )_{ij}

slide-93
SLIDE 93

Approximate null distribution of MMD via permutation

Permuted case: [Alba Fernández et al., 2008]

w = (1, −1, 1, . . . , 1, −1, . . . , 1, −1, −1)^⊤   (entries shuffled; equal number of +1 and −1)

(1/n²) Σ_{i,j} ( [ K_{P,P}  K_{P,Q} ; K_{Q,P}  K_{Q,Q} ] ⊙ ww^⊤ )_{ij} = ?

slide-94
SLIDE 94

Approximate null distribution of MMD via permutation

Permuted case: [Alba Fernández et al., 2008]

w = (1, −1, 1, . . . , 1, −1, . . . , 1, −1, −1)^⊤   (entries shuffled; equal number of +1 and −1)

(1/n²) Σ_{i,j} ( [ K_{P,P}  K_{P,Q} ; K_{Q,P}  K_{Q,Q} ] ⊙ ww^⊤ )_{ij} = ?

[Figure: under permuted labels the Gram matrix blocks are scrambled, so the statistic behaves as if both samples came from the same distribution. Figure thanks to Kacper Chwialkowski.]

slide-95
SLIDE 95

Approximate null distribution of MMD̂² via permutation

MMD̂²_p ≈ (1/n²) Σ_{i,j} ( [ K_{P,P}  K_{P,Q} ; K_{Q,P}  K_{Q,Q} ] ⊙ ww^⊤ )_{ij}

[Figure: null PDF of n × MMD̂² and the null PDF obtained from permutation agree closely]
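A minimal sketch of the permutation procedure, assuming a Gaussian kernel and equal sample sizes (names and defaults are illustrative):

```python
import numpy as np

def mmd2_from_gram(k_joint, w):
    # Biased MMD^2 = (1/n^2) * sum of entries of (K ⊙ w w^T), as above
    n = len(w) // 2
    return (k_joint * np.outer(w, w)).sum() / n ** 2

def mmd_permutation_test(x, y, sigma=1.0, n_perms=500, alpha=0.05, seed=0):
    z = np.concatenate([x, y])
    k_joint = np.exp(-((z[:, None] - z[None, :]) ** 2) / (2 * sigma ** 2))
    w = np.concatenate([np.ones(len(x)), -np.ones(len(y))])
    rng = np.random.default_rng(seed)
    null = [mmd2_from_gram(k_joint, rng.permutation(w)) for _ in range(n_perms)]
    stat = mmd2_from_gram(k_joint, w)
    return stat > np.quantile(null, 1 - alpha)   # True => reject H0

rng = np.random.default_rng(3)
print(mmd_permutation_test(rng.normal(0, 1, 100), rng.normal(0.5, 1, 100)))
```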

slide-96
SLIDE 96

Detecting differences in brain signals

Do local field potential (LFP) signals change when measured near a spike burst?

[Figure: LFP amplitude over time, near a spike burst and without a spike burst]

slide-97
SLIDE 97

Neuro data: consistent test w/o permutation

  • Maximum mean discrepancy (MMD): distance between P and Q

MMD(P, Q; F) := ‖µ_P − µ_Q‖²_F

  • Is MMD̂² significantly > 0?
  • When P = Q, null distribution of MMD̂²:

n MMD̂² →_D Σ_{l=1}^∞ λ_l (z_l² − 2),

    – λ_l is the lth eigenvalue of the centred kernel k̃(x_i, x_j)

[Figure: Type II error vs sample size m on the neuro data (P ≠ Q), comparing the spectral and permutation tests]

Use the Gram matrix spectrum for λ̂_l: consistent test without permutation
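A sketch of the spectral alternative: estimate λ̂_l from the centred Gram matrix of the pooled sample and simulate the null sum directly. The normalisation here is a simplification of the procedure in [NIPS09b]:

```python
import numpy as np

def spectral_null_quantile(x, y, sigma=1.0, n_draws=2000, alpha=0.05, seed=0):
    # Approximate the H0 law of n * MMD^2 by sum_l lambda_l * (z_l^2 - 2),
    # z_l ~ N(0, 2), with lambda_l estimated from the centred Gram matrix.
    z = np.concatenate([x, y])
    n = len(z)
    k = np.exp(-((z[:, None] - z[None, :]) ** 2) / (2 * sigma ** 2))
    h = np.eye(n) - np.ones((n, n)) / n
    lam = np.linalg.eigvalsh(h @ k @ h) / n
    lam = lam[lam > 1e-10]
    rng = np.random.default_rng(seed)
    zs = rng.normal(0.0, np.sqrt(2.0), size=(n_draws, lam.size))
    return np.quantile((lam * (zs ** 2 - 2.0)).sum(axis=1), 1 - alpha)
```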

slide-98
SLIDE 98

Hypothesis testing with HSIC

slide-99
SLIDE 99

Distribution of HSIC at independence

  • (Biased) empirical HSIC, a v-statistic:

HSIC_b = (1/n²) trace(KHLH)

    – Statistical testing: how do we find when this is large enough that the null hypothesis P = P_x P_y is unlikely?
    – Formally: given P = P_x P_y, what is the threshold T such that P(HSIC > T) < α for small α?

slide-100
SLIDE 100

Distribution of HSIC at independence

  • (Biased) empirical HSIC, a v-statistic:

HSIC_b = (1/n²) trace(KHLH)

  • The associated U-statistic is degenerate when P = P_x P_y [Serfling, 1980]:

n HSIC_b →_D Σ_{l=1}^∞ λ_l z_l²,   z_l ∼ N(0, 1) i.i.d.,

    where

λ_l ψ_l(z_j) = ∫ h_{ijqr} ψ_l(z_i) dF_{i,q,r},

h_{ijqr} = (1/4!) Σ_{(t,u,v,w)}^{(i,j,q,r)} [ k_{tu} l_{tu} + k_{tu} l_{vw} − 2 k_{tu} l_{tv} ]

slide-101
SLIDE 101

Distribution of HSIC at independence

  • (Biased) empirical HSIC, a v-statistic:

HSIC_b = (1/n²) trace(KHLH)

  • The associated U-statistic is degenerate when P = P_x P_y [Serfling, 1980]:

n HSIC_b →_D Σ_{l=1}^∞ λ_l z_l²,   z_l ∼ N(0, 1) i.i.d.,

    where

λ_l ψ_l(z_j) = ∫ h_{ijqr} ψ_l(z_i) dF_{i,q,r},

h_{ijqr} = (1/4!) Σ_{(t,u,v,w)}^{(i,j,q,r)} [ k_{tu} l_{tu} + k_{tu} l_{vw} − 2 k_{tu} l_{tv} ]

  • First two moments [NIPS07b]:

E(HSIC_b) = (1/n) Tr(C_xx) Tr(C_yy)

var(HSIC_b) = (2(n − 4)(n − 5) / (n)₄) ‖C_xx‖²_HS ‖C_yy‖²_HS + O(n⁻³).

slide-102
SLIDE 102

Statistical testing with HSIC

  • Given P = P_x P_y, what is the threshold T such that P(HSIC > T) < α for small α?
  • Null distribution via permutation [Feuerverger, 1993]

    – Compute HSIC for {x_i, y_{π(i)}}_{i=1}^n for a random permutation π of the indices {1, . . . , n}. This gives HSIC for independent variables.
    – Repeat for many different permutations, get empirical CDF
    – Threshold T is the 1 − α quantile of the empirical CDF

slide-103
SLIDE 103

Statistical testing with HSIC

  • Given P = P_x P_y, what is the threshold T such that P(HSIC > T) < α for small α?
  • Null distribution via permutation [Feuerverger, 1993]

    – Compute HSIC for {x_i, y_{π(i)}}_{i=1}^n for a random permutation π of the indices {1, . . . , n}. This gives HSIC for independent variables.
    – Repeat for many different permutations, get empirical CDF
    – Threshold T is the 1 − α quantile of the empirical CDF

  • Approximate null distribution via moment matching [Kankainen, 1995]:

n HSIC_b(Z) ∼ x^{α−1} e^{−x/β} / (β^α Γ(α)),   where α = (E(HSIC_b))² / var(HSIC_b),   β = n var(HSIC_b) / E(HSIC_b).
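A minimal sketch of the permutation version of this test, with Gaussian kernels as an illustrative choice:

```python
import numpy as np

def hsic_b(k, l):
    n = k.shape[0]
    h = np.eye(n) - np.ones((n, n)) / n
    return np.trace(k @ h @ l @ h) / n ** 2

def hsic_permutation_test(x, y, sigma=1.0, n_perms=500, alpha=0.05, seed=0):
    # Permuting y simulates the null P = Px * Py while keeping both marginals
    def gram(a):
        return np.exp(-((a[:, None] - a[None, :]) ** 2) / (2 * sigma ** 2))

    k, l = gram(x), gram(y)
    rng = np.random.default_rng(seed)
    null = []
    for _ in range(n_perms):
        idx = rng.permutation(len(y))
        null.append(hsic_b(k, l[np.ix_(idx, idx)]))
    return hsic_b(k, l) > np.quantile(null, 1 - alpha)  # True => reject independence
```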

slide-104
SLIDE 104

Experiment: dependence testing for translation

Are the French text extracts translations of the English?   (X ?⇐⇒ Y)

X1: Honourable senators, I have a question for the Leader of the Government in the Senate with regard to the support funding to farmers that has been announced. Most farmers have not received any money yet.

Y1: Honorables sénateurs, ma question s’adresse au leader du gouvernement au Sénat et concerne l’aide financière qu’on a annoncée pour les agriculteurs. La plupart des agriculteurs n’ont encore rien reçu de cet argent.

X2: No doubt there is great pressure on provincial and municipal governments in relation to the issue of child care, but the reality is that there have been no cuts to child care funding from the federal government to the provinces. In fact, we have increased federal investments for early childhood development.

Y2: Il est évident que les ordres de gouvernements provinciaux et municipaux subissent de fortes pressions en ce qui concerne les services de garde, mais le gouvernement n’a pas réduit le financement qu’il verse aux provinces pour les services de garde. Au contraire, nous avons augmenté le financement fédéral pour le développement des jeunes enfants.

· · ·

slide-105
SLIDE 105

Experiment: dependence testing for translation

  • (Biased) empirical HSIC:

HSIC_b = (1/n²) trace(KHLH)

  • Translation example: [NIPS07b] Canadian Hansard (agriculture)
  • 5-line extracts, k-spectrum kernel, k = 10, repetitions = 300, sample size 10

[Schematic: English extracts → Gram matrix K; French extracts → Gram matrix L; both feed into HSIC]

  • k-spectrum kernel: average Type II error 0 (α = 0.05)
slide-106
SLIDE 106

Experiment: dependence testing for translation

  • (Biased) empirical HSIC:

HSIC_b = (1/n²) trace(KHLH)

  • Translation example: [NIPS07b] Canadian Hansard (agriculture)
  • 5-line extracts, k-spectrum kernel, k = 10, repetitions = 300, sample size 10

[Schematic: English extracts → Gram matrix K; French extracts → Gram matrix L; both feed into HSIC]

  • k-spectrum kernel: average Type II error 0 (α = 0.05)
  • Bag of words kernel: average Type II error 0.18
slide-107
SLIDE 107

Kernel two-sample tests for big data, optimal kernel choice

slide-108
SLIDE 108

Quadratic time estimate of MMD

MMD² = ‖µ_P − µ_Q‖²_F = E_P k(x, x′) + E_Q k(y, y′) − 2E_{P,Q} k(x, y)

slide-109
SLIDE 109

Quadratic time estimate of MMD

MMD² = ‖µ_P − µ_Q‖²_F = E_P k(x, x′) + E_Q k(y, y′) − 2E_{P,Q} k(x, y)

Given i.i.d. X := {x1, . . . , xm} and Y := {y1, . . . , ym} from P, Q, respectively:

The earlier estimate (quadratic time):

Ê_P k(x, x′) = (1/(m(m − 1))) Σ_{i=1}^m Σ_{j≠i} k(x_i, x_j)

slide-110
SLIDE 110

Quadratic time estimate of MMD

MMD² = ‖µ_P − µ_Q‖²_F = E_P k(x, x′) + E_Q k(y, y′) − 2E_{P,Q} k(x, y)

Given i.i.d. X := {x1, . . . , xm} and Y := {y1, . . . , ym} from P, Q, respectively:

The earlier estimate (quadratic time):

Ê_P k(x, x′) = (1/(m(m − 1))) Σ_{i=1}^m Σ_{j≠i} k(x_i, x_j)

New, linear time estimate:

Ê_P k(x, x′) = (2/m) [k(x1, x2) + k(x3, x4) + . . .] = (2/m) Σ_{i=1}^{m/2} k(x_{2i−1}, x_{2i})

slide-111
SLIDE 111

Linear time MMD

Shorter expression with explicit k dependence:

MMD² =: η_k(p, q) = E_{xx′yy′} h_k(x, x′, y, y′) =: E_v h_k(v),

where h_k(x, x′, y, y′) = k(x, x′) + k(y, y′) − k(x, y′) − k(x′, y), and v := [x, x′, y, y′].

slide-112
SLIDE 112

Linear time MMD

Shorter expression with explicit k dependence:

MMD² =: η_k(p, q) = E_{xx′yy′} h_k(x, x′, y, y′) =: E_v h_k(v),

where h_k(x, x′, y, y′) = k(x, x′) + k(y, y′) − k(x, y′) − k(x′, y), and v := [x, x′, y, y′].

The linear time estimate again:

η̌_k = (2/m) Σ_{i=1}^{m/2} h_k(v_i),

where v_i := [x_{2i−1}, x_{2i}, y_{2i−1}, y_{2i}] and h_k(v_i) := k(x_{2i−1}, x_{2i}) + k(y_{2i−1}, y_{2i}) − k(x_{2i−1}, y_{2i}) − k(x_{2i}, y_{2i−1})
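A minimal sketch of this estimate and the corresponding test (with the threshold t_{k,α} = m^{−1/2} σ_k √2 Φ^{−1}(1 − α) from the slides that follow); the Gaussian kernel and all names are assumptions:

```python
import numpy as np
from scipy.stats import norm

def linear_time_mmd_test(x, y, sigma=1.0, alpha=0.05):
    def k(a, b):
        return np.exp(-((a - b) ** 2) / (2 * sigma ** 2))

    m = len(x) - len(x) % 2                      # use an even number of points
    x1, x2, y1, y2 = x[0:m:2], x[1:m:2], y[0:m:2], y[1:m:2]
    h = k(x1, x2) + k(y1, y2) - k(x1, y2) - k(x2, y1)   # one h_k(v_i) per pair
    eta = 2.0 * h.sum() / m                      # linear-time MMD^2 estimate
    sigma_k = h.std(ddof=1)                      # empirical std of h_k(v)
    threshold = sigma_k * np.sqrt(2.0 / m) * norm.ppf(1 - alpha)
    return eta, threshold, eta > threshold

rng = np.random.default_rng(4)
print(linear_time_mmd_test(rng.normal(0, 1.0, 10000), rng.normal(0, 1.3, 10000)))
```

Both the statistic and the threshold are computed in a single O(m) pass, with O(1) storage if the pairs are streamed.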

slide-113
SLIDE 113

Linear time vs quadratic time MMD

Disadvantages of linear time MMD vs quadratic time MMD

  • Much higher variance for a given m, hence. . .
  • . . .a much less powerful test for a given m
slide-114
SLIDE 114

Linear time vs quadratic time MMD

Disadvantages of linear time MMD vs quadratic time MMD

  • Much higher variance for a given m, hence. . .
  • . . .a much less powerful test for a given m

Advantages of the linear time MMD vs quadratic time MMD

  • Very simple asymptotic null distribution (a Gaussian, vs an infinite weighted sum of χ²)

  • Both test statistic and threshold computable in O(m), with storage O(1).
  • Given unlimited data, a given Type II error can be attained with less

computation

slide-115
SLIDE 115

Asymptotics of linear time MMD

By the central limit theorem,

m^{1/2} (η̌_k − η_k(p, q)) →_D N(0, 2σ_k²)

  • assuming 0 < E(h_k²) < ∞ (true for bounded k)
  • σ_k² = E_v h_k²(v) − [E_v h_k(v)]².

slide-116
SLIDE 116

Hypothesis test

Hypothesis test of asymptotic level α:

t_{k,α} = m^{−1/2} σ_k √2 Φ^{−1}(1 − α)

where Φ^{−1} is the inverse CDF of N(0, 1).

[Figure: null distribution of the linear-time estimate η̌_k, with the (1 − α) quantile t_{k,α} and the Type I error region]

slide-117
SLIDE 117

Type II error

[Figure: null vs alternative distribution of η̌_k; the Type II error is the mass of the alternative (centred at η_k(p, q)) falling below the threshold]

slide-118
SLIDE 118

The best kernel: minimizes Type II error

Type II error: η̌_k falls below the threshold t_{k,α} while η_k(p, q) > 0.

  • Prob. of a Type II error:

P(η̌_k < t_{k,α}) = Φ( Φ^{−1}(1 − α) − η_k(p, q)√m / (σ_k √2) )

  • where Φ is the Normal CDF.
slide-119
SLIDE 119

The best kernel: minimizes Type II error

Type II error: η̌_k falls below the threshold t_{k,α} while η_k(p, q) > 0.

  • Prob. of a Type II error:

P(η̌_k < t_{k,α}) = Φ( Φ^{−1}(1 − α) − η_k(p, q)√m / (σ_k √2) )

  • where Φ is the Normal CDF.

Since Φ is monotonic, the best kernel choice to minimize Type II error prob. is:

k* = arg max_{k∈K} η_k(p, q) σ_k^{−1},

where K is the family of kernels under consideration.

slide-120
SLIDE 120

Learning the best kernel in a family

Define the family of kernels as follows:

K := { k : k = Σ_{u=1}^d β_u k_u,   ‖β‖₁ = D,   β_u ≥ 0, ∀u ∈ {1, . . . , d} }.

Properties: if at least one β_u > 0,

  • all k ∈ K are valid kernels,
  • if all k_u are characteristic, then k is characteristic
slide-121
SLIDE 121

Test statistic

The squared MMD becomes

η_k(p, q) = ‖µ_k(p) − µ_k(q)‖²_{F_k} = Σ_{u=1}^d β_u η_u(p, q),

where η_u(p, q) := E_v h_u(v).

slide-122
SLIDE 122

Test statistic

The squared MMD becomes

η_k(p, q) = ‖µ_k(p) − µ_k(q)‖²_{F_k} = Σ_{u=1}^d β_u η_u(p, q),

where η_u(p, q) := E_v h_u(v).

Denote:

  • β = (β1, β2, . . . , βd)^⊤ ∈ R^d,
  • h = (h1, h2, . . . , hd)^⊤ ∈ R^d,

    – h_u(x, x′, y, y′) = k_u(x, x′) + k_u(y, y′) − k_u(x, y′) − k_u(x′, y)

  • η = E_v(h) = (η1, η2, . . . , ηd)^⊤ ∈ R^d.

Quantities for the test:

η_k(p, q) = E(β^⊤h) = β^⊤η,   σ_k² := β^⊤ cov(h) β.

slide-123
SLIDE 123

Optimization of the ratio η_k(p, q) σ_k^{−1}

Empirical test parameters:

η̂_k = β^⊤η̂,   σ̂_{k,λ} = √( β^⊤ (Q̂ + λ_m I) β ),

Q̂ is the empirical estimate of cov(h).

Note: η̂_k, σ̂_{k,λ} are computed on training data, vs η̌_k, σ̌_k on the data to be tested (why?)

slide-124
SLIDE 124

Optimization of the ratio η_k(p, q) σ_k^{−1}

Empirical test parameters:

η̂_k = β^⊤η̂,   σ̂_{k,λ} = √( β^⊤ (Q̂ + λ_m I) β ),

Q̂ is the empirical estimate of cov(h).

Note: η̂_k, σ̂_{k,λ} are computed on training data, vs η̌_k, σ̌_k on the data to be tested (why?)

Objective:

β̂* = arg max_{β⪰0} η̂_k σ̂_{k,λ}^{−1} = arg max_{β⪰0} β^⊤η̂ ( β^⊤ (Q̂ + λ_m I) β )^{−1/2} =: α(β; η̂, Q̂)

slide-125
SLIDE 125

Optimization of the ratio η_k(p, q) σ_k^{−1}

Assume: η̂ has at least one positive entry. Then there exists β ⪰ 0 s.t. α(β; η̂, Q̂) > 0. Thus: α(β̂*; η̂, Q̂) > 0

slide-126
SLIDE 126

Optimization of the ratio η_k(p, q) σ_k^{−1}

Assume: η̂ has at least one positive entry. Then there exists β ⪰ 0 s.t. α(β; η̂, Q̂) > 0. Thus: α(β̂*; η̂, Q̂) > 0

Solve the easier problem: β̂* = arg max_{β⪰0} α²(β; η̂, Q̂). Quadratic program:

min { β^⊤ (Q̂ + λ_m I) β : β^⊤η̂ = 1, β ⪰ 0 }

slide-127
SLIDE 127

Optimization of the ratio η_k(p, q) σ_k^{−1}

Assume: η̂ has at least one positive entry. Then there exists β ⪰ 0 s.t. α(β; η̂, Q̂) > 0. Thus: α(β̂*; η̂, Q̂) > 0

Solve the easier problem: β̂* = arg max_{β⪰0} α²(β; η̂, Q̂). Quadratic program:

min { β^⊤ (Q̂ + λ_m I) β : β^⊤η̂ = 1, β ⪰ 0 }

What if η̂ has no positive entries?

slide-128
SLIDE 128

Test procedure

  • 1. Split the data into testing and training.
  • 2. On the training data:

    (a) Compute η̂_u for all k_u ∈ K
    (b) If at least one η̂_u > 0, solve the QP to get β* (a sketch follows below); else choose a random kernel from K

  • 3. On the test data:

    (a) Compute η̌_{k*} using k* = Σ_{u=1}^d β*_u k_u
    (b) Compute the test threshold ť_{α,k*} using σ̌_{k*}

  • 4. Reject the null if η̌_{k*} > ť_{α,k*}
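A sketch of the QP in step 2(b) using SciPy’s SLSQP solver; the solver choice and the final ℓ1 normalisation (i.e. taking D = 1) are assumptions:

```python
import numpy as np
from scipy.optimize import minimize

def optimize_beta(eta_hat, q_hat, lam=1e-3):
    # minimize b^T (Q_hat + lam I) b  subject to  b^T eta_hat = 1, b >= 0
    d = len(eta_hat)
    a = q_hat + lam * np.eye(d)
    res = minimize(
        fun=lambda b: b @ a @ b,
        jac=lambda b: 2.0 * a @ b,
        x0=np.full(d, 1.0 / max(eta_hat.sum(), 1e-12)),
        method="SLSQP",
        bounds=[(0.0, None)] * d,
        constraints=[{"type": "eq", "fun": lambda b: b @ eta_hat - 1.0}],
    )
    beta = np.maximum(res.x, 0.0)
    return beta / beta.sum()        # rescale so that ||beta||_1 = D = 1
```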

slide-129
SLIDE 129

Convergence bounds

Assume a bounded kernel, with σ_k bounded away from 0. If λ_m = Θ(m^{−1/3}) then

| sup_{k∈K} η̂_k σ̂_{k,λ}^{−1} − sup_{k∈K} η_k σ_k^{−1} | = O_P(m^{−1/3}).

slide-130
SLIDE 130

Convergence bounds

Assume a bounded kernel, with σ_k bounded away from 0. If λ_m = Θ(m^{−1/3}) then

| sup_{k∈K} η̂_k σ̂_{k,λ}^{−1} − sup_{k∈K} η_k σ_k^{−1} | = O_P(m^{−1/3}).

Idea:

| sup_{k∈K} η̂_k σ̂_{k,λ}^{−1} − sup_{k∈K} η_k σ_k^{−1} |
  ≤ sup_{k∈K} | η̂_k σ̂_{k,λ}^{−1} − η_k σ_{k,λ}^{−1} | + sup_{k∈K} | η_k σ_{k,λ}^{−1} − η_k σ_k^{−1} |
  ≤ (√d / (D √λ_m)) [ C₁ sup_{k∈K} |η̂_k − η_k| + C₂ sup_{k∈K} |σ̂_{k,λ} − σ_{k,λ}| ] + C₃ D² λ_m
slide-131
SLIDE 131

Experiments

slide-132
SLIDE 132

Competing approaches

  • Median heuristic
  • Max. MMD: choose the k_u ∈ K with the largest η̂_u

    – same as maximizing β^⊤η̂ subject to ‖β‖₁ ≤ 1

  • ℓ2 statistic: maximize β^⊤η̂ subject to ‖β‖₂ ≤ 1
  • Cross validation on training set

Also compare with:

  • Single kernel that maximizes the ratio η_k(p, q) σ_k^{−1}

slide-133
SLIDE 133

Blobs: data

Difficult problems: lengthscale of the difference in distributions not the same as that of the distributions.

slide-134
SLIDE 134

Blobs: data

Difficult problems: the lengthscale of the difference in distributions is not the same as that of the distributions. We distinguish a field of Gaussian blobs with different covariances.

[Figure: blob data from p and q, each a grid of Gaussian blobs; ratio ε = 3.2 of largest to smallest eigenvalues of the blobs in q]

slide-135
SLIDE 135

Blobs: results

[Figure: Type II error vs eigenvalue ratio ε, for methods: max ratio, pt, l2, maxmmd, xval, xvalc, med]

Parameters: m = 10,000 (for training and test). Ratio ε of largest to smallest eigenvalues of blobs in q. Results are averages over 617 trials.

slide-136
SLIDE 136

Blobs: results

[Figure: Type II error vs ε; methods: max ratio, pt, l2, maxmmd, xval, xvalc, med]

Optimize the ratio η_k(p, q) σ_k^{−1}

slide-137
SLIDE 137

Blobs: results

[Figure: Type II error vs ε; methods: max ratio, pt, l2, maxmmd, xval, xvalc, med]

Maximize η_k(p, q) with ‖β‖ constraint

slide-138
SLIDE 138

Blobs: results

[Figure: Type II error vs ε; methods: max ratio, pt, l2, maxmmd, xval, xvalc, med]

Median heuristic

slide-139
SLIDE 139

Feature selection: data

Idea: no single best kernel. Each of the k_u is univariate (along a single coordinate)

slide-140
SLIDE 140

Feature selection: data

Idea: no single best kernel. Each of the k_u is univariate (along a single coordinate)

[Figure: feature selection data, samples from p and q differing along a single coordinate]

slide-141
SLIDE 141

Feature selection: results

[Figure: Type II error vs feature selection dimension; methods: max ratio, pt, l2, maxmmd; annotations mark the single best kernel vs the linear combination]

m = 10,000, average over 5000 trials

slide-142
SLIDE 142

Amplitude modulated signals

Given an audio signal s(t), an amplitude modulated signal can be defined as

u(t) = sin(ω_c t) [a s(t) + l]

  • ω_c: carrier frequency
  • a = 0.2 is the signal scaling, l = 2 is the offset
slide-143
SLIDE 143

Amplitude modulated signals

Given an audio signal s(t), an amplitude modulated signal can be defined as

u(t) = sin(ω_c t) [a s(t) + l]

  • ω_c: carrier frequency
  • a = 0.2 is the signal scaling, l = 2 is the offset

Two amplitude modulated signals from the same artist (in this case, Magnetic Fields).

  • Music sampled at 8 kHz (very low)
  • Carrier frequency is 24 kHz
  • AM signal observed at 120 kHz
  • Samples are extracts of length N = 1000, approx. 0.01 sec (very short).
  • Total dataset size is 30,000 samples from each of p, q.
slide-144
SLIDE 144

Amplitude modulated signals

[Figure: waveform samples from P and samples from Q]

slide-145
SLIDE 145

Results: AM signals

[Figure: Type II error vs added noise; methods: max ratio, pt, med, l2, maxmmd]

m = 10,000 (for training and test) and scaling a = 0.5. Average over 4124 trials. Gaussian noise added.
slide-146
SLIDE 146

Observations on kernel choice

  • It is possible to choose the best kernel for a kernel two-sample test
  • Kernel choice matters for “difficult” problems, where the distributions differ on a lengthscale different to that of the data.
  • Ongoing work:

    – quadratic time statistic
    – avoid training/test split

slide-147
SLIDE 147

Summary

  • MMD a distance between distributions [ISMB06, NIPS06a, JMLR10, JMLR12a]

    – high dimensionality
    – non-Euclidean data (strings, graphs)
    – nonparametric hypothesis tests

  • Measure and test independence [ALT05, NIPS07a, NIPS07b, ALT08, JMLR10, JMLR12a]
  • Characteristic RKHS: MMD a metric [NIPS07b, COLT08, NIPS08a]

    – Easy to check: does the spectrum cover R^d?

slide-148
SLIDE 148

Co-authors

  • From UCL:

    – Luca Baldassarre
    – Steffen Grunewalder
    – Guy Lever
    – Sam Patterson
    – Massimiliano Pontil
    – Dino Sejdinovic

  • External:

    – Karsten Borgwardt, MPI
    – Wicher Bergsma, LSE
    – Kenji Fukumizu, ISM
    – Zaid Harchaoui, INRIA
    – Bernhard Schoelkopf, MPI
    – Alex Smola, CMU/Google
    – Le Song, Georgia Tech
    – Bharath Sriperumbudur, Cambridge

slide-149
SLIDE 149

Selected references

Characteristic kernels and mean embeddings:

  • Smola, A., Gretton, A., Song, L., Schoelkopf, B. (2007). A Hilbert space embedding for distributions. ALT.
  • Sriperumbudur, B., Gretton, A., Fukumizu, K., Schoelkopf, B., Lanckriet, G. (2010). Hilbert space embeddings and metrics on probability measures. JMLR.
  • Gretton, A., Borgwardt, K., Rasch, M., Schoelkopf, B., Smola, A. (2012). A kernel two-sample test. JMLR.

Two-sample, independence, conditional independence tests:

  • Gretton, A., Fukumizu, K., Teo, C., Song, L., Schoelkopf, B., Smola, A. (2008). A kernel statistical test of independence. NIPS.
  • Fukumizu, K., Gretton, A., Sun, X., Schoelkopf, B. (2008). Kernel measures of conditional dependence.
  • Gretton, A., Fukumizu, K., Harchaoui, Z., Sriperumbudur, B. (2009). A fast, consistent kernel two-sample test. NIPS.
  • Gretton, A., Borgwardt, K., Rasch, M., Schoelkopf, B., Smola, A. (2012). A kernel two-sample test. JMLR.

Energy distance, relation to kernel distances:

  • Sejdinovic, D., Sriperumbudur, B., Gretton, A., Fukumizu, K. (2013). Equivalence of distance-based and RKHS-based statistics in hypothesis testing. Annals of Statistics.

Three way interaction:

  • Sejdinovic, D., Gretton, A., and Bergsma, W. (2013). A Kernel Test for Three-Variable Interactions. NIPS.
slide-150
SLIDE 150

Selected references (continued)

Conditional mean embedding, RKHS-valued regression:

  • Weston, J., Chapelle, O., Elisseeff, A., Schölkopf, B., and Vapnik, V. (2003). Kernel Dependency Estimation. NIPS.
  • Micchelli, C., and Pontil, M. (2005). On Learning Vector-Valued Functions. Neural Computation.
  • Caponnetto, A., and De Vito, E. (2007). Optimal Rates for the Regularized Least-Squares Algorithm. Foundations of Computational Mathematics.
  • Song, L., Huang, J., Smola, A., Fukumizu, K. (2009). Hilbert Space Embeddings of Conditional Distributions. ICML.
  • Grunewalder, S., Lever, G., Baldassarre, L., Patterson, S., Gretton, A., Pontil, M. (2012). Conditional mean embeddings as regressors. ICML.
  • Grunewalder, S., Gretton, A., Shawe-Taylor, J. (2013). Smooth operators. ICML.

Kernel Bayes rule:

  • Song, L., Fukumizu, K., Gretton, A. (2013). Kernel embeddings of conditional distributions: A unified kernel framework for nonparametric inference in graphical models. IEEE Signal Processing Magazine.
  • Fukumizu, K., Song, L., Gretton, A. (2013). Kernel Bayes rule: Bayesian inference with positive definite kernels. JMLR.

slide-151
SLIDE 151
slide-152
SLIDE 152

Local departures from the null

What is a hard testing problem?

slide-153
SLIDE 153

Local departures from the null

What is a hard testing problem?

  • First version: for fixed m, “closer” P and Q have higher Type II error

[Figure: two sample sets from P and Q, one pair close together and one pair further apart]

slide-154
SLIDE 154

Local departures from the null

What is a hard testing problem?

  • As m increases, distinguish “closer” P and Q with fixed Type II error
slide-155
SLIDE 155

Local departures from the null

What is a hard testing problem?

  • As m increases, distinguish “closer” P and Q with fixed Type II error
  • Example: f_P and f_Q probability densities, f_Q = f_P + δg, where δ ∈ R and g is some fixed function such that f_Q is a valid density

    – If δ ∼ m^{−1/2}, Type II error approaches a constant

slide-156
SLIDE 156

More general local departures from null

  • Example: f_P and f_Q probability densities, f_Q = f_P + δg, where δ ∈ R and g is some fixed function such that f_Q is a valid density

[Figure: density P(X) vs several perturbed densities Q(X), with perturbations of decreasing amplitude]

slide-157
SLIDE 157

Local departures from the null

What is a hard testing problem?

  • As we see more samples m, distinguish “closer” P and Q with the same Type II error
  • Example: f_P and f_Q probability densities, f_Q = f_P + δg, where δ ∈ R and g is some fixed function such that f_Q is a valid density

    – If δ ∼ m^{−1/2}, Type II error approaches a constant

  • ... but other choices are also possible - how to characterize them all?
slide-158
SLIDE 158

Local departures from the null

What is a hard testing problem?

  • As we see more samples m, distinguish “closer” P and Q with the same Type II error
  • Example: f_P and f_Q probability densities, f_Q = f_P + δg, where δ ∈ R and g is some fixed function such that f_Q is a valid density

    – If δ ∼ m^{−1/2}, Type II error approaches a constant

  • ... but other choices are also possible - how to characterize them all?

General characterization of local departures from H0:

  • Write µ_Q = µ_P + g_m, where g_m ∈ F is chosen such that µ_P + g_m is a valid distribution embedding
  • Minimum distinguishable distance [JMLR12]:

‖g_m‖_F = c m^{−1/2}

slide-159
SLIDE 159

More general local departures from null

  • More advanced example of a local departure from the null
  • Recall: µ_Q = µ_P + g_m, and ‖g_m‖_F = c m^{−1/2}

[Figure: density P(X) vs several perturbed densities Q(X), whose perturbations shrink in RKHS norm as m grows]

slide-160
SLIDE 160

Kernels vs kernels

  • How does MMD relate to the Parzen window density estimate? [Anderson et al., 1994]

f̂_P(x) = (1/m) Σ_{i=1}^m κ(x_i − x),   where κ satisfies ∫_X κ(x) dx = 1 and κ(x) ≥ 0.

slide-161
SLIDE 161

Kernels vs kernels

  • How does MMD relate to the Parzen window density estimate? [Anderson et al., 1994]

f̂_P(x) = (1/m) Σ_{i=1}^m κ(x_i − x),   where κ satisfies ∫_X κ(x) dx = 1 and κ(x) ≥ 0.

  • L2 distance between Parzen window estimates:

D_{L2}(f̂_P, f̂_Q)² = ∫ [ (1/m) Σ_{i=1}^m κ(x_i − z) − (1/m) Σ_{i=1}^m κ(y_i − z) ]² dz
                  = (1/m²) Σ_{i,j=1}^m k(x_i − x_j) + (1/m²) Σ_{i,j=1}^m k(y_i − y_j) − (2/m²) Σ_{i,j=1}^m k(x_i − y_j),

where k(x − y) = ∫ κ(x − z) κ(y − z) dz
slide-162
SLIDE 162

Kernels vs kernels

  • How does MMD relate to the Parzen window density estimate? [Anderson et al., 1994]

f̂_P(x) = (1/m) Σ_{i=1}^m κ(x_i − x),   where κ satisfies ∫_X κ(x) dx = 1 and κ(x) ≥ 0.

  • L2 distance between Parzen window estimates:

D_{L2}(f̂_P, f̂_Q)² = ∫ [ (1/m) Σ_{i=1}^m κ(x_i − z) − (1/m) Σ_{i=1}^m κ(y_i − z) ]² dz
                  = (1/m²) Σ_{i,j=1}^m k(x_i − x_j) + (1/m²) Σ_{i,j=1}^m k(y_i − y_j) − (2/m²) Σ_{i,j=1}^m k(x_i − y_j),

where k(x − y) = ∫ κ(x − z) κ(y − z) dz

  • For f_Q = f_P + δg, the minimum distance δ to discriminate f_P from f_Q is

δ = m^{−1/2} h_m^{−d/2},

where h_m is the bandwidth of κ.

slide-163
SLIDE 163

Characteristic Kernels (via universality)

Characteristic: MMD a metric (MMD = 0 iff P = Q) [NIPS07b, COLT08]

slide-164
SLIDE 164

Characteristic Kernels (via universality)

Characteristic: MMD a metric (MMD = 0 iff P = Q) [NIPS07b, COLT08] Classical result: P = Q if and only if EP(f(x)) = EQ(f(y)) for all f ∈ C(X), the space of bounded continuous functions on X

[Dudley, 2002]

slide-165
SLIDE 165

Characteristic Kernels (via universality)

Characteristic: MMD a metric (MMD = 0 iff P = Q) [NIPS07b, COLT08] Classical result: P = Q if and only if EP(f(x)) = EQ(f(y)) for all f ∈ C(X), the space of bounded continuous functions on X

[Dudley, 2002]

Universal RKHS: k(x, x′) continuous, X compact, and F dense in C(X) with respect to L∞ [Steinwart, 2001]

slide-166
SLIDE 166

Characteristic Kernels (via universality)

Characteristic: MMD a metric (MMD = 0 iff P = Q) [NIPS07b, COLT08] Classical result: P = Q if and only if EP(f(x)) = EQ(f(y)) for all f ∈ C(X), the space of bounded continuous functions on X

[Dudley, 2002]

Universal RKHS: k(x, x′) continuous, X compact, and F dense in C(X) with respect to L∞ [Steinwart, 2001] If F universal, then MMD {P, Q; F} = 0 iff P = Q

slide-167
SLIDE 167

Characteristic Kernels (via universality)

Proof: First, it is clear that P = Q implies MMD{P, Q; F} is zero. Converse: by the universality of F, for any given ε > 0 and f ∈ C(X), ∃g ∈ F such that

‖f − g‖_∞ ≤ ε.

slide-168
SLIDE 168

Characteristic Kernels (via universality)

Proof: First, it is clear that P = Q implies MMD{P, Q; F} is zero. Converse: by the universality of F, for any given ε > 0 and f ∈ C(X), ∃g ∈ F such that ‖f − g‖_∞ ≤ ε. We next make the expansion

|E_P f(x) − E_Q f(y)| ≤ |E_P f(x) − E_P g(x)| + |E_P g(x) − E_Q g(y)| + |E_Q g(y) − E_Q f(y)|.

The first and third terms satisfy

|E_P f(x) − E_P g(x)| ≤ E_P |f(x) − g(x)| ≤ ε.

slide-169
SLIDE 169

Characteristic Kernels (via universality)

Proof (continued): Next, write

E_P g(x) − E_Q g(y) = ⟨g(·), µ_P − µ_Q⟩_F = 0,

since MMD{P, Q; F} = 0 implies µ_P = µ_Q. Hence

|E_P f(x) − E_Q f(y)| ≤ 2ε

for all f ∈ C(X) and ε > 0, which implies P = Q.

slide-170
SLIDE 170

References

  • V. Alba Fernández, M. Jiménez-Gamero, and J. Muñoz Garcia. A test for the two-sample problem based on empirical characteristic functions. Comput. Stat. Data An., 52:3730–3748, 2008.
  • N. Anderson, P. Hall, and D. Titterington. Two-sample test statistics for measuring discrepancies between two multivariate probability density functions using kernel-based density estimates. Journal of Multivariate Analysis, 50:41–54, 1994.
  • M. Arcones and E. Giné. On the bootstrap of u and v statistics. The Annals of Statistics, 20(2):655–674, 1992.
  • R. M. Dudley. Real analysis and probability. Cambridge University Press, Cambridge, UK, 2002.
  • Andrey Feuerverger. A consistent test for bivariate dependence. International Statistical Review, 61(3):419–433, 1993.
  • K. Fukumizu, B. Sriperumbudur, A. Gretton, and B. Schoelkopf. Characteristic kernels on groups and semigroups. In Advances in Neural Information Processing Systems 21, pages 473–480, Red Hook, NY, 2009. Curran Associates Inc.
  • Z. Harchaoui, F. Bach, and E. Moulines. Testing for homogeneity with kernel Fisher discriminant analysis. In Advances in Neural Information Processing Systems 20, pages 609–616. MIT Press, Cambridge, MA, 2008.
  • W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58:13–30, 1963.
  • Wassily Hoeffding. A class of statistics with asymptotically normal distribution. The Annals of Mathematical Statistics, 19(3):293–325, 1948.
  • N. L. Johnson, S. Kotz, and N. Balakrishnan. Continuous Univariate Distributions. Volume 1. John Wiley and Sons, 2nd edition, 1994.
  • A. Kankainen. Consistent Testing of Total Independence Based on the Empirical Characteristic Function. PhD thesis, University of Jyväskylä, 1995.
  • S. Mallat. A Wavelet Tour of Signal Processing. Academic Press, 2nd edition, 1999.
  • C. McDiarmid. On the method of bounded differences. In Survey in Combinatorics, pages 148–188. Cambridge University Press, 1989.

slide-171
SLIDE 171
  • A. Müller. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29(2):429–443, 1997.
  • R. Serfling. Approximation Theorems of Mathematical Statistics. Wiley, New York, 1980.
  • B. Sriperumbudur, A. Gretton, K. Fukumizu, G. Lanckriet, and B. Schölkopf. Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11:1517–1561, 2010.
  • I. Steinwart. On the influence of the kernel on the consistency of support vector machines. Journal of Machine Learning Research, 2:67–93, 2001.
  • Ingo Steinwart and Andreas Christmann. Support Vector Machines. Information Science and Statistics. Springer, 2008.
  • G. Székely and M. Rizzo. Brownian distance covariance. Annals of Applied Statistics, 4(3):1233–1303, 2009.
  • G. Székely, M. Rizzo, and N. Bakirov. Measuring and testing dependence by correlation of distances. Ann. Stat., 35(6):2769–2794, 2007.
  • S. K. Zhou and R. Chellappa. From sample similarity to ensemble similarity: Probabilistic distance measures in reproducing kernel hilbert space. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(6):917–929, 2006.