Lecture 2: Mappings of Probabilities to RKHS and Applications
slide-1
SLIDE 1

Lecture 2: Mappings of Probabilities to RKHS and Applications

MLSS Cadiz, 2016

Arthur Gretton Gatsby Unit, CSML, UCL

slide-2
SLIDE 2

Outline

  • Kernel metric on the space of probability measures

– Function revealing differences in distributions
– Distance between means in space of features (RKHS)
– Independence measure: features of joint minus product of marginals

  • Characteristic kernels: feature space mappings of probabilities unique
  • Two-sample, independence tests for (almost!) any data type

– distributions on strings, images, graphs, groups (rotation matrices), semigroups, . . .

slide-3
SLIDE 3

Feature mean difference

  • Simple example: 2 Gaussians with different means
  • Answer: t-test
[Figure: two Gaussian probability densities PX and QX with different means]
slide-5
SLIDE 5

Feature mean difference

  • Two Gaussians with same means, different variance
  • Idea: look at difference in means of features of the RVs
  • In Gaussian case: second order features of form ϕ(x) = x2
[Figure: two Gaussian probability densities PX and QX with the same mean and different variances]

[Figure: densities of the feature X² under PX and QX]
slide-6
SLIDE 6

Feature mean difference

  • Gaussian and Laplace distributions
  • Same mean and same variance
  • Difference in means using higher order features...RKHS
[Figure: Gaussian and Laplace probability densities PX and QX over X]
slide-8
SLIDE 8

Probabilities in feature space: the mean trick

The reproducing property (kernel trick)

  • Given x ∈ X for some set X, define feature map ϕ(x) ∈ F,
    ϕ(x) = [. . . ϕi(x) . . .] ∈ ℓ²
  • For positive definite k(x, x′),
    k(x, x′) = ⟨ϕ(x), ϕ(x′)⟩F
  • The reproducing property:
    ∀f ∈ F, f(x) = ⟨f(·), ϕ(x)⟩F

The mean trick

  • Given P a Borel probability measure on X, define feature map µP ∈ F,
    µP = [. . . EP[ϕi(x)] . . .]
  • For positive definite k(x, x′),
    EP,Q k(x, y) = ⟨µP, µQ⟩F for x ∼ P and y ∼ Q
  • The mean trick (we call µP a mean/distribution embedding):
    EP(f(x)) = ⟨µP, f(·)⟩F
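The mean embedding has a direct empirical counterpart: replace the expectation EP[ϕi(x)] by a sample average, so that µ̂P(t) = (1/m) Σᵢ k(xᵢ, t). A minimal sketch in Python, assuming a 1-D Gaussian kernel with unit bandwidth (both are illustrative choices, not fixed by the slides):

```python
import numpy as np

def gauss_k(x, t, sigma=1.0):
    # Gaussian kernel k(x, t) = exp(-(x - t)^2 / (2 sigma^2)) for 1-D inputs
    return np.exp(-(x - t) ** 2 / (2 * sigma**2))

def empirical_mean_embedding(sample, sigma=1.0):
    # mu_hat_P(t) = (1/m) sum_i k(x_i, t): returned as a plain function of t
    sample = np.asarray(sample, dtype=float)
    return lambda t: gauss_k(sample[:, None],
                             np.asarray(t, dtype=float)[None, :],
                             sigma).mean(axis=0)

mu = empirical_mean_embedding([0.0, 0.0, 0.0])
# all sample points sit at 0 and k(0, 0) = 1, so mu([0.0]) = [1.0]
```

Evaluating µ̂P at any point t is then just an average of kernel evaluations, which is what makes the mean trick computable.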

slide-10
SLIDE 10

Does the feature space mean exist?

Does there exist an element µP ∈ F such that
EP f(x) = EP⟨f(·), ϕ(x)⟩F = ⟨f(·), EP ϕ(x)⟩F = ⟨f(·), µP(·)⟩F ∀f ∈ F?

Yes: you can exchange expectation and inner product (i.e. ϕ(x) is Bochner integrable [Steinwart and Christmann, 2008]) under the condition

EP ∥ϕ(x)∥F = EP √k(x, x) < ∞
slide-17
SLIDE 17

The maximum mean discrepancy

The maximum mean discrepancy is the distance between feature means:

MMD²(P, Q) = ∥µP − µQ∥²F
= ⟨µP, µP⟩F + ⟨µQ, µQ⟩F − 2⟨µP, µQ⟩F
= EP k(x, x′)  (a)  + EQ k(y, y′)  (a)  − 2 EP,Q k(x, y)  (b)

(a) = within-distrib. similarity, (b) = cross-distrib. similarity

Proof:

∥µP − µQ∥²F = ⟨µP − µQ, µP − µQ⟩F
= ⟨µP, µP⟩ + ⟨µQ, µQ⟩ − 2⟨µP, µQ⟩
= EP[µP(x)] + . . .
= EP⟨µP(·), k(x, ·)⟩ + . . .
= EP k(x, x′) + EQ k(y, y′) − 2 EP,Q k(x, y)

slide-18
SLIDE 18

The maximum mean discrepancy

The maximum mean discrepancy is the distance between feature means:

MMD²(P, Q) = ∥µP − µQ∥²F
= EP k(x, x′)  (a)  + EQ k(y, y′)  (a)  − 2 EP,Q k(x, y)  (b)

(a) = within-distrib. similarity, (b) = cross-distrib. similarity

Unbiased empirical estimate of the first term (quadratic time):

ÊP k(x, x′) = 1/(m(m − 1)) Σ_{i=1}^{m} Σ_{j≠i} k(xi, xj)
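The same diagonal-dropping trick applies to all three terms of MMD², giving the standard unbiased quadratic-time estimate. A sketch assuming a Gaussian kernel on d-dimensional samples (kernel choice and bandwidth are illustrative, not prescribed by the slides):

```python
import numpy as np

def rbf_gram(a, b, sigma=1.0):
    # Gaussian kernel matrix between the rows of a and b, shape (m, d) / (n, d)
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2 * sigma**2))

def mmd2_unbiased(x, y, sigma=1.0):
    # Unbiased MMD^2: the i = j terms are dropped from the two within-sample
    # sums, matching the 1/(m(m-1)) normalisation; the cross term keeps all pairs
    m, n = len(x), len(y)
    Kxx = rbf_gram(x, x, sigma)
    Kyy = rbf_gram(y, y, sigma)
    Kxy = rbf_gram(x, y, sigma)
    first = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
    second = (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
    return first + second - 2 * Kxy.mean()
```

For samples from the same distribution the estimate fluctuates around zero (it can be slightly negative, being unbiased); for well-separated distributions it is clearly positive.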

slide-21
SLIDE 21

The maximum mean discrepancy

[Figure: block Gram matrix on the pooled sample, with entries k(dogᵢ, dogⱼ), k(fishᵢ, fishⱼ), k(dogᵢ, fishⱼ)]

  • MMD² = K̄P,P + K̄Q,Q − 2K̄P,Q
    (averages of the block entries, diagonal terms removed from KP,P and KQ,Q)

slide-23
SLIDE 23

Function Showing Difference in Distributions

  • Are P and Q different?

[Figure: samples from P and Q]


slide-25
SLIDE 25

Function Showing Difference in Distributions

  • Maximum mean discrepancy: smooth function for P vs Q

MMD(P, Q; F) := sup_{f∈F} [EP f(x) − EQ f(y)] .

[Figure: samples from P and Q with a smooth witness function f(x)]


slide-27
SLIDE 27

Function Showing Difference in Distributions

  • What if the function is not smooth?

MMD(P, Q; F) := sup_{f∈F} [EP f(x) − EQ f(y)] .

[Figure: samples from P and Q with a bounded continuous (non-smooth) function f(x)]


slide-29
SLIDE 29

Function Showing Difference in Distributions

  • Maximum mean discrepancy: smooth function for P vs Q

MMD(P, Q; F) := sup_{f∈F} [EP f(x) − EQ f(y)] .

  • Gauss P vs Laplace Q

[Figure: witness f for Gauss and Laplace densities; probability densities of PX, QX and the witness f over X]

slide-32
SLIDE 32

Function Showing Difference in Distributions

  • Maximum mean discrepancy: smooth function for P vs Q

MMD(P, Q; F) := sup_{f∈F} [EP f(x) − EQ f(y)] .

  • Classical results: MMD(P, Q; F) = 0 iff P = Q, when

– F = bounded continuous [Dudley, 2002]
– F = bounded variation 1 (Kolmogorov metric) [Müller, 1997]
– F = bounded Lipschitz (Earth mover's distances) [Dudley, 2002]

  • MMD(P, Q; F) = 0 iff P = Q when F = the unit ball in a characteristic RKHS F (coming soon!)

[ISMB06, NIPS06a, NIPS07b, NIPS08a, JMLR10]

How do smooth functions relate to feature maps?

slide-33
SLIDE 33

Function view vs feature mean view

  • The (kernel) MMD: [ISMB06, NIPS06a]

MMD(P, Q; F) = sup_{f∈F} [EP f(x) − EQ f(y)]

[Figure: witness f for Gauss and Laplace densities; probability densities and f over X]
slide-36
SLIDE 36

Function view vs feature mean view

  • The (kernel) MMD: [ISMB06, NIPS06a]

MMD(P, Q; F) = sup_{f∈F} [EP f(x) − EQ f(y)]
= sup_{f∈F} ⟨f, µP − µQ⟩F      (use EP(f(x)) = ⟨µP, f⟩F)
= ∥µP − µQ∥F                   (use ∥θ∥F = sup_{f∈F} ⟨f, θ⟩F, since F := {f ∈ F : ∥f∥ ≤ 1})

Function view and feature view equivalent
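The witness attaining the sup is, up to normalisation, f ∝ µP − µQ, so its empirical version is f(t) = (1/m) Σᵢ k(xᵢ, t) − (1/n) Σⱼ k(yⱼ, t). A minimal sketch, again assuming a 1-D Gaussian kernel (an illustrative choice):

```python
import numpy as np

def gauss_k(a, t, sigma=1.0):
    # Gaussian kernel matrix between 1-D sample points a and evaluation points t
    return np.exp(-(a[:, None] - t[None, :]) ** 2 / (2 * sigma**2))

def witness(x, y, sigma=1.0):
    # Unnormalised witness: f(t) = mu_hat_P(t) - mu_hat_Q(t)
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return lambda t: (gauss_k(x, np.asarray(t, dtype=float), sigma).mean(axis=0)
                      - gauss_k(y, np.asarray(t, dtype=float), sigma).mean(axis=0))
```

Plotting this function over a grid reproduces the Gauss-vs-Laplace witness pictures: it is positive where P has more mass than Q and negative where Q dominates.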

slide-38
SLIDE 38

MMD for independence: HSIC

  • Dependence measure: the Hilbert-Schmidt Independence Criterion [ALT05, NIPS07a, ALT07, ALT08, JMLR10]
    Related to [Feuerverger, 1993] and [Székely and Rizzo, 2009, Székely et al., 2007]

HSIC(PXY, PX PY) := ∥µPXY − µPX PY∥²

[Figure: the HSIC kernel is a product of kernels on the two variables, κ((x, y), (x′, y′)) = k(x, x′) × l(y, y′)]

slide-39
SLIDE 39

MMD for independence: HSIC

  • Dependence measure: the Hilbert-Schmidt Independence Criterion [ALT05, NIPS07a, ALT07, ALT08, JMLR10]
    Related to [Feuerverger, 1993] and [Székely and Rizzo, 2009, Székely et al., 2007]

HSIC(PXY, PX PY) := ∥µPXY − µPX PY∥²

HSIC using expectations of kernels: define RKHS F on X with kernel k, and RKHS G on Y with kernel l. Then

HSIC(PXY, PX PY) = EXY EX′Y′ k(x, x′) l(y, y′) + EX EX′ k(x, x′) EY EY′ l(y, y′) − 2 EX′Y′ [EX k(x, x′) EY l(y, y′)] .
slide-42
SLIDE 42

HSIC: empirical estimate and intuition

[Figure: paired samples of text passages (X) and images (Y) illustrating dependence]

Empirical HSIC(PXY, PX PY):

HSICb = (1/n²) (HKH ◦ HLH)₊₊ = (1/n²) trace(KHLH)

(◦ is the entrywise product, (·)₊₊ sums all entries of a matrix, and H is the centering matrix)
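The centred trace form of the empirical estimate takes only a few lines: K and L are the Gram matrices on the x- and y-samples, and H = I − (1/n)11⊤ is the centering matrix (a sketch of the stated formula):

```python
import numpy as np

def hsic_b(K, L):
    # Biased empirical HSIC: (1/n^2) trace(K H L H), H the centering matrix
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / n**2
```

Centering means a constant kernel on one variable contributes nothing: if L is the all-ones matrix, HLH = 0 and HSICb = 0, as it should be for a feature carrying no information.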

slide-43
SLIDE 43

Characteristic kernels (Via Fourier, on the torus T)

slide-44
SLIDE 44

Characteristic Kernels (via Fourier)

Reminder: Characteristic: MMD a metric (MMD = 0 iff P = Q) [NIPS07b, JMLR10] In the next slides:

  • 1. Characteristic property on [−π, π] with periodic boundary
  • 2. Characteristic property on Rd
slide-45
SLIDE 45

Characteristic Kernels (via Fourier)

Reminder: Fourier series

  • Function on [−π, π] with periodic boundary:

f(x) = Σ_{ℓ=−∞}^{∞} f̂ℓ exp(iℓx) = Σ_{ℓ=−∞}^{∞} f̂ℓ (cos(ℓx) + i sin(ℓx)) .

[Figure: top-hat function f(x) and its Fourier series coefficients f̂ℓ]
slide-46
SLIDE 46

Characteristic Kernels (via Fourier)

Reminder: Fourier series of kernel k(x, y) = k(x − y) = k(z):

k(z) = Σ_{ℓ=−∞}^{∞} k̂ℓ exp(iℓz) .

E.g.,

k(x) = (1/(2π)) ϑ(x/(2π), iσ²/(2π)),    k̂ℓ = (1/(2π)) exp(−σ²ℓ²/2) ,

where ϑ is the Jacobi theta function, close to a Gaussian when σ² is sufficiently narrower than [−π, π].

[Figure: the kernel k(x) and its Fourier series coefficients k̂ℓ]
slide-48
SLIDE 48

Characteristic Kernels (via Fourier)

Maximum mean embedding via Fourier series:

  • Fourier series for P is the characteristic function φ̄P
  • Fourier series for the mean embedding is a product of Fourier series! (convolution theorem)

µP(x) = E_{t∼P} k(x − t) = ∫_{−π}^{π} k(x − t) dP(t),    µ̂P,ℓ = k̂ℓ × φ̄P,ℓ

  • MMD can be written in terms of Fourier series:

MMD(P, Q; F) = ∥ Σ_{ℓ=−∞}^{∞} [(φ̄P,ℓ − φ̄Q,ℓ) k̂ℓ] exp(iℓx) ∥F
slide-49
SLIDE 49

A simpler Fourier expression for MMD

  • From the previous slide,

MMD(P, Q; F) = ∥ Σ_{ℓ=−∞}^{∞} [(φ̄P,ℓ − φ̄Q,ℓ) k̂ℓ] exp(iℓx) ∥F

  • The squared norm of a function f in F is:

∥f∥²F = ⟨f, f⟩F = Σ_{ℓ=−∞}^{∞} |f̂ℓ|² / k̂ℓ .

  • Simple, interpretable expression for squared MMD:

MMD²(P, Q; F) = Σ_{ℓ=−∞}^{∞} [|φP,ℓ − φQ,ℓ| k̂ℓ]² / k̂ℓ = Σ_{ℓ=−∞}^{∞} |φP,ℓ − φQ,ℓ|² k̂ℓ

slide-52
SLIDE 52

Characteristic Kernels (2)

  • Example: P differs from Q at (roughly) one frequency

[Figure: densities P(x) and Q(x); their Fourier coefficients φP,ℓ and φQ,ℓ; and the characteristic function difference |φP,ℓ − φQ,ℓ|]

slide-54
SLIDE 54

Example

Is the Gaussian-spectrum kernel characteristic? YES

[Figure: the kernel k(x) and its Fourier series coefficients k̂ℓ, which are positive at every frequency]

MMD²(P, Q; F) = Σ_{ℓ=−∞}^{∞} |φP,ℓ − φQ,ℓ|² k̂ℓ

slide-56
SLIDE 56

Example

Is the triangle kernel characteristic? NO

[Figure: the triangle kernel f(x) and its Fourier series coefficients f̂ℓ, which vanish at some frequencies]

MMD²(P, Q; F) = Σ_{ℓ=−∞}^{∞} |φP,ℓ − φQ,ℓ|² k̂ℓ
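The spectral formula explains the YES/NO answers: MMD² sums the squared characteristic-function differences weighted by the kernel spectrum, so a difference confined to frequencies where k̂ℓ = 0 contributes nothing. A toy numeric check (the two spectra and the single-frequency difference are illustrative assumptions, not the slides' exact kernels):

```python
import numpy as np

ell = np.arange(-10, 11)                        # frequencies l
diff2 = (np.abs(ell) == 5).astype(float)        # |phi_P,l - phi_Q,l|^2: differ at |l| = 5 only
k_gauss = np.exp(-0.5 * ell**2 / 4.0)           # Gaussian-type spectrum: positive everywhere
k_tri = np.maximum(0.0, 1 - np.abs(ell) / 3)    # triangle-type spectrum: zero for |l| >= 3

mmd2_gauss = float(np.sum(diff2 * k_gauss))     # positive: the difference is detected
mmd2_tri = float(np.sum(diff2 * k_tri))         # exactly zero: the difference is invisible
```

With the positive spectrum the discrepancy registers; with the compactly supported spectrum the two distributions are indistinguishable by MMD, which is exactly the failure of the characteristic property.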

slide-57
SLIDE 57

Characteristic kernels (Via Fourier, on Rd)

slide-60
SLIDE 60

Characteristic Kernels (via Fourier)

  • Can we prove characteristic on Rᵈ?
  • Characteristic function of P via Fourier transform:

φP(ω) = ∫_{Rᵈ} exp(i x⊤ω) dP(x)

  • Translation invariant kernels: k(x, y) = k(x − y) = k(z)
  • Bochner's theorem:

k(z) = ∫_{Rᵈ} exp(−i z⊤ω) dΛ(ω)

– Λ a finite non-negative Borel measure

slide-62
SLIDE 62

Characteristic Kernels (via Fourier)

Fourier representation of MMD:

MMD²(P, Q; F) = ∫_{Rᵈ} |φP(ω) − φQ(ω)|² dΛ(ω),    φP the characteristic function of P

Proof: using Bochner's theorem (a) and Fubini's theorem (b),

MMD²(P, Q) := EP k(x − x′) + EQ k(y − y′) − 2 EP,Q k(x, y)
= ∫∫ k(s − t) d(P − Q)(s) d(P − Q)(t)
(a) = ∫∫ [ ∫_{Rᵈ} exp(−i(s − t)⊤ω) dΛ(ω) ] d(P − Q)(s) d(P − Q)(t)
(b) = ∫_{Rᵈ} [ ∫ exp(−i s⊤ω) d(P − Q)(s) ] [ ∫ exp(i t⊤ω) d(P − Q)(t) ] dΛ(ω)
= ∫_{Rᵈ} |φP(ω) − φQ(ω)|² dΛ(ω)
slide-65
SLIDE 65

Example

  • Example: P differs from Q at (roughly) one frequency

[Figure: densities P(X) and Q(X); their characteristic functions |φP| and |φQ|; and the characteristic function difference |φP − φQ| over frequency ω]
slide-66
SLIDE 66

Example

  • Example: P differs from Q at (roughly) one frequency
  • Exponentiated quadratic kernel: Characteristic

[Figure: the difference |φP − φQ| against frequency ω, together with the kernel spectrum]

slide-68
SLIDE 68

Example

  • Example: P differs from Q at (roughly) one frequency
  • Sinc kernel: NOT characteristic

[Figure: the difference |φP − φQ| against frequency ω, together with the kernel spectrum]

slide-70
SLIDE 70

Example

  • Example: P differs from Q at (roughly) one frequency
  • Triangle (B-spline) kernel: Characteristic

[Figure: the difference |φP − φQ| against frequency ω, together with the kernel spectrum]

slide-73
SLIDE 73

Summary: Characteristic Kernels

Characteristic kernel: (MMD = 0 iff P = Q) [NIPS07b, COLT08]

Main theorem: a translation invariant k is characteristic for prob. measures on Rᵈ if and only if supp(Λ) = Rᵈ (i.e. the spectrum may vanish on at most a countable set) [COLT08, JMLR10]

Corollary: a continuous, compactly supported k is characteristic (since its Fourier spectrum Λ(ω) cannot be zero on an interval).

1-D proof sketch from [Mallat, 1999, Theorem 2.6]; proof on Rᵈ via distribution theory in [Sriperumbudur et al., 2010, Corollary 10, p. 1535].

slide-74
SLIDE 74

k characteristic iff supp(Λ) = Rᵈ

Proof: supp{Λ} = Rᵈ ⇒ k characteristic. Recall the Fourier definition of MMD:

MMD²(P, Q) = ∫_{Rᵈ} |φP(ω) − φQ(ω)|² dΛ(ω).

The characteristic functions φP(ω) and φQ(ω) are uniformly continuous, hence their difference cannot be non-zero only on a countable set.

Map φP uniformly continuous: ∀ε > 0, ∃δ > 0 such that ∀(ω1, ω2) ∈ Ω for which d(ω1, ω2) < δ, we have d(φP(ω1), φP(ω2)) < ε. Uniform: δ depends only on ε, not on ω1, ω2.

slide-75
SLIDE 75

k characteristic iff supp(Λ) = Rᵈ

Proof: k characteristic ⇒ supp{Λ} = Rᵈ. Proof by contrapositive: given supp{Λ} ⊊ Rᵈ, there exists an open interval U on which Λ is zero. Construct densities p(x), q(x) such that φP, φQ differ only inside U.

slide-76
SLIDE 76

Further extensions

  • Similar reasoning wherever extensions of Bochner's theorem exist: [Fukumizu et al., 2009]

– Locally compact Abelian groups (periodic domains, as we saw)
– Compact, non-Abelian groups (orthogonal matrices)
– The semigroup R₊ⁿ (histograms)

  • Related kernel statistics: Fisher statistic [Harchaoui et al., 2008] (zero iff P = Q for characteristic kernels), other distances [Zhou and Chellappa, 2006] (not yet shown to establish whether P = Q), energy distances

slide-77
SLIDE 77

Statistical hypothesis testing

slide-78
SLIDE 78

Motivating question: differences in brain signals

The problem: do local field potential (LFP) signals change when measured near a spike burst?

[Figure: LFP amplitude over time, near a spike burst and without a spike burst]

slide-82
SLIDE 82

Statistical test using MMD (1)

  • Two hypotheses:

– H0: null hypothesis (P = Q)
– H1: alternative hypothesis (P ≠ Q)

  • Observe samples x := {x1, . . . , xn} from P and y from Q
  • If empirical MMD(x, y; F) is

– "far from zero": reject H0
– "close to zero": accept H0

slide-85
SLIDE 85

Statistical test using MMD (2)

  • "far from zero" vs "close to zero" - threshold?
  • One answer: asymptotic distribution of MMD²
  • An unbiased empirical estimate (quadratic cost):

MMD²̂ = 1/(n(n−1)) Σ_{i≠j} [ k(xi, xj) − k(xi, yj) − k(yi, xj) + k(yi, yj) ],
with h((xi, yi), (xj, yj)) the bracketed term.

  • When P ≠ Q, asymptotically normal:

√n (MMD²̂ − MMD²) ∼ N(0, σu²)   [Hoeffding, 1948, Serfling, 1980]

  • Expression for the variance, with zi := (xi, yi):

σu² = 4 [ Ez (Ez′ h(z, z′))² − (Ez,z′ h(z, z′))² ]

slide-86
SLIDE 86

Statistical test using MMD (3)

  • Example: Laplace distributions with different variance

[Figure: two Laplace densities PX and QX with different variances; empirical PDF of the MMD under H1 with a Gaussian fit]
slide-88
SLIDE 88

Statistical test using MMD (4)

  • When P = Q, the U-statistic is degenerate: Ez′[h(z, z′)] = 0 [Anderson et al., 1994]
  • The distribution is

n MMD²̂ ∼ Σ_{l=1}^{∞} λl (zl² − 2)

where

– zl ∼ N(0, 2) i.i.d.
– ∫_X k̃(x, x′) ψi(x) dP(x) = λi ψi(x′), with k̃ the centred kernel

[Figure: MMD density under H0; empirical PDF of n × MMD² against the χ² sum]
slide-89
SLIDE 89

Statistical test using MMD (5)

  • Given P = Q, want threshold T such that P(MMD > T) ≤ 0.05
  • MMD²̂ = K̄P,P + K̄Q,Q − 2K̄P,Q

[Figure: MMD density under H0 and H1; null and alternative distributions of n × MMD², the 1−α null quantile, and the Type II error region]
slide-92
SLIDE 92

Statistical test using MMD (5)

  • Given P = Q, want threshold T such that P(MMD > T) ≤ 0.05
  • Permutation for empirical CDF [Arcones and Giné, 1992, Alba Fernández et al., 2008]
  • Pearson curves by matching first four moments [Johnson et al., 1994]
  • Large deviation bounds [Hoeffding, 1963, McDiarmid, 1989]
  • Consistent test using kernel eigenspectrum [NIPS09b]

[Figure: CDF of the MMD and the Pearson fit]
slide-93
SLIDE 93

Approximate null distribution of MMD via permutation

Empirical MMD: take the sign vector

w = (1, 1, 1, . . . , 1, −1, −1, . . . , −1)⊤   (n entries +1 followed by n entries −1).

Then

(1/n²) ( [ KP,P  KP,Q ; KQ,P  KQ,Q ] ⊙ ww⊤ )₊₊ = MMD²̂ ,

where ⊙ is the entrywise product and (·)₊₊ sums all entries of a matrix.

slide-95
SLIDE 95

Approximate null distribution of MMD via permutation

Permuted case: [Alba Fernández et al., 2008]

w = (1, −1, 1, . . . , 1, −1, . . . , 1, −1, −1)⊤   (equal number of +1 and −1),

(1/n²) ( [ KP,P  KP,Q ; KQ,P  KQ,Q ] ⊙ ww⊤ )₊₊ = ?

[Figure: Gram matrix with a permuted sign pattern]

Figure thanks to Kacper Chwialkowski.

slide-96
SLIDE 96

Approximate null distribution of MMD² via permutation

MMD²ₚ ≈ (1/n²) ( [ KP,P  KP,Q ; KQ,P  KQ,Q ] ⊙ ww⊤ )₊₊

[Figure: MMD density under H0; the null PDF against the null PDF estimated by permutation]
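The sign-vector picture translates directly into code: the biased MMD²̂ is (1/n²) w⊤Kw on the pooled Gram matrix, and permuting the entries of w draws from the null. A sketch, assuming a 1-D Gaussian kernel (kernel and bandwidth are illustrative choices):

```python
import numpy as np

def rbf_gram(a, sigma=1.0):
    # Full Gram matrix on the pooled 1-D sample
    d2 = (a[:, None] - a[None, :]) ** 2
    return np.exp(-d2 / (2 * sigma**2))

def mmd2_from_w(K, w, n):
    # (1/n^2) w^T K w: biased MMD^2 when w = (+1,...,+1,-1,...,-1)
    return float(w @ K @ w) / n**2

def permutation_threshold(K, n, n_perm=500, alpha=0.05, seed=0):
    # 1 - alpha quantile of the null sample obtained by permuting the signs
    rng = np.random.default_rng(seed)
    w0 = np.concatenate([np.ones(n), -np.ones(n)])
    null = [mmd2_from_w(K, rng.permutation(w0), n) for _ in range(n_perm)]
    return float(np.quantile(null, 1 - alpha))
```

For well-separated samples the observed statistic with the ordered sign vector far exceeds the permutation threshold, so the test rejects H0.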
slide-97
SLIDE 97

Detecting differences in brain signals

Do local field potential (LFP) signals change when measured near a spike burst?

[Figure: LFP amplitude over time, near a spike burst and without a spike burst]

slide-98
SLIDE 98

Neuro data: consistent test w/o permutation

  • Maximum mean discrepancy (MMD): distance between P and Q

MMD²(P, Q; F) := ∥µP − µQ∥²F

  • Is MMD²̂ significantly > 0?
  • When P = Q, null distribution of MMD²̂:

n MMD²̂ →_D Σ_{l=1}^{∞} λl (zl² − 2),

– λl is the l-th eigenvalue of the centred kernel k̃(xi, xj)

[Figure: Type II error against sample size m for P ≠ Q (neuro), spectral vs permutation tests]

Use the Gram matrix spectrum for λ̂l: consistent test without permutation

slide-99
SLIDE 99

Hypothesis testing with HSIC

slide-100
SLIDE 100

Distribution of HSIC at independence

  • (Biased) empirical HSIC, a V-statistic:

HSICb = (1/n²) trace(KHLH)

– Statistical testing: how do we find when this is large enough that the null hypothesis P = Px Py is unlikely?
– Formally: given P = Px Py, what is the threshold T such that P(HSIC > T) < α for small α?

slide-102
SLIDE 102

Distribution of HSIC at independence

  • (Biased) empirical HSIC, a V-statistic:

HSICb = (1/n²) trace(KHLH)

  • Associated U-statistic degenerate when P = Px Py [Serfling, 1980]:

n HSICb →_D Σ_{l=1}^{∞} λl zl²,    zl ∼ N(0, 1) i.i.d.,

where

λl ψl(zj) = ∫ hijqr ψl(zi) dFi,q,r,
hijqr = (1/4!) Σ_{(t,u,v,w)}^{(i,j,q,r)} [ ktu ltu + ktu lvw − 2 ktu ltv ]

  • First two moments [NIPS07b]:

E(HSICb) = (1/n) Tr Cxx Tr Cyy
var(HSICb) = (2(n − 4)(n − 5) / (n)₄) ∥Cxx∥²HS ∥Cyy∥²HS + O(n⁻³).

slide-104
SLIDE 104

Statistical testing with HSIC

  • Given P = PxPy, what is the threshold T such that P(HSIC > T) < α for small α?
  • Null distribution via permutation [Feuerverger, 1993]:

– Compute HSIC for {xi, yπ(i)}, i = 1, …, n, for a random permutation π of the indices {1, …, n}. This gives HSIC for independent variables.
– Repeat for many different permutations to obtain an empirical CDF.
– The threshold T is the (1 − α) quantile of this empirical CDF.

  • Approximate null distribution via moment matching to a Gamma density [Kankainen, 1995]:

n·HSICb(Z) ∼ x^(α−1) e^(−x/β) / (β^α Γ(α)), where α = (E(HSICb))² / var(HSICb), β = n·var(HSICb) / E(HSICb)
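The permutation procedure above can be sketched directly in numpy; the Gaussian kernel, permutation count, and data are illustrative choices:

```python
import numpy as np

def gaussian_gram(X, sigma=1.0):
    sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
    return np.exp(-sq / (2 * sigma**2))

def hsic_b(K, L):
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / n**2

def hsic_permutation_test(K, L, alpha=0.05, n_perm=200, seed=1):
    """Threshold from the permutation null: HSIC over {x_i, y_pi(i)}."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    stat = hsic_b(K, L)
    null = []
    for _ in range(n_perm):
        p = rng.permutation(n)
        # permuting rows and columns of L = permuting the y sample
        null.append(hsic_b(K, L[np.ix_(p, p)]))
    threshold = np.quantile(null, 1 - alpha)  # (1 - alpha) quantile of null CDF
    return stat, threshold, stat > threshold

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 1))
y = x + 0.25 * rng.normal(size=(100, 1))      # dependent pair
stat, thr, reject = hsic_permutation_test(gaussian_gram(x), gaussian_gram(y))
print(reject)  # dependence detected
```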

slide-105
SLIDE 105

Experiment: dependence testing for translation

Are the French text extracts translations of the English?

X1: Honourable senators, I have a question for the Leader of the Government in the Senate with regard to the support funding to farmers that has been announced. Most farmers have not received any money yet.

Y1: Honorables sénateurs, ma question s'adresse au leader du gouvernement au Sénat et concerne l'aide financière qu'on a annoncée pour les agriculteurs. La plupart des agriculteurs n'ont encore rien reçu de cet argent.

X2: No doubt there is great pressure on provincial and municipal governments in relation to the issue of child care, but the reality is that there have been no cuts to child care funding from the federal government to the provinces. In fact, we have increased federal investments for early childhood development.

Y2: Il est évident que les ordres de gouvernements provinciaux et municipaux subissent de fortes pressions en ce qui concerne les services de garde, mais le gouvernement n'a pas réduit le financement qu'il verse aux provinces pour les services de garde. Au contraire, nous avons augmenté le financement fédéral pour le développement des jeunes enfants.

X ⇐?⇒ Y: are the paired extracts dependent?

slide-107
SLIDE 107

Experiment: dependence testing for translation

  • (Biased) empirical HSIC:

HSICb = (1/n²) trace(KHLH)

  • Translation example [NIPS07b]: Canadian Hansard (agriculture)
  • 5-line extracts, k-spectrum kernel with k = 10, 300 repetitions, sample size 10

[Diagram: English extracts → Gram matrix K, French extracts → Gram matrix L, combined in HSIC]

  • k-spectrum kernel: average Type II error 0 (α = 0.05)
  • Bag-of-words kernel: average Type II error 0.18
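A k-spectrum kernel counts shared length-k substrings between two strings. A minimal plain-Python sketch, using a small k for readability (the experiment above used k = 10):

```python
from collections import Counter

def spectrum_features(s, k):
    """Counts of all length-k contiguous substrings of s."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))

def k_spectrum_kernel(s, t, k=3):
    """Inner product of the substring-count feature vectors of s and t."""
    fs, ft = spectrum_features(s, k), spectrum_features(t, k)
    return sum(c * ft[sub] for sub, c in fs.items())

print(k_spectrum_kernel("the farmers", "the farm", k=3))  # → 6
```

The six shared 3-grams here are "the", "he ", "e f", " fa", "far", "arm", each occurring once in both strings.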
slide-108
SLIDE 108

Summary

  • MMD: a distance between distributions [ISMB06, NIPS06a, JMLR10, JMLR12a]

– high dimensionality
– non-Euclidean data (strings, graphs)
– nonparametric hypothesis tests

  • Measure and test independence [ALT05, NIPS07a, NIPS07b, ALT08, JMLR10, JMLR12a]
  • Characteristic RKHS: MMD a metric [NIPS07b, COLT08, NIPS08a]

– Easy to check: does the kernel spectrum cover ℝ^d?

slide-109
SLIDE 109

Co-authors

  • From UCL:

– Luca Baldassarre
– Steffen Grunewalder
– Guy Lever
– Sam Patterson
– Massimiliano Pontil
– Dino Sejdinovic

  • External:

– Karsten Borgwardt, MPI
– Wicher Bergsma, LSE
– Kenji Fukumizu, ISM
– Zaid Harchaoui, INRIA
– Bernhard Schoelkopf, MPI
– Alex Smola, CMU/Google
– Le Song, Georgia Tech
– Bharath Sriperumbudur, Cambridge

slide-110
SLIDE 110

Selected references

Characteristic kernels and mean embeddings:

  • Smola, A., Gretton, A., Song, L., Schoelkopf, B. (2007). A Hilbert space embedding for distributions. ALT.
  • Sriperumbudur, B., Gretton, A., Fukumizu, K., Schoelkopf, B., Lanckriet, G. (2010). Hilbert space embeddings and metrics on probability measures. JMLR.
  • Gretton, A., Borgwardt, K., Rasch, M., Schoelkopf, B., Smola, A. (2012). A kernel two-sample test. JMLR.

Two-sample, independence, conditional independence tests:

  • Gretton, A., Fukumizu, K., Teo, C., Song, L., Schoelkopf, B., Smola, A. (2008). A kernel statistical test of independence. NIPS.
  • Fukumizu, K., Gretton, A., Sun, X., Schoelkopf, B. (2008). Kernel measures of conditional dependence. NIPS.
  • Gretton, A., Fukumizu, K., Harchaoui, Z., Sriperumbudur, B. (2009). A fast, consistent kernel two-sample test. NIPS.
  • Gretton, A., Borgwardt, K., Rasch, M., Schoelkopf, B., Smola, A. (2012). A kernel two-sample test. JMLR.

Energy distance, relation to kernel distances

  • Sejdinovic, D., Sriperumbudur, B., Gretton, A.,, Fukumizu, K., (2013). Equivalence of distance-based and

rkhs-based statistics in hypothesis testing. Annals of Statistics.

Three way interaction

  • Sejdinovic, D., Gretton, A., and Bergsma, W. (2013). A Kernel Test for Three-Variable Interactions. NIPS.
slide-111
SLIDE 111

Selected references (continued)

Conditional mean embedding, RKHS-valued regression:

  • Weston, J., Chapelle, O., Elisseeff, A., Schölkopf, B., Vapnik, V. (2003). Kernel Dependency Estimation. NIPS.
  • Micchelli, C., Pontil, M. (2005). On Learning Vector-Valued Functions. Neural Computation.
  • Caponnetto, A., De Vito, E. (2007). Optimal Rates for the Regularized Least-Squares Algorithm. Foundations of Computational Mathematics.
  • Song, L., Huang, J., Smola, A., Fukumizu, K. (2009). Hilbert Space Embeddings of Conditional Distributions. ICML.
  • Grunewalder, S., Lever, G., Baldassarre, L., Patterson, S., Gretton, A., Pontil, M. (2012). Conditional mean embeddings as regressors. ICML.
  • Grunewalder, S., Gretton, A., Shawe-Taylor, J. (2013). Smooth operators. ICML.

Kernel Bayes' rule:

  • Song, L., Fukumizu, K., Gretton, A. (2013). Kernel embeddings of conditional distributions: A unified kernel framework for nonparametric inference in graphical models. IEEE Signal Processing Magazine.
  • Fukumizu, K., Song, L., Gretton, A. (2013). Kernel Bayes' rule: Bayesian inference with positive definite kernels. JMLR.

slide-114
SLIDE 114

Local departures from the null

What is a hard testing problem?

  • First version: for fixed m, "closer" P and Q have a higher Type II error

[Figure: two pairs of sample sets from P and Q, one pair well separated, one pair nearly overlapping]

slide-116
SLIDE 116

Local departures from the null

What is a hard testing problem?

  • As m increases, distinguish "closer" P and Q with fixed Type II error
  • Example: fP and fQ are probability densities, with fQ = fP + δg, where δ ∈ ℝ and g is some fixed function such that fQ is a valid density

– If δ ∼ m^(−1/2), the Type II error approaches a constant

slide-117
SLIDE 117

More general local departures from null

  • Example: fP and fQ are probability densities, with fQ = fP + δg, where δ ∈ ℝ and g is some fixed function such that fQ is a valid density

[Figure: density P(X) vs. three candidate densities Q(X), each an increasingly local perturbation of P]

slide-119
SLIDE 119

Local departures from the null

What is a hard testing problem?

  • As we see more samples m, distinguish "closer" P and Q with the same Type II error
  • Example: fP and fQ are probability densities, with fQ = fP + δg, where δ ∈ ℝ and g is some fixed function such that fQ is a valid density

– If δ ∼ m^(−1/2), the Type II error approaches a constant

  • ...but other choices are also possible – how to characterize them all?

General characterization of local departures from H0:

  • Write µQ = µP + gm, where gm ∈ F is chosen such that µP + gm is a valid distribution embedding
  • Minimum distinguishable distance [JMLR12]: ‖gm‖F = c·m^(−1/2)
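The construction fQ = fP + δg can be sketched numerically. The sinusoidal g and the constant c = 0.5 below are illustrative assumptions, chosen only so that fQ remains a valid density; δ = c·m^(−1/2) reproduces the hard local-alternative scaling:

```python
import numpy as np

# f_P: standard normal density. g: a zero-integral "wiggle", so that
# f_Q = f_P + delta * g still integrates to one and stays nonnegative
# for small delta.
x = np.linspace(-6.0, 6.0, 2001)
dx = x[1] - x[0]
f_p = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
g = np.sin(4 * x) * f_p          # odd function: integrates to zero

for m in [100, 400, 1600]:
    delta = 0.5 * m ** -0.5      # departure shrinks as the sample grows
    f_q = f_p + delta * g
    mass = f_q.sum() * dx        # Riemann approximation of the integral
    print(m, round(mass, 4), bool(f_q.min() >= 0))
```

As m grows, fQ stays a valid density while the perturbation δg shrinks toward fP at the m^(−1/2) rate.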

slide-120
SLIDE 120

More general local departures from null

  • A more advanced example of a local departure from the null
  • Recall: µQ = µP + gm, with ‖gm‖F = c·m^(−1/2)

[Figure: density P(X) vs. three candidate densities Q(X), with increasingly high-frequency perturbations]

slide-123
SLIDE 123

Kernels vs kernels

  • How does MMD relate to a Parzen window density estimate? [Anderson et al., 1994]

f̂P(x) = (1/m) ∑_{i=1}^m κ(xi − x), where κ satisfies ∫_X κ(x) dx = 1 and κ(x) ≥ 0

  • L2 distance between Parzen density estimates:

D_{L2}(f̂P, f̂Q)² = ∫ [ (1/m) ∑_{i=1}^m κ(xi − z) − (1/m) ∑_{i=1}^m κ(yi − z) ]² dz
= (1/m²) ∑_{i,j=1}^m k(xi − xj) + (1/m²) ∑_{i,j=1}^m k(yi − yj) − (2/m²) ∑_{i,j=1}^m k(xi − yj),

where k(x − y) = ∫ κ(x − z) κ(y − z) dz

  • For fQ = fP + δg, the minimum distance to discriminate fP from fQ is δ = m^(−1/2) h_m^(−d/2), where h_m is the width of κ
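The identity above can be checked numerically: with a Gaussian Parzen window κ of width h, the induced kernel k = κ ∗ κ is itself Gaussian with variance 2h². A numpy sketch, where the sample sizes, bandwidth, and integration grid are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
m, h = 40, 0.5
X = rng.normal(0.0, 1.0, m)
Y = rng.normal(1.0, 1.0, m)

def parzen(samples, z):
    """Parzen estimate with a Gaussian window kappa of width h."""
    d2 = (z[:, None] - samples[None, :]) ** 2
    return np.exp(-d2 / (2 * h**2)).mean(axis=1) / np.sqrt(2 * np.pi * h**2)

# Numerical L2 distance between the two Parzen estimates
z = np.linspace(-9.0, 10.0, 4001)
dz = z[1] - z[0]
l2_sq = np.sum((parzen(X, z) - parzen(Y, z)) ** 2) * dz

# Closed form: k = kappa * kappa (convolution) is Gaussian with variance 2 h^2
def k(a, b):
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-d2 / (4 * h**2)) / np.sqrt(4 * np.pi * h**2)

closed_form = k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()
print(l2_sq, closed_form)  # the two agree to quadrature accuracy
```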


slide-127
SLIDE 127

Characteristic Kernels (via universality)

Characteristic: MMD a metric (MMD = 0 iff P = Q) [NIPS07b, COLT08]

Classical result: P = Q if and only if EP(f(x)) = EQ(f(y)) for all f ∈ C(X), the space of bounded continuous functions on X [Dudley, 2002]

Universal RKHS: k(x, x′) continuous, X compact, and F dense in C(X) with respect to the L∞ norm [Steinwart, 2001]

If F is universal, then MMD{P, Q; F} = 0 iff P = Q


slide-129
SLIDE 129

Characteristic Kernels (via universality)

Proof: First, it is clear that P = Q implies MMD{P, Q; F} = 0.

Converse: by the universality of F, for any given ε > 0 and f ∈ C(X) there exists g ∈ F with ‖f − g‖∞ ≤ ε. We next make the expansion

|EPf(x) − EQf(y)| ≤ |EPf(x) − EPg(x)| + |EPg(x) − EQg(y)| + |EQg(y) − EQf(y)|.

The first and third terms satisfy |EPf(x) − EPg(x)| ≤ EP|f(x) − g(x)| ≤ ε.

slide-130
SLIDE 130

Characteristic Kernels (via universality)

Proof (continued): Next, write EPg(x) − EQg(y) = ⟨g(·), µP − µQ⟩F = 0, since MMD{P, Q; F} = 0 implies µP = µQ. Hence |EPf(x) − EQf(y)| ≤ 2ε for all f ∈ C(X) and ε > 0, which implies P = Q.

slide-131
SLIDE 131

References

  • V. Alba Fernández, M. Jiménez-Gamero, and J. Muñoz Garcia. A test for the two-sample problem based on empirical characteristic functions. Comput. Stat. Data An., 52:3730–3748, 2008.
  • N. Anderson, P. Hall, and D. Titterington. Two-sample test statistics for measuring discrepancies between two multivariate probability density functions using kernel-based density estimates. Journal of Multivariate Analysis, 50:41–54, 1994.
  • M. Arcones and E. Giné. On the bootstrap of u and v statistics. The Annals of Statistics, 20(2):655–674, 1992.
  • R. M. Dudley. Real analysis and probability. Cambridge University Press, Cambridge, UK, 2002.
  • A. Feuerverger. A consistent test for bivariate dependence. International Statistical Review, 61(3):419–433, 1993.
  • K. Fukumizu, B. Sriperumbudur, A. Gretton, and B. Schoelkopf. Characteristic kernels on groups and semigroups. In Advances in Neural Information Processing Systems 21, pages 473–480, Red Hook, NY, 2009. Curran Associates Inc.
  • Z. Harchaoui, F. Bach, and E. Moulines. Testing for homogeneity with kernel Fisher discriminant analysis. In Advances in Neural Information Processing Systems 20, pages 609–616. MIT Press, Cambridge, MA, 2008.
  • W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58:13–30, 1963.
  • W. Hoeffding. A class of statistics with asymptotically normal distribution. The Annals of Mathematical Statistics, 19(3):293–325, 1948.
  • N. L. Johnson, S. Kotz, and N. Balakrishnan. Continuous Univariate Distributions. Volume 1. John Wiley and Sons, 2nd edition, 1994.
  • A. Kankainen. Consistent Testing of Total Independence Based on the Empirical Characteristic Function. PhD thesis, University of Jyväskylä, 1995.
  • S. Mallat. A Wavelet Tour of Signal Processing. Academic Press, 2nd edition, 1999.
  • C. McDiarmid. On the method of bounded differences. In Survey in Combinatorics, pages 148–188. Cambridge University Press, 1989.

slide-132
SLIDE 132
  • A. Müller. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29(2):429–443, 1997.
  • R. Serfling. Approximation Theorems of Mathematical Statistics. Wiley, New York, 1980.
  • B. Sriperumbudur, A. Gretton, K. Fukumizu, G. Lanckriet, and B. Schölkopf. Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11:1517–1561, 2010.
  • I. Steinwart. On the influence of the kernel on the consistency of support vector machines. Journal of Machine Learning Research, 2:67–93, 2001.
  • I. Steinwart and A. Christmann. Support Vector Machines. Information Science and Statistics. Springer, 2008.
  • G. Székely and M. Rizzo. Brownian distance covariance. Annals of Applied Statistics, 4(3):1233–1303, 2009.
  • G. Székely, M. Rizzo, and N. Bakirov. Measuring and testing dependence by correlation of distances. Ann. Stat., 35(6):2769–2794, 2007.
  • S. K. Zhou and R. Chellappa. From sample similarity to ensemble similarity: Probabilistic distance measures in reproducing kernel Hilbert space. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(6):917–929, 2006.