Lecture 2: Mappings of Probabilities to RKHS and Applications
slide-1
SLIDE 1

Lecture 2: Mappings of Probabilities to RKHS and Applications

MLSS Cadiz, 2016

Arthur Gretton Gatsby Unit, CSML, UCL

slide-2
SLIDE 2

Outline

  • Kernel metric on the space of probability measures

– Function revealing differences in distributions
– Distance between means in space of features (RKHS)
– Independence measure: features of joint minus product of marginals

  • Characteristic kernels: feature space mappings of probabilities unique
  • Two-sample, independence tests for (almost!) any data type

– distributions on strings, images, graphs, groups (rotation matrices), semigroups, . . .

slide-3
SLIDE 3

Feature mean difference

  • Simple example: 2 Gaussians with different means
  • Answer: t-test
[Figure: two Gaussian probability densities PX and QX with different means]
slide-5
SLIDE 5

Feature mean difference

  • Two Gaussians with same means, different variance
  • Idea: look at difference in means of features of the RVs
  • In Gaussian case: second order features of form ϕ(x) = x2
[Figure: two Gaussian probability densities PX and QX with the same mean and different variances]

[Figure: densities of the feature X² under PX and QX]
slide-6
SLIDE 6

Feature mean difference

  • Gaussian and Laplace distributions
  • Same mean and same variance
  • Difference in means using higher order features...RKHS
[Figure: Gaussian and Laplace probability densities PX and QX over X]
slide-8
SLIDE 8

Probabilities in feature space: the mean trick

The reproducing property (kernel trick)

  • Given x ∈ X for some set X, define feature map ϕ(x) ∈ F,
    ϕ(x) = [. . . ϕi(x) . . .] ∈ ℓ²
  • For positive definite k(x, x′),
    k(x, x′) = ⟨ϕ(x), ϕ(x′)⟩F
  • The reproducing property:
    ∀f ∈ F, f(x) = ⟨f(·), ϕ(x)⟩F

The mean trick

  • Given P a Borel probability measure on X, define feature map µP ∈ F,
    µP = [. . . EP[ϕi(x)] . . .]
  • For positive definite k(x, x′),
    EP,Q k(x, y) = ⟨µP, µQ⟩F for x ∼ P and y ∼ Q
  • The mean trick (we call µP a mean/distribution embedding):
    EP(f(x)) = ⟨µP, f(·)⟩F
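The mean embedding has a direct empirical counterpart: replace the expectation EP[ϕi(x)] by a sample average, so that µ̂P(t) = (1/m) Σᵢ k(xᵢ, t). A minimal sketch in Python, assuming a 1-D Gaussian kernel with unit bandwidth (both are illustrative choices, not fixed by the slides):

```python
import numpy as np

def gauss_k(x, t, sigma=1.0):
    # Gaussian kernel k(x, t) = exp(-(x - t)^2 / (2 sigma^2)) for 1-D inputs
    return np.exp(-(x - t) ** 2 / (2 * sigma**2))

def empirical_mean_embedding(sample, sigma=1.0):
    # mu_hat_P(t) = (1/m) sum_i k(x_i, t): returned as a plain function of t
    sample = np.asarray(sample, dtype=float)
    return lambda t: gauss_k(sample[:, None],
                             np.asarray(t, dtype=float)[None, :],
                             sigma).mean(axis=0)

mu = empirical_mean_embedding([0.0, 0.0, 0.0])
# all sample points sit at 0 and k(0, 0) = 1, so mu([0.0]) = [1.0]
```

Evaluating µ̂P at any point t is then just an average of kernel evaluations, which is what makes the mean trick computable.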

slide-10
SLIDE 10

Does the feature space mean exist?

Does there exist an element µP ∈ F such that
EP f(x) = EP⟨f(·), ϕ(x)⟩F = ⟨f(·), EP ϕ(x)⟩F = ⟨f(·), µP(·)⟩F ∀f ∈ F?

Yes: you can exchange expectation and inner product (i.e. ϕ(x) is Bochner integrable [Steinwart and Christmann, 2008]) under the condition

EP ∥ϕ(x)∥F = EP √k(x, x) < ∞
slide-17
SLIDE 17

The maximum mean discrepancy

The maximum mean discrepancy is the distance between feature means:

MMD²(P, Q) = ∥µP − µQ∥²F
= ⟨µP, µP⟩F + ⟨µQ, µQ⟩F − 2⟨µP, µQ⟩F
= EP k(x, x′)  (a)  + EQ k(y, y′)  (a)  − 2 EP,Q k(x, y)  (b)

(a) = within-distrib. similarity, (b) = cross-distrib. similarity

Proof:

∥µP − µQ∥²F = ⟨µP − µQ, µP − µQ⟩F
= ⟨µP, µP⟩ + ⟨µQ, µQ⟩ − 2⟨µP, µQ⟩
= EP[µP(x)] + . . .
= EP⟨µP(·), k(x, ·)⟩ + . . .
= EP k(x, x′) + EQ k(y, y′) − 2 EP,Q k(x, y)

slide-18
SLIDE 18

The maximum mean discrepancy

The maximum mean discrepancy is the distance between feature means:

MMD²(P, Q) = ∥µP − µQ∥²F
= EP k(x, x′)  (a)  + EQ k(y, y′)  (a)  − 2 EP,Q k(x, y)  (b)

(a) = within-distrib. similarity, (b) = cross-distrib. similarity

Unbiased empirical estimate of the first term (quadratic time):

ÊP k(x, x′) = 1/(m(m − 1)) Σ_{i=1}^{m} Σ_{j≠i} k(xi, xj)
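The same diagonal-dropping trick applies to all three terms of MMD², giving the standard unbiased quadratic-time estimate. A sketch assuming a Gaussian kernel on d-dimensional samples (kernel choice and bandwidth are illustrative, not prescribed by the slides):

```python
import numpy as np

def rbf_gram(a, b, sigma=1.0):
    # Gaussian kernel matrix between the rows of a and b, shape (m, d) / (n, d)
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2 * sigma**2))

def mmd2_unbiased(x, y, sigma=1.0):
    # Unbiased MMD^2: the i = j terms are dropped from the two within-sample
    # sums, matching the 1/(m(m-1)) normalisation; the cross term keeps all pairs
    m, n = len(x), len(y)
    Kxx = rbf_gram(x, x, sigma)
    Kyy = rbf_gram(y, y, sigma)
    Kxy = rbf_gram(x, y, sigma)
    first = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
    second = (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
    return first + second - 2 * Kxy.mean()
```

For samples from the same distribution the estimate fluctuates around zero (it can be slightly negative, being unbiased); for well-separated distributions it is clearly positive.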

slide-21
SLIDE 21

The maximum mean discrepancy

[Figure: block Gram matrix on the pooled sample, with entries k(dogᵢ, dogⱼ), k(fishᵢ, fishⱼ), k(dogᵢ, fishⱼ)]

  • MMD² = K̄P,P + K̄Q,Q − 2K̄P,Q
    (averages of the block entries, diagonal terms removed from KP,P and KQ,Q)

slide-23
SLIDE 23

Function Showing Difference in Distributions

  • Are P and Q different?

[Figure: samples from P and Q]


slide-25
SLIDE 25

Function Showing Difference in Distributions

  • Maximum mean discrepancy: smooth function for P vs Q

MMD(P, Q; F) := sup_{f∈F} [EP f(x) − EQ f(y)] .

[Figure: samples from P and Q with a smooth witness function f(x)]


slide-27
SLIDE 27

Function Showing Difference in Distributions

  • What if the function is not smooth?

MMD(P, Q; F) := sup_{f∈F} [EP f(x) − EQ f(y)] .

[Figure: samples from P and Q with a bounded continuous (non-smooth) function f(x)]


slide-29
SLIDE 29

Function Showing Difference in Distributions

  • Maximum mean discrepancy: smooth function for P vs Q

MMD(P, Q; F) := sup_{f∈F} [EP f(x) − EQ f(y)] .

  • Gauss P vs Laplace Q

[Figure: witness f for Gauss and Laplace densities; probability densities of PX, QX and the witness f over X]

slide-32
SLIDE 32

Function Showing Difference in Distributions

  • Maximum mean discrepancy: smooth function for P vs Q

MMD(P, Q; F) := sup_{f∈F} [EP f(x) − EQ f(y)] .

  • Classical results: MMD(P, Q; F) = 0 iff P = Q, when

– F = bounded continuous [Dudley, 2002]
– F = bounded variation 1 (Kolmogorov metric) [Müller, 1997]
– F = bounded Lipschitz (Earth mover's distances) [Dudley, 2002]

  • MMD(P, Q; F) = 0 iff P = Q when F = the unit ball in a characteristic RKHS F (coming soon!)

[ISMB06, NIPS06a, NIPS07b, NIPS08a, JMLR10]

How do smooth functions relate to feature maps?

slide-33
SLIDE 33

Function view vs feature mean view

  • The (kernel) MMD: [ISMB06, NIPS06a]

MMD(P, Q; F) = sup_{f∈F} [EP f(x) − EQ f(y)]

[Figure: witness f for Gauss and Laplace densities; probability densities and f over X]
slide-36
SLIDE 36

Function view vs feature mean view

  • The (kernel) MMD: [ISMB06, NIPS06a]

MMD(P, Q; F) = sup_{f∈F} [EP f(x) − EQ f(y)]
= sup_{f∈F} ⟨f, µP − µQ⟩F      (use EP(f(x)) = ⟨µP, f⟩F)
= ∥µP − µQ∥F                   (use ∥θ∥F = sup_{f∈F} ⟨f, θ⟩F, since F := {f ∈ F : ∥f∥ ≤ 1})

Function view and feature view equivalent
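The witness attaining the sup is, up to normalisation, f ∝ µP − µQ, so its empirical version is f(t) = (1/m) Σᵢ k(xᵢ, t) − (1/n) Σⱼ k(yⱼ, t). A minimal sketch, again assuming a 1-D Gaussian kernel (an illustrative choice):

```python
import numpy as np

def gauss_k(a, t, sigma=1.0):
    # Gaussian kernel matrix between 1-D sample points a and evaluation points t
    return np.exp(-(a[:, None] - t[None, :]) ** 2 / (2 * sigma**2))

def witness(x, y, sigma=1.0):
    # Unnormalised witness: f(t) = mu_hat_P(t) - mu_hat_Q(t)
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return lambda t: (gauss_k(x, np.asarray(t, dtype=float), sigma).mean(axis=0)
                      - gauss_k(y, np.asarray(t, dtype=float), sigma).mean(axis=0))
```

Plotting this function over a grid reproduces the Gauss-vs-Laplace witness pictures: it is positive where P has more mass than Q and negative where Q dominates.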

slide-38
SLIDE 38

MMD for independence: HSIC

  • Dependence measure: the Hilbert-Schmidt Independence Criterion [ALT05, NIPS07a, ALT07, ALT08, JMLR10]
    Related to [Feuerverger, 1993] and [Székely and Rizzo, 2009, Székely et al., 2007]

HSIC(PXY, PX PY) := ∥µPXY − µPX PY∥²

[Figure: the HSIC kernel is a product of kernels on the two variables, κ((x, y), (x′, y′)) = k(x, x′) × l(y, y′)]

slide-39
SLIDE 39

MMD for independence: HSIC

  • Dependence measure: the Hilbert-Schmidt Independence Criterion [ALT05, NIPS07a, ALT07, ALT08, JMLR10]
    Related to [Feuerverger, 1993] and [Székely and Rizzo, 2009, Székely et al., 2007]

HSIC(PXY, PX PY) := ∥µPXY − µPX PY∥²

HSIC using expectations of kernels: define RKHS F on X with kernel k, and RKHS G on Y with kernel l. Then

HSIC(PXY, PX PY) = EXY EX′Y′ k(x, x′) l(y, y′) + EX EX′ k(x, x′) EY EY′ l(y, y′) − 2 EX′Y′ [EX k(x, x′) EY l(y, y′)] .
slide-42
SLIDE 42

HSIC: empirical estimate and intuition

[Figure: paired samples of text passages (X) and images (Y) illustrating dependence]

Empirical HSIC(PXY, PX PY):

HSICb = (1/n²) (HKH ◦ HLH)₊₊ = (1/n²) trace(KHLH)

(◦ is the entrywise product, (·)₊₊ sums all entries of a matrix, and H is the centering matrix)
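The centred trace form of the empirical estimate takes only a few lines: K and L are the Gram matrices on the x- and y-samples, and H = I − (1/n)11⊤ is the centering matrix (a sketch of the stated formula):

```python
import numpy as np

def hsic_b(K, L):
    # Biased empirical HSIC: (1/n^2) trace(K H L H), H the centering matrix
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / n**2
```

Centering means a constant kernel on one variable contributes nothing: if L is the all-ones matrix, HLH = 0 and HSICb = 0, as it should be for a feature carrying no information.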

slide-43
SLIDE 43

Characteristic kernels (Via Fourier, on the torus T)

slide-44
SLIDE 44

Characteristic Kernels (via Fourier)

Reminder: Characteristic: MMD a metric (MMD = 0 iff P = Q) [NIPS07b, JMLR10] In the next slides:

  • 1. Characteristic property on [−π, π] with periodic boundary
  • 2. Characteristic property on Rd
slide-45
SLIDE 45

Characteristic Kernels (via Fourier)

Reminder: Fourier series

  • Function on [−π, π] with periodic boundary:

f(x) = Σ_{ℓ=−∞}^{∞} f̂ℓ exp(iℓx) = Σ_{ℓ=−∞}^{∞} f̂ℓ (cos(ℓx) + i sin(ℓx)) .

[Figure: top-hat function f(x) and its Fourier series coefficients f̂ℓ]
slide-46
SLIDE 46

Characteristic Kernels (via Fourier)

Reminder: Fourier series of kernel k(x, y) = k(x − y) = k(z):

k(z) = Σ_{ℓ=−∞}^{∞} k̂ℓ exp(iℓz) .

E.g.,

k(x) = (1/(2π)) ϑ(x/(2π), iσ²/(2π)),    k̂ℓ = (1/(2π)) exp(−σ²ℓ²/2) ,

where ϑ is the Jacobi theta function, close to a Gaussian when σ² is sufficiently narrower than [−π, π].

[Figure: the kernel k(x) and its Fourier series coefficients k̂ℓ]
slide-48
SLIDE 48

Characteristic Kernels (via Fourier)

Maximum mean embedding via Fourier series:

  • Fourier series for P is the characteristic function φ̄P
  • Fourier series for the mean embedding is a product of Fourier series! (convolution theorem)

µP(x) = E_{t∼P} k(x − t) = ∫_{−π}^{π} k(x − t) dP(t),    µ̂P,ℓ = k̂ℓ × φ̄P,ℓ

  • MMD can be written in terms of Fourier series:

MMD(P, Q; F) = ∥ Σ_{ℓ=−∞}^{∞} [(φ̄P,ℓ − φ̄Q,ℓ) k̂ℓ] exp(iℓx) ∥F
slide-49
SLIDE 49

A simpler Fourier expression for MMD

  • From the previous slide,

MMD(P, Q; F) = ∥ Σ_{ℓ=−∞}^{∞} [(φ̄P,ℓ − φ̄Q,ℓ) k̂ℓ] exp(iℓx) ∥F

  • The squared norm of a function f in F is:

∥f∥²F = ⟨f, f⟩F = Σ_{ℓ=−∞}^{∞} |f̂ℓ|² / k̂ℓ .

  • Simple, interpretable expression for squared MMD:

MMD²(P, Q; F) = Σ_{ℓ=−∞}^{∞} [|φP,ℓ − φQ,ℓ| k̂ℓ]² / k̂ℓ = Σ_{ℓ=−∞}^{∞} |φP,ℓ − φQ,ℓ|² k̂ℓ

slide-52
SLIDE 52

Characteristic Kernels (2)

  • Example: P differs from Q at (roughly) one frequency

[Figure: densities P(x) and Q(x); their Fourier coefficients φP,ℓ and φQ,ℓ; and the characteristic function difference |φP,ℓ − φQ,ℓ|]

slide-54
SLIDE 54

Example

Is the Gaussian-spectrum kernel characteristic? YES

[Figure: the kernel k(x) and its Fourier series coefficients k̂ℓ, which are positive at every frequency]

MMD²(P, Q; F) = Σ_{ℓ=−∞}^{∞} |φP,ℓ − φQ,ℓ|² k̂ℓ

slide-56
SLIDE 56

Example

Is the triangle kernel characteristic? NO

[Figure: the triangle kernel f(x) and its Fourier series coefficients f̂ℓ, which vanish at some frequencies]

MMD²(P, Q; F) = Σ_{ℓ=−∞}^{∞} |φP,ℓ − φQ,ℓ|² k̂ℓ
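The spectral formula explains the YES/NO answers: MMD² sums the squared characteristic-function differences weighted by the kernel spectrum, so a difference confined to frequencies where k̂ℓ = 0 contributes nothing. A toy numeric check (the two spectra and the single-frequency difference are illustrative assumptions, not the slides' exact kernels):

```python
import numpy as np

ell = np.arange(-10, 11)                        # frequencies l
diff2 = (np.abs(ell) == 5).astype(float)        # |phi_P,l - phi_Q,l|^2: differ at |l| = 5 only
k_gauss = np.exp(-0.5 * ell**2 / 4.0)           # Gaussian-type spectrum: positive everywhere
k_tri = np.maximum(0.0, 1 - np.abs(ell) / 3)    # triangle-type spectrum: zero for |l| >= 3

mmd2_gauss = float(np.sum(diff2 * k_gauss))     # positive: the difference is detected
mmd2_tri = float(np.sum(diff2 * k_tri))         # exactly zero: the difference is invisible
```

With the positive spectrum the discrepancy registers; with the compactly supported spectrum the two distributions are indistinguishable by MMD, which is exactly the failure of the characteristic property.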

slide-57
SLIDE 57

Characteristic kernels (Via Fourier, on Rd)

slide-60
SLIDE 60

Characteristic Kernels (via Fourier)

  • Can we prove characteristic on Rᵈ?
  • Characteristic function of P via Fourier transform:

φP(ω) = ∫_{Rᵈ} exp(i x⊤ω) dP(x)

  • Translation invariant kernels: k(x, y) = k(x − y) = k(z)
  • Bochner's theorem:

k(z) = ∫_{Rᵈ} exp(−i z⊤ω) dΛ(ω)

– Λ a finite non-negative Borel measure

slide-62
SLIDE 62

Characteristic Kernels (via Fourier)

Fourier representation of MMD:

MMD²(P, Q; F) = ∫_{Rᵈ} |φP(ω) − φQ(ω)|² dΛ(ω),    φP the characteristic function of P

Proof: using Bochner's theorem (a) and Fubini's theorem (b),

MMD²(P, Q) := EP k(x − x′) + EQ k(y − y′) − 2 EP,Q k(x, y)
= ∫∫ k(s − t) d(P − Q)(s) d(P − Q)(t)
(a) = ∫∫ [ ∫_{Rᵈ} exp(−i(s − t)⊤ω) dΛ(ω) ] d(P − Q)(s) d(P − Q)(t)
(b) = ∫_{Rᵈ} [ ∫ exp(−i s⊤ω) d(P − Q)(s) ] [ ∫ exp(i t⊤ω) d(P − Q)(t) ] dΛ(ω)
= ∫_{Rᵈ} |φP(ω) − φQ(ω)|² dΛ(ω)
slide-65
SLIDE 65

Example

  • Example: P differs from Q at (roughly) one frequency

[Figure: densities P(X) and Q(X); their characteristic functions |φP| and |φQ|; and the characteristic function difference |φP − φQ| over frequency ω]
slide-66
SLIDE 66

Example

  • Example: P differs from Q at (roughly) one frequency
  • Exponentiated quadratic kernel: Characteristic

[Figure: the difference |φP − φQ| against frequency ω, together with the kernel spectrum]

slide-68
SLIDE 68

Example

  • Example: P differs from Q at (roughly) one frequency
  • Sinc kernel: NOT characteristic

[Figure: the difference |φP − φQ| against frequency ω, together with the kernel spectrum]

slide-70
SLIDE 70

Example

  • Example: P differs from Q at (roughly) one frequency
  • Triangle (B-spline) kernel: Characteristic

[Figure: the difference |φP − φQ| against frequency ω, together with the kernel spectrum]

slide-73
SLIDE 73

Summary: Characteristic Kernels

Characteristic kernel: (MMD = 0 iff P = Q) [NIPS07b, COLT08]

Main theorem: a translation invariant k is characteristic for prob. measures on Rᵈ if and only if supp(Λ) = Rᵈ (i.e. the spectrum may vanish on at most a countable set) [COLT08, JMLR10]

Corollary: a continuous, compactly supported k is characteristic (since its Fourier spectrum Λ(ω) cannot be zero on an interval).

1-D proof sketch from [Mallat, 1999, Theorem 2.6]; proof on Rᵈ via distribution theory in [Sriperumbudur et al., 2010, Corollary 10, p. 1535].

slide-74
SLIDE 74

k characteristic iff supp(Λ) = Rᵈ

Proof: supp{Λ} = Rᵈ ⇒ k characteristic. Recall the Fourier definition of MMD:

MMD²(P, Q) = ∫_{Rᵈ} |φP(ω) − φQ(ω)|² dΛ(ω).

The characteristic functions φP(ω) and φQ(ω) are uniformly continuous, hence their difference cannot be non-zero only on a countable set.

Map φP uniformly continuous: ∀ε > 0, ∃δ > 0 such that ∀(ω1, ω2) ∈ Ω for which d(ω1, ω2) < δ, we have d(φP(ω1), φP(ω2)) < ε. Uniform: δ depends only on ε, not on ω1, ω2.

slide-75
SLIDE 75

k characteristic iff supp(Λ) = Rᵈ

Proof: k characteristic ⇒ supp{Λ} = Rᵈ. Proof by contrapositive: given supp{Λ} ⊊ Rᵈ, there exists an open interval U on which Λ is zero. Construct densities p(x), q(x) such that φP, φQ differ only inside U.

slide-76
SLIDE 76

Further extensions

  • Similar reasoning wherever extensions of Bochner's theorem exist: [Fukumizu et al., 2009]

– Locally compact Abelian groups (periodic domains, as we saw)
– Compact, non-Abelian groups (orthogonal matrices)
– The semigroup R₊ⁿ (histograms)

  • Related kernel statistics: Fisher statistic [Harchaoui et al., 2008] (zero iff P = Q for characteristic kernels), other distances [Zhou and Chellappa, 2006] (not yet shown to establish whether P = Q), energy distances

slide-77
SLIDE 77

Statistical hypothesis testing

slide-78
SLIDE 78

Motivating question: differences in brain signals

The problem: do local field potential (LFP) signals change when measured near a spike burst?

[Figure: LFP amplitude over time, near a spike burst and without a spike burst]

slide-82
SLIDE 82

Statistical test using MMD (1)

  • Two hypotheses:

– H0: null hypothesis (P = Q)
– H1: alternative hypothesis (P ≠ Q)

  • Observe samples x := {x1, . . . , xn} from P and y from Q
  • If empirical MMD(x, y; F) is

– "far from zero": reject H0
– "close to zero": accept H0

slide-85
SLIDE 85

Statistical test using MMD (2)

  • "far from zero" vs "close to zero" - threshold?
  • One answer: asymptotic distribution of MMD²
  • An unbiased empirical estimate (quadratic cost):

MMD²̂ = 1/(n(n−1)) Σ_{i≠j} [ k(xi, xj) − k(xi, yj) − k(yi, xj) + k(yi, yj) ],
with h((xi, yi), (xj, yj)) the bracketed term.

  • When P ≠ Q, asymptotically normal:

√n (MMD²̂ − MMD²) ∼ N(0, σu²)   [Hoeffding, 1948, Serfling, 1980]

  • Expression for the variance, with zi := (xi, yi):

σu² = 4 [ Ez (Ez′ h(z, z′))² − (Ez,z′ h(z, z′))² ]

slide-86
SLIDE 86

Statistical test using MMD (3)

  • Example: Laplace distributions with different variance

[Figure: two Laplace densities PX and QX with different variances; empirical PDF of the MMD under H1 with a Gaussian fit]
slide-88
SLIDE 88

Statistical test using MMD (4)

  • When P = Q, the U-statistic is degenerate: Ez′[h(z, z′)] = 0 [Anderson et al., 1994]
  • The distribution is

n MMD²̂ ∼ Σ_{l=1}^{∞} λl (zl² − 2)

where

– zl ∼ N(0, 2) i.i.d.
– ∫_X k̃(x, x′) ψi(x) dP(x) = λi ψi(x′), with k̃ the centred kernel

[Figure: MMD density under H0; empirical PDF of n × MMD² against the χ² sum]
slide-89
SLIDE 89

Statistical test using MMD (5)

  • Given P = Q, want threshold T such that P(MMD > T) ≤ 0.05
  • MMD²̂ = K̄P,P + K̄Q,Q − 2K̄P,Q

[Figure: MMD density under H0 and H1; null and alternative distributions of n × MMD², the 1−α null quantile, and the Type II error region]
slide-92
SLIDE 92

Statistical test using MMD (5)

  • Given P = Q, want threshold T such that P(MMD > T) ≤ 0.05
  • Permutation for empirical CDF [Arcones and Giné, 1992, Alba Fernández et al., 2008]
  • Pearson curves by matching first four moments [Johnson et al., 1994]
  • Large deviation bounds [Hoeffding, 1963, McDiarmid, 1989]
  • Consistent test using kernel eigenspectrum [NIPS09b]

[Figure: CDF of the MMD and the Pearson fit]
slide-93
SLIDE 93

Approximate null distribution of MMD via permutation

Empirical MMD: take the sign vector

w = (1, 1, 1, . . . , 1, −1, −1, . . . , −1)⊤   (n entries +1 followed by n entries −1).

Then

(1/n²) ( [ KP,P  KP,Q ; KQ,P  KQ,Q ] ⊙ ww⊤ )₊₊ = MMD²̂ ,

where ⊙ is the entrywise product and (·)₊₊ sums all entries of a matrix.

slide-95
SLIDE 95

Approximate null distribution of MMD via permutation

Permuted case: [Alba Fernández et al., 2008]

w = (1, −1, 1, . . . , 1, −1, . . . , 1, −1, −1)⊤   (equal number of +1 and −1),

(1/n²) ( [ KP,P  KP,Q ; KQ,P  KQ,Q ] ⊙ ww⊤ )₊₊ = ?

[Figure: Gram matrix with a permuted sign pattern]

Figure thanks to Kacper Chwialkowski.

slide-96
SLIDE 96

Approximate null distribution of MMD² via permutation

MMD²ₚ ≈ (1/n²) ( [ KP,P  KP,Q ; KQ,P  KQ,Q ] ⊙ ww⊤ )₊₊

[Figure: MMD density under H0; the null PDF against the null PDF estimated by permutation]
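The sign-vector picture translates directly into code: the biased MMD²̂ is (1/n²) w⊤Kw on the pooled Gram matrix, and permuting the entries of w draws from the null. A sketch, assuming a 1-D Gaussian kernel (kernel and bandwidth are illustrative choices):

```python
import numpy as np

def rbf_gram(a, sigma=1.0):
    # Full Gram matrix on the pooled 1-D sample
    d2 = (a[:, None] - a[None, :]) ** 2
    return np.exp(-d2 / (2 * sigma**2))

def mmd2_from_w(K, w, n):
    # (1/n^2) w^T K w: biased MMD^2 when w = (+1,...,+1,-1,...,-1)
    return float(w @ K @ w) / n**2

def permutation_threshold(K, n, n_perm=500, alpha=0.05, seed=0):
    # 1 - alpha quantile of the null sample obtained by permuting the signs
    rng = np.random.default_rng(seed)
    w0 = np.concatenate([np.ones(n), -np.ones(n)])
    null = [mmd2_from_w(K, rng.permutation(w0), n) for _ in range(n_perm)]
    return float(np.quantile(null, 1 - alpha))
```

For well-separated samples the observed statistic with the ordered sign vector far exceeds the permutation threshold, so the test rejects H0.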
slide-97
SLIDE 97

Detecting differences in brain signals

Do local field potential (LFP) signals change when measured near a spike burst?

[Figure: LFP amplitude over time, near a spike burst and without a spike burst]

slide-98
SLIDE 98

Neuro data: consistent test w/o permutation

  • Maximum mean discrepancy (MMD): distance between P and Q

MMD²(P, Q; F) := ∥µP − µQ∥²F

  • Is MMD²̂ significantly > 0?
  • When P = Q, null distribution of MMD²̂:

n MMD²̂ →_D Σ_{l=1}^{∞} λl (zl² − 2),

– λl is the l-th eigenvalue of the centred kernel k̃(xi, xj)

[Figure: Type II error against sample size m for P ≠ Q (neuro), spectral vs permutation tests]

Use the Gram matrix spectrum for λ̂l: consistent test without permutation

slide-99
SLIDE 99

Hypothesis testing with HSIC

slide-100
SLIDE 100

Distribution of HSIC at independence

  • (Biased) empirical HSIC, a V-statistic:

HSICb = (1/n²) trace(KHLH)

– Statistical testing: how do we find when this is large enough that the null hypothesis P = Px Py is unlikely?
– Formally: given P = Px Py, what is the threshold T such that P(HSIC > T) < α for small α?

slide-102
SLIDE 102

Distribution of HSIC at independence

  • (Biased) empirical HSIC, a V-statistic:

HSICb = (1/n²) trace(KHLH)

  • Associated U-statistic degenerate when P = Px Py [Serfling, 1980]:

n HSICb →_D Σ_{l=1}^{∞} λl zl²,    zl ∼ N(0, 1) i.i.d.,

where

λl ψl(zj) = ∫ hijqr ψl(zi) dFi,q,r,
hijqr = (1/4!) Σ_{(t,u,v,w)}^{(i,j,q,r)} [ ktu ltu + ktu lvw − 2 ktu ltv ]

  • First two moments [NIPS07b]:

E(HSICb) = (1/n) Tr Cxx Tr Cyy
var(HSICb) = (2(n − 4)(n − 5) / (n)₄) ∥Cxx∥²HS ∥Cyy∥²HS + O(n⁻³).

slide-104
SLIDE 104

Statistical testing with HSIC

  • Given P = PxPy, what is the threshold T such that P(HSIC > T) < α for small α?
  • Null distribution via permutation [Feuerverger, 1993]:

– Compute HSIC for {xi, yπ(i)}, i = 1, …, n, for a random permutation π of the indices {1, …, n}. This gives HSIC for independent variables.
– Repeat for many different permutations to obtain an empirical CDF.
– The threshold T is the (1 − α) quantile of this empirical CDF.

  • Approximate null distribution via moment matching to a Gamma density [Kankainen, 1995]:

n·HSICb(Z) ∼ x^(α−1) e^(−x/β) / (β^α Γ(α)), where α = (E(HSICb))² / var(HSICb), β = n·var(HSICb) / E(HSICb)
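The permutation procedure above can be sketched directly in numpy; the Gaussian kernel, permutation count, and data are illustrative choices:

```python
import numpy as np

def gaussian_gram(X, sigma=1.0):
    sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
    return np.exp(-sq / (2 * sigma**2))

def hsic_b(K, L):
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / n**2

def hsic_permutation_test(K, L, alpha=0.05, n_perm=200, seed=1):
    """Threshold from the permutation null: HSIC over {x_i, y_pi(i)}."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    stat = hsic_b(K, L)
    null = []
    for _ in range(n_perm):
        p = rng.permutation(n)
        # permuting rows and columns of L = permuting the y sample
        null.append(hsic_b(K, L[np.ix_(p, p)]))
    threshold = np.quantile(null, 1 - alpha)  # (1 - alpha) quantile of null CDF
    return stat, threshold, stat > threshold

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 1))
y = x + 0.25 * rng.normal(size=(100, 1))      # dependent pair
stat, thr, reject = hsic_permutation_test(gaussian_gram(x), gaussian_gram(y))
print(reject)  # dependence detected
```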

slide-105
SLIDE 105

Experiment: dependence testing for translation

Are the French text extracts translations of the English?

X1: Honourable senators, I have a question for the Leader of the Government in the Senate with regard to the support funding to farmers that has been announced. Most farmers have not received any money yet.

Y1: Honorables sénateurs, ma question s'adresse au leader du gouvernement au Sénat et concerne l'aide financière qu'on a annoncée pour les agriculteurs. La plupart des agriculteurs n'ont encore rien reçu de cet argent.

X2: No doubt there is great pressure on provincial and municipal governments in relation to the issue of child care, but the reality is that there have been no cuts to child care funding from the federal government to the provinces. In fact, we have increased federal investments for early childhood development.

Y2: Il est évident que les ordres de gouvernements provinciaux et municipaux subissent de fortes pressions en ce qui concerne les services de garde, mais le gouvernement n'a pas réduit le financement qu'il verse aux provinces pour les services de garde. Au contraire, nous avons augmenté le financement fédéral pour le développement des jeunes enfants.

X ⇐?⇒ Y: are the paired extracts dependent?

slide-107
SLIDE 107

Experiment: dependence testing for translation

  • (Biased) empirical HSIC:

HSICb = (1/n²) trace(KHLH)

  • Translation example [NIPS07b]: Canadian Hansard (agriculture)
  • 5-line extracts, k-spectrum kernel with k = 10, 300 repetitions, sample size 10

[Diagram: English extracts → Gram matrix K, French extracts → Gram matrix L, combined in HSIC]

  • k-spectrum kernel: average Type II error 0 (α = 0.05)
  • Bag-of-words kernel: average Type II error 0.18
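A k-spectrum kernel counts shared length-k substrings between two strings. A minimal plain-Python sketch, using a small k for readability (the experiment above used k = 10):

```python
from collections import Counter

def spectrum_features(s, k):
    """Counts of all length-k contiguous substrings of s."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))

def k_spectrum_kernel(s, t, k=3):
    """Inner product of the substring-count feature vectors of s and t."""
    fs, ft = spectrum_features(s, k), spectrum_features(t, k)
    return sum(c * ft[sub] for sub, c in fs.items())

print(k_spectrum_kernel("the farmers", "the farm", k=3))  # → 6
```

The six shared 3-grams here are "the", "he ", "e f", " fa", "far", "arm", each occurring once in both strings.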
slide-108
SLIDE 108

Summary

  • MMD: a distance between distributions [ISMB06, NIPS06a, JMLR10, JMLR12a]

– high dimensionality
– non-Euclidean data (strings, graphs)
– nonparametric hypothesis tests

  • Measure and test independence [ALT05, NIPS07a, NIPS07b, ALT08, JMLR10, JMLR12a]
  • Characteristic RKHS: MMD a metric [NIPS07b, COLT08, NIPS08a]

– Easy to check: does the kernel spectrum cover ℝ^d?

slide-109
SLIDE 109

Co-authors

  • From UCL:

– Luca Baldassarre
– Steffen Grunewalder
– Guy Lever
– Sam Patterson
– Massimiliano Pontil
– Dino Sejdinovic

  • External:

– Karsten Borgwardt, MPI
– Wicher Bergsma, LSE
– Kenji Fukumizu, ISM
– Zaid Harchaoui, INRIA
– Bernhard Schoelkopf, MPI
– Alex Smola, CMU/Google
– Le Song, Georgia Tech
– Bharath Sriperumbudur, Cambridge

slide-110
SLIDE 110

Selected references

Characteristic kernels and mean embeddings:

  • Smola, A., Gretton, A., Song, L., Schoelkopf, B. (2007). A Hilbert space embedding for distributions. ALT.
  • Sriperumbudur, B., Gretton, A., Fukumizu, K., Schoelkopf, B., Lanckriet, G. (2010). Hilbert space embeddings and metrics on probability measures. JMLR.
  • Gretton, A., Borgwardt, K., Rasch, M., Schoelkopf, B., Smola, A. (2012). A kernel two-sample test. JMLR.

Two-sample, independence, conditional independence tests:

  • Gretton, A., Fukumizu, K., Teo, C., Song, L., Schoelkopf, B., Smola, A. (2008). A kernel statistical test of independence. NIPS.
  • Fukumizu, K., Gretton, A., Sun, X., Schoelkopf, B. (2008). Kernel measures of conditional dependence. NIPS.
  • Gretton, A., Fukumizu, K., Harchaoui, Z., Sriperumbudur, B. (2009). A fast, consistent kernel two-sample test. NIPS.
  • Gretton, A., Borgwardt, K., Rasch, M., Schoelkopf, B., Smola, A. (2012). A kernel two-sample test. JMLR.

Energy distance, relation to kernel distances

  • Sejdinovic, D., Sriperumbudur, B., Gretton, A.,, Fukumizu, K., (2013). Equivalence of distance-based and

rkhs-based statistics in hypothesis testing. Annals of Statistics.

Three way interaction

  • Sejdinovic, D., Gretton, A., and Bergsma, W. (2013). A Kernel Test for Three-Variable Interactions. NIPS.
slide-111
SLIDE 111

Selected references (continued)

Conditional mean embedding, RKHS-valued regression:

  • Weston, J., Chapelle, O., Elisseeff, A., Schölkopf, B., Vapnik, V. (2003). Kernel Dependency Estimation. NIPS.
  • Micchelli, C., Pontil, M. (2005). On Learning Vector-Valued Functions. Neural Computation.
  • Caponnetto, A., De Vito, E. (2007). Optimal Rates for the Regularized Least-Squares Algorithm. Foundations of Computational Mathematics.
  • Song, L., Huang, J., Smola, A., Fukumizu, K. (2009). Hilbert Space Embeddings of Conditional Distributions. ICML.
  • Grunewalder, S., Lever, G., Baldassarre, L., Patterson, S., Gretton, A., Pontil, M. (2012). Conditional mean embeddings as regressors. ICML.
  • Grunewalder, S., Gretton, A., Shawe-Taylor, J. (2013). Smooth operators. ICML.

Kernel Bayes' rule:

  • Song, L., Fukumizu, K., Gretton, A. (2013). Kernel embeddings of conditional distributions: A unified kernel framework for nonparametric inference in graphical models. IEEE Signal Processing Magazine.
  • Fukumizu, K., Song, L., Gretton, A. (2013). Kernel Bayes' rule: Bayesian inference with positive definite kernels. JMLR.

slide-114
SLIDE 114

Local departures from the null

What is a hard testing problem?

  • First version: for fixed m, "closer" P and Q have a higher Type II error

[Figure: two pairs of sample sets from P and Q, one pair well separated, one pair nearly overlapping]

slide-116
SLIDE 116

Local departures from the null

What is a hard testing problem?

  • As m increases, distinguish "closer" P and Q with fixed Type II error
  • Example: fP and fQ are probability densities, with fQ = fP + δg, where δ ∈ ℝ and g is some fixed function such that fQ is a valid density

– If δ ∼ m^(−1/2), the Type II error approaches a constant

slide-117
SLIDE 117

More general local departures from null

  • Example: fP and fQ are probability densities, with fQ = fP + δg, where δ ∈ ℝ and g is some fixed function such that fQ is a valid density

[Figure: density P(X) vs. three candidate densities Q(X), each an increasingly local perturbation of P]

slide-119
SLIDE 119

Local departures from the null

What is a hard testing problem?

  • As we see more samples m, distinguish "closer" P and Q with the same Type II error
  • Example: fP and fQ are probability densities, with fQ = fP + δg, where δ ∈ ℝ and g is some fixed function such that fQ is a valid density

– If δ ∼ m^(−1/2), the Type II error approaches a constant

  • ...but other choices are also possible – how to characterize them all?

General characterization of local departures from H0:

  • Write µQ = µP + gm, where gm ∈ F is chosen such that µP + gm is a valid distribution embedding
  • Minimum distinguishable distance [JMLR12]: ‖gm‖F = c·m^(−1/2)
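The construction fQ = fP + δg can be sketched numerically. The sinusoidal g and the constant c = 0.5 below are illustrative assumptions, chosen only so that fQ remains a valid density; δ = c·m^(−1/2) reproduces the hard local-alternative scaling:

```python
import numpy as np

# f_P: standard normal density. g: a zero-integral "wiggle", so that
# f_Q = f_P + delta * g still integrates to one and stays nonnegative
# for small delta.
x = np.linspace(-6.0, 6.0, 2001)
dx = x[1] - x[0]
f_p = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
g = np.sin(4 * x) * f_p          # odd function: integrates to zero

for m in [100, 400, 1600]:
    delta = 0.5 * m ** -0.5      # departure shrinks as the sample grows
    f_q = f_p + delta * g
    mass = f_q.sum() * dx        # Riemann approximation of the integral
    print(m, round(mass, 4), bool(f_q.min() >= 0))
```

As m grows, fQ stays a valid density while the perturbation δg shrinks toward fP at the m^(−1/2) rate.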

slide-120
SLIDE 120

More general local departures from null

  • A more advanced example of a local departure from the null
  • Recall: µQ = µP + gm, with ‖gm‖F = c·m^(−1/2)

[Figure: density P(X) vs. three candidate densities Q(X), with increasingly high-frequency perturbations]

slide-123
SLIDE 123

Kernels vs kernels

  • How does MMD relate to a Parzen window density estimate? [Anderson et al., 1994]

f̂P(x) = (1/m) ∑_{i=1}^m κ(xi − x), where κ satisfies ∫_X κ(x) dx = 1 and κ(x) ≥ 0

  • L2 distance between Parzen density estimates:

D_{L2}(f̂P, f̂Q)² = ∫ [ (1/m) ∑_{i=1}^m κ(xi − z) − (1/m) ∑_{i=1}^m κ(yi − z) ]² dz
= (1/m²) ∑_{i,j=1}^m k(xi − xj) + (1/m²) ∑_{i,j=1}^m k(yi − yj) − (2/m²) ∑_{i,j=1}^m k(xi − yj),

where k(x − y) = ∫ κ(x − z) κ(y − z) dz

  • For fQ = fP + δg, the minimum distance to discriminate fP from fQ is δ = m^(−1/2) h_m^(−d/2), where h_m is the width of κ
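The identity above can be checked numerically: with a Gaussian Parzen window κ of width h, the induced kernel k = κ ∗ κ is itself Gaussian with variance 2h². A numpy sketch, where the sample sizes, bandwidth, and integration grid are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
m, h = 40, 0.5
X = rng.normal(0.0, 1.0, m)
Y = rng.normal(1.0, 1.0, m)

def parzen(samples, z):
    """Parzen estimate with a Gaussian window kappa of width h."""
    d2 = (z[:, None] - samples[None, :]) ** 2
    return np.exp(-d2 / (2 * h**2)).mean(axis=1) / np.sqrt(2 * np.pi * h**2)

# Numerical L2 distance between the two Parzen estimates
z = np.linspace(-9.0, 10.0, 4001)
dz = z[1] - z[0]
l2_sq = np.sum((parzen(X, z) - parzen(Y, z)) ** 2) * dz

# Closed form: k = kappa * kappa (convolution) is Gaussian with variance 2 h^2
def k(a, b):
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-d2 / (4 * h**2)) / np.sqrt(4 * np.pi * h**2)

closed_form = k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()
print(l2_sq, closed_form)  # the two agree to quadrature accuracy
```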


slide-127
SLIDE 127

Characteristic Kernels (via universality)

Characteristic: MMD a metric (MMD = 0 iff P = Q) [NIPS07b, COLT08]

Classical result: P = Q if and only if EP(f(x)) = EQ(f(y)) for all f ∈ C(X), the space of bounded continuous functions on X [Dudley, 2002]

Universal RKHS: k(x, x′) continuous, X compact, and F dense in C(X) with respect to the L∞ norm [Steinwart, 2001]

If F is universal, then MMD{P, Q; F} = 0 iff P = Q


slide-129
SLIDE 129

Characteristic Kernels (via universality)

Proof: First, it is clear that P = Q implies MMD{P, Q; F} = 0.

Converse: by the universality of F, for any given ε > 0 and f ∈ C(X) there exists g ∈ F with ‖f − g‖∞ ≤ ε. We next make the expansion

|EPf(x) − EQf(y)| ≤ |EPf(x) − EPg(x)| + |EPg(x) − EQg(y)| + |EQg(y) − EQf(y)|.

The first and third terms satisfy |EPf(x) − EPg(x)| ≤ EP|f(x) − g(x)| ≤ ε.

slide-130
SLIDE 130

Characteristic Kernels (via universality)

Proof (continued): Next, write EPg(x) − EQg(y) = ⟨g(·), µP − µQ⟩F = 0, since MMD{P, Q; F} = 0 implies µP = µQ. Hence |EPf(x) − EQf(y)| ≤ 2ε for all f ∈ C(X) and ε > 0, which implies P = Q.

slide-131
SLIDE 131

References

  • V. Alba Fernández, M. Jiménez-Gamero, and J. Muñoz Garcia. A test for the two-sample problem based on empirical characteristic functions. Comput. Stat. Data An., 52:3730–3748, 2008.
  • N. Anderson, P. Hall, and D. Titterington. Two-sample test statistics for measuring discrepancies between two multivariate probability density functions using kernel-based density estimates. Journal of Multivariate Analysis, 50:41–54, 1994.
  • M. Arcones and E. Giné. On the bootstrap of u and v statistics. The Annals of Statistics, 20(2):655–674, 1992.
  • R. M. Dudley. Real analysis and probability. Cambridge University Press, Cambridge, UK, 2002.
  • A. Feuerverger. A consistent test for bivariate dependence. International Statistical Review, 61(3):419–433, 1993.
  • K. Fukumizu, B. Sriperumbudur, A. Gretton, and B. Schoelkopf. Characteristic kernels on groups and semigroups. In Advances in Neural Information Processing Systems 21, pages 473–480, Red Hook, NY, 2009. Curran Associates Inc.
  • Z. Harchaoui, F. Bach, and E. Moulines. Testing for homogeneity with kernel Fisher discriminant analysis. In Advances in Neural Information Processing Systems 20, pages 609–616. MIT Press, Cambridge, MA, 2008.
  • W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58:13–30, 1963.
  • W. Hoeffding. A class of statistics with asymptotically normal distribution. The Annals of Mathematical Statistics, 19(3):293–325, 1948.
  • N. L. Johnson, S. Kotz, and N. Balakrishnan. Continuous Univariate Distributions. Volume 1. John Wiley and Sons, 2nd edition, 1994.
  • A. Kankainen. Consistent Testing of Total Independence Based on the Empirical Characteristic Function. PhD thesis, University of Jyväskylä, 1995.
  • S. Mallat. A Wavelet Tour of Signal Processing. Academic Press, 2nd edition, 1999.
  • C. McDiarmid. On the method of bounded differences. In Survey in Combinatorics, pages 148–188. Cambridge University Press, 1989.

slide-132
SLIDE 132
  • A. Müller. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29(2):429–443, 1997.
  • R. Serfling. Approximation Theorems of Mathematical Statistics. Wiley, New York, 1980.
  • B. Sriperumbudur, A. Gretton, K. Fukumizu, G. Lanckriet, and B. Schölkopf. Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11:1517–1561, 2010.
  • I. Steinwart. On the influence of the kernel on the consistency of support vector machines. Journal of Machine Learning Research, 2:67–93, 2001.
  • I. Steinwart and A. Christmann. Support Vector Machines. Information Science and Statistics. Springer, 2008.
  • G. Székely and M. Rizzo. Brownian distance covariance. Annals of Applied Statistics, 4(3):1233–1303, 2009.
  • G. Székely, M. Rizzo, and N. Bakirov. Measuring and testing dependence by correlation of distances. Ann. Stat., 35(6):2769–2794, 2007.
  • S. K. Zhou and R. Chellappa. From sample similarity to ensemble similarity: Probabilistic distance measures in reproducing kernel Hilbert space. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(6):917–929, 2006.