Lecture 2: Mappings of Probabilities to RKHS and Applications
MLSS Tübingen, 2015
Arthur Gretton Gatsby Unit, CSML, UCL
Outline
– Kernel metric on the space of probability measures
– Function revealing differences in distributions
– Distance between means in space of features (RKHS)
– Independence measure: features of joint minus product of marginals
– Distributions on strings, images, graphs, groups (rotation matrices), semigroups, ...
– Testing on big data, kernel choice
– Energy distance/distance covariance: special case of kernel statistic
Feature mean difference
[Figure: two Gaussian densities P_X and Q_X with different means.]
Feature mean difference
[Figure: two Gaussian densities P_X and Q_X with different variances.]
Feature mean difference
[Figure: two Gaussian densities P_X and Q_X with different variances, and the densities of the feature X².]
Feature mean difference
[Figure: Gaussian and Laplace densities P_X and Q_X.]
Probabilities in feature space: the mean trick

The kernel trick:
– define the feature map φ_x ∈ F, φ_x = [... φ_i(x) ...] ∈ ℓ₂
– k(x, x′) = ⟨φ_x, φ_{x′}⟩_F
– f(x) = ⟨f, φ_x⟩_F

The mean trick:
– for a measure P on X, define the feature map µ_P ∈ F, µ_P = [... E_P[φ_i(x)] ...]
– E_{P,Q} k(x, y) = ⟨µ_P, µ_Q⟩_F for x ∼ P and y ∼ Q
– (mean/distribution embedding) E_P(f(x)) =: ⟨µ_P, f⟩_F
What does µ_P look like?

We plot the function µ_P, the element of the RKHS satisfying
  ⟨µ_P(·), f(·)⟩_F = E_P f(x).
What does it look like? Evaluate it at a point x:
  µ_P(x) = ⟨µ_P(·), φ(x)⟩_F = ⟨µ_P(·), k(·, x)⟩_F = E_{x′∼P} k(x′, x).   Expectation of the kernel!
Empirical estimate:
  µ̂_P(x) = (1/m) Σ_{i=1}^m k(x_i, x),   x_i ∼ P.

[Figure: histogram of samples from P, and the empirical embedding µ̂_P.]
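The empirical embedding is just an average of kernel functions centred at the sample points, so it is easy to compute and plot. A minimal sketch follows (not from the slides), assuming a Gaussian kernel; the bandwidth and the toy sample are illustrative choices.

```python
# Sketch: empirical mean embedding mu_hat_P(x) = (1/m) * sum_i k(x_i, x),
# with a Gaussian kernel (bandwidth sigma is an illustrative choice).
import numpy as np

def gauss_kernel(a, b, sigma=1.0):
    """Gaussian kernel k(a_i, b_j) for all pairs of entries of 1-d arrays a, b."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * sigma ** 2))

def empirical_mean_embedding(sample, query, sigma=1.0):
    """mu_hat_P at the query points: average of k(x_i, .) over the sample."""
    return gauss_kernel(sample, query, sigma).mean(axis=0)

rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=1.0, size=200)   # x_i ~ P (toy example)
query = np.linspace(-4, 4, 101)                     # grid at which to plot mu_hat_P
mu_hat = empirical_mean_embedding(sample, query)
print(mu_hat.max())
```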
Does the feature space mean exist?

Does there exist an element µ_P ∈ F such that
  E_P f(x) = E_P ⟨f(·), φ(x)⟩_F = ⟨f(·), E_P φ(x)⟩_F = ⟨f(·), µ_P(·)⟩_F   ∀f ∈ F ?
Yes: you can exchange expectation and inner product (i.e. φ(x) is Bochner integrable [Steinwart and Christmann, 2008]) under the condition
  E_P ‖φ(x)‖_F = E_P √(k(x, x)) < ∞.
Function Showing Difference in Distributions
[Figure: samples from P and Q.]

  MMD(P, Q; F) := sup_{f∈F} [E_P f(x) − E_Q f(y)]

[Figure: a smooth function taking large values on samples from P and small values on samples from Q.]
[Figure: a bounded continuous function taking large values on samples from P and small values on samples from Q.]
[Figure: witness function f for Gauss and Laplace densities.]

Classical function classes F for which MMD is a metric:
– F = bounded continuous [Dudley, 2002]
– F = bounded variation 1 (Kolmogorov metric) [Müller, 1997]
– F = bounded Lipschitz (Earth mover's distances) [Dudley, 2002]
Here: F = unit ball of an RKHS (coming soon!)
[ISMB06, NIPS06a, NIPS07b, NIPS08a, JMLR10]
How do smooth functions relate to feature maps?
Function view vs feature mean view

  MMD²(P, Q; F) = [ sup_{f∈F} (E_P f(x) − E_Q f(y)) ]²
                = [ sup_{f∈F} ⟨f, µ_P − µ_Q⟩_F ]²       use E_P(f(x)) =: ⟨µ_P, f⟩_F
                = ‖µ_P − µ_Q‖²_F                         use ‖θ‖_F = sup_{f∈F} ⟨f, θ⟩_F

Function view and feature view are equivalent.

[Figure: witness function f for Gauss and Laplace densities.]
Empirical estimate of MMD

Given samples {x_i}_{i=1}^m ∼ P and {y_i}_{i=1}^m ∼ Q, an unbiased empirical estimate is

  MMD̂² = 1/(m(m−1)) Σ_{i=1}^m Σ_{j≠i} [k(x_i, x_j) + k(y_i, y_j)] − 1/m² Σ_{i=1}^m Σ_{j=1}^m [k(y_i, x_j) + k(x_i, y_j)].

Derivation: expand the squared distance between the mean embeddings,

  ‖µ_P − µ_Q‖²_F = ⟨µ_P − µ_Q, µ_P − µ_Q⟩_F
                 = ⟨µ_P, µ_P⟩ + ⟨µ_Q, µ_Q⟩ − 2⟨µ_P, µ_Q⟩
                 = E_P[µ_P(x)] + ...
                 = E_P ⟨µ_P(·), k(x, ·)⟩ + ...
                 = E_P k(x, x′) + E_Q k(y, y′) − 2 E_{P,Q} k(x, y).

Then each population expectation is replaced by an unbiased empirical average, e.g.

  Ê k(x, x′) = 1/(m(m−1)) Σ_{i=1}^m Σ_{j≠i} k(x_i, x_j).
MMD for independence: HSIC

[NIPS07a, ALT07, ALT08, JMLR10]; related to [Feuerverger, 1993] and [Székely and Rizzo, 2009, Székely et al., 2007]

Dependence measure: the MMD between the embedding of the joint and the embedding of the product of the marginals,
  HSIC(P_XY, P_X P_Y) := ‖µ_{P_XY} − µ_{P_X P_Y}‖²

[Figure: schematic of paired observations (X, Y).]

HSIC using expectations of kernels: define an RKHS F on X with kernel k, and an RKHS G on Y with kernel l. Then
  HSIC(P_XY, P_X P_Y) = E_{XY} E_{X′Y′} k(x, x′) l(y, y′) + E_X E_{X′} k(x, x′) E_Y E_{Y′} l(y, y′) − 2 E_{X′Y′} [E_X k(x, x′) E_Y l(y, y′)]
HSIC: empirical estimate and intuition

[Figure: paired samples (x_i, y_i), e.g. images and accompanying text; dependence shows up as shared similarity structure between the two Gram matrices.]

Empirical HSIC(P_XY, P_X P_Y):
  HSIC_b = (1/n²) (HKH ◦ HLH)_{++} = (1/n²) trace(KHLH),
where K_ij = k(x_i, x_j), L_ij = l(y_i, y_j), H = I − (1/n) 1 1⊤ is the centring matrix, and (·)_{++} sums all entries of the matrix.
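A minimal sketch of the empirical statistic HSIC_b = (1/n²) trace(KHLH), assuming Gaussian kernels on X and Y; the bandwidths and the toy data are illustrative choices.

```python
# Sketch: biased empirical HSIC_b = (1/n^2) trace(K H L H) with Gaussian kernels.
import numpy as np

def gaussian_gram(A, sigma=1.0):
    sq = np.sum(A**2, 1)[:, None] + np.sum(A**2, 1)[None, :] - 2 * A @ A.T
    return np.exp(-sq / (2 * sigma**2))

def hsic_b(X, Y, sigma_x=1.0, sigma_y=1.0):
    n = X.shape[0]
    K = gaussian_gram(X, sigma_x)
    L = gaussian_gram(Y, sigma_y)
    H = np.eye(n) - np.ones((n, n)) / n      # centring matrix
    return np.trace(K @ H @ L @ H) / n**2

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 1))
Y = X + 0.5 * rng.normal(size=(300, 1))      # dependent toy data
print(hsic_b(X, Y))
```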
Characteristic kernels (Via Fourier, on the torus T)
Characteristic Kernels (via Fourier)

Reminder: a characteristic kernel is one for which MMD is a metric (MMD = 0 iff P = Q) [NIPS07b, JMLR10]. In the next slides: which kernels are characteristic?

Reminder: Fourier series
  f(x) = Σ_{ℓ=−∞}^{∞} f̂_ℓ exp(ıℓx) = Σ_{ℓ=−∞}^{∞} f̂_ℓ (cos(ℓx) + ı sin(ℓx)).

[Figure: top-hat function f(x) and its Fourier series coefficients f̂_ℓ.]
Reminder: Fourier series of a translation-invariant kernel k(x, y) = k(x − y) = k(z),
  k(z) = Σ_{ℓ=−∞}^{∞} k̂_ℓ exp(ıℓz).
E.g., the Gaussian-spectrum kernel on the torus,
  k(x) = (1/2π) ϑ(x/(2π), ıσ²/(2π)),    k̂_ℓ = (1/2π) exp(−σ²ℓ²/2),
where ϑ is the Jacobi theta function; k is close to a Gaussian when σ² is sufficiently narrower than [−π, π].

[Figure: the kernel k(x) and its Fourier series coefficients k̂_ℓ.]
Maximum mean embedding via Fourier series: let φ_{P,ℓ} denote the Fourier series coefficients (characteristic function) of P. Since the embedding is a convolution,
  µ_P(x) = E_{t∼P} k(x − t) = ∫_{−π}^{π} k(x − t) dP(t),
the convolution theorem gives
  µ̂_{P,ℓ} = k̂_ℓ × φ̄_{P,ℓ}.
Hence the witness µ_P − µ_Q has Fourier coefficients (φ̄_{P,ℓ} − φ̄_{Q,ℓ}) k̂_ℓ, and MMD(P, Q; F) = ‖µ_P − µ_Q‖_F.
A simpler Fourier expression for MMD

Recall the RKHS norm in terms of Fourier coefficients,
  ‖f‖²_F = ⟨f, f⟩_F = Σ_{ℓ=−∞}^{∞} |f̂_ℓ|² / k̂_ℓ.
Applying this to the witness µ_P − µ_Q, whose coefficients are (φ_{P,ℓ} − φ_{Q,ℓ}) k̂_ℓ,
  MMD²(P, Q; F) = Σ_{ℓ=−∞}^{∞} |(φ_{P,ℓ} − φ_{Q,ℓ}) k̂_ℓ|² / k̂_ℓ = Σ_{ℓ=−∞}^{∞} |φ_{P,ℓ} − φ_{Q,ℓ}|² k̂_ℓ.
Example

[Figure: two densities P(x) and Q(x) on the torus, their Fourier series coefficients φ_{P,ℓ} and φ_{Q,ℓ}, and the characteristic function difference |φ_{P,ℓ} − φ_{Q,ℓ}|.]
Example

Is the Gaussian-spectrum kernel characteristic? YES: all its Fourier coefficients k̂_ℓ are strictly positive, so
  MMD²(P, Q; F) := Σ_{ℓ=−∞}^{∞} |φ_{P,ℓ} − φ_{Q,ℓ}|² k̂_ℓ
vanishes only if φ_{P,ℓ} = φ_{Q,ℓ} for all ℓ, i.e. only if P = Q.

[Figure: the Gaussian-spectrum kernel and its (everywhere positive) Fourier series coefficients.]
Example

Is the triangle kernel characteristic? NO: some of its Fourier coefficients k̂_ℓ are zero, so two distributions that differ only at those frequencies give
  MMD²(P, Q; F) := Σ_{ℓ=−∞}^{∞} |φ_{P,ℓ} − φ_{Q,ℓ}|² k̂_ℓ = 0   even though P ≠ Q.

[Figure: the triangle kernel and its Fourier series coefficients, some of which vanish.]
Characteristic kernels (Via Fourier, on Rd)
Characteristic Kernels (via Fourier)

Characteristic function of P on R^d:
  φ_P(ω) = ∫_{R^d} e^{−i x⊤ω} dP(x).
Bochner's theorem: a continuous translation-invariant kernel can be written
  k(z) = ∫_{R^d} e^{−i z⊤ω} dΛ(ω),
– Λ a finite non-negative Borel measure.
The MMD in terms of characteristic functions:
  MMD²(P, Q; F) = ∫ |φ_P(ω) − φ_Q(ω)|² dΛ(ω),    φ_P the characteristic function of P.
Proof: using Bochner's theorem (a) and Fubini's theorem (b),
  MMD²(P, Q) = ∫∫_{R^d} k(x − y) d(P − Q)(x) d(P − Q)(y)
    (a) = ∫∫ [ ∫_{R^d} e^{−i(x−y)⊤ω} dΛ(ω) ] d(P − Q)(x) d(P − Q)(y)
    (b) = ∫ | ∫_{R^d} e^{−i x⊤ω} d(P − Q)(x) |² dΛ(ω)
        = ∫ |φ_P(ω) − φ_Q(ω)|² dΛ(ω).
Example

[Figure: densities P(X) and Q(X) on R, their characteristic function magnitudes |φ_P| and |φ_Q|, and the characteristic function difference |φ_P − φ_Q| as a function of frequency ω.]
Example

Which kernels detect the difference |φ_P − φ_Q|?
– Gaussian kernel: spectrum supported on all of R, so no difference can be hidden → Characteristic.
– Sinc kernel: spectrum supported on a bounded band, so differences outside that band are invisible → NOT characteristic.
– Triangle (B-spline) kernel: spectrum has isolated zeros only → Characteristic.

[Figure: the difference |φ_P − φ_Q| over frequency ω, compared with each kernel's spectrum.]
Summary: Characteristic Kernels

Characteristic kernel: MMD = 0 iff P = Q [NIPS07b, COLT08]

Main theorem: a translation-invariant k is characteristic for probability measures on R^d if and only if supp(Λ) = R^d (i.e. the spectrum vanishes on no open set, e.g. at most on a countable set of points). [COLT08, JMLR10]

Corollary: any continuous, compactly supported k is characteristic (since its Fourier spectrum Λ(ω) cannot be zero on an interval).

1-D proof sketch from [Mallat, 1999, Theorem 2.6]; proof on R^d via distribution theory in [Sriperumbudur et al., 2010, Corollary 10 p. 1535].
k characteristic iff supp(Λ) = Rd
Proof: supp{Λ} = R^d ⟹ k characteristic. Recall the Fourier definition of MMD:
  MMD²(P, Q) = ∫ |φ_P(ω) − φ_Q(ω)|² dΛ(ω).
Characteristic functions φ_P(ω) and φ_Q(ω) are uniformly continuous, hence their difference cannot be non-zero only on a countable set: if it is non-zero anywhere, it is non-zero on an open set, and Λ charges every open set, so the integral is positive.

Map φ_P uniformly continuous: ∀ε > 0, ∃δ > 0 such that ∀(ω₁, ω₂) ∈ Ω for which d(ω₁, ω₂) < δ, we have d(φ_P(ω₁), φ_P(ω₂)) < ε. Uniform: δ depends only on ε, not on ω₁, ω₂.
Proof: k characteristic ⟹ supp{Λ} = R^d. Proof by contrapositive: given supp{Λ} ⊊ R^d, there exists an open interval U on which Λ is zero. Construct densities p(x), q(x) such that φ_P, φ_Q differ only inside U; then MMD(P, Q) = 0 although P ≠ Q.
Further extensions
Characteristic kernels beyond R^d [Fukumizu et al., 2009]:
– Locally compact Abelian groups (periodic domains, as we saw)
– Compact, non-Abelian groups (orthogonal matrices)
– The semigroup R₊^n (histograms)
Related: other RKHS distances between distributions [Zhou and Chellappa, 2006] (not yet shown to establish whether P = Q), and energy distances (a special case of the kernel statistic, see the outline).
Statistical hypothesis testing
Motivating question: differences in brain signals. The problem: do local field potential (LFP) signals change when measured near a spike burst?

[Figure: LFP amplitude over time, near a spike burst and without a spike burst.]
Statistical test using MMD (1)

Two hypotheses:
– H0: null hypothesis (P = Q)
– H1: alternative hypothesis (P ≠ Q)

Compute the empirical MMD and compare with zero:
– "far from zero": reject H0
– "close to zero": accept H0
Statistical test using MMD (2)

Unbiased empirical estimate of MMD²:
  MMD̂²_u = 1/(n(n−1)) Σ_{i≠j} [k(x_i, x_j) − k(x_i, y_j) − k(y_i, x_j) + k(y_i, y_j)]

Asymptotics when P ≠ Q: the estimate is asymptotically normal,
  √n (MMD̂²_u − MMD²) →_D N(0, σ²_u)    [Hoeffding, 1948, Serfling, 1980]
with U-statistic variance
  σ²_u = 4 ( E_z[(E_{z′} h(z, z′))²] − [E_{z,z′} h(z, z′)]² ),   z := (x, y).
Statistical test using MMD (3)

[Figure: empirical PDF of the MMD estimate under H1 and a Gaussian fit, for two Laplace distributions P_X, Q_X with different variances.]
Statistical test using MMD (4)

Asymptotics under H0 (P = Q): the statistic is degenerate, and its null distribution is an infinite weighted sum of χ² variables,
  n MMD̂²_u(x, y; F) →_D Σ_{l=1}^∞ λ_l (z_l² − 2)
– z_l ∼ N(0, 2) i.i.d.
– λ_i are the eigenvalues of the centred kernel k̃(x, x′):
  ∫ k̃(x, x′) ψ_i(x) dP(x) = λ_i ψ_i(x′)

[Figure: MMD density under H0 (n × MMD²): empirical PDF vs the χ² sum.]
Statistical test using MMD (5)

In terms of the pooled Gram matrix blocks,
  MMD̂² = K̄_{P,P} + K̄_{Q,Q} − 2 K̄_{P,Q},
writing K̄_{·,·} for the average of the entries of the corresponding block.

[Figure: MMD density under H0 and H1 (n × MMD²), with the (1 − α) null quantile marked; the alternative mass below the threshold is the Type II error.]

How to get the test threshold (the (1 − α) null quantile)? One option: fit the null CDF with a Pearson curve by matching moments [Arcones and Giné, 1992, Alba Fernández et al., 2008].

[Figure: CDF of the MMD under H0 and the Pearson curve fit.]
Approximate null distribution of MMD via permutation

Write the pooled Gram matrix in blocks [K_{P,P}, K_{P,Q}; K_{Q,P}, K_{Q,Q}] and let w be a vector of ±1 labels.

Empirical MMD: w = (1, 1, 1, ..., 1, −1, ..., −1, −1, −1)⊤ (the first n entries label the x-sample, the last n the y-sample),
  MMD̂² = (1/n²) w⊤ [K_{P,P}  K_{P,Q}; K_{Q,P}  K_{Q,Q}] w.

Permuted case [Alba Fernández et al., 2008]: w = (1, −1, 1, ..., 1, −1, ..., 1, −1, −1)⊤ (equal number of +1 and −1, assigned at random),
  (1/n²) w⊤ [K_{P,P}  K_{P,Q}; K_{Q,P}  K_{Q,Q}] w = ?
This simulates a draw of the statistic under H0; repeating over many permutations gives the null distribution.

Figure thanks to Kacper Chwialkowski.
Approximate null distribution of MMD² via permutation:
  MMD̂²_perm ≈ (1/n²) w⊤ [K_{P,P}  K_{P,Q}; K_{Q,P}  K_{Q,Q}] w,   w a random ±1 labelling with equal counts.

[Figure: MMD density under H0 (n × MMD²): true null PDF vs null PDF from permutation.]
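A sketch of the permutation procedure above: recompute the statistic under random relabellings of the pooled sample, and take the (1 − α) empirical quantile as the test threshold. The biased block-sum form of MMD² is used; the kernel, α and the number of permutations are illustrative choices.

```python
# Sketch: permutation approximation to the null distribution of the (biased) MMD^2.
import numpy as np

def mmd2_biased_from_gram(K, n):
    """Biased MMD^2 from the pooled (2n x 2n) Gram matrix; first n rows = x-sample."""
    w = np.concatenate([np.ones(n), -np.ones(n)]) / n
    return w @ K @ w

def permutation_threshold(K, n, alpha=0.05, n_perm=500, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    stats = []
    for _ in range(n_perm):
        idx = rng.permutation(2 * n)                     # random relabelling
        stats.append(mmd2_biased_from_gram(K[np.ix_(idx, idx)], n))
    return np.quantile(stats, 1 - alpha)

# toy usage with a Gaussian kernel on the pooled data Z = [X; Y]
rng = np.random.default_rng(1)
X, Y = rng.normal(0, 1, (100, 1)), rng.normal(0.3, 1, (100, 1))
Z = np.vstack([X, Y])
sq = np.sum(Z**2, 1)[:, None] + np.sum(Z**2, 1)[None, :] - 2 * Z @ Z.T
K = np.exp(-sq / 2.0)
stat = mmd2_biased_from_gram(K, 100)
print(stat, stat > permutation_threshold(K, 100))        # True = reject H0
```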
Detecting differences in brain signals

Do local field potential (LFP) signals change when measured near a spike burst?

[Figure: LFP amplitude over time, near a spike burst and without a spike burst.]

Neuro data: consistent test w/o permutation
  MMD²(P, Q; F) := ‖µ_P − µ_Q‖²_F
Is the empirical MMD significantly > 0? Null distribution:
  n MMD̂² →_D Σ_{l=1}^∞ λ_l (z_l² − 2),
– λ_l is the l-th eigenvalue of the centred kernel k̃(x_i, x_j).
Use the Gram matrix spectrum for λ̂_l: consistent test without permutation.

[Figure: Type II error vs sample size m on the neuro data (P ≠ Q), comparing the spectral and permutation tests.]
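A sketch of the spectral approach: estimate the λ_l from the spectrum of the centred pooled Gram matrix and simulate the limiting law Σ_l λ_l (z_l² − 2). The eigenvalue normalisation shown (dividing by the pooled sample size) is a common recipe and should be treated as an assumption here, not as the slides' exact prescription.

```python
# Sketch: spectral approximation to the null distribution of n * MMD^2.
import numpy as np

def spectral_null_samples(K, n_samples=2000, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    m2 = K.shape[0]                                   # pooled sample size (2n)
    H = np.eye(m2) - np.ones((m2, m2)) / m2
    lam = np.linalg.eigvalsh(H @ K @ H) / m2          # estimated eigenvalues lambda_l
    lam = lam[lam > 1e-12]                            # drop numerically zero eigenvalues
    z = rng.normal(0.0, np.sqrt(2.0), size=(n_samples, lam.size))   # z_l ~ N(0, 2)
    return (z**2 - 2.0) @ lam                         # draws of sum_l lambda_l (z_l^2 - 2)

# usage: null_draws = spectral_null_samples(K); threshold = np.quantile(null_draws, 0.95)
# then compare n * MMD^2 against the threshold.
```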
Hypothesis testing with HSIC
Distribution of HSIC at independence

Empirical statistic: HSIC_b = (1/n²) trace(KHLH)
– Statistical testing: how do we find when this is large enough that the null hypothesis P = P_x P_y is unlikely?
– Formally: given P = P_x P_y, what is the threshold T such that P(HSIC_b > T) < α for small α?

Under H0 the scaled statistic converges to a weighted sum of χ² variables,
  n HSIC_b →_D Σ_{l=1}^∞ λ_l z_l²,    z_l ∼ N(0, 1) i.i.d.,
where the λ_l solve
  λ_l ψ_l(z_j) = ∫ h_{ijqr} ψ_l(z_i) dF_{i,q,r},
with the symmetrised kernel
  h_{ijqr} = (1/4!) Σ_{(t,u,v,w)}^{(i,j,q,r)} [k_{tu} l_{tu} + k_{tu} l_{vw} − 2 k_{tu} l_{tv}].

Moments (used below for a Gamma approximation to the null):
  E(HSIC_b) = (1/n) Tr(C_xx) Tr(C_yy),
  var(HSIC_b) = [2(n − 4)(n − 5) / (n)₄] ‖C_xx‖²_HS ‖C_yy‖²_HS + O(n⁻³),
where (n)₄ = n(n−1)(n−2)(n−3).
Statistical testing with HSIC

How do we find the threshold T such that P(HSIC_b > T) < α for small α?

Approach 1: permutation (a code sketch follows below)
– Compute HSIC for {x_i, y_{π(i)}}_{i=1}^n for a random permutation π of the indices {1, ..., n}. This gives HSIC for independent variables.
– Repeat for many different permutations, get the empirical CDF.
– Threshold T is the (1 − α) quantile of the empirical CDF.

Approach 2: Gamma approximation to the null distribution,
  n HSIC_b ∼ x^{α−1} e^{−x/β} / (β^α Γ(α)),   where α = (E(HSIC_b))² / var(HSIC_b),   β = n var(HSIC_b) / E(HSIC_b).
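A sketch of the permutation approach (Approach 1): recompute HSIC_b with the y-sample randomly permuted, and use the (1 − α) quantile of these values as the threshold. The estimator is passed in as a function (for example the hsic_b sketch shown earlier); α and the number of permutations are illustrative choices.

```python
# Sketch: permutation test for independence using an HSIC estimator hsic_fn(X, Y).
import numpy as np

def hsic_permutation_test(X, Y, hsic_fn, alpha=0.05, n_perm=500, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    stat = hsic_fn(X, Y)
    # permuting Y breaks the pairing, simulating the null P = Px * Py
    null = [hsic_fn(X, Y[rng.permutation(len(Y))]) for _ in range(n_perm)]
    threshold = np.quantile(null, 1 - alpha)
    return stat, threshold, stat > threshold     # True = reject independence
```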
Experiment: dependence testing for translation. Are the French text extracts translations of the English ones?

X1: Honourable senators, I have a question for the Leader of the Government in the Senate with regard to the support funding to farmers that has been announced. Most farmers have not received any money yet.

Y1: Honorables sénateurs, ma question s'adresse au leader du gouvernement au Sénat et concerne l'aide financière qu'on a annoncée pour les agriculteurs. La plupart des agriculteurs n'ont encore rien reçu de cet argent.

X2: No doubt there is great pressure on provincial and municipal governments in relation to the issue of child care, but the reality is that there have been no cuts to child care funding from the federal government to the provinces. In fact, we have increased federal investments for early childhood development.

Y2: Il est évident que les ordres de gouvernements provinciaux et municipaux subissent de fortes pressions en ce qui concerne les services de garde, mais le gouvernement n'a pas réduit le financement qu'il verse aux provinces pour les services de garde. Au contraire, nous avons augmenté le financement fédéral pour le développement des jeunes enfants.

· · ·
Experiment: dependence testing for translation

  HSIC_b = (1/n²) trace(KHLH)    [NIPS07b]

Canadian Hansard corpus (agriculture topic); k-spectrum kernel on strings, k = 10; repetitions = 300; sample size 10. The Gram matrix K on the English extracts and the Gram matrix L on the French extracts feed into HSIC.
Kernel two-sample tests for big data, optimal kernel choice
Quadratic time estimate of MMD

  MMD² = ‖µ_P − µ_Q‖²_F = E_P k(x, x′) + E_Q k(y, y′) − 2 E_{P,Q} k(x, y)

Given i.i.d. samples X := {x_1, ..., x_m} and Y := {y_1, ..., y_m} from P, Q respectively, the earlier estimate averages over all pairs (quadratic time), e.g.
  Ê_P k(x, x′) = 1/(m(m − 1)) Σ_{i=1}^m Σ_{j≠i} k(x_i, x_j).

New, linear time estimate: average over non-overlapping pairs,
  Ê_P k(x, x′) = (2/m) [k(x_1, x_2) + k(x_3, x_4) + ...] = (2/m) Σ_{i=1}^{m/2} k(x_{2i−1}, x_{2i}).
Linear time MMD

Shorter expression with explicit dependence on k:
  MMD² =: η_k(p, q) = E_{xx′yy′} h_k(x, x′, y, y′) =: E_v h_k(v),
where h_k(x, x′, y, y′) = k(x, x′) + k(y, y′) − k(x, y′) − k(x′, y), and v := [x, x′, y, y′].

The linear time estimate again:
  η̌_k = (2/m) Σ_{i=1}^{m/2} h_k(v_i),
where v_i := [x_{2i−1}, x_{2i}, y_{2i−1}, y_{2i}] and h_k(v_i) := k(x_{2i−1}, x_{2i}) + k(y_{2i−1}, y_{2i}) − k(x_{2i−1}, y_{2i}) − k(x_{2i}, y_{2i−1}).
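A minimal sketch of the linear-time estimate η̌_k, pairing consecutive observations; the Gaussian kernel, bandwidth and toy data are illustrative choices.

```python
# Sketch: linear-time MMD estimate eta_check = (2/m) * sum_i h_k(v_i).
import numpy as np

def gaussian_k(a, b, sigma=1.0):
    """Gaussian kernel evaluated row-wise on paired points a, b of shape (n, d)."""
    return np.exp(-np.sum((a - b) ** 2, axis=-1) / (2 * sigma**2))

def mmd2_linear(X, Y, sigma=1.0):
    m = (min(len(X), len(Y)) // 2) * 2           # use an even number of points
    x1, x2 = X[0:m:2], X[1:m:2]
    y1, y2 = Y[0:m:2], Y[1:m:2]
    h = (gaussian_k(x1, x2, sigma) + gaussian_k(y1, y2, sigma)
         - gaussian_k(x1, y2, sigma) - gaussian_k(x2, y1, sigma))
    return 2.0 * h.sum() / m                     # i.e. the mean of the m/2 terms h_k(v_i)

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, (10000, 1))
Y = rng.normal(0.05, 1.0, (10000, 1))
print(mmd2_linear(X, Y))
```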
Linear time vs quadratic time MMD

Disadvantages of the linear time MMD vs the quadratic time MMD:
– higher variance for a given sample size m, hence lower power.

Advantages of the linear time MMD vs the quadratic time MMD:
– simple asymptotic null distribution (Gaussian, vs an infinite weighted sum of χ²)
– O(m) computation and O(1) memory: the data need not be stored.
Asymptotics of linear time MMD

By the central limit theorem,
  m^{1/2} (η̌_k − η_k(p, q)) →_D N(0, 2σ²_k),
assuming E_v h²_k(v) < ∞ (true for bounded k), where
  σ²_k = E_v h²_k(v) − [E_v h_k(v)]².
Hypothesis test

Hypothesis test of asymptotic level α: reject H0 when η̌_k > t_{k,α}, where
  t_{k,α} = m^{−1/2} σ_k √2 Φ^{−1}(1 − α)
and Φ^{−1} is the inverse CDF of N(0, 1).

[Figure: null distribution of η̌_k (linear time), with the (1 − α) quantile t_{k,α} and Type I error marked; null vs alternative distribution of η̌_k, with the Type II error as the alternative mass below t_{k,α}.]
The best kernel: minimizes Type II error

A Type II error occurs when η̌_k falls below the threshold t_{k,α} although η_k(p, q) > 0. Asymptotically,
  P(η̌_k < t_{k,α}) = Φ( Φ^{−1}(1 − α) − η_k(p, q) √m / (σ_k √2) ).

Since Φ is monotonic, the best kernel choice to minimize the Type II error probability is
  k* = arg max_{k∈K} η_k(p, q) σ_k^{−1},
where K is the family of kernels under consideration.
Learning the best kernel in a family

Define the family of kernels as follows:
  K := { k = Σ_{u=1}^d β_u k_u : ‖β‖₁ = D, β_u ≥ 0, ∀u ∈ {1, ..., d} }
Properties: k is characteristic if at least one β_u > 0 (given characteristic base kernels k_u).

Test statistic

The squared MMD becomes
  η_k(p, q) = ‖µ_k(p) − µ_k(q)‖²_{F_k} = Σ_{u=1}^d β_u η_u(p, q),   where η_u(p, q) := E_v h_u(v).
Denote:
– h = (h_1, ..., h_d)⊤, with h_u(x, x′, y, y′) = k_u(x, x′) + k_u(y, y′) − k_u(x, y′) − k_u(x′, y)
– η = (η_1, ..., η_d)⊤

Quantities for the test:
  η_k(p, q) = E(β⊤h) = β⊤η,    σ²_k := β⊤ cov(h) β.
Optimization of the ratio η_k(p, q) σ_k^{−1}

Empirical test parameters:
  η̂_k = β⊤η̂,    σ̂_{k,λ} = ( β⊤(Q̂ + λ_m I) β )^{1/2},
where Q̂ is the empirical estimate of cov(h). Note: η̂_k, σ̂_{k,λ} are computed on training data, vs η̌_k, σ̌_k on the data to be tested (why? selecting the kernel on the data used for the test would bias the null distribution).

Objective:
  β̂* = arg max_{β⪰0} η̂_k σ̂_{k,λ}^{−1} = arg max_{β⪰0} β⊤η̂ ( β⊤(Q̂ + λ_m I) β )^{−1/2} =: α(β; η̂, Q̂)
Assume η̂ has at least one positive entry. Then there exists β ⪰ 0 such that α(β; η̂, Q̂) > 0, and thus α(β̂*; η̂, Q̂) > 0.

Solve the easier problem β̂* = arg max_{β⪰0} α²(β; η̂, Q̂), which is a quadratic program (a code sketch follows below):
  min { β⊤(Q̂ + λ_m I) β : β⊤η̂ = 1, β ⪰ 0 }.

What if η̂ has no positive entries? Then no β ⪰ 0 gives a positive ratio, and a kernel is chosen at random (see the test procedure below).
Test procedure

Step 1 (on the training data):
(a) Compute η̂_u for all k_u ∈ K
(b) If at least one η̂_u > 0, solve the QP to get β*; else choose a random kernel from K

Step 2 (on the test data):
(a) Compute η̌_{k*} using k* = Σ_{u=1}^d β*_u k_u
(b) Compute the test threshold ť_{α,k*} using σ̌_{k*}
(c) Reject H0 if η̌_{k*} > ť_{α,k*}
Convergence bounds

Assume a bounded kernel and σ_k bounded away from 0. If λ_m = Θ(m^{−1/3}), then the empirically selected ratio converges (in probability) to the best attainable one:
  sup_{k∈K} η̂_k σ̂_{k,λ}^{−1} − sup_{k∈K} η_k σ_k^{−1} → 0.

Idea:
  | sup_{k∈K} η̂_k σ̂_{k,λ}^{−1} − sup_{k∈K} η_k σ_k^{−1} |
    ≤ sup_{k∈K} | η̂_k σ̂_{k,λ}^{−1} − η_k σ_{k,λ}^{−1} | + sup_{k∈K} | η_k σ_{k,λ}^{−1} − η_k σ_k^{−1} |,
and the first term is controlled, up to a factor of order √d / (D √λ_m), by
  C₁ sup_{k∈K} |η̂_k − η_k| + C₂ sup_{k∈K} |σ̂_{k,λ} − σ_{k,λ}|.
Experiments

Competing approaches for kernel choice:
– maxmmd: choose the single base kernel with the largest η̂_u (the same as maximizing β⊤η̂ subject to ‖β‖₁ ≤ 1)
– l2: maximize β⊤η̂ subject to ‖β‖₂ ≤ 1
Also compare with:
– med: median heuristic for the kernel k
– xval, xvalc: cross-validation based choices
Blobs: data

Difficult problems: the lengthscale of the difference between the distributions is not the same as the lengthscale of the distributions themselves. We distinguish a field of Gaussian blobs with different covariances.

[Figure: blob data from p and from q; ε = 3.2 is the ratio of largest to smallest eigenvalues of the blobs in q.]
Blobs: results

[Figure: Type II error vs the eigenvalue ratio ε for max-ratio (optimize η_k(p, q) σ_k^{−1}), l2, maxmmd (maximize η_k(p, q) with a norm constraint on β), xval, xvalc, and med (median heuristic).]

Parameters: m = 10,000 (for training and test); ε is the ratio of largest to smallest eigenvalues of the blobs in q. Results are averages over 617 trials.
Feature selection: data

Idea: there is no single best kernel. Each of the k_u is univariate (along a single coordinate).

[Figure: selection data, p vs q in two coordinates.]

Feature selection: results

[Figure: Type II error vs feature selection dimension for max-ratio (linear combination), l2, and maxmmd (single best kernel); m = 10,000, average over 5000 trials.]
Amplitude modulated signals

Given an audio signal s(t), an amplitude modulated signal can be defined as
  u(t) = sin(ω_c t) [a s(t) + l],
with scaling a and offset l. Two amplitude modulated signals from the same artist (in this case, Magnetic Fields) give the samples from P and from Q.

Results: AM signals

[Figure: Type II error vs added noise for max-ratio, med, l2, maxmmd; m = 10,000 (for training and test), scaling a = 0.5, average over 4124 trials.]
Observations on kernel choice

Kernel choice matters for the two-sample test, especially when the distributions differ on a lengthscale different to that of the data.
Open directions:
– kernel choice for the quadratic time statistic
– avoiding the training/test split

Summary

Kernel distances between distributions handle:
– high dimensionality
– non-Euclidean data (strings, graphs)
– Nonparametric hypothesis tests
Characteristic kernels:
– Easy to check: does the spectrum cover R^d?
Co-authors
– Luca Baldassarre – Steffen Grünewälder – Guy Lever – Sam Patterson – Massimiliano Pontil – Dino Sejdinovic
– Karsten Borgwardt, MPI – Wicher Bergsma, LSE – Kenji Fukumizu, ISM – Zaid Harchaoui, INRIA – Bernhard Schoelkopf, MPI – Alex Smola, CMU/Google – Le Song, Georgia Tech – Bharath Sriperumbudur, Cambridge
Selected references
Characteristic kernels and mean embeddings:
– B. Sriperumbudur, A. Gretton, K. Fukumizu, G. Lanckriet, and B. Schölkopf (2010). Hilbert space embeddings and metrics on probability measures. JMLR.
Two-sample, independence, conditional independence tests:
Energy distance, relation to kernel distances
– D. Sejdinovic, B. Sriperumbudur, A. Gretton, and K. Fukumizu (2013). Equivalence of distance-based and RKHS-based statistics in hypothesis testing. Annals of Statistics.
Three way interaction
Selected references (continued)
Conditional mean embedding, RKHS-valued regression:
Estimation, NIPS.
Foundations of Computational Mathematics.
– Conditional mean embeddings as regressors. ICML.
Kernel Bayes rule:
kernel framework for nonparametric inference in graphical models. IEEE Signal Processing Magazine.
kernels, JMLR
Local departures from the null
What is a hard testing problem?
[Figure: samples from P and Q.]
Local departures from the null

What is a hard testing problem? Take departures from the null of the form
  f_Q = f_P + δ · g,
where g is some fixed function such that f_Q is a valid density.
– If δ ∼ m^{−1/2}, the Type II error approaches a constant.

More general local departures from the null

[Figure: density P(X) vs a sequence of perturbed densities Q(X).]

General characterization of local departures from H0, in terms of the distribution embedding: perturb µ_P by an RKHS function g_m with
  ‖g_m‖_F = c m^{−1/2}.
More general local departures from the null

[Figure: density P(X) vs a sequence of perturbed densities Q(X).]
Kernels vs kernels

Comparison with the L2 distance between Parzen window (kernel density) estimates [Anderson et al., 1994]:
  f̂_P(x) = (1/m) Σ_{i=1}^m κ(x_i − x),   where κ satisfies ∫ κ(x) dx = 1 and κ(x) ≥ 0.

The squared L2 distance between the two Parzen window estimates is an MMD with a convolved kernel:
  D₂(f̂_P, f̂_Q)² = ∫ [ (1/m) Σ_{i=1}^m κ(x_i − z) − (1/m) Σ_{i=1}^m κ(y_i − z) ]² dz
                 = (1/m²) Σ_{i,j} k(x_i − x_j) + (1/m²) Σ_{i,j} k(y_i − y_j) − (2/m²) Σ_{i,j} k(x_i − y_j),
  where k(x − y) = ∫ κ(x − z) κ(y − z) dz.

Difference: for the Parzen-window statistic the width h_m of κ must shrink with m, and the detectable local departure is then of size
  δ = m^{−1/2} h_m^{−d/2},   where h_m is the width of κ
(compare with the m^{−1/2} local departures discussed above).
Characteristic Kernels (via universality)
Characteristic: MMD a metric (MMD = 0 iff P = Q) [NIPS07b, COLT08]

Classical result: P = Q if and only if E_P(f(x)) = E_Q(f(y)) for all f ∈ C(X), the space of bounded continuous functions on X [Dudley, 2002].

Universal RKHS: k(x, x′) continuous, X compact, and F dense in C(X) with respect to L∞ [Steinwart, 2001].

If F is universal, then MMD{P, Q; F} = 0 iff P = Q.
Characteristic Kernels (via universality)

Proof: First, it is clear that P = Q implies MMD{P, Q; F} = 0.
Converse: by the universality of F, for any given ε > 0 and f ∈ C(X) there exists g ∈ F with
  ‖f − g‖_∞ ≤ ε.
We next make the expansion
  |E_P f(x) − E_Q f(y)| ≤ |E_P f(x) − E_P g(x)| + |E_P g(x) − E_Q g(y)| + |E_Q g(y) − E_Q f(y)|.
The first and third terms satisfy
  |E_P f(x) − E_P g(x)| ≤ E_P |f(x) − g(x)| ≤ ε.
Next, write
  E_P g(x) − E_Q g(y) = ⟨g(·), µ_P − µ_Q⟩_F = 0,
since MMD{P, Q; F} = 0 implies µ_P = µ_Q. Hence |E_P f(x) − E_Q f(y)| ≤ 2ε for all f ∈ C(X) and ε > 0, which implies P = Q.
References
V. Alba Fernández, M. Jiménez-Gamero, and J. Muñoz García. A test for the two-sample problem based on empirical characteristic functions. Computational Statistics and Data Analysis, 2008.
N. Anderson, P. Hall, and D. Titterington. Two-sample test statistics for measuring discrepancies between two multivariate probability density functions using kernel-based density estimates. Journal of Multivariate Analysis, 50:41–54, 1994.
R. M. Dudley. Real Analysis and Probability. Cambridge University Press, Cambridge, UK, 2002.
Andrey Feuerverger. A consistent test for bivariate dependence. International Statistical Review, 61(3):419–433, 1993.
K. Fukumizu, B. Sriperumbudur, A. Gretton, and B. Schölkopf. Characteristic kernels on groups and semigroups. In Advances in Neural Information Processing Systems 21, pages 473–480, Red Hook, NY, 2009. Curran Associates Inc.
Z. Harchaoui, F. Bach, and E. Moulines. Testing for homogeneity with kernel Fisher discriminant analysis. In Advances in Neural Information Processing Systems 20, pages 609–616. MIT Press, Cambridge, MA, 2008.
Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58:13–30, 1963.
Wassily Hoeffding. A class of statistics with asymptotically normal distribution. The Annals of Mathematical Statistics, 19(3):293–325, 1948.
Consistent Testing of Total Independence Based on the Empirical Characteristic Function. PhD thesis, University of Jyväskylä, 1995.
S. Mallat. A Wavelet Tour of Signal Processing. Academic Press, 2nd edition, 1999.
C. McDiarmid. On the method of bounded differences. In Surveys in Combinatorics, pages 148–188. Cambridge University Press, 1989.
R. Serfling. Approximation Theorems of Mathematical Statistics. Wiley, New York, 1980.
B. Sriperumbudur, A. Gretton, K. Fukumizu, G. Lanckriet, and B. Schölkopf. Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11:1517–1561, 2010.
I. Steinwart. On the influence of the kernel on the consistency of support vector machines. Journal of Machine Learning Research, 2:67–93, 2001.
Ingo Steinwart and Andreas Christmann. Support Vector Machines. Information Science and Statistics. Springer, 2008.
G. Székely and M. Rizzo. Brownian distance covariance. Annals of Applied Statistics, 4(3):1233–1303, 2009.
G. Székely, M. Rizzo, and N. Bakirov. Measuring and testing dependence by correlation of distances. Annals of Statistics, 35(6):2769–2794, 2007.
S. K. Zhou and R. Chellappa. Probabilistic distance measures in reproducing kernel Hilbert space. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(6):917–929, 2006.