Lecture 2: Mappings of Probabilities to RKHS and Applications
MLSS Cadiz, 2016
Arthur Gretton Gatsby Unit, CSML, UCL
Outline
– Kernel metric on the space of probability measures
– Function revealing differences in distributions
– Distance between means in space of features (RKHS)
– Independence measure: features of joint minus product of marginals
– Applications to distributions on strings, images, graphs, groups (rotation matrices), semigroups, . . .
Feature mean difference

Two Gaussians with different means
[Figure: densities of P_X and Q_X, two Gaussians with different means]

Feature mean difference

Two Gaussians with different variances
[Figure: densities of P_X and Q_X, two Gaussians with different variances; densities of the feature X²]

Feature mean difference

Gaussian and Laplace densities
[Figure: densities of P_X (Gauss) and Q_X (Laplace)]
Probabilities in feature space: the mean trick

The reproducing property (kernel trick):
– define feature map ϕ(x) ∈ F, ϕ(x) = [. . . ϕ_i(x) . . .] ∈ ℓ²
– k(x, x′) = ⟨ϕ(x), ϕ(x′)⟩_F
– ∀f ∈ F, f(x) = ⟨f(·), ϕ(x)⟩_F

The mean trick:
– for a measure P on X, define the feature map µ_P ∈ F, µ_P = [. . . E_P[ϕ_i(x)] . . .]
– E_{P,Q} k(x, y) = ⟨µ_P, µ_Q⟩_F for x ∼ P and y ∼ Q
– (mean/distribution embedding) E_P(f(x)) =: ⟨µ_P, f(·)⟩_F
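To make the mean trick concrete, here is a minimal numerical sketch (assuming a Gaussian RKHS kernel; all names and constants are illustrative): the inner product of two mean embeddings ⟨µ_P, µ_Q⟩_F is estimated by averaging the kernel over pairs of samples.

    import numpy as np

    def gaussian_kernel(X, Y, sigma=1.0):
        """Gram matrix k(x_i, y_j) = exp(-||x_i - y_j||^2 / (2 sigma^2))."""
        sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
        return np.exp(-sq / (2 * sigma**2))

    def mean_embedding_inner(X, Y, sigma=1.0):
        """Estimate <mu_P, mu_Q>_F = E_{P,Q} k(x, y) from samples X ~ P, Y ~ Q."""
        return gaussian_kernel(X, Y, sigma).mean()

    rng = np.random.default_rng(0)
    X = rng.normal(0.0, 1.0, size=(500, 1))   # samples from P
    Y = rng.normal(0.5, 1.0, size=(500, 1))   # samples from Q
    print(mean_embedding_inner(X, Y))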
Does the feature space mean exist?

Does there exist an element µ_P ∈ F such that
E_P f(x) = E_P ⟨f(·), ϕ(x)⟩_F = ⟨f(·), E_P ϕ(x)⟩_F = ⟨f(·), µ_P(·)⟩_F   ∀f ∈ F ?
Yes: you can exchange expectation and inner product (i.e., ϕ(x) is Bochner integrable [Steinwart and Christmann, 2008]) under the condition
E_P ‖ϕ(x)‖_F = E_P √k(x, x) < ∞ .
The maximum mean discrepancy

The maximum mean discrepancy is the distance between feature means:
MMD²(P, Q) = ‖µ_P − µ_Q‖²_F
           = ⟨µ_P, µ_P⟩_F + ⟨µ_Q, µ_Q⟩_F − 2⟨µ_P, µ_Q⟩_F
           = E_P k(x, x′) + E_Q k(y, y′)   (a)
             − 2 E_{P,Q} k(x, y)           (b)
(a) = within-distribution similarity, (b) = cross-distribution similarity

Proof:
‖µ_P − µ_Q‖²_F = ⟨µ_P − µ_Q, µ_P − µ_Q⟩_F
              = ⟨µ_P, µ_P⟩ + ⟨µ_Q, µ_Q⟩ − 2⟨µ_P, µ_Q⟩
              = E_P[µ_P(x)] + . . .
              = E_P ⟨µ_P(·), k(x, ·)⟩ + . . .
              = E_P k(x, x′) + E_Q k(y, y′) − 2 E_{P,Q} k(x, y)
The maximum mean discrepancy

MMD²(P, Q) = E_P k(x, x′) + E_Q k(y, y′) − 2 E_{P,Q} k(x, y)
Unbiased empirical estimate of the first term (quadratic time):
(1/(m(m − 1))) Σ_{i=1}^{m} Σ_{j≠i} k(xᵢ, xⱼ)
The maximum mean discrepancy

[Figure: pooled Gram matrix over the two samples, with blocks k(dogᵢ, dogⱼ), k(dogᵢ, fishⱼ), k(fishⱼ, dogᵢ), k(fishᵢ, fishⱼ)]
MMD̂² = K̄_{P,P} + K̄_{Q,Q} − 2 K̄_{P,Q}, where K̄ denotes the average over the corresponding Gram-matrix block
(diagonal terms removed from K_{P,P} and K_{Q,Q})
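A hedged sketch of this quadratic-time unbiased MMD² estimate, with the diagonals of the within-sample blocks removed; it reuses the illustrative gaussian_kernel helper from the earlier sketch.

    import numpy as np

    def mmd2_unbiased(X, Y, sigma=1.0):
        m, n = len(X), len(Y)
        Kxx = gaussian_kernel(X, X, sigma)
        Kyy = gaussian_kernel(Y, Y, sigma)
        Kxy = gaussian_kernel(X, Y, sigma)
        # remove diagonals so each within-sample term is an unbiased U-statistic
        term_xx = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
        term_yy = (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
        return term_xx + term_yy - 2 * Kxy.mean()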
Function Showing Difference in Distributions

Idea: find a well-behaved function that is large on samples from P and small on samples from Q.
[Figure: samples from P and Q]

Function Showing Difference in Distributions

MMD(P, Q; F) := sup_{f∈F} [E_P f(x) − E_Q f(y)] .
[Figures: a smooth function f(x) attaining the sup; a bounded continuous function f(x); the witness f for Gauss and Laplace densities]
Classical choices of the function class F:
– F = bounded continuous [Dudley, 2002]
– F = bounded variation 1 (Kolmogorov metric) [Müller, 1997]
– F = bounded Lipschitz (earth mover's distance) [Dudley, 2002]
Our choice: F = RKHS (coming soon!) [ISMB06, NIPS06a, NIPS07b, NIPS08a, JMLR10]
How do smooth functions relate to feature maps?
Function view vs feature mean view

MMD(P, Q; F) = sup_{f∈F} [E_P f(x) − E_Q f(y)]
             = sup_{f∈F} ⟨f, µ_P − µ_Q⟩_F     (use E_P(f(x)) =: ⟨µ_P, f⟩_F)
             = ‖µ_P − µ_Q‖_F                  (use ‖θ‖_F = sup_{f∈F} ⟨f, θ⟩_F since F := {f ∈ F : ‖f‖ ≤ 1})
[Figure: witness f for Gauss and Laplace densities]
The function view and the feature mean view are equivalent.
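The sup above is attained (up to normalization) by the witness function f ∝ µ_P − µ_Q. A minimal sketch of its empirical version, reusing the illustrative gaussian_kernel helper from the earlier sketch:

    import numpy as np

    def witness(v, X, Y, sigma=1.0):
        """Evaluate the empirical witness f(v) ~ mean_i k(x_i,v) - mean_j k(y_j,v)."""
        return (gaussian_kernel(v, X, sigma).mean(axis=1)
                - gaussian_kernel(v, Y, sigma).mean(axis=1))

    # e.g. evaluate on a grid to see where P and Q differ most:
    # grid = np.linspace(-6, 6, 200)[:, None]; f = witness(grid, X, Y)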
MMD for independence: HSIC

[NIPS07a, ALT07, ALT08, JMLR10]; related to [Feuerverger, 1993] and [Székely and Rizzo, 2009, Székely et al., 2007]
Dependence measure: the distance between the embedding of the joint and the embedding of the product of the marginals,
HSIC(P_XY, P_X P_Y) := ‖µ_{P_XY} − µ_{P_X P_Y}‖²
The embedding uses the product kernel κ((x, y), (x′, y′)) = k(x, x′) × l(y, y′).
[Figure: image pairs illustrating the kernel k(·, ·) on X, the kernel l(·, ·) on Y, and their product κ = k × l]

HSIC using expectations of kernels: define RKHS F on X with kernel k, RKHS G on Y with kernel l. Then
HSIC(P_XY, P_X P_Y) = E_{XY} E_{X′Y′} [k(x, x′) l(y, y′)] + E_X E_{X′} [k(x, x′)] E_Y E_{Y′} [l(y, y′)] − 2 E_{X′Y′} [E_X k(x, x′) E_Y l(y, y′)]
HSIC: empirical estimate and intuition

Example: paired text snippets in two distinct styles (from dogtime.com and petfinder.com):
"Their noses guide them through life, and [. . .] an interesting scent. They need plenty of exercise, about an hour a day if possible. A large animal who slings slobber, exudes a distinctive houndy odor, and wants nothing more [. . .] amount of exercise and mental stimulation."
"Known for their curiosity, intelligence, and excellent communication skills, the Javanese breed is perfect if you want a responsive, interactive pet, one that will blow in your ear and follow you everywhere."

Empirical HSIC(P_XY, P_X P_Y): (1/n²) (HKH ∘ HLH)₊₊, the sum over all entries of the elementwise product of the centred Gram matrices.
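A minimal sketch of this estimator, reusing the illustrative gaussian_kernel helper; HSIC_b = trace(KHLH)/n², which equals the sum of all entries of (HKH ∘ HLH)/n²:

    import numpy as np

    def hsic_biased(X, Y, sigma_x=1.0, sigma_y=1.0):
        n = len(X)
        K = gaussian_kernel(X, X, sigma_x)    # kernel on X
        L = gaussian_kernel(Y, Y, sigma_y)    # kernel on Y
        H = np.eye(n) - np.ones((n, n)) / n   # centering matrix
        return np.trace(K @ H @ L @ H) / n**2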
Characteristic kernels (Via Fourier, on the torus T)
Characteristic Kernels (via Fourier)

Reminder — characteristic: MMD is a metric (MMD = 0 iff P = Q) [NIPS07b, JMLR10]
In the next slides: checking this property via Fourier series on the torus.
Characteristic Kernels (via Fourier)

Reminder: Fourier series
f(x) = Σ_{ℓ=−∞}^{∞} f̂_ℓ exp(iℓx) = Σ_{ℓ=−∞}^{∞} f̂_ℓ (cos(ℓx) + i sin(ℓx)) .
[Figure: a top hat function f(x) and its Fourier series coefficients f̂_ℓ]
Characteristic Kernels (via Fourier)

Reminder: Fourier series of a translation-invariant kernel k(x, y) = k(x − y) = k(z),
k(z) = Σ_{ℓ=−∞}^{∞} k̂_ℓ exp(iℓz) .
E.g., the Gaussian-spectrum kernel
k(x) = (1/(2π)) ϑ(x/(2π), iσ²/(2π)),   k̂_ℓ = (1/(2π)) exp(−σ²ℓ²/2),
where ϑ is the Jacobi theta function; k is close to a Gaussian when σ² is sufficiently narrower than [−π, π].
[Figure: the kernel k(x) and its Fourier series coefficients]
Characteristic Kernels (via Fourier)

Mean embedding via Fourier series: with φ_{P,ℓ} the Fourier coefficients (characteristic function) of P,
µ_P(x) = E_P k(x − t) = ∫_{−π}^{π} k(x − t) dP(t),   so by the convolution theorem   µ̂_{P,ℓ} = k̂_ℓ × φ̄_{P,ℓ} .
Then MMD(P, Q; F) = ‖µ_P − µ_Q‖_F, where µ_P − µ_Q has Fourier coefficients (φ̄_{P,ℓ} − φ̄_{Q,ℓ}) k̂_ℓ.
A simpler Fourier expression for MMD

RKHS norm in terms of Fourier coefficients: ‖f‖²_F = ⟨f, f⟩_F = Σ_{ℓ=−∞}^{∞} |f̂_ℓ|² / k̂_ℓ .
Hence
MMD²(P, Q; F) = Σ_{ℓ=−∞}^{∞} |(φ_{P,ℓ} − φ_{Q,ℓ}) k̂_ℓ|² / k̂_ℓ = Σ_{ℓ=−∞}^{∞} |φ_{P,ℓ} − φ_{Q,ℓ}|² k̂_ℓ
Example

[Figure: two densities P(x) and Q(x), mapped by F to their Fourier coefficient sequences φ_{P,ℓ} and φ_{Q,ℓ}; the characteristic function difference |φ_{P,ℓ} − φ_{Q,ℓ}|]
Example

Is the Gaussian-spectrum kernel characteristic? YES: k̂_ℓ > 0 at every frequency ℓ.
[Figure: the kernel k(x) and its Fourier series coefficients]
MMD²(P, Q; F) := Σ_{ℓ=−∞}^{∞} |φ_{P,ℓ} − φ_{Q,ℓ}|² k̂_ℓ
Example

Is the triangle kernel characteristic? NO: k̂_ℓ = 0 at some frequencies ℓ, so differences φ_{P,ℓ} ≠ φ_{Q,ℓ} at those frequencies are invisible to MMD.
[Figure: the triangle kernel and its Fourier series coefficients]
MMD²(P, Q; F) := Σ_{ℓ=−∞}^{∞} |φ_{P,ℓ} − φ_{Q,ℓ}|² k̂_ℓ
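A hedged numerical sketch of this check, with illustrative constants: P is uniform on the torus and Q perturbs it at frequency ℓ = 2; a triangle kernel of half-width π has k̂_ℓ = 0 at even ℓ, so its MMD misses the difference, while the Gaussian-spectrum kernel detects it.

    import numpy as np

    ells = np.arange(-50, 51)

    # characteristic functions: P uniform; Q(x) = (1 + 0.5 cos(2x)) / (2 pi)
    phi_P = (ells == 0).astype(float)
    phi_Q = phi_P + 0.25 * ((ells == 2) | (ells == -2))

    # Gaussian-spectrum kernel coefficients: strictly positive at every ell
    khat_gauss = np.exp(-0.5 * 0.5**2 * ells**2)

    # triangle kernel k(z) = max(0, 1 - |z|/pi): coefficients computed numerically
    z = np.linspace(-np.pi, np.pi, 20001)
    dz = z[1] - z[0]
    tri = 1.0 - np.abs(z) / np.pi
    khat_tri = np.array([(tri * np.cos(l * z)).sum() * dz / (2 * np.pi)
                         for l in ells])

    mmd2 = lambda khat: np.sum(np.abs(phi_P - phi_Q)**2 * khat)
    print("Gaussian-spectrum kernel:", mmd2(khat_gauss))  # > 0: difference seen
    print("Triangle kernel:         ", mmd2(khat_tri))    # ~ 0: difference missed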
Characteristic kernels (via Fourier, on R^d)
Characteristic Kernels (via Fourier)

Characteristic function of P: φ_P(ω) = ∫_{R^d} e^{−i x⊤ω} dP(x)
Bochner's theorem: a translation-invariant kernel has the representation
k(z) = ∫_{R^d} e^{−i z⊤ω} dΛ(ω)
– Λ a finite non-negative Borel measure
Characteristic Kernels (via Fourier)

Fourier representation of MMD:
MMD²(P, Q; F) = ∫_{R^d} |φ_P(ω) − φ_Q(ω)|² dΛ(ω),   φ_P the characteristic function of P.
Proof: using Bochner's theorem (a) and Fubini's theorem (b),
MMD²(P, Q) := E_P k(x − x′) + E_Q k(y − y′) − 2 E_{P,Q} k(x, y)
            = ∫∫ k(s − t) d(P − Q)(s) d(P − Q)(t)
            (a)= ∫∫ [ ∫_{R^d} e^{−i(s−t)⊤ω} dΛ(ω) ] d(P − Q)(s) d(P − Q)(t)
            (b)= ∫_{R^d} | ∫ e^{−i s⊤ω} d(P − Q)(s) |² dΛ(ω)
            = ∫_{R^d} |φ_P(ω) − φ_Q(ω)|² dΛ(ω)
Example

[Figure: two densities, mapped by F to the moduli of their characteristic functions |φ_P(ω)| and |φ_Q(ω)|; the characteristic function difference |φ_P − φ_Q|]

Example

Exponentiated quadratic kernel: its spectrum is positive at every frequency ω, so any difference |φ_P − φ_Q| contributes to MMD. Characteristic.
[Figure: difference |φ_P − φ_Q| vs frequency ω]

Example

Sinc kernel: its spectrum is supported on a bounded interval, so differences |φ_P − φ_Q| outside that interval contribute nothing. NOT characteristic.
[Figure: difference |φ_P − φ_Q| vs frequency ω]

Example

Triangle (B-spline) kernel: its spectrum has isolated zeros, but supp(Λ) = R^d. Characteristic.
[Figure: difference |φ_P − φ_Q| vs frequency ω]
Summary: Characteristic Kernels

Characteristic kernel: MMD a metric (MMD = 0 iff P = Q) [NIPS07b, COLT08]
Main theorem: a translation-invariant k is characteristic for prob. measures on R^d if and only if supp(Λ) = R^d (the spectrum may vanish at most on a set with empty interior, e.g. a countable set of isolated points) [COLT08, JMLR10]
Corollary: a continuous, compactly supported k is characteristic (since its Fourier spectrum Λ(ω) cannot be zero on an interval).
1-D proof sketch in [Mallat, 1999, Theorem 2.6]; proof on R^d via distribution theory in [Sriperumbudur et al., 2010, Corollary 10, p. 1535].
k characteristic iff supp(Λ) = R^d

Proof: supp{Λ} = R^d ⟹ k characteristic. Recall the Fourier definition of MMD:
MMD²(P, Q) = ∫_{R^d} |φ_P(ω) − φ_Q(ω)|² dΛ(ω) .
Characteristic functions φ_P(ω) and φ_Q(ω) are uniformly continuous, hence their difference cannot be non-zero only on a countable set: if φ_P ≠ φ_Q somewhere, they differ on an open set, which carries Λ-mass.
A map φ_P is uniformly continuous if: ∀ε > 0, ∃δ > 0 such that ∀(ω₁, ω₂) ∈ Ω for which d(ω₁, ω₂) < δ, we have d(φ_P(ω₁), φ_P(ω₂)) < ε. Uniform: δ depends only on ε, not on ω₁, ω₂.
k characteristic iff supp(Λ) = R^d

Proof: k characteristic ⟹ supp{Λ} = R^d, by contrapositive. Given supp{Λ} ⊊ R^d, there exists an open interval U on which Λ is zero. Construct densities p(x), q(x) such that φ_P, φ_Q differ only inside U; then MMD(P, Q) = 0 although P ≠ Q.
Further extensions

Characteristic kernels beyond R^d [Fukumizu et al., 2009]:
– locally compact Abelian groups (periodic domains, as we saw)
– compact, non-Abelian groups (orthogonal matrices)
– the semigroup R₊ⁿ (histograms)
Related: other distances [Zhou and Chellappa, 2006] (not yet shown to establish whether P = Q); energy distances (see the references on their relation to kernel distances).
Statistical hypothesis testing

Motivating question: differences in brain signals
The problem: do local field potential (LFP) signals change when measured near a spike burst?
[Figure: LFP amplitude vs time, near a spike burst and without a spike burst]
Statistical test using MMD (1)

The hypotheses:
– H₀: null hypothesis (P = Q)
– H₁: alternative hypothesis (P ≠ Q)
Decide using the empirical MMD:
– "far from zero": reject H₀
– "close to zero": accept H₀
Statistical test using MMD (2)

Unbiased empirical estimate (quadratic time):
MMD̂²_u = (1/(n(n − 1))) Σ_{i≠j} [ k(xᵢ, xⱼ) − k(xᵢ, yⱼ) − k(yᵢ, xⱼ) + k(yᵢ, yⱼ) ]
When P ≠ Q, the statistic is asymptotically normal [Hoeffding, 1948, Serfling, 1980]:
√n (MMD̂²_u − MMD²) →_D N(0, σ²_u),
with σ²_u = 4 ( E_z[(E_{z′} h(z, z′))²] − (E_{z,z′} h(z, z′))² ), where z := (x, y) and h(z, z′) is the bracketed term above.
Statistical test using MMD (3)

[Figure: MMD distribution and Gaussian fit under H₁; the samples are two Laplace distributions with different variances]
Statistical test using MMD (4)

When P = Q, the null distribution is an infinite weighted sum of chi-squared variables:
n MMD̂²_u ∼ Σ_{l=1}^{∞} λ_l (z²_l − 2)
– z_l ∼ N(0, 2) i.i.d.
– λ_l are the eigenvalues of the centred kernel k̃(x, x′): ∫ k̃(x, x′) ψ_l(x) dP(x) = λ_l ψ_l(x′)
[Figure: MMD density under H₀, vs n × MMD²]
Statistical test using MMD (5)

MMD̂² = K̄_{P,P} + K̄_{Q,Q} − 2 K̄_{P,Q}
[Figure: MMD density under H₀ and H₁, vs n × MMD²]
Estimating the 1 − α test threshold from the null distribution: permutation/bootstrap [Arcones and Giné, 1992, Alba Fernández et al., 2008], or moment-matched Pearson curves.
[Figure: CDF of the MMD and Pearson fit, P(MMD < mmd) vs mmd]
Approximate null distribution of MMD via permutation

Empirical MMD as a quadratic form: with w = (1, 1, 1, . . . , 1, −1, . . . , −1, −1, −1)⊤,
MMD̂² ≈ (1/n²) w⊤ [ K_{P,P}  K_{P,Q} ; K_{Q,P}  K_{Q,Q} ] w

Permuted case [Alba Fernández et al., 2008]: w = (1, −1, 1, . . . , 1, −1, . . . , 1, −1, −1)⊤ (equal number of +1 and −1):
(1/n²) w⊤ [ K_{P,P}  K_{P,Q} ; K_{Q,P}  K_{Q,Q} ] w = ?
A random permutation of the signs simulates a draw of the statistic under H₀.
Figure thanks to Kacper Chwialkowski.
Approximate null distribution of MMD̂² via permutation

MMD̂²_p ≈ (1/n²) w⊤ [ K_{P,P}  K_{P,Q} ; K_{Q,P}  K_{Q,Q} ] w for sign vectors w with randomly permuted entries
[Figure: MMD density under H₀, vs n × MMD²]
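A hedged sketch of the resulting two-sample permutation test, reusing mmd2_unbiased from the earlier sketch; the permutation count and level are illustrative choices.

    import numpy as np

    def mmd_permutation_test(X, Y, sigma=1.0, n_perms=500, alpha=0.05, seed=0):
        rng = np.random.default_rng(seed)
        Z = np.vstack([X, Y])
        m = len(X)
        stat = mmd2_unbiased(X, Y, sigma)
        null = np.empty(n_perms)
        for b in range(n_perms):
            idx = rng.permutation(len(Z))       # shuffle the pooled labels
            null[b] = mmd2_unbiased(Z[idx[:m]], Z[idx[m:]], sigma)
        threshold = np.quantile(null, 1 - alpha)
        return stat, threshold, stat > threshold  # reject H0 if True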
Detecting differences in brain signals

Do local field potential (LFP) signals change when measured near a spike burst?
[Figure: LFP amplitude vs time, near a spike burst and without a spike burst]
Neuro data: consistent test w/o permutation

MMD²(P, Q; F) := ‖µ_P − µ_Q‖²_F — is the empirical MMD significantly > 0?
Null distribution: n MMD̂² →_D Σ_{l=1}^{∞} λ_l (z²_l − 2),
– λ_l is the lth eigenvalue of the centred kernel k̃(xᵢ, xⱼ)
Use the Gram matrix spectrum for λ̂_l: a consistent test without permutation (see the sketch below).
[Figure: Type II error vs sample size m for P ≠ Q (neuro); spectral and permutation thresholds compared]
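A hedged sketch of this spectral approach: estimate the λ_l by the eigenvalues of the centred pooled Gram matrix divided by the pooled sample size (this scaling convention is an assumption of the sketch), then simulate draws of n MMD̂² under H₀.

    import numpy as np

    def spectral_null_samples(X, Y, sigma=1.0, n_draws=2000, seed=0):
        rng = np.random.default_rng(seed)
        Z = np.vstack([X, Y])
        n = len(Z)
        K = gaussian_kernel(Z, Z, sigma)
        H = np.eye(n) - np.ones((n, n)) / n
        lam = np.linalg.eigvalsh(H @ K @ H) / n   # eigenvalue estimates
        lam = lam[lam > 1e-12]
        z2 = rng.normal(0.0, np.sqrt(2.0), size=(n_draws, len(lam)))**2
        return (z2 - 2.0) @ lam                   # draws of n * MMD^2 under H0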
Hypothesis testing with HSIC

Distribution of HSIC at independence
HSIC_b = (1/n²) trace(KHLH)
– Statistical testing: how do we find when this is large enough that the null hypothesis P = P_x P_y is unlikely?
– Formally: given P = P_x P_y, what is the threshold T such that P(HSIC > T) < α for small α?
Under the null,
n HSIC_b →_D Σ_{l=1}^{∞} λ_l z²_l ,   z_l ∼ N(0, 1) i.i.d.,
where λ_l ψ_l(z_j) = ∫ h_{ijqr} ψ_l(z_i) dF_{i,q,r} and
h_{ijqr} = (1/4!) Σ_{(t,u,v,w)}^{(i,j,q,r)} [ k_{tu} l_{tu} + k_{tu} l_{vw} − 2 k_{tu} l_{tv} ]
Distribution of HSIC at independence

HSIC_b = (1/n²) trace(KHLH)
Moments of the null distribution:
E(HSIC_b) = (1/n) Tr(C_xx) Tr(C_yy)
var(HSIC_b) = (2(n − 4)(n − 5) / (n)₄) ‖C_xx‖²_HS ‖C_yy‖²_HS + O(n⁻³),
with (n)₄ := n(n − 1)(n − 2)(n − 3).
Statistical testing with HSIC

Given P = P_x P_y, what is the threshold T such that P(HSIC > T) < α for small α?
Null distribution via permutation (implemented in the sketch below):
– Compute HSIC for {xᵢ, y_{π(i)}}ⁿᵢ₌₁ for a random permutation π of the indices {1, . . . , n}. This gives HSIC for independent variables.
– Repeat for many different permutations, get the empirical CDF.
– Threshold T is the 1 − α quantile of the empirical CDF.
Null distribution via a Gamma approximation (two-moment fit):
n HSIC_b(Z) ∼ x^{α−1} e^{−x/β} / (β^α Γ(α)),   where α = (E(HSIC_b))² / var(HSIC_b),   β = n var(HSIC_b) / E(HSIC_b).
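A hedged sketch of the permutation recipe above, reusing hsic_biased from the earlier sketch; shuffling Y against X simulates independence.

    import numpy as np

    def hsic_permutation_test(X, Y, sigma_x=1.0, sigma_y=1.0,
                              n_perms=500, alpha=0.05, seed=0):
        rng = np.random.default_rng(seed)
        stat = hsic_biased(X, Y, sigma_x, sigma_y)
        null = np.empty(n_perms)
        for b in range(n_perms):
            pi = rng.permutation(len(Y))            # random permutation pi
            null[b] = hsic_biased(X, Y[pi], sigma_x, sigma_y)
        threshold = np.quantile(null, 1 - alpha)
        return stat, threshold, stat > threshold    # reject H0 if True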
Experiment: dependence testing for translation

Are the French text extracts translations of the English?
X1: Honourable senators, I have a question for the Leader of the Government in the Senate with regard to the support funding to farmers that has been announced. Most farmers have not received any money yet.
Y1: Honorables sénateurs, ma question s'adresse au leader du gouvernement au Sénat et concerne l'aide financière qu'on a annoncée pour les agriculteurs. La plupart des agriculteurs n'ont encore rien reçu de cet argent.
X2: No doubt there is great pressure on provincial and municipal governments in relation to the issue of child care, but the reality is that there have been no cuts to child care funding from the federal government to the provinces. In fact, we have increased federal investments for early childhood development.
Y2: Il est évident que les ordres de gouvernements provinciaux et municipaux subissent de fortes pressions en ce qui concerne les services de garde, mais le gouvernement n'a pas réduit le financement qu'il verse aux provinces pour les services de garde. Au contraire, nous avons augmenté le financement fédéral pour le développement des jeunes enfants.
· · ·
Experiment: dependence testing for translation

HSIC_b = (1/n²) trace(KHLH) [NIPS07b]
Data: Canadian Hansard (agriculture)
k-spectrum kernel, k = 10, repetitions = 300, sample size 10
[Diagram: Gram matrix K on the English extracts and Gram matrix L on the French extracts feed into HSIC]
Summary

Kernel distances between distributions apply in settings with:
– high dimensionality
– non-Euclidean data (strings, graphs)
– nonparametric hypothesis tests
Characteristic kernels guarantee MMD = 0 iff P = Q:
– easy to check: does the spectrum cover R^d?
Co-authors

– Luca Baldassarre
– Steffen Grunewalder
– Guy Lever
– Sam Patterson
– Massimiliano Pontil
– Dino Sejdinovic

– Karsten Borgwardt, MPI
– Wicher Bergsma, LSE
– Kenji Fukumizu, ISM
– Zaid Harchaoui, INRIA
– Bernhard Schoelkopf, MPI
– Alex Smola, CMU/Google
– Le Song, Georgia Tech
– Bharath Sriperumbudur, Cambridge
Selected references

Characteristic kernels and mean embeddings:
– Sriperumbudur, B., Gretton, A., Fukumizu, K., Schölkopf, B., and Lanckriet, G. (2010). Hilbert space embeddings and metrics on probability measures. JMLR.

Two-sample, independence, conditional independence tests:

Energy distance, relation to kernel distances:
– Sejdinovic, D., Sriperumbudur, B., Gretton, A., and Fukumizu, K. (2013). Equivalence of distance-based and RKHS-based statistics in hypothesis testing. Annals of Statistics.

Three way interaction:

Selected references (continued)

Conditional mean embedding, RKHS-valued regression:
– Estimation, NIPS.
– Foundations of Computational Mathematics.
– Grünewälder, S., Lever, G., Baldassarre, L., Patterson, S., Gretton, A., and Pontil, M. (2012). Conditional mean embeddings as regressors. ICML.

Kernel Bayes rule:
– Song, L., Fukumizu, K., and Gretton, A. (2013). Kernel embeddings of conditional distributions: A unified kernel framework for nonparametric inference in graphical models. IEEE Signal Processing Magazine.
– Fukumizu, K., Song, L., and Gretton, A. (2013). Kernel Bayes' rule: Bayesian inference with positive definite kernels. JMLR.
Local departures from the null

What is a hard testing problem?
[Figure: samples from P and Q]

Local departures from the null

A local departure: f_Q = f_P + δ·g, where g is some fixed function such that f_Q is a valid density.
– If δ ∼ m^{−1/2}, the Type II error approaches a constant.
[Figure: density P(X) vs a sequence of perturbed densities Q(X)]
General characterization of local departures from H₀ via the distribution embedding: µ_Q = µ_P + g_m with ‖g_m‖_F = c·m^{−1/2}.

More general local departures from null

f_Q = f_P + δ·g for a fixed function g such that f_Q is a valid density.
[Figure: density P(X) vs perturbed densities Q(X) with increasingly high-frequency perturbations]
Kernels vs kernels

A classical approach [Anderson et al., 1994]: the L₂ distance between kernel density estimates,
f̂_P(x) = (1/m) Σ_{i=1}^{m} κ(xᵢ − x) ,   where κ satisfies ∫ κ(x) dx = 1 and κ(x) ≥ 0.
Then
‖f̂_P − f̂_Q‖₂² = ∫ [ (1/m) Σ_{i=1}^{m} κ(xᵢ − z) − (1/m) Σ_{i=1}^{m} κ(yᵢ − z) ]² dz
             = (1/m²) Σ_{i,j=1}^{m} k(xᵢ − xⱼ) + (1/m²) Σ_{i,j=1}^{m} k(yᵢ − yⱼ) − (2/m²) Σ_{i,j=1}^{m} k(xᵢ − yⱼ),
where k(x − y) = ∫ κ(x − z) κ(y − z) dz: an MMD whose kernel is the convolution κ ⋆ κ (a numerical check appears after this slide).
For this statistic, the local departure scaling is δ = m^{−1/2} h_m^{−d/2}, where h_m is the width of κ.
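As a sanity check on the identity above, a small numerical sketch (assumptions: Gaussian κ of width h, for which the convolved kernel k = κ ⋆ κ is a Gaussian of width √2·h; grid and constants are illustrative):

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(0.0, 1.0, size=(50, 1))
    Y = rng.normal(0.5, 1.0, size=(50, 1))
    h = 0.3
    m = len(X)

    # direct numerical integration of (f_P_hat - f_Q_hat)^2
    z = np.linspace(-8, 8, 4001)[:, None]
    kde = lambda S: np.exp(-(z - S.T)**2 / (2*h**2)).sum(1) / (m*np.sqrt(2*np.pi)*h)
    diff = kde(X) - kde(Y)
    lhs = (diff**2).sum() * (z[1, 0] - z[0, 0])

    # pairwise-sum formula with the convolved kernel k = kappa * kappa
    k = lambda S, T: np.exp(-(S - T.T)**2 / (4*h**2)) / (2*np.sqrt(np.pi)*h)
    rhs = (k(X, X).sum() + k(Y, Y).sum() - 2*k(X, Y).sum()) / m**2
    print(lhs, rhs)  # should agree to integration accuracy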
Characteristic Kernels (via universality)

Characteristic: MMD a metric (MMD = 0 iff P = Q) [NIPS07b, COLT08]
Classical result: P = Q if and only if E_P(f(x)) = E_Q(f(y)) for all f ∈ C(X), the space of bounded continuous functions on X [Dudley, 2002]
Universal RKHS: k(x, x′) continuous, X compact, and F dense in C(X) with respect to L∞ [Steinwart, 2001]
If F universal, then MMD{P, Q; F} = 0 iff P = Q
Characteristic Kernels (via universality)

Proof: First, it is clear that P = Q implies MMD{P, Q; F} is zero.
Converse: by the universality of F, for any given ε > 0 and f ∈ C(X) there exists g ∈ F with ‖f − g‖∞ ≤ ε.
We next make the expansion
|E_P f(x) − E_Q f(y)| ≤ |E_P f(x) − E_P g(x)| + |E_P g(x) − E_Q g(y)| + |E_Q g(y) − E_Q f(y)| .
The first and third terms satisfy |E_P f(x) − E_P g(x)| ≤ E_P |f(x) − g(x)| ≤ ε.
Next, write E_P g(x) − E_Q g(y) = ⟨g(·), µ_P − µ_Q⟩_F = 0, since MMD{P, Q; F} = 0 implies µ_P = µ_Q.
Hence |E_P f(x) − E_Q f(y)| ≤ 2ε for all f ∈ C(X) and ε > 0, which implies P = Q.