Lecture 3: Dependence measures using RKHS embeddings
MLSS Cadiz, 2016
Arthur Gretton Gatsby Unit, CSML, UCL
Outline
– Three or more variable interactions; comparison with conditional dependence testing [Sejdinovic et al., 2013a]
– Testing for time series [Chwialkowski and Gretton, 2014; Chwialkowski et al., 2014]
– Infinite dimensional exponential families [Sriperumbudur et al., 2014]
– Adaptive MCMC, and adaptive Hamiltonian Monte Carlo [Sejdinovic et al., 2014; Strathmann et al., 2015]
Example: a V-structure X → Z ← Y with pairwise independence.
X, Y i.i.d. ∼ N(0, 1), Z | X, Y ∼ sign(XY) · Exp(1/√2)
All pairs are independent: X ⊥⊥ Y, Y ⊥⊥ Z, X ⊥⊥ Z, yet (X, Y) and Z are jointly dependent.
[Figure: scatter plots of X vs Y, Y vs Z, X vs Z, and XY vs Z; only the last shows dependence.]
Faithfulness violated here
Detecting the V-structure X → Z ← Y.
Assume X ⊥⊥ Y has been established. The V-structure can then be detected by:
– a conditional dependence test: reject X ⊥⊥ Y | Z [Fukumizu et al., 2008, Zhang et al., 2011], or
– a factorisation test: reject (X, Y) ⊥⊥ Z ∨ (X, Z) ⊥⊥ Y ∨ (Y, Z) ⊥⊥ X (multiple standard two-variable tests):
– compute p-values for each of the marginal tests for (Y, Z) ⊥⊥ X, (X, Z) ⊥⊥ Y, and (X, Y) ⊥⊥ Z
– apply the Holm-Bonferroni (HB) sequentially rejective correction (Holm 1979)
Can we detect this higher-order dependence in high dimensions?
Dataset A: X1, Y1 i.i.d. ∼ N(0, 1), Z1 | X1, Y1 ∼ sign(X1Y1) · Exp(1/√2); the remaining dimensions X2:p, Y2:p, Z2:p are i.i.d. ∼ N(0, Ip−1).
All pairs are independent: X ⊥⊥ Y, Y ⊥⊥ Z, X ⊥⊥ Z.
[Figure: scatter plots of X1 vs Y1, Y1 vs Z1, X1 vs Z1, and X1·Y1 vs Z1.]
Faithfulness violated here
[Figure 1: V-structure discovery on Dataset A. Null acceptance rate (Type II error) vs. dimension p = 1, 3, . . . , 19, for the CI test of X ⊥⊥ Y | Z from Zhang et al. (2011) and a factorisation test based on two-variable tests with HB correction; n = 500.]
Interaction measure [Bahadur (1961); Lancaster (1969)]: an interaction measure of (X1, . . . , XD) ∼ P is a signed measure ∆P that vanishes whenever P can be factorised in a non-trivial way as a product of its (possibly multivariate) marginal distributions.
Lancaster interaction, two variables: ∆L P = PXY − PX PY
Lancaster interaction, three variables: ∆L P = PXY Z − PX PY Z − PY PXZ − PZ PXY + 2 PX PY PZ
Case of X ⊥⊥ (Y, Z): substituting PXY Z = PX PY Z (and hence PXY = PX PY and PXZ = PX PZ) makes the terms cancel, so ∆L P = 0; the other two factorisations behave in the same way. Thus
(X, Y) ⊥⊥ Z ∨ (X, Z) ⊥⊥ Y ∨ (Y, Z) ⊥⊥ X ⇒ ∆L P = 0.
. . . so what might be missed? The converse fails: ∆L P = 0 does not imply any factorisation. Example (binary variables):
P(0, 0, 0) = 0.2  P(0, 0, 1) = 0.1  P(1, 0, 0) = 0.1  P(1, 0, 1) = 0.1
P(0, 1, 0) = 0.1  P(0, 1, 1) = 0.1  P(1, 1, 0) = 0.1  P(1, 1, 1) = 0.2
Here ∆L P = 0, yet no non-trivial factorisation holds (e.g., X and Y are dependent); see the numerical check below.
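A quick numerical check of this example (not from the slides; plain numpy):

```python
import numpy as np
from itertools import product

# Joint distribution from the slide, indexed as P[x, y, z]
P = np.zeros((2, 2, 2))
P[0, 0, 0], P[0, 0, 1], P[1, 0, 0], P[1, 0, 1] = 0.2, 0.1, 0.1, 0.1
P[0, 1, 0], P[0, 1, 1], P[1, 1, 0], P[1, 1, 1] = 0.1, 0.1, 0.1, 0.2

# Marginals of one and two variables
Px, Py, Pz = P.sum((1, 2)), P.sum((0, 2)), P.sum((0, 1))
Pxy, Pxz, Pyz = P.sum(2), P.sum(1), P.sum(0)

for x, y, z in product(range(2), repeat=3):
    d = (P[x, y, z] - Px[x] * Pyz[y, z] - Py[y] * Pxz[x, z]
         - Pz[z] * Pxy[x, y] + 2 * Px[x] * Py[y] * Pz[z])
    print((x, y, z), round(d, 12))        # Delta_L P = 0 at every point

print(Pxy[0, 0], Px[0] * Py[0])           # 0.3 vs 0.25: X and Y are dependent
```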
A kernel statistic: embed the signed measures in Hκ, where κ = k ⊗ l ⊗ m:
‖µκ(PXY Z − PXY PZ − · · ·)‖²Hκ = ⟨µκ PXY Z, µκ PXY Z⟩Hκ − 2 ⟨µκ PXY Z, µκ PXY PZ⟩Hκ + · · ·
Each inner product ⟨µκ ν, µκ ν′⟩Hκ has a V-statistic estimator (Table 1; ◦ denotes the entrywise product, A++ the sum of all entries of A; normalising powers of n omitted, and only the upper triangle of the symmetric table shown):

Table 1: V-statistic estimators of ⟨µκ ν, µκ ν′⟩Hκ
ν \ ν′    | PXY Z         | PXY PZ        | PXZ PY        | PY Z PX       | PX PY PZ
PXY Z     | (K ◦ L ◦ M)++ | ((K ◦ L)M)++  | ((K ◦ M)L)++  | ((M ◦ L)K)++  | tr(K+ ◦ L+ ◦ M+)
PXY PZ    |               | (K ◦ L)++ M++ | (MKL)++       | (KLM)++       | (KL)++ M++
PXZ PY    |               |               | (K ◦ M)++ L++ | (KML)++       | (KM)++ L++
PY Z PX   |               |               |               | (L ◦ M)++ K++ | (LM)++ K++
PX PY PZ  |               |               |               |               | K++ L++ M++

Everything simplifies: the Lancaster test statistic is
‖µκ(∆L P)‖²Hκ = (1/n²) (HKH ◦ HLH ◦ HMH)++ ,
the empirical joint central moment in the feature space.
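A minimal numpy sketch of this statistic on Dataset A (the Gaussian kernel bandwidth and the exponential parameterisation are illustrative assumptions; a permutation test would still be needed to obtain a rejection threshold):

```python
import numpy as np

def gaussian_gram(x, sigma=1.0):
    # Gram matrix of a Gaussian kernel on a 1-D sample (bandwidth assumed)
    d2 = (x[:, None] - x[None, :]) ** 2
    return np.exp(-d2 / (2 * sigma ** 2))

def lancaster_stat(K, L, M):
    # ||mu_kappa(Delta_L P)||^2 estimated as (1/n^2) (HKH o HLH o HMH)_{++}
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n           # centering matrix
    Kc, Lc, Mc = H @ K @ H, H @ L @ H, H @ M @ H
    return (Kc * Lc * Mc).sum() / n ** 2          # '*' is the entrywise product

rng = np.random.default_rng(0)
n = 500
x, y = rng.standard_normal(n), rng.standard_normal(n)
z = np.sign(x * y) * rng.exponential(scale=np.sqrt(2), size=n)  # Dataset A (scale assumed)
print(lancaster_stat(gaussian_gram(x), gaussian_gram(y), gaussian_gram(z)))
```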
[Figure 2: V-structure discovery on Dataset A. Null acceptance rate (Type II error) vs. dimension, for the factorisation hypothesis tested with the Lancaster statistic vs. a two-variable based test (both with HB correction), and the CI test for X ⊥⊥ Y | Z from Zhang et al. (2011); n = 500.]
Dataset B: X1, Y1 i.i.d. ∼ N(0, 1), and
Z1 = X1² + ε w.p. 1/3,  Z1 = Y1² + ε w.p. 1/3,  Z1 = X1Y1 + ε w.p. 1/3,  where ε ∼ N(0, 0.1²);
the remaining dimensions X2:p, Y2:p, Z2:p are i.i.d. ∼ N(0, Ip−1).
[Figure 3: V-structure discovery on Dataset B. Null acceptance rate (Type II error) vs. dimension, for the factorisation hypothesis tested with the Lancaster statistic vs. a two-variable based test (both with HB correction), and the CI test for X ⊥⊥ Y | Z from Zhang et al. (2011); n = 500.]
Higher-order interactions (Streitberg, 1990):
∆S P = Σπ (−1)^{|π|−1} (|π| − 1)! Jπ P,
where the sum is over all partitions π of {1, . . . , D}.
– For a partition π, Jπ associates to the joint the corresponding factorisation, e.g., J13|2|4 P = PX1X3 PX2 PX4.
[Figure: growth of the Bell numbers, the number of partitions of {1, . . . , D}, for D = 1, . . . , 25 (log scale, reaching ≈ 10^19).]
Interpretation: the Lancaster interaction corresponds to joint central moments; the Streitberg interaction corresponds to joint cumulants.
Total independence test: H0 : PXY Z = PX PY PZ vs. H1 : PXY Z ≠ PX PY PZ
In general, with a kernel k(i) on each variable and κ = k(1) ⊗ · · · ⊗ k(D), the V-statistic estimate of the total independence statistic is
‖µκ(P̂X − ∏_{i=1}^D P̂Xi)‖²Hκ
= (1/n²) Σ_{a=1}^n Σ_{b=1}^n ∏_{i=1}^D K(i)ab
− (2/n^{D+1}) Σ_{a=1}^n ∏_{i=1}^D Σ_{b=1}^n K(i)ab
+ (1/n^{2D}) ∏_{i=1}^D Σ_{a=1}^n Σ_{b=1}^n K(i)ab .
For translation-invariant kernels, this recovers classical total-independence statistics based on empirical characteristic functions.
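A sketch of this V-statistic in numpy (a hypothetical helper, assuming the Gram matrices K(i) are precomputed):

```python
import numpy as np

def total_independence_stat(Ks):
    # V-statistic for ||mu_kappa(P_X - prod_i P_Xi)||^2; Ks is a list of D Gram matrices
    n, D = Ks[0].shape[0], len(Ks)
    prod_all = np.ones((n, n))
    for K in Ks:
        prod_all *= K                                       # entrywise prod_i K^(i)_ab
    term1 = prod_all.sum() / n ** 2
    rowsums = np.column_stack([K.sum(axis=1) for K in Ks])  # sum_b K^(i)_ab, per a
    term2 = 2 * rowsums.prod(axis=1).sum() / n ** (D + 1)
    term3 = np.prod([K.sum() for K in Ks]) / n ** (2 * D)
    return term1 - term2 + term3
```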
Example: text data (descriptions of dog breeds; text from dogtime.com and petfinder.com; gaps mark unrecoverable words):
"Their noses guide them through life, and [. . .] an interesting scent. They need plenty of exercise, about an hour a day if possible. A large animal who slings slobber, exudes a distinctive houndy odor, and wants nothing more [. . .] amount of exercise and mental stimulation."
"Known for their curiosity, intelligence, and excellent communication skills, the Japanese breed is perfect if you want a responsive, interactive pet, one that will [. . .] follow you everywhere."
Empirical HSIC(PXY , PX PY ): (1/n²) (HKH ◦ HLH)++
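A minimal sketch of this estimator (plain numpy; in practice the null distribution is obtained by permuting one of the samples, as the second helper illustrates):

```python
import numpy as np

def hsic(K, L):
    # Empirical HSIC(P_XY, P_X P_Y) = (1/n^2) (HKH o HLH)_{++}
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return ((H @ K @ H) * (H @ L @ H)).sum() / n ** 2

def hsic_permutation_pvalue(K, L, n_perm=500, rng=None):
    # Permuting the Y sample destroys any dependence, giving draws from the null
    rng = rng or np.random.default_rng()
    stat, n = hsic(K, L), K.shape[0]
    null = [hsic(K, L[np.ix_(p, p)]) for p in
            (rng.permutation(n) for _ in range(n_perm))]
    return (1 + sum(s >= stat for s in null)) / (1 + n_perm)
```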
A more intuitive idea: maximize the covariance of smooth mappings:
COCO(P; F, G) := sup_{‖f‖F=1, ‖g‖G=1} (Ex,y[f(x)g(y)] − Ex[f(x)] Ey[g(y)])
[Figure: a sample of (X, Y) with correlation −0.00; the dependence witnesses f (for X) and g (for Y); the mapped sample (f(X), g(Y)) has correlation −0.90, with COCO = 0.14.]
Covariance in RKHS: let's first look at the finite-dimensional linear case. We have two random vectors x ∈ Rd, y ∈ Rd′. Are they linearly dependent?
Compute their covariance matrix (ignore centering): Cxy = E[xy⊤]
How to get a single "summary" number? Solve for vectors f ∈ Rd, g ∈ Rd′:
argmax_{‖f‖=1, ‖g‖=1} f⊤Cxy g = argmax_{‖f‖=1, ‖g‖=1} Exy[(f⊤x)(g⊤y)]
Writing f(x) := f⊤x and g(y) := g⊤y, this is
argmax_{‖f‖=1, ‖g‖=1} Ex,y[f(x)g(y)] = argmax_{‖f‖=1, ‖g‖=1} cov(f(x), g(y)), i.e. the maximum singular value of Cxy.
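A small numpy illustration of this linear case (toy data; all names and sizes are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, dp = 2000, 3, 2
x = rng.standard_normal((n, d))
y = x[:, :dp] + 0.5 * rng.standard_normal((n, dp))   # linearly dependent toy data

Cxy = (x - x.mean(0)).T @ (y - y.mean(0)) / n        # empirical covariance matrix
U, s, Vt = np.linalg.svd(Cxy)
print(s[0])             # max over unit f, g of f^T Cxy g: the largest singular value
f, g = U[:, 0], Vt[0]   # the maximizing directions
```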
Given features φ(x) ∈ F and ψ(y) ∈ G:
Challenge 1: can we define a feature-space analog to xy⊤? YES: the tensor product f ⊗ g, defined by analogy with (fg⊤)h = f(g⊤h), so that (f ⊗ g)h = f ⟨g, h⟩G.
The inner product of two such tensor products is ⟨φ(x) ⊗ ψ(y), φ(x′) ⊗ ψ(y′)⟩ = k(x, x′) l(y, y′)
Given features φ(x) ∈ F and ψ(y) ∈ G:
Challenge 2: does a covariance "matrix" (operator) in feature space exist? I.e., is there some CXY : G → F such that ⟨f, CXY g⟩F = Ex,y[f(x)g(y)] = cov(f(x), g(y))?
YES: via a Bochner integrability argument (as with the mean embedding). Under the condition Ex,y[√(k(x, x) l(y, y))] < ∞,
CXY := Ex,y [φ(x) ⊗ ψ(y)], which is a Hilbert-Schmidt operator (its sum of squared singular values is finite).
Recall COCO(P; F, G) := sup_{‖f‖F=1, ‖g‖G=1} (Ex,y[f(x)g(y)] − Ex[f(x)] Ey[g(y)])
[Figure: as before, the witnesses f, g reveal dependence that the raw correlation (−0.00) misses: the correlation of (f(X), g(Y)) is −0.90, COCO = 0.14.]
How do we compute this from finite data?
The empirical feature covariance, given z := (xi, yi)_{i=1}^n (now including centering):
ĈXY = (1/n) Σ_{i=1}^n φ(xi) ⊗ ψ(yi) − µ̂x ⊗ µ̂y, where µ̂x := (1/n) Σ_{i=1}^n φ(xi).
Optimization problem: COCO(z; F, G) := max ⟨f, ĈXY g⟩F subject to ‖f‖F ≤ 1 and ‖g‖G ≤ 1.
Assume f = Σ_{i=1}^n αi [φ(xi) − µ̂x] and g = Σ_{i=1}^n βi [ψ(yi) − µ̂y]. The associated Lagrangian is
L(f, g, λ, γ) = ⟨f, ĈXY g⟩F − (λ/2)(‖f‖²F − 1) − (γ/2)(‖g‖²G − 1), where λ ≥ 0 and γ ≥ 0.
Setting derivatives to zero yields the eigenvalue problem
[ 0        (1/n) L̃ ] [α]  =  γ [α]
[ (1/n) K̃      0   ] [β]        [β]
where K̃ = HKH and L̃ = HLH are the matrices of inner products between centred observations in the respective feature spaces, with Kij = k(xi, xj), Lij = l(yi, yj), and H = I − (1/n)11⊤. COCO(z; F, G) is the largest eigenvalue γmax.
The corresponding witness function on X is
f(x) = Σ_{i=1}^n αi [k(xi, x) − (1/n) Σ_{j=1}^n k(xj, x)]
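A compact numpy sketch: since the block eigenproblem implies γ² α = (1/n²) L̃K̃ α, COCO can be computed directly from the centred Gram matrices (illustrative code, not from the slides):

```python
import numpy as np

def coco(K, L):
    # Empirical COCO: largest gamma solving the block eigenproblem above,
    # obtained from gamma^2 = largest eigenvalue of (1/n^2) Lc Kc
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    Kc, Lc = H @ K @ H, H @ L @ H                  # centred Gram matrices
    lam = np.linalg.eigvals(Lc @ Kc).real.max() / n ** 2
    return np.sqrt(max(lam, 0.0))
```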
[Figure: a smooth density and a rough density on X × Y, each shown with 500 samples.]
The density takes the form Px,y ∝ 1 + sin(ωx) sin(ωy)
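Such a density is easy to sample by rejection, since it is bounded above by 2; here is a sketch (the support [−π, π]² is an assumption for illustration, chosen so that the marginals are exactly uniform):

```python
import numpy as np

def sample_sinusoid(n, omega, rng=None):
    # Rejection sampler for p(x, y) proportional to 1 + sin(w x) sin(w y)
    # on [-pi, pi]^2; the density is bounded above by 2
    rng = rng or np.random.default_rng()
    samples = []
    while len(samples) < n:
        x, y = rng.uniform(-np.pi, np.pi, 2)
        if rng.uniform(0, 2) < 1 + np.sin(omega * x) * np.sin(omega * y):
            samples.append((x, y))
    return np.array(samples)

xy = sample_sinusoid(500, omega=2)   # higher omega: dependence at finer scales
```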
COCO vs. the frequency ω of the perturbation from independence:
– ω = 1: sample correlation 0.27; witness correlation −0.50, COCO = 0.09
– ω = 2: sample correlation 0.04; witness correlation 0.51, COCO = 0.07
– ω = 3: sample correlation 0.03; witness correlation −0.45, COCO = 0.03
– ω = 4: sample correlation 0.03; witness correlation 0.21, COCO = 0.02
– ω = ??: sample correlation 0.00; witness correlation −0.13, COCO = 0.02. In fact this last sample is uniform noise! The nonzero COCO is a bias, which will decrease with increasing sample size.
[Figures: for each case, the sample (X, Y), the dependence witnesses f, g, and the mapped sample (f(X), g(Y)).]
COCO vs. frequency of perturbation from independence: as ω increases, the witnesses f, g achieve lower linear covariance, so the dependence becomes harder to detect. COCO does not vanish for independent data at small sample sizes, since some mild linear dependence will be induced by f, g (bias).
[Figure: COCO (empirical average, 1500 samples) vs. the frequency ω = 1, . . . , 6 of the non-constant density component.]
Beyond the largest eigenvalue: further dependence witnesses.
[Figure: a sample with correlation 0; the first witness pair (f, g) has correlation −0.80 with COCO = 0.11; the second witness pair (f2, g2) has correlation −0.37 with COCO2 = 0.06.]
Criterion (HSIC) [ALT05, NIPS07a, JMLR10]:
HSIC(z; F, G) := Σ_{i=1}^n γi²  (the sum of the squared eigenvalues of the COCO problem)
Population criterion:
HSIC(P; F, G) := ‖Cxy‖²HS = ⟨Cxy, Cxy⟩HS
= Ex,x′,y,y′[k(x, x′) l(y, y′)] + Ex,x′[k(x, x′)] Ey,y′[l(y, y′)] − 2 Ex,y[Ex′ k(x, x′) Ey′ l(y, y′)]
with x′ an independent copy of x, and y′ a copy of y. HSIC is identical to MMD(PXY , PX PY ).
Theorem: when the kernels k and l are each characteristic, HSIC = 0 iff Px,y = Px Py [Gretton, 2015]. This is weaker than the MMD condition, which requires a kernel characteristic on X × Y to distinguish an arbitrary Px,y from an arbitrary Qx,y.
Question: wouldn't it be enough just to use a rich mapping from X to Y, e.g., via ridge regression with a characteristic F,
f∗ = arg min_{f∈F} Ex,y(y − f(x))² + λ‖f‖²F ?
Counterexample: a density symmetric about the x-axis, s.t. p(x, y) = p(x, −y). Then E[y|x] = 0, so the best regression function is identically zero, even though X and Y are dependent.
[Figure: a sample from such a density; correlation −0.00.]
Kernels on probabilities let us apply the standard toolbox:
– Support vector classification/regression, kernel ridge regression, . . .
The simplest kernel is the inner product of mean embeddings [Haussler, 1999, Gärtner et al., 2002]:
K(P, Q) = ⟨µP, µQ⟩F
with induced distance MMD²(µP, µQ) := ‖µP − µQ‖²F = EP k(x, x′) + EQ k(y, y′) − 2 EP,Q k(x, y)
Nonlinear kernels on distributions [NIPS10], [AISTATS15]:
KG(P, Q) = exp(−‖µP − µQ‖²F / (2θ²))
Ke(P, Q) = exp(−‖µP − µQ‖F / (2θ²))
KC(P, Q) = (1 + ‖µP − µQ‖²F / θ²)^{−1}
Kt(P, Q) = (1 + ‖µP − µQ‖F^θ)^{−1}, θ ≤ 2
. . . where ‖µP − µQ‖²F = EP k(x, x′) + EQ k(y, y′) − 2 EP,Q k(x, y)
Distribution regression:
– Data: {(Pi, yi)}_{i=1}^ℓ i.i.d. ∼ ρ(µP, y) = ρ(y|µP) ρ(µP), with mean embeddings µPi = EPi[φx]
– Regression target: fρ(µP) = ∫ y dρ(y|µP)
– Kernel ridge estimate (a code sketch follows below):
fλz = arg min_{f∈H} (1/ℓ) Σ_{i=1}^ℓ (f(µPi) − yi)² + λ‖f‖²H, (λ > 0)
– H consists of functions from F̄ ⊂ F to R, where F̄ := {µP : P ∈ P}, P a set of prob. measures on X
Risk: R[f] = Eρ(µP,y)(f(µP) − y)², excess risk E(fλz, fρ) = R[fλz] − R[fρ].
Convergence rate: E(fλz, fρ) = Op(ℓ^{−bc/(bc+1)})
– b size of input space, c smoothness of fρ
In practice only samples from each Pi are available, so use the empirical embeddings
µ̂Pi = N^{−1} Σ_{j=1}^N φxj, with xj i.i.d. ∼ Pi.
Then E(fλẑ, fρ) = Op(ℓ^{−bc/(bc+1)}). Same rate as for population µPi embeddings!
[AISTATS15, JMLR in revision]
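A minimal sketch of the two-stage procedure with the linear kernel K(P, Q) = ⟨µ̂P, µ̂Q⟩ (all names and parameter values are illustrative assumptions):

```python
import numpy as np

def rbf(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def emb_inner(bag_a, bag_b, sigma=1.0):
    # <mu_hat_P, mu_hat_Q> = mean of base-kernel values across the two bags
    return rbf(bag_a, bag_b, sigma).mean()

def fit_predict(train_bags, y, test_bags, lam=1e-3, sigma=1.0):
    # Kernel ridge regression on empirical mean embeddings
    ell = len(train_bags)
    G = np.array([[emb_inner(a, b, sigma) for b in train_bags] for a in train_bags])
    alpha = np.linalg.solve(G + lam * ell * np.eye(ell), y)
    Kt = np.array([[emb_inner(t, b, sigma) for b in train_bags] for t in test_bags])
    return Kt @ alpha
```

The nonlinear kernels KG, Ke, KC, Kt above can be built from the same pairwise quantities, since ‖µ̂P − µ̂Q‖² = ⟨µ̂P, µ̂P⟩ + ⟨µ̂Q, µ̂Q⟩ − 2⟨µ̂P, µ̂Q⟩.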
Supervised learning applications:
– Atmospheric monitoring: predict the aerosol value from the distribution of pixel values of a multispectral satellite image over an area (performance matches the engineered state-of-the-art [Wang et al., 2012])
– Approximate inference: predict outgoing expectation-propagation messages from incoming messages, when updates would otherwise be done by numerical integration [UAI15]
Cause-effect discovery: an additive noise model is used to direct an edge between random variables x and y, positing y = f(x) + ε with ε independent of x [Hoyer et al., 2009] (Figure: D. Lopez-Paz).
Classification of cause-effect relations [Lopez-Paz et al., 2015]:
– training data: cause-effect pairs with causes and effects known [Zscheischler, J., 2014], with additive Gaussian noise
– featurize each pair by the embeddings µ̂Px, µ̂Py, µ̂Pxy, with labels for x → y and y → x
– result: 81% correct (Figure: Mooij et al. (2015))
Co-authors:
– Luca Baldassarre – Steffen Grunewalder – Guy Lever – Sam Patterson – Massimiliano Pontil – Dino Sejdinovic
– Karsten Borgwardt, MPI – Wicher Bergsma, LSE – Kenji Fukumizu, ISM – Zaid Harchaoui, INRIA – Bernhard Schoelkopf, MPI – Alex Smola, CMU/Google – Le Song, Georgia Tech – Bharath Sriperumbudur, Cambridge
MMD² = ‖µP − µQ‖²F = EP k(x, x′) + EQ k(y, y′) − 2 EP,Q k(x, y)
Given i.i.d. samples X := {x1, . . . , xm} and Y := {y1, . . . , ym} from P, Q, respectively:
The earlier estimate (quadratic time), shown here for the first term:
(1/(m(m − 1))) Σ_{i=1}^m Σ_{j≠i} k(xi, xj)
New, linear-time estimate of the same term:
(2/m) [k(x1, x2) + k(x3, x4) + . . .] = (2/m) Σ_{i=1}^{m/2} k(x2i−1, x2i)
Shorter expression with explicit k dependence:
MMD² =: ηk(p, q) = Ex,x′,y,y′ hk(x, x′, y, y′) =: Ev hk(v),
where hk(x, x′, y, y′) = k(x, x′) + k(y, y′) − k(x, y′) − k(x′, y) and v := [x, x′, y, y′].
The linear-time estimate again:
η̌k = (2/m) Σ_{i=1}^{m/2} hk(vi), where vi := [x2i−1, x2i, y2i−1, y2i] and
hk(vi) := k(x2i−1, x2i) + k(y2i−1, y2i) − k(x2i−1, y2i) − k(x2i, y2i−1)
Disadvantage of the linear-time MMD vs. the quadratic-time MMD: it uses the data less efficiently, so it has higher variance at a given sample size.
Advantages of the linear-time MMD vs. the quadratic-time MMD:
– a simple asymptotic null distribution: a Gaussian (vs. a weighted sum of χ²)
– straightforward threshold computation
– constant memory: the data need not be stored
By the central limit theorem, m^{1/2}(η̌k − ηk(p, q)) →D N(0, 2σk²),
assuming Ev hk²(v) < ∞ (true for bounded k), where σk² = Ev hk²(v) − [Ev hk(v)]².
Hypothesis test of asymptotic level α: reject when η̌k exceeds
tk,α = m^{−1/2} σk √2 Φ^{−1}(1 − α), where Φ^{−1} is the inverse CDF of N(0, 1).
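A minimal sketch of the resulting streaming test (plain numpy/scipy; the Gaussian kernel and its bandwidth are illustrative assumptions):

```python
import numpy as np
from scipy.stats import norm

def linear_time_mmd_test(x, y, k, alpha=0.05):
    # x, y: (m, d) arrays; k: vectorised kernel, k(A, B) -> (len(A),) values
    m = (min(len(x), len(y)) // 2) * 2
    x1, x2, y1, y2 = x[0:m:2], x[1:m:2], y[0:m:2], y[1:m:2]
    h = k(x1, x2) + k(y1, y2) - k(x1, y2) - k(x2, y1)  # h_k(v_i), one per block
    eta = h.mean()                   # equals (2/m) sum_i h_k(v_i)
    sigma = h.std()                  # estimates sigma_k
    t = np.sqrt(2) * sigma * norm.ppf(1 - alpha) / np.sqrt(m)
    return eta, eta > t              # the statistic and the test decision

gauss = lambda A, B: np.exp(-((A - B) ** 2).sum(-1) / 2)   # bandwidth 1, assumed
```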
[Figure: the asymptotic null distribution P(η̌k) of the linear-time statistic, with the threshold tk,α at its (1 − α) quantile and the Type I error in the tail beyond it; and the null vs. the alternative distribution, centred at ηk(p, q), whose mass below tk,α gives the Type II error.]
Type II error: η̌k falls below the threshold tk,α although ηk(p, q) > 0, with probability
P(η̌k < tk,α) = Φ(Φ^{−1}(1 − α) − ηk(p, q) √m / (σk √2))
Since Φ is monotonic, the best kernel choice to minimize the Type II error probability is
k∗ = arg max_{k∈K} ηk(p, q) σk^{−1},
where K is the family of kernels under consideration.
Define the family of kernels as follows:
K := { k = Σ_{u=1}^d βu ku : ‖β‖1 = D, βu ≥ 0, ∀u ∈ {1, . . . , d} }
Properties: with characteristic base kernels ku, every k ∈ K with at least one βu > 0 is characteristic.
The squared MMD becomes
ηk(p, q) = ‖µk(p) − µk(q)‖²Fk = Σ_{u=1}^d βu ηu(p, q), where ηu(p, q) := Ev hu(v).
Denote:
– hu(x, x′, y, y′) = ku(x, x′) + ku(y, y′) − ku(x, y′) − ku(x′, y)
– h := (h1, . . . , hd)⊤ and η := (η1, . . . , ηd)⊤
Quantities for the test: ηk(p, q) = E(β⊤h) = β⊤η and σk² := β⊤ cov(h) β.
Empirical test parameters: η̂k = β⊤η̂ and σ̂k,λ = (β⊤(Q̂ + λm I) β)^{1/2},
where Q̂ is the empirical estimate of cov(h).
Note: η̂k, σ̂k,λ are computed on training data, vs. η̌k, σ̌k on the data to be tested (why?)
Objective:
β̂∗ = arg max_{β⪰0} η̂k σ̂k,λ^{−1} = arg max_{β⪰0} β⊤η̂ (β⊤(Q̂ + λm I) β)^{−1/2} =: α(β; η̂, Q̂)
Assume η̂ has at least one positive entry. Then there exists β ⪰ 0 s.t. α(β; η̂, Q̂) > 0, and thus α(β̂∗; η̂, Q̂) > 0.
Solve the easier problem β̂∗ = arg max_{β⪰0} α²(β; η̂, Q̂), which is a quadratic program (a solver sketch follows the test procedure below):
min { β⊤(Q̂ + λm I) β : β⊤η̂ = 1, β ⪰ 0 }
What if η̂ has no positive entries? Kernel selection procedure:
(a) compute η̂u for all ku ∈ K
(b) if at least one η̂u > 0, solve the QP to get β∗; else choose a random kernel from K
Test procedure:
(a) compute η̌k∗ using k∗ = Σ_{u=1}^d β∗u ku
(b) compute the test threshold ťα,k∗ using σ̌k∗
(c) reject H0 when η̌k∗ > ťα,k∗
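A sketch of the QP step using scipy's generic SLSQP solver (a stand-in for whatever solver the authors used; all names are illustrative):

```python
import numpy as np
from scipy.optimize import minimize

def select_beta(eta_hat, Q_hat, lam_m, rng=None):
    # min beta^T (Q_hat + lam_m I) beta  s.t.  beta^T eta_hat = 1, beta >= 0
    d = len(eta_hat)
    if not np.any(eta_hat > 0):                 # fallback: random base kernel
        rng = rng or np.random.default_rng()
        beta = np.zeros(d); beta[rng.integers(d)] = 1.0
        return beta
    A = Q_hat + lam_m * np.eye(d)
    x0 = np.maximum(eta_hat, 1e-12)
    x0 = x0 / (x0 @ eta_hat)                    # feasible start: x0^T eta_hat = 1
    res = minimize(lambda b: b @ A @ b, x0, jac=lambda b: 2 * A @ b,
                   constraints=[{'type': 'eq', 'fun': lambda b: b @ eta_hat - 1}],
                   bounds=[(0, None)] * d, method='SLSQP')
    beta = np.maximum(res.x, 0)
    return beta / beta.sum()                    # rescale onto the simplex
```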
Convergence guarantee: assume a bounded kernel and σk bounded away from 0. If λm = Θ(m^{−1/3}), then
sup_{k∈K} η̂k σ̂k,λ^{−1} → sup_{k∈K} ηk σk^{−1} in probability.
Idea: bound
| sup_{k∈K} η̂k σ̂k,λ^{−1} − sup_{k∈K} ηk σk^{−1} |
≤ sup_{k∈K} | η̂k σ̂k,λ^{−1} − ηk σk,λ^{−1} | + sup_{k∈K} | ηk σk,λ^{−1} − ηk σk^{−1} |,
where the first term is at most
(√d / (D√λm)) (C1 sup_{k∈K} |η̂k − ηk| + C2 sup_{k∈K} |σ̂k,λ − σk,λ|).
Baselines for choosing k:
– maxmmd: take the single base kernel maximizing η̂u; same as maximizing β⊤η̂ subject to ‖β‖1 ≤ 1
– l2: maximize β⊤η̂ subject to ‖β‖2 ≤ 1
Also compare with: the median heuristic (med) and cross-validation (xval, xvalc).
Difficult problems: the lengthscale of the difference in distributions is not the same as that of the distributions themselves. We distinguish a field of Gaussian blobs with different covariances.
[Figure: blob data sampled from p and from q; ε is the ratio of the largest to the smallest eigenvalue of the blob covariances in q (here ε = 3.2).]
[Figure: Type II error vs. the ratio ε, for max-ratio, l2, maxmmd, xval, xvalc, and med.]
Parameters: m = 10,000 (for training and test); results are averaged over 617 trials.
Feature selection. Idea: there is no single best kernel; each of the ku is univariate (along a single coordinate).
[Figure: the selection data sampled from p and q; Type II error vs. the feature selection dimension, for max-ratio, l2, and maxmmd, comparing the single best kernel with a linear combination.]
m = 10,000, averaged over 5000 trials.
Audio data. Given an audio signal s(t), an amplitude modulated signal can be defined as
u(t) = sin(ωc t) [a s(t) + l]
(a sketch of this modulation follows below). P and Q are two amplitude modulated signals from the same artist (in this case, the Magnetic Fields).
[Figure: samples from P and from Q; Type II error vs. the added noise, for max-ratio, med, l2, and maxmmd.]
m = 10,000 (for training and test), scaling a = 0.5; averaged over 4124 trials.
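For concreteness, a small numpy sketch of this modulation (the sampling rate and carrier frequency are illustrative assumptions):

```python
import numpy as np

def am_modulate(s, fc, fs, a=0.5, l=1.0):
    # u(t) = sin(w_c t) (a s(t) + l), for a signal s sampled at rate fs
    t = np.arange(len(s)) / fs
    return np.sin(2 * np.pi * fc * t) * (a * s + l)
```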
Distances between probability distributions:
Energy distance [Baringhaus and Franz, 2004, Székely and Rizzo, 2004, 2005]:
DE(P, Q) = 2 EP,Q ‖X − Y‖^q − EP ‖X − X′‖^q − EQ ‖Y − Y′‖^q,  0 < q ≤ 2
Maximum mean discrepancy [Gretton et al., 2007, Smola et al., 2007, Gretton et al., 2012a]:
MMD²(P, Q; F) = EP k(X, X′) + EQ k(Y, Y′) − 2 EP,Q k(X, Y)
Energy distance is MMD with a particular kernel! [Sejdinovic et al., 2013b]
Distance covariance (0 < q, r ≤ 2) [Feuerverger, 1993, Székely et al., 2007]:
V²(X, Y) = EXY EX′Y′ [‖X − X′‖^q ‖Y − Y′‖^r] + EX EX′ ‖X − X′‖^q · EY EY′ ‖Y − Y′‖^r − 2 EXY [EX′ ‖X − X′‖^q EY′ ‖Y − Y′‖^r]
Hilbert-Schmidt Independence Criterion [Gretton et al., 2005, Smola et al., 2007, Gretton et al., 2008, Gretton and Gyorfi, 2010]: define an RKHS F on X with kernel k and an RKHS G on Y with kernel l. Then
HSIC(PXY , PX PY ) = EXY EX′Y′ [k(X, X′) l(Y, Y′)] + EX EX′ k(X, X′) · EY EY′ l(Y, Y′) − 2 EX′Y′ [EX k(X, X′) EY l(Y, Y′)]
Distance covariance is HSIC with particular kernels! [Sejdinovic et al., 2013b]
Theorem [Berg et al., 1984, Lemma 2.1, p. 74]: let ρ : X × X → R be a semimetric on X (no triangle inequality required). Fix z0 ∈ X and denote
kρ(z, z′) = ½ (ρ(z, z0) + ρ(z′, z0) − ρ(z, z′)).
Then kρ is positive definite (and by Moore-Aronszajn defines a unique RKHS) iff ρ is of negative type. Call kρ a distance-induced kernel.
Negative type: the semimetric space (Z, ρ) is said to have negative type if ∀n ≥ 2, z1, . . . , zn ∈ Z, and α1, . . . , αn ∈ R with Σ_{i=1}^n αi = 0,
Σ_{i=1}^n Σ_{j=1}^n αi αj ρ(zi, zj) ≤ 0.  (1)
Special case: Z ⊆ Rd with ρq(z, z′) = ‖z − z′‖^q, 0 < q ≤ 2, is of negative type.
Hence: energy distance is MMD with a distance-induced kernel, and distance covariance is HSIC with distance-induced kernels.
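A numerical check of this correspondence (illustrative code; with this kρ, the V-statistic for MMD² equals exactly half the energy-distance V-statistic, which the last line verifies):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((400, 2))
Y = rng.standard_normal((400, 2)) + 0.5

def dist(A, B):          # rho(z, z') = ||z - z'||, Euclidean distance matrix
    return np.sqrt(((A[:, None] - B[None, :]) ** 2).sum(-1))

z0 = np.zeros((1, 2))    # arbitrary base point
def k_rho(A, B):         # the distance-induced kernel
    return 0.5 * (dist(A, z0) + dist(B, z0).T - dist(A, B))

mmd2 = k_rho(X, X).mean() + k_rho(Y, Y).mean() - 2 * k_rho(X, Y).mean()
energy = 2 * dist(X, Y).mean() - dist(X, X).mean() - dist(Y, Y).mean()
print(mmd2, energy / 2)  # the two numbers agree
```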
Two-sample testing example in 1-D:
[Figure: a density P(X) and candidate densities Q(X) used in the example.]
We obtain more powerful tests on this problem when q = 1 (the exponent of the distance).
Further reading:
– Characteristic kernels and mean embeddings: Hilbert space embeddings and metrics on probability measures, JMLR.
– Two-sample, independence, conditional independence tests: see the references below.
– Energy distance, relation to kernel distances: Equivalence of distance-based and RKHS-based statistics in hypothesis testing, Annals of Statistics.
– Three way interaction: [Sejdinovic et al., 2013a].
– Conditional mean embedding, RKHS-valued regression: [. . .] Estimation, NIPS; Foundations of Computational Mathematics; Conditional mean embeddings as regressors, ICML.
– Kernel Bayes rule: a kernel framework for nonparametric inference in graphical models, IEEE Signal Processing Magazine; Kernel Bayes' rule: Bayesian inference with positive definite kernels, JMLR.
Normalised covariance (kernel CCA): write
Cxy = Cxx^{1/2} Vxy Cyy^{1/2}, where ‖Vxy‖S ≤ 1 [JMLR07]
Empirical version:
‖V̂xy‖S := sup_{f∈F, g∈G} ⟨f, Ĉxy g⟩F subject to ⟨f, (Ĉxx + εn I) f⟩F = 1 and ⟨g, (Ĉyy + εn I) g⟩G = 1
– the first canonical correlate
[NIPS07b]:
NOCCO(z; F, G) := ‖V̂xy‖²HS = tr(Rx Ry),
where Rx := K̃x (K̃x + nεn In)^{−1} and Ry := K̃y (K̃y + nεn In)^{−1}, with K̃x, K̃y the centred Gram matrices.
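A direct numpy sketch of this estimator (illustrative; εn is a regularisation parameter to be chosen):

```python
import numpy as np

def nocco(K, L, eps):
    # NOCCO = tr(Rx Ry), with Rx = Kc (Kc + n*eps*I)^{-1} and Kc = HKH
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    Kc, Lc = H @ K @ H, H @ L @ H
    Rx = Kc @ np.linalg.inv(Kc + n * eps * np.eye(n))
    Ry = Lc @ np.linalg.inv(Lc + n * eps * np.eye(n))
    return np.trace(Rx @ Ry)
```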
[Figure: NOCCO dependence witnesses f, g on samples where X and Y have correlation −0.00; the mapped pairs (f(X), g(Y)) attain correlations of 1.00 and 0.97.]
Population limit: NOCCO := ‖Vxy‖²HS, and
NOCCO = ∫∫_{X×Y} ( pxy(x, y) / (px(x) py(y)) − 1 )² px(x) py(y) dµ(x) dµ(y)
– µ(x) and µ(y) Lebesgue measures on X and Y; Pxy absolutely continuous w.r.t. µ(x) × µ(y), with density pxy and marginal densities px and py
Consistency: if εn → 0 and εn³ n → ∞, then ‖V̂xy − Vxy‖HS → 0 in probability.
References
C. Baker. Joint measures and cross-covariance operators. Transactions of the American Mathematical Society, 186:273–289, 1973.
L. Baringhaus and C. Franz. On a new multivariate two-sample test. J. Multivariate Anal., 88:190–206, 2004.
C. Berg, J. P. R. Christensen, and P. Ressel. Harmonic Analysis on Semigroups. Springer, New York, 1984.
A. Caponnetto and E. De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2007.
K. Chwialkowski and A. Gretton. A kernel independence test for random processes. In ICML, 2014.
K. Chwialkowski, D. Sejdinovic, and A. Gretton. A wild bootstrap for degenerate kernel tests. In NIPS, 2014.
A. Feuerverger. A consistent test for bivariate dependence. International Statistical Review, 61(3):419–433, 1993.
K. Fukumizu, A. Gretton, X. Sun, and B. Schölkopf. Kernel measures of conditional dependence. In Advances in Neural Information Processing Systems 20, pages 489–496, Cambridge, MA, 2008. MIT Press.
T. Gärtner, P. A. Flach, A. Kowalczyk, and A. J. Smola. Multi-instance kernels. In Proceedings of the International Conference on Machine Learning, pages 179–186. Morgan Kaufmann Publishers Inc., 2002.
A. Gretton and L. Györfi. Consistent nonparametric tests of independence. Journal of Machine Learning Research, 11:1391–1423, 2010.
A. Gretton, O. Bousquet, A. Smola, and B. Schölkopf. Measuring statistical dependence with Hilbert-Schmidt norms. In S. Jain, H. U. Simon, and E. Tomita, editors, Algorithmic Learning Theory, pages 63–77. Springer-Verlag, 2005.
A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf, and A. Smola. A kernel method for the two-sample problem. In Advances in Neural Information Processing Systems 15, pages 513–520, Cambridge, MA, 2007. MIT Press.
A. Gretton, K. Fukumizu, C.-H. Teo, L. Song, B. Schölkopf, and A. Smola. A kernel statistical test of independence. In Advances in Neural Information Processing Systems 20, pages 585–592, Cambridge, MA, 2008. MIT Press.
A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. JMLR, 13:723–773, 2012a.
A. Gretton, B. Sriperumbudur, D. Sejdinovic, H. Strathmann, S. Balakrishnan, M. Pontil, and K. Fukumizu. Optimal kernel choice for large-scale two-sample tests. In NIPS, 2012b.
A. Gretton. A simpler condition for consistency of a kernel independence test. ArXiv e-prints, 2015.
D. Haussler. Convolution kernels on discrete structures. Technical Report UCSC-CRL-99-10, UC Santa Cruz, 1999.
P. Hoyer, D. Janzing, J. Mooij, J. Peters, and B. Schölkopf. Nonlinear causal discovery with additive noise models. In NIPS, 2009.
W. Jitkrittum, A. Gretton, N. Heess, S. M. A. Eslami, B. Lakshminarayanan, D. Sejdinovic, and Z. Szabó. Kernel-based just-in-time learning for passing expectation propagation messages. In UAI, 2015.
D. Lopez-Paz, K. Muandet, B. Schölkopf, and I. Tolstikhin. Towards a learning theory of cause-effect inference. In ICML, 2015.
A. Smola, A. Gretton, L. Song, and B. Schölkopf. A Hilbert space embedding for distributions. In Proceedings of the International Conference on Algorithmic Learning Theory, volume 4754, pages 13–31. Springer, 2007.
B. Sriperumbudur, K. Fukumizu, A. Gretton, A. Hyvärinen, and R. Kumar. Density estimation in infinite dimensional exponential families. Technical Report 1312.3516, ArXiv e-prints, 2014.
H. Strathmann, D. Sejdinovic, S. Livingstone, Z. Szabó, and A. Gretton. Gradient-free Hamiltonian Monte Carlo with efficient kernel exponential families. ArXiv e-prints, 2015.
Z. Szabó, A. Gretton, B. Póczos, and B. Sriperumbudur. Two-stage sampled learning theory on distributions. In AISTATS, 2015.
G. Székely and M. Rizzo. Testing for equal distributions in high dimension. InterStat, 5, 2004.
G. Székely and M. Rizzo. A new test for multivariate normality. J. Multivariate Anal., 93:58–80, 2005.
G. Székely, M. Rizzo, and N. Bakirov. Measuring and testing dependence by correlation of distances. Ann. Stat., 35(6):2769–2794, 2007.
K. Zhang, J. Peters, D. Janzing, and B. Schölkopf. Kernel-based conditional independence test and application in causal discovery. In 27th Conference on Uncertainty in Artificial Intelligence, pages 804–813, 2011.