Measuring Dependence and Conditional Dependence with Kernels
Kenji Fukumizu
The Institute of Statistical Mathematics, Japan
June 25, 2014. ICML 2014, Causality Workshop
– Constraint-based methods for causal structure learning are based on measuring or testing (conditional) dependence, e.g. the PC algorithm (Spirtes et al. 1991, 2001), which runs a sequence of statistical (conditional) independence tests.
[Figure: example causal graph over four variables]
– Tests for structure learning may involve many variables.
– (Conditional) independence tests on continuous, high-dimensional domains are not easy: classical nonparametric approaches (KDE, smoothing kernels, ...) require a large data size and suffer from high dimensionality.
– Linear correlations may not be sufficient for complex relations.
– As building blocks of causal learning, this talk discusses kernel methods for measuring (in)dependence and conditional (in)dependence.
1. Introduction
2. Kernel measures for independence
3. Relations with distance covariance
4. How to choose a kernel
5. Conditional independence
6. Conclusions
– Feature map: a positive definite kernel on the space of original data Ω defines a feature map Φ: Ω → H into the feature space (RKHS) H, sending data points x_i, x_j to feature vectors Φ(x_i), Φ(x_j). Linear analysis is then done in the feature space.
[Figure: feature map Φ from the space of original data to the feature space (RKHS)]
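As a minimal sketch of this idea (assuming a Gaussian kernel and NumPy; the data and names are ours, not from the talk), note that the feature vectors are never formed explicitly; all feature-space inner products live in the Gram matrix:

```python
import numpy as np

def gaussian_gram(X, sigma=1.0):
    """Gram matrix K[i, j] = k(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma**2))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))   # 5 data points in R^3
K = gaussian_gram(X)
# K[i, j] equals the RKHS inner product <Phi(x_i), Phi(x_j)>:
# linear analysis in the feature space only ever touches K.
```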
– Empirical comparison: pos. def. kernel vs. smoothing kernel, on nonparametric regression Y = f(X) + ε with X ~ N(0, I_d) and noise ε ~ N(0, 0.1):
  kernel method (Gaussian kernel) vs.
  local linear regression (Epanechnikov kernel; 'locfit' in R is used).
  n = 100, 500 runs; bandwidth parameters are chosen by CV.
– Theory?
[Plot: mean square errors vs. dimension of X (2 to 10), kernel method vs. local linear]
X: random variable taking values in Ω; k: pos. def. kernel on Ω. The feature map defines an RKHS-valued random variable Φ(X), and the kernel mean
  m_X := E[Φ(X)] = E[k(·, X)]
represents the probability distribution of X.
– The kernel mean can express higher-order moments of X: suppose k(u, x) = c_0 + c_1 (ux) + c_2 (ux)² + ⋯ with c_i > 0; then
  m_X(u) = c_0 + c_1 E[X] u + c_2 E[X²] u² + ⋯
(c.f. moment generating function).
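A quick numerical check of the kernel mean (a sketch under assumptions of our own: Gaussian kernel exp(−(u − x)²/2) and X ~ N(0, 1), for which m_X(u) = E[k(u, X)] has the closed form exp(−u²/4)/√2 by Gaussian convolution):

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=50000)                    # X ~ N(0, 1)
u = 0.7
m_hat = np.mean(np.exp(-(u - X) ** 2 / 2.0))  # empirical kernel mean at u
m_true = np.exp(-u**2 / 4.0) / np.sqrt(2.0)   # E[k(u, X)] in closed form
```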
X ∼ P, Y ∼ Q (two probabilities on Ω); k: pos. def. kernel on Ω.
  MMD(P, Q) := ‖m_P − m_Q‖_H = sup_{f ∈ H, ‖f‖_H ≤ 1} | E[f(X)] − E[f(Y)] |
– For a characteristic kernel, MMD(P, Q) = 0 if and only if P = Q; e.g. Gaussian and Laplace kernels. The kernel mean then determines the distribution uniquely, and MMD is a metric on the probabilities.
– Interpretation: comparing the moments of P and Q through various functions f.
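The squared MMD expands as E[k(X, X′)] + E[k(Y, Y′)] − 2E[k(X, Y)], which gives a simple estimator. A sketch of the standard unbiased estimate with a Gaussian kernel (sample sizes, seed, and bandwidth are our choices, not from the talk):

```python
import numpy as np

def cross_gram(X, Y, sigma=1.0):
    d2 = (np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :]
          - 2.0 * X @ Y.T)
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma**2))

def mmd2_unbiased(X, Y, sigma=1.0):
    """Unbiased estimate of MMD^2 = E k(X,X') + E k(Y,Y') - 2 E k(X,Y)."""
    n, m = len(X), len(Y)
    Kxx, Kyy, Kxy = (cross_gram(X, X, sigma), cross_gram(Y, Y, sigma),
                     cross_gram(X, Y, sigma))
    # drop the diagonals so the first two expectations use independent copies
    term_x = (Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
    term_y = (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
    return term_x + term_y - 2.0 * Kxy.mean()

rng = np.random.default_rng(0)
same = mmd2_unbiased(rng.normal(size=(500, 1)), rng.normal(size=(500, 1)))
diff = mmd2_unbiased(rng.normal(size=(500, 1)), rng.normal(2.0, 1.0, size=(500, 1)))
# `same` is near zero (equal distributions); `diff` is clearly positive
```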
(X, Y): random vector taking values in X × Y; (H_X, k_X), (H_Y, k_Y): RKHSs on X and Y, resp. Compare the joint probability and the product of the marginals:
  HSIC(X, Y) := ‖ m_XY − m_X ⊗ m_Y ‖²_{H_X ⊗ H_Y}
Assume the product kernel k_X k_Y is characteristic on X × Y. Then HSIC(X, Y) = 0 if and only if X ⊥⊥ Y.
– Operator expression: the cross-covariance operator Σ_YX : H_X → H_Y is defined by
  ⟨g, Σ_YX f⟩ = Cov[f(X), g(Y)]   for all f ∈ H_X, g ∈ H_Y.
It is simply an extension of the covariance matrix (a linear map between RKHSs). Regarded as an element of H_Y ⊗ H_X, Σ_YX = m_YX − m_Y ⊗ m_X.
– HSIC(X, Y) = ‖ Σ_YX ‖²_HS (squared Hilbert–Schmidt norm).
– Population expression, with (X′, Y′), (X″, Y″) independent copies of (X, Y):
  HSIC(X, Y) = E[k_X(X, X′) k_Y(Y, Y′)] + E[k_X(X, X′)] E[k_Y(Y, Y′)] − 2 E[ E[k_X(X, X′) | X] · E[k_Y(Y, Y′) | Y] ]
– Hilbert–Schmidt norm: for ONBs {φ_i}, {ψ_j} of H_X and H_Y (resp.),
  ‖Σ_YX‖²_HS = Σ_{i,j} ⟨ψ_j, Σ_YX φ_i⟩²   (same as the Frobenius norm for matrices).
– Empirical estimator (Gram matrix expression):
  HSIC_emp(X, Y) = (1/n²) Tr[ K_X H K_Y H ],   H := I_n − (1/n) 1_n 1_nᵀ,
where (K_X)_ij = k_X(X_i, X_j) and (K_Y)_ij = k_Y(Y_i, Y_j).
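The Gram-matrix expression translates directly into code (a sketch; Gaussian kernels, bandwidths, and the toy data are our own choices):

```python
import numpy as np

def gaussian_gram(X, sigma=1.0):
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma**2))

def hsic(X, Y, sigma_x=1.0, sigma_y=1.0):
    """Empirical HSIC = (1/n^2) Tr[K_X H K_Y H], H = I - (1/n) 1 1^T."""
    n = len(X)
    H = np.eye(n) - np.full((n, n), 1.0 / n)
    return float(np.trace(gaussian_gram(X, sigma_x) @ H
                          @ gaussian_gram(Y, sigma_y) @ H)) / n**2

rng = np.random.default_rng(1)
x = rng.normal(size=(200, 1))
indep = hsic(x, rng.normal(size=(200, 1)))           # independent Y
dep = hsic(x, x + 0.1 * rng.normal(size=(200, 1)))   # Y is a noisy copy of X
```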
Given (X₁, Y₁), …, (X_n, Y_n) ~ P_XY i.i.d., use the test statistic n · HSIC_emp(X, Y).
Theorem: null distribution (Gretton, Fukumizu, et al., NIPS 2007). If X and Y are independent, then
  n · HSIC_emp(X, Y) ⇒ Σ_i λ_i Z_i²   (in law, as n → ∞),
where Z_i are i.i.d. ~ N(0, 1) and λ_i are the eigenvalues of the integral operator associated with the (centered) kernel h of the corresponding U-statistic.
Theorem: consistency of test (Gretton, Fukumizu, et al., NIPS 2007). If HSIC(X, Y) > 0, then
  √n ( HSIC_emp(X, Y) − HSIC(X, Y) ) ⇒ N(0, σ²)   (in law),
so n · HSIC_emp(X, Y) → ∞ as n → ∞ and the test is consistent.
– How to compute the critical region for a given significance level:
  The eigenvalues λ_i can be estimated with the Gram matrices.
  Gamma approximation by moment matching (Gretton, Fukumizu et al., NIPS 2007).
  Permutation test: always possible, but time consuming.
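The permutation option can be sketched as follows (our own implementation choices: Gaussian kernels, 200 permutations; permuting Y breaks any dependence while preserving the marginals, which simulates the null):

```python
import numpy as np

def gaussian_gram(X, sigma=1.0):
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma**2))

def hsic_stat(Kx, Ky):
    n = len(Kx)
    H = np.eye(n) - np.full((n, n), 1.0 / n)
    return float(np.trace(Kx @ H @ Ky @ H)) / n**2

def hsic_permutation_test(X, Y, n_perm=200, seed=0):
    """p-value by permuting Y; exact null simulation, but n_perm statistic
    evaluations (hence 'time consuming')."""
    rng = np.random.default_rng(seed)
    Kx, Ky = gaussian_gram(X), gaussian_gram(Y)
    observed = hsic_stat(Kx, Ky)
    count = 0
    for _ in range(n_perm):
        p = rng.permutation(len(Y))
        count += hsic_stat(Kx, Ky[np.ix_(p, p)]) >= observed
    return (count + 1) / (n_perm + 1)

rng = np.random.default_rng(2)
x = rng.normal(size=(100, 1))
p_dep = hsic_permutation_test(x, x + 0.1 * rng.normal(size=(100, 1)))
p_ind = hsic_permutation_test(x, rng.normal(size=(100, 1)))
# p_dep is small (reject independence); p_ind behaves like a null p-value
```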
– Experiment (type II errors): X, Y: 1-dim signal + noise components. Compared: HSIC (Gamma approx.) vs. the power-divergence test (λ = 3/2) with discretization (equi-probable partitions); each dimension is partitioned into parts, giving cells A ∈ A.
  Power-divergence statistic:
    T_λ = (2 / (λ(λ + 1))) Σ_A p̂_A [ ( p̂_A / (p̂_A^(1) ⋯ p̂_A^(m)) )^λ − 1 ],
  where p̂_A is the frequency in cell A and p̂_A^(k) is the marginal frequency in the k-th dimension.
  λ → 0: mutual information; λ = 1: χ²-divergence (mean square contingency).
[Plot: type II errors of HSIC vs. power-divergence tests]
– Distance covariance (distance correlation) is a recent dependence measure (Székely, Rizzo & Bakirov, AoS 2007). It is very popular in the statistics community.
– HSIC is closely related to (in fact, more general than) dCov.
Def. X, Y: random vectors (on Euclidean spaces); (X′, Y′), (X″, Y″): independent copies of (X, Y):
  dCov²(X, Y) := E[‖X − X′‖ ‖Y − Y′‖] + E[‖X − X′‖] E[‖Y − Y′‖] − 2 E[ E[‖X − X′‖ | X] · E[‖Y − Y′‖ | Y] ]
Note: the distance ‖x − y‖ is NOT a positive definite kernel.
For ρ a semi-metric on Ω (ρ(x, y) = ρ(y, x), and ρ(x, y) ≥ 0 with equality iff x = y), define the generalized distance covariance dCov_{ρ_X, ρ_Y} by replacing the Euclidean distances above with ρ_X, ρ_Y.
Theorem (Sejdinovic et al., AoS 2013). Assume ρ is of negative type, i.e., Σ_{i,j} c_i c_j ρ(x_i, x_j) ≤ 0 whenever Σ_i c_i = 0. Then
  k(x, x′) := ½ ( ρ(x, x₀) + ρ(x′, x₀) − ρ(x, x′) )
is positive definite, and with k_X and k_Y induced by ρ_X and ρ_Y, resp.,
  dCov²_{ρ_X, ρ_Y}(X, Y) = 4 HSIC(X, Y).
Example: ρ(x, x′) = ‖x − x′‖^α (0 < α ≤ 2) induces k(x, x′) = ½ ( ‖x‖^α + ‖x′‖^α − ‖x − x′‖^α )  (taking x₀ = 0).
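This equivalence can be checked numerically: under the centering H, the additive terms ρ(x, x₀), ρ(x′, x₀) of the induced kernel vanish, so HKH = −½ HDH with D the distance matrix. A sketch with toy data of our own choosing (α = 1, i.e. Euclidean distance):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 2))
Y = X[:, :1] ** 2 + rng.normal(size=(50, 1))   # nonlinear dependence

def dist(Z):
    sq = np.sum(Z**2, axis=1)
    return np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T, 0.0))

n = len(X)
H = np.eye(n) - np.full((n, n), 1.0 / n)

# Empirical distance covariance: (1/n^2) sum_ij A_ij B_ij,
# with A, B the double-centered distance matrices.
A, B = H @ dist(X) @ H, H @ dist(Y) @ H
dcov2 = (A * B).sum() / n**2

# HSIC with the induced kernel: the terms d(x, x0), d(x', x0) are
# annihilated by centering, so effectively K = -(1/2) D.
Kx, Ky = -0.5 * dist(X), -0.5 * dist(Y)
hsic = np.trace(Kx @ H @ Ky @ H) / n**2
```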
– Experiment: data with density p(x, y) ∝ 1 + sin(ℓx) sin(ℓy); larger ℓ makes the dependence harder to detect.
[Plots (A), (B): test performance from easier (clearly dependent) to harder (nearly independent) cases]
– The power of a test depends on the choice of kernel, e.g. the bandwidth σ in the Gaussian kernel exp(−‖x − y‖²/(2σ²)).
– Common heuristics (e.g. the median of pairwise distances): no theoretical optimality, but empirically good.
– The kernel choice method below was developed for a simple (linear-time) version of MMD; it may be extended to HSIC.
X₁, …, X_n ~ P and Y₁, …, Y_n ~ Q, i.i.d.
  MMD²(P, Q) = E[k(X, X′)] + E[k(Y, Y′)] − 2 E[k(X, Y)]
Linear-time estimator:
  L-MMD(X, Y) := (2/n) Σ_{i=1}^{n/2} h( (X_{2i−1}, Y_{2i−1}), (X_{2i}, Y_{2i}) ),
  h( (x, y), (x′, y′) ) := k(x, x′) + k(y, y′) − k(x, y′) − k(x′, y)
– A consistent estimator of MMD²(P, Q), computable in O(n) time.
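The linear-time estimator can be sketched as follows (sample sizes, seed, and Gaussian kernel bandwidth are our choices):

```python
import numpy as np

def k(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2, axis=-1) / (2.0 * sigma**2))

def linear_mmd(X, Y, sigma=1.0):
    """Average of h((x,y),(x',y')) = k(x,x') + k(y,y') - k(x,y') - k(x',y)
    over the disjoint pairs (sample 2i-1, sample 2i): O(n) time and memory."""
    h = (k(X[0::2], X[1::2], sigma) + k(Y[0::2], Y[1::2], sigma)
         - k(X[0::2], Y[1::2], sigma) - k(X[1::2], Y[0::2], sigma))
    return h.mean()

rng = np.random.default_rng(4)
stat_null = linear_mmd(rng.normal(size=(2000, 1)), rng.normal(size=(2000, 1)))
stat_alt = linear_mmd(rng.normal(size=(2000, 1)), rng.normal(2.0, 1.0, size=(2000, 1)))
# stat_null is near zero; stat_alt estimates a clearly positive MMD^2
```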
– Asymptotic normality: √n ( L-MMD(X, Y) − MMD²(P, Q) ) ⇒ N(0, 2σ²), with σ² = Var[ h((X, Y), (X′, Y′)) ].
– t_α: threshold for level α, Pr( L-MMD > t_α ; H₀ ) = α.
– Under the alternative (MMD(P, Q) > 0), the type II error behaves as
  Pr( L-MMD ≤ t_α ) → Φ( Φ⁻¹(1 − α) − √n MMD²(P, Q) / (√2 σ) ).
– To minimize the type II error, choose the kernel maximizing the ratio
  max_{k ∈ K} L-MMD_k(X, Y) / σ̂_k.
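The ratio criterion amounts to a simple grid search over candidate kernels (a sketch; the bandwidth grid, data, and the small guard constant in the denominator are our own illustration):

```python
import numpy as np

def k(x, y, sigma):
    return np.exp(-np.sum((x - y) ** 2, axis=-1) / (2.0 * sigma**2))

def ratio(X, Y, sigma):
    """Criterion L-MMD / sigma_hat, from the per-pair h values."""
    h = (k(X[0::2], X[1::2], sigma) + k(Y[0::2], Y[1::2], sigma)
         - k(X[0::2], Y[1::2], sigma) - k(X[1::2], Y[0::2], sigma))
    return h.mean() / (h.std(ddof=1) + 1e-12)   # guard against sigma_hat = 0

rng = np.random.default_rng(5)
X = rng.normal(size=(1000, 1))
Y = rng.normal(0.5, 1.0, size=(1000, 1))        # mean-shifted alternative
grid = [0.01, 0.1, 1.0, 10.0]
best = max(grid, key=lambda s: ratio(X, Y, s))  # bandwidth maximizing the criterion
```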
– Example: two AM (amplitude-modulated) signals p and q (songs with different instruments), carrier cos(ω t); Gaussian kernels with different bandwidths are compared.
[Plots: AM signal from p, AM signal from q]
– Conditional cross-covariance operator:
  Σ_YX|Z := Σ_YX − Σ_YZ Σ_ZZ⁻¹ Σ_ZX
– Normalization: the decomposition Σ_YX = Σ_YY^{1/2} V_YX Σ_XX^{1/2} with a bounded operator V_YX is possible (Baker 1973).
– Assume the kernels are characteristic. Σ_YX|Z = O is weaker than the conditional independence X ⊥⊥ Y ∣ Z. With the paired variables Ẍ := (X, Z), Ÿ := (Y, Z) (product kernels are used), Σ_ŸẌ|Z = O if and only if X ⊥⊥ Y ∣ Z.
– Conditional independence measure:
  HSCONIC(X, Y | Z) := ‖ Σ_ŸẌ|Z ‖²_HS
– Empirical estimator HSCONIC_emp(X, Y | Z): built from the centered Gram matrices of Ẍ, Ÿ, Z with regularized inverses (K̃ + n ε_n I)⁻¹; ε_n: regularization coefficient.
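A heavily hedged sketch of a regularized conditional-dependence statistic in this spirit (following the trace formula of Fukumizu et al.'s work on kernel measures of conditional dependence; the exact estimator in the talk may differ, and all kernels, bandwidths, and constants here are our assumptions):

```python
import numpy as np

def gram(Z, sigma=1.0):
    """Gaussian Gram matrix (bandwidth is our choice)."""
    sq = np.sum(Z**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma**2))

def cond_dep_stat(X, Y, Z, eps=1e-3):
    """With centered Gram matrices Kc and R := Kc (Kc + n*eps*I)^{-1},
    return Tr[R_Y R_X - 2 R_Y R_X R_Z + R_Y R_Z R_X R_Z], where the
    augmented variables (X, Z) and (Y, Z) use product kernels."""
    n = len(Z)
    H = np.eye(n) - np.full((n, n), 1.0 / n)
    def R(K):
        Kc = H @ K @ H
        return Kc @ np.linalg.inv(Kc + n * eps * np.eye(n))
    Rx = R(gram(X) * gram(Z))   # product kernel for the paired variable (X, Z)
    Ry = R(gram(Y) * gram(Z))   # product kernel for the paired variable (Y, Z)
    Rz = R(gram(Z))
    return float(np.trace(Ry @ Rx - 2.0 * Ry @ Rx @ Rz + Ry @ Rz @ Rx @ Rz))

rng = np.random.default_rng(6)
Z = rng.normal(size=(100, 1))
X = Z + 0.3 * rng.normal(size=(100, 1))
Y_ci = Z + 0.3 * rng.normal(size=(100, 1))    # X and Y_ci independent given Z
Y_dep = X + 0.3 * rng.normal(size=(100, 1))   # still depends on X given Z
ci, dep = cond_dep_stat(X, Y_ci, Z), cond_dep_stat(X, Y_dep, Z)
```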
– The estimator is consistent, but its asymptotic distribution is NOT known.
– A permutation test for continuous variables is not straightforward: permuting data within groups of similar Z values can simulate conditionally independent data, but this does not give rigorous conditional independence.
– HSIC and HSCONIC are defined via the kernel mean embedding of probabilities.
– They show better performance than classical methods in high-dimensional cases.
– HSIC includes the distance covariance, a recently popular independence measure in statistics, as a special case.
– For the linear-time MMD, a kernel can be chosen so that the test power is asymptotically maximized.