

SLIDE 1

Measuring Dependence and Conditional Dependence with Kernels

Kenji Fukumizu
The Institute of Statistical Mathematics, Japan

ICML 2014, Causality Workshop, June 25, 2014

SLIDE 2

Introduction

SLIDE 3

Dependence Measures

• Dependence measures and causality

– Constraint-based methods for causal structure learning rely on measuring or testing (conditional) dependence, e.g., the PC algorithm (Spirtes et al. 1991, 2001) with (conditional) independence tests such as χ²-tests.

[Figure: example DAG on four variables (1-4) with the conditional independence statements to be tested, e.g., X ⊥ Y ∣ Z, etc.]

slide-4
SLIDE 4

• Problems

– Tests for structure learning may involve many variables.
– (Conditional) independence tests for continuous, high-dimensional domains are not easy.
  • Discretization causes many bins, requiring a large data size.
  • Nonparametric methods (KDE, smoothing kernels, ...) are often weak for high dimensionality.
– Linear correlations may not be sufficient for complex relations.

[Figure: discretization of a two-dimensional domain into bins]
SLIDE 5

• This talk

– As building blocks of causal learning, kernel methods for measuring (in)dependence and conditional (in)dependence are discussed.

SLIDE 6

Outline

1. Introduction
2. Kernel measures for independence
3. Relations with distance covariance
4. How to choose a kernel
5. Conditional independence
6. Conclusions

SLIDE 7

Kernel measures for independence

SLIDE 8

Kernel methods

• Feature map and kernel methods

– Feature map $\Phi: \Omega \to H$, $x \mapsto \Phi(x)$: the data $x_1, \ldots, x_n$ are mapped to the feature vectors $\Phi(x_1), \ldots, \Phi(x_n)$.

[Figure: the feature map $\Phi$ sends $x_i, x_j$ in the space of original data to $\Phi(x_i), \Phi(x_j)$ in the feature space $H$ (RKHS)]

– Do linear analysis in the feature space. With a positive definite kernel $k$, inner products in $H$ are computed as $\langle \Phi(x_i), \Phi(x_j) \rangle_H = k(x_i, x_j)$.
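As a small, self-contained illustration (my addition, not from the slides): with a Gaussian kernel, the Gram matrix collects exactly the feature-space inner products $\langle \Phi(x_i), \Phi(x_j) \rangle = k(x_i, x_j)$, which is all that linear analysis in the RKHS ever needs.

```python
import numpy as np

def gaussian_gram(X, sigma=1.0):
    """Gram matrix G[i, j] = k(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 sigma^2)).

    Each entry equals the RKHS inner product <Phi(x_i), Phi(x_j)>, so linear
    methods in the feature space only ever need this matrix."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # ||x_i - x_j||^2
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))

X = np.random.randn(5, 3)   # five data points in R^3
G = gaussian_gram(X)        # 5 x 5, symmetric positive semi-definite
```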

SLIDE 9

• Do kernel methods work well for high-dimensional data?

– Empirical comparison: positive definite kernel vs. smoothing kernel, on nonparametric regression $Y = f(X) + \varepsilon$ with $X \sim N(0, I_d)$ and $\varepsilon \sim N(0, 0.1)$:

  • Kernel ridge regression

(Gaussian kernel)

  • Local linear regression

(Epanechnikov kernel, ‘locfit’ in R is used)

$n = 100$, 500 runs. Bandwidth parameters are chosen by CV.

– Theory?

[Figure: mean square errors vs. dimension of X (2 to 10) for the kernel method and local linear regression]

SLIDE 10

Representing probabilities

$X$: random variable taking values on $\Omega$. $k$: positive definite kernel on $\Omega$. The feature map defines an RKHS-valued random variable $\Phi(X)$. The kernel mean $E[\Phi(X)]$ represents the probability distribution $P$ of $X$:

$m_P := E[\Phi(X)] = E[k(\cdot, X)]$

– The kernel mean can express higher-order moments of $X$: suppose $k(u, x) = c_0 + c_1 (ux) + c_2 (ux)^2 + \cdots$ with $c_i > 0$; then $m_P(u) = c_0 + c_1 E[X] u + c_2 E[X^2] u^2 + \cdots$, c.f. the moment generating function.
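A minimal sketch (my addition) of the empirical counterpart: the kernel mean is estimated by $\widehat m_P = \frac{1}{n} \sum_i k(\cdot, X_i)$, which can be evaluated at any point $u$; the Gaussian kernel and the names are illustrative.

```python
import numpy as np

def empirical_kernel_mean(X, sigma=1.0):
    """Return u -> (1/n) sum_i k(u, X_i), the empirical kernel mean
    evaluated pointwise, for a Gaussian kernel with bandwidth sigma."""
    def m_hat(u):
        d2 = np.sum((X - u) ** 2, axis=1)            # ||X_i - u||^2
        return np.mean(np.exp(-d2 / (2.0 * sigma ** 2)))
    return m_hat

X = np.random.randn(500, 2)          # sample from P
m_hat = empirical_kernel_mean(X)
print(m_hat(np.zeros(2)))            # embedding evaluated at u = 0
```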

SLIDE 11

Comparing two probabilities

• MMD (Maximum Mean Discrepancy; Gretton et al. 2005)

$X \sim P$, $Y \sim Q$ (two probabilities on $\Omega$). $k$: positive definite kernel on $\Omega$.

$\mathrm{MMD}(P, Q) := \| m_P - m_Q \|_{\mathcal H} = \sup_{f \in \mathcal H, \|f\| \le 1} \langle f, m_P - m_Q \rangle = \sup_{f \in \mathcal H, \|f\| \le 1} \big( E[f(X)] - E[f(Y)] \big)$

– Characteristic kernels are defined so that $\mathrm{MMD}(P, Q) = 0$ if and only if $P = Q$, e.g., Gaussian and Laplace kernels. The kernel mean then determines the distribution uniquely, and MMD is a metric on the probabilities.


Comparing the moments through various functions
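A sketch (my addition) of the standard plug-in estimator of $\mathrm{MMD}^2$ via Gram matrices; this is the biased V-statistic form, with a Gaussian kernel assumed.

```python
import numpy as np

def mmd2_biased(X, Y, sigma=1.0):
    """Biased (V-statistic) estimate of MMD(P, Q)^2:
    mean(k(X, X)) + mean(k(Y, Y)) - 2 * mean(k(X, Y))."""
    def gram(A, B):
        d2 = (np.sum(A ** 2, 1)[:, None] + np.sum(B ** 2, 1)[None, :]
              - 2.0 * A @ B.T)
        return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))
    return gram(X, X).mean() + gram(Y, Y).mean() - 2.0 * gram(X, Y).mean()

X = np.random.randn(200, 2)           # sample from P
Y = np.random.randn(200, 2) + 1.0     # sample from Q (mean-shifted)
print(mmd2_biased(X, Y))              # clearly positive here
```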

SLIDE 12

HSIC: Independence measure

• Hilbert-Schmidt Independence Criterion (HSIC)

$(X, Y)$: random vector taking values on $\Omega_X \times \Omega_Y$. $(\mathcal H_X, k_X)$, $(\mathcal H_Y, k_Y)$: RKHSs on $\Omega_X$ and $\Omega_Y$, resp. Compare the joint probability $P_{XY}$ and the product of the marginals $P_X \otimes P_Y$:

  • Def.

$\mathrm{HSIC}(X, Y) := \mathrm{MMD}(P_{XY}, P_X \otimes P_Y)^2$

  • Theorem

Assume the product kernel $k_X k_Y$ is characteristic on $\Omega_X \times \Omega_Y$. Then $\mathrm{HSIC}(X, Y) = 0$ if and only if $X \perp\!\!\!\perp Y$.

SLIDE 13

Covariance operator


Operator expression:

  • Def. Covariance operators $\Sigma_{YX}: \mathcal H_X \to \mathcal H_Y$, $\Sigma_{XX}: \mathcal H_X \to \mathcal H_X$:

$\langle g, \Sigma_{YX} f \rangle_{\mathcal H_Y} = \mathrm{Cov}[f(X), g(Y)] \qquad \forall f \in \mathcal H_X,\ g \in \mathcal H_Y$

Simply, an extension of the covariance matrix (a linear map).

[Figure: $X \mapsto \Phi_X(X) \in \mathcal H_X$ and $Y \mapsto \Phi_Y(Y) \in \mathcal H_Y$]

  • $\Sigma_{YX}$ can also be regarded as an element of the tensor product $\mathcal H_X \otimes \mathcal H_Y$:

$\langle \Sigma_{YX}, f \otimes g \rangle_{\mathcal H_X \otimes \mathcal H_Y} = \mathrm{Cov}[f(X), g(Y)] \qquad \forall f \otimes g \in \mathcal H_X \otimes \mathcal H_Y$

SLIDE 14

Expressions of HSIC

– $\mathrm{HSIC}(X, Y) = \| \Sigma_{YX} \|_{HS}^2$

  • Hilbert-Schmidt norm: $\|A\|_{HS}^2 := \sum_i \sum_j \langle \psi_j, A \phi_i \rangle^2$ for $A: \mathcal H_X \to \mathcal H_Y$, where $\{\phi_i\}$, $\{\psi_j\}$: ONBs of $\mathcal H_X$ and $\mathcal H_Y$ (resp.) (same as the Frobenius norm).

– Population expression:

$\mathrm{HSIC}(X, Y) = E[k_X(X, \tilde X) k_Y(Y, \tilde Y)] - 2 E\big[ E[k_X(X, \tilde X) \mid X]\, E[k_Y(Y, \tilde Y) \mid Y] \big] + E[k_X(X, \tilde X)]\, E[k_Y(Y, \tilde Y)]$,

where $(\tilde X, \tilde Y)$: an independent copy of $(X, Y)$.

– Empirical estimator (Gram matrix expression): given $(X_1, Y_1), \ldots, (X_n, Y_n) \sim P_{XY}$ i.i.d.,

$\widehat{\mathrm{HSIC}}(X, Y) = \frac{1}{n^2} \mathrm{Tr}[G_X H G_Y H]$,

where $(G_X)_{ij} := k_X(X_i, X_j)$, $(G_Y)_{ij} := k_Y(Y_i, Y_j)$, and $H := I_n - \frac{1}{n} \mathbf{1}\mathbf{1}^T$ (centering).

• Test statistic: $n \, \widehat{\mathrm{HSIC}}(X, Y)$.
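A sketch (my addition) of the Gram-matrix estimator above, $\widehat{\mathrm{HSIC}} = \frac{1}{n^2}\mathrm{Tr}[G_X H G_Y H]$, assuming Gaussian kernels with illustrative bandwidths.

```python
import numpy as np

def gaussian_gram(X, sigma):
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))

def hsic(X, Y, sigma_x=1.0, sigma_y=1.0):
    """Empirical HSIC = Tr[Gx H Gy H] / n^2, with centering matrix
    H = I - (1/n) 1 1^T."""
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    Gx = gaussian_gram(X, sigma_x)
    Gy = gaussian_gram(Y, sigma_y)
    return np.trace(Gx @ H @ Gy @ H) / n ** 2

X = np.random.randn(200, 1)
Y = X + 0.3 * np.random.randn(200, 1)   # Y depends on X
print(hsic(X, Y))                       # larger than for independent data
```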

SLIDE 15

Independence test with HSIC

Theorem: null distribution (Gretton, Fukumizu, et al. NIPS 2007)
If $X$ and $Y$ are independent, then

$n \, \widehat{\mathrm{HSIC}}(X, Y) \Rightarrow \sum_i \lambda_i Z_i^2 \qquad (n \to \infty)$ in law,

where $Z_i$: i.i.d. $\sim N(0, 1)$ and $\{\lambda_i\}$ are the eigenvalues of an integral operator.

Theorem: consistency of test (Gretton, Fukumizu, et al. NIPS 2007)
If $\mathrm{HSIC}(X, Y) > 0$, then

$\sqrt{n}\, \big( \widehat{\mathrm{HSIC}}(X, Y) - \mathrm{HSIC}(X, Y) \big) \Rightarrow N(0, \sigma^2) \qquad (n \to \infty)$ in law,

where $\sigma^2 = 16 \Big( E_{z_1}\big[ \big( E_{z_2, z_3, z_4}[ h(z_1, z_2, z_3, z_4) ] \big)^2 \big] - \mathrm{HSIC}(X, Y)^2 \Big)$, with $z_i = (X_i, Y_i)$ and $h$ the U-statistic kernel of HSIC.

SLIDE 16

• Independence test with HSIC:

– How to compute the critical region for a given significance level?

  • Simulation of the null distribution (Gretton, Fukumizu et al. NIPS 2009): the eigenvalues can be estimated with the Gram matrices.
  • Approximation with a two-parameter Gamma distribution by moment matching (Gretton, Fukumizu et al. NIPS 2007).
  • Permutation test / bootstrap: always possible, but time consuming (see the sketch below).
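A sketch (my addition) of the permutation option: under the null, shuffling $Y$ destroys any dependence with $X$, so re-computing the statistic on permuted data simulates the null distribution. `hsic` stands for an estimator such as the Gram-matrix version sketched under SLIDE 14.

```python
import numpy as np

def hsic_permutation_test(X, Y, hsic, n_perm=500, alpha=0.05, seed=None):
    """Permutation test of independence based on an HSIC estimator.

    Shuffling Y breaks the X-Y coupling while preserving the marginals,
    which simulates the null. Returns (reject, p_value)."""
    rng = np.random.default_rng(seed)
    stat = hsic(X, Y)
    null = np.array([hsic(X, Y[rng.permutation(len(Y))])
                     for _ in range(n_perm)])
    p_value = (1 + np.sum(null >= stat)) / (1 + n_perm)
    return p_value < alpha, p_value
```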

SLIDE 17

Experiments: independence test

– $X$, $Y$: 1 dim signal + noise components.
– Compared: HSIC (Gamma approx.) vs. power divergence ($\lambda = 3/2$) with discretization (equi-probable bins).

[Figure: Type II errors of the two tests]

SLIDE 18

• Power divergence

Each dimension is partitioned into $q$ parts; $A$ ranges over the resulting bins.

$T_\lambda = \frac{2}{\lambda(\lambda + 1)} \sum_{A} \hat p_A \left[ \left( \frac{\hat p_A}{\prod_k \hat p_A^{(k)}} \right)^{\lambda} - 1 \right]$

$\hat p_A$: frequency in bin $A$; $\hat p_A^{(k)}$: marginal frequency in the $k$-th dimension.

$\lambda \to 0$: mutual information. $\lambda = 1$: $\chi^2$-divergence (mean square contingency).

SLIDE 19

Relation to distance covariance

SLIDE 20

Distance covariance

– Distance covariance (distance correlation) is a recent measure of independence for continuous variables (Székely, Rizzo, Bakirov, AoS 2007). It is very popular in the statistical community.
– HSIC is closely related to (in fact, more general than) dCov.

• Distance covariance

Def. $X$, $Y$: random vectors (on Euclidean spaces),

$\mathrm{dCov}(X, Y)^2 := E\big[ \|X - \tilde X\| \|Y - \tilde Y\| \big] + E\big[ \|X - \tilde X\| \big] E\big[ \|Y - \tilde Y\| \big] - 2 E\Big[ E\big[ \|X - \tilde X\| \mid X \big]\, E\big[ \|Y - \tilde Y\| \mid Y \big] \Big]$,

where $(\tilde X, \tilde Y)$: independent copies of $(X, Y)$.
Note: the distance $\|x - x'\|$ is NOT positive definite as a kernel.

SLIDE 21

For $\rho$ a semi-metric on $\Omega$ ($\rho(x, y) = \rho(y, x)$, $\rho(x, y) \ge 0$ with equality iff $x = y$), define the generalized distance covariance by

$\mathrm{dCov}_{\rho_X, \rho_Y}(X, Y)^2 := E[\rho_X(X, \tilde X) \rho_Y(Y, \tilde Y)] + E[\rho_X(X, \tilde X)]\, E[\rho_Y(Y, \tilde Y)] - 2 E\big[ E[\rho_X(X, \tilde X) \mid X]\, E[\rho_Y(Y, \tilde Y) \mid Y] \big]$.

Theorem (Sejdinovic et al. AoS 2013). Assume $\rho$ is of negative type, i.e., $\sum_{i,j} c_i c_j \rho(x_i, x_j) \le 0$ for any $(c_i)$ with $\sum_i c_i = 0$. Then

$k(x, x') := \frac{1}{2} \big\{ \rho(x, x_0) + \rho(x', x_0) - \rho(x, x') \big\}$

is positive definite (for any fixed base point $x_0$), and with $k_X$ and $k_Y$ induced by $\rho_X$ and $\rho_Y$, resp.,

$\mathrm{HSIC}(X, Y) = \frac{1}{4} \, \mathrm{dCov}_{\rho_X, \rho_Y}(X, Y)^2$.

Example: $\rho(x, x') = \|x - x'\|^q$ $(0 < q \le 2)$ is of negative type; then $\mathrm{HSIC}(X, Y) = \frac{1}{4} \, \mathrm{dCov}_q(X, Y)^2$.
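A sketch (my addition) of the equivalence for $q = 1$ and base point $x_0 = 0$: the induced kernel is $k(x, x') = \frac{1}{2}(\|x\| + \|x'\| - \|x - x'\|)$, and plugging it into the HSIC Gram-matrix formula recovers $\frac{1}{4}\mathrm{dCov}^2$.

```python
import numpy as np

def distance_induced_gram(X):
    """Gram matrix of k(x, x') = (||x|| + ||x'|| - ||x - x'||) / 2, the
    positive definite kernel induced by rho(x, x') = ||x - x'|| (x0 = 0)."""
    norms = np.linalg.norm(X, axis=1)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    return 0.5 * (norms[:, None] + norms[None, :] - D)

def dcov2(X, Y):
    """Empirical dCov(X, Y)^2 computed as 4 * HSIC with induced kernels."""
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    Gx = distance_induced_gram(X)
    Gy = distance_induced_gram(Y)
    return 4.0 * np.trace(Gx @ H @ Gy @ H) / n ** 2

X = np.random.randn(100, 2)
Y = X[:, :1] ** 2 + 0.1 * np.random.randn(100, 1)   # nonlinear dependence
print(dcov2(X, Y))
```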

SLIDE 22

Experiments

Samples $(X, Y)$ with density $p(x, y) \propto 1 + \sin(\ell \omega x) \sin(\ell \omega y)$: as the frequency $\ell$ increases, the samples look closer to independent and the dependence is harder to detect.

[Figure: (A), (B) sample plots and test results, ranging from easier (clearly dependent) to harder (close to independent)]

SLIDE 23

How to choose a kernel

SLIDE 24

Kernel Choice

– The power of a test depends on the choice of kernel, e.g., the bandwidth $\sigma$ in the Gaussian kernel $\exp\!\big( -\|x - x'\|^2 / (2\sigma^2) \big)$.

  • Heuristics for $\sigma$: median of the pairwise distances $\{ \|X_i - X_j\| \}$ (Gretton et al. NIPS 2006); see the sketch below.
  • Maximization of the HSIC value over a kernel family, $\sup_{k \in \mathcal K} \mathrm{HSIC}_k(X, Y)$ (Sriperumbudur, Fukumizu et al. NIPS 2009).
    – No theoretical optimality, but empirically good.
  • Power of the test (Gretton, Fukumizu, et al. NIPS 2010).
    – Developed for a simple version of MMD.
    – May be extended to HSIC.
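A sketch (my addition) of the median heuristic mentioned in the first bullet: set the Gaussian bandwidth $\sigma$ to the median of the pairwise distances $\|X_i - X_j\|$.

```python
import numpy as np

def median_heuristic(X):
    """Bandwidth sigma = median of pairwise distances ||x_i - x_j||
    (off-diagonal pairs only)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    return np.median(D[np.triu_indices_from(D, k=1)])

X = np.random.randn(300, 5)
sigma = median_heuristic(X)   # use in exp(-||x - x'||^2 / (2 sigma^2))
```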

SLIDE 25

Power of linear-time MMD test

• Linear-time MMD

$X_1, \ldots, X_n \sim P$, $Y_1, \ldots, Y_n \sim Q$, i.i.d.

$\mathrm{MMD}(P, Q)^2 = E\big[ k(X, X') + k(Y, Y') - k(X, Y') - k(X', Y) \big]$

$\mathrm{L\text{-}MMD}(X, Y) := \frac{2}{n} \sum_{i=1}^{n/2} \big\{ k(X_{2i-1}, X_{2i}) + k(Y_{2i-1}, Y_{2i}) - k(X_{2i-1}, Y_{2i}) - k(X_{2i}, Y_{2i-1}) \big\}$

– Consistent estimator of $\mathrm{MMD}(X, Y)^2$.
  • Less accurate, but less computational cost.
  • Easier asymptotics:

$\sqrt{n}\, \big( \mathrm{L\text{-}MMD}(X, Y) - \mathrm{MMD}^2 \big) \Rightarrow N(0, 2\sigma_k^2)$, where $\sigma_k^2 = \mathrm{Var}_{(x,y),(x',y')}\big[ h\big( (x, y), (x', y') \big) \big]$ with $h\big((x, y), (x', y')\big) = k(x, x') + k(y, y') - k(x, y') - k(x', y)$.
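A sketch (my addition) of L-MMD as defined above, Gaussian kernel assumed: it runs over disjoint sample pairs, so it costs $O(n)$ time and memory, and the empirical variance of $h$ is exactly what the power criterion on the next slide needs.

```python
import numpy as np

def linear_time_mmd(X, Y, sigma=1.0):
    """Linear-time MMD over disjoint pairs:
    h_i = k(x, x') + k(y, y') - k(x, y') - k(x', y),
    returning (mean of h, variance of h)."""
    def k(a, b):
        return np.exp(-np.sum((a - b) ** 2, axis=1) / (2.0 * sigma ** 2))
    m = (len(X) // 2) * 2                 # use an even number of points
    x1, x2 = X[0:m:2], X[1:m:2]
    y1, y2 = Y[0:m:2], Y[1:m:2]
    h = k(x1, x2) + k(y1, y2) - k(x1, y2) - k(x2, y1)
    return h.mean(), h.var()

X = np.random.randn(10000, 1)
Y = 1.2 * np.random.randn(10000, 1)
lmmd, var_h = linear_time_mmd(X, Y)
print(lmmd / np.sqrt(var_h))   # the ratio driving the kernel-choice criterion
```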

SLIDE 26

• Power of test

– $t_\alpha$: threshold for level $\alpha$, i.e., $\Pr\big( \sqrt{n}\, \mathrm{L\text{-}MMD} > t_\alpha \big) = \alpha$ under $P = Q$.
– Under the alternative ($\mathrm{MMD}(P, Q) > 0$), the type II error is

$\Pr\big( \sqrt{n}\, \mathrm{L\text{-}MMD}(X, Y) < t_\alpha \big) \to \Phi\Big( \Phi^{-1}(1 - \alpha) - \frac{\sqrt{n}\, \mathrm{MMD}(P, Q)^2}{\sqrt{2}\, \sigma_k} \Big)$

  • $\Phi$: c.d.f. of $N(0, 1)$.
– To minimize the type II error, choose the kernel that attains

$\max_k \ \mathrm{L\text{-}MMD}_k(X, Y) / \sigma_k$.
SLIDE 27

• Experiment

– Two AM (amplitude modulation) signals: songs with different instruments, modulated with a carrier $\cos(\omega t)$.
– Gaussian kernels with different bandwidths are compared.

[Figure: AM signal from $p$ and AM signal from $q$]

SLIDE 28

Conditional independence

SLIDE 29

Conditional covariance

• Conditional covariance operator

$\Sigma_{YX|Z} := \Sigma_{YX} - \Sigma_{YZ} \Sigma_{ZZ}^{-1} \Sigma_{ZX}$

  • The decomposition $\Sigma_{YX} = \Sigma_{YY}^{1/2} V_{YX} \Sigma_{XX}^{1/2}$, with $\|V_{YX}\| \le 1$, is possible (Baker 1973).
  • $V_{YX}$ is a "correlation" operator, c.f. $\mathrm{Var}(Y)^{-1/2} \mathrm{Cov}(Y, X) \mathrm{Var}(X)^{-1/2}$.

SLIDE 30

Conditional independence

– Assume the kernels are characteristic.
  • $\Sigma_{YX|Z} = O$ is weaker than the conditional independence $X \perp\!\!\!\perp Y \mid Z$.
  • $\Sigma_{\ddot Y \ddot X | Z} = O$ if and only if $X \perp\!\!\!\perp Y \mid Z$, where $\ddot X = (X, Z)$, $\ddot Y = (Y, Z)$ are the paired variables (product kernels are used).
– Conditional independence measure:

$\mathrm{HSCONIC}(X, Y \mid Z) := \| \Sigma_{\ddot Y \ddot X | Z} \|_{HS}^2$

– Empirical estimator:

$\widehat{\mathrm{HSCONIC}}(X, Y \mid Z) := \big\| \widehat\Sigma_{\ddot Y \ddot X} - \widehat\Sigma_{\ddot Y Z} (\widehat\Sigma_{ZZ} + \varepsilon_n I)^{-1} \widehat\Sigma_{Z \ddot X} \big\|_{HS}^2$, where $\varepsilon_n$: regularization coefficient.
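A sketch (my derivation, not from the slides) of a Gram-matrix form of the estimator above: with centered Gram matrices $\tilde K = HGH$, product kernels for the paired variables, and $M = I - \tilde K_Z(\tilde K_Z + n\varepsilon I)^{-1}$, the squared HS norm becomes $\frac{1}{n^2}\mathrm{Tr}[M \tilde K_{\ddot Y} M \tilde K_{\ddot X}]$; all names and the Gaussian kernels are illustrative.

```python
import numpy as np

def gaussian_gram(X, sigma=1.0):
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))

def hsconic(X, Y, Z, sigma=1.0, eps=1e-3):
    """One Gram-matrix realization of ||Sigma_{..Y ..X | Z}||_HS^2 with
    regularized inversion (eps = regularization coefficient).

    Paired variables use product kernels: elementwise products of the
    Gram matrices of (X, Z) and of (Y, Z)."""
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    Kz  = H @ gaussian_gram(Z, sigma) @ H
    Kxz = H @ (gaussian_gram(X, sigma) * gaussian_gram(Z, sigma)) @ H
    Kyz = H @ (gaussian_gram(Y, sigma) * gaussian_gram(Z, sigma)) @ H
    M = np.eye(n) - Kz @ np.linalg.inv(Kz + n * eps * np.eye(n))
    return np.trace(M @ Kyz @ M @ Kxz) / n ** 2
```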

SLIDE 31

– The estimator is consistent, but the asymptotic distribution is NOT known.

  • Regularized inversion makes it difficult.

– A permutation test for continuous variables is not straightforward.

  • Discretization / neighboring data are needed to simulate conditionally independent data, so the simulated null is not rigorously conditionally independent.

SLIDE 32

Conclusions


• Dependence measures with kernels

– HSIC and HSCONIC are defined via the kernel mean embedding of probabilities.
– They show better performance than classical methods in high-dimensional cases.

  • Theoretical backing is needed, but this is still open.

– HSIC includes the distance covariance, a recently popular independence measure in statistics, as a special case.
– For the linear-time MMD, a kernel can be chosen so that the test power is asymptotically maximized.

  • Extension to other cases is needed.