Measuring Dependence and Conditional Dependence with Kernels


  1. Measuring Dependence and Conditional Dependence with Kernels. Kenji Fukumizu, The Institute of Statistical Mathematics, Japan. June 25, 2014, ICML 2014 Causality Workshop.

  2. Introduction

  3. Dependence Measures – Dependence measures and causality: constraint-based methods for causal structure learning rest on measuring or testing (conditional) dependence, e.g. the PC algorithm (Spirtes et al. 1991, 2001), which applies (conditional) independence tests, e.g. $\chi^2$-tests: $X_i \perp X_j$, $X_i \perp X_j \mid X_k$, etc. [Figure: example DAG over variables 1-4.]

  4. Problems – Tests for structure learning may involve many variables. – (Conditional) independence tests for continuous, high-dimensional domains are not easy. • Discretization produces many bins, requiring a large sample size. • Nonparametric methods (KDE, smoothing kernels, ...) are often weak in high dimensions. – Linear correlation may not be sufficient for complex relations. [Figure: scatterplot of a nonlinearly dependent pair with little linear correlation.]

  5. This talk – As building blocks of causal learning, kernel methods for measuring (in)dependence and conditional (in)dependence are discussed.

  6. Outline: 1. Introduction. 2. Kernel measures for independence. 3. Relations with distance covariance. 4. How to choose a kernel. 5. Conditional independence. 6. Conclusions.

  7. Kernel measures for independence

  8. Kernel methods – Feature map and kernel methods: do linear analysis in the feature space. [Figure: data points $x_i, x_j$ in the space of original data are mapped by the feature map to $\Phi(x_i), \Phi(x_j)$ in the feature space (RKHS) $H$.] – Feature map $\Phi: \Omega \to H$, $x \mapsto \Phi(x)$; feature vectors $X_1, \dots, X_n \mapsto \Phi(X_1), \dots, \Phi(X_n)$.
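To make the feature-space picture concrete, here is a minimal NumPy sketch (not from the slides) of the kernel trick: the feature vectors $\Phi(x_i)$ are never formed explicitly; only their inner products, i.e. the Gram matrix of a positive definite kernel, here assumed Gaussian.

```python
import numpy as np

def gaussian_gram(X, sigma=1.0):
    """Gram matrix G[i, j] = k(x_i, x_j) for the Gaussian kernel
    k(x, y) = exp(-||x - y||^2 / (2 sigma^2)).

    X: (n, d) array of n data points. The feature map Phi stays implicit:
    G[i, j] equals the RKHS inner product <Phi(x_i), Phi(x_j)>."""
    sq = np.sum(X**2, axis=1)
    sqdist = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-sqdist / (2.0 * sigma**2))
```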

  9. Do kernel methods work well for high-dimensional data? – Empirical comparison of a positive definite kernel with a smoothing kernel in nonparametric regression: $Y = 1 / (1.5 + \|X\|) + Z$, $X \sim N(0, I_d)$, $Z \sim N(0, 0.1^2)$. • Kernel ridge regression (Gaussian kernel). • Local linear regression (Epanechnikov kernel; 'locfit' in R is used). $n = 100$, 500 runs; bandwidth parameters are chosen by CV. [Figure: mean square errors vs. dimension of X for the two methods.] – Theory?
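For reference, a toy re-creation of the kernel-ridge side of this comparison; the bandwidth sigma and ridge parameter lam below are illustrative stand-ins for the CV-chosen values in the slide's experiment.

```python
import numpy as np

def kernel_ridge_fit_predict(Xtr, ytr, Xte, sigma=1.0, lam=1e-3):
    """Kernel ridge regression with a Gaussian kernel:
    alpha = (G + n*lam*I)^{-1} y,  f(x) = sum_i alpha_i k(x, x_i)."""
    def gram(A, B):
        sq = (np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :]
              - 2.0 * A @ B.T)
        return np.exp(-sq / (2.0 * sigma**2))
    n = len(Xtr)
    alpha = np.linalg.solve(gram(Xtr, Xtr) + n * lam * np.eye(n), ytr)
    return gram(Xte, Xtr) @ alpha

# Toy version of the slide's regression problem (d-dimensional input):
rng = np.random.default_rng(0)
d, n = 5, 100
X = rng.normal(size=(n, d))
y = 1.0 / (1.5 + np.linalg.norm(X, axis=1)) + 0.1 * rng.normal(size=n)
Xte = rng.normal(size=(200, d))
pred = kernel_ridge_fit_predict(X, y, Xte, sigma=np.sqrt(d), lam=1e-2)
```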

  10. Representing probabilities – $X$: random variable taking values on $\Omega$; $k$: pos. def. kernel on $\Omega$. The feature map defines an RKHS-valued random variable $\Phi(X)$, and the kernel mean $E[\Phi(X)]$ represents the probability distribution of $X$: $m_X := E[\Phi(X)] = \int k(\cdot, x)\, dP(x)$. – The kernel mean can express higher-order moments of $X$: suppose $k(x, u) = c_0 + c_1 (xu) + c_2 (xu)^2 + \cdots$ with $c_i > 0$; then $m_X(u) = c_0 + c_1 E[X]\, u + c_2 E[X^2]\, u^2 + \cdots$; c.f. the moment generating function.
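A short sketch of the empirical kernel mean $\hat{m}_X = \frac{1}{n} \sum_{i=1}^n k(\cdot, X_i)$, evaluated at query points; a Gaussian kernel is assumed.

```python
import numpy as np

def kernel_mean(X, U, sigma=1.0):
    """Evaluate the empirical kernel mean m_hat(u) = (1/n) sum_i k(u, X_i)
    at each query point u in U (rows), with a Gaussian kernel."""
    sq = (np.sum(U**2, 1)[:, None] + np.sum(X**2, 1)[None, :]
          - 2.0 * U @ X.T)
    return np.exp(-sq / (2.0 * sigma**2)).mean(axis=1)
```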

  11. Comparing two probabilities – MMD (Maximum Mean Discrepancy, Gretton et al. 2005). $X \sim P$, $Y \sim Q$ (two probabilities on $\Omega$); $k$: pos. def. kernel on $\Omega$. $\mathrm{MMD}^2(P, Q) := \| m_P - m_Q \|_H^2 = \sup_{f \in H, \|f\| \le 1} \big( E[f(X)] - E[f(Y)] \big)^2$: comparing the moments through various functions. – Characteristic kernels are defined so that $\mathrm{MMD}(P, Q) = 0$ if and only if $P = Q$; e.g. Gaussian and Laplace kernels. For such kernels the kernel mean $m_P$ determines the distribution uniquely, and MMD is a metric on the probabilities.
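A sketch of the standard plug-in (V-statistic) estimator $\widehat{\mathrm{MMD}}^2 = \frac{1}{n^2} \sum_{i,j} k(x_i, x_j) + \frac{1}{m^2} \sum_{i,j} k(y_i, y_j) - \frac{2}{nm} \sum_{i,j} k(x_i, y_j)$, again with an assumed Gaussian kernel:

```python
import numpy as np

def gaussian_gram(A, B, sigma=1.0):
    sq = (np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-sq / (2.0 * sigma**2))

def mmd2(X, Y, sigma=1.0):
    """Biased (V-statistic) estimate of MMD^2 between samples X and Y."""
    return (gaussian_gram(X, X, sigma).mean()
            + gaussian_gram(Y, Y, sigma).mean()
            - 2.0 * gaussian_gram(X, Y, sigma).mean())
```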

  12. HSIC: Independence measure – Hilbert-Schmidt Independence Criterion (HSIC). $(X, Y)$: random vector taking values on $\Omega_X \times \Omega_Y$; $(H_X, k_X)$, $(H_Y, k_Y)$: RKHSs on $\Omega_X$ and $\Omega_Y$, resp. Compare the joint probability $P_{XY}$ and the product of the marginals $P_X P_Y$. Def. $\mathrm{HSIC}(X, Y) := \mathrm{MMD}^2(P_{XY}, P_X P_Y) = \| m_{XY} - m_X \otimes m_Y \|^2_{H_X \otimes H_Y}$. Theorem: assume the product kernel $k_X k_Y$ is characteristic on $\Omega_X \times \Omega_Y$; then $\mathrm{HSIC}(X, Y) = 0$ if and only if $X \perp Y$.

  13. Covariance operator – Operator expression: $\langle m_{XY} - m_X \otimes m_Y,\; g \otimes f \rangle_{H_Y \otimes H_X} = E[f(X) g(Y)] - E[f(X)]\, E[g(Y)]$. – Def. covariance operators $\Sigma_{YX}: H_X \to H_Y$ and $\Sigma_{XX}: H_X \to H_X$: $\langle g, \Sigma_{YX} f \rangle = E[f(X) g(Y)] - E[f(X)] E[g(Y)] = \mathrm{Cov}[f(X), g(Y)]$ $(\forall f \in H_X, g \in H_Y)$, and $\langle g, \Sigma_{XX} f \rangle = \mathrm{Cov}[f(X), g(X)]$ $(\forall f, g \in H_X)$. Simply an extension of the covariance matrix (a linear map). [Figure: $\Phi_X(X) \in H_X$ and $\Phi_Y(Y) \in H_Y$, with $\Sigma_{YX}$ mapping between the feature spaces.]

  14. Expressions of HSIC – $\mathrm{HSIC}(X, Y) = \| \Sigma_{YX} \|_{HS}^2$: the squared Hilbert-Schmidt norm (same as the Frobenius norm for matrices), $\| A \|_{HS}^2 := \sum_i \sum_j \langle \psi_j, A \varphi_i \rangle^2$ for $A: H_1 \to H_2$, where $\{\varphi_i\}$, $\{\psi_j\}$ are ONBs of $H_1$ and $H_2$ (resp.). – Population expression: $\mathrm{HSIC}(X, Y) = E[k_X(X, X') k_Y(Y, Y')] + E[k_X(X, X')]\, E[k_Y(Y, Y')] - 2\, E_{XY}\big[ E_{X'}[k_X(X, X')]\, E_{Y'}[k_Y(Y, Y')] \big]$, where $(X', Y'), (X'', Y'')$ are independent copies of $(X, Y)$. – Empirical estimator (Gram matrix expression): given an i.i.d. sample $(X_1, Y_1), \dots, (X_n, Y_n) \sim P_{XY}$, $\mathrm{HSIC}_{emp}(X, Y) = \frac{1}{n^2} \mathrm{Tr}\big[ \widetilde{G}_X \widetilde{G}_Y \big]$, where $(G_X)_{ij} = k_X(X_i, X_j)$, $(G_Y)_{ij} = k_Y(Y_i, Y_j)$, and $\widetilde{G} = H G H$ with the centering matrix $H = I_n - \frac{1}{n} \mathbf{1}_n \mathbf{1}_n^\top$. Test statistic: $n \cdot \mathrm{HSIC}_{emp}(X, Y)$.
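A direct implementation of this Gram-matrix estimator (Gaussian kernels assumed for both variables):

```python
import numpy as np

def gaussian_gram(A, sigma=1.0):
    sq = (np.sum(A**2, 1)[:, None] + np.sum(A**2, 1)[None, :]
          - 2.0 * A @ A.T)
    return np.exp(-sq / (2.0 * sigma**2))

def hsic_emp(X, Y, sigma_x=1.0, sigma_y=1.0):
    """Empirical HSIC = (1/n^2) Tr[Gx~ Gy~] with centered Gram matrices."""
    n = len(X)
    H = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    Gx = H @ gaussian_gram(X, sigma_x) @ H
    Gy = H @ gaussian_gram(Y, sigma_y) @ H
    return np.trace(Gx @ Gy) / n**2
```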

  15. Independence test with HSIC – Theorem: null distribution (Gretton, Fukumizu, et al., NIPS 2007). If $X$ and $Y$ are independent, then $n \cdot \mathrm{HSIC}_{emp}(X, Y) \xrightarrow{\;law\;} \sum_{i=1}^{\infty} \lambda_i Z_i^2$ $(n \to \infty)$, where the $Z_i$ are i.i.d. $\sim N(0, 1)$ and $\{\lambda_i\}_{i=1}^{\infty}$ are the eigenvalues of an integral operator. – Theorem: consistency of the test (Gretton, Fukumizu, et al., NIPS 2007). If $\mathrm{HSIC}(X, Y) > 0$, then $\sqrt{n}\, \big( \mathrm{HSIC}_{emp}(X, Y) - \mathrm{HSIC}(X, Y) \big) \xrightarrow{\;law\;} N(0, \sigma^2)$ $(n \to \infty)$, where $\sigma^2 = 16 \Big( E_a \big[ \big( E_{b,c,d}[ h(U_a, U_b, U_c, U_d) ] \big)^2 \big] - \mathrm{HSIC}(X, Y)^2 \Big)$, with $U_i = (X_i, Y_i)$ and $h$ the kernel of the corresponding U-statistic.
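One practical use of the null-distribution theorem (the first bullet on the next slide) is to simulate $\sum_i \lambda_i Z_i^2$ with eigenvalues estimated from the data. The sketch below relies on the fact that, under independence, the operator's spectrum consists of the products $\lambda_i \mu_j$ of the eigenvalues of the two centered kernel operators, which can be estimated by the eigenvalues of the centered Gram matrices divided by $n$; it is an illustration, not the authors' code.

```python
import numpy as np

def hsic_null_samples(Gx, Gy, n_draws=1000, seed=0, tol=1e-10):
    """Simulate the asymptotic null of n*HSIC_emp: sum_ij lam_i mu_j Z_ij^2,
    with operator eigenvalues estimated by eigenvalues of the centered
    Gram matrices divided by n (spectral approximation)."""
    rng = np.random.default_rng(seed)
    n = len(Gx)
    H = np.eye(n) - np.ones((n, n)) / n
    lam = np.linalg.eigvalsh(H @ Gx @ H) / n
    mu = np.linalg.eigvalsh(H @ Gy @ H) / n
    lam, mu = lam[lam > tol], mu[mu > tol]   # keep numerically nonzero part
    prod = np.outer(lam, mu).ravel()
    Z2 = rng.chisquare(1, size=(n_draws, prod.size))
    return Z2 @ prod   # one draw of sum_ij lam_i mu_j Z_ij^2 per row
```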

  16. Independence test with HSIC – How to compute the critical region for a given significance level: • Simulation of the null distribution (Gretton, Fukumizu, et al., NIPS 2009); the eigenvalues can be estimated from the Gram matrices (see the sketch above). • Approximation with a two-parameter Gamma distribution by moment matching (Gretton, Fukumizu, et al., NIPS 2007). • Permutation test / bootstrap: always possible, but time consuming (see the sketch below).
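A sketch of the permutation option, reusing hsic_emp from the earlier sketch: permuting the Y sample destroys any dependence on X while preserving both marginals, so the permuted statistics emulate the null distribution.

```python
import numpy as np

def hsic_perm_test(X, Y, n_perm=500, sigma_x=1.0, sigma_y=1.0, seed=0):
    """Permutation p-value for the HSIC independence test.
    Relies on hsic_emp(X, Y, sigma_x, sigma_y) from the slide-14 sketch."""
    rng = np.random.default_rng(seed)
    stat = hsic_emp(X, Y, sigma_x, sigma_y)
    null = np.array([hsic_emp(X, Y[rng.permutation(len(Y))],
                              sigma_x, sigma_y)
                     for _ in range(n_perm)])
    # add-one correction keeps the p-value strictly positive
    pval = (1 + np.sum(null >= stat)) / (1 + n_perm)
    return stat, pval
```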

  17. Experiments: independence test – X, Y: 1 dim + noise components. Compared: HSIC (Gamma approximation) vs. the power-divergence test ($\lambda = 3/2$) with equi-probable discretization. [Figure: Type II errors of the two tests.]

  18. Power divergence – Each dimension is partitioned into $q$ parts, giving a partition $\{A_i\}_{i \in I}$ of the domain. $T_\lambda := \frac{2n}{\lambda(\lambda + 1)} \sum_{i \in I} \hat{p}_i \left[ \left( \frac{\hat{p}_i}{\prod_k \hat{p}_i^{(k)}} \right)^{\lambda} - 1 \right]$, where $\hat{p}_i$ is the relative frequency in $A_i$ and $\hat{p}_i^{(k)}$ is the marginal relative frequency in the $k$-th dimension. Under independence, $T_\lambda$ converges in law to a $\chi^2$ distribution $(n \to \infty)$. – $\lambda \to 0$: mutual information; $\lambda = 2$: $\chi^2$-divergence (mean square contingency).
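A sketch of this statistic for two discretized variables (a 2D contingency table), under the Read-Cressie-style normalization written above; lam = 1.5 matches the $\lambda = 3/2$ used in the experiments.

```python
import numpy as np

def power_divergence_indep(counts, lam=1.5):
    """Power-divergence independence statistic for a 2D contingency table:
    T = 2n/(lam*(lam+1)) * sum_i p_i * ((p_i / prod_k p_i^(k))**lam - 1)."""
    n = counts.sum()
    p = counts / n                               # joint relative frequencies
    q = np.outer(p.sum(axis=1), p.sum(axis=0))   # product of marginals
    mask = p > 0                                 # empty cells contribute 0
    return 2 * n / (lam * (lam + 1)) * np.sum(
        p[mask] * ((p[mask] / q[mask])**lam - 1.0))
```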

  19. Relation to distance covariance

  20. Distance covariance – Distance covariance (distance correlation) is a recent measure of independence for continuous variables (Székely, Rizzo, Bakirov, AoS 2007); it is very popular in the statistics community. – HSIC is closely related to (in fact, more general than) dCov. – Def. $X, Y$: random vectors (on Euclidean spaces). $\mathrm{dCov}^2(X, Y) := E\big[ \|X - X'\| \|Y - Y'\| \big] + E\big[ \|X - X'\| \big]\, E\big[ \|Y - Y'\| \big] - 2\, E\big[ \|X - X'\| \|Y - Y''\| \big]$, where $(X', Y'), (X'', Y'')$ are independent copies of $(X, Y)$. Note: $\|x - y\|$ is NOT positive definite.
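A sketch of the plug-in (V-statistic) estimate via double-centered distance matrices, following the Székely et al. formulation:

```python
import numpy as np

def dcov2(X, Y):
    """V-statistic estimate of dCov^2 using double-centered
    Euclidean distance matrices (Szekely et al. 2007).
    X, Y: (n, d) arrays with matching n."""
    def dcenter(A):
        D = np.linalg.norm(A[:, None, :] - A[None, :, :], axis=-1)
        return (D - D.mean(0, keepdims=True) - D.mean(1, keepdims=True)
                + D.mean())
    n = len(X)
    return (dcenter(X) * dcenter(Y)).sum() / n**2
```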

  21. For a semi-metric $\rho$ on $\Omega$ (i.e. $\rho(x, y) = \rho(y, x)$ and $\rho(x, y) \ge 0$ with equality iff $x = y$), define the generalized distance covariance by $\mathrm{dCov}^2_{\rho_X, \rho_Y}(X, Y) := E\big[ \rho_X(X, X') \rho_Y(Y, Y') \big] + E\big[ \rho_X(X, X') \big]\, E\big[ \rho_Y(Y, Y') \big] - 2\, E\big[ \rho_X(X, X') \rho_Y(Y, Y'') \big]$. Theorem (Sejdinovic et al., AoS 2013). Assume $\rho$ is of negative type, i.e., $\sum_{i,j=1}^n c_i c_j \rho(x_i, x_j) \le 0$ for any $(c_i)_{i=1}^n$ with $\sum_{i=1}^n c_i = 0$. Then $k(x, y) := \frac{1}{2} \big( \rho(x, x_0) + \rho(y, x_0) - \rho(x, y) \big)$ is positive definite, and with $k_X$ and $k_Y$ induced by $\rho_X$ and $\rho_Y$, resp., $4\, \mathrm{HSIC}(X, Y) = \mathrm{dCov}^2_{\rho_X, \rho_Y}(X, Y)$. Example: $\rho(x, y) = \| x - y \|^\alpha$ $(0 < \alpha \le 2)$ gives $k(x, y) = \frac{1}{2} \big( \| x - x_0 \|^\alpha + \| y - x_0 \|^\alpha - \| x - y \|^\alpha \big)$.
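The equivalence can be checked numerically; a sketch for $\alpha = 1$, reusing dcov2 from the previous sketch. The base point $x_0$ is annihilated by the centering inside HSIC, so any fixed choice works; taking the first sample point is an illustrative assumption.

```python
import numpy as np

def induced_gram(A, alpha=1.0):
    """Distance-induced kernel k(x,y) = (rho(x,x0)+rho(y,x0)-rho(x,y))/2
    with rho(x,y) = ||x-y||^alpha and x0 = the first sample point."""
    D = np.linalg.norm(A[:, None, :] - A[None, :, :], axis=-1)**alpha
    return 0.5 * (D[:, [0]] + D[[0], :] - D)

def hsic_from_gram(Gx, Gy):
    n = len(Gx)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(H @ Gx @ H @ Gy) / n**2

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
Y = X + 0.5 * rng.normal(size=(200, 2))   # a dependent sample
lhs = 4 * hsic_from_gram(induced_gram(X), induced_gram(Y))
rhs = dcov2(X, Y)                          # from the previous sketch
print(np.isclose(lhs, rhs))               # the identity holds empirically
```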

  22. Experiments – $\rho(x, y) = \| x - y \|^\alpha$; data with density $p(x, y) \propto 1 + \sin(\ell x) \sin(\ell y)$. [Figure: panels (A) and (B), annotated 'independent / dependent' and 'harder / easier'.]

  23. How to choose a kernel
