

slide-1
SLIDE 1

The Finite-Set Independence Criterion (FSIC)

Wittawat Jitkrittum, Zoltán Szabó, Arthur Gretton
Gatsby Unit, University College London
wittawat@gatsby.ucl.ac.uk
3rd UCL Workshop on the Theory of Big Data, 28 June 2017

slide-2
SLIDE 2

What Is Independence Testing?

Let $(X, Y) \in \mathbb{R}^{d_x} \times \mathbb{R}^{d_y}$ be random vectors following $P_{xy}$. Given a joint sample $\{(x_i, y_i)\}_{i=1}^{n} \sim P_{xy}$ (unknown), test

  • $H_0: P_{xy} = P_x P_y$
  • vs. $H_1: P_{xy} \neq P_x P_y$.

Compute a test statistic $\hat{\lambda}_n$. Reject $H_0$ if $\hat{\lambda}_n > T_\alpha$ (threshold), where $T_\alpha$ is the $(1-\alpha)$-quantile of the null distribution.

[Figure: null distribution $P_{H_0}(\hat{\lambda}_n)$ with rejection threshold $T_\alpha$.]

slide-5
SLIDE 5

Motivations

The modern state-of-the-art test is HSIC [Gretton et al., 2005].
✓ Nonparametric, i.e., no assumption on $P_{xy}$. Kernel-based.
✗ Slow. Runtime: $O(n^2)$, where $n$ is the sample size.
✗ No systematic way to choose the kernels.

We propose the Finite-Set Independence Criterion (FSIC):
1. Nonparametric.
2. Linear-time. Runtime complexity: $O(n)$. Fast.
3. Tunable, i.e., a well-defined criterion for parameter tuning.

slide-7
SLIDE 7

Proposal: The Finite-Set Independence Criterion (FSIC)

1. Pick two positive definite kernels: $k$ for $X$, and $l$ for $Y$.
   ✎ Gaussian kernel: $k(x, v) = \exp\!\left(-\frac{\|x - v\|^2}{2\sigma_x^2}\right)$.
2. Pick some feature $(v, w) \in \mathbb{R}^{d_x} \times \mathbb{R}^{d_y}$.
3. Transform $(x, y) \mapsto (k(x, v), l(y, w))$, mapping $\mathbb{R}^{d_x} \times \mathbb{R}^{d_y} \to \mathbb{R} \times \mathbb{R}$, then measure the covariance:

$$\mathrm{FSIC}^2(X, Y) = \mathrm{cov}^2_{(x,y)\sim P_{xy}}[k(x, v), l(y, w)].$$

[Figures across the animation steps: scatter plots of the data with a location $(v, w)$ and the transformed features $k(x, v)$, $l(y, w)$; the sample correlations of the transformed pairs are 0.97, −0.47, and 0.33 at different locations for a first dataset, and 0.023, 0.025, and 0.087 for a second dataset.]
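To make step 3 concrete, here is a minimal NumPy sketch (my own illustration, not from the slides) of the single-location construction: Gaussian kernels, one test location $(v, w)$, and the empirical covariance of the transformed sample. The toy data and all variable names are assumptions for illustration only.

```python
import numpy as np

def gauss_kernel(A, c, sigma):
    """Gaussian kernel values k(a_i, c) between the rows of A and one location c."""
    return np.exp(-np.sum((A - c) ** 2, axis=1) / (2.0 * sigma ** 2))

def fsic2_single_location(X, Y, v, w, sigma_x, sigma_y):
    """Empirical FSIC^2 at one location (v, w): squared covariance of k(x, v) and l(y, w)."""
    kx = gauss_kernel(X, v, sigma_x)   # shape (n,)
    ly = gauss_kernel(Y, w, sigma_y)   # shape (n,)
    cov = np.mean(kx * ly) - np.mean(kx) * np.mean(ly)
    return cov ** 2

# Hypothetical toy data: Y depends on X, so some locations give a clearly nonzero covariance.
rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 1))
Y = np.sin(2.0 * X) + 0.1 * rng.normal(size=(n, 1))
print(fsic2_single_location(X, Y, v=np.array([0.5]), w=np.array([0.8]), sigma_x=1.0, sigma_y=1.0))
```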

slide-16
SLIDE 16

General Form of FSIC

$$\mathrm{FSIC}^2(X, Y) = \frac{1}{J} \sum_{j=1}^{J} \mathrm{cov}^2_{(x,y)\sim P_{xy}}[k(x, v_j), l(y, w_j)],$$

for $J$ features $\{(v_j, w_j)\}_{j=1}^{J} \subset \mathbb{R}^{d_x} \times \mathbb{R}^{d_y}$.

Proposition 1.
Assume
1. Kernels $k$ and $l$ satisfy some conditions (e.g., Gaussian kernels).
2. Features $\{(v_i, w_i)\}_{i=1}^{J}$ are drawn from a distribution with a density.

Then, for any $J \geq 1$, $\mathrm{FSIC}(X, Y) = 0$ if and only if $X$ and $Y$ are independent.

Under $H_0: P_{xy} = P_x P_y$, $n\,\widehat{\mathrm{FSIC}^2} \sim$ a weighted sum of $J$ dependent $\chi^2$ variables, so the $(1-\alpha)$-quantile needed for the threshold is difficult to obtain.
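Building on the single-location sketch above (again, my own illustration rather than the authors' code), the general form simply averages the squared empirical covariances over $J$ randomly drawn locations:

```python
import numpy as np

def gauss_kernel(A, c, sigma):
    """Gaussian kernel values between the rows of A and one location c."""
    return np.exp(-np.sum((A - c) ** 2, axis=1) / (2.0 * sigma ** 2))

def fsic2(X, Y, V, W, sigma_x, sigma_y):
    """FSIC^2 estimate: average over J locations of the squared empirical covariance."""
    vals = []
    for v, w in zip(V, W):
        kx, ly = gauss_kernel(X, v, sigma_x), gauss_kernel(Y, w, sigma_y)
        vals.append((np.mean(kx * ly) - np.mean(kx) * np.mean(ly)) ** 2)
    return float(np.mean(vals))

# Hypothetical toy data; locations drawn from a distribution with a density (here Gaussian).
rng = np.random.default_rng(0)
n, J = 1000, 10
X = rng.normal(size=(n, 2))
Y = np.tanh(X[:, :1]) + 0.2 * rng.normal(size=(n, 1))
V = rng.normal(size=(J, 2))
W = rng.normal(size=(J, 1))
print(fsic2(X, Y, V, W, sigma_x=1.0, sigma_y=1.0))
```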

slide-19
SLIDE 19

Normalized FSIC (NFSIC)

Let $\hat{\mathbf{u}} := \big( \widehat{\mathrm{cov}}[k(x, v_1), l(y, w_1)], \ldots, \widehat{\mathrm{cov}}[k(x, v_J), l(y, w_J)] \big)^\top \in \mathbb{R}^J$. Then $\widehat{\mathrm{FSIC}^2} = \frac{1}{J}\, \hat{\mathbf{u}}^\top \hat{\mathbf{u}}$.

$$\widehat{\mathrm{NFSIC}^2}(X, Y) = \hat{\lambda}_n := n\, \hat{\mathbf{u}}^\top \big( \hat{\boldsymbol{\Sigma}} + \gamma_n I \big)^{-1} \hat{\mathbf{u}},$$

with a regularization parameter $\gamma_n \geq 0$, where $\hat{\Sigma}_{ij}$ is the covariance of $\hat{u}_i$ and $\hat{u}_j$.

Theorem 1 (the NFSIC test is consistent).
Assume $\gamma_n \to 0$ and the same conditions on $k$ and $l$ as before.
1. Under $H_0$, $\hat{\lambda}_n \xrightarrow{d} \chi^2(J)$ as $n \to \infty$. Easy to get the threshold $T_\alpha$.
2. Under $H_1$, $P(\text{reject } H_0) \to 1$ as $n \to \infty$.

Complexity: $O(J^3 + J^2 n + (d_x + d_y) J n)$. Only a small $J$ is needed.
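Because the null distribution is simply $\chi^2(J)$, the rejection threshold is a one-line quantile lookup. A minimal sketch (my own, with made-up numbers; $\hat{\lambda}_n$ would come from the NFSIC estimator):

```python
from scipy.stats import chi2

alpha, J = 0.01, 10
lambda_hat = 35.2                         # hypothetical value of the NFSIC statistic

T_alpha = chi2.ppf(1.0 - alpha, df=J)     # (1 - alpha)-quantile of chi^2(J)
print(T_alpha, lambda_hat > T_alpha)      # reject H0 if the statistic exceeds the threshold
```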

slide-24
SLIDE 24

Tuning Features and Kernels

Split the data into training (tr) and test (te) sets. Procedure:
1. Choose $\{(v_i, w_i)\}_{i=1}^{J}$ and the Gaussian widths by maximizing $\hat{\lambda}_n^{(tr)}$ (i.e., the statistic computed on the training set), via gradient ascent.
2. Reject $H_0$ if $\hat{\lambda}_n^{(te)} > (1-\alpha)$-quantile of $\chi^2(J)$.
Splitting avoids overfitting.

Theorem 2.
This procedure increases a lower bound on $P(\text{reject } H_0 \mid H_1 \text{ true})$ (the test power). Asymptotically, the false rejection rate is $\alpha$.

slide-27
SLIDE 27

Simulation Settings

Gaussian kernels $k(x, x') = \exp\!\left(-\frac{\|x - x'\|_2^2}{2\sigma_x^2}\right)$ for both $X$ and $Y$.

Methods compared:
1. NFSIC-opt: NFSIC with optimization. $O(n)$.
2. QHSIC [Gretton et al., 2005]: state-of-the-art HSIC. $O(n^2)$.
3. NFSIC-med: NFSIC with random features.
4. NyHSIC: linear-time HSIC with Nyström approximation.
5. FHSIC: linear-time HSIC with random Fourier features.
6. RDC [Lopez-Paz et al., 2013]: canonical correlation analysis with a cosine basis.

$J = 10$ in NFSIC.

slide-28
SLIDE 28

YouTube Video ($X$) vs. Caption ($Y$)

$X \in \mathbb{R}^{2000}$: Fisher vector encoding of motion boundary histogram descriptors [Wang and Schmid, 2013]. $Y \in \mathbb{R}^{1878}$: bag of words (term frequency). $\alpha = 0.01$.

[Figure: test power vs. sample size $n$ for QHSIC and the proposed NFSIC; a second panel shows the Type-I error when the $(X, Y)$ pairs are exchanged so that $H_0$ holds.]

For large $n$, NFSIC is comparable to HSIC.

slide-31
SLIDE 31

Conclusions

We proposed the Finite-Set Independence Criterion (FSIC). The independence test based on FSIC is
1. nonparametric,
2. linear-time,
3. adaptive (parameters are automatically tuned).

An Adaptive Test of Independence with Analytic Kernel Embeddings
Wittawat Jitkrittum, Zoltán Szabó, Arthur Gretton
https://arxiv.org/abs/1610.04782 (to appear in ICML 2017)
Python code: https://github.com/wittawatj/fsic-test

slide-32
SLIDE 32

Questions?

Thank you


slide-33
SLIDE 33

Reference

Coauthors: Zoltán Szabó (École Polytechnique) and Arthur Gretton (Gatsby Unit, UCL).

An Adaptive Test of Independence with Analytic Kernel Embeddings
Wittawat Jitkrittum, Zoltán Szabó, Arthur Gretton
https://arxiv.org/abs/1610.04782 (to appear in ICML 2017)
Python code: https://github.com/wittawatj/fsic-test

slide-34
SLIDE 34

Requirements on the Kernels

Definition 1 (Analytic kernels).
$k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is said to be analytic if, for all $x \in \mathcal{X}$, $v \mapsto k(x, v)$ is a real analytic function on $\mathcal{X}$. Analytic: the Taylor series about $x_0$ converges for all $x_0 \in \mathcal{X}$. This implies that $k$ is infinitely differentiable.

Definition 2 (Characteristic kernels).
Let $\mu_P(v) := \mathbb{E}_{z \sim P}[k(z, v)]$. $k$ is said to be characteristic if $\mu_P$ is unique for distinct $P$; equivalently, $P \mapsto \mu_P$ is injective.

[Figure: the mean embeddings $\mu_P$ and $\mu_Q$ of two distributions $P \neq Q$ in the RKHS; their RKHS distance is $\mathrm{MMD}(P, Q)$.]
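To make Definition 2 concrete, the following NumPy sketch (my own illustration, not from the slides) evaluates the empirical mean embedding $\hat{\mu}_P(v) = \frac{1}{n}\sum_i k(z_i, v)$ of two samples at a grid of locations; with a Gaussian kernel, samples from different distributions produce visibly different embeddings.

```python
import numpy as np

def gauss_k(z, v, sigma=1.0):
    """Gaussian kernel values k(z_i, v) for a 1-D sample z and one location v."""
    return np.exp(-(z - v) ** 2 / (2.0 * sigma ** 2))

def mean_embedding(z, locations, sigma=1.0):
    """Empirical mean embedding evaluated at each location: (1/n) * sum_i k(z_i, v)."""
    return np.array([gauss_k(z, v, sigma).mean() for v in locations])

rng = np.random.default_rng(0)
P_sample = rng.normal(loc=0.0, scale=1.0, size=2000)    # P = N(0, 1)
Q_sample = rng.laplace(loc=0.0, scale=1.0, size=2000)   # Q = Laplace(0, 1), same mean
V = np.linspace(-3.0, 3.0, 7)
print(mean_embedding(P_sample, V))
print(mean_embedding(Q_sample, V))   # differs from the line above, reflecting mu_P != mu_Q
```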

slide-35
SLIDE 35

Optimization Objective = Power Lower Bound

Recall $\hat{\lambda}_n := n\, \hat{\mathbf{u}}^\top \big( \hat{\boldsymbol{\Sigma}} + \gamma_n I \big)^{-1} \hat{\mathbf{u}}$. Let $\mathrm{NFSIC}^2(X, Y) := \lambda_n := n\, \mathbf{u}^\top \boldsymbol{\Sigma}^{-1} \mathbf{u}$.

Theorem 3 (A lower bound on the test power).
1. Under some conditions, the test power satisfies $P_{H_1}\big(\hat{\lambda}_n \geq T_\alpha\big) \geq L(\lambda_n)$, where

$$L(\lambda_n) = 1 - 62 e^{-\xi_1 \gamma_n^2 (\lambda_n - T_\alpha)^2 / n} - 2 e^{-\lfloor 0.5 n \rfloor (\lambda_n - T_\alpha)^2 / [\xi_2 n^2]} - 2 e^{-\left[ (\lambda_n - T_\alpha)\gamma_n (n-1)/3 \,-\, \xi_3 n \,-\, c_3 \gamma_n^2 n (n-1) \right]^2 / [\xi_4 n^2 (n-1)]},$$

and $\xi_1, \ldots, \xi_4, c_3 > 0$ are constants.
2. For large $n$, $L(\lambda_n)$ is increasing in $\lambda_n$.

Set the test locations and Gaussian widths $= \arg\max L(\lambda_n) = \arg\max \lambda_n$.

slide-39
SLIDE 39

An Estimator of ❭ NFSIC2

$$\hat{\lambda}_n := n\, \hat{\mathbf{u}}^\top \big( \hat{\boldsymbol{\Sigma}} + \gamma_n I \big)^{-1} \hat{\mathbf{u}}, \qquad J \text{ test locations } \{(v_i, w_i)\}_{i=1}^{J} \text{ drawn from a distribution with a density.}$$

Let $\mathbf{K} = [k(v_i, x_j)] \in \mathbb{R}^{J \times n}$ and $\mathbf{L} = [l(w_i, y_j)] \in \mathbb{R}^{J \times n}$. (No $n \times n$ Gram matrix is needed.) Estimators:

1. $\hat{\mathbf{u}} = \dfrac{(\mathbf{K} \circ \mathbf{L}) \mathbf{1}_n}{n-1} - \dfrac{(\mathbf{K}\mathbf{1}_n) \circ (\mathbf{L}\mathbf{1}_n)}{n(n-1)}$.

2. $\hat{\boldsymbol{\Sigma}} = \dfrac{\boldsymbol{\Gamma}\boldsymbol{\Gamma}^\top}{n}$, where $\boldsymbol{\Gamma} := \big(\mathbf{K} - n^{-1}\mathbf{K}\mathbf{1}_n\mathbf{1}_n^\top\big) \circ \big(\mathbf{L} - n^{-1}\mathbf{L}\mathbf{1}_n\mathbf{1}_n^\top\big) - \hat{\mathbf{u}}\mathbf{1}_n^\top$.

$\hat{\lambda}_n$ can be computed in $O(J^3 + J^2 n + (d_x + d_y) J n)$ time. Main point: linear in $n$, cubic in $J$ (which is small).
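Below is a minimal NumPy rendering of these estimators (my own sketch for illustration, not the authors' reference implementation; see the fsic-test repository for that). The kernel choice, bandwidths, regularization value, and toy data are all assumptions.

```python
import numpy as np

def gauss_gram(V, X, sigma):
    """J x n matrix of Gaussian kernel values k(v_i, x_j)."""
    sq = np.sum(V**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2.0 * V @ X.T
    return np.exp(-sq / (2.0 * sigma**2))

def nfsic_statistic(X, Y, V, W, sigx, sigy, gamma=1e-4):
    """lambda_hat = n u_hat^T (Sigma_hat + gamma I)^{-1} u_hat, linear in n."""
    n, J = X.shape[0], V.shape[0]
    K = gauss_gram(V, X, sigx)                     # J x n
    L = gauss_gram(W, Y, sigy)                     # J x n
    one = np.ones(n)
    # u_hat = (K o L) 1 / (n-1)  -  (K 1) o (L 1) / (n(n-1))
    u = (K * L) @ one / (n - 1.0) - (K @ one) * (L @ one) / (n * (n - 1.0))
    # Gamma = (K - K 1 1^T / n) o (L - L 1 1^T / n) - u 1^T ;  Sigma_hat = Gamma Gamma^T / n
    Kc = K - (K @ one)[:, None] / n
    Lc = L - (L @ one)[:, None] / n
    Gam = Kc * Lc - u[:, None]
    Sigma = Gam @ Gam.T / n
    return float(n * u @ np.linalg.solve(Sigma + gamma * np.eye(J), u))

# Hypothetical usage: dependent toy data, random test locations.
rng = np.random.default_rng(0)
n, dx, dy, J = 1000, 3, 1, 5
X = rng.normal(size=(n, dx))
Y = np.sin(X[:, :1]) + 0.2 * rng.normal(size=(n, dy))
V, W = rng.normal(size=(J, dx)), rng.normal(size=(J, dy))
print(nfsic_statistic(X, Y, V, W, sigx=1.0, sigy=1.0))
```

In the tuning procedure described earlier, one would maximize this statistic on the training split over the locations and bandwidths, then compare its value on the held-out split against the $\chi^2(J)$ quantile.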

slide-42
SLIDE 42

Alternative View of the Witness u✭v❀ w✮

The witness $u(v, w)$ can be rewritten as

$$u(v, w) := \mu_{xy}(v, w) - \mu_x(v)\,\mu_y(w) = \mathbb{E}_{xy}[k(x, v)\, l(y, w)] - \mathbb{E}_x[k(x, v)]\,\mathbb{E}_y[l(y, w)] = \mathrm{cov}_{xy}[k(x, v), l(y, w)].$$

1. Transform $x \mapsto k(x, v)$ (from $\mathbb{R}^{d_x}$ to $\mathbb{R}$) and $y \mapsto l(y, w)$ (from $\mathbb{R}^{d_y}$ to $\mathbb{R}$).
2. Then take the covariance.

The kernel transformations turn the linear covariance into a dependence measure.

slide-44
SLIDE 44

Alternative Form of ❫ u✭v❀ w✮

Recall $\widehat{\mathrm{FSIC}^2} = \frac{1}{J}\sum_{i=1}^{J} \hat{u}(v_i, w_i)^2$.

Let $\widehat{\mu_x \mu_y}(v, w)$ be an unbiased estimator of $\mu_x(v)\mu_y(w)$:

$$\widehat{\mu_x \mu_y}(v, w) := \frac{1}{n(n-1)} \sum_{i=1}^{n} \sum_{j \neq i} k(x_i, v)\, l(y_j, w).$$

An unbiased estimator of $u(v, w)$ is

$$\hat{u}(v, w) = \hat{\mu}_{xy}(v, w) - \widehat{\mu_x \mu_y}(v, w) = \frac{2}{n(n-1)} \sum_{i < j} h_{(v,w)}\big((x_i, y_i), (x_j, y_j)\big),$$

where $h_{(v,w)}\big((x, y), (x', y')\big) := \frac{1}{2}\big(k(x, v) - k(x', v)\big)\big(l(y, w) - l(y', w)\big)$.

Given $(v, w)$, $\hat{u}(v, w)$ is a one-sample second-order U-statistic.
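As a quick numerical sanity check (my own illustration), the sketch below computes $\hat{u}(v, w)$ both from the U-statistic with kernel $h_{(v,w)}$ and from the plug-in form $\hat{\mu}_{xy} - \widehat{\mu_x \mu_y}$; the two agree up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = np.cos(x) + 0.3 * rng.normal(size=n)
v, w, sig = 0.5, 0.8, 1.0

k = np.exp(-(x - v) ** 2 / (2 * sig ** 2))   # k(x_i, v)
l = np.exp(-(y - w) ** 2 / (2 * sig ** 2))   # l(y_i, w)

# U-statistic form: (2 / (n(n-1))) * sum_{i<j} 0.5 * (k_i - k_j) * (l_i - l_j)
u_stat = 0.0
for i in range(n):
    for j in range(i + 1, n):
        u_stat += 0.5 * (k[i] - k[j]) * (l[i] - l[j])
u_stat *= 2.0 / (n * (n - 1))

# Plug-in form: mu_xy_hat minus the unbiased estimate of mu_x * mu_y
mu_xy = np.mean(k * l)
mu_x_mu_y = (np.sum(k) * np.sum(l) - np.sum(k * l)) / (n * (n - 1))
u_plugin = mu_xy - mu_x_mu_y

print(u_stat, u_plugin, np.isclose(u_stat, u_plugin))
```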

slide-46
SLIDE 46

Independence Test with HSIC [Gretton et al., 2005]

Hilbert-Schmidt Independence Criterion.
$\mathrm{HSIC}(X, Y) = \mathrm{MMD}(P_{xy}, P_x P_y) = \|u\|_{\mathrm{RKHS}}$ (needs two kernels: $k$ for $X$ and $l$ for $Y$).
Empirical witness: $\hat{u}(v, w) = \hat{\mu}_{xy}(v, w) - \hat{\mu}_x(v)\,\hat{\mu}_y(w)$, where $\hat{\mu}_{xy}(v, w) = \frac{1}{n}\sum_{i=1}^{n} k(x_i, v)\, l(y_i, w)$.
$\mathrm{HSIC}(X, Y) = 0$ if and only if $X$ and $Y$ are independent.
Test statistic $= \|\hat{u}\|_{\mathrm{RKHS}}$ (the "flatness" of $\hat{u}$). Complexity: $O(n^2)$.
Key question: can we measure this flatness in another way that costs only $O(n)$?
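For contrast with the linear-time statistic, one commonly used quadratic-time estimator of HSIC is the biased form $(n-1)^{-2}\,\mathrm{tr}(\mathbf{K} H \mathbf{L} H)$ with $n \times n$ Gram matrices $\mathbf{K}$, $\mathbf{L}$ and centering matrix $H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^\top$. The sketch below (my own, with hypothetical toy data) makes the $O(n^2)$ memory and time cost explicit.

```python
import numpy as np

def gram(A, sigma):
    """n x n Gaussian Gram matrix for the rows of A -- note the quadratic memory cost."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(A**2, 1)[None, :] - 2.0 * A @ A.T
    return np.exp(-sq / (2.0 * sigma**2))

def hsic_biased(X, Y, sigx=1.0, sigy=1.0):
    """Biased HSIC estimate (n-1)^{-2} tr(K H L H), via double-centering of K."""
    n = X.shape[0]
    K, L = gram(X, sigx), gram(Y, sigy)
    Kc = K - K.mean(0) - K.mean(1)[:, None] + K.mean()   # H K H
    return float((Kc * L).sum()) / (n - 1) ** 2

rng = np.random.default_rng(0)
n = 300
X = rng.normal(size=(n, 2))
Y = X[:, :1] ** 2 + 0.1 * rng.normal(size=(n, 1))   # dependent toy data
print(hsic_biased(X, Y))
```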

slide-53
SLIDE 53

Proposal: The Finite Set Independence Criterion (FSIC)

Idea: evaluate $\hat{u}^2(v, w)$ at only finitely many test locations. With a set of $J$ random locations $\{(v_1, w_1), \ldots, (v_J, w_J)\}$:

$$\widehat{\mathrm{FSIC}^2}(X, Y) = \frac{1}{J} \sum_{i=1}^{J} \hat{u}^2(v_i, w_i).$$

Complexity: $O((d_x + d_y) J n)$. Linear time.

Can $\mathrm{FSIC}^2(X, Y) = 0$ even if $X$ and $Y$ are dependent?
  • No. The population $\mathrm{FSIC}(X, Y) = 0$ iff $X \perp Y$, almost surely.

slide-58
SLIDE 58

HSIC vs. FSIC

Recall the witness $\hat{u}(v, w) = \hat{\mu}_{xy}(v, w) - \hat{\mu}_x(v)\,\hat{\mu}_y(w)$.

HSIC [Gretton et al., 2005] $= \|\hat{u}\|_{\mathrm{RKHS}}$.
[Figure: witness surface over $(v, w)$.]
Good when the difference between $p_{xy}$ and $p_x p_y$ is spatially diffuse; $\hat{u}$ is almost flat.

FSIC [proposed] $= \frac{1}{J}\sum_{i=1}^{J} \hat{u}^2(v_i, w_i)$.
[Figure: witness surface over $(v, w)$.]
Good when the difference between $p_{xy}$ and $p_x p_y$ is local; $\hat{u}$ is mostly zero but has many peaks (feature interaction).

slide-59
SLIDE 59

Toy Problem 1: Independent Gaussians

$X \sim \mathcal{N}(0, I_{d_x})$ and $Y \sim \mathcal{N}(0, I_{d_y})$ are independent, so $H_0$ holds. Set $\alpha := 0.05$, $d_x = d_y = 250$.

[Figure: Type-I error and runtime (s) vs. sample size $n$ for NFSIC-opt, NFSIC-med, QHSIC, NyHSIC, FHSIC, and RDC.]

Correct Type-I errors (false positive rates) for all methods.

slide-62
SLIDE 62

Toy Problem 2: Sinusoid

$p_{xy}(x, y) \propto 1 + \sin(\omega x)\sin(\omega y)$, where $x, y \in (-\pi, \pi)$. The changes between $p_{xy}$ and $p_x p_y$ are local. Set $n = 4000$.

[Figure: test power vs. $\omega$ in $1 + \sin(\omega x)\sin(\omega y)$ for NFSIC-opt, NFSIC-med, QHSIC, NyHSIC, FHSIC, and RDC, with density plots of $p_{xy}$ for $\omega = 1, \ldots, 4$.]

Main point: NFSIC handles local changes in the joint space well.
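This density is easy to simulate by rejection sampling, since the unnormalized density $1 + \sin(\omega x)\sin(\omega y)$ is bounded by 2 on $(-\pi, \pi)^2$. A minimal sketch (my own illustration, not the authors' data-generation code):

```python
import numpy as np

def sample_sinusoid(n, omega, rng):
    """Rejection sampling from p(x, y) proportional to 1 + sin(omega x) sin(omega y)."""
    xs, ys = [], []
    while len(xs) < n:
        x = rng.uniform(-np.pi, np.pi)
        y = rng.uniform(-np.pi, np.pi)
        # Accept with probability (unnormalized density) / 2, since it is at most 2.
        if rng.uniform(0.0, 2.0) < 1.0 + np.sin(omega * x) * np.sin(omega * y):
            xs.append(x)
            ys.append(y)
    return np.array(xs), np.array(ys)

rng = np.random.default_rng(0)
x, y = sample_sinusoid(4000, omega=2.0, rng=rng)
print(np.corrcoef(x, y)[0, 1])   # weak linear correlation; the dependence is local
```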

slide-68
SLIDE 68

Toy Problem 3: Gaussian Sign

$y = |Z| \prod_{i=1}^{d_x} \mathrm{sign}(x_i)$, where $x \sim \mathcal{N}(0, I_{d_x})$ and $Z \sim \mathcal{N}(0, 1)$ (noise).

Full interaction among $x_1, \ldots, x_{d_x}$: all of $x_1, \ldots, x_{d_x}$ must be considered jointly to detect the dependency.

[Figure: test power vs. sample size $n$ for NFSIC-opt, NFSIC-med, QHSIC, NyHSIC, FHSIC, and RDC.]

Main point: NFSIC can handle feature interaction.
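A minimal sketch of this data-generating process (my own illustration), together with a quick check that no single coordinate of $x$ is linearly correlated with $y$:

```python
import numpy as np

def gaussian_sign(n, dx, rng):
    """y = |Z| * prod_i sign(x_i), with x ~ N(0, I_dx) and Z ~ N(0, 1)."""
    x = rng.normal(size=(n, dx))
    z = rng.normal(size=n)
    y = np.abs(z) * np.prod(np.sign(x), axis=1)
    return x, y

rng = np.random.default_rng(0)
x, y = gaussian_sign(2000, dx=4, rng=rng)
# Each single coordinate of x is uncorrelated with y; the dependence only appears jointly.
print(np.corrcoef(x[:, 0], y)[0, 1])
```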

slide-70
SLIDE 70

Test Power vs. J

Test power does not always increase with $J$ (the number of test locations). $n = 800$.

[Figure: sinusoid problem with $\omega = 2$; test power vs. $J$.]

Accurate estimation of $\hat{\boldsymbol{\Sigma}} \in \mathbb{R}^{J \times J}$ in $\hat{\lambda}_n = n\, \hat{\mathbf{u}}^\top \big( \hat{\boldsymbol{\Sigma}} + \gamma_n I \big)^{-1} \hat{\mathbf{u}}$ becomes more difficult as $J$ grows, and a large $J$ defeats the purpose of a linear-time test.

slide-71
SLIDE 71

Real Problem: Million Song Data

Song ($X$) vs. year of release ($Y$): Western commercial tracks from 1922 to 2011 [Bertin-Mahieux et al., 2011]. $X \in \mathbb{R}^{90}$ contains audio features; $Y \in \mathbb{R}$ is the year of release.

[Figure: Type-I error (with the $(X, Y)$ pairs broken to simulate $H_0$) and test power vs. sample size $n$ for NFSIC-opt, NFSIC-med, QHSIC, NyHSIC, FHSIC, and RDC.]

NFSIC-opt has the highest power among the linear-time tests.

slide-73
SLIDE 73

References I

Bertin-Mahieux, T., Ellis, D. P., Whitman, B., and Lamere, P. (2011). The Million Song Dataset. In International Conference on Music Information Retrieval (ISMIR).

Gretton, A., Bousquet, O., Smola, A., and Schölkopf, B. (2005). Measuring Statistical Dependence with Hilbert-Schmidt Norms. In Algorithmic Learning Theory (ALT), pages 63–77.

Lopez-Paz, D., Hennig, P., and Schölkopf, B. (2013). The Randomized Dependence Coefficient. In Advances in Neural Information Processing Systems (NIPS), pages 1–9.

slide-74
SLIDE 74

References II

Wang, H. and Schmid, C. (2013). Action Recognition with Improved Trajectories. In IEEE International Conference on Computer Vision (ICCV), pages 3551–3558.