Lecture 3: Dependence measures using RKHS embeddings, MLSS Cadiz, 2016 (PowerPoint PPT Presentation)



SLIDE 1

Lecture 3: Dependence measures using RKHS embeddings

MLSS Cadiz, 2016

Arthur Gretton Gatsby Unit, CSML, UCL

SLIDE 2

Outline

  • Three or more variable interactions, comparison with conditional dependence testing [Sejdinovic et al., 2013a]
  • Dependence detection in detail, covariance operators
  • Choice of kernel to maximise test power [Gretton et al., 2012b]
  • Supervised learning with distributions as inputs [Jitkrittum et al., 2015, Szabó et al., 2015]
  • Recent work (2014/2015) (not in this talk, see my webpage)
    – Testing for time series [Chwialkowski and Gretton, 2014, Chwialkowski et al., 2014]
    – Infinite dimensional exponential families [Sriperumbudur et al., 2014]
    – Adaptive MCMC, and adaptive Hamiltonian Monte Carlo [Sejdinovic et al., 2014, Strathmann et al., 2015]

SLIDE 3

Lancaster (3-way) Interactions

SLIDE 4

Detecting a higher order interaction

  • How to detect V-structures with pairwise weak individual dependence?

X Y Z

SLIDE 6

Detecting a higher order interaction

  • How to detect V-structures with pairwise weak individual dependence?
  • X ⊥⊥ Y, Y ⊥⊥ Z, X ⊥⊥ Z
  • X, Y i.i.d. ∼ N(0, 1)
  • Z | X, Y ∼ sign(XY) Exp(1/√2)

Faithfulness violated here

[Plots: scatter plots of X vs Y, Y vs Z, X vs Z, and XY vs Z for the V-structure over X, Y, Z]

SLIDE 7

V-structure Discovery

[Graph: V-structure over X, Y, Z]

Assume X ⊥⊥ Y has been established. The V-structure can then be detected by:

  • Consistent CI test: H0 : X ⊥⊥ Y | Z [Fukumizu et al., 2008, Zhang et al., 2011], or

SLIDE 8

V-structure Discovery

[Graph: V-structure over X, Y, Z]

Assume X ⊥⊥ Y has been established. The V-structure can then be detected by:

  • Consistent CI test: H0 : X ⊥⊥ Y | Z [Fukumizu et al., 2008, Zhang et al., 2011], or
  • Factorisation test: H0 : (X, Y) ⊥⊥ Z ∨ (X, Z) ⊥⊥ Y ∨ (Y, Z) ⊥⊥ X (multiple standard two-variable tests)
    – compute p-values for each of the marginal tests for (Y, Z) ⊥⊥ X, (X, Z) ⊥⊥ Y, and (X, Y) ⊥⊥ Z
    – apply the Holm-Bonferroni (HB) sequentially rejective correction (Holm, 1979)
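The Holm-Bonferroni step can be sketched in a few lines (a minimal illustration with a hypothetical `holm_bonferroni` helper, not code from the lecture; since the factorisation null is a disjunction of the three marginal nulls, one natural reading is to reject it only when every corrected marginal test rejects):

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Holm's sequentially rejective correction: compare the i-th smallest
    p-value against alpha / (m - i); stop at the first non-rejection."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # all remaining (larger) p-values also fail
    return reject

# Hypothetical p-values for the three marginal factorisation tests:
p_marginal = [0.001, 0.002, 0.003]
reject_factorisation = all(holm_bonferroni(p_marginal))
```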

SLIDE 9

V-structure Discovery (2)

  • How to detect V-structures with pairwise weak (or nonexistent) dependence?
  • X ⊥⊥ Y, Y ⊥⊥ Z, X ⊥⊥ Z
  • X1, Y1 i.i.d. ∼ N(0, 1)
  • Z1 | X1, Y1 ∼ sign(X1 Y1) Exp(1/√2)
  • X2:p, Y2:p, Z2:p i.i.d. ∼ N(0, I_{p−1})

Faithfulness violated here

[Plots: scatter plots of X1 vs Y1, Y1 vs Z1, X1 vs Z1, and X1·Y1 vs Z1]

SLIDE 10

V-structure Discovery (3)

Figure 1: CI test for X ⊥⊥ Y | Z from Zhang et al. (2011), and a factorisation test with a HB correction, n = 500. [Plot: null acceptance rate (Type II error) vs dimension (1 to 19), dataset A; curves: CI test X ⊥⊥ Y | Z, two-variable factorisation test]

SLIDE 11

Lancaster Interaction Measure

[Bahadur (1961); Lancaster (1969)] Interaction measure of (X1, . . . , XD) ∼ P is a signed measure ∆P that vanishes whenever P can be factorised in a non-trivial way as a product of its (possibly multivariate) marginal distributions.

  • D = 2 :

∆LP = PXY − PXPY

SLIDE 12

Lancaster Interaction Measure

  • D = 3 : ∆LP = PXYZ − PX PYZ − PY PXZ − PZ PXY + 2 PX PY PZ

SLIDE 13

Lancaster Interaction Measure

  • D = 3 : ∆LP = PXYZ − PX PYZ − PY PXZ − PZ PXY + 2 PX PY PZ

[Diagram: the five terms of ∆LP drawn as graphs over (X, Y, Z)]

SLIDE 14

Lancaster Interaction Measure

Case of X ⊥⊥ (Y, Z): then PXYZ = PX PYZ, PXZ = PX PZ, and PXY = PX PY, so

∆LP = PXYZ − PX PYZ − PY PXZ − PZ PXY + 2 PX PY PZ = 0

SLIDE 15

Lancaster Interaction Measure

  • D = 3 : ∆LP = PXYZ − PX PYZ − PY PXZ − PZ PXY + 2 PX PY PZ

(X, Y) ⊥⊥ Z ∨ (X, Z) ⊥⊥ Y ∨ (Y, Z) ⊥⊥ X ⇒ ∆LP = 0. ...so what might be missed?

SLIDE 16

Lancaster Interaction Measure

∆LP = 0 does not imply (X, Y) ⊥⊥ Z ∨ (X, Z) ⊥⊥ Y ∨ (Y, Z) ⊥⊥ X. Example:

P(0, 0, 0) = 0.2  P(0, 0, 1) = 0.1  P(1, 0, 0) = 0.1  P(1, 0, 1) = 0.1
P(0, 1, 0) = 0.1  P(0, 1, 1) = 0.1  P(1, 1, 0) = 0.1  P(1, 1, 1) = 0.2
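The example can be checked directly: the Lancaster measure vanishes for this table, yet no two-variable factorisation holds. A short verification (helper names are mine):

```python
from itertools import product

# Joint distribution from the slide: P(x, y, z) on {0, 1}^3
P = {(0,0,0): 0.2, (0,0,1): 0.1, (1,0,0): 0.1, (1,0,1): 0.1,
     (0,1,0): 0.1, (0,1,1): 0.1, (1,1,0): 0.1, (1,1,1): 0.2}

def marg(keep):
    """Marginal of P over the coordinates listed in `keep`."""
    out = {}
    for xyz, p in P.items():
        key = tuple(xyz[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out

PX, PY, PZ = marg((0,)), marg((1,)), marg((2,))
PXY, PXZ, PYZ = marg((0, 1)), marg((0, 2)), marg((1, 2))

# Lancaster interaction for D = 3
delta = {}
for x, y, z in product((0, 1), repeat=3):
    delta[(x, y, z)] = (P[(x, y, z)]
                       - PX[(x,)] * PYZ[(y, z)]
                       - PY[(y,)] * PXZ[(x, z)]
                       - PZ[(z,)] * PXY[(x, y)]
                       + 2 * PX[(x,)] * PY[(y,)] * PZ[(z,)])

max_delta = max(abs(v) for v in delta.values())          # vanishes (up to rounding)
gap = abs(P[(0, 0, 0)] - PXY[(0, 0)] * PZ[(0,)])         # but (X, Y) is not independent of Z
```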

SLIDE 17

A Test using Lancaster Measure

  • Test statistic is the empirical estimate of ‖µκ(∆LP)‖²_Hκ, where κ = k ⊗ l ⊗ m:

‖µκ(PXYZ − PXY PZ − · · · )‖²_Hκ = ⟨µκPXYZ, µκPXYZ⟩_Hκ − 2⟨µκPXYZ, µκPXY PZ⟩_Hκ + · · ·

SLIDE 18

Inner Product Estimators

Table 1: V-statistic estimators of ⟨µκν, µκν′⟩ (upper triangle; the table is symmetric):

ν\ν′       PXYZ          PXY PZ        PXZ PY        PYZ PX        PX PY PZ
PXYZ       (K∘L∘M)++     ((K∘L)M)++    ((K∘M)L)++    ((M∘L)K)++    tr(K+ ∘ L+ ∘ M+)
PXY PZ                   (K∘L)++ M++   (MKL)++       (KLM)++       (KL)++ M++
PXZ PY                                 (K∘M)++ L++   (KML)++       (KM)++ L++
PYZ PX                                               (L∘M)++ K++   (LM)++ K++
PX PY PZ                                                           K++ L++ M++

SLIDE 19

Inner Product Estimators

(Table 1 repeated.)

‖µκ(∆LP)‖²_Hκ = (1/n²) (HKH ∘ HLH ∘ HMH)++ , the empirical joint central moment in the feature space

SLIDE 20

Example A: factorisation tests

Figure 2: Factorisation hypothesis: Lancaster statistic vs. a two-variable based test (both with HB correction); test for X ⊥⊥ Y | Z from Zhang et al. (2011), n = 500. [Plot: null acceptance rate (Type II error) vs dimension (1 to 19), dataset A; curves: CI test, ∆L factorisation test, two-variable factorisation test]

SLIDE 21

Example B: Joint dependence can be easier to detect

  • X1, Y1 i.i.d. ∼ N(0, 1)
  • Z1 = X1² + ǫ w.p. 1/3; Y1² + ǫ w.p. 1/3; X1 Y1 + ǫ w.p. 1/3, where ǫ ∼ N(0, 0.1²)
  • X2:p, Y2:p, Z2:p i.i.d. ∼ N(0, I_{p−1})
  • dependence of Z on the pair (X, Y) is stronger than on X and Y individually
  • satisfies faithfulness
SLIDE 22

Example B: factorisation tests

Figure 3: Factorisation hypothesis: Lancaster statistic vs. a two-variable based test (both with HB correction); test for X ⊥⊥ Y | Z from Zhang et al. (2011), n = 500. [Plot: null acceptance rate (Type II error) vs dimension (1 to 19), dataset B; curves: CI test, ∆L factorisation test, two-variable factorisation test]

SLIDE 23

Interaction for D ≥ 4

  • Interaction measure valid for all D (Streitberg, 1990):

∆S P = Σ_π (−1)^{|π|−1} (|π| − 1)! Jπ P

    – For a partition π, Jπ associates to the joint the corresponding factorisation, e.g., J13|2|4 P = PX1X3 PX2 PX4.

SLIDE 25

Interaction for D ≥ 4

  • Interaction measure valid for all D (Streitberg, 1990):

∆S P = Σ_π (−1)^{|π|−1} (|π| − 1)! Jπ P

    – For a partition π, Jπ associates to the joint the corresponding factorisation, e.g., J13|2|4 P = PX1X3 PX2 PX4.

[Plot: number of partitions of {1, . . . , D} vs D, growing as the Bell numbers (reaching ~1e19 by D = 25)]

joint central moments (Lancaster interaction) vs. joint cumulants (Streitberg interaction)
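The growth shown in the plot can be reproduced with the Bell triangle; a short sketch (the helper name is mine):

```python
def bell_numbers(n_max):
    """Bell numbers via the Bell triangle: B(D) counts the partitions of
    {1, ..., D}, i.e. the number of terms in Streitberg's interaction measure."""
    row = [1]
    bells = [1]  # B(1) = 1: a single partition of {1}
    for _ in range(n_max - 1):
        nxt = [row[-1]]          # each row starts with the last entry of the previous row
        for v in row:
            nxt.append(nxt[-1] + v)
        row = nxt
        bells.append(row[-1])    # B(D) is the last entry of row D
    return bells

print(bell_numbers(10)[-1])  # 115975 partitions already at D = 10
```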

SLIDE 26

Total independence test

  • Total independence test:

H0 : PXYZ = PX PY PZ vs. H1 : PXYZ ≠ PX PY PZ

SLIDE 27

Total independence test

  • Total independence test:

H0 : PXYZ = PX PY PZ vs. H1 : PXYZ ≠ PX PY PZ

  • For (X1, . . . , XD) ∼ PX, and κ = ⊗_{i=1}^D k^(i):

‖µκ(P̂X − ∏_{i=1}^D P̂Xi)‖² = ‖∆tot P̂‖²
= (1/n²) Σ_{a=1}^n Σ_{b=1}^n ∏_{i=1}^D K^(i)_{ab} − (2/n^{D+1}) Σ_{a=1}^n ∏_{i=1}^D Σ_{b=1}^n K^(i)_{ab} + (1/n^{2D}) ∏_{i=1}^D Σ_{a=1}^n Σ_{b=1}^n K^(i)_{ab}.

  • Coincides with the test proposed by Kankainen (1995) using empirical characteristic functions.
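The three-term estimator above translates directly into Gram-matrix operations; a minimal sketch assuming Gaussian kernels on 1-d coordinates (the function name is mine):

```python
import numpy as np

def total_independence_stat(samples, sigma=1.0):
    """Empirical ||mu_kappa(P_hat - prod_i P_hat_i)||^2 for D variables,
    built from the per-variable Gram matrices K^(i)."""
    n = samples[0].shape[0]
    D = len(samples)
    Ks = [np.exp(-np.subtract.outer(s, s) ** 2 / (2 * sigma ** 2)) for s in samples]
    # (1/n^2) sum_{a,b} prod_i K^(i)_ab
    term1 = np.sum(np.prod(Ks, axis=0)) / n ** 2
    # (2/n^{D+1}) sum_a prod_i sum_b K^(i)_ab
    term2 = 2 * np.sum(np.prod([K.sum(axis=1) for K in Ks], axis=0)) / n ** (D + 1)
    # (1/n^{2D}) prod_i sum_{a,b} K^(i)_ab
    term3 = np.prod([K.sum() for K in Ks]) / n ** (2 * D)
    return term1 - term2 + term3

rng = np.random.default_rng(2)
x = rng.normal(size=100)
dep = total_independence_stat([x, x, x])  # strongly jointly dependent triple
ind = total_independence_stat([x, rng.normal(size=100), rng.normal(size=100)])
```

Being a squared norm of an embedding difference, the statistic is nonnegative, and is far larger for the dependent triple.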

SLIDE 28

Kernel dependence measures - in detail

SLIDE 29

MMD for independence: HSIC

“Their noses guide them through life, and they're never happier than when following an interesting scent. They need plenty of exercise, about an hour a day if possible. A large animal who slings slobber, exudes a distinctive houndy odor, and wants nothing more than to follow his nose. They need a significant amount of exercise and mental stimulation.”

“Known for their curiosity, intelligence, and excellent communication skills, the Javanese breed is perfect if you want a responsive, interactive pet, one that will blow in your ear and follow you everywhere.”

Text from dogtime.com and petfinder.com.

Empirical HSIC(PXY, PX PY): (1/n²) (HKH ∘ HLH)++
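The empirical statistic is a few lines of matrix code; a minimal sketch assuming Gaussian kernels on 1-d samples (the function name and kernel choice are mine):

```python
import numpy as np

def hsic(x, y, sigma=1.0):
    """Biased empirical HSIC: (1/n^2) * sum over all entries of the
    elementwise product of the centred Gram matrices, (1/n^2)(HKH o HLH)_{++}."""
    n = len(x)
    K = np.exp(-np.subtract.outer(x, x) ** 2 / (2 * sigma ** 2))  # Gram matrix on X
    L = np.exp(-np.subtract.outer(y, y) ** 2 / (2 * sigma ** 2))  # Gram matrix on Y
    H = np.eye(n) - np.ones((n, n)) / n                            # centring matrix
    return np.sum((H @ K @ H) * (H @ L @ H)) / n ** 2

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y_dep = np.sin(3 * x) + 0.1 * rng.normal(size=200)  # dependent on x
y_ind = rng.normal(size=200)                        # independent of x
```

On these samples the statistic is much larger for the dependent pair than for the independent one.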

SLIDE 30

Covariance to reveal dependence

A more intuitive idea: maximize covariance of smooth mappings:

COCO(P; F, G) := sup_{‖f‖F=1, ‖g‖G=1} (Ex,y[f(x) g(y)] − Ex[f(x)] Ey[g(y)])

SLIDE 31

Covariance to reveal dependence

A more intuitive idea: maximize covariance of smooth mappings:

COCO(P; F, G) := sup_{‖f‖F=1, ‖g‖G=1} (Ex,y[f(x) g(y)] − Ex[f(x)] Ey[g(y)])

[Scatter plot: X vs Y, correlation −0.00]

SLIDE 32

Covariance to reveal dependence

A more intuitive idea: maximize covariance of smooth mappings:

COCO(P; F, G) := sup_{‖f‖F=1, ‖g‖G=1} (Ex,y[f(x) g(y)] − Ex[f(x)] Ey[g(y)])

[Plots: X vs Y (correlation −0.00); dependence witness f(x) for X; dependence witness g(y) for Y]

SLIDE 33

Covariance to reveal dependence

A more intuitive idea: maximize covariance of smooth mappings:

COCO(P; F, G) := sup_{‖f‖F=1, ‖g‖G=1} (Ex,y[f(x) g(y)] − Ex[f(x)] Ey[g(y)])

[Plots: X vs Y (correlation −0.00); dependence witnesses f, g; f(X) vs g(Y): correlation −0.90, COCO: 0.14]

SLIDE 34

Covariance to reveal dependence

A more intuitive idea: maximize covariance of smooth mappings:

COCO(P; F, G) := sup_{‖f‖F=1, ‖g‖G=1} (Ex,y[f(x) g(y)] − Ex[f(x)] Ey[g(y)])

[Plots: X vs Y (correlation −0.00); dependence witnesses f, g; f(X) vs g(Y): correlation −0.90, COCO: 0.14]

How do we define covariance in (infinite) feature spaces?

SLIDE 35

Covariance to reveal dependence

Covariance in RKHS: Let's first look at the finite linear case. We have two random vectors x ∈ R^d, y ∈ R^d′. Are they linearly dependent?

SLIDE 36

Covariance to reveal dependence

Covariance in RKHS: Let's first look at the finite linear case. We have two random vectors x ∈ R^d, y ∈ R^d′. Are they linearly dependent? Compute their covariance matrix (ignore centering):

Cxy = E[xy⊤]

How to get a single “summary” number?

SLIDE 37

Covariance to reveal dependence

Covariance in RKHS: Let's first look at the finite linear case. We have two random vectors x ∈ R^d, y ∈ R^d′. Are they linearly dependent? Compute their covariance matrix (ignore centering):

Cxy = E[xy⊤]

How to get a single “summary” number? Solve for vectors f ∈ R^d, g ∈ R^d′:

argmax_{‖f‖=1, ‖g‖=1} f⊤ Cxy g = argmax_{‖f‖=1, ‖g‖=1} Exy[(f⊤x)(g⊤y)] = argmax_{‖f‖=1, ‖g‖=1} Ex,y[f(x) g(y)] = argmax_{‖f‖=1, ‖g‖=1} cov(f(x), g(y))   (maximum singular value)

SLIDE 38

Challenges in defining feature space covariance

Given features φ(x) ∈ F and ψ(y) ∈ G: Challenge 1: Can we define a feature space analog to xy⊤? YES:

  • Given f ∈ R^d, g ∈ R^d′, h ∈ R^d′, define the matrix fg⊤ such that (fg⊤)h = f(g⊤h).
  • Given f ∈ F, g ∈ G, h ∈ G, define the tensor product operator f ⊗ g such that (f ⊗ g)h = f ⟨g, h⟩G.
  • Now just set f := φ(x), g := ψ(y), to get xy⊤ → φ(x) ⊗ ψ(y)
  • Corresponds to the product kernel:

⟨φ(x) ⊗ ψ(y), φ(x′) ⊗ ψ(y′)⟩ = k(x, x′) l(y, y′)

SLIDE 39

Challenges in defining feature space covariance

Given features φ(x) ∈ F and ψ(y) ∈ G: Challenge 2: Does a covariance “matrix” (operator) in feature space exist? I.e. is there some CXY : G → F such that f, CXY gF = Ex,y[f(x)g(y)] = cov (f(x), g(y))

SLIDE 40

Challenges in defining feature space covariance

Given features φ(x) ∈ F and ψ(y) ∈ G: Challenge 2: Does a covariance “matrix” (operator) in feature space exist? I.e. is there some CXY : G → F such that f, CXY gF = Ex,y[f(x)g(y)] = cov (f(x), g(y)) YES: via Bochner integrability argument (as with mean embedding). Under the condition Ex,y

  • k(x, x)l(y, y)
  • < ∞, we can define:

CXY := Ex,y [φ(x) ⊗ ψ(y)] which is a Hilbert-Schmidt operator (sum of squared singular values is finite).

SLIDE 41

REMINDER: functions revealing dependence

COCO(P; F, G) := sup_{‖f‖F=1, ‖g‖G=1} (Ex,y[f(x) g(y)] − Ex[f(x)] Ey[g(y)])

[Plots: X vs Y (correlation −0.00); dependence witnesses f, g; f(X) vs g(Y): correlation −0.90, COCO: 0.14]

How do we compute this from finite data?

SLIDE 42

Empirical covariance operator

The empirical feature covariance, given z := {(xi, yi)}_{i=1}^n (now include centering):

ĈXY := (1/n) Σ_{i=1}^n φ(xi) ⊗ ψ(yi) − µ̂x ⊗ µ̂y, where µ̂x := (1/n) Σ_{i=1}^n φ(xi).
SLIDE 43

Functions revealing dependence

Optimization problem:

COCO(z; F, G) := max ⟨f, ĈXY g⟩F subject to ‖f‖F ≤ 1, ‖g‖G ≤ 1

Assume f = Σ_{i=1}^n αi [φ(xi) − µ̂x], g = Σ_{j=1}^n βj [ψ(yj) − µ̂y]. The associated Lagrangian is

L(f, g, λ, γ) = ⟨f, ĈXY g⟩F − (λ/2)(‖f‖²F − 1) − (γ/2)(‖g‖²G − 1),

where λ ≥ 0 and γ ≥ 0.

SLIDE 44

Covariance to reveal dependence

  • Empirical COCO(z; F, G): largest eigenvalue γ of the generalized eigenvalue problem

[ 0          (1/n) K̃L̃ ] [α]   =   γ [ K̃  0 ] [α]
[ (1/n) L̃K̃  0         ] [β]         [ 0  L̃ ] [β]

K̃ and L̃ are matrices of inner products between centred observations in the respective feature spaces:

K̃ = HKH, where Kij = k(xi, xj) and H = I − (1/n) 11⊤

SLIDE 45

Covariance to reveal dependence

  • Empirical COCO(z; F, G): largest eigenvalue γ of the generalized eigenvalue problem

[ 0          (1/n) K̃L̃ ] [α]   =   γ [ K̃  0 ] [α]
[ (1/n) L̃K̃  0         ] [β]         [ 0  L̃ ] [β]

K̃ and L̃ are matrices of inner products between centred observations in the respective feature spaces:

K̃ = HKH, where Kij = k(xi, xj) and H = I − (1/n) 11⊤

  • Mapping function for x:

f(x) = Σ_{i=1}^n αi [ k(xi, x) − (1/n) Σ_{j=1}^n k(xj, x) ]
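Substituting the expansions of f and g into the eigenvalue problem reduces it (up to null-space issues) to γ = (1/n)·√λmax(K̃L̃). A sketch under that reduction, assuming Gaussian kernels on 1-d data (names are mine):

```python
import numpy as np

def coco(x, y, sigma=1.0):
    """Empirical COCO: largest eigenvalue gamma of the generalized eigenvalue
    problem, computed as (1/n) * sqrt(lambda_max(Kc @ Lc)) for centred Gram
    matrices Kc, Lc."""
    n = len(x)
    K = np.exp(-np.subtract.outer(x, x) ** 2 / (2 * sigma ** 2))
    L = np.exp(-np.subtract.outer(y, y) ** 2 / (2 * sigma ** 2))
    H = np.eye(n) - np.ones((n, n)) / n
    Kc, Lc = H @ K @ H, H @ L @ H
    # Kc @ Lc is a product of PSD matrices, so its spectrum is real and nonnegative
    lam_max = float(np.max(np.linalg.eigvals(Kc @ Lc).real))
    return np.sqrt(max(lam_max, 0.0)) / n

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y_ind = rng.normal(size=200)  # independent of x
```

With a perfectly dependent pair the statistic clearly dominates the independent one (the latter stays small but nonzero at finite n, the bias discussed later in the lecture).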

SLIDE 46

Hard-to-detect dependence

[Plots: smooth density and 500 samples from it; rough density and 500 samples from it]

Density takes the form: Px,y ∝ 1 + sin(ωx) sin(ωy)

SLIDE 47

Hard-to-detect dependence

COCO vs frequency of perturbation from independence.

SLIDE 48

Hard-to-detect dependence

COCO vs frequency of perturbation from independence. Case of ω = 1

[Plots: X vs Y (correlation 0.27); dependence witnesses f, g; f(X) vs g(Y): correlation −0.50, COCO: 0.09]

SLIDE 49

Hard-to-detect dependence

COCO vs frequency of perturbation from independence. Case of ω = 2

[Plots: X vs Y (correlation 0.04); dependence witnesses f, g; f(X) vs g(Y): correlation 0.51, COCO: 0.07]

SLIDE 50

Hard-to-detect dependence

COCO vs frequency of perturbation from independence. Case of ω = 3

[Plots: X vs Y (correlation 0.03); dependence witnesses f, g; f(X) vs g(Y): correlation −0.45, COCO: 0.03]

SLIDE 51

Hard-to-detect dependence

COCO vs frequency of perturbation from independence. Case of ω = 4

[Plots: X vs Y (correlation 0.03); dependence witnesses f, g; f(X) vs g(Y): correlation 0.21, COCO: 0.02]

SLIDE 52

Hard-to-detect dependence

COCO vs frequency of perturbation from independence. Case of ω = ??

[Plots: X vs Y (correlation 0.00); dependence witnesses f, g; f(X) vs g(Y): correlation −0.13, COCO: 0.02]

SLIDE 53

Hard-to-detect dependence

COCO vs frequency of perturbation from independence. Case of uniform noise! This bias will decrease with increasing sample size.

[Plots: X vs Y (correlation 0.00); dependence witnesses f, g; f(X) vs g(Y): correlation −0.13, COCO: 0.02]

SLIDE 54

Hard-to-detect dependence

COCO vs frequency of perturbation from independence.

  • As dependence is encoded at higher frequencies, the smooth mappings

f, g achieve lower linear covariance.

  • Even for independent variables, COCO will not be zero at finite sample

sizes, since some mild linear dependence will be induced by f, g (bias)

  • This bias will decrease with increasing sample size.
SLIDE 55

Hard-to-detect dependence

  • Example: sinusoids of increasing frequency

[Plot: empirical average COCO (1500 samples) vs frequency ω = 1, . . . , 6 of the non-constant density component]

SLIDE 56

More functions revealing dependence

  • Can we do better than COCO?
SLIDE 57

More functions revealing dependence

  • Can we do better than COCO?
  • A second example with zero correlation

[Plots: X vs Y (correlation 0); dependence witnesses f, g; f(X) vs g(Y): correlation −0.80, COCO: 0.11]

SLIDE 58

More functions revealing dependence

  • Can we do better than COCO?
  • A second example with zero correlation

[Plots: X vs Y (correlation 0); second dependence witnesses f2(x) for X and g2(y) for Y]

SLIDE 59

More functions revealing dependence

  • Can we do better than COCO?
  • A second example with zero correlation

[Plots: X vs Y (correlation 0); second dependence witnesses f2, g2; f2(X) vs g2(Y): correlation −0.37, COCO2: 0.06]

SLIDE 60

Hilbert-Schmidt Independence Criterion

  • Given γi := COCOi(z; F, G), define the Hilbert-Schmidt Independence Criterion (HSIC) [ALT05, NIPS07a, JMLR10]:

HSIC(z; F, G) := Σ_{i=1}^n γi²

SLIDE 61

Hilbert-Schmidt Independence Criterion

  • Given γi := COCOi(z; F, G), define the Hilbert-Schmidt Independence Criterion (HSIC) [ALT05, NIPS07a, JMLR10]:

HSIC(z; F, G) := Σ_{i=1}^n γi²

  • In the limit of infinite samples:

HSIC(P; F, G) := ‖Cxy‖²HS = ⟨Cxy, Cxy⟩HS
= Ex,x′,y,y′[k(x, x′) l(y, y′)] + Ex,x′[k(x, x′)] Ey,y′[l(y, y′)] − 2 Ex,y[Ex′[k(x, x′)] Ey′[l(y, y′)]]

x′ an independent copy of x, y′ a copy of y. HSIC is identical to MMD(PXY, PX PY)

SLIDE 62

When does HSIC determine independence?

Theorem: When kernels k and l are each characteristic, then HSIC = 0 iff Px,y = PxPy [Gretton, 2015]. Weaker than MMD condition (which requires a kernel characteristic on X × Y to distinguish Px,y from Qx,y).

SLIDE 63

Intuition: why characteristic needed on both X and Y

Question: Wouldn’t it be enough just to use a rich mapping from X to Y, e.g. via ridge regression with characteristic F: f∗ = arg min

f∈F

  • EXY (Y − f, φ(X)F)2 + λf2

F

  • ,
SLIDE 64

Intuition: why characteristic needed on both X and Y

Question: Wouldn’t it be enough just to use a rich mapping from X to Y, e.g. via ridge regression with characteristic F: f∗ = arg min

f∈F

  • EXY (Y − f, φ(X)F)2 + λf2

F

  • ,

Counterexample: density symmetric about x-axis, s.t. p(x, y) = p(x, −y)

−2 2 −1.5 −1 −0.5 0.5 1 1.5

X Y Correlation: −0.00

SLIDE 65

Regression using distribution embeddings

SLIDE 66

Kernels on distributions in supervised learning

  • Kernels have been very widely used in supervised learning

– Support vector classification/regression, kernel ridge regression . . .

SLIDE 67

Kernels on distributions in supervised learning

  • Kernels have been very widely used in supervised learning
  • Simple kernel on distributions (population counterpart of set kernel) [Haussler, 1999, Gärtner et al., 2002]

K(P, Q) = ⟨µP, µQ⟩F

  • Squared distance between distribution embeddings (MMD)

MMD²(µP, µQ) := ‖µP − µQ‖²F = EP k(x, x′) + EQ k(y, y′) − 2 EP,Q k(x, y)

SLIDE 68

Kernels on distributions in supervised learning

  • Kernels have been very widely used in supervised learning
  • Simple kernel on distributions (population counterpart of set kernel) [Haussler, 1999, Gärtner et al., 2002]

K(P, Q) = ⟨µP, µQ⟩F

  • Can define kernels on mean embedding features [Christmann, Steinwart NIPS10], [AISTATS15]:

KG = exp(−‖µP − µQ‖²F / (2θ²))
Ke = exp(−‖µP − µQ‖F / (2θ²))
KC = (1 + ‖µP − µQ‖²F / θ²)^{−1}
Kt = (1 + ‖µP − µQ‖F^θ)^{−1}, θ ≤ 2

where ‖µP − µQ‖²F = EP k(x, x′) + EQ k(y, y′) − 2 EP,Q k(x, y)
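KG can be evaluated from samples by plugging in a biased MMD estimate; a minimal sketch with a Gaussian base kernel (function names and the two-bandwidth setup are mine):

```python
import numpy as np

def mmd2(xs, ys, theta_k=1.0):
    """Biased estimate of ||mu_P - mu_Q||_F^2 from samples of P and Q,
    with a Gaussian base kernel k of bandwidth theta_k."""
    def gram(a, b):
        return np.exp(-np.subtract.outer(a, b) ** 2 / (2 * theta_k ** 2))
    return gram(xs, xs).mean() + gram(ys, ys).mean() - 2 * gram(xs, ys).mean()

def K_G(xs, ys, theta=1.0):
    """Gaussian kernel on mean embeddings, K_G(P, Q) = exp(-MMD^2 / (2 theta^2)),
    estimated from samples."""
    return np.exp(-mmd2(xs, ys) / (2 * theta ** 2))

rng = np.random.default_rng(3)
same = K_G(rng.normal(size=200), rng.normal(size=200))          # P close to Q
diff = K_G(rng.normal(size=200), rng.normal(3.0, 1, size=200))  # Q shifted by 3
```

The biased MMD estimate is a squared norm and hence nonnegative, so K_G ≤ 1, with larger values for closer distributions.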

SLIDE 69

Regression using population mean embeddings

  • Samples z := {(µPi, yi)}_{i=1}^ℓ i.i.d. ∼ ρ(µP, y) = ρ(y|µP) ρ(µP), with µPi = EPi[ϕx]
  • Regression function: fρ(µP) = ∫R y dρ(y|µP)

SLIDE 70

Regression using population mean embeddings

  • Samples z := {(µPi, yi)}_{i=1}^ℓ i.i.d. ∼ ρ(µP, y) = ρ(y|µP) ρ(µP), with µPi = EPi[ϕx]
  • Regression function: fρ(µP) = ∫R y dρ(y|µP)
  • Ridge regression for labelled distributions (λ > 0):

fλz = arg min_{f∈H} (1/ℓ) Σ_{i=1}^ℓ (f(µPi) − yi)² + λ‖f‖²H

  • Define the RKHS H with kernel K(µP, µQ) := ⟨ψµP, ψµQ⟩H: functions from F ⊂ F to R, where F := {µP : P ∈ P}, P the set of prob. meas. on X

SLIDE 71

Regression using population mean embeddings

  • Expected risk, excess risk:

R[f] = Eρ(µP,y)(f(µP) − y)²,  E(fλz, fρ) = R[fλz] − R[fρ].

  • Minimax rate [Caponnetto and Vito, 2007]:

E(fλz, fρ) = Op(ℓ^{−bc/(bc+1)}),  (1 < b, c ∈ (1, 2]).

    – b: size of input space, c: smoothness of fρ

SLIDE 72

Regression using population mean embeddings

  • Expected risk, excess risk:

R[f] = Eρ(µP,y)(f(µP) − y)²,  E(fλz, fρ) = R[fλz] − R[fρ].

  • Minimax rate [Caponnetto and Vito, 2007]:

E(fλz, fρ) = Op(ℓ^{−bc/(bc+1)}),  (1 < b, c ∈ (1, 2]).

    – b: size of input space, c: smoothness of fρ

  • Replace µPi with µ̂Pi = N^{−1} Σ_{j=1}^N ϕ(xj), xj i.i.d. ∼ Pi
  • Given N = ℓ^a log(ℓ) and a = 2 (and a Hölder condition on ψ : F → H):

E(fλẑ, fρ) = Op(ℓ^{−bc/(bc+1)}),  (1 < b, c ∈ (1, 2]).

Same rate as for the population µPi embeddings! [AISTATS15, JMLR in revision]

SLIDE 73

Kernels on distributions in supervised learning

Supervised learning applications:

  • Regression: from distributions to vector spaces [AISTATS15]

    – Atmospheric monitoring: predict aerosol value from the distribution of pixel values of a multispectral satellite image over an area (performance matches engineered state-of-the-art [Wang et al., 2012])

  • Expectation propagation: learn to predict outgoing messages from

incoming messages, when updates would otherwise be done by numerical integration [UAI15]

  • Learning causal direction with mean embeddings [Lopez-Paz et al., 2015]
SLIDE 74

Learning causal direction with mean embeddings

Additive noise model to direct an edge between random variables x and y

[Hoyer et al., 2009] Figure: D. Lopez-Paz

SLIDE 75

Learning causal direction with mean embeddings

Classification of cause-effect relations [Lopez-Paz et al., 2015]

  • Tuebingen cause-effect pairs: 82 scalar real-world examples where causes and effects are known [Zscheischler, J., 2014]
  • Training data: artificial, random nonlinear functions with additive Gaussian noise.
  • Features: µ̂Px, µ̂Py, µ̂Pxy with labels for x → y and y → x
  • Performance: 81% correct

Figure: Mooij et al. (2015)

SLIDE 76

Co-authors

  • From UCL:
    – Luca Baldassarre
    – Steffen Grunewalder
    – Guy Lever
    – Sam Patterson
    – Massimiliano Pontil
    – Dino Sejdinovic
  • External:
    – Karsten Borgwardt, MPI
    – Wicher Bergsma, LSE
    – Kenji Fukumizu, ISM
    – Zaid Harchaoui, INRIA
    – Bernhard Schoelkopf, MPI
    – Alex Smola, CMU/Google
    – Le Song, Georgia Tech
    – Bharath Sriperumbudur, Cambridge

SLIDE 77

Kernel two-sample tests for big data, optimal kernel choice

SLIDE 78

Quadratic time estimate of MMD

MMD² = ‖µP − µQ‖²F = EP k(x, x′) + EQ k(y, y′) − 2 EP,Q k(x, y)

SLIDE 79

Quadratic time estimate of MMD

MMD² = ‖µP − µQ‖²F = EP k(x, x′) + EQ k(y, y′) − 2 EP,Q k(x, y)

Given i.i.d. X := {x1, . . . , xm} and Y := {y1, . . . , ym} from P, Q, respectively. The earlier estimate (quadratic time):

ÊP k(x, x′) = (1 / (m(m − 1))) Σ_{i=1}^m Σ_{j≠i} k(xi, xj)

SLIDE 80

Quadratic time estimate of MMD

MMD² = ‖µP − µQ‖²F = EP k(x, x′) + EQ k(y, y′) − 2 EP,Q k(x, y)

Given i.i.d. X := {x1, . . . , xm} and Y := {y1, . . . , ym} from P, Q, respectively. The earlier estimate (quadratic time):

ÊP k(x, x′) = (1 / (m(m − 1))) Σ_{i=1}^m Σ_{j≠i} k(xi, xj)

New, linear time estimate:

ÊP k(x, x′) = (2/m) [k(x1, x2) + k(x3, x4) + . . .] = (2/m) Σ_{i=1}^{m/2} k(x2i−1, x2i)

SLIDE 81

Linear time MMD

Shorter expression with explicit k dependence: MMD² =: ηk(p, q) = Ex,x′,y,y′ hk(x, x′, y, y′) =: Ev hk(v), where hk(x, x′, y, y′) = k(x, x′) + k(y, y′) − k(x, y′) − k(x′, y), and v := [x, x′, y, y′].

SLIDE 82

Linear time MMD

Shorter expression with explicit k dependence: MMD² =: ηk(p, q) = Ex,x′,y,y′ hk(x, x′, y, y′) =: Ev hk(v), where hk(x, x′, y, y′) = k(x, x′) + k(y, y′) − k(x, y′) − k(x′, y), and v := [x, x′, y, y′]. The linear time estimate again:

η̌k = (2/m) Σ_{i=1}^{m/2} hk(vi), where vi := [x2i−1, x2i, y2i−1, y2i] and hk(vi) := k(x2i−1, x2i) + k(y2i−1, y2i) − k(x2i−1, y2i) − k(x2i, y2i−1)
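The estimator averages hk over disjoint 4-tuples, so it needs one pass and constant storage. A sketch for a Gaussian kernel on 1-d samples (names are mine):

```python
import numpy as np

def linear_mmd2(x, y, sigma=1.0):
    """Linear-time MMD^2 estimate: average of h_k over disjoint 4-tuples
    v_i = [x_{2i-1}, x_{2i}, y_{2i-1}, y_{2i}]; O(m) time."""
    def k(a, b):
        return np.exp(-(a - b) ** 2 / (2 * sigma ** 2))
    x1, x2 = x[0::2], x[1::2]  # x_{2i-1}, x_{2i}
    y1, y2 = y[0::2], y[1::2]
    h = k(x1, x2) + k(y1, y2) - k(x1, y2) - k(x2, y1)
    return h.mean()

rng = np.random.default_rng(4)
m = 10000
stat_null = linear_mmd2(rng.normal(size=m), rng.normal(size=m))         # P = Q
stat_alt = linear_mmd2(rng.normal(size=m), rng.normal(1.0, 1, size=m))  # shifted Q
```

Under the null the estimate sits near zero; under the mean-shift alternative it is clearly positive, at the cost of much higher variance than the quadratic-time estimate.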

SLIDE 83

Linear time vs quadratic time MMD

Disadvantages of linear time MMD vs quadratic time MMD

  • Much higher variance for a given m, hence. . .
  • . . .a much less powerful test for a given m
SLIDE 84

Linear time vs quadratic time MMD

Disadvantages of linear time MMD vs quadratic time MMD

  • Much higher variance for a given m, hence. . .
  • . . .a much less powerful test for a given m

Advantages of the linear time MMD vs quadratic time MMD

  • Very simple asymptotic null distribution (a Gaussian, vs an infinite

weighted sum of χ2)

  • Both test statistic and threshold computable in O(m), with storage O(1).
  • Given unlimited data, a given Type II error can be attained with less

computation

SLIDE 85

Asymptotics of linear time MMD

By the central limit theorem, m^{1/2}(η̌k − ηk(p, q)) →D N(0, 2σk²)

  • assuming 0 < E(hk²) < ∞ (true for bounded k)
  • σk² = Ev hk²(v) − [Ev hk(v)]².

SLIDE 86

Hypothesis test

Hypothesis test of asymptotic level α: tk,α = m^{−1/2} σk √2 Φ^{−1}(1 − α), where Φ^{−1} is the inverse CDF of N(0, 1).

[Plot: null distribution of MMD² = η̌k; the threshold tk,α is the (1 − α) quantile, and the Type I error is the mass above it]
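The threshold is a one-liner with the standard Normal quantile; a sketch using Python's stdlib (the function name is mine):

```python
from math import sqrt
from statistics import NormalDist

def mmd_threshold(sigma_k, m, alpha=0.05):
    """Level-alpha threshold for the linear-time MMD: under H0 the statistic
    is asymptotically N(0, 2 sigma_k^2 / m), so the (1 - alpha) quantile is
    t = m^{-1/2} * sigma_k * sqrt(2) * Phi^{-1}(1 - alpha)."""
    return sigma_k * sqrt(2.0 / m) * NormalDist().inv_cdf(1.0 - alpha)
```

Both the statistic and this threshold are O(m) to compute, which is the key advantage of the linear-time test.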

SLIDE 87

Type II error

[Plot: null vs alternative distributions of η̌k; the alternative is centred at ηk(p, q), and the Type II error is the alternative mass below the threshold]

SLIDE 88

The best kernel: minimizes Type II error

Type II error: η̌k falls below the threshold tk,α while ηk(p, q) > 0.

  • Prob. of a Type II error:

P(η̌k < tk,α) = Φ(Φ^{−1}(1 − α) − ηk(p, q)√m / (σk √2)),

where Φ is the Normal CDF.
SLIDE 89

The best kernel: minimizes Type II error

Type II error: η̌k falls below the threshold tk,α while ηk(p, q) > 0.

  • Prob. of a Type II error:

P(η̌k < tk,α) = Φ(Φ^{−1}(1 − α) − ηk(p, q)√m / (σk √2)),

where Φ is the Normal CDF.

Since Φ is monotonic, the best kernel choice to minimize Type II error prob. is:

k∗ = arg max_{k∈K} ηk(p, q) σk^{−1},

where K is the family of kernels under consideration.

SLIDE 90

Learning the best kernel in a family

Define the family of kernels as follows:

K := { k : k = Σ_{u=1}^d βu ku, ‖β‖1 = D, βu ≥ 0, ∀u ∈ {1, . . . , d} }.

Properties, if at least one βu > 0:

  • all k ∈ K are valid kernels,
  • if all ku are characteristic, then k is characteristic
SLIDE 91

Test statistic

The squared MMD becomes ηk(p, q) = ‖µk(p) − µk(q)‖²Fk = Σ_{u=1}^d βu ηu(p, q), where ηu(p, q) := Ev hu(v).

SLIDE 92

Test statistic

The squared MMD becomes ηk(p, q) = ‖µk(p) − µk(q)‖²Fk = Σ_{u=1}^d βu ηu(p, q), where ηu(p, q) := Ev hu(v). Denote:

  • β = (β1, β2, . . . , βd)⊤ ∈ Rd,
  • h = (h1, h2, . . . , hd)⊤ ∈ Rd, with hu(x, x′, y, y′) = ku(x, x′) + ku(y, y′) − ku(x, y′) − ku(x′, y),
  • η = Ev(h) = (η1, η2, . . . , ηd)⊤ ∈ Rd.

Quantities for the test: ηk(p, q) = E(β⊤h) = β⊤η,  σk² := β⊤ cov(h) β.

SLIDE 93

Optimization of the ratio ηk(p, q) σk^{−1}

Empirical test parameters: η̂k = β⊤η̂,  σ̂k,λ = √(β⊤(Q̂ + λm I)β),

where Q̂ is the empirical estimate of cov(h). Note: η̂k, σ̂k,λ are computed on training data, vs η̌k, σ̌k on the data to be tested (why?)

SLIDE 94

Optimization of the ratio ηk(p, q) σk^{−1}

Empirical test parameters: η̂k = β⊤η̂,  σ̂k,λ = √(β⊤(Q̂ + λm I)β),

where Q̂ is the empirical estimate of cov(h). Note: η̂k, σ̂k,λ are computed on training data, vs η̌k, σ̌k on the data to be tested (why?)

Objective:

β̂∗ = arg max_{β⪰0} η̂k σ̂k,λ^{−1} = arg max_{β⪰0} β⊤η̂ (β⊤(Q̂ + λm I)β)^{−1/2} =: α(β; η̂, Q̂)

SLIDE 95

Optimization of the ratio ηk(p, q) σk^{−1}

Assume: η̂ has at least one positive entry. Then there exists β ⪰ 0 s.t. α(β; η̂, Q̂) > 0. Thus: α(β̂∗; η̂, Q̂) > 0

SLIDE 96

Optimization of the ratio ηk(p, q) σk^{−1}

Assume: η̂ has at least one positive entry. Then there exists β ⪰ 0 s.t. α(β; η̂, Q̂) > 0. Thus: α(β̂∗; η̂, Q̂) > 0

Solve the easier problem: β̂∗ = arg max_{β⪰0} α²(β; η̂, Q̂). Quadratic program:

min{ β⊤(Q̂ + λm I)β : β⊤η̂ = 1, β ⪰ 0 }

slide-97
SLIDE 97

Optimization of ratio η_k(p, q)/σ_k

Assume: η̂ has at least one positive entry. Then there exists β ⪰ 0 such that α(β; η̂, Q̂) > 0, and thus α(β̂*; η̂, Q̂) > 0.

Solve the easier problem β̂* = arg max_{β ⪰ 0} α²(β; η̂, Q̂), i.e. the quadratic program:

min{ β⊤( Q̂ + λ_m I )β : β⊤η̂ = 1, β ⪰ 0 }

What if η̂ has no positive entries?
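The quadratic program is small (d variables, one equality constraint, nonnegativity), so even a naive solver suffices for illustration. A hedged NumPy sketch using a simple active-set heuristic; this is not a production QP solver, and it assumes η̂ has at least one positive entry, as on this slide:

```python
import numpy as np

def optimal_beta(eta_hat, Q_hat, lam=1e-6):
    """Solve  min beta' (Q + lam I) beta  s.t.  beta' eta = 1, beta >= 0,
    by repeatedly solving the equality-constrained problem and dropping
    coordinates driven negative (naive active-set heuristic)."""
    d = len(eta_hat)
    A = Q_hat + lam * np.eye(d)
    free = np.arange(d)
    while True:
        b = np.linalg.solve(A[np.ix_(free, free)], eta_hat[free])
        b /= eta_hat[free] @ b          # enforce beta' eta = 1 on the free set
        if np.all(b >= 0):
            beta = np.zeros(d)
            beta[free] = b
            return beta
        free = free[b > 0]              # drop coordinates driven negative
```

Any rescaling of the result is equally good for the test, since α(β; η̂, Q̂) is invariant to positive scaling of β.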

slide-98
SLIDE 98

Test procedure

  • 1. Split the data into training and test sets.
  • 2. On the training data:
      (a) Compute η̂_u for all k_u ∈ K.
      (b) If at least one η̂_u > 0, solve the QP to get β*; else choose a random kernel from K.
  • 3. On the test data:
      (a) Compute η̌_{k*} using k* = Σ_{u=1}^d β*_u k_u.
      (b) Compute the test threshold ť_{α,k*} using σ̌_{k*}.
  • 4. Reject the null if η̌_{k*} > ť_{α,k*}.
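Steps 3(b)-4 can use the asymptotic normality of the linear-time statistic: η̌_{k*} averages m/2 i.i.d. terms h(v_i), so under the null (η_k = 0) it is approximately N(0, 2σ²_k/m). A minimal standard-library sketch of the resulting threshold and decision:

```python
from math import sqrt
from statistics import NormalDist

def reject_null(eta_check, sigma_check, m, alpha=0.05):
    """eta_check is an average of m/2 i.i.d. terms, so under the null it is
    approximately N(0, 2*sigma^2/m).  Reject when the statistic exceeds the
    (1 - alpha) quantile of this approximate null distribution."""
    t = NormalDist().inv_cdf(1 - alpha) * sigma_check * sqrt(2.0 / m)
    return eta_check > t
```

Here eta_check and sigma_check stand for η̌_{k*} and σ̌_{k*} computed on the held-out test data.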

slide-99
SLIDE 99

Convergence bounds

Assume a bounded kernel, with σ_k bounded away from 0. If λ_m = Θ(m^{−1/3}), then

| sup_{k∈K} η̂_k/σ̂_{k,λ} − sup_{k∈K} η_k/σ_k | = O_P( m^{−1/3} ).

slide-100
SLIDE 100

Convergence bounds

Assume a bounded kernel, with σ_k bounded away from 0. If λ_m = Θ(m^{−1/3}), then

| sup_{k∈K} η̂_k/σ̂_{k,λ} − sup_{k∈K} η_k/σ_k | = O_P( m^{−1/3} ).

Idea:

| sup_{k∈K} η̂_k/σ̂_{k,λ} − sup_{k∈K} η_k/σ_k |
  ≤ sup_{k∈K} | η̂_k/σ̂_{k,λ} − η_k/σ_{k,λ} | + sup_{k∈K} | η_k/σ_{k,λ} − η_k/σ_k |
  ≤ (√d / (D√λ_m)) ( C_1 sup_{k∈K} |η̂_k − η_k| + C_2 sup_{k∈K} |σ̂_{k,λ} − σ_{k,λ}| ) + C_3 D² λ_m.
slide-101
SLIDE 101

Experiments

slide-102
SLIDE 102

Competing approaches

  • Median heuristic
  • Max MMD: choose the k_u ∈ K with the largest η̂_u — the same as maximizing β⊤η̂ subject to ‖β‖_1 ≤ 1
  • ℓ2 statistic: maximize β⊤η̂ subject to ‖β‖_2 ≤ 1
  • Cross-validation on the training set

Also compare with:

  • the single kernel that maximizes the ratio η_k(p, q)/σ_k

slide-103
SLIDE 103

Blobs: data

Difficult problems: the lengthscale of the difference in distributions is not the same as that of the distributions.

slide-104
SLIDE 104

Blobs: data

Difficult problems: the lengthscale of the difference in distributions is not the same as that of the distributions. We distinguish a field of Gaussian blobs with different covariances.

[Figure: blob data sampled from p (coordinates x1, x2) and from q (coordinates y1, y2).]

Ratio ε = 3.2 of largest to smallest eigenvalues of blobs in q.
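A dataset of this kind is easy to simulate. The sketch below draws from a grid of Gaussian blobs; the grid size, spacing, and rotation angle are illustrative assumptions, not the exact experimental settings:

```python
import numpy as np

def blobs(m, eps=1.0, rows=5, cols=5, spacing=10.0, angle=np.pi / 4, seed=None):
    """Sample m points from a grid of Gaussian blobs.  Use eps = 1 for p
    (isotropic blobs); for q the blob covariance has largest-to-smallest
    eigenvalue ratio eps, rotated by `angle`."""
    rng = np.random.default_rng(seed)
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols)), axis=-1)
    centers = spacing * grid.reshape(-1, 2).astype(float)
    c = centers[rng.integers(len(centers), size=m)]   # random blob per point
    R = np.array([[np.cos(angle), -np.sin(angle)],
                  [np.sin(angle),  np.cos(angle)]])
    cov = R @ np.diag([eps, 1.0]) @ R.T               # rotated anisotropic cov
    return c + rng.multivariate_normal(np.zeros(2), cov, size=m)
```

Samples from p and q then come from `blobs(m, eps=1.0)` and `blobs(m, eps=3.2)` respectively.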

slide-105
SLIDE 105

Blobs: results

[Figure: Type II error vs. eigenvalue ratio ε, comparing max ratio, opt, ℓ2, max MMD, xval, xvalc, and the median heuristic.]

Parameters: m = 10,000 (for training and test). Ratio ε of largest to smallest eigenvalues of blobs in q. Results are averaged over 617 trials.

slide-106
SLIDE 106

Blobs: results

[Figure: same plot, highlighting the method that optimizes the ratio η_k(p, q)/σ_k.]

slide-107
SLIDE 107

Blobs: results

[Figure: same plot, highlighting the method that maximizes η_k(p, q) under a constraint on β.]

slide-108
SLIDE 108

Blobs: results

[Figure: same plot, highlighting the median heuristic.]

slide-109
SLIDE 109

Feature selection: data

Idea: there is no single best kernel. Each of the k_u is univariate (along a single coordinate).

slide-110
SLIDE 110

Feature selection: data

Idea: there is no single best kernel. Each of the k_u is univariate (along a single coordinate).

[Figure: samples from p and q for the feature selection data, coordinates x1 and x2.]

slide-111
SLIDE 111

Feature selection: results

[Figure: Type II error vs. feature selection dimension, comparing max ratio, opt, ℓ2, and max MMD; single best kernel vs. linear combination.]

m = 10,000, averaged over 5000 trials.

slide-112
SLIDE 112

Amplitude modulated signals

Given an audio signal s(t), an amplitude modulated signal can be defined as

u(t) = sin(ω_c t) [a s(t) + l]

  • ω_c: carrier frequency
  • a = 0.2 is the signal scaling, l = 2 is the offset
slide-113
SLIDE 113

Amplitude modulated signals

Given an audio signal s(t), an amplitude modulated signal can be defined as

u(t) = sin(ω_c t) [a s(t) + l]

  • ω_c: carrier frequency
  • a = 0.2 is the signal scaling, l = 2 is the offset

Two amplitude modulated signals from the same artist (in this case, Magnetic Fields):

  • Music sampled at 8 kHz (very low)
  • Carrier frequency is 24 kHz
  • AM signal observed at 120 kHz
  • Samples are extracts of length N = 1000, approx. 0.01 sec (very short)
  • Total dataset size is 30,000 samples from each of p, q
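The construction of u(t) is direct to sketch. The snippet below only illustrates the formula; resampling the 8 kHz audio to the 120 kHz observation rate is omitted, and s is assumed already at the output rate:

```python
import numpy as np

def amplitude_modulate(s, fc=24_000.0, fs=120_000.0, a=0.2, offset=2.0):
    """u(t) = sin(2*pi*fc*t) * (a*s(t) + offset), sampled at rate fs.
    The audio s is assumed already resampled to fs."""
    t = np.arange(len(s)) / fs
    return np.sin(2 * np.pi * fc * t) * (a * s + offset)
```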
slide-114
SLIDE 114

Amplitude modulated signals

[Figure: samples from P and samples from Q.]

slide-115
SLIDE 115

Results: AM signals

[Figure: Type II error vs. added noise, comparing max ratio, opt, median, ℓ2, and max MMD.]

m = 10,000 (for training and test) and scaling a = 0.5. Averaged over 4124 trials. Gaussian noise added.
slide-116
SLIDE 116

Observations on kernel choice

  • It is possible to choose the best kernel for a kernel two-sample test.
  • Kernel choice matters for “difficult” problems, where the distributions differ on a lengthscale different to that of the data.
  • Ongoing work:
      – quadratic-time statistic
      – avoiding the training/test split

slide-117
SLIDE 117

Energy Distance and the MMD

slide-118
SLIDE 118

Energy distance and MMD

Distances between probability distributions:

Energy distance [Baringhaus and Franz, 2004, Székely and Rizzo, 2004, 2005]:

D_E(P, Q) = 2E_{P,Q}‖X − Y‖^q − E_P‖X − X′‖^q − E_Q‖Y − Y′‖^q,    0 < q ≤ 2

Maximum mean discrepancy [Gretton et al., 2007, Smola et al., 2007, Gretton et al., 2012a]:

MMD²(P, Q; F) = E_P k(X, X′) + E_Q k(Y, Y′) − 2E_{P,Q} k(X, Y)

slide-119
SLIDE 119

Energy distance and MMD

Distances between probability distributions:

Energy distance [Baringhaus and Franz, 2004, Székely and Rizzo, 2004, 2005]:

D_E(P, Q) = 2E_{P,Q}‖X − Y‖^q − E_P‖X − X′‖^q − E_Q‖Y − Y′‖^q,    0 < q ≤ 2

Maximum mean discrepancy [Gretton et al., 2007, Smola et al., 2007, Gretton et al., 2012a]:

MMD²(P, Q; F) = E_P k(X, X′) + E_Q k(Y, Y′) − 2E_{P,Q} k(X, Y)

Energy distance is MMD with a particular kernel!

[Sejdinovic et al., 2013b]

slide-120
SLIDE 120

Distance covariance and HSIC

Distance covariance (0 < q, r ≤ 2) [Feuerverger, 1993, Székely et al., 2007]:

V²(X, Y) = E_{XY}E_{X′Y′}[ ‖X − X′‖^q ‖Y − Y′‖^r ] + E_X E_{X′}‖X − X′‖^q · E_Y E_{Y′}‖Y − Y′‖^r − 2E_{XY}[ E_{X′}‖X − X′‖^q E_{Y′}‖Y − Y′‖^r ]

Hilbert-Schmidt Independence Criterion [Gretton et al., 2005, Smola et al., 2007, Gretton et al., 2008, Gretton and Gyorfi, 2010]: define an RKHS F on X with kernel k, and an RKHS G on Y with kernel l. Then

HSIC(P_XY, P_X P_Y) = E_{XY}E_{X′Y′}[ k(X, X′) l(Y, Y′) ] + E_X E_{X′} k(X, X′) · E_Y E_{Y′} l(Y, Y′) − 2E_{X′Y′}[ E_X k(X, X′) E_Y l(Y, Y′) ].
slide-121
SLIDE 121

Distance covariance and HSIC

Distance covariance (0 < q, r ≤ 2) [Feuerverger, 1993, Székely et al., 2007]:

V²(X, Y) = E_{XY}E_{X′Y′}[ ‖X − X′‖^q ‖Y − Y′‖^r ] + E_X E_{X′}‖X − X′‖^q · E_Y E_{Y′}‖Y − Y′‖^r − 2E_{XY}[ E_{X′}‖X − X′‖^q E_{Y′}‖Y − Y′‖^r ]

Hilbert-Schmidt Independence Criterion [Gretton et al., 2005, Smola et al., 2007, Gretton et al., 2008, Gretton and Gyorfi, 2010]: define an RKHS F on X with kernel k, and an RKHS G on Y with kernel l. Then

HSIC(P_XY, P_X P_Y) = E_{XY}E_{X′Y′}[ k(X, X′) l(Y, Y′) ] + E_X E_{X′} k(X, X′) · E_Y E_{Y′} l(Y, Y′) − 2E_{X′Y′}[ E_X k(X, X′) E_Y l(Y, Y′) ].

Distance covariance is HSIC with particular kernels!

[Sejdinovic et al., 2013b]
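The population HSIC expression has a familiar biased empirical counterpart via centered Gram matrices, HSIC ≈ tr(K H L H)/m² with centering matrix H = I − (1/m)11⊤. A sketch with Gaussian kernels; the bandwidths are illustrative:

```python
import numpy as np

def hsic(X, Y, sx=1.0, sy=1.0):
    """Biased empirical HSIC with Gaussian kernels on X and Y."""
    m = X.shape[0]
    def gram(Z, s):
        sq = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
        return np.exp(-sq / (2 * s**2))
    K, L = gram(X, sx), gram(Y, sy)
    H = np.eye(m) - np.ones((m, m)) / m    # centering matrix
    return np.trace(K @ H @ L @ H) / m**2
```

Since HKH and HLH are both positive semidefinite, the statistic is nonnegative, and it is larger for dependent samples than for independent ones.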

slide-122
SLIDE 122

Semimetrics and Hilbert spaces

Theorem [Berg et al., 1984, Lemma 2.1, p. 74]: let ρ : X × X → R be a semimetric (no triangle inequality) on X. Let z_0 ∈ X, and denote

k_ρ(z, z′) = ½( ρ(z, z_0) + ρ(z′, z_0) − ρ(z, z′) ).

Then k_ρ is positive definite (and, via Moore-Aronszajn, defines a unique RKHS) iff ρ is of negative type. Call k_ρ a distance-induced kernel.

Negative type: the semimetric space (Z, ρ) is said to have negative type if for all n ≥ 2, z_1, . . . , z_n ∈ Z, and α_1, . . . , α_n ∈ R with Σ_{i=1}^n α_i = 0,

Σ_{i=1}^n Σ_{j=1}^n α_i α_j ρ(z_i, z_j) ≤ 0. (1)

slide-123
SLIDE 123

Semimetrics and Hilbert spaces

Theorem [Berg et al., 1984, Lemma 2.1, p. 74]: let ρ : X × X → R be a semimetric (no triangle inequality) on X. Let z_0 ∈ X, and denote

k_ρ(z, z′) = ½( ρ(z, z_0) + ρ(z′, z_0) − ρ(z, z′) ).

Then k_ρ is positive definite (and, via Moore-Aronszajn, defines a unique RKHS) iff ρ is of negative type. Call k_ρ a distance-induced kernel.

Special case: Z ⊆ R^d and ρ_q(z, z′) = ‖z − z′‖^q. Then ρ_q is a valid semimetric of negative type for 0 < q ≤ 2.
slide-124
SLIDE 124

Semimetrics and Hilbert spaces

Theorem [Berg et al., 1984, Lemma 2.1, p. 74]: let ρ : X × X → R be a semimetric (no triangle inequality) on X. Let z_0 ∈ X, and denote

k_ρ(z, z′) = ½( ρ(z, z_0) + ρ(z′, z_0) − ρ(z, z′) ).

Then k_ρ is positive definite (and, via Moore-Aronszajn, defines a unique RKHS) iff ρ is of negative type. Call k_ρ a distance-induced kernel.

Special case: Z ⊆ R^d and ρ_q(z, z′) = ‖z − z′‖^q. Then ρ_q is a valid semimetric of negative type for 0 < q ≤ 2.

Energy distance is MMD with a distance-induced kernel. Distance covariance is HSIC with distance-induced kernels.
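The equivalence can be checked numerically: with ρ(z, z′) = ‖z − z′‖ and the induced kernel k_ρ, the biased empirical statistics satisfy D_E = 2·MMD², and the contributions of the base point z_0 cancel exactly. A sketch:

```python
import numpy as np

def energy_distance(X, Y):
    """Energy distance with q = 1, biased V-statistic form."""
    d = lambda A, B: np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return 2 * d(X, Y).mean() - d(X, X).mean() - d(Y, Y).mean()

def mmd2_induced(X, Y, z0):
    """Biased MMD^2 with the distance-induced kernel
    k(z, z') = (rho(z, z0) + rho(z', z0) - rho(z, z')) / 2, rho Euclidean."""
    rho = lambda A, B: np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    def k(A, B):
        return 0.5 * (rho(A, z0[None, :]) + rho(B, z0[None, :]).T - rho(A, B))
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()
```

The identity holds for any choice of z0, which is the point of the lemma above.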

slide-125
SLIDE 125

Two-sample testing benchmark

Two-sample testing example in 1-D:

[Figure: density P(X), compared against three candidate densities Q(X).]

slide-126
SLIDE 126

Two-sample test, MMD with distance kernel

Obtain more powerful tests on this problem when q = 1 (the exponent of the distance). Key:

  • Gaussian kernel
  • q = 1
  • Best: q = 1/3
  • Worst: q = 2
slide-127
SLIDE 127

Selected references

Characteristic kernels and mean embeddings:

  • Smola, A., Gretton, A., Song, L., Schölkopf, B. (2007). A Hilbert space embedding for distributions. ALT.
  • Sriperumbudur, B., Gretton, A., Fukumizu, K., Schölkopf, B., Lanckriet, G. (2010). Hilbert space embeddings and metrics on probability measures. JMLR.
  • Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A. (2012). A kernel two-sample test. JMLR.

Two-sample, independence, conditional independence tests:

  • Gretton, A., Fukumizu, K., Teo, C., Song, L., Schölkopf, B., Smola, A. (2008). A kernel statistical test of independence. NIPS.
  • Fukumizu, K., Gretton, A., Sun, X., Schölkopf, B. (2008). Kernel measures of conditional dependence. NIPS.
  • Gretton, A., Fukumizu, K., Harchaoui, Z., Sriperumbudur, B. (2009). A fast, consistent kernel two-sample test. NIPS.
  • Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A. (2012). A kernel two-sample test. JMLR.

Energy distance, relation to kernel distances:

  • Sejdinovic, D., Sriperumbudur, B., Gretton, A., Fukumizu, K. (2013). Equivalence of distance-based and RKHS-based statistics in hypothesis testing. Annals of Statistics.

Three-way interaction:

  • Sejdinovic, D., Gretton, A., and Bergsma, W. (2013). A kernel test for three-variable interactions. NIPS.
slide-128
SLIDE 128

Selected references (continued)

Conditional mean embedding, RKHS-valued regression:

  • Weston, J., Chapelle, O., Elisseeff, A., Schölkopf, B., and Vapnik, V. (2003). Kernel dependency estimation. NIPS.
  • Micchelli, C., and Pontil, M. (2005). On learning vector-valued functions. Neural Computation.
  • Caponnetto, A., and De Vito, E. (2007). Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics.
  • Song, L., Huang, J., Smola, A., Fukumizu, K. (2009). Hilbert space embeddings of conditional distributions. ICML.
  • Grunewalder, S., Lever, G., Baldassarre, L., Patterson, S., Gretton, A., Pontil, M. (2012). Conditional mean embeddings as regressors. ICML.
  • Grunewalder, S., Gretton, A., Shawe-Taylor, J. (2013). Smooth operators. ICML.

Kernel Bayes rule:

  • Song, L., Fukumizu, K., Gretton, A. (2013). Kernel embeddings of conditional distributions: A unified kernel framework for nonparametric inference in graphical models. IEEE Signal Processing Magazine.
  • Fukumizu, K., Song, L., Gretton, A. (2013). Kernel Bayes' rule: Bayesian inference with positive definite kernels. JMLR.

slide-130
SLIDE 130

Kernel CCA: Definition

  • There exists a factorization of C_xy such that [Baker, 1973]

C_xy = C_xx^{1/2} V_xy C_yy^{1/2},    ‖V_xy‖_S ≤ 1

  • Regularized empirical estimate of the spectral norm [JMLR07]:

‖V̂_xy‖_S := sup_{f ∈ F, g ∈ G} ⟨f, Ĉ_xy g⟩_F subject to ⟨f, (Ĉ_xx + ǫ_n I)f⟩_F = 1, ⟨g, (Ĉ_yy + ǫ_n I)g⟩_G = 1

      – the first canonical correlate

slide-131
SLIDE 131

Kernel CCA: Definition

  • There exists a factorization of C_xy such that [Baker, 1973]

C_xy = C_xx^{1/2} V_xy C_yy^{1/2},    ‖V_xy‖_S ≤ 1

  • Regularized empirical estimate of the spectral norm [JMLR07]:

‖V̂_xy‖_S := sup_{f ∈ F, g ∈ G} ⟨f, Ĉ_xy g⟩_F subject to ⟨f, (Ĉ_xx + ǫ_n I)f⟩_F = 1, ⟨g, (Ĉ_yy + ǫ_n I)g⟩_G = 1

      – the first canonical correlate

  • Regularized empirical estimate of the HS norm [NIPS07b]:

NOCCO(z; F, G) := ‖V̂_xy‖²_HS = tr( R_y R_x ),    R_x := K̃_x ( K̃_x + nǫ_n I_n )^{−1},

where K̃_x denotes the centered Gram matrix.
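The empirical NOCCO statistic can be sketched in a few lines. Centering of the Gram matrices and the Gaussian bandwidths below are assumptions of this illustration:

```python
import numpy as np

def nocco(X, Y, eps=1e-2, sx=1.0, sy=1.0):
    """Empirical NOCCO: tr(Ry Rx) with R = Kc (Kc + n*eps*I)^{-1},
    where Kc is a centered Gaussian Gram matrix."""
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n              # centering matrix
    def R(Z, s):
        sq = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
        Kc = H @ np.exp(-sq / (2 * s**2)) @ H
        return Kc @ np.linalg.inv(Kc + n * eps * np.eye(n))
    return np.trace(R(Y, sy) @ R(X, sx))
```

As a product of two positive semidefinite matrices inside a trace, the statistic is nonnegative, and it is larger for dependent samples.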

slide-132
SLIDE 132

Kernel CCA: Illustration

  • Ring-shaped density, first eigenvalue

[Figure: data (X, Y), correlation −0.00; dependence witness f(x) on X; dependence witness g(y) on Y; scatter of f(X) vs. g(Y), correlation 1.00.]

slide-133
SLIDE 133

Kernel CCA: Illustration

  • Ring-shaped density, third eigenvalue

[Figure: data (X, Y), correlation −0.00; dependence witness f(x) on X; dependence witness g(y) on Y; scatter of f(X) vs. g(Y), correlation 0.97.]

slide-134
SLIDE 134

NOCCO: HS Norm of Normalized Cross Covariance

  • Define NOCCO as NOCCO := ‖V_xy‖²_HS
  • Characteristic kernels: the population NOCCO is the mean-square contingency, independent of the RKHS:

NOCCO = ∫_{X×Y} ( p_xy(x, y) / (p_x(x) p_y(y)) − 1 )² p_x(x) p_y(y) dµ(x) dµ(y).

      – µ(x) and µ(y) Lebesgue measures on X and Y; P_xy absolutely continuous w.r.t. µ(x) × µ(y), with density p_xy and marginal densities p_x, p_y

  • Convergence result: assume the regularization ǫ_n satisfies ǫ_n → 0 and ǫ_n³ n → ∞. Then

‖V̂_xy − V_xy‖_HS → 0 in probability.

slide-135
SLIDE 135

References

  • C. Baker. Joint measures and cross-covariance operators. Transactions of the American Mathematical Society, 186:273–289, 1973.
  • L. Baringhaus and C. Franz. On a new multivariate two-sample test. J. Multivariate Anal., 88:190–206, 2004.
  • C. Berg, J. P. R. Christensen, and P. Ressel. Harmonic Analysis on Semigroups. Springer, New York, 1984.
  • A. Caponnetto and E. De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2007.
  • K. Chwialkowski and A. Gretton. A kernel independence test for random processes. ICML, 2014.
  • K. Chwialkowski, D. Sejdinovic, and A. Gretton. A wild bootstrap for degenerate kernel tests. NIPS, 2014.
  • A. Feuerverger. A consistent test for bivariate dependence. International Statistical Review, 61(3):419–433, 1993.
  • K. Fukumizu, A. Gretton, X. Sun, and B. Schölkopf. Kernel measures of conditional dependence. In Advances in Neural Information Processing Systems 20, pages 489–496, Cambridge, MA, 2008. MIT Press.
  • T. Gärtner, P. A. Flach, A. Kowalczyk, and A. J. Smola. Multi-instance kernels. In Proceedings of the International Conference on Machine Learning, pages 179–186. Morgan Kaufmann Publishers Inc., 2002.
  • A. Gretton and L. Gyorfi. Consistent nonparametric tests of independence. Journal of Machine Learning Research, 11:1391–1423, 2010.
  • A. Gretton, O. Bousquet, A. J. Smola, and B. Schölkopf. Measuring statistical dependence with Hilbert-Schmidt norms. In S. Jain, H. U. Simon, and E. Tomita, editors, Proceedings of the International Conference on Algorithmic Learning Theory, pages 63–77. Springer-Verlag, 2005.
  • A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf, and A. J. Smola. A kernel method for the two-sample problem. In Advances in Neural Information Processing Systems 15, pages 513–520, Cambridge, MA, 2007. MIT Press.

slide-136
SLIDE 136
  • A. Gretton, K. Fukumizu, C.-H. Teo, L. Song, B. Schölkopf, and A. J. Smola. A kernel statistical test of independence. In Advances in Neural Information Processing Systems 20, pages 585–592, Cambridge, MA, 2008. MIT Press.
  • A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. JMLR, 13:723–773, 2012a.
  • A. Gretton, B. Sriperumbudur, D. Sejdinovic, H. Strathmann, S. Balakrishnan, M. Pontil, and K. Fukumizu. Optimal kernel choice for large-scale two-sample tests. In NIPS, 2012b.
  • A. Gretton. A simpler condition for consistency of a kernel independence test. Technical Report 1501.06103, arXiv, 2015.
  • D. Haussler. Convolution kernels on discrete structures. Technical Report UCS-CRL-99-10, UC Santa Cruz, 1999.
  • P. Hoyer, D. Janzing, J. Mooij, J. Peters, and B. Schölkopf. Nonlinear causal discovery with additive noise models. In NIPS, 2009.
  • W. Jitkrittum, A. Gretton, N. Heess, S. M. A. Eslami, B. Lakshminarayanan, D. Sejdinovic, and Z. Szabó. Kernel-based just-in-time learning for passing expectation propagation messages. UAI, 2015.
  • D. Lopez-Paz, K. Muandet, B. Schölkopf, and I. Tolstikhin. Towards a learning theory of cause-effect inference. In ICML, 2015.
  • D. Sejdinovic, A. Gretton, and W. Bergsma. A kernel test for three-variable interactions. In NIPS, 2013a.
  • D. Sejdinovic, B. Sriperumbudur, A. Gretton, and K. Fukumizu. Equivalence of distance-based and RKHS-based statistics in hypothesis testing. Annals of Statistics, 41(5):2263–2702, 2013b.
  • D. Sejdinovic, H. Strathmann, M. Lomeli Garcia, C. Andrieu, and A. Gretton. Kernel adaptive Metropolis-Hastings. ICML, 2014.
  • A. J. Smola, A. Gretton, L. Song, and B. Schölkopf. A Hilbert space embedding for distributions. In Proceedings of the International Conference on Algorithmic Learning Theory, volume 4754, pages 13–31. Springer, 2007.
  • B. Sriperumbudur, K. Fukumizu, A. Gretton, and A. Hyvärinen. Density estimation in infinite dimensional exponential families. Technical Report 1312.3516, ArXiv e-prints, 2014.

slide-137
SLIDE 137

  • H. Strathmann, D. Sejdinovic, S. Livingstone, Z. Szabó, and A. Gretton. Gradient-free Hamiltonian Monte Carlo with efficient kernel exponential families. arXiv, 2015.
  • Z. Szabó, A. Gretton, B. Póczos, and B. Sriperumbudur. Two-stage sampled learning theory on distributions. AISTATS, 2015.
  • G. Székely and M. Rizzo. Testing for equal distributions in high dimension. InterStat, 5, 2004.
  • G. Székely and M. Rizzo. A new test for multivariate normality. J. Multivariate Anal., 93:58–80, 2005.
  • G. Székely, M. Rizzo, and N. Bakirov. Measuring and testing dependence by correlation of distances. Ann. Stat., 35(6):2769–2794, 2007.
  • K. Zhang, J. Peters, D. Janzing, and B. Schölkopf. Kernel-based conditional independence test and application in causal discovery. In 27th Conference on Uncertainty in Artificial Intelligence, pages 804–813, 2011.