SLIDE 1

Relative Goodness-of-Fit Tests for Models with Latent Variables

Arthur Gretton

Gatsby Computational Neuroscience Unit, University College London

June 15, 2019

SLIDE 2

Model Criticism

SLIDE 4

Model Criticism

Data = robbery events in Chicago in 2016.

SLIDE 5

Model Criticism

Is this a good model?

SLIDE 6

Model Criticism

Goal: test whether a (complicated) model fits the data.

SLIDE 7

Model Criticism

"All models are wrong."
  • G. Box (1976)

SLIDE 8

Relative model comparison

Have: two candidate models P and Q, and samples {x_i}_{i=1}^n from a reference distribution R.
Goal: which of P and Q is better?

[Figure: samples from GAN, Goodfellow et al. (2014), and samples from LSGAN, Mao et al. (2017). Which model is better?]

SLIDE 9

Most interesting models have latent structure

[Figure: graphical model representation of hierarchical LDA with a nested CRP prior, Blei et al. (2003)]

SLIDE 10

Outline

Relative goodness-of-fit tests for models with latent variables:
  • The kernel Stein discrepancy
      • Comparing two models via samples: MMD and the witness function.
      • Comparing a sample and a model: Stein modification of the witness class.
  • Constructing a relative hypothesis test using the KSD
  • Relative hypothesis tests with latent variables (new, unpublished)

SLIDE 11

Kernel Stein Discrepancy

Model P, data {x_i}_{i=1}^n ~ Q.

"All models are wrong" (P ≠ Q).

SLIDE 12

Integral probability metrics

Integral probability metric: find a "well-behaved" function f(x) to maximize

    E_Q f(Y) − E_P f(X)

[Plot: a smooth witness function f(x)]

SLIDE 14

All of kernel methods

Functions are linear combinations of features:  ‖f‖²_F := Σ_{i=1}^∞ f_i²

SLIDE 15

All of kernel methods

"The kernel trick":

    f(x) = Σ_{ℓ=1}^∞ f_ℓ φ_ℓ(x) = Σ_{i=1}^m α_i k(x_i, x),    with f_ℓ := Σ_{i=1}^m α_i φ_ℓ(x_i)

A function of infinitely many features, expressed using m coefficients.

[Plot: f(x) as a weighted sum of kernel functions centred at the points x_i]
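A minimal Python sketch of such a kernel expansion, assuming a Gaussian kernel with illustrative anchor points x_i and coefficients α_i (none of these choices are from the talk):

```python
import numpy as np

def gauss_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel k(x, y) = exp(-(x - y)^2 / (2 * bandwidth^2))."""
    return np.exp(-(x - y) ** 2 / (2 * bandwidth ** 2))

# m anchor points and coefficients define f(x) = sum_i alpha_i k(x_i, x)
x_pts = np.array([1.0, 3.0, 5.0])    # illustrative anchor points x_i
alpha = np.array([0.5, -0.2, 0.8])   # illustrative coefficients alpha_i

def f(x):
    """Evaluate the kernel expansion at x."""
    return sum(a * gauss_kernel(xi, x) for a, xi in zip(alpha, x_pts))

print(f(2.0))  # a function of infinitely many features, held as m = 3 coefficients
```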

SLIDE 17

MMD as an integral probability metric

Maximum mean discrepancy: smooth witness function for P vs Q,

    MMD(P, Q; F) := sup_{‖f‖_F ≤ 1} [ E_P f(X) − E_Q f(Y) ]

For a characteristic RKHS F, MMD(P, Q; F) = 0 iff P = Q.

Other choices for the witness function class:
  • Bounded continuous [Dudley, 2002]
  • Bounded variation 1 (Kolmogorov metric) [Müller, 1997]
  • 1-Lipschitz (Wasserstein distances) [Dudley, 2002]

[Plot: p(x), q(x), and the witness function f*(x)]
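A minimal Python sketch of the corresponding unbiased MMD² estimator, assuming a Gaussian kernel and illustrative 1-D Gaussian samples:

```python
import numpy as np

def mmd2_unbiased(X, Y, bandwidth=1.0):
    """Unbiased estimator of MMD^2 between samples X ~ P and Y ~ Q (1-D arrays)."""
    def k(a, b):
        return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * bandwidth ** 2))
    Kxx, Kyy, Kxy = k(X, X), k(Y, Y), k(X, Y)
    n, m = len(X), len(Y)
    # drop diagonal terms to make the within-sample averages unbiased (U-statistics)
    term_xx = (Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
    term_yy = (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
    return term_xx + term_yy - 2 * Kxy.mean()

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, 500)   # P = N(0, 1)
Y = rng.normal(0.5, 1.0, 500)   # Q = N(0.5, 1): estimate should be clearly > 0
print(mmd2_unbiased(X, Y))
```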

SLIDE 20

Statistical model criticism: toy example

    MMD(P, Q) = sup_{‖f‖_F ≤ 1} [ E_q f − E_p f ]

[Plot: p(x), q(x), and the witness function f*(x)]

Can we compute MMD with samples from Q and a model P?
Problem: we usually can't compute E_p f in closed form.

SLIDE 21

Stein idea

To get rid of E_p f in sup_{‖f‖_F ≤ 1} [ E_q f − E_p f ], we define the (1-D) Stein operator

    [A_p f](x) = (1/p(x)) (d/dx) ( f(x) p(x) )

Then E_p [A_p f] = 0, subject to appropriate boundary conditions.

Gorham and Mackey (NeurIPS 2015); Oates, Girolami, Chopin (JRSS B 2016)
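The Stein identity is easy to check by Monte Carlo. A sketch assuming p = N(0, 1), for which [A_p f](x) = −x f(x) + f′(x), and an arbitrary smooth test function (both illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1_000_000)          # samples from p = N(0, 1)

# For p = N(0, 1): [A_p f](x) = f(x) * d/dx log p(x) + f'(x) = -x * f(x) + f'(x)
f  = lambda x: np.tanh(x)               # smooth, bounded test function
fp = lambda x: 1.0 - np.tanh(x) ** 2    # its derivative

stein = -x * f(x) + fp(x)
print(stein.mean())                     # close to 0, as the Stein identity predicts
```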

SLIDE 22

Kernel Stein Discrepancy

Stein operator:  A_p f = (1/p(x)) (d/dx) ( f(x) p(x) )

Kernel Stein Discrepancy (KSD):

    KSD_p(Q) = sup_{‖g‖_F ≤ 1} [ E_q A_p g − E_p A_p g ] = sup_{‖g‖_F ≤ 1} E_q A_p g,

since the second term E_p A_p g vanishes.

[Plots: p(x), q(x), and the Stein witness g*(x)]

SLIDE 26

Simple expression using kernels

Rewrite the Stein operator:

    [A_p f](x) = (1/p(x)) (d/dx) ( f(x) p(x) ) = f(x) (d/dx) log p(x) + (d/dx) f(x)

Can we define "Stein features"?

    [A_p f](x) = ( (d/dx) log p(x) ) f(x) + (d/dx) f(x) =: ⟨ f, ξ(x) ⟩,

with Stein features ξ(x) satisfying E_{x~p} ξ(x) = 0.

SLIDE 28

The kernel trick for derivatives

Reproducing property for the derivative: for differentiable k(x, x′),

    (d/dx) f(x) = ⟨ f, (d/dx) φ(x) ⟩        (a)

Using the kernel derivative trick in (a),

    [A_p f](x) = ( (d/dx) log p(x) ) f(x) + (d/dx) f(x)
               = ⟨ f, ( (d/dx) log p(x) ) φ(x) + (d/dx) φ(x) ⟩
               =: ⟨ f, ξ(x) ⟩_F ,

which identifies the Stein features ξ(x).

SLIDE 30

Kernel Stein discrepancy: derivation

Closed-form expression for KSD: given independent x, x′ ~ Q,

    KSD_p(Q) = sup_{‖g‖_F ≤ 1} E_{x~q} [A_p g](x)
             = sup_{‖g‖_F ≤ 1} E_{x~q} ⟨ g, ξ_x ⟩_F
             = sup_{‖g‖_F ≤ 1} ⟨ g, E_{x~q} ξ_x ⟩_F        (a)
             = ‖ E_{x~q} ξ_x ‖_F

Caution: step (a) requires a condition for the Riesz theorem to hold,

    E_{x~q} [ ( (d/dx) log p(x) )² ] < ∞.

Chwialkowski, Strathmann, G. (ICML 2016); Liu, Lee, Jordan (ICML 2016)

SLIDE 33

The witness function: Chicago Crime

Model p = 10-component Gaussian mixture.
The witness function g shows the mismatch.

SLIDE 35

Does the Riesz condition matter?

Consider the standard normal, p(x) = (1/√(2π)) exp(−x²/2). Then (d/dx) log p(x) = −x.

If q is a Cauchy distribution, then the integral

    E_{x~q} [ ( (d/dx) log p(x) )² ] = ∫_{−∞}^{∞} x² q(x) dx

is undefined.
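A quick numerical illustration, assuming q = Cauchy (sample size arbitrary): the running average of x² never settles, so the Riesz condition fails:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_cauchy(1_000_000)       # samples from q = Cauchy

# running average of (d/dx log p(x))^2 = x^2 under q: does not converge
running_mean = np.cumsum(x ** 2) / np.arange(1, len(x) + 1)
print(running_mean[[999, 99_999, 999_999]])   # keeps jumping as n grows
```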

SLIDE 37

Kernel Stein discrepancy: population expression

Test statistic:

    KSD²_p(Q) = ‖ E_{x~q} ξ_x ‖²_F = E_{x,x′~Q} h_p(x, x′)

where

    h_p(x, x′) = s_p(x)ᵀ s_p(x′) k(x, x′) + s_p(x)ᵀ k₂(x, x′) + s_p(x′)ᵀ k₁(x, x′) + tr[ k₁₂(x, x′) ]

with

    s_p(x) := ∇p(x)/p(x) ∈ ℝ^D
    k₁(a, b) := ∇_x k(x, x′)|_{x=a, x′=b} ∈ ℝ^D
    k₂(a, b) := ∇_{x′} k(x, x′)|_{x=a, x′=b} ∈ ℝ^D
    k₁₂(a, b) := ∇_x ∇_{x′} k(x, x′)|_{x=a, x′=b} ∈ ℝ^{D×D}

We do not need to normalize p, or to sample from it.

If the kernel is C₀-universal and Q satisfies E_{x~Q} ‖ ∇ log( p(x)/q(x) ) ‖² < ∞, then KSD²_p(Q) = 0 iff P = Q.
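A 1-D Python sketch of h_p and the resulting KSD² estimate, assuming a Gaussian model p = N(0, 1) (score s_p(x) = −x) and an RBF kernel; the data and bandwidth are illustrative:

```python
import numpy as np

def ksd2_u_stat(x, score, bandwidth=1.0):
    """U-statistic estimate of KSD^2_p(Q) in 1-D with an RBF kernel.

    x: samples from Q; score(x) returns d/dx log p(x)."""
    s = score(x)
    d = x[:, None] - x[None, :]                  # pairwise differences x - x'
    K = np.exp(-d ** 2 / (2 * bandwidth ** 2))
    K1 = -d / bandwidth ** 2 * K                 # d/dx  k(x, x')
    K2 = -K1                                     # d/dx' k(x, x')
    K12 = (1.0 / bandwidth ** 2 - d ** 2 / bandwidth ** 4) * K
    H = s[:, None] * s[None, :] * K + s[:, None] * K2 + s[None, :] * K1 + K12
    n = len(x)
    return (H.sum() - np.trace(H)) / (n * (n - 1))   # drop i = j terms

rng = np.random.default_rng(0)
score_p = lambda x: -x                           # p = N(0, 1); normalization never used
print(ksd2_u_stat(rng.normal(size=500), score_p))        # Q = P: near 0
print(ksd2_u_stat(rng.normal(1.0, 1.0, 500), score_p))   # Q != P: clearly positive
```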

SLIDE 41

KSD for discrete-valued variables

Discrete domains: X = {1, …, L}^D with L ∈ ℕ.

The population KSD (discrete case):

    KSD²_p(Q) = E_{x,x′~Q} h_p(x, x′)

where

    h_p(x, x′) = s_p(x)ᵀ s_p(x′) k(x, x′) − s_p(x)ᵀ k₂(x, x′) − s_p(x′)ᵀ k₁(x, x′) + tr[ k₁₂(x, x′) ]

with k₁(x, x′) = Δ¹_x k(x, x′), where Δ¹_x is a difference operator on x, and s_p(x) = Δp(x)/p(x).

A discrete kernel: k(x, x′) = exp(−d_H(x, x′)), where d_H(x, x′) = (1/D) Σ_{d=1}^D 𝕀(x_d ≠ x′_d).

KSD²_p(Q) = 0 iff P = Q, provided that the Gram matrix over all the configurations in X is strictly positive definite, and P > 0 and Q > 0.

Ranganath et al. (NeurIPS 2016), Yang et al. (ICML 2018)
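A Python sketch of the exponentiated-Hamming kernel and the Gram-matrix condition above, for binary strings (D = 3, L = 2; illustrative, not the talk's code):

```python
import numpy as np
from itertools import product

def hamming_kernel(x, y):
    """k(x, x') = exp(-d_H(x, x')), with d_H the normalized Hamming distance."""
    x, y = np.asarray(x), np.asarray(y)
    return np.exp(-np.mean(x != y))

# Gram matrix over all configurations of X = {0, 1}^D, D = 3
configs = list(product([0, 1], repeat=3))
G = np.array([[hamming_kernel(a, b) for b in configs] for a in configs])

# the slide's condition: strict positive definiteness of the Gram matrix
print(np.linalg.eigvalsh(G).min() > 0)   # True for this kernel
```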

SLIDE 44

Empirical statistic, asymptotic normality for P ≠ Q

The empirical statistic (a U-statistic):

    \widehat{KSD}²_p(Q) := 1/(n(n−1)) Σ_{i≠j} h_p(x_i, x_j)

Asymptotic distribution when P ≠ Q:

    √n ( \widehat{KSD}²_p(Q) − KSD²_p(Q) ) →_d N(0, σ²_{h_p}),    σ²_{h_p} = 4 Var[ E_{x′}[ h_p(x, x′) ] ]

[Plot: the statistic's sampling distribution, centred at KSD²_p(Q)]

SLIDE 46

Relative goodness-of-fit testing

Two generative models P and Q, data {x_i}_{i=1}^n ~ R.

Neither model gives a perfect fit (P ≠ R and Q ≠ R).

SLIDE 47

Joint asymptotic normality

Joint asymptotic normality when P ≠ R and Q ≠ R:

    √n [ \widehat{KSD}²_p(R) − KSD²_p(R),  \widehat{KSD}²_q(R) − KSD²_q(R) ]ᵀ
        →_d N( 0, [ σ²_{h_p}  σ_{h_p h_q} ;  σ_{h_p h_q}  σ²_{h_q} ] )

The difference in statistics is therefore asymptotically normal:

    √n ( ( \widehat{KSD}²_p(R) − \widehat{KSD}²_q(R) ) − ( KSD²_p(R) − KSD²_q(R) ) )
        →_d N( 0, σ²_{h_p} + σ²_{h_q} − 2σ_{h_p h_q} )

⇒ a statistical test with null hypothesis KSD²_p(R) − KSD²_q(R) ≤ 0 is straightforward.

[Plot: joint sampling distribution of the two statistics]
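A Python sketch of the resulting one-sided test, assuming h_p and h_q have already been evaluated into matrices Hp, Hq on the shared sample from R; the variance plug-in below is one standard choice for U-statistics, not necessarily the talk's:

```python
import numpy as np
from scipy.stats import norm

def rel_ksd_test(Hp, Hq, alpha=0.05):
    """Relative test from U-statistic kernel matrices Hp[i, j] = h_p(x_i, x_j),
    Hq[i, j] = h_q(x_i, x_j), on shared data x_1, ..., x_n ~ R.

    Null: KSD^2_p(R) - KSD^2_q(R) <= 0 (P fits at least as well as Q)."""
    n = Hp.shape[0]
    off = ~np.eye(n, dtype=bool)
    stat = (Hp[off].sum() - Hq[off].sum()) / (n * (n - 1))
    # plug-in variance: 4 * Var of the conditional means E_x'[h_p - h_q]
    g = (Hp.sum(axis=1) - np.diag(Hp) - Hq.sum(axis=1) + np.diag(Hq)) / (n - 1)
    sigma2 = 4.0 * g.var()
    threshold = norm.ppf(1 - alpha) * np.sqrt(sigma2 / n)
    # reject => evidence that P has the larger discrepancy, i.e. Q fits better
    return stat, stat > threshold
```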

SLIDE 49

Latent variable models

Can we compare latent variable models with KSD?

    p(x) = ∫ p(x|z) p(z) dz,    q(y) = ∫ q(y|w) p(w) dw

[Figure: graphical models for the two latent variable models, with latents Z, W and observables X, Y]

Recall the multi-dimensional Stein operator:

    [A_p f](x) = ⟨ ∇p(x)/p(x), f(x) ⟩ + ⟨ ∇, f(x) ⟩

The first term, ∇p(x)/p(x), requires the marginal p(x), often intractable… but sampling from the model can be straightforward!

SLIDE 51

Monte Carlo approximation

Approximate the integral using {z_j}_{j=1}^m ~ p(z):

    p(x) = ∫ p(x|z) p(z) dz ≈ p_m(x) = (1/m) Σ_{j=1}^m p(x|z_j)

Estimate the KSDs with the approximate densities:

    \widehat{KSD}²_p(R) − \widehat{KSD}²_q(R) ≈ \widehat{KSD}²_{p_m}(R) − \widehat{KSD}²_{q_m}(R)

Recall

    √n ( ( \widehat{KSD}²_p(R) − \widehat{KSD}²_q(R) ) − ( KSD²_p(R) − KSD²_q(R) ) ) →_d N( 0, σ²_{h_p} + σ²_{h_q} − 2σ_{h_p h_q} )

→ if m is large, can we simply substitute p_m and q_m?

SLIDE 53

Simple proof of concept

Check \widehat{KSD}²_p(R) ≈ \widehat{KSD}²_{p_m}(R) with a toy model:

Model: Beta-Binomial BetaBinom(a, b):

    p(x|z) = C(N, x) z^x (1 − z)^{N−x},    p(z) = Beta(a, b)

  • Latent z ∈ (0, 1): success probability for the binomial likelihood
  • Marginal p(x): tractable (given by the beta function)

Generate √n \widehat{KSD}²_p(R) and √n \widehat{KSD}²_{p_m}(R) → what do their distributions look like?
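A Python sketch contrasting the exact Beta-Binomial marginal with its Monte Carlo counterpart p_m, for illustrative parameter values:

```python
import numpy as np
from scipy.special import betaln, comb
from scipy.stats import binom

N, a, b = 10, 2.0, 3.0

def p_exact(x):
    """Exact Beta-Binomial marginal, via the beta function:
    p(x) = C(N, x) * B(x + a, N - x + b) / B(a, b)."""
    return comb(N, x) * np.exp(betaln(x + a, N - x + b) - betaln(a, b))

def p_mc(x, m=1000, seed=0):
    """Monte Carlo marginal p_m(x) = (1/m) sum_j Binom(x | N, z_j), z_j ~ Beta(a, b)."""
    z = np.random.default_rng(seed).beta(a, b, size=m)
    return binom.pmf(x, N, z).mean()

x = 4
print(p_exact(x), p_mc(x))   # close for large m
```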

SLIDE 54

Effect of sampling the latents (Beta-Binomial)

[Plot: distributions of the √n-scaled U-statistics, √n \widehat{KSD}²_p vs √n \widehat{KSD}²_{p_m}]

SLIDE 57

Why this happens

  • The approximation p_m gives a random draw KSD²_{p_m}(R), normally distributed around KSD²_p(R) (approximation error).
  • Given p_m, the estimate \widehat{KSD}²_{p_m}(R) is normally distributed around KSD²_{p_m}(R).
  • The distribution of \widehat{KSD}²_{p_m}(R) is therefore averaged over the random draws of KSD²_{p_m}(R).
  • Hence \widehat{KSD}²_{p_m}(R) has a higher variance than \widehat{KSD}²_p(R).

SLIDE 63

Correction for this effect

Beta-Binomial models with p vs q = p_m: numerical vs closed-form marginalisation.

  • The naive relative KSD test has incorrect Type-I error: since q = p_m ≠ p, its rejection rate → 1 as n → ∞.
  • With a correction for the increased variance of \widehat{KSD}²_{p_m}(R), the null is accepted w.p. 1 − α.

Setup: P = BetaBinom(5+a, 1+b), Q = p_m, R = BetaBinom(a, b), k(x, x′) = exp(−𝕀(x ≠ x′)), α = 0.05.

[Plot: rejection rate vs sample size n, for KSD without the corrected threshold (m = 100), KSD (m = 1000), LKSD (KSD for latent models, m = 100), and LKSD (m = 1000)]

SLIDE 65

Asymptotics for approximate KSD

We have asymptotic normality for KSD²_{p_m}(R):

    √m ( KSD²_{p_m}(R) − KSD²_p(R) ) →_d N(0, γ²_p)

The fine print:
  • inf_x p(x) > 0 and sup_x | dp(x)/dx | < ∞
  • (Uniform CLT) The likelihoods { p(x|·) : x ∈ X } and derivatives { (d/dx) p(x|·) : x ∈ X } form a p(z)-Donsker class.

SLIDE 66

Asymptotic distribution for the relative KSD test

Asymptotic distribution of the approximate KSD estimate, as (n, m) → ∞ with n/m → r ∈ [0, ∞):

    √n ( ( \widehat{KSD}²_{p_m}(R) − \widehat{KSD}²_{q_m}(R) ) − ( KSD²_p(R) − KSD²_q(R) ) ) →_d N(0, c²)

where c = σ_pq √( 1 + r (γ_pq/σ_pq)² ), with

    γ²_pq = lim_{m→∞} m · Var[ E_{x,x′} h_{p_m}(x, x′) − E_{x,x′} h_{q_m}(x, x′) ]
    σ²_pq = lim_{n→∞} n · Var[ \widehat{KSD}²_p(R) − \widehat{KSD}²_q(R) ]

Fine print:
  • h_p(x, x′) − h_q(x, x′) has a finite third moment
  • an additional technical condition (next slide)
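A Python sketch of the corrected rejection threshold implied by the limit above; how σ²_pq and γ²_pq are estimated is left open here (e.g. γ²_pq via repeated redraws of the latents — an assumption, not a detail given on the slide):

```python
import numpy as np
from scipy.stats import norm

def corrected_threshold(sigma2_pq, gamma2_pq, n, m, alpha=0.05):
    """One-sided threshold for the relative test with MC-marginalised models.

    Uses c^2 = sigma2_pq * (1 + (n/m) * gamma2_pq / sigma2_pq), from the slide above."""
    r = n / m
    c2 = sigma2_pq * (1.0 + r * gamma2_pq / sigma2_pq)
    return norm.ppf(1 - alpha) * np.sqrt(c2 / n)

# usage: reject the null when the statistic exceeds the corrected threshold, e.g.
# stat > corrected_threshold(sigma2_hat, gamma2_hat, n=300, m=100)
```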

SLIDE 67

Main theorem

Theorem (Asymptotic distribution of a random-kernel U-statistic)

Let
  • U_{n,m}: a U-statistic defined by a random U-statistic kernel H_m
  • U_n: a U-statistic defined by a fixed U-statistic kernel h

Assume that
  • σ²_{H_m} → σ²_h in probability
  • ν₃(H_m) → ν₃(h) < ∞ in probability, where ν₃(H_m) = E_{x,x′} | H_m(x, x′) − E_{x,x′} H_m(x, x′) |³
  • Y_m := √m ( E_n[U_{n,m} | H_m] − E_n[U_n] ) →_d Y

Then, with n/m → r ∈ [0, ∞),

    lim_{n,m→∞} Pr[ √n ( U_{n,m} − E_n U_n ) < t ] = E_Y [ Φ( (t − √r Y) / σ_h ) ],

with Φ the standard normal CDF.

SLIDE 70

Experiment: sensitivity to model difference

Data R = Sigmoid Belief Network SBN(W):

    R(x|z) = sigmoid(Wz),    R(z) = N(0, I),    W ∈ ℝ^{30×10}

Models:

    P = SBN(W + ε[1, 0, …, 0]),    Q = SBN(W + [1, 0, …, 0])

Only the first column of the weight matrix W is perturbed, by ε.

Two scenarios (Hamming kernel, sample size n = 300, α = 0.05):
  • Null: ε ≤ 1
  • Alternative: ε > 1 (the higher the rejection rate, the better)

[Plot: rejection rate vs perturbation ε, for MMD, LKSD (KSD for latent models, m = 100), and LKSD (m = 1000)]

Findings:
  • KSD has higher power for ε > 1.
  • The sample-wise difference between the models is subtle, so MMD fails.
  • The model's information is better utilised by the KSD-based test.

SLIDE 73

Questions?