Relative Goodness-of-Fit Tests for Models with Latent Variables
Arthur Gretton
Gatsby Computational Neuroscience Unit, University College London
June 15, 2019
Model Criticism

Data: robbery events in Chicago in 2016. Is this a good model?

Goal: test whether a (complicated) model fits the data.
Have: two candidate models P and Q, and samples $\{x_i\}_{i=1}^n$ from a reference distribution R.
Goal: which of P and Q is better?

Example: samples from a GAN [Goodfellow et al., 2014] vs. samples from an LSGAN [Mao et al., 2017].
Graphical model representation of hierarchical LDA with a nested CRP prior, Blei et al. (2003)
The kernel Stein discrepancy

Outline:
- Comparing two models via samples: the MMD and its witness function
- Comparing a sample and a model: the Stein modification of the witness class
- Constructing a relative hypothesis test using the KSD
- Relative hypothesis tests with latent variables (new, unpublished)
Setting: model P, data $\{x_i\}_{i=1}^n \sim Q$.
"All models are wrong" ($P \neq Q$).
Integral probability metric: find a "well-behaved" function $f(x)$ that maximizes
$$\mathbf{E}_Q f(Y) - \mathbf{E}_P f(X).$$

[Plot: a smooth witness function $f(x)$.]
Functions are linear combinations of features:
$$\|f\|_{\mathcal{F}}^2 := \sum_{i=1}^{\infty} f_i^2.$$
"The kernel trick":
$$f(x) = \sum_{\ell=1}^{\infty} f_\ell\, \varphi_\ell(x) = \sum_{i=1}^{m} \alpha_i\, k(x_i, x),$$
with coefficients
$$f_\ell := \sum_{i=1}^{m} \alpha_i\, \varphi_\ell(x_i).$$
A function of infinitely many features is expressed using only m coefficients.
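To make this concrete, here is a minimal Python sketch (the Gaussian kernel, the centres, and the coefficients are illustrative choices, not from the talk): a function in an infinite-dimensional feature space is evaluated using only m coefficients.

```python
# A minimal sketch of the kernel trick: a function in an (infinite-feature)
# Gaussian RKHS is evaluated via m coefficients alpha_i and kernel values.
import numpy as np

def gauss_kernel(a, b, sigma=1.0):
    """Gaussian kernel k(a, b) = exp(-(a - b)^2 / (2 sigma^2))."""
    return np.exp(-(a - b) ** 2 / (2 * sigma ** 2))

def f(x, centres, alpha, sigma=1.0):
    """Evaluate f(x) = sum_i alpha_i k(x_i, x)."""
    return sum(a * gauss_kernel(c, x, sigma) for a, c in zip(alpha, centres))

centres = np.array([1.0, 3.0, 5.0])   # the x_i (illustrative)
alpha = np.array([0.5, -0.2, 0.8])    # the alpha_i (illustrative)
print(f(2.0, centres, alpha))
```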
Maximum mean discrepancy: the smooth witness function for P vs Q,
$$\mathrm{MMD}(P, Q; \mathcal{F}) := \sup_{\|f\|_{\mathcal{F}} \le 1} \left[ \mathbf{E}_P f(X) - \mathbf{E}_Q f(Y) \right].$$

[Plot: densities $p(x)$, $q(x)$ and the witness $f^*(x)$.]

For a characteristic RKHS $\mathcal{F}$, $\mathrm{MMD}(P, Q; \mathcal{F}) = 0$ iff $P = Q$.

Other choices of witness function class:
- Bounded continuous [Dudley, 2002]
- Bounded variation 1 (Kolmogorov metric) [Müller, 1997]
- 1-Lipschitz (Wasserstein distances) [Dudley, 2002]
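The squared MMD also has a simple unbiased sample estimate. A minimal sketch, assuming a Gaussian kernel with an illustrative bandwidth:

```python
# A minimal sketch of the unbiased empirical MMD^2 between two 1-D samples,
# using a Gaussian kernel. Bandwidth and distributions are illustrative.
import numpy as np

def gauss_gram(X, Y, sigma=1.0):
    """Gram matrix K[i, j] = k(X[i], Y[j]) for 1-D samples."""
    D2 = (X[:, None] - Y[None, :]) ** 2
    return np.exp(-D2 / (2 * sigma ** 2))

def mmd2_unbiased(X, Y, sigma=1.0):
    """Unbiased estimate of MMD^2(P, Q) from X ~ P, Y ~ Q."""
    n, m = len(X), len(Y)
    Kxx = gauss_gram(X, X, sigma); np.fill_diagonal(Kxx, 0.0)
    Kyy = gauss_gram(Y, Y, sigma); np.fill_diagonal(Kyy, 0.0)
    Kxy = gauss_gram(X, Y, sigma)
    return (Kxx.sum() / (n * (n - 1))
            + Kyy.sum() / (m * (m - 1))
            - 2.0 * Kxy.mean())

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=500)   # P = N(0, 1)
Y = rng.normal(0.5, 1.0, size=500)   # Q = N(0.5, 1)
print(mmd2_unbiased(X, Y))
```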
$$\mathrm{MMD}(P, Q) = \sup_{\|f\|_{\mathcal{F}} \le 1} \left[ \mathbf{E}_q f - \mathbf{E}_p f \right]$$

Can we compute the MMD with samples from Q and a model P?
Problem: we usually cannot compute $\mathbf{E}_p f$ in closed form.
To get rid of $\mathbf{E}_p f$ in $\sup_{\|f\|_{\mathcal{F}} \le 1} [\mathbf{E}_q f - \mathbf{E}_p f]$, define the (1-D) Stein operator
$$[\mathcal{A}_p f](x) = \frac{1}{p(x)} \frac{d}{dx} \big( f(x)\, p(x) \big).$$
Then $\mathbf{E}_p \mathcal{A}_p f = 0$, subject to appropriate boundary conditions.

Gorham and Mackey (NeurIPS 2015); Oates, Girolami, Chopin (JRSS B 2016)
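A quick numerical sanity check of this zero-mean property (a sketch; the test function $f(x) = \sin x$ and the standard normal model are illustrative choices):

```python
# Sanity check (a sketch) that E_p[A_p f] = 0: for the standard normal,
# d/dx log p(x) = -x, so [A_p f](x) = f'(x) - x f(x). Verify by Monte Carlo
# with the illustrative choice f(x) = sin(x).
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)

f = np.sin(x)
df = np.cos(x)
stein_f = df - x * f          # [A_p f](x) for p = N(0, 1)
print(stein_f.mean())         # approximately 0
```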
Stein operator: $\mathcal{A}_p f = \frac{1}{p(x)} \frac{d}{dx} \big( f(x)\, p(x) \big)$.

Kernel Stein Discrepancy (KSD):
$$\mathrm{KSD}_p(Q) = \sup_{\|g\|_{\mathcal{F}} \le 1} \left[ \mathbf{E}_q \mathcal{A}_p g - \mathbf{E}_p \mathcal{A}_p g \right] = \sup_{\|g\|_{\mathcal{F}} \le 1} \mathbf{E}_q \mathcal{A}_p g,$$
since the second term vanishes, $\mathbf{E}_p \mathcal{A}_p g = 0$.

[Plot: densities $p(x)$, $q(x)$ and the Stein witness $g^*(x)$.]
Rewrite the Stein operator as
$$[\mathcal{A}_p f](x) = \frac{1}{p(x)} \frac{d}{dx} \big( f(x)\, p(x) \big) = f(x) \frac{d}{dx} \log p(x) + \frac{d}{dx} f(x).$$
In RKHS notation,
$$[\mathcal{A}_p f](x) = \left( \frac{d}{dx} \log p(x) \right) f(x) + \frac{d}{dx} f(x) =: \Big\langle f,\ \underbrace{\xi(x)}_{\text{Stein features}} \Big\rangle_{\mathcal{F}},$$
where $\mathbf{E}_{x \sim p}\, \xi(x) = 0$.
Reproducing property for the derivative: for differentiable $k(x, x')$,
$$\frac{d}{dx} f(x) = \left\langle f,\ \frac{d}{dx} \varphi(x) \right\rangle_{\mathcal{F}}.$$
Using this kernel derivative trick in $(a)$,
$$[\mathcal{A}_p f](x) = \left( \frac{d}{dx} \log p(x) \right) f(x) + \frac{d}{dx} f(x) = \Big\langle f,\ \underbrace{\left( \frac{d}{dx} \log p(x) \right) \varphi(x) + \frac{d}{dx} \varphi(x)}_{(a)} \Big\rangle_{\mathcal{F}} =: \langle f, \xi(x) \rangle_{\mathcal{F}}.$$
Closed-form expression for the KSD: given independent $x, x' \sim Q$,
$$\mathrm{KSD}_p(Q) = \sup_{\|g\|_{\mathcal{F}} \le 1} \mathbf{E}_{x \sim q} \big( [\mathcal{A}_p g](x) \big) = \sup_{\|g\|_{\mathcal{F}} \le 1} \mathbf{E}_{x \sim q} \langle g, \xi_x \rangle_{\mathcal{F}} \overset{(a)}{=} \sup_{\|g\|_{\mathcal{F}} \le 1} \langle g,\ \mathbf{E}_{x \sim q} \xi_x \rangle_{\mathcal{F}} = \| \mathbf{E}_{x \sim q} \xi_x \|_{\mathcal{F}}.$$
Caution: step $(a)$ requires a condition for the Riesz theorem to hold,
$$\mathbf{E}_{x \sim q} \left( \frac{d}{dx} \log p(x) \right)^2 < \infty.$$

Chwialkowski, Strathmann, Gretton (ICML 2016); Liu, Lee, Jordan (ICML 2016)
Example: model $p$ = 10-component Gaussian mixture. [Plot: the Stein witness function $g$ shows where the sample mismatches the model.]
Example where this condition fails: consider the standard normal,
$$p(x) = \frac{1}{\sqrt{2\pi}} \exp\left( -x^2/2 \right), \qquad \text{so} \quad \frac{d}{dx} \log p(x) = -x.$$
If $q$ is a Cauchy distribution, then the integral
$$\mathbf{E}_{x \sim q} \left( \frac{d}{dx} \log p(x) \right)^2 = \int_{-\infty}^{\infty} x^2\, q(x)\, dx$$
is undefined.
Test statistic:
$$\mathrm{KSD}_p^2(Q) = \| \mathbf{E}_{x \sim q} \xi_x \|_{\mathcal{F}}^2 = \mathbf{E}_{x, x' \sim Q}\, h_p(x, x'),$$
where
$$h_p(x, x') = s_p(x)^\top s_p(x')\, k(x, x') + s_p(x)^\top k_2(x, x') + s_p(x')^\top k_1(x, x') + \mathrm{tr}\!\left[ k_{12}(x, x') \right],$$
with the score $s_p(x) = \nabla p(x)/p(x) \in \mathbb{R}^D$ and
$$k_1(a, b) := \nabla_x k(x, x')\big|_{x=a,\, x'=b} \in \mathbb{R}^D, \quad k_2(a, b) := \nabla_{x'} k(x, x')\big|_{x=a,\, x'=b} \in \mathbb{R}^D, \quad k_{12}(a, b) := \nabla_x \nabla_{x'} k(x, x')\big|_{x=a,\, x'=b} \in \mathbb{R}^{D \times D}.$$
If the kernel is $C_0$-universal and $Q$ satisfies $\mathbf{E}_{x \sim Q} \left\| \nabla \log \frac{p(x)}{q(x)} \right\|^2 < \infty$, then $\mathrm{KSD}_p^2(Q) = 0$ iff $P = Q$.
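The statistic is easy to compute from samples and the model score. A minimal 1-D sketch, assuming a Gaussian kernel with an illustrative bandwidth and a standard normal model p, so that $s_p(x) = -x$:

```python
# A minimal sketch of the KSD^2 U-statistic in 1-D, for a Gaussian kernel
# and a standard normal model p (score s_p(x) = -x). Choices illustrative.
import numpy as np

def ksd2_ustat(x, score_p, sigma=1.0):
    """Unbiased KSD^2 estimate: (1/(n(n-1))) sum_{i != j} h_p(x_i, x_j)."""
    n = len(x)
    d = x[:, None] - x[None, :]                          # pairwise differences
    K = np.exp(-d ** 2 / (2 * sigma ** 2))               # k(x_i, x_j)
    K1 = -(d / sigma ** 2) * K                           # d/dx k
    K2 = -K1                                             # d/dx' k
    K12 = (1.0 / sigma ** 2 - d ** 2 / sigma ** 4) * K   # d^2/(dx dx') k
    s = score_p(x)
    H = (s[:, None] * s[None, :] * K
         + s[:, None] * K2 + s[None, :] * K1 + K12)      # h_p(x_i, x_j)
    np.fill_diagonal(H, 0.0)
    return H.sum() / (n * (n - 1))

rng = np.random.default_rng(0)
x_good = rng.standard_normal(500)        # Q = P: statistic near 0
x_bad = rng.normal(1.0, 1.0, 500)        # Q != P: statistic positive
score = lambda x: -x                     # score of p = N(0, 1)
print(ksd2_ustat(x_good, score), ksd2_ustat(x_bad, score))
```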
Discrete domains: $\mathcal{X} = \{1, \ldots, L\}^D$ with $L \in \mathbb{N}$. The population KSD (discrete case):
$$\mathrm{KSD}_p^2(Q) = \mathbf{E}_{x, x' \sim Q}\, h_p(x, x'),$$
where
$$h_p(x, x') = s_p(x)^\top s_p(x')\, k(x, x') - s_p(x)^\top k_2(x, x') - s_p(x')^\top k_1(x, x') + \mathrm{tr}\!\left[ k_{12}(x, x') \right],$$
with $k_1(x, x') = \Delta_x^1 k(x, x')$, where $\Delta_x^1$ is the difference operator on $x$, and $s_p(x) = \Delta p(x)/p(x)$.

A discrete kernel: $k(x, x') = \exp\left( -d_H(x, x') \right)$, where $d_H(x, x') = \frac{1}{D} \sum_{d=1}^{D} \mathbb{I}(x_d \neq x'_d)$ is the normalized Hamming distance.

$\mathrm{KSD}_p^2(Q) = 0$ iff $P = Q$, provided the Gram matrix over all the configurations in $\mathcal{X}$ is strictly positive definite, and $P > 0$, $Q > 0$.

Ranganath et al. (NeurIPS 2016); Yang et al. (ICML 2018)
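For reference, a minimal sketch of the exponentiated Hamming kernel above (the input values are illustrative):

```python
# A minimal sketch of the exponentiated (normalized) Hamming kernel used
# for the discrete KSD: k(x, x') = exp(-d_H(x, x')).
import numpy as np

def hamming_kernel(x, y):
    """k(x, y) = exp(-(1/D) sum_d I(x_d != y_d)) for equal-length arrays."""
    x, y = np.asarray(x), np.asarray(y)
    d_h = np.mean(x != y)
    return np.exp(-d_h)

x = np.array([1, 2, 3, 1])
y = np.array([1, 2, 1, 1])
print(hamming_kernel(x, y))   # one of four entries differ: exp(-1/4)
```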
The empirical statistic:
$$\widehat{\mathrm{KSD}}_p^2(Q) := \frac{1}{n(n-1)} \sum_{i \neq j} h_p(x_i, x_j).$$
Asymptotic distribution when $P \neq Q$:
$$\sqrt{n} \left( \widehat{\mathrm{KSD}}_p^2(Q) - \mathrm{KSD}_p^2(Q) \right) \overset{d}{\to} \mathcal{N}(0, \sigma_{h_p}^2), \qquad \sigma_{h_p}^2 = 4\, \mathrm{Var}\big[ \mathbf{E}_{x'}[h_p(x, x')] \big].$$
Two generative models P and Q, data $\{x_i\}_{i=1}^n \sim R$.
Neither model gives a perfect fit ($P \neq R$ and $Q \neq R$).
Joint asymptotic normality when $P \neq R$ and $Q \neq R$:
$$\sqrt{n} \begin{bmatrix} \widehat{\mathrm{KSD}}_p^2(R) - \mathrm{KSD}_p^2(R) \\ \widehat{\mathrm{KSD}}_q^2(R) - \mathrm{KSD}_q^2(R) \end{bmatrix} \overset{d}{\to} \mathcal{N} \left( \mathbf{0},\ \begin{bmatrix} \sigma_{h_p}^2 & \sigma_{h_p h_q} \\ \sigma_{h_p h_q} & \sigma_{h_q}^2 \end{bmatrix} \right).$$

The difference of the statistics is therefore asymptotically normal:
$$\sqrt{n} \left[ \widehat{\mathrm{KSD}}_p^2(R) - \widehat{\mathrm{KSD}}_q^2(R) - \left( \mathrm{KSD}_p^2(R) - \mathrm{KSD}_q^2(R) \right) \right] \overset{d}{\to} \mathcal{N}\!\left( 0,\ \sigma_{h_p}^2 + \sigma_{h_q}^2 - 2 \sigma_{h_p h_q} \right),$$
so a statistical test with null hypothesis $\mathrm{KSD}_p^2(R) - \mathrm{KSD}_q^2(R) \le 0$ is straightforward.
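A sketch of the resulting relative test, studentising the difference of the two U-statistics with a plug-in estimate of the asymptotic variance. The Gaussian kernel, 1-D setting, and shifted-normal models are illustrative choices, not the talk's experiments:

```python
# A sketch of a relative KSD test: reject H0 ("P fits at least as well as Q")
# when the studentised difference of KSD^2 statistics exceeds the normal
# quantile. The variance estimator is a plug-in for 4 Var[E_{x'}(h_p - h_q)].
import numpy as np
from scipy.stats import norm

def h_matrix(x, score_p, sigma=1.0):
    """H[i, j] = h_p(x_i, x_j) for a 1-D Gaussian kernel and score s_p."""
    d = x[:, None] - x[None, :]
    K = np.exp(-d ** 2 / (2 * sigma ** 2))
    K1 = -(d / sigma ** 2) * K                          # d/dx k
    K2 = -K1                                            # d/dx' k
    K12 = (1.0 / sigma ** 2 - d ** 2 / sigma ** 4) * K  # d^2/(dx dx') k
    s = score_p(x)
    return s[:, None] * s[None, :] * K + s[:, None] * K2 + s[None, :] * K1 + K12

def relative_ksd_test(x, score_p, score_q, alpha=0.05):
    """One-sided test of H0: KSD_p^2(R) - KSD_q^2(R) <= 0."""
    n = len(x)
    D = h_matrix(x, score_p) - h_matrix(x, score_q)
    np.fill_diagonal(D, 0.0)
    stat = D.sum() / (n * (n - 1))          # estimate of KSD_p^2 - KSD_q^2
    row_means = D.sum(axis=1) / (n - 1)     # estimates of E_{x'}[(h_p - h_q)(x_i, x')]
    var = 4.0 * row_means.var(ddof=1)       # plug-in asymptotic variance
    z = np.sqrt(n) * stat / np.sqrt(var)
    return stat, z > norm.ppf(1 - alpha)    # reject H0 => conclude Q fits better

rng = np.random.default_rng(0)
x = rng.standard_normal(1000)               # data R = N(0, 1)
p_score = lambda t: -(t - 1.0)              # P = N(1, 1): the worse model
q_score = lambda t: -(t - 0.2)              # Q = N(0.2, 1): the better model
print(relative_ksd_test(x, p_score, q_score))
```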
Can we compare latent variable models with the KSD?
$$p(x) = \int p(x \mid z)\, p(z)\, dz, \qquad q(y) = \int q(y \mid w)\, p(w)\, dw$$
Recall the multi-dimensional Stein operator:
$$[\mathcal{A}_p f](x) = \Big\langle \underbrace{\frac{\nabla p(x)}{p(x)}}_{(a)},\ f(x) \Big\rangle + \langle \nabla, f(x) \rangle.$$
Expression $(a)$ requires the marginal $p(x)$, which is often intractable... but sampling from the model can be straightforward!
Approximate the integral using $\{z_j\}_{j=1}^m \sim p(z)$:
$$p(x) = \int p(x \mid z)\, p(z)\, dz \approx p_m(x) = \frac{1}{m} \sum_{j=1}^{m} p(x \mid z_j).$$
Estimate the KSDs with the approximate densities:
$$\widehat{\mathrm{KSD}}_p^2(R) - \widehat{\mathrm{KSD}}_q^2(R) \approx \widehat{\mathrm{KSD}}_{p_m}^2(R) - \widehat{\mathrm{KSD}}_{q_m}^2(R).$$
Recall
$$\sqrt{n} \left[ \widehat{\mathrm{KSD}}_p^2(R) - \widehat{\mathrm{KSD}}_q^2(R) - \left( \mathrm{KSD}_p^2(R) - \mathrm{KSD}_q^2(R) \right) \right] \overset{d}{\to} \mathcal{N}\!\left( 0,\ \sigma_{h_p}^2 + \sigma_{h_q}^2 - 2 \sigma_{h_p h_q} \right).$$
If m is large, can we simply substitute $p_m$ and $q_m$?
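A minimal sketch of the Monte Carlo marginal $p_m$, for a toy Gaussian latent variable model (an illustrative choice whose exact marginal is known, so the approximation can be checked directly):

```python
# A minimal sketch of the Monte Carlo marginal p_m(x) = (1/m) sum_j p(x|z_j),
# z_j ~ p(z), for a toy model (illustrative): z ~ N(0, 1), x | z ~ N(z, 1),
# so the exact marginal is N(0, 2).
import numpy as np
from scipy.stats import norm

def p_m(x, z, lik_std=1.0):
    """Monte Carlo estimate of the marginal density at points x."""
    # shape (len(x), m): p(x_i | z_j), averaged over the latent draws
    return norm.pdf(x[:, None], loc=z[None, :], scale=lik_std).mean(axis=1)

rng = np.random.default_rng(0)
z = rng.standard_normal(1000)          # m = 1000 latent draws
x = np.linspace(-3, 3, 5)
print(p_m(x, z))                       # compare with the exact marginal:
print(norm.pdf(x, scale=np.sqrt(2.0)))
```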
Check $\widehat{\mathrm{KSD}}_p^2(R) \approx \widehat{\mathrm{KSD}}_{p_m}^2(R)$ with a toy model.

Model: Beta-Binomial, $\mathrm{BetaBinom}(\alpha, \beta)$:
$$p(x \mid z) = \binom{N}{x} z^x (1 - z)^{N - x}, \qquad p(z) = \mathrm{Beta}(a, b).$$
- Latent $z \in (0, 1)$: the success probability of the binomial likelihood.
- Marginal $p(x)$: tractable (given by the beta function).

Generate $\sqrt{n}\, \widehat{\mathrm{KSD}}_p^2(R)$ and $\sqrt{n}\, \widehat{\mathrm{KSD}}_{p_m}^2(R)$: what do their distributions look like?
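A sketch of this toy model: the exact Beta-Binomial marginal via the beta function, next to the Monte Carlo approximation $p_m$ (the parameter values are illustrative):

```python
# A sketch of the Beta-Binomial toy model: exact marginal pmf via the beta
# function vs. the Monte Carlo approximation p_m. N, a, b are illustrative.
import numpy as np
from scipy.special import comb, betaln

def betabinom_pmf(x, N, a, b):
    """Exact marginal: C(N, x) * B(x + a, N - x + b) / B(a, b)."""
    return comb(N, x) * np.exp(betaln(x + a, N - x + b) - betaln(a, b))

def betabinom_pm(x, N, z):
    """Monte Carlo marginal p_m(x) = (1/m) sum_j Binom(x; N, z_j)."""
    lik = comb(N, x)[..., None] * z ** x[..., None] * (1 - z) ** (N - x[..., None])
    return lik.mean(axis=-1)

rng = np.random.default_rng(0)
N, a, b, m = 10, 2.0, 3.0, 1000
z = rng.beta(a, b, size=m)             # latent draws z_j ~ Beta(a, b)
x = np.arange(N + 1)
print(betabinom_pmf(x, N, a, b))       # exact marginal
print(betabinom_pm(x, N, z))           # Monte Carlo marginal
```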
[Plot: sampling distributions of $\sqrt{n}\, \widehat{\mathrm{KSD}}_p^2(R)$ (exact marginal, $p$) and $\sqrt{n}\, \widehat{\mathrm{KSD}}_{p_m}^2(R)$ (approximate marginal, $p_m$).]
Two sources of randomness:
- $\mathrm{KSD}_{p_m}^2(R)$ is normally distributed around $\mathrm{KSD}_p^2(R)$: the approximation error from the latent draws.
- Each approximation $p_m$ gives a random draw of $\mathrm{KSD}_{p_m}^2(R)$.
- Given $p_m$, the estimator $\widehat{\mathrm{KSD}}_{p_m}^2(R)$ is normally distributed around $\mathrm{KSD}_{p_m}^2(R)$.
- The distribution of $\widehat{\mathrm{KSD}}_{p_m}^2(R)$ is therefore averaged over the random draws of $\mathrm{KSD}_{p_m}^2(R)$.
- Consequently, $\widehat{\mathrm{KSD}}_{p_m}^2(R)$ has a higher variance than $\widehat{\mathrm{KSD}}_p^2(R)$.
Beta-Binomial models, p vs $q = p_m$: numerical vs closed-form marginalisation.
- Setup: $P = \mathrm{BetaBinom}(5 + a,\, 1 + b)$, $Q = p_m$, $R = \mathrm{BetaBinom}(a, b)$; kernel $k(x, x') = \exp(-\mathbb{I}(x \neq x'))$; level $\alpha = 0.05$.
- With the correction for the increased variance of $\widehat{\mathrm{KSD}}_{p_m}^2(R)$, the null is accepted with probability $1 - \alpha$.
- The naive relative KSD test has incorrect Type I error: since $q = p_m \neq p$, its rejection rate $\to 1$ as $n \to \infty$.

[Plot legend: KSD without corrected threshold (m = 100); KSD (m = 1000); LKSD (KSD for latent models, m = 100); LKSD (m = 1000).]
We have asymptotic normality for $\mathrm{KSD}_{p_m}^2(R)$:
$$\sqrt{m} \left( \mathrm{KSD}_{p_m}^2(R) - \mathrm{KSD}_p^2(R) \right) \overset{d}{\to} \mathcal{N}(0, \gamma_p^2).$$
The fine print:
- $\inf_x p(x) > 0$ and $\sup_x \left| \frac{dp(x)}{dx} \right| < \infty$;
- (uniform CLT) the likelihoods $\{ p(x \mid \cdot) : x \in \mathcal{X} \}$ and derivatives $\{ \frac{d}{dx} p(x \mid \cdot) : x \in \mathcal{X} \}$ are $p(z)$-Donsker classes.
Asymptotic distribution of the approximate KSD estimate: as $(n, m) \to \infty$ with $n/m \to r \in [0, \infty)$,
$$\sqrt{n} \left[ \left( \widehat{\mathrm{KSD}}_{p_m}^2(R) - \widehat{\mathrm{KSD}}_{q_m}^2(R) \right) - \left( \mathrm{KSD}_p^2(R) - \mathrm{KSD}_q^2(R) \right) \right] \overset{d}{\to} \mathcal{N}(0, c^2),$$
where $c = \sigma_{pq} \sqrt{1 + r (\gamma_{pq}/\sigma_{pq})^2}$ and
$$\gamma_{pq}^2 = \lim_{m \to \infty} m \cdot \mathrm{Var}\left[ \mathbf{E}_{x, x'} h_{p_m}(x, x') - \mathbf{E}_{x, x'} h_{q_m}(x, x') \right], \qquad \sigma_{pq}^2 = \lim_{n \to \infty} n \cdot \mathrm{Var}\left[ \widehat{\mathrm{KSD}}_p^2(R) - \widehat{\mathrm{KSD}}_q^2(R) \right].$$
Fine print: $h_p(x, x') - h_q(x, x')$ has a finite third moment, plus an additional technical condition (next slide).
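A sketch of how the corrected threshold would follow from this result, assuming plug-in estimates of $\sigma_{pq}$ and $\gamma_{pq}$ are available (their estimation is not shown here; the numbers below are illustrative):

```python
# A sketch of the corrected test threshold: the approximate-KSD difference is
# studentised by c = sigma_pq * sqrt(1 + r * (gamma_pq / sigma_pq)^2), i.e.
# c^2 = sigma_pq^2 + r * gamma_pq^2 with r = n / m. Estimates of sigma_pq and
# gamma_pq are assumed given (e.g. plug-in estimates, not shown).
import numpy as np
from scipy.stats import norm

def corrected_threshold(sigma_pq, gamma_pq, n, m, alpha=0.05):
    """One-sided rejection threshold for sqrt(n) * (KSD^2_pm - KSD^2_qm)."""
    r = n / m
    c = np.sqrt(sigma_pq ** 2 + r * gamma_pq ** 2)
    return c * norm.ppf(1 - alpha)

# Example: with n = 300 samples and m = 100 latent draws, the extra latent
# variance inflates the threshold relative to the naive sigma-only one.
print(corrected_threshold(sigma_pq=1.0, gamma_pq=0.5, n=300, m=100))
print(1.0 * norm.ppf(0.95))   # naive threshold for comparison
```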
The additional technical condition. Let
- $U_{n,m}$: a U-statistic defined by a random U-statistic kernel $H_m$;
- $U_n$: a U-statistic defined by a fixed U-statistic kernel $h$.
Assume that
- $\sigma_{H_m}^2 \to \sigma_h^2$ in probability;
- $\nu_3(H_m) \to \nu_3(h) < \infty$ in probability, where $\nu_3(H_m) = \mathbf{E}_{x, x'} \left| H_m(x, x') - \mathbf{E}_{x, x'} H_m(x, x') \right|^3$;
- $Y_m := \sqrt{m} \left( \mathbf{E}_n[U_{n,m} \mid H_m] - \mathbf{E}_n[U_n] \right) \overset{d}{\to} Y$.
Then, with $n/m \to r \in [0, \infty)$,
$$\lim_{n, m \to \infty} \Pr\left[ \sqrt{n} \left( U_{n,m} - \mathbf{E}_n U_n \right) < t \right] = \mathbf{E}_Y \left[ \Phi\!\left( \frac{t - \sqrt{r}\, Y}{\sigma_h} \right) \right].$$
Experiment: data $R$ = sigmoid belief network $\mathrm{SBN}(W)$:
$$R(x \mid z) = \mathrm{sigmoid}(Wz), \qquad R(z) = \mathcal{N}(0, I), \qquad W \in \mathbb{R}^{30 \times 10}.$$
Models: $P = \mathrm{SBN}(W + \epsilon [1, 0, \ldots, 0])$ and $Q = \mathrm{SBN}(W + [1, 0, \ldots, 0])$; only the first column of the weight $W$ is perturbed, by $\epsilon$ for $P$.

Two scenarios (Hamming kernel, sample size $n = 300$):
- Null: $\epsilon \le 1$ (level $\alpha = 0.05$);
- Alternative: $\epsilon > 1$ (higher rejection rate is better).

Results: the KSD-based test has higher power for $\epsilon > 1$. The sample-wise difference between the models is subtle, so the MMD fails; the model's likelihood information is better utilised by the KSD.

[Plot legend: MMD; LKSD (KSD for latent models, m = 100); LKSD (m = 1000).]
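For completeness, a minimal sketch of sampling from an $\mathrm{SBN}(W)$ as specified above, interpreting $R(x \mid z) = \mathrm{sigmoid}(Wz)$ componentwise as Bernoulli means (a standard reading of the sigmoid belief network):

```python
# A minimal sketch of sampling from a sigmoid belief network SBN(W):
# z ~ N(0, I), then each x_d is Bernoulli with mean sigmoid((W z)_d).
import numpy as np

def sample_sbn(W, n, rng):
    """Draw n binary observations x in {0, 1}^D from SBN(W)."""
    D, d_z = W.shape
    z = rng.standard_normal((n, d_z))            # latent z ~ N(0, I)
    probs = 1.0 / (1.0 + np.exp(-z @ W.T))       # sigmoid(W z), row-wise
    return (rng.random((n, D)) < probs).astype(int)

rng = np.random.default_rng(0)
W = rng.standard_normal((30, 10))                # W in R^{30 x 10}
X = sample_sbn(W, n=300, rng=rng)
print(X.shape, X.mean())
```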