The Maximum Mean Discrepancy for Training Generative Adversarial Networks
Arthur Gretton, Gatsby Computational Neuroscience Unit, University College London. Cardiff, 2018
A motivation: comparing two samples. Given: samples from two unknown distributions P and Q.
Sutherland, Tung, Strathmann, De, Ramdas, Smola, G., ICLR 2017.
(Binkowski, Sutherland, Arbel, G., ICLR 2018), (Arbel, Sutherland, Binkowski, G., arXiv 2018)
Their noses guide them through life, and they're never happier than when following an interesting scent. A large animal who slings slobber, exudes a distinctive houndy odor, and wants nothing more than to follow his nose.
Text from dogtime.com and petfinder.com
A responsive, interactive pet, one that will blow in your ear and follow you everywhere.
[Figure: two Gaussians with different means; densities P_X and Q_X.]
[Figure: two Gaussians with different variances; densities P_X and Q_X.]
[Figure: two Gaussians with different variances; densities P_X and Q_X. Right: densities of the feature X², where the difference in variances appears as a difference in means.]
[Figure: Gaussian and Laplace densities, P_X and Q_X.]
Features: Gaussian Processes for Machine Learning, Rasmussen and Williams, Ch. 4.
Fine print: the feature map φ(x) must be Bochner integrable for all probability measures considered. Always true if the kernel is bounded.
Writing the MMD in terms of kernel expectations:
$$\mathrm{MMD}^2(P,Q) = \underbrace{\mathbf{E}_{x,x'\sim P}\,k(x,x')}_{(a)} + \underbrace{\mathbf{E}_{y,y'\sim Q}\,k(y,y')}_{(a)} - 2\,\underbrace{\mathbf{E}_{x\sim P,\;y\sim Q}\,k(x,y)}_{(b)}$$
(a): within-distribution terms; (b): cross term.
$$\widehat{\mathrm{MMD}}^2 = \frac{1}{n(n-1)}\sum_{i\neq j} k(\mathrm{dog}_i,\mathrm{dog}_j) + \frac{1}{n(n-1)}\sum_{i\neq j} k(\mathrm{fish}_i,\mathrm{fish}_j) - \frac{2}{n^2}\sum_{i,j} k(\mathrm{dog}_i,\mathrm{fish}_j)$$
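The estimator above is straightforward to implement. A minimal NumPy sketch, where the Gaussian kernel and its bandwidth are assumptions (the talk leaves the kernel choice open at this point):

```python
import numpy as np

def mmd2_unbiased(X, Y, sigma=1.0):
    """Unbiased MMD^2 estimate with a Gaussian kernel k(a,b) = exp(-||a-b||^2 / (2 sigma^2))."""
    def k(A, B):
        sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return np.exp(-sq / (2 * sigma**2))

    n, m = len(X), len(Y)
    Kxx, Kyy, Kxy = k(X, X), k(Y, Y), k(X, Y)
    # exclude the diagonal from the within-sample averages (this makes the estimate unbiased)
    term_x = (Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
    term_y = (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
    return term_x + term_y - 2 * Kxy.mean()

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1, size=(500, 2))    # sample from P
Y = rng.normal(1.0, 1, size=(500, 2))    # sample from Q (shifted mean)
X2 = rng.normal(0.0, 1, size=(500, 2))   # second sample from P
print(mmd2_unbiased(X, Y))    # noticeably positive: P != Q
print(mmd2_unbiased(X, X2))   # near zero: same distribution
```

Excluding the diagonal terms k(x_i, x_i) from the within-sample sums is exactly the 1/(n(n−1)) normalization in the display above.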
[Figure: samples from P and Q.]

[Figure: a smooth function f(x) evaluated on samples from P and Q.]
MMD as an integral probability metric:
$$\mathrm{MMD}(P,Q;\mathcal{F}) = \sup_{\|f\|_{\mathcal{F}} \le 1} \left[ \mathbf{E}_P f(X) - \mathbf{E}_Q f(Y) \right]$$
(well defined whenever the kernel is bounded).

[Figure: witness function f for Gaussian and Laplace densities.]
For $f \in \mathcal{F}$, $\mathbf{E}_P f(X) - \mathbf{E}_Q f(Y) = \langle f, \mu_P - \mu_Q \rangle_{\mathcal{F}}$, so the supremum over $\|f\|_{\mathcal{F}} \le 1$ is attained at the witness $f^* \propto \mu_P - \mu_Q$, giving $\mathrm{MMD}(P,Q;\mathcal{F}) = \|\mu_P - \mu_Q\|_{\mathcal{F}}$.

[Figure: witness function f for Gaussian and Laplace densities.]
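Since the supremum is attained at $f^* \propto \mu_P - \mu_Q$, the empirical witness can be evaluated directly from the samples as a difference of kernel means. A sketch for the Gauss-vs-Laplace figure; the kernel, bandwidth, and sample sizes are assumptions:

```python
import numpy as np

def witness(t, X, Y, sigma=1.0):
    """Empirical witness f(t) = mean_i k(x_i, t) - mean_j k(y_j, t) (up to scaling)."""
    k = lambda S, t: np.exp(-(S - t) ** 2 / (2 * sigma ** 2))
    return k(X, t).mean() - k(Y, t).mean()

rng = np.random.default_rng(1)
X = rng.normal(0, 1, 5000)                  # "Gauss" sample
Y = rng.laplace(0, 1 / np.sqrt(2), 5000)    # "Laplace" sample, variance matched to 1

# evaluate the witness on a grid, as in the figure
ts = np.linspace(-6, 6, 25)
vals = np.array([witness(t, X, Y, sigma=0.5) for t in ts])
# the Laplace is peakier at 0, so the witness is typically negative there
# and positive at intermediate |x| where the Gaussian has more mass
```
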
Empirical estimate via mean embeddings:
$$\hat{\mu}_P = \frac{1}{n}\sum_{i=1}^{n} \varphi(x_i), \qquad \hat{\mu}_Q = \frac{1}{n}\sum_{i=1}^{n} \varphi(y_i),$$
$$\widehat{\mathrm{MMD}} = \left\| \hat{\mu}_P - \hat{\mu}_Q \right\|_{\mathcal{F}} = \left[ \frac{1}{n^2}\sum_{i,j} k(x_i,x_j) + \frac{1}{n^2}\sum_{i,j} k(y_i,y_j) - \frac{2}{n^2}\sum_{i,j} k(x_i,y_j) \right]^{1/2}$$
Sriperumbudur, Fukumizu, G, Schoelkopf, Lanckriet (2012)
$$\widehat{\mathrm{MMD}}^2 = \frac{1}{n(n-1)}\sum_{i\neq j} k(x_i,x_j) + \frac{1}{n(n-1)}\sum_{i\neq j} k(y_i,y_j) - \frac{2}{n^2}\sum_{i,j} k(x_i,y_j)$$
When P = Q, $\widehat{\mathrm{MMD}}^2$ should be “close to zero”; when P ≠ Q, “far from zero”. Choose the threshold on $\widehat{\mathrm{MMD}}^2$ to get false positive rate $\alpha$.
Behaviour of $\widehat{\mathrm{MMD}}^2$ when P ≠ Q: the estimate concentrates around the population $\mathrm{MMD}^2(P,Q)$, with approximately Gaussian fluctuations whose standard deviation shrinks as $O(n^{-1/2})$.

[Figure: empirical PDF of $\widehat{\mathrm{MMD}}^2$ with a Gaussian fit, for two Laplace distributions with different variances.]
Behaviour of $\widehat{\mathrm{MMD}}^2$ when P = Q:

[Figure: empirical distribution of $n \cdot \widehat{\mathrm{MMD}}^2$ under the null.]
Under the null, $n\,\widehat{\mathrm{MMD}}^2$ converges to a weighted sum of centred chi-squared variables:
$$n\,\widehat{\mathrm{MMD}}^2 \sim \sum_{l=1}^{\infty} \lambda_l \left[ z_l^2 - 2 \right], \qquad z_l \sim \mathcal{N}(0,2) \text{ i.i.d.},$$
where the $\lambda_i$ are eigenvalues of the centred kernel:
$$\lambda_i \psi_i(x') = \int \underbrace{\tilde{k}(x,x')}_{\text{centred}}\, \psi_i(x)\, dP(x).$$
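One way to use this expansion in practice is to estimate the $\lambda_l$ from the spectrum of the centred Gram matrix and simulate draws from the limiting law. A sketch, with a Gaussian kernel and bandwidth as assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
X = rng.normal(size=(2 * n, 1))        # pooled sample; here P = Q by construction
m = 2 * n

# Gaussian kernel matrix on the pooled sample (sigma = 1 is an assumption)
sq = (X - X.T) ** 2
K = np.exp(-sq / 2)

# eigenvalues of the centred Gram matrix, scaled by 1/m, estimate the lambda_l
H = np.eye(m) - np.ones((m, m)) / m
lam = np.linalg.eigvalsh(H @ K @ H) / m
lam = np.clip(lam, 0, None)            # clip tiny negative values from round-off

# simulate the limiting law: sum_l lambda_l (z_l^2 - 2), z_l ~ N(0, 2) i.i.d.
z = rng.normal(0, np.sqrt(2), size=(5000, m))
null_draws = (z**2 - 2) @ lam
print(np.quantile(null_draws, 0.95))   # approximate threshold for n * MMD^2
```
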
Permutation test: compute the same statistic on pooled and randomly re-partitioned samples $\tilde{x}, \tilde{y}$:
$$\widehat{\mathrm{MMD}}^2 = \frac{1}{n(n-1)}\sum_{i\neq j} k(\tilde{x}_i,\tilde{x}_j) + \frac{1}{n(n-1)}\sum_{i\neq j} k(\tilde{y}_i,\tilde{y}_j) - \frac{2}{n^2}\sum_{i,j} k(\tilde{x}_i,\tilde{y}_j)$$
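A sketch of the resulting permutation test; the Gaussian kernel, bandwidth, and number of permutations are assumptions. The pooled kernel matrix is computed once and re-indexed for each permutation:

```python
import numpy as np

def mmd2_u(K, n):
    """Unbiased MMD^2 from a pooled (2n x 2n) kernel matrix; first n rows are X."""
    Kxx, Kyy, Kxy = K[:n, :n], K[n:, n:], K[:n, n:]
    t1 = (Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
    t2 = (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
    return t1 + t2 - 2 * Kxy.mean()

def permutation_test(X, Y, sigma=1.0, n_perm=500, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    Z = np.concatenate([X, Y])
    sq = np.sum(Z**2, 1)[:, None] + np.sum(Z**2, 1)[None, :] - 2 * Z @ Z.T
    K = np.exp(-sq / (2 * sigma**2))   # kernel matrix on the pooled sample
    n = len(X)
    stat = mmd2_u(K, n)
    # null distribution: recompute the statistic under random re-partitions
    null = [mmd2_u(K[np.ix_(p, p)], n)
            for p in (rng.permutation(2 * n) for _ in range(n_perm))]
    threshold = np.quantile(null, 1 - alpha)
    return stat, threshold, stat > threshold

rng = np.random.default_rng(3)
X = rng.normal(0.0, 1, (200, 2))
Y = rng.normal(0.7, 1, (200, 2))       # shifted mean: the test should reject
stat, thr, reject = permutation_test(X, Y)
```
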
Test power: as $n$ grows,
$$\Pr\!\left( n\,\widehat{\mathrm{MMD}}^2 > \hat{c}_\alpha \right) \to \Phi\!\left( \underbrace{\frac{\sqrt{n}\,\mathrm{MMD}^2}{\sigma_{H_1}}}_{O(n^{1/2})} - \frac{\hat{c}_\alpha}{\sqrt{n}\,\sigma_{H_1}} \right),$$
so choosing the kernel to maximize test power amounts to maximizing the ratio $\mathrm{MMD}^2 / \sigma_{H_1}$.

(Sutherland, Tung, Strathmann, De, Ramdas, Smola, G., ICLR 2017)
Code: github.com/dougalsutherland/opt-mmd
Gradient penalty for the critic (Binkowski, Sutherland, Arbel, G. [ICLR 2018]; Bellemare et al. [2017] for the energy distance): penalize deviations of the critic's gradient norm from 1,
$$\lambda\,\mathbf{E}\left( \left\| \nabla f_\theta(\tilde{x}) \right\| - 1 \right)^2 .$$
Remark by Bottou et al. (2017): the gradient penalty modifies the function class. So the critic is no longer an MMD in the RKHS $\mathcal{F}$.
Nagarajan and Kolter [NIPS 2017], Mescheder et al. [ICML 2018], Balduzzi et al. [ICML 2018]
Figure from Mescheder et al. [ICML 2018]
Arbel, Sutherland, Binkowski, G. [NIPS 2018]
The critic's Sobolev-type norm is controlled by its RKHS norm:
$$\left( \int f^2\, dP + \sum_{i=1}^{d} \int (\partial_i f)^2\, dP \right)^{1/2} \le \sigma^{-1}_{k,P,\lambda}\, \|f\|_k .$$
Salimans et al. [NIPS 2016]
et al. [ICLR 2014],
Heusel et al. [NIPS 2017]: the Fréchet Inception Distance between feature distributions,
$$\mathrm{FID}(P,Q) = \left\| \mu_P - \mu_Q \right\|^2 + \mathrm{Tr}\left( \Sigma_P + \Sigma_Q - 2\left( \Sigma_P \Sigma_Q \right)^{1/2} \right).$$

[Figure: FID estimate as a function of the number of samples n.]
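A NumPy-only sketch of the FID formula above, using $\mathrm{Tr}\,(\Sigma_P\Sigma_Q)^{1/2} = \sum_i \sqrt{\nu_i}$ with $\nu_i$ the eigenvalues of $\Sigma_P\Sigma_Q$; random vectors stand in for Inception activations (an assumption, to keep the sketch self-contained):

```python
import numpy as np

def fid(feats_p, feats_q):
    """Frechet distance between Gaussians fitted to two feature matrices (rows = samples)."""
    mu_p, mu_q = feats_p.mean(0), feats_q.mean(0)
    cov_p = np.cov(feats_p, rowvar=False)
    cov_q = np.cov(feats_q, rowvar=False)
    # Tr((cov_p cov_q)^{1/2}) = sum of square roots of the eigenvalues of cov_p @ cov_q
    eigs = np.linalg.eigvals(cov_p @ cov_q).real
    tr_sqrt = np.sqrt(np.clip(eigs, 0, None)).sum()
    diff = mu_p - mu_q
    return diff @ diff + np.trace(cov_p) + np.trace(cov_q) - 2 * tr_sqrt

rng = np.random.default_rng(4)
A = rng.normal(0.0, 1, (2000, 8))    # stand-in for features of real images
B = rng.normal(0.5, 1, (2000, 8))    # features of "generated" images, shifted
A2 = rng.normal(0.0, 1, (2000, 8))   # fresh sample from the same distribution as A
```

Note FID fits Gaussians to the two feature sets, so the plug-in estimate is biased at finite sample size; this is the contrast with KID below.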
Benchmark covariance: CCᵀ, with C a d × d matrix with i.i.d. standard normal entries.
KID (Kernel Inception Distance): the unbiased $\widehat{\mathrm{MMD}}^2$ with the cubic polynomial kernel $k(x,y) = \left( \tfrac{1}{d}\langle x, y\rangle + 1 \right)^3$ on Inception features.

[Figure: KID estimate as a function of the number of samples n; the unbiased estimate fluctuates around zero when the distributions match.]

[Bounliphone et al. ICLR 2016]
Related: “An empirical study on evaluation metrics of generative adversarial networks”, Xu et al. [arXiv, June 2018]
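A sketch of KID as the unbiased MMD² estimate under the cubic polynomial kernel; as in the FID sketch, random vectors stand in for Inception activations:

```python
import numpy as np

def kid(feats_p, feats_q):
    """Unbiased MMD^2 with the cubic polynomial kernel k(x,y) = (<x,y>/d + 1)^3."""
    d = feats_p.shape[1]
    k = lambda A, B: (A @ B.T / d + 1) ** 3
    n, m = len(feats_p), len(feats_q)
    Kxx, Kyy, Kxy = k(feats_p, feats_p), k(feats_q, feats_q), k(feats_p, feats_q)
    # diagonal excluded: this is what makes the estimate unbiased (unlike FID)
    t1 = (Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
    t2 = (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
    return t1 + t2 - 2 * Kxy.mean()

rng = np.random.default_rng(5)
A = rng.normal(0.0, 1, (1000, 16))
B = rng.normal(0.0, 1, (1000, 16))   # same distribution: KID fluctuates around zero
C = rng.normal(0.3, 1, (1000, 16))   # shifted distribution: KID clearly positive
```
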
202 599 face images, resized and cropped to 160 × 160
ILSVRC2012 (ImageNet) dataset, 1 281 167 images, resized to 64 × 64. Around 20 000 classes.