

SLIDE 1

The Maximum Mean Discrepancy for Training Generative Adversarial Networks

Arthur Gretton, Gatsby Computational Neuroscience Unit, University College London. Cardiff, 2018

SLIDE 2

A motivation: comparing two samples

Given: Samples from unknown distributions P and Q. Goal: do P and Q differ?


SLIDE 3

A real-life example: two-sample tests

Have: two collections of samples X, Y from unknown distributions P and Q. Goal: do P and Q differ? MNIST samples vs. samples from a GAN.

Is there a significant difference between the GAN and MNIST samples?

  • T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, Xi Chen, NIPS 2016

  • Sutherland, Tung, Strathmann, De, Ramdas, Smola, G., ICLR 2017

SLIDE 4

Training generative models

SLIDE 5

Training generative models

Have: one collection of samples X from an unknown distribution P. Goal: generate samples Q that look like P. LSUN bedroom samples (P) vs. generated Q from an MMD GAN.

Using MMD to train a GAN

(Binkowski, Sutherland, Arbel, G., ICLR 2018), (Arbel, Sutherland, Binkowski, G., arXiv 2018)


SLIDE 6

Testing goodness of fit

Given: a model P and samples from Q. Goal: is P a good fit for Q? Chicago crime data; the model is a Gaussian mixture with two components.

SLIDE 7

Testing independence

Given: samples from a joint distribution $P_{XY}$. Goal: are X and Y independent?

Their noses guide them through life, and they're never happier than when following an interesting scent. A large animal who slings slobber, exudes a distinctive houndy odor, and wants nothing more than to follow his nose.

Text from dogtime.com and petfinder.com

A responsive, interactive pet, one that will blow in your ear and follow you everywhere.


SLIDE 8

Outline

Maximum Mean Discrepancy (MMD)...

  • ...as a difference in feature means
  • ...as an integral probability metric (not just a technicality!)

A statistical test based on the MMD

Training generative adversarial networks with MMD

  • Gradient regularisation and data adaptivity
  • Evaluating GAN performance? Problems with Inception and FID.


SLIDE 9

Maximum Mean Discrepancy

SLIDE 10

Feature mean difference

Simple example: two Gaussians with different means. Answer: a t-test.

[Figure: probability densities $P_X$ and $Q_X$, two Gaussians with different means]

SLIDE 11

Feature mean difference

Two Gaussians with the same means but different variance. Idea: look at the difference in means of features of the random variables. In the Gaussian case: second-order features of the form $\varphi(x) = x^2$.

[Figures: densities $P_X$ and $Q_X$, two Gaussians with different variances; the densities of the feature $x^2$ then differ in mean]

SLIDE 13

Feature mean difference

Gaussian and Laplace distributions: same mean and same variance. Difference in means using higher-order features... RKHS.

[Figure: Gaussian and Laplace densities $P_X$ and $Q_X$]

SLIDE 14

Infinitely many features using kernels

Kernels: dot products of features.

Feature map $\varphi(x) \in \mathcal{F}$, $\varphi(x) = [\ldots\ \varphi_i(x)\ \ldots] \in \ell^2$. For positive definite $k$, $k(x, x') = \langle \varphi(x), \varphi(x') \rangle_{\mathcal{F}}$. Infinitely many features $\varphi(x)$, dot product in closed form!

Exponentiated quadratic kernel: $k(x, x') = \exp\left( -\gamma \|x - x'\|^2 \right)$

Features: Gaussian Processes for Machine Learning, Rasmussen and Williams, Ch. 4.
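To make the closed-form dot product concrete, here is a minimal NumPy sketch (ours, not from the talk; the function name and the bandwidth parameter gamma are our own choices):

```python
import numpy as np

def gaussian_kernel(X, Y, gamma=1.0):
    """Exponentiated quadratic kernel matrix K[i, j] = exp(-gamma * ||X[i] - Y[j]||^2).

    The dot product of infinitely many features, computed in closed form.
    """
    sq = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq, 0.0))  # clip tiny negative round-off

# The kernel matrix of a sample with itself is symmetric positive semi-definite.
X = np.random.randn(5, 2)
K = gaussian_kernel(X, X)
assert np.allclose(K, K.T) and np.linalg.eigvalsh(K).min() > -1e-9
```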

SLIDE 16

Infinitely many features of distributions

Given $P$ a Borel probability measure on $\mathcal{X}$, define the feature map of the probability $P$: $\mu_P = [\ldots\ E_P[\varphi_i(X)]\ \ldots]$

For positive definite $k(x, x')$: $\langle \mu_P, \mu_Q \rangle_{\mathcal{F}} = E_{P,Q}\, k(X, Y)$ for $X \sim P$ and $Y \sim Q$.

Fine print: the feature map $\varphi(x)$ must be Bochner integrable for all probability measures considered. Always true if the kernel is bounded.

SLIDE 18

The maximum mean discrepancy

The maximum mean discrepancy is the distance between feature means:

$\mathrm{MMD}^2(P, Q) = \|\mu_P - \mu_Q\|_{\mathcal{F}}^2$
$= \langle \mu_P, \mu_P \rangle_{\mathcal{F}} + \langle \mu_Q, \mu_Q \rangle_{\mathcal{F}} - 2 \langle \mu_P, \mu_Q \rangle_{\mathcal{F}}$
$= \underbrace{E_P\, k(X, X')}_{(a)} + \underbrace{E_Q\, k(Y, Y')}_{(a)} - 2\, \underbrace{E_{P,Q}\, k(X, Y)}_{(b)}$

(a) = within-distribution similarity, (b) = cross-distribution similarity.

SLIDE 21

Illustration of MMD

Dogs ($= P$) and fish ($= Q$) example revisited. Each entry of the kernel matrix is one of $k(\mathrm{dog}_i, \mathrm{dog}_j)$, $k(\mathrm{dog}_i, \mathrm{fish}_j)$, or $k(\mathrm{fish}_i, \mathrm{fish}_j)$.

SLIDE 22

Illustration of MMD

The empirical maximum mean discrepancy:

$\widehat{\mathrm{MMD}}^2 = \frac{1}{n(n-1)} \sum_{i \neq j} k(\mathrm{dog}_i, \mathrm{dog}_j) + \frac{1}{n(n-1)} \sum_{i \neq j} k(\mathrm{fish}_i, \mathrm{fish}_j) - \frac{2}{n^2} \sum_{i,j} k(\mathrm{dog}_i, \mathrm{fish}_j)$
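The same estimator in code — a minimal NumPy sketch (our own naming; it allows unequal sample sizes, with the cross term averaged over n·m pairs):

```python
import numpy as np

def gaussian_kernel(X, Y, gamma=1.0):
    sq = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def mmd2_unbiased(X, Y, gamma=1.0):
    """Unbiased estimate of MMD^2: the i = j terms are excluded from the
    within-sample sums, exactly as in the formula on the slide."""
    n, m = len(X), len(Y)
    Kxx = gaussian_kernel(X, X, gamma)
    Kyy = gaussian_kernel(Y, Y, gamma)
    Kxy = gaussian_kernel(X, Y, gamma)
    return ((Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
            + (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
            - 2 * Kxy.mean())

# Same distribution: estimate near 0; different means: estimate positive.
rng = np.random.default_rng(0)
print(mmd2_unbiased(rng.normal(0, 1, (200, 2)), rng.normal(0, 1, (200, 2))))
print(mmd2_unbiased(rng.normal(0, 1, (200, 2)), rng.normal(1, 1, (200, 2))))
```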

SLIDE 23

MMD as an integral probability metric

Are P and Q different?

[Figure: scatter plot of samples from P and Q]

SLIDE 25

MMD as an integral probability metric

Integral probability metric: find a "well behaved function" $f(x)$ to maximize $E_P f(X) - E_Q f(Y)$

[Figure: samples from P and Q, with a smooth function f(x) separating them]

SLIDE 27

MMD as an integral probability metric

Maximum mean discrepancy: smooth function for P vs Q

$\mathrm{MMD}(P, Q; F) := \sup_{\|f\| \leq 1} \left[ E_P f(X) - E_Q f(Y) \right]$   ($F$ = unit ball in RKHS $\mathcal{F}$)

[Figure: witness f for Gauss and Laplace densities]

Functions are linear combinations of features, so expectations of functions are linear combinations of expected features:

$E_P f(X) = \langle f, E_P \varphi(X) \rangle_{\mathcal{F}} = \langle f, \mu_P \rangle_{\mathcal{F}}$   (always true if the kernel is bounded)

For a characteristic RKHS $\mathcal{F}$: $\mathrm{MMD}(P, Q; F) = 0$ iff $P = Q$.

Other choices for the witness function class: bounded continuous [Dudley, 2002]; bounded variation 1 (Kolmogorov metric) [Müller, 1997]; bounded Lipschitz (Wasserstein distances) [Dudley, 2002]

SLIDE 32

Integral prob. metric vs feature difference

The MMD:

$\mathrm{MMD}(P, Q; F) = \sup_{f \in F} \left[ E_P f(X) - E_Q f(Y) \right]$
$= \sup_{f \in F} \langle f, \mu_P - \mu_Q \rangle_{\mathcal{F}}$   (use $E_P f(X) = \langle \mu_P, f \rangle_{\mathcal{F}}$)
$= \|\mu_P - \mu_Q\|_{\mathcal{F}}$

[Figure: the optimal witness $f^*$ is the unit vector in the direction $\mu_P - \mu_Q$]

Function view and feature view are equivalent.

SLIDE 38

Construction of MMD witness

Construction of the empirical witness function (proof: next slide!)

Observe $X = \{x_1, \ldots, x_n\} \sim P$; observe $Y = \{y_1, \ldots, y_n\} \sim Q$

[Figure: the empirical witness, evaluated at a point v]

SLIDE 42

Derivation of empirical witness function

Recall the witness function expression: $f^* \propto \mu_P - \mu_Q$

The empirical feature mean for P: $\hat{\mu}_P := \frac{1}{n} \sum_{i=1}^n \varphi(x_i)$

The empirical witness function at v:

$f^*(v) = \langle f^*, \varphi(v) \rangle_{\mathcal{F}} \propto \langle \hat{\mu}_P - \hat{\mu}_Q, \varphi(v) \rangle_{\mathcal{F}} = \frac{1}{n} \sum_{i=1}^n k(x_i, v) - \frac{1}{n} \sum_{i=1}^n k(y_i, v)$

Don't need explicit feature coefficients $f^* := [f^*_1, f^*_2, \ldots]$
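A minimal NumPy sketch of this witness (helper names are ours; the Laplace scale is chosen so both samples have unit variance, matching the Gauss-vs-Laplace example):

```python
import numpy as np

def gaussian_kernel(X, Y, gamma=1.0):
    sq = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def witness(V, X, Y, gamma=1.0):
    """Empirical witness at each row of V:
    f(v) = (1/n) sum_i k(x_i, v) - (1/n) sum_i k(y_i, v)."""
    return gaussian_kernel(X, V, gamma).mean(0) - gaussian_kernel(Y, V, gamma).mean(0)

# Evaluate on a 1-D grid for Gaussian vs Laplace samples (matched moments).
rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, (500, 1))
Y = rng.laplace(0.0, 1.0 / np.sqrt(2.0), (500, 1))  # variance 1 as well
grid = np.linspace(-4, 4, 9)[:, None]
print(witness(grid, X, Y))  # positive where P has more mass than Q
```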

SLIDE 47

Interlude: divergence measures

Divergences

[Figure: a map of divergence families, built up over several slides]

Sriperumbudur, Fukumizu, G., Schoelkopf, Lanckriet (2012)

SLIDE 53

Two-Sample Testing with MMD

SLIDE 54

A statistical test using MMD

The empirical MMD:

$\widehat{\mathrm{MMD}}^2 = \frac{1}{n(n-1)} \sum_{i \neq j} k(x_i, x_j) + \frac{1}{n(n-1)} \sum_{i \neq j} k(y_i, y_j) - \frac{2}{n^2} \sum_{i,j} k(x_i, y_j)$

How does this help decide whether P = Q? The perspective from statistical hypothesis testing:

  • Null hypothesis $H_0$ when $P = Q$: should see $\widehat{\mathrm{MMD}}^2$ "close to zero".
  • Alternative hypothesis $H_1$ when $P \neq Q$: should see $\widehat{\mathrm{MMD}}^2$ "far from zero".

Want a threshold $c_\alpha$ for $\widehat{\mathrm{MMD}}^2$ to get false positive rate $\alpha$.

SLIDE 57

Behaviour of $\widehat{\mathrm{MMD}}^2$ when P ≠ Q

Draw $n = 200$ i.i.d. samples from P and Q (Laplace distributions with different y-variance). One draw gives $\sqrt{n} \times \widehat{\mathrm{MMD}}^2 = 1.2$; a fresh draw of 200 samples gives $\sqrt{n} \times \widehat{\mathrm{MMD}}^2 = 1.5$.

[Figures: scatter plots of the samples from P and Q, and the statistic values for each draw]

SLIDE 60

Behaviour of $\widehat{\mathrm{MMD}}^2$ when P ≠ Q

Repeat this 150, 300, then 3000 times ...

[Figure: histogram of $\sqrt{n} \times \widehat{\mathrm{MMD}}^2$ over the repeated draws]

SLIDE 63

Asymptotics of $\widehat{\mathrm{MMD}}^2$ when P ≠ Q

When $P \neq Q$, the statistic is asymptotically normal:

$\frac{\widehat{\mathrm{MMD}}^2 - \mathrm{MMD}^2(P, Q)}{\sqrt{V_n(P, Q)}} \xrightarrow{D} \mathcal{N}(0, 1)$

where the variance $V_n(P, Q) = O\left( n^{-1} \right)$.

[Figure: empirical PDF of the statistic with a Gaussian fit, for two Laplace distributions with different variances]

SLIDE 64

Behaviour of $\widehat{\mathrm{MMD}}^2$ when P = Q

What happens when P and Q are the same?

SLIDE 65

Behaviour of $\widehat{\mathrm{MMD}}^2$ when P = Q

Case of $P = Q = \mathcal{N}(0, 1)$

[Figure: histogram of the statistic under the null, built up over repeated draws]

SLIDE 70

Asymptotics of $\widehat{\mathrm{MMD}}^2$ when P = Q

When $P = Q$, the statistic has asymptotic distribution

$n\, \widehat{\mathrm{MMD}}^2 \sim \sum_{l=1}^{\infty} \lambda_l \left( z_l^2 - 2 \right)$

where $\lambda_i \psi_i(x') = \int \underbrace{\tilde{k}(x, x')}_{\text{centred}}\, \psi_i(x)\, dP(x)$ and $z_l \sim \mathcal{N}(0, 2)$ i.i.d.

[Figure: the null distribution, an infinite weighted sum of chi-squared variables]
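One practical way to work with this null is to estimate the $\lambda_l$ from the spectrum of the centred Gram matrix of the pooled sample and draw the weighted sum directly — a hedged NumPy sketch (this spectral approximation is a standard companion to the result above; helper names are ours):

```python
import numpy as np

def gaussian_kernel(X, Y, gamma=1.0):
    sq = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def spectral_null_samples(Z, gamma=1.0, n_samples=2000, seed=0):
    """Approximate draws of n * MMD^2 under H0: sum_l lambda_l * (z_l^2 - 2),
    with lambda_l estimated from the centred Gram matrix of the pooled data."""
    rng = np.random.default_rng(seed)
    N = len(Z)
    K = gaussian_kernel(Z, Z, gamma)
    H = np.eye(N) - np.ones((N, N)) / N            # centring matrix
    lam = np.linalg.eigvalsh(H @ K @ H) / N        # eigenvalue estimates
    lam = lam[lam > 1e-12]
    z = rng.normal(0.0, np.sqrt(2.0), size=(n_samples, lam.size))  # z_l ~ N(0, 2)
    return (z**2 - 2.0) @ lam

null = spectral_null_samples(np.random.default_rng(1).normal(size=(200, 2)))
print(np.quantile(null, 0.95))  # an approximate test threshold for n * MMD^2
```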

SLIDE 71

A statistical test

A summary of the asymptotics:

[Figure: null and alternative distributions of the statistic, with the threshold between them]

Test construction: (G., Borgwardt, Rasch, Schoelkopf, and Smola, JMLR 2012)

SLIDE 73

How do we get test threshold $c_\alpha$?

Original empirical MMD for dogs and fish:

$\widehat{\mathrm{MMD}}^2 = \frac{1}{n(n-1)} \sum_{i \neq j} k(x_i, x_j) + \frac{1}{n(n-1)} \sum_{i \neq j} k(y_i, y_j) - \frac{2}{n^2} \sum_{i,j} k(x_i, y_j)$

[Figure: kernel matrices $k(x_i, x_j)$, $k(y_i, y_j)$, $k(x_i, y_j)$]

SLIDE 74

How do we get test threshold $c_\alpha$?

Permuted dog and fish samples (merdogs):

$\widehat{\mathrm{MMD}}^2 = \frac{1}{n(n-1)} \sum_{i \neq j} k(\tilde{x}_i, \tilde{x}_j) + \frac{1}{n(n-1)} \sum_{i \neq j} k(\tilde{y}_i, \tilde{y}_j) - \frac{2}{n^2} \sum_{i,j} k(\tilde{x}_i, \tilde{y}_j)$

Permutation simulates P = Q.
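A minimal NumPy sketch of the permutation threshold (our own function names; the kernel and unbiased estimator are as in the earlier sketches):

```python
import numpy as np

def gaussian_kernel(X, Y, gamma=1.0):
    sq = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def mmd2_unbiased(X, Y, gamma=1.0):
    n, m = len(X), len(Y)
    Kxx, Kyy = gaussian_kernel(X, X, gamma), gaussian_kernel(Y, Y, gamma)
    Kxy = gaussian_kernel(X, Y, gamma)
    return ((Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
            + (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1)) - 2 * Kxy.mean())

def mmd_permutation_test(X, Y, gamma=1.0, n_perm=500, alpha=0.05, seed=0):
    """Reject H0: P = Q if the observed MMD^2 exceeds the (1 - alpha) quantile
    of the statistic computed on permuted ("merdog") splits of the pooled sample."""
    rng = np.random.default_rng(seed)
    Z, n = np.vstack([X, Y]), len(X)
    null = np.empty(n_perm)
    for b in range(n_perm):
        idx = rng.permutation(len(Z))
        null[b] = mmd2_unbiased(Z[idx[:n]], Z[idx[n:]], gamma)
    c_alpha = np.quantile(null, 1.0 - alpha)
    stat = mmd2_unbiased(X, Y, gamma)
    return stat, c_alpha, stat > c_alpha
```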

SLIDE 76

How to choose the best kernel: optimising the kernel parameters

SLIDE 77

Graphical illustration

Maximising test power is the same as minimising false negatives.

[Figure: null and alternative distributions with the threshold $c_\alpha$]

SLIDE 78

Optimizing kernel for test power

The power of our test ($\Pr_1$ denotes probability under $P \neq Q$):

$\Pr_1\left( n\, \widehat{\mathrm{MMD}}^2 > \hat{c}_\alpha \right) \to \Phi\left( \underbrace{\frac{\mathrm{MMD}^2(P, Q)}{\sqrt{V_n(P, Q)}}}_{O(n^{1/2})} - \underbrace{\frac{c_\alpha}{n \sqrt{V_n(P, Q)}}}_{O(n^{-1/2})} \right)$

where $\Phi$ is the CDF of the standard normal distribution and $\hat{c}_\alpha$ is an estimate of the test threshold $c_\alpha$.

The variance under $H_1$ decreases as $V_n(P, Q) = O(n^{-1})$, so for large n the second term is negligible!

To maximize test power, maximize $\frac{\mathrm{MMD}^2(P, Q)}{\sqrt{V_n(P, Q)}}$

(Sutherland, Tung, Strathmann, De, Ramdas, Smola, G., ICLR 2017)

Code: github.com/dougalsutherland/opt-mmd
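A rough sketch of kernel selection by this criterion (ours, not the opt-mmd code: we stand in for $V_n$ with a simple bootstrap variance where the paper uses a closed-form estimator, and in practice the kernel should be chosen on held-out data, not the test set):

```python
import numpy as np

def gaussian_kernel(X, Y, gamma=1.0):
    sq = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def mmd2_unbiased(X, Y, gamma=1.0):
    n, m = len(X), len(Y)
    Kxx, Kyy = gaussian_kernel(X, X, gamma), gaussian_kernel(Y, Y, gamma)
    Kxy = gaussian_kernel(X, Y, gamma)
    return ((Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
            + (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1)) - 2 * Kxy.mean())

def power_proxy(X, Y, gamma, n_boot=100, eps=1e-8, seed=0):
    """Estimate MMD^2 / sqrt(V_n) with a bootstrap stand-in for the variance."""
    rng = np.random.default_rng(seed)
    n, m = len(X), len(Y)
    boots = [mmd2_unbiased(X[rng.integers(0, n, n)], Y[rng.integers(0, m, m)], gamma)
             for _ in range(n_boot)]
    return mmd2_unbiased(X, Y, gamma) / (np.std(boots) + eps)

# Pick the bandwidth that maximises the power proxy on training data.
rng = np.random.default_rng(0)
X_train, Y_train = rng.normal(0, 1, (200, 2)), rng.laplace(0, 1, (200, 2))
best_gamma = max([0.01, 0.1, 1.0, 10.0], key=lambda g: power_proxy(X_train, Y_train, g))
```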

SLIDE 82

Troubleshooting for generative adversarial networks

MNIST samples vs. samples from a GAN; ARD map.

Power for the optimised ARD kernel: 1.00 at $\alpha = 0.01$. Power for the optimised RBF kernel: 0.57 at $\alpha = 0.01$.

SLIDE 84

Troubleshooting generative adversarial networks


SLIDE 85

Training GANs with MMD

SLIDE 86

What is a Generative Adversarial Network (GAN)?

[Figure: GAN architecture, shown as a build over several slides]

SLIDE 90

Why is classification not enough?

SLIDE 91

MMD for GAN critic

Can you use MMD as a critic to train GANs? (From ICML 2015 and from UAI 2015.)

Need better image features.

SLIDE 93

How to improve the critic witness

Add convolutional features! The critic (teacher) also needs to be trained. How to regularise?

MMD GAN, Li et al. [NIPS 2017]; Coulomb GAN, Unterthiner et al. [ICLR 2018]

SLIDE 94

WGAN-GP

Wasserstein GAN, Arjovsky et al. [ICML 2017]; WGAN-GP, Gulrajani et al. [NIPS 2017]

Given a generator $G_\theta$ with parameters $\theta$ to be trained; samples $Y \sim G_\theta(Z)$ where $Z \sim R$. Given critic features $h_\psi$ with parameters $\psi$ to be trained; $f_\psi$ is a linear function of $h_\psi$.

WGAN-GP critic objective with gradient penalty:

$\max_\psi\; E_{X \sim P} f_\psi(X) - E_{Z \sim R} f_\psi(G_\theta(Z)) - \lambda\, E_{\tilde{X}} \left( \left\| \nabla_{\tilde{X}} f_\psi(\tilde{X}) \right\| - 1 \right)^2$

where $\tilde{X} = \gamma x_i + (1 - \gamma) G_\theta(z_j)$, $\gamma \sim U([0, 1])$, $x_i \in \{x_\ell\}_{\ell=1}^m$, $z_j \in \{z_\ell\}_{\ell=1}^n$
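A minimal PyTorch sketch of the WGAN-GP penalty (ours, not the talk's code): interpolate between real and generated batches and push the critic's gradient norm at the interpolates towards 1.

```python
import torch

def wgan_gp_penalty(critic, x_real, x_fake, lam=10.0):
    """lam * E[ (||grad_x critic(x_tilde)|| - 1)^2 ] at random interpolates."""
    gamma = torch.rand(x_real.size(0), 1)                       # gamma ~ U[0, 1]
    x_tilde = (gamma * x_real + (1 - gamma) * x_fake).requires_grad_(True)
    grad, = torch.autograd.grad(critic(x_tilde).sum(), x_tilde, create_graph=True)
    return lam * ((grad.norm(2, dim=1) - 1.0) ** 2).mean()

# Toy usage: a linear critic on 2-D points; maximise E_P f - E_Q f minus the penalty.
critic = torch.nn.Linear(2, 1)
x_real, x_fake = torch.randn(64, 2), torch.randn(64, 2) + 1.0
loss = -(critic(x_real).mean() - critic(x_fake).mean()) \
       + wgan_gp_penalty(critic, x_real, x_fake)
loss.backward()  # gradients flow to the critic parameters through the penalty
```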

SLIDE 97

The (W)MMD

Train the MMD critic features with the witness function gradient penalty. Binkowski, Sutherland, Arbel, G. [ICLR 2018]; Bellemare et al. [2017] for the energy distance:

$\max_\psi\; \mathrm{MMD}^2\left( h_\psi(X), h_\psi(G_\theta(Z)) \right) - \lambda\, E_{\tilde{X}} \left( \left\| \nabla_{\tilde{X}} f_\psi(\tilde{X}) \right\| - 1 \right)^2$

where $\tilde{X} = \gamma x_i + (1 - \gamma) G_\theta(z_j)$, $\gamma \sim U([0, 1])$, $x_i \in \{x_\ell\}_{\ell=1}^m$, $z_j \in \{z_\ell\}_{\ell=1}^n$

Remark by Bottou et al. (2017): the gradient penalty modifies the function class, so the critic is no longer an MMD in the RKHS $\mathcal{F}$.
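A sketch of the corresponding MMD critic loss with the witness gradient penalty (our PyTorch approximation with a Gaussian kernel on the critic features h; not the MMD GAN reference code):

```python
import torch

def gauss_k(A, B, gamma=1.0):
    return torch.exp(-gamma * torch.cdist(A, B) ** 2)

def mmd2(hx, hy, gamma=1.0):
    """Unbiased MMD^2 between feature batches hx and hy."""
    n, m = hx.size(0), hy.size(0)
    Kxx, Kyy, Kxy = gauss_k(hx, hx, gamma), gauss_k(hy, hy, gamma), gauss_k(hx, hy, gamma)
    return ((Kxx.sum() - Kxx.diagonal().sum()) / (n * (n - 1))
            + (Kyy.sum() - Kyy.diagonal().sum()) / (m * (m - 1)) - 2 * Kxy.mean())

def witness_gp(h, x, y, gamma=1.0, lam=1.0):
    """Penalise the gradient of the empirical MMD witness
    f(v) = mean_i k(h(x_i), h(v)) - mean_j k(h(y_j), h(v)) at interpolates v."""
    t = torch.rand(x.size(0), 1)
    v = (t * x + (1 - t) * y).requires_grad_(True)
    f_v = gauss_k(h(x), h(v), gamma).mean(0) - gauss_k(h(y), h(v), gamma).mean(0)
    grad, = torch.autograd.grad(f_v.sum(), v, create_graph=True)
    return lam * ((grad.norm(2, dim=1) - 1.0) ** 2).mean()

# Critic step: ascend MMD^2 on critic features h_psi, minus the penalty.
h = torch.nn.Sequential(torch.nn.Linear(2, 16), torch.nn.ReLU(), torch.nn.Linear(16, 8))
x, y = torch.randn(64, 2), torch.randn(64, 2) + 1.0
critic_objective = mmd2(h(x), h(y)) - witness_gp(h, x, y)
(-critic_objective).backward()
```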

SLIDE 98

MMD for GAN critic: revisited

From ICLR 2018: samples are better!

Can we do better still?

SLIDE 101

Convergence issues for WGAN-GP penalty

A WGAN-GP style gradient penalty may not converge near the solution.

Nagarajan and Kolter [NIPS 2017], Mescheder et al. [ICML 2018], Balduzzi et al. [ICML 2018]

The Dirac-GAN: $P = \delta_0$, $Q = \delta_\theta$, $f_\psi(x) = \psi \cdot x$

[Figure from Mescheder et al., ICML 2018]

SLIDE 103

A better gradient penalty

New MMD GAN witness regulariser: Arbel, Sutherland, Binkowski, G. [NIPS 2018]

Based on a semi-supervised learning regulariser, Bousquet et al. [NIPS 2004]; related to Sobolev GAN, Mroueh et al. [ICLR 2018].

Modified witness function: [formula shown on slide]

Problem: not computationally feasible, $O(n^3)$ per iteration.

The scaled MMD: $\mathrm{SMMD} = \sigma_{k,P,\lambda}\, \mathrm{MMD}$, where

$\sigma_{k,P,\lambda} = \left( \lambda + \int k(x, x)\, dP(x) + \sum_{i=1}^d \int \partial_i \partial_{i+d} k(x, x)\, dP(x) \right)^{-1/2}$

Replace the expensive constraint with a cheap upper bound: $\|f\|_S^2 \leq \sigma_{k,P,\lambda}^{-2}\, \|f\|_k^2$

Idea: rather than regularise the critic or witness function, regularise the features directly.

SLIDE 109

Evaluation and experiments

SLIDE 110

Evaluation of GANs

The inception score? Salimans et al. [NIPS 2016]

Based on the classification output $p(y|x)$ of the inception model, Szegedy et al. [ICLR 2014]:

$E_X \exp\left( \mathrm{KL}\left( P(y|X)\, \|\, P(y) \right) \right)$

High when: the predictive label distribution $P(y|x)$ has low entropy (good quality images), and the label entropy $P(y)$ is high (good variety).

Problem: relies on a trained classifier! Can't be used on new categories (celeb, bedroom, ...).
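Given a matrix of classifier outputs $p(y|x)$, the score is a few lines — a minimal NumPy sketch (the softmax outputs themselves would come from the inception model; names are ours):

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """probs: (n_images, n_classes) array of rows p(y|x).
    Returns exp( E_X KL( p(y|X) || p(y) ) )."""
    p_y = probs.mean(axis=0, keepdims=True)          # marginal label distribution
    kl = (probs * (np.log(probs + eps) - np.log(p_y + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))

# Confident, varied predictions score high; uniform predictions score 1.
confident = np.eye(10)[np.arange(100) % 10]          # one-hot, all classes used
print(inception_score(confident))                    # ~10 (number of classes)
print(inception_score(np.full((100, 10), 0.1)))      # ~1
```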

SLIDE 112

Evaluation of GANs

The Frechet inception distance? Heusel et al. [NIPS 2017]

Fits Gaussians to features in the inception architecture (pool3 layer):

$\mathrm{FID}(P, Q) = \|\mu_P - \mu_Q\|^2 + \mathrm{tr}(\Sigma_P) + \mathrm{tr}(\Sigma_Q) - 2\, \mathrm{tr}\left( (\Sigma_P \Sigma_Q)^{1/2} \right)$

where $\mu_P$ and $\Sigma_P$ are the feature mean and covariance of P.

Problem: bias. For finite samples it can consistently give the incorrect answer.

[Figure: bias demo, FID vs. sample size n, CIFAR-10 train vs. test]
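The FID formula translates directly into code — a minimal NumPy/SciPy sketch of the Frechet distance between two Gaussian fits (helper names are ours):

```python
import numpy as np
from scipy.linalg import sqrtm

def fid_from_moments(mu1, S1, mu2, S2):
    """||mu1 - mu2||^2 + tr(S1) + tr(S2) - 2 tr((S1 S2)^(1/2))."""
    covmean = sqrtm(S1 @ S2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real       # discard numerical imaginary round-off
    return float(((mu1 - mu2) ** 2).sum() + np.trace(S1) + np.trace(S2)
                 - 2.0 * np.trace(covmean))

def fid(X, Y):
    """Plug-in FID between two sets of (pool3) feature vectors, rows = images.
    The plug-in moments are exactly where the finite-sample bias enters."""
    return fid_from_moments(X.mean(0), np.cov(X, rowvar=False),
                            Y.mean(0), np.cov(Y, rowvar=False))
```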

SLIDE 114

Evaluation of GANs

The FID can give the wrong answer in theory. Assume m samples from P and $n \to \infty$ samples from Q. Given two alternatives:

$P_1 = \mathcal{N}\left( 0, \left( 1 - \tfrac{1}{m} \right)^2 \right)$, $P_2 = \mathcal{N}(0, 1)$, $Q = \mathcal{N}(0, 1)$.

Clearly, $\mathrm{FID}(P_1, Q) = \tfrac{1}{m^2} > \mathrm{FID}(P_2, Q) = 0$. Yet given m samples from each of $P_1$ and $P_2$, $\mathrm{FID}(\hat{P}_1, Q) < \mathrm{FID}(\hat{P}_2, Q)$.

SLIDE 118

Evaluation of GANs

The FID can give the wrong answer in practice. Let $d = 2048$, and define

$P_1 = \mathrm{relu}(\mathcal{N}(0, I_d))$, $P_2 = \mathrm{relu}(\mathcal{N}(1, 0.8\Sigma + 0.2 I_d))$, $Q = \mathrm{relu}(\mathcal{N}(1, I_d))$,

where $\Sigma = \tfrac{4}{d} C C^T$, with C a $d \times d$ matrix with i.i.d. standard normal entries. For a random draw of C:

$\mathrm{FID}(P_1, Q) \approx 1123.0 > 1114.8 \approx \mathrm{FID}(P_2, Q)$

With $m = 50\,000$ samples, $\mathrm{FID}(\hat{P}_1, Q) \approx 1133.7 < 1136.2 \approx \mathrm{FID}(\hat{P}_2, Q)$.

At $m = 100\,000$ samples the ordering of the estimates is correct. This behaviour is similar for other random draws of C.

SLIDE 122

The kernel inception distance (KID)

Binkowski, Sutherland, Arbel, G. [ICLR 2018]

Measures similarity of the samples' representations in the inception architecture (pool3 layer): the MMD with kernel

$k(x, y) = \left( \tfrac{1}{d}\, x^\top y + 1 \right)^3$

Checks the match of feature means, variances, and skewness. Unbiased: e.g. CIFAR-10 train vs. test. [Figure: KID vs. n, centred on zero]

... "but isn't KID computationally costly?" A "block" KID implementation is cheaper than FID: see the paper (or use the Tensorflow implementation)!

Also used for automatic learning rate adjustment: if $\mathrm{KID}(\hat{P}_{t+1}, Q)$ is not significantly better than $\mathrm{KID}(\hat{P}_t, Q)$, reduce the learning rate. [Bounliphone et al., ICLR 2016]

Related: "An empirical study on evaluation metrics of generative adversarial networks", Xu et al. [arXiv, June 2018]
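A minimal NumPy sketch of the KID estimator (ours): the unbiased MMD² with the cubic polynomial kernel on inception features; averaging it over disjoint blocks gives the cheap "block" variant mentioned above.

```python
import numpy as np

def kid(X, Y):
    """Unbiased MMD^2 with k(x, y) = ((1/d) x^T y + 1)^3; rows are feature vectors."""
    d = X.shape[1]
    Kxx = ((X @ X.T) / d + 1.0) ** 3
    Kyy = ((Y @ Y.T) / d + 1.0) ** 3
    Kxy = ((X @ Y.T) / d + 1.0) ** 3
    n, m = len(X), len(Y)
    return ((Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
            + (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1)) - 2.0 * Kxy.mean())

def block_kid(X, Y, block=1000, seed=0):
    """Average the unbiased estimate over disjoint blocks: still unbiased,
    and O(B * block^2) instead of O(n^2)."""
    rng = np.random.default_rng(seed)
    xi, yi = rng.permutation(len(X)), rng.permutation(len(Y))
    n_blocks = min(len(X), len(Y)) // block
    return float(np.mean([kid(X[xi[b*block:(b+1)*block]], Y[yi[b*block:(b+1)*block]])
                          for b in range(n_blocks)]))
```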

SLIDE 126

Benchmarks for comparison (all from ICLR 2018)

SLIDE 127

Results: what does MMD buy you?

Critic features from DCGAN: an f-filter critic has f, 2f, 4f, and 8f convolutional filters in layers 1-4. LSUN 64 × 64.

MMD GAN samples, f = 64, KID = 3; WGAN samples, f = 64, KID = 4

SLIDE 128

Results: what does MMD buy you?

MMD GAN samples, f = 16, KID = 9; WGAN samples, f = 16, KID = 37

SLIDE 129

The kernel inception distance (KID)

Faster training: performance scores vs. generator iterations on MNIST

SLIDE 130

Results: celebrity faces 160 × 160

KID scores: Sobolev GAN: 14; SN-GAN: 18; old MMD GAN: 13; SMMD GAN: 6

202 599 face images, resized and cropped to 160 × 160

SLIDE 131

Results: ImageNet 64 × 64

KID (FID) scores: BGAN: 47; SN-GAN: 44; SMMD GAN: 35

ILSVRC2012 (ImageNet) dataset: 1 281 167 images over 1000 classes, resized to 64 × 64.

SLIDE 134

Summary

MMD critic gives state-of-the-art performance for GAN training (FID and KID):

  • use convolutional input features
  • train with the new gradient regulariser

Faster training, simpler critic network. Reasons for good performance:

  • Unlike WGAN-GP, the MMD loss is still a valid critic when features are not optimal
  • Kernel features do some of the "work", so simpler $h_\psi$ features are possible
  • A better gradient/feature regulariser gives a better critic

Code for "Demystifying MMD GANs," ICLR 2018, including the KID score: https://github.com/mbinkowski/MMD-GAN
Code for the new SMMD: https://github.com/MichaelArbel/Scaled-MMD-GAN

SLIDE 135

Co-authors

From Gatsby: Mikolaj Binkowski, Kacper Chwialkowski, Wittawat Jitkrittum, Heiko Strathmann, Dougal Sutherland, Wenkai Xu

External collaborators: Kenji Fukumizu, Bernhard Schoelkopf, Dino Sejdinovic, Bharath Sriperumbudur, Alex Smola, Zoltan Szabo

SLIDE 136

Questions?