

SLIDE 1

The Maximum Mean Discrepancy for Training Generative Adversarial Networks

Arthur Gretton, Gatsby Computational Neuroscience Unit, University College London. Cardiff, 2018

SLIDE 2

A motivation: comparing two samples

Given: Samples from unknown distributions P and Q. Goal: do P and Q differ?


SLIDE 3

A real-life example: two-sample tests

Have: two collections of samples X, Y from unknown distributions P and Q. Goal: do P and Q differ? MNIST samples vs. samples from a GAN.

Is there a significant difference between the GAN and MNIST samples?

  • T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, Xi Chen, NIPS 2016

  • Sutherland, Tung, Strathmann, De, Ramdas, Smola, G., ICLR 2017

SLIDE 4

Training generative models

SLIDE 5

Training generative models

Have: one collection of samples X from an unknown distribution P. Goal: generate samples Q that look like P. LSUN bedroom samples (P) vs. generated Q from an MMD GAN.

Using MMD to train a GAN

(Binkowski, Sutherland, Arbel, G., ICLR 2018), (Arbel, Sutherland, Binkowski, G., arXiv 2018)


SLIDE 6

Testing goodness of fit

Given: a model P and samples from Q. Goal: is P a good fit for Q? Chicago crime data; the model is a Gaussian mixture with two components.

SLIDE 7

Testing independence

Given: samples from a joint distribution $P_{XY}$. Goal: are X and Y independent?

Their noses guide them through life, and they're never happier than when following an interesting scent. A large animal who slings slobber, exudes a distinctive houndy odor, and wants nothing more than to follow his nose.

Text from dogtime.com and petfinder.com

A responsive, interactive pet, one that will blow in your ear and follow you everywhere.


SLIDE 8

Outline

Maximum Mean Discrepancy (MMD)...

  • ...as a difference in feature means
  • ...as an integral probability metric (not just a technicality!)

A statistical test based on the MMD

Training generative adversarial networks with MMD

  • Gradient regularisation and data adaptivity
  • Evaluating GAN performance? Problems with Inception and FID.


SLIDE 9

Maximum Mean Discrepancy

SLIDE 10

Feature mean difference

Simple example: two Gaussians with different means. Answer: a t-test.

[Figure: probability densities $P_X$ and $Q_X$, two Gaussians with different means]

SLIDE 11

Feature mean difference

Two Gaussians with the same means but different variance. Idea: look at the difference in means of features of the random variables. In the Gaussian case: second-order features of the form $\varphi(x) = x^2$.

[Figures: densities $P_X$ and $Q_X$, two Gaussians with different variances; the densities of the feature $x^2$ then differ in mean]

SLIDE 13

Feature mean difference

Gaussian and Laplace distributions: same mean and same variance. Difference in means using higher-order features... RKHS.

[Figure: Gaussian and Laplace densities $P_X$ and $Q_X$]

SLIDE 14

Infinitely many features using kernels

Kernels: dot products of features.

Feature map $\varphi(x) \in \mathcal{F}$, $\varphi(x) = [\ldots\ \varphi_i(x)\ \ldots] \in \ell^2$. For positive definite $k$, $k(x, x') = \langle \varphi(x), \varphi(x') \rangle_{\mathcal{F}}$. Infinitely many features $\varphi(x)$, dot product in closed form!

Exponentiated quadratic kernel: $k(x, x') = \exp\left( -\gamma \|x - x'\|^2 \right)$

Features: Gaussian Processes for Machine Learning, Rasmussen and Williams, Ch. 4.
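To make the closed-form dot product concrete, here is a minimal NumPy sketch (ours, not from the talk; the function name and the bandwidth parameter gamma are our own choices):

```python
import numpy as np

def gaussian_kernel(X, Y, gamma=1.0):
    """Exponentiated quadratic kernel matrix K[i, j] = exp(-gamma * ||X[i] - Y[j]||^2).

    The dot product of infinitely many features, computed in closed form.
    """
    sq = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq, 0.0))  # clip tiny negative round-off

# The kernel matrix of a sample with itself is symmetric positive semi-definite.
X = np.random.randn(5, 2)
K = gaussian_kernel(X, X)
assert np.allclose(K, K.T) and np.linalg.eigvalsh(K).min() > -1e-9
```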

SLIDE 16

Infinitely many features of distributions

Given $P$ a Borel probability measure on $\mathcal{X}$, define the feature map of the probability $P$: $\mu_P = [\ldots\ E_P[\varphi_i(X)]\ \ldots]$

For positive definite $k(x, x')$: $\langle \mu_P, \mu_Q \rangle_{\mathcal{F}} = E_{P,Q}\, k(X, Y)$ for $X \sim P$ and $Y \sim Q$.

Fine print: the feature map $\varphi(x)$ must be Bochner integrable for all probability measures considered. Always true if the kernel is bounded.

SLIDE 18

The maximum mean discrepancy

The maximum mean discrepancy is the distance between feature means:

$\mathrm{MMD}^2(P, Q) = \|\mu_P - \mu_Q\|_{\mathcal{F}}^2$
$= \langle \mu_P, \mu_P \rangle_{\mathcal{F}} + \langle \mu_Q, \mu_Q \rangle_{\mathcal{F}} - 2 \langle \mu_P, \mu_Q \rangle_{\mathcal{F}}$
$= \underbrace{E_P\, k(X, X')}_{(a)} + \underbrace{E_Q\, k(Y, Y')}_{(a)} - 2\, \underbrace{E_{P,Q}\, k(X, Y)}_{(b)}$

(a) = within-distribution similarity, (b) = cross-distribution similarity.

SLIDE 21

Illustration of MMD

Dogs ($= P$) and fish ($= Q$) example revisited. Each entry of the kernel matrix is one of $k(\mathrm{dog}_i, \mathrm{dog}_j)$, $k(\mathrm{dog}_i, \mathrm{fish}_j)$, or $k(\mathrm{fish}_i, \mathrm{fish}_j)$.

SLIDE 22

Illustration of MMD

The empirical maximum mean discrepancy:

$\widehat{\mathrm{MMD}}^2 = \frac{1}{n(n-1)} \sum_{i \neq j} k(\mathrm{dog}_i, \mathrm{dog}_j) + \frac{1}{n(n-1)} \sum_{i \neq j} k(\mathrm{fish}_i, \mathrm{fish}_j) - \frac{2}{n^2} \sum_{i,j} k(\mathrm{dog}_i, \mathrm{fish}_j)$
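The same estimator in code — a minimal NumPy sketch (our own naming; it allows unequal sample sizes, with the cross term averaged over n·m pairs):

```python
import numpy as np

def gaussian_kernel(X, Y, gamma=1.0):
    sq = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def mmd2_unbiased(X, Y, gamma=1.0):
    """Unbiased estimate of MMD^2: the i = j terms are excluded from the
    within-sample sums, exactly as in the formula on the slide."""
    n, m = len(X), len(Y)
    Kxx = gaussian_kernel(X, X, gamma)
    Kyy = gaussian_kernel(Y, Y, gamma)
    Kxy = gaussian_kernel(X, Y, gamma)
    return ((Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
            + (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
            - 2 * Kxy.mean())

# Same distribution: estimate near 0; different means: estimate positive.
rng = np.random.default_rng(0)
print(mmd2_unbiased(rng.normal(0, 1, (200, 2)), rng.normal(0, 1, (200, 2))))
print(mmd2_unbiased(rng.normal(0, 1, (200, 2)), rng.normal(1, 1, (200, 2))))
```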

SLIDE 23

MMD as an integral probability metric

Are P and Q different?

[Figure: scatter plot of samples from P and Q]

SLIDE 25

MMD as an integral probability metric

Integral probability metric: find a "well behaved function" $f(x)$ to maximize $E_P f(X) - E_Q f(Y)$

[Figure: samples from P and Q, with a smooth function f(x) separating them]

SLIDE 27

MMD as an integral probability metric

Maximum mean discrepancy: smooth function for P vs Q

$\mathrm{MMD}(P, Q; F) := \sup_{\|f\| \leq 1} \left[ E_P f(X) - E_Q f(Y) \right]$   ($F$ = unit ball in RKHS $\mathcal{F}$)

[Figure: witness f for Gauss and Laplace densities]

Functions are linear combinations of features, so expectations of functions are linear combinations of expected features:

$E_P f(X) = \langle f, E_P \varphi(X) \rangle_{\mathcal{F}} = \langle f, \mu_P \rangle_{\mathcal{F}}$   (always true if the kernel is bounded)

For a characteristic RKHS $\mathcal{F}$: $\mathrm{MMD}(P, Q; F) = 0$ iff $P = Q$.

Other choices for the witness function class: bounded continuous [Dudley, 2002]; bounded variation 1 (Kolmogorov metric) [Müller, 1997]; bounded Lipschitz (Wasserstein distances) [Dudley, 2002]

SLIDE 32

Integral prob. metric vs feature difference

The MMD:

$\mathrm{MMD}(P, Q; F) = \sup_{f \in F} \left[ E_P f(X) - E_Q f(Y) \right]$
$= \sup_{f \in F} \langle f, \mu_P - \mu_Q \rangle_{\mathcal{F}}$   (use $E_P f(X) = \langle \mu_P, f \rangle_{\mathcal{F}}$)
$= \|\mu_P - \mu_Q\|_{\mathcal{F}}$

[Figure: the optimal witness $f^*$ is the unit vector in the direction $\mu_P - \mu_Q$]

Function view and feature view are equivalent.

SLIDE 38

Construction of MMD witness

Construction of the empirical witness function (proof: next slide!)

Observe $X = \{x_1, \ldots, x_n\} \sim P$; observe $Y = \{y_1, \ldots, y_n\} \sim Q$

[Figure: the empirical witness, evaluated at a point v]

SLIDE 42

Derivation of empirical witness function

Recall the witness function expression: $f^* \propto \mu_P - \mu_Q$

The empirical feature mean for P: $\hat{\mu}_P := \frac{1}{n} \sum_{i=1}^n \varphi(x_i)$

The empirical witness function at v:

$f^*(v) = \langle f^*, \varphi(v) \rangle_{\mathcal{F}} \propto \langle \hat{\mu}_P - \hat{\mu}_Q, \varphi(v) \rangle_{\mathcal{F}} = \frac{1}{n} \sum_{i=1}^n k(x_i, v) - \frac{1}{n} \sum_{i=1}^n k(y_i, v)$

Don't need explicit feature coefficients $f^* := [f^*_1, f^*_2, \ldots]$
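A minimal NumPy sketch of this witness (helper names are ours; the Laplace scale is chosen so both samples have unit variance, matching the Gauss-vs-Laplace example):

```python
import numpy as np

def gaussian_kernel(X, Y, gamma=1.0):
    sq = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def witness(V, X, Y, gamma=1.0):
    """Empirical witness at each row of V:
    f(v) = (1/n) sum_i k(x_i, v) - (1/n) sum_i k(y_i, v)."""
    return gaussian_kernel(X, V, gamma).mean(0) - gaussian_kernel(Y, V, gamma).mean(0)

# Evaluate on a 1-D grid for Gaussian vs Laplace samples (matched moments).
rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, (500, 1))
Y = rng.laplace(0.0, 1.0 / np.sqrt(2.0), (500, 1))  # variance 1 as well
grid = np.linspace(-4, 4, 9)[:, None]
print(witness(grid, X, Y))  # positive where P has more mass than Q
```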

SLIDE 47

Interlude: divergence measures

Divergences

[Figure: a map of divergence families, built up over several slides]

Sriperumbudur, Fukumizu, G., Schoelkopf, Lanckriet (2012)

SLIDE 53

Two-Sample Testing with MMD

SLIDE 54

A statistical test using MMD

The empirical MMD:

$\widehat{\mathrm{MMD}}^2 = \frac{1}{n(n-1)} \sum_{i \neq j} k(x_i, x_j) + \frac{1}{n(n-1)} \sum_{i \neq j} k(y_i, y_j) - \frac{2}{n^2} \sum_{i,j} k(x_i, y_j)$

How does this help decide whether P = Q? The perspective from statistical hypothesis testing:

  • Null hypothesis $H_0$ when $P = Q$: should see $\widehat{\mathrm{MMD}}^2$ "close to zero".
  • Alternative hypothesis $H_1$ when $P \neq Q$: should see $\widehat{\mathrm{MMD}}^2$ "far from zero".

Want a threshold $c_\alpha$ for $\widehat{\mathrm{MMD}}^2$ to get false positive rate $\alpha$.

SLIDE 57

Behaviour of $\widehat{\mathrm{MMD}}^2$ when P ≠ Q

Draw $n = 200$ i.i.d. samples from P and Q (Laplace distributions with different y-variance). One draw gives $\sqrt{n} \times \widehat{\mathrm{MMD}}^2 = 1.2$; a fresh draw of 200 samples gives $\sqrt{n} \times \widehat{\mathrm{MMD}}^2 = 1.5$.

[Figures: scatter plots of the samples from P and Q, and the statistic values for each draw]

SLIDE 60

Behaviour of $\widehat{\mathrm{MMD}}^2$ when P ≠ Q

Repeat this 150, 300, then 3000 times ...

[Figure: histogram of $\sqrt{n} \times \widehat{\mathrm{MMD}}^2$ over the repeated draws]

SLIDE 63

Asymptotics of $\widehat{\mathrm{MMD}}^2$ when P ≠ Q

When $P \neq Q$, the statistic is asymptotically normal:

$\frac{\widehat{\mathrm{MMD}}^2 - \mathrm{MMD}^2(P, Q)}{\sqrt{V_n(P, Q)}} \xrightarrow{D} \mathcal{N}(0, 1)$

where the variance $V_n(P, Q) = O\left( n^{-1} \right)$.

[Figure: empirical PDF of the statistic with a Gaussian fit, for two Laplace distributions with different variances]

SLIDE 64

Behaviour of $\widehat{\mathrm{MMD}}^2$ when P = Q

What happens when P and Q are the same?

SLIDE 65

Behaviour of $\widehat{\mathrm{MMD}}^2$ when P = Q

Case of $P = Q = \mathcal{N}(0, 1)$

[Figure: histogram of the statistic under the null, built up over repeated draws]

SLIDE 70

Asymptotics of $\widehat{\mathrm{MMD}}^2$ when P = Q

When $P = Q$, the statistic has asymptotic distribution

$n\, \widehat{\mathrm{MMD}}^2 \sim \sum_{l=1}^{\infty} \lambda_l \left( z_l^2 - 2 \right)$

where $\lambda_i \psi_i(x') = \int \underbrace{\tilde{k}(x, x')}_{\text{centred}}\, \psi_i(x)\, dP(x)$ and $z_l \sim \mathcal{N}(0, 2)$ i.i.d.

[Figure: the null distribution, an infinite weighted sum of chi-squared variables]
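One practical way to work with this null is to estimate the $\lambda_l$ from the spectrum of the centred Gram matrix of the pooled sample and draw the weighted sum directly — a hedged NumPy sketch (this spectral approximation is a standard companion to the result above; helper names are ours):

```python
import numpy as np

def gaussian_kernel(X, Y, gamma=1.0):
    sq = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def spectral_null_samples(Z, gamma=1.0, n_samples=2000, seed=0):
    """Approximate draws of n * MMD^2 under H0: sum_l lambda_l * (z_l^2 - 2),
    with lambda_l estimated from the centred Gram matrix of the pooled data."""
    rng = np.random.default_rng(seed)
    N = len(Z)
    K = gaussian_kernel(Z, Z, gamma)
    H = np.eye(N) - np.ones((N, N)) / N            # centring matrix
    lam = np.linalg.eigvalsh(H @ K @ H) / N        # eigenvalue estimates
    lam = lam[lam > 1e-12]
    z = rng.normal(0.0, np.sqrt(2.0), size=(n_samples, lam.size))  # z_l ~ N(0, 2)
    return (z**2 - 2.0) @ lam

null = spectral_null_samples(np.random.default_rng(1).normal(size=(200, 2)))
print(np.quantile(null, 0.95))  # an approximate test threshold for n * MMD^2
```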

SLIDE 71

A statistical test

A summary of the asymptotics:

[Figure: null and alternative distributions of the statistic, with the threshold between them]

Test construction: (G., Borgwardt, Rasch, Schoelkopf, and Smola, JMLR 2012)

SLIDE 73

How do we get test threshold $c_\alpha$?

Original empirical MMD for dogs and fish:

$\widehat{\mathrm{MMD}}^2 = \frac{1}{n(n-1)} \sum_{i \neq j} k(x_i, x_j) + \frac{1}{n(n-1)} \sum_{i \neq j} k(y_i, y_j) - \frac{2}{n^2} \sum_{i,j} k(x_i, y_j)$

[Figure: kernel matrices $k(x_i, x_j)$, $k(y_i, y_j)$, $k(x_i, y_j)$]

SLIDE 74

How do we get test threshold $c_\alpha$?

Permuted dog and fish samples (merdogs):

$\widehat{\mathrm{MMD}}^2 = \frac{1}{n(n-1)} \sum_{i \neq j} k(\tilde{x}_i, \tilde{x}_j) + \frac{1}{n(n-1)} \sum_{i \neq j} k(\tilde{y}_i, \tilde{y}_j) - \frac{2}{n^2} \sum_{i,j} k(\tilde{x}_i, \tilde{y}_j)$

Permutation simulates P = Q.
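A minimal NumPy sketch of the permutation threshold (our own function names; the kernel and unbiased estimator are as in the earlier sketches):

```python
import numpy as np

def gaussian_kernel(X, Y, gamma=1.0):
    sq = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def mmd2_unbiased(X, Y, gamma=1.0):
    n, m = len(X), len(Y)
    Kxx, Kyy = gaussian_kernel(X, X, gamma), gaussian_kernel(Y, Y, gamma)
    Kxy = gaussian_kernel(X, Y, gamma)
    return ((Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
            + (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1)) - 2 * Kxy.mean())

def mmd_permutation_test(X, Y, gamma=1.0, n_perm=500, alpha=0.05, seed=0):
    """Reject H0: P = Q if the observed MMD^2 exceeds the (1 - alpha) quantile
    of the statistic computed on permuted ("merdog") splits of the pooled sample."""
    rng = np.random.default_rng(seed)
    Z, n = np.vstack([X, Y]), len(X)
    null = np.empty(n_perm)
    for b in range(n_perm):
        idx = rng.permutation(len(Z))
        null[b] = mmd2_unbiased(Z[idx[:n]], Z[idx[n:]], gamma)
    c_alpha = np.quantile(null, 1.0 - alpha)
    stat = mmd2_unbiased(X, Y, gamma)
    return stat, c_alpha, stat > c_alpha
```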

SLIDE 76

How to choose the best kernel: optimising the kernel parameters

SLIDE 77

Graphical illustration

Maximising test power is the same as minimising false negatives.

[Figure: null and alternative distributions with the threshold $c_\alpha$]

SLIDE 78

Optimizing kernel for test power

The power of our test ($\Pr_1$ denotes probability under $P \neq Q$):

$\Pr_1\left( n\, \widehat{\mathrm{MMD}}^2 > \hat{c}_\alpha \right) \to \Phi\left( \underbrace{\frac{\mathrm{MMD}^2(P, Q)}{\sqrt{V_n(P, Q)}}}_{O(n^{1/2})} - \underbrace{\frac{c_\alpha}{n \sqrt{V_n(P, Q)}}}_{O(n^{-1/2})} \right)$

where $\Phi$ is the CDF of the standard normal distribution and $\hat{c}_\alpha$ is an estimate of the test threshold $c_\alpha$.

The variance under $H_1$ decreases as $V_n(P, Q) = O(n^{-1})$, so for large n the second term is negligible!

To maximize test power, maximize $\frac{\mathrm{MMD}^2(P, Q)}{\sqrt{V_n(P, Q)}}$

(Sutherland, Tung, Strathmann, De, Ramdas, Smola, G., ICLR 2017)

Code: github.com/dougalsutherland/opt-mmd
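A rough sketch of kernel selection by this criterion (ours, not the opt-mmd code: we stand in for $V_n$ with a simple bootstrap variance where the paper uses a closed-form estimator, and in practice the kernel should be chosen on held-out data, not the test set):

```python
import numpy as np

def gaussian_kernel(X, Y, gamma=1.0):
    sq = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def mmd2_unbiased(X, Y, gamma=1.0):
    n, m = len(X), len(Y)
    Kxx, Kyy = gaussian_kernel(X, X, gamma), gaussian_kernel(Y, Y, gamma)
    Kxy = gaussian_kernel(X, Y, gamma)
    return ((Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
            + (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1)) - 2 * Kxy.mean())

def power_proxy(X, Y, gamma, n_boot=100, eps=1e-8, seed=0):
    """Estimate MMD^2 / sqrt(V_n) with a bootstrap stand-in for the variance."""
    rng = np.random.default_rng(seed)
    n, m = len(X), len(Y)
    boots = [mmd2_unbiased(X[rng.integers(0, n, n)], Y[rng.integers(0, m, m)], gamma)
             for _ in range(n_boot)]
    return mmd2_unbiased(X, Y, gamma) / (np.std(boots) + eps)

# Pick the bandwidth that maximises the power proxy on training data.
rng = np.random.default_rng(0)
X_train, Y_train = rng.normal(0, 1, (200, 2)), rng.laplace(0, 1, (200, 2))
best_gamma = max([0.01, 0.1, 1.0, 10.0], key=lambda g: power_proxy(X_train, Y_train, g))
```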

SLIDE 82

Troubleshooting for generative adversarial networks

MNIST samples vs. samples from a GAN; ARD map.

Power for the optimised ARD kernel: 1.00 at $\alpha = 0.01$. Power for the optimised RBF kernel: 0.57 at $\alpha = 0.01$.

SLIDE 84

Troubleshooting generative adversarial networks


SLIDE 85

Training GANs with MMD

SLIDE 86

What is a Generative Adversarial Network (GAN)?

[Figure: GAN architecture, shown as a build over several slides]

SLIDE 90

Why is classification not enough?

SLIDE 91

MMD for GAN critic

Can you use MMD as a critic to train GANs? (From ICML 2015 and from UAI 2015.)

Need better image features.

SLIDE 93

How to improve the critic witness

Add convolutional features! The critic (teacher) also needs to be trained. How to regularise?

MMD GAN, Li et al. [NIPS 2017]; Coulomb GAN, Unterthiner et al. [ICLR 2018]

SLIDE 94

WGAN-GP

Wasserstein GAN, Arjovsky et al. [ICML 2017]; WGAN-GP, Gulrajani et al. [NIPS 2017]

Given a generator $G_\theta$ with parameters $\theta$ to be trained; samples $Y \sim G_\theta(Z)$ where $Z \sim R$. Given critic features $h_\psi$ with parameters $\psi$ to be trained; $f_\psi$ is a linear function of $h_\psi$.

WGAN-GP critic objective with gradient penalty:

$\max_\psi\; E_{X \sim P} f_\psi(X) - E_{Z \sim R} f_\psi(G_\theta(Z)) - \lambda\, E_{\tilde{X}} \left( \left\| \nabla_{\tilde{X}} f_\psi(\tilde{X}) \right\| - 1 \right)^2$

where $\tilde{X} = \gamma x_i + (1 - \gamma) G_\theta(z_j)$, $\gamma \sim U([0, 1])$, $x_i \in \{x_\ell\}_{\ell=1}^m$, $z_j \in \{z_\ell\}_{\ell=1}^n$
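A minimal PyTorch sketch of the WGAN-GP penalty (ours, not the talk's code): interpolate between real and generated batches and push the critic's gradient norm at the interpolates towards 1.

```python
import torch

def wgan_gp_penalty(critic, x_real, x_fake, lam=10.0):
    """lam * E[ (||grad_x critic(x_tilde)|| - 1)^2 ] at random interpolates."""
    gamma = torch.rand(x_real.size(0), 1)                       # gamma ~ U[0, 1]
    x_tilde = (gamma * x_real + (1 - gamma) * x_fake).requires_grad_(True)
    grad, = torch.autograd.grad(critic(x_tilde).sum(), x_tilde, create_graph=True)
    return lam * ((grad.norm(2, dim=1) - 1.0) ** 2).mean()

# Toy usage: a linear critic on 2-D points; maximise E_P f - E_Q f minus the penalty.
critic = torch.nn.Linear(2, 1)
x_real, x_fake = torch.randn(64, 2), torch.randn(64, 2) + 1.0
loss = -(critic(x_real).mean() - critic(x_fake).mean()) \
       + wgan_gp_penalty(critic, x_real, x_fake)
loss.backward()  # gradients flow to the critic parameters through the penalty
```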

SLIDE 97

The (W)MMD

Train the MMD critic features with the witness function gradient penalty. Binkowski, Sutherland, Arbel, G. [ICLR 2018]; Bellemare et al. [2017] for the energy distance:

$\max_\psi\; \mathrm{MMD}^2\left( h_\psi(X), h_\psi(G_\theta(Z)) \right) - \lambda\, E_{\tilde{X}} \left( \left\| \nabla_{\tilde{X}} f_\psi(\tilde{X}) \right\| - 1 \right)^2$

where $\tilde{X} = \gamma x_i + (1 - \gamma) G_\theta(z_j)$, $\gamma \sim U([0, 1])$, $x_i \in \{x_\ell\}_{\ell=1}^m$, $z_j \in \{z_\ell\}_{\ell=1}^n$

Remark by Bottou et al. (2017): the gradient penalty modifies the function class, so the critic is no longer an MMD in the RKHS $\mathcal{F}$.
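A sketch of the corresponding MMD critic loss with the witness gradient penalty (our PyTorch approximation with a Gaussian kernel on the critic features h; not the MMD GAN reference code):

```python
import torch

def gauss_k(A, B, gamma=1.0):
    return torch.exp(-gamma * torch.cdist(A, B) ** 2)

def mmd2(hx, hy, gamma=1.0):
    """Unbiased MMD^2 between feature batches hx and hy."""
    n, m = hx.size(0), hy.size(0)
    Kxx, Kyy, Kxy = gauss_k(hx, hx, gamma), gauss_k(hy, hy, gamma), gauss_k(hx, hy, gamma)
    return ((Kxx.sum() - Kxx.diagonal().sum()) / (n * (n - 1))
            + (Kyy.sum() - Kyy.diagonal().sum()) / (m * (m - 1)) - 2 * Kxy.mean())

def witness_gp(h, x, y, gamma=1.0, lam=1.0):
    """Penalise the gradient of the empirical MMD witness
    f(v) = mean_i k(h(x_i), h(v)) - mean_j k(h(y_j), h(v)) at interpolates v."""
    t = torch.rand(x.size(0), 1)
    v = (t * x + (1 - t) * y).requires_grad_(True)
    f_v = gauss_k(h(x), h(v), gamma).mean(0) - gauss_k(h(y), h(v), gamma).mean(0)
    grad, = torch.autograd.grad(f_v.sum(), v, create_graph=True)
    return lam * ((grad.norm(2, dim=1) - 1.0) ** 2).mean()

# Critic step: ascend MMD^2 on critic features h_psi, minus the penalty.
h = torch.nn.Sequential(torch.nn.Linear(2, 16), torch.nn.ReLU(), torch.nn.Linear(16, 8))
x, y = torch.randn(64, 2), torch.randn(64, 2) + 1.0
critic_objective = mmd2(h(x), h(y)) - witness_gp(h, x, y)
(-critic_objective).backward()
```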

SLIDE 98

MMD for GAN critic: revisited

From ICLR 2018: samples are better!

Can we do better still?

SLIDE 101

Convergence issues for WGAN-GP penalty

A WGAN-GP style gradient penalty may not converge near the solution.

Nagarajan and Kolter [NIPS 2017], Mescheder et al. [ICML 2018], Balduzzi et al. [ICML 2018]

The Dirac-GAN: $P = \delta_0$, $Q = \delta_\theta$, $f_\psi(x) = \psi \cdot x$

[Figure from Mescheder et al., ICML 2018]

SLIDE 103

A better gradient penalty

New MMD GAN witness regulariser: Arbel, Sutherland, Binkowski, G. [NIPS 2018]

Based on a semi-supervised learning regulariser, Bousquet et al. [NIPS 2004]; related to Sobolev GAN, Mroueh et al. [ICLR 2018].

Modified witness function: [formula shown on slide]

Problem: not computationally feasible, $O(n^3)$ per iteration.

The scaled MMD: $\mathrm{SMMD} = \sigma_{k,P,\lambda}\, \mathrm{MMD}$, where

$\sigma_{k,P,\lambda} = \left( \lambda + \int k(x, x)\, dP(x) + \sum_{i=1}^d \int \partial_i \partial_{i+d} k(x, x)\, dP(x) \right)^{-1/2}$

Replace the expensive constraint with a cheap upper bound: $\|f\|_S^2 \leq \sigma_{k,P,\lambda}^{-2}\, \|f\|_k^2$

Idea: rather than regularise the critic or witness function, regularise the features directly.

SLIDE 109

Evaluation and experiments

SLIDE 110

Evaluation of GANs

The inception score? Salimans et al. [NIPS 2016]

Based on the classification output $p(y|x)$ of the inception model, Szegedy et al. [ICLR 2014]:

$E_X \exp\left( \mathrm{KL}\left( P(y|X)\, \|\, P(y) \right) \right)$

High when: the predictive label distribution $P(y|x)$ has low entropy (good quality images), and the label entropy $P(y)$ is high (good variety).

Problem: relies on a trained classifier! Can't be used on new categories (celeb, bedroom, ...).
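Given a matrix of classifier outputs $p(y|x)$, the score is a few lines — a minimal NumPy sketch (the softmax outputs themselves would come from the inception model; names are ours):

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """probs: (n_images, n_classes) array of rows p(y|x).
    Returns exp( E_X KL( p(y|X) || p(y) ) )."""
    p_y = probs.mean(axis=0, keepdims=True)          # marginal label distribution
    kl = (probs * (np.log(probs + eps) - np.log(p_y + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))

# Confident, varied predictions score high; uniform predictions score 1.
confident = np.eye(10)[np.arange(100) % 10]          # one-hot, all classes used
print(inception_score(confident))                    # ~10 (number of classes)
print(inception_score(np.full((100, 10), 0.1)))      # ~1
```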

SLIDE 112

Evaluation of GANs

The Frechet inception distance? Heusel et al. [NIPS 2017]

Fits Gaussians to features in the inception architecture (pool3 layer):

$\mathrm{FID}(P, Q) = \|\mu_P - \mu_Q\|^2 + \mathrm{tr}(\Sigma_P) + \mathrm{tr}(\Sigma_Q) - 2\, \mathrm{tr}\left( (\Sigma_P \Sigma_Q)^{1/2} \right)$

where $\mu_P$ and $\Sigma_P$ are the feature mean and covariance of P.

Problem: bias. For finite samples it can consistently give the incorrect answer.

[Figure: bias demo, FID vs. sample size n, CIFAR-10 train vs. test]
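The FID formula translates directly into code — a minimal NumPy/SciPy sketch of the Frechet distance between two Gaussian fits (helper names are ours):

```python
import numpy as np
from scipy.linalg import sqrtm

def fid_from_moments(mu1, S1, mu2, S2):
    """||mu1 - mu2||^2 + tr(S1) + tr(S2) - 2 tr((S1 S2)^(1/2))."""
    covmean = sqrtm(S1 @ S2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real       # discard numerical imaginary round-off
    return float(((mu1 - mu2) ** 2).sum() + np.trace(S1) + np.trace(S2)
                 - 2.0 * np.trace(covmean))

def fid(X, Y):
    """Plug-in FID between two sets of (pool3) feature vectors, rows = images.
    The plug-in moments are exactly where the finite-sample bias enters."""
    return fid_from_moments(X.mean(0), np.cov(X, rowvar=False),
                            Y.mean(0), np.cov(Y, rowvar=False))
```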

SLIDE 114

Evaluation of GANs

The FID can give the wrong answer in theory. Assume m samples from P and $n \to \infty$ samples from Q. Given two alternatives:

$P_1 = \mathcal{N}\left( 0, \left( 1 - \tfrac{1}{m} \right)^2 \right)$, $P_2 = \mathcal{N}(0, 1)$, $Q = \mathcal{N}(0, 1)$.

Clearly, $\mathrm{FID}(P_1, Q) = \tfrac{1}{m^2} > \mathrm{FID}(P_2, Q) = 0$. Yet given m samples from each of $P_1$ and $P_2$, $\mathrm{FID}(\hat{P}_1, Q) < \mathrm{FID}(\hat{P}_2, Q)$.

SLIDE 118

Evaluation of GANs

The FID can give the wrong answer in practice. Let $d = 2048$, and define

$P_1 = \mathrm{relu}(\mathcal{N}(0, I_d))$, $P_2 = \mathrm{relu}(\mathcal{N}(1, 0.8\Sigma + 0.2 I_d))$, $Q = \mathrm{relu}(\mathcal{N}(1, I_d))$,

where $\Sigma = \tfrac{4}{d} C C^T$, with C a $d \times d$ matrix with i.i.d. standard normal entries. For a random draw of C:

$\mathrm{FID}(P_1, Q) \approx 1123.0 > 1114.8 \approx \mathrm{FID}(P_2, Q)$

With $m = 50\,000$ samples, $\mathrm{FID}(\hat{P}_1, Q) \approx 1133.7 < 1136.2 \approx \mathrm{FID}(\hat{P}_2, Q)$.

At $m = 100\,000$ samples the ordering of the estimates is correct. This behaviour is similar for other random draws of C.

SLIDE 122

The kernel inception distance (KID)

Binkowski, Sutherland, Arbel, G. [ICLR 2018]

Measures similarity of the samples' representations in the inception architecture (pool3 layer): the MMD with kernel

$k(x, y) = \left( \tfrac{1}{d}\, x^\top y + 1 \right)^3$

Checks the match of feature means, variances, and skewness. Unbiased: e.g. CIFAR-10 train vs. test. [Figure: KID vs. n, centred on zero]

... "but isn't KID computationally costly?" A "block" KID implementation is cheaper than FID: see the paper (or use the Tensorflow implementation)!

Also used for automatic learning rate adjustment: if $\mathrm{KID}(\hat{P}_{t+1}, Q)$ is not significantly better than $\mathrm{KID}(\hat{P}_t, Q)$, reduce the learning rate. [Bounliphone et al., ICLR 2016]

Related: "An empirical study on evaluation metrics of generative adversarial networks", Xu et al. [arXiv, June 2018]
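A minimal NumPy sketch of the KID estimator (ours): the unbiased MMD² with the cubic polynomial kernel on inception features; averaging it over disjoint blocks gives the cheap "block" variant mentioned above.

```python
import numpy as np

def kid(X, Y):
    """Unbiased MMD^2 with k(x, y) = ((1/d) x^T y + 1)^3; rows are feature vectors."""
    d = X.shape[1]
    Kxx = ((X @ X.T) / d + 1.0) ** 3
    Kyy = ((Y @ Y.T) / d + 1.0) ** 3
    Kxy = ((X @ Y.T) / d + 1.0) ** 3
    n, m = len(X), len(Y)
    return ((Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
            + (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1)) - 2.0 * Kxy.mean())

def block_kid(X, Y, block=1000, seed=0):
    """Average the unbiased estimate over disjoint blocks: still unbiased,
    and O(B * block^2) instead of O(n^2)."""
    rng = np.random.default_rng(seed)
    xi, yi = rng.permutation(len(X)), rng.permutation(len(Y))
    n_blocks = min(len(X), len(Y)) // block
    return float(np.mean([kid(X[xi[b*block:(b+1)*block]], Y[yi[b*block:(b+1)*block]])
                          for b in range(n_blocks)]))
```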

SLIDE 126

Benchmarks for comparison (all from ICLR 2018)

SLIDE 127

Results: what does MMD buy you?

Critic features from DCGAN: an f-filter critic has f, 2f, 4f, and 8f convolutional filters in layers 1-4. LSUN 64 × 64.

MMD GAN samples, f = 64, KID = 3; WGAN samples, f = 64, KID = 4

SLIDE 128

Results: what does MMD buy you?

MMD GAN samples, f = 16, KID = 9; WGAN samples, f = 16, KID = 37

SLIDE 129

The kernel inception distance (KID)

Faster training: performance scores vs. generator iterations on MNIST

SLIDE 130

Results: celebrity faces 160 × 160

KID scores: Sobolev GAN: 14; SN-GAN: 18; old MMD GAN: 13; SMMD GAN: 6

202 599 face images, resized and cropped to 160 × 160

SLIDE 131

Results: ImageNet 64 × 64

KID (FID) scores: BGAN: 47; SN-GAN: 44; SMMD GAN: 35

ILSVRC2012 (ImageNet) dataset: 1 281 167 images over 1000 classes, resized to 64 × 64.

SLIDE 134

Summary

MMD critic gives state-of-the-art performance for GAN training (FID and KID):

  • use convolutional input features
  • train with the new gradient regulariser

Faster training, simpler critic network. Reasons for good performance:

  • Unlike WGAN-GP, the MMD loss is still a valid critic when features are not optimal
  • Kernel features do some of the "work", so simpler $h_\psi$ features are possible
  • A better gradient/feature regulariser gives a better critic

Code for "Demystifying MMD GANs," ICLR 2018, including the KID score: https://github.com/mbinkowski/MMD-GAN
Code for the new SMMD: https://github.com/MichaelArbel/Scaled-MMD-GAN

SLIDE 135

Co-authors

From Gatsby: Mikolaj Binkowski, Kacper Chwialkowski, Wittawat Jitkrittum, Heiko Strathmann, Dougal Sutherland, Wenkai Xu

External collaborators: Kenji Fukumizu, Bernhard Schoelkopf, Dino Sejdinovic, Bharath Sriperumbudur, Alex Smola, Zoltan Szabo

SLIDE 136

Questions?