Stochastic Optimization for Regularized Wasserstein Estimators (ICML 2020)
Francis Bach, Quentin Berthet, Marin Ballu
Wasserstein Distance: a natural geometry for distributions

How does one compute the distance between two data distributions?

- Relative entropy and other f-divergences allow classical statistical approaches.
- Optimal transport theory allows us to capture the geometry of the data distributions, with the Wasserstein distance, in its Monge and Kantorovich formulations:

$$W_c(\mu, \nu) = \mathrm{OT}(\mu, \nu) = \min_{T_{\#}\mu = \nu} \mathbb{E}_{X \sim \mu}\big[c(X, T(X))\big] \quad \text{(Monge)}$$

$$W_c(\mu, \nu) = \mathrm{OT}(\mu, \nu) = \min_{\pi \in \Pi(\mu, \nu)} \mathbb{E}_{(X, Y) \sim \pi}\big[c(X, Y)\big] \quad \text{(Kantorovich)}$$
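For discrete measures, the Kantorovich problem above is a finite linear program, so on small supports it can be solved exactly. Below is a minimal sketch using SciPy's LP solver; the data, sizes, and variable names are illustrative, not from the paper.

```python
# Exact discrete OT as a linear program: min <C, pi> s.t. pi >= 0,
# row sums of pi equal mu, column sums equal nu.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
I, J = 5, 7
x, y = rng.normal(size=(I, 1)), rng.normal(size=(J, 1))
C = (x - y.T) ** 2                      # squared-distance cost c(x, y)
mu = np.full(I, 1.0 / I)                # uniform source weights
nu = np.full(J, 1.0 / J)                # uniform target weights

# Marginal constraints on the vectorized plan pi (row-major index i*J + j).
A_eq = np.zeros((I + J, I * J))
for i in range(I):
    A_eq[i, i * J:(i + 1) * J] = 1.0    # sum_j pi[i, j] = mu[i]
for j in range(J):
    A_eq[I + j, j::J] = 1.0             # sum_i pi[i, j] = nu[j]

res = linprog(C.ravel(), A_eq=A_eq, b_eq=np.concatenate([mu, nu]),
              bounds=(0, None), method="highs")
print("OT(mu, nu) =", res.fun)
```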
Wasserstein distance in machine learning

- Wasserstein GAN (Arjovsky et al., 2017)
- Wasserstein Discriminant Analysis (Flamary et al., 2018)
- Clustered point-matching (Alvarez-Melis et al., 2018)
- Diffeomorphic registration (Feydy et al., 2017)
- Alignment of embeddings (Grave et al., 2019)
- Sinkhorn divergence for generative models (Genevay et al., 2019)
Our contribution

We consider the minimum Kantorovich estimator (Bassetti et al., 2006), or Wasserstein estimator, of the measure $\mu$:

$$\min_{\nu \in \mathcal{M}} \mathrm{OT}(\mu, \nu),$$

which is often used for $\mu = \sum_i \delta_{x_i}$ to fit a parametric model $\mathcal{M}$ (as with MLE, where the KL divergence replaces OT).

[Diagram: the estimator projects $\mu$ onto the model $\mathcal{M}$, at distance $\mathrm{OT}(\mu, \nu)$.]

- We add two layers of entropic regularization.
- We propose a new stochastic optimization scheme to minimize the regularized problem.
- Time per step is sublinear in the natural dimension of the problem.
- We provide theoretical guarantees, and simulations.
Regularized Wasserstein Distance

Wasserstein distance:
$$W_c(\mu, \nu) = \mathrm{OT}(\mu, \nu) = \min_{\pi \in \Pi(\mu, \nu)} \mathbb{E}_{(X, Y) \sim \pi}\big[c(X, Y)\big]$$

Regularized Wasserstein distance:
$$\mathrm{OT}_{\varepsilon}(\mu, \nu) = \min_{\pi \in \Pi(\mu, \nu)} \mathbb{E}_{(X, Y) \sim \pi}\big[c(X, Y)\big] + \varepsilon\, \mathrm{KL}(\pi, \mu \otimes \nu)$$

Computed at light speed by the Sinkhorn algorithm (Cuturi, 2013), or by SGD on the dual problem (Genevay et al., 2016).
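For discrete measures, $\mathrm{OT}_\varepsilon$ is typically computed by Sinkhorn-style matrix scaling. Here is a minimal sketch; conventions for the entropy term differ across references (KL against $\mu \otimes \nu$ as above vs. the plain-entropy form of Cuturi, 2013), so constants differ but the scaling loop takes the same form. All names and data are illustrative.

```python
# Minimal Sinkhorn sketch: alternately rescale a Gibbs kernel so that
# the plan's marginals match mu and nu.
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=(5, 1)), rng.normal(size=(7, 1))
C = (x - y.T) ** 2                      # squared-distance cost
mu, nu = np.full(5, 0.2), np.full(7, 1 / 7)

eps = 0.1
K = np.exp(-C / eps)                    # Gibbs kernel
u = np.ones_like(mu)
for _ in range(500):                    # alternate the two marginal fits
    v = nu / (K.T @ u)
    u = mu / (K @ v)
pi = u[:, None] * K * v[None, :]        # regularized transport plan

print("marginal error:", abs(pi.sum(1) - mu).max(), abs(pi.sum(0) - nu).max())
print("transport cost:", (pi * C).sum())
```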
Regularized Wasserstein Estimator

Wasserstein estimator:
$$\min_{\nu \in \mathcal{M}} \mathrm{OT}(\mu, \nu)$$

First layer of regularization:
$$\min_{\nu \in \mathcal{M}} \mathrm{OT}_{\varepsilon}(\mu, \nu)$$

Second layer of regularization:
$$\min_{\nu \in \mathcal{M}} \mathrm{OT}_{\varepsilon}(\mu, \nu) + \eta\, \mathrm{KL}(\nu, \beta)$$
First layer: Gaussian deconvolution

This is a recent interpretation (Rigollet and Weed, 2018). Let $X_i$ be i.i.d. random variables with law $\nu^*$, let $Z_i \sim \varphi_\varepsilon = \mathcal{N}(0, \varepsilon I_d)$ be i.i.d. Gaussian noise, and let $Y_i = X_i + Z_i$ be the perturbed observations, with distribution $\mu$:

$$X_i \sim \nu^* \quad \longrightarrow \quad Y_i = X_i + Z_i \sim \varphi_\varepsilon * \nu^*$$

For $c(x, y) = \|x - y\|^2$, the MLE for $\nu^*$ is

$$\hat{\nu} := \arg\max_{\nu \in \mathcal{M}} \sum_i \log(\varphi_\varepsilon * \nu)(Y_i) \quad \Longleftrightarrow \quad \hat{\nu} = \arg\min_{\nu \in \mathcal{M}} \mathrm{OT}_\varepsilon(\mu, \nu).$$

$$X_i \sim \nu^* \quad \longleftarrow \quad Y_i \sim \varphi_\varepsilon * \nu^* \quad \text{(deconvolution: recover } \nu^* \text{ from the noisy } Y_i\text{)}$$
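A tiny simulation may help fix ideas: draw $X_i$ from a toy $\nu^*$, add $\mathcal{N}(0, \varepsilon I_d)$ noise, and keep only the $Y_i$, which is all the estimator ever sees. Everything here (the choice of $\nu^*$, sizes, seeds) is illustrative.

```python
# Simulating the deconvolution model: only Y = X + Z is observed.
import numpy as np

rng = np.random.default_rng(0)
eps, n, d = 0.1, 2000, 2
X = rng.choice([-1.0, 1.0], size=(n, d))          # X_i ~ nu*: toy discrete measure
Z = rng.normal(scale=np.sqrt(eps), size=(n, d))   # Z_i ~ N(0, eps I_d)
Y = X + Z                                         # observed: Y_i ~ phi_eps * nu*
# mu is the law of Y; fitting nu by min OT_eps(mu, nu) acts as deconvolution.
print(Y.mean(axis=0), Y.var(axis=0))              # variance ~ 1 + eps per coordinate
```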
First layer: adds entropy to the transport matrix

[Figure 1: transport plan with small regularization, $\varepsilon = 0.01$. Figure 2: transport plan with big regularization, $\varepsilon = 0.1$.]
Second Layer: Interpolation with likelihood estimators

Wasserstein estimator:
$$\min_{\nu \in \mathcal{M}} \mathrm{OT}(\mu, \nu)$$

Maximum likelihood estimator:
$$\min_{\nu \in \mathcal{M}} \mathrm{KL}(\nu, \beta)$$

Regularized Wasserstein estimator:
$$\min_{\nu \in \mathcal{M}} \mathrm{OT}_\varepsilon(\mu, \nu) + \eta\, \mathrm{KL}(\nu, \beta)$$
Second Layer: adds entropy to the target measure

[Figure 3: estimated measure with small regularization, $\eta = 0.02$. Figure 4: estimated measure with big regularization, $\eta = 0.2$.]
Dual Formulation of the problem

The problem
$$\min_{\nu \in \mathcal{M}} \mathrm{OT}_\varepsilon(\mu, \nu) + \eta\, \mathrm{KL}(\nu, \beta),$$
with
$$\mathrm{OT}_\varepsilon(\mu, \nu) = \min_{\pi \in \Pi(\mu, \nu)} \mathbb{E}_{(X, Y) \sim \pi}\big[c(X, Y)\big] + \varepsilon\, \mathrm{KL}(\pi, \mu \otimes \nu),$$
is
$$\min_{\nu \in \mathcal{M}}\ \min_{\pi \in \Pi(\mu, \nu)} \mathbb{E}_{(X, Y) \sim \pi}\big[c(X, Y)\big] + \varepsilon\, \mathrm{KL}(\pi, \mu \otimes \nu) + \eta\, \mathrm{KL}(\nu, \beta).$$

We consider the dual of the second min.
Dual Formulation

The dual problem can be written as a saddle-point problem, where the min and the max can be swapped. The final formulation is of the form
$$\max_{(a, b) \in \mathbb{R}^I \times \mathbb{R}^J} F(a, b).$$
Properties of the function F in the discrete case

1. $F$ is $\lambda$-strongly convex on the hyperplane $E = \{\sum_i \mu_i a_i = \sum_j \beta_j b_j\}$.
2. There exists a solution of $\max_{(a, b) \in \mathbb{R}^I \times \mathbb{R}^J} F(a, b)$ which lies in $E$, and it is unique.
3. The gradients of $F$ can be written as expectations:
$$\nabla_a F = \mathbb{E}\big[(1 - D_{i,j})\, e_i\big], \qquad \nabla_b F = \mathbb{E}\big[(f_j - D_{i,j})\, e_j\big],$$
with
$$D_{i,j}(a, b) = \exp\Big(\frac{a_i + b_j - C_{i,j}}{\varepsilon}\Big) \quad \text{and} \quad f_j = \frac{\nu_j(b)}{\beta_j}.$$
Stochastic Gradient Descent

We have stochastic gradients for $F$:
$$G_a = (1 - D_{i,j})\, e_i, \qquad G_b = (f_j - D_{i,j})\, e_j.$$

SGD algorithm:
- Sample $i \in \{1, \ldots, I\}$ with probability $\mu_i$,
- Sample $j \in \{1, \ldots, J\}$ with probability $\beta_j$,
- Compute $G_a$ and $G_b$,
- $a \leftarrow a + \gamma_t G_a$,
- $b \leftarrow b + \gamma_t G_b$.
Stochastic Gradient Descent

We only have to update $a$ and $b$ one coefficient at a time (a sketch follows the list):
- Sample $i \in \{1, \ldots, I\}$ with probability $\mu_i$,
- Sample $j \in \{1, \ldots, J\}$ with probability $\beta_j$,
- Compute $f_j$ and $D_{i,j}$,
- $a_i \leftarrow a_i + \gamma_t (1 - D_{i,j})$,
- $b_j \leftarrow b_j + \gamma_t (f_j - D_{i,j})$.
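A minimal sketch of this loop, written directly from the bullet points above. The expression for $\nu_j(b)$ is taken from the next slide, the step size is an illustrative $1/t$ schedule, and the data are random; none of this is the paper's reference implementation.

```python
# Coordinate-sparse stochastic updates on the dual variables (a, b).
import numpy as np

rng = np.random.default_rng(0)
I, J = 50, 60
C = rng.random((I, J))                            # cost matrix C_{i,j}
mu = np.full(I, 1 / I)                            # source weights mu_i
beta = np.full(J, 1 / J)                          # reference weights beta_j
eps, eta = 0.1, 0.2                               # assumes eta > eps
a, b = np.zeros(I), np.zeros(J)

for t in range(1, 10001):
    i = rng.choice(I, p=mu)                       # sample i with probability mu_i
    j = rng.choice(J, p=beta)                     # sample j with probability beta_j
    D = np.exp((a[i] + b[j] - C[i, j]) / eps)
    S = np.sum(beta * np.exp(-b / (eta - eps)))   # naive O(J) normalizing sum
    f = np.exp(-b[j] / (eta - eps)) / S           # f_j = nu_j(b) / beta_j
    gamma = 1.0 / t                               # illustrative step size
    a[i] += gamma * (1 - D)                       # update only coordinate i of a
    b[j] += gamma * (f - D)                       # update only coordinate j of b
```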
The sum memorization trick

The computation of $D_{i,j}(a, b) = \exp\big(\frac{a_i + b_j - C_{i,j}}{\varepsilon}\big)$ is $O(1)$, and so is $f_j = \nu_j(b)/\beta_j$ once the normalizing sum is known. However,
$$\nu_j(b) = \frac{\beta_j e^{-b_j/(\eta - \varepsilon)}}{\sum_k \beta_k e^{-b_k/(\eta - \varepsilon)}},$$
whose denominator naively costs $O(J)$ per step. We can obtain it in $O(1)$ by maintaining the running sum
$$S^{(t)} = \sum_k \beta_k e^{-b_k^{(t)}/(\eta - \varepsilon)},$$
updated as
$$S^{(t+1)} = S^{(t)} + \beta_j e^{-b_j^{(t+1)}/(\eta - \varepsilon)} - \beta_j e^{-b_j^{(t)}/(\eta - \varepsilon)},$$
since only the coordinate $b_j$ changes at step $t$. A code sketch follows.
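The same loop with the running sum maintained across iterations, continuing the sketch above (the variables `a`, `b`, `C`, `mu`, `beta`, `eps`, `eta`, `rng` carry over).

```python
# Sum memorization: keep S = sum_k beta_k * exp(-b_k/(eta-eps)) up to date,
# so each step touches only one coordinate of b and costs O(1) arithmetic.
S = float(np.sum(beta * np.exp(-b / (eta - eps))))   # initialize once, O(J)

for t in range(1, 10001):
    i = rng.choice(I, p=mu)
    j = rng.choice(J, p=beta)
    D = np.exp((a[i] + b[j] - C[i, j]) / eps)
    f = np.exp(-b[j] / (eta - eps)) / S              # O(1): reuse the cached sum
    gamma = 1.0 / t
    a[i] += gamma * (1 - D)
    old_term = beta[j] * np.exp(-b[j] / (eta - eps))
    b[j] += gamma * (f - D)
    new_term = beta[j] * np.exp(-b[j] / (eta - eps))
    S += new_term - old_term                         # O(1) update of the sum
```

One caveat: NumPy's weighted `choice` is itself $O(I)$ or $O(J)$ per draw. Since $\mu$ and $\beta$ are fixed, a genuinely sublinear-time implementation would precompute alias tables once and sample in $O(1)$ per step.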
Convergence Bounds

With stepsize $\gamma_t = \frac{1}{\lambda t}$, the estimator verifies
$$\mathbb{E}\big[\mathrm{KL}(\nu^*, \nu_t)\big] \le \frac{C_1}{(\eta - \varepsilon)\lambda^2} \cdot \frac{1 + \log t}{t}.$$

With stepsize $\gamma_t = \frac{C_2}{\sqrt{t}}$, the estimator verifies the following bound:
$$\mathbb{E}\big[\mathrm{KL}(\nu^*, \nu_t)\big] \le \frac{C_3}{(\eta - \varepsilon)\lambda} \cdot \frac{2 + \log t}{\sqrt{t}}.$$
Simulations

[Figure 5: convergence of the gradient norm for different dimensions.]
Application to Wasserstein Barycenters

Wasserstein barycenter:
$$\min_{\nu} \sum_{k=1}^{K} \theta_k\, \mathrm{OT}(\mu_k, \nu).$$

Doubly regularized Wasserstein barycenter (a sketch follows):
$$\min_{\nu} \sum_{k=1}^{K} \theta_k\, \mathrm{OT}_\varepsilon(\mu_k, \nu) + \eta\, \mathrm{KL}(\nu, \beta).$$
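As a hedged sketch of how the same stochastic scheme might extend to this objective: keep one dual vector $a^{(k)}$ per input measure $\mu_k$ and a shared $b$, and start each step by sampling the measure index $k$ with probability $\theta_k$, which reproduces the $\theta$-weighted sum in expectation. This continues the variables from the SGD sketches above (`b`, `S`, `mu`, `beta`, `eps`, `eta`, `rng`) and is our illustrative guess at the natural extension, not the paper's stated algorithm.

```python
# Illustrative barycenter extension: K cost matrices, K dual vectors a^(k),
# one shared dual vector b coupling the K transport problems.
K = 3
theta = np.full(K, 1 / K)                        # barycenter weights theta_k
Cs = [rng.random((I, J)) for _ in range(K)]      # one cost matrix per mu_k
A = np.zeros((K, I))                             # dual vector a^(k) per measure

for t in range(1, 10001):
    k = rng.choice(K, p=theta)                   # sample a measure index
    i = rng.choice(I, p=mu)                      # shared mu here, for simplicity
    j = rng.choice(J, p=beta)
    D = np.exp((A[k, i] + b[j] - Cs[k][i, j]) / eps)
    f = np.exp(-b[j] / (eta - eps)) / S          # cached sum, as before
    gamma = 1.0 / t
    A[k, i] += gamma * (1 - D)
    old_term = beta[j] * np.exp(-b[j] / (eta - eps))
    b[j] += gamma * (f - D)                      # shared b couples the K problems
    new_term = beta[j] * np.exp(-b[j] / (eta - eps))
    S += new_term - old_term
```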
Conclusion

Takeaways:
- Wasserstein estimators are "projections" according to Wasserstein distances,
- Two layers of entropic regularization are used here,
- It is then possible to compute stochastic gradients in O(1) for this problem,
- The results are also valid for Wasserstein barycenters.

Thank you for your attention!