
1. Stochastic Optimization for Regularized Wasserstein Estimators. ICML 2020. Marin Ballu, Quentin Berthet, Francis Bach.

2. Wasserstein Distance: a natural geometry for distributions. How does one compute the distance between two data distributions?

3. Wasserstein Distance: a natural geometry for distributions. How does one compute the distance between two data distributions?
• Relative entropy and other f-divergences allow classical statistical approaches.

4. Wasserstein Distance: a natural geometry for distributions. How does one compute the distance between two data distributions?
• Relative entropy and other f-divergences allow classical statistical approaches.
• Optimal transport theory allows us to capture the geometry of the data distributions, with the Wasserstein distance: $W_c(\mu, \nu) = \mathrm{OT}(\mu, \nu) = \min_{T_\#\mu = \nu} \mathbb{E}_{X \sim \mu}[c(X, T(X))]$.

5. Wasserstein Distance: a natural geometry for distributions. How does one compute the distance between two data distributions?
• Relative entropy and other f-divergences allow classical statistical approaches.
• Optimal transport theory allows us to capture the geometry of the data distributions, with the Wasserstein distance: $W_c(\mu, \nu) = \mathrm{OT}(\mu, \nu) = \min_{\pi \in \Pi(\mu, \nu)} \mathbb{E}_{(X, Y) \sim \pi}[c(X, Y)]$.

6. Wasserstein distance in machine learning:
• Wasserstein GAN (Arjovsky et al., 2017)
• Wasserstein Discriminant Analysis (Flamary et al., 2018)
• Clustered point-matching (Alvarez-Melis et al., 2018)

7. Wasserstein distance in machine learning:
• Diffeomorphic registration (Feydy et al., 2017)
• Sinkhorn divergence for generative models (Genevay et al., 2019)
• Alignment of embeddings (Grave et al., 2019)

8. Our contribution. We consider the minimum Kantorovich estimator (Bassetti et al., 2006), or Wasserstein estimator, of the measure $\mu$: $\min_{\nu \in \mathcal{M}} \mathrm{OT}(\mu, \nu)$, which is often used for $\mu = \sum_i \delta_{x_i}$ to fit a parametric model $\mathcal{M}$ (as with MLE, where the KL divergence replaces OT). [Diagram: $\mu$, $\nu$, the model $\mathcal{M}$, and $\mathrm{OT}(\mu, \nu)$.]

9. Our contribution.
• We add two layers of entropic regularization.
• We propose a new stochastic optimization scheme to minimize the regularized problem.
• Time per step is sublinear in the natural dimension of the problem.
• We provide theoretical guarantees, and simulations.

10. Regularized Wasserstein Distance. Wasserstein distance: $W_c(\mu, \nu) = \mathrm{OT}(\mu, \nu) = \min_{\pi \in \Pi(\mu, \nu)} \mathbb{E}_{(X, Y) \sim \pi}[c(X, Y)]$.

11. Regularized Wasserstein Distance.
Wasserstein distance: $W_c(\mu, \nu) = \mathrm{OT}(\mu, \nu) = \min_{\pi \in \Pi(\mu, \nu)} \mathbb{E}_{(X, Y) \sim \pi}[c(X, Y)]$.
Regularized Wasserstein distance: $\mathrm{OT}_\varepsilon(\mu, \nu) = \min_{\pi \in \Pi(\mu, \nu)} \mathbb{E}_{(X, Y) \sim \pi}[c(X, Y)] + \varepsilon\,\mathrm{KL}(\pi, \mu \otimes \nu)$.
Computed at light speed by the Sinkhorn algorithm (Cuturi, 2013), or by SGD on the dual problem (Genevay et al., 2016).
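
For intuition, here is a minimal NumPy sketch of the Sinkhorn iterations for the entropic problem above; it is our own illustration, not code from the paper. Under the marginal constraints, the $\mathrm{KL}(\pi, \mu \otimes \nu)$ penalty differs from the negative-entropy penalty only by a constant, so the classical scaling updates apply.

```python
import numpy as np

def sinkhorn(mu, nu, C, eps, n_iters=500):
    """Sinkhorn scaling iterations for the entropic OT plan between discrete
    measures mu (length I) and nu (length J), with cost matrix C of shape (I, J).
    Illustrative sketch (Cuturi, 2013), not the estimator proposed in the slides."""
    K = np.exp(-C / eps)                  # Gibbs kernel e^{-C/eps}
    v = np.ones_like(nu)
    for _ in range(n_iters):
        u = mu / (K @ v)                  # enforce row marginals = mu
        v = nu / (K.T @ u)                # enforce column marginals = nu
    pi = u[:, None] * K * v[None, :]      # regularized transport plan
    return pi, float(np.sum(pi * C))      # plan and its transport cost
```

For instance, `sinkhorn(np.ones(5)/5, np.ones(7)/7, C, 0.1)` returns a 5 x 7 plan whose marginals are (approximately) the two uniform measures.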

12. Regularized Wasserstein Estimator. Wasserstein estimator: $\min_{\nu \in \mathcal{M}} \mathrm{OT}(\mu, \nu)$.

13. Regularized Wasserstein Estimator.
Wasserstein estimator: $\min_{\nu \in \mathcal{M}} \mathrm{OT}(\mu, \nu)$.
First layer of regularization: $\min_{\nu \in \mathcal{M}} \mathrm{OT}_\varepsilon(\mu, \nu)$.

14. Regularized Wasserstein Estimator.
Wasserstein estimator: $\min_{\nu \in \mathcal{M}} \mathrm{OT}(\mu, \nu)$.
First layer of regularization: $\min_{\nu \in \mathcal{M}} \mathrm{OT}_\varepsilon(\mu, \nu)$.
Second layer of regularization: $\min_{\nu \in \mathcal{M}} \mathrm{OT}_\varepsilon(\mu, \nu) + \eta\,\mathrm{KL}(\nu, \beta)$.

15. First layer: Gaussian deconvolution. This is a recent interpretation (Rigollet, Weed 2018). Let $X_i$ be iid random variables following $\nu^*$, let $Z_i \sim \varphi_\varepsilon = \mathcal{N}(0, \varepsilon\,\mathrm{Id})$ be iid Gaussian noise, and let $Y_i = X_i + Z_i$ be the perturbed observation, with distribution $\mu$. [Diagram: $X_i \sim \nu^* \;\to\; X_i + Z_i \;\to\; Y_i \sim \varphi_\varepsilon * \nu^*$.]
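
A tiny toy simulation of this observation model (entirely our own example, with arbitrary choices for $\nu^*$ and the noise level):

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 0.01                                       # noise variance (arbitrary)
# X_i ~ nu*: here nu* is a made-up discrete measure on three points
X = rng.choice([-1.0, 0.0, 1.0], size=1000, p=[0.3, 0.4, 0.3])
Z = rng.normal(scale=np.sqrt(eps), size=1000)    # Z_i ~ N(0, eps)
Y = X + Z                                        # observations, distributed as phi_eps * nu*
```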

16. First layer: Gaussian deconvolution. For $c(x, y) = \|x - y\|^2$, the MLE for $\nu^*$ is $\hat\nu := \arg\max_{\nu \in \mathcal{M}} \sum_i \log(\varphi_\varepsilon * \nu)(Y_i) \;\Longleftrightarrow\; \hat\nu = \arg\min_{\nu \in \mathcal{M}} \mathrm{OT}_\varepsilon(\mu, \nu)$. [Diagram: deconvolution recovers $X_i \sim \nu^*$ from $Y_i \sim \varphi_\varepsilon * \nu^*$.]

17. First layer: adds entropy to the transport matrix. Figure 1: small regularization, $\varepsilon = 0.01$. Figure 2: large regularization, $\varepsilon = 0.1$.

18. Second Layer: Interpolation with likelihood estimators.
Wasserstein estimator: $\min_{\nu \in \mathcal{M}} \mathrm{OT}(\mu, \nu)$.
Maximum likelihood estimator: $\min_{\nu \in \mathcal{M}} \mathrm{KL}(\nu, \beta)$.
Regularized Wasserstein estimator: $\min_{\nu \in \mathcal{M}} \mathrm{OT}_\varepsilon(\mu, \nu) + \eta\,\mathrm{KL}(\nu, \beta)$.

19. Second Layer: adds entropy to the target measure. Figure 3: small regularization, $\eta = 0.02$. Figure 4: large regularization, $\eta = 0.2$.

20. Dual Formulation of the problem. The problem $\min_{\nu \in \mathcal{M}} \mathrm{OT}_\varepsilon(\mu, \nu) + \eta\,\mathrm{KL}(\nu, \beta)$, with $\mathrm{OT}_\varepsilon(\mu, \nu) = \min_{\pi \in \Pi(\mu, \nu)} \mathbb{E}_{(X, Y) \sim \pi}[c(X, Y)] + \varepsilon\,\mathrm{KL}(\pi, \mu \otimes \nu)$, is $\min_{\nu \in \mathcal{M}} \min_{\pi \in \Pi(\mu, \nu)} \mathbb{E}_{(X, Y) \sim \pi}[c(X, Y)] + \varepsilon\,\mathrm{KL}(\pi, \mu \otimes \nu) + \eta\,\mathrm{KL}(\nu, \beta)$. We consider the dual of the second min.

21. Dual Formulation. The dual problem can be written as a saddle-point problem, where the min and the max can be swapped. The final formulation is of the form $\max_{(a, b) \in \mathbb{R}^I \times \mathbb{R}^J} F(a, b)$.

22. Properties of the function F in the discrete case.
1. $F$ is $\lambda$-strongly concave on the hyperplane $E = \{\sum_i \mu_i a_i = \sum_j \beta_j b_j\}$.
2. There exists a solution of $\max_{(a, b) \in \mathbb{R}^I \times \mathbb{R}^J} F(a, b)$ which lies in $E$, and it is unique.
3. The gradients of $F$ can be written as expectations: $\nabla_a F = \mathbb{E}[(1 - D_{i,j})\, e_i]$, $\nabla_b F = \mathbb{E}[(f_j - D_{i,j})\, e_j]$, with $D_{i,j}(a, b) = \exp\big(\frac{a_i + b_j - C_{i,j}}{\varepsilon}\big)$ and $f_j = \nu_j(b)/\beta_j$.

23. Stochastic Gradient Descent. We have stochastic gradients for $F$: $G_a = (1 - D_{i,j})\, e_i$ and $G_b = (f_j - D_{i,j})\, e_j$.
SGD algorithm:
• Sample $i \in \{1, \ldots, I\}$ with probability $\mu_i$,
• Sample $j \in \{1, \ldots, J\}$ with probability $\beta_j$,
• Compute $G_a$ and $G_b$,
• $a \leftarrow a + \gamma_t G_a$,
• $b \leftarrow b + \gamma_t G_b$.

24. Stochastic Gradient Descent. We only need to update $a$ and $b$ one coefficient at a time (sketched in code below):
• Sample $i \in \{1, \ldots, I\}$ with probability $\mu_i$,
• Sample $j \in \{1, \ldots, J\}$ with probability $\beta_j$,
• Compute $f_j$ and $D_{i,j}$,
• $a_i \leftarrow a_i + \gamma_t (1 - D_{i,j})$,
• $b_j \leftarrow b_j + \gamma_t (f_j - D_{i,j})$.
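
Below is a minimal NumPy sketch of one such coordinate update, with our own variable names and an $O(J)$ recomputation of the normalizing sum behind $f_j$ (the memorization trick on the next slide removes that $O(J)$ cost). It illustrates the update rule above; it is not the authors' implementation.

```python
import numpy as np

def sgd_step(a, b, C, mu, beta, eps, eta, gamma_t, rng):
    """One stochastic update of the dual variables (a, b); only a[i] and b[j] change."""
    i = rng.choice(len(mu), p=mu)                  # sample i with probability mu_i
    j = rng.choice(len(beta), p=beta)              # sample j with probability beta_j
    D_ij = np.exp((a[i] + b[j] - C[i, j]) / eps)
    S = np.sum(beta * np.exp(-b / (eta - eps)))    # normalizing sum of nu(b); O(J) here
    f_j = np.exp(-b[j] / (eta - eps)) / S          # f_j = nu_j(b) / beta_j
    a[i] += gamma_t * (1.0 - D_ij)
    b[j] += gamma_t * (f_j - D_ij)
    return a, b
```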

25. The sum memorization trick. The computation of $D_{i,j}(a, b) = \exp\big(\frac{a_i + b_j - C_{i,j}}{\varepsilon}\big)$ is $O(1)$, and so is $f_j = \nu_j(b)/\beta_j$ once the normalizing sum of $\nu_j(b) = \frac{\beta_j\, e^{-b_j/(\eta - \varepsilon)}}{\sum_k \beta_k\, e^{-b_k/(\eta - \varepsilon)}}$ is available. Naively that sum costs $O(J)$, but we can keep every step $O(1)$ by maintaining $S^{(t)} = \sum_k \beta_k\, e^{-b_k^{(t)}/(\eta - \varepsilon)}$, updated as $S^{(t+1)} = S^{(t)} + \beta_j\, e^{-b_j^{(t+1)}/(\eta - \varepsilon)} - \beta_j\, e^{-b_j^{(t)}/(\eta - \varepsilon)}$.
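
In code, the trick is just an $O(1)$ correction of the stored sum for the single coordinate that changed; a sketch with our own function and variable names:

```python
import numpy as np

def update_bj_and_S(b, S, j, delta, beta, eta, eps):
    """Apply b[j] += delta and correct S = sum_k beta_k * exp(-b_k/(eta - eps)) in O(1)."""
    old_term = beta[j] * np.exp(-b[j] / (eta - eps))
    b[j] += delta
    new_term = beta[j] * np.exp(-b[j] / (eta - eps))
    return b, S + new_term - old_term
```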

26. Convergence Bounds. With stepsize $\gamma_t = \frac{1}{\lambda t}$, the estimator satisfies $\mathbb{E}[\mathrm{KL}(\nu^*, \nu_t)] \le \frac{C_1\,(1 + \log t)}{(\eta - \varepsilon)\,\lambda^2\, t}$. With stepsize $\gamma_t = \frac{C_2}{\sqrt{t}}$, the estimator satisfies the bound $\mathbb{E}[\mathrm{KL}(\nu^*, \nu_t)] \le \frac{C_3\,(2 + \log t)}{(\eta - \varepsilon)\,\lambda\,\sqrt{t}}$.

27. Simulations. Figure 5: convergence of the gradient norm for different dimensions.

28. Application to Wasserstein Barycenters.
Wasserstein barycenter: $\min_\nu \sum_{k=1}^K \theta_k\, \mathrm{OT}(\mu_k, \nu)$.
Doubly regularized Wasserstein barycenter: $\min_\nu \sum_{k=1}^K \theta_k\, \mathrm{OT}_\varepsilon(\mu_k, \nu) + \eta\,\mathrm{KL}(\nu, \beta)$.

29. Conclusion. Takeaways:
• Wasserstein estimators are "projections" according to Wasserstein distances,
• Two layers of entropic regularization are used here,
• It is then possible to compute stochastic gradients in $O(1)$ for this problem,
• The results are also valid for Wasserstein barycenters.
Thank you for your attention!
