Stochastic Optimization for Regularized Wasserstein Estimators (ICML 2020) PowerPoint PPT Presentation
SLIDE 1

Stochastic Optimization for Regularized Wasserstein Estimators

ICML 2020

Marin Ballu, Quentin Berthet, Francis Bach

SLIDE 2

Wasserstein Distance: a natural geometry for distributions

How does one compute the distance between two data distributions?

SLIDE 3

Wasserstein Distance: a natural geometry for distributions

How does one compute the distance between two data distributions?

  • Relative entropy and other f-divergences allow classical statistical approaches.

SLIDE 4

Wasserstein Distance: a natural geometry for distributions

How does one compute the distance between two data distributions?

  • Relative entropy and other f-divergences allow classical statistical approaches.

  • Optimal transport theory allows us to capture the geometry of the data distributions, with the Wasserstein distance:

$W_c(\mu, \nu) = \mathrm{OT}(\mu, \nu) = \min_{T_\# \mu = \nu} \mathbb{E}_{X \sim \mu}\left[c(X, T(X))\right]$

SLIDE 5

Wasserstein Distance: a natural geometry for distributions

How does one compute the distance between two data distributions?

  • Relative entropy and other f-divergences allow classical statistical approaches.

  • Optimal transport theory allows us to capture the geometry of the data distributions, with the Wasserstein distance:

$W_c(\mu, \nu) = \mathrm{OT}(\mu, \nu) = \min_{\pi \in \Pi(\mu, \nu)} \mathbb{E}_{(X, Y) \sim \pi}\left[c(X, Y)\right]$
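For intuition, the discrete version of this problem is a linear program over couplings. The following is a minimal sketch with scipy, assuming small illustrative data (the point locations, weights, and variable names are not from the slides):

```python
import numpy as np
from scipy.optimize import linprog

# Two small discrete distributions mu (on points x) and nu (on points y),
# with squared-distance cost c(x, y) = |x - y|^2.
x, mu = np.array([0.0, 1.0, 2.0]), np.array([0.5, 0.3, 0.2])
y, nu = np.array([0.5, 1.5]), np.array([0.4, 0.6])
C = (x[:, None] - y[None, :]) ** 2        # cost matrix C[i, j]

I, J = len(mu), len(nu)
# Marginal constraints: sum_j pi[i, j] = mu[i] and sum_i pi[i, j] = nu[j].
A_eq = np.zeros((I + J, I * J))
for i in range(I):
    A_eq[i, i * J:(i + 1) * J] = 1.0      # row-sum constraints
for j in range(J):
    A_eq[I + j, j::J] = 1.0               # column-sum constraints
b_eq = np.concatenate([mu, nu])

# min <C, pi> over couplings pi >= 0 with the prescribed marginals.
res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
print("OT(mu, nu) =", res.fun)
```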

SLIDE 6

Wasserstein distance in machine learning

  • Wasserstein GAN (Arjovsky et al., 2017)
  • Wasserstein Discriminant Analysis (Flamary et al., 2018)
  • Clustered point-matching (Alvarez-Melis et al., 2018)

SLIDE 7

Wasserstein distance in machine learning

  • Diffeomorphic registration (Feydy et al., 2017)
  • Alignment of embeddings (Grave et al., 2019)
  • Sinkhorn divergence for generative models (Genevay et al., 2019)

SLIDE 8

Our contribution

We consider the minimum Kantorovich estimator (Bassetti et al., 2006), or Wasserstein estimator of the measure µ:

$\min_{\nu \in \mathcal{M}} \mathrm{OT}(\mu, \nu),$

which is often used for $\mu = \sum_i \delta_{x_i}$ to fit a parametric model $\mathcal{M}$ (as with MLE, where the KL divergence replaces OT).

[Figure: measure µ, model M, fitted ν, distance OT(µ, ν)]

SLIDE 9

Our contribution

  • We add two layers of entropic regularization.
  • We propose a new stochastic optimization scheme to minimize the regularized problem.
  • Time per step is sublinear in the natural dimension of the problem.
  • We provide theoretical guarantees and simulations.

SLIDE 10

Regularized Wasserstein Distance

Wasserstein distance:

$W_c(\mu, \nu) = \mathrm{OT}(\mu, \nu) = \min_{\pi \in \Pi(\mu, \nu)} \mathbb{E}_{(X, Y) \sim \pi}\left[c(X, Y)\right]$

SLIDE 11

Regularized Wasserstein Distance

Wasserstein distance:

$W_c(\mu, \nu) = \mathrm{OT}(\mu, \nu) = \min_{\pi \in \Pi(\mu, \nu)} \mathbb{E}_{(X, Y) \sim \pi}\left[c(X, Y)\right]$

Regularized Wasserstein distance:

$\mathrm{OT}_\varepsilon(\mu, \nu) = \min_{\pi \in \Pi(\mu, \nu)} \mathbb{E}_{(X, Y) \sim \pi}\left[c(X, Y)\right] + \varepsilon\,\mathrm{KL}(\pi, \mu \otimes \nu)$

Computed at light speed by the Sinkhorn algorithm (Cuturi, 2013) or by SGD on the dual problem (Genevay et al., 2016).
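A minimal numpy sketch of the Sinkhorn iterations for the discrete entropic problem above (the function name, fixed iteration count, and absence of a convergence check are simplifications, not taken from the paper):

```python
import numpy as np

def sinkhorn(mu, nu, C, eps, n_iters=500):
    """Entropic OT between discrete measures mu and nu with cost matrix C.
    Returns the plan pi and OT_eps(mu, nu) = <pi, C> + eps * KL(pi, mu x nu)."""
    K = np.exp(-C / eps)                 # Gibbs kernel
    u, v = np.ones_like(mu), np.ones_like(nu)
    for _ in range(n_iters):             # alternating projections onto the marginals
        u = mu / (K @ v)
        v = nu / (K.T @ u)
    pi = u[:, None] * K * v[None, :]     # coupling with (approximate) marginals mu, nu
    kl = np.sum(pi * np.log(pi / (mu[:, None] * nu[None, :])))
    return pi, np.sum(pi * C) + eps * kl
```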

SLIDE 12

Regularized Wasserstein Estimator

Wasserstein estimator: $\min_{\nu \in \mathcal{M}} \mathrm{OT}(\mu, \nu)$

SLIDE 13

Regularized Wasserstein Estimator

Wasserstein estimator: $\min_{\nu \in \mathcal{M}} \mathrm{OT}(\mu, \nu)$

First layer of regularization: $\min_{\nu \in \mathcal{M}} \mathrm{OT}_\varepsilon(\mu, \nu)$

SLIDE 14

Regularized Wasserstein Estimator

Wasserstein estimator: $\min_{\nu \in \mathcal{M}} \mathrm{OT}(\mu, \nu)$

First layer of regularization: $\min_{\nu \in \mathcal{M}} \mathrm{OT}_\varepsilon(\mu, \nu)$

Second layer of regularization: $\min_{\nu \in \mathcal{M}} \mathrm{OT}_\varepsilon(\mu, \nu) + \eta\,\mathrm{KL}(\nu, \beta)$
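To make the two layers concrete, here is a hedged sketch evaluating this doubly regularized objective for a candidate discrete measure ν on a fixed support, reusing the sinkhorn() sketch above (the function and argument names are illustrative, not from the paper):

```python
def regularized_estimator_objective(nu, mu, beta, C, eps, eta):
    """OT_eps(mu, nu) + eta * KL(nu, beta) for a candidate discrete measure nu."""
    _, ot_eps = sinkhorn(mu, nu, C, eps)          # first layer: entropic OT
    kl_nu_beta = np.sum(nu * np.log(nu / beta))   # second layer: KL(nu, beta)
    return ot_eps + eta * kl_nu_beta
```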

SLIDE 15

First layer: Gaussian deconvolution

This is a recent interpretation (Rigollet and Weed, 2018). Let $X_i$ be i.i.d. random variables following $\nu^*$, let $Z_i \sim \varphi_\varepsilon = N(0, \varepsilon I_d)$ be i.i.d. Gaussian noise, and let $Y_i = X_i + Z_i$ be the perturbed observations, with distribution $\mu$.

$X_i \sim \nu^* \;\longrightarrow\; Y_i = X_i + Z_i \sim \varphi_\varepsilon * \nu^*$

SLIDE 16

First layer: Gaussian deconvolution

For $c(x, y) = \|x - y\|^2$, the MLE for $\nu^*$ is

$\hat{\nu} := \arg\max_{\nu \in \mathcal{M}} \sum_i \log(\varphi_\varepsilon * \nu)(Y_i) \quad\Longleftrightarrow\quad \hat{\nu} = \arg\min_{\nu \in \mathcal{M}} \mathrm{OT}_\varepsilon(\mu, \nu).$

$X_i \sim \nu^* \;\longleftarrow\; Y_i \sim \varphi_\varepsilon * \nu^*$  (deconvolution: recover the law of $X_i$ from the observations $Y_i = X_i + Z_i$)

SLIDE 17

First layer: adds entropy to the transport matrix

Figure 1: Small regularization ε = 0.01
Figure 2: Big regularization ε = 0.1

SLIDE 18

Second Layer: Interpolation with likelihood estimators

Wasserstein Estimator: $\min_{\nu \in \mathcal{M}} \mathrm{OT}(\mu, \nu)$

Maximum Likelihood Estimator: $\min_{\nu \in \mathcal{M}} \mathrm{KL}(\nu, \beta)$

Regularized Wasserstein Estimator: $\min_{\nu \in \mathcal{M}} \mathrm{OT}_\varepsilon(\mu, \nu) + \eta\,\mathrm{KL}(\nu, \beta)$

SLIDE 19

Second Layer: adds entropy to the target measure

Figure 3: Small regularization η = 0.02
Figure 4: Big regularization η = 0.2

SLIDE 20

Dual Formulation of the problem

$\min_{\nu \in \mathcal{M}} \mathrm{OT}_\varepsilon(\mu, \nu) + \eta\,\mathrm{KL}(\nu, \beta)$

with $\mathrm{OT}_\varepsilon(\mu, \nu) = \min_{\pi \in \Pi(\mu, \nu)} \mathbb{E}_{(X, Y) \sim \pi}\left[c(X, Y)\right] + \varepsilon\,\mathrm{KL}(\pi, \mu \otimes \nu)$

is

$\min_{\nu \in \mathcal{M}} \min_{\pi \in \Pi(\mu, \nu)} \mathbb{E}_{(X, Y) \sim \pi}\left[c(X, Y)\right] + \varepsilon\,\mathrm{KL}(\pi, \mu \otimes \nu) + \eta\,\mathrm{KL}(\nu, \beta).$

We consider the dual of the second min.

SLIDE 21

Dual Formulation

The dual problem can be written as a saddle-point problem, where the min and the max can be swapped. The final formulation is of the form

$\max_{(a, b) \in \mathbb{R}^I \times \mathbb{R}^J} F(a, b).$

SLIDE 22

Properties of the function F in the discrete case

  • 1. F is λ-strongly concave on the hyperplane $E = \{\sum_i \mu_i a_i = \sum_j \beta_j b_j\}$.
  • 2. There exists a solution of $\max_{(a, b) \in \mathbb{R}^I \times \mathbb{R}^J} F(a, b)$ which lies in E, and it is unique.
  • 3. The gradients of F can be written as expectations (over $i \sim \mu$, $j \sim \beta$):

$\nabla_a F = \mathbb{E}\left[(1 - D_{i,j})\, e_i\right], \qquad \nabla_b F = \mathbb{E}\left[(f_j - D_{i,j})\, e_j\right],$

with $D_{i,j}(a, b) = \exp\left(\frac{a_i + b_j - C_{i,j}}{\varepsilon}\right)$ and $f_j = \frac{\nu_j(b)}{\beta_j}$.
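As a consistency check (a reconstruction from the stated gradients, not a formula quoted from the slides), one function F whose gradients match the expressions above, up to additive constants, is

$F(a, b) = \sum_i \mu_i a_i \;-\; (\eta - \varepsilon) \log\Big(\sum_j \beta_j e^{-b_j/(\eta - \varepsilon)}\Big) \;-\; \varepsilon \sum_{i,j} \mu_i \beta_j \, e^{(a_i + b_j - C_{i,j})/\varepsilon},$

since then $\partial_{a_i} F = \mu_i \big(1 - \sum_j \beta_j D_{i,j}\big)$ and $\partial_{b_j} F = \nu_j(b) - \beta_j \sum_i \mu_i D_{i,j}$, which are exactly $\mathbb{E}[(1 - D_{i,j}) e_i]$ and $\mathbb{E}[(f_j - D_{i,j}) e_j]$ for $i \sim \mu$, $j \sim \beta$, with $\nu_j(b)$ as given on the sum-memorization slide below.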

SLIDE 23

Stochastic Gradient Descent

We have stochastic gradients for F: $G_a = (1 - D_{i,j})\, e_i$ and $G_b = (f_j - D_{i,j})\, e_j$. SGD algorithm:

  • Sample $i \in \{1, \ldots, I\}$ with probability $\mu_i$,
  • Sample $j \in \{1, \ldots, J\}$ with probability $\beta_j$,
  • Compute $G_a$ and $G_b$,
  • $a \leftarrow a + \gamma_t G_a$,
  • $b \leftarrow b + \gamma_t G_b$.

SLIDE 24

Stochastic Gradient Descent

We only have to compute one coefficient of a and of b at a time (see the sketch after this list):

  • Sample $i \in \{1, \ldots, I\}$ with probability $\mu_i$,
  • Sample $j \in \{1, \ldots, J\}$ with probability $\beta_j$,
  • Compute $f_j$ and $D_{i,j}$,
  • $a_i \leftarrow a_i + \gamma_t (1 - D_{i,j})$,
  • $b_j \leftarrow b_j + \gamma_t (f_j - D_{i,j})$.
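A hedged Python sketch of one such coordinate step (array and argument names are illustrative; the naive O(J) computation of f_j here is exactly what the sum-memorization trick on the next slide removes):

```python
import numpy as np

def sgd_step(a, b, C, mu, beta, eps, eta, gamma_t, rng):
    """One stochastic ascent step on F, touching one coordinate of a and one of b."""
    i = rng.choice(len(mu), p=mu)              # sample i ~ mu
    j = rng.choice(len(beta), p=beta)          # sample j ~ beta
    D_ij = np.exp((a[i] + b[j] - C[i, j]) / eps)
    w = beta * np.exp(-b / (eta - eps))        # naive O(J) normalization of nu(b)
    f_j = w[j] / (beta[j] * w.sum())           # f_j = nu_j(b) / beta_j
    a[i] += gamma_t * (1.0 - D_ij)
    b[j] += gamma_t * (f_j - D_ij)
    return a, b
```

Here rng is a numpy Generator, e.g. np.random.default_rng().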

SLIDE 25

The sum memorization trick

The computation of $D_{i,j}(a, b) = \exp\left(\frac{a_i + b_j - C_{i,j}}{\varepsilon}\right)$ is $O(1)$. However, $f_j = \frac{\nu_j(b)}{\beta_j}$ involves

$\nu_j(b) = \frac{\beta_j e^{-b_j/(\eta - \varepsilon)}}{\sum_k \beta_k e^{-b_k/(\eta - \varepsilon)}},$

whose normalizing sum naively costs $O(J)$, but we can do it in $O(1)$ if we maintain

$S^{(t)} = \sum_k \beta_k e^{-b_k^{(t)}/(\eta - \varepsilon)},$

updated with $S^{(t+1)} = S^{(t)} + \beta_j e^{-b_j^{(t+1)}/(\eta - \varepsilon)} - \beta_j e^{-b_j^{(t)}/(\eta - \varepsilon)}$.
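A hedged sketch of the same coordinate step with the running sum S maintained across iterations, so that every operation is O(1) (the function name and initialization are illustrative):

```python
def sgd_step_memo(a, b, S, C, mu, beta, eps, eta, gamma_t, rng):
    """One O(1) stochastic step; S memorizes sum_k beta_k * exp(-b_k / (eta - eps))."""
    i = rng.choice(len(mu), p=mu)
    j = rng.choice(len(beta), p=beta)
    D_ij = np.exp((a[i] + b[j] - C[i, j]) / eps)
    old_term = beta[j] * np.exp(-b[j] / (eta - eps))
    f_j = old_term / (beta[j] * S)            # f_j = nu_j(b) / beta_j, in O(1)
    a[i] += gamma_t * (1.0 - D_ij)
    b[j] += gamma_t * (f_j - D_ij)            # only b_j changes, so S is easy to patch
    S += beta[j] * np.exp(-b[j] / (eta - eps)) - old_term
    return a, b, S

# Illustrative initialization: a = np.zeros(I), b = np.zeros(J), S = beta.sum().
```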

SLIDE 26

Convergence Bounds

With stepsize $\gamma_t = \frac{1}{\lambda t}$, the estimator verifies

$\mathbb{E}\left[\mathrm{KL}(\nu^*, \nu_t)\right] \le \frac{C_1}{(\eta - \varepsilon)\,\lambda^2} \cdot \frac{1 + \log t}{t}.$

With stepsize $\gamma_t = \frac{C_2}{\sqrt{t}}$, the estimator verifies the following bound:

$\mathbb{E}\left[\mathrm{KL}(\nu^*, \nu_t)\right] \le \frac{C_3}{(\eta - \varepsilon)\,\lambda} \cdot \frac{2 + \log t}{\sqrt{t}}.$

SLIDE 27

Simulations

Figure 5: Convergence of the gradient norm for different dimensions.

SLIDE 28

Application to Wasserstein Barycenters

Wasserstein barycenter:

$\min_{\nu} \sum_{k=1}^{K} \theta_k \, \mathrm{OT}(\mu_k, \nu).$

Doubly regularized Wasserstein barycenter:

$\min_{\nu} \sum_{k=1}^{K} \theta_k \, \mathrm{OT}_\varepsilon(\mu_k, \nu) + \eta\,\mathrm{KL}(\nu, \beta).$
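A hedged sketch of this doubly regularized barycenter objective, assuming for simplicity that all measures share the same support and cost matrix C, and reusing the sinkhorn() sketch from earlier:

```python
def barycenter_objective(nu, mus, thetas, beta, C, eps, eta):
    """sum_k theta_k * OT_eps(mu_k, nu) + eta * KL(nu, beta) for a candidate nu."""
    ot_part = sum(th * sinkhorn(mu_k, nu, C, eps)[1] for th, mu_k in zip(thetas, mus))
    return ot_part + eta * np.sum(nu * np.log(nu / beta))
```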

SLIDE 29

Conclusion

Takeaways:

  • Wasserstein estimators are "projections" according to Wasserstein distances,
  • Two layers of entropic regularization are used here,
  • It is then possible to compute stochastic gradients in $O(1)$ for this problem,
  • The results are also valid for Wasserstein barycenters.

Thank you for your attention!
