

slide-1
SLIDE 1

Online Sinkhorn: Optimal Transport distances from sample streams

Arthur Mensch, joint work with Gabriel Peyré

École Normale Supérieure, Département de Mathématiques et Applications, Paris, France

CIRM, 3/12/2020

slide-2
SLIDE 2

Optimal transport for machine learning

Density fitting

1 / 29

slide-3
SLIDE 3

Optimal transport for machine learning

Density fitting
Distance between points

1 / 29

slide-4
SLIDE 4

Optimal transport for machine learning

Density fitting
Distance between points
Distance between distributions: α ∈ P(X), β ∈ P(X)
Dependency on the cost C : X × X → R: W(α, β, C)

1 / 29

slide-5
SLIDE 5

The trouble with optimal transport

Figure: StyleGAN2 (neural network, sample)

In ML, at least one distribution is not a discrete measure α = (1/n) Σ_{i=1}^n δ_{x_i}

Algorithms for OT work with discrete distributions

2 / 29

slide-6
SLIDE 6

The trouble with optimal transport

Figure: StyleGAN2 (neural network, sample)

In ML, at least one distribution is not a discrete measure α = (1/n) Σ_{i=1}^n δ_{x_i}

Algorithms for OT work with discrete distributions
Need for consistent estimators of W(α, β)
Using streams of samples (x_t)_t, (y_t)_t from α and β

2 / 29

slide-7
SLIDE 7

The trouble with optimal transport

Figure: StyleGAN2 (neural network, sample)

In ML, at least one distribution is not a discrete measure α = (1/n) Σ_{i=1}^n δ_{x_i}

Algorithms for OT work with discrete distributions
Need for consistent estimators of W(α, β) and its backward operator
Using streams of samples (x_t)_t, (y_t)_t from α and β

2 / 29

slide-8
SLIDE 8

Outline

1. Tractable algorithms for optimal transport

2. Online Sinkhorn

3 / 29

slide-9
SLIDE 9

Wasserstein distance (Kantorovich, 1942)

α = Σ_{i=1}^n a_i δ_{x_i}    β = Σ_{j=1}^m b_j δ_{y_j}    C = (C(x_i, y_j))_{i,j}

α ∈ P(X): positions x = (x_i)_i ∈ X, weights a = (a_i)_i ∈ △_n

4 / 29

slide-10
SLIDE 10

Wasserstein distance (Kantorovich, 1942)

α = Σ_{i=1}^n a_i δ_{x_i}    β = Σ_{j=1}^m b_j δ_{y_j}    C = (C(x_i, y_j))_{i,j}

α ∈ P(X): positions x = (x_i)_i ∈ X, weights a = (a_i)_i ∈ △_n

P ∈ △_{n×m}, P1 = a, Pᵀ1 = b

4 / 29

slide-11
SLIDE 11

Wasserstein distance (Kantorovich, 1942)

α = Σ_{i=1}^n a_i δ_{x_i}    β = Σ_{j=1}^m b_j δ_{y_j}    C = (C(x_i, y_j))_{i,j}

α ∈ P(X): positions x = (x_i)_i ∈ X, weights a = (a_i)_i ∈ △_n

P ∈ △_{n×m}, P1 = a, Pᵀ1 = b

Cost: Σ_{i,j} P_{i,j} C_{i,j} = ⟨P, C⟩

4 / 29

slide-12
SLIDE 12

Wasserstein distance (Kantorovich, 1942)

W_C(α, β) = min_{P ∈ △_{n×m}, P1 = a, Pᵀ1 = b} ⟨P, C⟩

[1] M. Cuturi. “Sinkhorn Distances: Lightspeed Computation of Optimal Transport”. In: Advances in Neural Information Processing Systems. 2013.

5 / 29
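As an illustration of the discrete Kantorovich problem above (not part of the talk), the minimization over plans P with fixed marginals can be solved as a linear program with scipy; the toy point clouds, uniform weights and squared-distance cost below are assumptions made for the example.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, m = 5, 4
x, y = rng.normal(size=(n, 1)), rng.normal(size=(m, 1))
a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)       # weights in the simplex
C = (x - y.T) ** 2                                    # cost matrix C(x_i, y_j)

# Marginal constraints P1 = a and P^T 1 = b on the flattened plan vec(P).
A_row = np.kron(np.eye(n), np.ones((1, m)))           # row sums of P
A_col = np.kron(np.ones((1, n)), np.eye(m))           # column sums of P
res = linprog(c=C.ravel(),
              A_eq=np.vstack([A_row, A_col]),
              b_eq=np.concatenate([a, b]),
              bounds=(0, None))
P = res.x.reshape(n, m)                               # optimal transport plan
print("W_C(alpha, beta) =", res.fun)
```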

slide-13
SLIDE 13

Wasserstein distance (Kantorovich, 1942)

W_C(α, β) = min_{P ∈ △_{n×m}, P1 = a, Pᵀ1 = b} ⟨P, C⟩

Entropic regularization [1]: W(α, β) = W_C^1(α, β) = min_{P ∈ △_{n×m}, P1 = a, Pᵀ1 = b} ⟨P, C⟩ + KL(P | a ⊗ b)

[1] M. Cuturi. “Sinkhorn Distances: Lightspeed Computation of Optimal Transport”. In: Advances in Neural Information Processing Systems. 2013.

5 / 29

slide-14
SLIDE 14

Working with continuous objects

C : X × X → R (functions)    α, β ∈ P(X) (distributions)    π ∈ P(X × X) with marginals:
∫_y dπ(·, y) = dα(·)    ∫_x dπ(x, ·) = dβ(·)

6 / 29

slide-15
SLIDE 15

Working with continuous objects

C : X × X → R (functions)    α, β ∈ P(X) (distributions)    π ∈ P(X × X) with marginals:
∫_y dπ(·, y) = dα(·)    ∫_x dπ(x, ·) = dβ(·)

W(α, β) = min_{π ∈ U(α,β)} ⟨π, C⟩ + KL(π | α ⊗ β)
        = min_{π ∈ U(α,β)} ∫_{x,y} C(x, y) dπ(x, y) + ∫_{x,y} log (dπ / dα dβ)(x, y) dπ(x, y)

6 / 29

slide-16
SLIDE 16

Working with continuous objects

C : X × X → R (functions)    α, β ∈ P(X) (distributions)    π ∈ P(X × X) with marginals:
∫_y dπ(·, y) = dα(·)    ∫_x dπ(x, ·) = dβ(·)

W(α, β) = min_{π ∈ U(α,β)} ⟨π, C⟩ + KL(π | α ⊗ β)
        = min_{π ∈ U(α,β)} ∫_{x,y} C(x, y) dπ(x, y) + ∫_{x,y} log (dπ / dα dβ)(x, y) dπ(x, y)

Discrete case: π = Σ_{i,j} P_{i,j} δ_{(x_i, y_j)}

6 / 29

slide-17
SLIDE 17

Computing W using matrix scaling

W(α, β) = min_{π ∈ U(α,β)} ⟨π, C⟩ + KL(π | α ⊗ β)

Fenchel-Rockafellar dual [2]: W(α, β) = max_{f,g ∈ C(X)} ⟨α, f⟩ + ⟨β, g⟩ − ⟨α ⊗ β, exp(f ⊕ g − C)⟩ + 1

[2] R. T. Rockafellar. “Monotone operators associated with saddle-functions and minimax problems”. In: Proceedings of Symposia in Pure Mathematics. Vol. 18.1. 1970, pp. 241–250.
[3] Richard Sinkhorn. “A relationship between arbitrary positive matrices and doubly stochastic matrices”. In: The Annals of Mathematical Statistics 35 (1964), pp. 876–879.

7 / 29

slide-18
SLIDE 18

Computing W using matrix scaling

W(α, β) = min_{π ∈ U(α,β)} ⟨π, C⟩ + KL(π | α ⊗ β)

Fenchel-Rockafellar dual [2]: W(α, β) = max_{f,g ∈ C(X)} ⟨α, f⟩ + ⟨β, g⟩ − ⟨α ⊗ β, exp(f ⊕ g − C)⟩ + 1

Alternated maximisation: Sinkhorn-Knopp algorithm [3]
f_t(·) = T_β(g_{t−1})(·) = − log ∫_{y∈X} exp(g_{t−1}(y) − C(·, y)) dβ(y)
g_t(·) = T_α(f_t)(·) = − log ∫_{x∈X} exp(f_t(x) − C(x, ·)) dα(x)

[2] R. T. Rockafellar. “Monotone operators associated with saddle-functions and minimax problems”. In: Proceedings of Symposia in Pure Mathematics. Vol. 18.1. 1970, pp. 241–250.
[3] Richard Sinkhorn. “A relationship between arbitrary positive matrices and doubly stochastic matrices”. In: The Annals of Mathematical Statistics 35 (1964), pp. 876–879.

7 / 29

slide-19
SLIDE 19

Computing W using matrix scaling

W(α, β) = min_{π ∈ U(α,β)} ⟨π, C⟩ + KL(π | α ⊗ β)

Fenchel-Rockafellar dual [2]: W(α, β) = max_{f,g ∈ C(X)} ⟨α, f⟩ + ⟨β, g⟩ − ⟨α ⊗ β, exp(f ⊕ g − C)⟩ + 1

Alternated maximisation: Sinkhorn-Knopp algorithm [3]
f⋆(·) = T_β(g⋆)(·) = − log ∫_{y∈X} exp(g⋆(y) − C(·, y)) dβ(y)
g⋆(·) = T_α(f⋆)(·) = − log ∫_{x∈X} exp(f⋆(x) − C(x, ·)) dα(x)

[2] R. T. Rockafellar. “Monotone operators associated with saddle-functions and minimax problems”. In: Proceedings of Symposia in Pure Mathematics. Vol. 18.1. 1970, pp. 241–250.
[3] Richard Sinkhorn. “A relationship between arbitrary positive matrices and doubly stochastic matrices”. In: The Annals of Mathematical Statistics 35 (1964), pp. 876–879.

7 / 29

slide-20
SLIDE 20

Computing W using matrix scaling

W(α, β) = min_{π ∈ U(α,β)} ⟨π, C⟩ + KL(π | α ⊗ β)

Fenchel-Rockafellar dual [2] (non strongly convex): W(α, β) = max_{f,g ∈ C(X)} ⟨α, f⟩ + ⟨β, g⟩ − ⟨α ⊗ β, exp(f ⊕ g − C)⟩ + 1

Alternated maximisation: Sinkhorn-Knopp algorithm [3]
f⋆(·) = T_β(g⋆)(·) = − log ∫_{y∈X} exp(g⋆(y) − C(·, y)) dβ(y)
g⋆(·) = T_α(f⋆)(·) = − log ∫_{x∈X} exp(f⋆(x) − C(x, ·)) dα(x)

[2] R. T. Rockafellar. “Monotone operators associated with saddle-functions and minimax problems”. In: Proceedings of Symposia in Pure Mathematics. Vol. 18.1. 1970, pp. 241–250.
[3] Richard Sinkhorn. “A relationship between arbitrary positive matrices and doubly stochastic matrices”. In: The Annals of Mathematical Statistics 35 (1964), pp. 876–879.

7 / 29

slide-21
SLIDE 21

Implementing Sinkhorn algorithm

Discrete distributions: α = Σ_{i=1}^n a_i δ_{x_i}    β = Σ_{j=1}^m b_j δ_{y_j}    C = (C(x_i, y_j))_{i,j}

Repeat until convergence:
f_t(x_i) = − log Σ_j b_j exp(g_{t−1}(y_j) − C(x_i, y_j))
g_t(y_j) = − log Σ_i a_i exp(f_t(x_i) − C(x_i, y_j))

8 / 29

slide-22
SLIDE 22

Implementing Sinkhorn algorithm

Discrete distributions: α = Σ_{i=1}^n a_i δ_{x_i}    β = Σ_{j=1}^m b_j δ_{y_j}    C = (C(x_i, y_j))_{i,j}

Repeat until convergence:
f_t(x_i) = − log Σ_j b_j exp(g_{t−1}(y_j) − C(x_i, y_j))
g_t(y_j) = − log Σ_i a_i exp(f_t(x_i) − C(x_i, y_j))

Finite representation of potentials / transportation plan

8 / 29
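A minimal numpy sketch of these discrete updates (an illustration, not the authors' code), written in the log domain with logsumexp for stability and with the regularization fixed to 1 as in the slides; the returned value is the dual objective ⟨a, f⟩ + ⟨b, g⟩ − ⟨a ⊗ b, exp(f ⊕ g − C)⟩ + 1.

```python
import numpy as np
from scipy.special import logsumexp

def sinkhorn(a, b, C, n_iter=200):
    """Sinkhorn-Knopp on discrete measures: returns dual potentials (f, g) and W."""
    log_a, log_b = np.log(a), np.log(b)
    f, g = np.zeros(len(a)), np.zeros(len(b))
    for _ in range(n_iter):
        # f_t(x_i) = -log sum_j b_j exp(g_{t-1}(y_j) - C(x_i, y_j))
        f = -logsumexp(log_b[None, :] + g[None, :] - C, axis=1)
        # g_t(y_j) = -log sum_i a_i exp(f_t(x_i) - C(x_i, y_j))
        g = -logsumexp(log_a[:, None] + f[:, None] - C, axis=0)
    # Dual objective <a, f> + <b, g> - <a x b, exp(f + g - C)> + 1
    W = a @ f + b @ g - np.exp(log_a[:, None] + log_b[None, :]
                               + f[:, None] + g[None, :] - C).sum() + 1.0
    return f, g, W
```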

slide-23
SLIDE 23

Distances between continuous distributions

Classic approach:
α, β --(sampling)--> α̂, β̂ --(cost)--> C = (C(x_i, y_j))_{i,j} --(Sinkhorn)--> W(α̂, β̂)

α̂ = (1/b) Σ_{i=1}^b δ_{x_i}    β̂ = (1/b) Σ_{i=1}^b δ_{y_i}

9 / 29

slide-24
SLIDE 24

Distances between continuous distributions

Classic approach:
α, β --(sampling)--> α̂, β̂ --(cost)--> C = (C(x_i, y_j))_{i,j} --(Sinkhorn)--> W(α̂, β̂)

α̂ = (1/b) Σ_{i=1}^b δ_{x_i}    β̂ = (1/b) Σ_{i=1}^b δ_{y_i}

Sampling once, approximation W(α̂, β̂) ≈ W(α, β)

9 / 29

slide-25
SLIDE 25

Distances between continuous distributions

Classic approach:
α, β --(sampling)--> α̂, β̂ --(cost)--> C = (C(x_i, y_j))_{i,j} --(Sinkhorn)--> W(α̂, β̂)

Sampling once, approximation W(α̂, β̂) ≈ W(α, β)

Our approach:
α, β --(repeated sampling)--> (α̂_t, β̂_t)_t --(cost + transform)--> (f_t, g_t)_t

α̂_t = (1/b_t) Σ_{i=n_t}^{n_{t+1}} δ_{x_i}    β̂_t = (1/b_t) Σ_{i=n_t}^{n_{t+1}} δ_{y_i}

9 / 29

slide-26
SLIDE 26

Distances between continuous distributions

Classic approach:
α, β --(sampling)--> α̂, β̂ --(cost)--> C = (C(x_i, y_j))_{i,j} --(Sinkhorn)--> W(α̂, β̂)

Sampling once, approximation W(α̂, β̂) ≈ W(α, β)

Our approach:
α, β --(repeated sampling)--> (α̂_t, β̂_t)_t --(cost + transform)--> (f_t, g_t)_t

α̂_t = (1/b_t) Σ_{i=n_t}^{n_{t+1}} δ_{x_i}    β̂_t = (1/b_t) Σ_{i=n_t}^{n_{t+1}} δ_{y_i}

Consistent estimation: (f_t, g_t) → (f⋆, g⋆), W_t → W(α, β)

9 / 29
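For contrast with the streaming approach, here is a small sketch of the classic plug-in pipeline above (sample once, build the cost, run Sinkhorn on the empirical measures); the Gaussian toy distributions, sample size and squared-distance cost are assumptions of the example.

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)
n = 500
x = rng.normal(0.0, 1.0, size=(n, 2))                    # one batch of samples from alpha
y = rng.normal(1.0, 1.0, size=(n, 2))                    # one batch of samples from beta
a = b = np.full(n, 1.0 / n)                               # uniform empirical weights
C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)        # cost matrix C(x_i, y_j)

f, g = np.zeros(n), np.zeros(n)
for _ in range(200):                                      # Sinkhorn on alpha_hat, beta_hat
    f = -logsumexp(np.log(b)[None, :] + g[None, :] - C, axis=1)
    g = -logsumexp(np.log(a)[:, None] + f[:, None] - C, axis=0)

# Plug-in estimate W(alpha_hat, beta_hat), used as a proxy for W(alpha, beta).
W_hat = a @ f + b @ g - (np.outer(a, b) * np.exp(f[:, None] + g[None, :] - C)).sum() + 1.0
print(W_hat)
```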

slide-27
SLIDE 27

Optimising the cost in functional space [4]

W(α, β) = max_{f,g ∈ C(X)} ⟨α, f⟩ + ⟨β, g⟩ − ⟨α ⊗ β, exp(f ⊕ g − C)⟩ + 1

Parametrize f_t, g_t in a RKHS with kernel κ, use SGD

[4] A. Genevay, M. Cuturi, G. Peyré, and F. Bach. “Stochastic Optimization for Large-scale Optimal Transport”. In: Advances in Neural Information Processing Systems. 2016, pp. 3432–3440.

10 / 29

slide-28
SLIDE 28

Optimising the cost in functional space [4]

W(α, β) = max_{f,g ∈ C(X)} ⟨α, f⟩ + ⟨β, g⟩ − ⟨α ⊗ β, exp(f ⊕ g − C)⟩ + 1

Parametrize f_t, g_t in a RKHS with kernel κ, use SGD
Sample (x_t, y_t) ∼ α, β

[4] A. Genevay, M. Cuturi, G. Peyré, and F. Bach. “Stochastic Optimization for Large-scale Optimal Transport”. In: Advances in Neural Information Processing Systems. 2016, pp. 3432–3440.

10 / 29

slide-29
SLIDE 29

Optimising the cost in functional space [4]

W(α, β) = max_{f,g ∈ C(X)} ⟨α, f⟩ + ⟨β, g⟩ − ⟨α ⊗ β, exp(f ⊕ g − C)⟩ + 1

Parametrize f_t, g_t in a RKHS with kernel κ, use SGD
Sample (x_t, y_t) ∼ α, β
Compute ∇F(f_t, g_t)(x_t, y_t), update
f_t(·) = Σ_{s=1}^t μ_s κ(·, x_s)    g_t(·) = Σ_{s=1}^t ν_s κ(·, y_s)

[4] A. Genevay, M. Cuturi, G. Peyré, and F. Bach. “Stochastic Optimization for Large-scale Optimal Transport”. In: Advances in Neural Information Processing Systems. 2016, pp. 3432–3440.

10 / 29
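A rough sketch of this scheme under stated assumptions: a Gaussian RKHS kernel, a 1/√t step size, d-dimensional numpy samples, and hypothetical callables sample_alpha, sample_beta and cost supplied by the user. The coefficient of each new kernel atom is taken to be η_t (1 − exp(f(x) + g(y) − C(x, y))), the sample gradient of the dual objective above at the pair (x, y).

```python
import numpy as np

def kappa(z, atoms, sigma=1.0):
    """Gaussian RKHS kernel between one point z and an array of stored atoms."""
    return np.exp(-((z - atoms) ** 2).sum(-1) / (2 * sigma ** 2))

def rkhs_sgd(sample_alpha, sample_beta, cost, n_steps=1000, lr=0.1):
    """SGD on the dual objective F, with f and g parametrized as kernel expansions."""
    xs, ys, mus, nus = [], [], [], []             # atoms x_s, y_s and coefficients mu_s, nu_s
    for t in range(1, n_steps + 1):
        x, y = sample_alpha(), sample_beta()
        f_x = np.dot(mus, kappa(x, np.array(xs))) if xs else 0.0
        g_y = np.dot(nus, kappa(y, np.array(ys))) if ys else 0.0
        grad = 1.0 - np.exp(f_x + g_y - cost(x, y))   # stochastic gradient at (x, y)
        eta = lr / np.sqrt(t)
        xs.append(x); mus.append(eta * grad)          # f_t(.) = sum_s mu_s kappa(., x_s)
        ys.append(y); nus.append(eta * grad)          # g_t(.) = sum_s nu_s kappa(., y_s)
    return (np.array(xs), np.array(mus)), (np.array(ys), np.array(nus))
```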

slide-30
SLIDE 30

RKHS approach

11 / 29

slide-31
SLIDE 31

RKHS approach

Convergence of F(f_t, g_t) → F⋆, rate O(1/√t)

Ad-hoc parametrization
Convergence of the energy
Unstable updates

11 / 29

slide-32
SLIDE 32

Aparté for the deep learners

(Diagram: forward and backward passes)

Potentials are needed for backpropagation

12 / 29

slide-33
SLIDE 33

Outline

1. Tractable algorithms for optimal transport

2. Online Sinkhorn

13 / 29

slide-34
SLIDE 34

Parametrizing continuous potentials

At optimality:
f⋆(·) = − log ∫_{y∈X} exp(g⋆(y) − C(·, y)) dβ(y)
g⋆(·) = − log ∫_{x∈X} exp(f⋆(x) − C(x, ·)) dα(x)

14 / 29

slide-35
SLIDE 35

Parametrizing continuous potentials

At optimality:
f⋆(·) = − log ∫_{y∈X} exp(g⋆(y) − C(·, y)) dβ(y)
g⋆(·) = − log ∫_{x∈X} exp(f⋆(x) − C(x, ·)) dα(x)

Mixture parametrization with κ = exp(−C):
f_t(·) = − log Σ_{i=1}^{n_t} exp(q_i) κ(·, y_i)    g_t(·) = − log Σ_{i=1}^{n_t} exp(p_i) κ(x_i, ·)

14 / 29

slide-36
SLIDE 36

Parametrizing continuous potentials

At optimality:
f⋆(·) = − log ∫_{y∈X} exp(g⋆(y) − C(·, y)) dβ(y)
g⋆(·) = − log ∫_{x∈X} exp(f⋆(x) − C(x, ·)) dα(x)

Mixture parametrization with κ = exp(−C):
f_t(·) = − log Σ_{i=1}^{n_t} exp(q_i) κ(·, y_i)    g_t(·) = − log Σ_{i=1}^{n_t} exp(p_i) κ(x_i, ·)

Iteration t: (α̂_t)_t, (β̂_t)_t ⟹ enrich f_t(·), g_t(·)

14 / 29
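As a small illustration (the helper names are not from the talk), evaluating such a mixture-parametrized potential reduces to a logsumexp over the stored atoms, since κ = exp(−C):

```python
import numpy as np
from scipy.special import logsumexp

def eval_potential(z, atoms, q, cost):
    """f_t(z) = -log sum_i exp(q_i) kappa(z, y_i) = -logsumexp_i (q_i - C(z, y_i))."""
    return -logsumexp(q[None, :] - cost(z, atoms), axis=1)

# Illustrative squared-distance cost between two point clouds.
sq_cost = lambda z, y: ((z[:, None, :] - y[None, :, :]) ** 2).sum(-1)
```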


slide-38
SLIDE 38

A naïve approach: randomized Sinkhorn

Functional updates:
f_t(·) = − log ∫_{y∈X} exp(g_t(y) − C(·, y)) dβ(y)
g_t(·) = − log ∫_{x∈X} exp(f_t(x) − C(x, ·)) dα(x)

15 / 29

slide-39
SLIDE 39

A naïve approach: randomized Sinkhorn

Functional updates:
f_t(·) = − log ∫_{y∈X} exp(g_t(y) − C(·, y)) dβ(y)
g_t(·) = − log ∫_{x∈X} exp(f_t(x) − C(x, ·)) dα(x)

α → α̂_t = (1/b_t) Σ_{i=n_t}^{n_{t+1}} δ_{x_i}    β → β̂_t = (1/b_t) Σ_{i=n_t}^{n_{t+1}} δ_{y_i}

15 / 29

slide-40
SLIDE 40

A naïve approach: randomized Sinkhorn

Functional updates:
f_t(·) = − log ∫_{y∈X} exp(g_t(y) − C(·, y)) dβ(y)
g_t(·) = − log ∫_{x∈X} exp(f_t(x) − C(x, ·)) dα(x)

α → α̂_t = (1/b_t) Σ_{i=n_t}^{n_{t+1}} δ_{x_i}    β → β̂_t = (1/b_t) Σ_{i=n_t}^{n_{t+1}} δ_{y_i}

f_t(·) = − log (1/b_t) Σ_{i=n_t}^{n_{t+1}} exp(g_t(y_i) − C(·, y_i))
g_t(·) = − log (1/b_t) Σ_{i=n_t}^{n_{t+1}} exp(f_t(x_i) − C(x_i, ·))

15 / 29
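A sketch of one randomized Sinkhorn sweep under these updates (an illustration, not the talk's code): the integrals are replaced by a fresh minibatch and the previous potentials are simply overwritten. The cost argument is a hypothetical callable returning the pairwise cost matrix between two point clouds.

```python
import numpy as np
from scipy.special import logsumexp

def randomized_sinkhorn_step(x_batch, y_batch, g_prev, cost):
    """One sweep on a fresh minibatch; returns the new potentials as callables."""
    b = len(y_batch)
    g_at_y = g_prev(y_batch)                      # previous g evaluated on the new y's
    def f_t(z):
        # f_t(.) = -log (1/b) sum_i exp(g(y_i) - C(., y_i))
        return -logsumexp(-np.log(b) + g_at_y[None, :] - cost(z, y_batch), axis=1)
    f_at_x = f_t(x_batch)
    def g_t(z):
        # g_t(.) = -log (1/b) sum_i exp(f_t(x_i) - C(x_i, .))
        return -logsumexp(-np.log(b) + f_at_x[None, :] - cost(z, x_batch), axis=1)
    return f_t, g_t
```

As the next slides argue, iterating this produces zero-mean oscillations around the optimum rather than convergence to deterministic potentials.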

slide-41
SLIDE 41

Does it converge ?

16 / 29

slide-42
SLIDE 42

Does it converge ?

Solution f⋆, g⋆ of the dual problem:
exp(−f⋆(·)) = ∫_{y∈X} exp(g⋆(y) − C(·, y)) dβ(y) ≈ (1/b_t) Σ_{i=n_t}^{n_{t+1}} exp(g⋆(y_i) − C(·, y_i))

[5] Persi Diaconis and David Freedman. “Iterated random functions”. In: SIAM Review 41.1 (1999), pp. 45–76.

17 / 29

slide-43
SLIDE 43

Does it converge ?

Solution f⋆, g⋆ of the dual problem:
exp(−f⋆(·)) = ∫_{y∈X} exp(g⋆(y) − C(·, y)) dβ(y) ≈ (1/b_t) Σ_{i=n_t}^{n_{t+1}} exp(g⋆(y_i) − C(·, y_i))

Zero-mean oscillations “around” the optimum (in (e^{−f}, e^{−g}))

[5] Persi Diaconis and David Freedman. “Iterated random functions”. In: SIAM Review 41.1 (1999), pp. 45–76.

17 / 29

slide-44
SLIDE 44

Does it converge ?

Solution f⋆, g⋆ of the dual problem:
exp(−f⋆(·)) = ∫_{y∈X} exp(g⋆(y) − C(·, y)) dβ(y) ≈ (1/b_t) Σ_{i=n_t}^{n_{t+1}} exp(g⋆(y_i) − C(·, y_i))

Zero-mean oscillations “around” the optimum (in (e^{−f}, e^{−g}))

Proposition: (f_t, g_t) is a Markov chain that converges towards a functional random variable (f_∞, g_∞).

[5] Persi Diaconis and David Freedman. “Iterated random functions”. In: SIAM Review 41.1 (1999), pp. 45–76.

17 / 29

slide-45
SLIDE 45

Does it converge ?

Solution f⋆, g⋆ of the dual problem:
exp(−f⋆(·)) = ∫_{y∈X} exp(g⋆(y) − C(·, y)) dβ(y) ≈ (1/b_t) Σ_{i=n_t}^{n_{t+1}} exp(g⋆(y_i) − C(·, y_i))

Zero-mean oscillations “around” the optimum (in (e^{−f}, e^{−g}))

Proposition: (f_t, g_t) is a Markov chain that converges towards a functional random variable (f_∞, g_∞).

E[exp(−f_∞(·))] = ∫_{y∈X} E[exp(g_∞(y) − C(·, y))] dβ(y)

[5] Persi Diaconis and David Freedman. “Iterated random functions”. In: SIAM Review 41.1 (1999), pp. 45–76.

17 / 29

slide-46
SLIDE 46

Does it converge ?

Solution f⋆, g⋆ of the dual problem:
exp(−f⋆(·)) = ∫_{y∈X} exp(g⋆(y) − C(·, y)) dβ(y) ≈ (1/b_t) Σ_{i=n_t}^{n_{t+1}} exp(g⋆(y_i) − C(·, y_i))

Zero-mean oscillations “around” the optimum (in (e^{−f}, e^{−g}))

Proposition: (f_t, g_t) is a Markov chain that converges towards a functional random variable (f_∞, g_∞).

E[exp(−f_∞(·))] = ∫_{y∈X} E[exp(g_∞(y) − C(·, y))] dβ(y)

Proof: iterated random contracting functions [5]

[5] Persi Diaconis and David Freedman. “Iterated random functions”. In: SIAM Review 41.1 (1999), pp. 45–76.

17 / 29

slide-47
SLIDE 47

Reducing the iterate variance

We need f∞, g∞ to be deterministic functions

18 / 29

slide-48
SLIDE 48

Reducing the iterate variance

We need f_∞, g_∞ to be deterministic functions

Two strategies:
Keep past information: (κ(x_i, ·))_i, (κ(·, y_i))_i
Increase the batch size b_t

18 / 29

slide-49
SLIDE 49

Reducing the iterate variance

We need f_∞, g_∞ to be deterministic functions

Two strategies:
Keep past information: (κ(x_i, ·))_i, (κ(·, y_i))_i
Increase the batch size b_t

exp(−T̂_{β_t}(g_t)(·)) = (1/b_t) Σ_{i=n_t}^{n_{t+1}} exp(g_t(y_i) − C(·, y_i))
exp(−f_{t+1}(·)) = (1 − η_t) exp(−f_t(·)) + η_t exp(−T̂_{β_t}(g_t)(·))

18 / 29

slide-50
SLIDE 50

Updating potential representations

exp(−T̂_{β_t}(g_t)(·)) = (1/b_t) Σ_{i=n_t}^{n_{t+1}} exp(g_t(y_i) − C(·, y_i))
exp(−f_{t+1}(·)) = (1 − η_t) exp(−f_t(·)) + η_t exp(−T̂_{β_t}(g_t)(·))

Add new points + simple update rule on (q_i)_i, (p_i)_i:
f_t(·) = − log Σ_{i=1}^{n_t} exp(q_i) κ(·, y_i)

Unbiased updates in (e^{−f}, e^{−g}) space

19 / 29

slide-51
SLIDE 51

Updating potential representations

exp(−T̂_{β_t}(g_t)(·)) = (1/b_t) Σ_{i=n_t}^{n_{t+1}} exp(g_t(y_i) − C(·, y_i))
exp(−f_{t+1}(·)) = (1 − η_t) exp(−f_t(·)) + η_t exp(−T̂_{β_t}(g_t)(·))

Add new points + simple update rule on (q_i)_i, (p_i)_i:
f_t(·) = − log Σ_{i=1}^{n_t} exp(q_i) κ(·, y_i)

Unbiased updates in (e^{−f}, e^{−g}) space

f_t, g_t --(random iter.)--> T̂_{β_t}(g_t), T̂_{α_t}(f_t) --(enrich(η_t))--> f_{t+1}, g_{t+1}

19 / 29
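A minimal sketch of the enrich step for f under the κ = exp(−C) parametrization (an illustration, not the authors' implementation): in log space, the averaging above becomes "damp the stored log-weights by log(1 − η_t), then append the new atoms with log-weights log(η_t / b_t) + g_t(y_i)".

```python
import numpy as np

def enrich_f(atoms_y, q, new_y, g_at_new_y, eta):
    """Online Sinkhorn update of f's representation f(.) = -log sum_i exp(q_i - C(., y_i)).

    Implements exp(-f_{t+1}) = (1 - eta) exp(-f_t)
                             + eta * (1/b_t) sum_i exp(g_t(y_i) - C(., y_i)).
    """
    b_t = len(new_y)
    q_old = q + np.log1p(-eta)                  # scale the existing mixture by (1 - eta)
    q_new = np.log(eta / b_t) + g_at_new_y      # new atoms weighted by (eta / b_t) exp(g_t(y_i))
    return np.concatenate([atoms_y, new_y]), np.concatenate([q_old, q_new])
```

The symmetric step enriches g with the new x batch; taking η_t = 1 and discarding the stored atoms recovers the randomized Sinkhorn sweep above.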

slide-52
SLIDE 52

Complexity

f_t, g_t --(random iter.)--> T̂_{β_t}(g_t), T̂_{α_t}(f_t) --(enrich(η_t))--> f_{t+1}, g_{t+1}

Single iteration: O(b_t) extra memory, O(b_t n_{t−1}) computations
Total complexity after seeing n_t points: O(n_t) memory, O(n_t²) computations (cost + fixed-point iterations)

20 / 29

slide-53
SLIDE 53

Complexity

f_t, g_t --(random iter.)--> T̂_{β_t}(g_t), T̂_{α_t}(f_t) --(enrich(η_t))--> f_{t+1}, g_{t+1}

Single iteration: O(b_t) extra memory, O(b_t n_{t−1}) computations
Total complexity after seeing n_t points: O(n_t) memory, O(n_t²) computations (cost + fixed-point iterations)

Sinkhorn with n points: O(n²) cost + O(n²) iterations

20 / 29

slide-54
SLIDE 54

Does it converge ?

Assumptions: compact space and Lipschitz cost; schedules (b_t)_t, (η_t)_t with Σ_t η_t = ∞ and Σ_t η_t / b_t^{1/2} < ∞

Norm: ‖f‖_var = max_x f(x) − min_x f(x)

Proposition: the potentials f_t and g_t converge a.s.: ‖f_t − f⋆‖_var + ‖g_t − g⋆‖_var → 0

21 / 29

slide-55
SLIDE 55

Convergence guarantees

Proposition: the potentials f_t and g_t converge a.s.: ‖f_t − f⋆‖_var + ‖g_t − g⋆‖_var → 0

Proof: contraction + uniform law of large numbers [6]

[6] Aad W. Van der Vaart. Asymptotic statistics. Cambridge University Press, 2000.

22 / 29

slide-56
SLIDE 56

Convergence guarantees

Proposition: the potentials f_t and g_t converge a.s.: ‖f_t − f⋆‖_var + ‖g_t − g⋆‖_var → 0
Proof: contraction + uniform law of large numbers [6]

Consistent estimation of W(α, β) and of its backward operator
No rates; solution with infinite complexity (when to stop ?)

[6] Aad W. Van der Vaart. Asymptotic statistics. Cambridge University Press, 2000.

22 / 29

slide-57
SLIDE 57

Convergence guarantees

Proposition: the potentials f_t and g_t converge a.s.: ‖f_t − f⋆‖_var + ‖g_t − g⋆‖_var → 0
Proof: contraction + uniform law of large numbers [6]

Consistent estimation of W(α, β) and of its backward operator
No rates; solution with infinite complexity (when to stop ?)

Trade-off between step size η_t and increasing batch size b_t: from η_t = 1/t, b_t = log² t to η_t = 1, b_t = t² log² t
Constant batch size b: O(1/√b) approximate solution

[6] Aad W. Van der Vaart. Asymptotic statistics. Cambridge University Press, 2000.

22 / 29

slide-58
SLIDE 58

A mirror descent interpretation

W = max_{f,g ∈ C(X)} ⟨α, f⟩ + ⟨β, g⟩ − ⟨α ⊗ β, exp(f ⊕ g − C)⟩ + 1

23 / 29

slide-59
SLIDE 59

A mirror descent interpretation

W = max_{f,g ∈ C(X)} ⟨α, f⟩ + ⟨β, g⟩ − ⟨α ⊗ β, exp(f ⊕ g − C)⟩ + 1
  = − min_{μ,ν ∈ M+(X)} KL(α|μ) + KL(β|ν) + ⟨μ ⊗ ν, exp(−C)⟩ − 1

Block-convex objective on positive measures, μ_t = α exp(f_t)

23 / 29

slide-60
SLIDE 60

A mirror descent interpretation

W = max_{f,g ∈ C(X)} ⟨α, f⟩ + ⟨β, g⟩ − ⟨α ⊗ β, exp(f ⊕ g − C)⟩ + 1
  = − min_{μ,ν ∈ M+(X)} KL(α|μ) + KL(β|ν) + ⟨μ ⊗ ν, exp(−C)⟩ − 1

Block-convex objective on positive measures, μ_t = α exp(f_t)

Distance-generating function ⟹ online Sinkhorn: ω(μ, ν) = KL(α|μ) + KL(β|ν)

23 / 29

slide-61
SLIDE 61

A mirror descent interpretation

W = max_{f,g ∈ C(X)} ⟨α, f⟩ + ⟨β, g⟩ − ⟨α ⊗ β, exp(f ⊕ g − C)⟩ + 1
  = − min_{μ,ν ∈ M+(X)} KL(α|μ) + KL(β|ν) + ⟨μ ⊗ ν, exp(−C)⟩ − 1

Block-convex objective on positive measures, μ_t = α exp(f_t)

Distance-generating function ⟹ online Sinkhorn: ω(μ, ν) = KL(α|μ) + KL(β|ν)

Stochastic mirror descent (unbiased gradient): (μ_{t+1}, ν_{t+1}) = ∇ω⋆(∇ω(μ_t, ν_t) − η_t ∇̃F(μ_t, ν_t))

23 / 29

slide-62
SLIDE 62

Reducing the bias due to sampling

Figure: ‖(f_t, g_t) − (f⋆, g⋆)‖_var and |W_t − W| versus computation, comparing averaged and plain online Sinkhorn (5%), averaged and plain random Sinkhorn (5%), Sinkhorn (5%) and Sinkhorn.

Online Sinkhorn is consistent. Iterate averaging helps

24 / 29

slide-63
SLIDE 63

Simultaneous estimation of cost and potentials

Figure: ‖(f_t, g_t) − (f⋆, g⋆)‖_var and |W_t − W| versus computation (up to the full cost matrix Ĉ), for online Sinkhorn with 5% and 10% sampling and for Sinkhorn.

f_t(·) = − log Σ_{i=1}^{n_t} exp(q_i) κ(·, y_i)    g_t(·) = − log Σ_{i=1}^{n_t} exp(p_i) κ(x_i, ·)

Fill the cost matrix (C(x_i, y_j))_{i,j ∈ [1, n_t]} and move n_t → n

25 / 29

slide-64
SLIDE 64

Estimating continuous potentials

26 / 29

slide-65
SLIDE 65

Comparing with RKHS SGD

27 / 29

slide-66
SLIDE 66

Conclusion

Finding a fixed-point solution with noisy operators
Consistent estimation of W(α, β)
Importance of the parametrization in functional space:
f_t(·) = − log Σ_{i=1}^{n_t} exp(q_i) κ(·, y_i)    g_t(·) = − log Σ_{i=1}^{n_t} exp(p_i) κ(x_i, ·)

Can we compress the parametrization ? e.g. k-means
Can we obtain rates ? Mirror descent fails at that

28 / 29

slide-67
SLIDE 67

Thank you !

Arthur Mensch and Gabriel Peyré. “Online Sinkhorn: Optimal Transport distances from sample streams”. In: arXiv preprint arXiv:2003.01415 (2020)

29 / 29

slide-68
SLIDE 68

◮ Mensch, Arthur and Gabriel Peyré. “Online Sinkhorn: Optimal Transport distances from sample streams”. In: arXiv preprint arXiv:2003.01415 (2020).
◮ Cuturi, M. “Sinkhorn Distances: Lightspeed Computation of Optimal Transport”. In: Advances in Neural Information Processing Systems. 2013.
◮ Diaconis, Persi and David Freedman. “Iterated random functions”. In: SIAM Review 41.1 (1999), pp. 45–76.
◮ Genevay, A., M. Cuturi, G. Peyré, and F. Bach. “Stochastic Optimization for Large-scale Optimal Transport”. In: Advances in Neural Information Processing Systems. 2016, pp. 3432–3440.
◮ Rockafellar, R. T. “Monotone operators associated with saddle-functions and minimax problems”. In: Proceedings of Symposia in Pure Mathematics. Vol. 18.1. 1970, pp. 241–250.
◮ Sinkhorn, Richard. “A relationship between arbitrary positive matrices and doubly stochastic matrices”. In: The Annals of Mathematical Statistics 35 (1964), pp. 876–879.
◮ Van der Vaart, Aad W. Asymptotic statistics. Cambridge University Press, 2000.

1 / 1