SLIDE 1

Generative Models and Optimal Transport

Marco Cuturi

Joint work / work in progress with

  • G. Peyré, A. Genevay (ENS), F. Bach (INRIA),
  • G. Montavon, K.-R. Müller (TU Berlin)
SLIDE 2

Statistics 0.1: Density Fitting

We collect data and summarize it as the empirical measure

$\nu_{\text{data}} = \frac{1}{N}\sum_{i=1}^{N} \delta_{x_i}$

SLIDE 3

Statistics 0.1: Density Fitting

We fit a parametric family of densities $\{p_\theta,\ \theta \in \Theta\}$ to the collected data $\nu_{\text{data}} = \frac{1}{N}\sum_{i=1}^{N}\delta_{x_i}$, e.g. $\theta = (m, \Sigma)$ and $p_\theta = \mathcal{N}(m, \Sigma)$.

SLIDES 4-6

Density Fitting

Successive fits $p_{\theta_1}, p_{\theta_2}, \dots, p_{\theta_{\text{done}}}$ of $\nu_{\text{data}}$: we stop when there is a good fit.

SLIDE 7

Maximum Likelihood Estimation

$\max_{\theta \in \Theta}\ \frac{1}{N}\sum_{i=1}^{N} \log p_\theta(x_i)$

SLIDE 8

Maximum Likelihood Estimation

$\max_{\theta \in \Theta}\ \frac{1}{N}\sum_{i=1}^{N} \log p_\theta(x_i)$

Since $\log 0 = -\infty$, $p_\theta(x_i)$ must be $> 0$ at every data point.
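To make this concrete, here is a minimal NumPy/SciPy sketch (not from the talk; data and names are illustrative) of MLE for the Gaussian example $\theta = (m, \Sigma)$, where the maximizer has a closed form:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Minimal sketch (illustrative data) of MLE for the Gaussian family
# p_theta = N(m, Sigma): the maximizer of the average log-likelihood is the
# sample mean and (biased) sample covariance, in closed form.
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=(1000, 2))   # N = 1000 data points in R^2

m_hat = x.mean(axis=0)                               # argmax over m
Sigma_hat = np.cov(x, rowvar=False, bias=True)       # argmax over Sigma (1/N normalization)

# Average log-likelihood: finite because a Gaussian density is > 0 everywhere,
# so the log 0 = -inf issue never arises for this family.
avg_ll = multivariate_normal(mean=m_hat, cov=Sigma_hat).logpdf(x).mean()
print(m_hat, avg_ll)
```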

SLIDE 9

Maximum Likelihood Estimation

Equivalent to a KL projection onto $\{p_\theta,\ \theta \in \Theta\}$ in the space of probability measures:

$\min_{\theta \in \Theta}\ \mathrm{KL}(\nu_{\text{data}} \,\|\, p_\theta)$


SLIDES 11-12

In higher-dimensional spaces…

$\min_{\theta \in \Theta}\ \mathrm{KL}(\nu_{\text{data}} \,\|\, p_\theta)$

SLIDE 13

In higher-dimensional spaces…

The data space has dimension $100 \times 100 \times 256 \times 256 \times 256 \approx 167 \times 10^9$.

SLIDES 14-19

Generative Models

A generative model pushes a fixed measure $\mu$ on a latent space through a parameterized map

$f_\theta : \text{latent space} \to \text{data space}$

A latent code, e.g. $z = (.32,\ .8,\ .34,\ \dots,\ .01)^T$, is mapped to a point $f_\theta(z)$ in the data space.

SLIDES 20-21

Generative Models

The model distribution on the data space is the push-forward measure $f_{\theta\sharp}\mu$.

Push-forward: $\forall B \subset \Omega,\ f_\sharp\mu(B) := \mu(f^{-1}(B))$
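The push-forward is easy to sample from: draw $z \sim \mu$ and return $f_\theta(z)$. A minimal sketch, with a toy two-layer map standing in for a neural network $f_\theta$ (all names and sizes are illustrative):

```python
import numpy as np

# Minimal sketch of sampling from a push-forward measure f_theta#mu:
# draw z ~ mu on the latent space, then return f_theta(z).
rng = np.random.default_rng(0)

def f_theta(z, W1, b1, W2, b2):
    h = np.tanh(z @ W1 + b1)        # hidden layer
    return h @ W2 + b2              # point in the data space

d_latent, d_hidden, d_data = 4, 16, 2
W1 = rng.normal(size=(d_latent, d_hidden)); b1 = np.zeros(d_hidden)
W2 = rng.normal(size=(d_hidden, d_data));   b2 = np.zeros(d_data)

z = rng.normal(size=(1000, d_latent))   # z ~ mu = N(0, I) on the latent space
x = f_theta(z, W1, b1, W2, b2)          # 1000 samples from f_theta#mu
print(x.shape)                          # (1000, 2)
```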

SLIDES 22-23

Generative Models

Goal: find $\theta$ such that $f_{\theta\sharp}\mu$ fits $\nu_{\text{data}}$.

SLIDE 24

Generative Models

What is the difference between fitting a push-forward measure $f_{\theta\sharp}\mu$ and fitting a density $p_\theta$?

SLIDE 25

Generative Models

Recall that MLE

$\max_{\theta \in \Theta}\ \frac{1}{N}\sum_{i=1}^{N} \log p_\theta(x_i) \;=\; \min_{\theta \in \Theta}\ \mathrm{KL}(\nu_{\text{data}} \,\|\, p_\theta)$

SLIDES 26-27

Generative Models

Applying MLE to the push-forward model would read

$\max_{\theta \in \Theta}\ \frac{1}{N}\sum_{i=1}^{N} \log f_{\theta\sharp}\mu(x_i) \;=\; \min_{\theta \in \Theta}\ \mathrm{KL}(\nu_{\text{data}} \,\|\, f_{\theta\sharp}\mu),$

but $f_{\theta\sharp}\mu$ typically has no density on the data space (its support is low-dimensional), so these quantities are degenerate.

SLIDE 28

Generative Models

We need a more flexible discrepancy function to compare $\nu_{\text{data}}$ and $f_{\theta\sharp}\mu$.

SLIDE 29

Workarounds?

  • Formulation as an adversarial problem [GPM…’14]:

$\min_{\theta \in \Theta}\ \max_{\text{classifiers } g}\ \text{Accuracy}_g\big((f_{\theta\sharp}\mu, +1), (\nu_{\text{data}}, -1)\big)$

  • Use a richer metric for probability measures, able to handle measures with non-overlapping supports:

$\min_{\theta \in \Theta}\ \Delta(\nu_{\text{data}}, p_\theta), \quad \text{not} \quad \min_{\theta \in \Theta}\ \mathrm{KL}(\nu_{\text{data}} \,\|\, p_\theta)$


SLIDE 31

Minimum Kantorovich Estimation

  • Use optimal transport theory, namely Wasserstein distances, to define the discrepancy:

$\min_{\theta \in \Theta}\ W(\nu_{\text{data}}, f_{\theta\sharp}\mu)$

  • Optimal transport is a fertile field in mathematics: Monge, Kantorovich and Koopmans (Nobel ’75), Dantzig, Brenier, McCann, Otto, Villani (Fields ’10).

SLIDE 32

What is Optimal Transport?

A geometric toolbox to compare probability measures supported on a metric space.

[Figure: examples of such measures: empirical measures (i.e. data) $\mu, \nu$; color histograms $h_1, h_2$; bags of features; statistical models $p_\theta, p_{\theta'}$; brain activation maps.]


SLIDES 34-36

Optimal Transport Geometry

A geometric toolbox to compare probability measures supported on a metric space: in $\mathcal{P}(\Omega)$, OT provides the Wasserstein distance $W(p_\theta, p_{\theta'})$ and the interpolant between $p_\theta$ and $p_{\theta'}$ [McCann’95].

SLIDES 37-38

Optimal Transport Geometry

It also provides the Wasserstein barycenter [Agueh’11] of several measures $p_{\theta'}, p_{\theta''} \in \mathcal{P}(\Omega)$.

SLIDES 39-41

Optimal Transport Geometry

A geometric toolbox to compare probability measures supported on a metric space.

[Figure: illustrations from [SDPC..’15].]

SLIDES 42-52

Origins: Monge’s Problem

[Figure, built up across several slides: a point $x$ is transported to $y = T(x)$ at cost $D(x, T(x))$.]

SLIDE 53

Origins: Monge’s Problem

Let $\Omega$ be a probability space, $c : \Omega \times \Omega \to \mathbb{R}$, and $\mu, \nu$ two probability measures in $\mathcal{P}(\Omega)$.

[Monge’81] problem: find a map $T : \Omega \to \Omega$ achieving

$\inf_{T_\sharp\mu = \nu} \int c(x, T(x))\, \mu(dx)$

SLIDE 54

Origins: Monge’s Problem

[Brenier’87]: if $\Omega = \mathbb{R}^d$, $c(x, y) = \|x - y\|^2$, and $\mu, \nu$ are absolutely continuous, then the optimal map is $T = \nabla u$ with $u$ convex.

SLIDES 55-56

Monge’s Problem

$\inf_{T_\sharp\mu = \nu} \int c(x, T(x))\, \mu(dx)$

A Monge map need not exist: if $\mu = \delta_x$, then $T_\sharp\delta_x = \delta_{T(x)}$, so no map can split the mass of a Dirac and the constraint $T_\sharp\mu = \nu$ may be infeasible.

SLIDE 57

[Kantorovich’42] Relaxation

  • Instead of maps $T : \Omega \to \Omega$, consider probabilistic maps, i.e. couplings $P \in \mathcal{P}(\Omega \times \Omega)$:

$\Pi(\mu, \nu) \stackrel{\text{def}}{=} \{P \in \mathcal{P}(\Omega \times \Omega) \mid \forall A, B \subset \Omega,\ P(A \times \Omega) = \mu(A),\ P(\Omega \times B) = \nu(B)\}$

SLIDES 58-59

[Kantorovich’42] Relaxation

$\Pi(\mu, \nu) \stackrel{\text{def}}{=} \{P \in \mathcal{P}(\Omega \times \Omega) \mid \forall A, B \subset \Omega,\ P(A \times \Omega) = \mu(A),\ P(\Omega \times B) = \nu(B)\}$

[Figure: two different discrete couplings $P(x, y)$ with the same marginals $\mu(x)$, $\nu(y)$.]

SLIDES 60-62

Wasserstein Distances

  • Def. For $p \geq 1$, the $p$-Wasserstein distance between $\mu, \nu$ in $\mathcal{P}(\Omega)$, defined by a metric $D$ on $\Omega$:

PRIMAL

$W_p^p(\mu, \nu) \stackrel{\text{def}}{=} \inf_{P \in \Pi(\mu,\nu)} \iint D(x, y)^p\, P(dx, dy)$

DUAL

$W_p^p(\mu, \nu) = \sup_{\substack{\varphi \in L_1(\mu),\, \psi \in L_1(\nu) \\ \varphi(x) + \psi(y) \leq D^p(x,y)}} \int \varphi\, d\mu + \int \psi\, d\nu$

SLIDES 63-64

W is versatile

It applies to discrete-discrete, discrete-continuous, and continuous-continuous pairs of measures. Available tools:

  • Network flow solvers
  • Entropic regularization
  • Stochastic optimization [GCPB’16]
  • Low-dim. solvers [M’11][KMB’16][L’15]

SLIDE 65

Minimum Kantorovich Estimators

$\min_{\theta \in \Theta}\ W(\nu_{\text{data}}, f_{\theta\sharp}\mu)$

  • [Bassetti’06]: first reference discussing this approach.
  • [MMC’16]: uses regularization in a finite setting.
  • [ACB’17] (WGAN), [BJGR’17] (Wasserstein ABC).
  • Hot topic: approximating & differentiating W efficiently.
  • Today: ideas from our recent preprint [GPC’17].
SLIDE 66

Wasserstein between 2 Diracs

On $(\Omega, D)$:

$W_p^p(\delta_x, \delta_y) = D(x, y)^p$

SLIDES 67-68

Wasserstein on Uniform Measures

Take $\mu = \sum_{i=1}^n \frac{1}{n}\delta_{x_i}$ and $\nu = \sum_{j=1}^n \frac{1}{n}\delta_{y_j}$ on $(\Omega, D)$. A permutation $\sigma \in S_n$ has cost

$C(\sigma) = \frac{1}{n}\sum_{i=1}^n D(x_i, y_{\sigma_i})^p$

SLIDE 69

Optimal Assignment ⊂ Wasserstein

For two uniform measures on $n$ points each,

$W_p^p(\mu, \nu) = \min_{\sigma \in S_n} C(\sigma)$

(sketched in code below).
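A minimal sketch of this reduction using SciPy's Hungarian solver (`linear_sum_assignment`); points and parameters are illustrative:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

# Minimal sketch: W_p^p between two uniform empirical measures on n points
# reduces to an optimal assignment, solved here with the Hungarian algorithm.
rng = np.random.default_rng(0)
n, p = 50, 2
x = rng.normal(size=(n, 2))
y = rng.normal(loc=1.0, size=(n, 2))

M = cdist(x, y) ** p                      # M_ij = D(x_i, y_j)^p
row, col = linear_sum_assignment(M)       # optimal permutation sigma
W_pp = M[row, col].mean()                 # (1/n) sum_i D(x_i, y_sigma(i))^p
print(W_pp)
```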

SLIDES 70-71

OT on Two Empirical Measures

More generally, consider weighted empirical measures on $(\Omega, D)$:

$\mu = \sum_{i=1}^n a_i \delta_{x_i} \quad \text{and} \quad \nu = \sum_{j=1}^m b_j \delta_{y_j}$

SLIDES 72-74

Wasserstein on Empirical Measures

Consider $\mu = \sum_{i=1}^n a_i \delta_{x_i}$ and $\nu = \sum_{j=1}^m b_j \delta_{y_j}$, and define the transportation polytope and cost matrix

$U(a, b) \stackrel{\text{def}}{=} \{P \in \mathbb{R}_+^{n \times m} \mid P\mathbf{1}_m = a,\ P^T\mathbf{1}_n = b\}, \qquad M_{XY} \stackrel{\text{def}}{=} [D(x_i, y_j)^p]_{ij}.$

  • Def. Optimal Transport Problem (a linear program; a sketch follows):

$W_p^p(\mu, \nu) = \min_{P \in U(a,b)} \langle P, M_{XY}\rangle$
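A minimal sketch of the discrete OT problem as a linear program over the flattened coupling, using `scipy.optimize.linprog` (toy sizes; a dedicated OT solver would be used in practice):

```python
import numpy as np
from scipy.optimize import linprog
from scipy.spatial.distance import cdist

# Minimal sketch: min_{P in U(a,b)} <P, M_XY> as an LP over vec(P).
rng = np.random.default_rng(0)
n, m, p = 5, 7, 2
x, y = rng.normal(size=(n, 2)), rng.normal(size=(m, 2))
a = np.full(n, 1 / n)
b = np.full(m, 1 / m)
M = cdist(x, y) ** p

# Equality constraints: P 1_m = a (row sums) and P^T 1_n = b (column sums).
A_rows = np.kron(np.eye(n), np.ones(m))   # picks out row sums of vec(P)
A_cols = np.kron(np.ones(n), np.eye(m))   # picks out column sums of vec(P)
A_eq = np.vstack([A_rows, A_cols])
b_eq = np.concatenate([a, b])

res = linprog(M.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
print(res.fun)                            # W_p^p(mu, nu)
```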

SLIDES 75-76

Discrete OT Problem

[Figure: the polytope $U(a, b)$, the cost matrix $M_{XY}$, and an optimal solution $P^\star$.]

SLIDE 77

Discrete OT Problem

  • Def. Dual OT problem

$W_p^p(\mu, \nu) = \max_{\substack{\alpha \in \mathbb{R}^n,\, \beta \in \mathbb{R}^m \\ \alpha_i + \beta_j \leq D(x_i, y_j)^p}} \alpha^T a + \beta^T b$

SLIDES 78-84

Discrete OT Problem

An $O(n^3 \log n)$ network flow solver is used in practice.

Note: flow/PDE formulations [Beckman’61]/[Benamou’98] can be used for $p = 1$/$p = 2$ with a sparse-graph metric/Euclidean metric, respectively.

However, the optimal solution $P^\star$ is unstable and not always unique, and $W_p^p(\mu, \nu)$ is not differentiable.


SLIDE 88

Solution: Modify the OT Problem

Wishlist: faster & scalable, more stable, differentiable.

SLIDES 89-90

Entropic Regularization [Wilson’62]

$E(P) \stackrel{\text{def}}{=} -\sum_{i,j=1}^{n,m} P_{ij} \log P_{ij}$

  • Def. Regularized Wasserstein, $\gamma \geq 0$:

$W_\gamma(\mu, \nu) \stackrel{\text{def}}{=} \min_{P \in U(a,b)} \langle P, M_{XY}\rangle - \gamma E(P)$

Note: the optimal solution $P_\gamma$ is unique because the entropy is strongly concave.

SLIDES 91-93

Fast & Scalable Algorithm

  • Prop. If $P_\gamma \stackrel{\text{def}}{=} \operatorname{argmin}_{P \in U(a,b)} \langle P, M_{XY}\rangle - \gamma E(P)$, then $\exists!\ u \in \mathbb{R}^n_+,\ v \in \mathbb{R}^m_+$ such that

$P_\gamma = \operatorname{diag}(u)\, K \operatorname{diag}(v), \qquad K \stackrel{\text{def}}{=} e^{-M_{XY}/\gamma}$

Sketch: with multipliers $\alpha, \beta$ for the marginal constraints,

$L(P, \alpha, \beta) = \sum_{ij} P_{ij} M_{ij} + \gamma P_{ij}\log P_{ij} + \alpha^T(P\mathbf{1} - a) + \beta^T(P^T\mathbf{1} - b),$

$\partial L/\partial P_{ij} = M_{ij} + \gamma(\log P_{ij} + 1) + \alpha_i + \beta_j, \quad (\partial L/\partial P_{ij} = 0) \Rightarrow P_{ij} = e^{\frac{\alpha_i}{\gamma} + \frac{1}{2}}\, e^{-\frac{M_{ij}}{\gamma}}\, e^{\frac{\beta_j}{\gamma} + \frac{1}{2}} = u_i K_{ij} v_j.$

  • [Sinkhorn’64] fixed-point iterations for $(u, v)$: $u \leftarrow a / (Kv),\ v \leftarrow b / (K^T u)$ (elementwise divisions); a NumPy sketch follows this list.
  • $O(nm)$ complexity per iteration, GPGPU-parallel [C’13].
  • $O(n^{d+1})$ per iteration if $\Omega = \{1, \dots, n\}^d$ and $D^p$ is separable [S..C..’15].
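A minimal NumPy sketch of the iterations (toy data; γ and the iteration count are illustrative, and a practical implementation would monitor marginal violations and work in the log domain for small γ):

```python
import numpy as np
from scipy.spatial.distance import cdist

# Minimal sketch of Sinkhorn's fixed-point iterations.
# K = exp(-M_XY/gamma); u <- a/(K v), v <- b/(K^T u), elementwise.
rng = np.random.default_rng(0)
n, m, p, gamma = 200, 300, 2, 0.1
x = rng.normal(size=(n, 2))
y = rng.normal(loc=1.0, size=(m, 2))
a = np.full(n, 1 / n)
b = np.full(m, 1 / m)

M = cdist(x, y) ** p
K = np.exp(-M / gamma)

v = np.ones(m)
for _ in range(500):          # fixed iteration budget; could test marginals instead
    u = a / (K @ v)
    v = b / (K.T @ u)

P_gamma = u[:, None] * K * v[None, :]    # P_gamma = diag(u) K diag(v)
print((P_gamma * M).sum())               # transport cost <P_gamma, M_XY>
```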

SLIDE 94

Very Fast EMD Approx. Solver

[Figure: average execution time per distance (in s.) vs. histogram dimension (64 to 4096), comparing FastEMD, Rubner’s emd, and Sinkhorn on CPU/GPU with γ = 0.02 and γ = 0.1.]

  • Note. $(\Omega, D)$ is a random graph with the shortest-path metric; histograms are sampled uniformly on the simplex; Sinkhorn tolerance is $10^{-2}$.

SLIDES 95-99

Regularization ⤑ Differentiability

With $\mu = \sum_{i=1}^n a_i\delta_{x_i}$ and $\nu = \sum_{j=1}^m b_j\delta_{y_j}$ on $(\Omega, D)$, write

$W_\gamma((a, X), (b, Y)) = \min_{P \in U(a,b)} \langle P, M_{XY}\rangle - \gamma E(P).$

How does the value respond to perturbations $a \leftarrow a + \Delta a$ of the weights, and $X \leftarrow X + \Delta X$ of the support?

$W_\gamma((a + \Delta a, X), (b, Y)) = W_\gamma((a, X), (b, Y)) +\ ??$

$W_\gamma((a, X + \Delta X), (b, Y)) = W_\gamma((a, X), (b, Y)) +\ ??$

SLIDE 100

1. Differentiability of Regularized OT

  • Def. Dual regularized OT problem [CD’14]:

$W_\gamma(\mu, \nu) = \max_{\alpha, \beta}\ \alpha^T a + \beta^T b - \gamma\, (e^{\alpha/\gamma})^T K\, e^{\beta/\gamma}$

  • Prop. $W_\gamma(\mu, \nu)$ is
  • 1. convex w.r.t. $a$, with $\nabla_a W = \alpha^\star = \gamma \log(u)$ (illustrated below);
  • 2. decreased, when $p = 2$ and $\Omega = \mathbb{R}^d$, by the update $X \leftarrow Y P^T \operatorname{diag}(a^{-1})$.
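A minimal sketch of item 1: run Sinkhorn, then read off $\nabla_a W_\gamma = \gamma \log u$, defined up to an additive constant since $a$ lives on the simplex (toy data, illustrative parameters):

```python
import numpy as np
from scipy.spatial.distance import cdist

# Minimal sketch of the gradient formula from [CD'14]: after Sinkhorn,
# grad_a W_gamma = alpha* = gamma * log(u), up to an additive constant.
rng = np.random.default_rng(1)
n, m, gamma = 40, 50, 0.2
x, y = rng.normal(size=(n, 2)), rng.normal(size=(m, 2))
a, b = np.full(n, 1 / n), np.full(m, 1 / m)
K = np.exp(-cdist(x, y) ** 2 / gamma)

v = np.ones(m)
for _ in range(1000):
    u = a / (K @ v)
    v = b / (K.T @ u)

grad_a = gamma * np.log(u)
grad_a -= grad_a.mean()     # fix the additive constant on the simplex
print(grad_a[:5])
```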

SLIDE 101

2. Duality for Discrete Regularized OT

  • Prop. [CP’16] Writing $H_\nu : a \mapsto W_\gamma(\mu, \nu)$:
  • 1. $H_\nu$ has a simple Legendre transform:

$H_\nu^* : g \in \mathbb{R}^n \mapsto \gamma\left(E(b) + b^T \log(K^T e^{g/\gamma})\right)$

  • 2. If $A \in \mathbb{R}^{n \times d}$ and $f$ is convex on $\mathbb{R}^d$:

$\min_{a \in \Sigma_n} H_\nu(a) + f(Aa) \;=\; \max_{g \in \mathbb{R}^d} -H_\nu^*(A^T g) - f^*(-g)$
SLIDES 102-103

3. Stochastic Formulation [GCPB’16]

DUAL

$W_p^p(\mu, \nu) = \sup_{\varphi, \psi} \int \varphi\, d\mu + \int \psi\, d\nu - \iota_C(\varphi, \psi), \qquad C = \{(\varphi, \psi) \mid \varphi \oplus \psi \leq D^p\}$

REGULARIZED DUAL: for $\gamma > 0$, regularize the dual constraints with a smooth penalty,

$W_\gamma(\mu, \nu) = \sup_{\varphi, \psi} \int \varphi\, d\mu + \int \psi\, d\nu - \iota_C^\gamma(\varphi, \psi), \qquad \iota_C^\gamma(\varphi, \psi) = \gamma \iint e^{(\varphi \oplus \psi - D^p)/\gamma}\, d\mu\, d\nu$

SLIDE 104

Smoothed D-transforms

SEMI-DUAL

$W_p^p(\mu, \nu) = \sup_{\varphi} \int \varphi\, d\mu + \int \varphi^D\, d\nu$

REGULARIZED SEMI-DUAL ($\gamma > 0$; a small sketch of the smoothed transform follows)

$W_\gamma(\mu, \nu) = \sup_{\varphi} \int \varphi\, d\mu + \int \varphi^{D,\gamma}\, d\nu, \qquad \varphi^{D,\gamma} = -\gamma \log \int e^{\frac{\varphi(x) - D(x,\cdot)^p}{\gamma}}\, d\mu(x)$
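A minimal sketch of the smoothed transform when $\mu$ is discrete, where the integral becomes a weighted log-sum-exp (illustrative names and data):

```python
import numpy as np
from scipy.special import logsumexp
from scipy.spatial.distance import cdist

# Minimal sketch of the smoothed D-transform for mu = sum_i a_i delta_{x_i}:
# phi^{D,gamma}(y) = -gamma * log sum_i a_i exp((phi_i - D(x_i, y)^p)/gamma).
rng = np.random.default_rng(0)
n, gamma, p = 30, 0.1, 2
x = rng.normal(size=(n, 2))
a = np.full(n, 1 / n)
phi = rng.normal(size=n)          # dual potential, one value per support point of mu

def smoothed_D_transform(y):
    Dp = cdist(x, np.atleast_2d(y)).ravel() ** p
    return -gamma * logsumexp((phi - Dp) / gamma, b=a)   # stable log-sum-exp

print(smoothed_D_transform(np.array([0.5, -0.3])))
```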

SLIDE 105

Regularized Semi-dual Wasserstein

Substituting $\varphi^{D,\gamma}$ into the regularized semi-dual:

$W_\gamma(\mu, \nu) = \sup_{\varphi} \int_y \left[\int_x \varphi(x)\, d\mu(x) - \gamma \log \int_x e^{\frac{\varphi(x) - D(x,y)^p}{\gamma}}\, d\mu(x)\right] d\nu(y).$
SLIDES 106-108

Stochastic Regularized Semi-dual

What if $\mu$ is a discrete measure, $\mu = \sum_{i=1}^n a_i \delta_{x_i}$? Then $\varphi \in L_1(\mu)$ is just a vector $\alpha \in \mathbb{R}^n$:

$\sup_{\alpha \in \mathbb{R}^n} \int_y \left[\sum_{i=1}^n \alpha_i a_i - \gamma \log \sum_{i=1}^n e^{\frac{\alpha_i - D(x_i, y)^p}{\gamma}} a_i\right] d\nu(y) \;=\; \sup_{\alpha \in \mathbb{R}^n} \mathbb{E}_\nu[f(\alpha, y)]$

STOCHASTIC REGULARIZED SEMI-DUAL: an expectation over $y \sim \nu$, so it can be maximized by stochastic gradient ascent on samples (sketched below).
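A minimal sketch of stochastic gradient ascent on $\mathbb{E}_\nu[f(\alpha, y)]$, drawing one $y \sim \nu$ per step, in the spirit of [GCPB’16]; the sampler, step size, and data are illustrative:

```python
import numpy as np
from scipy.special import logsumexp
from scipy.spatial.distance import cdist

# Minimal sketch of SGA on the semi-dual E_nu[f(alpha, y)].
rng = np.random.default_rng(0)
n, gamma, p = 30, 0.1, 2
x = rng.normal(size=(n, 2))
a = np.full(n, 1 / n)

def sample_nu():                        # stand-in for sampling the (possibly
    return rng.normal(loc=1.0, size=2)  # continuous) measure nu

alpha = np.zeros(n)
for t in range(1, 5001):
    y = sample_nu()
    Dp = cdist(x, y[None, :]).ravel() ** p
    # f(alpha, y) = <alpha, a> - gamma * log sum_i a_i e^{(alpha_i - Dp_i)/gamma},
    # whose gradient is a minus a softmax-weighted vector chi:
    logits = (alpha - Dp) / gamma + np.log(a)
    chi = np.exp(logits - logsumexp(logits))
    alpha += (1.0 / np.sqrt(t)) * (a - chi)   # SGA step with decaying rate
print(alpha[:5])
```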
SLIDE 109

4. Sinkhorn Divergence

  • Def. For $\gamma > 0$, let $W_\gamma(\mu, \nu) \stackrel{\text{def}}{=} \langle P_\gamma, M_{XY}\rangle$.
  • Prop. $W_\gamma(\mu, \mu) > 0$, hence the need to normalize.
  • Def. Normalized Sinkhorn Divergence (sketched in code below):

$\bar{W}_\gamma(\mu, \nu) \stackrel{\text{def}}{=} W_\gamma(\mu, \nu) - \tfrac{1}{2}\left(W_\gamma(\mu, \mu) + W_\gamma(\nu, \nu)\right)$

  • Prop. If $p = 1$, $\bar{W}_\gamma(\mu, \nu) \to \mathrm{ED}(\mu, \nu)$ (the energy distance) as $\gamma \to \infty$.
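A minimal sketch of the normalized divergence, reusing a plain Sinkhorn solver for the three terms (illustrative data and parameters):

```python
import numpy as np
from scipy.spatial.distance import cdist

# Minimal sketch: W̄_gamma(mu, nu) = W_gamma(mu, nu)
#                 - (W_gamma(mu, mu) + W_gamma(nu, nu)) / 2,
# where W_gamma(mu, nu) = <P_gamma, M_XY> is the sharp transport cost.
def sinkhorn_cost(x, a, y, b, gamma=0.1, p=2, iters=1000):
    M = cdist(x, y) ** p
    K = np.exp(-M / gamma)
    v = np.ones(len(b))
    for _ in range(iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return ((u[:, None] * K * v[None, :]) * M).sum()

def sinkhorn_divergence(x, a, y, b, **kw):
    return (sinkhorn_cost(x, a, y, b, **kw)
            - 0.5 * (sinkhorn_cost(x, a, x, a, **kw)
                     + sinkhorn_cost(y, b, y, b, **kw)))

rng = np.random.default_rng(0)
x, y = rng.normal(size=(50, 2)), rng.normal(loc=1.0, size=(60, 2))
a, b = np.full(50, 1 / 50), np.full(60, 1 / 60)
print(sinkhorn_divergence(x, a, y, b))
```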

SLIDE 110

Algorithmic Formulation

  • Def. For $L \geq 1$, define

$W_L(\mu, \nu) \stackrel{\text{def}}{=} \langle P_L, M_{XY}\rangle, \quad \text{where } P_L \stackrel{\text{def}}{=} \operatorname{diag}(u_L) K \operatorname{diag}(v_L),$

$v_0 = \mathbf{1}_m; \quad \text{for } l \geq 0,\ u_l \stackrel{\text{def}}{=} a / (K v_l),\ v_{l+1} \stackrel{\text{def}}{=} b / (K^T u_l).$

  • Prop. $\frac{\partial W_L}{\partial X}$ and $\frac{\partial W_L}{\partial a}$ can be computed recursively, in $O(L)$ kernel $K$ × vector products.

SLIDES 111-112

Algorithmic Formulation of Reg. OT

Example: differentiability w.r.t. $a$. The transposed Jacobian-vector products obey the recursions

$\left(\frac{\partial v_0}{\partial a}\right)^{T} = 0_{m \times n}, \qquad \left(\frac{\partial u_l}{\partial a}\right)^{T} x = \frac{x}{K v_l} - \left(\frac{\partial v_l}{\partial a}\right)^{T} K^T\!\left(\frac{x \odot a}{(K v_l)^2}\right), \qquad \left(\frac{\partial v_{l+1}}{\partial a}\right)^{T} y = -\left(\frac{\partial u_l}{\partial a}\right)^{T} K\left(\frac{y \odot b}{(K^T u_l)^2}\right).$

With $N = K \odot M_{XY}$,

$\nabla_a W_L(\mu, \nu) = \left(\frac{\partial u_L}{\partial a}\right)^{T} N v_L + \left(\frac{\partial v_L}{\partial a}\right)^{T} N^T u_L.$

SLIDE 113

Wasserstein Barycenters

Wasserstein barycenter [Agueh’11] of $\nu_1, \dots, \nu_N \in \mathcal{P}(\Omega)$:

$\min_{\mu \in \mathcal{P}(\Omega)}\ \sum_{i=1}^N \lambda_i W_p^p(\mu, \nu_i)$

SLIDES 114-115

Multimarginal Formulation

  • Exact solution ($W_2$) using multimarginal OT [Agueh’11].

[Figure: point clouds and their exact $W_2$ barycenter.]

If $|\operatorname{supp}\nu_i| = n_i$, this is an LP of size $\left(\prod_i n_i,\ \sum_i n_i\right)$.

SLIDES 116-117

Finite Case, LP Formulation

  • When $\Omega$ is a finite set with metric matrix $M$, $\min_\mu \sum_i \lambda_i W_p^p(\mu, \nu_i)$ is another LP:

$\min_{P_1, \dots, P_N, a}\ \sum_{i=1}^N \lambda_i \langle P_i, M\rangle \quad \text{s.t.}\quad P_i^T \mathbf{1}_n = b_i\ \ \forall i \leq N, \qquad P_1\mathbf{1}_n = \dots = P_N\mathbf{1}_n = a.$

If $|\Omega| = n$, this LP has size $(Nn^2, (2N-1)n)$; unstable.

SLIDES 118-120

Primal Descent on Regularized W

$\min_{\mu \in Q \subset \mathcal{P}(\Omega)}\ \sum_{i=1}^N \lambda_i W_\gamma(\mu, \nu_i)$

[CD’14] Fast Computation of Wasserstein Barycenters, International Conference on Machine Learning, 2014.

SLIDES 121-123

Primal Descent on Algorithmic W

$\min_{\mu \in Q \subset \mathcal{P}(\Omega)}\ \sum_{i=1}^N \lambda_i W_L(\mu, \nu_i)$

Not a convex problem.

SLIDE 124

Inverse Wasserstein Problems

  • Consider the barycenter operator

$b(\lambda) \stackrel{\text{def}}{=} \operatorname{argmin}_{a} \sum_{i=1}^N \lambda_i W_\gamma(a, b_i),$

  • and address now Wasserstein inverse problems: given $a$, find

$\operatorname{argmin}_{\lambda \in \Sigma_N}\ E(\lambda) \stackrel{\text{def}}{=} \mathrm{Loss}(a, b(\lambda)).$

SLIDE 125

The Wasserstein Simplex

[Figure.]

SLIDE 126

Barycenters = Fixed Points

  • Prop. [BCCNP’15] Consider $B \in (\Sigma_d)^N$ (one histogram per column), let $U_0 = \mathbf{1}_{d \times N}$, and for $l \geq 0$:

$b_l \stackrel{\text{def}}{=} \exp\left(\log(K^T U_l)\,\lambda\right); \qquad V_{l+1} \stackrel{\text{def}}{=} \frac{b_l \mathbf{1}_N^T}{K^T U_l}, \quad U_{l+1} \stackrel{\text{def}}{=} \frac{B}{K V_{l+1}}$

(divisions elementwise). The iterates $b_l$ converge to the regularized barycenter $b(\lambda)$; a NumPy sketch follows.

SLIDE 127

Using Truncated Barycenters

  • Instead of using the exact barycenter,

$\operatorname{argmin}_{\lambda \in \Sigma_N}\ E(\lambda) \stackrel{\text{def}}{=} \mathrm{Loss}(a, b(\lambda)),$

  • use the $L$-iterate barycenter,

$\operatorname{argmin}_{\lambda \in \Sigma_N}\ E^{(L)}(\lambda) \stackrel{\text{def}}{=} \mathrm{Loss}(a, b^{(L)}(\lambda)),$

  • and differentiate using the chain rule:

$\nabla E^{(L)}(\lambda) = [\partial b^{(L)}]^T(g), \qquad g \stackrel{\text{def}}{=} \nabla \mathrm{Loss}(a, \cdot)\big|_{b^{(L)}(\lambda)}.$

SLIDE 128

Gradient / Barycenter Computation

[Figure.]

SLIDE 129

Application: Volume Reconstruction

[BPC’16] Wasserstein Barycentric Coordinates: Histogram Regression Using Optimal Transport, SIGGRAPH’16.

SLIDES 130-133

Application: Color Grading

[BPC’16] Wasserstein Barycentric Coordinates: Histogram Regression Using Optimal Transport, SIGGRAPH’16.

SLIDES 134-135

Application: Brain Mapping

[Figure: original activation map vs. its Euclidean projection and its Wasserstein projection.]

SLIDE 136

At Last: Application to Generative Models [GPC’17]

[Figure: computational graph of the Sinkhorn generative model: latent samples $(z_1, \dots, z_m)$ are mapped through the generator (parameters $\theta_1, \theta_2, \dots$) to $(x_1, \dots, x_m)$ and compared to input data $(y_1, \dots, y_n)$ via the cost matrix $C = (c(x_i, y_j))_{i,j}$ and kernel $K = e^{-C/\varepsilon}$; $L$ unrolled Sinkhorn iterations ($\ell = 1, \dots, L-1$) produce scalings $a_\ell, b_\ell$ and the loss $\hat{E}_L(\theta) = \langle (C \odot K) b_L, a_L \rangle$.]

Approximate the W loss by the transport cost $\bar{W}_L$ after $L$ Sinkhorn iterations.
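A minimal PyTorch sketch of this training loop: unroll $L$ Sinkhorn iterations on a minibatch cost matrix and backpropagate through them (the architecture, ε, and all hyper-parameters are illustrative assumptions, not those of [GPC’17]):

```python
import torch

# Minimal sketch (illustrative architecture and hyper-parameters): fit a
# generator f_theta by minimizing the transport cost after L unrolled Sinkhorn
# iterations; autodiff differentiates through the loop, as in the algorithmic
# formulation above.
torch.manual_seed(0)
d_latent, d_data, L, eps = 4, 2, 20, 0.5

f_theta = torch.nn.Sequential(
    torch.nn.Linear(d_latent, 64), torch.nn.Tanh(), torch.nn.Linear(64, d_data))
opt = torch.optim.Adam(f_theta.parameters(), lr=1e-3)

y = torch.randn(256, d_data) + 2.0          # stand-in for the input data
b = torch.full((y.shape[0],), 1.0 / y.shape[0])

for step in range(1000):
    z = torch.randn(128, d_latent)          # z ~ mu on the latent space
    x = f_theta(z)                          # minibatch sampled from f_theta#mu
    C = torch.cdist(x, y) ** 2              # cost matrix (c(x_i, y_j))_ij
    K = torch.exp(-C / eps)
    a = torch.full((x.shape[0],), 1.0 / x.shape[0])
    v = torch.ones_like(b)
    for _ in range(L):                      # L unrolled Sinkhorn iterations
        u = a / (K @ v)
        v = b / (K.T @ u)
    loss = ((u[:, None] * K * v[None, :]) * C).sum()   # <P_L, C>
    opt.zero_grad()
    loss.backward()
    opt.step()
```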

SLIDE 137

Example: MNIST, Learning $f_\theta$

SLIDE 138

Example: Generation of Images

[Figure: CIFAR-10 samples from MMD-GAN, γ = 1000, and γ = 10.]

  • CIFAR-10 images.
  • In these examples the cost function is also learned adversarially, as a NN mapping onto feature vectors.

SLIDE 139

Concluding Remarks

  • Regularized OT is much faster than OT.
  • Regularized OT can interpolate between W and the MMD / energy distance metrics.
  • The solution of regularized OT is “auto-differentiable”.
  • Many open problems remain!

NIPS’17 WORKSHOP · NIPS’17 TUTORIAL