SLIDE 1

Optimal Transport for Machine Learning

Aude Genevay

CEREMADE (Université Paris-Dauphine) DMA (Ecole Normale Supérieure) MOKAPLAN Team (INRIA Paris)

Imaging in Paris - February 2018

SLIDE 2

Outline

1 Entropy Regularized OT
2 Applications in Imaging
3 Large Scale "OT" for Machine Learning
4 Application to Generative Models

SLIDE 3

Shortcomings of OT

Two main issues arise when using OT in practice:

  • Poor sample complexity: a lot of samples from µ and ν are needed to get a good approximation of W(µ, ν)
  • Heavy computational cost: solving discrete OT requires solving an LP → network simplex solver, O(n³ log n) [Pele and Werman ’09]

SLIDE 4

Entropy!

  • Basically: adding an entropic regularization smoothes the constraint
  • Makes the problem easier:
  • yields an unconstrained dual problem
  • the discrete case can be solved efficiently with an iterative algorithm (more on that later)
  • For ML applications, the regularized Wasserstein distance works better than the standard one
  • In high dimension, it helps avoid overfitting
SLIDE 5

Entropic Relaxation of OT [Cuturi ’13]

Add an entropic penalty to the Kantorovich formulation of OT:

min_{γ ∈ Π(µ,ν)}  ∫_{X×Y} c(x, y) dγ(x, y) + ε KL(γ | µ ⊗ ν)

where

KL(γ | µ ⊗ ν) := ∫_{X×Y} ( log( dγ/d(µ ⊗ ν) (x, y) ) − 1 ) dγ(x, y)
SLIDE 6

Dual Formulation

max_{u ∈ C(X), v ∈ C(Y)}  ∫_X u(x) dµ(x) + ∫_Y v(y) dν(y) − ε ∫_{X×Y} e^{(u(x) + v(y) − c(x,y))/ε} dµ(x) dν(y)

The constraint of standard OT, u(x) + v(y) ≤ c(x, y), is replaced by a smooth penalty term.

SLIDE 7

Dual Formulation

The dual problem is concave in u and v; the first-order condition for each variable yields:

∇_u = 0 ⇔ u(x) = −ε log( ∫_Y e^{(v(y) − c(x,y))/ε} dν(y) )

∇_v = 0 ⇔ v(y) = −ε log( ∫_X e^{(u(x) − c(x,y))/ε} dµ(x) )

SLIDE 8

The Discrete Case

Dual problem:

max_{u ∈ Rⁿ, v ∈ Rᵐ}  Σ_{i=1}^n u_i µ_i + Σ_{j=1}^m v_j ν_j − ε Σ_{i,j=1}^{n,m} e^{(u_i + v_j − c(x_i, y_j))/ε} µ_i ν_j

First-order conditions for each variable:

∇_u = 0 ⇔ u_i = −ε log( Σ_{j=1}^m e^{(v_j − c(x_i, y_j))/ε} ν_j )

∇_v = 0 ⇔ v_j = −ε log( Σ_{i=1}^n e^{(u_i − c(x_i, y_j))/ε} µ_i )

⇒ Do alternate maximizations!

SLIDE 9

Sinkhorn’s Algorithm

  • Iterates: (a, b) := (e^{u/ε}, e^{v/ε})

Sinkhorn algorithm [Cuturi ’13]:

initialize b ← 1_m, K ← (e^{−c(x_i, y_j)/ε})_{i,j}
repeat
    a ← µ ⊘ (K b)
    b ← ν ⊘ (Kᵀ a)
return γ = diag(a) K diag(b)

  • each iteration has O(nm) complexity (matrix-vector multiplication)
  • can be improved to O(n log n) on gridded spaces with convolutions [Solomon et al. ’15]
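A minimal NumPy sketch of the algorithm above (the function name and the choice of a fixed iteration count instead of a convergence test are my own):

```python
import numpy as np

def sinkhorn(mu, nu, C, eps, n_iters=1000):
    """Sinkhorn's algorithm for entropy-regularized OT.

    mu: (n,) source weights, nu: (m,) target weights,
    C:  (n, m) ground-cost matrix, eps: regularization strength.
    Returns the regularized coupling gamma = diag(a) K diag(b).
    """
    K = np.exp(-C / eps)      # Gibbs kernel K_ij = exp(-c(x_i, y_j)/eps)
    b = np.ones_like(nu)      # initialize b <- 1_m
    for _ in range(n_iters):
        a = mu / (K @ b)      # a <- mu ./ (K b)
        b = nu / (K.T @ a)    # b <- nu ./ (K^T a)
    return a[:, None] * K * b[None, :]
```

Each iteration is two matrix-vector products, i.e. O(nm), matching the complexity stated above.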

SLIDE 10

Sinkhorn - Toy Example

Marginals µ and ν. Top: evolution of γ with the number of iterations l. Bottom: evolution of γ with the regularization parameter ε.

SLIDE 11

Sinkhorn - Convergence

Definition (Hilbert metric)

Projective metric defined for x, y ∈ R^d₊₊ by

d_H(x, y) := log( max_i(x_i/y_i) / min_i(x_i/y_i) )

Theorem

The iterates (a^(l), b^(l)) converge linearly for the Hilbert metric.

Remark: the contraction coefficient deteriorates quickly as ε → 0 (exponentially in the worst case).
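As a quick illustration, the Hilbert metric in the definition above takes a few lines (a hypothetical helper, not from the slides); note the projective property: scaling one argument by a positive constant leaves the distance unchanged.

```python
import numpy as np

def hilbert_metric(x, y):
    """Hilbert projective metric on the positive orthant R^d_++:
    d_H(x, y) = log( max_i(x_i / y_i) / min_i(x_i / y_i) )."""
    r = x / y
    return np.log(r.max() / r.min())
```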

SLIDE 12

Sinkhorn - Convergence

Constraint violation

We have the following bound on the iterates:

d_H(a^(l), a⋆) ≤ κ d_H(γ^(l) 1_m, µ)

So monitoring the violation of the marginal constraints is a good way to monitor the convergence of Sinkhorn’s algorithm.

[Figure: γ 1_m − µ for various regularizations]
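The stopping rule suggested above can be sketched as follows (a variation of the basic iteration; the tolerance, the L1 norm, and the iteration cap are my own choices):

```python
import numpy as np

def sinkhorn_monitored(mu, nu, C, eps, tol=1e-9, max_iters=10_000):
    """Sinkhorn iterations stopped when the marginal-constraint
    violation ||gamma 1_m - mu||_1 falls below tol."""
    K = np.exp(-C / eps)
    b = np.ones_like(nu)
    for it in range(max_iters):
        a = mu / (K @ b)
        b = nu / (K.T @ a)
        row_marginal = a * (K @ b)   # gamma 1_m for gamma = diag(a) K diag(b)
        if np.abs(row_marginal - mu).sum() < tol:
            break
    return a[:, None] * K * b[None, :], it + 1
```

The column marginal is satisfied exactly after each b-update, so only the row marginal needs to be checked.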

SLIDE 13

Color Transfer

Image courtesy of G. Peyré

SLIDE 14

Shape / Image Barycenters

Regularized Wasserstein Barycenters [Nenna et al. ’15]

µ̄ = argmin_{µ ∈ Σ_n}  Σ_k W_ε(µ_k, µ)

Image from [Solomon et al. ’15]

SLIDE 15

Sinkhorn loss

Consider entropy-regularized OT:

min_{π ∈ Π(µ,ν)}  ∫_{X×Y} c(x, y) dπ(x, y) + ε KL(π | µ ⊗ ν)

Regularized loss:

W_{c,ε}(µ, ν) := ∫_{X×Y} c(x, y) dπ_ε(x, y)

where π_ε is the solution of the regularized problem above.

SLIDE 16

Sinkhorn Divergences : interpolation between OT and MMD

Theorem

The Sinkhorn loss between two measures µ, ν is defined as:

W̄_{c,ε}(µ, ν) = 2 W_{c,ε}(µ, ν) − W_{c,ε}(µ, µ) − W_{c,ε}(ν, ν)

with the following limiting behavior in ε:

1 as ε → 0, W̄_{c,ε}(µ, ν) → 2 W_c(µ, ν)
2 as ε → +∞, W̄_{c,ε}(µ, ν) → ‖µ − ν‖_{−c}

where ‖·‖_{−c} is the MMD distance whose kernel is minus the cost from OT.

Remark: some conditions are required on c to get an MMD distance when ε → ∞. In particular, c = ‖·‖_p^p with 0 < p < 2 is valid.
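A small NumPy sketch of the Sinkhorn loss W̄_{c,ε} for empirical 1-D measures, using c(x, y) = |x − y| (a p = 1 cost, valid per the remark above); function names and parameters are my own choices:

```python
import numpy as np

def reg_ot_cost(mu, nu, C, eps, n_iters=2000):
    """W_{c,eps}: transport cost <C, gamma_eps> at the Sinkhorn fixed point."""
    K = np.exp(-C / eps)
    b = np.ones_like(nu)
    for _ in range(n_iters):
        a = mu / (K @ b)
        b = nu / (K.T @ a)
    gamma = a[:, None] * K * b[None, :]
    return (gamma * C).sum()

def sinkhorn_divergence(x, y, mu, nu, eps):
    """Sinkhorn loss 2 W(mu,nu) - W(mu,mu) - W(nu,nu) between empirical
    measures supported on 1-D points x, y with cost c(x, y) = |x - y|."""
    cost = lambda u, v: np.abs(u[:, None] - v[None, :])
    return (2.0 * reg_ot_cost(mu, nu, cost(x, y), eps)
            - reg_ot_cost(mu, mu, cost(x, x), eps)
            - reg_ot_cost(nu, nu, cost(y, y), eps))
```

Unlike W_{c,ε} alone, the debiased quantity vanishes when the two measures coincide.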

SLIDE 17

Sample Complexity

Sample Complexity of OT and MMD

Let µ be a probability distribution on R^d, and µ̂_n an empirical measure drawn from µ. Then:

W(µ, µ̂_n) = O(n^{−1/d})
MMD(µ, µ̂_n) = O(n^{−1/2})

⇒ the number n of samples needed to reach a precision η on the Wasserstein distance grows exponentially with the dimension d of the space!

SLIDE 18

Sample Complexity - Sinkhorn loss

The sample complexity of the Sinkhorn loss seems to improve as ε grows.

Plots courtesy of G. Peyré and M. Cuturi

SLIDE 19

Generative Models

Figure: Illustration of Density Fitting on a Generative Model

SLIDE 20

Density Fitting with Sinkhorn loss "Formally"

Solve minθ E(θ) where E(θ)

def.

= ¯ Wc,ε(µθ, ν) ⇒ Issue : untractable gradient

SLIDE 21

Approximating Sinkhorn loss

  • Rather than approximating the gradient, approximate the loss itself
  • Minibatches: Ê(θ)
  • sample x_1, . . . , x_m from µ_θ
  • use the empirical Wasserstein distance W_{c,ε}(µ̂_θ, ν̂) where µ̂_θ = (1/m) Σ_{i=1}^m δ_{x_i}
  • Use L iterations of Sinkhorn’s algorithm: Ê^(L)(θ)
  • compute L steps of the algorithm
  • use this as a proxy for W(µ̂_θ, ν)
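The two approximations above combine into a single proxy loss, sketched here in NumPy with a squared Euclidean ground cost (batch sizes, cost, and L are illustrative; in training, gradients in θ would come from backpropagating through these L iterations):

```python
import numpy as np

def sinkhorn_loss_L(x, y, eps, L):
    """Minibatch proxy hat{E}^{(L)}: run exactly L Sinkhorn iterations on
    the empirical measures of the two samples, then return the transport
    cost <C, gamma_L>.  x: (m, d) model samples, y: (n, d) data samples."""
    C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)  # squared Euclidean cost
    m, n = C.shape
    mu, nu = np.full(m, 1.0 / m), np.full(n, 1.0 / n)        # uniform minibatch weights
    K = np.exp(-C / eps)
    b = np.ones(n)
    for _ in range(L):        # a fixed number of iterations, not run to convergence
        a = mu / (K @ b)
        b = nu / (K.T @ a)
    gamma = a[:, None] * K * b[None, :]
    return (gamma * C).sum()

# usage sketch: x plays the role of samples from the generator mu_theta
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 2))               # stand-in for generator samples
y = rng.normal(loc=3.0, size=(10, 2))     # stand-in for a data minibatch
loss = sinkhorn_loss_L(x, y, eps=1.0, L=50)
```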

SLIDE 22

Computing the Gradient in Practice

Figure: Scheme of the loss approximation. Latent samples (z_1, . . . , z_m) are pushed through the generative model (layers θ_1, θ_2, . . .) to produce (x_1, . . . , x_m); together with the input data (y_1, . . . , y_n) they define the cost matrix C = (c(x_i, y_j))_{i,j} and the kernel K = e^{−C/ε}; Sinkhorn iterations ℓ = 1, . . . , L − 1 (b_ℓ → a_{ℓ+1} → b_{ℓ+1}) starting from 1_m yield the output Ê^(L)(θ) = ⟨(C ⊙ K) b_L, a_L⟩.

  • Compute the exact gradient of Ê^(L)(θ) with autodiff
  • Backpropagation through the above graph
  • Same computational cost as an evaluation of Ê^(L)(θ)

SLIDE 23

Numerical Results on MNIST (L2 cost)

Figure: Samples from MNIST dataset

SLIDE 24

Numerical Results on MNIST (L2 cost)

Figure: Fully connected NN with 2 hidden layers

SLIDE 25

Numerical Results on MNIST (L2 cost)

Figure: Manifolds in the latent space for various parameters

SLIDE 26

Learning the cost [Li et al. ’17, Bellemare et al. ’17]

  • On complex data sets, the choice of a good ground metric c is not trivial
  • Use a parametric cost function c_φ(x, y) = ‖f_φ(x) − f_φ(y)‖₂² (where f_φ : X → R^d)
  • The optimization problem becomes a minmax (like GANs): min_θ max_φ W̄_{c_φ,ε}(µ_θ, ν)
  • Same approximations, but alternate between updating the cost parameters φ and the measure parameters θ

SLIDE 27

Numerical Results on CIFAR (learning the cost)

Figure: Samples from CIFAR dataset

SLIDE 28

Numerical Results on CIFAR (learning the cost)

Figure: Fully connected NN with 2 hidden layers

SLIDE 29

Numerical Results on CIFAR (learning the cost)

(a) MMD  (b) ε = 1000  (c) ε = 10

Figure: Samples from the generator trained on CIFAR-10 with the MMD and Sinkhorn losses (generated from the same samples in the latent space)

Which is better? It is not just about generating nice images, but about capturing a high-dimensional distribution... which is hard to evaluate.

SLIDE 30

Shape Registration [Feydy et al. ’17]