SLIDE 1

Data-driven optimal transport

Esteban G. Tabak, NYU - Courant Institute of Mathematical Sciences; Giulio Trigila, Technische Universität München. Machine Learning Seminar, November 5, 2013

SLIDE 2

The story in a picture

[Figure: sample points xᵢ drawn from ρ(x) in space X, sample points yⱼ drawn from µ(y) in space Y, and the coupling π(x, y) connecting them]

SLIDE 3

Optimal transport (Monge formulation)

Find a plan to transport material from one location to another that minimizes the total transportation cost. In terms of the probability densities ρ(x) and µ(y) with x, y ∈ Rⁿ,

min over y(x):  M(y) = ∫_{Rⁿ} c(x, y(x)) ρ(x) dx,

subject to

∫_{y⁻¹(Ω)} ρ(x) dx = ∫_Ω µ(y) dy  for all measurable sets Ω.

If y is smooth and one-to-one, this is equivalent to the point-wise relation

ρ(x) = µ(y(x)) J_y(x),

where J_y(x) is the Jacobian determinant of the map y.
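In one dimension the point-wise relation for an increasing map reads ρ(x) = µ(y(x)) y′(x), and it is solved by the monotone rearrangement y = G⁻¹∘F, with F and G the CDFs of ρ and µ. A minimal numpy sketch estimating it from samples (the two Gaussians below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, 2000)   # samples x_i from rho (illustrative Gaussian)
y = rng.normal(3.0, 0.5, 2000)   # samples y_j from mu  (illustrative Gaussian)

# Empirical monotone rearrangement y(x) = G^{-1}(F(x)):
# sorting both samples pairs equal quantiles of rho and mu.
xs, ys = np.sort(x), np.sort(y)

def y_of_x(q):
    # interpolate the sorted-sample correspondence at the point q
    return np.interp(q, xs, ys)

print(y_of_x(0.0))   # median of rho maps near the median of mu, about 3.0
```

Sorting realizes the quantile coupling directly; no density estimate is needed.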

SLIDE 4

Kantorovich formulation

min over π(x, y):  K(π) = ∫ c(x, y) π(x, y) dx dy

subject to

ρ(x) = ∫ π(x, y) dy,   µ(y) = ∫ π(x, y) dx.

If c(x, y) is convex, min M(y) = min K(π), and the optimal plan concentrates on the graph of the optimal map:

π(S) = ρ({x : (x, y(x)) ∈ S}).

SLIDE 5

Dual formulation

Maximize

D(u, v) = ∫ u(x) ρ(x) dx + ∫ v(y) µ(y) dy

over all continuous functions u and v satisfying

u(x) + v(y) ≤ c(x, y).

SLIDE 6

The standard example: c(x, y) = ‖y − x‖²

min ∫ c(x, y) π(x, y) dx dy  →  max ∫ x·y π(x, y) dx dy,

with dual

min ∫ u(x) ρ(x) dx + ∫ v(y) µ(y) dy,   u(x) + v(y) ≥ x·y.

At the optimum the potentials are a pair of convex conjugates:

u(x) = max_y (x·y − v(y)) ≡ v*(x),
v(y) = max_x (x·y − u(x)) ≡ u*(y),

and the optimal map for Monge's problem is given by y(x) = ∇u(x), with the potential u(x) satisfying the Monge–Ampère equation

ρ(x) = µ(∇u(x)) det(D²u)(x).
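As a concrete check of this relation (an illustrative 1-D example, not from the talk): between two Gaussians the optimal potential is a convex quadratic, y(x) = u′(x) is affine, and the 1-D Monge–Ampère equation ρ(x) = µ(u′(x)) u″(x) holds identically.

```python
import numpy as np

# Two illustrative 1-D Gaussians: rho = N(mu1, s1^2), mu = N(mu2, s2^2)
mu1, s1, mu2, s2 = 0.0, 1.0, 2.0, 0.5

def rho(x): return np.exp(-(x - mu1)**2 / (2 * s1**2)) / (s1 * np.sqrt(2 * np.pi))
def mu(y):  return np.exp(-(y - mu2)**2 / (2 * s2**2)) / (s2 * np.sqrt(2 * np.pi))

# Convex potential u(x) = mu2*x + (s2/(2*s1))*(x - mu1)^2, so that:
du = lambda x: mu2 + (s2 / s1) * (x - mu1)   # y(x) = u'(x), the optimal map
d2u = s2 / s1                                # u''(x): the 1-D det(D^2 u)

x = np.linspace(-3.0, 3.0, 101)
assert np.allclose(rho(x), mu(du(x)) * d2u)  # Monge-Ampere, verified pointwise
```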

SLIDE 7

A data-based formulation

In the Kantorovich dual formulation, the objective function to maximize is a sum of two expected values:

D(u, v) = ∫ u(x) ρ(x) dx + ∫ v(y) µ(y) dy.

Hence, if ρ(x) and µ(y) are known only through samples, it is natural to pose the problem in terms of empirical means: maximize

D(u, v) = (1/m) Σᵢ u(xᵢ) + (1/n) Σⱼ v(yⱼ)

over functions u and v satisfying

u(x) + v(y) ≤ c(x, y).

SLIDE 8

A purely discrete reduction

Maximize

D(u, v) = (1/m) Σᵢ uᵢ + (1/n) Σⱼ vⱼ    (1)

over vectors u and v satisfying

uᵢ + vⱼ ≤ cᵢⱼ.    (2)

This is dual to the uncapacitated transportation problem: minimize

C(π) = Σᵢ,ⱼ cᵢⱼ πᵢⱼ    (3)

subject to

Σⱼ πᵢⱼ = n,    (4)
Σᵢ πᵢⱼ = m.    (5)
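This transportation problem is a small linear program and can be solved directly. A sketch with scipy on made-up sample data, with the row and column sums rescaled to 1/m and 1/n (a constant rescaling of the constraints, so the optimal cost matches the empirical objective); with m = n the optimal plan is a permutation, so the LP value can be cross-checked against the assignment problem:

```python
import numpy as np
from scipy.optimize import linprog, linear_sum_assignment

rng = np.random.default_rng(0)
m = n = 5
x = rng.normal(size=m)             # samples x_i from rho
y = rng.normal(2.0, 1.0, size=n)   # samples y_j from mu
C = (x[:, None] - y[None, :])**2   # c_ij = |x_i - y_j|^2

# Equality constraints: row sums 1/m, column sums 1/n
A_eq, b_eq = [], []
for i in range(m):
    row = np.zeros((m, n)); row[i, :] = 1.0
    A_eq.append(row.ravel()); b_eq.append(1.0 / m)
for j in range(n):
    col = np.zeros((m, n)); col[:, j] = 1.0
    A_eq.append(col.ravel()); b_eq.append(1.0 / n)

res = linprog(C.ravel(), A_eq=np.array(A_eq), b_eq=np.array(b_eq),
              bounds=(0, None), method="highs")

# Cross-check: for m = n the plan is (1/n) times a permutation matrix
ri, cj = linear_sum_assignment(C)
assert np.isclose(res.fun, C[ri, cj].sum() / n)
```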

SLIDE 9

Fully constrained scenario

D(u, v) = (1/m) Σᵢ u(xᵢ) + (1/n) Σⱼ v(yⱼ) = ∫ u(x) ρ(x) dx + ∫ v(y) µ(y) dy

with

ρ(x) = (1/m) Σᵢ δ(x − xᵢ),   µ(y) = (1/n) Σⱼ δ(y − yⱼ),

which brings us back to the purely discrete case.

SLIDE 10

The Lagrangian

L(π, u, v) = ∫ c(x, y) π(x, y) dx dy
  − ∫ [ρ(x) − (1/m) Σᵢ δ(x − xᵢ)] u(x) dx
  − ∫ [µ(y) − (1/n) Σⱼ δ(y − yⱼ)] v(y) dy,

where ρ(x) and µ(y) are shorthand notation for

ρ(x) = ∫ π(x, y) dy   and   µ(y) = ∫ π(x, y) dx.
SLIDE 11

In terms of the Lagrangian, the problem can be formulated as a game:

d :  max over u(x), v(y) of  min over π(x, y) ≥ 0 of  L(π, u, v),

with dual

p :  min over π(x, y) ≥ 0 of  max over u(x), v(y) of  L(π, u, v).

If the functions u(x) and v(y) are unrestricted, the problem p becomes

p :  min over π(x, y) ≥ 0 of  ∫ c(x, y) π(x, y) dx dy

subject to

ρ(x) = (1/m) Σᵢ δ(x − xᵢ),   µ(y) = (1/n) Σⱼ δ(y − yⱼ),

as before.

SLIDE 12

A suitable relaxation

Instead, restrict the space of functions F from which u and v can be selected. If F is invariant under dilations,

u ∈ F, λ ∈ R  →  λu ∈ F,

then the problem p becomes

p :  min over π(x, y) ≥ 0 of  ∫ c(x, y) π(x, y) dx dy

subject to

∫ [ρ(x) − (1/m) Σᵢ δ(x − xᵢ)] u(x) dx = 0,
∫ [µ(y) − (1/n) Σⱼ δ(y − yⱼ)] v(y) dy = 0

for all u(x), v(y) ∈ F, a weak formulation of the constraints.

SLIDE 13

Some possible choices for F

  • 1. Constants: the constraints just guarantee that ρ and µ integrate to one.
  • 2. Linear functions: the solution is a rigid displacement that moves the mean of {xᵢ} onto the mean of {yⱼ}.
  • 3. Quadratic functions: a linear transformation mapping the empirical mean and covariance matrix of {xᵢ} into those of {yⱼ}.
  • 4. Smooth functions with appropriate local bandwidth, chosen so as neither to over- nor under-resolve ρ(x) and µ(y).
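For choice 3, the linear map has a closed form in terms of the empirical moments. A sketch on made-up 2-D sample clouds: the matrix A = S₁^(−1/2) (S₁^(1/2) S₂ S₁^(1/2))^(1/2) S₁^(−1/2) is symmetric positive definite (hence the gradient of a convex quadratic), and the affine map built from it matches the target's empirical mean and covariance exactly.

```python
import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 2)) @ np.array([[1.0, 0.3], [0.0, 0.5]])                # {x_i}
Y = rng.normal(size=(500, 2)) @ np.array([[0.8, 0.0], [0.4, 1.2]]) + [2.0, -1.0]  # {y_j}

a, b = X.mean(0), Y.mean(0)        # empirical means
S1, S2 = np.cov(X.T), np.cov(Y.T)  # empirical covariances

# A = S1^{-1/2} (S1^{1/2} S2 S1^{1/2})^{1/2} S1^{-1/2}, symmetric positive definite
R1 = np.real(sqrtm(S1)); R1i = np.linalg.inv(R1)
A = R1i @ np.real(sqrtm(R1 @ S2 @ R1)) @ R1i

Z = (X - a) @ A.T + b              # affine map applied to the samples

assert np.allclose(Z.mean(0), b)   # means match
assert np.allclose(np.cov(Z.T), S2)  # covariances match
```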

SLIDE 14

A flow-based, primal-dual approach

c = ½‖x − y‖²,   y(x) = ∇u(x).

We think of the map y(x) as the endpoint of a time-dependent flow z(x, t), such that z(x, 0) = x and z(x, ∞) = y(x). z(x, t) follows the gradient flow

ż = −∇z [δD̃ₜ/δu]|_{u = ½‖z‖²},   z(x, 0) = x,

where

D̃ₜ = ∫ u(z) ρₜ(z) dz + ∫ u*(y) µ(y) dy

and ρₜ is the evolving probability distribution underlying the points z(x, t).

SLIDE 15

The variational derivative of D̃ₜ adopts the form

δD̃ₜ/δu = ρₜ − det(D²u) µ(∇u),

which, applied at u = ½‖z‖² (the potential corresponding to the identity map), yields the simpler expression

δD̃ₜ/δu (z) = ρₜ(z) − µ(z),

so

ż = −∇z [ρₜ(z) − µ(z)],   z(x, 0) = x.

SLIDE 16

The probability density ρₜ satisfies the continuity equation

∂ρₜ(z)/∂t + ∇z · [ρₜ(z) ż] = 0,

yielding

∂ρₜ(z)/∂t − ∇z · [ρₜ(z) ∇z(ρₜ(z) − µ(z))] = 0,

a nonlinear heat equation that converges to ρ_{t=∞}(z) = µ(z).

SLIDE 17

In terms of samples,

  • 1. Objective function:

    D̃ₜ = ∫ u(z) ρₜ(z) dz + ∫ u*(y) µ(y) dy  →  (1/m) Σᵢ u(zᵢ) + (1/n) Σⱼ u*(yⱼ).

  • 2. Test functions: general u(z) ∈ F → localized u(z, β), with β a finite-dimensional parameter and appropriate bandwidth.

  • 3. Gradient descent:

    −u̇ = δD̃ₜ/δu (z) = ρₜ(z) − µ(z)  →  −β̇ = ∇β D̃ₜ = (1/m) Σᵢ ∇β u(zᵢ) − (1/n) Σⱼ ∇β u(yⱼ).
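A deliberately simplified 1-D sketch of these three steps, using global linear and quadratic test functions (choices 2–3 from the earlier slide) in place of the localized u(z, β), so each descent step applies an affine map to the evolving particles; all numbers are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
z = rng.normal(0.0, 1.0, 2000)   # evolving particles z(x, t), initially from rho
y = rng.normal(2.0, 0.7, 2000)   # fixed target samples from mu

eps = 0.1
for _ in range(1000):
    # u(z, beta) = z^2/2 + b1*z + b2*z^2/2, so grad_beta u = (z, z^2/2);
    # -beta_dot = (1/m) sum grad_beta u(z_i) - (1/n) sum grad_beta u(y_j)
    g1 = z.mean() - y.mean()
    g2 = 0.5 * (np.mean(z**2) - np.mean(y**2))
    b1, b2 = -eps * g1, -eps * g2    # small descent step in beta
    z = (1.0 + b2) * z + b1          # displace particles: z <- grad_z u(z)

# the flow matches the first two empirical moments of the target
print(z.mean(), z.std())
```

With only two moments matched, the limit is the best affine push-forward; the localized bumps of the next slide refine the map beyond second moments.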

SLIDE 18

Left to discuss:

  • 1. Form of u(z, β) and determination of the bandwidth:

    u(z, β) = ½‖z‖² + Σₖ βₖ Fₖ((z − zₖ)/αₖ),   αₖ ∝ (1/ρₜ(zₖ))^(1/d).

  • 2. Enforcement of the map's optimality.
SLIDE 19

Some applications and extensions

  • 1. Maps: fluid flow from tracers, effects of a medical treatment, resource allocation, ...
  • 2. Density estimation: ρ(x)
  • 3. Classification and clustering: ρₖ(x)
  • 4. Regression: π(x, y), dim(x) = dim(y), with

    c(x, y) = Σₖ wₖ ‖y − yₖ‖² / (αₖ² + ‖x − xₖ‖²)

  • 5. Determination of the worst scenario under given marginals: π(x₁, x₂, ..., xₙ)

SLIDE 20

Thanks!