Data-driven optimal transport
Esteban G. Tabak, NYU Courant Institute of Mathematical Sciences · Giulio Trigila, Technische Universität München · Machine Learning Seminar, November 5, 2013
The story in a picture

[Figure: a source density ρ(x) with samples x_i, a target density µ(y) with samples y_j, and the coupling π(x, y) between them.]
The optimal transport problem: find a plan to transport material from one location to another that minimizes the total transportation cost. In terms of the probability densities ρ(x) and µ(y) with x, y ∈ R^n, Monge's formulation seeks a map y(x):

min_{y(x)} M(y) = ∫ c(x, y(x)) ρ(x) dx

subject to ∫_{y⁻¹(Ω)} ρ(x) dx = ∫_Ω µ(y) dy for all measurable sets Ω. If y is smooth and one-to-one, this is equivalent to the point-wise relation ρ(x) = µ(y(x)) J_y(x), where J_y is the Jacobian determinant of the map.
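The point-wise relation can be sanity-checked numerically; the sketch below uses an illustrative choice (not from the slides): ρ uniform on (0, 1) and the map y(x) = x², whose push-forward density is µ(y) = 1/(2√y).

```python
import numpy as np

# Illustrative check of rho(x) = mu(y(x)) * J_y(x) in 1D (choices not from
# the slides): rho = Uniform(0, 1), map y(x) = x^2, so the push-forward
# density is mu(y) = 1 / (2 * sqrt(y)) on (0, 1).

def rho(x):
    return np.ones_like(x)            # density of Uniform(0, 1)

def y_map(x):
    return x**2                       # candidate transport map

def jacobian(x):
    return 2.0 * x                    # J_y(x) = dy/dx

def mu(y):
    return 1.0 / (2.0 * np.sqrt(y))   # push-forward of Uniform(0,1) under x -> x^2

x = np.linspace(0.1, 0.9, 9)
assert np.allclose(rho(x), mu(y_map(x)) * jacobian(x))  # point-wise relation holds
```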
Kantorovich's relaxation optimizes instead over transport plans π(x, y) ≥ 0:

min_{π(x,y)} K(π) = ∫∫ c(x, y) π(x, y) dx dy

subject to ρ(x) = ∫ π(x, y) dy, µ(y) = ∫ π(x, y) dx.

If c(x, y) is convex, min M(y) = min K(π), and the optimal plan is supported on the graph of the optimal map: π(S) = ρ({x : (x, y(x)) ∈ S}).
Kantorovich duality: Maximize

D(u, v) = ∫ u(x) ρ(x) dx + ∫ v(y) µ(y) dy

subject to u(x) + v(y) ≤ c(x, y).
For the quadratic cost c(x, y) = ½‖x − y‖², expanding the square reduces the problem to

max_{π(x,y)} ∫∫ x·y π(x, y) dx dy,

with dual

min_{u,v} ∫ u(x) ρ(x) dx + ∫ v(y) µ(y) dy subject to u(x) + v(y) ≥ x·y.

At the optimum, u and v are Legendre transforms of each other,

u(x) = max_y (x·y − v(y)) ≡ v*(x), v(y) = max_x (x·y − u(x)) ≡ u*(y),

and the optimal map for Monge's problem is given by y(x) = ∇u(x), with the potential u(x) satisfying the Monge–Ampère equation ρ(x) = µ(∇u(x)) det(D²u)(x).
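In one dimension the optimal map for this cost is the monotone rearrangement y(x) = F_µ⁻¹(F_ρ(x)); on samples it reduces to matching sorted points. A minimal sketch, assuming Gaussian data and equal sample sizes (illustrative choices, not from the slides):

```python
import numpy as np

# In 1D with quadratic cost, the optimal map is the monotone rearrangement;
# on samples it simply matches sorted points to sorted points.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 1000)       # samples from rho = N(0, 1)
y = rng.normal(3.0, 2.0, 1000)       # samples from mu  = N(3, 2^2)

order = np.argsort(x)
y_of_x = np.empty_like(x)
y_of_x[order] = np.sort(y)           # i-th smallest x is mapped to i-th smallest y

# Between two Gaussians the exact monotone map is affine, y = 3 + 2x;
# the sample map should track it closely in the bulk.
```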
In the Kantorovich dual formulation, the objective function to maximize is a sum of two expected values:

D(u, v) = E_ρ[u] + E_µ[v] = ∫ u(x) ρ(x) dx + ∫ v(y) µ(y) dy.

Hence, if ρ(x) and µ(y) are known only through samples {x_i} and {y_j}, it is natural to pose the problem in terms of empirical means:

Maximize D(u, v) = (1/m) Σ_{i=1..m} u(x_i) + (1/n) Σ_{j=1..n} v(y_j)

subject to u(x) + v(y) ≤ c(x, y).
Writing u_i = u(x_i), v_j = v(y_j) and c_ij = c(x_i, y_j), this is a finite linear program:

Maximize D(u, v) = (1/m) Σ_i u_i + (1/n) Σ_j v_j (1)

subject to u_i + v_j ≤ c_ij. (2)

This is dual to the uncapacitated transportation problem:

Minimize C(π) = Σ_{i,j} c_ij π_ij (3)

subject to

Σ_j π_ij = n, (4)
Σ_i π_ij = m, (5)

where π has been rescaled by the factor mn.
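Problem (3)–(5) can be checked on a tiny instance. With m = n, an optimal plan can be taken proportional to a permutation matrix (Birkhoff's theorem), so for small n a brute-force search over permutations suffices. The points and cost below are illustrative:

```python
import itertools
import numpy as np

# Tiny instance of the transportation problem: with m = n and equal weights,
# some optimal plan is (a multiple of) a permutation matrix, so for small n
# we can minimize sum_ij c_ij * pi_ij by brute force over permutations.
rng = np.random.default_rng(1)
n = 5
x = rng.random((n, 2))                                  # samples x_i of rho
y = rng.random((n, 2))                                  # samples y_j of mu
c = 0.5 * ((x[:, None, :] - y[None, :, :])**2).sum(-1)  # c_ij = |x_i - y_j|^2 / 2

best_cost, best_perm = min(
    (sum(c[i, p[i]] for i in range(n)), p)
    for p in itertools.permutations(range(n))
)
# best_perm[i] = j means all the mass at x_i is shipped to y_j.
```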
The empirical objective can itself be written in integral form,

D(u, v) = (1/m) Σ_i u(x_i) + (1/n) Σ_j v(y_j) = ∫ u(x) ρ(x) dx + ∫ v(y) µ(y) dy,

with ρ(x) = (1/m) Σ_i δ(x − x_i), µ(y) = (1/n) Σ_j δ(y − y_j), which brings us back to the purely discrete case.
Introduce the Lagrangian

L(π, u, v) = ∫∫ c(x, y) π(x, y) dx dy − ∫ [ρ(x) − (1/m) Σ_i δ(x − x_i)] u(x) dx − ∫ [µ(y) − (1/n) Σ_j δ(y − y_j)] v(y) dy,

where ρ(x) and µ(y) are shortcut notations for the marginals ρ(x) = ∫ π(x, y) dy and µ(y) = ∫ π(x, y) dx.
In terms of the Lagrangian, the problem can be formulated as a game,

d: max_{u(x), v(y)} min_{π(x,y) ≥ 0} L(π, u, v),

with dual

p: min_{π(x,y) ≥ 0} max_{u(x), v(y)} L(π, u, v).
If the functions u(x) and v(y) are unrestricted, the problem p becomes

p: min_{π(x,y) ≥ 0} ∫∫ c(x, y) π(x, y) dx dy subject to ρ(x) = (1/m) Σ_i δ(x − x_i), µ(y) = (1/n) Σ_j δ(y − y_j),

as before.
Instead, restrict the space of functions F from which u and v can be selected. If F is invariant under dilations (u ∈ F, λ ∈ R ⇒ λu ∈ F), then the problem p becomes

p: min_{π(x,y) ≥ 0} ∫∫ c(x, y) π(x, y) dx dy

subject to

∫ [ρ(x) − (1/m) Σ_i δ(x − x_i)] u(x) dx = 0, ∫ [µ(y) − (1/n) Σ_j δ(y − y_j)] v(y) dy = 0 for all u(x), v(y) ∈ F,

a weak formulation of the constraints.
The choice of F determines which features of the data the plan must match:
• If F contains the constant functions, π must integrate to one.
• If F contains the linear functions, the plan moves the mean of {x_i} into the mean of {y_j}.
• If F contains the quadratic functions, the plan moves the empirical mean and covariance matrix of {x_i} into those of {y_j}.
• F should be chosen so as neither to over- nor under-resolve ρ(x) and µ(y).
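The mean and covariance remarks can be illustrated directly: the affine map below matches the empirical mean and covariance of {x_i} to those of {y_j}. This is a sketch of the moment-matching idea only, not the slides' algorithm, and it is one of many maps with this property:

```python
import numpy as np

# Affine moment matching: z = mean_y + A (x - mean_x) with A = Sy^{1/2} Sx^{-1/2}
# (valid here because both empirical covariances are symmetric positive definite).
rng = np.random.default_rng(2)
X = rng.normal(size=(2000, 2)) @ np.array([[1.0, 0.3], [0.0, 0.5]])
Y = rng.normal(size=(2000, 2)) @ np.array([[2.0, 0.0], [0.7, 1.0]]) + 5.0

def sqrtm_spd(S):
    """Symmetric square root of an SPD matrix via eigendecomposition."""
    w, V = np.linalg.eigh(S)
    return (V * np.sqrt(w)) @ V.T

mx, my = X.mean(0), Y.mean(0)
Sx, Sy = np.cov(X.T), np.cov(Y.T)
A = sqrtm_spd(Sy) @ np.linalg.inv(sqrtm_spd(Sx))
Z = my + (X - mx) @ A.T              # mapped samples

assert np.allclose(Z.mean(0), my)    # empirical means match exactly
assert np.allclose(np.cov(Z.T), Sy)  # empirical covariances match exactly
```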
For c = ½‖x − y‖², y(x) = ∇u(x). We think of the map y(x) as the endpoint of a time-dependent flow z(x, t), such that z(x, 0) = x and z(x, ∞) = y(x). z(x, t) follows the gradient flow

ż = −∇_z [δD̃_t/δu](z), z(x, 0) = x,

where D̃_t = ∫ u(z) ρ_t(z) dz + ∫ u*(y) µ(y) dy and ρ_t is the evolving probability distribution underlying the points z(x, t).

The variational derivative of D̃_t adopts the form

δD̃_t/δu (z) = ρ_t(z) − µ(∇u(z)) det(D²u)(z),

which, applied at u = ½‖z‖² (the potential corresponding to the identity map), yields the simpler expression

δD̃_t/δu (z) = ρ_t(z) − µ(z),

so

ż = −∇_z [ρ_t(z) − µ(z)], z(x, 0) = x.
The probability density ρ_t satisfies the continuity equation

∂ρ_t/∂t + ∇_z · [ρ_t ż] = 0,

yielding

∂ρ_t/∂t − ∇_z · [ρ_t ∇_z(ρ_t − µ)] = 0,

a nonlinear heat equation that converges to ρ_{t=∞}(z) = µ(z).
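A crude particle discretization of the flow ż = −∇_z[ρ_t(z) − µ(z)] in one dimension, with ρ_t estimated from the particles by a Gaussian kernel; the densities, bandwidth, and step size are illustrative choices, not taken from the slides:

```python
import numpy as np

# Particle sketch of z' = -d/dz [rho_t(z) - mu(z)] in 1D.
rng = np.random.default_rng(3)
z = rng.normal(0.0, 1.0, 400)          # particles carrying rho_t, start ~ N(0, 1)
h, dt, steps = 0.35, 0.2, 300
target_mean = 1.5                      # target mu = N(1.5, 1)

def d_mu(s):
    """Derivative of the target density mu = N(target_mean, 1)."""
    return -(s - target_mean) * np.exp(-0.5 * (s - target_mean)**2) / np.sqrt(2 * np.pi)

def d_rho(s, z):
    """Derivative of the Gaussian-kernel density estimate of rho_t at s."""
    d = (s[:, None] - z[None, :]) / h
    k = np.exp(-0.5 * d**2) / (np.sqrt(2 * np.pi) * h)
    return (-d / h * k).mean(axis=1)

for _ in range(steps):
    z = z - dt * (d_rho(z, z) - d_mu(z))   # explicit Euler step of the flow

# The particle cloud drifts from mean 0 toward the target mean 1.5.
```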
In practice D̃_t is estimated from the samples,

D̃_t = (1/m) Σ_i u(z_i) + (1/n) Σ_j u*(y_j),

with u = u(z, β), β a finite-dimensional parameter, and an appropriate bandwidth. The gradient flow in function space,

ż = −∇_z [δD̃_t/δu](z) = −∇_z [ρ_t(z) − µ(z)],

then becomes a gradient descent in the parameters:

−β̇ = ∇_β D̃_t = (1/m) Σ_i ∇_β u(z_i) − (1/n) Σ_j ∇_β u(y_j),

where the second term uses the envelope theorem for ∇_β u*, evaluated near the identity map.
A natural parametrization superposes localized bumps on the identity potential,

u(z, β) = ½‖z‖² + Σ_k β_k F_k((z − z_k)/α_k),

with bandwidths adapted to the local density of points, α_k ∝ ρ_t(z_k)^(−1/d).
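A sketch of this parametrization with Gaussian bumps F_k (one plausible choice; the slides leave F_k generic). The induced map ∇_z u is the identity plus a parametrized perturbation:

```python
import numpy as np

# u(z, beta) = |z|^2/2 + sum_k beta_k F_k((z - z_k)/alpha_k) with Gaussian
# bumps F_k; centers, bandwidths, and parameters below are illustrative.
centers = np.array([[0.0, 0.0], [1.0, 1.0]])   # z_k
alphas = np.array([0.8, 1.2])                  # bandwidths alpha_k
betas = np.array([0.1, -0.05])                 # parameters beta_k

def grad_u(z):
    """Gradient of u(z, beta); z has shape (n, d)."""
    out = z.copy()                              # gradient of |z|^2 / 2 is z
    for zk, ak, bk in zip(centers, alphas, betas):
        d = (z - zk) / ak
        F = np.exp(-0.5 * (d**2).sum(-1))       # Gaussian bump F_k
        out += bk * (-d / ak) * F[:, None]      # beta_k * grad F_k
    return out

z = np.random.default_rng(4).normal(size=(5, 2))
y = grad_u(z)   # with beta = 0 this would be exactly the identity map
```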
Applications: resource allocation, . . .
Extensions: data-adapted costs of the form

c(x, y) = Σ_k w_k ‖y − y_k‖² / (α_k² + ‖x − x_k‖²),

and multi-marginal plans π(x_1, x_2, . . . , x_n).