Sinkhorn Barycenters with Free Support via Frank Wolfe algorithm - PowerPoint PPT Presentation


SLIDE 1

Sinkhorn Barycenters with Free Support via Frank Wolfe algorithm

Giulia Luise¹, Saverio Salzo², Massimiliano Pontil¹,², Carlo Ciliberto³

¹ Department of Computer Science, University College London, UK
² CSML, Istituto Italiano di Tecnologia, Genova, Italy
³ Department of Electrical and Electronic Engineering, Imperial College London, UK

Sinkhorn Barycenters via Frank Wolfe 1 / 22

SLIDE 2

Outline

  • 1. Introduction: Goal and Contributions
  • 2. Setting and problem statement
  • 3. Approach
  • 4. Convergence analysis
  • 5. Experiments

SLIDE 3

Introduction: Goal and Contributions

Goal and contributions

We propose a novel method to compute the barycenter of a set of probability distributions with respect to the Sinkhorn divergence that:

  • does not fix the support beforehand
  • handles both discrete and continuous measures
  • admits a convergence analysis.

SLIDE 4

Introduction: Goal and Contributions

Goal and contributions

Our analysis hinges on the following contributions:

  • We show that the gradient of the Sinkhorn divergence is Lipschitz continuous on the space of probability measures with respect to the Total Variation norm.
  • We characterize the sample complexity of an empirical estimator approximating the Sinkhorn gradients.
  • A byproduct of our analysis is the generalization of the Frank-Wolfe algorithm to settings where the objective functional is defined only on a set with empty interior, which is the case for the Sinkhorn divergence barycenter problem.

SLIDE 5

Setting and problem statement

Setting and Notation

  • X ⊂ ℝᵈ is a compact set.
  • c: X × X → ℝ is a symmetric cost function, e.g. c(x, y) = ‖x − y‖₂².
  • M₁⁺(X) is the space of probability measures on X.
  • M(X) is the Banach space of finite signed measures on X.

SLIDE 6

Setting and problem statement

Entropic Regularized Optimal Transport

For any α, β ∈ M₁⁺(X), the Optimal Transport problem with entropic regularization is defined as follows:

OT_ε(α, β) = min_{π ∈ Π(α, β)} ∫_{X²} c(x, y) dπ(x, y) + ε KL(π | α ⊗ β),   ε ≥ 0,   (1)

where:
  • KL(π | α ⊗ β) is the Kullback-Leibler divergence between the transport plan π and the product distribution α ⊗ β;
  • Π(α, β) = {π ∈ M₁⁺(X²): P₁#π = α, P₂#π = β} is the transport polytope (with Pᵢ: X × X → X the projection onto the i-th component and # the push-forward).
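For two discrete measures, problem (1) can be solved numerically with the classical Sinkhorn scaling iterations. A minimal sketch (our illustration, not the authors' code; the function name and signature are ours), assuming weight vectors a, b and a precomputed cost matrix C:

```python
import numpy as np

def sinkhorn_ot(a, b, C, eps, n_iter=500):
    """Entropic OT value OT_eps between two discrete measures with
    weights a, b and pairwise cost matrix C, via Sinkhorn scaling."""
    K = np.exp(-C / eps)                  # Gibbs kernel
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iter):
        u = a / (K @ v)                   # match the first marginal
        v = b / (K.T @ u)                 # match the second marginal
    P = u[:, None] * K * v[None, :]       # (approximate) optimal plan
    # transport cost + eps * KL(P | a x b)
    kl = np.sum(P * np.log(P / (a[:, None] * b[None, :]) + 1e-300))
    return np.sum(P * C) + eps * kl
```

This plain-domain version is only safe for moderate ε; when ε is small relative to the costs, the kernel K underflows and the iterations should be run in the log domain instead (as in [Feydy et al., 2019]).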

SLIDE 7

Setting and problem statement

Sinkhorn Divergences

To remove the bias induced by the KL term, [Genevay et al., 2018] proposed to subtract the autocorrelation terms ½ OT_ε(α, α) and ½ OT_ε(β, β) from OT_ε(α, β), obtaining the divergence

S_ε(α, β) = OT_ε(α, β) − ½ OT_ε(α, α) − ½ OT_ε(β, β),   (2)

which is nonnegative, convex, and metrizes the weak convergence (see [Feydy et al., 2019]).

In the following we study the barycenter problem with this Sinkhorn divergence.
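Given a routine for OT_ε, the debiasing in (2) amounts to three calls. A self-contained sketch for discrete measures with squared Euclidean cost (our illustration, not the authors' code; names and defaults are ours):

```python
import numpy as np

def _ot_eps(a, b, C, eps, n_iter=500):
    # entropic OT value via Sinkhorn scaling (moderate-eps regime)
    K = np.exp(-C / eps)
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iter):
        u = a / (K @ v)
        v = b / (K.T @ u)
    P = u[:, None] * K * v[None, :]
    kl = np.sum(P * np.log(P / (a[:, None] * b[None, :]) + 1e-300))
    return np.sum(P * C) + eps * kl

def sinkhorn_divergence(a, xa, b, xb, eps):
    """S_eps(alpha, beta) = OT_eps(alpha, beta)
       - OT_eps(alpha, alpha)/2 - OT_eps(beta, beta)/2, cf. eq. (2),
    for discrete measures with weights a, b on support points xa, xb
    (arrays of shape (n, d)), squared Euclidean cost."""
    cost = lambda x, y: ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    return (_ot_eps(a, b, cost(xa, xb), eps)
            - 0.5 * _ot_eps(a, a, cost(xa, xa), eps)
            - 0.5 * _ot_eps(b, b, cost(xb, xb), eps))
```

By construction S_ε(α, α) = 0, and S_ε(α, β) grows with the separation between the two measures, which is exactly what makes it usable as a barycenter objective.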

SLIDE 8

Setting and problem statement

Barycenter Problem

Barycenters of probabilities are useful in a range of applications, such as texture mixing, Bayesian inference, and imaging. The barycenter problem w.r.t. the Sinkhorn divergence is formulated as follows: given input measures β₁, …, β_m ∈ M₁⁺(X) and weights ω₁, …, ω_m ≥ 0 such that Σ_{j=1}^{m} ω_j = 1, solve

min_{α ∈ M₁⁺(X)} B_ε(α),   with   B_ε(α) = Σ_{j=1}^{m} ω_j S_ε(α, β_j).   (3)

SLIDE 9

Setting and problem statement

Approach: Frank-Wolfe algorithm

Classic methods to approach the barycenter problem:

  • 1. fix the support of the barycenter beforehand and optimize the weights only (convergence analysis available), OR
  • 2. alternately optimize over weights and support points (no convergence guarantees).

Our approach via Frank-Wolfe:
  − it iteratively populates the target barycenter, one point at a time;
  − it does not require the support to be fixed beforehand;
  − there is no hyperparameter tuning.

SLIDE 10

Approach

Frank-Wolfe Algorithm on Banach spaces

Let W be a Banach space, W* its topological dual, and D ⊂ W* a nonempty, convex, closed, bounded set. Let G: D → ℝ be convex with some smoothness properties.

Theorem. Suppose in addition that ∇G is L-Lipschitz continuous with L > 0. Let (w_k)_{k∈ℕ} be obtained according to Alg. 1. Then, for every integer k ≥ 1,

G(w_k) − min G ≤ (2 / (k + 2)) L (diam D)² + Δ_k.   (4)
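In finite dimensions the same scheme reads very concretely. A minimal illustration on the probability simplex (ours, not the paper's Alg. 1): the linear minimization oracle over the simplex is attained at a vertex, and the iterate moves toward it with step size 2/(k+2).

```python
import numpy as np

def frank_wolfe_simplex(grad, w0, n_steps=500):
    """Frank-Wolfe on the probability simplex: the linear minimization
    oracle over the simplex is attained at a vertex, and the iterate is
    a convex combination with the standard step size 2/(k+2)."""
    w = w0.copy()
    for k in range(n_steps):
        g = grad(w)                       # (approximate) gradient at w
        s = np.zeros_like(w)
        s[np.argmin(g)] = 1.0             # vertex minimizing <g, s>
        w += 2.0 / (k + 2) * (s - w)      # FW update
    return w

# example: minimize G(w) = 0.5 * ||w - t||^2 over the simplex;
# when t already lies in the simplex, the minimizer is t itself
t = np.array([0.1, 0.6, 0.3])
w = frank_wolfe_simplex(lambda w: w - t, np.array([1.0, 0.0, 0.0]))
```

After a few hundred steps w is close to t, consistently with the O(1/k) rate in (4); the iterate stays in the simplex by construction, since every update is a convex combination of simplex points.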

SLIDE 11

Approach

Can Frank-Wolfe be applied?

Optimization domain. M₁⁺(X) is convex, closed, and bounded in the Banach space M(X). ✔

Objective functional. The objective functional B_ε is convex, since it is a convex combination of S_ε(·, β_j), j = 1, …, m. ✔

Lipschitz continuity of the gradient. This is the most critical condition.

SLIDE 12

Approach

Lipschitz continuity of Sinkhorn potentials

This is one of the main contributions of the paper.

Theorem. The gradient ∇S_ε is Lipschitz continuous: there is a constant L_ε > 0 such that, for all α, α′, β, β′ ∈ M₁⁺(X),

‖∇S_ε(α, β) − ∇S_ε(α′, β′)‖_∞ ≤ L_ε (‖α − α′‖_TV + ‖β − β′‖_TV).   (5)

It follows that ∇B_ε is also Lipschitz continuous, and hence our framework is suitable for applying the FW algorithm.

SLIDE 13

Approach

How the algorithm works - I

The inner step in the FW algorithm amounts to:

µ_{k+1} ∈ argmin_{µ ∈ M₁⁺(X)} ⟨ Σ_{j=1}^{m} ω_j ∇S_ε(·, β_j)(α_k), µ ⟩.   (6)

Note that:
  • by the Bauer maximum principle, solutions of (6) are achieved at the extreme points of the optimization domain;
  • the extreme points of M₁⁺(X) are Dirac deltas.

Hence (6) is equivalent to µ_{k+1} = δ_{x_{k+1}} with

x_{k+1} ∈ argmin_{x ∈ X} Σ_{j=1}^{m} ω_j ∇S_ε(·, β_j)(α_k)(x).   (7)

SLIDE 14

Approach

How the algorithm works - II

Once the new support point x_{k+1} has been obtained, the FW update corresponds to

α_{k+1} = α_k + (2 / (k + 2)) (δ_{x_{k+1}} − α_k) = (k / (k + 2)) α_k + (2 / (k + 2)) δ_{x_{k+1}}.   (8)

Weights and support points are updated simultaneously at each iteration.
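Putting (6)–(8) together, the whole method can be sketched for discrete inputs. This is our illustration, not the authors' implementation: the argmin in (7) is restricted to a finite candidate grid, and the gradient of S_ε(·, β_j) at α_k is computed as the difference between the dual Sinkhorn potential of OT_ε(α_k, β_j) and the symmetric potential of OT_ε(α_k, α_k), each extended to new points by one log-domain softmin step (following [Feydy et al., 2019]); all names and defaults are ours.

```python
import numpy as np

def _cost(x, y):
    # squared Euclidean cost matrix between point clouds x (n,d), y (m,d)
    return ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)

def _softmin(eps, C, h, logw):
    # row-wise -eps * log sum_j w_j exp((h_j - C_ij)/eps), stabilized
    M = (h[None, :] - C) / eps + logw[None, :]
    mx = M.max(axis=1)
    return -eps * (mx + np.log(np.exp(M - mx[:, None]).sum(axis=1)))

def _dual_potential(a, xa, b, yb, eps, n_iter=200):
    # log-domain Sinkhorn for OT_eps(alpha, beta); returns potential on yb
    C, la, lb = _cost(xa, yb), np.log(a), np.log(b)
    f, g = np.zeros(len(a)), np.zeros(len(b))
    for _ in range(n_iter):
        f = _softmin(eps, C, g, lb)
        g = _softmin(eps, C.T, f, la)
    return g

def _sym_potential(a, xa, eps, n_iter=200):
    # symmetric potential of OT_eps(alpha, alpha), damped fixed point
    C, la = _cost(xa, xa), np.log(a)
    p = np.zeros(len(a))
    for _ in range(n_iter):
        p = 0.5 * (p + _softmin(eps, C, p, la))
    return p

def fw_barycenter(betas, weights, grid, eps=0.05, n_steps=30):
    """Free-support FW: each step evaluates an approximation of the
    gradient of B_eps at every candidate grid point, adds the minimizer
    as a new Dirac, and mixes it in with step size 2/(k+2), cf. (8).
    betas is a list of (weights_j, points_j); the weights must sum to 1."""
    a, xa = np.ones(1), grid[:1].copy()       # start from one grid atom
    for k in range(n_steps):
        p = _sym_potential(a, xa, eps)
        # gradient of S_eps(., beta_j) at alpha_k, extended to the grid,
        # is f_j - p; since the omega_j sum to 1, subtract p only once
        grad = -_softmin(eps, _cost(grid, xa), p, np.log(a))
        for w, (b, yb) in zip(weights, betas):
            g = _dual_potential(a, xa, b, yb, eps)
            grad += w * _softmin(eps, _cost(grid, yb), g, np.log(b))
        x_new = grid[np.argmin(grad)][None, :]   # eq. (7) on the grid
        gamma = 2.0 / (k + 2)
        a = np.concatenate([(1.0 - gamma) * a, [gamma]])
        xa = np.concatenate([xa, x_new], axis=0)
        keep = a > 1e-12                      # drop zero-weight atoms
        a, xa = a[keep], xa[keep]
    return a, xa
```

Each iteration adds at most one support point, so after k steps the barycenter has at most k + 1 atoms, and no support or step size needs to be tuned by hand.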

SLIDE 15

Convergence analysis

Convergence analysis-finite case

Theorem. Suppose that β₁, …, β_m ∈ M₁⁺(X) have finite support and let α_k be the k-th iterate of our algorithm. Then

B_ε(α_k) − min_{α ∈ M₁⁺(X)} B_ε(α) ≤ C_ε / (k + 2),   (9)

where C_ε is a constant depending on ε and on the domain X.

What if the input measures β₁, …, β_m ∈ M₁⁺(X) are continuous and we only have access to samples?

SLIDE 16

Convergence analysis

Sample complexity of Sinkhorn Potentials

FW can be applied when only an approximation of the gradient is available. Hence we need to quantify the approximation error between ∇S_ε(·, β) and ∇S_ε(·, β̂) in terms of the sample size of β̂.

Theorem (Sample Complexity of Sinkhorn Potentials). Suppose that c is smooth. Then, for any α, β ∈ M₁⁺(X) and any empirical measure β̂ of a set of n points independently sampled from β, we have, for every τ ∈ (0, 1],

‖∇₁S_ε(α, β) − ∇₁S_ε(α, β̂)‖_∞ ≤ C_ε log(3/τ) / √n   (10)

with probability at least 1 − τ.

SLIDE 17

Convergence analysis

Convergence analysis-general case

Using the sample complexity of the Sinkhorn gradient, we can characterize the convergence of our algorithm in the general setting.

Theorem. Suppose that c is smooth. Let n ∈ ℕ and let β̂₁, …, β̂_m be empirical distributions with n support points, each independently sampled from β₁, …, β_m. Let α_k be the k-th iterate of our algorithm applied to β̂₁, …, β̂_m. Then, for any τ ∈ (0, 1], the following holds with probability larger than 1 − τ:

B_ε(α_k) − min_{α ∈ M₁⁺(X)} B_ε(α) ≤ C_ε log(3m/τ) / min(k, √n).   (11)

SLIDE 18

Experiments

Barycenter of nested ellipses

Barycenter of 30 randomly generated nested ellipses on a 50 × 50 grid, similarly to [Cuturi and Doucet, 2014]. Each image is interpreted as a probability distribution in 2D.

SLIDE 19

Experiments

Barycenters of continuous measures

Barycenter of 5 Gaussian distributions with randomly generated means and covariances:
  • scatter plot: output of our method;
  • level sets of the density: true Wasserstein barycenter.
FW recovers both the mean and the covariance of the target barycenter.

SLIDE 20

Experiments

Matching of a distribution

“Barycenter” of a single measure β ∈ M₁⁺(X).

The solution of this problem is β itself, so we can interpret the intermediate iterates as compressed versions of the original measure. FW prioritizes the support points with higher weight.

SLIDE 21

Experiments

Summary

  • We proposed a novel method to compute Sinkhorn barycenters with free support via the Frank-Wolfe algorithm.
  • We proved convergence rates both for finite and for continuous measures.
  • We proved two new results on Sinkhorn divergences (Lipschitz continuity and sample complexity of the gradient) that are instrumental for the convergence analysis of the method.

SLIDE 22

Experiments

References I

Cuturi, M. and Doucet, A. (2014). Fast computation of Wasserstein barycenters. In Xing, E. P. and Jebara, T., editors, Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pages 685–693, Beijing, China. PMLR.

Feydy, J., Séjourné, T., Vialard, F.-X., Amari, S.-I., Trouvé, A., and Peyré, G. (2019). Interpolating between optimal transport and MMD using Sinkhorn divergences. In International Conference on Artificial Intelligence and Statistics (AISTATS).

Genevay, A., Peyré, G., and Cuturi, M. (2018). Learning generative models with Sinkhorn divergences. In International Conference on Artificial Intelligence and Statistics, pages 1608–1617.
