

slide-1
SLIDE 1

Online Sinkhorn: Optimal Transport distances from sample streams

Arthur Mensch, joint work with Gabriel Peyré

École Normale Supérieure, Département de Mathématiques et Applications, Paris, France

CIRM, 3/12/2020

slide-2
SLIDE 2

Optimal transport for machine learning

Density fitting

1 / 29

slide-3
SLIDE 3

Optimal transport for machine learning

Density fitting
Distance between points

1 / 29

slide-4
SLIDE 4

Optimal transport for machine learning

Density fitting
Distance between points
Distance between distributions: α ∈ P(X), β ∈ P(X)
Dependency on the cost C : X × X → R: W(α, β, C)

1 / 29

slide-5
SLIDE 5

The trouble with optimal transport

Figure: StyleGAN2 (neural network, sample)

In ML, at least one distribution is not a discrete measure α = (1/n) Σ_{i=1}^n δ_{x_i}

Algorithms for OT work with discrete distributions

2 / 29

slide-6
SLIDE 6

The trouble with optimal transport

Figure: StyleGAN2 (neural network, sample)

In ML, at least one distribution is not a discrete measure α = (1/n) Σ_{i=1}^n δ_{x_i}

Algorithms for OT work with discrete distributions
Need for consistent estimators of W(α, β)
Using streams of samples (x_t)_t, (y_t)_t from α and β

2 / 29

slide-7
SLIDE 7

The trouble with optimal transport

Figure: StyleGAN2 (neural network, sample)

In ML, at least one distribution is not a discrete measure α = (1/n) Σ_{i=1}^n δ_{x_i}

Algorithms for OT work with discrete distributions
Need for consistent estimators of W(α, β) and its backward operator
Using streams of samples (x_t)_t, (y_t)_t from α and β

2 / 29

slide-8
SLIDE 8

Outline

1. Tractable algorithms for optimal transport

2. Online Sinkhorn

3 / 29

slide-9
SLIDE 9

Wasserstein distance (Kantorovich, 1942)

α = Σ_{i=1}^n a_i δ_{x_i}    β = Σ_{j=1}^m b_j δ_{y_j}    C = (C(x_i, y_j))_{i,j}

α ∈ P(X): positions x = (x_i)_i ∈ X, weights a = (a_i)_i ∈ △_n

4 / 29

slide-10
SLIDE 10

Wasserstein distance (Kantorovich, 1942)

α = Σ_{i=1}^n a_i δ_{x_i}    β = Σ_{j=1}^m b_j δ_{y_j}    C = (C(x_i, y_j))_{i,j}

α ∈ P(X): positions x = (x_i)_i ∈ X, weights a = (a_i)_i ∈ △_n

P ∈ △_{n×m}, P1 = a, Pᵀ1 = b

4 / 29

slide-11
SLIDE 11

Wasserstein distance (Kantorovich, 1942)

α = Σ_{i=1}^n a_i δ_{x_i}    β = Σ_{j=1}^m b_j δ_{y_j}    C = (C(x_i, y_j))_{i,j}

α ∈ P(X): positions x = (x_i)_i ∈ X, weights a = (a_i)_i ∈ △_n

P ∈ △_{n×m}, P1 = a, Pᵀ1 = b

Cost: Σ_{i,j} P_{i,j} C_{i,j} = ⟨P, C⟩

4 / 29

slide-12
SLIDE 12

Wasserstein distance (Kantorovich, 1942)

W_C(α, β) = min_{P ∈ △_{n×m}, P1 = a, Pᵀ1 = b} ⟨P, C⟩

[1] M. Cuturi. “Sinkhorn Distances: Lightspeed Computation of Optimal Transport”. In: Advances in Neural Information Processing Systems. 2013.

5 / 29
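As an illustration of the discrete Kantorovich problem above (not part of the talk), the minimization over plans P with fixed marginals can be solved as a linear program with scipy; the toy point clouds, uniform weights and squared-distance cost below are assumptions made for the example.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, m = 5, 4
x, y = rng.normal(size=(n, 1)), rng.normal(size=(m, 1))
a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)       # weights in the simplex
C = (x - y.T) ** 2                                    # cost matrix C(x_i, y_j)

# Marginal constraints P1 = a and P^T 1 = b on the flattened plan vec(P).
A_row = np.kron(np.eye(n), np.ones((1, m)))           # row sums of P
A_col = np.kron(np.ones((1, n)), np.eye(m))           # column sums of P
res = linprog(c=C.ravel(),
              A_eq=np.vstack([A_row, A_col]),
              b_eq=np.concatenate([a, b]),
              bounds=(0, None))
P = res.x.reshape(n, m)                               # optimal transport plan
print("W_C(alpha, beta) =", res.fun)
```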

slide-13
SLIDE 13

Wasserstein distance (Kantorovich, 1942)

W_C(α, β) = min_{P ∈ △_{n×m}, P1 = a, Pᵀ1 = b} ⟨P, C⟩

Entropic regularization [1]: W(α, β) = W_C^1(α, β) = min_{P ∈ △_{n×m}, P1 = a, Pᵀ1 = b} ⟨P, C⟩ + KL(P | a ⊗ b)

[1] M. Cuturi. “Sinkhorn Distances: Lightspeed Computation of Optimal Transport”. In: Advances in Neural Information Processing Systems. 2013.

5 / 29

slide-14
SLIDE 14

Working with continuous objects

C : X × X → R (functions)    α, β ∈ P(X) (distributions)    π ∈ P(X × X) with marginals:
∫_y dπ(·, y) = dα(·)    ∫_x dπ(x, ·) = dβ(·)

6 / 29

slide-15
SLIDE 15

Working with continuous objects

C : X × X → R (functions)    α, β ∈ P(X) (distributions)    π ∈ P(X × X) with marginals:
∫_y dπ(·, y) = dα(·)    ∫_x dπ(x, ·) = dβ(·)

W(α, β) = min_{π ∈ U(α,β)} ⟨π, C⟩ + KL(π | α ⊗ β)
        = min_{π ∈ U(α,β)} ∫_{x,y} C(x, y) dπ(x, y) + ∫_{x,y} log (dπ / dα dβ)(x, y) dπ(x, y)

6 / 29

slide-16
SLIDE 16

Working with continuous objects

C : X × X → R (functions)    α, β ∈ P(X) (distributions)    π ∈ P(X × X) with marginals:
∫_y dπ(·, y) = dα(·)    ∫_x dπ(x, ·) = dβ(·)

W(α, β) = min_{π ∈ U(α,β)} ⟨π, C⟩ + KL(π | α ⊗ β)
        = min_{π ∈ U(α,β)} ∫_{x,y} C(x, y) dπ(x, y) + ∫_{x,y} log (dπ / dα dβ)(x, y) dπ(x, y)

Discrete case: π = Σ_{i,j} P_{i,j} δ_{(x_i, y_j)}

6 / 29

slide-17
SLIDE 17

Computing W using matrix scaling

W(α, β) = min_{π ∈ U(α,β)} ⟨π, C⟩ + KL(π | α ⊗ β)

Fenchel-Rockafellar dual [2]: W(α, β) = max_{f,g ∈ C(X)} ⟨α, f⟩ + ⟨β, g⟩ − ⟨α ⊗ β, exp(f ⊕ g − C)⟩ + 1

[2] R. T. Rockafellar. “Monotone operators associated with saddle-functions and minimax problems”. In: Proceedings of Symposia in Pure Mathematics. Vol. 18.1. 1970, pp. 241–250.
[3] Richard Sinkhorn. “A relationship between arbitrary positive matrices and doubly stochastic matrices”. In: The Annals of Mathematical Statistics 35 (1964), pp. 876–879.

7 / 29

slide-18
SLIDE 18

Computing W using matrix scaling

W(α, β) = min_{π ∈ U(α,β)} ⟨π, C⟩ + KL(π | α ⊗ β)

Fenchel-Rockafellar dual [2]: W(α, β) = max_{f,g ∈ C(X)} ⟨α, f⟩ + ⟨β, g⟩ − ⟨α ⊗ β, exp(f ⊕ g − C)⟩ + 1

Alternated maximisation: Sinkhorn-Knopp algorithm [3]
f_t(·) = T_β(g_{t−1})(·) = − log ∫_{y∈X} exp(g_{t−1}(y) − C(·, y)) dβ(y)
g_t(·) = T_α(f_t)(·) = − log ∫_{x∈X} exp(f_t(x) − C(x, ·)) dα(x)

[2] R. T. Rockafellar. “Monotone operators associated with saddle-functions and minimax problems”. In: Proceedings of Symposia in Pure Mathematics. Vol. 18.1. 1970, pp. 241–250.
[3] Richard Sinkhorn. “A relationship between arbitrary positive matrices and doubly stochastic matrices”. In: The Annals of Mathematical Statistics 35 (1964), pp. 876–879.

7 / 29

slide-19
SLIDE 19

Computing W using matrix scaling

W(α, β) = min_{π ∈ U(α,β)} ⟨π, C⟩ + KL(π | α ⊗ β)

Fenchel-Rockafellar dual [2]: W(α, β) = max_{f,g ∈ C(X)} ⟨α, f⟩ + ⟨β, g⟩ − ⟨α ⊗ β, exp(f ⊕ g − C)⟩ + 1

Alternated maximisation: Sinkhorn-Knopp algorithm [3]
f⋆(·) = T_β(g⋆)(·) = − log ∫_{y∈X} exp(g⋆(y) − C(·, y)) dβ(y)
g⋆(·) = T_α(f⋆)(·) = − log ∫_{x∈X} exp(f⋆(x) − C(x, ·)) dα(x)

[2] R. T. Rockafellar. “Monotone operators associated with saddle-functions and minimax problems”. In: Proceedings of Symposia in Pure Mathematics. Vol. 18.1. 1970, pp. 241–250.
[3] Richard Sinkhorn. “A relationship between arbitrary positive matrices and doubly stochastic matrices”. In: The Annals of Mathematical Statistics 35 (1964), pp. 876–879.

7 / 29

slide-20
SLIDE 20

Computing W using matrix scaling

W(α, β) = min_{π ∈ U(α,β)} ⟨π, C⟩ + KL(π | α ⊗ β)

Fenchel-Rockafellar dual [2] (non strongly convex): W(α, β) = max_{f,g ∈ C(X)} ⟨α, f⟩ + ⟨β, g⟩ − ⟨α ⊗ β, exp(f ⊕ g − C)⟩ + 1

Alternated maximisation: Sinkhorn-Knopp algorithm [3]
f⋆(·) = T_β(g⋆)(·) = − log ∫_{y∈X} exp(g⋆(y) − C(·, y)) dβ(y)
g⋆(·) = T_α(f⋆)(·) = − log ∫_{x∈X} exp(f⋆(x) − C(x, ·)) dα(x)

[2] R. T. Rockafellar. “Monotone operators associated with saddle-functions and minimax problems”. In: Proceedings of Symposia in Pure Mathematics. Vol. 18.1. 1970, pp. 241–250.
[3] Richard Sinkhorn. “A relationship between arbitrary positive matrices and doubly stochastic matrices”. In: The Annals of Mathematical Statistics 35 (1964), pp. 876–879.

7 / 29

slide-21
SLIDE 21

Implementing Sinkhorn algorithm

Discrete distributions: α = Σ_{i=1}^n a_i δ_{x_i}    β = Σ_{j=1}^m b_j δ_{y_j}    C = (C(x_i, y_j))_{i,j}

Repeat until convergence:
f_t(x_i) = − log Σ_j b_j exp(g_{t−1}(y_j) − C(x_i, y_j))
g_t(y_j) = − log Σ_i a_i exp(f_t(x_i) − C(x_i, y_j))

8 / 29

slide-22
SLIDE 22

Implementing Sinkhorn algorithm

Discrete distributions: α = Σ_{i=1}^n a_i δ_{x_i}    β = Σ_{j=1}^m b_j δ_{y_j}    C = (C(x_i, y_j))_{i,j}

Repeat until convergence:
f_t(x_i) = − log Σ_j b_j exp(g_{t−1}(y_j) − C(x_i, y_j))
g_t(y_j) = − log Σ_i a_i exp(f_t(x_i) − C(x_i, y_j))

Finite representation of potentials / transportation plan

8 / 29
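A minimal numpy sketch of these discrete updates (an illustration, not the authors' code), written in the log domain with logsumexp for stability and with the regularization fixed to 1 as in the slides; the returned value is the dual objective ⟨a, f⟩ + ⟨b, g⟩ − ⟨a ⊗ b, exp(f ⊕ g − C)⟩ + 1.

```python
import numpy as np
from scipy.special import logsumexp

def sinkhorn(a, b, C, n_iter=200):
    """Sinkhorn-Knopp on discrete measures: returns dual potentials (f, g) and W."""
    log_a, log_b = np.log(a), np.log(b)
    f, g = np.zeros(len(a)), np.zeros(len(b))
    for _ in range(n_iter):
        # f_t(x_i) = -log sum_j b_j exp(g_{t-1}(y_j) - C(x_i, y_j))
        f = -logsumexp(log_b[None, :] + g[None, :] - C, axis=1)
        # g_t(y_j) = -log sum_i a_i exp(f_t(x_i) - C(x_i, y_j))
        g = -logsumexp(log_a[:, None] + f[:, None] - C, axis=0)
    # Dual objective <a, f> + <b, g> - <a x b, exp(f + g - C)> + 1
    W = a @ f + b @ g - np.exp(log_a[:, None] + log_b[None, :]
                               + f[:, None] + g[None, :] - C).sum() + 1.0
    return f, g, W
```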

slide-23
SLIDE 23

Distances between continuous distributions

Classic approach:
α, β --(sampling)--> α̂, β̂ --(cost)--> C = (C(x_i, y_j))_{i,j} --(Sinkhorn)--> W(α̂, β̂)

α̂ = (1/b) Σ_{i=1}^b δ_{x_i}    β̂ = (1/b) Σ_{i=1}^b δ_{y_i}

9 / 29

slide-24
SLIDE 24

Distances between continuous distributions

Classic approach:
α, β --(sampling)--> α̂, β̂ --(cost)--> C = (C(x_i, y_j))_{i,j} --(Sinkhorn)--> W(α̂, β̂)

α̂ = (1/b) Σ_{i=1}^b δ_{x_i}    β̂ = (1/b) Σ_{i=1}^b δ_{y_i}

Sampling once, approximation W(α̂, β̂) ≈ W(α, β)

9 / 29

slide-25
SLIDE 25

Distances between continuous distributions

Classic approach:
α, β --(sampling)--> α̂, β̂ --(cost)--> C = (C(x_i, y_j))_{i,j} --(Sinkhorn)--> W(α̂, β̂)

Sampling once, approximation W(α̂, β̂) ≈ W(α, β)

Our approach:
α, β --(repeated sampling)--> (α̂_t, β̂_t)_t --(cost + transform)--> (f_t, g_t)_t

α̂_t = (1/b_t) Σ_{i=n_t}^{n_{t+1}} δ_{x_i}    β̂_t = (1/b_t) Σ_{i=n_t}^{n_{t+1}} δ_{y_i}

9 / 29

slide-26
SLIDE 26

Distances between continuous distributions

Classic approach:
α, β --(sampling)--> α̂, β̂ --(cost)--> C = (C(x_i, y_j))_{i,j} --(Sinkhorn)--> W(α̂, β̂)

Sampling once, approximation W(α̂, β̂) ≈ W(α, β)

Our approach:
α, β --(repeated sampling)--> (α̂_t, β̂_t)_t --(cost + transform)--> (f_t, g_t)_t

α̂_t = (1/b_t) Σ_{i=n_t}^{n_{t+1}} δ_{x_i}    β̂_t = (1/b_t) Σ_{i=n_t}^{n_{t+1}} δ_{y_i}

Consistent estimation: (f_t, g_t) → (f⋆, g⋆), W_t → W(α, β)

9 / 29
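For contrast with the streaming approach, here is a small sketch of the classic plug-in pipeline above (sample once, build the cost, run Sinkhorn on the empirical measures); the Gaussian toy distributions, sample size and squared-distance cost are assumptions of the example.

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)
n = 500
x = rng.normal(0.0, 1.0, size=(n, 2))                    # one batch of samples from alpha
y = rng.normal(1.0, 1.0, size=(n, 2))                    # one batch of samples from beta
a = b = np.full(n, 1.0 / n)                               # uniform empirical weights
C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)        # cost matrix C(x_i, y_j)

f, g = np.zeros(n), np.zeros(n)
for _ in range(200):                                      # Sinkhorn on alpha_hat, beta_hat
    f = -logsumexp(np.log(b)[None, :] + g[None, :] - C, axis=1)
    g = -logsumexp(np.log(a)[:, None] + f[:, None] - C, axis=0)

# Plug-in estimate W(alpha_hat, beta_hat), used as a proxy for W(alpha, beta).
W_hat = a @ f + b @ g - (np.outer(a, b) * np.exp(f[:, None] + g[None, :] - C)).sum() + 1.0
print(W_hat)
```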

slide-27
SLIDE 27

Optimising the cost in functional space [4]

W(α, β) = max_{f,g ∈ C(X)} ⟨α, f⟩ + ⟨β, g⟩ − ⟨α ⊗ β, exp(f ⊕ g − C)⟩ + 1

Parametrize f_t, g_t in a RKHS with kernel κ, use SGD

[4] A. Genevay, M. Cuturi, G. Peyré, and F. Bach. “Stochastic Optimization for Large-scale Optimal Transport”. In: Advances in Neural Information Processing Systems. 2016, pp. 3432–3440.

10 / 29

slide-28
SLIDE 28

Optimising the cost in functional space [4]

W(α, β) = max_{f,g ∈ C(X)} ⟨α, f⟩ + ⟨β, g⟩ − ⟨α ⊗ β, exp(f ⊕ g − C)⟩ + 1

Parametrize f_t, g_t in a RKHS with kernel κ, use SGD
Sample (x_t, y_t) ∼ α, β

[4] A. Genevay, M. Cuturi, G. Peyré, and F. Bach. “Stochastic Optimization for Large-scale Optimal Transport”. In: Advances in Neural Information Processing Systems. 2016, pp. 3432–3440.

10 / 29

slide-29
SLIDE 29

Optimising the cost in functional space [4]

W(α, β) = max_{f,g ∈ C(X)} ⟨α, f⟩ + ⟨β, g⟩ − ⟨α ⊗ β, exp(f ⊕ g − C)⟩ + 1

Parametrize f_t, g_t in a RKHS with kernel κ, use SGD
Sample (x_t, y_t) ∼ α, β
Compute ∇F(f_t, g_t)(x_t, y_t), update
f_t(·) = Σ_{s=1}^t μ_s κ(·, x_s)    g_t(·) = Σ_{s=1}^t ν_s κ(·, y_s)

[4] A. Genevay, M. Cuturi, G. Peyré, and F. Bach. “Stochastic Optimization for Large-scale Optimal Transport”. In: Advances in Neural Information Processing Systems. 2016, pp. 3432–3440.

10 / 29
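A rough sketch of this scheme under stated assumptions: a Gaussian RKHS kernel, a 1/√t step size, d-dimensional numpy samples, and hypothetical callables sample_alpha, sample_beta and cost supplied by the user. The coefficient of each new kernel atom is taken to be η_t (1 − exp(f(x) + g(y) − C(x, y))), the sample gradient of the dual objective above at the pair (x, y).

```python
import numpy as np

def kappa(z, atoms, sigma=1.0):
    """Gaussian RKHS kernel between one point z and an array of stored atoms."""
    return np.exp(-((z - atoms) ** 2).sum(-1) / (2 * sigma ** 2))

def rkhs_sgd(sample_alpha, sample_beta, cost, n_steps=1000, lr=0.1):
    """SGD on the dual objective F, with f and g parametrized as kernel expansions."""
    xs, ys, mus, nus = [], [], [], []             # atoms x_s, y_s and coefficients mu_s, nu_s
    for t in range(1, n_steps + 1):
        x, y = sample_alpha(), sample_beta()
        f_x = np.dot(mus, kappa(x, np.array(xs))) if xs else 0.0
        g_y = np.dot(nus, kappa(y, np.array(ys))) if ys else 0.0
        grad = 1.0 - np.exp(f_x + g_y - cost(x, y))   # stochastic gradient at (x, y)
        eta = lr / np.sqrt(t)
        xs.append(x); mus.append(eta * grad)          # f_t(.) = sum_s mu_s kappa(., x_s)
        ys.append(y); nus.append(eta * grad)          # g_t(.) = sum_s nu_s kappa(., y_s)
    return (np.array(xs), np.array(mus)), (np.array(ys), np.array(nus))
```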

slide-30
SLIDE 30

RKHS approach

11 / 29

slide-31
SLIDE 31

RKHS approach

Convergence of F(f_t, g_t) → F⋆, rate O(1/√t)

Ad-hoc parametrization
Convergence of the energy
Unstable updates

11 / 29

slide-32
SLIDE 32

Aparté for the deep learners

(Diagram: forward and backward passes)

Potentials are needed for backpropagation

12 / 29

slide-33
SLIDE 33

Outline

1. Tractable algorithms for optimal transport

2. Online Sinkhorn

13 / 29

slide-34
SLIDE 34

Parametrizing continuous potentials

At optimality:
f⋆(·) = − log ∫_{y∈X} exp(g⋆(y) − C(·, y)) dβ(y)
g⋆(·) = − log ∫_{x∈X} exp(f⋆(x) − C(x, ·)) dα(x)

14 / 29

slide-35
SLIDE 35

Parametrizing continuous potentials

At optimality:
f⋆(·) = − log ∫_{y∈X} exp(g⋆(y) − C(·, y)) dβ(y)
g⋆(·) = − log ∫_{x∈X} exp(f⋆(x) − C(x, ·)) dα(x)

Mixture parametrization with κ = exp(−C):
f_t(·) = − log Σ_{i=1}^{n_t} exp(q_i) κ(·, y_i)    g_t(·) = − log Σ_{i=1}^{n_t} exp(p_i) κ(x_i, ·)

14 / 29

slide-36
SLIDE 36

Parametrizing continuous potentials

At optimality:
f⋆(·) = − log ∫_{y∈X} exp(g⋆(y) − C(·, y)) dβ(y)
g⋆(·) = − log ∫_{x∈X} exp(f⋆(x) − C(x, ·)) dα(x)

Mixture parametrization with κ = exp(−C):
f_t(·) = − log Σ_{i=1}^{n_t} exp(q_i) κ(·, y_i)    g_t(·) = − log Σ_{i=1}^{n_t} exp(p_i) κ(x_i, ·)

Iteration t: (α̂_t)_t, (β̂_t)_t ⟹ enrich f_t(·), g_t(·)

14 / 29
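As a small illustration (the helper names are not from the talk), evaluating such a mixture-parametrized potential reduces to a logsumexp over the stored atoms, since κ = exp(−C):

```python
import numpy as np
from scipy.special import logsumexp

def eval_potential(z, atoms, q, cost):
    """f_t(z) = -log sum_i exp(q_i) kappa(z, y_i) = -logsumexp_i (q_i - C(z, y_i))."""
    return -logsumexp(q[None, :] - cost(z, atoms), axis=1)

# Illustrative squared-distance cost between two point clouds.
sq_cost = lambda z, y: ((z[:, None, :] - y[None, :, :]) ** 2).sum(-1)
```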


slide-38
SLIDE 38

A naïve approach: randomized Sinkhorn

Functional updates:
f_t(·) = − log ∫_{y∈X} exp(g_t(y) − C(·, y)) dβ(y)
g_t(·) = − log ∫_{x∈X} exp(f_t(x) − C(x, ·)) dα(x)

15 / 29

slide-39
SLIDE 39

A naïve approach: randomized Sinkhorn

Functional updates:
f_t(·) = − log ∫_{y∈X} exp(g_t(y) − C(·, y)) dβ(y)
g_t(·) = − log ∫_{x∈X} exp(f_t(x) − C(x, ·)) dα(x)

α → α̂_t = (1/b_t) Σ_{i=n_t}^{n_{t+1}} δ_{x_i}    β → β̂_t = (1/b_t) Σ_{i=n_t}^{n_{t+1}} δ_{y_i}

15 / 29

slide-40
SLIDE 40

A naïve approach: randomized Sinkhorn

Functional updates:
f_t(·) = − log ∫_{y∈X} exp(g_t(y) − C(·, y)) dβ(y)
g_t(·) = − log ∫_{x∈X} exp(f_t(x) − C(x, ·)) dα(x)

α → α̂_t = (1/b_t) Σ_{i=n_t}^{n_{t+1}} δ_{x_i}    β → β̂_t = (1/b_t) Σ_{i=n_t}^{n_{t+1}} δ_{y_i}

f_t(·) = − log (1/b_t) Σ_{i=n_t}^{n_{t+1}} exp(g_t(y_i) − C(·, y_i))
g_t(·) = − log (1/b_t) Σ_{i=n_t}^{n_{t+1}} exp(f_t(x_i) − C(x_i, ·))

15 / 29
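A sketch of one randomized Sinkhorn sweep under these updates (an illustration, not the talk's code): the integrals are replaced by a fresh minibatch and the previous potentials are simply overwritten. The cost argument is a hypothetical callable returning the pairwise cost matrix between two point clouds.

```python
import numpy as np
from scipy.special import logsumexp

def randomized_sinkhorn_step(x_batch, y_batch, g_prev, cost):
    """One sweep on a fresh minibatch; returns the new potentials as callables."""
    b = len(y_batch)
    g_at_y = g_prev(y_batch)                      # previous g evaluated on the new y's
    def f_t(z):
        # f_t(.) = -log (1/b) sum_i exp(g(y_i) - C(., y_i))
        return -logsumexp(-np.log(b) + g_at_y[None, :] - cost(z, y_batch), axis=1)
    f_at_x = f_t(x_batch)
    def g_t(z):
        # g_t(.) = -log (1/b) sum_i exp(f_t(x_i) - C(x_i, .))
        return -logsumexp(-np.log(b) + f_at_x[None, :] - cost(z, x_batch), axis=1)
    return f_t, g_t
```

As the next slides argue, iterating this produces zero-mean oscillations around the optimum rather than convergence to deterministic potentials.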

slide-41
SLIDE 41

Does it converge ?

16 / 29

slide-42
SLIDE 42

Does it converge ?

Solution f⋆, g⋆ of the dual problem:
exp(−f⋆(·)) = ∫_{y∈X} exp(g⋆(y) − C(·, y)) dβ(y) ≈ (1/b_t) Σ_{i=n_t}^{n_{t+1}} exp(g⋆(y_i) − C(·, y_i))

[5] Persi Diaconis and David Freedman. “Iterated random functions”. In: SIAM Review 41.1 (1999), pp. 45–76.

17 / 29

slide-43
SLIDE 43

Does it converge ?

Solution f⋆, g⋆ of the dual problem:
exp(−f⋆(·)) = ∫_{y∈X} exp(g⋆(y) − C(·, y)) dβ(y) ≈ (1/b_t) Σ_{i=n_t}^{n_{t+1}} exp(g⋆(y_i) − C(·, y_i))

Zero-mean oscillations “around” the optimum (in (e^{−f}, e^{−g}))

[5] Persi Diaconis and David Freedman. “Iterated random functions”. In: SIAM Review 41.1 (1999), pp. 45–76.

17 / 29

slide-44
SLIDE 44

Does it converge ?

Solution f⋆, g⋆ of the dual problem:
exp(−f⋆(·)) = ∫_{y∈X} exp(g⋆(y) − C(·, y)) dβ(y) ≈ (1/b_t) Σ_{i=n_t}^{n_{t+1}} exp(g⋆(y_i) − C(·, y_i))

Zero-mean oscillations “around” the optimum (in (e^{−f}, e^{−g}))

Proposition: (f_t, g_t) is a Markov chain that converges towards a functional random variable (f_∞, g_∞).

[5] Persi Diaconis and David Freedman. “Iterated random functions”. In: SIAM Review 41.1 (1999), pp. 45–76.

17 / 29

slide-45
SLIDE 45

Does it converge ?

Solution f⋆, g⋆ of the dual problem:
exp(−f⋆(·)) = ∫_{y∈X} exp(g⋆(y) − C(·, y)) dβ(y) ≈ (1/b_t) Σ_{i=n_t}^{n_{t+1}} exp(g⋆(y_i) − C(·, y_i))

Zero-mean oscillations “around” the optimum (in (e^{−f}, e^{−g}))

Proposition: (f_t, g_t) is a Markov chain that converges towards a functional random variable (f_∞, g_∞).

E[exp(−f_∞(·))] = ∫_{y∈X} E[exp(g_∞(y) − C(·, y))] dβ(y)

[5] Persi Diaconis and David Freedman. “Iterated random functions”. In: SIAM Review 41.1 (1999), pp. 45–76.

17 / 29

slide-46
SLIDE 46

Does it converge ?

Solution f⋆, g⋆ of the dual problem:
exp(−f⋆(·)) = ∫_{y∈X} exp(g⋆(y) − C(·, y)) dβ(y) ≈ (1/b_t) Σ_{i=n_t}^{n_{t+1}} exp(g⋆(y_i) − C(·, y_i))

Zero-mean oscillations “around” the optimum (in (e^{−f}, e^{−g}))

Proposition: (f_t, g_t) is a Markov chain that converges towards a functional random variable (f_∞, g_∞).

E[exp(−f_∞(·))] = ∫_{y∈X} E[exp(g_∞(y) − C(·, y))] dβ(y)

Proof: iterated random contracting functions [5]

[5] Persi Diaconis and David Freedman. “Iterated random functions”. In: SIAM Review 41.1 (1999), pp. 45–76.

17 / 29

slide-47
SLIDE 47

Reducing the iterate variance

We need f∞, g∞ to be deterministic functions

18 / 29

slide-48
SLIDE 48

Reducing the iterate variance

We need f_∞, g_∞ to be deterministic functions

Two strategies:
Keep past information: (κ(x_i, ·))_i, (κ(·, y_i))_i
Increase the batch size b_t

18 / 29

slide-49
SLIDE 49

Reducing the iterate variance

We need f_∞, g_∞ to be deterministic functions

Two strategies:
Keep past information: (κ(x_i, ·))_i, (κ(·, y_i))_i
Increase the batch size b_t

exp(−T̂_{β_t}(g_t)(·)) = (1/b_t) Σ_{i=n_t}^{n_{t+1}} exp(g_t(y_i) − C(·, y_i))
exp(−f_{t+1}(·)) = (1 − η_t) exp(−f_t(·)) + η_t exp(−T̂_{β_t}(g_t)(·))

18 / 29

slide-50
SLIDE 50

Updating potential representations

exp(−T̂_{β_t}(g_t)(·)) = (1/b_t) Σ_{i=n_t}^{n_{t+1}} exp(g_t(y_i) − C(·, y_i))
exp(−f_{t+1}(·)) = (1 − η_t) exp(−f_t(·)) + η_t exp(−T̂_{β_t}(g_t)(·))

Add new points + simple update rule on (q_i)_i, (p_i)_i:
f_t(·) = − log Σ_{i=1}^{n_t} exp(q_i) κ(·, y_i)

Unbiased updates in (e^{−f}, e^{−g}) space

19 / 29

slide-51
SLIDE 51

Updating potential representations

exp(−T̂_{β_t}(g_t)(·)) = (1/b_t) Σ_{i=n_t}^{n_{t+1}} exp(g_t(y_i) − C(·, y_i))
exp(−f_{t+1}(·)) = (1 − η_t) exp(−f_t(·)) + η_t exp(−T̂_{β_t}(g_t)(·))

Add new points + simple update rule on (q_i)_i, (p_i)_i:
f_t(·) = − log Σ_{i=1}^{n_t} exp(q_i) κ(·, y_i)

Unbiased updates in (e^{−f}, e^{−g}) space

f_t, g_t --(random iter.)--> T̂_{β_t}(g_t), T̂_{α_t}(f_t) --(enrich(η_t))--> f_{t+1}, g_{t+1}

19 / 29
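A minimal sketch of the enrich step for f under the κ = exp(−C) parametrization (an illustration, not the authors' implementation): in log space, the averaging above becomes "damp the stored log-weights by log(1 − η_t), then append the new atoms with log-weights log(η_t / b_t) + g_t(y_i)".

```python
import numpy as np

def enrich_f(atoms_y, q, new_y, g_at_new_y, eta):
    """Online Sinkhorn update of f's representation f(.) = -log sum_i exp(q_i - C(., y_i)).

    Implements exp(-f_{t+1}) = (1 - eta) exp(-f_t)
                             + eta * (1/b_t) sum_i exp(g_t(y_i) - C(., y_i)).
    """
    b_t = len(new_y)
    q_old = q + np.log1p(-eta)                  # scale the existing mixture by (1 - eta)
    q_new = np.log(eta / b_t) + g_at_new_y      # new atoms weighted by (eta / b_t) exp(g_t(y_i))
    return np.concatenate([atoms_y, new_y]), np.concatenate([q_old, q_new])
```

The symmetric step enriches g with the new x batch; taking η_t = 1 and discarding the stored atoms recovers the randomized Sinkhorn sweep above.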

slide-52
SLIDE 52

Complexity

f_t, g_t --(random iter.)--> T̂_{β_t}(g_t), T̂_{α_t}(f_t) --(enrich(η_t))--> f_{t+1}, g_{t+1}

Single iteration: O(b_t) extra memory, O(b_t n_{t−1}) computations
Total complexity after seeing n_t points: O(n_t) memory, O(n_t²) computations (cost + fixed-point iterations)

20 / 29

slide-53
SLIDE 53

Complexity

f_t, g_t --(random iter.)--> T̂_{β_t}(g_t), T̂_{α_t}(f_t) --(enrich(η_t))--> f_{t+1}, g_{t+1}

Single iteration: O(b_t) extra memory, O(b_t n_{t−1}) computations
Total complexity after seeing n_t points: O(n_t) memory, O(n_t²) computations (cost + fixed-point iterations)

Sinkhorn with n points: O(n²) cost + O(n²) iterations

20 / 29

slide-54
SLIDE 54

Does it converge ?

Assumptions: compact space and Lipschitz cost; schedules (b_t)_t, (η_t)_t with Σ_t η_t = ∞ and Σ_t η_t / b_t^{1/2} < ∞

Norm: ‖f‖_var = max_x f(x) − min_x f(x)

Proposition: the potentials f_t and g_t converge a.s.: ‖f_t − f⋆‖_var + ‖g_t − g⋆‖_var → 0

21 / 29

slide-55
SLIDE 55

Convergence guarantees

Proposition: the potentials f_t and g_t converge a.s.: ‖f_t − f⋆‖_var + ‖g_t − g⋆‖_var → 0

Proof: contraction + uniform law of large numbers [6]

[6] Aad W. Van der Vaart. Asymptotic statistics. Cambridge University Press, 2000.

22 / 29

slide-56
SLIDE 56

Convergence guarantees

Proposition: the potentials f_t and g_t converge a.s.: ‖f_t − f⋆‖_var + ‖g_t − g⋆‖_var → 0
Proof: contraction + uniform law of large numbers [6]

Consistent estimation of W(α, β) and of its backward operator
No rates; solution with infinite complexity (when to stop ?)

[6] Aad W. Van der Vaart. Asymptotic statistics. Cambridge University Press, 2000.

22 / 29

slide-57
SLIDE 57

Convergence guarantees

Proposition: the potentials f_t and g_t converge a.s.: ‖f_t − f⋆‖_var + ‖g_t − g⋆‖_var → 0
Proof: contraction + uniform law of large numbers [6]

Consistent estimation of W(α, β) and of its backward operator
No rates; solution with infinite complexity (when to stop ?)

Trade-off between step size η_t and increasing batch size b_t: from η_t = 1/t, b_t = log² t to η_t = 1, b_t = t² log² t
Constant batch size b: O(1/√b) approximate solution

[6] Aad W. Van der Vaart. Asymptotic statistics. Cambridge University Press, 2000.

22 / 29

slide-58
SLIDE 58

A mirror descent interpretation

W = max_{f,g ∈ C(X)} ⟨α, f⟩ + ⟨β, g⟩ − ⟨α ⊗ β, exp(f ⊕ g − C)⟩ + 1

23 / 29

slide-59
SLIDE 59

A mirror descent interpretation

W = max_{f,g ∈ C(X)} ⟨α, f⟩ + ⟨β, g⟩ − ⟨α ⊗ β, exp(f ⊕ g − C)⟩ + 1
  = − min_{μ,ν ∈ M+(X)} KL(α|μ) + KL(β|ν) + ⟨μ ⊗ ν, exp(−C)⟩ − 1

Block-convex objective on positive measures, μ_t = α exp(f_t)

23 / 29

slide-60
SLIDE 60

A mirror descent interpretation

W = max_{f,g ∈ C(X)} ⟨α, f⟩ + ⟨β, g⟩ − ⟨α ⊗ β, exp(f ⊕ g − C)⟩ + 1
  = − min_{μ,ν ∈ M+(X)} KL(α|μ) + KL(β|ν) + ⟨μ ⊗ ν, exp(−C)⟩ − 1

Block-convex objective on positive measures, μ_t = α exp(f_t)

Distance-generating function ⟹ online Sinkhorn: ω(μ, ν) = KL(α|μ) + KL(β|ν)

23 / 29

slide-61
SLIDE 61

A mirror descent interpretation

W = max_{f,g ∈ C(X)} ⟨α, f⟩ + ⟨β, g⟩ − ⟨α ⊗ β, exp(f ⊕ g − C)⟩ + 1
  = − min_{μ,ν ∈ M+(X)} KL(α|μ) + KL(β|ν) + ⟨μ ⊗ ν, exp(−C)⟩ − 1

Block-convex objective on positive measures, μ_t = α exp(f_t)

Distance-generating function ⟹ online Sinkhorn: ω(μ, ν) = KL(α|μ) + KL(β|ν)

Stochastic mirror descent (unbiased gradient): (μ_{t+1}, ν_{t+1}) = ∇ω⋆(∇ω(μ_t, ν_t) − η_t ∇̃F(μ_t, ν_t))

23 / 29

slide-62
SLIDE 62

Reducing the bias due to sampling

Figure: ‖(f_t, g_t) − (f⋆, g⋆)‖_var and |W_t − W| versus computation, comparing averaged and plain online Sinkhorn (5%), averaged and plain random Sinkhorn (5%), Sinkhorn (5%) and Sinkhorn.

Online Sinkhorn is consistent. Iterate averaging helps

24 / 29

slide-63
SLIDE 63

Simultaneous estimation of cost and potentials

Figure: ‖(f_t, g_t) − (f⋆, g⋆)‖_var and |W_t − W| versus computation (up to the full cost matrix Ĉ), for online Sinkhorn with 5% and 10% sampling and for Sinkhorn.

f_t(·) = − log Σ_{i=1}^{n_t} exp(q_i) κ(·, y_i)    g_t(·) = − log Σ_{i=1}^{n_t} exp(p_i) κ(x_i, ·)

Fill the cost matrix (C(x_i, y_j))_{i,j ∈ [1, n_t]} and move n_t → n

25 / 29

slide-64
SLIDE 64

Estimating continuous potentials

26 / 29

slide-65
SLIDE 65

Comparing with RKHS SGD

27 / 29

slide-66
SLIDE 66

Conclusion

Finding a fixed-point solution with noisy operators
Consistent estimation of W(α, β)
Importance of the parametrization in functional space:
f_t(·) = − log Σ_{i=1}^{n_t} exp(q_i) κ(·, y_i)    g_t(·) = − log Σ_{i=1}^{n_t} exp(p_i) κ(x_i, ·)

Can we compress the parametrization ? e.g. k-means
Can we obtain rates ? Mirror descent fails at that

28 / 29

slide-67
SLIDE 67

Thank you !

Arthur Mensch and Gabriel Peyré. “Online Sinkhorn: Optimal Transport distances from sample streams”. In: arXiv preprint arXiv:2003.01415 (2020)

29 / 29

slide-68
SLIDE 68

◮ Mensch, Arthur and Gabriel Peyré. “Online Sinkhorn: Optimal Transport distances from sample streams”. In: arXiv preprint arXiv:2003.01415 (2020).
◮ Cuturi, M. “Sinkhorn Distances: Lightspeed Computation of Optimal Transport”. In: Advances in Neural Information Processing Systems. 2013.
◮ Diaconis, Persi and David Freedman. “Iterated random functions”. In: SIAM Review 41.1 (1999), pp. 45–76.
◮ Genevay, A., M. Cuturi, G. Peyré, and F. Bach. “Stochastic Optimization for Large-scale Optimal Transport”. In: Advances in Neural Information Processing Systems. 2016, pp. 3432–3440.
◮ Rockafellar, R. T. “Monotone operators associated with saddle-functions and minimax problems”. In: Proceedings of Symposia in Pure Mathematics. Vol. 18.1. 1970, pp. 241–250.
◮ Sinkhorn, Richard. “A relationship between arbitrary positive matrices and doubly stochastic matrices”. In: The Annals of Mathematical Statistics 35 (1964), pp. 876–879.
◮ Van der Vaart, Aad W. Asymptotic statistics. Cambridge University Press, 2000.

1 / 1