
Online Sinkhorn: Optimal Transport distances from sample streams - PowerPoint PPT Presentation



  1. Online Sinkhorn: Optimal Transport distances from sample streams. Arthur Mensch, joint work with Gabriel Peyré. École Normale Supérieure, Département de Mathématiques et Applications, Paris, France. CIRM, 3/12/2020.

  2. Optimal transport for machine learning. Density fitting.

  3. Optimal transport for machine learning. Density fitting. Distance between points.

  4. Optimal transport for machine learning. Density fitting. Distance between points. Distance between distributions: α ∈ P(X), β ∈ P(X). Dependency on the cost C : X × X → R: W(α, β, C).

  5. The trouble with optimal transport. Figure: StyleGAN2 (neural network → generated sample). In ML, at least one distribution is not discrete: α ≠ (1/n) Σ_i δ_{x_i}. Algorithms for OT work with discrete distributions.

  6. The trouble with optimal transport. Figure: StyleGAN2 (neural network → generated sample). In ML, at least one distribution is not discrete: α ≠ (1/n) Σ_i δ_{x_i}. Algorithms for OT work with discrete distributions. Need for consistent estimators of W(α, β), using streams of samples (x_t)_t, (y_t)_t from α and β.

  7. The trouble with optimal transport. Figure: StyleGAN2 (neural network → generated sample). In ML, at least one distribution is not discrete: α ≠ (1/n) Σ_i δ_{x_i}. Algorithms for OT work with discrete distributions. Need for consistent estimators of W(α, β) and its backward operator, using streams of samples (x_t)_t, (y_t)_t from α and β.

  8. Outline. 1. Tractable algorithms for optimal transport. 2. Online Sinkhorn.

  9. Wasserstein distance (Kantorovich, 1942). α = Σ_{i=1}^n a_i δ_{x_i}, β = Σ_{j=1}^m b_j δ_{y_j}, C = (C(x_i, y_j))_{i,j}. α ∈ P(X): positions x = (x_i)_i ∈ X, weights a = (a_i)_i ∈ △_n.

  10. Wasserstein distance (Kantorovich, 1942). α = Σ_{i=1}^n a_i δ_{x_i}, β = Σ_{j=1}^m b_j δ_{y_j}, C = (C(x_i, y_j))_{i,j}. α ∈ P(X): positions x = (x_i)_i ∈ X, weights a = (a_i)_i ∈ △_n. P ∈ △_{n×m}, P1 = a, P^⊤1 = b.

  11. Wasserstein distance (Kantorovich, 1942). α = Σ_{i=1}^n a_i δ_{x_i}, β = Σ_{j=1}^m b_j δ_{y_j}, C = (C(x_i, y_j))_{i,j}. α ∈ P(X): positions x = (x_i)_i ∈ X, weights a = (a_i)_i ∈ △_n. P ∈ △_{n×m}, P1 = a, P^⊤1 = b. Cost: Σ_{i,j} P_{i,j} C_{i,j} = ⟨P, C⟩.

  12. Wasserstein distance (Kantorovich, 1942). W_C(α, β) = min_{P ∈ △_{n×m}, P1 = a, P^⊤1 = b} ⟨P, C⟩. [1] M. Cuturi. “Sinkhorn Distances: Lightspeed Computation of Optimal Transport”. In: Advances in Neural Information Processing Systems. 2013.
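To make the discrete Kantorovich problem concrete, here is a minimal numpy/scipy sketch (not from the slides) that solves W_C(α, β) = min ⟨P, C⟩ subject to P1 = a, P^⊤1 = b as a generic linear program; the Gaussian point clouds and the squared Euclidean cost are illustrative assumptions.

```python
# Illustrative sketch: the Kantorovich LP for small discrete measures,
# solved with scipy's generic linear-programming interface.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, m = 4, 5
x = rng.normal(size=(n, 2))          # support of alpha (assumed data)
y = rng.normal(size=(m, 2))          # support of beta (assumed data)
a = np.full(n, 1.0 / n)              # weights of alpha
b = np.full(m, 1.0 / m)              # weights of beta
C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)  # squared Euclidean cost

# Marginal constraints written on the flattened plan vec(P) (row-major).
A_rows = np.kron(np.eye(n), np.ones((1, m)))   # P 1 = a
A_cols = np.kron(np.ones((1, n)), np.eye(m))   # P^T 1 = b
A_eq = np.vstack([A_rows, A_cols])
b_eq = np.concatenate([a, b])

res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
P = res.x.reshape(n, m)
print("W_C(alpha, beta) =", (P * C).sum())
```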

  13. Wasserstein distance (Kantorovich, 1942). W_C(α, β) = min_{P ∈ △_{n×m}, P1 = a, P^⊤1 = b} ⟨P, C⟩. Entropic regularization [1]: W(α, β) = W_C^1(α, β) = min_{P ∈ △_{n×m}, P1 = a, P^⊤1 = b} ⟨P, C⟩ + KL(P | a ⊗ b). [1] M. Cuturi. “Sinkhorn Distances: Lightspeed Computation of Optimal Transport”. In: Advances in Neural Information Processing Systems. 2013.
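As a small worked check of the regularized objective (again not from the slides), the sketch below evaluates ⟨P, C⟩ + KL(P | a ⊗ b) for two feasible plans: the independent coupling a ⊗ b, whose KL term vanishes, and a diagonal coupling, whose KL term equals log n; the point clouds are illustrative assumptions.

```python
# Illustrative sketch: evaluating the entropic objective <P, C> + KL(P | a x b).
import numpy as np

rng = np.random.default_rng(0)
n = 4
x, y = rng.normal(size=(n, 2)), rng.normal(size=(n, 2))
a = b = np.full(n, 1 / n)
C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)

def objective(P):
    # <P, C> + KL(P | a x b), with the convention 0 log 0 = 0
    ab = np.outer(a, b)
    mask = P > 0
    kl = (P[mask] * np.log(P[mask] / ab[mask])).sum()
    return (P * C).sum() + kl

print(objective(np.outer(a, b)))   # independent coupling: KL term is 0
print(objective(np.eye(n) / n))    # diagonal coupling: KL term is log(n)
```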

  14. Working with continuous objects. C : X × X → R: functions. α, β ∈ P(X): distributions. π ∈ P(X × X) with marginals ∫_y dπ(·, y) = dα(·) and ∫_x dπ(x, ·) = dβ(·).

  15. Working with continuous objects. C : X × X → R: functions. α, β ∈ P(X): distributions. π ∈ P(X × X) with marginals ∫_y dπ(·, y) = dα(·) and ∫_x dπ(x, ·) = dβ(·). W(α, β) = min_{π ∈ U(α,β)} ⟨π, C⟩ + KL(π | α ⊗ β) = ∫_{x,y} C(x, y) dπ(x, y) + ∫_{x,y} log (dπ/(dα dβ))(x, y) dπ(x, y).

  16. Working with continuous objects. C : X × X → R: functions. α, β ∈ P(X): distributions. π ∈ P(X × X) with marginals ∫_y dπ(·, y) = dα(·) and ∫_x dπ(x, ·) = dβ(·). W(α, β) = min_{π ∈ U(α,β)} ⟨π, C⟩ + KL(π | α ⊗ β) = ∫_{x,y} C(x, y) dπ(x, y) + ∫_{x,y} log (dπ/(dα dβ))(x, y) dπ(x, y). Discrete case: π = Σ_{i,j} P_{i,j} δ_{(x_i, y_j)}.
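A brief numpy check of the discrete case, included only for illustration: when π = Σ_{i,j} P_{i,j} δ_{(x_i, y_j)}, the continuous marginal constraints on π reduce to row and column sums of the matrix P.

```python
# Illustrative check: discrete marginal constraints are row/column sums of P.
import numpy as np

n, m = 4, 5
a, b = np.full(n, 1 / n), np.full(m, 1 / m)

P = np.outer(a, b)                     # independent coupling, always feasible
print(np.allclose(P.sum(axis=1), a))   # integrating out y: P 1 = a
print(np.allclose(P.sum(axis=0), b))   # integrating out x: P^T 1 = b
```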

  17. Computing W using matrix scaling. W(α, β) = min_{π ∈ U(α,β)} ⟨π, C⟩ + KL(π | α ⊗ β). Fenchel-Rockafellar [2] dual: W(α, β) = max_{f,g ∈ C(X)} ⟨α, f⟩ + ⟨β, g⟩ − ⟨α ⊗ β, exp(f ⊕ g − C)⟩ + 1. [2] R. T. Rockafellar. “Monotone operators associated with saddle-functions and minimax problems”. In: Proceedings of Symposia in Pure Mathematics. Vol. 18.1. 1970, pp. 241–250. [3] Richard Sinkhorn. “A relationship between arbitrary positive matrices and doubly stochastic matrices”. In: The Annals of Mathematical Statistics 35 (1964), pp. 876–879.

  18. Computing W using matrix scaling. W(α, β) = min_{π ∈ U(α,β)} ⟨π, C⟩ + KL(π | α ⊗ β). Fenchel-Rockafellar [2] dual: W(α, β) = max_{f,g ∈ C(X)} ⟨α, f⟩ + ⟨β, g⟩ − ⟨α ⊗ β, exp(f ⊕ g − C)⟩ + 1. Alternating maximisation: the Sinkhorn-Knopp algorithm [3] iterates f_t(·) = T_β(g_{t−1})(·) = −log ∫_{y∈X} exp(g_{t−1}(y) − C(·, y)) dβ(y) and g_t(·) = T_α(f_t)(·) = −log ∫_{x∈X} exp(f_t(x) − C(x, ·)) dα(x). [2] R. T. Rockafellar. “Monotone operators associated with saddle-functions and minimax problems”. In: Proceedings of Symposia in Pure Mathematics. Vol. 18.1. 1970, pp. 241–250. [3] Richard Sinkhorn. “A relationship between arbitrary positive matrices and doubly stochastic matrices”. In: The Annals of Mathematical Statistics 35 (1964), pp. 876–879.

  19. Computing W using matrix scaling. W(α, β) = min_{π ∈ U(α,β)} ⟨π, C⟩ + KL(π | α ⊗ β). Fenchel-Rockafellar [2] dual: W(α, β) = max_{f,g ∈ C(X)} ⟨α, f⟩ + ⟨β, g⟩ − ⟨α ⊗ β, exp(f ⊕ g − C)⟩ + 1. Alternating maximisation: the Sinkhorn-Knopp algorithm [3]. At convergence, the potentials satisfy the fixed-point equations f⋆(·) = T_β(g⋆)(·) ≜ −log ∫_{y∈X} exp(g⋆(y) − C(·, y)) dβ(y) and g⋆(·) = T_α(f⋆)(·) ≜ −log ∫_{x∈X} exp(f⋆(x) − C(x, ·)) dα(x). [2] R. T. Rockafellar. “Monotone operators associated with saddle-functions and minimax problems”. In: Proceedings of Symposia in Pure Mathematics. Vol. 18.1. 1970, pp. 241–250. [3] Richard Sinkhorn. “A relationship between arbitrary positive matrices and doubly stochastic matrices”. In: The Annals of Mathematical Statistics 35 (1964), pp. 876–879.

  20. Computing W using matrix scaling. W(α, β) = min_{π ∈ U(α,β)} ⟨π, C⟩ + KL(π | α ⊗ β). Fenchel-Rockafellar [2] dual (non strongly convex): W(α, β) = max_{f,g ∈ C(X)} ⟨α, f⟩ + ⟨β, g⟩ − ⟨α ⊗ β, exp(f ⊕ g − C)⟩ + 1. Alternating maximisation: the Sinkhorn-Knopp algorithm [3]. At convergence, the potentials satisfy the fixed-point equations f⋆(·) = T_β(g⋆)(·) ≜ −log ∫_{y∈X} exp(g⋆(y) − C(·, y)) dβ(y) and g⋆(·) = T_α(f⋆)(·) ≜ −log ∫_{x∈X} exp(f⋆(x) − C(x, ·)) dα(x). [2] R. T. Rockafellar. “Monotone operators associated with saddle-functions and minimax problems”. In: Proceedings of Symposia in Pure Mathematics. Vol. 18.1. 1970, pp. 241–250. [3] Richard Sinkhorn. “A relationship between arbitrary positive matrices and doubly stochastic matrices”. In: The Annals of Mathematical Statistics 35 (1964), pp. 876–879.
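Hedged sketch of a single Sinkhorn-Knopp half-update: when β is an empirical measure, the integral in T_β(g) becomes a weighted log-sum-exp over the samples. The function name `T`, the squared Euclidean cost, and the random data below are assumptions made for illustration, not part of the slides.

```python
# Sketch of T_beta(g)(.) = -log ∫ exp(g(y) - C(., y)) dbeta(y)
# for an empirical measure beta = sum_j b_j delta_{y_j}.
import numpy as np
from scipy.special import logsumexp

def T(g, b, cost_to_support):
    # cost_to_support[k, j] = C(query point k, y_j); returns f at the query points
    return -logsumexp(np.log(b)[None, :] + g[None, :] - cost_to_support, axis=1)

rng = np.random.default_rng(0)
m = 6
y = rng.normal(size=(m, 2))          # support of beta (assumed data)
b = np.full(m, 1 / m)                # weights of beta
g = np.zeros(m)                      # current potential g on the support of beta

x_query = rng.normal(size=(3, 2))    # points where we evaluate f = T_beta(g)
C = ((x_query[:, None, :] - y[None, :, :]) ** 2).sum(-1)
print(T(g, b, C))
```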

  21. Implementing the Sinkhorn algorithm. Discrete distributions: α = Σ_{i=1}^n a_i δ_{x_i}, β = Σ_{j=1}^m b_j δ_{y_j}, C = (C(x_i, y_j))_{i,j}. Repeat until convergence: f_t(x_i) = −log Σ_j b_j exp(g_{t−1}(y_j) − C(x_i, y_j)); g_t(y_j) = −log Σ_i a_i exp(f_t(x_i) − C(x_i, y_j)).

  22. Implementing the Sinkhorn algorithm. Discrete distributions: α = Σ_{i=1}^n a_i δ_{x_i}, β = Σ_{j=1}^m b_j δ_{y_j}, C = (C(x_i, y_j))_{i,j}. Repeat until convergence: f_t(x_i) = −log Σ_j b_j exp(g_{t−1}(y_j) − C(x_i, y_j)); g_t(y_j) = −log Σ_i a_i exp(f_t(x_i) − C(x_i, y_j)). Finite representation of the potentials / transportation plan.
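A minimal, self-contained numpy sketch of the discrete iterations on this slide, in the log domain (with ε = 1 folded into C, as in the slides); the dual objective from slide 17 is used to read off the regularized OT value, and the Gaussian point clouds and squared Euclidean cost are illustrative choices, not from the slides.

```python
# Discrete Sinkhorn iterations from the slide, log-domain:
#   f_t(x_i) = -log sum_j b_j exp(g_{t-1}(y_j) - C(x_i, y_j))
#   g_t(y_j) = -log sum_i a_i exp(f_t(x_i)   - C(x_i, y_j))
import numpy as np
from scipy.special import logsumexp

def sinkhorn(a, b, C, n_iter=500):
    f, g = np.zeros(len(a)), np.zeros(len(b))
    for _ in range(n_iter):
        f = -logsumexp(np.log(b)[None, :] + g[None, :] - C, axis=1)
        g = -logsumexp(np.log(a)[:, None] + f[:, None] - C, axis=0)
    # dual objective <alpha, f> + <beta, g> - <alpha x beta, exp(f + g - C)> + 1
    plan = np.exp(f[:, None] + g[None, :] - C) * np.outer(a, b)  # optimal plan at convergence
    W = a @ f + b @ g - plan.sum() + 1.0
    return f, g, W

rng = np.random.default_rng(0)
n, m = 50, 60
x, y = rng.normal(size=(n, 2)), rng.normal(size=(m, 2)) + 1.0
a, b = np.full(n, 1 / n), np.full(m, 1 / m)
C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
f, g, W = sinkhorn(a, b, C)
print("regularised OT value:", W)
```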

  23. Distances between continuous distributions. Classic approach: α, β → [sampling] → α̂, β̂ → [cost C = (C(x_i, y_j))_{i,j}] → [Sinkhorn] → W(α̂, β̂), with α̂ = (1/b) Σ_{i=1}^b δ_{x_i} and β̂ = (1/b) Σ_{i=1}^b δ_{y_i}.

  24. Distances between continuous distributions. Classic approach: α, β → [sampling] → α̂, β̂ → [cost C = (C(x_i, y_j))_{i,j}] → [Sinkhorn] → W(α̂, β̂), with α̂ = (1/b) Σ_{i=1}^b δ_{x_i} and β̂ = (1/b) Σ_{i=1}^b δ_{y_i}. Sampling once, approximation W(α̂, β̂) ≈ W(α, β).
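An illustrative sketch of this "sample once" pipeline, with two Gaussians standing in for the continuous α and β (an assumption, not from the slides): draw b samples from each, build the cost matrix on the samples, run Sinkhorn on the empirical measures, and report W(α̂, β̂) as a plug-in estimate of W(α, β).

```python
# Classic pipeline: sample once, then Sinkhorn on the empirical measures.
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)
batch = 200                                   # "b" samples drawn once
x = rng.normal(loc=0.0, size=(batch, 2))      # samples from alpha (assumed Gaussian)
y = rng.normal(loc=1.0, size=(batch, 2))      # samples from beta (assumed Gaussian)
a = b = np.full(batch, 1 / batch)             # uniform empirical weights
C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)

f, g = np.zeros(batch), np.zeros(batch)
for _ in range(300):                          # Sinkhorn on the empirical measures
    f = -logsumexp(np.log(b)[None, :] + g[None, :] - C, axis=1)
    g = -logsumexp(np.log(a)[:, None] + f[:, None] - C, axis=0)

W_hat = a @ f + b @ g - (np.exp(f[:, None] + g[None, :] - C) * np.outer(a, b)).sum() + 1
print("W(hat alpha, hat beta) =", W_hat)
```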

  25. Distances between continuous distributions. Classic approach: α, β → [sampling] → α̂, β̂ → [cost C = (C(x_i, y_j))_{i,j}] → [Sinkhorn] → W(α̂, β̂). Sampling once, approximation W(α̂, β̂) ≈ W(α, β). Our approach: repeated sampling of mini-batches α̂_t = (1/b_t) Σ_{i=n_t}^{n_{t+1}} δ_{x_i}, β̂_t = (1/b_t) Σ_{i=n_t}^{n_{t+1}} δ_{y_i}: α, β → [repeated sampling] → (α̂_t, β̂_t)_t → [cost + transform] → (f_t, g_t)_t → …
