

  1. Advanced Section #2: Optimal Transport. AC 209B: Data Science 2. Javier Zazo, Pavlos Protopapas

  2. Lecture Outline
     ◮ Historical overview
     ◮ Definitions and formulations
     ◮ Metric properties of optimal transport
     ◮ Application I: Supervised learning with Wasserstein Loss
     ◮ Application II: Domain adaptation

  3. Historical overview

  4. The origins of optimal transport
     ◮ Gaspard Monge proposed the first formulation in 1781.
     ◮ How can one move dirt from one place (déblais) to another (remblais) with minimal effort?
     ◮ He posed the problem of finding a mapping F between two distributions of mass.
     ◮ The mapping is optimized with respect to a displacement cost c(x, y).

  5. Transportation problem I
     ◮ Formulated by Frank Lauren Hitchcock in 1941.
     Factories & warehouses example:
     ◮ A fixed number of factories, each of which produces goods at a fixed output rate.
     ◮ A fixed number of warehouses, each of which has a fixed storage capacity.
     ◮ There is a cost to transport goods from each factory to each warehouse.
     ◮ Goal: find the shipment of goods from factories to warehouses with the lowest possible total cost.

  6. Transportation problem II: Example
     Factories (supply):
     ◮ F1 makes 5 units.
     ◮ F2 makes 4 units.
     ◮ F3 makes 6 units.
     Warehouses (capacity):
     ◮ W1 can store 5 units.
     ◮ W2 can store 3 units.
     ◮ W3 can store 5 units.
     ◮ W4 can store 2 units.
     Transportation costs:
            W1   W2   W3   W4
     F1      5    4    7    6
     F2      2    5    3    5
     F3      6    3    4    4

  7. Transportation problem III
     ◮ One factory can transport product to multiple warehouses.
     ◮ One warehouse can receive product from multiple factories.
     ◮ The transportation problem can be formulated as an ordinary linear constrained optimization problem (LP), with $x_{ij}$ the amount shipped from factory $i$ to warehouse $j$:
       $$\begin{aligned}
       \min_{x_{ij} \ge 0}\;\; & 5x_{11} + 4x_{12} + 7x_{13} + 6x_{14} + 2x_{21} + 5x_{22} + 3x_{23} + 5x_{24} + 6x_{31} + 3x_{32} + 4x_{33} + 4x_{34} \\
       \text{s.t.}\;\; & x_{11} + x_{12} + x_{13} + x_{14} = 5 \\
       & x_{21} + x_{22} + x_{23} + x_{24} = 4 \\
       & x_{31} + x_{32} + x_{33} + x_{34} = 6 \\
       & x_{11} + x_{21} + x_{31} \le 5 \\
       & x_{12} + x_{22} + x_{32} \le 3 \\
       & x_{13} + x_{23} + x_{33} \le 5 \\
       & x_{14} + x_{24} + x_{34} \le 2
       \end{aligned}$$
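This LP can be handed to any off-the-shelf solver. Below is a minimal sketch (not part of the original slides) using SciPy's linprog; the variable layout (rows = factories, columns = warehouses) mirrors the example above.

```python
import numpy as np
from scipy.optimize import linprog

# Cost matrix from the example: rows are factories F1-F3, columns are warehouses W1-W4.
C = np.array([[5.0, 4.0, 7.0, 6.0],
              [2.0, 5.0, 3.0, 5.0],
              [6.0, 3.0, 4.0, 4.0]])
supply = np.array([5.0, 4.0, 6.0])         # units produced by each factory
capacity = np.array([5.0, 3.0, 5.0, 2.0])  # storage capacity of each warehouse

n_f, n_w = C.shape
c = C.ravel()  # objective coefficients for x_11, x_12, ..., x_34

# Equality constraints: each factory ships everything it produces.
A_eq = np.zeros((n_f, n_f * n_w))
for i in range(n_f):
    A_eq[i, i * n_w:(i + 1) * n_w] = 1.0

# Inequality constraints: each warehouse receives at most its capacity.
A_ub = np.zeros((n_w, n_f * n_w))
for j in range(n_w):
    A_ub[j, j::n_w] = 1.0

res = linprog(c, A_ub=A_ub, b_ub=capacity, A_eq=A_eq, b_eq=supply,
              bounds=(0, None), method="highs")
print(res.x.reshape(n_f, n_w))  # optimal shipping plan
print(res.fun)                  # minimal total cost
```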

  8. Definitions and formulations

  9. Definitions
     ◮ Probability simplex: $\Delta_n = \{\, a \in \mathbb{R}^n_+ \mid \sum_{i=1}^n a_i = 1 \,\}$.
     ◮ Discrete probability distribution: $p = (p_1, p_2, \ldots, p_n) \in \Delta_n$.
     ◮ Space $\mathcal{X}$: support of the distribution (a vector/array of coordinates, temperatures, etc.).
     ◮ Discrete measure: given weights $p = (p_1, p_2, \ldots, p_n)$ and locations $x = (x_1, x_2, \ldots, x_n)$,
       $\alpha = \sum_i p_i \delta_{x_i}.$
     ◮ Radon measure: $\alpha \in \mathcal{M}(\mathcal{X})$, where $\mathcal{X}$ is equipped with a distance; integrating a continuous function $f$ against $\alpha$ (written here for a measure with density $\rho_\alpha$):
       $\int_{\mathcal{X}} f(x)\, d\alpha(x) = \int_{\mathcal{X}} f(x)\, \rho_\alpha(x)\, dx.$
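As a quick illustration (not from the slides), a discrete measure is just a pair of arrays: weights on the probability simplex and support locations; integrating a function against it reduces to a weighted sum. The values below are hypothetical.

```python
import numpy as np

# Weights p on the probability simplex Delta_3 and support locations x in R^2.
p = np.array([0.2, 0.5, 0.3])
x = np.array([[0.0, 0.0],
              [1.0, 0.5],
              [2.0, 1.0]])

assert np.all(p >= 0) and np.isclose(p.sum(), 1.0)  # p lies in the simplex

# Integrating a continuous f against alpha = sum_i p_i * delta_{x_i}
# is the weighted sum  sum_i p_i * f(x_i).
f = lambda z: np.linalg.norm(z, axis=-1)
print(np.sum(p * f(x)))
```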

  10. More definitions
     ◮ Set of positive measures: $\mathcal{M}_+(\mathcal{X})$, such that $\int_{\mathcal{X}} f(x)\, d\alpha(x) \in \mathbb{R}_+$ for nonnegative $f$.
     ◮ Set of probability measures: $\mathcal{M}^1_+(\mathcal{X})$, such that $\int_{\mathcal{X}} d\alpha(x) = 1$.

  11. Assignment and Monge problems
     ◮ $n$ origin elements (factories),
     ◮ $m = n$ destination elements (warehouses),
     ◮ we look for a permutation (an assignment, in the general case) of the elements:
       $\min_{\sigma \in \mathrm{Perm}(n)} \frac{1}{n} \sum_{i=1}^n C_{i, \sigma(i)}$
     ◮ The set of $n$ discrete elements has $n!$ possible permutations.
     ◮ Later works aimed to simplify Monge's problem, such as Hitchcock in 1941 and Kantorovich in 1942.
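For small instances the assignment problem can be solved exactly with the Hungarian algorithm. The sketch below (not part of the original slides) uses SciPy's linear_sum_assignment on a hypothetical square cost matrix, since a permutation requires n = m.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
n = 4
C = rng.uniform(0.0, 10.0, size=(n, n))   # hypothetical pairwise costs C_{i,j}

# Hungarian algorithm: finds the permutation sigma minimizing sum_i C[i, sigma(i)].
row_ind, col_ind = linear_sum_assignment(C)
sigma = col_ind                            # sigma[i] = destination assigned to origin i
avg_cost = C[row_ind, col_ind].sum() / n   # (1/n) * sum_i C[i, sigma(i)]
print(sigma, avg_cost)
```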

  12. Kantorovich relaxation
     ◮ Goal: find a minimal-cost transport plan $F$ such that
       $F \in U(p, q) = \{\, F \in \mathbb{R}^{n \times n}_+ \mid F\mathbf{1} = p \text{ and } F^T\mathbf{1} = q \,\}$
     ◮ $F\mathbf{1} = p$ sums the rows of $F$ → all goods are transported from $p$.
     ◮ $F^T\mathbf{1} = q$ sums the columns of $F$ → all goods are received in $q$.
     ◮ $p$ and $q$ are probability distributions → mass is conserved and equals 1.

  13. Relation to linear programming
     ◮ The Kantorovich problem is an LP:
       $L_C(p, q) = \min_{F \ge 0,\; F\mathbf{1} = p,\; F^T\mathbf{1} = q} \mathrm{tr}(FC) \qquad (1)$
     ◮ LPs can be solved with the simplex method, interior point methods, dual descent methods, etc. The problem is convex.
     ◮ One option is to use general-purpose LP solvers: Clp, Gurobi, Mosek, SeDuMi, CPLEX, ECOS, etc.
     ◮ Specialized methods (with Python, C, Julia, etc. libraries) also exist:
       – Network simplex
       – Approximate methods: Sinkhorn iterations, smoothed versions, etc.
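As an illustration (not from the slides), the discrete Kantorovich LP can be solved with the POT (Python Optimal Transport) package, assuming its ot.emd / ot.emd2 interface; the marginals and locations below are hypothetical.

```python
import numpy as np
import ot  # POT: Python Optimal Transport (pip install pot)

# Hypothetical marginals p, q and support locations.
p = np.array([0.5, 0.25, 0.25])
q = np.array([0.25, 0.25, 0.25, 0.25])
x = np.array([[0.0], [1.0], [2.0]])          # source locations
y = np.array([[0.5], [1.5], [2.5], [3.5]])   # target locations
C = ot.dist(x, y, metric="sqeuclidean")      # C[i, j] = |x_i - y_j|^2

F = ot.emd(p, q, C)   # optimal transport plan, computed with the network simplex
cost = np.sum(F * C)  # L_C(p, q); ot.emd2(p, q, C) returns this value directly
print(F)
print(cost)
```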

  14. Kantorovich formulation for arbitrary measures
     ◮ Now $C$ becomes a cost function $c(x, y): \mathcal{X} \times \mathcal{Y} \to \mathbb{R}_+$.
     ◮ For discrete measures $\alpha = \sum_i p_i \delta_{x_i}$ and $\beta = \sum_i q_i \delta_{y_i}$:
       – $c(x, y)$ is still a matrix, whose entries depend on the locations of the measures.
     ◮ For arbitrary probability measures:
       – Define a coupling $\pi \in \mathcal{M}^1_+(\mathcal{X} \times \mathcal{Y})$ → a joint probability distribution over $\mathcal{X}$ and $\mathcal{Y}$:
         $U(\alpha, \beta) = \{\, \pi \in \mathcal{M}^1_+(\mathcal{X} \times \mathcal{Y}) \mid P_{\mathcal{X}\sharp}\pi = \alpha \text{ and } P_{\mathcal{Y}\sharp}\pi = \beta \,\}$
       – The continuous problem:
         $L_c(\alpha, \beta) = \min_{\pi \in U(\alpha, \beta)} \int_{\mathcal{X} \times \mathcal{Y}} c(x, y)\, d\pi(x, y) = \min \{\, \mathbb{E}_{(X, Y)}[c(X, Y)] \mid X \sim \alpha,\; Y \sim \beta \,\}$

  15. Example of transport maps for arbitrary measures

  16. Metric properties of optimal transport

  17. Metric properties of discrete optimal transport
     ◮ The Wasserstein distance is also referred to as the OT distance or the Earth mover's distance (EMD).

     Discrete Wasserstein distance. Consider $p, q \in \Delta_n$ and a cost matrix
     $C \in \mathcal{C}_n = \{\, C \in \mathbb{R}^{n \times n}_+ \mid C = C^T,\ \mathrm{diag}(C) = 0,\ C_{i,j} > 0 \text{ for } i \ne j,\ \text{and } \forall (i, j, k):\ C_{i,j} \le C_{i,k} + C_{k,j} \,\}.$
     Then $W_p(p, q) = L_{C^p}(p, q)^{1/p}$, where $C^p$ is the elementwise $p$-th power, defines a p-Wasserstein distance on $\Delta_n$.

     ◮ Recall that $L_C(p, q)$ refers to the discrete Kantorovich problem:
       $L_C(p, q) = \min \{\, \mathrm{tr}(FC) \mid F \ge 0,\ F\mathbf{1} = p,\ F^T\mathbf{1} = q \,\}$

  18. Proof that the p-Wasserstein distance constitutes a distance
     ◮ We need to show positivity, symmetry and the triangle inequality.
     ◮ Since $\mathrm{diag}(C) = 0$, $W_p(p, p) = 0$, attained by the plan $F^* = \mathrm{diag}(p)$.
     ◮ Because of the strict positivity of the off-diagonal elements of $C$, $W_p(p, q) > 0$ for $p \ne q$.
     ◮ Since $C$ is symmetric, transposing a feasible plan leaves the objective $\mathrm{tr}(FC)$ unchanged, so $W_p(p, q) = W_p(q, p)$.
     ◮ For the triangle inequality, take $p$, $q$ and $t$, and let $F = \mathrm{sol}(W_p(p, q))$, $G = \mathrm{sol}(W_p(q, t))$.
     ◮ For simplicity, assume $q > 0$ (detailed proof in the lecture notes), and define $S = F\, \mathrm{diag}(1/q)\, G \in \mathbb{R}^{n \times n}_+$.
     ◮ Note that $S \in U(p, t)$, i.e., $S$ is a feasible transport plan between $p$ and $t$:
       $S\mathbf{1} = F\, \mathrm{diag}(1/q)\, G\mathbf{1} = F\, \mathrm{diag}(1/q)\, q = F\mathbf{1} = p$  (using $G\mathbf{1} = q$),
       $S^T\mathbf{1} = G^T\, \mathrm{diag}(1/q)\, F^T\mathbf{1} = G^T\, \mathrm{diag}(1/q)\, q = G^T\mathbf{1} = t$  (using $F^T\mathbf{1} = q$).

  19. Wasserstein distance for arbitrary measures
     Consider $\alpha \in \mathcal{M}^1_+(\mathcal{X})$, $\beta \in \mathcal{M}^1_+(\mathcal{Y})$ with $\mathcal{X} = \mathcal{Y}$, and, for some $p \ge 1$, a cost $c$ such that
     ◮ $c(x, y) = c(y, x) \ge 0$;
     ◮ $c(x, y) = 0$ if and only if $x = y$;
     ◮ $\forall (x, y, z) \in \mathcal{X}^3$: $c(x, y) \le c(x, z) + c(z, y)$.
     Then $W_p(\alpha, \beta) = L_{c^p}(\alpha, \beta)^{1/p}$ defines a p-Wasserstein distance on $\mathcal{M}^1_+(\mathcal{X})$.
     ◮ Recall that the Kantorovich problem for arbitrary measures is given by:
       $L_c(\alpha, \beta) = \min_{\pi \in U(\alpha, \beta)} \int_{\mathcal{X} \times \mathcal{Y}} c(x, y)\, d\pi(x, y)$

  20. Special cases I
     ◮ Binary cost matrix: if $C = \mathbf{1}\mathbf{1}^T - I$, then $L_C(p, q) = \tfrac{1}{2}\|p - q\|_1$ (the total variation distance).
     ◮ 1D case of empirical measures:
       – $\mathcal{X} = \mathbb{R}$; $\alpha = \frac{1}{n}\sum_i \delta_{x_i}$, $\beta = \frac{1}{n}\sum_i \delta_{y_i}$;
       – $x_1 \le x_2 \le \ldots \le x_n$ and $y_1 \le y_2 \le \ldots \le y_n$ are the ordered observations; then
         $W_p(\alpha, \beta)^p = \frac{1}{n}\sum_{i=1}^n |x_i - y_i|^p$
     ◮ Histogram equalization (figure).
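A quick numerical check of the 1D formula (not from the slides): for uniform empirical measures with equally many samples, the p-Wasserstein distance reduces to sorting both samples; for p = 1 it agrees with scipy.stats.wasserstein_distance. The samples below are hypothetical.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(42)
x = rng.normal(0.0, 1.0, size=500)   # samples defining alpha = (1/n) sum_i delta_{x_i}
y = rng.normal(2.0, 1.5, size=500)   # samples defining beta  = (1/n) sum_i delta_{y_i}

def wasserstein_1d(x, y, p=2):
    """W_p between two uniform empirical measures with the same number of samples."""
    xs, ys = np.sort(x), np.sort(y)
    return (np.mean(np.abs(xs - ys) ** p)) ** (1.0 / p)

print(wasserstein_1d(x, y, p=1))
print(wasserstein_distance(x, y))    # SciPy's W_1, should agree up to float error
```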

  21. Color transfer

  22. Special cases II: Distance between Gaussians
     ◮ If $\alpha = \mathcal{N}(m_\alpha, \Sigma_\alpha)$ and $\beta = \mathcal{N}(m_\beta, \Sigma_\beta)$ are two Gaussians in $\mathbb{R}^d$,
     ◮ the map
       $T: x \mapsto m_\beta + A(x - m_\alpha), \quad \text{where } A = \Sigma_\alpha^{-1/2}\left(\Sigma_\alpha^{1/2}\Sigma_\beta\Sigma_\alpha^{1/2}\right)^{1/2}\Sigma_\alpha^{-1/2},$
       constitutes an optimal transport map.
     ◮ Furthermore,
       $W_2^2(\alpha, \beta) = \|m_\alpha - m_\beta\|^2 + \mathrm{tr}\!\left(\Sigma_\alpha + \Sigma_\beta - 2\left(\Sigma_\alpha^{1/2}\Sigma_\beta\Sigma_\alpha^{1/2}\right)^{1/2}\right).$
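A small numerical sketch of this closed form (not in the slides), using scipy.linalg.sqrtm for the matrix square roots; the means and covariances are hypothetical.

```python
import numpy as np
from scipy.linalg import sqrtm

def gaussian_w2_squared(m_a, S_a, m_b, S_b):
    """Squared 2-Wasserstein distance between N(m_a, S_a) and N(m_b, S_b)."""
    S_a_half = sqrtm(S_a)
    cross = sqrtm(S_a_half @ S_b @ S_a_half)   # (S_a^{1/2} S_b S_a^{1/2})^{1/2}
    # sqrtm may return tiny imaginary parts from numerical error; keep the real part.
    cross = np.real(cross)
    return np.sum((m_a - m_b) ** 2) + np.trace(S_a + S_b - 2.0 * cross)

# Hypothetical 2D Gaussians.
m_a, S_a = np.zeros(2), np.eye(2)
m_b, S_b = np.array([1.0, -1.0]), np.array([[2.0, 0.3], [0.3, 0.5]])
print(gaussian_w2_squared(m_a, S_a, m_b, S_b))
```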

  23. Application I: Supervised learning with Wasserstein Loss

  24. Learning with Wasserstein Loss
     ◮ A natural metric on the outputs can be used to improve predictions.
     ◮ The Wasserstein distance provides a natural notion of dissimilarity for probability measures → it can encourage smoothness in the predictions.
       – In ImageNet, the 1000 categories may have inherent semantic relationships.
       – In speech recognition systems, outputs correspond to keywords that also have semantic relations → this correlation can be exploited.

  25. Semantic relationships: Flickr dataset

  26. Problem setup
     ◮ Goal: learn a mapping from $\mathcal{X} \subset \mathbb{R}^d$ to $\mathcal{Y} = \mathbb{R}^K_+$, over a label set $\mathcal{K}$ with $|\mathcal{K}| = K$.
     ◮ Assume $\mathcal{K}$ possesses a metric $d_{\mathcal{K}}(\cdot, \cdot)$, called the ground metric.
     ◮ Learning is over a hypothesis space $\mathcal{H}$ of predictors $h_\theta: \mathcal{X} \to \mathcal{Y}$, parameterized by $\theta \in \Theta$.
       – These can be a logistic regression, the output of a neural network, etc.
     ◮ Empirical risk minimization:
       $\min_{h_\theta \in \mathcal{H}} \mathbb{E}\{\ell(h_\theta(x), y)\} \approx \min_{h_\theta \in \mathcal{H}} \frac{1}{N} \sum_{i=1}^N \ell(h_\theta(x_i), y_i)$

  27. Discrete Wasserstein loss
     ◮ Assuming $h_\theta$ outputs a probability measure (a discrete probability distribution), and $y_i$ corresponds to the one-hot encoding of the label classes, the empirical Wasserstein loss is
       $\sum_{i=1}^N L_C(h_\theta(x_i), y_i),$
       where $C$ encodes the ground metric given by $c(x, y)$.
     ◮ In order to optimize the loss function, how do we compute gradients?
       – Gradients are easy to compute in the dual domain (a smoothed alternative is sketched below).
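In practice an entropically smoothed (Sinkhorn) surrogate of the exact loss is often used because it is cheap and differentiable; the exact gradient comes from the LP dual, discussed on the next slide. The sketch below (not from the slides, with a hypothetical 4-class ground metric) shows the Sinkhorn approximation for one prediction/target pair.

```python
import numpy as np

def sinkhorn_loss(p, q, C, eps=0.1, n_iters=200):
    """Entropically smoothed OT cost <P, C> between histograms p and q."""
    K = np.exp(-C / eps)                 # Gibbs kernel
    u = np.ones_like(p)
    for _ in range(n_iters):
        v = q / (K.T @ u + 1e-16)        # scale columns to match marginal q
        u = p / (K @ v + 1e-16)          # scale rows to match marginal p
    P = u[:, None] * K * v[None, :]      # approximate transport plan
    return np.sum(P * C)

# Hypothetical 4-class example: softmax output vs. one-hot target, with a
# ground metric saying classes 0 and 1 are semantically close.
C = np.array([[0.0, 0.1, 1.0, 1.0],
              [0.1, 0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0, 0.5],
              [1.0, 1.0, 0.5, 0.0]])
h = np.array([0.05, 0.80, 0.10, 0.05])  # prediction concentrated on class 1
y = np.array([1.0, 0.0, 0.0, 0.0])      # true class is 0
print(sinkhorn_loss(h, y, C))           # small, since class 1 is close to class 0
```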

  28. Dual problem formulation
     1. Construct the Lagrangian:
        $\mathcal{L}(x, \lambda, \nu) = f(x) + \sum_i \lambda_i g_i(x) + \sum_j \nu_j h_j(x).$
     2. Dual function: the minimum of the Lagrangian over $x$:
        $q(\lambda, \nu) = \min_x \mathcal{L}(x, \lambda, \nu).$
     3. Dual problem: maximization of the dual function over $\lambda_i \ge 0$:
        $\max_{\lambda \in \mathbb{R}^m,\, \nu \in \mathbb{R}^p} q(\lambda, \nu) \quad \text{s.t.} \quad \lambda_i \ge 0 \;\; \forall i. \qquad (2)$
     ◮ Weak duality (the dual optimum lower-bounds the primal optimum) always holds; strong duality (the two coincide) holds for feasible LPs such as the Kantorovich problem.
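Applied to the Kantorovich LP of slide 13, these steps give the following dual, a standard result stated here for completeness rather than taken verbatim from the slides:

```latex
% Dual of  L_C(p, q) = min { tr(FC) : F >= 0, F1 = p, F^T 1 = q }
\begin{aligned}
L_C(p, q) \;=\; \max_{f \in \mathbb{R}^n,\; g \in \mathbb{R}^n} \;\; & \langle f, p \rangle + \langle g, q \rangle \\
\text{s.t.} \;\; & f_i + g_j \le C_{i,j} \quad \forall\, i, j.
\end{aligned}
```

An optimal dual vector $f^\ast$ is a subgradient of $L_C(\cdot, q)$ at $p$ (up to a constant shift), which is why gradients of the Wasserstein loss are easy to compute in the dual domain.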
