Advanced Section #2: Optimal Transport
AC 209B: Data Science 2
Javier Zazo, Pavlos Protopapas
Lecture Outline
◮ Historical overview
◮ Definitions and formulations
◮ Metric properties of optimal transport
◮ Application I: Supervised learning with Wasserstein Loss
◮ Application II: Domain adaptation
Historical overview
The origins of optimal transport
◮ Gaspard Monge proposed the first idea in 1781.
◮ How to move dirt from one place (déblais) to another (remblais) with minimal effort?
◮ He enunciated the problem of finding a mapping F between two distributions of mass.
◮ Optimization with respect to a displacement cost c(x, y).
Transportation problem I
◮ Formulated by Frank Lauren Hitchcock in 1941.
Factories & warehouses example
◮ Fixed number of factories, each of which produces goods at a fixed output rate.
◮ Fixed number of warehouses, each of which has a fixed storage capacity.
◮ There is a cost to transport goods from a factory to a warehouse.
◮ Goal: find the transportation of goods from factory → warehouse with the lowest possible cost.
Transportation problem II: Example
Factories:
◮ F1 makes 5 units.
◮ F2 makes 4 units.
◮ F3 makes 6 units.
Warehouses:
◮ W1 can store 5 units.
◮ W2 can store 3 units.
◮ W3 can store 5 units.
◮ W4 can store 2 units.
Transportation costs:
      W1  W2  W3  W4
F1     5   4   7   6
F2     2   5   3   5
F3     6   3   4   4

[Figure: bipartite graph of factories (supplies 5, 4, 6) and warehouses (capacities 5, 3, 5, 2) with an example transport plan.]
Transportation problem III
◮ One factory can transport product to multiple warehouses.
◮ One warehouse can receive product from multiple factories.
◮ The transportation problem can be formulated as an ordinary constrained linear optimization problem (LP), solvable numerically (see the solver sketch below):

  min_{x_ij}  5x_11 + 4x_12 + 7x_13 + 6x_14 + 2x_21 + 5x_22 + 3x_23 + 5x_24 + 6x_31 + 3x_32 + 4x_33 + 4x_34
  s.t.  x_11 + x_12 + x_13 + x_14 = 5
        x_21 + x_22 + x_23 + x_24 = 4
        x_31 + x_32 + x_33 + x_34 = 6
        x_11 + x_21 + x_31 ≤ 5
        x_12 + x_22 + x_32 ≤ 3
        x_13 + x_23 + x_33 ≤ 5
        x_14 + x_24 + x_34 ≤ 2
        x_ij ≥ 0
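As an illustration (not part of the original slides), here is a minimal sketch of solving this LP with SciPy; the row-major flattening of x_ij and the use of linprog are our choices:

```python
import numpy as np
from scipy.optimize import linprog

# Costs C[i][j]: factory Fi -> warehouse Wj, from the table above.
C = np.array([[5, 4, 7, 6],
              [2, 5, 3, 5],
              [6, 3, 4, 4]], dtype=float)

# Flatten x_ij row-major; equality: each factory ships its full output.
A_eq = np.zeros((3, 12))
for i in range(3):
    A_eq[i, 4 * i:4 * (i + 1)] = 1.0
b_eq = [5, 4, 6]

# Inequality: each warehouse receives at most its capacity.
A_ub = np.zeros((4, 12))
for j in range(4):
    A_ub[j, j::4] = 1.0
b_ub = [5, 3, 5, 2]

res = linprog(C.ravel(), A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=(0, None))
print(res.x.reshape(3, 4), res.fun)  # optimal plan and total cost
```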
Definitions and formulations
Definitions
◮ Probability simplex: ∆_n = { a ∈ R^n_+ : Σ_{i=1}^n a_i = 1 }.
◮ Discrete probability distribution: p = (p_1, p_2, . . . , p_n) ∈ ∆_n.
◮ Space X: support of the distribution (coordinate vector/array, temperature, etc.).
◮ Discrete measure: given weights p = (p_1, p_2, . . . , p_n) and locations x = (x_1, x_2, . . . , x_n),
  α = Σ_i p_i δ_{x_i}
◮ Radon measure: α ∈ M(X), where X is equipped with a distance; α can be integrated against a continuous function f:
  ∫_X f(x) dα(x)
  If α has a density ρ_α with respect to the Lebesgue measure on R^d, then
  ∫_X f(x) dα(x) = ∫_X f(x) ρ_α(x) dx
More definitions
◮ Set of positive measures: M_+(X), such that ∫_X f(x) dα(x) ∈ R_+.
◮ Set of probability measures: M^1_+(X), such that ∫_X dα(x) = 1.
Assignment and Monge problems
◮ n origin elements (factories),
◮ m = n destination elements (warehouses),
◮ we look for a permutation (an assignment, in the general case) of elements:

  min_{σ ∈ Perm(n)}  (1/n) Σ_{i=1}^n C_{i,σ(i)}

◮ The set of n discrete elements has n! possible permutations.
◮ Later works aimed to simplify Monge's problem, such as Hitchcock in 1941 and Kantorovich in 1942. (A solver sketch for the assignment problem follows below.)
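As a hedged illustration (not from the slides), the assignment problem can be solved without enumerating the n! permutations, e.g. with SciPy's Hungarian-algorithm solver; the cost matrix here is made up:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Illustrative cost matrix C[i, j]: cost of assigning origin i to destination j.
C = np.array([[4.0, 1.0, 3.0],
              [2.0, 0.0, 5.0],
              [3.0, 2.0, 2.0]])

rows, cols = linear_sum_assignment(C)   # optimal permutation sigma
avg_cost = C[rows, cols].mean()         # (1/n) * sum_i C[i, sigma(i)]
print(cols, avg_cost)
```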
Kantorovich relaxation
◮ Goal: find a minimal transport plan F such that
  F ∈ U(p, q) = { F ∈ R^{n×n}_+ : F1 = p and F^T 1 = q }
◮ F1 = p sums the rows of F → all goods are transported from p.
◮ F^T 1 = q sums the columns of F → all goods are received in q.
◮ p and q are probability distributions → mass is conserved and equals 1.
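To make the marginal constraints concrete, a tiny numpy check (the plan and its numbers are illustrative, not from the slides):

```python
import numpy as np

# A candidate transport plan; row sums give p, column sums give q.
F = np.array([[0.3, 0.1],
              [0.2, 0.4]])
p = F @ np.ones(2)      # F1 = p      -> [0.4, 0.6]
q = F.T @ np.ones(2)    # F^T 1 = q   -> [0.5, 0.5]
print(p, q, F.sum())    # total mass is 1
```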
Relation to linear programming
◮ The Kantorovich problem is an LP:

  L_C(p, q) = min_{F ≥ 0} tr(FC)  s.t.  F1 = p, F^T 1 = q   (1)

◮ LPs can be solved with the simplex method, interior-point methods, dual descent methods, etc. The problem is convex.
◮ One option is to use LP solvers: Clp, Gurobi, Mosek, SeDuMi, CPLEX, ECOS, etc.
◮ Specialized methods (and Python, C, Julia, etc. libraries) exist:
  – Network simplex
  – Approximate methods: Sinkhorn, smoothed versions, etc.
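A hedged sketch using the POT library (`pip install pot`), whose `ot.emd` solver implements the network simplex mentioned above; the distributions and cost matrix are illustrative:

```python
import numpy as np
import ot  # Python Optimal Transport (POT)

p = np.array([0.4, 0.6])      # source distribution
q = np.array([0.5, 0.5])      # target distribution
C = np.array([[0.0, 1.0],
              [1.0, 0.0]])    # ground cost

F = ot.emd(p, q, C)           # optimal transport plan (network simplex)
cost = np.sum(F * C)          # L_C(p, q)
print(F, cost)
```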
Kantorovich formulation for arbitrary measures
◮ Now C becomes a cost function c(x, y) : X × Y → R_+.
◮ For discrete measures α = Σ_i p_i δ_{x_i} and β = Σ_i q_i δ_{y_i}:
  – c(x, y) is still a matrix whose costs depend on the locations of the measures.
◮ For arbitrary probability measures:
  – Define a coupling π ∈ M^1_+(X × Y) → a joint probability distribution over X and Y:
    U(α, β) = { π ∈ M^1_+(X × Y) : P_X♯π = α and P_Y♯π = β }
  – The continuous problem:
    L_c(α, β) = min_{π ∈ U(α,β)} ∫_{X×Y} c(x, y) dπ(x, y) = min { E_{(X,Y)}[c(X, Y)] : X ∼ α, Y ∼ β }
Example of transport maps for arbitrary measures
Metric properties of optimal transport
Metric properties of the discrete optimal transport
◮ The Wasserstein distance is also referred to as the OT distance or the Earth mover's distance (EMD).
Discrete Wasserstein distance
Consider p, q ∈ ∆_n and C ∈ C_n = { C ∈ R^{n×n}_+ : C = C^T, diag(C) = 0, and ∀(i, j, k), C_{i,j} ≤ C_{i,k} + C_{k,j} }.
Then W_p(p, q) = L_{C^p}(p, q)^{1/p}, with C^p the element-wise power, defines a p-Wasserstein distance on ∆_n.
◮ Recall that L_C(p, q) refers to the discrete Kantorovich problem:
  L_C(p, q) = min { tr(FC) : F ≥ 0, F1 = p, F^T 1 = q }
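A small sketch of this definition, again assuming the POT library (`ot.emd2` returns the optimal cost rather than the plan); the support points and ground metric are illustrative:

```python
import numpy as np
import ot

def wasserstein_p(p_hist, q_hist, D, p=2):
    # D: pairwise ground distances; the cost matrix is the element-wise power D**p.
    return ot.emd2(p_hist, q_hist, D ** p) ** (1.0 / p)

x = np.array([[0.0], [1.0], [2.0]])            # support locations
D = ot.dist(x, x, metric='euclidean')          # ground distance matrix
p_hist = np.array([0.5, 0.5, 0.0])
q_hist = np.array([0.0, 0.5, 0.5])
print(wasserstein_p(p_hist, q_hist, D, p=2))   # expect 1.0
```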
Proof that p-Wasserstein constitutes a distance
◮ We need to show positivity, symmetry and the triangle inequality.
◮ Since diag(C) = 0, W_p(p, p) = 0, attained by F* = diag(p).
◮ Because of the strict positivity of the off-diagonal elements, W_p(p, q) > 0 for p ≠ q.
◮ Since the objective is tr(CF) and C is symmetric, transposing an optimal plan for (p, q) gives an optimal plan for (q, p), so W_p(p, q) = W_p(q, p).
◮ For the triangle inequality, take p, q and t, and let F = sol(W_p(p, q)), G = sol(W_p(q, t)).
◮ For simplicity, assume q > 0 (detailed proof in the lecture notes). Define S = F diag(1/q) G ∈ R^{n×n}_+.
◮ Note that S ∈ U(p, t), i.e., S is a feasible transport plan:
  S1 = F diag(1/q) G1 = F diag(1/q) q = F1 = p
  S^T 1 = G^T diag(1/q) F^T 1 = G^T diag(1/q) q = G^T 1 = t
◮ Feasibility of S, together with Minkowski's inequality, then yields W_p(p, t) ≤ W_p(p, q) + W_p(q, t).
Wasserstein distance for arbitrary measures
Consider α ∈ M^1_+(X), β ∈ M^1_+(Y), X = Y, and for some p ≥ 1 a cost satisfying:
◮ c(x, y) = c(y, x) ≥ 0;
◮ c(x, y) = 0 if and only if x = y;
◮ ∀(x, y, z) ∈ X^3, c(x, y) ≤ c(x, z) + c(z, y).
Then W_p(α, β) = L_{c^p}(α, β)^{1/p} defines a p-Wasserstein distance on X.
◮ Recall that the Kantorovich problem for arbitrary measures is given by:
  L_c(α, β) = min_{π ∈ U(α,β)} ∫_{X×Y} c(x, y) dπ(x, y)
Special cases I
◮ Binary cost matrix: if C = 11^T − I, then L_C(p, q) = (1/2) ‖p − q‖_1.
◮ 1D case of empirical measures:
  – X = R; α = (1/n) Σ_i δ_{x_i}, β = (1/n) Σ_i δ_{y_i};
  – x_1 ≤ x_2 ≤ . . . ≤ x_n and y_1 ≤ y_2 ≤ . . . ≤ y_n ordered observations. Then
    W_p(α, β)^p = (1/n) Σ_{i=1}^n |x_i − y_i|^p
  (see the sorting sketch below).
◮ Histogram equalization:
  [Figure: histogram equalization example.]
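A minimal numpy sketch of the 1D case, assuming equal-weight samples (the function name is ours); sorting gives the optimal matching:

```python
import numpy as np

def wasserstein_1d(x, y, p=2):
    # Equal-weight empirical measures: the optimal plan matches order statistics.
    x, y = np.sort(x), np.sort(y)
    return np.mean(np.abs(x - y) ** p) ** (1.0 / p)

rng = np.random.default_rng(0)
print(wasserstein_1d(rng.normal(0, 1, 500), rng.normal(2, 1, 500)))  # ≈ 2
```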
Color transfer
Special cases II: Distance between Gaussians
◮ If α = N(m_α, Σ_α) and β = N(m_β, Σ_β) are two Gaussians in R^d,
◮ the map T : x ↦ m_β + A(x − m_α), where
  A = Σ_α^{-1/2} (Σ_α^{1/2} Σ_β Σ_α^{1/2})^{1/2} Σ_α^{-1/2},
  constitutes an optimal transport map.
◮ Furthermore, W_2^2(α, β) = ‖m_α − m_β‖^2 + tr(Σ_α + Σ_β − 2(Σ_α^{1/2} Σ_β Σ_α^{1/2})^{1/2}).
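A direct transcription of this closed form, assuming SciPy's `sqrtm` (which can return a negligible imaginary part, hence the `np.real`); the test values are illustrative:

```python
import numpy as np
from scipy.linalg import sqrtm

def gaussian_w2_squared(m_a, S_a, m_b, S_b):
    # W2^2 between N(m_a, S_a) and N(m_b, S_b), per the formula above.
    rA = np.real(sqrtm(S_a))
    cross = np.real(sqrtm(rA @ S_b @ rA))
    return np.sum((m_a - m_b) ** 2) + np.trace(S_a + S_b - 2 * cross)

m_a, m_b = np.zeros(2), np.ones(2)
S = np.eye(2)
print(gaussian_w2_squared(m_a, S, m_b, S))  # identical covariances: ||m_a - m_b||^2 = 2
```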
Application I: Supervised learning with Wasserstein Loss
Learning with Wasserstein Loss
◮ A natural metric on the outputs can be used to improve predictions.
◮ The Wasserstein distance provides a natural notion of dissimilarity for probability measures → it can encourage smoothness of the predictions.
  – In ImageNet, the 1000 categories may have inherent semantic relationships.
  – In speech recognition systems, outputs correspond to keywords that also have semantic relations → this correlation can be exploited.
Semantic relationships: Flickr dataset
Problem setup
◮ Goal: learn a mapping from X ⊂ R^d to Y = R^K_+, over a label set K with |K| = K.
◮ Assume K possesses a metric d_K(·, ·), the ground metric.
◮ Learn over a hypothesis space H of predictors h_θ : X → Y, parameterized by θ ∈ Θ.
  – These can be a logistic regression, the output of a NN, etc.
◮ Empirical risk minimization:
  min_{h_θ ∈ H} E[ℓ(h_θ(x), y)] ≈ (1/N) Σ_{i=1}^N ℓ(h_θ(x_i), y_i)
Discrete Wasserstein loss
◮ Assume h_θ outputs a probability measure (or a discrete probability distribution), and y_i corresponds to the one-hot encoding of the label classes. The total Wasserstein loss is
  Σ_{i=1}^N L_C(h_θ(x_i), y_i)
  where C encodes the ground metric given by c(x, y).
◮ In order to optimize the loss function, how do we compute gradients?
  – Gradients are easy to compute in the dual domain.
Dual problem formulation
1. Construct the Lagrangian:
   L(x, λ, ν) = f(x) + Σ_i λ_i g_i(x) + Σ_j ν_j h_j(x).
2. Dual function: the minimum of the Lagrangian over x:
   q(λ, ν) = min_x L(x, λ, ν).
3. Dual problem: maximization of the dual function over λ_i ≥ 0:
   max_{λ ∈ R^m, ν ∈ R^p} q(λ, ν)  s.t.  λ_i ≥ 0 ∀i.   (2)

[Figure: duality gap diagram contrasting weak and strong duality.]
Dual problem of the discrete Kantorovich problem
Dual of the discrete Kantorovich problem
Given p ∈ R^n, q ∈ R^n and C ∈ R^{n×n}, the dual of L_C(p, q) has the following form:
  max_{r,s} p^T r + q^T s  s.t.  r1^T + 1s^T ≤ C   (3)
where r ∈ R^n, s ∈ R^n.
◮ Because the primal OT Kantorovich problem is a feasible, bounded LP for probability distributions p and q, the dual problem is also feasible and strong duality holds.
◮ The dual problem can play an important part in devising algorithms to solve the Kantorovich problem.
◮ The dual variables can be interpreted as prices.
Dual problem of the discrete Kantorovich problem: Proof
◮ Semi-Lagrangian of the primal problem:
  J(F; r, s) = tr(CF^T) + r^T(p − F1) + s^T(q − F^T 1)
◮ Dual problem:
  max_{r,s} r^T p + s^T q + min_{F ≥ 0} [ tr(CF^T) − r^T F1 − s^T F^T 1 ]
  Using r^T F1 = tr(F^T r1^T) and s^T F^T 1 = tr(F^T 1s^T), and defining Q = C − r1^T − 1s^T,
  min_{F ≥ 0} tr(QF^T) = { 0 if Q ≥ 0; −∞ otherwise }
◮ Giving:
  max_{r,s} r^T p + s^T q  s.t.  r1^T + 1s^T ≤ C
Gradient of the Wasserstein Loss
◮ Back to the Wasserstein loss function: L_C(h_θ(x_i), y_i).
◮ In dual form:
  max_{r,s} r^T h_θ(x_i) + s^T y_i  s.t.  r1^T + 1s^T ≤ C.
◮ We can take a conditional subgradient w.r.t. h_θ(x):
  d/dh_θ(x) W_p(h_θ(x), y) = r*,
  where r* is an optimal dual variable (a sketch of extracting it follows below).
◮ Note that the Wasserstein loss is subdifferentiable.
◮ Computing the Wasserstein loss for N examples can be costly in high dimensions...
◮ Once we have the subgradient, we can backpropagate to update θ with SGD.
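A hedged sketch of extracting the optimal dual variable with POT (`ot.emd` exposes its dual potentials via `log=True`); the prediction, target and ground metric are toy values:

```python
import numpy as np
import ot

h = np.array([0.2, 0.5, 0.3])    # model prediction h_theta(x_i)
y = np.array([0.0, 0.0, 1.0])    # one-hot target y_i
C = ot.dist(np.arange(3.0)[:, None], np.arange(3.0)[:, None])  # toy ground metric

_, log = ot.emd(h, y, C, log=True)
r = log['u']                      # dual potential: subgradient w.r.t. h
print(r)
```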
Effects of the ground metric I
◮ The authors compare the discriminative power of W_p for different values of the exponent p.
Effects of the ground metric II
◮ KL loss vs. Wasserstein loss on the Flickr database, using the combined objective:
  ℓ(x_i, y_i) = W_p(h_θ(x_i), y_i) + α·KL
Homework proposal
◮ Train a Wasserstein loss classifier on the plane with semantic classes.
Thank you for listening!
◮ There are more things I wanted to talk about:
1. Approximate methods, such as Sinkhorn or smooth OT, to scale problem dimensions (a minimal sketch follows below).
2. Domain adaptation: transport a database of unlabelled data to a domain where such labels exist, according to a Wasserstein transport plan.
3. Ground metric learning: allows the cost matrix to be learned from data, potentially improving performance compared to a p-Wasserstein loss, as we have seen in examples.
4. Barycenter estimation: for clustering, or interpolation between histograms.
5. Transfer learning.
6. Unbalanced optimal transport.
7. Wasserstein discriminant analysis.
8. Etc.
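A minimal sketch of the Sinkhorn iterations mentioned in item 1 (the regularization `eps` and iteration count are illustrative choices):

```python
import numpy as np

def sinkhorn(p, q, C, eps=0.05, n_iters=1000):
    # Entropic regularization: scale K = exp(-C/eps) to match the marginals.
    K = np.exp(-C / eps)
    u = np.ones_like(p)
    for _ in range(n_iters):
        v = q / (K.T @ u)
        u = p / (K @ v)
    F = u[:, None] * K * v[None, :]   # approximate transport plan
    return F, np.sum(F * C)           # plan and approximate OT cost
```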
Application II: Domain adaptation
Problem intuition
◮ We consider unsupervised domain adaptation → labels are available only in the source domain.
◮ Assumption: the data is processed to make the domains similar.
◮ The transformation follows a least-effort principle.
Procedure
1. Estimate the marginals μ_s and μ_t from the source and target sample distributions.
2. Find a transport map T from μ_s to μ_t.
3. Use T to transport the labeled samples x^s and train a classifier on them.
Related work
◮ The approach defines a local transformation for each sample in the domain.
◮ It can be seen as a graph matching problem → marginal distribution conservation.
◮ Related work:
  1. Projection methods: inner products, region transformation, extraction of common features.
  2. Unsupervised: common latent space representations; feature extraction is key.
  3. Gradual alignment of feature representations: kernel methods.
Problem description
◮ K: set of possible labels, available only in the source domain.
◮ Source sample data: ((x^s_i)_{i=1}^{N_s}, (y_i)_{i=1}^{N_s}).
◮ Target sample data: (x^t_i)_{i=1}^{N_t}.
◮ Joint probability distribution in the source: P_s(x^s, y); marginal over x: μ_s.
◮ Joint probability distribution in the target: P_t(x^t, y); marginal over x: μ_t.
Assumptions on the transportation
◮ The domain drift is due to an unknown, possibly nonlinear, transformation T of the input space.
◮ From a probabilistic perspective, T transforms μ_s into μ_t, i.e., T♯μ_s = μ_t; the target samples X_t are drawn from the same pdf as T♯μ_s.
◮ The transformation preserves the conditional distributions:
  P_s(y|x^s) = P_t(y|x^t) ⟺ f_t(T(x^s)) = f_s(x^s)
Problem formulation
◮ Empirical distributions:
  μ_s = Σ_{i=1}^{N_s} p^s_i δ_{x^s_i},   μ_t = Σ_{i=1}^{N_t} p^t_i δ_{x^t_i}
◮ Transport problem:
  F = argmin_{F ∈ U(μ_s, μ_t)} tr(FC),  where C_{ij} = ‖x^s_i − x^t_j‖².
◮ When N_s = N_t = N and, for all i, p^s_i = p^t_i = 1/N, F is simply a permutation matrix scaled by 1/N. (An end-to-end sketch of the adaptation procedure follows below.)
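A hedged end-to-end sketch of the procedure, assuming the POT library and a barycentric mapping in place of the exact map T (the function name and data handling are ours):

```python
import numpy as np
import ot

def ot_domain_adaptation(Xs, Xt):
    ns, nt = len(Xs), len(Xt)
    p = np.full(ns, 1.0 / ns)   # uniform source marginal
    q = np.full(nt, 1.0 / nt)   # uniform target marginal
    C = ot.dist(Xs, Xt)         # squared Euclidean cost by default
    F = ot.emd(p, q, C)         # optimal coupling
    # Barycentric mapping: each source point moves to the weighted mean
    # of its target assignments under the coupling.
    Xs_mapped = (F / F.sum(axis=1, keepdims=True)) @ Xt
    return Xs_mapped
```

A classifier can then be trained on `Xs_mapped` together with the source labels, following step 3 of the procedure above.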