Optimal Transport Methods in Operations Research and Statistics


SLIDE 1

Optimal Transport Methods in Operations Research and Statistics

Jose Blanchet (based on work with F. He, Y. Kang, K. Murthy, F. Zhang).

Stanford University (Management Science and Engineering), and Columbia University (Department of Statistics and Department of IEOR).

Blanchet (Columbia U. and Stanford U.) 1 / 60

SLIDE 2

Goal: Introduce optimal transport techniques and their applications in OR & Statistics. Optimal transport is a useful tool in model robustness, equilibrium, and machine learning!

SLIDES 3–7

Agenda

- Introduction to Optimal Transport
- Economic Interpretations and Wasserstein Distances
- Applications in Stochastic Operations Research
- Applications in Distributionally Robust Optimization
- Applications in Statistics

SLIDE 8

Introduction to Optimal Transport

Monge–Kantorovich Problem & Duality (see e.g. C. Villani’s 2008 textbook)

SLIDES 9–13

Monge Problem

What’s the cheapest way to transport a pile of sand to cover a sinkhole?

min_{T(·) : T(X) ∼ ν} E_µ [c(X, T(X))],

where c(x, y) ≥ 0 is the cost of transporting x to y, and T(X) ∼ ν means T(X) follows the distribution ν(·). The problem is highly non-linear; there was not much progress for about 160 years!

SLIDES 14–17

Kantorovich Relaxation: Primal Problem

Let Π(µ, ν) be the class of joint distributions π of random variables (X, Y) such that π_X = marginal of X = µ and π_Y = marginal of Y = ν. Solve

min { E_π [c(X, Y)] : π ∈ Π(µ, ν) }.

This is an (infinite-dimensional) linear program:

D_c(µ, ν) := min_{π(dx,dy) ≥ 0} ∫_{X×Y} c(x, y) π(dx, dy)

s.t. ∫_Y π(dx, dy) = µ(dx), ∫_X π(dx, dy) = ν(dy).

If c(x, y) = d^p(x, y) for a metric d, then D_c^{1/p}(µ, ν) is the p-Wasserstein metric.

Blanchet (Columbia U. and Stanford U.) 7 / 60
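For finitely supported µ and ν the program above becomes an ordinary finite LP, which makes a good sanity check. A minimal sketch using SciPy's `linprog` (the two-point marginals and the cost matrix below are made up for illustration):

```python
import numpy as np
from scipy.optimize import linprog

# Two-point marginals: mu puts mass on x in {0, 1}, nu on y in {0, 1}.
mu = np.array([0.4, 0.6])
nu = np.array([0.7, 0.3])
C = np.array([[0.0, 1.0],
              [1.0, 0.0]])  # c(x, y) = |x - y|

# Decision vector: pi flattened row-major, with pi[i, j] >= 0.
n, m = C.shape
A_eq, b_eq = [], []
for i in range(n):            # row sums match mu
    row = np.zeros(n * m)
    row[i * m:(i + 1) * m] = 1.0
    A_eq.append(row)
    b_eq.append(mu[i])
for j in range(m):            # column sums match nu
    col = np.zeros(n * m)
    col[j::m] = 1.0
    A_eq.append(col)
    b_eq.append(nu[j])

res = linprog(C.ravel(), A_eq=np.array(A_eq), b_eq=np.array(b_eq),
              bounds=(0, None))
transport_cost = res.fun  # optimal D_c(mu, nu); 0.3 units of mass must move
```

The equality system is rank-deficient (both sets of marginal constraints encode the same total mass), which the default HiGHS solver handles without trouble.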

SLIDE 18

Illustration of Optimal Transport Costs

Monge’s solution would take the form π*(dx, dy) = δ_{T(x)}(dy) µ(dx).

SLIDES 19–22

Kantorovich Relaxation: Dual Problem

The primal always has a solution for c(·) ≥ 0 lower semicontinuous. The dual linear program is

sup_{α, β} ∫_X α(x) µ(dx) + ∫_Y β(y) ν(dy)

s.t. α(x) + β(y) ≤ c(x, y) ∀ (x, y) ∈ X × Y.

The dual α and β can be taken over continuous functions. Complementary slackness: equality holds on the support of π* (the primal optimizer).

SLIDES 23–27

Kantorovich Relaxation: Primal Interpretation

John wants to remove a pile of sand, µ(·). Peter wants to cover a sinkhole, ν(·). The cost for John and Peter to transport the sand to cover the sinkhole is

D_c(µ, ν) = ∫_{X×Y} c(x, y) π*(dx, dy).

Now comes Maria, who has a business... Maria promises to transport the whole amount on behalf of John and Peter.

SLIDES 28–31

Kantorovich Relaxation: Primal Interpretation

Maria charges John α(x) per unit of mass at x (and similarly charges Peter β(y)). For Peter and John to agree we must have α(x) + β(y) ≤ c(x, y). Maria wishes to maximize her profit

∫ α(x) µ(dx) + ∫ β(y) ν(dy).

Kantorovich duality says the primal and dual optimal values coincide and (under mild regularity)

α*(x) = inf_y {c(x, y) − β*(y)}, β*(y) = inf_x {c(x, y) − α*(x)}.

SLIDES 32–34

Proof Techniques

Suppose X and Y are compact. Write the primal with Lagrange multipliers α, β for the marginal constraints:

min_{π ≥ 0} sup_{α, β} { ∫_{X×Y} c(x, y) π(dx, dy)
+ ∫_X α(x) µ(dx) − ∫_{X×Y} α(x) π(dx, dy)
+ ∫_Y β(y) ν(dy) − ∫_{X×Y} β(y) π(dx, dy) }.

Swap the sup and inf using Sion’s min-max theorem via a compactness argument and conclude. A significant amount of work is needed to extend this to general Polish spaces and to construct the dual optimizers (the primal is a bit easier).

SLIDE 35

Optimal Transport Applications

Optimal transport has gained popularity in many areas, including image analysis, economics, statistics, and machine learning. The rest of the talk mostly concerns applications to OR and Statistics, but we’ll briefly touch upon others, including economics...

SLIDE 36

Illustration of Optimal Transport in Image Analysis

Illustration from Santambrogio (2010).

SLIDE 37

Application of Optimal Transport in Economics

Economic interpretations (see e.g. A. Galichon’s 2016 textbook & McCann’s 2013 notes).

SLIDES 38–42

Applications in Labor Markets

A worker with skill x and a company with technology y generate surplus Ψ(x, y). The population of workers is given by µ(x); the population of companies is given by ν(y). The salary of worker x is α(x) and the cost of technology y is β(y), with

α(x) + β(y) ≥ Ψ(x, y).

Companies want to minimize total production cost

∫ α(x) µ(x) dx + ∫ β(y) ν(y) dy.

SLIDES 43–45

Applications in Labor Markets

Now let a central planner organize the labor market. The planner wishes to maximize total surplus

∫ Ψ(x, y) π(dx, dy)

over assignments π(·) which satisfy market clearing:

∫_Y π(dx, dy) = µ(dx), ∫_X π(dx, dy) = ν(dy).

SLIDES 46–49

Solving for the Optimal Transport Coupling

Suppose that Ψ(x, y) = xy, µ(x) = I(x ∈ [0, 1]), ν(y) = e^{−y} I(y > 0). Solve the primal by sampling: let {X_i^n}_{i=1}^n and {Y_j^n}_{j=1}^n be i.i.d. from µ and ν, respectively, with empirical CDFs

F_{µ,n}(x) = (1/n) Σ_{i=1}^n I(X_i^n ≤ x), F_{ν,n}(y) = (1/n) Σ_{j=1}^n I(Y_j^n ≤ y).

Consider

max_{π(x_i^n, y_j^n) ≥ 0} Σ_{i,j} Ψ(x_i^n, y_j^n) π(x_i^n, y_j^n)

s.t. Σ_j π(x_i^n, y_j^n) = 1/n ∀ x_i, Σ_i π(x_i^n, y_j^n) = 1/n ∀ y_j.

Clearly, simply sort and match is the solution!

Blanchet (Columbia U. and Stanford U.) 18 / 60
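The sort-and-match claim is easy to verify by brute force on a tiny sample: for the supermodular surplus Ψ(x, y) = xy, matching the i-th smallest x to the i-th smallest y dominates every other assignment. A quick check (the sample points are illustrative):

```python
from itertools import permutations

x = [0.1, 0.5, 0.9]
y = [0.2, 0.4, 0.8]

def surplus(assignment):
    # total surplus of matching x[i] with y[assignment[i]], with Psi(x, y) = x * y
    return sum(xi * y[j] for xi, j in zip(x, assignment))

# Both lists are already sorted, so the identity assignment is sort-and-match.
sorted_value = surplus(range(3))
best = max(surplus(p) for p in permutations(range(3)))
# sort-and-match attains the maximum over all 3! assignments
```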

SLIDES 50–53

Solving for the Optimal Transport Coupling

Think of Y_j^n = −log(1 − U_j^n) for U_j^n i.i.d. uniform(0, 1). The j-th order statistic X_{(j)}^n is matched to Y_{(j)}^n. As n → ∞, X_{(⌈nt⌉)}^n → t, so Y_{(⌈nt⌉)}^n → −log(1 − t). Thus, the optimal coupling as n → ∞ is X = U and Y = −log(1 − U) (the comonotonic coupling).

Blanchet (Columbia U. and Stanford U.) 19 / 60
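The limiting statements X^n_(⌈nt⌉) → t and Y^n_(⌈nt⌉) → −log(1 − t) can be checked by simulation (sample size and seed below are arbitrary):

```python
import math
import random

random.seed(0)
n, t = 100_000, 0.5
u = sorted(random.random() for _ in range(n))   # sorted uniform(0, 1) sample
y = [-math.log(1 - ui) for ui in u]             # matched exponential sample

x_q = u[int(n * t)]   # empirical t-quantile of the uniforms, close to t
y_q = y[int(n * t)]   # matched value, close to -log(1 - t) = log 2
```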

SLIDES 54–57

Identities for Wasserstein Distances

The comonotonic coupling is the solution if ∂²_{x,y} Ψ(x, y) ≥ 0 (supermodularity), or, for costs c(x, y) = −Ψ(x, y), if ∂²_{x,y} c(x, y) ≤ 0 (submodularity).

Corollary: Suppose c(x, y) = |x − y|. Then X = F_µ^{−1}(U) and Y = F_ν^{−1}(U), and thus

D_c(F_µ, F_ν) = ∫_0^1 |F_µ^{−1}(u) − F_ν^{−1}(u)| du.

Similar identities are common for Wasserstein distances...

Blanchet (Columbia U. and Stanford U.) 20 / 60
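For two equal-size samples the corollary turns the 1-Wasserstein distance into a mean of sorted differences, since the empirical quantile functions are just the sorted samples. A minimal sketch (the toy samples are illustrative):

```python
def wasserstein_1(a, b):
    # D_c with c(x, y) = |x - y|: average absolute gap between sorted samples,
    # i.e. the integral of |F_mu^{-1}(u) - F_nu^{-1}(u)| over u in (0, 1)
    # evaluated at the empirical quantile levels.
    assert len(a) == len(b)
    return sum(abs(s - t) for s, t in zip(sorted(a), sorted(b))) / len(a)

d = wasserstein_1([0.0, 1.0, 2.0], [1.0, 2.0, 3.0])  # samples shifted by 1
```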

SLIDES 58–61

Interesting Insight on Salary Effects

In equilibrium, by the envelope theorem,

β̇*(y) = d/dy sup_x [Ψ(x, y) − α*(x)] = ∂Ψ/∂y (x_y, y) = x_y.

We also know that y = −log(1 − x), i.e. x_y = 1 − exp(−y), so

β*(y) = y + exp(−y) − 1 + β*(0), α*(x) + β*(−log(1 − x)) = xy.

What if Ψ(x, y) → Ψ(x, y) + f(x) (i.e. productivity grows)? Answer: salaries grow if f(·) is increasing.

Blanchet (Columbia U. and Stanford U.) 21 / 60
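The closed form β*(y) = y + exp(−y) − 1 + β*(0) is just the integral of the envelope derivative x_y = 1 − exp(−y); a quick numerical check (grid size is arbitrary, and β*(0) is set to 0):

```python
import math

def beta_star(y, n=20_000):
    # integrate the envelope derivative x_s = 1 - exp(-s) from 0 to y
    # with the trapezoidal rule, taking beta*(0) = 0
    h = y / n
    grid = [1 - math.exp(-k * h) for k in range(n + 1)]
    return h * (sum(grid) - 0.5 * (grid[0] + grid[-1]))

closed_form = 2 + math.exp(-2) - 1   # y + exp(-y) - 1 at y = 2
numeric = beta_star(2.0)
```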

SLIDE 62

Applications of Optimal Transport in Stochastic OR

Based on Blanchet and Murthy (2016), https://arxiv.org/abs/1604.01446. Insight: diffusion approximations and optimal transport.

SLIDES 63–66

A Distributionally Robust Performance Analysis

In Stochastic OR we are often interested in evaluating E_{Ptrue}(f(X)) for a complex model Ptrue. Moreover, we wish to control / optimize it:

min_θ E_{Ptrue}(h(X, θ)).

The model Ptrue might be unknown or too difficult to work with. So, we introduce a proxy P0 which provides a good trade-off between tractability and model fidelity (e.g. Brownian motion for heavy-traffic approximations).

SLIDES 67–68

A Distributionally Robust Performance Analysis

For f(·) upper semicontinuous with E_{P0}|f(X)| < ∞, consider

sup { E_P(f(Y)) : D_c(P, P0) ≤ δ },

where X takes values on a Polish space and c(·) is lower semicontinuous. This is also an infinite-dimensional linear program:

sup ∫_{X×Y} f(y) π(dx, dy)

s.t. ∫_{X×Y} c(x, y) π(dx, dy) ≤ δ, ∫_Y π(dx, dy) = P0(dx).

SLIDES 69–72

A Distributionally Robust Performance Analysis

Formal duality:

Dual = inf_{λ≥0, α} { λδ + ∫ α(x) P0(dx) } s.t. λc(x, y) + α(x) ≥ f(y).

B. & Murthy (2016), no duality gap:

Dual = inf_{λ≥0} { λδ + E_{P0} [ sup_y { f(y) − λc(X, y) } ] }.

We refer to this as RoPA duality in this talk. Let us consider the important case f(y) = I(y ∈ A) with c(x, x) = 0.

SLIDES 73–74

A Distributionally Robust Performance Analysis

So, if f(y) = I(y ∈ A) and c_A(x) = inf{c(x, y) : y ∈ A}, then

Dual = inf_{λ≥0} { λδ + E_{P0} (1 − λ c_A(X))^+ } = P0 (c_A(X) ≤ 1/λ*).

If c_A(X) is continuous under P0 and E_{P0}(c_A(X)) ≥ δ, then λ* is characterized by

δ = E_{P0} [c_A(X) I(c_A(X) ≤ 1/λ*)].

Blanchet (Columbia U. and Stanford U.) 26 / 60
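For an empirical P0 on n sample points the dual objective λδ + E0(1 − λ c_A(X))^+ is convex and piecewise linear in λ, so for δ > 0 its infimum is attained at λ = 0 or at a kink λ = 1/c_A(x_i). A sketch of the resulting worst-case probability (the sample, the set A, and δ below are made up):

```python
def robust_prob(costs, delta):
    # costs[i] = c_A(x_i): transport cost from sample point i into the set A.
    # Dual objective: lambda * delta + mean((1 - lambda * c)^+); for delta > 0
    # this convex piecewise-linear function of lambda attains its infimum at
    # lambda = 0 or at one of the kinks lambda = 1 / c.
    n = len(costs)

    def dual(lam):
        return lam * delta + sum(max(1 - lam * c, 0.0) for c in costs) / n

    candidates = [0.0] + [1.0 / c for c in costs if c > 0]
    return min(dual(lam) for lam in candidates)

# A = [2.5, inf), samples at 0, 1, 2, 3  ->  c_A(x) = (2.5 - x)^+
worst_case = robust_prob([2.5, 1.5, 0.5, 0.0], delta=0.5)
```

With δ = 0 the worst case collapses to the empirical probability of A; growing δ inflates it toward 1.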

SLIDES 75–78

Example: Model Uncertainty in Bankruptcy Calculations

Let R(t) be the reserve (perhaps across multiple lines of business) at time t. The bankruptcy probability over a finite time horizon T is

u_T = Ptrue (R(t) ∈ B for some t ∈ [0, T]),

where B is a set which models bankruptcy. Problem: the model Ptrue may be complex, intractable, or simply unknown...

SLIDES 79–83

A Distributionally Robust Risk Analysis Formulation

Our solution: estimate u_T by solving

sup_{D_c(P0, P) ≤ δ} P (R(t) ∈ B for some t ∈ [0, T]),

where P0 is a suitable model:

- P0 = proxy for Ptrue, with the right trade-off between fidelity and tractability.
- δ is the distributional uncertainty size.
- {P : D_c(P0, P) ≤ δ} is the distributional uncertainty region.

SLIDES 84–87

Desirable Elements of Distributionally Robust Formulation

- Would like D_c(·) to have wide flexibility (even non-parametric).
- Want the optimization to be tractable.
- Want to preserve the advantages of using P0.
- Want a way to estimate δ.

SLIDES 88–92

Connections to Distributionally Robust Optimization

Standard choices are based on divergences (such as Kullback-Leibler), as in Hansen & Sargent (2016):

D(ν || µ) = E_ν [ log (dν/dµ) ].

Robust optimization: Ben-Tal, El Ghaoui, Nemirovski (2009). Big problem: absolute continuity may typically be violated... Think of using Brownian motion as a proxy model for R(t)... Optimal transport is a natural option!

Blanchet (Columbia U. and Stanford U.) 30 / 60
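The absolute-continuity problem is visible already for two nearby point masses: the KL divergence is infinite because the measures are mutually singular, while the transport cost is tiny. A small illustration (locations and weights are made up):

```python
import math

atoms = [0.0, 0.1]
p = [1.0, 0.0]   # point mass at 0
q = [0.0, 1.0]   # point mass at 0.1

def kl(p, q):
    # D(p || q) = sum_i p_i log(p_i / q_i); infinite as soon as p charges
    # an atom that q does not (no absolute continuity).
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0 and qi == 0:
            return math.inf
        if pi > 0:
            total += pi * math.log(pi / qi)
    return total

kl_div = kl(p, q)               # infinite: the supports are disjoint
w1 = abs(atoms[0] - atoms[1])   # optimal transport cost: just move the atom
```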

SLIDES 93–95

Application 1: Back to the Classical Risk Problem

Suppose that c(x, y) = d_J(x(·), y(·)), the Skorokhod J1 metric:

d_J(x, y) = inf_{φ(·) bijection} max { sup_{t∈[0,1]} |x(t) − y(φ(t))|, sup_{t∈[0,1]} |φ(t) − t| }.

If R(t) = b − Z(t), then ruin during the time interval [0, 1] is

B_b = {R(·) : 0 ≥ inf_{t∈[0,1]} R(t)} = {Z(·) : b ≤ sup_{t∈[0,1]} Z(t)}.

Let P0(·) be the Wiener measure; we want to compute

sup_{D_c(P0, P) ≤ δ} P (Z ∈ B_b).

SLIDE 96

Application 1: Computing Distance to Bankruptcy

So {c_{B_b}(Z) ≤ 1/λ*} = {sup_{t∈[0,1]} Z(t) ≥ b − 1/λ*}, and

sup_{D_c(P0, P) ≤ δ} P (Z ∈ B_b) = P0 ( sup_{t∈[0,1]} Z(t) ≥ b − 1/λ* ).

Blanchet (Columbia U. and Stanford U.) 32 / 60
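Under the Wiener proxy the right-hand side is explicit via the reflection principle, P0(sup_{t≤1} Z(t) ≥ a) = 2(1 − Φ(a)) for a ≥ 0, so robustification just lowers the barrier from b to b − 1/λ*. A sketch (the values of b and 1/λ* below are made up):

```python
import math

def phi(a):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))

def sup_bm_tail(a):
    # reflection principle for standard Brownian motion on [0, 1]:
    # P(sup_{t<=1} B(t) >= a) = 2 * (1 - Phi(a)) for a >= 0
    return 2.0 * (1.0 - phi(a))

baseline = sup_bm_tail(3.0)       # P0(ruin) with barrier b = 3
robust = sup_bm_tail(3.0 - 0.5)   # barrier lowered by 1/lambda* = 0.5
```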

SLIDES 97–99

Application 1: Computing the Uncertainty Size

Note that any coupling π such that π_X = P0 and π_Y = P satisfies

D_c(P0, P) ≤ E_π [c(X, Y)] ≈ δ.

So use any coupling between the evidence and P0, or expert knowledge. We discuss choosing δ non-parametrically momentarily.

SLIDE 100

Application 1: Illustration of Coupling

Given arrivals and claim sizes, let Z(t) = m_2^{−1/2} Σ_{k=1}^{N(t)} (X_k − m_1).

SLIDE 101

Application 1: Coupling in Action

SLIDE 102

Application 1: Numerical Example

Assume Poisson arrivals and Pareto claim sizes with index 2.2 (P(V > t) = 1/(1 + t)^{2.2}). Cost c(x, y) = d_J(x, y)^2 (note the power of 2). Used Algorithm 1 to calibrate δ (estimating means and variances from data).

b    | P0(Ruin)        | P*_robust(Ruin) / Ptrue(Ruin)
100  | 1.07 × 10^{−1}  | 12.28
150  | 2.52 × 10^{−4}  | 10.65
200  | 5.35 × 10^{−8}  | 10.80
250  | 1.15 × 10^{−12} | 10.98

SLIDES 103–106

Additional Applications: Multidimensional Ruin Problems

https://arxiv.org/abs/1604.01446 contains more applications:

- Control: min_θ sup_{P : D(P, P0) ≤ δ} E [L(θ, Z)] (robust optimal reinsurance).
- Multidimensional risk processes (explicit evaluation of c_B(x) for the d_J metric).
- Key insight: the geometry of the target set often remains largely the same!

slide-107
SLIDE 107

Connections to Distributionally Robust Optimization

Based on: Robust Wasserstein Profile Inference (B., Murthy & Kang ’16) https://arxiv.org/abs/1610.05627 Highlight: Additional insights into why optimal transport...

Blanchet (Columbia U. and Stanford U.) 38 / 60

slide-110
SLIDE 110

Distributionally Robust Optimization in Machine Learning

Consider estimating β∗ ∈ Rm in the linear regression Yi = β∗T Xi + ei, where {(Yi, Xi)}_{i=1}^n are data points.

The Ordinary Least Squares approach estimates β∗ via

min_β EPn[(Y − βT X)^2] = min_β (1/n) ∑_{i=1}^n (Yi − βT Xi)^2.

Apply the distributionally robust estimator based on optimal transport.

Blanchet (Columbia U. and Stanford U.) 39 / 60

slide-111
SLIDE 111

Connection to Sqrt-Lasso

Theorem (B., Kang, Murthy (2016)) Suppose that

c((x, y), (x′, y′)) = ‖x − x′‖q^2 if y = y′, and ∞ if y ≠ y′.

Then, if 1/p + 1/q = 1,

max_{P: Dc(P,Pn)≤δ} EP^{1/2}[(Y − βT X)^2] = EPn^{1/2}[(Y − βT X)^2] + √δ ‖β‖p.

Remark 1: This is sqrt-Lasso (Belloni et al. (2011)). Remark 2: Uses the RoPA duality theorem and a "judicious choice of c (·)".

Blanchet (Columbia U. and Stanford U.) 40 / 60
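The right-hand side of the theorem is easy to compute directly. The sketch below evaluates the sqrt-Lasso objective (empirical RMSE plus √δ ‖β‖_p) on synthetic data; the data and the δ values are illustrative, and minimizing this objective over β would recover the DRO estimator.

```python
import numpy as np

def sqrt_lasso_objective(beta, X, y, delta, p=1):
    """RHS of the theorem: empirical RMSE plus sqrt(delta) * ||beta||_p.
    By the duality result this equals the worst-case root-mean-square loss
    over the Wasserstein ball of radius delta (q-norm ground cost, 1/p + 1/q = 1)."""
    rmse = np.sqrt(np.mean((y - X @ beta) ** 2))
    return rmse + np.sqrt(delta) * np.linalg.norm(beta, ord=p)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
beta_true = np.array([1.0, 0.0, -2.0])
y = X @ beta_true + 0.1 * rng.normal(size=200)

# a larger delta (more distributional ambiguity) penalizes coefficients more
obj_small = sqrt_lasso_objective(beta_true, X, y, delta=0.01)
obj_large = sqrt_lasso_objective(beta_true, X, y, delta=1.0)
```

At β = 0 the penalty vanishes and the objective reduces to the RMSE of y itself, which is a quick sanity check on the formula.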

slide-112
SLIDE 112

Connection to Regularized Logistic Regression

Theorem (B., Kang, Murthy (2016)) Suppose that

c((x, y), (x′, y′)) = ‖x − x′‖q if y = y′, and ∞ if y ≠ y′.

Then,

sup_{P: Dc(P,Pn)≤δ} EP[log(1 + e^{−Y βT X})] = EPn[log(1 + e^{−Y βT X})] + δ ‖β‖p.

Remark 1: An approximate connection is studied in Esfahani and Kuhn (2015).

Blanchet (Columbia U. and Stanford U.) 41 / 60

slide-117
SLIDE 117

Unification and Extensions of Regularized Estimators

Distributionally Robust Optimization using Optimal Transport recovers many other estimators...

Support Vector Machines: B., Kang, Murthy (2016) - https://arxiv.org/abs/1610.05627

Group Lasso: B. & Kang (2016): https://arxiv.org/abs/1705.04241

Generalized adaptive ridge: B., Kang, Murthy, Zhang (2017): https://arxiv.org/abs/1705.07152

Semisupervised learning: B. & Kang (2016): https://arxiv.org/abs/1702.08848

Blanchet (Columbia U. and Stanford U.) 42 / 60

slide-120
SLIDE 120

How Do Regularization and Dual Norms Arise?

Let us work out a simple example... Recall RoPA Duality. Pick c((x, y), (x′, y′)) = ‖(x, y) − (x′, y′)‖q^2. Then

max_{P: Dc(P,Pn)≤δ} EP[((X, Y) · (β, 1))^2]
= min_{λ≥0} { λδ + EPn sup_{(x′,y′)} [ ((x′, y′) · (β, 1))^2 − λ ‖(X, Y) − (x′, y′)‖q^2 ] }.

Let's focus on the inner supremum under EPn...

Blanchet (Columbia U. and Stanford U.) 43 / 60

slide-123
SLIDE 123

How Do Regularization and Dual Norms Arise?

Let ∆ = (X, Y) − (x′, y′). Then

sup_{(x′,y′)} ((x′, y′) · (β, 1))^2 − λ ‖(X, Y) − (x′, y′)‖q^2
= sup_∆ ((X, Y) · (β, 1) − ∆ · (β, 1))^2 − λ ‖∆‖q^2
= sup_{‖∆‖q} (|(X, Y) · (β, 1)| + ‖∆‖q ‖(β, 1)‖p)^2 − λ ‖∆‖q^2.

The last equality uses that z ↦ z^2 is symmetric around the origin and |a · b| ≤ ‖a‖p ‖b‖q. Note the problem is now one-dimensional (easily computable).

Blanchet (Columbia U. and Stanford U.) 44 / 60
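The one-dimensional problem can be checked numerically. Writing a = ‖∆‖q, s = (X, Y)·(β, 1) and b = ‖(β, 1)‖p, a short calculus exercise (assuming λ > b², so the supremum is finite) gives maximizer a∗ = b|s|/(λ − b²) and value λs²/(λ − b²). This closed form is my own derivation, not stated on the slide, and the numbers below are illustrative.

```python
import numpy as np

def inner_sup_numeric(s, b, lam, grid_max=50.0, n=200001):
    # brute-force maximize (|s| + a*b)^2 - lam*a^2 over a = ||Delta||_q >= 0
    a = np.linspace(0.0, grid_max, n)
    return float(np.max((abs(s) + a * b) ** 2 - lam * a ** 2))

def inner_sup_closed_form(s, b, lam):
    # finite iff lam > b**2; attained at a* = b*|s| / (lam - b**2)
    return lam * s ** 2 / (lam - b ** 2)

s, b, lam = 1.7, 0.8, 2.5   # b stands in for ||(beta, 1)||_p; lam > b**2
```

The grid search and the closed form agree, confirming that the inner supremum is an easy scalar problem.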

slide-128
SLIDE 128

On Role of Transport Cost...

https://arxiv.org/abs/1705.07152: data-driven choice of c (·). Suppose that ‖x − x′‖A^2 = (x − x′)T A (x − x′) with A positive definite (Mahalanobis distance). Then

min_β max_{P: Dc(P,Pn)≤δ} EP^{1/2}[(Y − βT X)^2] = min_β { EPn^{1/2}[(Y − βT X)^2] + √δ ‖β‖_{A^{−1}} }.

Intuition: Think of A as diagonal, encoding the inverse variability of the Xi's... High variability –> cheap transportation –> high impact on risk estimation.

Blanchet (Columbia U. and Stanford U.) 45 / 60
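A small sketch of this intuition: with a diagonal A, the dual Mahalanobis norm ‖β‖_{A⁻¹} penalizes coefficients of high-variability features (small A entries, i.e. cheap transport) more heavily. The matrix and vectors below are illustrative.

```python
import numpy as np

# A ~ inverse variability of the features (illustrative):
# feature 0 is high-variance (small A entry, cheap to transport),
# feature 1 is low-variance (large A entry, expensive to transport).
A = np.diag([0.25, 4.0])
A_inv = np.linalg.inv(A)

def dual_mahalanobis_norm(beta, A_inv):
    """||beta||_{A^{-1}} = sqrt(beta^T A^{-1} beta): the regularizer appearing
    in the DRO representation with Mahalanobis ground cost."""
    return float(np.sqrt(beta @ A_inv @ beta))

e0 = np.array([1.0, 0.0])   # unit coefficient on the high-variance feature
e1 = np.array([0.0, 1.0])   # unit coefficient on the low-variance feature
```

Here ‖e0‖_{A⁻¹} = 2.0 while ‖e1‖_{A⁻¹} = 0.5: the coefficient tied to the cheaply-transported feature carries four times the penalty, exactly the "high variability → high impact" effect.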

slide-134
SLIDE 134

On Role of Transport Cost...

Comparing L1 regularization vs data-driven cost regularization on real data:

                      BC            BN            QSAR          Magic
LRL1      Train   .185 ± .123   .080 ± .030   .614 ± .038   .548 ± .087
          Test    .428 ± .338   .340 ± .228   .755 ± .019   .610 ± .050
          Accur   .929 ± .023   .930 ± .042   .646 ± .036   .665 ± .045
DRO-NL    Train   .032 ± .015   .113 ± .035   .339 ± .044   .381 ± .084
          Test    .119 ± .044   .194 ± .067   .554 ± .032   .576 ± .049
          Accur   .955 ± .016   .931 ± .036   .736 ± .027   .730 ± .043
Num Predictors        30            4             30            10
Train Size            40            20            80            30
Test Size            329           752           475          9990

Table: Numerical results for real data sets.

Blanchet (Columbia U. and Stanford U.) 47 / 60

slide-135
SLIDE 135

Connections to Statistical Analysis

Based on: Robust Wasserstein Profile Inference (B., Murthy & Kang ’16) https://arxiv.org/abs/1610.05627 Highlight: How to choose size of uncertainty?

Blanchet (Columbia U. and Stanford U.) 48 / 60

slide-139
SLIDE 139

Towards an Optimal Choice of Uncertainty Size

How to choose the uncertainty size in a data-driven way? Once again, consider Lasso as an example:

min_β max_{P: Dc(P,Pn)≤δ} EP[(Y − βT X)^2] = min_β ( EPn^{1/2}[(Y − βT X)^2] + √δ ‖β‖p )^2.

Use the left-hand side to define a statistical principle to choose δ. Important: Optimizing δ is equivalent to optimizing the regularization!

Blanchet (Columbia U. and Stanford U.) 49 / 60

slide-143
SLIDE 143

Towards an Optimal Choice of Uncertainty Size

“Standard” way to pick δ (Esfahani and Kuhn (2015)). Estimate D (Ptrue, Pn) using concentration of measure results. Not a good idea: rate of convergence of the form O

  • 1/n1/d

(d is the data dimension). Instead we seek an optimal approach.

Blanchet (Columbia U. and Stanford U.) 50 / 60

slide-147
SLIDE 147

Towards an Optimal Choice of Uncertainty Size

Keep in mind linear regression problem Yi = βT

∗ Xi + ǫi.

The plausible model variations of Pn are given by the set Uδ (n) = {P : Dc (P, Pn) ≤ δ}. Given P ∈ Uδ (n), define ¯ β (P) = arg min EP

  • Y − βT X
  • .

It is natural to say that Λδ (n) = {¯ β (P) : P ∈ Uδ (n)} are plausible estimates of β∗.

Blanchet (Columbia U. and Stanford U.) 51 / 60

slide-150
SLIDE 150

Optimal Choice of Uncertainty Size

Given a confidence level 1 − α, we advocate choosing δ via

min δ s.t. P(β∗ ∈ Λδ(n)) ≥ 1 − α.

Equivalently: Find the smallest confidence region Λδ(n) at level 1 − α. In simple words: find the smallest δ so that β∗ is plausible with confidence level 1 − α.

Blanchet (Columbia U. and Stanford U.) 52 / 60

slide-154
SLIDE 154

The Robust Wasserstein Profile Function

The value β̄(P) is characterized by the first-order condition

EP[∇β(Y − βT X)^2] = −2 EP[(Y − βT X) X] = 0.

Define the Robust Wasserstein Profile (RWP) function:

Rn(β) = min{ Dc(P, Pn) : EP[(Y − βT X) X] = 0 }.

Note that Rn(β∗) ≤ δ ⟺ β∗ ∈ Λδ(n) = {β̄(P) : Dc(P, Pn) ≤ δ}. So δ is the 1 − α quantile of Rn(β∗)!

Blanchet (Columbia U. and Stanford U.) 53 / 60
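In the scalar toy case (mean estimation with quadratic cost, treated later in the deck), Rn(β∗) reduces to (Ȳ − β∗)², so the prescription "δ = (1 − α) quantile of Rn(β∗)" can be simulated directly. Normal data and the particular sample size are my assumptions; the point is that n·δ stabilizes near the σ²χ²₁ quantile, illustrating the O(1/n) scaling discussed next.

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma, beta_star, alpha = 200, 1.0, 0.0, 0.05

# Monte Carlo the sampling distribution of R_n(beta*) = (mean(Y) - beta*)^2
reps = [(rng.normal(beta_star, sigma, n).mean() - beta_star) ** 2
        for _ in range(5000)]
delta = float(np.quantile(reps, 1 - alpha))

# CLT check: n * R_n(beta*) ~ sigma^2 * chi^2_1, whose 95% quantile is ~3.84
```

So the data-driven δ is tiny, of order 1/n, rather than the much larger 1/n^(1/d) radius from concentration-of-measure bounds.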

slide-155
SLIDE 155

The Robust Wasserstein Profile Function

Blanchet (Columbia U. and Stanford U.) 54 / 60

slide-156
SLIDE 156

Computing Optimal Regularization Parameter

Theorem (B., Murthy, Kang (2016)) Suppose that {(Yi, Xi)}_{i=1}^n is an i.i.d. sample with finite variance and that

c((x, y), (x′, y′)) = ‖x − x′‖q^2 if y = y′, and ∞ if y ≠ y′.

Then n Rn(β∗) ⇒ L1, where L1 is explicit and

L1 ≤_D L2 := ( E[e^2] / (E[e^2] − (E|e|)^2) ) ‖N(0, Cov(X))‖q^2.

Remark: We recover the same order of regularization (but L1 gives the optimal constant!)

Blanchet (Columbia U. and Stanford U.) 55 / 60

slide-160
SLIDE 160

Discussion on Optimal Uncertainty Size

Optimal δ is of order O (1/n) as opposed to O

  • 1/n1/d

as advocated in the standard approach. We characterize the asymptotic constant (not only order) in optimal regularization: P

  • L1 ≤ η1−α

= 1 − α. Rn (β∗) is inspired by Empirical Likelihood — Owen (1988). Lam & Zhou (2015) use Empirical Likelihood in DRO, but focus on divergence.

Blanchet (Columbia U. and Stanford U.) 56 / 60

slide-161
SLIDE 161

A Toy Example Illustrating Proof Techniques

Consider

min_β max_{P: Dc(P,Pn)≤δ} E[(Y − β)^2]

with c(y, y′) = |y − y′|^ρ, and define

Rn(β) = min_{π(dy,du)≥0} { ∫ |y − u|^ρ π(dy, du) : ∫_{u∈R} π(dy, du) = (1/n) ∑_i δ_{Yi}(dy), ∫ 2(u − β) π(dy, du) = 0 }.

Blanchet (Columbia U. and Stanford U.) 57 / 60

slide-162
SLIDE 162

A Toy Example Illustrating Proof Techniques

Dual linear programming problem (plug in β = β∗):

Rn(β∗) = sup_{λ∈R} { −(1/n) ∑_{i=1}^n sup_{u∈R} [ λ(u − β∗) − |Yi − u|^ρ ] }
= sup_{λ∈R} { −(λ/n) ∑_{i=1}^n (Yi − β∗) − (1/n) ∑_{i=1}^n sup_{u∈R} [ λ(u − Yi) − |Yi − u|^ρ ] }
= sup_λ { −(λ/n) ∑_{i=1}^n (Yi − β∗) − (ρ − 1) |λ/ρ|^{ρ/(ρ−1)} }
= | (1/n) ∑_{i=1}^n (Yi − β∗) |^ρ = | n^{−1/2} N(0, σ^2) |^ρ (in distribution).

Blanchet (Columbia U. and Stanford U.) 58 / 60
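For ρ = 2 the maximization over λ can be done by brute force and compared against the closed form |n⁻¹∑(Yi − β∗)|^ρ from the last display. The data below are synthetic.

```python
import numpy as np

rng = np.random.default_rng(3)
Y = rng.normal(loc=1.0, scale=2.0, size=500)
beta_star = 1.0
rho = 2

w_bar = np.mean(Y - beta_star)

# dual objective: g(lam) = -lam*w_bar - (rho-1)*|lam/rho|^(rho/(rho-1))
lam_grid = np.linspace(-10, 10, 400001)
g = -lam_grid * w_bar - (rho - 1) * np.abs(lam_grid / rho) ** (rho / (rho - 1))
rn_dual = float(g.max())

rn_closed = float(np.abs(w_bar) ** rho)   # closed form from the slide
```

The two values coincide, and since w_bar is of order n^(−1/2), Rn(β∗) is of order 1/n for ρ = 2, matching the scaling of the optimal uncertainty size.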

slide-164
SLIDE 164

Discussion: Some Open Problems

Extensions: Optimal Transport with constraints, Optimal Martingale Transport. Computational methods: the typical approach is entropic regularization (new methods are currently being developed in the machine learning literature).

Blanchet (Columbia U. and Stanford U.) 59 / 60

slide-170
SLIDE 170

Conclusions

Optimal transport (OT) is a powerful tool based on linear programming.
OT costs are natural for computing model uncertainty.
OT can be used in path-space to quantify the error in diffusion approximations.
OT can be used for data-driven distributionally robust optimization.
The cost function in OT can be used to improve out-of-sample performance.
OT can be used for statistical inference via the RWP function.

Blanchet (Columbia U. and Stanford U.) 60 / 60