Divergence, Gibbs measures, and entropic regularizations of optimal transport
Soumik Pal, University of Washington, Seattle
Fields Institute, Feb 13, 2020
The Monge problem 1781
$P, Q$ - probabilities on $X = \mathbb{R}^d = Y$. $c(x, y)$ - cost of transport. E.g., $c(x, y) = \|x - y\|$ or $c(x, y) = \frac{1}{2}\|x - y\|^2$.
Monge problem: minimize, among maps $T : \mathbb{R}^d \to \mathbb{R}^d$ with $T_\# P = Q$,
$$\int c(x, T(x))\, dP.$$
Kantorovich relaxation 1939
Figure: by M. Cuturi
$\Pi(P, Q)$ - couplings of $(P, Q)$ (joint distributions with the given marginals). (Monge-)Kantorovich relaxation: minimize among $\nu \in \Pi(P, Q)$:
$$\inf_{\nu \in \Pi(P,Q)} \int c(x, y)\, d\nu.$$
Linear optimization in $\nu$ over the convex set $\Pi(P, Q)$.
Example: quadratic Wasserstein
Consider $c(x, y) = \frac{1}{2}\|x - y\|^2$. Assume $P, Q$ have densities $\rho_0, \rho_1$.
$$W_2^2(P, Q) = W_2^2(\rho_0, \rho_1) = \inf_{\nu \in \Pi(\rho_0, \rho_1)} \int \|x - y\|^2\, d\nu.$$
Theorem (Y. Brenier ’87)
There exists a convex $\varphi$ such that $T(x) = \nabla\varphi(x)$ solves both the Monge and the Kantorovich OT problems for $(\rho_0, \rho_1)$ uniquely.
When are MK solutions Monge?
When transporting densities, other cost functions also give Monge solutions. Twist condition: $y \mapsto \nabla_x c(x, y)$ is one-to-one. Example: $c(x, y) = g(x - y)$ with $g$ strictly convex.
$$W_g(\rho_0, \rho_1) := \inf_{\nu \in \Pi} \nu\big(g(x - y)\big) = \inf_{\nu \in \Pi} \int g(x - y)\, d\nu.$$
Entropic regularization
Monge solutions are highly degenerate: supported on a graph. Entropy as a measure of degeneracy:
$$\mathrm{Ent}(\nu) := \begin{cases} \int f(x) \log f(x)\, dx, & \text{if } \nu \text{ has a density } f, \\ \infty, & \text{otherwise.} \end{cases}$$
Example: the entropy of $N(0, \sigma^2)$ is $-\log\sigma$ + constant. Monge solutions have infinite entropy. Föllmer ’88, Rüschendorf-Thomsen ’93, Cuturi ’13, Gigli ’19 ... suggested penalizing OT with entropy. Why? Fast algorithms. Statistical physics. Smooth approximations.
Entropic regularization
MK OT problem with $c(x, y) = g(x - y)$, $g \ge 0$ strictly convex.
$$W_g(\rho_0, \rho_1) := \inf_{\nu \in \Pi(\rho_0, \rho_1)} \int g(x - y)\, d\nu.$$
For $h > 0$,
$$K'_h := \inf_{\nu \in \Pi} \big[\nu\big(g(x - y)\big) + h\,\mathrm{Ent}(\nu)\big].$$
Naturally, $K'_h(\rho_0, \rho_1) \approx W_g(\rho_0, \rho_1)$ as $h \to 0^+$. What is the rate of convergence?
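The "fast algorithms" mentioned on the previous slide are Sinkhorn-type matrix scalings. A minimal discrete sketch (illustrative, not from the talk; sample sizes, regularization values, and variable names are arbitrary choices):

```python
import numpy as np

# Entropic OT between two empirical measures via Sinkhorn matrix scaling,
# with cost g(x - y) = |x - y|^2 / 2 on scattered 1-d points.
rng = np.random.default_rng(0)
xs = rng.normal(0.0, 1.0, size=200)          # support points of rho_0
ys = rng.normal(1.0, 0.5, size=200)          # support points of rho_1
rho0 = np.full(200, 1.0 / 200)               # uniform weights
rho1 = np.full(200, 1.0 / 200)
C = 0.5 * (xs[:, None] - ys[None, :]) ** 2   # cost matrix

def sinkhorn(C, rho0, rho1, h, iters=2000):
    """Optimal entropically regularized coupling at temperature h."""
    K = np.exp(-C / h)                       # Gibbs kernel exp(-g(x - y)/h)
    u = np.ones_like(rho0)
    for _ in range(iters):                   # alternately fit the two marginals
        v = rho1 / (K.T @ u)
        u = rho0 / (K @ v)
    return u[:, None] * K * v[None, :]

for h in [1.0, 0.5, 0.1]:
    nu = sinkhorn(C, rho0, rho1, h)
    print(h, (nu * C).sum())                 # nu_h(g) approaches W_g as h -> 0+
```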
Entropic cost
An equivalent form of the entropic relaxation. Define the "transition kernel"
$$p_h(x, y) = \frac{1}{\Lambda_h} \exp\left(-\frac{1}{h} g(x - y)\right), \quad \Lambda_h = \text{normalization},$$
and the joint distribution $\mu_h(x, y) = \rho_0(x) p_h(x, y)$. Relative entropy:
$$H(\nu \mid \mu) = \int \log\frac{d\nu}{d\mu}\, d\nu.$$
Define the entropic cost
$$K_h = \inf_{\text{couplings}(\rho_0, \rho_1)} H(\nu \mid \mu_h).$$
Then $K_h = K'_h / h - \mathrm{Ent}(\rho_0) + \log\Lambda_h$.
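One way to see this identity (it uses only that every $\nu \in \Pi(\rho_0, \rho_1)$ has first marginal $\rho_0$): if $\nu$ has density $f(x, y)$, then
$$H(\nu \mid \mu_h) = \int f \log f - \int f \log\big(\rho_0(x) p_h(x, y)\big) = \mathrm{Ent}(\nu) - \mathrm{Ent}(\rho_0) + \frac{1}{h}\nu\big(g(x - y)\big) + \log\Lambda_h,$$
and taking the infimum over $\nu \in \Pi(\rho_0, \rho_1)$ gives the displayed relation.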
Example: quadratic Wasserstein
Consider $g(x - y) = \frac{1}{2}\|x - y\|^2$. Then $p_h(x, y)$ is the transition density of Brownian motion, with $h$ = temperature:
$$p_h(x, y) = (2\pi h)^{-d/2} \exp\left(-\frac{1}{2h}\|x - y\|^2\right), \quad \Lambda_h = (2\pi h)^{d/2}.$$
Entropic cost:
$$K_h = \frac{K'_h}{h} - \mathrm{Ent}(\rho_0) + \frac{d}{2}\log(2\pi h).$$
In general, there need not be a stochastic process behind $p_h(x, y)$.
Schrödinger’s problem
Brownian motion $X$ at temperature $h \approx 0$. "Condition" $X_0 \sim \rho_0$, $X_1 \sim \rho_1$: an exponentially rare event. On this rare event, what do the particles do? Schrödinger ’31, Föllmer ’88, Léonard ’12. A particle initially at $x$ moves close to $\nabla\varphi(x)$ (the Brenier map). Recall: for any $g(x - y)$,
$$\lim_{h \to 0} h K_h = \lim_{h \to 0} K'_h = W_g(\rho_0, \rho_1).$$
Rate of convergence?
Pointwise convergence
Theorem (P. ’19)
$\rho_0, \rho_1$ compactly supported (+ technical conditions), Kantorovich potential uniformly convex. Then
$$\lim_{h \to 0^+} \left[K_h - \frac{1}{2h} W_2^2(\rho_0, \rho_1)\right] = \frac{1}{2}\big(\mathrm{Ent}(\rho_1) - \mathrm{Ent}(\rho_0)\big).$$
Complementary results are known for Gamma convergence; pointwise convergence was left open. Adams, Dirr, Peletier, Zimmer ’11 (1-d), Duong, Laschos, Renger ’13, Erbar, Maas, Renger ’15 (multidimensional, Fokker-Planck).
Divergence
To state the result for a general $g$, we need a new concept. For a convex function $\varphi$, the Bregman divergence:
$$D[y \mid z] = \varphi(y) - \varphi(z) - (y - z)\cdot\nabla\varphi(z) \ge 0.$$
If $x^* = \nabla\varphi(x)$ (the Brenier solution),
$$D[y \mid x^*] = \frac{1}{2}\|x - y\|^2 - \varphi_c(x) - \varphi^*_c(y),$$
where $\varphi_c, \varphi^*_c$ are the c-concave functions
$$\varphi_c(x) = \frac{1}{2}\|x\|^2 - \varphi(x), \quad \varphi^*_c(y) = \frac{1}{2}\|y\|^2 - \varphi^*(y).$$
For $y \approx x^*$: $D[y \mid x^*] \approx \frac{1}{2}(y - x^*)^T A(x^*)(y - x^*)$, where $A(z) = \nabla^2\varphi^*(z)$.
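A worked check, using only the definitions above: when $x^* = \nabla\varphi(x)$ we have $\nabla\varphi^*(x^*) = x$ and $\varphi(x) = x\cdot x^* - \varphi^*(x^*)$, so
$$\frac{1}{2}\|x - y\|^2 - \varphi_c(x) - \varphi^*_c(y) = \varphi(x) + \varphi^*(y) - x\cdot y = \varphi^*(y) - \varphi^*(x^*) - (y - x^*)\cdot\nabla\varphi^*(x^*),$$
i.e., the Bregman divergence of $\varphi^*$ between $y$ and $x^*$. Nonnegativity is the Fenchel-Young inequality, with equality iff $y = x^*$.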
Divergence
Generalize to cost $g$. The Monge solution is given by (Gangbo-McCann)
$$x^* = x - (\nabla g)^{-1} \circ \nabla\psi(x),$$
for some c-concave function $\psi$, with dual c-concave function $\psi^*$. Divergence:
$$D[y \mid x^*] = g(x - y) - \psi(x) - \psi^*(y) \ge 0.$$
For $y \approx x^*$, extract the matrix $A(x^*)$ from the Taylor expansion. The divergence / $A(\cdot)$ measures the sensitivity of the Monge map. Related to the cross-difference of Kim & McCann ’10, McCann ’12, Yang & Wong ’19.
Pointwise convergence
Theorem (P. ’19)
$\rho_0, \rho_1$ compactly supported (+ technical condition), $A(\cdot)$ "uniformly elliptic". Then
$$\lim_{h \to 0^+} \left[K_h - \frac{1}{h} W_g(\rho_0, \rho_1)\right] = \frac{1}{2}\int \rho_1(y) \log\det A(y)\, dy - \frac{1}{2}\log\det\nabla^2 g(0).$$
For $g(x - y) = \|x - y\|^2/2$, $\log\det\nabla^2 g(0) = 0$, and for $\varphi$ (Brenier)
$$\frac{1}{2}\int \rho_1(y) \log\det A(y)\, dy = \frac{1}{2}\int \rho_1(y) \log\det\nabla^2\varphi^*(y)\, dy,$$
which equals $\frac{1}{2}\big(\mathrm{Ent}(\rho_1) - \mathrm{Ent}(\rho_0)\big)$ by a simple calculation by McCann.
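That calculation, sketched: the Monge-Ampère equation gives $\rho_0(\nabla\varphi^*(y))\det\nabla^2\varphi^*(y) = \rho_1(y)$, and $\nabla\varphi^*$ pushes $\rho_1$ forward to $\rho_0$, so
$$\int \rho_1(y) \log\det\nabla^2\varphi^*(y)\, dy = \int \rho_1 \log\rho_1\, dy - \int \rho_1(y)\log\rho_0(\nabla\varphi^*(y))\, dy = \mathrm{Ent}(\rho_1) - \mathrm{Ent}(\rho_0).$$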
Idea of the proof: approximate Schrödinger bridge
Idea of the proof: Brownian case
Recall, we want to condition Brownian motion to have marginals $\rho_0, \rho_1$. $p_h(x, y)$ - Brownian transition density at time $h$; $\mu_h(x, y) = \rho_0(x) p_h(x, y)$, the joint distribution. If I can "guess" this conditional distribution $\hat\mu_h$, then
$$K_h = \inf_{\text{couplings}(\rho_0, \rho_1)} H(\nu \mid \mu_h) = H(\hat\mu_h \mid \mu_h).$$
One can approximately do so for small $h$ by a Taylor expansion in $h$.
Idea of the proof: Brownian case
It is known (Rüschendorf) that $\hat\mu_h$ must be of the form
$$\hat\mu_h(x, y) = e^{a(x) + b(y)}\mu_h(x, y) \propto \exp\left(-\frac{1}{h}g(x - y) + a(x) + b(y)\right).$$
With $\varphi$ the convex function from the Brenier map,
$$a(x) = \frac{1}{h}\left(\frac{\|x\|^2}{2} - \varphi(x)\right) + h\zeta_h(x), \quad b(y) = \frac{1}{h}\left(\frac{\|y\|^2}{2} - \varphi^*(y)\right) + h\xi_h(y),$$
where $\zeta_h, \xi_h$ are $O(1)$.
Idea of the proof
Thus, up to lower order terms,
$$\hat\mu_h(x, y) \propto \rho_0(x)\exp\left(-\frac{1}{h}g(x - y) + \frac{1}{h}\varphi_c(x) + \frac{1}{h}\varphi^*_c(y)\right) = \rho_0(x)\exp\left(-\frac{1}{h}D[y \mid x^*]\right).$$
If $y - x^*$ is large, it gets penalized exponentially. Hence
$$\hat\mu_h(x, y) \approx \rho_0(x)\exp\left(-\frac{1}{2h}(y - x^*)^T\nabla^2\varphi^*(x^*)(y - x^*)\right):$$
a Gaussian transition kernel with mean $x^*$ and covariance $h\big(\nabla^2\varphi^*(x^*)\big)^{-1}$.
Idea of the proof
For $h \approx 0$, the Schrödinger bridge is approximately Gaussian. Sample $X \sim \rho_0$, generate $Y \sim N\big(x^*, h\big(\nabla^2\varphi^*(x^*)\big)^{-1}\big)$:
$$\hat\mu_h(x, y) \approx \rho_0(x)\sqrt{\det\nabla^2\varphi^*(x^*)}\,(2\pi h)^{-d/2}\exp\left(-\frac{1}{2h}(y - x^*)^T\nabla^2\varphi^*(x^*)(y - x^*)\right).$$
$Y$ is not exactly $\rho_1$; there are lower order corrections. Nevertheless, up to the leading term,
$$H(\hat\mu_h \mid \mu_h) \approx \frac{1}{2}\int \log\det\nabla^2\varphi^*(x^*)\,\rho_0(x)\, dx = \frac{1}{2}\big(\mathrm{Ent}(\rho_1) - \mathrm{Ent}(\rho_0)\big).$$
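This approximation is easy to simulate in one dimension, where the Brenier map between two Gaussians is affine. A sketch (illustrative; the parameters m, s, h are arbitrary choices):

```python
import numpy as np

# Approximate Schrodinger bridge between rho_0 = N(0, 1) and rho_1 = N(m, s^2).
# The Brenier map is T(x) = m + s*x, and (phi*)''(y) = 1/s, so the approximate
# bridge kernel is N(T(x), h*s).
rng = np.random.default_rng(1)
m, s, h = 1.0, 0.5, 0.01

X = rng.normal(0.0, 1.0, size=100_000)   # X ~ rho_0
x_star = m + s * X                       # x* = grad phi(x), the Brenier map
Y = rng.normal(x_star, np.sqrt(h * s))   # Y | X ~ N(x*, h * ((phi*)'')^{-1})

# For small h, the second marginal is close to rho_1 = N(m, s^2):
print(Y.mean(), Y.var())                 # approx m and s^2 + h*s
```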
Divergence based methods
The divergence-based method is distinct from the usual dynamic techniques, which typically handle only the quadratic cost (Benamou-Brenier, Otto calculus). See Conforti & Tamanini ’19 for one more term in the expansion for the quadratic cost. Higher order terms should be related to higher order derivatives of the divergence.
The Dirichlet transport
Dirichlet transport, P.-Wong ’16
$\Delta_n$ - unit simplex $\{(p_1, \ldots, p_n) : p_i > 0, \sum_i p_i = 1\}$. $\Delta_n$ is an abelian group with identity $e = (1/n, \ldots, 1/n)$: if $p, q \in \Delta_n$, then
$$(p \odot q)_i = \frac{p_i q_i}{\sum_{j=1}^n p_j q_j}, \quad (p^{-1})_i = \frac{1/p_i}{\sum_{j=1}^n 1/p_j}.$$
K-L divergence or relative entropy as "distance":
$$H(q \mid p) = \sum_{i=1}^n q_i \log(q_i/p_i).$$
Take $X = Y = \Delta_n$ and
$$c(p, q) = H\big(e \mid p^{-1} \odot q\big) = \log\left(\frac{1}{n}\sum_{i=1}^n \frac{q_i}{p_i}\right) - \frac{1}{n}\sum_{i=1}^n \log\frac{q_i}{p_i} \ge 0.$$
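A minimal sketch of this group structure and cost (illustrative; the test points p, q are arbitrary):

```python
import numpy as np

def mult(p, q):
    """Group operation: componentwise product, renormalized to the simplex."""
    r = p * q
    return r / r.sum()

def inv(p):
    """Group inverse: componentwise reciprocal, renormalized."""
    r = 1.0 / p
    return r / r.sum()

def cost(p, q):
    """c(p, q) = log(mean of q_i/p_i) - mean of log(q_i/p_i) >= 0 (Jensen)."""
    r = q / p
    return np.log(r.mean()) - np.mean(np.log(r))

n = 4
e = np.full(n, 1.0 / n)
p = np.array([0.1, 0.2, 0.3, 0.4])
q = np.array([0.4, 0.3, 0.2, 0.1])

assert np.allclose(mult(p, inv(p)), e)   # p multiplied by p^{-1} gives e
print(cost(p, q), cost(p, p))            # positive, and 0 at q = p
```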
Exponentially concave functions
$\phi : \Delta_n \to \mathbb{R} \cup \{-\infty\}$ is exponentially concave if $e^\phi$ is concave. $x \mapsto \frac{1}{2}\log x$ is e-concave, but $x \mapsto 2\log x$ is not.
Examples ($p, r \in \Delta_n$, $0 < \lambda < 1$):
$$\phi(p) = \frac{1}{n}\sum_i \log p_i, \quad \phi(p) = \log\sum_i r_i p_i, \quad \phi(p) = \frac{1}{\lambda}\log\sum_i p_i^\lambda.$$
(Fernholz ’02, P. and Wong ’15). Analog of Brenier's theorem: if $(p, q = F(p))$ is the Monge solution, then $p^{-1} = \nabla\phi(q)$ for an exponentially concave Kantorovich potential $\phi$. Smoothness and MTW: Khan & Zhang ’19.
Back to the Dirichlet transport
What is the corresponding probabilistic picture for the cost function $c(p, q) = H\big(e \mid p^{-1} \odot q\big)$ on the unit simplex $\Delta_n$?
Symmetric Dirichlet distribution $\mathrm{Dir}(\lambda)$: density $\propto \prod_{j=1}^n p_j^{\lambda/n - 1}$, a probability distribution on the unit simplex. If $U \sim \mathrm{Dir}(\lambda)$, then $E(U) = e$ and $\mathrm{Var}(U_i) = O(1/\lambda)$.
Dirichlet transition
The Haar measure on $(\Delta_n, \odot)$ is $\mathrm{Dir}(0)$: $\nu(p) = \prod_{i=1}^n p_i^{-1}$. Consider the transition probability: $p \in \Delta_n$, $U \sim \mathrm{Dir}(\lambda)$, $Q = p \odot U$. Then
$$f_\lambda(p, q) = c\,\nu(q)\exp\big(-\lambda c(p, q)\big) \quad \text{(P.-Wong ’18)}.$$
Temperature: $h = 1/\lambda$. Let $p_h(p, q) = f_{1/h}(p, q)$. As $h \to 0^+$, $p_h \to \delta_p$. As $h \to \infty$, $Q \to \mathrm{Dir}(0)$, the Haar measure.
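The transition is easy to sample. A sketch (illustrative; the base point p and the values of lambda are arbitrary):

```python
import numpy as np

# Multiplicative Dirichlet transition Q = p (group-multiplied by) U, where
# U ~ symmetric Dir(lam), i.e. Dirichlet with all parameters equal to lam/n.
# As lam -> infinity (temperature h = 1/lam -> 0+), Q concentrates at p.
rng = np.random.default_rng(2)

def dirichlet_step(p, lam, size=10_000):
    U = rng.dirichlet(np.full(len(p), lam / len(p)), size=size)
    Q = p * U                                # componentwise product ...
    return Q / Q.sum(axis=1, keepdims=True)  # ... renormalized to the simplex

p = np.array([0.1, 0.2, 0.3, 0.4])
for lam in [10.0, 100.0, 1000.0]:
    Q = dirichlet_step(p, lam)
    print(lam, np.abs(Q.mean(axis=0) - p).max())   # -> 0 as h -> 0+
```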
Multiplicative Schrödinger problem
Fix $\rho_0, \rho_1$. Let $\mu_h(p, q) = \rho_0(p) p_h(p, q)$. Recall relative entropy: $H(\nu \mid \mu) = \int \log(d\nu/d\mu)\, d\nu$. Entropic cost:
$$K_h = \inf_{\text{couplings}(\rho_0, \rho_1)} H(\nu \mid \mu_h).$$
For a density $\rho$ on $\Delta_n$, let $\mathrm{Ent}_0(\rho) = H(\rho \mid \mathrm{Dir}(0))$: relative entropy w.r.t. the Haar measure.
A tabular comparison
Group:                  (R^n, +)              (∆_n, ⊙)
Identity:               0                     e = (1/n, ..., 1/n)
Cost:                   |y − x|^2             H(e | q ⊙ p^{-1})
Potential:              convex                exponentially concave
Monge solution:         y = ∇φ(x)             q = ∇ϕ(p)
Displacement:           y − x                 π(p) = q ⊙ p^{-1}
Stochastic transition:  add Gaussian          multiply Dirichlet
Haar measure:           Lebesgue              Dir(0)
Entropy:                standard Ent          Ent_0
Pointwise convergence
Theorem (P. ’19)
$\rho_0, \rho_1$ compactly supported + exponentially concave potential "uniformly convex". Then
$$\lim_{h \to 0^+}\left[K_h - \left(\frac{1}{h} - \frac{n}{2}\right)C(\rho_0, \rho_1)\right] = \frac{1}{2}\big(\mathrm{Ent}_0(\rho_1) - \mathrm{Ent}_0(\rho_0)\big),$$
where $C(\rho_0, \rho_1)$ is the optimal cost of transport with cost $c$. Not a metric, but a divergence: not symmetric in $(\rho_0, \rho_1)$. AFAIK, the only such example known. Related to Erbar ’14 (jump processes) and Maas ’11 (Markov chains).
Connections to gradient flow of entropy
Gradient flow of entropy
Ambrosio-Gigli-Savaré; recent survey by Santambrogio. Consider the Cauchy problem in $\mathbb{R}^n$:
$$x'(t) = -\nabla F(x(t)), \quad x(0) = x_0:$$
gradient flow with potential $F$. Euler discretization: fix a small step parameter $h > 0$,
$$x^h_{k+1} = \mathrm{argmin}_x\left[\frac{\|x - x^h_k\|^2}{2h} + F(x)\right].$$
First-order condition:
$$\frac{x^h_{k+1} - x^h_k}{h} = -\nabla F(x^h_{k+1}),$$
which converges to the gradient flow as $h \to 0^+$.
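A minimal sketch of this scheme (illustrative) for $F(x) = x^2/2$, where the argmin step has the closed form $x_{k+1} = x_k/(1 + h)$ and the exact flow is $x(t) = x_0 e^{-t}$:

```python
import numpy as np

# Minimizing-movement / Euler scheme for F(x) = x^2 / 2 in one dimension.
# Each step solves argmin_x [ (x - x_k)^2 / (2h) + x^2 / 2 ] = x_k / (1 + h).
x0, h, T = 1.0, 0.01, 1.0
x = x0
for _ in range(int(T / h)):
    x = x / (1.0 + h)          # one proximal step of the discretization

print(x, x0 * np.exp(-T))      # discrete scheme vs exact flow at time T
```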
Heat equation as a gradient flow of entropy
Start with a density $\rho^{(0)} = \rho_0$. Fix $h > 0$:
$$\rho^{(k+1)} = \mathrm{argmin}_\rho\left[\frac{1}{2h}W_2^2\big(\rho, \rho^{(k)}\big) + \mathrm{Ent}(\rho)\right].$$
Define the interpolation $\rho^h(t) = \rho^{(k)}$, $kh \le t < (k+1)h$. Jordan-Kinderlehrer-Otto (JKO) ’98: $\rho^h(t)$ "converges" to the heat equation
$$\frac{\partial\rho}{\partial t} = \frac{\partial^2\rho}{\partial x^2}, \quad \rho(0, x) = \rho_0:$$
the gradient flow of entropy in the Wasserstein metric space.
Entropic cost to gradient flow
How does the entropic cost imply the gradient flow for the heat equation? Take Brownian motion starting from $\rho_0$, with $\rho(t)$ its density at time $t$. Obviously $\rho_h = \mathrm{argmin}_\rho K_h(\rho_0, \rho)$ and $\rho_{(k+1)h} = \mathrm{argmin}_\rho K_h(\rho_{kh}, \rho)$: relative entropy is minimized by the exact transition density. But
$$K_h(\rho_0, \rho) \approx \frac{1}{2h}W_2^2(\rho_0, \rho) + \frac{1}{2}\big(\mathrm{Ent}(\rho) - \mathrm{Ent}(\rho_0)\big).$$
This "morally" implies the gradient flow of entropy.
Gradient flow without a metric?
The Dirichlet transport has a similar structure:
$$K_h(\rho_0, \rho) \approx \left(\frac{1}{h} - \frac{n}{2}\right)C(\rho_0, \rho) + \frac{1}{2}\big(\mathrm{Ent}_0(\rho) - \mathrm{Ent}_0(\rho_0)\big).$$