  1. Divergence, Gibbs measures, and entropic regularizations of optimal transport. Soumik Pal, University of Washington, Seattle. Fields Institute, Feb 13, 2020.

  2. The Monge problem (1781). $P, Q$ – probabilities on $X = \mathbb{R}^d = Y$. $c(x,y)$ – cost of transport, e.g. $c(x,y) = \|x-y\|$ or $c(x,y) = \tfrac{1}{2}\|x-y\|^2$. Monge problem: minimize $\int c(x, T(x))\, dP$ among maps $T : \mathbb{R}^d \to \mathbb{R}^d$ with $T_\# P = Q$.
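
[Editor's illustration, not from the slides: in one dimension, for strictly convex costs, the Monge map is the monotone rearrangement $T = F_Q^{-1}\circ F_P$. A minimal Python sketch, with $P = N(0,1)$ and $Q = N(2, 0.5^2)$ chosen purely for illustration.]

```python
import numpy as np
from scipy.stats import norm

# In 1-d, for strictly convex costs, the Monge map is the monotone
# rearrangement T = F_Q^{-1} o F_P (quantile transform).
P = norm(loc=0.0, scale=1.0)   # source: N(0, 1)
Q = norm(loc=2.0, scale=0.5)   # target: N(2, 0.25)

def T(x):
    return Q.ppf(P.cdf(x))

# Check the pushforward T_# P = Q by sampling.
rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=100_000)
Y = T(X)
print(Y.mean(), Y.std())       # ~2.0 and ~0.5
```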

  3. Kantorovich relaxation (1939). Figure: by M. Cuturi. $\Pi(P,Q)$ – couplings of $(P,Q)$ (joint distributions with the given marginals). (Monge–)Kantorovich relaxation: minimize over $\nu \in \Pi(P,Q)$: $\inf_{\nu \in \Pi(P,Q)} \int c(x,y)\, d\nu$. This is a linear optimization in $\nu$ over the convex set $\Pi(P,Q)$.
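
[For discrete marginals this is a finite linear program; a minimal sketch using scipy's LP solver, with made-up atoms and weights:]

```python
import numpy as np
from scipy.optimize import linprog

# Discrete marginals: atoms x_i with weights a_i, atoms y_j with weights b_j.
x = np.array([0.0, 1.0, 2.0]); a = np.array([0.5, 0.3, 0.2])
y = np.array([0.5, 1.5, 3.0]); b = np.array([0.4, 0.4, 0.2])
C = 0.5 * (x[:, None] - y[None, :])**2      # cost matrix c(x_i, y_j)
m, n = C.shape

# Marginal constraints: row sums = a, column sums = b.
A_eq = np.zeros((m + n, m * n))
for i in range(m):
    A_eq[i, i * n:(i + 1) * n] = 1.0        # sum_j nu_ij = a_i
for j in range(n):
    A_eq[m + j, j::n] = 1.0                 # sum_i nu_ij = b_j
b_eq = np.concatenate([a, b])

res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
nu = res.x.reshape(m, n)                    # optimal coupling
print(res.fun)                              # optimal transport cost
```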

  4. Example: quadratic Wasserstein. Consider $c(x,y) = \tfrac{1}{2}\|x-y\|^2$ and assume $P, Q$ have densities $\rho_0, \rho_1$. Then $W_2^2(P,Q) = W_2^2(\rho_0,\rho_1) = \inf_{\nu\in\Pi(\rho_0,\rho_1)} \int \|x-y\|^2\, d\nu$. Theorem (Y. Brenier ’87): there exists a convex $\varphi$ such that $T(x) = \nabla\varphi(x)$ solves both the Monge and the Kantorovich OT problems for $(\rho_0,\rho_1)$ uniquely.

  5. When are MK solutions Monge? When transporting densities, other cost functions also give Monge solutions. Twist condition: $y \mapsto \nabla_x c(x,y)$ is one-to-one. Example: $c(x,y) = g(x-y)$ with $g$ strictly convex. $W_g(\rho_0,\rho_1) := \inf_{\nu\in\Pi}\nu(g(x-y)) = \inf_{\nu\in\Pi}\int g(x-y)\, d\nu$.

  6. Entropic regularization. Monge solutions are highly degenerate: supported on a graph. Entropy as a measure of degeneracy: $\mathrm{Ent}(\nu) := \int f(x)\log f(x)\,dx$ if $\nu$ has a density $f$, and $\infty$ otherwise. Example: the entropy of $N(0,\sigma^2)$ is $-\log\sigma$ + constant. Monge solutions have infinite entropy. Föllmer ’88, Rüschendorf–Thomsen ’93, Cuturi ’13, Gigli ’19, ... suggested penalizing OT with entropy. Why? Fast algorithms; statistical physics; smooth approximations.
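
[Editor's sanity check on the sign convention: here $\mathrm{Ent}$ is the negative of differential entropy, so $\mathrm{Ent}(N(0,\sigma^2)) = E[\log f(X)] = -\log\sigma - \tfrac{1}{2}\log(2\pi e)$. A quick Monte Carlo sketch:]

```python
import numpy as np

rng = np.random.default_rng(0)
for sigma in [0.5, 1.0, 2.0]:
    X = rng.normal(0.0, sigma, size=200_000)
    log_f = -0.5 * np.log(2 * np.pi * sigma**2) - X**2 / (2 * sigma**2)
    # Ent(nu) = E[log f(X)]; closed form: -log(sigma) - (1/2) log(2*pi*e)
    print(sigma, log_f.mean(), -np.log(sigma) - 0.5 * np.log(2 * np.pi * np.e))
```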

  7. Entropic regularization. MK OT problem with $c(x,y) = g(x-y)$, $g \ge 0$ strictly convex: $W_g(\rho_0,\rho_1) := \inf_{\nu\in\Pi(\rho_0,\rho_1)}\int g(x-y)\,d\nu$. For $h > 0$, $K'_h := \inf_{\nu\in\Pi}\left[\nu(g(x-y)) + h\,\mathrm{Ent}(\nu)\right]$. Naturally, $K'_h(\rho_0,\rho_1) \approx W_g(\rho_0,\rho_1)$ as $h \to 0^+$. What is the rate of convergence?
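
[In the discrete case this penalized problem is what Sinkhorn's algorithm solves. A minimal sketch (uniform weights on random sorted atoms, quadratic $g$, all choices mine), showing the regularized cost approach the unregularized value as $h$ decreases:]

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
x = np.sort(rng.uniform(0, 1, n)); y = np.sort(rng.uniform(0, 1, n))
a = np.full(n, 1 / n); b = np.full(n, 1 / n)
C = 0.5 * (x[:, None] - y[None, :])**2           # g(x - y)

def sinkhorn(C, a, b, h, iters=5000):
    K = np.exp(-C / h)                           # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(iters):                       # alternating marginal fits
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]           # coupling nu

# With sorted atoms and uniform weights, the identity coupling is optimal.
exact = np.mean(0.5 * (x - y)**2)
for h in [1.0, 0.1, 0.01]:
    nu = sinkhorn(C, a, b, h)
    print(h, np.sum(nu * C), exact)              # nu(g) -> W_g as h -> 0
```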

  8. Entropic cost. An equivalent form of the entropic relaxation. Define the “transition kernel” $p_h(x,y) = \frac{1}{\Lambda_h}\exp\left(-\frac{1}{h}g(x-y)\right)$, $\Lambda_h$ = normalization, and the joint distribution $\mu_h(x,y) = \rho_0(x)\,p_h(x,y)$. Relative entropy: $H(\nu\mid\mu) = \int \log\left(\frac{d\nu}{d\mu}\right) d\nu$. Define the entropic cost $K_h = \inf_{\nu\in\text{couplings}(\rho_0,\rho_1)} H(\nu\mid\mu_h)$. Then $K_h = K'_h/h - \mathrm{Ent}(\rho_0) + \log\Lambda_h$.
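
[Filling in the algebra behind the last identity: for a coupling $\nu$ with density $f$, $-\log p_h(x,y) = \frac{1}{h}g(x-y) + \log\Lambda_h$, so $H(\nu\mid\mu_h) = \int f\log f - \int f\log\rho_0(x) - \int f\log p_h(x,y) = \mathrm{Ent}(\nu) - \mathrm{Ent}(\rho_0) + \frac{1}{h}\nu(g(x-y)) + \log\Lambda_h$, using that the first marginal of $\nu$ is $\rho_0$. Taking the infimum over couplings gives $K_h = K'_h/h - \mathrm{Ent}(\rho_0) + \log\Lambda_h$.]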

  9. Example: quadratic Wasserstein. Consider $g(x-y) = \tfrac{1}{2}\|x-y\|^2$. Then $p_h(x,y)$ is the transition density of Brownian motion run for time $h$; $h$ = temperature: $p_h(x,y) = (2\pi h)^{-d/2}\exp\left(-\frac{1}{2h}\|x-y\|^2\right)$, $\Lambda_h = (2\pi h)^{d/2}$. Entropic cost: $K_h = K'_h/h - \mathrm{Ent}(\rho_0) + \frac{d}{2}\log(2\pi h)$. In general, there need not be a stochastic process behind $p_h(x,y)$.

  10. Schrödinger’s problem. Brownian motion $X$ at temperature $h \approx 0$. “Condition” on $X_0 \sim \rho_0$, $X_1 \sim \rho_1$: an exponentially rare event. On this rare event, what do the particles do? Schrödinger ’31, Föllmer ’88, Léonard ’12. A particle initially at $x$ moves close to $\nabla\varphi(x)$ (the Brenier map). Recall, for any $g(x-y)$: $\lim_{h\to 0} hK_h = \lim_{h\to 0} K'_h = W_g(\rho_0,\rho_1)$. Rate of convergence?

  11. Pointwise convergence. Theorem (P. ’19). Let $\rho_0, \rho_1$ be compactly supported (+ technical conditions), with uniformly convex Kantorovich potential. Then $\lim_{h\to 0^+}\left(K_h - \frac{1}{2h}W_2^2(\rho_0,\rho_1)\right) = \frac{1}{2}\left(\mathrm{Ent}(\rho_1) - \mathrm{Ent}(\rho_0)\right)$. Complementary results were known via Gamma-convergence; pointwise convergence was left open. Adams, Dirr, Peletier, Zimmer ’11 (1-d), Duong, Laschos, Renger ’13, Erbar, Maas, Renger ’15 (multidimensional, Fokker–Planck).
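
[A rough numerical sketch of this expansion (editor's illustration, not from the talk): discretize two 1-d Gaussians on a grid, solve the Schrödinger problem $K_h = \inf H(\nu\mid\mu_h)$ by log-domain Sinkhorn, and compare $K_h - W_2^2/2h$ with $\frac{1}{2}(\mathrm{Ent}(\rho_1)-\mathrm{Ent}(\rho_0))$. Gaussians are not compactly supported, so this only probes the formula heuristically, and the discretization and finite iteration counts add error.]

```python
import numpy as np
from scipy.special import logsumexp

# rho_0 = N(0, 1), rho_1 = N(0.5, 0.8^2); quadratic cost g(x - y) = |x - y|^2 / 2.
m0, s0, m1, s1 = 0.0, 1.0, 0.5, 0.8
x = np.linspace(-5.0, 6.0, 400)
dx = x[1] - x[0]
a = np.exp(-(x - m0)**2 / (2 * s0**2)); a /= a.sum()
b = np.exp(-(x - m1)**2 / (2 * s1**2)); b /= b.sum()

# Closed forms in 1-d (Ent = int f log f, so Ent(N(m, s^2)) = -log s + const):
W22 = (m1 - m0)**2 + (s1 - s0)**2
ent_diff = 0.5 * np.log(s0 / s1)             # (Ent(rho_1) - Ent(rho_0)) / 2

for h in [0.2, 0.1, 0.05]:
    # log of the discretized mu_h(x, y) = rho_0(x) p_h(x, y) on the grid
    logM = (np.log(a)[:, None] - (x[:, None] - x[None, :])**2 / (2 * h)
            - 0.5 * np.log(2 * np.pi * h) + np.log(dx))
    alpha, beta = np.zeros_like(a), np.zeros_like(b)
    for _ in range(2000):                    # log-domain Sinkhorn iterations
        alpha = np.log(a) - logsumexp(logM + beta[None, :], axis=1)
        beta = np.log(b) - logsumexp(logM + alpha[:, None], axis=0)
    Kh = a @ alpha + b @ beta                # H(nu* | mu_h) at the optimum
    # Should approach ent_diff as h -> 0 (Sinkhorn converges slower for small h).
    print(h, Kh - W22 / (2 * h), ent_diff)
```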

  12. Divergence. To state the result for a general $g$, we need a new concept. For a convex function $\varphi$, the Bregman divergence is $D[y\mid z] = \varphi(y) - \varphi(z) - (y-z)\cdot\nabla\varphi(z) \ge 0$. If $x^* = \nabla\varphi(x)$ (Brenier solution), then $D[y\mid x^*] = \tfrac{1}{2}\|y-x\|^2 - \varphi_c(x) - \varphi^*_c(y)$, where $\varphi_c, \varphi^*_c$ are the c-concave functions $\varphi_c(x) = \tfrac{1}{2}\|x\|^2 - \varphi(x)$, $\varphi^*_c(y) = \tfrac{1}{2}\|y\|^2 - \varphi^*(y)$. For $y \approx x^*$: $D[y\mid x^*] \approx \tfrac{1}{2}(y-x^*)^T A(x^*)(y-x^*)$, with $A(z) = \nabla^2\varphi^*(z)$.
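
[A direct sketch of the definition; the two convex functions below are editor's illustrative choices:]

```python
import numpy as np

def bregman(phi, grad_phi, y, z):
    """Bregman divergence D[y | z] = phi(y) - phi(z) - (y - z) . grad phi(z)."""
    return phi(y) - phi(z) - np.dot(y - z, grad_phi(z))

# phi(x) = |x|^2 / 2 recovers the squared Euclidean distance: D[y|z] = |y-z|^2 / 2.
phi = lambda x: 0.5 * np.dot(x, x)
grad_phi = lambda x: x
rng = np.random.default_rng(0)
y, z = rng.normal(size=3), rng.normal(size=3)
print(bregman(phi, grad_phi, y, z), 0.5 * np.sum((y - z)**2))

# phi(x) = sum x_i log x_i (negative entropy) gives the generalized KL divergence.
phi_e = lambda x: np.sum(x * np.log(x))
grad_e = lambda x: np.log(x) + 1.0
p, q = rng.uniform(0.1, 1.0, 3), rng.uniform(0.1, 1.0, 3)
print(bregman(phi_e, grad_e, p, q))  # = sum p log(p/q) - sum p + sum q >= 0
```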

  13. Divergence. Generalize to cost $g$. The Monge solution is given by (Gangbo–McCann) $x^* = x - (\nabla g)^{-1}\circ\nabla\psi(x)$, for some $c$-concave function $\psi$ with dual $c$-concave function $\psi^*$. Divergence: $D[y\mid x^*] = g(x-y) - \psi(x) - \psi^*(y) \ge 0$. For $y \approx x^*$, extract the matrix $A(x^*)$ from the Taylor series. The divergence / $A(\cdot)$ measures the sensitivity of the Monge map. Related to the cross-difference of Kim & McCann ’10, McCann ’12, Yang & Wong ’19.

  14. Pointwise convergence. Theorem (P. ’19). $\rho_0, \rho_1$ compactly supported (+ technical condition), $A(\cdot)$ “uniformly elliptic”. Then $\lim_{h\to 0^+}\left(K_h - \frac{1}{h}W_g(\rho_0,\rho_1)\right) = \frac{1}{2}\int\rho_1(y)\log\det A(y)\,dy - \frac{1}{2}\log\det\nabla^2 g(0)$. For $g(x-y) = \|x-y\|^2/2$: $\log\det\nabla^2 g(0) = 0$ and, with $\varphi$ the Brenier potential, $\frac{1}{2}\int\rho_1(y)\log\det A(y)\,dy = \frac{1}{2}\int\rho_1(y)\log\det(\nabla^2\varphi^*(y))\,dy$, which equals $\frac{1}{2}(\mathrm{Ent}(\rho_1) - \mathrm{Ent}(\rho_0))$ by a simple calculation due to McCann.

  15. Idea of the proof: approximate Schrödinger bridge

  16. Idea of the proof: Brownian case. Recall, we want to condition Brownian motion to have marginals $\rho_0, \rho_1$. $p_h(x,y)$ is the Brownian transition density at time $h$, and $\mu_h(x,y) = \rho_0(x)p_h(x,y)$ the joint distribution. If I can “guess” the conditioned distribution $\hat\mu_h$, then $K_h = \inf_{\text{couplings}(\rho_0,\rho_1)} H(\nu\mid\mu_h) = H(\hat\mu_h\mid\mu_h)$. One can do so approximately for small $h$ by a Taylor expansion in $h$.

  17. Idea of the proof: Brownian case. It is known (Rüschendorf) that $\hat\mu_h$ must be of the form $\hat\mu_h(x,y) = e^{a(x)+b(y)}\mu_h(x,y) \propto \exp\left(-\frac{1}{h}g(x-y) + a(x) + b(y)\right)$. With $\varphi$ the convex function from the Brenier map, $a(x) = \frac{1}{h}\left(\frac{\|x\|^2}{2} - \varphi(x)\right) + h\,\zeta_h(x)$, $b(y) = \frac{1}{h}\left(\frac{\|y\|^2}{2} - \varphi^*(y)\right) + h\,\xi_h(y)$, where $\zeta_h, \xi_h$ are $O(1)$.

  18. Idea of the proof. Thus, up to lower order terms, $\hat\mu_h(x,y) \propto \rho_0(x)\exp\left(-\frac{1}{h}g(x-y) + \frac{1}{h}\varphi_c(x) + \frac{1}{h}\varphi^*_c(y)\right) = \rho_0(x)\exp\left(-\frac{1}{h}D[y\mid x^*]\right)$. If $y - x^*$ is large, it gets penalized exponentially. Hence $\hat\mu_h(x,y) \propto \rho_0(x)\exp\left(-\frac{1}{2h}(y-x^*)^T\nabla^2\varphi^*(x^*)(y-x^*)\right)$: a Gaussian transition kernel with mean $x^*$ and covariance $h\left(\nabla^2\varphi^*(x^*)\right)^{-1}$.

  19. Idea of the proof. For $h \approx 0$, the Schrödinger bridge is approximately Gaussian. Sample $X\sim\rho_0$, generate $Y\sim N\left(x^*, h\left(\nabla^2\varphi^*(x^*)\right)^{-1}\right)$: $\hat\mu_h(x,y) \approx \rho_0(x)\,(2\pi h)^{-d/2}\sqrt{\det(\nabla^2\varphi^*(x^*))}\,\exp\left(-\frac{1}{2h}(y-x^*)^T\nabla^2\varphi^*(x^*)(y-x^*)\right)$. $Y$ is not exactly $\rho_1$; there are lower order corrections. Nevertheless, $H(\hat\mu_h\mid\mu_h) = \frac{1}{2}\int\log\det\nabla^2\varphi^*(x^*)\,\rho_0(x)\,dx = \frac{1}{2}\left(\mathrm{Ent}(\rho_1) - \mathrm{Ent}(\rho_0)\right)$.
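
[A sketch of this sampling recipe in 1-d, where for $\rho_0 = N(0,\sigma_0^2)$, $\rho_1 = N(m,\sigma_1^2)$ the Brenier map is $T(x) = m + (\sigma_1/\sigma_0)x$ and $(\varphi^*)'' \equiv \sigma_0/\sigma_1$; the Gaussian pair is editor's illustrative choice:]

```python
import numpy as np

# rho_0 = N(0, s0^2), rho_1 = N(m, s1^2); Brenier map T(x) = m + (s1/s0) x.
s0, m, s1 = 1.0, 1.0, 1.5
hess_phi_star = s0 / s1                # (phi*)'' is constant for Gaussians

rng = np.random.default_rng(0)
h = 0.01
X = rng.normal(0.0, s0, size=200_000)           # X ~ rho_0
x_star = m + (s1 / s0) * X                      # Monge image of X
Y = x_star + np.sqrt(h / hess_phi_star) * rng.normal(size=X.shape)

# Y ~ N(m, s1^2 + h * s1/s0): equals rho_1 only up to O(h), as the slide notes.
print(Y.mean(), Y.std(), np.sqrt(s1**2 + h * s1 / s0))
```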

  20. Divergence based methods. The divergence-based method is distinct from the usual dynamic techniques, which apply only to the quadratic cost: Benamou–Brenier, Otto calculus. See Conforti & Tamanini ’19 for one more term in the expansion for the quadratic cost. Higher order terms should be related to higher order derivatives of the divergence.

  21. The Dirichlet transport

  22. Dirichlet transport, P.–Wong ’16. $\Delta_n$ – the unit simplex $\{(p_1,\ldots,p_n) : p_i > 0,\ \sum_i p_i = 1\}$. $\Delta_n$ is an abelian group with identity $e = (1/n,\ldots,1/n)$: if $p, q \in \Delta_n$, then $(p\odot q)_i = \frac{p_i q_i}{\sum_{j=1}^n p_j q_j}$, $(p^{-1})_i = \frac{1/p_i}{\sum_{j=1}^n 1/p_j}$. K-L divergence, or relative entropy, as “distance”: $H(q\mid p) = \sum_{i=1}^n q_i\log(q_i/p_i)$. Take $X = Y = \Delta_n$ and $c(p,q) = H\left(e \mid p^{-1}\odot q\right) = \log\left(\frac{1}{n}\sum_{i=1}^n\frac{q_i}{p_i}\right) - \frac{1}{n}\sum_{i=1}^n\log\frac{q_i}{p_i} \ge 0$.
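
[These operations translate directly to code; a minimal sketch checking the closed form for $c(p,q)$ and its nonnegativity (which is AM $\ge$ GM applied to the ratios $q_i/p_i$):]

```python
import numpy as np

def mult(p, q):     # group operation: (p . q)_i = p_i q_i / sum_j p_j q_j
    r = p * q
    return r / r.sum()

def inv(p):         # group inverse: (p^{-1})_i = (1/p_i) / sum_j (1/p_j)
    r = 1.0 / p
    return r / r.sum()

def KL(q, p):       # relative entropy H(q | p)
    return np.sum(q * np.log(q / p))

def cost(p, q):     # c(p, q) = H(e | p^{-1} . q)
    e = np.full(len(p), 1.0 / len(p))
    return KL(e, mult(inv(p), q))

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(4)); q = rng.dirichlet(np.ones(4))
closed_form = np.log(np.mean(q / p)) - np.mean(np.log(q / p))
print(cost(p, q), closed_form)      # agree; >= 0 by AM-GM
print(cost(p, p))                   # 0, since p^{-1} . p = e
```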

  23. Exponentially concave functions. $\varphi : \Delta_n \to \mathbb{R}\cup\{-\infty\}$ is exponentially concave if $e^\varphi$ is concave. $x\mapsto\tfrac{1}{2}\log x$ is e-concave, but $x\mapsto 2\log x$ is not. Examples ($p, r \in \Delta_n$, $0 < \lambda < 1$): $\varphi(p) = \frac{1}{n}\sum_i\log p_i$; $\varphi(p) = \log\left(\sum_i r_i p_i\right)$; $\varphi(p) = \frac{1}{\lambda}\log\left(\sum_i p_i^\lambda\right)$. (Fernholz ’02, P. and Wong ’15.) Analog of Brenier’s theorem: if $(p, q = F(p))$ is the Monge solution, then $p^{-1} = \widetilde\nabla\varphi(q)$ for an e-concave Kantorovich potential $\varphi$. Smoothness, MTW: Khan & Zhang ’19.
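
[A quick numeric check of the definition on the two scalar examples, testing concavity of $e^\varphi$ along random segments (editor's sketch):]

```python
import numpy as np

rng = np.random.default_rng(0)

def is_e_concave_1d(phi, trials=10_000):
    """Test concavity of exp(phi) along random segments in (0, 1)."""
    x, y = rng.uniform(0.01, 1.0, (2, trials))
    lam = rng.uniform(0, 1, trials)
    mid = np.exp(phi(lam * x + (1 - lam) * y))
    chord = lam * np.exp(phi(x)) + (1 - lam) * np.exp(phi(y))
    return np.all(mid >= chord - 1e-12)

print(is_e_concave_1d(lambda x: 0.5 * np.log(x)))  # True:  e^phi = sqrt(x), concave
print(is_e_concave_1d(lambda x: 2.0 * np.log(x)))  # False: e^phi = x^2, convex
```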

  24. Back to the Dirichlet transport. What is the corresponding probabilistic picture for the cost function $c(p,q) = H\left(e\mid p^{-1}\odot q\right)$ on the unit simplex $\Delta_n$? Symmetric Dirichlet distribution $\mathrm{Dir}(\lambda)$: density $\propto \prod_{j=1}^n p_j^{\lambda/n - 1}$, a probability distribution on the unit simplex. If $U\sim\mathrm{Dir}(\lambda)$, then $E(U) = e$ and $\mathrm{Var}(U_i) = O\left(\frac{1}{\lambda}\right)$.
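
[A sampling check of these moments with numpy's Dirichlet sampler; per the density above, $\mathrm{Dir}(\lambda)$ has concentration $\lambda/n$ in each coordinate, and the standard moment formulas give $\mathrm{Var}(U_i) = \frac{(1/n)(1-1/n)}{\lambda+1}$:]

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
for lam in [10.0, 100.0, 1000.0]:
    U = rng.dirichlet(np.full(n, lam / n), size=100_000)
    # E(U) = e = (1/n, ..., 1/n); Var(U_i) = (1/n)(1 - 1/n)/(lam + 1) = O(1/lam)
    print(lam, U.mean(axis=0).round(3), U.var(axis=0).mean(),
          (1 / n) * (1 - 1 / n) / (lam + 1))
```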
