
On the Chi square and higher-order Chi distances for approximating f-divergences - PowerPoint PPT Presentation



  1. On the Chi square and higher-order Chi distances for approximating f-divergences
Frank Nielsen¹, Richard Nock²
www.informationgeometry.org
¹ Sony Computer Science Laboratories, Inc. ² UAG-CEREGMIA
September 2013
© 2013 Frank Nielsen, Sony Computer Science Laboratories, Inc.

  2. Statistical divergences
A statistical divergence measures the separability between two distributions. Examples: the Pearson and Neyman $\chi^2$ distances and the Kullback-Leibler divergence:

$$\chi^2_P(X_1:X_2) = \int \frac{(x_2(x)-x_1(x))^2}{x_1(x)}\, d\nu(x),$$

$$\chi^2_N(X_1:X_2) = \int \frac{(x_1(x)-x_2(x))^2}{x_2(x)}\, d\nu(x),$$

$$\mathrm{KL}(X_1:X_2) = \int x_1(x)\log\frac{x_1(x)}{x_2(x)}\, d\nu(x).$$
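To make the definitions concrete, here is a minimal Java sketch (ours, not the deck's fDivergence package) evaluating the three divergences for distributions given as probability arrays over a shared finite support (counting measure $\nu$):

```java
// Minimal sketch: Pearson/Neyman chi-square and Kullback-Leibler
// divergence between two discrete distributions on a common support.
public class Divergences {
    static double chi2Pearson(double[] x1, double[] x2) {
        double s = 0;
        for (int i = 0; i < x1.length; i++) {
            double d = x2[i] - x1[i];
            s += d * d / x1[i];                   // (x2 - x1)^2 / x1
        }
        return s;
    }

    static double chi2Neyman(double[] x1, double[] x2) {
        double s = 0;
        for (int i = 0; i < x1.length; i++) {
            double d = x1[i] - x2[i];
            s += d * d / x2[i];                   // (x1 - x2)^2 / x2
        }
        return s;
    }

    static double kl(double[] x1, double[] x2) {
        double s = 0;
        for (int i = 0; i < x1.length; i++)
            s += x1[i] * Math.log(x1[i] / x2[i]); // x1 log(x1 / x2)
        return s;
    }

    public static void main(String[] args) {
        double[] p = {0.5, 0.3, 0.2}, q = {0.4, 0.4, 0.2};
        System.out.println(chi2Pearson(p, q));
        System.out.println(chi2Neyman(p, q));
        System.out.println(kl(p, q));
    }
}
```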

  3. f-divergences: a generic definition

$$I_f(X_1:X_2) = \int x_1(x)\, f\!\left(\frac{x_2(x)}{x_1(x)}\right) d\nu(x) \ge 0,$$

where $f$ is a convex function $f:(0,\infty)\subseteq \mathrm{dom}(f)\to[0,\infty]$ such that $f(1)=0$.
By Jensen's inequality, $I_f(X_1:X_2) \ge f\left(\int x_2(x)\, d\nu(x)\right) = f(1) = 0$.
One may further require $f'(1)=0$ and fix the scale of the divergence by setting $f''(1)=1$.
Any f-divergence can be symmetrized: $S_f(X_1:X_2) = I_f(X_1:X_2) + I_{f^*}(X_1:X_2)$ with $f^*(u) = u f(1/u)$, since $I_{f^*}(X_1:X_2) = I_f(X_2:X_1)$.
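The generic definition translates directly into code by passing the generator as a function. A minimal Java sketch (ours), again assuming a finite support:

```java
import java.util.function.DoubleUnaryOperator;

// Sketch: generic f-divergence I_f(X1:X2) = sum_x x1(x) f(x2(x)/x1(x))
// on a finite support; the generator is supplied as a lambda.
public class FDivergence {
    static double If(double[] x1, double[] x2, DoubleUnaryOperator f) {
        double s = 0;
        for (int i = 0; i < x1.length; i++)
            s += x1[i] * f.applyAsDouble(x2[i] / x1[i]);
        return s;
    }

    public static void main(String[] args) {
        double[] p = {0.5, 0.3, 0.2}, q = {0.4, 0.4, 0.2};
        DoubleUnaryOperator pearson = u -> (u - 1) * (u - 1); // chi2_P generator
        DoubleUnaryOperator kl = u -> -Math.log(u);           // KL generator
        System.out.println(If(p, q, pearson)); // equals chi2_P(p:q)
        System.out.println(If(p, q, kl));      // equals KL(p:q)
    }
}
```

With $f(u) = (u-1)^2$ this recovers $\chi^2_P$, and with $f(u) = -\log u$ it recovers KL, matching the table on the next slide.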

  4. f-divergences: some examples

| Name of the f-divergence | Formula $I_f(P:Q)$ | Generator $f(u)$ with $f(1)=0$ |
|---|---|---|
| Total variation (metric) | $\frac{1}{2}\int \lvert p(x)-q(x)\rvert\, d\nu(x)$ | $\frac{1}{2}\lvert u-1\rvert$ |
| Squared Hellinger | $\int (\sqrt{p(x)}-\sqrt{q(x)})^2\, d\nu(x)$ | $(\sqrt{u}-1)^2$ |
| Pearson $\chi^2_P$ | $\int \frac{(q(x)-p(x))^2}{p(x)}\, d\nu(x)$ | $(u-1)^2$ |
| Neyman $\chi^2_N$ | $\int \frac{(p(x)-q(x))^2}{q(x)}\, d\nu(x)$ | $\frac{(1-u)^2}{u}$ |
| Pearson-Vajda $\chi^k_P$ | $\int \frac{(q(x)-\lambda p(x))^k}{p^{k-1}(x)}\, d\nu(x)$ | $(u-\lambda)^k$ |
| Pearson-Vajda $\lvert\chi\rvert^k_P$ | $\int \frac{\lvert q(x)-\lambda p(x)\rvert^k}{p^{k-1}(x)}\, d\nu(x)$ | $\lvert u-\lambda\rvert^k$ |
| Kullback-Leibler | $\int p(x)\log\frac{p(x)}{q(x)}\, d\nu(x)$ | $-\log u$ |
| reverse Kullback-Leibler | $\int q(x)\log\frac{q(x)}{p(x)}\, d\nu(x)$ | $u\log u$ |
| $\alpha$-divergence | $\frac{4}{1-\alpha^2}\left(1 - \int p^{\frac{1-\alpha}{2}}(x)\, q^{\frac{1+\alpha}{2}}(x)\, d\nu(x)\right)$ | $\frac{4}{1-\alpha^2}\left(1-u^{\frac{1+\alpha}{2}}\right)$ |
| Jensen-Shannon | $\frac{1}{2}\int \left(p(x)\log\frac{2p(x)}{p(x)+q(x)} + q(x)\log\frac{2q(x)}{p(x)+q(x)}\right) d\nu(x)$ | $\frac{1}{2}\left(u\log u - (u+1)\log\frac{1+u}{2}\right)$ |

  5. Stochastic approximations of f-divergences

$$\hat{I}^{(n)}_f(X_1:X_2) = \frac{1}{2n}\sum_{i=1}^{n}\left[ f\!\left(\frac{x_2(s_i)}{x_1(s_i)}\right) + \frac{x_1(t_i)}{x_2(t_i)}\, f\!\left(\frac{x_2(t_i)}{x_1(t_i)}\right) \right],$$

with $s_1,\dots,s_n$ and $t_1,\dots,t_n$ i.i.d. sampled from $X_1$ and $X_2$, respectively. Then

$$\lim_{n\to\infty} \hat{I}^{(n)}_f(X_1:X_2) = I_f(X_1:X_2).$$

◮ Works for any generator $f$, but...
◮ in practice it is limited to distributions with small-dimensional support.
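A sketch of this two-sample Monte Carlo estimator, using two univariate Gaussians $N(\mu,1)$ as test densities (our choice; any densities that can be both sampled and evaluated would do):

```java
import java.util.Random;
import java.util.function.DoubleUnaryOperator;

// Sketch: two-sample Monte Carlo estimate of I_f, averaging the
// X1-sample estimator and the importance-weighted X2-sample estimator.
public class StochasticIf {
    static double pdf(double x, double mu) {           // N(mu, 1) density
        return Math.exp(-0.5 * (x - mu) * (x - mu)) / Math.sqrt(2 * Math.PI);
    }

    public static void main(String[] args) {
        double mu1 = 0.0, mu2 = 1.0;
        DoubleUnaryOperator f = u -> -Math.log(u);     // KL generator
        Random rng = new Random(42);
        int n = 1_000_000;
        double sum = 0;
        for (int i = 0; i < n; i++) {
            double s = mu1 + rng.nextGaussian();       // s_i ~ X1
            double t = mu2 + rng.nextGaussian();       // t_i ~ X2
            sum += f.applyAsDouble(pdf(s, mu2) / pdf(s, mu1));
            sum += (pdf(t, mu1) / pdf(t, mu2))
                 * f.applyAsDouble(pdf(t, mu2) / pdf(t, mu1));
        }
        // 1/(2n) * sum; should approach KL = (mu1-mu2)^2/2 = 0.5 here
        System.out.println(sum / (2.0 * n));
    }
}
```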

  6. Exponential families
Canonical decomposition of the probability measure:

$$p_\theta(x) = \exp(\langle t(x),\theta\rangle - F(\theta) + k(x)).$$

Here, we consider an affine natural parameter space $\Theta$.

$$\mathrm{Poi}(\lambda):\ p(x\mid\lambda) = \frac{\lambda^x e^{-\lambda}}{x!},\quad \lambda>0,\ x\in\{0,1,\dots\},$$

$$\mathrm{Nor}_I(\mu):\ p(x\mid\mu) = (2\pi)^{-\frac{d}{2}} e^{-\frac{1}{2}(x-\mu)^\top(x-\mu)},\quad \mu\in\mathbb{R}^d,\ x\in\mathbb{R}^d.$$

| Family | $\theta$ | $\Theta$ | $F(\theta)$ | $k(x)$ | $t(x)$ | $\nu$ |
|---|---|---|---|---|---|---|
| Poisson | $\log\lambda$ | $\mathbb{R}$ | $e^\theta$ | $-\log x!$ | $x$ | $\nu_c$ (counting) |
| Iso. Gaussian | $\mu$ | $\mathbb{R}^d$ | $\frac{1}{2}\theta^\top\theta + \frac{d}{2}\log 2\pi$ | $-\frac{1}{2}x^\top x$ | $x$ | $\nu_L$ (Lebesgue) |
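As a quick sanity check (ours, not from the deck), the Poisson row of the table can be verified by evaluating the canonical form against the standard density:

```java
// Sketch: check the canonical decomposition for the Poisson family,
// p_theta(x) = exp(x*theta - F(theta) + k(x)) with theta = log(lambda),
// F(theta) = e^theta, t(x) = x, k(x) = -log x!.
public class PoissonCanonical {
    static double logFactorial(int x) {
        double s = 0;
        for (int i = 2; i <= x; i++) s += Math.log(i);
        return s;
    }

    public static void main(String[] args) {
        double lambda = 0.6;
        int x = 3;
        double theta = Math.log(lambda);
        double canonical = Math.exp(x * theta - Math.exp(theta) - logFactorial(x));
        double direct = Math.pow(lambda, x) * Math.exp(-lambda) / 6.0; // 3! = 6
        System.out.println(canonical + " = " + direct); // same value
    }
}
```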

  7. $\chi^2$ for affine exponential families
Bypassing the integral computation, we obtain closed-form formulas:

$$\chi^2_P(X_1:X_2) = e^{F(2\theta_2-\theta_1) - (2F(\theta_2)-F(\theta_1))} - 1,$$

$$\chi^2_N(X_1:X_2) = e^{F(2\theta_1-\theta_2) - (2F(\theta_1)-F(\theta_2))} - 1.$$

The Kullback-Leibler divergence amounts to a Bregman divergence [3]:

$$\mathrm{KL}(X_1:X_2) = B_F(\theta_2:\theta_1),\qquad B_F(\theta:\theta') = F(\theta) - F(\theta') - (\theta-\theta')^\top\nabla F(\theta').$$
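A sketch of these closed forms for the Poisson family, where $\theta = \log\lambda$ and $F(\theta) = e^\theta$ per the previous slide:

```java
// Sketch: closed-form Pearson/Neyman chi-squares for an affine
// exponential family, instantiated here with the Poisson log-normalizer.
public class ChiSquareExpFam {
    static double F(double theta) { return Math.exp(theta); } // Poisson F

    static double chi2Pearson(double th1, double th2) {
        return Math.exp(F(2 * th2 - th1) - (2 * F(th2) - F(th1))) - 1;
    }

    static double chi2Neyman(double th1, double th2) {
        return Math.exp(F(2 * th1 - th2) - (2 * F(th1) - F(th2))) - 1;
    }

    public static void main(String[] args) {
        double th1 = Math.log(0.6), th2 = Math.log(0.3);
        System.out.println(chi2Pearson(th1, th2)); // no integral needed
        System.out.println(chi2Neyman(th1, th2));
    }
}
```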

  8. Higher-order Vajda $\chi^k$ divergences

$$\chi^k_P(X_1:X_2) = \int \frac{(x_2(x)-x_1(x))^k}{x_1(x)^{k-1}}\, d\nu(x),$$

$$\lvert\chi\rvert^k_P(X_1:X_2) = \int \frac{\lvert x_2(x)-x_1(x)\rvert^k}{x_1(x)^{k-1}}\, d\nu(x)$$

are f-divergences for the generators $(u-1)^k$ and $\lvert u-1\rvert^k$.
◮ When $k=1$, $\chi^1_P(X_1:X_2) = \int (x_2(x)-x_1(x))\, d\nu(x) = 0$ (never discriminative), and $\lvert\chi\rvert^1_P(X_1:X_2)$ is twice the total variation distance.
◮ $\chi^0_P$ is the unit constant.
◮ $\chi^k_P$ is a signed distance.

  9. Higher-order Vajda $\chi^k$ divergences
Lemma. The (signed) $\chi^k_P$ distance between members $X_1 \sim E_F(\theta_1)$ and $X_2 \sim E_F(\theta_2)$ of the same affine exponential family is ($k\in\mathbb{N}$) always bounded and equal to:

$$\chi^k_P(X_1:X_2) = \sum_{j=0}^{k} (-1)^{k-j}\binom{k}{j}\frac{e^{F((1-j)\theta_1 + j\theta_2)}}{e^{(1-j)F(\theta_1)+jF(\theta_2)}}.$$

  10. Higher-order Vajda $\chi^k$ divergences
For Poisson and isotropic normal distributions, we get closed-form formulas:

$$\chi^k_P(\lambda_1:\lambda_2) = \sum_{j=0}^{k}(-1)^{k-j}\binom{k}{j}\, e^{\lambda_1^{1-j}\lambda_2^{j} - ((1-j)\lambda_1 + j\lambda_2)},$$

$$\chi^k_P(\mu_1:\mu_2) = \sum_{j=0}^{k}(-1)^{k-j}\binom{k}{j}\, e^{\frac{1}{2}j(j-1)(\mu_1-\mu_2)^\top(\mu_1-\mu_2)}.$$

These are signed distances.
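A sketch of the Poisson closed form as an alternating binomial sum; the $k=1$ and $k=2$ cases reproduce the properties of slide 8 ($\chi^1_P = 0$) and the $\chi^2_P$ closed form of slide 7:

```java
// Sketch: signed chi^k_P between two Poisson distributions via the
// lemma's alternating binomial sum (closed form, no integration).
public class VajdaChiK {
    static double binomial(int k, int j) {
        double b = 1;
        for (int i = 0; i < j; i++) b = b * (k - i) / (i + 1);
        return b;
    }

    static double chiK(int k, double l1, double l2) {
        double s = 0;
        for (int j = 0; j <= k; j++) {
            double sign = ((k - j) % 2 == 0) ? 1 : -1;   // (-1)^(k-j)
            double expo = Math.pow(l1, 1 - j) * Math.pow(l2, j)
                        - ((1 - j) * l1 + j * l2);       // exponent from slide 10
            s += sign * binomial(k, j) * Math.exp(expo);
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(chiK(1, 0.6, 0.3)); // 0: chi^1_P never discriminates
        System.out.println(chiK(2, 0.6, 0.3)); // matches the chi2_P closed form
    }
}
```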

  11. f-divergences from Taylor series
Lemma (extends Theorem 1 of [1]). When bounded, the f-divergence $I_f$ can be expressed as a power series of higher-order chi-type distances:

$$I_f(X_1:X_2) = \int x_1(x) \sum_{i=0}^{\infty} \frac{1}{i!} f^{(i)}(\lambda) \left(\frac{x_2(x)}{x_1(x)}-\lambda\right)^i d\nu(x) = \sum_{i=0}^{\infty} \frac{1}{i!} f^{(i)}(\lambda)\, \chi^i_{\lambda,P}(X_1:X_2),$$

provided $I_f<\infty$, where $\chi^i_{\lambda,P}$ generalizes $\chi^i_P$:

$$\chi^i_{\lambda,P}(X_1:X_2) = \int \frac{(x_2(x)-\lambda x_1(x))^i}{x_1(x)^{i-1}}\, d\nu(x),$$

with $\chi^0_{\lambda,P}(X_1:X_2) = 1$ by convention. Note that $\chi^k_{\lambda,P} \ge (1-\lambda)^k$, and that $\chi^k_{\lambda,P}(X_1:X_2) - (1-\lambda)^k$ is an f-divergence for the generator $f(u) = (u-\lambda)^k - (1-\lambda)^k$ (so that $f(1)=0$).

  12. f-divergences: analytic formula
◮ For $\lambda = 1 \in \mathrm{int}(\mathrm{dom}(f^{(i)}))$, the truncation error of the f-divergence series is bounded (Theorem 1 of [1]):

$$\left| I_f(X_1:X_2) - \sum_{k=0}^{s} \frac{f^{(k)}(1)}{k!}\,\chi^k_P(X_1:X_2) \right| \le \frac{1}{(s+1)!}\,\|f^{(s+1)}\|_\infty\, (M-m)^{s+1},$$

where $\|f^{(s+1)}\|_\infty = \sup_{t\in[m,M]}|f^{(s+1)}(t)|$ and $m \le \frac{x_2(x)}{x_1(x)} \le M$.
◮ For $\lambda = 0$ (whenever $0 \in \mathrm{int}(\mathrm{dom}(f^{(i)}))$) and affine exponential families, a simpler expression holds:

$$I_f(X_1:X_2) = \sum_{i=0}^{\infty} \frac{f^{(i)}(0)}{i!}\, I_{1-i,i}(\theta_1:\theta_2),\qquad I_{1-i,i}(\theta_1:\theta_2) = \frac{e^{F(i\theta_2+(1-i)\theta_1)}}{e^{iF(\theta_2)+(1-i)F(\theta_1)}}.$$
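For a generator analytic at 0 the $\lambda = 0$ series can even terminate. A sketch (ours) checking it on $f(u) = (u-1)^2$, whose expansion has exactly three non-zero terms and must reproduce the $\chi^2_P$ closed form of slide 7:

```java
// Sketch: the lambda = 0 expansion on f(u) = (u-1)^2 for the Poisson
// family: I_f = I_{1,0} - 2*I_{0,1} + I_{-1,2} = chi2_P.
public class LambdaZeroCheck {
    static double F(double theta) { return Math.exp(theta); } // Poisson F

    static double I(int i, double th1, double th2) {          // I_{1-i,i}
        return Math.exp(F(i * th2 + (1 - i) * th1)
                      - (i * F(th2) + (1 - i) * F(th1)));
    }

    public static void main(String[] args) {
        double th1 = Math.log(0.6), th2 = Math.log(0.3);
        // f(0)/0! = 1, f'(0)/1! = -2, f''(0)/2! = 1 for f(u) = (u-1)^2
        double series = I(0, th1, th2) - 2 * I(1, th1, th2) + I(2, th1, th2);
        double chi2P = Math.exp(F(2 * th2 - th1) - (2 * F(th2) - F(th1))) - 1;
        System.out.println(series); // equal
        System.out.println(chi2P);
    }
}
```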

  13. Corollary: approximating f-divergences by $\chi^2$ divergences
Corollary. A second-order Taylor expansion yields

$$I_f(X_1:X_2) \approx f(1) + f'(1)\,\chi^1_N(X_1:X_2) + \frac{1}{2}f''(1)\,\chi^2_N(X_1:X_2).$$

Since $f(1)=0$ and $\chi^1_N(X_1:X_2)=0$, it follows that

$$I_f(X_1:X_2) \approx \frac{f''(1)}{2}\,\chi^2_N(X_1:X_2)$$

($f''(1)>0$ follows from the strict convexity of the generator). When $f(u)=u\log u$, this yields the well-known approximation [2]:

$$\chi^2_P(X_1:X_2) \approx 2\,\mathrm{KL}(X_1:X_2).$$
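A quick numerical check (ours) of the corollary on two nearby Poisson distributions, using the closed form of slide 7 and the Bregman form of KL:

```java
// Sketch: verify chi2_P ~ 2*KL on two close Poisson distributions.
public class SecondOrderCheck {
    public static void main(String[] args) {
        double l1 = 1.0, l2 = 1.01;
        // KL(Poi(l1):Poi(l2)) = l1*log(l1/l2) + l2 - l1 (Bregman form)
        double kl = l1 * Math.log(l1 / l2) + l2 - l1;
        // chi2_P closed form with F(theta) = e^theta, theta = log(lambda)
        double chi2P = Math.exp(l2 * l2 / l1 - (2 * l2 - l1)) - 1;
        System.out.println(2 * kl); // ~ 9.93e-5
        System.out.println(chi2P);  // ~ 1.00e-4: close for nearby parameters
    }
}
```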

  14. Kullback-Leibler divergence: analytic expression
For the Kullback-Leibler divergence, $f(u) = -\log u$, so $f^{(i)}(u) = (-1)^i (i-1)!\, u^{-i}$ and hence $\frac{f^{(i)}(1)}{i!} = \frac{(-1)^i}{i}$ for $i \ge 1$ (with $f(1)=0$). Since $\chi^1_{1,P} = 0$, it follows that

$$\mathrm{KL}(X_1:X_2) = \sum_{i=2}^{\infty} \frac{(-1)^i}{i}\,\chi^i_P(X_1:X_2)$$

→ an alternating-sign series.
Poisson distributions with $\lambda_1 = 0.6$ and $\lambda_2 = 0.3$: $\mathrm{KL} \approx 0.1158$ (exact, computed via the Bregman divergence); a stochastic evaluation with $n = 10^6$ yields $\widehat{\mathrm{KL}} \approx 0.1156$. The truncated Taylor series gives $0.0809$ ($s=2$), $0.0910$ ($s=3$), $0.1017$ ($s=4$), $0.1135$ ($s=10$), $0.1150$ ($s=15$), etc.
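A sketch reproducing the truncation numbers above by summing the alternating series with the Poisson closed form for $\chi^i_P$ (the values in the comments are the slide's):

```java
// Sketch: truncated series KL ~ sum_{i=2}^{s} ((-1)^i / i) * chi^i_P
// for Poisson(0.6) vs Poisson(0.3), against the exact Bregman value.
public class KLTaylorPoisson {
    static double binomial(int k, int j) {
        double b = 1;
        for (int i = 0; i < j; i++) b = b * (k - i) / (i + 1);
        return b;
    }

    static double chiK(int k, double l1, double l2) { // closed form, slide 10
        double s = 0;
        for (int j = 0; j <= k; j++) {
            double sign = ((k - j) % 2 == 0) ? 1 : -1;
            double expo = Math.pow(l1, 1 - j) * Math.pow(l2, j)
                        - ((1 - j) * l1 + j * l2);
            s += sign * binomial(k, j) * Math.exp(expo);
        }
        return s;
    }

    public static void main(String[] args) {
        double l1 = 0.6, l2 = 0.3;
        double exact = l1 * Math.log(l1 / l2) + l2 - l1; // ~ 0.1158
        double kl = 0;
        for (int i = 2; i <= 15; i++) {
            kl += ((i % 2 == 0) ? 1.0 : -1.0) / i * chiK(i, l1, l2);
            System.out.println("s = " + i + ": " + kl);
            // 0.0809 (s=2), 0.0910 (s=3), 0.1017 (s=4), ..., 0.1150 (s=15)
        }
        System.out.println("exact: " + exact);
    }
}
```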

  15. Contributions
Statistical f-divergences between members of the same exponential family with affine natural space:
◮ a generic closed-form formula for the Pearson/Neyman $\chi^2$ and the Vajda $\chi^k$-type distances;
◮ an analytic expression of f-divergences in terms of Pearson-Vajda-type distances;
◮ a second-order Taylor approximation for fast estimation of f-divergences.
Java™ package: www.informationgeometry.org/fDivergence/

  16. Thank you.

@article{fDivChi-arXiv1309.3029,
  author = "Frank Nielsen and Richard Nock",
  title  = "On the {C}hi square and higher-order {C}hi distances for approximating $f$-divergences",
  year   = "2013",
  eprint = "arXiv/1309.3029"
}

www.informationgeometry.org
