SLIDE 1

On the Chi square and higher-order Chi distances for approximating f-divergences

Frank Nielsen¹ and Richard Nock², www.informationgeometry.org

¹ Sony Computer Science Laboratories, Inc.  ² UAG-CEREGMIA

September 2013

© 2013 Frank Nielsen, Sony Computer Science Laboratories, Inc.

SLIDE 2

Statistical divergences

A statistical divergence measures the separability between two distributions. Examples: the Pearson and Neyman χ² divergences, and the Kullback-Leibler divergence:

$$\chi^2_P(X_1 : X_2) = \int \frac{(x_2(x) - x_1(x))^2}{x_1(x)}\, d\nu(x),$$

$$\chi^2_N(X_1 : X_2) = \int \frac{(x_1(x) - x_2(x))^2}{x_2(x)}\, d\nu(x),$$

$$\mathrm{KL}(X_1 : X_2) = \int x_1(x) \log \frac{x_1(x)}{x_2(x)}\, d\nu(x),$$

where x₁ and x₂ denote the densities of X₁ and X₂ with respect to the dominating measure ν.
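As an illustration (not part of the original deck), a minimal sketch of these three divergences for discrete distributions sharing a finite support; the function names are my own:

```python
import numpy as np

def chi2_pearson(x1, x2):
    # Pearson chi^2: sum of (x2 - x1)^2 / x1 over the support
    return np.sum((x2 - x1) ** 2 / x1)

def chi2_neyman(x1, x2):
    # Neyman chi^2: same form with the roles of x1 and x2 swapped
    return np.sum((x1 - x2) ** 2 / x2)

def kl(x1, x2):
    # Kullback-Leibler divergence KL(x1 : x2)
    return np.sum(x1 * np.log(x1 / x2))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
print(chi2_pearson(p, q), chi2_neyman(p, q), kl(p, q))
```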


SLIDE 3

f-divergences: A generic definition

$$I_f(X_1 : X_2) = \int x_1(x)\, f\!\left(\frac{x_2(x)}{x_1(x)}\right) d\nu(x) \ \ge\ 0,$$

where the generator f : (0, ∞) ⊆ dom(f) → [0, ∞] is a convex function with f(1) = 0. Non-negativity follows from Jensen's inequality:

$$I_f(X_1 : X_2) \ \ge\ f\!\left(\int x_2(x)\, d\nu(x)\right) = f(1) = 0.$$

One may further require f′(1) = 0 and fix the scale of the divergence by setting f″(1) = 1. Any f-divergence can be symmetrized: S_f(X₁ : X₂) = I_f(X₁ : X₂) + I_{f*}(X₁ : X₂) with the conjugate generator f*(u) = u f(1/u), since I_{f*}(X₁ : X₂) = I_f(X₂ : X₁).
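Since an f-divergence is fully determined by its generator, the definition codes up once. A sketch under the same discrete setup as above (helper names are mine); Pearson χ² is recovered from f(u) = (u − 1)²:

```python
import numpy as np

def f_divergence(x1, x2, f):
    # I_f(X1 : X2) = sum_x x1(x) f(x2(x)/x1(x)) for discrete distributions
    return np.sum(x1 * f(x2 / x1))

def symmetrized(x1, x2, f):
    # S_f = I_f(X1 : X2) + I_{f*}(X1 : X2), using I_{f*}(X1 : X2) = I_f(X2 : X1)
    return f_divergence(x1, x2, f) + f_divergence(x2, x1, f)

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
print(f_divergence(p, q, lambda u: (u - 1) ** 2))  # Pearson chi^2_P(p : q)
```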


SLIDE 4

f-divergences: Some examples

Name of the f-divergence, formula I_f(P : Q), and generator f(u) with f(1) = 0:

◮ Total variation (metric): ½ ∫ |p(x) − q(x)| dν(x); f(u) = ½ |u − 1|
◮ Squared Hellinger: ∫ (√p(x) − √q(x))² dν(x); f(u) = (√u − 1)²
◮ Pearson χ²_P: ∫ (q(x) − p(x))²/p(x) dν(x); f(u) = (u − 1)²
◮ Neyman χ²_N: ∫ (p(x) − q(x))²/q(x) dν(x); f(u) = (1 − u)²/u
◮ Pearson-Vajda χ^k_P: ∫ (q(x) − p(x))^k/p^{k−1}(x) dν(x); f(u) = (u − 1)^k
◮ Pearson-Vajda |χ|^k_P: ∫ |q(x) − p(x)|^k/p^{k−1}(x) dν(x); f(u) = |u − 1|^k
◮ Kullback-Leibler: ∫ p(x) log(p(x)/q(x)) dν(x); f(u) = −log u
◮ reverse Kullback-Leibler: ∫ q(x) log(q(x)/p(x)) dν(x); f(u) = u log u
◮ α-divergence: (4/(1 − α²)) (1 − ∫ p^{(1−α)/2}(x) q^{(1+α)/2}(x) dν(x)); f(u) = (4/(1 − α²))(1 − u^{(1+α)/2})
◮ Jensen-Shannon: ½ ∫ (p(x) log(2p(x)/(p(x) + q(x))) + q(x) log(2q(x)/(p(x) + q(x)))) dν(x); f(u) = −(u + 1) log((1 + u)/2) + u log u
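The generator column translates directly into code. A small sketch (my own encoding; α is fixed arbitrarily) checking that every generator vanishes at u = 1:

```python
import numpy as np

alpha = 0.5  # any alpha != +/-1 for the alpha-divergence generator
generators = {
    "total variation":   lambda u: 0.5 * np.abs(u - 1),
    "squared Hellinger": lambda u: (np.sqrt(u) - 1) ** 2,
    "Pearson chi^2":     lambda u: (u - 1) ** 2,
    "Neyman chi^2":      lambda u: (1 - u) ** 2 / u,
    "Kullback-Leibler":  lambda u: -np.log(u),
    "reverse KL":        lambda u: u * np.log(u),
    "alpha-divergence":  lambda u: 4 / (1 - alpha ** 2) * (1 - u ** ((1 + alpha) / 2)),
    "Jensen-Shannon":    lambda u: -(u + 1) * np.log((1 + u) / 2) + u * np.log(u),
}

for name, f in generators.items():
    assert abs(f(1.0)) < 1e-12, name  # f(1) = 0 for every generator
```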

SLIDE 5

Stochastic approximations of f-divergences

$$I^{(n)}_f(X_1 : X_2) = \frac{1}{2n} \sum_{i=1}^{n} \left( f\!\left(\frac{x_2(s_i)}{x_1(s_i)}\right) + \frac{x_1(t_i)}{x_2(t_i)}\, f\!\left(\frac{x_2(t_i)}{x_1(t_i)}\right) \right),$$

with s₁, ..., s_n and t₁, ..., t_n i.i.d. samples from X₁ and X₂, respectively. Then

$$\lim_{n \to \infty} I^{(n)}_f(X_1 : X_2) = I_f(X_1 : X_2).$$

◮ Works for any generator f, but...
◮ in practice it is limited to distributions with small-dimensional support.
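A minimal Monte Carlo sketch of this estimator (function and variable names are mine), using the Poisson pair λ₁ = 0.6, λ₂ = 0.3 from slide 14 and the KL generator f(u) = −log u:

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(0)

def stochastic_f_divergence(pdf1, pdf2, sample1, sample2, f, n):
    # averages f(x2/x1) over draws from X1 together with its
    # importance-weighted counterpart over draws from X2, as above
    s = sample1(n)  # s_1, ..., s_n iid from X1
    t = sample2(n)  # t_1, ..., t_n iid from X2
    a = f(pdf2(s) / pdf1(s))
    b = (pdf1(t) / pdf2(t)) * f(pdf2(t) / pdf1(t))
    return (np.sum(a) + np.sum(b)) / (2 * n)

lam1, lam2 = 0.6, 0.3
est = stochastic_f_divergence(
    lambda x: poisson.pmf(x, lam1), lambda x: poisson.pmf(x, lam2),
    lambda n: rng.poisson(lam1, n), lambda n: rng.poisson(lam2, n),
    lambda u: -np.log(u), n=1_000_000)
print(est)  # approaches KL(X1 : X2) ~ 0.1158 (see slide 14)
```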


SLIDE 6

Exponential families

Canonical decomposition of the probability measure:

$$p_\theta(x) = \exp(\langle t(x), \theta \rangle - F(\theta) + k(x)).$$

Here we consider families whose natural parameter space Θ is affine. Two examples:

$$\mathrm{Poi}(\lambda):\ p(x \mid \lambda) = \frac{\lambda^x e^{-\lambda}}{x!}, \quad \lambda > 0,\ x \in \{0, 1, \dots\},$$

$$\mathrm{Nor}_I(\mu):\ p(x \mid \mu) = (2\pi)^{-d/2}\, e^{-\frac{1}{2}(x - \mu)^\top (x - \mu)}, \quad \mu \in \mathbb{R}^d,\ x \in \mathbb{R}^d.$$

◮ Poisson: θ = log λ, Θ = ℝ, F(θ) = e^θ, k(x) = −log x!, t(x) = x, ν = counting measure ν_c.
◮ Iso. Gaussian: θ = µ, Θ = ℝ^d, F(θ) = ½ θ^⊤θ, k(x) = −(d/2) log 2π − ½ x^⊤x, t(x) = x, ν = Lebesgue measure ν_L.
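For concreteness, a sketch of the Poisson row (my own function names), checking the canonical decomposition against the usual pmf:

```python
import numpy as np
from math import lgamma, log

def poisson_F(theta):
    return np.exp(theta)            # log-normalizer F(theta) = e^theta

def poisson_log_density(x, theta):
    # log p(x) = <t(x), theta> - F(theta) + k(x), with t(x) = x, k(x) = -log x!
    return x * theta - poisson_F(theta) - lgamma(x + 1)

lam = 0.6
theta = log(lam)                    # natural parameter theta = log(lambda)
x = 2
direct = x * log(lam) - lam - lgamma(x + 1)   # log of lambda^x e^{-lambda} / x!
print(poisson_log_density(x, theta), direct)  # identical
```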


SLIDE 7

χ² for affine exponential families

Bypassing the integral computation, we obtain closed-form formulas:

$$\chi^2_P(X_1 : X_2) = e^{F(2\theta_2 - \theta_1) - (2F(\theta_2) - F(\theta_1))} - 1,$$

$$\chi^2_N(X_1 : X_2) = e^{F(2\theta_1 - \theta_2) - (2F(\theta_1) - F(\theta_2))} - 1.$$

The Kullback-Leibler divergence amounts to a Bregman divergence [3]:

$$\mathrm{KL}(X_1 : X_2) = B_F(\theta_2 : \theta_1), \qquad B_F(\theta : \theta') = F(\theta) - F(\theta') - (\theta - \theta')^\top \nabla F(\theta').$$
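A sketch for the Poisson family, where F(θ) = e^θ (names are mine), evaluated on the λ₁ = 0.6, λ₂ = 0.3 pair of slide 14:

```python
import numpy as np

def chi2_pearson_closed(F, th1, th2):
    # closed-form Pearson chi^2 for an affine exponential family
    return np.exp(F(2 * th2 - th1) - (2 * F(th2) - F(th1))) - 1

def kl_bregman(F, gradF, th1, th2):
    # KL(X1 : X2) = B_F(theta2 : theta1)
    return F(th2) - F(th1) - (th2 - th1) * gradF(th1)

F = gradF = np.exp                       # Poisson: F(theta) = e^theta
th1, th2 = np.log(0.6), np.log(0.3)
print(chi2_pearson_closed(F, th1, th2))  # chi^2_P(X1 : X2) ~ 0.1618
print(kl_bregman(F, gradF, th1, th2))    # KL(X1 : X2) ~ 0.1158
```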


SLIDE 8

Higher-order Vajda χ^k divergences

$$\chi^k_P(X_1 : X_2) = \int \frac{(x_2(x) - x_1(x))^k}{x_1(x)^{k-1}}\, d\nu(x), \qquad |\chi|^k_P(X_1 : X_2) = \int \frac{|x_2(x) - x_1(x)|^k}{x_1(x)^{k-1}}\, d\nu(x)$$

are f-divergences for the generators (u − 1)^k and |u − 1)^k, respectively.

◮ When k = 1, χ¹_P(X₁ : X₂) = ∫ (x₁(x) − x₂(x)) dν(x) = 0 (never discriminative), and |χ|¹_P(X₁ : X₂) is twice the total variation distance.
◮ χ⁰_P is the unit constant.
◮ χ^k_P is a signed distance: it may be negative for odd k.


SLIDE 9

Higher-order Vajda χ^k divergences

Lemma

The (signed) χ^k_P distance between members X₁ ∼ EF(θ₁) and X₂ ∼ EF(θ₂) of the same affine exponential family is (for k ∈ ℕ) always bounded and equal to

$$\chi^k_P(X_1 : X_2) = \sum_{j=0}^{k} (-1)^{k-j} \binom{k}{j} \frac{e^{F((1-j)\theta_1 + j\theta_2)}}{e^{(1-j)F(\theta_1) + jF(\theta_2)}}.$$
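The lemma is straightforward to evaluate; a sketch (names mine) instantiated for the Poisson family, checking that χ¹_P vanishes and that χ²_P matches the closed form of slide 7:

```python
import numpy as np
from math import comb

def chi_k_expfam(F, th1, th2, k):
    # signed chi^k_P via the binomial sum of the lemma
    return sum((-1) ** (k - j) * comb(k, j)
               * np.exp(F((1 - j) * th1 + j * th2)
                        - ((1 - j) * F(th1) + j * F(th2)))
               for j in range(k + 1))

th1, th2 = np.log(0.6), np.log(0.3)       # Poisson: F(theta) = e^theta
print(chi_k_expfam(np.exp, th1, th2, 1))  # 0: chi^1_P never discriminates
print(chi_k_expfam(np.exp, th1, th2, 2))  # ~ 0.1618, matches slide 7
print(chi_k_expfam(np.exp, th1, th2, 3))  # ~ -0.0305: signed for odd k
```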


SLIDE 10

Higher-order Vajda χ^k divergences: closed-form examples

For Poisson and isotropic Normal distributions we get the closed-form formulas

$$\chi^k_P(\lambda_1 : \lambda_2) = \sum_{j=0}^{k} (-1)^{k-j} \binom{k}{j}\, e^{\lambda_1^{1-j} \lambda_2^{j} - ((1-j)\lambda_1 + j\lambda_2)},$$

$$\chi^k_P(\mu_1 : \mu_2) = \sum_{j=0}^{k} (-1)^{k-j} \binom{k}{j}\, e^{\frac{1}{2} j(j-1) (\mu_1 - \mu_2)^\top (\mu_1 - \mu_2)}.$$

These are signed distances.
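A sketch of the isotropic Gaussian formula (names mine); for k = 2 it reduces to e^{‖µ₁−µ₂‖²} − 1, consistent with slide 7's closed form:

```python
import numpy as np
from math import comb

def chi_k_iso_gaussian(mu1, mu2, k):
    # signed chi^k_P between N(mu1, I) and N(mu2, I)
    d2 = float(np.sum((np.asarray(mu1) - np.asarray(mu2)) ** 2))
    return sum((-1) ** (k - j) * comb(k, j) * np.exp(0.5 * j * (j - 1) * d2)
               for j in range(k + 1))

print(chi_k_iso_gaussian([0.0, 0.0], [1.0, 0.0], 2))  # e^1 - 1 ~ 1.71828
```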


SLIDE 11

f-divergences from Taylor series

Lemma (extends Theorem 1 of [1])

When bounded, the f-divergence I_f can be expressed as a power series of higher-order chi-type distances:

$$I_f(X_1 : X_2) = \int x_1(x) \sum_{i=0}^{\infty} \frac{1}{i!}\, f^{(i)}(\lambda) \left( \frac{x_2(x)}{x_1(x)} - \lambda \right)^{i} d\nu(x) = \sum_{i=0}^{\infty} \frac{f^{(i)}(\lambda)}{i!}\, \chi^i_{\lambda,P}(X_1 : X_2),$$

provided I_f < ∞, where χ^i_{λ,P}(X₁ : X₂) generalizes χ^i_P:

$$\chi^i_{\lambda,P}(X_1 : X_2) = \int \frac{(x_2(x) - \lambda x_1(x))^i}{x_1(x)^{i-1}}\, d\nu(x),$$

with χ⁰_{λ,P}(X₁ : X₂) = 1 by convention. Note that χ^k_{λ,P} − (1 − λ)^k is itself an f-divergence for the generator f(u) = (u − λ)^k − (1 − λ)^k, so that χ^k_{λ,P} ≥ (1 − λ)^k.
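Because the Taylor series is exact for polynomial generators, the lemma can be checked directly. A discrete sketch (names mine) verifying the identity for f(u) = (u − 1)² at λ = 0.5:

```python
import numpy as np

def chi_i_lam(x1, x2, i, lam):
    # generalized chi^i_{lambda,P}; chi^0_{lambda,P} = 1 by convention
    if i == 0:
        return 1.0
    return float(np.sum((x2 - lam * x1) ** i / x1 ** (i - 1)))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
lam = 0.5

# f(u) = (u - 1)^2: f(lam) = (lam - 1)^2, f'(lam) = 2(lam - 1), f''(lam)/2! = 1
series = ((lam - 1) ** 2 * chi_i_lam(p, q, 0, lam)
          + 2 * (lam - 1) * chi_i_lam(p, q, 1, lam)
          + chi_i_lam(p, q, 2, lam))
direct = np.sum(p * (q / p - 1) ** 2)  # Pearson chi^2_P(p : q)
print(series, direct)                  # equal: the series is exact here
```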


SLIDE 12

f-divergences: Analytic formula

◮ For λ = 1 ∈ int(dom(f^{(i)})), truncating the series yields a guaranteed approximation of the f-divergence (Theorem 1 of [1]):

$$\left| I_f(X_1 : X_2) - \sum_{k=0}^{s} \frac{f^{(k)}(1)}{k!}\, \chi^k_P(X_1 : X_2) \right| \le \frac{\|f^{(s+1)}\|_\infty}{(s+1)!}\, (M - m)^{s+1},$$

where ‖f^{(s+1)}‖_∞ = sup_{t ∈ [m,M]} |f^{(s+1)}(t)| and m ≤ p/q ≤ M.

◮ For λ = 0 (whenever 0 ∈ int(dom(f^{(i)}))) and affine exponential families, a simpler expression holds:

$$I_f(X_1 : X_2) = \sum_{i=0}^{\infty} \frac{f^{(i)}(0)}{i!}\, I_{1-i,i}(\theta_1 : \theta_2), \qquad I_{1-i,i}(\theta_1 : \theta_2) = \frac{e^{F(i\theta_2 + (1-i)\theta_1)}}{e^{iF(\theta_2) + (1-i)F(\theta_1)}}.$$
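A sketch of the λ = 0 expression (names mine); for the polynomial generator f(u) = (u − 1)² the series terminates after three terms and recovers the χ²_P closed form of slide 7:

```python
import numpy as np

def I_weight(F, th1, th2, i):
    # I_{1-i, i}(theta1 : theta2) for an affine exponential family
    return np.exp(F(i * th2 + (1 - i) * th1) - (i * F(th2) + (1 - i) * F(th1)))

th1, th2 = np.log(0.6), np.log(0.3)  # Poisson: F(theta) = e^theta
# f(u) = (u - 1)^2: f(0) = 1, f'(0) = -2, f''(0)/2! = 1, higher terms vanish
chi2_P = (I_weight(np.exp, th1, th2, 0) - 2 * I_weight(np.exp, th1, th2, 1)
          + I_weight(np.exp, th1, th2, 2))
print(chi2_P)  # ~ 0.1618, equal to chi^2_P(X1 : X2) from slide 7
```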


SLIDE 13

Corollary: Approximating f-divergences by χ² divergences

Corollary

A second-order Taylor expansion yields

$$I_f(X_1 : X_2) \approx f(1) + f'(1)\, \chi^1_N(X_1 : X_2) + \frac{1}{2} f''(1)\, \chi^2_N(X_1 : X_2).$$

Since f(1) = 0 and χ¹_N(X₁ : X₂) = 0, it follows that

$$I_f(X_1 : X_2) \approx \frac{f''(1)}{2}\, \chi^2_N(X_1 : X_2)$$

(f″(1) > 0 follows from the strict convexity of the generator). When f(u) = u log u, this yields the well-known approximation [2]:

$$\chi^2_P(X_1 : X_2) \approx 2\, \mathrm{KL}(X_1 : X_2).$$
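A quick numeric sanity check of this approximation (my own, combining the closed forms of slides 7 and 14); the agreement tightens as the two distributions approach each other:

```python
import numpy as np

lam1 = 0.6
for lam2 in (0.3, 0.55):
    chi2_P = np.exp(lam2 ** 2 / lam1 + lam1 - 2 * lam2) - 1  # Poisson closed form
    kl = lam2 - lam1 - lam1 * np.log(lam2 / lam1)            # KL(X1 : X2)
    print(lam2, chi2_P, 2 * kl)
# lam2 = 0.3 : 0.162 vs 0.232; lam2 = 0.55: 0.00418 vs 0.00441
```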


SLIDE 14

Kullback-Leibler divergence: Analytic expression

The Kullback-Leibler divergence has generator f(u) = −log u, so f^{(i)}(u) = (−1)^i (i − 1)! u^{−i} and hence f^{(i)}(1)/i! = (−1)^i/i for i ≥ 1 (with f(1) = 0). Since χ¹_{1,P} = 0, it follows that

$$\mathrm{KL}(X_1 : X_2) = \sum_{j=2}^{\infty} \frac{(-1)^j}{j}\, \chi^j_P(X_1 : X_2),$$

an alternating-sign series.

Poisson distributions with λ₁ = 0.6 and λ₂ = 0.3: KL ≈ 0.1158 (exact, using the Bregman divergence); a stochastic evaluation with n = 10⁶ yields KL ≈ 0.1156. KL from the Taylor truncation: 0.0809 (s = 2), 0.0910 (s = 3), 0.1017 (s = 4), 0.1135 (s = 10), 0.1150 (s = 15), etc.
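These truncation values are easy to reproduce; a sketch (names mine) combining the Poisson closed form of slide 10 with the alternating series:

```python
import numpy as np
from math import comb

def chi_k_poisson(l1, l2, k):
    # signed chi^k_P(lambda1 : lambda2), closed form from slide 10
    return sum((-1) ** (k - j) * comb(k, j)
               * np.exp(l1 ** (1 - j) * l2 ** j - ((1 - j) * l1 + j * l2))
               for j in range(k + 1))

def kl_truncated(l1, l2, s):
    # KL ~ sum_{j=2}^{s} (-1)^j / j * chi^j_P
    return sum((-1) ** j / j * chi_k_poisson(l1, l2, j) for j in range(2, s + 1))

for s in (2, 3, 4, 10, 15):
    print(s, kl_truncated(0.6, 0.3, s))
# increases toward KL ~ 0.1158, cf. the truncation values quoted above
```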


SLIDE 15

Contributions

Statistical f-divergences between members of the same exponential family with an affine natural parameter space:

◮ a generic closed-form formula for the Pearson/Neyman χ² and Vajda χ^k-type distances,
◮ analytic expressions of f-divergences using Pearson-Vajda-type distances,
◮ a second-order Taylor approximation for fast estimation of f-divergences.

Java™ package: www.informationgeometry.org/fDivergence/


SLIDE 16

Thank you.

@article{fDivChi-arXiv1309.3029,
  author = "Frank Nielsen and Richard Nock",
  title  = "On the {C}hi square and higher-order {C}hi distances for approximating $f$-divergences",
  year   = "2013",
  eprint = "arXiv/1309.3029"
}

www.informationgeometry.org


SLIDE 17

Bibliographic references I

[1] N. S. Barnett, P. Cerone, S. S. Dragomir, and A. Sofo. Approximating Csiszár f-divergence by the use of Taylor's formula with integral remainder. Mathematical Inequalities & Applications, 5(3):417–434, 2002.

[2] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Wiley-Interscience, New York, NY, USA, 1991.

[3] Frank Nielsen and Sylvain Boltz. The Burbea-Rao and Bhattacharyya centroids. IEEE Transactions on Information Theory, 57(8):5455–5466, August 2011.