On Data-Processing and Majorization Inequalities for f-Divergences


  1. On Data-Processing and Majorization Inequalities for f-Divergences. Igal Sason, EE Department, Technion - Israel Institute of Technology. IZS 2020, Zurich, Switzerland, February 26-28, 2020.

  2. Introduction: f-Divergences. f-divergences form a general class of divergence measures which are commonly used in information theory, learning theory, and related fields.
     - I. Csiszár, "Eine informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizität von Markoffschen Ketten," Publ. Math. Inst. Hungar. Acad. Sci., vol. 8, pp. 85-108, Jan. 1963.
     - I. Csiszár, "On topological properties of f-divergences," Studia Scientiarum Mathematicarum Hungarica, vol. 2, pp. 329-339, Jan. 1967.
     - I. Csiszár, "A class of measures of informativity of observation channels," Periodica Mathematicarum Hungarica, vol. 2, pp. 191-213, Mar. 1972.
     - S. M. Ali and S. D. Silvey, "A general class of coefficients of divergence of one distribution from another," Journal of the Royal Statistical Society, Series B, vol. 28, no. 1, pp. 131-142, Jan. 1966.

  3. Introduction: this talk is restricted to the discrete setting. $f : (0, \infty) \to \mathbb{R}$ is a convex function with $f(1) = 0$; $P, Q$ are probability mass functions defined on a (finite or countably infinite) set $\mathcal{X}$.
     f-Divergence (definition): the f-divergence from $P$ to $Q$ is given by
     $$D_f(P \| Q) := \sum_{x \in \mathcal{X}} Q(x)\, f\!\left(\frac{P(x)}{Q(x)}\right)$$
     with the conventions $f(0) := \lim_{t \downarrow 0} f(t)$, $0 f\!\left(\tfrac{0}{0}\right) := 0$, and $0 f\!\left(\tfrac{a}{0}\right) := \lim_{t \downarrow 0} t f\!\left(\tfrac{a}{t}\right) = a \lim_{u \to \infty} \frac{f(u)}{u}$ for $a > 0$.
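A quick numerical companion (not part of the slides): for a finite alphabet, the definition and its boundary conventions can be coded directly. The function name and its arguments are illustrative choices, a minimal sketch under the stated conventions rather than a reference implementation.

```python
import numpy as np

def f_divergence(p, q, f, f_at_zero=None, slope_at_inf=None):
    """Compute D_f(P||Q) = sum_x Q(x) f(P(x)/Q(x)) for finite pmfs,
    using the conventions f(0) = lim_{t->0} f(t), 0*f(0/0) = 0,
    and 0*f(a/0) = a * lim_{u->inf} f(u)/u for a > 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    total = 0.0
    for pi, qi in zip(p, q):
        if qi > 0 and pi > 0:
            total += qi * f(pi / qi)
        elif qi > 0 and pi == 0:
            total += qi * f_at_zero          # Q(x) * f(0)
        elif qi == 0 and pi > 0:
            total += pi * slope_at_inf       # a * lim_{u->inf} f(u)/u
        # the case P(x) = Q(x) = 0 contributes 0
    return total

# Example: chi-squared divergence, f(t) = (t-1)^2, with f(0) = 1 and
# lim f(u)/u = +inf (so Q must dominate P for a finite value).
P = [0.0, 0.6, 0.4]
Q = [0.25, 0.25, 0.5]
out = f_divergence(P, Q, lambda t: (t - 1.0) ** 2, f_at_zero=1.0, slope_at_inf=np.inf)
assert np.isclose(out, np.sum((np.array(P) - np.array(Q)) ** 2 / np.array(Q)))
print(out)
```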

  4. Introduction: f-divergences, examples.
     Relative entropy: $f(t) = t \log t$, $t > 0$ $\Rightarrow$ $D_f(P\|Q) = D(P\|Q)$; and $f(t) = -\log t$, $t > 0$ $\Rightarrow$ $D_f(P\|Q) = D(Q\|P)$.
     Total variation (TV) distance: $f(t) = |t - 1|$, $t \ge 0$ $\Rightarrow$ $D_f(P\|Q) = |P - Q| := \sum_{x \in \mathcal{X}} |P(x) - Q(x)|$.
     Chi-squared divergence: $f(t) = (t-1)^2$, $t \ge 0$ $\Rightarrow$ $D_f(P\|Q) = \chi^2(P\|Q) := \sum_{x \in \mathcal{X}} \frac{(P(x) - Q(x))^2}{Q(x)}$.
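These three examples can be cross-checked numerically against their closed forms; the snippet below is an illustrative sketch (the helper name and pmfs are mine, not from the talk), restricted to strictly positive pmfs so no boundary conventions are needed.

```python
import numpy as np

# Two strictly positive pmfs on a 4-letter alphabet.
P = np.array([0.10, 0.20, 0.30, 0.40])
Q = np.array([0.25, 0.25, 0.25, 0.25])

def Df(P, Q, f):
    """D_f(P||Q) = sum_x Q(x) f(P(x)/Q(x)) for strictly positive pmfs."""
    return float(np.sum(Q * f(P / Q)))

kl   = Df(P, Q, lambda t: t * np.log(t))       # f(t) = t log t  -> D(P||Q), in nats
rev  = Df(P, Q, lambda t: -np.log(t))          # f(t) = -log t   -> D(Q||P)
tv   = Df(P, Q, lambda t: np.abs(t - 1.0))     # f(t) = |t - 1|  -> |P - Q|
chi2 = Df(P, Q, lambda t: (t - 1.0) ** 2)      # f(t) = (t-1)^2  -> chi^2(P||Q)

# Cross-check against the closed-form expressions on this slide.
assert np.isclose(kl,   np.sum(P * np.log(P / Q)))
assert np.isclose(rev,  np.sum(Q * np.log(Q / P)))
assert np.isclose(tv,   np.sum(np.abs(P - Q)))
assert np.isclose(chi2, np.sum((P - Q) ** 2 / Q))
print(kl, rev, tv, chi2)
```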

  5. Introduction: f-divergences, examples (cont.).
     $E_\gamma$ divergence (Polyanskiy, Poor and Verdú, IEEE T-IT, 2010): for $\gamma \ge 1$, $E_\gamma(P\|Q) := D_{f_\gamma}(P\|Q)$ with $f_\gamma(t) = (t - \gamma)_+$ for $t > 0$, and $(x)_+ := \max\{x, 0\}$. Since $E_1(P\|Q) = \tfrac{1}{2}|P - Q|$, the $E_\gamma$ divergence generalizes the TV distance. Moreover, $E_\gamma(P\|Q) = \max_{\mathcal{E} \in \mathcal{F}} \bigl[P(\mathcal{E}) - \gamma Q(\mathcal{E})\bigr]$.
     Other important f-divergences: triangular discrimination (Vincze-Le Cam distance '81; Topsøe 2000); Jensen-Shannon divergence (Lin 1991; Topsøe 2000); DeGroot statistical information (DeGroot '62; Liese & Vajda '06), see later; Marton's divergence (Marton 1996; Samson 2000).
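For finite alphabets, $E_\gamma(P\|Q)$ reduces to $\sum_x (P(x) - \gamma Q(x))_+$, which also realizes the maximization over events; a small illustrative sketch (code and pmfs are mine, not from the talk):

```python
import numpy as np

P = np.array([0.10, 0.20, 0.30, 0.40])
Q = np.array([0.25, 0.25, 0.25, 0.25])

def E_gamma(P, Q, gamma):
    """E_gamma(P||Q) = D_{f_gamma}(P||Q) with f_gamma(t) = (t - gamma)_+.
    On a finite alphabet this equals sum_x (P(x) - gamma*Q(x))_+ ,
    i.e. the maximum of P(E) - gamma*Q(E) over events E."""
    return float(np.sum(np.maximum(P - gamma * Q, 0.0)))

# E_1 recovers half of the total variation distance.
assert np.isclose(E_gamma(P, Q, 1.0), 0.5 * np.sum(np.abs(P - Q)))
for gamma in (1.0, 1.5, 2.0):
    print(gamma, E_gamma(P, Q, gamma))
```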

  6. Introduction: data-processing inequality for f-divergences. Let $\mathcal{X}$ and $\mathcal{Y}$ be finite or countably infinite sets; let $P_X$ and $Q_X$ be probability mass functions supported on $\mathcal{X}$; let $W_{Y|X} : \mathcal{X} \to \mathcal{Y}$ be a stochastic transformation; let the output distributions be $P_Y := P_X W_{Y|X}$ and $Q_Y := Q_X W_{Y|X}$; and let $f : (0, \infty) \to \mathbb{R}$ be a convex function with $f(1) = 0$. Then
     $$D_f(P_Y \| Q_Y) \le D_f(P_X \| Q_X).$$
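A numerical sanity check of the data-processing inequality with a randomly drawn channel (an illustrative sketch; the pmfs, alphabet sizes and channel are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Input pmfs on |X| = 4 and a random channel W_{Y|X} with |Y| = 3
# (row x of W is the conditional pmf W(.|x)).
P_X = np.array([0.10, 0.20, 0.30, 0.40])
Q_X = np.array([0.25, 0.25, 0.25, 0.25])
W = rng.dirichlet(np.ones(3), size=4)          # shape (4, 3), rows sum to 1

# Output distributions P_Y = P_X W and Q_Y = Q_X W.
P_Y, Q_Y = P_X @ W, Q_X @ W

def Df(P, Q, f):
    return float(np.sum(Q * f(P / Q)))

# Data-processing inequality, checked for a few convex f with f(1) = 0.
for name, f in [("KL", lambda t: t * np.log(t)),
                ("TV", lambda t: np.abs(t - 1.0)),
                ("chi2", lambda t: (t - 1.0) ** 2)]:
    dx, dy = Df(P_X, Q_X, f), Df(P_Y, Q_Y, f)
    print(name, dy, "<=", dx, dy <= dx + 1e-12)
```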

  7. Introduction: contraction coefficient for f-divergences. Let $Q_X$ be a probability mass function defined on a set $\mathcal{X}$ which is not a point mass, and let $W_{Y|X} : \mathcal{X} \to \mathcal{Y}$ be a stochastic transformation. The contraction coefficient for f-divergences is defined as
     $$\mu_f(Q_X, W_{Y|X}) := \sup_{P_X :\, D_f(P_X \| Q_X) \in (0, \infty)} \frac{D_f(P_Y \| Q_Y)}{D_f(P_X \| Q_X)}.$$
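The supremum defining $\mu_f$ is generally hard to evaluate exactly; a crude Monte Carlo search over input pmfs only yields a lower estimate, but it illustrates the definition. A sketch with an arbitrary $Q_X$ and channel, shown for the chi-squared divergence:

```python
import numpy as np

rng = np.random.default_rng(1)

Q_X = np.array([0.25, 0.25, 0.25, 0.25])          # reference pmf (not a point mass)
W = rng.dirichlet(np.ones(3), size=4)             # a fixed channel W_{Y|X}
Q_Y = Q_X @ W

def chi2(P, Q):
    return float(np.sum((P - Q) ** 2 / Q))

# Random search over input pmfs P_X; the maximum found is only a lower
# estimate of the contraction coefficient mu_{chi^2}(Q_X, W_{Y|X}).
best = 0.0
for _ in range(20000):
    P_X = rng.dirichlet(np.ones(4))
    num, den = chi2(P_X @ W, Q_Y), chi2(P_X, Q_X)
    if den > 1e-12:
        best = max(best, num / den)
print("lower estimate of mu_chi2:", best)
```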

  8. Introduction: strong data-processing inequalities (SDPIs). If $\mu_f(Q_X, W_{Y|X}) < 1$, then
     $$D_f(P_Y \| Q_Y) \le \mu_f(Q_X, W_{Y|X})\, D_f(P_X \| Q_X).$$
     Contraction coefficients for f-divergences play a key role in strong data-processing inequalities: Ahlswede and Gács ('76); Cohen et al. ('93); Raginsky ('16); Polyanskiy and Wu ('16, '17); Makur, Polyanskiy and Wu ('18).

  9. New results: SDPI for f-divergences. Theorem 1 (SDPI for f-divergences): Let
     $$\xi_1 := \inf_{x \in \mathcal{X}} \frac{P_X(x)}{Q_X(x)} \in [0, 1], \qquad \xi_2 := \sup_{x \in \mathcal{X}} \frac{P_X(x)}{Q_X(x)} \in [1, \infty],$$
     and let $c_f := c_f(\xi_1, \xi_2) \ge 0$ and $d_f := d_f(\xi_1, \xi_2) \ge 0$ satisfy
     $$2 c_f \le \frac{f'_+(v) - f'_+(u)}{v - u} \le 2 d_f, \qquad \forall\, u, v \in I,\ u < v,$$
     where $f'_+$ is the right-side derivative of $f$ and $I := [\xi_1, \xi_2] \cap (0, \infty)$. Then
     $$d_f \bigl[\chi^2(P_X \| Q_X) - \chi^2(P_Y \| Q_Y)\bigr] \ge D_f(P_X \| Q_X) - D_f(P_Y \| Q_Y) \ge c_f \bigl[\chi^2(P_X \| Q_X) - \chi^2(P_Y \| Q_Y)\bigr] \ge 0.$$

  10. New results: SDPI for f-divergences. Theorem 1, SDPI (cont.): If $f$ is twice differentiable on $I$, then the best coefficients are given by
     $$c_f = \tfrac{1}{2} \inf_{t \in I(\xi_1, \xi_2)} f''(t), \qquad d_f = \tfrac{1}{2} \sup_{t \in I(\xi_1, \xi_2)} f''(t).$$
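For the relative entropy, $f(t) = t \log t$ gives $f''(t) = 1/t$ (in nats), so $c_f = 1/(2\xi_2)$ and $d_f = 1/(2\xi_1)$; the sandwich of Theorem 1 can then be checked numerically. An illustrative sketch with arbitrary pmfs and a random channel:

```python
import numpy as np

rng = np.random.default_rng(2)

P_X = np.array([0.10, 0.20, 0.30, 0.40])
Q_X = np.array([0.25, 0.25, 0.25, 0.25])
W = rng.dirichlet(np.ones(3), size=4)
P_Y, Q_Y = P_X @ W, Q_X @ W

def kl(P, Q):
    return float(np.sum(P * np.log(P / Q)))        # nats

def chi2(P, Q):
    return float(np.sum((P - Q) ** 2 / Q))

# f(t) = t log t has f''(t) = 1/t, so on I = [xi1, xi2] the best coefficients
# are c_f = 1/(2*xi2) and d_f = 1/(2*xi1) (in nats).
ratios = P_X / Q_X
xi1, xi2 = ratios.min(), ratios.max()
c_f, d_f = 1.0 / (2.0 * xi2), 1.0 / (2.0 * xi1)

dD = kl(P_X, Q_X) - kl(P_Y, Q_Y)
dChi = chi2(P_X, Q_X) - chi2(P_Y, Q_Y)
print(c_f * dChi, "<=", dD, "<=", d_f * dChi)
assert c_f * dChi - 1e-12 <= dD <= d_f * dChi + 1e-12
```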

  11. New results: SDPI for f-divergences. Theorem 1, SDPI (cont.): with the best coefficients as on the previous slide, this SDPI is locally tight. Let
     $$\lim_{n \to \infty} \inf_{x \in \mathcal{X}} \frac{P_X^{(n)}(x)}{Q_X(x)} = 1, \qquad \lim_{n \to \infty} \sup_{x \in \mathcal{X}} \frac{P_X^{(n)}(x)}{Q_X(x)} = 1.$$
     If $f$ has a continuous second derivative at unity, then
     $$\lim_{n \to \infty} \frac{D_f(P_X^{(n)} \| Q_X) - D_f(P_Y^{(n)} \| Q_Y)}{\chi^2(P_X^{(n)} \| Q_X) - \chi^2(P_Y^{(n)} \| Q_Y)} = \tfrac{1}{2} f''(1).$$

  12. New results: SDPI for f-divergences. Advantage: tensorization of the chi-squared divergence,
     $$\chi^2(P_1 \times \cdots \times P_m \,\|\, Q_1 \times \cdots \times Q_m) = \prod_{i=1}^{m} \bigl(1 + \chi^2(P_i \| Q_i)\bigr) - 1.$$
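The tensorization identity is easy to verify numerically for product pmfs; an illustrative sketch (random per-coordinate pairs, then the flattened product distributions):

```python
import numpy as np

rng = np.random.default_rng(3)

def chi2(P, Q):
    return float(np.sum((P - Q) ** 2 / Q))

# m independent coordinates, each with its own pair (P_i, Q_i).
m = 4
Ps = [rng.dirichlet(np.ones(3)) for _ in range(m)]
Qs = [rng.dirichlet(np.ones(3)) for _ in range(m)]

# Product pmfs P_1 x ... x P_m and Q_1 x ... x Q_m, flattened to vectors.
P_prod, Q_prod = np.array([1.0]), np.array([1.0])
for P_i, Q_i in zip(Ps, Qs):
    P_prod = np.outer(P_prod, P_i).ravel()
    Q_prod = np.outer(Q_prod, Q_i).ravel()

lhs = chi2(P_prod, Q_prod)
rhs = np.prod([1.0 + chi2(P_i, Q_i) for P_i, Q_i in zip(Ps, Qs)]) - 1.0
print(lhs, rhs)
assert np.isclose(lhs, rhs)
```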

  13. New results: SDPI for f-divergences. Theorem 2 (SDPI for f-divergences): Let $f : (0, \infty) \to \mathbb{R}$ satisfy the conditions: $f$ is a convex function, differentiable at 1, with $f(1) = 0$ and $f(0) := \lim_{t \to 0^+} f(t) < \infty$; and the function $g : (0, \infty) \to \mathbb{R}$, defined by $g(t) := \frac{f(t) - f(0)}{t}$ for all $t > 0$, is convex. Let
     $$\kappa(\xi_1, \xi_2) := \sup_{t \in (\xi_1, 1) \cup (1, \xi_2)} \frac{f(t) + f'(1)(1 - t)}{(t - 1)^2}.$$
     Then
     $$\frac{D_f(P_Y \| Q_Y)}{D_f(P_X \| Q_X)} \le \frac{\kappa(\xi_1, \xi_2)}{f(0) + f'(1)} \cdot \frac{\chi^2(P_Y \| Q_Y)}{\chi^2(P_X \| Q_X)}.$$
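A numerical check of the ratio bound as stated above, for one admissible $f$: the choice $f(t) = t^3 - 1$ is my own illustrative example (convex, $f(1) = 0$, $f(0) = -1$ finite, and $g(t) = t^2$ convex, so it meets the listed conditions), and $\kappa$ is evaluated by a grid search. This is a sketch under that reading of the theorem, not the talk's own experiment.

```python
import numpy as np

rng = np.random.default_rng(4)

# f(t) = t^3 - 1: convex, f(1) = 0, f(0) = -1 finite, f'(1) = 3, g(t) = t^2 convex.
f = lambda t: t ** 3 - 1.0
f0, fp1 = -1.0, 3.0

def Df(P, Q):
    return float(np.sum(Q * f(P / Q)))

def chi2(P, Q):
    return float(np.sum((P - Q) ** 2 / Q))

P_X = np.array([0.10, 0.20, 0.30, 0.40])
Q_X = np.array([0.25, 0.25, 0.25, 0.25])
W = rng.dirichlet(np.ones(3), size=4)
P_Y, Q_Y = P_X @ W, Q_X @ W

# kappa(xi1, xi2): grid estimate of sup of [f(t) + f'(1)(1-t)] / (t-1)^2 on (xi1,1) u (1,xi2).
ratios = P_X / Q_X
xi1, xi2 = ratios.min(), ratios.max()
ts = np.concatenate([np.linspace(xi1, 1.0, 2001)[:-1], np.linspace(1.0, xi2, 2001)[1:]])
kappa = np.max((f(ts) + fp1 * (1.0 - ts)) / (ts - 1.0) ** 2)

lhs = Df(P_Y, Q_Y) / Df(P_X, Q_X)
rhs = kappa / (f0 + fp1) * chi2(P_Y, Q_Y) / chi2(P_X, Q_X)
print(lhs, "<=", rhs, lhs <= rhs + 1e-9)
```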

  14. New results: SDPI for f-divergences. Numerical results: the tightness of the bounds (SDPI inequalities) in Theorems 1 and 2 was exemplified numerically for transmission over a binary erasure channel (BEC) and a binary symmetric channel (BSC).

  15. Application: list decoding error bounds. List decoding: the decision rule outputs a list of choices. The extension of Fano's inequality to list decoding, expressed in terms of $H(X|Y)$, was initiated by Ahlswede, Gács and Körner ('76). It is useful for proving converse results (jointly with the blowing-up lemma).
     Generalized Fano's inequality for fixed list size:
     $$H(X|Y) \le \log M - d\Bigl(P_{\mathcal{L}} \,\Big\|\, 1 - \frac{L}{M}\Bigr),$$
     where $d(\cdot \| \cdot)$ denotes the binary relative entropy:
     $$d(x \| y) := x \log\frac{x}{y} + (1 - x) \log\frac{1 - x}{1 - y}, \qquad x, y \in (0, 1).$$
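Evaluating the fixed-list-size bound only needs the binary relative entropy; a small illustrative sketch (in bits, with arbitrary $M$, $L$ and error probabilities of my own choosing):

```python
import numpy as np

def d_binary(x, y):
    """Binary relative entropy d(x||y) = x log(x/y) + (1-x) log((1-x)/(1-y)), in bits."""
    total = 0.0
    for a, b in [(x, y), (1.0 - x, 1.0 - y)]:
        if a > 0.0:
            total += a * np.log2(a / b)
    return total

# Generalized Fano bound on H(X|Y) for fixed list size L out of M messages,
# as a function of the list decoding error probability P_L.
M, L = 16, 2
for P_L in (0.0, 0.05, 0.1, 0.25, 0.5):
    bound = np.log2(M) - d_binary(P_L, 1.0 - L / M)
    print(f"P_L = {P_L:4.2f}:  H(X|Y) <= {bound:.4f} bits")
```

At $P_L = 0$ the bound reduces to $\log L$, as expected for an always-correct list of size $L$.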

  16. List decoding error bounds. Theorem 3 (tightened bound by the strong DPI): Let $P_{XY}$ be a probability measure defined on $\mathcal{X} \times \mathcal{Y}$ with $|\mathcal{X}| = M$. Consider a decision rule $\mathcal{L} : \mathcal{Y} \to \binom{\mathcal{X}}{L}$, where $\binom{\mathcal{X}}{L}$ stands for the set of subsets of $\mathcal{X}$ of cardinality $L$, and $L < M$ is fixed. Denote the list decoding error probability by $P_{\mathcal{L}} := \mathbb{P}\bigl[X \notin \mathcal{L}(Y)\bigr]$. If the $L$ most probable elements of $\mathcal{X}$ are selected, given $Y \in \mathcal{Y}$, then
     $$H(X|Y) \le \log M - d\Bigl(P_{\mathcal{L}} \,\Big\|\, 1 - \frac{L}{M}\Bigr) - \frac{\log e}{2 \sup_{(x,y) \in \mathcal{X} \times \mathcal{Y}} P_{X|Y}(x|y)} \left[\mathbb{E}\bigl[P_{X|Y}(X|Y)\bigr] - \frac{1 - P_{\mathcal{L}}}{L}\right].$$
     Proof: use Theorem 1 (the first SDPI) with $f(t) = t \log t$, $t > 0$, with $P_{X|Y=y}$, with $Q_{X|Y=y}$ equiprobable over $\{1, \ldots, M\}$, and with $W_{Z|X, Y=y}$ equal to 1 or 0 according to whether $X \in \mathcal{L}(y)$ or $X \notin \mathcal{L}(y)$; then average over $Y$. Numerical experimentation exemplifies this improvement over the generalized Fano inequality.
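A toy comparison of the generalized Fano bound with the bound of Theorem 3 as written above (the joint pmf and the base-2 units are my own illustrative choices; the subtracted term follows the reconstruction shown on this slide):

```python
import numpy as np

def d_binary(x, y):
    """Binary relative entropy d(x||y), in bits."""
    total = 0.0
    for a, b in [(x, y), (1.0 - x, 1.0 - y)]:
        if a > 0.0:
            total += a * np.log2(a / b)
    return total

# A small joint pmf P_XY with |X| = M = 4 (rows) and |Y| = 3 (columns).
P_XY = np.array([[0.20, 0.02, 0.03],
                 [0.05, 0.25, 0.05],
                 [0.03, 0.02, 0.15],
                 [0.02, 0.01, 0.17]])
M, L = 4, 2
P_Y = P_XY.sum(axis=0)
P_X_given_Y = P_XY / P_Y                              # column y holds P_{X|Y=y}

# Top-L list decoder and its error probability.
lists = [np.argsort(P_X_given_Y[:, y])[-L:] for y in range(P_XY.shape[1])]
P_err = 1.0 - sum(P_XY[lists[y], y].sum() for y in range(P_XY.shape[1]))

H_X_given_Y = -np.sum(P_XY * np.log2(P_X_given_Y))
fano = np.log2(M) - d_binary(P_err, 1.0 - L / M)

# Extra term of Theorem 3: E[P_{X|Y}(X|Y)], sup_{x,y} P_{X|Y}(x|y), and (1-P_L)/L.
E_post = float(np.sum(P_XY * P_X_given_Y))            # E[P_{X|Y}(X|Y)]
sup_post = float(P_X_given_Y.max())
correction = (np.log2(np.e) / (2.0 * sup_post)) * (E_post - (1.0 - P_err) / L)

print("H(X|Y)                 =", H_X_given_Y)
print("generalized Fano bound =", fano)
print("bound of Theorem 3     =", fano - correction)
```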

  17. List decoding error bounds. Generalized Fano's inequality for variable list size (1975): Let $P_{XY}$ be a probability measure defined on $\mathcal{X} \times \mathcal{Y}$ with $|\mathcal{X}| = M$, and consider a decision rule $\mathcal{L} : \mathcal{Y} \to 2^{\mathcal{X}}$. Let the (average) list decoding error probability be given by $P_{\mathcal{L}} := \mathbb{P}\bigl[X \notin \mathcal{L}(Y)\bigr]$, with $|\mathcal{L}(y)| \ge 1$ for all $y \in \mathcal{Y}$. Then
     $$H(X|Y) \le h(P_{\mathcal{L}}) + \mathbb{E}\bigl[\log |\mathcal{L}(Y)|\bigr] + P_{\mathcal{L}} \log M.$$
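A sketch checking the variable-list-size inequality for a threshold-based list rule (the joint pmf and the threshold are arbitrary illustrative choices, not from the talk):

```python
import numpy as np

P_XY = np.array([[0.20, 0.02, 0.03],
                 [0.05, 0.25, 0.05],
                 [0.03, 0.02, 0.15],
                 [0.02, 0.01, 0.17]])
M = 4
P_Y = P_XY.sum(axis=0)
P_X_given_Y = P_XY / P_Y

# A variable-size list rule: keep, for each y, the symbols whose posterior
# exceeds a threshold (at least one symbol is always kept).
tau = 0.15
lists = []
for y in range(P_XY.shape[1]):
    keep = np.where(P_X_given_Y[:, y] >= tau)[0]
    lists.append(keep if keep.size > 0 else np.array([np.argmax(P_X_given_Y[:, y])]))

P_err = 1.0 - sum(P_XY[lists[y], y].sum() for y in range(P_XY.shape[1]))
E_log_list = sum(P_Y[y] * np.log2(len(lists[y])) for y in range(P_XY.shape[1]))

def h2(p):
    """Binary entropy, in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * np.log2(p) - (1 - p) * np.log2(1 - p)

H_X_given_Y = -np.sum(P_XY * np.log2(P_X_given_Y))
bound = h2(P_err) + E_log_list + P_err * np.log2(M)
print(H_X_given_Y, "<=", bound)
```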
