SLIDE 1

On Data-Processing and Majorization Inequalities for f-Divergences

Igal Sason, EE Department, Technion - Israel Institute of Technology

IZS 2020, Zurich, Switzerland, February 26-28, 2020

SLIDE 2

Introduction

f-Divergences

f-divergences form a general class of divergence measures which are commonly used in information theory, learning theory and related fields.

  • I. Csiszár, "Eine informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizität von Markoffschen Ketten" (an information-theoretic inequality and its application to the proof of the ergodicity of Markov chains), Publ. Math. Inst. Hungar. Acad. Sci., vol. 8, pp. 85–108, Jan. 1963.
  • I. Csiszár, "On topological properties of f-divergences," Studia Scientiarum Mathematicarum Hungarica, vol. 2, pp. 329–339, Jan. 1967.
  • I. Csiszár, "A class of measures of informativity of observation channels," Periodica Mathematica Hungarica, vol. 2, pp. 191–213, Mar. 1972.
  • S. M. Ali and S. D. Silvey, "A general class of coefficients of divergence of one distribution from another," Journal of the Royal Statistical Society, Series B, vol. 28, no. 1, pp. 131–142, Jan. 1966.

SLIDE 3

Introduction

This Talk is Restricted to the Discrete Setting

f : (0, ∞) → R is a convex function with f(1) = 0; P, Q are probability mass functions defined on a (finite or countably infinite) set X.

f-Divergence: Definition

The f-divergence from P to Q is given by

$$D_f(P\|Q) := \sum_{x \in \mathcal{X}} Q(x)\, f\!\left(\frac{P(x)}{Q(x)}\right),$$

with the conventions

$$f(0) := \lim_{t \downarrow 0} f(t), \qquad 0\, f\!\left(\tfrac{0}{0}\right) := 0, \qquad 0\, f\!\left(\tfrac{a}{0}\right) := \lim_{t \downarrow 0} t\, f\!\left(\tfrac{a}{t}\right) = a \lim_{u \to \infty} \frac{f(u)}{u}, \quad a > 0.$$
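
A minimal numerical sketch of this definition for finite alphabets (illustrative code, not from the paper; the function name and the slope_at_inf argument are assumptions used to encode the conventions above):

```python
import numpy as np

def f_divergence(p, q, f, slope_at_inf=np.inf):
    """D_f(P||Q) = sum_x Q(x) f(P(x)/Q(x)) for finite pmfs p, q.

    Conventions from the definition above:
      0 * f(0/0) := 0, and
      0 * f(a/0) := a * lim_{u->inf} f(u)/u for a > 0
      (pass that limit via slope_at_inf; it may be numpy.inf).
    """
    total = 0.0
    for pi, qi in zip(np.asarray(p, float), np.asarray(q, float)):
        if qi > 0:
            total += qi * f(pi / qi)
        elif pi > 0:            # Q(x) = 0 < P(x)
            total += pi * slope_at_inf
        # pi == qi == 0 contributes nothing
    return total

# Example: relative entropy D(P||Q) in nats, i.e. f(t) = t*ln(t) with f(0) = 0.
P, Q = [0.5, 0.3, 0.2], [0.25, 0.25, 0.5]
kl = f_divergence(P, Q, lambda t: t * np.log(t) if t > 0 else 0.0)
print(kl)
```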

SLIDE 4

Introduction

f-divergences: Examples

  • Relative entropy: $f(t) = t \log t$, $t > 0$ $\Rightarrow$ $D_f(P\|Q) = D(P\|Q)$; and $f(t) = -\log t$, $t > 0$ $\Rightarrow$ $D_f(P\|Q) = D(Q\|P)$.
  • Total variation (TV) distance: $f(t) = |t - 1|$, $t \geq 0$ $\Rightarrow$ $D_f(P\|Q) = |P - Q| := \sum_{x \in \mathcal{X}} |P(x) - Q(x)|$.
  • Chi-squared divergence: $f(t) = (t - 1)^2$, $t \geq 0$ $\Rightarrow$ $D_f(P\|Q) = \chi^2(P\|Q) := \sum_{x \in \mathcal{X}} \frac{(P(x) - Q(x))^2}{Q(x)}$.
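
A quick self-contained check of these three examples on a toy pair of strictly positive pmfs (a sketch; the alphabet and the pmf values are arbitrary, and natural logarithms are used for the relative entropy):

```python
import numpy as np

P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.25, 0.25, 0.5])

# Relative entropy D(P||Q) with f(t) = t log t (natural log here).
kl = np.sum(Q * (P / Q) * np.log(P / Q))      # = sum_x P(x) log(P(x)/Q(x))

# Total variation |P - Q| with f(t) = |t - 1|.
tv = np.sum(Q * np.abs(P / Q - 1.0))          # = sum_x |P(x) - Q(x)|

# Chi-squared divergence with f(t) = (t - 1)^2.
chi2 = np.sum(Q * (P / Q - 1.0) ** 2)         # = sum_x (P(x) - Q(x))^2 / Q(x)

print(kl, tv, chi2)
```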

SLIDE 5

Introduction

f-divergences: Examples (cont.)

$E_\gamma$ divergence (Polyanskiy, Poor and Verdú, IEEE T-IT, 2010)

For $\gamma \geq 1$,

$$E_\gamma(P\|Q) := D_{f_\gamma}(P\|Q) \qquad (1)$$

with $f_\gamma(t) = (t - \gamma)^+$ for $t > 0$, and $(x)^+ := \max\{x, 0\}$.

  • $E_1(P\|Q) = \tfrac{1}{2}\,|P - Q|$, so the $E_\gamma$ divergence generalizes the TV distance.
  • $E_\gamma(P\|Q) = \max_{\mathcal{E} \subseteq \mathcal{X}} \bigl( P(\mathcal{E}) - \gamma\, Q(\mathcal{E}) \bigr)$.
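
A small sketch contrasting the two expressions for E_gamma on a finite alphabet: the sum of positive parts implied by the definition, and a brute-force maximization over all events (illustrative code; the pmfs are arbitrary):

```python
import numpy as np
from itertools import combinations

def E_gamma(p, q, gamma):
    """E_gamma(P||Q) = sum_x (P(x) - gamma*Q(x))^+ for finite pmfs."""
    return float(np.sum(np.maximum(np.asarray(p) - gamma * np.asarray(q), 0.0)))

def E_gamma_via_events(p, q, gamma):
    """Same quantity, computed as max over all events E of P(E) - gamma*Q(E)."""
    best = 0.0                               # the empty event gives 0
    for r in range(1, len(p) + 1):
        for event in combinations(range(len(p)), r):
            best = max(best, sum(p[i] for i in event) - gamma * sum(q[i] for i in event))
    return best

P, Q = [0.5, 0.3, 0.2], [0.25, 0.25, 0.5]
assert np.isclose(E_gamma(P, Q, 1.25), E_gamma_via_events(P, Q, 1.25))
# For gamma = 1, E_1(P||Q) equals half the total variation distance.
assert np.isclose(E_gamma(P, Q, 1.0), 0.5 * np.sum(np.abs(np.array(P) - np.array(Q))))
```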

Other important f-divergences:

  • Triangular discrimination (Vincze-Le Cam distance, 1981; Topsøe, 2000);
  • Jensen-Shannon divergence (Lin, 1991; Topsøe, 2000);
  • DeGroot statistical information (DeGroot, 1962; Liese and Vajda, 2006); see later;
  • Marton's divergence (Marton, 1996; Samson, 2000).

SLIDE 6

Introduction

Data-Processing Inequality for f-Divergences

Let

  • $\mathcal{X}$ and $\mathcal{Y}$ be finite or countably infinite sets;
  • $P_X$ and $Q_X$ be probability mass functions supported on $\mathcal{X}$;
  • $W_{Y|X} \colon \mathcal{X} \to \mathcal{Y}$ be a stochastic transformation;
  • $P_Y := P_X W_{Y|X}$ and $Q_Y := Q_X W_{Y|X}$ be the output distributions;
  • $f \colon (0, \infty) \to \mathbb{R}$ be a convex function with $f(1) = 0$.

Then

$$D_f(P_Y \| Q_Y) \leq D_f(P_X \| Q_X).$$
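
A numerical illustration of the data-processing inequality with f(t) = t log t (a sketch; the channel and the input distributions are random and have no special meaning):

```python
import numpy as np

rng = np.random.default_rng(0)

def kl(p, q):
    """Relative entropy D(P||Q) in nats, for strictly positive pmfs."""
    return float(np.sum(p * np.log(p / q)))

nx, ny = 4, 3
P_X = rng.dirichlet(np.ones(nx))
Q_X = rng.dirichlet(np.ones(nx))
W = rng.dirichlet(np.ones(ny), size=nx)       # W[x, y] = W_{Y|X}(y|x), rows are pmfs

P_Y = P_X @ W                                 # output distributions P_Y = P_X W_{Y|X}
Q_Y = Q_X @ W

assert kl(P_Y, Q_Y) <= kl(P_X, Q_X) + 1e-12   # D_f(P_Y||Q_Y) <= D_f(P_X||Q_X)
```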

SLIDE 7

Introduction

Contraction Coefficient for f-Divergences

Let $Q_X$ be a probability mass function on a set $\mathcal{X}$ that is not a point mass, and let $W_{Y|X} \colon \mathcal{X} \to \mathcal{Y}$ be a stochastic transformation. The contraction coefficient for f-divergences is defined as

$$\mu_f(Q_X, W_{Y|X}) := \sup_{P_X \colon D_f(P_X \| Q_X) \in (0, \infty)} \frac{D_f(P_Y \| Q_Y)}{D_f(P_X \| Q_X)}.$$

SLIDE 8

Introduction

Strong Data Processing Inequalities (SDPI)

If $\mu_f(Q_X, W_{Y|X}) < 1$, then

$$D_f(P_Y \| Q_Y) \leq \mu_f(Q_X, W_{Y|X})\, D_f(P_X \| Q_X).$$

Contraction coefficients for f-divergences play a key role in strong data-processing inequalities: Ahlswede and Gács ('76); Cohen et al. ('93); Raginsky ('16); Polyanskiy and Wu ('16, '17); Makur, Polyanskiy and Wu ('18).
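
A crude Monte Carlo sketch that estimates the contraction coefficient from below by random search over input pmfs (illustrative only; the true supremum generally requires a dedicated optimization, and the channel here is an arbitrary random one):

```python
import numpy as np

rng = np.random.default_rng(1)

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

nx, ny = 3, 3
Q_X = np.array([0.2, 0.3, 0.5])
W = rng.dirichlet(np.ones(ny), size=nx)        # a fixed random channel W_{Y|X}
Q_Y = Q_X @ W

best_ratio = 0.0
for _ in range(20000):
    P_X = rng.dirichlet(np.ones(nx))
    num, den = kl(P_X @ W, Q_Y), kl(P_X, Q_X)
    if den > 1e-12:
        best_ratio = max(best_ratio, num / den)

# best_ratio is a lower estimate of mu_f(Q_X, W_{Y|X}) for f(t) = t*log(t);
# by the SDPI, every input P_X satisfies
# D(P_Y||Q_Y) <= mu_f(Q_X, W_{Y|X}) * D(P_X||Q_X).
print(best_ratio)
```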

SLIDE 9

New Results: SDPI for f-divergences

Theorem 1: SDPI for f-divergences

Let

$$\xi_1 := \inf_{x \in \mathcal{X}} \frac{P_X(x)}{Q_X(x)} \in [0, 1], \qquad \xi_2 := \sup_{x \in \mathcal{X}} \frac{P_X(x)}{Q_X(x)} \in [1, \infty],$$

and let $c_f := c_f(\xi_1, \xi_2) \geq 0$ and $d_f := d_f(\xi_1, \xi_2) \geq 0$ satisfy

$$2 c_f \leq \frac{f'_+(v) - f'_+(u)}{v - u} \leq 2 d_f, \qquad \forall\, u, v \in I, \; u < v,$$

where $f'_+$ is the right-side derivative of $f$ and $I := [\xi_1, \xi_2] \cap (0, \infty)$. Then

$$d_f \bigl[\chi^2(P_X \| Q_X) - \chi^2(P_Y \| Q_Y)\bigr] \;\geq\; D_f(P_X \| Q_X) - D_f(P_Y \| Q_Y) \;\geq\; c_f \bigl[\chi^2(P_X \| Q_X) - \chi^2(P_Y \| Q_Y)\bigr] \;\geq\; 0.$$
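
A sanity check of the two-sided bound in Theorem 1 for f(t) = t log t, where f''(t) = 1/t and hence, by the expression for the best coefficients on the next slide, c_f = 1/(2 xi_2) and d_f = 1/(2 xi_1) (a sketch with random strictly positive pmfs, so that xi_1 > 0; natural logarithms throughout):

```python
import numpy as np

rng = np.random.default_rng(2)

def kl(p, q):   return float(np.sum(p * np.log(p / q)))    # nats
def chi2(p, q): return float(np.sum((p - q) ** 2 / q))

nx, ny = 4, 3
P_X = rng.dirichlet(np.ones(nx))
Q_X = rng.dirichlet(np.ones(nx))
W = rng.dirichlet(np.ones(ny), size=nx)
P_Y, Q_Y = P_X @ W, Q_X @ W

xi1, xi2 = np.min(P_X / Q_X), np.max(P_X / Q_X)
# For f(t) = t*ln(t): f''(t) = 1/t, so on I = [xi1, xi2] the best coefficients are
c_f, d_f = 1.0 / (2.0 * xi2), 1.0 / (2.0 * xi1)

delta_D    = kl(P_X, Q_X) - kl(P_Y, Q_Y)
delta_chi2 = chi2(P_X, Q_X) - chi2(P_Y, Q_Y)

assert d_f * delta_chi2 + 1e-12 >= delta_D >= c_f * delta_chi2 - 1e-12 >= -1e-12
```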
SLIDE 10

New Results: SDPI for f-divergences

Theorem 1: SDPI (Cont.)

If $f$ is twice differentiable on $I$, then the best coefficients are given by

$$c_f = \frac{1}{2} \inf_{t \in I(\xi_1, \xi_2)} f''(t), \qquad d_f = \frac{1}{2} \sup_{t \in I(\xi_1, \xi_2)} f''(t).$$

SLIDE 11

New Results: SDPI for f-divergences

Theorem 1: SDPI (Cont.)

This SDPI is Locally Tight

Let $\{P_X^{(n)}\}$ satisfy

$$\lim_{n \to \infty} \inf_{x \in \mathcal{X}} \frac{P_X^{(n)}(x)}{Q_X(x)} = 1, \qquad \lim_{n \to \infty} \sup_{x \in \mathcal{X}} \frac{P_X^{(n)}(x)}{Q_X(x)} = 1.$$

If $f$ has a continuous second derivative at unity, then

$$\lim_{n \to \infty} \frac{D_f(P_X^{(n)} \| Q_X) - D_f(P_Y^{(n)} \| Q_Y)}{\chi^2(P_X^{(n)} \| Q_X) - \chi^2(P_Y^{(n)} \| Q_Y)} = \frac{1}{2}\, f''(1).$$

SLIDE 12

New Results: SDPI for f-divergences

Advantage: Tensorization of the Chi-Squared Divergence

$$\chi^2(P_1 \times \cdots \times P_m \,\|\, Q_1 \times \cdots \times Q_m) = \prod_{i=1}^{m} \bigl(1 + \chi^2(P_i \| Q_i)\bigr) - 1.$$
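
A quick numerical check of the tensorization identity for m = 2 (a sketch; the product pmfs are formed with an outer product and the marginal pmfs are arbitrary):

```python
import numpy as np

def chi2(p, q):
    return float(np.sum((p - q) ** 2 / q))

P1, Q1 = np.array([0.6, 0.4]), np.array([0.5, 0.5])
P2, Q2 = np.array([0.2, 0.3, 0.5]), np.array([0.4, 0.4, 0.2])

# chi^2 of the product distributions, computed directly on X1 x X2 ...
direct = chi2(np.outer(P1, P2).ravel(), np.outer(Q1, Q2).ravel())
# ... and via the tensorization identity.
tensorized = (1.0 + chi2(P1, Q1)) * (1.0 + chi2(P2, Q2)) - 1.0

assert np.isclose(direct, tensorized)
```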
SLIDE 13

New Results: SDPI for f-divergences

Theorem 2: SDPI for f-divergences

Let $f \colon (0, \infty) \to \mathbb{R}$ satisfy the following conditions:

  • $f$ is convex, differentiable at 1, $f(1) = 0$, and $f(0) := \lim_{t \to 0^+} f(t) < \infty$;
  • the function $g \colon (0, \infty) \to \mathbb{R}$, defined by $g(t) := \frac{f(t) - f(0)}{t}$ for all $t > 0$, is convex.

Let

$$\kappa(\xi_1, \xi_2) := \sup_{t \in (\xi_1, 1) \cup (1, \xi_2)} \frac{f(t) + f'(1)\,(1 - t)}{(t - 1)^2}.$$

Then

$$\frac{D_f(P_Y \| Q_Y)}{D_f(P_X \| Q_X)} \leq \frac{\kappa(\xi_1, \xi_2)}{f(0) + f'(1)} \cdot \frac{\chi^2(P_Y \| Q_Y)}{\chi^2(P_X \| Q_X)}.$$

SLIDE 14

New Results: SDPI for f-divergences

Numerical Results

The tightness of the bounds (the SDPIs) in Theorems 1 and 2 was exemplified numerically for transmission over a binary erasure channel (BEC) and a binary symmetric channel (BSC).

SLIDE 15

Application: List Decoding Error Bounds

List Decoding

The decision rule outputs a list of choices. The extension of Fano's inequality to list decoding, expressed in terms of $H(X|Y)$, was initiated by Ahlswede, Gács and Körner ('76). It is useful for proving converse results (jointly with the blowing-up lemma).

Generalized Fano’s Inequality for Fixed List Size

$$H(X|Y) \leq \log M - d\Bigl(P_L \,\Big\|\, 1 - \frac{L}{M}\Bigr),$$

where $M = |\mathcal{X}|$, $L$ is the fixed list size, $P_L$ is the list decoding error probability (both defined formally on the next slide), and $d(\cdot \| \cdot)$ denotes the binary relative entropy:

$$d(x \| y) := x \log \frac{x}{y} + (1 - x) \log \frac{1 - x}{1 - y}, \qquad x, y \in (0, 1).$$
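
A small helper for this bound (a sketch, not code from the paper; base-2 logarithms, so the bound is in bits, and it assumes P_L in (0,1) and L < M):

```python
import numpy as np

def binary_kl(x, y):
    """Binary relative entropy d(x||y) in bits, for x, y in (0, 1)."""
    return x * np.log2(x / y) + (1.0 - x) * np.log2((1.0 - x) / (1.0 - y))

def fano_fixed_list(M, L, P_L):
    """Upper bound on H(X|Y) in bits: log M - d(P_L || 1 - L/M)."""
    return np.log2(M) - binary_kl(P_L, 1.0 - L / M)

# Example: M = 5 hypotheses, lists of size L = 2, error probability 0.25.
print(fano_fixed_list(5, 2, 0.25))
```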

SLIDE 16

List Decoding Error Bounds

Theorem 3: Tightened Bound by Strong DPI (SDPI)

Let $P_{XY}$ be a probability measure defined on $\mathcal{X} \times \mathcal{Y}$ with $|\mathcal{X}| = M$. Consider a decision rule $\mathcal{L} \colon \mathcal{Y} \to \binom{\mathcal{X}}{L}$, where $\binom{\mathcal{X}}{L}$ stands for the set of subsets of $\mathcal{X}$ with cardinality $L$, and $L < M$ is fixed. Denote the list decoding error probability by $P_L := \mathbb{P}\bigl[X \notin \mathcal{L}(Y)\bigr]$.

If the $L$ most probable elements of $\mathcal{X}$ are selected, given $Y \in \mathcal{Y}$, then

$$H(X|Y) \leq \log M - d\Bigl(P_L \,\Big\|\, 1 - \frac{L}{M}\Bigr) - \frac{\log e}{2} \cdot \frac{\mathbb{E}\bigl[P_{X|Y}(X|Y)\bigr] - \frac{1 - P_L}{L}}{\displaystyle \sup_{(x,y) \in \mathcal{X} \times \mathcal{Y}} P_{X|Y}(x|y)}.$$

Proof idea: apply Theorem 1 (the first SDPI) with $f(t) = t \log t$ for $t > 0$, with $P_{X|Y=y}$, with $Q_{X|Y=y}$ equiprobable over $\{1, \ldots, M\}$, and with $W_{Z|X, Y=y}$ mapping $X$ to $Z = 1$ if $X \in \mathcal{L}(y)$ and to $Z = 0$ if $X \notin \mathcal{L}(y)$; then average over $Y$. Numerical experimentation exemplifies this improvement.

SLIDE 17

List Decoding Error Bounds

Generalized Fano’s Inequality for Variable List Size (1975)

Let $P_{XY}$ be a probability measure defined on $\mathcal{X} \times \mathcal{Y}$ with $|\mathcal{X}| = M$; consider a decision rule $\mathcal{L} \colon \mathcal{Y} \to 2^{\mathcal{X}}$, and let the (average) list decoding error probability be given by $P_L := \mathbb{P}\bigl[X \notin \mathcal{L}(Y)\bigr]$, with $|\mathcal{L}(y)| \geq 1$ for all $y \in \mathcal{Y}$. Then

$$H(X|Y) \leq h(P_L) + \mathbb{E}\bigl[\log |\mathcal{L}(Y)|\bigr] + P_L \log M,$$

where $h(\cdot)$ denotes the binary entropy function.
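
A sketch of the right-hand side of this inequality, plus a simple grid search for the smallest error probability it permits once H(X|Y), M and E[log|L(Y)|] are known (illustrative function names; base-2 logarithms):

```python
import numpy as np

def binary_entropy(p):
    """h(p) in bits, with h(0) = h(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * np.log2(p) - (1.0 - p) * np.log2(1.0 - p)

def fano_variable_list_rhs(P_L, M, expected_log_list_size):
    """h(P_L) + E[log|L(Y)|] + P_L * log M (bits)."""
    return binary_entropy(P_L) + expected_log_list_size + P_L * np.log2(M)

def min_error_prob(H_cond, M, expected_log_list_size, steps=100000):
    """Smallest P_L on a uniform grid for which H(X|Y) <= RHS can hold."""
    for i in range(steps + 1):
        P_L = i / steps
        if fano_variable_list_rhs(P_L, M, expected_log_list_size) >= H_cond:
            return P_L
    return 1.0
```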

SLIDE 18

List Decoding Error Bounds

Theorem: A Consequence of DPI for the Eγ-Divergence

For every $\gamma \geq 1$,

$$P_L \geq \frac{1 + \gamma}{2} - \frac{\gamma\, \mathbb{E}\bigl[|\mathcal{L}(Y)|\bigr]}{M} - \frac{1}{2}\, \mathbb{E}\Biggl[\, \sum_{x \in \mathcal{X}} \Bigl| P_{X|Y}(x|Y) - \frac{\gamma}{M} \Bigr| \Biggr].$$

Conditions for the bound to hold with equality are proved in the paper.
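
A direct sketch of this lower bound for a finite joint pmf stored as a matrix P_XY[x, y], with the lists given as a dictionary from y to a set of symbols (the function and argument names are illustrative):

```python
import numpy as np

def eg_list_error_lower_bound(P_XY, lists, gamma):
    """Lower bound on P_L = Pr[X not in L(Y)] from the E_gamma DPI:
    (1+gamma)/2 - gamma*E[|L(Y)|]/M - 0.5*E[ sum_x |P_{X|Y}(x|Y) - gamma/M| ]."""
    M = P_XY.shape[0]
    P_Y = P_XY.sum(axis=0)                                  # marginal pmf of Y
    exp_list_size = sum(P_Y[y] * len(lists[y]) for y in range(len(P_Y)))
    exp_abs = 0.0
    for y in range(len(P_Y)):
        P_X_given_y = P_XY[:, y] / P_Y[y]
        exp_abs += P_Y[y] * np.sum(np.abs(P_X_given_y - gamma / M))
    return (1.0 + gamma) / 2.0 - gamma * exp_list_size / M - 0.5 * exp_abs
```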

SLIDE 19

List Decoding Error Bounds

Simple Example

$X$ and $Y$ are random variables taking values in $\mathcal{X} = \{0, 1, 2, 3, 4\}$ and $\mathcal{Y} = \{0, 1\}$, with joint probability mass function

$$P_{XY}(0,0) = P_{XY}(1,0) = P_{XY}(2,0) = \tfrac{1}{8}, \qquad P_{XY}(3,0) = P_{XY}(4,0) = \tfrac{1}{16},$$

$$P_{XY}(0,1) = P_{XY}(1,1) = P_{XY}(2,1) = \tfrac{1}{24}, \qquad P_{XY}(3,1) = P_{XY}(4,1) = \tfrac{3}{16}.$$

The lists in $\mathcal{X}$, given $Y \in \mathcal{Y}$, are $\mathcal{L}(0) = \{0, 1, 2\}$ and $\mathcal{L}(1) = \{3, 4\}$. Then:

  • if $\gamma = \tfrac{5}{4}$, the bound holds with equality and $P_L = \tfrac{1}{4}$;
  • the generalized Fano's inequality only gives $P_L \geq 0.1206$.
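
A numerical check of this example (a sketch reproducing the numbers above): the true error probability equals 1/4, and the E_gamma-based lower bound of the previous slide is attained at gamma = 5/4; the value 0.1206 quoted for the generalized Fano inequality is taken from the slide.

```python
import numpy as np

# Joint pmf P_XY[x, y] for X in {0,...,4}, Y in {0,1}.
P_XY = np.array([[1/8,  1/24],
                 [1/8,  1/24],
                 [1/8,  1/24],
                 [1/16, 3/16],
                 [1/16, 3/16]])
lists = {0: {0, 1, 2}, 1: {3, 4}}
M = 5
gamma = 5 / 4

P_Y = P_XY.sum(axis=0)                                      # = [1/2, 1/2]
# True list decoding error probability P_L = Pr[X not in L(Y)].
P_L = sum(P_XY[x, y] for y in (0, 1) for x in range(M) if x not in lists[y])

# E_gamma-based lower bound (same formula as on the previous slide).
exp_list_size = sum(P_Y[y] * len(lists[y]) for y in (0, 1))
exp_abs = sum(P_Y[y] * np.sum(np.abs(P_XY[:, y] / P_Y[y] - gamma / M)) for y in (0, 1))
bound = (1 + gamma) / 2 - gamma * exp_list_size / M - 0.5 * exp_abs

print(P_L, bound)      # both equal 0.25, so the bound is tight here
```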

SLIDE 20

Summary

Summary

  • We focus on strong data-processing inequalities for f-divergences.
  • We exemplify their utility for list decoding error bounds.
  • Another application (see the paper): variable-to-fixed length Tunstall codes.
  • Majorization inequalities and an information-theoretic application were presented at ITA 2020.

Journal Papers (Related Work)

  • I. Sason and S. Verdú, "f-divergence inequalities," IEEE T-IT, Nov. 2016.
  • I. Sason, "On f-divergences: integral representations, local behavior, and inequalities," Entropy, May 2018.
  • I. Sason, "On data-processing and majorization inequalities for f-divergences," Entropy, Oct. 2019.

SLIDE 21

Summary

More on f-Divergences and f-Informativities

  • I-divergence (relative entropy), and its generalization to f-divergences;
  • Mutual information, and its generalization by means of f-informativities;
  • Risk lower bounds in estimation and learning problems;
  • Exact locus of the joint range of f-divergences, and tensorization;
  • Contraction coefficients and strong data-processing inequalities;
  • DeGroot statistical information and its important links to f-divergences;
  • Integral and variational representations of f-divergences, and applications;
  • Sufficiency and ε-sufficiency of observation channels, and implications;
  • Zakai and Ziv's extension of rate-distortion theory with f-divergences;
  • Asymptotic methods in statistical decision theory with f-divergences;
  • Robustness of f-divergence-based estimators.

Thanks to Imre, who introduced these information measures!
