Comparison of Local and Global Contraction Coefficients for KL Divergence


  1. Comparison of Local and Global Contraction Coefficients for KL Divergence
     Anuran Makur and Lizhong Zheng
     EECS Department, Massachusetts Institute of Technology
     5 November 2015
     A. Makur & L. Zheng (MIT), Local and Global Contraction Coefficients, 5 November 2015, 1 / 32

  2. Outline
     1. Introduction to Contraction Coefficients
        - Measuring Ergodicity
        - Contraction Coefficients of Strong Data Processing Inequalities
     2. Motivation from Inference
     3. Contraction Coefficients for KL and $\chi^2$-Divergences
     4. Bounds between Contraction Coefficients

  9. Measuring Ergodicity
     Consider an ergodic Markov chain with $n \times n$ column stochastic transition matrix $W$.
     - irreducible $\Rightarrow$ unique stationary distribution $\pi$: $W\pi = \pi$
     - aperiodic $\Rightarrow$ $W^k \to \pi \mathbf{1}^T$ (rank 1 matrix)

     Rate of convergence?
     - Perron-Frobenius: $1 = \lambda_1(W) > |\lambda_2(W)| \geq \cdots \geq |\lambda_n(W)|$
     - Rate of convergence determined by $|\lambda_2(W)|$ $\leftarrow$ coefficient of ergodicity

     Want: A guarantee on the relative improvement, i.e. for any distribution $p$, $W^{k+1} p$ is "closer" to $\pi$ than $W^k p$.
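
The convergence rate described on this slide can be checked numerically. The sketch below is only an illustration: the 2-state matrix `W`, the starting distribution, and the helper `mat_vec` are hypothetical choices, not taken from the talk. It tracks the $\ell_1$ distance between $W^k p$ and $\pi$ and compares the per-step shrink factor with $|\lambda_2(W)|$.

```python
def mat_vec(W, p):
    """Apply a column-stochastic matrix W (given as a list of rows) to a vector p."""
    return [sum(W[i][j] * p[j] for j in range(len(p))) for i in range(len(W))]

# Hypothetical 2-state chain; every column sums to 1 (column stochastic).
W = [[0.9, 0.2],
     [0.1, 0.8]]

# Stationary distribution: W pi = pi gives pi = (2/3, 1/3) for this W.
pi = [2 / 3, 1 / 3]

# For a 2x2 stochastic matrix the eigenvalues are 1 and trace(W) - 1.
lambda_2 = W[0][0] + W[1][1] - 1

# Track the l1 distance to pi along the trajectory p, Wp, W^2 p, ...
p = [0.0, 1.0]
errors = []
for k in range(6):
    errors.append(sum(abs(a - b) for a, b in zip(p, pi)))
    p = mat_vec(W, p)

# Ratio of successive errors: the observed one-step contraction factor.
ratios = [errors[k + 1] / errors[k] for k in range(5)]
print(lambda_2, ratios)
```

In two states every difference $p - \pi$ is a multiple of the second eigenvector, so the observed ratios match $|\lambda_2(W)|$ at every step; in larger chains the match is only asymptotic.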

  13. Measuring Ergodicity
      Let $d : \mathcal{P} \times \mathcal{P} \to [0, \infty]$ be a divergence measure on the simplex $\mathcal{P}$.

      Want: $\forall p \in \mathcal{P}$,
      $$d(Wp, W\pi) \leq \eta_d(\pi, W) \, d(p, \pi), \qquad \text{where } W\pi = \pi,$$
      for some contraction coefficient $\eta_d(\pi, W) \in [0, 1]$.

      This would mean that: $\forall p \in \mathcal{P}$,
      $$d(W^k p, \pi) \leq \eta_d(\pi, W)^k \, d(p, \pi).$$

      $\eta_d(\pi, W) < 1 \Rightarrow W^k p \xrightarrow{d} \pi$ geometrically fast with rate $\eta_d(\pi, W)$.

      So, $\eta_d(\pi, W)$ is a coefficient of ergodicity, and we define it as:
      $$\eta_d(\pi, W) \triangleq \sup_{p : p \neq \pi} \frac{d(Wp, W\pi)}{d(p, \pi)}.$$
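
To make the supremum in this definition concrete, the following sketch takes $d$ to be KL divergence and evaluates the ratio $d(Wp, W\pi)/d(p, \pi)$ on a grid of distributions. The 2-state matrix `W`, the grid, and the helper names are assumptions for illustration. A grid search only yields a lower bound on $\eta_d(\pi, W)$, not its exact value.

```python
import math

def mat_vec(W, p):
    """Apply a column-stochastic matrix W (given as a list of rows) to a vector p."""
    return [sum(W[i][j] * p[j] for j in range(len(p))) for i in range(len(W))]

def kl(r, q):
    """KL divergence D(r || q) in nats, with the convention 0 log 0 = 0."""
    return sum(a * math.log(a / b) for a, b in zip(r, q) if a > 0)

# Hypothetical 2-state chain with stationary distribution pi = (2/3, 1/3).
W = [[0.9, 0.2],
     [0.1, 0.8]]
pi = [2 / 3, 1 / 3]

# Evaluate D(Wp || W pi) / D(p || pi) on a grid; W pi = pi, so the
# numerator is D(Wp || pi). The grid never hits p = pi exactly.
ratios = []
for i in range(1, 100):
    t = i / 100
    p = [t, 1 - t]
    ratios.append(kl(mat_vec(W, p), pi) / kl(p, pi))

# The largest sampled ratio is a lower bound on the contraction coefficient.
eta_lower = max(ratios)
print(eta_lower)
```

Every sampled ratio stays below 1 here, which is what the contraction inequality requires of this particular `W`.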

  19. Measuring Ergodicity
      Can we define notions of distance between distributions which make $W$ a contraction?

      Does the $\ell_2$-norm work?
      $$\|W\pi - Wp\|_2 = \|W(\pi - p)\|_2 \leq \|W\|_2 \, \|\pi - p\|_2$$
      where the spectral norm $\|W\|_2$ is the largest singular value of $W$. $\|W\|_2 > 1$ is possible ...

      Dobrushin-Doeblin Coefficient of Ergodicity: The $\ell_1$-norm (total variation distance) works!
      $$\|W\pi - Wp\|_1 = \|W(\pi - p)\|_1 \leq \eta_{TV}(\pi, W) \, \|\pi - p\|_1$$
      where
      $$\eta_{TV}(\pi, W) \triangleq \sup_{p : p \neq \pi} \frac{\|W\pi - Wp\|_1}{\|\pi - p\|_1} \in [0, 1]$$
      is the Dobrushin-Doeblin contraction coefficient.
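
Both claims on this slide can be illustrated with a small example. The 3-state matrix `W` and the test vectors below are hypothetical. The closed form used for the Dobrushin coefficient (the largest total variation distance between two columns of `W`) is the standard all-pairs version from the general literature, assumed here rather than taken from the slides; it upper-bounds the $\pi$-anchored supremum defined above.

```python
import math

def mat_vec(W, x):
    """Apply a matrix W (given as a list of rows) to a vector x."""
    return [sum(W[i][j] * x[j] for j in range(len(x))) for i in range(len(W))]

def l1(x):
    return sum(abs(v) for v in x)

def l2(x):
    return math.sqrt(sum(v * v for v in x))

# Hypothetical 3-state column-stochastic matrix with all entries positive.
W = [[0.8, 0.7, 0.6],
     [0.1, 0.2, 0.1],
     [0.1, 0.1, 0.3]]

# l2: a single vector with ||Wx||_2 > ||x||_2 witnesses ||W||_2 > 1.
x = [1.0, 1.0, 1.0]
l2_gain = l2(mat_vec(W, x)) / l2(x)

# l1: all-pairs Dobrushin coefficient = max total variation between columns.
n = len(W)
cols = [[W[i][j] for i in range(n)] for j in range(n)]
dobrushin = max(0.5 * l1([a - b for a, b in zip(cols[j], cols[k])])
                for j in range(n) for k in range(n))

# Check the l1 contraction on a few pairs of distributions.
pairs = [([1.0, 0.0, 0.0], [0.0, 0.0, 1.0]),
         ([0.2, 0.3, 0.5], [0.5, 0.3, 0.2]),
         ([0.1, 0.8, 0.1], [0.6, 0.2, 0.2])]
l1_ratios = []
for p, q in pairs:
    diff = [a - b for a, b in zip(p, q)]
    l1_ratios.append(l1(mat_vec(W, diff)) / l1(diff))

print(l2_gain, dobrushin, l1_ratios)
```

The $\ell_2$ gain exceeds 1 even though every $\ell_1$ ratio stays at or below the Dobrushin coefficient.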

  22. Csiszár $f$-Divergence
      Definition (Csiszár $f$-Divergence): Given distributions $R_X$ and $P_X$ on $\mathcal{X}$, we define their $f$-divergence as:
      $$D_f(R_X \| P_X) \triangleq \sum_{x \in \mathcal{X}} P_X(x) \, f\!\left(\frac{R_X(x)}{P_X(x)}\right)$$
      where $f : \mathbb{R}^+ \to \mathbb{R}$ is convex and $f(1) = 0$.
      - Non-negativity: $D_f(R_X \| P_X) \geq 0$ with equality iff $R_X = P_X$.
      - Data Processing Inequality: For a fixed channel $P_{Y|X}$: $\forall R_X, P_X$,
        $$D_f(R_Y \| P_Y) \leq D_f(R_X \| P_X)$$
        where $R_Y$ and $P_Y$ are output pmfs corresponding to $R_X$ and $P_X$.
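
The definition and the data processing inequality can be sketched in a few lines. The function `f_divergence`, the two choices of $f$ (KL with $f(t) = t \log t$ and $\chi^2$ with $f(t) = (t - 1)^2$), the input pmfs, and the channel matrix below are illustrative assumptions. The sketch assumes all probabilities are strictly positive.

```python
import math

def f_divergence(f, r, p):
    """D_f(r || p) = sum_x p(x) f(r(x) / p(x)); assumes p(x) > 0 for all x."""
    return sum(px * f(rx / px) for rx, px in zip(r, p))

def f_kl(t):
    """f(t) = t log t gives KL divergence (with f(0) = 0 by convention)."""
    return t * math.log(t) if t > 0 else 0.0

def f_chi2(t):
    """f(t) = (t - 1)^2 gives chi-squared divergence."""
    return (t - 1) ** 2

def channel_output(K, p):
    """Push an input pmf p through a channel K, where K[y][x] = P(Y=y | X=x)."""
    return [sum(K[y][x] * p[x] for x in range(len(p))) for y in range(len(K))]

# Hypothetical input pmfs and channel (each column of K sums to 1).
R_X = [0.5, 0.3, 0.2]
P_X = [0.2, 0.3, 0.5]
K = [[0.8, 0.7, 0.6],
     [0.1, 0.2, 0.1],
     [0.1, 0.1, 0.3]]

R_Y = channel_output(K, R_X)
P_Y = channel_output(K, P_X)

for name, f in [("KL", f_kl), ("chi2", f_chi2)]:
    d_in = f_divergence(f, R_X, P_X)
    d_out = f_divergence(f, R_Y, P_Y)
    print(name, d_in, d_out, d_out <= d_in)
```

Swapping in another convex $f$ with $f(1) = 0$ only requires passing a different function to `f_divergence`.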
