
Comparison of Local and Global Contraction Coefficients for KL Divergence

Anuran Makur and Lizhong Zheng

EECS Department, Massachusetts Institute of Technology

5 November 2015


Outline

1. Introduction to Contraction Coefficients: Measuring Ergodicity; Contraction Coefficients of Strong Data Processing Inequalities
2. Motivation from Inference
3. Contraction Coefficients for KL and χ²-Divergences
4. Bounds between Contraction Coefficients

Measuring Ergodicity

Consider an ergodic Markov chain with n × n column-stochastic transition matrix W.

- Irreducible ⇒ unique stationary distribution π: Wπ = π.
- Aperiodic ⇒ W^k → π1^T (a rank-1 matrix).
- Rate of convergence? Perron-Frobenius: 1 = λ₁(W) > |λ₂(W)| ≥ ⋯ ≥ |λₙ(W)|, and the rate of convergence is determined by |λ₂(W)|, the coefficient of ergodicity (illustrated numerically below).
- Want: a guarantee on the relative improvement, i.e. for any distribution p, W^{k+1}p is "closer" to π than W^k p.
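
The following minimal Python/numpy sketch (not part of the original deck; the 3-state chain W is a toy example chosen here) illustrates that the ℓ₁ error ‖W^k p − π‖₁ shrinks roughly like |λ₂(W)|^k:

    import numpy as np

    # Toy 3-state ergodic chain; columns sum to 1 (column-stochastic, as in the slides).
    W = np.array([[0.6, 0.2, 0.1],
                  [0.3, 0.5, 0.3],
                  [0.1, 0.3, 0.6]])

    # Stationary distribution: eigenvector of W at eigenvalue 1, normalized to a pmf.
    vals, vecs = np.linalg.eig(W)
    pi = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
    pi /= pi.sum()

    lam2 = sorted(np.abs(vals))[-2]   # |lambda_2(W)|
    p = np.array([1.0, 0.0, 0.0])     # arbitrary initial distribution
    for k in range(1, 6):
        p = W @ p
        print(k, np.abs(p - pi).sum(), lam2 ** k)  # l1 error vs. |lambda_2|^k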

Measuring Ergodicity

Let d : P × P → [0, ∞] be a divergence measure on the simplex P.

Want: ∀p ∈ P, d(Wp, Wπ) ≤ ηd(π, W) d(p, π) for some contraction coefficient ηd(π, W) ∈ [0, 1]. This would mean that ∀p ∈ P, d(W^k p, π) ≤ ηd(π, W)^k d(p, π), so ηd(π, W) < 1 implies W^k p → π (in d) geometrically fast with rate ηd(π, W).

So ηd(π, W) is a coefficient of ergodicity, and we define it as:

$$\eta_d(\pi, W) \triangleq \sup_{p \,:\, p \neq \pi} \frac{d(Wp, W\pi)}{d(p, \pi)}.$$

Measuring Ergodicity

Can we define notions of distance between distributions which make W a contraction? Does the ℓ₂-norm work?

$$\|W\pi - Wp\|_2 = \|W(\pi - p)\|_2 \le \|W\|_2 \, \|\pi - p\|_2$$

where the spectral norm ‖W‖₂ is the largest singular value of W. But ‖W‖₂ > 1 is possible...

Dobrushin-Doeblin Coefficient of Ergodicity: the ℓ₁-norm (total variation distance) works!

$$\|W\pi - Wp\|_1 = \|W(\pi - p)\|_1 \le \eta_{\mathsf{TV}}(\pi, W) \, \|\pi - p\|_1$$

where $\eta_{\mathsf{TV}}(\pi, W) \triangleq \sup_{p : p \neq \pi} \frac{\|W\pi - Wp\|_1}{\|\pi - p\|_1} \in [0, 1]$ is the Dobrushin-Doeblin contraction coefficient.
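
As a numerical aside (not in the deck), ηTV can be computed by Dobrushin's classical formula: half the largest ℓ₁ distance between columns of W. The sketch below, with the same toy W assumed, also spot-checks the contraction on random pmfs:

    import numpy as np

    def dobrushin(W):
        # Dobrushin-Doeblin coefficient: (1/2) max over column pairs of l1 distance.
        n = W.shape[1]
        return 0.5 * max(np.abs(W[:, i] - W[:, j]).sum()
                         for i in range(n) for j in range(n))

    W = np.array([[0.6, 0.2, 0.1],
                  [0.3, 0.5, 0.3],
                  [0.1, 0.3, 0.6]])
    eta_tv = dobrushin(W)

    vals, vecs = np.linalg.eig(W)
    pi = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
    pi /= pi.sum()

    rng = np.random.default_rng(0)
    for _ in range(1000):
        p = rng.dirichlet(np.ones(3))   # random pmf on 3 atoms
        # ||W p - W pi||_1 <= eta_TV * ||p - pi||_1 should hold for every p.
        assert np.abs(W @ p - W @ pi).sum() <= eta_tv * np.abs(p - pi).sum() + 1e-12
    print("eta_TV =", eta_tv)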

Csiszár f-Divergence

Definition (Csiszár f-Divergence): Given distributions RX and PX on 𝒳, we define their f-divergence as:

$$D_f(R_X \| P_X) \triangleq \sum_{x \in \mathcal{X}} P_X(x) \, f\!\left(\frac{R_X(x)}{P_X(x)}\right)$$

where f : R⁺ → R is convex and f(1) = 0.

- Non-negativity: Df(RX‖PX) ≥ 0 with equality iff RX = PX.
- Data Processing Inequality: For a fixed channel PY|X: ∀RX, PX, Df(RY‖PY) ≤ Df(RX‖PX), where RY and PY are the output pmfs corresponding to RX and PX.

Csiszár f-Divergence

Theorem [Amari and Cichocki, 2010]: A decomposable divergence measure satisfies data processing if and only if it is an f-divergence.

Definition: A divergence d is decomposable if it can be written as

$$d(R_X, P_X) = \sum_{x \in \mathcal{X}} g\left(R_X(x), P_X(x)\right)$$

for some function g : [0, 1]² → R.

Csiszár f-Divergence: Some Examples

- Total Variation Distance: f(t) = |t − 1| produces Df(RX‖PX) = ‖RX − PX‖₁.
- KL Divergence: f(t) = t log(t) produces $D_f(R_X \| P_X) = D(R_X \| P_X) = \sum_{x \in \mathcal{X}} R_X(x) \log \frac{R_X(x)}{P_X(x)}$.
- χ²-Divergence: f(t) = (t − 1)² produces $D_f(R_X \| P_X) = \chi^2(R_X, P_X) = \sum_{x \in \mathcal{X}} \frac{(R_X(x) - P_X(x))^2}{P_X(x)}$.
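
A small numpy sketch (pmf values assumed for illustration) evaluates all three examples through the single formula Df(R‖P) = Σₓ P(x) f(R(x)/P(x)):

    import numpy as np

    def f_divergence(R, P, f):
        # D_f(R || P) = sum_x P(x) f(R(x)/P(x)); assumes full-support pmfs.
        return float(np.sum(P * f(R / P)))

    R = np.array([0.5, 0.3, 0.2])
    P = np.array([0.4, 0.4, 0.2])

    tv  = f_divergence(R, P, lambda t: np.abs(t - 1))   # ||R - P||_1
    kl  = f_divergence(R, P, lambda t: t * np.log(t))   # D(R||P), in nats
    chi = f_divergence(R, P, lambda t: (t - 1) ** 2)    # chi^2(R, P)
    print(tv, kl, chi)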

Contraction Coefficients

Definition (Contraction Coefficient for f-Divergence): For a fixed source distribution PX and channel PY|X, we define the contraction coefficient for f-divergence as:

$$\eta_f\left(P_X, P_{Y|X}\right) \triangleq \sup_{R_X : R_X \neq P_X} \frac{D_f(R_Y \| P_Y)}{D_f(R_X \| P_X)}$$

where RY is the output distribution when RX passes through PY|X.

Strong Data Processing Inequality: For fixed PX and PY|X, we have: ∀RX, Df(RY‖PY) ≤ ηf(PX, PY|X) Df(RX‖PX).

We will use the following instances of contraction coefficients:

1. f(t) = t log(t): ηf(PX, PY|X) = ηKL(PX, PY|X)
2. f(t) = (t − 1)²: ηf(PX, PY|X) = ηχ²(PX, PY|X)

Outline

1. Introduction to Contraction Coefficients
2. Motivation from Inference: Inference Problem; Unsupervised Model Selection
3. Contraction Coefficients for KL and χ²-Divergences
4. Bounds between Contraction Coefficients

Motivation: Inference Problem

Problem: Infer a hidden variable U about a "person X" given some data Y_1, ..., Y_m ∈ 𝒴 about the person that is conditionally independent given U:

U → (Y_1, ..., Y_m)

Assume U is binary with P(U = −1) = P(U = 1) = 1/2.

Example: U ∈ {conservative, liberal} and Y = movies watched on Netflix.

Log-likelihood Ratio Test: Construct a sufficient statistic Z, with U → (Y_1, ..., Y_m) → Z:

$$Z \triangleq \sum_{i=1}^{m} \log \frac{P_{Y|U}(Y_i \,|\, 1)}{P_{Y|U}(Y_i \,|\, {-1})}$$

Maximum Likelihood Estimate: $\hat{U} = \mathrm{sign}(Z)$.
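
A minimal simulation of this test (the observation model PY|U below is hypothetical, chosen only for illustration):

    import numpy as np

    rng = np.random.default_rng(1)

    # Hypothetical binary observation alphabet Y = {0, 1}.
    p_y_given_u = {1: np.array([0.3, 0.7]), -1: np.array([0.6, 0.4])}

    u = rng.choice([-1, 1])                        # hidden label, uniform prior
    y = rng.choice(2, size=100, p=p_y_given_u[u])  # conditionally i.i.d. data Y_1..Y_m

    # Sufficient statistic Z = sum_i log P(Y_i|1)/P(Y_i|-1); ML estimate is sign(Z).
    z = np.sum(np.log(p_y_given_u[1][y]) - np.log(p_y_given_u[-1][y]))
    u_hat = int(np.sign(z))                        # (ties at Z = 0 ignored here)
    print("true U:", u, " estimate:", u_hat)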

Motivation: Unsupervised Model Selection

How do we learn PY|U? Given i.i.d. training data (X_1, Y_1), ..., (X_n, Y_n) with U_i → X_i → Y_i for each i, where each X_i ∈ 𝒳 = {1, 2, ..., |𝒳|} and X indexes different people.

The training data gives us the empirical distribution Pⁿ_{X,Y}:

$$\forall (x, y) \in \mathcal{X} \times \mathcal{Y}, \quad P^n_{X,Y}(x, y) \triangleq \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}(X_i = x, Y_i = y)$$

We assume that the true distribution PX,Y = Pⁿ_{X,Y} (motivated by concentration of measure results).
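
For concreteness, a tiny sketch of the empirical distribution (toy samples assumed):

    import numpy as np

    def empirical_joint(xs, ys, nx, ny):
        # P^n_{X,Y}(x, y) = (1/n) * #{i : X_i = x, Y_i = y}
        P = np.zeros((nx, ny))
        for x, y in zip(xs, ys):
            P[x, y] += 1.0
        return P / len(xs)

    xs = [0, 1, 1, 2, 0, 2, 1]   # toy X samples over {0, 1, 2}
    ys = [0, 1, 1, 0, 1, 1, 0]   # toy Y samples over {0, 1}
    print(empirical_joint(xs, ys, nx=3, ny=2))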

Motivation: Unsupervised Model Selection

Model Selection Problem: Given U ~ Bernoulli(1/2) and the joint pmf PX,Y for the Markov chain

U → X → Y (with PU, PX|U, PX, PY|X, PY),

find PX|U that maximizes the proportion of information that passes through the Markov chain:

$$\max \; \frac{I(U; Y)}{I(U; X)}.$$

Remark: I(U;Y)/I(U;X) = 1 ⇒ I(U;Y) = I(U;X), which means Y is a sufficient statistic for U.

Outline

1. Introduction to Contraction Coefficients
2. Motivation from Inference
3. Contraction Coefficients for KL and χ²-Divergences: Data Processing Inequalities; Contraction Coefficient for KL Divergence; Local Approximation of KL Divergence; Local Contraction Coefficient
4. Bounds between Contraction Coefficients

Data Processing Inequalities

- Data Processing Inequality for KL Divergence: Fix PX and PY|X. Then, for any RX: D(RY‖PY) ≤ D(RX‖PX), where RY is the output when RX passes through PY|X.
- Strong Data Processing Inequality for KL Divergence: Fix PX and PY|X. Then, for any RX: D(RY‖PY) ≤ ηKL(PX, PY|X) D(RX‖PX).
- Data Processing Inequality for Mutual Information: Given a Markov chain U → X → Y: I(U;Y) ≤ I(U;X).
- Strong Data Processing Inequality for Mutual Information: For fixed PX and PY|X: I(U;Y) ≤ ηKL(PX, PY|X) I(U;X).

Contraction Coefficient for KL Divergence

Definition (Contraction Coefficient for KL Divergence): For a fixed source distribution PX and channel PY|X, we define the contraction coefficient for KL divergence and mutual information as:

$$\eta_{\mathsf{KL}}\left(P_X, P_{Y|X}\right) \triangleq \sup_{R_X : R_X \neq P_X} \frac{D(R_Y \| P_Y)}{D(R_X \| P_X)} = \sup_{\substack{P_U, P_{X|U} : \\ U \to X \to Y}} \frac{I(U; Y)}{I(U; X)}$$

where the second equality is proven in [Anantharam et al., 2013] and [Polyanskiy and Wu, 2016].

- This provides an optimization criterion which finds both PU and PX|U for our model selection problem.
- The problem is not concave, so it is difficult to solve.
- Observation: D(RY‖PY) ≤ D(RX‖PX) is tight when RX = PX, but the sequence of pmfs RX achieving the supremum does not tend to PX.

Local Approximation of KL Divergence

Idea: Find a sequence of pmfs RX → PX that maximizes D(RY‖PY)/D(RX‖PX).

Consider the trajectory:

$$\forall x \in \mathcal{X}, \quad R_X^{(\epsilon)}(x) = P_X(x) + \epsilon \sqrt{P_X(x)}\, K_X(x)$$

where we can think of KX and √PX as vectors, and KXᵀ√PX = 0.

Taylor's theorem:

$$D(R_X^{(\epsilon)} \| P_X) = \frac{1}{2} \underbrace{\epsilon^2 \|K_X\|_2^2}_{=\,\chi^2(R_X^{(\epsilon)},\, P_X)} + \, o(\epsilon^2)$$

$$D(R_Y^{(\epsilon)} \| P_Y) = \frac{1}{2} \underbrace{\epsilon^2 \|B K_X\|_2^2}_{=\,\chi^2(R_Y^{(\epsilon)},\, P_Y)} + \, o(\epsilon^2)$$

where R_Y^{(ε)} = PY|X · R_X^{(ε)}, and B captures the effect of the channel on KX:

$$B \triangleq \mathrm{diag}\left(\sqrt{P_Y}\right)^{-1} \cdot P_{Y|X} \cdot \mathrm{diag}\left(\sqrt{P_X}\right).$$
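
The second-order behavior is easy to check numerically. In the sketch below (source, channel, and perturbation direction are toy values assumed here), both KL divergences match their ½ε²‖·‖² approximations up to o(ε²) corrections:

    import numpy as np

    PX  = np.array([0.5, 0.3, 0.2])            # toy source
    PYX = np.array([[0.7, 0.2, 0.1],           # toy channel; column x holds P_{Y|X}(.|x)
                    [0.3, 0.8, 0.9]])
    PY  = PYX @ PX

    B = np.diag(1 / np.sqrt(PY)) @ PYX @ np.diag(np.sqrt(PX))

    def kl(R, P):
        return float(np.sum(R * np.log(R / P)))

    # A direction K_X with K_X^T sqrt(P_X) = 0 (note ||sqrt(P_X)||_2 = 1).
    K = np.array([1.0, -1.0, 0.5])
    K -= (K @ np.sqrt(PX)) * np.sqrt(PX)

    eps = 1e-3
    RX = PX + eps * np.sqrt(PX) * K
    RY = PYX @ RX
    print(kl(RX, PX), 0.5 * eps**2 * (K @ K))            # input side
    print(kl(RY, PY), 0.5 * eps**2 * (B @ K) @ (B @ K))  # output side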

Local Contraction Coefficient

Theorem (Local Contraction Coefficient) [Makur and Zheng, 2015]: For random variables X and Y with joint pmf PX,Y, we have:

$$\lim_{\epsilon \to 0} \; \sup_{\substack{R_X : R_X \neq P_X \\ D(R_X \| P_X) = \frac{1}{2}\epsilon^2}} \frac{D(R_Y \| P_Y)}{D(R_X \| P_X)} \;=\; \max_{\substack{K_X : K_X \neq \mathbf{0} \\ K_X^T \sqrt{P_X} = 0}} \frac{\|B K_X\|_2^2}{\|K_X\|_2^2} \;=\; \eta_{\chi^2}\left(P_X, P_{Y|X}\right)$$

where $B = \mathrm{diag}(\sqrt{P_Y})^{-1} \cdot P_{Y|X} \cdot \mathrm{diag}(\sqrt{P_X})$, and the RHS is maximized by K*_X, the right singular vector of B corresponding to its "largest" singular value.

- The trajectory ∀x ∈ 𝒳, R_X^{(ε)}(x) = PX(x) + ε√PX(x) K*_X(x) achieves the supremum in the LHS as ε → 0.
- This formulation admits an easy solution using the SVD (a numerical sketch follows below).
- Model Selection Solution: for fixed small ε,
  ∀x ∈ 𝒳, PX|U(x|1) = PX(x) + ε√PX(x) K*_X(x) and PX|U(x|−1) = PX(x) − ε√PX(x) K*_X(x).
- ηχ²(PX, PY|X) is also equal to the squared Hirschfeld-Gebelein-Rényi maximal correlation.
- Other singular vectors of B can be used to decompose information into "mutually orthogonal" parts [Makur et al., 2015].
- Next: compare ηχ²(PX, PY|X) and ηKL(PX, PY|X).
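
A minimal SVD sketch of this recipe (same toy PX and PY|X as above, assumed for illustration):

    import numpy as np

    PX  = np.array([0.5, 0.3, 0.2])
    PYX = np.array([[0.7, 0.2, 0.1],
                    [0.3, 0.8, 0.9]])
    PY  = PYX @ PX
    B   = np.diag(1 / np.sqrt(PY)) @ PYX @ np.diag(np.sqrt(PX))

    # Top singular value of B is 1 with right singular vector ±sqrt(P_X); the
    # constraint K_X ⟂ sqrt(P_X) therefore picks out the second singular vector.
    U, s, Vt = np.linalg.svd(B)
    eta_chi2 = s[1] ** 2
    K_star   = Vt[1]

    # Local model selection: P_{X|U}(.|±1) = P_X ± eps * sqrt(P_X) * K*_X.
    eps = 0.05
    print("eta_chi2 =", eta_chi2)
    print(PX + eps * np.sqrt(PX) * K_star)   # P_{X|U}(.|+1)
    print(PX - eps * np.sqrt(PX) * K_star)   # P_{X|U}(.|-1)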

Outline

1. Introduction to Contraction Coefficients
2. Motivation from Inference
3. Contraction Coefficients for KL and χ²-Divergences
4. Bounds between Contraction Coefficients: Contraction Coefficient Bound; Upper Bound on Contraction Coefficient of KL Divergence; Bounding KL Divergence with χ²-Divergence; Binary Symmetric Channel Example

Contraction Coefficient Bound

Theorem (Contraction Coefficient Bound) [Makur and Zheng, 2015]: For a fixed source distribution PX and channel PY|X, we have:

$$\eta_{\chi^2}\left(P_X, P_{Y|X}\right) \;\le\; \eta_{\mathsf{KL}}\left(P_X, P_{Y|X}\right) \;\le\; \frac{\eta_{\chi^2}\left(P_X, P_{Y|X}\right)}{\min_{x \in \mathcal{X}} P_X(x)}.$$

Remark: Our local model selection method cannot perform "too poorly."

Lower Bound:

$$\underbrace{\lim_{\epsilon \to 0} \; \sup_{\substack{R_X : R_X \neq P_X \\ D(R_X \| P_X) = \frac{1}{2}\epsilon^2}} \frac{D(R_Y \| P_Y)}{D(R_X \| P_X)}}_{\eta_{\chi^2}(P_X,\, P_{Y|X})} \;\le\; \underbrace{\sup_{R_X : R_X \neq P_X} \frac{D(R_Y \| P_Y)}{D(R_X \| P_X)}}_{\eta_{\mathsf{KL}}(P_X,\, P_{Y|X})}$$

This result is known in the literature, and the inequality can be strict, as demonstrated in [Anantharam et al., 2013].

Upper Bound on Contraction Coefficient of KL Divergence

Upper Bound Proof Sketch: Suppose we have

D(RY‖PY) ≤ α ‖BKX‖₂², for some α,
D(RX‖PX) ≥ β ‖KX‖₂², for some β,

where ∀x ∈ 𝒳, RX(x) = PX(x) + √PX(x) KX(x). Then we can prove an upper bound because:

$$\frac{D(R_Y \| P_Y)}{D(R_X \| P_X)} \;\le\; \frac{\alpha}{\beta} \cdot \frac{\|B K_X\|_2^2}{\|K_X\|_2^2}.$$

Bounding KL Divergence with χ²-Divergence

KL Divergence Lower Bound:

[Figure: a convex function G over a convex set Q, with its tangent "plane" G(y₀) + ∇G(y₀)ᵀ(y − y₀) at a point y₀; the Bregman divergence G(y₁) − G(y₀) − ∇G(y₀)ᵀ(y₁ − y₀) is the gap between G and this tangent at y₁.]

Bregman Divergence: Given convex F : P → R:

$$\forall x_1, x_0 \in \mathcal{P}, \quad B_F(x_1, x_0) \triangleq F(x_1) - F(x_0) - \nabla F(x_0)^T (x_1 - x_0).$$

Bounding KL Divergence with χ²-Divergence

KL Divergence Lower Bound: Let Hn : P𝒳 → R be the negative Shannon entropy function:

$$\forall Q \in \mathcal{P}_{\mathcal{X}}, \quad H_n(Q) \triangleq \sum_{x \in \mathcal{X}} Q(x) \log\left(Q(x)\right).$$

KL divergence is a Bregman divergence [Banerjee et al., 2005]:

$$D(R_X \| P_X) = H_n(R_X) - H_n(P_X) - \nabla H_n(P_X)^T (R_X - P_X).$$

Hn : P𝒳 → R is strongly convex because ∇²Hn(Q) = diag(Q)⁻¹ ⪰ I, where I denotes the identity matrix. Hence:

$$H_n(R_X) \ge H_n(P_X) + \nabla H_n(P_X)^T (R_X - P_X) + \frac{1}{2} \|R_X - P_X\|_2^2$$

$$\Rightarrow \quad D(R_X \| P_X) \ge \frac{1}{2} \|R_X - P_X\|_2^2.$$

Using ∀x ∈ 𝒳, RX(x) = PX(x) + √PX(x) KX(x), we see that:

$$D(R_X \| P_X) \ge \frac{1}{2} \|R_X - P_X\|_2^2 \ge \frac{\min_{x \in \mathcal{X}} P_X(x)}{2} \|K_X\|_2^2.$$

Bounding KL Divergence with χ²-Divergence

Lemma (KL Divergence Lower Bound): Given pmfs PX and RX, we have:

$$D(R_X \| P_X) \ge \frac{\min_{x \in \mathcal{X}} P_X(x)}{2} \|K_X\|_2^2$$

where ∀x ∈ 𝒳, RX(x) = PX(x) + √PX(x) KX(x). This can be improved to:

Lemma (KL Divergence Lower Bound, improved): Given pmfs PX and RX, we have:

$$D(R_X \| P_X) \ge \min_{x \in \mathcal{X}} P_X(x) \, \|K_X\|_2^2.$$

Bounding KL Divergence with χ²-Divergence

Lemma (KL Divergence Upper Bound): Given pmfs PX and RX, we have:

$$D(R_X \| P_X) \le \log\left(1 + \|K_X\|_2^2\right) \le \|K_X\|_2^2$$

where ∀x ∈ 𝒳, RX(x) = PX(x) + √PX(x) KX(x).

Proof: By Jensen's inequality,

$$D(R_X \| P_X) = \mathbb{E}_{R_X}\left[\log \frac{R_X(X)}{P_X(X)}\right] \le \log\left(\mathbb{E}_{R_X}\left[\frac{R_X(X)}{P_X(X)}\right]\right).$$

Simplify:

$$\mathbb{E}_{R_X}\left[\frac{R_X(X)}{P_X(X)}\right] = \sum_{x \in \mathcal{X}} \frac{R_X(x)^2}{P_X(x)} = 1 + \|K_X\|_2^2.$$

Hence, we have D(RX‖PX) ≤ log(1 + ‖KX‖₂²) ≤ ‖KX‖₂², using the fact that ∀x > −1, log(1 + x) ≤ x.
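
Both lemmas are easy to spot-check numerically (random pmfs, natural-log KL; a sketch, not a proof):

    import numpy as np

    rng = np.random.default_rng(2)

    def kl(R, P):
        return float(np.sum(R * np.log(R / P)))

    for _ in range(1000):
        P = rng.dirichlet(np.ones(4))
        R = rng.dirichlet(np.ones(4))
        K = (R - P) / np.sqrt(P)      # so that R = P + sqrt(P) * K
        k2 = float(K @ K)             # ||K||_2^2 = chi^2(R, P)
        # improved lower bound <= D(R||P) <= log(1 + ||K||^2)
        assert P.min() * k2 <= kl(R, P) <= np.log1p(k2) + 1e-12
    print("lower and upper bounds hold on all sampled pmfs")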

Contraction Coefficient Bound

For a fixed source distribution PX and channel PY|X, we have:

$$D(R_X \| P_X) \ge \min_{x \in \mathcal{X}} P_X(x) \, \|K_X\|_2^2 \qquad \text{and} \qquad D(R_Y \| P_Y) \le \|B K_X\|_2^2$$

where RY is the output when RX passes through PY|X, and B = diag(√PY)⁻¹ · PY|X · diag(√PX). Combining the two bounds gives:

Theorem (Contraction Coefficient Bound) [Makur and Zheng, 2015]: For a fixed source distribution PX and channel PY|X, we have:

$$\eta_{\chi^2}\left(P_X, P_{Y|X}\right) \le \eta_{\mathsf{KL}}\left(P_X, P_{Y|X}\right) \le \frac{\eta_{\chi^2}\left(P_X, P_{Y|X}\right)}{\min_{x \in \mathcal{X}} P_X(x)}.$$

Example of Contraction Coefficient Bound

Binary Symmetric Channel Bounds:

$$\eta_{\chi^2}\left(P_X, P_{Y|X}\right) \le \eta_{\mathsf{KL}}\left(P_X, P_{Y|X}\right) \le \frac{\eta_{\chi^2}\left(P_X, P_{Y|X}\right)}{\min_{x \in \mathcal{X}} P_X(x)}$$
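
A numerical sketch of these two bounds for a binary symmetric channel (crossover δ and input pmfs assumed here; for the uniform input, ηχ² should come out as (1 − 2δ)², the squared maximal correlation):

    import numpy as np

    def bounds(PX, PYX):
        # Returns (eta_chi2, eta_chi2 / min_x PX(x)): the sandwich on eta_KL.
        PY = PYX @ PX
        B = np.diag(1 / np.sqrt(PY)) @ PYX @ np.diag(np.sqrt(PX))
        s = np.linalg.svd(B, compute_uv=False)
        eta = s[1] ** 2
        return eta, eta / PX.min()

    delta = 0.1                                 # BSC crossover probability
    PYX = np.array([[1 - delta, delta],
                    [delta, 1 - delta]])
    for px in [0.5, 0.3, 0.1]:
        lo, hi = bounds(np.array([px, 1 - px]), PYX)
        print(px, lo, hi)                       # eta_chi2 <= eta_KL <= eta_chi2 / min PX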

Conclusion

Theorem (Contraction Coefficient Bound) [Makur and Zheng, 2015]: For a fixed source distribution PX and channel PY|X, we have:

$$\eta_{\chi^2}\left(P_X, P_{Y|X}\right) \le \eta_{\mathsf{KL}}\left(P_X, P_{Y|X}\right) \le \frac{\eta_{\chi^2}\left(P_X, P_{Y|X}\right)}{\min_{x \in \mathcal{X}} P_X(x)}.$$

Summary:
- The contraction coefficient for KL divergence can perform model selection, but there is no simple algorithm for computing it.
- The contraction coefficient for χ²-divergence performs (suboptimal) model selection using the SVD.
- Bounds exist between these contraction coefficients.


References

- Amari, S. and Cichocki, A. (2010). Information geometry of divergence functions. Bulletin of the Polish Academy of Sciences, Technical Sciences, 58(1):183-195.
- Anantharam, V., Gohari, A., Kamath, S., and Nair, C. (2013). On maximal correlation, hypercontractivity, and the data processing inequality studied by Erkip and Cover. arXiv:1304.6133 [cs.IT].
- Banerjee, A., Merugu, S., Dhillon, I. S., and Ghosh, J. (2005). Clustering with Bregman divergences. Journal of Machine Learning Research, 6:1705-1749.
- Makur, A., Kozynski, F., Huang, S.-L., and Zheng, L. (2015). An efficient algorithm for information decomposition and extraction. In Proceedings of the 53rd Annual Allerton Conference on Communication, Control, and Computing, pages 972-979, Allerton House, UIUC, Illinois, USA.
- Makur, A. and Zheng, L. (2015). Bounds between contraction coefficients. In Proceedings of the 53rd Annual Allerton Conference on Communication, Control, and Computing, pages 1422-1429, Allerton House, UIUC, Illinois, USA.
- Polyanskiy, Y. and Wu, Y. (2016). Dissipation of information in channels with input constraints. IEEE Transactions on Information Theory, 62(1):35-55.