Lecture 5: Measures of Information for Continuous Random Variables


SLIDE 1

Lecture 5: Measures of Information for Continuous Random Variables

I-Hsiang Wang

Department of Electrical Engineering, National Taiwan University, ihwang@ntu.edu.tw

October 26, 2015

SLIDE 2

From Discrete to Continuous

So far we have focused on discrete (finite-alphabet) r.v.'s:

- Entropy and mutual information for discrete r.v.'s.
- Lossless source coding for discrete stationary sources.
- Channel coding over discrete memoryless channels.

In this lecture and the next two lectures, we extend the basic principles and fundamental theorems to continuous random sources and channels. In particular:

- Mutual information for continuous r.v.'s. (this lecture)
- Lossy source coding for continuous stationary sources. (Lecture 7)
- Gaussian channel capacity. (Lecture 6)

SLIDE 3

Outline

1. First we investigate basic information measures – entropy, mutual information, and KL divergence – when the r.v.'s are continuous. We will see that both mutual information and KL divergence are well defined, while the entropy of a continuous r.v. is not.

2. Then, we introduce differential entropy as the continuous-r.v. counterpart of Shannon entropy, and discuss its related properties.

SLIDE 4

1. Entropy and Mutual Information
2. Differential Entropy

SLIDE 5

Entropy of a Continuous Random Variable

Question: What is the entropy of a continuous real-valued random variable X? Suppose X has the probability density function (p.d.f.) $f(x)$. Let us discretize X to answer this question, as follows:

- Partition $\mathbb{R}$ into length-$\Delta$ intervals: $\mathbb{R} = \bigcup_{k=-\infty}^{\infty} [k\Delta, (k+1)\Delta)$.
- Suppose $f(x)$ is continuous. Then, by the mean value theorem, $\forall\, k \in \mathbb{Z}$, $\exists\, x_k \in [k\Delta, (k+1)\Delta)$ such that $f(x_k) = \frac{1}{\Delta}\int_{k\Delta}^{(k+1)\Delta} f(x)\,dx$.
- Set $[X]_\Delta \triangleq x_k$ if $X \in [k\Delta, (k+1)\Delta)$, with p.m.f. $p(x_k) = f(x_k)\,\Delta$.
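A minimal numerical sketch of this quantizer, assuming $X \sim N(0,1)$ (an example not from the slides): by the MVT, the interval probability equals $f(x_k)\Delta$, so we can build the p.m.f. of $[X]_\Delta$ directly from interval masses. The names `delta`, `k`, and `pmf` are illustrative.

```python
# Sketch: discretize X ~ N(0,1) into length-delta intervals [k*delta, (k+1)*delta).
import numpy as np
from scipy.stats import norm

delta = 0.1
k = np.arange(-80, 80)                                   # covers [-8, 8)
# P{X in [k*delta, (k+1)*delta)} = f(x_k) * delta by the mean value theorem
pmf = norm.cdf((k + 1) * delta) - norm.cdf(k * delta)

print(pmf.sum())                                         # ~1: a valid p.m.f.
```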

SLIDE 6

[Figure: the p.d.f. $f(x)$ of $X$, where $F_X(x) \triangleq P\{X \le x\}$ is the c.d.f. and $f(x) \triangleq \frac{dF_X(x)}{dx}$.]

SLIDE 7

[Figure: the p.d.f. $f(x)$ with the real line partitioned into length-$\Delta$ intervals.]

SLIDE 8

[Figure: the p.d.f. $f(x)$ with the mean-value points $x_1, x_3, x_5, \dots$ marked inside the length-$\Delta$ intervals.]

SLIDE 9

Observation: $\lim_{\Delta \to 0} H([X]_\Delta) = H(X)$ (intuitively), while

$$H([X]_\Delta) = -\sum_{k=-\infty}^{\infty} \big(f(x_k)\Delta\big) \log\big(f(x_k)\Delta\big) = -\Delta \sum_{k=-\infty}^{\infty} f(x_k) \log f(x_k) - \log \Delta \;\to\; -\int_{-\infty}^{\infty} f(x) \log f(x)\,dx + \infty = \infty \quad \text{as } \Delta \to 0.$$

Hence, $H(X) = \infty$ if $-\int_{-\infty}^{\infty} f(x) \log f(x)\,dx = \mathbb{E}\big[\log \frac{1}{f(X)}\big]$ exists.

It is quite intuitive that the entropy of a continuous random variable can be arbitrarily large, because it can take infinitely many possible values.
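A rough numerical check of this observation (a sketch, again assuming $X \sim N(0,1)$, entropies in nats): $H([X]_\Delta)$ should blow up like $-\log\Delta$, while $H([X]_\Delta) + \log\Delta$ should approach $h(X) = \tfrac{1}{2}\log(2\pi e) \approx 1.4189$.

```python
# Sketch: H([X]_delta) grows like -log(delta), while H([X]_delta) + log(delta)
# converges to -∫ f log f dx = 0.5*log(2*pi*e) ≈ 1.4189 nats.
import numpy as np
from scipy.stats import norm

for delta in [0.5, 0.1, 0.01, 0.001]:
    k = np.arange(int(-10 / delta), int(10 / delta))
    p = norm.cdf((k + 1) * delta) - norm.cdf(k * delta)  # p.m.f. of [X]_delta
    p = p[p > 0]
    H = -np.sum(p * np.log(p))                           # discrete entropy (nats)
    print(delta, H, H + np.log(delta))                   # last column -> 1.4189
```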

SLIDE 10

Mutual Information between Continuous Random Variables

How about the mutual information between two continuous r.v.'s X and Y, with joint p.d.f. $f_{X,Y}(x, y)$ and marginal p.d.f.'s $f_X(x)$ and $f_Y(y)$? Again, we use discretization:

- Partition the plane $\mathbb{R}^2$ into $\Delta \times \Delta$ squares: $\mathbb{R}^2 = \bigcup_{k,j=-\infty}^{\infty} I^\Delta_k \times I^\Delta_j$, where $I^\Delta_k \triangleq [k\Delta, (k+1)\Delta)$.
- Suppose $f_{X,Y}(x, y)$ is continuous. Then, by the mean value theorem (MVT), $\forall\, k, j \in \mathbb{Z}$, $\exists\, (x_k, y_j) \in I^\Delta_k \times I^\Delta_j$ such that $f_{X,Y}(x_k, y_j) = \frac{1}{\Delta^2}\iint_{I^\Delta_k \times I^\Delta_j} f_{X,Y}(x, y)\,dx\,dy$.
- Set $([X]_\Delta, [Y]_\Delta) \triangleq (x_k, y_j)$ if $(X, Y) \in I^\Delta_k \times I^\Delta_j$, with p.m.f. $p(x_k, y_j) = f_{X,Y}(x_k, y_j)\,\Delta^2$.

SLIDE 11

By the MVT, $\forall\, k, j \in \mathbb{Z}$, $\exists\, \tilde{x}_k \in I^\Delta_k$ and $\tilde{y}_j \in I^\Delta_j$ such that

$$p(x_k) = \int_{I^\Delta_k} f_X(x)\,dx = \Delta f_X(\tilde{x}_k), \qquad p(y_j) = \int_{I^\Delta_j} f_Y(y)\,dy = \Delta f_Y(\tilde{y}_j).$$

Observation: $\lim_{\Delta \to 0} I([X]_\Delta; [Y]_\Delta) = I(X; Y)$ (intuitively), while

$$I([X]_\Delta; [Y]_\Delta) = \sum_{k,j=-\infty}^{\infty} p(x_k, y_j) \log \frac{p(x_k, y_j)}{p(x_k)\,p(y_j)} = \sum_{k,j=-\infty}^{\infty} \big(f_{X,Y}(x_k, y_j)\,\Delta^2\big) \log \frac{f_{X,Y}(x_k, y_j)\,\Delta^2}{f_X(\tilde{x}_k)\,f_Y(\tilde{y}_j)\,\Delta^2}$$
$$= \Delta^2 \sum_{k,j=-\infty}^{\infty} f_{X,Y}(x_k, y_j) \log \frac{f_{X,Y}(x_k, y_j)}{f_X(\tilde{x}_k)\,f_Y(\tilde{y}_j)} \;\to\; \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} f_{X,Y}(x, y) \log \frac{f_{X,Y}(x, y)}{f_X(x)\,f_Y(y)}\,dx\,dy \quad \text{as } \Delta \to 0.$$

Hence, $I(X; Y) = \mathbb{E}\big[\log \frac{f_{X,Y}(X,Y)}{f_X(X)\,f_Y(Y)}\big]$ if the improper integral exists.
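As a sanity check (a sketch, not part of the lecture), this discretization can be carried out numerically for a standard bivariate Gaussian with correlation $\rho$, where the closed form $I(X;Y) = -\tfrac{1}{2}\log(1-\rho^2)$ is known; the grid range and $\Delta$ below are illustrative choices.

```python
# Sketch: discretize a standard bivariate Gaussian (correlation rho) on
# delta x delta squares and compare I([X]_delta; [Y]_delta) with the known
# closed form I(X;Y) = -0.5*log(1 - rho^2) (in nats).
import numpy as np

rho, delta = 0.8, 0.05
x = np.arange(-6, 6, delta) + delta / 2                  # square centers
X, Y = np.meshgrid(x, x)
f = np.exp(-(X**2 - 2*rho*X*Y + Y**2) / (2*(1 - rho**2))) \
    / (2*np.pi*np.sqrt(1 - rho**2))                      # joint p.d.f.
p = f * delta**2                                         # p(x_k, y_j)
py = p.sum(axis=1, keepdims=True)                        # marginal of Y (rows)
px = p.sum(axis=0, keepdims=True)                        # marginal of X (cols)
I = np.sum(p * np.log(p / (py * px)))
print(I, -0.5 * np.log(1 - rho**2))                      # both ≈ 0.5108
```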

SLIDE 12

Mutual Information

Unlike entropy, which is only well defined for discrete random variables, the mutual information between two real-valued random variables (not necessarily continuous or discrete) can be defined in general as follows.

Definition 1 (Mutual information). The mutual information between two random variables X and Y is defined as
$$I(X; Y) \triangleq \sup_{\mathcal{P}, \mathcal{Q}} I([X]_\mathcal{P}; [Y]_\mathcal{Q}),$$
where the supremum is taken over all pairs of partitions $\mathcal{P}$ and $\mathcal{Q}$ of $\mathbb{R}$.

Similar to mutual information, KL divergence can also be defined between two probability measures, whether the probability distributions are discrete, continuous, or neither.

Remark: Although defining information measures in such a general way is appealing, these definitions do not provide explicit ways to compute these information measures.

SLIDE 13

1. Entropy and Mutual Information
2. Differential Entropy

SLIDE 14

Differential Entropy

For continuous r.v.'s, it turns out to be useful to define the counterparts of entropy and conditional entropy, as follows:

Definition 2 (Differential entropy and conditional differential entropy). The differential entropy of a continuous r.v. X with p.d.f. $f(x)$ is defined as
$$h(X) \triangleq \mathbb{E}\Big[\log \frac{1}{f(X)}\Big]$$
if the (improper) integral exists. The conditional differential entropy of a continuous r.v. X given Y, where $(X, Y)$ has joint p.d.f. $f(x, y)$ and conditional p.d.f. $f(x|y)$, is defined as
$$h(X|Y) \triangleq \mathbb{E}\Big[\log \frac{1}{f(X|Y)}\Big]$$
if the (improper) integral exists.

We have the following theorem immediately from the previous discussion:

Theorem 1 (Mutual information between two continuous r.v.'s).
$$I(X; Y) = \mathbb{E}\Big[\log \frac{f(X,Y)}{f(X)\,f(Y)}\Big] = h(X) - h(X|Y).$$
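Since $h(X)$ is an expectation, it can be estimated by Monte Carlo: sample from $f$ and average $\log\frac{1}{f(X)}$. A minimal sketch, assuming $X \sim N(0,1)$ so that the known value $\tfrac{1}{2}\log(2\pi e) \approx 1.4189$ nats can be checked:

```python
# Sketch: Monte Carlo estimate of h(X) = E[log 1/f(X)] for X ~ N(0,1).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)
h_mc = -np.mean(norm.logpdf(x))              # sample mean of log(1/f(X)), nats
print(h_mc, 0.5 * np.log(2 * np.pi * np.e))  # ≈ 1.4189 either way
```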

SLIDE 15

Kullback-Leibler Divergence

Definition 3 (KL divergence between densities). The Kullback-Leibler divergence between two probability density functions $f(x)$ and $g(x)$ is defined as
$$D(f\,\|\,g) \triangleq \mathbb{E}\Big[\log \frac{f(X)}{g(X)}\Big]$$
if the (improper) integral exists, where the expectation is taken over $X \sim f(x)$.

By Jensen's inequality, it is straightforward to see that the non-negativity of KL divergence remains:

Proposition 1 (Non-negativity of KL divergence). $D(f\,\|\,g) \ge 0$, with equality iff $f = g$ almost everywhere (i.e., except on a set of zero probability).

Note: $D(f\,\|\,g)$ is finite only if the support of $f(x)$ is contained in the support of $g(x)$.
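A numerical sketch of Definition 3 and Proposition 1 for two Gaussian densities; the reference value uses the standard Gaussian KL identity $D(N(\mu_1,\sigma_1^2)\,\|\,N(\mu_2,\sigma_2^2)) = \log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1-\mu_2)^2}{2\sigma_2^2} - \frac{1}{2}$, which is not derived on these slides.

```python
# Sketch: Monte Carlo estimate of D(f||g) = E_f[log f(X)/g(X)] for
# f = N(0,1), g = N(1,4), versus the Gaussian closed form (nats).
import numpy as np
from scipy.stats import norm

f, g = norm(0, 1), norm(1, 2)                 # scipy uses (loc, scale = std)
x = f.rvs(size=1_000_000, random_state=0)
D_mc = np.mean(f.logpdf(x) - g.logpdf(x))     # >= 0, per Proposition 1
D_cf = np.log(2 / 1) + (1 + (0 - 1)**2) / (2 * 4) - 0.5
print(D_mc, D_cf)                             # both ≈ 0.4431
```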

SLIDE 16

Properties that Extend to Continuous R.V.’s

Proposition 2 (Chain rule).
$$h(X, Y) = h(X) + h(Y|X), \qquad h(X^n) = \sum_{i=1}^{n} h(X_i \mid X^{i-1}).$$

Proposition 3 (Conditioning reduces differential entropy).
$$h(X|Y) \le h(X), \qquad h(X|Y, Z) \le h(X|Z).$$

Proposition 4 (Non-negativity of mutual information).
$$I(X; Y) \ge 0, \qquad I(X; Y|Z) \ge 0.$$
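These propositions can be sanity-checked in closed form for a jointly Gaussian pair with unit variances and correlation $\rho$ (a sketch using Gaussian entropy formulas, assumed here and computed in the examples that follow): since $Y|X{=}x \sim N(\rho x, 1-\rho^2)$, we have $h(Y|X) = \tfrac{1}{2}\log(2\pi e(1-\rho^2))$.

```python
# Sketch: verify the chain rule and "conditioning reduces differential
# entropy" for (X, Y) jointly Gaussian, unit variances, correlation rho.
import numpy as np

rho = 0.8
c = 0.5 * np.log(2 * np.pi * np.e)          # h of a unit-variance Gaussian
h_X = h_Y = c
h_Y_given_X = c + 0.5 * np.log(1 - rho**2)  # Y|X=x ~ N(rho*x, 1 - rho^2)
h_XY = 2 * c + 0.5 * np.log(1 - rho**2)     # 0.5*log((2*pi*e)^2 * det K)
print(np.isclose(h_XY, h_X + h_Y_given_X))  # chain rule holds: True
print(h_Y_given_X <= h_Y)                   # conditioning reduces h: True
```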

SLIDE 17

Examples

Example 1 (Differential entropy of a uniform r.v.). For a r.v. $X \sim \mathrm{Unif}[a, b]$, that is, with p.d.f. $f_X(x) = \frac{1}{b-a}\mathbb{1}\{a \le x \le b\}$, the differential entropy is $h(X) = \log(b - a)$.

Example 2 (Differential entropy of $N(0, 1)$). For a r.v. $X \sim N(0, 1)$, that is, with p.d.f. $f_X(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$, the differential entropy is $h(X) = \frac{1}{2}\log(2\pi e)$.
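Both examples can be verified by numerically integrating $-f \log f$ (a sketch in nats; the endpoints $a, b$ below are arbitrary illustrative choices):

```python
# Sketch: numerically integrate -f(x) log f(x) for Examples 1 and 2 (nats).
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

a, b = 2.0, 5.0                                          # Unif[a, b]
h_unif, _ = quad(lambda x: -(1 / (b - a)) * np.log(1 / (b - a)), a, b)
print(h_unif, np.log(b - a))                             # both ≈ log 3 ≈ 1.0986

h_gauss, _ = quad(lambda x: -norm.pdf(x) * norm.logpdf(x), -np.inf, np.inf)
print(h_gauss, 0.5 * np.log(2 * np.pi * np.e))           # both ≈ 1.4189
```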

SLIDE 18

New Properties of Differential Entropy

Differential entropy can be negative. Since $b - a$ can be made arbitrarily small, $h(X) = \log(b - a)$ can be negative. Hence, the non-negativity of entropy does not extend to differential entropy.

Scaling changes the differential entropy. Consider $X \sim \mathrm{Unif}[0, 1]$. Then $2X \sim \mathrm{Unif}[0, 2]$. Hence (with log base 2),
$$h(X) = \log 1 = 0, \quad h(2X) = \log 2 = 1 \implies h(X) \ne h(2X).$$
This is in sharp contrast to entropy: $H(X) = H(g(X))$ as long as $g(\cdot)$ is an invertible function.

SLIDE 19

Scaling and Translation

Proposition 5 (Scaling and translation, scalar case). Let X be a continuous random variable with differential entropy $h(X)$.
- Translation does not change the differential entropy: for a constant c, $h(X + c) = h(X)$.
- Scaling shifts the differential entropy: for a constant $a \ne 0$, $h(aX) = h(X) + \log|a|$.

Proposition 6 (Scaling and translation, vector case). Let $\mathbf{X}$ be a continuous random vector with differential entropy $h(\mathbf{X})$.
- For a constant vector $\mathbf{c}$, $h(\mathbf{X} + \mathbf{c}) = h(\mathbf{X})$.
- For an invertible matrix $A \in \mathbb{R}^{n \times n}$, $h(A\mathbf{X}) = h(\mathbf{X}) + \log|\det A|$.

The proofs of these propositions are left as exercises (simple calculus).
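A quick Monte Carlo sketch of Proposition 5, assuming $X \sim N(0,1)$ (not from the slides): if $Y = aX + c$ then $Y \sim N(c, a^2)$, and a plug-in estimate of $h(Y) - h(X)$ should equal $\log|a|$, independent of c. The values of `a` and `c` are illustrative.

```python
# Sketch: h(aX + c) - h(X) = log|a| for X ~ N(0,1), estimated by Monte Carlo.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
a, c = 3.0, -7.0                                  # illustrative constants
x = rng.standard_normal(500_000)
y = a * x + c                                     # Y ~ N(c, a^2)
h_X = -np.mean(norm.logpdf(x))                    # plug-in estimate of h(X)
h_Y = -np.mean(norm.logpdf(y, loc=c, scale=abs(a)))
print(h_Y - h_X, np.log(abs(a)))                  # both ≈ log 3 ≈ 1.0986
```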

SLIDE 20

Differential Entropy of Gaussian Random Vectors

Example 3 (Differential entropy of Gaussian random vectors). For an n-dim random vector $\mathbf{X} \sim N(\mathbf{m}, K)$, the differential entropy is
$$h(\mathbf{X}) = \tfrac{1}{2}\log\big((2\pi e)^n \det K\big).$$

Sol: For an n-dim random vector $\mathbf{X} \sim N(\mathbf{m}, K)$, we can rewrite $\mathbf{X}$ as $\mathbf{X} = A\mathbf{W} + \mathbf{m}$, where $AA^\mathsf{T} = K$ and $\mathbf{W}$ consists of i.i.d. $W_i \sim N(0, 1)$, $i = 1, \dots, n$. Hence, by the translation and scaling properties of differential entropy:
$$h(\mathbf{X}) = h(\mathbf{W}) + \log|\det A| = \sum_{i=1}^{n} h(W_i) + \tfrac{1}{2}\log(\det K) = \tfrac{n}{2}\log(2\pi e) + \tfrac{1}{2}\log(\det K) = \tfrac{1}{2}\log\big((2\pi e)^n \det K\big),$$
where $\log|\det A| = \tfrac{1}{2}\log(\det K)$ since $\det K = \det(AA^\mathsf{T}) = (\det A)^2$.
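A check of Example 3 (a sketch; the covariance $K$ below is an arbitrary positive-definite example): scipy's `multivariate_normal` reports differential entropy in nats, which should match $\tfrac{1}{2}\log\big((2\pi e)^n \det K\big)$.

```python
# Sketch: compare Example 3's formula with scipy's Gaussian entropy (nats).
import numpy as np
from scipy.stats import multivariate_normal

K = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.0, 0.3],
              [0.0, 0.3, 1.5]])                  # positive definite
n = K.shape[0]
h_formula = 0.5 * np.log((2 * np.pi * np.e) ** n * np.linalg.det(K))
h_scipy = multivariate_normal(mean=np.zeros(n), cov=K).entropy()
print(h_formula, h_scipy)                        # equal up to rounding
```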

SLIDE 21

Maximum Differential Entropy

Recall that the uniform distribution maximizes entropy for a r.v. with finite support. For differential entropy, the maximization problem needs to be associated with constraints on the distribution (otherwise, it is simple to make it infinite). The following theorem asserts that, under a second-moment constraint, the zero-mean Gaussian maximizes the differential entropy.

Theorem 2 (Maximum differential entropy under covariance constraint). Let $\mathbf{X}$ be a random vector with mean $\mathbf{m}$ and covariance matrix $\mathbb{E}\big[(\mathbf{X} - \mathbf{m})(\mathbf{X} - \mathbf{m})^\mathsf{T}\big] = K$, and let $\mathbf{X}_G$ be Gaussian with the same covariance K. Then,
$$h(\mathbf{X}) \le h(\mathbf{X}_G) = \tfrac{1}{2}\log\big((2\pi e)^n \det K\big).$$

SLIDE 22

Pf: First, we can assume WLOG that both $\mathbf{X}$ and $\mathbf{X}_G$ are zero-mean. Let the p.d.f. of $\mathbf{X}$ be $f(\mathbf{x})$ and the p.d.f. of $\mathbf{X}_G$ be $f_G(\mathbf{x})$. Hence,
$$0 \le D(f\,\|\,f_G) = \mathbb{E}_f[\log f(\mathbf{X})] - \mathbb{E}_f[\log f_G(\mathbf{X})] = -h(\mathbf{X}) - \mathbb{E}_f[\log f_G(\mathbf{X})].$$
Note that $\log f_G(\mathbf{x})$ is a quadratic function of $\mathbf{x}$, and $\mathbf{X}, \mathbf{X}_G$ have the same second moments. Hence,
$$\mathbb{E}_f[\log f_G(\mathbf{X})] = \mathbb{E}_{f_G}[\log f_G(\mathbf{X}_G)] = -h(\mathbf{X}_G)$$
$$\implies 0 \le D(f\,\|\,f_G) = -h(\mathbf{X}) + h(\mathbf{X}_G) \implies h(\mathbf{X}) \le h(\mathbf{X}_G).$$
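To illustrate Theorem 2 in the scalar case ($n = 1$, nats): among unit-variance densities, the Gaussian has the largest differential entropy. A sketch comparing it against a unit-variance Laplace density, using the standard identity $h = 1 + \log(2b)$ for $\mathrm{Laplace}(b)$ (an identity assumed here, not derived on these slides):

```python
# Sketch: a unit-variance Laplace r.v. has strictly smaller differential
# entropy than the unit-variance Gaussian, as Theorem 2 predicts (nats).
import numpy as np

h_gauss = 0.5 * np.log(2 * np.pi * np.e)  # ≈ 1.4189
b = 1 / np.sqrt(2)                        # Laplace(b): variance 2*b^2 = 1
h_laplace = 1 + np.log(2 * b)             # ≈ 1.3466
print(h_laplace < h_gauss)                # True
```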

SLIDE 23

Summary

SLIDE 24

Mutual information between two continuous r.v.'s X and Y with joint density $f_{X,Y}$:
$$I(X; Y) = \mathbb{E}\Big[\log \frac{f_{X,Y}(X,Y)}{f_X(X)\,f_Y(Y)}\Big].$$
Differential entropy and conditional differential entropy:
$$h(X) \triangleq \mathbb{E}\Big[\log \frac{1}{f_X(X)}\Big], \qquad h(X|Y) \triangleq \mathbb{E}\Big[\log \frac{1}{f_{X|Y}(X|Y)}\Big].$$
$I(X; Y) = h(X) - h(X|Y) = h(Y) - h(Y|X)$.
KL divergence between densities f and g:
$$D(f\,\|\,g) \triangleq \mathbb{E}_f\Big[\log \frac{f(X)}{g(X)}\Big].$$
The chain rule, the fact that conditioning reduces differential entropy, and the non-negativity of mutual information and KL divergence all continue to hold.
Differential entropy can be negative; in particular, $h(X) \le h(X, Y)$ need not hold, since $h(Y|X)$ can be negative.
