

SLIDE 1

Chapter 8 Differential Entropy

Peng-Hua Wang

Graduate Inst. of Comm. Engineering, National Taipei University

SLIDE 2

Chapter Outline

  • Chap. 8 Differential Entropy

    8.1 Definitions
    8.2 AEP for Continuous Random Variables
    8.3 Relation of Differential Entropy to Discrete Entropy
    8.4 Joint and Conditional Differential Entropy
    8.5 Relative Entropy and Mutual Information
    8.6 Properties of Differential Entropy and Related Quantities

SLIDE 3

8.1 Definitions

SLIDE 4

Definitions

Definition 1 (Differential entropy) The differential entropy h(X) of a continuous random variable X with pdf f(x) is defined as

$$h(X) = -\int_S f(x)\log f(x)\,dx,$$

where S is the support region of the random variable.

Example. If X ∼ U(0, a), then

$$h(X) = -\int_0^a \frac{1}{a}\log\frac{1}{a}\,dx = \log a.$$
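As a quick numerical sanity check (my own sketch, not from the slides), the definition can be evaluated directly by quadrature for the uniform example:

```python
# Numerically integrate -f log f for X ~ U(0, a) and compare with log a.
import numpy as np
from scipy.integrate import quad

a = 4.0
f = lambda x: 1.0 / a                              # pdf of U(0, a) on (0, a)
h, _ = quad(lambda x: -f(x) * np.log(f(x)), 0, a)  # differential entropy (nats)

print(h, "vs log a =", np.log(a))                  # both ~ 1.3863
```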

SLIDE 5

Differential Entropy of Gaussian

  • Example. If X ∼ N(0, σ²) with pdf $\phi(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-x^2/(2\sigma^2)}$, then

$$h_a(X) = -\int \phi(x)\log_a \phi(x)\,dx = -\int \phi(x)\left[\log_a\frac{1}{\sqrt{2\pi\sigma^2}} - \frac{x^2}{2\sigma^2}\log_a e\right]dx$$

$$= \frac{1}{2}\log_a(2\pi\sigma^2) + \frac{\log_a e}{2\sigma^2}\,E_\phi[X^2] = \frac{1}{2}\log_a(2\pi e\sigma^2)$$
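A Monte Carlo sketch of the same computation in nats (my own illustration, not from the slides): the sample mean of −log φ(X) should approach ½ log(2πeσ²).

```python
# Estimate h(X) = E[-log phi(X)] for X ~ N(0, sigma^2) by sampling.
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0
x = rng.normal(0.0, sigma, size=1_000_000)

# -log phi(x) = (1/2) log(2*pi*sigma^2) + x^2 / (2*sigma^2)
neg_log_phi = 0.5 * np.log(2 * np.pi * sigma**2) + x**2 / (2 * sigma**2)

print(neg_log_phi.mean())                          # Monte Carlo estimate
print(0.5 * np.log(2 * np.pi * np.e * sigma**2))   # closed form ~ 2.1121
```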

SLIDE 6

Differential Entropy of Gaussian

  • Remark. If a random variable with pdf f(x) has zero mean and variance σ², then

$$-\int f(x)\log_a \phi(x)\,dx = -\int f(x)\left[\log_a\frac{1}{\sqrt{2\pi\sigma^2}} - \frac{x^2}{2\sigma^2}\log_a e\right]dx$$

$$= \frac{1}{2}\log_a(2\pi\sigma^2) + \frac{\log_a e}{2\sigma^2}\,E_f[X^2] = \frac{1}{2}\log_a(2\pi e\sigma^2)$$

Since log_a φ(x) is quadratic in x, this integral depends on f only through its mean and variance, so it equals the Gaussian entropy h_a(X) computed above.

SLIDE 7

Gaussian has Maximal Differential Entropy

Suppose that a random variable X with pdf f(x) has zero mean and variance σ². What is its maximal differential entropy? Let φ(x) be the pdf of N(0, σ²). Then

$$h(X) + \int f(x)\log\phi(x)\,dx = \int f(x)\log\frac{\phi(x)}{f(x)}\,dx \le \log\int f(x)\,\frac{\phi(x)}{f(x)}\,dx \quad \text{(Jensen's inequality; the logarithm is concave)}$$

$$= \log\int\phi(x)\,dx = 0$$

That is,

$$h(X) \le -\int f(x)\log\phi(x)\,dx = \frac{1}{2}\log(2\pi e\sigma^2)$$

and equality holds iff f(x) = φ(x) almost everywhere.
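To see the bound in action (my own sketch, not from the slides): the uniform density with the same variance falls strictly below the Gaussian bound.

```python
# Among zero-mean densities with variance 1, compare the uniform's
# entropy with the Gaussian bound (1/2) log(2*pi*e) (nats).
import numpy as np

gaussian_bound = 0.5 * np.log(2 * np.pi * np.e)  # ~ 1.4189
b = np.sqrt(3.0)                                 # Uniform(-b, b) has variance b**2/3 = 1
h_uniform = np.log(2 * b)                        # entropy of Uniform(-b, b) ~ 1.2425

print(h_uniform, "<=", gaussian_bound)
```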

SLIDE 8

8.2 AEP for Continuous Random Variables

SLIDE 9

AEP

Theorem 1 (AEP) Let X₁, X₂, ..., Xₙ be a sequence of i.i.d. random variables with common pdf f(x). Then

$$-\frac{1}{n}\log f(X_1, X_2, \ldots, X_n) \to E[-\log f(X)] = h(X)$$

in probability.

Definition 2 (Typical set) For ε > 0, the typical set $A_\epsilon^{(n)}$ with respect to f(x) is defined as

$$A_\epsilon^{(n)} = \left\{(x_1, x_2, \ldots, x_n) \in S^n : \left|-\frac{1}{n}\log f(x_1, x_2, \ldots, x_n) - h(X)\right| \le \epsilon\right\}$$
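A quick empirical look at the convergence (my own sketch, not from the slides): for i.i.d. standard normal samples, −(1/n) log f(X₁, ..., Xₙ) concentrates around h(X) = ½ log(2πe) as n grows.

```python
# Demonstrate the AEP for i.i.d. N(0, 1) samples (entropies in nats).
import numpy as np

rng = np.random.default_rng(1)
h = 0.5 * np.log(2 * np.pi * np.e)   # h(X) for the standard normal

for n in (10, 100, 10_000):
    x = rng.normal(size=n)
    # -(1/n) log f(x_1, ..., x_n) = mean of -log f(x_i) by independence
    est = np.mean(0.5 * np.log(2 * np.pi) + 0.5 * x**2)
    print(n, est, "-> h(X) =", h)
```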
SLIDE 10

AEP

Definition 3 (Volume) The volume Vol(A) of a set A ⊂ Rⁿ is defined as

$$\mathrm{Vol}(A) = \int_A dx_1\,dx_2\cdots dx_n$$

Theorem 2 (Properties of the typical set)
  1. $\Pr(A_\epsilon^{(n)}) > 1 - \epsilon$ for n sufficiently large.
  2. $\mathrm{Vol}(A_\epsilon^{(n)}) \le 2^{n(h(X)+\epsilon)}$ for all n.
  3. $\mathrm{Vol}(A_\epsilon^{(n)}) \ge (1-\epsilon)\,2^{n(h(X)-\epsilon)}$ for n sufficiently large.

SLIDE 11

8.4 Joint and Conditional Differential Entropy

SLIDE 12

Definitions

Definition 4 (Differential entropy) The differential entropy of jointly distributed random variables X₁, X₂, ..., Xₙ is defined as

$$h(X_1, X_2, \ldots, X_n) = -\int f(x^n)\log f(x^n)\,dx^n$$

where f(xⁿ) = f(x₁, x₂, ..., xₙ) is the joint pdf.

Definition 5 (Conditional differential entropy) The conditional differential entropy of jointly distributed random variables X, Y with joint pdf f(x, y) is defined as, if it exists,

$$h(X|Y) = -\int f(x, y)\log f(x|y)\,dx\,dy = h(X, Y) - h(Y)$$
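A numerical sketch of Definition 5 (my own check, not from the slides): for a bivariate normal, h(X|Y) = h(X, Y) − h(Y) matches ½ log(2πeσ²(1 − ρ²)), since the conditional variance of X given Y is σ²(1 − ρ²).

```python
# Verify h(X|Y) = h(X, Y) - h(Y) for a bivariate normal (nats).
import numpy as np
from scipy.stats import multivariate_normal, norm

sigma, rho = 1.5, 0.8
K = sigma**2 * np.array([[1.0, rho], [rho, 1.0]])

h_xy = multivariate_normal(mean=[0, 0], cov=K).entropy()  # h(X, Y)
h_y = norm(0, sigma).entropy()                            # h(Y)

print(h_xy - h_y)
print(0.5 * np.log(2 * np.pi * np.e * sigma**2 * (1 - rho**2)))
```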
SLIDE 13

Multivariate Normal Distribution

Theorem 3 (Entropy of a multivariate normal) Let X₁, X₂, ..., Xₙ have a multivariate normal distribution with mean vector µ and covariance matrix K. Then

$$h(X_1, X_2, \ldots, X_n) = \frac{1}{2}\log\left((2\pi e)^n |K|\right)$$

  • Proof. The joint pdf of a multivariate normal distribution is

$$\phi(\mathbf{x}) = \frac{1}{(2\pi)^{n/2}|K|^{1/2}}\, e^{-\frac{1}{2}(\mathbf{x}-\mu)^t K^{-1}(\mathbf{x}-\mu)}$$
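As a numerical sanity check of the theorem (my own sketch, separate from the proof that follows): the closed form agrees with scipy's built-in entropy for a multivariate normal, both in nats.

```python
# Compare (1/2) log((2*pi*e)^n |K|) with scipy's multivariate normal entropy.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(2)
A = rng.normal(size=(3, 3))
K = A @ A.T + 3 * np.eye(3)        # a positive definite covariance matrix

n = K.shape[0]
_, logdet = np.linalg.slogdet(K)   # log |K|, computed stably
closed_form = 0.5 * (n * np.log(2 * np.pi * np.e) + logdet)

print(closed_form)
print(multivariate_normal(mean=np.zeros(n), cov=K).entropy())
```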

SLIDE 14

Multivariate Normal Distribution

Therefore,

$$h(X_1, X_2, \ldots, X_n) = -\int \phi(\mathbf{x})\log_a\phi(\mathbf{x})\,d\mathbf{x} = \int \phi(\mathbf{x})\left[\frac{1}{2}\log_a\left((2\pi)^n|K|\right) + \frac{1}{2}(\mathbf{x}-\mu)^t K^{-1}(\mathbf{x}-\mu)\log_a e\right]d\mathbf{x}$$

$$= \frac{1}{2}\log_a\left((2\pi)^n|K|\right) + \frac{1}{2}(\log_a e)\,\underbrace{E\left[(\mathbf{X}-\mu)^t K^{-1}(\mathbf{X}-\mu)\right]}_{=\,n}$$

$$= \frac{1}{2}\log_a\left((2\pi)^n|K|\right) + \frac{n}{2}\log_a e = \frac{1}{2}\log_a\left((2\pi e)^n|K|\right)$$

SLIDE 15

Multivariate Normal Distribution

Let Y = (Y₁, Y₂, ..., Yₙ)ᵗ be a random vector. If K = E[YYᵗ], then E[YᵗK⁻¹Y] = n.

  • Proof. Write K in terms of its columns and K⁻¹ in terms of its rows:

$$K = E[\mathbf{Y}\mathbf{Y}^t] = \begin{pmatrix} \mathbf{k}_1 & \mathbf{k}_2 & \cdots & \mathbf{k}_n \end{pmatrix}, \qquad K^{-1} = \begin{pmatrix} \mathbf{a}_1^t \\ \mathbf{a}_2^t \\ \vdots \\ \mathbf{a}_n^t \end{pmatrix}$$

We have $\mathbf{k}_i = E[Y_i\mathbf{Y}]$ and $\mathbf{a}_j^t\mathbf{k}_i = \delta_{ij}$, since K⁻¹K = I.

SLIDE 16

Multivariate Normal Distribution

Now,

$$\mathbf{Y}^t K^{-1}\mathbf{Y} = \mathbf{Y}^t \begin{pmatrix} \mathbf{a}_1^t \\ \mathbf{a}_2^t \\ \vdots \\ \mathbf{a}_n^t \end{pmatrix}\mathbf{Y} = (Y_1, Y_2, \ldots, Y_n)\begin{pmatrix} \mathbf{a}_1^t\mathbf{Y} \\ \mathbf{a}_2^t\mathbf{Y} \\ \vdots \\ \mathbf{a}_n^t\mathbf{Y} \end{pmatrix} = Y_1\mathbf{a}_1^t\mathbf{Y} + Y_2\mathbf{a}_2^t\mathbf{Y} + \cdots + Y_n\mathbf{a}_n^t\mathbf{Y}$$

and

$$E[\mathbf{Y}^t K^{-1}\mathbf{Y}] = \mathbf{a}_1^t E[Y_1\mathbf{Y}] + \mathbf{a}_2^t E[Y_2\mathbf{Y}] + \cdots + \mathbf{a}_n^t E[Y_n\mathbf{Y}] = \mathbf{a}_1^t\mathbf{k}_1 + \mathbf{a}_2^t\mathbf{k}_2 + \cdots + \mathbf{a}_n^t\mathbf{k}_n = n$$
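The identity is easy to check numerically (my own sketch; equivalently, E[YᵗK⁻¹Y] = tr(K⁻¹E[YYᵗ]) = tr(I) = n):

```python
# Monte Carlo check that E[Y^t K^{-1} Y] = n for Y with covariance K.
import numpy as np

rng = np.random.default_rng(3)
n = 4
A = rng.normal(size=(n, n))
K = A @ A.T + n * np.eye(n)     # positive definite covariance

Y = rng.multivariate_normal(np.zeros(n), K, size=500_000)  # samples as rows
Kinv = np.linalg.inv(K)
quad = np.einsum("ij,jk,ik->i", Y, Kinv, Y)  # quadratic form per sample

print(quad.mean(), "~", n)
```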

SLIDE 17

8.5 Relative Entropy and Mutual Information

SLIDE 18

Definitions

Definition 6 (Relative entropy) The relative entropy (or Kullback-Leibler distance) D(f||g) between two densities f(x) and g(x) is defined as

$$D(f\|g) = \int f(x)\log\frac{f(x)}{g(x)}\,dx$$

Definition 7 (Mutual information) The mutual information I(X; Y) between two random variables with joint density f(x, y) is defined as

$$I(X; Y) = \int f(x, y)\log\frac{f(x, y)}{f(x)f(y)}\,dx\,dy$$

SLIDE 19

Example

Let (X, Y) ∼ N(0, K), where

$$K = \begin{pmatrix} \sigma^2 & \rho\sigma^2 \\ \rho\sigma^2 & \sigma^2 \end{pmatrix}.$$

Then $h(X) = h(Y) = \frac{1}{2}\log(2\pi e\sigma^2)$ and

$$h(X, Y) = \frac{1}{2}\log\left((2\pi e)^2|K|\right) = \frac{1}{2}\log\left((2\pi e)^2\sigma^4(1-\rho^2)\right).$$

Therefore,

$$I(X; Y) = h(X) + h(Y) - h(X, Y) = -\frac{1}{2}\log(1-\rho^2).$$
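A numerical check of the example (my own sketch, not from the slides): computing I(X; Y) from the three entropies reproduces −½ log(1 − ρ²).

```python
# I(X; Y) = h(X) + h(Y) - h(X, Y) for the bivariate normal example (nats).
import numpy as np
from scipy.stats import multivariate_normal, norm

sigma, rho = 1.0, 0.9
K = sigma**2 * np.array([[1.0, rho], [rho, 1.0]])

h_x = norm(0, sigma).entropy()                            # h(X) = h(Y)
h_xy = multivariate_normal(mean=[0, 0], cov=K).entropy()  # h(X, Y)

print(2 * h_x - h_xy)             # mutual information from entropies
print(-0.5 * np.log(1 - rho**2))  # closed form ~ 0.8304
```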

SLIDE 20

8.6 Properties of Differential Entropy and Related Quantities

SLIDE 21

Properties

Theorem 4 (Relative entropy)

D(f||g) ≥ 0

with equality iff f = g almost everywhere.

Corollary 1
  1. I(X; Y) ≥ 0, with equality iff X and Y are independent.
  2. h(X|Y) ≤ h(X), with equality iff X and Y are independent.
SLIDE 22

Properties

Theorem 5 (Chain rule for differential entropy)

$$h(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} h(X_i \mid X_1, X_2, \ldots, X_{i-1})$$

Corollary 2

$$h(X_1, X_2, \ldots, X_n) \le \sum_{i=1}^{n} h(X_i)$$

Corollary 3 (Hadamard's inequality) If K is the covariance matrix of a multivariate normal distribution, then

$$|K| \le \prod_{i=1}^{n} K_{ii}.$$
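Hadamard's inequality is easy to observe numerically (my own sketch, not from the slides):

```python
# det(K) <= product of diagonal entries for a covariance matrix K.
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(5, 5))
K = A @ A.T                      # positive semidefinite covariance

print(np.linalg.det(K), "<=", np.prod(np.diag(K)))
```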

SLIDE 23

Properties

Theorem 6
  1. h(X + c) = h(X).
  2. h(aX) = h(X) + log |a|.
  3. h(AX) = h(X) + log |det(A)|.
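Property 2 can be seen directly for a Gaussian (my own sketch, not from the slides): scaling multiplies the standard deviation by |a|, which adds log |a| to the entropy.

```python
# Check h(aX) = h(X) + log|a| for X ~ N(0, sigma^2) (nats).
import numpy as np
from scipy.stats import norm

sigma, a = 1.0, 3.0
h_x = norm(0, sigma).entropy()            # h(X)
h_ax = norm(0, abs(a) * sigma).entropy()  # h(aX): aX ~ N(0, a^2 sigma^2)

print(h_ax)
print(h_x + np.log(abs(a)))
```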
SLIDE 24

Gaussian has Maximal Entropy

Theorem 7 Let the random vector X ∈ Rⁿ have zero mean and covariance K = E[XXᵗ]. Then

$$h(\mathbf{X}) \le \frac{1}{2}\log\left((2\pi e)^n|K|\right),$$

with equality iff X ∼ N(0, K).

  • Proof. Let g(x) be any density satisfying $\int x_i x_j\, g(\mathbf{x})\,d\mathbf{x} = K_{ij}$. Let φ(x) be the density of N(0, K). Then

$$0 \le D(g\|\phi) = \int g\log(g/\phi) = -h(g) - \int g\log\phi = -h(g) - \int\phi\log\phi = -h(g) + h(\phi),$$

where $\int g\log\phi = \int\phi\log\phi$ because log φ(x) is a quadratic form in x and g and φ share the same second moments. That is, h(g) ≤ h(φ), and equality holds iff g = φ almost everywhere.