slide-1
SLIDE 1

Lecture 5 Continuous-Valued Sources and Channels

I-Hsiang Wang

Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw

November 4, 2016

1 / 63 I-Hsiang Wang IT Lecture 5

slide-2
SLIDE 2

From Discrete to Continuous

So far we have focused on discrete (finite-alphabet) sources and channels:
  Information measures for discrete r.v.'s.
  Lossless/lossy source coding for discrete stationary/memoryless sources.
  Channel coding over discrete memoryless channels.
In this lecture, we extend the basic principles and fundamental theorems to continuous sources and channels. In particular:
  Mutual information and information divergence for continuous r.v.'s.
  Channel coding over continuous memoryless channels.
  Lossy source coding for continuous memoryless sources.

2 / 63 I-Hsiang Wang IT Lecture 5

slide-3
SLIDE 3

Outline

1 First we investigate the basic information measures – entropy, mutual information, and KL divergence – when the r.v.'s are continuous. We will see that mutual information and KL divergence remain well defined, while the entropy of a continuous r.v. is not.

2 Then, we introduce differential entropy as the continuous counterpart of Shannon entropy, and discuss its related properties.

3 / 63 I-Hsiang Wang IT Lecture 5

slide-4
SLIDE 4

Measures of Information for Continuous Random Variables Entropy and Mutual Information

1 Measures of Information for Continuous Random Variables
    Entropy and Mutual Information
    Differential Entropy
2 Channel Coding over Continuous Memoryless Channels
    Continuous Memoryless Channel
    Gaussian Channel Capacity
3 Lossy Source Coding for Continuous Memoryless Sources

4 / 63 I-Hsiang Wang IT Lecture 5

slide-5
SLIDE 5

Measures of Information for Continuous Random Variables Entropy and Mutual Information

Entropy of a Continuous Random Variable

Question: What is the entropy of a continuous real-valued random variable X ?

Suppose X has the probability density function (p.d.f.) fX (·).

Let us discretize X to answer this question, as follows:

Partition R into length-∆ intervals: R = ∪_{k=−∞}^{∞} [k∆, (k+1)∆).

Suppose that f_X(·) is continuous (drop the subscript "X" below). Then by the mean-value theorem, ∀ k ∈ Z, ∃ x_k ∈ [k∆, (k+1)∆) such that

    f(x_k) = (1/∆) ∫_{k∆}^{(k+1)∆} f(x) dx.

Set [X]_∆ ≜ x_k if X ∈ [k∆, (k+1)∆), with p.m.f. P(x_k) = f(x_k) ∆.

5 / 63 I-Hsiang Wang IT Lecture 5

slide-6
SLIDE 6

Measures of Information for Continuous Random Variables Entropy and Mutual Information

[Figure: c.d.f. and p.d.f. of X — F_X(x) ≜ P{X ≤ x}, f(x) ≜ dF_X(x)/dx.]

6 / 63 I-Hsiang Wang IT Lecture 5

slide-7
SLIDE 7

Measures of Information for Continuous Random Variables Entropy and Mutual Information

[Figure: the p.d.f. f(x) with the x-axis partitioned into length-∆ intervals.]

7 / 63 I-Hsiang Wang IT Lecture 5

slide-8
SLIDE 8

Measures of Information for Continuous Random Variables Entropy and Mutual Information

[Figure: the p.d.f. f(x) with the x-axis partitioned into length-∆ intervals; representative points x_1, x_3, x_5, ... marked inside the intervals.]

8 / 63 I-Hsiang Wang IT Lecture 5

slide-9
SLIDE 9

Measures of Information for Continuous Random Variables Entropy and Mutual Information

Observation: lim_{∆→0} H([X]_∆) = H(X) (intuitively), while

    H([X]_∆) = −∑_{k=−∞}^{∞} (f(x_k) ∆) log (f(x_k) ∆)
             = −∆ ∑_{k=−∞}^{∞} f(x_k) log f(x_k) − log ∆
             → −∫_{−∞}^{∞} f(x) log f(x) dx + ∞ = ∞   as ∆ → 0.

Hence, H(X) = ∞ whenever −∫_{−∞}^{∞} f(x) log f(x) dx = E[ log 1/f(X) ] exists. It is quite intuitive that the entropy of a continuous random variable can be arbitrarily large, because it can take infinitely many possible values.
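As a sanity check, here is a small, illustrative Python/NumPy sketch (not part of the original slides) that quantizes a standard Gaussian X with bins truncated to [−10, 10]: every halving of ∆ adds roughly one bit, consistent with H([X]_∆) ≈ h(X) − log₂ ∆ diverging as ∆ → 0.

```python
import numpy as np
from math import erf, sqrt, log2, pi, e

def Phi(t):
    # standard normal c.d.f.
    return 0.5 * (1.0 + erf(t / sqrt(2.0)))

def quantized_entropy(delta, x_max=10.0):
    # H([X]_Delta) in bits for X ~ N(0,1); bins of width delta, truncated to [-x_max, x_max]
    edges = np.arange(-x_max, x_max + delta, delta)
    p = np.array([Phi(b) - Phi(a) for a, b in zip(edges[:-1], edges[1:])])
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

h_X = 0.5 * log2(2 * pi * e)   # differential entropy of N(0,1), about 2.047 bits
for delta in (1.0, 0.5, 0.25, 0.125, 0.0625):
    print(f"delta={delta:6.4f}  H={quantized_entropy(delta):6.3f}  h(X)-log2(delta)={h_X - log2(delta):6.3f}")
```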

9 / 63 I-Hsiang Wang IT Lecture 5

slide-10
SLIDE 10

Measures of Information for Continuous Random Variables Entropy and Mutual Information

Mutual Information between Continuous Random Variables

How about the mutual information between two continuous r.v.'s X and Y, with joint p.d.f. f_{X,Y}(x, y) and marginal p.d.f.'s f_X(x) and f_Y(y)? Again, we use discretization:

Partition R² into ∆ × ∆ squares: R² = ∪_{k,j=−∞}^{∞} I_k^∆ × I_j^∆, where I_k^∆ = [k∆, (k+1)∆).

Suppose that f_{X,Y}(x, y) is continuous. Then by the mean-value theorem (MVT), ∀ k, j ∈ Z, ∃ (x_k, y_j) ∈ I_k^∆ × I_j^∆ such that

    f_{X,Y}(x_k, y_j) = (1/∆²) ∫∫_{I_k^∆ × I_j^∆} f_{X,Y}(x, y) dx dy.

Set ([X]_∆, [Y]_∆) ≜ (x_k, y_j) if (X, Y) ∈ I_k^∆ × I_j^∆, with p.m.f. P_{X,Y}(x_k, y_j) = f_{X,Y}(x_k, y_j) ∆².

10 / 63 I-Hsiang Wang IT Lecture 5

slide-11
SLIDE 11

Measures of Information for Continuous Random Variables Entropy and Mutual Information

By the MVT, ∀ k, j ∈ Z, ∃ x̃_k ∈ I_k^∆ and ỹ_j ∈ I_j^∆ such that

    P_X(x_k) = ∫_{I_k^∆} f_X(x) dx = ∆ f_X(x̃_k),   P_Y(y_j) = ∫_{I_j^∆} f_Y(y) dy = ∆ f_Y(ỹ_j).

Observation: lim_{∆→0} I([X]_∆ ; [Y]_∆) = I(X ; Y) (intuitively), while

    I([X]_∆ ; [Y]_∆) = ∑_{k,j=−∞}^{∞} P_{X,Y}(x_k, y_j) log [ P_{X,Y}(x_k, y_j) / (P_X(x_k) P_Y(y_j)) ]
                     = ∑_{k,j=−∞}^{∞} (f_{X,Y}(x_k, y_j) ∆²) log [ f_{X,Y}(x_k, y_j) ∆² / (f_X(x̃_k) f_Y(ỹ_j) ∆²) ]
                     = ∆² ∑_{k,j=−∞}^{∞} f_{X,Y}(x_k, y_j) log [ f_{X,Y}(x_k, y_j) / (f_X(x̃_k) f_Y(ỹ_j)) ]
                     → ∫_{−∞}^{∞} ∫_{−∞}^{∞} f_{X,Y}(x, y) log [ f_{X,Y}(x, y) / (f_X(x) f_Y(y)) ] dx dy   as ∆ → 0.

Hence, I(X ; Y) = E[ log f(X,Y) / (f(X) f(Y)) ] if the improper integral exists.
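The same discretization can be carried out numerically. The illustrative sketch below (assumed parameters: correlation ρ = 0.8, bin width ∆ = 0.05, truncation at ±6) builds the p.m.f. P_{X,Y}(x_k, y_j) ≈ f_{X,Y}(x_k, y_j) ∆² for a bivariate Gaussian and checks that I([X]_∆ ; [Y]_∆) approaches the closed form −(1/2) log₂(1 − ρ²).

```python
import numpy as np

rho, delta, x_max = 0.8, 0.05, 6.0   # illustrative correlation, bin width, truncation

def joint_pdf(x, y):
    # zero-mean, unit-variance bivariate normal with correlation rho
    q = (x**2 - 2 * rho * x * y + y**2) / (1 - rho**2)
    return np.exp(-q / 2) / (2 * np.pi * np.sqrt(1 - rho**2))

centers = np.arange(-x_max, x_max, delta) + delta / 2
X, Y = np.meshgrid(centers, centers, indexing="ij")
P = joint_pdf(X, Y) * delta**2               # P_{X,Y}(x_k, y_j) ~= f(x_k, y_j) * delta^2
Px, Py = P.sum(axis=1, keepdims=True), P.sum(axis=0, keepdims=True)
mask = P > 0
I_quant = np.sum(P[mask] * np.log2(P[mask] / (Px * Py)[mask]))
print("I([X]_d ; [Y]_d) ~", I_quant, "   -0.5*log2(1-rho^2) =", -0.5 * np.log2(1 - rho**2))
```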

11 / 63 I-Hsiang Wang IT Lecture 5

slide-12
SLIDE 12

Measures of Information for Continuous Random Variables Entropy and Mutual Information

Mutual Information

Unlike entropy, which is only well defined for discrete r.v.'s, mutual information can in general be defined between two real-valued r.v.'s (not necessarily continuous or discrete), as follows.

Definition 1 (Mutual information)
The mutual information between two r.v.'s X and Y is defined as

    I(X ; Y) = sup_{P,Q} I([X]_P ; [Y]_Q),

where the supremum is taken over all pairs of partitions P and Q of R.

Similarly, information divergence can be defined between any two probability measures, whether the distributions are discrete, continuous, or neither.

Remark: Although defining information measures in such generality is nice, these definitions do not provide explicit ways to compute them.

12 / 63 I-Hsiang Wang IT Lecture 5

slide-13
SLIDE 13

Measures of Information for Continuous Random Variables Differential Entropy

1 Measures of Information for Continuous Random Variables
    Entropy and Mutual Information
    Differential Entropy
2 Channel Coding over Continuous Memoryless Channels
    Continuous Memoryless Channel
    Gaussian Channel Capacity
3 Lossy Source Coding for Continuous Memoryless Sources

13 / 63 I-Hsiang Wang IT Lecture 5

slide-14
SLIDE 14

Measures of Information for Continuous Random Variables Differential Entropy

Differential Entropy

For continuous r.v.'s, let us define the following counterparts of entropy and conditional entropy.

Definition 2 (Differential entropy and conditional differential entropy)
The differential entropy of a continuous r.v. X with p.d.f. f_X is defined as

    h(X) ≜ E[ log 1/f_X(X) ]

if the (improper) integral exists. The conditional differential entropy of a continuous r.v. X given Y, where (X, Y) has joint p.d.f. f_{X,Y} and conditional p.d.f. f_{X|Y}, is defined as

    h(X | Y) ≜ E[ log 1/f_{X|Y}(X|Y) ]

if the (improper) integral exists.

We have the following theorem immediately from the previous discussion:

Theorem 1 (Mutual information between two continuous r.v.'s)

    I(X ; Y) = E[ log f_{X,Y}(X,Y) / (f_X(X) f_Y(Y)) ] = h(X) − h(X | Y).

14 / 63 I-Hsiang Wang IT Lecture 5

slide-15
SLIDE 15

Measures of Information for Continuous Random Variables Differential Entropy

Information Divergence

Definition 3 (Information divergence for densities) The information divergence from density g (x) to f (x) is defined as

    D(f ∥ g) ≜ E_{X∼f}[ log f(X)/g(X) ] = ∫_{x ∈ supp f} f(x) log [ f(x)/g(x) ] dx

if the (improper) integral exists.

By Jensen's inequality, it is straightforward to see that the non-negativity of KL divergence carries over.

Proposition 1 (Non-negativity of Information divergence)

D (f ∥g) ≥ 0, with equality iff f = g almost everywhere (i.e., except for some points with zero probability).

Note: D (f ∥g) is finite only if the support of f (x) is contained in the support of g (x).

15 / 63 I-Hsiang Wang IT Lecture 5

slide-16
SLIDE 16

Measures of Information for Continuous Random Variables Differential Entropy

Properties that Extend to Continuous R.V.'s

Proposition 2 (Chain rule)

h(X, Y) = h(X) + h(Y | X),   h(X^n) = ∑_{i=1}^{n} h(X_i | X^{i−1}).

Proposition 3 (Conditioning reduces differential entropy)

h (X |Y ) ≤ h (X ) , h (X |Y, Z ) ≤ h (X |Z ) .

Proposition 4 (Non-negativity of mutual information)

I (X ; Y ) ≥ 0, I (X ; Y |Z ) ≥ 0.

16 / 63 I-Hsiang Wang IT Lecture 5

slide-17
SLIDE 17

Measures of Information for Continuous Random Variables Differential Entropy

Examples

Example 1 (Differential entropy of a uniform r.v.)
For a r.v. X ∼ Unif[a, b], that is, with p.d.f. f_X(x) = 1/(b−a) · 1{a ≤ x ≤ b}, the differential entropy is

    h(X) = log (b − a).

Example 2 (Differential entropy of N(0, 1))
For a r.v. X ∼ N(0, 1), that is, with p.d.f. f_X(x) = (1/√(2π)) e^{−x²/2}, the differential entropy is

    h(X) = (1/2) log (2πe).
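Both closed forms can be checked by Monte Carlo, estimating h(X) = E[ log 1/f_X(X) ] from samples. A minimal, illustrative sketch (the interval endpoints a = 2, b = 5, the sample size, and the seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

a, b = 2.0, 5.0                                          # illustrative interval for Unif[a, b]
x = rng.uniform(a, b, n)
h_unif = np.mean(-np.log2(np.full(n, 1.0 / (b - a))))    # -E[log2 f_X(X)]; density is constant
print("Unif[a,b]: MC", h_unif, "  log2(b-a) =", np.log2(b - a))

x = rng.standard_normal(n)                               # X ~ N(0, 1)
f = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
print("N(0,1):    MC", np.mean(-np.log2(f)), "  0.5*log2(2*pi*e) =", 0.5 * np.log2(2 * np.pi * np.e))
```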

17 / 63 I-Hsiang Wang IT Lecture 5

slide-18
SLIDE 18

Measures of Information for Continuous Random Variables Differential Entropy

New Properties of Differential Entropy

Differential entropy can be negative. Since b − a can be made arbitrarily small, h(X) = log (b − a) can be negative. Hence, the non-negativity property of entropy cannot be extended to differential entropy.

Scaling changes the differential entropy. Consider X ∼ Unif[0, 1]. Then 2X ∼ Unif[0, 2]. Hence,

    h(X) = log 1 = 0,   h(2X) = log 2 = 1   ⇒   h(X) ≠ h(2X).

This is in sharp contrast to entropy: H(X) = H(g(X)) as long as g(·) is an invertible function.

18 / 63 I-Hsiang Wang IT Lecture 5

slide-19
SLIDE 19

Measures of Information for Continuous Random Variables Differential Entropy

Scaling and Translation

Proposition 5 (Scaling and Translation in the Scalar Case)
Let X be a continuous random variable with differential entropy h(X).
Translation does not change the differential entropy: for a constant c, h(X + c) = h(X).
Scaling shifts the differential entropy: for a constant a ≠ 0, h(aX) = h(X) + log|a|.

Proposition 6 (Scaling and Translation in the Vector Case)
Let X be a continuous random vector with differential entropy h(X).
For a constant vector c, h(X + c) = h(X).
For an invertible matrix a ∈ R^{n×n}, h(aX) = h(X) + log|det a|.

The proofs of these propositions are left as exercises (simple calculus).

19 / 63 I-Hsiang Wang IT Lecture 5

slide-20
SLIDE 20

Measures of Information for Continuous Random Variables Differential Entropy

Differential Entropy of Gaussian Random Vectors

Example 3 (Differential entropy of Gaussian random vectors)
For an n-dim random vector X ∼ N(m, k), the differential entropy is

    h(X) = (1/2) log [ (2πe)^n det k ].

sol: For an n-dim random vector X ∼ N(m, k), we can rewrite X as

    X = aW + m,

where a a⊺ = k and W consists of i.i.d. W_i ∼ N(0, 1), i = 1, . . . , n. Hence, by the translation and scaling properties of differential entropy,

    h(X) = h(W) + log|det a| = ∑_{i=1}^{n} h(W_i) + (1/2) log (det k)
         = (n/2) log (2πe) + (1/2) log (det k) = (1/2) log [ (2πe)^n det k ].
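A short numerical check of Example 3 and of Proposition 6 (a sketch, with an arbitrarily chosen covariance matrix): factor k = a a⊺ with a Cholesky factor and compare the closed form against h(W) + log₂|det a|.

```python
import numpy as np

k = np.array([[2.0, 0.6],
              [0.6, 1.0]])                       # illustrative covariance matrix (n = 2)
n = k.shape[0]

sign, logdet = np.linalg.slogdet(k)              # natural-log determinant of k
h_closed = 0.5 * (n * np.log2(2 * np.pi * np.e) + logdet / np.log(2))

a = np.linalg.cholesky(k)                        # k = a a^T, so X = a W + m with W ~ N(0, I)
h_via_prop6 = n * 0.5 * np.log2(2 * np.pi * np.e) + np.log2(abs(np.linalg.det(a)))
print(h_closed, h_via_prop6)                     # the two values agree (in bits)
```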

20 / 63 I-Hsiang Wang IT Lecture 5

slide-21
SLIDE 21

Measures of Information for Continuous Random Variables Differential Entropy

Maximum Differential Entropy

Uniform distribution maximizes entropy for a r.v. with finite support. For differential entropy, the maximization problem needs to be associated with constraints on the distribution (otherwise it is easy to make it infinite). Under a second-moment constraint, the zero-mean Gaussian maximizes the differential entropy.

Theorem 2 (Maximum Differential Entropy under Covariance Constraint)
Let X be a random vector with mean m and covariance matrix

    E[(X − m)(X − m)⊺] = k,

and let X^G be Gaussian with the same covariance k. Then,

    h(X) ≤ h(X^G) = (1/2) log [ (2πe)^n det k ].

21 / 63 I-Hsiang Wang IT Lecture 5

slide-22
SLIDE 22

Measures of Information for Continuous Random Variables Differential Entropy

pf: First, we can assume WLOG that both X and X^G are zero-mean. Let the p.d.f. of X be f(x) and the p.d.f. of X^G be f_G(x). Hence,

    0 ≤ D(f ∥ f_G) = E_f[log f(X)] − E_f[log f_G(X)] = −h(X) − E_f[log f_G(X)].

Note that log f_G(x) is a quadratic function of x, and X, X^G have the same second moments. Hence,

    E_f[log f_G(X)] = E_{f_G}[log f_G(X^G)] = −h(X^G)
    ⇒ 0 ≤ D(f ∥ f_G) = −h(X) + h(X^G)
    ⇒ h(X) ≤ h(X^G).
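The theorem can also be illustrated numerically in the scalar case: among zero-mean, unit-variance densities, the uniform and Laplace differential entropies fall below the Gaussian value (1/2) log(2πe). A hedged Monte Carlo sketch (sample size and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500_000
gauss_bound = 0.5 * np.log2(2 * np.pi * np.e)    # h of the unit-variance Gaussian, ~2.047 bits

w = 2 * np.sqrt(3.0)                             # Unif[-sqrt(3), sqrt(3)] has variance 1
h_unif = np.mean(-np.log2(np.full(n, 1.0 / w)))

b = 1 / np.sqrt(2.0)                             # Laplace(b) has variance 2 b^2 = 1
x = rng.laplace(0.0, b, n)
h_lap = np.mean(-np.log2(np.exp(-np.abs(x) / b) / (2 * b)))

x = rng.standard_normal(n)
h_gauss = np.mean(-np.log2(np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)))

print(f"uniform {h_unif:.3f}  laplace {h_lap:.3f}  gaussian {h_gauss:.3f}  bound {gauss_bound:.3f}")
```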

22 / 63 I-Hsiang Wang IT Lecture 5

slide-23
SLIDE 23

Channel Coding over Continuous Memoryless Channels

1 Measures of Information for Continuous Random Variables
    Entropy and Mutual Information
    Differential Entropy
2 Channel Coding over Continuous Memoryless Channels
    Continuous Memoryless Channel
    Gaussian Channel Capacity
3 Lossy Source Coding for Continuous Memoryless Sources

23 / 63 I-Hsiang Wang IT Lecture 5

slide-24
SLIDE 24

Channel Coding over Continuous Memoryless Channels

We have investigated the measures of information for continuous r.v.'s:
  The amount of uncertainty (entropy) is mostly infinite.
  Mutual information and KL divergence are well defined.
  Differential entropy is useful for computing information measures of continuous r.v.'s.

Question: How about coding theorems? Is there a general way or framework to extend coding theorems from discrete sources/channels to continuous sources/channels?

24 / 63 I-Hsiang Wang IT Lecture 5

slide-25
SLIDE 25

Channel Coding over Continuous Memoryless Channels

Discrete Memoryless Channel (P_{Y|X}):
[Block diagram: w → Channel Encoder → x^N → P_{Y|X} → y^N → Channel Decoder → ŵ]

    C(B) = max_{X: E[b(X)] ≤ B} I(X ; Y).

Continuous Memoryless Channel (f_{Y|X}):
[Block diagram: w → Channel Encoder → x^N → f_{Y|X} → y^N → Channel Decoder → ŵ]

    C(B) = sup_{X: E[b(X)] ≤ B} I(X ; Y) ?

25 / 63 I-Hsiang Wang IT Lecture 5

slide-26
SLIDE 26

Channel Coding over Continuous Memoryless Channels

Coding Theorems: from Discrete to Continuous (1)

Two main techniques for extending the achievability part of coding theorems from the discrete world to the continuous world:

1 Discretization: Discretize the source and channel input/output to create a discrete system, and then make the discretization finer and finer to prove achievability.

2 New typicality: Extend weak typicality to continuous r.v.'s and repeat the arguments in a similar way. In particular, replace the entropy terms in the definitions of weakly typical sequences by differential entropy terms.

Using discretization to derive the achievability of Gaussian channel capacity follows Gallager [2] and El Gamal & Kim [6]. Cover & Thomas [1] and Yeung [5] use weak typicality for continuous r.v.'s. Moser [4] uses a threshold decoder, similar to weak typicality.

26 / 63 I-Hsiang Wang IT Lecture 5

slide-27
SLIDE 27

Channel Coding over Continuous Memoryless Channels

Coding Theorems: from Discrete to Continuous (2)

In this lecture, we use discretization for the achievability proof.
Pros: No need for new tools (e.g., typicality) for continuous r.v.'s. Extends naturally to multi-terminal settings – one can focus on discrete memoryless networks.
Cons: Technical; not much insight on how to achieve capacity.
Hence, we also use a geometric argument to provide insights on how to achieve capacity.
Disclaimer: We will not be 100% rigorous in deriving the results in this lecture. Instead, you can find rigorous treatments in the references.

27 / 63 I-Hsiang Wang IT Lecture 5

slide-28
SLIDE 28

Channel Coding over Continuous Memoryless Channels

Outline

1 First, we formulate the channel coding problem over continuous memoryless channels (CMC), state the coding theorem, and sketch the converse and achievability proofs.

2 Second, we introduce the additive Gaussian noise (AGN) channel, derive the Gaussian channel capacity, and provide insights based on geometric arguments.

28 / 63 I-Hsiang Wang IT Lecture 5

slide-29
SLIDE 29

Channel Coding over Continuous Memoryless Channels Continuous Memoryless Channel

1 Measures of Information for Continuous Random Variables
    Entropy and Mutual Information
    Differential Entropy
2 Channel Coding over Continuous Memoryless Channels
    Continuous Memoryless Channel
    Gaussian Channel Capacity
3 Lossy Source Coding for Continuous Memoryless Sources

29 / 63 I-Hsiang Wang IT Lecture 5

slide-30
SLIDE 30

Channel Coding over Continuous Memoryless Channels Continuous Memoryless Channel

Continuous Memoryless Channel

[Block diagram: w → Channel Encoder → x^N → f_{Y|X} → y^N → Channel Decoder → ŵ.]

1 Input/output alphabet X = Y = R.
2 Continuous Memoryless Channel (CMC):
  Channel law: governed by the conditional density (p.d.f.) f_{Y|X}.
  Memoryless: Y_k − X_k − (X^{k−1}, Y^{k−1}) forms a Markov chain.
3 Average input cost constraint B: (1/N) ∑_{k=1}^{N} b(x_k) ≤ B, where b : R → [0, ∞) is the (single-letter) cost function.

The definitions of error probability, achievable rate, and capacity are the same as those for the DMC.

30 / 63 I-Hsiang Wang IT Lecture 5

slide-31
SLIDE 31

Channel Coding over Continuous Memoryless Channels Continuous Memoryless Channel

Channel Coding Theorem

Theorem 3 (Continuous Memoryless Channel Capacity)
The capacity of the CMC (R, f_{Y|X}, R) with input cost constraint B is

    C = sup_{X: E[b(X)] ≤ B} I(X ; Y).        (1)

Note: The input distribution of the r.v. X need not have a density – it could also be discrete.

How to compute h(Y | X) when X has no density? Recall

    h(Y | X) = E_X[ −∫_{supp Y} f(y|X) log f(y|X) dy ],

where f(y|x) is the conditional density of Y given X.

Converse proof: Exactly the same as that in the DMC case.

31 / 63 I-Hsiang Wang IT Lecture 5

slide-32
SLIDE 32

Channel Coding over Continuous Memoryless Channels Continuous Memoryless Channel

Sketch of the Achievability (1): Discretization

[Block diagram: w → ENC → x^N → f_{Y|X} → y^N → DEC → ŵ.]

Achievability proof makes use of discretization – we can apply the result in DMC with input cost:

32 / 63 I-Hsiang Wang IT Lecture 5

slide-33
SLIDE 33

Channel Coding over Continuous Memoryless Channels Continuous Memoryless Channel

Sketch of the Achievability (1): Discretization

[Block diagram: w → ENC → Q_in → f_{Y|X} → Q_out → DEC → ŵ.]

Achievability proof makes use of discretization – we can apply the result in DMC with input cost:

Q_in: (single-letter) discretization that maps X ∈ R to X_d ∈ X_d.
Q_out: (single-letter) discretization that maps Y ∈ R to Y_d ∈ Y_d.

Note that both X_d and Y_d are discrete (countable) alphabets.

33 / 63 I-Hsiang Wang IT Lecture 5

slide-34
SLIDE 34

Channel Coding over Continuous Memoryless Channels Continuous Memoryless Channel

Sketch of the Achievability (1): Discretization

[Block diagram: w → New ENC (ENC followed by Q_in) → equivalent DMC P_{Y_d|X_d} (f_{Y|X} followed by Q_out) → DEC → ŵ.]

Achievability proof makes use of discretization – we can apply the result in DMC with input cost:

Q_in: (single-letter) discretization that maps X ∈ R to X_d ∈ X_d.
Q_out: (single-letter) discretization that maps Y ∈ R to Y_d ∈ Y_d.

Note that both X_d and Y_d are discrete (countable) alphabets.

Idea: With the two discretization blocks Q_in and Q_out, one can build an equivalent DMC (X_d, P_{Y_d|X_d}, Y_d) as shown above.

34 / 63 I-Hsiang Wang IT Lecture 5

slide-35
SLIDE 35

Channel Coding over Continuous Memoryless Channels Continuous Memoryless Channel

Sketch of the Achievability (2): Arguments

[Block diagram: w → New ENC → x_d^N → equivalent DMC P_{Y_d|X_d} → y_d^N → DEC → ŵ, with Q_in absorbed into the new encoder and Q_out absorbed into the equivalent DMC.]

1 Random codebook generation: Generate the codebook randomly based on the original (continuous) r.v. X, satisfying E[b(X)] ≤ B.

2 Choice of discretization: Choose Q_in such that the cost constraint is not violated after discretization. Specifically, E[b(X_d)] ≤ B.

3 Achievability in the equivalent DMC: By the achievability part of the channel coding theorem for the DMC with input constraint, any rate R < I(X_d ; Y_d) is achievable.

4 Achievability in the original CMC: Prove that when the discretization in Q_in and Q_out gets finer and finer, I(X_d ; Y_d) → I(X ; Y).

35 / 63 I-Hsiang Wang IT Lecture 5

slide-36
SLIDE 36

Channel Coding over Continuous Memoryless Channels Gaussian Channel Capacity

1 Measures of Information for Continuous Random Variables
    Entropy and Mutual Information
    Differential Entropy
2 Channel Coding over Continuous Memoryless Channels
    Continuous Memoryless Channel
    Gaussian Channel Capacity
3 Lossy Source Coding for Continuous Memoryless Sources

36 / 63 I-Hsiang Wang IT Lecture 5

slide-37
SLIDE 37

Channel Coding over Continuous Memoryless Channels Gaussian Channel Capacity

Additive White Gaussian Noise (AWGN) Channel

[Block diagram: w → Channel Encoder → x^N → (+ z^N) → y^N → Channel Decoder → ŵ.]

1 Input/output alphabet X = Y = R.
2 AWGN Channel:
  Conditional p.d.f. f_{Y|X} given by Y = X + Z, with Z ∼ N(0, σ²) ⊥⊥ X.
  {Z_k} form an i.i.d. (white) Gaussian random process with Z_k ∼ N(0, σ²), ∀ k.
  Memoryless: Z_k ⊥⊥ (W, X^{k−1}, Z^{k−1}).
  Without feedback: Z^N ⊥⊥ X^N.
3 Average input power constraint P: (1/N) ∑_{k=1}^{N} |x_k|² ≤ P.

37 / 63 I-Hsiang Wang IT Lecture 5

slide-38
SLIDE 38

Channel Coding over Continuous Memoryless Channels Gaussian Channel Capacity

Channel Coding Theorem for Gaussian Channel

Theorem 4 (Gaussian Channel Capacity) The capacity of the AWGN channel with input power constraint P and noise variance σ2 is given by

C = sup

X: E[|X|2]≤P

I (X ; Y ) = 1

2 log

( 1 + P

σ2

) .

(2) Note: For the AWGN channel, the supremum is actually attainable with Gaussian input

X ∼ N (0, P), that is, the input has density fX (x) =

1 √ 2πP e− x2

2P , as shown in the next slide.

38 / 63 I-Hsiang Wang IT Lecture 5

slide-39
SLIDE 39

Channel Coding over Continuous Memoryless Channels Gaussian Channel Capacity

Evaluation of Capacity

Let us compute the capacity of the AWGN channel (2) as follows:

    I(X ; Y) = h(Y) − h(Y | X) = h(Y) − h(X + Z | X) = h(Y) − h(Z | X)
             = h(Y) − h(Z)                          (since Z ⊥⊥ X)
             = h(Y) − (1/2) log (2πe σ²)
          (a)≤ (1/2) log [2πe (P + σ²)] − (1/2) log (2πe σ²) = (1/2) log (1 + P/σ²).

Here (a) is due to h(Y) ≤ (1/2) log (2πe Var[Y]) and Var[Y] = Var[X] + Var[Z] ≤ P + σ², since Var[X] ≤ E[X²] ≤ P.

Finally, note that the above inequalities hold with equality when X ∼ N(0, P).
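A quick numerical cross-check of this evaluation (a sketch; P = 4 and σ² = 1 are illustrative choices): estimate h(Y) by Monte Carlo for the capacity-achieving input X ∼ N(0, P) and subtract h(Z).

```python
import numpy as np

P, sigma2 = 4.0, 1.0                              # illustrative power and noise variance
C = 0.5 * np.log2(1 + P / sigma2)                 # capacity in bits per channel use

rng = np.random.default_rng(2)
n = 1_000_000
x = rng.normal(0.0, np.sqrt(P), n)                # capacity-achieving input X ~ N(0, P)
y = x + rng.normal(0.0, np.sqrt(sigma2), n)       # Y = X + Z, so Y ~ N(0, P + sigma2)

f_Y = np.exp(-y**2 / (2 * (P + sigma2))) / np.sqrt(2 * np.pi * (P + sigma2))
h_Y = np.mean(-np.log2(f_Y))                      # Monte Carlo estimate of h(Y)
h_Z = 0.5 * np.log2(2 * np.pi * np.e * sigma2)
print("h(Y) - h(Z) =", h_Y - h_Z, "   capacity formula =", C)
```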

39 / 63 I-Hsiang Wang IT Lecture 5

slide-40
SLIDE 40

Channel Coding over Continuous Memoryless Channels Gaussian Channel Capacity

Geometric Intuition: Sphere Packing

[Figure: output space R^N; sphere of radius √(N(P + σ²)) containing most outputs y = x + z.]

By the LLN, as N → ∞, most output sequences y (= y^N) will lie inside the N-dimensional sphere of radius √(N(P + σ²)).

40 / 63 I-Hsiang Wang IT Lecture 5

slide-41
SLIDE 41

Channel Coding over Continuous Memoryless Channels Gaussian Channel Capacity

Geometric Intuition: Sphere Packing

[Figure: output sphere of radius √(N(P + σ²)); noise sphere of radius √(Nσ²) centered at a codeword x.]

By the LLN, as N → ∞, most output sequences y (= y^N) will lie inside the N-dimensional sphere of radius √(N(P + σ²)).

Also by the LLN, as N → ∞, y will lie near the surface of the N-dimensional sphere centered at x with radius √(Nσ²).

41 / 63 I-Hsiang Wang IT Lecture 5

slide-42
SLIDE 42

Channel Coding over Continuous Memoryless Channels Gaussian Channel Capacity

Geometric Intuition: Sphere Packing

[Figure: output sphere of radius √(N(P + σ²)) packed with noise spheres of radius √(Nσ²).]

By the LLN, as N → ∞, most output sequences y (= y^N) will lie inside the N-dimensional sphere of radius √(N(P + σ²)).

Also by the LLN, as N → ∞, y will lie near the surface of the N-dimensional sphere centered at x with radius √(Nσ²).

Vanishing error probability criterion ⇒ non-overlapping spheres.
Max. # of non-overlapping spheres = max. # of codewords that can be reliably delivered.

Question: How many non-overlapping spheres can be packed into the large sphere?

42 / 63 I-Hsiang Wang IT Lecture 5

slide-43
SLIDE 43

Channel Coding over Continuous Memoryless Channels Gaussian Channel Capacity

Geometric Intuition: Sphere Packing

[Figure: output sphere of radius √(N(P + σ²)) packed with noise spheres of radius √(Nσ²).]

Back-of-envelope calculation:

    2^{NR} ≤ (√(N(P + σ²)))^N / (√(Nσ²))^N
    ⇒ R ≤ (1/N) log [ (√(N(P + σ²)))^N / (√(Nσ²))^N ] = (1/2) log (1 + P/σ²).

Hence, intuitively any achievable rate R cannot exceed

    C = (1/2) log (1 + P/σ²).

How to achieve it?
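The volume calculation can be made slightly more explicit. In the sketch below (P = 4 and σ² = 1 are illustrative values), the exact N-ball volume V_N(r) = π^{N/2} r^N / Γ(N/2 + 1) is used; the constants cancel in the ratio, so the per-dimension exponent equals (1/2) log₂(1 + P/σ²) for every N.

```python
import numpy as np
from math import lgamma, log, pi

P, sigma2 = 4.0, 1.0                              # illustrative values

def log2_ball_volume(N, r):
    # log2 of the volume of an N-ball: V = pi^(N/2) r^N / Gamma(N/2 + 1)
    return (0.5 * N * log(pi) + N * log(r) - lgamma(N / 2 + 1)) / log(2)

for N in (2, 10, 100, 1000):
    big = log2_ball_volume(N, np.sqrt(N * (P + sigma2)))     # output sphere
    small = log2_ball_volume(N, np.sqrt(N * sigma2))         # noise sphere
    print(N, (big - small) / N)                              # log2(# of packed spheres) / N
print("0.5*log2(1 + P/sigma2) =", 0.5 * np.log2(1 + P / sigma2))
```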

43 / 63 I-Hsiang Wang IT Lecture 5

slide-44
SLIDE 44

Channel Coding over Continuous Memoryless Channels Gaussian Channel Capacity

Achieving Capacity via Good Packing

[Figure: "x-sphere" of radius √(NP) containing codewords x_1, x_2, ...]

Random codebook generation: Generate 2^{NR} N-dim. vectors (codewords) {x_1, . . . , x_{2^{NR}}} lying in the "x-sphere" of radius √(NP).

44 / 63 I-Hsiang Wang IT Lecture 5

slide-45
SLIDE 45

Channel Coding over Continuous Memoryless Channels Gaussian Channel Capacity

Achieving Capacity via Good Packing

[Figure: "x-sphere" of radius √(NP) with codewords x_1, x_2 and the scaled output αy.]

Random codebook generation: Generate 2^{NR} N-dim. vectors (codewords) {x_1, . . . , x_{2^{NR}}} lying in the "x-sphere" of radius √(NP).

Decoding: with α ≜ P/(P + σ²) (the MMSE coefficient),

    y → MMSE scaling (×α) → αy → nearest-neighbor search → x̂.

45 / 63 I-Hsiang Wang IT Lecture 5

slide-46
SLIDE 46

Channel Coding over Continuous Memoryless Channels Gaussian Channel Capacity

Achieving Capacity via Good Packing

[Figure: "x-sphere" of radius √(NP); uncertainty sphere of radius √(NPσ²/(P + σ²)) centered at αy.]

Random codebook generation: Generate 2^{NR} N-dim. vectors (codewords) {x_1, . . . , x_{2^{NR}}} lying in the "x-sphere" of radius √(NP).

Decoding: with α ≜ P/(P + σ²) (the MMSE coefficient),

    y → MMSE scaling (×α) → αy → nearest-neighbor search → x̂.

By the LLN, we have

    ∥αy − x_1∥² = ∥αz + (α − 1)x_1∥² ≈ α²Nσ² + (α − 1)²NP = NPσ²/(P + σ²).

46 / 63 I-Hsiang Wang IT Lecture 5

slide-47
SLIDE 47

Channel Coding over Continuous Memoryless Channels Gaussian Channel Capacity

Achieving Capacity via Good Packing

[Figure: "x-sphere" of radius √(NP); uncertainty sphere of radius √(NPσ²/(P + σ²)) centered at αy.]

Performance analysis: When does an error occur? When another codeword, say x_2, falls inside the uncertainty sphere centered at αy. What is that probability? It is the ratio of the volumes of the two spheres:

    P{x_1 → x_2} = (√(NPσ²/(P + σ²)))^N / (√(NP))^N = (σ²/(P + σ²))^{N/2}.

47 / 63 I-Hsiang Wang IT Lecture 5

slide-48
SLIDE 48

Channel Coding over Continuous Memoryless Channels Gaussian Channel Capacity

Achieving Capacity via Good Packing

[Figure: "x-sphere" of radius √(NP); uncertainty sphere of radius √(NPσ²/(P + σ²)) centered at αy.]

By the union of events bound, the total probability of error

    P{E} ≤ 2^{NR} (σ²/(P + σ²))^{N/2} = 2^{N(R − (1/2) log (1 + P/σ²))},

which vanishes as N → ∞ if

    R < (1/2) log (1 + P/σ²).

Hence, any R < (1/2) log (1 + P/σ²) is achievable.
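A small numerical illustration of this union bound (P = 4, σ² = 1 are arbitrary choices): for rates below (1/2) log₂(1 + P/σ²) ≈ 1.16 the bound decays exponentially in N, while above capacity it is vacuous.

```python
import numpy as np

P, sigma2 = 4.0, 1.0
C = 0.5 * np.log2(1 + P / sigma2)                 # ~1.161 bits per channel use

def union_bound(N, R):
    # P(E) <= 2^{NR} (sigma2/(P+sigma2))^{N/2} = 2^{N (R - C)}
    return 2.0 ** (N * (R - C))

for R in (0.8, 1.0, 1.3):                         # two rates below capacity, one above
    print(f"R={R}:", [union_bound(N, R) for N in (50, 100, 200)])
```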

48 / 63 I-Hsiang Wang IT Lecture 5

slide-49
SLIDE 49

Channel Coding over Continuous Memoryless Channels Gaussian Channel Capacity

Practical Relevance of the Gaussian Noise Model

In communication engineering, additive Gaussian noise is the most widely used model for a noisy channel with real (or complex) input/output. Reasons:

1 Due to the CLT, a Gaussian well models noise that is the aggregation of many small perturbations.
2 Analytically, the Gaussian is highly tractable.
3 Consider an input-power-constrained channel with independent additive noise. Within the family of noise distributions with the same variance, Gaussian noise is the worst-case noise.

The last point is important: for an additive-noise channel with input power constraint P and noise variance σ², the capacity is lower bounded by the Gaussian channel capacity (1/2) log (1 + P/σ²).

49 / 63 I-Hsiang Wang IT Lecture 5

slide-50
SLIDE 50

Channel Coding over Continuous Memoryless Channels Gaussian Channel Capacity

Gaussian Noise is the Worst-Case Noise

Proposition 7
Consider a Gaussian r.v. X^G ∼ N(0, P) and Y = X^G + Z, where Z has density f_Z(z), variance Var[Z] = σ², and Z ⊥⊥ X^G. Then,

    I(X^G ; Y) ≥ (1/2) log (1 + P/σ²).

With Proposition 7, we immediately obtain the following theorem:

Theorem 5 (Gaussian is the Worst-Case Additive Noise)
Consider a CMC f_{Y|X}: Y = X + Z, Z ⊥⊥ X, with input power constraint P and noise variance σ², where the additive noise has a density. Then the capacity C is minimized when Z ∼ N(0, σ²), and

    C ≥ C_G ≜ (1/2) log (1 + P/σ²).

pf: C ≥ I(X^G ; X^G + Z) ≥ (1/2) log (1 + P/σ²).
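Proposition 7 can be illustrated numerically. The sketch below (P = 4, σ² = 1 are illustrative values) uses uniform noise with the same variance; since f_Y is a Gaussian–uniform convolution with a closed form, I(X^G ; X^G + Z) = h(Y) − h(Z) can be estimated by Monte Carlo and compared against (1/2) log₂(1 + P/σ²).

```python
import numpy as np
from math import erf, sqrt

P, sigma2 = 4.0, 1.0                               # illustrative values
a = np.sqrt(3 * sigma2)                            # Z ~ Unif[-a, a] has variance sigma2
rng = np.random.default_rng(3)
n = 400_000

x = rng.normal(0.0, np.sqrt(P), n)                 # X^G ~ N(0, P)
z = rng.uniform(-a, a, n)                          # non-Gaussian noise with the same variance
y = x + z

Phi = np.vectorize(lambda t: 0.5 * (1 + erf(t / sqrt(2.0))))
f_Y = (Phi((y + a) / np.sqrt(P)) - Phi((y - a) / np.sqrt(P))) / (2 * a)   # Gaussian-uniform convolution

I_mc = np.mean(-np.log2(f_Y)) - np.log2(2 * a)     # h(Y) - h(Z), in bits
print("I(X^G ; X^G + Z) ~", I_mc, "   0.5*log2(1 + P/sigma2) =", 0.5 * np.log2(1 + P / sigma2))
```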

50 / 63 I-Hsiang Wang IT Lecture 5

slide-51
SLIDE 51

Channel Coding over Continuous Memoryless Channels Gaussian Channel Capacity

Proof of Proposition 7

Let Z^G ∼ N(0, σ²), and denote Y^G ≜ X^G + Z^G. We aim to prove

    I(X^G ; Y) ≥ I(X^G ; Y^G).

First note that I(X^G ; Y) = h(Y) − h(Z) does not change if we shift Z by a constant. Hence, WLOG assume E[Z] = 0. Since both X^G and Z are zero-mean, so is Y. Note that Y^G ∼ N(0, P + σ²) and Z^G ∼ N(0, σ²). Hence,

    h(Y^G) = E_{Y^G}[ −log f_{Y^G}(Y^G) ]
           = (1/2) log [2π(P + σ²)] + (log e)/(2(P + σ²)) · E_{Y^G}[(Y^G)²]
           = (1/2) log [2π(P + σ²)] + (log e)/(2(P + σ²)) · E_Y[Y²]
           = E_Y[ −log f_{Y^G}(Y) ].

51 / 63 I-Hsiang Wang IT Lecture 5

slide-52
SLIDE 52

Channel Coding over Continuous Memoryless Channels Gaussian Channel Capacity

The key step above is to realize that Y and Y^G have the same variance. Similarly, h(Z^G) = E_Z[−log f_{Z^G}(Z)]. Therefore,

    I(X^G ; Y^G) − I(X^G ; Y)
      = { h(Y^G) − h(Y) } − { h(Z^G) − h(Z) }
      = { E_Y[−log f_{Y^G}(Y)] − E_Y[−log f_Y(Y)] } − { E_Z[−log f_{Z^G}(Z)] − E_Z[−log f_Z(Z)] }
      = E_Y[ log f_Y(Y)/f_{Y^G}(Y) ] − E_Z[ log f_Z(Z)/f_{Z^G}(Z) ]
      = E_{Y,Z}[ log ( f_Y(Y) f_{Z^G}(Z) ) / ( f_{Y^G}(Y) f_Z(Z) ) ]
      ≤ log ( E_{Y,Z}[ f_Y(Y) f_{Z^G}(Z) / ( f_{Y^G}(Y) f_Z(Z) ) ] ).        (Jensen's inequality)

To finish the proof, we shall prove that E_{Y,Z}[ f_Y(Y) f_{Z^G}(Z) / ( f_{Y^G}(Y) f_Z(Z) ) ] = 1.

52 / 63 I-Hsiang Wang IT Lecture 5

slide-53
SLIDE 53

Channel Coding over Continuous Memoryless Channels Gaussian Channel Capacity

Let us calculate E_{Y,Z}[ f_Y(Y) f_{Z^G}(Z) / ( f_{Y^G}(Y) f_Z(Z) ) ] as follows:

    E_{Y,Z}[ f_Y(Y) f_{Z^G}(Z) / ( f_{Y^G}(Y) f_Z(Z) ) ]
      = ∫∫ f_{Y,Z}(y, z) · f_Y(y) f_{Z^G}(z) / ( f_{Y^G}(y) f_Z(z) ) dz dy
      = ∫∫ f_Z(z) f_{X^G}(y − z) · f_Y(y) f_{Z^G}(z) / ( f_{Y^G}(y) f_Z(z) ) dz dy     (∵ Y = X^G + Z)
      = ∫∫ [ f_{X^G}(y − z) f_{Z^G}(z) ] · f_Y(y) / f_{Y^G}(y) dz dy
      = ∫∫ f_{Y^G, Z^G}(y, z) · f_Y(y) / f_{Y^G}(y) dz dy                              (∵ Y^G = X^G + Z^G)
      = ∫ f_Y(y) / f_{Y^G}(y) · ( ∫ f_{Y^G, Z^G}(y, z) dz ) dy
      = ∫ f_Y(y) / f_{Y^G}(y) · f_{Y^G}(y) dy = ∫ f_Y(y) dy = 1.

Hence, the proof is complete.

53 / 63 I-Hsiang Wang IT Lecture 5

slide-54
SLIDE 54

Lossy Source Coding for Continuous Memoryless Sources

1 Measures of Information for Continuous Random Variables
    Entropy and Mutual Information
    Differential Entropy
2 Channel Coding over Continuous Memoryless Channels
    Continuous Memoryless Channel
    Gaussian Channel Capacity
3 Lossy Source Coding for Continuous Memoryless Sources

54 / 63 I-Hsiang Wang IT Lecture 5

slide-55
SLIDE 55

Lossy Source Coding for Continuous Memoryless Sources

Lossy Source Coding Theorem

[Block diagram: source s[1:N] → Source Encoder → b[1:K] → Source Decoder → ŝ[1:N] → Destination.]

Theorem 6 (A Lossy Source Coding Theorem for CMS)
For a continuous memoryless source {S_i | i ∈ N} with p.d.f. f_S,

    R(D) = inf_{f_{Ŝ|S}: E[d(S, Ŝ)] ≤ D} I(S ; Ŝ).        (3)

Remark: One can use weak typicality or the discretization method used in channel coding to extend the lossy source coding theorem from discrete memoryless sources to continuous ones.

55 / 63 I-Hsiang Wang IT Lecture 5

slide-56
SLIDE 56

Lossy Source Coding for Continuous Memoryless Sources

Gaussian Source with Squared Error Distortion

Source (Gaussian): S_i ∈ S = R, and S_i i.i.d. ∼ N(µ, σ²), ∀ i.
Distortion (squared error): d(s, ŝ) = |s − ŝ|².

Theorem 7 (Rate distortion function of a Gaussian source with squared error distortion)
For the Gaussian source with squared error distortion (as defined above), the rate distortion function is

    R(D) = (1/2) log (σ²/D)  for 0 ≤ D ≤ σ²,   and   R(D) = 0  for D > σ².

Note: In particular, R(0) = ∞, which is quite intuitive!
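For reference, the rate distortion function is trivial to evaluate; a minimal sketch (σ² = 1 and the listed distortion values are arbitrary):

```python
import numpy as np

def rate_distortion_gaussian(D, sigma2):
    # R(D) = max{0, 0.5 * log2(sigma2 / D)} bits per source symbol
    return max(0.0, 0.5 * np.log2(sigma2 / D))

sigma2 = 1.0
for D in (0.01, 0.1, 0.5, 1.0, 2.0):
    print(f"D={D:5.2f}  R(D)={rate_distortion_gaussian(D, sigma2):.3f} bits/symbol")
```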

56 / 63 I-Hsiang Wang IT Lecture 5

slide-57
SLIDE 57

Lossy Source Coding for Continuous Memoryless Sources

pf: First step: identify D_min and D_max.
D_min = 0, because one can choose ŝ(s) = s.
D_max = σ², because one can choose ŝ = µ, the mean of S.

Next step: lower bound I(S ; Ŝ) = h(S) − h(S | Ŝ). This is equivalent to upper bounding h(S | Ŝ):

    h(S | Ŝ) = h(S − Ŝ | Ŝ) ≤ h(S − Ŝ) ≤ (1/2) log (2πe D),

where the last inequality holds since Var[S − Ŝ] ≤ E[|S − Ŝ|²] ≤ D. Hence,

    I(S ; Ŝ) ≥ (1/2) log (2πe σ²) − (1/2) log (2πe D) = (1/2) log (σ²/D).

57 / 63 I-Hsiang Wang IT Lecture 5

slide-58
SLIDE 58

Lossy Source Coding for Continuous Memoryless Sources

Final step: show that the lower bound (1/2) log (σ²/D) can be attained.

The goal is to find a conditional density f_{Ŝ|S} such that Ŝ ⊥⊥ (S − Ŝ), so that h(S − Ŝ | Ŝ) = h(S − Ŝ), and (S − Ŝ) ∼ N(0, D). Again, this can be done via an auxiliary reverse channel.

Consider a channel with input Ŝ, output S, and additive noise Z ∼ N(0, D) ⊥⊥ Ŝ:

    S = Ŝ + Z   ⇒   Z = S − Ŝ.

The reverse channel specifies the joint distribution f_{S,Ŝ} and hence f_{Ŝ|S}!

[Figure: reverse test channel — Ŝ ∼ N(µ, σ² − D) plus independent Z ∼ N(0, D) gives S ∼ N(µ, σ²), with error S − Ŝ = Z.]
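The reverse test channel can be checked by simulation. In the illustrative sketch below (µ = 0, σ² = 1, D = 0.25 are assumed values), S = Ŝ + Z with Ŝ ∼ N(µ, σ² − D) and independent Z ∼ N(0, D) indeed has variance σ² and distortion E[(S − Ŝ)²] = D.

```python
import numpy as np

mu, sigma2, D = 0.0, 1.0, 0.25                     # illustrative source mean/variance and distortion
rng = np.random.default_rng(4)
n = 1_000_000

s_hat = rng.normal(mu, np.sqrt(sigma2 - D), n)     # S_hat ~ N(mu, sigma2 - D)
z = rng.normal(0.0, np.sqrt(D), n)                 # Z ~ N(0, D), independent of S_hat
s = s_hat + z                                      # reverse channel output S ~ N(mu, sigma2)

print("Var[S]           ~", s.var(), "   target sigma2 =", sigma2)
print("E[(S - S_hat)^2] ~", np.mean((s - s_hat) ** 2), "   target D =", D)
print("R(D) = 0.5*log2(sigma2/D) =", 0.5 * np.log2(sigma2 / D), "bits")
```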

58 / 63 I-Hsiang Wang IT Lecture 5

slide-59
SLIDE 59

Lossy Source Coding for Continuous Memoryless Sources

Gaussian Source is the Hardest to Compress

Theorem 8 (Gaussian is the Worst-Case Source to Compress)
Consider a zero-mean CMS {S_i} with density f_S and variance σ². Then the rate distortion function with squared error distortion is maximized when S ∼ N(0, σ²), and

    R(D) ≤ R_G(D) ≜ max{ 0, (1/2) log (σ²/D) }.

pf: Note that R(D) = min_{f_{Ŝ|S}: E[d(S,Ŝ)] ≤ D} I(S ; Ŝ). To obtain an upper bound, we simply need to choose some f_{Ŝ|S} that yields I(S ; Ŝ) = h(Ŝ) − h(Ŝ | S) ≤ R_G(D). Note that:

The term h(Ŝ | S) can be computed if Ŝ = aS + bZ^G for a standard Gaussian Z^G ⊥⊥ S.

The term h(Ŝ) can be upper bounded by (1/2) log (2πe Var[Ŝ]) = (1/2) log [2πe (a²σ² + b²)].

59 / 63 I-Hsiang Wang IT Lecture 5

slide-60
SLIDE 60

Lossy Source Coding for Continuous Memoryless Sources

How to find the coefficients a and b? Let us reverse-engineer:

1 The distortion should be D:

    E[(S − Ŝ)²] = (1 − a)²σ² + b²,   which we set equal to D.

2 The induced mutual information should be upper bounded by R_G(D):

    I(S ; Ŝ) = h(Ŝ) − h(Ŝ | S) ≤ (1/2) log [2πe (a²σ² + b²)] − (1/2) log (2πe b²)
             = (1/2) log [(a²σ² + b²)/b²],   which we set equal to (1/2) log (σ²/D).

Combining the above two, we can solve a = (σ² − D)/σ² and b = √(D(σ² − D))/σ, for D < σ².

For D ≥ σ², it is obvious that R(D) = 0. Proof complete.
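A two-line numerical check of these coefficients (σ² = 1 and D = 0.25 are illustrative values): with a = (σ² − D)/σ² and b = √(D(σ² − D))/σ, the distortion equals D and the induced mutual information equals (1/2) log₂(σ²/D).

```python
import numpy as np

sigma2, D = 1.0, 0.25                              # illustrative values with D < sigma2
a = (sigma2 - D) / sigma2
b = np.sqrt(D * (sigma2 - D) / sigma2)             # i.e. sqrt(D*(sigma2 - D)) / sigma

distortion = (1 - a) ** 2 * sigma2 + b ** 2        # E[(S - S_hat)^2]
mi = 0.5 * np.log2((a ** 2 * sigma2 + b ** 2) / b ** 2)
print("distortion =", distortion, "  target D =", D)
print("I(S ; S_hat) =", mi, "  target 0.5*log2(sigma2/D) =", 0.5 * np.log2(sigma2 / D))
```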

60 / 63 I-Hsiang Wang IT Lecture 5

slide-61
SLIDE 61

Summary

61 / 63 I-Hsiang Wang IT Lecture 5

slide-62
SLIDE 62

Mutual information between two continuous r.v.'s X and Y with joint density f_{X,Y}:

    I(X ; Y) = E[ log f_{X,Y}(X,Y) / (f_X(X) f_Y(Y)) ].

Differential entropy and conditional differential entropy:

    h(X) ≜ E[ log 1/f_X(X) ],   h(X | Y) ≜ E[ log 1/f_{X|Y}(X|Y) ].

    I(X ; Y) = h(X) − h(X | Y) = h(Y) − h(Y | X).

Information divergence between densities f and g: D(f ∥ g) ≜ E_{X∼f}[ log f(X)/g(X) ].

Chain rule, conditioning reduces differential entropy, non-negativity of mutual information and information divergence: all continue to hold.

Differential entropy can be negative; h(X) ≤ h(X, Y) need not hold (since h(Y | X) can be negative).

62 / 63 I-Hsiang Wang IT Lecture 5

slide-63
SLIDE 63

Continuous memoryless channel capacity:

    C(B) = sup_{X: E[b(X)] ≤ B} I(X ; Y).

Rate distortion function for a continuous memoryless source:

    R(D) = inf_{f_{Ŝ|S}: E[d(S, Ŝ)] ≤ D} I(S ; Ŝ).

Gaussian channel capacity: C(P) = (1/2) log (1 + P/σ²).

Gaussian source with squared error distortion: R(D) = max{ (1/2) log (σ²/D), 0 }.

Gaussian noise is the worst-case additive noise under a second-moment constraint.
Gaussian source is the worst-case source under a second-moment constraint.

63 / 63 I-Hsiang Wang IT Lecture 5