
Chapter 7 Channel Capacity

Peng-Hua Wang

Graduate Inst. of Comm. Engineering National Taipei University


Chapter Outline

  • Chap. 7 Channel Capacity

    7.1 Examples of Channel Capacity
    7.2 Symmetric Channels
    7.3 Properties of Channel Capacity
    7.4 Preview of the Channel Coding Theorem
    7.5 Definitions
    7.6 Jointly Typical Sequences
    7.7 Channel Coding Theorem
    7.8 Zero-Error Codes
    7.9 Fano’s Inequality and the Converse to the Coding Theorem


7.1 Examples of Channel Capacity


Channel Model

■ Operational channel capacity: the logarithm of the maximum number of
  distinguishable signals for n uses of a communication channel, per use.
  ◆ If we can send M signals without error in n transmissions, the rate is
    (log M)/n bits per transmission.
■ Information channel capacity: the maximum mutual information between the
  channel input and output.
■ The operational channel capacity is equal to the information channel
  capacity.
  ◆ This is a fundamental theorem and central success of information theory.


Channel Capacity

Definition 1 (Discrete channel) A system consisting of an input alphabet X,
an output alphabet Y, and a probability transition matrix p(y|x).

Definition 2 (Channel capacity) The “information” channel capacity of a
discrete memoryless channel is

    C = max_{p(x)} I(X; Y)

where the maximum is taken over all possible input distributions p(x).

■ Operational definition of channel capacity: the highest rate, in bits per
  channel use, at which information can be sent with arbitrarily low
  probability of error.
■ Shannon’s second theorem: the information channel capacity is equal to the
  operational channel capacity.
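Definition 2 is a concrete optimization problem, so it can be evaluated
numerically. Below is a minimal sketch (not from the slides; the function name
and parameters are our own) of the classical Blahut–Arimoto iteration for
C = max_{p(x)} I(X; Y):

```python
import numpy as np

def channel_capacity(P, tol=1e-9, max_iter=10_000):
    """Blahut-Arimoto iteration for C = max_{p(x)} I(X; Y), in bits.

    P[x, y] = p(y|x) is the channel transition matrix.
    """
    m = P.shape[0]
    p = np.full(m, 1.0 / m)                  # start from the uniform input
    for _ in range(max_iter):
        q = p @ P                            # output distribution p(y)
        with np.errstate(divide="ignore", invalid="ignore"):
            # D(p(y|x) || p(y)) for each input symbol x, in bits
            d = np.where(P > 0, P * np.log2(P / q), 0.0).sum(axis=1)
        p_new = p * np.exp2(d)
        p_new /= p_new.sum()
        done = np.max(np.abs(p_new - p)) < tol
        p = p_new
        if done:
            break
    return float(p @ d), p                   # capacity and the maximizing p(x)

# Noiseless binary channel (Example 1 below): C = 1 bit at p(x) = (1/2, 1/2).
C, p_opt = channel_capacity(np.eye(2))
print(C, p_opt)
```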


Example 1

Noiseless binary channel:

    p(Y = 0) = p(X = 0) = π_0,  p(Y = 1) = p(X = 1) = π_1 = 1 − π_0

    I(X; Y) = H(Y) − H(Y|X) = H(Y) ≤ 1

with equality iff π_0 = π_1 = 1/2. Hence C = 1 bit per transmission.


Example 2

Noisy channel with non-overlapping outputs:

    p(X = 0) = π_0,  p(X = 1) = π_1 = 1 − π_0
    p(Y = 1) = π_0 p,  p(Y = 2) = π_0 (1 − p),  p = 1/2
    p(Y = 3) = π_1 q,  p(Y = 4) = π_1 (1 − q),  q = 1/3

    I(X; Y) = H(Y) − H(Y|X) = H(Y) − π_0 H(p) − π_1 H(q)
            = H(π_0)          [since H(Y) = H(π_0) + π_0 H(p) + π_1 H(q)]
            = H(X) ≤ 1

Hence C = 1 bit, achieved at π_0 = 1/2: the outputs never overlap, so the
input can always be recovered.


Noisy Typewriter

[Figure: the noisy typewriter. Each input letter is received either unchanged
or as the next letter of the alphabet, each with probability 1/2.]


    I(X; Y) = H(Y) − H(Y|X)
            = H(Y) − Σ_x p(x) H(Y|X = x)
            = H(Y) − Σ_x p(x) H(1/2)     [two equally likely outputs per input]
            = H(Y) − 1
            ≤ log 26 − 1 = log 13

    C = max I(X; Y) = log 13 bits, achieved by the uniform input distribution.


Binary Symmetric Channel (BSC)

[Figure: the BSC. Each input bit is received correctly with probability 1 − p
and flipped with crossover probability p.]


    I(X; Y) = H(Y) − H(Y|X)
            = H(Y) − Σ_x p(x) H(Y|X = x)
            = H(Y) − Σ_x p(x) H(p)
            = H(Y) − H(p)
            ≤ 1 − H(p)

    C = max I(X; Y) = 1 − H(p) bits, achieved by the uniform input distribution.
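The closed form is immediate to evaluate; a small sketch (our own, with an
illustrative crossover probability):

```python
import numpy as np

def Hb(p):
    """Binary entropy H(p) in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return float(-p * np.log2(p) - (1 - p) * np.log2(1 - p))

def bsc_capacity(p):
    return 1.0 - Hb(p)

print(bsc_capacity(0.11))   # ~0.5 bits per channel use
```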


Binary Erasure Channel

[Figure: the BEC. Each input bit is received correctly with probability 1 − α
and erased (output e) with probability α.]


    I(X; Y) = H(Y) − H(Y|X)
            = H(Y) − Σ_x p(x) H(Y|X = x)
            = H(Y) − Σ_x p(x) H(α)
            = H(Y) − H(α)

With p(X = 0) = π_0, grouping the outputs gives H(Y) = (1 − α) H(π_0) + H(α),
so

    I(X; Y) = (1 − α) H(π_0)  and  C = max I(X; Y) = 1 − α,

achieved at π_0 = 1/2.


7.3 Properties of Channel Capacity


Properties of Channel Capacity

■ C ≥ 0.
■ C ≤ log |X|.
■ C ≤ log |Y|.
■ I(X; Y) is a continuous function of p(x).
■ I(X; Y) is a concave function of p(x), so a local maximum of the capacity
  problem is a global maximum.


7.4 Preview of the Channel Coding Theorem


Preview of the Channel Coding Theorem

■ For each (typical) input n-sequence, there are approximately 2^{nH(Y|X)}
  possible Y sequences.
■ The total number of possible (typical) Y sequences is 2^{nH(Y)}.
■ This set has to be divided into sets of size 2^{nH(Y|X)} corresponding to
  the different input X sequences.
■ The total number of disjoint sets is therefore at most

      2^{nH(Y)} / 2^{nH(Y|X)} = 2^{n(H(Y) − H(Y|X))} = 2^{nI(X;Y)}.

■ Hence we can send at most 2^{nI(X;Y)} distinguishable sequences of length n.


Example

■ 6 typical sequences for X^n; 4 typical sequences for Y^n.
■ 12 jointly typical sequences for (X^n, Y^n).
■ For every typical X^n, we have

      2^{nH(X,Y)} / 2^{nH(X)} = 2^{nH(Y|X)} = 2

  typical Y^n, e.g., X^n = 001100 ⇒ Y^n = 010100 or 101011.


Example

■ Since there are 2^{nH(Y)} = 4 typical Y^n in total, to how many typical X^n
  can these typical Y^n be assigned without overlap?

      2^{nH(Y)} / 2^{nH(Y|X)} = 2^{n(H(Y) − H(Y|X))} = 2^{nI(X;Y)} = 2.

■ Can we assign more typical X^n? No. For some received Y^n we could not
  determine which X^n was sent, e.g., if we use 001100, 101101, and 101000 as
  codewords, we cannot determine which codeword was sent when we receive
  101011.


7.5 Definitions


Communication Channel

■ Message W ∈ {1, 2, ..., M}.
■ Encoder: input W, output X^n ≡ X^n(W) ∈ X^n.
  ◆ n is the length of the signal. We transmit the signal by using the
    channel n times, sending one symbol per use.
■ Channel: input X^n, output Y^n with distribution p(y^n|x^n).
■ Decoder: input Y^n, output Ŵ = g(Y^n), where g is a deterministic decoding
  rule.
■ If Ŵ ≠ W, an error occurs.


Definitions

Definition 3 (Discrete channel) A discrete channel, denoted by
(X, p(y|x), Y), consists of two finite sets X and Y and a collection of
probability mass functions p(y|x).

■ X: input; Y: output; for every input x ∈ X, Σ_y p(y|x) = 1.

Definition 4 (Discrete memoryless channel, DMC) The nth extension of the
discrete memoryless channel is the channel (X^n, p(y^n|x^n), Y^n), where

    p(y_k | x^k, y^{k−1}) = p(y_k | x_k),  k = 1, 2, ..., n.

■ Without feedback: p(x_k | x^{k−1}, y^{k−1}) = p(x_k | x^{k−1}).
■ For the nth extension of a DMC without feedback,

    p(y^n | x^n) = Π_{i=1}^{n} p(y_i | x_i).
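The memoryless property makes the channel trivial to simulate symbol by
symbol. A minimal sketch (our own; the function name and the BSC numbers are
illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def dmc_output(P, xn, rng):
    """Pass x^n through a DMC: p(y^n | x^n) = prod_i p(y_i | x_i).

    P[x, y] = p(y|x); each symbol sees the channel independently.
    """
    return np.array([rng.choice(P.shape[1], p=P[x]) for x in xn])

# A BSC with crossover probability 0.1.
P = np.array([[0.9, 0.1],
              [0.1, 0.9]])
print(dmc_output(P, np.array([0, 1, 1, 0, 1]), rng))
```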


Definitions

Definition 5 ((M, n) code) An (M, n) code for the channel (X, p(y|x), Y)
consists of the following:

1. An index set {1, 2, ..., M}.
2. An encoding function X^n : {1, 2, ..., M} → X^n. The codewords are
   x^n(1), x^n(2), ..., x^n(M); the set of codewords is called the codebook.
3. A decoding function g : Y^n → {1, 2, ..., M}.

Definitions

Definition 6 (Conditional probability of error)

    λ_i = Pr(g(Y^n) ≠ i | X^n = x^n(i))
        = Σ_{y^n : g(y^n) ≠ i} p(y^n | x^n(i))
        = Σ_{y^n} p(y^n | x^n(i)) I(g(y^n) ≠ i)

■ I(·) is the indicator function.


Definitions

Definition 7 (Maximal probability of error)

    λ^{(n)} = max_{i ∈ {1, 2, ..., M}} λ_i

Definition 8 (Average probability of error)

    P_e^{(n)} = (1/M) Σ_{i=1}^{M} λ_i

■ The decoding error is

      Pr(g(Y^n) ≠ W) = Σ_{i=1}^{M} Pr(W = i) Pr(g(Y^n) ≠ i | W = i).

  If the index W is chosen uniformly from {1, 2, ..., M}, then
  P_e^{(n)} = Pr(g(Y^n) ≠ W).


Definitions

Definition 9 (Rate) The rate R of an (M, n) code is

    R = (log M) / n  bits per transmission.

Definition 10 (Achievable rate) A rate R is said to be achievable if there
exists a sequence of (⌈2^{nR}⌉, n) codes such that the maximal probability of
error λ^{(n)} tends to 0 as n → ∞.

Definition 11 (Channel capacity) The capacity of a channel is the supremum of
all achievable rates.
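To see why achievability at a positive rate is the interesting claim, consider
(our own illustration, not from the slides) the n-fold repetition code over a
BSC(0.1): its error probability vanishes as n grows, but its rate 1/n vanishes
too.

```python
import numpy as np

rng = np.random.default_rng(1)

def repetition_error(n, p=0.1, trials=100_000):
    """Empirical error of the (2, n) repetition code with majority decoding."""
    flips = rng.random((trials, n)) < p
    return float(np.mean(flips.sum(axis=1) > n / 2))  # majority of bits flipped

for n in (1, 3, 5, 11):
    print(n, "rate:", 1 / n, "error:", repetition_error(n))
```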


7.6 Jointly Typical Sequences


Definitions

Definition 12 (Jointly typical sequences) The set A_ε^{(n)} of jointly typical
sequences {(x^n, y^n)} with respect to the distribution p(x, y) is defined by

    A_ε^{(n)} = { (x^n, y^n) ∈ X^n × Y^n :
        | −(1/n) log p(x^n) − H(X) | < ε,
        | −(1/n) log p(y^n) − H(Y) | < ε,
        | −(1/n) log p(x^n, y^n) − H(X, Y) | < ε }

where

    p(x^n, y^n) = Π_{i=1}^{n} p(x_i, y_i).
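Definition 12 translates directly into a membership test. A minimal sketch
(our own names and interface):

```python
import numpy as np

def entropy(p):
    """Entropy in bits of a pmf given as an array."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def jointly_typical(xn, yn, pxy, eps):
    """Test whether (x^n, y^n) lies in A_eps^(n) for the joint pmf pxy[x, y]."""
    n = len(xn)
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    HX, HY, HXY = entropy(px), entropy(py), entropy(pxy.ravel())
    lx = -np.log2(px[xn]).sum() / n        # -(1/n) log p(x^n)
    ly = -np.log2(py[yn]).sum() / n        # -(1/n) log p(y^n)
    lxy = -np.log2(pxy[xn, yn]).sum() / n  # -(1/n) log p(x^n, y^n)
    return abs(lx - HX) < eps and abs(ly - HY) < eps and abs(lxy - HXY) < eps
```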


Joint AEP

Theorem 1 (Joint AEP) Let (X^n, Y^n) be sequences of length n drawn i.i.d.
according to p(x^n, y^n) = Π_{i=1}^{n} p(x_i, y_i). Then:

1. Pr( (X^n, Y^n) ∈ A_ε^{(n)} ) → 1 as n → ∞.
2. |A_ε^{(n)}| ≤ 2^{n(H(X,Y) + ε)}.
3. If (X̃^n, Ỹ^n) ∼ p(x^n) p(y^n) [i.e., X̃^n and Ỹ^n are independent with
   the same marginals as (X^n, Y^n)], then

       Pr( (X̃^n, Ỹ^n) ∈ A_ε^{(n)} ) ≤ 2^{−n(I(X;Y) − 3ε)}.

   Also, for sufficiently large n,

       Pr( (X̃^n, Ỹ^n) ∈ A_ε^{(n)} ) ≥ (1 − ε) 2^{−n(I(X;Y) + 3ε)}.

Joint AEP

Theorem 2 (Joint AEP, part 1) Pr( (X^n, Y^n) ∈ A_ε^{(n)} ) → 1 as n → ∞.

Proof. Given ε > 0, define the events

    A = { X^n : | −(1/n) log p(X^n) − H(X) | ≥ ε },
    B = { Y^n : | −(1/n) log p(Y^n) − H(Y) | ≥ ε },
    C = { (X^n, Y^n) : | −(1/n) log p(X^n, Y^n) − H(X, Y) | ≥ ε }.

Joint AEP

Then, by the weak law of large numbers, there exist n_1, n_2, n_3 such that

    Pr(A) < ε/3 for all n > n_1,
    Pr(B) < ε/3 for all n > n_2,
    Pr(C) < ε/3 for all n > n_3.

Thus,

    Pr( (X^n, Y^n) ∈ A_ε^{(n)} ) = Pr(A^c ∩ B^c ∩ C^c)
        = 1 − Pr(A ∪ B ∪ C)
        ≥ 1 − (Pr(A) + Pr(B) + Pr(C))
        ≥ 1 − ε

for all n > max{n_1, n_2, n_3}.


Joint AEP

Theorem 3 (Joint AEP, part 2) |A_ε^{(n)}| ≤ 2^{n(H(X,Y) + ε)}.

Proof.

    1 = Σ_{(x^n, y^n)} p(x^n, y^n)
      ≥ Σ_{(x^n, y^n) ∈ A_ε^{(n)}} p(x^n, y^n)
      ≥ |A_ε^{(n)}| 2^{−n(H(X,Y) + ε)}.

Thus |A_ε^{(n)}| ≤ 2^{n(H(X,Y) + ε)}.

Joint AEP

Theorem 4 (Joint AEP, part 3) If (X̃^n, Ỹ^n) ∼ p(x^n) p(y^n) [i.e., X̃^n and
Ỹ^n are independent with the same marginals], then

    Pr( (X̃^n, Ỹ^n) ∈ A_ε^{(n)} ) ≤ 2^{−n(I(X;Y) − 3ε)}.

Also, for sufficiently large n,

    Pr( (X̃^n, Ỹ^n) ∈ A_ε^{(n)} ) ≥ (1 − ε) 2^{−n(I(X;Y) + 3ε)}.

Joint AEP

Proof.

    Pr( (X̃^n, Ỹ^n) ∈ A_ε^{(n)} ) = Σ_{(x^n, y^n) ∈ A_ε^{(n)}} p(x^n) p(y^n)
        ≤ 2^{n(H(X,Y) + ε)} · 2^{−n(H(X) − ε)} · 2^{−n(H(Y) − ε)}
        = 2^{−n(I(X;Y) − 3ε)}.

For sufficiently large n, Pr(A_ε^{(n)}) ≥ 1 − ε, and therefore

    1 − ε ≤ Σ_{(x^n, y^n) ∈ A_ε^{(n)}} p(x^n, y^n)
          ≤ |A_ε^{(n)}| 2^{−n(H(X,Y) − ε)},

so

    |A_ε^{(n)}| ≥ (1 − ε) 2^{n(H(X,Y) − ε)}.

Joint AEP

    Pr( (X̃^n, Ỹ^n) ∈ A_ε^{(n)} ) = Σ_{(x^n, y^n) ∈ A_ε^{(n)}} p(x^n) p(y^n)
        ≥ (1 − ε) 2^{n(H(X,Y) − ε)} · 2^{−n(H(X) + ε)} · 2^{−n(H(Y) + ε)}
        = (1 − ε) 2^{−n(I(X;Y) + 3ε)}


Joint AEP: Conclusion

■ There are about 2^{nH(X)} typical X sequences and about 2^{nH(Y)} typical
  Y sequences.
■ There are about 2^{nH(X,Y)} jointly typical sequences.
■ If we choose a typical X^n and a typical Y^n independently at random, the
  probability that the pair is jointly typical is about 2^{−nI(X;Y)}.
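This contrast is easy to see empirically. The sketch below (our own; it reuses
the `jointly_typical` test defined after Definition 12, with a hypothetical
BSC(0.1) joint pmf) draws pairs jointly and independently: jointly drawn pairs
land in A_ε^{(n)} with high probability, independently drawn pairs essentially
never do.

```python
import numpy as np

rng = np.random.default_rng(2)

# Joint pmf p(x, y) of a BSC(0.1) with uniform input.
pxy = np.array([[0.45, 0.05],
                [0.05, 0.45]])
px, py = pxy.sum(axis=1), pxy.sum(axis=0)
n, eps, trials = 1000, 0.05, 2000

joint_hits = indep_hits = 0
for _ in range(trials):
    idx = rng.choice(4, size=n, p=pxy.ravel())   # (X^n, Y^n) ~ p(x, y)
    joint_hits += jointly_typical(idx // 2, idx % 2, pxy, eps)
    xi = rng.choice(2, size=n, p=px)             # independent marginals
    yi = rng.choice(2, size=n, p=py)
    indep_hits += jointly_typical(xi, yi, pxy, eps)

print(joint_hits / trials)   # high; -> 1 as n grows
print(indep_hits / trials)   # ~0: about 2^{-n I(X;Y)} with I ~ 0.53 bits
```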


7.7 Channel Coding Theorem


Channel Coding Theorem

Theorem 5 (Channel coding theorem) For every rate R < C, there exists a
sequence of (2^{nR}, n) codes with maximum probability of error λ^{(n)} → 0.
Conversely, any sequence of (2^{nR}, n) codes with λ^{(n)} → 0 must have
R ≤ C.

■ We have to prove two parts:
  ◆ R < C → achievable.
  ◆ Achievable → R ≤ C.
■ Main ideas:
  ◆ Random encoding (random codes).
  ◆ Jointly typical decoding.


Random Code

■ Generate a (2^{nR}, n) code at random according to a fixed distribution
  p(x). That is, the 2^{nR} codewords are drawn i.i.d. with

      p(x^n) = Π_{i=1}^{n} p(x_i).

■ A particular code C is the matrix whose 2^{nR} rows are the codewords:

      C = [ x_1(1)       x_2(1)       · · ·  x_n(1)
            x_1(2)       x_2(2)       · · ·  x_n(2)
              ...                             ...
            x_1(2^{nR})  x_2(2^{nR})  · · ·  x_n(2^{nR}) ]

■ The code C is revealed to both sender and receiver. Both are also assumed
  to know the channel transition matrix p(y|x).


Random Code

■ There are (|X|^n)^{2^{nR}} different codes.
■ The probability of a particular code C is

      Pr(C) = Π_{w=1}^{2^{nR}} Π_{i=1}^{n} p(x_i(w)).
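Generating such a random code is a one-liner; a sketch with hypothetical
parameters:

```python
import numpy as np

rng = np.random.default_rng(3)

n, R = 10, 0.5
M = int(np.ceil(2 ** (n * R)))        # 2^{nR} codewords
p = np.array([0.5, 0.5])              # a fixed input distribution p(x)
codebook = rng.choice(2, size=(M, n), p=p)   # row w is x^n(w), symbols i.i.d. ~ p
print(codebook.shape)                 # (32, 10)
```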


Transmission and Channel

■ A message W is chosen according to a uniform distribution:

      Pr[W = w] = 2^{−nR},  w = 1, 2, ..., 2^{nR}.

■ The wth codeword X^n(w), corresponding to the wth row of C, is sent over
  the channel.
■ The receiver receives a sequence Y^n according to the distribution

      P(y^n | x^n(w)) = Π_{i=1}^{n} p(y_i | x_i(w)).

  That is, we use the DMC n times.


Jointly Typical Decoding

■ The receiver declares that message Ŵ was sent if
  ◆ (X^n(Ŵ), Y^n) is jointly typical, and
  ◆ there is no other jointly typical pair for Y^n; that is, there is no
    other W′ ≠ Ŵ such that (X^n(W′), Y^n) is jointly typical.
■ If no such Ŵ exists, or if there is more than one, an error is declared
  (Ŵ = 0).
■ A decoding error occurs if Ŵ ≠ W. Let E be the event {Ŵ ≠ W}.
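The whole construction fits in a short simulation. The sketch below (our own;
all parameters are illustrative) draws a random code, sends codeword 1 through
a BSC(0.05), and decodes by joint typicality. The rate 0.02 is kept far below
C = 1 − H(0.05) ≈ 0.71 so the codebook stays small enough to enumerate at a
blocklength where typicality has concentrated; with a uniform input the
marginal typicality conditions hold exactly, so only the joint condition is
tested.

```python
import numpy as np

rng = np.random.default_rng(4)

n, R, p, eps, trials = 500, 0.02, 0.05, 0.1, 200
M = int(2 ** (n * R))                        # 2^{nR} = 1024 codewords
HXY = 1.0 - p * np.log2(p) - (1 - p) * np.log2(1 - p)   # H(X,Y) = 1 + H(p)

errors = 0
for _ in range(trials):
    book = rng.integers(0, 2, size=(M, n))   # random codebook, p(x) uniform
    noise = (rng.random(n) < p).astype(book.dtype)
    yn = book[0] ^ noise                     # send codeword 1 through the BSC
    # -(1/n) log p(x^n(w), y^n) for every codeword w at once: per symbol,
    # p(x, y) = 0.5(1-p) on agreement and 0.5p on disagreement.
    agree = (book == yn).sum(axis=1)
    lxy = (-agree * np.log2(0.5 * (1 - p))
           - (n - agree) * np.log2(0.5 * p)) / n
    typical = np.abs(lxy - HXY) < eps
    # error unless codeword 1 is jointly typical with y^n, and uniquely so
    errors += not (typical[0] and typical.sum() == 1)

print(errors / trials)                       # small; -> 0 as n grows
```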


Proof of R < C → Achievable

■ Consider the probability of error averaged over all codewords in the
  codebook and over all codebooks:

      Pr(E) = Σ_C Pr(C) P_e^{(n)}(C)
            = Σ_C Pr(C) (1/2^{nR}) Σ_{w=1}^{2^{nR}} λ_w(C)
            = (1/2^{nR}) Σ_{w=1}^{2^{nR}} Σ_C Pr(C) λ_w(C)

  ◆ P_e^{(n)}(C) is defined with respect to jointly typical decoding.
  ◆ By the symmetry of the code construction, Σ_C Pr(C) λ_w(C) does not
    depend on w.


Proof of R < C → Achievable

■ Therefore,

      Pr(E) = (1/2^{nR}) Σ_{w=1}^{2^{nR}} Σ_C Pr(C) λ_w(C)
            = Σ_C Pr(C) λ_w(C)    (for any w)
            = Σ_C Pr(C) λ_1(C) = Pr(E | W = 1).

■ Define E_i = {(X^n(i), Y^n) is a jointly typical pair} for
  i = 1, 2, ..., 2^{nR}, where Y^n is the channel output when the first
  codeword X^n(1) was sent. A decoding error occurs on the event
  ◆ E_1^c: the transmitted codeword and the received sequence are not
    jointly typical, or
  ◆ E_2 ∪ E_3 ∪ · · · ∪ E_{2^{nR}}: a wrong codeword is jointly typical with
    the received sequence.




Proof of R < C → Achievable

■ The average error satisfies

      Pr(E | W = 1) = P(E_1^c ∪ E_2 ∪ E_3 ∪ · · · ∪ E_{2^{nR}} | W = 1)
                    ≤ P(E_1^c | W = 1) + Σ_{i=2}^{2^{nR}} P(E_i | W = 1).

■ By the joint AEP,

      P(E_1^c | W = 1) ≤ ε  for n sufficiently large,

      P(E_i | W = 1) ≤ 2^{−n(I(X;Y) − 3ε)}  for i ≠ 1,

  since X^n(i) is independent of Y^n (joint AEP, part 3).


Proof of R < C → Achievable

■ We have

      Pr(E | W = 1) ≤ ε + (2^{nR} − 1) 2^{−n(I(X;Y) − 3ε)}
                    ≤ ε + 2^{nR} 2^{−n(I(X;Y) − 3ε)}
                    = ε + 2^{−n(I(X;Y) − R − 3ε)}.

■ If I(X;Y) − R − 3ε > 0, then 2^{−n(I(X;Y) − R − 3ε)} ≤ ε for n sufficiently
  large, and

      Pr(E | W = 1) ≤ 2ε.

■ So far we have proved: for any ε > 0, if R < I(X;Y) − 3ε and n is
  sufficiently large, the average decoding error satisfies
  Pr(E) = Pr(E | W = 1) ≤ 2ε.
■ What do we need? If R < C, the maximum error probability λ^{(n)} → 0.
  (We are almost there. Almost...)


Proof of R < C → Achievable, final part

■ Choose p(x) such that I(X;Y) is maximized, i.e., such that I(X;Y) achieves
  the channel capacity C. The condition R < I(X;Y) − 3ε can then be replaced
  by the achievability condition R < C − 3ε.
■ Since the average probability of error over codebooks is less than 2ε,
  there exists at least one codebook C* such that Pr(E | C*) ≤ 2ε.
  ◆ C* can be found by an exhaustive search over all codes.
■ Since W is chosen uniformly, we have

      Pr(E | C*) = (1/2^{nR}) Σ_{i=1}^{2^{nR}} λ_i(C*) ≤ 2ε,

  which implies that the maximal error probability of the better half of the
  codewords is less than 4ε.
  ◆ (Analogy: if 10 students have an average score of 40, then at least half
    of them must score below 80.)

Proof of R < C → Achievable, final part

■ We throw away the worst half of the codewords in the best codebook C*. The
  new code has a maximal probability of error less than 4ε. However, we now
  have a (2^{nR}/2, n) = (2^{n(R − 1/n)}, n) code: the rate of the new code
  is R − 1/n, which is negligibly different from R for large n.
■ Summary: if R − 1/n < C − 3ε for arbitrary ε > 0, then λ^{(n)} ≤ 4ε for n
  sufficiently large.


7.8 Zero-Error Codes


No error → R ≤ C

■ Assume that we have a (2^{nR}, n) code with zero probability of error.
  ◆ Then W is determined by Y^n: Pr(g(Y^n) = W) = 1, so H(W | Y^n) = 0.
■ To obtain a strong bound, assume that W is uniformly distributed over
  {1, 2, ..., 2^{nR}}. Then

      nR = H(W) = H(W | Y^n) + I(W; Y^n)
         = I(W; Y^n)
         ≤ I(X^n; Y^n)              (data processing ineq.; W → X^n(W) → Y^n)
         ≤ Σ_{i=1}^{n} I(X_i; Y_i)  (see Lemma 1 below)
         ≤ nC                       (definition of channel capacity)

■ That is, no error → R ≤ C.


No error → R ≤ C

Lemma 1 Let Y^n be the result of passing X^n through a discrete memoryless
channel of capacity C. Then for all p(x^n),

    I(X^n; Y^n) ≤ nC.

Proof.

    I(X^n; Y^n) = H(Y^n) − H(Y^n | X^n)
                = H(Y^n) − Σ_{i=1}^{n} H(Y_i | Y_1, ..., Y_{i−1}, X^n)
                = H(Y^n) − Σ_{i=1}^{n} H(Y_i | X_i)     (definition of DMC)
                ≤ Σ_{i=1}^{n} H(Y_i) − Σ_{i=1}^{n} H(Y_i | X_i)
                = Σ_{i=1}^{n} I(X_i; Y_i) ≤ nC


7.9 Fano’s Inequality and the Converse to the Coding Theorem


Fano’s Inequality

Theorem 6 (Fano’s inequality) Let X and W have the same sample space
X = {1, 2, ..., M} and joint p.m.f. p(x, w). Let

    P_e = Pr[X ≠ W] = Σ_{x ∈ X} Σ_{w ≠ x} p(x, w).

Then

    P_e log(M − 1) + H(P_e) ≥ H(X | W)

where

    H(P_e) = −P_e log P_e − (1 − P_e) log(1 − P_e).
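A quick numerical sanity check of the bound (our own sketch; the random joint
pmf is purely illustrative):

```python
import numpy as np

def H(v):
    """Entropy in bits of a pmf given as an array."""
    v = v[v > 0]
    return float(-(v * np.log2(v)).sum())

rng = np.random.default_rng(5)
M = 4
pxw = rng.random((M, M))
pxw /= pxw.sum()                          # a random joint pmf p(x, w)

Pe = 1.0 - np.trace(pxw)                  # Pr[X != W]
H_X_given_W = H(pxw.ravel()) - H(pxw.sum(axis=0))   # H(X|W) = H(X,W) - H(W)
lhs = Pe * np.log2(M - 1) + H(np.array([Pe, 1 - Pe]))
print(lhs >= H_X_given_W)                 # True: Fano's bound holds
```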


Fano’s Inequality

Proof. We will prove that H(X|W) − H(P_e) − P_e log(M − 1) ≤ 0. Write each
term as a double sum:

    H(X|W) = Σ_x Σ_w p(x, w) log [1 / p(x|w)]
           = Σ_x Σ_{w=x} p(x, w) log [1 / p(x|w)]
             + Σ_x Σ_{w≠x} p(x, w) log [1 / p(x|w)]

    −P_e log(M − 1) = Σ_x Σ_{w≠x} p(x, w) log [1 / (M − 1)]

    −H(P_e) = P_e log P_e + (1 − P_e) log(1 − P_e)
            = Σ_x Σ_{w≠x} p(x, w) log P_e + Σ_x Σ_{w=x} p(x, w) log(1 − P_e)

Add the above three terms together.


Fano’s Inequality

Proof (cont.)

    H(X|W) − P_e log(M − 1) − H(P_e)
      = Σ_x Σ_{w≠x} p(x, w) log [ P_e / ((M − 1) p(x|w)) ]
        + Σ_x Σ_{w=x} p(x, w) log [ (1 − P_e) / p(x|w) ]
      ≤ log [ Σ_x Σ_{w≠x} p(x, w) P_e / ((M − 1) p(x|w))
              + Σ_x Σ_{w=x} p(x, w) (1 − P_e) / p(x|w) ]
      = log [ (P_e / (M − 1)) Σ_x Σ_{w≠x} p(w) + (1 − P_e) Σ_x Σ_{w=x} p(w) ]
      = log [P_e + (1 − P_e)] = 0


Fano’s Inequality

Corollary 1
1. P_e log M + H(P_e) ≥ H(X|W), where P_e = Pr[X ≠ W].
2. 1 + P_e log M ≥ H(X|W), where P_e = Pr[X ≠ W].
3. If X → Y → X̂ and P_e = Pr[X ≠ X̂], then
   H(P_e) + P_e log M ≥ H(X | X̂) ≥ H(X | Y).

Remarks.
1. H(X|W) ≤ P_e log(M − 1) + H(P_e) ≤ P_e log M + H(P_e).
2. H(X|W) ≤ P_e log(M − 1) + H(P_e) ≤ P_e log M + 1, since H(P_e) ≤ 1.
3. The second inequality in 3 follows from the data processing inequality.

Data Processing Inequality

Lemma 2 (Data processing inequality) If X → Y → Z, then

    I(X; Z) ≤ I(X; Y).

Proof.

    I(X; Z) − I(X; Y) = H(X) − H(X|Z) − [H(X) − H(X|Y)]
      = H(X|Y) − H(X|Z)
      = Σ_x Σ_y p(x, y) log [1 / p(x|y)] − Σ_x Σ_z p(x, z) log [1 / p(x|z)]
      = Σ_x Σ_y Σ_z p(x, y, z) log [1 / p(x|y)]
        − Σ_x Σ_y Σ_z p(x, y, z) log [1 / p(x|z)]
      = Σ_x Σ_y Σ_z p(x, y, z) log [ p(x|z) / p(x|y) ]
      ≤ log Σ_x Σ_y Σ_z p(x, y, z) p(x|z) / p(x|y)   (concavity of the logarithm)

Data Processing Inequality

Proof (cont.) Since X → Y → Z, we have

    p(x, y, z) = p(x, y) p(z|x, y) = p(x, y) p(z|y) = p(x, y) p(y, z) / p(y)

and

    p(x, y, z) p(x|z) / p(x|y)
      = [p(x, y) p(y, z) / p(y)] × [p(x, z) p(y) / (p(z) p(x, y))]
      = p(x, z) p(y, z) / p(z).

Therefore,

    Σ_x Σ_y Σ_z p(x, y, z) p(x|z) / p(x|y)
      = Σ_x Σ_z [p(x, z) / p(z)] Σ_y p(y, z)
      = Σ_x Σ_z p(x, z) = 1,

so I(X; Z) − I(X; Y) ≤ log 1 = 0.


Data Processing Inequality (Summary)

Lemma 3
1. If X → Y → Z, then

       I(X; Z) ≤ min{ I(X; Y), I(Y; Z) }  and  H(X|Y) ≤ H(X|Z).

2. If X → Y → Z → W, then

       I(X; Z) + I(Y; W) ≤ I(X; W) + I(Y; Z)  and  I(X; W) ≤ I(Y; Z).


Achievable → R ≤ C

Theorem 7 (Converse to the channel coding theorem) Any sequence of
(2^{nR}, n) codes with λ^{(n)} → 0 must have R ≤ C.

Proof.

■ For a fixed encoding rule X^n(W) and a fixed decoding rule Ŵ = g(Y^n), we
  have W → X^n(W) → Y^n → Ŵ.
■ For each n, let W be drawn according to a uniform distribution over
  {1, 2, ..., 2^{nR}}.
■ Since W has a uniform distribution,

      Pr[Ŵ ≠ W] = P_e^{(n)} = (1/2^{nR}) Σ_{i=1}^{2^{nR}} λ_i.


Achievable → R ≤ C

Proof (cont.)

    nR = H(W)                               (W is uniformly distributed)
       = H(W | Ŵ) + I(W; Ŵ)
       ≤ 1 + P_e^{(n)} nR + I(W; Ŵ)        (Fano’s inequality)
       ≤ 1 + P_e^{(n)} nR + I(X^n; Y^n)     (data processing inequality)
       ≤ 1 + P_e^{(n)} nR + nC              (Lemma 1)

    ⇒ P_e^{(n)} ≥ 1 − C/R − 1/(nR)

That is, if R > C, the probability of error is bounded below by a positive
constant for sufficiently large n; it cannot be made arbitrarily small.
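The final bound makes the converse quantitative. A tiny illustration (our own
numbers; e.g., a BSC(0.11) has C ≈ 0.5 bits):

```python
# Lower bound on the error probability when signaling above capacity:
# P_e^(n) >= 1 - C/R - 1/(nR).
C, R = 0.5, 0.75
for n in (10, 100, 1000):
    print(n, 1 - C / R - 1 / (n * R))   # -> 1/3 as n grows
```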