SLIDE 1

Lecture 2 Lossless Source Coding

I-Hsiang Wang

Department of Electrical Engineering, National Taiwan University
ihwang@ntu.edu.tw

October 2, 2016

SLIDE 2

The engineering problem motivating the study of this lecture:

For a (random) source sequence of length N, design an encoding scheme (mapping) to describe it using K bits, so that the decoder can reconstruct the source sequence at the destination from these K bits. How the encoding scheme works (the mapping) is known to the decoder a priori.

Fundamental questions:
- What is the minimum possible ratio K/N (the compression ratio/rate)?
- How do we achieve that fundamental limit?

In this lecture, we will demonstrate that, for most random sources, when we want to reconstruct the source losslessly, the fundamental limit is the entropy rate of the random process of the source.

SLIDE 3

The Source Coding Problem (Shannon's Abstraction)

(Block diagram: Source → Source Encoder → Source Decoder → Destination, with signals s[1:N] → b[1:K] → ŝ[1:N].)

Meta Description

1. Encoder: Represent the source sequence s[1:N] by a binary source codeword w ≜ b[1:K] ∈ {0, 1, ..., 2^K − 1}, with K as small as possible.
2. Decoder: From the source codeword w, reconstruct the source sequence either losslessly or within a certain distortion.
3. Efficiency: Determined by the code rate R ≜ K/N bits per symbol time.

SLIDE 4

Decoding Criteria


Naturally, one would think of two different decoding criteria for the source coding problem.

1. Exact: the reconstructed sequence ŝ[1:N] = s[1:N].
2. Lossy: the reconstructed sequence ŝ[1:N] ≠ s[1:N] in general, but is within a prescribed distortion.

SLIDE 5

Let us begin with some simple back-of-the-envelope analysis of the system under the exact recovery criterion to get some intuition. For fixed N, if the decoder would like to reconstruct s[1:N] exactly for all possible s[1:N] ∈ S^N, then it is simple to see that the smallest K must satisfy

2^(K−1) < |S|^N ≤ 2^K ⟹ K = ⌈N log|S|⌉. Why?

Because every possible sequence has to be uniquely represented by K bits! It seems impossible to have data compression if we require exact reconstruction.

What is going wrong?
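The counting bound above is easy to evaluate numerically. A minimal sketch (the alphabet size and block length below are arbitrary choices for illustration, not values from the lecture):

```python
import math

def min_bits_exact(alphabet_size: int, N: int) -> int:
    """Smallest K with 2^K >= |S|^N, i.e. K = ceil(N * log2|S|)."""
    return math.ceil(N * math.log2(alphabet_size))

# A ternary source of length 100 needs 159 bits, about 1.59 bits/symbol:
# no rate below log2(3) is possible under exact recovery of all sequences.
print(min_bits_exact(3, 100))  # 159
```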

SLIDE 6

Random Source

Recall: data compression is possible because there is redundancy in the source sequence. One of the simplest ways to capture redundancy is to model the source as a random process.

(Another reason to use a random source model comes from engineering practice, as mentioned in Lecture 1.)

Redundancy comes from the fact that different symbols in S are drawn with different probabilities. With a random source model, there are immediately two approaches one can take to demonstrate data compression:
- Allow variable codeword lengths for symbols with different probabilities, rather than fixing the length to K.
- Allow (almost) lossless reconstruction rather than exact recovery.

SLIDE 7

Block-to-Variable Source Coding

(Block diagram as before, with the codeword b[1:K] now of variable length.)

The key difference here is that we allow K to depend on the realization of the source, s[1:N]. Using variable codeword lengths is intuitive: for symbols with higher probability, we tend to use shorter codewords. The definition of the code rate is modified to R ≜ E[K]/N.

An optimal block-to-variable source code, the Huffman code, achieves the minimum compression rate for a given distribution of the random source. (See Chapter 5 of Cover & Thomas.) Note: the decoding criterion here is exact reconstruction (zero error).
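Huffman's construction is not developed in these slides (see Cover & Thomas, Ch. 5), but a compact sketch conveys the idea: repeatedly merge the two least probable subtrees. This is the standard textbook algorithm; the pmf below is an arbitrary illustrative choice.

```python
import heapq
from itertools import count

def huffman_code(pmf):
    """Build a binary prefix code for a pmf {symbol: probability}."""
    tiebreak = count()  # keeps heap comparisons away from the dict payloads
    heap = [(p, next(tiebreak), {sym: ""}) for sym, p in pmf.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)  # two least probable subtrees
        p1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c0.items()}
        merged.update({s: "1" + w for s, w in c1.items()})
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
    return heap[0][2]

pmf = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
code = huffman_code(pmf)
avg_len = sum(pmf[s] * len(code[s]) for s in pmf)
print(code, avg_len)  # expected length 1.75 bits, which equals H(S) here
```

For this dyadic pmf the expected codeword length meets the entropy exactly; in general Huffman coding achieves an expected length within one bit of H(S).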

SLIDE 8

(Almost) Lossless Decoding Criterion

Another way to let the randomness kick in: allow non-exact recovery. To be precise, we turn our focus to finding the smallest possible R = K/N given that the error probability

P_e^(N) ≜ P{ Ŝ[1:N] ≠ S[1:N] } → 0 as N → ∞.

Key features of this approach:
- Focus on the asymptotic regime where N → ∞; instead of error-free reconstruction, the criterion is relaxed to a vanishing error probability.
- Compared to the previous approach, where the analysis is mainly combinatorial, the analysis here is mainly probabilistic.

SLIDE 9

Outline

In this lecture, we shall

1. First, focusing on memoryless sources, introduce a powerful tool called typical sequences, and use typical sequences to prove a lossless source coding theorem.
2. Second, extend the typical-sequence framework to sources with memory, and prove a similar lossless source coding theorem there.

We will show that the minimum compression rate is equal to the entropy of the random source. Let us begin with the simplest case, where the source {S[t] | t = 1, 2, ...} consists of i.i.d. random variables S[t] ∼ P_S; this is called a discrete memoryless source (DMS).

SLIDE 10

1. Typical Sequences and a Lossless Source Coding Theorem
   - Typicality and AEP
   - Lossless Source Coding Theorem

2. Weakly Typical Sequences and Sources with Memory
   - Entropy Rate of Random Processes
   - Typicality for Sources with Memory

SLIDE 11

Typicality and AEP

SLIDE 12

Overview of Typicality Methods

Goal: Understand and exploit the probabilistic asymptotic properties of an i.i.d. generated sequence S[1:N] for coding.

Key observation: As N → ∞, one often observes that a comparatively small set of sequences become "typical" and contribute almost the whole probability, while the others become "atypical".

(cf. Lecture 2 "Operational Meaning of Entropy")

For lossless reconstruction with vanishing error probability, we can use shorter codewords to label "typical" sequences and ignore "atypical" ones.

Note: There are several notions of typicality and various definitions in the literature. In this lecture, we give two definitions: (robust) typicality and weak typicality.

Notation: For convenience, we use the following interchangeably:

x[t] ↔ x_t, and x[1:N] ↔ x^N.

SLIDE 13

Typical Sequence

A (robust) typical sequence is a sequence whose empirical distribution is close to the true distribution. For a sequence x^n, its empirical p.m.f. is given by the frequency of occurrence of each symbol in x^n:

π(a|x^n) ≜ (1/n) Σ_{i=1}^n 1{x_i = a}.

Due to the law of large numbers, if X_i ∼ P_X i.i.d., then π(a|X^n) → P_X(a) in probability for all a ∈ X as n → ∞. That is, with high probability, the empirical p.m.f. does not deviate too much from the actual p.m.f.

Definition 1 (Typical Sequence)
For ε ∈ (0, 1), a sequence x^n is called ε-typical with respect to a random variable X ∼ P_X if

|π(a|x^n) − P_X(a)| ≤ ε P_X(a), ∀ a ∈ X.

The ε-typical set is T_ε^(n)(X) ≜ {x^n ∈ X^n | x^n is ε-typical with respect to X}.
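Definition 1 translates directly into a membership test. A small sketch (the pmf and sequences are illustrative choices):

```python
from collections import Counter

def is_robust_typical(x, pmf, eps):
    """Check |pi(a|x^n) - P_X(a)| <= eps * P_X(a) for every symbol a."""
    n = len(x)
    freq = Counter(x)
    return all(abs(freq.get(a, 0) / n - p) <= eps * p for a, p in pmf.items())

pmf = {0: 0.5, 1: 0.5}
print(is_robust_typical([0, 1, 1, 0, 1, 0, 0, 1, 0, 1], pmf, eps=0.2))  # True: five 0s
print(is_robust_typical([0, 0, 0, 0, 0, 0, 0, 1, 1, 1], pmf, eps=0.2))  # False: seven 0s
```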

SLIDE 14

Note: In the following, when the context is clear, we write "T_ε^(n)" instead of "T_ε^(n)(X)".

Example 1
Consider a random bit sequence generated i.i.d. according to Ber(1/2). Let us set ε = 0.2 and n = 10. What is T_ε^(n)? How large is the typical set?

sol: Based on the definition, an n-sequence x^n is ε-typical iff

π(0|x^n) ∈ [0.4, 0.6] and π(1|x^n) ∈ [0.4, 0.6].

In other words, the number of "0"s in the sequence should be 4, 5, or 6. Hence, T_ε^(n) consists of all length-10 sequences with 4, 5, or 6 "0"s. The size of T_ε^(n) is

(10 choose 4) + (10 choose 5) + (10 choose 6) = 672.
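A one-line computation confirms the count in Example 1:

```python
from math import comb

# Length-10 binary sequences with 4, 5, or 6 zeros.
print(sum(comb(10, k) for k in (4, 5, 6)))  # 672, out of 2^10 = 1024 sequences
```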

SLIDE 15

Properties of Typical Sequences

Let P(x^n) ≜ P{X^n = x^n} = Π_{i=1}^n P_X(x_i), that is, the probability that the DMS generates the sequence x^n. Similarly, P(A) ≜ P{X^n ∈ A} denotes the probability of a set A.

Proposition 1 (Properties of Typical Sequences and the Typical Set)

1. ∀ x^n ∈ T_ε^(n)(X): 2^(−n(H(X)+δ(ε))) ≤ P(x^n) ≤ 2^(−n(H(X)−δ(ε))), where δ(ε) = εH(X).
   (by the definition of typical sequences and entropy)
2. lim_{n→∞} P(T_ε^(n)(X)) = 1; in particular, P(T_ε^(n)(X)) ≥ 1 − ε for n large enough.
   (by the law of large numbers (LLN))
3. |T_ε^(n)(X)| ≤ 2^(n(H(X)+δ(ε))).
   (by summing the lower bound in Property 1 over the typical set)
4. |T_ε^(n)(X)| ≥ (1 − ε) 2^(n(H(X)−δ(ε))) for n large enough.
   (by the upper bound in Property 1, and Property 2)
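Properties 1 and 2 can be probed by simulation. The sketch below (source pmf, ε, n, and trial count are all arbitrary choices) checks the probability bounds on every typical sample and estimates P(T_ε^(n)):

```python
import math
import random
from collections import Counter

pmf = {0: 0.7, 1: 0.3}
eps, n, trials = 0.1, 500, 2000
H = -sum(p * math.log2(p) for p in pmf.values())
delta = eps * H

def is_typical(x):
    c = Counter(x)
    return all(abs(c.get(a, 0) / n - p) <= eps * p for a, p in pmf.items())

hits = 0
for _ in range(trials):
    x = random.choices(list(pmf), weights=list(pmf.values()), k=n)
    if is_typical(x):
        hits += 1
        # Property 1: every typical sequence obeys the probability bounds.
        logp = sum(math.log2(pmf[a]) for a in x)
        assert -n * (H + delta) <= logp <= -n * (H - delta)

print(f"estimated P(T_eps^(n)) = {hits / trials:.3f}")  # approaches 1 as n grows
```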

SLIDE 16

Asymptotic Equipartition Property (AEP)

(Figure: the typical set T_ε^(n)(X) drawn inside X^n, with P(T_ε^(n)(X)) → 1, |T_ε^(n)(X)| ≈ 2^(nH(X)), and P(x^n) ≈ 2^(−nH(X)) for each typical x^n.)

Observations:

1. The typical set has probability approaching 1 as n → ∞, while its size is roughly 2^(nH(X)), significantly smaller than |X^n| = 2^(n log|X|).
2. Within the typical set, all typical sequences have roughly the same probability 2^(−nH(X)).

SLIDE 17

Application to Data Compression

(Same figure as on the previous slide.)

In other words, as n → ∞, with probability approaching 1 the realization of the DMS is a typical sequence; typical sequences are roughly uniformly distributed over the typical set; and there are roughly 2^(nH(X)) of them.

Hence, we can use roughly nH(X) bits to uniquely describe each typical sequence, and ignore the atypical ones. Since the probability of getting an atypical sequence vanishes as n → ∞, so does the error probability.

SLIDE 18

Lossless Source Coding Theorem

SLIDE 19

Lossless Source Coding: Problem Setup (Formally)


1. A (2^(NR), N) lossless source code consists of
   - an encoding function (encoder) enc_N : S^N → {0,1}^K that maps each source sequence s^N to a bit sequence b^K, where K ≜ ⌊NR⌋;
   - a decoding function (decoder) dec_N : {0,1}^K → S^N that maps each bit sequence b^K to a reconstructed source sequence ŝ^N.

2. The error probability of a (2^(NR), N) code is defined as P_e^(N) ≜ P{Ŝ^N ≠ S^N}.

3. A rate R is said to be achievable if there exists a sequence of (2^(NR), N) codes such that P_e^(N) → 0 as N → ∞. The optimal compression rate is R* ≜ inf{R | R is achievable}.

SLIDE 20

A Lossless Source Coding Theorem


Theorem 1 (A Lossless Source Coding Theorem for the DMS)
For a DMS S, R* = H(S).

Remark: In information theory, to establish a coding theorem, one needs to prove two directions:

- Direct part (achievability): show that ∀ R > R* = H(S), there exists a sequence of (2^(NR), N) codes such that P_e^(N) → 0 as N → ∞.
- Converse part (converse): show that for every sequence of (2^(NR), N) codes such that P_e^(N) → 0 as N → ∞, the rate R ≥ R* = H(S).

SLIDE 21

Lossless Source Coding Theorem: Achievability Proof (1)


pf: Here we provide a simple proof based on typical sequences (typicality).

Codebook generation: Choose an ε > 0 and set R = H(S) + δ(ε) such that NR ∈ Z, where δ(ε) = εH(S). By Property 3 of Proposition 1, we have an upper bound on the number of typical sequences: |T_ε^(N)(S)| ≤ 2^(N(H(S)+δ(ε))) = 2^(NR).

Encoding: Hence, we can label each typical sequence with a length-NR bit sequence, which defines the encoding function for all s^N ∈ T_ε^(N). For s^N ∉ T_ε^(N), the encoding function maps it to the all-zero sequence.

Decoding: The decoding function simply maps the received bit sequence back to the corresponding typical sequence.
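For tiny block lengths the construction can be carried out literally by enumerating the typical set. No real system would do this (the set is exponentially large), but it makes the proof concrete; all parameters below are illustrative choices:

```python
import math
from collections import Counter
from itertools import product

pmf = {0: 0.8, 1: 0.2}
N, eps = 12, 0.3
H = -sum(p * math.log2(p) for p in pmf.values())

def is_typical(s):
    c = Counter(s)
    return all(abs(c.get(a, 0) / N - p) <= eps * p for a, p in pmf.items())

# Codebook: index the typical set; index 0 doubles as the fallback
# label for atypical sequences, mirroring the all-zero codeword above.
typical = [s for s in product(pmf, repeat=N) if is_typical(s)]
index = {s: i for i, s in enumerate(typical)}
K = math.ceil(math.log2(len(typical)))

def enc(s):                 # encoder: typical sequence -> index fitting in K bits
    return index.get(tuple(s), 0)

def dec(w):                 # decoder: index -> typical sequence
    return list(typical[w])

s = [0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1]   # a typical realization (three 1s)
assert dec(enc(s)) == s                     # lossless on the typical set
print(f"K = {K} bits, rate K/N = {K/N:.2f}, H(S) = {H:.2f}")
```

For N = 12 the rate overhead over H(S) is still visible; the point of the theorem is that it vanishes as N grows.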

SLIDE 22

Lossless Source Coding Theorem: Achievability Proof (2)


Error probability analysis: Obviously an error occurs iff the generated sequence s^N ∉ T_ε^(N), since every typical sequence can be reconstructed uniquely and perfectly. Hence,

P_e^(N) ≤ P̄_e^(N) ≜ P{S^N ∉ T_ε^(N)} = 1 − P(T_ε^(N)) → 1 − 1 = 0 as N → ∞,

due to Property 2 of Proposition 1. Finally, since δ(ε) can be made arbitrarily small, we have shown that ∀ R > R* = H(S), there exists a sequence of (2^(NR), N) codes such that P_e^(N) → 0 as N → ∞.

SLIDE 23

Reflections

In proving achievability of coding theorems, we often upper bound P_e^(N) for some coding scheme; by showing that the upper bound → 0 as N → ∞, we show that the rate is achievable.

For the achievability of the lossless source coding problem, we found an upper bound on the error probability. Note that the typicality encoder need not be optimal. Hence, the optimal probability of error P_e^(N) ≤ P̄_e^(N) = P{S^N ∉ T_ε^(N)} = 1 − P(T_ε^(N)), which tends to 0 as N → ∞ due to the LLN.

Next, to prove the converse, we need to lower bound P_e^(N) over all possible coding schemes; since we require P_e^(N) → 0, forcing this lower bound → 0 as N → ∞ shows that any achievable rate has to satisfy a certain necessary condition.

In the following, we introduce an important lemma due to Robert Fano (Fano's inequality). Fano's inequality is widely used in converse proofs.

SLIDE 24

Fano's Inequality

Lemma 1 (Fano's Inequality)
For jointly distributed r.v.'s (U, V), let P_e ≜ P{U ≠ V}. Then H(U|V) ≤ H_b(P_e) + P_e log|U|.

pf: Define E ≜ 1{U ≠ V}, the indicator of the event {U ≠ V}, so E ∼ Ber(P_e). Using the chain rule and the non-negativity of conditional entropy, we have

H(U|V) ≤ H(U, E|V) = H(E|V) + H(U|V, E).

Note that H(E|V) ≤ H(E) = H_b(P_e), and

H(U|V, E) = P{E = 1} H(U|V, E = 1) + P{E = 0} H(U|V, E = 0),

where P{E = 1} = P_e, H(U|V, E = 1) ≤ log|U|, and H(U|V, E = 0) = 0 since U = V when E = 0.

Hence, H(U|V) ≤ H_b(P_e) + P_e log|U|.
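A quick numeric check of Lemma 1 on a small joint pmf (the distribution below is made up for illustration):

```python
import math

def Hb(p):  # binary entropy
    return 0.0 if p in (0, 1) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Arbitrary joint pmf P(u, v) on U = V = {0, 1}.
P = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
Pv = {v: sum(p for (u, w), p in P.items() if w == v) for v in (0, 1)}

H_UgV = sum(p * math.log2(Pv[v] / p) for (u, v), p in P.items())  # H(U|V)
Pe = sum(p for (u, v), p in P.items() if u != v)                  # P{U != V}
bound = Hb(Pe) + Pe * math.log2(2)                                # Hb(Pe) + Pe*log|U|

print(f"H(U|V) = {H_UgV:.3f} <= {bound:.3f}")  # about 0.875 <= 1.181
assert H_UgV <= bound
```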

SLIDE 25

Corollary 1 (Lower Bound on Error Probability)

P_e ≥ (H(U|V) − 1) / log|U|.

pf: From Lemma 1 and H_b(P_e) ≤ 1, we have

H(U|V) ≤ H_b(P_e) + P_e log|U| ≤ 1 + P_e log|U|.

Exercise 1
Show that Lemma 1 can be sharpened as follows:

H(U|V) ≤ H_b(P_e) + P_e log(|U| − 1),

if U and V both take values in U.

SLIDE 26

Lossless Source Coding Theorem: Converse Proof

Recall: we would like to show that for every sequence of (2^(NR), N) codes such that P_e^(N) → 0 as N → ∞, the rate R ≥ R* = H(S).

pf: Note that B^K is a r.v. because it is generated from another r.v., S^N.

K = NR ≥ H(B^K) ≥ I(B^K; S^N)                      (1)
       ≥ I(Ŝ^N; S^N) = H(S^N) − H(S^N | Ŝ^N)       (2)
       ≥ H(S^N) − (1 + P_e^(N) log|S^N|)            (3)

(1) is due to the upper bounds on entropy and mutual information.
(2) is due to the Markov chain S^N − B^K − Ŝ^N and the data processing inequality.
(3) is due to Fano's inequality.
SLIDE 27

We have the following inequality for every block length N and every source code:

NR ≥ H(S^N) − (1 + P_e^(N) log|S^N|).

Since the source is a DMS, we have H(S^N) = N·H(S). Dividing both sides of the above inequality by N, we get

R ≥ H(S) − 1/N − P_e^(N) log|S|.

If the rate R is achievable, then by definition, P_e^(N) → 0 as N → ∞. Taking N → ∞, we conclude that R ≥ H(S) if R is achievable.

Exercise 2
Prove the lossless source coding theorem for the DMS by using Theorem 1 in Lecture 2.

SLIDE 28

Weakly Typical Sequences and Sources with Memory

SLIDE 29

Entropy Rate of Random Processes

SLIDE 30

Beyond Memoryless Sources

Recap: So far we have established a coding theorem for (block-to-block) lossless source coding with discrete memoryless sources (DMS):
- Achievability: we used typicality to construct a simple code achieving any rate R > H(S).
- Converse: we used Fano's inequality and the data processing inequality to show that every achievable rate must satisfy R ≥ H(S).

Question: What if the source is not memoryless? (In other words, we cannot use a single p.m.f. P_S to describe the random process.)
- Affected: entropy, AEP.
- Unaffected: Fano and data processing in the converse proof.

For sources with memory, we develop the following to establish a lossless source coding theorem:

1. A measure of uncertainty for random processes
2. A general AEP for random processes with memory

SLIDE 31

Discrete Stationary Source

A (discrete-time) random process {X_i | i = 1, 2, ...} consists of an infinite sequence of r.v.'s. Such a random process is characterized by all of its joint p.m.f.'s P_{X_1, X_2, ..., X_n}, ∀ n = 1, 2, ....

Definition 2 (Stationary Random Process)
A random process {X_i} is stationary if for every shift l ∈ N,

P_{X_1, X_2, ..., X_n} = P_{X_{1+l}, X_{2+l}, ..., X_{n+l}}, ∀ n ∈ N.

When extending the source coding theorem, instead of discrete memoryless sources (DMS), we focus on discrete stationary sources (DSS), where the source process {S_i | i ∈ N} is stationary (but not necessarily memoryless).

SLIDE 32

Entropy Rate

For a discrete random process {X_i}, how do we measure its uncertainty?

- There are infinitely many r.v.'s in a process {X_i}, so it is meaningless to use H(X_1, X_2, ...) (it is likely to be ∞).
- Instead, we should measure the amount of uncertainty per symbol, or measure the marginal amount of uncertainty of the current symbol conditioned on the past.

We give the following intuitive definitions.

Definition 3 (Entropy Rate)
The entropy rate of a random process {X_i} is defined by

H({X_i}) ≜ lim_{n→∞} (1/n) H(X_1, X_2, ..., X_n),

if the limit exists. Alternatively, we can also define it by the following, if the limit exists:

H̄({X_i}) ≜ lim_{n→∞} H(X_n | X^(n−1)).

SLIDE 33

Example 2 (Entropy Rate of an i.i.d. Process)
Consider a random process {X_i} where X_1, X_2, ... are i.i.d. according to P_X. Does the entropy rate exist? If so, compute it.

sol: Since the r.v.'s are i.i.d., H(X_1, ..., X_n) = nH(X_1) and H(X_n | X^(n−1)) = H(X_n) = H(X_1). Hence,

H({X_i}) = H̄({X_i}) = H(X_1) = E_{P_X}[− log P_X(X)].

Exercise 3 (H and H̄ May Be Different)
Consider a random process {X_i} where X_1, X_3, ... are i.i.d. and X_{2k} = X_{2k−1} for all k ∈ N. Show that H({X_i}) exists, but H̄({X_i}) does not.

SLIDE 34

The Two Notions of Entropy Rate

In Definition 3, we have defined two notions of entropy rate: H and H̄. In Exercise 3, we see that the two notions are not equivalent in general.

Note: Let a_n ≜ (1/n) H(X_1, ..., X_n) and b_n ≜ H(X_n | X^(n−1)). Then:

- a_n = (1/n) Σ_{k=1}^n b_k, due to the chain rule.
- H({X_i}) = lim_{n→∞} a_n and H̄({X_i}) = lim_{n→∞} b_n.

The following lemma from calculus sheds some light on the relationship between these two notions.

Lemma 2 (Cesàro Mean)
lim_{n→∞} b_n = c ⟹ lim_{n→∞} a_n = c, where a_n ≜ (1/n) Σ_{k=1}^n b_k. The reverse direction is not true in general.

As a corollary, if H̄ exists, then so does H, and H = H̄.

SLIDE 35

Entropy Rate of Stationary Process

Lemma 3
For a stationary random process {X_i}, H(X_n | X^(n−1)) is non-increasing in n.

pf: Due to the fact that conditioning reduces entropy, we have

H(X_{n+1} | X^n) = H(X_{n+1} | X[2:n], X_1) ≤ H(X_{n+1} | X[2:n]).

Since {X_i} is stationary, H(X_{n+1} | X[2:n]) = H(X_n | X^(n−1)).

Exercise 4
Show that for a stationary {X_i}, (1/n) H(X_1, ..., X_n) is non-increasing in n, and H(X_n | X^(n−1)) ≤ (1/n) H(X_1, ..., X_n).

SLIDE 36

Theorem 2
For a stationary random process {X_i}, H({X_i}) = H̄({X_i}).

pf: Since b_n ≜ H(X_n | X^(n−1)) is non-increasing in n and bounded from below by 0, we conclude that b_n converges as n → ∞. Since (1/n) H(X_1, ..., X_n) = (1/n) Σ_{k=1}^n b_k, Lemma 2 completes the proof.

SLIDE 37

Markov Process and its Entropy Rate

A Markov process is one of the simplest random processes with memory.

Definition 4 (Markov Process)
A random process {X_i} is Markov if ∀ n ∈ N,

P_{X_n | X_{n−1}, X_{n−2}, ..., X_1} = P_{X_n | X_{n−1}}.

For a stationary Markov {X_i}, the entropy rate is simple to compute:

H({X_i}) = lim_{n→∞} H(X_n | X^(n−1))
         = lim_{n→∞} H(X_n | X_{n−1})    (Markovity)
         = H(X_2 | X_1).                 (stationarity)

SLIDE 38

Computation of Entropy Rate of Markov Process

Example 3 (Two-State Markov Process)
Consider a stationary two-state Markov process {X_i | i ∈ N} taking values in {0, 1} with probability transition matrix

P_{X_2|X_1} = [ 1 − α    α   ]
              [   β    1 − β ],

where α, β ∈ (0, 1). Find the marginal p.m.f. P_{X_n}(x) for all n ∈ N and the entropy rate H({X_i}).

(State diagram: from state 0, stay with probability 1 − α or move to state 1 with probability α; from state 1, stay with probability 1 − β or move to state 0 with probability β.)
SLIDE 39

sol: The stationary distribution [π(0) π(1)] of a Markov chain can be computed by solving the following linear equation: for all n ∈ N,

[π(0) π(1)] = [π(0) π(1)] [ 1 − α    α   ]
                          [   β    1 − β ]

⟹ [π(0) π(1)] = [ β/(α+β)  α/(α+β) ] = [ P_{X_n}(0)  P_{X_n}(1) ].

To compute H({X_i}), since it is equal to H(X_2 | X_1), we can easily compute it as follows:

H({X_i}) = H(X_2 | X_1) = π(0) H(X_2 | X_1 = 0) + π(1) H(X_2 | X_1 = 1)
         = (β/(α+β)) H_b(α) + (α/(α+β)) H_b(β).
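A short numeric check of Example 3 for one choice of α and β (the values are arbitrary):

```python
import math

def Hb(p):  # binary entropy in bits
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

alpha, beta = 0.2, 0.5
pi0, pi1 = beta / (alpha + beta), alpha / (alpha + beta)  # stationary distribution
rate = pi0 * Hb(alpha) + pi1 * Hb(beta)                   # H(X2 | X1)

# Sanity check: [pi0 pi1] is invariant under the transition matrix.
assert abs(pi0 * (1 - alpha) + pi1 * beta - pi0) < 1e-12
print(f"pi = ({pi0:.3f}, {pi1:.3f}), entropy rate = {rate:.3f} bits/symbol")
```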

SLIDE 40

Typicality for Sources with Memory

SLIDE 41

Typicality for Sources with Memory

How do we extend typicality to sources with memory?

Observations:
- A random process with memory is characterized by joint distributions of arbitrary length, because the depth of the memory is arbitrary.
- The empirical distribution π(a|x^n) is insufficient to determine whether the sequence x^n is typical, because it only describes the marginal p.m.f. P_X, not joint distributions of other lengths.

We should ask: what properties are critical for us to extend typicality to sources with memory? Recall: the most critical feature of typical sequences is the Asymptotic Equipartition Property (AEP).

SLIDE 42

(Same typical-set figure as before.)

Suppose now we want to give another definition of typicality and another kind of typical set A_ε^(n). The key properties we would like to keep are the two below. Why? Because the other two properties, regarding the size of A_ε^(n), can be derived from these two:

1. ∀ x^n ∈ A_ε^(n): 2^(−n(H(X)+δ(ε))) ≤ P(x^n) ≤ 2^(−n(H(X)−δ(ε))).   (definition)
2. lim_{n→∞} P(A_ε^(n)) = 1.   (by the LLN)
SLIDE 43

Weakly Typical Sequences (for i.i.d. Random Process)

Idea: why not directly define typical sequences as those that satisfy

2^(−n(H(X)+δ(ε))) ≤ P(x^n) ≤ 2^(−n(H(X)−δ(ε)))?

Again, let P(x^n) ≜ P{X^n = x^n} = Π_{i=1}^n P_X(x_i), that is, the probability that the DMS generates the sequence x^n.

Definition 5 (Weakly Typical Sequences for an i.i.d. Random Process)
For ε ∈ (0, 1), a sequence x^n is called weakly ε-typical with respect to a random variable X ∼ P_X if

| −(1/n) log P(x^n) − H(X) | ≤ ε.

The weakly ε-typical set is A_ε^(n)(X) ≜ {x^n ∈ X^n | x^n is weakly ε-typical with respect to X}.
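Definition 5 is again a one-line membership test. A sketch mirroring the earlier robust-typicality checker (pmf and sequences are illustrative):

```python
import math

def is_weakly_typical(x, pmf, eps):
    """Check |-(1/n) log2 P(x^n) - H(X)| <= eps for an i.i.d. source."""
    H = -sum(p * math.log2(p) for p in pmf.values())
    empirical = -sum(math.log2(pmf[a]) for a in x) / len(x)
    return abs(empirical - H) <= eps

pmf = {0: 0.7, 1: 0.3}
print(is_weakly_typical([0, 0, 1, 0, 0, 1, 0, 0, 1, 0], pmf, eps=0.1))  # True
print(is_weakly_typical([1] * 10, pmf, eps=0.1))                        # False
```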

SLIDE 44

Typical vs. Weakly Typical Sequences

Comparisons in definition:

- Typical sequence: the empirical p.m.f. π(·|x^n) = (1/n) Σ_{i=1}^n 1{x_i = ·} is compared against P_X(·).
- Weakly typical sequence: the quantity −(1/n) log P(x^n) = (1/n) Σ_{i=1}^n log(1/P_X(x_i)) is compared against H(X).

Exercise 5
Show that T_ε^(n) ⊆ A_δ^(n) with δ = εH(X), whereas in general, for some ε > 0, there is no δ′ > 0 such that A_{δ′}^(n) ⊆ T_ε^(n).

SLIDE 45

AEP with i.i.d. Source and Weakly Typical Sequences

Proposition 2

1. ∀ x^n ∈ A_ε^(n)(X): 2^(−n(H(X)+ε)) ≤ P(x^n) ≤ 2^(−n(H(X)−ε)).   (by definition)
2. lim_{n→∞} P(A_ε^(n)(X)) = 1.   (by the LLN)
3. |A_ε^(n)(X)| ≤ 2^(n(H(X)+ε)).   (by 1 and 2)
4. |A_ε^(n)(X)| ≥ (1 − ε) 2^(n(H(X)−ε)) for n large enough.   (by 1 and 2)

Note: We only need to check that Property 2 holds, since Property 1 is automatic by definition, and Properties 3 and 4 follow from 1 and 2. Property 2 holds due to the LLN: as n → ∞,

−(1/n) log P(X^n) = (1/n) Σ_{i=1}^n log(1/P_X(X_i)) → E[log(1/P_X(X))] = H(X) in probability.
SLIDE 46

Stationary Ergodic Random Processes

Let us turn to one of our original goals: finding an AEP and the definition of weakly typical sequences for random processes with memory. In other words, we would like to establish the following LLN-like key property for random processes with memory: as n → ∞,

−(1/n) log P(X^n) → H({X_i}) in probability.

It turns out that this is true for stationary ergodic processes. Roughly speaking, a stationary process {X_i} is ergodic if the time average (empirical average) converges to the ensemble average with probability 1. More specifically, ∀ k_1, k_2, ..., k_m ∈ N and all measurable f,

P{ lim_{n→∞} (1/n) Σ_{l=0}^{n−1} f(X_{k_1+l}, ..., X_{k_m+l}) = E[f(X_{k_1}, ..., X_{k_m})] } = 1.

SLIDE 47

AEP for Stationary Ergodic Processes

Theorem 3 (Shannon-McMillan-Breiman)
If H({X_i}) is the entropy rate of a stationary ergodic process {X_i}, then

P{ lim_{n→∞} −(1/n) log P(X^n) = H({X_i}) } = 1,

which implies that −(1/n) log P(X^n) → H({X_i}) in probability as n → ∞.

With the above theorem, we can redefine weakly typical sequences as we did in the i.i.d. case, with the substitution H(X) → H({X_i}), and derive the corresponding properties. As discussed before, the four key properties in Propositions 1 and 2 remain the same.
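For a stationary ergodic Markov chain, Theorem 3 can be observed directly: simulate a long path, evaluate its normalized log-probability, and compare with the entropy rate from Example 3. A sketch (the chain parameters are illustrative):

```python
import math
import random

alpha, beta = 0.2, 0.5                      # P(0 -> 1) = alpha, P(1 -> 0) = beta
pi0 = beta / (alpha + beta)                 # stationary probability of state 0
T = {0: {0: 1 - alpha, 1: alpha}, 1: {0: beta, 1: 1 - beta}}

def Hb(p):
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

rate = pi0 * Hb(alpha) + (1 - pi0) * Hb(beta)   # entropy rate H({X_i})

n = 100_000
state = 0 if random.random() < pi0 else 1       # start from stationarity
log_p = math.log2(pi0 if state == 0 else 1 - pi0)
for _ in range(n - 1):
    nxt = 0 if random.random() < T[state][0] else 1
    log_p += math.log2(T[state][nxt])
    state = nxt

print(f"-(1/n) log P(X^n) = {-log_p / n:.4f}, entropy rate = {rate:.4f}")
```

The two printed numbers agree to a few decimal places for large n, illustrating the almost-sure convergence the theorem asserts.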

SLIDE 48

Lossless Source Coding Theorem for Ergodic DSS


Theorem 4 (A Lossless Source Coding Theorem for Ergodic DSS)
For a discrete stationary ergodic source {S_i}, R* = H({S_i}).

Achievability can be proved as in the DMS case, based on weakly typical sequences; the proof is identical. The converse is also the same as in the DMS case, except that we make use of the following fact from Exercise 4:

(1/N) H(S^N) ↓ H({S_i}) as N → ∞.

SLIDE 49

Summary

SLIDE 50

Lossless source coding theorem: R* = H, where
1. H = H(S) for a DMS {S_i};
2. H = H({S_i}) for an ergodic DSS {S_i}.

Typical sequences T_ε^(n) vs. weakly typical sequences A_ε^(n).

Asymptotic Equipartition Property (AEP), for both typical sequences with a DMS and weakly typical sequences with an ergodic DSS:

1. ∀ x^n ∈ T_ε^(n)(X): 2^(−n(H(X)+δ(ε))) ≤ P(x^n) ≤ 2^(−n(H(X)−δ(ε)));
   ∀ x^n ∈ A_ε^(n)({X_i}): 2^(−n(H({X_i})+ε)) ≤ P(x^n) ≤ 2^(−n(H({X_i})−ε)).
2. lim_{n→∞} P(T_ε^(n)(X)) = 1; lim_{n→∞} P(A_ε^(n)({X_i})) = 1.
3. |T_ε^(n)(X)| ≤ 2^(n(H(X)+δ(ε))); |A_ε^(n)({X_i})| ≤ 2^(n(H({X_i})+ε)).
4. |T_ε^(n)(X)| ≥ (1 − ε) 2^(n(H(X)−δ(ε))) for n large enough; |A_ε^(n)({X_i})| ≥ (1 − ε) 2^(n(H({X_i})−ε)) for n large enough.

Fano's inequality.
