Lecture 3 Source Coding

I-Hsiang Wang

Department of Electrical Engineering, National Taiwan University
ihwang@ntu.edu.tw

October 19, 2014

The Source Coding Problem

[Block diagram: Source → Source Encoder → b[1:K] → Source Decoder → Destination; input s[1:N], reconstruction ŝ[1:N]]

Meta Description

1. Encoder: represent the source sequence s[1:N] by a binary source codeword w := b[1:K] ∈ [0 : 2^K − 1], with K as small as possible.
2. Decoder: from the source codeword w, reconstruct the source sequence either losslessly or within a certain distortion.
3. Efficiency: determined by the code rate R := K/N bits per symbol time.

Decoding Criteria


Naturally, one would think of two different decoding criteria for the source coding problem.

1. Exact: the reconstructed sequence ŝ[1:N] = s[1:N].
2. Lossy: the reconstructed sequence ŝ[1:N] ≠ s[1:N], but it lies within a prescribed distortion.

Let us begin with some simple analysis of the system under the exact-recovery criterion. For fixed N, if the decoder must reconstruct s[1:N] exactly for all possible s[1:N] ∈ S^N, then it is simple to see that the smallest K must satisfy 2^{K−1} < |S|^N ≤ 2^K ⟹ K = ⌈N log|S|⌉. Why? Because every possible sequence has to be uniquely represented by K bits! (For example, |S| = 26 and N = 100 force K = ⌈100 log 26⌉ = 471 bits, i.e., 4.71 bits per symbol, no matter how skewed the source statistics are.) As a consequence, it seems that exact reconstruction makes data compression impossible.

What is going wrong?

Random Source

Recall: data compression is possible because there is redundancy in the source sequence. One of the simplest ways to capture redundancy is to model the source as a random process. (Another reason to use a random source model is engineering practice, as mentioned in Lecture 1.)

Redundancy comes from the fact that different symbols in S are drawn with different probabilities. With a random source model, there are immediately two approaches one can take to achieve data compression:
- allow variable codeword lengths for symbols with different probabilities, rather than fixing the length to K;
- allow (almost) lossless reconstruction rather than exact recovery.

Block-to-Variable Source Coding

[Block diagram as before, with the codeword b[1:K] now of variable length]

The key difference here is that we allow K to depend on the realization of the source s[1:N]. Using variable codeword lengths is intuitive: for symbols with higher probability, we tend to use shorter codewords. The definition of the code rate is modified to R := E[K]/N.

In this lecture we will introduce an optimal block-to-variable source code, the Huffman code, which achieves the minimum compression rate for a given distribution of the random source. Note: the decoding criterion here is exact reconstruction. A small illustration of the idea follows.
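As a concrete sketch of the block-to-variable idea (illustrative code, not the lecture's own material; the source p.m.f. below is made up), Huffman's construction repeatedly merges the two least probable subtrees, so more probable symbols end up with shorter codewords:

```python
# A minimal sketch of Huffman coding; the pmf is a hypothetical example.
import heapq

def huffman_code(pmf):
    """Build a prefix-free code {symbol: bitstring} from a dict {symbol: prob}."""
    # Heap entries: (subtree probability, tiebreak id, {symbol: partial codeword})
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(pmf.items())]
    heapq.heapify(heap)
    uid = len(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)   # the two least probable subtrees
        p1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c0.items()}
        merged.update({s: "1" + w for s, w in c1.items()})
        heapq.heappush(heap, (p0 + p1, uid, merged))
        uid += 1
    return heap[0][2]

pmf = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}      # hypothetical DMS
code = huffman_code(pmf)
avg_len = sum(p * len(code[s]) for s, p in pmf.items())  # E[K] per symbol
print(code, avg_len)   # expected length 1.75 bits/symbol = H(S) for this pmf
```

For this dyadic p.m.f. the expected codeword length equals the entropy exactly; this is consistent with the claim above that the minimum compression rate is tied to the source distribution.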

(Almost) Lossless Decoding Criterion

Another way to let the randomness kick in: allow non-exact recovery. To be precise, we turn our focus to finding the smallest possible R = K/N given that the error probability P_e^(N) := Pr{Ŝ[1:N] ≠ S[1:N]} → 0 as N → ∞.

Key features of this approach:
- Focus on the asymptotic regime where N → ∞; instead of error-free reconstruction, the criterion is relaxed to vanishing error probability.
- Compared with the previous approach, where the analysis is mainly combinatorial, the analysis here is based chiefly on probabilistic arguments.

Outline

In this lecture, we shall
1. first, introduce a powerful tool called typical sequences, and use typical sequences to prove a lossless source coding theorem;
2. second, introduce block-to-variable source coding schemes, especially Huffman codes, and prove their optimality.

In both cases, we will show that the minimum compression rate equals the entropy of the random source. We begin with the simplest case, where the random process {S[t] | t = 1, 2, ...} consists of i.i.d. random variables S[t] ∼ pS; such a source is called a discrete memoryless source (DMS).

Part 1: Typical Sequences and a Lossless Source Coding Theorem

Overview of Typicality Methods

Goal: understand and exploit the probabilistic asymptotic properties of an i.i.d. randomly generated sequence S[1:N] for coding.

Key observation: as N → ∞, one often observes that a comparatively small set of sequences becomes "typical" and contributes almost the whole probability, while the others are "atypical". For lossless reconstruction with vanishing error probability, we can use shorter codewords to label the "typical" sequences and ignore the "atypical" ones.

Note: there are several notions of typicality and various definitions in the literature. In this lecture we give two: (robust) typicality and weak typicality.

Notation: for convenience, we use the following interchangeably: x[t] ↔ x_t, x[1:N] ↔ x^N.

Typical Sequence

Roughly speaking, a (robust) typical sequence is a sequence whose empirical distribution is close to the actual distribution. For a sequence x^n, the empirical p.m.f. is the frequency of occurrence of each symbol: π(a|x^n) := (1/n) ∑_{i=1}^n 1{x_i = a}. By the law of large numbers, if x^n is drawn i.i.d. from pX, then π(a|x^n) → pX(a) in probability for every a ∈ X as n → ∞: with high probability, the empirical p.m.f. does not deviate much from the actual p.m.f.

Definition 1 (Typical Sequence)
For X ∼ pX and ϵ ∈ (0, 1), a sequence x^n is called ϵ-typical if |π(a|x^n) − pX(a)| ≤ ϵ pX(a) for all a ∈ X. The typical set T_ϵ^(n)(X) is the collection of all ϵ-typical length-n sequences.

Note: in the following, when the context is clear, we write T_ϵ^(n) instead of T_ϵ^(n)(X).

Example 1
Consider a random bit sequence generated i.i.d. according to Ber(1/2). Let ϵ = 0.2 and n = 10. What is T_ϵ^(n)? How large is the typical set?

sol: By definition, a sequence x^n is ϵ-typical iff π(0|x^n) ∈ [0.4, 0.6] and π(1|x^n) ∈ [0.4, 0.6]. In other words, the number of "0"s in the sequence must be 4, 5, or 6. Hence T_ϵ^(n) consists of all length-10 sequences with four, five, or six "0"s, and its size is C(10,4) + C(10,5) + C(10,6) = 210 + 252 + 210 = 672.
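A brute-force check of this count (a sketch, not part of the original slides):

```python
# Enumerate all 2^10 binary sequences and count those that are eps-typical
# for Ber(1/2) per Definition 1; compare with the binomial-sum answer.
from itertools import product
from math import comb

n, eps, p = 10, 0.2, 0.5
count = sum(
    1 for x in product([0, 1], repeat=n)
    if abs(x.count(0) / n - p) <= eps * p   # the condition for the symbol 1 is symmetric
)
print(count)                                    # 672
print(comb(10, 4) + comb(10, 5) + comb(10, 6))  # 672
```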

Properties of Typical Sequences

Let p(x^n) := Pr{X^n = x^n} = ∏_{i=1}^n pX(x_i), the probability that the DMS generates the sequence x^n. Similarly, p(A) := Pr{X^n ∈ A} denotes the probability of a set A.

Proposition 1 (Properties of Typical Sequences and the Typical Set)
1. ∀ x^n ∈ T_ϵ^(n)(X): 2^{−n(H(X)+δ(ϵ))} ≤ p(x^n) ≤ 2^{−n(H(X)−δ(ϵ))}, where δ(ϵ) = ϵH(X). (By the definitions of typical sequences and entropy.)
2. lim_{n→∞} p(T_ϵ^(n)(X)) = 1; in particular, p(T_ϵ^(n)(X)) ≥ 1 − ϵ for n large enough. (By the law of large numbers (LLN).)
3. |T_ϵ^(n)(X)| ≤ 2^{n(H(X)+δ(ϵ))}. (By summing the lower bound in Property 1 over the typical set.)
4. |T_ϵ^(n)(X)| ≥ (1 − ϵ) 2^{n(H(X)−δ(ϵ))} for n large enough. (By the upper bound in Property 1, together with Property 2.)
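A numerical illustration of these properties (a sketch with made-up parameters, not from the slides): for a Ber(q) DMS, whether x^n is typical depends only on its number of 1s, so p(T_ϵ^(n)) and |T_ϵ^(n)| reduce to binomial sums.

```python
# Exact typical-set probability and size for a Ber(q) DMS via binomial sums.
from math import comb, log2

q, eps, n = 0.3, 0.1, 500
H = -q * log2(q) - (1 - q) * log2(1 - q)   # H(X) in bits

# For q < 1/2 both typicality constraints reduce to the tighter one,
# |k/n - q| <= eps*q, where k is the number of 1s in x^n.
ks = [k for k in range(n + 1) if abs(k / n - q) <= eps * q]

prob_T = sum(comb(n, k) * q**k * (1 - q)**(n - k) for k in ks)
size_T = sum(comb(n, k) for k in ks)
print(f"p(T) = {prob_T:.4f}")              # tends to 1 as n grows (Property 2)
print(f"(1/n) log2|T| = {log2(size_T)/n:.4f}, H(X) = {H:.4f}")  # Properties 3-4
```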

Asymptotic Equipartition Property (AEP)

[Figure: inside the space X^n, the typical set T_ϵ^(n)(X) satisfies p(T_ϵ^(n)(X)) → 1 and |T_ϵ^(n)(X)| ≈ 2^{nH(X)}; each typical x^n has p(x^n) ≈ 2^{−nH(X)}]

Observations:
1. The typical set has probability approaching 1 as n → ∞, while its size is roughly 2^{nH(X)}, significantly smaller than |X^n| = 2^{n log|X|}.
2. Within the typical set, all typical sequences have roughly the same probability 2^{−nH(X)}.

Application to Data Compression


In other words, as n → ∞: with probability approaching 1 the realization of the DMS is a typical sequence; typical sequences are roughly uniformly distributed over the typical set; and there are roughly 2^{nH(X)} of them. Hence we can use roughly nH(X) bits to uniquely describe each typical sequence and ignore the atypical ones. Since the probability of getting an atypical sequence vanishes as n → ∞, so does the error probability.

Lossless Source Coding: Problem Setup


1. A (2^{NR}, N) lossless source code consists of
   - an encoding function (encoder) enc_N : S^N → {0,1}^K that maps each source sequence s^N to a bit sequence b^K, where K := ⌊NR⌋;
   - a decoding function (decoder) dec_N : {0,1}^K → S^N that maps each bit sequence b^K to a reconstructed source sequence ŝ^N.
2. The error probability of a (2^{NR}, N) lossless source code is defined as P_e^(N) := Pr{Ŝ^N ≠ S^N}.
3. A rate R is said to be achievable if there exists a sequence of (2^{NR}, N) codes such that P_e^(N) → 0 as N → ∞. The optimal compression rate is R* := inf{R | R achievable}.

A Lossless Source Coding Theorem


Theorem 1 (A Lossless Source Coding Theorem for a DMS)
For a DMS S, R* = H(S).

Remark: in information theory, establishing a coding theorem usually requires proving two directions:
- Direct part (achievability): show that ∀ R > R* = H(S), ∃ a sequence of (2^{NR}, N) codes such that P_e^(N) → 0 as N → ∞.
- Converse part: show that for every sequence of (2^{NR}, N) codes with P_e^(N) → 0 as N → ∞, the rate satisfies R ≥ R* = H(S).

Lossless Source Coding Theorem: Achievability Proof (1)


pf: Here we provide a simple proof based on typical sequences.

Codebook generation: choose an ϵ > 0 and set R = H(S) + δ(ϵ) such that NR ∈ ℤ, where δ(ϵ) = ϵH(S). By Property 3 of Proposition 1, we have an upper bound on the number of typical sequences: |T_ϵ^(N)(S)| ≤ 2^{N(H(S)+δ(ϵ))} = 2^{NR}.

Encoding: hence we can label each typical sequence with a length-NR bit sequence, which defines the encoding function for all s^N ∈ T_ϵ^(N). For s^N ∉ T_ϵ^(N), the encoder maps them to the all-zero sequence.

Decoding: the decoder simply maps the received bit sequence back to the typical sequence labeled by that bit sequence.
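A toy instance of this scheme (a sketch with illustrative parameters, not from the slides): enumerate the ϵ-typical set of a short Ber(q) block, give each typical sequence a fixed-length binary label, and send every atypical sequence to the all-zero label.

```python
from itertools import product
from math import ceil, log2

q, eps, N = 0.3, 0.2, 12                      # made-up source and blocklength

def is_typical(x):
    # robust typicality: |pi(a|x) - p(a)| <= eps * p(a) for a in {0, 1}
    pi1 = sum(x) / N
    return abs(pi1 - q) <= eps * q and abs((1 - pi1) - (1 - q)) <= eps * (1 - q)

typical = [x for x in product([0, 1], repeat=N) if is_typical(x)]
K = max(1, ceil(log2(len(typical))))          # fixed label length
enc = {x: format(i, f"0{K}b") for i, x in enumerate(typical)}
dec = {b: x for x, b in enc.items()}

def encode(x):
    return enc.get(x, "0" * K)                # atypical inputs all collide here

x = (0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0)      # four 1s: pi(1|x) = 1/3, typical
print(K, encode(x), dec[encode(x)] == x)      # decoding succeeds iff x is typical
```

Here K = 10 bits for N = 12 source bits: the compression comes precisely from not reserving labels for atypical sequences, and an error occurs exactly when the source emits one.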

Lossless Source Coding Theorem: Achievability Proof (2)

Error probability analysis: an error occurs iff the generated sequence s^N ∉ T_ϵ^(N), since every typical sequence is reconstructed uniquely and perfectly. Hence
P_e^(N) = Pr{S^N ∉ T_ϵ^(N)} = 1 − p(T_ϵ^(N)) → 1 − 1 = 0 as N → ∞,
by Property 2 of Proposition 1.

Finally, since δ(ϵ) can be made arbitrarily small, we have shown that ∀ R > R* = H(S), ∃ a sequence of (2^{NR}, N) codes such that P_e^(N) → 0 as N → ∞.

Reflections

In proving achievability, we essentially derive an upper bound on P_e^(N) for some coding scheme; by showing that this upper bound → 0 as N → ∞, we conclude that the rate of the scheme is achievable. For the lossless source coding problem we do not even need an upper bound, because we have the exact expression P_e^(N) = Pr{S^N ∉ T_ϵ^(N)} = 1 − p(T_ϵ^(N)), which tends to 0 as N → ∞ by the LLN.

Next, to prove the converse, we develop a lower bound on P_e^(N) valid for all possible coding schemes; since we require P_e^(N) → 0 as N → ∞, forcing this lower bound to 0 shows that any achievable rate must satisfy a certain necessary condition.

In the following, we introduce an important lemma due to Robert Fano (Fano's inequality), which is widely used in converse proofs.

Fano’s Inequality

Lemma 1 (Fano's Inequality)
For jointly distributed r.v.'s (U, V), let P_e := Pr{U ≠ V}. Then H(U|V) ≤ H_b(P_e) + P_e log|U|.

pf: Define E := 1{U ≠ V}, the indicator of the error event {U ≠ V}. Hence E ∼ Ber(P_e) and H(E) = H_b(P_e). Using the chain rule and the non-negativity of conditional entropy,
H(U|V) ≤ H(U, E|V) = H(E|V) + H(U|V, E).
Note that H(E|V) ≤ H(E) = H_b(P_e), and
H(U|V, E) = Pr{E = 1} H(U|V, E = 1) + Pr{E = 0} H(U|V, E = 0) ≤ P_e log|U|,
since Pr{E = 1} = P_e, H(U|V, E = 1) ≤ log|U|, and H(U|V, E = 0) = 0 (∵ U = V when E = 0). Hence H(U|V) ≤ H_b(P_e) + P_e log|U|.
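A quick numerical sanity check of the lemma (a sketch, not from the slides; the joint p.m.f. below is arbitrary):

```python
from math import log2

pUV = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.3, (2, 1): 0.1}
alphabet_U = {u for u, _ in pUV}

pV = {}
for (u, v), p in pUV.items():        # marginal p(v)
    pV[v] = pV.get(v, 0.0) + p

H_U_given_V = -sum(p * log2(p / pV[v]) for (u, v), p in pUV.items())
Pe = sum(p for (u, v), p in pUV.items() if u != v)
Hb = -Pe * log2(Pe) - (1 - Pe) * log2(1 - Pe)    # binary entropy of Pe

print(f"H(U|V) = {H_U_given_V:.4f}")                                   # ~1.0465
print(f"Hb(Pe) + Pe*log|U| = {Hb + Pe * log2(len(alphabet_U)):.4f}")   # ~1.3568, larger
```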

Corollary 1 (Lower Bound on Error Probability)
P_e ≥ (H(U|V) − 1) / log|U|.

pf: From Lemma 1 and H_b(P_e) ≤ 1, we have H(U|V) ≤ H_b(P_e) + P_e log|U| ≤ 1 + P_e log|U|.

Exercise 1
Show that Lemma 1 can be sharpened to H(U|V) ≤ H_b(P_e) + P_e log(|U| − 1) if U and V both take values in U.

Lossless Source Coding Theorem: Converse Proof (1)

Recall: we would like to show that for every sequence of (2^{NR}, N) codes with P_e^(N) → 0 as N → ∞, the rate satisfies R ≥ R* = H(S).

pf: Note that B^K is a r.v., because it is generated from another r.v., S^N.
NR ≥ K ≥ H(B^K) ≥ I(B^K; S^N)   (1)
  ≥ I(Ŝ^N; S^N) = H(S^N) − H(S^N | Ŝ^N)   (2)
  ≥ H(S^N) − (1 + P_e^(N) log|S^N|)   (3)
(1) is due to the upper bounds on entropy and mutual information.
(2) is due to the Markov chain S^N − B^K − Ŝ^N and the data processing inequality.
(3) is due to Fano's inequality.

Lossless Source Coding Theorem: Converse Proof (2)

We thus have, for every blocklength N and every source code,
NR ≥ H(S^N) − (1 + P_e^(N) log|S^N|).
Since the source is a DMS, H(S^N) = NH(S). Dividing both sides by N gives
R ≥ H(S) − 1/N − P_e^(N) log|S|.
If the rate R is achievable, then by definition P_e^(N) → 0 as N → ∞. Taking N → ∞, we conclude that R ≥ H(S) whenever R is achievable.

Part 2: Weakly Typical Sequences and Sources with Memory

Beyond Memoryless Sources

Recap: so far we have established a coding theorem for (block-to-block) lossless source coding with a discrete memoryless source (DMS):
- Achievability: we used typical sequences to construct a simple code achieving every rate R > H(S).
- Converse: we used Fano's inequality and the data processing inequality to show that every achievable rate must satisfy R ≥ H(S).

Question: what if the source is not memoryless, i.e., we cannot use a single p.m.f. pS to describe the random process?
- Affected: entropy, AEP.
- Unaffected: Fano and data processing in the converse proof.

For sources with memory, we should develop two things so that a lossless source coding theorem can be established:
1. a measure of uncertainty for random processes;
2. a general AEP for random processes with memory.

Discrete Stationary Source

A (discrete-time) random process {X_i | i = 1, 2, ...} consists of an infinite sequence of r.v.'s, and is characterized by all of its joint p.m.f.'s p_{X1,X2,...,Xn}, n = 1, 2, ....

Definition 2 (Stationary Random Process)
A random process {X_i} is stationary if for every shift l ∈ ℕ and every n ∈ ℕ, p_{X1,X2,...,Xn} = p_{X_{1+l},X_{2+l},...,X_{n+l}}.

When extending the source coding theorem, instead of discrete memoryless sources (DMS) we focus on discrete stationary sources (DSS), where the source process {S_i | i ∈ ℕ} is stationary (but not necessarily memoryless).

Entropy Rate

For a discrete random process {X_i}, how do we measure its uncertainty? Since there are infinitely many r.v.'s in a random process, it is meaningless to use H(X1, X2, ...) directly (∵ it is likely to be ∞). Instead, we should measure the amount of uncertainty per symbol, or the marginal uncertainty of the current symbol conditioned on all past symbols. This leads to the following intuitive definitions.

Definition 3 (Entropy Rate)
The entropy rate of a random process {X_i} is defined by H({X_i}) := lim_{n→∞} (1/n) H(X1, X2, ..., Xn), if the limit exists. Alternatively, we can also define H′({X_i}) := lim_{n→∞} H(X_n | X^{n−1}), if the limit exists.

Example 2 (Entropy Rate of an i.i.d. Process)
Consider a random process {X_i} where X1, X2, ... are i.i.d. according to pX. Does the entropy rate exist? If so, compute it.

sol: Since the r.v.'s are i.i.d., for all n ∈ ℕ, H(X1, ..., Xn) = nH(X1) and H(X_n | X^{n−1}) = H(X_n) = H(X1). Hence H({X_i}) = H′({X_i}) = H(X1) = E_{pX}[−log pX(X)].

Exercise 2 (H and H′ May Differ)
Consider a random process {X_i} where X1, X3, X5, ... are i.i.d. and X_{2k} = X_{2k−1} for all k ∈ ℕ. Show that H({X_i}) exists, but H′({X_i}) does not.

The Two Notions of Entropy Rate

In Definition 3 we defined two notions of entropy rate, H and H′, and Exercise 2 shows that they are not equivalent in general.

Note: let a_n := (1/n) H(X1, ..., Xn) and b_n := H(X_n | X^{n−1}). Then a_n = (1/n) ∑_{k=1}^n b_k by the chain rule, H({X_i}) = lim_{n→∞} a_n, and H′({X_i}) = lim_{n→∞} b_n. The following lemma from calculus sheds some light on the relationship between the two notions.

Lemma 2 (Cesàro Mean)
If lim_{n→∞} b_n = c, then lim_{n→∞} a_n = c, where a_n := (1/n) ∑_{k=1}^n b_k. The reverse direction is not true in general.

As a corollary, if H′ exists, then H also exists and H = H′.

Entropy Rate of Stationary Process

Lemma 3
For a stationary random process {X_i}, H(X_n | X^{n−1}) is non-increasing in n.

pf: Since conditioning reduces entropy, H(X_{n+1} | X^n) = H(X_{n+1} | X[2:n], X1) ≤ H(X_{n+1} | X[2:n]). Since {X_i} is stationary, H(X_{n+1} | X[2:n]) = H(X_n | X^{n−1}).

Theorem 2
For a stationary random process {X_i}, H({X_i}) = H′({X_i}).

pf: b_n := H(X_n | X^{n−1}) is non-increasing in n and bounded from below by 0, so b_n converges as n → ∞. Since (1/n) H(X1, ..., Xn) = (1/n) ∑_{k=1}^n b_k, Lemma 2 completes the proof.

Exercise 3
Show that for a stationary random process {X_i}:
1. (1/n) H(X1, ..., Xn) is non-increasing in n;
2. H(X_n | X^{n−1}) ≤ (1/n) H(X1, ..., Xn).

Markov Process and its Entropy Rate

A Markov process is one of the simplest random processes with memory.

Definition 4 (Markov Process)
A random process {X_i} is Markov if for all n ∈ ℕ, p(x_n | x_{n−1}, x_{n−2}, ..., x1) = p(x_n | x_{n−1}).

For a stationary Markov process {X_i}, the entropy rate is simple to compute:
H({X_i}) = lim_{n→∞} H(X_n | X^{n−1}) = lim_{n→∞} H(X_n | X_{n−1}) = H(X2 | X1),
where the second equality uses Markovity and the third uses stationarity.
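A small worked example (a sketch with made-up transition probabilities, not from the slides): for a stationary two-state Markov chain, H(X2|X1) is the μ-weighted entropy of the rows of the transition matrix, where μ is the stationary distribution.

```python
from math import log2

a, b = 0.1, 0.3                    # P(1|0) = a, P(0|1) = b (illustrative)
P = [[1 - a, a], [b, 1 - b]]       # row-stochastic transition matrix

# Stationary distribution mu solves mu = mu P; for two states, mu = (b, a)/(a+b)
mu = [b / (a + b), a / (a + b)]

def H_row(row):
    return -sum(p * log2(p) for p in row if p > 0)

rate = sum(m * H_row(row) for m, row in zip(mu, P))   # H(X2|X1)
print(f"{rate:.4f} bits/symbol")   # 0.75*Hb(0.1) + 0.25*Hb(0.3) ~ 0.5721
```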

Typicality for Sources with Memory

How do we extend typicality to sources with memory?

Observation: a random process with memory is characterized by joint distributions of arbitrary length, because the depth of the memory is arbitrary. The empirical distribution π(a|x^n) is insufficient to determine whether a sequence is typical, because it only describes the marginal p.m.f. pX, not the joint distributions of other lengths.

We should therefore ask: which properties are critical for extending typicality to sources with memory? Recall that the most important feature of typical sequences is the asymptotic equipartition property (AEP).


Suppose now we would like to give another definition of typicality, with typical set A_ϵ^(n). The key properties we would like to keep are:
1. ∀ x^n ∈ A_ϵ^(n): 2^{−n(H(X)+δ(ϵ))} ≤ p(x^n) ≤ 2^{−n(H(X)−δ(ϵ))}. (By definition.)
2. lim_{n→∞} p(A_ϵ^(n)) = 1. (By the LLN.)
Why these two? Because the other two properties, concerning the size of A_ϵ^(n), can be derived from them.

Weakly Typical Sequences (for i.i.d. Random Process)

Idea: why not directly define typical sequences as those satisfying 2^{−n(H(X)+δ(ϵ))} ≤ p(x^n) ≤ 2^{−n(H(X)−δ(ϵ))}?

Again, let p(x^n) := Pr{X^n = x^n} = ∏_{i=1}^n pX(x_i), the probability that the DMS generates the sequence x^n.

Definition 5 (Weakly Typical Sequences for an i.i.d. Random Process)
For X ∼ pX and ϵ ∈ (0, 1), a sequence x^n is called weakly ϵ-typical if |−(1/n) log p(x^n) − H(X)| ≤ ϵ. The weakly typical set A_ϵ^(n)(X) is the collection of all weakly ϵ-typical length-n sequences.
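An empirical look at the definition (a sketch with a made-up p.m.f., not from the slides): by the LLN, the fraction of i.i.d. draws that are weakly ϵ-typical approaches 1 as n grows.

```python
import random
from math import log2

random.seed(0)
pX = {"a": 0.5, "b": 0.3, "c": 0.2}
H = -sum(p * log2(p) for p in pX.values())
symbols, weights = list(pX), list(pX.values())

def empirical_rate(n):
    x = random.choices(symbols, weights, k=n)   # draw x^n i.i.d. from pX
    return -sum(log2(pX[s]) for s in x) / n     # -(1/n) log p(x^n)

eps, trials = 0.05, 2000
for n in (10, 100, 1000):
    hits = sum(abs(empirical_rate(n) - H) <= eps for _ in range(trials))
    print(f"n = {n:4d}: fraction weakly eps-typical ~ {hits / trials:.3f}")
```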

Typical vs. Weakly Typical Sequences

Comparison of the definitions:
- Typical sequence: the empirical p.m.f. π(·|x^n) = (1/n) ∑_{i=1}^n 1{x_i = ·} is close to pX(·).
- Weakly typical sequence: the empirical log-loss −(1/n) log p(x^n) = (1/n) ∑_{i=1}^n log(1/pX(x_i)) is close to H(X).

Exercise 4
Show that T_ϵ^(n) ⊆ A_δ^(n) with δ = ϵH(X), whereas in general, for some ϵ > 0, there is no δ′ > 0 such that A_{δ′}^(n) ⊆ T_ϵ^(n).

AEP with i.i.d. Source and Weakly Typical Sequences

Proposition 2
1. ∀ x^n ∈ A_ϵ^(n)(X): 2^{−n(H(X)+ϵ)} ≤ p(x^n) ≤ 2^{−n(H(X)−ϵ)}. (By definition.)
2. lim_{n→∞} p(A_ϵ^(n)(X)) = 1. (By the LLN.)
3. |A_ϵ^(n)(X)| ≤ 2^{n(H(X)+ϵ)}. (By 1 and 2.)
4. |A_ϵ^(n)(X)| ≥ (1 − ϵ) 2^{n(H(X)−ϵ)} for n large enough. (By 1 and 2.)

Note: we only need to check that Property 2 holds, since Property 1 is automatic by definition, and Properties 3 and 4 follow from 1 and 2. Property 2 is true due to the LLN: as n → ∞,
−(1/n) log p(x^n) = (1/n) ∑_{i=1}^n log(1/pX(x_i)) → E[log(1/pX(X))] = H(X) in probability.

Stationary Ergodic Random Processes

Let us return to one of our original goals: finding an AEP and a definition of weakly typical sequences for random processes with memory. In other words, we would like to establish the following LLN-like key property: as n → ∞,
−(1/n) log p(X^n) → H({X_i}) in probability.
It turns out that this holds for stationary ergodic processes. Roughly speaking, a stationary process {X_i} is ergodic if the time average (empirical average) converges to the ensemble average with probability 1. More precisely, for all k1, k2, ..., km ∈ ℕ and every measurable f,
Pr{ lim_{n→∞} (1/n) ∑_{l=0}^{n−1} f(X_{k1+l}, ..., X_{km+l}) = E[f(X_{k1}, ..., X_{km})] } = 1.

AEP for Stationary Ergodic Processes

Theorem 3 (Shannon-McMillan-Breiman)
If H({X_i}) is the entropy rate of a stationary ergodic process {X_i}, then
Pr{ lim_{n→∞} −(1/n) log p(X^n) = H({X_i}) } = 1,
which implies −(1/n) log p(X^n) → H({X_i}) in probability as n → ∞.

With the above theorem, we can re-define weakly typical sequences as in the i.i.d. case, with the substitution H(X) → H({X_i}), and derive the corresponding properties. As discussed before, the four key properties in Propositions 1 and 2 remain the same.
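An empirical illustration of the theorem (a sketch with made-up parameters, not from the slides): for a stationary two-state Markov chain, which is ergodic here since all transition probabilities are positive, −(1/n) log p(X^n) concentrates around the entropy rate H(X2|X1) computed earlier.

```python
import random
from math import log2

random.seed(1)
a, b = 0.1, 0.3
P = [[1 - a, a], [b, 1 - b]]
mu = [b / (a + b), a / (a + b)]                   # stationary distribution
rate = -sum(mu[i] * P[i][j] * log2(P[i][j]) for i in range(2) for j in range(2))

def neg_log_prob_rate(n):
    x = random.choices([0, 1], mu)[0]             # X1 ~ mu (stationary start)
    total = -log2(mu[x])
    for _ in range(n - 1):
        y = random.choices([0, 1], P[x])[0]       # X_{t+1} ~ P(. | X_t)
        total -= log2(P[x][y])
        x = y
    return total / n                              # -(1/n) log p(x^n)

for n in (100, 1000, 10000):
    print(f"n = {n:5d}: {neg_log_prob_rate(n):.4f} (entropy rate {rate:.4f})")
```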

Lossless Source Coding Theorem for Ergodic DSS


Theorem 4 (A Lossless Source Coding Theorem for an Ergodic DSS)
For a discrete stationary ergodic source {S_i}, R* = H({S_i}).

Achievability is proved exactly as in the DMS case, based on weakly typical sequences; the proof is identical. The converse is also the same as in the DMS case, except that we use the following fact from Exercise 3: (1/N) H(S^N) ↓ H({S_i}) as N → ∞.

Part 3: Summary

Lossless source coding theorem: R* = H, where
1. H = H(S) for a DMS {S_i};
2. H = H({S_i}) for an ergodic DSS {S_i}.

Typical sequences T_ϵ^(n) vs. weakly typical sequences A_ϵ^(n).

Asymptotic equipartition property (AEP), for typical sequences with a DMS and for weakly typical sequences with an ergodic DSS:
1. ∀ x^n ∈ T_ϵ^(n)(X): 2^{−n(H(X)+δ(ϵ))} ≤ p(x^n) ≤ 2^{−n(H(X)−δ(ϵ))}; ∀ x^n ∈ A_ϵ^(n)({X_i}): 2^{−n(H({X_i})+ϵ)} ≤ p(x^n) ≤ 2^{−n(H({X_i})−ϵ)}.
2. lim_{n→∞} p(T_ϵ^(n)(X)) = 1 and lim_{n→∞} p(A_ϵ^(n)({X_i})) = 1.
3. |T_ϵ^(n)(X)| ≤ 2^{n(H(X)+δ(ϵ))} and |A_ϵ^(n)({X_i})| ≤ 2^{n(H({X_i})+ϵ)}.
4. |T_ϵ^(n)(X)| ≥ (1 − ϵ) 2^{n(H(X)−δ(ϵ))} and |A_ϵ^(n)({X_i})| ≥ (1 − ϵ) 2^{n(H({X_i})−ϵ)}, for n large enough.

Fano's inequality: H(U|V) ≤ H_b(P_e) + P_e log|U|.
