
Chapter 5 Data Compression

Peng-Hua Wang

Graduate Inst. of Comm. Engineering National Taipei University


Chapter Outline

  • Chap. 5 Data Compression

5.1 Example of Codes
5.2 Kraft Inequality
5.3 Optimal Codes
5.4 Bound on Optimal Code Length
5.5 Kraft Inequality for Uniquely Decodable Codes
5.6 Huffman Codes
5.7 Some Comments on Huffman Codes
5.8 Optimality of Huffman Codes
5.9 Shannon-Fano-Elias Coding
5.10 Competitive Optimality of the Shannon Code
5.11 Generation of Discrete Distributions from Fair Coins


5.1 Example of Codes


Source code

Definition (Source code) A source code C for a random variable X is a mapping from X, the range of X, to D∗, the set of finite-length strings of symbols from a D-ary alphabet. Let C(x) denote the codeword corresponding to x and let l(x) denote the length of C(x).

■ For example, C(red) = 00, C(blue) = 11 is a source code with a

mapping from X = {red, blue} to D∗ with alphabet D = {0, 1}.


Source code

Definition (Expected length) The expected length L(C) of a source code C(x) for a random variable X with probability mass function p(x) is given by

L(C) = ∑_{x∈X} p(x) l(x),

where l(x) is the length of the codeword associated with x.


Example

Example 5.1.1 Let X be a random variable with the following distribution and codeword assignment:

Pr{X = 1} = 1/2, codeword C(1) = 0
Pr{X = 2} = 1/4, codeword C(2) = 10
Pr{X = 3} = 1/8, codeword C(3) = 110
Pr{X = 4} = 1/8, codeword C(4) = 111

■ H(X) = 1.75 bits.
■ E l(X) = 1.75 bits.
■ The code is uniquely decodable.
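
A quick numerical check (a Python sketch; the pmf and codewords are exactly those listed above):

```python
import math

# Example 5.1.1: entropy and expected code length both come to 1.75 bits.
p = {1: 1/2, 2: 1/4, 3: 1/8, 4: 1/8}          # source pmf
code = {1: "0", 2: "10", 3: "110", 4: "111"}  # codeword assignment

H = -sum(px * math.log2(px) for px in p.values())  # entropy H(X)
L = sum(p[x] * len(code[x]) for x in p)            # expected length E l(X)
print(H, L)  # 1.75 1.75
```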


Example

Example 5.1.2 Consider the following example:

Pr{X = 1} = 1/3, codeword C(1) = 0
Pr{X = 2} = 1/3, codeword C(2) = 10
Pr{X = 3} = 1/3, codeword C(3) = 11

■ H(X) = 1.58 bits.
■ E l(X) = 1.66 bits.
■ The code is uniquely decodable.


Source code

Definition (nonsingular) A code is said to be nonsingular if every element of the range of X maps into a different string in D∗; that is,

x ≠ x′ ⇒ C(x) ≠ C(x′).

Definition (extension code) The extension C∗ of a code C is the mapping from finite-length strings of X to finite-length strings of D, defined by

C(x1x2 · · · xn) = C(x1)C(x2) · · · C(xn),

where C(x1)C(x2) · · · C(xn) indicates concatenation of the corresponding codewords.

Example 5.1.4 If C(x1) = 00 and C(x2) = 11, then C(x1x2) = 0011.


Source code

Definition (uniquely decodable) A code is called uniquely decodable if its extension is nonsingular.

Definition (prefix code) A code is called a prefix code or an instantaneous code if no codeword is a prefix of any other codeword.

■ For an instantaneous code, the symbol xi can be decoded as soon as we come to the end of the codeword corresponding to it.

■ For example, the binary string 01011111010 produced by the code of Example 5.1.1 is parsed as 0, 10, 111, 110, 10.

Source code

X    Singular    Nonsingular, but not UD    UD, but not instantaneous    Instantaneous
1    0           0                          10                           0
2    0           010                        00                           10
3    0           01                         11                           110
4    0           10                         110                          111


Decoding Tree
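
The slide's figure is not reproduced here. As a stand-in, a minimal Python sketch of a decoding tree for the code of Example 5.1.1: codewords are paths from the root, symbols sit at the leaves, and the decoder restarts at the root after every leaf, which is exactly why an instantaneous code needs no lookahead.

```python
# Build a binary trie from a prefix code; each codeword is a root-to-leaf path.
def build_tree(code):
    root = {}
    for symbol, word in code.items():
        node = root
        for bit in word[:-1]:
            node = node.setdefault(bit, {})
        node[word[-1]] = symbol  # leaf carries the source symbol
    return root

# Follow the bits down the tree; emit a symbol and restart at each leaf.
def decode(bits, tree):
    symbols, node = [], tree
    for bit in bits:
        node = node[bit]
        if not isinstance(node, dict):
            symbols.append(node)
            node = tree
    return symbols

code = {1: "0", 2: "10", 3: "110", 4: "111"}
print(decode("01011111010", build_tree(code)))  # [1, 2, 4, 3, 2]
```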


5.2 Kraft Inequality


Kraft Inequality

Theorem 5.2.1 (Kraft Inequality) For any instantaneous code (prefix code) over an alphabet of size D, the codeword lengths l1, l2, . . . , lm must satisfy the inequality

∑_i D^{−l_i} ≤ 1.

Conversely, given a set of codeword lengths that satisfy this inequality, there exists an instantaneous code with these word lengths.
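
Both directions are easy to exercise numerically. A sketch: check the Kraft sum, then build one instantaneous code from valid lengths via the standard construction that hands out consecutive codeword values in order of increasing length.

```python
def kraft_sum(lengths, D=2):
    return sum(D ** -l for l in lengths)

def to_base(n, D, width):
    # n written as exactly `width` base-D digits.
    digits = []
    for _ in range(width):
        n, r = divmod(n, D)
        digits.append(str(r))
    return "".join(reversed(digits))

def prefix_code(lengths, D=2):
    assert kraft_sum(lengths, D) <= 1, "lengths violate the Kraft inequality"
    words, value, prev = [], 0, 0
    for l in sorted(lengths):
        value *= D ** (l - prev)  # scale so no earlier codeword is a prefix
        words.append(to_base(value, D, l))
        value += 1
        prev = l
    return words

print(prefix_code([1, 2, 3, 3]))  # ['0', '10', '110', '111']
```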


Extended Kraft Inequality

Theorem 5.2.2 (Extended Kraft Inequality) For any countably infinite set of codewords that form a prefix code, the codeword lengths satisfy the extended Kraft inequality,

∑_{i=1}^{∞} D^{−l_i} ≤ 1.

Conversely, given any l1, l2, . . . satisfying the extended Kraft inequality, we can construct a prefix code with these codeword lengths.


5.3 Optimal Codes


Minimize expected length

Problem Given the source pmf p1, p2, . . . , pm, find codeword lengths l1, l2, . . . , lm that minimize the expected code length

L = ∑_i p_i l_i

subject to the constraint

∑_i D^{−l_i} ≤ 1.

■ l1, l2, . . . , lm are integers.
■ We first relax the original integer program: the restriction to integer lengths is relaxed to real numbers.
■ Solve by Lagrange multipliers.


Solve the relaxed problem

J = ∑_i p_i l_i + λ ∑_i D^{−l_i}

∂J/∂l_i = p_i − λ D^{−l_i} ln D

Setting ∂J/∂l_i = 0 gives D^{−l_i} = p_i / (λ ln D). Substituting into the constraint ∑_i D^{−l_i} = 1 gives λ = 1/ln D, hence p_i = D^{−l_i} and the optimal code lengths are

l∗_i = −log_D p_i.

■ The expected code length is

L∗ = ∑_i p_i l∗_i = −∑_i p_i log_D p_i = H_D(X).
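
For a non-D-adic pmf the relaxed optimum is fractional, so H_D(X) is only a lower bound; compare Example 5.1.2, where H(X) = 1.58 bits but the best expected length is 1.66 bits. A small sketch:

```python
import math

# Sketch: for a uniform pmf on 3 symbols the ideal lengths -log2 p_i are not
# integers, so the entropy is only a lower bound on the achievable length.
p = [1/3, 1/3, 1/3]
ideal = [-math.log2(pi) for pi in p]   # each ~1.585 bits
H = -sum(pi * math.log2(pi) for pi in p)
print(ideal, H)  # ideal lengths ~[1.585, 1.585, 1.585], H ~1.585
```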

Expected code length

Theorem 5.3.1 The expected length L of any instantaneous D-ary code for a random variable X is greater than or equal to the entropy H_D(X); that is,

L ≥ H_D(X),

with equality if and only if D^{−l_i} = p_i.

Proof.

L − H_D(X) = ∑ p_i l_i + ∑ p_i log_D p_i
           = −∑ p_i log_D D^{−l_i} + ∑ p_i log_D p_i
           = D(p||q) ≥ 0,

where q_i = D^{−l_i}. WRONG!! Because ∑_i D^{−l_i} ≤ 1, q may not be a valid distribution.


Expected code length

Proof.

L − H_D(X) = ∑ p_i l_i + ∑ p_i log_D p_i
           = −∑ p_i log_D D^{−l_i} + ∑ p_i log_D p_i.

Let c = ∑_j D^{−l_j} and r_i = D^{−l_i}/c. Then

L − H_D(X) = ∑ p_i log_D (p_i / r_i) − log_D c
           = D(p||r) + log_D (1/c) ≥ 0,

since D(p||r) ≥ 0 and c ≤ 1 by the Kraft inequality. Hence L ≥ H_D(X), with equality iff p_i = D^{−l_i}; that is, iff −log_D p_i is an integer for all i.


D-adic

Definition (D-adic) A probability distribution is called D-adic if each of the probabilities is equal to D^{−n} for some integer n.

■ L = H_D(X) if and only if the distribution of X is D-adic.
■ How to find the optimal code? ⇒ Find the D-adic distribution that is closest (in the relative entropy sense) to the distribution of X.
■ What is an upper bound on the optimal code length?


5.4 Bound on Optimal Code Length


Optimal code length

Theorem 5.4.1 Let l∗_1, l∗_2, . . . , l∗_m be optimal codeword lengths for a source distribution p and a D-ary alphabet, and let L∗ be the associated expected length of an optimal code (L∗ = ∑ p_i l∗_i). Then

H_D(X) ≤ L∗ < H_D(X) + 1.

■ Proof. Let l_i = ⌈log_D (1/p_i)⌉, where ⌈x⌉ is the smallest integer ≥ x. These lengths satisfy the Kraft inequality since

∑_i D^{−⌈log_D (1/p_i)⌉} ≤ ∑_i D^{−log_D (1/p_i)} = ∑_i p_i = 1.

This choice of codeword lengths satisfies

log_D (1/p_i) ≤ l_i < log_D (1/p_i) + 1.


Optimal code length

Multiplying by p_i and summing over i, we obtain

H_D(X) ≤ L < H_D(X) + 1.

Since L∗ is the expected length of the optimal code,

L∗ ≤ L < H_D(X) + 1.

On the other hand, from Theorem 5.3.1,

L∗ ≥ H_D(X).

Therefore,

H_D(X) ≤ L∗ < H_D(X) + 1.
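
The proof's construction can be run directly. A sketch with an arbitrary non-dyadic pmf:

```python
import math

# Shannon-style lengths l_i = ceil(log2(1/p_i)): Kraft holds and
# H(X) <= L < H(X) + 1, as in Theorem 5.4.1.
p = [0.25, 0.25, 0.2, 0.15, 0.15]
lengths = [math.ceil(math.log2(1 / pi)) for pi in p]   # [2, 2, 3, 3, 3]
K = sum(2 ** -l for l in lengths)                      # 0.875 <= 1
L = sum(pi * li for pi, li in zip(p, lengths))         # 2.5
H = -sum(pi * math.log2(pi) for pi in p)               # ~2.285
print(K <= 1, H <= L < H + 1)  # True True
```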


Optimal code length

Consider a system in which we send a sequence of n symbols from X. Define L_n to be the expected codeword length per input symbol,

L_n = (1/n) ∑ p(x_1, x_2, . . . , x_n) l(x_1, x_2, . . . , x_n) = (1/n) E[l(X_1, X_2, . . . , X_n)].

We have

H(X_1, X_2, . . . , X_n) ≤ E[l(X_1, X_2, . . . , X_n)] < H(X_1, X_2, . . . , X_n) + 1.

If X_1, X_2, . . . , X_n are i.i.d., this gives

H(X) ≤ L_n < H(X) + 1/n.


Optimal code length

If X_1, X_2, . . . , X_n are independent but not identically distributed, we have

(1/n) H(X_1, X_2, . . . , X_n) ≤ L_n < (1/n) H(X_1, X_2, . . . , X_n) + 1/n.

If the random process is stationary,

(1/n) H(X_1, X_2, . . . , X_n) → H(X),

the entropy rate of the process.


Optimal code length

Theorem 5.4.2 The minimum expected codeword length per symbol satisfies

(1/n) H(X_1, X_2, . . . , X_n) ≤ L∗_n < (1/n) H(X_1, X_2, . . . , X_n) + 1/n.

If X_1, X_2, . . . , X_n is a stationary random process,

L∗_n → H(X),

where H(X) is the entropy rate of the random process.


Wrong Code

Theorem 5.4.3 (Wrong Code) The expected length under p(x) of the code assignment l(x) = ⌈log(1/q(x))⌉ satisfies

H(p) + D(p||q) ≤ E_p[l(X)] < H(p) + D(p||q) + 1.

■ Proof. The expected codelength is

E[l(X)] = ∑_x p(x) ⌈log(1/q(x))⌉
        < ∑_x p(x) (log(1/q(x)) + 1)
        = ∑_x p(x) log(1/q(x)) + 1
        = ∑_x p(x) (log(p(x)/q(x)) − log p(x)) + 1
        = H(p) + D(p||q) + 1.

The lower bound can be derived similarly.
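
A numerical sketch of the theorem, with a deliberately mismatched design pmf q (both pmfs below are illustrative choices, not from the slides):

```python
import math

# Designing lengths for the wrong pmf q costs about D(p||q) extra bits.
p = [1/2, 1/4, 1/8, 1/8]
q = [1/8, 1/8, 1/4, 1/2]                                # mismatched design pmf
lengths = [math.ceil(math.log2(1 / qi)) for qi in q]    # [3, 3, 2, 1]
E = sum(pi * li for pi, li in zip(p, lengths))          # 2.625
H = -sum(pi * math.log2(pi) for pi in p)                # 1.75
Dpq = sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))  # 0.875
print(H + Dpq <= E < H + Dpq + 1)  # True (tight on the left: q is dyadic)
```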


5.5 Kraft Inequality for Uniquely Decodable Codes


Uniquely Decodable Codes

Theorem 5.5.1 (McMillan) The codeword lengths of any uniquely decodable D-ary code must satisfy the Kraft inequality

∑_i D^{−l_i} ≤ 1.

Conversely, given a set of codeword lengths that satisfy this inequality, it is possible to construct a uniquely decodable code with these codeword lengths.

■ Proof. Consider the kth power of the Kraft sum:

(∑_x D^{−l(x)})^k = ∑_{x_1} ∑_{x_2} · · · ∑_{x_k} D^{−l(x_1)} D^{−l(x_2)} · · · D^{−l(x_k)}
                 = ∑_{x_1,...,x_k ∈ X^k} D^{−l(x_1)} D^{−l(x_2)} · · · D^{−l(x_k)}
                 = ∑_{x^k ∈ X^k} D^{−l(x^k)},

where l(x^k) = l(x_1) + · · · + l(x_k) is the length of the concatenated codeword.


Uniquely Decodable Codes

We now gather the terms by word length to obtain

∑_{x^k ∈ X^k} D^{−l(x^k)} = ∑_{m=1}^{k l_max} a(m) D^{−m},

where l_max is the maximum codeword length and a(m) is the number of source sequences mapping into codewords of length m. Since the code is uniquely decodable, there is at most one sequence mapping into each code m-sequence, and there are at most D^m code m-sequences. Thus, a(m) ≤ D^m.


Uniquely Decodable Codes

Therefore,

(∑_x D^{−l(x)})^k = ∑_{m=1}^{k l_max} a(m) D^{−m} ≤ ∑_{m=1}^{k l_max} D^m D^{−m} = k l_max,

and hence

∑_j D^{−l_j} ≤ (k l_max)^{1/k}.

Since this inequality is true for all k, it is true in the limit as k → ∞. Since (k l_max)^{1/k} → 1, we have

∑_j D^{−l_j} ≤ 1.


5.6 Huffman Codes


Example 5.6.1

X = {1, 2, 3, 4, 5}, p = {0.25, 0.25, 0.2, 0.15, 0.15}, binary.
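
The slide's Huffman tree is not reproduced here. A minimal binary Huffman coder (a sketch, not the original slide material) applied to this pmf:

```python
import heapq
from itertools import count

def huffman(pmf):
    # Heap of (probability, tiebreak, partial code); the counter keeps
    # tuple comparison away from the dict payloads.
    tiebreak = count()
    heap = [(p, next(tiebreak), {sym: ""}) for sym, p in pmf.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)  # two least probable nodes
        p1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c0.items()}
        merged.update({s: "1" + w for s, w in c1.items()})
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
    return heap[0][2]

pmf = {1: 0.25, 2: 0.25, 3: 0.2, 4: 0.15, 5: 0.15}
code = huffman(pmf)
print(code, sum(pmf[s] * len(w) for s, w in code.items()))  # L = 2.3 bits
```

Different tie-breaking rules can produce different codewords, but every Huffman code for this pmf has the same expected length, 2.3 bits.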


Example 5.6.2

X = {1, 2, 3, 4, 5}, p = {0.25, 0.25, 0.2, 0.15, 0.15}, ternary.


Example 5.6.3

X = {1, 2, 3, 4, 5, 6}, p = {0.25, 0.25, 0.2, 0.1, 0.1, 0.1}, ternary.

■ Each merging step reduces the number of symbols by D − 1, so after the kth stage the total number of symbols is 1 + k(D − 1). If the alphabet size is not of this form, add dummy symbols of probability 0 (here one dummy brings 6 symbols to 7 = 1 + 3 · 2); see the sketch below.
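
A hedged sketch of the D-ary procedure on this pmf: pad with zero-probability dummies until the count has the form 1 + k(D − 1), then repeatedly merge the D least probable nodes.

```python
import heapq
from itertools import count

def huffman_dary(pmf, D=3):
    tiebreak = count()
    heap = [(p, next(tiebreak), {s: ""}) for s, p in pmf.items()]
    while (len(heap) - 1) % (D - 1) != 0:
        heap.append((0.0, next(tiebreak), {}))  # dummy symbol, probability 0
    heapq.heapify(heap)
    while len(heap) > 1:
        total, merged = 0.0, {}
        for digit in range(D):  # merge the D least probable nodes
            p, _, c = heapq.heappop(heap)
            total += p
            merged.update({s: str(digit) + w for s, w in c.items()})
        heapq.heappush(heap, (total, next(tiebreak), merged))
    return heap[0][2]

pmf = {1: 0.25, 2: 0.25, 3: 0.2, 4: 0.1, 5: 0.1, 6: 0.1}
code = huffman_dary(pmf, D=3)
print(code, sum(pmf[s] * len(w) for s, w in code.items()))  # 1.7 ternary digits
```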


Shannon Code

■ The Shannon code uses codeword lengths l_i = ⌈log(1/p_i)⌉.
■ It may be much worse than the optimal code. For example, let p_1 = 1 − 1/1024 and p_2 = 1/1024. Then ⌈log(1/p_1)⌉ = 1 and ⌈log(1/p_2)⌉ = 10, so the rare symbol gets a 10-bit codeword, although the optimal code uses exactly 1 bit for each symbol.
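
The arithmetic behind the example, as a two-line sketch:

```python
import math

p1, p2 = 1 - 1/1024, 1/1024
print(math.ceil(math.log2(1/p1)), math.ceil(math.log2(1/p2)))  # 1 10
```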


5.8 Optimality of Huffman Codes


Properties of Optimal Codes

Lemma 5.8.1 For any distribution, there exists an optimal instantaneous code that satisfies the following properties:

  • 1. If pj > pk, then lj ≤ lk.
  • 2. The two longest codewords have the same length.
  • 3. Two of the longest codewords differ only in the last bit.

Optimality of Huffman codes

Lemma 5.8.2 Let C be a code for the distribution p_1 ≥ p_2 ≥ · · · ≥ p_{K−1} ≥ p_K, and let C′ be a code for the reduced distribution p_1, p_2, . . . , p_{K−2}, p_{K−1} + p_K. If C′ is optimal with code assignment

p_1 → w_1, p_2 → w_2, . . . , p_{K−1} + p_K → w_{K−1},

then C is also optimal with code assignment

p_1 → w_1, p_2 → w_2, . . . , p_{K−1} → w_{K−1}0, p_K → w_{K−1}1.


Optimality of Huffman codes

■ Proof. The average length of C′ is

L(C′) = p_1 l_1 + p_2 l_2 + · · · + (p_{K−1} + p_K) l_{K−1}.

The average length of C is

L(C) = p_1 l_1 + p_2 l_2 + · · · + p_{K−1}(l_{K−1} + 1) + p_K(l_{K−1} + 1).

We have

L(C) = L(C′) + p_{K−1} + p_K.

That is, we can minimize L(C) by minimizing L(C′), since p_{K−1} + p_K is a constant.
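
A numerical check of the length identity on the pmf of Example 5.6.1 (the reduced code below is one optimal choice, picked here for illustration):

```python
# Splitting the merged codeword "11" adds exactly p4 + p5 to the length.
p = {1: 0.25, 2: 0.25, 3: 0.2, 4: 0.15, 5: 0.15}
p_reduced = {1: 0.25, 2: 0.25, 3: 0.2, "45": 0.3}  # p4 + p5 merged

code_reduced = {1: "00", 2: "01", 3: "10", "45": "11"}  # optimal for p_reduced
code = {1: "00", 2: "01", 3: "10", 4: "110", 5: "111"}  # "11" -> "110", "111"

L_reduced = sum(p_reduced[s] * len(w) for s, w in code_reduced.items())
L = sum(p[s] * len(w) for s, w in code.items())
print(L, L_reduced + p[4] + p[5])  # 2.3 2.3
```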