Data Compression (Chapters 4-6), presented by Tapani Raiko, Feb 26, 2004



SLIDE 1

T-61.182 Information Theory and Machine Learning

Data Compression (Chapters 4-6)

presented by Tapani Raiko Feb 26, 2004

SLIDE 2

Contents (Data Compression)

      Chap.   Data     Lossy?     Result
      4       Block    Lossy      Shannon's source coding theorem
      5       Symbol   Lossless   Huffman coding algorithm
      6       Stream   Lossless   Arithmetic coding algorithm

SLIDE 3

Weighting Problem (What is information?)

  • 12 balls, all equal in weight except for one
  • Two-pan balance to use
  • Determine which is the odd ball and whether it is heavier or lighter
  • As few uses of the balance as possible!
  • The outcome of a random experiment is guaranteed to be most informative if the probability distribution over outcomes is uniform

SLIDE 4

[Figure: the weighing-problem solution as a three-level decision tree. Each node lists the hypotheses still possible (1+ meaning "ball 1 is heavy", 1− meaning "ball 1 is light") and which balls to weigh; the three branches of each weighing are left pan heavier, right pan heavier, and balance. The first weighing is 1 2 3 4 against 5 6 7 8, and every odd-ball hypothesis is identified in exactly three weighings.]

SLIDE 5

Definitions

  • Shannon information content:

    h(x = ai) ≡ log2(1 / pi)

  • Entropy:

    H(X) = Σi pi log2(1 / pi)

  • Both are additive for independent variables

      p      h(p)   H2(p)
      0.001  10.0   0.011
      0.01    6.6   0.081
      0.1     3.3   0.47
      0.2     2.3   0.72
      0.5     1.0   1.0

[Plots: h(p) = log2(1/p) and the binary entropy H2(p) as functions of p.]
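The table's values can be reproduced with a few lines of Python (a quick sketch, not part of the slides):

```python
import math

def h(p):
    """Shannon information content of an outcome of probability p, in bits."""
    return math.log2(1 / p)

def H2(p):
    """Binary entropy: H2(p) = p log2(1/p) + (1 - p) log2(1/(1 - p))."""
    return p * math.log2(1 / p) + (1 - p) * math.log2(1 / (1 - p))

for p in (0.001, 0.01, 0.1, 0.2, 0.5):
    print(f"p = {p:<6}  h(p) = {h(p):5.1f}  H2(p) = {H2(p):.3f}")
```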

SLIDE 6

Game of Submarine

  • Player hides a submarine in one square of an 8 by 8 grid
  • Another player tries to hit it

[Figure: an 8×8 grid with columns A-H and rows 1-8; × marks each missed shot and S the square hiding the submarine.]

      move #       1        2        32       48       49
      question     G3       B1       E5       F3       H3
      outcome      x = n    x = n    x = n    x = n    x = y
      P(x)         63/64    62/63    32/33    16/17    1/16
      h(x)         0.0227   0.0230   0.0443   0.0874   4.0
      Total info.  0.0227   0.0458   1.0      2.0      6.0

  • Compare to asking 6 yes/no questions about the location
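The "Total info." row telescopes: multiplying the outcome probabilities of 48 misses and the final hit gives exactly 1/64, so the game yields the same 6 bits as the yes/no questions. A quick check (an illustration, not from the slides):

```python
import math

total = 0.0
squares_left = 64
for move in range(48):                  # 48 misses: outcome probability (n-1)/n
    total += math.log2(squares_left / (squares_left - 1))
    squares_left -= 1
total += math.log2(squares_left)        # the hit: probability 1/16, worth 4 bits
print(total)                            # ≈ 6 bits in total
```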
SLIDE 7

Raw Bit Content

  • A binary name is given to each outcome of a random variable X
  • The length of the names would be log2 |AX|
    (assuming |AX| happens to be a power of 2)

  • Define: The raw bit content of X is

    H0(X) = log2 |AX|

  • Simply counts the possible outcomes - no compression yet
  • Additive: H0(X, Y ) = H0(X) + H0(Y )
SLIDE 8

Lossy Compression

  • Let

    AX = {a, b, c, d, e, f, g, h}
    PX = {1/4, 1/4, 1/4, 3/16, 1/64, 1/64, 1/64, 1/64}

  • The raw bit content is 3 bits (8 binary names)

  • If we are willing to run a risk of δ = 1/16 of not having a name for x, then we can get by with 2 bits (4 names)

      δ = 0          δ = 1/16
      x   c(x)       x   c(x)
      a   000        a   00
      b   001        b   01
      c   010        c   10
      d   011        d   11
      e   100        e   −
      f   101        f   −
      g   110        g   −
      h   111        h   −

SLIDE 9

[Figure: log2 P(x) = −2 for a, b, c; −2.4 for d; −6 for e, f, g, h; the subsets S0 (all eight outcomes) and S1/16 (the top four) are marked.]

The outcomes of X ranked by their probability

SLIDE 10

Essential Bit Content

  • Allow an error with probability δ
  • Choose the smallest sufficient subset Sδ such that

    P(x ∈ Sδ) ≥ 1 − δ

    (arrange the elements of AX in order of decreasing probability and take enough from the beginning)

  • Define: The essential bit content of X is

Hδ(X) = log2 |Sδ|

  • Note that the raw bit content H0 is a special case of Hδ
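Hδ can be computed directly for the eight-outcome ensemble of the previous slides (a sketch, not from the slides; probabilities kept as exact fractions):

```python
import math
from fractions import Fraction

# The example ensemble: P = {1/4, 1/4, 1/4, 3/16, 1/64, 1/64, 1/64, 1/64}
P = [Fraction(1, 4)] * 3 + [Fraction(3, 16)] + [Fraction(1, 64)] * 4

def H_delta(P, delta):
    """log2 of the size of the smallest subset with probability >= 1 - delta."""
    target = 1 - Fraction(delta)
    total, size = Fraction(0), 0
    for p in sorted(P, reverse=True):       # take outcomes in decreasing probability
        if total >= target:
            break
        total += p
        size += 1
    return math.log2(size)

print(H_delta(P, 0))                  # 3.0: the raw bit content H0
print(H_delta(P, Fraction(1, 16)))    # 2.0: four names suffice
```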
SLIDE 11

Hδ(X)

[Plot: Hδ(X) is a staircase in δ, dropping from 3 bits toward 0; each step is labeled with the subset in use, from {a,b,c,d,e,f,g,h} down to {a}.]

The essential bit content as the function of allowed probability of error

SLIDE 12

Extended Ensembles (Blocks)

  • Consider a tuple of N i.i.d. random variables
  • Denote by XN the ensemble (X1, X2, . . . , XN)
  • Entropy is additive: H(XN) = NH(X)
  • Example: N flips of a bent coin: p0 = 0.9, p1 = 0.1
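Entropy additivity can be checked numerically for the bent coin by brute force over the 2^N outcomes of X^N (a sketch, not from the slides):

```python
import math

p1, N = 0.1, 4

def H2(p):
    """Binary entropy in bits."""
    return p * math.log2(1 / p) + (1 - p) * math.log2(1 / (1 - p))

# Entropy of X^N computed directly from the joint distribution.
H_joint = 0.0
for x in range(2 ** N):
    r = bin(x).count("1")                   # number of 1s in this outcome
    P = p1 ** r * (1 - p1) ** (N - r)
    H_joint += P * math.log2(1 / P)

print(H_joint, N * H2(p1))                  # equal: H(X^N) = N H(X)
```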
SLIDE 13

[Figure: the outcomes of X^4 (0000; strings with one 1; with two 1s; with three 1s; 1111) ranked by log2 P(x), with the subsets S0.01 and S0.1 marked.]

Outcomes of the bent coin ensemble X4

SLIDE 14

Hδ(X4)

[Plot: Hδ(X^4) as a staircase function of δ, starting at 4 bits for δ = 0.]

Essential bit content of the bent coin ensemble X4

SLIDE 15

Hδ(X10)

[Plot: Hδ(X^10) as a staircase function of δ, starting at 10 bits for δ = 0.]

Essential bit content of the bent coin ensemble X10

SLIDE 16

(1/N) Hδ(X^N)

[Plot: (1/N) Hδ(X^N) against δ for N = 10, 210, 410, 610, 810, 1010; as N grows the staircases flatten toward the horizontal line at H(X).]

Essential bit content per toss

SLIDE 17

Shannon’s Source Coding Theorem

Given ε > 0 and 0 < δ < 1, there exists a positive integer N0 such that for N > N0,

    | (1/N) Hδ(X^N) − H(X) | < ε.

[Figure: for large N, the curve (1/N) Hδ(X^N) lies between H − ε and H + ε for all δ ∈ (0, 1); at δ = 0 it starts near H0(X).]

  • Proof involves
    – Law of large numbers
    – Chebyshev’s inequality
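The convergence can be seen numerically for the bent coin (p1 = 0.1). All strings with r ones are equiprobable, so the smallest δ-sufficient subset is built by taking r = 0, 1, 2, . . . in turn; a sketch, not from the slides:

```python
import math

p1, delta = 0.1, 0.1
H = p1 * math.log2(1 / p1) + (1 - p1) * math.log2(1 / (1 - p1))   # ≈ 0.469 bits

def Hdelta_per_symbol(N):
    """(1/N) H_delta(X^N) for the bent coin, taking strings in order of
    decreasing probability, i.e. increasing number of 1s (since p1 < 0.5)."""
    cum, size = 0.0, 0
    for r in range(N + 1):
        if cum >= 1 - delta:
            break
        size += math.comb(N, r)
        cum += math.comb(N, r) * p1 ** r * (1 - p1) ** (N - r)
    return math.log2(size) / N

for N in (10, 100, 1000):
    print(N, Hdelta_per_symbol(N))          # approaches H as N grows
```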

SLIDE 18

x (100 symbols)                                                                                       log2 P(x)

...1...................1.....1....1.1.......1........1...........1.....................1.......11...  −50.1
......................1.....1.....1.......1....1.........1.....................................1....  −37.3
........1....1..1...1....11..1.1.........11.........................1...1.1..1...1................1.  −65.9
1.1...1................1.......................11.1..1............................1.....1..1.11.....  −56.4
...11...........1...1.....1.1......1..........1....1...1.....1............1.........................  −53.2
..............1......1.........1.1.......1..........1............1...1......................1.......  −43.7
.....1........1.......1...1............1............1...........1......1..11........................  −46.8
.....1..1..1...............111...................1...............1.........1.1...1...1.............1  −56.4
.........1..........1.....1......1..........1....1..............................................1...  −37.3
......1........................1..............1.....1..1.1.1..1...................................1.  −43.7
1.......................1..........1...1...................1....1....1........1..11..1.1...1........  −56.4
...........11.1.........1................1......1.....................1.............................  −37.3
.1..........1...1.1.............1.......11...........1.1...1..............1.............11..........  −56.4
......1...1..1.....1..11.1.1.1...1.....................1............1.............1..1..............  −59.5
............11.1......1....1..1............................1.......1..............1.......1.........  −46.8
....................................................................................................  −15.2
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111  −332.1

Some samples from X100. Compare to H(X100) = 46.9 bits.

SLIDE 19

Typicality

  • A string contains r 1s and N − r 0s
  • Consider r as a random variable (binomial distribution)
  • Mean and std: r ∼ N p1 ± √(N p1 (1 − p1))
  • A typical string is a one with r ≃ Np1
  • In general, the information content is within N [H(X) ± β]:

    log2(1 / P(x)) ≃ N Σi pi log2(1 / pi) = N H(X)
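The total probability of the typical set can be computed by summing over r (a sketch for the bent coin, not from the slides):

```python
import math

def prob_typical(N, p1, beta):
    """P(x in T_Nbeta): total probability of strings whose information
    content per symbol is within beta of the entropy H(X)."""
    H = p1 * math.log2(1 / p1) + (1 - p1) * math.log2(1 / (1 - p1))
    total = 0.0
    for r in range(N + 1):
        info = r * math.log2(1 / p1) + (N - r) * math.log2(1 / (1 - p1))
        if abs(info / N - H) <= beta:
            total += math.comb(N, r) * p1 ** r * (1 - p1) ** (N - r)
    return total

print(prob_typical(100, 0.1, 0.1))      # ≈ 0.76
print(prob_typical(1000, 0.1, 0.1))     # ≈ 1: typicality sharpens as N grows
```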

SLIDE 20

[Figure: four rows of panels for N = 100 (left) and N = 1000 (right). Top: the number of strings with r 1s, n(r) = C(N, r). Second: log2 P(x) against r. Third: the typical-set indicator T. Bottom: the total probability of strings with r 1s, n(r) P(x) = C(N, r) p1^r (1 − p1)^(N−r), which concentrates around r = N p1.]

Anatomy of the typical set T

SLIDE 21

[Figure: outcomes of X^N ranked by log2 P(x); the typical set TNβ clusters around log2 P(x) = −N H(X). Example outcomes:]

  • 0000000000000. . . 00000000000
  • 0001000000000. . . 00000000000
  • 0100000001000. . . 00010000000
  • 0000100000010. . . 00001000010
  • 1111111111110. . . 11111110111

Outcomes of XN ranked by their probability and the typical set TNβ

SLIDE 22

Shannon’s source coding theorem (verbal statement)

N i.i.d. random variables each with entropy H(X) can be compressed into more than NH(X) bits with negligible risk of information loss, as N → ∞; conversely if they are compressed into fewer than NH(X) bits it is virtually certain that information will be lost.

SLIDE 23

End of Chapter 4

      Chap.   Data     Lossy?     Result
      4       Block    Lossy      Shannon's source coding theorem
      5       Symbol   Lossless   Huffman coding algorithm
      6       Stream   Lossless   Arithmetic coding algorithm

SLIDE 24

Contents, Chap. 5: Symbol Codes

  • Lossless coding: shorter encodings to the more probable outcomes and longer encodings to the less probable
  • Practical to decode?
  • Best achievable compression?
  • Source coding theorem (symbol codes): the expected length L(C, X) ∈ [H(X), H(X) + 1)

  • Huffman coding algorithm
SLIDE 25

Definitions

  • A (binary) symbol code is a mapping from AX to {0, 1}+
  • c(x) is the codeword of x and l(x) its length
  • Extended code: c+(x1 x2 . . . xN) = c(x1) c(x2) . . . c(xN) (no punctuation)
  • A code C(X) is uniquely decodeable if no two distinct strings have the same encoding
  • A symbol code is called a prefix code if no codeword is a prefix of any other codeword (constraining to prefix codes doesn’t lose any performance)

SLIDE 26

Examples

AX = {a, b, c, d},  PX = {1/2, 1/4, 1/8, 1/8}

  • Using C0: c+(acdba) = 1000 0010 0001 0100 1000 = 10000010000101001000
  • Code C1 = {0, 101} is a prefix code, so it can be represented as a tree
  • Code C2 = {1, 101} is not a prefix code because 1 is a prefix of 101

      C0: ai  c(ai)  li
          a   1000   4
          b   0100   4
          c   0010   4
          d   0001   4

[Figure: the binary tree for C1 = {0, 101}.]

SLIDE 27

Expected length

  • Expected length L(C, X) of a symbol code C for ensemble X is

    L(C, X) = Σ_{x∈AX} P(x) l(x)

  • Bounded below by H(X) (for uniquely decodeable codes)
  • Equal to H(X) only if the codelengths equal the Shannon information contents: li = log2(1/pi)
  • Codelengths implicitly define a probability distribution {qi}:

    qi ≡ 2^(−li)
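For the running example ensemble PX = {1/2, 1/4, 1/8, 1/8} the codelengths {1, 2, 3, 3} match the Shannon information contents exactly, so L equals H (a quick check; the codewords of C3 are assumed to be {0, 10, 110, 111}, consistent with the next slide):

```python
import math

p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
C3 = {"a": "0", "b": "10", "c": "110", "d": "111"}

H = sum(pi * math.log2(1 / pi) for pi in p.values())
L = sum(p[x] * len(C3[x]) for x in p)
q = {x: 2 ** -len(C3[x]) for x in p}    # implicit distribution q_i = 2^-l_i

print(H, L)                             # both 1.75: q matches p exactly
```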

SLIDE 28

Examples

  • L(C3, X) = 1.75 = H(X)
  • L(C4, X) = 2 > H(X)
  • L(C5, X) = 1.25 < H(X)

      C3: ai  c(ai)  pi   h(pi)  li
          a   0      1/2  1.0    1
          b   10     1/4  2.0    2
          c   110    1/8  3.0    3
          d   111    1/8  3.0    3

      C4 = {00, 01, 10, 11}    C5 = {0, 1, 00, 11}

SLIDE 29

Example

      C6: ai  c(ai)  pi   h(pi)  li
          a   0      1/2  1.0    1
          b   01     1/4  2.0    2
          c   011    1/8  3.0    3
          d   111    1/8  3.0    3

  • L(C6, X) = 1.75 = H(X)
  • C6 is not a prefix code but is in fact uniquely decodable
SLIDE 30

Kraft Inequality

  • If a code is uniquely decodeable, its lengths must satisfy

    Σi 2^(−li) ≤ 1

  • For any lengths satisfying the Kraft inequality, there exists a prefix code with those lengths

[Figure: the binary strings 0, 1; 00 . . . 11; 000 . . . 111; 0000 . . . 1111 drawn as a tree; a codeword of length l uses up a fraction 2^(−l) of the budget.]

The total symbol code budget
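Both directions are easy to check in code: compute the Kraft sum, and build a prefix code from any satisfying lengths by handing out intervals of the code budget in order of increasing length (a sketch, not from the slides):

```python
def kraft_sum(lengths):
    return sum(2 ** -l for l in lengths)

def prefix_code(lengths):
    """Construct a prefix code with the given lengths, assuming they satisfy
    the Kraft inequality: walk down the [0, 1) code budget, giving each
    codeword the next free interval of size 2^-l."""
    assert kraft_sum(lengths) <= 1
    codewords, f = [], 0.0
    for l in sorted(lengths):
        codewords.append(format(int(f * 2 ** l), f"0{l}b"))  # first l binary digits of f
        f += 2 ** -l
    return codewords

print(kraft_sum([1, 2, 3, 3]))      # 1.0: a complete code, no budget left over
print(prefix_code([1, 2, 3, 3]))    # ['0', '10', '110', '111']
```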

SLIDE 31

Source coding theorem for symbol codes

  • By setting

    li = ⌈log2(1/pi)⌉,

    where ⌈l⌉ denotes the smallest integer greater than or equal to l, we get (with Kraft’s inequality):

  • There exists a prefix code C with

    H(X) ≤ L(C, X) < H(X) + 1

  • Relative entropy DKL(p||q) measures how many bits per symbol are wasted:

    L(C, X) = Σi pi log2(1/qi) = H(X) + DKL(p||q)
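The theorem can be checked for any distribution by computing the Shannon codelengths directly (a sketch, not from the slides):

```python
import math

ps = [0.25, 0.25, 0.2, 0.15, 0.15]
ls = [math.ceil(math.log2(1 / p)) for p in ps]   # Shannon codelengths

assert sum(2 ** -l for l in ls) <= 1             # Kraft holds: a prefix code exists
H = sum(p * math.log2(1 / p) for p in ps)
L = sum(p * l for p, l in zip(ps, ls))
print(ls, H, L)                                  # H <= L < H + 1
```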

SLIDE 32

Huffman Coding Algorithm

  • 1. Take two least probable symbols in the alphabet
  • 2. Give them the longest codewords differing only in the last digit
  • 3. Combine them into a single symbol and repeat

[Figure: the Huffman procedure on the example below; the probabilities merge as 0.15 + 0.15 = 0.3, 0.25 + 0.2 = 0.45, 0.25 + 0.3 = 0.55, 0.45 + 0.55 = 1.0 over steps 1-4.]

      ai  pi    h(pi)  li  c(ai)
      a   0.25  2.0    2   00
      b   0.25  2.0    2   10
      c   0.2   2.3    2   11
      d   0.15  2.7    3   010
      e   0.15  2.7    3   011
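A minimal Huffman implementation with a heap (a sketch, not the slides' code; d and e are merged first, reproducing the table's codelengths):

```python
import heapq
import itertools

def huffman(probs):
    """Huffman code: repeatedly pop the two least probable entries and merge
    them, prepending a distinguishing bit to each side's codewords."""
    tie = itertools.count()     # tie-breaker so the heap never compares dicts
    heap = [(p, next(tie), {sym: ""}) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)
        p1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c0.items()}
        merged.update({s: "1" + w for s, w in c1.items()})
        heapq.heappush(heap, (p0 + p1, next(tie), merged))
    return heap[0][2]

probs = {"a": 0.25, "b": 0.25, "c": 0.2, "d": 0.15, "e": 0.15}
code = huffman(probs)
L = sum(probs[s] * len(code[s]) for s in probs)
print(code, L)      # codelengths 2, 2, 2, 3, 3; expected length 2.3 bits
```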

SLIDE 33

Optimality

  • Huffman coding is optimal in two senses:
    – Smallest expected codelength of uniquely decodeable symbol codes
    – Prefix code → easy to decode
  • But:
    – The overhead of between 0 and 1 bits per symbol is important if H(X) is small → compress blocks of symbols to make H(X) larger
    – Does not take context into account (symbol code vs. stream code)

SLIDE 34

End of Chapter 5

      Chap.   Data     Lossy?     Result
      4       Block    Lossy      Shannon's source coding theorem
      5       Symbol   Lossless   Huffman coding algorithm
      6       Stream   Lossless   Arithmetic coding algorithm

SLIDE 35

Guessing Game

  • Human was asked to guess a sentence character by character
  • The numbers of guesses are listed below each character

T H E R E - I S - N O - R E V E R S E - O N - A - M O T O R C Y C L E -

1 1 1 5 1 1 2 1 1 2 1 1 15 1 17 1 1 1 2 1 3 2 1 2 2 7 1 1 1 1 4 1 1 1 1 1

  • One could encode only the string 1, 1, 1, 5, 1, . . .
  • Decoding requires an identical twin who also plays the guessing game

SLIDE 36

Arithmetic Coding (1/2)

  • Human predictor is replaced by a probabilistic model of the source
  • The model supplies a predictive distribution over the next symbol
  • It can handle complex adaptive models (context-dependent)
  • Binary strings define real intervals within the real line [0, 1)
  • The string 01 corresponds to [0.01, 0.10) in binary, or [0.25, 0.50) in base ten

[Figure: the line [0, 1) marked at 0.00, 0.25, 0.50, 0.75, 1.00, with the intervals of the strings 1, 01, and 01101.]

SLIDE 37

Arithmetic Coding (2/2)

  • Divide the real line [0, 1) into I intervals of lengths equal to the probabilities P(x1 = ai):

    a1 gets [0.00, P(x1 = a1)),
    a2 gets [P(x1 = a1), P(x1 = a1) + P(x1 = a2)),
    . . .
    aI gets [P(x1 = a1) + . . . + P(x1 = aI−1), 1.0)

[Figure 6.2: the partition of [0, 1) into the intervals for a1, a2, . . . , aI; the interval of a2 is subdivided in the same proportions into a2a1, . . . , a2aI.]

  • Pick an interval and subdivide it (and iterate)
  • Send a binary string whose interval lies within that interval
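The interval arithmetic of the steps above can be sketched in a few lines (fixed, non-adaptive symbol probabilities here for simplicity; not the slides' code):

```python
def interval_for(string, probs):
    """Shrink [0, 1) symbol by symbol: at each step keep the sub-interval
    assigned to the observed symbol, with sub-interval widths proportional
    to the symbol probabilities."""
    lo, hi = 0.0, 1.0
    for s in string:
        width = hi - lo
        cum = 0.0
        for sym, p in probs.items():
            if sym == s:
                lo, hi = lo + cum * width, lo + (cum + p) * width
                break
            cum += p
    return lo, hi

# With P(a) = P(b) = 0.5 the source interval of 'ab' is [0.25, 0.50),
# the same as the binary-string interval of '01' on the previous slide.
print(interval_for("ab", {"a": 0.5, "b": 0.5}))
```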
SLIDE 38

Example: Bent Coin (1/3)

  • Coin sides are a and b, and the ’end of file’ symbol is ✷
  • Use a Bayesian model with a uniform prior over probabilities of outcomes

      Context (sequence thus far)   Probability of next symbol
      (empty)                       P(a) = 0.425         P(b) = 0.425         P(✷) = 0.15
      b                             P(a | b) = 0.28      P(b | b) = 0.57      P(✷ | b) = 0.15
      bb                            P(a | bb) = 0.21     P(b | bb) = 0.64     P(✷ | bb) = 0.15
      bbb                           P(a | bbb) = 0.17    P(b | bbb) = 0.68    P(✷ | bbb) = 0.15
      bbba                          P(a | bbba) = 0.28   P(b | bbba) = 0.57   P(✷ | bbba) = 0.15
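The table's numbers are consistent with Laplace's rule scaled by a fixed end-of-file probability, i.e. P(a | s) = (1 − 0.15) · (Fa + 1)/(Fa + Fb + 2), where Fa and Fb count occurrences so far. This reconstruction is an assumption, but it reproduces every row (Python sketch):

```python
def predictive(F_a, F_b, p_eof=0.15):
    """Assumed model: Laplace's rule for the next coin face, with the
    remaining 0.15 of probability reserved for the end-of-file symbol."""
    p_a = (F_a + 1) / (F_a + F_b + 2)
    return ((1 - p_eof) * p_a, (1 - p_eof) * (1 - p_a), p_eof)

print(predictive(0, 0))   # (0.425, 0.425, 0.15): the empty context
print(predictive(0, 1))   # ≈ (0.283, 0.567, 0.15): table's 0.28 / 0.57 after b
print(predictive(0, 2))   # ≈ (0.2125, 0.6375, 0.15): 0.21 / 0.64 after bb
print(predictive(1, 3))   # ≈ (0.283, 0.567, 0.15): 0.28 / 0.57 after bbba
```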

SLIDE 39

[Figure 6.4. Illustration of the arithmetic coding process as the sequence bbba✷ is transmitted: the source interval is narrowed through b, bb, bbb, bbba to bbba✷, and the binary string 100111101, whose interval lies inside it, is sent.]

SLIDE 40

[Figure: the intervals of all source strings up to length four (a, b, ✷; aa, ab, a✷; . . . ; bbbb) laid alongside the intervals of all binary strings up to length five (0, 1; 00 . . . 11; . . . ; 11111).]

SLIDE 41

On Arithmetic Coding

  • Computationally efficient
  • Length of a string closely matches the Shannon information content
  • Overhead required to terminate a message is never more than 2 bits
    ⇒ Finding a good coding is equivalent to finding a good probabilistic model!
  • Flexible:
    – any source alphabet and any encoded alphabet
    – alphabets can change with time
    – probabilities are context-dependent
  • Can be used to generate random samples from random bits economically

SLIDE 42

Lempel-Ziv Coding

  • Used in gzip etc.

      source substrings   λ     1      0      11      01      010      00       10
      s(n)                000   001    010    011     100     101      110      111
      (pointer, bit)            (, 1)  (0, 0) (01, 1) (10, 1) (100, 0) (010, 0) (001, 0)

  • Asymptotically compresses down to the entropy of the source (not in practice)
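A minimal LZ78-style parser (a sketch; the source string 1011010100010 is an assumption, chosen so the parse matches the table above):

```python
def lz_parse(bits):
    """Split the source into the shortest substrings not seen before; each
    new substring is emitted as (index of its longest known prefix, final
    bit), which is all the decoder needs to rebuild the dictionary."""
    index = {"": 0}
    out, cur = [], ""
    for b in bits:
        cur += b
        if cur not in index:
            index[cur] = len(index)
            out.append((index[cur[:-1]], cur[-1]))
            cur = ""
    return out

print(lz_parse("1011010100010"))
# [(0, '1'), (0, '0'), (1, '1'), (2, '1'), (4, '0'), (2, '0'), (1, '0')]
```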

SLIDE 43

Summary (1/2)

  • Fixed-length block codes (Chapter 4)
    – Only a tiny fraction of source strings are given an encoding
    – Identify entropy as the measure of compressibility
    – No practical use
  • Symbol codes (Chapter 5)
    – Variable code lengths allow lossless compression
    – Expected code length is H + DKL (between the source distribution and the code’s implicit distribution)
    – DKL can be made smaller than 1 bit per symbol
    – Huffman code is the optimal symbol code

SLIDE 44

Summary (2/2)

  • Stream codes (Chapter 6)
    – Arithmetic coding combines a probabilistic model with an encoding algorithm
    – Lempel-Ziv memorises strings that have already occurred
    – If any of the bits is altered by noise, the rest of the encoding fails

SLIDE 45

Exercises

  • 6.19 (entropy and information)
  • 4.16 (Shannon source coding theorem)
  • 6.16 (Huffman coding)
  • 6.7 (Arithmetic coding)