Compression: Information Theory
Greg Plaxton
Theory in Programming Practice, Fall 2005
Department of Computer Science, University of Texas at Austin
Coding Theory
- Encoder
  – Input: a message over some finite alphabet such as {0, 1} or {a, . . . , z}
  – Output: encoded message
- Decoder
  – Input: an encoded message produced by the encoder
  – Output: (a good approximation to) the associated input message
- Motivation?
Theory in Programming Practice, Plaxton, Fall 2005
Some Applications of Coding Theory
- Compression
– Goal: Produce a short encoding of the input message
- Error detection/correction
– Goal: Produce a fault-tolerant encoding of the input message
- Cryptography
– Goal: Produce an encoding of the input message that can only be decoded by the intended recipient(s) of the message
- It is desirable for the encoding and decoding algorithms to be efficient in terms of time and space
  – Various tradeoffs are appropriate for different applications
Compression
- Lossless: decoder recovers the original input message
- Lossy: decoder recovers an approximation to the original input message
- The application dictates how much, if any, loss we can tolerate
  – Text compression is usually required to be lossless
  – Image/video compression is often lossy
- We will focus on techniques for lossless compression
Text Compression
- Practical question: I’m running out of disk space; how much can I compress my files?
- A (naive?) idea:
  – Any file can be compressed to the empty string: just write a decoder that outputs the file when given the empty string as input!
  – A problem with this approach is that we need to store the decoder, and the naive implementation of the decoder (which simply stores the original file in some static data structure within the decoder program) is at least as large as the original file
  – Can this idea be salvaged?
Kolmogorov Complexity
- In some cases, a large file can be generated by a very small program running on the empty string; e.g., a file containing a list of the first trillion prime numbers
- Your files can be compressed down to the size of the smallest program that (when given the empty string as input) produces them as output
  – How do I figure out this shortest program?
  – Won’t it be time-consuming to write/debug/maintain?
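The prime-list example above can be made concrete: a program of a few lines acts as a "decoder" whose output can be made as large as we like. This sketch uses simple trial division (far too slow for a trillion primes, but the point is the program's size, not its speed):

```python
def first_primes(n):
    """Return the first n prime numbers by trial division against
    the primes found so far."""
    primes = []
    candidate = 2
    while len(primes) < n:
        # candidate is prime iff no earlier prime divides it
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes

print(first_primes(10))  # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```

The program text stays a fixed few hundred bytes no matter how large `n` is, while its output grows without bound: exactly the situation in which the Kolmogorov "compression" of a file is tiny.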
Information Theory
- May be viewed as providing a practical way to (approximately) carry out the strategy suggested by Kolmogorov complexity
- Consider a file that you would like to compress
  – Assume that this file can be viewed, to a reasonable degree of approximation, as being drawn from a particular probability distribution (e.g., we will see that this is true of English text)
  – Perhaps many other people have files drawn from this distribution, or from distributions in a similar class
  – If so, a good encoder/decoder pair for that class of distributions may already exist; with luck, it will already be installed on your system
Example: English Text
- In what sense can we view English text as being (approximately) drawn from a probability distribution?
- English text is one of the example applications discussed in Shannon’s 1948 paper “A Mathematical Theory of Communication”
  – On page 7 we find a sequence of successively more accurate probabilistic models of English text
  – Claude Shannon (1916–2001) is known as the “father of information theory”
Entropy in Thermodynamics
- In thermodynamics, entropy is a measure of energy dispersal
  – The more “spread out” the energy of a system is, the higher the entropy
  – A system in which the energy is concentrated at a single point has zero entropy
  – A system in which the energy is uniformly distributed has reached its maximum possible entropy
- Second law of thermodynamics: The entropy of an isolated system can only increase
  – Bad news: The entropy of the universe can only increase as matter and energy degrade to an ultimate state of inert uniformity
  – Good news: This process is likely to take a while
Entropy in Information Theory (Shannon)
- A measure of the uncertainty associated with a probability distribution
  – The more “spread out” the distribution is, the higher the entropy
  – A probability distribution in which all of the probability is concentrated on a single outcome has zero entropy
  – For any given set of possible outcomes, the probability distribution with the maximum entropy is the uniform distribution
- Consider a distribution over a set of n outcomes in which the ith outcome has associated probability p_i; Shannon defined the entropy of this distribution as

  Σ_i p_i log(1/p_i) = −Σ_i p_i log p_i
- The logarithm above is normally assumed to be taken base 2, in which
case the units of entropy are bits (binary digits)
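The definition above translates directly into a few lines of code. This sketch computes the base-2 entropy of a distribution given as a list of probabilities, and checks the extreme cases from the bullets above (a point mass and the uniform distribution):

```python
import math

def entropy(probs):
    """Shannon entropy, in bits, of a discrete distribution given as an
    iterable of probabilities summing to 1. Zero-probability outcomes
    are skipped, since p log(1/p) tends to 0 as p tends to 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([1.0, 0.0, 0.0]))  # point mass: 0 bits
print(entropy([0.25] * 4))       # uniform on 4 outcomes: 2.0 bits
print(entropy([1 / 27] * 27))    # uniform on 27 symbols: log 27 ≈ 4.75 bits
```

The last line reproduces the zero-order figure for 27-symbol English text (26 letters plus space) that appears later in these slides.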
Entropy of an I.I.D. Source
- Consider a message in which each successive symbol is independently
drawn from the same probability distribution over n symbols, where the probability of drawing the ith symbol is pi
- The entropy of such a source is −Σ_i p_i log p_i bits per symbol
- Example: Shannon’s first-order model of English text yields an entropy of 4.07 bits per symbol
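An i.i.d. source is easy to simulate, which gives an empirical check on the formula: draw a long message, measure the symbol frequencies, and plug them into −Σ_i p_i log p_i. The three-symbol distribution here is a made-up toy, not one of Shannon’s models:

```python
import math
import random
from collections import Counter

# Toy i.i.d. source (assumed for illustration): three symbols with
# probabilities 1/2, 1/4, 1/4, whose entropy is exactly 1.5 bits/symbol.
probs = {"a": 0.5, "b": 0.25, "c": 0.25}
true_h = -sum(p * math.log2(p) for p in probs.values())

random.seed(0)
message = random.choices(list(probs), weights=list(probs.values()), k=100_000)

# Plug-in estimate from observed symbol frequencies.
counts = Counter(message)
est_h = -sum((c / len(message)) * math.log2(c / len(message))
             for c in counts.values())

print(round(true_h, 3), round(est_h, 3))  # estimate should be close to 1.5
```

With 100,000 draws the empirical frequencies are close to the true probabilities, so the plug-in estimate lands within a few thousandths of a bit of the true entropy.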
Discrete Markov Process
- A more general notion of a source
- Includes as special cases the kth order processes discussed earlier in
connection with Shannon’s modeling of English text
- Closely related to the concept of finite state machines to be discussed
later in this course
Entropy of a Discrete Markov Process
- Under certain (relatively mild) technical assumptions, for any k > 0 and any X in A^k, where A denotes the set of symbols, the fraction of all sequences of length k in the output that are equal to X converges to a particular number p(X)
- We may then define H_k as

  H_k = (1/k) Σ_{X∈A^k} p(X) log(1/p(X))
- Theorem (Shannon): If a given discrete Markov process satisfies the technical assumptions alluded to above, then its entropy is equal to lim_{k→∞} H_k bits per symbol
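Replacing the limiting probabilities p(X) with observed k-gram frequencies gives a simple empirical estimate of H_k. This sketch (a plug-in estimator over a toy string, not a full treatment of the convergence conditions) shows how larger k exposes structure that the first-order statistics miss:

```python
import math
from collections import Counter

def h_k(text, k):
    """Plug-in estimate of H_k = (1/k) * sum over k-grams X of
    p(X) log2(1/p(X)), using overlapping k-gram frequencies in
    `text` as the probabilities p(X)."""
    grams = [text[i:i + k] for i in range(len(text) - k + 1)]
    n = len(grams)
    counts = Counter(grams)
    return -sum((c / n) * math.log2(c / n) for c in counts.values()) / k

sample = "abababababababab"
print(h_k(sample, 1))  # 'a' and 'b' equally frequent: 1.0 bit/symbol
print(h_k(sample, 2))  # 2-grams reveal the alternation: well below 1 bit
```

For genuinely structured sources (such as English text), H_1 > H_2 > H_3 > … in just this way, which is what the successively smaller figures on the next slide reflect.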
Example: English Text
- Zero-order approximation: log 27 ≈ 4.75 bits per symbol
- First-order approximation: 4.07 bits per symbol
- Second-order approximation: 3.36 bits per symbol
- Third-order approximation: 2.77 bits per symbol
- Approximation based on experiments involving humans: 0.6 to 1.3 bits
per symbol
Entropy as a Measure of Compressibility
- Fundamental Theorem for a Noiseless Channel (Shannon): Let a source have entropy H (bits per symbol) and a channel have capacity C (bits per second). Then it is possible to encode the output of the source in such a way as to transmit at the average rate C/H − ε symbols per second, where ε is arbitrarily small. It is not possible to transmit at an average rate greater than C/H.
- What does this imply regarding how much we can hope to compress a given file containing n symbols, where n is large?
  – Suppose the file content is similar in structure to the output of a source with entropy H
  – Then we cannot hope to encode the file using fewer than about nH bits
  – Furthermore this bound can be achieved to within an arbitrarily small factor
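The nH lower bound can be seen in action with an off-the-shelf compressor. This sketch (using a made-up two-symbol source and Python’s zlib, which is a general-purpose compressor rather than one tuned to this source) compares the compressed size against the bound:

```python
import math
import random
import zlib

# Assumed toy source: i.i.d. symbols 'a' with probability 0.9, 'b' with 0.1.
p = 0.9
h = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))  # ≈ 0.469 bits/symbol

random.seed(1)
n = 100_000
data = bytes(random.choices(b"ab", weights=[p, 1 - p], k=n))

compressed = zlib.compress(data, level=9)

# The theorem says no lossless encoding of a typical output of this source
# can use much fewer than n*h bits; zlib should land above that floor but
# well below the n bytes of the raw file.
print(f"entropy bound: {n * h / 8:.0f} bytes, zlib: {len(compressed)} bytes")
```

zlib does not reach the bound here (it is byte-oriented and knows nothing about the source), but the gap between raw size, compressed size, and nH/8 makes the theorem’s two directions concrete: real compressors can get well under n bytes, and none can get under about nH bits.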