

SLIDE 1

Compression: Other Lossless Compression Algorithms

Greg Plaxton Theory in Programming Practice, Fall 2005 Department of Computer Science University of Texas at Austin

SLIDE 2

LZ78 (Lempel-Ziv)

  • The encoder and decoder each maintain a “dictionary” containing certain words seen previously
    – Initially the dictionary contains only the empty string (in practice it is often initialized to the set of single-symbol words)
    – The algorithm maintains the invariant that the encoder and decoder dictionaries are the same (except the decoder dictionary can lag behind by a word)
    – The encoder communicates a dictionary entry to the decoder by sending an integer index into the dictionary
    – If the dictionary becomes full, a common strategy is to evict the LRU entry


SLIDE 3

LZ78: Outline of a Single Iteration

  • Suppose the encoder has consumed some prefix of the input sequence
  • The encoder now considers successively longer prefixes of the remaining input until it finds the first prefix αx such that α is a word in the dictionary and αx is not a word in the dictionary
  • The word αx is added to the dictionary of the encoder
  • The word αx is communicated to the decoder by transmitting the index i of α and the symbol x
  • The decoder uses its dictionary to map i to α, and then adds the word αx to its dictionary (both sides are sketched in code below)
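
Below is a minimal sketch of both sides of this iteration in Python. It is illustrative only: the function names are hypothetical, the dictionary is an unbounded Python dict (no LRU eviction), and it starts from just the empty string.

```python
def lz78_encode(text):
    """One-pass LZ78 encoding; emits (index of alpha, symbol x) pairs."""
    dictionary = {"": 0}          # initially only the empty string
    output = []
    i = 0
    while i < len(text):
        alpha = ""
        # Extend the prefix while alpha + next symbol is still a known word.
        while i < len(text) and alpha + text[i] in dictionary:
            alpha += text[i]
            i += 1
        if i == len(text):
            # Input ended inside a known word; emit it with no new symbol.
            output.append((dictionary[alpha], None))
            break
        x = text[i]
        i += 1
        output.append((dictionary[alpha], x))      # transmit index of alpha and x
        dictionary[alpha + x] = len(dictionary)    # encoder adds the word alpha x
    return output

def lz78_decode(pairs):
    """Mirrors the encoder: map each index back to alpha, then add alpha x."""
    words = [""]                  # the decoder's dictionary, index -> word
    out = []
    for index, x in pairs:
        out.append(words[index] + (x if x is not None else ""))
        if x is not None:
            words.append(words[index] + x)         # decoder adds the word alpha x
    return "".join(out)

assert lz78_decode(lz78_encode("abababa")) == "abababa"
```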


SLIDE 4

LZ78: Dictionary Data Structure

  • It is common to implement the dictionary as a trie
    – If the set of symbols is, e.g., the 256 possible bytes, then each node of the trie might have an array of length 256 to store its children
    – While fast (linear time), this implementation is somewhat inefficient in terms of space
    – A trick that can achieve a good space-time tradeoff is to store the children of a trie node in a linked list until the number of children is sufficiently large (say 10 or so), and then switch to an array
    – Alternatively, the children of a trie node could be stored in a hash table (see the sketch below)
  • The integers used to represent dictionary entries are indices into an array of pointers into the trie
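
As a concrete illustration, here is a hypothetical skeleton of the hash-table variant in Python (the linked-list-then-array trick is omitted for brevity; all names are illustrative):

```python
class TrieNode:
    def __init__(self, index):
        self.index = index        # integer dictionary index of this word
        self.children = {}        # symbol -> TrieNode (hash-table variant)

class LZ78Dictionary:
    def __init__(self):
        self.root = TrieNode(0)   # index 0 represents the empty string
        self.nodes = [self.root]  # the array of pointers into the trie

    def extend(self, node, symbol):
        """Add the word spelled by node's path plus symbol; return its node."""
        child = TrieNode(len(self.nodes))
        node.children[symbol] = child
        self.nodes.append(child)
        return child
```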


SLIDE 5

LZ Algorithms

  • Quite a few variations of LZ77 and LZ78 have been proposed
  • The LZ algorithms are popular because they run in a single pass, provide good compression, are easy to code, and run quickly
  • Used in popular compression utilities such as compress, gzip, and WinZip


SLIDE 6

Arithmetic Coding

  • Assume an i.i.d. source with alphabet A, where the ith symbol in A has associated probability p_i, 1 ≤ i ≤ n = |A|
  • Map each input string to a subinterval of the real interval [0, 1]
    – Chop up the unit interval based on the first symbol of the string, with the ith symbol assigned to the subinterval [∑_{1≤j<i} p_j, ∑_{1≤j≤i} p_j)
    – Recursively construct the mapping within each subinterval to handle strings of length 2, then 3, et cetera (see the sketch below)
  • The encoder specifies the real interval corresponding to the next fixed-size block of symbols to be sent
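
A sketch of this map in Python, assuming the alphabet and its probabilities are passed as parallel sequences (the helper name is hypothetical):

```python
from itertools import accumulate

def string_interval(s, symbols, probs):
    """Interval of [0, 1) assigned to string s by the recursive chopping."""
    cum = [0.0] + list(accumulate(probs))   # cumulative sums of the p_i
    lo, hi = 0.0, 1.0
    for ch in s:
        i = symbols.index(ch)
        width = hi - lo
        # Subdivide the current interval in proportion to the probabilities.
        lo, hi = lo + width * cum[i], lo + width * cum[i + 1]
    return lo, hi

print(string_interval("ba", "ab", [0.25, 0.75]))  # (0.25, 0.4375) = [1/4, 7/16)
```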


SLIDE 7

Arithmetic Coding: Specifying a Particular Interval

  • To specify an interval, the encoder sends a (variable length) bit string that is itself interpreted as a subinterval of [0, 1]
    – For example, 010 is interpreted as the interval containing all reals with binary expansion of the form .010∗∗∗… where the ∗’s represent don’t cares (0 or 1)
    – Thus 010 corresponds to [1/4, 3/8), 0 corresponds to [0, 1/2), 11 corresponds to [3/4, 1), et cetera
  • Once the decoder has received a bit string that is entirely contained within an interval corresponding to a particular block, it outputs that block and proceeds to the next iteration (a sketch of this correspondence follows)
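
A sketch of this correspondence in Python: bits_to_interval restates the rule above, and shortest_bits_within (a hypothetical helper, used again on the next slide) finds the shortest bit string whose interval fits inside a given target interval:

```python
import math
from fractions import Fraction

def bits_to_interval(bits):
    """All reals whose binary expansion begins with the given bits."""
    lo = Fraction(int(bits, 2) if bits else 0, 2 ** len(bits))
    return lo, lo + Fraction(1, 2 ** len(bits))

def shortest_bits_within(lo, hi):
    """Shortest bit string b with bits_to_interval(b) contained in [lo, hi)."""
    k = 0
    while True:
        m = math.ceil(lo * 2 ** k)          # leftmost dyadic candidate at depth k
        if Fraction(m + 1, 2 ** k) <= hi:   # does [m/2^k, (m+1)/2^k) fit?
            return format(m, "b").zfill(k) if k else ""
        k += 1

print(bits_to_interval("010"))  # (Fraction(1, 4), Fraction(3, 8)), i.e., [1/4, 3/8)
```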


SLIDE 8

Arithmetic Coding: An Example

  • Assume that our alphabet is {a, b}, that each symbol is an a with probability 1/4, and that we wish to encode blocks of two symbols
  • We associate aa with the interval [0, 1/16), ab with [1/16, 1/4), ba with [1/4, 7/16), and bb with [7/16, 1)
  • Thus we can set the codeword for aa to 0000 (since [0, 1/16) ⊆ [0, 1/16)), for ab to 001 (since [1/8, 1/4) ⊆ [1/16, 1/4)), for ba to 010 (since [1/4, 3/8) ⊆ [1/4, 7/16)), and for bb to 1 (since [1/2, 1) ⊆ [7/16, 1))
    – Note that this is a prefix code (why?)
    – We can optimize this code further by contracting away any degree-one internal nodes in the trie representation of the prefix code
    – This optimization yields the codewords 000 for aa, 001 for ab, 01 for ba, and 1 for bb (the unoptimized codewords are reproduced programmatically below)
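
Combining the two hypothetical helpers sketched on the previous slides reproduces the unoptimized codewords:

```python
for block in ["aa", "ab", "ba", "bb"]:
    lo, hi = string_interval(block, "ab", [0.25, 0.75])
    print(block, shortest_bits_within(lo, hi))
# aa 0000, ab 001, ba 010, bb 1 (before the degree-one contraction)
```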


SLIDE 9

Arithmetic Coding: Another Example

  • Consider A = {a, b} where the probability associated with a is close to 1, e.g., 0.99
    – The entropy per symbol is close to zero, so a direct application of Huffman coding performs poorly
    – Even with a block size of 50, arithmetic coding communicates the all-a’s block using only a single bit, since 0.99^50 > 1/2 (checked numerically below)
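
A quick numerical check of the claim, under the convention that a owns the left end of the unit interval, so the all-a’s block maps to [0, 0.99^50):

```python
print(0.99 ** 50)  # about 0.605, which exceeds 1/2
# Hence [0, 1/2), the interval of the single bit "0", fits inside [0, 0.99**50).
```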


SLIDE 10

Arithmetic Coding versus Huffman Coding

  • Why not just use a Huffman code defined over the probability distribution of all strings of the desired block length?
    – This is guaranteed to compress at least as well as arithmetic coding, since both techniques yield prefix codes, and Huffman’s algorithm gives an optimal prefix code
  • Note that the number of strings with the desired block length is typically enormous
    – Thus, computing and representing the Huffman tree is prohibitively expensive
  • The key advantage of arithmetic coding is that there is no need for either the encoder or the decoder to maintain an explicit representation of the entire code
    – Due to the simple structure of the code, the encoder/decoder can encode/decode on the fly


SLIDE 11

Run-Length Coding

  • Another technique that is useful for dealing with certain low-entropy sources
  • The basic idea is to encode a run of length k of the same symbol a as the pair (a, k)
  • The resulting sequence of pairs is then typically coded using some other technique, e.g., Huffman coding (a minimal run-length coder is sketched after this list)
  • Example: FAX protocols
    – Run-length coding converts the document to alternating runs of white and black pixels
    – Run lengths are encoded using a fixed Huffman code that works well on typical documents
    – A long run such as 500 might be coded by passing Huffman codes for 128+, 128+, 128+, 64+, 52
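
A minimal run-length coder in Python (the downstream Huffman stage is omitted; function names are illustrative):

```python
from itertools import groupby

def rle_encode(seq):
    """Collapse each run of a repeated symbol a of length k into (a, k)."""
    return [(sym, len(list(run))) for sym, run in groupby(seq)]

def rle_decode(pairs):
    return "".join(sym * k for sym, k in pairs)

print(rle_encode("wwwwbbw"))  # [('w', 4), ('b', 2), ('w', 1)]
```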


SLIDE 12

Move-To-Front Coding

  • A good technique for dealing with sources where the output favors certain symbols for a while, then favors another set of symbols, et cetera
  • Keep the symbols in a list
  • When a symbol is transmitted, move it to the head of the list
  • Transmit a symbol by indicating its current position (index) in the list
  • The hope is that we will mostly be sending small indices (see the sketch below)
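
A minimal sketch in Python over a fixed alphabet; indices are 1-based to match the index-coding scheme on the next slide (function names are illustrative):

```python
def mtf_encode(text, alphabet):
    order = list(alphabet)
    indices = []
    for ch in text:
        i = order.index(ch)
        indices.append(i + 1)            # transmit the current position
        order.insert(0, order.pop(i))    # move the symbol to the head
    return indices

def mtf_decode(indices, alphabet):
    order = list(alphabet)
    out = []
    for i in indices:
        order.insert(0, order.pop(i - 1))  # mirror the encoder's update
        out.append(order[0])
    return "".join(out)

print(mtf_encode("aaabbbab", "ab"))  # [1, 1, 1, 2, 1, 1, 2, 2]
```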


SLIDE 13

Move-To-Front Coding: Compressing the Index Sequence

  • The sequence of indices can be compressed using another method such as Huffman coding
  • An easy alternative (though perhaps unlikely to give the best performance) is to encode each k-bit index using 2k − 1 bits as follows (sketched in code below)
    – Assume the lowest index is 1; thus k > 0
    – Send (k − 1) 0’s followed by the k-bit index
    – The decoder counts the leading zeros to determine k, then decodes the k-bit index
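
A sketch in Python. This scheme is essentially the Elias gamma code: a k-bit index necessarily begins with a 1 bit, so the leading 0’s unambiguously delimit k:

```python
def encode_index(i):
    """Send (k - 1) 0's followed by the k bits of index i >= 1."""
    bits = format(i, "b")                 # k bits, leading bit is 1
    return "0" * (len(bits) - 1) + bits

def decode_indices(stream):
    out, pos = [], 0
    while pos < len(stream):
        k = 1
        while stream[pos] == "0":         # count leading 0's to determine k
            k += 1
            pos += 1
        out.append(int(stream[pos:pos + k], 2))
        pos += k
    return out

code = "".join(encode_index(i) for i in [1, 1, 2, 5])
print(code, decode_indices(code))  # 1101000101 [1, 1, 2, 5]
```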


SLIDE 14

Prediction by Partial Matching

  • This is essentially the approach that Shannon used in his experiments with English text discussed in an earlier lecture
  • The idea is to maintain, for each string α of some fixed length k, the conditional probability distribution for the symbol that follows the string α (the bookkeeping is sketched below)
  • The encoder specifies the next symbol using some appropriate code, e.g., a Huffman code for the given probability distribution
  • Shannon showed that for a wide class of discrete Markov sources, the performance of this technique approaches the entropy lower bound for k sufficiently large
    – But in practice we cannot afford to use a value of k that is very large, since the number of separate probability distributions to maintain is |A|^k
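
A sketch in Python of the bookkeeping only; turning the counts into, say, adaptive Huffman codes is omitted, and the function name is illustrative:

```python
from collections import Counter, defaultdict

def context_counts(text, k):
    """For each length-k context, count which symbols have followed it."""
    counts = defaultdict(Counter)
    for i in range(k, len(text)):
        counts[text[i - k:i]][text[i]] += 1
    return counts

print(context_counts("the theme then", 2)["th"])  # Counter({'e': 3})
```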


SLIDE 15

Burrows-Wheeler Transform

  • A relatively recent (1994) technique
  • A number of compression algorithms have been proposed that make use of the Burrows-Wheeler transform in combination with other techniques such as arithmetic coding, run-length coding, and move-to-front coding
  • The bzip utility is such an algorithm
    – Typically compresses better than gzip and other LZ-based algorithms


SLIDE 16

Burrows-Wheeler Transform: Abstract View

  • Take the next block of n symbols to be encoded
  • Construct the n strings corresponding to all rotations of the block, numbering them from 0 (say)
  • Sort the resulting n strings
  • Given this sorted list of strings, transmit the position of string 0 (the original block) in the sorted order, together with the sequence of last symbols
  • Symbols with a similar context in the original string are now grouped together, so this sequence can be compressed using other methods
  • A nontrivial insight is that the information transmitted is sufficient for decoding (the forward direction is sketched below)
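
A minimal sketch of the forward transform in Python (the inverse transform and all efficiency concerns are deferred to the next slide; names are illustrative):

```python
def bwt_forward(block):
    n = len(block)
    rotations = [block[i:] + block[:i] for i in range(n)]  # string i = rotation by i
    rows = sorted(range(n), key=lambda i: rotations[i])    # sort the n strings
    index = rows.index(0)                 # where string 0 (the original) landed
    last = "".join(rotations[i][-1] for i in rows)         # sequence of last symbols
    return index, last

print(bwt_forward("banana"))  # (3, 'nnbaaa')
```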


SLIDE 17

Burrows-Wheeler Transform: Implementation

  • The preceding high-level description seems to imply that quadratic space is needed
  • In fact, each of the n rotations of the original string can be represented by a pointer into the original string (see the sketch below)
  • A standard sorting utility can be used, but each comparison could be costly in the worst case (e.g., if all of the symbols in the block are the same)
  • Better worst-case guarantees can be achieved using algorithms specifically designed for suffix sorting
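
A sketch of the pointer-based representation in Python: each rotation is identified by its start index and compared lazily, so no rotated copies are materialized (each comparison is still O(n) in the worst case, as noted above):

```python
from functools import cmp_to_key

def bwt_forward_pointers(block):
    n = len(block)

    def compare(i, j):
        # Compare the rotations starting at positions i and j, symbol by symbol.
        for k in range(n):
            a, b = block[(i + k) % n], block[(j + k) % n]
            if a != b:
                return -1 if a < b else 1
        return 0

    rows = sorted(range(n), key=cmp_to_key(compare))
    # The last symbol of the rotation starting at i is block[i - 1].
    return rows.index(0), "".join(block[(i - 1) % n] for i in rows)

print(bwt_forward_pointers("banana"))  # (3, 'nnbaaa'), matching the sketch above
```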
