
Data structures for decoder: Construction of canonical Huffman



1. Canonical Huffman trees

Goals: a scheme for large alphabets with
• efficient decoding (Huffman)
• efficient coding
• economic use of main memory

    symbol   frequency   Code 1   Code 2   decimal
    a        10          000      000      0
    b        11          001      001      1
    c        12          100      010      2
    d        13          101      011      3
    e        22          01       10       4
    f        23          11       11       5

Code 1: lca(e,b) = 0; Code 2: lca(e,b) = ε.
Code 2 assigns successive integers, going down from the longest codes.

2. A non-Huffman tree of the same cost

[Figure: a tree of the same total cost that Huffman's algorithm does not produce; leaves e, f at depth 2 (numbers 4, 5) and b, c, d, a at depth 3 (numbers 0..3).]

3. General approach (constructing the tree for Code 2)

Let the maximal code length be L, and let n_i = #(leaves) of length i, for i = 1, ..., L (some possibly zero).
• Allocate to length L the numbers 0, ..., n_L − 1, in binary (completed by zeros on the left to length L, as needed). These occupy nodes 0, ..., n_L/2 − 1 (that is, n_L/2 nodes) on level L−1.
• Allocate to length L−1 the numbers n_L/2, ..., n_L/2 + n_{L−1} − 1 (completed by zeros on the left to length L−1, as needed).
• Now nodes 0, ..., (n_L/2 + n_{L−1})/2 − 1 on level L−2 are occupied, and so on, down to level 1.

4. Lemma: the #(nodes) in each level of a Huffman tree is even.
Proof: a parent with a single child is impossible.

5. Canonical Huffman algorithm

• Compute the lengths of the codes and the number n_i of symbols of each length (as for regular Huffman).
• Let L = max length, and set first(L) = 0.

    for i = L-1 downto 1 {
        first(i) = (first(i+1) + n_{i+1}) / 2
        assign to the symbols of length i codes of this length,
        starting at first(i)
    }

Q: What happens when there are no symbols of length i?
Q: Does first(L) = 0 < first(L−1) < … < first(1) always hold?

6. Decoding (assume we start on a new symbol)

    i = 1;
    v = nextbit();           // we have read the first bit
    while (v < first(i)) {   // small codes start at large numbers!
        i++;
        v = 2*v + nextbit();
    }
    /* now v is the code, of length i, of a symbol s;
       s is in position v - first(i) in the block of symbols
       with code length i (positions counted from 0) */

Decoding can be implemented by shift/compare (very efficient).
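Since the slides only sketch the assignment loop, here is a minimal runnable sketch of the code assignment (Python; the name canonical_codes and the tie-breaking order within a length are my assumptions), checked against the a..f example:

    def canonical_codes(lengths):
        """Assign canonical ("Code 2") codewords, given code lengths.

        lengths: dict symbol -> code length (as produced by regular Huffman).
        Returns: dict symbol -> codeword (a string of '0'/'1').
        """
        L = max(lengths.values())
        n = [0] * (L + 1)                  # n[i] = #(symbols) of length i
        for l in lengths.values():
            n[l] += 1
        first = [0] * (L + 1)              # first(L) = 0
        for i in range(L - 1, 0, -1):
            # the division is exact: #(nodes) per level is even (lemma)
            first[i] = (first[i + 1] + n[i + 1]) // 2
        codes, nxt = {}, first[:]          # next free number per length
        for s in sorted(lengths, key=lambda s: (lengths[s], s)):
            i = lengths[s]
            codes[s] = format(nxt[i], "0%db" % i)  # pad zeros on the left
            nxt[i] += 1
        return codes

    # lengths 3,3,3,3,2,2 for a..f reproduce "Code 2" from the table
    print(canonical_codes({"a": 3, "b": 3, "c": 3, "d": 3, "e": 2, "f": 2}))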

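A matching sketch of the shift/compare decoder. It uses the first(i) array together with per-length symbol arrays, which are exactly the decoder data structures described in the next slide; treating the input as a plain string of '0'/'1' characters is my simplification.

    def decoder_tables(lengths):
        """Build first(i) and S(i): symbols of length i, ordered by code value."""
        L = max(lengths.values())
        n = [0] * (L + 1)
        for l in lengths.values():
            n[l] += 1
        first = [0] * (L + 1)
        for i in range(L - 1, 0, -1):
            first[i] = (first[i + 1] + n[i + 1]) // 2
        S = [[] for _ in range(L + 1)]
        for s in sorted(lengths, key=lambda s: (lengths[s], s)):
            S[lengths[s]].append(s)
        return first, S

    def decode(bits, first, S):
        """Decode a '0'/'1' string symbol by symbol, as on the slide."""
        out, nextbit = [], iter(bits)
        try:
            while True:
                i, v = 1, int(next(nextbit))   # we have read the first bit
                while v < first[i]:            # small codes start at large numbers!
                    i += 1
                    v = 2 * v + int(next(nextbit))
                out.append(S[i][v - first[i]]) # position v - first(i) in block i
        except StopIteration:                  # input exhausted
            return "".join(out)

    first, S = decoder_tables({"a": 3, "b": 3, "c": 3, "d": 3, "e": 2, "f": 2})
    print(decode("00010001", first, S))        # 000|10|001 -> "aeb"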
7. Data structures for the decoder

• The array first(i).
• Arrays S(i) of the symbols with code length i, ordered by their code value (v − first(i) is the index that gets the symbol for code v).

Thus decoding uses efficient arithmetic operations plus array look-up, which is more efficient than storing a tree and traversing pointers.

What about coding (for large alphabets, where the symbols are words or blocks)? The problem: millions of symbols → a large Huffman tree, …

8. Construction of canonical Huffman (sketch)

Assumption: we have the symbol frequencies.
Input: a sequence of (symbol, freq) pairs.
Output: a sequence of (symbol, length) pairs.
Idea: use a single array to represent both the heap used for creating the tree and the resulting tree and lengths. We illustrate by example; a simpler heap-based sketch of the same computation appears after slide 12 below.

9. Example: frequencies 2, 8, 11, 12 (each cell holding a frequency also holds its symbol, not shown).

    cells:     1   2   3   4   5   6   7   8
    initially: #5  #8  #7  #6  2   12  11  8

Cells 1..4 form a heap of pointers, ordered by the frequencies they point to; cells 5..8 hold the frequencies. Now the representatives of 2 and 8 (the two smallest) go out of the heap, and the rest percolate. The sum 10 is put into cell 4, and its representative into cell 3; cell 4 is the parent ("sum") of cells 5 and 8.

10. After one more step, 10 + 11 = 21 is put into cell 3. Finally, cell 2 holds the total 33, and the array is a representation of the Huffman tree by parent pointers:

    cells 2..8: 33  #2  #3  #4  #2  #3  #4

Next, going i = 2 to 8, lengths are assigned: each cell receives the depth of its parent plus one (here shown after i = 4):

    cells 2..8: 0   1   2   #4  #2  #3  #4

11. Summary

• Insertion of the (symbol, freq) pairs into the array – O(n)
• Creation of the heap – kn log n
• Creating the tree from the heap: each step is k log n, so the total is kn log n
• Computing the lengths – O(n)
• Storage requirements: 2n cells (compare to a pointer-based tree!)

12. Entropy H is a lower bound on compression. How can one still improve?

Huffman works for given frequencies, e.g., those of the English language – static modeling.
Plus: no need to store the model in the coder/decoder.
But one can instead construct a frequency table for each file – semi-static modeling.
Minus:
– the model must be stored in the compressed file (negligible for large files)
– compression takes more time
Plus: it may provide better compression.
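The in-place 2n-cell construction of slides 8 to 10 is intricate; as a cross-check, here is a simpler sketch that computes the same (symbol, length) output using Python's heapq instead of the shared array. The eager depth propagation (deepening every leaf under a merged node on the spot) is my substitution for the slides' second pass over parent pointers, but it yields the same lengths:

    import heapq

    def huffman_lengths(freqs):
        """(symbol, freq) pairs in -> (symbol, code length) pairs out.

        Sketch: heapq instead of the slides' in-place 2n-cell array.
        Each heap entry carries the set of leaves below it, so a merge
        can deepen all of them by one level immediately.
        """
        heap = [(f, i, {s}) for i, (s, f) in enumerate(freqs)]
        heapq.heapify(heap)
        depth = {s: 0 for s, _ in freqs}
        tie = len(heap)                     # tie-breaker so sets never compare
        while len(heap) > 1:
            f1, _, s1 = heapq.heappop(heap) # the two smallest go out ...
            f2, _, s2 = heapq.heappop(heap)
            for s in s1 | s2:               # ... every leaf below the new
                depth[s] += 1               #     parent is one level deeper
            heapq.heappush(heap, (f1 + f2, tie, s1 | s2))  # the sum returns
            tie += 1
        return sorted(depth.items())

    # the example: frequencies 2, 8, 11, 12 -> lengths 3, 3, 2, 1
    print(huffman_lengths([("p", 2), ("q", 8), ("r", 11), ("s", 12)]))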

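To make the entropy bound of slide 12 concrete, a small sketch (the comparison uses the example frequencies 2, 8, 11, 12 and the Huffman lengths derived above; both are from the slides, the code is mine):

    from math import log2

    def entropy(freqs):
        """H = -sum p_i * log2(p_i): a lower bound on the average
        number of bits per symbol for any symbol-by-symbol code."""
        total = sum(freqs)
        return -sum(f / total * log2(f / total) for f in freqs)

    freqs = [2, 8, 11, 12]             # frequencies from the example above
    lengths = [3, 3, 2, 1]             # their Huffman code lengths
    avg = sum(f * l for f, l in zip(freqs, lengths)) / sum(freqs)
    print(entropy(freqs), avg)         # H ~ 1.80 bits vs 64/33 ~ 1.94 bits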
13. A third option: start compressing with default frequencies, and update them as coding proceeds – adaptive modeling. After reading a symbol:
– compress it
– update the frequency table*

Decoding must use precisely the same algorithm for updating the frequencies, so it can follow the coding.
Plus:
• the model need not be stored
• it may provide compression that adapts to the file, including local changes of frequencies
Minus: less efficient than the previous models.

* A sliding window may be used, to better reflect local changes.

14. Adaptive Huffman

• Constructing the Huffman tree anew after each symbol: O(n)
• Incremental adaptation in O(log n) is possible
Both are too expensive for practical use (large alphabets). We illustrate adaptivity with arithmetic coding (soon).

15. Higher-order modeling: use of context

E.g., for each block of 2 letters, construct a frequency table for the next letter (2nd-order compression). This uses conditional probabilities, hence the improvement. It too can be static, semi-static, or adaptive.

16. Arithmetic coding

Can be static, semi-static, or adaptive. Basic idea:
• Coder: start with the interval [0,1).
• The 1st symbol selects a sub-interval, based on its probability.
• The i'th symbol selects a sub-interval of the (i−1)'th interval, based on its probability.
• When the file ends, store a number lying in the final interval.
• Decoder: reads the number and reconstructs the sequence of intervals, i.e., of symbols.
• Important: the length of the file is stored at the beginning of the compressed file (otherwise the decoder does not know when to stop).

17. Example (static): a ~ 3/4, b ~ 1/4; the file to be compressed is aaaba.

The sequence of intervals (with the symbol creating each):

    [0,1) →a [0,3/4) →a [0,9/16) →a [0,27/64)
          →b [81/256,108/256) →a [324/1024,405/1024)

Assuming this is the end, we store:
• 5 – the length of the file
• any number in the final interval, say 0.011 (3 binary digits)
(For a large file, the length field is negligible.) A worked sketch of exactly this computation follows slide 18 below.

18. Why is this a good approach in general?

For a symbol with large probability, the number of binary digits needed to represent one occurrence is smaller than 1, which leads to poor compression with Huffman (which spends at least one bit per symbol). Arithmetic coding represents such a symbol by a small shrinkage of the interval, so the extra number of digits is indeed smaller than 1. Consider the example above: after the first 3 a's, one digit suffices!
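The interval arithmetic of slide 17 is easy to reproduce exactly with rationals. A minimal sketch; the cumulative-bounds table cum is my encoding of the static model a ~ 3/4, b ~ 1/4:

    from fractions import Fraction as F

    def encode_interval(text, cum):
        """Shrink [0,1) symbol by symbol; cum maps each symbol to its
        (low, high) cumulative-probability bounds within an interval."""
        lo, hi = F(0), F(1)
        for ch in text:
            width = hi - lo
            c_lo, c_hi = cum[ch]
            lo, hi = lo + width * c_lo, lo + width * c_hi
        return lo, hi

    cum = {"a": (F(0), F(3, 4)), "b": (F(3, 4), F(1))}
    lo, hi = encode_interval("aaaba", cum)
    print(lo, hi)              # 81/256 (= 324/1024) and 405/1024, as on the slide
    print(lo <= F(3, 8) < hi)  # 0.011 in binary is 3/8: inside -> True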

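And a corresponding decoder sketch: given the stored length 5 and a number in the final interval (0.011 in binary, i.e., 3/8), it recovers the file by replaying the same subdivision. The linear scan over cum is a simplification of mine:

    from fractions import Fraction as F

    def decode_interval(x, n, cum):
        """Recover n symbols from a number x in the final interval by
        replaying the same interval subdivision as the coder."""
        lo, hi, out = F(0), F(1), []
        for _ in range(n):
            width = hi - lo
            for ch, (c_lo, c_hi) in cum.items():   # find the sub-interval
                if lo + width * c_lo <= x < lo + width * c_hi:
                    lo, hi = lo + width * c_lo, lo + width * c_hi
                    out.append(ch)
                    break
        return "".join(out)

    cum = {"a": (F(0), F(3, 4)), "b": (F(3, 4), F(1))}  # same model as above
    print(decode_interval(F(3, 8), 5, cum))  # length 5, number 3/8 -> "aaaba"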