
Sign In

• Website: https://goo.gl/forms/FzHSa5INKlavWIJC3
• Enter the word of the day in the appropriate slot.

Lecture #4: Simple Compression: Huffman Trees

• Strings are composed of characters, which (like everything else in a computer) are represented as bit strings.
• The relationship between characters and their bit representations (encodings or code points) is arbitrary. Standardization is necessary to prevent chaos.
• Python now uses an international standard known as Unicode, which encodes (as of Version 9.0) 128,237 characters, using code points that range from 0–1,114,111.
• These cover 135 scripts (roughly, alphabets), and various sets of symbols: punctuation, control characters (like tab or newline), mathematical symbols, etc.
• A few examples:

    Literal     Glyph     Literal     Glyph     Literal     Glyph
    "\u0041"    A         "\u00A7"    §         "\u0398"    Θ
    "\u0061"    a         "\u00A9"    ©         "\u2663"    ♣
    "\u0030"    0         "\u00E9"    é         "\u2639"    ☹
    "\u0040"    @         "\u05D0"    א

Last modified: Fri Mar 10 17:54:45 2017. CS198: Extra Lecture #4.

More Efficient Encoding

• If every character in a text were represented by an integer value in the full range, we'd need 3 bytes (24 bits) per character.
• So usually, the code points themselves are encoded.
• One common encoding, UTF-8, uses 1–4 bytes per character, depending on the number of significant bits in the code point.

    Bits coded   Range of code points     Byte 1     Byte 2     Byte 3     Byte 4
    7            0x0000 .. 0x007F         0xxxxxxx
    11           0x0080 .. 0x07FF         110xxxxx   10xxxxxx
    16           0x0800 .. 0xFFFF         1110xxxx   10xxxxxx   10xxxxxx
    21           0x10000 .. 0x10FFFF      11110xxx   10xxxxxx   10xxxxxx   10xxxxxx

• x's mark places containing the bits of the code points. The other bits flag how many bytes are needed.
• Where one-byte characters are common, this saves space.
• One clever feature is that bytes 2–4 (continuation bytes) all start with a distinctive pattern (10), so that starting at any byte in an array of bytes, one can find the beginning of a character.

Still More Efficient

• We can, however, do better still by using other variable-length encodings that can use less than a byte per character.
• There's a potential problem with this idea, however: ambiguity.
• Suppose we tried an encoding like this, using shorter codes for more common letters:

    E => 0, T => 1, A => 10, O => 11, I => 100, ...

• And suppose we receive the bits 100.
• Is this "TEE", "AE", or "I"? Where does one letter end and the next begin?

Unique Prefix Property

• This ambiguity problem can be solved by choosing a code with the

    Unique Prefix Property: The bit encoding for any character is never a prefix of the encoding of any other character.

• For example, the encoding

    E => 0, T => 10, A => 1101, O => 1100, I => 1110, ...

  has this property (at least for the characters shown). No encoding appears at the beginning of any other.
• E.g., "TEE" encodes to 1000, "AE" to 11010, and "I" to 1110.
• There is never any ambiguity about where a character begins, if one works from the left.
• Starting from a given bit position, p, as soon as one collects bits that match the encoding of character C, we know that C has to be the character that starts at p, since adding more bits can never match another character.

Decoding Using the Unique Prefix Property

• Given a bit encoding with the unique prefix property, how do we decode?
• The discussion on the previous slide gives one solution, using a dictionary to map encodings to characters.
• For simplicity, imagine our encoded text as a string of 0s and 1s (not a representation you'd actually use in practice!).
• Suppose D is a dictionary from such strings of 0s and 1s to characters. Then,

    def decode(msg):
        """Convert encoded message MSG into the character string it represents."""
        ch = ""        # Bits collected so far for the current character
        result = ""
        for b in msg:
            ch += b
            if ch in D:
                # By the unique prefix property, this match is the only possible one.
                result += D[ch]
                ch = ""
        return result
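The UTF-8 byte layout tabulated under "More Efficient Encoding" can be sketched directly in Python. The function name utf8_encode is hypothetical (it is not from the lecture), and this sketch assumes a valid code point; it does not reject surrogates or out-of-range values. It is checked against Python's built-in str.encode.

```python
def utf8_encode(code_point):
    """Encode one Unicode code point as UTF-8 bytes, following the table.

    Hypothetical helper for illustration; assumes 0 <= code_point <= 0x10FFFF
    and does no validation (e.g., of surrogates).
    """
    if code_point <= 0x7F:                      # 7 bits: 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:                     # 11 bits: 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (code_point >> 6),
                      0b10000000 | (code_point & 0b111111)])
    if code_point <= 0xFFFF:                    # 16 bits: 1110xxxx + 2 continuation bytes
        return bytes([0b11100000 | (code_point >> 12),
                      0b10000000 | ((code_point >> 6) & 0b111111),
                      0b10000000 | (code_point & 0b111111)])
    return bytes([0b11110000 | (code_point >> 18),   # 21 bits: 11110xxx + 3 continuation bytes
                  0b10000000 | ((code_point >> 12) & 0b111111),
                  0b10000000 | ((code_point >> 6) & 0b111111),
                  0b10000000 | (code_point & 0b111111)])


# Compare with the built-in encoder on characters of each length class.
for ch in "A\u00E9\u2663\U0001F600":
    assert utf8_encode(ord(ch)) == ch.encode("utf-8")
    print(repr(ch), len(utf8_encode(ord(ch))), "byte(s)")
```

Each continuation byte carries six payload bits, which is why the ranges in the table grow from 7 to 11, 16, and 21 bits.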

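To make the ambiguity argument concrete, here is a minimal sketch that tests a code for the unique prefix property and lists every way a bit string can be split into code words. The names has_unique_prefix_property and parses are hypothetical helpers, not part of the lecture; the two code tables are the ones given above.

```python
def has_unique_prefix_property(codes):
    """True if no code word in CODES (a dict char -> bit string) is a
    prefix of another code word."""
    words = sorted(codes.values())
    # In sorted order, a word that is a prefix of any other word is also
    # a prefix of its immediate successor, so adjacent pairs suffice.
    return all(not b.startswith(a) for a, b in zip(words, words[1:]))


def parses(bits, codes):
    """Return every string whose encoding under CODES is exactly BITS.
    Assumes no code word is empty."""
    if bits == "":
        return [""]
    results = []
    for ch, word in codes.items():
        if bits.startswith(word):
            results += [ch + rest for rest in parses(bits[len(word):], codes)]
    return results


ambiguous = {"E": "0", "T": "1", "A": "10", "O": "11", "I": "100"}
prefix_free = {"E": "0", "T": "10", "A": "1101", "O": "1100", "I": "1110"}

print(has_unique_prefix_property(ambiguous))    # False
print(sorted(parses("100", ambiguous)))         # ['AE', 'I', 'TEE']
print(has_unique_prefix_property(prefix_free))  # True
print(parses("1000", prefix_free))              # ['TEE']
```

With the first table, the bits 100 have three readings; with the second, every bit string has at most one.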
Using Trees

• Binary trees offer a particular way to represent the dictionary from the last slide.

    Letter   Encoding
    A        00
    B        01
    C        100
    D        101
    E        1100
    F        1101

  [Figure: the corresponding binary tree, with leaves A and B two levels down on the left, C and D three levels down, and E and F four levels down on the right.]

• Left branches tell what to do when looking at a 0 bit; right branches do the same for 1 bits (the result is called a Patricia tree).
• To decode, e.g., 1101001011100,
  – Following bits 1101 (right, right, left, right) takes us to leaf 'F'.
  – Returning to the top, 00 takes us to 'A'.
  – Again from the top, 101 takes us to 'D'.
  – Finally, 1100 gives 'E'. Complete decoding: "FADE".

A Problem

• How, then, do we get an encoding that
  – Minimizes the size of a text, and
  – Satisfies the unique prefix property (so that it can be decoded unambiguously)?
• There is no universal encoding that does this for any text.
• We'd like an algorithm that finds a custom-made optimal encoding for any particular text.
• The idea is to encode more common characters in fewer bits.

Huffman Coding

• Huffman coding is named after an MIT student who invented this encoding in response to a class assignment.
• Given an alphabet of symbols to be encoded, with their relative frequencies in a text, it produces the optimal variable-width unique-prefix encoding, assuming that we encode individual characters independently.
• The basic idea is to accumulate trees representing subsets of characters from the bottom up, starting with trivial trees each containing a single character.
• Each time two trees are clustered into one under a new parent node, it represents an additional bit in the coding, so it is best to prefer clustering trees that represent characters with smallest frequency.

Example

• Want to encode the string "AAAAAAAAAABBBBBCCCCCCCDDDDDDDDDEEEF".
• Here, the frequencies are

    Letter   Count
    A        10
    B        5
    C        7
    D        9
    E        3
    F        1

• Represent as 6 one-node trees labeled with letters and their frequencies:

    F/1   E/3   B/5   C/7   D/9   A/10

Forming Subtrees

In the trees on these slides, /n[L, R] stands for an unlabeled parent node with total frequency n, left subtree L, and right subtree R.

• Starting with

    F/1   E/3   B/5   C/7   D/9   A/10

• We combine the two nodes with the smallest frequencies to get a "bigger" node representing both the characters E and F:

    /4[F/1, E/3]   B/5   C/7   D/9   A/10

• Keeping the resulting trees in order by frequency, repeat:

    C/7   /9[/4[F/1, E/3], B/5]   D/9   A/10

Forming Subtrees (II)

• And again:

    D/9   A/10   /16[C/7, /9[/4[F/1, E/3], B/5]]
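The merging procedure on the "Forming Subtrees" slides can be sketched as a loop over a list of trees kept in order by frequency. The name merge_steps and the tuple representation are assumptions made for this illustration. One further assumption: a newly merged tree is placed before existing trees of equal frequency, which is what the slides appear to do when the new /9 tree lands ahead of D/9.

```python
def merge_steps(freqs):
    """Yield the ordered forest after each greedy merge.

    FREQS maps letters to counts. A tree is a (weight, label) pair, where
    label is a letter or a (left, right) pair of labels. Hypothetical
    helper for illustration only.
    """
    forest = sorted((w, ch) for ch, w in freqs.items())
    yield list(forest)
    while len(forest) > 1:
        (w1, t1), (w2, t2) = forest[0], forest[1]   # two smallest trees
        merged = (w1 + w2, (t1, t2))
        rest = forest[2:]
        # Insert the new tree before any tree of equal or larger weight.
        i = 0
        while i < len(rest) and rest[i][0] < merged[0]:
            i += 1
        forest = rest[:i] + [merged] + rest[i:]
        yield list(forest)


freqs = {"A": 10, "B": 5, "C": 7, "D": 9, "E": 3, "F": 1}
for forest in merge_steps(freqs):
    print([w for w, _ in forest])
```

For the example frequencies, the weights go from 1, 3, 5, 7, 9, 10 down to a single tree of weight 35, the length of the text.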

Forming Subtrees (III)

Here /n[L, R] stands for an unlabeled parent node with total frequency n, left subtree L, and right subtree R.

• And yet again:

    /16[C/7, /9[/4[F/1, E/3], B/5]]   /19[D/9, A/10]

Forming Subtrees (IV)

• Finally, we get the tree

    /35[/16[C/7, /9[/4[F/1, E/3], B/5]], /19[D/9, A/10]]

  which corresponds to the following encoding table (0 for a left branch, 1 for a right branch):

    Letter   Encoding
    A        11
    B        011
    C        00
    D        10
    E        0101
    F        0100

• So the string "AAAAAAAAAABBBBBCCCCCCCDDDDDDDDDEEEF" encodes as
  "11111111111111111111011011011011011000000000000001010101010101010100101010101010100",
  which is 83 bits (10·2 + 5·3 + 7·2 + 9·2 + 3·4 + 1·4), as opposed to 94 with our previous unique-prefix encoding from the "Using Trees" slide, and 280 using UTF-8 and Unicode.
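As a check on the final table and bit count, the following sketch builds the tree for the example string, reads off the codes, and decodes by walking the tree from the root. The function names and the tuple-based tree representation are assumptions for illustration, and ties are broken by placing a new tree before existing trees of equal frequency (a different tie-break could give different codes with the same total length). A production implementation would more likely use a priority queue such as heapq.

```python
from collections import Counter


def huffman_tree(text):
    """Return a Huffman tree for TEXT: a letter, or a (left, right) pair."""
    forest = sorted((w, ch) for ch, w in Counter(text).items())
    while len(forest) > 1:
        (w1, t1), (w2, t2) = forest[0], forest[1]   # two smallest trees
        merged = (w1 + w2, (t1, t2))
        rest = forest[2:]
        i = 0
        while i < len(rest) and rest[i][0] < merged[0]:
            i += 1
        forest = rest[:i] + [merged] + rest[i:]
    return forest[0][1]


def code_table(tree, prefix=""):
    """Map each letter to its bit string: 0 for left, 1 for right.
    Note: a one-letter alphabet would get the empty code here."""
    if isinstance(tree, str):
        return {tree: prefix}
    left, right = tree
    table = code_table(left, prefix + "0")
    table.update(code_table(right, prefix + "1"))
    return table


def decode_with_tree(bits, tree):
    """Decode BITS by walking TREE, restarting at the root after each leaf."""
    result, node = "", tree
    for bit in bits:
        node = node[int(bit)]
        if isinstance(node, str):
            result += node
            node = tree
    return result


text = "A" * 10 + "B" * 5 + "C" * 7 + "D" * 9 + "E" * 3 + "F"
tree = huffman_tree(text)
codes = code_table(tree)
encoded = "".join(codes[ch] for ch in text)
print(codes)
print(len(encoded), "bits;", 8 * len(text), "bits at one byte per character")
assert decode_with_tree(encoded, tree) == text
```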
