Sign In
• Website: https://goo.gl/forms/FzHSa5INKlavWIJC3
• Enter the word of the day in the appropriate slot.

Last modified: Fri Mar 10 17:54:45 2017 CS198: Extra Lecture #4

Lecture #4: Simple Compression: Huffman Trees
• Strings are composed of characters, which (like everything else in a computer) are represented as bit strings.
• The relationship between characters and their bit representations (encodings or code points) is arbitrary. Standardization is necessary to prevent chaos.
• Python now uses an international standard known as Unicode, which encodes (as of Version 9.0) 128,237 characters, using code points that range from 0–1,114,111.
• These cover 135 scripts (roughly, alphabets), and various sets of symbols: punctuation, control characters (like tab or newline), mathematical symbols, etc.
• A few examples:

    Glyph  Encoding    Glyph  Encoding    Glyph  Encoding
    A      "\u0041"    §      "\u00A7"    Θ      "\u0398"
    a      "\u0061"    ©      "\u00A9"    ♣      "\u2663"
    0      "\u0030"    é      "\u00E9"    ☹      "\u2639"
    @      "\u0040"    א      "\u05D0"
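The correspondence between characters and code points can be checked directly in Python 3. This is a small sketch; the sample characters are taken from the examples above, and the variable name `samples` is only illustrative.

```python
# Each character corresponds to an integer code point;
# ord and chr convert between the two.
samples = ["A", "a", "0", "@", "\u00A7", "\u00E9", "\u0398", "\u2663"]
for ch in samples:
    # !a prints an ASCII-safe form of the character.
    print(f"{ch!a}: code point {ord(ch)} = U+{ord(ch):04X}")

# A "\uXXXX" escape in a string literal names the same character.
assert "\u0041" == "A" and chr(0x41) == "A"

# The largest valid code point is 0x10FFFF, i.e. 1,114,111.
assert 0x10FFFF == 1114111
```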

More Efficient Encoding
• If every character in a text is represented by an integer value in the full range, we’d have 3 bytes (24 bits) per character.
• So usually, the code points themselves are encoded.
• One common encoding, UTF-8, uses 1–4 bytes per character, depending on the number of significant bits in the code point.

    Bits coded  Range of code points   Byte 1    Byte 2    Byte 3    Byte 4
    7           0x0000 .. 0x007F       0xxxxxxx
    11          0x0080 .. 0x07FF       110xxxxx  10xxxxxx
    16          0x0800 .. 0xFFFF       1110xxxx  10xxxxxx  10xxxxxx
    21          0x10000 .. 0x10FFFF    11110xxx  10xxxxxx  10xxxxxx  10xxxxxx

• x’s mark places containing the bits of the code points. The other bits flag how many bytes are needed.
• Where one-byte characters are common, this saves space.
• One clever feature is that bytes 2–4 (continuation bytes) all start with a distinctive pattern (10), so that if one starts at any byte in an array of bytes, one can find the beginning of the character.
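The byte patterns in the table can be turned into a short encoder. The sketch below is illustrative only: the names `utf8_encode` and `char_start` are made up here, and the code does no validation (for example, it does not reject the surrogate range that real UTF-8 encoders refuse). It is checked against Python's built-in `str.encode`.

```python
def utf8_encode(cp):
    """Encode one code point CP as UTF-8 bytes, following the table."""
    if cp < 0x80:            # 7 bits:  0xxxxxxx
        return bytes([cp])
    elif cp < 0x800:         # 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    elif cp < 0x10000:       # 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    else:                    # 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])

def char_start(data, i):
    """Return the index of the first byte of the character containing data[i].
    Continuation bytes are exactly those of the form 10xxxxxx."""
    while data[i] & 0xC0 == 0x80:
        i -= 1
    return i

for ch in ["A", "\u00E9", "\u2663", "\U0001F600"]:
    encoded = utf8_encode(ord(ch))
    assert encoded == ch.encode("utf-8")   # agrees with the built-in encoder
    print(f"U+{ord(ch):04X}: {len(encoded)} byte(s), {encoded.hex()}")
```

Because every continuation byte starts with the bits 10, `char_start` only needs to step backwards until it sees a byte that does not.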

Still More Efficient
• We can, however, do better still by using other variable-length encodings that can use less than a byte per character.
• There’s a potential problem with this idea, however: ambiguity.
• Suppose we tried an encoding like this, using shorter codes for more common letters:
    E => 0, T => 1, A => 10, O => 11, I => 100, ...
• And suppose we receive the bits 100.
• Is this “TEE”, “AE”, or “I”? Where does one letter end and the next begin?
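The ambiguity can be made concrete by listing every way of splitting a bit string into code words. This is a minimal sketch using only the five letters shown on the slide; the function name `parses` is an illustrative choice.

```python
# The ambiguous encoding from the slide (only the letters shown there).
CODE = {"E": "0", "T": "1", "A": "10", "O": "11", "I": "100"}

def parses(bits):
    """Return every string whose encoding under CODE is exactly BITS."""
    if bits == "":
        return [""]
    results = []
    for ch, word in CODE.items():
        if bits.startswith(word):
            # Use WORD for the first letter, then parse what is left.
            results += [ch + rest for rest in parses(bits[len(word):])]
    return results

print(sorted(parses("100")))   # ['AE', 'I', 'TEE']
```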

Unique Prefix Property
• This ambiguity problem can be solved by choosing a code with the Unique Prefix Property: the bit encoding for any character is never a prefix of the encoding of any other character.
• For example, the encoding
    E => 0, T => 10, A => 1101, O => 1100, I => 1110, ...
  has this property (at least for the characters shown). No encoding appears at the beginning of any other.
• E.g., “TEE” encodes to 1000, “AE” to 11010, and “I” to 1110.
• There is never any ambiguity about where a character begins, if one works from the left.
• Starting from a given bit position, p, as soon as one collects bits that match the encoding of character C, we know that C has to be the character that starts at p, since adding more bits can never match another character.
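The property is easy to test mechanically. The sketch below compares every pair of code words; the names `has_unique_prefix_property`, `ambiguous`, `prefix_free`, and `encode` are illustrative, and both tables contain only the letters shown on the slides.

```python
def has_unique_prefix_property(code):
    """True if no code word in CODE (a dict from characters to bit strings)
    is a prefix of the code word of a different character."""
    words = list(code.values())
    return not any(i != j and w2.startswith(w1)
                   for i, w1 in enumerate(words)
                   for j, w2 in enumerate(words))

def encode(text, code):
    """Concatenate the code words for the characters of TEXT."""
    return "".join(code[ch] for ch in text)

ambiguous = {"E": "0", "T": "1", "A": "10", "O": "11", "I": "100"}
prefix_free = {"E": "0", "T": "10", "A": "1101", "O": "1100", "I": "1110"}

print(has_unique_prefix_property(ambiguous))     # False
print(has_unique_prefix_property(prefix_free))   # True
print(encode("TEE", prefix_free))                # 1000
```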

Decoding Using the Unique Prefix Property
• Given a bit encoding with the unique prefix property, how do we decode?
• The discussion in the previous slide gives one solution, using a dictionary to map encodings to characters.
• For simplicity, imagine our encoded text as a string of 0s and 1s (not a representation you’d actually use in practice!).
• Suppose D is a dictionary from such strings of 0s and 1s to characters. Then,

    def decode(msg):
        """Convert encoded message MSG into the character string it represents."""
        ch = ""
        result = ""
        for b in msg:
            # Collect bits until they match some character's encoding.
            ch += b
            if ch in D:
                result += D[ch]
                ch = ""
        return result
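A self-contained variant can be run as follows. It departs from the slide's version in two ways that are assumptions of this sketch: the dictionary is passed in as a parameter rather than read from a global, and leftover bits that match no character raise an error. The dictionary is built by inverting the prefix-free table from the previous slide.

```python
# Prefix-free table from the previous slide (only the letters shown there).
code = {"E": "0", "T": "10", "A": "1101", "O": "1100", "I": "1110"}

# Invert it: map each bit string back to its character.
D = {bits: ch for ch, bits in code.items()}

def decode(msg, D):
    """Convert encoded message MSG (a string of 0s and 1s) into the
    character string it represents, using dictionary D."""
    ch = ""
    result = ""
    for b in msg:
        ch += b                  # collect bits ...
        if ch in D:              # ... until they match an encoding
            result += D[ch]
            ch = ""
    if ch:
        raise ValueError("trailing bits do not form a complete character: " + ch)
    return result

encoded = "".join(code[c] for c in "TEA")
print(encoded)             # 1001101
print(decode(encoded, D))  # TEA
```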

Using Trees
• Binary trees offer a particular way to represent the dictionary from the last slide.

    Letter  Encoding
    A       00
    B       01
    C       100
    D       101
    E       1100
    F       1101

    [Figure: the binary tree for this table, with leaves A, B, C, D, E, F at the positions given by their encodings.]

• Left branches tell what to do when looking at a 0 bit; right branches do the same for 1 bits (the result is called a Patricia tree).
• To decode, e.g., 1101001011100,
    – Following bits 1101 (right, right, left, right) takes us to leaf ‘F’.
    – Returning to the top, 00 takes us to ‘A’.
    – Again from the top, 101 takes us to ‘D’.
    – Finally, 1100 gives ‘E’. Complete decoding: “FADE”.
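The walk described above can be sketched with nested tuples standing in for tree nodes. The representation is an assumption made for illustration: an internal node is a pair `(left, right)`, a leaf is a one-character string, and `None` marks a branch with no encoding (here, the bits 111).

```python
# Internal node: (subtree for bit 0, subtree for bit 1); leaf: a character.
TREE = (("A", "B"),                    # 00 -> A, 01 -> B
        (("C", "D"),                   # 100 -> C, 101 -> D
         (("E", "F"), None)))          # 1100 -> E, 1101 -> F, 111 unused

def decode_with_tree(bits, tree):
    """Decode BITS by walking TREE: go left on 0, right on 1, and return
    to the root each time a leaf is reached."""
    result = ""
    node = tree
    for b in bits:
        node = node[int(b)]
        if node is None:
            raise ValueError("bits do not correspond to any character")
        if isinstance(node, str):      # reached a leaf
            result += node
            node = tree
    if node is not tree:
        raise ValueError("message ends in the middle of a character")
    return result

print(decode_with_tree("1101001011100", TREE))   # FADE
```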

A Problem
• How, then, do we get an encoding that
    – Minimizes the size of a text, and
    – Satisfies the unique prefix property (so that it can be decoded unambiguously)?
• There is no universal encoding that does this for any text.
• We’d like an algorithm that finds a custom-made optimal encoding for any particular text.
• The idea is to encode more common characters in fewer bits.

Huffman Coding
• Huffman coding is named after an MIT student who invented this encoding in response to a class assignment.
• Given an alphabet of symbols to be encoded, with their relative frequencies in a text, it produces the optimal variable-width unique-prefix encoding, assuming that we encode individual characters independently.
• The basic idea is to accumulate trees representing subsets of characters from the bottom up, starting with trivial trees each containing a single character.
• Each time two trees are clustered into one under a new parent node, it represents an additional bit in the coding, so it is best to prefer clustering trees that represent characters with the smallest frequency.
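The bottom-up clustering can be sketched with a priority queue from Python's `heapq` module. The function names, the tuple representation of trees, and the tie-breaking counter are all choices made for this sketch, not part of the slides. When several trees have equal frequency, a different tie-breaking order can give a different code table, but the total encoded length comes out the same.

```python
import heapq
from itertools import count

def huffman_tree(freqs):
    """Build a Huffman tree from FREQS, a dict from characters to frequencies.
    A leaf is a character; an internal node is a pair (left, right)."""
    tiebreak = count()   # unique sequence numbers, so trees are never compared
    heap = [(f, next(tiebreak), ch) for ch, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Cluster the two trees with the smallest total frequency.
        f1, _, t1 = heapq.heappop(heap)
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (t1, t2)))
    return heap[0][2]

def code_table(tree, prefix=""):
    """Return a dict from characters to bit strings: 0 for left, 1 for right."""
    if isinstance(tree, str):
        return {tree: prefix or "0"}   # a lone character still needs one bit
    left, right = tree
    table = code_table(left, prefix + "0")
    table.update(code_table(right, prefix + "1"))
    return table

freqs = {"A": 10, "B": 5, "C": 7, "D": 9, "E": 3, "F": 1}
table = code_table(huffman_tree(freqs))
for ch in sorted(table):
    print(ch, table[ch])
print("total bits:", sum(freqs[ch] * len(table[ch]) for ch in freqs))
```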

Example
• Want to encode the string “AAAAAAAAAABBBBBCCCCCCCDDDDDDDDDEEEF”
• Here, the frequencies are

    Letter  Count
    A       10
    B       5
    C       7
    D       9
    E       3
    F       1

• Represent as 6 one-node trees labeled with letters and their frequencies:

    F/1   E/3   B/5   C/7   D/9   A/10
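The counts and the initial ordering of one-node trees can be computed with `collections.Counter`. In this sketch the example string is written as adjacent literals, one per letter, purely to make the counts easy to check by eye; the name `forest` is illustrative.

```python
from collections import Counter

# The example string, split by letter for readability.
text = ("AAAAAAAAAA" "BBBBB" "CCCCCCC" "DDDDDDDDD" "EEE" "F")

counts = Counter(text)

# One (frequency, letter) pair per one-node tree, smallest frequency first.
forest = sorted((n, ch) for ch, n in counts.items())
print(forest)
# [(1, 'F'), (3, 'E'), (5, 'B'), (7, 'C'), (9, 'D'), (10, 'A')]
```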

Forming Subtrees
• Starting with

    F/1   E/3   B/5   C/7   D/9   A/10

• We combine the two nodes with the smallest frequencies to get a “bigger” node representing both the characters E and F:

        /4      B/5   C/7   D/9   A/10
       /  \
    F/1    E/3

• Keeping the resulting trees in order by frequency, repeat:

    C/7        /9       D/9   A/10
              /  \
            /4    B/5
           /  \
        F/1    E/3

Forming Subtrees (II)
• And again:

    D/9   A/10        /16
                     /   \
                  C/7     /9
                         /  \
                       /4    B/5
                      /  \
                   F/1    E/3

Forming Subtrees (III)
• And yet again:

         /16                /19
        /   \              /   \
     C/7     /9         D/9     A/10
            /  \
          /4    B/5
         /  \
      F/1    E/3
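The sequence of forests in these slides can be reproduced with a sorted list. This is a sketch under one stated assumption: a newly merged tree is placed before any existing tree of equal frequency, which is the order the slides appear to use (the new /9 precedes D/9). Trees are written as parenthesized labels, and the name `merge_steps` is illustrative.

```python
from bisect import bisect_left

def merge_steps(forest):
    """FOREST is a list of (frequency, label) pairs sorted by frequency.
    Yield the forest after each merge of its two smallest trees."""
    forest = list(forest)
    while len(forest) > 1:
        (f1, t1), (f2, t2) = forest[0], forest[1]
        merged = (f1 + f2, f"({t1} {t2})")
        rest = forest[2:]
        # Place the new tree before any tree of equal frequency.
        pos = bisect_left([f for f, _ in rest], merged[0])
        forest = rest[:pos] + [merged] + rest[pos:]
        yield forest

start = [(1, "F"), (3, "E"), (5, "B"), (7, "C"), (9, "D"), (10, "A")]
for step in merge_steps(start):
    print("   ".join(f"{label}/{freq}" for freq, label in step))
```

The fourth line printed corresponds to the two trees /16 and /19 shown above; the fifth and last line is the single tree of total frequency 35.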

Forming Subtrees (IV)
• Finally, we get the tree below, which corresponds to the encoding table that follows it.

                   /35
                 /     \
         /16                /19
        /   \              /   \
     C/7     /9         D/9     A/10
            /  \
          /4    B/5
         /  \
      F/1    E/3

    Letter  Encoding
    A       11
    B       011
    C       00
    D       10
    E       0101
    F       0100

• So the string “AAAAAAAAAABBBBBCCCCCCCDDDDDDDDDEEEF” encodes as
  “11111111111111111111011011011011011000000000000001010101010101010100101010101010100”,
  which is 83 bits (10·2 + 5·3 + 7·2 + 9·2 + 3·4 + 1·4), as opposed to 94 with our previous unique-prefix encoding from the “Using Trees” slide, and 280 using UTF-8 and Unicode.
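The bit counts can be recomputed from the two encoding tables. In this sketch the dictionary names `huffman` and `earlier` and the helper `encode` are illustrative; the tables themselves are copied from this slide and from the “Using Trees” slide.

```python
# Table from this slide.
huffman = {"A": "11", "B": "011", "C": "00", "D": "10", "E": "0101", "F": "0100"}
# Table from the "Using Trees" slide.
earlier = {"A": "00", "B": "01", "C": "100", "D": "101", "E": "1100", "F": "1101"}

# The example string, split by letter for readability.
text = ("AAAAAAAAAA" "BBBBB" "CCCCCCC" "DDDDDDDDD" "EEE" "F")

def encode(text, code):
    """Concatenate the code words for the characters of TEXT."""
    return "".join(code[ch] for ch in text)

print("Huffman table:", len(encode(text, huffman)), "bits")
print("Earlier table:", len(encode(text, earlier)), "bits")
# Every character here is ASCII, so UTF-8 uses one byte (8 bits) each.
print("UTF-8:        ", 8 * len(text.encode("utf-8")), "bits")
```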
