
CS 3000: Algorithms & Data, Jonathan Ullman. Lecture 19: Data Compression (Greedy Algorithms: Huffman Codes)



  1. CS 3000: Algorithms & Data • Jonathan Ullman • Lecture 19: Data Compression • Greedy Algorithms: Huffman Codes • Apr 5, 2018 [handwritten: updated Apr 8, 2020]

  2. Data Compression • How do we store strings of text compactly? • A binary code is a mapping enc: Σ → {0,1}* • Simplest code: assign numbers 1, 2, …, |Σ| to each symbol, map each to its binary representation using ⌈log₂ |Σ|⌉ bits • Morse Code: a variable-length code • [Handwritten: standard Morse examples, e.g., E = ●, A = ● –, D = – ● ●, B = – ● ● ●]
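
A minimal Python sketch (ours, not from the slides) of the "simplest code": number the symbols and write each number with ⌈log₂ |Σ|⌉ bits.

```python
import math

def fixed_length_code(alphabet):
    """Map each symbol to a binary string of ceil(log2 |alphabet|) bits."""
    width = math.ceil(math.log2(len(alphabet)))
    return {sym: format(i, f"0{width}b") for i, sym in enumerate(alphabet)}

code = fixed_length_code("abcd")          # {'a': '00', 'b': '01', 'c': '10', 'd': '11'}
print("".join(code[ch] for ch in "dab"))  # 110001
```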

  3. Data Compression • Letters have uneven frequencies! • Want to use short encodings for frequent letters, long encodings for infrequent letters

                a    b    c    d    avg. len.
    Frequency   1/2  1/4  1/8  1/8
    Encoding 1  00   01   10   11   2.0
    Encoding 2  0    10   110  111  1.75

    [Handwritten: avg. len. of Encoding 2 = (1/2)·1 + (1/4)·2 + (1/8)·3 + (1/8)·3 = 1.75]
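
To double-check the table, a short sketch (our own) that computes the average length of an encoding under the given frequencies:

```python
freq = {"a": 1/2, "b": 1/4, "c": 1/8, "d": 1/8}
enc1 = {"a": "00", "b": "01", "c": "10", "d": "11"}
enc2 = {"a": "0", "b": "10", "c": "110", "d": "111"}

def avg_len(freq, enc):
    """Average bits per letter: sum of f_x * len(enc(x))."""
    return sum(f * len(enc[sym]) for sym, f in freq.items())

print(avg_len(freq, enc1))  # 2.0
print(avg_len(freq, enc2))  # 1.75
```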

  4. Data Compression • What properties would a good code have? • Easy to encode a string: Encode(KTS) = – ● – – ● ● ● (Morse: K = – ● –, T = –, S = ● ● ●) • The encoding is short on average: few bits per letter for the given frequencies; note ≤ 4 bits per letter allows at most 30 symbols (2 + 4 + 8 + 16 = 30 binary strings of length ≤ 4) • Easy to decode a string? Decode(– ● – – ● ● ●) = ? Many possibilities: KTS, TETTEEE, KTEEE, … (since T = – and E = ●)

  5. Prefix-Free Codes • Cannot decode if there are ambiguities • e.g., cannot decode when enc(x) is a prefix of enc(y) • Prefix-Free Code: a binary code enc: Σ → {0,1}* such that for every x ≠ y ∈ Σ, enc(x) is not a prefix of enc(y) • Any fixed-length code is prefix-free • [Handwritten: a fixed-length code a → 00, b → 01, c → 10, d → 11, and a prefix-free variable-length code a → 0, b → 10, c → 110, d → 111]

  6. Prefix-Free Codes • Can represent a prefix-free code as a binary tree with edges labeled 0/1 and leaves labeled with symbols • Encode by going up the tree (or using a table) • Decode by going down the tree from the root, emitting a symbol and restarting at the root whenever a leaf is reached • [Handwritten: worked examples of encoding "dab" and decoding a long bit string with this tree; a decoding sketch follows below]
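
A sketch of decoding by walking down the code tree. The nested-dict tree representation is our choice, and we reuse the prefix-free code a → 0, b → 10, c → 110, d → 111 from slide 5:

```python
def build_tree(code):
    """Build a binary code tree (nested dicts) from {symbol: codeword}."""
    root = {}
    for sym, word in code.items():
        node = root
        for bit in word[:-1]:
            node = node.setdefault(bit, {})
        node[word[-1]] = sym          # leaves hold symbols
    return root

def decode(bits, root):
    """Walk down from the root; emit a symbol and restart at each leaf."""
    out, node = [], root
    for bit in bits:
        node = node[bit]
        if not isinstance(node, dict):
            out.append(node)
            node = root
    return "".join(out)

tree = build_tree({"a": "0", "b": "10", "c": "110", "d": "111"})
print(decode("111010", tree))  # dab
```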

  7. Huffman Codes • (An algorithm to find) an optimal prefix-free code • len(T) = Σ_{x ∈ Σ} f_x · len_T(x), the average number of bits per letter • optimal = minimizing len(T) • Note, optimality depends on what you're compressing • H is the 8th most frequent letter in English (6.094%) but the 20th most frequent in Italian (0.636%)

               a    b    c    d
    Frequency  1/2  1/4  1/8  1/8
    Encoding   0    10   110  111

    [Handwritten: len = (1/2)·1 + (1/4)·2 + (1/8)·3 + (1/8)·3 = 1.75]

  8. Huffman Codes • First Try: split letters into two sets of roughly equal frequency and recurse • Balanced binary trees should have low depth

        a    b    c    d    e
       .32  .25  .20  .18  .05

    [Handwritten: an example balanced split and the resulting code tree]

  9. Huffman Codes • First Try: split letters into two sets of roughly equal frequency and recurse

        a    b    c    d    e
       .32  .25  .20  .18  .05

    • first try: len = 2.25 • optimal: len = 2.23 • [Handwritten: the optimal tree gives the longest codeword to the least frequent letter]

  10. Huffman Codes • Huffman's Algorithm: pair up the two letters with the lowest frequency and recurse

        a    b    c    d    e
       .32  .25  .20  .18  .05

    [Handwritten merge steps: {d, e} = .23, then {c, {d, e}} = .43, then {a, b} = .57, then the root = 1.0]

  11. Huffman Codes • Huffman's Algorithm: pair up the two letters with the lowest frequency and recurse • Theorem: Huffman's Algorithm produces a prefix-free code of optimal length • We'll prove the theorem using an exchange argument
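
The slides give no code, so here is one possible Python implementation of Huffman's algorithm using a min-heap; the representation (each heap entry carries a partial codebook) is our own choice:

```python
import heapq
from itertools import count

def huffman(freq):
    """Return {symbol: codeword} for a frequency table {symbol: weight}."""
    ties = count()  # tiebreaker so equal weights never compare codebooks
    heap = [(w, next(ties), {sym: ""}) for sym, w in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Pair up the two subtrees with the lowest total frequency.
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next(ties), merged))
    return heap[0][2]

freq = {"a": .32, "b": .25, "c": .20, "d": .18, "e": .05}
code = huffman(freq)
print(sum(f * len(code[s]) for s, f in freq.items()))  # 2.23, matching slide 9
```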

  12. Huffman Codes • Theorem: Huffman's Alg produces an optimal prefix-free code • (1) In an optimal prefix-free code (a tree), every internal node has exactly two children • [Otherwise an internal node with a single child could be contracted, shortening every codeword beneath it]

  13. [Handwritten proof sketch: in the optimal code, if the lowest depth is d, then the two least frequent letters are siblings and they are at depth d; otherwise, swapping one of them with a more frequent letter at depth d could only shorten the average length, which can't happen in an optimal code]

  14. Huffman Codes • Theorem: Huffman's Alg produces an optimal prefix-free code • (2) If x, y have the lowest frequency, then there is an optimal code where x, y are siblings and are at the bottom of the tree (i.e., they have the lowest depths) • [Handwritten sketch: suppose someone gave you the optimal tree but without labels; then I should label the highest leaves with the most frequent symbols and go down. By (1), there are two sibling leaves at the lowest depth, and an optimal code fills those siblings with the two least frequent items]

  15. Huffman Codes • Theorem: Huffman's Alg produces an optimal prefix-free code • Proof by Induction on the Number of Letters in Σ: • Base case (|Σ| = 2): rather obvious; Huffman assigns the codewords 0 and 1, and no code can use fewer than one bit per letter • [Handwritten inductive step setup: if Huffman's algorithm is optimal for k − 1 letters, then it is optimal for k. Suppose we have frequencies f_1 ≤ f_2 ≤ ⋯ ≤ f_k; merge the two lowest, 1 and 2, into a new letter w with f_w = f_1 + f_2]

  16. [Handwritten derivation: let T be the Huffman code for Σ and T′ the Huffman code for Σ′, the alphabet with letters 1, 2 merged into w. Then len(T) = len(T′) + f_1 + f_2, since 1 and 2 each sit one level below where w sits in T′. By the inductive hypothesis, T′ minimizes len over prefix-free codes for Σ′. Now suppose U is an optimal code for Σ; by (2), we may assume 1 and 2 are siblings at the lowest level of U. Merging them gives a code U′ for Σ′ with len(U′) = len(U) − f_1 − f_2. Then len(T) = len(T′) + f_1 + f_2 ≤ len(U′) + f_1 + f_2 = len(U), so T is optimal]

  17. Huffman Codes • Theorem: Huffman’s Alg produces an optimal prefix-free code • Proof by Induction on the Number of Letters in Σ : • Inductive Hypothesis:

  18. Huffman Codes • Theorem: Huffman's Alg produces an optimal prefix-free code • Proof by Induction on the Number of Letters in Σ: • Inductive Hypothesis: • Without loss of generality, frequencies are f_1, …, f_k, the two lowest are f_1, f_2 • Merge 1, 2 into a new letter k + 1 with f_{k+1} = f_1 + f_2

  19. Huffman Codes • Theorem: Huffman's Alg produces an optimal prefix-free code • Proof by Induction on the Number of Letters in Σ: • Inductive Hypothesis: • Without loss of generality, frequencies are f_1, …, f_k, the two lowest are f_1, f_2 • Merge 1, 2 into a new letter k + 1 with f_{k+1} = f_1 + f_2 • By induction, if T′ is the Huffman code for f_3, …, f_{k+1}, then T′ is optimal • Need to prove that T is optimal for f_1, …, f_k

  20. Huffman Codes • Theorem: Huffman's Alg produces an optimal prefix-free code • If T′ is optimal for f_3, …, f_{k+1} then T is optimal for f_1, …, f_k

  21. An Experiment • Take the Dickens novel A Tale of Two Cities • File size is 799,940 bytes • Build a Huffman code and compress • File size is now 439,688 bytes (about 55% of the original)

           Raw      Huffman
    Size   799,940  439,688
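
A sketch of how one might reproduce the experiment, reusing the huffman() function above. The filename is hypothetical, and the count ignores the (small) cost of storing the code tree itself:

```python
from collections import Counter

def huffman_size_bytes(data: bytes) -> int:
    """Bytes needed to store data under a byte-level Huffman code."""
    freq = Counter(data)             # a Counter works as a {symbol: weight} table
    code = huffman(freq)
    bits = sum(freq[b] * len(code[b]) for b in freq)
    return (bits + 7) // 8           # round up to whole bytes

with open("two_cities.txt", "rb") as fh:  # hypothetical filename
    data = fh.read()
print(len(data), huffman_size_bytes(data))
```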

  22. Huffman Codes • Huffman's Algorithm: pair up the two letters with the lowest frequency and recurse • Theorem: Huffman's Algorithm produces a prefix-free code of optimal length • In what sense is this code really optimal? (Bonus material… will not test you on this)

  23. Length of Huffman Codes • What can we say about Huffman code length? • Suppose f_x = 2^(−ℓ_x) for every x ∈ Σ, for integer ℓ_x • Then len_T(x) = ℓ_x for the optimal Huffman code • Proof idea, by example:

    letter  a    b    c    d
    freq    1/2  1/4  1/8  1/8
    code    0    10   110  111
    len     1    2    3    3

  24. Length of Huffman Codes • What can we say about Huffman code length? • Suppose f_x = 2^(−ℓ_x) for every x ∈ Σ • Then len_T(x) = ℓ_x for the optimal Huffman code • len(T) = Σ_{x ∈ Σ} f_x · log₂(1/f_x) • [Handwritten: since ℓ_x = log₂(1/f_x), the average length Σ f_x · ℓ_x equals Σ f_x · log₂(1/f_x)]
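
A quick check of the claim, reusing the huffman() sketch from earlier: with dyadic frequencies, each codeword length equals log₂(1/f_x) exactly.

```python
import math

freq = {"a": 1/2, "b": 1/4, "c": 1/8, "d": 1/8}
code = huffman(freq)                 # huffman() as sketched above
for sym, f in freq.items():
    assert len(code[sym]) == round(math.log2(1 / f))
print(sum(f * len(code[sym]) for sym, f in freq.items()))  # 1.75
```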

  25. Entropy • Given a set of frequencies (aka a probability distribution), the entropy is H(f) = Σ_x f_x · log₂(1/f_x) • [Handwritten: this is exactly the length of the Huffman code in the dyadic case] • Entropy is a “measure of randomness”
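
The formula translates directly into a small helper (our own, not from the slides):

```python
import math

def entropy(freq):
    """H(f) = sum over x of f_x * log2(1/f_x), skipping zero frequencies."""
    return sum(f * math.log2(1 / f) for f in freq.values() if f > 0)

print(entropy({"a": 1/2, "b": 1/4, "c": 1/8, "d": 1/8}))  # 1.75 bits
```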

  26. Entropy • Given a set of frequencies (aka a probability distribution), the entropy is H(f) = Σ_x f_x · log₂(1/f_x) • [Handwritten: how random is the text?] • Entropy is a “measure of randomness” • Entropy was introduced by Shannon in 1948 and is the foundational concept in: • Data compression • Error correction (communicating over noisy channels) • Security (passwords and cryptography)

  27. Entropy of Passwords • Your password is a specific string, so f(your password) = 1.0 and its entropy is 0 • To talk about security of passwords, we have to model them as random • Random 16-letter string: H = 16 · log₂ 26 ≈ 75.2 • Random IMDb movie: H = log₂ 1,764,727 ≈ 20.7 • Your favorite IMDb movie: H ≪ 20.7 • Entropy measures how difficult passwords are to guess “on average”
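
The two estimates on this slide are plain arithmetic; a sketch to reproduce them:

```python
import math

print(16 * math.log2(26))     # ~75.2 bits: random 16-letter (a-z) string
print(math.log2(1_764_727))   # ~20.7 bits: uniform pick among 1,764,727 movies
```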

  28. Entropy of Passwords

  29. Entropy and Compression • Given a set of frequencies (probability distribution), the entropy is H(f) = Σ_x f_x · log₂(1/f_x) • [Handwritten: the length of the Huffman code] • Suppose that we generate string S by choosing n random letters independently with frequencies f • Any compression scheme requires at least H(f) bits per letter to store S (as n → ∞) • Huffman codes are truly optimal!
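
An informal empirical check (not part of the lecture), reusing the huffman() and entropy() sketches above: for letters drawn i.i.d. from f, Huffman's bits-per-letter lands close to, and never below, H(f).

```python
import random

f = {"a": .32, "b": .25, "c": .20, "d": .18, "e": .05}
code = huffman(f)
s = random.choices(list(f), weights=list(f.values()), k=100_000)
bits = sum(len(code[ch]) for ch in s)
print(bits / len(s))  # ~2.23 bits per letter
print(entropy(f))     # ~2.15, the lower bound
```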

  30. But Wait! • Take the Dickens novel A Tale of Two Cities • File size is 799,940 bytes • Build a Huffman code and compress • File size is now 439,688 bytes • But we can do better!

           Raw      Huffman  gzip     bzip2
    Size   799,940  439,688  301,295  220,156

  31. What do the frequencies represent? • Real data (e.g. natural language, music, images) have patterns between letters • U becomes a lot more common after a Q • Possible approach: model pairs of letters • Build a Huffman code for pairs-of-letters • Improves the compression ratio, but the tree gets bigger • Can only model certain types of patterns • Zip is based on Lempel-Ziv compression (LZ77), which tries to identify repeated patterns in the data
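
A sketch of the pairs-of-letters idea, reusing huffman() from above; the sample text and the naive non-overlapping pairing are our own choices:

```python
from collections import Counter

def bigram_bits_per_letter(text):
    """Huffman-code non-overlapping letter pairs; report bits per letter."""
    pairs = [text[i:i + 2] for i in range(0, len(text) - 1, 2)]  # drops a trailing odd char
    freq = Counter(pairs)
    code = huffman(freq)
    bits = sum(freq[p] * len(code[p]) for p in freq)
    return bits / (2 * len(pairs))

print(bigram_bits_per_letter("the quick brown fox jumps over the lazy dog"))
```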
