Lecture 19: Data Compression and Huffman Codes
Tim LaRock larock.t@northeastern.edu bit.ly/cs3000syllabus
Business
- Homework 5 grades posted. Request regrades on GradeScope ASAP!
- Midterm 2 approximately halfway graded. Grades should be …
- Extra Credit Assignment 1 open until Sunday night
- Extra Credit Assignment 2 to be released this evening and due Thursday 5PM
- Final Exam to be released Thursday 6PM and due Monday at midnight
Email me and I may be able to improvise a brief discussion!
We can modify the order of the files on the tape, resulting in a permutation π where π(i) returns the index of the file in the i-th block. We can then rewrite the expected (average) cost of accessing a file as

E[cost(π)] = (1/n) · Σ_{k=1..n} Σ_{i=1..k} L[π(i)]
Intuitively: To minimize average cost, we should store the smallest files first and put larger files after smaller ones! But how do we prove that this is the optimal strategy?
[Figure: two orderings of four files with lengths 3, 2, 3, 2 on the tape]
Original order: E[cost] = (3 + 5 + 8 + 10)/4 = 26/4
Sorted by length: E[cost(π)] = (2 + 4 + 7 + 10)/4 = 23/4
1. Find the unwritten file with minimum length (break ties arbitrarily)
2. Write that file to the tape
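A minimal sketch of this greedy rule in Python. The file lengths 3, 2, 3, 2 match the example above; the function name is my own:

```python
def greedy_tape_order(lengths):
    """Write files shortest-first; return the write order (as indices)
    and the expected (average) cost of accessing a random file."""
    # Sorting by length is exactly the greedy rule: repeatedly pick
    # the unwritten file with minimum length.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    prefix, costs = 0, []
    for i in order:
        prefix += lengths[i]   # reading file i reads everything before it too
        costs.append(prefix)
    return order, sum(costs) / len(costs)

order, avg = greedy_tape_order([3, 2, 3, 2])
# avg == (2 + 4 + 7 + 10) / 4 == 23/4
```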
Claim: E[cost(π)] is minimized when L[π(i)] ≤ L[π(i+1)] for all i.
Proof: Let a = π(i) and b = π(i+1), and suppose L[a] > L[b] for some index i. If we swap the files a and b on the tape, then the cost of accessing a increases by L[b] and the cost of accessing b decreases by L[a]. Overall, the swap changes the expected cost by (L[b] − L[a])/n. This change is an improvement because L[b] < L[a]. Thus, if the files are out of length-order, we can decrease the expected cost by swapping pairs to put them in order.
Key Point: If we had some other potentially optimal solution π*, we can transform it into the greedy solution by iteratively swapping files that are out of length-order, never increasing the cost. So the greedy solution is optimal.
For each symbol, map to binary numbers of ⌈log₂|Σ|⌉ bits
Idea: use shorter encodings for frequent letters and longer encodings for infrequent letters.

              a     b     c     d    avg length
  Frequency  1/2   1/4   1/8   1/8
  Encoding 1  00    01    10    11      2.0
  Encoding 2   0    10   110   111     1.75

Average length of Encoding 2: 1·(1/2) + 2·(1/4) + 3·(1/8) + 3·(1/8) = 1/2 + 1/2 + 3/4 = 1.75
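The two averages can be checked directly. A small sketch; the dictionaries mirror the table above and the helper name is my own:

```python
def avg_length(freqs, code):
    # Average codeword length: sum of frequency * codeword length.
    return sum(freqs[s] * len(code[s]) for s in freqs)

freqs = {"a": 1/2, "b": 1/4, "c": 1/8, "d": 1/8}
enc1 = {"a": "00", "b": "01", "c": "10", "d": "11"}
enc2 = {"a": "0", "b": "10", "c": "110", "d": "111"}
# avg_length(freqs, enc1) == 2.0 and avg_length(freqs, enc2) == 1.75
```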
Example (Morse-style code: K = −·−, T = −, S = ···):
Encode(KTS) = −·−−··· but Decode(−·−−···) is ambiguous without pauses!
≤ 4 bits per letter (30 symbols max!)
for every x ≠ y ∈ Σ, enc(x) is not a prefix of enc(y)
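A quick way to check this property in Python (a sketch; `is_prefix_free` is my own helper name). Sorting the codewords puts any prefix immediately before some word it prefixes, so checking consecutive pairs suffices:

```python
def is_prefix_free(code):
    """Return True iff no codeword is a prefix of another codeword."""
    words = sorted(code.values())
    # After sorting, if any word has a prefix in the set, that prefix's
    # immediate successor also starts with it, so adjacent checks suffice.
    return all(not b.startswith(a) for a, b in zip(words, words[1:]))

# The prefix-free encoding from the earlier table:
# is_prefix_free({"a": "0", "b": "10", "c": "110", "d": "111"}) -> True
# But "0" is a prefix of "00":
# is_prefix_free({"a": "0", "b": "00"}) -> False
```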
len(T) = Σ_{i∈Σ} f_i · len_T(i)
              a     b     c     d
  Frequency  1/2   1/4   1/8   1/8
  Encoding     0    10   110   111
Example: Σ = {a, b, c, d, e} with frequencies

       a     b     c     d     e
     .32   .25   .20   .18   .05

First try: codewords of length 2 for a, b, d and length 3 for c, e:
len = 2 · (0.32 + 0.25 + 0.18) + 3 · (0.20 + 0.05) = 2 · 0.75 + 3 · 0.25 = 2.25
A better tree puts a, b, c at depth 2 and d, e at depth 3:
len = 2 · (0.32 + 0.25 + 0.20) + 3 · (0.18 + 0.05) = 2 · 0.77 + 3 · 0.23 = 2.23
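Huffman's algorithm is short enough to sketch in Python. `heapq` implements the repeated "merge the two least frequent subtrees" step; the counter is only a tie-breaker so the heap never compares dicts:

```python
import heapq
from itertools import count

def huffman_lengths(freqs):
    """Return {symbol: codeword length} for a Huffman code."""
    tiebreak = count()
    # Heap entries: (subtree frequency, tie-breaker, {symbol: depth so far}).
    heap = [(f, next(tiebreak), {s: 0}) for s, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)  # two least frequent subtrees
        f2, _, t2 = heapq.heappop(heap)
        # Merging pushes every symbol in both subtrees one level deeper.
        merged = {s: d + 1 for s, d in {**t1, **t2}.items()}
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]

freqs = {"a": 0.32, "b": 0.25, "c": 0.20, "d": 0.18, "e": 0.05}
lengths = huffman_lengths(freqs)
avg = sum(freqs[s] * lengths[s] for s in freqs)
# avg ≈ 2.23, matching the better tree above
```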
Claim 1: In an optimal prefix-free code tree, every internal node has exactly two children.
Adding another internal node anywhere would only raise the average length!
What is the implication of removing an internal node with only one child? A strictly shorter code!
[Figure: example code trees over {a, b, c} and {a, b, c, d} illustrating both cases]
Implication: If a code tree has depth d, there are at least 2 leaves at depth d that are siblings!
Claim 2: If symbols x, y have the lowest frequencies, then there is an optimal code where x, y are siblings and are at the bottom of the tree.
Suppose someone gave you the optimal tree, but with no labels.
Ex: Σ = {a, b, c, d, e}, with f_a > f_b > f_c > f_d > f_e
How should you label the leaves? Given what we proved in (1), there are two sibling leaves at the lowest depth, so the two least frequent symbols can be placed there! By the same reasoning, the highest-frequency symbols should be on the highest leaves!
Implication: The first step of Huffman's Algorithm is towards an optimal code!
Inductive Step: If Huffman's algorithm is optimal for |Σ| = n − 1, then it is optimal for |Σ| = n.
Suppose we have frequencies f_1 ≥ f_2 ≥ ⋯ ≥ f_{n−1} ≥ f_n.
Based on Huffman's algorithm and what we proved in (1) and (2), we merge f_{n−1} and f_n to get a new symbol x. Now we have Σ′ = {1, 2, ⋯, n − 2, x}, where f_x = f_{n−1} + f_n.
Now |Σ′| = n − 1, so Huffman's algorithm builds an optimal tree for Σ′ by the inductive hypothesis. Expanding x back into the sibling leaves n − 1 and n adds exactly f_{n−1} + f_n to the length of any tree containing x, so the expanded tree is optimal for Σ.
We showed that…
1) In an optimal prefix-free code tree, every internal node has exactly two children
2) If symbols x, y have the lowest frequency, then there is an optimal code where x, y are siblings at the bottom of the tree
3) Every Huffman code satisfies these two properties by construction
Therefore, the code produced by Huffman's algorithm is an optimal prefix-free code. We proved this by induction on the number of symbols.
          Raw       Huffman
  Size  799,940     439,688
Special case: suppose f_i = 2^(−ℓ_i) for every i ∈ Σ.

  Letter      a      b      c      d
  Frequency   2^−1   2^−2   2^−3   2^−3
  Code        0      10     110    111
  Length      1      2      3      3

Then ℓ_i = log₂(1/f_i) for every i, so
len(T) = Σ_{i∈Σ} f_i · ℓ_i = Σ_{i∈Σ} f_i · log₂(1/f_i)
Entropy: H(f) = Σ_{i∈Σ} f_i · log₂(1/f_i), where Σ_{i∈Σ} f_i = 1.0
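A small numeric check of the entropy formula (a sketch; the frequencies come from the earlier examples, and `entropy` is my own helper name):

```python
import math

def entropy(freqs):
    # H(f) = sum over symbols of f_i * log2(1 / f_i); skip zero frequencies.
    return sum(f * math.log2(1 / f) for f in freqs.values() if f > 0)

# Dyadic frequencies: entropy exactly matches the Huffman average length 1.75.
dyadic = {"a": 1/2, "b": 1/4, "c": 1/8, "d": 1/8}
# entropy(dyadic) == 1.75

# Non-dyadic frequencies: entropy (≈ 2.15) is strictly below the Huffman
# average length 2.23 computed earlier.
freqs = {"a": 0.32, "b": 0.25, "c": 0.20, "d": 0.18, "e": 0.05}
```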
          Raw       Huffman    gzip      bzip2
  Size  799,940     439,688    301,295   220,156
Why do gzip and bzip2 beat Huffman? Huffman coding uses only the single-letter frequencies in H(f) = Σ_{i∈Σ} f_i · log₂(1/f_i); it ignores the dependence between different letters!