SLIDE 1

Lecture 19: Data Compression and Huffman Codes

Tim LaRock larock.t@northeastern.edu bit.ly/cs3000syllabus

SLIDE 2

Business

Homework 5 grades posted

  • Request regrades on GradeScope ASAP!

Midterm 2 approximately halfway graded

  • Grades should be out by Wednesday night

Extra Credit Assignment 1 open until Sunday night

Extra Credit Assignment 2 to be released this evening and due Thursday 5PM

  • Optional Greedy Algorithms and Information Theory assignment
  • Points will be added to your 2nd lowest homework grade

Final Exam to be released Thursday 6PM and due Monday at Midnight

  • Exam is cumulative, all topics fair game
  • Review during lecture on Thursday – form for questions will go out tonight
SLIDE 3

This Week

  • Today: Greedy algorithms + proof strategies
  • Data Compression, Huffman Codes, Information theory
  • Tomorrow: More greedy algorithms/info theory
  • Clustering; community detection in graphs/networks
  • Wednesday: Advanced topics and course wrap-up
  • If we haven’t talked about something you hoped we would, feel free to send me an email and I may be able to improvise a brief discussion!

  • Thursday: Final Exam Review
SLIDE 4

Last time: Files on Tape

We can modify the order of the files on the tape, resulting in a permutation $\pi$, where $\pi(i)$ returns the index of the file in the $i$-th block. We can then rewrite the expected (average) cost of accessing a file as

$$\mathbb{E}[\mathrm{cost}(\pi)] = \frac{1}{n} \sum_{k=1}^{n} \sum_{j=1}^{k} L[\pi(j)]$$

[Figure: a tape holding files 1, 2, 3, 4 with lengths 3, 2, 3, 2]

Intuitively: to minimize average cost, we should store the smallest files first; otherwise we will unnecessarily spend time skipping the large files to read smaller ones! But how do we prove that this is the optimal strategy?

In the original order: $\mathbb{E}[\mathrm{cost}] = \frac{3 + 5 + 8 + 10}{4} = \frac{26}{4}$. With the files sorted by length, $\pi = (2, 4, 1, 3)$: $\mathbb{E}[\mathrm{cost}(\pi)] = \frac{2 + 4 + 7 + 10}{4} = \frac{23}{4}$

SLIDE 5

Last time: Files on Tape

Input: A set of files labeled $1 \dots n$ with lengths $L[i]$
Output: An ordering of the files on the tape

Repeat until all files are on the tape:
1. Find the unwritten file with minimum length (break ties arbitrarily)
2. Write that file to the tape

How can we show this is optimal?
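A minimal sketch of this greedy rule in Python (the function and variable names are mine, not the slides'); the lengths below reproduce the 26/4 vs. 23/4 example from the previous slide:

```python
def order_files(lengths):
    """Greedy: write files in nondecreasing order of length."""
    return sorted(range(len(lengths)), key=lambda i: lengths[i])

def expected_cost(lengths, order):
    """E[cost] = (1/n) * sum over k of (L[order(1)] + ... + L[order(k)])."""
    total, prefix = 0, 0
    for i in order:
        prefix += lengths[i]   # time to reach and read this file
        total += prefix
    return total / len(lengths)

lengths = [3, 2, 3, 2]                               # files 1..4 from the example
print(expected_cost(lengths, [0, 1, 2, 3]))          # 6.5  = 26/4
print(expected_cost(lengths, order_files(lengths)))  # 5.75 = 23/4
```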

SLIDE 6

Last time: Files on Tape

Claim: $\mathbb{E}[\mathrm{cost}(\pi)]$ is minimized when $L[\pi(i)] \le L[\pi(i+1)]$ for all $i$.

Proof: Let $b = \pi(i)$ and $c = \pi(i+1)$, and suppose $L[b] > L[c]$ for some index $i$. If we swap the files $b$ and $c$ on the tape, then the cost of accessing $b$ increases by $L[c]$ and the cost of accessing $c$ decreases by $L[b]$. Overall, the swap changes the expected cost by

$$\frac{L[c] - L[b]}{n}$$

This change represents an improvement because $L[c] < L[b]$. Thus, if the files are out of length-order, we can decrease expected cost by swapping pairs to put them in order.

Key Point: If we had some other potentially optimal solution $\pi^*$, we can transform it into the optimal solution by iteratively swapping files that are out of length-order.
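A quick numeric check of the swap step (my own example, reusing the lengths from the previous slide):

```python
def expected_cost(lengths):
    """Average access cost when files sit on the tape in this order."""
    prefix = total = 0
    for x in lengths:
        prefix += x
        total += prefix
    return total / len(lengths)

tape = [3, 2, 3, 2]       # lengths in tape order; first two are out of order
swapped = [2, 3, 3, 2]    # swap the adjacent out-of-order pair
print(expected_cost(swapped) - expected_cost(tape))   # -0.25 = (2 - 3) / 4
```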

SLIDE 7

Data Compression and Huffman Codes

SLIDE 8

Data Compression

  • How do we store strings of text compactly?
  • A binary code is a mapping $\mathrm{enc}: \Sigma \to \{0,1\}^*$
  • Simplest code: assign numbers $1, 2, \dots, |\Sigma|$ to each symbol, map to binary numbers of $\lceil \log_2 |\Sigma| \rceil$ bits (see the sketch after this list)

  • Morse Code: [chart figure not reproduced]
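For instance, a minimal Python sketch of this fixed-length code (my own illustration, not from the slides):

```python
from math import ceil, log2

def fixed_length_code(alphabet):
    """Assign each symbol a binary codeword of ceil(log2 |Sigma|) bits."""
    width = ceil(log2(len(alphabet)))
    return {sym: format(i, f"0{width}b") for i, sym in enumerate(alphabet)}

print(fixed_length_code("abcd"))   # {'a': '00', 'b': '01', 'c': '10', 'd': '11'}
```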
SLIDE 10

Data Compression

  • Letters have uneven frequencies!
  • Want to use short encodings for frequent letters, long encodings for infrequent letters

            a     b     c     d     avg. len.
Frequency   1/2   1/4   1/8   1/8
Encoding 1  00    01    10    11    2.0
Encoding 2  0     10    110   111   1.75

$1 \cdot \frac{1}{2} + 2 \cdot \frac{1}{4} + 3 \cdot \frac{1}{8} + 3 \cdot \frac{1}{8} = \frac{1}{2} + \frac{1}{2} + \frac{3}{4} = 1.75$

SLIDE 11

Data Compression

  • What properties would a good code have?
  • Easy to encode a string
  • The encoding is short on average (bits per letter given frequencies)
  • Easy to decode a string?

Encode(KTS) = – ● – – ● ● ●
Decode(– ● – – ● ● ●) = ?
≤ 4 bits per letter (30 symbols max!)

SLIDE 12

Prefix Free Codes

  • Cannot decode if there are ambiguities
  • e.g. enc(“E”) is a prefix of enc(“S”)
  • Prefix-Free Code: a binary $\mathrm{enc}: \Sigma \to \{0,1\}^*$ such that for every $x \ne y \in \Sigma$, $\mathrm{enc}(x)$ is not a prefix of $\mathrm{enc}(y)$

  • Any fixed-length code is prefix-free
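A tiny Python check of this property (my own helper, not from the slides); comparing adjacent codewords after sorting suffices because every extension of a codeword sorts immediately after it:

```python
def is_prefix_free(code):
    """Check that no codeword is a prefix of another."""
    words = sorted(code.values())
    return all(not b.startswith(a) for a, b in zip(words, words[1:]))

print(is_prefix_free({"a": "0", "b": "10", "c": "110", "d": "111"}))  # True
print(is_prefix_free({"E": "0", "S": "000"}))                         # False
```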
SLIDE 13

Prefix Free Codes

  • Can represent a prefix-free code as a binary tree
  • Left child = 0
  • Right child = 1
  • Encode by going up the tree (or using a table)
  • d a b β†’ 0 0 1 1 0 1 1
  • Decode by going down the tree
  • 0 1 1 0 0 0 1 0 0 1 0 1 0 1 0 1 1 ← beadcab
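A sketch of tree-walk decoding in Python. The nested-tuple representation and the particular tree are my own stand-ins (the slides' tree isn't reproduced here):

```python
def decode(tree, bits):
    """Walk down from the root: 0 = left child, 1 = right child.
    A node is either a leaf (a letter) or a pair (left, right)."""
    out, node = [], tree
    for b in bits:
        node = node[int(b)]           # follow the edge labeled b
        if isinstance(node, str):     # reached a leaf: emit, restart at root
            out.append(node)
            node = tree
    return "".join(out)

# Hypothetical code: a=00, b=01, c=10, d=110, e=111
tree = (("a", "b"), ("c", ("d", "e")))
print(decode(tree, "0001110"))        # "abd"
```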
SLIDE 14

Huffman Codes

  • (An algorithm to find) an optimal prefix-free code
  • optimal = $\min_{\text{prefix-free } T} \mathrm{len}(T)$, where $\mathrm{len}(T) = \sum_{i \in \Sigma} f_i \cdot \mathrm{len}_T(i)$

  • Note, optimality depends on what you’re compressing
  • H is the 8th most frequent letter in English (6.094%) but the 20th most frequent in Italian (0.636%)

           a    b    c    d
Frequency  1/2  1/4  1/8  1/8
Encoding   0    10   110  111

SLIDE 15

Huffman Codes

  • First Try: split letters into two sets of roughly equal frequency and recurse
  • Balanced binary trees should have low depth

           a    b    c    d    e
Frequency  .32  .25  .20  .18  .05

SLIDE 17

Huffman Codes

  • First Try: split letters into two sets of roughly equal frequency and recurse
  • Balanced binary trees should have low depth

[Tree: split into {a, d} (total 0.5) and {b, c, e} (total 0.5), putting a, b, d at depth 2 and c, e at depth 3]

$2 \cdot (0.32 + 0.25 + 0.18) + 3 \cdot (0.20 + 0.05) = 2 \cdot 0.75 + 3 \cdot 0.25 = 2.25$

SLIDE 19

Huffman Codes

  • First Try: split letters into two sets of roughly equal frequency and recurse
  • First try: len = 2.25; optimal: len = 2.23

           a    b    c    d    e
Frequency  .32  .25  .20  .18  .05

SLIDE 21

Huffman Codes

  • Huffman’s Algorithm: pair up the two letters with the lowest frequency and recurse

           a    b    c    d    e
Frequency  .32  .25  .20  .18  .05
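A compact Python sketch of this algorithm using a min-heap (a standard rendering; the helper names are mine, not the slides'):

```python
import heapq
from itertools import count

def huffman(freqs):
    """Build a Huffman code. freqs maps symbol -> frequency.
    Returns a dict mapping symbol -> binary codeword."""
    ids = count()                       # tiebreaker so the heap never compares dicts
    heap = [(f, next(ids), {s: ""}) for s, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, code1 = heapq.heappop(heap)   # two lowest-frequency subtrees...
        f2, _, code2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in code1.items()}
        merged.update({s: "1" + c for s, c in code2.items()})
        heapq.heappush(heap, (f1 + f2, next(ids), merged))   # ...merge and recurse
    return heap[0][2]

freqs = {"a": .32, "b": .25, "c": .20, "d": .18, "e": .05}
code = huffman(freqs)
print(sum(freqs[s] * len(code[s]) for s in freqs))   # 2.23, matching the slide
```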

SLIDE 22

Huffman Codes

  • Huffman’s Algorithm: pair up the two letters with the lowest frequency and recurse
  • Theorem: Huffman’s Algorithm produces a prefix-free code of optimal length

  • We’ll prove the theorem using an exchange argument
SLIDE 23

Huffman Codes

  • Theorem: Huffman’s Alg produces an optimal prefix-free code
  • (1) In an optimal prefix-free code (a tree), every internal node has exactly two children

SLIDE 29

Huffman Codes

  • Theorem: Huffman’s Alg produces an optimal prefix-free code
  • (1) In an optimal prefix-free code (a tree), every internal node has exactly two children

Adding another internal node anywhere would only raise the average length! What is the implication of removing an internal node that has only one child? A strictly shorter code!

Implication: If a code tree has depth $d$, there are at least 2 leaves at depth $d$ that are siblings!

[Tree figures: example code trees over {a, b, c} and {a, b, c, d} illustrating a one-child internal node and its removal]

SLIDE 34

Huffman Codes

  • Theorem: Huffman’s Alg produces an optimal prefix-free code
  • (2) If $x, y$ have the lowest frequency, then there is an optimal code where $x, y$ are siblings and are at the bottom of the tree

Suppose someone gave you the optimal tree, but with no labels. Ex: $\Sigma = \{a, b, c, d, e\}$, with $f_a > f_b > f_c > f_d > f_e$.

How should you label the leaves? Given what we proved in (1), the two least frequent symbols will be siblings at the lowest depth! By definition, the highest frequency symbols should be on the highest leaves!

[Tree figure: leaves labeled a, b, c, d, e from shallowest to deepest]

Implication: The first step of Huffman’s Algorithm is towards an optimal code!

SLIDE 35

Huffman Codes

  • Theorem: Huffman’s Alg produces an optimal prefix-free code
  • Proof by Induction on the Number of Letters in $\Sigma$:
  • Base case ($|\Sigma| = 2$): rather obvious

Inductive Step: If Huffman’s algorithm is optimal for $|\Sigma| = k - 1$, then it is optimal for $|\Sigma| = k$.

Suppose we have frequencies $f_1 \ge f_2 \ge \dots \ge f_{k-1} \ge f_k$. Based on Huffman’s alg and what we proved in (1) and (2), we merge $f_{k-1}$ and $f_k$ to get a new symbol $x$. Now we have $\Sigma' = \{1, 2, \dots, k-2, x\}$, where $f_x = f_{k-1} + f_k$.

Now $|\Sigma'| = k - 1$, which is optimal by the inductive hypothesis. (Expanding $x$ back into the two sibling leaves $k-1$ and $k$ adds exactly $f_{k-1} + f_k$ to the expected length of any such tree, so an optimal tree for $\Sigma'$ yields an optimal tree for $\Sigma$.)

SLIDE 41

Huffman Codes

  • Theorem: Huffman’s Alg produces an optimal prefix-free code

We showed that…
1) In an optimal prefix-free code tree, every internal node has exactly two children
2) If symbols $x, y$ have the lowest frequency, then there is an optimal code where $x, y$ are siblings and are at the bottom of the tree
3) Every Huffman code satisfies these two properties by definition. Therefore, a code produced by Huffman’s algorithm is an optimal prefix-free code. We proved this by induction on the number of symbols.

SLIDE 42

An Experiment

  • Take the Dickens novel A Tale of Two Cities
  • File size is 799,940 bytes
  • Build a Huffman code and compress
  • File size is now 439,688 bytes

       Raw      Huffman
Size   799,940  439,688

SLIDE 43

Huffman Codes

  • Huffman’s Algorithm: pair up the two letters with the lowest frequency and recurse
  • Theorem: Huffman’s Algorithm produces a prefix-free code of optimal length
  • In what sense is this code really optimal? (Bonus material… will not test you on this)

SLIDE 46

Length of Huffman Codes

  • What can we say about Huffman code length?
  • Suppose $f_i = 2^{-\ell_i}$ for every $i \in \Sigma$
  • Then $\mathrm{len}_T(i) = \ell_i$ for the optimal Huffman code
  • Length of the code is the sum of $2^{-\ell_i} \cdot \ell_i$ for all $i$
  • $\mathrm{len}(T) = \sum_{i \in \Sigma} f_i \cdot \log_2 \frac{1}{f_i}$   (note $\log_2 f_i = -\ell_i$, so $\log_2 \frac{1}{f_i} = \ell_i$)

Letter     a     b     c     d
Frequency  2^-1  2^-2  2^-3  2^-3
Code       0     10    110   111
Length     1     2     3     3
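As a quick numeric check of this table (my own arithmetic, not the slides'):

```python
freqs = [2**-1, 2**-2, 2**-3, 2**-3]    # a, b, c, d
lengths = [1, 2, 3, 3]                  # codeword lengths for 0, 10, 110, 111
print(sum(f * l for f, l in zip(freqs, lengths)))   # 1.75 bits per letter
```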

SLIDE 47

Entropy

  • Given a set of frequencies (aka a probability distribution), the entropy is

$$H(f) = \sum_{i} f_i \cdot \log_2 \frac{1}{f_i}$$

  • Entropy is a “measure of randomness”
  • Entropy was introduced by Shannon in 1948 and is the foundational concept in:
  • Data compression
  • Error correction (communicating over noisy channels)
  • Security (passwords and cryptography)
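This formula translates directly into Python; as a sanity check (my own, not from the slides), the earlier example frequencies give entropy of roughly 2.15 bits, just below the optimal Huffman length of 2.23:

```python
from math import log2

def entropy(freqs):
    """H(f) = sum_i f_i * log2(1 / f_i), in bits per letter."""
    return sum(f * log2(1 / f) for f in freqs if f > 0)

print(entropy([.32, .25, .20, .18, .05]))   # ~2.15
```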
SLIDE 49

Entropy of Passwords

  • Your password is a specific string, so its frequency is 1.0
  • To talk about security of passwords, we have to model them as random
  • Random 16-letter string: $H = 16 \cdot \log_2 26 \approx 75.2$
  • Random IMDb movie: $H = \log_2 1764727 \approx 20.7$
  • Your favorite IMDb movie: $H \ll 20.7$
  • Entropy measures how difficult passwords are to guess “on average”
SLIDE 51

Entropy and Compression

  • Given a set of frequencies (probability distribution), the entropy is

$$H(f) = \sum_{i} f_i \cdot \log_2 \frac{1}{f_i}$$

  • Suppose that we generate string $S$ by choosing $n$ random letters independently with frequencies $f$
  • Any compression scheme requires at least $H(f)$ bits per letter to store $S$ (as $n \to \infty$)
  • Huffman codes are truly optimal!
SLIDE 52

But Wait!

  • Take the Dickens novel A Tale of Two Cities
  • File size is 799,940 bytes
  • Build a Huffman code and compress
  • File size is now 439,688 bytes
  • But we can do better!

       Raw      Huffman  gzip     bzip2
Size   799,940  439,688  301,295  220,156

SLIDE 53

What do the frequencies represent?

  • Real data (e.g. natural language, music, images) have patterns between letters
  • U becomes a lot more common after a Q
  • Possible approach: model pairs of letters
  • Build a Huffman code for pairs-of-letters (sketched below)
  • Improves compression ratio, but the tree gets bigger
  • Can only model certain types of patterns
  • Zip is based on an algorithm called LZW that tries to identify patterns based on the data
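A minimal sketch of the pairs-of-letters idea (my own illustration, with a hypothetical helper name): tally non-overlapping pair frequencies, which could then be fed to the same Huffman construction:

```python
from collections import Counter

def pair_frequencies(text):
    """Relative frequencies of non-overlapping letter pairs in text."""
    pairs = Counter(text[i:i + 2] for i in range(0, len(text) - 1, 2))
    total = sum(pairs.values())
    return {p: c / total for p, c in pairs.items()}

print(pair_frequencies("banana bandana"))
```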
SLIDE 54

Entropy and Compression

  • Given a set of frequencies (probability distribution), the entropy is

$$H(f) = \sum_{i} f_i \cdot \log_2 \frac{1}{f_i}$$

  • Suppose that we generate string $S$ by choosing $n$ random letters independently with frequencies $f$
  • Any compression scheme requires at least $H(f)$ bits per letter to store $S$
  • Huffman codes are truly optimal if and only if there is no relationship between different letters!
SLIDE 55

Wrap-up

Reading and Extra Credit Assignment: will send out an announcement this evening
Tomorrow: Greedy algorithm for clustering and an application
Wednesday: Advanced topics and course wrap (let me know if you want to hear about something in particular!)
Thursday: Final exam review