CS 3000: Algorithms & Data Jonathan Ullman
Lecture 19:
- Data Compression
- Greedy Algorithms: Huffman Codes
Apr 5, 2018 (annotated Apr 8, 2020)
Data Compression
How do we store strings of text compactly?
Fixed-length code: map each symbol to a binary string of ⌈log₂ |Σ|⌉ bits.
A variable-length code can do better by giving shorter strings to frequent symbols.
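The fixed-length bound is easy to check; a minimal sketch in Python (the helper name is illustrative):

```python
import math

def fixed_length_bits(alphabet):
    """Bits per symbol for a fixed-length binary code over `alphabet`."""
    return math.ceil(math.log2(len(alphabet)))

# 26 lowercase letters need ceil(log2(26)) = 5 bits each.
print(fixed_length_bits("abcdefghijklmnopqrstuvwxyz"))  # 5
```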
Data Compression
Variable-length code: use shorter encodings for frequent letters, longer encodings for infrequent letters.

letter       a    b    c    d
frequency    1/2  1/4  1/8  1/8
Encoding 1   00   01   10   11    → 2.0 bits/letter
Encoding 2   0    10   110  111   → 1.75 bits/letter

Average for Encoding 2: f_a·1 + f_b·2 + f_c·3 + f_d·3 = 1/2 + 1/2 + 3/8 + 3/8 = 1.75
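The averages in the table can be checked directly; a small sketch using the frequencies and codes from the slide:

```python
freqs = {"a": 1/2, "b": 1/4, "c": 1/8, "d": 1/8}
encoding1 = {"a": "00", "b": "01", "c": "10", "d": "11"}
encoding2 = {"a": "0", "b": "10", "c": "110", "d": "111"}

def avg_bits(freqs, code):
    # Expected codeword length: sum over symbols of f_x * len(enc(x)).
    return sum(f * len(code[x]) for x, f in freqs.items())

print(avg_bits(freqs, encoding1))  # 2.0
print(avg_bits(freqs, encoding2))  # 1.75
```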
Data Compression
Morse code uses at most 4 symbols per letter (enough for 30 letters), but it is ambiguous:
Encode(KTS) = – ● – – ● ● ●
Decode(– ● – – ● ● ●) = many possibilities: K T S, but also T E TT S, T E TT E E E, ...
Goal: minimize the average number of bits per letter, given some frequencies for the letters.
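The ambiguity can be counted mechanically. A sketch restricted to the Morse codewords that appear on the slide (E, T, S, K):

```python
from functools import lru_cache

# Morse codewords for the letters on the slide (dot = '.', dash = '-').
CODES = {"E": ".", "T": "-", "S": "...", "K": "-.-"}

def count_decodings(s):
    """Number of ways to split s into codewords from CODES."""
    @lru_cache(maxsize=None)
    def ways(i):
        if i == len(s):
            return 1
        return sum(ways(i + len(c)) for c in CODES.values()
                   if s.startswith(c, i))
    return ways(0)

# Encode(KTS) = "-.-" + "-" + "..." = "-.--..."
print(count_decodings("-.--..."))  # 4 parses, e.g. K T S and T E T T S
```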
Prefix Free Codes
A code is prefix-free if for every x ≠ y ∈ Σ, enc(x) is not a prefix of enc(y).

a → 0    b → 10    c → 110    d → 111

is a prefix-free variable-length code.
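The definition translates directly into a check; a sketch (function name is illustrative):

```python
from itertools import permutations

def is_prefix_free(code):
    """True if no codeword in `code` is a prefix of another codeword."""
    return not any(y.startswith(x)
                   for x, y in permutations(code.values(), 2))

# The variable-length code from the slide is prefix-free...
print(is_prefix_free({"a": "0", "b": "10", "c": "110", "d": "111"}))  # True
# ...but a code where enc(a) is a prefix of enc(b) is not.
print(is_prefix_free({"a": "0", "b": "01"}))  # False
```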
Prefix Free Codes
Represent the code as a binary tree:
- each edge is labeled 0 or 1
- each leaf is labeled with a symbol of Σ
- enc(x) = the edge labels on the path from the root to the leaf labeled x
A code is prefix-free exactly when symbols appear only at the leaves: decoding follows edges from the root until it reaches a leaf, outputs that symbol, and restarts at the root.
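The decoding walk described above can be sketched with nested dicts standing in for the tree (representation is an assumption, not from the slides):

```python
def build_tree(code):
    """Turn {symbol: codeword} into a nested-dict binary tree."""
    root = {}
    for symbol, word in code.items():
        node = root
        for bit in word[:-1]:
            node = node.setdefault(bit, {})
        node[word[-1]] = symbol  # leaf: labeled with the symbol
    return root

def decode(bits, tree):
    out, node = [], tree
    for bit in bits:
        node = node[bit]
        if isinstance(node, str):  # reached a leaf
            out.append(node)
            node = tree            # restart at the root
    return "".join(out)

code = {"a": "0", "b": "10", "c": "110", "d": "111"}
tree = build_tree(code)
print(decode("010110111", tree))  # "abcd": prefix-freeness makes this unambiguous
```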
Huffman Codes
Goal: find the prefix-free code T that minimizes the average number of bits per letter

  len(T) = Σ_{a ∈ Σ} f_a · len_T(a)

where f_a is the frequency of letter a and len_T(a) is the length of a's codeword.

letter     a    b    c    d
frequency  1/2  1/4  1/8  1/8
encoding   0    10   110  111

len(T) = f_a·1 + f_b·2 + f_c·3 + f_d·3 = 1.75 bits per letter
Huffman Codes
Huffman's algorithm: merge the two symbols with the lowest frequency into one, and recurse.

letter     a    b    c    d    e
frequency  .32  .25  .20  .18  .05

Merging e (.05) with d (.18) gives a combined symbol of frequency .23; the recursion continues on the smaller alphabet.
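The greedy rule can be traced with a heap; a sketch that records only the merged frequencies, not the tree (rounding is just to dodge float noise):

```python
import heapq

def merge_trace(freqs):
    """Repeatedly merge the two lowest frequencies; return each merged value."""
    heap = list(freqs)
    heapq.heapify(heap)
    merges = []
    while len(heap) > 1:
        x = heapq.heappop(heap)        # two lowest frequencies...
        y = heapq.heappop(heap)
        merged = round(x + y, 2)       # ...merged into one symbol
        merges.append(merged)
        heapq.heappush(heap, merged)
    return merges

# a, b, c, d, e from the slide.
print(merge_trace([.32, .25, .20, .18, .05]))  # [0.23, 0.43, 0.57, 1.0]
```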
Huffman Codes
Merge the two symbols with the lowest frequency and recurse.

letter     a    b    c    d    e
frequency  .32  .25  .20  .18  .05

A first, suboptimal tree gives len = 2.25; Huffman's code gives len = 2.23.
Note: the longest codewords go to the least frequent letters.
Huffman Codes
Merge the two symbols with the lowest frequency and recurse:
  e (.05) + d (.18) → {d, e} with frequency .23
  c (.20) + {d, e} (.23) → {c, d, e} with frequency .43
  a (.32) + b (.25) → {a, b} with frequency .57
  {c, d, e} (.43) + {a, b} (.57) → the root
Huffman Codes
Merge the two symbols with the lowest frequency and recurse.
Theorem: Huffman's algorithm produces a prefix-free code of optimal length.
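A complete sketch of the algorithm using a heap, checked against the slide's example. Ties are broken arbitrarily, so only the codeword lengths are canonical; the tiebreak counter is an implementation convenience, not part of the algorithm:

```python
import heapq
from itertools import count

def huffman(freqs):
    """Return {symbol: codeword} for a Huffman code over freqs."""
    tiebreak = count()  # keeps heap entries comparable when frequencies tie
    heap = [(f, next(tiebreak), sym) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        fx, _, x = heapq.heappop(heap)   # two lowest-frequency subtrees...
        fy, _, y = heapq.heappop(heap)
        heapq.heappush(heap, (fx + fy, next(tiebreak), (x, y)))  # ...merged
    code = {}
    def assign(node, prefix):
        if isinstance(node, tuple):      # internal node: recurse on children
            assign(node[0], prefix + "0")
            assign(node[1], prefix + "1")
        else:                            # leaf: record the codeword
            code[node] = prefix
    assign(heap[0][2], "")
    return code

freqs = {"a": .32, "b": .25, "c": .20, "d": .18, "e": .05}
code = huffman(freqs)
avg = sum(f * len(code[s]) for s, f in freqs.items())
print(round(avg, 2))  # 2.23, matching the slide
```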
Huffman Codes
Claim 1: in an optimal prefix-free code, every internal node of the tree has exactly two children.
(A node with only one child can't happen: contracting that edge shortens every codeword below it.)
Consequence: if the lowest depth in the optimal code is d, then there are at least two leaves at depth d, and they are siblings.
Huffman Codes
Claim 2: there is an optimal prefix-free code where the two least frequent symbols x, y are siblings at the bottom of the tree (i.e., they have the lowest depths).
Proof idea: suppose someone gave you the shape of an optimal tree without labels. To minimize the average length, label the highest leaves with the most frequent symbols and go down, so the least frequent symbols end up deepest. By Claim 1, there are two sibling leaves at the lowest depth; an optimal labeling fills those siblings with the two least frequent symbols.
Huffman Codes
Inductive Step: if Huffman's algorithm is optimal for alphabets of size k − 1, then it is optimal for alphabets of size k.

Suppose we have frequencies f₁ ≥ f₂ ≥ ⋯ ≥ f_{k−1} ≥ f_k. Merge the two least frequent symbols k − 1 and k into a new symbol w with f_w = f_{k−1} + f_k, giving an alphabet Σ′ of size k − 1.

Let T′ be the Huffman code for Σ′, and let T be the Huffman code for Σ: T is T′ with the leaf w expanded into an internal node whose children are k − 1 and k. Then

  len(T) = len(T′) + f_{k−1} + f_k

By the inductive hypothesis, T′ is an optimal code for Σ′ (it minimizes len).

Now suppose U is an optimal code for Σ. By Claim 2, we may assume k − 1 and k are siblings at the lowest level of U's tree. Let U′ be the code for Σ′ obtained from U by replacing the parent of k − 1 and k with a leaf labeled w. Then

  len(U′) = len(U) − (f_{k−1} + f_k)

Since T′ is optimal for Σ′, len(T′) ≤ len(U′), and therefore

  len(T) = len(T′) + f_{k−1} + f_k ≤ len(U′) + f_{k−1} + f_k = len(U)

so T is optimal.
Huffman Codes
Lemma: let f₁, …, f_k be frequencies whose two lowest are f_{k−1} and f_k, and let f_w = f_{k−1} + f_k. If T′ is an optimal prefix-free code for f₁, …, f_{k−2}, f_w, then the code T obtained by expanding the leaf w into children k − 1 and k is optimal for f₁, …, f_k.

This is exactly the recursion Huffman's algorithm performs, so by induction its output is optimal.
An Experiment
Compressing a sample text file:

        Raw      Huffman
Size    799,940  439,688
Huffman Codes
Recap: merge the two symbols with the lowest frequency and recurse; the result is a prefix-free code of optimal length.
(Bonus material… will not test you on this)
Length of Huffman Codes
Suppose f_a = 2^{−ℓ_a} for every a ∈ Σ, for integers ℓ_a.

letter     a     b     c     d
frequency  2⁻¹   2⁻²   2⁻³   2⁻³
code       0     10    110   111
length     1     2     3     3

Then the Huffman code gives letter a a codeword of length exactly ℓ_a = log₂(1/f_a).
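For power-of-two frequencies the lengths log₂(1/f_a) come out as integers and match the table above; a quick check:

```python
import math

freqs = {"a": 1/2, "b": 1/4, "c": 1/8, "d": 1/8}

# Codeword length for each letter when f_a = 2^{-l_a}: l_a = log2(1/f_a).
lengths = {s: math.log2(1 / f) for s, f in freqs.items()}
print(lengths)  # {'a': 1.0, 'b': 2.0, 'c': 3.0, 'd': 3.0}

# Average bits per letter, matching the code 0, 10, 110, 111.
avg = sum(f * lengths[s] for s, f in freqs.items())
print(avg)      # 1.75
```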
Length of Huffman Codes
In general the frequencies are not powers of two, but rounding up to ℓ_a = ⌈log₂(1/f_a)⌉ still yields a prefix-free code, so the Huffman code satisfies

  len(T) ≤ Σ_{a ∈ Σ} f_a · log₂(1/f_a) + 1
Entropy
Given frequencies f (a distribution), the entropy is

  H(f) = Σ_{a ∈ Σ} f_a · log₂(1/f_a)

which is approximately the length of the Huffman code.
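For the running example, the entropy can be computed and compared with the Huffman length of 2.23 from the earlier slide; a sketch:

```python
import math

def entropy(freqs):
    """H(f) = sum over letters of f_a * log2(1 / f_a)."""
    return sum(f * math.log2(1 / f) for f in freqs)

# Frequencies of a, b, c, d, e from the running example.
H = entropy([.32, .25, .20, .18, .05])
print(round(H, 3))  # ~2.152: within 1 bit of the Huffman length 2.23
```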
Entropy
Given frequencies f (a distribution), the entropy is

  H(f) = Σ_{a ∈ Σ} f_a · log₂(1/f_a)

Entropy is the foundational concept in information theory: it measures how random the text is.
Entropy of Passwords
If we model passwords as random draws from some distribution, the entropy of that distribution measures how hard they are to guess "on average".
Entropy and Compression
Theorem (Shannon): if a text of length n consists of random letters drawn independently with frequencies f, then the entropy

  H(f) = Σ_{a ∈ Σ} f_a · log₂(1/f_a)

is the optimal number of bits-per-letter needed to store the text (as n → ∞), approximately the length of the Huffman code.
But Wait!
        Raw      Huffman  gzip     bzip2
Size    799,940  439,688  301,295  220,156

What do the frequencies represent? Real text has patterns between letters; gzip and bzip2 identify patterns based on the data.
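The gap between Huffman and gzip/bzip2 comes from patterns across letters, which per-letter codes can't see. A sketch with Python's zlib (DEFLATE, the algorithm behind gzip): the text below has per-letter entropy of 1 bit, so any per-letter code needs at least 10,000 bits = 1,250 bytes, yet zlib exploits the repetition and lands far below that:

```python
import zlib

# 'a' and 'b' each appear with frequency 1/2, so H(f) = 1 bit/letter:
# a per-letter code needs >= 10000 bits = 1250 bytes for this text.
text = ("ab" * 5000).encode()
compressed = zlib.compress(text, 9)
print(len(text), len(compressed))  # compressed size is well under 1250 bytes
```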
Entropy and Compression
Theorem (Shannon): if a text consists of random letters drawn independently with frequencies f, then the entropy

  H(f) = Σ_{a ∈ Σ} f_a · log₂(1/f_a)

is the optimal number of bits-per-letter to store the text. The caveat: this assumes there is no relationship between different letters!