SLIDE 1

CS 3000: Algorithms & Data Jonathan Ullman

Lecture 19:

  • Data Compression
  • Greedy Algorithms: Huffman Codes

Apr 5, 2018

(updated Apr 8, 2020)

SLIDE 2

Data Compression

  • How do we store strings of text compactly?
  • A binary code is a mapping from Σ → {0,1}*
  • Simplest code: assign numbers 1, 2, …, |Σ| to each symbol, map to binary numbers of ⌈log₂ |Σ|⌉ bits

  • Morse Code:

[Figure: Morse code on the alphabet, a variable-length code, e.g. E → ●, A → ●–, D → –●●]
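To make the fixed-length scheme concrete, here is a minimal Python sketch (not from the slides; the function name is illustrative):

```python
import math

def fixed_length_code(alphabet):
    """Assign each symbol a number and write it with ceil(log2 |alphabet|) bits."""
    width = math.ceil(math.log2(len(alphabet)))
    return {sym: format(i, f'0{width}b') for i, sym in enumerate(alphabet)}

print(fixed_length_code('abcd'))   # {'a': '00', 'b': '01', 'c': '10', 'd': '11'}
```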

SLIDE 3

Data Compression

  • Letters have uneven frequencies!
  • Want to use short encodings for frequent letters, long encodings for infrequent letters

                a     b     c     d     avg. len.
    Frequency   1/2   1/4   1/8   1/8
    Encoding 1  00    01    10    11    2.0
    Encoding 2  0     10    110   111   1.75

  avg. len. of Encoding 2 = (1/2)·1 + (1/4)·2 + (1/8)·3 + (1/8)·3 = 1.75

SLIDE 4

Data Compression

  • What properties would a good code have?
  • Easy to encode a string
  • The encoding is short on average
  • Easy to decode a string?

Encode(KTS) = – ● – – ● ● ●
Decode(– ● – – ● ● ●) = ? Many possibilities: KTS, TETTEEE, … decoding is ambiguous!
Morse uses ≤ 4 bits per letter (30 symbols max!)

Goal: minimize the average bits per letter, given some frequencies.

SLIDE 5

Prefix Free Codes

  • Cannot decode if there are ambiguities
  • e.g. enc(“6”) is a prefix of enc(“9”)
  • Prefix-Free Code:
  • A binary enc: Σ → {0,1}* such that for every x ≠ y ∈ Σ, enc(x) is not a prefix of enc(y)

  • Any fixed-length code is prefix-free

[Figure: two variable-length codes on {a, b, c, d}; the code a → 0, b → 10, c → 110, d → 111 is a prefix-free variable-length code]
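The definition is easy to test in code; a minimal sketch (not from the slides):

```python
def is_prefix_free(code):
    """True if no codeword is a prefix of another codeword."""
    words = sorted(code.values())
    # After sorting, a codeword and any extension of it end up adjacent,
    # so checking neighboring pairs suffices.
    return all(not nxt.startswith(cur) for cur, nxt in zip(words, words[1:]))

print(is_prefix_free({'a': '0', 'b': '10', 'c': '110', 'd': '111'}))  # True
print(is_prefix_free({'a': '0', 'b': '01'}))                          # False
```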

SLIDE 6

Prefix Free Codes

  • Can represent a prefix-free code as a tree

  • Encode by going up the tree (or using a table)
  • d a b → 0 0 1 1 0 0 1 1
  • Decode by going down the tree
  • 0 1 1 0 0 0 1 0 0 1 0 1 0 1 0 1 1

[Figure: a binary tree with each leaf labeled with a symbol; edges to the left are 0, edges to the right are 1]
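A minimal decoding sketch in Python, assuming the prefix-free code a → 0, b → 10, c → 110, d → 111 from the earlier example (the Node class is illustrative):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    symbol: Optional[str] = None       # set only at leaves
    left: Optional["Node"] = None      # the 0-edge
    right: Optional["Node"] = None     # the 1-edge

def decode(root: Node, bits: str) -> str:
    """Walk down the tree; emit a symbol at each leaf and restart at the root."""
    out, node = [], root
    for b in bits:
        node = node.left if b == '0' else node.right
        if node.symbol is not None:
            out.append(node.symbol)
            node = root
    return ''.join(out)

# Tree for the code a -> 0, b -> 10, c -> 110, d -> 111:
tree = Node(left=Node('a'),
            right=Node(left=Node('b'),
                       right=Node(left=Node('c'), right=Node('d'))))
print(decode(tree, '111010'))   # 'dab'
```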

SLIDE 7

Huffman Codes

  • (An algorithm to find) an optimal prefix-free code
  • optimal = min over prefix-free codes T of
    len(T) = Σ_{x∈Σ} f_x · len_T(x)   (the average number of bits per letter)

  • Note, optimality depends on what you’re compressing
  • H is the 8th most frequent letter in English (6.094%) but the 20th most frequent in Italian (0.636%)

                a     b     c     d
    Frequency   1/2   1/4   1/8   1/8
    Encoding    0     10    110   111

  len(T) = f_a·1 + f_b·2 + f_c·3 + f_d·3 = (1/2)·1 + (1/4)·2 + (1/8)·3 + (1/8)·3 = 1.75
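The quantity len(T) is easy to compute directly; a small sketch (names are illustrative):

```python
def avg_len(freqs, code):
    """len(T) = sum over symbols of f_x * len_T(x): the average bits per letter."""
    return sum(freqs[x] * len(code[x]) for x in freqs)

print(avg_len({'a': 1/2, 'b': 1/4, 'c': 1/8, 'd': 1/8},
              {'a': '0', 'b': '10', 'c': '110', 'd': '111'}))   # 1.75
```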

SLIDE 8

Huffman Codes

  • First Try: split letters into two sets of roughly equal frequency and recurse
  • Balanced binary trees should have low depth

                a     b     c     d     e
    Frequency   .32   .25   .20   .18   .05

[Figure: the balanced-split tree, giving codewords a → 00, d → 01, b → 10, c → 110, e → 111, with subtree weights marked]

SLIDE 9

Huffman Codes

  • First Try: split letters into two sets of roughly equal frequency and recurse

    first try: len = 2.25
    optimal:   len = 2.23

                a     b     c     d     e
    Frequency   .32   .25   .20   .18   .05

[Figure: the first-try tree next to the optimal tree; the optimal tree gives the longest codeword to the least frequent letter]

SLIDE 10

Huffman Codes

  • Huffman’s Algorithm: pair up the two letters with the lowest frequency and recurse

                a     b     c     d     e
    Frequency   .32   .25   .20   .18   .05

[Figure: the merges performed by the algorithm: {d, e} (.23), then {c, d, e} (.43), then {a, b} (.57), then the root (1.0)]
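A compact Python sketch of Huffman’s algorithm (not from the slides): a heap keeps the two lowest-frequency trees at hand, and each merge pushes a combined tree back.

```python
import heapq
from itertools import count

def huffman(freqs):
    """Huffman's algorithm: repeatedly merge the two lowest-frequency trees."""
    ids = count()  # tie-breaker so the heap never compares tree structures
    heap = [(f, next(ids), sym) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # lowest frequency
        f2, _, right = heapq.heappop(heap)   # second lowest
        heapq.heappush(heap, (f1 + f2, next(ids), (left, right)))
    _, _, tree = heap[0]

    code = {}
    def walk(node, prefix):
        if isinstance(node, tuple):          # internal node: (left, right)
            walk(node[0], prefix + '0')
            walk(node[1], prefix + '1')
        else:                                # leaf: a symbol
            code[node] = prefix or '0'       # handles a 1-letter alphabet
    walk(tree, '')
    return code

print(huffman({'a': .32, 'b': .25, 'c': .20, 'd': .18, 'e': .05}))
# {'c': '00', 'e': '010', 'd': '011', 'b': '10', 'a': '11'}: average length 2.23
```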
SLIDE 11

Huffman Codes

  • Huffman’s Algorithm: pair up the two letters with the lowest frequency and recurse
  • Theorem: Huffman’s Algorithm produces a prefix-free code of optimal length

  • We’ll prove the theorem using an exchange argument
SLIDE 12

Huffman Codes

  • Theorem: Huffman’s Alg produces an optimal prefix-free code
  • (1) In an optimal prefix-free code (a tree), every internal node has exactly two children

  (If an internal node had only one child, contracting that edge would shorten every codeword below it while staying prefix-free, contradicting optimality.)

SLIDE 13
SLIDE 14

In the optimal code, if the lowest depth is d, then there are at least two leaves at depth d, and they are siblings. (A deepest leaf without a sibling can’t happen, by (1).)

SLIDE 15

Huffman Codes

  • Theorem: Huffman’s Alg produces an optimal prefix-free code
  • (2) If x, y have the lowest frequency, then there is an optimal code where x, y are siblings and are at the bottom of the tree

Suppose someone gave you the optimal tree, but without labels. Then you should label the highest leaves with the most frequent symbols and go down: f_a ≥ f_b ≥ f_c ≥ f_d ≥ …. By (1), there are two sibling leaves at the lowest depth, and the optimal labeling fills those siblings with the two least frequent symbols.
SLIDE 16

Huffman Codes

  • Theorem: Huffman’s Alg produces an optimal prefix-free code
  • Proof by Induction on the Number of Letters in Σ:
  • Base case (|Σ| = 2): rather obvious (use codewords 0 and 1)

Inductive Step: if Huffman’s algorithm is optimal for |Σ| = k − 1, then it’s optimal for |Σ| = k.

Suppose we have frequencies f_1 ≤ f_2 ≤ … ≤ f_k over Σ = {1, 2, 3, …, k}. Merge 1, 2 into a new letter w with f_w = f_1 + f_2, giving an alphabet Σ′ of size k − 1.
SLIDE 17

Huffman code T for Σ vs. Huffman code T′ for Σ′:

  • The two siblings 1, 2 in T collapse to the single leaf w in T′, so
    len(T) = len(T′) + f_1 + f_2
  • By the inductive hypothesis, T′ is an optimal code for Σ′ (it minimizes len(T′)).
  • Suppose U is an optimal code for Σ. By (2), we may assume letters 1 and 2 are siblings at the lowest level of U’s tree.
  • Replacing those siblings with a single leaf labeled w gives a prefix-free code U′ for Σ′ with len(U) = len(U′) + f_1 + f_2.
  • Then len(U) = len(U′) + f_1 + f_2 ≥ len(T′) + f_1 + f_2 = len(T), so T is optimal.
SLIDE 18

Huffman Codes

  • Theorem: Huffman’s Alg produces an optimal prefix-free code
  • Proof by Induction on the Number of Letters in Σ:
  • Inductive Hypothesis:
SLIDE 19

Huffman Codes

  • Theorem: Huffman’s Alg produces an optimal prefix-free code
  • Proof by Induction on the Number of Letters in Σ:
  • Inductive Hypothesis:
  • Without loss of generality, frequencies are f_1, …, f_k, and the two lowest are f_1, f_2
  • Merge 1, 2 into a new letter k + 1 with f_{k+1} = f_1 + f_2
SLIDE 20

Huffman Codes

  • Theorem: Huffman’s Alg produces an optimal prefix-free code
  • Proof by Induction on the Number of Letters in Σ:
  • Inductive Hypothesis:
  • Without loss of generality, frequencies are f_1, …, f_k, and the two lowest are f_1, f_2
  • Merge 1, 2 into a new letter k + 1 with f_{k+1} = f_1 + f_2
  • By induction, if T′ is the Huffman code for f_3, …, f_{k+1}, then T′ is optimal
  • Need to prove that T is optimal for f_1, …, f_k
SLIDE 21

Huffman Codes

  • Theorem: Huffman’s Alg produces an optimal prefix-free code
  • If T′ is optimal for f_3, …, f_{k+1}, then T is optimal for f_1, …, f_k
SLIDE 22

An Experiment

  • Take the Dickens novel A Tale of Two Cities
  • File size is 799,940 bytes
  • Build a Huffman code and compress
  • File size is now 439,688 bytes

            Raw       Huffman
    Size    799,940   439,688

  (Huffman compresses to about 55% of the original size.)
SLIDE 23

Huffman Codes

  • Huffman’s Algorithm: pair up the two letters with the lowest frequency and recurse
  • Theorem: Huffman’s Algorithm produces a prefix-free code of optimal length

  • In what sense is this code really optimal?

(Bonus material… will not test you on this)

SLIDE 24

Length of Huffman Codes

  • What can we say about Huffman code length?
  • Suppose f_x = 2^{−ℓ_x} for every x ∈ Σ (for integers ℓ_x)
  • Then, len_T(x) = ℓ_x for the optimal Huffman code
  • Proof:

[Example: letters a, b, c, d with frequencies 2^{−1}, 2^{−2}, 2^{−3}, 2^{−3} get codewords 0, 10, 110, 111, of lengths 1, 2, 3, 3]
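As a quick sanity check (not from the slides), running the huffman() sketch from the Slide 10 example on these dyadic frequencies recovers exactly the lengths ℓ_x:

```python
# Reuses huffman() from the sketch after Slide 10.
freqs = {'a': 1/2, 'b': 1/4, 'c': 1/8, 'd': 1/8}    # f_x = 2^(-l_x)
code = huffman(freqs)
print({x: len(w) for x, w in code.items()})          # {'a': 1, 'b': 2, 'c': 3, 'd': 3}
```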

SLIDE 25

Length of Huffman Codes

  • What can we say about Huffman code length?
  • Suppose f_x = 2^{−ℓ_x} for every x ∈ Σ
  • Then, len_T(x) = ℓ_x for the optimal Huffman code
  • len(T) = Σ_{x∈Σ} f_x · ℓ_x = Σ_{x∈Σ} f_x · log₂(1/f_x)

  (since f_x = 2^{−ℓ_x} means ℓ_x = log₂(1/f_x))
SLIDE 26

Entropy

  • Given a set of frequencies (aka a probability distribution), the entropy is

    H(f) = Σ_x f_x · log₂(1/f_x)

  • Entropy is a “measure of randomness”
  • (H(f) equals the length of the Huffman code when all frequencies are powers of 2)
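The formula translates directly to code; a one-line sketch:

```python
import math

def entropy(freqs):
    """H(f) = sum over x of f_x * log2(1/f_x)."""
    return sum(f * math.log2(1 / f) for f in freqs.values() if f > 0)

print(entropy({'a': 1/2, 'b': 1/4, 'c': 1/8, 'd': 1/8}))   # 1.75
```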

SLIDE 27

Entropy

  • Given a set of frequencies (aka a probability distribution), the entropy is

    H(f) = Σ_x f_x · log₂(1/f_x)

  • Entropy is a “measure of randomness”: how random is the text?
  • Entropy was introduced by Shannon in 1948 and is the foundational concept in:
  • Data compression
  • Error correction (communicating over noisy channels)
  • Security (passwords and cryptography)

SLIDE 28

Entropy of Passwords

  • Your password is a specific string, so f_password = 1.0
  • To talk about security of passwords, we have to model them as random
  • Random 16-letter string: H = 16 · log₂ 26 ≈ 75.2
  • Random IMDb movie: H = log₂ 1,764,727 ≈ 20.7
  • Your favorite IMDb movie: H ≪ 20.7
  • Entropy measures how difficult passwords are to guess “on average”
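The two numbers above are easy to reproduce:

```python
import math

print(16 * math.log2(26))      # about 75.2 bits: random 16-letter string
print(math.log2(1_764_727))    # about 20.7 bits: uniformly random IMDb movie
```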

SLIDE 29

Entropy of Passwords

SLIDE 30

Entropy and Compression

  • Given a set of frequencies (probability distribution), the entropy is

    H(f) = Σ_x f_x · log₂(1/f_x)

  • Suppose that we generate string S by choosing n random letters independently with frequencies f
  • Any compression scheme requires at least H(f) bits per letter to store S (as n → ∞)
  • Huffman codes are truly optimal! (H(f) is essentially the length of the Huffman code)

SLIDE 31

But Wait!

  • Take the Dickens novel A Tale of Two Cities
  • File size is 799,940 bytes
  • Build a Huffman code and compress
  • File size is now 439,688 bytes
  • But we can do better!

            Raw       Huffman   gzip      bzip2
    Size    799,940   439,688   301,295   220,156

SLIDE 32

What do the frequencies represent?

  • Real data (e.g. natural language, music, images) have patterns between letters

  • U becomes a lot more common after a Q
  • Possible approach: model pairs of letters
  • Build a Huffman code for pairs-of-letters (see the sketch after this list)
  • Improves compression ratio, but the tree gets bigger
  • Can only model certain types of patterns
  • Zip is based on Lempel-Ziv compression, which tries to identify repeated patterns in the data
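A hedged sketch of the pairs-of-letters idea (pair_freqs is illustrative, the commented line reuses the huffman() sketch from Slide 10, and the filename is hypothetical):

```python
from collections import Counter

def pair_freqs(text):
    """Empirical frequencies over non-overlapping pairs of letters."""
    pairs = [text[i:i + 2] for i in range(0, len(text) - 1, 2)]
    return {p: c / len(pairs) for p, c in Counter(pairs).items()}

# code_for_pairs = huffman(pair_freqs(open('two_cities.txt').read()))
```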

SLIDE 33

Entropy and Compression

  • Given a set of frequencies (probability distribution), the entropy is

    H(f) = Σ_x f_x · log₂(1/f_x)

  • Suppose that we generate string S by choosing n random letters independently with frequencies f
  • Any compression scheme requires at least H(f) bits per letter to store S
  • Huffman codes are truly optimal if and only if there is no relationship between different letters!