CSE 417 Algorithms Winter 2007 Huffman Codes: An Optimal Data - PowerPoint PPT Presentation

CSE 417 Algorithms Winter 2007 Huffman Codes: An Optimal Data Compression Method 1

a 45% b 13% Compression Example c 12% d 16% e 9% f 5% 100k file, 6 letter alphabet: File Size: ASCII, 8 bits/char: 800kbits 2 3 > 6; 3 bits/char: 300kbits Why? Storage, transmission vs 1Ghz cpu 2

a 45% b 13% Compression Example c 12% d 16% e 9% f 5% 100k file, 6 letter alphabet: File Size: E.g.: Why not: ASCII, 8 bits/char: 800kbits a 00 00 2 3 > 6; 3 bits/char: 300kbits b 01 01 better: d 10 10 2.52 bits/char 74%*2 +26%*4 : 252kbits c 1100 110 e 1101 1101 Optimal? f 1110 1110 1101110 = cf or ec? 3

Data Compression Binary character code (“code”) each k-bit source string maps to unique code word (e.g. k=8) “compression” alg: concatenate code words for successive k-bit “characters” of source Fixed/variable length codes all code words equal length? Prefix codes no code word is prefix of another (unique decoding) 4

a 45% b 13% Prefix Codes = Trees c 12% d 16% e 9% f 5% 1 0 1 0 0 0 0 0 1 1 1 0 0 0 1 0 1 f a b f a b

a 45% b 13% Greedy Idea #1 c 12% d 16% e 9% f 5% Put most frequent under root, then recurse … 100 . a:45 . . . . 6

a 45% b 13% Greedy Idea #1 c 12% d 16% e 9% f 5% Put most frequent under root, then recurse 100 Too greedy: a:45 55 unbalanced tree .45*1 + .16*2 + .13*3 … = 2.34 not too bad, but imagine if all d:16 29 freqs were ~1/6: (1+2+3+4+5+5)/6=3.33 . . b:13 . 7

a 45% b 13% Greedy Idea #2 c 12% d 16% e 9% f 5% Divide letters into 2 groups, with ~50% 100 weight in each; recurse (Shannon-Fano code) 50 50 Again, not terrible 2*.5+3*.5 = 2.5 But this tree a:45 f:5 25 25 can easily be improved! (How?) b:13 c:12 d:16 e:9 8

a 45% b 13% Greedy idea #3 c 12% d 16% e 9% f 5% Group least frequent letters near bottom 100 . . . . . . 25 14 c:12 b:13 f:5 e:9 9

.45*1 + .41*3 + .14*4 = 2.24 bits per char

Huffman’s Algorithm (1952) Algorithm: insert node for each letter into priority queue by freq while queue length > 1 do remove smallest 2; call them x, y make new node z from them, with f(z) = f(x) + f(y) insert z into queue Analysis: O(n) heap ops: O(n log n) Goal: Minimize � B ( T ) = freq(c)*depth(c) c � C Correctness : ??? 12

Correctness Strategy Optimal solution may not be unique, so cannot prove that greedy gives the only possible answer. Instead, show that greedy’s solution is as good as any. 13

Defn: A pair of leaves is an inversion if depth(x) ≥ depth(y) and freq(x) ≥ freq(y) Claim: If we flip an inversion, cost never increases. Why? All other things being equal, better to give more frequent letter the shorter code. before after (d(x)*f(x) + d(y)*f(y)) - (d(x)*f(y) + d(y)*f(x)) = (d(x) - d(y)) * (f(x) - f(y)) ≥ 0 I.e. non-negative cost savings.

Lemma 1: “Greedy Choice Property” The 2 least frequent letters might as well be siblings at deepest level Let a be least freq, b 2 nd Let u, v be siblings at max depth, f(u) ≤ f(v) (why must they exist?) Then (a,u) and (b,v) are inversions. Swap them. 15

Lemma 2 Let (C, f) be a problem instance: C an n-letter alphabet with letter frequencies f(c) for c in C. For any x, y in C, let C’ be the (n-1) letter alphabet C - {x,y} ∪ {z} and for all c in C’ define f'(c) = � f(c), if c � x,y,z � f(x) + f(y), if c = z � Let T’ be an optimal tree for (C’,f’). Then T’ = z T x y is optimal for (C,f) among all trees having x,y as siblings 16

Proof: � B ( T ) = d T ( c ) � f ( c ) c � C B ( T ) � B ( T ') = d T ( x ) � ( f ( x ) + f ( y )) � d T ' ( z ) � f '( z ) = ( d T ' ( z ) + 1) � f '( z ) � d T ' ( z ) � f '( z ) = f '( z ) ˆ T Suppose (having x & y as siblings) is better than T, i.e. ˆ B ( ˆ T ' Collapse x & y to z, forming ; as above: T ) < B ( T ). B ( ˆ ) � B ( ˆ T T ') = f '( z ) Then: B ( ˆ ') = B ( ˆ T T ) � f '( z ) < B ( T ) � f '( z ) = B ( T ') Contradicting optimality of T’

Theorem: Huffman gives optimal codes Proof: induction on |C| Basis: n=1,2 – immediate Induction: n>2 Let x,y be least frequent Form C´, f´, & z, as above By induction, T´ is opt for (C ´,f´) By lemma 2, T´ → T is opt for (C,f) among trees with x,y as siblings By lemma 1, some opt tree has x, y as siblings Therefore, T is optimal. 18

Data Compression Huffman is optimal. BUT still might do better! Huffman encodes fixed length blocks. What if we vary them? Huffman uses one encoding throughout a file. What if characteristics change? What if data has structure? E.g. raster images, video,… Huffman is lossless. Necessary? LZW, MPEG, … 19

20 David A. Huffman, 1925-1999

CSE 417 Algorithms Winter 2007 Huffman Codes: An Optimal Data - PowerPoint PPT Presentation

CSE 417 Algorithms Winter 2007 Huffman Codes: An Optimal Data Compression Method 1 a 45% b 13% Compression Example c 12% d 16% e 9% f 5% 100k file, 6 letter alphabet: File Size: ASCII, 8 bits/char: 800kbits 2 3 > 6; 3

CSE 417: Algorithms and Computational Complexity 1: Organization & Overview Winter 2006

I: Organization & Overview Winter 2007 Larry Ruzzo 1 http://www.cs.washington.edu/417 2

CSE 417: Algorithms and Computational Complexity 4: Dynamic Programming, I Fibonacci Winter

The Winter Walk at Wisley The Winter Walk at Wisley The Winter Walk at Wisley The Winter Walk at

Lecture 4 Additional Slides CSE 344, Winter 2014 Sudeepa Roy CSE 344 - Winter 2014 1 NOTE:

Dynamic Programming CSE 417: Algorithms and Outline: Computational Complexity General

CSE 417 c 12% Compression Example d 16% e 9% Algorithms f 5% 100k file, 6 letter

Announcements CSE 590f seminar Wednesday, 4pm, CSE 403 CSE 477, Winter/Spring 2009 UW

CSE202: Design and Analysis of Algorithms Ragesh Jaiswal, CSE, UCSD Ragesh Jaiswal, CSE, UCSD

CSE 3401 Functional and Logic Programming York University CSE 3401 Vida Movahedi 1 York University

Second quarter 2016 Results ING posts 2Q16 underlying net profit of EUR 1,417 million Ralph

18.417 Introduction to Computational Molecular Biology Foundations of Structural

Protein Structure Prediction Protein = chain of amino acids (AA) aa connected by peptide

Lecture 23 Introduction to Bode Plots CL-417 Process Control Prof. Kannan M. Moudgalya IIT

Predicting Protein Folding Paths S.Will, 18.417, Fall 2011 Protein Folding by Robotics S.Will,

Lecture 11 Controller Specifications CL-417 Process Control Prof. Kannan M. Moudgalya IIT

CSE 421 Algorithms Huffman Codes: An Optimal Data Compression Method 1 a 45% b 13%

Arithmetic Coding Mathias Winther Madsen mathias.winther@gmail.com Institute for Logic,

A First Look at Modern Enterprise Traffic Ruoming Pang , Princeton University Mark Allman ( ICSI

Challenges for Polarimetry at the ILC Spin Tracking Studies Moritz Beckmann, Jenny List DESY -

Formal Modeling in Cognitive Science Source Codes Lecture 30: Codes; Kraft Inequality; Source

Space- and Time-Efficient Data Structures for Massive Datasets Giulio Ermanno Pibiri

Applications of Galois Geometries to Coding Theory and Cryptography Leo Storme Ghent University

Introduction to Information by Erol Seke For the course Communications OSMANGAZI

Sambuz

Useful Links

Newsletter

Mail Us

CSE 417 Algorithms Winter 2007 Huffman Codes: An Optimal Data - PowerPoint PPT Presentation

CSE 417 Algorithms Winter 2007 Huffman Codes: An Optimal Data Compression Method 1 a 45% b 13% Compression Example c 12% d 16% e 9% f 5% 100k file, 6 letter alphabet: File Size: ASCII, 8 bits/char: 800kbits 2 3 > 6; 3

CSE 417: Algorithms and Computational Complexity 1: Organization &amp; Overview Winter 2006

I: Organization &amp; Overview Winter 2007 Larry Ruzzo 1 http://www.cs.washington.edu/417 2

CSE 417: Algorithms and Computational Complexity 4: Dynamic Programming, I Fibonacci Winter

The Winter Walk at Wisley The Winter Walk at Wisley The Winter Walk at Wisley The Winter Walk at

Lecture 4 Additional Slides CSE 344, Winter 2014 Sudeepa Roy CSE 344 - Winter 2014 1 NOTE:

Dynamic Programming CSE 417: Algorithms and Outline: Computational Complexity General

CSE 417 c 12% Compression Example d 16% e 9% Algorithms f 5% 100k file, 6 letter

Announcements CSE 590f seminar Wednesday, 4pm, CSE 403 CSE 477, Winter/Spring 2009 UW

CSE202: Design and Analysis of Algorithms Ragesh Jaiswal, CSE, UCSD Ragesh Jaiswal, CSE, UCSD

CSE 3401 Functional and Logic Programming York University CSE 3401 Vida Movahedi 1 York University

Second quarter 2016 Results ING posts 2Q16 underlying net profit of EUR 1,417 million Ralph

18.417 Introduction to Computational Molecular Biology Foundations of Structural

Protein Structure Prediction Protein = chain of amino acids (AA) aa connected by peptide

Lecture 23 Introduction to Bode Plots CL-417 Process Control Prof. Kannan M. Moudgalya IIT

Predicting Protein Folding Paths S.Will, 18.417, Fall 2011 Protein Folding by Robotics S.Will,

Lecture 11 Controller Specifications CL-417 Process Control Prof. Kannan M. Moudgalya IIT

CSE 421 Algorithms Huffman Codes: An Optimal Data Compression Method 1 a 45% b 13%

Arithmetic Coding Mathias Winther Madsen mathias.winther@gmail.com Institute for Logic,

A First Look at Modern Enterprise Traffic Ruoming Pang , Princeton University Mark Allman ( ICSI

Challenges for Polarimetry at the ILC Spin Tracking Studies Moritz Beckmann, Jenny List DESY -

Formal Modeling in Cognitive Science Source Codes Lecture 30: Codes; Kraft Inequality; Source

Space- and Time-Efficient Data Structures for Massive Datasets Giulio Ermanno Pibiri

Applications of Galois Geometries to Coding Theory and Cryptography Leo Storme Ghent University

Introduction to Information by Erol Seke For the course Communications OSMANGAZI

Sambuz

Useful Links

Newsletter

Mail Us

CSE 417: Algorithms and Computational Complexity 1: Organization & Overview Winter 2006

I: Organization & Overview Winter 2007 Larry Ruzzo 1 http://www.cs.washington.edu/417 2