RadixZip: Linear Time Compression of Token Streams Binh Vo - PowerPoint PPT Presentation

RadixZip: Linear Time Compression of Token Streams
Binh Vo <binh@google.com>
Gurmeet Singh Manku <manku@google.com>
Google Inc., USA

Data of interest
● Collections of records:
  – Databases.
  – Logs (query or ad-clicks at Google).
  – Tables (telephone records at AT&T).
● Transposing into collections of columns:
  – Faster lookup of specific attributes.
  – Improved compression.
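As a small illustration of the transposition step, the sketch below turns a list of records into one list per attribute. The record layout (user, IP, browser) and the name `to_columns` are hypothetical, chosen only for this example.

```python
def to_columns(records):
    """Transpose a list of equal-length records into a list of columns."""
    return [list(column) for column in zip(*records)]


# Hypothetical log records: (user, ip, browser).
records = [
    ("alice", "10.0.0.1", "firefox"),
    ("bob", "10.0.0.2", "chrome"),
    ("alice", "10.0.0.1", "firefox"),
]

users, ips, browsers = to_columns(records)
print(users)     # all values of the first attribute, in record order
print(browsers)  # all values of the third attribute, in record order
```

Each resulting column holds values of a single type, which is what makes per-column lookup and per-column compression possible.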

Context sorting compressors
● BZip - 1994 (Burrows, Wheeler, Seward).
  – General purpose compression.
  – Based on the BWT (suffix sorting).
● Vczip - 2004 (Vo and Vo).
  – Fixed width table compression.
  – Based on column dependency (predictor sorting).
● Common theme: sort data by some context.
  – A context is any string which helps 'predict' the target.
  – Similar to sorting the target if the prediction is accurate.
  – But reversible!

BWT: Suffixes as a context
● Transformed data is more compressible.
  – Bzip = BWT + Move-to-Front + Run-Length + Huffman
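To make the idea of sorting by a suffix context concrete, here is a deliberately naive BWT sketch that sorts all rotations of the input. It is for illustration only: it takes quadratic space and is not how Bzip computes the transform. The function name and the `$` sentinel are assumptions of this example.

```python
def bwt_naive(text, sentinel="$"):
    """Burrows-Wheeler transform via sorted rotations (illustrative, not efficient).

    The sentinel must not occur in `text`; it marks the end of the string
    so that the transform can be inverted.
    """
    assert sentinel not in text
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The output is the last character of each sorted rotation, i.e. the
    # character that precedes each sorted context.
    return "".join(rotation[-1] for rotation in rotations)


print(bwt_naive("banana"))  # annb$aa
```

Characters that share a context end up adjacent, which is what the later Move-to-Front and Run-Length stages exploit.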

Column-specific properties
● Boundary awareness:
  – Byte indices.
  – Intra-token contexts.
● Multi-column context:
  – Dependency.
  – E.g. a user with a fixed IP and browser.

Token-specific redundancy
● Boundary awareness:
  – Byte indices.
  – Intra-token contexts.
● Multi-column context:
  – Dependency.
  – E.g. a user with a fixed IP and browser.

RadixZipTransform
● For each byte column i:
  – Sort the tokens by the prefixes formed from earlier columns.
  – Append the reordered column i to the output.
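The per-column rule above can be sketched directly with a comparison sort. This is a minimal sketch under stated assumptions, not the authors' implementation: tokens are equal-length `bytes` objects, ties keep their original order, and each prefix is compared nearest byte first (the order that repeated stable radix passes would produce). The name `radixzip_transform_naive` is hypothetical.

```python
def radixzip_transform_naive(tokens):
    """Return one output column per byte position of the fixed-width tokens.

    Column i is emitted in the order obtained by stably sorting the tokens
    on their first i bytes, compared nearest byte first (an assumption of
    this sketch).
    """
    if not tokens:
        return []
    width = len(tokens[0])
    assert all(len(token) == width for token in tokens)
    columns = []
    for i in range(width):
        # sorted() is stable, so tokens with equal prefixes keep input order.
        order = sorted(range(len(tokens)), key=lambda row: tokens[row][:i][::-1])
        columns.append(bytes(tokens[row][i] for row in order))
    return columns


tokens = [b"cab", b"aab", b"cba", b"aac"]
print(radixzip_transform_naive(tokens))  # [b'caca', b'aaab', b'bcba']
```

Column 0 has an empty prefix, so it is emitted in input order; later columns group bytes that follow the same prefix.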

Linear Time
● Perform a radix sort.
● Append one column to the output before each sorting iteration.
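One way to read these two bullets is as a loop that emits a byte column in the current token order and then refines that order with a single stable counting-sort pass on the bytes just emitted. The sketch below follows that reading and includes an inverse to show the transform is reversible. Function names are hypothetical, tokens are assumed to be equal-length `bytes`, and this is not claimed to match the authors' code.

```python
def stable_bucket_pass(order, keys):
    """Stable counting sort of `order`, where keys[j] is the byte key of order[j]."""
    counts = [0] * 256
    for key in keys:
        counts[key] += 1
    starts, total = [0] * 256, 0
    for byte in range(256):
        starts[byte] = total
        total += counts[byte]
    result = [0] * len(order)
    for row, key in zip(order, keys):
        result[starts[key]] = row
        starts[key] += 1
    return result


def radixzip_forward(tokens):
    """Emit each byte column in the current order, then refine the order."""
    width = len(tokens[0]) if tokens else 0
    order = list(range(len(tokens)))
    columns = []
    for i in range(width):
        column = bytes(tokens[row][i] for row in order)
        columns.append(column)
        order = stable_bucket_pass(order, column)
    return columns


def radixzip_inverse(columns):
    """Rebuild the tokens by replaying the same sequence of stable passes."""
    n = len(columns[0]) if columns else 0
    rows = [bytearray() for _ in range(n)]
    order = list(range(n))
    for column in columns:
        for row, byte in zip(order, column):
            rows[row].append(byte)
        order = stable_bucket_pass(order, column)
    return [bytes(row) for row in rows]


tokens = [b"cab", b"aab", b"cba", b"aac"]
columns = radixzip_forward(tokens)
print(columns)                              # [b'caca', b'aaab', b'bcba']
print(radixzip_inverse(columns) == tokens)  # True
```

Each pass touches every token once plus a fixed 256-entry table, so the total work is proportional to the number of bytes in the stream. The decoder can replay the passes because the keys of each pass are exactly the column it has just read.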

Compression benefits
● Preserves byte columns.
● Context sorted, but limited to token boundaries.
● Transformed data is more compressible:
  – RadixZip = RadixZipTransform + MTF + RLE + Huffman
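To show why context-sorted output suits the back end named above, here is a minimal sketch of the MTF and RLE stages (Huffman coding is omitted). These are textbook versions with hypothetical function names; the exact variants used inside RadixZip or Bzip may differ.

```python
def move_to_front(data):
    """Replace each byte by its current position in a recency list."""
    recency = list(range(256))
    ranks = []
    for byte in data:
        rank = recency.index(byte)
        ranks.append(rank)
        # Move the byte just seen to the front of the list.
        recency.pop(rank)
        recency.insert(0, byte)
    return ranks


def run_lengths(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return [(value, count) for value, count in runs]


ranks = move_to_front(b"aaabbb")
print(ranks)               # [97, 0, 0, 98, 0, 0]
print(run_lengths(ranks))  # [(97, 1), (0, 2), (98, 1), (0, 2)]
```

Runs of identical bytes become runs of zeros after MTF, which RLE shortens and an entropy coder such as Huffman can then encode cheaply.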

Performance
● Linear time complexity.
● Memory properties:
  – Requires 8 bytes per token.
  – Cache-friendly.
● Comparison to BWT:
  – Faster than currently known BWT implementations.
  – Also uses less memory.
  – Simple to implement, which makes for robust code.

Inter-column dependency
● Passing permutations equivalent to presorting.
● Passed permutations continue to propagate.
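One possible reading of "passing permutations" is that the final token order computed for a predictor stream is used as the starting order for a dependent stream, so the dependent stream begins already sorted by its predictor. The sketch below illustrates that reading only. The `order` parameter, the function names, and the sample data are assumptions of this example, not details taken from the slides.

```python
def stable_pass(order, keys):
    """Stable bucket sort of `order`, where keys[j] is the byte key of order[j]."""
    buckets = [[] for _ in range(256)]
    for row, key in zip(order, keys):
        buckets[key].append(row)
    return [row for bucket in buckets for row in bucket]


def forward(tokens, order=None):
    """Transform fixed-width tokens; return (columns, final order)."""
    order = list(range(len(tokens))) if order is None else list(order)
    columns = []
    for i in range(len(tokens[0])):
        column = bytes(tokens[row][i] for row in order)
        columns.append(column)
        order = stable_pass(order, column)
    return columns, order


def inverse(columns, order=None):
    """Invert `forward`; return (tokens, final order)."""
    n = len(columns[0])
    order = list(range(n)) if order is None else list(order)
    rows = [bytearray() for _ in range(n)]
    for column in columns:
        for row, byte in zip(order, column):
            rows[row].append(byte)
        order = stable_pass(order, column)
    return [bytes(row) for row in rows], order


# Hypothetical streams: each user always uses the same browser.
users = [b"u2", b"u1", b"u2", b"u1"]
browsers = [b"F", b"C", b"F", b"C"]

user_columns, user_order = forward(users)
alone, _ = forward(browsers)                        # no predictor
passed, _ = forward(browsers, order=user_order)     # presorted by user
print(alone)   # [b'FCFC']
print(passed)  # [b'CCFF']

# The decoder recovers the same order while decoding the predictor stream.
decoded_users, decoded_order = inverse(user_columns)
decoded_browsers, _ = inverse(passed, order=decoded_order)
print(decoded_browsers == browsers)  # True
```

Because `forward` also returns its final order, that order can in turn be handed to a third stream, which is one way to interpret the permutations continuing to propagate.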

RadixZip vs Bzip2 (census data)
● US population survey.
  – Fixed-width fields.
  – Divided by field.
● RadixZip outperforms on larger columns.
● Loss on smaller ones:
  – Likely due to needing more byte-columns to 'ramp up'.
● About 15% total gain.

RadixZip vs Bzip2 (census data)
● Compression speed improves:
  – Especially on highly compressible streams,
  – Since Bzip2's algorithm is worst-case quadratic.
● Decompression speed improves.
● Most outliers are on very small streams.

Dependency results
● Hand-picked dependencies from census data.
● Use of a predictor can reduce compressed size to ~0.
● High dependency indicates little to no new information.

Conclusion
● RadixZipTransform - a linear time transform.
● Improvement in both performance and compression for token streams over general purpose compressors.
● Efficient exploitation of stream correlation.
