Random Access Archives for Efficient Compression of Many Small Files

Introduction Current Methods New Archiving Method Soup Generator Conclusion
Random Access Archives for Efficient Compression of Many Small Files
or Avoiding the void
Robert Jan Hensing
July 8, 2011

Overview
1. Introduction
2. Current Methods
3. New Archiving Method
4. Soup Generator
5. Conclusion

Compression
Finding a more space-efficient way to store data.
Lempel and Ziv, 1977 (LZ77): replace strings of symbols with references to earlier occurrences.
Huffman coding: use fewer bits for frequent symbols.
Information theory: compression can be optimal in a precise sense, with the expected code length approaching the entropy of the source.
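
The "fewer bits for frequent symbols" idea can be illustrated with a minimal sketch that computes Huffman code lengths for the bytes of an input. The function name `huffman_code_lengths` is hypothetical and not taken from the talk; the sketch only derives code lengths and does not emit an encoded bit stream.

```python
import heapq
from collections import Counter

def huffman_code_lengths(data: bytes) -> dict[int, int]:
    """Return {symbol: code length in bits} for a static Huffman code."""
    freq = Counter(data)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {next(iter(freq)): 1}
    # Heap entries: (weight, tiebreak, {symbol: depth so far}).
    heap = [(w, i, {sym: 0}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {sym: depth + 1 for sym, depth in {**a, **b}.items()}
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

lengths = huffman_code_lengths(b"aaaaaaaabbbc")
total_bits = sum(lengths[sym] for sym in b"aaaaaaaabbbc")
```

For this input the most frequent byte gets the shortest code, so the total is well below the 8 bits per symbol of the raw data.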

Adaptive Compression
LZ77 adapts to its input.
Huffman coding does not: the mapping between bit strings and symbols does not change. For large files, store the mapping along with the data.
Adaptive Huffman coding does: it modifies the mapping while encoding and decoding.

Adaptive Compression: Advantages
Any type of file can be compressed.
Mixed files can be compressed too.
Adaptive algorithms are great!

Are they?
LZ77 cannot refer to anything until some symbols have been encoded.
Adaptive Huffman does not know the probability distribution of the symbols until something has been read.

Obvious solution
“Solid compression” ... and your small files are gone.

Data
Small text files: newspaper articles, source code, book of law, tweets, ...
Requirements: storage efficiency, random access.

Current Methods
Choice:
Solid compression: efficient coding, no random access. Example: .tar.gz, used for back-ups and software distribution.
Individual compression: random access, inefficient coding of small files. Example: ZIP, used in .zip, .jar, OpenDocument and EPUB e-books.
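
The trade-off between the two choices can be measured with a small sketch. It uses Python's `zlib` module (DEFLATE, the same algorithm family as gzip and ZIP) as a stand-in for the tools named above; the helper names and the sample data are assumptions made for illustration, not the data set used in the talk.

```python
import zlib

def individual_size(files: list[bytes]) -> int:
    """Total size when every file is compressed on its own (random access)."""
    return sum(len(zlib.compress(f)) for f in files)

def solid_size(files: list[bytes]) -> int:
    """Size when all files are concatenated and compressed as one stream."""
    return len(zlib.compress(b"".join(files)))

# Hypothetical sample: many small, similar text files.
files = [
    b"Article %d: the council met on Monday to discuss the budget.\n" % i
    for i in range(50)
]
raw = sum(len(f) for f in files)
print("raw:", raw, "individual:", individual_size(files), "solid:", solid_size(files))
```

With inputs like these, individual compression barely shrinks each file, while the solid stream exploits the redundancy between files at the cost of random access.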

How bad is it?
[Figure: compressed data size (bytes, 0 to 6000) plotted against original data size (bytes, 0 to 8000), with an inset plot. Legend text as extracted: "identity standard deviation gzip prefixed estimate".]

How bad is it?
[Figure: compression ratio (0 to 2) plotted against original data size (bytes, 0 to 8000). Legend text as extracted: "identity standard deviation gzip prefixed estimate".]

New Archiving Method
Make sure the model of the algorithm is in a suitable state.
Bad solution: group the files.
Compression improves.
Access time is linear in the size of the file group; on average, half the group has to be decoded.
Writing always requires rewriting the whole group.

New Archiving Method
Instead: generate a redundant file.
Compression improves.
Access time is linear in the size of the generated file.
The generated file can be denser, and its size can be adjusted.

Deduplicating or trimming
How does increasing the input size help decrease the output size?
Condition: at any point during compression, the output does not depend on future data.
Or, more formally: there exists a small constant k such that, for any soup s and file f, c(s) and c(sf) are equal on their first #c(s) − k bytes, where c is the compression function and #x is the length of x in bytes.
The part of c(sf) that is shared with c(s) is therefore the same for every file, so it can be trimmed from each compressed file and stored only once.

Archiving
1. Generate soup
2. Prepend
3. Compress
4. Trim
5. Store
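
The prepend, compress and trim steps can be sketched with Python's `zlib` module. This is an illustration under stated assumptions, not the implementation from the talk: the class `SoupArchive` and its methods are hypothetical names, the soup is supplied by the caller instead of being generated, and a `Z_SYNC_FLUSH` after the soup is used so that the shared prefix ends on a byte boundary and can be cut off exactly (the case k = 0).

```python
import zlib

class SoupArchive:
    """Toy random-access archive: each file is compressed after a shared soup."""

    def __init__(self, soup: bytes):
        # Compress the soup once; the sync flush emits all pending output
        # while keeping the history window that later matches can refer to.
        comp = zlib.compressobj()
        self._prefix = comp.compress(soup) + comp.flush(zlib.Z_SYNC_FLUSH)
        self._comp = comp
        # Prime a decompressor with the same prefix, also only once.
        decomp = zlib.decompressobj()
        decomp.decompress(self._prefix)
        self._decomp = decomp
        self._entries: dict[str, bytes] = {}

    def add(self, name: str, data: bytes) -> None:
        # Continue from a copy of the primed state, so the stored bytes are
        # c(soup + data) with the shared prefix already trimmed.
        comp = self._comp.copy()
        self._entries[name] = comp.compress(data) + comp.flush()

    def read(self, name: str) -> bytes:
        # Random access: only this entry is decoded, from the primed state.
        decomp = self._decomp.copy()
        return decomp.decompress(self._entries[name]) + decomp.flush()

    def stored_size(self, name: str) -> int:
        return len(self._entries[name])

# Hypothetical soup and file contents.
soup = b"the council met on Monday to discuss the budget for next year. "
archive = SoupArchive(soup)
article = b"On Monday the council met to discuss the budget."
archive.add("article.txt", article)
print(archive.stored_size("article.txt"), "vs", len(zlib.compress(article)))
```

Copying the primed compressor and decompressor avoids reprocessing the soup for every file. DEFLATE only looks back 32 KiB, so in this sketch a larger soup would not help files compressed after it.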

Soup Generator
Counting: consider only the number of files a substring occurs in, not its total number of occurrences.
Locality: most implementations favor references at a short distance.
Frequency: the soup should contain all substrings that occur in at least two files.
Unicity: any substring should be in the soup only once.
Utility: any short substring in the soup should occur in at least two input files.

Approximating Longest Common Substrings
Considering the input strings AaaaBbbbCccc and BbbAaCcccc, the longest common substrings are:
1. Cccc
2. Bbb
3. Aa
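
The example above can be checked with a brute-force sketch that lists the maximal common substrings of two strings, meaning the common substrings that are not contained in a longer common substring. The function names are hypothetical, and the quadratic enumeration is only meant for tiny inputs; it is not the approximation method of the talk.

```python
def substrings(s: str) -> set[str]:
    """All non-empty substrings of s."""
    return {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}

def maximal_common_substrings(a: str, b: str) -> list[str]:
    """Common substrings of a and b not contained in a longer common one."""
    common = substrings(a) & substrings(b)
    maximal = [
        s for s in common
        if not any(s != t and s in t for t in common)
    ]
    # Longest first; ties broken alphabetically for a stable order.
    return sorted(maximal, key=lambda s: (-len(s), s))

result = maximal_common_substrings("AaaaBbbbCccc", "BbbAaCcccc")
print(result)  # ['Cccc', 'Bbb', 'Aa']
```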

Approximating Longest Common Substrings
Data structure: limited-depth trie.
Unique substrings in a file are easily recorded (Counting).
A merge operation can be defined.
Redundant substrings are stored only once.
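
A limited-depth trie with a merge operation might look like the sketch below. The class name `FileTrie`, the nested-dict node layout and the method names are assumptions for illustration, and the talk's own data structure may differ. Each node counts the number of files that contain its substring, so repeats inside one file count once, in line with the Counting principle.

```python
class FileTrie:
    """Depth-limited trie; a node's count is the number of files containing it."""

    def __init__(self, depth: int):
        self.depth = depth
        self.root: dict = {}  # symbol -> [file count, child nodes]

    @classmethod
    def from_file(cls, data: bytes, depth: int) -> "FileTrie":
        trie = cls(depth)
        for start in range(len(data)):
            node = trie.root
            for sym in data[start:start + depth]:
                # setdefault keeps the count at 1 however often the
                # substring repeats inside this single file.
                entry = node.setdefault(sym, [1, {}])
                node = entry[1]
        return trie

    def merge(self, other: "FileTrie") -> None:
        """Add the file counts of another trie into this one."""
        def merge_nodes(dst: dict, src: dict) -> None:
            for sym, (count, children) in src.items():
                entry = dst.setdefault(sym, [0, {}])
                entry[0] += count
                merge_nodes(entry[1], children)
        merge_nodes(self.root, other.root)

    def count(self, s: bytes) -> int:
        """Number of files containing s (0 if absent or longer than depth)."""
        node, found = self.root, 0
        for sym in s:
            if sym not in node:
                return 0
            found, node = node[sym]
        return found

merged = FileTrie.from_file(b"AaaaBbbbCccc", depth=4)
merged.merge(FileTrie.from_file(b"BbbAaCcccc", depth=4))
print(merged.count(b"Cccc"), merged.count(b"aaa"))  # 2 1
```

Substrings with a merged count of at least two are the candidates that the Frequency principle wants in the soup.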

Putting it together
For all files: read all fixed-length substrings of the file into a trie.
Merge all tries.
For all frequent substrings (merged ≥ 1):
Prediction
Cut off
Reverse prediction
Concatenate

Complexity
Time complexity is O(kn + n log n).
Space complexity is O(kn) in the worst case,
where n is the number of input bytes and k is the depth limit.
The worst case for space is random files.

Implementation
Command-line tool for archiving.
User-space filesystem for access.

Demo

Conclusion
Adaptive algorithms need room to adapt, and small files offer little.
Redundancy between files can be significant and can be taken advantage of.
Redundancy between files can be modeled with a soup.
Slides, paper and source code available later today: http://roberthensing.nl/har/news.html
