
Random Access Archives for Efficient Compression of Many Small Files


  1. [Navigation: Introduction, Current Methods, New Archiving Method, Soup Generator, Conclusion] Random Access Archives for Efficient Compression of Many Small Files, or: Avoiding the void. Robert Jan Hensing, July 8, 2011.

  2. Overview: 1. Introduction; 2. Current Methods; 3. New Archiving Method; 4. Soup Generator; 5. Conclusion.

  3. Compression: finding a more space-efficient way to store data. Lempel and Ziv ’77 (LZ77): replace strings of symbols by references to earlier occurrences. Huffman coding: use fewer bits for frequent symbols. Information theory: compression can be optimal in a sense.
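To make "fewer bits for frequent symbols" concrete, here is a minimal sketch of static Huffman code-length construction. It is an illustration only, not code from the talk; the function name `huffman_code_lengths` is hypothetical.

```python
import heapq
from collections import Counter

def huffman_code_lengths(data: bytes) -> dict:
    """Return {symbol: code length in bits} for a static Huffman code."""
    freq = Counter(data)
    if not freq:
        return {}
    if len(freq) == 1:                       # a lone symbol still needs one bit
        return {next(iter(freq)): 1}
    # Heap entries: (weight, unique tie-breaker, {symbol: depth so far}).
    heap = [(w, i, {sym: 0}) for i, (sym, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)       # the two lightest subtrees...
        w2, _, b = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**a, **b}.items()}  # ...sink one level
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

lengths = huffman_code_lengths(b"aaaaaaaabbbbccd")
print({chr(s): n for s, n in sorted(lengths.items())})
# prints {'a': 1, 'b': 2, 'c': 3, 'd': 3}
```

The most frequent symbol gets a 1-bit code, while the two rarest symbols share the longest code length.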

  4. Adaptive Compression. LZ77 adapts to its input. Huffman coding does not: the mapping of bits to symbols does not change. For large files: store the mapping. Adaptive Huffman does adapt: it modifies the mapping while encoding/decoding.

  5. Adaptive Compression: Advantages. Compress any type of file. Mixed files can be compressed too. Adaptive algorithms are great!

  6. Are they? LZ77 cannot refer to anything until some symbols have been encoded. Adaptive Huffman does not know the probability distribution of the symbols until something has been read.

  7. Obvious solution: “solid compression” ... and your small files are gone.

  8. Data: small text files (newspaper articles, source code, book of law, tweets, ...). Storage efficiency. Random access.

  9. Current Methods. Choice: (a) Solid compression: efficient coding, no random access. Example: back-ups and software distribution (.tar.gz). (b) Individual compression: random access, inefficient coding of small files. Example: ZIP, used in .zip, .jar, OpenDocument, and EPUB e-books.
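The trade-off between the two choices can be observed with a small experiment. The sketch below uses Python's `zlib` as a stand-in for the compressors named on the slide, and the `files` list is made-up sample data, not the data set used in the talk.

```python
import zlib

# Hypothetical stand-ins for many small, similar text files.
files = [
    f"Article {i}: the council met on Monday and approved "
    f"the budget for district {i}.".encode()
    for i in range(200)
]

original = sum(len(f) for f in files)
# Individual compression: each file decodes on its own, but every
# compressor starts with an empty model.
individual = sum(len(zlib.compress(f)) for f in files)
# Solid compression: one stream over all files, so later files can
# reference earlier ones, but a single file cannot be read in isolation.
solid = len(zlib.compress(b"".join(files)))

print("original:", original, "individual:", individual, "solid:", solid)
```

On data like this the solid stream is much smaller than the sum of the individually compressed files, which is the gap the talk sets out to close without giving up random access.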

  10. How bad is it? [Plot: compressed data size (bytes) against original data size (bytes, 0 to 8000). Legend: identity, standard deviation, gzip, prefixed, estimate.]

  11. How bad is it? [Plot: compression ratio (0 to 2) against original data size (bytes, 0 to 8000). Legend: identity, standard deviation, gzip, prefixed, estimate.]
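The dependence of the compression ratio on input size can be reproduced roughly as follows. This is a sketch under stated assumptions: the text is synthetic (random words from a small, made-up vocabulary), so the numbers will differ from the measurements plotted in the talk, and `ratio` is a hypothetical helper.

```python
import gzip
import random

rng = random.Random(0)                      # fixed seed for repeatability
words = ["archive", "soup", "file", "compress", "random", "access",
         "small", "trie", "the", "of", "a", "and"]
text = " ".join(rng.choice(words) for _ in range(4000)).encode()

def ratio(n: int) -> float:
    """Compressed size divided by original size for the first n bytes."""
    chunk = text[:n]
    return len(gzip.compress(chunk)) / len(chunk)

for n in (20, 100, 1000, 8000):
    print(n, round(ratio(n), 2))
```

For very small inputs the ratio exceeds 1, because the gzip header and trailer alone take 18 bytes and the compressor has seen nothing it could refer back to.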

  12. New Archiving Method. Make sure the model of the algorithm is in a suitable state. Bad solution: group the files. Compression improves. Access time is linear in the file group size; the average is half the group's decoding time. Writing always requires rewriting the whole group.

  13. New Archiving Method. Instead: generate a redundant file. Compression improves. Access time is linear in the generated file's size. The generated file can be denser and its size can be adjusted.

  14. Deduplicating or trimming. How does increasing the input size help decrease the output size?

  15. Deduplicating or trimming. How does increasing the input size help decrease the output size? Condition: at any point during compression, the output does not depend on future data. Or, more formally: there exists a small constant k such that, for any soup s and file f, c(s) equals c(sf) on the first #c(s) − k bytes, where c is the compression function and #c(s) is the length of c(s) in bytes.

  16. Archiving steps: generate soup, prepend, compress, trim, store.
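The prepend, compress, trim, and store steps can be sketched with Python's `zlib`. This is an illustration under assumptions, not the author's implementation: the class `SoupArchive` and its methods are hypothetical, the soup is passed in rather than generated, and `Z_SYNC_FLUSH` is used to force a byte-aligned boundary after the soup so that the shared prefix can be cut off exactly (the slides instead allow a small constant k of differing bytes).

```python
import zlib

class SoupArchive:
    """Toy archive: each file is compressed as if appended to a shared soup."""

    def __init__(self, soup: bytes, level: int = 9):
        # Compress the soup once and flush to a byte boundary. The flush
        # keeps the LZ77 window, so later data can still refer to the soup.
        self._primed = zlib.compressobj(level)
        self.prefix = (self._primed.compress(soup)
                       + self._primed.flush(zlib.Z_SYNC_FLUSH))
        self.entries = {}

    def store(self, name: str, data: bytes) -> None:
        c = self._primed.copy()             # compressor state "after the soup"
        # Only the bytes produced after the soup are kept (the trim step).
        self.entries[name] = c.compress(data) + c.flush()

    def load(self, name: str) -> bytes:
        d = zlib.decompressobj()
        d.decompress(self.prefix)           # replay the soup to rebuild the model
        return d.decompress(self.entries[name]) + d.flush()

soup = b"the council met on Monday and approved the budget for district "
archive = SoupArchive(soup)
doc = b"On Tuesday the council met on Monday's notes and approved the budget for district 7."
archive.store("doc", doc)
print(len(doc), len(zlib.compress(doc, 9)), len(archive.entries["doc"]))
```

Each entry can be read without touching any other entry. The cost is replaying the soup on every read, which is linear in the soup size; a primed decompressor could be copied instead to avoid the replay.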

  17. Soup Generator. Counting: consider only the number of files, not total occurrences. Locality: most implementations favor references at a short distance. Frequency: the soup should contain all substrings that occur in at least two files. Unicity: any substring should be in the soup only once. Utility: any short substring in the soup should occur in at least two input files.

  18. Approximating Longest Common Substrings. Considering the input strings AaaaBbbbCccc and BbbAaCcccc, the longest common substrings are: (1) Cccc, (2) Bbb, (3) Aa.
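One way to arrive at the list in this example is a brute-force greedy search: find the longest common substring, cut it out of both inputs, and repeat on the remaining pieces. The sketch below is only meant to reproduce the example; it is not the trie-based approximation the talk goes on to describe, and all function names are hypothetical.

```python
def longest_common(a_parts, b_parts):
    """Longest string that occurs in some part of a and some part of b."""
    best = ""
    for a in a_parts:
        for i in range(len(a)):
            # Only try candidates longer than the best found so far.
            for j in range(i + len(best) + 1, len(a) + 1):
                if any(a[i:j] in b for b in b_parts):
                    best = a[i:j]
                else:
                    break                   # longer candidates from i fail too
    return best

def cut(parts, s):
    """Remove the first occurrence of s, splitting the part that held it."""
    out, done = [], False
    for p in parts:
        if not done and s in p:
            out.extend(x for x in p.split(s, 1) if x)
            done = True
        else:
            out.append(p)
    return out

def greedy_common_substrings(a, b, min_len=2):
    a_parts, b_parts, found = [a], [b], []
    while True:
        s = longest_common(a_parts, b_parts)
        if len(s) < min_len:
            return found
        found.append(s)
        a_parts, b_parts = cut(a_parts, s), cut(b_parts, s)

print(greedy_common_substrings("AaaaBbbbCccc", "BbbAaCcccc"))
# prints ['Cccc', 'Bbb', 'Aa']
```

This search is far too slow for real archives, which is why an approximation is needed.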

  19. Approximating Longest Common Substrings. Data structure: limited-depth trie. Easily records the unique substrings in a file (Counting). A merge operation can be defined. Redundant substrings are stored only once.
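A limited-depth trie with per-file counting and a merge operation could look like the sketch below. It is a minimal illustration assuming a dictionary-based node layout; `Node`, `file_trie`, `merge`, and `frequent` are hypothetical names, and the talk's actual implementation may differ.

```python
class Node:
    __slots__ = ("children", "files")

    def __init__(self):
        self.children = {}   # next byte -> Node
        self.files = 0       # number of files containing this substring

def file_trie(data: bytes, depth: int) -> Node:
    """Trie of all substrings of data up to `depth` bytes long."""
    root = Node()
    for i in range(len(data)):
        node = root
        for byte in data[i:i + depth]:
            if byte not in node.children:
                node.children[byte] = Node()
            node = node.children[byte]
            node.files = 1   # counted once per file, however often it occurs
    return root

def merge(a: Node, b: Node) -> Node:
    """Add the file counts of b into a; shared substrings share a node."""
    a.files += b.files
    for byte, child in b.children.items():
        if byte in a.children:
            merge(a.children[byte], child)
        else:
            a.children[byte] = child
    return a

def frequent(node: Node, min_files: int = 2, prefix: bytes = b""):
    """Yield every substring that occurs in at least min_files files."""
    for byte, child in sorted(node.children.items()):
        if child.files >= min_files:
            s = prefix + bytes([byte])
            yield s
            yield from frequent(child, min_files, s)

merged = Node()
for data in (b"abcab", b"xabcy"):
    merge(merged, file_trie(data, depth=3))
print(list(frequent(merged)))
# prints [b'a', b'ab', b'abc', b'b', b'bc', b'c']
```

Because a node's count can never exceed its parent's, the search can stop descending as soon as a node falls below the threshold.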

  20. Putting it together. For all files: read all fixed-length substrings of the file into a trie. Merge all tries. For all frequent substrings (merged ≥ 1): prediction, cut off, reverse prediction, concatenate.

  21. Complexity. Time complexity is O(kn + n log n). Space complexity is O(kn) in the worst case, where n is the number of input bytes and k is the depth limit. The worst case for space is random files.

  22. Implementation. Command-line tool for archiving. User-space filesystem for access.

  23. Demo

  24. Conclusion. Adaptive algorithms need room to adapt, and small files give them little. Redundancy between files can be significant and can be taken advantage of. Redundancy between files can be modeled with a soup. Slides, paper and source code available later today: http://roberthensing.nl/har/news.html
