Random Access Archives for Efficient Compression of Many Small Files


Random Access Archives for Efficient Compression of Many Small Files, or Avoiding the void. Robert Jan Hensing, July 8, 2011.


SLIDE 1

Introduction Current Methods New Archiving Method Soup Generator Conclusion

Random Access Archives for Efficient Compression of Many Small Files

or

Avoiding the void

Robert Jan Hensing
July 8, 2011

SLIDE 2

Overview

1. Introduction
2. Current Methods
3. New Archiving Method
4. Soup Generator
5. Conclusion

SLIDE 3

Compression

Finding a more space-efficient way to store data.

Lempel, Ziv ’77 (LZ77): Replace strings of symbols by references to earlier occurrences
Huffman Coding: Use fewer bits for frequent symbols
Information Theory: compression can be optimal in a sense
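
To make "fewer bits for frequent symbols" concrete, here is a minimal sketch of static Huffman coding in Python. The function name `huffman_code` and the sample text are illustrative assumptions, not taken from the talk; real compressors store the code far more compactly.

```python
import heapq
from collections import Counter

def huffman_code(data: bytes) -> dict[int, str]:
    """Return a prefix code (byte value -> bit string) built from symbol counts."""
    counts = Counter(data)
    if not counts:
        return {}
    if len(counts) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {next(iter(counts)): "0"}
    # Heap entries: (weight, unique tie breaker, {symbol: partial code}).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(sorted(counts.items()))]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees; their codes grow by one leading bit.
        w0, _, left = heapq.heappop(heap)
        w1, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w0 + w1, next_id, merged))
        next_id += 1
    return heap[0][2]

text = b"abracadabra"
code = huffman_code(text)
encoded = "".join(code[b] for b in text)
print(len(text) * 8, "bits raw,", len(encoded), "bits Huffman coded")
```

The most frequent symbol (`a`) ends up with the shortest code. Note that the mapping itself must also be stored or agreed upon, which is part of the cost for small inputs.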

SLIDE 4

Adaptive Compression

LZ77 adapts to its input.
Huffman Coding does not: the mapping of bits to symbols does not change.
Large files: store the mapping.
Adaptive Huffman does adapt: it modifies the mapping while encoding/decoding.

SLIDE 5

Adaptive Compression

Advantages
Compress any type of file
Mixed files can be compressed too

Adaptive algorithms are great!

SLIDE 6

Are they?

LZ77 cannot refer to anything until some symbols have been encoded.
Adaptive Huffman does not know the probability distribution until something has been read.
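
The warm-up cost can be observed with a quick experiment. The sketch below uses Python's `zlib` module (DEFLATE, i.e. LZ77 plus Huffman coding) as a stand-in for the adaptive compressors discussed here; the sample sentence and the helper `ratio` are assumptions, and repeating one sentence is deliberately artificial.

```python
import zlib

def ratio(data: bytes) -> float:
    """Compressed size divided by original size (lower is better)."""
    return len(zlib.compress(data, 9)) / len(data)

sentence = b"Adaptive algorithms need space to adapt to small files. "

# The same text compresses far better once the compressor has seen enough of it.
for copies in (1, 4, 16, 64):
    data = sentence * copies
    print(len(data), "bytes -> ratio", round(ratio(data), 2))
```

For the single sentence the compressor has nothing to refer back to, so the ratio stays close to (or above) 1; with more input the ratio drops sharply.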

SLIDE 7

Obvious solution “Solid compression”

... and your small files are gone

SLIDE 8

Data

Small text files:
newspaper articles
source code
book of law
tweets
...

Storage efficiency
Random access

SLIDE 9

Current Methods

Choice:

Solid compression: efficient coding, no random access.
Example: back-ups and software distribution (.tar.gz).

Individual compression: random access, inefficient coding of small files.
Example: ZIP, used in .zip, .jar, OpenDocument, EPUB e-books.
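
The trade-off between the two choices can be sketched in a few lines of Python with `zlib`. The generated "articles" are hypothetical stand-ins for many small, similar text files; they are not data from the talk.

```python
import zlib

# Hypothetical stand-ins for many small, similar text files.
files = [
    f"Article {i}: the council voted on proposal {i} after a short debate.\n".encode()
    for i in range(200)
]

original = sum(len(f) for f in files)

# Individual compression: every file can be read on its own,
# but each one starts with an empty compression model.
individual = sum(len(zlib.compress(f, 9)) for f in files)

# Solid compression: one stream over all files, so later files
# benefit from earlier ones, but reading one file means decoding the stream.
solid = len(zlib.compress(b"".join(files), 9))

print("original:", original, "individual:", individual, "solid:", solid)
```

On data like this the solid stream is many times smaller than the sum of the individually compressed files, which is the gap the new archiving method tries to close without giving up random access.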

SLIDE 10

How bad is it?

[Figure: compressed data size (bytes) plotted against original data size (bytes). Legend: identity, standard deviation, gzip, prefixed estimate.]

SLIDE 11

How bad is it?

[Figure: compression ratio plotted against original data size (bytes). Legend: identity, standard deviation, gzip, prefixed estimate.]

SLIDE 12

New Archiving Method

Make sure the model of the algorithm is in a suitable state.

Bad solution: group the files.
Compression improves.
Access time is linear in file group size; average is half the group decoding time.
Writing always requires rewriting the whole group.

SLIDE 13

New Archiving Method

Instead: generate a redundant file.
Compression improves.
Access time is linear in generated file size.
The generated file can be denser, and its size can be adjusted.

SLIDE 14

Deduplicating or trimming

How does increasing the input size help decrease the output size?

SLIDE 15

Deduplicating or trimming

How does increasing the input size help decrease the output size?

Condition: At any point during compression, the output does not depend on future data.

Or, more formally: There exists a small constant k, such that for any soup s and file f, c(s) equals c(sf) up to #c(s) − k bytes.

SLIDE 16

Archiving

1. Generate Soup
2. Prepend
3. Compress
4. Trim
5. Store
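
The prepend, compress, trim and store steps can be approximated with Python's `zlib` module. This is a sketch under stated assumptions, not the implementation from the talk: it uses raw DEFLATE, forces a byte-aligned boundary after the soup with `Z_SYNC_FLUSH`, and clones the primed compressor state with `copy()` so the shared prefix is never stored per file. The function names and the sample soup are hypothetical.

```python
import zlib

def make_archive_state(soup: bytes):
    """Compress the soup once; return a (compressor, decompressor) pair primed with it."""
    comp = zlib.compressobj(9, zlib.DEFLATED, -15)   # raw deflate, no header or checksum
    # Z_SYNC_FLUSH ends the soup on a byte boundary but keeps the LZ77 window.
    prefix = comp.compress(soup) + comp.flush(zlib.Z_SYNC_FLUSH)
    decomp = zlib.decompressobj(-15)
    decomp.decompress(prefix)                        # prime the window; output is discarded
    return comp, decomp

def store(comp, data: bytes) -> bytes:
    """Compress one file as if it followed the soup; return only the file's part."""
    c = comp.copy()                                  # leave the primed state untouched
    return c.compress(data) + c.flush()

def load(decomp, stored: bytes) -> bytes:
    """Decode one stored file independently of all other files."""
    d = decomp.copy()
    return d.decompress(stored) + d.flush()

soup = b"the quick brown fox jumps over the lazy dog. "
comp, decomp = make_archive_state(soup)

file_a = b"the lazy dog jumps over the quick brown fox."
file_b = b"a quick brown dog."
stored_a = store(comp, file_a)
stored_b = store(comp, file_b)

print(len(file_a), len(zlib.compress(file_a, 9)), len(stored_a))
```

Each stored file can be decoded on its own with `load`, in any order, because every call starts from a copy of the same primed state. The only shared cost is decoding the soup once.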

SLIDE 17

Soup Generator

Counting: Consider only the number of files, not total occurrences.
Locality: Most implementations favor references at a short distance.
Frequency: The soup should contain all substrings that occur in at least two files.
Unicity: Any substring should be in the soup only once.
Utility: Any short substring in the soup should occur in at least two input files.
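
The Counting and Frequency criteria can be illustrated with a small Python sketch that counts, for each fixed-length substring, the number of files containing it rather than its total number of occurrences. The function name `file_counts`, the substring length and the sample files are assumptions made for illustration.

```python
from collections import Counter

def file_counts(files: list[bytes], k: int) -> Counter:
    """For each length-k substring, count the number of files that contain it."""
    counts: Counter = Counter()
    for data in files:
        # A set, so each file contributes at most 1 per substring (Counting).
        grams = {data[i:i + k] for i in range(len(data) - k + 1)}
        counts.update(grams)
    return counts

files = [b"abababab", b"xxabxx", b"zzzz"]
counts = file_counts(files, 2)

# Frequency: candidates for the soup occur in at least two files.
shared = sorted(g for g, n in counts.items() if n >= 2)
print(shared)
```

Here `ab` occurs five times in total but is counted twice, once per file, while `zz` is repeated inside a single file and is therefore not a candidate.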

SLIDE 18

Approximating Longest Common Substrings

Considering the input strings AaaaBbbbCccc and BbbAaCcccc, the longest common substrings are:

1. Cccc
2. Bbb
3. Aa
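
The example can be checked with a brute-force Python sketch. It enumerates every common substring and keeps those not contained in a longer one already kept; this is quadratic in the input size and only meant to verify the example, not the approximation method of the talk. The function names are hypothetical.

```python
def substrings(s: str) -> set[str]:
    """All non-empty substrings of s."""
    return {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}

def maximal_common_substrings(a: str, b: str) -> list[str]:
    """Common substrings of a and b, longest first, skipping parts of kept ones."""
    common = substrings(a) & substrings(b)
    kept: list[str] = []
    for cand in sorted(common, key=lambda s: (-len(s), s)):
        if not any(cand in longer for longer in kept):
            kept.append(cand)
    return kept

print(maximal_common_substrings("AaaaBbbbCccc", "BbbAaCcccc"))
# -> ['Cccc', 'Bbb', 'Aa']
```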

SLIDE 19

Approximating Longest Common Substrings

Data structure: limited depth trie
Easily record unique substrings in a file (Counting)
Can define merge operation
Redundant substrings stored only once
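
A minimal Python sketch of such a depth-limited trie follows. The class layout, the per-node file counter and the method names are assumptions made for illustration; the talk does not specify them.

```python
class Trie:
    """Depth-limited trie; each node counts the files containing its substring."""

    def __init__(self) -> None:
        self.children: dict[int, "Trie"] = {}
        self.files = 0

    @classmethod
    def from_file(cls, data: bytes, depth: int) -> "Trie":
        """Record every substring of up to `depth` bytes of one file."""
        root = cls()
        for start in range(len(data)):
            node = root
            for byte in data[start:start + depth]:
                node = node.children.setdefault(byte, cls())
                node.files = 1      # seen in this file, however often (Counting)
        return root

    def merge(self, other: "Trie") -> None:
        """Add another file's trie; shared substrings share one node."""
        self.files += other.files
        for byte, child in other.children.items():
            self.children.setdefault(byte, Trie()).merge(child)

    def count(self, key: bytes) -> int:
        """Number of merged files that contain `key` (0 if absent or too deep)."""
        node = self
        for byte in key:
            node = node.children.get(byte)
            if node is None:
                return 0
        return node.files

merged = Trie.from_file(b"abab", 3)
merged.merge(Trie.from_file(b"xab", 3))
print(merged.count(b"ab"), merged.count(b"aba"), merged.count(b"abab"))
```

After merging, a substring that occurs in both files has a count of 2 on a single node, which is how redundant substrings end up stored only once.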

SLIDE 20

Putting it together

For all files: read all fixed-length substrings of the file into a trie.
Merge all tries.
For all frequent substrings (merged ≥ 1):
  Prediction
  Cut off
  Reverse prediction
Concatenate.

SLIDE 21

Complexity

Time complexity is O(kn + n log n).
Space complexity is O(kn) in the worst case,
where n is the number of input bytes and k is the depth limit.
Worst case for space: random files.

SLIDE 22

Implementation

Command line tool for archiving
User space filesystem for access

SLIDE 23

Demo

SLIDE 24

Conclusion

Adaptive algorithms need space to adapt to small files.
Redundancy between files can be significant and can be taken advantage of.
Redundancy between files can be modeled with a soup.

Slides, paper and source code available later today:
http://roberthensing.nl/har/news.html