Non-Linear Compression: Gzip Me Not!
Michael F. Nowlan, Bryan Ford, Ramakrishna Gummadi
Decentralized and Distributed Systems Group, Department of Computer Science, Yale University
4th USENIX Workshop on Hot Topics in Storage and File Systems
DeDiS Group, Yale CS HotStorage '12, Boston, MA 2
Linear Compression
The popular compression schemes (e.g., gzip, bzip2) are linear.
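[Figure: compression over time t. comp takes block B1 and state S0, yielding output C1 and state S1; comp then takes B2 and S1, yielding C2 and S2.]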
Compression state accumulates sequentially, with each successive block of data that is compressed.
Any given state depends on all previous compression states.
This dependency chain is restrictive. It forces decompression to proceed in the same order as compression (i.e., it prohibits random access).
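[Figure: decompression over time t. dcomp takes C1 and state S0, yielding B1 and state S1; dcomp then takes C2 and S1, yielding B2 and S2.]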
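The dependency chain can be observed with any streaming compressor. The sketch below uses Python's standard zlib module as a stand-in for a linear scheme (an assumption for illustration; the sample text is made up). The names b1, b2, c1, c2 mirror B1, B2, C1, C2 in the figure.

```python
import zlib

b1 = b"the quick brown fox jumps over the lazy dog\n" * 4
b2 = b"the quick brown fox naps beside the lazy dog\n" * 4

# One compressor object carries the state S0 -> S1 -> S2 across blocks.
comp = zlib.compressobj()
c1 = comp.compress(b1) + comp.flush(zlib.Z_SYNC_FLUSH)  # state is now S1
c2 = comp.compress(b2) + comp.flush(zlib.Z_SYNC_FLUSH)  # state is now S2

# Decompressing in the original order works.
dcomp = zlib.decompressobj()
in_order = (dcomp.decompress(c1), dcomp.decompress(c2))

# Decompressing C2 from a fresh state, without replaying C1, does not recover B2.
try:
    out_of_order = zlib.decompressobj().decompress(c2)
except zlib.error:
    out_of_order = None

print(in_order == (b1, b2), out_of_order == b2)
```

The second block can only be read after the first has been replayed, which is the ordering constraint described above.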
In summary: Popular compression schemes transform compression state linearly.
Outline
- Linear Compression
- Compression in Storage Systems
- Storage Requirements
- Linear Limitations
- Non-Linear Compression
- Architecture and API
- Example Applications
- Prototype Implementation
- Preliminary Results
- Future Work
Compression in Storage Systems
Storage systems that use compression generally perform:
1) block compression, and/or
2) delta-encoding
Examples include:
- De-duplicating file systems
- Distributed source control management
- Collaborative editing systems
[Figure: a data source emitting blocks B1 and B2.]
Storage Requirements
Data blocks may be related, or not, and they may be available at different times (e.g., versions of a file), or all at once.
Inter-block content (columns) vs. availability (rows):
Availability | Related | Unrelated
At once | Linear | ???
Over time | ??? | Linear
Linear Limitations
[Figure: the content/availability matrix, with the ??? cell in the "At once" row labeled "Random Access".]
Resetting compression state between blocks enables random access, but significantly reduces the compression ratio for small blocks.
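A rough way to see this trade-off is to compress the same small blocks both ways. The sketch below is illustrative only: it uses Python's zlib module and made-up log records as the blocks, neither of which comes from the slides.

```python
import zlib

# 200 small, related blocks (hypothetical log records, each well under 1 KB).
blocks = [
    f"user={i % 7} action=login status=ok latency_ms={(i * 13) % 97}\n".encode()
    for i in range(200)
]
raw_size = sum(len(b) for b in blocks)

# Option 1: reset the state for every block. Any block decompresses on its own.
reset_size = sum(len(zlib.compress(b)) for b in blocks)

# Option 2: keep one linear stream. State is shared, but the order is fixed.
stream = zlib.compressobj()
stream_size = sum(
    len(stream.compress(b) + stream.flush(zlib.Z_SYNC_FLUSH)) for b in blocks
)

print(f"raw={raw_size} reset={reset_size} stream={stream_size}")
```

With blocks this small, the per-block reset leaves the compressor almost nothing to reference, so the shared stream comes out much smaller at the price of ordered decompression.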
[Figure: the content/availability matrix, with the ??? cell in the "Over time" row labeled "Reuse Compression State".]
No abstraction for doing this!
Linear compression forces an all-or-nothing choice between random access and compression ratio (especially for blocks < 1 KB), and it has no notion of copying or reusing compression state.
NLC API
Linear Compression API:
- State initialize();
- int compress(State, void*, int);
- int decompress(State, void*, int);
Non-Linear Compression API adds:
- State fork(State);
NLC Fork
[Figure: Foo.c at v.1 branches into v.2a and v.2b, edited by Alice and Bob.]
- v.2a: small delta with a content dependency.
- v.2b: small delta with a content dependency; independent of v.2a.
Intuition: Fork copies compression state to allow independent compression, or decompression, using previous compression state.
[Figure: compressing v.1 takes S0 to S1; a fork of S1 is then compressed independently into S2a and S2b.]
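Python's zlib module happens to expose a comparable operation: Compress.copy() and Decompress.copy() duplicate a live stream state. The sketch below uses them as a stand-in for fork; it is not the NLC prototype, and the file contents are invented for illustration.

```python
import zlib

# v.1 of a hypothetical Foo.c, then two small independent edits of it.
v1 = b"".join(b"int f%d(void) { return %d; }\n" % (i, i) for i in range(40))
v2a = v1 + b"int func_alice(void) { return 1; }\n"
v2b = v1 + b"int func_bob(void) { return 2; }\n"

# S0 -> S1: compress v.1 and flush so its output can be decoded so far.
s1 = zlib.compressobj()
c1 = s1.compress(v1) + s1.flush(zlib.Z_SYNC_FLUSH)

# Fork: two copies of S1 that evolve independently into S2a and S2b.
s2a, s2b = s1.copy(), s1.copy()
c2a = s2a.compress(v2a) + s2a.flush(zlib.Z_SYNC_FLUSH)
c2b = s2b.compress(v2b) + s2b.flush(zlib.Z_SYNC_FLUSH)

# The decompressor mirrors the fork, so v.2b is recovered without ever seeing c2a.
d1 = zlib.decompressobj()
out1 = d1.decompress(c1)
d2a, d2b = d1.copy(), d1.copy()
out2b = d2b.decompress(c2b)
out2a = d2a.decompress(c2a)

# Size of v.2a when forked from S1 vs. compressed from a fresh state.
print(len(c2a), len(zlib.compress(v2a)))
```

Each branch benefits from the state accumulated while compressing v.1, yet neither branch needs the other's output to decompress.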
NLC API
Linear Compression API:
- State initialize();
- int compress(State, void*, int);
- int decompress(State, void*, int);
Non-Linear Compression API adds:
- State fork(State);
- State merge(State, State);
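To make the five calls concrete, here is a minimal Python sketch of the same API shape. Everything in it is an assumption for illustration: a State is modeled as the bytes seen so far and passed to zlib as a preset dictionary, compress and decompress return a new State instead of updating one in place, and merge is approximated by concatenating histories. It is not the prototype's Adaptive Huffman design.

```python
import zlib
from dataclasses import dataclass
from typing import Tuple

WINDOW = 32 * 1024  # deflate can reference at most 32 KiB of prior data


@dataclass(frozen=True)
class State:
    """A node in the state DAG: the data seen along the paths leading to it."""
    history: bytes = b""


def initialize() -> State:
    return State()


def compress(state: State, block: bytes) -> Tuple[State, bytes]:
    """Compress one block against `state`; return the successor state and output."""
    comp = zlib.compressobj(zdict=state.history) if state.history else zlib.compressobj()
    data = comp.compress(block) + comp.flush()
    return State((state.history + block)[-WINDOW:]), data


def decompress(state: State, data: bytes) -> Tuple[State, bytes]:
    """Inverse of compress; must be given the same state the compressor used."""
    dcomp = zlib.decompressobj(zdict=state.history)
    block = dcomp.decompress(data) + dcomp.flush()
    return State((state.history + block)[-WINDOW:]), block


def fork(state: State) -> State:
    """Copy a state so that two paths can continue independently."""
    return State(state.history)


def merge(a: State, b: State) -> State:
    """Combine two states so later blocks can draw on both histories."""
    return State((a.history + b.history)[-WINDOW:])


# Compressor side: one shared block, two forked branches, then a merge.
s1, c1 = compress(initialize(), b"shared header\n" * 8)
sa, ca = compress(fork(s1), b"shared header\nalice's change\n")
sb, cb = compress(fork(s1), b"shared header\nbob's change\n")
s3, c3 = compress(merge(sa, sb), b"alice's change\nbob's change\n")

# Decompressor side: the same DAG, with Bob's branch decoded before Alice's.
d1, out_1 = decompress(initialize(), c1)
db, out_b = decompress(fork(d1), cb)
da, out_a = decompress(fork(d1), ca)
_, out_3 = decompress(merge(da, db), c3)
```

The application is responsible for walking the same path through the DAG on both sides; the sketch does this by hand.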
NLC Merge
[Figure: Foo.c versions v.1, v.2a, and v.2b (Alice and Bob) lead to v.3, whose contents include both int func_alice() { … } and int func_bob() { … }.]
Intuition: Merge combines compression state to allow future compression to use all acquired state between two nodes.
[Figure: S2a and S2b are compressed independently into S3a and S3b, which are then merged into S3.]
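The benefit of merging can be approximated with zlib preset dictionaries. In the sketch below, a "state" is simply the bytes seen on one branch, and merge is modeled as concatenating the two branches' histories; both are assumptions for illustration rather than the prototype's merge algorithm, and the file contents are invented.

```python
import zlib

v1 = b"#include <stdio.h>\n\nint main(void) {\n    return 0;\n}\n"
alice_fn = b'int func_alice(void) {\n    puts("alpha: checksum verified");\n    return 17;\n}\n'
bob_fn = b'int func_bob(void) {\n    puts("bravo: journal replayed");\n    return 42;\n}\n'
v2a, v2b = v1 + alice_fn, v1 + bob_fn
v3 = v1 + alice_fn + bob_fn  # the merged file contains both edits


def compress_with(history: bytes, block: bytes) -> bytes:
    """Compress `block` using `history` as a preset dictionary (the state)."""
    comp = zlib.compressobj(zdict=history)
    return comp.compress(block) + comp.flush()


state_a = v1 + v2a           # everything seen on Alice's branch
state_b = v1 + v2b           # everything seen on Bob's branch
merged = state_a + state_b   # merge, approximated by concatenation

c3_alice_only = compress_with(state_a, v3)
c3_merged = compress_with(merged, v3)

# The decompressor needs the same merged state to read v.3.
dcomp = zlib.decompressobj(zdict=merged)
restored = dcomp.decompress(c3_merged) + dcomp.flush()

print(len(c3_alice_only), len(c3_merged))
```

With only Alice's state, Bob's function has to be encoded mostly from scratch; with the merged state, v.3 can reference both edits.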
NLC Architecture
- NLC module provided by the OS.
- Single abstraction for all outstanding state nodes.
- Independent of any specific compression scheme.
- Supports Huffman, Arithmetic, LZW, LZ77, etc.
- No expectation of random access within a block.
- Normal linear compression within blocks.
- Application can use different paths through the DAG for logically distinct “streams” of data.
- Application keeps compressor in sync with decompressor, but Future Work discusses potential NLC “naming”, or “identification”, schemes.
NLC – Parallel Compression
[Figure: DAG of compression states S0 through S6; edge types are fork, merge, and compress.]
NLC – Synchronized Streams
[Figure: DAG of compression states S0 through S5; edge types are fork, merge, and compress.]
NLC – Windowed Compression
[Figure: DAG from the base state S0 through S1 to S6, with states S1', S2', S3' and a cumulative state SCUM; edge types are fork, merge, and compress.]
For any given state, x, and current state, c, x is merged into the cumulative state when: x <= (c - w).
Window, w, = 3.
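One possible reading of the rule, assumed here, is that block c is compressed against a cumulative state holding only the blocks x with x <= c - w, so the w most recent blocks never depend on one another and can be decompressed in any order. The sketch below follows that reading, using zlib preset dictionaries as the cumulative state; the helper names and sample records are hypothetical.

```python
import random
import zlib

W = 3                # window size in blocks, as on the slide
LIMIT = 32 * 1024    # deflate uses at most 32 KiB of preset dictionary


def compress_block(blocks, c, w=W):
    """Compress block c (1-indexed) against the blocks x with x <= c - w."""
    history = b"".join(blocks[: max(0, c - w)])[-LIMIT:]
    comp = zlib.compressobj(zdict=history) if history else zlib.compressobj()
    return comp.compress(blocks[c - 1]) + comp.flush()


def decompress_block(known, c, data, w=W):
    """Decode block c given already-decoded blocks in `known` (index -> bytes)."""
    history = b"".join(known[x] for x in range(1, max(0, c - w) + 1))[-LIMIT:]
    dcomp = zlib.decompressobj(zdict=history)
    return dcomp.decompress(data) + dcomp.flush()


blocks = [
    f"record {i}: sensor=temp unit=C value={20 + i % 5}\n".encode()
    for i in range(1, 7)
]
packets = {c: compress_block(blocks, c) for c in range(1, 7)}

# Blocks 1..3 have an empty cumulative state; decode them first.
known = {c: decompress_block({}, c, packets[c]) for c in (1, 2, 3)}

# Blocks 4..6 depend only on blocks 1..3, so they can arrive in any order.
late = [4, 5, 6]
random.shuffle(late)
for c in late:
    known[c] = decompress_block(known, c, packets[c])
```

A larger window allows more reordering but delays the point at which a block's content becomes available to later blocks.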
Prototype Implementation
- We have an Adaptive Huffman compressor in C++.
- Proof-of-concept; not meant to compete head-to-head with gzip or other compressors.
- Order of magnitude slower.
- Fork and Merge are very expensive.
- Compression ratios approach optimal, depending on application fork/merge strategy.
- Merge allows eventual usage of all compression state.
Preliminary Results
[Figure: results chart. Block size = 128 bytes; window size = 3 blocks.]
The cost for “unordered decompression” is paid in the first 10 KB.
Future Work – Challenges
- Merge, Merge, Merge
- It's computationally expensive and slow.
- Is it even needed? Are approximation heuristics good enough?
- Fork/Merge behaviors
- Should we use Fork and Merge sparingly?
- Block size vs. Memory overhead
- As block sizes decrease, the compression overhead ratio increases.
- State node “naming” or “identification”
- NLC module should do it for the application.
Conclusion
- Data Compression is used everywhere. However, the API is one-size-fits-all.
- Non-Linear Compression aims to be a superset of the traditional compression API by offering Fork and Merge.
- Fork and Merge allow compression state to follow the data's natural logical dependencies.
- This provides localized compression and unordered decompression in many instances.
Thanks to Jana Iyengar, Avi Silberschatz, Michael Fischer, Rob Ross, the anonymous reviewers... And all of you for listening! Questions?