VXA: A Virtual Architecture for Durable Compressed Archives
Bryan Ford
Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology
http://pdos.csail.mit.edu/~baford/vxa/
The Ubiquity of Data Compression
Everything is compressed these days
– Archive/Backup/Distribution: ZIP, tar.gz, ...
– Multimedia streams: mp3, ogg, wmv, ...
– Office documents: XML-in-ZIP
– Digital cameras: JPEG, proprietary RAW, ...
– Video camcorders: DV, MPEG-2, ...
Compressed Data Formats
Observation #1: Data compression formats evolve rapidly
[Figure: timeline, 1980-2005, of compressed data formats in four categories (Lossless Compression, Video Encoding, Audio Encoding, Image Encoding). Formats shown: SQ, LU, ARC, ZOO, LHarc, ZIP, compress, gzip, bzip2, RAR, 7z, MPEG-1, MPEG-2, MPEG-4, DV, QuickTime, Sorenson, FLIC, ANIM, WMV 7, WMV 8, WMV 9, MP3, RealAudio, AAC, FLAC, Vorbis, WAV, AIFF, AIFF-C, 8SVX, WMA 7, WMA 9, JPEG, JPEG 2000, GIF, PNG, TIFF, ILBM, PCX, TGA, BMP]
Problems:
– Inconvenient: each new algorithm requires decoder install/upgrade
– Impedes data portability: data unusable on systems without supported decoder
– Threatens long-term data usability: old decoders may not run on new operating systems
Archiving Compressed Data
Observation #2: Processor architectures evolve more conservatively
[Figure: timeline, 1980-2005. x86 Architecture: 8086, 80386 (32-bit), MMX (int vector), SSE (FP vector), x86-64 (64-bit), labeled "Fully Backward Compatible Extensions". Other Architectures: 68000, MIPS, ARM, SPARC, PA-RISC, PowerPC, DEC Alpha, Itanium ("Itanic")]
VXA: Virtual Executable Archives
Observation 1+2: Instruction formats are historically more durable than compressed data formats
Make archive self-extracting (data + executable decoder)
To extract data, archive reader runs embedded decoder
[Diagram: Archive Writer (Encoder) → Archive containing decoder D → Archive Reader (Decoder)]
Goals of VXA
Make self-extracting archives...
1. Safe: malicious decoders can't compromise host
2. Future-proof: simple, well-defined architecture [Lorie]
3. Easy: allow reuse of existing code, languages, tools
4. Efficient: practical for short-term data packaging too
[Diagram: Archive Writer (Encoder) → Archive containing decoder D → Archive Reader, which runs decoder D inside a fast x86 emulator]
Outline
- Archiver Operation
- vxZIP Archive Format
- Decoder Architecture
- Emulator Design & Implementation
- Evaluation (performance, storage overhead)
- Conclusion
Archive Writer Operation
[Diagram: VXA Archiver writing an Archive]
– Uncompressed input files go through the General Compressor, Image Compressor, or Audio Compressor; the matching Decoder1, Decoder2, or Decoder3 is stored in the archive as D1, D2, D3
– Pre-compressed input files go through the Image Format Recognizer or Audio Format Recognizer; the matching Decoder4 or Decoder5 is stored in the archive as D4, D5
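The split between compressors and format recognizers can be sketched as below. This is only an illustration, not the VXA archiver's code: the function names `recognize` and `archive_entry` are hypothetical, the decoder labels are borrowed from the ported-decoder list, and the mapping of gzip and Ogg data to the `zlib` and `vorbis` decoders is an assumption. The magic numbers themselves are the standard file signatures for each format.

```python
import zlib

# Standard file signatures, mapped to decoder labels (mapping is an assumption).
MAGIC = [
    (b"fLaC", "flac"),
    (b"OggS", "vorbis"),                         # Ogg container, assumed to hold Vorbis
    (b"\xff\xd8\xff", "jpeg"),
    (b"\x00\x00\x00\x0cjP  \r\n\x87\n", "jp2"),  # JPEG 2000 signature box
    (b"\x1f\x8b", "zlib"),                       # gzip wrapper around deflate
    (b"BZh", "bzip2"),
]

def recognize(data):
    """Return the decoder label for already-compressed data, or None."""
    for magic, decoder in MAGIC:
        if data.startswith(magic):
            return decoder
    return None

def archive_entry(data):
    """Return (decoder label, stored bytes, was_precompressed) for one input file."""
    decoder = recognize(data)
    if decoder is not None:
        # Pre-compressed input: store unchanged, just attach the decoder.
        return decoder, data, True
    # Uncompressed input: fall back to the general-purpose compressor.
    return "zlib", zlib.compress(data), False
```

A recognized file is stored byte for byte, so the archiver never needs an encoder for that format, only the decoder it attaches.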
Archive Reader Operation
[Diagram: VXA Archive Reader with x86 Emulator, reading an Archive that holds decoders D1, D2, D3, D4, D5]
– Decoder1, Decoder2, Decoder3 run in the x86 emulator to restore the original uncompressed files
– Pre-compressed files are extracted either as the original pre-compressed files, or as de-compressed files by running Decoder4, Decoder5 in the x86 emulator
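The two extraction paths for pre-compressed files can be sketched as follows. The `Entry` record, the `decode_precompressed` flag, and the `DECODERS` table are hypothetical names, and a plain Python callable stands in for running an archived decoder in the emulator; treating the compressed form as the default output is an assumption read off the diagram.

```python
import zlib
from dataclasses import dataclass

@dataclass
class Entry:
    name: str
    data: bytes          # bytes as stored in the archive
    decoder: str         # label of the attached decoder
    precompressed: bool  # True if the input file was already compressed

# Stand-in for "run the archived decoder in the emulator".
DECODERS = {"zlib": zlib.decompress}

def extract(entries, decode_precompressed=False):
    """Return {name: bytes}; pre-compressed files stay compressed unless asked."""
    out = {}
    for entry in entries:
        if entry.precompressed and not decode_precompressed:
            out[entry.name] = entry.data
        else:
            out[entry.name] = DECODERS[entry.decoder](entry.data)
    return out
```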
vxZIP Archive Format
- Backward compatible with legacy ZIP format
- Decoders intermixed with archived files
- Archived files have new extension header pointing to decoder
- Decoders are hidden, “deflated” (gzip)
[Diagram: vxZIP Archive containing Central Directory, Audio file (FLAC-encoded), FLAC Decoder (deflated), Audio file (FLAC-encoded), Image file (JP2-encoded), JP2 Decoder (deflated)]
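The extension-header idea can be tried with Python's standard `zipfile` module, which lets a member carry extra-field records that legacy readers skip. The sketch below is an assumption-laden illustration: the header ID `0x7856` and the 4-byte offset payload are invented here and are not the real vxZIP layout, and the decoder is stored as an ordinary named member rather than hidden.

```python
import io
import struct
import zipfile

VXA_EXTRA_ID = 0x7856  # hypothetical header ID, not the real vxZIP value

def add_with_decoder(zf, name, payload, decoder_name):
    """Store payload unchanged, tagged with the offset of its decoder's local header."""
    offset = zf.getinfo(decoder_name).header_offset
    info = zipfile.ZipInfo(name)
    info.compress_type = zipfile.ZIP_STORED
    # Extra-field record: 2-byte ID, 2-byte data size, then the data (little-endian).
    info.extra = struct.pack("<HHI", VXA_EXTRA_ID, 4, offset)
    zf.writestr(info, payload)

def decoder_offset(info):
    """Walk the extra-field records and return the decoder offset, or None."""
    extra = info.extra
    while len(extra) >= 4:
        tag, size = struct.unpack("<HH", extra[:4])
        if tag == VXA_EXTRA_ID and size == 4:
            return struct.unpack("<I", extra[4:8])[0]
        extra = extra[4 + size:]
    return None

def build_demo():
    """Build a small in-memory archive: a readme, a fake decoder, one tagged file."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("readme.txt", b"demo")
        zf.writestr("decoders/flac", b"fake decoder image",
                    compress_type=zipfile.ZIP_DEFLATED)
        add_with_decoder(zf, "song.flac", b"fLaC fake audio", "decoders/flac")
    return buf.getvalue()
```

A reader that knows nothing about the extra record still extracts `song.flac` as a normal stored member, which is the backward-compatibility property the slide lists first.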
vxZIP Decoder Architecture
- Decoders are ELF executables for x86-32
– Can be written in any language, safe or unsafe
– Compiled using ordinary tools (GCC)
- Decoders have access to five “system calls”:
– read stdin, write stdout, malloc, next file, exit
- Decoders cannot:
– open files, windows, devices, network connections, ...
– get system info: user name, current time, OS type, ...
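To make the five-call interface concrete, here is a minimal model of it. Real decoders are x86-32 ELF binaries, so this is only a sketch: the class `DecoderHost`, its method names, and the heap-budget reading of `malloc` are all assumptions, and Python enforces the restriction by convention rather than by sandboxing.

```python
import io
import zlib

class DecoderHost:
    """Host side of the interface: the only five operations a decoder is given."""

    def __init__(self, inputs, heap_limit=1 << 30):
        self._inputs = [io.BytesIO(b) for b in inputs]
        self.outputs = [io.BytesIO() for _ in inputs]
        self._cur = 0
        self._heap_used = 0
        self._heap_limit = heap_limit
        self.exit_code = None

    def read(self, n):            # read stdin
        return self._inputs[self._cur].read(n)

    def write(self, data):        # write stdout
        return self.outputs[self._cur].write(data)

    def malloc(self, n):          # grant memory only within the budget
        if n < 0 or self._heap_used + n > self._heap_limit:
            return False
        self._heap_used += n
        return True

    def next_file(self):          # move on to the next encoded stream, if any
        if self._cur + 1 >= len(self._inputs):
            return False
        self._cur += 1
        return True

    def exit(self, code):
        self.exit_code = code

def zlib_decoder(host):
    """A decoder written against those five calls and nothing else."""
    while True:
        d = zlib.decompressobj()
        while chunk := host.read(4096):
            host.write(d.decompress(chunk))
        host.write(d.flush())
        if not host.next_file():
            break
    host.exit(0)
```

Because the decoder only ever sees a byte stream in and a byte stream out, it has no way to name a file, a device, or the current time.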
Decoders Ported So Far
(using existing implementations in C, mostly unmodified)
General-purpose (lossless):
- zlib: Classic gzip/deflate algorithm
- bzip2: Burrows-Wheeler algorithm
Still image codecs:
- jpeg: Classic lossy image compression scheme
- jp2: JPEG 2000 wavelet-based algorithm, lossy or lossless
Audio codecs:
- flac: Free Lossless Audio Codec
- vorbis: Standard lossy audio codec for Ogg streams
vx32 Emulator Architecture
Runs in vxUnZIP process
- Loads decoder into address space sandbox
- Restricts decoder's memory accesses to sandbox
- Dispatches decoder's VXA “system calls” to vxUnZIP (not to host OS!)
[Diagram: vxUnZIP Process Address Space containing the vxUnZIP Application, the vx32 Emulator library, and the VXA Decoder Address Space (up to 1GB); VXA System Calls go from the decoder address space to the application]
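The three responsibilities listed above (a private address space, confined memory accesses, and calls routed to the application) can be modeled in a few lines. This is a toy under stated assumptions, not vx32: the names `Sandbox` and `SandboxFault` are hypothetical, and explicit bounds checks stand in for whatever mechanism the real emulator uses.

```python
class SandboxFault(Exception):
    """Raised when the guest reaches for memory or services outside its sandbox."""

class Sandbox:
    def __init__(self, size, syscalls):
        self.mem = bytearray(size)      # the decoder's whole address space
        self.syscalls = dict(syscalls)  # handlers supplied by the application

    def _check(self, addr, length):
        if addr < 0 or length < 0 or addr + length > len(self.mem):
            raise SandboxFault(f"access [{addr}, {addr + length}) outside sandbox")

    def load(self, addr, length):
        self._check(addr, length)
        return bytes(self.mem[addr:addr + length])

    def store(self, addr, data):
        self._check(addr, len(data))
        self.mem[addr:addr + len(data)] = data

    def syscall(self, name, *args):
        # Calls go to the application's table, never to the host OS.
        handler = self.syscalls.get(name)
        if handler is None:
            raise SandboxFault(f"unknown call {name!r}")
        return handler(self, *args)
```

A handler receives the sandbox itself, so even the application-side `write` reads guest memory through the same bounds check.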
vx32 Emulator Implementation
On x86-{32/64} hosts:
– Secure fault isolation [Wahbe]
– Data sandboxing via custom LDT segments
– Code sandboxing via instruction rewriting [Sites, Nethercote]
– No privileges or kernel extensions
[Diagram: Kernel Address Space above the vxUnZIP process. The Flat-Model Code/Data Segment holds the vxUnZIP Application and the vx32 Emulator library with its transformed code cache; the VXA Decoder Address Space (up to 1GB) is the Decoder Data Segment (LDT). Code Rewriting and Controlled Procedure Calls connect the two]
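The role of the transformed code cache can be illustrated with a toy translator. Everything here is an assumption made for the example: the four-instruction guest language (`add`, `jlt`, `jmp`, `halt`) is invented, fragments become Python closures instead of rewritten x86 code, and the safety checks are far simpler than real instruction rewriting. What it does show is the general shape: translate a fragment on first use, cache it by guest address, and refuse any instruction or jump target outside the sandboxed code.

```python
def translate(program, pc):
    """Translate one fragment: straight-line code up to the next control transfer."""
    delta = 0
    while True:
        if not 0 <= pc < len(program):
            raise ValueError("control transfer outside sandboxed code")
        op, arg = program[pc]
        if op == "add":
            delta += arg
            pc += 1
        elif op == "jlt":                    # jump to target if acc < limit
            limit, target = arg
            fall = pc + 1
            def frag(acc, d=delta):
                acc += d
                return acc, (target if acc < limit else fall)
            return frag
        elif op == "jmp":
            return lambda acc, d=delta, t=arg: (acc + d, t)
        elif op == "halt":
            return lambda acc, d=delta: (acc + d, None)
        else:
            raise ValueError(f"unsafe or unknown instruction: {op}")

def run(program, acc=0):
    """Execute via the fragment cache; return (result, number of fragments translated)."""
    cache = {}
    pc = 0
    while pc is not None:
        frag = cache.get(pc)
        if frag is None:
            frag = cache[pc] = translate(program, pc)  # translate once, reuse after
        acc, pc = frag(acc)
    return acc, len(cache)
```

A loop body is translated once and then re-entered from the cache on every iteration, which is why a rewriting emulator can approach native speed on hot code.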
vx32 Emulator Implementation
On other host architectures:
– Portable but slow “fallback” instruction interpreter (mostly done)
– Fast x86-to-PowerPC binary translator (in progress)
– Hopefully more in the future
Emulator implemented as generic library
– Can be used for other sandboxing applications
Evaluation
Two issues to address:
- Performance overhead of emulated decoders
– not important for long-term archival storage, but...
– very important for common short-term uses of archives: backups, software distribution, structured documents, ...
- Storage overhead of archived decoders
Performance Test Method
Run 6 ported decoders on appropriate data sets
– Athlon 64 3000+ PC running SuSE Linux 9.3
– Measure user-mode CPU time (not wall-clock time)
Compare:
– Emulated vs native execution
– Running on x86-32 vs x86-64 host environment
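One way to take this kind of measurement on a Unix host is sketched below; it is not the authors' test harness. The helper names `user_cpu_seconds` and `normalized_time` are hypothetical, and the approach relies on `os.times().children_user`, which accumulates user-mode CPU time of child processes that have been waited for (it stays at zero on Windows).

```python
import os
import subprocess

def user_cpu_seconds(argv):
    """User-mode CPU time consumed by one child process, in seconds."""
    before = os.times().children_user
    subprocess.run(argv, check=True, stdout=subprocess.DEVNULL)
    return os.times().children_user - before

def normalized_time(emulated_seconds, native_seconds):
    """Emulated time as a multiple of native time (1.0 means no overhead)."""
    return emulated_seconds / native_seconds
```

User-mode CPU time excludes time spent waiting on disk or in the kernel, so it isolates the cost of executing decoder code, which is what emulation changes.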
Performance Overhead
[Figure: bar chart of Normalized User-mode Execution Time (axis 0.1 to 1.2) for zlib, bzip2, jpeg, jp2, flac, vorbis; series: native x86-32, vx32 on x86-32, native x86-64, vx32 on x86-64]
Storage Overhead
Archiver stores only one copy of each decoder
– Storage cost amortized over all files of same type
– Relative overhead depends on size of archive
Therefore, measure only absolute decoder size (compressed, as stored in archive)
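The amortization argument is simple arithmetic, shown here with made-up sizes rather than measured ones. The function name `relative_overhead` is hypothetical.

```python
def relative_overhead(decoder_kb, payload_kb):
    """Fraction of the archive occupied by one stored decoder."""
    return decoder_kb / (decoder_kb + payload_kb)

# Illustrative sizes only: the same 100 KB decoder in two archives.
small_archive = relative_overhead(100, 900)      # 10% of a 1,000 KB archive
large_archive = relative_overhead(100, 99_900)   # 0.1% of a 100,000 KB archive
```

Since the ratio depends entirely on how much data shares the decoder, only the absolute decoder size is a property of the decoder itself.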
Storage Overhead
[Figure: bar chart of Compressed code size (KBytes, axis 10 to 130) for zlib, bzip2, jpeg, jp2, flac, vorbis; series: Decoder, C library]
Conclusion
VXA makes self-extracting archives...
- Safe: decoders fully sandboxed
- Future-proof: simple, OS-independent environment
- Easy: re-use existing decoders, languages, tools
- Efficient: ≤ 11% slowdown vs native x86-32