VXA: A Virtual Architecture for Durable Compressed Archives



slide-1
SLIDE 1

VXA: A Virtual Architecture for Durable Compressed Archives

Bryan Ford

Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology
http://pdos.csail.mit.edu/~baford/vxa/

slide-2
SLIDE 2

The Ubiquity of Data Compression

Everything is compressed these days

– Archive/Backup/Distribution: ZIP, tar.gz, ...
– Multimedia streams: mp3, ogg, wmv, ...
– Office documents: XML-in-ZIP
– Digital cameras: JPEG, proprietary RAW, ...
– Video camcorders: DV, MPEG-2, ...

slide-3
SLIDE 3

Compressed Data Formats

Observation #1: Data compression formats evolve rapidly

[Timeline figure, 1980 to 2005. Lossless Compression formats shown: SQ, LU, ARC, ZOO, LHarc, ZIP, bzip2, gzip, compress, 7z, RAR]

slide-4
SLIDE 4

Compressed Data Formats

Observation #1: Data compression formats evolve rapidly

[Timeline figure, 1980 to 2005, with tracks for Lossless Compression, Video Encoding, Audio Encoding, and Image Encoding.
Lossless Compression: SQ, LU, ARC, ZOO, LHarc, ZIP, bzip2, gzip, compress, 7z, RAR.
Other formats shown: MPEG-2, DV, WMV7, MPEG-1, MPEG-4, MP3, RealAudio, AAC, FLAC, Vorbis, Sorenson, QuickTime, JPEG, GIF, PNG, JPEG 2, TIFF, WMA7, WMA9, WMV9, WMV8, FLIC, WAV, AIFF, 8SVX, ANIM, AIFF-C, ILBM, PCX, TGA, BMP]

slide-5
SLIDE 5

Compressed Data Formats

Observation #1: Data compression formats evolve rapidly

Problems:

– Inconvenient: each new algorithm requires decoder install/upgrade
– Impedes data portability: data unusable on systems without supported decoder
– Threatens long-term data usability: old decoders may not run on new operating systems
slide-6
SLIDE 6

Archiving Compressed Data

Observation #2: Processor architectures evolve more conservatively

[Timeline figure, 1980 to 2005. x86 Architecture: 8086, 80386 (32-bit), MMX (int vector), SSE (FP vector), x86-64 (64-bit)]

slide-7
SLIDE 7

Archiving Compressed Data

Observation #2: Processor architectures evolve more conservatively

[Timeline figure, 1980 to 2005. x86 Architecture: 8086, 80386 (32-bit), MMX (int vector), SSE (FP vector), x86-64 (64-bit). Annotation: Fully Backward Compatible Extensions]

slide-8
SLIDE 8

Archiving Compressed Data

Observation #2: Processor architectures evolve more conservatively

[Timeline figure, 1980 to 2005. x86 Architecture: 8086, 80386 (32-bit), MMX (int vector), SSE (FP vector), x86-64 (64-bit). Other Architectures: 68000, MIPS, ARM, PA-RISC, PowerPC, DEC Alpha, Itanium, SPARC]

slide-9
SLIDE 9

Archiving Compressed Data

Observation #2: Processor architectures evolve more conservatively

[Timeline figure, 1980 to 2005. x86 Architecture: 8086, 80386 (32-bit), MMX (int vector), SSE (FP vector), x86-64 (64-bit). Other Architectures: 68000, MIPS, ARM, PA-RISC, PowerPC, DEC Alpha, Itanium, SPARC]

slide-12
SLIDE 12

Archiving Compressed Data

Observation #2: Processor architectures evolve more conservatively

[Timeline figure, 1980 to 2005. x86 Architecture: 8086, 80386 (32-bit), MMX (int vector), SSE (FP vector), x86-64 (64-bit). Other Architectures: 68000, MIPS, ARM, PA-RISC, PowerPC, DEC Alpha, Itanium, SPARC. Annotation: "Itanic"]

slide-13
SLIDE 13

VXA: Virtual Executable Archives

Observations #1 + #2: Instruction formats are historically more durable than compressed data formats

– Make archive self-extracting (data + executable decoder)
– To extract data, archive reader runs embedded decoder

[Diagram: Archive Writer with Encoder; Archive carrying a copy of the decoder (D); Archive Reader with Decoder]

slide-14
SLIDE 14

Goals of VXA

Make self-extracting archives...

[Diagram: Archive Writer with Encoder; Archive carrying a copy of the decoder (D); Archive Reader with Decoder]

slide-15
SLIDE 15

Goals of VXA

Make self-extracting archives...

1. Safe: malicious decoders can't compromise host
2. Future-proof: simple, well-defined architecture [Lorie]

[Diagram: Archive Writer with Encoder; Archive carrying a copy of the decoder (D); Archive Reader running D inside an Emulator]

slide-16
SLIDE 16

Goals of VXA

Make self-extracting archives...

1. Safe: malicious decoders can't compromise host
2. Future-proof: simple, well-defined architecture [Lorie]
3. Easy: allow reuse of existing code, languages, tools

[Diagram: Archive Writer with Encoder; Archive carrying a copy of the decoder (D); Archive Reader running D inside an x86 Emulator]

slide-17
SLIDE 17

Goals of VXA

Make self-extracting archives...

1. Safe: malicious decoders can't compromise host
2. Future-proof: simple, well-defined architecture [Lorie]
3. Easy: allow reuse of existing code, languages, tools
4. Efficient: practical for short-term data packaging too

[Diagram: Archive Writer with Encoder; Archive carrying a copy of the decoder (D); Archive Reader running D inside a Fast x86 Emulator]

slide-18
SLIDE 18

Outline

  • Archiver Operation
  • vxZIP Archive Format
  • Decoder Architecture
  • Emulator Design & Implementation
  • Evaluation (performance, storage overhead)
  • Conclusion
slide-19
SLIDE 19

Archive Writer Operation

[Diagram: VXA Archiver producing an Archive]

slide-20
SLIDE 20

Archive Writer Operation

[Diagram: Uncompressed Input Files pass through the VXA Archiver's General Compressor (Decoder1); the Archive contains D1]

slide-22
SLIDE 22

Archive Writer Operation

[Diagram: Uncompressed Input Files pass through the General Compressor (Decoder1), Image Compressor (Decoder2), or Audio Compressor (Decoder3); the Archive contains D1, D2, D3]

slide-23
SLIDE 23

Archive Writer Operation

[Diagram: inputs are Uncompressed Input Files and Pre-Compressed Input Files; the General Compressor (Decoder1), Image Compressor (Decoder2), and Audio Compressor (Decoder3) feed the Archive, which contains D1, D2, D3]

slide-24
SLIDE 24

Archive Writer Operation

[Diagram: Uncompressed Input Files pass through the General Compressor (Decoder1), Image Compressor (Decoder2), or Audio Compressor (Decoder3); Pre-Compressed Input Files pass through the Image Format Recognizer (Decoder4) or Audio Format Recognizer (Decoder5); the Archive contains D1, D2, D3, D4, D5]
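
The writer-side choice between compressing a file and recognizing an already-compressed one can be sketched as follows. This is a minimal illustration, not the VXA archiver's code: the function name `plan_entry`, the table `MAGIC_TO_DECODER`, and the decoder labels are hypothetical, and recognition by leading magic bytes is an assumed technique. Only the file signatures themselves are standard.

```python
# Hypothetical sketch: decide how the archiver should treat one input file.
# Pre-compressed files are recognized by their leading "magic" bytes and
# stored unchanged with a matching decoder; anything else goes through the
# general-purpose compressor.

MAGIC_TO_DECODER = [
    (b"fLaC", "flac"),                                # FLAC stream marker
    (b"OggS", "vorbis"),                              # Ogg page (assumed to carry Vorbis)
    (b"\xff\xd8\xff", "jpeg"),                        # JPEG start-of-image marker
    (b"\x00\x00\x00\x0cjP  \r\n\x87\n", "jp2"),       # JPEG 2000 signature box
]


def plan_entry(data: bytes) -> tuple[str, str]:
    """Return (action, decoder_name) for one input file."""
    for magic, decoder in MAGIC_TO_DECODER:
        if data.startswith(magic):
            # Already compressed: keep the bytes as they are.
            return ("store", decoder)
    # Unrecognized data: compress it with the general compressor.
    return ("compress", "zlib")
```

A real recognizer would likely parse more than the first few bytes, since a matching signature does not guarantee a well-formed stream.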

slide-25
SLIDE 25

Archive Reader Operation

[Diagram: VXA Archive Reader with x86 Emulator, reading an Archive that contains D4, D5, D1, D2, D3]

slide-26
SLIDE 26

Archive Reader Operation

[Diagram: VXA Archive Reader runs Decoder1 in the x86 Emulator to produce the Original Uncompressed Files; the Archive contains D4, D5, D1, D2, D3]

slide-27
SLIDE 27

Archive Reader Operation

[Diagram: VXA Archive Reader runs Decoder1, Decoder2, and Decoder3 in the x86 Emulator to produce the Original Uncompressed Files; the Archive contains D4, D5, D1, D2, D3]

slide-28
SLIDE 28

Archive Reader Operation

[Diagram: VXA Archive Reader runs Decoder1, Decoder2, and Decoder3 in the x86 Emulator; outputs are the Original Uncompressed Files and the Original Pre-Compressed Files; the Archive contains D4, D5, D1, D2, D3]

slide-29
SLIDE 29

Archive Reader Operation

[Diagram: VXA Archive Reader runs Decoder1, Decoder2, Decoder3, Decoder4, and Decoder5 in the x86 Emulator; outputs are the Original Uncompressed Files and the De-compressed Files; the Archive contains D4, D5, D1, D2, D3]

slide-30
SLIDE 30

vxZIP Archive Format

  • Backward compatible with legacy ZIP format

[Diagram: vxZIP Archive containing Central Directory, Audio file, Audio file, Image file]

slide-31
SLIDE 31

vxZIP Archive Format

  • Backward compatible with legacy ZIP format
  • Decoders intermixed with archived files

[Diagram: vxZIP Archive containing Central Directory, Audio file, FLAC Decoder, Audio file, Image file, JP2 Decoder]

slide-32
SLIDE 32

vxZIP Archive Format

  • Backward compatible with legacy ZIP format
  • Decoders intermixed with archived files
  • Archived files have new extension header pointing to decoder

[Diagram: vxZIP Archive containing Central Directory, Audio file (FLAC-encoded), FLAC Decoder, Audio file (FLAC-encoded), Image file (JP2-encoded), JP2 Decoder]

slide-33
SLIDE 33

vxZIP Archive Format

  • Backward compatible with legacy ZIP format
  • Decoders intermixed with archived files
  • Archived files have new extension header pointing to decoder
  • Decoders are hidden, “deflated” (gzip)

[Diagram: vxZIP Archive containing Central Directory, Audio file (FLAC-encoded), FLAC Decoder (deflated), Audio file (FLAC-encoded), Image file (JP2-encoded), JP2 Decoder (deflated)]
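
The idea of an extension header that points from an archived file to its decoder can be sketched with Python's standard `zipfile` module. This is an illustration under stated assumptions, not the vxZIP layout: the header ID `VXA_EXTRA_ID`, the 4-byte offset payload, and the member names are all hypothetical. Unlike vxZIP, the decoder here is an ordinary named member rather than a hidden one, and the audio data is simply stored.

```python
import io
import struct
import zipfile

VXA_EXTRA_ID = 0x7856  # hypothetical extra-field header ID, for this sketch only


def build_archive(payload: bytes, decoder_image: bytes) -> bytes:
    """Write one deflated decoder and one file whose extra field points to it."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # Store the decoder once, deflated like any other ZIP member.
        zf.writestr("decoders/flac.elf", decoder_image,
                    compress_type=zipfile.ZIP_DEFLATED)
        decoder_offset = zf.getinfo("decoders/flac.elf").header_offset

        # ZIP extra fields are (2-byte ID, 2-byte size, data) records.
        info = zipfile.ZipInfo("song.flac")
        info.compress_type = zipfile.ZIP_STORED
        info.extra = struct.pack("<HHI", VXA_EXTRA_ID, 4, decoder_offset)
        zf.writestr(info, payload)
    return buf.getvalue()


def decoder_offset_for(archive: bytes, name: str) -> int:
    """Walk a member's extra field and return the decoder offset it records."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        extra = zf.getinfo(name).extra
    pos = 0
    while pos + 4 <= len(extra):
        header_id, size = struct.unpack_from("<HH", extra, pos)
        if header_id == VXA_EXTRA_ID:
            (offset,) = struct.unpack_from("<I", extra, pos + 4)
            return offset
        pos += 4 + size  # skip records this reader does not understand
    raise KeyError(f"{name} has no decoder pointer")
```

Because ZIP readers skip extra-field records whose IDs they do not recognize, an archive built this way still opens in ordinary ZIP tools, which is the kind of backward compatibility the slide describes.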

slide-34
SLIDE 34

vxZIP Decoder Architecture

  • Decoders are ELF executables for x86-32
    – Can be written in any language, safe or unsafe
    – Compiled using ordinary tools (GCC)
  • Decoders have access to five “system calls”:
    – read stdin, write stdout, malloc, next file, exit
  • Decoders cannot:
    – open files, windows, devices, network connections, ...
    – get system info: user name, current time, OS type, ...
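
The narrow decoder interface can be modeled in a few lines. The sketch below is a toy model, not the VXA ABI: the class `VxaCalls`, the method signatures, the heap limit, and the semantics assumed for "next file" (advance to the next input and start a new output) are all hypothetical, and Python itself provides no sandboxing. It only shows that a decoder can do its work with nothing but these five calls.

```python
import io


class DecoderExit(Exception):
    """Raised when the decoder calls exit()."""

    def __init__(self, status: int):
        super().__init__(status)
        self.status = status


class VxaCalls:
    """The only host services a decoder may use in this model."""

    def __init__(self, inputs: list[bytes], heap_limit: int = 1 << 20):
        self._inputs = [io.BytesIO(b) for b in inputs]
        self._index = 0
        self.outputs = [io.BytesIO()]
        self._heap_used = 0
        self._heap_limit = heap_limit

    def read(self, n: int) -> bytes:           # read stdin
        return self._inputs[self._index].read(n)

    def write(self, data: bytes) -> None:      # write stdout
        self.outputs[-1].write(data)

    def malloc(self, n: int) -> bytearray:     # bounded memory allocation
        if self._heap_used + n > self._heap_limit:
            raise MemoryError("decoder heap limit exceeded")
        self._heap_used += n
        return bytearray(n)

    def next_file(self) -> bool:               # assumed: move to next input
        if self._index + 1 >= len(self._inputs):
            return False
        self._index += 1
        self.outputs.append(io.BytesIO())
        return True

    def exit(self, status: int = 0) -> None:   # terminate the decoder
        raise DecoderExit(status)


def rle_decoder(vxa: VxaCalls) -> None:
    """Toy run-length decoder: each input is a series of (count, byte) pairs."""
    while True:
        while True:
            pair = vxa.read(2)
            if len(pair) < 2:
                break
            vxa.write(bytes([pair[1]]) * pair[0])
        if not vxa.next_file():
            vxa.exit(0)


def run(decoder, inputs: list[bytes]) -> list[bytes]:
    """Run a decoder against the model host and collect its outputs."""
    host = VxaCalls(inputs)
    try:
        decoder(host)
    except DecoderExit as done:
        if done.status != 0:
            raise RuntimeError(f"decoder failed with status {done.status}")
    return [out.getvalue() for out in host.outputs]
```

For example, `run(rle_decoder, [b"\x03a\x02b", b"\x01z"])` decodes two inputs in a single decoder run, one output per input.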

slide-35
SLIDE 35

Decoders Ported So Far

(using existing implementations in C, mostly unmodified)

General-purpose (lossless):

  • zlib: Classic gzip/deflate algorithm
  • bzip2: Burrows-Wheeler algorithm

Still image codecs:

  • jpeg: Classic lossy image compression scheme
  • jp2: JPEG 2000 wavelet-based algorithm, lossy or lossless

Audio codecs:

  • flac: Free Lossless Audio Codec
  • vorbis: Standard lossy audio codec for Ogg streams
slide-36
SLIDE 36

vx32 Emulator Architecture

Runs in vxUnZIP process

  • Loads decoder into address space sandbox
  • Restricts decoder's memory accesses to sandbox
  • Dispatches decoder's VXA “system calls” to vxUnZIP (not to host OS!)

[Diagram: vxUnZIP Process Address Space containing the vxUnZIP Application, the vx32 Emulator library, and the VXA Decoder Address Space (up to 1GB); VXA System Calls pass from the Decoder Address Space to vxUnZIP]
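
The two duties listed above, confining memory accesses and routing system calls to the host application, can be illustrated with a small model. This is a conceptual sketch only: the names `Sandbox`, `SandboxFault`, `register`, and `syscall` are hypothetical, and explicit bounds checks in Python stand in for whatever mechanism the real emulator uses.

```python
class SandboxFault(Exception):
    """Raised when guest code steps outside what the sandbox allows."""


class Sandbox:
    """A fixed-size guest address space with checked access and call dispatch."""

    def __init__(self, size: int):
        self._mem = bytearray(size)
        self._syscalls = {}

    def _check(self, addr: int, length: int) -> None:
        # Every guest access must fall entirely inside the sandbox.
        if addr < 0 or length < 0 or addr + length > len(self._mem):
            raise SandboxFault(f"access at {addr} (+{length}) outside sandbox")

    def load(self, addr: int, length: int) -> bytes:
        self._check(addr, length)
        return bytes(self._mem[addr:addr + length])

    def store(self, addr: int, data: bytes) -> None:
        self._check(addr, len(data))
        self._mem[addr:addr + len(data)] = data

    def register(self, number: int, handler) -> None:
        """Host side: install the handler for one guest system call."""
        self._syscalls[number] = handler

    def syscall(self, number: int, *args):
        """Guest side: calls reach only handlers the host registered."""
        handler = self._syscalls.get(number)
        if handler is None:
            raise SandboxFault(f"unknown system call {number}")
        return handler(self, *args)
```

A host would register a small, fixed set of handlers (for instance one that copies a guest buffer into the output stream) and nothing else, so guest code has no path to the host operating system.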

slide-37
SLIDE 37

vx32 Emulator Implementation

On x86-{32/64} hosts:

– Secure fault isolation [Wahbe]
– Data sandboxing via custom LDT segments
– Code sandboxing via instruction rewriting [Sites, Nethercote]
– No privileges or kernel extensions

[Diagram labels: Kernel Address Space; Flat-Model Code/Data Segment; vxUnZIP Application; vx32 Emulator library; Transformed code cache; VXA Decoder Address Space (up to 1GB); Decoder Data Segment (LDT); Code Rewriting; Controlled Procedure Calls]

slide-38
SLIDE 38

vx32 Emulator Implementation

On other host architectures:

– Portable but slow “fallback” instruction interpreter (mostly done)
– Fast x86-to-PowerPC binary translator (in progress)
– Hopefully more in the future

Emulator implemented as generic library

– Can be used for other sandboxing applications

slide-39
SLIDE 39

Evaluation

Two issues to address:

  • Performance overhead of emulated decoders
    – not important for long-term archival storage, but...
    – very important for common short-term uses of archives: backups, software distribution, structured documents, ...
  • Storage overhead of archived decoders
slide-40
SLIDE 40

Performance Test Method

Run 6 ported decoders on appropriate data sets

– Athlon 64 3000+ PC running SuSE Linux 9.3
– Measure user-mode CPU time (not wall-clock time)

Compare:

– Emulated vs native execution
– Running on x86-32 vs x86-64 host environment

slide-41
SLIDE 41

Performance Overhead

[Bar chart: Normalized User-mode Execution Time (axis 0.1 to 1.2) for zlib, bzip2, jpeg, jp2, flac, vorbis. Series: native x86-32, vx32 on x86-32, native x86-64, vx32 on x86-64]

slide-43
SLIDE 43

Storage Overhead

Archiver stores only one copy of each decoder

– Storage cost amortized over all files of same type
– Relative overhead depends on size of archive

Therefore, measure only absolute decoder size (compressed, as stored in archive)
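
The amortization argument is simple arithmetic: one stored decoder is shared by every file that uses it, so its relative cost shrinks as the archived data grows. The sketch below uses made-up sizes purely for illustration (the 50 KB decoder and the payload sizes are assumptions, not measurements from the talk), and the function name `relative_overhead` is hypothetical.

```python
def relative_overhead(decoder_kb: float, payload_kb: float) -> float:
    """Decoder size as a fraction of the data archived with that decoder."""
    if payload_kb <= 0:
        raise ValueError("payload size must be positive")
    return decoder_kb / payload_kb


# Hypothetical sizes: one 50 KB decoder shared by archives of growing size.
for payload_kb in (500, 5_000, 50_000):
    share = relative_overhead(50, payload_kb)
    print(f"{payload_kb:>6} KB of data: decoder adds {share:.2%}")
```

Since the ratio depends entirely on how much data shares the decoder, the absolute decoder size is the only figure that characterizes the format itself.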

slide-44
SLIDE 44

Storage Overhead

[Bar chart: Compressed code size in KBytes (axis 10 to 130) for zlib, bzip2, jpeg, jp2, flac, vorbis. Series: Decoder, C library]

slide-45
SLIDE 45

Conclusion

VXA makes self-extracting archives...

  • Safe: decoders fully sandboxed
  • Future-proof: simple, OS-independent environment
  • Easy: re-use existing decoders, languages, tools
  • Efficient: ≤ 11% slowdown vs native x86-32

Available at: http://pdos.csail.mit.edu/~baford/vxa/