Tradeoffs in XML Database Compression



SLIDE 1

Tradeoffs in XML Database Compression

James Cheney
University of Edinburgh
Data Compression Conference, March 30, 2006

Tradeoffs in XML Database Compression – p.1/22

SLIDE 2

XML Compression

  • XML: a format for tree-structured data
  • Increasingly used for large data collections (e.g. bibliographic/scientific databases)

[Figure: tree view of a book document with title, author, and chapter nodes, p nodes under the chapters, and text leaves, shown next to the equivalent XML:]

<book>
  <title>text</title>
  <author>text</author>
  <chapter><p>text</p></chapter>
  <chapter><p>text</p></chapter>
</book>

  • Verbose, so gzip or bzip2 is usually used to compress XML
  • Can XML-specific compression techniques do better?
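As a rough illustration of the general-purpose baseline mentioned on this slide, the sketch below measures bits per character (bpc) on a small synthetic XML string. The sample document and the helper name `bits_per_char` are invented for illustration. `zlib` implements DEFLATE, the algorithm gzip uses, and stands in for gzip here.

```python
import bz2
import zlib


def bits_per_char(data: bytes, compress) -> float:
    """Compressed size in bits divided by the number of input bytes."""
    return 8 * len(compress(data)) / len(data)


# Synthetic, highly repetitive sample (an assumption, not data from the talk).
sample = (
    b"<book><title>text</title><author>text</author>"
    + b"<chapter><p>text</p></chapter>" * 200
    + b"</book>"
)

rates = {
    "zlib": bits_per_char(sample, zlib.compress),
    "bz2": bits_per_char(sample, bz2.compress),
}
for name, rate in rates.items():
    print(f"{name}: {rate:.3f} bpc")
```

Real documents are far less repetitive than this sample, so the printed rates only show how the measurement works, not what to expect on DBLP or PSD.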

SLIDE 3

Prior work

  • XMill (Liefke, Suciu 2000): first (serious) XML compression work
    • transforms the XML document, then compresses it with gzip/bzip2
  • XMLPPM (Cheney, DCC 2001): uses statistical modeling; better compression than XMill, but slower
  • SCMPPM (Adiego, de la Fuente, Navarro, DCC 2004), XAUST (Hariharan, Shankar, CIAA 2005): use different statistical models, report improvement over XMLPPM
  • Other approaches have been explored, but statistical methods have the best performance

SLIDE 4

Motivation

  • Most experimental evaluations have focused only on compression rate (and often only for large files)
  • Other relevant factors, such as memory requirements and rate of convergence, have been neglected
  • Thus, experiments demonstrating improved compression are valid, but don’t tell the whole story
  • Goal of this talk: a detailed comparison of memory vs. compression rate, and of rate of convergence
  • Focus on the unstructured-text compression behavior of the statistical XML compression models used in XMLPPM, SCMPPM, and XAUST

SLIDE 5

Statistical models

  • Statistical text compression: compresses text by building a model that predicts the next symbol
  • Adaptive approach: interleave model building and prediction/compression
  • Requires only one pass over the data, but has to “learn” the model as it goes

[Figure: input symbols a c a d a b a r a b r feeding a model with P(a) = .45, P(b) = .225, P(r) = .225, P(X) = 0.1, which emits the bits 001011101010010111]

SLIDE 6

Statistical models

[Figure, next build step: same input and model probabilities (P(a) = .45, P(b) = .225, P(r) = .225, P(X) = 0.1); the output grows to 00101110101001011110110]

SLIDE 7

Statistical models

[Figure, final build step: the model is updated to P(a) = .38, P(b) = .19, P(r) = .19, P(c) = .19, P(X) = 0.05; output bits 00101110101001011110110]
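The adaptive loop on these slides (predict the next symbol, code it, then update the model) can be sketched as follows. This is a minimal order-0 model, far simpler than PPM, and it reports the ideal code length instead of producing actual bits. Reading the figure's P(X) as an escape probability for unseen symbols is an assumption, as are the one-count escape estimate, the 8-bit literal cost, and the names `AdaptiveModel` and `ideal_code_length`.

```python
import math
from collections import Counter


class AdaptiveModel:
    """Order-0 adaptive model; one count is reserved for an escape event."""

    def __init__(self):
        self.counts = Counter()
        self.total = 0

    def prob(self, symbol):
        # Seen symbols get count / (total + 1); the remaining 1 / (total + 1)
        # is the escape probability (an assumed smoothing choice).
        seen = self.counts[symbol]
        return (seen if seen else 1) / (self.total + 1)

    def update(self, symbol):
        self.counts[symbol] += 1
        self.total += 1


def ideal_code_length(text, model=None):
    """Bits an ideal arithmetic coder would spend coding text adaptively."""
    if model is None:
        model = AdaptiveModel()
    bits = 0.0
    for ch in text:
        novel = model.counts[ch] == 0
        bits += -math.log2(model.prob(ch))  # predict and code
        if novel:
            bits += 8  # assumed cost of sending the new symbol literally
        model.update(ch)  # learn, so the next prediction is better
    return bits


print(ideal_code_length("abracadabra"))
```

Because the model starts empty, the first symbols are expensive and the per-symbol cost falls as the counts accumulate, which is the "learning" cost the slide refers to.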

SLIDE 8

Approach #1: Multi-model

  • Idea: switch between n models, one model M(e) per element name e
  • Use M(e) to encode the text immediately under e

[Figure: example tree with element nodes A, B, C, D; the text strings "xyz", "123", "abc", "456" go to the per-element models M(A), M(B), M(C), M(D), which emit the bit strings 0101, 11010, 01110, 11111]

  • Used in SCMPPM, XAUST
  • I’ll call this the Structured Contexts Model (SCM) approach
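A minimal sketch of the multi-model idea, assuming a toy order-0 adaptive model per element name in place of the PPM models that SCMPPM and XAUST actually use. The function names, the escape estimate, and the 8-bit literal cost are illustrative assumptions. The example input follows the slides' example, with "xyz" and "abc" under C and "123" and "456" under D.

```python
import math
from collections import Counter, defaultdict


def cost_and_update(counts, ch):
    """Bits to code ch under an adaptive order-0 model, then update the model."""
    total = sum(counts.values())
    seen = counts[ch]
    # Unseen symbols pay an escape (one reserved count) plus 8 assumed literal bits.
    bits = -math.log2((seen or 1) / (total + 1)) + (0 if seen else 8)
    counts[ch] += 1
    return bits


def scm_cost(items):
    """items: (element_name, text) pairs in document order."""
    models = defaultdict(Counter)  # one model M(e) per element name e
    return sum(
        cost_and_update(models[element], ch)
        for element, text in items
        for ch in text
    )


slide_example = [("C", "xyz"), ("D", "123"), ("C", "abc"), ("D", "456")]
print(scm_cost(slide_example))
```

Since every element name has its own counts, text that is common to several elements has to be learned once per model; in this toy version "aa" under C followed by "aa" under D costs 18 bits, against 10 bits for "aaaa" under a single element.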

SLIDE 9

Approach #2: Single-model

  • Idea: use a single model for text, but “prime” the model with element symbols
  • Priming symbols are “free” since they can be inferred from the tree context (this is part of the fixed cost we’re ignoring)

[Figure: the example tree (elements A, B, C, D; text "xyz", "123", "abc", "456") feeding a single model]

(00) (01) (02) "xyz" (03) "123" (02) "abc" (03) "456"

A = 00, B = 01, C = 02, D = 03

  • where (00), (01), etc. are priming symbols for the various element tags
  • Used in XMLPPM, so I’ll call it the XMLPPM approach
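A minimal sketch of priming, assuming a single fixed order-1 context model in place of XMLPPM's PPM model. The element name is placed in the context before the text starts, so it influences the first prediction but costs no bits itself. The function name `primed_cost`, the escape estimate, and the 8-bit literal cost are illustrative assumptions.

```python
import math
from collections import Counter, defaultdict


def primed_cost(items):
    """items: (element_name, text) pairs; all text shares one order-1 model."""
    contexts = defaultdict(Counter)  # previous symbol -> counts of the next symbol
    bits = 0.0
    for element, text in items:
        # Priming symbol: enters the context for free, since the decoder
        # already knows the element from the tree structure.
        prev = ("elem", element)
        for ch in text:
            counts = contexts[prev]
            total, seen = sum(counts.values()), counts[ch]
            # Unseen symbols pay an escape plus 8 assumed literal bits.
            bits += -math.log2((seen or 1) / (total + 1)) + (0 if seen else 8)
            counts[ch] += 1
            prev = ch
    return bits


print(primed_cost([("C", "ab"), ("D", "ab")]))  # 25.0
```

In the usage line, the "b" under D costs only 1 bit because the context after "a" is shared with the text under C. That sharing of statistics across elements is what separates this approach from one model per element.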

SLIDE 10

Prior experiments

  • XMLPPM: wide variety of XML documents, max size <1MB; used 1MB of memory for statistical models
    • when the limit is reached, the statistical model restarts
  • SCMPPM: used large TREC documents with 8 elements and very little structure; statistical models used 1MB each (maximum of 8MB for TREC)
  • XAUST: used large documents such as DBLP; no upper limit on memory

SLIDE 11

Flaws in prior experiments

  • XMLPPM: didn’t consider large documents, memory variation
  • SCMPPM, XAUST: didn’t consider small documents, memory variation
  • Can’t tell whether the reported compression gain is due to using more memory or to more accurate modeling
  • SCM approach may allocate much more memory than it ever uses
  • SCM approach may eventually attain much better compression, but may converge very slowly (benefiting only large files)
  • Not enough data to draw any conclusions about the relative merits of these approaches

SLIDE 12

Text is the dominant factor

Most of the “interesting” content of most XML documents is unstructured text

                   gzip                        xmlppm
file      struct   total    %struct   struct   total    %struct
DBLP      9.9MB    52.4MB   19%       667KB    33.4MB   2.0%
Medline   2.7MB    20.2MB   14%       539KB    13.7MB   3.9%
XMark     4.1MB    38.1MB   11%       287KB    27.6MB   1.0%
PSD       13.6MB   108MB    12%       2.5MB    79.6MB   3.1%

  • Existing techniques already compress structure well (structure is roughly 1–20% of the compressed document)
  • So, in this work, focus only on modeling/compression of unstructured text in XML
  • Compressing the structure is treated as a small fixed cost
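The %struct column can be rechecked from the struct and total columns. A minimal sketch for the xmlppm columns, assuming 1MB = 1000KB (the slide does not say which convention was used); the variable names are illustrative.

```python
# (struct size in MB, total size in MB) for the xmlppm columns of the table.
xmlppm_sizes = {
    "DBLP": (0.667, 33.4),
    "Medline": (0.539, 13.7),
    "XMark": (0.287, 27.6),
    "PSD": (2.5, 79.6),
}

# Structure as a percentage of the total compressed size, to one decimal.
pct_struct = {
    name: round(100 * struct / total, 1)
    for name, (struct, total) in xmlppm_sizes.items()
}
print(pct_struct)
```

Under that assumption the recomputed percentages agree with the table to one decimal place, so text accounts for more than 96% of the xmlppm output on all four files.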

SLIDE 13

Experimental methodology

Three experiments:

  • 1. Memory vs. compression rate: for a wide range of model sizes, measured compression rate vs. memory used
  • 2. Convergence rate: compressed prefixes of large files, and measured prefix length vs. compression rate
  • 3. Memory footprint (not shown): for a wide range of model sizes, measured memory allocated vs. memory used
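The convergence experiment (item 2) can be sketched as follows: compress prefixes of increasing length and record the bit rate for each. This sketch uses `zlib` as a stand-in compressor and a synthetic input, both assumptions; the experiments in the talk used PPM-based models on DBLP and PSD. The function name `convergence_curve` and its parameters are illustrative.

```python
import zlib


def convergence_curve(data, compress=zlib.compress, points=6):
    """Return (prefix_length, bits_per_char) pairs for doubling prefix lengths."""
    lengths = [len(data) >> k for k in range(points - 1, -1, -1)]
    return [
        (n, 8 * len(compress(data[:n])) / n)
        for n in lengths
        if n > 0
    ]


# Synthetic input (an assumption); a real run would read a large XML file.
data = b"<p>text</p>" * 1000
for length, bpc in convergence_curve(data):
    print(f"{length:>6} bytes: {bpc:.3f} bpc")
```

Plotting the pairs with a logarithmic x-axis gives a curve of the same kind as the convergence plots in this talk: input size in bytes against bit rate in bpc.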

SLIDE 14

Experiments

  • Used two large “typical” data sets:
    • DBLP (bibliography, 300MB uncompressed)
    • PSD (protein sequence database, 717MB uncompressed)
  • Tested plain PPM, XMLPPM, SCM, and a “hybrid” (not shown)
  • Further details in paper

SLIDE 15

Memory use vs. compression rate

[Plot, DBLP: bit rate (bpc, axis 0.5 to 1) vs. memory used (MB, log axis 1 to 100); curves for ppm, xmlppm, scm]

SLIDE 16

Memory use vs. compression rate

[Plot, PSD: bit rate (bpc, axis 0.7 to 1.2) vs. memory used (MB, log axis 1 to 100); curves for ppm, xmlppm, scm]

SLIDE 17

Memory use vs. compression rate

  • For DBLP, the improvement for SCM is minor (5%), and it needs over 40MB to achieve this.
  • For PSD, SCM can perform around 10% better; it improves after 10MB.
  • Why?
    • Small XMLPPM models benefit from sharing common statistics
    • But large SCM models benefit from specialization

SLIDE 18

Convergence rate

[Plot, DBLP: bit rate (bpc, axis 0.5 to 3) vs. input size (bytes, log axis 1000 to 1e+09); curves for ppm, xmlppm, scm]

SLIDE 19

Convergence rate

[Plot, PSD: bit rate (bpc, axis 0.5 to 3.5) vs. input size (bytes, log axis 1000 to 1e+09); curves for ppm, xmlppm, scm]

SLIDE 20

Convergence rate

  • Overall trend: SCM performs worse early, but eventually better
  • Why?
    • Because SCM separates text under different elements, each model learns any common text separately
    • But because XMLPPM lumps all text into a single model, it eventually does worse because of averaging

SLIDE 21

Conclusions

  • The SCM approach does provide better compression... provided you give it lots of memory and lots of data
  • Of course, for “archiving” XML DBs (DBLP, PSD, etc.), this is fine!
  • However, the XMLPPM approach is better for small documents or when using small amounts of memory
  • This may make it preferable for on-the-fly compression of XML “messages”: webpages, RDF, RSS feeds, SOAP RPCs
  • Or for low-memory devices such as PDAs, mobile phones

SLIDE 22

Meta-conclusions

  • XML compression research is still a wide open area
  • However, so far experiments have focused on compression rate and ignored other factors
  • More generally, standards for benchmarking and evaluating XML compression systems are needed!
  • Source code should also be made available to allow repeatability: http://xmlppm.sourceforge.net
