An Empirical Evaluation of Simple DTD-Conscious Compression

An Empirical Evaluation of Simple DTD-Conscious Compression Techniques
James Cheney
Database Group/Digital Curation Centre, University of Edinburgh
WebDB 2005, June 17, 2005

Always start with a joke...
Why did the chicken cross the road? To get to the other side!

Always start with a joke...
  <?xml version="1.0"?>
  <!DOCTYPE joke SYSTEM "joke.dtd">
  <joke type="question-answer">
    <setup>
      Why did the chicken cross the road?
    </setup>
    <punch-line>
      To get to the other side!
    </punch-line>
    <laughter type="optional"/>
  </joke>
XML is verbose.

XML Compression
The term XML compression has been used in several different contexts:
1. minimum-length encoding for efficient XML storage and transmission
2. compact binary formats for efficient XML stream processing
3. techniques for efficient in-database XML storage and query processing
For us, XML compression means (1).

Prior work: XML compression
• State of practice: use gzip or bzip2 (or library variants) to compress XML as text
• [Liefke, Suciu 2000] XMill: transform the XML document to bring similar text closer together, then use gzip/bzip2
• [Cheney 2001] XMLPPM: compress XML by leveraging advanced statistical text compression techniques
  - XMLPPM and its variants have the best published results so far.

DTD-conscious compression
DTD/schema information tells us what valid XML documents to expect, so it “obviously” should help compression.
Assume the encoder and decoder have access to an identical DTD.
[Diagram: XML → Encoder (with DTD) → DTD-specific encoding → Decoder (with DTD) → XML]

Prior work: DTD-conscious compression
[Levene and Wood, 2002]: use DTD regexp content models to encode element structure.
Example: in the regexp model (c + d)(ab)* d?, encode cabababd as 011101.
Bits indicate decisions made at choice points during validation.
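The choice-point idea can be sketched in a few lines of Python. This is an illustrative sketch, not Levene and Wood's implementation. The class names (`Sym`, `Seq`, `Alt`, `Star`, `Opt`) and the bit conventions are assumptions: 0 = left branch, 1 = right branch; 1 = iterate, 0 = stop; 1 = optional part present, 0 = absent. They were chosen because they reproduce the slide's 011101. The sketch also assumes a deterministic content model, so one symbol of lookahead decides every choice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sym:
    name: str


@dataclass(frozen=True)
class Seq:
    parts: tuple


@dataclass(frozen=True)
class Alt:
    left: object
    right: object


@dataclass(frozen=True)
class Star:
    body: object


@dataclass(frozen=True)
class Opt:
    body: object


def nullable(n):
    """True if the model can match the empty sequence."""
    if isinstance(n, Sym):
        return False
    if isinstance(n, Seq):
        return all(nullable(p) for p in n.parts)
    if isinstance(n, Alt):
        return nullable(n.left) or nullable(n.right)
    return True  # Star and Opt


def first(n):
    """Set of symbols that can start a match of the model."""
    if isinstance(n, Sym):
        return {n.name}
    if isinstance(n, Seq):
        out = set()
        for p in n.parts:
            out |= first(p)
            if not nullable(p):
                break
        return out
    if isinstance(n, Alt):
        return first(n.left) | first(n.right)
    return first(n.body)  # Star and Opt


def encode(n, syms, pos, bits):
    """Match syms from pos, appending one bit per choice point; return new pos."""
    nxt = syms[pos] if pos < len(syms) else None
    if isinstance(n, Sym):
        if nxt != n.name:
            raise ValueError(f"expected {n.name!r} at position {pos}")
        return pos + 1  # no choice here, so no bit is emitted
    if isinstance(n, Seq):
        for p in n.parts:
            pos = encode(p, syms, pos, bits)
        return pos
    if isinstance(n, Alt):
        take_right = nxt not in first(n.left)
        bits.append("1" if take_right else "0")
        return encode(n.right if take_right else n.left, syms, pos, bits)
    if isinstance(n, Star):
        while pos < len(syms) and syms[pos] in first(n.body):
            bits.append("1")  # go round the loop once more
            pos = encode(n.body, syms, pos, bits)
        bits.append("0")  # leave the loop
        return pos
    if isinstance(n, Opt):
        present = nxt in first(n.body)
        bits.append("1" if present else "0")
        return encode(n.body, syms, pos, bits) if present else pos
    raise TypeError(f"unknown model node: {n!r}")


def encode_children(model, symbols):
    """Return the decision bits for a sequence of one-letter child names."""
    bits = []
    end = encode(model, symbols, 0, bits)
    if end != len(symbols):
        raise ValueError("trailing symbols do not match the content model")
    return "".join(bits)


# The slide's model (c + d)(ab)* d?
MODEL = Seq((Alt(Sym("c"), Sym("d")), Star(Seq((Sym("a"), Sym("b")))), Opt(Sym("d"))))
print(encode_children(MODEL, "cabababd"))  # prints 011101
```

Reading the output left to right: 0 selects c, 111 are three trips through (ab)*, 0 leaves the loop, and 1 records that the optional d is present.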

Prior work: DTD-conscious compression
While likely much more compact than XML text, the LW02 technique does not compress better than XMLPPM.
Why? XMLPPM already “learns” a lot about data structure, and uses a more advanced statistical model than Levene and Wood’s encoding.
Moreover, LW02’s technique is not easy to incorporate into XMLPPM.
Why? LW02’s encoding breaks byte alignment, confusing later text compression stages.
Lesson: need to avoid stepping on the toes of later stages.

Why DTDs vs XML Schemas?
• Pro: DTDs are simpler, more stable, and less work to validate; techniques should generalize
• Con: XML Schemas are more descriptive (especially datatypes) and appear to be more popular now
It is a lot of work to implement DTD-conscious, let alone XML Schema-conscious, compression; is it worth the effort?

Our approach
Look for simple techniques for leveraging DTD information in XMLPPM.
Easier to implement, easier to test, easier to incorporate into XMLPPM.
If simple techniques are effective, more complex techniques may be worthwhile.
Implemented in DTDPPM, an XMLPPM variant that simultaneously validates and compresses.

Four simple optimizations
• Strip ignorable (non-PCDATA) whitespace: obvious but necessary for good compression due to properties of the underlying compressor
• Re-use element, attribute, and default symbols found in DTDs
• Predict element symbols (open and close-element tags) using regular expression context
• Sort and encode attribute lists using bitmaps; use type and default information as well
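The first optimization can be sketched as follows. This is a minimal Python illustration using the standard `xml.etree.ElementTree` module, not DTDPPM's code. It assumes the caller already knows which elements have element-only content (no #PCDATA in their content model); the set `element_only` and the function name are hypothetical.

```python
import xml.etree.ElementTree as ET


def strip_ignorable_whitespace(elem, element_only):
    """Drop whitespace-only text directly inside element-only elements."""
    if elem.tag in element_only:
        # Whitespace before the first child.
        if elem.text is not None and not elem.text.strip():
            elem.text = None
        # Whitespace after each child (still inside this element).
        for child in elem:
            if child.tail is not None and not child.tail.strip():
                child.tail = None
    for child in elem:
        strip_ignorable_whitespace(child, element_only)


doc = "<book> <title>Title</title> <author>Auth1</author> <author>Auth2</author></book>"
root = ET.fromstring(doc)
# Assumption: the DTD declares book as (title,author+), so only book is element-only.
strip_ignorable_whitespace(root, element_only={"book"})
stripped = ET.tostring(root, encoding="unicode")
print(stripped)
```

Whitespace inside title and author is left alone, because their content is #PCDATA and the whitespace there is significant.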

Example
Given element declarations
  <!ELEMENT book (title,author+)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT author (#PCDATA)>
Encode
  <book> <title>Title</title> <author>Auth1</author> <author>Auth2</author></book>
as
  00 'f' 'o' 'o' 'A' 'u' 't' 'h' '1' 01 ... FF FF

Example: attribute list coding
Given attribute list declaration
  <!ATTLIST elt att1 CDATA #FIXED "foo"
                att2 (x|y|z) #REQUIRED
                att3 CDATA #IMPLIED
                att4 CDATA "bar">
we can encode the attribute list of
  <elt att1='foo' att2='y' att4='baz'>
as
  01000000 2 01 'b' 'a' 'z' 00
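One plausible reading of the leading bitmap and the enumerated value is sketched below in Python. The slide does not spell out the layout, so everything here is an assumption: #FIXED and #REQUIRED attributes get no bit, an #IMPLIED attribute gets 1 if present, a defaulted attribute gets 1 if its value differs from the default, the bits are padded to a whole byte, and enumerated values are coded by 1-based position. The function names are hypothetical, and the framing of the string value (the 01 and 00 bytes around 'b' 'a' 'z') is not modeled.

```python
def attribute_bitmap(decls, attrs):
    """Return the presence/override bits, padded with zeros to whole bytes.

    decls: list of (name, kind, default) in DTD order, where kind is one of
           "FIXED", "REQUIRED", "IMPLIED", "DEFAULT".
    attrs: dict of attribute values found on the element.
    """
    bits = []
    for name, kind, default in decls:
        if kind == "IMPLIED":
            bits.append(1 if name in attrs else 0)
        elif kind == "DEFAULT":
            bits.append(1 if attrs.get(name, default) != default else 0)
        # FIXED and REQUIRED attributes carry no information in the bitmap.
    bits += [0] * (-len(bits) % 8)  # pad to a byte boundary
    return "".join(str(b) for b in bits)


def enum_index(choices, value):
    """1-based position of an enumerated attribute value."""
    return choices.index(value) + 1


DECLS = [
    ("att1", "FIXED", "foo"),
    ("att2", "REQUIRED", None),
    ("att3", "IMPLIED", None),
    ("att4", "DEFAULT", "bar"),
]
ATTRS = {"att1": "foo", "att2": "y", "att4": "baz"}

print(attribute_bitmap(DECLS, ATTRS))      # prints 01000000
print(enum_index(("x", "y", "z"), "y"))    # prints 2
```

Under these assumptions att3 (absent) contributes 0 and att4 (overridden with 'baz') contributes 1, which matches the slide's 01000000 and the 2 that follows it. Padding to a byte boundary keeps the output byte-aligned for the later text compression stage.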

Evaluation
• “XMLPPM benchmark”: corpus used in [Cheney 2001]; mostly of historical interest (5MB, mixed sources)
• NewsML: Reuters news reports (2.7MB total, 11KB avg)
• MusicXML: musical scores (1.8MB total, 101KB avg)
• Medium data sets (Washington corpus, 3MB total, mixed sources)
• Large data sets (DBLP, XMark, PSD, Medline, 100-700MB each)

Setup
Experimental setup: AMD64 3000+, 512MB RAM, FC3
Measured
• compression effectiveness (compressed bits per input character)
• compression time (ns per input character)
Note: decompression time for PPM techniques ≈ compression time (but gzip and bzip2 decompress faster than they compress)
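The two measures can be computed as in the Python sketch below. It is not the benchmark harness used for the talk. It covers only gzip and bzip2 through Python's standard `gzip` and `bz2` modules (xmlppm and dtdppm are external tools and are not invoked), treats each input byte as one character, and uses a synthetic sample rather than any of the corpora.

```python
import bz2
import gzip
import time


def measure(compress, data):
    """Return (compressed bits per input character, ns per input character)."""
    if not data:
        raise ValueError("input must be non-empty")
    start = time.perf_counter_ns()
    out = compress(data)
    elapsed_ns = time.perf_counter_ns() - start
    n = len(data)  # assumption: one byte counts as one input character
    return 8 * len(out) / n, elapsed_ns / n


# Synthetic, highly repetitive sample; real corpora will behave differently.
sample = (
    b'<joke type="question-answer"><setup>Why did the chicken cross the road?'
    b"</setup><punch-line>To get to the other side!</punch-line></joke>\n"
) * 1000

for name, fn in (("gzip", gzip.compress), ("bzip2", bz2.compress)):
    bits_per_char, ns_per_char = measure(fn, sample)
    print(f"{name}: {bits_per_char:.3f} bits/char, {ns_per_char:.1f} ns/char")
```

A single timed run like this is noisy; repeating the measurement and taking the median would give steadier numbers.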

[Figure: Compression rate (bits per input character) for gzip, bzip2, xmlppm, and dtdppm on each data set; vertical axis from 0.000 to 3.000]
[Figure: Compression speed (ns per input character) for gzip, bzip2, xmlppm, and dtdppm on each data set; vertical axis from 0 to 2500]
Data sets in both figures: xmlppm (5.3MB), newsml (2.7MB), musicxml (1.8MB), uw (3.9MB), xmark (116MB), medline (127MB), psd (717MB), dblp (103MB)

Observations
Short documents (NewsML) compress better, but re-parsing the DTD is very expensive.
Highly structured documents (MusicXML) compress much better.
Flat data sets or very large irregular documents compress no better than with bzip2, but xmlppm/dtdppm are faster than bzip2.
XMark compresses no better, but may not be a realistic compression benchmark (since it is randomly generated).

Which technique is best?
No single technique dominates.
In particular, the improvement does not all come from whitespace (WS) stripping; each technique can account for 0-80% of the improvement.
A variety of techniques is needed because XML data structure varies widely.
WS stripping is probably the best value for effort: everyone should do it when compressing XML (and many already do).

Conclusions
DTD information “obviously” should be useful for compression.
However, real improvements over advanced XML-only techniques do not come easily.
We have explored many alternatives and identified four that do work (in the context of one XML compressor, XMLPPM).
Future work: improving efficiency, more advanced techniques, XML Schema
http://sourceforge.net/projects/xmlppm
