Syllable-based compression for XML: Katsiaryna Chernik, Jan Lánský, Leo Galamboš (PowerPoint presentation)


slide-1
SLIDE 1

Syllable-based compression for XML

Katsiaryna Chernik, Jan Lánský, Leo Galamboš

Dept. of Software Engineering, Faculty of Mathematics and Physics, Charles University

slide-2
SLIDE 2

Content

  • Motivation
  • Syllable-based compression
  • XMLSyl
  • XMillSyl
  • Results
  • Conclusion

slide-3
SLIDE 3

Motivation

XML:
  • Simple text format for structured text documents
  • Data exchange standard
  • High redundancy

slide-4
SLIDE 4

Compression Methods for XML

  • Character-based: XMill, XMLPPM, XGrind, ...
  • Word-based: ?
  • Syllable-based: ?

slide-5
SLIDE 5

Syllable-based compression

  • LZWL: dictionary-based method; syllable-based version of LZW
  • HufSyl: statistical method; adaptive Huffman coding; inspired by HuffWord
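To make "syllable-based version of LZW" concrete, here is a minimal sketch of LZW running over a sequence of tokens (for example syllables) instead of characters. It is not the LZWL implementation: the function names `lzw_encode` and `lzw_decode` are hypothetical, and seeding the dictionary with the distinct tokens of the input is an assumption made only to keep the example self-contained.

```python
def lzw_encode(tokens):
    """LZW over a sequence of tokens (e.g. syllables) instead of characters."""
    # Assumption: the initial dictionary holds every distinct token of the
    # input, in order of first appearance, and is shared with the decoder.
    alphabet = list(dict.fromkeys(tokens))
    table = {(tok,): i for i, tok in enumerate(alphabet)}
    codes, phrase = [], ()
    for tok in tokens:
        candidate = phrase + (tok,)
        if candidate in table:
            phrase = candidate              # keep extending the known phrase
        else:
            codes.append(table[phrase])     # emit the longest known phrase
            table[candidate] = len(table)   # learn the new, longer phrase
            phrase = (tok,)
    if phrase:
        codes.append(table[phrase])
    return alphabet, codes


def lzw_decode(alphabet, codes):
    """Inverse of lzw_encode; rebuilds the phrase table while decoding."""
    table = [(tok,) for tok in alphabet]
    if not codes:
        return []
    prev = table[codes[0]]
    out = list(prev)
    for code in codes[1:]:
        if code < len(table):
            entry = table[code]
        else:
            # Classic LZW corner case: the code names the phrase being built.
            entry = prev + (prev[0],)
        out.extend(entry)
        table.append(prev + (entry[0],))
        prev = entry
    return out
```

For the token sequence `["ba", "na", "na", "ba", "na", "na"]` the encoder emits five codes for six tokens, and `lzw_decode` restores the original sequence.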

slide-6
SLIDE 6

Syllable-based compression

  • Syllable-based compression is suitable for languages with rich morphology (Czech)
  • Syllable-based compression is suitable for small or middle-sized files

slide-7
SLIDE 7

Syllable-based compression of XML

  • The majority of XML documents are small or middle-sized
  • Many XML documents are text-like: news in RSS format, documentation or books in DocBook format
  • Syllable-based compression and XML?

slide-8
SLIDE 8

XMLSyl

Idea

  • Plain syllable-based compressor: XML tokens are split into many syllables
  • XMLSyl: XML tokens are treated as single syllables
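The difference between the two tokenizations can be sketched as follows. The splitter below is a toy vowel-group heuristic, not the authors' syllable decomposition, and `toy_syllables` and `xml_aware_tokens` are hypothetical names introduced only for this illustration.

```python
import re

VOWELS = "aeiouyAEIOUY"
# Toy rule (assumption): a syllable is consonants + a vowel group, plus any
# trailing consonants at the end of the word; vowel-free runs stay whole.
SYLLABLE = re.compile(rf"[^{VOWELS}]*[{VOWELS}]+(?:[^{VOWELS}]+$)?|[^{VOWELS}]+")
TAG = re.compile(r"</?[^<>]+>")


def toy_syllables(text):
    """Split text into toy syllables; every non-letter is its own token."""
    tokens = []
    for piece in re.findall(r"[A-Za-z]+|[^A-Za-z]", text):
        if piece.isalpha():
            tokens.extend(SYLLABLE.findall(piece))
        else:
            tokens.append(piece)
    return tokens


def xml_aware_tokens(text):
    """Keep each XML tag as one token; split only the text between tags."""
    tokens, pos = [], 0
    for match in TAG.finditer(text):
        tokens.extend(toy_syllables(text[pos:match.start()]))
        tokens.append(match.group())
        pos = match.end()
    tokens.extend(toy_syllables(text[pos:]))
    return tokens
```

With this toy rule, `toy_syllables("<title>XML</title>")` yields ten tokens (`<`, `ti`, `tle`, `>`, ...), while `xml_aware_tokens` yields three: `<title>`, `XML`, `</title>`.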

slide-9
SLIDE 9

XMLSyl

Architecture

[Diagram: XML Document -> SAX Parser -> Structure Encoder -> Element Container, Attribute Container, Data and Structure Container -> Syllable Compressor (two instances shown) -> Compressed XML document]

slide-10
SLIDE 10

XMLSyl

Example

XML doc:
    <book><title lang="en">XML</title></book>

SAX events:
    startElement("book")
    startElement("title", ("lang", "en"))
    characters("XML")
    endElement("title")
    endElement("book")

slide-11
SLIDE 11

XMLSyl

Example – Encoding process

SAX events:
    startElement("book")
    startElement("title", ("lang", "en"))
    characters("XML")
    endElement("title")
    endElement("book")

Element Container: book = E0, title = E1
Attribute Container: lang = A0
Data and Structure Container: E0 E1 A0 en END_ATT CHAR XML END_CHAR END_TAG END_TAG
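The encoding step of this example can be sketched with Python's standard `xml.sax` module. This is an illustration under assumptions, not the XMLSyl source: the class name `ContainerEncoder` is hypothetical, `END_ATT` is assumed to be emitted only when an element has attributes (as in the example), and whitespace-only text is assumed to be dropped.

```python
import xml.sax


class ContainerEncoder(xml.sax.ContentHandler):
    """Fill the three containers from SAX events (sketch of the slide example)."""

    def __init__(self):
        super().__init__()
        self.elements = {}    # Element Container: element name -> code
        self.attributes = {}  # Attribute Container: attribute name -> code
        self.stream = []      # Data and Structure Container
        self._text = []       # buffer, since SAX may deliver text in chunks

    def _flush_text(self):
        text = "".join(self._text)
        self._text = []
        if text.strip():  # assumption: whitespace-only text is dropped
            self.stream += ["CHAR", text, "END_CHAR"]

    def startElement(self, name, attrs):
        self._flush_text()
        # Assign the next free code (E0, E1, ...) on first sight of a name.
        self.stream.append(self.elements.setdefault(name, f"E{len(self.elements)}"))
        for attr in attrs.getNames():
            code = self.attributes.setdefault(attr, f"A{len(self.attributes)}")
            self.stream += [code, attrs.getValue(attr)]
        if attrs.getLength():
            self.stream.append("END_ATT")

    def characters(self, content):
        self._text.append(content)

    def endElement(self, name):
        self._flush_text()
        self.stream.append("END_TAG")


def encode(xml_bytes):
    handler = ContainerEncoder()
    xml.sax.parseString(xml_bytes, handler)
    return handler
```

Calling `encode(b'<book><title lang="en">XML</title></book>')` fills the containers as in the example: `book` and `title` map to `E0` and `E1`, `lang` maps to `A0`, and the stream reads `E0 E1 A0 en END_ATT CHAR XML END_CHAR END_TAG END_TAG`. In XMLSyl the resulting stream would then go to a syllable compressor.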

slide-12
SLIDE 12

XMLSyl

Implementation details

  • SAX parser: Expat
  • Syllable compressor: LZWL and HufSyl
  • Encoding was inspired by existing XML compression methods: XMLPPM, XGrind, XPress, XMill

slide-13
SLIDE 13

XMillSyl

  • Based on XMill
  • Main principles of XMill: separating structure from data; grouping data values with related meaning

slide-14
SLIDE 14

Architecture of XMill

[Diagram: Input XML file -> SAX Parser -> Path Processor -> Structure Container and Data Containers 1, 2, ..., k -> gzip (one per container) -> Compressed XML file]
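The two XMill principles (separating structure from data, and grouping related data values) can be sketched as below. This is a simplified illustration, not XMill itself: the names `PathSplitter` and `compress_containers` are hypothetical, grouping by the full element path is an assumption, the structure tokens are invented for readability, and `zlib.compress` stands in for gzip. The compressor is a parameter, so a syllable-based coder could be plugged in for the data containers.

```python
import xml.sax
import zlib
from collections import defaultdict


class PathSplitter(xml.sax.ContentHandler):
    """Route structure and data into separate containers (simplified sketch)."""

    def __init__(self):
        super().__init__()
        self.path = []
        self.structure = []            # structure container (hypothetical tokens)
        self.data = defaultdict(list)  # element path -> data container
        self._text = []

    def _flush_text(self):
        text = "".join(self._text)
        self._text = []
        if text.strip():  # assumption: whitespace-only text is dropped
            self.data["/".join(self.path)].append(text)
            self.structure.append("TEXT")

    def startElement(self, name, attrs):
        self._flush_text()
        self.path.append(name)
        self.structure.append("OPEN:" + name)
        for attr in attrs.getNames():
            self.data["/".join(self.path) + "/@" + attr].append(attrs.getValue(attr))
            self.structure.append("ATTR:" + attr)

    def characters(self, content):
        self._text.append(content)

    def endElement(self, name):
        self._flush_text()
        self.structure.append("CLOSE")
        self.path.pop()


def compress_containers(xml_bytes, compress=zlib.compress):
    """Compress the structure and each data container independently."""
    handler = PathSplitter()
    xml.sax.parseString(xml_bytes, handler)
    structure_blob = compress(" ".join(handler.structure).encode("utf-8"))
    data_blobs = {path: compress("\0".join(values).encode("utf-8"))
                  for path, values in handler.data.items()}
    return handler, structure_blob, data_blobs
```

For a document with two `book` elements, all titles end up in one container and all years in another, so each compressor call sees values of the same kind.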

slide-15
SLIDE 15

XMill – one container

[Diagram: Input XML file -> SAX Parser -> Path Processor -> Structure Container and a single Data Container -> gzip (one per container) -> Compressed XML file]

slide-16
SLIDE 16

XMillSyl

Architecture

[Diagram: Input XML file -> SAX Parser -> Path Processor -> Structure Container (gzip) and Data Containers 1, 2, ..., k (LZWL / HufSyl each) -> Compressed XML file]

slide-17
SLIDE 17

Syllable-based compression of XML

Experimental results

XMLSyl & XMillSyl vs. LZWL & HufSyl:
  • Non-textual XML data: 50-60% better
  • Textual XML data: 10-20% better

slide-18
SLIDE 18

Syllable-based compression of XML

Experimental results

[Chart: text-like XML documents; series: XMillSyl (more containers), XMillSyl (one container), XMLSyl]

slide-19
SLIDE 19

Syllable-based compression of XML

Experimental results

XMLSyl:
  • XMLhuf is suitable for small files
  • XMLzwl is suitable for large files

XMLSyl vs. XMill:
  • On average 10-15% worse than XMill
  • On some documents the same performance or better

[Chart: text-like XML documents]

slide-20
SLIDE 20

Conclusion

  • New syllable-based compression methods for XML:
    XMLSyl (versions: XMLzwl, XMLhuf), XMillSyl (versions: XMillzwl, XMillhuf)
  • One of our methods outperforms XMill on some documents
slide-21
SLIDE 21

Conclusion

Future work:
  • extract and utilize the information in the DTD section
  • create a special syllable dictionary for elements and attributes
  • compress HTML data