Compressing and Searching XML Data Via Two Zips (Paolo Ferragina)

SLIDE 1

Paolo Ferragina, Università di Pisa

Compressing and Searching XML Data Via Two Zips

Paolo Ferragina

Dipartimento di Informatica, Università di Pisa

[Joint with F. Luccio, G. Manzini, S. Muthukrishnan]

SLIDE 2

Six years ago... [now, J. ACM 05]

Opportunistic Data Structures with Applications

  • P. Ferragina, G. Manzini

The survey by Navarro and Mäkinen cites more than 50 papers on the subject!

SLIDE 3

An XML excerpt

<dblp>
  <book>
    <author> Donald E. Knuth </author>
    <title> The TeXbook </title>
    <publisher> Addison-Wesley </publisher>
    <year> 1986 </year>
  </book>
  <article>
    <author> Donald E. Knuth </author>
    <author> Ronald W. Moore </author>
    <title> An Analysis of Alpha-Beta Pruning </title>
    <pages> 293-326 </pages>
    <year> 1975 </year>
    <volume> 6 </volume>
    <journal> Artificial Intelligence </journal>
  </article>

  ...

</dblp>

It is verbose!

SLIDE 4

A tree interpretation...

  • XML document exploration ≡ Tree navigation
  • XML document search ≡ Labeled subpath searches

Subset of XPath [W3C]
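To make the tree interpretation concrete, here is a minimal Python sketch that parses a shortened copy of the DBLP excerpt and answers a labeled subpath search by brute force. The function names `labeled_paths` and `subpath_search` are hypothetical, and the full scan is only an illustration of the query semantics, not the compressed index discussed in the talk.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the DBLP-style excerpt (one book only).
XML = """<dblp>
  <book>
    <author>Donald E. Knuth</author>
    <title>The TeXbook</title>
    <year>1986</year>
  </book>
</dblp>"""

def labeled_paths(elem, prefix=()):
    """Yield (tag path from the root, text) for every element, in pre-order."""
    path = prefix + (elem.tag,)
    yield path, (elem.text or "").strip()
    for child in elem:
        yield from labeled_paths(child, path)

def subpath_search(root, labels):
    """Texts of the nodes whose downward path ends with the given labels."""
    k = len(labels)
    return [text for path, text in labeled_paths(root)
            if path[-k:] == tuple(labels)]

root = ET.fromstring(XML)
authors = subpath_search(root, ["book", "author"])
print(authors)  # ['Donald E. Knuth']
```

The query corresponds to the XPath-like expression //book/author; every node is visited, which is exactly the cost a compressed index tries to avoid.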

SLIDE 5

The Problem

Summary indexes (like Dataguide, 1-index or 2-index)

  • take large space and do not support “content” searches

XML-aware compressors (like XMill, XmlPpm, ScmPpm,...)

  • need to decompress the whole file

XML-queriable compressors (like XPress, XGrind, XQzip,...)

  • achieve poor compression and scan the whole (compressed) file

In theory many solutions exist, starting from [Jacobson, IEEE Focs ’89]

  • no subpath/content searches, and poor performance on labeled trees

We wish to devise a compressed representation for a labeled tree T that efficiently supports the following operations:

  • Navigational operations: parent(u), child(u, i), child(u, i, c)
  • Subpath searches: given a sequence Π of k labels
  • Content searches: subpath + substring search
  • Visualization operation: given a node, visualize its descending subtree

XML-native search engines might exploit this tool as a core block for query optimization and (compressed) storage

SLIDE 6

A transform for “labeled trees”

[Ferragina et al., IEEE Focs ’05]

We proposed the XBW-transform, which mimics on trees the nice structural properties of the Burrows-Wheeler Transform on strings (do you know bzip?).

The XBW linearizes the tree T into 2 arrays such that:

  • the compression of T reduces to using any k-th order entropy compressor (gzip, bzip,...) over these two arrays
  • the indexing of T reduces to implementing simple rank/select query operations over these two arrays

SLIDE 7

The XBW-Transform

[Figure: the labeled tree T = C( B( D(c, a), c ), A( b, a, D(c) ), B( D(b), a ) )]

Sα   Sπ
C    ε
B    C
D    B C
c    D B C
a    D B C
c    B C
A    C
b    A C
a    A C
D    A C
c    D A C
B    C
D    B C
b    D B C
a    B C

Sα: permutation of tree nodes
Sπ: upward labeled paths

Step 1.

Visit the tree in pre-order. For each node, write down its label and the labels on its upward path
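Step 1 can be sketched in Python as follows. The nested-tuple encoding of the example tree and the helper names `node` and `preorder_rows` are assumptions made for illustration; they are not taken from the talk.

```python
def node(label, *children):
    """Hypothetical tree encoding: (label, [children])."""
    return (label, list(children))

# The example tree, read off the Sα / Sπ columns of the slide.
T = node("C",
         node("B", node("D", node("c"), node("a")), node("c")),
         node("A", node("b"), node("a"), node("D", node("c"))),
         node("B", node("D", node("b")), node("a")))

def preorder_rows(tree, upward=()):
    """Yield (label, upward path); the path goes from the parent up to the root."""
    label, children = tree
    yield label, upward
    for child in children:
        yield from preorder_rows(child, (label,) + upward)

rows = list(preorder_rows(T))
s_alpha = "".join(label for label, _ in rows)
s_pi = ["".join(path) or "ε" for _, path in rows]
print(s_alpha)    # CBDcacAbaDcBDba
print(s_pi[:4])   # ['ε', 'C', 'BC', 'DBC']
```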

SLIDE 8

The XBW-Transform

[Figure: the labeled tree T next to its sorted rows]

Sα   Sπ
C    ε
b    A C
a    A C
D    A C
D    B C
c    B C
D    B C
a    B C
B    C
A    C
B    C
c    D A C
c    D B C
a    D B C
b    D B C

Sπ: upward labeled paths

Step 2.

Stably sort according to Sπ
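A minimal sketch of Step 2, assuming the pre-order rows of the example tree are given as (label, upward path) pairs with single-character labels. Python's `sorted` is stable, which is the property the step relies on.

```python
# Pre-order rows (label, upward path) of the example tree; "" stands for ε.
PREORDER = [("C", ""), ("B", "C"), ("D", "BC"), ("c", "DBC"), ("a", "DBC"),
            ("c", "BC"), ("A", "C"), ("b", "AC"), ("a", "AC"), ("D", "AC"),
            ("c", "DAC"), ("B", "C"), ("D", "BC"), ("b", "DBC"), ("a", "BC")]

# Sort on the path only. Stability keeps rows with equal paths in pre-order,
# so siblings stay adjacent and in their left-to-right order.
sorted_rows = sorted(PREORDER, key=lambda row: row[1])
sorted_alpha = "".join(label for label, _ in sorted_rows)
print(sorted_alpha)  # CbaDDcDaBABccab
```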

SLIDE 9

The XBW-Transform

Slast  Sα   Sπ
1      C    ε
0      b    A C
0      a    A C
1      D    A C
0      D    B C
1      c    B C
0      D    B C
1      a    B C
0      B    C
0      A    C
1      B    C
1      c    D A C
0      c    D B C
1      a    D B C
1      b    D B C

Step 3.

Add a binary array Slast marking the rows corresponding to last children

XBW = < Slast, Sα >

XBW takes optimal t log |Σ| + 2t bits

XBW can be built and inverted in optimal O(t) time

Key fact

Nodes correspond to items in < Slast, Sα >
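The three steps can be put together in a short Python sketch. It assumes single-character labels and a hypothetical (label, [children]) tree encoding, and it sorts explicit path strings, so it is a naive construction for illustration and not the optimal O(t) algorithm mentioned above.

```python
def xbw(tree):
    """Return (s_last, s_alpha) for a tree encoded as (label, [children])."""
    rows = []  # (upward path, is_last_child, label), collected in pre-order

    def visit(node, upward, is_last):
        label, children = node
        rows.append((upward, is_last, label))
        for i, child in enumerate(children):
            visit(child, label + upward, i == len(children) - 1)

    visit(tree, "", True)               # the root counts as a last child
    rows.sort(key=lambda row: row[0])   # stable sort on the upward path only
    s_last = "".join("1" if last else "0" for _, last, _ in rows)
    s_alpha = "".join(label for _, _, label in rows)
    return s_last, s_alpha

# The example tree of the slides.
T = ("C", [("B", [("D", [("c", []), ("a", [])]), ("c", [])]),
           ("A", [("b", []), ("a", []), ("D", [("c", [])])]),
           ("B", [("D", [("b", [])]), ("a", [])])])

s_last, s_alpha = xbw(T)
print(s_last)   # 100101010011011
print(s_alpha)  # CbaDDcDaBABccab
```

Sπ is used only as the sort key and is not part of the output.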

SLIDE 10

XBzip – a simple XML compressor

[Figure labels: Pcdata; Tags, Attributes and symbol =]

XBW is compressible:

  • Sα and Spcdata are locally homogeneous
  • Slast has some structure

SLIDE 11

XBzip = XBW + PPMd

String compressors are not so bad: within 5%

[Bar chart, axis 0% to 25%: gzip, bzip2, ppmdi, xmill + ppmdi, scmppm and XBzip on DBLP, Pathways and News]

SLIDE 12

Some structural properties

[Figure: the example tree next to its XBW arrays Slast, Sα and Sπ; a node labeled B is highlighted]

Two useful properties:

  • Children are contiguous and delimited by 1s
  • Children reflect the order of their parents

SLIDE 13

XBW is navigational

[Figure: the example tree next to its XBW arrays Slast, Sα and Sπ; a node labeled B is highlighted]

XBW is navigational:

  • Rank-Select data structures on Slast and Sα
  • The array C of |Σ| integers

Get_children (for the highlighted B):

  • Rank(B,Sα) = 2
  • Select in Slast the 2nd 1 from here...

C: A 2, B 5, C 9, D 12
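A hedged Python sketch of Get_children on the example arrays. Rows are 1-based as in the C array; Slast is written with explicit 0s, which the slide leaves blank. The `rank` and `select` helpers are naive linear scans standing in for real rank/select data structures, and treating every label missing from C as a leaf is an assumption of this sketch.

```python
S_LAST = "100101010011011"
S_ALPHA = "CbaDDcDaBABccab"
C = {"A": 2, "B": 5, "C": 9, "D": 12}  # first row whose Sπ starts with the label

def rank(seq, symbol, i):
    """Occurrences of symbol in rows 1..i (inclusive)."""
    return seq[:i].count(symbol)

def select(seq, symbol, j):
    """Row of the j-th occurrence of symbol; 0 when j == 0."""
    row = 0
    for _ in range(j):
        row = seq.index(symbol, row) + 1
    return row

def get_children(i):
    """Rows (first, last) of the children of the node at row i, None for a leaf."""
    label = S_ALPHA[i - 1]
    if label not in C:
        return None
    r = rank(S_ALPHA, label, i)          # this is the r-th node with that label
    z = rank(S_LAST, "1", C[label] - 1)  # sibling groups before the label's block
    return select(S_LAST, "1", z + r - 1) + 1, select(S_LAST, "1", z + r)

first, last = get_children(11)           # row 11 holds the second B
print(first, last, S_ALPHA[first - 1:last])  # 7 8 Da
```

For the B at row 11, Rank(B,Sα) = 2, so its children form the 2nd group of rows inside the block of B that starts at row C[B] = 5.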

SLIDE 14

XBW is searchable (count subpaths)

[Figure: the XBW-index, i.e. the example tree next to its arrays Slast, Sα and Sπ, with the row range [fr, lr] marked]

Π = B D

Rows whose Sπ starts with ‘B’: [fr, lr]

Inductive step:

  • Pick the next char Π[i+1], i.e. ‘D’
  • Search for the first and last ‘D’ in Sα[fr,lr]
  • Jump to their children

Their children have upward path = ‘D B’

2 occurrences of Π because of two 1s

XBW is searchable:

  • Rank-Select data structures on Slast and Sα
  • Array C of |Σ| integers

C: A 2, B 5, C 9, D 12
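The counting procedure can be sketched in Python on the example arrays. Rows are 1-based, Slast is written with explicit 0s, and `rank`/`select` are naive scans used as stand-ins for succinct structures. The sketch assumes every label of Π is an internal-node label (a key of C), since matches are counted as groups of children via the 1s of Slast.

```python
S_LAST = "100101010011011"
S_ALPHA = "CbaDDcDaBABccab"
C = {"A": 2, "B": 5, "C": 9, "D": 12}  # first row whose Sπ starts with the label

def rank(seq, symbol, i):
    """Occurrences of symbol in rows 1..i (inclusive)."""
    return seq[:i].count(symbol)

def select(seq, symbol, j):
    """Row of the j-th occurrence of symbol; 0 when j == 0."""
    row = 0
    for _ in range(j):
        row = seq.index(symbol, row) + 1
    return row

def block(label):
    """Rows [fr, lr] whose Sπ starts with label."""
    starts = sorted(C.values()) + [len(S_ALPHA) + 1]
    fr = C[label]
    return fr, starts[starts.index(fr) + 1] - 1

def count_subpath(pi):
    """Number of occurrences of the downward labeled path pi, e.g. "BD"."""
    fr, lr = block(pi[0])
    for c in pi[1:]:
        k1 = rank(S_ALPHA, c, fr - 1) + 1   # rank of the first c in Sα[fr, lr]
        k2 = rank(S_ALPHA, c, lr)           # rank of the last c in Sα[fr, lr]
        if k2 < k1:
            return 0                        # no c among these rows
        z = rank(S_LAST, "1", C[c] - 1)
        fr = select(S_LAST, "1", z + k1 - 1) + 1   # jump to the children
        lr = select(S_LAST, "1", z + k2)
    return rank(S_LAST, "1", lr) - rank(S_LAST, "1", fr - 1)

print(count_subpath("BD"))  # 2
```

For Π = B D the range moves from rows [5, 8] to rows [13, 15], and the two 1s of Slast in that range give the 2 occurrences.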

SLIDE 15

XBzipIndex: XBW + FM-index

  • Up to 36% improvement in compression ratio
  • Query (counting) time ≅ 8 ms, Navigation time ≅ 3 ms

[Bar chart, axis 0% to 60%: Huffword, XPress, XQzip, XBzipIndex and XBzip on DBLP, Pathways and News]

DBLP: 1.75 bytes/node, Pathways: 0.31 bytes/node, News: 3.91 bytes/node

SLIDE 16

The overall picture on Compressed Indexing...

[Diagram: data types, Indexing and Compressed Indexing linked by a “strong connection”; references shown: [Kosaraju, Focs ’89], [IEEE Focs ’05], [WWW ’06], [IEEE Focs ’00], [J. ACM ’05]]

This is a powerful paradigm to design compressed indexes:

  1. Transform the input into a few arrays (via BWT or XBW)
  2. Index (+ Compress) the arrays to support rank/select ops
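For strings, the same two-step paradigm can be illustrated with a toy Python sketch: the BWT produces one array, and counting a pattern needs only rank queries on it (backward search). The `$` end marker, the rotation-sorting construction and the linear-time rank are simplifying assumptions, not how a real compressed index is built.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations; '$' is an assumed end marker."""
    s = text + "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def count(pattern, last):
    """Count occurrences of pattern using only rank queries on the BWT array."""
    first = sorted(last)
    smaller = {ch: first.index(ch) for ch in set(last)}  # symbols smaller than ch
    lo, hi = 0, len(last)                                # half-open row range
    for ch in reversed(pattern):
        if ch not in smaller:
            return 0
        lo = smaller[ch] + last[:lo].count(ch)           # rank(ch, lo)
        hi = smaller[ch] + last[:hi].count(ch)           # rank(ch, hi)
        if lo >= hi:
            return 0
    return hi - lo

L = bwt("mississippi")
print(L)                # ipssm$pissii
print(count("ssi", L))  # 2
```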

Theory: Soda ’06 (2), Cpm ’06 (2), Icalp ’06 (2), DCC ’06 (1)

Experimental: Wea ’06 (2)

http://pizzachili.di.unipi.it or http://pizzachili.dcc.uchile.cl