Paolo Ferragina, Università di Pisa
Compressing and Searching XML Data Via Two Zips
Paolo Ferragina
Dipartimento di Informatica, Università di Pisa
[Joint with F. Luccio, G. Manzini, S. Muthukrishnan]
Six years ago... [now, J. ACM ’05]
The survey by Navarro and Mäkinen cites more than 50 papers on the subject!
<dblp>
  <book>
    <author> Donald E. Knuth </author>
    <title> The TeXbook </title>
    <publisher> Addison-Wesley </publisher>
    <year> 1986 </year>
  </book>
  <article>
    <author> Donald E. Knuth </author>
    <author> Ronald W. Moore </author>
    <title> An Analysis of Alpha-Beta Pruning </title>
    <pages> 293-326 </pages>
    <year> 1975 </year>
    <volume> 6 </volume>
    <journal> Artificial Intelligence </journal>
  </article>
</dblp>
Subset of XPath [W3C]
We wish to devise a compressed representation for a labeled tree T that efficiently supports some operations:
might exploit this tool as a core block for query optimization and (compressed) storage
[Ferragina et al, IEEE Focs ’05]
structural properties of the Burrows-Wheeler Transform on strings (do you know bzip?).
compressor (gzip, bzip,...) over these two arrays
query operations over these two arrays
[Figure: the labeled tree T (15 nodes, root C) next to the arrays Sα and Sπ]

Sα    Sπ (upward labeled paths)
C     ε
B     C
D     B C
c     D B C
a     D B C
c     B C
A     C
b     A C
a     A C
D     A C
c     D A C
B     C
D     B C
b     D B C
a     B C
Visit the tree in pre-order. For each node, write down its label and the labels on its upward path
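The pre-order visit above can be sketched in a few lines of Python. This is only an illustration: the nested-tuple encoding `(label, [children])`, the variable `T`, and the function name `preorder_rows` are assumptions of the sketch, and the tree shape is read off the slide's example.

```python
# Hypothetical encoding of the example tree: (label, [children]).
T = ("C", [
    ("B", [("D", [("c", []), ("a", [])]), ("c", [])]),
    ("A", [("b", []), ("a", []), ("D", [("c", [])])]),
    ("B", [("D", [("b", [])]), ("a", [])]),
])

def preorder_rows(node, path=()):
    """Yield (label, upward_path) for each node, visiting the tree in pre-order."""
    label, children = node
    yield label, path
    for child in children:
        # A child's upward path starts at its parent and ends at the root.
        yield from preorder_rows(child, (label,) + path)

rows = list(preorder_rows(T))
S_alpha = [label for label, _ in rows]
S_pi = [" ".join(path) or "ε" for _, path in rows]

for label, path in zip(S_alpha, S_pi):
    print(label, "|", path)
```

Each printed line is one row: the node label, then the labels met while walking from its parent up to the root.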
[Figure: the labeled tree T next to the arrays Sα and Sπ after sorting]

row   Sα    Sπ (upward labeled paths)
1     C     ε
2     b     A C
3     a     A C
4     D     A C
5     D     B C
6     c     B C
7     D     B C
8     a     B C
9     B     C
10    A     C
11    B     C
12    c     D A C
13    c     D B C
14    a     D B C
15    b     D B C
Stably sort according to Sπ
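A minimal sketch of the stable sort, assuming the pre-order rows of the slide's example are given as `(label, upward_path)` pairs with single-character labels (the list `rows` and the other names are illustrative only). Python's `sorted` is stable, so rows with equal paths keep their pre-order order.

```python
# Pre-order rows of the example tree: (label, upward path as a string).
rows = [
    ("C", ""), ("B", "C"), ("D", "BC"), ("c", "DBC"), ("a", "DBC"),
    ("c", "BC"), ("A", "C"), ("b", "AC"), ("a", "AC"), ("D", "AC"),
    ("c", "DAC"), ("B", "C"), ("D", "BC"), ("b", "DBC"), ("a", "BC"),
]

# Stable sort on the upward path only; ties keep their pre-order position.
sorted_rows = sorted(rows, key=lambda row: row[1])

S_alpha_sorted = "".join(label for label, _ in sorted_rows)
S_pi_sorted = [path for _, path in sorted_rows]
print(S_alpha_sorted)
```

Stability matters here: it keeps the children of one node adjacent and in their original left-to-right order.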
XBW takes optimal t log |Σ| + 2t bits
Add a binary array Slast marking the rows corresponding to last children

row   Slast   Sα    Sπ
1     1       C     ε
2     0       b     A C
3     0       a     A C
4     1       D     A C
5     0       D     B C
6     1       c     B C
7     0       D     B C
8     1       a     B C
9     0       B     C
10    0       A     C
11    1       B     C
12    1       c     D A C
13    0       c     D B C
14    1       a     D B C
15    1       b     D B C

XBW: the columns Slast and Sα
XBW can be built and inverted in optimal O(t) time
Nodes correspond to items in < Slast,Sα>
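The construction of the pair < Slast,Sα> can be sketched as follows. Note that this sketch sorts the rows, so it runs in O(t log t) time and is not the optimal O(t) construction mentioned above. The tree encoding, the function name `xbw`, and the single-character labels are assumptions made for illustration.

```python
# Hypothetical encoding of the example tree: (label, [children]).
T = ("C", [
    ("B", [("D", [("c", []), ("a", [])]), ("c", [])]),
    ("A", [("b", []), ("a", []), ("D", [("c", [])])]),
    ("B", [("D", [("b", [])]), ("a", [])]),
])

def xbw(tree):
    """Return (S_last, S_alpha) for a tree with single-character labels."""
    rows = []

    def visit(node, path, is_last):
        label, children = node
        rows.append((path, is_last, label))
        for i, child in enumerate(children):
            # The last child of each node gets its S_last bit set.
            visit(child, label + path, i == len(children) - 1)

    visit(tree, "", True)           # pre-order visit; the root counts as a last child
    rows.sort(key=lambda r: r[0])   # stable sort on the upward path only
    S_last = [int(last) for _, last, _ in rows]
    S_alpha = [label for _, _, label in rows]
    return S_last, S_alpha

S_last, S_alpha = xbw(T)
print(S_last)
print("".join(S_alpha))
```

The upward paths are used only as sort keys and are not part of the returned pair.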
XBW is compressible:
Sα and Spcdata are locally homogeneous
Slast has some structure
[Figure labels: Pcdata; Tags, Attributes and symbol =]
String compressors are not so bad: within 5%
[Bar chart: axis from 0% to 25%; datasets DBLP, Pathways, News; compressors gzip, bzip2, ppmdi, xmill + ppmdi, scmppm, XBzip]
Two useful properties:
[Figure: the tree T next to its XBW arrays Slast, Sα and Sπ (15 rows, Slast has eight 1s), with a node labeled B highlighted]
XBW is navigational:
[Figure: the tree T next to its XBW arrays Slast, Sα and Sπ (15 rows, Slast has eight 1s), with a node labeled B highlighted]
Get_children
Rank(B,Sα) = 2
Select in Slast the 2nd 1, counting from row 5 (the first row whose Sπ starts with B)
First row whose Sπ starts with each symbol: A 2, B 5, C 9, D 12
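The Get_children step can be sketched with naive linear-time rank and select over the example arrays (rows are 1-based, as on the slides). The helper names `rank`, `select`, `get_children` and the dictionary `F` are illustrative assumptions; a real index would use succinct rank/select structures instead of scans.

```python
S_last = [1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1]
S_alpha = list("CbaDDcDaBABccab")
# First row whose upward path starts with each (internal) symbol.
F = {"A": 2, "B": 5, "C": 9, "D": 12}

def rank(symbol, seq, i):
    """Number of occurrences of symbol in rows 1..i of seq."""
    return seq[:i].count(symbol)

def select(symbol, seq, k):
    """Row (1-based) of the k-th occurrence of symbol in seq."""
    count = 0
    for row, s in enumerate(seq, start=1):
        if s == symbol:
            count += 1
            if count == k:
                return row
    raise ValueError("fewer than k occurrences")

def get_children(i):
    """Return (first_row, last_row) of the children of the node at row i."""
    c = S_alpha[i - 1]
    if c not in F:
        return None                      # assumption: leaf labels never occur in F
    k = rank(c, S_alpha, i)              # row i holds the k-th node labeled c
    z = rank(1, S_last, F[c] - 1)        # 1s that precede the block of c
    first = select(1, S_last, z + k - 1) + 1
    last = select(1, S_last, z + k)      # the k-th 1 inside the block of c
    return first, last

print(get_children(11))  # the second B: Rank(B, S_alpha) = 2
```

For row 11 the sketch selects the 2nd 1 inside the block starting at row 5, which delimits rows 7 to 8.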
XBW is searchable:
XBW-index
[Figure: the tree T next to its XBW arrays Slast, Sα and Sπ (15 rows, Slast has eight 1s), with the rows fr and lr marked]
First row whose Sπ starts with each symbol: A 2, B 5, C 9, D 12
Rows [fr,lr] whose Sπ starts with ‘B’
Inductive step:
Pick the next char in Π[i+1], i.e. ‘D’
Search for the first and last ‘D’ in Sα[fr,lr]
Jump to their children
Their children have upward path = ‘D B’
2 occurrences of Π because of two 1s
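The search for Π = ‘B D’ can be sketched as a loop over the symbols of Π, again with naive rank and select on the example arrays. The function name `subpath_search`, the dictionary `F`, and the restriction to paths made of internal-node labels are assumptions of this sketch.

```python
S_last = [1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1]
S_alpha = list("CbaDDcDaBABccab")
# First row whose upward path starts with each (internal) symbol.
F = {"A": 2, "B": 5, "C": 9, "D": 12}

def rank(symbol, seq, i):
    """Number of occurrences of symbol in rows 1..i of seq."""
    return seq[:i].count(symbol)

def select(symbol, seq, k):
    """Row (1-based) of the k-th occurrence of symbol in seq."""
    rows = [row for row, s in enumerate(seq, start=1) if s == symbol]
    return rows[k - 1]

def subpath_search(path):
    """Return (fr, lr, occurrences) for a top-down path of internal labels."""
    symbols = sorted(F)
    c = path[0]
    nxt = symbols.index(c) + 1
    fr = F[c]                                   # rows whose path starts with c
    lr = F[symbols[nxt]] - 1 if nxt < len(symbols) else len(S_alpha)
    for c in path[1:]:
        k1 = rank(c, S_alpha, fr - 1) + 1       # first c inside S_alpha[fr..lr]
        k2 = rank(c, S_alpha, lr)               # last c inside S_alpha[fr..lr]
        if k1 > k2:
            return None                         # no occurrence of the path
        z = rank(1, S_last, F[c] - 1)
        fr = select(1, S_last, z + k1 - 1) + 1  # jump to the first child
        lr = select(1, S_last, z + k2)          # jump to the last child
    return fr, lr, sum(S_last[fr - 1:lr])       # one 1 per matched node

print(subpath_search("BD"))
```

On the example, the rows for ‘B’ are 5 to 8, the first and last ‘D’ in that range sit at rows 5 and 7, and their children span rows 13 to 15, which contain two 1s.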
Up to 36% improvement in compression ratio
Query (counting) time ≅ 8 ms, navigation time ≅ 3 ms
[Bar chart: axis from 0% to 60%; datasets DBLP, Pathways, News; tools Huffword, XPress, XQzip, XBzipIndex, XBzip]
DBLP: 1.75 bytes/node, Pathways: 0.31 bytes/node, News: 3.91 bytes/node
Indexing
[Kosaraju, Focs ’89]
[IEEE Focs ’05] [WWW ’06]
Data type Compressed Indexing
[IEEE Focs ’00] [J. ACM ’05]
This is a powerful paradigm to design compressed indexes:
Theory: Soda ’06 (2), Cpm ’06 (2), Icalp ’06 (2), DCC ’06 (1) Experimental: Wea ’06 (2)
http://pizzachili.di.unipi.it or http://pizzachili.dcc.uchile.cl