RE-Tree: An Efficient Index Structure for Regular Expressions
Chee-Yong Chan, Minos Garofalakis, Rajeev Rastogi
Information Sciences Research Center Bell Laboratories, Lucent Technologies
Motivation
Regular expressions (REs) are a powerful formalism for pattern/structure specifications.
– XPath pattern language for XML documents
– Policy language of Border Gateway Protocol (BGP)
[Figure: RE filter. Inputs: an input string s and R, a set of REs. Output: the subset of R that matches s.]
Our Approach: RE-Tree
Hierarchical index structure to maximize pruning of the search space.
Challenge: REs generally define infinite sets, and there is no well-defined metric for clustering REs.
RE-Tree Overview
[Figure: RE-tree whose leaf FAs are M1, M2, ..., M8, ...]
Each directory entry = (FA, Pointer), where the pointer refers to a node at the next level.
RE-Tree: Containment Property
L(M1) ⊇ L(M2) ∪ L(M3) ∪ L(M4)
Example (nodes N' and N in the figure):
M1 = a (a | b) (a | b | c)*
M2 = a (a | b) c*
M3 = aa (a | b | c)* c
M4 = ab (bc | cc)*
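The containment property can be spot-checked on the example REs. The sketch below is only a bounded test, not a proof: it enumerates every string over {a, b, c} up to a small length and confirms that anything matched by M2, M3, or M4 is also matched by M1. The helper names (`strings_up_to`, `containment_holds`) and the length bound are assumptions made for illustration.

```python
import re
from itertools import product

ALPHABET = "abc"
M1 = re.compile(r"a(a|b)(a|b|c)*")          # bounding RE from the slide
CHILDREN = {
    "M2": re.compile(r"a(a|b)c*"),
    "M3": re.compile(r"aa(a|b|c)*c"),
    "M4": re.compile(r"ab(bc|cc)*"),
}

def strings_up_to(max_len):
    """Yield every string over ALPHABET of length 0..max_len."""
    for n in range(max_len + 1):
        for chars in product(ALPHABET, repeat=n):
            yield "".join(chars)

def containment_holds(max_len=7):
    """True if every short string matched by a child RE is also matched by M1."""
    for w in strings_up_to(max_len):
        in_child = any(rx.fullmatch(w) for rx in CHILDREN.values())
        if in_child and not M1.fullmatch(w):
            return False
    return True

print(containment_holds())
```

This is what makes pruning safe: an input string that M1 rejects (for example `bca`) cannot match M2, M3, or M4, so the whole subtree can be skipped.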
Bounding Finite Automata
– Most precise FA accepts the union of L(Mi) for all Mi in the set.
– Least precise FA accepts Σ*.
– A more precise FA improves search pruning, but its size could be large, resulting in lower fan-out of the index node.
Limit the number of states per internal FA (using an index parameter α).
Within that limit, maximize the precision of bounding FAs.
RE-Trees vs. R-Trees
                        R-trees                              RE-trees
Data type               Multi-dimensional rectangles         Regular languages
Internal node entries   Minimal bounding rectangles (MBRs)   Bounding FAs
Update                  Minimize volume of MBRs              Minimize size of languages accepted by bounding FAs
RE-Tree Algorithms
– Selecting an optimal insertion node
– Computing an optimal bounding FA
– Computing an optimal node split
RE-Tree Optimization Problems
Select the node corresponding to Mi that maximizes |L(M ∩ Mi)|, where M is the FA to be inserted.
Compute M, a bounding FA of S (with at most α states), that minimizes |L(M)|.
Partition S into S1 & S2 such that |L(union of FAs in S1)| + |L(union of FAs in S2)| is minimized.
Note: the language sizes |L(·)| in all three problems are possibly infinite!
Main Challenge
Example: (a|b)* is larger than a(a|b)*.
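Need a size measure that captures the intuition of the “larger than” relationship.
L(M) is larger than L(M’) if
    ∃ N ∈ Ζ+ s.t. ∀ k > N:  Σ_{i=1..k} |L(M, i)|  >  Σ_{i=1..k} |L(M’, i)|
where L(M, i) is the set of strings of length i in L(M).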
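For the example pair, the cumulative counts in this definition can be computed by brute force. The sketch below enumerates strings with Python's `re` module; the function name `cumulative_counts` and the cutoff of 8 are illustrative assumptions, and a finite check like this only suggests, rather than proves, that the inequality holds for all larger k.

```python
import re
from itertools import product

def cumulative_counts(pattern, alphabet, max_len):
    """Return [c_1, ..., c_max_len], where c_k counts matching strings of length 1..k."""
    rx = re.compile(pattern)
    totals, running = [], 0
    for n in range(1, max_len + 1):
        running += sum(
            1 for chars in product(alphabet, repeat=n)
            if rx.fullmatch("".join(chars))
        )
        totals.append(running)
    return totals

big = cumulative_counts(r"(a|b)*", "ab", 8)     # every non-empty string over {a, b}
small = cumulative_counts(r"a(a|b)*", "ab", 8)  # only those starting with 'a'

for k, (b, s) in enumerate(zip(big, small), start=1):
    print(k, b, s)
```

At every k the first column of counts exceeds the second, which matches the intuition that (a|b)* is larger than a(a|b)*.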
Max-Count Size Measure
|L(M)| = |L(M,1)| + |L(M,2)| + ... + |L(M,k)|
Example:
L(M1) = (b|c)* d (a|b)* d (b|c)* d
L(M2) = dd (a|b|c)* d
L(M2) is larger than L(M1), but the max-count measure is correct iff the maximum length parameter k > 15.
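The sensitivity of max-count to the length parameter can be illustrated by counting accepted strings per length with a small dynamic program. The transition tables below are hand-built DFAs for the two REs in the example (an assumption of this sketch, not taken from the slides), and `counts_by_length` and `max_count` are hypothetical helper names.

```python
def counts_by_length(dfa, start, accepting, max_len):
    """counts[n] = number of accepted strings of length n, for n = 0..max_len."""
    paths = {start: 1}                 # state -> number of strings that reach it
    counts = []
    for _ in range(max_len + 1):
        counts.append(sum(c for s, c in paths.items() if s in accepting))
        nxt = {}
        for s, c in paths.items():
            for t in dfa.get(s, {}).values():
                nxt[t] = nxt.get(t, 0) + c
        paths = nxt
    return counts

# (b|c)* d (a|b)* d (b|c)* d, accepting state 3
DFA_M1 = {0: {"b": 0, "c": 0, "d": 1},
          1: {"a": 1, "b": 1, "d": 2},
          2: {"b": 2, "c": 2, "d": 3}}
# dd (a|b|c)* d, accepting state 3
DFA_M2 = {0: {"d": 1},
          1: {"d": 2},
          2: {"a": 2, "b": 2, "c": 2, "d": 3}}

def max_count(dfa, k):
    """|L(M,1)| + ... + |L(M,k)| for a DFA with start state 0 and accepting state 3."""
    return sum(counts_by_length(dfa, 0, {3}, k)[1:])

for k in (10, 20):
    print(k, max_count(DFA_M1, k), max_count(DFA_M2, k))
```

With a small k the measure ranks L(M1) above L(M2); with a larger k the ranking flips to the correct one, which is the weakness the slide points out.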
MDL-based Size Measure
Minimum Description Length (MDL) principle: definition of an optimal model for a given data set.
For L(M1) ⊇ S and L(M2) ⊇ S, M1 is more precise than M2 when
    Σ_{w ∈ S} Encode(w, M1)  <  Σ_{w ∈ S} Encode(w, M2)
L(M2) is larger than L(M1) when
    Σ_{w ∈ S1} Encode(w, M1) / |w|  <  Σ_{w ∈ S2} Encode(w, M2) / |w|
Definition of Encoding(w,M)
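For w ∈ L(M) with |w| = n, where s0, s1, ..., sn is the sequence of states M visits while accepting w:
    Encode(w, M) = Σ_{i=0..n-1} log(# out-going transitions in si)
Example:
[Figure: FA M with edge labels a,b,c; b; d; d; d]
    Encode(ddbd, M) = log(1) + log(2) + log(4) + log(4) = 5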
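A minimal sketch of the encoding cost, assuming base-2 logarithms and a deterministic FA stored as a nested dictionary. The automaton `M` below is hypothetical: it was chosen only so that the out-degrees along the path for `ddbd` are 1, 2, 4, 4, which reproduces the value 5 from the example; the slide's actual FA may differ.

```python
import math

# Hypothetical DFA: state -> {symbol: next_state}. Not taken from the slide.
M = {
    "s0": {"d": "s1"},
    "s1": {"b": "s1", "d": "s2"},
    "s2": {"a": "s2", "b": "s2", "c": "s2", "d": "s3"},
    "s3": {},
}

def encode(w, dfa, start="s0"):
    """Sum of log2(out-degree) over the states visited before each symbol of w."""
    state, bits = start, 0.0
    for sym in w:
        bits += math.log2(len(dfa[state]))   # cost of choosing one outgoing transition
        state = dfa[state][sym]              # raises KeyError if w is not accepted
    return bits

print(encode("ddbd", M))
```

The intuition is that a state with many outgoing transitions needs more bits to say which one was taken, so a less precise FA encodes the same string at higher cost.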
Algorithm to Optimize Bounding FA
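Given a set S of FAs, compute a bounding FA M of S such that
(1) M has at most α states, and
(2) |L(M)| is minimized.
Approach: start from the most precise bounding FA and incrementally relax its precision (by greedily merging pairs of states) until the space constraint is satisfied.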
An Example
Compute a bounding FA for S = { abb*, aa*b } with α = 3.
[Figure: FA diagrams for this example]
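Greedy state merging on this example can be sketched as follows. This is an illustration under several assumptions rather than the authors' algorithm: the FA is an NFA stored as a tuple (states, transitions, start, finals), the starting point is a hand-built union of abb* and aa*b, and candidate merges are ranked by a max-count style measure (strings of length at most `k`) instead of the MDL-based measure. All function names are hypothetical.

```python
from itertools import combinations

ALPHABET = "ab"

def merge(nfa, p, q):
    """Return a copy of nfa with state q folded into state p."""
    states, trans, start, finals = nfa
    ren = lambda s: p if s == q else s
    return (frozenset(ren(s) for s in states),
            frozenset((ren(s), a, ren(d)) for s, a, d in trans),
            ren(start),
            frozenset(ren(s) for s in finals))

def step(trans, subset, sym):
    """States reachable from any state in subset by reading sym."""
    return frozenset(d for s, a, d in trans if s in subset and a == sym)

def count_upto(nfa, k):
    """Number of distinct accepted strings of length 0..k (subset-construction DP)."""
    _, trans, start, finals = nfa
    cur, total = {frozenset([start]): 1}, 0
    for _ in range(k + 1):
        total += sum(c for sub, c in cur.items() if sub & finals)
        nxt = {}
        for sub, c in cur.items():
            for sym in ALPHABET:
                t = step(trans, sub, sym)
                if t:
                    nxt[t] = nxt.get(t, 0) + c
        cur = nxt
    return total

def accepts(nfa, w):
    _, trans, start, finals = nfa
    sub = frozenset([start])
    for sym in w:
        sub = step(trans, sub, sym)
    return bool(sub & finals)

def bound(nfa, alpha, k=8):
    """Merge the pair of states that enlarges the language least, until <= alpha states."""
    while len(nfa[0]) > alpha:
        candidates = (merge(nfa, p, q) for p, q in combinations(sorted(nfa[0]), 2))
        nfa = min(candidates, key=lambda m: count_upto(m, k))
    return nfa

# Union of abb* (0 -a-> 1 -b-> 2, loop b on 2) and aa*b (0 -a-> 3, loop a on 3, 3 -b-> 4).
UNION = (frozenset(range(5)),
         frozenset({(0, "a", 1), (1, "b", 2), (2, "b", 2),
                    (0, "a", 3), (3, "a", 3), (3, "b", 4)}),
         0, frozenset({2, 4}))

bounding = bound(UNION, alpha=3)
print(len(bounding[0]), count_upto(UNION, 8), count_upto(bounding, 8))
```

Merging states can only add accepted strings, so the result still accepts everything in abb* and aa*b; the greedy choice tries to keep the added strings few.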
Other RE-Tree Algorithms
– Select the node corresponding to Mi that maximizes |L(M ∩ Mi)|, where M is the FA to be inserted.
– Partition S into S1 & S2 (each with at least m FAs) such that |L(union of FAs in S1)| + |L(union of FAs in S2)| is minimized.
– Problem is NP-hard.
– Heuristic used is similar to R-tree’s Quadratic Split Algorithm.
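The flavor of a quadratic-split heuristic can be shown on a simplified stand-in. In the sketch below each FA is replaced by a finite set of strings (for instance, its language truncated to short strings), set size plays the role of |L(·)|, and the seed and assignment rules follow the general shape of R-tree's quadratic split. The function name, the data, and these simplifications are assumptions for illustration and not the heuristic actually used for RE-trees.

```python
from itertools import combinations

def quadratic_split(langs, m):
    """Split {name: set_of_strings} into two groups of at least m names each."""
    names = sorted(langs)

    def waste(pair):
        x, y = langs[pair[0]], langs[pair[1]]
        return len(x | y) - len(x) - len(y)

    # Seeds: the pair that would be most wasteful to keep in the same group.
    a, b = max(combinations(names, 2), key=waste)
    groups = {a: [a], b: [b]}
    cover = {a: set(langs[a]), b: set(langs[b])}
    rest = [n for n in names if n not in (a, b)]

    def growth(n, g):
        return len(cover[g] | langs[n]) - len(cover[g])

    def assign(n, g):
        groups[g].append(n)
        cover[g] |= langs[n]
        rest.remove(n)

    while rest:
        # If a group needs every remaining entry to reach m, give it all of them.
        short = [g for g in (a, b) if len(groups[g]) + len(rest) <= m]
        if short:
            for n in list(rest):
                assign(n, short[0])
            break
        # Next entry: the one with the strongest preference for one group.
        n = max(rest, key=lambda x: abs(growth(x, a) - growth(x, b)))
        assign(n, min((a, b), key=lambda g: (growth(n, g), len(groups[g]))))
    return groups[a], groups[b]

LANGS = {
    "M1": {"aa", "ab"},
    "M2": {"aa", "ab", "ac"},
    "M3": {"ba", "bb"},
    "M4": {"bb", "bc"},
}
g1, g2 = quadratic_split(LANGS, m=2)
print(g1, g2)
```

Entries with overlapping languages end up together, which keeps the total size of the two group unions small.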
Optimizing RE-Tree Operations
(i.e., union & intersection).
Example: Selecting the optimal insertion node requires computing |L(Mi ∩ M)| for each Mi in the current node.
An unbiased estimate of |L(Mi ∩ M, k)| is given by
    (# strings in S accepted by Mi) / |S| × |L(M, k)|
where S = uniform random sample of L(M, k).
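The estimator can be tried on a toy case. The sketch below takes L(M, k) to be the strings of length exactly k, materializes it by brute force (feasible only for tiny examples), and draws the sample with replacement using a fixed seed; the choice of REs, k = 6, the sample size, and the helper names are all assumptions for illustration.

```python
import random
import re
from itertools import product

def language_at_length(pattern, alphabet, k):
    """All strings of length exactly k over alphabet that match pattern."""
    rx = re.compile(pattern)
    words = ("".join(chars) for chars in product(alphabet, repeat=k))
    return [w for w in words if rx.fullmatch(w)]

def estimate_intersection(sample, mi_pattern, size_m_k):
    """(# strings in sample accepted by Mi) / |sample| * |L(M, k)|."""
    rx = re.compile(mi_pattern)
    hits = sum(1 for w in sample if rx.fullmatch(w))
    return hits / len(sample) * size_m_k

L_M_k = language_at_length(r"(a|b)*", "ab", 6)   # M = (a|b)*, k = 6
rng = random.Random(0)                           # fixed seed for repeatability
S = rng.choices(L_M_k, k=200)                    # uniform sample, with replacement
est = estimate_intersection(S, r"a(a|b)*", len(L_M_k))
print(est)
```

The point of the estimator is that no intersection automaton is built: only membership tests of sampled strings against Mi are needed.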
Related Work
to speed up searching of an RE query.
[VLDB’00], YFilter [ICDE’02], XTrie [ICDE’02], matchMaker [EDBT’02].
– Class of REs supported in XPath is more restrictive.
– Indexes for filtering XPath are all main-memory structures.
Experimental Evaluation
– Vary RE similarity, α, size of data set.
data set.
memory running FreeBSD 4.1.
Varying Similarity of REs
[Plot: ratio of FA comparisons (y-axis, 0.5 to 3.5) vs. result size (x-axis, 10 to 50); curves for p = 0.5, p = 0.75, p = 1.0]
Varying Similarity of REs
[Plot: ratio of evaluation time (y-axis, 0.5 to 2) vs. result size (x-axis, 10 to 50); curves for p = 0.5, p = 0.75, p = 1.0]
Conclusions