RE-Tree: An Efficient Index Structure for Regular Expressions - - PowerPoint PPT Presentation

SLIDE 1

RE-Tree: An Efficient Index Structure for Regular Expressions

Chee-Yong Chan, Minos Garofalakis, Rajeev Rastogi

Information Sciences Research Center Bell Laboratories, Lucent Technologies

SLIDE 2

Motivation

  • Regular Expressions (REs) provide a simple yet powerful formalism for pattern/structure specifications.
  • Example applications:
    – XPath pattern language for XML documents
    – Policy language of Border Gateway Protocol (BGP)
  • RE Filtering Problem: given an input string s and a set R of REs, the RE filter returns the subset of R that matches s.
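
As a point of reference, the filtering problem can be written as a plain sequential scan. The sketch below uses Python's `re` module; the function name `re_filter` and the sample set `R` are illustrative assumptions, and "match" is taken to mean that the whole string s belongs to the language of the RE.

```python
import re

def re_filter(patterns, s):
    """Return the patterns whose language contains the whole string s."""
    return [p for p in patterns if re.fullmatch(p, s)]

# Hypothetical RE set R, for illustration only.
R = ["a(a|b)c*", "aa(a|b|c)*c", "ab(bc|cc)*"]
print(re_filter(R, "aac"))
```

Every RE is tested against every input string here, which is the cost an index structure tries to avoid.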

SLIDE 3

Our Approach: RE-Tree

  • Idea: Partition RE data set using a height-balanced hierarchical index structure to maximize pruning of search space.
  • Challenge: REs generally define infinite sets and there is no well-defined metric for clustering REs.

SLIDE 4

RE-Tree Overview

[Figure: tree of FAs M1, M2, ..., M8, ...; the upper levels hold internal FAs and the bottom level holds leaf FAs.]

  • Dynamic, height-balanced, hierarchical index structure.
  • REs are stored as finite automata (FA) in the leaf nodes.
  • Internal nodes contain directory entries pointing to nodes at next level; each directory entry = (FA, Pointer).

SLIDE 5

RE-Tree: Containment Property

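  • Example: M1 = Bounding FA of { M2, M3, M4 }

    L(M1) ⊇ L(M2) ∪ L(M3) ∪ L(M4)

    M1 = a (a | b) (a | b | c)*
    M2 = a (a | b) c*
    M3 = aa (a | b | c)* c
    M4 = ab (bc | cc)*

[Figure: the directory entry M1 points to the child node that holds M2, M3 and M4; the two nodes are labeled N' and N.]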
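
The containment property is what allows a search to skip whole subtrees: if a bounding FA rejects the input string, no FA below it can accept it. A minimal sketch of such a search follows, assuming Python regular expressions as stand-ins for the FAs. The `Node` class, the `search` function, and the sibling leaf holding `b(a|c)*` are hypothetical additions for illustration; only the four REs of the example come from the slide.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Node:
    is_leaf: bool
    # Each entry is (pattern, child); child is a Node in internal nodes, None in leaves.
    entries: list = field(default_factory=list)

def search(node, s):
    """Return the leaf REs that match s, descending only where the bounding RE accepts s."""
    matches = []
    for pattern, child in node.entries:
        if re.fullmatch(pattern, s):
            if node.is_leaf:
                matches.append(pattern)
            else:
                matches.extend(search(child, s))
    return matches

leaf = Node(True, [("a(a|b)c*", None), ("aa(a|b|c)*c", None), ("ab(bc|cc)*", None)])
other = Node(True, [("b(a|c)*", None)])  # hypothetical sibling leaf
root = Node(False, [("a(a|b)(a|b|c)*", leaf), ("b(a|b|c)*", other)])
print(search(root, "abcc"))
```

For the input `abcc` only the first directory entry accepts, so the sibling leaf is never visited.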

SLIDE 6

Bounding Finite Automata

  • Many possible bounding FAs for a given set of FAs.
    – Most precise FA accepts union of L(Mi) for all Mi in the set.
    – Least precise FA accepts Σ*.
  • Space-Precision tradeoff for bounding FAs:
    – A more precise FA improves search pruning but its size could be large, resulting in lower fan-out of index node.
  • RE-tree controls fan-out by bounding the maximum number of states per internal FA (using an index parameter α).
  • Goal: Optimize search performance by maximizing precision of bounding FAs.

SLIDE 7

RE-Trees vs. R-Trees

  • RE-trees are similar in spirit to R-trees.

                        R-trees                              RE-trees
Data type               Multi-dimensional rectangles         Regular languages
Internal node entries   Minimal bounding rectangles (MBR)    Bounding FAs
Update operations       Minimize volume of MBRs              Minimize size of languages accepted by bounding FAs

SLIDE 8

RE-Tree Algorithms

  • RE-tree construction involves three key operations:
    – Selecting an optimal insertion node
    – Computing an optimal bounding FA
    – Computing an optimal node split

SLIDE 9

RE-Tree Optimization Problems

  • Let S = {M1, M2, ..., Mn} be the set of FAs in a node N.
  • Selecting an optimal insertion node
    Select the node corresponding to Mi that maximizes |L(M ∩ Mi)|, where M is the FA to be inserted.
  • Computing an optimal bounding FA
    Compute M, a bounding FA of S (with at most α states), that minimizes |L(M)|.
  • Computing an optimal node split
    Partition S into S1 & S2 such that |L(union of FAs in S1)| + |L(union of FAs in S2)| is minimized.

SLIDE 10

RE-Tree Optimization Problems

  • The language sizes |L(·)| that these three problems optimize are possibly infinite!

SLIDE 11

Main Challenge

  • Problem: How to measure size of REs?
  • Observe: Infinite REs may not have the same size.
    Example: (a|b)* is larger than a(a|b)*.
  • Idea: Need a computable measure for size of REs that captures intuition of "larger than" relationship.
  • Let L(M,i) = Set of length-i strings in L(M).
  • Intuitively, L(M) is larger than L(M') iff
    ∃ N ∈ Z+ s.t. ∀ k > N:  Σ_{i=1}^{k} |L(M, i)|  >  Σ_{i=1}^{k} |L(M', i)|

SLIDE 12

Max-Count Size Measure

  • Idea: Count size of L(M) up to some maximum length k.
    |L(M)| = |L(M,1)| + |L(M,2)| + ... + |L(M,k)|
  • Cons: Sensitive to maximum length parameter value.
    Example:
      L(M1) = (b|c)* d (a|b)* d (b|c)* d
      L(M2) = dd (a|b|c)* d
    L(M2) is larger than L(M1), but max-count measure is correct iff maximum length parameter value > 15.
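
The max-count measure can be computed by dynamic programming over a deterministic FA, since in a DFA each string follows exactly one path. The sketch below is one possible way to do this, not the authors' implementation. The dictionary encoding `{state: {symbol: next_state}}`, the function names, and the hand-built DFAs `M1` and `M2` for the two example languages are assumptions made for illustration.

```python
from collections import defaultdict

def count_by_length(dfa, start, accepting, max_len):
    """Return [|L(M,1)|, ..., |L(M,max_len)|] for a DFA {state: {symbol: next_state}}."""
    reach = {start: 1}  # state -> number of strings of the current length that reach it
    counts = []
    for _ in range(max_len):
        nxt = defaultdict(int)
        for state, ways in reach.items():
            for target in dfa.get(state, {}).values():
                nxt[target] += ways
        reach = nxt
        counts.append(sum(w for s, w in reach.items() if s in accepting))
    return counts

def max_count_size(dfa, start, accepting, k):
    """Max-count measure: |L(M,1)| + ... + |L(M,k)|."""
    return sum(count_by_length(dfa, start, accepting, k))

# Hand-built DFAs (assumed encodings) for the two example languages; state 3 accepts.
M1 = {0: {"b": 0, "c": 0, "d": 1}, 1: {"a": 1, "b": 1, "d": 2}, 2: {"b": 2, "c": 2, "d": 3}}
M2 = {0: {"d": 1}, 1: {"d": 2}, 2: {"a": 2, "b": 2, "c": 2, "d": 3}}

for k in (10, 20):
    print(k, max_count_size(M1, 0, {3}, k), max_count_size(M2, 0, {3}, k))
```

Running the loop for a small and a large k shows the sensitivity to the length parameter: the ordering of the two measures flips as k grows.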

SLIDE 13

MDL-based Size Measure

  • MDL Principle: Provides an information-theoretic definition of an optimal model for a given data set.
    For L(M1) ⊇ S and L(M2) ⊇ S, "M1 is more precise than M2" corresponds to
    Σ_{w ∈ S} Encode(w, M1)  <  Σ_{w ∈ S} Encode(w, M2)
  • Observation: "L(M2) is larger than L(M1)" corresponds to
    Σ_{w ∈ S1} Encode(w, M1) / |w|  <  Σ_{w ∈ S2} Encode(w, M2) / |w|

SLIDE 14

Definition of Encoding(w,M)

  • How to encode w ∈ L(M) using M?
  • Let p = <s0, s1, ..., sn> be accepting path of w in M.
  • Encode(w, M) = Σ_{i=0}^{n-1} log(# out-going transitions in si)

Example:

[Figure: FA M with transitions labeled a,b,c; b; d; d; d.]

Encode(ddbd, M) = log(1) + log(2) + log(4) + log(4) = 5
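
The encoding cost can be sketched directly from the definition: walk the accepting path and add log2 of the number of out-going transitions at each state visited before the last one. The code below assumes a deterministic FA stored as `{state: {symbol: next_state}}` and counts one transition per symbol. The automaton `M` is a hypothetical one, chosen so that the word `ddbd` costs 5 bits; it is not claimed to be the automaton drawn on the slide.

```python
import math

def encode(word, dfa, start, accepting):
    """Bits needed to encode word along its accepting path in a DFA."""
    state = start
    bits = 0.0
    for symbol in word:
        out = dfa.get(state, {})
        if symbol not in out:
            raise ValueError("word is not accepted by the automaton")
        bits += math.log2(len(out))  # cost of choosing one of the out-going transitions
        state = out[symbol]
    if state not in accepting:
        raise ValueError("word is not accepted by the automaton")
    return bits

# Hypothetical DFA: state 3 is the only accepting state.
M = {0: {"d": 1}, 1: {"b": 1, "d": 2}, 2: {"a": 2, "b": 2, "c": 2, "d": 3}}
print(encode("ddbd", M, 0, {3}))
```

States with many out-going transitions make every string that passes through them more expensive, which is why a less precise FA yields a larger total cost on the same sample.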

SLIDE 15

Algorithm to Optimize Bounding FA

  • Compute a bounding FA M for a given set of FAs S s.t.
    (1) M has at most α states, and
    (2) |L(M)| is minimized.
  • Problem is NP-hard.
  • Heuristic: Compute the most precise FA for S & then incrementally relax its precision (by greedily merging pairs of states) until the space constraint is satisfied.

SLIDE 16

An Example

Compute bounding FA for S = { abb*, aa*b } with α = 3.

[Figure: FA state diagrams for the example, with transitions labeled a and b.]

SLIDE 17

Other RE-Tree Algorithms

  • Selecting an optimal insertion node
    – Select the node corresponding to Mi that maximizes |L(M ∩ Mi)|, where M is the FA to be inserted.
  • Computing an optimal node split
    – Partition S into S1 & S2 (each with at least m FAs) such that |L(union of FAs in S1)| + |L(union of FAs in S2)| is minimized.
    – Problem is NP-hard.
    – Heuristic used is similar to R-tree's Quadratic Split Algorithm.
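
The slides do not spell out the split heuristic, so the following is only a sketch of how an R-tree style quadratic split could be adapted to languages. It makes several assumptions: each language is approximated by the finite set of its strings of length at most `k`, "size" is the cardinality of that set, the seed pair is the one with the least overlap, and the input holds at least 2m entries. The function names and the four sample REs are hypothetical.

```python
import re
from itertools import combinations, product

def language(regex, alphabet, k):
    """Finite stand-in for L(M): all strings of length 1..k that the RE accepts."""
    pattern = re.compile(regex)
    words = ("".join(t) for n in range(1, k + 1) for t in product(alphabet, repeat=n))
    return frozenset(w for w in words if pattern.fullmatch(w))

def quadratic_split(langs, m):
    """Split {name: language} into two groups with at least m entries each."""
    names = list(langs)
    # PickSeeds: the pair that would be most wasteful to keep in one group.
    s1, s2 = max(combinations(names, 2),
                 key=lambda p: len(langs[p[0]] | langs[p[1]])
                 - len(langs[p[0]]) - len(langs[p[1]]))
    groups = [[s1], [s2]]
    covers = [langs[s1], langs[s2]]
    rest = [n for n in names if n not in (s1, s2)]
    while rest:
        # Honour the minimum fill m: hand all remaining entries to a group that needs them.
        for g in (0, 1):
            if len(groups[g]) + len(rest) <= m:
                groups[g].extend(rest)
                rest = []
        if not rest:
            break
        # PickNext: the entry with the strongest preference for one of the groups.
        growth = lambda n, g: len(covers[g] | langs[n]) - len(covers[g])
        nxt = max(rest, key=lambda n: abs(growth(n, 0) - growth(n, 1)))
        g = 0 if growth(nxt, 0) <= growth(nxt, 1) else 1
        groups[g].append(nxt)
        covers[g] = covers[g] | langs[nxt]
        rest.remove(nxt)
    return groups

regexes = ["a(a|b)*", "ab*", "b(a|b)*", "ba*"]
langs = {r: language(r, "ab", 4) for r in regexes}
print(quadratic_split(langs, m=2))
```

In this small example the REs that start with `a` end up in one group and those that start with `b` in the other, so neither group's union grows beyond its largest member.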

SLIDE 18

Optimizing RE-Tree Operations

  • RE-tree algorithms involve many FA operations (i.e., union & intersection).
  • Speed up performance using sampling techniques.
    Example: Selecting optimal insertion node requires computing |L(Mi ∩ M)| for each Mi in current node.
    An unbiased estimate of |L(Mi ∩ M, k)| is given by
      (# strings in S accepted by Mi) / |S| × |L(M, k)|
    where S = uniform random sample of L(M, k).
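
The estimator can be sketched as follows, assuming M is available as a DFA stored as `{state: {symbol: next_state}}` and Mi as a Python regular expression. Uniform sampling from L(M, k) is done here by first counting, for every state and remaining length, how many completions reach an accepting state, and then choosing each symbol in proportion to those counts; this sampling scheme and all function names are assumptions, since the slides only state the estimate itself.

```python
import random
import re

def completions(dfa, accepting, k):
    """table[j][q] = number of length-j strings that lead from q to an accepting state."""
    states = set(dfa) | {t for out in dfa.values() for t in out.values()} | set(accepting)
    table = [{q: int(q in accepting) for q in states}]
    for _ in range(k):
        prev = table[-1]
        table.append({q: sum(prev[t] for t in dfa.get(q, {}).values()) for q in states})
    return table

def sample_word(dfa, start, table, k, rng):
    """Draw one string uniformly at random from L(M, k); requires table[k][start] > 0."""
    state, word = start, []
    for remaining in range(k, 0, -1):
        symbols = sorted(dfa[state])
        weights = [table[remaining - 1][dfa[state][a]] for a in symbols]
        symbol = rng.choices(symbols, weights=weights)[0]
        word.append(symbol)
        state = dfa[state][symbol]
    return "".join(word)

def estimate_intersection(dfa, start, accepting, k, other_regex, n_samples, seed=0):
    """Estimate |L(Mi ∩ M, k)| as (hits / n_samples) * |L(M, k)|."""
    table = completions(dfa, accepting, k)
    total = table[k][start]  # |L(M, k)|
    if total == 0:
        return 0.0
    rng = random.Random(seed)
    other = re.compile(other_regex)
    hits = sum(1 for _ in range(n_samples)
               if other.fullmatch(sample_word(dfa, start, table, k, rng)))
    return hits / n_samples * total

# Hypothetical inputs: M accepts dd(a|b|c)*d, Mi is the RE "dda.*".
M = {0: {"d": 1}, 1: {"d": 2}, 2: {"a": 2, "b": 2, "c": 2, "d": 3}}
print(estimate_intersection(M, 0, {3}, 6, "dda.*", n_samples=200))
```

Each sampled string costs one membership test against Mi, so the estimate avoids building the intersection automaton at all.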

SLIDE 19

Related Work

  • A lot of work on the traditional RE search problem: how to speed up searching of an RE query.
  • But none on the RE filtering problem.
  • Indexes for filtering XPath expressions: XFilter [VLDB’00], YFilter [ICDE’02], XTrie [ICDE’02], matchMaker [EDBT’02].
    – Class of REs supported in XPath is more restrictive.
    – Indexes for filtering XPath are all main-memory structures.

SLIDE 20

Experimental Evaluation

  • Algorithms: RE-tree vs Sequential File Approach.
  • Data Set: Generated synthetic RE data sets.
    – Vary RE similarity, α, size of data set.
  • Queries: Generated 1000 random query strings from RE data set.
  • System: 700 MHz Intel Pentium III with 512 MB memory running FreeBSD 4.1.

SLIDE 21

[Figure: Varying Similarity of REs. Axes: Result Size and Ratio of FA Comparisons; curves for p = 0.5, p = 0.75, p = 1.0.]

SLIDE 22

[Figure: Varying Similarity of REs. Axes: Result Size and Ratio of Evaluation Time; curves for p = 0.5, p = 0.75, p = 1.0.]

SLIDE 23

Conclusions

  • RE-Tree, a novel index structure for REs.
  • Novel size measures for REs.
  • Update algorithms to optimize bounding FAs.
  • Sampling-based techniques to speed up RE-tree operations.