FIX: Feature-based Indexing Technique for XML Documents Ning Zhang - - PowerPoint PPT Presentation



slide-1
SLIDE 1

FIX: Feature-based Indexing Technique for XML Documents

Ning Zhang

University of Waterloo http://www.cs.uwaterloo.ca/~nzhang

Joint work with M. Tamer Özsu, Ihab F. Ilyas, and Ashraf Aboulnaga

1 Ning Zhang

slide-3
SLIDE 3

Motivating Example

  • Twig query (the root axis may be //, all other axes are /):

Q1: Find the phone numbers (P) of all authors (A) who also have an email (E) and a school (S).

//A[./E][./S]/P

  • Find all subtrees that satisfy a pattern tree:

[Figure: pattern tree with root A and children E, S, and P, each attached by a / edge.]

  • A general path containing // in the middle can be decomposed into interconnected twig queries.

[Figure: a pattern over the nodes A, B, C, L, M, and N with one // edge among / edges.]

slide-12
SLIDE 12

Approaches to Evaluating Twig Queries

Navigational Approach

Traverse the XML tree and perform a Tree Pattern Matching (TPM) operation at every tree node.

[Figure: XML tree]

  • Analogous to a sequential scan and very expensive: more than 4,000,000 TPM operations on DBLP.
  • Many of these operations are unnecessary for highly selective queries.
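The navigational approach can be illustrated with a minimal sketch. The toy document, the nested-tuple pattern encoding, and the function names `matches` and `navigational_eval` are assumptions made for this example and are not taken from the slides. The point is only that one TPM attempt is made per tree node, regardless of how few nodes can match.

```python
import xml.etree.ElementTree as ET

def matches(node, pattern):
    """True if `node` matches the pattern root and every pattern child
    matches at least one child of `node` (all axes are /)."""
    label, children = pattern
    if node.tag != label:
        return False
    return all(any(matches(c, p) for c in node) for p in children)

def navigational_eval(root, pattern):
    """Visit every node and run one TPM attempt on it."""
    ops = 0
    hits = []
    for node in root.iter():
        ops += 1
        if matches(node, pattern):
            hits.append(node)
    return hits, ops

# Hypothetical toy document: three authors, only the first has E, S, and P.
doc = ET.fromstring(
    "<bib>"
    "<A><E>a@x.org</E><S>UW</S><P>111</P></A>"
    "<A><E>b@x.org</E><P>222</P></A>"
    "<A><S>MIT</S><P>333</P></A>"
    "</bib>"
)

# Q1 = //A[./E][./S]/P, written as (label, [child patterns]).
Q1 = ("A", [("E", []), ("S", []), ("P", [])])
hits, ops = navigational_eval(doc, Q1)
phones = [p.text for a in hits for p in a.findall("P")]
```

Here `ops` counts one TPM attempt per node of the document, while only a single author satisfies Q1.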

slide-15
SLIDE 15

Approaches to Evaluating Twig Queries

Join-based Approach

XML elements are clustered by tag name; elements whose tag names match the query are then structurally joined.

[Figure: the pattern tree with nodes E, A, S, and P, evaluated over per-tag element lists (S S S, A A A, P P P, E E E).]

  • Analogous to an index-based join.
  • Tag-name indexes are not discriminative enough: elements are selected for joining solely on the basis of their tag names, without considering their descendants.
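A rough sketch of the join-based idea follows. It assumes a (start, end, level) region encoding and a naive nested-loop join; neither is specified in the slides, and the names `build_tag_index`, `is_child`, and `twig_join` are hypothetical. It shows that the tag-name index returns every A element as a join candidate, even though most of them lack the required children.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def build_tag_index(root):
    """Cluster elements by tag name; each element is stored as a
    (start, end, level) region assigned by a depth-first traversal."""
    index = defaultdict(list)
    counter = 0

    def visit(node, level):
        nonlocal counter
        start = counter
        counter += 1
        for child in node:
            visit(child, level + 1)
        end = counter
        counter += 1
        index[node.tag].append((start, end, level))

    visit(root, 0)
    for regions in index.values():
        regions.sort()
    return index

def is_child(parent, child):
    """Parent-child test on regions: containment plus adjacent levels."""
    return (parent[0] < child[0] and child[1] < parent[1]
            and child[2] == parent[2] + 1)

def twig_join(index, root_tag, child_tags):
    """Naive structural join: keep each root-tag element that has at
    least one child region in every required child-tag list."""
    return [a for a in index[root_tag]
            if all(any(is_child(a, c) for c in index[t]) for t in child_tags)]

doc = ET.fromstring(
    "<bib>"
    "<A><E/><S/><P/></A>"
    "<A><E/><P/></A>"
    "<A><S/><P/></A>"
    "</bib>"
)
index = build_tag_index(doc)
candidates = index["A"]                            # all A elements
answers = twig_join(index, "A", ["E", "S", "P"])   # only the first A
```

All three A elements enter the join, but only one survives it, which is the lack of discrimination the slide points out.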

slide-21
SLIDE 21

Objectives

  • Build an index that considers not only root tags but also the whole subtree.

[Figure: TPM starting points in the example document under a sequential scan, under a tag-name index, and under an index that considers the whole subtree.]

  • Exploit existing indexes (e.g., B+ tree) to build the new index.
  • Incorporate both structure and values in the index.

slide-22
SLIDE 22

Related Work: Structural Indexes

Cluster XML tree nodes that have similar structures in terms of:

  • Tag name: tag-name indexes use tag names as keys for a B+ tree (the pruning power comes only from the root of the query tree).
  • Rooted path: DataGuide, 1-index, A(k)-index (simple path expressions only).
  • Subtree: bisimulation graph.
  • Rooted path and subtree: F&B bisimulation graph.

[Figure: bisimulation graph of a bibliography document, with vertices labeled bib, article, book, inproceedings, www, author, title, affiliation, address, email, and phone.]

The bisimulation and F&B bisimulation graphs can be very large (3 × 10^5 vertices and 2 × 10^6 edges for Treebank).

slide-24
SLIDE 24

Our Approach: Feature-based Index (FIX)

Key Idea:

  • 1. Data trees and query trees are all converted to bisimulation graphs.

1.1 A bisimulation graph is much smaller than the XML tree.
1.2 A bisimulation graph preserves all structural information.

  • 2. Enumerate all subgraphs of depth k (indexable units) in the data bisimulation graph.
  • 3. Insert the indexable units into the index, keyed on their distinctive features.
  • 4. Calculate the features of the query bisimulation graph and compare them with the features of the indexed units to filter out units that cannot match.

What are the features for labeled trees?
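Step 1 can be sketched as follows, assuming a simple bottom-up construction in which two nodes are merged when they have the same tag and the same set of child classes. The slides do not give the construction algorithm, so the function `bisimulation_graph` and the toy document are illustrative only.

```python
import xml.etree.ElementTree as ET

def bisimulation_graph(root):
    """Merge nodes that share a tag and the same set of child classes.
    Returns (root vertex id, {vertex id: tag}, set of (parent, child) edges)."""
    class_of = {}   # signature -> vertex id
    labels = {}     # vertex id -> tag name
    edges = set()   # (parent vertex id, child vertex id)

    def visit(node):
        child_ids = frozenset(visit(child) for child in node)
        signature = (node.tag, child_ids)
        if signature not in class_of:
            vid = len(class_of)
            class_of[signature] = vid
            labels[vid] = node.tag
            edges.update((vid, c) for c in child_ids)
        return class_of[signature]

    return visit(root), labels, edges

# Hypothetical toy document: the first two authors are structurally identical.
doc = ET.fromstring(
    "<bib>"
    "<A><E/><S/><P/></A>"
    "<A><E/><S/><P/></A>"
    "<A><E/><P/></A>"
    "</bib>"
)
tree_nodes = sum(1 for _ in doc.iter())
root_id, labels, edges = bisimulation_graph(doc)
```

On this document the 12 tree nodes collapse into 6 graph vertices, because the repeated author structure and the repeated leaf tags are each represented once.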

slide-28
SLIDE 28

Features of XML Tree

Three features of an indexable unit (after it has been converted to a special matrix):

  • 1. minimum eigenvalue λmin
  • 2. maximum eigenvalue λmax
  • 3. root label r

Theorem

Given two graphs G and H, if H is an induced subgraph of G, then λmin(G) ≤ λmin(H) ≤ λmax(H) ≤ λmax(G).

Necessary condition for a query Q to have answers in a data tree D:

λmin(D) ≤ λmin(Q) ≤ λmax(Q) ≤ λmax(D) ∧ r(Q) = r(D)

The returned results may contain false positives, so a refinement step is needed.

slide-29
SLIDE 29

Calculating Features

  • 1. Compute the bisimulation graph of the XML tree.
  • 2. Convert the labeled directed graph into a weighted directed graph (encode each labeled edge as an edge weight), e.g.,

(article, title) → 3
(article, author) → 5
(author, address) → 4
(author, email) → 7

  • 3. Convert the weighted directed graph into an anti-symmetric matrix. With the vertices ordered article, author, title, address, email:

\[
M = \begin{pmatrix}
 0 &  5 & 3 & 0 & 0 \\
-5 &  0 & 0 & 4 & 7 \\
-3 &  0 & 0 & 0 & 0 \\
 0 & -4 & 0 & 0 & 0 \\
 0 & -7 & 0 & 0 & 0
\end{pmatrix}
\]

  • 4. Calculate the eigenvalues λ1, λ2, ..., λn of the matrix. λmin, λmax, and the root node label r are the three features.
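The feature computation can be sketched in plain Python under one explicit assumption. A real anti-symmetric matrix M has purely imaginary eigenvalues, and the slides do not say how real-valued λmin and λmax are obtained from them. This sketch assumes they are the extreme eigenvalues of the Hermitian matrix iM, which are real and symmetric about zero, so that λmax is the square root of the largest eigenvalue of MᵀM and λmin = −λmax. The helper names and the use of power iteration are choices made for this example.

```python
import math

def antisymmetric_matrix(n, weighted_edges):
    """Build M with M[u][v] = w and M[v][u] = -w for each edge (u, v)."""
    m = [[0.0] * n for _ in range(n)]
    for (u, v), w in weighted_edges.items():
        m[u][v] = float(w)
        m[v][u] = -float(w)
    return m

def spectral_features(m, iterations=200):
    """Return (lam_min, lam_max), assuming they are the extreme
    eigenvalues of i*M (an assumption, see the text above)."""
    n = len(m)
    # s = M^T M is symmetric positive semidefinite; its largest
    # eigenvalue equals lam_max squared.
    s = [[sum(m[k][i] * m[k][j] for k in range(n)) for j in range(n)]
         for i in range(n)]
    x = [1.0] * n
    for _ in range(iterations):          # power iteration on s
        y = [sum(s[i][j] * x[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in y))
        if norm == 0.0:
            return 0.0, 0.0
        x = [v / norm for v in y]
    sx = [sum(s[i][j] * x[j] for j in range(n)) for i in range(n)]
    top = sum(a * b for a, b in zip(x, sx))   # Rayleigh quotient of unit x
    lam_max = math.sqrt(max(top, 0.0))
    return -lam_max, lam_max

# Vertex ids: article=0, author=1, title=2, address=3, email=4;
# edge weights are the ones listed on the slide.
edges = {(0, 2): 3, (0, 1): 5, (1, 3): 4, (1, 4): 7}
M = antisymmetric_matrix(5, edges)
lam_min, lam_max = spectral_features(M)
features = (lam_min, lam_max, "article")
```

For the slide's example graph this yields λmax ≈ 9.63 under the stated assumption. A production implementation would more likely call a numerical linear algebra library than hand-roll power iteration.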

slide-30
SLIDE 30

Building Index

  • The bisimulation graph could be too large, so restrict the depth of the indexed XML trees to a small k.
  • For each document in a collection:
  • if the document's maximum depth is less than k, convert the whole tree into a bisimulation graph;
  • otherwise, enumerate all subtrees whose depth is under the limit k and convert each of them into a bisimulation graph.
  • Both clustered and unclustered indexes can be built for a collection.

[Figure: an unclustered index over the primary XML data storage, and a clustered index over a copy of the primary XML data storage with redundancy.]
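The enumeration step might look like the sketch below. It assumes that depth is counted in levels of nodes and that one indexable unit is produced per node by truncating that node's subtree at k levels; the slides do not spell out either detail, and the names `depth`, `truncate`, and `indexable_units` are hypothetical.

```python
import xml.etree.ElementTree as ET

def depth(node):
    """Number of node levels in the subtree rooted at `node`."""
    return 1 + max((depth(c) for c in node), default=0)

def truncate(node, k):
    """Nested-tuple copy of the subtree at `node`, cut off after k levels."""
    if k <= 1:
        return (node.tag, ())
    return (node.tag, tuple(truncate(c, k - 1) for c in node))

def indexable_units(root, k):
    """Whole tree if it is shallower than k, else one truncated
    subtree per node (assumed enumeration strategy)."""
    if depth(root) < k:
        return [truncate(root, depth(root))]
    return [truncate(node, k) for node in root.iter()]

doc = ET.fromstring("<bib><A><E/><P/></A><A><E/></A></bib>")
units_k2 = indexable_units(doc, 2)   # document depth 3 >= 2: one unit per node
units_k5 = indexable_units(doc, 5)   # document depth 3 < 5: the whole tree
```

Each unit would then be converted into a bisimulation graph and inserted into the index under its features.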

slide-31
SLIDE 31

Query Processing

  • Convert the query tree into a bisimulation graph.
  • Convert the bisimulation graph into an anti-symmetric matrix.
  • Calculate λmin(Q) and λmax(Q).
  • The problem is reduced to a range query: find all entries in the index whose [λmin, λmax] contains [λmin(Q), λmax(Q)] and whose root label r equals Q.root.
  • Refinement: evaluate Tree Pattern Matching on the returned candidate results.
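The filtering step can be illustrated with a linear scan over a small in-memory list. The `Entry` record, the `candidates` function, and all feature values below are made up for illustration; an actual index would answer the same containment test with a B+ tree range lookup rather than a scan.

```python
from typing import List, NamedTuple

class Entry(NamedTuple):
    root: str        # root label r of the indexable unit
    lam_min: float
    lam_max: float
    unit_id: int

def candidates(index: List[Entry], q_root: str,
               q_min: float, q_max: float) -> List[int]:
    """Keep units whose root label matches and whose eigenvalue range
    contains the query's range. Survivors still need refinement by TPM."""
    return [e.unit_id for e in index
            if e.root == q_root and e.lam_min <= q_min and q_max <= e.lam_max]

# Illustrative feature values only.
index = [
    Entry("article", -9.6, 9.6, 1),
    Entry("article", -3.0, 3.0, 2),
    Entry("book", -9.6, 9.6, 3),
]
hits = candidates(index, "article", -5.0, 5.0)
```

Unit 2 is pruned by its eigenvalue range and unit 3 by its root label, so only unit 1 is passed on to the refinement step.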

slide-32
SLIDE 32

Incorporating Values

  • Treat values as special tag names:
  • Values are hashed into a small domain Dv that is disjoint from the domain of tag-name encodings Dt, i.e., Dv ∩ Dt = ∅.
  • (tag, value) edges are also mapped to distinct integers.
  • Can answer equality-constrained queries, e.g.,

//book[title="TCP/IP Illustrated"]/price
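One way to keep the two code domains disjoint is sketched below. The tag table, the bucket count, the CRC32 hash, and the pairing formula in `encode_edge` are all assumptions made for this example; the slides only require that Dv ∩ Dt = ∅ and that edges map to distinct integers.

```python
import zlib

# Hypothetical tag-name encoding Dt = {0, 1, 2}.
TAG_CODES = {"book": 0, "title": 1, "price": 2}
VALUE_BUCKETS = 16   # size of the small value domain Dv (assumed)

def encode_value(value: str) -> int:
    """Hash a text value into Dv, placed directly after the tag codes
    so that Dv and Dt cannot overlap."""
    offset = len(TAG_CODES)
    return offset + zlib.crc32(value.encode("utf-8")) % VALUE_BUCKETS

def encode_edge(parent_code: int, child_code: int, domain_size: int) -> int:
    """Map a (parent, child) code pair to a distinct positive edge weight."""
    return parent_code * domain_size + child_code + 1

DOMAIN_SIZE = len(TAG_CODES) + VALUE_BUCKETS
value_code = encode_value("TCP/IP Illustrated")
title_edge = encode_edge(TAG_CODES["title"], value_code, DOMAIN_SIZE)
price_edge = encode_edge(TAG_CODES["price"], value_code, DOMAIN_SIZE)
```

Because different values can hash to the same bucket, a value predicate filtered this way can still admit false positives, which the refinement step removes.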

slide-33
SLIDE 33

Performance Evaluation: Implementation-independent Metrics

[Figure: average values of three metrics (avg. sel, avg. pp, avg. fpr) on a 0 to 1 scale for the data sets TCMD, DBLP, XMark, and Treebank.]

Implementation-independent metrics:

sel = 1 − rst/ent
pp = 1 − cdt/ent
fpr = 1 − rst/cdt
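The three formulas are easy to check on made-up counts. The slides do not define the abbreviations; this sketch assumes that `ent` is the number of index entries, `cdt` the number of candidates returned by the index, and `rst` the number of true results, and the counts used are invented.

```python
def fix_metrics(ent: int, cdt: int, rst: int):
    """Return (sel, pp, fpr) as defined on the slide, under the assumed
    reading ent = entries, cdt = candidates, rst = results."""
    sel = 1 - rst / ent   # selectivity
    pp = 1 - cdt / ent    # pruning power
    fpr = 1 - rst / cdt   # false-positive ratio
    return sel, pp, fpr

# Invented counts: 1000 entries, 50 candidates, 40 true results.
sel, pp, fpr = fix_metrics(ent=1000, cdt=50, rst=40)
```

With these counts the index prunes 95% of the entries, and 20% of the candidates it returns are false positives.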

slide-34
SLIDE 34

Performance Evaluation: Runtime

Runtime Performance on XMark

[Figure: runtime in msec (log scale, 1 to 10,000) of NoK, F&B, FIX unclustered, and FIX clustered on the test queries Xmark_hi_sp, Xmark_lo_sp, Xmark_hi_bp, and Xmark_lo_bp.]

FIX improves performance significantly on structure-rich data sets.

slide-35
SLIDE 35

Performance Evaluation: Runtime (cont.)

Runtime Performance on DBLP

[Figure: runtime in msec (log scale, 0.1 to 100,000) of NoK, F&B, FIX unclustered, and FIX clustered on the test queries DBLP_hi_sp, DBLP_lo_sp, DBLP_hi_bp, and DBLP_lo_bp.]

FIX does not perform as well on data sets with simple structure.

slide-36
SLIDE 36

Performance Evaluation: Runtime (cont.)

Runtime Performance on DBLP with values

[Figure: runtime in msec (log scale, 1 to 10,000) of F&B and FIX clustered on the queries DBLP_hi_bp and DBLP_lo_bp.]

When values are taken into account, however, FIX performs better.

slide-37
SLIDE 37

Conclusion and Future Work

Summary:

  • We identify three features for pruning subtrees during query processing.
  • Easy to calculate.
  • Pruning uses simple numeric comparisons.
  • A unified structure and value index (FIX) can be built based on these features to improve query performance significantly.
  • Query evaluation based on FIX is simple and relies on well-studied techniques.

Future Work:

  • Try an R-tree or another high-dimensional index instead of a B+ tree.
  • Support a wider range of queries.
  • Find more features!

slide-38
SLIDE 38

Thank you!
