
Graph Indexing: Tree + Δ ≥ Graph - PowerPoint PPT Presentation



  1. Graph Indexing: Tree + Δ ≥ Graph. Peixiang Zhao, Jeffrey Xu Yu, Philip S. Yu. The Chinese University of Hong Kong, {pxzhao, yu}@se.cuhk.edu.hk; IBM Watson Research Center, psyu@us.ibm.com

  2. An Overview
  • Graph containment query
  • The framework and query cost model
  • Some existing path/graph based solutions
  • A new tree-based approach
  • Experimental studies
  • Conclusion

  3. Graph Containment Query
  • Given a graph database G = {g_1, g_2, …, g_N} and a query graph q, find the set sup(q) = {g_i | q ⊆ g_i, g_i ∈ G}.
  • Infeasible to check subgraph isomorphism for every g_i in G, because subgraph isomorphism is NP-complete.
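
For concreteness, a minimal brute-force evaluation of sup(q) in Python (networkx assumed; the representation, with node labels kept in a "label" attribute, is an assumption, not from the slides). Every database graph triggers one NP-complete test, which is exactly what indexing must avoid:

```python
# Brute-force sup(q): one NP-complete subgraph-isomorphism test per graph.
# Assumes networkx graphs with node labels stored in a "label" attribute.
from networkx.algorithms import isomorphism

def naive_sup(q, G_db):
    """Return sup(q) = [g in G_db such that q is contained in g]."""
    node_match = isomorphism.categorical_node_match("label", None)
    answer = []
    for g in G_db:
        matcher = isomorphism.GraphMatcher(g, q, node_match=node_match)
        if matcher.subgraph_is_isomorphic():  # induced-subgraph test; close enough for a sketch
            answer.append(g)
    return answer
```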

  4. The Framework
  • Index construction generates a set of features, F, from the graph database G. Each feature, f, maintains the set of ids of the graphs in G containing f, sup(f).
  • Query processing is a filtering-verification process.
  • Filtering phase uses the indexed features contained in the query graph, q, to compute the candidate set C_q.
  • Verification phase checks subgraph isomorphism for every graph in C_q. False positives are pruned.
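
A sketch of the filtering-verification loop under the obvious representation (all helper names are assumptions: index maps each feature f to sup(f) as a set of graph ids, features_of(q) lists the indexed features contained in q, and contains(g, q) is the exact subgraph-isomorphism test):

```python
# Filtering-verification: intersect inverted lists, then verify the survivors.
def answer_query(q, G_db, index, features_of, contains):
    # Filtering: any answer graph must contain every indexed feature of q.
    C_q = set(range(len(G_db)))
    for f in features_of(q):
        C_q &= index[f]                 # candidate set shrinks with each feature
    # Verification: exact (NP-complete) tests prune the false positives.
    return [gid for gid in sorted(C_q) if contains(G_db[gid], q)]
```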

  5. Query Cost Model
  • The cost of processing a graph containment query q upon G is modeled as cost(q) = C_f + |C_q| · C_v, where
  • C_f : the filtering cost, and
  • C_v : the per-candidate verification cost (NP-complete).
  • Several facts:
  • To improve query performance is to minimize |C_q|.
  • The selected feature set F has great impact on C_f and |C_q|.
  • There is also an index construction cost: the cost of discovering the feature set F.
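
A toy illustration of the trade-off this model captures (the numbers are made up): a cheap filter that leaves a large C_q can cost more overall than a costlier but more selective one, because every candidate pays the verification cost C_v:

```python
# cost(q) = C_f + |C_q| * C_v, with the NP-complete term C_v dominating.
def query_cost(c_f, cand_size, c_v):
    return c_f + cand_size * c_v

print(query_cost(c_f=1.0, cand_size=500, c_v=10.0))   # 5001.0: weak filter
print(query_cost(c_f=25.0, cand_size=40, c_v=10.0))   # 425.0: selective filter wins
```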

  6. Existing Solutions: Paths vs. Graphs
  • Path-based indexing approach: GraphGrep (PODS'02)
  • All paths up to a certain length l_p are enumerated as indexing features (see the sketch below)
  – An efficient index construction process
  – Index size is determined by l_p
  – Limited pruning power, because the structural information is lost
  • Graph-based indexing approach: gIndex (SIGMOD'04)
  • Discriminative frequent subgraphs are mined from G as indexing features
  – A costly index construction process
  – Compact index structure
  – Great pruning power, because structural information is well-preserved
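
A sketch of GraphGrep-style path features under the same assumed representation as above: enumerate the label sequences of all simple paths with at most l_p edges (the canonicalization via min(t, reversed t) is an illustrative choice, not GraphGrep's actual encoding):

```python
# Enumerate label sequences of all simple paths with at most l_p edges.
def path_features(g, l_p):
    feats = set()
    def extend(path):
        labels = tuple(g.nodes[v]["label"] for v in path)
        feats.add(min(labels, labels[::-1]))  # one canonical direction per path
        if len(path) == l_p + 1:              # path already has l_p edges
            return
        for w in g.neighbors(path[-1]):
            if w not in path:                 # simple paths only
                extend(path + [w])
    for v in g.nodes:
        extend([v])
    return feats
```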

  7. Tree Features?
  • Regarding paths and graphs as index features:
  • The cost of generating path features is small, but the candidate set can be large.
  • The cost of generating frequent graph features is high, but the candidate set can be small.
  • The key observation: the majority of frequent graph-features (more than 95%) are trees.
  • How well can tree features do?
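
The 95% figure can be checked mechanically; a sketch that buckets mined features into paths, trees, and general graphs (networkx assumed, connected features):

```python
# Bucket a connected feature graph as "path", "tree", or general "graph".
import networkx as nx
from collections import Counter

def classify(feature):
    if not nx.is_tree(feature):                    # has a cycle
        return "graph"
    if all(deg <= 2 for _, deg in feature.degree()):
        return "path"                              # a tree with max degree 2
    return "tree"

# e.g. distribution = Counter(classify(f) for f in frequent_features)
```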

  8. A New Approach: Tree+Δ
  • To explore the indexability of path, tree and graph.
  • A new approach, Tree+Δ:
  • To select frequent tree features.
  • To select a small number of discriminative graph-features that can prune graphs effectively, on demand, without costly graph mining.
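
A high-level sketch of how Tree+Δ filtering could look (every helper name here is an assumption: tree_index maps frequent subtrees to sup(t), delta caches the on-demand graph features with their supports, and subtrees_of/subgraphs_of enumerate the indexed features contained in q):

```python
# Tree+Delta filtering: tree features first, cached graph features (Delta) next.
def tree_delta_candidates(q, tree_index, delta, subtrees_of, subgraphs_of, all_ids):
    C_q = set(all_ids)
    for t in subtrees_of(q):          # frequent tree features contained in q
        if t in tree_index:
            C_q &= tree_index[t]
    for g in subgraphs_of(q):         # reuse discriminative graph features
        if g in delta:
            C_q &= delta[g]
    return C_q                        # still needs verification
```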

  9. Indexability of Path, Tree and Graph
  • We consider three main factors to assess indexability:
  • The frequent feature set size: |F|
  • The feature selection (mining) cost: C_FS
  • The candidate set size: |C_q|

  10. The Frequent Feature Set Size: |F|
  • 95% of frequent graph features are trees. Why?
  • Consider non-tree frequent graph features g and g'.
  • By the Apriori principle, all of g's subtrees t_1, t_2, …, t_n are frequent.
  • Because of the structural diversity and vertex/edge label variety, there is little chance that the subtrees of g coincide with those of g'.

  11. Frequent Feature Distributions
  [Chart omitted: frequent feature distributions on the real dataset (AIDS antivirus screen dataset), N = 1,000, σ = 0.1]

  12. The Feature Selection Cost: C_FS
  • Given a graph database, G, and a minimum support threshold, σ, discover the frequent feature set F from G.
  • Graph: two prohibitive operations are unavoidable
  – Subgraph isomorphism
  – Graph isomorphism
  • Tree: one prohibitive operation is unavoidable
  – Tree-in-graph testing
  • Path: polynomial time

  13. The Candidate Set Size: |C_q|
  • Let the pruning power of a frequent feature f be power(f) = 1 − |sup(f)|/|G|.
  • For a frequent feature set S = {f_1, f_2, …, f_n}, power(S) = 1 − |sup(f_1) ∩ … ∩ sup(f_n)|/|G|.
  • Let the frequent subtree feature set of a graph g be T(g) = {t_1, t_2, …, t_n}. Then power(g) ≥ power(T(g)).
  • Let the frequent subpath feature set of a tree t be P(t) = {p_1, p_2, …, p_n}. Then power(t) ≥ power(P(t)).
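
Under the reconstruction above, pruning power is directly computable from the inverted lists; a minimal sketch (sup is an assumed mapping from a feature to its set of graph ids):

```python
# power(S) = 1 - |sup(f1) & ... & sup(fn)| / |G|
def power(feature_set, sup, db_size):
    if not feature_set:
        return 0.0                    # no feature, nothing pruned
    ids = None
    for f in feature_set:
        ids = set(sup[f]) if ids is None else ids & sup[f]
    return 1.0 - len(ids) / db_size
```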

  14. The Pruning Power
  [Chart omitted: pruning power comparison on the real dataset (AIDS antivirus screen dataset), N = 1,000, σ = 0.1]

  15. Indexability of Tree
  • The frequent tree-feature set dominates (95%).
  • Discovering frequent tree-features can be done much more efficiently than mining frequent general graph-features.
  • Frequent tree-features can contribute pruning power similar to that of frequent graph-features.

  16. Add Graph Features On Demand
  • Consider a query graph q which contains a subgraph g.
  • If power(T(g)) ≈ power(g), there is no need to index the graph-feature g.
  • If power(g) >> power(T(g)), g should be selected as an index feature, because g is more discriminative than T(g) in terms of pruning (a sketch follows below).
  • Select discriminative graph-features on demand, without mining the whole set of frequent graph-features from G.
  • The selected graph features are additional indexing features, denoted Δ, kept for later reuse.
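
One plausible reading of the on-demand rule as code, reusing power() from above (the ratio test and the threshold eps0 anticipate the discriminative ratio on the next slide; this formalization is an assumption, not the paper's exact definition):

```python
# Index a non-tree subgraph g of the query only if it clearly out-prunes T(g).
def maybe_add_to_delta(g, T_g, sup, db_size, delta, eps0):
    p_trees = power(T_g, sup, db_size)        # power(T(g))
    p_graph = power([g], sup, db_size)        # power(g)
    if p_trees == 0.0 or p_graph / p_trees >= eps0:
        delta[g] = set(sup[g])                # cache g (and sup(g)) in Delta for reuse
```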

  17. Discriminative Ratio
  • A discriminative ratio, ε(g), is defined to measure the similarity of pruning power between a graph-feature g and its subtrees T(g).
  • A non-tree graph feature, g, is discriminative if ε(g) ≥ ε0.

  18. Discriminative Graph Selection (1)
  • Consider two graphs g and g', where g ⊆ g'.
  • If the gap between power(g') and power(g) is large, reclaim g' from G. Otherwise, do not reclaim g' in the presence of g.
  • Approximate the discriminative ratio between g' and g in the presence of the frequent tree-features discovered.

  19. Discriminative Graph Selection (2)
  • Let the occurrence probability of g in the graph database be Pr(g) = |sup(g)|/|G|.
  • The conditional occurrence probability of g' w.r.t. g is Pr(g'|g) = |sup(g')|/|sup(g)| (since g ⊆ g', sup(g') ⊆ sup(g)).
  • When Pr(g'|g) is small, g' has a higher probability of being discriminative w.r.t. g.
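
These two probabilities reduce to support-list sizes; a minimal sketch, assuming g ⊆ g' so that sup(g') ⊆ sup(g), with sup the same assumed mapping as before:

```python
# Pr(x) = |sup(x)|/|G|;  Pr(g'|g) = |sup(g')|/|sup(g)| when g is a subgraph of g'.
def pr(x, sup, db_size):
    return len(sup[x]) / db_size

def pr_given(g_prime, g, sup):
    return len(sup[g_prime]) / len(sup[g])
```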

  20. Discriminative Graph Selection (3)
  • Upper and lower bounds on Pr(g'|g) can be derived, because ε(g) ≥ ε0 and ε(g') ≥ ε0.
  • Recall: Pr(x) = |sup(x)|/|G|.

  21. Discriminative Graph Selection (4)
  • Because 0 ≤ Pr(g'|g) ≤ 1, the conditional occurrence probability Pr(g'|g) is upper-bounded solely in terms of T(g').
