Optimization of Regular Path Queries in Large Graphs
Nikolay Yakovets
in collaboration with: Parke Godfrey and Jarek Gryz
2
Scalable & efficient evaluation of regular path queries
Linked Data
RPQs
Semantics
Evaluation
WAVEGUIDE
Implementation
Optimization
Plans
Costs
Adjacency Query
list all neighbours, find k-neighbourhood of a node
Pattern Matching Query
find all sub-graphs in a database that are isomorphic to a given query pattern graph
Summarization Query
summarize or operate on query results e.g. aggregation; avg(), min(), max(), etc
Navigational Query
deals with paths in a graph
test whether nodes are reachable in a graph
paths of fixed or arbitrary lengths
pattern
3
SPARQL Protocol and RDF Query Language (SPARQL)
Query:
SELECT ?pop WHERE { :Oakville :population ?pop }
variables: ?pop
graph pattern: { :Oakville :population ?pop }
Graph:
ny:nikolay foaf:name "Nikolay Yakovets"
ny:nikolay foaf:based_near dbpedia:Oakville
dbpedia:Oakville dp:population "182520"
adjacency pattern matching summarization
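A minimal Python sketch of how a single triple pattern such as the one in the query above could be matched, assuming the graph is held in memory as a set of (subject, predicate, object) tuples. The `match` helper and the `TRIPLES` data (with prefixes simplified to agree with the query text) are illustrative assumptions, not part of any SPARQL engine.

```python
# Toy triple store; the triples loosely mirror the slide's example graph,
# with prefixes simplified so that they agree with the query text (assumption).
TRIPLES = {
    ("ny:nikolay", "foaf:name", "Nikolay Yakovets"),
    ("ny:nikolay", "foaf:based_near", ":Oakville"),
    (":Oakville", ":population", "182520"),
}

def match(pattern, triples):
    """Return one binding dict per triple that matches a single triple pattern.

    Terms starting with '?' are variables; every other term must match exactly.
    """
    results = []
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A repeated variable must bind to the same value each time.
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            # No mismatch was found, so the triple matches the pattern.
            results.append(binding)
    return results

# SELECT ?pop WHERE { :Oakville :population ?pop }
answers = match((":Oakville", ":population", "?pop"), TRIPLES)
```

Here `answers` holds a single binding of `?pop` to the population literal. A full basic graph pattern would join the bindings of several such triple patterns.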
4
concatenation: p1/p2
disjunction: p1|p2
zero or one: p?
inverted: ^p
negated: !iri
Kleene star: p*
Kleene plus: p+
path
underlying data paths
5
en:Gundam jp:ガンダム
:sameAs :isLocatedIn
jp:お台場 en:Daiba en:Tokyo en:Japan jp:東京 jp:関東地方 jp:本州 jp:日本
select ?place { en:Gundam (:sameAs*/:isLocatedIn)+/:sameAs* ?place . }
hierarchy (:isLocatedIn) to fully utilize richer spatial information in Japanese dataset
:isLocatedIn :isLocatedIn :sameAs
6
free variables regular language
[[Q]]G - an evaluation of Q over graph database G
a bag (allow duplicates)
a set (discard duplicates)
path-induced string λ(p) ∈ L(r)
path is simple or arbitrary
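The difference between bag and set semantics can be illustrated with a small Python sketch. It assumes simple-path semantics, single-character edge labels (so that a path's label string λ(p) can be tested against L(r) with Python's `re` module), and a made-up graph. The function names are hypothetical, and enumerating paths this way is only for illustration, not an efficient evaluation strategy.

```python
import re
from collections import Counter

# Edge-labelled graph as (source, label, target) triples; data is made up.
# There are two distinct paths from node 1 to node 3, both spelling "ab".
EDGES = [(1, "a", 2), (2, "b", 3), (1, "a", 4), (4, "b", 3)]

def simple_paths(edges, start):
    """Yield (end_node, label_string) for every non-empty simple path from start."""
    stack = [(start, "", {start})]
    while stack:
        node, word, seen = stack.pop()
        for s, label, t in edges:
            if s == node and t not in seen:
                yield t, word + label
                stack.append((t, word + label, seen | {t}))

def evaluate(edges, regex):
    """Bag of (x, y) pairs: one copy per simple path p with λ(p) in L(regex)."""
    nodes = {s for s, _, _ in edges} | {t for _, _, t in edges}
    bag = Counter()
    for x in nodes:
        for y, word in simple_paths(edges, x):
            if re.fullmatch(regex, word):
                bag[(x, y)] += 1
    return bag

bag_answer = evaluate(EDGES, "(ab)+")  # bag semantics: duplicates kept
set_answer = set(bag_answer)           # set semantics: duplicates discarded
```

Under bag semantics the pair (1, 3) is returned once per conforming path, while under set semantics it appears once.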
7
SPARQL (W3C proposal for RDF query language) support of RPQs through SPARQL 1.1 property paths
(Arenas et al., Losemann et al., 2013)
Tractable on DAGs, or restricted compatible regex
8
9
10
select ?place { en:Gundam (:sameAs*/:isLocatedIn)+/:sameAs* ?place . }
query ε-NFA:
if necessary :
query and graph automata.
states to produce an answer to a query
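The automaton-based approach can be sketched as a breadth-first search over (graph node, automaton state) pairs, which explores the product of the query and graph automata lazily. This is a minimal sketch under stated assumptions: the query automaton is given ε-free as a dictionary, the graph is an adjacency dictionary, and the function and variable names are hypothetical.

```python
from collections import deque

def rpq_from(graph, nfa, start_state, accepting, source):
    """Nodes reachable from `source` along a path whose labels the NFA accepts.

    graph: dict node -> list of (label, target)
    nfa:   dict (state, label) -> set of next states (ε-free for brevity)
    """
    start = (source, start_state)
    seen = {start}
    queue = deque([start])
    answers = set()
    while queue:
        node, state = queue.popleft()
        if state in accepting:
            # An accepting product state yields an answer node.
            answers.add(node)
        for label, target in graph.get(node, []):
            for nxt in nfa.get((state, label), ()):
                pair = (target, nxt)
                if pair not in seen:
                    seen.add(pair)
                    queue.append(pair)
    return answers

# Made-up graph with an a/b cycle through nodes 1..4 and a side edge c.
GRAPH = {1: [("a", 2)], 2: [("b", 3)], 3: [("a", 4), ("c", 5)], 4: [("b", 1)]}
# Automaton for (a/b)+ : 0 -a-> 1 -b-> 2 -a-> 1, accepting state 2.
NFA = {(0, "a"): {1}, (1, "b"): {2}, (2, "a"): {1}}

reached = rpq_from(GRAPH, NFA, 0, {2}, 1)
```

Because each (node, state) pair is visited at most once, the search terminates on cyclic graphs.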
11
select ?place { en:Gundam (:sameAs*/:isLocatedIn)+/:sameAs* ?place . }
SPRJU-RA extended with α
α computes the least fixpoint: the transitive closure of a given relation
Q → Q parse tree → Q RA tree → favourite RDBMS
plan spaces
PFA ∉ α-RA
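As a sketch of the relational-algebra approach, the closure operator can be written as a least-fixpoint iteration over a binary relation. The code below is a naive illustration, assuming a made-up triples table `T` and hypothetical helper names. It is not how an RDBMS would implement recursion, and the plan shown for (a/b)+ is just one possible plan.

```python
def compose(r, s):
    """Join of binary relations r and s on r.object = s.subject."""
    return {(x, z) for x, y in r for y2, z in s if y == y2}

def alpha(r):
    """Least fixpoint of X = r ∪ compose(X, r), i.e. the transitive closure of r."""
    closure = set(r)
    while True:
        extended = closure | compose(closure, r)
        if extended == closure:
            return closure
        closure = extended

def select(triples, predicate):
    """Selection on the predicate column, projected to (subject, object)."""
    return {(s, o) for s, p, o in triples if p == predicate}

# Triples table T as (subject, predicate, object); data is made up.
T = {(1, "a", 2), (2, "b", 3), (3, "a", 4), (4, "b", 5)}

# One plan for (a/b)+ : close the join of the a-edges with the b-edges.
ab = compose(select(T, "a"), select(T, "b"))
ab_plus = alpha(ab)
```

The closure is applied to the already joined relation `ab`, so every iteration extends paths by a whole a/b segment at a time.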
12
13
[Figure: waveplan Pab+ with wavefronts Wab and Wab+]
a wavefront
label seed starting state set of states edge labels transition function wavefront labels accepting states
a transition function
δ : Q × (((E ∪ L) × {· , ·}) ∪ {ε}) → 2^Q
graph edges
appending or prepending pipeline
a seed
by construction
q0 Wl S
starting state seed
14
a waveplan
transitions over a view
labels and seeds, but not vice-versa
e.g., query (?x, (a/b)+, ?y)
(a/b)+
15
[Figure: waveplan Pab+ with wavefronts Wab and Wab+, and <Pab+]
set of wavefronts
Exploration procedure based on semi-naive evaluation
Intermediate search results kept in the search cache
Cache keeps track of end-nodes and corresponding states in a plan
Loop while new tuples are discovered
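The exploration loop can be sketched in Python as a semi-naive iteration in which each round expands only the tuples discovered in the previous round. This is a simplified illustration, not the WAVEGUIDE implementation: the waveplan is reduced to a plain ε-free automaton over graph edge labels, the cache stores (seed, end-node, state) tuples, and all names are hypothetical.

```python
def guided_search(graph, transitions, start_state, accepting, seeds):
    """Semi-naive expansion of (seed, node, state) tuples until no new ones appear."""
    cache = {(s, s, start_state) for s in seeds}  # everything discovered so far
    delta = set(cache)                            # tuples new in the last round
    while delta:                                  # loop while new tuples are discovered
        fresh = set()
        for seed, node, state in delta:           # expand only the newest tuples
            for label, target in graph.get(node, []):
                for nxt in transitions.get((state, label), ()):
                    fresh.add((seed, target, nxt))
        delta = fresh - cache                     # keep only tuples not yet cached
        cache |= delta
    # Answers are (seed, end-node) pairs that were cached in an accepting state.
    return {(seed, node) for seed, node, state in cache if state in accepting}

# Made-up cyclic graph: 1 -a-> 2 -b-> 3 -a-> 4 -b-> 1
GRAPH = {1: [("a", 2)], 2: [("b", 3)], 3: [("a", 4)], 4: [("b", 1)]}
# Automaton for (a/b)+ with accepting state 2.
TRANSITIONS = {(0, "a"): {1}, (1, "b"): {2}, (2, "a"): {1}}

pairs = guided_search(GRAPH, TRANSITIONS, 0, {2}, seeds={1, 3})
```

Subtracting the cache from each round's output is what keeps the loop from re-expanding known tuples and guarantees termination on cyclic graphs.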
16
17
techniques? efficient?
enabled by WAVEGUIDE?
analysis?
18
19
[Figure: query automaton for (abc)+ and waveplan P(abc)+ with a single wavefront W(abc)+]
20
[Figure: α-RA plan for (abc)+ (selections σp=a, σp=b, σp=c over T, joined on o=s, under α) and waveplan P(abc)+ with wavefronts Wabc and W(abc)+]
21
[Figure: waveplan P(abc)+ with wavefronts Wbc and W(abc)+]
22
sub-space of standard plans
the cost model
Concatenation rules, p : s1/s2
rule id  description             waveplan  precondition         seed
CC       concat compound         Ws2·      |s1| > 1, |s2| > 1   null; d2 = U
CCF      concat compound flip    ·Ws1      |s1| > 1, |s2| > 1   d1 = U; null
CP       concat pipe             s2·       |s1| > 0, |s2| = 1   null
CPF      concat pipe flip        ·s1       |s1| = 1, |s2| > 0   null
DP       direct pipeline         ε         |s1| > 0, |s2| > 0   null; d2 = s1
DP       inverse pipeline        ε         |s1| > 0, |s2| > 0   d1 = s2; null

Absorb seed rules, seed d
rule id  description                   waveplan  precondition  seed
ASDP     absorb seed direct pipe       s1·       |s1| = 1      null
ASIP     absorb seed inverse pipe      ·s2       |s2| = 1      null
ASDC     absorb seed direct compound   Ws1·      |s1| > 1      null
ASIC     absorb seed inverse compound  ·Ws2      |s2| > 1      null
seed passing: d1 = U, d2 = U

Kleene rules, seed d
rule id  description  waveplan  precondition  seed
KP       kleene plus  ε, +      null          d1 = d/(s1)+; d1 = (s1)+/d
KS       kleene star  ε, *      null          d1 = d/(s1)*; d1 = (s1)*/d
[Figure: plan spaces PWP, PSWP, PFA, PTFA and Pα-RA]
23
the search
deltas (search space)
(search space)
cache (materialized cache size)
24
Search cardinality
directions
frequencies - synopsis
Solution redundancy
conforming paths
Sub-path redundancy
hierarchical structures
paths
25
Choice of wavefronts
directions with direct/inverse and graph/view transitions
Reduce
discovered and cyclic
Threading
(views)
26
Waveguide in the context of SPARQL
path query optimization on large RDF datasets
Guided search as procedural SQL
Illustration
27
28
Observations
improvement even for simple queries
profiles depending on tape
iterations
29
30
DBPedia dataset: mining 21 queries of type ?x (a/b) ?y
evaluating pipelined and full loop caching: is the rich WG plan space useful?
need to cost, as the type of edge walks performed differs depending on the plan and the shape of the graph
31
mining RPQ patterns and a set of realistic queries over YAGO2s and DBPedia
benchmarking: despite slower transitive closure, WG gains significant improvement due to richer plan space
32
Devise WAVEGUIDE (WG) framework for planning and evaluation
Demonstrate that it subsumes existing techniques and extends well beyond them
Analyze WG's plan space and provide an efficient way to enumerate through a subspace of plans
Model the cost factors that determine the efficiency of the plans
Present and prototype powerful optimizations offered by WG plans
33
plans?
binning
assumption
can we do better?
34
[Figure: plan-space diagrams positioning PUnroll, PGlu and PDerivative relative to PWP, PSWP, PFA, PTFA and Pα-RA]
35