Characteristic Sets:
Accurate Cardinality Estimation for RDF Queries with Multiple Joins
Thomas Neumann Guido Moerkotte
Presented By : Pranjal Gupta
Recap
RDF is the underlying data model of the Semantic Web. Data is represented as (subject, predicate, object) triples.
SELECT DISTINCT ?e WHERE { ?e <author> “Jane Austen” , ?e <title> ?b, ?e <year> ?y }
query plans that have optimal join ordering.
subject.
[Figure: query graph of the star join on ?e, with edges <author> to “Jane Austen”, <title> to ?b, and <year> to ?y]
○ By using Characteristic sets.
general queries.
1. Lack of explicit schema based on the structure. Cannot partition the data for estimation, since all data looks the same.
2. Predicates are correlated and hence, cardinality cannot be estimated using single-bucket histograms.
3. RDF predicates are usually string values -> histograms are deemed inappropriate for estimation.
4. RDF-3X’s solution.
IDEA
1. RDF data does not have a fixed schema.
2. The outgoing “predicate” edges give an idea about the “class” of the entity, e.g. Artist, City, Country.
3. A “soft” schema hence occurs in the data, based on the predicates of a subject.
Set of all predicates that have at least one triple with the subject
SC(“Google”) = { “product”, “founder”, “founded_in”, “CEO”, “website” }
Set of the characteristic sets of all subjects s for which at least one pair of predicate p and object o exists
{ “Founder”, “Founded In”, “CEO”, “CFO”, “Product”, “Revenue”, “Profit” }: “Google”, “Amazon”, “Tesla”
{ “Country”, “Province”, “Population”, “latitude”, “longitude” }: “Mumbai”, “New York”, “Toronto”
{ “Author”, “Title”, “Publisher”, “ISBN”, “Year”, “Language” }: “Namesake”, “The girl with a dragon tattoo”, “Tell me your Dreams”
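The two definitions above can be sketched in a few lines of Python. This is only an illustration: the function name `characteristic_sets` and the toy triples (with placeholder objects `o1` to `o6`) are assumptions made here, not something taken from the slides.

```python
from collections import Counter, defaultdict

def characteristic_sets(triples):
    """Return (subject -> characteristic set, characteristic set -> subject count)."""
    predicates_of = defaultdict(set)
    for subject, predicate, _obj in triples:
        predicates_of[subject].add(predicate)
    # A characteristic set is the set of predicates that occur with a subject.
    sc_of = {s: frozenset(ps) for s, ps in predicates_of.items()}
    # Count how many distinct subjects share each characteristic set.
    counts = Counter(sc_of.values())
    return sc_of, counts

# Hypothetical toy triples; the objects are placeholders, not real facts.
triples = [
    ("Google", "founder", "o1"),
    ("Google", "founder", "o2"),  # a repeated predicate appears once in the set
    ("Google", "CEO", "o3"),
    ("Amazon", "founder", "o4"),
    ("Amazon", "CEO", "o5"),
    ("Mumbai", "country", "o6"),
]
sc_of, counts = characteristic_sets(triples)
```

Here “Google” and “Amazon” end up in the same characteristic set, so that set gets a count of 2, while “Mumbai” forms its own set.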
SELECT DISTINCT ?e WHERE { ?e <author> ?a , ?e <title> ?b }
[Figure: query graph of the star join on ?e, with edges <author> to ?a and <title> to ?b]
SOLUTION
Sum of the cardinalities of all characteristic sets that are supersets of SC(Q)
SC(Q) = { “title”, “author” }
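A minimal sketch of this superset sum, assuming the characteristic sets are kept in a plain dictionary from predicate set to distinct-subject count. The helper name `distinct_subjects` is hypothetical; the counts are borrowed from the S1 to S4 example in the MERGING SOLUTIONS slides of this deck.

```python
def distinct_subjects(char_set_counts, query_predicates):
    """Sum the subject counts of every characteristic set that contains
    all predicates of the query."""
    q = frozenset(query_predicates)
    return sum(count for cs, count in char_set_counts.items() if q <= cs)

# Counts taken from the S1..S4 example in this deck.
char_set_counts = {
    frozenset({"author"}): 100,
    frozenset({"title"}): 200,
    frozenset({"author", "title", "year"}): 1000,
    frozenset({"author", "title"}): 20,
}

estimate = distinct_subjects(char_set_counts, {"title", "author"})
```

Only the last two sets contain both predicates, so they are the only ones that contribute to the estimate for the DISTINCT query.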
Limitation of previous calculations :
[Figure: <ent #416> with a <title> edge to “Let it Snow” and multiple <author> edges, among them “John Green” and “Lauren Myracle”]
SC(<ent 416>) = { “title”, “author” } count = 1
The count records each subject once, so a subject with several <author> triples still contributes only 1.
SELECT DISTINCT ?e WHERE { ?e <author> ?a , ?e <title> ?b }
Predicate Annotations !
characteristic set is also stored
SELECT DISTINCT ?e WHERE { ?e <author> ?a , ?e <title> ?b }
SC(Q) = { “title”, “author” }
S = { “title”, “author”, “year” }
average <author> occurrences per subject in S = 2300/1000 = 2.3
average <title> occurrences per subject in S = 1010/1000 = 1.01
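As a worked step, multiplying the distinct-subject count of S (1000) by the two ratios gives the estimated number of result tuples that S contributes when duplicates are not removed. The symbol card(Q | S) is a label introduced here for illustration.

```latex
\[
\mathrm{card}(Q \mid S) \approx 1000 \cdot \frac{2300}{1000} \cdot \frac{1010}{1000} = 2323
\]
```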
in -> correlation b/w subject and predicate.
and object ? INEFFICIENT
OBSERVATION
key of a relational table.
selective.
Complete Algorithm
1. Loop over all characteristic sets in SC that are supersets of the query's characteristic set.
2. For each such set, loop over all triples that appear in the query.
3. If the triple's object is bound, take the minimum of the selectivities of all object-bound triples in the query.
4. Else, update the cumulative selectivity (m).
5. Calculate the cardinality in the current characteristic set and add it to the global cardinality.
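The steps above can be sketched as follows. This is an illustrative reading of the slides, not the authors' code: the function name `star_join_cardinality`, the dictionary layout, and the selectivity value 0.01 for a bound author are all assumptions. The counts for the two characteristic sets are the S3 and S4 values used elsewhere in this deck.

```python
def star_join_cardinality(char_sets, query, selectivity):
    """char_sets: list of {"distinct": int, "occurrences": {predicate: int}}
    query: list of (predicate, bound_object_or_None) for one subject star
    selectivity: {(predicate, object): fraction}, assumed to come from other stats
    """
    query_preds = {pred for pred, _ in query}
    card = 0.0
    for cs in char_sets:
        # Only characteristic sets that contain every query predicate qualify.
        if not query_preds.issubset(cs["occurrences"]):
            continue
        m = 1.0  # cumulative factor from unbound objects
        o = 1.0  # most selective bound-object condition seen so far
        for pred, obj in query:
            if obj is not None:
                o = min(o, selectivity[(pred, obj)])
            else:
                m *= cs["occurrences"][pred] / cs["distinct"]
        card += cs["distinct"] * m * o
    return card

s3 = {"distinct": 1000, "occurrences": {"author": 2300, "title": 1001, "year": 1000}}
s4 = {"distinct": 20, "occurrences": {"author": 30, "title": 20}}

# Both objects unbound.
unbound = star_join_cardinality([s3, s4], [("author", None), ("title", None)], {})

# Author bound to a constant; 0.01 is a made-up selectivity for illustration.
bound = star_join_cardinality(
    [s3, s4],
    [("author", "Jane Austen"), ("title", None)],
    {("author", "Jane Austen"): 0.01},
)
```

Taking the minimum over bound objects, rather than multiplying their selectivities, means only the most selective constant is applied.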
MERGING SOLUTIONS
S1 = {(author, 120), 100}
S2 = {(title, 230), 200}
S3 = {(author, 2300), (title, 1001), (year, 1000), 1000 }
S4 = {(author, 30), (title, 20), 20}
[Figure: S4 is split and its parts are merged into S1 and S2]
S1 = {(author, 150), 120}
S2 = {(title, 250), 220}
[Figure: S4 is merged into its superset S3]
S3 = {(author, 2330), (title, 1021), (year, 1000), 1020 }
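The merge arithmetic in these examples is a plain addition of counts, which the sketch below reproduces. The helper names `merge_into` and `restrict` and the dictionary layout are assumptions made for illustration.

```python
def merge_into(target, source):
    """Add the distinct count and per-predicate occurrences of source onto target."""
    occurrences = dict(target["occurrences"])
    for pred, n in source["occurrences"].items():
        occurrences[pred] = occurrences.get(pred, 0) + n
    return {
        "distinct": target["distinct"] + source["distinct"],
        "occurrences": occurrences,
    }

def restrict(char_set, predicates):
    """Keep only the given predicates of a characteristic set (used when splitting)."""
    return {
        "distinct": char_set["distinct"],
        "occurrences": {p: char_set["occurrences"][p] for p in predicates},
    }

s1 = {"distinct": 100, "occurrences": {"author": 120}}
s3 = {"distinct": 1000, "occurrences": {"author": 2300, "title": 1001, "year": 1000}}
s4 = {"distinct": 20, "occurrences": {"author": 30, "title": 20}}

# Merge S4 directly into its superset S3.
merged_s3 = merge_into(s3, s4)

# Split alternative: the author part of S4 is merged into S1.
merged_s1 = merge_into(s1, restrict(s4, ["author"]))
```

After the direct merge, the 20 subjects of S4 are counted as if they also had a <year> predicate, which is why merging trades accuracy for a smaller set of statistics.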
correct upper bound in computation
Find the set of all characteristic sets in SC that are supersets of S.
If there are any: S’ = the subset of those characteristic sets that have the fewest elements; merge S into the member of S’ that has the maximum distinct count.
Else, break S into S1 and S2, such that S1 is the maximal subset of S contained in a characteristic set in SC. Merge S1 and S2 individually.
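One way to read this selection rule is sketched below. It only decides where S should go and does not perform the merge. The function name `choose_merge`, the return format, and the tie-breaking (first candidate in dictionary order) are assumptions, not details from the slides.

```python
def choose_merge(char_sets, s):
    """char_sets: {frozenset of predicates: distinct count}; s: the set to eliminate.
    Returns ("merge", target) or ("split", s1, s2)."""
    supersets = [cs for cs in char_sets if s < cs]  # proper supersets of s
    if supersets:
        fewest = min(len(cs) for cs in supersets)
        smallest = [cs for cs in supersets if len(cs) == fewest]
        # Among the smallest supersets, prefer the one with the most subjects.
        target = max(smallest, key=lambda cs: char_sets[cs])
        return ("merge", target)
    # No superset: s1 is the largest part of s covered by another characteristic set.
    overlaps = [s & cs for cs in char_sets if cs != s]
    s1 = max(overlaps, key=len, default=frozenset())
    return ("split", s1, s - s1)

S1 = frozenset({"author"})
S2 = frozenset({"title"})
S3 = frozenset({"author", "title", "year"})
S4 = frozenset({"author", "title"})

with_superset = choose_merge({S1: 100, S2: 200, S3: 1000, S4: 20}, S4)
without_superset = choose_merge({S1: 100, S2: 200, S4: 20}, S4)
```

With S3 present, S4 is merged into it; without S3, S4 is split into its author and title parts, which can then be merged into S1 and S2.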
Principles for integrating the characteristic-set-based cardinality estimator into the plan generator:
Calculate the cardinality estimate once per class of equivalent query plans
Use the maximum amount of consistent correlation information
information for only a few portions of the graph.
the graph, before starting to use join estimates.
Assume independence if no correlation information is available.
calculate estimates using general join stats.
“late” in cost estimation.
SELECT ?a ?t WHERE { ?b <author> ?a , ?b <title> ?t, ?b <year> ‘2009’, ?b <published_by> ?p, ?p <name> “ACM” }
[Figure: QUERY GRAPH with nodes ?b, ?a, ?t, ?p, “2009” and edges <author>, <title>, <year>, <pub_by>, <name>; JOIN GRAPH with one node per triple pattern of the query]
Programming approach. At each step, match one of the query patterns.
cardinality for the query subgraph from the DP table, if available.
for the part of the graph using the ESTIMATE_QUERY_CARDINALITY function
WORST CASE
[Figure: QUERY GRAPH with nodes ?b, ?a, ?t, ?p, “2009” and edges <name>, <year>, <author>, <title>, <pub_by>]
ESTIMATE_QUERY_CARDINALITY
Selects the largest subject star join (S) from the uncovered region of QR, calculates the cardinality of that star, and marks S as covered in QR.
Selects the largest object star join (S) from the uncovered region of QR, calculates the cardinality of that star, and marks S as covered in QR.
Uses the independence assumption for all nodes and edges left uncovered in QR.
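The covering part of this procedure can be sketched as a greedy loop. This is a simplification under stated assumptions: it only picks the stars and returns the leftover patterns, it does not compute any cardinalities, a star is assumed to need at least two patterns (`min_size`), and ties favor the subject star. All names are hypothetical.

```python
from collections import defaultdict

def largest_star(patterns, position):
    """Group patterns by their subject (position 0) or object (position 2)
    and return the largest group; the first one wins on ties."""
    groups = defaultdict(list)
    for pattern in patterns:
        groups[pattern[position]].append(pattern)
    return max(groups.values(), key=len, default=[])

def cover_with_stars(patterns, min_size=2):
    """Greedily cover the query with subject and object stars.
    Returns (chosen stars, patterns left for the independence assumption)."""
    uncovered = list(patterns)
    stars = []
    while True:
        subject_star = largest_star(uncovered, 0)
        object_star = largest_star(uncovered, 2)
        if len(subject_star) >= len(object_star):
            kind, star = "subject", subject_star
        else:
            kind, star = "object", object_star
        if len(star) < min_size:
            break
        stars.append((kind, star))
        uncovered = [p for p in uncovered if p not in star]
    return stars, uncovered

# Triple patterns of the example query in this deck.
query = [
    ("?b", "author", "?a"),
    ("?b", "title", "?t"),
    ("?b", "year", "2009"),
    ("?b", "published_by", "?p"),
    ("?p", "name", "ACM"),
]
stars, leftover = cover_with_stars(query)
```

For the example query, the four patterns on ?b form one subject star, and the pattern on ?p is left over for the independence assumption.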
nodes/edges
Yago LibraryThing
○ Strong schema
○ Very good cardinality estimates
○ Merging