A Node Indexing Scheme for Web Entity Retrieval

SLIDE 1

A Node Indexing Scheme for Web Entity Retrieval

Renaud Delbru, Nickolai Toupikov, Michele Catasta, and Giovanni Tummarello

Digital Enterprise Research Institute, Galway

June 2, 2010

SLIDE 2

Introduction Tree Model Query Model Implementation Comparison Experimental Results Conclusion

Introduction

Web of Data

Pages with semantic markup: RDF, RDFa, Microformats; currently on the order of X00,000,000 (hundreds of millions of) pages.

How can this data be consumed?

Traditional search engines are ineffective; shift from text documents to data entities.

Semi-structured IR: node indexing scheme

Technique from the XML IR world; good compromise between query expressiveness, query processing time and update complexity.

SIREn (Semantic Information Retrieval Engine)

Open-source implementation; at the core of the Sindice search engine.

SLIDE 3

From “Web” to “Web of Data”

Web

Document; Bag of words; Unstructured; Document centric.

Web of Data

Dataset - Entity; Bag of RDF assertions; Semi-structured; Dataset - entity centric.

SLIDE 11

Entity Retrieval

Entity Retrieval

Given an entity search query, find the most relevant entities (list of entities ordered by relevance).

Entity Search Query

We aim to support three types of queries:

full-text search: keyword-based queries when the data structure is unknown;
structural query: complex queries specified in a star-shaped structure when the data schema is known;
semi-structural query: combination of the two (where full-text search can be used on any part of the star-shaped query) when the data structure is partially known.

Relevant subset of SPARQL; matches well with IR.
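The three query types can be pictured as one star-shaped pattern whose arms carry optional keywords. The sketch below is only an illustration under assumptions: the names `Pattern`, `contains` and `matches`, the token-level matching, and the sample entity are hypothetical and are not taken from SIREn.

```python
from typing import List, Optional, Tuple

Pair = Tuple[str, str]                         # one (predicate, object) pair of an entity
Pattern = Tuple[Optional[str], Optional[str]]  # keyword per position; None means "anything"


def contains(text: str, keyword: Optional[str]) -> bool:
    """True if the keyword is absent (wildcard) or is one of the tokens of `text`."""
    return keyword is None or keyword.lower() in text.lower().split()


def matches(entity: List[Pair], star: List[Pattern]) -> bool:
    """Every arm of the star must be satisfied by at least one (predicate, object) pair."""
    return all(
        any(contains(p, pk) and contains(o, ok) for p, o in entity)
        for pk, ok in star
    )


# Hypothetical entity description.
entity = [("name", "Renaud Delbru"), ("knows", "Giovanni Tummarello")]

full_text = [(None, "renaud")]                            # structure unknown
structural = [("name", "renaud"), ("knows", "giovanni")]  # schema known
semi_structural = [("name", "delbru"), (None, "giovanni")]  # schema partially known
```

In this reading, a full-text query is simply a star with a single arm and no predicate constraint, which is why the three types can share one evaluation routine.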

SLIDE 17

Entity Retrieval: Star Query

(a) Visual representation of an RDF graph. (b) Star-shaped query

Figure: Oval nodes represent resources and rectangular ones represent literals.

SLIDE 18

Outline: Tree Model

Tree Model

Conceptual Model
Node-Labelled Tree: Model
Node-Labelled Tree: Example

SLIDE 19

Conceptual Model

Figure: Conceptual representation of the node-labelled tree model

SLIDE 20

Node-Labelled Tree: Model

Origin: Semi-structured information retrieval, more recently XML retrieval.
Goal: Encode relationships between nodes.
Operators: Parent-Child and Ancestor-Descendant (as in XPath).
Requirement: Assign unique identifiers (node labels) that encode relationships between the nodes.
Solution: Node labelling scheme (e.g., Dewey Encoding).
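With a Dewey-style labelling, both operators reduce to prefix tests on the labels. The following is a minimal sketch, assuming each label is the tuple of sibling positions on the path from the root to the node; the type alias and function names are hypothetical.

```python
from typing import Tuple

Dewey = Tuple[int, ...]  # e.g. (1, 2, 1) for the label "1.2.1"


def is_ancestor(a: Dewey, d: Dewey) -> bool:
    """Ancestor-Descendant: `a` is a proper prefix of `d`."""
    return len(a) < len(d) and d[:len(a)] == a


def is_parent(p: Dewey, c: Dewey) -> bool:
    """Parent-Child: `p` is an ancestor of `c` and `c` is exactly one level deeper."""
    return len(c) == len(p) + 1 and is_ancestor(p, c)


# Hypothetical labels for one path of the tree.
context = (1,)
entity = (1, 2)
predicate = (1, 2, 1)
obj = (1, 2, 1, 1)
```

Because the relationship is decided from the two labels alone, no tree traversal is needed at query time.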

SLIDE 21

Node-Labelled Tree: Example

Figure: Node-labelled tree using Dewey’s encoding

SLIDE 22

Outline: Query Model

Query Model

Operator Overview
Structure Operators

SLIDE 23

Operator Overview

Content Operators

Orthogonal to the structure operators; atomic search element: keyword; Boolean operators (intersection, union, difference), proximity operators (phrase, etc.), ...

Allow composing complex keyword queries to retrieve nodes.

Structure Operators

Atomic search element: node; allow composing path queries to retrieve quads; allow combining quads.

SLIDE 25

Structure Operator: Ancestor-Descendant

Ancestor-Descendant: A//D

A node A is the ancestor of a node D if there exists a path from A to D.

SELECT DISTINCT ?s WHERE { GRAPH <renaud.delbru.fr> { ?s ?p <person> }}

→ renaud.delbru.fr // person

SELECT DISTINCT ?g WHERE { GRAPH ?g { <renaud> ?p <giovanni> }}

→ renaud // giovanni

SLIDE 28

Structure Operator: Parent-Child

Parent-Child: P/C

A node P is the parent of a node C if P is an ancestor of C and C is exactly one level below P.

SELECT DISTINCT ?g ?s WHERE { GRAPH ?g { ?s <knows> <giovanni> . }}

→ knows / giovanni
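To make the two structure operators concrete, each quad can be read as a four-level path (context, subject, predicate, object). The sketch below evaluates `A // D` and `P / C` directly on such paths; it assumes an in-memory list of quads, and the function names and sample data are hypothetical rather than SIREn's actual evaluation strategy.

```python
from typing import List, Tuple

Quad = Tuple[str, str, str, str]  # (context, subject, predicate, object)


def matches_descendant(path: Quad, a: str, d: str) -> bool:
    """A // D: `a` occurs at some level above `d` on the same path."""
    return any(path[i] == a and d in path[i + 1:] for i in range(len(path)))


def matches_child(path: Quad, p: str, c: str) -> bool:
    """P / C: `c` occurs exactly one level below `p`."""
    return any(path[i] == p and path[i + 1] == c for i in range(len(path) - 1))


# Hypothetical data set.
QUADS: List[Quad] = [
    ("renaud.delbru.fr", "renaud", "type", "person"),
    ("renaud.delbru.fr", "renaud", "knows", "giovanni"),
    ("example.org", "michele", "knows", "giovanni"),
]

# renaud.delbru.fr // person : subjects described as a person in that context
subjects = {q[1] for q in QUADS if matches_descendant(q, "renaud.delbru.fr", "person")}

# renaud // giovanni : contexts in which renaud is linked to giovanni
contexts = {q[0] for q in QUADS if matches_descendant(q, "renaud", "giovanni")}

# knows / giovanni : (context, subject) pairs that know giovanni
pairs = sorted((q[0], q[1]) for q in QUADS if matches_child(q, "knows", "giovanni"))
```

The scan over all quads is only for readability; the point of the indexing scheme is to answer the same questions from node labels in the postings lists.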

SLIDE 30

Structure Operator: Set Operators

Set Operators

Union (∪), set-difference (\) and intersection (∩) over sets of nodes.

SELECT DISTINCT ?g ?s WHERE { GRAPH ?g {
  ?s <name> "renaud" .
  { ?s <made> <paper-2> } UNION { ?s <made> <paper-12> }
  OPTIONAL { ?s <made> ?x . FILTER (?x = <paper-1>) }
  FILTER (!bound(?x))
}}

name / "renaud" AND made / paper-2 OR paper-12 NOT made / paper-1
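The combination above maps onto ordinary set algebra once each path query has produced its set of (context, subject) pairs. A small sketch with made-up result sets follows; the grouping `AND (… OR …) NOT …` is taken from the structure of the SPARQL query, and all identifiers are hypothetical.

```python
from typing import Set, Tuple

Entity = Tuple[str, str]  # (context, subject)

# Hypothetical results of the individual path queries.
name_renaud: Set[Entity] = {("g1", "s1"), ("g1", "s2"), ("g2", "s3")}
made_paper_2: Set[Entity] = {("g1", "s1")}
made_paper_12: Set[Entity] = {("g2", "s3"), ("g3", "s4")}
made_paper_1: Set[Entity] = {("g2", "s3")}

# AND -> intersection, OR -> union, NOT -> set difference
result = (name_renaud & (made_paper_2 | made_paper_12)) - made_paper_1
```

Here only `("g1", "s1")` survives: `("g2", "s3")` satisfies the first two conditions but is excluded because it also made paper-1.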

SLIDE 31

Outline: Implementation

Implementation

Inverted Index
Incremental Update of the Inverted Lists
Query Processing

SLIDE 32

Inverted Index

Inverted Index: collection of n inverted lists $I_{t_0}, I_{t_1}, \ldots, I_{t_n}$
$I_t$: list of postings for the term t
Posting: context, entity, predicate, object and position

Example (Inverted Index)

Term   | Postings
name   | [1.1.1], [1.2.1], [3.2.1], ...
renaud | [1.1.1.1], [5.3.2.1], ...
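The example postings can be reproduced with a small in-memory builder. This is a sketch under several assumptions: identifiers are assigned in order of first appearance among siblings, literals are tokenised on whitespace, the position component of a posting is omitted, and `build_index` is a hypothetical name, not SIREn's API.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Quad = Tuple[str, str, str, str]  # (context, subject, predicate, object literal)
Posting = Tuple[int, ...]         # Dewey-style node label


def build_index(quads: List[Quad]) -> Dict[str, List[Posting]]:
    index: Dict[str, List[Posting]] = defaultdict(list)
    ids: Dict[tuple, int] = {}                      # tree path -> id among its siblings
    children: Dict[tuple, int] = defaultdict(int)   # tree path -> children seen so far

    def node_id(parent: tuple, label: str) -> Tuple[int, bool]:
        """Return the sibling id of `label` under `parent` and whether it is new."""
        key = parent + (label,)
        is_new = key not in ids
        if is_new:
            children[parent] += 1
            ids[key] = children[parent]
        return ids[key], is_new

    for c, s, p, o in quads:
        cid, _ = node_id((), c)
        sid, _ = node_id((c,), s)
        pid, new_pred = node_id((c, s), p)
        oid, new_obj = node_id((c, s, p), o)
        if new_pred:
            # predicate terms are indexed with a three-level label
            index[p].append((cid, sid, pid))
        if new_obj:
            # object terms are indexed with a four-level label
            for token in dict.fromkeys(o.lower().split()):
                index[token].append((cid, sid, pid, oid))
    return dict(index)


# Hypothetical input.
quads = [
    ("ctx1", "renaud", "name", "Renaud Delbru"),
    ("ctx1", "renaud", "knows", "giovanni"),
    ("ctx1", "giovanni", "name", "Giovanni Tummarello"),
]
index = build_index(quads)
```

With this input, `name` receives the postings `(1, 1, 1)` and `(1, 2, 1)` and `renaud` receives `(1, 1, 1, 1)`, mirroring the shape of the table. Lists stay sorted only if quads arrive grouped by context and subject, which the sketch assumes.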

SLIDE 34

Incremental Update

Insertion of one quad

1. Access the postings lists of the terms p and o;
2. Append a posting to each list.

Complexity of Insertion

O(log(n) + 1)
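The two terms of the bound correspond to a logarithmic dictionary lookup followed by a constant-time append. The sketch below uses a sorted Python list with `bisect` as a stand-in for the term dictionary; the class and function names are hypothetical, and adding a brand-new term to a Python list is linear, whereas an on-disk structure such as a B-tree would presumably keep that step logarithmic.

```python
import bisect
from typing import List, Tuple

Posting = Tuple[int, ...]


class TermDictionary:
    """Sorted term dictionary with one postings list per term (illustrative only)."""

    def __init__(self) -> None:
        self.terms: List[str] = []               # kept sorted
        self.postings: List[List[Posting]] = []  # postings[i] belongs to terms[i]

    def lookup(self, term: str) -> List[Posting]:
        i = bisect.bisect_left(self.terms, term)  # binary search: O(log n)
        if i == len(self.terms) or self.terms[i] != term:
            # unseen term: create an empty postings list at the right place
            self.terms.insert(i, term)
            self.postings.insert(i, [])
        return self.postings[i]


def insert_quad(d: TermDictionary, p_term: str, o_term: str,
                p_posting: Posting, o_posting: Posting) -> None:
    """Step 1: access the lists of p and o. Step 2: append one posting to each."""
    d.lookup(p_term).append(p_posting)  # append: O(1)
    d.lookup(o_term).append(o_posting)


d = TermDictionary()
insert_quad(d, "name", "renaud", (1, 1, 1), (1, 1, 1, 1))
insert_quad(d, "name", "delbru", (1, 2, 1), (1, 2, 1, 1))
```

No existing posting is rewritten or moved during an insertion, which is what keeps the update cost independent of the size of the lists.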

SLIDE 38

Query Processing

Intersection

Quad pattern search; Combination of quad patterns.

Intersection of Posting Lists

1. Retrieve the postings list of each term;
2. Walk through the lists simultaneously. At each step, we
   2.1 compare the context and subject identifiers;
   2.2 compare the predicate and object identifiers. If they are the same, we put the pair (cid, sid) in the result list and advance the pointers to their next position in each postings list.

Complexity

Worst case: linear in the total number of posting entries
Average case: sub-linear time with an internal index
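The walk described above can be sketched as a merge over two sorted postings lists, here for a quad pattern with a fixed predicate term and object term. The sketch assumes predicate postings of the form (cid, sid, pid) and object postings of the form (cid, sid, pid, oid); the function name is hypothetical, and the skip structure behind the sub-linear average case is not shown.

```python
from typing import List, Tuple

Posting = Tuple[int, ...]


def intersect(pred_postings: List[Posting],
              obj_postings: List[Posting]) -> List[Tuple[int, int]]:
    """Return the (cid, sid) pairs where the object node lies under the predicate node."""
    results: List[Tuple[int, int]] = []
    i = j = 0
    while i < len(pred_postings) and j < len(obj_postings):
        p, o = pred_postings[i], obj_postings[j]
        # first compare the context and subject identifiers
        if p[:2] < o[:2]:
            i += 1
        elif p[:2] > o[:2]:
            j += 1
        # same entity: compare the predicate identifiers
        elif p[2] < o[2]:
            i += 1
        elif p[2] > o[2]:
            j += 1
        else:
            pair = (p[0], p[1])
            if not results or results[-1] != pair:
                results.append(pair)
            i += 1
            j += 1
    return results


# Postings taken from the inverted index example ("name" and "renaud").
name_postings = [(1, 1, 1), (1, 2, 1), (3, 2, 1)]
renaud_postings = [(1, 1, 1, 1), (5, 3, 2, 1)]
hits = intersect(name_postings, renaud_postings)
```

Every iteration advances at least one pointer, so the loop runs at most as many times as there are postings in both lists, which matches the linear worst case.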

SLIDE 44

Outline: Comparison

Comparison

Entity Retrieval: Candidate Systems
Field-based indexing scheme
Comparison Table

SLIDE 45

Entity Retrieval: Candidate Systems

Node-labelled indexing scheme (SIREn)
Field-based indexing scheme (Lucene)
Quad table indexing scheme (RDF Database)
Hybrid between field-based and quad table (Semplore)

SLIDE 46

Field-based indexing scheme

Example (Field-Based Inverted Index)

Term        | Postings
name:renaud | [1], [2], [3], ...
name:delbru | [1], [5], ...
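In a field-based scheme the dictionary key is the pair field:term and a posting is just an entity identifier. The following sketch, with a hypothetical `build_field_index` function and made-up entities, shows two consequences under that simplified model: the same token yields one dictionary entry per field, and a conjunctive query over a multi-valued field can match across different values.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Entities = Dict[int, List[Tuple[str, str]]]  # entity id -> [(field, value), ...]


def build_field_index(entities: Entities) -> Dict[str, List[int]]:
    index: Dict[str, List[int]] = defaultdict(list)
    for eid, fields in entities.items():
        for field, value in fields:
            for token in value.lower().split():
                key = f"{field}:{token}"  # field name is folded into the term
                if not index[key] or index[key][-1] != eid:
                    index[key].append(eid)
    return dict(index)


# Hypothetical entities; entity 3 has two distinct name values.
entities: Entities = {
    1: [("name", "Renaud Delbru")],
    3: [("name", "Renaud Martin"), ("name", "Anne Delbru"), ("label", "Renaud")],
    5: [("name", "Delbru")],
}
field_index = build_field_index(entities)

# Conjunctive query name:renaud AND name:delbru
both = sorted(set(field_index["name:renaud"]) & set(field_index["name:delbru"]))
```

Entity 3 appears in `both` although none of its name values contains the two tokens together, because the posting no longer records which value a token came from. This is one way to read the multi-valued and false-positive rows of the comparison table for the field index.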

SLIDE 47

Comparison among Candidates

Criteria              | Node Index | Field Index       | Quad Table         | Semplore
Dictionary Lookup     | O(log(n))  | O(log(n ∗ m))     | O(log(n))          | O(log(n))
Quad Lookup           | O(log(n))  | O(log(n ∗ m))     | O(log(n) + log(k)) | O(log(n))
Join in Quad Lookup   | Yes        | No                | No                 | No
Star Query Evaluation | Sub-Linear | Sub-Linear        | O(n)               | O(n ∗ log(n))
Update Cost           | O(log(n))  | O(log(n ∗ m))     | O(log(n) + log(k)) | O(log(n) + log(l))
Multiple Indices      | No         | No                | Yes                | Yes
Query Expressiveness  | Star       | Star              | Graph              | Tree
Full-Text             | Yes        | Yes (on literals) | No                 | Yes (on literals)
Multi-Valued Support  | Yes        | No                | Yes                | No
False positive        | No         | Yes               | No                 | Yes

SLIDE 57

Outline: Experimental Results

Experimental Results

Experimental Design
Index Size
Insertion Time
Query Execution
Scalability Test

SLIDE 58

Experimental Design

Datasets

Barton: 50M triples (approximately 6GB)
Real-World: random sampling of Sindice, 10M triples (approximately 2GB)

Candidate Systems

SIREn
OpenRDF Sesame
RDF-3X

Comparison Criteria

Index size
Incremental index creation time
Query processing time

SLIDE 59

Index Size and Indexing Time

(a) Index size in MB per dataset and system

           | SIREn | Sesame | RDF-3X
Barton     | 789   | 3400   | 3076
Real-World | 296   | 799    | 1138

(b) Indexing time in minutes per dataset and system

           | SIREn10 | SIREn100 | Sesame | RDF-3X
Barton     | 3       | 1.5      | 266    | 11
Real-World | 1       | 0.5      | 47     | 3.6

Table: Report on index size and indexing time
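The relative index sizes follow directly from table (a). A short computation, using only the reported figures (the function name `size_ratios` is a hypothetical helper), makes the comparison explicit:

```python
from typing import Dict

# Index sizes in MB, copied from table (a).
index_size_mb: Dict[str, Dict[str, int]] = {
    "Barton": {"SIREn": 789, "Sesame": 3400, "RDF-3X": 3076},
    "Real-World": {"SIREn": 296, "Sesame": 799, "RDF-3X": 1138},
}


def size_ratios(sizes: Dict[str, Dict[str, int]],
                baseline: str = "SIREn") -> Dict[str, Dict[str, float]]:
    """Size of each system's index relative to the baseline, per dataset."""
    return {
        dataset: {
            system: round(size / row[baseline], 2)
            for system, size in row.items()
            if system != baseline
        }
        for dataset, row in sizes.items()
    }


ratios = size_ratios(index_size_mb)
```

On these figures the other two indexes are roughly 2.7 to 4.3 times larger than the SIREn index.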

SLIDE 60

Insertion Time

(a) Plot showing the commit time every 10,000 triples during the index creation on Barton
(b) Plot showing the commit time every 500,000 triples during the index creation over one billion triples

Figure: Dark dots are Sesame commit time records while gray dots are SIREn commit time records

SLIDE 61

Query Execution: Benchmark Design

Setup

2 Quad-core, 8GB, SATA disk
Cold cache
50 measurements per query
Arithmetic mean reported

Query Types

A*: Term lookup
B*: Triple pattern lookup
C*, D*, E*: Intersection, union, exclusion

SLIDE 62

Query Execution: Time

(a) Barton dataset

       | A1    | A2   | B1   | C1   | C2   | D1   | D2   | E
RDF-3X | 16.12 | 0.12 | 1.38 | 1.16 | 0.38 | 0.23 | 0.14 | X
SIREn  | 2.79  | 0.02 | 1.33 | 1.71 | 0.95 | 0.36 | 0.03 | 0.96

(b) Real-World dataset

       | A1   | A2   | B1   | B2   | C1   | C2   | D1   | E
RDF-3X | 0.29 | 0.12 | 0.17 | 0.18 | 0.21 | 0.13 | 0.28 | X
SIREn  | 0.23 | 0.03 | 0.04 | 0.05 | 0.09 | 0.08 | 0.16 | 0.53

Table: Querying time in seconds

SLIDE 63

Query Execution: Scalability Test

Dataset

Billion Triple Challenge dataset replicated 10 times
1 Terabyte of data

(a) 10 Billion Triples dataset

         | Q1   | Q2   | Q3   | Q4  | Q5  | Q6   | Q7
Time (s) | 0.75 | 1.3  | 1.4  | 0.5 | 1.5 | 1.6  | 4
Hits     | 7552 | 9344 | 3.5M | 57K | 448 | 8.2M | 20.7M

Table: Querying time in seconds

SLIDE 64

Conclusion

SIREn (Semantic Information Retrieval Engine)

Open source implementation: http://siren.sindice.com
At the core of the Sindice search engine:

100M semi-structured documents (6.5 billion triples)
On a single index (quad-core, 8GB)
Replicated: one master for updates, one slave for queries

About ranking

See “Hierarchical Link Analysis for Web Entity Retrieval”

Future Work

Query expressiveness: investigate entity path queries
Index optimisation: inverted list compression
