A Main Memory Index Structure to Query Linked Data (Olaf Hartig)



SLIDE 1

A Main Memory Index Structure to Query Linked Data

Olaf Hartig

http://olafhartig.de/foaf.rdf#olaf @olafhartig

Frank Huber

Database and Information Systems Research Group Humboldt-Universität zu Berlin

SLIDE 2

Olaf Hartig - A Main Memory Index Structure to Query Linked Data 2

The Issue

[Figure: plots of hit rate, query execution time (in seconds), and number of query results for queries No. 36 to No. 40 (ContactInfoPhillipe, UnsetPropsPhillipe, 2ndDegree1Phillipe, 2ndDegree2Phillipe, IncomingPhillipe); legend: "no reuse" and "given order".]
SLIDE 3

The Issue

[Figure: the plots from the previous slide (hit rate, query execution time in seconds, and number of query results for queries No. 36 to No. 40), now annotated.]

Descriptor objects in the query-local dataset after query execution: 172 533
SLIDE 4

What data structure do we use to physically represent the query-local dataset?

[Diagram: the logical representation of Linked Data from the Web (the query-local dataset) next to its physical representation, which is marked with a "?".]

SLIDE 5

Outline

  • 1. Requirements + Existing Work
  • 2. Data Structures
  • 3. Evaluation
SLIDE 6

Requirements

  • (Consecutively) build and use ad hoc collections of many small sets of RDF triples
  • Four main operations:
      • Find: matching triples for a triple pattern in all descriptor objects
      • Add, Remove, Replace: descriptor objects
  • Support of concurrent access (i.e. isolation)
  • Non-relevant properties:
      • Querying descriptor objects individually is not necessary
      • No need to write data back to the Web
      • ACID properties not required for complete queries
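
The four operations can be pinned down with a deliberately naive, unindexed reference model, which is handy as a baseline when checking indexed variants. This is only a sketch: the class name `NaiveQueryLocalDataset`, the use of `None` as a wildcard in triple patterns, and string IDs for descriptor objects are assumptions made for illustration. Concurrent access is not handled here.

```python
from typing import Dict, Optional, Set, Tuple

Triple = Tuple[str, str, str]
# None stands for a variable in a triple pattern (an assumption of this sketch).
Pattern = Tuple[Optional[str], Optional[str], Optional[str]]


class NaiveQueryLocalDataset:
    """Unindexed baseline: one plain set of triples per descriptor object."""

    def __init__(self) -> None:
        self._objects: Dict[str, Set[Triple]] = {}

    def add(self, obj_id: str, triples: Set[Triple]) -> None:
        self._objects[obj_id] = set(triples)

    def remove(self, obj_id: str) -> None:
        self._objects.pop(obj_id, None)

    def replace(self, obj_id: str, triples: Set[Triple]) -> None:
        # Replace = drop the old descriptor object, then add the new one.
        self.remove(obj_id)
        self.add(obj_id, triples)

    def find(self, pattern: Pattern) -> Set[Triple]:
        # Find scans all descriptor objects and returns the matching triples.
        return {
            t
            for triples in self._objects.values()
            for t in triples
            if all(p is None or p == v for p, v in zip(pattern, t))
        }
```

Every `find` call here is a full scan over all descriptor objects, which is exactly the cost the index structures on the following slides try to avoid.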
SLIDE 8

Existing Work

  • Disk based storage solutions for RDF data
      • Unsuitable due to very costly I/O operations
  • Main memory based data structures in the literature
      • Focus on a large, single set of RDF triples
      • Optimized for complete graph pattern queries or path queries
  • Main memory based data structures in RDF frameworks
      • Focus on Jena, ARQ and NG4J
      • Inefficient (see evaluation)
SLIDE 9

Outline

  • 1. Requirements + Existing Work
  • 2. Data Structures
  • 3. Evaluation
SLIDE 12

Hash-Based Index for RDF Data

[Diagram: logical representation next to the physical representation, which consists of a dictionary (Dict) and six hash tables (S, P, O, SP, SO, PO).]

  • Dictionary:
      • Two-way mapping between RDF terms and numerical identifiers
  • 6 hash tables:
      • Each hash table contains all ID-encoded triples
  • Efficient support for all types of triple patterns

ID-encoded triple: tid = ( id_s, id_p, id_o )

Find example: a triple pattern over knows, http://bob.name, and the variable ?acq

*Similar to Harth and Decker, 2005
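
The structure described above (a dictionary plus six hash tables that each hold every ID-encoded triple) can be sketched as follows. This is an illustrative sketch, not the SQUIN implementation: the class names `Dictionary` and `HashIndex` are made up, Python `dict` objects stand in for the fixed-size hash tables, and `None` marks a variable in a triple pattern.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Set, Tuple

Pattern = Tuple[Optional[str], Optional[str], Optional[str]]  # None = variable


class Dictionary:
    """Two-way mapping between RDF terms and numerical identifiers."""

    def __init__(self) -> None:
        self._ids: Dict[str, int] = {}
        self._terms: List[str] = []

    def encode(self, term: str) -> int:
        if term not in self._ids:
            self._ids[term] = len(self._terms)
            self._terms.append(term)
        return self._ids[term]

    def lookup(self, term: str) -> Optional[int]:
        return self._ids.get(term)

    def decode(self, term_id: int) -> str:
        return self._terms[term_id]


class HashIndex:
    # Triple positions that form the key of each of the six tables.
    KEYS = {"S": (0,), "P": (1,), "O": (2,), "SP": (0, 1), "SO": (0, 2), "PO": (1, 2)}

    def __init__(self, dictionary: Dictionary) -> None:
        self.dict = dictionary
        self.tables = {name: defaultdict(set) for name in self.KEYS}

    def insert(self, s: str, p: str, o: str) -> None:
        tid = (self.dict.encode(s), self.dict.encode(p), self.dict.encode(o))
        for name, pos in self.KEYS.items():  # every table holds every triple
            self.tables[name][tuple(tid[i] for i in pos)].add(tid)

    def find(self, pattern: Pattern) -> Set[Tuple[str, str, str]]:
        bound = [i for i, term in enumerate(pattern) if term is not None]
        ids = {}
        for i in bound:
            term_id = self.dict.lookup(pattern[i])
            if term_id is None:  # unknown term: nothing can match
                return set()
            ids[i] = term_id
        if not bound:  # (?s, ?p, ?o): any table contains all triples
            candidates = set().union(*self.tables["S"].values())
        else:
            pos = tuple(bound[:2])  # a table key has at most two components
            name = "".join("SPO"[i] for i in pos)
            candidates = self.tables[name].get(tuple(ids[i] for i in pos), set())
        return {
            tuple(self.dict.decode(x) for x in tid)
            for tid in candidates
            if all(tid[i] == ids[i] for i in bound)
        }
```

A pattern with subject and predicate bound is answered from the SP table alone; a fully bound pattern uses the SP table and then filters on the object.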

SLIDE 14

Individual Indexing

[Diagram: in the physical representation, each descriptor object of the query-local dataset has its own set of six hash tables (S, P, O, SP, SO, PO), alongside a single dictionary (Dict).]

  • Idea: Index each descriptor object separately
  • Implementation of the four operations:
      • Add, Remove, and Replace are straightforward
      • Find requires iterating over all indexes

Find example: a triple pattern over knows, http://bob.name, and the variable ?acq
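
A minimal sketch of individual indexing, under stated assumptions: `TripleIndex` is a small stand-in for the six-table index of a single descriptor object, dictionary encoding is omitted for brevity, and `None` marks a variable. The names are hypothetical. The point is the shape of the operations: Add, Remove, and Replace touch exactly one index, while Find has to visit every index.

```python
from collections import defaultdict
from itertools import combinations


class TripleIndex:
    """Stand-in for the six hash tables (S, P, O, SP, SO, PO) of one descriptor object."""

    def __init__(self, triples):
        self.all = set(triples)
        self.tables = defaultdict(set)  # (key positions, key values) -> triples
        for t in self.all:
            for n in (1, 2):
                for pos in combinations(range(3), n):
                    self.tables[(pos, tuple(t[i] for i in pos))].add(t)

    def find(self, pattern):
        bound = tuple(i for i, v in enumerate(pattern) if v is not None)
        if not bound:
            return set(self.all)
        pos = bound[:2]
        hits = self.tables.get((pos, tuple(pattern[i] for i in pos)), set())
        return {t for t in hits if all(t[i] == pattern[i] for i in bound)}


class IndividuallyIndexedDataset:
    def __init__(self):
        self.indexes = {}  # descriptor object ID -> its own TripleIndex

    def add(self, obj_id, triples):
        self.indexes[obj_id] = TripleIndex(triples)

    def remove(self, obj_id):
        self.indexes.pop(obj_id, None)

    def replace(self, obj_id, triples):
        # Build the new index first, then swap it in with one assignment.
        self.indexes[obj_id] = TripleIndex(triples)

    def find(self, pattern):
        result = set()
        for index in self.indexes.values():  # one lookup per descriptor object
            result |= index.find(pattern)
        return result
```

With thousands of descriptor objects, each Find performs thousands of small lookups, which is the cost the combined and quad variants avoid.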

SLIDE 17

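Combined Indexing

[Diagram: in the physical representation, one set of six hash tables (S, P, O, SP, SO, PO) and a dictionary (Dict) cover the whole query-local dataset.]

  • Idea: Use a single index for all descriptor objects
  • src: maps each triple to a set of descriptor object IDs

Index entry: tid = ( id_s, id_p, id_o ) + src( tid ) = { IDs of the descriptor objects that contain the triple }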
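
The role of src can be illustrated with a small sketch. It assumes that a triple is stored once no matter how many descriptor objects contain it, and that it leaves the index when its last source is removed. The class name `CombinedIndex`, the scan used in `remove`, and the omission of dictionary encoding and of concurrency handling are simplifications of this sketch, not details taken from the slides.

```python
from collections import defaultdict
from itertools import combinations

# Key positions of the six tables: S, P, O, SP, SO, PO.
KEY_POSITIONS = [pos for n in (1, 2) for pos in combinations(range(3), n)]


class CombinedIndex:
    def __init__(self):
        self.tables = defaultdict(set)  # (key positions, key values) -> triples
        self.src = defaultdict(set)     # triple -> IDs of descriptor objects

    def add(self, obj_id, triples):
        for t in triples:
            if t not in self.src:  # first source: put the triple into the tables
                for pos in KEY_POSITIONS:
                    self.tables[(pos, tuple(t[i] for i in pos))].add(t)
            self.src[t].add(obj_id)

    def remove(self, obj_id):
        for t in [t for t, ids in self.src.items() if obj_id in ids]:
            self.src[t].discard(obj_id)
            if not self.src[t]:  # no source left: drop the triple entirely
                del self.src[t]
                for pos in KEY_POSITIONS:
                    self.tables[(pos, tuple(t[i] for i in pos))].discard(t)

    def replace(self, obj_id, triples):
        self.remove(obj_id)
        self.add(obj_id, triples)

    def find(self, pattern):
        bound = tuple(i for i, v in enumerate(pattern) if v is not None)
        if not bound:
            return set(self.src)
        pos = bound[:2]
        hits = self.tables.get((pos, tuple(pattern[i] for i in pos)), set())
        return {t for t in hits if all(t[i] == pattern[i] for i in bound)}
```

Find now needs a single table lookup regardless of the number of descriptor objects; the price is the src bookkeeping on every Add and Remove.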

SLIDE 18

Quad Indexing

[Diagram: in the physical representation, one set of six hash tables (S, P, O, SP, SO, PO) and a dictionary (Dict) cover the whole query-local dataset.]

  • Idea: Use a single quad index for all descriptor objects
  • quad = ID-encoded triple + descriptor object ID

Index entry: q = ( (id_s, id_p, id_o), descriptor object ID )
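
For comparison, a quad-based variant might look like the sketch below. Each entry pairs a triple with the ID of one descriptor object, so no separate src map is needed and Add never has to check for an existing entry. As before, `QuadIndex` and its details (no dictionary encoding, a scan in `remove`, `None` as variable) are assumptions of this sketch rather than the presented implementation.

```python
from collections import defaultdict
from itertools import combinations

# Key positions of the six tables: S, P, O, SP, SO, PO.
KEY_POSITIONS = [pos for n in (1, 2) for pos in combinations(range(3), n)]


class QuadIndex:
    def __init__(self):
        self.tables = defaultdict(set)  # (key positions, key values) -> quads
        self.quads = set()              # all quads: (triple, descriptor object ID)

    def add(self, obj_id, triples):
        for t in triples:
            q = (t, obj_id)
            self.quads.add(q)
            for pos in KEY_POSITIONS:
                self.tables[(pos, tuple(t[i] for i in pos))].add(q)

    def remove(self, obj_id):
        for q in [q for q in self.quads if q[1] == obj_id]:
            self.quads.discard(q)
            t = q[0]
            for pos in KEY_POSITIONS:
                self.tables[(pos, tuple(t[i] for i in pos))].discard(q)

    def replace(self, obj_id, triples):
        self.remove(obj_id)
        self.add(obj_id, triples)

    def find(self, pattern):
        bound = tuple(i for i, v in enumerate(pattern) if v is not None)
        if bound:
            pos = bound[:2]
            hits = self.tables.get((pos, tuple(pattern[i] for i in pos)), set())
        else:
            hits = self.quads
        # The same triple may occur in several quads; the result set deduplicates it.
        return {t for (t, _obj) in hits if all(t[i] == pattern[i] for i in bound)}
```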

SLIDE 19

Outline

  • 1. Requirements + Existing Work
  • 2. Data Structures
  • 3. Evaluation
SLIDE 21

Experiment Setup

Does this affect the overall execution time of link traversal based query executions?

  • Simulation of the Web of Data
      • Linked Data server publishes BSBM dataset (scaling factor: 50)
      • Adjusted BSBM queries link to the simulation server
  • Experiment:
      • Sequence of 200 query mixes
      • Reuse of the query-local dataset for the whole sequence
      • IndIR, CombIR, and QuadIR (as presented), engine: SQUIN
      • NamedGraphSetImpl (NG4J/Jena), engine: SemWeb Client
SLIDE 22

Execution Time

[Figure: execution time in seconds per query mix (x-axis up to 200 query mixes) for NG4J (SWClLib), IndIR (m=4), CombIR (m=12), and CombQuadIR (m=12); a second plot shows the overall number of descriptor objects in the queried dataset per query mix.]
SLIDE 24

Summary

  • Three hash index based data structures:
      • Individual indexing
      • Combined indexing
      • Quad indexing
  • Findings:
      • A single index improves query performance significantly
      • Smaller load times with quads
  • Also suitable for other use cases of ad hoc storing of Linked Data that is:
      • Consecutively retrieved from remote sources
      • Used for immediate local processing
SLIDE 25

Olaf Hartig - How Caching Improves Efficiency and Result Completeness for Querying Linked Data 25

Backup Slides

SLIDE 26

Combined Indexing

[Diagram: in the physical representation, one set of six hash tables (S, P, O, SP, SO, PO) and a dictionary (Dict) cover the whole query-local dataset.]

  • Idea: Use a single index for all descriptor objects
  • src: maps each triple to a set of descriptor object IDs
  • Implementation of the four operations requires:
      • status: maps each descriptor object ID to a status (BeingIndexed, Indexed, ToBeRemoved, BeingRemoved)
      • Find reports a triple t only if: ∃d ∈ src(t) : status(d) = Indexed
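
The visibility rule for Find can be written down directly. In this sketch, `src` and `status` are plain dictionaries and the function names `is_reported` and `find_visible` are hypothetical; only the four status values and the rule itself come from the slide.

```python
from enum import Enum


class Status(Enum):
    BEING_INDEXED = 1
    INDEXED = 2
    TO_BE_REMOVED = 3
    BEING_REMOVED = 4


def is_reported(triple, src, status):
    """True iff there is a d in src(triple) with status(d) = Indexed."""
    return any(status.get(d) is Status.INDEXED for d in src.get(triple, ()))


def find_visible(candidates, src, status):
    """Filter index hits down to the triples that Find may report."""
    return {t for t in candidates if is_reported(t, src, status)}
```

A triple that is already in the hash tables stays invisible while all of its descriptor objects are still being indexed or are on their way out, which is one way to give concurrent queries an isolated view without locking the whole index.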

SLIDE 27

Implementation

  • Available as Free Software (http://squin.org)
  • Hash tables with n = 2^m buckets
  • Hash functions:
      • h_S( i_s, i_p, i_o ) = i_s & bitmask[m]
      • h_SP( i_s, i_p, i_o ) = ( i_s • i_p ) & bitmask[m]
      • etc.
  • Which m? (see paper)
      • m = 4 for the individual indexes
      • m = 12 for the combined index
      • m = 12 for the quad index
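
A small sketch of how such hash functions behave. Two points are assumptions, since the slide does not spell them out: `bitmask(m)` is taken to be the value with the m lowest bits set, and the operator "•" in the SP hash function is read as integer multiplication.

```python
def num_buckets(m: int) -> int:
    """A table with parameter m has n = 2^m buckets."""
    return 1 << m


def bitmask(m: int) -> int:
    """Assumed meaning of bitmask[m]: the m lowest bits set."""
    return (1 << m) - 1


def h_s(i_s: int, i_p: int, i_o: int, m: int) -> int:
    # Bucket number for the S table: the m lowest bits of the subject ID.
    return i_s & bitmask(m)


def h_sp(i_s: int, i_p: int, i_o: int, m: int) -> int:
    # Bucket number for the SP table; "•" is read as multiplication (assumption).
    return (i_s * i_p) & bitmask(m)
```

With m = 4 a table has 16 buckets, and with m = 12 it has 4096. For example, `h_s(37, 5, 9, 4)` is `37 & 15`, which is 5.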
SLIDE 28

Experiment Setup

  • Comparison without link traversal based query execution
  • Compared data structures:
      • Our implementation of IndIR, CombIR, CombQuadIR
      • NamedGraphSetImpl in NG4J (Jena)
      • DatasetGraphMap in ARQ (Jena)
  • Berlin SPARQL Benchmark (BSBM)
      • BSBM datasets partitioned into query-local datasets
      • BSBM (v2.0) query mixes executed over these datasets
SLIDE 29

Required Memory

BSBM scaling factor    Number of descriptor objects    Overall number of triples
 50                     2,599                           22,616
100                     4,178                           40,133
150                     5,756                           57,524
200                     7,329                           75,062
250                     9,873                           97,613
300                    11,455                          115,217
350                    13,954                          137,567
500                    18,687                          190,502

[Figure: estimated required memory in MB over pc (BSBM scaling factor) for ARQ, NG4J, IndIR (m=4), CombIR (m=12), and CombQuad (m=12).]

SLIDE 30

Execution Time

[Table: BSBM scaling factor, number of descriptor objects, and overall number of triples; same values as on the Required Memory slide.]

[Figure: average execution time per query mix in seconds over pc (BSBM scaling factor) for ARQ, NG4J, IndIR (m=4), CombIR (m=12), and CombQuad (m=12).]

SLIDE 31

Load Time

[Table: BSBM scaling factor, number of descriptor objects, and overall number of triples; same values as on the Required Memory slide.]

[Figure: average creation time in seconds over pc (BSBM scaling factor) for CombIR (m=12), IndIR (m=4), and CombQuad (m=12).]

SLIDE 32

Execution Time

[Figure: execution time in seconds per query mix (x-axis up to 200 query mixes) for NG4J (SWClLib), IndIR (m=4), CombIR (m=12), and CombQuadIR (m=12); a second plot shows only CombIR (m=12) and CombQuadIR (m=12).]

SLIDE 33

These slides have been created by Olaf Hartig (http://olafhartig.de).

This work is licensed under a Creative Commons Attribution-Share Alike 3.0 License (http://creativecommons.org/licenses/by-sa/3.0/).