Building a High Performance Environment for RDF Publishing
Pascal Christoph
These slides and all the graphics made by the author and those taken from https://openclipart.org/ are dedicated to the public domain (https://creativecommons.org/about/cc0). All marks mentioned may be trademarks or registered trademarks
Read about the license of "The Scream" by Edvard Munch at https://en.wikipedia.org/wiki/File:The_Scream.jpg
Linking Open Data cloud diagram by Richard Cyganiak and Anja Jentzsch: http://lod-cloud.net/
Overview
Publishing is for Consuming
Story so far - experiences with lobid.org
Publishing RDF through elasticsearch
Future prospects
Publishing is for Consuming
Mandatory
A resource gets a dereferenceable URI, which provides RDF:
<http://lobid.org/resource/HT002948556> <http://purl.org/dc/terms/title> "With reference to reference" .
<http://lobid.org/resource/HT002948556> <http://purl.org/dc/terms/issued> "1983" .
<http://lobid.org/resource/HT002948556> <http://purl.org/ontology/bibo/isbn13> "9780915145539" .
<http://lobid.org/resource/HT002948556> <http://purl.org/dc/elements/1.1/creator> <http://d-nb.info/gnd/135539897> .
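To make the "resource, URI, RDF" idea concrete, here is a minimal sketch of how a consumer might read the triples shown above. It is not part of the lobid code base; the regular expression and the function name `parse_ntriples` are illustrative assumptions, and the parser only handles IRIs and plain string literals as they appear in this example.

```python
import re

# Matches one simple N-Triples statement:
#   <subject> <predicate> (<object IRI> | "plain literal") .
# Assumption: no blank nodes, language tags, datatypes or escaped quotes,
# which is enough for the example triples from the slide.
TRIPLE = re.compile(r'<([^>]*)>\s*<([^>]*)>\s*(?:<([^>]*)>|"([^"]*)")\s*\.')


def parse_ntriples(text):
    """Return a list of (subject, predicate, object) string tuples."""
    triples = []
    for match in TRIPLE.finditer(text):
        subject, predicate, iri, literal = match.groups()
        triples.append((subject, predicate, iri if iri is not None else literal))
    return triples


SAMPLE = '''
<http://lobid.org/resource/HT002948556> <http://purl.org/dc/terms/title> "With reference to reference" .
<http://lobid.org/resource/HT002948556> <http://purl.org/dc/terms/issued> "1983" .
<http://lobid.org/resource/HT002948556> <http://purl.org/ontology/bibo/isbn13> "9780915145539" .
<http://lobid.org/resource/HT002948556> <http://purl.org/dc/elements/1.1/creator> <http://d-nb.info/gnd/135539897> .
'''

for s, p, o in parse_ntriples(SAMPLE):
    print(p, "->", o)
```

A production consumer would more likely use a full RDF library; the point here is only that the published data is easy to read.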
=> basic LOD publishing is very simple:
Nice to have
SPARQL Endpoint
In principle, web developers already have simple APIs: the mandatory part, where a resource gets a dereferenceable URI that provides the data (in RDF).
In principle, web developers already have powerful APIs:
RESTful SPARQL example, getting all data of all resources having a particular ISBN:
curl -H "Accept: application/json" --data-urlencode 'query=
prefix bibo: <http://purl.org/ontology/bibo/>
SELECT * WHERE {
  ?s bibo:isbn13 "9780851706238" ;
     ?p ?o .
} LIMIT 100
' http://lobid.org/sparql/
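The same request can be sketched in Python with the standard library only. This is an illustration, not the author's tooling: the function name `build_sparql_request` is made up, and the endpoint URL is taken from the slide and may no longer be live, so the sketch only builds the request and leaves sending it as a comment.

```python
import urllib.parse
import urllib.request

ENDPOINT = "http://lobid.org/sparql/"  # from the slide; availability is an assumption

QUERY = """
prefix bibo: <http://purl.org/ontology/bibo/>
SELECT * WHERE {
  ?s bibo:isbn13 "9780851706238" ;
     ?p ?o .
} LIMIT 100
"""


def build_sparql_request(endpoint, query):
    """Build a form-encoded POST request, like curl --data-urlencode does."""
    body = urllib.parse.urlencode({"query": query}).encode("ascii")
    return urllib.request.Request(
        endpoint, data=body, headers={"Accept": "application/json"}
    )


request = build_sparql_request(ENDPOINT, QUERY)
print(request.get_method(), request.full_url)
# To actually send it (network access required):
#   response_body = urllib.request.urlopen(request).read()
```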
… and the JSON result:
{
  "head": { "vars": [ "s", "p", "o" ] },
  "results": {
    "bindings": [
      {
        "o": { "type": "uri", "value": "http://openlibrary.org/works/OL2109573W" },
        "p": { "type": "uri", "value": "http://rdvocab.info/RDARelationshipsWEMI/workManifested" },
        "s": { "type": "uri", "value": "http://lobid.org/resource/HT007824357" }
      },
      { "o": { ...
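A consumer still has to unpack this result format before using it. The following sketch, with the hypothetical helper `rows`, flattens the bindings into plain tuples; it uses only the first binding from the slide, since the rest of the result is elided there.

```python
import json

# Only the first binding from the slide; the remaining ones are elided there.
RESULT = json.loads("""
{ "head": { "vars": ["s", "p", "o"] },
  "results": { "bindings": [
    { "o": { "type": "uri", "value": "http://openlibrary.org/works/OL2109573W" },
      "p": { "type": "uri", "value": "http://rdvocab.info/RDARelationshipsWEMI/workManifested" },
      "s": { "type": "uri", "value": "http://lobid.org/resource/HT007824357" } }
  ] } }
""")


def rows(result):
    """Yield one tuple of plain values per binding, in the order of head.vars."""
    variables = result["head"]["vars"]
    for binding in result["results"]["bindings"]:
        # A variable may be unbound in a row, hence the .get() calls.
        yield tuple(binding.get(name, {}).get("value") for name in variables)


for row in rows(RESULT):
    print(row)
```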
As it is, web developers don't like SPARQL.
[Image: web developer]
Web developers want APIs like:
[Image: happy web developer]
What is lobid.org?
What's missing?
Storing the data
2010 - 2011, lobid-organisation
Filesystem:
+ easy to maintain
+ reliable
+ fast
lobid today
Triple store (4store):
+ power of SPARQL
+/- depending on the query: fast to horribly slow
+/- search (but string searches are often slow and limited)
lobid today
Search engine (elasticsearch):
+ fast search
+ stemming, linguistics …
+ wildcard searching
+ facets
+ geo search
+ JSON
+ schema-less
+ simple RESTful API
+ many plugins
+ ...
+ easy to achieve High Availability
+ scales nicely
Storing / getting the data: lobid today
Getting the data
lobid: technology/dependency stack
sometimes gets stuck!
highly available! we can do that
Storing / getting the data
Variant 1: technology/dependency stack
For external access. Sometimes gets stuck!
Closed, internal. Will be safe from malicious queries.
redundant, complex …
Publishing LOD with elasticsearch
Variant 2: technology/dependency stack
highly available! we can do that
For external access and some fancy nice-to-have stuff. Sometimes gets stuck!
Basic LOD functionality (and some other APIs) is highly available
Benefits
performance test
Data: 10 M records <=> 300 M triples
Case-insensitive query: "beach"
SELECT ?s WHERE { ?s <http://purl.org/dc/terms/title> ?o FILTER regex(str(?o), "beach", "i") }
# => SPARQL execution time for Q8316: 108.7s, returned 2815 rows.
http://$ip:9200/_search?q=beach&from=0&size=2800
# => Elasticsearch needed 0.4s
=> Elasticsearch is more than 250 times faster
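As a quick check of the numbers above, the sketch below rebuilds the Elasticsearch search URL and computes the ratio of the two timings. The helper names `search_url` and `speedup` and the host `localhost` are assumptions for illustration; since the 0.4 s figure is rounded, the ratio is only approximate.

```python
from urllib.parse import urlencode


def search_url(host, term, size, start=0):
    """Build an Elasticsearch URI search like the one on the slide."""
    params = urlencode({"q": term, "from": start, "size": size})
    return f"http://{host}:9200/_search?{params}"


def speedup(slow_seconds, fast_seconds):
    """How many times faster the second measurement is than the first."""
    return slow_seconds / fast_seconds


print(search_url("localhost", "beach", 2800))
print(round(speedup(108.7, 0.4)))  # ratio of the two timings from the slide
```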
(4store has support for text indexing; we have not tested that.)
performance test
Elasticsearch: 18 M records, 6 GB RAM: 5 hours
4store: 1 B triples, 72 GB RAM: 7 hours
CPU: quad core at 2.4 GHz with Hyper-Threading => 8 CPUs
HD: 6 x 2.5" 10k rpm, 146 GB each
(Don't take benchmarks too seriously; they only give a rough indication!)
Benefits, relying on elasticsearch as basic LOD storage
Why JSON-LD?
JSON is:
JSON-LD is:
RESTful elasticsearch API, e.g.:
(… OK, something is left to be done!)
Caveats
How to integrate semantic search into a document store?
dct:contributor --------> dct:creator -------> dc:creator \---------> dc:contributor
\--------> bibo:translator
…
There is no inferencing as there is with SPARQL!
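One conceivable workaround is to expand a property into its sub-properties before querying the search engine. The sketch below is an illustration only, not what lobid does: the sub-property table is an assumption chosen to resemble the diagram above, the function names are made up, and it assumes the index uses prefixed property names as field names. The generated body uses elasticsearch's `multi_match` query.

```python
# Assumed sub-property relations (super-property -> direct sub-properties).
# These are illustrative; check the vocabularies before relying on them.
SUB_PROPERTIES = {
    "dc:contributor": ["dct:contributor"],
    "dc:creator": ["dct:creator"],
    "dct:contributor": ["dct:creator", "bibo:translator"],
}


def expand(prop, hierarchy=SUB_PROPERTIES):
    """Return prop plus all of its transitive sub-properties, without duplicates."""
    seen = []
    stack = [prop]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.append(current)
        stack.extend(hierarchy.get(current, []))
    return seen


def multi_field_query(prop, text):
    """Build a multi_match query body that searches all expanded fields."""
    return {"query": {"multi_match": {"query": text, "fields": expand(prop)}}}


print(multi_field_query("dct:contributor", "Schmidt"))
```

The same expansion could also be applied at index time by copying values into the super-property's field; which side is cheaper depends on the data and the queries.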
Our data flow :
From records to RDF triples
|-----> graph-database
'------> computing ---> record-database
MARC/MAB/PICA... JSON-LD
tree-based vs graph-based:
What is the document? Only the top-level node?
… but then you couldn't even search for the author's name!
Searching requires integrating some fields from subgraphs into the document.
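A minimal sketch of that integration step, assuming the graph is available as a flat list of JSON-LD-like nodes keyed by `@id`. The function name `embed_labels`, the field names `creator` and `preferredName`, and the author label are hypothetical; only the two URIs come from the example data earlier in the slides.

```python
def embed_labels(graph, root_id, link_fields, label_field="preferredName"):
    """Copy the label of each linked node into the top-level document."""
    nodes = {node["@id"]: node for node in graph}
    document = dict(nodes[root_id])  # shallow copy of the top-level node
    for field in link_fields:
        embedded = []
        for reference in document.get(field, []):
            target = nodes.get(reference["@id"], {})
            entry = {"@id": reference["@id"]}
            if label_field in target:
                entry[label_field] = target[label_field]
            embedded.append(entry)
        if embedded:
            document[field] = embedded  # replaces the list, leaves the graph untouched
    return document


GRAPH = [
    {"@id": "http://lobid.org/resource/HT002948556",
     "title": "With reference to reference",
     "creator": [{"@id": "http://d-nb.info/gnd/135539897"}]},
    {"@id": "http://d-nb.info/gnd/135539897",
     "preferredName": "Example, Author"},  # hypothetical label for illustration
]

document = embed_labels(GRAPH, "http://lobid.org/resource/HT002948556", ["creator"])
print(document)
```

The resulting document can be indexed as a single record, so a search for the author's name finds the book.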
auto suggest
Demo
RESTful APIs:
http://demo.lobid.org/search?format=short&index=gnd-index&author=Schmidt%2C+Karl
http://demo.lobid.org/search?format=page&index=gnd-index&author=Schmidt%2C+Karl
http://demo.lobid.org/search?format=full&index=gnd-index&author=Schmidt%2C+Karl
… API usage:
GET /search?format=<page|full|short>&index=<lobid-index|gnd-index>&author=<query>
Easy to extend with the Play framework and the elasticsearch API.
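A client for this API needs little more than URL encoding. The sketch below builds request URLs for the usage pattern above; the function name `search_url` and the validation of parameter values are illustrative assumptions, and the demo host may no longer be reachable.

```python
from urllib.parse import urlencode

BASE_URL = "http://demo.lobid.org/search"  # from the slide; availability is an assumption
FORMATS = {"page", "full", "short"}
INDEXES = {"lobid-index", "gnd-index"}


def search_url(author, fmt="short", index="gnd-index"):
    """Build a search URL following the API usage pattern from the slide."""
    if fmt not in FORMATS:
        raise ValueError(f"unknown format: {fmt}")
    if index not in INDEXES:
        raise ValueError(f"unknown index: {index}")
    query = urlencode({"format": fmt, "index": index, "author": author})
    return f"{BASE_URL}?{query}"


print(search_url("Schmidt, Karl"))
```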
RESTful APIs: http://demo.lobid.org/search?format=short&index=gnd-index&author=Schmidt%2C+Karl
[
"Schmidt, Karl (1894-1945)",
"Schmidt, Karl",
"Schmidt, Karl (1910-)",
"Schmidt, Karl (1846-1928)",
"Schmidt, Karl (1913-)",
"Schmidt, Karl (1899-)",
"Schmidt, Karl (1924-)",
"Schmidt, Karl (1836-1888)",
"Schmidt, L. F. Karl",
"Schmidt, Karl (1902-1945)",
"Schmidt, Karl J.",
"Schmidt, Karl (1848-1905)",
"Schmidt, Karl (1817-1882)",
"Schmidt, Karl R.",
"Schmidt, Karl (1954-)",
"Schmidt, Karl (1888-)",
"Schmidt, Karl (1867-)",
...
]
Conclusion
a highly customizable/reliable/feature-rich LOD service
highly available! we can do that
For external access and some fancy nice-to-have stuff. Sometimes gets stuck!
Basic LOD functionality (and some other APIs) is highly available
The software is open source:
https://github.com/lobid/
http://elasticsearch.org/
https://hadoop.apache.org/
http://www.playframework.org/
http://4store.org/
Pascal Christoph semweb@hbz-nrw.de christoph@hbz-nrw.de
Using a dark background, this presentation saves maybe 70% of energy