Lizard: A Linked Data Publishing Platform. Andy Seaborne, Epimorphics

SLIDE 1

Lizard

A Linked Data Publishing Platform

Andy Seaborne Epimorphics Ltd.

SLIDE 2

Outline

➢ The (a) real world of service provision
➢ What to do about (some of) it
➢ How to do that

SLIDE 3

Who am I?

Andy Seaborne

➢ Editor on SPARQL Query
➢ A committer on Apache Jena
➢ At Epimorphics Ltd

SLIDE 4

This work

➢ Epimorphics
➢ Funding: InnovateUK*
➢ Users
  ○ For the discussion and encouragement

* Used to be the Technology Strategy Board. UK Department for Business, Innovation & Skills

SLIDE 5

Example Services

➢ http://environment.data.gov.uk/
➢ http://landregistry.data.gov.uk/

SLIDE 6

Customer Requirements

➢ Maximise usage
➢ Publication not application

SLIDE 7

Running Services

Data publishing != Database backed web site

  • Different traffic patterns
    ○ Expensive queries, less control
    ○ Bot multiplier effect
  • “Admin”
    ○ SLAs: Heartbleed

SLIDE 8

Problem Statement

  • Reacting to events
  • Machine administration / SLAs

SLIDE 9

Goals

➢ 24x7 Operation
➢ Consistency

SLIDE 10

About Consistency

Makes the system easier to use
  ○ For users
  ○ For operators

Each query sees an unchanging database
… that did exist; no “bit of this, bit of that”

Clients may conspire!

SLIDE 11

Apache Jena TDB

➢ Node Table
  ○ Inline values (integers, date/dateTime, …)
➢ Indexes are covering
  ○ Range scans
  ○ All key, no value
  ○ No "triple table"

[Diagram: node table with columns Id and RDF Term; Index: SPO, Index: POS, Index: OSP]
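
A minimal sketch of what "covering" indexes with range scans can look like, assuming triples are already reduced to integer NodeIds. The ids, the list-based indexes and the `range_scan` helper are illustrative assumptions, not Jena's actual code or API.

```python
from bisect import bisect_left

# Hypothetical NodeIds; in a real store the node table maps these to RDF terms.
triples = [(1, 10, 100), (1, 11, 101), (2, 10, 100), (3, 10, 102)]

# Each index holds the whole triple in a different column order
# ("all key, no value"), so there is no separate triple table.
SPO = sorted((s, p, o) for s, p, o in triples)
POS = sorted((p, o, s) for s, p, o in triples)
OSP = sorted((o, s, p) for s, p, o in triples)

def range_scan(index, prefix):
    """Yield every entry of a sorted index whose leading columns equal prefix."""
    start = bisect_left(index, prefix)  # first entry >= prefix
    for entry in index[start:]:
        if entry[:len(prefix)] != prefix:
            break  # entries are sorted, so the matching run has ended
        yield entry
```

A pattern with P and O bound is answered from POS alone, for example `list(range_scan(POS, (10, 100)))`; a pattern with only S bound uses SPO.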

SLIDE 12

SPARQL Execution

{ ?x :p 123 . }

➢ Convert to NodeIds
➢ Look in POS to get all PO?, assign S to ?x
➢ 123 is an inline constant in TDB.

{ ?x :p 123 . ?x :q ?v . }

➢ A database join
➢ Index join (Loop+substitution)
➢ Index join (= loop) on :x1 :q ?v where :x1 is the value of ?x
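
The loop-plus-substitution join on this slide can be sketched as below. This is an illustration under assumptions: NodeIds are plain integers, the indexes are sorted Python lists, and `scan` and `index_join` are hypothetical names rather than TDB functions.

```python
from bisect import bisect_left

def scan(index, prefix):
    """Yield entries of a sorted index whose leading columns equal prefix."""
    start = bisect_left(index, prefix)
    for entry in index[start:]:
        if entry[:len(prefix)] != prefix:
            break
        yield entry

def index_join(pos, spo, p, o, q):
    """Solve { ?x p o . ?x q ?v }: loop over ?x, substitute it into the second pattern."""
    for _, _, x in scan(pos, (p, o)):      # POS lookup on (p, o) binds ?x
        for _, _, v in scan(spo, (x, q)):  # SPO lookup on (x, q) binds ?v
            yield (x, v)

# Hypothetical ids: 10 stands for :p, 11 for :q, 123 for the inline constant.
triples = [(1, 10, 123), (1, 11, 500), (2, 10, 123), (2, 11, 501),
           (2, 11, 502), (3, 10, 999), (3, 11, 503)]
SPO = sorted(triples)
POS = sorted((p, o, s) for s, p, o in triples)
```

Each binding of ?x costs one further index lookup, which is cheap on a single machine but becomes a network round trip once the index lives on another server.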

SLIDE 13

Index Implementation

➢ TDB uses threaded B+Trees for indexes
  ○ 8K blocks
  ○ 100-way B+Tree

[Diagram: B+Tree with branch blocks of SPO keys and child pointers (Ptr) above leaf blocks of SPO entries]
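
As rough arithmetic on what a 100-way tree implies, the sketch below counts the levels needed to reach a given number of entries. It assumes every block, leaves included, holds exactly 100 entries and is completely full; the slide does not state the leaf capacity or fill factor, so treat the numbers as an order-of-magnitude guide only.

```python
def levels_needed(entries, fanout=100):
    """Smallest number of tree levels whose capacity (fanout ** levels) covers entries."""
    levels, capacity = 1, fanout
    while capacity < entries:
        capacity *= fanout
        levels += 1
    return levels
```

Under these assumptions a million triples fit in three levels and a billion in five, so a lookup touches only a handful of blocks.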

SLIDE 14

Choices

Where to introduce distribution?

➢ Query and Update
➢ Indexes / B+Trees
➢ Node table / Objects
➢ Blocks
➢ Key → Value Store

SLIDE 15

This Does Not Work (very well)

Distribute the storage
  ○ K->V store
  ○ Index access on query processor

➢ Easy to do (pick a KV store of your choice)
➢ Impedance mismatch
  ○ Too much data moving about
  ○ Little parallelism
  ○ Bad cold-start

[Diagram: stack of Query and Update; B+Trees, Objects; Blocks; Key→Value]

SLIDE 16

Distribute

➢ Distribute the indexes
  ○ With modified index access
➢ Distribute the nodes
➢ Comms: Apache Thrift

[Diagram: stack of Query and Update; B+Trees, Objects; Blocks; Key→Value]

SLIDE 17

Clustered Node Table

➢ Node Table
  ○ N replicas; Read R / Write W
    e.g. W=N and R=1 => Complete copies of node table on each data server
  ○ Can shard
  ○ Replaceable

Requirement: NodeId for naming
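
A small simulation of the N / R / W scheme, assuming the usual quorum-overlap rule (a read is guaranteed to see the latest write when R + W > N). The slide only names N, R and W; the class, the version counter and the random choice of replicas are hypothetical details added for illustration.

```python
import random

class ReplicatedNodeTable:
    """Toy node table: n replicas, each write goes to w of them, each read asks r."""

    def __init__(self, n, r, w):
        if not (1 <= r <= n and 1 <= w <= n):
            raise ValueError("need 1 <= r <= n and 1 <= w <= n")
        self.n, self.r, self.w = n, r, w
        self.replicas = [dict() for _ in range(n)]  # node_id -> (version, term)
        self.version = 0

    def put(self, node_id, term):
        self.version += 1
        for i in random.sample(range(self.n), self.w):
            self.replicas[i][node_id] = (self.version, term)

    def get(self, node_id):
        asked = random.sample(range(self.n), self.r)
        answers = [self.replicas[i][node_id] for i in asked if node_id in self.replicas[i]]
        # The highest version among the answers wins.
        return max(answers)[1] if answers else None
```

With W=N and R=1, as on the slide, every replica holds a complete copy, so any single data server can answer a lookup by itself.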

SLIDE 18

Clustered Indexes

➢ Indexes
  ○ Can shard by subject
  ○ Replicas of each shard (R=1, W=N)
  ○ Compound access operations
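
One way to picture sharding by subject, assuming integer NodeIds and a simple modulo placement. The placement function is an assumption made for this sketch; the slides do not say how Lizard assigns subjects to shards.

```python
from collections import defaultdict

def shard_for(subject_id, num_shards):
    """Hypothetical placement rule: subject NodeId modulo the shard count."""
    return subject_id % num_shards

def build_shards(triples, num_shards):
    """Group (s, p, o) triples by the shard of their subject, sorted within each shard."""
    shards = defaultdict(list)
    for s, p, o in triples:
        shards[shard_for(s, num_shards)].append((s, p, o))
    return {shard: sorted(entries) for shard, entries in shards.items()}

triples = [(1, 10, 100), (4, 10, 101), (4, 11, 102), (2, 10, 100), (3, 12, 103)]
shards = build_shards(triples, 3)
```

Because all triples with the same subject land on one shard, a request for several predicates of one subject can be answered by a single shard.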

SLIDE 19

Clustered Indexes

[Diagram: an index split into Shard 1, Shard 2 and Shard 3, placed on Machine 1 and Machine 2]

SLIDE 20

Modified SPARQL Execution

➢ Different unit of index access
  ○ subject + several predicates
    (subj, pred1, pred2, pred3, …)
➢ Different join algorithms
  ○ Merge join
  ○ Parallel hash join
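
A generic merge join, sketched for two inputs that are already sorted on the join key (for instance, bindings of ?x arriving in index order). This is the textbook algorithm, not code taken from Lizard; the `(key, value)` pair layout is an assumption.

```python
def merge_join(left, right):
    """Join two lists of (key, value) pairs, each sorted by key."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of equal keys on each side, then emit their cross product.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lk:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            for _, lv in left[i:i_end]:
                for _, rv in right[j:j_end]:
                    out.append((lk, lv, rv))
            i, j = i_end, j_end
    return out
```

Unlike the loop-and-substitute index join, each input is read once in order, so the two sides can be streamed from different data servers without a lookup per binding.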

SLIDE 21

Configuration 1

➢ Load Balancer (or RR-DNS)
➢ Query server
➢ Query server
➢ Data server: POS Copy 1, PSO Copy 2
➢ Data server: POS Copy 1, PSO Copy 2
➢ Data server: Node Copy 1
➢ Data server: Node Copy 2

SLIDE 22

Configuration 2

[Diagram labels: Load Balancer (or RR-DNS); Query server (×2); Data server (×2); Node Copy 1; Node Copy 2; POS Copy 1, PSO Copy 2 (×2)]

SLIDE 23

Status

➢ Working prototype
➢ Spin-off: TDB2

SLIDE 24

New Technology

  • Copy-on-write indexes
  • New transactional coordinator
  • Apache Thrift encoded node table
  • Side effect: TDB2
    ○ Arbitrary scaling transactions
    ○ Transactional only
    ○ Space recovery

SLIDE 25

Paul Hirst / CC-BY-SA-2.5
