The SHARD Triple-Store (Rick Schantz, Kurt Rohloff)



slide-1
SLIDE 1

Clause-Iteration with Map-Reduce to Scalably Query Data Graphs:

The SHARD Triple-Store

Kurt Rohloff, krohloff@bbn.com, @avometric
Rick Schantz, schantz@bbn.com

Many thanks to:

  • Prakash Manghwani, Mike Dean, Ian Emmons, Gail Mitchell, Doug Reid, Chris Kappler from BBN
  • Hanspeter Pfister from Harvard SEAS
  • Phil Zeyliger from Cloudera

slide-2
SLIDE 2

Outline

  • Challenge Problem: Scalably Query Graph Data
  • Large-Scale Computing and MapReduce
  • SHARD
  • Design Insights


slide-3
SLIDE 3

A Preface

SHARD is a cloud-based graph store.

  • High-performance scalable query processing.

SHARD is released as open source.

  • BSD license.

More information and code at:

– My webpage
– SourceForge (SHARD-3store)

  • Use svn to get code:

svn co https://shard-3store.svn.sourceforge.net/svnroot/shard-3store shard-3store

– Don’t worry - this command is on SourceForge!


slide-4
SLIDE 4


Scalable Graph Data Querying

  • Emerging commercially

– Used by NYTimes, BBC, Pharma, …
– Numerous startups.
– Oracle, MySQL have SemWeb support.

  • Government use…
  • See the SemWeb.


slide-5
SLIDE 5

SPARQL-like Queries

SPARQL query to find all people who own a car made in Detroit:

SELECT ?person
WHERE {
    ?person :owns ?car .
    ?car a :Car .
    ?car :madeIn :Detroit .
}

[Figure: query graph. ?person owns ?car; ?car a Car; ?car madeIn Detroit.]


slide-6
SLIDE 6

Answering Queries

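[Figure: data graph with the edges Kurt owns car0; Kurt livesIn Cambridge; car0 a Car; car0 madeBy Ford; car0 madeIn Detroit; Cambridge a City; Detroit a City. The query graph (?person owns ?car; ?car a Car; ?car madeIn Detroit) is matched against it.]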


Variable bindings: ?person to Kurt, ?car to car0
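
As a rough illustration of what answering the query means, the sketch below matches the three query clauses against the example graph by brute force. It is not SHARD code: the names TRIPLES, QUERY, match_clause, and answer are hypothetical, and the namespace prefixes (the leading ":") are dropped for brevity.

```python
# Illustrative sketch: brute-force matching of the example query against the
# example graph. All names here are hypothetical, not SHARD APIs.

TRIPLES = [
    ("Kurt", "owns", "car0"),
    ("Kurt", "livesIn", "Cambridge"),
    ("car0", "a", "Car"),
    ("car0", "madeBy", "Ford"),
    ("car0", "madeIn", "Detroit"),
    ("Cambridge", "a", "City"),
    ("Detroit", "a", "City"),
]

QUERY = [
    ("?person", "owns", "?car"),
    ("?car", "a", "Car"),
    ("?car", "madeIn", "Detroit"),
]


def match_clause(clause, triple, binding):
    """Extend `binding` so that `clause` matches `triple`, or return None."""
    new_binding = dict(binding)
    for term, value in zip(clause, triple):
        if term.startswith("?"):
            if new_binding.setdefault(term, value) != value:
                return None  # variable already bound to something else
        elif term != value:
            return None  # constant does not match
    return new_binding


def answer(query, triples):
    """Return every variable binding that satisfies all clauses."""
    bindings = [{}]
    for clause in query:
        bindings = [
            extended
            for binding in bindings
            for triple in triples
            if (extended := match_clause(clause, triple, binding)) is not None
        ]
    return bindings


print(answer(QUERY, TRIPLES))
```

The only binding that survives all three clauses is ?person to Kurt and ?car to car0. A single-machine scan like this is exactly what does not scale, which motivates the MapReduce approach in the following slides.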

slide-7
SLIDE 7

Design Considerations

  • Scalable – web-scale?
  • High Assurance.
  • Cost Effective – commodity hardware?
  • Modular inferred data separation.
  • Robustness.
  • Considerations as endless as applications.


slide-8
SLIDE 8

Scale Limitations!

  • Triple-Store Study:

– “An Evaluation of Triple-Store Technologies for Large Data Stores”, SSWS '07 (Part of OTM).

  • What about cloud computing?

– Economic scalability…


slide-9
SLIDE 9

General Programming for Scalable Cloud Computing

From Experience:

  • Inherently multi-threaded.
  • Toolsets still young.

– Not many debugging tools.

  • Mental models are different...

– Learn an algorithm, adapt it to the chosen framework.
– Ex: try to fit the problem into the PageRank design pattern.

  • (This isn’t what we do, but this approach seems common.)


slide-10
SLIDE 10

Scalable Distributed System (Cloud) Design Concept

  • We use the maturing MapReduce framework in Hadoop to bulk-process graph edges.
  • This provides a services layer to scale our graph query processing techniques.
  • Innovation:

– Iterative clause-based construction of queries.
– Join partial query responses over multiple Map-Reduce jobs using flagged keys.


Abstraction of parallelization enables much easier scaling.


slide-11
SLIDE 11

SHARD Triple-Store Built on Hadoop

Prioritized goals:

  • Commodity hardware, ONLY
  • Web scalable
  • Robust

What is good: Design Considerations:

  • Large query responses
  • Complex queries
slide-12
SLIDE 12

Clause Iteration Query Response Construction

[Figure: Source Data (subject, predicate, object triples) is matched against the query graph one clause at a time. The first clause (?person owns ?car) yields the 1st clause results; adding the second clause (?car a Car) yields the 2nd clause results.]


slide-13
SLIDE 13

1st Partial Query Match By Clause


In the first Map step, the first query clause is used to find partial query matches that satisfy that clause.

  • Keys are variable bindings
  • Values are set to null

Source data:

John owns dog0
Kurt livesIn Cambridge
Kurt owns car0
dog0 a Dog
car0 a Car
…

1st Map Key-Val Output (for the clause ?person :owns ?car .):

{John dog0} - null
{Kurt car0} - null
…

In the first Reduce step, repeated partial matches are removed.
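
The first Map and Reduce steps can be sketched in plain Python standing in for Hadoop. This is only an illustration under stated assumptions: first_map and first_reduce are hypothetical names, triples are held in memory, and a clause that repeats the same variable is not handled.

```python
# Sketch of the first Map and Reduce steps. Plain Python stands in for Hadoop;
# the function names are hypothetical, not SHARD's.

SOURCE = [
    ("John", "owns", "dog0"),
    ("Kurt", "livesIn", "Cambridge"),
    ("Kurt", "owns", "car0"),
    ("dog0", "a", "Dog"),
    ("car0", "a", "Car"),
]

FIRST_CLAUSE = ("?person", "owns", "?car")


def first_map(clause, triple):
    """Emit (variable bindings, None) if the triple satisfies the clause."""
    bound = []
    for term, value in zip(clause, triple):
        if term.startswith("?"):
            bound.append(value)
        elif term != value:
            return  # constant mismatch: emit nothing
    yield tuple(bound), None


def first_reduce(pairs):
    """Remove repeated partial matches by keeping each key once."""
    return sorted({key for key, _ in pairs})


map_output = [pair for triple in SOURCE for pair in first_map(FIRST_CLAUSE, triple)]
partial_matches = first_reduce(map_output)
print(partial_matches)
```

The keys carry the bindings for (?person, ?car) and the values are null (None), matching the key-value output on the slide.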

slide-14
SLIDE 14

2nd Clause Map – New Bindings


Map partial query matches from 2nd query clause.

  • Keys are variable bindings previously observed.
  • Values are set to new variable bindings.

Map matches from previous clause for reordering.

  • Keys are variable bindings common with current clause
  • Values are previous non-common bindings

Source data:

John owns dog0
Kurt livesIn Cambridge
Kurt owns car0
dog0 a Dog
car0 a Car
…

2nd Map Key-Val Output (for the clause ?car a Car .):

{car0} – null
…
{dog0} – {John}
{car0} – {Kurt}
…

1st Map Key-Val Output:

{John dog0} - null
{Kurt car0} - null
…
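
A minimal sketch of the second Map step follows, again with plain Python in place of Hadoop. The "new" and "old" flags are an assumed way of marking where a record came from (the slides mention flagged keys but do not give the encoding), an empty dict plays the role of null, and all function names are hypothetical.

```python
# Illustrative sketch only: plain Python stands in for a Hadoop Map step, and
# the "new"/"old" flags are an assumed encoding, not SHARD's actual format.

SOURCE = [
    ("John", "owns", "dog0"),
    ("Kurt", "livesIn", "Cambridge"),
    ("Kurt", "owns", "car0"),
    ("dog0", "a", "Dog"),
    ("car0", "a", "Car"),
]

# Partial matches left by the first clause (?person :owns ?car).
PREVIOUS = [
    {"?person": "John", "?car": "dog0"},
    {"?person": "Kurt", "?car": "car0"},
]

SECOND_CLAUSE = ("?car", "a", "Car")
COMMON = ("?car",)  # variables shared by the previous matches and this clause


def split_on_common(binding, common):
    """Return (key of common bindings, dict of the remaining bindings)."""
    key = tuple(binding[var] for var in common)
    rest = {var: val for var, val in binding.items() if var not in common}
    return key, rest


def map_clause(clause, triple, common):
    """Emit a record flagged "new" if the triple matches the clause."""
    binding = {}
    for term, value in zip(clause, triple):
        if term.startswith("?"):
            binding[term] = value
        elif term != value:
            return  # constant mismatch: emit nothing
    key, rest = split_on_common(binding, common)
    yield key, ("new", rest)


def map_previous(match, common):
    """Re-key a previous partial match on the common variables, flagged "old"."""
    key, rest = split_on_common(match, common)
    yield key, ("old", rest)


second_map_output = [
    record for triple in SOURCE for record in map_clause(SECOND_CLAUSE, triple, COMMON)
] + [record for match in PREVIOUS for record in map_previous(match, COMMON)]

for key, value in second_map_output:
    print(key, value)
```

Both kinds of record end up keyed on the shared variable ?car, so a later Reduce step sees the new match for car0 and the earlier match for Kurt under the same key.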

slide-15
SLIDE 15

2nd Clause Reduce – Join


Reduce joins partial mappings on common variable bindings with flagged keys.

2nd Map Key-Val Output:

{car0} – null
…
{dog0} – {John}
{car0} – {Kurt}
…

Reduce

2nd Reduce Key-Val Output:

{car0} – {Kurt}
…

Process continues over all query clauses.
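
The join in the Reduce step can be sketched as a group-by on the key followed by pairing records from the two sides. As before this is an assumption-laden illustration: the "new"/"old" flags and the reduce_join name are hypothetical, and an empty dict stands for null.

```python
# Sketch of the second Reduce step as a join on the shared key. The flags and
# function name are assumptions for illustration, not SHARD's implementation.
from collections import defaultdict

# Second Map output from the slide: key, then (flag, remaining bindings).
MAP_OUTPUT = [
    (("car0",), ("new", {})),
    (("dog0",), ("old", {"?person": "John"})),
    (("car0",), ("old", {"?person": "Kurt"})),
]


def reduce_join(pairs):
    """Keep keys seen on both sides and merge their bindings."""
    groups = defaultdict(lambda: {"new": [], "old": []})
    for key, (flag, bindings) in pairs:
        groups[key][flag].append(bindings)

    joined = []
    for key, sides in groups.items():
        for old in sides["old"]:
            for new in sides["new"]:
                joined.append((key, {**old, **new}))
    return joined


print(reduce_join(MAP_OUTPUT))
```

The key dog0 appears only on the "old" side, so John's partial match is dropped, leaving car0 with Kurt as on the slide. In the full process this output would feed the Map step for the next clause.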

slide-16
SLIDE 16

HDFS Graph Storage

[Figure: example data graph of Kurt, car0, Ford, Detroit, Cambridge, Car, and City, connected by owns, madeBy, madeIn, livesIn, and a edges.]


Graphs are saved as flat files in HDFS (portions of the file are saved on each data node):

Kurt owns car0 livesIn Cambridge
car0 a Car madeBy Ford madeIn Detroit
Cambridge a City
Detroit a City


slide-17
SLIDE 17

HDFS data partitioning


[Figure: a Client, a Name Node, and Nodes 1 to 4 in a Local Cloud; the example data items Cannon Right, Cannon Left, and Cannon Behind appear at the client and are repeated across the nodes.]


  • Hash Partitioning by Default.
  • Neighborhood partitioning would probably provide better performance.
  • R&D opportunity!
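
To make the idea of hash partitioning concrete, here is a small sketch that assigns each subject line to a node by hashing the subject. It illustrates the general technique only and is not how HDFS places file blocks; node_for, partition, the use of CRC32, and the node count of 4 are all assumptions chosen for the example.

```python
# Sketch of hash partitioning of subject lines across nodes. Illustration of
# the general idea only; names and the CRC32 hash are assumptions.
import zlib

LINES = [
    "Kurt owns car0 livesIn Cambridge",
    "car0 a Car madeBy Ford madeIn Detroit",
    "Cambridge a City",
    "Detroit a City",
]


def node_for(subject, num_nodes):
    """Pick a node index from a stable hash of the subject."""
    return zlib.crc32(subject.encode("utf-8")) % num_nodes


def partition(lines, num_nodes):
    """Place each line on the node chosen by its subject (the first token)."""
    nodes = [[] for _ in range(num_nodes)]
    for line in lines:
        subject = line.split()[0]
        nodes[node_for(subject, num_nodes)].append(line)
    return nodes


for index, held in enumerate(partition(LINES, 4)):
    print(f"node {index}: {held}")
```

A hash ignores graph structure, so neighbors such as Kurt and car0 may land on different nodes and their join then has to move data between nodes. That is the cost a neighborhood-aware partitioning would try to reduce.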
slide-18
SLIDE 18

Query Processing Implementation

  • BBN-developed query processor.

– Starting integration with “standard” interfaces (Jena, Sesame).

  • SHARD supports “most” of SPARQL.

– Like most commercial triple-stores.

  • Large performance improvements possible with improved query reordering.
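
One possible reordering heuristic, offered purely as an assumption and not as SHARD's method, is to process the most constrained clause first and then prefer clauses that share variables with those already processed. The names variables and reorder are hypothetical.

```python
# Hypothetical clause-reordering heuristic: fewest unbound variables first,
# preferring clauses connected to variables that are already bound.

QUERY = [
    ("?person", "owns", "?car"),
    ("?car", "a", "Car"),
    ("?car", "madeIn", "Detroit"),
]


def variables(clause):
    """Return the set of variable terms in a clause."""
    return {term for term in clause if term.startswith("?")}


def reorder(clauses):
    """Greedily order clauses so each step introduces few new variables."""
    remaining = list(clauses)
    ordered, bound = [], set()
    while remaining:
        # Prefer clauses that join with what is already bound, if any exist.
        connected = [c for c in remaining if variables(c) & bound] or remaining
        best = min(connected, key=lambda c: len(variables(c) - bound))
        ordered.append(best)
        bound |= variables(best)
        remaining.remove(best)
    return ordered


for clause in reorder(QUERY):
    print(clause)
```

For the example query this moves the two single-variable clauses on ?car ahead of the ?person clause, so fewer partial matches would be carried into later iterations. A real optimizer would likely use data statistics rather than variable counts alone.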


slide-19
SLIDE 19

Data Persistence Advice from SHARD

  • Down to “bare metal” in HDFS for large-scale efficiency.

– No Berkeley DB, no C-stores, …. Nothing.

  • Simple data storage as flat files.

– Lists of (predicate, object) pairs for every subject by line.
– Ex: Kurt owns car0 livesIn Cambridge

  • Simple often really is better…
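
Reading such a line back into triples takes only a few lines of code. The sketch below assumes whitespace-separated tokens with no embedded spaces, which the slides do not confirm; parse_line is a hypothetical name.

```python
# Sketch of parsing one flat-file line: a subject followed by
# (predicate, object) pairs. Whitespace-separated tokens are an assumption.

def parse_line(line):
    """Split 'subject p1 o1 p2 o2 ...' into (subject, predicate, object) triples."""
    tokens = line.split()
    if len(tokens) < 3 or len(tokens) % 2 == 0:
        raise ValueError(f"malformed line: {line!r}")
    subject, rest = tokens[0], tokens[1:]
    return [(subject, rest[i], rest[i + 1]) for i in range(0, len(rest), 2)]


print(parse_line("Kurt owns car0 livesIn Cambridge"))
```

Because every line is self-contained, a Map task can process its portion of the file without consulting any index or other node.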


slide-20
SLIDE 20

Test Data

  • Deployed code on Amazon EC2 cloud.

– 19 XL nodes.

  • LUBM (Lehigh Univ. BenchMark)

– Artificial data on students, professors, courses, etc… at universities.

  • 800 million edge graph.

– 6000 LUBM university dataset.

  • In general, performed comparably to “industrial” monolithic triple-stores.


slide-21
SLIDE 21

Performance Comparison

Query Type                                             SHARD                       Parliament+Sesame  Parliament+Jena
Simple Query, Small Response: Triple Lookup (Query 1)  404 sec. (approx 0.1 hr.)   0.1 hr             0.001 hr
Triangular Query (Query 9)                             740 sec. (approx 0.2 hr.)   1 hr               1 hr
Simple Query, Large Response (Query 14)                118 sec. (approx 0.03 hr.)  1 hr               5 hr
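
The approximate hour figures for SHARD follow from dividing the measured seconds by 3600. A short check, using only the timings in the table (the variable names are illustrative):

```python
# Convert the SHARD timings from the table into hours.
timings_sec = {"Query 1": 404, "Query 9": 740, "Query 14": 118}

hours = {query: seconds / 3600 for query, seconds in timings_sec.items()}

for query, value in hours.items():
    print(f"{query}: {value:.3f} hr")
```

This gives roughly 0.112, 0.206, and 0.033 hours, consistent with the rounded values of 0.1, 0.2, and 0.03 hours shown for SHARD.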

slide-22
SLIDE 22

Insight from Query Performance

  • SHARD is not optimal for edge look-ups.

– This could be expected: SHARD (and MapReduce implementations) have no real indexing support.

  • SHARD does well where large portions of the dataset need to be processed.

– Ex: multiple join operations, returning large datasets.
– This behavior is an artifact of the parallel searching and joining operations native to Clause-Iteration.


slide-23
SLIDE 23

Design Insights

  • Abstraction is a big win.

– Surprisingly economical for development.

  • Lack of indexing limits look-up capabilities.

– This may not be so bad for some applications.
– An index would also need to be continually updated as data is added.


slide-24
SLIDE 24

Design Insights – Data Partitioning

  • Data linking may be a big win to reduce join overhead and reduce the need for iterations over clauses.

– A first step would be advanced data partitioning.
– Some done in Cloud9, but still wide open for even basic R&D implementations.

  • Advanced data partitioning would also minimize the overhead of moving intermediate results between compute nodes.

– This seemed to be the biggest bottleneck.


slide-25
SLIDE 25

Design Insights – Query Processing

  • Query pre-processing may also be a big win.

– Could also greatly reduce the amount of data carried between nodes during join operations.

  • Subject-Iteration may be an alternative approach for queries with strongly connected source nodes.

– Iterate over query subject rather than clauses.


slide-26
SLIDE 26

Thanks! Questions?

Kurt Rohloff krohloff@bbn.com @avometric