The SHARD Triple-Store


  1. Clause-Iteration with Map-Reduce to Scalably Query Data Graphs: The SHARD Triple-Store
     Rick Schantz (schantz@bbn.com), Kurt Rohloff (krohloff@bbn.com, @avometric)
     Many thanks to:
     • Prakash Manghwani, Mike Dean, Ian Emmons, Gail Mitchell, Doug Reid, Chris Kappler from BBN
     • Hanspeter Pfister from Harvard SEAS
     • Phil Zeyliger from Cloudera

  2. Outline
     • Challenge Problem: Scalably Query Graph Data
     • Large-Scale Computing and MapReduce
     • SHARD
     • Design Insights

  3. A Preface
     SHARD is a cloud-based graph store.
     • High-performance scalable query processing.
     SHARD released open-source.
     • BSD license.
     More information and code at:
     – My webpage
     – SourceForge (SHARD-3store)
     • Use svn to get code:
       svn co https://shard-3store.svn.sourceforge.net/svnroot/shard-3store shard-3store
     – Don’t worry - this command is on SourceForge!

  4. Scalable Graph Data Querying
     • Emerging commercially
       – Use by NYTimes, BBC, Pharma, …
       – Numerous startups.
       – Oracle, MySQL have SemWeb support.
     • Government use …
     • See the SemWeb.

  5. SPARQL-like Queries
     SPARQL Query to find all people who own a car made in Detroit:
       SELECT ?person
       WHERE {
         ?person :owns ?car .
         ?car a :Car .
         ?car :madeIn :Detroit .
       }
     [Figure: query graph. ?person owns ?car; ?car a Car; ?car madeIn Detroit.]

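  6. Answering Queries
     [Figure: data graph (Kurt owns car0; Kurt livesIn Cambridge; car0 a Car; car0 madeBy Ford; car0 madeIn Detroit; Cambridge a City; Detroit a City) matched against the query graph (?person owns ?car; ?car a Car; ?car madeIn Detroit).]
     Variable bindings:
     • ?person to Kurt
     • ?car to car0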
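The bindings on this slide can be reproduced with a small in-memory matcher. This is a minimal sketch, not SHARD code: it assumes the graph is held as a Python set of (subject, predicate, object) tuples, and the helper names `match_clause` and `answer` are made up for illustration.

```python
# Data graph from the slide, as (subject, predicate, object) triples.
TRIPLES = {
    ("Kurt", "owns", "car0"),
    ("Kurt", "livesIn", "Cambridge"),
    ("car0", "a", "Car"),
    ("car0", "madeBy", "Ford"),
    ("car0", "madeIn", "Detroit"),
    ("Cambridge", "a", "City"),
    ("Detroit", "a", "City"),
}

# The three query clauses; terms starting with "?" are variables.
QUERY = [
    ("?person", "owns", "?car"),
    ("?car", "a", "Car"),
    ("?car", "madeIn", "Detroit"),
]


def match_clause(clause, triple, bindings):
    """Return bindings extended so that clause matches triple, or None."""
    extended = dict(bindings)
    for term, value in zip(clause, triple):
        if term.startswith("?"):
            # A variable must keep the value it was bound to earlier.
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return extended


def answer(query, triples, bindings=None):
    """Return every set of variable bindings that satisfies all clauses."""
    bindings = {} if bindings is None else bindings
    if not query:
        return [bindings]
    results = []
    for triple in triples:
        extended = match_clause(query[0], triple, bindings)
        if extended is not None:
            results.extend(answer(query[1:], triples, extended))
    return results


print(answer(QUERY, TRIPLES))
```

On the slide's graph this prints `[{'?person': 'Kurt', '?car': 'car0'}]`. The sketch scans every triple for every clause on one machine, which is the work the later slides spread across Map-Reduce jobs.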

  7. Design Considerations
     • Scalable – web-scale?
     • High Assurance.
     • Cost Effective – commodity hardware?
     • Modular inferred data separation.
     • Robustness.
     • Considerations as endless as applications.

  8. Scale Limitations!
     • Triple-Store Study:
       – “An Evaluation of Triple-Store Technologies for Large Data Stores”, SSWS '07 (Part of OTM).
     • What about cloud computing?
       – Economic scalability…

  9. General Programming for Scalable Cloud Computing
     From Experience:
     • Inherently multi-threaded.
     • Toolsets still young.
       – Not many debugging tools.
     • Mental models are different...
       – Learn an algorithm, adapt it to the chosen framework.
       – Ex: try to fit problem into PageRank design pattern.
         • (This isn’t what we do, but this approach seems common.)

  10. Scalable Distributed System (Cloud) Design Concept
      Abstraction of parallelization enables much easier scaling.
      • We use the maturing MapReduce framework in Hadoop to bulk process graph edges.
      • This provides a services layer to scale our graph query processing techniques.
      • Innovation:
        – Iterative clause-based construction of queries.
        – Join partial query responses over multiple Map-Reduce jobs using flagged keys.

  11. SHARD Triple-Store Built on Hadoop Prioritized goals: • Commodity hardware, ONLY • Web scalable • Robust What is good: Design Considerations: • Large query responses • Complex queries

  12. Clause Iteration Query Response Construction
      [Figure: source data triples (s, p, o) are narrowed to the 1st clause results (?person owns ?car), then extended clause by clause with ?car a Car and ?car madeIn Detroit.]

  13. 1st Partial Query Match By Clause
      In first Map Step, first query clause is used to find partial query matches that satisfy first clause
      • Keys are variable bindings
      • Values are set to null
      Query clause: ?person :owns ?car .
      Source data:
        John owns dog0
        Kurt livesIn Cambridge
        Kurt owns car0
        dog0 a Dog
        car0 a Car
        …
      1st Map Key-Val Output:
        {John dog0} - null
        {Kurt car0} - null
        …
      In first Reduce Step, repeated partial matches are removed
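The first Map and Reduce steps can be sketched in plain Python. This is an illustration under stated assumptions, not the Hadoop job itself: triples are in-memory tuples, a key is a tuple of (variable, value) pairs, and the function names `bind`, `first_map` and `first_reduce` are hypothetical.

```python
# Source data and first query clause from the slide.
SOURCE = [
    ("John", "owns", "dog0"),
    ("Kurt", "livesIn", "Cambridge"),
    ("Kurt", "owns", "car0"),
    ("dog0", "a", "Dog"),
    ("car0", "a", "Car"),
]
CLAUSE = ("?person", "owns", "?car")


def bind(clause, triple):
    """Return {variable: value} if the triple matches the clause, else None."""
    bindings = {}
    for term, value in zip(clause, triple):
        if term.startswith("?"):
            if bindings.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return bindings


def first_map(clause, triples):
    """Emit (key, None) pairs; each key holds the variable bindings of a match."""
    for triple in triples:
        bindings = bind(clause, triple)
        if bindings is not None:
            yield tuple(bindings.items()), None


def first_reduce(pairs):
    """Remove repeated partial matches, keeping first-seen order."""
    return list(dict.fromkeys(key for key, _ in pairs))


for key in first_reduce(first_map(CLAUSE, SOURCE)):
    print(dict(key), "-", None)
```

The two surviving keys correspond to `{John dog0}` and `{Kurt car0}` on the slide. In a real Map-Reduce run, grouping by key is what makes the duplicate removal in the Reduce step cheap, since identical keys arrive at the same reducer.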

  14. 2nd Clause Map – New Bindings
      Map partial query matches from 2nd query clause.
      • Keys are variable bindings previously observed.
      • Values are set to new variable bindings.
      Map matches from previous clause for reordering.
      • Keys are variable bindings common with current clause
      • Values are previous non-common bindings
      Query clause: ?car a Car .
      Source data:
        John owns dog0
        Kurt livesIn Cambridge
        Kurt owns car0
        dog0 a Dog
        car0 a Car
        …
      1st Map Key-Val Output:
        {John dog0} - null
        {Kurt car0} - null
        …
      2nd Map Key-Val Output:
        {car0} – null
        …
        {dog0} – {John}
        {car0} – {Kurt}
        …

  15. 2nd Clause Reduce – Join
      Reduce joins partial mappings on common variable bindings with flagged keys.
      2nd Map Key-Val Output:
        {car0} – null
        …
        {dog0} – {John}
        {car0} – {Kurt}
        …
      2nd Reduce Key-Val Output:
        {car0} – {Kurt}
        …
      Process continues over all query clauses.
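The second-clause Map and the joining Reduce from slides 14 and 15 can be sketched as follows. The slides do not say how keys are flagged, so this sketch assumes each value carries a tag, `"clause"` for matches of the current clause and `"previous"` for earlier partial matches; those tags and the function names are illustrative choices, not SHARD's actual encoding.

```python
from collections import defaultdict

SOURCE = [
    ("John", "owns", "dog0"),
    ("Kurt", "livesIn", "Cambridge"),
    ("Kurt", "owns", "car0"),
    ("dog0", "a", "Dog"),
    ("car0", "a", "Car"),
]
# Partial matches left over from the first clause (?person :owns ?car).
PREVIOUS = [
    {"?person": "John", "?car": "dog0"},
    {"?person": "Kurt", "?car": "car0"},
]
CLAUSE = ("?car", "a", "Car")


def bind(clause, triple):
    """Return {variable: value} if the triple matches the clause, else None."""
    bindings = {}
    for term, value in zip(clause, triple):
        if term.startswith("?"):
            if bindings.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return bindings


def second_map(clause, triples, previous):
    """Key both inputs on the variables shared with the current clause."""
    clause_vars = {term for term in clause if term.startswith("?")}
    seen_vars = set(previous[0]) if previous else set()
    common = sorted(clause_vars & seen_vars)
    for triple in triples:
        b = bind(clause, triple)
        if b is not None:
            key = tuple((v, b[v]) for v in common)
            new = {v: x for v, x in b.items() if v not in common}
            yield key, ("clause", new)
    for match in previous:
        key = tuple((v, match[v]) for v in common)
        rest = {v: x for v, x in match.items() if v not in common}
        yield key, ("previous", rest)


def second_reduce(pairs):
    """Join records that share a key and carry opposite flags."""
    groups = defaultdict(lambda: {"clause": [], "previous": []})
    for key, (flag, bindings) in pairs:
        groups[key][flag].append(bindings)
    joined = []
    for key, sides in groups.items():
        for new in sides["clause"]:
            for rest in sides["previous"]:
                joined.append({**dict(key), **rest, **new})
    return joined


print(second_reduce(second_map(CLAUSE, SOURCE, PREVIOUS)))
```

With the slide's data this prints `[{'?car': 'car0', '?person': 'Kurt'}]`: the `dog0` key has only a `"previous"` record, so John's partial match is dropped, which mirrors `{car0} – {Kurt}` being the only Reduce output shown.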

  16. HDFS Graph Storage
      [Figure: data graph. Kurt owns car0; Kurt livesIn Cambridge; car0 a Car; car0 madeBy Ford; car0 madeIn Detroit; Cambridge a City; Detroit a City.]
      Graphs saved as flat-file in HDFS:
      (Portions of file saved on each data node.)
        Kurt owns car0 livesIn Cambridge
        car0 a Car madeBy Ford madeIn Detroit
        Cambridge a City
        Detroit a City
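Reading this flat-file layout back into triples is straightforward. The sketch below assumes whitespace-separated tokens with no spaces inside a term, which the slide does not state; `parse_line` is a hypothetical helper, not part of SHARD.

```python
def parse_line(line):
    """Split 'subject p1 o1 p2 o2 ...' into (subject, predicate, object) triples."""
    tokens = line.split()
    if not tokens:
        return []
    subject, rest = tokens[0], tokens[1:]
    if len(rest) % 2 != 0:
        raise ValueError(f"unpaired predicate/object in line: {line!r}")
    # Walk the remaining tokens two at a time: predicate, then object.
    return [(subject, rest[i], rest[i + 1]) for i in range(0, len(rest), 2)]


FLAT_FILE = """\
Kurt owns car0 livesIn Cambridge
car0 a Car madeBy Ford madeIn Detroit
Cambridge a City
Detroit a City
"""

triples = [t for line in FLAT_FILE.splitlines() for t in parse_line(line)]
for triple in triples:
    print(triple)
```

Because each line is self-contained, a mapper can parse whichever portion of the file its data node holds without consulting any other node.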

  17. HDFS data partitioning
      [Figure: a local client and a name node, with cloud nodes 1 to 4 each labeled Cannon Right, Cannon Left, Cannon Behind.]
      • Hash Partitioning by Default.
      • Neighborhood partitioning would probably provide better performance.
      • R&D opportunity!
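To see why hash partitioning ignores graph neighborhoods, here is a generic sketch that assigns each subject line to a node by hashing the subject. It illustrates hash partitioning in general and is an assumption about the idea, not a description of how HDFS places blocks or how SHARD is implemented; `node_for` and `partition` are made-up names, and CRC32 stands in for whatever hash a real system would use.

```python
import zlib


def node_for(subject, num_nodes):
    """Pick a node index from a stable hash of the subject."""
    return zlib.crc32(subject.encode("utf-8")) % num_nodes


def partition(lines, num_nodes):
    """Distribute flat-file lines across nodes by hashing each line's subject."""
    parts = [[] for _ in range(num_nodes)]
    for line in lines:
        subject = line.split(maxsplit=1)[0]
        parts[node_for(subject, num_nodes)].append(line)
    return parts


LINES = [
    "Kurt owns car0 livesIn Cambridge",
    "car0 a Car madeBy Ford madeIn Detroit",
    "Cambridge a City",
    "Detroit a City",
]

for index, part in enumerate(partition(LINES, 4)):
    print(index, part)
```

The hash looks only at the subject string, so nothing keeps `Kurt` and `car0` on the same node even though they are neighbors in the graph. A neighborhood-aware scheme would try to co-locate them so that joins move less data.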

  18. Query Processing Implementation
      • BBN-developed query processor.
        – Starting integration with “standard” interfaces
          • Jena, Sesame.
      • SHARD supports “most” of SPARQL.
        – Like most commercial triple-stores.
      • Large performance improvements possible with improved query reordering.

  19. Data Persistence Advice from SHARD
      • Down to “bare metal” in HDFS for large-scale efficiency.
        – No Berkeley DB, no C-stores, …. Nothing.
      • Simple data storage as flat files.
        – Lists of (predicate, object) pairs for every subject by line.
        – Ex: Kurt owns car0 livesIn Cambridge
      • Simple often really is better…
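Producing the one-line-per-subject layout from a list of triples takes only a grouping step. This is a minimal sketch that assumes space-separated terms and in-memory data; `to_lines` is a hypothetical helper rather than SHARD's loader.

```python
from collections import defaultdict


def to_lines(triples):
    """Group triples by subject and emit 'subject p1 o1 p2 o2 ...' lines."""
    by_subject = defaultdict(list)
    for subject, predicate, obj in triples:
        by_subject[subject].extend((predicate, obj))
    # Subjects come out in first-seen order because dicts keep insertion order.
    return [" ".join([subject, *pairs]) for subject, pairs in by_subject.items()]


TRIPLES = [
    ("Kurt", "owns", "car0"),
    ("car0", "a", "Car"),
    ("Kurt", "livesIn", "Cambridge"),
]

for line in to_lines(TRIPLES):
    print(line)
```

The first printed line is `Kurt owns car0 livesIn Cambridge`, matching the slide's example. Keeping all of a subject's edges on one line means clauses that share a subject can be checked against a single record.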

  20. Test Data
      • Deployed code on Amazon EC2 cloud.
        – 19 XL nodes.
      • LUBM (Lehigh Univ. BenchMark)
        – Artificial data on students, professors, courses, etc… at universities.
      • 800 million edge graph.
        – 6000 LUBM university dataset.
      • In general, performed comparably to “industrial” monolithic triple-stores.

  21. Performance Comparison
      Query Type                                             SHARD                       Parliament+Sesame  Parliament+Jena
      Simple Query, Small Response: Triple Lookup (Query 1)  404 sec. (approx 0.1 hr.)   0.1hr              0.001hr
      Triangular Query (Query 9)                             740 sec. (approx 0.2 hr.)   1hr                1hr
      Simple Query, Large Response (Query 14)                118 sec. (approx 0.03 hr.)  1hr                5hr

  22. Insight from Query Performance
      • SHARD is not optimal for edge look-ups.
        – This could be expected – SHARD (and MapReduce implementations) have no real indexing support.
      • SHARD does well where large portions of dataset need to be processed.
        – Ex:
          • Multiple join operations
          • Return large datasets
        – This behavior is an artifact of parallel searching and joining operation native to Clause-Iteration.

  23. Design Insights
      • Abstraction is a big win.
        – Surprisingly economical for development.
      • Lack of indexing limits look-up capabilities.
        – This may not be so bad for some applications
        – Index will also need to be continually updated as data added.

  24. Design Insights – Data Partitioning
      • Data linking may be a big win to reduce join overhead and reduce need for iterations over clauses.
        – A first step would be advanced data partitioning.
        – Done some in Cloud9, but still wide open for even basic R&D implementations.
      • Advanced data partitioning would also minimize overhead of moving intermediate results between compute nodes.
        – This seemed to be biggest bottleneck.

  25. Design Insights – Query Processing
      • Query pre-processing may also be a big win.
        – Could also greatly reduce amount of data carried between nodes during join operations.
      • Subject-Iteration may be an alternative approach for queries with strongly connected source nodes.
        – Iterate over query subject rather than clauses.

  26. Thanks! Questions?
      Kurt Rohloff
      krohloff@bbn.com
      @avometric
