
Final Project Presentation, Dec. 2nd, 2013, Qing Zheng & Atreyee Maiti



  1. 15799 Final Project Presentation, Dec. 2nd, 2013. Qing Zheng & Atreyee Maiti

  2. Goals: Graph Queries • How do different DBs handle large graphs? • What are the differences in performance? • Which DB to choose for a specific use case?

  3. Datasets/System: Neo4j vs. MySQL • the most popular open-source DB for each community

  4. Datasets/System: Neo4j vs. MySQL • the most popular open-source DB for each community. Wikipedia Datasets • Reasonably big, easily accessible, and familiar to most people

  5. Experimental Settings: Amazon EC2 • Neo4j 1.9.5 • MySQL 5.5.34 • Ubuntu 12.04.3 • 1 CPU, 4GB RAM, 410GB Disk (m1.medium)

  6. Benchmarks: Queries • Six-Degree [diagram label: <6]

  7. Benchmarks: Queries • Six-Degree • Shortest Path [diagram labels: <6, min]

  8. Benchmarks: Queries • Six-Degree • Shortest Path • Most Cited Page [diagram labels: <6, min]

  9. API: Client Interface • SQL for MySQL • Java API for Neo4j

  10. Results: Six Degree

  11. Results: Six Degree [figure: values 2, 2, 3, 1, 2, 2, 2, 3]

  12. Results: Most Cited Page

  13. Results: Most Cited Page: 4,673,396

  14. Performance: MySQL vs. Neo4j [bar charts: run time in seconds, cold vs. warm, in three panels (Six-Degree, Most-Cited-Page, Shortest-Path); axis maxima 18, 3000, and 10800]

  15. MYSQL ANALYSIS, SYSTEM-WISE

  16. Storage Engine: InnoDB • Reliable, high-performance transactional engine. MyISAM • Read-optimized, data-warehouse class engine • Dedicated in-memory buffer for index blocks • Uses OS page cache for buffering data blocks

  17. Bulk Loading: Best Practices • Convert SQL inserts into raw CSV files • Build indices after data is fully loaded • Increase "myisam_sort_buffer_size" • Add more memory
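The first bulk-loading step, converting SQL inserts into CSV, can be sketched as follows. This is a minimal illustration, not the authors' tool: the function names `insert_to_rows` and `rows_to_csv` and the sample table name are hypothetical, and the parser assumes one `INSERT ... VALUES (...),(...);` statement with single-quoted strings and backslash escapes.

```python
import csv
import io

# Escape sequences decoded inside quoted strings; any other character
# following a backslash is taken literally (a simplification).
_ESCAPES = {"n": "\n", "t": "\t", "r": "\r", "0": "\0"}


def insert_to_rows(statement):
    """Yield one list of field strings per tuple of an INSERT ... VALUES statement.

    Unquoted values (numbers, NULL) are returned as their literal text.
    """
    body = statement[statement.index("VALUES") + len("VALUES"):]
    row, field = [], []
    in_string = in_row = False
    i = 0
    while i < len(body):
        ch = body[i]
        if in_string:
            if ch == "\\" and i + 1 < len(body):
                nxt = body[i + 1]
                field.append(_ESCAPES.get(nxt, nxt))
                i += 2
                continue
            if ch == "'":
                in_string = False  # closing quote
            else:
                field.append(ch)
        elif ch == "'":
            in_string = True  # opening quote
        elif ch == "(" and not in_row:
            in_row, row, field = True, [], []
        elif ch == "," and in_row:
            row.append("".join(field))
            field = []
        elif ch == ")" and in_row:
            row.append("".join(field))
            yield row
            in_row = False
        elif in_row and not ch.isspace():
            field.append(ch)
        i += 1


def rows_to_csv(rows):
    """Serialize rows as CSV text with minimal quoting."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()
```

A file produced this way could then be loaded with MySQL's `LOAD DATA INFILE` instead of replaying each insert, with indices built afterwards as the slide recommends.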

  18. Tuning: Optimizing for workloads • Compression (total table size after compression: 26G) • Redesign table schemas • Add/remove indices • Set index cache to 25% of the RAM • Disable query cache (not for optimization)

  19. Schema Profile: Wiki Datasets • 31,293,738 pages • 709,804,739 links

  20. Six Degree Query: Breadth-First Search [diagram: search from source s to destination d, nodes labeled by depth]

  23. Six Degree Query: Breadth-First Search • Group By / Subquery [diagram: search from source s to destination d, nodes labeled by depth]

  24. Six Degree Query: Breadth-First Search • Insert Ignore Into … • Group By / Subquery [diagram: search from source s to destination d, nodes labeled by depth]

  25. Six Degree Query: Ignoring Breadth-First Search • 1/44th of the index block read requests • No additional sorting • 5x more rows in temporary tables >>> 20x performance boost

  26. Six Degree Query: Ignoring Breadth-First Search • 1/44th of the index block read requests • No additional sorting • 5x more rows in temporary tables >>> 20x performance boost. Need to keep temp table short!
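A breadth-first search that lets the database discard already-visited pages on insert might look like the sketch below. It is an illustration under stated assumptions, not the authors' query: it uses SQLite's `INSERT OR IGNORE` through Python's `sqlite3` module as a stand-in for MySQL's `INSERT IGNORE`, and the `links(src, dst)` schema, the `visited` temp table, and the function names are hypothetical.

```python
import sqlite3


def demo_connection():
    """Build a tiny in-memory link graph (hypothetical sample data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE links (src INTEGER, dst INTEGER)")
    conn.execute("CREATE INDEX forward_idx ON links (src, dst)")
    conn.executemany(
        "INSERT INTO links VALUES (?, ?)",
        [(1, 2), (1, 3), (2, 4), (3, 4), (4, 5), (5, 1), (6, 1)],
    )
    return conn


def six_degree(conn, src, dst, max_depth=6):
    """Return the depth at which dst is first reached from src, else None."""
    if src == dst:
        return 0
    cur = conn.cursor()
    cur.execute("DROP TABLE IF EXISTS temp.visited")
    # The primary key makes re-inserting an already visited page a no-op.
    cur.execute(
        "CREATE TEMP TABLE visited (page INTEGER PRIMARY KEY, depth INTEGER)"
    )
    cur.execute("INSERT INTO visited VALUES (?, 0)", (src,))
    for depth in range(1, max_depth + 1):
        # Expand the previous level in one statement; duplicates are ignored,
        # so no GROUP BY or subquery is needed to deduplicate.
        cur.execute(
            "INSERT OR IGNORE INTO visited (page, depth) "
            "SELECT l.dst, ? FROM links AS l "
            "JOIN visited AS v ON v.page = l.src WHERE v.depth = ?",
            (depth, depth - 1),
        )
        if cur.execute(
            "SELECT 1 FROM visited WHERE page = ?", (dst,)
        ).fetchone():
            return depth
        added = cur.execute(
            "SELECT COUNT(*) FROM visited WHERE depth = ?", (depth,)
        ).fetchone()[0]
        if added == 0:  # nothing new was reached, so dst is unreachable
            return None
    return None
```

The trade-off noted on the slide shows up here as well: the `visited` table accumulates every reached page, so its size grows with each level.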

  27. Six Degree Query: Ignoring Breadth-First Search [diagram: search from Adolf-Hitler to Walk-to-the-Sky, levels 0 to 3; per-level counts 19/19, 1,210/1,211, 9,270/11,743, 85,829/340,632, 619,132/2,594,398]

  28. Six Degree Query: Bidirectional Breadth-First Search • 1/34th of the rows in temporary tables • 1/386th of the index block read requests • 1/5th of the index block write requests >>> 720x additional performance boost
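The idea behind the bidirectional search can be sketched in plain Python: grow one frontier forward from the source and one backward from the destination, always expanding the smaller of the two, and stop when they touch. This is an in-memory illustration only; the function name and the edge-list input are assumptions, not the SQL the authors ran.

```python
from collections import defaultdict


def bidirectional_bfs(edges, src, dst, max_depth=6):
    """Return the shortest link distance from src to dst, or None if it
    exceeds max_depth or no path exists. edges is an iterable of (from, to)."""
    if src == dst:
        return 0
    out_links, in_links = defaultdict(set), defaultdict(set)
    for a, b in edges:
        out_links[a].add(b)
        in_links[b].add(a)

    dist_f, dist_b = {src: 0}, {dst: 0}   # distances from src / to dst
    front_f, front_b = {src}, {dst}       # current frontiers
    depth_f = depth_b = 0

    while front_f and front_b and depth_f + depth_b < max_depth:
        forward = len(front_f) <= len(front_b)  # expand the smaller side
        if forward:
            front, dist, other, adj = front_f, dist_f, dist_b, out_links
        else:
            front, dist, other, adj = front_b, dist_b, dist_f, in_links

        best, next_front = None, set()
        for node in front:
            for neighbor in adj.get(node, ()):
                if neighbor in other:
                    # The two searches meet across this edge.
                    total = dist[node] + 1 + other[neighbor]
                    best = total if best is None else min(best, total)
                elif neighbor not in dist:
                    dist[neighbor] = dist[node] + 1
                    next_front.add(neighbor)
        if best is not None:
            return best
        if forward:
            front_f, depth_f = next_front, depth_f + 1
        else:
            front_b, depth_b = next_front, depth_b + 1
    return None
```

Because each side only has to reach roughly half the path length, both frontiers stay far smaller than a single one-directional frontier, which is consistent with the drop in temporary-table rows reported on the slide.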

  29. Shortest Path Query: Bidirectional Batched Shortest Path >>> 318x performance boost

  30. Shortest Path Query [bar chart: time in seconds (axis 0 to 1.6) per test page A, 42, G, Pt, T, M, N, AH, WttS, S, JG, Ah, R, W; annotation: 2,786 secs]

  31. Most Cited Page: COUNT(*), GROUP BY, ORDER BY, and LIMIT

  32. Most Cited Page: COUNT(*), GROUP BY, ORDER BY, and LIMIT. EXPLAIN output:

          +----+-------------+-------+-------+---------------+---------+---------+------+-----------+----------------------------------------------+
          | id | select_type | table | type  | possible_keys | key     | key_len | ref  | rows      | Extra                                        |
          +----+-------------+-------+-------+---------------+---------+---------+------+-----------+----------------------------------------------+
          |  1 | SIMPLE      | links | index | NULL          | REVERSE | 8       | NULL | 709804739 | Using index; Using temporary; Using filesort |
          +----+-------------+-------+-------+---------------+---------+---------+------+-----------+----------------------------------------------+
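A query of the shape named on the slide (count, group, order, limit) can be sketched as follows. The `links` table name comes from the EXPLAIN output; the column names `src` and `dst`, the index name, and the use of SQLite through Python's `sqlite3` in place of MySQL are assumptions made for a runnable illustration.

```python
import sqlite3


def most_cited(conn, limit=1):
    """Return (page, inbound_link_count) rows, most cited first."""
    return conn.execute(
        "SELECT dst, COUNT(*) AS cited FROM links "
        "GROUP BY dst ORDER BY cited DESC, dst ASC LIMIT ?",
        (limit,),
    ).fetchall()


def demo_links():
    """Tiny hypothetical link table with an index led by the target column."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE links (src INTEGER, dst INTEGER)")
    conn.execute("CREATE INDEX reverse_idx ON links (dst, src)")
    conn.executemany(
        "INSERT INTO links VALUES (?, ?)",
        [(1, 2), (1, 3), (2, 4), (3, 4), (4, 5), (5, 1), (6, 1), (5, 4)],
    )
    return conn
```

Whatever index is used, every link row has to be read once to produce the counts, which matches the 709,804,739 rows reported in the EXPLAIN output.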

  36. Most Cited Page: Sort Buffer • 2MB: 33 merge passes • 8MB: 8 merge passes • 64MB: 1 merge pass

  37. Most Cited Page: Sort Buffer • 2MB: 33 merge passes • 8MB: 8 merge passes • 64MB: 1 merge pass >>> 0x performance improvement

  38. Most Cited Page: Sort Buffer • 2MB: 33 merge passes • 8MB: 8 merge passes • 64MB: 1 merge pass >>> 0x performance improvement. 45x more rows scanned than sorted

  39. Quick Summary: MySQL-MyISAM • Loading takes time • Pay attention to query algorithms • Limited performance for large joins • Nice documentation with good out-of-box performance for analysis

  40. NEO4J ANALYSIS, SYSTEM-WISE

  41. Data cleaning/importing: Importing tool • Use of graphipedia to import the compressed dataset - LinkExtractor to transform the XML into a links XML - Import graph, which uses the links to create nodes and then relationships, and also indexes the data. Graph Structure • Node: pages with property "title" • Relationship: "Link" • Lucene index

  42. Algorithm implementation: Neo4j GraphAlgoFactory. Benchmark / Implementation • six degree: findSinglePath with max depth • shortest path: shortestPath • most cited node: get all relationships, maintain count
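The "get all relationships, maintain count" approach for the most cited node amounts to a single pass over every relationship while tallying end nodes. The sketch below shows that pattern in plain Python rather than the Neo4j Java API; the function name and the (start, end) pair representation are hypothetical.

```python
from collections import Counter


def most_cited_node(relationships, top=1):
    """Scan (start, end) pairs once and return the most linked-to end nodes."""
    counts = Counter()
    for _start, end in relationships:
        counts[end] += 1  # one tally per inbound relationship
    return counts.most_common(top)
```

The work is proportional to the number of relationships, and the counter needs one entry per distinct target node.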

  43. Internals [diagram: nodes A and B connected by KNOWS relationships, with properties Name:Qing and Age:24] • Records are basically linked lists of nodes and relationships - suffers when it needs to traverse a lot of linked lists (e.g. most cited page) • The major win is on joins - beyond that, performance depends on configuration and resource availability

  44. Caching: Two types • File buffer caching - uses native I/O to cache data in memory - storage file data is kept in a representation similar to the one on disk for fast traversal • Object caching - uses the area allocated for the heap - caches individual nodes and relationships and their properties in a form that is optimized for fast traversal of the graph - relies on garbage collection for eviction from the cache in an LRU manner. Cache levels • in heap • in file buffer cache • disk
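The three cache levels can be pictured as a lookup that falls through from the heap to the file buffer cache to disk. The sketch below is a conceptual model only: the class and attribute names are hypothetical, and it swaps in explicit LRU eviction with fixed capacities where the slide says the object cache relies on garbage collection.

```python
from collections import OrderedDict


class TieredStore:
    """Look up records through an object cache, a buffer cache, then disk."""

    def __init__(self, disk, object_capacity, buffer_capacity):
        self.disk = disk                      # dict standing in for store files
        self.object_cache = OrderedDict()     # level 1: in heap
        self.buffer_cache = OrderedDict()     # level 2: file buffer cache
        self.object_capacity = object_capacity
        self.buffer_capacity = buffer_capacity
        self.hits = {"heap": 0, "buffer": 0, "disk": 0}

    @staticmethod
    def _put(cache, capacity, key, value):
        cache[key] = value
        cache.move_to_end(key)                # mark as most recently used
        if len(cache) > capacity:
            cache.popitem(last=False)         # evict least recently used

    def get(self, key):
        if key in self.object_cache:
            self.object_cache.move_to_end(key)
            self.hits["heap"] += 1
            return self.object_cache[key]
        if key in self.buffer_cache:
            self.buffer_cache.move_to_end(key)
            self.hits["buffer"] += 1
            value = self.buffer_cache[key]
        else:
            self.hits["disk"] += 1
            value = self.disk[key]
            self._put(self.buffer_cache, self.buffer_capacity, key, value)
        # Promote the record into the object cache for the next traversal.
        self._put(self.object_cache, self.object_capacity, key, value)
        return value
```

In this model a warm run is one where most `get` calls are answered from the first two levels, which is the distinction the cold and warm measurements draw.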
