Final Project Presentation, Dec. 2nd, 2013, Qing Zheng & Atreyee Maiti


slide-1
SLIDE 1

15799

Final Project Presentation

  • Dec. 2nd, 2013

Qing Zheng & Atreyee Maiti

slide-2
SLIDE 2

Goals

Graph Queries

  • How do different DBs handle large graphs?
  • What are the differences in performance?
  • Which DB should be used for a specific use case?

slide-4
SLIDE 4

Datasets/System

Neo4j vs. MySQL

  • The most popular open-source DB for each community

Wikipedia Datasets

  • Reasonably big, easily accessible, and familiar to most people

slide-5
SLIDE 5

Experimental Settings

Amazon EC2

  • Neo4j 1.9.5
  • MySQL 5.5.34
  • Ubuntu 12.04.3
  • 1 CPU, 4GB RAM, 410GB Disk (m1.medium)

slide-8
SLIDE 8

Queries

  • Six-Degree
  • Shortest Path
  • Most Cited Page

Benchmarks

[Figure: benchmark query diagrams, annotated "<6" and "<6 min"]

slide-9
SLIDE 9

API

Client Interface

  • SQL for MySQL
  • Java API for Neo4j

slide-11
SLIDE 11

Results

Six Degree

[Figure: result graph with nodes labeled 1, 2, 2, 2, 2, 2, 3, 3]

slide-13
SLIDE 13

Results

Most Cited Page

4,673,396

slide-14
SLIDE 14

Performance

[Figure: bar charts of Neo4j vs. MySQL run times in seconds, cold and warm, for three queries: Six-Degree (axis 3 to 18), Shortest-Path (axis 500 to 3000), and Most-Cited-Page (axis 1800 to 10800)]

slide-15
SLIDE 15
slide-16
SLIDE 16

ANALYSIS: SYSTEM-WISE

MYSQL

slide-17
SLIDE 17

Storage Engine

INNODB

  • Reliable, high-performance transactional engine

MYISAM

  • Read-optimized, data-warehouse class engine
  • Dedicated in-memory buffer for index blocks
  • Uses OS page cache for buffering data blocks

slide-18
SLIDE 18

Bulk Loading

Best Practices

  • Convert SQL inserts into raw CSV files
  • Build indices after data is fully loaded
  • Increase "myisam_sort_buffer_size"
  • Add more memory

slide-19
SLIDE 19

Tuning

Optimizing for workloads

  • Compression (total table size after compression: 26G)
  • Redesign table schemas
  • Add/remove indices
  • Set index cache to 25% of the RAM
  • Disable query cache (not as an optimization)

slide-20
SLIDE 20

Schema Profile

Wiki Datasets

  • 31,293,738 pages
  • 709,804,739 links


slide-25
SLIDE 25

Six Degree Query

Breadth-First Search

[Figure: example graph from source s to destination d, nodes labeled 1, 1, 2, 2, 2, 2, 1]

  • Group By / Subquery
  • Insert Ignore Into …
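The level-by-level expansion named on this slide can be sketched in Python. This is a minimal in-memory illustration, not the SQL used in the project: the `links` dictionary, the function name `within_degrees`, and the toy graph are all assumptions. The `distance` dictionary stands in for the temporary table, and the membership check plays the role that de-duplication (Group By or Insert Ignore) plays there.

```python
from typing import Dict, List, Set


def within_degrees(links: Dict[str, List[str]], start: str,
                   max_depth: int = 6) -> Dict[str, int]:
    """Map every page reachable from `start` within `max_depth` hops
    to its hop distance, expanding one BFS level per iteration."""
    distance = {start: 0}          # pages seen so far and the level they were found at
    frontier: Set[str] = {start}   # pages discovered in the previous level
    for depth in range(1, max_depth + 1):
        next_frontier: Set[str] = set()
        for page in frontier:
            for target in links.get(page, []):
                # Skip pages seen before, much like INSERT IGNORE on a unique key.
                if target not in distance:
                    distance[target] = depth
                    next_frontier.add(target)
        if not next_frontier:      # nothing new was found, so stop early
            break
        frontier = next_frontier
    return distance


# Hypothetical toy graph, not taken from the Wikipedia dataset.
toy_links = {"s": ["a", "b"], "a": ["c"], "b": ["c", "d"], "c": ["d"]}
print(within_degrees(toy_links, "s", max_depth=2))
```

Keeping only first-seen pages is what keeps each level small; without that check the frontier would carry every duplicate path forward.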

slide-27
SLIDE 27

Six Degree Query

Ignoring Breadth-First Search

  • 1/44th index block read requests
  • No additional sorting
  • 5x more rows in temporary tables

>>> 20x performance boost

Need to keep temp table short!

slide-28
SLIDE 28

Six Degree Query

Ignoring Breadth-First Search

Start page         1              2                 3
Adolf-Hitler       1,210/1,211    85,829/340,632
Walk-to-the-Sky    19/19          9,270/11,743      619,132/2,594,398

slide-29
SLIDE 29

Six Degree Query

Bidirectional Breadth-First Search

  • 1/34th rows in temporary tables
  • 1/386th index block read requests
  • 1/5th index block write requests

>>> 720x additional performance boost
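A bidirectional search of this kind can be sketched as follows. This is an in-memory illustration under stated assumptions, not the project's SQL: the `links` dictionary, the helper `reverse_links`, and the function name `bidirectional_hops` are hypothetical. The sketch always expands the smaller of the two frontiers, which is one common way to keep the intermediate sets short.

```python
from typing import Dict, List, Optional


def reverse_links(links: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Build target -> sources, so the search can also walk backwards."""
    rev: Dict[str, List[str]] = {}
    for src, targets in links.items():
        for dst in targets:
            rev.setdefault(dst, []).append(src)
    return rev


def bidirectional_hops(links: Dict[str, List[str]], start: str, goal: str,
                       max_depth: int = 6) -> Optional[int]:
    """Return the hop count of a shortest directed path from start to goal,
    or None if there is no path within max_depth hops."""
    if start == goal:
        return 0
    back = reverse_links(links)
    dist_f, dist_b = {start: 0}, {goal: 0}
    front_f, front_b = {start}, {goal}
    for _ in range(max_depth):             # each iteration explores one more hop
        expand_forward = len(front_f) <= len(front_b)
        if expand_forward:
            graph, dist, other, frontier = links, dist_f, dist_b, front_f
        else:
            graph, dist, other, frontier = back, dist_b, dist_f, front_b
        next_frontier = set()
        for page in frontier:
            for nbr in graph.get(page, []):
                if nbr in other:           # the two searches have met
                    return dist[page] + 1 + other[nbr]
                if nbr not in dist:
                    dist[nbr] = dist[page] + 1
                    next_frontier.add(nbr)
        if not next_frontier:              # this side is exhausted: no path
            return None
        if expand_forward:
            front_f = next_frontier
        else:
            front_b = next_frontier
    return None


# Hypothetical toy graph.
toy_links = {"s": ["a", "b"], "a": ["c"], "b": ["c"], "c": ["d"], "d": []}
print(bidirectional_hops(toy_links, "s", "d"))  # 3
```

Because each side only has to reach roughly half the path length, the two frontiers together usually touch far fewer pages than a single search from the source.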

slide-30
SLIDE 30

Shortest Path Query

Bidirectional Batched Shortest Path

>>> 318x performance boost

slide-31
SLIDE 31

Shortest Path Query

[Figure: per-query shortest-path run times in seconds (axis 0.2 to 1.6) for queries labeled A, 42, G, Pt, T, M, N, AH, WttS, S, JG, Ah, R, W; one bar is annotated 2,786 secs]

slide-36
SLIDE 36

Most Cited Page

Count(*) & Group-By & Order-By & Limit

+----+-------------+-------+-------+---------------+---------+---------+------+-----------+----------------------------------------------+
| id | select_type | table | type  | possible_keys | key     | key_len | ref  | rows      | Extra                                        |
+----+-------------+-------+-------+---------------+---------+---------+------+-----------+----------------------------------------------+
|  1 | SIMPLE      | links | index | NULL          | REVERSE | 8       | NULL | 709804739 | Using index; Using temporary; Using filesort |
+----+-------------+-------+-------+---------------+---------+---------+------+-----------+----------------------------------------------+
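The shape of this query (count, group, order, limit) can be illustrated with a short Python sketch. The slides do not show the SQL text or the column names, so the `(source, target)` tuple layout, the function name `most_cited`, and the toy data are assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def most_cited(links: Iterable[Tuple[int, int]], limit: int = 1) -> List[Tuple[int, int]]:
    """Count incoming links per target page and return the top `limit` pages
    as (page_id, count) pairs, highest count first."""
    # One full pass over all links, grouped by target (the COUNT(*) / GROUP BY part).
    counts = Counter(target for _source, target in links)
    # Order by count and keep the first rows (the ORDER BY / LIMIT part).
    return counts.most_common(limit)


# Hypothetical (source, target) pairs.
toy = [(1, 3), (2, 3), (4, 3), (1, 2), (4, 2), (3, 1)]
print(most_cited(toy))  # [(3, 3)]
```

As in the plan above, every link row has to be visited once, so the cost is dominated by the scan rather than by the final ordering.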

slide-39
SLIDE 39

Most Cited Page

Sort Buffer

  • 2MB: 33 merge passes
  • 8MB: 8 merge passes
  • 64MB: 1 merge pass

>>> 0x performance improvement

45x more rows scanned than sorted, so the scan dominates the run time

slide-40
SLIDE 40

Quick Summary

MySQL-MyISAM

  • Loading takes time
  • Pay attention to query algorithms
  • Limited performance for large joins
  • Nice documentation with good out-of-box performance for analysis

slide-41
SLIDE 41

ANALYSIS: SYSTEM-WISE

NEO4J

slide-42
SLIDE 42

Data cleaning/importing

Importing tool

  • Use of graphipedia to import the compressed dataset
  • LinkExtractor transforms the XML dump into a links XML
  • Import graph uses the links to create nodes and then relationships; it also indexes the data

Graph Structure

  • Node: pages with property "title"
  • Relationship: "Link"
  • Lucene index

slide-43
SLIDE 43

Algorithm implementation

Neo4j GraphAlgoFactory

Benchmark          Implementation
six degree         findSinglePath with max depth
shortest path      shortestPath
most cited node    get all relationships, maintain count

slide-44
SLIDE 44

Internals

[Figure: nodes A and B joined by a KNOWS relationship, with properties Name: Qing and Age: 24]

  • Records are basically linked lists of nodes and relationships; performance suffers when a lot of linked lists must be traversed (e.g. most cited page)
  • The major win is joins; beyond that, performance depends on configuration and resource availability

slide-45
SLIDE 45

Caching

Two types

  • File buffer caching: uses native I/O to cache store-file data in memory, in a representation similar to the one on disk, for fast traversal
  • Object caching: uses the area allocated for the heap; caches individual nodes and relationships and their properties in a form that is optimized for fast traversal of the graph; relies on garbage collection for eviction from the cache in an LRU manner

Cache levels

  • in heap
  • in file buffer cache
  • disk
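The three cache levels can be pictured as a read path that falls through from the heap to the file buffer to disk. The sketch below is a toy model, not Neo4j code: the class `TieredStore`, its capacities, and the byte-string "records" are all assumptions, and it uses an explicit LRU list where Neo4j's object cache relies on garbage collection for eviction.

```python
from collections import OrderedDict
from typing import Dict


class TieredStore:
    """Toy read path: object cache (heap) -> file buffer cache -> disk."""

    def __init__(self, disk: Dict[int, bytes], object_capacity: int, buffer_capacity: int):
        self.disk = disk
        self.object_cache: "OrderedDict[int, str]" = OrderedDict()
        self.buffer_cache: "OrderedDict[int, bytes]" = OrderedDict()
        self.object_capacity = object_capacity
        self.buffer_capacity = buffer_capacity
        self.hits = {"object": 0, "buffer": 0, "disk": 0}

    @staticmethod
    def _put(cache: OrderedDict, key, value, capacity: int) -> None:
        cache[key] = value
        cache.move_to_end(key)            # mark as most recently used
        if len(cache) > capacity:
            cache.popitem(last=False)     # evict the least recently used entry

    def read(self, record_id: int) -> str:
        if record_id in self.object_cache:            # level 1: in heap
            self.hits["object"] += 1
            self.object_cache.move_to_end(record_id)
            return self.object_cache[record_id]
        if record_id in self.buffer_cache:            # level 2: file buffer cache
            self.hits["buffer"] += 1
            self.buffer_cache.move_to_end(record_id)
            raw = self.buffer_cache[record_id]
        else:                                         # level 3: disk
            self.hits["disk"] += 1
            raw = self.disk[record_id]
            self._put(self.buffer_cache, record_id, raw, self.buffer_capacity)
        obj = raw.decode("utf-8")         # turn the raw record into an object
        self._put(self.object_cache, record_id, obj, self.object_capacity)
        return obj


store = TieredStore({1: b"page-1", 2: b"page-2"}, object_capacity=1, buffer_capacity=2)
for record in (1, 2, 1):
    store.read(record)
print(store.hits)  # {'object': 0, 'buffer': 1, 'disk': 2}
```

With an object cache that holds only one record, the third read misses the heap but is still served from the file buffer, which is the situation a cold versus warm comparison is sensitive to.
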
slide-46
SLIDE 46
slide-47
SLIDE 47

Tuning/choices made

JVM options:

  • initial heap size = 512m
  • max heap size = 1024m
  • CMSInitiatingOccupancyFraction=50
  • UseConcMarkSweepGC

Cache type: weak cache type (object cache)

  • Provides short life span for cached objects. Suitable for high-throughput applications where a larger portion of the graph than what can fit into memory is frequently accessed.

Memory mapping options (based on sizes of the corresponding store files):

  • nodes = 200M
  • relationships = 5G
  • propertystore = 500M

slide-48
SLIDE 48

Optimizations impact

Make a lot of difference for long-running queries

  • Most cited node with default tunings => 23 minutes
  • With optimizations => ~5 mins!!

slide-49
SLIDE 49

Run time split

  • reading into memory (on warm runs, reads are served from memory)
  • populate file buffer cache
  • waiting on gc
  • query execution
slide-50
SLIDE 50

More Resources

Algorithm          On 8GB RAM, SSD    On 4GB RAM, HDD
Six degree         5 seconds          11 seconds
Shortest path      5 seconds          11 seconds
Most cited node    161 seconds        256 seconds
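For a quick sense of scale, the ratios between the two configurations can be computed directly. The numbers are copied from the table above; treating the second column as seconds is an assumption, since the slide states the unit only for the first column, and the helper name `speedup` is hypothetical.

```python
# Run times from the table: (8GB RAM + SSD, 4GB RAM + HDD).
# The second value is assumed to be in seconds as well.
runtimes = {
    "Six degree": (5, 11),
    "Shortest path": (5, 11),
    "Most cited node": (161, 256),
}


def speedup(fast: float, slow: float) -> float:
    """How many times faster the `fast` run is than the `slow` run."""
    return slow / fast


for name, (larger_machine, smaller_machine) in runtimes.items():
    ratio = speedup(larger_machine, smaller_machine)
    print(f"{name}: {ratio:.2f}x faster with more RAM and an SSD")
```

The traversal queries gain about 2.2x, while the scan-heavy most-cited query gains about 1.6x.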

slide-51
SLIDE 51
slide-52
SLIDE 52

CONCLUSION

slide-53
SLIDE 53

Takeaways

  • For running graph algorithms that require joins on a relational DB, the algorithm needs to be optimized to a large extent; for graph algorithms with an unknown destination, MySQL is very poor
  • For scanning entire tables, MySQL is good but Neo4j performs poorly
  • Neo4j has a lot of knobs that can help it perform fast, but they need to be known and explored. Increasing the heap is not always the solution!!
  • MySQL MyISAM query performance is good out of the box; it needs tuning mostly for importing large data
  • Performance also depends on the structure of your graph: how large it is and its fan-out
  • All in all, the decision should be based on the use case
slide-54
SLIDE 54

System specific learnings

Description                                                           System            Comments
Highly connected nodes problem                                        Neo4j specific    Neo4j 2.1 will be solving this, but it is not yet released
Neo4j has a huge set of algorithms that can be used out of the box    Neo4j specific
Neo4j community is very active                                        Neo4j specific

slide-55
SLIDE 55

Future work

  • Incorporate algorithms which require multiple hops
  • With same setup:
      • run on PostgreSQL
      • run on SSD
  • On new systems:
      • Graph processing systems
      • SciDB
  • Neo4j feedback

slide-56
SLIDE 56

THANKS!

slide-57
SLIDE 57

Acknowledgements

We thank Professor Andy Pavlo for giving us direction at various points in the project. We are also grateful to AWS for the funding towards running the experiments.

slide-58
SLIDE 58

References

http://www.slideshare.net/thobe/an-overview-of-neo4j-internals
http://event.cwi.nl/grades2013/07-welc.pdf
http://docs.neo4j.org/chunked/milestone/configuration-caches.html
http://www.slideshare.net/markhneedham/football-graph-neo4j-and-the-premier-league
https://github.com/mirkonasato/graphipedia
http://istc-bigdata.org/index.php/benchmarking-graph-databases/
http://dev.mysql.com/doc/refman/5.5/en/index.html
http://dumps.wikimedia.org/
http://vldb.org/pvldb/vol5/p358_jungao_vldb2012.pdf