

SLIDE 1

Lawrence Livermore National Laboratory

Evaluating Use of Data Flow Systems for Large Graph Analysis

Andy Yoo and Ian Kaplan

Lawrence Livermore National Laboratory, P. O. Box 808, Livermore, CA 94551 This work performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344

SLIDE 2

Graph mining techniques have been widely used in many important applications in recent years

Graph mining extracts information by analyzing relations and structures in graphs (such as ER graphs)

So-called “scale-free” graphs can carry rich information


SLIDE 3

Graph Mining Applications: Web Search

Google’s PageRank uses a web graph to rank web pages for given queries.

Related applications and techniques:
– Personalized web search
– People search
– Eigenvalue/eigenvector computation
– Random walk with restart

SLIDE 4

Graph Mining Applications: Social Network Analysis

Zachary’s Karate Club (1977): divided into two groups centered around two individuals, 1 and 34.

Community detection algorithms can identify the two communities (e.g., Girvan and Newman, 2002).

Further analysis reveals detailed community structures in the graph (e.g., van Dongen, 2000 and Palla, 2005).

SLIDE 5

Graph Mining Applications: Protein Clustering

Can discover proteins with similar functions by clustering protein modules in protein-protein interaction graphs.

Protein-protein interaction network of yeast (Adamcsek et al., Bioinformatics, 22:1021, 2006)

SLIDE 6

Graph Mining Applications: National Security

Apply subgraph pattern matching algorithms to intelligence analysis (e.g., J. Ullman, 1976; T. Coffman, S. Greenblatt, and S. Marcus, “Graph-based technologies for intelligence analysis,” ACM, 2004).

Other related applications:
– Exact and inexact pattern discovery
– Fraud detection
– Cyber security
– Behavioral prediction

SLIDE 7

Challenges

High complexity of graph mining algorithms
– Common graph mining algorithms have high-order computational complexity
  • High-order algorithms (O(N^2) or worse): PageRank, community finding, path traversal
  • NP-complete problems: maximal cliques, subgraph pattern matching

Large data size requires out-of-core approaches
– Graphs with 10^9+ nodes and edges are increasingly common
– Intermediate results grow exponentially in many cases

SLIDE 8

Traditional relational databases have been used in large graph analysis

Due to their prevalence and ease of use, conventional database systems have been used in graph analysis, even though they are designed for transaction processing.

Poor performance and scalability:
– 300B-node graph search on Netezza on a 700-node NPS (SC’06)
– 120B-node graph search on a 60-node MSSG (Cluster ’06)

Distribution of response time for 100 bi-directional searches:
< 1 minute: 26%, 1–2 minutes: 37%, 2–5 minutes: 30%, 5–10 minutes: 2%, 10+ minutes: 5%

SLIDE 9

Many-tasks paradigm is currently used for analyzing large data sets: Map/Reduce

Map/Reduce is a popular many-tasks model used for a wide range of applications.

Map/Reduce model:
– A M/R program consists of many map and reduce tasks
– Each task works independently
– Data passes between mappers and reducers via intermediate files
– Processes lists of (key, value) pairs

Is Map/Reduce for everything?
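The (key, value) flow above can be sketched in a few lines of Python — a minimal, hypothetical in-memory word count, not code from the talk; a real framework would write the intermediate pairs to files between the phases:

```python
from itertools import groupby
from operator import itemgetter

def map_task(doc):
    # Each mapper works independently, emitting (key, value) pairs.
    return [(word, 1) for word in doc.split()]

def reduce_task(key, values):
    # Each reducer sees all values for one key.
    return (key, sum(values))

def map_reduce(docs):
    # Shuffle phase: group the intermediate pairs by key
    # (a real framework spills these to intermediate files).
    pairs = sorted((p for d in docs for p in map_task(d)), key=itemgetter(0))
    return dict(reduce_task(k, [v for _, v in g])
                for k, g in groupby(pairs, key=itemgetter(0)))

print(map_reduce(["graph mining", "graph analysis"]))
# {'analysis': 1, 'graph': 2, 'mining': 1}
```

Because each map and reduce task touches only its own inputs, the tasks parallelize trivially — which is exactly the embarrassingly parallel shape the next slide calls out as a limitation for graph work.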

SLIDE 10

Map/Reduce model is too limited for large complex graph analysis

Map/Reduce has been used successfully for some applications:
– Inverted index construction
– Distributed sort
– Term-vector calculation
– PageRank

Drawbacks:
– Model limited to embarrassingly parallel applications
– Poor performance and scalability (due to poor handling of intermediate results)

BFS search results:
System       Platform                Time (sec)
Map/Reduce   20-node Fenix cluster   1068
SGRACE       64-node Tuson cluster   221

The full PubMed graph with 30 million vertices and 500 million edges was used, except for SGRACE, for which a synthetic graph with 25 million vertices and 125 million edges was used.

333.75 sec / 64 nodes

SLIDE 11

Dataflow model is a promising alternative to address these issues

Many independent tasks access external data in parallel, realizing data parallelism:
– Tasks triggered by the availability of data
– No flow of control
– Data parallel and independent

More flexible and complex than Map/Reduce (Map/Reduce on steroids!)

We evaluated the use of the dataflow model for large graph analysis in this work.

[Figure: Dryad dataflow diagram]

SLIDE 12

We measured the performance of graph algorithms on an actual dataflow machine: Data Analytic Supercomputer

DAS vs. RDBMS

DAS:
  • Parallel dataflow engine on commodity clusters
  • Specialized high-performance library
  • Streaming data pipelined for maximum in-memory processing
  • Sequentialized disk accesses
  • Optimized for SORT and JOIN operations
  • Offers great flexibility for optimization

RDBMS:
  • Sequential or parallel relational database systems on commodity HW
  • Optimized for transaction processing
  • Ubiquitous
  • Relatively easy to use
  • Relies on SQL compiler for optimization
SLIDE 13

DAS programming and execution environment

Uses ECL, a proprietary dataflow language.

Built-in ECL data manipulation constructs (JOIN, SORT, MERGE, etc.) are implemented in a highly optimized library. Unlike SQL, these low-level constructs are suitable for complex graph operations.

Toolchain (from the slide): ECL Code → ECL Compiler + ECL Library → C++ Code → Executable → CE CE CE …

SLIDE 14

An example ECL code

[The ECL code listing on this slide did not survive extraction.]

SLIDE 15

We evaluated some of the most commonly used applications in our experiments

Applications evaluated on the DAS system:

– Path Traversal: uni- and bi-directional BFS
– Pattern Matching: find subgraphs that match a given template
– TeraByte (TB) Sort: Jim Gray’s SORT benchmark
– PageRank: eigenvector computation using the power method
– Disambiguation: binning-based coreference resolution

SLIDE 16

Real-world graphs are used in our performance experiments

[PubMed graph schema] Node types: Article, Author, Journal, Journal Issue, Grant, Grant Agency, Chemical, Keyword, ContactInfo, MeshHeading. Edge types: IsAuthorOf, PublishedIn, IsIssueOf, FundedByGrant, IssuedGrant, HasChemical, HasKeyword, HasContactInfo, HasMeshHeading.

              PubMed Sm   PubMed Lg
|V|           1M          29M
|E|           2M          270M
Raw data size 400 MB      127 GB

SLIDE 17

Path Traversal: Breadth-first search (BFS) on DAS

[BFS illustration: Source → … → Destination]

Improved performance by constructing an adjacency list via denormalization, which reduces the number of rows to join.

BFS times on the large PubMed data (seconds):
                Edge List   Adjacency List (Denormalized)
Unidirectional  287.926     120.359
Bidirectional   204.90      56.431
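The denormalization idea can be sketched in Python (an illustrative stand-in, not the ECL run on DAS): pre-grouping the edge list into one adjacency row per source vertex turns each frontier expansion into a single lookup per frontier vertex, instead of a join against every edge row.

```python
from collections import defaultdict

def denormalize(edge_list):
    # Collapse the edge list into one adjacency row per source vertex.
    adj = defaultdict(list)
    for src, dst in edge_list:
        adj[src].append(dst)
    return adj

def bfs_levels(adj, source):
    # Level-synchronous BFS: expand the whole frontier each round.
    level, frontier, seen = {source: 0}, {source}, {source}
    depth = 0
    while frontier:
        depth += 1
        nxt = {v for u in frontier for v in adj.get(u, []) if v not in seen}
        for v in nxt:
            level[v] = depth
        seen |= nxt
        frontier = nxt
    return level

edges = [(1, 2), (1, 3), (2, 4), (3, 4), (4, 5)]
print(bfs_levels(denormalize(edges), 1))
# == {1: 0, 2: 1, 3: 1, 4: 2, 5: 3}
```

Denormalization trades storage for join fan-out, which matches the 2–4X improvement the table reports.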

SLIDE 18

DAS system is ideal for handling complex subgraph pattern queries on large data sets

[The example subgraph pattern query descriptions on this slide did not survive extraction.]
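Although the query texts were lost in extraction, the general technique — expressing a subgraph template as joins over the edge relation — can be sketched in Python (an illustrative stand-in, not the ECL queries run on DAS). Here a hypothetical two-hop path template a → b → c is matched by self-joining the edge table on the middle vertex:

```python
def join_path2(edges):
    # Match the template a -> b -> c by self-joining the edge table on b.
    # A dataflow engine would SORT both inputs on the join key and stream them.
    by_src = {}
    for src, dst in edges:
        by_src.setdefault(src, []).append(dst)
    return [(a, b, c) for a, b in edges for c in by_src.get(b, [])]

edges = [("x", "y"), ("y", "z"), ("y", "w")]
print(sorted(join_path2(edges)))
# [('x', 'y', 'w'), ('x', 'y', 'z')]
```

Larger templates chain more joins, which is why a system optimized for SORT and JOIN handles these queries well.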

SLIDE 19

DAS system is ideal for handling complex subgraph pattern queries on large data sets (Cont’d)

[Further example query descriptions on this slide did not survive extraction.]

SLIDE 20

Query Performance for Large PubMed (30M nodes)

Query times (seconds):

         DAS (20 nodes)   Netezza (54 nodes)   YADM (4 nodes)
Query 1  9.422            27.3                 120.00
Query 2  142.099          834.47               930.00
Query 3  469.511          15392.96             10188.00
Query 4  37.803           741.42               667.00
Query 5  44.600           496.48               N/A

~250 – ~300X speedup

SLIDE 21

DAS system still outperforms other SQL machines in price/performance

Metric = Time / (#Spindles × Cost):

         DAS           Netezza       YADM
Query 1  2.51253E-05   6.66144E-05   0.0004
Query 2  0.000378931   0.00203618    0.0031
Query 3  0.001252029   0.037560164   0.03396
Query 4  0.000100808   0.001809129   0.002223333
Query 5  0.000118933   0.001211454   N/A

SLIDE 22

LNSSI/LLNL measured Terabyte Sort (TB Sort) performance

  • One of the sort benchmarks that measures the elapsed time to sort 10^12 bytes of data
  • Yahoo holds the current record (as of March 2009): 3.48 minutes on 910 nodes (4 dual-core processors, 4 disks, 8 GB memory), using Hadoop Map/Reduce
  • Performed TB sort on a 20-node DAS system
  • Achieved a 2X speedup by:
    – Radix-based distribution and local sort, which makes tasks independent
    – Optimized SORT operation

Apache Hadoop: 03:20:44
DAS: 01:39:26
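The radix-based distribution can be sketched as follows (a hypothetical Python illustration, not the DAS implementation): records are partitioned by their leading byte, so each bucket can be sorted locally and independently, and concatenating the buckets in radix order is already globally sorted.

```python
def radix_distribute_sort(records, buckets=4):
    # Distribute: bucket i receives only keys whose leading byte falls in
    # range i, so every key in bucket i sorts before every key in bucket i+1.
    bins = [[] for _ in range(buckets)]
    for rec in records:
        bins[rec[0] * buckets // 256].append(rec)
    # Local sort: each bucket is sorted independently (a separate task);
    # concatenation in bucket order yields a globally sorted output.
    return [rec for b in bins for rec in sorted(b)]

data = [b"pear", b"apple", b"zebra", b"fig"]
print(radix_distribute_sort(data))
# [b'apple', b'fig', b'pear', b'zebra']
```

Because no bucket ever needs data from another, there is no merge phase — which is the independence the slide credits for the speedup.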

SLIDE 23

Found some key people in the large Enron email graph by running the PageRank algorithm

The data set has 4,022 Enron employees and 51,078 emails. The top 30 high scorers include some notable names:
– Jeff Dasovich
– Louise Kitchen
– Tana Jones
– John Lavorato

Took 7 seconds to run on DAS.
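The power-method PageRank used here can be sketched on a toy graph (illustrative Python — the graph, function name, and damping factor are assumptions, not the ECL implementation or the Enron data):

```python
def pagerank(adj, damping=0.85, iters=50):
    # Power method: repeatedly apply the PageRank update until the
    # rank vector (the dominant eigenvector) stabilizes.
    n = len(adj)
    rank = {v: 1.0 / n for v in adj}
    for _ in range(iters):
        nxt = {v: (1 - damping) / n for v in adj}
        for u, outs in adj.items():
            share = damping * rank[u] / len(outs) if outs else 0.0
            for v in outs:
                nxt[v] += share
        rank = nxt
    return rank

# Toy graph: 'a' and 'b' both point at 'c', which points back at 'a'.
adj = {"a": ["c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(adj)
print(max(ranks, key=ranks.get))
# c
```

Each iteration is one sparse matrix–vector product, expressible as a JOIN of the rank table with the edge table followed by an aggregation — a natural fit for the dataflow operators above.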

SLIDE 24

Developed scalable algorithm for author disambiguation in many-tasks paradigm

LLNL has developed an entity resolution algorithm based on a binning (or blocking) algorithm.

The original algorithm could not complete on the full PubMed data set:
– Only 67% completed (in 2 months)
– Could not resolve bins with 300+ names

Uses DAS as an active-disk system: “bring computation to where the data is, instead of moving data from the data store.”

Achieved orders-of-magnitude performance improvement.

SLIDE 25

Distributed disambiguation algorithm: DAS as an Active Disk

[Diagram: original (sequential) algorithm vs. many-tasks disambiguation algorithm]

Original (sequential) algorithm: a Reader pulls author info (abstracts, titles) from a MySQL RDBMS via JDBC, and the Binning Algorithm (BA) bins it by coauthors and keywords; the MySQL RDBMS is the performance bottleneck.

Many-tasks disambiguation algorithm: Binner → Resolver → disambiguated authors. The binning algorithm works only on local data, run by many tasks in parallel. Able to process 45,529,883 bins in 20 hours!
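The binning (blocking) step can be sketched in Python (a hypothetical illustration of blocking in general — the bin key and matching rule are invented, not LLNL's resolver): records are grouped by a cheap key, and the expensive pairwise comparison runs only within each bin, so the bins are independent tasks.

```python
from collections import defaultdict
from itertools import combinations

def bin_records(records):
    # Cheap blocking key: last name + first initial.
    bins = defaultdict(list)
    for rec in records:
        last, first = rec["name"].split(",")
        bins[(last.strip().lower(), first.strip()[0].lower())].append(rec)
    return bins

def resolve_bin(recs):
    # Expensive comparison, confined to one bin: a toy rule that
    # two records co-refer if they share a coauthor.
    return [(a["id"], b["id"]) for a, b in combinations(recs, 2)
            if set(a["coauthors"]) & set(b["coauthors"])]

records = [
    {"id": 1, "name": "Smith, John", "coauthors": ["Lee"]},
    {"id": 2, "name": "Smith, J.", "coauthors": ["Lee", "Kim"]},
    {"id": 3, "name": "Jones, Tana", "coauthors": ["Kim"]},
]
matches = [m for recs in bin_records(records).values() for m in resolve_bin(recs)]
print(matches)
# [(1, 2)]
```

Since every comparison stays inside one bin, bins can be shipped to wherever their data lives — the active-disk idea on the previous slide.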

SLIDE 26

Many-tasks model has enabled efficient large graph analysis

High performance and scalability are feasible with the many-tasks approach.

Benefits:
– Enables data parallelism on large-scale data
– Reduces communication via independent, localized tasks
– Enables optimization of tasks for built-in constructs
– Combines complexity and flexibility

SLIDE 27

Conclusions

Studied the use of the many-tasks model for large, complex graph analysis.

Evaluated the performance of a comprehensive set of graph applications, including subgraph pattern queries, on an actual dataflow system.

The many-tasks paradigm is a very promising approach for graph mining applications and offers many advantages over contemporary methods like RDBMS and Map/Reduce.

SLIDE 28

Thank you
