Towards Efficient Query Processing on Massive Time-Evolving Graphs - PowerPoint Presentation




SLIDE 1

Towards Efficient Query Processing on Massive Time-Evolving Graphs

Arash Fard, Amir Abdolrashidi, Lakshmish Ramaswamy, John A. Miller
Department of Computer Science, The University of Georgia

International Workshop on Collaborative Big Data (C-Big 2012)

SLIDE 2

Introduction

• The number of nodes and edges is dynamic in many emerging applications, for example:
  • Hyperlink structure of the World Wide Web
  • Relationship structures in online social networks
  • Connectivity structures of the Internet and overlays
  • Communication flow networks among individuals

• Time-Evolving Graph (TEG):
  • A sequence of snapshots of a graph as it evolves over time
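For concreteness, a TEG can be held in code as a list of timestamped snapshots; this tiny sketch (all names and data are illustrative, not from the talk) stores each snapshot as an adjacency-set dict:

```python
# Illustrative TEG: a list of snapshots, one per time stamp.
# Each snapshot maps a vertex to the set of its out-neighbors.
teg = [
    {"a": {"b"}, "b": set()},                   # t = 0
    {"a": {"b", "c"}, "b": {"c"}, "c": set()},  # t = 1: vertex c and two edges added
    {"a": {"c"}, "b": {"c"}, "c": set()},       # t = 2: edge a -> b removed
]

def edges_at(teg, t):
    """Return the edge set of the snapshot taken at time t."""
    return {(u, v) for u, nbrs in teg[t].items() for v in nbrs}
```

Storing full snapshots is the simplest representation; the slide on pattern matching in TEGs later replaces it with diffs between consecutive snapshots.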


SLIDE 3

Need for New Approaches for TEGs

In contrast to moderate-size, static graphs, TEGs bring:

1. An additional dimension, namely time
2. Huge size in many modern domains (Facebook has about 800 million vertices and 104 billion edges)
3. A data size that grows by multiple orders of magnitude because of the temporal dimension

We study three important problems about TEGs:

• Distribution on cluster computers
• Reachability query
• Pattern matching


SLIDE 4

BSP model and Vertex-centric graph processing


• BSP (Bulk Synchronous Parallel) model
• Vertex-centric graph processing
  • Each vertex of the data graph is a computing unit.
  • Each vertex initially knows only its own label and its outgoing edges.
  • Systems: Pregel, Apache Giraph, GPS

[Figure: a BSP computation alternates supersteps with global communication/synchronization phases]
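As a concrete (and purely illustrative) instance of the model, the sketch below runs a classic vertex-centric task, max-value propagation, in supersteps separated by a barrier; `bsp_max` is our own name, not an API of Pregel, Giraph, or GPS:

```python
def bsp_max(graph, values):
    """Pregel-style max propagation. `graph` maps each vertex to its
    out-neighbors; `values` maps each vertex to an initial integer.
    In every superstep, active vertices send their value along outgoing
    edges; a vertex updates (and votes to stay active) only if it
    receives a larger value. The loop boundary plays the role of the
    global barrier in the BSP model."""
    values = dict(values)
    active = set(graph)                 # initially, every vertex is active
    while active:                       # one loop iteration = one superstep
        inbox = {v: [] for v in graph}  # messages delivered at the barrier
        for u in active:
            for v in graph[u]:
                inbox[v].append(values[u])
        active = set()
        for v, msgs in inbox.items():
            m = max(msgs, default=values[v])
            if m > values[v]:
                values[v] = m
                active.add(v)
    return values
```

When no vertex updates in a superstep, the computation halts, mirroring the vote-to-halt convention of these systems.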

  • M. Felice Pace, "BSP vs MapReduce," Proceedings of the 12th International Conference on Computational Science (ICCS '12)
SLIDE 5

TEG distribution on Clusters

• Two contradictory goals:
  • Minimizing communication cost among the nodes of the cluster
  • Maximizing node utilization

• A trade-off between two extremes:
  • Assigning the vertices randomly
  • Partitioning the graph into connected components

[Figure: the same graph distributed over partitions P1-P4, contrasting random assignment of vertices with assignment based on a partitioning pattern]
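The two extremes can be made concrete with a toy sketch (the function names and the deterministic hash are illustrative): random assignment ignores structure entirely, and the number of cross-partition edges serves as a proxy for communication cost:

```python
def random_partition(vertices, k):
    """One extreme: spread vertices over k workers by a deterministic
    hash of the vertex name, ignoring the graph structure entirely."""
    return {v: sum(map(ord, v)) % k for v in vertices}

def cut_edges(graph, part):
    """Proxy for communication cost: count edges whose endpoints were
    assigned to different workers."""
    return sum(1 for u, nbrs in graph.items()
                 for v in nbrs if part[u] != part[v])
```

A structure-aware partitioning drives `cut_edges` toward zero, but risks leaving some workers underutilized, which is exactly the trade-off above.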

SLIDE 6

TEG distribution on Clusters

• More partitions than the number of compute nodes
• Dynamic repartitioning of sub-graphs when changes pass a certain threshold related to the connectivity and structure of the sub-graphs
• Incremental reallocation of a node in order to reduce the communication cost
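Incremental reallocation can be sketched greedily (an illustrative heuristic, not necessarily the exact policy of the talk): move a vertex to the worker that already holds most of its neighbors, which minimizes that vertex's cross-worker edges:

```python
from collections import Counter

def best_worker_for(v, graph, part):
    """Greedy reallocation sketch: return the worker holding the most
    in- or out-neighbors of v; placing v there minimizes the number of
    v's edges that cross workers. `graph` maps vertex -> out-neighbors,
    `part` maps vertex -> current worker."""
    tally = Counter()
    for u in graph.get(v, ()):        # workers of v's out-neighbors
        tally[part[u]] += 1
    for u, nbrs in graph.items():     # workers of v's in-neighbors
        if v in nbrs:
            tally[part[u]] += 1
    return tally.most_common(1)[0][0] if tally else part[v]
```

In a real system the move would also be weighed against node utilization, per the contradictory goals on the previous slide.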



SLIDE 7

Pattern Matching


• There are different paradigms for pattern matching:
  • Sub-graph isomorphism (NP-complete)
  • Graph simulation (quadratic)
  • Dual simulation (cubic)
  • Strong simulation (cubic)
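For reference, graph simulation admits a short fixpoint algorithm (this sequential sketch uses illustrative names; see Henzinger et al. in the references for the efficient version): start from label-compatible candidates, then repeatedly discard any data vertex that cannot cover some query successor:

```python
def graph_simulation(q_adj, q_label, d_adj, d_label):
    """Naive fixpoint for graph simulation. sim[u] holds the data
    vertices that may still match query vertex u. Seed it by label
    equality, then drop any candidate v that lacks a successor matching
    some query successor of u, until nothing changes. This runs in
    polynomial time, unlike sub-graph isomorphism."""
    sim = {u: {v for v in d_label if d_label[v] == q_label[u]}
           for u in q_label}
    changed = True
    while changed:
        changed = False
        for u, u_children in q_adj.items():
            for v in list(sim[u]):
                # v must have, for every query child c, a child in sim[c]
                if any(not (d_adj.get(v, set()) & sim[c])
                       for c in u_children):
                    sim[u].remove(v)
                    changed = True
    return sim
```

On the slide's example, a data vertex labeled SD with no Bio successor would be pruned from the SD candidates in exactly this way.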

PM: Product Manager, SD: Software Developer, Bio: Biologist

[Figure: a pattern graph over the labels PM, SD, and Bio, and a data graph whose vertices (John, Ann, Sara, Mary, Bob, Alice) carry those labels]

SLIDE 8

Graph Simulation

PM: Product Manager, SD: Software Developer, Bio: Biologist, DM: Data Mining specialist, AI: Artificial Intelligence specialist, HR: Human Resources

[Figure: a pattern graph over PM, SD, Bio, DM, and AI, and a larger data graph; highlighted vertices show the graph-simulation match]

SLIDE 9

Graph Simulation

[Figure: the graph-simulation result on the same pattern and data graphs; labels as on the previous slide]

SLIDE 10

Graph Dual Simulation

[Figure: the dual-simulation result on the same pattern and data graphs; dual simulation also enforces the incoming-edge (parent) conditions, so fewer data vertices qualify than under plain graph simulation]

SLIDE 11

Graph Strong Simulation

[Figure: the strong-simulation result on the same pattern and data graphs; strong simulation additionally restricts matches to bounded-diameter balls around candidate vertices, adding locality on top of dual simulation]

SLIDE 12

Distributed Graph simulation


[Figure: a small query graph (labels a, b, c) and a data graph (labels a-f) spread across compute nodes - the initial distributed state]

SLIDE 13

Distributed Graph simulation


[Figure: the same query and data graphs after the first superstep]

SLIDE 14

Distributed Graph simulation


[Figure: the same query and data graphs after the second superstep]

SLIDE 15

Distributed Graph simulation


[Figure: the same query and data graphs after the third superstep]
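The superstep sequence in these slides can be mimicked sequentially (all names are illustrative; real systems exchange the match sets as messages, which this sketch approximates with a per-superstep snapshot):

```python
def distributed_simulation(q_adj, q_label, d_adj, d_label):
    """Vertex-centric sketch of distributed graph simulation. Every data
    vertex keeps the set of query vertices it may still match. In each
    superstep, vertices read their successors' match sets from the
    previous superstep (the 'messages'); a vertex drops a query
    candidate u when some query child of u is no longer matched by any
    of its out-neighbors. The loop ends when a superstep changes
    nothing."""
    match = {v: {u for u in q_label if q_label[u] == d_label[v]}
             for v in d_label}
    changed, supersteps = True, 0
    while changed:
        changed = False
        supersteps += 1
        snapshot = {v: set(s) for v, s in match.items()}  # last superstep's state
        for v in d_label:
            for u in list(match[v]):
                ok = all(any(c in snapshot[w] for w in d_adj.get(v, ()))
                         for c in q_adj.get(u, ()))
                if not ok:
                    match[v].remove(u)
                    changed = True
    return match, supersteps
```

Each iteration of the outer loop corresponds to one superstep with a barrier, as in the figures above.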

SLIDE 16


Preliminary results

Source of the graphs: http://snap.stanford.edu/data/
Graph synthesizer: http://projects.skewed.de/graph-tool/
Number of vertices in the pattern: 20

SLIDE 17

Pattern Matching in TEGs


• We borrow the idea of result graphs from [1].
• Lists hold insert and delete requests, with time stamps marking snapshots of the graph:
  • Delete commands can only diminish the result graph.
  • Insert commands can expand the previous result graph.
• Result graphs are saved for some of the snapshots of the graph.
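This bookkeeping can be sketched as follows (a naive illustration, not the incremental algorithm of [1]; `recompute` is a hypothetical hook standing in for re-matching only the affected region):

```python
def apply_diff(result_edges, diff, recompute):
    """Maintain a saved result graph across one snapshot transition.
    Deletes can only shrink the result graph, so they are applied
    directly; inserts may enlarge it, so this naive sketch delegates
    them to a caller-supplied `recompute` hook (hypothetical) that
    returns any newly matching edges."""
    edges = set(result_edges) - set(diff.get("deletes", ()))
    if diff.get("inserts"):
        edges |= recompute(diff["inserts"])
    return edges
```

Saving result graphs only for some snapshots then means replaying the diffs from the nearest saved snapshot to answer a query at an arbitrary time.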

[1] W. Fan, J. Li, J. Luo, Z. Tan, X. Wang, and Y. Wu, "Incremental graph pattern matching," in Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data (SIGMOD '11), New York, NY, USA: ACM, 2011, pp. 925-936.

[Figure: successive result graphs RG1, RG2, RG3, connected by Diff(G2, G1) and Diff(G3, G2), each diff a list of inserts and deletes]

SLIDE 18

References

• Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski, "Pregel: a system for large-scale graph processing," in Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data (SIGMOD '10), 2010.
• "Giraph website," http://giraph.apache.org/.
• S. Salihoglu and J. Widom, "GPS: a graph processing system," Stanford University, Technical Report, 2012.
• S. Ma, Y. Cao, J. Huai, and T. Wo, "Distributed graph pattern matching," in Proceedings of the 21st International Conference on World Wide Web (WWW '12), 2012.
• W. Fan, J. Li, J. Luo, Z. Tan, X. Wang, and Y. Wu, "Incremental graph pattern matching," in Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data (SIGMOD '11), 2011, pp. 925-936.
• S. Ma, Y. Cao, W. Fan, J. Huai, and T. Wo, "Capturing topology in graph pattern matching," Proc. VLDB Endow., vol. 5, no. 4, pp. 310-321, Dec. 2011.
• M. R. Henzinger, T. A. Henzinger, and P. W. Kopke, "Computing simulations on finite and infinite graphs," in Proceedings of the 36th Annual Symposium on Foundations of Computer Science (FOCS '95), 1995.