
Towards Efficient Query Processing on Massive Time-Evolving Graphs

Arash Fard, Amir Abdolrashidi, Lakshmish Ramaswamy, John A. Miller. Department of Computer Science, The University of Georgia. International Workshop on Collaborative Big Data.


  1. Towards Efficient Query Processing on Massive Time-Evolving Graphs
     Arash Fard, Amir Abdolrashidi, Lakshmish Ramaswamy, John A. Miller
     Department of Computer Science, The University of Georgia
     International Workshop on Collaborative Big Data (C-Big 2012)

  2. Introduction
     - The number of nodes and edges is dynamic in many emerging applications, for example:
       - Hyperlink structure of the World Wide Web
       - Relationship structures in online social networks
       - Connectivity structures of the Internet and overlays
       - Communication flow networks among individuals
     - Time-Evolving Graph (TEG): a sequence of snapshots of a graph as it evolves over time

  3. Need New Approaches for TEG
     In contrast to medium-sized, static graphs:
     1. Huge size in many modern domains
        - Facebook has about 800 million vertices and 104 billion edges.
     2. An additional dimension, namely time
        - The additional temporal dimension causes the data size to increase by multiple orders of magnitude.
     We study three important problems about TEGs:
     - Distribution on cluster computers
     - Reachability query
     - Pattern matching

  4. BSP Model and Vertex-Centric Graph Processing
     - BSP (Bulk Synchronous Parallel) model
       [Figure: a sequence of supersteps separated by communication phases.]
     - Vertex-centric graph processing
       - Each vertex of the data graph is a computing unit.
       - Each vertex initially knows only its own label and its outgoing edges.
       - Pregel, Apache Giraph, GPS
     M. Felice Pace, "BSP vs MapReduce," Proceedings of the 12th International Conference on Computational Science (ICCS '12).
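
The vertex-centric BSP model on slide 4 can be illustrated with a small single-process sketch. It is not code from Pregel, Giraph, or GPS; the function name `bsp_max_value` and the max-value propagation task are assumptions chosen only to show supersteps, message passing, and the barrier between them.

```python
from collections import defaultdict


def bsp_max_value(out_edges, values):
    """Propagate the largest vertex value through the graph, BSP style.

    out_edges: dict mapping each vertex to a list of its out-neighbors.
    values:    dict mapping each vertex to its initial value.
    Each vertex only uses its own value and its own outgoing edges.
    """
    values = dict(values)

    # Superstep 0: every vertex sends its value along its outgoing edges.
    inbox = defaultdict(list)
    for vertex, neighbors in out_edges.items():
        for neighbor in neighbors:
            inbox[neighbor].append(values[vertex])

    # Later supersteps: a vertex is active only if it received messages.
    # The loop boundary plays the role of the synchronization barrier.
    while inbox:
        outbox = defaultdict(list)
        for vertex, messages in inbox.items():
            best = max(messages)
            if best > values[vertex]:
                values[vertex] = best
                for neighbor in out_edges.get(vertex, ()):
                    outbox[neighbor].append(best)
        inbox = outbox  # messages are delivered in the next superstep

    return values


if __name__ == "__main__":
    ring = {"a": ["b"], "b": ["c"], "c": ["a"]}
    print(bsp_max_value(ring, {"a": 3, "b": 6, "c": 2}))
```

The computation halts when no vertex sends a message, which mirrors the vote-to-halt convention of Pregel-like systems.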

  5. TEG Distribution on Clusters
     - Two contradictory goals:
       - Minimizing communication cost among the nodes of the cluster
       - Maximizing node utilization
     - A trade-off between two extremes:
       - Assigning the vertices randomly
       - Partitioning the graph into connected components
     [Figure: two panels, "Random assignment of graph vertices" and "Assignment of vertices based on a partitioning pattern" (partitions P1 to P4).]

  6. TEG Distribution on Clusters
     - More partitions than the number of compute nodes
     - Dynamic repartitioning of sub-graphs when changes pass a certain threshold related to the connectivity and structure of the sub-graphs
     - Incremental reallocation of a node in order to reduce the communication cost
     [Figure: panels a, b, c.]
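
The trade-off on slides 5 and 6 can be made concrete with two simple measures: the number of edges cut by an assignment (a proxy for communication cost) and the largest per-worker load (a proxy for utilization). The sketch below is illustrative only; the function names, the round-robin stand-in for random assignment, and the greedy `reallocate` rule are assumptions, not the authors' method.

```python
from collections import Counter


def cut_edges(edges, assignment):
    """Count edges whose endpoints live on different workers."""
    return sum(1 for u, v in edges if assignment[u] != assignment[v])


def max_load(assignment, num_workers):
    """Return the number of vertices on the most loaded worker."""
    loads = [0] * num_workers
    for worker in assignment.values():
        loads[worker] += 1
    return max(loads)


def reallocate(vertex, edges, assignment):
    """Greedy move (hypothetical): place one vertex on the worker that
    already holds most of its neighbors, leaving all others in place."""
    counts = Counter()
    for u, v in edges:
        if u == vertex:
            counts[assignment[v]] += 1
        elif v == vertex:
            counts[assignment[u]] += 1
    if not counts:
        return dict(assignment)
    updated = dict(assignment)
    updated[vertex] = counts.most_common(1)[0][0]
    return updated


if __name__ == "__main__":
    # A path 0-1-2-3-4 plus an isolated vertex 5, on two workers.
    edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
    round_robin = {v: v % 2 for v in range(6)}
    by_component = {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 1}
    for name, a in [("round robin", round_robin), ("components", by_component)]:
        print(name, "cut:", cut_edges(edges, a), "max load:", max_load(a, 2))
```

In this toy graph the scattered assignment balances load but cuts every edge, while the component-based assignment cuts nothing but leaves one worker nearly idle.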

  7. Pattern Matching
     There are different paradigms for pattern matching:
     - Sub-graph isomorphism (NP-complete)
     - Graph simulation (quadratic)
     - Dual simulation (cubic)
     - Strong simulation (cubic)
     [Figure: a pattern graph and a data graph whose vertices are people (Mary, John, Bob, Ann, Alice, Sara). Legend: PM = Product Manager, SD = Software Developer, Bio = Biologist.]
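
As a rough illustration of graph simulation from slide 7, the sketch below computes the maximum simulation relation with a naive fixpoint loop. It is not the optimized quadratic-time algorithm from the literature, and the function name, the argument layout, and the toy vertex ids (`pm1`, `sd1`, ...) are assumptions made for the example.

```python
def graph_simulation(p_edges, p_labels, g_edges, g_labels):
    """Return {pattern vertex: set of matching data vertices}.

    A data vertex v matches pattern vertex u if the labels agree and,
    for every pattern edge (u, u2), v has a child that matches u2.
    If some pattern vertex has no match, every set is returned empty.
    """
    children = {v: set() for v in g_labels}
    for a, b in g_edges:
        children[a].add(b)

    # Start from all label-compatible candidates, then prune.
    sim = {
        u: {v for v, label in g_labels.items() if label == p_labels[u]}
        for u in p_labels
    }
    changed = True
    while changed:
        changed = False
        for u, u2 in p_edges:
            keep = {v for v in sim[u] if children[v] & sim[u2]}
            if keep != sim[u]:
                sim[u] = keep
                changed = True

    if any(not matches for matches in sim.values()):
        return {u: set() for u in p_labels}
    return sim


if __name__ == "__main__":
    # Pattern: a PM who manages an SD who works with a Bio.
    p_labels = {"pm": "PM", "sd": "SD", "bio": "Bio"}
    p_edges = [("pm", "sd"), ("sd", "bio")]
    g_labels = {"pm1": "PM", "sd1": "SD", "bio1": "Bio", "pm2": "PM", "sd2": "SD"}
    g_edges = [("pm1", "sd1"), ("sd1", "bio1"), ("pm2", "sd2")]
    print(graph_simulation(p_edges, p_labels, g_edges, g_labels))
```

Here `sd2` is pruned because it has no Bio child, and `pm2` is then pruned because its only child no longer matches.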

  8. Graph Simulation
     [Figure: pattern graph and data graph with labeled vertices. Legend: PM = Product Manager, SD = Software Developer, Bio = Biologist, DM = Data Mining specialist, AI = Artificial Intelligence specialist, HR = Human Resources.]

  9. Graph Simulation
     [Figure: same pattern graph and data graph as slide 8.]

  10. Graph Dual Simulation
      [Figure: same pattern graph and data graph as slide 8.]

  11. Graph Strong Simulation
      [Figure: same pattern graph and data graph as slide 8.]

  12. Distributed Graph Simulation: Initial Distributed Graph
      [Figure: data graph and query graph; vertex labels a to f.]

  13. Distributed Graph Simulation: The First Superstep
      [Figure: data graph and query graph; vertex labels a to f.]

  14. Distributed Graph Simulation: The Second Superstep
      [Figure: data graph and query graph; vertex labels a to f.]

  15. Distributed Graph Simulation: The Third Superstep
      [Figure: data graph and query graph; vertex labels a to f.]
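
The superstep-by-superstep evaluation shown on slides 12 to 15 can be approximated in a single process. The sketch below is a simplified stand-in, not the authors' protocol: it assumes that in every superstep each data vertex receives the current match sets of its children, and it ends with a global check that every query vertex is matched. All names are hypothetical.

```python
def bsp_graph_simulation(p_edges, p_labels, g_out, g_labels):
    """Return {data vertex: set of query vertices it matches}.

    g_out maps a data vertex to its out-neighbors (children).
    """
    p_children = {u: set() for u in p_labels}
    for u, u2 in p_edges:
        p_children[u].add(u2)

    # First superstep: each vertex checks its own label locally.
    match = {
        v: {u for u, label in p_labels.items() if label == g_labels[v]}
        for v in g_labels
    }

    while True:
        # Communication phase: children report their match sets.
        reported = {v: [match[c] for c in g_out.get(v, ())] for v in g_labels}

        # Computation phase: keep u only if every child of u in the
        # query is matched by at least one child of v in the data graph.
        new_match = {
            v: {
                u
                for u in match[v]
                if all(any(u2 in m for m in reported[v]) for u2 in p_children[u])
            }
            for v in g_labels
        }
        if new_match == match:  # no vertex changed: all vote to halt
            break
        match = new_match

    # Global check (assumed to be done by a master or an aggregator).
    matched = set().union(*match.values())
    if matched != set(p_labels):
        return {v: set() for v in g_labels}
    return match


if __name__ == "__main__":
    p_labels = {"pm": "PM", "sd": "SD", "bio": "Bio"}
    p_edges = [("pm", "sd"), ("sd", "bio")]
    g_labels = {"pm1": "PM", "sd1": "SD", "bio1": "Bio", "pm2": "PM", "sd2": "SD"}
    g_out = {"pm1": ["sd1"], "sd1": ["bio1"], "pm2": ["sd2"]}
    print(bsp_graph_simulation(p_edges, p_labels, g_out, g_labels))
```

Because every vertex updates from the previous superstep's state, removals propagate one hop per superstep, so the number of supersteps depends on how far a failed match has to travel toward its ancestors.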

  16. Preliminary Results
      - Source of the graph: http://snap.stanford.edu/data/
      - Number of vertices in the pattern: 20
      - Graph synthesizer: http://projects.skewed.de/graph-tool/

  17. Pattern Matching in TEGs
      - We borrow the idea of result graphs from [1].
      - Lists of insert and delete requests, and time stamps for snapshots of the graph.
      - Delete commands can only diminish the result graph.
      - Insert commands can only expand the previous result graph.
      - Result graphs are saved for some of the snapshots of the graph.
      [Figure: result graphs RG1, RG2, RG3 for consecutive snapshots, with Diff(G2,G1) := inserts/deletes and Diff(G3,G2) := inserts/deletes.]
      [1] W. Fan, J. Li, J. Luo, Z. Tan, X. Wang, and Y. Wu, “Incremental graph pattern matching,” in Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, ser. SIGMOD ’11. New York, NY, USA: ACM, 2011, pp. 925–936.
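
The observation that deletes can only diminish the result graph suggests a simple reuse strategy, sketched below for graph simulation. When the diff between two snapshots contains only deletes, the previous result is refined on the new edges; when it contains inserts, this sketch simply recomputes from the label-based candidates. It is an illustration under stated assumptions (fixed vertex set, edge-only changes, hypothetical names `diff`, `refine`, `update`), not the incremental algorithm of [1].

```python
def diff(old_edges, new_edges):
    """Return (inserts, deletes) between two edge-set snapshots."""
    old, new = set(old_edges), set(new_edges)
    return new - old, old - new


def refine(p_edges, candidates, g_edges):
    """Shrink candidate sets until they satisfy graph simulation."""
    children = {}
    for a, b in g_edges:
        children.setdefault(a, set()).add(b)
    sim = {u: set(vs) for u, vs in candidates.items()}
    changed = True
    while changed:
        changed = False
        for u, u2 in p_edges:
            keep = {v for v in sim[u] if children.get(v, set()) & sim[u2]}
            if keep != sim[u]:
                sim[u] = keep
                changed = True
    return sim


def update(p_edges, p_labels, g_labels, prev_sim, old_edges, new_edges):
    """Compute the match sets for the new snapshot, reusing prev_sim
    when the diff contains deletes only."""
    inserts, _deletes = diff(old_edges, new_edges)
    if inserts or not prev_sim:
        # Inserts may add matches: restart from label candidates.
        start = {
            u: {v for v, label in g_labels.items() if label == p_labels[u]}
            for u in p_labels
        }
    else:
        # Deletes only: the new result is a subset of the old one.
        start = prev_sim
    return refine(p_edges, start, new_edges)


if __name__ == "__main__":
    p_labels = {"pm": "PM", "sd": "SD", "bio": "Bio"}
    p_edges = [("pm", "sd"), ("sd", "bio")]
    g_labels = {"pm1": "PM", "sd1": "SD", "bio1": "Bio", "pm2": "PM", "sd2": "SD"}
    g1 = [("pm1", "sd1"), ("sd1", "bio1"), ("pm2", "sd2"), ("sd2", "bio1")]
    g2 = [e for e in g1 if e != ("sd2", "bio1")]
    rg1 = update(p_edges, p_labels, g_labels, {}, [], g1)
    rg2 = update(p_edges, p_labels, g_labels, rg1, g1, g2)
    print(rg1)
    print(rg2)
```

Refining from the previous result is safe for deletes because any simulation on the smaller graph is also a simulation on the larger one, so the new maximum match is contained in the old one.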

  18. References
      - G. Malewicz, M. H. Austern, A. J. C. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski, “Pregel: a system for large-scale graph processing,” in Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, ser. SIGMOD ’10, 2010.
      - “Giraph website,” http://giraph.apache.org/.
      - S. Salihoglu and J. Widom, “GPS: A graph processing system,” Stanford University, Technical Report, 2012.
      - S. Ma, Y. Cao, J. Huai, and T. Wo, “Distributed graph pattern matching,” in Proceedings of the 21st International Conference on World Wide Web, ser. WWW ’12.
      - W. Fan, J. Li, J. Luo, Z. Tan, X. Wang, and Y. Wu, “Incremental graph pattern matching,” in Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, ser. SIGMOD ’11.
      - S. Ma, Y. Cao, W. Fan, J. Huai, and T. Wo, “Capturing topology in graph pattern matching,” Proc. VLDB Endow., vol. 5, no. 4, pp. 310–321, Dec. 2011.
      - M. R. Henzinger, T. A. Henzinger, and P. W. Kopke, “Computing simulations on finite and infinite graphs,” in Proceedings of the 36th Annual Symposium on Foundations of Computer Science, ser. FOCS ’95.
