GraphFrames: An Integrated API for Mixing Graph and Relational Queries


  1. GraphFrames: An Integrated API for Mixing Graph and Relational Queries
     Ankur Dave, UC Berkeley AMPLab
     Joint work with Alekh Jindal (Microsoft), Li Erran Li (Uber), Reynold Xin (Databricks), Joseph Gonzalez (UC Berkeley), and Matei Zaharia (MIT and Databricks)

  2. Trend: Unified Graph Analysis
     • 2009: Spark (Relational Queries)
     • 2013: Apache Spark + GraphX (+ Graph Algorithms)
     • 2016: Apache Spark + GraphFrames (+ Graph Queries)

  3. Graph Algorithms vs. Graph Queries
     Graph Algorithms: PageRank, Alternating Least Squares
     Graph Queries: (figure only)

  4. Graph Algorithms vs. Graph Queries
     Graph Algorithm: PageRank
     Graph Query: Wikipedia Collaborators
     (Figure: Editor 1 and Editor 2 both edit Article 1 on the same day and both edit Article 2 on the same day, which links Editor 1 and Editor 2 through Article 1 and Article 2.)

  5. Graph Algorithms vs. Graph Queries

     Graph Algorithm: PageRank

         // Iterate until convergence
         wikipedia.pregel(
           sendMsg = { e =>
             e.sendToDst(e.srcRank * e.weight)
           },
           mergeMsg = _ + _,
           vprog = { (id, oldRank, msgSum) =>
             0.15 + 0.85 * msgSum
           })

     Graph Query: Wikipedia Collaborators

         wikipedia.find(
           "(u1)-[e11]->(article1);
            (u2)-[e21]->(article1);
            (u1)-[e12]->(article2);
            (u2)-[e22]->(article2)")
           .select(
             "*",
             "e11.date - e21.date".as("d1"),
             "e12.date - e22.date".as("d2"))
           .sort("d1 + d2".desc).take(10)
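
     The Pregel-style PageRank on this slide has three steps: send a message along each edge, sum the messages arriving at each vertex, and apply the vertex program. The following is a minimal plain-Python sketch of that loop, not the GraphX or GraphFrames implementation. The function name `pagerank`, the toy edge list, the fixed iteration count, and the choice of edge weight as 1 / out-degree of the source are all assumptions made for illustration.

```python
from collections import defaultdict

def pagerank(edges, num_iters=50):
    """Pregel-style PageRank over (src, dst) pairs. Illustrative sketch only."""
    vertices = {v for edge in edges for v in edge}
    out_degree = defaultdict(int)
    for src, _ in edges:
        out_degree[src] += 1

    rank = {v: 1.0 for v in vertices}
    for _ in range(num_iters):
        msg_sum = defaultdict(float)
        for src, dst in edges:
            # sendMsg: srcRank * weight, ASSUMING weight = 1 / out-degree(src).
            # mergeMsg: messages to the same destination are added together.
            msg_sum[dst] += rank[src] / out_degree[src]
        # vprog: 0.15 + 0.85 * msgSum (vertices with no messages get 0.15).
        rank = {v: 0.15 + 0.85 * msg_sum[v] for v in vertices}
    return rank

# Hypothetical four-edge graph.
ranks = pagerank([("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")])
```

     A real run would iterate until the ranks stop changing rather than for a fixed number of rounds, as the slide's comment suggests.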

  6. Separate Systems
     Graph Algorithms | Graph Queries

  7. Problem: Mixed Graph Analysis
     (Figure labels: Raw Wikipedia XML; Text Table with Hyperlinks and Article Text; Edit Table and Edit Graph over User and Article; PageRank; Frequent Collaborators; Vandalism Suspects.)

  8. Solution: GraphFrames
     Graph Algorithms and Graph Queries, layered on:
     • GraphFrames API
     • Pattern Query Optimizer
     • Spark SQL

  9. GraphFrames API
     • Unifies graph algorithms, graph queries, and relational operations (DataFrames)
     • Designed for interactive use
     • Available in Scala, Java, and Python

         class GraphFrame {
           def vertices: DataFrame
           def edges: DataFrame
           def find(pattern: String): DataFrame
           def registerView(pattern: String, df: DataFrame): Unit
           def degrees(): DataFrame
           def pageRank(): GraphFrame
           def connectedComponents(): GraphFrame
           ...
         }

  10. Implementation
      (Figure: query pipeline with the stages Query String, Parsed Pattern, Graph–Relational Translation, Logical Plan, Join Elimination and Reordering, Optimized Logical Plan, Spark SQL, and DataFrame Result. Side components: Materialized Views with View Selection, and GraphX for Graph Algorithms.)

  11. Graph–Relational Translation
      (Figure: the existing logical plan, with output A, B, C, is joined with the edge table (Src, Dst) on C = Src; the result is joined with the vertex table (ID, Attr) on D = ID.)
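
      One way to picture graph–relational translation is to turn each pattern term into joins against the edge and vertex tables. The sketch below does this with Python's standard `sqlite3` module rather than Spark SQL, and it is not the GraphFrames translator. The helper `pattern_to_sql`, the regular expression, the table and column names, and the toy rows are all assumptions made for illustration.

```python
import re
import sqlite3

# Matches one pattern term such as "(a)-[e1]->(b)".
TERM = re.compile(r"\((\w+)\)-\[(\w+)\]->\((\w+)\)")

def pattern_to_sql(pattern):
    """Translate a motif like "(a)-[e1]->(b); (b)-[e2]->(c)" into a SQL join."""
    terms = TERM.findall(pattern)
    vertex_names = []
    for src, _, dst in terms:
        for name in (src, dst):
            if name not in vertex_names:
                vertex_names.append(name)
    # One vertex-table alias per named vertex, one edge-table alias per named edge.
    tables = [f"vertices AS {v}" for v in vertex_names]
    tables += [f"edges AS {e}" for _, e, _ in terms]
    conds = []
    for src, e, dst in terms:
        conds += [f"{e}.src = {src}.id", f"{e}.dst = {dst}.id"]
    cols = ", ".join(f"{v}.id" for v in vertex_names)
    return f"SELECT {cols} FROM {', '.join(tables)} WHERE {' AND '.join(conds)}"

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE vertices (id INTEGER PRIMARY KEY, attr TEXT);
    CREATE TABLE edges (src INTEGER, dst INTEGER);
    INSERT INTO vertices VALUES (1, 'A'), (2, 'B'), (3, 'C'), (4, 'D');
    INSERT INTO edges VALUES (1, 2), (1, 3), (2, 3);
""")

sql = pattern_to_sql("(a)-[e1]->(b); (b)-[e2]->(c)")
rows = conn.execute(sql).fetchall()  # two-hop paths a -> b -> c
```

      With these toy rows the only two-hop path is 1 → 2 → 3, so `rows` holds a single tuple.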

  12. Join Elimination

      Edges          Vertices
      Src  Dst       ID  Attr
      1    2         1   A
      1    3         2   B
      2    3         3   C
      2    5         4   D

      Unnecessary join:

          SELECT src, dst
          FROM edges INNER JOIN vertices ON src = id;

      can be eliminated if tables satisfy referential integrity, simplifying graph–relational translation:

          SELECT src, dst FROM edges;
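
      The equivalence claimed on this slide can be checked on a small database. The sketch below uses Python's standard `sqlite3` module, not Spark SQL, and loads the example rows from the slide as best they can be read. It runs both queries, then adds a hypothetical edge whose source has no vertex row to show why referential integrity is required. The `ORDER BY` clauses and variable names are assumptions added so the results compare deterministically.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE vertices (id INTEGER PRIMARY KEY, attr TEXT);
    CREATE TABLE edges (src INTEGER, dst INTEGER);
    INSERT INTO vertices VALUES (1, 'A'), (2, 'B'), (3, 'C'), (4, 'D');
    INSERT INTO edges VALUES (1, 2), (1, 3), (2, 3), (2, 5);
""")

JOINED = ("SELECT src, dst FROM edges INNER JOIN vertices ON src = id "
          "ORDER BY src, dst")
PLAIN = "SELECT src, dst FROM edges ORDER BY src, dst"

# Every src value has a matching vertex id, so the join filters nothing.
with_join = conn.execute(JOINED).fetchall()
without_join = conn.execute(PLAIN).fetchall()

# Hypothetical violation: an edge whose src (9) is missing from vertices.
conn.execute("INSERT INTO edges VALUES (9, 1)")
broken_with_join = conn.execute(JOINED).fetchall()
broken_without_join = conn.execute(PLAIN).fetchall()
```

      In the first case the two result lists are identical; after the dangling edge is inserted, the joined query drops it while the plain query keeps it, so the rewrite is no longer safe.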

  13. Materialized View Selection
      GraphX: Triplet view enabled efficient message-passing algorithms
      (Figure: a graph over vertices A to D, its Edges and Vertices tables, the Triplet View built from them, and the Updated PageRanks computed from that view.)

  14. Materialized View Selection
      GraphFrames: User-defined views enable efficient graph queries
      (Figure: the same graph, with Edges and Vertices tables feeding the Triplet View, used by PageRank and Community Detection, and User-Defined Views, used by Graph Queries.)

  15. Join Reordering
      (Figure: an example query over the edges A → B, B → A, B → D, B → E, C → B, C → D, and C → E, shown as a Left-Deep Plan and as a Bushy Plan; a User-Defined View covers A → B and B → A.)

  16. Query Planning Algorithm
      Dynamic programming algorithm based on: J. Huang, K. Venkatraman, and D.J. Abadi. Query optimization of distributed pattern matching. In ICDE 2014.
      1. Considers all left-deep plans, and a subset of bushy plans
         • Bushy plans to explore are chosen using layered-DAG and cycle-detection heuristics
      2. Considers using each view that is exactly equivalent to a plan subtree
         • Result: Selects the largest of multiple hierarchically contained views
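
      The stated result of step 2, picking the largest of several views that are contained in one another, can be illustrated in isolation. The sketch below is not the dynamic programming planner and does not test plan-subtree equivalence. It ASSUMES that a view is usable whenever all of its pattern edges occur in the query. The function `pick_view`, the view names, and the edge sets are hypothetical.

```python
def pick_view(query_edges, views):
    """Return the name of the largest view whose edges all occur in the query.

    Returns None when no registered view applies. Ties are broken by name.
    """
    candidates = [
        (len(edges), name)
        for name, edges in views.items()
        if edges <= query_edges  # subset test: every view edge is in the query
    ]
    return max(candidates)[1] if candidates else None

# Hypothetical query pattern and registered views, as sets of (src, dst) edges.
query = frozenset({("a", "b"), ("b", "a"), ("b", "c")})
views = {
    "ab": frozenset({("a", "b")}),
    "ab_ba": frozenset({("a", "b"), ("b", "a")}),  # contains the "ab" view
    "cd": frozenset({("c", "d")}),                 # not part of the query
}
chosen = pick_view(query, views)
```

      Here both `ab` and `ab_ba` apply, and the larger one is chosen because it replaces more of the join work.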

  17. Evaluation
      Faster than Neo4j for unanchored pattern queries
      (Charts: query latency in seconds for GraphFrames and Neo4j, in two panels: Anchored Pattern Query and Unanchored Pattern Query.)
      Triangle query on 1M edge subgraph of web-Google. Each system configured to use a single core.

  18. Evaluation
      Approaches performance of GraphX for graph algorithms using Spark SQL whole-stage code generation
      (Chart: PageRank Performance, per-iteration runtime in seconds for GraphFrames, GraphX, and Naïve Spark.)
      Per-iteration performance on web-Google, single 8-core machine. Naïve Spark uses Scala RDD API.

  19. Evaluation
      Registering the right views can greatly improve performance for some queries
      Workload: J. Huang, K. Venkatraman, and D.J. Abadi. Query optimization of distributed pattern matching. In ICDE 2014.

  20. Future Work
      • Suggest views automatically
      • Exploit attribute-based partitioning in optimizer
      • Code generation for single node

  21. Try It Out!
      Released as a Spark Package at: https://github.com/graphframes/graphframes
      Thanks to Joseph Bradley, Xiangrui Meng, and Timothy Hunter.
      ankurd@eecs.berkeley.edu
