
Batch & Stream Graph Processing with Apache Flink



  1. Batch & Stream Graph Processing with Apache Flink Vasia Kalavri vasia@apache.org @vkalavri

  2. Outline • Distributed Graph Processing • Gelly: Batch Graph Processing with Flink • Gelly-Stream: Continuous Graph Processing with Flink

  3. WHEN DO YOU NEED DISTRIBUTED GRAPH PROCESSING?

  4. MISCONCEPTION #1 MY GRAPH IS SO BIG, IT DOESN’T FIT IN A SINGLE MACHINE Big Data Ninja

  5. A SOCIAL NETWORK

  6. YOUR INPUT DATASET SIZE IS _OFTEN_ IRRELEVANT

  7. INTERMEDIATE DATA: THE OFTEN ▸ Naive Who(m) to Follow: ▸ compute a friends-of-friends list per user ▸ exclude existing friends ▸ rank by common connections
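The three steps of the naive approach can be sketched in plain Java to show why intermediate data, not input size, is the problem: every two-hop path becomes a record before anything is filtered or ranked. This is a minimal single-machine sketch; the class name, method names, and the six-edge toy graph are assumptions made for illustration and do not come from the slides.

```java
import java.util.*;

class WhoToFollow {
    /** Builds an undirected friendship graph from {a, b} pairs (hypothetical helper). */
    static Map<Integer, Set<Integer>> undirected(int[][] edges) {
        Map<Integer, Set<Integer>> friends = new TreeMap<>();
        for (int[] e : edges) {
            friends.computeIfAbsent(e[0], k -> new TreeSet<>()).add(e[1]);
            friends.computeIfAbsent(e[1], k -> new TreeSet<>()).add(e[0]);
        }
        return friends;
    }

    /** Counts the (user, friend-of-friend) records produced before filtering and ranking. */
    static long intermediateRecords(Map<Integer, Set<Integer>> friends) {
        long records = 0;
        for (Set<Integer> direct : friends.values()) {
            for (int f : direct) {
                records += friends.get(f).size(); // one record per two-hop path
            }
        }
        return records;
    }

    /** Friends-of-friends of `user`, minus existing friends, ranked by common connections. */
    static List<Integer> recommend(Map<Integer, Set<Integer>> friends, int user) {
        Set<Integer> direct = friends.getOrDefault(user, Collections.emptySet());
        Map<Integer, Integer> common = new HashMap<>();
        for (int f : direct) {
            for (int fof : friends.get(f)) {
                if (fof != user && !direct.contains(fof)) {
                    common.merge(fof, 1, Integer::sum);
                }
            }
        }
        List<Integer> ranked = new ArrayList<>(common.keySet());
        // Most common connections first; ties broken by smaller id.
        ranked.sort((a, b) -> {
            int byCount = Integer.compare(common.get(b), common.get(a));
            return byCount != 0 ? byCount : Integer.compare(a, b);
        });
        return ranked;
    }

    public static void main(String[] args) {
        Map<Integer, Set<Integer>> g =
            undirected(new int[][] {{1, 2}, {1, 3}, {2, 3}, {2, 4}, {3, 4}, {4, 5}});
        System.out.println("two-hop records: " + intermediateRecords(g));
        System.out.println("suggestions for user 5: " + recommend(g, 5));
    }
}
```

The number of two-hop records equals the sum of squared vertex degrees, so a few high-degree users are enough to make the intermediate result far larger than the edge list.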

  8. MISCONCEPTION #2 DISTRIBUTED PROCESSING IS ALWAYS FASTER THAN SINGLE-NODE Data Science Rockstar

  9. GRAPHS DON’T APPEAR OUT OF THIN AIR Expectation…

  10. GRAPHS DON’T APPEAR OUT OF THIN AIR Reality!

  11. HOW DO WE EXPRESS A DISTRIBUTED GRAPH ANALYSIS TASK?

  12. GRAPH APPLICATIONS ARE DIVERSE ▸ Iterative value propagation ▸ PageRank, Connected Components, Label Propagation ▸ Traversals and path exploration ▸ Shortest paths, centrality measures ▸ Ego-network analysis ▸ Personalized recommendations ▸ Pattern mining ▸ Finding frequent subgraphs

  13. LINEAR ALGEBRA
      Adjacency Matrix (row = source vertex, column = target vertex)
          1 2 3 4 5
        1 0 0 1 1 0
        2 1 0 0 1 0
        3 0 0 0 0 0
        4 0 1 1 0 1
        5 0 0 1 0 0
      - Partition by rows, columns, blocks
      - Efficient representation of non-zero elements
      - Algorithms expressed as vector-matrix multiplications

  14. BREADTH-FIRST SEARCH (figure: example graph with vertices 1-5 and its adjacency matrix)
          1 2 3 4 5
        1 0 0 1 1 0
        2 1 0 0 1 0
        3 0 0 0 0 0
        4 0 1 1 0 1
        5 0 0 1 0 0

  15. BREADTH-FIRST SEARCH (figure: same graph and adjacency matrix, next animation step)

  16. BREADTH-FIRST SEARCH (figure: the adjacency matrix multiplied by a 0/1 vector, first step)

  17. BREADTH-FIRST SEARCH (figure: the adjacency matrix multiplied by a 0/1 vector, second step)

  18. BREADTH-FIRST SEARCH (figure: the adjacency matrix multiplied by a 0/1 vector, third step)
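Breadth-first search as repeated vector-matrix multiplication can be sketched as follows. This is a minimal sketch that assumes the frontier is a 0/1 row vector multiplied from the left and that already-visited vertices are masked out after each product; the class and method names are hypothetical, and only the matrix itself is taken from the slides.

```java
import java.util.Arrays;

class MatrixBfs {
    // Adjacency matrix from the slides: A[i][j] == 1 means an edge from vertex i+1 to vertex j+1.
    static final int[][] A = {
        {0, 0, 1, 1, 0},
        {1, 0, 0, 1, 0},
        {0, 0, 0, 0, 0},
        {0, 1, 1, 0, 1},
        {0, 0, 1, 0, 0}
    };

    /** Boolean vector-matrix product: which vertices are one hop away from the frontier. */
    static boolean[] multiply(boolean[] frontier, int[][] a) {
        int n = a.length;
        boolean[] next = new boolean[n];
        for (int i = 0; i < n; i++) {
            if (!frontier[i]) continue;
            for (int j = 0; j < n; j++) {
                if (a[i][j] != 0) next[j] = true;
            }
        }
        return next;
    }

    /** BFS level of every vertex from the 0-based source, or -1 if unreachable. */
    static int[] levels(int[][] a, int source) {
        int n = a.length;
        int[] level = new int[n];
        Arrays.fill(level, -1);
        boolean[] frontier = new boolean[n];
        frontier[source] = true;
        level[source] = 0;
        for (int depth = 1; ; depth++) {
            boolean[] reached = multiply(frontier, a);
            boolean[] next = new boolean[n];
            boolean any = false;
            for (int j = 0; j < n; j++) {
                if (reached[j] && level[j] < 0) { // keep only newly discovered vertices
                    level[j] = depth;
                    next[j] = true;
                    any = true;
                }
            }
            if (!any) return level;
            frontier = next;
        }
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(levels(A, 0)));
    }
}
```

Starting from vertex 1, the first product selects row 1 of the matrix, so vertices 3 and 4 form the next frontier.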

  19. RECENT DISTRIBUTED GRAPH PROCESSING HISTORY (figure: timeline with the years 2004, 2009, 2010, 2012, 2013, 2014, 2015; systems shown: MapReduce, Pregel, Giraph++, Arabesque, Tinkerpop, Pegasus, PowerGraph, Signal-Collect, NScale; categories: Iterative value propagation, Graph Traversals, Ego-network analysis, Pattern Matching)

  20. PREGEL: THINK LIKE A VERTEX (figure: the example graph next to per-vertex adjacency lists, e.g. 1: 3, 4; 2: 1, 4; 5: 3)

  21. PREGEL: SUPERSTEPS (figure: the per-vertex state in superstep i and in superstep i+1)
      (V_{i+1}, outbox) <- compute(V_i, inbox)

  22. PAGERANK: THE WORD COUNT OF GRAPH PROCESSING
      VertexID   Out-degree   Transition Probability
      1          2            1/2
      2          2            1/2
      3          0            -
      4          3            1/3
      5          1            1

  23. PAGERANK: THE WORD COUNT OF GRAPH PROCESSING (same graph and table as slide 22)
      PR(3) = 0.5*PR(1) + 0.33*PR(4) + PR(5)
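The sum for vertex 3 is one instance of a general rule: each in-neighbor contributes its rank multiplied by its transition probability. Written out, with $d_{\mathrm{out}}(u)$ for the out-degree of $u$ and $|V|$ for the number of vertices (notation assumed here, not used on the slides):

```latex
\mathrm{PR}(v) \;=\; \sum_{u \,:\, (u,v) \in E} \frac{\mathrm{PR}(u)}{d_{\mathrm{out}}(u)}
\qquad\text{e.g.}\qquad
\mathrm{PR}(3) \;=\; \tfrac{1}{2}\,\mathrm{PR}(1) + \tfrac{1}{3}\,\mathrm{PR}(4) + \mathrm{PR}(5)

% Damped form, matching the 0.15 / 0.85 constants in the Pregel pseudocode on slide 24:
\mathrm{PR}(v) \;=\; \frac{0.15}{|V|} \;+\; 0.85 \sum_{u \,:\, (u,v) \in E} \frac{\mathrm{PR}(u)}{d_{\mathrm{out}}(u)}
```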

  24. PREGEL EXAMPLE: PAGERANK
      void compute(messages):
          // sum up received messages
          sum = 0.0
          for (m <- messages) do
              sum = sum + m
          end for
          // update vertex rank
          setValue(0.15/numVertices() + 0.85*sum)
          // distribute rank to neighbors
          for (edge <- getOutEdges()) do
              sendMessageTo(edge.target(), getValue()/numEdges)
          end for
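To see the superstep mechanics end to end, the compute() logic can be simulated on a single machine over the five-vertex example graph. This is a sketch under stated assumptions, not Pregel or Giraph code: ranks start at 1/numVertices, superstep 0 only distributes the initial rank (the slide does not show how the first superstep is handled), and the class and helper names are hypothetical.

```java
import java.util.*;

class PregelPageRank {
    // Out-edges of the example graph, 0-based: 1->{3,4}, 2->{1,4}, 3->{}, 4->{2,3,5}, 5->{3}.
    static final int[][] OUT = { {2, 3}, {0, 3}, {}, {1, 2, 4}, {2} };

    /** Creates one empty message list per vertex. */
    static List<List<Double>> emptyBoxes(int n) {
        List<List<Double>> boxes = new ArrayList<>();
        for (int i = 0; i < n; i++) boxes.add(new ArrayList<>());
        return boxes;
    }

    /** Runs the given number of supersteps and returns the vertex ranks. */
    static double[] run(int[][] out, int supersteps) {
        int n = out.length;
        double[] value = new double[n];
        Arrays.fill(value, 1.0 / n); // assumed initial rank
        List<List<Double>> inbox = emptyBoxes(n);
        for (int step = 0; step < supersteps; step++) {
            List<List<Double>> outbox = emptyBoxes(n);
            for (int v = 0; v < n; v++) {
                if (step > 0) {
                    // sum up received messages and update the vertex rank
                    double sum = 0.0;
                    for (double m : inbox.get(v)) sum += m;
                    value[v] = 0.15 / n + 0.85 * sum;
                }
                // distribute rank to neighbors
                for (int target : out[v]) {
                    outbox.get(target).add(value[v] / out[v].length);
                }
            }
            inbox = outbox; // synchronization barrier: messages become visible next superstep
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(run(OUT, 30)));
    }
}
```

Vertex 3 has no out-edges, so the rank it collects is not passed on and the ranks do not sum to one in this sketch; the slide's pseudocode does not address such dangling vertices either.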

  25. SIGNAL-COLLECT (figure: a Signal phase and a Collect phase between superstep i and superstep i+1)
      outbox <- signal(V_i)
      V_{i+1} <- collect(inbox)

  26. SIGNAL-COLLECT EXAMPLE: PAGERANK
      void signal():
          // distribute rank to neighbors
          for (edge <- getOutEdges()) do
              sendMessageTo(edge.target(), getValue()/numEdges)
          end for
      void collect(messages):
          // sum up received messages
          sum = 0.0
          for (m <- messages) do
              sum = sum + m
          end for
          // update vertex rank
          setValue(0.15/numVertices() + 0.85*sum)

  27. GATHER-SUM-APPLY (POWERGRAPH) (figure: the Gather, Sum, and Apply phases across superstep i and superstep i+1)

  28. GSA EXAMPLE: PAGERANK
      double gather(source, edge, target):
          // compute partial rank
          return target.value() / target.numEdges()
      double sum(rank1, rank2):
          // combine partial ranks
          return rank1 + rank2
      double apply(sum, currentRank):
          // update rank
          return 0.15 + 0.85*sum

  29. PROBLEMS WITH VERTEX-CENTRIC MODELS ▸ Excessive communication ▸ Worker load imbalance ▸ Global synchronization ▸ High memory requirements ▸ inbox/outbox can grow too large ▸ overhead for low-degree vertices in GSA

  30. Vertex-Centric Connected Components ‣ Propagate the minimum value through the graph ‣ In each superstep, the value propagates one hop ‣ Requires diameter + 1 supersteps to converge

  31. THINK LIKE A (SUB)GRAPH (figure: the example graph split into two partitions)
      - compute() on the entire partition
      - Information flows freely inside each partition
      - Network communication between partitions, not vertices

  32. Subgraph-Centric Connected Components ‣ In each superstep, the value propagates throughout each subgraph ‣ Communication between partitions only ‣ Requires (possibly) fewer supersteps to converge
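The difference in superstep counts between the vertex-centric and subgraph-centric versions of connected components can be checked with a small single-machine simulation. This is a sketch under stated assumptions: the graph is an undirected path, a run ends with the first superstep in which no label changes (that final superstep is counted), and all class, method, and parameter names are hypothetical.

```java
import java.util.*;

class ComponentsSupersteps {
    /** Neighbor lists of an undirected path 0 - 1 - ... - (n-1). */
    static int[][] path(int n) {
        int[][] nb = new int[n][];
        for (int v = 0; v < n; v++) {
            if (n == 1) nb[v] = new int[0];
            else if (v == 0) nb[v] = new int[] {1};
            else if (v == n - 1) nb[v] = new int[] {n - 2};
            else nb[v] = new int[] {v - 1, v + 1};
        }
        return nb;
    }

    /** Vertex-centric: a label moves at most one hop per superstep. Updates `label` in place. */
    static int vertexCentric(int[][] neighbors, int[] label) {
        int supersteps = 0;
        boolean changed = true;
        while (changed) {
            changed = false;
            int[] next = label.clone(); // read old labels, write new ones (synchronous step)
            for (int v = 0; v < label.length; v++) {
                for (int u : neighbors[v]) {
                    if (label[u] < next[v]) { next[v] = label[u]; changed = true; }
                }
            }
            System.arraycopy(next, 0, label, 0, label.length);
            supersteps++;
        }
        return supersteps;
    }

    /** Subgraph-centric: labels spread freely inside a partition, then cross partitions once. */
    static int subgraphCentric(int[][] neighbors, int[] partition, int[] label) {
        int supersteps = 0;
        boolean changed = true;
        while (changed) {
            changed = false;
            // Local phase: propagate to a fixpoint along edges inside each partition.
            boolean localChange = true;
            while (localChange) {
                localChange = false;
                for (int v = 0; v < label.length; v++) {
                    for (int u : neighbors[v]) {
                        if (partition[u] == partition[v] && label[u] < label[v]) {
                            label[v] = label[u]; localChange = true; changed = true;
                        }
                    }
                }
            }
            // Communication phase: exchange labels across partition boundaries only.
            int[] next = label.clone();
            for (int v = 0; v < label.length; v++) {
                for (int u : neighbors[v]) {
                    if (partition[u] != partition[v] && label[u] < next[v]) {
                        next[v] = label[u]; changed = true;
                    }
                }
            }
            System.arraycopy(next, 0, label, 0, label.length);
            supersteps++;
        }
        return supersteps;
    }

    public static void main(String[] args) {
        System.out.println(vertexCentric(path(6), new int[] {1, 2, 3, 4, 5, 6}));
        System.out.println(subgraphCentric(path(6), new int[] {0, 0, 0, 1, 1, 1},
                                           new int[] {1, 2, 3, 4, 5, 6}));
    }
}
```

On a path with the minimum label at one end, the vertex-centric run hits the diameter + 1 bound, while splitting the path into two partitions cuts the count because only one boundary has to be crossed. How much is saved in practice depends on the partitioning.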

  33. RECENT DISTRIBUTED GRAPH PROCESSING HISTORY (figure: timeline with the years 2004, 2009, 2010, 2012, 2013, 2014, 2015; systems shown: MapReduce, Pregel, Giraph++, Arabesque, Tinkerpop, Pegasus, PowerGraph, Signal-Collect, NScale; categories: Iterative value propagation, Graph Traversals, Ego-network analysis, Pattern Matching)

  34. CAN WE HAVE IT ALL? ▸ Data pipeline integration: built on top of an efficient distributed processing engine ▸ Graph ETL: high-level API with abstractions and methods to transform graphs ▸ Familiar programming model: support popular programming abstractions

  35. Gelly the Apache Flink Graph API

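  36. Flink Stack (figure: layered stack diagram)
      • Libraries and compatibility layers, as labeled in the figure: Dataflow (WiP), Hadoop M/R, Cascading, Dataflow, SAMOA, Gelly, Table (twice), CEP, ML
      • APIs: DataSet (Java/Scala), DataStream (Java/Scala)
      • Runtime: Streaming dataflow runtime
      • Deployment: Local, Remote, Yarn, Embedded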

  37. Why Graph Processing with Apache Flink? • Native Iteration Operators • DataSet Optimizations • Ecosystem Integration

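  38. Flink Iteration Operators (figure: dataflow diagrams of two iteration operators; labels in the first: Input, Iterative Update Function, Replace, Result; labels in the second: Workset, Solution Set, State, Iterative Update Function, Result)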

  39. Optimization • the runtime is aware of the iterative execution • no scheduling overhead between iterations • caching and state maintenance are handled automatically (figure labels: Push work “out of the loop”, Cache loop-invariant data, Maintain state as index)

  40. Beyond Iterations • Performance & Scalability • Memory management • Efficient serialization framework • Operations on binary data • Automatic Optimizations • Choose best execution strategy • Cache invariant data

  41. Meet Gelly • Java & Scala Graph APIs on top of Flink • graph transformations and utilities • iterative graph processing • library of graph algorithms • Can be seamlessly mixed with the DataSet Flink API to easily implement applications that use both record-based and graph-based analysis

  42. Hello, Gelly!
      Java
          ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
          DataSet<Edge<Long, NullValue>> edges = getEdgesDataSet(env);
          Graph<Long, Long, NullValue> graph = Graph.fromDataSet(edges, env);
          DataSet<Vertex<Long, Long>> verticesWithMinIds = graph.run(new ConnectedComponents(maxIterations));
      Scala
          val env = ExecutionEnvironment.getExecutionEnvironment
          val edges: DataSet[Edge[Long, NullValue]] = getEdgesDataSet(env)
          val graph = Graph.fromDataSet(edges, env)
          val components = graph.run(new ConnectedComponents(maxIterations))

  43. Graph Methods
      • Transformations: map, filter, join; subgraph, union, difference; reverse, undirected; getTriplets
      • Mutations: add vertex/edge, remove vertex/edge
      • Graph Properties: getVertexIds, getEdgeIds, numberOfVertices, numberOfEdges, getDegrees, ...

  44. Example: mapVertices (figure: a graph before and after each vertex value is incremented by one)
          val graph = Graph.fromDataSet(...)
          // increment each vertex value by one
          val updatedGraph = graph.mapVertices(v => v.getValue + 1)

  45. Example: subGraph
          val graph: Graph[Long, Long, Long] = ...
          // keep only vertices with positive values
          // and only edges with negative values
          val subGraph = graph.subgraph(
            vertex => vertex.getValue > 0,
            edge => edge.getValue < 0)

  46. Neighborhood Methods • Apply a reduce function to the 1st-hop neighborhood of each vertex in parallel
          graph.reduceOnNeighbors(new MinValue, EdgeDirection.OUT)

  47. Iterative Graph Processing • Gelly offers iterative graph processing abstractions on top of Flink’s Delta iterations • vertex-centric • scatter-gather • gather-sum-apply • partition-centric*
