Batch & Stream Graph Processing with Apache Flink



slide-1
SLIDE 1

Batch & Stream Graph Processing with Apache Flink

Vasia Kalavri vasia@apache.org @vkalavri

slide-2
SLIDE 2
slide-3
SLIDE 3

Outline

  • Distributed Graph Processing
  • Gelly: Batch Graph Processing with Flink
  • Gelly-Stream: Continuous Graph Processing with Flink
slide-4
SLIDE 4

WHEN DO YOU NEED DISTRIBUTED GRAPH PROCESSING?

slide-5
SLIDE 5

MY GRAPH IS SO BIG, IT DOESN’T FIT IN A SINGLE MACHINE

Big Data Ninja

MISCONCEPTION #1

slide-6
SLIDE 6

A SOCIAL NETWORK

slide-7
SLIDE 7

YOUR INPUT DATASET SIZE IS _OFTEN_ IRRELEVANT

slide-8
SLIDE 8

INTERMEDIATE DATA: THE OFTEN

▸ Naive Who(m) to Follow:

  ▸ compute a friends-of-friends list per user
  ▸ exclude existing friends
  ▸ rank by common connections
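The three steps above can be sketched in plain Java. This is a minimal single-machine illustration, not the deck's implementation: the class name `WhoToFollow`, its helper methods, and the tiny example graph are all assumptions made for the sketch. `intermediatePairs` counts how many friend-of-friend records the naive approach materializes, which is the sum of squared degrees and can dwarf the input edge count.

```java
import java.util.*;

// Hypothetical sketch of the naive "Who(m) to Follow" steps on an in-memory graph.
class WhoToFollow {
    /** Builds a symmetric friendship map from an undirected edge list. */
    static Map<Integer, Set<Integer>> undirected(int[][] edges) {
        Map<Integer, Set<Integer>> friends = new HashMap<>();
        for (int[] e : edges) {
            friends.computeIfAbsent(e[0], k -> new HashSet<>()).add(e[1]);
            friends.computeIfAbsent(e[1], k -> new HashSet<>()).add(e[0]);
        }
        return friends;
    }

    /** Friends-of-friends of `user`, minus existing friends, ranked by common connections. */
    static List<Integer> recommend(Map<Integer, Set<Integer>> friends, int user) {
        Map<Integer, Integer> common = new HashMap<>();
        Set<Integer> mine = friends.getOrDefault(user, Set.of());
        for (int friend : mine) {
            for (int fof : friends.getOrDefault(friend, Set.of())) {
                if (fof != user && !mine.contains(fof)) {
                    common.merge(fof, 1, Integer::sum); // one more common connection
                }
            }
        }
        List<Integer> ranked = new ArrayList<>(common.keySet());
        ranked.sort((a, b) -> {
            int byCount = Integer.compare(common.get(b), common.get(a));
            return byCount != 0 ? byCount : Integer.compare(a, b);
        });
        return ranked;
    }

    /** Number of (user, friend-of-friend) records generated before filtering and ranking. */
    static long intermediatePairs(Map<Integer, Set<Integer>> friends) {
        long total = 0;
        for (Set<Integer> fs : friends.values()) {
            for (int f : fs) {
                total += friends.getOrDefault(f, Set.of()).size();
            }
        }
        return total;
    }
}
```

On the 5-edge example graph used in the test, the intermediate list already holds 22 records, more than four times the number of input edges.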

slide-9
SLIDE 9

DISTRIBUTED PROCESSING IS ALWAYS FASTER THAN SINGLE-NODE

Data Science Rockstar

MISCONCEPTION #2

slide-10
SLIDE 10
slide-11
SLIDE 11

GRAPHS DON’T APPEAR OUT OF THIN AIR

Expectation…

slide-12
SLIDE 12

GRAPHS DON’T APPEAR OUT OF THIN AIR

Reality!

slide-13
SLIDE 13

HOW DO WE EXPRESS A DISTRIBUTED GRAPH ANALYSIS TASK?

slide-14
SLIDE 14

GRAPH APPLICATIONS ARE DIVERSE

▸ Iterative value propagation

▸ PageRank, Connected Components, Label Propagation

▸ Traversals and path exploration

▸ Shortest paths, centrality measures

▸ Ego-network analysis

▸ Personalized recommendations

▸ Pattern mining

▸ Finding frequent subgraphs

slide-15
SLIDE 15

LINEAR ALGEBRA

Adjacency Matrix

  • Partition by rows, columns, blocks
  • Efficient representation of non-zero elements
  • Algorithms expressed as vector-matrix multiplications

[Figure: example graph with vertices 1 to 5 and its 5x5 adjacency matrix]

slide-16
SLIDE 16

BREADTH-FIRST SEARCH

[Figure: example graph with vertices 1 to 5 and its adjacency matrix]

slide-17
SLIDE 17

BREADTH-FIRST SEARCH

[Figure: example graph with vertices 1 to 5 and its adjacency matrix, next animation step]

slide-18
SLIDE 18

BREADTH-FIRST SEARCH

[Figure: frontier vector X adjacency matrix = next frontier vector, on the example graph]

slide-19
SLIDE 19

BREADTH-FIRST SEARCH

[Figure: frontier vector X adjacency matrix = next frontier vector, next animation step]

slide-20
SLIDE 20

BREADTH-FIRST SEARCH

[Figure: frontier vector X adjacency matrix = next frontier vector, with a larger frontier]
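The idea behind these BFS slides, expanding the frontier by multiplying a vector with the adjacency matrix, can be sketched as follows. This is an illustrative sketch only: the class `MatrixBfs` and its methods are hypothetical, and the edge list in the usage note contains just the edges that can be read off other slides of this deck (vertex k of the slides is index k-1 here), so it is not guaranteed to match the figure.

```java
import java.util.*;

// Hypothetical sketch: BFS expressed as repeated boolean vector-matrix products.
class MatrixBfs {
    /** Builds a dense boolean adjacency matrix from 0-based directed edges. */
    static boolean[][] fromEdges(int n, int[][] edges) {
        boolean[][] adj = new boolean[n][n];
        for (int[] e : edges) {
            adj[e[0]][e[1]] = true;
        }
        return adj;
    }

    /** One step: next[j] = OR over i of (frontier[i] AND adj[i][j]). */
    static boolean[] step(boolean[] frontier, boolean[][] adj) {
        int n = adj.length;
        boolean[] next = new boolean[n];
        for (int i = 0; i < n; i++) {
            if (!frontier[i]) continue;
            for (int j = 0; j < n; j++) {
                next[j] |= adj[i][j];
            }
        }
        return next;
    }

    /** BFS level of every vertex from `source`, or -1 if unreachable. */
    static int[] levels(boolean[][] adj, int source) {
        int n = adj.length;
        int[] level = new int[n];
        Arrays.fill(level, -1);
        boolean[] frontier = new boolean[n];
        frontier[source] = true;
        level[source] = 0;
        for (int depth = 1; ; depth++) {
            boolean[] reached = step(frontier, adj);
            boolean any = false;
            for (int j = 0; j < n; j++) {
                if (reached[j] && level[j] == -1) {
                    level[j] = depth;   // first time this vertex is reached
                    any = true;
                } else {
                    reached[j] = false; // mask out vertices visited earlier
                }
            }
            if (!any) return level;
            frontier = reached;
        }
    }
}
```

With the edges 1→3, 1→4, 2→1, 2→4, 4→3, 5→3 and vertex 2 as the source, vertices 1 and 4 are at level 1, vertex 3 is at level 2, and vertex 5 is unreachable.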

slide-21
SLIDE 21

RECENT DISTRIBUTED GRAPH PROCESSING HISTORY

[Figure: timeline of systems, grouped by application type]

2004  MapReduce
2009  Pegasus
2010  Pregel, Signal-Collect
2012  PowerGraph
2013  Giraph++
2014  NScale
2015  Arabesque
      Tinkerpop

Application types shown: Iterative value propagation, Graph Traversals, Ego-network analysis, Pattern Matching

slide-22
SLIDE 22

PREGEL: THINK LIKE A VERTEX

[Figure: example graph with vertices 1 to 5, each vertex holding its own adjacency list]

1 -> 3, 4
2 -> 1, 4
5 -> 3
. . .

slide-23
SLIDE 23

PREGEL: SUPERSTEPS

(V_{i+1}, outbox) <- compute(V_i, inbox)

[Figure: vertex states and adjacency lists in superstep i and in superstep i+1]

slide-24
SLIDE 24

PAGERANK: THE WORD COUNT OF GRAPH PROCESSING

VertexID   Out-degree   Transition Probability
1          2            1/2
2          2            1/2
3          -            -
4          3            1/3
5          1            1

[Figure: example graph with vertices 1 to 5]

slide-25
SLIDE 25

PAGERANK: THE WORD COUNT OF GRAPH PROCESSING

VertexID   Out-degree   Transition Probability
1          2            1/2
2          2            1/2
3          -            -
4          3            1/3
5          1            1

[Figure: example graph with vertices 1 to 5]

PR(3) = 0.5*PR(1) + 0.33*PR(4) + PR(5)

slide-26
SLIDE 26

PREGEL EXAMPLE: PAGERANK

void compute(messages):
    // sum up received messages
    sum = 0.0
    for (m <- messages) do
        sum = sum + m
    end for

    // update vertex rank
    setValue(0.15/numVertices() + 0.85*sum)

    // distribute rank to neighbors
    for (edge <- getOutEdges()) do
        sendMessageTo(edge.target(), getValue()/numEdges)
    end for
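To make the superstep mechanics concrete, here is a small single-process simulation of the pseudocode above. It is a sketch under stated assumptions: the class `PregelPageRank` is hypothetical, messages are summed on delivery (as a combiner would do) instead of being kept in per-vertex lists, every vertex starts with rank 1/n, and vertices without out-edges simply stop forwarding their rank.

```java
import java.util.*;

// Hypothetical single-process simulation of Pregel-style PageRank.
class PregelPageRank {
    /**
     * outEdges[v] lists the targets of v's out-edges (0-based vertex IDs).
     * Superstep 0 only sends the initial ranks; supersteps 1..supersteps update and send.
     */
    static double[] run(int[][] outEdges, int supersteps) {
        int n = outEdges.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);
        double[] inbox = new double[n]; // sum of the messages delivered to each vertex
        for (int s = 0; s <= supersteps; s++) {
            double[] outbox = new double[n];
            for (int v = 0; v < n; v++) {
                if (s > 0) {
                    // update vertex rank from the summed messages
                    rank[v] = 0.15 / n + 0.85 * inbox[v];
                }
                // distribute rank evenly over the out-edges
                for (int target : outEdges[v]) {
                    outbox[target] += rank[v] / outEdges[v].length;
                }
            }
            inbox = outbox; // messages become visible in the next superstep
        }
        return rank;
    }
}
```

A vertex that nobody links to ends up with exactly 0.15/n, and as long as every vertex has at least one out-edge the ranks keep summing to 1.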

slide-27
SLIDE 27

SIGNAL-COLLECT

outbox <- signal(V_i)

V_{i+1} <- collect(inbox)

[Figure: the Signal phase and the Collect phase of a vertex, shown across superstep i and superstep i+1]

slide-28
SLIDE 28

SIGNAL-COLLECT EXAMPLE: PAGERANK

void signal():
    // distribute rank to neighbors
    for (edge <- getOutEdges()) do
        sendMessageTo(edge.target(), getValue()/numEdges)
    end for

void collect(messages):
    // sum up received messages
    sum = 0.0
    for (m <- messages) do
        sum = sum + m
    end for

    // update vertex rank
    setValue(0.15/numVertices() + 0.85*sum)

slide-29
SLIDE 29

GATHER-SUM-APPLY (POWERGRAPH)

[Figure: the Gather, Sum, and Apply phases of a vertex and its neighbors, shown across superstep i and superstep i+1]

slide-30
SLIDE 30

GSA EXAMPLE: PAGERANK

// compute partial rank
double gather(source, edge, target):
    return target.value() / target.numEdges()

// combine partial ranks
double sum(rank1, rank2):
    return rank1 + rank2

// update rank
double apply(sum, currentRank):
    return 0.15 + 0.85*sum

slide-31
SLIDE 31

PROBLEMS WITH VERTEX-CENTRIC MODELS

▸ Excessive communication
▸ Worker load imbalance
▸ Global Synchronization
▸ High memory requirements

  ▸ inbox/outbox can grow too large
  ▸ overhead for low-degree vertices in GSA

slide-32
SLIDE 32

Vertex-Centric Connected Components

  • Propagate the minimum value through the graph
  • In each superstep, the value propagates one hop
  • Requires diameter + 1 supersteps to converge
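The "diameter + 1" bound can be checked with a short simulation of synchronous minimum-label propagation. The sketch below is an assumption-laden illustration, not Gelly code: the class `MinLabelPropagation` is hypothetical, the graph is undirected, and the last superstep is the one in which no label changes.

```java
import java.util.*;

// Hypothetical sketch: synchronous min-label propagation for connected components.
class MinLabelPropagation {
    /**
     * edges are undirected pairs of 0-based vertex indices; labels[v] starts as v's ID
     * and is overwritten with v's component ID. Returns the number of supersteps run,
     * counting the final one in which nothing changes.
     */
    static int run(int[][] edges, int[] labels) {
        int supersteps = 0;
        boolean changed = true;
        while (changed) {
            supersteps++;
            int[] next = labels.clone();
            for (int[] e : edges) {
                // each endpoint sees the label its neighbor held in the previous superstep
                next[e[0]] = Math.min(next[e[0]], labels[e[1]]);
                next[e[1]] = Math.min(next[e[1]], labels[e[0]]);
            }
            changed = !Arrays.equals(next, labels);
            System.arraycopy(next, 0, labels, 0, labels.length);
        }
        return supersteps;
    }
}
```

On a path of five vertices (diameter 4) with the minimum ID at one end, the simulation runs 5 supersteps: four to carry the minimum to the far end and one to detect that nothing changed.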

slide-33
SLIDE 33

THINK LIKE A (SUB)GRAPH

[Figure: the example graph with vertices 1 to 5, split into partitions]

  • compute() on the entire partition
  • Information flows freely inside each partition
  • Network communication between partitions, not vertices

slide-34
SLIDE 34

Subgraph-Centric Connected Components

  • In each superstep, the value propagates throughout each subgraph
  • Communication between partitions only
  • Requires (possibly) fewer supersteps to converge

slide-35
SLIDE 35

RECENT DISTRIBUTED GRAPH PROCESSING HISTORY

[Figure: timeline of systems, grouped by application type]

2004  MapReduce
2009  Pegasus
2010  Pregel, Signal-Collect
2012  PowerGraph
2013  Giraph++
2014  NScale
2015  Arabesque
      Tinkerpop

Application types shown: Iterative value propagation, Graph Traversals, Ego-network analysis, Pattern Matching

slide-36
SLIDE 36

CAN WE HAVE IT ALL?

▸ Data pipeline integration: built on top of an efficient distributed processing engine

▸ Graph ETL: high-level API with abstractions and methods to transform graphs

▸ Familiar programming model: support popular programming abstractions

slide-37
SLIDE 37

Gelly

the Apache Flink Graph API

slide-38
SLIDE 38

Flink Stack

[Figure: the Flink stack. Libraries and compatibility layers: Gelly, Table, ML, SAMOA, CEP, Cascading, Hadoop M/R, Dataflow (WiP). APIs: DataSet (Java/Scala), DataStream (Java/Scala). Core: streaming dataflow runtime. Deployment: Local, Remote, Yarn, Embedded.]

slide-39
SLIDE 39

Why Graph Processing with Apache Flink?

  • Native Iteration Operators
  • DataSet Optimizations
  • Ecosystem Integration
slide-40
SLIDE 40

Flink Iteration Operators

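[Figure: two iteration operators. In the first, the iterative update function consumes the input and its result replaces the input for the next round. In the second, the iterative update function consumes a workset, and its result updates the solution set, which is kept as state.]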

slide-41
SLIDE 41

Optimization

  • the runtime is aware of the iterative execution
  • no scheduling overhead between iterations
  • caching and state maintenance are handled automatically

  • Push work “out of the loop”
  • Maintain state as index
  • Cache loop-invariant data

slide-42
SLIDE 42

Beyond Iterations

  • Performance & Scalability
    • Memory management
    • Efficient serialization framework
    • Operations on binary data
  • Automatic Optimizations
    • Choose best execution strategy
    • Cache invariant data
slide-43
SLIDE 43

Meet Gelly

  • Java & Scala Graph APIs on top of Flink
    • graph transformations and utilities
    • iterative graph processing
    • library of graph algorithms
  • Can be seamlessly mixed with the Flink DataSet API to easily implement applications that use both record-based and graph-based analysis

slide-44
SLIDE 44

Hello, Gelly!

Java:

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<Edge<Long, NullValue>> edges = getEdgesDataSet(env);
Graph<Long, Long, NullValue> graph = Graph.fromDataSet(edges, env);
DataSet<Vertex<Long, Long>> verticesWithMinIds = graph.run(
    new ConnectedComponents(maxIterations));

Scala:

val env = ExecutionEnvironment.getExecutionEnvironment
val edges: DataSet[Edge[Long, NullValue]] = getEdgesDataSet(env)
val graph = Graph.fromDataSet(edges, env)
val components = graph.run(new ConnectedComponents(maxIterations))

slide-45
SLIDE 45

Graph Methods

Graph Properties

getVertexIds getEdgeIds numberOfVertices numberOfEdges getDegrees ...

Transformations

map, filter, join subgraph, union, difference reverse, undirected getTriplets

Mutations

add vertex/edge remove vertex/edge

slide-46
SLIDE 46

Example: mapVertices

 val graph = Graph.fromDataSet(...)
 
 // increment each vertex value by one
 val updatedGraph = graph.mapVertices(v => v.getValue + 1)

[Figure: example graph showing the vertex values before and after mapVertices]

slide-47
SLIDE 47

Example: subGraph

val graph: Graph[Long, Long, Long] = ...
 
 // keep only vertices with positive values
 // and only edges with negative values
 val subGraph = graph.subgraph( vertex => vertex.getValue > 0, edge => edge.getValue < 0 )

slide-48
SLIDE 48

Neighborhood Methods

  • Apply a reduce function to the 1st-hop neighborhood of each vertex in parallel

graph.reduceOnNeighbors(new MinValue, EdgeDirection.OUT)

slide-49
SLIDE 49

Iterative Graph Processing

  • Gelly offers iterative graph processing abstractions on top of Flink’s Delta iterations
    • vertex-centric
    • scatter-gather
    • gather-sum-apply
    • partition-centric*
slide-50
SLIDE 50

Vertex-Centric SSSP

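final class SSSPComputeFunction extends ComputeFunction[Long, Double, Double, Double] {

  override def compute(vertex: Vertex[Long, Double], messages: MessageIterator[Double]) = {
    // the source starts at distance 0; every other vertex starts "infinitely" far away
    var minDistance = if (vertex.getId == srcId) 0.0 else Double.MaxValue

    // keep the smallest distance offered by any neighbor
    while (messages.hasNext) {
      val msg = messages.next
      if (msg < minDistance) minDistance = msg
    }

    // on improvement, store the new distance and propagate it along the out-edges
    if (vertex.getValue > minDistance) {
      setNewVertexValue(minDistance)
      for (edge: Edge[Long, Double] <- getEdges)
        sendMessageTo(edge.getTarget, minDistance + edge.getValue)
    }
  }
}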
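The same compute logic can be traced outside Flink with a small superstep simulation. This is a sketch only: the class `VertexCentricSssp` and its array-based graph encoding are assumptions made for illustration, and the per-vertex inbox keeps just the minimum message, which is all the compute function needs.

```java
import java.util.*;

// Hypothetical single-process simulation of vertex-centric SSSP.
class VertexCentricSssp {
    /** Edge i goes from from[i] to to[i] with weight w[i]; returns distances from src. */
    static double[] run(int n, int[] from, int[] to, double[] w, int src) {
        double[] dist = new double[n];
        Arrays.fill(dist, Double.POSITIVE_INFINITY);
        double[] inbox = new double[n]; // smallest message delivered to each vertex
        Arrays.fill(inbox, Double.POSITIVE_INFINITY);
        inbox[src] = 0.0;               // stands in for the source-ID check in compute()
        boolean messagesSent = true;
        while (messagesSent) {
            messagesSent = false;
            double[] outbox = new double[n];
            Arrays.fill(outbox, Double.POSITIVE_INFINITY);
            for (int v = 0; v < n; v++) {
                if (inbox[v] < dist[v]) {
                    dist[v] = inbox[v]; // improved distance: store it ...
                    for (int i = 0; i < from.length; i++) {
                        if (from[i] == v) { // ... and offer it to every out-neighbor
                            outbox[to[i]] = Math.min(outbox[to[i]], dist[v] + w[i]);
                            messagesSent = true;
                        }
                    }
                }
            }
            inbox = outbox; // delivered in the next superstep
        }
        return dist;
    }
}
```

The loop ends once a superstep sends no messages, which mirrors how a vertex-centric iteration halts when all vertices are inactive.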

slide-51
SLIDE 51

Library of Algorithms

  • PageRank*
  • Single Source Shortest Paths*
  • Label Propagation
  • Weakly Connected Components*
  • Community Detection
  • Triangle Count & Enumeration
  • Clustering Coefficient
  • Jaccard & Adamic-Adar Similarity
  • Graph Summarization

*: both scatter-gather and GSA implementations

val ranks = inputGraph.run(new PageRank(0.85, 20))
slide-52
SLIDE 52

Gelly-Stream

single-pass stream graph processing with Flink

slide-53
SLIDE 53

Real Graphs are dynamic

Graphs are created from events happening in real-time

slide-54
SLIDE 54
slide-55
SLIDE 55

How we’ve done graph processing so far

  • 1. Load: read the graph from disk and partition it in memory

slide-56
SLIDE 56

How we’ve done graph processing so far

  • 1. Load: read the graph from disk and partition it in memory
  • 2. Compute: read and mutate the graph state

slide-57
SLIDE 57

How we’ve done graph processing so far

  • 1. Load: read the graph from disk and partition it in memory
  • 2. Compute: read and mutate the graph state
  • 3. Store: write the final graph state back to disk

slide-58
SLIDE 58

What’s wrong with this model?

  • It is slow
    • wait until the computation is over before you see any result
    • pre-processing and partitioning
  • It is expensive
    • lots of memory and CPU required in order to scale
  • It requires re-computation for graph changes
    • no efficient way to deal with updates
slide-59
SLIDE 59

Graph Streaming Challenges

  • Maintain the dynamic graph structure
  • Provide up-to-date results with low latency
  • Compute on fresh state only

slide-60
SLIDE 60

Single-Pass Graph Streaming

  • Each event is an edge addition
  • Maintain only a graph summary
  • Recent events are grouped in graph windows

slide-61
SLIDE 61
slide-62
SLIDE 62

Graph Summaries

  • spanners for distance estimation
  • sparsifiers for cut estimation
  • sketches for homomorphic properties

[Figure: running the algorithm on the graph gives result R1; running it on the graph summary gives result R2, with R1 ~ R2]

slide-63
SLIDE 63

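Batch Connected Components

[Figure: iteration i=0. Two components, one with vertices 1 to 5 and one with vertices 6 to 8; every vertex is labeled with its own ID]

slide-64
SLIDE 64

Batch Connected Components

[Figure: iteration i=0. Every vertex sends its label to its neighbors]

slide-65
SLIDE 65

Batch Connected Components

[Figure: iteration i=1. Labels in the first component are now 1 or 2; all labels in the second component are 6]

slide-66
SLIDE 66

Batch Connected Components

[Figure: iteration i=1. Vertices send their updated labels to their neighbors]

slide-67
SLIDE 67

Batch Connected Components

[Figure: iteration i=2. All labels in the first component are 1; all labels in the second component are 6]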

slide-68
SLIDE 68

Stream Connected Components

[Figure: incoming edge stream over vertices 1 to 8]

Graph Summary: Disjoint Set (Union-Find)

  • Only store component IDs and vertex IDs

slide-69
SLIDE 69

[Figure: edge stream and disjoint-set summary. Cid = 1: {1, 3}]

slide-70
SLIDE 70

[Figure: edge stream and disjoint-set summary. Cid = 1: {1, 3}; Cid = 2: {2, 5}]

slide-71
SLIDE 71

[Figure: edge stream and disjoint-set summary. Cid = 1: {1, 3}; Cid = 2: {2, 5, 4}]

slide-72
SLIDE 72

[Figure: edge stream and disjoint-set summary. Cid = 1: {1, 3}; Cid = 2: {2, 5, 4}; Cid = 6: {6, 7}]

slide-73
SLIDE 73

[Figure: edge stream and disjoint-set summary. Cid = 1: {1, 3}; Cid = 2: {2, 5, 4}; Cid = 6: {6, 7, 8}]

slide-74
SLIDE 74

[Figure: edge stream and disjoint-set summary. Cid = 1: {1, 3}; Cid = 2: {2, 5, 4}; Cid = 6: {6, 7, 8}]

slide-75
SLIDE 75

[Figure: edge stream and disjoint-set summary, with the components rearranged. Cid = 6: {6, 7, 8}; Cid = 1: {1, 3}; Cid = 2: {2, 5, 4}]

slide-76
SLIDE 76

[Figure: edge stream and final disjoint-set summary. Cid = 1: {1, 3, 2, 5, 4}; Cid = 6: {6, 7, 8}]
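The walkthrough above can be reproduced with a compact union-find that consumes one edge at a time and stores only vertex IDs and component IDs. This is a minimal sketch, not the Gelly-Stream implementation: the class `StreamingComponents` is hypothetical, the smaller root always wins so that the component ID is the smallest vertex ID, and the edge list in the usage note is read off the figure residue of this transcript, so its order is an assumption.

```java
import java.util.*;

// Hypothetical sketch: single-pass connected components over an edge stream.
class StreamingComponents {
    private final Map<Integer, Integer> parent = new HashMap<>();

    /** Returns the component ID (root) of v, registering v on first sight. */
    int find(int v) {
        parent.putIfAbsent(v, v);
        int root = v;
        while (parent.get(root) != root) {
            root = parent.get(root);
        }
        // path compression: point every vertex on the path directly at the root
        while (parent.get(v) != root) {
            int next = parent.get(v);
            parent.put(v, root);
            v = next;
        }
        return root;
    }

    /** Processes one edge addition; the smaller root becomes the component ID. */
    void addEdge(int u, int v) {
        int ru = find(u);
        int rv = find(v);
        if (ru != rv) {
            parent.put(Math.max(ru, rv), Math.min(ru, rv));
        }
    }

    /** Current summary as component ID -> sorted member vertices. */
    Map<Integer, Set<Integer>> components() {
        Map<Integer, Set<Integer>> result = new TreeMap<>();
        for (int v : new ArrayList<>(parent.keySet())) {
            result.computeIfAbsent(find(v), k -> new TreeSet<>()).add(v);
        }
        return result;
    }
}
```

Feeding the edges (5,4), (7,6), (8,6), (4,2), (4,3), (8,7), (4,1), (3,1), (5,2) one by one ends with two components, Cid 1 = {1, 2, 3, 4, 5} and Cid 6 = {6, 7, 8}, matching the last slide of the walkthrough. The summary can be queried after any edge, which is what gives the streaming version its low latency.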

slide-77
SLIDE 77

Distributed Stream Connected Components

slide-78
SLIDE 78

Stream Connected Components with Flink

DataStream<DisjointSet> cc = edgeStream
 .keyBy(0)
 .timeWindow(Time.of(100, TimeUnit.MILLISECONDS))
 .fold(new DisjointSet(), new UpdateCC())
 .flatMap(new Merger())
 .setParallelism(1);

slide-79
SLIDE 79

Stream Connected Components with Flink

DataStream<DisjointSet> cc = edgeStream
 .keyBy(0)
 .timeWindow(Time.of(100, TimeUnit.MILLISECONDS))
 .fold(new DisjointSet(), new UpdateCC())
 .flatMap(new Merger())
 .setParallelism(1);

keyBy(0): partition the edge stream

slide-80
SLIDE 80

Stream Connected Components with Flink

DataStream<DisjointSet> cc = edgeStream
 .keyBy(0)
 .timeWindow(Time.of(100, TimeUnit.MILLISECONDS))
 .fold(new DisjointSet(), new UpdateCC())
 .flatMap(new Merger())
 .setParallelism(1);

timeWindow(...): define the merging frequency

slide-81
SLIDE 81

Stream Connected Components with Flink

DataStream<DisjointSet> cc = edgeStream
 .keyBy(0)
 .timeWindow(Time.of(100, TimeUnit.MILLISECONDS))
 .fold(new DisjointSet(), new UpdateCC())
 .flatMap(new Merger())
 .setParallelism(1);

fold(new DisjointSet(), new UpdateCC()): merge locally

slide-82
SLIDE 82

Stream Connected Components with Flink

DataStream<DisjointSet> cc = edgeStream
 .keyBy(0)
 .timeWindow(Time.of(100, TimeUnit.MILLISECONDS))
 .fold(new DisjointSet(), new UpdateCC())
 .flatMap(new Merger())
 .setParallelism(1);

flatMap(new Merger()) with setParallelism(1): merge globally
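The four stages of this pipeline (partition, window, merge locally, merge globally) can be mimicked without Flink for a single window. The sketch below is an assumption throughout: `DisjointSetSketch` is a hypothetical stand-in for the deck's `DisjointSet`, `UpdateCC`, and `Merger` classes, whose real code is not shown in the slides, and partitioning by the source vertex ID modulo the partition count only approximates `keyBy(0)`.

```java
import java.util.*;

// Hypothetical sketch: per-partition union-find summaries merged into one global summary.
class DisjointSetSketch {
    final Map<Long, Long> parent = new HashMap<>();

    long find(long v) {
        parent.putIfAbsent(v, v);
        long p = parent.get(v);
        if (p == v) return v;
        long root = find(p);
        parent.put(v, root); // path compression
        return root;
    }

    /** Local merge step: fold one edge into this summary. */
    void union(long u, long v) {
        long ru = find(u);
        long rv = find(v);
        if (ru != rv) {
            parent.put(Math.max(ru, rv), Math.min(ru, rv));
        }
    }

    /** Global merge step: replay the other summary's (vertex, parent) pairs as edges. */
    void merge(DisjointSetSketch other) {
        for (Map.Entry<Long, Long> e : other.parent.entrySet()) {
            union(e.getKey(), e.getValue());
        }
    }

    /** One window: fold edges per partition, then merge all partial summaries. */
    static DisjointSetSketch connectedComponents(long[][] edges, int partitions) {
        DisjointSetSketch[] local = new DisjointSetSketch[partitions];
        for (int i = 0; i < partitions; i++) {
            local[i] = new DisjointSetSketch();
        }
        for (long[] e : edges) {
            int p = (int) Math.floorMod(e[0], (long) partitions); // partition by source vertex
            local[p].union(e[0], e[1]);
        }
        DisjointSetSketch global = new DisjointSetSketch();
        for (DisjointSetSketch summary : local) {
            global.merge(summary);
        }
        return global;
    }
}
```

Only the partial summaries cross partition boundaries, never the raw edges, which is why the single-parallelism merge stage stays cheap. In the streaming job this would repeat for every window, with the global summary kept across windows.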

slide-83
SLIDE 83

Gelly on Streams

Gelly (on DataSet):

  • Static Graphs
  • Multi-Pass Algorithms
  • Full Computations

Gelly-Stream (on DataStream):

  • Dynamic Graphs
  • Single-Pass Algorithms
  • Approximate Computations

[Figure: stack with Gelly on DataSet and Gelly-Stream on DataStream, both on top of Distributed Dataflow and Deployment]

slide-84
SLIDE 84

Introducing Gelly-Stream

Gelly-Stream enriches the DataStream API with two new ADTs:

  • GraphStream:
    • A representation of a data stream of edges.
    • Edges can have state (e.g. weights).
    • Supports property streams, transformations and aggregations.
  • GraphWindow:
    • A “time-slice” of a graph stream.
    • It enables neighborhood aggregations.
slide-85
SLIDE 85

GraphStream Operations

Property Streams (GraphStream -> DataStream):

.getEdges() .getVertices() .numberOfVertices() .numberOfEdges() .getDegrees() .inDegrees() .outDegrees()

Transformations (GraphStream -> GraphStream):

.mapEdges(); .distinct(); .filterVertices(); .filterEdges(); .reverse(); .undirected(); .union();

slide-86
SLIDE 86

Graph Stream Aggregations

[Figure: the edges of the graph stream are folded per window into local summaries, which are combined (fold or reduce) into a global summary; the aggregate result is emitted as a property stream. Global aggregates can be persistent or transient.]

graphStream.aggregate( new MyGraphAggregation(window, fold, combine, transform))

slide-87
SLIDE 87

Slicing Graph Streams

graphStream.slice(Time.of(1, MINUTE));

[Figure: the graph stream cut into one-minute slices at 11:40, 11:41, 11:42, 11:43]

slide-88
SLIDE 88

Aggregating Slices

graphStream.slice(Time.of(1, MINUTE), direction)

  • Slicing collocates edges by vertex information
  • Neighborhood aggregations on sliced graphs

Aggregations: .reduceOnEdges(); .foldNeighbors(); .applyOnNeighbors();

[Figure: edges of a slice grouped by source or by target]

slide-89
SLIDE 89

Finding Matches Nearby

graphStream.filterVertices(GraphGeeks())
    .slice(Time.of(15, MINUTE), EdgeDirection.IN)
    .applyOnNeighbors(FindPairs())

GraphStream :: graph geek check-ins

wendy checked_in soap_bar
steve checked_in soap_bar
tom checked_in joe’s_grill
sandra checked_in soap_bar
rafa checked_in joe’s_grill

GraphWindow :: user-place (one slice)

soap_bar: wendy, steve, sandra
joe’s_grill: tom, rafa

FindPairs

{wendy, steve} {steve, sandra} {wendy, sandra} {tom, rafa}
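What the slice and `applyOnNeighbors` steps compute can be shown with ordinary collections. This sketch is not Gelly-Stream code: the class `FindPairsSketch`, the `CheckIn` type, and the minute timestamps are hypothetical additions, and the `GraphGeeks` vertex filter is left out. Check-ins are bucketed into tumbling windows, grouped by place (the edge target, as with `EdgeDirection.IN`), and every pair of users in a group is emitted.

```java
import java.util.*;

// Hypothetical sketch: pair up users who checked in at the same place in the same slice.
class FindPairsSketch {
    static final class CheckIn {
        final String user;
        final String place;
        final long minute; // assumed event time in minutes, not part of the slide
        CheckIn(String user, String place, long minute) {
            this.user = user;
            this.place = place;
            this.minute = minute;
        }
    }

    static List<String> findPairs(List<CheckIn> stream, long windowMinutes) {
        // slice index -> place -> users in arrival order
        Map<Long, Map<String, List<String>>> slices = new TreeMap<>();
        for (CheckIn c : stream) {
            slices.computeIfAbsent(c.minute / windowMinutes, w -> new LinkedHashMap<>())
                  .computeIfAbsent(c.place, p -> new ArrayList<>())
                  .add(c.user);
        }
        List<String> pairs = new ArrayList<>();
        for (Map<String, List<String>> byPlace : slices.values()) {
            for (List<String> users : byPlace.values()) {
                // neighborhood aggregation: all pairs among the place's in-neighbors
                for (int i = 0; i < users.size(); i++) {
                    for (int j = i + 1; j < users.size(); j++) {
                        pairs.add("{" + users.get(i) + ", " + users.get(j) + "}");
                    }
                }
            }
        }
        return pairs;
    }
}
```

With the five check-ins from the slide in one 15-minute slice, the sketch yields the same four pairs as the slide. If one of the soap_bar check-ins falls into the next slice, the pairs involving that user disappear, which is the effect of slicing.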

slide-90
SLIDE 90

Feeling Gelly?

  • Gelly Guide
    https://ci.apache.org/projects/flink/flink-docs-master/libs/gelly_guide.html

  • Gelly-Stream Repository
    https://github.com/vasia/gelly-streaming

  • Gelly-Stream talk @FOSDEM16
    https://fosdem.org/2016/schedule/event/graph_processing_apache_flink/

  • Related Papers
    http://www.citeulike.org/user/vasiakalavri/tag/graph-streaming