GraphX: Graph Processing in a Distributed Dataflow Framework (OSDI 2014)




SLIDE 1

GraphX: Graph Processing in a Distributed Dataflow Framework

OSDI 2014 Bidyut Hota

SLIDE 2

Agenda

  • Analytics space background
  • Motivation
  • Goal
  • Approach
  • Optimizations
  • Results
  • Flaws/Limitations
  • Questions
SLIDE 3

Real life Analytics Pipeline

Raw data -> Link table -> PageRank -> Desired results

  • E.g., Google Knowledge Graph: 570M vertices, 18B edges (as of mid-2017)
SLIDE 4

Real life Analytics Pipeline

Raw data -> Link table -> PageRank -> Desired results

Tables

SLIDE 5

Real life Analytics Pipeline

Raw data -> Link table -> PageRank -> Desired results

Graphs

SLIDE 6

Systems landscape

SLIDE 7

Motivation

  • Separate systems currently exist to compute on each of these data representations.
  • Ability to combine data sources.
  • Enhance dataflow frameworks to leverage their inherent strengths.

SLIDE 8

Current drawbacks of dataflow frameworks

  • Implementing iterative algorithms -> requires multiple stages of complex joins.
  • Do not cover common patterns in graph algorithms -> room for optimization.
  • Unlike Spark, no fine-grained control of data partitioning.
SLIDE 9

Current drawbacks of specialized systems

  • Lack the ability to combine graphs with unstructured or tabular data.
  • Favor snapshot recovery rather than fault tolerance as in Spark.

SLIDE 10

What can we leverage?

  • Immutability of RDDs
  • Reusing indices across graph and collection views over iterations
  • Increase in performance
SLIDE 11

Goal

  • General-purpose distributed framework for graph computations
  • Performance comparable to specialized graph processing systems

SLIDE 12

Approach

  • Unify the tabular view and the graph view
  • Adopt the best ideas from specialized systems
  • Represent graphs on dataflow frameworks
  • Optimizations
  • Develop the GraphX API on top of Spark
SLIDE 13

Graph approach: PageRank example

  • E.g., the PageRank algorithm
  • Graph-parallel abstraction
  • Define a vertex program
  • Terminate when vertex programs vote to halt

Figure: PageRank in Pregel
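
The figure itself did not survive extraction, so here is a rough stand-in: a toy, single-machine imitation of a Pregel-style PageRank vertex program. The names `vertex_program` and `run_pregel`, the damping factor of 0.85, and the 30-superstep cutoff are assumptions made for illustration, not details taken from the slides.

```python
from collections import defaultdict

DAMPING = 0.85        # assumed damping factor
MAX_SUPERSTEPS = 30   # assumed cutoff after which every vertex votes to halt


def vertex_program(superstep, rank, messages, out_degree):
    """Return (new_rank, share_to_send_or_None, vote_to_halt) for one vertex."""
    if superstep > 0:
        # Combine the contributions received from in-neighbors.
        rank = (1.0 - DAMPING) + DAMPING * sum(messages)
    if superstep < MAX_SUPERSTEPS:
        share = rank / out_degree if out_degree else None
        return rank, share, False
    return rank, None, True  # vote to halt and send nothing


def run_pregel(edges, vertices):
    """Run supersteps until no vertex is active and no message is in flight.

    Assumes every edge endpoint appears in `vertices`.
    """
    out_neighbors = defaultdict(list)
    for src, dst in edges:
        out_neighbors[src].append(dst)

    ranks = {v: 1.0 for v in vertices}
    active = set(vertices)
    inbox = defaultdict(list)
    superstep = 0
    while active:
        outbox = defaultdict(list)
        next_active = set()
        for v in active:
            rank, share, halt = vertex_program(
                superstep, ranks[v], inbox[v], len(out_neighbors[v]))
            ranks[v] = rank
            if share is not None:
                for dst in out_neighbors[v]:
                    outbox[dst].append(share)
            if not halt:
                next_active.add(v)
        # A pending message reactivates its recipient, as in Pregel.
        next_active |= set(outbox)
        inbox, active = outbox, next_active
        superstep += 1
    return ranks


cycle_ranks = run_pregel([(0, 1), (1, 2), (2, 0)], [0, 1, 2])
```

On a three-vertex cycle every vertex keeps a rank of about 1.0, which makes a quick sanity check.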

SLIDE 14

Approach

  • GAS (Gather, Apply, Scatter)

How to apply this in dataflow frameworks?

  • Map, group-by, join dataflow operators
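
One way to read the question above: a single gather-apply-scatter round can be written with nothing but map, group-by, and join over a vertex table and an edge table. The sketch below does one PageRank step on plain Python lists. The helpers `join`, `group_by_sum`, and `pagerank_step` are hypothetical names for this illustration and are not Spark or GraphX calls.

```python
from collections import defaultdict


def join(left, right):
    """Inner hash join of two lists of (key, value) pairs."""
    index = defaultdict(list)
    for key, value in right:
        index[key].append(value)
    return [(key, (lv, rv)) for key, lv in left for rv in index[key]]


def group_by_sum(pairs):
    """Group (key, number) pairs by key and sum each group."""
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())


def pagerank_step(vertices, edges, damping=0.85):
    """vertices: list of (id, rank); edges: list of (src, dst)."""
    out_degree = group_by_sum([(src, 1.0) for src, _ in edges])
    # Join: attach each source's rank and out-degree to its edges.
    src_state = join(vertices, out_degree)        # (id, (rank, degree))
    edges_with_src = join(edges, src_state)       # (src, (dst, (rank, degree)))
    # Map (scatter): one message per edge, addressed to the destination.
    messages = [(dst, rank / degree)
                for _, (dst, (rank, degree)) in edges_with_src]
    # Group-by (gather): sum the messages arriving at each vertex.
    gathered = dict(group_by_sum(messages))
    # Apply: a left outer join back onto the vertex table.
    return [(v, (1.0 - damping) + damping * gathered.get(v, 0.0))
            for v, _ in vertices]


step = dict(pagerank_step([(0, 1.0), (1, 1.0), (2, 1.0)], [(0, 1), (0, 2)]))
```

Written naively like this, every iteration repeats the joins from scratch, which is the cost the later optimization slides target.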
SLIDE 15

Representing Property graphs as Tables

Never transfer edges
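
A minimal sketch of what "never transfer edges" can mean in practice, assuming a layout with a vertex table, an edge table split into partitions, and a routing table that records which edge partitions mention each vertex. Edges stay where they are and only vertex attributes are shipped to them. The function names and the toy data are hypothetical.

```python
from collections import defaultdict


def build_routing_table(edge_partitions):
    """Map each vertex id to the set of edge partitions that reference it."""
    routing = defaultdict(set)
    for pid, edges in enumerate(edge_partitions):
        for src, dst in edges:
            routing[src].add(pid)
            routing[dst].add(pid)
    return routing


def ship_vertices(vertex_attrs, routing, num_partitions):
    """Copy each vertex attribute only to the partitions that need it."""
    mirrors = [dict() for _ in range(num_partitions)]
    for vid, attr in vertex_attrs.items():
        for pid in routing.get(vid, ()):
            mirrors[pid][vid] = attr
    return mirrors


# Two edge partitions; vertex 3 is cut across both, vertex 5 has no edges.
edge_partitions = [[(1, 2), (1, 3)], [(3, 4)]]
vertex_attrs = {1: "a", 2: "b", 3: "c", 4: "d", 5: "e"}

routing = build_routing_table(edge_partitions)
mirrors = ship_vertices(vertex_attrs, routing, len(edge_partitions))
```

Vertex 3 is mirrored in both partitions, while the isolated vertex 5 is shipped nowhere.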

SLIDE 16

GraphX API

SLIDE 17

Using the dataflow operators

Logical representation: join of the vertices table on the edges table
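
As a rough illustration of that join, the sketch below builds a triplet view (source attribute, edge attribute, destination attribute) and then aggregates messages per destination. `triplets`, `aggregate`, and `older_follower` are hypothetical helpers written for this example, not the GraphX operators themselves, and the follower data is made up.

```python
def triplets(vertices, edges):
    """Join the vertex table onto both ends of every edge.

    vertices: dict of vertex id -> attribute
    edges:    list of (src_id, dst_id, edge_attr)
    """
    return [
        ((src, vertices[src]), attr, (dst, vertices[dst]))
        for src, dst, attr in edges
    ]


def aggregate(triplet_view, send, merge):
    """Map each triplet to an optional (target, message); merge per target."""
    merged = {}
    for triplet in triplet_view:
        out = send(triplet)
        if out is None:
            continue
        target, msg = out
        merged[target] = merge(merged[target], msg) if target in merged else msg
    return merged


def older_follower(triplet):
    """Send a count of 1 to the followee when the follower is older."""
    (_, src_age), _, (dst, dst_age) = triplet
    return (dst, 1) if src_age > dst_age else None


ages = {1: 30, 2: 25, 3: 40}
follows = [(1, 2, "follows"), (3, 2, "follows"), (2, 3, "follows")]

older_followers = aggregate(
    triplets(ages, follows), older_follower, lambda a, b: a + b)
```

Here user 2 ends up with two older followers and user 3 with none, so only user 2 appears in the result.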

SLIDE 18

Using the dataflow operators on vertex program

User-defined

SLIDE 19

Optimizations

  • Specialized data structures
  • Vertex-cut partitioning
  • Remote caching
  • Active set tracking
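
To make vertex-cut partitioning a little more concrete, here is a simplified sketch that assigns edges (not vertices) to partitions on a two-dimensional grid and then counts how many partitions each vertex is replicated in. The modulo-based placement is an assumption chosen for readability and is not claimed to be the exact strategy GraphX uses.

```python
from collections import defaultdict


def edge_partition_2d(src, dst, side):
    """Place an edge in a side x side grid: row from src, column from dst."""
    return (src % side) * side + (dst % side)


def vertex_replicas(edges, assign):
    """Map each vertex to the partitions holding at least one of its edges."""
    placed = defaultdict(set)
    for src, dst in edges:
        pid = assign(src, dst)
        placed[src].add(pid)
        placed[dst].add(pid)
    return placed


def replication_factor(placed):
    """Average number of partitions per vertex, a proxy for communication."""
    return sum(len(p) for p in placed.values()) / len(placed)


# A star graph: hub 0 points at vertices 1..8, spread over a 2 x 2 grid.
star = [(0, leaf) for leaf in range(1, 9)]
placed = vertex_replicas(star, lambda s, d: edge_partition_2d(s, d, side=2))
```

With a grid placement like this, a vertex can appear in at most `2 * side - 1` partitions (one row plus one column), so the hub of the star is mirrored in only a few places even though it touches every edge.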

SLIDE 20

Implementing Optimizations

  • Reusable hash index
  • Sequential scan or clustered scan based on active set (dynamic)
  • Incremental updates
  • Automatic join elimination

Additional optimizations:

  • Memory-based shuffle
  • Batching and columnar structure
  • Variable-length integer encoding
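
The last item can be illustrated with a common variable-length integer scheme: 7 payload bits per byte, with the high bit marking that more bytes follow. The slides do not say which encoding is used, so treat this as one plausible example; `encode_varint` and `decode_varint` are names chosen for this sketch.

```python
def encode_varint(n):
    """Encode a non-negative integer, least significant 7-bit group first."""
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data):
    """Decode one varint; return (value, number_of_bytes_consumed)."""
    result = 0
    shift = 0
    for i, byte in enumerate(data):
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, i + 1
        shift += 7
    raise ValueError("truncated varint")


encoded = encode_varint(300)
decoded = decode_varint(encoded)
```

Small values fit in a single byte, which helps when most of what gets shuffled is small integers.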
SLIDE 21

Results

SLIDE 22

Results

Scaling for PageRank on the Twitter dataset

Effect of partitioning on communication

SLIDE 23

Current Flaws

  • Is not optimized for dynamic graphs.
  • Requires incremental updates to the routing table.
  • Is not designed for streaming applications.
  • Asynchronous graph computation is not available. This is where Naiad will outperform.
SLIDE 24

Questions