slide-1
SLIDE 1

Distributed GraphLab

A Framework for Machine Learning and Data Mining in the Cloud

By Maciej Biskupiak for R212

Paper authors: Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, J. Hellerstein
slide-2
SLIDE 2

Motivation

Abstractions of parallel computation are necessary. Current models such as MapReduce, Dryad, or Pregel are too limiting or too inefficient for our purposes.

slide-3
SLIDE 3

GraphLab Abstraction

GraphLab is:

  • Asynchronous: parameter values are not necessarily updated at the same time
  • Dynamic: parameters are not updated equally often
  • Serialisable: all parallel executions have an equivalent serial execution (no data races)

It was originally developed for the multicore, in-memory setting.

slide-4
SLIDE 4

GraphLab Abstraction

GraphLab consists of three main parts:

  • The Data Graph
  • Update Function
  • Sync Function
slide-5
SLIDE 5

Data Graph

  • Computation can be expressed as an arbitrary graph.
  • Data is associated with either vertices or edges.
  • The data itself is mutable, but the structure of the graph is not.

[Figure: data graph with vertices V1 to V6 and edge data D on the edges V1<->V4, V1<->V5, V2<->V3, V2<->V4, V3<->V4, V3<->V6, V5<->V6]
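The split between fixed structure and mutable data can be sketched in a few lines of Python. This is only an illustration: the `DataGraph` class and its attribute names are hypothetical and are not GraphLab's actual (C++) API. The edge list is read off the labels of the figure on this slide.

```python
class DataGraph:
    """Toy data graph: fixed structure, mutable vertex and edge data."""

    def __init__(self, vertices, edges):
        # Build adjacency sets, then freeze them so the structure cannot change.
        adjacency = {v: set() for v in vertices}
        for a, b in edges:
            adjacency[a].add(b)
            adjacency[b].add(a)
        self._neighbors = {v: frozenset(n) for v, n in adjacency.items()}
        # Data is mutable: one slot per vertex and one per undirected edge.
        self.vertex_data = {v: None for v in vertices}
        self.edge_data = {frozenset(e): None for e in edges}

    def neighbors(self, v):
        return self._neighbors[v]


# Edges of the example graph (hypothetical integer ids for V1..V6).
EDGES = [(1, 4), (1, 5), (2, 3), (2, 4), (3, 4), (3, 6), (5, 6)]
graph = DataGraph(range(1, 7), EDGES)
graph.vertex_data[4] = 0.25               # vertex data can be overwritten
graph.edge_data[frozenset((1, 4))] = 1.0  # so can edge data
```

Storing the neighbor sets as `frozenset` is one simple way to express "structure is immutable" in this sketch; a real system enforces it at the framework level.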

slide-6
SLIDE 6

Update Function

The update function takes a vertex v and its surrounding context Sv (the data on v, on its adjacent edges, and on its adjacent vertices). It returns the new values of the context Sv and a list T of vertices that will eventually be updated.

[Figure: the data graph, with the vertex to be updated and its context highlighted]
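As an illustration of this signature, here is a minimal sketch of a PageRank-style update function together with a serial stand-in for the engine. The function names, the damping and tolerance values, and the queue-based scheduler are assumptions made for the example; they are not GraphLab's API, and a real engine runs updates in parallel.

```python
from collections import deque

# Hypothetical toy graph: undirected neighbor lists for V1..V6.
NEIGHBORS = {1: [4, 5], 2: [3, 4], 3: [2, 4, 6], 4: [1, 2, 3], 5: [1, 6], 6: [3, 5]}


def pagerank_update(v, rank, neighbors, damping=0.85, tol=1e-4):
    """Update vertex v from its context and return the set T to reschedule."""
    old = rank[v]
    # Read the context: each neighbor's rank, split evenly over its edges.
    incoming = sum(rank[u] / len(neighbors[u]) for u in neighbors[v])
    rank[v] = (1 - damping) + damping * incoming
    # Dynamic scheduling: only reschedule the neighbors if v changed noticeably.
    return set(neighbors[v]) if abs(rank[v] - old) > tol else set()


def run(neighbors, update):
    """Serial stand-in for the engine: pop a vertex, update it, enqueue T."""
    rank = {v: 1.0 for v in neighbors}
    queue, queued = deque(neighbors), set(neighbors)
    updates = 0
    while queue:
        v = queue.popleft()
        queued.discard(v)
        for u in update(v, rank, neighbors):
            if u not in queued:
                queue.append(u)
                queued.add(u)
        updates += 1
    return rank, updates


ranks, n_updates = run(NEIGHBORS, pagerank_update)
```

Because the returned set T is empty once a vertex has settled, vertices that converge early stop being updated, which is the "dynamic" property from the earlier slide.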

slide-7
SLIDE 7

Sync Function

The sync function provides a way to track global state. Each vertex v publishes a value derived from its context Sv. The sync function performs an associative sum over all of these values.
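A sync operation of this kind can be sketched as a map over the vertices followed by an associative fold. The parameter names `map_fn`, `combine`, and `finalize` are illustrative choices for this sketch, and the residual values are made up.

```python
from functools import reduce


def sync(vertex_data, map_fn, combine, finalize=lambda acc: acc, zero=0.0):
    """Fold one value per vertex into a single global value.

    `combine` must be associative so that partial results computed on
    different machines can be merged in any grouping.
    """
    mapped = (map_fn(d) for d in vertex_data.values())
    return finalize(reduce(combine, mapped, zero))


# Hypothetical per-vertex residuals; the global value is their mean magnitude.
residuals = {1: 0.5, 2: -0.25, 3: 0.25, 4: 1.0}
mean_residual = sync(
    residuals,
    map_fn=abs,
    combine=lambda a, b: a + b,
    finalize=lambda total: total / len(residuals),
)
```

The same helper covers other aggregates, for example `sync(residuals, abs, max)` for the largest residual.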
slide-8
SLIDE 8

Distributed GraphLab

In order to bring GraphLab to the distributed setting, we need a solution for the following:

  • Distributing the graph data and balancing the computation

  • Maintaining consistency across nodes
  • Achieving fault tolerance
slide-9
SLIDE 9

Distributing Graph Data

We partition the graph into a set of K atoms (where K is much greater than the number of servers).

Each atom is stored as a separate file and contains information about ‘ghosts’, the vertices and edges adjacent to the atom’s boundary.

[Figure: an atom, drawn with vertices V1, V4, V5 and edge data Dv1<->v4, Dv1<->v5, shown next to vertices V2, V3, V6]
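The idea of atoms and ghosts can be sketched as follows. The modulo placement of vertices and the round-robin placement of atoms are stand-ins chosen for brevity, not the partitioning method used by GraphLab, and the function names are hypothetical.

```python
def build_atoms(edges, num_atoms):
    """Assign vertices to atoms and record each atom's ghost vertices."""
    vertices = sorted({v for e in edges for v in e})
    # Assumption: a simple modulo placement stands in for a real partitioner.
    atom_of = {v: v % num_atoms for v in vertices}
    atoms = {a: {"owned": set(), "ghosts": set()} for a in range(num_atoms)}
    for v in vertices:
        atoms[atom_of[v]]["owned"].add(v)
    for a, b in edges:
        if atom_of[a] != atom_of[b]:
            # A cut edge: each side keeps a ghost copy of the remote endpoint.
            atoms[atom_of[a]]["ghosts"].add(b)
            atoms[atom_of[b]]["ghosts"].add(a)
    return atoms


def place_atoms(num_atoms, num_servers):
    """Spread the K atoms over the servers round-robin."""
    return {a: a % num_servers for a in range(num_atoms)}


EDGES = [(1, 4), (1, 5), (2, 3), (2, 4), (3, 4), (3, 6), (5, 6)]
atoms = build_atoms(EDGES, num_atoms=3)
placement = place_atoms(num_atoms=3, num_servers=2)
```

Having more atoms than servers means the same atom files can be redistributed when the number of servers changes, without repartitioning the graph.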

slide-10
SLIDE 10

Maintaining Consistency

Data races are possible if the contexts of update functions overlap.

GraphLab provides two means of dealing with this:

  • A chromatic engine based on graph coloring
  • A distributed reader/writer lock system

[Figure: the data graph with vertices V1 to V6 and their edge data]
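The coloring approach can be sketched as below: color the graph so that no two adjacent vertices share a color, then process one color at a time. The greedy coloring heuristic and the function names are assumptions of this sketch, and the loop over a batch is serial where a real engine would run the batch concurrently.

```python
def greedy_coloring(neighbors):
    """Give each vertex the smallest color not used by an already-colored neighbor."""
    color = {}
    for v in sorted(neighbors):
        taken = {color[u] for u in neighbors[v] if u in color}
        color[v] = next(c for c in range(len(neighbors)) if c not in taken)
    return color


def chromatic_sweep(neighbors, color, update):
    """Run one sweep; vertices of one color form a batch that could run in parallel."""
    batches = []
    for c in sorted(set(color.values())):
        batch = [v for v in sorted(neighbors) if color[v] == c]
        for v in batch:        # a real engine would run these concurrently
            update(v)
        batches.append(batch)  # barrier before moving on to the next color
    return batches


NEIGHBORS = {1: [4, 5], 2: [3, 4], 3: [2, 4, 6], 4: [1, 2, 3], 5: [1, 6], 6: [3, 5]}
colors = greedy_coloring(NEIGHBORS)
visited = []
batches = chromatic_sweep(NEIGHBORS, colors, visited.append)
```

Since no two vertices in a batch are adjacent, an update never writes a vertex that another update in the same batch is reading.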

slide-11
SLIDE 11

Levels of consistency

Distributed locking supports various levels of consistency.

  • Vertex consistency: obtains a write lock on the vertex only; adjacent data is not locked
  • Vertex and edge consistency: obtains a write lock on the vertex and its adjacent edges, and a read lock on its adjacent vertices
  • Total consistency: obtains a write lock on the vertex and its adjacent edges and vertices

Weaker levels give greater performance, and some problems (e.g. PageRank) do not require total consistency.
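To make the trade-off concrete, the sketch below computes which data items an update would lock at each level and checks whether two updates could run at the same time. It assumes, following the GraphLab paper, that the weakest level locks only the central vertex; the keys `vertex`, `edge`, and `full` and the helper names are hypothetical.

```python
NEIGHBORS = {1: [4, 5], 2: [3, 4], 3: [2, 4, 6], 4: [1, 2, 3], 5: [1, 6], 6: [3, 5]}


def lock_plan(v, level, neighbors=NEIGHBORS):
    """Return (write_set, read_set) of data items locked to update v."""
    vertex = {("vertex", v)}
    adj_vertices = {("vertex", u) for u in neighbors[v]}
    adj_edges = {("edge", frozenset((v, u))) for u in neighbors[v]}
    if level == "vertex":
        return vertex, set()
    if level == "edge":
        return vertex | adj_edges, adj_vertices
    if level == "full":
        return vertex | adj_edges | adj_vertices, set()
    raise ValueError(f"unknown consistency level: {level}")


def conflicts(plan_a, plan_b):
    """Two plans conflict if one writes an item the other reads or writes."""
    (write_a, read_a), (write_b, read_b) = plan_a, plan_b
    return bool(write_a & (write_b | read_b)) or bool(write_b & read_a)


# V1 and V2 are not adjacent but share the neighbor V4.
edge_ok = not conflicts(lock_plan(1, "edge"), lock_plan(2, "edge"))
full_ok = not conflicts(lock_plan(1, "full"), lock_plan(2, "full"))
```

Here `edge_ok` is true because both updates only read V4, while `full_ok` is false because both would need to write it. This is the extra parallelism that the weaker level buys.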

slide-12
SLIDE 12

Fault Tolerance

In the event of a failure, the system can recover to a snapshot taken at a previous point. The snapshot mechanism has to be asynchronous in order to avoid suspending execution. GraphLab implements the Chandy-Lamport snapshot algorithm to achieve this.
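One way to picture an asynchronous snapshot is as an update that spreads through the graph: a vertex saves its own data, saves the edges leading to neighbors that have not been saved yet, and schedules those neighbors. The following is a simplified, serial sketch of that idea with hypothetical names and data; it leaves out the concurrent regular updates and is not GraphLab's implementation.

```python
from collections import deque


def snapshot(neighbors, vertex_data, edge_data, start):
    """Serial sketch of a snapshot that spreads through the graph as an update."""
    saved_vertices, saved_edges = {}, {}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        if v in saved_vertices:
            continue  # v was already snapshotted
        saved_vertices[v] = vertex_data[v]
        for u in neighbors[v]:
            if u not in saved_vertices:
                # The first endpoint to be snapshotted saves the edge,
                # then schedules the other endpoint for a snapshot update.
                edge = frozenset((u, v))
                saved_edges[edge] = edge_data[edge]
                queue.append(u)
    return saved_vertices, saved_edges


NEIGHBORS = {1: [4, 5], 2: [3, 4], 3: [2, 4, 6], 4: [1, 2, 3], 5: [1, 6], 6: [3, 5]}
VERTEX_DATA = {v: f"D{v}" for v in NEIGHBORS}
EDGE_DATA = {frozenset((v, u)): f"D{min(v, u)}-{max(v, u)}"
             for v in NEIGHBORS for u in NEIGHBORS[v]}
saved_v, saved_e = snapshot(NEIGHBORS, VERTEX_DATA, EDGE_DATA, start=1)
```

Each edge is saved by whichever of its endpoints is reached first, so every vertex and every edge ends up in the snapshot without stopping the whole graph at once.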

slide-13
SLIDE 13

Performance

  • Achieves a 20-60x improvement over Hadoop
  • Competitive with tailored MPI implementations
  • Error can converge almost two times faster than in non-dynamic computation

slide-14
SLIDE 14

Performance

[Figure: asynchronous vs synchronous performance of PageRank]

[Figures: comparison of scalability on Named Entity Recognition (first) and Netflix collaborative filtering (second)]

slide-15
SLIDE 15

Conclusion

GraphLab is a powerful abstraction of parallel computation brought to the distributed setting. It provides more flexibility than other models, constrained only by the inability to modify the graph structure.