Distributed GraphLab
A Framework for Machine Learning and Data Mining in the Cloud
By Maciej Biskupiak for R212
- Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, J. Hellerstein
Motivation
Abstractions of parallel computation are
races)
Computation is expressed as an arbitrary graph.
Data is associated with vertices or edges.
The data is mutable, but the structure of the graph is not.
[Figure: example data graph with vertices V1 to V6 and data D on the edges v1<->v4, v1<->v5, v2<->v3, v2<->v4, v3<->v4, v3<->v6, v5<->v6]
The update function takes a vertex V and its surrounding context Sv. It returns the new values of the context Sv and a list T of vertices that will eventually be updated.
[Figure: the vertex to be updated and its context]
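The update-function contract can be illustrated with a small sequential sketch. Everything here is hypothetical and not GraphLab's API: the edge list is read off the example graph in the slides, `averaging_update` is a toy rule chosen only for illustration, and `run` is a single-threaded stand-in for the scheduler. The sketch mutates vertex data in place rather than returning a new context.

```python
from collections import deque

# Edges of the example data graph (assumed undirected).
EDGES = [(1, 4), (1, 5), (2, 3), (2, 4), (3, 4), (3, 6), (5, 6)]

def neighbors(v, edges=EDGES):
    """Vertices that share an edge with v."""
    return sorted({b if a == v else a for a, b in edges if v in (a, b)})

def averaging_update(v, vertex_data, tolerance=1e-3):
    """Toy update: set v to the mean of its neighbours' values.

    Reads only the context of v, writes v in place, and returns the
    list T of vertices that should be updated later.
    """
    nbrs = neighbors(v)
    new_value = sum(vertex_data[u] for u in nbrs) / len(nbrs)
    changed = abs(new_value - vertex_data[v]) > tolerance
    vertex_data[v] = new_value
    # Only reschedule the neighbours if v moved noticeably.
    return nbrs if changed else []

def run(vertex_data, initial, max_updates=10_000):
    """Sequential stand-in for a scheduler: apply updates until T is empty."""
    queue, queued, updates = deque(initial), set(initial), 0
    while queue and updates < max_updates:
        v = queue.popleft()
        queued.discard(v)
        for u in averaging_update(v, vertex_data):
            if u not in queued:
                queue.append(u)
                queued.add(u)
        updates += 1
    return updates
```

Calling `averaging_update(4, data)` touches only V4 and its neighbours V1, V2 and V3, which is the locality that lets many updates run in parallel.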
We partition the graph into a set of K atoms (where K is much greater than the number of machines).
Each atom is stored as a separate file and contains information about ‘ghosts’: the vertices and edges adjacent to the atom's boundary.
[Figure: an atom containing vertices V1, V4, V5 and edge data Dv1<->v5, Dv1<->v4; vertices V2, V6, V3 lie outside it]
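As a rough illustration of atoms and ghosts, the sketch below splits the example graph into two atoms and records, for each atom, the remote vertices that sit across its boundary. The function names, the dictionary layout and the round-robin placement are all assumptions made for this example; the real system stores each atom as a file and may place atoms differently.

```python
# Edges of the example data graph (assumed undirected).
EDGES = [(1, 4), (1, 5), (2, 3), (2, 4), (3, 4), (3, 6), (5, 6)]

def build_atoms(edges, assignment):
    """Group vertices into atoms and collect each atom's ghosts.

    assignment maps vertex -> atom id. Each atom keeps its own vertices,
    every edge that touches them, and a ghost entry for each remote endpoint.
    """
    atoms = {}
    for v, a in assignment.items():
        atom = atoms.setdefault(a, {"vertices": set(), "edges": [], "ghosts": set()})
        atom["vertices"].add(v)
    for u, v in edges:
        au, av = assignment[u], assignment[v]
        atoms[au]["edges"].append((u, v))
        if au != av:
            # Boundary edge: both atoms keep a copy plus a ghost of the far end.
            atoms[av]["edges"].append((u, v))
            atoms[au]["ghosts"].add(v)
            atoms[av]["ghosts"].add(u)
    return atoms

def place_atoms(atom_ids, n_machines):
    """Hypothetical placement: spread K atoms over machines round-robin."""
    return {a: i % n_machines for i, a in enumerate(sorted(atom_ids))}

# Atom 0 holds V1, V4, V5 as in the figure; atom 1 holds the rest.
ASSIGNMENT = {1: 0, 4: 0, 5: 0, 2: 1, 3: 1, 6: 1}
ATOMS = build_atoms(EDGES, ASSIGNMENT)
```

Because K is much larger than the number of machines, the same set of atom files can be reused for clusters of different sizes by changing only the placement step.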
Data races are possible if the contexts of concurrently executing update functions overlap.
GraphLab provides two means of dealing with this:
- graph coloring
- a distributed locking system
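The graph coloring approach can be sketched as follows: if no two adjacent vertices share a color, then all vertices of one color can be updated at the same time without any of them being in another's neighbourhood. The greedy heuristic and the function names below are assumptions chosen for brevity, not necessarily the coloring method GraphLab uses.

```python
# Edges of the example data graph (assumed undirected).
EDGES = [(1, 4), (1, 5), (2, 3), (2, 4), (3, 4), (3, 6), (5, 6)]

def greedy_coloring(edges):
    """Give each vertex the smallest color not used by an already colored neighbour."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    colors = {}
    for v in sorted(adjacency):
        used = {colors[u] for u in adjacency[v] if u in colors}
        color = 0
        while color in used:
            color += 1
        colors[v] = color
    return colors

def color_classes(colors):
    """Group vertices by color; each group is one parallel step."""
    groups = {}
    for v, c in colors.items():
        groups.setdefault(c, []).append(v)
    return [sorted(groups[c]) for c in sorted(groups)]

COLORS = greedy_coloring(EDGES)
STEPS = color_classes(COLORS)
```

An engine built on this idea would run the groups in `STEPS` one after another, updating every vertex inside a group concurrently.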
Distributed locking supports various levels of consistency.
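- Vertex consistency: a write lock on the vertex itself only, with no protection for data belonging to adjacent vertices
- Edge consistency: a write lock on the vertex and its adjacent edges and a read lock on its adjacent vertices
- Full consistency: a write lock on the vertex and its adjacent edges and vertices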
Weaker consistency gives greater performance, as some problems do not require full consistency (e.g. PageRank).
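To make the trade-off concrete, the sketch below lists which locks an update would request under each consistency level and checks whether two updates could run side by side. It is a simplification under stated assumptions: locks are modelled per vertex only (edge data is taken to be covered by the locks on its endpoints), and `lock_plan` and `conflicts` are hypothetical helper names, not part of GraphLab.

```python
# Edges of the example data graph (assumed undirected).
EDGES = [(1, 4), (1, 5), (2, 3), (2, 4), (3, 4), (3, 6), (5, 6)]

def neighbors(v, edges=EDGES):
    """Vertices that share an edge with v."""
    return sorted({b if a == v else a for a, b in edges if v in (a, b)})

def lock_plan(v, model):
    """Return {vertex: 'read' | 'write'} for an update on v under a consistency model."""
    if model == "vertex":
        plan = {}
    elif model == "edge":
        plan = {u: "read" for u in neighbors(v)}
    elif model == "full":
        plan = {u: "write" for u in neighbors(v)}
    else:
        raise ValueError(f"unknown consistency model: {model}")
    plan[v] = "write"
    # Sorted order stands in for a canonical acquisition order.
    return dict(sorted(plan.items()))

def conflicts(plan_a, plan_b):
    """Two updates conflict if they share a vertex and either wants to write it."""
    shared = plan_a.keys() & plan_b.keys()
    return any(plan_a[x] == "write" or plan_b[x] == "write" for x in shared)
```

In the example graph, V1 and V2 are not adjacent but share the neighbour V4. Under this model they can run together with edge consistency, since both only read V4, but not with full consistency, which is where the extra parallelism of the weaker levels comes from.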
In the event of a failure, the system can recover to a snapshot taken at a previous point. The snapshot mechanism has to be asynchronous in order to avoid suspending execution. GraphLab implements the Chandy-Lamport algorithm to achieve this.
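The idea behind a Chandy-Lamport snapshot can be shown with a small single-process simulation. This is the generic marker protocol over FIFO channels, not GraphLab's own implementation; the `Network` class, the account-balance workload and all names are assumptions made for the example. Processes keep exchanging messages while the snapshot is taken, and the recorded process states plus the recorded in-flight messages still add up to a consistent total.

```python
from collections import deque

MARKER = "MARKER"

class Process:
    def __init__(self, balance):
        self.balance = balance
        self.recorded_state = None   # local state captured by the snapshot
        self.recording = {}          # incoming channel -> messages seen so far
        self.channel_state = {}      # incoming channel -> final recorded messages

class Network:
    def __init__(self, balances):
        self.procs = {n: Process(b) for n, b in balances.items()}
        # One FIFO channel per ordered pair of processes.
        self.channels = {(a, b): deque() for a in balances for b in balances if a != b}

    def send(self, src, dst, amount):
        self.procs[src].balance -= amount
        self.channels[(src, dst)].append(amount)

    def start_snapshot(self, name):
        """Record local state, start recording inputs, send markers on outputs."""
        p = self.procs[name]
        p.recorded_state = p.balance
        for chan, queue in self.channels.items():
            if chan[1] == name:
                p.recording[chan] = []
            if chan[0] == name:
                queue.append(MARKER)

    def deliver(self, src, dst):
        chan = (src, dst)
        msg = self.channels[chan].popleft()
        p = self.procs[dst]
        if msg == MARKER:
            if p.recorded_state is None:
                self.start_snapshot(dst)   # first marker: take the local snapshot
            # A marker closes the channel: what was recorded is its state.
            p.channel_state[chan] = p.recording.pop(chan)
        else:
            p.balance += msg
            if chan in p.recording:
                p.recording[chan].append(msg)

    def drain(self):
        """Deliver pending messages until every channel is empty."""
        while True:
            pending = [c for c, q in self.channels.items() if q]
            if not pending:
                return
            self.deliver(*pending[0])

    def snapshot_total(self):
        return sum(p.recorded_state + sum(sum(m) for m in p.channel_state.values())
                   for p in self.procs.values())

def run_demo():
    net = Network({"A": 100, "B": 50, "C": 25})
    net.send("A", "B", 10)
    net.start_snapshot("A")          # taken while a message is still in flight
    net.send("B", "A", 5)
    net.send("C", "A", 7)
    net.drain()
    live_total = sum(p.balance for p in net.procs.values())
    return net.snapshot_total(), live_total
```

No process is ever paused in this simulation: transfers sent after the snapshot starts are either counted in a recorded channel state or excluded entirely, which is what makes the snapshot usable as a recovery point.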
Asynchronous vs synchronous performance
Comparison of scalability on Named Entity Recognition (first) and Netflix collaborative filtering (second)