Unleash Data Science Danny Bickson Co-Founder GraphLab Project - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Unleash Data Science

Danny Bickson

Co-Founder

slide-2
SLIDE 2

GraphLab Project History

GraphLab (2009) GraphChi (2011) GraphLab Create (2014)

slide-3
SLIDE 3

GraphLab Open Source (2009)

slide-4
SLIDE 4

Graphs are Everywhere

slide-5
SLIDE 5

Graphs are Essential to Data Mining and Machine Learning

  • Identify influential information
  • Reason about latent properties
  • Model complex data dependencies

slide-6
SLIDE 6

Examples of Graphs in Machine Learning

slide-7
SLIDE 7

Predicting User Behavior

[Figure: graph of posts; some are labeled Liberal or Conservative, the rest are unlabeled (?)]

Label Propagation
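
As a rough illustration of label propagation, the sketch below spreads a few known labels to unlabeled vertices by majority vote over neighbors. The toy graph, the vertex names, and the function `propagate_labels` are hypothetical and are not taken from GraphLab.

```python
from collections import Counter


def propagate_labels(edges, seeds, max_iters=10):
    """Spread seed labels over an undirected graph by neighbor majority vote."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    labels = dict(seeds)  # seed labels are never overwritten
    for _ in range(max_iters):
        updates = {}
        for v in sorted(neighbors):
            if v in seeds:
                continue
            votes = Counter(labels[n] for n in neighbors[v] if n in labels)
            if votes:
                # Ties are broken alphabetically so the result is deterministic.
                best = min(votes, key=lambda lab: (-votes[lab], lab))
                if labels.get(v) != best:
                    updates[v] = best
        if not updates:
            break  # no label changed, so the propagation has converged
        labels.update(updates)
    return labels


# Two tightly knit groups joined by a single bridge edge (c, d).
edges = [("a", "b"), ("a", "c"), ("b", "c"),
         ("d", "e"), ("d", "f"), ("e", "f"),
         ("c", "d")]
seeds = {"a": "liberal", "f": "conservative"}
result = propagate_labels(edges, seeds)
```

Each vertex only reads the labels of its neighbors, which is what makes this kind of algorithm a fit for the graph-parallel pattern described later in the deck.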

slide-8
SLIDE 8

Finding Communities

Count triangles passing through each vertex: measures “cohesiveness” of local community

  • More Triangles → Stronger Community
  • Fewer Triangles → Weaker Community

[Figure: example graph labeled 1 2 3 4]
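
A minimal sketch of per-vertex triangle counting, assuming a small undirected graph given as an edge list. The function name `triangles_per_vertex` and the example edges are illustrative only.

```python
from itertools import combinations


def triangles_per_vertex(edges):
    """Count, for every vertex, the triangles that pass through it."""
    neighbors = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    counts = {}
    for v, nbrs in neighbors.items():
        # A triangle through v is a pair of v's neighbors that are adjacent.
        counts[v] = sum(
            1 for a, b in combinations(sorted(nbrs), 2) if b in neighbors[a]
        )
    return counts


# One triangle (1, 2, 3) plus a pendant vertex 4.
edges = [(1, 2), (1, 3), (2, 3), (3, 4)]
counts = triangles_per_vertex(edges)
total_triangles = sum(counts.values()) // 3  # each triangle is seen from 3 vertices
```

Vertices 1, 2 and 3 each sit on one triangle, while vertex 4 sits on none, so it belongs to a weaker local community by this measure.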

slide-9
SLIDE 9

Recommending Products

[Figure: bipartite graph of Users and Items; edges carry Ratings]

slide-10
SLIDE 10

Many More Applications

  • Collaborative Filtering
      • Alternating Least Squares
      • Stochastic Gradient Descent
      • Tensor Factorization
  • Structured Prediction
      • Loopy Belief Propagation
      • Max-Product Linear Programs
      • Gibbs Sampling
  • Semi-supervised ML
      • Graph SSL
      • CoEM
  • Community Detection
      • Triangle-Counting
      • K-core Decomposition
      • K-Truss
  • Graph Analytics
      • PageRank
      • Personalized PageRank
      • Shortest Path
      • Graph Coloring
  • Classification
      • Neural Networks

slide-11
SLIDE 11

The Graph-Parallel Pattern

Model / Alg. State

Computation depends only on the neighbors

slide-12
SLIDE 12

The GraphLab Framework

  • Data Model: Property Graph
  • Computation: Vertex Programs

slide-13
SLIDE 13

Machine Learning Pipeline

  • Data: images, docs, movie ratings
  • Extract Features: faces, important words, side info
  • Graph Formation: similar faces, shared words, rated movies
  • Structured Machine Learning Algorithm: belief propagation, LDA, collaborative filtering
  • Value from Data: face labels, doc topics, movie recommendations

slide-14
SLIDE 14

Parallelizing Machine Learning

  • Graph Ingress (mostly data-parallel): Data → Extract Features → Graph Formation
  • Graph-Structured Computation (graph-parallel): Structured Machine Learning Algorithm → Value from Data

slide-15
SLIDE 15

ML Tasks Beyond Data-Parallelism

Data-Parallel (Map Reduce):

  • Cross Validation
  • Feature Extraction
  • Computing Sufficient Statistics

Graph-Parallel:

  • Graphical Models: Gibbs Sampling, Belief Propagation, Variational Opt.
  • Semi-Supervised Learning: Label Propagation, CoEM
  • Graph Analysis: PageRank, Triangle Counting
  • Collaborative Filtering: Tensor Factorization

slide-16
SLIDE 16

Example of a Graph-Parallel Algorithm

slide-17
SLIDE 17

Flashback to 1998

Why?

First Google advantage: a Graph Algorithm & System to Support it!

slide-18
SLIDE 18

PageRank: Identifying Leaders

slide-19
SLIDE 19

PageRank

What’s the rank of this user?

  • Depends on rank of who follows her
  • Depends on rank of who follows them…

Loops in graph → Must iterate!

slide-20
SLIDE 20

PageRank Iteration

  • α is the random reset probability
  • wji is the prob. of transitioning (similarity) from j to i

Iterate until convergence:

    R[i] = α + (1 − α) Σ_{j ∈ N[i]} wji × R[j]

“My rank is weighted average of my friends’ ranks”
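
A minimal sketch of this iteration, assuming the update rule R[i] ← α + (1 − α) Σ wji × R[j] that appears in the update-function pseudocode later in the deck. The toy follower graph, the choice of wji as 1 / out-degree of j, and the function name `pagerank` are assumptions made for illustration.

```python
def pagerank(in_links, alpha=0.15, tol=1e-6, max_iters=200):
    """Iterate R[i] = alpha + (1 - alpha) * sum_j w_ji * R[j] until ranks settle.

    in_links maps each vertex i to a dict {j: w_ji} of in-neighbors and weights.
    """
    ranks = {i: 1.0 for i in in_links}
    for _ in range(max_iters):
        new_ranks = {
            i: alpha + (1 - alpha) * sum(w * ranks[j] for j, w in nbrs.items())
            for i, nbrs in in_links.items()
        }
        delta = max(abs(new_ranks[i] - ranks[i]) for i in ranks)
        ranks = new_ranks
        if delta < tol:
            break  # largest change is below the tolerance
    return ranks


# Toy graph with edges a -> b, a -> c, b -> c, c -> a.
# Assumed weights: w_ji = 1 / out-degree of j.
in_links = {
    "a": {"c": 1.0},
    "b": {"a": 0.5},
    "c": {"a": 0.5, "b": 1.0},
}
ranks = pagerank(in_links)
```

Vertex c is followed by both a and b, so it ends up with the highest rank. Vertex a is followed only by c, yet it still outranks b because its single follower is highly ranked.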

slide-21
SLIDE 21

Properties of Graph Parallel Algorithms

  • Dependency Graph
  • Iterative Computation: My Rank depends on Friends’ Rank
  • Local Updates

slide-22
SLIDE 22

Data-Parallel vs Graph Parallel

  • Data-Parallel (MapReduce): Table of independent Rows → Result
  • Graph-Parallel: Dependency Graph

slide-23
SLIDE 23

Addressing Graph-Parallel ML

Data-Parallel (Map Reduce):

  • Cross Validation
  • Feature Extraction
  • Computing Sufficient Statistics

Graph-Parallel (Map Reduce? Graph-Parallel Abstraction):

  • Graphical Models: Gibbs Sampling, Belief Propagation, Variational Opt.
  • Semi-Supervised Learning: Label Propagation, CoEM
  • Data-Mining: PageRank, Triangle Counting
  • Collaborative Filtering: Tensor Factorization

slide-24
SLIDE 24

Data Graph

Data associated with vertices and edges

Vertex Data:

  • User profile text
  • Current interests estimates

Edge Data:

  • Similarity weights

Graph:

  • Social Network
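
One way to picture such a data graph is the sketch below, which stores user data on vertices and similarity weights on edges. The class names `VertexData` and `DataGraph` and their methods are hypothetical and do not reproduce GraphLab's actual data structures.

```python
from dataclasses import dataclass, field


@dataclass
class VertexData:
    profile_text: str
    interests: dict = field(default_factory=dict)  # topic -> current estimate


@dataclass
class DataGraph:
    vertices: dict = field(default_factory=dict)  # vertex id -> VertexData
    edges: dict = field(default_factory=dict)     # (src, dst) -> similarity weight

    def add_vertex(self, vid, data):
        self.vertices[vid] = data

    def add_edge(self, src, dst, weight):
        if src not in self.vertices or dst not in self.vertices:
            raise KeyError("both endpoints must be added before the edge")
        self.edges[(src, dst)] = weight

    def neighbors(self, vid):
        """Yield (neighbor id, weight) for every edge touching vid."""
        for (src, dst), weight in self.edges.items():
            if src == vid:
                yield dst, weight
            elif dst == vid:
                yield src, weight


graph = DataGraph()
graph.add_vertex("alice", VertexData("likes graph algorithms", {"ml": 0.9}))
graph.add_vertex("bob", VertexData("photographer", {"ml": 0.2}))
graph.add_edge("alice", "bob", 0.4)
bob_neighbors = list(graph.neighbors("bob"))
```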
slide-25
SLIDE 25

How do we program graph computation?

“Think like a Vertex.”

  • Malewicz et al. [SIGMOD’10]
slide-26
SLIDE 26

Update Functions

User-defined program: applied to a vertex, it transforms the data in the scope of that vertex

    pagerank(i, scope) {
      // Get Neighborhood data
      (R[i], wij, R[j]) ← scope;

      // Update the vertex data
      R[i] ← α + (1 − α) Σ_{j ∈ N[i]} wji × R[j];

      // Reschedule Neighbors if needed
      if R[i] changes then
        reschedule_neighbors_of(i);
    }

Dynamic computation

Update function applied (asynchronously) in parallel until convergence

Many schedulers available to prioritize computation
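
The dynamic scheduling idea can be sketched sequentially as follows. This is a single-threaded approximation with a FIFO queue standing in for the scheduler, not GraphLab's asynchronous parallel engine. The function name `run_dynamic_pagerank`, the tolerance, and the toy graph are assumptions.

```python
from collections import deque


def run_dynamic_pagerank(in_links, alpha=0.15, tol=1e-4):
    """Apply a PageRank update function vertex by vertex, driven by a FIFO queue."""
    # out_neighbors[j] lists the vertices whose rank depends on R[j].
    out_neighbors = {v: [] for v in in_links}
    for i, nbrs in in_links.items():
        for j in nbrs:
            out_neighbors[j].append(i)

    ranks = {v: 1.0 for v in in_links}
    queue = deque(in_links)      # every vertex starts out scheduled
    scheduled = set(in_links)
    updates = 0
    while queue:
        i = queue.popleft()
        scheduled.discard(i)
        # Update function: read the scope of i, write only R[i].
        new_rank = alpha + (1 - alpha) * sum(
            w * ranks[j] for j, w in in_links[i].items()
        )
        changed = abs(new_rank - ranks[i]) > tol
        ranks[i] = new_rank
        updates += 1
        if changed:
            # Reschedule the neighbors that read R[i].
            for k in out_neighbors[i]:
                if k not in scheduled:
                    scheduled.add(k)
                    queue.append(k)
    return ranks, updates


# Toy graph with edges a -> b, a -> c, b -> c, c -> a (w_ji = 1 / out-degree of j).
in_links = {
    "a": {"c": 1.0},
    "b": {"a": 0.5},
    "c": {"a": 0.5, "b": 1.0},
}
ranks, updates = run_dynamic_pagerank(in_links)
```

Work stops on its own once no update changes a rank by more than the tolerance, so vertices that settle early are not recomputed. Swapping the FIFO queue for a priority queue would give a different scheduler with the same update function.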

slide-27
SLIDE 27

Ensuring Race-Free Code

How much can computation overlap?

slide-28
SLIDE 28

Need for Consistency?

No Consistency:

  • Higher Throughput (#updates/sec)
  • Potentially Slower Convergence of ML

slide-29
SLIDE 29

The GraphLab Framework

  • Graph Based Data Representation
  • Update Functions (User Computation)
  • Scheduler
  • Consistency Model

slide-30
SLIDE 30

Never Ending Learner Project (CoEM)

  • Hadoop: 95 Cores, 7.5 hrs
  • Distributed GraphLab: 32 EC2 machines, 80 secs (0.3% of Hadoop time)

2 orders of mag faster → 2 orders of mag cheaper

slide-31
SLIDE 31

GraphLab 1 provided exciting scaling performance

But… Thus far…

We couldn’t scale up to Altavista Webgraph 2002

1.4B vertices, 6.7B edges

slide-32
SLIDE 32

Natural Graphs

[Image from WikiCommons]

slide-33
SLIDE 33

Achilles Heel: Idealized Graph Assumption

Assumed… Small degree → Easy to partition

But, Natural Graphs… Many high degree vertices (power-law degree distribution) → Very hard to partition

slide-34
SLIDE 34

Power-Law Degree Distribution

AltaVista WebGraph: 1.4B Vertices, 6.6B Edges

High-Degree Vertices: 1% of vertices are adjacent to 50% of edges

[Figure: log-log plot of Number of Vertices (count) against Degree]
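
To make the "1% of vertices, 50% of edges" figure concrete, the sketch below measures what share of edge endpoints falls on the highest-degree vertices. The degree sequence is made up for illustration and is not the AltaVista data. The function name `edge_share_of_top_vertices` is hypothetical.

```python
def edge_share_of_top_vertices(degrees, top_percent=1):
    """Fraction of edge endpoints that touch the highest-degree vertices."""
    ordered = sorted(degrees, reverse=True)
    k = max(1, len(ordered) * top_percent // 100)  # size of the top group
    return sum(ordered[:k]) / sum(ordered)


# Hypothetical skewed degree sequence: 10 hubs and 990 low-degree vertices.
degrees = [500] * 10 + [5] * 990
share = edge_share_of_top_vertices(degrees)
```

With this made-up sequence the top 1% of vertices account for roughly half of all edge endpoints, which is the kind of skew that makes balanced partitioning hard.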

slide-35
SLIDE 35

High Degree Vertices are Common

  • Netflix (Users, Movies): Popular Movies
  • “Social” People
  • LDA (Docs, Words): Common Words

[Figure: Netflix bipartite graph, a social graph, and the LDA graphical model (θ, Z, w) with Hyper Parameters B, α; example label: Obama]

slide-36
SLIDE 36

GraphLab 2 Solution

  • Split High-Degree vertices
  • New Abstraction → Leads to this Split Vertex Strategy

[Figure: a vertex split across Machine 1 and Machine 2; captions: “Program For This”, “Run on This”]

slide-37
SLIDE 37

GAS Decomposition

  • Gather (Reduce): Accumulate information about neighborhood, combined with a parallel “sum” (Y + … + Y → Σ)
  • Apply: Apply the accumulated value to center vertex (Y, Σ → Y’)
  • Scatter: Update adjacent edges and vertices
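
The decomposition can be sketched for PageRank as below. This is a synchronous, single-process illustration under assumed names (`PageRankProgram`, `run_gas`). It is not the GraphLab vertex-program API, and it does not split vertices across machines.

```python
class PageRankProgram:
    """A vertex program split into gather, apply and scatter phases."""

    def __init__(self, alpha=0.15, tol=1e-4):
        self.alpha = alpha
        self.tol = tol

    def gather(self, neighbor_rank, weight):
        # Contribution of one in-edge; contributions are combined with "+".
        return weight * neighbor_rank

    def apply(self, old_rank, gathered_sum):
        # Turn the accumulated value into the new vertex value.
        return self.alpha + (1 - self.alpha) * gathered_sum

    def scatter(self, old_rank, new_rank):
        # Report whether out-neighbors need to run again.
        return abs(new_rank - old_rank) > self.tol


def run_gas(program, in_links, max_rounds=100):
    out_neighbors = {v: [] for v in in_links}
    for i, nbrs in in_links.items():
        for j in nbrs:
            out_neighbors[j].append(i)

    ranks = {v: 1.0 for v in in_links}
    active = set(in_links)
    for _ in range(max_rounds):
        if not active:
            break
        # Gather: each edge contributes independently, so the sum for a
        # high-degree vertex could be computed in pieces and then combined.
        sums = {
            i: sum(program.gather(ranks[j], w) for j, w in in_links[i].items())
            for i in active
        }
        next_active = set()
        for i in active:
            new_rank = program.apply(ranks[i], sums[i])
            if program.scatter(ranks[i], new_rank):
                next_active.update(out_neighbors[i])
            ranks[i] = new_rank
        active = next_active
    return ranks


# Toy graph with edges a -> b, a -> c, b -> c, c -> a (w_ji = 1 / out-degree of j).
in_links = {
    "a": {"c": 1.0},
    "b": {"a": 0.5},
    "c": {"a": 0.5, "b": 1.0},
}
ranks = run_gas(PageRankProgram(), in_links)
```

Because gather contributions are combined with a commutative, associative sum, the work for one high-degree vertex does not have to happen in one place, which is the property the split-vertex strategy relies on.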

slide-38
SLIDE 38

PageRank on the Live-Journal Graph

Runtime (in seconds, PageRank for 10 iterations):

  • GraphLab: 22
  • Spark: 354
  • Mahout/Hadoop: 1340

GraphLab is 60x faster than Hadoop

GraphLab is 16x faster than Spark

slide-39
SLIDE 39

Topic Modeling (LDA)

English language Wikipedia

  • 2.6M Documents, 8.3M Words, 500M Tokens
  • Computationally intensive

[Figure: bar chart of Million Tokens Per Second for Yahoo! and GraphLab]

  • Yahoo!: 100 Yahoo! Machines, specifically engineered for this task
  • GraphLab: 64 cc2.8xlarge EC2 Nodes, 200 lines of code & 4 human hours

slide-40
SLIDE 40

GraphLab vs. Giraph

Source: SC13 paper

slide-41
SLIDE 41

GraphChi (2011)

slide-42
SLIDE 42

GraphChi: Going small with GraphLab

Solve huge problems on small or embedded devices?

Key: Exploit non-volatile memory (starting with SSDs and HDs)

slide-43
SLIDE 43

GraphChi – disk-based GraphLab

Challenge: Random Accesses

Novel GraphChi solution: Parallel sliding windows method → minimizes number of random accesses

slide-44
SLIDE 44

Triangle Counting on Twitter Graph

40M Users, 1.2B Edges

Total: 34.8 Billion Triangles

  • Hadoop: 1636 Machines, 423 Minutes
  • GraphLab2: 64 Machines, 1024 Cores, 1.5 Minutes
  • GraphChi: 59 Minutes, 1 Mac Mini!

Hadoop results from [Suri & Vassilvitskii WWW ‘11]

slide-45
SLIDE 45

Netflix Collaborative Filtering

  • Alternating Least Squares Matrix Factorization

Model: 0.5 million nodes, 99 million edges

[Figure: Runtime (s), log scale, against #Nodes (4 to 64) for Hadoop, MPI and GraphLab]

slide-46
SLIDE 46

Intel Labs Report on GraphLab

Data source: Nezih Yigitbasi, Intel Labs

slide-47
SLIDE 47

The Cost of Hadoop

[Figure: log-log plot of Runtime (s) and Cost ($) for GraphLab and Hadoop]

slide-48
SLIDE 48

Growing User Community and Adoption

slide-49
SLIDE 49

GraphLab Conferences 2012 → 2013

slide-50
SLIDE 50

Growing community contribution

slide-51
SLIDE 51

Unleash Data Science

Power + Simplicity

slide-52
SLIDE 52

Real-World Pipelines Combine Graphs & Tables

  • Raw Wikipedia (XML) → Text Table (Title, Body) → Term-Doc Graph → Topic Model (LDA) → Word Topics (Word, Topic)
  • Raw Wikipedia (XML) → Hyperlinks → PageRank → Top 20 Pages (Title, PR)

slide-53
SLIDE 53

GraphLab Create: Blend Graphs & Tables

Enabling users to easily and efficiently express entire graph analytics pipelines within a simple Python API.

slide-54
SLIDE 54

Machine Learning is a powerful tool but …

  • Even basic applications can be challenging.
  • 6 months from R/Matlab to production (at best).
  • State-of-art algorithms are trapped in research papers.

Goal of GraphLab: Make large-scale machine learning accessible to all! 

slide-55
SLIDE 55

Now with GraphLab: Learn/Prototype/Deploy

  • Even basics of scalable ML can be challenging → Learn ML with GraphLab Notebook
  • 6 months from R/Matlab to production, at best → pip install graphlab, then deploy on EC2
  • State-of-art ML algorithms trapped in research papers → Fully integrated via GraphLab Toolkits

slide-56
SLIDE 56

Value Proposition

“Data scientists tend to use a variety of tools, often across different programming languages… require a lot of context-switching which affects productivity and impedes reproducibility.”

Ben Lorica, O’Reilly Media

GraphLab Create: From prototyping to production without context switching

slide-57
SLIDE 57

Three Steps to Simplicity

Learn

Learn ML with GraphLab Notebook

Prototype

Easy Install graphlab

    Last login: Tue Dec 3 11:00:00 on ttys000
    d-173-250-172-19:~graphlab$
    > pip install graphlab
    > python
    ...
    >>>
    >>> import graphlab as AWESOME
    ...
    >>>

Deploy

Easily Scale GraphLab with EC2 or GraphLab Platform

    >>> import graphlab
    >>> graphlab.launch("cc2.8xlarge")

or

Publish Notebook to Collaborators

slide-58
SLIDE 58

Learn: GraphLab Notebooks

slide-59
SLIDE 59

Prototype: GraphLab Create

  • Build recommenders fast. Don’t waste time coding from scratch.
  • Code in Python. Do more in one system with tools you love.
  • Iterate more. Don’t wait for tomorrow to improve results.
  • Scale with ease. Create on your laptop, deploy to the Cloud.

GraphLab Create (beta) is a Python package that enables developers and data scientists to apply machine learning to build state-of-the-art data products.

slide-60
SLIDE 60

Deploy: GraphLab Create

  • Easily install & prototype locally with new Python API
  • Deploy to the cluster in one step

slide-61
SLIDE 61

GraphLab Toolkits

Highly scalable, state-of-the-art machine learning straight from Python, continually growing with external contributors across industry and academia.

  • Graph Analytics
  • Graphical Models
  • Computer Vision
  • Clustering
  • Topic Modeling
  • Collaborative Filtering

slide-62
SLIDE 62

Collaborative Filtering Vertical

  • Award winning software for collaborative filtering
  • We ranked top places in several high profile competitions: ACM KDD CUP 2011, ACM KDD CUP 2012, WCSD 2013
  • GraphLab software is used by thousands of companies. Some examples:
      • Pandora uses GraphLab for recommending music
      • Adobe uses GraphLab for recommending designers in their social network
      • King is using GraphLab for recommending game moves
  • References from the above companies will be given upon request
slide-63
SLIDE 63

Unmatched functionality

  • Side features
  • Cold start support (new users)
  • High dimensional models
  • RESTful API (in the works)
slide-64
SLIDE 64

GraphLab License

  • Open source: Apache 2
  • Python: closed source. Licensed (currently free)

slide-65
SLIDE 65

Build scalable data products fast

  • Join our community at GraphLab.com
  • Follow us @graphlabteam