

SLIDE 1

Big Data I: Graph Processing, Distributed Machine Learning

CS 240: Computing Systems and Concurrency Lecture 21 Marco Canini

Credits: Michael Freedman and Kyle Jamieson developed much of the original material. Selected content adapted from J. Gonzalez.

SLIDE 2

Patient presents with abdominal pain.

Diagnosis?

[Diagram: a dependency graph. The patient ate something which contains a pathogen, purchased from a store that also sold to other patients, who were diagnosed with the same illness.]

  • E. coli infection

SLIDE 3

Big Data is Everywhere

  • Machine learning is a reality
  • How will we design and implement “Big Learning” systems?

72 hours of video uploaded to YouTube every minute, 28 million Wikipedia pages, 900 million Facebook users, 6 billion Flickr photos

SLIDE 4

Threads, Locks, & Messages

We could use… “low-level parallel primitives.”

SLIDE 5

Shift Towards Use Of Parallelism in ML

  • Programmers repeatedly solve the same parallel design challenges:

– Race conditions, distributed state, communication…

  • Resulting code is very specialized:

– Difficult to maintain, extend, debug…

(GPUs, multicore, clusters, clouds, supercomputers)

Idea: Avoid these problems by using high-level abstractions

SLIDE 6

MapReduce / Hadoop

A better answer: build learning algorithms on top of high-level parallel abstractions.

SLIDE 7

MapReduce – Map Phase

Embarrassingly parallel, independent computation; no communication needed.

[Figure: four CPUs (CPU 1 through CPU 4) each independently map one input record to a numeric result.]

SLIDE 8

MapReduce – Map Phase

[Figure: each CPU applies the map function to an image and emits image features.]

SLIDE 9

MapReduce – Map Phase

Embarrassingly parallel, independent computation

[Figure: the CPUs continue mapping further images, still with no communication.]

SLIDE 10

MapReduce – Reduce Phase

[Figure: map outputs (image features) are labeled indoor (I) or outdoor (O); two CPUs aggregate them into indoor-picture and outdoor-picture statistics.]
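
The two phases above form the classic data-parallel pattern: an embarrassingly parallel map over independent records, then a reduce that aggregates per key. A minimal single-machine sketch of that pattern in Python (the feature extraction and the indoor/outdoor labels are invented stand-ins for the slide's image example):

    from collections import defaultdict

    def map_phase(images):
        # Each record is processed independently; no communication needed.
        for label, pixels in images:                  # label: "indoor" or "outdoor"
            feature = sum(pixels) / len(pixels)       # stand-in for feature extraction
            yield label, feature

    def reduce_phase(mapped):
        # Group map outputs by key and aggregate into per-class statistics.
        groups = defaultdict(list)
        for label, feature in mapped:
            groups[label].append(feature)
        return {k: sum(v) / len(v) for k, v in groups.items()}

    images = [("outdoor", [12, 14, 13]), ("indoor", [42, 40, 44]),
              ("outdoor", [21, 22, 20]), ("indoor", [25, 26, 27])]
    print(reduce_phase(map_phase(images)))            # per-class mean feature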

SLIDE 11

Map-Reduce for Data-Parallel ML

  • Excellent for large data-parallel tasks!

Data-Parallel (MapReduce): feature extraction, algorithm tuning, basic data processing

Graph-Parallel: belief propagation, label propagation, kernel methods, deep belief networks, neural networks, tensor factorization, PageRank, lasso

Is there more to machine learning?

SLIDE 12

Exploiting Dependencies

SLIDE 13

Graphs are Everywhere

  • Collaborative filtering: Netflix users–movies graph
  • Text analysis: Wiki docs–words graph
  • Probabilistic analysis: social networks

SLIDE 14

Concrete Example

Label Propagation

SLIDE 15


Label Propagation Algorithm

  • Social Arithmetic:
  • Recurrence Algorithm:

– iterate until convergence

  • Parallelism:

– Compute all Likes[i] in parallel

Example (social arithmetic): I Like = 50% × (what I list on my profile: 50% cameras, 50% biking) + 40% × (what Sue Ann likes: 80% cameras, 20% biking) + 10% × (what Carlos likes: 30% cameras, 70% biking) = 60% cameras, 40% biking

Likes[i] = Σ_{j ∈ Friends[i]} Wij × Likes[j]
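
A small runnable sketch of this recurrence (the friend weights and initial interest vectors are taken from the slide's example; the dictionary-based representation is an illustrative choice, not GraphLab's API):

    def label_propagation(weights, likes, iters):
        # Iterate Likes[i] = sum over friends j of W[i][j] * Likes[j].
        for _ in range(iters):
            new = {}
            for i, friends in weights.items():        # every Likes[i] is independent,
                new[i] = {}                           # so this loop can run in parallel
                for j, w in friends.items():
                    for label, p in likes[j].items():
                        new[i][label] = new[i].get(label, 0.0) + w * p
            likes = new
        return likes

    weights = {"me": {"profile": 0.5, "sue_ann": 0.4, "carlos": 0.1},
               "profile": {"profile": 1.0}, "sue_ann": {"sue_ann": 1.0},
               "carlos": {"carlos": 1.0}}
    likes = {"me": {},
             "profile": {"cameras": 0.5, "biking": 0.5},
             "sue_ann": {"cameras": 0.8, "biking": 0.2},
             "carlos":  {"cameras": 0.3, "biking": 0.7}}
    print(label_propagation(weights, likes, iters=1)["me"])   # ~60% cameras, 40% biking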

SLIDE 16

Properties of Graph Parallel Algorithms

  • Dependency graph
  • Factored computation
  • Iterative computation: what I like depends on what my friends like

SLIDE 17

Map-Reduce for Data-Parallel ML

  • Excellent for large data-parallel tasks!

Data-Parallel (MapReduce): feature extraction, algorithm tuning, basic data processing

Graph-Parallel (MapReduce?): belief propagation, label propagation, kernel methods, deep belief networks, neural networks, tensor factorization, PageRank, lasso

SLIDE 18

Problem: Data Dependencies

  • MapReduce doesn’t efficiently express data dependencies

– User must code substantial data transformations
– Costly data replication

[Figure: MapReduce assumes independent rows of data.]

SLIDE 19

Iterative Algorithms

  • MapReduce doesn’t efficiently express iterative algorithms:

[Figure: each iteration re-processes all data partitions across CPUs and ends at a global barrier, so one slow processor stalls every iteration.]

SLIDE 20

MapAbuse: Iterative MapReduce

  • Only a subset of data needs computation:

[Figure: every iteration still touches all data partitions and waits at a barrier, even when only a subset changed.]

SLIDE 21

MapAbuse: Iterative MapReduce

  • System is not optimized for iteration:

[Figure: each iteration pays a startup penalty and a disk penalty as state is written to and re-read from disk between MapReduce jobs, with barriers in between.]

SLIDE 22

ML Tasks Beyond Data-Parallelism

Data-Parallel (MapReduce): cross validation, feature extraction, computing sufficient statistics

Graph-Parallel (?): graphical models (Gibbs sampling, belief propagation, variational optimization), semi-supervised learning (label propagation, CoEM), graph analysis (PageRank, triangle counting), collaborative filtering (tensor factorization)

SLIDE 23

ML Tasks Beyond Data-Parallelism

Data-Parallel (MapReduce): cross validation, feature extraction, computing sufficient statistics

Graph-Parallel: Pregel

SLIDE 24


  • Limited CPU Power
  • Limited Memory
  • Limited Scalability
SLIDE 25

Distributed Cloud

Challenges:

  • Distribute state
  • Keep data consistent
  • Provide fault tolerance


Scale up computational resources!

SLIDE 26

The GraphLab Framework

1. Graph-based data representation
2. Update functions (user computation)
3. Consistency model


SLIDE 27

Data Graph

Data is associated with both vertices and edges

Vertex data:
  • User profile
  • Current interest estimates

Edge data:
  • Relationship (friend, classmate, relative)

Graph:
  • Social network

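A rough sketch of such a data graph in Python (the field names and adjacency representation are illustrative, not GraphLab's actual data structures):

    class DataGraph:
        # User-defined data attached to both vertices and edges.
        def __init__(self):
            self.vertex_data = {}   # vertex -> data (e.g., profile, interest estimates)
            self.edge_data = {}     # (u, v) -> data (e.g., relationship type)
            self.adj = {}           # vertex -> set of neighbors

        def add_vertex(self, v, data):
            self.vertex_data[v] = data
            self.adj.setdefault(v, set())

        def add_edge(self, u, v, data):
            self.edge_data[(u, v)] = data
            self.adj.setdefault(u, set()).add(v)
            self.adj.setdefault(v, set()).add(u)

    g = DataGraph()
    g.add_vertex("alice", {"profile": "...", "interests": {"cameras": 0.5, "biking": 0.5}})
    g.add_vertex("bob",   {"profile": "...", "interests": {"cameras": 0.8, "biking": 0.2}})
    g.add_edge("alice", "bob", {"relationship": "friend"})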

SLIDE 28

Distributed Data Graph


Partition the graph across multiple machines:

SLIDE 29

Distributed Data Graph

  • Ghost vertices maintain adjacency structure and replicate remote data.

[Figure: a partitioned graph with “ghost” vertices mirrored at partition boundaries.]

SLIDE 30

Distributed Data Graph

  • Cut efficiently using HPC graph-partitioning tools (ParMetis / Scotch / …)

[Figure: the graph cut across machines, with “ghost” vertices along the cut.]
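
A toy sketch of the partitioning idea (the hard-coded two-machine placement below is only for illustration; the slide's point is that a real deployment computes a better cut with ParMetis or Scotch):

    from collections import defaultdict

    def partition_with_ghosts(edges, owner):
        # Each machine owns its assigned vertices and keeps read-only "ghost"
        # copies of remote neighbors, preserving the local adjacency structure.
        local, ghosts = defaultdict(set), defaultdict(set)
        for u, v in edges:
            for a, b in ((u, v), (v, u)):
                m = owner(a)
                local[m].add(a)
                if owner(b) != m:
                    ghosts[m].add(b)      # b's data is replicated and kept in sync
        return local, ghosts

    edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]
    owner = lambda vertex: 0 if vertex in ("a", "b") else 1   # toy 2-machine placement
    print(partition_with_ghosts(edges, owner))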

SLIDE 31

The GraphLab Framework

1. Graph-based data representation
2. Update functions (user computation)
3. Consistency model


SLIDE 32

Update Function

A user-defined program, applied to a vertex; transforms data in scope of vertex

    Pagerank(scope) {
      // Update the current vertex data
      vertex.PageRank = α
      ForEach inPage:
        vertex.PageRank += (1 − α) × inPage.PageRank

      // Reschedule neighbors if needed
      if vertex.PageRank changes then
        reschedule_all_neighbors;
    }

Selectively triggers computation at neighbors

Update function applied (asynchronously) in parallel until convergence

Many schedulers available to prioritize computation
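
A rough, runnable sketch of this vertex-program model in Python (the graph encoding, the scheduling queue, and the tolerance are assumptions for illustration, and unlike the simplified slide formula it uses the standard 1/out-degree weighting so the example actually converges; this is not GraphLab's real API):

    from collections import deque

    def pagerank_engine(in_nbrs, out_nbrs, alpha=0.15, tol=1e-6):
        # Apply the update function vertex by vertex; when a vertex's value
        # changes noticeably, selectively reschedule its out-neighbors.
        rank = {v: 1.0 for v in in_nbrs}
        queue, queued = deque(in_nbrs), set(in_nbrs)
        while queue:
            v = queue.popleft(); queued.discard(v)
            old = rank[v]
            rank[v] = alpha + (1 - alpha) * sum(rank[u] / len(out_nbrs[u])
                                                for u in in_nbrs[v])
            if abs(rank[v] - old) > tol:              # reschedule neighbors if needed
                for w in out_nbrs[v]:
                    if w not in queued:
                        queue.append(w); queued.add(w)
        return rank

    in_nbrs  = {"a": ["c"], "b": ["a"], "c": ["a", "b"]}   # who links to me
    out_nbrs = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}   # whom I link to
    print(pagerank_engine(in_nbrs, out_nbrs))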

SLIDE 33

Distributed Scheduling

[Figure: the graph is split across two machines; each machine keeps a queue of the vertices it owns that are scheduled for updates.]

Each machine maintains a schedule over the vertices it owns


Distributed Consensus used to identify completion

SLIDE 34

Ensuring Race-Free Code

  • How much can computation overlap?


SLIDE 35

The GraphLab Framework

1. Graph-based data representation
2. Update functions (user computation)
3. Consistency model


SLIDE 36

PageRank Revisited


    Pagerank(scope) {
      …
      vertex.PageRank = α
      ForEach inPage:
        vertex.PageRank += (1 − α) × inPage.PageRank
      …
    }

SLIDE 37

PageRank data races confound convergence


SLIDE 38

Racing PageRank: Bug


    Pagerank(scope) {
      …
      vertex.PageRank = α
      ForEach inPage:
        vertex.PageRank += (1 − α) × inPage.PageRank
      …
    }

The update writes vertex.PageRank as it accumulates, so a neighbor reading it concurrently can observe a partially accumulated value.

SLIDE 39

Racing PageRank: Bug Fix


    Pagerank(scope) {
      …
      tmp = α
      ForEach inPage:
        tmp += (1 − α) × inPage.PageRank
      vertex.PageRank = tmp
      …
    }

Accumulate into a local tmp and publish the result with a single write to vertex.PageRank.

SLIDE 40

Throughput != Performance

No consistency → higher throughput (#updates/sec), but potentially slower convergence of the ML algorithm.

SLIDE 41

Serializability


For every parallel execution, there exists a sequential execution of update functions which produces the same result.

[Figure: a parallel schedule on CPU 1 and CPU 2 and an equivalent sequential schedule on a single CPU, over time.]

SLIDE 42

Serializability Example


Edge consistency: overlapping regions are only read.

Update functions one vertex apart can be run in parallel.

Stronger / weaker consistency levels are available; user-tunable consistency levels trade off parallelism vs. consistency.

[Figure: two update scopes whose write regions (center vertex and incident edges) do not overlap; the vertex shared between them is only read.]
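
As a minimal illustration of the edge-consistency rule (the adjacency-dict representation is an assumption for illustration):

    def parallel_ok_under_edge_consistency(adj, u, v):
        # Under edge consistency an update on x writes x and its incident edges
        # and only reads x's neighbors. Two updates therefore conflict only if
        # their center vertices are equal or adjacent; centers one vertex apart
        # overlap only in a read region and may run in parallel.
        return u != v and v not in adj[u]

    adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
    print(parallel_ok_under_edge_consistency(adj, "a", "b"))  # False: adjacent
    print(parallel_ok_under_edge_consistency(adj, "a", "c"))  # True: one vertex apart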

SLIDE 43

Distributed Consistency

  • Solution 1: Chromatic Engine

– Edge Consistency via Graph Coloring

  • Solution 2: Distributed Locking
SLIDE 44

Chromatic Distributed Engine

[Timeline, per machine: execute tasks on all vertices of color 0 → ghost synchronization → completion + barrier → execute tasks on all vertices of color 1 → ghost synchronization → completion + barrier → …]

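A single-machine sketch of the chromatic engine's schedule (the greedy coloring and the update callback are illustrative; in the distributed engine each color phase ends with ghost synchronization and a barrier, as shown above):

    def greedy_color(adj):
        # Assign colors so that adjacent vertices never share a color.
        color = {}
        for v in adj:
            used = {color[u] for u in adj[v] if u in color}
            color[v] = next(c for c in range(len(adj) + 1) if c not in used)
        return color

    def chromatic_sweep(adj, update):
        color = greedy_color(adj)
        for c in sorted(set(color.values())):
            batch = [v for v in adj if color[v] == c]
            for v in batch:        # same-colored vertices are never adjacent, so
                update(v)          # they could all run in parallel without locks
            # ...ghost synchronization + completion barrier would go here...

    adj = {"a": {"b", "c"}, "b": {"a"}, "c": {"a"}}
    chromatic_sweep(adj, lambda v: print("update", v))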

SLIDE 45

Matrix Factorization

  • Netflix Collaborative Filtering

– Alternating Least Squares Matrix Factorization

Model: 0.5 million nodes, 99 million edges

[Figure: bipartite Netflix users–movies graph; each user and each movie is assigned a d-dimensional latent factor.]
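
For concreteness, a compact single-machine NumPy sketch of alternating least squares on such a bipartite ratings graph (the regularization constant, dimension d, and random initialization are illustrative choices, not the slide's exact setup):

    import numpy as np

    def als(ratings, n_users, n_movies, d=5, reg=0.1, iters=10):
        # ratings: list of (user, movie, value) edges of the bipartite graph.
        U = np.random.rand(n_users, d)
        M = np.random.rand(n_movies, d)
        for _ in range(iters):
            # Fix one side's factors; each vertex on the other side solves an
            # independent least-squares problem over its neighbors (graph-parallel).
            for F, G, idx in ((U, M, 0), (M, U, 1)):
                for i in range(F.shape[0]):
                    obs = [(e[1 - idx], e[2]) for e in ratings if e[idx] == i]
                    if obs:
                        A = np.array([G[j] for j, _ in obs])
                        b = np.array([r for _, r in obs])
                        F[i] = np.linalg.solve(A.T @ A + reg * np.eye(d), A.T @ b)
        return U, M

    ratings = [(0, 0, 5.0), (0, 1, 1.0), (1, 0, 4.0), (1, 2, 2.0)]
    U, M = als(ratings, n_users=2, n_movies=3, d=2)
    print(U @ M.T)    # reconstructed rating estimates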

SLIDE 46

Netflix Collaborative Filtering


[Plots: (left) speedup vs. number of machines (4 to 64) against ideal, for d = 5 (1.0M cycles), d = 20 (2.1M cycles), d = 50 (7.7M cycles), and d = 100 (30M cycles). (right) Runtime in seconds (log scale) vs. number of machines at d = 20 for Hadoop, MPI, and GraphLab.]

SLIDE 47

Distributed Consistency

  • Solution 1: Chromatic Engine

– Edge consistency via graph coloring
– Requires a graph coloring to be available
– Frequent barriers → inefficient when only some vertices are active

  • Solution 2: Distributed Locking
SLIDE 48

Distributed Locking

Edge Consistency can be guaranteed through locking.

[Figure: each vertex is protected by a reader/writer (RW) lock.]


SLIDE 49

Consistency Through Locking

Acquire a write-lock on the center vertex and read-locks on adjacent vertices.


Performance problem: Acquiring a lock from a neighboring machine incurs a latency penalty
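
A thread-based sketch of this locking discipline (plain mutexes stand in for the per-vertex reader/writer locks, and the class below is an illustration, not GraphLab's interface). Taking locks in a canonical order keeps overlapping scopes from deadlocking; in the distributed setting each acquire on a remote vertex costs a network round trip, which motivates the pipelining on the next slides:

    import threading

    class ScopeLocker:
        def __init__(self, vertices):
            self.locks = {v: threading.Lock() for v in vertices}

        def acquire(self, vertex, neighbors):
            # Scope of an update: the center vertex (write) plus its neighbors
            # (read). Locks are acquired in sorted order to avoid deadlock.
            scope = sorted({vertex, *neighbors})
            for v in scope:
                self.locks[v].acquire()     # a remote vertex would add latency here
            return scope

        def release(self, scope):
            for v in reversed(scope):
                self.locks[v].release()

    locker = ScopeLocker(["a", "b", "c", "d"])
    scope = locker.acquire("b", ["a", "c"])
    # ... run the update function on b's scope ...
    locker.release(scope)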

SLIDE 50

Simple locking

[Timeline: lock scope 1 → process request 1 → scope 1 acquired → update_function 1 → release scope 1 → process release 1.]

SLIDE 51

Pipelining hides latency

GraphLab Idea: Hide latency using pipelining

[Timeline: lock requests for scopes 1, 2, and 3 are issued back to back; while update_function 1 runs and scope 1 is released, requests 2 and 3 are already being processed, so update_function 2 starts as soon as scope 2 is acquired.]

SLIDE 52

Distributed Consistency

  • Solution 1: Chromatic Engine

– Edge consistency via graph coloring
– Requires a graph coloring to be available
– Frequent barriers → inefficient when only some vertices are active

  • Solution 2: Distributed Locking

– Residual BP on 190K-vertex/560K-edge graph, 4 machines
– No pipelining: 472 sec; with pipelining: 10 sec

SLIDE 53

How to handle machine failure?

  • What happens when machines fail? How do we provide fault tolerance?

  • Strawman scheme: synchronous snapshot checkpointing

  1. Stop the world
  2. Write each machine’s state to disk
SLIDE 54

Snapshot Performance

[Plot: vertices updated (×10⁸) vs. time elapsed (s) for no snapshot, synchronous snapshot, and asynchronous snapshot, with one slow machine; the synchronous snapshot stalls progress for the duration of the snapshot.]

How can we do better, leveraging GraphLab’s consistency mechanisms?


SLIDE 55

Chandy-Lamport checkpointing

Step 1. Atomically, one initiator: (a) turns red, (b) records its own state, (c) sends markers to its neighbors.

Step 2. On receiving a marker, a non-red node atomically: (a) turns red, (b) records its own state, (c) sends markers along all outgoing channels.

Assumes first-in, first-out (FIFO) channels between nodes.

Implemented within GraphLab as an Update Function
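
A bare-bones sketch of the marker rule above (the names and callback-style channels are illustrative; recording of in-flight channel messages, part of the full Chandy-Lamport algorithm, is omitted here for brevity):

    class SnapshotNode:
        def __init__(self, name, state, out_channels):
            self.name, self.state = name, state
            self.out_channels = out_channels   # FIFO channels to neighbors
            self.red = False                   # red = already snapshotted
            self.recorded_state = None

        def on_marker(self):
            if self.red:
                return                         # ignore duplicate markers
            self.red = True                    # (a) turn red
            self.recorded_state = self.state   # (b) record own state
            for send_marker in self.out_channels:
                send_marker()                  # (c) forward markers downstream

        def initiate_snapshot(self):
            self.on_marker()                   # the initiator runs the same steps

    # Tiny two-node example with callback "channels":
    b = SnapshotNode("b", state=42, out_channels=[])
    a = SnapshotNode("a", state=7, out_channels=[b.on_marker])
    a.initiate_snapshot()
    print(a.recorded_state, b.recorded_state)  # 7 42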

SLIDE 56

Async. Snapshot Performance

[Plot: vertices updated (×10⁸) vs. time elapsed (s) for no snapshot, synchronous snapshot, and asynchronous snapshot, with one slow machine.]

No system performance penalty incurred from the slow machine!

SLIDE 57

Summary

  • Two different methods of achieving consistency

– Graph coloring
– Distributed locking with pipelining

  • Efficient implementations
  • Asynchronous fault tolerance with fine-grained Chandy-Lamport snapshots


(Performance, usability, efficiency, scalability)

SLIDE 58

Sunday topic: Streaming Data Processing and Cluster Coordination
