Splash: User-friendly Programming Interface for Parallelizing Stochastic Algorithms

SLIDE 1

Splash

User-friendly Programming Interface for Parallelizing Stochastic Algorithms

Yuchen Zhang and Michael Jordan

AMP Lab, UC Berkeley

SLIDE 2

Batch Algorithm vs. Stochastic Algorithm

Consider minimizing a loss function L(w) := (1/n) Σ_{i=1}^n ℓi(w).

SLIDE 3

Batch Algorithm vs. Stochastic Algorithm

Consider minimizing a loss function L(w) := (1/n) Σ_{i=1}^n ℓi(w).

Gradient Descent: iteratively update wt+1 = wt − ηt∇L(wt).

SLIDE 4

Batch Algorithm vs. Stochastic Algorithm

Consider minimizing a loss function L(w) := (1/n) Σ_{i=1}^n ℓi(w).

Gradient Descent: iteratively update wt+1 = wt − ηt∇L(wt).
  Pros: Easy to parallelize (via Spark).
  Cons: May need hundreds of iterations to converge.

[Plot: loss function vs. running time (seconds); curve: Gradient Descent - 64 threads]

SLIDE 5

Batch Algorithm vs. Stochastic Algorithm

Consider minimizing a loss function L(w) := (1/n) Σ_{i=1}^n ℓi(w).

Stochastic Gradient Descent (SGD): randomly draw ℓt, then wt+1 = wt − ηt∇ℓt(wt).

SLIDE 6

Batch Algorithm vs. Stochastic Algorithm

Consider minimizing a loss function L(w) := (1/n) Σ_{i=1}^n ℓi(w).

Stochastic Gradient Descent (SGD): randomly draw ℓt, then wt+1 = wt − ηt∇ℓt(wt).
  Pros: Much faster convergence.
  Cons: Sequential algorithm, difficult to parallelize.

[Plot: loss function vs. running time (seconds); curves: Gradient Descent - 64 threads, Stochastic Gradient Descent]

SLIDE 7

More Stochastic Algorithms

Convex Optimization:
  Adaptive SGD (Duchi et al.)
  Stochastic Average Gradient Method (Schmidt et al.)
  Stochastic Dual Coordinate Ascent (Shalev-Shwartz and Zhang)

SLIDE 8

More Stochastic Algorithms

Convex Optimization:
  Adaptive SGD (Duchi et al.)
  Stochastic Average Gradient Method (Schmidt et al.)
  Stochastic Dual Coordinate Ascent (Shalev-Shwartz and Zhang)

Probabilistic Model Inference:
  Markov chain Monte Carlo and Gibbs sampling
  Expectation propagation (Minka)
  Stochastic variational inference (Hoffman et al.)

SLIDE 9

More Stochastic Algorithms

Convex Optimization:
  Adaptive SGD (Duchi et al.)
  Stochastic Average Gradient Method (Schmidt et al.)
  Stochastic Dual Coordinate Ascent (Shalev-Shwartz and Zhang)

Probabilistic Model Inference:
  Markov chain Monte Carlo and Gibbs sampling
  Expectation propagation (Minka)
  Stochastic variational inference (Hoffman et al.)

SGD variants for:
  Matrix factorization
  Learning neural networks
  Learning denoising auto-encoders

SLIDE 10

More Stochastic Algorithms

Convex Optimization:
  Adaptive SGD (Duchi et al.)
  Stochastic Average Gradient Method (Schmidt et al.)
  Stochastic Dual Coordinate Ascent (Shalev-Shwartz and Zhang)

Probabilistic Model Inference:
  Markov chain Monte Carlo and Gibbs sampling
  Expectation propagation (Minka)
  Stochastic variational inference (Hoffman et al.)

SGD variants for:
  Matrix factorization
  Learning neural networks
  Learning denoising auto-encoders

How to parallelize these algorithms?

SLIDE 11

First Attempt

After processing a subsequence of random samples...

Single-thread Algorithm: incremental update w ← w + ∆.

SLIDE 12

First Attempt

After processing a subsequence of random samples...

Single-thread Algorithm: incremental update w ← w + ∆.

Parallel Algorithm:
  Thread 1 (on 1/m of samples): w ← w + ∆1.
  Thread 2 (on 1/m of samples): w ← w + ∆2.
  ...
  Thread m (on 1/m of samples): w ← w + ∆m.

SLIDE 13

First Attempt

After processing a subsequence of random samples...

Single-thread Algorithm: incremental update w ← w + ∆.

Parallel Algorithm:
  Thread 1 (on 1/m of samples): w ← w + ∆1.
  Thread 2 (on 1/m of samples): w ← w + ∆2.
  ...
  Thread m (on 1/m of samples): w ← w + ∆m.

Aggregate parallel updates: w ← w + ∆1 + · · · + ∆m.

SLIDE 14

First Attempt

After processing a subsequence of random samples...

Single-thread Algorithm: incremental update w ← w + ∆.

Parallel Algorithm:
  Thread 1 (on 1/m of samples): w ← w + ∆1.
  Thread 2 (on 1/m of samples): w ← w + ∆2.
  ...
  Thread m (on 1/m of samples): w ← w + ∆m.

Aggregate parallel updates: w ← w + ∆1 + · · · + ∆m.

[Plot: loss function vs. running time (seconds); curves: Single-thread SGD, Parallel SGD - 64 threads]

Doesn't work for SGD!
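To make the failure concrete, here is a minimal single-machine sketch (an illustration written for this transcript, not code from the talk) of the scheme above: m simulated threads each run SGD on a shard of the data, and their deltas are summed.

  case class Sample(x: Double, y: Double)

  // One "thread": run SGD on its shard starting from w0, return its delta.
  def sgdDelta(shard: Seq[Sample], w0: Double, eta: Double): Double = {
    var w = w0
    for (s <- shard) w -= eta * s.x * (w * s.x - s.y)  // least-squares gradient step
    w - w0
  }

  // Naive parallelization: shard the data, sum the deltas.
  // (Threads are simulated sequentially; leftover samples are ignored.)
  def naiveParallelStep(data: Seq[Sample], w: Double, eta: Double, m: Int): Double = {
    val shards = data.grouped(math.max(1, data.size / m)).toSeq.take(m)
    w + shards.map(shard => sgdDelta(shard, w, eta)).sum
  }

Because all m deltas are computed from the same starting point w and then summed, the combined move is roughly m times larger than a single SGD step, which is why the 64-thread curve above diverges.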

SLIDE 15

Conflicts in Parallel Updates

Reason for failure: ∆1, . . . , ∆m simultaneously manipulate the same variable w, causing conflicts in the parallel updates.

SLIDE 16

Conflicts in Parallel Updates

Reason for failure: ∆1, . . . , ∆m simultaneously manipulate the same variable w, causing conflicts in the parallel updates.

How to resolve conflicts?

SLIDE 17

Conflicts in Parallel Updates

Reason for failure: ∆1, . . . , ∆m simultaneously manipulate the same variable w, causing conflicts in the parallel updates.

How to resolve conflicts?

1. Frequent communication between threads:
   Pros: a general approach to resolving conflicts.
   Cons: inter-node (asynchronous) communication is expensive!

SLIDE 18

Conflicts in Parallel Updates

Reason for failure: ∆1, . . . , ∆m simultaneously manipulate the same variable w, causing conflicts in the parallel updates.

How to resolve conflicts?

1. Frequent communication between threads:
   Pros: a general approach to resolving conflicts.
   Cons: inter-node (asynchronous) communication is expensive!

2. Carefully partition the data so that threads never simultaneously manipulate the same variable:
   Pros: doesn't need frequent communication.
   Cons: needs problem-specific partitioning schemes; only works for a subset of problems.

SLIDE 19

Splash: A Principled Solution

Splash is:
  A programming interface for developing stochastic algorithms.
  An execution engine for running stochastic algorithms on distributed systems.

SLIDE 20

Splash: A Principled Solution

Splash is:
  A programming interface for developing stochastic algorithms.
  An execution engine for running stochastic algorithms on distributed systems.

Features of Splash include:
  Easy Programming: users develop single-thread algorithms via Splash: no communication protocol, no conflict management, no data partitioning, no hyper-parameter tuning.

SLIDE 21

Splash: A Principled Solution

Splash is:
  A programming interface for developing stochastic algorithms.
  An execution engine for running stochastic algorithms on distributed systems.

Features of Splash include:
  Easy Programming: users develop single-thread algorithms via Splash: no communication protocol, no conflict management, no data partitioning, no hyper-parameter tuning.
  Fast Performance: Splash adopts a novel strategy for automatic parallelization with infrequent communication, so communication is no longer a performance bottleneck.

SLIDE 22

Splash: A Principled Solution

Splash is:
  A programming interface for developing stochastic algorithms.
  An execution engine for running stochastic algorithms on distributed systems.

Features of Splash include:
  Easy Programming: users develop single-thread algorithms via Splash: no communication protocol, no conflict management, no data partitioning, no hyper-parameter tuning.
  Fast Performance: Splash adopts a novel strategy for automatic parallelization with infrequent communication, so communication is no longer a performance bottleneck.
  Integration with Spark: takes RDDs as input and returns RDDs as output. Works with KeystoneML, MLlib and other data analysis tools on Spark.

SLIDE 23

Programming Interface

SLIDE 24

Programming with Splash

Splash users implement the following function:

  def process(sample: Any, weight: Int, sharedVar: VariableSet) {
    /* implement the stochastic algorithm */
  }

where
  sample — a random sample from the dataset.
  weight — the algorithm should treat the sample as if it were duplicated weight times.
  sharedVar — the set of all shared variables.

SLIDE 25

Example: SGD for Linear Regression

Goal: find w∗ = arg min_w (1/n) Σ_{i=1}^n (w·xi − yi)².

SGD update: randomly draw (xi, yi), then w ← w − η∇w(w·xi − yi)².
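For reference (a step worked out here, not shown on the slide), the per-sample gradient is ∇w(w·xi − yi)² = 2(w·xi − yi)·xi; the constant 2 is typically absorbed into the step size η, as the implementation on the next slide does.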

SLIDE 26

Example: SGD for Linear Regression

Goal: find w∗ = arg min_w (1/n) Σ_{i=1}^n (w·xi − yi)².

SGD update: randomly draw (xi, yi), then w ← w − η∇w(w·xi − yi)².

Splash implementation:

  def process(sample: Any, weight: Int, sharedVar: VariableSet) {
    val stepsize = sharedVar.get("eta") * weight
    val gradient = sample.x * (sharedVar.get("w") * sample.x - sample.y)
    sharedVar.add("w", -stepsize * gradient)
  }

SLIDE 27

Example: SGD for Linear Regression

Goal: find w∗ = arg min_w (1/n) Σ_{i=1}^n (w·xi − yi)².

SGD update: randomly draw (xi, yi), then w ← w − η∇w(w·xi − yi)².

Splash implementation:

  def process(sample: Any, weight: Int, sharedVar: VariableSet) {
    val stepsize = sharedVar.get("eta") * weight
    val gradient = sample.x * (sharedVar.get("w") * sample.x - sample.y)
    sharedVar.add("w", -stepsize * gradient)
  }

Supported operations: get, add, multiply, delayedAdd.
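As a self-contained illustration (a toy written for this transcript; in real use the variable set is supplied by the Splash runtime), the same process logic can be exercised against a map-backed stand-in:

  import scala.collection.mutable

  // Toy stand-in for Splash's shared-variable store.
  class ToyVariableSet {
    private val table = mutable.Map[String, Double]().withDefaultValue(0.0)
    def get(key: String): Double = table(key)
    def add(key: String, delta: Double): Unit = table(key) += delta
  }

  case class Sample(x: Double, y: Double)

  def process(sample: Sample, weight: Int, sharedVar: ToyVariableSet): Unit = {
    val stepsize = sharedVar.get("eta") * weight
    val gradient = sample.x * (sharedVar.get("w") * sample.x - sample.y)
    sharedVar.add("w", -stepsize * gradient)
  }

  // Sequential SGD over a toy dataset: w approaches ~2, the least-squares slope.
  val vars = new ToyVariableSet
  vars.add("eta", 0.05)
  val data = Seq(Sample(1.0, 2.0), Sample(2.0, 4.1), Sample(3.0, 5.9))
  for (_ <- 1 to 100; s <- data) process(s, weight = 1, sharedVar = vars)
  println(vars.get("w"))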

SLIDE 28

Get Operations

Get the value of a variable (Double or Array[Double]).

  get(key): returns var[key]
  getArray(key): returns varArray[key]
  getArrayElement(key, index): returns varArray[key][index]
  getArrayElements(key, indices): returns varArray[key][indices]

Array-based operations are more efficient than element-wise operations, because the key-value retrieval is executed only once for the whole array.
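A toy cost model of that difference (an illustration, not Splash's implementation): element-wise access pays one key-value lookup per element, while the array-based call pays a single lookup and then indexes the array directly.

  import scala.collection.mutable

  val store = mutable.Map("w" -> Array.fill(1000)(1.0))

  // Element-wise: 1000 hash lookups of the key "w".
  val slow = (0 until 1000).map(i => store("w")(i)).sum

  // Array-based: one hash lookup, then plain array indexing.
  val arr = store("w")
  val fast = arr.sum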

SLIDE 29

Add Operations

Add a quantity δ to a variable.

  add(key, delta): var[key] += delta
  addArray(key, deltaArray): varArray[key] += deltaArray
  addArrayElement(key, index, delta): varArray[key][index] += delta
  addArrayElements(key, indices, deltaArrayElements): varArray[key][indices] += deltaArrayElements

SLIDE 30

Multiply Operations

Multiply a variable by a quantity γ.

  multiply(key, gamma): var[key] *= gamma
  multiplyArray(key, gamma): varArray[key] *= gamma

We have optimized the implementation so that the time complexity of multiplyArray is O(1), independent of the array dimension.

SLIDE 31

Multiply Operations

Multiply a variable by a quantity γ.

  multiply(key, gamma): var[key] *= gamma
  multiplyArray(key, gamma): varArray[key] *= gamma

We have optimized the implementation so that the time complexity of multiplyArray is O(1), independent of the array dimension.

Example: SGD with sparse features and ℓ2-norm regularization.

  w ← (1 − λ)·w      (multiply operation)          (1)
  w ← w − η∇f(w)     (addArrayElements operation)  (2)

Time complexity of (1) is O(1); time complexity of (2) is O(nnz(∇f(w))).
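One plausible way to make multiplyArray O(1) (an assumption about the mechanism; the slides don't spell out Splash's internals) is to keep a lazy scale factor and fold it into reads and element-wise writes:

  // Lazily scaled array: multiplying every element is O(1); reads and
  // element-wise adds compensate for the pending scale factor.
  // (A real implementation would re-normalize occasionally to avoid underflow.)
  class ScaledArray(n: Int) {
    private val raw = new Array[Double](n)
    private var scale = 1.0
    def multiplyAll(gamma: Double): Unit = scale *= gamma      // O(1)
    def get(i: Int): Double = scale * raw(i)
    def addElement(i: Int, delta: Double): Unit = raw(i) += delta / scale
  }

Under such a scheme the shrinkage step (1) costs O(1) per sample, and only the nnz(∇f(w)) coordinates touched in step (2) cost real work.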

SLIDE 32

Delayed Add Operations

Add a quantity δ to a variable; the operation is not executed until the next time the same sample is processed by the system.

  delayedAdd(key, delta): var[key] += delta
  delayedAddArray(key, deltaArray): varArray[key] += deltaArray
  delayedAddArrayElement(key, index, delta): varArray[key][index] += delta

SLIDE 33

Delayed Add Operations

Add a quantity δ to a variable; the operation is not executed until the next time the same sample is processed by the system.

  delayedAdd(key, delta): var[key] += delta
  delayedAddArray(key, deltaArray): varArray[key] += deltaArray
  delayedAddArrayElement(key, index, delta): varArray[key][index] += delta

Example: Collapsed Gibbs Sampling for LDA — update the word-topic counter nwk when topic k is assigned to word w.

  nwk ← nwk + weight   (add operation)          (3)
  nwk ← nwk − weight   (delayed add operation)  (4)

(3) is executed instantly; (4) will be executed at the next visit, just before a new topic is sampled for the same word.
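A toy model of these semantics (written for this transcript; Splash's engine does this bookkeeping internally): each sample carries a queue of pending deltas that is flushed just before the sample is processed again.

  import scala.collection.mutable

  class ToyDelayedVars {
    val table = mutable.Map[String, Double]().withDefaultValue(0.0)
    private val pending = mutable.Map[Long, mutable.Buffer[(String, Double)]]()
    def add(key: String, delta: Double): Unit = table(key) += delta
    def delayedAdd(id: Long, key: String, delta: Double): Unit =
      pending.getOrElseUpdate(id, mutable.Buffer()) += ((key, delta))
    // The engine calls this right before sample id is processed again.
    def beginVisit(id: Long): Unit =
      pending.remove(id).foreach(_.foreach { case (k, d) => table(k) += d })
  }

  // LDA counter pattern for one token: the decrement scheduled at the previous
  // visit lands first, then the new assignment is counted.
  val vars = new ToyDelayedVars
  def visitToken(id: Long, word: String, topic: Int, weight: Int): Unit = {
    vars.beginVisit(id)
    vars.add(s"n_${word}_$topic", weight)              // (3) instant
    vars.delayedAdd(id, s"n_${word}_$topic", -weight)  // (4) deferred
  }

  visitToken(42L, "apple", 3, 1)  // n_apple_3 becomes 1
  visitToken(42L, "apple", 5, 1)  // first flushes the -1 for topic 3, then counts topic 5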

SLIDE 34

Running Stochastic Algorithm

Three simple steps:

1. Convert the RDD dataset to a Parametrized RDD:

   val paramRdd = new ParametrizedRDD(rdd)

SLIDE 35

Running Stochastic Algorithm

Three simple steps:

1. Convert the RDD dataset to a Parametrized RDD:

   val paramRdd = new ParametrizedRDD(rdd)

2. Set a function that implements the algorithm:

   paramRdd.setProcessFunction(process)

SLIDE 36

Running Stochastic Algorithm

Three simple steps:

1. Convert the RDD dataset to a Parametrized RDD:

   val paramRdd = new ParametrizedRDD(rdd)

2. Set a function that implements the algorithm:

   paramRdd.setProcessFunction(process)

3. Start running:

   paramRdd.run()
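Putting the three steps together (a sketch assuming a live SparkContext sc, the process function from the linear-regression example, and a hypothetical parseSample parser for the input lines):

  val rdd = sc.textFile("data.txt").map(parseSample)  // parseSample: hypothetical
  val paramRdd = new ParametrizedRDD(rdd)
  paramRdd.setProcessFunction(process)
  paramRdd.run()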

SLIDE 37

Execution Engine

SLIDE 38

How does Splash work?

In each iteration, the execution engine does:

1. Propose candidate degrees of parallelism m1, . . . , mk such that Σ_{i=1}^k mi = m := (# of cores). For each i ∈ [k], collect mi cores and do:

SLIDE 39

How does Splash work?

In each iteration, the execution engine does:

1. Propose candidate degrees of parallelism m1, . . . , mk such that Σ_{i=1}^k mi = m := (# of cores). For each i ∈ [k], collect mi cores and do:

   1.1 Each core gets a sub-sequence of samples (by default 1/m of the full data). The cores process their samples sequentially using the process function. Every sample is weighted by mi.

SLIDE 40

How does Splash work?

In each iteration, the execution engine does:

1. Propose candidate degrees of parallelism m1, . . . , mk such that Σ_{i=1}^k mi = m := (# of cores). For each i ∈ [k], collect mi cores and do:

   1.1 Each core gets a sub-sequence of samples (by default 1/m of the full data). The cores process their samples sequentially using the process function. Every sample is weighted by mi.

   1.2 Combine the updates of all mi cores to get the global update. There are different strategies for combining different types of updates; for add operations, the updates are averaged.

SLIDE 41

How does Splash work?

In each iteration, the execution engine does:

1. Propose candidate degrees of parallelism m1, . . . , mk such that Σ_{i=1}^k mi = m := (# of cores). For each i ∈ [k], collect mi cores and do:

   1.1 Each core gets a sub-sequence of samples (by default 1/m of the full data). The cores process their samples sequentially using the process function. Every sample is weighted by mi.

   1.2 Combine the updates of all mi cores to get the global update. There are different strategies for combining different types of updates; for add operations, the updates are averaged.

2. If k > 1, select the best mi by a parallel cross-validation procedure.

SLIDE 42

How does Splash work?

In each iteration, the execution engine does:

1. Propose candidate degrees of parallelism m1, . . . , mk such that Σ_{i=1}^k mi = m := (# of cores). For each i ∈ [k], collect mi cores and do:

   1.1 Each core gets a sub-sequence of samples (by default 1/m of the full data). The cores process their samples sequentially using the process function. Every sample is weighted by mi.

   1.2 Combine the updates of all mi cores to get the global update. There are different strategies for combining different types of updates; for add operations, the updates are averaged.

2. If k > 1, select the best mi by a parallel cross-validation procedure.

3. Broadcast the best update to all machines and apply it, then proceed to the next iteration. (The degree of parallelism doesn't decrease.)
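A minimal single-machine sketch of one such iteration (an illustration for this transcript, with the mi parallel cores simulated sequentially; the real engine runs them across Spark executors):

  case class Sample(x: Double, y: Double)

  // One "core": weighted SGD on its shard, starting from w0; returns its delta.
  def weightedDelta(shard: Seq[Sample], w0: Double, eta: Double, mi: Int): Double = {
    var w = w0
    for (s <- shard) w -= (eta * mi) * s.x * (w * s.x - s.y)  // each sample weighted by mi
    w - w0
  }

  // One iteration at degree of parallelism mi: shard, process, average.
  def engineIteration(data: Seq[Sample], w: Double, eta: Double, mi: Int): Double = {
    val shards = data.grouped(math.max(1, data.size / mi)).toSeq.take(mi)
    val deltas = shards.map(shard => weightedDelta(shard, w, eta, mi))
    w + deltas.sum / deltas.size  // add-type updates are averaged
  }

Averaging the reweighted deltas, instead of summing unit-weight deltas as in the naive scheme, keeps the combined step well-scaled.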

SLIDE 43

Example: Reweighting for SGD

[Figure: contour plots showing (a) optimal solution, (b) solution with full update, (c) local solutions with unit-weight update]

SLIDE 44

Example: Reweighting for SGD

[Figure: contour plots showing (a) optimal solution, (b) solution with full update, (c) local solutions with unit-weight update, (d) average of local solutions in (c), (e) aggregate of local solutions in (c)]

SLIDE 45

Example: Reweighting for SGD

[Figure: contour plots showing (a) optimal solution, (b) solution with full update, (c) local solutions with unit-weight update, (d) average of local solutions in (c), (e) aggregate of local solutions in (c), (f) local solutions with weighted update]

SLIDE 46

Example: Reweighting for SGD

[Figure: contour plots showing (a) optimal solution, (b) solution with full update, (c) local solutions with unit-weight update, (d) average of local solutions in (c), (e) aggregate of local solutions in (c), (f) local solutions with weighted update, (g) average of local solutions in (f)]

SLIDE 47

Experiments

SLIDE 48

Experiment Setups

System: Amazon EC2 cluster with 8 workers. Each worker has 8 Intel Xeon E5-2665 cores and 30 GB of memory, and is connected to a commodity 1Gb network.

Algorithms: SGD for logistic regression; mini-batch SGD for collaborative filtering; Gibbs sampling for topic modelling.

Datasets:
  MNIST 8M (LR): 8 million samples, 7,840 parameters.
  Netflix (CF): 100 million samples, 65 million parameters.
  NYTimes (LDA): 100 million samples, 200 million parameters.

Baseline methods: single-thread stochastic algorithms; MLlib (the official machine learning library for Spark).

SLIDE 49

Logistic Regression on MNIST Digit Recognition

[Plots: loss function vs. runtime (seconds) for Splash (SGD), Single-thread SGD, and MLlib (L-BFGS); speedup rate of Splash over single-thread SGD and over MLlib (L-BFGS) at several loss function values]

Splash converges to a good solution in a few seconds, while other methods take hundreds of seconds.
Splash is 10x - 25x faster than single-thread SGD.
Splash is 15x - 30x faster than parallelized L-BFGS.

SLIDE 50

Netflix Movie Recommendation

[Plot: prediction loss vs. runtime (seconds) for Splash (SGD), Single-thread SGD, and MLlib (ALS)]

Splash is 36x faster than parallelized Alternating Least Squares (ALS).
Splash converges to a better solution than ALS (the problem is non-convex).

SLIDE 51

Topic Modelling on New York Times Articles

[Plot: predictive log-likelihood vs. runtime (seconds) for Splash (Gibbs), Single-thread (Gibbs), and MLlib (VI)]

Splash is 3x - 6x faster than parallelized Variational Inference (VI).
Splash converges to a better solution than VI.

SLIDE 52

Runtime Analysis

[Bar chart: runtime per pass, split into computation time, waiting time, and communication time, for MNIST 8M (LR), Netflix (CF), and NYTimes (LDA)]

Waiting time is 16%, 21%, and 26% of the computation time, respectively. Communication time is 6%, 39%, and 103% of the computation time.

SLIDE 53

Machine Learning Package

SLIDE 54

Stochastic Machine Learning Library on Splash

Goal:

  Fast performance: order-of-magnitude faster than MLlib.
  Ease of use: call with one line of code.
  Integration: easy to build a data analytics pipeline.

Algorithms:

  Stochastic gradient descent.
  Stochastic matrix factorization.
  Gibbs sampling for LDA.

More algorithms will be implemented in the future...

SLIDE 55

Summary

Splash is a general-purpose programming interface for developing stochastic algorithms.
Splash is also an execution engine for automatically parallelizing stochastic algorithms.
Reweighting is the key to achieving fast performance without sacrificing communication efficiency.
We observe good empirical performance, and we have theoretical guarantees for SGD.
Splash is online at http://zhangyuc.github.io/splash/.
