Scaling with the Parameter Server: Variations on a Theme



slide-1
SLIDE 1

Alexander Smola Google Research & CMU alex.smola.org

Scaling with the Parameter Server: Variations on a Theme

slide-2
SLIDE 2

Thanks

Joey Gonzalez, Shravan Narayanamurthy, Markus Weimer, Sergiy Matyusevich, Amr Ahmed, Nino Shervashidze

slide-3
SLIDE 3

Practical Distributed Inference

  • Multicore
    • asynchronous optimization with shared state
  • Multiple machines
    • exact synchronization (Yahoo LDA)
    • approximate synchronization
    • dual decomposition

slide-4
SLIDE 4

Motivation: Data & Systems

slide-5
SLIDE 5

Commodity Hardware

  • High Performance Computing
    Very reliable, custom built, expensive
  • Consumer hardware
    Cheap, efficient, easy to replicate; not very reliable, deal with it!

slide-6
SLIDE 6

The Joys of Real Hardware

Slide courtesy of Jeff Dean
http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//people/jeff/stanford-295-talk.pdf

slide-7
SLIDE 7

Scaling problems

  • Data (lower bounds)
    – >10 billion documents (webpages, e-mails, advertisements, tweets)
    – >100 million users on Google, Facebook, Twitter, Yahoo, Hotmail
    – >1 million days of video on YouTube
    – >10 billion images on Facebook
  • Processing capability of a single machine is about 1 TB/hour, but we have much more data
  • Parameter space for models is too big for a single machine (but not by much)
    Personalize content for many millions of users
  • Need to process data on many cores and many machines simultaneously
slide-8
SLIDE 8

Some Problems

  • Good old-fashioned supervised learning
    (classification, regression, tagging, entity extraction, ...)
  • Graph factorization
    (latent variable estimation, social recommendation, discovery)
  • Structure inference
    (clustering, topics, hierarchies, DAGs, whatever else your NP Bayes friends have)
  • Example use case: combine information from generic webpages, databases, human generated data, and semistructured tables into knowledge about entities.

How do we solve it at scale? ... this talk

slide-11
SLIDE 11

Multicore parallelism


slide-12
SLIDE 12

Multicore Parallelism

  • Many processor cores
    – Decompose into separate tasks
    – Good Java/C++ tool support
  • Shared memory
    – Exact estimates: requires locking of neighbors (see e.g. GraphLab). Good if the problem can be decomposed cleanly (e.g. Gibbs sampling in a large model)
    – Exact updates with delayed incorporation: requires locking of state. Good if a delayed update is of little consequence (e.g. Yahoo LDA, Yahoo online)
    – Hogwild updates: no locking whatsoever, requires atomic state. Good if collision probability is low

[Diagram: data source, loss gradient, x]
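To make the Hogwild bullet concrete, here is a minimal Python sketch (not code from the talk): several threads apply sparse hinge-loss SGD updates to a shared weight vector with no locking, relying on collisions between threads being rare. The toy data stream and all names are made up for illustration.

```python
import threading
import numpy as np

# Hogwild-style sketch: threads update a shared weight vector in place,
# without locks. This only makes sense when updates are sparse and
# collisions are rare, as stated on slide 12.
DIM, STEPS, LR = 1000, 10000, 0.01
w = np.zeros(DIM)                          # shared state, updated in place

def worker(seed):
    rng = np.random.default_rng(seed)
    for _ in range(STEPS):
        idx = rng.choice(DIM, size=5, replace=False)   # sparse toy example
        x = rng.normal(size=5)
        y = rng.choice([-1.0, 1.0])
        margin = y * float(x @ w[idx])
        if margin < 1.0:                   # hinge-loss subgradient step
            w[idx] += LR * y * x           # unsynchronized ("Hogwild") update

threads = [threading.Thread(target=worker, args=(s,)) for s in range(4)]
for t in threads: t.start()
for t in threads: t.join()
```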

slide-13
SLIDE 13
Stochastic Gradient Descent

  • Online template: minimize_w Σ_i f_i(w)
  • Delayed updates
    (round robin for data parallelism, aggregation tree for parameter parallelism)

[Diagram: online SGD pipeline (data source, loss gradient, x); data-parallel and parameter-parallel variants with data parts and an updater]

Input: scalar σ > 0 and delay τ
for t = τ + 1 to T + τ do
  Obtain f_t and incur loss f_t(w_t)
  Compute g_t := ∇f_t(w_t) and set η_t = 1 / (σ (t − τ))
  Update w_{t+1} = w_t − η_t g_{t−τ}
end for
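The template above, transcribed into a runnable Python sketch; the quadratic loss, gradient, and data stream are stand-ins, and in this sketch the first τ steps only accumulate gradients rather than reaching back before the loop.

```python
import numpy as np

def delayed_sgd(grad, data, sigma=1.0, tau=4, T=1000, dim=10):
    """Delayed SGD template from slide 13: at step t, apply the gradient
    computed tau steps earlier with learning rate 1 / (sigma * (t - tau))."""
    w = np.zeros(dim)
    pending = []                          # gradients waiting to be applied
    for t in range(tau + 1, T + tau + 1):
        x, y = data(t)                    # obtain f_t (one example)
        pending.append(grad(w, x, y))     # compute g_t at the current w
        if len(pending) > tau:            # g computed tau steps ago is ready
            eta = 1.0 / (sigma * (t - tau))
            w -= eta * pending.pop(0)     # w_{t+1} = w_t - eta_t g_{t-tau}
    return w

# Hypothetical plumbing: a least-squares stream with target 0.
rng = np.random.default_rng(0)
data = lambda t: (rng.normal(size=10), 0.0)
grad = lambda w, x, y: 2 * (w @ x - y) * x
w = delayed_sgd(grad, data)
```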

slide-14
SLIDE 14

Guarantees

  • Worst case guarantee (Zinkevich, Langford, Smola, 2010)
    SGD with delay τ on τ processors is no worse than sequential SGD:
    E[R[X]] ≤ 4 R L √(τ T)
  • Lower bound is tight
    Proof: send the same instance τ times
  • Better bounds with iid data
    – Penalty is the covariance in the features
    – Vanishing penalty for smooth f(w):
      E[R[X]] ≤ (28.3 R² H + (2/3) R L + (4/3) R² H log T) τ² + (8/3) R L √T
  • Works even better if we don't lock between updates
    (Hogwild; Recht, Re, Wright, 2011)

slide-15
SLIDE 15

Speedup on TREC

[Figure: speedup in % vs. number of cores]

slide-16
SLIDE 16

Smola and Narayanamurthy, 2010

LDA Multicore Inference

  • Decouple multithreaded sampling and updating (almost)
    avoids stalling for locks in the sampler
  • Joint state table
    – much less memory required
    – samplers synchronized (10 docs vs. millions delay)
  • Hyperparameter update via stochastic gradient descent
  • No need to keep documents in memory (streaming)

[Architecture diagram: tokens and topics from file, samplers (Intel Threading Building Blocks), combiner, count updater, joint state table, diagnostics & optimization, output to file]

slide-17
SLIDE 17

Smola and Narayanamurthy, 2010

LDA Multicore Inference


  • Sequential collapsed Gibbs sampler, separate state table
    Mallet (Mimno et al., 2008): slow mixing, high memory load, many iterations
  • Sequential collapsed Gibbs sampler (parallel)
    Yahoo LDA (Smola and Narayanamurthy, 2010): fast mixing, many iterations
  • Sequential stochastic gradient descent (variational, single logical thread)
    VW LDA (Hoffman et al., 2011): fast convergence, few iterations, dense
  • Sequential stochastic sampling gradient descent (only partly variational)
    Hoffman, Mimno, Blei, 2012: fast convergence, quite sparse, single logical thread

slide-18
SLIDE 18

General strategy

  • Shared state space
  • Delayed updates from cores
  • Proof technique is usually to show that the problem hasn't changed too much during the delay (in terms of interactions)
  • More work
    – Macready, Siapas and Kauffman, 1995: Criticality and Parallelism in Combinatorial Optimization
    – Low, Gonzalez, Kyrola, Bickson, Guestrin and Hellerstein, 2010: Shotgun for l1

slide-19
SLIDE 19

This was easy ... what if we need many machines?

slide-22
SLIDE 22

Parameter Server: 30,000 ft view

slide-23
SLIDE 23

diagram from Ramakrishnan, Sakrejda, Canon, DoE 2011

Why (not) MapReduce?

  • Map(key, value)
    process instances on a subset of the data / emit aggregate statistics
  • Reduce(key, value)
    aggregate over the whole dataset, update parameters
  • This is a parameter exchange mechanism (simply repeat MapReduce; see the sketch below)
    good if you can make your algorithm fit (e.g. distributed convex online solvers)
  • Can be slow to propagate updates between machines & slow to converge
    (e.g. a really bad idea in clustering: each machine proposes a different clustering)
    Hadoop MapReduce loses the state between mapper iterations
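A toy illustration of the "simply repeat MapReduce" parameter-exchange pattern (a sketch, not an actual Hadoop job): mappers run local gradient steps on their shard, the reducer averages the resulting parameter vectors, and the driver repeats. The shards and the least-squares objective are made up.

```python
import numpy as np

def map_step(w, shard, lr=0.1):
    """Mapper: a few local SGD steps on one data shard, emit parameters."""
    w = w.copy()
    for x, y in shard:
        w -= lr * 2 * (w @ x - y) * x     # least-squares gradient step
    return w

def reduce_step(partials):
    """Reducer: average the parameter vectors produced by all mappers."""
    return np.mean(partials, axis=0)

# Hypothetical driver: repeat MapReduce for a fixed number of rounds.
rng = np.random.default_rng(0)
shards = [[(rng.normal(size=5), 1.0) for _ in range(100)] for _ in range(8)]
w = np.zeros(5)
for _ in range(20):                       # one MapReduce round per iteration
    w = reduce_step([map_step(w, s) for s in shards])
```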

slide-24
SLIDE 24

General parallel algorithm template

  • Clients have a local copy of the parameters to be estimated
  • P2P is infeasible since it needs O(n²) connections
    (see Asuncion et al. for an amazing tour de force)
  • Synchronize* with the parameter server (a minimal sketch follows below)
    – Reconciliation protocol: average parameters, lock variables, turnstile counter
    – Synchronization schedule: asynchronous, synchronous, episodic
    – Load distribution algorithm: single server, uniform distribution, fault tolerance, recovery

[Diagram: client and server]
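A minimal in-process sketch of this client/server template (hypothetical API, additive reconciliation only): each client keeps a local copy, pushes the delta since its last synchronization, and pulls the reconciled global value.

```python
class ParameterServer:
    """Toy in-process server: keyed parameters, additive reconciliation."""
    def __init__(self):
        self.store = {}
    def push(self, key, delta):
        self.store[key] = self.store.get(key, 0.0) + delta
    def pull(self, key):
        return self.store.get(key, 0.0)

class Client:
    """Client with a local (possibly stale) copy of one parameter."""
    def __init__(self, server, key):
        self.server, self.key = server, key
        self.local = server.pull(key)
        self.synced = self.local          # value at last synchronization
    def update(self, delta):
        self.local += delta               # local work between syncs
    def synchronize(self):
        self.server.push(self.key, self.local - self.synced)   # send delta
        self.local = self.server.pull(self.key)                 # receive global
        self.synced = self.local

server = ParameterServer()
clients = [Client(server, "w") for _ in range(3)]
for c in clients:
    c.update(1.0)
    c.synchronize()
assert server.pull("w") == 3.0
```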

slide-25
SLIDE 25

General parallel algorithm template

  • A client syncs to many masters; a master serves many clients
  • A complete graph is bad for the network; use randomized messaging to fix it

[Diagram: clients and servers]

slide-26
SLIDE 26

Desiderata

  • Variable and load distribution
    • Large number of objects (a priori unknown)
    • Large pool of machines (often faulty)
  • Assign objects to machines such that
    • an object goes to the same machine (if possible)
    • machines can be added / fail dynamically
  • Consistent hashing (elements, sets, proportional)
    • symmetric, dynamically scalable, fault tolerant
    • for large scale inference
    • for real time data sketches

slide-27
SLIDE 27

Karger et al. 1999, Ahmed et al. 2011

Random Caching Trees

  • Cache / synchronize an object
  • Uneven load distribution; must not generate a hotspot
  • For a given key, pick a random order of machines
  • Map the order onto a tree / star via BFS ordering (see the sketch below)
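A sketch of the per-key ordering (helper names are hypothetical): every key gets its own deterministic pseudorandom permutation of the machine pool, and the permutation is laid out as a tree via BFS so that synchronization traffic for a hot key is spread over many machines instead of one.

```python
import hashlib

def machine_order(key, machines):
    """Deterministic pseudorandom order of machines for one key:
    sort machines by a per-(key, machine) hash, as on slide 27."""
    def score(m):
        return hashlib.md5(f"{key}:{m}".encode()).hexdigest()
    return sorted(machines, key=score)

def caching_tree(key, machines, fanout=2):
    """Map the per-key order onto a tree via BFS: element i's parent is
    element (i - 1) // fanout; the root caches/serves the object."""
    order = machine_order(key, machines)
    return [(order[(i - 1) // fanout], m) for i, m in enumerate(order) if i > 0]

machines = [f"m{i}" for i in range(7)]
print(machine_order("topic_counts", machines)[0])    # root for this key
print(caching_tree("topic_counts", machines))        # (parent, child) edges
```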

slide-28
SLIDE 28

Karger et al. 1999, Ahmed et al. 2011

Random Caching Trees


(Karger et al. 1999 - ‘Akamai’ paper)

slide-29
SLIDE 29
Argmin Hash

  • Consistent hashing
    – Uniform distribution over the machine pool M
    – Fully determined by the hash function h; no need to ask a master
    – If we add/remove a machine m', all but O(1/m) of the keys remain in place
  • Consistent hashing with k replications
    – If we add/remove a machine, only O(k/m) of the keys need reassigning (also self repair)
  • Cost to assign is O(m); this can be expensive for 1000 servers

m(key) = argmin_{m ∈ M} h(key, m)
Pr{m(key) = m_0} = 1/m
m(key, k) = the k machines in M with the smallest h(key, m)
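The argmin-hash rule written out as a Python sketch (this is the rendezvous / highest-random-weight construction): the machine assigned to a key minimizes h(key, m), the k smallest give k replicas, and adding or removing a machine reassigns only the keys it wins or loses.

```python
import hashlib

def h(key, machine):
    """Per-(key, machine) hash; any well-mixed hash works."""
    return int(hashlib.sha1(f"{key}|{machine}".encode()).hexdigest(), 16)

def assign(key, machines, k=1):
    """m(key, k): the k machines with the smallest h(key, m)."""
    return sorted(machines, key=lambda m: h(key, m))[:k]

machines = [f"server-{i}" for i in range(8)]
print(assign("user:42", machines))          # primary machine for this key
print(assign("user:42", machines, k=3))     # primary plus 2 replicas
```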

slide-30
SLIDE 30

Distributed Hash Table

  • Fixing the O(m) lookup
    – Assign machines to a ring via hash h(m)
    – Assign keys to the ring
    – Pick the machine nearest to the key on the left
  • O(log m) lookup
  • Insert/removal only affects the neighbor
    (however, it is a big problem for that neighbor)
  • Uneven load distribution
    (load depends on segment size)
  • Insert each machine more than once to fix this
    (do not use messy Cassandra-style manual balancing)
  • For k-fold replication, simply pick the k leftmost machines (skipping duplicates)

[Diagram: ring of N keys]
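A sketch of the ring construction (hypothetical class and parameter names): machines are hashed onto the ring several times (virtual nodes) to even out the load, and a key's owners are found by walking the ring from the key's position, skipping duplicate virtual nodes for replication. The walk here goes in the direction of increasing hash rather than "to the left"; that is just a direction convention.

```python
import bisect
import hashlib

def ring_hash(s):
    return int(hashlib.sha1(s.encode()).hexdigest(), 16) % (2 ** 32)

class HashRing:
    """Consistent hashing ring with virtual nodes, as on slide 30."""
    def __init__(self, machines, vnodes=50):
        self.points = sorted(
            (ring_hash(f"{m}#{v}"), m) for m in machines for v in range(vnodes)
        )
        self.keys = [p for p, _ in self.points]

    def lookup(self, key, k=1):
        """Walk the ring from the key's position; return k distinct machines."""
        i = bisect.bisect(self.keys, ring_hash(key))   # O(log m) lookup
        owners = []
        while len(owners) < k:
            m = self.points[i % len(self.points)][1]
            if m not in owners:            # skip duplicate virtual nodes
                owners.append(m)
            i += 1
        return owners

ring = HashRing([f"m{i}" for i in range(16)])
print(ring.lookup("doc:12345"))        # owner machine
print(ring.lookup("doc:12345", k=3))   # owner plus 2 replicas
```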


slide-35
SLIDE 35

Exact Synchronization


slide-36
SLIDE 36

Motivation: Latent Variable Models

[Diagram: data, local state, global state]

slide-37
SLIDE 37

Distribution

[Diagram: data, local state, global state copy]

slide-38
SLIDE 38

Preserving the polytope

  • Delayed count updates
    – Collapsed representation for exponential families
      (bad things can happen otherwise: negative counts, indefinite covariances)
    – Need to keep track of the aggregate state of the random variables
  • Exchangeable random process
    – See also Church by Mansinghka, Tenenbaum, Roy etc.
    – Need to maintain a statistic of the aggregate

p(X | μ_0, m_0) = ∫ dθ p(θ | μ_0, m_0) ∏_{i=1}^m p(x_i | θ) = p( Σ_{i=1}^m φ(x_i) | μ_0, m_0 )

The aggregated sufficient statistics form an Abelian group. Delays are OK. Approximation is not!
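A tiny illustration, with made-up counts, of why the Abelian group structure matters: collapsed sufficient statistics are additive, so deltas from different workers can be merged in any order and with any delay and the aggregate stays exact.

```python
from collections import Counter
import random

# Sufficient statistics for a collapsed model are additive counts, so
# (Counter, +) gives an Abelian group action on the aggregate: the order
# and delay of delta application do not change the final state.
worker_deltas = [
    Counter({("topic", 0): 3, ("topic", 1): 1}),
    Counter({("topic", 1): 2}),
    Counter({("topic", 0): -1, ("topic", 2): 4}),   # deltas may be negative
]

def aggregate(deltas):
    total = Counter()
    for d in deltas:
        total.update(d)                  # additive, commutative merge
    return total

shuffled = worker_deltas[:]
random.shuffle(shuffled)
assert aggregate(worker_deltas) == aggregate(shuffled)
```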

slide-39
SLIDE 39

Example: User Profiling

[Diagram: vanilla LDA vs. user profiling; data, local state, global state]


slide-41
SLIDE 41

Distribution

[Diagram: global state with replicas at the rack and cluster level]


slide-43
SLIDE 43

Ahmed et al., 2012

Synchronization

  • Child updates local state
    – Start with a common state
    – Child stores old and new state
    – Parent keeps the global state
  • Transmit differences asynchronously
    – Inverse element for the difference
    – Abelian group for commutativity (sum, log-sum, cyclic group, exponential families)

global to local:  x ← x + (x_global − x_old);  x_old ← x_global
local to global:  δ ← x − x_old;  x_old ← x;  x_global ← x_global + δ
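The two update rules above as a Python sketch (the dict standing in for the master and the class around it are hypothetical plumbing; the rules themselves follow the slide):

```python
class SyncedValue:
    """One locally replicated parameter with the slide's two rules."""
    def __init__(self, x=0.0):
        self.x = x          # current local value
        self.x_old = x      # value at the last synchronization

    def local_to_global(self, server, key):
        delta = self.x - self.x_old                    # delta <- x - x_old
        self.x_old = self.x                            # x_old <- x
        server[key] = server.get(key, 0.0) + delta     # x_global <- x_global + delta

    def global_to_local(self, server, key):
        x_global = server.get(key, 0.0)
        self.x += x_global - self.x_old                # x <- x + (x_global - x_old)
        self.x_old = x_global                          # x_old <- x_global

server = {}                                  # stand-in for the master's store
a, b = SyncedValue(), SyncedValue()
a.x += 2.0; b.x += 3.0                       # independent local work
a.local_to_global(server, "w"); b.local_to_global(server, "w")
a.global_to_local(server, "w"); b.global_to_local(server, "w")
assert a.x == b.x == server["w"] == 5.0      # both updates are reconciled
```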

slide-44
SLIDE 44

Ahmed et al., 2012

Synchronization


  • Naive approach (dumb master)
    – Global state is only a (key, value) store
    – The local node needs to lock/read/write/unlock the master
    – Needs 4 TCP/IP round trips; latency bound
  • Better solution (smart master)
    – Client sends a message to the master / it is queued / the master incorporates it
    – Master sends a message to the client / it is queued / the client incorporates it
    – Bandwidth bound (>10x speedup in practice)

slide-45
SLIDE 45

Weak scaling (more data = more machines)

slide-47
SLIDE 47

Exact Synchronization in a Nutshell

  • Each machine computes local updates
  • Inference relative to local (stale) version of the model
  • Send local changes to global
  • Receive global changes at local client
  • Only send / receive aggregate changes
  • Easy change relative to single machine implementation
  • Not fault tolerant (need to restart system if single machine fails)
  • Delays may destroy convergence properties
slide-48
SLIDE 48

Approximate Synchronization & Dual Decomposition


slide-49
SLIDE 49

Motivation: Distributed Optimization

  • Distributed optimization problem: f(x) = Σ_i f_i(x)
  • Decompose over p processors
  • Make progress on the subproblems f_i(x_i), one per processor
  • Exchange updates with the parameter server (difference & state)
  • Retrieve related state from the parameter server and update locally
    (do not assume that the x_i yield an orthogonal decomposition)
  • Difference to the exact parameter server: stochastic gradient descent updates

slide-50
SLIDE 50

Properties

  • Fault tolerant
    – Restart server(s) from the last backup state
    – No need to restart the entire system when individual machines fail
  • Works well for deep belief networks
    – See the Google Brain project
      Paper by Jeffrey Dean, Greg S. Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc'Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, Andrew Y. Ng
  • Fails to converge for graph factorization (tried and tested ...)
    – Initial convergence, but parameters diverge subsequently
    – Caused by the delay in parameter updates (local updates overcompensate for changes to parameter values)

slide-51
SLIDE 51

Dual Decomposition to the rescue

  • Optimization problem
  • Lagrangian relaxation
    update x, z and the Lagrange multipliers; include the equality constraint if needed
  • This explicitly deals with different values in the local state and the global consensus

minimize_x Σ_i f_i(x)

or equivalently

minimize_{x_i, z} Σ_i f_i(x_i)   subject to x_i = z

L(x_i, z, λ) = Σ_i f_i(x_i) + λ Σ_i ‖x_i − z‖²

slide-52
SLIDE 52

Synchronous Variant (MapReduce)

  • Lagrangian relaxation
    L(x_i, z, λ) = Σ_i f_i(x_i) + λ Σ_i ‖x_i − z‖²
  • Local step (Map step)
    Solve the local minimization problem on each machine:
    minimize_{x_i} f_i(x_i) + λ ‖x_i − z‖²
  • Global step (Reduce step + intermediate)
    – Aggregate the local solutions and average to compute the new value of z
    – Update the Lagrange multiplier
    – Rebroadcast to the local clients (a toy sketch of one round follows below)
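One synchronous round of this scheme on a toy problem (a sketch: scalar quadratics f_i(x) = (x − a_i)², closed-form local minimizers, and a made-up multiplier schedule standing in for the slide's "update the Lagrange multiplier" step):

```python
import numpy as np

# Toy synchronous dual-decomposition rounds: each "machine" holds
# f_i(x) = (x - a_i)^2; the Map step minimizes f_i(x_i) + lam * (x_i - z)^2
# in closed form, the Reduce step averages the x_i into the consensus z.
a = np.array([1.0, 4.0, 7.0, 10.0])     # hypothetical per-machine targets
z, lam = 0.0, 1.0

for _ in range(50):
    x = (a + lam * z) / (1.0 + lam)     # local (Map) step, one value per machine
    z = x.mean()                        # global (Reduce) step: new consensus
    lam *= 1.05                         # slowly tighten the consensus penalty

print(z, x)     # z approaches the minimizer of sum_i f_i (here the mean of a)
```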

slide-53
SLIDE 53

Asynchronous Variant

  • Lagrangian relaxation
    L(x_i, z, λ) = Σ_i f_i(x_i) + λ Σ_i ‖x_i − z‖²
  • Local step (continuous)
    Solve the local minimization problems (e.g. via SGD) and send updates to the server:
    minimize_{x_i} f_i(x_i) + λ ‖x_i − z‖²
  • Global step (continuous)
    – Aggregate local solutions asynchronously from the clients
    – Update the Lagrange multiplier
    – Rebroadcast the global state to the local clients

Ahmed, Shervashidze, Smola, 2013

slide-54
SLIDE 54

Convergence (synchronous vs. asynchronous)

[Figure: objective function vs. time in minutes (linear scale); full dataset, 200M nodes; asynchronous vs. synchronous optimization]

slide-55
SLIDE 55

Convergence (synchronous vs. asynchronous)

[Figure: objective function vs. time in minutes (log scale); full dataset, 200M nodes; asynchronous vs. synchronous optimization]

slide-56
SLIDE 56

Acceleration (single CPU vs. 32 machines)

[Figure: average test error vs. time in minutes (log scale); 32M nodes; multi-machine asynchronous (32 machines) vs. single machine]

slide-57
SLIDE 57

Weak scaling (more data = more machines)

[Figure: time per epoch (minutes) vs. number of nodes in millions; multi-machine asynchronous vs. single machine; number of machines scaled linearly: 4, 8, 16, ..., 128]

slide-58
SLIDE 58

Even more parameter server variants

  • Graphlab (PowerGraph decomposition and updates)
  • Facebook parameter server for EP updates
  • Google brain project
  • Graph factorization

... your algorithm here ...

  • From January 2013 on at CMU

Open source version (ping me if you want to contribute)

slide-59
SLIDE 59

The road ahead

slide-60
SLIDE 60

Fault tolerant parameter server

[Diagram: general communication pattern]

slide-61
SLIDE 61

Fault tolerant parameter server

[Diagram: sources for a given key]

slide-62
SLIDE 62
Fault tolerant parameter server

  • Elect a master server for a given key x
  • Replicas serve as hot failover backups (consistent hashing)
  • Need a replication and synchronization protocol (a sketch follows below)

[Diagram: sources for a given key; servers 1, 2, 3]
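A sketch of master election with hot failover (hypothetical helper names), reusing the argmin-hash assignment from slide 29: the k machines ranked for a key are its master plus replicas, and if the master dies the next live replica takes over.

```python
import hashlib

def ranked_servers(key, machines, k=3):
    """k machines responsible for `key`: position 0 is the elected master,
    the rest are hot-failover replicas (argmin-hash style assignment)."""
    def score(m):
        return int(hashlib.sha1(f"{key}|{m}".encode()).hexdigest(), 16)
    return sorted(machines, key=score)[:k]

def master_for(key, machines, alive, k=3):
    """The first ranked machine that is still alive acts as the master."""
    for m in ranked_servers(key, machines, k):
        if m in alive:
            return m
    raise RuntimeError("all replicas for this key have failed")

machines = [f"ps{i}" for i in range(8)]
alive = set(machines)
m0 = master_for("w[1234]", machines, alive)
alive.discard(m0)                          # simulate master failure
print(m0, "->", master_for("w[1234]", machines, alive))   # failover replica
```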

slide-63
SLIDE 63

Fault tolerant parameter server

[Diagram: replicate from the master, small overhead O(r / (r + c)); disk]

slide-66
SLIDE 66
Fault tolerant parameter server

  • Plug in existing solvers on the nodes
    • LibLinear / Dual Cached Loops / VW
    • Stan(?) / GraphLab backing (handles approximate sync, not as a first class primitive)
  • Sketching server (minimal changes to the insert/query operations needed)
  • Second order solve on the subgradients (not just ADMM)
  • Difference to traditional (k,v) servers is smart updates on the server side

slide-67
SLIDE 67

Problem distribution - a teaser

(Parsa: Li, Andersen, Smola)

[Figure (a): text datasets, improvement (%) vs. running time (sec) for Parsa, Zoltan, PaToH]
[Figure (b): social networks, improvement (%) vs. running time (sec) for Parsa, Zoltan, PaToH, METIS, PowerGraph]

slide-68
SLIDE 68
  • Multicore
    • stochastic gradient descent
      (Hogwild, Slow learners are fast, ...)
  • Multiple machines
    • exact synchronization
    • approximate synchronization & dual decomposition
