Scaling with the Parameter Server: Variations on a Theme



slide-1
SLIDE 1

Alexander Smola Google Research & CMU alex.smola.org

Scaling with the Parameter Server: Variations on a Theme

slide-2
SLIDE 2

Thanks

Joey Gonzalez, Shravan Narayanamurthy, Markus Weimer, Sergiy Matyusevich, Amr Ahmed, Nino Shervashidze

slide-3
SLIDE 3

Practical Distributed Inference

  • Multicore
    • asynchronous optimization with shared state
  • Multiple machines
    • exact synchronization (Yahoo LDA)
    • approximate synchronization
    • dual decomposition

slide-4
SLIDE 4

Motivation: Data & Systems

slide-5
SLIDE 5

Commodity Hardware

  • High Performance Computing
    Very reliable, custom built, expensive
  • Consumer hardware
    Cheap, efficient, easy to replicate; not very reliable, deal with it!

slide-6
SLIDE 6

The Joys of Real Hardware

Slide courtesy of Jeff Dean
http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//people/jeff/stanford-295-talk.pdf

slide-7
SLIDE 7

Scaling problems

  • Data (lower bounds)
    – >10 billion documents (webpages, e-mails, advertisements, tweets)
    – >100 million users on Google, Facebook, Twitter, Yahoo, Hotmail
    – >1 million days of video on YouTube
    – >10 billion images on Facebook
  • Processing capability of a single machine is about 1 TB/hour, but we have much more data
  • Parameter space for models is too big for a single machine (but not by much)
    Personalize content for many millions of users
  • Need to process data on many cores and many machines simultaneously
slide-8
SLIDE 8

Some Problems

  • Good old-fashioned supervised learning
    (classification, regression, tagging, entity extraction, ...)
  • Graph factorization
    (latent variable estimation, social recommendation, discovery)
  • Structure inference
    (clustering, topics, hierarchies, DAGs, whatever else your NP Bayes friends have)
  • Example use case: combine information from generic webpages, databases, human generated data, and semistructured tables into knowledge about entities.

How do we solve it at scale? ... this talk

slide-11
SLIDE 11

Multicore parallelism


slide-12
SLIDE 12

Multicore Parallelism

  • Many processor cores
    – Decompose into separate tasks
    – Good Java/C++ tool support
  • Shared memory
    – Exact estimates: requires locking of neighbors (see e.g. GraphLab). Good if the problem can be decomposed cleanly (e.g. Gibbs sampling in a large model)
    – Exact updates with delayed incorporation: requires locking of state. Good if a delayed update is of little consequence (e.g. Yahoo LDA, Yahoo online)
    – Hogwild updates: no locking whatsoever, requires atomic state. Good if collision probability is low

[Diagram: data source, loss gradient, x]
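To make the Hogwild bullet concrete, here is a minimal Python sketch (not code from the talk): several threads apply sparse hinge-loss SGD updates to a shared weight vector with no locking, relying on collisions between threads being rare. The toy data stream and all names are made up for illustration.

```python
import threading
import numpy as np

# Hogwild-style sketch: threads update a shared weight vector in place,
# without locks. This only makes sense when updates are sparse and
# collisions are rare, as stated on slide 12.
DIM, STEPS, LR = 1000, 10000, 0.01
w = np.zeros(DIM)                          # shared state, updated in place

def worker(seed):
    rng = np.random.default_rng(seed)
    for _ in range(STEPS):
        idx = rng.choice(DIM, size=5, replace=False)   # sparse toy example
        x = rng.normal(size=5)
        y = rng.choice([-1.0, 1.0])
        margin = y * float(x @ w[idx])
        if margin < 1.0:                   # hinge-loss subgradient step
            w[idx] += LR * y * x           # unsynchronized ("Hogwild") update

threads = [threading.Thread(target=worker, args=(s,)) for s in range(4)]
for t in threads: t.start()
for t in threads: t.join()
```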

slide-13
SLIDE 13
Stochastic Gradient Descent

  • Online template: minimize_w Σ_i f_i(w)
  • Delayed updates
    (round robin for data parallelism, aggregation tree for parameter parallelism)

[Diagram: online SGD pipeline (data source, loss gradient, x); data-parallel and parameter-parallel variants with data parts and an updater]

Input: scalar σ > 0 and delay τ
for t = τ + 1 to T + τ do
  Obtain f_t and incur loss f_t(w_t)
  Compute g_t := ∇f_t(w_t) and set η_t = 1 / (σ (t − τ))
  Update w_{t+1} = w_t − η_t g_{t−τ}
end for
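The template above, transcribed into a runnable Python sketch; the quadratic loss, gradient, and data stream are stand-ins, and in this sketch the first τ steps only accumulate gradients rather than reaching back before the loop.

```python
import numpy as np

def delayed_sgd(grad, data, sigma=1.0, tau=4, T=1000, dim=10):
    """Delayed SGD template from slide 13: at step t, apply the gradient
    computed tau steps earlier with learning rate 1 / (sigma * (t - tau))."""
    w = np.zeros(dim)
    pending = []                          # gradients waiting to be applied
    for t in range(tau + 1, T + tau + 1):
        x, y = data(t)                    # obtain f_t (one example)
        pending.append(grad(w, x, y))     # compute g_t at the current w
        if len(pending) > tau:            # g computed tau steps ago is ready
            eta = 1.0 / (sigma * (t - tau))
            w -= eta * pending.pop(0)     # w_{t+1} = w_t - eta_t g_{t-tau}
    return w

# Hypothetical plumbing: a least-squares stream with target 0.
rng = np.random.default_rng(0)
data = lambda t: (rng.normal(size=10), 0.0)
grad = lambda w, x, y: 2 * (w @ x - y) * x
w = delayed_sgd(grad, data)
```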

slide-14
SLIDE 14

Guarantees

  • Worst case guarantee (Zinkevich, Langford, Smola, 2010)
    SGD with delay τ on τ processors is no worse than sequential SGD:
    E[R[X]] ≤ 4 R L √(τ T)
  • Lower bound is tight
    Proof: send the same instance τ times
  • Better bounds with iid data
    – Penalty is the covariance in the features
    – Vanishing penalty for smooth f(w):
      E[R[X]] ≤ (28.3 R² H + (2/3) R L + (4/3) R² H log T) τ² + (8/3) R L √T
  • Works even better if we don't lock between updates
    (Hogwild; Recht, Re, Wright, 2011)

slide-15
SLIDE 15

Speedup on TREC

[Figure: speedup in % vs. number of cores]

slide-16
SLIDE 16

Smola and Narayanamurthy, 2010

LDA Multicore Inference

  • Decouple multithreaded sampling and updating (almost)
    avoids stalling for locks in the sampler
  • Joint state table
    – much less memory required
    – samplers synchronized (10 docs vs. millions delay)
  • Hyperparameter update via stochastic gradient descent
  • No need to keep documents in memory (streaming)

[Architecture diagram: tokens and topics from file, samplers (Intel Threading Building Blocks), combiner, count updater, joint state table, diagnostics & optimization, output to file]

slide-17
SLIDE 17

Smola and Narayanamurthy, 2010

LDA Multicore Inference


  • Sequential collapsed Gibbs sampler, separate state table
    Mallet (Mimno et al., 2008): slow mixing, high memory load, many iterations
  • Sequential collapsed Gibbs sampler (parallel)
    Yahoo LDA (Smola and Narayanamurthy, 2010): fast mixing, many iterations
  • Sequential stochastic gradient descent (variational, single logical thread)
    VW LDA (Hoffman et al., 2011): fast convergence, few iterations, dense
  • Sequential stochastic sampling gradient descent (only partly variational)
    Hoffman, Mimno, Blei, 2012: fast convergence, quite sparse, single logical thread

slide-18
SLIDE 18

General strategy

  • Shared state space
  • Delayed updates from cores
  • Proof technique is usually to show that the problem hasn't changed too much during the delay (in terms of interactions)
  • More work
    – Macready, Siapas and Kauffman, 1995: Criticality and Parallelism in Combinatorial Optimization
    – Low, Gonzalez, Kyrola, Bickson, Guestrin and Hellerstein, 2010: Shotgun for l1

slide-19
SLIDE 19

This was easy ... what if we need many machines?

slide-22
SLIDE 22

Parameter Server: 30,000 ft view

slide-23
SLIDE 23

diagram from Ramakrishnan, Sakrejda, Canon, DoE 2011

Why (not) MapReduce?

  • Map(key, value)
    process instances on a subset of the data / emit aggregate statistics
  • Reduce(key, value)
    aggregate over the whole dataset, update parameters
  • This is a parameter exchange mechanism (simply repeat MapReduce; see the sketch below)
    good if you can make your algorithm fit (e.g. distributed convex online solvers)
  • Can be slow to propagate updates between machines & slow to converge
    (e.g. a really bad idea in clustering: each machine proposes a different clustering)
    Hadoop MapReduce loses the state between mapper iterations
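A toy illustration of the "simply repeat MapReduce" parameter-exchange pattern (a sketch, not an actual Hadoop job): mappers run local gradient steps on their shard, the reducer averages the resulting parameter vectors, and the driver repeats. The shards and the least-squares objective are made up.

```python
import numpy as np

def map_step(w, shard, lr=0.1):
    """Mapper: a few local SGD steps on one data shard, emit parameters."""
    w = w.copy()
    for x, y in shard:
        w -= lr * 2 * (w @ x - y) * x     # least-squares gradient step
    return w

def reduce_step(partials):
    """Reducer: average the parameter vectors produced by all mappers."""
    return np.mean(partials, axis=0)

# Hypothetical driver: repeat MapReduce for a fixed number of rounds.
rng = np.random.default_rng(0)
shards = [[(rng.normal(size=5), 1.0) for _ in range(100)] for _ in range(8)]
w = np.zeros(5)
for _ in range(20):                       # one MapReduce round per iteration
    w = reduce_step([map_step(w, s) for s in shards])
```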

slide-24
SLIDE 24

General parallel algorithm template

  • Clients have a local copy of the parameters to be estimated
  • P2P is infeasible since it needs O(n²) connections
    (see Asuncion et al. for an amazing tour de force)
  • Synchronize* with the parameter server (a minimal sketch follows below)
    – Reconciliation protocol: average parameters, lock variables, turnstile counter
    – Synchronization schedule: asynchronous, synchronous, episodic
    – Load distribution algorithm: single server, uniform distribution, fault tolerance, recovery

[Diagram: client and server]
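A minimal in-process sketch of this client/server template (hypothetical API, additive reconciliation only): each client keeps a local copy, pushes the delta since its last synchronization, and pulls the reconciled global value.

```python
class ParameterServer:
    """Toy in-process server: keyed parameters, additive reconciliation."""
    def __init__(self):
        self.store = {}
    def push(self, key, delta):
        self.store[key] = self.store.get(key, 0.0) + delta
    def pull(self, key):
        return self.store.get(key, 0.0)

class Client:
    """Client with a local (possibly stale) copy of one parameter."""
    def __init__(self, server, key):
        self.server, self.key = server, key
        self.local = server.pull(key)
        self.synced = self.local          # value at last synchronization
    def update(self, delta):
        self.local += delta               # local work between syncs
    def synchronize(self):
        self.server.push(self.key, self.local - self.synced)   # send delta
        self.local = self.server.pull(self.key)                 # receive global
        self.synced = self.local

server = ParameterServer()
clients = [Client(server, "w") for _ in range(3)]
for c in clients:
    c.update(1.0)
    c.synchronize()
assert server.pull("w") == 3.0
```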

slide-25
SLIDE 25

General parallel algorithm template

  • A client syncs to many masters; a master serves many clients
  • A complete graph is bad for the network; use randomized messaging to fix it

[Diagram: clients and servers]

slide-26
SLIDE 26

Desiderata

  • Variable and load distribution
    • Large number of objects (a priori unknown)
    • Large pool of machines (often faulty)
  • Assign objects to machines such that
    • an object goes to the same machine (if possible)
    • machines can be added / fail dynamically
  • Consistent hashing (elements, sets, proportional)
    • symmetric, dynamically scalable, fault tolerant
    • for large scale inference
    • for real time data sketches

slide-27
SLIDE 27

Karger et al. 1999, Ahmed et al. 2011

Random Caching Trees

  • Cache / synchronize an object
  • Uneven load distribution; must not generate a hotspot
  • For a given key, pick a random order of machines
  • Map the order onto a tree / star via BFS ordering (see the sketch below)
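A sketch of the per-key ordering (helper names are hypothetical): every key gets its own deterministic pseudorandom permutation of the machine pool, and the permutation is laid out as a tree via BFS so that synchronization traffic for a hot key is spread over many machines instead of one.

```python
import hashlib

def machine_order(key, machines):
    """Deterministic pseudorandom order of machines for one key:
    sort machines by a per-(key, machine) hash, as on slide 27."""
    def score(m):
        return hashlib.md5(f"{key}:{m}".encode()).hexdigest()
    return sorted(machines, key=score)

def caching_tree(key, machines, fanout=2):
    """Map the per-key order onto a tree via BFS: element i's parent is
    element (i - 1) // fanout; the root caches/serves the object."""
    order = machine_order(key, machines)
    return [(order[(i - 1) // fanout], m) for i, m in enumerate(order) if i > 0]

machines = [f"m{i}" for i in range(7)]
print(machine_order("topic_counts", machines)[0])    # root for this key
print(caching_tree("topic_counts", machines))        # (parent, child) edges
```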

slide-28
SLIDE 28

Karger et al. 1999, Ahmed et al. 2011

Random Caching Trees


(Karger et al. 1999 - ‘Akamai’ paper)

slide-29
SLIDE 29
Argmin Hash

  • Consistent hashing
    – Uniform distribution over the machine pool M
    – Fully determined by the hash function h; no need to ask a master
    – If we add/remove a machine m', all but O(1/m) of the keys remain in place
  • Consistent hashing with k replications
    – If we add/remove a machine, only O(k/m) of the keys need reassigning (also self repair)
  • Cost to assign is O(m); this can be expensive for 1000 servers

m(key) = argmin_{m ∈ M} h(key, m)
Pr{m(key) = m_0} = 1/m
m(key, k) = the k machines in M with the smallest h(key, m)
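The argmin-hash rule written out as a Python sketch (this is the rendezvous / highest-random-weight construction): the machine assigned to a key minimizes h(key, m), the k smallest give k replicas, and adding or removing a machine reassigns only the keys it wins or loses.

```python
import hashlib

def h(key, machine):
    """Per-(key, machine) hash; any well-mixed hash works."""
    return int(hashlib.sha1(f"{key}|{machine}".encode()).hexdigest(), 16)

def assign(key, machines, k=1):
    """m(key, k): the k machines with the smallest h(key, m)."""
    return sorted(machines, key=lambda m: h(key, m))[:k]

machines = [f"server-{i}" for i in range(8)]
print(assign("user:42", machines))          # primary machine for this key
print(assign("user:42", machines, k=3))     # primary plus 2 replicas
```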

slide-30
SLIDE 30

Distributed Hash Table

  • Fixing the O(m) lookup
    – Assign machines to a ring via hash h(m)
    – Assign keys to the ring
    – Pick the machine nearest to the key on the left
  • O(log m) lookup
  • Insert/removal only affects the neighbor
    (however, it is a big problem for that neighbor)
  • Uneven load distribution
    (load depends on segment size)
  • Insert each machine more than once to fix this
    (do not use messy Cassandra-style manual balancing)
  • For k-fold replication, simply pick the k leftmost machines (skipping duplicates)

[Diagram: ring of N keys]
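A sketch of the ring construction (hypothetical class and parameter names): machines are hashed onto the ring several times (virtual nodes) to even out the load, and a key's owners are found by walking the ring from the key's position, skipping duplicate virtual nodes for replication. The walk here goes in the direction of increasing hash rather than "to the left"; that is just a direction convention.

```python
import bisect
import hashlib

def ring_hash(s):
    return int(hashlib.sha1(s.encode()).hexdigest(), 16) % (2 ** 32)

class HashRing:
    """Consistent hashing ring with virtual nodes, as on slide 30."""
    def __init__(self, machines, vnodes=50):
        self.points = sorted(
            (ring_hash(f"{m}#{v}"), m) for m in machines for v in range(vnodes)
        )
        self.keys = [p for p, _ in self.points]

    def lookup(self, key, k=1):
        """Walk the ring from the key's position; return k distinct machines."""
        i = bisect.bisect(self.keys, ring_hash(key))   # O(log m) lookup
        owners = []
        while len(owners) < k:
            m = self.points[i % len(self.points)][1]
            if m not in owners:            # skip duplicate virtual nodes
                owners.append(m)
            i += 1
        return owners

ring = HashRing([f"m{i}" for i in range(16)])
print(ring.lookup("doc:12345"))        # owner machine
print(ring.lookup("doc:12345", k=3))   # owner plus 2 replicas
```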


slide-35
SLIDE 35

Exact Synchronization


slide-36
SLIDE 36

Motivation: Latent Variable Models

[Diagram: data, local state, global state]

slide-37
SLIDE 37

Distribution

[Diagram: data, local state, global state copy]

slide-38
SLIDE 38

Preserving the polytope

  • Delayed count updates
    – Collapsed representation for exponential families
      (bad things can happen otherwise: negative counts, indefinite covariances)
    – Need to keep track of the aggregate state of the random variables
  • Exchangeable random process
    – See also Church by Mansinghka, Tenenbaum, Roy etc.
    – Need to maintain a statistic of the aggregate

p(X | μ_0, m_0) = ∫ dθ p(θ | μ_0, m_0) ∏_{i=1}^m p(x_i | θ) = p( Σ_{i=1}^m φ(x_i) | μ_0, m_0 )

The aggregated sufficient statistics form an Abelian group. Delays are OK. Approximation is not!
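A tiny illustration, with made-up counts, of why the Abelian group structure matters: collapsed sufficient statistics are additive, so deltas from different workers can be merged in any order and with any delay and the aggregate stays exact.

```python
from collections import Counter
import random

# Sufficient statistics for a collapsed model are additive counts, so
# (Counter, +) gives an Abelian group action on the aggregate: the order
# and delay of delta application do not change the final state.
worker_deltas = [
    Counter({("topic", 0): 3, ("topic", 1): 1}),
    Counter({("topic", 1): 2}),
    Counter({("topic", 0): -1, ("topic", 2): 4}),   # deltas may be negative
]

def aggregate(deltas):
    total = Counter()
    for d in deltas:
        total.update(d)                  # additive, commutative merge
    return total

shuffled = worker_deltas[:]
random.shuffle(shuffled)
assert aggregate(worker_deltas) == aggregate(shuffled)
```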

slide-39
SLIDE 39

Example: User Profiling

[Diagram: vanilla LDA vs. user profiling; data, local state, global state]


slide-41
SLIDE 41

Distribution

[Diagram: global state with replicas at the rack and cluster level]


slide-43
SLIDE 43

Ahmed et al., 2012

Synchronization

  • Child updates local state
    – Start with a common state
    – Child stores old and new state
    – Parent keeps the global state
  • Transmit differences asynchronously
    – Inverse element for the difference
    – Abelian group for commutativity (sum, log-sum, cyclic group, exponential families)

global to local:  x ← x + (x_global − x_old);  x_old ← x_global
local to global:  δ ← x − x_old;  x_old ← x;  x_global ← x_global + δ
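The two update rules above as a Python sketch (the dict standing in for the master and the class around it are hypothetical plumbing; the rules themselves follow the slide):

```python
class SyncedValue:
    """One locally replicated parameter with the slide's two rules."""
    def __init__(self, x=0.0):
        self.x = x          # current local value
        self.x_old = x      # value at the last synchronization

    def local_to_global(self, server, key):
        delta = self.x - self.x_old                    # delta <- x - x_old
        self.x_old = self.x                            # x_old <- x
        server[key] = server.get(key, 0.0) + delta     # x_global <- x_global + delta

    def global_to_local(self, server, key):
        x_global = server.get(key, 0.0)
        self.x += x_global - self.x_old                # x <- x + (x_global - x_old)
        self.x_old = x_global                          # x_old <- x_global

server = {}                                  # stand-in for the master's store
a, b = SyncedValue(), SyncedValue()
a.x += 2.0; b.x += 3.0                       # independent local work
a.local_to_global(server, "w"); b.local_to_global(server, "w")
a.global_to_local(server, "w"); b.global_to_local(server, "w")
assert a.x == b.x == server["w"] == 5.0      # both updates are reconciled
```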

slide-44
SLIDE 44

Ahmed et al., 2012

Synchronization


  • Naive approach (dumb master)
    – Global state is only a (key, value) store
    – The local node needs to lock/read/write/unlock the master
    – Needs 4 TCP/IP round trips; latency bound
  • Better solution (smart master)
    – Client sends a message to the master / it is queued / the master incorporates it
    – Master sends a message to the client / it is queued / the client incorporates it
    – Bandwidth bound (>10x speedup in practice)

slide-45
SLIDE 45

Weak scaling (more data = more machines)

slide-47
SLIDE 47

Exact Synchronization in a Nutshell

  • Each machine computes local updates
  • Inference relative to local (stale) version of the model
  • Send local changes to global
  • Receive global changes at local client
  • Only send / receive aggregate changes
  • Easy change relative to single machine implementation
  • Not fault tolerant (need to restart system if single machine fails)
  • Delays may destroy convergence properties
slide-48
SLIDE 48

Approximate Synchronization & Dual Decomposition


slide-49
SLIDE 49

Motivation: Distributed Optimization

  • Distributed optimization problem: f(x) = Σ_i f_i(x)
  • Decompose over p processors
  • Make progress on the subproblems f_i(x_i), one per processor
  • Exchange updates with the parameter server (difference & state)
  • Retrieve related state from the parameter server and update locally
    (do not assume that the x_i yield an orthogonal decomposition)
  • Difference to the exact parameter server: stochastic gradient descent updates

slide-50
SLIDE 50

Properties

  • Fault tolerant
    – Restart server(s) from the last backup state
    – No need to restart the entire system when individual machines fail
  • Works well for deep belief networks
    – See the Google Brain project
      Paper by Jeffrey Dean, Greg S. Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc'Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, Andrew Y. Ng
  • Fails to converge for graph factorization (tried and tested ...)
    – Initial convergence, but parameters diverge subsequently
    – Caused by the delay in parameter updates (local updates overcompensate for changes to parameter values)

slide-51
SLIDE 51

Dual Decomposition to the rescue

  • Optimization problem
  • Lagrangian relaxation
    update x, z and the Lagrange multipliers; include the equality constraint if needed
  • This explicitly deals with different values in the local state and the global consensus

minimize_x Σ_i f_i(x)

or equivalently

minimize_{x_i, z} Σ_i f_i(x_i)   subject to x_i = z

L(x_i, z, λ) = Σ_i f_i(x_i) + λ Σ_i ‖x_i − z‖²

slide-52
SLIDE 52

Synchronous Variant (MapReduce)

  • Lagrangian relaxation
    L(x_i, z, λ) = Σ_i f_i(x_i) + λ Σ_i ‖x_i − z‖²
  • Local step (Map step)
    Solve the local minimization problem on each machine:
    minimize_{x_i} f_i(x_i) + λ ‖x_i − z‖²
  • Global step (Reduce step + intermediate)
    – Aggregate the local solutions and average to compute the new value of z
    – Update the Lagrange multiplier
    – Rebroadcast to the local clients (a toy sketch of one round follows below)
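One synchronous round of this scheme on a toy problem (a sketch: scalar quadratics f_i(x) = (x − a_i)², closed-form local minimizers, and a made-up multiplier schedule standing in for the slide's "update the Lagrange multiplier" step):

```python
import numpy as np

# Toy synchronous dual-decomposition rounds: each "machine" holds
# f_i(x) = (x - a_i)^2; the Map step minimizes f_i(x_i) + lam * (x_i - z)^2
# in closed form, the Reduce step averages the x_i into the consensus z.
a = np.array([1.0, 4.0, 7.0, 10.0])     # hypothetical per-machine targets
z, lam = 0.0, 1.0

for _ in range(50):
    x = (a + lam * z) / (1.0 + lam)     # local (Map) step, one value per machine
    z = x.mean()                        # global (Reduce) step: new consensus
    lam *= 1.05                         # slowly tighten the consensus penalty

print(z, x)     # z approaches the minimizer of sum_i f_i (here the mean of a)
```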

slide-53
SLIDE 53

Asynchronous Variant

  • Lagrangian relaxation
    L(x_i, z, λ) = Σ_i f_i(x_i) + λ Σ_i ‖x_i − z‖²
  • Local step (continuous)
    Solve the local minimization problems (e.g. via SGD) and send updates to the server:
    minimize_{x_i} f_i(x_i) + λ ‖x_i − z‖²
  • Global step (continuous)
    – Aggregate local solutions asynchronously from the clients
    – Update the Lagrange multiplier
    – Rebroadcast the global state to the local clients

Ahmed, Shervashidze, Smola, 2013

slide-54
SLIDE 54

Convergence (synchronous vs. asynchronous)

[Figure: objective function vs. time in minutes (linear scale); full dataset, 200M nodes; asynchronous vs. synchronous optimization]

slide-55
SLIDE 55

Convergence (synchronous vs. asynchronous)

[Figure: objective function vs. time in minutes (log scale); full dataset, 200M nodes; asynchronous vs. synchronous optimization]

slide-56
SLIDE 56

Acceleration (single CPU vs. 32 machines)

[Figure: average test error vs. time in minutes (log scale); 32M nodes; multi-machine asynchronous (32 machines) vs. single machine]

slide-57
SLIDE 57

Weak scaling (more data = more machines)

[Figure: time per epoch (minutes) vs. number of nodes in millions; multi-machine asynchronous vs. single machine; number of machines scaled linearly: 4, 8, 16, ..., 128]

slide-58
SLIDE 58

Even more parameter server variants

  • Graphlab (PowerGraph decomposition and updates)
  • Facebook parameter server for EP updates
  • Google brain project
  • Graph factorization

... your algorithm here ...

  • From January 2013 on at CMU

Open source version (ping me if you want to contribute)

slide-59
SLIDE 59

The road ahead

slide-60
SLIDE 60

Fault tolerant parameter server

[Diagram: general communication pattern]

slide-61
SLIDE 61

Fault tolerant parameter server

[Diagram: sources for a given key]

slide-62
SLIDE 62
Fault tolerant parameter server

  • Elect a master server for a given key x
  • Replicas serve as hot failover backups (consistent hashing)
  • Need a replication and synchronization protocol (a sketch follows below)

[Diagram: sources for a given key; servers 1, 2, 3]
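A sketch of master election with hot failover (hypothetical helper names), reusing the argmin-hash assignment from slide 29: the k machines ranked for a key are its master plus replicas, and if the master dies the next live replica takes over.

```python
import hashlib

def ranked_servers(key, machines, k=3):
    """k machines responsible for `key`: position 0 is the elected master,
    the rest are hot-failover replicas (argmin-hash style assignment)."""
    def score(m):
        return int(hashlib.sha1(f"{key}|{m}".encode()).hexdigest(), 16)
    return sorted(machines, key=score)[:k]

def master_for(key, machines, alive, k=3):
    """The first ranked machine that is still alive acts as the master."""
    for m in ranked_servers(key, machines, k):
        if m in alive:
            return m
    raise RuntimeError("all replicas for this key have failed")

machines = [f"ps{i}" for i in range(8)]
alive = set(machines)
m0 = master_for("w[1234]", machines, alive)
alive.discard(m0)                          # simulate master failure
print(m0, "->", master_for("w[1234]", machines, alive))   # failover replica
```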

slide-63
SLIDE 63

Fault tolerant parameter server

[Diagram: replicate from the master, small overhead O(r / (r + c)); disk]

slide-66
SLIDE 66
Fault tolerant parameter server

  • Plug in existing solvers on the nodes
    • LibLinear / Dual Cached Loops / VW
    • Stan(?) / GraphLab backing (handles approximate sync, not as a first class primitive)
  • Sketching server (minimal changes to the insert/query operations needed)
  • Second order solve on the subgradients (not just ADMM)
  • Difference to traditional (k,v) servers is smart updates on the server side

slide-67
SLIDE 67

Problem distribution - a teaser

(Parsa: Li, Andersen, Smola)

[Figure (a): text datasets, improvement (%) vs. running time (sec) for Parsa, Zoltan, PaToH]
[Figure (b): social networks, improvement (%) vs. running time (sec) for Parsa, Zoltan, PaToH, METIS, PowerGraph]

slide-68
SLIDE 68
  • Multicore
    • stochastic gradient descent
      (Hogwild, Slow learners are fast, ...)
  • Multiple machines
    • exact synchronization
    • approximate synchronization & dual decomposition
