SLIDE 1

NOMAD: A Distributed Framework for Latent Variable Models

Inderjit S. Dhillon, Department of Computer Science, University of Texas at Austin

Joint work with H.-F. Yu, C.-J. Hsieh, H. Yun, and S.V.N. Vishwanathan

NIPS 2014 Workshop: Distributed Machine Learning and Matrix Computations

SLIDE 2

Outline

Challenges

Matrix Completion
  Stochastic Gradient Method
  Existing Distributed Approaches
  Our Solution: NOMAD-MF

Latent Dirichlet Allocation (LDA)
  Gibbs Sampling
  Existing Distributed Solutions: AD-LDA, Yahoo! LDA
  Our Solution: F+NOMAD-LDA

SLIDE 3

Large-scale Latent Variable Modeling

Latent Variable Models: very useful in many applications

Latent models for recommender systems (e.g., MF)
Topic models for document corpora (e.g., LDA)

Fast growth of data

Almost 2.5 × 10^18 bytes of data added each day
90% of the world's data today was generated in the past two years

SLIDE 4

Challenges

Algorithmic as well as hardware level

Many effective algorithms involve fine-grained iterative computation ⇒ hard to parallelize

Many current parallel approaches:

  bulk synchronization ⇒ wasted CPU power while communicating
  complicated locking mechanisms ⇒ hard to scale to many machines
  asynchronous computation using a parameter server ⇒ not serializable, danger of stale parameters

Proposed NOMAD Framework

  access graph analysis to exploit parallelism
  asynchronous computation, non-blocking communication, and lock-free implementation
  serializable (or almost serializable)
  successful applications: MF and LDA

SLIDE 5

Matrix Factorization: Recommender Systems

SLIDE 6

Recommender Systems

SLIDE 7

Matrix Factorization Approach: A ≈ WH^T

SLIDE 8

Matrix Factorization Approach: A ≈ WH^T

SLIDE 9

Matrix Factorization Approach

min_{W ∈ R^{m×k}, H ∈ R^{n×k}}  Σ_{(i,j)∈Ω} (A_ij − w_i^T h_j)^2 + λ (‖W‖_F^2 + ‖H‖_F^2)

Ω = {(i, j) | A_ij is observed}
Regularization terms to avoid over-fitting
Maps users/items to the latent feature space R^k:
  the ith user ⇒ the ith row of W, w_i^T
  the jth item ⇒ the jth column of H^T, h_j
w_i^T h_j measures the interaction between user i and item j

SLIDE 10

SGM: Stochastic Gradient Method

SGM update: pick (i, j) ∈ Ω

  R_ij ← A_ij − w_i^T h_j
  w_i ← w_i − η ( (λ/|Ω_i|) w_i − R_ij h_j )
  h_j ← h_j − η ( (λ/|Ω̄_j|) h_j − R_ij w_i )

Ω_i: observed ratings in the i-th row; Ω̄_j: observed ratings in the j-th column

[Figure: a 3×3 example matrix A with row factors w_1^T, w_2^T, w_3^T and column factors h_1, h_2, h_3]

One iteration: |Ω| updates
Time per update: O(k)
Time per iteration: O(|Ω|k), better than O(|Ω|k^2) for ALS
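A minimal NumPy sketch of one such pass over Ω (illustrative only; the names, and the counts omega_i / omega_j holding |Ω_i| and |Ω̄_j|, are ours, not the NOMAD code):

import numpy as np

def sgd_epoch(entries, W, H, omega_i, omega_j, lam=0.05, eta=0.01):
    """One pass of stochastic gradient updates over the observed entries Omega.

    entries: list of (i, j, a_ij) observed ratings
    W: m x k user factors, H: n x k item factors
    omega_i[i] / omega_j[j]: number of observed ratings in row i / column j
    """
    for i, j, a_ij in entries:
        r_ij = a_ij - W[i] @ H[j]                         # residual R_ij
        grad_w = (lam / omega_i[i]) * W[i] - r_ij * H[j]
        grad_h = (lam / omega_j[j]) * H[j] - r_ij * W[i]
        W[i] -= eta * grad_w                              # w_i update, O(k)
        H[j] -= eta * grad_h                              # h_j update, O(k)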

SLIDE 11

Parallel Stochastic Gradient Descent for MF

Challenge: direct parallel updates ⇒ memory conflicts.

Multi-core parallelization:
  Hogwild [Niu et al, 2011]
  Jellyfish [Recht et al, 2011]
  FPSGD** [Zhuang et al, 2013]

Multi-machine parallelization:
  DSGD [Gemulla et al, 2011]
  DSGD++ [Teflioudi et al, 2013]

SLIDE 12

DSGD/JellyFish [Gemulla et al, 2011; Recht et al, 2011]

[Figure: the rating matrix partitioned into blocks; workers process disjoint blocks, then synchronize and communicate before moving on to the next set of blocks, and synchronize and communicate again]

SLIDE 13

Proposed Asynchronous Approach: NOMAD-MF [Yun et al, 2014]

SLIDE 14

Motivation

Most existing parallel approaches require

Synchronization, and/or
  E.g., ALS, DSGD/JellyFish, DSGD++, CCD++
  Computing power is wasted: interleaved computation and communication; curse of the last reducer

Locking, and/or
  E.g., parallel SGD, FPSGD**
  A standard way to avoid conflicts and guarantee serializability
  Complicated remote locking slows down the computation
  Hard to implement efficient locking on a distributed system

Computation using stale values
  E.g., Hogwild, asynchronous SGD using a parameter server
  Lack of serializability

Q: Can we avoid both synchronization and locking, but keep CPUs from being idle and guarantee serializability?

SLIDE 15

Our answer: NOMAD

A: Yes, NOMAD keeps the CPU and the network busy simultaneously.

Stochastic gradient update rule
  only a small set of variables is involved

Nomadic token passing
  widely used in telecommunications
  avoids conflicts without explicit remote locking
  Idea: "owner computes"
  NOMAD: multiple "active tokens" and nomadic passing

Features (see the worker-loop sketch below):
  fully asynchronous computation
  lock-free implementation
  non-blocking communication
  serializable update sequence
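A structural sketch of one NOMAD worker in Python (the names, the round-robin destination choice, and the plain λ regularization are illustrative simplifications, not the authors' implementation):

import numpy as np   # w_i and h_j are assumed to be 1-D arrays of length k

def nomad_worker(my_queue, peer_queues, W_block, A_block, eta, lam):
    """One worker: repeatedly receive a token (j, h_j), update, and pass it on.

    my_queue:    this worker's concurrent token queue
    peer_queues: the other workers' token queues
    W_block:     dict i -> w_i for the user rows owned by this worker
    A_block:     dict j -> list of (i, A_ij) ratings local to this worker
    """
    rr = 0                                     # round-robin choice of the next owner
    while True:
        j, h_j = my_queue.get()                # receive a nomadic token
        for i, a_ij in A_block.get(j, []):     # SG updates touch only w_i and h_j
            r_ij = a_ij - W_block[i] @ h_j
            w_old = W_block[i].copy()
            W_block[i] -= eta * (lam * W_block[i] - r_ij * h_j)
            h_j -= eta * (lam * h_j - r_ij * w_old)
        peer_queues[rr % len(peer_queues)].put((j, h_j))   # pass the token on
        rr += 1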

SLIDE 16

Access Graph for Stochastic Gradient

Access graph G = (V , E):

  V = {w_i} ∪ {h_j}
  E = {e_ij : (i, j) ∈ Ω}

Connection to SG:
  each e_ij corresponds to an SG update
  only w_i and h_j are accessed

Parallelism:
  edges without a common node can be updated in parallel
  identify a "matching" in the graph

Nomadic Token Passing:
  a mechanism such that the active edges always form a matching
  serializability guaranteed

[Figure: bipartite access graph with user nodes w_i on one side and item nodes h_j on the other]

SLIDE 17

More Details

Nomadic tokens for {h_j}: n tokens (j, h_j), O(k) space each

Worker: p workers, each with
  a computing unit + a concurrent token queue
  a block of W: O(mk/p) space
  a block row of A: O(|Ω|/p) space

[Figure: the rating matrix partitioned into p block rows, one per worker]

SLIDE 18-28

Illustration of NOMAD communication

[Figure: a sequence of animation frames showing nomadic tokens circulating among the workers' queues while each worker updates its local block]

SLIDE 29

Comparison on a Multi-core System

On a 32-core processor with enough RAM. Comparison: NOMAD, FPSGD**, and CCD++.

[Plot: test RMSE vs. seconds on Netflix (100M ratings), machines=1, cores=30, λ = 0.05, k = 100; curves for NOMAD, FPSGD**, CCD++]

[Plot: test RMSE vs. seconds on Yahoo! (250M ratings), machines=1, cores=30, λ = 1.00, k = 100; curves for NOMAD, FPSGD**, CCD++]

SLIDE 30

Comparison on a Distributed System

On a distributed system with 32 machines. Comparison: NOMAD, DSGD, DSGD++, and CCD++.

[Plot: test RMSE vs. seconds on Netflix (100M ratings), machines=32, cores=4, λ = 0.05, k = 100; curves for NOMAD, DSGD, DSGD++, CCD++]

[Plot: test RMSE vs. seconds on Yahoo! (250M ratings), machines=32, cores=4, λ = 1.00, k = 100; curves for NOMAD, DSGD, DSGD++, CCD++]

SLIDE 31

Super Linear Scaling of NOMAD-MF

[Plot: test RMSE vs. (seconds × machines × cores) on Yahoo!, cores=4, λ = 1.00, k = 100; curves for # machines = 1, 2, 4, 8, 16, 32]

SLIDE 32

Topic Modeling: Latent Dirichlet Allocation

SLIDE 33

Latent Dirichlet Allocation (LDA)

Each topic is a multinomial distribution over words
Each document is a multinomial distribution over topics
Each word is drawn from one of these topics

Source: http://www.cs.columbia.edu/~blei/papers/icml-2012-tutorial.pdf
SLIDE 34

Graphical Model for LDA

[Plate diagram: word w_{i,j} depends on topic assignment z_{i,j} and topics φ_t; z_{i,j} depends on topic proportions θ_i, which depend on α; φ_t depends on β; plates over j = 1, . . . , n_i, i = 1, . . . , I, and t = 1, . . . , T]

  w_{i,j}: observed word        z_{i,j}: topic assignment
  θ_i: topic proportions        α: proportion parameter
  φ_t: topics                   β: topic parameter

Joint distribution:

  Pr(·) = ∏_{t=1}^{T} Pr(φ_t | β) ∏_{i=1}^{I} [ Pr(θ_i | α) ∏_{j=1}^{n_i} Pr(z_{i,j} | θ_i) Pr(w_{i,j} | φ_{z_{i,j}}) ]

Pr(φ_t | β), Pr(θ_i | α): Dirichlet distributions
Pr(w | φ_t), Pr(z | θ_i): multinomial distributions
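The joint distribution above corresponds to the usual LDA generative process; a small illustrative NumPy sketch (function and argument names are ours, not from the slides):

import numpy as np

def generate_corpus(I, n_i, T, V, alpha, beta, rng=np.random.default_rng(0)):
    """Sample a toy corpus from the LDA generative process."""
    phi = rng.dirichlet(np.full(V, beta), size=T)            # topics: phi_t ~ Dirichlet(beta)
    docs = []
    for _ in range(I):
        theta = rng.dirichlet(np.full(T, alpha))             # proportions: theta_i ~ Dirichlet(alpha)
        z = rng.choice(T, size=n_i, p=theta)                 # assignments: z_ij ~ Multinomial(theta_i)
        w = np.array([rng.choice(V, p=phi[t]) for t in z])   # words: w_ij ~ Multinomial(phi_{z_ij})
        docs.append(w)
    return docs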

SLIDE 35

Inference for LDA

Only the documents are observed
θ_i, φ_t, z_{i,j} are latent
Goal: infer these latent structures

Source: http://www.cs.columbia.edu/~blei/papers/icml-2012-tutorial.pdf
SLIDE 36

Posterior Inference for LDA

Task: Pr(θ_i, φ_t, z_{i,j} | {d_i}, α, β)

Given
  a corpus of documents {d_i : i = 1, . . . , N} and α, β
  each document d_i = {w_{i,j} : j = 1, . . . , n_i}

Exact inference for z_{i,j}, θ_i, φ_t
  intractable: the latent variables are dependent when conditioned on the data

Approximate inference approaches:

Variational methods
  see [Blei et al, 2003]
  an optimization approach
  runs faster but generates biased results

Gibbs sampling
  see [Griffiths & Steyvers, 2004]
  an MCMC approach
  more accurate but slower with a vanilla implementation

Goal: Design a scalable Gibbs sampler for LDA

SLIDE 37

Gibbs Sampling for LDA [Griffiths & Steyvers, 2004]

Count matrices for topic assignment {zi,j}:

  n_dt: # words of document d assigned to topic t
  n_wt: # times word w is assigned to topic t
  n_t := Σ_w n_wt = Σ_d n_dt

Gibbs sampling step

  1. Choose w := w_{i,j} with old assignment t_o := z_{i,j} in document d := d_i
  2. Decrease n_{d t_o}, n_{w t_o}, n_{t_o} by 1
  3. Resample a new assignment t_n := z_{i,j} according to

       Pr(z_{i,j} = t) ∝ (n_dt + α)(n_wt + β) / (n_t + β̄),   ∀t = 1, . . . , T

  4. Increase n_{d t_n}, n_{w t_n}, n_{t_n} by 1

Constants
  J: vocabulary size
  β̄ = β × J
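A minimal NumPy sketch of one such Gibbs step, using the naive Θ(T) sampler that the later slides speed up (array names are illustrative):

import numpy as np

def gibbs_step(d, w, z_old, n_dt, n_wt, n_t, alpha, beta, rng):
    """Resample the topic of one token (word w in document d).

    n_dt: (D, T) doc-topic counts, n_wt: (V, T) word-topic counts, n_t: (T,) topic counts.
    """
    # steps 1-2: remove the old assignment t_o from the counts
    n_dt[d, z_old] -= 1
    n_wt[w, z_old] -= 1
    n_t[z_old] -= 1

    # step 3: Pr(z = t) proportional to (n_dt + alpha)(n_wt + beta)/(n_t + beta_bar)
    beta_bar = beta * n_wt.shape[0]                      # beta_bar = beta * J
    p = (n_dt[d] + alpha) * (n_wt[w] + beta) / (n_t + beta_bar)
    z_new = rng.choice(len(p), p=p / p.sum())

    # step 4: add the new assignment t_n back into the counts
    n_dt[d, z_new] += 1
    n_wt[w, z_new] += 1
    n_t[z_new] += 1
    return z_new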

SLIDE 38

Access Pattern for Gibbs Sampling

[Figure: access pattern of a Gibbs step; each token z_ij touches the word-topic counts n_wt, the doc-topic counts n_dt, and the topic counts n_t, indexed by words, topics, and docs]

SLIDE 39

Multinomial Sampling Techniques for p ∈ R^T_+

Method             Initialization        Generation   Parameter Update
                   Time       Space      Time         Time
LSearch            Θ(T)       Θ(1)       Θ(T)         Θ(1)
BSearch            Θ(T)       Θ(1)       Θ(log T)     Θ(T)
Alias Method       Θ(T)       Θ(T)       Θ(1)         Θ(T)
F+tree Sampling    Θ(T)       Θ(1)       Θ(log T)     Θ(log T)

LSearch
  maintain c_T = p^T 1
  linear search for generation
  Θ(1) update

BSearch
  maintain c = cumsum(p)
  binary search for generation
  no support for updates

Alias Method
  alias table construction has some overhead
  no support for updates

F+tree
  a variant of the Fenwick tree
  construction has low overhead
  logarithmic time for sampling and updates

SLIDE 40

F+Tree: Construction

Construction in Θ(T) time for p = [0.3, 1.5, 0.4, 0.3]^T

[Figure: F+tree stored as a binary heap over the leaves p_1, . . . , p_4]

  node 001 (root): 2.5 = 1.8 + 0.7
  node 010: 1.8 = 0.3 + 1.5        node 011: 0.7 = 0.4 + 0.3
  leaves 100: 0.3 = p_1   101: 1.5 = p_2   110: 0.4 = p_3   111: 0.3 = p_4

SLIDE 41

F+Tree: Sampling

Multinomial sampling in Θ(log T) time
Initial u: a number drawn uniformly from [0, F[1])

[Figure: example walk down the same F+tree with u = 2.1; at the root, u ≥ 1.8 (the left child), so go right and set u ← u − 1.8 = 0.3; at node 011, u < 0.4 (the left child), so go left; the walk ends at leaf 110, i.e., sample t = 3]

SLIDE 42

F+Tree: Update

Update in Θ(log T) time: p_3 ← p_3 + δ

[Figure: only the nodes on the path from leaf 110 to the root change: leaf 110 becomes 0.4 + δ, node 011 becomes 0.7 + δ, and root 001 becomes 2.5 + δ (shown with δ = 1: 1.4, 1.7, 3.5); all other nodes are unchanged]
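The three F+tree operations fit in a few dozen lines. Below is a minimal illustrative Python sketch (our own naming, not the paper's code); the tree is stored heap-style in an array F with the root at F[1] and the T leaves starting at F[size]:

import numpy as np

class FPlusTree:
    """Minimal F+tree sketch: a complete binary tree over T weights, stored in an array."""

    def __init__(self, p):
        self.T = len(p)
        self.size = 1
        while self.size < self.T:
            self.size *= 2
        self.F = np.zeros(2 * self.size)
        self.F[self.size:self.size + self.T] = p           # leaves hold p_1, ..., p_T
        for i in range(self.size - 1, 0, -1):               # Theta(T) construction
            self.F[i] = self.F[2 * i] + self.F[2 * i + 1]   # internal node = sum of children

    def sample(self, rng):
        """Draw t with probability p_t / F[1] in Theta(log T) time."""
        u = rng.uniform(0.0, self.F[1])
        i = 1
        while i < self.size:              # walk down the tree
            if u < self.F[2 * i]:
                i = 2 * i                 # go left
            else:
                u -= self.F[2 * i]
                i = 2 * i + 1             # go right, after subtracting the left mass
        return i - self.size              # 0-based leaf index

    def update(self, t, delta):
        """p_t <- p_t + delta in Theta(log T) time: fix only the leaf-to-root path."""
        i = self.size + t
        while i >= 1:
            self.F[i] += delta
            i //= 2

For p = [0.3, 1.5, 0.4, 0.3], sample() reproduces the walk on the previous slide (u = 2.1 ends at leaf 110, i.e., topic 3), and update(2, δ) touches only that leaf-to-root path.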

SLIDE 43

F+LDA = LDA with F+tree Sampling

Decomposition of p:

  p_t = (n_dt + α)(n_wt + β) / (n_t + β̄),   ∀t = 1, . . . , T
      = β (n_dt + α)/(n_t + β̄)  [= q_t]  +  n_wt (n_dt + α)/(n_t + β̄)  [= r_t]     (1)

  i.e., p = βq + r, sampled with a two-level scheme (see the sketch below):

q is dense
  only 2 entries (q_{t_o}, q_{t_n}) change per Gibbs step within the same document
  use the F+tree for q

r is sparse
  nonzero entries: T_w := {t : n_wt ≠ 0}
  the entire r changes at each Gibbs step
  use BSearch for r

Can also work with a word-by-word update sequence
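A sketch of the two-level sampling of p = βq + r, reusing the FPlusTree sketch from the F+tree slides (the interface and names are illustrative assumptions):

import numpy as np

def sample_two_level(q_tree, r_support, r_values, beta, rng):
    """Draw a topic from p = beta*q + r without materializing the dense p.

    q_tree:    an FPlusTree built over q (so q_tree.F[1] is the total mass of q)
    r_support: topic indices t with n_wt != 0 (the set T_w)
    r_values:  the corresponding r_t values, recomputed at every Gibbs step
    """
    mass_q = beta * q_tree.F[1]
    r_cumsum = np.cumsum(r_values)
    mass_r = r_cumsum[-1] if len(r_values) else 0.0

    u = rng.uniform(0.0, mass_q + mass_r)
    if u < mass_q:
        return q_tree.sample(rng)        # dense part: F+tree, Theta(log T)
    idx = np.searchsorted(r_cumsum, u - mass_q, side='right')
    return r_support[idx]                # sparse part: binary search over cumsum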

SLIDE 44

F+LDA: Alternative Decomposition

Word-by-word Gibbs sampling sequence

Decomposition of p:

  p_t = (n_dt + α)(n_wt + β) / (n_t + β̄),   ∀t = 1, . . . , T
      = α (n_wt + β)/(n_t + β̄)  [= q_t]  +  n_dt (n_wt + β)/(n_t + β̄)  [= r_t]     (2)

  i.e., p = αq + r

q: changes only slightly along this sequence ⇒ use the F+tree
r: |T_d| nonzeros, where T_d := {t : n_dt ≠ 0} ⇒ use BSearch

SLIDE 45

Comparison to Other LDA Sampling

F+LDA (word-by-word)
  Sequence: word-by-word; Exact: Yes
  Decomposition: p_t = α(n_wt + β)/(n_t + β̄) + n_dt(n_wt + β)/(n_t + β̄)
  Structure: F+tree / BSearch; Fresh samples: Yes / Yes
  Initialization: Θ(log T) / Θ(|T_d|); Sampling: Θ(log T) / Θ(log |T_d|)

F+LDA (doc-by-doc)
  Sequence: doc-by-doc; Exact: Yes
  Decomposition: p_t = β(n_dt + α)/(n_t + β̄) + n_wt(n_dt + α)/(n_t + β̄)
  Structure: F+tree / BSearch; Fresh samples: Yes / Yes
  Initialization: Θ(log T) / Θ(|T_w|); Sampling: Θ(log T) / Θ(log |T_w|)

Sparse-LDA
  Sequence: doc-by-doc; Exact: Yes
  Decomposition: p_t = αβ/(n_t + β̄) + n_dt β/(n_t + β̄) + n_wt(n_dt + α)/(n_t + β̄)
  Structure: LSearch / LSearch / LSearch; Fresh samples: Yes / Yes / Yes
  Initialization: Θ(1) / Θ(1) / Θ(|T_w|); Sampling: Θ(T) / Θ(|T_d|) / Θ(|T_w|)

Alias-LDA
  Sequence: doc-by-doc; Exact: No
  Decomposition: p_t = α(n_wt + β)/(n_t + β̄) + n_dt(n_wt + β)/(n_t + β̄)
  Structure: Alias / Alias; Fresh samples: No / Yes
  Initialization: Θ(1) / Θ(|T_d|); Sampling: Θ(#MH) / Θ(#MH)

(The entries after "Structure", "Fresh samples", "Initialization", and "Sampling" are given per term of each decomposition.)

F+LDA: word-by-word is faster than doc-by-doc for large I
  |T_d| is bounded by n_i, but |T_w| approaches T
  per Gibbs step cost: ρ_F log T + ρ_B |T_d|

SparseLDA:
  per Gibbs step cost: Θ(T + |T_d| + |T_w|)
  the first Θ(T) rarely happens, but |T_w| → T for large I

AliasLDA:
  per Gibbs step cost: ρ_A |T_d| + #MH
  ρ_A ≈ 3 × ρ_B: construction overhead of the alias table
  if (ρ_A − ρ_B) |T_d| > ρ_F log T ⇒ AliasLDA is slower than F+LDA
  say |T_d| ≈ 100: F+LDA is still faster for T < 2^50

SLIDE 46

Comparison of various sampling methods

Single machine, single thread
y-axis: speedup over normal O(T) multinomial sampling
Enron: 38K docs with 6M tokens
NYTimes: 0.3M docs with 100M tokens

SLIDE 47

Access Pattern for Gibbs Sampling

[Figure (repeated from Slide 38): access pattern of a Gibbs step; each token z_ij touches the word-topic counts n_wt, the doc-topic counts n_dt, and the topic counts n_t, indexed by words, topics, and docs]

SLIDE 48

Access Graph for Gibbs Sampling

G = (V, E): a hypergraph
  V = {d_i} ∪ {w_j} ∪ {s}
  E = {e_ij = (d_i, w_j, s)}

Connection to Gibbs sampling:
  (d_i)_t := n_{d_i t},  (w_j)_t := n_{w_j t},  (s)_t := n_t
  each e_ij: a Gibbs step for word w_j in d_i, accessing (d_i, w_j, s)

Parallelism: more challenging
  all edges are incident to s
  all (s)_t are large in general ⇒ a slightly stale s is fine for accuracy
  duplicate s for parallelism

[Figure: hypergraph with document nodes d_i, word nodes w_j, and the summation node s]

SLIDE 49

Nomadic Tokens for wj

Nomadic tokens for {w_j : j = 1, . . . , J}: J tokens (j, w_j), O(T) space each

Worker: p workers, each with
  a computing unit + a concurrent token queue
  a subset of {d_i}: O(IT/p) space

[Figure: the corpus partitioned among workers; "x" marks an occurrence of a word, the bigger rectangle is a worker's subset of the corpus, and each smaller rectangle is a unit subtask]

SLIDE 50

Nomadic Token for s: Circular Delta Update

Single global s
  travels among the machines as a messenger
  broadcasts local delta updates

Every machine p keeps (s_p, s̄)
  s_p: local working copy
  s̄: snapshot version of the global s

When the global s visits machine 3 (say):

  s ← s + (s_3 − s̄),   s̄ ← s,   s_3 ← s

[Figure: the global token s circulating among machines holding (s_1, s̄), (s_2, s̄), (s_3, s̄), (s_4, s̄)]
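A small sketch of the circular delta update when the global token arrives at a machine (the MachineState container and all names are illustrative, not the authors' code):

from dataclasses import dataclass
import numpy as np

@dataclass
class MachineState:
    local_s: np.ndarray    # s_p: local working copy, modified by the local Gibbs sampler
    snapshot: np.ndarray   # snapshot of the global s taken at the token's last visit

def visit(global_s, m):
    """Fold machine m's local delta into the global s and resynchronize m."""
    global_s = global_s + (m.local_s - m.snapshot)   # s <- s + (s_p - snapshot)
    m.snapshot = global_s.copy()                     # snapshot <- s
    m.local_s = global_s.copy()                      # s_p <- s
    return global_s                                  # the token then travels to the next machine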

SLIDE 51

Comparison on a single multi-core machine

On a machine with a 20-core processor
Comparison: F+NOMAD LDA, Yahoo! LDA
PubMed: 9M docs with 700M tokens
Amazon: 30M docs with 1.5B tokens

SLIDE 52

Comparison on a Multi-machine System

32 machines, each with a 20-core processor
Comparison: F+NOMAD LDA, Yahoo! LDA
Amazon: 30M docs with 1.5B tokens
UMBC: 40M docs with 1.5B tokens

SLIDE 53

Conclusions

The NOMAD framework uses nomadic tokens to provide
  asynchronous computation
  non-blocking communication
  lock-free implementation
  serializable (or nearly serializable) updates

Recommender systems: matrix factorization
  scalable parallel stochastic gradient
  serializability guarantee

Topic modeling: Latent Dirichlet Allocation
  logarithmic F+tree sampling for efficient Gibbs sampling
  duplicated nomadic tokens for the common node
  outperforms Yahoo! LDA
