Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent
Rainer Gemulla, Peter J. Haas, Yannis Sismanis, Erik Nijkamp
August 23, 2011
Outline
Matrix Factorization Stochastic Gradient Descent Distributed SGD with MapReduce Experiments Summary
2 / 32
Collaborative Filtering

◮ Problem
  ◮ Set of users
  ◮ Set of items (movies, books, jokes, products, stories, ...)
  ◮ Feedback (ratings, purchases, click-throughs, tags, ...)
◮ Predict additional items a user may like
  ◮ Assumption: similar feedback ⇒ similar taste
◮ Example

                Avatar   The Matrix   Up
    Alice          ?          4        2
    Bob            3          2        ?
    Charlie        5          ?        3

◮ Netflix competition: 500k users, 20k movies, 100M movie ratings, 3M question marks

4 / 32
Semantic Factors (Koren et al., 2009)

[Figure: movies and users placed in a two-dimensional latent factor space; one axis runs from "geared toward females" to "geared toward males", the other from "serious" to "escapist". Example placements: The Princess Diaries, Sense and Sensibility, The Color Purple, Amadeus, The Lion King, Braveheart, Lethal Weapon, Independence Day, Ocean's 11, Dumb and Dumber; users Gus and Dave.]

5 / 32
Latent Factor Models

◮ Discover latent factors (r = 1)

                      Avatar    The Matrix     Up
                      (2.24)      (1.92)     (1.18)
    Alice   (1.98)    ? (4.4)     4 (3.8)    2 (2.3)
    Bob     (1.21)    3 (2.7)     2 (2.3)    ? (1.4)
    Charlie (2.30)    5 (5.2)     ? (4.4)    3 (2.7)

◮ Minimum loss

    min_{W,H} ∑_{(i,j)∈Z} (V_ij − [WH]_ij)²

◮ Bias:

    min_{W,H,u,m} ∑_{(i,j)∈Z} (V_ij − µ − u_i − m_j − [WH]_ij)²

◮ Bias, regularization:

    min_{W,H,u,m} ∑_{(i,j)∈Z} (V_ij − µ − u_i − m_j − [WH]_ij)² + λ(‖W‖² + ‖H‖² + ‖u‖² + ‖m‖²)

◮ Bias, regularization, time:

    min_{W,H,u,m} ∑_{(i,j,t)∈Z_t} (V_ij − µ − u_i(t) − m_j(t) − [W(t)H]_ij)² + λ(‖W(t)‖² + ‖H‖² + ‖u(t)‖² + ‖m(t)‖²)

6 / 32
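The rank-1 numbers on this slide can be checked in a few lines of Python; a minimal sketch (the factor values are copied from the slide, and predictions are rounded to one decimal as shown there):

```python
# Rank-1 latent factors from the slide (r = 1): one number per user / movie.
W = {"Alice": 1.98, "Bob": 1.21, "Charlie": 2.30}
H = {"Avatar": 2.24, "The Matrix": 1.92, "Up": 1.18}

# Known ratings: the training set Z (the question marks are simply absent).
V = {("Alice", "The Matrix"): 4, ("Alice", "Up"): 2,
     ("Bob", "Avatar"): 3, ("Bob", "The Matrix"): 2,
     ("Charlie", "Avatar"): 5, ("Charlie", "Up"): 3}

def predict(i, j):
    return W[i] * H[j]          # [WH]_ij for rank 1

# Sum of squared errors over the known entries (i, j) in Z.
loss = sum((v - predict(i, j)) ** 2 for (i, j), v in V.items())

# The predictions fill in the question marks:
print(round(predict("Alice", "Avatar"), 1))        # 4.4
print(round(predict("Bob", "Up"), 1))              # 1.4
print(round(predict("Charlie", "The Matrix"), 1))  # 4.4
```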
Generalized Matrix Factorization

◮ A general machine learning problem
  ◮ Recommender systems, text indexing, face recognition, ...
◮ Training data
  ◮ V: m × n input matrix (e.g., rating matrix)
  ◮ Z: training set of indexes in V (e.g., subset of known ratings)
◮ Parameter space
  ◮ W: row factors (e.g., m × r latent customer factors)
  ◮ H: column factors (e.g., r × n latent movie factors)
◮ Model
  ◮ L_ij(W_i*, H_*j): loss at element (i, j)
  ◮ Includes prediction error, regularization, auxiliary information, ...
  ◮ Constraints (e.g., non-negativity)
◮ Find the best model

    argmin_{W,H} ∑_{(i,j)∈Z} L_ij(W_i*, H_*j)

7 / 32
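The element-wise loss L_ij is the plug-in point of this formulation. A hypothetical sketch of that interface in Python (the names `l2_loss` and `total_loss` are illustrative, not from the paper):

```python
def l2_loss(v, w, h, lam=0.0):
    """L_ij(W_i*, H_*j): squared prediction error plus optional L2 regularization."""
    pred = sum(wk * hk for wk, hk in zip(w, h))                 # [WH]_ij
    reg = lam * (sum(x * x for x in w) + sum(x * x for x in h))
    return (v - pred) ** 2 + reg

def total_loss(V, W, H, loss=l2_loss, **kw):
    """Sum of L_ij over the training set Z (the keys of V)."""
    return sum(loss(v, W[i], H[j], **kw) for (i, j), v in V.items())

# A perfect rank-1 fit of a single known entry has zero unregularized loss:
print(total_loss({(0, 0): 4.0}, [[2.0]], [[2.0]]))  # 0.0
```

Swapping in a different per-element loss (e.g., one that also consults auxiliary information) changes the model without touching the optimization machinery.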
Successful Applications

◮ Movie recommendation (Netflix, competition papers)
  ◮ >12M users, >20k movies, 2.4B ratings (projected)
  ◮ 36GB data, 9.2GB model (projected)
  ◮ Latent factor model
◮ Website recommendation (Microsoft, WWW10)
  ◮ 51M users, 15M URLs, 1.2B clicks
  ◮ 17.8GB data, 161GB metadata, 49GB model
  ◮ Gaussian non-negative matrix factorization
◮ News personalization (Google, WWW07)
  ◮ Millions of users, millions of stories, ? clicks
  ◮ Probabilistic latent semantic indexing

Distributed processing is necessary!
◮ Big data  ◮ Large models  ◮ Expensive computations

8 / 32
Outline
Matrix Factorization Stochastic Gradient Descent Distributed SGD with MapReduce Experiments Summary
9 / 32
Stochastic Gradient Descent

◮ Find a minimum θ* of the function L
◮ Pick a starting point θ_0
◮ Approximate the gradient: L̂′(θ_0)
◮ Jump "approximately" downhill
◮ Stochastic difference equation

    θ_{n+1} = θ_n − ε_n L̂′(θ_n)

◮ Under certain conditions, asymptotically approximates (continuous) gradient descent

[Figure: contour plot of L with the SGD iterates jumping approximately downhill toward θ*; inset: distance to the optimum over time]

10 / 32
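The steps above can be played out on a toy problem; a sketch with a made-up loss L(x, y) = x² + 2y² written as a sum of two terms, where each SGD step samples one term and scales its gradient to get an unbiased estimate of L′:

```python
import random

# Gradients of the two terms of L(x, y) = x^2 + 2*y^2.
terms = [lambda x, y: (2 * x, 0.0),      # gradient of x^2
         lambda x, y: (0.0, 4 * y)]      # gradient of 2*y^2

random.seed(0)
theta = [1.0, 1.0]                       # starting point theta_0
for n in range(2000):
    eps = 0.1 / (1 + 0.01 * n)           # decreasing step size eps_n
    gx, gy = random.choice(terms)(*theta)
    # Scale by the number of terms (2): unbiased estimate of the full gradient.
    theta[0] -= eps * 2 * gx
    theta[1] -= eps * 2 * gy

print(theta)  # both coordinates end up very close to the minimizer (0, 0)
```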
Stochastic Gradient Descent for Matrix Factorization

◮ Set θ = (W, H) and use

    L(θ) = ∑_{(i,j)∈Z} L_ij(W_i*, H_*j)
    L′(θ) = ∑_{(i,j)∈Z} L′_ij(W_i*, H_*j)
    L̂′(θ, z) = N L′_{i_z j_z}(W_{i_z*}, H_{*j_z}),  where N = |Z|

◮ SGD epoch
  1. Pick a random entry z ∈ Z
  2. Compute the approximate gradient L̂′(θ, z)
  3. Update the parameters: θ_{n+1} = θ_n − ε_n L̂′(θ_n, z)
  4. Repeat N times

11 / 32
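One SGD epoch for the plain squared loss L_ij = (V_ij − W_i* · H_*j)² can be sketched as follows (toy data; the constant step size and rank r = 2 are arbitrary choices for the illustration):

```python
import random

random.seed(1)
m, n, r, eps = 4, 3, 2, 0.05
# Training set Z: the known entries of V, as a dict keyed by (i, j).
V = {(0, 1): 4, (0, 2): 2, (1, 0): 3, (1, 1): 2, (2, 0): 5, (2, 2): 3, (3, 0): 1}
Z = list(V)
W = [[random.uniform(0, 1) for _ in range(r)] for _ in range(m)]
H = [[random.uniform(0, 1) for _ in range(r)] for _ in range(n)]

def sgd_epoch():
    for _ in range(len(Z)):                 # N = |Z| steps per epoch
        i, j = random.choice(Z)             # 1. pick a random entry z
        err = V[i, j] - sum(W[i][k] * H[j][k] for k in range(r))
        for k in range(r):                  # 2.+3. step on W_i* and H_*j only
            W[i][k], H[j][k] = (W[i][k] + eps * 2 * err * H[j][k],
                                H[j][k] + eps * 2 * err * W[i][k])

def loss():
    return sum((v - sum(W[i][k] * H[j][k] for k in range(r))) ** 2
               for (i, j), v in V.items())

before = loss()
for _ in range(200):
    sgd_epoch()
print(loss() < before)  # the training loss drops from its random-init value
```

Note that each step reads and writes only the row W_i* and column H_*j of the sampled entry; this locality is what the distributed algorithm later exploits.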
Stochastic Gradient Descent on Netflix Data

[Figure: mean loss vs. epoch (10-60) on Netflix data; curves for L-BFGS and SGD]

12 / 32
Comparison

◮ Per epoch, assuming O(r) gradient computation per element

                             GD              SGD
    Algorithm                Deterministic   Randomized
    Gradient computations    1               N
    Gradient types           Exact           Approximate
    Parameter updates        1               N
    Time                     O(rN)           O(rN)
    Space                    O((m + n)r)     O((m + n)r)

◮ Why stochastic?
  ◮ Fast convergence to the vicinity of the optimum
  ◮ Randomization may help escape local minima
  ◮ Exploitation of "repeated structure"

13 / 32
Outline
Matrix Factorization Stochastic Gradient Descent Distributed SGD with MapReduce Experiments Summary
14 / 32
Averaging Techniques

◮ SGD steps depend on each other:

    θ_{n+1} = θ_n − ε_n L̂′(θ_n)

  How to distribute?
◮ Parameter mixing (ISGD)
  ◮ Map: run independent instances of SGD on subsets of the data (until convergence)
  ◮ Reduce: average the results

15 / 32
Averaging Techniques

[Figure: mean loss vs. epoch (10-60) on Netflix data; curves for L-BFGS, SGD, and ISGD]

16 / 32
Averaging Techniques

◮ Parameter mixing (ISGD)
  ◮ Map: run independent instances of SGD on subsets of the data (until convergence)
  ◮ Reduce: average the results
  ◮ Does not converge to the correct solution!
◮ Iterative parameter mixing (PSGD)
  ◮ Map: run independent instances of SGD on subsets of the data (for some time)
  ◮ Reduce: average the results
  ◮ Repeat

17 / 32
Averaging Techniques

[Figure: mean loss vs. epoch (10-60) on Netflix data; curves for L-BFGS, SGD, ISGD, and PSGD]

18 / 32
Averaging Techniques

◮ Parameter mixing (ISGD): does not converge to the correct solution!
◮ Iterative parameter mixing (PSGD): converges slowly!

19 / 32
Problem Structure

◮ SGD steps depend on each other:

    θ_{n+1} = θ_n − ε_n L̂′(θ_n)

◮ An SGD step on example z ∈ Z ...
  1. Reads W_{i_z*} and H_{*j_z}
  2. Performs the gradient computation L′_{i_z j_z}(W_{i_z*}, H_{*j_z})
  3. Updates W_{i_z*} and H_{*j_z}
◮ Not all steps are dependent

[Figure: successive examples z_n and z_{n+1} in V may touch the same row of W, the same column of H, or neither]

20 / 32
Interchangeability

◮ Two elements z1, z2 ∈ Z are interchangeable if they share neither row nor column
◮ When z_n and z_{n+1} are interchangeable, the SGD steps

    θ_{n+2} = θ_n − ε L̂′(θ_n, z_n) − ε L̂′(θ_{n+1}, z_{n+1})
            = θ_n − ε L̂′(θ_n, z_n) − ε L̂′(θ_n, z_{n+1}),

  become parallelizable!

21 / 32
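The commutation can be checked directly; a sketch with rank r = 1 and squared loss (all values are made up for the illustration):

```python
eps = 0.1
# z1 = (0, 0) and z2 = (1, 1) share neither a row nor a column.
V = {(0, 0): 4.0, (1, 1): 2.0}

def step(W, H, i, j):
    """One SGD step on entry (i, j) for loss (V_ij - W_i * H_j)^2, rank 1."""
    W, H = W.copy(), H.copy()
    err = V[i, j] - W[i] * H[j]
    W[i], H[j] = (W[i] + eps * 2 * err * H[j],   # simultaneous update
                  H[j] + eps * 2 * err * W[i])
    return W, H

W0, H0 = [0.5, 0.5], [0.5, 0.5]
a = step(*step(W0, H0, 0, 0), 1, 1)   # z1 then z2
b = step(*step(W0, H0, 1, 1), 0, 0)   # z2 then z1
print(a == b)                         # interchangeable steps commute exactly

V[(0, 1)] = 3.0                       # (0, 0) and (0, 1) share row 0 ...
c = step(*step(W0, H0, 0, 0), 0, 1)
d = step(*step(W0, H0, 0, 1), 0, 0)
print(c == d)                         # ... so here the order does matter
```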
Exploitation

◮ Block and distribute the input matrix V across nodes (Node 1, Node 2, Node 3)

           H1     H2     H3
    W1    V11    V12    V13
    W2    V21    V22    V23
    W3    V31    V32    V33

◮ High-level approach (Map only)
  1. Pick a "diagonal" (e.g., {V11, V22, V33}, then {V12, V23, V31}, ...)
  2. Run SGD on the diagonal (in parallel)
  3. Merge the results
  4. Move on to the next "diagonal"
◮ Steps 1-3 form a cycle
◮ Step 2: simulate sequential SGD
  ◮ The blocks on a diagonal are interchangeable
  ◮ Throw dice on how many iterations per block
  ◮ Throw dice on which step sizes per block
◮ An instance of "stratified SGD"
◮ Provably correct

22 / 32
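A single-process sketch of this diagonal schedule (block count, step-size policy, and data are simplified assumptions, and the "parallel" blocks of a diagonal are simply run independently here rather than on separate nodes):

```python
import random

# Block V into d x d strata. Within one diagonal {(b, (b + s) % d)} the blocks
# share no rows of W and no columns of H, so their SGD steps are
# interchangeable and could run on separate Map tasks.
random.seed(2)
d, m, n, r, eps = 3, 6, 6, 1, 0.05
V = {(i, j): random.uniform(1, 5) for i in range(m) for j in range(n)
     if random.random() < 0.5}                       # sparse training set Z
W = [[0.5] * r for _ in range(m)]
H = [[0.5] * r for _ in range(n)]

def block(b, c):                                     # entries of block V_bc
    return [(i, j) for (i, j) in V
            if i * d // m == b and j * d // n == c]

def sgd_on_block(entries):                           # would run on one node
    for i, j in entries:
        err = V[i, j] - sum(W[i][k] * H[j][k] for k in range(r))
        for k in range(r):
            W[i][k], H[j][k] = (W[i][k] + eps * 2 * err * H[j][k],
                                H[j][k] + eps * 2 * err * W[i][k])

def cycle():                                         # steps 1-3, d times
    for s in range(d):                               # pick a diagonal
        for b in range(d):                           # independent blocks:
            sgd_on_block(block(b, (b + s) % d))      # parallelizable in Map

def total_loss():
    return sum((v - sum(W[i][k] * H[j][k] for k in range(r))) ** 2
               for (i, j), v in V.items())

loss_before = total_loss()
for _ in range(50):
    cycle()
loss_after = total_loss()
print(loss_after < loss_before)   # the stratified schedule still descends
```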
How does it work?

[Figure: contour plots of the stratum losses L1 and L2 and the combined loss L = 0.3 L1 + 0.7 L2, shown over cycles 0, 1, 2, ..., 6 and cycle 100; the stratified SGD iterate alternates between descending on L1 and on L2, and settles at the minimum of L]

23 / 32
Outline
Matrix Factorization Stochastic Gradient Descent Distributed SGD with MapReduce Experiments Summary
24 / 32
DSGD scales well (Netflix, NZSL+L2)

[Figure: mean loss vs. time (hours); curves for SGD and for DSGD with 1x7, 2x7, 4x7, and 8x7 nodes; adding nodes reaches the same loss in less time]

25 / 32
DSGD is fast (8x8, Netflix, NZSL)

[Figure: mean loss vs. time (hours, 0-2); curves for L-BFGS, DSGD, ISGD, PSGD, and ALS]

26 / 32
DSGD is fast (8x8, Netflix, NZSL+L2)

[Figure: mean loss vs. time (hours, 0-2); curves for L-BFGS, DSGD, ISGD, PSGD, and ALS]

27 / 32
DSGD is fast (8x8, synth., NZSL+L2)

[Figure: mean loss (log scale, 1-1000) vs. time (hours, 5-20) on synthetic data; curves for DSGD, PSGD, and ALS]

28 / 32
DSGD runs on Hadoop

[Figure: wall-clock time per epoch (seconds). Left, fixed CPU (@24 cores): data sizes 100M, 400M, 1.6B, 6.4B, 25.6B nonzero entries, with times scaling roughly as 1x, 1.3x, 2.3x, 6.6x, 23.8x. Right, scaled CPU: 1.6B entries @ 5 cores, 6.4B @ 20 cores, 25.6B @ 80 cores, with times 1x, 1x, 1.3x]

(25.6B entries > 1/2 TB of data)

29 / 32
Outline
Matrix Factorization Stochastic Gradient Descent Distributed SGD with MapReduce Experiments Summary
30 / 32
Summary

◮ Matrix factorization
  ◮ Widely applicable via customized loss functions
  ◮ Large instances (millions × millions with billions of entries)
◮ Distributed Stochastic Gradient Descent
  ◮ Simple and versatile
  ◮ Avoids averaging via a novel "stratified SGD" variant
  ◮ Achieves:
    ◮ Fully distributed data/model
    ◮ Fully distributed processing
    ◮ Same or better loss
    ◮ Faster
    ◮ Good scalability
◮ Future Directions
  ◮ More decompositions (e.g., losses at 0)
  ◮ Tensors
  ◮ Stratified SGD for other models
  ◮ ...

31 / 32
Thank you!
32 / 32