SLIDE 1

Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent

Rainer Gemulla, Peter J. Haas, Yannis Sismanis, Christina Teflioudi, Faraz Makari (December 17, 2011)

SLIDE 2

Outline

◮ Matrix Factorization
◮ Stochastic Gradient Descent
◮ Distributed SGD with MapReduce
◮ Experiments
◮ Summary


SLIDE 4

Matrix completion visualized

Original image

SLIDE 5

Matrix completion visualized

Original image · Partially observed image

SLIDE 6

Matrix completion visualized

Original image · Partially observed image · Reconstructed image

SLIDE 7

Matrix completion for recommender systems

◮ Discover latent factors (r = 1)

          Avatar   The Matrix   Up
Alice                  4         2
Bob         3          2
Charlie     5                    3

SLIDE 8

Matrix completion for recommender systems

◮ Discover latent factors (r = 1)

                 Avatar (2.24)   The Matrix (1.92)   Up (1.18)
Alice (1.98)                            4               2
Bob (1.21)            3                 2
Charlie (2.30)        5                                 3

SLIDE 9

Matrix completion for recommender systems

◮ Discover latent factors (r = 1)

                 Avatar (2.24)   The Matrix (1.92)   Up (1.18)
Alice (1.98)                        4 (3.8)           2 (2.3)
Bob (1.21)         3 (2.7)          2 (2.3)
Charlie (2.30)     5 (5.2)                            3 (2.7)

◮ Minimum loss

min_{W,H} Σ_{(i,j)∈Z} (Vij − [WH]ij)²

SLIDE 10

Matrix completion for recommender systems

◮ Discover latent factors (r = 1)

                 Avatar (2.24)   The Matrix (1.92)   Up (1.18)
Alice (1.98)       ? (4.4)          4 (3.8)           2 (2.3)
Bob (1.21)         3 (2.7)          2 (2.3)           ? (1.4)
Charlie (2.30)     5 (5.2)          ? (4.4)           3 (2.7)

◮ Minimum loss

min_{W,H} Σ_{(i,j)∈Z} (Vij − [WH]ij)²
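To make the r = 1 example concrete, here is a minimal sketch (assuming numpy; the factor values are the ones shown in parentheses above) that reproduces the predictions and evaluates the plain squared loss over the observed entries. The bias and regularization terms added on the following slides are omitted here.

```python
import numpy as np

# r = 1 factors from the example above: one value per user (W) and per movie (H)
W = np.array([[1.98], [1.21], [2.30]])   # Alice, Bob, Charlie
H = np.array([[2.24, 1.92, 1.18]])       # Avatar, The Matrix, Up

# Training set Z: the observed ratings as (i, j, V_ij) triples
Z = [(0, 1, 4), (0, 2, 2), (1, 0, 3), (1, 1, 2), (2, 0, 5), (2, 2, 3)]

WH = W @ H                                          # all predictions [WH]_ij
loss = sum((v - WH[i, j]) ** 2 for i, j, v in Z)    # Σ (V_ij − [WH]_ij)² over Z

print(np.round(WH, 1))   # rows Alice/Bob/Charlie; e.g. Alice's missing Avatar rating comes out near 4.4
print(round(loss, 3))
```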

SLIDE 11

Matrix completion for recommender systems

◮ Discover latent factors (r = 1)

                 Avatar (2.24)   The Matrix (1.92)   Up (1.18)
Alice (1.98)       ? (4.4)          4 (3.8)           2 (2.3)
Bob (1.21)         3 (2.7)          2 (2.3)           ? (1.4)
Charlie (2.30)     5 (5.2)          ? (4.4)           3 (2.7)

◮ Minimum loss

min_{W,H,u,m} Σ_{(i,j)∈Z} (Vij − µ − ui − mj − [WH]ij)²

◮ Bias

SLIDE 12

Matrix completion for recommender systems

◮ Discover latent factors (r = 1)

                 Avatar (2.24)   The Matrix (1.92)   Up (1.18)
Alice (1.98)       ? (4.4)          4 (3.8)           2 (2.3)
Bob (1.21)         3 (2.7)          2 (2.3)           ? (1.4)
Charlie (2.30)     5 (5.2)          ? (4.4)           3 (2.7)

◮ Minimum loss

min_{W,H,u,m} Σ_{(i,j)∈Z} (Vij − µ − ui − mj − [WH]ij)² + λ(‖W‖² + ‖H‖² + ‖u‖² + ‖m‖²)

◮ Bias, regularization

SLIDE 13

Matrix completion for recommender systems

◮ Discover latent factors (r = 1)

                 Avatar (2.24)   The Matrix (1.92)   Up (1.18)
Alice (1.98)       ? (4.4)          4 (3.8)           2 (2.3)
Bob (1.21)         3 (2.7)          2 (2.3)           ? (1.4)
Charlie (2.30)     5 (5.2)          ? (4.4)           3 (2.7)

◮ Minimum loss

min_{W,H,u,m} Σ_{(i,j,t)∈Zt} (Vij − µ − ui(t) − mj(t) − [W(t)H]ij)² + λ(‖W(t)‖² + ‖H‖² + ‖u(t)‖² + ‖m(t)‖²)

◮ Bias, regularization, time, . . .

SLIDE 14

Generalized Matrix Factorization

◮ A general machine learning problem

◮ Recommender systems, text indexing, face recognition, . . .

SLIDE 15

Generalized Matrix Factorization

◮ A general machine learning problem

◮ Recommender systems, text indexing, face recognition, . . .

◮ Training data

◮ V: m × n input matrix (e.g., rating matrix)
◮ Z: training set of indexes in V (e.g., subset of known ratings)

SLIDE 16

Generalized Matrix Factorization

◮ A general machine learning problem

◮ Recommender systems, text indexing, face recognition, . . .

◮ Training data

◮ V: m × n input matrix (e.g., rating matrix)
◮ Z: training set of indexes in V (e.g., subset of known ratings)

◮ Parameter space

◮ W: row factors (e.g., m × r latent customer factors)
◮ H: column factors (e.g., r × n latent movie factors)

SLIDE 17

Generalized Matrix Factorization

◮ A general machine learning problem

◮ Recommender systems, text indexing, face recognition, . . .

◮ Training data

◮ V: m × n input matrix (e.g., rating matrix)
◮ Z: training set of indexes in V (e.g., subset of known ratings)

◮ Parameter space

◮ W: row factors (e.g., m × r latent customer factors)
◮ H: column factors (e.g., r × n latent movie factors)

◮ Model

◮ Lij(Wi∗, H∗j): loss at element (i, j)
◮ Includes prediction error, regularization, auxiliary information, . . .
◮ Constraints (e.g., non-negativity)

SLIDE 18

Generalized Matrix Factorization

◮ A general machine learning problem

◮ Recommender systems, text indexing, face recognition, . . .

◮ Training data

◮ V: m × n input matrix (e.g., rating matrix)
◮ Z: training set of indexes in V (e.g., subset of known ratings)

◮ Parameter space

◮ W: row factors (e.g., m × r latent customer factors)
◮ H: column factors (e.g., r × n latent movie factors)

◮ Model

◮ Lij(Wi∗, H∗j): loss at element (i, j)
◮ Includes prediction error, regularization, auxiliary information, . . .
◮ Constraints (e.g., non-negativity)

◮ Find best model

argmin_{W,H} Σ_{(i,j)∈Z} Lij(Wi∗, H∗j)
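As a minimal illustration of how these pieces fit together in code (assuming numpy; the sizes, values, and the particular local loss below are made up for the sketch), the training set can be kept as sparse (i, j, Vij) triples while W and H are dense factor matrices, and Lij is just a pluggable function of Wi∗ and H∗j:

```python
import numpy as np

rng = np.random.default_rng(0)

m, n, r = 4, 5, 2                           # toy sizes; real instances are millions x millions
W = rng.normal(scale=0.1, size=(m, r))      # row factors (e.g., latent customer factors)
H = rng.normal(scale=0.1, size=(r, n))      # column factors (e.g., latent movie factors)

# Z: training set of observed cells of V, stored as (i, j, V_ij) triples
Z = [(0, 1, 4.0), (1, 3, 2.0), (2, 0, 5.0), (3, 4, 1.0)]

lam = 0.05                                  # regularization weight (illustrative)

def L_ij(w_i, h_j, v_ij):
    """Local loss at cell (i, j): squared error plus an L2 term (one possible choice)."""
    return (v_ij - w_i @ h_j) ** 2 + lam * (w_i @ w_i + h_j @ h_j)

objective = sum(L_ij(W[i], H[:, j], v) for i, j, v in Z)   # Σ_{(i,j)∈Z} Lij(Wi*, H*j)
```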

SLIDE 19

Successful Applications

◮ Movie recommendation (Netflix, competition papers)

◮ >12M users, >20k movies, 2.4B ratings (projected)
◮ 36GB data, 9.2GB model (projected)
◮ Latent factor model

◮ Website recommendation (Microsoft, WWW10)

◮ 51M users, 15M URLs, 1.2B clicks
◮ 17.8GB data, 161GB metadata, 49GB model
◮ Gaussian non-negative matrix factorization

◮ News personalization (Google, WWW07)

◮ Millions of users, millions of stories, ? clicks
◮ Probabilistic latent semantic indexing

SLIDE 20

Successful Applications

◮ Movie recommendation (Netflix, competition papers)

◮ >12M users, >20k movies, 2.4B ratings (projected)
◮ 36GB data, 9.2GB model (projected)
◮ Latent factor model

◮ Website recommendation (Microsoft, WWW10)

◮ 51M users, 15M URLs, 1.2B clicks
◮ 17.8GB data, 161GB metadata, 49GB model
◮ Gaussian non-negative matrix factorization

◮ News personalization (Google, WWW07)

◮ Millions of users, millions of stories, ? clicks
◮ Probabilistic latent semantic indexing

Distributed processing is necessary!

◮ Big data
◮ Large models
◮ Expensive computations

SLIDE 21

Outline

◮ Matrix Factorization
◮ Stochastic Gradient Descent
◮ Distributed SGD with MapReduce
◮ Experiments
◮ Summary

SLIDE 22

Stochastic Gradient Descent

◮ Find minimum θ∗ of function L

SLIDE 23

Stochastic Gradient Descent

◮ Find minimum θ∗ of function L
◮ Pick a starting point θ0
◮ Approximate gradient L̂′(θ0)

SLIDE 24

Stochastic Gradient Descent

◮ Find minimum θ∗ of function L
◮ Pick a starting point θ0
◮ Approximate gradient L̂′(θ0)

◮ Move “approximately” downhill

SLIDE 25

Stochastic Gradient Descent

◮ Find minimum θ∗ of function L
◮ Pick a starting point θ0
◮ Approximate gradient L̂′(θ0)
◮ Move “approximately” downhill
◮ Stochastic difference equation

θn+1 = θn − ǫn L̂′(θn)

SLIDE 26

Stochastic Gradient Descent

◮ Find minimum θ∗ of function L
◮ Pick a starting point θ0
◮ Approximate gradient L̂′(θ0)
◮ Move “approximately” downhill
◮ Stochastic difference equation

θn+1 = θn − ǫn L̂′(θn)

◮ Under certain conditions, asymptotically approximates (continuous) gradient descent
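A toy illustration of the update rule (not from the talk; assuming numpy): minimize L(θ) = θ² from a noisy gradient with a decreasing step-size schedule, one standard choice satisfying the usual conditions.

```python
import numpy as np

rng = np.random.default_rng(1)

def grad_hat(theta):
    """Approximate gradient of L(theta) = theta^2: the true gradient plus noise."""
    return 2.0 * theta + rng.normal(scale=0.5)

theta = 1.0                          # starting point theta_0
for n in range(10_000):
    eps_n = 0.5 / (n + 1)            # decreasing step sizes epsilon_n
    theta -= eps_n * grad_hat(theta) # theta_{n+1} = theta_n - eps_n * L_hat'(theta_n)

print(theta)                         # ends up near the minimizer theta* = 0
```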

[Contour plot of L with the SGD iterates; plot of q(t) − q∗ against t]

SLIDE 27

Stochastic Gradient Descent for Matrix Factorization

◮ Set θ = (W, H) and use

L(θ) = Σ_{(i,j)∈Z} Lij(Wi∗, H∗j)

SLIDE 28

Stochastic Gradient Descent for Matrix Factorization

◮ Set θ = (W, H) and use

L(θ) = Σ_{(i,j)∈Z} Lij(Wi∗, H∗j)
L′(θ) = Σ_{(i,j)∈Z} L′ij(Wi∗, H∗j)

SLIDE 29

Stochastic Gradient Descent for Matrix Factorization

◮ Set θ = (W, H) and use

L(θ) = Σ_{(i,j)∈Z} Lij(Wi∗, H∗j)
L′(θ) = Σ_{(i,j)∈Z} L′ij(Wi∗, H∗j)
L̂′(θ, z) = N · L′izjz(Wiz∗, H∗jz), where N = |Z|

SLIDE 30

Stochastic Gradient Descent for Matrix Factorization

◮ Set θ = (W, H) and use

L(θ) = Σ_{(i,j)∈Z} Lij(Wi∗, H∗j)
L′(θ) = Σ_{(i,j)∈Z} L′ij(Wi∗, H∗j)
L̂′(θ, z) = N · L′izjz(Wiz∗, H∗jz), where N = |Z|

◮ SGD epoch

  • 1. Pick a random entry z ∈ Z
  • 2. Compute approximate gradient L̂′(θ, z)
  • 3. Update parameters: θn+1 = θn − ǫn L̂′(θn, z)
  • 4. Repeat N times
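A minimal single-machine sketch of such an epoch for the squared loss with L2 regularization (assuming numpy; the sizes, the synthetic training set, and the constants are illustrative, and the scaling factor N from L̂′ is folded into the step size ǫ):

```python
import numpy as np

rng = np.random.default_rng(2)

m, n, r = 100, 80, 5
W = rng.normal(scale=0.1, size=(m, r))
H = rng.normal(scale=0.1, size=(r, n))
Z = [(rng.integers(m), rng.integers(n), rng.normal(loc=3.0)) for _ in range(2000)]

lam, eps = 0.05, 0.01                     # regularization weight and (fixed) step size

def sgd_epoch(W, H, Z):
    N = len(Z)
    for _ in range(N):                    # 4. repeat N times
        i, j, v = Z[rng.integers(N)]      # 1. pick a random training entry z
        e = v - W[i] @ H[:, j]            # 2. gradient of (V_ij - W_i. H_.j)^2 + L2 terms
        grad_Wi = -2 * e * H[:, j] + 2 * lam * W[i]
        grad_Hj = -2 * e * W[i] + 2 * lam * H[:, j]
        W[i] -= eps * grad_Wi             # 3. update only the touched row and column factors
        H[:, j] -= eps * grad_Hj
    return W, H

W, H = sgd_epoch(W, H, Z)
```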

SLIDE 31

Stochastic Gradient Descent on Netflix Data

[Mean loss vs. epoch on Netflix data: SGD compared with LBFGS and ALS]

SLIDE 32

Outline

◮ Matrix Factorization
◮ Stochastic Gradient Descent
◮ Distributed SGD with MapReduce
◮ Experiments
◮ Summary

SLIDE 33

Averaging Techniques

◮ SGD steps depend on each other

θn+1 = θn − ǫn L̂′(θn)

How to distribute?

SLIDE 34

Averaging Techniques

◮ SGD steps depend on each other

θn+1 = θn − ǫn L̂′(θn)

How to distribute?

◮ Parameter mixing (MSGD)

◮ Map: Run independent instances of SGD on subsets of the data (until convergence)
◮ Reduce: Average results

SLIDE 35

Averaging Techniques

[Mean loss vs. epoch on Netflix data: MSGD compared with SGD, LBFGS, and ALS]

SLIDE 36

Averaging Techniques

◮ SGD steps depend on each other

θn+1 = θn − ǫn L̂′(θn)

How to distribute?

◮ Parameter mixing (MSGD)

◮ Map: Run independent instances of SGD on subsets of the data (until convergence)
◮ Reduce: Average results
◮ Does not converge to correct solution! (see the sketch below)
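A tiny way to see why naive averaging is problematic for factorization models (a toy example, not from the talk): two parameter settings can factor the same matrix exactly, yet their average does not.

```python
# Two exact rank-1 factorizations of the 1x1 "matrix" V = [6]:
w1, h1 = 2.0, 3.0        # 2 * 3 = 6
w2, h2 = 3.0, 2.0        # 3 * 2 = 6

# Averaging the parameters of the two runs gives a worse model:
w_avg, h_avg = (w1 + w2) / 2, (h1 + h2) / 2
print(w_avg * h_avg)     # 6.25, not 6: the loss is nonconvex in (W, H)
```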

SLIDE 37

Averaging Techniques

◮ SGD steps depend on each other

θn+1 = θn − ǫn L̂′(θn)

How to distribute?

◮ Parameter mixing (MSGD)

◮ Map: Run independent instances of SGD on subsets of the data (until convergence)
◮ Reduce: Average results
◮ Does not converge to correct solution!

◮ Iterative Parameter mixing (ISGD)

◮ Map: Run independent instances of SGD on subsets of the data (for some time)
◮ Reduce: Average results
◮ Repeat (see the sketch below)
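A toy sketch of this iterative map/average pattern on a one-dimensional least-squares problem (assuming numpy; the sequential loop over partitions stands in for parallel mappers, and all constants are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=2.0, scale=1.0, size=4000)     # minimizer of Σ (x_i - θ)² is mean(x)

def local_sgd(theta, part, eps=0.01, steps=200):
    """Run SGD on one data partition only, 'for some time'."""
    for _ in range(steps):
        xi = part[rng.integers(len(part))]
        theta -= eps * 2.0 * (theta - xi)
    return theta

theta = 0.0
for _ in range(20):                                   # outer MapReduce-style rounds
    parts = np.array_split(rng.permutation(x), 8)
    results = [local_sgd(theta, p) for p in parts]    # Map: independent SGD instances
    theta = float(np.mean(results))                   # Reduce: average the parameters
print(theta, x.mean())                                # close after enough rounds
```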

SLIDE 38

Averaging Techniques

[Mean loss vs. epoch on Netflix data: ISGD compared with MSGD, SGD, LBFGS, and ALS]

SLIDE 39

Averaging Techniques

◮ SGD steps depend on each other

θn+1 = θn − ǫn L̂′(θn)

How to distribute?

◮ Parameter mixing (MSGD)

◮ Map: Run independent instances of SGD on subsets of the data (until convergence)
◮ Reduce: Average results
◮ Does not converge to correct solution!

◮ Iterative Parameter mixing (ISGD)

◮ Map: Run independent instances of SGD on subsets of the data (for some time)
◮ Reduce: Average results
◮ Repeat
◮ Converges slowly!

SLIDE 40

Problem Structure

◮ SGD steps depend on each other

θn+1 = θn − ǫn L̂′(θn)

◮ An SGD step on example z ∈ Z . . .

  • 1. Reads Wiz∗ and H∗jz
  • 2. Performs gradient computation L′izjz(Wiz∗, H∗jz)
  • 3. Updates Wiz∗ and H∗jz

SLIDE 41

Problem Structure

◮ SGD steps depend on each other

θn+1 = θn − ǫn L̂′(θn)

◮ An SGD step on example z ∈ Z . . .

  • 1. Reads Wiz∗ and H∗jz
  • 2. Performs gradient computation L′izjz(Wiz∗, H∗jz)
  • 3. Updates Wiz∗ and H∗jz

SLIDE 42

Problem Structure

◮ SGD steps depend on each other

θn+1 = θn − ǫn L̂′(θn)

◮ An SGD step on example z ∈ Z . . .

  • 1. Reads Wiz∗ and H∗jz
  • 2. Performs gradient computation L′izjz(Wiz∗, H∗jz)
  • 3. Updates Wiz∗ and H∗jz

SLIDE 43

Problem Structure

◮ SGD steps depend on each other

θn+1 = θn − ǫn L̂′(θn)

◮ An SGD step on example z ∈ Z . . .

  • 1. Reads Wiz∗ and H∗jz
  • 2. Performs gradient computation L′izjz(Wiz∗, H∗jz)
  • 3. Updates Wiz∗ and H∗jz

◮ Not all steps are dependent

SLIDE 44

Interchangeability

◮ Two elements z1, z2 ∈ Z are interchangeable if they share neither row nor column
◮ When zn and zn+1 are interchangeable, the SGD steps

θn+1 = θn − ǫ L̂′(θn, zn)

SLIDE 45

Interchangeability

◮ Two elements z1, z2 ∈ Z are interchangeable if they share neither row nor column
◮ When zn and zn+1 are interchangeable, the SGD steps

θn+2 = θn − ǫ L̂′(θn, zn) − ǫ L̂′(θn+1, zn+1)

SLIDE 46

Interchangeability

◮ Two elements z1, z2 ∈ Z are interchangeable if they share neither row nor column
◮ When zn and zn+1 are interchangeable, the SGD steps

θn+2 = θn − ǫ L̂′(θn, zn) − ǫ L̂′(θn+1, zn+1)
     = θn − ǫ L̂′(θn, zn) − ǫ L̂′(θn, zn+1)

become parallelizable!
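This can be checked numerically with a small sketch (assuming numpy; the step size, sizes, and the plain squared loss are illustrative): applying the updates for two interchangeable entries in either order yields exactly the same parameters.

```python
import numpy as np

rng = np.random.default_rng(4)
W, H = rng.normal(size=(4, 2)), rng.normal(size=(2, 4))
eps = 0.1

def step(W, H, z):
    """One SGD step for the squared loss at entry z = (i, j, v); touches only W[i] and H[:, j]."""
    i, j, v = z
    e = v - W[i] @ H[:, j]
    gW, gH = 2 * e * H[:, j], 2 * e * W[i]
    W, H = W.copy(), H.copy()
    W[i] += eps * gW
    H[:, j] += eps * gH
    return W, H

z1, z2 = (0, 1, 3.0), (2, 3, 5.0)      # share neither row nor column -> interchangeable

W12, H12 = step(*step(W, H, z1), z2)   # z1 first, then z2
W21, H21 = step(*step(W, H, z2), z1)   # z2 first, then z1
print(np.allclose(W12, W21) and np.allclose(H12, H21))   # True
```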

SLIDE 47

Exploitation

◮ Block and distribute the input matrix V

Node 1: W1, H1, blocks V11 V12 V13
Node 2: W2, H2, blocks V21 V22 V23
Node 3: W3, H3, blocks V31 V32 V33

SLIDE 48

Exploitation

◮ Block and distribute the input matrix V
◮ High-level approach (Map only)

  • 1. Pick a “diagonal”
  • 2. Run SGD on the diagonal (in parallel)
  • 3. Merge the results
  • 4. Move on to next “diagonal”

◮ Steps 1–3 form a cycle

Node 1: W1, H1, blocks V11 V12 V13
Node 2: W2, H2, blocks V21 V22 V23
Node 3: W3, H3, blocks V31 V32 V33

SLIDE 49

Exploitation

◮ Block and distribute the input matrix V
◮ High-level approach (Map only)

  • 1. Pick a “diagonal”
  • 2. Run SGD on the diagonal (in parallel)
  • 3. Merge the results
  • 4. Move on to next “diagonal”

◮ Steps 1–3 form a cycle

Node 1: W1, H1, V11
Node 2: W2, H2, V22
Node 3: W3, H3, V33

SLIDE 50

Exploitation

◮ Block and distribute the input matrix V
◮ High-level approach (Map only)

  • 1. Pick a “diagonal”
  • 2. Run SGD on the diagonal (in parallel)
  • 3. Merge the results
  • 4. Move on to next “diagonal”

◮ Steps 1–3 form a cycle

◮ Step 2: Simulate sequential SGD
  ◮ Interchangeable blocks
  ◮ Throw dice of how many iterations per block
  ◮ Throw dice of which step sizes per block

Node 1: W1, H1, V11
Node 2: W2, H2, V22
Node 3: W3, H3, V33

SLIDE 51

Exploitation

◮ Block and distribute the input matrix V
◮ High-level approach (Map only)

  • 1. Pick a “diagonal”
  • 2. Run SGD on the diagonal (in parallel)
  • 3. Merge the results
  • 4. Move on to next “diagonal”

◮ Steps 1–3 form a cycle

◮ Step 2: Simulate sequential SGD
  ◮ Interchangeable blocks
  ◮ Throw dice of how many iterations per block
  ◮ Throw dice of which step sizes per block

Node 1: W1, H1, V11
Node 2: W2, H2, V22
Node 3: W3, H3, V33

        V11   V12   V13
        V21   V22   V23
        V31   V32   V33

SLIDE 52

Exploitation

◮ Block and distribute the input matrix V
◮ High-level approach (Map only)

  • 1. Pick a “diagonal”
  • 2. Run SGD on the diagonal (in parallel)
  • 3. Merge the results
  • 4. Move on to next “diagonal”

◮ Steps 1–3 form a cycle

◮ Step 2: Simulate sequential SGD
  ◮ Interchangeable blocks
  ◮ Throw dice of how many iterations per block
  ◮ Throw dice of which step sizes per block

Node 1: W1, H1, V11
Node 2: W2, H2, V22
Node 3: W3, H3, V33

(next diagonal: V12, V23, V31)

SLIDE 53

Exploitation

◮ Block and distribute the input matrix V
◮ High-level approach (Map only)

  • 1. Pick a “diagonal”
  • 2. Run SGD on the diagonal (in parallel)
  • 3. Merge the results
  • 4. Move on to next “diagonal”

◮ Steps 1–3 form a cycle

◮ Step 2: Simulate sequential SGD
  ◮ Interchangeable blocks
  ◮ Throw dice of how many iterations per block
  ◮ Throw dice of which step sizes per block

◮ Instance of “stratified SGD”
◮ Provably correct (a single-process sketch of the schedule follows below)

Node 1: W1, H1, V11
Node 2: W2, H2, V22
Node 3: W3, H3, V33

(next diagonal: V12, V23, V31)
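A single-process sketch of this schedule (assuming numpy; the block count, the synthetic data, and the plain squared loss are illustrative). The inner loop over the blocks of one “diagonal” is exactly the work the d nodes would do in parallel, since those blocks are interchangeable:

```python
import numpy as np

rng = np.random.default_rng(5)
m, n, r, d = 9, 9, 2, 3                       # d x d blocking, one block row/column per node
W = rng.normal(scale=0.1, size=(m, r))
H = rng.normal(scale=0.1, size=(r, n))
Z = [(i, j, 3.0 + rng.normal()) for i in range(m) for j in range(n) if rng.random() < 0.3]

row_blk = lambda i: i * d // m                # which of the d row blocks index i belongs to
col_blk = lambda j: j * d // n

def run_block(b, c, eps=0.01):
    """SGD over the training points that fall into block (b, c) of V."""
    for i, j, v in (z for z in Z if row_blk(z[0]) == b and col_blk(z[1]) == c):
        e = v - W[i] @ H[:, j]
        gW, gH = 2 * e * H[:, j], 2 * e * W[i]
        W[i] += eps * gW
        H[:, j] += eps * gH

for _ in range(10):                           # cycles
    for s in range(d):                        # one cycle visits d "diagonals" (strata)
        for b in range(d):                    # blocks (b, (b+s) % d) are interchangeable,
            run_block(b, (b + s) % d)         # so they could run on d nodes in parallel
```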

SLIDE 54

How does it work?

L = 0.3 L1 + 0.7 L2

[Contour plots of the strata losses L1 and L2 and of the combined loss L; iterate shown after Cycle 0]
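In DSGD the strata are visited according to a schedule; the simplest way to see why the weights 0.3 and 0.7 matter is a toy run (assuming numpy; the two quadratic strata losses are made up) that samples strata in proportion to their weights and still converges to the minimizer of the combined loss L:

```python
import numpy as np

rng = np.random.default_rng(6)

a1, a2 = np.array([-0.4, 0.3]), np.array([0.2, -0.1])     # minimizers of L1 and L2
grad = {0: lambda th: 2 * (th - a1), 1: lambda th: 2 * (th - a2)}
w = [0.3, 0.7]                                            # stratum weights: L = 0.3 L1 + 0.7 L2

theta = np.zeros(2)
for n in range(20_000):
    s = rng.choice(2, p=w)               # visit strata in proportion to their weights
    eps_n = 1.0 / (n + 100)              # decreasing step sizes
    theta -= eps_n * grad[s](theta)

print(theta, 0.3 * a1 + 0.7 * a2)        # both close to the minimizer of L, not of L1 or L2
```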

SLIDE 55

How does it work?

L = 0.3 L1 + 0.7 L2

[Contour plots of L1, L2, and L; iterate shown after Cycle 1]

SLIDE 56

How does it work?

L = 0.3 L1 + 0.7 L2

[Contour plots of L1, L2, and L; iterate shown after Cycle 2]

SLIDE 57

How does it work?

L = 0.3 L1 + 0.7 L2

[Contour plots of L1, L2, and L; iterate shown after Cycle 3]

SLIDE 58

How does it work?

L = 0.3 L1 + 0.7 L2

[Contour plots of L1, L2, and L; iterate shown after Cycle 4]

SLIDE 59

How does it work?

L = 0.3 L1 + 0.7 L2

[Contour plots of L1, L2, and L; iterate shown after Cycle 5]

SLIDE 60

How does it work?

L = 0.3 L1 + 0.7 L2

[Contour plots of L1, L2, and L; iterate shown after Cycle 6]

SLIDE 61

How does it work?

L = 0.3 L1 + 0.7 L2

[Contour plots of L1, L2, and L; iterate shown after Cycle 100]


SLIDE 64

Outline

◮ Matrix Factorization
◮ Stochastic Gradient Descent
◮ Distributed SGD with MapReduce
◮ Experiments
◮ Summary

SLIDE 65

Netflix, Nzsl+L2

480k×18k, 99M nonzeros, rank 50, training

[Training loss (×10^9) vs. elapsed time (minutes): SGD vs. DSGD 1x8, 2x8, 4x8]

SLIDE 66

Netflix, Nzsl+L2

480k×18k, 99M nonzeros, rank 50, training

[Training loss (×10^9) vs. elapsed time (minutes), first 15 minutes: SGD vs. DSGD 1x8, 2x8, 4x8]

SLIDE 67

Netflix, Nzsl+L2

480k×18k, 99M nonzeros, rank 50, training

[Training loss (×10^9) vs. elapsed time (minutes): SGD vs. DSGD 1x8, 2x8, 4x8]

DSGD converges significantly faster than SGD.

SLIDE 68

Netflix, Nzsl+L2

480k×18k, 99M nonzeros, rank 50, training

[Training loss (×10^9) vs. elapsed time (minutes): SGD vs. DSGD 1x8, 2x8, 4x8 and PSGD]

SLIDE 69

Netflix, Nzsl+L2

480k×18k, 99M nonzeros, rank 50, training

[Training loss (×10^9) vs. epoch: SGD vs. DSGD 1x8, 2x8, 4x8 and PSGD]

SLIDE 70

Netflix, Nzsl+L2

480k×18k, 99M nonzeros, rank 50, training

[Training loss (×10^9) vs. epoch: SGD vs. DSGD 1x8, 2x8, 4x8 and PSGD]

Stratification is (currently) not for free.

SLIDE 71

KDD Cup, Nzsl+Nzl2+Bias

1M×625k, 253M nonzeros, rank 60, training

[Training loss (×10^11) vs. time (minutes): SGD vs. DSGD 1x7, 2x7, 4x7]

SLIDE 72

Synthetic data, Nzsl+L2

10M×1M, 1B nonzeros, rank 50, training

[Training loss (×10^9) vs. time (minutes): DALS 8x7, DSGD 8x7, PSGD 1x7]

SLIDE 73

Outline

◮ Matrix Factorization
◮ Stochastic Gradient Descent
◮ Distributed SGD with MapReduce
◮ Experiments
◮ Summary

SLIDE 74

Summary

◮ Matrix factorization

◮ Widely applicable via customized loss functions
◮ Large instances (millions × millions with billions of entries)

◮ Distributed Stochastic Gradient Descent

◮ Simple and versatile
◮ Avoids averaging via novel “stratified SGD” variant
◮ Achieves
  ◮ Fully distributed data/model
  ◮ Fully distributed processing
◮ Competitive to alternative algorithms
◮ Fast, scalable

◮ Future Directions

◮ Improved stratification
◮ Simultaneous computation & communication
◮ Stratification for other models
◮ ...

SLIDE 75

Thank you!
