Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent

Rainer Gemulla
with Peter J. Haas, Yannis Sismanis, Christina Teflioudi, and Faraz Makari

December 17, 2011
Outline

◮ Matrix Factorization
◮ Stochastic Gradient Descent
◮ Distributed SGD with MapReduce
◮ Experiments
◮ Summary
Matrix completion visualized

[Figures: original image; partially observed image; reconstructed image]
Matrix completion for recommender systems

◮ Discover latent factors (r = 1). Factor values in parentheses; predicted ratings next to the known entries:

                      Avatar (2.24)   The Matrix (1.92)   Up (1.18)
    Alice   (1.98)       ?  (4.4)          4  (3.8)        2  (2.3)
    Bob     (1.21)       3  (2.7)          2  (2.3)        ?  (1.4)
    Charlie (2.30)       5  (5.2)          ?  (4.4)        3  (2.7)

◮ Minimize the loss over the observed entries Z:

    \min_{W,H} \sum_{(i,j) \in Z} (V_{ij} - [WH]_{ij})^2

◮ With bias and regularization:

    \min_{W,H,u,m} \sum_{(i,j) \in Z} (V_{ij} - \mu - u_i - m_j - [WH]_{ij})^2
        + \lambda (\|W\|^2 + \|H\|^2 + \|u\|^2 + \|m\|^2)

◮ With bias, regularization, time, ...:

    \min_{W,H,u,m} \sum_{(i,j,t) \in Z_t} (V_{ij} - \mu - u_i(t) - m_j(t) - [W(t)H]_{ij})^2
        + \lambda (\|W(t)\|^2 + \|H\|^2 + \|u(t)\|^2 + \|m(t)\|^2)
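As a quick sanity check of the rank-1 example above: the predicted ratings are just the product of the user factor and the movie factor, i.e., the outer product of the two factor vectors. A minimal NumPy sketch, with the factor values taken from the table:

```python
import numpy as np

# Rank-1 factors from the table above (r = 1: one number per user/movie).
w = np.array([1.98, 1.21, 2.30])   # Alice, Bob, Charlie
h = np.array([2.24, 1.92, 1.18])   # Avatar, The Matrix, Up

# Predictions [WH]_ij = w_i * h_j; for r = 1 this is the outer product.
print(np.round(np.outer(w, h), 1))
# [[4.4 3.8 2.3]
#  [2.7 2.3 1.4]
#  [5.2 4.4 2.7]]
```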
Generalized Matrix Factorization

◮ A general machine learning problem
  ◮ Recommender systems, text indexing, face recognition, ...
◮ Training data
  ◮ V: m × n input matrix (e.g., rating matrix)
  ◮ Z: training set of indexes in V (e.g., subset of known ratings)
◮ Parameter space
  ◮ W: row factors (e.g., m × r latent customer factors)
  ◮ H: column factors (e.g., r × n latent movie factors)
◮ Model
  ◮ L_{ij}(W_{i*}, H_{*j}): loss at element (i, j)
  ◮ Includes prediction error, regularization, auxiliary information, ...
  ◮ Constraints (e.g., non-negativity)
◮ Find the best model:

    \operatorname{argmin}_{W,H} \sum_{(i,j) \in Z} L_{ij}(W_{i*}, H_{*j})
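Since the framework is parameterized only by the element-local loss, one concrete choice makes it tangible. A minimal sketch, assuming squared error plus L2 regularization of the two factor vectors involved (the function name and the weighting are illustrative, not the talk's exact formulation):

```python
import numpy as np

def loss_ij(v_ij, w_i, h_j, lam=0.05):
    """Element-local loss L_ij(W_i*, H_*j): squared prediction error
    at (i, j) plus L2 regularization of the factors it touches."""
    err = v_ij - w_i @ h_j          # prediction error V_ij - [WH]_ij
    return err ** 2 + lam * (w_i @ w_i + h_j @ h_j)
```

The total loss is then the sum of loss_ij over the training set Z, exactly the argmin objective above.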
Successful Applications

◮ Movie recommendation (Netflix, competition papers)
  ◮ >12M users, >20k movies, 2.4B ratings (projected)
  ◮ 36GB data, 9.2GB model (projected)
  ◮ Latent factor model
◮ Website recommendation (Microsoft, WWW10)
  ◮ 51M users, 15M URLs, 1.2B clicks
  ◮ 17.8GB data, 161GB metadata, 49GB model
  ◮ Gaussian non-negative matrix factorization
◮ News personalization (Google, WWW07)
  ◮ Millions of users, millions of stories, ? clicks
  ◮ Probabilistic latent semantic indexing

Distributed processing is necessary!
◮ Big data
◮ Large models
◮ Expensive computations
Stochastic Gradient Descent

◮ Find a minimum θ* of the function L
◮ Pick a starting point θ_0
◮ Approximate the gradient: \hat{L}'(θ_0)
◮ Move "approximately" downhill
◮ Stochastic difference equation:

    \theta_{n+1} = \theta_n - \epsilon_n \hat{L}'(\theta_n)

◮ Under certain conditions, asymptotically approximates (continuous) gradient descent

[Figures: contour plot of L over [-1, 1]^2 with the stochastic descent path; step function q(t) approaching the optimum q*]
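A minimal illustration of the update rule above, assuming a simple quadratic L and gradient estimates corrupted by noise (all names and constants here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_hat(theta):
    """Noisy gradient estimate of L(theta) = ||theta||^2 / 2."""
    return theta + rng.normal(scale=0.1, size=theta.shape)

theta = np.array([1.0, -0.5])            # starting point theta_0
for n in range(1000):
    eps = 0.1 / (1 + 0.01 * n)           # decreasing step sizes eps_n
    theta = theta - eps * grad_hat(theta)
print(theta)                             # close to the minimizer at (0, 0)
```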
Stochastic Gradient Descent for Matrix Factorization

◮ Set θ = (W, H) and use

    L(\theta) = \sum_{(i,j) \in Z} L_{ij}(W_{i*}, H_{*j})
    L'(\theta) = \sum_{(i,j) \in Z} L'_{ij}(W_{i*}, H_{*j})
    \hat{L}'(\theta, z) = N \, L'_{i_z j_z}(W_{i_z *}, H_{* j_z}),  where N = |Z|

◮ SGD epoch
  1. Pick a random entry z ∈ Z
  2. Compute the approximate gradient \hat{L}'(\theta, z)
  3. Update the parameters: \theta_{n+1} = \theta_n - \epsilon_n \hat{L}'(\theta_n, z)
  4. Repeat N times
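A compact sketch of one such epoch for the plain squared loss L_ij = (V_ij − [WH]_ij)²; a sketch only, with regularization and bias terms omitted and the scaling factor N absorbed into the step size, as is common:

```python
import numpy as np

def sgd_epoch(V, Z, W, H, eps):
    """One SGD epoch: N = |Z| random steps, each reading and updating
    only the row W[i] and the column H[:, j] of the chosen entry."""
    N = len(Z)
    for idx in np.random.randint(N, size=N):
        i, j = Z[idx]
        err = V[i, j] - W[i] @ H[:, j]    # prediction error at (i, j)
        w_i = W[i].copy()                 # read both factors before updating
        W[i]    += eps * 2 * err * H[:, j]
        H[:, j] += eps * 2 * err * w_i
    return W, H
```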
Stochastic Gradient Descent on Netflix Data

[Figure: mean loss vs. epoch (1-60) on Netflix data for LBFGS, SGD, and ALS]
Averaging Techniques

◮ SGD steps depend on each other:

    \theta_{n+1} = \theta_n - \epsilon_n \hat{L}'(\theta_n)

  How to distribute?

◮ Parameter mixing (MSGD)
  ◮ Map: run independent instances of SGD on subsets of the data (until convergence)
  ◮ Reduce: average the results
  ◮ Does not converge to the correct solution!

◮ Iterative parameter mixing (ISGD)
  ◮ Map: run independent instances of SGD on subsets of the data (for some time)
  ◮ Reduce: average the results
  ◮ Repeat
  ◮ Converges slowly!

[Figure: mean loss vs. epoch (1-60) on Netflix data for LBFGS, SGD, ALS, MSGD, and ISGD]
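A single-machine simulation of iterative parameter mixing, reusing the sgd_epoch sketch above (the worker count and round count are illustrative):

```python
import numpy as np

def isgd(V, Z, W, H, eps, workers=4, rounds=50):
    """ISGD: each map task runs SGD on its own subset of Z, starting
    from the shared parameters; reduce averages the results; repeat."""
    parts = np.array_split(np.random.permutation(len(Z)), workers)
    for _ in range(rounds):
        results = [sgd_epoch(V, [Z[k] for k in part], W.copy(), H.copy(), eps)
                   for part in parts]                  # map: parallel on a cluster
        W = np.mean([w for w, _ in results], axis=0)   # reduce: average W
        H = np.mean([h for _, h in results], axis=0)   # reduce: average H
    return W, H
```

MSGD is the special case with a single round run to convergence; as noted above, it does not converge to the correct solution.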
Problem Structure

◮ SGD steps depend on each other:

    \theta_{n+1} = \theta_n - \epsilon_n \hat{L}'(\theta_n)

◮ An SGD step on example z ∈ Z ...
  1. Reads W_{i_z *} and H_{* j_z}
  2. Performs the gradient computation L'_{i_z j_z}(W_{i_z *}, H_{* j_z})
  3. Updates W_{i_z *} and H_{* j_z}
◮ Not all steps are dependent: the next example z_{n+1} conflicts with z_n only if it touches the same row of W or the same column of H
Interchangeability

◮ Two elements z_1, z_2 ∈ Z are interchangeable if they share neither row nor column
◮ When z_n and z_{n+1} are interchangeable, the SGD steps

    \theta_{n+2} = \theta_n - \epsilon \hat{L}'(\theta_n, z_n) - \epsilon \hat{L}'(\theta_{n+1}, z_{n+1})
                 = \theta_n - \epsilon \hat{L}'(\theta_n, z_n) - \epsilon \hat{L}'(\theta_n, z_{n+1})

  become parallelizable!
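The condition is easy to state in code; a two-line sketch:

```python
def interchangeable(z1, z2):
    """Entries z1 = (i1, j1) and z2 = (i2, j2) can be processed in either
    order (or in parallel): they touch distinct rows of W and columns of H."""
    (i1, j1), (i2, j2) = z1, z2
    return i1 != i2 and j1 != j2
```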
Exploitation

◮ Block and distribute the input matrix V: with 3 nodes, V is split into a 3×3 grid of blocks V_11, ..., V_33, and node k holds the factor blocks W_k and H_k
◮ High-level approach (Map only)
  1. Pick a "diagonal": a set of blocks that share no rows or columns, e.g., {V_11, V_22, V_33}, then {V_12, V_23, V_31}, ...
  2. Run SGD on the diagonal (in parallel)
  3. Merge the results
  4. Move on to the next "diagonal"
◮ Steps 1-3 form a cycle
◮ Step 2 simulates sequential SGD:
  ◮ Blocks on a diagonal are interchangeable
  ◮ Throw dice for how many SGD iterations to run in each block
  ◮ Throw dice for which step sizes to use in each block
◮ An instance of "stratified SGD"; provably correct (see the sketch below)
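A single-machine sketch of one DSGD cycle, reusing sgd_epoch from above. Assumptions: a square d×d blocking, with Zblocks[(a, b)] holding the training entries that fall into block (a, b). On a real cluster the inner loop runs as d parallel map tasks, since the blocks of a diagonal touch disjoint rows of W and columns of H:

```python
def dsgd_cycle(V, Zblocks, W, H, eps, d=3):
    """One DSGD cycle: visit d 'diagonals' (strata); within a stratum,
    the d blocks are interchangeable, so their SGD runs can be merged
    as if they had been executed sequentially."""
    for s in range(d):                    # stratum s: blocks {(a, (a+s) % d)}
        for a in range(d):                # runs in parallel on a real cluster
            b = (a + s) % d
            W, H = sgd_epoch(V, Zblocks[(a, b)], W, H, eps)
    return W, H
```

The per-block randomization of iteration counts and step sizes mentioned on the slide is omitted here for brevity.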
How does it work?

◮ Toy example: a stratified loss L = 0.3 L_1 + 0.7 L_2; SGD alternates between steps on stratum L_1 and steps on stratum L_2

[Figures: contour plots of L_1, L_2, and of L with the iterate shown after cycles 0-6 and after cycle 100; the iterate approaches the minimizer of L]
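A toy reconstruction of this experiment; the talk does not give the exact L_1 and L_2, so two simple quadratics are assumed here, and the stratum weights 0.3 and 0.7 are applied explicitly in the updates:

```python
import numpy as np

# Hypothetical strata losses L_s(theta) = ||theta - c_s||^2 (assumed).
c1, c2 = np.array([0.4, 0.2]), np.array([-0.3, -0.1])

theta = np.array([0.6, -0.6])
for n in range(100):                             # one cycle = one step per stratum
    eps = 0.1 / (1 + 0.05 * n)                   # decreasing step sizes
    theta = theta - eps * 0.3 * 2 * (theta - c1) # step on stratum L1
    theta = theta - eps * 0.7 * 2 * (theta - c2) # step on stratum L2

print(theta)                 # approaches the minimizer of L ...
print(0.3 * c1 + 0.7 * c2)   # ... the weighted mean (-0.09, -0.01)
```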
Netflix, NZSL+L2
480k × 18k, 99M nonzeros, rank 50, training loss

[Figure: training loss (×10^9) vs. elapsed time, 0-350 minutes, for SGD and DSGD 1×8, 2×8, 4×8; zoomed view of the first 5-15 minutes]

DSGD converges significantly faster than SGD.

[Figure: the same setup including PSGD, plotted as loss vs. epoch (1-50)]

Stratification is (currently) not for free.
KDD Cup, NZSL+NZL2+Bias
1M × 625k, 253M nonzeros, rank 60, training loss

[Figure: training loss (×10^11) vs. time, 10-60 minutes, for SGD and DSGD 1×7, 2×7, 4×7]
Synthetic data, NZSL+L2
10M × 1M, 1B nonzeros, rank 50, training loss

[Figure: training loss (×10^9, log scale) vs. time, 50-200 minutes, for DALS 8×7, DSGD 8×7, and PSGD 1×7]
Summary

◮ Matrix factorization
  ◮ Widely applicable via customized loss functions
  ◮ Large instances (millions × millions with billions of entries)
◮ Distributed Stochastic Gradient Descent
  ◮ Simple and versatile
  ◮ Avoids averaging via a novel "stratified SGD" variant
  ◮ Achieves fully distributed data/model and fully distributed processing
  ◮ Competitive with alternative algorithms; fast and scalable
◮ Future directions
  ◮ Improved stratification
  ◮ Simultaneous computation & communication
  ◮ Stratification for other models
  ◮ ...
Thank you!