Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent
Rainer Gemulla, Peter J. Haas, Yannis Sismanis, Erik Nijkamp
August 23, 2011
Outline
Matrix Factorization Stochastic Gradient Descent Distributed SGD with MapReduce Experiments Summary
2 / 32
Collaborative Filtering

◮ Problem
  ◮ Set of users
  ◮ Set of items (movies, books, jokes, products, stories, ...)
  ◮ Feedback (ratings, purchases, click-throughs, tags, ...)
◮ Predict additional items a user may like
  ◮ Assumption: similar feedback ⇒ similar taste
◮ Example

                Avatar   The Matrix   Up
    Alice          ?          4        2
    Bob            3          2        ?
    Charlie        5          ?        3

◮ Netflix competition: 500k users, 20k movies, 100M movie ratings, 3M question marks

4 / 32
Semantic Factors (Koren et al., 2009)

[Figure: movies and users placed in a two-dimensional latent factor space; one axis runs from "geared toward females" to "geared toward males", the other from "serious" to "escapist". Example placements: The Princess Diaries, Sense and Sensibility, The Color Purple, Amadeus, The Lion King, Braveheart, Lethal Weapon, Independence Day, Ocean's 11, Dumb and Dumber; users Gus and Dave.]

5 / 32
Latent Factor Models

◮ Discover latent factors (r = 1)

                      Avatar    The Matrix     Up
                      (2.24)      (1.92)     (1.18)
    Alice   (1.98)    ? (4.4)     4 (3.8)    2 (2.3)
    Bob     (1.21)    3 (2.7)     2 (2.3)    ? (1.4)
    Charlie (2.30)    5 (5.2)     ? (4.4)    3 (2.7)

◮ Minimum loss

    min_{W,H} ∑_{(i,j)∈Z} (V_ij − [WH]_ij)²

◮ Bias:

    min_{W,H,u,m} ∑_{(i,j)∈Z} (V_ij − µ − u_i − m_j − [WH]_ij)²

◮ Bias, regularization:

    min_{W,H,u,m} ∑_{(i,j)∈Z} (V_ij − µ − u_i − m_j − [WH]_ij)² + λ(‖W‖² + ‖H‖² + ‖u‖² + ‖m‖²)

◮ Bias, regularization, time:

    min_{W,H,u,m} ∑_{(i,j,t)∈Z_t} (V_ij − µ − u_i(t) − m_j(t) − [W(t)H]_ij)² + λ(‖W(t)‖² + ‖H‖² + ‖u(t)‖² + ‖m(t)‖²)

6 / 32
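The rank-1 numbers on this slide can be checked in a few lines of Python; a minimal sketch (the factor values are copied from the slide, and predictions are rounded to one decimal as shown there):

```python
# Rank-1 latent factors from the slide (r = 1): one number per user / movie.
W = {"Alice": 1.98, "Bob": 1.21, "Charlie": 2.30}
H = {"Avatar": 2.24, "The Matrix": 1.92, "Up": 1.18}

# Known ratings: the training set Z (the question marks are simply absent).
V = {("Alice", "The Matrix"): 4, ("Alice", "Up"): 2,
     ("Bob", "Avatar"): 3, ("Bob", "The Matrix"): 2,
     ("Charlie", "Avatar"): 5, ("Charlie", "Up"): 3}

def predict(i, j):
    return W[i] * H[j]          # [WH]_ij for rank 1

# Sum of squared errors over the known entries (i, j) in Z.
loss = sum((v - predict(i, j)) ** 2 for (i, j), v in V.items())

# The predictions fill in the question marks:
print(round(predict("Alice", "Avatar"), 1))        # 4.4
print(round(predict("Bob", "Up"), 1))              # 1.4
print(round(predict("Charlie", "The Matrix"), 1))  # 4.4
```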
Generalized Matrix Factorization

◮ A general machine learning problem
  ◮ Recommender systems, text indexing, face recognition, ...
◮ Training data
  ◮ V: m × n input matrix (e.g., rating matrix)
  ◮ Z: training set of indexes in V (e.g., subset of known ratings)
◮ Parameter space
  ◮ W: row factors (e.g., m × r latent customer factors)
  ◮ H: column factors (e.g., r × n latent movie factors)
◮ Model
  ◮ L_ij(W_i*, H_*j): loss at element (i, j)
  ◮ Includes prediction error, regularization, auxiliary information, ...
  ◮ Constraints (e.g., non-negativity)
◮ Find the best model

    argmin_{W,H} ∑_{(i,j)∈Z} L_ij(W_i*, H_*j)

7 / 32
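The element-wise loss L_ij is the plug-in point of this formulation. A hypothetical sketch of that interface in Python (the names `l2_loss` and `total_loss` are illustrative, not from the paper):

```python
def l2_loss(v, w, h, lam=0.0):
    """L_ij(W_i*, H_*j): squared prediction error plus optional L2 regularization."""
    pred = sum(wk * hk for wk, hk in zip(w, h))                 # [WH]_ij
    reg = lam * (sum(x * x for x in w) + sum(x * x for x in h))
    return (v - pred) ** 2 + reg

def total_loss(V, W, H, loss=l2_loss, **kw):
    """Sum of L_ij over the training set Z (the keys of V)."""
    return sum(loss(v, W[i], H[j], **kw) for (i, j), v in V.items())

# A perfect rank-1 fit of a single known entry has zero unregularized loss:
print(total_loss({(0, 0): 4.0}, [[2.0]], [[2.0]]))  # 0.0
```

Swapping in a different per-element loss (e.g., one that also consults auxiliary information) changes the model without touching the optimization machinery.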
Successful Applications

◮ Movie recommendation (Netflix, competition papers)
  ◮ >12M users, >20k movies, 2.4B ratings (projected)
  ◮ 36GB data, 9.2GB model (projected)
  ◮ Latent factor model
◮ Website recommendation (Microsoft, WWW10)
  ◮ 51M users, 15M URLs, 1.2B clicks
  ◮ 17.8GB data, 161GB metadata, 49GB model
  ◮ Gaussian non-negative matrix factorization
◮ News personalization (Google, WWW07)
  ◮ Millions of users, millions of stories, ? clicks
  ◮ Probabilistic latent semantic indexing

Distributed processing is necessary!
◮ Big data  ◮ Large models  ◮ Expensive computations

8 / 32
Outline
Matrix Factorization Stochastic Gradient Descent Distributed SGD with MapReduce Experiments Summary
9 / 32
Stochastic Gradient Descent

◮ Find a minimum θ* of the function L
◮ Pick a starting point θ_0
◮ Approximate the gradient: L̂′(θ_0)
◮ Jump "approximately" downhill
◮ Stochastic difference equation

    θ_{n+1} = θ_n − ε_n L̂′(θ_n)

◮ Under certain conditions, asymptotically approximates (continuous) gradient descent

[Figure: contour plot of L with the SGD iterates jumping approximately downhill toward θ*; inset: distance to the optimum over time]

10 / 32
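The steps above can be played out on a toy problem; a sketch with a made-up loss L(x, y) = x² + 2y² written as a sum of two terms, where each SGD step samples one term and scales its gradient to get an unbiased estimate of L′:

```python
import random

# Gradients of the two terms of L(x, y) = x^2 + 2*y^2.
terms = [lambda x, y: (2 * x, 0.0),      # gradient of x^2
         lambda x, y: (0.0, 4 * y)]      # gradient of 2*y^2

random.seed(0)
theta = [1.0, 1.0]                       # starting point theta_0
for n in range(2000):
    eps = 0.1 / (1 + 0.01 * n)           # decreasing step size eps_n
    gx, gy = random.choice(terms)(*theta)
    # Scale by the number of terms (2): unbiased estimate of the full gradient.
    theta[0] -= eps * 2 * gx
    theta[1] -= eps * 2 * gy

print(theta)  # both coordinates end up very close to the minimizer (0, 0)
```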
Stochastic Gradient Descent for Matrix Factorization

◮ Set θ = (W, H) and use

    L(θ) = ∑_{(i,j)∈Z} L_ij(W_i*, H_*j)
    L′(θ) = ∑_{(i,j)∈Z} L′_ij(W_i*, H_*j)
    L̂′(θ, z) = N L′_{i_z j_z}(W_{i_z*}, H_{*j_z}),  where N = |Z|

◮ SGD epoch
  1. Pick a random entry z ∈ Z
  2. Compute the approximate gradient L̂′(θ, z)
  3. Update the parameters: θ_{n+1} = θ_n − ε_n L̂′(θ_n, z)
  4. Repeat N times

11 / 32
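One SGD epoch for the plain squared loss L_ij = (V_ij − W_i* · H_*j)² can be sketched as follows (toy data; the constant step size and rank r = 2 are arbitrary choices for the illustration):

```python
import random

random.seed(1)
m, n, r, eps = 4, 3, 2, 0.05
# Training set Z: the known entries of V, as a dict keyed by (i, j).
V = {(0, 1): 4, (0, 2): 2, (1, 0): 3, (1, 1): 2, (2, 0): 5, (2, 2): 3, (3, 0): 1}
Z = list(V)
W = [[random.uniform(0, 1) for _ in range(r)] for _ in range(m)]
H = [[random.uniform(0, 1) for _ in range(r)] for _ in range(n)]

def sgd_epoch():
    for _ in range(len(Z)):                 # N = |Z| steps per epoch
        i, j = random.choice(Z)             # 1. pick a random entry z
        err = V[i, j] - sum(W[i][k] * H[j][k] for k in range(r))
        for k in range(r):                  # 2.+3. step on W_i* and H_*j only
            W[i][k], H[j][k] = (W[i][k] + eps * 2 * err * H[j][k],
                                H[j][k] + eps * 2 * err * W[i][k])

def loss():
    return sum((v - sum(W[i][k] * H[j][k] for k in range(r))) ** 2
               for (i, j), v in V.items())

before = loss()
for _ in range(200):
    sgd_epoch()
print(loss() < before)  # the training loss drops from its random-init value
```

Note that each step reads and writes only the row W_i* and column H_*j of the sampled entry; this locality is what the distributed algorithm later exploits.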
Stochastic Gradient Descent on Netflix Data

[Figure: mean loss vs. epoch (10-60) on Netflix data; curves for L-BFGS and SGD]

12 / 32
Comparison

◮ Per epoch, assuming O(r) gradient computation per element

                             GD              SGD
    Algorithm                Deterministic   Randomized
    Gradient computations    1               N
    Gradient types           Exact           Approximate
    Parameter updates        1               N
    Time                     O(rN)           O(rN)
    Space                    O((m + n)r)     O((m + n)r)

◮ Why stochastic?
  ◮ Fast convergence to the vicinity of the optimum
  ◮ Randomization may help escape local minima
  ◮ Exploitation of "repeated structure"

13 / 32
Outline
Matrix Factorization Stochastic Gradient Descent Distributed SGD with MapReduce Experiments Summary
14 / 32
Averaging Techniques

◮ SGD steps depend on each other:

    θ_{n+1} = θ_n − ε_n L̂′(θ_n)

  How to distribute?
◮ Parameter mixing (ISGD)
  ◮ Map: run independent instances of SGD on subsets of the data (until convergence)
  ◮ Reduce: average the results

15 / 32
Averaging Techniques

[Figure: mean loss vs. epoch (10-60) on Netflix data; curves for L-BFGS, SGD, and ISGD]

16 / 32
Averaging Techniques

◮ Parameter mixing (ISGD)
  ◮ Map: run independent instances of SGD on subsets of the data (until convergence)
  ◮ Reduce: average the results
  ◮ Does not converge to the correct solution!
◮ Iterative parameter mixing (PSGD)
  ◮ Map: run independent instances of SGD on subsets of the data (for some time)
  ◮ Reduce: average the results
  ◮ Repeat

17 / 32
Averaging Techniques

[Figure: mean loss vs. epoch (10-60) on Netflix data; curves for L-BFGS, SGD, ISGD, and PSGD]

18 / 32
Averaging Techniques

◮ Parameter mixing (ISGD): does not converge to the correct solution!
◮ Iterative parameter mixing (PSGD): converges slowly!

19 / 32
Problem Structure

◮ SGD steps depend on each other:

    θ_{n+1} = θ_n − ε_n L̂′(θ_n)

◮ An SGD step on example z ∈ Z ...
  1. Reads W_{i_z*} and H_{*j_z}
  2. Performs the gradient computation L′_{i_z j_z}(W_{i_z*}, H_{*j_z})
  3. Updates W_{i_z*} and H_{*j_z}
◮ Not all steps are dependent

[Figure: successive examples z_n and z_{n+1} in V may touch the same row of W, the same column of H, or neither]

20 / 32
Interchangeability

◮ Two elements z1, z2 ∈ Z are interchangeable if they share neither row nor column
◮ When z_n and z_{n+1} are interchangeable, the SGD steps

    θ_{n+2} = θ_n − ε L̂′(θ_n, z_n) − ε L̂′(θ_{n+1}, z_{n+1})
            = θ_n − ε L̂′(θ_n, z_n) − ε L̂′(θ_n, z_{n+1}),

  become parallelizable!

21 / 32
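The commutation can be checked directly; a sketch with rank r = 1 and squared loss (all values are made up for the illustration):

```python
eps = 0.1
# z1 = (0, 0) and z2 = (1, 1) share neither a row nor a column.
V = {(0, 0): 4.0, (1, 1): 2.0}

def step(W, H, i, j):
    """One SGD step on entry (i, j) for loss (V_ij - W_i * H_j)^2, rank 1."""
    W, H = W.copy(), H.copy()
    err = V[i, j] - W[i] * H[j]
    W[i], H[j] = (W[i] + eps * 2 * err * H[j],   # simultaneous update
                  H[j] + eps * 2 * err * W[i])
    return W, H

W0, H0 = [0.5, 0.5], [0.5, 0.5]
a = step(*step(W0, H0, 0, 0), 1, 1)   # z1 then z2
b = step(*step(W0, H0, 1, 1), 0, 0)   # z2 then z1
print(a == b)                         # interchangeable steps commute exactly

V[(0, 1)] = 3.0                       # (0, 0) and (0, 1) share row 0 ...
c = step(*step(W0, H0, 0, 0), 0, 1)
d = step(*step(W0, H0, 0, 1), 0, 0)
print(c == d)                         # ... so here the order does matter
```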
Exploitation

◮ Block and distribute the input matrix V across nodes (Node 1, Node 2, Node 3)

           H1     H2     H3
    W1    V11    V12    V13
    W2    V21    V22    V23
    W3    V31    V32    V33

◮ High-level approach (Map only)
  1. Pick a "diagonal" (e.g., {V11, V22, V33}, then {V12, V23, V31}, ...)
  2. Run SGD on the diagonal (in parallel)
  3. Merge the results
  4. Move on to the next "diagonal"
◮ Steps 1-3 form a cycle
◮ Step 2: simulate sequential SGD
  ◮ The blocks on a diagonal are interchangeable
  ◮ Throw dice on how many iterations per block
  ◮ Throw dice on which step sizes per block
◮ An instance of "stratified SGD"
◮ Provably correct

22 / 32
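A single-process sketch of this diagonal schedule (block count, step-size policy, and data are simplified assumptions, and the "parallel" blocks of a diagonal are simply run independently here rather than on separate nodes):

```python
import random

# Block V into d x d strata. Within one diagonal {(b, (b + s) % d)} the blocks
# share no rows of W and no columns of H, so their SGD steps are
# interchangeable and could run on separate Map tasks.
random.seed(2)
d, m, n, r, eps = 3, 6, 6, 1, 0.05
V = {(i, j): random.uniform(1, 5) for i in range(m) for j in range(n)
     if random.random() < 0.5}                       # sparse training set Z
W = [[0.5] * r for _ in range(m)]
H = [[0.5] * r for _ in range(n)]

def block(b, c):                                     # entries of block V_bc
    return [(i, j) for (i, j) in V
            if i * d // m == b and j * d // n == c]

def sgd_on_block(entries):                           # would run on one node
    for i, j in entries:
        err = V[i, j] - sum(W[i][k] * H[j][k] for k in range(r))
        for k in range(r):
            W[i][k], H[j][k] = (W[i][k] + eps * 2 * err * H[j][k],
                                H[j][k] + eps * 2 * err * W[i][k])

def cycle():                                         # steps 1-3, d times
    for s in range(d):                               # pick a diagonal
        for b in range(d):                           # independent blocks:
            sgd_on_block(block(b, (b + s) % d))      # parallelizable in Map

def total_loss():
    return sum((v - sum(W[i][k] * H[j][k] for k in range(r))) ** 2
               for (i, j), v in V.items())

loss_before = total_loss()
for _ in range(50):
    cycle()
loss_after = total_loss()
print(loss_after < loss_before)   # the stratified schedule still descends
```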
How does it work?

[Figure: contour plots of the stratum losses L1 and L2 and the combined loss L = 0.3 L1 + 0.7 L2, shown over cycles 0, 1, 2, ..., 6 and cycle 100; the stratified SGD iterate alternates between descending on L1 and on L2, and settles at the minimum of L]

23 / 32
Outline
Matrix Factorization Stochastic Gradient Descent Distributed SGD with MapReduce Experiments Summary
24 / 32
DSGD scales well (Netflix, NZSL+L2)

[Figure: mean loss vs. time (hours); curves for SGD and for DSGD with 1x7, 2x7, 4x7, and 8x7 nodes; adding nodes reaches the same loss in less time]

25 / 32
DSGD is fast (8x8, Netflix, NZSL)

[Figure: mean loss vs. time (hours, 0-2); curves for L-BFGS, DSGD, ISGD, PSGD, and ALS]

26 / 32
DSGD is fast (8x8, Netflix, NZSL+L2)

[Figure: mean loss vs. time (hours, 0-2); curves for L-BFGS, DSGD, ISGD, PSGD, and ALS]

27 / 32
DSGD is fast (8x8, synth., NZSL+L2)

[Figure: mean loss (log scale, 1-1000) vs. time (hours, 5-20) on synthetic data; curves for DSGD, PSGD, and ALS]

28 / 32
DSGD runs on Hadoop

[Figure: wall-clock time per epoch (seconds). Left, fixed CPU (@24 cores): data sizes 100M, 400M, 1.6B, 6.4B, 25.6B nonzero entries, with times scaling roughly as 1x, 1.3x, 2.3x, 6.6x, 23.8x. Right, scaled CPU: 1.6B entries @ 5 cores, 6.4B @ 20 cores, 25.6B @ 80 cores, with times 1x, 1x, 1.3x]

(25.6B entries > 1/2 TB of data)

29 / 32
Outline
Matrix Factorization Stochastic Gradient Descent Distributed SGD with MapReduce Experiments Summary
30 / 32
Summary

◮ Matrix factorization
  ◮ Widely applicable via customized loss functions
  ◮ Large instances (millions × millions with billions of entries)
◮ Distributed Stochastic Gradient Descent
  ◮ Simple and versatile
  ◮ Avoids averaging via a novel "stratified SGD" variant
  ◮ Achieves:
    ◮ Fully distributed data/model
    ◮ Fully distributed processing
    ◮ Same or better loss
    ◮ Faster
    ◮ Good scalability
◮ Future Directions
  ◮ More decompositions (e.g., losses at 0)
  ◮ Tensors
  ◮ Stratified SGD for other models
  ◮ ...

31 / 32
Thank you!
32 / 32