NOMAD: A Distributed Framework for Latent Variable Models

1. NOMAD: A Distributed Framework for Latent Variable Models. Inderjit S. Dhillon, Department of Computer Science, University of Texas at Austin. Joint work with H.-F. Yu, C.-J. Hsieh, H. Yun, and S.V.N. Vishwanathan. NIPS 2014 Workshop: Distributed Machine Learning and Matrix Computations, Dec 12, 2014.

2. Outline. Challenges. Matrix Completion: stochastic gradient method; existing distributed approaches; our solution, NOMAD-MF. Latent Dirichlet Allocation (LDA): Gibbs sampling; existing distributed solutions (AD-LDA, Yahoo! LDA); our solution, F+NOMAD-LDA.

3. Large-scale Latent Variable Modeling. Latent variable models are very useful in many applications: latent factor models for recommender systems (e.g., matrix factorization) and topic models for document corpora (e.g., LDA). Data is growing fast: almost 2.5 × 10^18 bytes of data are added each day, and 90% of the world's data today was generated in the past two years.

4. Challenges. Challenges arise at both the algorithmic and the hardware level. Many effective algorithms involve fine-grained iterative computation, which is hard to parallelize. Many current parallel approaches rely on bulk synchronization (CPU power is wasted while communicating), complicated locking mechanisms (hard to scale to many machines), or asynchronous computation with a parameter server (not serializable, with the danger of stale parameters). The proposed NOMAD framework uses access-graph analysis to exploit parallelism; it relies on asynchronous computation, non-blocking communication, and lock-free data structures; it is serializable (or almost serializable); and it has been applied successfully to MF and LDA.

5. Matrix Factorization: Recommender Systems

6. Recommender Systems

7. Matrix Factorization Approach: A ≈ WH^T

9. Matrix Factorization Approach.
min_{W ∈ R^{m×k}, H ∈ R^{n×k}} Σ_{(i,j)∈Ω} (A_ij − w_i^T h_j)^2 + λ (‖W‖_F^2 + ‖H‖_F^2), where Ω = {(i,j) | A_ij is observed}.
The regularization terms avoid over-fitting. The factorization maps users and items to a latent feature space R^k: the i-th user corresponds to the i-th row of W, w_i^T, and the j-th item to the j-th column of H^T, h_j. The inner product w_i^T h_j measures their interaction.
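
For concreteness, here is a minimal NumPy sketch that evaluates this objective; the function name mf_objective and the representation of Ω as a list of (i, j, rating) triples are illustrative choices, not from the talk.

```python
import numpy as np

def mf_objective(ratings, W, H, lam):
    """Regularized squared error over the observed entries.

    ratings : list of (i, j, A_ij) triples, i.e. the observed set Omega
    W       : m x k user-factor matrix; H : n x k item-factor matrix
    lam     : regularization parameter lambda
    """
    loss = sum((a_ij - W[i] @ H[j]) ** 2 for i, j, a_ij in ratings)
    reg = lam * (np.linalg.norm(W, "fro") ** 2 + np.linalg.norm(H, "fro") ** 2)
    return loss + reg
```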

10. SGM: Stochastic Gradient Method.
SGM update: pick (i, j) ∈ Ω, then
R_ij ← A_ij − w_i^T h_j,
w_i ← w_i − η ( (λ/|Ω_i|) w_i − R_ij h_j ),
h_j ← h_j − η ( (λ/|Ω̄_j|) h_j − R_ij w_i ),
where Ω_i is the set of observed ratings in the i-th row and Ω̄_j is the set of observed ratings in the j-th column. [Figure: 3×3 example with rows w_1^T, w_2^T, w_3^T, columns h_1, h_2, h_3, and entries A_11 through A_33.]
One iteration consists of |Ω| updates. Time per update: O(k); time per iteration: O(|Ω| k), better than the O(|Ω| k^2) of ALS.
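
A single-threaded sketch of one epoch of these updates, assuming the same (i, j, rating) triple representation as above; the per-pick scalings λ/|Ω_i| and λ/|Ω̄_j| follow the update rule on the slide, and all names are illustrative.

```python
import numpy as np

def sgd_epoch(ratings, W, H, lam, eta, rng):
    """One epoch of stochastic gradient updates over the observed entries."""
    # |Omega_i| and |Omega_bar_j|: number of observed ratings per row / column.
    row_cnt = np.zeros(W.shape[0])
    col_cnt = np.zeros(H.shape[0])
    for i, j, _ in ratings:
        row_cnt[i] += 1
        col_cnt[j] += 1

    for idx in rng.permutation(len(ratings)):        # random order, |Omega| updates
        i, j, a_ij = ratings[idx]
        r_ij = a_ij - W[i] @ H[j]                    # residual R_ij
        w_old = W[i].copy()                          # h_j's update uses the old w_i
        W[i] -= eta * ((lam / row_cnt[i]) * W[i] - r_ij * H[j])   # O(k) work
        H[j] -= eta * ((lam / col_cnt[j]) * H[j] - r_ij * w_old)  # O(k) work
```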

11. Parallel Stochastic Gradient Descent for MF. Challenge: direct parallel updates lead to memory conflicts. Multi-core parallelization: Hogwild [Niu et al., 2011], Jellyfish [Recht et al., 2011], FPSGD** [Zhuang et al., 2013]. Multi-machine parallelization: DSGD [Gemulla et al., 2011], DSGD++ [Teflioudi et al., 2013].

12. DSGD/Jellyfish [Gemulla et al., 2011; Recht et al., 2011]. [Figure: the rating matrix is partitioned into blocks; in each sub-epoch the workers update a set of blocks that share no rows or columns, then "synchronize and communicate" before moving on to the next set of blocks.] A schematic of this block scheduling is sketched below.
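
As a rough illustration of that synchronous pattern (not code from DSGD itself; the p×p blocking and the stratum rotation rule are assumptions taken from the figure), the schedule can be sketched as:

```python
def dsgd_strata(p):
    """For p workers and a p x p blocking of A, yield for each sub-epoch the
    list of (row_block, col_block) pairs processed in parallel; within a
    stratum no two blocks share a row or a column, so updates do not conflict."""
    for s in range(p):
        yield [(w, (w + s) % p) for w in range(p)]

# Example with p = 4: four sub-epochs, each ending in a "synchronize and
# communicate" barrier where workers exchange their blocks of H.
for stratum in dsgd_strata(4):
    print(stratum)
```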

13. Proposed Asynchronous Approach: NOMAD-MF [Yun et al., 2014]

14. Motivation. Most existing parallel approaches require one or more of the following. (1) Synchronization (e.g., ALS, DSGD/Jellyfish, DSGD++, CCD++): computing power is wasted because computation and communication are interleaved, and there is the curse of the last reducer. (2) Locking (e.g., parallel SGD, FPSGD**): a standard way to avoid conflicts and guarantee serializability, but complicated remote locking slows down computation and is hard to implement efficiently on a distributed system. (3) Computation using stale values (e.g., Hogwild, asynchronous SGD with a parameter server): lacks serializability. Q: Can we avoid both synchronization and locking, keep the CPU from being idle, and still guarantee serializability?

15. Our answer: NOMAD. A: Yes, NOMAD keeps the CPU and the network busy simultaneously. The stochastic gradient update rule involves only a small set of variables. Nomadic token passing, widely used in telecommunications, avoids conflicts without explicit remote locking. Idea: "owner computes". NOMAD uses multiple "active tokens" and nomadic passing. Features: fully asynchronous computation, lock-free implementation, non-blocking communication, and a serializable update sequence.

16. Access Graph for Stochastic Gradient. Access graph G = (V, E) with V = {w_i} ∪ {h_j} and E = {e_ij : (i, j) ∈ Ω}; it is bipartite between user nodes w_i and item nodes h_j. Connection to SG: each edge e_ij corresponds to one SG update, which accesses only w_i and h_j. Parallelism: edges without a common node can be updated in parallel, i.e., we want to identify a matching in the graph. Nomadic token passing is a mechanism that ensures the active edges always form a matching, so serializability is guaranteed (a small check of this condition is sketched below).
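
A tiny illustrative sketch of this serializability condition (is_matching is not a function from NOMAD): a set of concurrently active edges is safe exactly when no two of them touch the same w_i or the same h_j.

```python
def is_matching(active_edges):
    """active_edges: iterable of (i, j) pairs, one per in-flight SG update.
    Returns True iff the edges form a matching in the access graph, i.e.
    no row index i and no column index j appears more than once."""
    rows, cols = set(), set()
    for i, j in active_edges:
        if i in rows or j in cols:
            return False            # two updates would share w_i or h_j
        rows.add(i)
        cols.add(j)
    return True
```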

17. More Details. Nomadic tokens for {h_j}: n tokens (j, h_j), each taking O(k) space. Workers: p workers; each worker is a computing unit plus a concurrent token queue, and owns a block of W (O(mk/p) space) and a block row of A (O(|Ω|/p) entries). [Figure: row-wise partitioning of A and W across the workers.] A simplified worker loop is sketched below.
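
A much-simplified, single-machine sketch of the worker loop (the actual NOMAD implementation is a distributed C++ system; the queue/dict data structures, the plain λ regularization scaling, and the random forwarding rule here are assumptions for illustration only):

```python
import queue
import numpy as np

def nomad_worker(my_queue, all_queues, W, local_rows, lam, eta, n_tokens, rng):
    """One NOMAD worker ("owner computes"): it alone updates its rows of W.

    my_queue   : this worker's concurrent token queue (queue.Queue)
    all_queues : every worker's queue, so processed tokens can be forwarded
    local_rows : {i: {j: A_ij}} for the block row of A owned by this worker
    """
    for _ in range(n_tokens):
        j, h_j = my_queue.get()                  # wait for an item token (j, h_j)
        for i, row in local_rows.items():
            if j in row:                         # observed rating A_ij
                r_ij = row[j] - W[i] @ h_j
                w_old = W[i].copy()
                W[i] -= eta * (lam * W[i] - r_ij * h_j)
                h_j -= eta * (lam * h_j - r_ij * w_old)
        all_queues[rng.integers(len(all_queues))].put((j, h_j))   # nomadic pass-on

# Usage sketch: each of p workers runs this loop in its own thread, e.g.
#   queues = [queue.Queue() for _ in range(p)]
#   for j in range(n): queues[j % p].put((j, H[j].copy()))
# so the n item tokens circulate while every worker keeps computing.
```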

18. Illustration of NOMAD communication (a sequence repeated through slide 28). [Figure animation: successive snapshots of item tokens migrating among the workers while each worker keeps updating its own rows of W with the tokens it currently holds, with no global synchronization point.]

29. Comparison on a Multi-core System. On a 32-core processor with enough RAM, using 30 cores. Comparison of NOMAD, FPSGD**, and CCD++. [Plots: test RMSE vs. seconds. Left: Netflix (100M ratings), machines = 1, cores = 30, λ = 0.05, k = 100. Right: Yahoo! (250M ratings), machines = 1, cores = 30, λ = 1.00, k = 100.]

30. Comparison on a Distributed System. On a distributed system with 32 machines. Comparison of NOMAD, DSGD, DSGD++, and CCD++. [Plots: test RMSE vs. seconds. Left: Netflix (100M ratings), machines = 32, cores = 4, λ = 0.05, k = 100. Right: Yahoo! (250M ratings), machines = 32, cores = 4, λ = 1.00, k = 100.]
