SLIDE 1

NOMAD: A Distributed Framework for Latent Variable Models

Inderjit S. Dhillon, Department of Computer Science, University of Texas at Austin

Joint work with H.-F. Yu, C.-J. Hsieh, H. Yun, and S.V.N. Vishwanathan

NIPS 2014 Workshop: Distributed Machine Learning and Matrix Computations

SLIDE 2

Outline

Challenges

Matrix Completion
  Stochastic Gradient Method
  Existing Distributed Approaches
  Our Solution: NOMAD-MF

Latent Dirichlet Allocation (LDA)
  Gibbs Sampling
  Existing Distributed Solutions: AD-LDA, Yahoo! LDA
  Our Solution: F+NOMAD-LDA

SLIDE 3

Large-scale Latent Variable Modeling

Latent Variable Models: very useful in many applications

Latent models for recommender systems (e.g., MF)
Topic models for document corpora (e.g., LDA)

Fast growth of data

Almost 2.5 × 10^18 bytes of data added each day
90% of the world's data today was generated in the past two years

SLIDE 4

Challenges

Algorithmic as well as hardware level

Many effective algorithms involve fine-grained iterative computation ⇒ hard to parallelize

Many current parallel approaches:

  bulk synchronization ⇒ wasted CPU power while communicating
  complicated locking mechanisms ⇒ hard to scale to many machines
  asynchronous computation using a parameter server ⇒ not serializable, danger of stale parameters

Proposed NOMAD Framework

  access graph analysis to exploit parallelism
  asynchronous computation, non-blocking communication, and lock-free implementation
  serializable (or almost serializable)
  successful applications: MF and LDA

SLIDE 5

Matrix Factorization: Recommender Systems

SLIDE 6

Recommender Systems

SLIDE 7

Matrix Factorization Approach: A ≈ WH^T

SLIDE 8

Matrix Factorization Approach: A ≈ WH^T

SLIDE 9

Matrix Factorization Approach

min_{W ∈ R^{m×k}, H ∈ R^{n×k}}  Σ_{(i,j)∈Ω} (A_ij − w_i^T h_j)^2 + λ (‖W‖_F^2 + ‖H‖_F^2)

Ω = {(i, j) | A_ij is observed}
Regularization terms to avoid over-fitting
Maps users/items to the latent feature space R^k:
  the ith user ⇒ the ith row of W, w_i^T
  the jth item ⇒ the jth column of H^T, h_j
w_i^T h_j measures the interaction between user i and item j

SLIDE 10

SGM: Stochastic Gradient Method

SGM update: pick (i, j) ∈ Ω

  R_ij ← A_ij − w_i^T h_j
  w_i ← w_i − η ( (λ/|Ω_i|) w_i − R_ij h_j )
  h_j ← h_j − η ( (λ/|Ω̄_j|) h_j − R_ij w_i )

Ω_i: observed ratings in the i-th row; Ω̄_j: observed ratings in the j-th column

[Figure: a 3×3 example matrix A with row factors w_1^T, w_2^T, w_3^T and column factors h_1, h_2, h_3]

One iteration: |Ω| updates
Time per update: O(k)
Time per iteration: O(|Ω|k), better than O(|Ω|k^2) for ALS
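A minimal NumPy sketch of one such pass over Ω (illustrative only; the names, and the counts omega_i / omega_j holding |Ω_i| and |Ω̄_j|, are ours, not the NOMAD code):

import numpy as np

def sgd_epoch(entries, W, H, omega_i, omega_j, lam=0.05, eta=0.01):
    """One pass of stochastic gradient updates over the observed entries Omega.

    entries: list of (i, j, a_ij) observed ratings
    W: m x k user factors, H: n x k item factors
    omega_i[i] / omega_j[j]: number of observed ratings in row i / column j
    """
    for i, j, a_ij in entries:
        r_ij = a_ij - W[i] @ H[j]                         # residual R_ij
        grad_w = (lam / omega_i[i]) * W[i] - r_ij * H[j]
        grad_h = (lam / omega_j[j]) * H[j] - r_ij * W[i]
        W[i] -= eta * grad_w                              # w_i update, O(k)
        H[j] -= eta * grad_h                              # h_j update, O(k)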

SLIDE 11

Parallel Stochastic Gradient Descent for MF

Challenge: direct parallel updates ⇒ memory conflicts.

Multi-core parallelization:
  Hogwild [Niu et al, 2011]
  Jellyfish [Recht et al, 2011]
  FPSGD** [Zhuang et al, 2013]

Multi-machine parallelization:
  DSGD [Gemulla et al, 2011]
  DSGD++ [Teflioudi et al, 2013]

SLIDE 12

DSGD/JellyFish [Gemulla et al, 2011; Recht et al, 2011]

[Figure: the rating matrix partitioned into blocks; workers process disjoint blocks, then synchronize and communicate before moving on to the next set of blocks, and synchronize and communicate again]

SLIDE 13

Proposed Asynchronous Approach: NOMAD-MF [Yun et al, 2014]

SLIDE 14

Motivation

Most existing parallel approaches require

Synchronization, and/or
  E.g., ALS, DSGD/JellyFish, DSGD++, CCD++
  Computing power is wasted: interleaved computation and communication; curse of the last reducer

Locking, and/or
  E.g., parallel SGD, FPSGD**
  A standard way to avoid conflicts and guarantee serializability
  Complicated remote locking slows down the computation
  Hard to implement efficient locking on a distributed system

Computation using stale values
  E.g., Hogwild, asynchronous SGD using a parameter server
  Lack of serializability

Q: Can we avoid both synchronization and locking, but keep CPUs from being idle and guarantee serializability?

SLIDE 15

Our answer: NOMAD

A: Yes, NOMAD keeps the CPU and the network busy simultaneously.

Stochastic gradient update rule
  only a small set of variables is involved

Nomadic token passing
  widely used in telecommunications
  avoids conflicts without explicit remote locking
  Idea: "owner computes"
  NOMAD: multiple "active tokens" and nomadic passing

Features (see the worker-loop sketch below):
  fully asynchronous computation
  lock-free implementation
  non-blocking communication
  serializable update sequence
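A structural sketch of one NOMAD worker in Python (the names, the round-robin destination choice, and the plain λ regularization are illustrative simplifications, not the authors' implementation):

import numpy as np   # w_i and h_j are assumed to be 1-D arrays of length k

def nomad_worker(my_queue, peer_queues, W_block, A_block, eta, lam):
    """One worker: repeatedly receive a token (j, h_j), update, and pass it on.

    my_queue:    this worker's concurrent token queue
    peer_queues: the other workers' token queues
    W_block:     dict i -> w_i for the user rows owned by this worker
    A_block:     dict j -> list of (i, A_ij) ratings local to this worker
    """
    rr = 0                                     # round-robin choice of the next owner
    while True:
        j, h_j = my_queue.get()                # receive a nomadic token
        for i, a_ij in A_block.get(j, []):     # SG updates touch only w_i and h_j
            r_ij = a_ij - W_block[i] @ h_j
            w_old = W_block[i].copy()
            W_block[i] -= eta * (lam * W_block[i] - r_ij * h_j)
            h_j -= eta * (lam * h_j - r_ij * w_old)
        peer_queues[rr % len(peer_queues)].put((j, h_j))   # pass the token on
        rr += 1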

SLIDE 16

Access Graph for Stochastic Gradient

Access graph G = (V , E):

  V = {w_i} ∪ {h_j}
  E = {e_ij : (i, j) ∈ Ω}

Connection to SG:
  each e_ij corresponds to an SG update
  only w_i and h_j are accessed

Parallelism:
  edges without a common node can be updated in parallel
  identify a "matching" in the graph

Nomadic Token Passing:
  a mechanism such that the active edges always form a matching
  serializability guaranteed

[Figure: bipartite access graph with user nodes w_i on one side and item nodes h_j on the other]

SLIDE 17

More Details

Nomadic tokens for {h_j}: n tokens (j, h_j), O(k) space each

Worker: p workers, each with
  a computing unit + a concurrent token queue
  a block of W: O(mk/p) space
  a block row of A: O(|Ω|/p) space

[Figure: the rating matrix partitioned into p block rows, one per worker]

SLIDE 18-28

Illustration of NOMAD communication

[Figure: a sequence of animation frames showing nomadic tokens circulating among the workers' queues while each worker updates its local block]

SLIDE 29

Comparison on a Multi-core System

On a 32-core processor with enough RAM. Comparison: NOMAD, FPSGD**, and CCD++.

[Plot: test RMSE vs. seconds on Netflix (100M ratings), machines=1, cores=30, λ = 0.05, k = 100; curves for NOMAD, FPSGD**, CCD++]

[Plot: test RMSE vs. seconds on Yahoo! (250M ratings), machines=1, cores=30, λ = 1.00, k = 100; curves for NOMAD, FPSGD**, CCD++]

SLIDE 30

Comparison on a Distributed System

On a distributed system with 32 machines. Comparison: NOMAD, DSGD, DSGD++, and CCD++.

[Plot: test RMSE vs. seconds on Netflix (100M ratings), machines=32, cores=4, λ = 0.05, k = 100; curves for NOMAD, DSGD, DSGD++, CCD++]

[Plot: test RMSE vs. seconds on Yahoo! (250M ratings), machines=32, cores=4, λ = 1.00, k = 100; curves for NOMAD, DSGD, DSGD++, CCD++]

SLIDE 31

Super Linear Scaling of NOMAD-MF

[Plot: test RMSE vs. (seconds × machines × cores) on Yahoo!, cores=4, λ = 1.00, k = 100; curves for # machines = 1, 2, 4, 8, 16, 32]

SLIDE 32

Topic Modeling: Latent Dirichlet Allocation

SLIDE 33

Latent Dirichlet Allocation (LDA)

Each topic is a multinomial distribution over words
Each document is a multinomial distribution over topics
Each word is drawn from one of these topics

Source: http://www.cs.columbia.edu/~blei/papers/icml-2012-tutorial.pdf
SLIDE 34

Graphical Model for LDA

[Plate diagram: word w_{i,j} depends on topic assignment z_{i,j} and topics φ_t; z_{i,j} depends on topic proportions θ_i, which depend on α; φ_t depends on β; plates over j = 1, . . . , n_i, i = 1, . . . , I, and t = 1, . . . , T]

  w_{i,j}: observed word        z_{i,j}: topic assignment
  θ_i: topic proportions        α: proportion parameter
  φ_t: topics                   β: topic parameter

Joint distribution:

  Pr(·) = ∏_{t=1}^{T} Pr(φ_t | β) ∏_{i=1}^{I} [ Pr(θ_i | α) ∏_{j=1}^{n_i} Pr(z_{i,j} | θ_i) Pr(w_{i,j} | φ_{z_{i,j}}) ]

Pr(φ_t | β), Pr(θ_i | α): Dirichlet distributions
Pr(w | φ_t), Pr(z | θ_i): multinomial distributions
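The joint distribution above corresponds to the usual LDA generative process; a small illustrative NumPy sketch (function and argument names are ours, not from the slides):

import numpy as np

def generate_corpus(I, n_i, T, V, alpha, beta, rng=np.random.default_rng(0)):
    """Sample a toy corpus from the LDA generative process."""
    phi = rng.dirichlet(np.full(V, beta), size=T)            # topics: phi_t ~ Dirichlet(beta)
    docs = []
    for _ in range(I):
        theta = rng.dirichlet(np.full(T, alpha))             # proportions: theta_i ~ Dirichlet(alpha)
        z = rng.choice(T, size=n_i, p=theta)                 # assignments: z_ij ~ Multinomial(theta_i)
        w = np.array([rng.choice(V, p=phi[t]) for t in z])   # words: w_ij ~ Multinomial(phi_{z_ij})
        docs.append(w)
    return docs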

SLIDE 35

Inference for LDA

Only the documents are observed
θ_i, φ_t, z_{i,j} are latent
Goal: infer these latent structures

Source: http://www.cs.columbia.edu/~blei/papers/icml-2012-tutorial.pdf
SLIDE 36

Posterior Inference for LDA

Task: Pr(θ_i, φ_t, z_{i,j} | {d_i}, α, β)

Given
  a corpus of documents {d_i : i = 1, . . . , N} and α, β
  each document d_i = {w_{i,j} : j = 1, . . . , n_i}

Exact inference for z_{i,j}, θ_i, φ_t
  intractable: the latent variables are dependent when conditioned on the data

Approximate inference approaches:

Variational methods
  see [Blei et al, 2003]
  an optimization approach
  runs faster but generates biased results

Gibbs sampling
  see [Griffiths & Steyvers, 2004]
  an MCMC approach
  more accurate but slower with a vanilla implementation

Goal: Design a scalable Gibbs sampler for LDA

SLIDE 37

Gibbs Sampling for LDA [Griffiths & Steyvers, 2004]

Count matrices for topic assignment {zi,j}:

  n_dt: # words of document d assigned to topic t
  n_wt: # times word w is assigned to topic t
  n_t := Σ_w n_wt = Σ_d n_dt

Gibbs sampling step

  1. Choose w := w_{i,j} with old assignment t_o := z_{i,j} in document d := d_i
  2. Decrease n_{d t_o}, n_{w t_o}, n_{t_o} by 1
  3. Resample a new assignment t_n := z_{i,j} according to

       Pr(z_{i,j} = t) ∝ (n_dt + α)(n_wt + β) / (n_t + β̄),   ∀t = 1, . . . , T

  4. Increase n_{d t_n}, n_{w t_n}, n_{t_n} by 1

Constants
  J: vocabulary size
  β̄ = β × J
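A minimal NumPy sketch of one such Gibbs step, using the naive Θ(T) sampler that the later slides speed up (array names are illustrative):

import numpy as np

def gibbs_step(d, w, z_old, n_dt, n_wt, n_t, alpha, beta, rng):
    """Resample the topic of one token (word w in document d).

    n_dt: (D, T) doc-topic counts, n_wt: (V, T) word-topic counts, n_t: (T,) topic counts.
    """
    # steps 1-2: remove the old assignment t_o from the counts
    n_dt[d, z_old] -= 1
    n_wt[w, z_old] -= 1
    n_t[z_old] -= 1

    # step 3: Pr(z = t) proportional to (n_dt + alpha)(n_wt + beta)/(n_t + beta_bar)
    beta_bar = beta * n_wt.shape[0]                      # beta_bar = beta * J
    p = (n_dt[d] + alpha) * (n_wt[w] + beta) / (n_t + beta_bar)
    z_new = rng.choice(len(p), p=p / p.sum())

    # step 4: add the new assignment t_n back into the counts
    n_dt[d, z_new] += 1
    n_wt[w, z_new] += 1
    n_t[z_new] += 1
    return z_new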

SLIDE 38

Access Pattern for Gibbs Sampling

[Figure: access pattern of a Gibbs step; each token z_ij touches the word-topic counts n_wt, the doc-topic counts n_dt, and the topic counts n_t, indexed by words, topics, and docs]

SLIDE 39

Multinomial Sampling Techniques for p ∈ R^T_+

Method             Initialization        Generation   Parameter Update
                   Time       Space      Time         Time
LSearch            Θ(T)       Θ(1)       Θ(T)         Θ(1)
BSearch            Θ(T)       Θ(1)       Θ(log T)     Θ(T)
Alias Method       Θ(T)       Θ(T)       Θ(1)         Θ(T)
F+tree Sampling    Θ(T)       Θ(1)       Θ(log T)     Θ(log T)

LSearch
  maintain c_T = p^T 1
  linear search for generation
  Θ(1) update

BSearch
  maintain c = cumsum(p)
  binary search for generation
  no support for updates

Alias Method
  alias table construction has some overhead
  no support for updates

F+tree
  a variant of the Fenwick tree
  construction has low overhead
  logarithmic time for sampling and updates

SLIDE 40

F+Tree: Construction

Construction in Θ(T) time for p = [0.3, 1.5, 0.4, 0.3]^T

[Figure: F+tree stored as a binary heap over the leaves p_1, . . . , p_4]

  node 001 (root): 2.5 = 1.8 + 0.7
  node 010: 1.8 = 0.3 + 1.5        node 011: 0.7 = 0.4 + 0.3
  leaves 100: 0.3 = p_1   101: 1.5 = p_2   110: 0.4 = p_3   111: 0.3 = p_4

SLIDE 41

F+Tree: Sampling

Multinomial sampling in Θ(log T) time
Initial u: a number drawn uniformly from [0, F[1])

[Figure: example walk down the same F+tree with u = 2.1; at the root, u ≥ 1.8 (the left child), so go right and set u ← u − 1.8 = 0.3; at node 011, u < 0.4 (the left child), so go left; the walk ends at leaf 110, i.e., sample t = 3]

SLIDE 42

F+Tree: Update

Update in Θ(log T) time: p_3 ← p_3 + δ

[Figure: only the nodes on the path from leaf 110 to the root change: leaf 110 becomes 0.4 + δ, node 011 becomes 0.7 + δ, and root 001 becomes 2.5 + δ (shown with δ = 1: 1.4, 1.7, 3.5); all other nodes are unchanged]
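The three F+tree operations fit in a few dozen lines. Below is a minimal illustrative Python sketch (our own naming, not the paper's code); the tree is stored heap-style in an array F with the root at F[1] and the T leaves starting at F[size]:

import numpy as np

class FPlusTree:
    """Minimal F+tree sketch: a complete binary tree over T weights, stored in an array."""

    def __init__(self, p):
        self.T = len(p)
        self.size = 1
        while self.size < self.T:
            self.size *= 2
        self.F = np.zeros(2 * self.size)
        self.F[self.size:self.size + self.T] = p           # leaves hold p_1, ..., p_T
        for i in range(self.size - 1, 0, -1):               # Theta(T) construction
            self.F[i] = self.F[2 * i] + self.F[2 * i + 1]   # internal node = sum of children

    def sample(self, rng):
        """Draw t with probability p_t / F[1] in Theta(log T) time."""
        u = rng.uniform(0.0, self.F[1])
        i = 1
        while i < self.size:              # walk down the tree
            if u < self.F[2 * i]:
                i = 2 * i                 # go left
            else:
                u -= self.F[2 * i]
                i = 2 * i + 1             # go right, after subtracting the left mass
        return i - self.size              # 0-based leaf index

    def update(self, t, delta):
        """p_t <- p_t + delta in Theta(log T) time: fix only the leaf-to-root path."""
        i = self.size + t
        while i >= 1:
            self.F[i] += delta
            i //= 2

For p = [0.3, 1.5, 0.4, 0.3], sample() reproduces the walk on the previous slide (u = 2.1 ends at leaf 110, i.e., topic 3), and update(2, δ) touches only that leaf-to-root path.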

SLIDE 43

F+LDA = LDA with F+tree Sampling

Decomposition of p:

  p_t = (n_dt + α)(n_wt + β) / (n_t + β̄),   ∀t = 1, . . . , T
      = β (n_dt + α)/(n_t + β̄)  [= q_t]  +  n_wt (n_dt + α)/(n_t + β̄)  [= r_t]     (1)

  i.e., p = βq + r, sampled with a two-level scheme (see the sketch below):

q is dense
  only 2 entries (q_{t_o}, q_{t_n}) change per Gibbs step within the same document
  use the F+tree for q

r is sparse
  nonzero entries: T_w := {t : n_wt ≠ 0}
  the entire r changes at each Gibbs step
  use BSearch for r

Can also work with a word-by-word update sequence
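A sketch of the two-level sampling of p = βq + r, reusing the FPlusTree sketch from the F+tree slides (the interface and names are illustrative assumptions):

import numpy as np

def sample_two_level(q_tree, r_support, r_values, beta, rng):
    """Draw a topic from p = beta*q + r without materializing the dense p.

    q_tree:    an FPlusTree built over q (so q_tree.F[1] is the total mass of q)
    r_support: topic indices t with n_wt != 0 (the set T_w)
    r_values:  the corresponding r_t values, recomputed at every Gibbs step
    """
    mass_q = beta * q_tree.F[1]
    r_cumsum = np.cumsum(r_values)
    mass_r = r_cumsum[-1] if len(r_values) else 0.0

    u = rng.uniform(0.0, mass_q + mass_r)
    if u < mass_q:
        return q_tree.sample(rng)        # dense part: F+tree, Theta(log T)
    idx = np.searchsorted(r_cumsum, u - mass_q, side='right')
    return r_support[idx]                # sparse part: binary search over cumsum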

SLIDE 44

F+LDA: Alternative Decomposition

Word-by-word Gibbs sampling sequence

Decomposition of p:

  p_t = (n_dt + α)(n_wt + β) / (n_t + β̄),   ∀t = 1, . . . , T
      = α (n_wt + β)/(n_t + β̄)  [= q_t]  +  n_dt (n_wt + β)/(n_t + β̄)  [= r_t]     (2)

  i.e., p = αq + r

q: changes only slightly along this sequence ⇒ use the F+tree
r: |T_d| nonzeros, where T_d := {t : n_dt ≠ 0} ⇒ use BSearch

SLIDE 45

Comparison to Other LDA Sampling

F+LDA (word-by-word)
  Sequence: word-by-word; Exact: Yes
  Decomposition: p_t = α(n_wt + β)/(n_t + β̄) + n_dt(n_wt + β)/(n_t + β̄)
  Structure: F+tree / BSearch; Fresh samples: Yes / Yes
  Initialization: Θ(log T) / Θ(|T_d|); Sampling: Θ(log T) / Θ(log |T_d|)

F+LDA (doc-by-doc)
  Sequence: doc-by-doc; Exact: Yes
  Decomposition: p_t = β(n_dt + α)/(n_t + β̄) + n_wt(n_dt + α)/(n_t + β̄)
  Structure: F+tree / BSearch; Fresh samples: Yes / Yes
  Initialization: Θ(log T) / Θ(|T_w|); Sampling: Θ(log T) / Θ(log |T_w|)

Sparse-LDA
  Sequence: doc-by-doc; Exact: Yes
  Decomposition: p_t = αβ/(n_t + β̄) + n_dt β/(n_t + β̄) + n_wt(n_dt + α)/(n_t + β̄)
  Structure: LSearch / LSearch / LSearch; Fresh samples: Yes / Yes / Yes
  Initialization: Θ(1) / Θ(1) / Θ(|T_w|); Sampling: Θ(T) / Θ(|T_d|) / Θ(|T_w|)

Alias-LDA
  Sequence: doc-by-doc; Exact: No
  Decomposition: p_t = α(n_wt + β)/(n_t + β̄) + n_dt(n_wt + β)/(n_t + β̄)
  Structure: Alias / Alias; Fresh samples: No / Yes
  Initialization: Θ(1) / Θ(|T_d|); Sampling: Θ(#MH) / Θ(#MH)

(The entries after "Structure", "Fresh samples", "Initialization", and "Sampling" are given per term of each decomposition.)

F+LDA: word-by-word is faster than doc-by-doc for large I
  |T_d| is bounded by n_i, but |T_w| approaches T
  per Gibbs step cost: ρ_F log T + ρ_B |T_d|

SparseLDA:
  per Gibbs step cost: Θ(T + |T_d| + |T_w|)
  the first Θ(T) rarely happens, but |T_w| → T for large I

AliasLDA:
  per Gibbs step cost: ρ_A |T_d| + #MH
  ρ_A ≈ 3 × ρ_B: construction overhead of the alias table
  if (ρ_A − ρ_B) |T_d| > ρ_F log T ⇒ AliasLDA is slower than F+LDA
  say |T_d| ≈ 100: F+LDA is still faster for T < 2^50

SLIDE 46

Comparison of various sampling methods

Single machine, single thread
y-axis: speedup over normal O(T) multinomial sampling
Enron: 38K docs with 6M tokens
NYTimes: 0.3M docs with 100M tokens

SLIDE 47

Access Pattern for Gibbs Sampling

[Figure (repeated from Slide 38): access pattern of a Gibbs step; each token z_ij touches the word-topic counts n_wt, the doc-topic counts n_dt, and the topic counts n_t, indexed by words, topics, and docs]

SLIDE 48

Access Graph for Gibbs Sampling

G = (V, E): a hypergraph
  V = {d_i} ∪ {w_j} ∪ {s}
  E = {e_ij = (d_i, w_j, s)}

Connection to Gibbs sampling:
  (d_i)_t := n_{d_i t},  (w_j)_t := n_{w_j t},  (s)_t := n_t
  each e_ij: a Gibbs step for word w_j in d_i, accessing (d_i, w_j, s)

Parallelism: more challenging
  all edges are incident to s
  all (s)_t are large in general ⇒ a slightly stale s is fine for accuracy
  duplicate s for parallelism

[Figure: hypergraph with document nodes d_i, word nodes w_j, and the summation node s]

SLIDE 49

Nomadic Tokens for wj

Nomadic tokens for {w_j : j = 1, . . . , J}: J tokens (j, w_j), O(T) space each

Worker: p workers, each with
  a computing unit + a concurrent token queue
  a subset of {d_i}: O(IT/p) space

[Figure: the corpus partitioned among workers; "x" marks an occurrence of a word, the bigger rectangle is a worker's subset of the corpus, and each smaller rectangle is a unit subtask]

SLIDE 50

Nomadic Token for s: Circular Delta Update

Single global s
  travels among the machines as a messenger
  broadcasts local delta updates

Every machine p keeps (s_p, s̄)
  s_p: local working copy
  s̄: snapshot version of the global s

When the global s visits machine 3 (say):

  s ← s + (s_3 − s̄),   s̄ ← s,   s_3 ← s

[Figure: the global token s circulating among machines holding (s_1, s̄), (s_2, s̄), (s_3, s̄), (s_4, s̄)]
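A small sketch of the circular delta update when the global token arrives at a machine (the MachineState container and all names are illustrative, not the authors' code):

from dataclasses import dataclass
import numpy as np

@dataclass
class MachineState:
    local_s: np.ndarray    # s_p: local working copy, modified by the local Gibbs sampler
    snapshot: np.ndarray   # snapshot of the global s taken at the token's last visit

def visit(global_s, m):
    """Fold machine m's local delta into the global s and resynchronize m."""
    global_s = global_s + (m.local_s - m.snapshot)   # s <- s + (s_p - snapshot)
    m.snapshot = global_s.copy()                     # snapshot <- s
    m.local_s = global_s.copy()                      # s_p <- s
    return global_s                                  # the token then travels to the next machine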

SLIDE 51

Comparison on a single multi-core machine

On a machine with a 20-core processor
Comparison: F+NOMAD LDA, Yahoo! LDA
PubMed: 9M docs with 700M tokens
Amazon: 30M docs with 1.5B tokens

SLIDE 52

Comparison on a Multi-machine System

32 machines, each with a 20-core processor
Comparison: F+NOMAD LDA, Yahoo! LDA
Amazon: 30M docs with 1.5B tokens
UMBC: 40M docs with 1.5B tokens

SLIDE 53

Conclusions

The NOMAD framework uses nomadic tokens to provide
  asynchronous computation
  non-blocking communication
  lock-free implementation
  serializable (or nearly serializable) updates

Recommender systems: matrix factorization
  scalable parallel stochastic gradient
  serializability guarantee

Topic modeling: Latent Dirichlet Allocation
  logarithmic F+tree sampling for efficient Gibbs sampling
  duplicated nomadic tokens for the common node
  outperforms Yahoo! LDA
