CuMF: Large-Scale Matrix Factorization on Just One Machine with GPUs


SLIDE 1

CuMF: Large-Scale Matrix Factorization on Just One Machine with GPUs

Wei Tan, IBM T. J. Watson Research Center, wtan@us.ibm.com

SLIDE 2

Agenda

  • Why accelerate recommendation (matrix factorization) using GPUs?
  • What are the challenges?
  • How does cuMF tackle the challenges?
  • So what is the result?

SLIDE 3

Why do we need a fast and scalable recommender system?


  • Recommendation is pervasive
  • Drives 80% of Netflix’s watch hours [1]
  • US digital ads market: $37.3 billion [2]
  • Need: fast, scalable, economical

SLIDE 4

ALS (alternating least squares) for collaborative filtering


  • Input: users’ ratings on some items
  • Output: all missing ratings
  • How: factorize the rating matrix R into X·Θᵀ, and minimize the loss on the observed ratings
  • ALS: fix Θ and solve for X, then fix X and solve for Θ; iterate

[Figure: R ≈ X·Θᵀ; row x_u of X holds user u’s factors, column θ_v of Θᵀ holds item v’s factors]
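Written out, the objective is the standard ALS loss; Ω is the set of observed (u, v) pairs, and the plain ℓ2 regularizer below is one common choice (the exact weighting used by cuMF may differ):

\min_{X,\Theta}\;\sum_{(u,v)\in\Omega}\bigl(r_{uv}-x_u^{\top}\theta_v\bigr)^2
\;+\;\lambda\Bigl(\sum_{u}\lVert x_u\rVert^{2}+\sum_{v}\lVert\theta_v\rVert^{2}\Bigr)

With Θ fixed, the loss is quadratic in each x_u (and vice versa), so every ALS half-step decomposes into independent least-squares solves: one per user, then one per item.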

SLIDE 5

Matrix factorization/ALS is versatile


[Figure: three instances of R ≈ X·Θᵀ]
  • Recommendation: user × item rating matrix
  • Word embedding: word × word co-occurrence matrix
  • Topic model: word × document matrix

Matrix factorization is a very important algorithm to accelerate.

SLIDE 6

Challenges of fast and scalable ALS


  • ALS needs to solve, for every user u (and symmetrically for every item v):

    x_u = ( Σ_{v∈Ω_u} θ_v θ_vᵀ + λI )⁻¹ · Σ_{v∈Ω_u} r_uv θ_v

    – aggregating the θ_v’s: spmm-like; solving each f×f system: LU or Cholesky decomposition (cuBLAS)
  • Challenge 1: accessing and aggregating many θ_v’s is irregular (R is sparse) and memory intensive
  • Challenge 2: a single GPU cannot handle big m, n, and nnz
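These m per-user systems are independent, so they map naturally onto batched cuBLAS routines. A minimal sketch of the solve step (not the cuMF source; buffer setup omitted), assuming the A_u and b_u have already been formed on the GPU:

#include <cublas_v2.h>

/* d_Aptr, d_Bptr: device arrays holding m device pointers to the f x f
 * matrices A_u = sum θ_v θ_vᵀ + λI and the right-hand sides b_u = sum r_uv θ_v.
 * d_piv: m*f pivot indices (device); d_info: m status ints (device). */
void batched_solve(cublasHandle_t h, int m, int f,
                   float **d_Aptr, float **d_Bptr, int *d_piv, int *d_info)
{
    int h_info = 0;
    /* batched LU factorization of all m matrices A_u, in place */
    cublasSgetrfBatched(h, f, d_Aptr, f, d_piv, d_info, m);
    /* batched triangular solves; each solution x_u overwrites its b_u */
    cublasSgetrsBatched(h, CUBLAS_OP_N, f, 1,
                        (const float * const *)d_Aptr, f, d_piv,
                        d_Bptr, f, &h_info, m);
}

Each f×f system is tiny, so batching is what keeps the GPU busy: one call factorizes all m matrices, a second call solves them.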

SLIDE 7

Challenge 1: memory access


  • NVIDIA K40: memory bandwidth 288 GB/s, compute 4.3 Tflop/s
  • Reaching peak flops requires an operational intensity of at least 4.3 Tflop/s ÷ 288 GB/s ≈ 15 flops/byte → more flops per byte → caching!

[Figure: roofline plot of Gflop/s vs. operational intensity (flops/byte); below ≈15 flops/byte the GPU is bandwidth-bound (under-utilized), above it compute-bound (fully utilized at 4.3 Tflop/s)]

SLIDE 8

Address challenge 1: memory-optimized ALS


  • To obtain high operational intensity when forming Σ θ_v θ_vᵀ (see the sketch below):
  • 1. Reuse each θ_v for many users
  • 2. Load a portion of Θ into shared memory (smem)
  • 3. Tile and aggregate in registers
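A minimal CUDA sketch of steps 1–3 (illustrative, not the cuMF kernel: it assumes f = 32 and a 2×2 register tile per thread, where cuMF uses a larger f, deeper tiling, and the symmetry of the output):

#include <cuda_runtime.h>

#define F 32   // factor dimension f (kept small here for clarity)
#define T 2    // register-tile edge: each thread owns a T x T patch of the F x F output

// One block per user u: A_u = sum over u's rated items v of theta_v * theta_v^T.
// Each theta_v is staged once into shared memory and reused by all (F/T)^2 threads,
// which is the data reuse that raises flops/byte. lambda*I is added before the solve
// (not shown). Launch: form_hermitian<<<m, dim3(F/T, F/T)>>>(theta, row_ptr, col_idx, A);
__global__ void form_hermitian(const float *__restrict__ theta,   // F x n, column-major
                               const int   *__restrict__ row_ptr, // CSR row pointers of R
                               const int   *__restrict__ col_idx, // item ids per user
                               float       *__restrict__ A)       // m output matrices, F x F each
{
    __shared__ float th[F];                 // current theta_v, shared by the whole block
    const int u  = blockIdx.x;
    const int ti = threadIdx.y * T;         // top row of this thread's register tile
    const int tj = threadIdx.x * T;         // left column of this thread's register tile
    float acc[T][T] = {0};                  // accumulator lives in registers

    for (int p = row_ptr[u]; p < row_ptr[u + 1]; ++p) {
        const int v = col_idx[p];
        // cooperative load of theta_v into shared memory
        for (int k = threadIdx.y * blockDim.x + threadIdx.x; k < F;
             k += blockDim.x * blockDim.y)
            th[k] = theta[v * F + k];
        __syncthreads();
        // rank-1 update of the register tile: acc += th[ti..] * th[tj..]^T
        for (int a = 0; a < T; ++a)
            for (int b = 0; b < T; ++b)
                acc[a][b] += th[ti + a] * th[tj + b];
        __syncthreads();                    // keep th[] intact until all threads are done
    }
    // write this thread's T x T patch of A_u back to global memory
    for (int a = 0; a < T; ++a)
        for (int b = 0; b < T; ++b)
            A[u * F * F + (ti + a) * F + (tj + b)] = acc[a][b];
}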

SLIDE 9

Address challenge 1: memory-optimized ALS


[Figure: the f × f output Σ θ_v θ_vᵀ is partitioned into T × T tiles; each thread accumulates one tile in registers while the θ_v values stream through shared memory]

SLIDE 10

Address challenge 2: scale-up ALS on multiple GPUs

  • Model parallel: each GPU solves a portion of the model

[Figure: model-parallel partitioning of the parameters across GPUs]

SLIDE 11

Address challenge 2: scale-up ALS on multiple GPUs

  • Data parallel: each GPU solves the model with a portion of the training data

[Figure: combining data parallelism with model parallelism across GPUs]

SLIDE 12

Address challenge 2: parallel reduction

  • Data parallel needs cross-GPU reduction, sketched below

[Figure: one-phase vs. two-phase parallel reduction; two-phase reduces within each socket first (intra-socket), then once across sockets (inter-socket)]
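A host-side sketch of the two-phase scheme, under stated assumptions: 4 GPUs with peer access enabled, GPUs 0–1 on one socket and 2–3 on the other, buf[g] holding GPU g's partial result of len floats, scratch[g] a receive buffer (helper names are illustrative, not cuMF's):

#include <cuda_runtime.h>

__global__ void axpy(float *dst, const float *src, size_t len)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < len) dst[i] += src[i];       // elementwise dst += src
}

static void add_on(int gpu, float *dst, const float *src, size_t len)
{
    cudaSetDevice(gpu);                  // run the accumulation on GPU `gpu`
    axpy<<<(unsigned)((len + 255) / 256), 256>>>(dst, src, len);
    cudaDeviceSynchronize();
}

void two_phase_reduce(float *buf[4], float *scratch[4], size_t len)
{
    size_t bytes = len * sizeof(float);
    /* phase 1: intra-socket, over the fast links (GPU1 -> GPU0, GPU3 -> GPU2) */
    cudaMemcpyPeer(scratch[0], 0, buf[1], 1, bytes);
    add_on(0, buf[0], scratch[0], len);
    cudaMemcpyPeer(scratch[2], 2, buf[3], 3, bytes);
    add_on(2, buf[2], scratch[2], len);
    /* phase 2: inter-socket, a single transfer over the slower link */
    cudaMemcpyPeer(scratch[0], 0, buf[2], 2, bytes);
    add_on(0, buf[0], scratch[0], len);
    /* buf[0] now holds the full sum; broadcast back as needed */
}

The point of the two phases is that only one transfer crosses the slow inter-socket link, instead of three.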

SLIDE 13

Recap: cuMF tackled two challenges


  • ALS needs to solve, for every user u (and symmetrically for every item v):

    x_u = ( Σ_{v∈Ω_u} θ_v θ_vᵀ + λI )⁻¹ · Σ_{v∈Ω_u} r_uv θ_v

    – aggregating the θ_v’s: spmm-like; solving each f×f system: LU or Cholesky decomposition (cuBLAS)
  • Challenge 1: accessing and aggregating many θ_v’s is irregular (R is sparse) and memory intensive
  • Challenge 2: a single GPU cannot handle big m, n, and nnz

SLIDE 14

Connect cuMF to Spark MLlib


  • Spark applications relying on mllib/ALS need no change
  • Modified mllib/ALS detects GPUs and offloads the computation
  • Leverages the best of Spark (scale-out) and GPUs (scale-up)

[Figure: software stack: ALS apps on top of mllib/ALS, which calls cuMF through JNI]
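The JNI boundary could look like the following sketch; the class and method names are hypothetical, not the actual cuMF binding. mllib/ALS hands over the ratings in CSR form and receives the factor matrices back:

#include <jni.h>

/* Hypothetical JNI entry point -- illustrative names, not the real cuMF API. */
JNIEXPORT void JNICALL Java_cumf_AlsJni_solve(
    JNIEnv *env, jclass cls,
    jint m, jint n, jint f, jfloat lambda,
    jintArray rowPtr, jintArray colIdx, jfloatArray vals,
    jfloatArray X, jfloatArray Theta)
{
    /* pin the Java arrays so native code can read the CSR ratings */
    jint   *h_row = (*env)->GetIntArrayElements(env, rowPtr, NULL);
    jint   *h_col = (*env)->GetIntArrayElements(env, colIdx, NULL);
    jfloat *h_val = (*env)->GetFloatArrayElements(env, vals, NULL);

    /* ... copy CSR to the GPU, run the ALS iterations, copy the factors
     *     back into X and Theta via SetFloatArrayRegion ... */

    (*env)->ReleaseIntArrayElements(env, rowPtr, h_row, JNI_ABORT);
    (*env)->ReleaseIntArrayElements(env, colIdx, h_col, JNI_ABORT);
    (*env)->ReleaseFloatArrayElements(env, vals, h_val, JNI_ABORT);
}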

SLIDE 15

Connect cuMF to Spark MLlib

[Figure: Spark + cuMF on one Power 8 node with two K40 GPUs; rating RDDs stay on the CPU, CUDA solver kernels run on each GPU, and Spark shuffles exchange parameters between iterations]

  • RDD on CPU: to distribute rating data and shuffle parameters
  • Solver on GPU: to form and solve the least-squares subproblems
  • Able to run on multiple nodes, and multiple GPUs per node


SLIDE 16

Implementation


  • In C (ca. 10k LOC)
  • CUDA 7.0/7.5; GCC with OpenMP 3.0
  • Baselines:

    – libmf: SGD on 1 node [RecSys 14]
    – NOMAD: SGD on multiple nodes [VLDB 14]
    – SparkALS: ALS on Spark
    – Factorbird: SGD + parameter server for MF
    – Facebook: enhanced Giraph

SLIDE 17

CuMF performance


[Figure: convergence curves; x-axis: time in seconds, y-axis: Root Mean Square Error (RMSE) on the test set]

  • 1 GPU vs. 30 cores: cuMF is slightly faster than libmf and NOMAD
  • cuMF scales well on 1, 2, and 4 GPUs

SLIDE 18

Effectiveness of memory optimization


[Figure: convergence curves; x-axis: time in seconds, y-axis: Root Mean Square Error (RMSE) on the test set]

  • Aggressively using registers → 2x speedup
  • Using texture → 25%-35% faster

SLIDE 19

CuMF performance and cost


  • cuMF @ 4 GPUs ≈ NOMAD @ 64 HPC nodes ≈ 10x NOMAD @ 32 AWS nodes
  • cuMF @ 4 GPUs ≈ 10x SparkALS @ 50 nodes, at ≈ 1% of its cost

SLIDE 20

CuMF accelerated Spark on Power 8


  • cuMF @ 2 K40s achieves a 6+x speedup in training (193 sec vs. 1.3k sec)

*GUI designed by Amir Sanjar

SLIDE 21

Conclusion

  • Why accelerate recommendation (matrix factorization) using GPUs?

    – Need to be fast, scalable, and economical

  • What are the challenges?

    – Memory access; scaling to multiple GPUs

  • How does cuMF tackle the challenges?

    – Optimized memory access, parallelism, and communication

  • So what is the result?

    – Up to 10x as fast, 100x as cost-efficient
    – Use cuMF standalone or with Spark
    – GPUs can tackle ML problems beyond deep learning!

SLIDE 22

Wei Tan, IBM T. J. Watson Research Center
wtan@us.ibm.com, http://ibm.biz/wei_tan

Thank you, questions?

Faster and Cheaper: Parallelizing Large-Scale Matrix Factorization on GPUs. Wei Tan, Liangliang Cao, Liana Fong. HPDC 2016. http://arxiv.org/abs/1603.03820
Source code available soon.