

SLIDE 1

cuMF_sgd: Fast and Scalable Matrix Factorization on GPUs

Wei Tan

IBM T. J. Watson Research Center
wtan@us.ibm.com
http://github.com/cumf
http://researcher.ibm.com/person/us-wtan

GTC 2017

SLIDE 2

Agenda

  • Why
    – accelerate matrix factorization?
    – use GPUs?
  • How to accelerate: cuMF
    – alternating least squares (ALS)
    – stochastic gradient descent (SGD)
  • What is the result
    – cuMF outperforms all competitors

SLIDE 3

MF Explained using Recommender Systems

  • Input: users' ratings on some items
  • Want: predict the missing ratings
  • How: derive user/item features
  • MF: factorize the rating matrix R into X and ΘT, and minimize the empirical loss:

    min_{X,Θ} Σ_{(u,v)∈R} (r_uv − x_uᵀ θ_v)² + λ (‖x_u‖² + ‖θ_v‖²)

[Figure: R ≈ X ΘT, with x_u a user's feature vector (a row of X) and θ_v an item's feature vector (a column of ΘT).]

SLIDE 4

Other applications of MF

  • Word embedding: factorize a word-word co-occurrence matrix into X ΘT
  • Topic model: factorize a word-document matrix into X ΘT
  • Link prediction: factorize a user-item adjacency matrix into X ΘT
  • Network compression

SLIDE 5

To Solve MF: SGD

Stochastic gradient descent (SGD)

  • Update takes one rating at a time
  • Vector inner product: memory bound
  • Needs many light epochs
  • Parallelization: non-trivial
  • Handles dense (implicit) ratings: no

[Figure: one SGD step on rating (u1, v1) touches only the vectors x_u1 and θ_v1.]
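
To make the per-rating cost concrete, here is a minimal sketch of one SGD step (the generic textbook update; the learning rate lr and regularizer lambda are my names, and this is not cuMF_sgd's tuned kernel):

__device__ void sgd_update(float* xu, float* thv, float r_uv,
                           float lr, float lambda, int f) {
    // prediction error e = r_uv - <x_u, theta_v>
    float e = r_uv;
    for (int i = 0; i < f; ++i) e -= xu[i] * thv[i];
    // regularized gradient step on both feature vectors
    for (int i = 0; i < f; ++i) {
        float x = xu[i], t = thv[i];
        xu[i]  = x + lr * (e * t - lambda * x);
        thv[i] = t + lr * (e * x - lambda * t);
    }
}

Each step moves 2f floats of model data but performs only a handful of flops per float: memory bound.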

SLIDE 6

To Solve MF: ALS

Alternating least squares (ALS)

  • Update takes ALL ratings at a time
  • Vector outer product & solve: compute bound
  • Needs few heavy epochs
  • Parallelization: straightforward
  • Handles dense (implicit) ratings: yes

[Figure: one ALS step recomputes x_u1 from the vectors θ_v of all items user u1 rated.]
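
For reference, each ALS half-step has the standard closed form (notation mine, following the usual regularized least-squares derivation):

    x_u ← (Θ_uᵀ Θ_u + λI)⁻¹ Θ_uᵀ r_u

where Θ_u stacks the feature vectors of the items user u rated and r_u holds the corresponding ratings. Accumulating Θ_uᵀ Θ_u is the outer-product work, and each f × f solve costs O(f³): compute bound.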

SLIDE 7

Challenge: compute and memory capacity of CPUs

A CPU offers ~1 Tflops and ~80 GB/s. With f = 100, per epoch:

  • ALS floating-point operations
    – Netflix: 1.5 T
    – Hugewiki: 80 T
    – Facebook: 2000 T
  • SGD memory transfer
    – Netflix: 80 GB
    – Hugewiki: 2.4 TB
    – Facebook: 80 TB
  • Both far exceed CPU flops and bandwidth capacity
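
As a back-of-the-envelope check (assuming Netflix's roughly 10^8 ratings; the constants are my approximations, not from the slide):

    SGD traffic ≈ |R| × 2f floats × 4 B = 10^8 × 2 × 100 × 4 B = 80 GB per epoch
    ALS flops  ≈ 2 |R| f² = 2 × 10^8 × 100² ≈ 2 T per epoch, the same order as the 1.5 T above

At 80 GB/s, a single SGD epoch on a CPU costs at least a second of pure memory traffic, and SGD needs many such epochs.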


SLIDE 8

GPU vs. CPU: compute FLOPS and memory bandwidth

https://www.karlrupp.net/2013/06/cpu-gpu-and-mic-hardware-characteristics-over-time/

  • Raw performance: 1 GPU ≈ 10 × CPU
  • Practical performance is even better, because multi-CPU setups suffer from slow interconnects: 1 GPU > 10 × CPU, 4 GPUs >> 40 × CPU

SLIDE 9

Goal: a CUDA library for MF

[Diagram: applications (collaborative filtering, word embedding, …) built on cuMF (kernels for ALS and SGD), built on CUDA.]

  • Fast: fast training; update the model quickly
  • Scalable: deal with big data; exploit fast interconnects
  • Cost efficient: fully utilize flops or bandwidth; cheaper than CPU solutions

SLIDE 10

Challenges of SGD

  • Iterate over all ratings and apply the update in sequence
  • Memory bound
  • Two sub-problems:
    – 1. a fast update kernel: cache, memory coalescing, half precision, warp shuffle, ILP, register reuse
    – 2. how to parallelize: (a) Hogwild! with parallel workers, (b) matrix blocking

[Figure: (a) Hogwild! with parallel workers 0 and 1; (b) matrix blocking.]

SLIDE 11

Challenges on GPUs

Scalability

  • Existing CPU implementations do not scale beyond ~30 threads.
  • Main reason: scheduling overhead and context-switch overhead increase with the number of threads.

Vectorization

  • The inner loop is short vector arithmetic:

    float a = p[i] * q[i];    // dot product, accumulated over i
    p[i] = p[i] - grad_p[i];  // update p
    q[i] = q[i] - grad_q[i];  // update q

Memory access

  • Cache.
  • Memory efficiency.

Code complexity

  • Branchy code such as if (…) { … } else { … } is slow on GPUs.
  • GPUs are not good at processing complex logic; keeping the code structure simple is very important for performance. This is a common problem for all sparse applications.

SLIDE 12

Computation Optimization - 1: In-warp Vectorization

  • Use one thread block to process one update.
  • For any feature-vector length k, block size = warp size = 32 (e.g., k = 128: each lane handles 4 elements).
  • In-warp vectorization, as in the sketch below.

In-warp dot product: the fastest dot product operation on GPUs. No shared memory, no synchronization overhead, plus extra hardware acceleration (warp shuffle).

Instruction-level parallelism: since the Kepler architecture, GPUs exploit ILP by employing VLIW-like techniques; consecutive instructions with no data dependency execute in parallel.

Register file: fully utilize registers; keep all variables in the register file, the fastest storage on GPUs.
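
A minimal sketch of the in-warp pattern for k = 128 (assuming the CUDA 9-style __shfl_xor_sync; the real cuMF_sgd kernel adds half precision and the other tricks below):

__device__ float warp_dot_k128(const float* __restrict__ p,
                               const float* __restrict__ q) {
    int lane = threadIdx.x & 31;           // block size = warp size = 32
    float sum = 0.0f;
    #pragma unroll
    for (int i = lane; i < 128; i += 32)   // 4 independent FMAs per lane: ILP
        sum += p[i] * q[i];                // partials stay in registers
    // butterfly reduction via register shuffles:
    // no shared memory, no __syncthreads()
    for (int off = 16; off > 0; off >>= 1)
        sum += __shfl_xor_sync(0xffffffffu, sum, off);
    return sum;                            // every lane holds the full dot product
}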

SLIDE 13

Computation Optimization - 2: Optimized Memory Access

Cache: use the L1 cache to capture data locality in the matrix.

Half precision: Maxwell supports memory storage in half precision; using it reduces bandwidth consumption by 50% with no accuracy loss.

Memory coalescing: we ensure perfect memory coalescing at programming time, so every byte of off-chip memory access is utilized, achieving 100% bandwidth efficiency.

Register file: we keep all reusable data in the register file and avoid register spilling.
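
A sketch of the half-storage, float-compute idea (my illustration using cuda_fp16.h; the kernel and its names are hypothetical, not cuMF_sgd's code):

#include <cuda_fp16.h>

// Factors are stored as 16-bit halves, halving off-chip traffic;
// the arithmetic itself stays in 32-bit float to preserve accuracy.
__global__ void sgd_step_half(__half* p, const __half* grad,
                              float lr, int k) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < k) {
        float pf = __half2float(p[i]);       // 2-byte load, float math
        float gf = __half2float(grad[i]);
        p[i] = __float2half(pf - lr * gf);   // 2-byte store
    }
}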

SLIDE 14

Single GPU Scheduling

[Figure: a global scheduling table assigns blocks of the rating matrix to workers 0…N.]

[Figure: #updates/s vs. #parallel workers, libmf vs. libmf-GPU.]

Centralized scheduling: a global scheduler is responsible for the workload scheduling; when a parallel worker finishes its job, it asks the global scheduler for a new one. Scalability problem:

  • 1. On CPUs, it does not scale beyond 30 threads.
  • 2. On GPUs, it does not scale beyond 240 thread blocks.

SLIDE 15

Single GPU Scheduling – 1 Wave-like update

  • Basic observation: GPUs do not like complex block scheduling; a sketch of one possible wave pattern follows.
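
A sketch of a wave-like schedule (an assumption on my part, following the classic DSGD-style diagonal rotation; sgd_wave_kernel and the d_* names are hypothetical, see the HPDC 2017 paper for the exact scheme):

// The rating matrix is split into W x W blocks; worker w owns row
// block w. In wave t, worker w updates column block (w + t) % W, so
// no two workers ever share users or items: lock-free, and each
// worker derives its next block from the wave counter alone, with
// no scheduling table at all.
for (int t = 0; t < W; ++t) {
    // one kernel launch per wave keeps workers in lockstep;
    // inside, worker w = blockIdx.x processes block (w, (w + t) % W)
    sgd_wave_kernel<<<W, 32>>>(d_ratings, d_x, d_theta, t, W);
}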


SLIDE 16

Single GPU Scheduling – 2 Batch-Hogwild


  • Key idea: borrow the idea of Hogwild!
  • Optimization: fetch a batch of samples to exploit data locality (see the sketch below)

[Figure: comparison.]
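
A sketch of the idea (my illustration; Rating and sgd_update_one are hypothetical helpers, and the real kernel is more elaborate):

struct Rating { int u, v; float r; };
__device__ void sgd_update_one(Rating r, float* x, float* theta);  // hypothetical, defined elsewhere

// Plain Hogwild! draws one random rating per step; here each warp
// claims a contiguous batch of pre-shuffled ratings instead, so the
// rating reads are coalesced and cache-friendly while the model
// updates stay lock-free, exactly as in Hogwild!.
__global__ void batch_hogwild(const Rating* ratings, long n,
                              int batch, float* x, float* theta) {
    long warp  = (blockIdx.x * (long)blockDim.x + threadIdx.x) / 32;
    long begin = warp * batch;
    long end   = begin + batch < n ? begin + batch : n;
    // all 32 lanes walk the batch together, cooperating inside
    // sgd_update_one (e.g., via the in-warp dot product of slide 12)
    for (long i = begin; i < end; ++i)
        sgd_update_one(ratings[i], x, theta);
}
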
SLIDE 17

Performance

  • Faster than all state-of-the-art techniques with only one GPU card

SLIDE 18

Cross Architecture Scalability


SLIDE 19

Performance and achieved bandwidth


SLIDE 20

Conclusion


  • SGD-based MF is memory bound: try to increase memory bandwidth instead of increasing FLOPS.
  • GPUs do not prefer complex scheduling policies or control logic.
  • Half precision (16-bit floating point) is accurate enough for matrix factorization.
  • Understanding the architecture details of GPUs helps a lot when writing high-performance GPU applications.

  • cuMF_sgd is the fastest SGD-based MF.
SLIDE 21

Thank you, questions?

  • Acknowledgement: Xiaolong Xie, Liangliang Cao
  • Faster and Cheaper: Parallelizing Large-Scale Matrix Factorization on GPUs. HPDC 2016
  • CuMF_SGD: Fast and Scalable Matrix Factorization. HPDC 2017
  • Code: http://github.com/cuMF/
  • Blog: http://ibm.biz/cumf-blog
  • Contact: Wei Tan, wtan@us.ibm.com
