

  1. cuMF_sgd: Fast and Scalable Matrix Factorization on GPUs. Wei Tan, IBM T. J. Watson Research Center. wtan@us.ibm.com | http://github.com/cumf | http://researcher.ibm.com/person/us-wtan. GTC 2017

  2. Agenda – Why accelerate matrix factorization, and why use GPUs? – How to accelerate: cuMF, via alternating least squares (ALS) and stochastic gradient descent (SGD) – What is the result: cuMF outperforms all competitors

  3. MF Explained Using Recommender Systems – Input: users' ratings on some items – Want: predict missing ratings – How: derive user features x_u and item features θ_v – MF: factorize the rating matrix R into X and Θᵀ and minimize the empirical loss:
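A standard form of that objective (the talk's exact regularization may differ) is:

```latex
\min_{X,\Theta} \sum_{(u,v)\in\Omega} \big( r_{uv} - x_u^\top \theta_v \big)^2
  + \lambda \Big( \sum_u \lVert x_u \rVert^2 + \sum_v \lVert \theta_v \rVert^2 \Big)
```

where Ω is the set of observed ratings, x_u is user u's feature vector, and θ_v is item v's.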

  4. Other Applications of MF [slide shows example matrices] – Topic model: factorize a document × word matrix – Word embedding: factorize a word × word co-occurrence matrix – Network compression and link prediction: factorize a user × item link matrix

  5. To Solve MF: SGD – Stochastic gradient descent (SGD) updates take one rating at a time – Vector inner product: memory-bound – Needs many light epochs – Parallelization: non-trivial – Handles dense (implicit) ratings: no
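For one sampled rating r_uv, the textbook SGD step (learning rate η, regularization λ; not necessarily the exact variant used here) is:

```latex
e_{uv} = r_{uv} - x_u^\top \theta_v, \qquad
x_u \leftarrow x_u + \eta\,( e_{uv}\,\theta_v - \lambda\, x_u ), \qquad
\theta_v \leftarrow \theta_v + \eta\,( e_{uv}\, x_u - \lambda\, \theta_v )
```

Each update touches only two length-f vectors, which is why SGD is memory-bound rather than compute-bound.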

  6. To Solve MF: ALS – Alternating least squares (ALS) updates use ALL ratings at a time – Vector outer product and solve: compute-bound – Needs few heavy epochs – Parallelization: straightforward – Handles dense (implicit) ratings: yes
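Each ALS half-step fixes Θ and solves a regularized least-squares problem per user (and symmetrically per item). In standard form:

```latex
x_u = \big( \Theta_{\Omega_u}^\top \Theta_{\Omega_u} + \lambda I \big)^{-1} \Theta_{\Omega_u}^\top r_{u,\Omega_u}
```

where Ω_u is the set of items rated by user u. Forming and solving these f × f systems is what makes ALS compute-bound.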

  7. Challenge: compute and memory capacity of CPUs. A CPU offers ~1 Tflops and ~80 GB/s. With f = 100, per epoch: – ALS floating-point operations: Netflix 1.5 T, Hugewiki 80 T, Facebook 2000 T – SGD memory transfer: Netflix 80 GB, Hugewiki 2.4 TB, Facebook 80 TB – far beyond CPU flops and bandwidth capacity
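As a sanity check on the Netflix SGD figure (assuming ~10^8 ratings, f = 100, 4-byte floats, and counting each of the two feature vectors once per update):

```latex
10^8 \times 2 \times 100 \times 4\,\mathrm{B} = 8 \times 10^{10}\,\mathrm{B} = 80\,\mathrm{GB\ per\ epoch}
```

Even at the CPU's peak 80 GB/s this is a full second per epoch, and SGD's random access pattern keeps achieved bandwidth far below peak.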

  8. GPU vs. CPU: compute FLOPS and memory bandwidth – Raw performance: 1 GPU ≈ 10× CPU – Practical performance (slow interconnects penalize scaling out CPUs): 1 GPU > 10× CPU; 4 GPUs >> 40× CPU https://www.karlrupp.net/2013/06/cpu-gpu-and-mic-hardware-characteristics-over-time/

  9. Goal: a CUDA library for MF. [Stack diagram: applications such as collaborative filtering and word embedding sit on MF, which runs on cuMF (kernels for ALS and SGD) on top of CUDA.] – Fast: quick training, update the model quickly – Scalable: deal with big data, exploit fast interconnects – Cost efficient: fully utilize flops or bandwidth, cheaper than CPU solutions

  10. How to Parallelize: Challenges of SGD – SGD iterates over all ratings and does so in sequence; parallel workers must somehow partition this stream – Two known schemes: (a) Hogwild!, (b) matrix blocking – SGD is memory-bound, so the update kernel leans on: warp shuffle, ILP, register reuse, caching, memory coalescing, half precision

  11. Challenges on GPUs. The per-rating update is just short vector operations: a = p · q (dot product); p ← p − η·∇p; q ← q − η·∇q – Vectorization: make these short-vector operations fast – Memory access: cache and memory efficiency, a common problem for all sparse applications – Scalability: existing CPU implementations do not scale beyond ~30 threads, mainly because scheduling and context-switch overheads grow with the thread count – Code complexity: GPUs are not good at processing complex logic such as if/else branches; keeping the code structure simple is very important for performance

  12. Computation Optimization 1: In-Warp Vectorization – Use one thread block to process one update; for any feature-vector length k, block size = warp size = 32 – Instruction-level parallelism: since the Kepler architecture, GPUs exploit ILP with VLIW-like techniques, so consecutive instructions with no data dependency execute in parallel – In-warp dot product: the fastest dot product on GPUs (no shared memory, no synchronization overhead, plus extra hardware acceleration) – Registers: keep all variables in the register file, the fastest storage on GPUs
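A minimal sketch of the in-warp dot product idea (illustrative, not the cuMF_sgd source; __shfl_xor_sync requires CUDA 9+, earlier code would use __shfl_xor):

```cuda
// One warp (32 threads) computes the dot product of two k-float vectors.
// No shared memory and no __syncthreads(): lanes exchange partial sums
// via warp shuffles, and all intermediate values stay in registers.
__device__ float warp_dot(const float* p, const float* q, int k) {
    int lane = threadIdx.x & 31;          // lane id within the warp
    float sum = 0.0f;
    for (int i = lane; i < k; i += 32)    // strided so loads are coalesced
        sum += p[i] * q[i];
    for (int offset = 16; offset > 0; offset >>= 1)
        sum += __shfl_xor_sync(0xffffffffu, sum, offset);  // butterfly reduction
    return sum;                           // every lane holds the full sum
}
```

The butterfly (XOR) reduction leaves the result in all 32 lanes, which the subsequent update step needs.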

  13. Computation Optimization 2: Optimized Memory Access – Memory coalescing: ensure perfect coalescing at programming time, so every byte of off-chip memory traffic is utilized, achieving 100% bandwidth efficiency – Cache: use the L1 cache to capture data locality in the rating matrix – Register file: keep all reusable data in registers and avoid register spilling – Half precision: Maxwell supports half-precision storage; using it cuts bandwidth consumption by 50% with no accuracy loss
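A hedged sketch of the half-precision trick (names illustrative, not cuMF's): store the feature vectors as 16-bit halves to halve off-chip traffic, but convert to float for arithmetic so precision is reduced only in storage.

```cuda
#include <cuda_fp16.h>

// Feature vectors live in memory as 16-bit halves; math stays in fp32.
// One warp strides over the k elements of p, applying an SGD-style update.
__device__ void sgd_update_half(__half* p, const __half* q,
                                float err, float lr, float lambda, int k) {
    int lane = threadIdx.x & 31;
    for (int i = lane; i < k; i += 32) {
        float pi = __half2float(p[i]);        // convert on load
        float qi = __half2float(q[i]);
        pi += lr * (err * qi - lambda * pi);  // fp32 arithmetic
        p[i] = __float2half(pi);              // convert back on store
    }
}
```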

  14. Single-GPU Scheduling – Centralized scheduling: a global scheduler is responsible for workload scheduling; when a parallel worker finishes its job, it asks the global scheduler for a new one – Scalability problem: on CPUs this does not scale beyond 30 threads; on GPUs it does not scale beyond 240 thread blocks [Plot: #updates/s vs. #parallel workers (0 to 300) for libmf and a libmf GPU port; throughput plateaus below 2×10^8 updates/s]

  15. Single-GPU Scheduling 1: Wave-Like Update – Basic observation: GPUs do not like complex block scheduling [the slide's figure illustrates the wave-like update; a sketch of one common reading follows]
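The figure is not in this transcript, so the details below are an assumption about how a wave-like update is commonly organized: split the rating matrix into N × N tiles; in wave w, block b updates tile (b, (b + w) mod N), so tiles in the same wave share no users or items, and no global scheduler or locking is needed. A host-side skeleton, where sgd_wave is a hypothetical per-tile SGD kernel:

```cuda
#include <cuda_runtime.h>

// Scheduling skeleton only; sgd_wave is a hypothetical per-tile SGD kernel
// in which block b processes tile (b, (b + w) % N) of the rating matrix.
__global__ void sgd_wave(const int* tile_offsets, int w, int N);

void run_epoch(const int* tile_offsets, int N) {
    for (int w = 0; w < N; ++w)
        sgd_wave<<<N, 32>>>(tile_offsets, w, N);  // each launch acts as a barrier between waves
    cudaDeviceSynchronize();                      // wait for the last wave
}
```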

  16. Single-GPU Scheduling 2: Batch-Hogwild – Key idea: borrow the idea of Hogwild! – Optimization: fetch a batch of samples at a time to exploit data locality (a sketch follows) – [Comparison figure not in transcript]
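A minimal sketch of the batch-Hogwild idea (illustrative names and layout, not the cuMF_sgd kernel): the ratings array is pre-shuffled once, then each warp reads its batch contiguously (good locality and coalescing) and applies lock-free Hogwild-style updates, tolerating the occasional race.

```cuda
struct Rating { int u, v; float r; };

// One warp per batch: contiguous reads of pre-shuffled samples, then
// lock-free (Hogwild-style) updates of the two feature vectors.
__global__ void batch_hogwild(const Rating* s, long n, int batch,
                              float* x, float* theta, int k,
                              float lr, float lambda) {
    long warp = (blockIdx.x * (long)blockDim.x + threadIdx.x) / 32;
    int lane = threadIdx.x & 31;
    long begin = warp * batch;
    long end = begin + batch; if (end > n) end = n;
    for (long i = begin; i < end; ++i) {
        float* xu = x + (long)s[i].u * k;       // user feature vector
        float* tv = theta + (long)s[i].v * k;   // item feature vector
        float dot = 0.0f;                       // in-warp dot product
        for (int j = lane; j < k; j += 32) dot += xu[j] * tv[j];
        for (int o = 16; o > 0; o >>= 1)
            dot += __shfl_xor_sync(0xffffffffu, dot, o);
        float err = s[i].r - dot;               // all lanes hold err
        for (int j = lane; j < k; j += 32) {    // unsynchronized update
            float a = xu[j], b = tv[j];
            xu[j] = a + lr * (err * b - lambda * a);
            tv[j] = b + lr * (err * a - lambda * b);
        }
    }
}
```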

  17. Performance – Faster than all state-of-the-art techniques with only one GPU card

  18. Cross-Architecture Scalability

  19. Performance and Achieved Bandwidth

  20. Conclusion – SGD-based MF is memory-bound: work to increase memory bandwidth utilization rather than FLOPS – GPUs do not favor complex scheduling policies or control logic – Half precision (16-bit floating point) is accurate enough for matrix factorization – Understanding the architecture details of GPUs helps a lot when writing high-performance GPU applications – cuMF_sgd is the fastest SGD-based MF

  21. Thank you, questions? – Acknowledgement: Xiaolong Xie, Liangliang Cao – Faster and Cheaper: Parallelizing Large-Scale Matrix Factorization on GPUs. HPDC 2016 – CuMF_SGD: Fast and Scalable Matrix Factorization. HPDC 2017 – Code: http://github.com/cuMF/ – Blog: http://ibm.biz/cumf-blog – Contact: Wei Tan, wtan@us.ibm.com
