cuMF_sgd: Fast and Scalable Matrix Factorization on GPUs
Wei Tan
IBM T. J. Watson Research Center | wtan@us.ibm.com
http://github.com/cumf | http://researcher.ibm.com/person/us-wtan
GTC 2017
Agenda
Why accelerate matrix factorization?
Matrix factorization: approximate a sparse matrix R by X Θᵀ and minimize the empirical loss:

min_{X,Θ} Σ_{(u,v)∈R} (r_uv − x_uᵀ θ_v)² + λ_x ‖x_u‖² + λ_θ ‖θ_v‖²
[Figure: matrix factorization R ≈ X Θᵀ applied to three kinds of matrices: word-word, word-document, and user-item.]
[Figure: CPU, GPU, and MIC hardware characteristics over time. Source: https://www.karlrupp.net/2013/06/cpu-gpu-and-mic-hardware-characteristics-over-time/]
Goals: fast, scalable, cost efficient.
[Figure: two SGD parallelization schemes: (a) Hogwild, where parallel workers 0 and 1 update the matrix concurrently without locks; (b) matrix blocking, where workers process disjoint blocks.]
Optimization techniques: cache, memory coalescing, half precision, warp shuffle, ILP, register reuse.
Existing SGD implementations do not scale beyond ~30 threads. [Figure: updates/s vs. #threads]
float a = p[·] * q[·];        // dot product
p[·] = p[·] – gradient[·];    // update p
q[·] = q[·] – gradient[·];    // update q
if(…){ … } else if(…){ … }
SGD involves complex logic. Keeping the code structure simple is very important for performance, a common challenge for all sparse applications.
In-warp dot product
Instruction-level parallelism: since the Kepler architecture, GPUs exploit ILP with VLIW-like techniques; consecutive instructions with no data dependencies execute in parallel.
Warp shuffle gives the fastest dot product on GPUs: no shared memory and no synchronization.
Register reuse: keep all variables in the register file, the fastest storage on a GPU.
Cache
Use the L1 cache to capture data locality in the matrix.
Half precision
Maxwell supports half-precision memory storage. Using half precision reduces bandwidth consumption by 50% with no accuracy loss.
Memory coalescing
We ensure perfect memory coalescing at programming time: every byte of off-chip memory access is utilized, achieving 100% bandwidth efficiency.
Register file
We keep all reusable data in the register file and avoid register spilling.
[Figure: a global scheduling table assigns blocks of the rating matrix to workers 0 … N.]
[Figure: #Updates/s vs. #parallel workers for libmf and libmf-GPU.]
Centralized scheduling: a global scheduler is responsible for workload scheduling; when a parallel worker finishes its job, it asks the global scheduler for a new one. Scalability problem: libmf does not scale beyond 30 threads, and libmf-GPU does not scale beyond 240 thread blocks.
high-performance GPU applications.