

SLIDE 1

cuMF_sgd: Fast and Scalable Matrix Factorization on GPUs

Wei Tan

IBM T. J. Watson Research Center
wtan@us.ibm.com
http://github.com/cumf
http://researcher.ibm.com/person/us-wtan

GTC 2017

SLIDE 2

Agenda

  • Why
    – accelerate matrix factorization?
    – use GPUs?
  • How to accelerate: cuMF
    – alternating least squares (ALS)
    – stochastic gradient descent (SGD)
  • What is the result
    – cuMF outperforms all competitors

SLIDE 3

MF Explained using Recommender Systems

  • Input: users' ratings on some items
  • Want: predict the missing ratings
  • How: derive user/item features
  • MF: factorize the rating matrix R into X and ΘT, and minimize the empirical loss:

    min_{X,Θ} Σ_{(u,v)∈R} (r_uv − x_uᵀ θ_v)² + λ (‖x_u‖² + ‖θ_v‖²)

[Figure: R ≈ X ΘT, with x_u a user's feature vector (a row of X) and θ_v an item's feature vector (a column of ΘT).]

SLIDE 4

Other applications of MF

  • Word embedding: factorize a word-word co-occurrence matrix into X ΘT
  • Topic model: factorize a word-document matrix into X ΘT
  • Link prediction: factorize a user-item adjacency matrix into X ΘT
  • Network compression

SLIDE 5

To Solve MF: SGD

Stochastic gradient descent (SGD)

  • Update takes one rating at a time
  • Vector inner product: memory bound
  • Needs many light epochs
  • Parallelization: non-trivial
  • Handles dense (implicit) ratings: no

[Figure: one SGD step on rating (u1, v1) touches only the vectors x_u1 and θ_v1.]
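
To make the per-rating cost concrete, here is a minimal sketch of one SGD step (the generic textbook update; the learning rate lr and regularizer lambda are my names, and this is not cuMF_sgd's tuned kernel):

__device__ void sgd_update(float* xu, float* thv, float r_uv,
                           float lr, float lambda, int f) {
    // prediction error e = r_uv - <x_u, theta_v>
    float e = r_uv;
    for (int i = 0; i < f; ++i) e -= xu[i] * thv[i];
    // regularized gradient step on both feature vectors
    for (int i = 0; i < f; ++i) {
        float x = xu[i], t = thv[i];
        xu[i]  = x + lr * (e * t - lambda * x);
        thv[i] = t + lr * (e * x - lambda * t);
    }
}

Each step moves 2f floats of model data but performs only a handful of flops per float: memory bound.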

SLIDE 6

To Solve MF: ALS

Alternating least squares (ALS)

  • Update takes ALL ratings at a time
  • Vector outer product & solve: compute bound
  • Needs few heavy epochs
  • Parallelization: straightforward
  • Handles dense (implicit) ratings: yes

[Figure: one ALS step recomputes x_u1 from the vectors θ_v of all items user u1 rated.]
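
For reference, each ALS half-step has the standard closed form (notation mine, following the usual regularized least-squares derivation):

    x_u ← (Θ_uᵀ Θ_u + λI)⁻¹ Θ_uᵀ r_u

where Θ_u stacks the feature vectors of the items user u rated and r_u holds the corresponding ratings. Accumulating Θ_uᵀ Θ_u is the outer-product work, and each f × f solve costs O(f³): compute bound.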

SLIDE 7

Challenge: compute and memory capacity of CPUs

A CPU offers ~1 Tflops and ~80 GB/s. With f = 100, per epoch:

  • ALS floating-point operations
    – Netflix: 1.5 T
    – Hugewiki: 80 T
    – Facebook: 2000 T
  • SGD memory transfer
    – Netflix: 80 GB
    – Hugewiki: 2.4 TB
    – Facebook: 80 TB
  • Both far exceed CPU flops and bandwidth capacity
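
As a back-of-the-envelope check (assuming Netflix's roughly 10^8 ratings; the constants are my approximations, not from the slide):

    SGD traffic ≈ |R| × 2f floats × 4 B = 10^8 × 2 × 100 × 4 B = 80 GB per epoch
    ALS flops  ≈ 2 |R| f² = 2 × 10^8 × 100² ≈ 2 T per epoch, the same order as the 1.5 T above

At 80 GB/s, a single SGD epoch on a CPU costs at least a second of pure memory traffic, and SGD needs many such epochs.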


SLIDE 8

GPU vs. CPU: compute FLOPS and memory bandwidth

https://www.karlrupp.net/2013/06/cpu-gpu-and-mic-hardware-characteristics-over-time/

  • Raw performance: 1 GPU ≈ 10 × CPU
  • Practical performance is even better, because multi-CPU setups suffer from slow interconnects: 1 GPU > 10 × CPU, 4 GPUs >> 40 × CPU

SLIDE 9

Goal: a CUDA library for MF

[Diagram: applications (collaborative filtering, word embedding, …) built on cuMF (kernels for ALS and SGD), built on CUDA.]

  • Fast: fast training; update the model quickly
  • Scalable: deal with big data; exploit fast interconnects
  • Cost efficient: fully utilize flops or bandwidth; cheaper than CPU solutions

SLIDE 10

Challenges of SGD

  • Iterate over all ratings and apply the update in sequence
  • Memory bound
  • Two sub-problems:
    – 1. a fast update kernel: cache, memory coalescing, half precision, warp shuffle, ILP, register reuse
    – 2. how to parallelize: (a) Hogwild! with parallel workers, (b) matrix blocking

[Figure: (a) Hogwild! with parallel workers 0 and 1; (b) matrix blocking.]

SLIDE 11

Challenges on GPUs

Scalability

  • Existing CPU implementations do not scale beyond ~30 threads.
  • Main reason: scheduling overhead and context-switch overhead increase with the number of threads.

Vectorization

  • The inner loop is short vector arithmetic:

    float a = p[i] * q[i];    // dot product, accumulated over i
    p[i] = p[i] - grad_p[i];  // update p
    q[i] = q[i] - grad_q[i];  // update q

Memory access

  • Cache.
  • Memory efficiency.

Code complexity

  • Branchy code such as if (…) { … } else { … } is slow on GPUs.
  • GPUs are not good at processing complex logic; keeping the code structure simple is very important for performance. This is a common problem for all sparse applications.

SLIDE 12

Computation Optimization - 1: In-warp Vectorization

  • Use one thread block to process one update.
  • For any feature-vector length k, block size = warp size = 32 (e.g., k = 128: each lane handles 4 elements).
  • In-warp vectorization, as in the sketch below.

In-warp dot product: the fastest dot product operation on GPUs. No shared memory, no synchronization overhead, plus extra hardware acceleration (warp shuffle).

Instruction-level parallelism: since the Kepler architecture, GPUs exploit ILP by employing VLIW-like techniques; consecutive instructions with no data dependency execute in parallel.

Register file: fully utilize registers; keep all variables in the register file, the fastest storage on GPUs.
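
A minimal sketch of the in-warp pattern for k = 128 (assuming the CUDA 9-style __shfl_xor_sync; the real cuMF_sgd kernel adds half precision and the other tricks below):

__device__ float warp_dot_k128(const float* __restrict__ p,
                               const float* __restrict__ q) {
    int lane = threadIdx.x & 31;           // block size = warp size = 32
    float sum = 0.0f;
    #pragma unroll
    for (int i = lane; i < 128; i += 32)   // 4 independent FMAs per lane: ILP
        sum += p[i] * q[i];                // partials stay in registers
    // butterfly reduction via register shuffles:
    // no shared memory, no __syncthreads()
    for (int off = 16; off > 0; off >>= 1)
        sum += __shfl_xor_sync(0xffffffffu, sum, off);
    return sum;                            // every lane holds the full dot product
}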

SLIDE 13

Computation Optimization - 2: Optimized Memory Access

Cache: use the L1 cache to capture data locality in the matrix.

Half precision: Maxwell supports memory storage in half precision; using it reduces bandwidth consumption by 50% with no accuracy loss.

Memory coalescing: we ensure perfect memory coalescing at programming time, so every byte of off-chip memory access is utilized, achieving 100% bandwidth efficiency.

Register file: we keep all reusable data in the register file and avoid register spilling.
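
A sketch of the half-storage, float-compute idea (my illustration using cuda_fp16.h; the kernel and its names are hypothetical, not cuMF_sgd's code):

#include <cuda_fp16.h>

// Factors are stored as 16-bit halves, halving off-chip traffic;
// the arithmetic itself stays in 32-bit float to preserve accuracy.
__global__ void sgd_step_half(__half* p, const __half* grad,
                              float lr, int k) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < k) {
        float pf = __half2float(p[i]);       // 2-byte load, float math
        float gf = __half2float(grad[i]);
        p[i] = __float2half(pf - lr * gf);   // 2-byte store
    }
}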

SLIDE 14

Single GPU Scheduling

[Figure: a global scheduling table assigns blocks of the rating matrix to workers 0…N.]

[Figure: #updates/s vs. #parallel workers, libmf vs. libmf-GPU.]

Centralized scheduling: a global scheduler is responsible for the workload scheduling; when a parallel worker finishes its job, it asks the global scheduler for a new one. Scalability problem:

  • 1. On CPUs, it does not scale beyond 30 threads.
  • 2. On GPUs, it does not scale beyond 240 thread blocks.

SLIDE 15

Single GPU Scheduling – 1 Wave-like update

  • Basic observation: GPUs do not like complex block scheduling; a sketch of one possible wave pattern follows.
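
A sketch of a wave-like schedule (an assumption on my part, following the classic DSGD-style diagonal rotation; sgd_wave_kernel and the d_* names are hypothetical, see the HPDC 2017 paper for the exact scheme):

// The rating matrix is split into W x W blocks; worker w owns row
// block w. In wave t, worker w updates column block (w + t) % W, so
// no two workers ever share users or items: lock-free, and each
// worker derives its next block from the wave counter alone, with
// no scheduling table at all.
for (int t = 0; t < W; ++t) {
    // one kernel launch per wave keeps workers in lockstep;
    // inside, worker w = blockIdx.x processes block (w, (w + t) % W)
    sgd_wave_kernel<<<W, 32>>>(d_ratings, d_x, d_theta, t, W);
}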


SLIDE 16

Single GPU Scheduling – 2 Batch-Hogwild


  • Key idea: borrow the idea of Hogwild!
  • Optimization: fetch a batch of samples to exploit data locality (see the sketch below)

[Figure: comparison.]
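
A sketch of the idea (my illustration; Rating and sgd_update_one are hypothetical helpers, and the real kernel is more elaborate):

struct Rating { int u, v; float r; };
__device__ void sgd_update_one(Rating r, float* x, float* theta);  // hypothetical, defined elsewhere

// Plain Hogwild! draws one random rating per step; here each warp
// claims a contiguous batch of pre-shuffled ratings instead, so the
// rating reads are coalesced and cache-friendly while the model
// updates stay lock-free, exactly as in Hogwild!.
__global__ void batch_hogwild(const Rating* ratings, long n,
                              int batch, float* x, float* theta) {
    long warp  = (blockIdx.x * (long)blockDim.x + threadIdx.x) / 32;
    long begin = warp * batch;
    long end   = begin + batch < n ? begin + batch : n;
    // all 32 lanes walk the batch together, cooperating inside
    // sgd_update_one (e.g., via the in-warp dot product of slide 12)
    for (long i = begin; i < end; ++i)
        sgd_update_one(ratings[i], x, theta);
}
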
SLIDE 17

Performance

  • Faster than all state-of-the-art techniques with only one GPU card

SLIDE 18

Cross Architecture Scalability


SLIDE 19

Performance and achieved bandwidth


SLIDE 20

Conclusion


  • SGD-based MF is memory bound: try to increase memory bandwidth instead of increasing FLOPS.
  • GPUs do not prefer complex scheduling policies or control logic.
  • Half precision (16-bit floating point) is accurate enough for matrix factorization.
  • Understanding the architecture details of GPUs helps a lot when writing high-performance GPU applications.

  • cuMF_sgd is the fastest SGD-based MF.
SLIDE 21

Thank you, questions?

  • Acknowledgement: Xiaolong Xie, Liangliang Cao
  • Faster and Cheaper: Parallelizing Large-Scale Matrix Factorization on GPUs. HPDC 2016
  • CuMF_SGD: Fast and Scalable Matrix Factorization. HPDC 2017
  • Code: http://github.com/cuMF/
  • Blog: http://ibm.biz/cumf-blog
  • Contact: Wei Tan, wtan@us.ibm.com
