CuMF: Large-Scale Matrix Factorization on Just One Machine with GPUs


SLIDE 1

CuMF: Large-Scale Matrix Factorization on Just One Machine with GPUs

Wei Tan, IBM T. J. Watson Research Center, wtan@us.ibm.com

SLIDE 2

Agenda

  • Why accelerate recommendation (matrix factorization) using GPUs?
  • What are the challenges?
  • How does cuMF tackle the challenges?
  • So what is the result?

SLIDE 3

Why do we need a fast and scalable recommender system?


  • Recommendation is pervasive
  • Drives 80% of Netflix’s watch hours [1]
  • US digital ads market: $37.3 billion [2]
  • Need: fast, scalable, economical

SLIDE 4

ALS (alternating least squares) for collaborative filtering


  • Input: users’ ratings on some items
  • Output: all missing ratings
  • How: factorize the rating matrix R into X·Θᵀ, and minimize the loss on the observed ratings
  • ALS: fix Θ and solve for X, then fix X and solve for Θ; iterate

[Figure: R ≈ X·Θᵀ; row x_u of X holds user u’s factors, column θ_v of Θᵀ holds item v’s factors]
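Written out, the objective is the standard ALS loss; Ω is the set of observed (u, v) pairs, and the plain ℓ2 regularizer below is one common choice (the exact weighting used by cuMF may differ):

\min_{X,\Theta}\;\sum_{(u,v)\in\Omega}\bigl(r_{uv}-x_u^{\top}\theta_v\bigr)^2
\;+\;\lambda\Bigl(\sum_{u}\lVert x_u\rVert^{2}+\sum_{v}\lVert\theta_v\rVert^{2}\Bigr)

With Θ fixed, the loss is quadratic in each x_u (and vice versa), so every ALS half-step decomposes into independent least-squares solves: one per user, then one per item.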

SLIDE 5

Matrix factorization/ALS is versatile


[Figure: three instances of R ≈ X·Θᵀ]
  • Recommendation: user × item rating matrix
  • Word embedding: word × word co-occurrence matrix
  • Topic model: word × document matrix

Matrix factorization is a very important algorithm to accelerate.

SLIDE 6

Challenges of fast and scalable ALS


  • ALS needs to solve, for every user u (and symmetrically for every item v):

    x_u = ( Σ_{v∈Ω_u} θ_v θ_vᵀ + λI )⁻¹ · Σ_{v∈Ω_u} r_uv θ_v

    – aggregating the θ_v’s: spmm-like; solving each f×f system: LU or Cholesky decomposition (cuBLAS)
  • Challenge 1: accessing and aggregating many θ_v’s is irregular (R is sparse) and memory intensive
  • Challenge 2: a single GPU cannot handle big m, n, and nnz
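These m per-user systems are independent, so they map naturally onto batched cuBLAS routines. A minimal sketch of the solve step (not the cuMF source; buffer setup omitted), assuming the A_u and b_u have already been formed on the GPU:

#include <cublas_v2.h>

/* d_Aptr, d_Bptr: device arrays holding m device pointers to the f x f
 * matrices A_u = sum θ_v θ_vᵀ + λI and the right-hand sides b_u = sum r_uv θ_v.
 * d_piv: m*f pivot indices (device); d_info: m status ints (device). */
void batched_solve(cublasHandle_t h, int m, int f,
                   float **d_Aptr, float **d_Bptr, int *d_piv, int *d_info)
{
    int h_info = 0;
    /* batched LU factorization of all m matrices A_u, in place */
    cublasSgetrfBatched(h, f, d_Aptr, f, d_piv, d_info, m);
    /* batched triangular solves; each solution x_u overwrites its b_u */
    cublasSgetrsBatched(h, CUBLAS_OP_N, f, 1,
                        (const float * const *)d_Aptr, f, d_piv,
                        d_Bptr, f, &h_info, m);
}

Each f×f system is tiny, so batching is what keeps the GPU busy: one call factorizes all m matrices, a second call solves them.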

SLIDE 7

Challenge 1: memory access


  • NVIDIA K40: memory bandwidth 288 GB/s, compute 4.3 Tflop/s
  • Reaching peak flops requires an operational intensity of at least 4.3 Tflop/s ÷ 288 GB/s ≈ 15 flops/byte → more flops per byte → caching!

[Figure: roofline plot of Gflop/s vs. operational intensity (flops/byte); below ≈15 flops/byte the GPU is bandwidth-bound (under-utilized), above it compute-bound (fully utilized at 4.3 Tflop/s)]

SLIDE 8

Address challenge 1: memory-optimized ALS


  • To obtain high operational intensity when forming Σ θ_v θ_vᵀ (see the sketch below):
  • 1. Reuse each θ_v for many users
  • 2. Load a portion of Θ into shared memory (smem)
  • 3. Tile and aggregate in registers
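A minimal CUDA sketch of steps 1–3 (illustrative, not the cuMF kernel: it assumes f = 32 and a 2×2 register tile per thread, where cuMF uses a larger f, deeper tiling, and the symmetry of the output):

#include <cuda_runtime.h>

#define F 32   // factor dimension f (kept small here for clarity)
#define T 2    // register-tile edge: each thread owns a T x T patch of the F x F output

// One block per user u: A_u = sum over u's rated items v of theta_v * theta_v^T.
// Each theta_v is staged once into shared memory and reused by all (F/T)^2 threads,
// which is the data reuse that raises flops/byte. lambda*I is added before the solve
// (not shown). Launch: form_hermitian<<<m, dim3(F/T, F/T)>>>(theta, row_ptr, col_idx, A);
__global__ void form_hermitian(const float *__restrict__ theta,   // F x n, column-major
                               const int   *__restrict__ row_ptr, // CSR row pointers of R
                               const int   *__restrict__ col_idx, // item ids per user
                               float       *__restrict__ A)       // m output matrices, F x F each
{
    __shared__ float th[F];                 // current theta_v, shared by the whole block
    const int u  = blockIdx.x;
    const int ti = threadIdx.y * T;         // top row of this thread's register tile
    const int tj = threadIdx.x * T;         // left column of this thread's register tile
    float acc[T][T] = {0};                  // accumulator lives in registers

    for (int p = row_ptr[u]; p < row_ptr[u + 1]; ++p) {
        const int v = col_idx[p];
        // cooperative load of theta_v into shared memory
        for (int k = threadIdx.y * blockDim.x + threadIdx.x; k < F;
             k += blockDim.x * blockDim.y)
            th[k] = theta[v * F + k];
        __syncthreads();
        // rank-1 update of the register tile: acc += th[ti..] * th[tj..]^T
        for (int a = 0; a < T; ++a)
            for (int b = 0; b < T; ++b)
                acc[a][b] += th[ti + a] * th[tj + b];
        __syncthreads();                    // keep th[] intact until all threads are done
    }
    // write this thread's T x T patch of A_u back to global memory
    for (int a = 0; a < T; ++a)
        for (int b = 0; b < T; ++b)
            A[u * F * F + (ti + a) * F + (tj + b)] = acc[a][b];
}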

SLIDE 9

Address challenge 1: memory-optimized ALS


[Figure: the f × f output Σ θ_v θ_vᵀ is partitioned into T × T tiles; each thread accumulates one tile in registers while the θ_v values stream through shared memory]

SLIDE 10

Address challenge 2: scale-up ALS on multiple GPUs

  • Model parallel: each GPU solves a portion of the model

[Figure: model-parallel partitioning of the parameters across GPUs]

SLIDE 11

Address challenge 2: scale-up ALS on multiple GPUs

  • Data parallel: each GPU solves the model with a portion of the training data

[Figure: combining data parallelism with model parallelism across GPUs]

SLIDE 12

Address challenge 2: parallel reduction

  • Data parallel needs cross-GPU reduction, sketched below

[Figure: one-phase vs. two-phase parallel reduction; two-phase reduces within each socket first (intra-socket), then once across sockets (inter-socket)]
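A host-side sketch of the two-phase scheme, under stated assumptions: 4 GPUs with peer access enabled, GPUs 0–1 on one socket and 2–3 on the other, buf[g] holding GPU g's partial result of len floats, scratch[g] a receive buffer (helper names are illustrative, not cuMF's):

#include <cuda_runtime.h>

__global__ void axpy(float *dst, const float *src, size_t len)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < len) dst[i] += src[i];       // elementwise dst += src
}

static void add_on(int gpu, float *dst, const float *src, size_t len)
{
    cudaSetDevice(gpu);                  // run the accumulation on GPU `gpu`
    axpy<<<(unsigned)((len + 255) / 256), 256>>>(dst, src, len);
    cudaDeviceSynchronize();
}

void two_phase_reduce(float *buf[4], float *scratch[4], size_t len)
{
    size_t bytes = len * sizeof(float);
    /* phase 1: intra-socket, over the fast links (GPU1 -> GPU0, GPU3 -> GPU2) */
    cudaMemcpyPeer(scratch[0], 0, buf[1], 1, bytes);
    add_on(0, buf[0], scratch[0], len);
    cudaMemcpyPeer(scratch[2], 2, buf[3], 3, bytes);
    add_on(2, buf[2], scratch[2], len);
    /* phase 2: inter-socket, a single transfer over the slower link */
    cudaMemcpyPeer(scratch[0], 0, buf[2], 2, bytes);
    add_on(0, buf[0], scratch[0], len);
    /* buf[0] now holds the full sum; broadcast back as needed */
}

The point of the two phases is that only one transfer crosses the slow inter-socket link, instead of three.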

SLIDE 13

Recap: cuMF tackled two challenges


  • ALS needs to solve, for every user u (and symmetrically for every item v):

    x_u = ( Σ_{v∈Ω_u} θ_v θ_vᵀ + λI )⁻¹ · Σ_{v∈Ω_u} r_uv θ_v

    – aggregating the θ_v’s: spmm-like; solving each f×f system: LU or Cholesky decomposition (cuBLAS)
  • Challenge 1: accessing and aggregating many θ_v’s is irregular (R is sparse) and memory intensive
  • Challenge 2: a single GPU cannot handle big m, n, and nnz

SLIDE 14

Connect cuMF to Spark MLlib


  • Spark applications relying on mllib/ALS need no change
  • Modified mllib/ALS detects GPUs and offloads the computation
  • Leverages the best of Spark (scale-out) and GPUs (scale-up)

[Figure: software stack: ALS apps on top of mllib/ALS, which calls cuMF through JNI]
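The JNI boundary could look like the following sketch; the class and method names are hypothetical, not the actual cuMF binding. mllib/ALS hands over the ratings in CSR form and receives the factor matrices back:

#include <jni.h>

/* Hypothetical JNI entry point -- illustrative names, not the real cuMF API. */
JNIEXPORT void JNICALL Java_cumf_AlsJni_solve(
    JNIEnv *env, jclass cls,
    jint m, jint n, jint f, jfloat lambda,
    jintArray rowPtr, jintArray colIdx, jfloatArray vals,
    jfloatArray X, jfloatArray Theta)
{
    /* pin the Java arrays so native code can read the CSR ratings */
    jint   *h_row = (*env)->GetIntArrayElements(env, rowPtr, NULL);
    jint   *h_col = (*env)->GetIntArrayElements(env, colIdx, NULL);
    jfloat *h_val = (*env)->GetFloatArrayElements(env, vals, NULL);

    /* ... copy CSR to the GPU, run the ALS iterations, copy the factors
     *     back into X and Theta via SetFloatArrayRegion ... */

    (*env)->ReleaseIntArrayElements(env, rowPtr, h_row, JNI_ABORT);
    (*env)->ReleaseIntArrayElements(env, colIdx, h_col, JNI_ABORT);
    (*env)->ReleaseFloatArrayElements(env, vals, h_val, JNI_ABORT);
}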

SLIDE 15

Connect cuMF to Spark MLlib

[Figure: Spark + cuMF on one Power 8 node with two K40 GPUs; rating RDDs stay on the CPU, CUDA solver kernels run on each GPU, and Spark shuffles exchange parameters between iterations]

  • RDD on CPU: to distribute rating data and shuffle parameters
  • Solver on GPU: to form and solve the least-squares subproblems
  • Able to run on multiple nodes, and multiple GPUs per node


SLIDE 16

Implementation


  • In C (ca. 10k LOC)
  • CUDA 7.0/7.5; GCC with OpenMP 3.0
  • Baselines:

    – libmf: SGD on 1 node [RecSys 14]
    – NOMAD: SGD on multiple nodes [VLDB 14]
    – SparkALS: ALS on Spark
    – Factorbird: SGD + parameter server for MF
    – Facebook: enhanced Giraph

SLIDE 17

CuMF performance


[Figure: convergence curves; x-axis: time in seconds, y-axis: Root Mean Square Error (RMSE) on the test set]

  • 1 GPU vs. 30 cores: cuMF is slightly faster than libmf and NOMAD
  • cuMF scales well on 1, 2, and 4 GPUs

SLIDE 18

Effectiveness of memory optimization


[Figure: convergence curves; x-axis: time in seconds, y-axis: Root Mean Square Error (RMSE) on the test set]

  • Aggressively using registers → 2x speedup
  • Using texture → 25%-35% faster

SLIDE 19

CuMF performance and cost


  • cuMF @ 4 GPUs ≈ NOMAD @ 64 HPC nodes ≈ 10x NOMAD @ 32 AWS nodes
  • cuMF @ 4 GPUs ≈ 10x SparkALS @ 50 nodes, at ≈ 1% of its cost

SLIDE 20

CuMF accelerated Spark on Power 8


  • cuMF @ 2 K40s achieves a 6+x speedup in training (193 sec vs. 1.3k sec)

*GUI designed by Amir Sanjar

SLIDE 21

Conclusion

  • Why accelerate recommendation (matrix factorization) using GPUs?

    – Need to be fast, scalable, and economical

  • What are the challenges?

    – Memory access; scaling to multiple GPUs

  • How does cuMF tackle the challenges?

    – Optimized memory access, parallelism, and communication

  • So what is the result?

    – Up to 10x as fast, 100x as cost-efficient
    – Use cuMF standalone or with Spark
    – GPUs can tackle ML problems beyond deep learning!

SLIDE 22

Wei Tan, IBM T. J. Watson Research Center
wtan@us.ibm.com, http://ibm.biz/wei_tan

Thank you, questions?

Faster and Cheaper: Parallelizing Large-Scale Matrix Factorization on GPUs. Wei Tan, Liangliang Cao, Liana Fong. HPDC 2016. http://arxiv.org/abs/1603.03820
Source code available soon.