

SLIDE 1

PSL: Exploiting Parallelism, Sparsity and Locality to Accelerate Matrix Factorization on x86 Platforms

Weixin Deng, Pengyu Wang, Jing Wang, Chao Li

Shanghai Jiao Tong University

Bench’19 November 15, 2019


SLIDE 2

Outline

1. Introduction
2. Background
3. Design and Implementation
4. Evaluation


SLIDE 3

Recommendation Systems

Figure: Movie Recommendation on IMDb

Alternating least squares with weighted-λ-regularization (ALS-WR) is a popular solution to large-scale collaborative filtering.


SLIDE 4

ALS-WR

Algorithm 1: Updating U and M in One Epoch

 1: for i = 0 to n_u do in parallel
 2:     M_i ← M[:, I_i^u]
 3:     R_i ← R[i, I_i^u]
 4:     V_i ← M_i R_i^T
 5:     A_i ← M_i M_i^T + λ n_{u_i} E
 6:     U[:, i] ← linalg.solve(A_i, V_i)
 7: end for
 8: for j = 0 to n_m do in parallel
 9:     U_j ← U[:, I_j^m]
10:     R_j ← R[I_j^m, j]
11:     V_j ← U_j R_j^T
12:     A_j ← U_j U_j^T + λ n_{m_j} E
13:     M[:, j] ← linalg.solve(A_j, V_j)
14: end for

Here I_i^u denotes the set of items rated by user i and I_j^m the set of users who rated item j; n_{u_i} and n_{m_j} are their sizes, E is the n_f × n_f identity matrix, and λ is the regularization weight.

Main components of ALS-WR

◮ Accessing data of U, M and R.
◮ Performing linear algebra operations and solving linear equations.

Optimization opportunities

◮ Adopting suitable data layouts.
◮ Parallelizing loops.
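As a concrete illustration, here is a minimal sketch of how the user-update half of Algorithm 1 (lines 1-7) could map onto single-threaded MKL calls inside an OpenMP parallel for. The function name update_users, the CSR argument layout, and the column-gather strategy are illustrative assumptions, not the authors' implementation; only cblas_sgemm and LAPACKE_sgesv are confirmed by the hotspot profile later in the deck.

    // Hypothetical sketch of Algorithm 1, lines 1-7 (not the authors' code).
    // Assumes: column-major U (nf x nu) and M (nf x nm), ratings R stored in
    // CSR so the items rated by user i (I_i^u) are contiguous, and MKL
    // running single-threaded beneath the OpenMP loop.
    #include <cstddef>
    #include <mkl.h>
    #include <omp.h>
    #include <vector>

    void update_users(int nu, int nf, float lambda,
                      const int* ptr, const int* idx, const float* val, // CSR of R
                      const float* M, float* U) {
        #pragma omp parallel for schedule(dynamic)
        for (int i = 0; i < nu; ++i) {
            const int n_ui = ptr[i + 1] - ptr[i];      // # ratings of user i
            if (n_ui == 0) continue;
            std::vector<float> Mi((std::size_t)nf * n_ui); // M_i = M[:, I_i^u]
            std::vector<float> Ai((std::size_t)nf * nf);
            std::vector<float> Vi(nf, 0.0f);
            for (int k = 0; k < n_ui; ++k) {
                const int j = idx[ptr[i] + k];
                // Gather column j of M into column k of M_i.
                cblas_scopy(nf, M + (std::size_t)j * nf, 1,
                            Mi.data() + (std::size_t)k * nf, 1);
                // V_i = M_i R_i^T, accumulated one rated column at a time.
                cblas_saxpy(nf, val[ptr[i] + k], M + (std::size_t)j * nf, 1,
                            Vi.data(), 1);
            }
            // A_i = M_i M_i^T + lambda * n_ui * E
            cblas_sgemm(CblasColMajor, CblasNoTrans, CblasTrans, nf, nf, n_ui,
                        1.0f, Mi.data(), nf, Mi.data(), nf, 0.0f, Ai.data(), nf);
            for (int d = 0; d < nf; ++d)
                Ai[(std::size_t)d * nf + d] += lambda * n_ui;
            // U[:, i] <- linalg.solve(A_i, V_i); sgesv overwrites Vi with x.
            std::vector<int> ipiv(nf);
            LAPACKE_sgesv(LAPACK_COL_MAJOR, nf, 1, Ai.data(), nf, ipiv.data(),
                          Vi.data(), nf);
            cblas_scopy(nf, Vi.data(), 1, U + (std::size_t)i * nf, 1);
        }
    }

The item-update half (lines 8-14) is symmetric, slicing R by column instead of by row.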


SLIDES 5-8

(Identical in content to SLIDE 4.)

SLIDE 9

Root Mean Square Error (RMSE)

Algorithm 2: Calculating RMSE

1: s ← 0
2: for each rating r_ij do in parallel
3:     U_i ← U[:, i]
4:     M_j ← M[:, j]
5:     r̂_ij ← U_i^T M_j
6:     s ← s + (r̂_ij − r_ij)^2
7: end for
8: return (s / N)^{1/2}

Main components of RMSE

◮ Accessing data of U and M.
◮ Performing linear algebra operations.

Optimization opportunities

◮ Adopting suitable data layouts.
◮ Parallelizing loops.
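A minimal sketch of how Algorithm 2 could be parallelized with an OpenMP reduction follows. The COO triplet storage for the test ratings and the function signature are illustrative assumptions, not the authors' code.

    // Hypothetical sketch of Algorithm 2 (not the authors' code).
    // reduction(+:s) gives each thread a private partial sum, so no atomic
    // operations are needed on the shared accumulator.
    #include <cmath>
    #include <cstddef>
    #include <mkl.h>
    #include <omp.h>

    float rmse(std::size_t N, const int* row, const int* col, const float* val,
               int nf, const float* U, const float* M) {
        double s = 0.0;
        #pragma omp parallel for reduction(+:s)
        for (std::ptrdiff_t k = 0; k < (std::ptrdiff_t)N; ++k) {
            // r_hat_ij = U[:, i]^T M[:, j] with column-major feature matrices.
            const float r_hat = cblas_sdot(nf, U + (std::size_t)row[k] * nf, 1,
                                               M + (std::size_t)col[k] * nf, 1);
            const double e = r_hat - val[k];
            s += e * e;
        }
        return (float)std::sqrt(s / (double)N);
    }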


SLIDE 10

Design and Implementation

Motivation

◮ Accessing columns of dense matrices, and rows and columns of sparse matrices.
◮ Performing linear algebra operations and solving linear equations.
◮ Making loops multi-threaded.

1. Parallelism
   ◮ Multi-threaded loops (OpenMP)
     ⋆ parallel for
     ⋆ schedule(dynamic) (work balancing)
     ⋆ reduction(+:sum) (thread-local copies that avoid atomic operations)
   ◮ Single-threaded Intel MKL (so MKL threads do not oversubscribe the OpenMP threads)
2. Sparsity and Locality
   ◮ Rating data (sparse)
     ⋆ Compressed Sparse Row (CSR)
     ⋆ Compressed Sparse Column (CSC)
   ◮ Feature matrices (dense)
     ⋆ Column-major
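On the sparsity side, the sketch below shows one standard way to build the CSR form from COO triplets (an assumed input format, not the authors' loader); the CSC build is symmetric, with rows and columns swapped. On the parallelism side, MKL can be kept single-threaded beneath the OpenMP loops with mkl_set_num_threads(1).

    // Minimal sketch: building CSR from 0-based COO triplets (rows ri,
    // columns ci, values v). CSR makes a user's ratings contiguous, so the
    // user loop of Algorithm 1 touches only the nonzeros it needs.
    #include <cstddef>
    #include <vector>

    struct CSR {
        std::vector<int>   ptr;  // nu + 1 row offsets
        std::vector<int>   col;  // nnz column indices
        std::vector<float> val;  // nnz rating values
    };

    CSR coo_to_csr(int nu, const std::vector<int>& ri,
                   const std::vector<int>& ci, const std::vector<float>& v) {
        CSR m;
        m.ptr.assign(nu + 1, 0);
        for (int r : ri) ++m.ptr[r + 1];                        // per-row counts
        for (int r = 0; r < nu; ++r) m.ptr[r + 1] += m.ptr[r];  // prefix sum
        m.col.resize(v.size());
        m.val.resize(v.size());
        std::vector<int> next(m.ptr.begin(), m.ptr.end() - 1);  // insertion cursors
        for (std::size_t k = 0; k < v.size(); ++k) {
            const int p = next[ri[k]]++;
            m.col[p] = ci[k];
            m.val[p] = v[k];
        }
        return m;
    }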


SLIDE 11

Results

Table: Training Parameters

    Input Data                             ml-20m (500 MB)
    Dimension of Feature Matrices (n_f)    100
    Number of Epochs                       30
    Train-Test Data Ratio                  4:1

Table: Execution Time

    Loading Data                           15 s
    Training (per epoch)                   52 s
    Training with RMSE (per epoch)         63 s

Measured on a 64-core machine (2 sockets, 32 cores each) with Hygon C86 7185 processors.


SLIDE 12

Performance Analysis

1. Hotspots
2. Scalability
3. Speedup of different optimizations

Platform

◮ Performed on a 20-core machine (2 sockets, 10 cores each) with Intel Xeon Silver 4114 (hyper-threaded, 40 threads in total).
◮ Intel VTune Amplifier is used to measure hotspots.


SLIDE 13

Hotspots

Table: Hotspots of the 40-threaded ALS-WR

    Rank   Function                   CPU Time
    1      cblas_sgemm                36.1%
    2      LAPACKE_sgesv_work         23.7%
    3      __intel_avx_rep_memcpy     10.8%
    4      operator new               7.5%
    5      [MKL SERVICE]@malloc       7.1%

Hotspots 1 and 2 perform matrix multiplication and solve linear equations; hotspot 3 stems from accessing columns of dense matrices; hotspots 4 and 5 are memory allocation. Further optimizing memory access and memory management might yield better performance; one such refinement is sketched below.
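Following that suggestion, here is a hypothetical refinement (not from the deck): hoisting per-thread scratch buffers out of the parallel loop so each thread allocates once, removing operator new and MKL's allocator from the hot path. The function name and the max_nui cap on ratings per user are assumptions; the elided loop body would perform the gather / cblas_sgemm / LAPACKE_sgesv steps of Algorithm 1.

    // Hypothetical refinement: one allocation per thread, reused for every
    // user that thread processes, instead of fresh buffers per iteration.
    #include <cstddef>
    #include <omp.h>
    #include <vector>

    void update_users_with_buffer_reuse(int nu, int nf, int max_nui) {
        #pragma omp parallel
        {
            // Thread-local scratch space, sized for the largest user.
            std::vector<float> Mi((std::size_t)nf * max_nui);
            std::vector<float> Ai((std::size_t)nf * nf);
            std::vector<float> Vi(nf);
            std::vector<int>   ipiv(nf);

            #pragma omp for schedule(dynamic)
            for (int i = 0; i < nu; ++i) {
                // ... gather M_i, form A_i and V_i, solve, write U[:, i],
                //     exactly as in Algorithm 1, but into Mi/Ai/Vi/ipiv ...
                (void)i;
            }
        }
    }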


SLIDE 14

Scalability

Figure: Speedup of Multi-threaded ALS-WR

    Threads    1     4     8     12    16    20    40
    Speedup    1.0   3.1   5.4   7.6   9.7   11.4  14.5

The speedup is up to 14.5x when using 40 threads.


SLIDE 15

Summary

Figure: Speedup of Different Optimizations (cumulative)

    Baseline                                      1.0
    + Data Layouts                                1.8   (1.8x)
    + Multi-Threading with Atomic Operations      8.9   (4.9x)
    + Multi-Threading with Thread-Local Copies    26.2  (2.9x)

Leveraging sparsity makes processing large-scale data feasible. Improving spatial locality brings a 1.8x speedup, and multi-threading offers a further 14.5x, for about 26.2x overall.


SLIDE 16

End

Thank you for listening!
