
PSL: Exploiting Parallelism, Sparsity and Locality to Accelerate Matrix Factorization on x86 Platforms
Weixin Deng, Pengyu Wang, Jing Wang, Chao Li
Shanghai Jiao Tong University
Bench'19, November 15, 2019


  1. PSL: Exploiting Parallelism, Sparsity and Locality to Accelerate Matrix Factorization on x86 Platforms. Weixin Deng, Pengyu Wang, Jing Wang, Chao Li, Shanghai Jiao Tong University. Bench'19, November 15, 2019.

  2. Outline: 1 Introduction, 2 Background, 3 Design and Implementation, 4 Evaluation.

  3. Recommendation Systems. [Figure: Movie Recommendation on IMDB] Alternating least squares with weighted-λ-regularization (ALS-WR) is a popular solution to large-scale collaborative filtering.

  4. ALS-WR

  Algorithm 1: Updating U and M in One Epoch
    for i = 0 to n_u do in parallel
      M_i ← M[:, I_i^u]
      R_i ← R[i, I_i^u]
      V_i ← M_i R_i^T
      A_i ← M_i M_i^T + λ n_{u_i} E
      U[:, i] ← linalg.solve(A_i, V_i)
    end for
    for j = 0 to n_m do in parallel
      U_j ← U[:, I_j^m]
      R_j ← R[I_j^m, j]
      V_j ← U_j R_j^T
      A_j ← U_j U_j^T + λ n_{m_j} E
      M[:, j] ← linalg.solve(A_j, V_j)
    end for

  Main components of ALS-WR:
  - Accessing data of U, M, and R.
  - Performing linear algebra operations and solving linear equations.

  Optimization opportunities:
  - Adopting suitable data layouts.
  - Parallelizing loops.
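  As a concrete illustration of one iteration of Algorithm 1, the following C++ sketch (not from the slides) forms A_i and V_i with CBLAS and solves for U[:, i] with LAPACKE, assuming column-major feature matrices and sequential MKL. The function name update_user, its parameter layout, and the gathered buffer Mi are illustrative assumptions, not the authors' code.

    // Hypothetical sketch of one user update in Algorithm 1 using sequential MKL.
    // M is the nf x n_m item-feature matrix (column-major); cols[]/vals[] hold the
    // item ids and ratings of user i (one CSR row); Ui receives column i of U.
    #include <mkl.h>
    #include <algorithm>
    #include <cstddef>
    #include <vector>

    void update_user(int nf, float lambda, const float* M,
                     const int* cols, const float* vals, int nnz, float* Ui)
    {
        // Gather M_i = M[:, I_i^u]: the nf x nnz block of columns rated by user i.
        std::vector<float> Mi(static_cast<size_t>(nf) * nnz);
        for (int k = 0; k < nnz; ++k)
            std::copy(M + static_cast<size_t>(cols[k]) * nf,
                      M + static_cast<size_t>(cols[k] + 1) * nf,
                      Mi.data() + static_cast<size_t>(k) * nf);

        // A_i = M_i * M_i^T + lambda * n_{u_i} * E   (nf x nf)
        std::vector<float> Ai(static_cast<size_t>(nf) * nf);
        cblas_sgemm(CblasColMajor, CblasNoTrans, CblasTrans, nf, nf, nnz,
                    1.0f, Mi.data(), nf, Mi.data(), nf, 0.0f, Ai.data(), nf);
        for (int d = 0; d < nf; ++d)
            Ai[static_cast<size_t>(d) * nf + d] += lambda * nnz;

        // V_i = M_i * R_i^T   (nf x 1), written directly into Ui
        cblas_sgemv(CblasColMajor, CblasNoTrans, nf, nnz,
                    1.0f, Mi.data(), nf, vals, 1, 0.0f, Ui, 1);

        // Solve A_i * U[:, i] = V_i; LAPACKE_sgesv overwrites Ui with the solution.
        std::vector<lapack_int> ipiv(nf);
        LAPACKE_sgesv(LAPACK_COL_MAJOR, nf, 1, Ai.data(), nf, ipiv.data(), Ui, nf);
    }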

  9. Root Mean Square Error (RMSE)

  Algorithm 2: Calculating RMSE
    s ← 0
    for each rating r_ij do in parallel
      U_i ← U[:, i]
      M_j ← M[:, j]
      r̂_ij ← U_i^T M_j
      s ← s + (r̂_ij - r_ij)^2
    end for
    return (s/N)^(1/2)

  Main components of RMSE:
  - Accessing data of U and M.
  - Performing linear algebra operations.

  Optimization opportunities:
  - Adopting suitable data layouts.
  - Parallelizing loops.
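  Not from the slides: a minimal C++/OpenMP sketch of Algorithm 2, assuming column-major U and M and a flat array of N test ratings (users[k], items[k], ratings[k]); all names are illustrative. The reduction(+:s) clause is the same pattern highlighted on the next slide: each thread keeps a private partial sum, so no atomic operations are needed in the loop.

    // Hypothetical sketch of the parallel RMSE computation (Algorithm 2).
    #include <cmath>

    float rmse(int nf, long long N, const float* U, const float* M,
               const int* users, const int* items, const float* ratings)
    {
        double s = 0.0;
        // Each thread accumulates into a private copy of s; OpenMP sums them at the end.
        #pragma omp parallel for reduction(+:s)
        for (long long k = 0; k < N; ++k) {
            const float* Ui = U + static_cast<long long>(users[k]) * nf;  // U[:, i]
            const float* Mj = M + static_cast<long long>(items[k]) * nf;  // M[:, j]
            float pred = 0.0f;                        // r̂_ij = U_i^T M_j
            for (int d = 0; d < nf; ++d)
                pred += Ui[d] * Mj[d];
            const double err = static_cast<double>(pred) - ratings[k];
            s += err * err;
        }
        return static_cast<float>(std::sqrt(s / static_cast<double>(N)));
    }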

  10. Design and Implementation

  Motivation:
  - Accessing columns of dense matrices, and rows and columns of sparse matrices.
  - Performing linear algebra operations and solving linear equations.
  - Making loops multi-threaded.

  1 Parallelism
  - Multi-threaded loops (OpenMP):
    - parallel for
    - schedule(dynamic) for load balancing
    - reduction(+:sum): thread-local copies that avoid atomic operations
  - Single-threaded Intel MKL (each OpenMP thread calls sequential MKL routines)

  2 Sparsity and Locality
  - Rating data (sparse):
    - Compressed Sparse Row (CSR)
    - Compressed Sparse Column (CSC)
  - Feature matrices (dense):
    - Column-major layout
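  To make the layout and threading choices above concrete, here is a hedged C++ sketch (not the authors' code): ratings are kept in both CSR and CSC form, the dense feature matrices are column-major, and sequential MKL runs inside a dynamically scheduled OpenMP loop. It reuses the hypothetical update_user helper from the earlier sketch; all struct and function names are illustrative.

    // Hypothetical data layouts and threading skeleton; names are illustrative.
    #include <mkl.h>
    #include <cstddef>
    #include <vector>

    // Ratings stored twice: CSR gives contiguous access to one user's ratings
    // (for updating U), CSC gives contiguous access to one item's ratings (for M).
    struct CsrMatrix { std::vector<int> row_ptr, col_idx; std::vector<float> val; };
    struct CscMatrix { std::vector<int> col_ptr, row_idx; std::vector<float> val; };

    // Dense feature matrices are column-major, so U[:, i] and M[:, j] are
    // contiguous nf-element slices starting at data[i * nf] and data[j * nf].
    struct DenseColMajor { int nf; std::vector<float> data; };

    // Per-user solve sketched after Algorithm 1 (hypothetical helper).
    void update_user(int nf, float lambda, const float* M,
                     const int* cols, const float* vals, int nnz, float* Ui);

    void update_all_users(const CsrMatrix& R, const DenseColMajor& M,
                          DenseColMajor& U, float lambda)
    {
        mkl_set_num_threads(1);  // sequential MKL; parallelism comes from the outer loop
        const int n_u = static_cast<int>(R.row_ptr.size()) - 1;

        // dynamic scheduling balances work, since users rate different numbers of items
        #pragma omp parallel for schedule(dynamic)
        for (int i = 0; i < n_u; ++i) {
            const int begin = R.row_ptr[i];
            const int n_ui = R.row_ptr[i + 1] - begin;
            update_user(M.nf, lambda, M.data.data(),
                        R.col_idx.data() + begin, R.val.data() + begin, n_ui,
                        U.data.data() + static_cast<size_t>(i) * M.nf);
        }
        // The item update is symmetric, looping over the columns of the CSC matrix.
    }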

  11. Results

  Table: Training Parameters
    Input data: ml-20m (500 MB)
    Dimension of feature matrices (n_f): 100
    Number of epochs: 30
    Train-test data ratio: 4:1

  Table: Execution Time (per epoch, seconds)
    Loading Data: 15
    Training: 52
    Training with RMSE: 63

  Measured on a 64-core machine (2 sockets, 32 cores each) with Hygon C86 7185 processors.

  12. Performance Analysis
  1 Hotspots
  2 Scalability
  3 Speedup of different optimizations

  Platform:
  - A 20-core machine (2 sockets, 10 cores each) with Intel Xeon Silver 4114 (hyper-threaded, 40 threads in total).
  - Intel VTune Amplifier is used to measure hotspots.

  13. Hotspots

  Table: Hotspots of the 40-threaded ALS-WR
    1. cblas_sgemm: 36.1% of CPU time
    2. LAPACKE_sgesv_work: 23.7%
    3. __intel_avx_rep_memcpy: 10.8%
    4. operator new: 7.5%
    5. [MKL SERVICE]@malloc: 7.1%

  The top two hotspots perform matrix multiplication and solve linear equations; the memcpy hotspot comes from accessing columns of dense matrices; the last two are memory allocation. Further optimizing memory access and memory management might help gain better performance.
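  The allocation hotspots (operator new, [MKL SERVICE]@malloc) suggest one common mitigation, not claimed by the slides: allocate per-thread scratch buffers once per epoch instead of once per user. A skeleton of that pattern, with illustrative names:

    // Hypothetical skeleton: hoist scratch allocations out of the per-user loop so
    // each thread allocates once per epoch rather than once per iteration.
    #include <cstddef>
    #include <vector>

    void update_all_users_prealloc(int n_u, int nf, int max_items_per_user)
    {
        #pragma omp parallel
        {
            // One set of buffers per thread, reused for every user it processes.
            std::vector<float> Mi(static_cast<std::size_t>(nf) * max_items_per_user); // gathered M_i
            std::vector<float> Ai(static_cast<std::size_t>(nf) * nf);                 // normal matrix A_i
            std::vector<int>   ipiv(nf);                                              // pivot array for sgesv

            #pragma omp for schedule(dynamic)
            for (int i = 0; i < n_u; ++i) {
                // ... gather into Mi, build Ai, and solve into U[:, i] as in the earlier sketches ...
            }
        }
    }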

  14. Scalability

  [Figure: Speedup of Multi-threaded ALS-WR. Speedup vs. number of threads: 1.0x at 1 thread, 3.1x at 4, 5.4x at 8, 7.6x at 12, 9.7x at 16, 11.4x at 20, and 14.5x at 40 threads.]

  The speedup is up to 14.5x when using 40 threads.

  15. Summary

  [Figure: Speedup of Different Optimizations. Cumulative speedup over the baseline: 1.0x baseline; 1.8x with better data layouts (1.8x gain); 8.9x adding multi-threading with atomic operations (4.9x gain); 26.2x with multi-threading using thread-local copies (2.9x gain).]

  Leveraging sparsity helps process large-scale data. Improving spatial locality brings 1.8x speedup. Using multi-threading offers 14.5x speedup.

  16. End. Thank you for listening!
