

SLIDE 1

PSL: Exploiting Parallelism, Sparsity and Locality to Accelerate Matrix Factorization on x86 Platforms

Weixin Deng, Pengyu Wang, Jing Wang, Chao Li

Shanghai Jiao Tong University

Bench’19 November 15, 2019


SLIDE 2

Outline

1. Introduction
2. Background
3. Design and Implementation
4. Evaluation


SLIDE 3

Recommendation Systems

Figure: Movie Recommendation on IMDb

Alternating least squares with weighted-λ-regularization (ALS-WR) is a popular solution to large-scale collaborative filtering.


SLIDE 4

ALS-WR

Algorithm 1: Updating U and M in One Epoch

 1: for i = 0 to n_u do in parallel
 2:     M_i ← M[:, I_i^u]
 3:     R_i ← R[i, I_i^u]
 4:     V_i ← M_i R_i^T
 5:     A_i ← M_i M_i^T + λ n_{u_i} E
 6:     U[:, i] ← linalg.solve(A_i, V_i)
 7: end for
 8: for j = 0 to n_m do in parallel
 9:     U_j ← U[:, I_j^m]
10:     R_j ← R[I_j^m, j]
11:     V_j ← U_j R_j^T
12:     A_j ← U_j U_j^T + λ n_{m_j} E
13:     M[:, j] ← linalg.solve(A_j, V_j)
14: end for

Here I_i^u denotes the set of items rated by user i and I_j^m the set of users who rated item j; n_{u_i} and n_{m_j} are their sizes, E is the n_f × n_f identity matrix, and λ is the regularization weight.

Main components of ALS-WR

◮ Accessing data of U, M and R.
◮ Performing linear algebra operations and solving linear equations.

Optimization opportunities

◮ Adopting suitable data layouts.
◮ Parallelizing loops.
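As a concrete illustration, here is a minimal sketch of how the user-update half of Algorithm 1 (lines 1-7) could map onto single-threaded MKL calls inside an OpenMP parallel for. The function name update_users, the CSR argument layout, and the column-gather strategy are illustrative assumptions, not the authors' implementation; only cblas_sgemm and LAPACKE_sgesv are confirmed by the hotspot profile later in the deck.

    // Hypothetical sketch of Algorithm 1, lines 1-7 (not the authors' code).
    // Assumes: column-major U (nf x nu) and M (nf x nm), ratings R stored in
    // CSR so the items rated by user i (I_i^u) are contiguous, and MKL
    // running single-threaded beneath the OpenMP loop.
    #include <cstddef>
    #include <mkl.h>
    #include <omp.h>
    #include <vector>

    void update_users(int nu, int nf, float lambda,
                      const int* ptr, const int* idx, const float* val, // CSR of R
                      const float* M, float* U) {
        #pragma omp parallel for schedule(dynamic)
        for (int i = 0; i < nu; ++i) {
            const int n_ui = ptr[i + 1] - ptr[i];      // # ratings of user i
            if (n_ui == 0) continue;
            std::vector<float> Mi((std::size_t)nf * n_ui); // M_i = M[:, I_i^u]
            std::vector<float> Ai((std::size_t)nf * nf);
            std::vector<float> Vi(nf, 0.0f);
            for (int k = 0; k < n_ui; ++k) {
                const int j = idx[ptr[i] + k];
                // Gather column j of M into column k of M_i.
                cblas_scopy(nf, M + (std::size_t)j * nf, 1,
                            Mi.data() + (std::size_t)k * nf, 1);
                // V_i = M_i R_i^T, accumulated one rated column at a time.
                cblas_saxpy(nf, val[ptr[i] + k], M + (std::size_t)j * nf, 1,
                            Vi.data(), 1);
            }
            // A_i = M_i M_i^T + lambda * n_ui * E
            cblas_sgemm(CblasColMajor, CblasNoTrans, CblasTrans, nf, nf, n_ui,
                        1.0f, Mi.data(), nf, Mi.data(), nf, 0.0f, Ai.data(), nf);
            for (int d = 0; d < nf; ++d)
                Ai[(std::size_t)d * nf + d] += lambda * n_ui;
            // U[:, i] <- linalg.solve(A_i, V_i); sgesv overwrites Vi with x.
            std::vector<int> ipiv(nf);
            LAPACKE_sgesv(LAPACK_COL_MAJOR, nf, 1, Ai.data(), nf, ipiv.data(),
                          Vi.data(), nf);
            cblas_scopy(nf, Vi.data(), 1, U + (std::size_t)i * nf, 1);
        }
    }

The item-update half (lines 8-14) is symmetric, slicing R by column instead of by row.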


SLIDES 5-8

(Identical in content to SLIDE 4.)

SLIDE 9

Root Mean Square Error (RMSE)

Algorithm 2: Calculating RMSE

1: s ← 0
2: for each rating r_ij do in parallel
3:     U_i ← U[:, i]
4:     M_j ← M[:, j]
5:     r̂_ij ← U_i^T M_j
6:     s ← s + (r̂_ij − r_ij)^2
7: end for
8: return (s / N)^{1/2}

Main components of RMSE

◮ Accessing data of U and M.
◮ Performing linear algebra operations.

Optimization opportunities

◮ Adopting suitable data layouts.
◮ Parallelizing loops.
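A minimal sketch of how Algorithm 2 could be parallelized with an OpenMP reduction follows. The COO triplet storage for the test ratings and the function signature are illustrative assumptions, not the authors' code.

    // Hypothetical sketch of Algorithm 2 (not the authors' code).
    // reduction(+:s) gives each thread a private partial sum, so no atomic
    // operations are needed on the shared accumulator.
    #include <cmath>
    #include <cstddef>
    #include <mkl.h>
    #include <omp.h>

    float rmse(std::size_t N, const int* row, const int* col, const float* val,
               int nf, const float* U, const float* M) {
        double s = 0.0;
        #pragma omp parallel for reduction(+:s)
        for (std::ptrdiff_t k = 0; k < (std::ptrdiff_t)N; ++k) {
            // r_hat_ij = U[:, i]^T M[:, j] with column-major feature matrices.
            const float r_hat = cblas_sdot(nf, U + (std::size_t)row[k] * nf, 1,
                                               M + (std::size_t)col[k] * nf, 1);
            const double e = r_hat - val[k];
            s += e * e;
        }
        return (float)std::sqrt(s / (double)N);
    }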


SLIDE 10

Design and Implementation

Motivation

◮ Accessing columns of dense matrices, and rows and columns of sparse matrices.
◮ Performing linear algebra operations and solving linear equations.
◮ Making loops multi-threaded.

1. Parallelism
   ◮ Multi-threaded loops (OpenMP)
     ⋆ parallel for
     ⋆ schedule(dynamic) (work balancing)
     ⋆ reduction(+:sum) (thread-local copies that avoid atomic operations)
   ◮ Single-threaded Intel MKL (so MKL threads do not oversubscribe the OpenMP threads)
2. Sparsity and Locality
   ◮ Rating data (sparse)
     ⋆ Compressed Sparse Row (CSR)
     ⋆ Compressed Sparse Column (CSC)
   ◮ Feature matrices (dense)
     ⋆ Column-major
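On the sparsity side, the sketch below shows one standard way to build the CSR form from COO triplets (an assumed input format, not the authors' loader); the CSC build is symmetric, with rows and columns swapped. On the parallelism side, MKL can be kept single-threaded beneath the OpenMP loops with mkl_set_num_threads(1).

    // Minimal sketch: building CSR from 0-based COO triplets (rows ri,
    // columns ci, values v). CSR makes a user's ratings contiguous, so the
    // user loop of Algorithm 1 touches only the nonzeros it needs.
    #include <cstddef>
    #include <vector>

    struct CSR {
        std::vector<int>   ptr;  // nu + 1 row offsets
        std::vector<int>   col;  // nnz column indices
        std::vector<float> val;  // nnz rating values
    };

    CSR coo_to_csr(int nu, const std::vector<int>& ri,
                   const std::vector<int>& ci, const std::vector<float>& v) {
        CSR m;
        m.ptr.assign(nu + 1, 0);
        for (int r : ri) ++m.ptr[r + 1];                        // per-row counts
        for (int r = 0; r < nu; ++r) m.ptr[r + 1] += m.ptr[r];  // prefix sum
        m.col.resize(v.size());
        m.val.resize(v.size());
        std::vector<int> next(m.ptr.begin(), m.ptr.end() - 1);  // insertion cursors
        for (std::size_t k = 0; k < v.size(); ++k) {
            const int p = next[ri[k]]++;
            m.col[p] = ci[k];
            m.val[p] = v[k];
        }
        return m;
    }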


SLIDE 11

Results

Table: Training Parameters

    Input Data                             ml-20m (500 MB)
    Dimension of Feature Matrices (n_f)    100
    Number of Epochs                       30
    Train-Test Data Ratio                  4:1

Table: Execution Time

    Loading Data                           15 s
    Training (per epoch)                   52 s
    Training with RMSE (per epoch)         63 s

Measured on a 64-core machine (2 sockets, 32 cores each) with Hygon C86 7185 processors.


SLIDE 12

Performance Analysis

1. Hotspots
2. Scalability
3. Speedup of different optimizations

Platform

◮ Performed on a 20-core machine (2 sockets, 10 cores each) with Intel Xeon Silver 4114 (hyper-threaded, 40 threads in total).
◮ Intel VTune Amplifier is used to measure hotspots.


SLIDE 13

Hotspots

Table: Hotspots of the 40-threaded ALS-WR

    Rank   Function                   CPU Time
    1      cblas_sgemm                36.1%
    2      LAPACKE_sgesv_work         23.7%
    3      __intel_avx_rep_memcpy     10.8%
    4      operator new               7.5%
    5      [MKL SERVICE]@malloc       7.1%

Hotspots 1 and 2 perform matrix multiplication and solve linear equations; hotspot 3 stems from accessing columns of dense matrices; hotspots 4 and 5 are memory allocation. Further optimizing memory access and memory management might yield better performance; one such refinement is sketched below.
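Following that suggestion, here is a hypothetical refinement (not from the deck): hoisting per-thread scratch buffers out of the parallel loop so each thread allocates once, removing operator new and MKL's allocator from the hot path. The function name and the max_nui cap on ratings per user are assumptions; the elided loop body would perform the gather / cblas_sgemm / LAPACKE_sgesv steps of Algorithm 1.

    // Hypothetical refinement: one allocation per thread, reused for every
    // user that thread processes, instead of fresh buffers per iteration.
    #include <cstddef>
    #include <omp.h>
    #include <vector>

    void update_users_with_buffer_reuse(int nu, int nf, int max_nui) {
        #pragma omp parallel
        {
            // Thread-local scratch space, sized for the largest user.
            std::vector<float> Mi((std::size_t)nf * max_nui);
            std::vector<float> Ai((std::size_t)nf * nf);
            std::vector<float> Vi(nf);
            std::vector<int>   ipiv(nf);

            #pragma omp for schedule(dynamic)
            for (int i = 0; i < nu; ++i) {
                // ... gather M_i, form A_i and V_i, solve, write U[:, i],
                //     exactly as in Algorithm 1, but into Mi/Ai/Vi/ipiv ...
                (void)i;
            }
        }
    }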


SLIDE 14

Scalability

Figure: Speedup of Multi-threaded ALS-WR

    Threads    1     4     8     12    16    20    40
    Speedup    1.0   3.1   5.4   7.6   9.7   11.4  14.5

The speedup is up to 14.5x when using 40 threads.


SLIDE 15

Summary

Figure: Speedup of Different Optimizations (cumulative)

    Baseline                                      1.0
    + Data Layouts                                1.8   (1.8x)
    + Multi-Threading with Atomic Operations      8.9   (4.9x)
    + Multi-Threading with Thread-Local Copies    26.2  (2.9x)

Leveraging sparsity makes processing large-scale data feasible. Improving spatial locality brings a 1.8x speedup, and multi-threading offers a further 14.5x, for about 26.2x overall.


SLIDE 16

End

Thank you for listening!
