FatMan vs. LittleBoy: Scaling up Linear Algebraic Operations in Scale-out Data Platforms


  1. FatMan vs. LittleBoy: Scaling up Linear Algebraic Operations in Scale-out Data Platforms. Luna Xu (Virginia Tech), Seung-Hwan Lim (ORNL), Ali R. Butt (Virginia Tech), Sreenivas R. Sukumar (ORNL), Ramakrishnan Kannan (ORNL). 5/5/2015

  2. HPC is used to enable scientific discovery [figure: scientific simulation/observation → data analysis → scientific discovery]

  3. Increasing role of data in HPC. Source: Stephens, Zachary D. et al. “Big Data: Astronomical or Genomical?” PLoS Biology 13.7 (2015). PMC. Web. 3 Nov. 2016.

  4. Efficient big data processing → faster HPC • Scale of data to be analyzed is growing exponentially • Observation: data analysis in HPC is similar to data analysis in the big data community • Big data scale-out platforms can benefit data-based scientific discovery

  5. Scale-out data processing is becoming popular

  6. Problem: Unable to leverage accelerators [figure: cluster nodes equipped with GPUs]

  7. Per-node performance does not scale. Sources: https://www.nextplatform.com/2015/07/13/top-500-supercomputer-list-reflects-shifting-state-of-global-hpc-trends/; Nvidia

  8. Per-node performance does not scale. Sources: https://www.nextplatform.com/2015/07/13/top-500-supercomputer-list-reflects-shifting-state-of-global-hpc-trends/; Nvidia

  9. Our contribution: scaling up linear algebra in scale-out data platforms • Introduce support for distributed dense matrix manipulation in Spark to enable scale-out matrix operations • Adopt scale-up hardware acceleration for BLAS-3 operations on distributed matrices • Design a flexible controller that decides when and whether to use hardware accelerators based on matrix density

  10. Agenda • Introduction • Background • Design • Evaluation • Conclusion

  11. Distributed matrix support in Spark [figure: an input matrix file is loaded into one of MLlib's distributed matrix types (RowMatrix, IndexedRowMatrix, CoordinateMatrix, BlockMatrix), all ultimately represented as RDDs of sparse matrices]
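As an illustration of these types (a minimal sketch, not taken from the slides; the coordinate-triple input format and the block size are assumptions), a matrix can be read into a CoordinateMatrix and converted into a BlockMatrix, the representation used for distributed multiplication:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.distributed.{BlockMatrix, CoordinateMatrix, MatrixEntry}

// Minimal sketch: load "row col value" triples into a CoordinateMatrix, then
// convert to a BlockMatrix of 1024 x 1024 blocks for distributed operations.
// The input format and block size are illustrative assumptions.
def loadBlockMatrix(sc: SparkContext, path: String): BlockMatrix = {
  val entries = sc.textFile(path).map { line =>
    val Array(i, j, v) = line.trim.split("\\s+")
    MatrixEntry(i.toLong, j.toLong, v.toDouble)
  }
  new CoordinateMatrix(entries).toBlockMatrix(1024, 1024).cache()
}
```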

  12. Distributed matrix support in Spark: dependencies between MLlib's internal components cause all matrices to be treated as sparse, regardless of their density [same figure as slide 11]

  13. Distributed matrix support in Spark [figure as in slide 11, with BlockMatrix multiplication routed to an ad-hoc Scala implementation over the RDD of sparse matrices]

  14. Distributed matrix support in Spark: the ad-hoc implementation of sparse matrix multiplication prevents the use of hardware acceleration [figure as in slide 13]
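To see why the dense path matters, here is an illustrative sketch (not the authors' code; the helper name denseGemm and the assumption of non-transposed, column-major blocks are mine) of multiplying two dense blocks with a single BLAS-3 DGEMM call through netlib-java, which is exactly the kind of call OpenBLAS or NVBLAS can accelerate and which the ad-hoc sparse Scala path never issues:

```scala
import com.github.fommil.netlib.{BLAS => NetlibBLAS}
import org.apache.spark.mllib.linalg.DenseMatrix

// Illustrative sketch: multiply two dense, column-major blocks with native
// BLAS-3 DGEMM via netlib-java so that a native library (OpenBLAS, or NVBLAS
// when preloaded) can service the call. Assumes neither block is stored transposed.
def denseGemm(a: DenseMatrix, b: DenseMatrix): DenseMatrix = {
  require(a.numCols == b.numRows, "inner dimensions must match")
  val blas = NetlibBLAS.getInstance()            // native implementation if available
  val c = new Array[Double](a.numRows * b.numCols)
  blas.dgemm("N", "N", a.numRows, b.numCols, a.numCols,
    1.0, a.values, a.numRows,
    b.values, b.numRows,
    0.0, c, a.numRows)
  new DenseMatrix(a.numRows, b.numCols, c)
}
```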

  15. Agenda • Introduction • Background • Design • Evaluation • Conclusion

  16. Design considerations • Enable user transparency • Support scalable matrix multiplication • Support dense and sparse matrices

  17. System architecture [figure: matrix multiplication in MLlib on a Spark cluster; a selector and enabler choose between the Scala BLAS implementation inside the JVM and native libraries reached through netlib-java (JNI): OpenBLAS on each worker's CPUs and NVBLAS/cuBLAS-XT on its GPUs]
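The selector and enabler above can be pictured as a density-based dispatch. The sketch below is hypothetical (the GemmSelector name, the 0.5 threshold, and the nativeGemm/sparseMultiply parameters are assumptions, not the paper's actual policy); it only illustrates the decision the controller makes:

```scala
import org.apache.spark.mllib.linalg.{DenseMatrix, Matrix, SparseMatrix}

// Hypothetical selector sketch: route dense-enough operands to a native BLAS-3
// GEMM (OpenBLAS or NVBLAS through netlib-java) and keep sparse operands on the
// existing Scala sparse path. The threshold and function parameters are assumptions.
object GemmSelector {
  def density(m: Matrix): Double = m match {
    case s: SparseMatrix => s.values.length.toDouble / (m.numRows.toLong * m.numCols)
    case _               => 1.0
  }

  private def toDense(m: Matrix): DenseMatrix = m match {
    case d: DenseMatrix  => d
    case s: SparseMatrix => s.toDense
  }

  def multiply(a: Matrix, b: Matrix,
               nativeGemm: (DenseMatrix, DenseMatrix) => DenseMatrix,
               sparseMultiply: (Matrix, Matrix) => Matrix,
               threshold: Double = 0.5): Matrix =
    if (density(a) >= threshold && density(b) >= threshold)
      nativeGemm(toDense(a), toDense(b)) // hardware-accelerated BLAS-3 path
    else
      sparseMultiply(a, b)               // default sparse Scala path
}
```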

  18. Agenda • Introduction • Background • Design • Evaluation • Conclusion

  19. Methodology: Gramian matrix computation with GEMM • XX^T • Machine learning (PCA, SVD) • Data analysis (all-pairs similarity)

  20. Methodology: Gramian matrix computation with GEMM (XX^T), used in machine learning (PCA, SVD) and data analysis (all-pairs similarity). Input matrices:

      # of Rows (Cols)   Density   Raw size (GB)
      4873               1         0.34
      14684              1         3.1
      24495              1         8.4
      663331             1         77
      4873               0.05      0.104
      14684              0.05      2.6
      24495              0.05      19
      663331             0.05      41
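With the input loaded as a BlockMatrix (see the sketch after slide 11), the Gramian XX^T reduces to a distributed block multiplication whose per-block products are GEMM calls; the sketch below shows the idea, not the exact benchmark driver used in the evaluation:

```scala
import org.apache.spark.mllib.linalg.distributed.BlockMatrix

// Minimal sketch: the Gramian X * X^T as a distributed block multiplication.
// Each block-level product is a local GEMM, which is where a native BLAS
// (OpenBLAS / NVBLAS) can be applied.
def gramian(x: BlockMatrix): BlockMatrix = {
  x.validate()            // sanity-check the block layout before multiplying
  x.multiply(x.transpose) // X * X^T
}
```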

  21. Methodology
      Spark configuration:
      • Version: 1.6.1
      • 2-node cluster
      • One executor per node (56 cores, 800 GB)
      BLAS configuration:
      • OpenBLAS v0.2.19 (hand compiled)
      • NVBLAS v7.5
      System spec (Rhea GPU node):
      • CPU: dual Intel Xeon E5
      • CPU cores: 28 (56 HT)
      • Memory: 1 TB
      • GPU: dual NVIDIA K80
      • GPU cores: 4992 x 2
      • GPU memory: 24 x 2 GB
      • CUDA: 7.5
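For context, steering MLlib's GEMM calls to a native BLAS in a setup like this typically means binding netlib-java to the system BLAS and, for the GPU path, preloading NVBLAS so it intercepts BLAS-3 calls. The sketch below uses real Spark and netlib-java property names, but the resource values and the libnvblas path are assumptions matching the Rhea node described above, not the authors' exact configuration:

```scala
import org.apache.spark.SparkConf

// Sketch of a Spark 1.6-era configuration that exposes native BLAS to MLlib.
// Property names are real; the values and the library path are assumptions.
val conf = new SparkConf()
  .setAppName("GramianGemm")
  .set("spark.executor.cores", "56")
  .set("spark.executor.memory", "800g")
  // Bind netlib-java to the system BLAS (e.g. OpenBLAS) instead of its Java fallback.
  .set("spark.executor.extraJavaOptions",
    "-Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.NativeSystemBLAS")
  // For the GPU path, preload NVBLAS so it intercepts BLAS-3 (GEMM) calls and
  // forwards them to cuBLAS-XT; other routines fall through to the CPU BLAS
  // named in nvblas.conf.
  .setExecutorEnv("LD_PRELOAD", "/usr/local/cuda-7.5/lib64/libnvblas.so")
```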

  22. Overall performance, dense matrices [chart: speedup over the default-Spark baseline vs. matrix size (4873, 14684, 24495, 66331) for OpenBLAS and NVBLAS; callouts mark speedups of 2.2x and 1.5x]

  23. Performance breakdown, dense matrices [chart: percentage of time spent in GC, shuffle, compute, and other phases vs. matrix size (4873, 14684, 24495, 66331); a callout marks 85.2%]

  24. Performance breakdown, dense matrices [same chart as slide 23, with additional callouts at 92.9% and 96.1%]

  25. Overall performance, sparse matrices [chart: speedup over the default-Spark baseline vs. matrix size (4873, 24495, 66331, 97708) for OpenBLAS and NVBLAS; a callout marks a 10% difference near the baseline]

  26. Overall performance, sparse matrices [same chart as slide 25, with a callout at 36.7%]

  27. Performance breakdown, sparse matrices [chart: percentage of time spent in GC, shuffle, compute, and other phases vs. matrix size (4873, 14684, 24495, 66331); callouts mark 22% and 85.1%]

  28. Conclusion • We employ scale-up acceleration for linear algebraic operations in Spark • Our approach decides whether to use hardware acceleration based on matrix density • The system improves overall performance by more than 2x compared to default Spark • Contact: Luna Xu, xuluna@cs.vt.edu • DSSL: http://dssl.cs.vt.edu/
