FatMan vs. LittleBoy: Scaling up Linear Algebraic Operations in Scale-out Data Platforms
Luna Xu (Virginia Tech), Seung-Hwan Lim (ORNL), Ali R. Butt (Virginia Tech), Sreenivas R. Sukumar (ORNL), Ramakrishnan Kannan (ORNL)
5/5/2015
HPC is used to enable scientific discovery
[Flow: scientific simulation/observation → data analysis → scientific discovery]
Increasing role of data in HPC
[Figure: growth of scientific data volumes. Source: Stephens, Zachary D. et al. "Big Data: Astronomical or Genomical?" PLoS Biology 13.7 (2015). PMC. Web. 3 Nov. 2016.]
Efficient big data processing → faster HPC
• The scale of data to be analyzed is growing exponentially
• Observation: data analysis in HPC is similar to data analysis in the big data community
• Big data scale-out platforms can benefit data-driven scientific discovery
Scale-out data processing is becoming popular
Problem: unable to leverage accelerators
[Figure: cluster nodes equipped with GPUs that scale-out data platforms leave unused]
Per-node performance does not scale
[Charts: per-node performance trends on the Top500 list vs. accelerator performance. Sources: https://www.nextplatform.com/2015/07/13/top-500-supercomputer-list-reflects-shifting-state-of-global-hpc-trends/; Nvidia]
Our contribution: scaling up linear algebra in scale-out data platforms
• Introduce support for distributed dense matrix manipulation in Spark for scale-out matrix operations
• Adopt scale-up hardware acceleration for BLAS-3 operations on distributed matrices
• Design a flexible controller that decides when and whether to use hardware accelerators based on matrix density (see the sketch below)
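A minimal sketch of the density-based controller idea, in Scala. All names (BackendSelector, the backend labels) and the thresholds are illustrative assumptions, not the paper's actual code; the point is only that the density of a matrix decides whether a multiplication is routed to the GPU, to a native CPU BLAS, or kept in MLlib's sparse path.

```scala
// Hypothetical sketch; names and thresholds are assumptions, not the paper's code.
object BackendSelector {
  sealed trait Backend
  case object GpuBlas     extends Backend // NVBLAS / cuBLAS-XT path
  case object CpuBlas     extends Backend // OpenBLAS path
  case object SparseScala extends Backend // MLlib's ad-hoc sparse multiplication

  /** Pick a backend from the fraction of non-zero entries in a matrix block. */
  def select(numNonZeros: Long, numRows: Long, numCols: Long): Backend = {
    val density = numNonZeros.toDouble / (numRows.toDouble * numCols.toDouble)
    if (density >= 0.5) GpuBlas       // dense blocks: offload GEMM to the accelerator
    else if (density >= 0.05) CpuBlas // moderately dense: native CPU BLAS
    else SparseScala                  // very sparse: densification would cost more than it saves
  }
}
```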
Agenda
• Introduction
• Background
• Design
• Evaluation
• Conclusion
Distributed matrix support in Spark (MLlib)
[Diagram: an input file is loaded into an RDD and exposed through MLlib's distributed matrix types (RowMatrix, IndexedRowMatrix, CoordinateMatrix, BlockMatrix), all backed by a sparse matrix representation]
• Dependencies between internal components in MLlib treat all matrices as sparse, regardless of density
• The ad-hoc Scala implementation of sparse matrix multiplication prevents the use of hardware acceleration (see the sketch below)
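For context, a minimal sketch of the MLlib path in the diagram above, assuming an existing SparkContext `sc` and a whitespace-separated (i, j, value) input file (both assumptions): the entries become a CoordinateMatrix and then a BlockMatrix, whose multiply() goes through the ad-hoc Scala implementation mentioned above.

```scala
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}
import org.apache.spark.rdd.RDD

// Load a text file of (i, j, value) triples into MLlib's distributed matrix types.
val entries: RDD[MatrixEntry] = sc.textFile("hdfs:///data/matrix.txt").map { line =>
  val Array(i, j, v) = line.split("\\s+")
  MatrixEntry(i.toLong, j.toLong, v.toDouble)
}
val coord    = new CoordinateMatrix(entries)
val blockMat = coord.toBlockMatrix(rowsPerBlock = 1024, colsPerBlock = 1024).cache()

// blockMat.multiply(other) is the call that uses MLlib's ad-hoc Scala
// sparse implementation regardless of how dense the blocks actually are.
```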
Design considerations
• Enable user transparency
• Support scalable matrix multiplication
• Support both dense and sparse matrices
System architecture
[Diagram: MLlib matrix multiplication on a Spark cluster. A selector chooses between the Scala BLAS implementation and native BLAS; an enabler routes native calls from the JVM through netlib-java (JNI) to OpenBLAS on the CPU workers or to NVBLAS (cuBLAS-XT) on the GPU workers (see the sketch below)]
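A minimal sketch of the lower half of this stack: a level-3 BLAS call (DGEMM) issued from the JVM through netlib-java's JNI bridge. Which native library answers it, OpenBLAS on the CPU or NVBLAS (cuBLAS-XT) on the GPU, depends on which native BLAS the process loads, not on this code; the small 4x4 matrices here are placeholders.

```scala
import com.github.fommil.netlib.BLAS

// Column-major 4 x 4 matrices used only to exercise the call path.
val n = 4
val a = Array.fill(n * n)(1.0)
val b = Array.fill(n * n)(2.0)
val c = new Array[Double](n * n)

// C := 1.0 * A * B + 0.0 * C, dispatched via JNI to whichever native BLAS is installed.
BLAS.getInstance().dgemm("N", "N", n, n, n, 1.0, a, n, b, n, 0.0, c, n)
```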
Methodology: Gramian matrix computation with GEMM
• XX^T (a Spark sketch follows the table below)
• Machine learning (PCA, SVD)
• Data analysis (all-pairs similarity)

# of Rows (Cols)   Density   Raw size (GB)
4873               1         0.34
14684              1         3.1
24495              1         8.4
663331             1         77
4873               0.05      0.104
14684              0.05      2.6
24495              0.05      19
663331             0.05      41
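A minimal sketch of how the XX^T workload above can be expressed on a BlockMatrix, reusing `blockMat` from the earlier sketch (an assumption); with a dense backend each block-level product reduces to a BLAS-3 (GEMM) call.

```scala
// Gramian-style product X * X^T expressed with MLlib's BlockMatrix API.
// `blockMat` is the cached BlockMatrix from the earlier sketch (an assumption).
val gramian = blockMat.multiply(blockMat.transpose)

// Materialize the result; each block-level multiplication is a GEMM candidate.
gramian.blocks.count()
```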
Methodology
• Spark configuration:
  • Version: 1.6.1
  • 2-node cluster
  • One executor per node (56 cores, 800 GB)
• BLAS configuration:
  • OpenBLAS v0.2.19 (hand-compiled)
  • NVBLAS v7.5
(A sketch for verifying the loaded BLAS backend follows below.)

System spec (Rhea GPU node):
CPU          Dual Intel Xeon E5
CPU cores    28 (56 HT)
Memory       1 TB
GPU          Dual NVIDIA K80
GPU cores    4992 x 2
GPU memory   24 x 2 GB
CUDA         7.5
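A small sketch (an assumption, not part of the paper's setup scripts) for checking that the executors actually picked up a native BLAS through netlib-java rather than falling back to the pure-Java F2J implementation; `sc` is an existing SparkContext.

```scala
import com.github.fommil.netlib.BLAS

// Run the check on the workers, where OpenBLAS / NVBLAS must be installed.
val impls = sc.parallelize(0 until 4, 4)
  .map(_ => BLAS.getInstance().getClass.getName)
  .distinct()
  .collect()

// Expect e.g. com.github.fommil.netlib.NativeSystemBLAS; F2jBLAS means no native BLAS was found.
impls.foreach(println)
```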
Overall performance: dense matrices
[Chart: speedup over the default Spark baseline vs. matrix size (4873, 14684, 24495, 66331) for OpenBLAS and NVBLAS; highlighted speedups of 1.5x and 2.2x]
Performance breakdown: dense matrices
[Chart: fraction of execution time spent in GC, shuffle, compute, and others for each matrix size (4873, 14684, 24495, 66331); highlighted values of 85.2%, 92.9%, and 96.1%]
Overall performance: sparse matrices
[Chart: speedup over the default Spark baseline vs. matrix size (4873, 24495, 66331, 97708) for OpenBLAS and NVBLAS; both fall below the baseline, with highlighted slowdowns of 10% and 36.7%]
Performance breakdown: sparse matrices
[Chart: fraction of execution time spent in GC, shuffle, compute, and others for each matrix size (4873, 14684, 24495, 66331); highlighted values of 22% and 85.1%]
Conclusion
• We employ scale-up acceleration for linear algebraic operations in Spark
• Our approach decides whether to use hardware accelerators based on matrix density
• The system improves overall performance by up to 2.2x compared to default Spark
• Contact: Luna Xu, xuluna@cs.vt.edu
• DSSL: http://dssl.cs.vt.edu/