Tuning Basic Linear Algebra Routines for Hybrid CPU+GPU Platforms


  1. Tuning Basic Linear Algebra Routines for Hybrid CPU+GPU Platforms
Gregorio Bernabé, Luis P. García, Javier Cuenca and Domingo Giménez
Universidad de Murcia / Universidad Politécnica de Cartagena
Scientific Computing and Parallel Programming Group
International Conference on Computational Science (ICCS), June 10-12, 2014
gbernabe@um.es

  2. Introduction
Due to the omnipresence of multicore systems with GPU accelerators:
- Software optimization techniques are necessary to exploit the potential of the CPU+GPU system
- Model the execution time of the routine and apply an empirical approach to study its behaviour
In this work:
- An empirical auto-tuning technique for a basic hybrid linear algebra kernel: a methodology for installation and modelling
- How to use the basic auto-tuned kernel in a higher-level routine: the LU factorization
- The technique achieves optimum load balance between GPUs and CPUs when they execute linear algebra routines

  3. Outline
1 Introduction
2 Motivation
3 Auto-tuning a multi-device matrix multiplication
4 Auto-tuning a multi-device LU factorization by blocks
5 Conclusions and future research

  4. Outline
1 Introduction
2 Motivation
3 Auto-tuning a multi-device matrix multiplication
4 Auto-tuning a multi-device LU factorization by blocks
5 Conclusions and future research

  5. Motivation
Auto-tuning technique for achieving optimum load balance between GPUs and CPUs in basic linear algebra routines:
- Matrix multiplication kernel: the basic idea is to carry out a matrix multiplication simultaneously on the GPU and on the CPU cores
- Overlap the multi-device (CPU+GPU) computations and the data transfers
DGEMM in CPU+GPU: C = αAB + βC is split by columns of B and C, with B = (B_1 B_2) and C = (C_1 C_2), so that C_1 = αAB_1 + βC_1 can be performed on the GPU and C_2 = αAB_2 + βC_2 on the CPU.
[Figure: A is M x K; B_1 and C_1 hold the first N_GPU columns of B and C, B_2 and C_2 the remaining N_CPU columns.]
The distribution between GPU and CPU, N = N_GPU + N_CPU, depends on N, on the relative speed of the GPU and the CPU, and on the number of cores in the system.
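The column split, written as a block partition (a restatement of the slide's formula, with the block dimensions made explicit):

    \[
    \begin{pmatrix} C_1 & C_2 \end{pmatrix}
      = \alpha\, A \begin{pmatrix} B_1 & B_2 \end{pmatrix}
      + \beta \begin{pmatrix} C_1 & C_2 \end{pmatrix}
    \;\Longrightarrow\;
    \begin{cases}
      C_1 = \alpha A B_1 + \beta C_1 & \text{GPU, } M \times N_{GPU} \\
      C_2 = \alpha A B_2 + \beta C_2 & \text{CPU, } M \times N_{CPU}
    \end{cases}
    \]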

  6. Motivation
DGEMM CPU+GPU:

    // Asynchronous transfer requires pinned host memory
    cudaMallocHost((void **) &h_A, sizeof(double)*szeA);
    // Copy async host memory to device
    cublasSetMatrixAsync(M, K, h_A, d_A, ...);
    cublasSetMatrixAsync(K, N_gpu, h_B, d_B, ...);
    // Have GPU do C_1 = A x B_1
    cublasDgemm(M, N_gpu, K, d_A, d_B, d_C, ...);
    // Copy async results from device to host
    cublasGetMatrixAsync(M, N_gpu, d_C, lddc, h_C, ...);
    // Have CPU do C_2 = A x B_2
    dgemm_(&M, &N_cpu, &K, h_A, h_B+ldb*N_gpu, h_C+ldc*N_gpu, ...);

- GPU part: CUBLAS, MAGMA, CULA Tools
- CPU part with a multithreaded BLAS: MKL, GotoBLAS, ATLAS
- GPU and CPU computations are overlapped, and GPU-CPU data transfers are performed asynchronously, to achieve the maximum performance
- To reduce the CPU-GPU data transfer time, the pinned memory mechanism provided by CUDA is used
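A more complete, self-contained version of the same scheme (a minimal sketch, not the authors' code: it assumes the cuBLAS v2 API, column-major storage, a Fortran-interface CPU dgemm_ such as MKL's, beta = 0 so C need not be copied to the device, and helper names like hybrid_dgemm and n_gpu of our own choosing):

    /* Sketch of a hybrid DGEMM, C = alpha*A*B (beta = 0 for brevity):
       the first n_gpu columns of B and C go to the GPU, the remaining
       n_cpu = N - n_gpu columns are handled by a multithreaded CPU dgemm_. */
    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    extern void dgemm_(const char *ta, const char *tb, const int *m,
                       const int *n, const int *k, const double *alpha,
                       const double *a, const int *lda, const double *b,
                       const int *ldb, const double *beta, double *c,
                       const int *ldc);

    void hybrid_dgemm(cublasHandle_t handle, cudaStream_t stream,
                      int M, int N, int K, int n_gpu, double alpha,
                      const double *h_A, const double *h_B, double *h_C)
    {
        int n_cpu = N - n_gpu;
        double beta = 0.0;
        double *d_A, *d_B, *d_C;
        cudaMalloc((void **)&d_A, sizeof(double) * M * K);
        cudaMalloc((void **)&d_B, sizeof(double) * K * n_gpu);
        cudaMalloc((void **)&d_C, sizeof(double) * M * n_gpu);

        /* Asynchronous copies; h_A, h_B, h_C are assumed pinned
           (allocated with cudaMallocHost), as the slide requires. */
        cublasSetMatrixAsync(M, K, sizeof(double), h_A, M, d_A, M, stream);
        cublasSetMatrixAsync(K, n_gpu, sizeof(double), h_B, K, d_B, K, stream);

        /* GPU: C_1 = alpha * A * B_1 (enqueued on the stream, returns at once) */
        cublasSetStream(handle, stream);
        cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, M, n_gpu, K,
                    &alpha, d_A, M, d_B, K, &beta, d_C, M);
        cublasGetMatrixAsync(M, n_gpu, sizeof(double), d_C, M, h_C, M, stream);

        /* CPU, concurrent with the GPU stream:
           C_2 = alpha * A * B_2 on the last n_cpu columns */
        if (n_cpu > 0)
            dgemm_("N", "N", &M, &n_cpu, &K, &alpha, h_A, &M,
                   h_B + (size_t)K * n_gpu, &K, &beta,
                   h_C + (size_t)M * n_gpu, &M);

        cudaStreamSynchronize(stream);   /* wait for the GPU half of C */
        cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    }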

  7. Outline
1 Introduction
2 Motivation
3 Auto-tuning a multi-device matrix multiplication
4 Auto-tuning a multi-device LU factorization by blocks
5 Conclusions and future research

  8. Empirical modelling of the execution time
[Figure: general scheme of the empirical modelling, which selects N_CPU and N_GPU. At installation time the hybrid DGEMM is executed for sizes in the Installation Set {384, 1152, ..., 8064}, increasing N_CPU by ΔN_CPU at each step with N_GPU = N - N_CPU. Least squares over the measurements gives k_i_gpu and k_i_cpu for T_dgemm_gpu(m, n) and T_dgemm_cpu(m, n), and t_s_h2d, t_w_h2d and t_s_d2h, t_w_d2h for T_comu_h2d and T_comu_d2h.]
The model of the execution time of the hybrid DGEMM routine:
- T_dgemm(m, n) = k_1 m^2 n + k_2 m^2 + k_3 m
- T_comu(n) = t_s + n t_w
- T_EXEC = max(T_dgemm_cpu + γ T_comu, T_dgemm_gpu + T_comu)
- γ ∈ [0, 1]: overlap of CPU computation and CPU-GPU data transfer; obtained experimentally for a particular system
Experiments are run with M ∈ Installation Set, with initial value N_CPU = 0. The value of N_CPU is increased by a predetermined amount until the modelled execution time exceeds the previous lowest modelled execution time by a threshold: T_EXEC - T_MIN > Th · T_MIN
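The selection of N_CPU can be sketched as a simple search driven by the fitted model (a hypothetical illustration: the helper names, the structures and the byte count for the transfers are ours; only the formulas come from the slide):

    /* Sketch of the installation search for N_CPU: grow n_cpu by a fixed
       step, evaluate the modelled time, keep the minimum, and stop once
       the model worsens by more than the threshold Th. */
    #include <math.h>

    typedef struct { double k1, k2, k3; } GemmCoeffs;  /* fitted per device */

    static double t_dgemm(GemmCoeffs c, double m, double n) {
        return c.k1 * m * m * n + c.k2 * m * m + c.k3 * m;  /* k1 m^2 n + k2 m^2 + k3 m */
    }
    static double t_comu(double t_s, double t_w, double nbytes) {
        return t_s + nbytes * t_w;                          /* t_s + n t_w */
    }

    /* Returns the modelled-optimal n_cpu for an N x N problem. */
    int select_n_cpu(int N, int step, double Th, double gamma,
                     GemmCoeffs cpu, GemmCoeffs gpu, double t_s, double t_w)
    {
        double t_min = INFINITY;
        int best = 0;
        for (int n_cpu = 0; n_cpu <= N; n_cpu += step) {
            int n_gpu = N - n_cpu;
            /* bytes moved: A, plus the GPU parts of B and C (our estimate) */
            double nbytes = sizeof(double) * ((double)N * N + 2.0 * N * n_gpu);
            double t = fmax(t_dgemm(cpu, N, n_cpu) + gamma * t_comu(t_s, t_w, nbytes),
                            t_dgemm(gpu, N, n_gpu) + t_comu(t_s, t_w, nbytes));
            if (t < t_min) { t_min = t; best = n_cpu; }
            else if (t - t_min > Th * t_min) break;  /* T_EXEC - T_MIN > Th*T_MIN */
        }
        return best;
    }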

  9. Installation of the hybrid dgemm routine
[Figure: general scheme of the empirical modelling, as on slide 8.]
Installation:
- Estimates the time to transfer n bytes between CPU and GPU
- Obtains t_s (the latency of sending the first byte) and t_w (the time required to send each subsequent byte)
- Estimated by linear regression over experimental results for the CUDA routines cublasSetMatrixAsync and cublasGetMatrixAsync
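A minimal benchmark of this kind might look as follows (a sketch under our own assumptions: event-based timing, a handful of square sizes, pinned buffers; only cublasSetMatrixAsync and the linear fit t_s + n·t_w come from the slide):

    #include <cublas_v2.h>
    #include <cuda_runtime.h>
    #include <stdio.h>

    /* Times host-to-device copies of n x n double matrices for several n
       and fits t_comu(nbytes) = t_s + nbytes * t_w by least squares. */
    int main(void) {
        const int sizes[] = {384, 1152, 1920, 2688, 3456};
        const int npts = (int)(sizeof(sizes) / sizeof(sizes[0]));
        double x[5], y[5];
        cudaStream_t stream;
        cudaStreamCreate(&stream);
        for (int i = 0; i < npts; i++) {
            int n = sizes[i];
            double *h, *d;
            cudaMallocHost((void **)&h, sizeof(double) * n * n);  /* pinned */
            cudaMalloc((void **)&d, sizeof(double) * n * n);
            cudaEvent_t t0, t1;
            cudaEventCreate(&t0); cudaEventCreate(&t1);
            cudaEventRecord(t0, stream);
            cublasSetMatrixAsync(n, n, sizeof(double), h, n, d, n, stream);
            cudaEventRecord(t1, stream);
            cudaEventSynchronize(t1);
            float ms;
            cudaEventElapsedTime(&ms, t0, t1);
            x[i] = (double)n * n * sizeof(double);   /* bytes transferred */
            y[i] = ms * 1e-3;                        /* seconds */
            cudaFree(d); cudaFreeHost(h);
            cudaEventDestroy(t0); cudaEventDestroy(t1);
        }
        /* least-squares line y = t_s + t_w * x */
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (int i = 0; i < npts; i++) {
            sx += x[i]; sy += y[i]; sxx += x[i] * x[i]; sxy += x[i] * y[i];
        }
        double t_w = (npts * sxy - sx * sy) / (npts * sxx - sx * sx);
        double t_s = (sy - t_w * sx) / npts;
        printf("t_s = %g s, t_w = %g s/byte\n", t_s, t_w);
        cudaStreamDestroy(stream);
        return 0;
    }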

  10. Installation of the hybrid dgemm routine
[Figure: general scheme of the empirical modelling, as on slide 8.]
Installation:
- Estimation of the k_i: least squares over the experimental results of simple benchmarks for the basic routines dgemm and cublasDgemm, run on the data specified in the Installation Set
- The benchmarks obtain the running times of the basic operations with the same data storage and access scheme used in the hybrid routine
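The least-squares fit itself is standard; a compact sketch (hypothetical helper names; only the three basis functions m^2 n, m^2 and m of the slide's model are taken from the source):

    /* Fits T_dgemm(m, n) = k1*m^2*n + k2*m^2 + k3*m to measured times
       via the 3x3 normal equations of least squares. */
    #include <math.h>

    /* Solve the 3x3 system A*k = b by Gaussian elimination with pivoting. */
    static void solve3(double A[3][3], double b[3], double k[3]) {
        for (int c = 0; c < 3; c++) {
            int p = c;
            for (int r = c + 1; r < 3; r++)
                if (fabs(A[r][c]) > fabs(A[p][c])) p = r;
            for (int j = 0; j < 3; j++) { double t = A[c][j]; A[c][j] = A[p][j]; A[p][j] = t; }
            double t = b[c]; b[c] = b[p]; b[p] = t;
            for (int r = c + 1; r < 3; r++) {
                double f = A[r][c] / A[c][c];
                for (int j = c; j < 3; j++) A[r][j] -= f * A[c][j];
                b[r] -= f * b[c];
            }
        }
        for (int c = 2; c >= 0; c--) {
            k[c] = b[c];
            for (int j = c + 1; j < 3; j++) k[c] -= A[c][j] * k[j];
            k[c] /= A[c][c];
        }
    }

    /* ms[i], ns[i]: benchmarked sizes; t[i]: measured times in seconds. */
    void fit_k(int npts, const double *ms, const double *ns, const double *t,
               double k[3]) {
        double A[3][3] = {{0}}, b[3] = {0};
        for (int i = 0; i < npts; i++) {
            /* basis functions of the model */
            double f[3] = { ms[i] * ms[i] * ns[i], ms[i] * ms[i], ms[i] };
            for (int r = 0; r < 3; r++) {
                b[r] += f[r] * t[i];
                for (int c = 0; c < 3; c++) A[r][c] += f[r] * f[c];
            }
        }
        solve3(A, b, k);
    }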

  11. Installation of the hybrid dgemm routine
Computational systems:
- 12CK20: a shared-memory system with two hexa-core Intel Xeon E5-2620 processors (12 cores) and a Tesla K20c GPU (Kepler architecture) with 4800 MBytes of global memory and 2496 CUDA cores (13 Streaming Multiprocessors of 192 Streaming Processors each)
Installation:
- It has been empirically found that γ = 1 best predicts the time cost for the computational system 12CK20:
  T_exec = max(T_dgemm_cpu, T_dgemm_gpu) + T_comu
- The reason is that the CPU is not idle during the copy of matrices A and B from CPU to GPU
- The average deviation between the modelled time and the measured time for problem sizes in the Installation Set ranges from 4.14% for medium and large matrix sizes to 11.44% for small matrix sizes
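The γ = 1 form follows because a term added to both arguments factors out of the max:

    \[
    \max\bigl(T_{dgemm\_cpu} + \gamma\,T_{comu},\; T_{dgemm\_gpu} + T_{comu}\bigr)
    \;\overset{\gamma = 1}{=}\;
    \max\bigl(T_{dgemm\_cpu},\; T_{dgemm\_gpu}\bigr) + T_{comu}
    \]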

  12. Experimental results for the hybrid dgemm routine
Validation Set ≠ Installation Set. Execution time (in seconds) of dgemm with the optimum selection of N_CPU and with the selection provided by the empirical model, for different matrix sizes n in the Validation Set:

            Model               Optimum             Deviation
    n       N_CPU   time        N_CPU   time        (%)
    768     0       0.0036      0       0.0036      0.00
    1536    48      0.0199      0       0.0171      16.61
    2304    224     0.0424      240     0.0411      3.14
    3072    384     0.0846      336     0.0842      0.46
    3840    512     0.1459      512     0.1459      0.00
    4608    640     0.2359      640     0.2359      0.00
    5376    768     0.3562      800     0.3558      0.10
    6144    896     0.5110      960     0.5100      0.18
    6912    1008    0.7093      1072    0.7019      1.06
    7680    1136    0.9618      1200    0.9375      2.59
    8448    1264    1.2305      1280    1.2255      0.41
    9216    1376    1.9682      1280    1.5803      24.55
    9984    1504    2.1745      1280    2.1573      0.80
    10572   1616    2.3111      1552    2.3101      0.04
    11520   1744    3.3041      1392    3.0419      8.62

- N_CPU is predicted exactly in only 3 of the 15 cases, but the selected N_CPU is very close to the optimum
- This has little influence on the mean relative deviation from the optimum, which is approximately 4%

  13. Experimental results for the hybrid dgemm routine
[Figure: Matrix Multiplication - 12CK20. Average GFLOPS versus matrix size (400 to 1000 GFLOPS for sizes 2000 to 10000) for Hybrid DGEMM Model, Hybrid DGEMM Optimum and MKL+CUBLAS.]
The improvement obtained with the model (Hybrid DGEMM Model) is similar to that obtained with the optimum distribution (Hybrid DGEMM Optimum), and very close to the sum of the GFLOPS that could ideally be obtained by working with MKL dgemm and CUBLAS dgemm separately (MKL+CUBLAS).

  14. Outline
1 Introduction
2 Motivation
3 Auto-tuning a multi-device matrix multiplication
4 Auto-tuning a multi-device LU factorization by blocks
5 Conclusions and future research
