
Empirical Modelling of Linear Algebra Shared-Memory Routines



  1. Empirical Modelling of Linear Algebra Shared-Memory Routines
     Jesús Cámara, Luis P. García, Javier Cuenca, Domingo Giménez
     Scientific Computing and Parallel Programming Group, University of Murcia (UMU), Spain
     ICCS 2013
     Domingo Giménez (UMU), domingo@um.es. ICCS, June 5-7, 2013. Slide 1 / 24

  2. Introduction
     - Multicore processors and cc-NUMA systems can offer performance improvements
     - Software optimization techniques are needed to benefit from the potential of the hardware
       - Modelling the execution time of the routine
       - Applying an empirical approach to study its behavior
     - In this work:
       - Analysis of the behavior of multithreaded LAPACK routines in PLASMA and Intel MKL
       - Methodology for installation and modelling: take decisions at running time to reduce execution time
       - Typical decisions: number of threads, block or tile size, routine to use

  3. Outline
     - Introduction
     - Auto-tuning methodology
     - Motivation
     - Routines and Computational Systems
     - Empirical Modelling Method. PLASMA
     - Comparison with Intel MKL LAPACK
     - Conclusions and future work lines

  4. Auto-tuning methodology
     [Figure: flow diagram with three phases, DESIGN, INSTALLATION and EXECUTION. Box labels recoverable from the slide: LAR, Extracting AP, Implementing the SOLAR Manager, Installation Set, Tuning AP, Tuned AP, Selection of Optimum AP (for problem size n), Selected AP, Optimum AP, Execution of LAR.]

  5. Motivation
     - PLASMA parallelism is not hidden inside BLAS
     - The PLASMA library relies on tile algorithms
     - Tile algorithms: outer block size and inner block size
     - Auto-tuning PLASMA: finding the outer and inner block size pairs that maximize performance
     - But default values are used for tile sizes:
       - Cholesky: 120
       - LU: (200, 20)
       - QR: (144, 48)

  6. Motivation
     - The optimum tile size depends on the number of threads, the matrix size and the computing system
     [Figure: Cholesky factorization on the 16-core system. Three surface plots, for n = 3048, n = 4072 and n = 5096, of execution time t (seconds) against tile size b (40 to 280) and number of threads (8 to 16).]

  7. Motivation
     - The optimum tile size depends on the number of threads, the matrix size and the computing system
     [Figure: Cholesky factorization on the 24-core system. Three surface plots, for n = 3048, n = 4072 and n = 5096, of execution time t (seconds) against tile size b (40 to 280) and number of threads (8 to 24).]

  8. Routines and Computational Systems
     - LAPACK routines: Cholesky, LU, QR
     - Computational systems:
       - Hipatia: system with 16 cores, 4 Intel Xeon Quad-Core (4 cores) processors, 2.93 GHz. Linux 2.6.18, Intel icc compiler (v12.0.0) and Intel MKL (v10.3.2)
       - Saturno: NUMA system with 24 cores, 4 Intel Xeon X7542 (hexa-core) processors, 1.87 GHz, 32 GB of shared memory. Linux 2.6.35, Intel icc compiler (v12.0.2) and Intel MKL (v10.3.2)
       - Joule: NUMA system with 64 cores, 4 AMD Opteron 6276 (16 cores) processors, 2.3 GHz, 64 GB of shared memory. Linux 2.6.32, Intel icc compiler (v12.1.3) and Intel MKL (v10.3.9)

  9. Empirical Modelling Method
     - Empirically estimated model of the execution time
     - Possible combinations: problem size and algorithm parameters (number of threads, block sizes, etc.)
     - Algorithm parameter: number of threads ($t$). Candidate terms:
       $\{n^3, n^2, n, 1\} \times \{t, 1, \frac{1}{t}\}$
       $T(n,t) = k_1 \frac{n^3}{t} + k_2 n^2 t + k_3 n^2 + k_4 \frac{n^2}{t} + k_5 n t + k_6 n$
     - Experiments are performed for different values of $n$ and $t$
     - Estimation of the values of $k_i$ with least squares (LS) or non-negative least squares (NNLS)
       - LS: all coefficients have non-zero values (positive or negative)
       - NNLS: the values obtained for the coefficients are positive or zero
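The LS estimation of the six coefficients can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: the timings are synthetic, the coefficients in `TRUE_K` are invented, and the helper names (`design_row`, `fit_ls`, `predict`) are hypothetical. It solves the normal equations after scaling each column, which is adequate for a six-term model but not what a production solver would do.

```python
# Terms of the slide's model T(n, t); the order matches k1..k6.
TERMS = [
    lambda n, t: n**3 / t,
    lambda n, t: n**2 * t,
    lambda n, t: n**2,
    lambda n, t: n**2 / t,
    lambda n, t: n * t,
    lambda n, t: n,
]


def design_row(n, t):
    """Evaluate every model term for one (n, t) experiment."""
    return [f(n, t) for f in TERMS]


def solve(M, y):
    """Gaussian elimination with partial pivoting for a small dense system."""
    m = len(y)
    A = [row[:] + [y[i]] for i, row in enumerate(M)]
    for c in range(m):
        p = max(range(c, m), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(c + 1, m):
            f = A[r][c] / A[c][c]
            for j in range(c, m + 1):
                A[r][j] -= f * A[c][j]
    x = [0.0] * m
    for c in reversed(range(m)):
        s = sum(A[c][j] * x[j] for j in range(c + 1, m))
        x[c] = (A[c][m] - s) / A[c][c]
    return x


def fit_ls(samples):
    """samples: list of (n, t, measured_time). Returns the fitted k1..k6."""
    rows = [design_row(n, t) for n, t, _ in samples]
    times = [time for _, _, time in samples]
    m = len(TERMS)
    # Scale each column by its maximum so n^3/t and n have comparable size.
    scale = [max(abs(r[j]) for r in rows) for j in range(m)]
    X = [[r[j] / scale[j] for j in range(m)] for r in rows]
    XtX = [[sum(x[i] * x[j] for x in X) for j in range(m)] for i in range(m)]
    Xty = [sum(x[i] * y for x, y in zip(X, times)) for i in range(m)]
    z = solve(XtX, Xty)
    return [z[j] / scale[j] for j in range(m)]


def predict(k, n, t):
    return sum(ki * v for ki, v in zip(k, design_row(n, t)))


# Synthetic "measurements" generated from invented coefficients.
TRUE_K = [2e-10, 1e-9, 3e-8, 5e-8, 1e-7, 1e-6]
samples = [
    (n, t, sum(c * v for c, v in zip(TRUE_K, design_row(n, t))))
    for n in range(1000, 5001, 1000)
    for t in (1, 2, 4, 8, 16)
]
k = fit_ls(samples)
```

With real timings the LS coefficients may come out negative, which is the case the NNLS variant on the slide avoids.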

  10. Empirical Modelling Method. PLASMA
     - Algorithm parameters: number of threads ($t$), outer block size ($b$), inner block size ($l$)
     - Possible combinations:
       $\{n^3, n^2, n, 1\} \times \{t, 1, \frac{1}{t}\} \times \{b^2, b, 1, \frac{1}{b}\} \times \{l^2, l, 1, \frac{1}{l}\}$
     $$
     \begin{aligned}
     T(n,t,b,l) ={}& k_1 \frac{n^3}{t} + k_2 \frac{n^3}{bl} + k_3 \frac{n^3}{l} + k_4 \frac{n^3}{b} + k_5 n^2 t b + k_6 n^2 t + k_7 \frac{n^2 t}{bl} \\
     &+ k_8 \frac{n^2 t}{l} + k_9 \frac{n^2 t}{b} + k_{10} n^2 t l + k_{11} n^2 b + k_{12} n^2 l + k_{13} \frac{n^2}{bl} + k_{14} \frac{n^2}{l} \\
     &+ k_{15} \frac{n^2}{b} + k_{16} n^2 + k_{17} \frac{n^2}{t} + k_{18} \frac{n^2 b}{t} + k_{19} \frac{n^2 l}{t} + k_{20} n t b^2 \\
     &+ k_{21} n t b l + k_{22} n t b + k_{23} n t l^2 + k_{24} \frac{n t}{l} + k_{25} \frac{n t}{b} + k_{26} n t l + k_{27} n t \\
     &+ k_{28} n b^2 + k_{29} n b l + k_{30} n b + k_{31} n l^2 + k_{32} n l + k_{33} n
     \end{aligned}
     $$
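The candidate terms are products of one factor from each set, so they can be enumerated as exponent tuples. The sketch below is only illustrative: the tuple encoding `(n_exp, t_exp, b_exp, l_exp)` and the helper `term_value` are assumptions made here, not something taken from the slides.

```python
from itertools import product

# Exponents for {n^3, n^2, n, 1} x {t, 1, 1/t} x {b^2, b, 1, 1/b} x {l^2, l, 1, 1/l}.
N_EXP = (3, 2, 1, 0)
T_EXP = (1, 0, -1)
B_EXP = (2, 1, 0, -1)
L_EXP = (2, 1, 0, -1)

# Every possible combination, encoded as (n_exp, t_exp, b_exp, l_exp).
candidates = list(product(N_EXP, T_EXP, B_EXP, L_EXP))


def term_value(exps, n, t, b, l):
    """Evaluate one candidate term for given parameter values."""
    en, et, eb, el = exps
    return n**en * t**et * b**eb * l**el


# Three of the terms that appear in the 33-term model.
k1_term = (3, -1, 0, 0)   # n^3 / t
k4_term = (3, 0, -1, 0)   # n^3 / b
k20_term = (1, 1, 2, 0)   # n t b^2

print(len(candidates))  # 192
```

The full product has 4 x 3 x 4 x 4 = 192 combinations, of which the model on the slide keeps 33.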

  11. Empirical Modelling Method. PLASMA
     - Installation:
       - The routine is executed with the parameter values in an Installation Set
       - $t$, $b$ and $l$ are varied over a preselected set of possible values
       - 33 or 17 coefficients $k_i$ are estimated with LS or NNLS
       - This gives the Mod-LS or Mod-NNLS model for the execution time of the routine
       - The model and the possible values of the algorithm parameters are stored
     - At execution time: the number of threads and the tile sizes are selected for each problem size using the information provided by the model
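The execution-time step amounts to evaluating the stored model over the stored parameter values and keeping the cheapest prediction. The following is a minimal sketch under stated assumptions: the three-term model and its coefficients are invented for illustration, only $t$ and one tile size $b$ are considered, and `select_parameters` is a hypothetical name.

```python
from itertools import product

# Hypothetical stored model: (coefficient, (n_exp, t_exp, b_exp)) pairs.
MODEL = [
    (1.0e-10, (3, -1, 0)),   # n^3 / t
    (2.0e-10, (3, 0, -1)),   # n^3 / b
    (1.0e-11, (1, 1, 2)),    # n t b^2
]

# Hypothetical stored candidate values for the algorithm parameters.
T_VALUES = (4, 8, 12, 16)
B_VALUES = tuple(range(40, 281, 40))


def predict(n, t, b):
    """Predicted execution time from the stored model."""
    return sum(k * n**en * t**et * b**eb for k, (en, et, eb) in MODEL)


def select_parameters(n):
    """Return the (t, b) pair with the lowest predicted time for size n."""
    return min(product(T_VALUES, B_VALUES), key=lambda tb: predict(n, *tb))


t_best, b_best = select_parameters(4000)
```

Because only the model is evaluated, the selection cost is negligible compared with running the routine itself.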

  12. Example. Empirical modelling of Cholesky
     - Installation Set. Cholesky:
       - $n = \{500, 1500, 2500, 3500, 4500, 5500, 6500, 7500, 8500, 9500\}$
       - $b$ ranging from 40 to 300, $b_{inc} = 40$
       - $t$ ranging from 4, 6 or 16 to the number of available cores, $t_{inc} = 4, 6, 16$
     - Mod-NNLS:
       - Hipatia: $T(n,t,b) = k_1 \frac{n^3}{t} + k_2 \frac{n^3}{b}$
       - Saturno: $T(n,t,b) = k_1 \frac{n^3}{t} + k_2 \frac{n^3}{b} + k_3 n^2 t b + k_{11} n t b^2$
       - Joule: $T(n,t,b) = k_1 \frac{n^3}{t} + k_2 \frac{n^3}{b} + k_3 n^2 t b + k_6 n^2 b + k_{11} n t b^2$
     - With Mod-NNLS, the non-zero coefficients change with the execution platform
     - This gives insight into the contribution of each algorithmic parameter
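To show how a non-negative fit can zero out terms, here is a small NNLS sketch using projected coordinate descent. It is a stand-in chosen for brevity, not the solver used in the work (a Lawson-Hanson routine such as `scipy.optimize.nnls` would be the usual choice). The timings are synthetic, the generating coefficients are invented, and the thread values assume a 16-core system with $t$ from 4 to 16 in steps of 4.

```python
from itertools import product

# The five (n, t, b) terms that appear in the Mod-NNLS models on the slide.
TERMS = [
    lambda n, t, b: n**3 / t,
    lambda n, t, b: n**3 / b,
    lambda n, t, b: n**2 * t * b,
    lambda n, t, b: n**2 * b,
    lambda n, t, b: n * t * b**2,
]


def nnls_coordinate_descent(rows, y, sweeps=200):
    """Minimise ||y - X k||^2 subject to k >= 0 by cyclic coordinate updates."""
    m = len(rows[0])
    # Column scaling keeps n^3 and n terms on a comparable footing.
    scale = [max(r[j] for r in rows) for j in range(m)]
    X = [[r[j] / scale[j] for j in range(m)] for r in rows]
    col_sq = [sum(x[j] ** 2 for x in X) for j in range(m)]
    z = [0.0] * m
    resid = list(y)  # residual y - X z, with z starting at zero
    for _ in range(sweeps):
        for j in range(m):
            grad = sum(x[j] * r for x, r in zip(X, resid))
            new = max(0.0, z[j] + grad / col_sq[j])  # project onto z_j >= 0
            delta = new - z[j]
            if delta != 0.0:
                for i, x in enumerate(X):
                    resid[i] -= delta * x[j]
                z[j] = new
    return [z[j] / scale[j] for j in range(m)], resid


# Installation set taken from the slide, with assumed thread values.
grid = list(product(range(500, 9501, 1000), range(4, 17, 4), range(40, 281, 40)))
rows = [[f(n, t, b) for f in TERMS] for n, t, b in grid]
# Synthetic timings from an invented two-term model of the Hipatia form.
y = [2e-10 * n**3 / t + 5e-10 * n**3 / b for n, t, b in grid]

k, resid = nnls_coordinate_descent(rows, y)
```

Every coefficient returned is zero or positive by construction; how close the fit gets to the generating model depends on the number of sweeps.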

  13. Example. Empirical modelling of Cholesky
     - Validation Set ≠ Installation Set
     - Each entry is a pair (number of threads, tile size)

      size  | Hipatia                       | Saturno                       | Joule
            | Optimum   Mod-LS    Mod-NNLS  | Optimum   Mod-LS    Mod-NNLS  | Optimum   Mod-LS    Mod-NNLS
      2000  | (12,200)  (8,280)   (16,280)  | (24,80)   (12,80)   (24,80)   | (32,40)   (16,200)  (64,40)
      3000  | (16,280)  (12,280)  (16,280)  | (24,80)   (18,80)   (24,80)   | (64,60)   (48,40)   (64,60)
      4000  | (16,280)  (16,280)  (16,280)  | (24,120)  (24,120)  (24,80)   | (64,60)   (64,60)   (64,60)
      5000  | (16,280)  (16,280)  (16,280)  | (24,120)  (24,160)  (24,120)  | (64,140)  (64,60)   (64,60)
      6000  | (16,280)  (16,280)  (16,280)  | (24,120)  (24,160)  (24,120)  | (64,140)  (64,60)   (64,80)
      7000  | (16,280)  (16,280)  (16,280)  | (24,160)  (24,200)  (24,120)  | (64,140)  (64,60)   (64,80)
      8000  | (16,280)  (16,280)  (16,280)  | (24,160)  (24,240)  (24,160)  | (64,200)  (64,200)  (64,100)
      9000  | (16,280)  (16,280)  (16,280)  | (24,200)  (24,240)  (24,160)  | (64,200)  (64,200)  (64,100)
      10000 | (16,280)  (16,280)  (16,280)  | (24,200)  (24,280)  (24,160)  | (64,200)  (64,200)  (64,100)

     [Figure: three bar charts, Cholesky (Hipatia), Cholesky (Saturno) and Cholesky (Joule), showing the deviation (%) of Mod-LS and Mod-NNLS for each problem size n from 2000 to 10000. Horizontal axes run from 0 to 10, 0 to 15 and 0 to 30 respectively.]
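One way to read the table is to count how often each model selects exactly the optimum parameter pair. The sketch below does this for the Hipatia columns only. It assumes the three entries per system are ordered Optimum, Mod-LS, Mod-NNLS and that each pair is (threads, tile size); the helper `count_matches` is a hypothetical name.

```python
# Hipatia entries transcribed from the table, one per validation size.
sizes = [2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000]
optimum = [(12, 200)] + [(16, 280)] * 8
mod_ls = [(8, 280), (12, 280)] + [(16, 280)] * 7
mod_nnls = [(16, 280)] * 9


def count_matches(selected, best):
    """Number of problem sizes where the selected pair equals the optimum."""
    return sum(s == b for s, b in zip(selected, best))


ls_hits = count_matches(mod_ls, optimum)      # 7 of 9
nnls_hits = count_matches(mod_nnls, optimum)  # 8 of 9
```

A missed pair does not necessarily mean a large loss in execution time, which is what the deviation charts on the slide report.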

  14. Example. Empirical modelling of QR
     - Installation Set. QR:
       - $n = \{512, 1024, 1536, 2048, 2560, 3072, 3584, 4096, 4608, 5120\}$
       - $b$ ranging from 24 to 304, $b_{inc} = 40$
       - $l$ ranging from 28 to 208, $l_{inc} = 20$
       - $t$ ranging from 4, 6 or 16 to the number of available cores, $t_{inc} = 4, 6, 16$
     - Mod-NNLS:
       - Hipatia: $T(n,t,b,l) = k_1 \frac{n^3}{t} + k_2 \frac{n^3}{bl} + k_{19} \frac{n^2 l}{t}$
       - Saturno: $T(n,t,b,l) = k_1 \frac{n^3}{t} + k_4 \frac{n^3}{b} + k_{10} n^2 t l + k_{20} n t b^2 + k_{23} n t l^2 + k_{24} \frac{n t}{l}$
       - Joule: $T(n,t,b,l) = k_1 \frac{n^3}{t} + k_4 \frac{n^3}{b} + k_{12} n^2 l + k_{13} \frac{n^2}{bl} + k_{17} \frac{n^2}{t} + k_{19} \frac{n^2 l}{t} + k_{20} n t b^2 + k_{21} n t l^2 + k_{28} n b^2$
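The QR installation set can be generated directly from the ranges on the slide. In this sketch the thread values are an assumption (a 16-core system, with $t$ from 4 to 16 in steps of 4), and no constraint between $b$ and $l$ is applied because the slide does not state one.

```python
from itertools import product

n_values = list(range(512, 5121, 512))   # 512, 1024, ..., 5120
b_values = list(range(24, 305, 40))      # 24, 64, ..., 304
l_values = list(range(28, 209, 20))      # 28, 48, ..., 208
t_values = list(range(4, 17, 4))         # assumed: 4, 8, 12, 16

# One experiment per (n, t, b, l) combination.
installation_set = list(product(n_values, t_values, b_values, l_values))

print(len(installation_set))  # 3200
```

Under these assumptions the installation phase would run 3200 timed executions of the routine, which shows why the sets are kept coarse.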
