Empirical Modelling of Linear Algebra Shared-Memory Routines

Jesús Cámara, Luis P. García, Javier Cuenca, Domingo Giménez
Scientific Computing and Parallel Programming Group, University of Murcia (UMU), Spain
ICCS 2013, June 5-7, 2013
Domingo Giménez (UMU), domingo@um.es

Introduction
- Multicore processors and cc-NUMA systems can offer performance improvements.
- Software optimization techniques are necessary to benefit from the potential of the hardware:
  - Modelling the execution time of the routine
  - Applying an empirical approach to study its behavior
- In this work:
  - Analysis of the behavior of multithreaded LAPACK routines in PLASMA and Intel MKL
  - Methodology for installation and modelling: decisions are taken at run time to reduce execution time
  - Typical decisions: number of threads, block or tile size, routine to use

Outline
- Introduction
- Auto-tuning methodology
- Motivation
- Routines and Computational Systems
- Empirical Modelling Method. PLASMA
- Comparison with Intel MKL LAPACK
- Conclusions and future work lines

Auto-tuning methodology
[Figure: auto-tuning workflow with DESIGN, INSTALLATION and EXECUTION phases. Box labels: Implementing the Manager, Extracting AP, SOLAR Manager, Installation Set, Tuning AP, Tuned AP, Selection of Optimum AP, Selected AP, Optimum AP, Execution of LAR; input n_R.]

Motivation
- In PLASMA, parallelism is not hidden inside BLAS.
- The PLASMA library relies on tile algorithms.
- Tile algorithms have two parameters: the outer block size and the inner block size.
- Auto-tuning PLASMA means finding the outer and inner block size pairs that maximize performance.
- However, default values are used for the tile sizes:
  - Cholesky: 120
  - LU: (200, 20)
  - QR: (144, 48)

Motivation
- The optimum tile size depends on the number of threads, the matrix size and the computing system.
[Figure: Cholesky factorization on the 16-core system. Three surface plots of execution time t (seconds) against tile size b (40 to 280) and number of threads (8 to 16), for n = 3048, n = 4072 and n = 5096.]

Motivation
- The optimum tile size depends on the number of threads, the matrix size and the computing system.
[Figure: Cholesky factorization on the 24-core system. Three surface plots of execution time t (seconds) against tile size b (40 to 280) and number of threads (8 to 24), for n = 3048, n = 4072 and n = 5096.]

Routines and Computational Systems
- LAPACK routines: Cholesky, LU, QR
- Computational systems:
  - Hipatia: system with 16 cores, 4 Intel Xeon quad-core processors (4 cores each), 2.93 GHz. Linux 2.6.18, Intel icc compiler (v12.0.0) and Intel MKL (v10.3.2)
  - Saturno: NUMA system with 24 cores, 4 Intel Xeon X7542 (hexa-core) processors, 1.87 GHz, 32 GB of shared memory. Linux 2.6.35, Intel icc compiler (v12.0.2) and Intel MKL (v10.3.2)
  - Joule: NUMA system with 64 cores, 4 AMD Opteron 6276 (16 cores) processors, 2.3 GHz, 64 GB of shared memory. Linux 2.6.32, Intel icc compiler (v12.1.3) and Intel MKL (v10.3.9)

Empirical Modelling Method
- An empirically estimated model of the execution time is built.
- The model terms are possible combinations of the problem size and the algorithm parameters (number of threads, block sizes, etc.).
- With the number of threads t as the algorithm parameter, the terms are taken from $\{n^3, n^2, n, 1\} \times \{t, 1, \frac{1}{t}\}$:

\[
T(n,t) = k_1 \frac{n^3}{t} + k_2 n^2 t + k_3 n^2 + k_4 \frac{n^2}{t} + k_5 n t + k_6 n
\]

- Experiments are performed for different values of n and t.
- The values of the k_i are estimated with least squares (LS) or non-negative least squares (NNLS):
  - LS: all coefficients have non-zero values (positive or negative).
  - NNLS: the values obtained for the coefficients are positive or zero.
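As an illustration of the LS estimation step, the following minimal sketch fits the six coefficients of T(n, t) from a set of timing samples. It is not the authors' code: the coefficient values, the sample grid (with n expressed in thousands) and all function names are hypothetical, and the timings are synthetic values generated from the model itself rather than measurements. The solver uses column-scaled normal equations with Gaussian elimination so that the example needs only the Python standard library.

```python
# Least-squares fit of T(n, t) = k1*n^3/t + k2*n^2*t + k3*n^2 + k4*n^2/t + k5*n*t + k6*n.
# All data here is synthetic and only illustrates the mechanics of the fit.

def basis(n, t):
    """Row of the design matrix for one (n, t) experiment."""
    return [n**3 / t, n**2 * t, n**2, n**2 / t, n * t, n]

def predict(k, n, t):
    """Modelled execution time for coefficients k."""
    return sum(ki * bi for ki, bi in zip(k, basis(n, t)))

def solve(a, y):
    """Solve the square system a x = y by Gaussian elimination with partial pivoting."""
    m = len(y)
    aug = [row[:] + [rhs] for row, rhs in zip(a, y)]
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, m):
            f = aug[r][col] / aug[col][col]
            for c in range(col, m + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * m
    for r in reversed(range(m)):
        s = sum(aug[r][c] * x[c] for c in range(r + 1, m))
        x[r] = (aug[r][m] - s) / aug[r][r]
    return x

def fit_ls(samples):
    """samples: list of (n, t, time). Returns the LS coefficients k1..k6."""
    rows = [basis(n, t) for n, t, _ in samples]
    times = [time for _, _, time in samples]
    m = len(rows[0])
    # Scale each column to unit maximum to improve the conditioning.
    scale = [max(abs(r[j]) for r in rows) for j in range(m)]
    rows = [[r[j] / scale[j] for j in range(m)] for r in rows]
    gram = [[sum(r[i] * r[j] for r in rows) for j in range(m)] for i in range(m)]
    rhs = [sum(r[i] * time for r, time in zip(rows, times)) for i in range(m)]
    return [c / s for c, s in zip(solve(gram, rhs), scale)]

# Hypothetical "true" coefficients used to generate noise-free synthetic timings.
K_TRUE = [2.0, 0.01, 0.5, 0.3, 0.02, 0.1]
SAMPLES = [(n, t, predict(K_TRUE, n, t))
           for n in (1.0, 2.0, 3.0, 4.0, 5.0) for t in (1, 2, 4, 8)]
K_FIT = fit_ls(SAMPLES)
```

For the NNLS variant, the unconstrained solver would be replaced by a non-negative one; SciPy's `scipy.optimize.nnls` is one routine that could serve, although the slides do not say which implementation was used.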

Empirical Modelling Method. PLASMA
- Algorithm parameters: number of threads (t), outer block size (b), inner block size (l).
- Possible combinations: $\{n^3, n^2, n, 1\} \times \{t, 1, \frac{1}{t}\} \times \{b^2, b, 1, \frac{1}{b}\} \times \{l^2, l, 1, \frac{1}{l}\}$

\begin{align*}
T(n,t,b,l) ={} & k_1 \frac{n^3}{t} + k_2 \frac{n^3}{bl} + k_3 \frac{n^3}{l} + k_4 \frac{n^3}{b} + k_5 n^2 t b + k_6 n^2 t + k_7 \frac{n^2 t}{bl} \\
& + k_8 \frac{n^2 t}{l} + k_9 \frac{n^2 t}{b} + k_{10} n^2 t l + k_{11} n^2 b + k_{12} n^2 l + k_{13} \frac{n^2}{bl} + k_{14} \frac{n^2}{l} \\
& + k_{15} \frac{n^2}{b} + k_{16} n^2 + k_{17} \frac{n^2}{t} + k_{18} \frac{n^2 b}{t} + k_{19} \frac{n^2 l}{t} + k_{20} n t b^2 \\
& + k_{21} n t b l + k_{22} n t b + k_{23} n t l^2 + k_{24} \frac{n t}{l} + k_{25} \frac{n t}{b} + k_{26} n t l + k_{27} n t \\
& + k_{28} n b^2 + k_{29} n b l + k_{30} n b + k_{31} n l^2 + k_{32} n l + k_{33} n
\end{align*}
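One compact way to work with a model of this size is to store each term as a tuple of exponents of (n, t, b, l), in the order k_1 to k_33. The sketch below is an illustrative encoding rather than anything taken from the slides; the names `TERMS`, `basis_row`, `predict` and `TERMS_NO_L` are hypothetical. Dropping every term that involves l leaves 17 terms, which is presumably the 17-coefficient model used for routines without an inner block size, although the slides do not list that model explicitly.

```python
# Exponents (n, t, b, l) of the 33 model terms, in the order k1..k33.
TERMS = [
    (3, -1, 0, 0), (3, 0, -1, -1), (3, 0, 0, -1), (3, 0, -1, 0),    # k1..k4
    (2, 1, 1, 0), (2, 1, 0, 0), (2, 1, -1, -1),                     # k5..k7
    (2, 1, 0, -1), (2, 1, -1, 0), (2, 1, 0, 1),                     # k8..k10
    (2, 0, 1, 0), (2, 0, 0, 1), (2, 0, -1, -1), (2, 0, 0, -1),      # k11..k14
    (2, 0, -1, 0), (2, 0, 0, 0), (2, -1, 0, 0),                     # k15..k17
    (2, -1, 1, 0), (2, -1, 0, 1), (1, 1, 2, 0),                     # k18..k20
    (1, 1, 1, 1), (1, 1, 1, 0), (1, 1, 0, 2), (1, 1, 0, -1),        # k21..k24
    (1, 1, -1, 0), (1, 1, 0, 1), (1, 1, 0, 0),                      # k25..k27
    (1, 0, 2, 0), (1, 0, 1, 1), (1, 0, 1, 0),                       # k28..k30
    (1, 0, 0, 2), (1, 0, 0, 1), (1, 0, 0, 0),                       # k31..k33
]

def basis_row(n, t, b, l):
    """Values of the 33 terms for one experiment (one row of the design matrix)."""
    return [n**en * t**et * b**eb * l**el for en, et, eb, el in TERMS]

def predict(k, n, t, b, l):
    """Modelled execution time for a coefficient vector k of length 33."""
    return sum(ki * x for ki, x in zip(k, basis_row(n, t, b, l)))

# Terms that do not depend on the inner block size l (assumed reduced model).
TERMS_NO_L = [term for term in TERMS if term[3] == 0]
```

With this representation, the LS or NNLS estimation only needs `basis_row` evaluated at every point of the installation set, and a fitted model with many zero coefficients can be read off directly as a short list of surviving terms.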

Empirical Modelling Method. PLASMA
- Installation:
  - The routine is executed with the parameter values in an Installation Set.
  - t, b and l are varied over some preselected values.
  - The 33 or 17 coefficients k_i are estimated with LS or NNLS.
  - This gives the Mod-LS or Mod-NNLS model for the execution time of the routine.
  - The model and the possible values for the algorithm parameters are stored.
- At execution time: the number of threads and the tile sizes are selected for each problem size with the information provided by the model.

Example. Empirical modelling of Cholesky
- Installation Set (Cholesky):
  - n = {500, 1500, 2500, 3500, 4500, 5500, 6500, 7500, 8500, 9500}
  - b ranging from 40 to 300, with increment b_inc = 40
  - t ranging from 4, 6 or 16 up to the number of available cores, with increment t_inc = 4, 6 or 16
- Mod-NNLS:
  - Hipatia: $T(n,t,b) = k_1 \frac{n^3}{t} + k_2 \frac{n^3}{b}$
  - Saturno: $T(n,t,b) = k_1 \frac{n^3}{t} + k_2 \frac{n^3}{b} + k_3 n^2 t b + k_{11} n t b^2$
  - Joule: $T(n,t,b) = k_1 \frac{n^3}{t} + k_2 \frac{n^3}{b} + k_3 n^2 t b + k_6 n^2 b + k_{11} n t b^2$
- With Mod-NNLS, the non-zero coefficients change with the execution platform.
- The surviving terms give insight into the contribution of each algorithmic parameter.
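The execution-time decision can be sketched as a search over the stored candidate values for the pair that minimizes the modelled time. The example below uses the Mod-NNLS form reported for Saturno; the coefficient values are hypothetical placeholders (the slides give none), and the candidate ranges assume that Saturno uses the increment of 6 for t and the 40-to-300 range with increment 40 for b.

```python
def saturno_cholesky_time(k, n, t, b):
    """Saturno Mod-NNLS form: k1*n^3/t + k2*n^3/b + k3*n^2*t*b + k11*n*t*b^2."""
    k1, k2, k3, k11 = k
    return k1 * n**3 / t + k2 * n**3 / b + k3 * n**2 * t * b + k11 * n * t * b**2

def select_parameters(model, k, n, threads, blocks):
    """Return the (t, b) pair with the smallest modelled time for problem size n."""
    candidates = [(t, b) for t in threads for b in blocks]
    return min(candidates, key=lambda tb: model(k, n, tb[0], tb[1]))

# Hypothetical coefficients (k1, k2, k3, k11); real values come from the installation.
K_HYPOTHETICAL = (1e-10, 1e-9, 1e-11, 1e-13)
THREADS = list(range(6, 25, 6))    # assumed candidates: 6, 12, 18, 24
BLOCKS = list(range(40, 301, 40))  # assumed candidates: 40, 80, ..., 280

best_t, best_b = select_parameters(
    saturno_cholesky_time, K_HYPOTHETICAL, 4000, THREADS, BLOCKS)
```

Because the candidate sets are small, an exhaustive search is cheap compared with the factorization itself, which is what makes a decision at run time practical.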

Example. Empirical modelling of Cholesky
- Validation Set ≠ Installation Set
- Selected parameters for each problem size, given as (threads, tile size) pairs:

         Hipatia                         Saturno                         Joule
n        Optimum   Mod-LS    Mod-NNLS    Optimum   Mod-LS    Mod-NNLS    Optimum   Mod-LS    Mod-NNLS
2000     (12,200)  (8,280)   (16,280)    (24,80)   (12,80)   (24,80)     (32,40)   (16,200)  (64,40)
3000     (16,280)  (12,280)  (16,280)    (24,80)   (18,80)   (24,80)     (64,60)   (48,40)   (64,60)
4000     (16,280)  (16,280)  (16,280)    (24,120)  (24,120)  (24,80)     (64,60)   (64,60)   (64,60)
5000     (16,280)  (16,280)  (16,280)    (24,120)  (24,160)  (24,120)    (64,140)  (64,60)   (64,60)
6000     (16,280)  (16,280)  (16,280)    (24,120)  (24,160)  (24,120)    (64,140)  (64,60)   (64,80)
7000     (16,280)  (16,280)  (16,280)    (24,160)  (24,200)  (24,120)    (64,140)  (64,60)   (64,80)
8000     (16,280)  (16,280)  (16,280)    (24,160)  (24,240)  (24,160)    (64,200)  (64,200)  (64,100)
9000     (16,280)  (16,280)  (16,280)    (24,200)  (24,240)  (24,160)    (64,200)  (64,200)  (64,100)
10000    (16,280)  (16,280)  (16,280)    (24,200)  (24,280)  (24,160)    (64,200)  (64,200)  (64,100)

[Figure: three bar charts, Cholesky (Hipatia), Cholesky (Saturno) and Cholesky (Joule), showing the deviation (%) of Mod-LS and Mod-NNLS for n = 2000 to 10000.]

Example. Empirical modelling of QR
- Installation Set (QR):
  - n = {512, 1024, 1536, 2048, 2560, 3072, 3584, 4096, 4608, 5120}
  - b ranging from 24 to 304, with increment b_inc = 40
  - l ranging from 28 to 208, with increment l_inc = 20
  - t ranging from 4, 6 or 16 up to the number of available cores, with increment t_inc = 4, 6 or 16
- Mod-NNLS:
  - Hipatia: $T(n,t,b,l) = k_1 \frac{n^3}{t} + k_2 \frac{n^3}{bl} + k_{19} \frac{n^2 l}{t}$
  - Saturno: $T(n,t,b,l) = k_1 \frac{n^3}{t} + k_4 \frac{n^3}{b} + k_{10} n^2 t l + k_{20} n t b^2 + k_{23} n t l^2 + k_{24} \frac{n t}{l}$
  - Joule: $T(n,t,b,l) = k_1 \frac{n^3}{t} + k_4 \frac{n^3}{b} + k_{12} n^2 l + k_{13} \frac{n^2}{bl} + k_{17} \frac{n^2}{t} + k_{19} \frac{n^2 l}{t} + k_{20} n t b^2 + k_{21} n t l^2 + k_{28} n b^2$
