SLIDE 1

Empirical Modelling of Linear Algebra Shared-Memory Routines

Jesús Cámara, Luis P. García, Javier Cuenca, Domingo Giménez
Scientific Computing and Parallel Programming Group
University of Murcia (UMU), Spain

ICCS 2013

Domingo Giménez (UMU) domingo@um.es ICCS / June 5-7, 2013 1 / 24

slide-2
SLIDE 2

Introduction

  • Multicore processors and cc-NUMA systems can offer performance improvements
  • Software optimization techniques are necessary to benefit from the potential of the hardware:
      Modelling the execution time of the routine
      Applying some empirical approach to study the behavior

In this work:

  • Analysis of the behavior of multithreaded LAPACK routines in PLASMA and Intel MKL
  • Methodology for installation and modelling: take decisions at running time to reduce the execution time
  • Typical decisions: number of threads, block or tile size, routine to use

SLIDE 3

Outline

  • Introduction
  • Auto-tuning methodology
  • Motivation
  • Routines and Computational Systems
  • Empirical Modelling Method. PLASMA
  • Comparison with Intel MKL LAPACK
  • Conclusions and future work lines

SLIDE 4

Auto-tuning methodology

[Figure: auto-tuning workflow with three phases (DESIGN, INSTALLATION, EXECUTION). Diagram labels: LAR, Extracting AP, Selected AP, Implementing the Manager, SOLAR Manager, Tuning AP, Installation Set, Tuned AP, Execution of LAR, Optimum AP, Selection of Optimum AP, nR.]

SLIDE 5

Motivation

  • PLASMA parallelism is not hidden inside BLAS
  • The PLASMA library relies on tile algorithms
  • Tile algorithms: outer block size and inner block size
  • Auto-tuning PLASMA: finding the outer and inner block size pairs that maximize performance
  • BUT default values are used for the tile sizes:
      Cholesky: 120
      LU: (200, 20)
      QR: (144, 48)

SLIDE 6

Motivation

The optimum tile size depends on number of threads, matrix size and computing system

Cholesky factorization, 16-core system

[Figure: three surface plots of the execution time t (seconds) against the tile size b (40 to 280) and the number of threads (8 to 16), for n = 3048, n = 4072 and n = 5096.]

SLIDE 7

Motivation

The optimum tile size depends on number of threads, matrix size and computing system

Cholesky factorization, 24-core system

[Figure: three surface plots of the execution time t (seconds) against the tile size b (40 to 280) and the number of threads (8 to 24), for n = 3048, n = 4072 and n = 5096.]

SLIDE 8

Routines and Computational Systems

LAPACK routines
  • Cholesky
  • LU
  • QR

Computational systems
  • Hipatia: system with 16 cores, 4 Intel Xeon quad-core processors (4 cores each), 2.93 GHz. Linux 2.6.18, Intel icc compiler (v12.0.0) and Intel MKL (v10.3.2)
  • Saturno: NUMA system with 24 cores, 4 Intel Xeon X7542 (hexa-core) processors, 1.87 GHz, 32 GB of shared memory. Linux 2.6.35, Intel icc compiler (v12.0.2) and Intel MKL (v10.3.2)
  • Joule: NUMA system with 64 cores, 4 AMD Opteron 6276 (16 cores) processors, 2.3 GHz, 64 GB of shared memory. Linux 2.6.32, Intel icc compiler (v12.1.3) and Intel MKL (v10.3.9)

SLIDE 9

Empirical Modelling Method

  • Empirically estimated model of the execution time
  • Possible combinations: problem size and algorithm parameters (number of threads, block sizes, etc.)
  • Algorithm parameter number of threads (t):

\[ \{n^3, n^2, n, 1\} \times \{t, 1, \tfrac{1}{t}\} \]

\[ T(n,t) = k_1 \frac{n^3}{t} + k_2 n^2 t + k_3 n^2 + k_4 \frac{n^2}{t} + k_5 n t + k_6 n \]

  • Experiments are performed for different values of n and t
  • Estimation of the values of k_i with least squares (LS) or non-negative least squares (NNLS):
      LS: all coefficients have non-zero values (positive or negative)
      NNLS: the values obtained for the coefficients are positive or zero
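The fitting step can be sketched in Python as follows. This is only an illustration under stated assumptions: the installation grid and the coefficients used to generate the synthetic times are made up, and the slides do not say which solver was used, so `numpy.linalg.lstsq` and `scipy.optimize.nnls` stand in for LS and NNLS.

```python
import numpy as np
from scipy.optimize import nnls


def design_matrix(n, t):
    """One column per term of T(n, t), in the order k1..k6."""
    n = np.asarray(n, dtype=float)
    t = np.asarray(t, dtype=float)
    return np.column_stack([n**3 / t, n**2 * t, n**2, n**2 / t, n * t, n])


# Hypothetical installation set: every (n, t) pair of a small grid.
sizes = [500, 1500, 2500, 3500, 4500]
threads = [4, 8, 12, 16]
n_grid, t_grid = np.meshgrid(sizes, threads, indexing="ij")
n_obs, t_obs = n_grid.ravel(), t_grid.ravel()

# Synthetic "measured" times from made-up coefficients (not real data).
k_true = np.array([2e-9, 0.0, 1e-8, 0.0, 1e-6, 0.0])
A = design_matrix(n_obs, t_obs)
times = A @ k_true

# Scale the columns to unit norm so the fit is better conditioned,
# then undo the scaling on the fitted coefficients.
scale = np.linalg.norm(A, axis=0)
k_ls = np.linalg.lstsq(A / scale, times, rcond=None)[0] / scale
k_nnls = nnls(A / scale, times)[0] / scale

print("LS coefficients:  ", k_ls)
print("NNLS coefficients:", k_nnls)
```

With real, noisy measurements the LS coefficients may come out negative, while NNLS constrains every coefficient to be positive or zero, which is the distinction drawn above.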

SLIDE 10

Empirical Modelling Method. PLASMA

Algorithm parameters:

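  • Number of threads (t)
  • Outer block size (b)
  • Inner block size (l)

Possible combinations:

\[ \{n^3, n^2, n, 1\} \times \{t, 1, \tfrac{1}{t}\} \times \{b^2, b, 1, \tfrac{1}{b}\} \times \{l^2, l, 1, \tfrac{1}{l}\} \]

\[
\begin{aligned}
T(n,t,b,l) ={} & k_1 \frac{n^3}{t} + k_2 \frac{n^3}{bl} + k_3 \frac{n^3}{l} + k_4 \frac{n^3}{b} + k_5 n^2 t b + k_6 n^2 t + k_7 \frac{n^2 t}{bl} + k_8 \frac{n^2 t}{l} + k_9 \frac{n^2 t}{b} \\
& + k_{10} n^2 t l + k_{11} n^2 b + k_{12} n^2 l + k_{13} \frac{n^2}{bl} + k_{14} \frac{n^2}{l} + k_{15} \frac{n^2}{b} + k_{16} n^2 + k_{17} \frac{n^2}{t} + k_{18} \frac{n^2 b}{t} + k_{19} \frac{n^2 l}{t} \\
& + k_{20} n t b^2 + k_{21} n t b l + k_{22} n t b + k_{23} n t l^2 + k_{24} \frac{n t}{l} + k_{25} \frac{n t}{b} + k_{26} n t l \\
& + k_{27} n t + k_{28} n b^2 + k_{29} n b l + k_{30} n b + k_{31} n l^2 + k_{32} n l + k_{33} n
\end{aligned}
\]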
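One possible way to handle the 33 terms in code is to store each term as a tuple of exponents of (n, t, b, l). The sketch below is an illustration only: the names `TERMS`, `design_row` and `predict` are hypothetical, and the exponents are transcribed from the model T(n, t, b, l) above.

```python
# Exponents of (n, t, b, l) for each term, in the order k1..k33.
TERMS = [
    (3, -1, 0, 0), (3, 0, -1, -1), (3, 0, 0, -1), (3, 0, -1, 0),   # k1-k4
    (2, 1, 1, 0), (2, 1, 0, 0), (2, 1, -1, -1), (2, 1, 0, -1),     # k5-k8
    (2, 1, -1, 0), (2, 1, 0, 1), (2, 0, 1, 0), (2, 0, 0, 1),       # k9-k12
    (2, 0, -1, -1), (2, 0, 0, -1), (2, 0, -1, 0), (2, 0, 0, 0),    # k13-k16
    (2, -1, 0, 0), (2, -1, 1, 0), (2, -1, 0, 1), (1, 1, 2, 0),     # k17-k20
    (1, 1, 1, 1), (1, 1, 1, 0), (1, 1, 0, 2), (1, 1, 0, -1),       # k21-k24
    (1, 1, -1, 0), (1, 1, 0, 1), (1, 1, 0, 0), (1, 0, 2, 0),       # k25-k28
    (1, 0, 1, 1), (1, 0, 1, 0), (1, 0, 0, 2), (1, 0, 0, 1),        # k29-k32
    (1, 0, 0, 0),                                                  # k33
]


def design_row(n, t, b, l):
    """Value of each of the 33 terms for one experiment (n, t, b, l)."""
    return [float(n**en * t**et * b**eb * l**el) for en, et, eb, el in TERMS]


def predict(k, n, t, b, l):
    """Modelled execution time for coefficients k (a sequence of 33 values)."""
    return sum(ki * xi for ki, xi in zip(k, design_row(n, t, b, l)))


row = design_row(4000, 16, 280, 40)
print(len(row), "terms; first term n^3/t =", row[0])
```

Stacking one such row per experiment gives the matrix whose columns are fitted with LS or NNLS.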

SLIDE 11

Empirical Modelling Method. PLASMA

Installation
  • Executing the routine with values of the parameters in an Installation Set
  • Varying t, b and l over some preselected values
  • Estimation of 33 or 17 k_i coefficients with LS or NNLS
  • Obtain Mod-LS or Mod-NNLS models for the execution time of the routine
  • The model and the different possible values for the algorithm parameters are stored

At execution time: the number of threads and tile sizes are selected for each problem size with the information provided by the model
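The execution-time selection can be sketched as a search over the stored parameter values for the combination with the lowest modelled time. The example below is a minimal illustration, not the authors' implementation: it uses a two-term Cholesky model of the form k1 n^3/t + k2 n^3/b with hypothetical coefficient values, and the function names are invented for the example.

```python
from itertools import product


def cholesky_model(k1, k2):
    """Two-term model T(n, t, b) = k1 * n^3 / t + k2 * n^3 / b."""
    return lambda n, t, b: k1 * n**3 / t + k2 * n**3 / b


def select_parameters(model, n, thread_values, tile_values):
    """Return the stored (t, b) pair with the lowest modelled time for size n."""
    return min(product(thread_values, tile_values),
               key=lambda tb: model(n, *tb))


# Hypothetical coefficients; real values would come from the installation.
model = cholesky_model(k1=2e-9, k2=1e-10)
threads = range(4, 17, 4)   # 4, 8, 12, 16
tiles = range(40, 301, 40)  # 40, 80, ..., 280

best = select_parameters(model, 4000, threads, tiles)
print("selected (t, b):", best)
```

Because both terms of this particular model decrease as t and b grow, the search always returns the largest stored values of t and b when both coefficients are positive.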

SLIDE 12
Example. Empirical modelling of Cholesky

Installation Set. Cholesky
  • n = {500, 1500, 2500, 3500, 4500, 5500, 6500, 7500, 8500, 9500}
  • b ranging from 40 to 300, b inc = 40
  • t ranging from 4, 6 or 16 to the number of available cores, t inc = 4, 6, 16

Mod-NNLS
  • Hipatia. \( T(n,t,b) = k_1 \frac{n^3}{t} + k_2 \frac{n^3}{b} \)
  • Saturno. \( T(n,t,b) = k_1 \frac{n^3}{t} + k_2 \frac{n^3}{b} + k_3 n^2 t b + k_{11} n t b^2 \)
  • Joule. \( T(n,t,b) = k_1 \frac{n^3}{t} + k_2 \frac{n^3}{b} + k_3 n^2 t b + k_6 n^2 b + k_{11} n t b^2 \)

With Mod-NNLS the non-zero coefficients change with the execution platform. This gives insight into the contribution of the values of the algorithmic parameters.
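The Cholesky installation set can be enumerated as a grid of (n, t, b) experiments. The sketch below assumes that the thread start and increment values 4, 6 and 16 correspond to Hipatia, Saturno and Joule respectively; the slide lists the three values without naming the systems, so this pairing is an assumption based on their core counts.

```python
from itertools import product

CORES = {"Hipatia": 16, "Saturno": 24, "Joule": 64}
# Assumed pairing of the thread start/increment values with the systems.
T_STEP = {"Hipatia": 4, "Saturno": 6, "Joule": 16}


def installation_set(system):
    """List every (n, t, b) experiment of the Cholesky installation set."""
    sizes = range(500, 9501, 1000)   # 500, 1500, ..., 9500
    tiles = range(40, 301, 40)       # 40, 80, ..., 280
    step = T_STEP[system]
    threads = range(step, CORES[system] + 1, step)
    return list(product(sizes, threads, tiles))


for system in CORES:
    experiments = installation_set(system)
    print(system, len(experiments), "experiments, first:", experiments[0])
```

Under these assumptions each system runs 10 sizes, 4 thread counts and 7 tile sizes.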

SLIDE 13
Example. Empirical modelling of Cholesky

Validation Set ≠ Installation Set

Selected (t, b) for each problem size:

Hipatia
size    Optimum    Mod-LS     Mod-NNLS
2000    (12,200)   (8,280)    (16,280)
3000    (16,280)   (12,280)   (16,280)
4000    (16,280)   (16,280)   (16,280)
5000    (16,280)   (16,280)   (16,280)
6000    (16,280)   (16,280)   (16,280)
7000    (16,280)   (16,280)   (16,280)
8000    (16,280)   (16,280)   (16,280)
9000    (16,280)   (16,280)   (16,280)
10000   (16,280)   (16,280)   (16,280)

Saturno
size    Optimum    Mod-LS     Mod-NNLS
2000    (24,80)    (12,80)    (24,80)
3000    (24,80)    (18,80)    (24,80)
4000    (24,120)   (24,120)   (24,80)
5000    (24,120)   (24,160)   (24,120)
6000    (24,120)   (24,160)   (24,120)
7000    (24,160)   (24,200)   (24,120)
8000    (24,160)   (24,240)   (24,160)
9000    (24,200)   (24,240)   (24,160)
10000   (24,200)   (24,280)   (24,160)

Joule
size    Optimum    Mod-LS     Mod-NNLS
2000    (32,40)    (16,200)   (64,40)
3000    (64,60)    (48,40)    (64,60)
4000    (64,60)    (64,60)    (64,60)
5000    (64,140)   (64,60)    (64,60)
6000    (64,140)   (64,60)    (64,80)
7000    (64,140)   (64,60)    (64,80)
8000    (64,200)   (64,200)   (64,100)
9000    (64,200)   (64,200)   (64,100)
10000   (64,200)   (64,200)   (64,100)

[Figure: three bar charts of the deviation (%) against n (2000 to 10000), comparing Mod-LS and Mod-NNLS: Cholesky (Hipatia), Cholesky (Saturno), Cholesky (Joule).]
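The slides do not define the deviation explicitly. A plausible reading, assumed here, is the relative excess of the execution time obtained with the model-selected parameters over the time obtained with the optimum parameters. The timings in the sketch are hypothetical and are not taken from the slides.

```python
def deviation_percent(time_selected, time_optimum):
    """Relative excess time (%) of the model-selected configuration."""
    return 100.0 * (time_selected - time_optimum) / time_optimum


def mean_deviation(pairs):
    """Mean deviation (%) over (time_selected, time_optimum) pairs."""
    return sum(deviation_percent(s, o) for s, o in pairs) / len(pairs)


# Hypothetical timings in seconds, one pair per problem size.
pairs = [(1.05, 1.00), (2.00, 2.00), (3.30, 3.00)]
print(f"mean deviation: {mean_deviation(pairs):.2f} %")
```

A deviation of zero means the model selected a configuration as fast as the measured optimum.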

SLIDE 14
Example. Empirical modelling of QR

Installation Set. QR
  • n = {512, 1024, 1536, 2048, 2560, 3072, 3584, 4096, 4608, 5120}
  • b ranging from 24 to 304, b inc = 40
  • l ranging from 28 to 208, l inc = 20
  • t ranging from 4, 6 or 16 to the number of available cores, t inc = 4, 6, 16

Mod-NNLS
  • Hipatia. \( T(n,t,b,l) = k_1 \frac{n^3}{t} + k_2 \frac{n^3}{bl} + k_{19} \frac{n^2 l}{t} \)
  • Saturno. \( T(n,t,b,l) = k_1 \frac{n^3}{t} + k_4 \frac{n^3}{b} + k_{10} n^2 t l + k_{20} n t b^2 + k_{23} n t l^2 + k_{24} \frac{n t}{l} \)
  • Joule. \( T(n,t,b,l) = k_1 \frac{n^3}{t} + k_4 \frac{n^3}{b} + k_{12} n^2 l + k_{13} \frac{n^2}{bl} + k_{17} \frac{n^2}{t} + k_{19} \frac{n^2 l}{t} + k_{20} n t b^2 + k_{23} n t l^2 + k_{28} n b^2 \)

SLIDE 15
Example. Empirical modelling of QR

Validation Set ≠ Installation Set

Selected (t, b, l) for each problem size:

Hipatia
size   Optimum        Mod-LS         Mod-NNLS
1512   (16,224,88)    (12,304,188)   (16,304,68)
2024   (16,264,88)    (16,304,188)   (16,304,68)
2536   (16,264,88)    (16,304,188)   (16,304,68)
3048   (16,264,88)    (16,304,168)   (16,304,88)
3560   (16,264,88)    (16,304,168)   (16,304,88)
4072   (16,304,108)   (16,304,168)   (16,304,88)
4584   (16,264,88)    (16,304,28)    (16,304,108)
5096   (16,304,88)    (16,304,28)    (16,304,108)
5608   (16,304,108)   (16,304,28)    (16,304,108)

Saturno
size   Optimum        Mod-LS         Mod-NNLS
1512   (24,104,28)    (24,104,28)    (24,104,68)
2024   (24,104,28)    (24,104,28)    (24,104,68)
2536   (24,144,48)    (24,144,28)    (24,144,48)
3048   (24,144,48)    (24,144,48)    (24,144,48)
3560   (24,144,48)    (24,184,48)    (24,144,48)
4072   (24,184,48)    (24,184,48)    (24,184,48)
4584   (24,184,48)    (24,224,48)    (24,184,48)
5096   (24,184,48)    (24,224,48)    (24,224,48)
5608   (24,144,48)    (24,224,48)    (24,224,48)

Joule
size   Optimum        Mod-LS         Mod-NNLS
1512   (64,64,28)     (32,104,28)    (64,64,48)
2024   (48,64,28)     (64,104,48)    (64,64,48)
2536   (64,64,28)     (64,104,28)    (64,104,48)
3048   (64,64,28)     (64,104,28)    (64,104,48)
3560   (64,64,28)     (64,104,28)    (64,104,48)
4072   (64,144,48)    (64,104,28)    (64,104,48)
4584   (64,144,48)    (64,104,28)    (64,144,28)
5096   (64,144,48)    (64,104,28)    (64,144,28)
5608   (64,144,48)    (64,104,28)    (64,144,28)

[Figure: three bar charts of the deviation (%) against n (1512 to 5608), comparing Mod-LS and Mod-NNLS: QR (Hipatia), QR (Saturno), QR (Joule).]

SLIDE 16

Global results. Empirical modelling of Cholesky and QR

Cholesky
  • The mean deviation from the optimum:
      Mod-LS: 1 % (Hipatia), 4 % (Saturno), 5 % (Joule)
      Mod-NNLS: 0.5 % (Hipatia), 0.4 % (Saturno), 7 % (Joule)
  • The mean improvement with respect to the Default execution is 27 % in Hipatia, 1 % in Saturno and 13 % in Joule.

QR
  • The mean deviation from the optimum:
      Mod-LS: 15 % (Hipatia), 1 % (Saturno), 8 % (Joule)
      Mod-NNLS: 6 % (Hipatia), 2 % (Saturno), 4 % (Joule)
  • The mean improvement with respect to the Default execution is 34 % in Hipatia, 2 % in Saturno and 7 % in Joule.

SLIDE 17

Comparison with Intel MKL LAPACK

Hipatia

[Figure: three plots of the execution time t (seconds) against the matrix size (500 to 20000), comparing PLASMA and MKL: QR (DGEQRF), LU (DGETRF), Cholesky (DPOTRF).]

SLIDE 18

Comparison with Intel MKL LAPACK

Hipatia

[Figure: three plots of the execution time t (seconds) against the matrix size (500 to 20000), comparing PLASMA and MKL: QR (DGEQRF), LU (DGETRF), Cholesky (DPOTRF).]

Saturno

[Figure: three plots of the execution time t (seconds) against the matrix size (500 to 20000), comparing PLASMA and MKL: QR (DGEQRF), LU (DGETRF), Cholesky (DPOTRF).]

SLIDE 19

Comparison with Intel MKL LAPACK

Hipatia, Saturno, Joule

[Figure: for each system, three plots of the execution time t (seconds) against the matrix size (500 to 20000), comparing PLASMA and MKL: QR (DGEQRF), LU (DGETRF), Cholesky (DPOTRF).]

SLIDE 20

Comparison with Intel MKL LAPACK

Hipatia, Saturno, Joule

[Figure: for each system, three plots of the execution time t (seconds) against the matrix size (500 to 20000), comparing PLASMA and MKL: QR (DGEQRF), LU (DGETRF), Cholesky (DPOTRF).]

  • Intel MKL outperforms PLASMA for large matrices, except for QR or in Joule. It is impossible to draw general conclusions about the advantage of using MKL
  • PLASMA can compete against MKL with a correct selection of parameters
  • The auto-tuning methodology can be used to decide which implementation to use and to select the tile sizes and the number of threads correctly

SLIDE 21

Comparison with Intel MKL LAPACK. LU

PLASMA LU vs MKL LU. Hipatia
(execution times in seconds; selected (t, b, l) in parentheses)

size    Mod-LS                 Mod-NNLS               Default   MKL
2000    0.176 (16,120,20)      0.239 (16,280,40)      0.190     0.147
3000    0.433 (16,120,20)      0.447 (16,280,60)      0.377     0.378
4000    0.726 (16,280,100)     0.729 (16,280,60)      0.669     0.695
5000    1.124 (16,280,120)     1.121 (16,280,80)      1.127     1.196
6000    1.745 (16,280,240)     1.777 (16,280,80)      1.820     1.853
7000    2.621 (16,280,20)      2.619 (16,280,80)      2.779     2.659
8000    3.797 (16,280,20)      3.821 (16,280,40)      4.088     3.863
Total   10.622                 10.753                 11.05     10.791

  • The improvements with Mod-LS and Mod-NNLS are close
  • Best times with Mod-LS. Preferred installation method
  • Difference of 4 % with respect to the Default
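The Total row and the 4 % figure for Hipatia can be checked from the per-size times in the table. The sketch below assumes that the reported difference is the relative reduction of the total Mod-LS time with respect to the total Default time; the variable names are chosen for the example.

```python
# Per-size execution times for Hipatia, copied from the table above.
mod_ls = [0.176, 0.433, 0.726, 1.124, 1.745, 2.621, 3.797]
mod_nnls = [0.239, 0.447, 0.729, 1.121, 1.777, 2.619, 3.821]
default = [0.190, 0.377, 0.669, 1.127, 1.820, 2.779, 4.088]
mkl = [0.147, 0.378, 0.695, 1.196, 1.853, 2.659, 3.863]

totals = {
    "Mod-LS": sum(mod_ls),
    "Mod-NNLS": sum(mod_nnls),
    "Default": sum(default),
    "MKL": sum(mkl),
}

# Relative reduction (%) of the Mod-LS total with respect to the Default total.
improvement = 100.0 * (totals["Default"] - totals["Mod-LS"]) / totals["Default"]

for name, total in totals.items():
    print(f"{name:9s} total = {total:.3f}")
print(f"Mod-LS vs Default: {improvement:.1f} %")
```

The same calculation applies to the Saturno and Joule tables.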

SLIDE 22

Comparison with Intel MKL LAPACK. LU

PLASMA LU vs MKL LU. Saturno
(execution times in seconds; selected (t, b, l) in parentheses)

size    Mod-LS                 Mod-NNLS               Default   MKL
3000    0.214 (24,120,20)      0.464 (24,280,60)      0.324     0.279
4000    0.383 (24,160,60)      0.647 (24,280,60)      0.504     0.653
5000    0.692 (24,160,60)      0.859 (24,280,60)      0.865     1.121
6000    1.144 (24,160,60)      1.184 (24,280,80)      1.156     1.853
7000    1.777 (24,160,120)     1.770 (24,280,80)      2.131     2.359
8000    2.508 (24,200,20)      2.498 (24,280,80)      2.508     3.452
9000    3.521 (24,200,20)      3.464 (24,280,100)     3.521     4.727
Total   10.239                 10.886                 11.009    14.444

  • The improvements with Mod-LS and Mod-NNLS are close
  • Best times with Mod-LS. Preferred installation method
  • Difference of 7 % with respect to the Default

SLIDE 23

Comparison with Intel MKL LAPACK. LU

PLASMA LU vs MKL LU. Joule
(execution times in seconds; selected (t, b, l) in parentheses)

size    Mod-LS                 Mod-NNLS               Default   MKL
3000    0.750 (64,120,20)      0.828 (64,280,80)      0.596     0.536
4000    0.822 (64,160,160)     1.164 (64,280,60)      0.848     0.938
5000    1.137 (64,200,200)     1.672 (64,280,60)      1.168     1.775
6000    1.458 (64,200,200)     2.122 (64,280,60)      1.493     2.736
7000    1.939 (64,200,200)     2.817 (64,280,80)      2.029     3.826
8000    2.730 (64,200,20)      2.758 (64,280,80)      2.730     5.066
9000    3.738 (64,200,20)      3.983 (64,280,80)      3.738     8.259
Total   12.574                 15.344                 12.602    23.136

  • The auto-tuning selects satisfactory values of the parameters
  • Similar results to those obtained without parameter tuning: difference of 0.22 % with respect to the Default

SLIDE 24

Conclusions

  • Empirical auto-tuning approach for the PLASMA implementation of LAPACK routines:
      Provides a set of empirically obtained models
      The models facilitate a satisfactory selection of the algorithmic parameters
  • This work focuses on the Cholesky, QR and LU factorizations, but it is representative of the process to be followed for auto-tuning the whole library
  • The methodology is useful to obtain execution times close to the lowest achievable without the need for user expertise
  • The results have shown that PLASMA can, with an auto-tuning methodology, be competitive with the Intel MKL library
