
Static versus Dynamic Memory Allocation: a Comparison for Linear Algebra Kernels

Toufik Baroudi (1), Vincent Loechner (2), Rachid Seghir (1). (1) University of Batna, Algeria; (2) University of Strasbourg & INRIA, France. IMPACT 2020.


  1. Static versus Dynamic Memory Allocation: a Comparison for Linear Algebra Kernels. Toufik Baroudi (1), Vincent Loechner (2), Rachid Seghir (1). (1) University of Batna, Algeria; (2) University of Strasbourg & INRIA, France. IMPACT 2020.

  2. Introduction. Two years ago [BSL17, TACO]: a compact data layout for regular sparse matrices, optimized by Pluto. Our preliminary benchmarks were inconsistent, and the cause was the matrix allocation mode: either a statically declared array, or an array of pointers to dynamically allocated memory.
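The two allocation modes named on this slide can be sketched in C as follows. This is a minimal illustration, not the authors' benchmark code: the names `A_static`, `alloc_rows`, `free_rows`, `sum_static`, and `sum_rows` are hypothetical, and the size `N` is kept tiny here (the experiments in the talk use N=8000).

```c
#include <assert.h>
#include <stdlib.h>

#define N 4  /* tiny illustrative size (assumption); the talk uses N=8000 */

/* Static mode: one contiguous block whose dimensions are known at compile time. */
static double A_static[N][N];

/* Dynamic mode: an array of row pointers, each row allocated separately. */
double **alloc_rows(size_t n)
{
    double **rows = malloc(n * sizeof *rows);
    if (rows == NULL)
        return NULL;
    for (size_t i = 0; i < n; i++) {
        rows[i] = calloc(n, sizeof **rows);
        if (rows[i] == NULL) {      /* undo the partial allocation on failure */
            while (i-- > 0)
                free(rows[i]);
            free(rows);
            return NULL;
        }
    }
    return rows;
}

void free_rows(double **rows, size_t n)
{
    if (rows == NULL)
        return;
    for (size_t i = 0; i < n; i++)
        free(rows[i]);
    free(rows);
}

/* A_static[i][j] is one base address plus a computed offset. */
double sum_static(void)
{
    double s = 0.0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            s += A_static[i][j];
    return s;
}

/* rows[i][j] first loads the row pointer rows[i], then indexes into that row. */
double sum_rows(double **rows, size_t n)
{
    assert(rows != NULL);
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            s += rows[i][j];
    return s;
}
```

Both versions use the same `a[i][j]` syntax at the call site, which is why the allocation mode is easy to overlook when comparing benchmark variants.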

  3. Introduction: Content of this Presentation. We precisely analyze one code, triangular matrix multiplication, using performance counters (number of instructions, memory accesses, L1 to L3 cache misses, TLB misses, and vectorized instructions). We then ran the same tests on the PolyBench linear algebra kernels.

  4. Introduction: Objective. Array allocation mode influences performance! Main factors of performance variation: the ability of the compiler to detect vectorization opportunities, and the number of cache misses and memory loads.

  5. Introduction: Objective. This work is not a manifesto for one type of allocation or the other; it is a warning: the declaration and allocation of arrays matter! Comparisons between code versions that use different array allocation modes can be biased.

  6. Table of Contents: (1) Introduction; (2) Triangular Matrix Multiplication: Demonstration; (3) Triangular Matrix Multiplication: Performance Analysis; (4) PolyBench: Performance Analysis; (5) Conclusion.

  7. Motivation: Triangular Matrix Multiplication. Demo run under completely different conditions from those in the paper: on this laptop (macOS 10.14, clang/LLVM 9.0.0, 4-core Intel Core i7).
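The slides do not show the kernel itself. As a rough sketch, assuming the usual textbook product of two lower-triangular matrices, the loop nest looks like the code below. The function name `trmm_lower` and the flat row-major storage are assumptions made for illustration; flat storage is used only so the sketch does not favor either allocation mode discussed in the talk.

```c
#include <assert.h>
#include <stddef.h>

/* C = A * B for lower-triangular n x n matrices stored row-major in flat arrays.
 * Only entries with j <= i are read or written; the strict upper part of c
 * is left untouched. */
void trmm_lower(size_t n, const double *a, const double *b, double *c)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j <= i; j++) {
            double acc = 0.0;
            /* A[i][k] can be nonzero only for k <= i,
             * B[k][j] can be nonzero only for k >= j. */
            for (size_t k = j; k <= i; k++)
                acc += a[i * n + k] * b[k * n + j];
            c[i * n + j] = acc;
        }
    }
}
```

The triangular loop bounds make the iteration space non-rectangular, which is the kind of loop nest a polyhedral tool such as Pluto is designed to tile and parallelize.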


  9. Triangular Matrix Multiplication: setup
     - Intel platform: dual-socket Intel Xeon E5-2650v3 (Haswell-EP), 2x10 hyperthreaded cores, AVX2 (256 bits)
     - AMD platform: dual-socket AMD Opteron 6172 (Magny-Cours), 2x12 cores, SSE (128 bits)
     - using pluto-0.11.4 --tile --parallel
     - using gcc-7.4.0 -O3 -march=native -fopenmp
     - on a regular Linux 4.0.15 (Ubuntu)
     - problem size: N=8000

  10. Triangular Matrix Multiplication: execution time. [Figure: bar chart of execution time (s), log scale, for the Orig Stat, Orig Dyn, Par Stat, and Par Dyn versions on Intel C1, AMD C1, Intel C2, and AMD C2.]

  11. Triangular Matrix Multiplication: L1-dcache-loads. [Figure: two bar charts for the Orig Stat, Orig Dyn, Par Stat, and Par Dyn versions on Intel C1, AMD C1, Intel C2, and AMD C2: execution time (s), log scale, and number of L1 data cache loads (billions).]

  12. Triangular Matrix Multiplication: L1-dcache-misses. [Figure: two bar charts for the Orig Stat, Orig Dyn, Par Stat, and Par Dyn versions on Intel C1, AMD C1, Intel C2, and AMD C2: execution time (s), log scale, and number of L1 data cache misses (billions), log scale.]

  13. Triangular Matrix Multiplication: L3-dcache-misses. [Figure: two bar charts for the Orig Stat, Orig Dyn, Par Stat, and Par Dyn versions on Intel C1, AMD C1, Intel C2, and AMD C2: execution time (s), log scale, and number of L3 data cache misses (billions), log scale.]

  14. Triangular Matrix Multiplication: dTLB-misses. [Figure: two bar charts for the Orig Stat, Orig Dyn, Par Stat, and Par Dyn versions on Intel C1, AMD C1, Intel C2, and AMD C2: execution time (s), log scale, and number of dTLB misses (billions), log scale.]

  15. Triangular Matrix Multiplication: vectorized instructions. [Figure: two bar charts for the Orig Stat, Orig Dyn, Par Stat, and Par Dyn versions on Intel C1, AMD C1, Intel C2, and AMD C2: execution time (s), log scale, and number of vectorized instructions (millions), log scale.] The vectorized-instruction counter is unavailable on the AMD platform, but "gcc -fopt-info-vec" seems to confirm the correlation.
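One way to repeat the "gcc -fopt-info-vec" check is to compile the same row update written against both array kinds and compare the vectorizer notes GCC prints. The sketch below is a hypothetical test case, not code from the talk: `S`, `M`, `update_static`, and `update_rows` are illustrative names. Which variant gets vectorized depends on the compiler version and target, so no particular outcome is claimed here.

```c
#include <assert.h>
#include <stddef.h>

/* Compile with, for example:
 *   gcc -O3 -march=native -fopt-info-vec -c file.c
 * and compare the notes reported for the two loops below. */

#define M 8  /* tiny illustrative size (assumption) */

static double S[M][M];

/* Row update on the static array: the base address and the row length
 * are compile-time constants. */
void update_static(size_t i, size_t k, double alpha)
{
    assert(i < M && k < M);
    for (size_t j = 0; j < M; j++)
        S[i][j] += alpha * S[k][j];
}

/* Same update through row pointers: the compiler generally cannot prove
 * that d[i] and d[k] refer to non-overlapping memory, so it may add a
 * runtime overlap check or give up on vectorizing the loop. */
void update_rows(double **d, size_t n, size_t i, size_t k, double alpha)
{
    assert(d != NULL);
    for (size_t j = 0; j < n; j++)
        d[i][j] += alpha * d[k][j];
}
```

Keeping both functions in one file makes the two vectorizer reports appear side by side in the compiler output.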

  16. Triangular Matrix Multiplication: Synthesis
     - Array allocation mode has a significant impact on the performance of this code.
     - It can have opposite effects on different processors!
     - Factors of influence: number of memory accesses; number of cache and TLB misses; number of vectorized instructions.
     - Other experiments on the Intel platform (on other triangular matrix kernels: Cholesky, SolveMat, sspfa) show that the number of vectorized instructions is a major factor of influence.


  18. PolyBench: setup
     - on the Intel platform
     - using pluto-0.11.4 --tile --parallel
     - using gcc-7.4.0 -O3 -march=native -fopenmp
     - PolyBench macro POLYBENCH_STACK_ARRAYS: the static version uses a stack-allocated static array; the dynamic version uses a multidimensional heap-allocated array (not an array of pointers as in the previous experiment)
     - problem size: N=2,000 for O(N³) algorithms, N=20,000 for O(N²) algorithms
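The "multidimensional heap-allocated array" mentioned here differs from an array of pointers: the whole matrix lives in one contiguous heap block and is indexed through a pointer-to-array type. The sketch below illustrates the idea with plain `malloc` and a C99 variably modified type. It is an assumption-laden illustration, not PolyBench's actual allocation code; `alloc_2d` and `fill_2d` are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>

/* One malloc for the whole n x n matrix; the caller views the result
 * as double (*)[n], a pointer to rows of n doubles. */
void *alloc_2d(size_t n)
{
    assert(n > 0);
    return malloc(n * n * sizeof(double));
}

/* a[i][j] is a single base-plus-offset access: unlike the array-of-pointers
 * layout, no row pointer is loaded from memory first. */
void fill_2d(size_t n, double (*a)[n])
{
    assert(a != NULL);
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            a[i][j] = (double)(i * n + j);
}
```

A caller would write `double (*a)[n] = alloc_2d(n);`, use `a[i][j]` as with a static array, and release the matrix with a single `free(a)`.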

  19. PolyBench: execution time. [Figure: bar chart of execution time (s), log scale, for the Orig Stat, Orig Dyn, Par Stat, and Par Dyn versions of the kernels atax, mvt, 2mm, 3mm, bicg, trisolv, lu, and cholesky, with annotations "3.8x" and "+20%".]

  20. PolyBench: vectorized instructions. [Figure: two bar charts for the Orig Stat, Orig Dyn, Par Stat, and Par Dyn versions of the kernels 2mm, 3mm, atax, bicg, mvt, trisolv, lu, and cholesky: number of vectorized instructions (millions), log scale, and execution time (s), log scale.]
