SLIDE 1

Static versus Dynamic Memory Allocation: a Comparison for Linear Algebra Kernels

Toufik Baroudi (1), Vincent Loechner (2), Rachid Seghir (1)

(1) University of Batna, Algeria
(2) University of Strasbourg & INRIA, France

IMPACT 2020

Baroudi, Loechner, Seghir Static vs. Dynamic Memory Allocation IMPACT 2020 1 / 25

SLIDE 2

Introduction

Two years ago [BSL17, TACO]: compact data layout for regular sparse matrices

  • Optimized by Pluto
  • Our preliminary benchmarks were inconsistent

→ Cause: the matrix allocation mode, either a statically declared array or an array of pointers to dynamically allocated memory
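
The two allocation modes can be sketched in C as follows. This is a minimal illustration, not the code used in the talk: the names `A_static`, `alloc_rows`, `free_rows`, `sum_static` and `sum_rows` are hypothetical, and N=512 is chosen only to keep the example small. The point is that the same `a[i][j]` expression compiles to different address computations in each mode.

```c
#include <assert.h>
#include <stdlib.h>

#define N 512  /* illustrative size, not the N=8000 of the experiments */

/* Static mode: one contiguous block with compile-time dimensions. */
static double A_static[N][N];

/* Dynamic mode: an array of n row pointers, each row allocated separately.
   Returns NULL if any allocation fails. */
double **alloc_rows(size_t n)
{
    assert(n > 0);
    double **a = malloc(n * sizeof *a);
    if (a == NULL)
        return NULL;
    for (size_t i = 0; i < n; i++) {
        a[i] = calloc(n, sizeof **a);
        if (a[i] == NULL) {            /* roll back rows allocated so far */
            while (i > 0)
                free(a[--i]);
            free(a);
            return NULL;
        }
    }
    return a;
}

void free_rows(double **a, size_t n)
{
    if (a == NULL)
        return;
    for (size_t i = 0; i < n; i++)
        free(a[i]);
    free(a);
}

/* Same traversal, two addressing schemes. */
double sum_static(void)
{
    double s = 0.0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            s += A_static[i][j];       /* offset i*N + j from a fixed base */
    return s;
}

double sum_rows(double **a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            s += a[i][j];              /* load row pointer a[i], then index j */
    return s;
}
```

In the second form the compiler sees N independent row pointers rather than one block of known shape, which is one plausible reason its analysis of the loops may differ between the two modes.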

SLIDE 3

Introduction: Content of this Presentation

  • We precisely analyze one code, triangular matrix multiplication, using the performance counters (#instr., #mem. accesses, #L1-L3 cache misses, #TLB misses, #vectorized instr.)
  • We ran the same tests on the PolyBench linear algebra kernels

SLIDE 5

Introduction: Objective

Array allocation mode influences performance!

Main factors of performance variation:

  • ability of the compiler to detect vectorization
  • number of cache misses and memory loads

This work is not a manifesto for one type of allocation or the other; it is a warning: the declaration and allocation of arrays matter! Comparisons between versions of a code that use different array allocation modes can be biased.

SLIDE 6

Table of Contents

  1. Introduction
  2. Triangular Matrix Multiplication: Demonstration
  3. Triangular Matrix Multiplication: Performance Analysis
  4. PolyBench: Performance Analysis
  5. Conclusion

SLIDE 7

Motivation: Triangular Matrix Multiplication Demo

Run in completely different conditions than in the paper:

  • On this laptop (macOS 10.14, clang/llvm-9.0.0, 4-core Intel Core i7)

SLIDE 9

Triangular Matrix Multiplication: setup

  • Intel platform: dual-socket Intel Xeon E5-2650v3 (Haswell-EP), 2x10 hyperthreaded cores, AVX2 (256 bits)
  • AMD platform: dual-socket AMD Opteron 6172 (Magny-Cours), 2x12 cores, SSE (128 bits)
  • Using pluto-0.11.4 --tile --parallel
  • Using gcc-7.4.0 -O3 -march=native -fopenmp
  • On a regular Linux 4.0.15 (Ubuntu)
  • Problem size: N=8000
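
For orientation, a triangular matrix multiplication can be sketched as below. The slides do not show the kernel itself, so this is an assumption: it multiplies two upper-triangular matrices stored as full 2D arrays, and the name `utmm` and its signature are hypothetical. The versions measured in the talk were additionally tiled and parallelized by Pluto, which this sketch does not attempt.

```c
#include <assert.h>
#include <stddef.h>

/* C = A * B for upper-triangular n x n matrices stored as full 2D arrays.
   Only the upper triangle of c (j >= i) is written. */
void utmm(size_t n, double a[n][n], double b[n][n], double c[n][n])
{
    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        for (size_t j = i; j < n; j++) {
            double acc = 0.0;
            /* a[i][k] is zero for k < i and b[k][j] is zero for k > j,
               so only k in [i, j] contributes. */
            for (size_t k = i; k <= j; k++)
                acc += a[i][k] * b[k][j];
            c[i][j] = acc;
        }
    }
}
```

The loop bounds depend on the outer indices, so the iteration space is a non-rectangular polyhedron, the kind of loop nest that Pluto is designed to handle.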

SLIDE 10

Triangular Matrix Multiplication: execution time

[Bar chart, log scale] Execution time (s)

Version    Orig Stat   Orig Dyn   Par Stat   Par Dyn
Intel C1   166         209        10.2       19.8
AMD C1     446         512        25         20
Intel C2   64.8        64.1       9.7        17.3
AMD C2     135.1       135.2      40.9       34.2

SLIDE 11

Triangular Matrix Multiplication: L1-dcache-loads

[Bar chart] # L1-dcache-loads (billions)

Version    Orig Stat   Orig Dyn   Par Stat   Par Dyn
Intel C1   353         727        279        347
AMD C1     471         834        320        366
Intel C2   43.8        44.1       173        319
AMD C2     204         205        323        369

SLIDE 12

Triangular Matrix Multiplication: L1-dcache-misses

[Bar chart, log scale] # L1-dcache-misses (billions)

Version    Orig Stat   Orig Dyn   Par Stat   Par Dyn
Intel C1   19.88       19.87      2.25       2.44
AMD C1     1.01        1.03       0.89       0.88
Intel C2   19.7        19.8       2.08       2.68
AMD C2     3.17        3.62       0.85       0.91

SLIDE 13

Triangular Matrix Multiplication: L3-dcache-misses

[Bar chart, log scale] # L3-dcache-misses (billions)

Version    Orig Stat   Orig Dyn   Par Stat   Par Dyn
Intel C1   0.61        0.53       0.12       0.12
AMD C1     0.83        0.82       0.44       0.41
Intel C2   3.25        2.99       0.12       0.12
AMD C2     1.86        2.19       0.42       0.4

SLIDE 14

Triangular Matrix Multiplication: dTLB-misses

[Bar chart, log scale] # dTLB-misses (billions)

Version    Orig Stat   Orig Dyn   Par Stat   Par Dyn
Intel C1   43          40         43         40
AMD C1     300         204        250        66
Intel C2   177         105        58         42
AMD C2     208         207        345        35

SLIDE 15

Triangular Matrix Multiplication: vectorized instructions

[Bar chart, log scale] # vectorized instructions (millions)

Version    Orig Stat   Orig Dyn   Par Stat   Par Dyn
Intel C1   19.3        1          2,006.5    4.6
Intel C2   69,360      69,310     1,280      3

This counter is unavailable on the AMD platform, but "gcc -fopt-info-vec" seems to confirm the correlation.

SLIDE 16

Triangular Matrix Multiplication: Synthesis

  • Array allocation mode has a significant impact on the performance of this code
  • It can have opposite effects on different processors!
  • Factors of influence:
      - number of memory accesses
      - number of cache and TLB misses
      - number of vectorized instructions
  • Other experiments (1) on the Intel platform show that the number of vectorized instructions is a major factor of influence

(1) On other triangular matrix kernels: Cholesky, SolveMat, sspfa.

SLIDE 18

PolyBench: setup

  • On the Intel platform
  • Using pluto-0.11.4 --tile --parallel
  • Using gcc-7.4.0 -O3 -march=native -fopenmp
  • PolyBench macro POLYBENCH_STACK_ARRAYS:
      - static version: stack-allocated static array
      - dynamic version: multidimensional heap-allocated array (not an array of pointers as in the previous experiment)
  • Problem size:
      - N=2,000 for O(N^3) algorithms
      - N=20,000 for O(N^2) algorithms
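
The difference between these two PolyBench modes can be sketched as below. This is an illustration of the two storage shapes only, not PolyBench's actual macros: `row_t`, `set_identity`, `trace`, `trace_stack` and `trace_heap` are hypothetical names, and N=128 is chosen so that the stack array stays small. Unlike an array of row pointers, the heap version here is a single contiguous block that is still indexed as `a[i][j]`.

```c
#include <assert.h>
#include <stdlib.h>

#define N 128                 /* illustrative size only */

typedef double row_t[N];      /* one matrix row */

/* Works on either kind of storage: both are seen as a pointer to rows. */
static void set_identity(row_t *a)
{
    assert(a != NULL);
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            a[i][j] = (i == j) ? 1.0 : 0.0;
}

static double trace(row_t *a)
{
    double t = 0.0;
    for (size_t i = 0; i < N; i++)
        t += a[i][i];
    return t;
}

/* "Static" version: the array lives on the stack. */
double trace_stack(void)
{
    double a[N][N];
    set_identity(a);
    return trace(a);
}

/* "Dynamic" version: one contiguous heap block, same a[i][j] indexing. */
double trace_heap(void)
{
    row_t *a = malloc(N * sizeof *a);
    if (a == NULL)
        return -1.0;          /* allocation failed */
    set_identity(a);
    double t = trace(a);
    free(a);
    return t;
}
```

Both versions use the same row-major address arithmetic, so any performance difference between them has to come from elsewhere, for example from what the compiler knows about the base address.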

SLIDE 19

PolyBench: execution time

[Bar chart, log scale] Execution time (s)

Kernel     Orig Stat   Orig Dyn   Par Stat   Par Dyn
2mm        164.8       43.35      1.69       1.82
3mm        247.4       65.6       3.02       3.1
atax       0.93        0.93       0.4        0.41
bicg       1.48        0.75       0.4        0.41
mvt        9.55        3.04       0.4        0.41
trisolv    0.36        0.34       0.12       0.12
lu         12.72       12.73      0.36       0.43
cholesky   2.21        2.21       0.76       0.76

Chart annotations: 3.8x, +20%

SLIDE 20

PolyBench: vectorized instructions

[Bar chart, log scale] # vectorized instructions (millions)

Kernel     Orig Stat   Orig Dyn   Par Stat   Par Dyn
2mm        38.66       8,042      6,294      5,537
3mm        39.21       12,042     6,295      5,539
atax       12,016      12,013     1,214      1,176
bicg       951.8       952.1      1,264      1,264
mvt        901.7       1,102      901.6      901.3
trisolv    300.8       300.8      300.8      300.8
lu         6,016       6,018      8,082      7,830
cholesky   6,006       6,009      6,007      6,007

SLIDE 21

PolyBench: L1 cache misses

[Bar chart, log scale] # L1-dcache-misses (millions)

Kernel     Orig Stat   Orig Dyn   Par Stat   Par Dyn
2mm        25,870      6,483      223        271
3mm        38,820      9,725      365        412
atax       260         260        194        223
bicg       210         210        193        210
mvt        1,030       370        232        263
trisolv    82          85         65         67
lu         15,432      15,423     12,993     12,982
cholesky   12,761      12,751     12,960     12,951

SLIDE 23

PolyBench: memory loads

[Bar chart, log scale] # L1-dcache-loads (billions)

Kernel     Orig Stat   Orig Dyn   Par Stat   Par Dyn
2mm        32.19       8.08       23.78      27.85
3mm        48.29       12.11      40.16      44.2
atax       1.66        1.68       1.91       2.11
bicg       3.07        1.87       1.94       2.12
mvt        2.37        1.72       3.06       3.06
trisolv    0.75        0.75       1.24       1.21
lu         16.35       16.34      13.2       14.65
cholesky   13.12       13.11      13.02      13.01

Chart annotations: +2.6% and +20% on execution time; +10% and +11% on L1-dcache-loads

SLIDE 24

Analysis

Factors of influence:

  • number of vectorized instructions
  • number of cache misses
  • number of memory accesses?

Why are there more or fewer memory accesses when arrays are allocated statically rather than dynamically?

→ We suspect that the varying pressure on register allocation changes the compiler's decision on data reuse (bicg)
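
To make the data-reuse question concrete, here is a sketch of a BiCG-style kernel. It assumes the usual formulation (s = A^T r and q = A p computed in one sweep over A) with a square n x n matrix; the function name and signature are hypothetical and this is not the exact benchmark source.

```c
#include <assert.h>
#include <stddef.h>

/* BiCG-style kernel: s = A^T r and q = A p in a single sweep over A. */
void bicg(size_t n, double a[n][n], double s[n], double q[n],
          double p[n], double r[n])
{
    assert(n > 0);
    for (size_t j = 0; j < n; j++)
        s[j] = 0.0;
    for (size_t i = 0; i < n; i++) {
        q[i] = 0.0;
        for (size_t j = 0; j < n; j++) {
            s[j] += r[i] * a[i][j];   /* r[i] is invariant in the j loop */
            q[i] += a[i][j] * p[j];   /* q[i] is reused on every j iteration */
        }
    }
}
```

In the inner loop, `r[i]` and `q[i]` are candidates for staying in registers. If the compiler keeps them there, each iteration needs roughly three loads (`a[i][j]`, `s[j]`, `p[j]`); if it does not, they are re-read from memory every iteration and the load count rises. Which choice the compiler makes for a given allocation mode is exactly the open question raised above, and this sketch does not settle it.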

SLIDE 26

Conclusion

  • Array allocation has a significant impact on performance in many cases
  • The advantage can go to either static or dynamic allocation, and it can even flip between architectures!
  • Be careful when you compare codes using different allocations (e.g. when working on data layout transformations): this side effect could bias your measurements!
