Tuning space optimization for multi-core architectures, V. Martínez - PowerPoint PPT Presentation



SLIDE 1

Tuning space optimization for multi-core architectures

V. Martínez, F. Dupros, M. Castro, H. Aochi and P. Navaux
SLIDE 2

Contents

  • Introduction.
  • HPC applications performance.
  • Stencil Model.
  • Testbed configuration.
  • Experiments.
  • Results.
  • Conclusion.

SLIDE 3

Scientific Applications

  • Multi-core applications.
  • Stencil applications: data dependency.

SLIDE 4

Contribution

To find the best configuration of runtime parameters (tuning) for stencil computations, based on the number of available threads and the reduction of L3 cache misses.

SLIDE 5

Contents

  • Introduction.
  • Stencil Model.
  • Jacobi 7-point.
  • Testbed configuration.
  • Experiments.
  • Results.
  • Conclusion.

SLIDE 6

Stencil: 7-point Jacobi

  • 3D stencil.
  • Heat equation.
  • Finite difference method.
  • Calculate:

B_{i,j,k} = α A_{i,j,k} + β (A_{i−1,j,k} + A_{i,j−1,k} + A_{i,j,k−1} + A_{i+1,j,k} + A_{i,j+1,k} + A_{i,j,k+1})

Algorithm 1: Pseudocode for the stencil algorithm

 1: for each timestep do
 2:   Compute in parallel:
 3:     for each block in X-direction do
 4:       for each block in Y-direction do
 5:         for each block in Z-direction do
 6:           Compute stencil(3D tile)
 7:         end for
 8:       end for
 9:     end for
10: end for

SLIDE 7

Contents

  • Introduction.
  • Stencil Model.
  • Testbed configuration.
  • Hardware/Application setup.
  • Experiments.
  • Results.
  • Conclusion.

SLIDE 8

Experiments (Testbed)

                        Node 1       Node 2
  Processor             i5-4570      Xeon X7550
  Clock (GHz)           3.2          2.0
  Cores                 4            8
  Sockets               1            4
  Threads               4            64
  L3 cache size (MB)    6            18
  Compiler              gcc-4.6.4    gcc-4.6.4

SLIDE 9

Contents

  • Introduction.
  • Stencil Model.
  • Testbed configuration.
  • Experiments.
  • Results.
  • Conclusion.

SLIDE 10

Experiments (Setup)

  • Output:
  • Cache misses (PAPI_L3_TCM).
  • Cache accesses (PAPI_L3_TCA).
  • Time.
  • GFLOPS.

Input vector (total configurations):

               Node 1    Node 2
  Threads      2         6
  Looping      2         2
  Size         3         3
  Chunk        4         4
  Scheduling   3         3
  Total        144       432

SLIDE 11

Algorithms

Naive
  • Triple nested loops coming from the three spatial dimensions.
Blocking
  • Dependencies between components are exploited to implement a space-time decomposition.
Skew
  • Decompose the stencil using both the space and the time directions, but in a specific order.

[Figure: tile decomposition (U0, U1, U2, U3) illustrating the size of the tile.]

SLIDE 12

Contents

  • Introduction.
  • Stencil Model.
  • Testbed configuration.
  • Experiments.
  • Results.
  • Tuning.
  • Conclusion.

SLIDE 13

Results (Algorithms)

[Figure: performance (Gflops, 2-8) and L3 cache misses (log scale, 1e+02-1e+08) on Node 1 and Node 2. Naive (magenta), Blocking (green) and Skew (cyan).]

SLIDE 14

Results (Scalability)

[Figure: Gflops scalability on Node 1 (up to 4 threads) and Node 2 (2-64 threads). Naive (magenta), Blocking (green) and Skew (cyan).]

SLIDE 15

Results (Code optimization)

[Figure: Gflops (0-10) for Naive, Blocking and Skew on Node 1 and Node 2. Parallelfor (red), Tasking (blue).]

SLIDE 16

Results (Problem size)

[Figure: Gflops (0-10) for problem sizes 128, 256 and 512. Node 1 - Tasking, Node 2 - Parallelfor. Naive (magenta), Blocking (green) and Skew (cyan).]

SLIDE 17

Results (Scheduling)

[Figure: Gflops (5-15) versus chunk size (32, 128, 256, 512) on Node 2 for Naive, Blocking and Skew. Parallelfor (red), Tasking (blue).]

SLIDE 18

Results (Scheduling)

[Figure: Gflops (5-15) versus chunk size (32, 128, 256, 512) on Node 2 for Naive, Blocking and Skew, by scheduling policy. Dynamic (green), Guided (orange) and Static (gray).]

SLIDE 19

Makespan (Naive parallelfor)
SLIDE 21

Makespan (Naive tasking)
SLIDE 23

Contents

  • Introduction.
  • Stencil Model.
  • Testbed configuration.
  • Experiments.
  • Results.
  • Conclusion.

SLIDE 24

Conclusion

  • Tasking achieves good performance when the algorithm does not use the cache intensively (Skew).
  • Chunk size and the scheduling algorithm (OpenMP) play an important role for the Naive algorithm and can help reach peak performance.
  • Fitting: cache misses can be approximated by a linear fit, whereas performance can be predicted by an exponential fit.
  • Future work: we intend to develop an auto-tuning approach to automate the choice of the input parameters.

SLIDE 25

Questions?

victor.martinez@inf.ufrgs.br

Thanks