Tuning space optimization for multi-core architectures, V. Martínez - PowerPoint PPT Presentation



SLIDE 1

Tuning space optimization for multi-core architectures

V. Martínez, F. Dupros, M. Castro, H. Aochi and P. Navaux
SLIDE 2

Contents

  • Introduction.
  • HPC applications performance.
  • Stencil Model.
  • Testbed configuration.
  • Experiments.
  • Results.
  • Conclusion.

SLIDE 3

Scientific Applications

  • Multi-core applications.
  • Stencil applications: data dependency.

SLIDE 4

Contribution

To find the best configuration of runtime parameters (tuning) for stencil computations, based on the number of available threads and the reduction of L3 cache misses.

SLIDE 5

Contents

  • Introduction.
  • Stencil Model.
  • Jacobi 7-point.
  • Testbed configuration.
  • Experiments.
  • Results.
  • Conclusion.

SLIDE 6

Stencil: 7-point Jacobi

  • 3D stencil.
  • Heat equation.
  • Finite difference method.
  • Calculate:

B_{i,j,k} = α A_{i,j,k} + β (A_{i−1,j,k} + A_{i,j−1,k} + A_{i,j,k−1} + A_{i+1,j,k} + A_{i,j+1,k} + A_{i,j,k+1})

Algorithm 1: Pseudocode for the stencil algorithm

 1: for each timestep do
 2:   Compute in parallel:
 3:     for each block in X-direction do
 4:       for each block in Y-direction do
 5:         for each block in Z-direction do
 6:           Compute stencil(3D tile)
 7:         end for
 8:       end for
 9:     end for
10: end for

SLIDE 7

Contents

  • Introduction.
  • Stencil Model.
  • Testbed configuration.
  • Hardware/Application setup.
  • Experiments.
  • Results.
  • Conclusion.

SLIDE 8

Experiments (Testbed)

                        Node 1       Node 2
  Processor             i5-4570      Xeon X7550
  Clock (GHz)           3.2          2.0
  Cores                 4            8
  Sockets               1            4
  Threads               4            64
  L3 cache size (MB)    6            18
  Compiler              gcc-4.6.4    gcc-4.6.4

SLIDE 9

Contents

  • Introduction.
  • Stencil Model.
  • Testbed configuration.
  • Experiments.
  • Results.
  • Conclusion.

SLIDE 10

Experiments (Setup)

  • Output:
  • Cache misses (PAPI_L3_TCM).
  • Cache accesses (PAPI_L3_TCA).
  • Time.
  • GFLOPS.

Input vector (total configurations):

               Node 1    Node 2
  Threads      2         6
  Looping      2         2
  Size         3         3
  Chunk        4         4
  Scheduling   3         3
  Total        144       432

SLIDE 11

Algorithms

Naive
  • Triple nested loops coming from the three spatial dimensions.
Blocking
  • Dependencies between components are exploited to implement a space-time decomposition.
Skew
  • Decompose the stencil using both the space and the time directions, but in a specific order.

[Figure: tile decomposition (U0, U1, U2, U3) illustrating the size of the tile.]

SLIDE 12

Contents

  • Introduction.
  • Stencil Model.
  • Testbed configuration.
  • Experiments.
  • Results.
  • Tuning.
  • Conclusion.

SLIDE 13

Results (Algorithms)

[Figure: performance (Gflops, 2-8) and L3 cache misses (log scale, 1e+02-1e+08) on Node 1 and Node 2. Naive (magenta), Blocking (green) and Skew (cyan).]

SLIDE 14

Results (Scalability)

[Figure: Gflops scalability on Node 1 (up to 4 threads) and Node 2 (2-64 threads). Naive (magenta), Blocking (green) and Skew (cyan).]

SLIDE 15

Results (Code optimization)

[Figure: Gflops (0-10) for Naive, Blocking and Skew on Node 1 and Node 2. Parallelfor (red), Tasking (blue).]

SLIDE 16

Results (Problem size)

[Figure: Gflops (0-10) for problem sizes 128, 256 and 512. Node 1 - Tasking, Node 2 - Parallelfor. Naive (magenta), Blocking (green) and Skew (cyan).]

SLIDE 17

Results (Scheduling)

[Figure: Gflops (5-15) versus chunk size (32, 128, 256, 512) on Node 2 for Naive, Blocking and Skew. Parallelfor (red), Tasking (blue).]

SLIDE 18

Results (Scheduling)

[Figure: Gflops (5-15) versus chunk size (32, 128, 256, 512) on Node 2 for Naive, Blocking and Skew, by scheduling policy. Dynamic (green), Guided (orange) and Static (gray).]

SLIDE 19

Makespan (Naive parallelfor)
SLIDE 21

Makespan (Naive tasking)
SLIDE 23

Contents

  • Introduction.
  • Stencil Model.
  • Testbed configuration.
  • Experiments.
  • Results.
  • Conclusion.

SLIDE 24

Conclusion

  • Tasking achieves good performance when the algorithm does not use the cache intensively (Skew).
  • Chunk size and the scheduling algorithm (OpenMP) play an important role for the Naive algorithm and can help reach peak performance.
  • Fitting: cache misses can be approximated by a linear fit, whereas performance can be predicted by an exponential fit.
  • Future work: we intend to develop an auto-tuning approach to automate the choice of the input parameters.

SLIDE 25

Questions?

victor.martinez@inf.ufrgs.br

Thanks