Parallel Performance Optimization and Productivity
EU H2020 Centre of Excellence (CoE)
1 December 2018 – 30 November 2021
Grant Agreement No 824080
POP CoE A Centre of Excellence On
A team with
proven commitment in application to real academic and industrial use cases
Why?
Frequent lack of quantified understanding of actual behaviour
Not clear most productive direction of code refactoring
compute-intensive applications and productivity of the development efforts
What?
When? December 2018 – November 2021
How?
describing application and needs https://pop-coe.eu/request-service-form
qualifies and quantifies approaches to address them (recommendations)
effect of proposed optimisations
Note: Effort shared between our experts and customer!
SerE = max (CT / TT on ideal network)
CT = Computational time
TT = Total time
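The serialization efficiency definition can be illustrated with a short calculation. This is a minimal sketch, assuming a single total runtime TT measured on an ideal network and one computational time CT per process; the function name and all timing values are hypothetical and chosen only for illustration.

```python
def serialization_efficiency(computation_times, ideal_total_time):
    """SerE = max over processes of CT / TT, with TT taken on an ideal network."""
    return max(ct / ideal_total_time for ct in computation_times)


# Hypothetical computational times (seconds) for four processes.
ct_per_process = [9.0, 9.5, 8.0, 9.2]
# Hypothetical total time (seconds) of the same run on an ideal network.
tt_ideal = 10.0

ser_e = serialization_efficiency(ct_per_process, tt_ideal)
print(f"SerE = {ser_e:.2f}")
```

With these made-up timings the busiest process computes for 9.5 s out of 10 s, so SerE is 0.95.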
Metric                     2     4     8     16
Parallel Efficiency        0.98  0.94  0.90  0.85
Load Balance               0.99  0.97  0.91  0.92
Serialization efficiency   0.99  0.98  0.99  0.94
Transfer Efficiency        0.99  0.99  0.99  0.98
Computation Efficiency     1.00  0.96  0.87  0.70
Global efficiency          0.98  0.90  0.78  0.59
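The tabulated efficiencies can be cross-checked against each other. The sketch below assumes a multiplicative model that is not stated on the slide itself: Parallel Efficiency is taken as Load Balance times Serialization efficiency times Transfer Efficiency, and Global efficiency as Parallel Efficiency times Computation Efficiency. The numbers are copied from the table; the variable names are illustrative.

```python
# Efficiencies copied from the table; columns are the headings 2, 4, 8, 16.
columns = [2, 4, 8, 16]
parallel = [0.98, 0.94, 0.90, 0.85]
load_balance = [0.99, 0.97, 0.91, 0.92]
serialization = [0.99, 0.98, 0.99, 0.94]
transfer = [0.99, 0.99, 0.99, 0.98]
computation = [1.00, 0.96, 0.87, 0.70]
global_eff = [0.98, 0.90, 0.78, 0.59]

# Assumed relationships (not stated on the slide):
#   parallel ~ load_balance * serialization * transfer
#   global   ~ parallel * computation
parallel_gap = []
global_gap = []
for i, n in enumerate(columns):
    pe_product = load_balance[i] * serialization[i] * transfer[i]
    ge_product = parallel[i] * computation[i]
    parallel_gap.append(abs(pe_product - parallel[i]))
    global_gap.append(abs(ge_product - global_eff[i]))
    print(f"{n:>2}: LB*SerE*TE = {pe_product:.3f} (table {parallel[i]:.2f}), "
          f"PE*CompE = {ge_product:.3f} (table {global_eff[i]:.2f})")
```

Under this assumption the products agree with the tabulated values to within about 0.01, which is consistent with the table entries being rounded to two decimals.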
Metric                          2     4     8     16
IPC Scaling Efficiency          1.00  0.99  0.96  0.84
Instruction Scaling Efficiency  1.00  0.97  0.94  0.91
Core frequency efficiency       1.00  0.99  0.96  0.91
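These three factors can be compared with the Computation Efficiency row reported earlier in the deck. The sketch assumes, without confirmation from the slide, that Computation Efficiency is approximately the product of IPC scaling, instruction scaling and core frequency efficiency. Values are copied from the two tables; names are illustrative.

```python
# Columns are the headings 2, 4, 8, 16 used in both tables.
columns = [2, 4, 8, 16]
ipc_scaling = [1.00, 0.99, 0.96, 0.84]
instruction_scaling = [1.00, 0.97, 0.94, 0.91]
frequency = [1.00, 0.99, 0.96, 0.91]
# Computation Efficiency row from the earlier efficiency table.
computation_table = [1.00, 0.96, 0.87, 0.70]

# Assumed relationship (not stated on the slide):
#   computation ~ ipc_scaling * instruction_scaling * frequency
products = [a * b * c for a, b, c in zip(ipc_scaling, instruction_scaling, frequency)]
gaps = [abs(p - t) for p, t in zip(products, computation_table)]

for n, p, t in zip(columns, products, computation_table):
    print(f"{n:>2}: product = {p:.3f}, table = {t:.2f}")
```

Read this way, the drop in Computation Efficiency at 16 comes mostly from the IPC factor (0.84), with instruction count and core frequency each contributing a smaller loss (0.91).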
(if available at customer site)
behaviour
directions to refactor code
performance in specific production conditions
environment setup
provider
performance in production conditions
modifying environment setup
time allocation processes
Performance Audits and Plans
Proof-of-Concept
Area / Codes
Computational Fluid Dynamics
DROPS (RWTH Aachen), Nek5000 (PDC KTH), SOWFA (CENER), ParFlow (FZ-Juelich), FDS (COAC) & others
Electronic Structure Calculations
ADF, BAND, DFTB (SCM), Quantum ESPRESSO (Cineca), FHI-AIMS (University of Barcelona), SIESTA (BSC), ONETEP (University of Warwick)
Earth Sciences
NEMO (BULL), UKCA (University of Cambridge), SHEMAT-Suite (RWTH Aachen), GITM (Cefas) & others
Finite Element Analysis
Ateles, Musubi (University of Siegen) & others
Gyrokinetic Plasma Turbulence
GYSELA (CEA), GS2 (STFC)
Materials Modelling
VAMPIRE (University of York), GraGLeS2D (RWTH Aachen), DPM (University of Luxembourg), QUIP (University of Warwick), FIDIMAG (University of Southampton), GBmolDD (University of Durham), k-Wave (Brno University), EPW (University of Oxford) & others
Neural Networks
OpenNN (Artelnics)
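[Figure: breakdown of audited codes* by MPI, OpenMP, Accelerator and Others**. Extracted labels and values, in extraction order: MPI 60 56; OpenMP 12 11; Others** 8 1; Accelerator 3 4+4 1 1]
* Based on data collected for 161 POP Performance Audits
** MAGMA, Celery, TBB, GASPI, C++ threads, MATLAB PT, StarPU, GlobalArrays, Charm++, Fortran Coarray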
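[Figure: breakdown of audited codes* by language. Extracted labels and values, in extraction order: Fortran 59 31; C / C++ 47 2; Python 4 3; Other** 4 5 6]
* Based on data collected for 161 POP Performance Audits
** TCL, Matlab, Perl, Octave, Java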
[Figure: bar chart by area (Chemistry, Engineering, Earth Science, CFD, Energy, Other, Machine Learning, Health), with series "All" and "SMEs"; axis from 0% to 30%]
[Figure: chart with categories Academic, Research, Large company, SME; extracted values, in extraction order: 55%, 25%, 7%, 13%]
[Figure: bar chart with categories Communication issues, Computation issues, Load Balance; axis from 0% to 45%]
[Figure: bar chart for MPI, OpenMP and Hybrid MPI + OpenMP, with series Load Balance, Computation, Communication; axis from 0% to 120%]
Improvements Reductions
memory)
execution time
worked as expected
production runs
in complex and tissue-realistic media
exterior MPI processes with fewer grid cells took much longer than interior
requiring many more small and poorly-balanced parallel loops
reduced overall runtime by a factor of 2
www.k-wave.org
showing exterior MPI ranks (0,3) and interior MPI ranks (1,2)
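The imbalance between exterior and interior ranks can be expressed as a single load-balance figure. This is a minimal sketch, assuming load balance is defined as average computation time divided by maximum computation time (a definition not given on these slides); the per-rank timings are hypothetical and only mimic the reported pattern of exterior ranks 0 and 3 taking longer than interior ranks 1 and 2.

```python
def load_balance(computation_times):
    """Assumed definition: average computation time over maximum computation time."""
    average = sum(computation_times) / len(computation_times)
    return average / max(computation_times)


# Hypothetical per-rank computation times in seconds (not measured data).
# Ranks 0 and 3 are exterior, ranks 1 and 2 are interior.
ct_by_rank = {0: 12.0, 1: 6.0, 2: 6.0, 3: 12.0}

lb = load_balance(list(ct_by_rank.values()))
slowest = max(ct_by_rank.values())
slowest_ranks = sorted(r for r, t in ct_by_rank.items() if t == slowest)

print(f"Load balance = {lb:.2f}, slowest ranks = {slowest_ranks}")
```

A value of 1.0 would mean every rank computes for the same time; with these illustrative numbers the interior ranks sit idle for half of each step.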
sequential computational performance
application with specific recommended improvements
scenario and pressure model used
exchange during the work
to 56 seconds: 450-fold speed-up!
epw.org.uk
improvements, enabled EPW simulations to scale to previously impractical 1920 MPI ranks
ranks
What is the observed performance gain after implementing recommendations?
25%
25%
20% overall, 50% for the given module
50-75% (case dependent)
12%
Up to 62%, depending on the use case
6-47%, depending on the test case
15%
What are the main results?
[Figure: bar chart of answers; axis from 0% to 80%]
Only performance gain
Better scalability
Possibility to run on a slower platform (handling the same problem size)
Possibility to treat larger problems
Possibility to better exploit new architectures (mixing multi- and many-core servers)
Other (please specify)
How much effort was necessary?
[Figure: bar chart of answers; axis from 0% to 60%]
A few person-days
A few person-weeks
A few person-months
service
Performance Audits (73 customers)
and 70% plan to implement the suggested code modifications
Performance Plans (11 customers)
Proof-of-Concepts (8 customers)
* Based on data collected in 92 customer satisfaction questionnaires and 52 phone interviews with customers
Expected more interest from industry / SME / ISVs
public information and services
12-Dec-2018 42
per month
07-Feb-19 48
This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreements No 676553 and 824080.