Performance Optimisation and Productivity: EU H2020 Centre of Excellence (CoE), 1 October 2015 – 31 March 2018


slide-1
SLIDE 1

Performance Optimisation and Productivity
EU H2020 Centre of Excellence (CoE)
1 October 2015 – 31 March 2018
Grant Agreement No 676553

slide-2
SLIDE 2

POP CoE

  • A Centre of Excellence
  • On Performance Optimisation and Productivity
  • Promoting best practices in parallel programming
  • Providing Services
  • Precise understanding of application and system behaviour
  • Suggestion/support on how to refactor code in the most productive way

  • Horizontal
  • Transversal across application areas, platforms, scales
  • For (your?) academic AND industrial codes and users!


slide-3
SLIDE 3
  • Who?
  • BSC (coordinator), ES
  • HLRS, DE
  • JSC, DE
  • NAG, UK
  • RWTH Aachen, IT Center, DE
  • TERATEC, FR

A team with

  • Excellence in performance tools and tuning
  • Excellence in programming models and practices
  • Research and development background AND proven commitment in application to real academic and industrial use cases


Partners

slide-4
SLIDE 4

Why?

  • Complexity of machines and codes
  • Frequent lack of quantified understanding of actual behaviour
  • Not clear which direction of code refactoring is most productive
  • Important to maximize efficiency (performance, power) of compute-intensive applications and the productivity of the development effort

What?

  • Parallel programs, mainly MPI/OpenMP
  • Although also CUDA, OpenCL, OpenACC, Python, …


Motivation

slide-5
SLIDE 5

The process …

When? October 2015 – March 2018

How?

  • Apply
  • Fill in a small questionnaire describing your application and needs: https://pop-coe.eu/request-service-form

  • Questions? Ask pop@bsc.es
  • Selection/assignment process
  • Install tools @ your production machine (local, PRACE, …)
  • Interactively: gather data → analysis → report


slide-6
SLIDE 6

? Parallel Application Performance Audit → Report

  • Primary service
  • Identify performance issues of customer code (at customer site)
  • Small effort (< 1 month)

! Parallel Application Performance Plan → Report

  • Follow-up on the audit service
  • Identifies the root causes of the issues found and qualifies and quantifies approaches to address them

  • Longer effort (1-3 months)

Proof-of-Concept → Software Demonstrator

  • Experiments and mock-up tests for customer codes
  • Kernel extraction, parallelisation, and mini-app experiments to show the effect of proposed optimisations

  • 6 months effort

Services provided by the CoE

slide-7
SLIDE 7
  • Application Structure
  • (if appropriate) Region of Interest
  • Scalability Information
  • Application Efficiency
  • E.g. time spent outside MPI
  • Load Balance
  • Whether due to internal or external factors
  • Serial Performance
  • Identification of poor code quality
  • Communications
  • E.g. sensitivity to network performance
  • Summary and Recommendations


Outline of a Typical Audit Report

slide-8
SLIDE 8
  • The following metrics are used in a POP Performance Audit:
  • Global Efficiency (GE): GE = PE * CompE
  • Parallel Efficiency (PE): PE = LB * CommE
  • Load Balance Efficiency (LB): LB = avg(CT)/max(CT)
  • Communication Efficiency (CommE): CommE = SerE * TE
  • Serialization Efficiency (SerE): SerE = max(CT) / TT on ideal network
  • Transfer Efficiency (TE): TE = TT on ideal network / TT
  • Computation Efficiency (CompE)
  • Computed out of IPC Scaling and Instruction Scaling
  • For strong scaling: ideal scaling -> efficiency of 1.0
  • Details see https://sharepoint.ecampus.rwth-aachen.de/units/rz/HPC/public/Shared%20Documents/Metrics.pdf

Efficiencies

CT = computational time, TT = total time
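The metric hierarchy above can be sketched numerically. The per-process computation times and runtimes below are invented for illustration (CompE is omitted here because it requires IPC and instruction-scaling data not given on the slide):

```python
# Sketch of the POP efficiency hierarchy, assuming hypothetical
# per-MPI-process computation times (ct, seconds) and total runtimes
# measured on the real network and replayed on an ideal network
# (the latter is what a Dimemas simulation would provide).
ct = [9.0, 8.0, 7.5, 7.0]  # computation time per MPI process (hypothetical)
tt_real = 12.0             # total runtime on the real network
tt_ideal = 10.0            # total runtime on an ideal (zero-latency) network

lb = sum(ct) / len(ct) / max(ct)  # Load Balance: avg(CT) / max(CT)
ser_e = max(ct) / tt_ideal        # Serialization Efficiency
te = tt_ideal / tt_real           # Transfer Efficiency
comm_e = ser_e * te               # Communication Efficiency
pe = lb * comm_e                  # Parallel Efficiency

print(f"LB={lb:.3f} SerE={ser_e:.3f} TE={te:.3f} CommE={comm_e:.3f} PE={pe:.3f}")
```

With these invented numbers the load balance comes out at 0.875 and the parallel efficiency at about 0.66, i.e. a third of the parallel runtime is lost to imbalance and communication.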

slide-9
SLIDE 9

Target customers

  • Code developers
  • Assessment of detailed actual behaviour
  • Suggestion of most productive directions to refactor code
  • Users
  • Assessment of achieved performance in specific production conditions
  • Possible improvements from modifying the environment setup
  • Evidence to interact with the code provider
  • Infrastructure operators
  • Assessment of achieved performance in production conditions
  • Possible improvements from modifying the environment setup
  • Information for computer time allocation processes
  • Training of support staff
  • Vendors
  • Benchmarking
  • Customer support
  • System dimensioning/design

slide-10
SLIDE 10

Area / Codes:

  • Computational Fluid Dynamics: DROPS (RWTH Aachen), Nek5000 (PDC KTH), SOWFA (CENER), ParFlow (FZ-Juelich), FDS (COAC) & others
  • Electronic Structure Calculations: ADF (SCM), Quantum Espresso (Cineca), FHI-AIMS (University of Barcelona), SIESTA (BSC), ONETEP (University of Warwick)
  • Earth Sciences: NEMO (BULL), UKCA (University of Cambridge), SHEMAT-Suite (RWTH Aachen) & others
  • Finite Element Analysis: Ateles (University of Siegen) & others
  • Gyrokinetic Plasma Turbulence: GYSELA (CEA), GS2 (STFC)
  • Materials Modelling: VAMPIRE (University of York), GraGLeS2D (RWTH Aachen), DPM (University of Luxembourg), QUIP (University of Warwick) & others
  • Neural Networks: OpenNN (Artelnics)

POP Users and Their Codes

slide-11
SLIDE 11

Customer Feedback (Sep 2016)


  • Results from 18 of 23 completed feedback surveys (~78%)
  • How responsive have the POP experts been to your questions or concerns about the analysis and the report?

  • What was the quality of their answers?
slide-12
SLIDE 12
  • Powerful tools …
  • Extrae + Paraver
  • Score-P + Scalasca/TAU/Vampir + Cube
  • Dimemas, Extra-P
  • Other commercial tools
  • … and techniques
  • Clustering, modeling, projection, extrapolation, memory access patterns, …

  • … with extreme detail …
  • … and up to extreme scale
  • Unify methodologies
  • Structure
  • Spatio temporal / syntactic
  • Metrics
  • Parallel fundamental factors: efficiency, load balance, serialization

  • Programming model related metrics
  • User-level code sequential performance

  • Hierarchical search
  • From high-level fundamental behavior to its causes

  • To deliver insight
  • To estimate potentials

Best Practices in Performance Analysis

slide-13
SLIDE 13

Performance Tools

slide-14
SLIDE 14

Tools

  • Install and use already available monitoring and analysis technology
  • Analysis and predictive capabilities
  • Delivering insight
  • With extreme detail
  • Up to extreme scale
  • Open-source toolsets
  • Extrae + Paraver
  • Score-P + Cube + Scalasca/TAU/Vampir
  • Dimemas, Extra-P
  • SimGrid

  • Commercial toolsets (if available at customer site)

  • Intel tools
  • Cray tools
  • Allinea tools
slide-15
SLIDE 15

Tool Ecosystem – Overview

[Diagram: the instrumented target application is measured by Score-P (with PAPI hardware counters), producing CUBE4 reports and OTF2 traces; these feed the Scalasca wait-state analysis, TAU (ParaProf, PerfExplorer), the Periscope online interface, the CUBE browser, and Vampir, with remote guidance on top.]

slide-16
SLIDE 16
  • Score-P (www.score-p.org)
  • Parallel Program Instrumentation and Profile/Trace Measurement
  • MPI, OpenMP, SHMEM, CUDA, OpenCL, OmpSs support
  • Latest version: 3.0
  • New: User function sampling + MPI measurement, OpenACC support
  • Scalasca (www.scalasca.org)
  • Scalable Profile and Trace analysis
  • Latest version: 2.3.1
  • New: More platforms (Xeon Phi, K computer, ARM64, …), Score-P 2.X and 3.x support
  • Cube (www.scalasca.org)
  • Profile browser
  • Latest version: 4.3.4
  • Soon: Client/server architecture, more analysis plugins, performance improvements

Tool Ecosystem – Status
slide-17
SLIDE 17

BSC Performance Tools (www.bsc.es/paraver)

  • Instantaneous metrics for ALL hardware counters at "no" cost
  • Adaptive burst mode tracing (example: 26.7 MB trace, 1600 cores, 2.5 s; Eff: 0.43, LB: 0.52, Comm: 0.81)
  • Tracking performance evolution
  • Flexible trace visualization and analysis
  • Advanced clustering algorithms

[Screenshots: BSC-ES EC-EARTH and AMG2013 traces]

slide-18
SLIDE 18

BSC Performance Tools (www.bsc.es/paraver)

What if … we increase the IPC of Cluster 1? … we balance Clusters 1 & 2?

slide-19
SLIDE 19

BSC Performance Tools (www.bsc.es/paraver)

[Diagram: traces at several core counts are processed by eff_factors.py, extrapolation.py and Dimemas (no MPI noise + no OS noise) into eff.csv; see "Scalability prediction for fundamental performance factors", J. Labarta et al., SuperFRI 2014]

Models and projection, data access patterns, Tareador (Intel–BSC Exascale Lab)

slide-20
SLIDE 20

Code Audit Examples

slide-21
SLIDE 21
  • Numerical simulation tool for studying the motion and chemical conversion of particulate material in furnaces

  • C++ code parallelised with MPI


DPM – University of Luxembourg

  • Key audit results:
  • Performance problems were due to the way the code had been parallelised
  • Scalability limited by end-point contention due to sending MPI messages in increasing-rank order
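The end-point contention pattern flagged in this audit can be illustrated without MPI. In the toy model below (not the actual DPM code; rank count and self-sends-elided simplification are assumptions), visiting destinations in increasing-rank order makes every sender hit the same receiver in the same round, while offsetting each sender's start by its own rank spreads every round across distinct receivers:

```python
# Toy illustration of MPI end-point contention (hypothetical, not DPM code).
P = 8  # assumed number of MPI ranks

def naive_round(step):
    # Increasing-rank order: at each step, every sender targets rank `step`,
    # so that one receiver becomes a bottleneck (self-sends ignored here).
    return [step for sender in range(P)]

def staggered_round(step):
    # Each sender starts at (rank + 1) and wraps around, so in any given
    # round all P senders target P distinct receivers.
    return [(sender + 1 + step) % P for sender in range(P)]

print("step 0, naive:    ", naive_round(0))      # all senders target rank 0
print("step 0, staggered:", staggered_round(0))  # P distinct receivers
```

The staggered schedule is the classic fix for this kind of contention; the audit's point is that the message *ordering*, not the message *volume*, limited scalability.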

slide-22
SLIDE 22
  • An integrated suite of codes for nanoscale electronic structure calculations and materials modelling

  • Very widely used
  • Fortran code with hybrid MPI+OpenMP
  • Key audit result:
  • For a significant portion of time only 1 out of 5 OpenMP threads per MPI process does useful computation (1.77x speedup over 1 thread)

Quantum Espresso – Cineca/MaX CoE
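The audit figure above translates directly into a thread efficiency. A quick back-of-envelope check, using only the numbers from the slide:

```python
# Thread efficiency implied by the Quantum Espresso audit figures:
# 5 OpenMP threads per MPI process, but only 1.77x speedup over 1 thread.
threads = 5
speedup = 1.77                     # measured speedup over a single thread
efficiency = speedup / threads     # fraction of ideal 5x scaling achieved
print(f"OpenMP thread efficiency: {efficiency:.2f}")  # ~0.35
```

In other words, roughly two thirds of the thread resources are idle, which matches the observation that mostly one thread per process does useful computation.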

slide-23
SLIDE 23
  • Magnetic materials simulation code
  • C++ code parallelised with MPI
  • Key audit results:
  • Best enhancements would be to vectorise main loops, improve cache reuse, and replace multiple calls to the random number generator with a single call that returns a vector of numbers
  • Initial implementation of these points by the user suggests they could lead to a 2x speedup

VAMPIRE – University of York
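The RNG recommendation above can be sketched in a few lines. This is an illustrative NumPy example, not VAMPIRE's actual (C++) generator; the array size is an arbitrary assumption:

```python
# Illustration of the RNG change recommended in the VAMPIRE audit:
# replace many per-element RNG calls with one call returning a vector.
import timeit
import numpy as np

rng = np.random.default_rng(seed=42)
n = 100_000  # hypothetical number of random values needed per step

def per_call():
    # One generator call per element: per-call overhead dominates.
    return [rng.random() for _ in range(n)]

def vectorised():
    # A single call that returns a vector of n numbers.
    return rng.random(n)

t_loop = timeit.timeit(per_call, number=1)
t_vec = timeit.timeit(vectorised, number=1)
print(f"per-call: {t_loop:.4f}s  vectorised: {t_vec:.4f}s")
```

The same batching idea applies to any per-iteration call with fixed overhead, which is why the audit singles it out alongside vectorisation and cache reuse.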

slide-24
SLIDE 24
  • 5D gyrokinetic code for studying flux-driven plasma turbulence in tokamaks

  • Fortran code with hybrid MPI+OpenMP
  • Key audit results:
  • Not fully utilising OpenMP threads: idle for 17.24% of execution time (only 1.4% due to MPI)

  • Imbalance due to unequal distribution of threads on nodes

GYSELA – CEA

slide-25
SLIDE 25

Proof-of-Concept Examples

slide-26
SLIDE 26
  • Simulates grain growth phenomena in polycrystalline materials
  • C++ parallelized with OpenMP
  • Designed for very large SMP machines (e.g. 16 sockets and 2 TB memory)

  • Key audit results:
  • Good load balance
  • Costly use of division and square root inside loops
  • Not fully utilising vectorisation in key loops
  • NUMA-specific data-sharing issues lead to long memory access times

GraGLeS2D – RWTH Aachen

slide-27
SLIDE 27
  • Improvements:
  • Restructured code to enable vectorisation
  • Used memory allocation library optimised for NUMA machines
  • Reordered work distribution to optimise for data locality

GraGLeS2D – RWTH Aachen

  • Speed up in region of interest is more than 10x
  • Overall application speed up is 2.5x
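The "costly division and square root inside loops" finding from the audit is a classic strength-reduction case. A minimal sketch, assuming a hypothetical normalisation loop (the real GraGLeS2D code is C++; names here are invented):

```python
# Hoisting a loop-invariant division/sqrt out of a loop: the reciprocal is
# computed once, and the loop body becomes a cheap multiplication, which
# also makes it easier for a compiler to vectorise.
import math

def normalise_naive(values, scale):
    # sqrt and division repeated on every iteration
    return [v / math.sqrt(scale) for v in values]

def normalise_hoisted(values, scale):
    inv = 1.0 / math.sqrt(scale)   # computed once, outside the loop
    return [v * inv for v in values]

data = [1.0, 4.0, 9.0]
print(normalise_hoisted(data, 4.0))  # [0.5, 2.0, 4.5]
```

Combined with NUMA-aware allocation and locality-ordered work distribution, this style of loop-body cleanup is what produced the >10x region-of-interest speedup reported above.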
slide-28
SLIDE 28
  • Finite element code
  • C and Fortran code with hybrid MPI+OpenMP parallelisation
  • Key audit results:
  • High number of function calls
  • Costly divisions inside inner loops
  • Poor load balance

Ateles – University of Siegen

  • Performance plan:
  • Improve function inlining
  • Improve vectorisation
  • Reduce duplicate computation
slide-29
SLIDE 29
  • Inlined key functions → 6% reduction in execution time
  • Improved mathematical operations in loops → 28% reduction in execution time
  • Vectorisation: found a bug in the GNU compiler; confirmed the Intel compiler worked as expected

  • 6 weeks software engineering effort
  • Customer has confirmed "substantial" performance increase on production runs

Ateles – Proof-of-Concept
slide-30
SLIDE 30
  • If you have the feeling you are not getting the performance you expected
  • If you are not sure whether it is a problem of your application, the system, …
  • If you want an external view and recommendations on suggested refactoring efforts

  • If you would like some help on how to best restructure your code

POP Coordination

  • Prof. Jesus Labarta, Judit Gimenez

Barcelona Supercomputing Center (BSC)
Email: pop@bsc.es
URL: http://www.pop-coe.eu

Contact us!!

slide-31
SLIDE 31

Other activities

  • Customer advocacy
  • Gather customer feedback, ensure satisfaction, steer activities
  • Sustainability
  • Explore business models
  • Training
  • Best practices on the use of the tools and programming models (MPI + OpenMP)


slide-32
SLIDE 32

29-Sep-16

Contact: https://www.pop-coe.eu, mailto:pop@bsc.es

This project has received funding from the European Union‘s Horizon 2020 research and innovation programme under grant agreement No 676553.

Performance Optimisation and Productivity

A Centre of Excellence in Computing Applications