
VIRTUAL INSTITUTE – HIGH PRODUCTIVITY SUPERCOMPUTING
Introduction to Parallel Application Performance Engineering
Brian Wylie
Jülich Supercomputing Centre
(with content used with permission from tutorials by Bernd Mohr/JSC and Luiz DeRose/Cray)

Performance: an old problem
“The most constant difficulty in contriving the engine has arisen from the desire to reduce the time in which the calculations were executed to the shortest which is possible.”
Charles Babbage, 1791 – 1871
[Figure: Difference Engine]
NEIC HPC & APPLICATIONS WORKSHOP: PARALLEL APPLICATION PERFORMANCE ENGINEERING (REYKJAVIK, ICELAND, 24 AUGUST 2017)

Today: the “free lunch” is over
■ Moore's law is still in charge, but
  ■ Clock rates no longer increase
  ■ Performance gains only through increased parallelism
■ Optimizations of applications more difficult
  ■ Increasing application complexity
    ■ Multi-physics
    ■ Multi-scale
  ■ Increasing machine complexity
    ■ Hierarchical networks / memory
    ■ More CPUs / multi-core
→ Every doubling of scale reveals a new bottleneck!

Performance factors of parallel applications
■ “Sequential” performance factors
  ■ Computation
    → Choose right algorithm, use optimizing compiler
  ■ Cache and memory
    → Tough! Only limited tool support, hope compiler gets it right
  ■ Input / output
    → Often not given enough attention
■ “Parallel” performance factors
  ■ Partitioning / decomposition
  ■ Communication (i.e., message passing)
  ■ Multithreading
  ■ Synchronization / locking
    → More or less understood, good tool support

Tuning basics
■ Successful engineering is a combination of
  ■ Careful setting of various tuning parameters
  ■ The right algorithms and libraries
  ■ Compiler flags and directives
  ■ …
  ■ Thinking !!!
■ Measurement is better than guessing
  ■ To determine performance bottlenecks
  ■ To compare alternatives
  ■ To validate tuning decisions and optimizations
    → After each step!

Performance engineering workflow
[Figure: workflow diagram linking four phases: Preparation, Measurement, Analysis, Optimization]
Preparation
  • Prepare application with symbols
  • Insert extra code (probes/hooks)
Measurement
  • Collection of performance data
  • Aggregation of performance data
Analysis
  • Calculation of metrics
  • Identification of performance problems
  • Presentation of results
Optimization
  • Modifications intended to eliminate/reduce performance problem

The 80/20 rule
■ Programs typically spend 80% of their time in 20% of the code
■ Programmers typically spend 20% of their effort to get 80% of the total speedup possible for the application
  → Know when to stop!
■ Don't optimize what does not matter
  → Make the common case fast!
“If you optimize everything, you will always be unhappy.”
Donald E. Knuth

Metrics of performance
■ What can be measured?
  ■ A count of how often an event occurs
    ■ E.g., the number of MPI point-to-point messages sent
  ■ The duration of some interval
    ■ E.g., the time spent in these send calls
  ■ The size of some parameter
    ■ E.g., the number of bytes transmitted by these calls
■ Derived metrics
  ■ E.g., rates / throughput
  ■ Needed for normalization
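The three basic kinds of metric (count, duration, size) and a derived rate can be illustrated with a small accumulator. This is only a sketch: the `send_stats` structure and the functions below are hypothetical names, not part of MPI or of any measurement tool, and the caller supplies the timings rather than measuring them.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-process statistics for point-to-point send calls. */
typedef struct {
    unsigned long count;  /* count: how often the event occurred */
    double seconds;       /* duration: total time spent in the calls */
    size_t bytes;         /* size: total number of bytes transmitted */
} send_stats;

/* Record one send call that moved `bytes` bytes and took `seconds` seconds. */
void record_send(send_stats *s, size_t bytes, double seconds)
{
    assert(s != NULL && seconds >= 0.0);
    s->count += 1;
    s->seconds += seconds;
    s->bytes += bytes;
}

/* Derived metric: throughput in bytes per second (0 if no time was recorded). */
double send_throughput(const send_stats *s)
{
    return s->seconds > 0.0 ? (double)s->bytes / s->seconds : 0.0;
}

/* Derived metric: mean message size, i.e. bytes normalized by the call count. */
double mean_message_size(const send_stats *s)
{
    return s->count > 0 ? (double)s->bytes / (double)s->count : 0.0;
}
```

Normalizing by count or by time is what makes runs with different message volumes or durations comparable.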

Example metrics
■ Execution time
■ Number of function calls
■ CPI
  ■ CPU cycles per instruction
■ FLOPS
  ■ Floating-point operations executed per second
  ■ “math” operations? HW operations? HW instructions? 32-/64-bit? …
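As a minimal sketch, CPI and FLOPS are both simple ratios once the raw counts are available. The function names are hypothetical. How the counts are obtained, and which operations count as "floating-point operations", is left open here, just as the slide's questions suggest.

```c
#include <assert.h>

/* CPI: CPU cycles per instruction, computed from two counter readings. */
double cycles_per_instruction(unsigned long long cycles,
                              unsigned long long instructions)
{
    assert(instructions > 0);
    return (double)cycles / (double)instructions;
}

/* FLOPS: floating-point operations executed per second of elapsed time. */
double flops(unsigned long long fp_operations, double seconds)
{
    assert(seconds > 0.0);
    return (double)fp_operations / seconds;
}
```

For example, 3000 cycles for 2000 instructions gives a CPI of 1.5.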

Execution time
■ Wall-clock time
  ■ Includes waiting time: I/O, memory, other system activities
  ■ In time-sharing environments also the time consumed by other applications
■ CPU time
  ■ Time spent by the CPU to execute the application
  ■ Does not include time the program was context-switched out
  ■ Problem: Does not include inherent waiting time (e.g., I/O)
  ■ Problem: Portability? What is user, what is system time?
■ Problem: Execution time is non-deterministic
  ■ Use mean or minimum of several runs
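The two notions of execution time can be read with standard calls, sketched below under the assumption of a POSIX system: `clock_gettime(CLOCK_MONOTONIC, ...)` for wall-clock time and ISO C `clock()` for CPU time. The helper names (`wall_seconds`, `cpu_seconds`, `min_of`, `mean_of`) are hypothetical.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* exposes clock_gettime under strict -std=c11 */
#endif
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Wall-clock time in seconds, read from a monotonic clock. */
double wall_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* CPU time in seconds consumed by this process so far. */
double cpu_seconds(void)
{
    return (double)clock() / (double)CLOCKS_PER_SEC;
}

/* Minimum of n run times: least disturbed by other system activity. */
double min_of(const double *t, size_t n)
{
    assert(t != NULL && n > 0);
    double m = t[0];
    for (size_t i = 1; i < n; i++)
        if (t[i] < m)
            m = t[i];
    return m;
}

/* Mean of n run times: reflects the typical run, including disturbances. */
double mean_of(const double *t, size_t n)
{
    assert(t != NULL && n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += t[i];
    return sum / (double)n;
}
```

A run would be timed by taking the difference of two `wall_seconds()` readings around it, repeating the run several times, and reporting `min_of` or `mean_of` over the results.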

Inclusive vs. Exclusive values
■ Inclusive
  ■ Information of all sub-elements aggregated into single value
■ Exclusive
  ■ Information cannot be subdivided further

    int foo()
    {
      int a;
      a = 1 + 1;
      bar();
      a = a + 1;
      return a;
    }

[Figure: the code is annotated with “Inclusive” and “Exclusive” extents]
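One common way to relate the two values for time is to subtract the inclusive time of each direct callee from the caller's inclusive time. The sketch below assumes a hypothetical `call_node` structure that holds already-measured inclusive times. It is not the data model of any particular tool.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CHILDREN 4

/* Hypothetical call-tree node holding a measured inclusive time. */
typedef struct call_node {
    const char *name;
    double inclusive;  /* time in this routine and everything it calls */
    const struct call_node *children[MAX_CHILDREN];
    size_t num_children;
} call_node;

/* Exclusive time: inclusive time minus the inclusive time of each direct callee. */
double exclusive_time(const call_node *n)
{
    assert(n != NULL && n->num_children <= MAX_CHILDREN);
    double t = n->inclusive;
    for (size_t i = 0; i < n->num_children; i++)
        t -= n->children[i]->inclusive;
    return t;
}
```

If `foo` has an inclusive time of 5.0 and its only callee `bar` has 3.0, the exclusive time of `foo` is 2.0, which is the time spent in the statements of `foo` itself.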

Classification of measurement techniques
■ How are performance measurements triggered?
  ■ Sampling
  ■ Code instrumentation
■ How is performance data recorded?
  ■ Profiling / Runtime summarization
  ■ Tracing
■ How is performance data analyzed?
  ■ Online
  ■ Post mortem

Sampling
[Figure: timeline with sample points t1 … t9 taken while main, foo(0), foo(1) and foo(2) execute; labels: Time, Measurement]

    int main()
    {
      int i;
      for (i=0; i < 3; i++)
        foo(i);
      return 0;
    }

    void foo(int i)
    {
      if (i > 0)
        foo(i - 1);
    }

■ Running program is periodically interrupted to take measurement
  ■ Timer interrupt, OS signal, or HWC overflow
  ■ Service routine examines return-address stack
  ■ Addresses are mapped to routines using symbol table information
■ Statistical inference of program behavior
  ■ Not very detailed information on highly volatile metrics
  ■ Requires long-running applications
■ Works with unmodified executables
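A timer-driven sampler can be sketched with the POSIX calls `sigaction` and `setitimer(ITIMER_PROF, ...)`. As a loudly stated simplification, the service routine below does not examine the return-address stack. Instead it reads a hypothetical `current_region` marker that the program sets itself, so this sketch needs a small source change that a real sampling tool would not. All names other than the POSIX calls are assumptions made for illustration.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700  /* exposes sigaction and setitimer under strict -std=c11 */
#endif
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/time.h>
#include <time.h>

enum { REGION_MAIN, REGION_FOO, REGION_COUNT };

/* Simplification: a marker set by the program replaces stack inspection. */
volatile sig_atomic_t current_region = REGION_MAIN;
volatile sig_atomic_t samples[REGION_COUNT];

/* Service routine: attribute one sample to whatever region is running. */
void on_sample(int signo)
{
    (void)signo;
    samples[current_region]++;
}

/* Interrupt the program every `interval_usec` microseconds of CPU time. */
int start_sampling(long interval_usec)
{
    assert(interval_usec > 0 && interval_usec < 1000000);
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sample;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    if (sigaction(SIGPROF, &sa, NULL) != 0)
        return -1;
    struct itimerval tv;
    tv.it_interval.tv_sec = 0;
    tv.it_interval.tv_usec = interval_usec;
    tv.it_value = tv.it_interval;
    return setitimer(ITIMER_PROF, &tv, NULL);
}

/* Disarm the profiling timer. */
int stop_sampling(void)
{
    struct itimerval off;
    memset(&off, 0, sizeof off);
    return setitimer(ITIMER_PROF, &off, NULL);
}

/* Burn roughly `seconds` of CPU time so that there is something to sample. */
void busy_for(double seconds)
{
    clock_t end = clock() + (clock_t)(seconds * CLOCKS_PER_SEC);
    while (clock() < end) { /* spin */ }
}

/* A routine of interest; samples taken while it runs are attributed to it. */
void foo(double seconds)
{
    sig_atomic_t saved = current_region;
    current_region = REGION_FOO;
    busy_for(seconds);
    current_region = saved;
}
```

The sample counts are only a statistical estimate of where time goes, which is why the slide asks for long-running applications: a routine shorter than the sampling interval may never be hit.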

Instrumentation
[Figure: timeline with events t1 … t14 recorded while main, foo(0), foo(1) and foo(2) execute; labels: Time, Measurement]

    int main()
    {
      int i;
      Enter("main");
      for (i=0; i < 3; i++)
        foo(i);
      Leave("main");
      return 0;
    }

    void foo(int i)
    {
      Enter("foo");
      if (i > 0)
        foo(i - 1);
      Leave("foo");
    }

■ Measurement code is inserted such that every event of interest is captured directly
  ■ Can be done in various ways
■ Advantage:
  ■ Much more detailed information
■ Disadvantage:
  ■ Processing of source-code / executable necessary
  ■ Large relative overheads for small functions
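The slide leaves `Enter` and `Leave` undefined. One possible sketch is shown below, assuming they simply append events to an in-memory trace. The trace layout, the `visits` summary and the name `instrumented_main` (the body of the slide's `main`, renamed so it can be called from other code) are all hypothetical. A real measurement system would also record a time stamp with each event.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_EVENTS 64

typedef enum { EV_ENTER, EV_LEAVE } event_kind;

typedef struct {
    event_kind kind;
    const char *region;
} event;

event trace[MAX_EVENTS];  /* tracing: every event is kept, in order */
size_t num_events = 0;

/* Append one event to the trace buffer. */
static void record(event_kind kind, const char *region)
{
    assert(num_events < MAX_EVENTS);
    trace[num_events].kind = kind;
    trace[num_events].region = region;
    num_events++;
}

void Enter(const char *region) { record(EV_ENTER, region); }
void Leave(const char *region) { record(EV_LEAVE, region); }

/* Instrumented routine from the slide. */
void foo(int i)
{
    Enter("foo");
    if (i > 0)
        foo(i - 1);
    Leave("foo");
}

/* Body of the slide's main(), renamed for this sketch. */
void instrumented_main(void)
{
    int i;
    Enter("main");
    for (i = 0; i < 3; i++)
        foo(i);
    Leave("main");
}

/* Profile-style summary: number of Enter events recorded for a region. */
size_t visits(const char *region)
{
    size_t n = 0;
    for (size_t k = 0; k < num_events; k++)
        if (trace[k].kind == EV_ENTER && strcmp(trace[k].region, region) == 0)
            n++;
    return n;
}
```

Running `instrumented_main()` records 14 events (one Enter and one Leave for `main`, and six of each for `foo`), matching the fourteen time stamps on the slide's timeline. Each call to `foo` does almost no work of its own, so the two recording calls dominate its cost, which is the "large relative overhead for small functions" the slide warns about.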

Instrumentation techniques
■ Static instrumentation
  ■ Program is instrumented prior to execution
■ Dynamic instrumentation
  ■ Program is instrumented at runtime
■ Code is inserted
  ■ Manually
  ■ Automatically
    ■ By a preprocessor / source-to-source translation tool
    ■ By a compiler
    ■ By linking against a pre-instrumented library / runtime system
    ■ By binary-rewrite / dynamic instrumentation tool
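As one example of automatic insertion by a compiler, GCC-compatible compilers call the two hooks below around every function when the program is built with `-finstrument-functions`. This is a sketch of that one mechanism only, not of what any particular performance tool does. The counters and the `open_calls` helper are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

unsigned long enter_count = 0;
unsigned long exit_count = 0;

/* The hooks must not be instrumented themselves, or they would recurse. */
void __cyg_profile_func_enter(void *this_fn, void *call_site)
    __attribute__((no_instrument_function));
void __cyg_profile_func_exit(void *this_fn, void *call_site)
    __attribute__((no_instrument_function));
unsigned long open_calls(void) __attribute__((no_instrument_function));

/* Called on entry to every instrumented function. */
void __cyg_profile_func_enter(void *this_fn, void *call_site)
{
    (void)this_fn;    /* address of the function being entered */
    (void)call_site;  /* address it was called from */
    enter_count++;
}

/* Called on exit from every instrumented function. */
void __cyg_profile_func_exit(void *this_fn, void *call_site)
{
    (void)this_fn;
    (void)call_site;
    exit_count++;
}

/* Number of instrumented calls that have been entered but not yet left. */
unsigned long open_calls(void)
{
    assert(enter_count >= exit_count);
    return enter_count - exit_count;
}
```

Without the compiler flag the hooks are ordinary functions that nothing calls. With it, no source changes to the application routines are needed, which is the difference from manual insertion.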

Critical issues
■ Accuracy
  ■ Intrusion overhead
    ■ Measurement itself needs time and thus lowers performance
  ■ Perturbation
    ■ Measurement alters program behaviour
    ■ E.g., memory access pattern
  ■ Accuracy of timers & counters
■ Granularity
  ■ How many measurements?
  ■ How much information / processing during each measurement?
→ Tradeoff: Accuracy vs. Expressiveness of data
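Intrusion overhead can be estimated roughly by timing the timer itself. The sketch below assumes a POSIX system with `clock_gettime`, and assumes that each measured routine is bracketed by exactly two timer reads. The function names are hypothetical, and the estimate ignores perturbation effects such as changed memory access patterns.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* exposes clock_gettime under strict -std=c11 */
#endif
#include <assert.h>
#include <time.h>

/* Current wall-clock time in seconds from a monotonic clock. */
static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Estimate the cost of one timer read by timing `calls` back-to-back reads. */
double timer_overhead_seconds(long calls)
{
    assert(calls > 0);
    double start = now_seconds();
    for (long i = 0; i < calls; i++)
        (void)now_seconds();
    double stop = now_seconds();
    return (stop - start) / (double)calls;
}

/* Overhead relative to the routine's own run time, assuming two timer
 * reads (one on entry, one on exit) per measured call. */
double relative_overhead(double routine_seconds, double per_read_seconds)
{
    assert(routine_seconds > 0.0 && per_read_seconds >= 0.0);
    return 2.0 * per_read_seconds / routine_seconds;
}
```

With a hypothetical read cost of 0.1 microseconds, a routine that runs for 1 microsecond would carry about 20% overhead, while a routine that runs for 1 millisecond would carry about 0.02%.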
