VIRTUAL INSTITUTE – HIGH PRODUCTIVITY SUPERCOMPUTING
Introduction to Parallel Performance Engineering
Bert Wesarg, Technische Universität Dresden
(with content used with permission from tutorials by Bernd Mohr/JSC and Luiz DeRose/Cray)

Performance: an old problem
"The most constant difficulty in contriving the engine has arisen from the desire to reduce the time in which the calculations were executed to the shortest which is possible."
Charles Babbage, 1791–1871
[Figure: Difference Engine]
PERFORMANCE ENGINEERING WITH SCORE-P AND VAMPIR, PASSAU, SEPTEMBER 15, 2015

Today: the "free lunch" is over
■ Moore's law is still in charge, but
  ■ Clock rates no longer increase
  ■ Performance gains only through increased parallelism
■ Optimization of applications is more difficult
  ■ Increasing application complexity
    ■ Multi-physics
    ■ Multi-scale
  ■ Increasing machine complexity
    ■ Hierarchical networks / memory
    ■ More CPUs / multi-core
→ Every doubling of scale reveals a new bottleneck!

Performance factors of parallel applications
■ "Sequential" performance factors
  ■ Computation
    → Choose the right algorithm, use an optimizing compiler
  ■ Cache and memory
    → Tough! Only limited tool support, hope the compiler gets it right
  ■ Input / output
    → Often not given enough attention
■ "Parallel" performance factors
  ■ Partitioning / decomposition
  ■ Communication (i.e., message passing)
  ■ Multithreading
  ■ Synchronization / locking
  → More or less understood, good tool support

Tuning basics
■ Successful engineering is a combination of
  ■ Careful setting of various tuning parameters
  ■ The right algorithms and libraries
  ■ Compiler flags and directives
  ■ …
  ■ Thinking !!!
■ Measurement is better than guessing
  ■ To determine performance bottlenecks
  ■ To compare alternatives
  ■ To validate tuning decisions and optimizations
    → After each step!

Performance engineering workflow
[Figure: cycle of four phases: Preparation, Measurement, Analysis, Optimization]
■ Preparation
  ■ Prepare application with symbols
  ■ Insert extra code (probes / hooks)
■ Measurement
  ■ Collection of performance data
  ■ Aggregation of performance data
■ Analysis
  ■ Calculation of metrics
  ■ Identification of performance problems
  ■ Presentation of results
■ Optimization
  ■ Modifications intended to eliminate / reduce performance problem

The 80/20 rule
■ Programs typically spend 80% of their time in 20% of the code
■ Programmers typically spend 20% of their effort to get 80% of the total speedup possible for the application
  → Know when to stop!
■ Don't optimize what does not matter
  → Make the common case fast!
"If you optimize everything, you will always be unhappy." Donald E. Knuth

Metrics of performance
■ What can be measured?
  ■ A count of how often an event occurs
    ■ E.g., the number of MPI point-to-point messages sent
  ■ The duration of some interval
    ■ E.g., the time spent in these send calls
  ■ The size of some parameter
    ■ E.g., the number of bytes transmitted by these calls
■ Derived metrics
  ■ E.g., rates / throughput
  ■ Needed for normalization
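The three kinds of raw measurement above (count, duration, size) combine into derived metrics by simple division. The following is a minimal sketch of that step; the `send_stats` record and the function names are hypothetical and not taken from any particular tool.

```c
/* Raw measurements for a group of send calls (hypothetical record layout). */
typedef struct {
    unsigned long messages;   /* count: number of messages sent */
    double seconds;           /* duration: time spent in the send calls */
    unsigned long bytes;      /* size: bytes transmitted by those calls */
} send_stats;

/* Derived metric: messages per second (0 if no time was recorded). */
double message_rate(const send_stats *s)
{
    return s->seconds > 0.0 ? (double)s->messages / s->seconds : 0.0;
}

/* Derived metric: bytes per second (0 if no time was recorded). */
double bandwidth(const send_stats *s)
{
    return s->seconds > 0.0 ? (double)s->bytes / s->seconds : 0.0;
}

/* Derived metric: average bytes per message (0 if nothing was sent). */
double mean_message_size(const send_stats *s)
{
    return s->messages > 0 ? (double)s->bytes / (double)s->messages : 0.0;
}
```

Rates such as these make runs of different length or scale comparable, which is what the slide means by normalization.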

Example metrics
■ Execution time
■ Number of function calls
■ CPI
  ■ CPU cycles per instruction
■ FLOPS
  ■ Floating-point operations executed per second
  ■ "Math" operations? HW operations? HW instructions? 32-/64-bit? …

Execution time
■ Wall-clock time
  ■ Includes waiting time: I/O, memory, other system activities
  ■ In time-sharing environments also the time consumed by other applications
■ CPU time
  ■ Time spent by the CPU to execute the application
  ■ Does not include time the program was context-switched out
    ■ Problem: Does not include inherent waiting time (e.g., I/O)
    ■ Problem: Portability? What is user, what is system time?
■ Problem: Execution time is non-deterministic
  ■ Use median of several runs, or at least the arithmetic mean
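The difference between the two clocks, and the advice to take the median of several runs, can be sketched in a few lines of C. This is an illustration under assumptions: the helper names are made up, `timespec_get` (C11) reads calendar time and can jump if the system clock is adjusted, and on POSIX systems `clock_gettime` with `CLOCK_MONOTONIC` is the more common choice for interval timing.

```c
#include <stdlib.h>
#include <time.h>

/* Wall-clock seconds (C11 timespec_get; includes all waiting time). */
double wall_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Processor time used by this program so far (ISO C clock). */
double cpu_seconds(void)
{
    return (double)clock() / CLOCKS_PER_SEC;
}

/* Three-way comparison of two doubles for qsort. */
static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Median of n run times; sorts the array in place. n must be > 0. */
double median(double *t, size_t n)
{
    qsort(t, n, sizeof *t, cmp_double);
    return (n % 2) ? t[n / 2] : 0.5 * (t[n / 2 - 1] + t[n / 2]);
}

/* Time `runs` repetitions of work() and return the median wall-clock time.
   `out` must have room for `runs` values. */
double median_wall_time(void (*work)(void), double *out, size_t runs)
{
    for (size_t i = 0; i < runs; i++) {
        double start = wall_seconds();
        work();
        out[i] = wall_seconds() - start;
    }
    return median(out, runs);
}
```

The median is less sensitive than the mean to a single run disturbed by other system activity, which is why the slide prefers it.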

Inclusive vs. exclusive values
■ Inclusive
  ■ Information of all sub-elements aggregated into single value
■ Exclusive
  ■ Information cannot be subdivided further
[Figure: the code below with two brackets labelled "Inclusive" and "Exclusive"]

    void bar(void);      /* callee, defined elsewhere */

    int foo()
    {
        int a;
        a = 1 + 1;
        bar();
        a = a + 1;
        return a;
    }
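For time, the two views are related by subtraction: a routine's exclusive time is its inclusive time minus the inclusive time of its direct callees. The sketch below assumes a hypothetical `region` call-tree node; it is not the data structure of any specific tool.

```c
#include <stddef.h>

/* One node of a call tree (hypothetical layout for illustration). */
typedef struct region {
    const char *name;
    double inclusive;               /* time from enter to leave, callees included */
    const struct region *children;  /* array of direct callees */
    size_t n_children;
} region;

/* Exclusive time = inclusive time minus the inclusive time of all direct callees. */
double exclusive_time(const region *r)
{
    double t = r->inclusive;
    for (size_t i = 0; i < r->n_children; i++)
        t -= r->children[i].inclusive;
    return t;
}
```

For example, if `foo` takes 5 s from entry to exit and its only callee `bar` accounts for 3 s of that, the exclusive time of `foo` is 2 s.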

Classification of measurement techniques
■ How are performance measurements triggered?
  ■ Sampling
  ■ Code instrumentation
■ How is performance data recorded?
  ■ Profiling / Runtime summarization
  ■ Tracing / Logging
■ How is performance data analyzed?
  ■ Post mortem
  ■ Online

Sampling
[Figure: timeline of main, foo(0), foo(1), foo(2) with measurement points t1 to t9]
■ Running program is periodically interrupted to take measurement
  ■ Timer interrupt, OS signal, or HWC overflow
  ■ Service routine examines return-address stack
  ■ Addresses are mapped to routines using symbol table information
■ Statistical inference of program behavior
  ■ Not very detailed information on highly volatile metrics
  ■ Requires long-running applications
■ Works with unmodified executables

    void foo(int i);     /* prototype, so main can call foo before its definition */

    int main()
    {
        int i;
        for (i = 0; i < 3; i++)
            foo(i);
        return 0;
    }

    void foo(int i)
    {
        if (i > 0)
            foo(i - 1);
    }
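The statistical inference behind sampling can be illustrated without any signal handling. The sketch below replaces the timer interrupt and the stack inspection with a recorded timeline, then counts how many evenly spaced samples land in each routine; the number of hits times the sampling period estimates the time spent there. The `interval` type and both functions are assumptions made for this illustration.

```c
#include <stddef.h>
#include <string.h>

/* A stretch of time during which one routine is on top of the call stack. */
typedef struct {
    const char *routine;
    double start, end;   /* half-open interval [start, end) in seconds */
} interval;

/* Return the routine executing at time t, or NULL if none is recorded. */
const char *routine_at(const interval *tl, size_t n, double t)
{
    for (size_t i = 0; i < n; i++)
        if (t >= tl[i].start && t < tl[i].end)
            return tl[i].routine;
    return NULL;
}

/* Take n_samples samples at times first, first + period, ... and
   count how many of them land in `routine`. */
unsigned count_samples(const interval *tl, size_t n, const char *routine,
                       double first, double period, unsigned n_samples)
{
    unsigned hits = 0;
    for (unsigned k = 0; k < n_samples; k++) {
        const char *r = routine_at(tl, n, first + k * period);
        if (r != NULL && strcmp(r, routine) == 0)
            hits++;
    }
    return hits;
}
```

A routine that is shorter than the sampling period may be missed entirely, which is why sampling needs long-running applications to give reliable estimates.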

Instrumentation
[Figure: timeline of main, foo(0), foo(1), foo(2) with measurement points t1 to t14]
■ Measurement code is inserted such that every event of interest is captured directly
  ■ Can be done in various ways
■ Advantage:
  ■ Much more detailed information
■ Disadvantage:
  ■ Processing of source code / executable necessary
  ■ Large relative overheads for small functions

    /* Enter/Leave are the inserted measurement calls; foo needs a prototype. */
    void Enter(const char *name);
    void Leave(const char *name);
    void foo(int i);

    int main()
    {
        int i;
        Enter("main");
        for (i = 0; i < 3; i++)
            foo(i);
        Leave("main");
        return 0;
    }

    void foo(int i)
    {
        Enter("foo");
        if (i > 0)
            foo(i - 1);
        Leave("foo");
    }
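The slide leaves `Enter` and `Leave` undefined. One way they could be implemented, purely as an illustration, is with a small table of per-region call counters plus an event counter. Everything below except the names `Enter`, `Leave`, and `foo` is hypothetical, and the slide's `main` is renamed `instrumented_main` so the sketch can be linked into another program.

```c
#include <string.h>

#define MAX_REGIONS 16

/* Per-region call counter (hypothetical bookkeeping, not a real tool's API). */
typedef struct {
    const char *name;
    unsigned long calls;
} region_stat;

static region_stat regions[MAX_REGIONS];
static int n_regions;
static int depth, max_depth;     /* current and deepest call nesting */
static unsigned long n_events;   /* every Enter and every Leave is one event */

/* Find the entry for `name`, creating it on first use; NULL if the table is full. */
static region_stat *lookup(const char *name)
{
    for (int i = 0; i < n_regions; i++)
        if (strcmp(regions[i].name, name) == 0)
            return &regions[i];
    if (n_regions == MAX_REGIONS)
        return NULL;
    regions[n_regions].name = name;
    regions[n_regions].calls = 0;
    return &regions[n_regions++];
}

void Enter(const char *name)
{
    region_stat *r = lookup(name);
    if (r != NULL)
        r->calls++;
    if (++depth > max_depth)
        max_depth = depth;
    n_events++;
}

void Leave(const char *name)
{
    (void)name;   /* a real tool would check this against the open region */
    depth--;
    n_events++;
}

/* Number of recorded calls of `name` (0 if it was never entered). */
unsigned long calls_of(const char *name)
{
    for (int i = 0; i < n_regions; i++)
        if (strcmp(regions[i].name, name) == 0)
            return regions[i].calls;
    return 0;
}

/* The instrumented routines from the slide. */
void foo(int i)
{
    Enter("foo");
    if (i > 0)
        foo(i - 1);
    Leave("foo");
}

int instrumented_main(void)
{
    int i;
    Enter("main");
    for (i = 0; i < 3; i++)
        foo(i);
    Leave("main");
    return 0;
}
```

Running `instrumented_main()` records one call of `main` and six calls of `foo`, that is 14 events in total, matching the 14 time stamps on the slide's timeline. For a routine as small as `foo`, the bookkeeping in `Enter` and `Leave` costs far more than the routine itself, which is the overhead problem the slide warns about.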

Instrumentation techniques
■ Static instrumentation
  ■ Program is instrumented prior to execution
■ Dynamic instrumentation
  ■ Program is instrumented at runtime
■ Code is inserted
  ■ Manually
  ■ Automatically
    ■ By a preprocessor / source-to-source translation tool
    ■ By a compiler
    ■ By linking against a pre-instrumented library / runtime system
    ■ By binary-rewrite / dynamic instrumentation tool

Critical issues
■ Accuracy
  ■ Intrusion overhead
    ■ Measurement itself needs time and thus lowers performance
  ■ Perturbation
    ■ Measurement alters program behaviour
    ■ E.g., memory access pattern
  ■ Accuracy of timers & counters
■ Granularity
  ■ How many measurements?
  ■ How much information / processing during each measurement?
→ Tradeoff: Accuracy vs. Expressiveness of data

Classification of measurement techniques
■ How are performance measurements triggered?
  ■ Sampling
  ■ Code instrumentation
■ How is performance data recorded?
  ■ Profiling / Runtime summarization
  ■ Tracing / Logging
■ How is performance data analyzed?
  ■ Post mortem
  ■ Online

Profiling / Runtime summarization
■ Recording of aggregated information
  ■ Total, maximum, minimum, …
■ For measurements
  ■ Time
  ■ Counts
    ■ Function calls
    ■ Bytes transferred
  ■ Hardware counters
■ Over program and system entities
  ■ Functions, call sites, basic blocks, loops, …
  ■ Processes, threads
→ Profile = summarization of events over the whole execution interval
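Runtime summarization means that each measurement updates a small fixed-size record instead of being stored individually. A minimal sketch of such a record follows; the `summary` type and its functions are hypothetical names chosen for this illustration.

```c
#include <float.h>

/* Aggregated statistics for one entity, e.g. one function on one thread. */
typedef struct {
    unsigned long count;      /* number of measurements folded in */
    double total, min, max;   /* running sum and extremes */
} summary;

/* Reset a summary to the empty state. */
void summary_init(summary *s)
{
    s->count = 0;
    s->total = 0.0;
    s->min = DBL_MAX;
    s->max = -DBL_MAX;
}

/* Fold one measurement (a duration, a byte count, a counter value) into the summary. */
void summary_add(summary *s, double value)
{
    s->count++;
    s->total += value;
    if (value < s->min)
        s->min = value;
    if (value > s->max)
        s->max = value;
}

/* Arithmetic mean of all measurements (0 if none were added). */
double summary_mean(const summary *s)
{
    return s->count > 0 ? s->total / (double)s->count : 0.0;
}
```

Memory use stays constant per entity no matter how many events occur, at the price of losing the order and timing of individual events.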
