Introduction to Performance Analysis


  1. Introduction to Performance Analysis
     Visualization and Analysis of Performance on Large-scale Software (VAPLS 2013)
     Todd Gamblin (LLNL) and Katherine Isaacs (UC Davis, LLNL)

  2. Why does my code run slowly?
     1) Bad algorithm
        - Poor computational complexity, poor performance
     2) Takes poor advantage of the machine
        - Code does not use hardware resources efficiently
        - Different code may take better or worse advantage of different types of hardware
        - Many factors can contribute: CPU, memory, threading, network, I/O
     3) It just has a lot of work to do
        - Already using the best algorithm
        - Maps well to the machine
     Distinguishing between these scenarios is difficult!

  3. Profiling a single process
     • Profiling is one of the most fundamental forms of performance analysis
     • Measures how much time is spent in particular functions in the code
       - May include calling context
       - May map time to particular files/line numbers
     • Helps programmers locate the bottleneck
       - Amdahl's law: figure out what needs to be sped up
     [Figure: screenshot from HPCToolkit, attributing elapsed time to functions and source code]
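To make the profiling idea concrete, here is a minimal sketch (not from the talk) of how a sampling profiler attributes time: a POSIX interval timer delivers SIGPROF periodically, and the handler counts how many samples land while a hypothetical hot function `compute()` is running. Real profilers such as HPCToolkit record the program counter and unwind the call stack instead of using a flag; the function names and 10 ms interval here are illustrative assumptions.

```c
/* Sketch of interval-timer sampling, the mechanism behind many profilers. */
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>

static volatile unsigned long samples_in_compute = 0; /* samples attributed to compute() */
static volatile unsigned long samples_total = 0;
static volatile int in_compute = 0;                   /* crude "where are we" flag */

static void on_sample(int sig) {
    (void)sig;
    samples_total++;
    if (in_compute) samples_in_compute++;   /* attribute this sample to compute() */
}

static double compute(long n) {             /* hypothetical hot function */
    double s = 0.0;
    in_compute = 1;
    for (long i = 1; i <= n; i++) s += 1.0 / (double)i;
    in_compute = 0;
    return s;
}

int main(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sample;
    sigaction(SIGPROF, &sa, NULL);

    /* fire SIGPROF every 10 ms of consumed CPU time, like a profiler's timer */
    struct itimerval it = { {0, 10000}, {0, 10000} };
    setitimer(ITIMER_PROF, &it, NULL);

    double r = compute(200000000L);
    printf("result=%g, %lu of %lu samples in compute()\n",
           r, samples_in_compute, samples_total);
    return 0;
}
```

The ratio of samples in `compute()` to total samples is the profiler's estimate of the fraction of time spent there, which is exactly the quantity Amdahl's law needs.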

  4. How do we measure hardware?
     • Modern chips offer special hardware performance counters:
       - Event counters, e.g. the number of floating-point, integer, and memory instructions; the number of L1 and L2 cache misses; the number of pipeline stalls. Only so many counters can be measured at once.
       - Instruction sampling (supported on newer chips): precise latency and memory access information, plus operands and other metadata for particular instructions.
     • Counters provide useful diagnostic information
       - Can explain why a particular region consumes lots of time.
       - Generally need to attribute counters to source code or another domain first.
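As an illustration of reading event counters, the sketch below uses the PAPI library (one common interface to hardware counters; it is not named on this slide, and the `PAPI_L1_DCM` event may be unavailable on some chips) to count L1 data-cache misses and total instructions around a loop.

```c
/* Hedged sketch: counting hardware events with PAPI around a code region.
 * Compile with -lpapi; event availability varies by processor. */
#include <stdio.h>
#include <stdlib.h>
#include <papi.h>

int main(void) {
    int es = PAPI_NULL;
    long long counts[2];

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) exit(1);
    PAPI_create_eventset(&es);
    PAPI_add_event(es, PAPI_L1_DCM);   /* L1 data cache misses; may be unavailable */
    PAPI_add_event(es, PAPI_TOT_INS);  /* total instructions completed */

    enum { N = 1 << 22 };
    static double a[N];                /* zero-initialized static array */

    PAPI_start(es);
    for (int i = 0; i < N; i++) a[i] = a[i] * 2.0 + 1.0;  /* region to measure */
    PAPI_stop(es, counts);

    printf("L1 data cache misses: %lld\n", counts[0]);
    printf("Total instructions:   %lld\n", counts[1]);
    return 0;
}
```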

  5. Understanding processor complexity
     • Once you've identified the "hot spot", how do you know what the problem is?
       - Have to dig deeper into the hardware
       - Understand how the code interacts with the architecture
     • Processors themselves are parallel:
       - Multiple functional units
       - Multiple instructions issued per clock cycle
       - SIMD (vector) instructions
       - Hyperthreading
     • Can code exploit these?
     [Figure: 17-core Blue Gene/Q SoC with a shared L2 cache]
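A small example of the SIMD point: the loop below is written so a compiler can map it onto vector instructions. Whether that actually happens depends on the compiler and flags (e.g. `-O3 -fopenmp-simd`), so treat this as a sketch, not a guarantee.

```c
/* A vectorizable saxpy loop: the restrict qualifiers and the omp simd hint
 * tell the compiler the iterations are independent, so it can issue one
 * multiply-add per SIMD lane instead of one per scalar iteration. */
#include <stddef.h>
#include <stdio.h>

void saxpy(size_t n, float a, const float *restrict x, float *restrict y) {
    #pragma omp simd
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

int main(void) {
    enum { N = 8 };
    float x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = (float)i; y[i] = 1.0f; }
    saxpy(N, 2.0f, x, y);
    printf("y[7] = %g\n", y[7]);   /* expect 15 */
    return 0;
}
```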

  6. Understanding memory
     • Threads and processes on a single node communicate through shared memory
     • Memory is hierarchical
       - Many levels of cache
       - Different access speeds
       - Different levels of sharing in cache
     [Figure: one processor with four cores (Core 0-3), per-core L1 caches, a shared L2 cache, and main memory]
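One way to see the cache hierarchy from user code (an illustration, not part of the slide) is to sum the same matrix in two traversal orders: the row-major walk reuses each cache line, while the column-major walk strides across rows and misses far more often, so it typically runs several times slower.

```c
/* Same arithmetic, two traversal orders; only the memory access pattern differs. */
#include <stdio.h>
#include <time.h>

#define N 4096
static double m[N][N];

static double sum_rows(void) {             /* unit stride: cache friendly */
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) s += m[i][j];
    return s;
}

static double sum_cols(void) {             /* stride of N doubles: cache hostile */
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++) s += m[i][j];
    return s;
}

static double seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    double t0 = seconds(); double a = sum_rows();
    double t1 = seconds(); double b = sum_cols();
    double t2 = seconds();
    /* printing the sums keeps the compiler from eliminating the loops */
    printf("row-major %.3fs  column-major %.3fs  (sums %g %g)\n",
           t1 - t0, t2 - t1, a, b);
    return 0;
}
```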

  7. Understanding memory
     • Access to main memory may not have uniform access time
       - More cores means uniform latency is hard to maintain
     • Many modern processors have Non-Uniform Memory Access (NUMA) latency
       - Time to access remote sockets is longer than time to access local ones
     [Figure: 4-socket, 16-core NUMA node, with per-socket L2 caches and memories (Memory 0-3)]
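A common way to cope with NUMA in application code is the "first touch" idiom sketched below: with threads pinned to sockets (e.g. `OMP_PROC_BIND=close`), the thread that first writes a page usually causes it to be allocated in that socket's local memory. The details are OS- and runtime-dependent and are not covered on the slide; this is an assumption-laden sketch using OpenMP.

```c
/* First-touch initialization: initialize data with the same thread layout
 * that will later compute on it, so each thread's pages stay NUMA-local. */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void) {
    size_t n = 1 << 24;                     /* 16 Mi doubles, spans many pages */
    double *a = malloc(n * sizeof *a);
    if (!a) return 1;

    /* First touch: each thread writes the chunk it will later work on. */
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; i++) a[i] = 0.0;

    /* Compute phase uses the same static schedule, so accesses stay local. */
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; i++) a[i] += 1.0;

    printf("a[0]=%g a[n-1]=%g (threads=%d)\n", a[0], a[n - 1], omp_get_max_threads());
    free(a);
    return 0;
}
```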

  8. Modern supercomputers are composed of many processors
     • Tianhe-2 in China
       - 3.1 million Intel Xeon Phi cores
       - Fat tree network
     • Titan at ORNL
       - 560,000 AMD Opteron cores
       - 18,688 Nvidia GPUs
       - 3D torus/mesh network
     • IBM Blue Gene/Q at LLNL
       - 1.5 million PowerPC A2 cores
       - 98,000 network nodes x 16 cores
       - 5D torus/mesh network

  9. Processors pass messages over a network
     • Each node in the network is a multi-core processor
     • Programs pass messages over the network
     • Many topologies: fat tree, Cartesian (torus/mesh), dragonfly
       - Multiple routing options for each one!
     • Most recent networks have extensive performance counters
       - Measure bandwidth on links
       - Measure contention on nodes
     [Figures: 4D torus network topology; fat tree network topology]
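For readers unfamiliar with message passing, here is a minimal MPI ring example (generic MPI, not tied to any machine or topology on this slide): each rank forwards a token to its neighbor, so one small message traverses the network once per hop.

```c
/* Token ring over MPI point-to-point messages. Run with mpirun -n <ranks>. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size, token;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size < 2) {                       /* a ring needs at least two ranks */
        if (rank == 0) printf("run with at least 2 ranks\n");
        MPI_Finalize();
        return 0;
    }

    if (rank == 0) {
        token = 42;
        MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(&token, 1, MPI_INT, size - 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("token made it around %d ranks\n", size);
    } else {
        MPI_Recv(&token, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(&token, 1, MPI_INT, (rank + 1) % size, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
```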

  10. Tracing in a message-passing application
      • Tracing records all function calls and messages
        - Can record along with counters
        - Large volume of records
        - Clocks may not be synchronized
      • Identify causes and propagation of delays
      • Log the behavior of adaptive algorithms in practice
      [Figure: screenshot from Vampir]
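Many tracing tools intercept MPI calls through the standard PMPI profiling interface. The sketch below (assuming an MPI-3 implementation; the record format is made up for illustration) wraps `MPI_Send`, timestamps the call, and forwards it to `PMPI_Send`. Compiled into a library and linked ahead of MPI, it sees every send without changing the application.

```c
/* Minimal PMPI wrapper: the tool's MPI_Send shadows the library's, logs a
 * timestamped trace record, then calls the real implementation via PMPI_Send. */
#include <stdio.h>
#include <mpi.h>

int MPI_Send(const void *buf, int count, MPI_Datatype type,
             int dest, int tag, MPI_Comm comm) {
    double t0 = MPI_Wtime();
    int rc = PMPI_Send(buf, count, type, dest, tag, comm);   /* real send */
    double t1 = MPI_Wtime();

    int rank;
    PMPI_Comm_rank(comm, &rank);
    /* one trace record: who sent what to whom, and how long the call took */
    fprintf(stderr, "TRACE send rank=%d dest=%d tag=%d count=%d dt=%.6fs\n",
            rank, dest, tag, count, t1 - t0);
    return rc;
}
```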

  11. Understanding parallel performance requires mapping hardware measurements to intuitive domains
      • Map hardware events to source code, data structures, etc.
        - Understand why performance is bad
        - Take action based on what the hardware data correlates to
      • Most programmers look at only a small fraction of the hardware data
        - Automated visualization and analysis could help leverage the data

  12. Tools for collecting performance measurements
      • HPCToolkit: hpctoolkit.org
      • mpiP: mpip.sourceforge.net
      • Open|Speedshop: openspeedshop.org
      • Paraver: www.bsc.es/computer-sciences/performance-tools/paraver
      • PnMPI: scalability.llnl.gov/pnmpi/
      • Scalasca: www.scalasca.org
      • ScalaTrace: moss.csc.ncsu.edu/~mueller/ScalaTrace/
      • TAU: www.cs.uoregon.edu/research/tau/
      • VampirTrace: www.tu-dresden.de/zih/vampirtrace
      ...and many more!
