Introduction to Performance Analysis


  1. Introduction to Performance Analysis
     Visualization and Analysis of Performance on Large-scale Software (VAPLS 2013)
     Todd Gamblin (LLNL) and Katherine Isaacs (UC Davis, LLNL)

  2. Why does my code run slowly?
     1) Bad algorithm
        - Poor computational complexity, poor performance
     2) Takes poor advantage of the machine
        - Code does not use hardware resources efficiently
        - Different code may take better or worse advantage of different types of hardware
        - Many factors can contribute: CPU, memory, threading, network, I/O
     3) It just has a lot of work to do
        - Already using the best algorithm
        - Maps well to the machine
     Distinguishing between these scenarios is difficult!

  3. Profiling a single process
     • Profiling is one of the most fundamental forms of performance analysis
     • Measures how much time is spent in particular functions in the code
       - May include calling context
       - May map time to particular files/line numbers
     • Helps programmers locate the bottleneck
       - Amdahl's law: figure out what needs to be sped up
     [Figure: screenshot from HPCToolkit, attributing elapsed time to functions and source code]
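To make the profiling idea concrete, here is a minimal sketch (not from the talk) of how a sampling profiler attributes time: a POSIX interval timer delivers SIGPROF periodically, and the handler counts how many samples land while a hypothetical hot function `compute()` is running. Real profilers such as HPCToolkit record the program counter and unwind the call stack instead of using a flag; the function names and 10 ms interval here are illustrative assumptions.

```c
/* Sketch of interval-timer sampling, the mechanism behind many profilers. */
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>

static volatile unsigned long samples_in_compute = 0; /* samples attributed to compute() */
static volatile unsigned long samples_total = 0;
static volatile int in_compute = 0;                   /* crude "where are we" flag */

static void on_sample(int sig) {
    (void)sig;
    samples_total++;
    if (in_compute) samples_in_compute++;   /* attribute this sample to compute() */
}

static double compute(long n) {             /* hypothetical hot function */
    double s = 0.0;
    in_compute = 1;
    for (long i = 1; i <= n; i++) s += 1.0 / (double)i;
    in_compute = 0;
    return s;
}

int main(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sample;
    sigaction(SIGPROF, &sa, NULL);

    /* fire SIGPROF every 10 ms of consumed CPU time, like a profiler's timer */
    struct itimerval it = { {0, 10000}, {0, 10000} };
    setitimer(ITIMER_PROF, &it, NULL);

    double r = compute(200000000L);
    printf("result=%g, %lu of %lu samples in compute()\n",
           r, samples_in_compute, samples_total);
    return 0;
}
```

The ratio of samples in `compute()` to total samples is the profiler's estimate of the fraction of time spent there, which is exactly the quantity Amdahl's law needs.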

  4. How do we measure hardware?
     • Modern chips offer special hardware performance counters:
       - Event counters, e.g. the number of floating-point, integer, and memory instructions; the number of L1 and L2 cache misses; the number of pipeline stalls. Only so many counters can be measured at once.
       - Instruction sampling (supported on newer chips): precise latency and memory access information, plus operands and other metadata for particular instructions.
     • Counters provide useful diagnostic information
       - Can explain why a particular region consumes lots of time.
       - Generally need to attribute counters to source code or another domain first.
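As an illustration of reading event counters, the sketch below uses the PAPI library (one common interface to hardware counters; it is not named on this slide, and the `PAPI_L1_DCM` event may be unavailable on some chips) to count L1 data-cache misses and total instructions around a loop.

```c
/* Hedged sketch: counting hardware events with PAPI around a code region.
 * Compile with -lpapi; event availability varies by processor. */
#include <stdio.h>
#include <stdlib.h>
#include <papi.h>

int main(void) {
    int es = PAPI_NULL;
    long long counts[2];

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) exit(1);
    PAPI_create_eventset(&es);
    PAPI_add_event(es, PAPI_L1_DCM);   /* L1 data cache misses; may be unavailable */
    PAPI_add_event(es, PAPI_TOT_INS);  /* total instructions completed */

    enum { N = 1 << 22 };
    static double a[N];                /* zero-initialized static array */

    PAPI_start(es);
    for (int i = 0; i < N; i++) a[i] = a[i] * 2.0 + 1.0;  /* region to measure */
    PAPI_stop(es, counts);

    printf("L1 data cache misses: %lld\n", counts[0]);
    printf("Total instructions:   %lld\n", counts[1]);
    return 0;
}
```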

  5. Understanding processor complexity
     • Once you've identified the "hot spot", how do you know what the problem is?
       - Have to dig deeper into the hardware
       - Understand how the code interacts with the architecture
     • Processors themselves are parallel:
       - Multiple functional units
       - Multiple instructions issued per clock cycle
       - SIMD (vector) instructions
       - Hyperthreading
     • Can code exploit these?
     [Figure: 17-core Blue Gene/Q SoC with a shared L2 cache]
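A small example of the SIMD point: the loop below is written so a compiler can map it onto vector instructions. Whether that actually happens depends on the compiler and flags (e.g. `-O3 -fopenmp-simd`), so treat this as a sketch, not a guarantee.

```c
/* A vectorizable saxpy loop: the restrict qualifiers and the omp simd hint
 * tell the compiler the iterations are independent, so it can issue one
 * multiply-add per SIMD lane instead of one per scalar iteration. */
#include <stddef.h>
#include <stdio.h>

void saxpy(size_t n, float a, const float *restrict x, float *restrict y) {
    #pragma omp simd
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

int main(void) {
    enum { N = 8 };
    float x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = (float)i; y[i] = 1.0f; }
    saxpy(N, 2.0f, x, y);
    printf("y[7] = %g\n", y[7]);   /* expect 15 */
    return 0;
}
```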

  6. Understanding memory
     • Threads and processes on a single node communicate through shared memory
     • Memory is hierarchical
       - Many levels of cache
       - Different access speeds
       - Different levels of sharing in cache
     [Figure: one processor with four cores (Core 0-3), per-core L1 caches, a shared L2 cache, and main memory]
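One way to see the cache hierarchy from user code (an illustration, not part of the slide) is to sum the same matrix in two traversal orders: the row-major walk reuses each cache line, while the column-major walk strides across rows and misses far more often, so it typically runs several times slower.

```c
/* Same arithmetic, two traversal orders; only the memory access pattern differs. */
#include <stdio.h>
#include <time.h>

#define N 4096
static double m[N][N];

static double sum_rows(void) {             /* unit stride: cache friendly */
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) s += m[i][j];
    return s;
}

static double sum_cols(void) {             /* stride of N doubles: cache hostile */
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++) s += m[i][j];
    return s;
}

static double seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    double t0 = seconds(); double a = sum_rows();
    double t1 = seconds(); double b = sum_cols();
    double t2 = seconds();
    /* printing the sums keeps the compiler from eliminating the loops */
    printf("row-major %.3fs  column-major %.3fs  (sums %g %g)\n",
           t1 - t0, t2 - t1, a, b);
    return 0;
}
```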

  7. Understanding memory
     • Access to main memory may not have uniform access time
       - More cores means uniform latency is hard to maintain
     • Many modern processors have Non-Uniform Memory Access (NUMA) latency
       - Time to access remote sockets is longer than time to access local ones
     [Figure: 4-socket, 16-core NUMA node, with per-socket L2 caches and memories (Memory 0-3)]
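A common way to cope with NUMA in application code is the "first touch" idiom sketched below: with threads pinned to sockets (e.g. `OMP_PROC_BIND=close`), the thread that first writes a page usually causes it to be allocated in that socket's local memory. The details are OS- and runtime-dependent and are not covered on the slide; this is an assumption-laden sketch using OpenMP.

```c
/* First-touch initialization: initialize data with the same thread layout
 * that will later compute on it, so each thread's pages stay NUMA-local. */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void) {
    size_t n = 1 << 24;                     /* 16 Mi doubles, spans many pages */
    double *a = malloc(n * sizeof *a);
    if (!a) return 1;

    /* First touch: each thread writes the chunk it will later work on. */
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; i++) a[i] = 0.0;

    /* Compute phase uses the same static schedule, so accesses stay local. */
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; i++) a[i] += 1.0;

    printf("a[0]=%g a[n-1]=%g (threads=%d)\n", a[0], a[n - 1], omp_get_max_threads());
    free(a);
    return 0;
}
```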

  8. Modern supercomputers are composed of many processors
     • Tianhe-2 in China
       - 3.1 million Intel Xeon Phi cores
       - Fat tree network
     • Titan at ORNL
       - 560,000 AMD Opteron cores
       - 18,688 Nvidia GPUs
       - 3D torus/mesh network
     • IBM Blue Gene/Q at LLNL
       - 1.5 million PowerPC A2 cores
       - 98,000 network nodes x 16 cores
       - 5D torus/mesh network

  9. Processors pass messages over a network
     • Each node in the network is a multi-core processor
     • Programs pass messages over the network
     • Many topologies: fat tree, Cartesian (torus/mesh), dragonfly
       - Multiple routing options for each one!
     • Most recent networks have extensive performance counters
       - Measure bandwidth on links
       - Measure contention on nodes
     [Figures: 4D torus network topology; fat tree network topology]
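For readers unfamiliar with message passing, here is a minimal MPI ring example (generic MPI, not tied to any machine or topology on this slide): each rank forwards a token to its neighbor, so one small message traverses the network once per hop.

```c
/* Token ring over MPI point-to-point messages. Run with mpirun -n <ranks>. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size, token;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size < 2) {                       /* a ring needs at least two ranks */
        if (rank == 0) printf("run with at least 2 ranks\n");
        MPI_Finalize();
        return 0;
    }

    if (rank == 0) {
        token = 42;
        MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(&token, 1, MPI_INT, size - 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("token made it around %d ranks\n", size);
    } else {
        MPI_Recv(&token, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(&token, 1, MPI_INT, (rank + 1) % size, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
```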

  10. Tracing in a message-passing application
      • Tracing records all function calls and messages
        - Can record along with counters
        - Large volume of records
        - Clocks may not be synchronized
      • Identify causes and propagation of delays
      • Log the behavior of adaptive algorithms in practice
      [Figure: screenshot from Vampir]
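Many tracing tools intercept MPI calls through the standard PMPI profiling interface. The sketch below (assuming an MPI-3 implementation; the record format is made up for illustration) wraps `MPI_Send`, timestamps the call, and forwards it to `PMPI_Send`. Compiled into a library and linked ahead of MPI, it sees every send without changing the application.

```c
/* Minimal PMPI wrapper: the tool's MPI_Send shadows the library's, logs a
 * timestamped trace record, then calls the real implementation via PMPI_Send. */
#include <stdio.h>
#include <mpi.h>

int MPI_Send(const void *buf, int count, MPI_Datatype type,
             int dest, int tag, MPI_Comm comm) {
    double t0 = MPI_Wtime();
    int rc = PMPI_Send(buf, count, type, dest, tag, comm);   /* real send */
    double t1 = MPI_Wtime();

    int rank;
    PMPI_Comm_rank(comm, &rank);
    /* one trace record: who sent what to whom, and how long the call took */
    fprintf(stderr, "TRACE send rank=%d dest=%d tag=%d count=%d dt=%.6fs\n",
            rank, dest, tag, count, t1 - t0);
    return rc;
}
```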

  11. Understanding parallel performance requires mapping hardware measurements to intuitive domains
      • Map hardware events to source code, data structures, etc.
        - Understand why performance is bad
        - Take action based on what the hardware data correlates to
      • Most programmers look at only a small fraction of the hardware data
        - Automated visualization and analysis could help leverage the data

  12. Tools for collecting performance measurements
      • HPCToolkit: hpctoolkit.org
      • mpiP: mpip.sourceforge.net
      • Open|Speedshop: openspeedshop.org
      • Paraver: www.bsc.es/computer-sciences/performance-tools/paraver
      • PnMPI: scalability.llnl.gov/pnmpi/
      • Scalasca: www.scalasca.org
      • ScalaTrace: moss.csc.ncsu.edu/~mueller/ScalaTrace/
      • TAU: www.cs.uoregon.edu/research/tau/
      • VampirTrace: www.tu-dresden.de/zih/vampirtrace
      ...and many more!
