
SLIDE 1: Introduction to Performance Analysis

Visualization and Analysis of Performance on Large-scale Software (VAPLS 2013)

Todd Gamblin (LLNL) and Katherine Isaacs (UC Davis, LLNL)

SLIDE 2: Why does my code run slowly?

Lawrence Livermore National Laboratory

1. Bad algorithm
   • Poor computational complexity, poor performance
2. Takes poor advantage of the machine
   • Code does not use hardware resources efficiently
   • Different code may take better or worse advantage of different types of hardware
   • Many factors can contribute: CPU, memory, threading, network, I/O
3. It just has a lot of work to do
   • Already using the best algorithm
   • Maps well to the machine
   • Distinguishing between these scenarios is difficult!
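Scenario 1 above (a bad algorithm) is often the cheapest problem to fix. As a minimal illustration, not from the slides, the hypothetical functions below compute the same result with O(n·m) and O(n + m) complexity; on large inputs the difference dwarfs any hardware tuning:

```python
import time

def count_common_quadratic(a, b):
    """O(len(a) * len(b)): 'x in b' on a list is a linear scan."""
    return sum(1 for x in a if x in b)

def count_common_linear(a, b):
    """O(len(a) + len(b)): one pass to build a set, one pass to probe it."""
    b_set = set(b)                      # hashing makes each membership test O(1)
    return sum(1 for x in a if x in b_set)

if __name__ == "__main__":
    a = list(range(0, 20000, 2))
    b = list(range(0, 20000, 3))
    t0 = time.perf_counter()
    slow = count_common_quadratic(a, b)
    t1 = time.perf_counter()
    fast = count_common_linear(a, b)
    t2 = time.perf_counter()
    assert slow == fast                 # same answer, very different cost
    print(f"quadratic: {t1 - t0:.3f}s  linear: {t2 - t1:.3f}s")
```

Profiling (next slide) is what tells you which scenario you are actually in.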

SLIDE 3: Profiling a single process

  • Profiling is one of the most fundamental forms of performance analysis
  • Measures how much time is spent in particular functions in the code
    • May include calling context
    • May map time to particular files/line numbers
  • Helps programmers locate the bottleneck
    • Amdahl's law: figure out what needs to be sped up

(Screenshot from HPCToolkit: source code alongside elapsed time per function)
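The slide describes tools like HPCToolkit; the same idea can be sketched with Python's built-in profiler, which likewise attributes time to functions (with file and line numbers). The two workload functions here are made up for illustration:

```python
import cProfile
import io
import pstats

def slow_part(n):
    # Deliberately heavy: repeated string concatenation copies the string each time
    s = ""
    for i in range(n):
        s += str(i)
    return len(s)

def fast_part(n):
    # Same result, one join
    return len("".join(str(i) for i in range(n)))

def workload():
    slow_part(20000)
    fast_part(20000)

profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats()
report = out.getvalue()
print(report)
```

The report lists each function with call counts and cumulative time, which is exactly the "where is the time going?" question Amdahl's law asks you to answer first.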

SLIDE 4: How do we measure hardware?

  • Modern chips offer special hardware performance counters:
    • Event counters, e.g.:
      • Number of FP, integer, memory, etc. instructions
      • Number of L1, L2 cache misses
      • Number of pipeline stalls
      • Only so many counters can be measured at once
    • Instruction sampling:
      • Precise latency and memory access information
      • Operands and other metadata for particular instructions
      • Supported on newer chips
  • Counters provide useful diagnostic information
    • Can explain why a particular region consumes lots of time
    • Generally need to attribute counters to source code or another domain first
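Reading the hardware counters themselves requires tools such as PAPI or Linux perf, which the slides do not detail. As a loose analogy, the operating system keeps per-process software counters that can be read portably on Unix via `resource.getrusage`; this sketch only shows that "counter snapshot before/after a region" workflow, not real hardware events:

```python
import resource

def rusage_snapshot():
    """Software counters the OS keeps per process (Unix only).
    Hardware events such as cache misses or pipeline stalls need
    PAPI, Linux perf, or a similar tool instead."""
    ru = resource.getrusage(resource.RUSAGE_SELF)
    return {
        "user_time_s": ru.ru_utime,    # CPU time spent in user mode
        "minor_faults": ru.ru_minflt,  # page faults served without disk I/O
        "major_faults": ru.ru_majflt,  # page faults that required disk I/O
        "max_rss_kb": ru.ru_maxrss,    # peak resident set size (KB on Linux)
    }

before = rusage_snapshot()
_ = [bytearray(4096) for _ in range(1000)]   # touch some memory
after = rusage_snapshot()
print(before, after)
```

The attribution problem from the slide applies here too: a raw counter delta is only useful once you can tie it to a region of code.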

SLIDE 5: Understanding processor complexity

  • Once you've identified the "hot spot", how do you know what the problem is?
    • Have to dig deeper into the hardware
    • Understand how the code interacts with the architecture
  • Processors themselves are parallel:
    • Multiple functional units
    • Multiple instructions issued per clock cycle
    • SIMD (vector) instructions
    • Hyperthreading
  • Can your code exploit these?

(Diagram: 17-core Blue Gene/Q SoC, with 17 processor cores and a shared L2 cache)
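Pure Python cannot reach SIMD units or issue slots directly, but the slide's closing question, can your code exploit the parallelism the chip offers, can at least be sketched at the core level. This is an illustrative example, not from the slides; it checks whether a CPU-bound kernel actually scales across processes:

```python
import multiprocessing as mp
import os

def cpu_bound(n):
    """A CPU-bound kernel: sum of squares below n."""
    return sum(i * i for i in range(n))

def run_serial(tasks):
    return [cpu_bound(n) for n in tasks]

def run_parallel(tasks):
    # One process per core (capped at 4 here): separate processes
    # sidestep the GIL for CPU-bound Python code.
    with mp.Pool(processes=min(os.cpu_count() or 1, 4)) as pool:
        return pool.map(cpu_bound, tasks)

if __name__ == "__main__":
    tasks = [50_000] * 8
    assert run_serial(tasks) == run_parallel(tasks)
```

In compiled HPC codes the analogous questions are whether the compiler vectorized the hot loops and whether enough independent work exists to fill the functional units.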

SLIDE 6: Understanding memory

  • Threads and processes on a single node communicate through shared memory
  • Memory is hierarchical
    • Many levels of cache
    • Different access speeds
    • Different levels of sharing in cache

(Diagram: Processor 0, with Cores 0 through 3, a private L1 cache per core, a shared L2 cache, and main memory)
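Because the levels of this hierarchy have such different access speeds, traversal order matters. The sketch below, an assumption-laden illustration rather than anything from the slides, contrasts row-major and column-major sweeps over a 2D structure; the effect is modest for Python lists of objects but dramatic for contiguous arrays in C or Fortran:

```python
import time

N = 600
grid = [[1] * N for _ in range(N)]   # N x N, stored as N row lists

def sum_row_major(g):
    """Visits each row's elements consecutively: good spatial locality."""
    total = 0
    for row in g:
        for x in row:
            total += x
    return total

def sum_col_major(g):
    """Jumps to a different row on every access: poor spatial locality."""
    total = 0
    n = len(g)
    for j in range(n):
        for i in range(n):
            total += g[i][j]
    return total

t0 = time.perf_counter()
a = sum_row_major(grid)
t1 = time.perf_counter()
b = sum_col_major(grid)
t2 = time.perf_counter()
assert a == b == N * N               # same answer either way
print(f"row-major: {t1 - t0:.3f}s  column-major: {t2 - t1:.3f}s")
```

The cache-miss counters from Slide 4 are what would confirm that the slower sweep is memory-bound rather than compute-bound.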

SLIDE 7: Understanding memory

  • Access to main memory may not have uniform access time
  • More cores means uniform latency is hard to maintain
  • Many modern processors have Non-Uniform Memory Access (NUMA) latency
  • Time to access remote sockets is longer than local ones

(Diagram: a 4-socket, 16-core NUMA node: Processors 0 through 3, each with cores C0 through C3, per-core L1 caches, and a shared L2 cache, attached to local memory banks Memory 0 through Memory 3)
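On Linux, the NUMA layout the diagram depicts is visible in sysfs. The helper below is hypothetical and for illustration only; in practice placement is controlled with tools such as numactl, libnuma, or an MPI launcher's binding options rather than hand-rolled code:

```python
import os

def numa_nodes():
    """Lists NUMA node ids from Linux sysfs; returns [] on other systems.
    Each id corresponds to one memory bank like Memory 0..3 in the diagram."""
    base = "/sys/devices/system/node"
    if not os.path.isdir(base):
        return []
    return sorted(
        int(name[4:])
        for name in os.listdir(base)
        if name.startswith("node") and name[4:].isdigit()
    )

nodes = numa_nodes()
print("NUMA nodes visible:", nodes)
```

A single-socket laptop typically reports one node; the 4-socket node on the slide would report four.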

SLIDE 8: Modern supercomputers are composed of many processors

  • Tianhe-2 in China
    • 3.1 million Intel Xeon Phi cores
    • Fat tree network
  • Titan at ORNL
    • 560,000 AMD Opteron cores
    • 18,688 Nvidia GPUs
    • 3D torus/mesh network
  • IBM Blue Gene/Q at LLNL
    • 1.5 million PowerPC A2 cores
    • 98,000 network nodes × 16 cores
    • 5D torus/mesh network

SLIDE 9: Processors pass messages over a network

  • Each node in the network is a multi-core processor
  • Programs pass messages over the network
  • Many topologies:
    • Fat tree
    • Cartesian (torus/mesh)
    • Dragonfly
    • Multiple routing options for each one!
  • Most recent networks have extensive performance counters
    • Measure bandwidth on links
    • Measure contention on nodes

(Diagrams: a 4-D torus network topology and a fat tree network topology)
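To make the Cartesian (torus/mesh) topology concrete, here is a small sketch, not from the slides, of computing a node's neighbors in a torus of arbitrary dimension; it mirrors what MPI's Cartesian communicators (MPI_Cart_create with periodic boundaries plus MPI_Cart_shift) provide:

```python
def torus_neighbors(coord, dims):
    """Neighbors of a node in a Cartesian torus: one step +/- along each
    dimension, with wraparound, so every node has 2 * len(dims) neighbors."""
    neighbors = []
    for d in range(len(dims)):
        for step in (-1, 1):
            c = list(coord)
            c[d] = (c[d] + step) % dims[d]   # periodic boundary: wrap around
            neighbors.append(tuple(c))
    return neighbors

# A node in a 4x4x4 3D torus has 6 neighbors; in the 5D Blue Gene/Q torus, 10
print(torus_neighbors((0, 0, 0), (4, 4, 4)))
```

Because several routes exist between any two nodes, the link-bandwidth and contention counters the slide mentions are what reveal which routes the traffic actually took.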

SLIDE 10: Tracing in a message-passing application

  • Tracing records all function calls and messages
    • Can record counters along with events
    • Produces a large volume of records
    • Clocks may not be synchronized
  • Identify causes and propagation of delays
  • Log behavior of adaptive algorithms in practice

(Screenshot from Vampir)
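The per-process half of what tracers like Vampir collect can be sketched with a decorator that timestamps every function entry and exit. This is an illustrative toy, not how the real tools work internally; production tracers also record message sends and receives and must reconcile the unsynchronized clocks the slide warns about:

```python
import functools
import time

trace_log = []   # (event, function_name, timestamp) records

def traced(fn):
    """Appends an enter/exit event with a timestamp around every call."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        trace_log.append(("enter", fn.__name__, time.perf_counter()))
        try:
            return fn(*args, **kwargs)
        finally:
            trace_log.append(("exit", fn.__name__, time.perf_counter()))
    return wrapper

@traced
def compute(n):
    return sum(range(n))

@traced
def step():
    return compute(1000)

step()
print(trace_log)
```

Even this toy shows why trace volume explodes: four records for two calls, before any messages or counters are attached.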

SLIDE 11: Understanding parallel performance requires mapping hardware measurements to intuitive domains

  • Map hardware events to source code, data structures, etc.
    • Understand why performance is bad
    • Take action based on what the hardware data correlates to
  • Most programmers look at only a small fraction of hardware data
  • Automated visualization and analysis could help leverage the data
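The "map events to source code" step can be sketched in miniature with Python's profiling hook, which exposes the file, line, and function for every call. This stands in for the attribution that HPC tools do with hardware events rather than Python call events; the workload functions are invented for the example:

```python
import sys
from collections import Counter

call_counts = Counter()

def attribute(frame, event, arg):
    """Maps each Python function entry to a (file, line, function) key,
    the kind of source-level attribution the slide calls for."""
    if event == "call":
        code = frame.f_code
        call_counts[(code.co_filename, code.co_firstlineno, code.co_name)] += 1

def inner():
    return 42

def outer():
    return inner() + inner()

sys.setprofile(attribute)
outer()
sys.setprofile(None)

by_name = {name: n for (_, _, name), n in call_counts.items()}
print(by_name)
```

Swapping the event source from Python calls to sampled hardware counters, and the key from (file, line, function) to data structures or network links, gives the general mapping problem the slide describes.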

SLIDE 12: Tools for collecting performance measurements

  • HPCToolkit: hpctoolkit.org
  • mpiP: mpip.sourceforge.net
  • Open|Speedshop: openspeedshop.org
  • Paraver: www.bsc.es/computer-sciences/performance-tools/paraver
  • PnMPI: scalability.llnl.gov/pnmpi/
  • Scalasca: www.scalasca.org
  • ScalaTrace: moss.csc.ncsu.edu/~mueller/ScalaTrace/
  • TAU: www.cs.uoregon.edu/research/tau/
  • VampirTrace: www.tu-dresden.de/zih/vampirtrace

…and many more!