Introduction to Performance Analysis
Visualization and Analysis of Performance on Large-scale Software Todd Gamblin LLNL
VAPLS 2013
Authors: Todd Gamblin (LLNL), Katherine Isaacs (UC Davis, LLNL)
Why does my code run slowly?
1. Bad algorithm
– Poor computational
Lawrence Livermore National Laboratory
– CPU, memory, threading, network, I/O
[Figure: screenshot from HPCToolkit, with panes labeled Source Code, Elapsed Time, and Functions]
– Number of FP, integer, memory, etc. instructions
– Number of L1, L2 cache misses
– Number of pipeline stalls
– Only so many counters can be measured at once.
– Precise latency and memory access information
– Operands and other metadata for particular instructions
– Newer chips support this
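Because only a limited number of counters can be measured at once, collecting a longer list of events generally means splitting it across several runs or time slices. The sketch below illustrates that idea only. The event labels, the limit of four simultaneous counters, and the function name `schedule_counter_groups` are assumptions made for illustration and do not describe any particular chip or tool.

```python
from typing import List


def schedule_counter_groups(events: List[str], max_counters: int) -> List[List[str]]:
    """Split events into ordered groups of at most max_counters each.

    Each group stands for one run (or time slice) in which every event
    in the group is counted simultaneously.
    """
    if max_counters < 1:
        raise ValueError("max_counters must be at least 1")
    return [events[i:i + max_counters] for i in range(0, len(events), max_counters)]


# Illustrative event labels taken from the bullet list above; these are
# not real counter names for any specific processor.
EVENTS = [
    "fp_instructions",
    "integer_instructions",
    "memory_instructions",
    "l1_cache_misses",
    "l2_cache_misses",
    "pipeline_stalls",
]

# Hypothetical hardware limit: four counters at a time.
groups = schedule_counter_groups(EVENTS, max_counters=4)
for run, group in enumerate(groups):
    print(f"run {run}: {', '.join(group)}")
```

Real hardware often adds further constraints, for example events that can only be counted on specific counters, so a simple fixed-size split like this should be read as a lower bound on the number of runs needed.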
the architecture
cycle
[Figure: 17-core Blue Gene/Q SoC, labeled with 17 processor cores and a shared L2 cache]
[Figure: Processor 0 with four cores (Core 0 to Core 3), four L1 caches, one L2 cache, and main memory]
[Figure: 4-socket, 16-core NUMA node. Each of the four processors has four cores (C0 to C3), four L1 caches, and one L2 cache, and is paired with one of four memories (Memory 0 to Memory 3).]
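On a NUMA node like the one pictured, whether a memory access is local depends on which socket the accessing core belongs to. The following is a minimal sketch of that mapping. It assumes cores are numbered contiguously from 0 to 15 across sockets and that Memory i is attached to Processor i. Neither assumption comes from the slides, real systems may number cores differently, and the helper names are hypothetical.

```python
# Layout taken from the figure caption: 4 sockets, 16 cores in total.
NUM_SOCKETS = 4
CORES_PER_SOCKET = 4


def socket_of(core: int) -> int:
    """Return the socket that owns a core, assuming contiguous numbering."""
    if not 0 <= core < NUM_SOCKETS * CORES_PER_SOCKET:
        raise ValueError(f"core {core} is out of range")
    return core // CORES_PER_SOCKET


def is_local_access(core: int, memory: int) -> bool:
    """True if the memory bank is attached to the core's own socket.

    Assumes memory bank i is attached to socket i (an illustrative
    convention, not something stated in the slides).
    """
    if not 0 <= memory < NUM_SOCKETS:
        raise ValueError(f"memory {memory} is out of range")
    return socket_of(core) == memory


# Core 5 sits on socket 1 under this numbering, so Memory 1 is local
# and every other memory bank is remote.
for bank in range(NUM_SOCKETS):
    kind = "local" if is_local_access(5, bank) else "remote"
    print(f"core 5 -> memory {bank}: {kind}")
```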
– Multiple routing options for each one!
[Figures: 4-D torus network topology; fat tree network topology]
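The remark about multiple routing options can be made concrete for a torus. The sketch below computes the minimal hop count between two nodes in each dimension, using the wraparound links, and then counts how many distinct orderings of those hops exist. The torus size of 4 nodes per dimension, the coordinates, and the function names are assumptions chosen for illustration. The count fixes one direction per dimension, so it is a lower bound when a destination is exactly halfway around a ring.

```python
from math import factorial
from typing import List, Sequence


def torus_hops(a: Sequence[int], b: Sequence[int], dims: Sequence[int]) -> List[int]:
    """Minimal hops needed in each dimension, taking wraparound links into account."""
    hops = []
    for x, y, k in zip(a, b, dims):
        d = abs(x - y) % k
        hops.append(min(d, k - d))  # go whichever way around the ring is shorter
    return hops


def minimal_route_count(hops: Sequence[int]) -> int:
    """Number of distinct orderings of the per-dimension hops.

    This is the multinomial coefficient (sum of hops)! / (h1! * h2! * ...).
    """
    total = factorial(sum(hops))
    for h in hops:
        total //= factorial(h)
    return total


# Hypothetical 4 x 4 x 4 x 4 torus and an arbitrary source/destination pair.
DIMS = (4, 4, 4, 4)
src = (0, 0, 0, 0)
dst = (1, 3, 2, 0)

hops = torus_hops(src, dst, DIMS)
print("hops per dimension:", hops)
print("total hops:", sum(hops))
print("minimal routes (lower bound):", minimal_route_count(hops))
```

Even this small example yields more than one shortest path for a single source and destination, which is part of what makes network behavior hard to reason about from end-to-end timings alone.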
Screenshot from Vampir