SLIDE 1

HPCToolkit: Performance Tools for Parallel Scientific Codes

John Mellor-Crummey

Department of Computer Science Rice University johnmc@rice.edu

http://hpctoolkit.org

Building Community Codes for Effective Scientific Research on HPC Platforms, September 7, 2012

SLIDE 2

Challenges for Computational Scientists

  • Execution environments and applications are rapidly evolving

— architecture

– rapidly changing multicore microprocessor designs
– increasing scale of parallel systems
– growing use of accelerators

— applications

– transition from MPI everywhere to threaded implementations
– add additional scientific capabilities
– maintain multiple variants or configurations

  • Computational scientists need to

— assess weaknesses in algorithms and their implementations
— improve scalability of executions within and across nodes
— adapt to changes in emerging architectures

Performance tools can play an important role as a guide

SLIDE 3

Performance Analysis Challenges

  • Complex architectures are hard to use efficiently

— multi-level parallelism: multi-core, ILP, SIMD instructions
— multi-level memory hierarchy
— result: gap between typical and peak performance is huge

  • Complex applications present challenges

— measurement and analysis
— understanding behaviors and tuning performance

  • Supercomputer platforms compound the complexity

— unique hardware
— unique microkernel-based operating systems
— multifaceted performance concerns

– computation
– data movement
– communication
– I/O

SLIDE 4

What Users Want

  • Multi-platform, programming model independent tools
  • Accurate measurement of complex parallel codes

— large, multi-lingual programs
— fully optimized code: loop optimization, templates, inlining
— binary-only libraries, sometimes partially stripped
— complex execution environments

– dynamic loading, static linking
– SPMD parallel codes with threaded node programs
– batch jobs

  • Effective performance analysis

— insightful analysis that pinpoints and explains problems

– correlate measurements with code for actionable results
– support analysis at the desired level: intuitive enough for application scientists and engineers, detailed enough for library developers and compiler writers

  • Scalable to petascale and beyond
SLIDE 5

“We Build It” *

  • HPCToolkit: 160K lines, 797 files

— measurement, data analysis: 110K lines C/C++, scripts; 424 files
— hpcviewer, hpctraceviewer GUIs: 54K lines Java; 373 files

  • HPCToolkit externals: 2.5M lines C/C++, 5782 files

— components developed

– execution control: libmonitor, 7K lines, 35 files
– binary analysis: Open Analysis, 76K lines, 343 files (+ ANL, Colorado)

— components extensively modified

– binary analysis: GNU binutils, 1.44M lines total, 1650 files (448K in bfd)

— other components

– stack unwinding: libunwind
– XML: libxml2, xerces
– understanding binaries: libelf, libdwarf, SymtabAPI

* With support from the US government:
DOE Office of Science: DE-FC02-07ER25800, DE-FC02-06ER25762
LANL: 03891-001-99-4G, 74837-001-03 49, 86192-001-04 49, 12783-001-05 49
AFRL: FA8650-09-C-7915

SLIDE 6

Contributors

  • Current

— staff: Michael Fagan, Mark Krentel, Laksono Adhianto
— students: Xu Liu, Milind Chabbi, Karthik Murthy
— external: Nathan Tallent (PNNL)

  • Alumni

— students: Gabriel Marin (ORNL), Nathan Froyd (Mozilla)
— staff: Rob Fowler (UNC)
— interns: Sinchan Banerjee (MIT), Michael Franco (Rice), Reed Landrum (Stanford), Bowden Kelly (Georgia Tech), Philip Taffet (St. John’s High School)

SLIDE 7

HPCToolkit Approach

  • Employ binary-level measurement and analysis

— observe fully optimized, dynamically linked executions
— support multi-lingual codes with external binary-only libraries

  • Use sampling-based measurement, avoiding instrumentation (sketched below)

— controllable overhead
— minimize systematic error and avoid blind spots
— enable data collection for large-scale parallelism

  • Collect and correlate multiple derived performance metrics

— diagnosis typically requires more than one species of metric

  • Associate metrics with both static and dynamic context

— loop nests, procedures, inlined code, calling context

  • Support top-down performance analysis

— natural approach that minimizes burden on developers
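
As a concrete illustration of timer-based sampling, here is a minimal sketch using POSIX interval timers. This is illustrative only, not hpcrun's implementation; the handler body is elided.

```c
#include <signal.h>
#include <string.h>
#include <sys/time.h>

/* Hypothetical handler: a real profiler would unwind the call stack
   here (see the call path profiling sketch later). */
static void sample_handler(int sig) { (void)sig; /* record one sample */ }

/* Deliver SIGPROF every 5 ms of CPU time (200 samples/second).
   Overhead is proportional to this frequency, not to call frequency. */
static void start_sampling(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = sample_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;            /* restart interrupted syscalls */
    sigaction(SIGPROF, &sa, NULL);

    struct itimerval t = { { 0, 5000 }, { 0, 5000 } };
    setitimer(ITIMER_PROF, &t, NULL);
}
```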

SLIDE 8

Outline

  • Overview of Rice’s HPCToolkit
  • Pinpointing scalability bottlenecks

— scalability bottlenecks on large-scale parallel systems
— scaling on multicore processors

  • Understanding temporal behavior
  • Assessing process variability
  • Understanding threading, GPU, and memory hierarchy

— blame shifting
— attributing memory hierarchy costs to data

  • Summary and conclusions
SLIDE 9

HPCToolkit Workflow

[Workflow diagram: compile & link turns source code into an optimized binary; hpcrun profiles an execution to produce a call path profile; hpcstruct analyzes the binary to recover program structure; hpcprof/hpcprof-mpi interprets the profile and correlates it with source to build a database; hpcviewer/hpctraceviewer present the results]

SLIDE 10

HPCToolkit Workflow

[Workflow diagram repeated from slide 9]

  • For dynamically-linked executables on stock Linux

— compile and link as you usually do

  • For statically-linked executables (e.g. for Blue Gene, Cray)

— add monitoring by using hpclink as a prefix to your link line

SLIDE 11

HPCToolkit Workflow

[Workflow diagram repeated from slide 9]

  • Measure execution unobtrusively

— launch optimized application binaries

– dynamically-linked applications: launch with hpcrun, e.g., mpirun -np 8192 hpcrun -t -e WALLCLOCK@5000 flash3 ...
– statically-linked applications: control with environment variables

— collect statistical call path profiles of events of interest

SLIDE 12

Call Path Profiling

Measure and attribute costs in context: sample timer or hardware counter overflows, then gather the calling context using stack unwinding.

[Diagram: a call path sample consists of the instruction pointer plus the chain of return addresses; samples are merged into a calling context tree]

Overhead is proportional to sampling frequency, not call frequency.
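
To make the mechanism concrete, here is a minimal sketch of gathering one call path with libunwind (one of the components HPCToolkit builds on, per slide 5). It is illustrative only; a production handler must also be async-signal-safe and cope with unwind failures in optimized code.

```c
#define UNW_LOCAL_ONLY
#include <libunwind.h>

/* Collect the current call path as a list of return addresses,
   as a sampling handler would on a timer or counter overflow.
   Illustrative sketch, not HPCToolkit's unwinder. Link with -lunwind. */
static int collect_call_path(unw_word_t *pcs, int max_frames)
{
    unw_context_t ctx;
    unw_cursor_t cursor;
    int n = 0;

    unw_getcontext(&ctx);
    unw_init_local(&cursor, &ctx);
    while (n < max_frames && unw_step(&cursor) > 0)
        unw_get_reg(&cursor, UNW_REG_IP, &pcs[n++]);

    return n;  /* the profiler merges these paths into a calling context tree */
}
```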

SLIDE 13

HPCToolkit Workflow

[Workflow diagram repeated from slide 9]

  • Analyze binary with hpcstruct: recover program structure

— analyze machine code, line map, debugging information
— extract loop nesting & identify inlined procedures
— map transformed loops and procedures to source

SLIDE 14

HPCToolkit Workflow

[Workflow diagram repeated from slide 9]

  • Combine multiple profiles

— multiple threads; multiple processes; multiple executions

  • Correlate metrics to static & dynamic program structure

SLIDE 15

HPCToolkit Workflow

[Workflow diagram repeated from slide 9]

  • Presentation

— explore performance data from multiple perspectives

– rank order by metrics to focus on what’s important
– compute derived metrics to help gain insight, e.g. scalability losses, waste, CPI, bandwidth (see below)

— graph thread-level metrics for contexts
— explore evolution of behavior over time
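
For instance, two common derived metrics can be computed directly from measured quantities. The PAPI event names shown are the standard ones for total cycles and instructions; whether a platform exposes them varies.

```latex
\mathrm{CPI} = \frac{\mathrm{PAPI\_TOT\_CYC}}{\mathrm{PAPI\_TOT\_INS}}
\qquad\qquad
\mathrm{bandwidth} = \frac{\text{bytes moved}}{\text{elapsed time}}
```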

SLIDE 16

Analyzing Chombo@1024PE with hpcviewer

[Screenshot of hpcviewer: source pane, navigation pane, metric pane, view control, and metric display; the view shows costs for inlined procedures, loops, and function calls in full context]

SLIDE 17

Outline

  • Overview of Rice’s HPCToolkit
  • Pinpointing scalability bottlenecks

— scalability bottlenecks on large-scale parallel systems
— scaling on multicore processors

  • Understanding temporal behavior
  • Assessing process variability
  • Understanding threading, GPU, and memory hierarchy

— blame shifting
— attributing memory hierarchy costs to data

  • Summary and conclusions
SLIDE 18

The Problem of Scaling

[Plot: parallel efficiency (0.500 to 1.000) versus CPU count (1 to 65,536); ideal efficiency stays flat while actual efficiency falls off, with a question mark highlighting the gap. Note: higher is better.]

SLIDE 19

Wanted: Scalability Analysis

  • Isolate scalability bottlenecks
  • Guide user to problems
  • Quantify the magnitude of each problem
SLIDE 20

Challenges for Pinpointing Scalability Bottlenecks

  • Parallel applications

— modern software uses layers of libraries
— performance is often context dependent

  • Monitoring

— bottleneck nature: computation, data movement, synchronization?
— 2 pragmatic constraints

– acceptable data volume
– low perturbation for use in production runs

[Example climate code skeleton: main calls ocean, atmosphere, sea ice, and land components, each followed by a wait]

SLIDE 21

Performance Analysis with Expectations

  • You have performance expectations for your parallel code

— strong scaling: linear speedup
— weak scaling: constant execution time

  • Put your expectations to work

— measure performance under different conditions

– e.g. different levels of parallelism or different inputs

— express your expectations as an equation
— compute the deviation from expectations for each calling context

– for both inclusive and exclusive costs

— correlate the metrics with the source code
— explore the annotated call tree interactively

SLIDE 22

Pinpointing and Quantifying Scalability Bottlenecks

For strong scaling from P to Q processors (Q > P), linear speedup means the total cost should stay constant, so the excess work at a calling context n is computed with P and Q as coefficients:

X(n) = Q × C_Q(n) − P × C_P(n)

where C_P(n) and C_Q(n) are the average per-process costs attributed to n in the P- and Q-processor executions. (For weak scaling, where the expectation is constant execution time, the coefficients drop out: X(n) = C_Q(n) − C_P(n).)

[Screenshot: call tree annotated with excess-work metrics]
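
A worked instance of the strong-scaling formula, with hypothetical numbers chosen only to illustrate the arithmetic: suppose a context costs an average of 10 s per process at P = 256 and 3 s per process at Q = 1024. Then

```latex
X(n) = 1024 \times 3\,\mathrm{s} \;-\; 256 \times 10\,\mathrm{s} \;=\; 512\,\mathrm{s}
```

so 512 s of aggregate work at that context is in excess of what linear speedup predicts; dividing by the total cost of the 1024-process run gives the fraction of the execution lost at that context.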

SLIDE 23

Scalability Analysis Demo: FLASH3

  • Parallel, adaptive-mesh refinement (AMR) code
  • Block structured AMR; a block is the unit of computation
  • Designed for compressible reactive flows
  • Can solve a broad range of (astro)physical problems
  • Portable: runs on many massively-parallel systems
  • Scales and performs well
  • Fully modular and extensible: components can be combined to create many different applications

[Figures: cellular detonation; helium burning on neutron stars; laser-driven shock instabilities; nova outbursts on white dwarfs; Rayleigh-Taylor instability; Orszag-Tang MHD vortex; magnetic Rayleigh-Taylor. Figures courtesy of FLASH Team, University of Chicago]

Code: University of Chicago FLASH3
Simulation: white dwarf detonation
Platform: Blue Gene/P
Experiment: 8192 vs. 256 processors
Scaling type: weak

SLIDE 24

Improved Flash Scaling of AMR Setup

[Graph courtesy of Anshu Dubey, U Chicago]

SLIDE 25

Outline

  • Overview of Rice’s HPCToolkit
  • Pinpointing scalability bottlenecks

— scalability bottlenecks on large-scale parallel systems
— scaling on multicore processors

  • Understanding temporal behavior
  • Assessing process variability
  • Understanding threading, GPU, and memory hierarchy

— blame shifting
— attributing memory hierarchy costs to data

  • Summary and conclusions
SLIDE 26

Understanding Temporal Behavior

  • Profiling compresses out the temporal dimension

— temporal patterns, e.g. serialization, are invisible in profiles

  • What can we do? Trace call path samples

— sketch:

– N times per second, take a call path sample of each thread
– organize the samples for each thread along a time line
– view how the execution evolves left to right
– what do we view? assign each procedure a color; view a depth slice of an execution

[Diagram: samples arranged with time on the x axis, processes on the y axis, and call stack depth as the third dimension]
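
One way to picture what such a trace stores, as a minimal sketch (illustrative, not HPCToolkit's trace format): each thread logs a timestamped reference into the calling context tree built during profiling, so every record encodes a full call path in a few bytes.

```c
#include <stdint.h>

/* Illustrative trace record: one (time, call-path) pair per sample. */
typedef struct {
    uint64_t time_ns;  /* when the sample was taken */
    uint32_t cct_id;   /* leaf node in the calling context tree */
} trace_record;
```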

SLIDE 27

hpctraceviewer: detail of FLASH3@256PE

Load imbalance among threads appears as different lengths of colored bands along the x axis

SLIDE 28

Outline

  • Overview of Rice’s HPCToolkit
  • Pinpointing scalability bottlenecks

— scalability bottlenecks on large-scale parallel systems
— scaling on multicore processors

  • Understanding temporal behavior
  • Assessing process variability
  • Understanding threading, GPU, and memory hierarchy

— blame shifting
— attributing memory hierarchy costs to data

  • Summary and conclusions
SLIDE 29

MPBS @ 960 cores, radix sort

Two views of load imbalance, which arises since the run is not on 2^k cores

SLIDE 30

Outline

  • Overview of Rice’s HPCToolkit
  • Pinpointing scalability bottlenecks

— scalability bottlenecks on large-scale parallel systems
— scaling on multicore processors

  • Understanding temporal behavior
  • Assessing process variability
  • Understanding threading, GPU, and memory hierarchy

— blame shifting
— attributing memory hierarchy costs to data

  • Summary and conclusions
SLIDE 31

Blame Shifting

  • Problem: in many circumstances sampling measures symptoms of performance losses rather than causes

— worker threads waiting for work
— threads waiting for a lock
— MPI process waiting for peers in a collective communication
— idle GPU waiting for work

  • Approach: shift blame for losses from victims to perpetrators (see the lock sketch below)

— blame code executing while other threads are idle
— blame code executed by lock holder when thread(s) are waiting
— blame processes that arrive late to collectives
— shift blame between CPU and GPU for hybrid code
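
A minimal sketch of the lock-contention case, with hypothetical names and simplified bookkeeping (HPCToolkit's real implementation differs): waiters deposit cost on the lock, and the holder claims the accumulated blame at release.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Blame-shifting sketch for lock contention. In a real profiler a
   waiter deposits cost only when a sample lands while it is spinning;
   the deposit is shown unconditionally here for brevity. */
typedef struct {
    atomic_int           held;
    atomic_uint_fast64_t blame;   /* cost deposited by waiting threads */
} blame_lock;

void blamed_acquire(blame_lock *l)
{
    int expected = 0;
    while (!atomic_compare_exchange_weak(&l->held, &expected, 1)) {
        expected = 0;
        atomic_fetch_add(&l->blame, 1);  /* "I am waiting": blame the lock */
    }
}

void blamed_release(blame_lock *l)
{
    /* The holder (the perpetrator) accepts all blame accumulated while
       it held the lock; a profiler would charge b to its calling context. */
    uint64_t b = atomic_exchange(&l->blame, 0);
    (void)b;
    atomic_store(&l->held, 0);
}
```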

SLIDES 32-35

Performance Expectations for Hybrid Code with Blame Shifting

[Four figure-only slides]

SLIDE 36

GPU Successes with HPCToolkit

  • LAMMPS: identified a hardware problem with the Keeneland system

— improperly seated GPUs were observed to have lower data copy bandwidth

  • LULESH: identified that dynamic memory allocation using cudaMalloc and cudaFree accounted for 90% of the idleness of the GPU (see the sketch below)
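
The kind of fix that finding points at, as a hedged sketch (the function and its parameters are hypothetical; error handling omitted): hoist device allocation out of the time-step loop so the allocator no longer keeps the GPU idle.

```c
#include <cuda_runtime.h>

/* Sketch: allocate the device buffer once and reuse it across steps,
   instead of calling cudaMalloc/cudaFree inside the loop. */
void timestep_loop(size_t bytes, int nsteps)
{
    void *d_buf = NULL;
    cudaMalloc(&d_buf, bytes);          /* once, not once per step */
    for (int step = 0; step < nsteps; step++) {
        /* ... launch kernels that use d_buf ... */
    }
    cudaFree(d_buf);
}
```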

SLIDE 37
Data Centric Analysis

  • Goal: associate memory hierarchy performance losses with data
  • Approach

— intercept allocations to associate measurements with their data ranges (sketched below)
— measure latency with “instruction-based sampling” (AMD Opteron)
— present quantitative results using hpcviewer
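
A minimal sketch of the interception step (illustrative; record_range is a hypothetical helper, and a production interposer must also handle reentrancy during startup): preload a shared library that wraps malloc and remembers each allocated range.

```c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stddef.h>

/* Hypothetical helper: remember [p, p+size) and the allocation's
   calling context so later samples can be mapped back to this data. */
static void record_range(void *p, size_t size) { (void)p; (void)size; }

/* Interpose on malloc; forward to the real allocator via RTLD_NEXT. */
void *malloc(size_t size)
{
    static void *(*real_malloc)(size_t);
    if (!real_malloc)
        real_malloc = (void *(*)(size_t))dlsym(RTLD_NEXT, "malloc");
    void *p = real_malloc(size);
    if (p) record_range(p, size);
    return p;
}
```

Built as a shared object (cc -shared -fPIC -ldl) and injected with LD_PRELOAD, a wrapper like this sees every allocation in a dynamically linked program.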

SLIDE 38

Data Centric Analysis of S3D

[Screenshot annotations: 41.2% of memory hierarchy latency is related to the yspecies array; yspecies latency for this loop is 14.5% of total latency in the program]

SLIDE 39

Outline

  • Overview of Rice’s HPCToolkit
  • Pinpointing scalability bottlenecks

— scalability bottlenecks on large-scale parallel systems
— scaling on multicore processors

  • Understanding temporal behavior
  • Assessing process variability
  • Understanding threading, GPU, and memory hierarchy

— blame shifting
— attributing memory hierarchy costs to data

  • Summary and conclusions
SLIDE 40

Summary

  • Sampling provides low overhead measurement
  • Call path profiling + binary analysis + blame shifting = insight

— scalability bottlenecks
— where insufficient parallelism lurks
— sources of lock contention
— load imbalance
— temporal dynamics
— bottlenecks in hybrid code
— problematic data structures

  • Other capabilities

— attribute memory leaks back to their full calling context

SLIDE 41

Current Technical Issues

Keeping up with emerging platforms ...

  • Blue Waters (Cray XK6)

— AMD Interlagos introduces new vector instructions

– issue: binary analysis tools need updates (even gdb is ignorant at present!)

— NVIDIA Tesla K20 GPUs

– issue: monitoring and assessing impact of asynchronous activities

— OpenACC programming model

– issue: creates threads that synchronize before main

  • Stampede (Intel MIC)

— new instruction set for MIC
— new programming models

– MIC-only programming
– MIC as MPI rank (ranks with heterogeneous capabilities)
– offload model

— massive threading

  • Keeneland (NVIDIA Tesla K20 GPU)

— multi-socket, multi-GPU nodes

SLIDE 42

A Spectrum of Tool Challenges

  • Intellectual

— research: develop new techniques for measurement, analysis, and presentation

– challenges of emerging systems: increasing scale (e.g. Sequoia); heterogeneity (e.g. host + accelerator, MIC or GPU; AMD Fusion); exploding growth of threading (MIC supports 200+ threads)
– blame shifting: identify causes rather than symptoms
– analyzing asynchronous activities
– measure and analyze all facets of application performance: CPU, accelerator, intra- and inter-node data movement & synchronization, I/O, interaction with hardware, interaction with other jobs, interaction with system software
– provide higher level insight, diagnosis, and guidance

  • Community leadership and engagement

— OS support for tools

– past successes: BG/P CNK and Cray CNL kernels; BG/Q spec; Linux kernel; PMU device drivers
– today: interfaces for observing communication, I/O network issues, data access latency

— standards committees: today, OpenMP tools API; future, OpenCL tools API?
— vendor engagement: NVIDIA, Intel, IBM
— outreach with workshops

  • Software

— development: new instructions for new CPUs; new programming models for heterogeneous systems
— maintenance
— deployment & user support

SLIDE 43

A Community of End Users

  • Government laboratories

— DOE Office of Science Labs

– Argonne National Laboratory
– Lawrence Berkeley National Laboratory
– Oak Ridge National Laboratory
– National Renewable Energy Laboratory

— DOE NNSA Labs

– Sandia National Laboratory
– Los Alamos National Laboratory
– Lawrence Livermore National Laboratory

— DOD Engineering Research Development Center
— TACC
— 25 PerfExpert sites

  • Numerous universities

— University of Utah
— University of Southern California
— Ohio State
— UNC Chapel Hill
— Mississippi State
— Georgia Tech
— ...

  • Companies

— Bull
— Cray Computer
— SAS Institute
— Western Geco
— AMD
— Numerical Algorithms Group
— Shell

  • Foreign centers

— Juelich Supercomputing Centre
— Britain’s Atomic Weapons Establishment
— Australian National University
— Laboratoire d'Aérologie, O.M.P.
— INRIA
— Norwegian Metacenter for High Performance Computing

SLIDE 44

A Community: One Group’s Tools ...

... are another group’s infrastructure

— PerfExpert (University of Texas and Texas State)

– uses HPCToolkit’s measurement and analysis infrastructure to collect hardware performance counter data, analyze application binaries, and attribute performance to routines and loop nests

— bullx (Bull)

– HPCToolkit performance tools are part of the standard bullx software suite deployed on Bull’s supercomputers, with custom extensions to hpcviewer to support cluster analysis

— MIAMI (ORNL)

– uses HPCToolkit’s hpcviewer user interface for interactive analysis of performance modeling results

— OpenSpeedshop (Krell)

– uses HPCToolkit’s libmonitor

— Scott Pakin (LANL)

– uses HPCToolkit’s hpcviewer user interface for interactive analysis of performance measurement data collected with custom instrumentation

SLIDE 45

Software Engineering?

  • Ideally: design abstractions and implement them
  • Harsh reality: performance tools interact with the system at the lowest level, which introduces complexity and design compromises

— absence of proper interfaces and mechanisms in layers we interact with

– blocking system calls that are invisible to timer-based profiling; system calls that can’t be restarted when interrupted
– stripped code (e.g., runtimes for OpenMP, CUDA)
– wrong information from compilers
– libraries (e.g., libc) that lack proper interfaces for wrapping
– helper threads launched during initialization of dynamic libraries
– lack of standard calling conventions in low-level code (e.g., dynamic library loading)
– ...

— experimentally determine what’s possible and how things work

– e.g., how can I get notification when CUDA kernels complete without serializing multiple threads that share a GPU? only a stripped library from NVIDIA; no behavioral documentation

— grow a piece of infrastructure from a prototype

  • With a larger team, we’d re-implement some components from scratch with the benefit of hindsight
  • Many packages we depend upon leave much to be desired, but there is insufficient funding to rewrite them

SLIDE 46

Open Issues

  • Incentives are lacking for community performance tools

— rational model: build reusable tool building blocks
— reality

– funding for research, not development
– components help our competitors more than us

  • Ongoing funding for maintenance, deployment, user support?
  • Continuity?

SLIDE 47

HPCToolkit Capabilities at a Glance

  • Attribute Costs to Code
  • Analyze Behavior over Time
  • Assess Imbalance and Variability
  • Associate Costs with Data
  • Shift Blame from Symptoms to Causes
  • Pinpoint & Quantify Scaling Bottlenecks

hpctoolkit.org