SLIDE 1

HPCToolkit: Performance Tools for Parallel Scientific Codes

John Mellor-Crummey

Department of Computer Science Rice University johnmc@rice.edu

http://hpctoolkit.org

Building Community Codes for Effective Scientific Research on HPC Platforms, September 7, 2012

SLIDE 2

Challenges for Computational Scientists

  • Execution environments and applications are rapidly evolving

— architecture

– rapidly changing multicore microprocessor designs
– increasing scale of parallel systems
– growing use of accelerators

— applications

– transition from MPI everywhere to threaded implementations
– add additional scientific capabilities
– maintain multiple variants or configurations

  • Computational scientists need to

— assess weaknesses in algorithms and their implementations
— improve scalability of executions within and across nodes
— adapt to changes in emerging architectures

Performance tools can play an important role as a guide

SLIDE 3

Performance Analysis Challenges

  • Complex architectures are hard to use efficiently

— multi-level parallelism: multi-core, ILP, SIMD instructions
— multi-level memory hierarchy
— result: gap between typical and peak performance is huge

  • Complex applications present challenges

— measurement and analysis
— understanding behaviors and tuning performance

  • Supercomputer platforms compound the complexity

— unique hardware
— unique microkernel-based operating systems
— multifaceted performance concerns

– computation
– data movement
– communication
– I/O

SLIDE 4

What Users Want

  • Multi-platform, programming model independent tools
  • Accurate measurement of complex parallel codes

— large, multi-lingual programs
— fully optimized code: loop optimization, templates, inlining
— binary-only libraries, sometimes partially stripped
— complex execution environments

– dynamic loading, static linking
– SPMD parallel codes with threaded node programs
– batch jobs

  • Effective performance analysis

— insightful analysis that pinpoints and explains problems

– correlate measurements with code for actionable results
– support analysis at the desired level: intuitive enough for application scientists and engineers, detailed enough for library developers and compiler writers

  • Scalable to petascale and beyond
SLIDE 5

“We Build It” *

  • HPCToolkit: 160K lines, 797 files

— measurement, data analysis: 110K lines C/C++, scripts; 424 files
— hpcviewer, hpctraceviewer GUIs: 54K lines Java; 373 files

  • HPCToolkit externals: 2.5M lines C/C++, 5782 files

— components developed

– execution control: libmonitor, 7K lines, 35 files
– binary analysis: Open Analysis, 76K lines, 343 files (+ ANL, Colorado)

— components extensively modified

– binary analysis: GNU binutils, 1.44M lines total, 1650 files (448K in bfd)

— other components

– stack unwinding: libunwind
– XML: libxml2, xerces
– understanding binaries: libelf, libdwarf, SymtabAPI

* With support from the US government:
DOE Office of Science: DE-FC02-07ER25800, DE-FC02-06ER25762
LANL: 03891-001-99-4G, 74837-001-03 49, 86192-001-04 49, 12783-001-05 49
AFRL: FA8650-09-C-7915

SLIDE 6

Contributors

  • Current

— staff: Michael Fagan, Mark Krentel, Laksono Adhianto
— students: Xu Liu, Milind Chabbi, Karthik Murthy
— external: Nathan Tallent (PNNL)

  • Alumni

— students: Gabriel Marin (ORNL), Nathan Froyd (Mozilla)
— staff: Rob Fowler (UNC)
— interns: Sinchan Banerjee (MIT), Michael Franco (Rice), Reed Landrum (Stanford), Bowden Kelly (Georgia Tech), Philip Taffet (St. John’s High School)

SLIDE 7

HPCToolkit Approach

  • Employ binary-level measurement and analysis

— observe fully optimized, dynamically linked executions
— support multi-lingual codes with external binary-only libraries

  • Use sampling-based measurement, avoiding instrumentation (sketched below)

— controllable overhead
— minimize systematic error and avoid blind spots
— enable data collection for large-scale parallelism

  • Collect and correlate multiple derived performance metrics

— diagnosis typically requires more than one species of metric

  • Associate metrics with both static and dynamic context

— loop nests, procedures, inlined code, calling context

  • Support top-down performance analysis

— natural approach that minimizes burden on developers
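
As a concrete illustration of timer-based sampling, here is a minimal sketch using POSIX interval timers. This is illustrative only, not hpcrun's implementation; the handler body is elided.

```c
#include <signal.h>
#include <string.h>
#include <sys/time.h>

/* Hypothetical handler: a real profiler would unwind the call stack
   here (see the call path profiling sketch later). */
static void sample_handler(int sig) { (void)sig; /* record one sample */ }

/* Deliver SIGPROF every 5 ms of CPU time (200 samples/second).
   Overhead is proportional to this frequency, not to call frequency. */
static void start_sampling(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = sample_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;            /* restart interrupted syscalls */
    sigaction(SIGPROF, &sa, NULL);

    struct itimerval t = { { 0, 5000 }, { 0, 5000 } };
    setitimer(ITIMER_PROF, &t, NULL);
}
```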

SLIDE 8

Outline

  • Overview of Rice’s HPCToolkit
  • Pinpointing scalability bottlenecks

— scalability bottlenecks on large-scale parallel systems
— scaling on multicore processors

  • Understanding temporal behavior
  • Assessing process variability
  • Understanding threading, GPU, and memory hierarchy

— blame shifting
— attributing memory hierarchy costs to data

  • Summary and conclusions
SLIDE 9

HPCToolkit Workflow

[Workflow diagram: compile & link turns source code into an optimized binary; hpcrun profiles an execution to produce a call path profile; hpcstruct analyzes the binary to recover program structure; hpcprof/hpcprof-mpi interprets the profile and correlates it with source to build a database; hpcviewer/hpctraceviewer present the results]

SLIDE 10

HPCToolkit Workflow

[Workflow diagram repeated from slide 9]

  • For dynamically-linked executables on stock Linux

— compile and link as you usually do

  • For statically-linked executables (e.g. for Blue Gene, Cray)

— add monitoring by using hpclink as a prefix to your link line

SLIDE 11

HPCToolkit Workflow

[Workflow diagram repeated from slide 9]

  • Measure execution unobtrusively

— launch optimized application binaries

– dynamically-linked applications: launch with hpcrun, e.g., mpirun -np 8192 hpcrun -t -e WALLCLOCK@5000 flash3 ...
– statically-linked applications: control with environment variables

— collect statistical call path profiles of events of interest

SLIDE 12

Call Path Profiling

Measure and attribute costs in context: sample timer or hardware counter overflows, then gather the calling context using stack unwinding.

[Diagram: a call path sample consists of the instruction pointer plus the chain of return addresses; samples are merged into a calling context tree]

Overhead is proportional to sampling frequency, not call frequency.
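
To make the mechanism concrete, here is a minimal sketch of gathering one call path with libunwind (one of the components HPCToolkit builds on, per slide 5). It is illustrative only; a production handler must also be async-signal-safe and cope with unwind failures in optimized code.

```c
#define UNW_LOCAL_ONLY
#include <libunwind.h>

/* Collect the current call path as a list of return addresses,
   as a sampling handler would on a timer or counter overflow.
   Illustrative sketch, not HPCToolkit's unwinder. Link with -lunwind. */
static int collect_call_path(unw_word_t *pcs, int max_frames)
{
    unw_context_t ctx;
    unw_cursor_t cursor;
    int n = 0;

    unw_getcontext(&ctx);
    unw_init_local(&cursor, &ctx);
    while (n < max_frames && unw_step(&cursor) > 0)
        unw_get_reg(&cursor, UNW_REG_IP, &pcs[n++]);

    return n;  /* the profiler merges these paths into a calling context tree */
}
```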

SLIDE 13

HPCToolkit Workflow

[Workflow diagram repeated from slide 9]

  • Analyze binary with hpcstruct: recover program structure

— analyze machine code, line map, debugging information
— extract loop nesting & identify inlined procedures
— map transformed loops and procedures to source

SLIDE 14

HPCToolkit Workflow

[Workflow diagram repeated from slide 9]

  • Combine multiple profiles

— multiple threads; multiple processes; multiple executions

  • Correlate metrics to static & dynamic program structure

SLIDE 15

HPCToolkit Workflow

[Workflow diagram repeated from slide 9]

  • Presentation

— explore performance data from multiple perspectives

– rank order by metrics to focus on what’s important
– compute derived metrics to help gain insight, e.g. scalability losses, waste, CPI, bandwidth (see below)

— graph thread-level metrics for contexts
— explore evolution of behavior over time
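
For instance, two common derived metrics can be computed directly from measured quantities. The PAPI event names shown are the standard ones for total cycles and instructions; whether a platform exposes them varies.

```latex
\mathrm{CPI} = \frac{\mathrm{PAPI\_TOT\_CYC}}{\mathrm{PAPI\_TOT\_INS}}
\qquad\qquad
\mathrm{bandwidth} = \frac{\text{bytes moved}}{\text{elapsed time}}
```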

SLIDE 16

Analyzing Chombo@1024PE with hpcviewer

[Screenshot of hpcviewer: source pane, navigation pane, metric pane, view control, and metric display; the view shows costs for inlined procedures, loops, and function calls in full context]

SLIDE 17

Outline

  • Overview of Rice’s HPCToolkit
  • Pinpointing scalability bottlenecks

— scalability bottlenecks on large-scale parallel systems
— scaling on multicore processors

  • Understanding temporal behavior
  • Assessing process variability
  • Understanding threading, GPU, and memory hierarchy

— blame shifting
— attributing memory hierarchy costs to data

  • Summary and conclusions
SLIDE 18

The Problem of Scaling

[Plot: parallel efficiency (0.500 to 1.000) versus CPU count (1 to 65,536); ideal efficiency stays flat while actual efficiency falls off, with a question mark highlighting the gap. Note: higher is better.]

SLIDE 19

Wanted: Scalability Analysis

  • Isolate scalability bottlenecks
  • Guide user to problems
  • Quantify the magnitude of each problem
SLIDE 20

Challenges for Pinpointing Scalability Bottlenecks

  • Parallel applications

— modern software uses layers of libraries
— performance is often context dependent

  • Monitoring

— bottleneck nature: computation, data movement, synchronization?
— 2 pragmatic constraints

– acceptable data volume
– low perturbation for use in production runs

[Example climate code skeleton: main calls ocean, atmosphere, sea ice, and land components, each followed by a wait]

SLIDE 21

Performance Analysis with Expectations

  • You have performance expectations for your parallel code

— strong scaling: linear speedup
— weak scaling: constant execution time

  • Put your expectations to work

— measure performance under different conditions

– e.g. different levels of parallelism or different inputs

— express your expectations as an equation
— compute the deviation from expectations for each calling context

– for both inclusive and exclusive costs

— correlate the metrics with the source code
— explore the annotated call tree interactively

SLIDE 22

Pinpointing and Quantifying Scalability Bottlenecks

For strong scaling from P to Q processors (Q > P), linear speedup means the total cost should stay constant, so the excess work at a calling context n is computed with P and Q as coefficients:

X(n) = Q × C_Q(n) − P × C_P(n)

where C_P(n) and C_Q(n) are the average per-process costs attributed to n in the P- and Q-processor executions. (For weak scaling, where the expectation is constant execution time, the coefficients drop out: X(n) = C_Q(n) − C_P(n).)

[Screenshot: call tree annotated with excess-work metrics]
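
A worked instance of the strong-scaling formula, with hypothetical numbers chosen only to illustrate the arithmetic: suppose a context costs an average of 10 s per process at P = 256 and 3 s per process at Q = 1024. Then

```latex
X(n) = 1024 \times 3\,\mathrm{s} \;-\; 256 \times 10\,\mathrm{s} \;=\; 512\,\mathrm{s}
```

so 512 s of aggregate work at that context is in excess of what linear speedup predicts; dividing by the total cost of the 1024-process run gives the fraction of the execution lost at that context.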

SLIDE 23

Scalability Analysis Demo: FLASH3

  • Parallel, adaptive-mesh refinement (AMR) code
  • Block structured AMR; a block is the unit of computation
  • Designed for compressible reactive flows
  • Can solve a broad range of (astro)physical problems
  • Portable: runs on many massively-parallel systems
  • Scales and performs well
  • Fully modular and extensible: components can be combined to create many different applications

[Figures: cellular detonation; helium burning on neutron stars; laser-driven shock instabilities; nova outbursts on white dwarfs; Rayleigh-Taylor instability; Orszag-Tang MHD vortex; magnetic Rayleigh-Taylor. Figures courtesy of FLASH Team, University of Chicago]

Code: University of Chicago FLASH3
Simulation: white dwarf detonation
Platform: Blue Gene/P
Experiment: 8192 vs. 256 processors
Scaling type: weak

SLIDE 24

Improved Flash Scaling of AMR Setup

[Graph courtesy of Anshu Dubey, U Chicago]

SLIDE 25

Outline

  • Overview of Rice’s HPCToolkit
  • Pinpointing scalability bottlenecks

— scalability bottlenecks on large-scale parallel systems
— scaling on multicore processors

  • Understanding temporal behavior
  • Assessing process variability
  • Understanding threading, GPU, and memory hierarchy

— blame shifting
— attributing memory hierarchy costs to data

  • Summary and conclusions
SLIDE 26

Understanding Temporal Behavior

  • Profiling compresses out the temporal dimension

— temporal patterns, e.g. serialization, are invisible in profiles

  • What can we do? Trace call path samples

— sketch:

– N times per second, take a call path sample of each thread
– organize the samples for each thread along a time line
– view how the execution evolves left to right
– what do we view? assign each procedure a color; view a depth slice of an execution

[Diagram: samples arranged with time on the x axis, processes on the y axis, and call stack depth as the third dimension]
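
One way to picture what such a trace stores, as a minimal sketch (illustrative, not HPCToolkit's trace format): each thread logs a timestamped reference into the calling context tree built during profiling, so every record encodes a full call path in a few bytes.

```c
#include <stdint.h>

/* Illustrative trace record: one (time, call-path) pair per sample. */
typedef struct {
    uint64_t time_ns;  /* when the sample was taken */
    uint32_t cct_id;   /* leaf node in the calling context tree */
} trace_record;
```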

SLIDE 27

hpctraceviewer: detail of FLASH3@256PE

Load imbalance among threads appears as different lengths of colored bands along the x axis

SLIDE 28

Outline

  • Overview of Rice’s HPCToolkit
  • Pinpointing scalability bottlenecks

— scalability bottlenecks on large-scale parallel systems
— scaling on multicore processors

  • Understanding temporal behavior
  • Assessing process variability
  • Understanding threading, GPU, and memory hierarchy

— blame shifting
— attributing memory hierarchy costs to data

  • Summary and conclusions
SLIDE 29

MPBS @ 960 cores, radix sort

Two views of load imbalance, which arises since the run is not on 2^k cores

SLIDE 30

Outline

  • Overview of Rice’s HPCToolkit
  • Pinpointing scalability bottlenecks

— scalability bottlenecks on large-scale parallel systems
— scaling on multicore processors

  • Understanding temporal behavior
  • Assessing process variability
  • Understanding threading, GPU, and memory hierarchy

— blame shifting
— attributing memory hierarchy costs to data

  • Summary and conclusions
SLIDE 31

Blame Shifting

  • Problem: in many circumstances sampling measures symptoms of performance losses rather than causes

— worker threads waiting for work
— threads waiting for a lock
— MPI process waiting for peers in a collective communication
— idle GPU waiting for work

  • Approach: shift blame for losses from victims to perpetrators (see the lock sketch below)

— blame code executing while other threads are idle
— blame code executed by lock holder when thread(s) are waiting
— blame processes that arrive late to collectives
— shift blame between CPU and GPU for hybrid code
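
A minimal sketch of the lock-contention case, with hypothetical names and simplified bookkeeping (HPCToolkit's real implementation differs): waiters deposit cost on the lock, and the holder claims the accumulated blame at release.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Blame-shifting sketch for lock contention. In a real profiler a
   waiter deposits cost only when a sample lands while it is spinning;
   the deposit is shown unconditionally here for brevity. */
typedef struct {
    atomic_int           held;
    atomic_uint_fast64_t blame;   /* cost deposited by waiting threads */
} blame_lock;

void blamed_acquire(blame_lock *l)
{
    int expected = 0;
    while (!atomic_compare_exchange_weak(&l->held, &expected, 1)) {
        expected = 0;
        atomic_fetch_add(&l->blame, 1);  /* "I am waiting": blame the lock */
    }
}

void blamed_release(blame_lock *l)
{
    /* The holder (the perpetrator) accepts all blame accumulated while
       it held the lock; a profiler would charge b to its calling context. */
    uint64_t b = atomic_exchange(&l->blame, 0);
    (void)b;
    atomic_store(&l->held, 0);
}
```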

SLIDES 32-35

Performance Expectations for Hybrid Code with Blame Shifting

[Four figure-only slides]

SLIDE 36

GPU Successes with HPCToolkit

  • LAMMPS: identified a hardware problem with the Keeneland system

— improperly seated GPUs were observed to have lower data copy bandwidth

  • LULESH: identified that dynamic memory allocation using cudaMalloc and cudaFree accounted for 90% of the idleness of the GPU (see the sketch below)
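
The kind of fix that finding points at, as a hedged sketch (the function and its parameters are hypothetical; error handling omitted): hoist device allocation out of the time-step loop so the allocator no longer keeps the GPU idle.

```c
#include <cuda_runtime.h>

/* Sketch: allocate the device buffer once and reuse it across steps,
   instead of calling cudaMalloc/cudaFree inside the loop. */
void timestep_loop(size_t bytes, int nsteps)
{
    void *d_buf = NULL;
    cudaMalloc(&d_buf, bytes);          /* once, not once per step */
    for (int step = 0; step < nsteps; step++) {
        /* ... launch kernels that use d_buf ... */
    }
    cudaFree(d_buf);
}
```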

SLIDE 37
Data Centric Analysis

  • Goal: associate memory hierarchy performance losses with data
  • Approach

— intercept allocations to associate measurements with their data ranges (sketched below)
— measure latency with “instruction-based sampling” (AMD Opteron)
— present quantitative results using hpcviewer
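
A minimal sketch of the interception step (illustrative; record_range is a hypothetical helper, and a production interposer must also handle reentrancy during startup): preload a shared library that wraps malloc and remembers each allocated range.

```c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stddef.h>

/* Hypothetical helper: remember [p, p+size) and the allocation's
   calling context so later samples can be mapped back to this data. */
static void record_range(void *p, size_t size) { (void)p; (void)size; }

/* Interpose on malloc; forward to the real allocator via RTLD_NEXT. */
void *malloc(size_t size)
{
    static void *(*real_malloc)(size_t);
    if (!real_malloc)
        real_malloc = (void *(*)(size_t))dlsym(RTLD_NEXT, "malloc");
    void *p = real_malloc(size);
    if (p) record_range(p, size);
    return p;
}
```

Built as a shared object (cc -shared -fPIC -ldl) and injected with LD_PRELOAD, a wrapper like this sees every allocation in a dynamically linked program.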

SLIDE 38

Data Centric Analysis of S3D

[Screenshot annotations: 41.2% of memory hierarchy latency is related to the yspecies array; yspecies latency for this loop is 14.5% of total latency in the program]

SLIDE 39

Outline

  • Overview of Rice’s HPCToolkit
  • Pinpointing scalability bottlenecks

— scalability bottlenecks on large-scale parallel systems
— scaling on multicore processors

  • Understanding temporal behavior
  • Assessing process variability
  • Understanding threading, GPU, and memory hierarchy

— blame shifting
— attributing memory hierarchy costs to data

  • Summary and conclusions
SLIDE 40

Summary

  • Sampling provides low overhead measurement
  • Call path profiling + binary analysis + blame shifting = insight

— scalability bottlenecks
— where insufficient parallelism lurks
— sources of lock contention
— load imbalance
— temporal dynamics
— bottlenecks in hybrid code
— problematic data structures

  • Other capabilities

— attribute memory leaks back to their full calling context

SLIDE 41

Current Technical Issues

Keeping up with emerging platforms ...

  • Blue Waters (Cray XK6)

— AMD Interlagos introduces new vector instructions

– issue: binary analysis tools need updates (even gdb is ignorant at present!)

— NVIDIA Tesla K20 GPUs

– issue: monitoring and assessing impact of asynchronous activities

— OpenACC programming model

– issue: creates threads that synchronize before main

  • Stampede (Intel MIC)

— new instruction set for MIC
— new programming models

– MIC-only programming
– MIC as MPI rank (ranks with heterogeneous capabilities)
– offload model

— massive threading

  • Keeneland (NVIDIA Tesla K20 GPU)

— multi-socket, multi-GPU nodes

SLIDE 42

A Spectrum of Tool Challenges

  • Intellectual

— research: develop new techniques for measurement, analysis, and presentation

– challenges of emerging systems: increasing scale (e.g. Sequoia); heterogeneity (e.g. host + accelerator, MIC or GPU; AMD Fusion); exploding growth of threading (MIC supports 200+ threads)
– blame shifting: identify causes rather than symptoms
– analyzing asynchronous activities
– measure and analyze all facets of application performance: CPU, accelerator, intra- and inter-node data movement & synchronization, I/O, interaction with hardware, interaction with other jobs, interaction with system software
– provide higher level insight, diagnosis, and guidance

  • Community leadership and engagement

— OS support for tools

– past successes: BG/P CNK and Cray CNL kernels; BG/Q spec; Linux kernel; PMU device drivers
– today: interfaces for observing communication, I/O network issues, data access latency

— standards committees: today, OpenMP tools API; future, OpenCL tools API?
— vendor engagement: NVIDIA, Intel, IBM
— outreach with workshops

  • Software

— development: new instructions for new CPUs; new programming models for heterogeneous systems
— maintenance
— deployment & user support

SLIDE 43

A Community of End Users

  • Government laboratories

— DOE Office of Science Labs

– Argonne National Laboratory
– Lawrence Berkeley National Laboratory
– Oak Ridge National Laboratory
– National Renewable Energy Laboratory

— DOE NNSA Labs

– Sandia National Laboratory
– Los Alamos National Laboratory
– Lawrence Livermore National Laboratory

— DOD Engineering Research Development Center
— TACC
— 25 PerfExpert sites

  • Numerous universities

— University of Utah
— University of Southern California
— Ohio State
— UNC Chapel Hill
— Mississippi State
— Georgia Tech
— ...

  • Companies

— Bull
— Cray Computer
— SAS Institute
— Western Geco
— AMD
— Numerical Algorithms Group
— Shell

  • Foreign centers

— Juelich Supercomputing Centre
— Britain’s Atomic Weapons Establishment
— Australian National University
— Laboratoire d'Aérologie, O.M.P.
— INRIA
— Norwegian Metacenter for High Performance Computing

SLIDE 44

A Community: One Group’s Tools ...

... are another group’s infrastructure

— PerfExpert (University of Texas and Texas State)

– uses HPCToolkit’s measurement and analysis infrastructure to collect hardware performance counter data, analyze application binaries, and attribute performance to routines and loop nests

— bullx (Bull)

– HPCToolkit performance tools are part of the standard bullx software suite deployed on Bull’s supercomputers, with custom extensions to hpcviewer to support cluster analysis

— MIAMI (ORNL)

– uses HPCToolkit’s hpcviewer user interface for interactive analysis of performance modeling results

— OpenSpeedshop (Krell)

– uses HPCToolkit’s libmonitor

— Scott Pakin (LANL)

– uses HPCToolkit’s hpcviewer user interface for interactive analysis of performance measurement data collected with custom instrumentation

SLIDE 45

Software Engineering?

  • Ideally: design abstractions and implement them
  • Harsh reality: performance tools interact with the system at the lowest level, which introduces complexity and design compromises

— absence of proper interfaces and mechanisms in layers we interact with

– blocking system calls that are invisible to timer-based profiling; system calls that can’t be restarted when interrupted
– stripped code (e.g., runtimes for OpenMP, CUDA)
– wrong information from compilers
– libraries (e.g., libc) that lack proper interfaces for wrapping
– helper threads launched during initialization of dynamic libraries
– lack of standard calling conventions in low-level code (e.g., dynamic library loading)
– ...

— experimentally determine what’s possible and how things work

– e.g., how can I get notification when CUDA kernels complete without serializing multiple threads that share a GPU? only a stripped library from NVIDIA; no behavioral documentation

— grow a piece of infrastructure from a prototype

  • With a larger team, we’d re-implement some components from scratch with the benefit of hindsight
  • Many packages we depend upon leave much to be desired, but there is insufficient funding to rewrite them

SLIDE 46

Open Issues

  • Incentives are lacking for community performance tools

— rational model: build reusable tool building blocks
— reality

– funding for research, not development
– components help our competitors more than us

  • Ongoing funding for maintenance, deployment, user support?
  • Continuity?

SLIDE 47

HPCToolkit Capabilities at a Glance

  • Attribute Costs to Code
  • Analyze Behavior over Time
  • Assess Imbalance and Variability
  • Associate Costs with Data
  • Shift Blame from Symptoms to Causes
  • Pinpoint & Quantify Scaling Bottlenecks

hpctoolkit.org