SLIDE 1

Using Sampling to Understand Parallel Program Performance

hpctoolkit.org

Parallel Tools Workshop • TU Dresden • Sept 26-27, 2011

Nathan Tallent
John Mellor-Crummey

  • M. Krentel, L. Adhianto, M. Fagan, X. Liu
  • Dept. of Computer Science, Rice University

SLIDE 2

Advertisement: WHIST 2012

  • International Workshop on High-performance Infrastructure for Scalable Tools

— in conjunction with PPoPP 2012
— February 25-29, 2012
— New Orleans, LA, USA

  • All things tools:

— special emphasis on performance and scalability
— special emphasis on infrastructure

  • how do we build tools?
  • often not welcome in other venues
  • No more CScADS Tools Workshop; come to WHIST instead!


whist-workshop.org

SLIDE 3

For Measurement, Sampling is Obvious...

  • Instrumentation (synchronous)

— monitors every instance of something
— must be used with care

  • instrumenting every procedure ➟ large overheads

– may preclude measurement of production runs
– encourages selective instrumentation ➟ blind spots

  • instrumentation disproportionately affects small procedures (systematic error)

  • cannot measure at the statement level

— note: specialized techniques may reduce overheads

  • sample instrumentation (switch between heavyweight & lightweight instrumentation)
  • dynamically insert & remove binary instrumentation
  • Sampling (asynchronous)

— provides representative and detailed measurements

  • assumes sufficient samples; no correlations (usually easy to satisfy)


Instrumentation types

  • source code
  • compiler-inserted
  • static binary
  • dynamic binary

Use sampling when possible; use instrumentation when necessary.

SLIDE 4

… but: Sampling is Also Questionable

  • How to attribute metrics to loops?

— attribution to outer loops trivial with source-code instrumentation

  • How to attribute metrics to data objects?
  • How to collect call paths at arbitrary sample points?

— stack unwinding can be hard: try using libunwind or GLIBC backtrace() in a SEGV handler for optimized code

  • How to collect call paths for languages that don’t use stacks?

— perfect stack unwinding not sufficient

  • Will sampling miss something important?

— it is possible to miss the first cause in a chain of events

  • How to pinpoint lock contention?

— with instrumentation, it is easy to track lock acquire/release

  • Can sampling pinpoint the causes of lock contention?

SLIDE 5
Outline

  • Motivation
  • Call Path Profiling
  • Pinpointing Scaling Bottlenecks
  • Blame Shifting
    — parallel idleness and overhead in work stealing
    — lock contention
    — load imbalance
  • Call Path Tracing
  • Conclusions

SLIDE 6
Sampling-based Call Path Profiling is Easy...*

  • Use timer or hardware counter as a sampling source
  • Gather calling context using stack unwinding
  • Measure & attribute costs in calling context

[Figure: a call path sample (the instruction pointer plus a chain of return addresses up to “main”) is inserted into a Calling Context Tree (CCT)]

Overhead proportional to sampling frequency... ...not call frequency
CCT size scales well
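To make the mechanism concrete, here is a minimal sketch of timer-driven call path sampling; it uses glibc's backtrace() as a stand-in for a real unwinder and is illustrative only, not HPCToolkit's implementation:

#include <execinfo.h>
#include <signal.h>
#include <sys/time.h>

#define MAX_FRAMES 64

/* Asynchronous sample handler: capture the current calling context.    */
/* (backtrace() is not async-signal-safe; robust tools use their own    */
/*  unwinder, which is exactly the difficulty the next slides discuss.) */
static void sample_handler(int sig) {
  void *frames[MAX_FRAMES];
  int depth = backtrace(frames, MAX_FRAMES);
  /* a real profiler would insert frames[0..depth) into its CCT here */
  (void) sig; (void) depth;
}

int main(void) {
  struct sigaction sa = { .sa_handler = sample_handler };
  sigaction(SIGPROF, &sa, 0);

  /* deliver SIGPROF every 5 ms of CPU time: roughly 200 samples/second */
  struct itimerval it = { .it_interval = { 0, 5000 }, .it_value = { 0, 5000 } };
  setitimer(ITIMER_PROF, &it, 0);

  for (volatile long i = 0; i < 500000000L; i++) ;   /* work to be sampled */
  return 0;
}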

SLIDE 7

… but: Hard to Unwind From Async Sample

  • Asynchronous sampling:

— ✓ overhead proportional to sampling frequency

  • low, controllable
  • no sample => irrelevant to performance

— ✓ overhead effects are not systematic
— ✗ must be able to unwind from every point in an executable

  • (use call stack unwinding to gather calling context)
  • Why is unwinding difficult? Requires either:

— frame pointers (linking every frame on the stack)

  • omitted in fully optimized code

— complete (and correct) unwind information

  • omitted for epilogues
  • erroneous (optimizers fail to maintain info while applying xforms)
  • missing for hand-coded assembly or partially stripped libraries


Dynamically analyze binary to compute unwind information

  • often spend significant time in such routines!

SLIDE 8

Unwinding Fully Optimized Parallel Code

  • Identify procedure bounds

— for dynamically linked code, do at runtime
— for statically linked code, do at compile time

  • Compute (on demand) unwind recipes for a procedure:

— scan the procedure’s object code, tracking the locations of

  • caller’s program counter
  • caller’s frame and stack pointer

— create unwind recipes between pairs of frame-relevant instructions

  • Processors: x86-64, x86, Power/BGP, MIPS (SiCortex ☹)
  • Results

— accurate call path profiles
— overheads of < 2% for sampling frequencies of 200/s


PLDI 2009. Distinguished Paper Award.

SLIDE 9

Attributing Measurements to Source

  • Compilers create semantic gap between binary and source

— call paths differ from user-level

  • inlining, tail calls

— loops differ from user-level

  • software pipelining, unroll & jam, blocking, etc.
  • Must bridge this semantic gap

— to be useful, a tool should

  • attribute binary-level measurements to source-level abstractions
  • How?

— cannot use instrumentation to measure user-level loops


Statically analyze binary to compute an object-to-source-code mapping

SLIDE 10

Attribution to Static & Dynamic Context

1-2% overhead

Costs for:

  • inlined procedures
  • loops
  • function calls in full context

PLDI 2009. Distinguished Paper Award.

SLIDE 11
Outline

  • Motivation
  • Call Path Profiling
  • Pinpointing Scaling Bottlenecks
  • Blame Shifting
    — parallel idleness and overhead in work stealing
    — lock contention
    — load imbalance
  • Call Path Tracing
  • Conclusions

SLIDE 12
The Challenges of Scale: O(100K) cores

  • What if my app doesn’t scale? Where is the bottleneck?
  • First step: Performance tools must scale!
    — measure at scale
      • sample processes (SPMD apps)
        – record data on a process with probability p (see the sketch below)
        – simplification of Gamblin et al., IPDPS ’08
    — analyze measurements
      • analyze measurements in parallel
      • data structure has space requirements of O(1 call path tree)
    — present performance data (without requiring an Altix)
      • use above data structure
  • Next step: Deliver insight
    — identify scaling bottlenecks in context…
      • use detailed sampling-based call path profile

simple and effective
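A minimal sketch of the process-sampling step described above (the function name and the seeding scheme are illustrative assumptions, not HPCToolkit's code):

#include <mpi.h>
#include <stdlib.h>

/* Decide once at startup whether this rank records profile data.     */
/* With probability p per rank, about p * N of N ranks are measured.  */
static int profiling_enabled(double p) {
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  srand48(1000003L * rank + 17);   /* decorrelate the ranks' coin flips */
  return drand48() < p;
}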

SLIDE 13

Pinpointing & Quantifying Scalability Bottlenecks

[Figure: scaling loss is computed by differencing the average CCT of a Q-core run against the average CCT of a P-core run; weak scaling uses no coefficients, strong scaling weights each CCT by its core count (the red coefficients)]

Coarfa, Mellor-Crummey, Froyd, Dotsenko. ICS’07 SC 2009
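A sketch of the excess-work formulas behind the figure, in notation chosen here (C_P(n) is the average inclusive cost of CCT node n on P cores, T_Q the total time on Q cores); the formulas follow the spirit of the cited ICS'07 analysis, but the exact normalization is an assumption:

% weak scaling (no coefficients): ideally C_Q(n) = C_P(n)
\[ L_{\mathrm{weak}}(n) = \frac{C_Q(n) - C_P(n)}{T_Q} \]
% strong scaling (coefficients P and Q): ideally Q * C_Q(n) = P * C_P(n)
\[ L_{\mathrm{strong}}(n) = \frac{Q\,C_Q(n) - P\,C_P(n)}{Q\,T_Q} \]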

SLIDE 14

FLASH: Top-down View of Scaling Losses

Weak scaling on BG/P, 256 to 8192 cores

21% of scaling loss is due to looping over all MPI ranks to build the adaptive mesh

SLIDE 15

Improved Flash Scaling of AMR Setup

Graph courtesy of Anshu Dubey, U Chicago. Note: lower is better.

SLIDE 16
Outline

  • Motivation
  • Call Path Profiling
  • Pinpointing Scaling Bottlenecks
  • Blame Shifting
    — parallel idleness and overhead in work stealing
    — lock contention
    — load imbalance
  • Call Path Tracing
  • Conclusions

SLIDE 17

Blame Shifting

  • Problem: sampling often measures symptoms of performance losses rather than causes
    — worker threads waiting for work
    — threads waiting for a lock
    — MPI process waiting for peers in a collective communication
  • Approach: shift blame from victims to perpetrators
  • Flavors
    — active measurement
    — post-mortem analysis only


SLIDE 18

Cilk: An Influential Multithreaded Language

cilk int fib(n) {
  if (n < 2) return n;
  else {
    int x, y;
    x = spawn fib(n-1);
    y = spawn fib(n-2);
    sync;
    return (x + y);
  }
}

[Figure: the tree of logical tasks created by recursive spawns of fib]

spawn: asynchronous call; creates a logical task that only blocks at a sync

spawn + recursion (fib): quickly creates significant logical parallelism

To map logical tasks to compute cores: lazy thread creation + work-stealing scheduler

SLIDE 19

Call Path Profiling of Cilk: Stack ≠ Path

[Figure: the fib spawn tree mapped onto three worker threads; each thread’s call stack holds only a fragment of the source-level calling context]

Problem: Work stealing separates source-level calling contexts in space and time

Logical call path profiling: Recover the full relationship between physical and source-level execution

Consider thread 3:
  • physical call path
  • logical call path

SLIDE 20
What If My Cilk Program Is Slow?

  • Possible problems:
    — parallel idleness: parallelism is too coarse-grained
    — parallel overhead: parallelism is too fine-grained
    — idleness and overhead: wrong algorithm
  • Try logical call path profiling:
    — measuring idleness
      • samples accumulate in the scheduler!
      • no source-level insight!
    — measuring overhead
      • how?!
      • work and overhead are both machine instructions

while (noWork) {
  t = random thread;
  try stealing from t;
}
save-live-vars();
offer-task();
user-work();
retract-task();

You spent time here!

Goal: Insightfully attribute idleness and overhead to logical contexts

PPoPP 2009 • IEEE Computer 12/09

SLIDE 21

Pinpointing Idleness via Blame Shifting

  • Effort = work + idleness

— parallel idleness: when a thread (core) is idle or blocked

  • Blame shifting: blame idleness on its cause

— before: blame idleness on self (victim)
— now: blame idleness on working threads (perpetrators)

  • blame workers because they do not create parallel tasks
  • Blame shifting requires ‘third party’ info (are others working?)

— maintain a small amount of global state: # of workers, # idlers

  • (add atomic increment/decrement to scheduler)
  • Shifting blame:

— on sampling a working thread t:

  • t.work += 1
  • t.idleness += # idlers / # workers

— on sampling an idle thread t:

  • do nothing! (others will claim the idleness)

apportion blame for current idleness
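A minimal sketch of this per-sample accounting (illustrative, not HPCToolkit's code; it assumes the scheduler maintains two shared atomic counters and a per-thread working flag):

#include <stdatomic.h>

/* Global scheduler state (updated with atomic inc/dec on state changes). */
static atomic_int n_working, n_idle;

/* Per-thread metrics; a real tool attributes them to the calling context. */
static _Thread_local double my_work, my_idleness;
static _Thread_local int i_am_working;     /* maintained by the scheduler */

/* Asynchronous sample handler for thread t. */
static void on_sample(void) {
  if (i_am_working) {
    int idlers  = atomic_load(&n_idle);
    int workers = atomic_load(&n_working);
    my_work += 1;
    if (workers > 0)
      my_idleness += (double) idlers / workers;  /* claim a share of current idleness */
  }
  /* idle thread: do nothing -- the working threads claim its idleness */
}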

SLIDE 22

Pinpointing Overhead

  • Work = useful-work + overhead

— overhead: when a thread works on non-user code

  • How to distinguish useful-work from overhead?

— instrument to distinguish between useful-work & overhead

  • too costly!
  • Insight: tag instructions contributing to overhead

— post-mortem analysis:

  • partition samples into useful-work and overhead

no measurement overhead!
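A minimal post-mortem sketch of that partitioning, assuming the address ranges of tagged overhead code (e.g., spawn and steal machinery) were recorded earlier; all names are illustrative:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uintptr_t lo, hi; } range_t;    /* tagged overhead code */

static bool is_overhead(uintptr_t ip, const range_t *tags, size_t n) {
  for (size_t i = 0; i < n; i++)
    if (ip >= tags[i].lo && ip < tags[i].hi) return true;
  return false;
}

/* Post-mortem: split one thread's samples into useful work and overhead. */
static void partition(const uintptr_t *ips, size_t n_samples,
                      const range_t *tags, size_t n_tags,
                      long *useful, long *overhead) {
  *useful = *overhead = 0;
  for (size_t i = 0; i < n_samples; i++) {
    if (is_overhead(ips[i], tags, n_tags)) (*overhead)++;
    else (*useful)++;
  }
}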

SLIDE 23

Bottom-up Idleness for Cilk ‘Cholesky’


Pinpoints serial initialization/finalization routines.

We can pinpoint and quantify the effect of serialization.


SLIDE 24
Outline

  • Motivation
  • Call Path Profiling
  • Pinpointing Scaling Bottlenecks
  • Blame Shifting
    — parallel idleness and overhead in work stealing
    — lock contention
    — load imbalance
  • Call Path Tracing
  • Conclusions

SLIDE 25

Lock Contention is a Problem!

  • Locks are widely used

— fine-grain locking is the gold-standard for performance
— explicit threading (e.g., Pthreads)
— implementing higher-level models

  • critical sections in OpenMP
  • software transactional memory
  • Key performance problem: Lock contention

— when a thread (and therefore core) is idle, waiting for a lock

  • Lock contention => parallel idleness

— chief form of parallel idleness in many programs


SLIDE 26

Pinpoint Lock Contention via Blame Shifting

  • Goal: precisely blame contention on its cause

— But: tool must not itself become a bottleneck!

  • Could adapt work-stealing strategy, but...

— imprecise: blames all workers (rather than one) for idleness

  • Insight: lock = precise communication channel (shared state)
  • Shifting blame to perpetrators:

— on sampling a working thread t:

  • t.work += 1

— on sampling an idle thread t waiting for lock L:

  • L.idleness += 1 (atomic add)

— when working thread t releases lock L:

  • t.idleness += atomic-swap(L.idleness, 0)
  • unwind the call stack to locate lock contention in calling context

Associate idleness with lock, not thread.
The swapped value exactly represents contention while t held lock L.
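A minimal sketch of this protocol (illustrative, not HPCToolkit's code; it assumes a per-lock idleness counter stored alongside the lock):

#include <pthread.h>
#include <stdatomic.h>

typedef struct {
  pthread_mutex_t mtx;
  atomic_long idleness;     /* samples accumulated by waiters of this lock */
} blamed_lock_t;

static _Thread_local long my_work, my_idleness;
static _Thread_local blamed_lock_t *waiting_on;   /* lock we are blocked on, if any */

/* Asynchronous sample handler for thread t. */
static void on_sample(void) {
  if (waiting_on)
    atomic_fetch_add(&waiting_on->idleness, 1);   /* charge the lock, not ourselves */
  else
    my_work += 1;
}

static void blamed_lock(blamed_lock_t *l) {
  waiting_on = l;
  pthread_mutex_lock(&l->mtx);
  waiting_on = NULL;
}

static void blamed_unlock(blamed_lock_t *l) {
  /* claim, in our own calling context, all idleness accrued while we held l */
  my_idleness += atomic_exchange(&l->idleness, 0);
  pthread_mutex_unlock(&l->mtx);
}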

SLIDE 27

Lock Contention in MADNESS

Lock contention accounts for 23.5% of execution time.

Adding futures to shared global work queue.
Quantum chemistry; MPI + pthreads
16 cores; 1 thread/core (4 x Barcelona)
[screenshot metric units: µs]

  • 65M distinct locks
  • max. of 340K live locks
  • 30K lock acquisitions/sec/thread

1-5% overhead

PPoPP 2010

SLIDE 28
Outline

  • Motivation
  • Call Path Profiling
  • Pinpointing Scaling Bottlenecks
  • Blame Shifting
    — parallel idleness and overhead in work stealing
    — lock contention
    — load imbalance
  • Call Path Tracing
  • Conclusions

SLIDE 29

What’s The Root Cause of Load Imbalance?


  • Most prior root-cause analyses use trace-based measurement

— example: SCALASCA

  • Traces are wonderful, but difficult to scale
  • Question: Can we find root causes of load imbalance in large-scale executions using call path profiling?

— ✓ sample-based call path profiles scale and are very detailed
— ✗ but they collapse potential causal chains

SLIDE 30

Blame-Shift Idleness in Call Path Profiles

  • 1. Identify idleness (exposed waiting): All imbalance is manifested in idleness (e.g., MPI_Cray_Progress_Wait)
  • 2. Identify balance points (procedures or loops): A balance point cannot contribute to imbalance
  • 3. Blame-shift idleness (effect) on closest ancestor balance point (nearer to cause)
  • 4. Context-sensitive scatter plots...

Precise! Post mortem! 0 measurement overhead!

SLIDE 31

PFLOTRAN

  • 1. Drill down ‘hot path’ to a balance point.
  • 2. Notice top two call sites...
  • 3. Plot per-rank context-sensitive cycle values: Early finishers... ...become early arrivers at Allreduce

Imbalance: blame-shifted idleness.
8K cores, Cray XT5
SC 2010

SLIDE 32
Outline

  • Motivation
  • Call Path Profiling
  • Pinpointing Scaling Bottlenecks
  • Blame Shifting
    — parallel idleness and overhead in work stealing
    — lock contention
    — load imbalance
  • Call Path Tracing
  • Conclusions

SLIDE 33

Understanding Temporal Behavior

  • Time-dependent behavior is often invisible in profiles

— but tracing is difficult to scale to long or large executions

  • What can we do? Trace call path samples:

— on each sample, record call path of each thread
— organize the samples for each thread along a time line
— view how the execution hierarchically evolves

  • assign each procedure a color; view a depth slice of an execution

— use sampling to scalably render large-scale traces

[Figure: trace view with Processes and Time axes, plus a call-stack depth slice down to “main”]

ICS 2011

SLIDE 34

Exposing Temporal Call Path Patterns


PFLOTRAN, 8184 processes, Cray XT5

[Figure: left, process-time view at selected depth (MPI rank × time); right, depth-time view for selected rank (call path depth × time)]

SLIDE 35

Presenting Large Traces on Small Displays

  • How to render an arbitrary portion of an arbitrarily large trace?

— we have a display window of dimensions h × w
— typically many more processes (or threads) than h
— typically many more samples (trace records) than w

  • Solution: sample the samples! (see the sketch below)

[Figure: a trace with n processes (p1 … pi … pn) over time is down-sampled to an h × w process-time image; each sample of samples defines a pixel]

On a laptop!
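A minimal sketch of the "sample the samples" rendering step; call_path_at() is an assumed accessor for the trace, not a real API:

#include <stddef.h>
#include <stdint.h>

/* Assumed accessor: identity of the procedure at the given depth of the
   call path that process 'p' was executing at time 't' (illustrative).  */
extern uint32_t call_path_at(size_t p, double t, int depth);

/* Render an h x w view of [t0,t1) x [p0,p1) by sampling the samples:
   one trace lookup per pixel, regardless of trace size.                 */
static void render(uint32_t *pixels, int h, int w,
                   size_t p0, size_t p1, double t0, double t1, int depth) {
  for (int row = 0; row < h; row++) {
    size_t p = p0 + (size_t)((double)row * (p1 - p0) / h);   /* sample processes */
    for (int col = 0; col < w; col++) {
      double t = t0 + (col + 0.5) * (t1 - t0) / w;           /* sample time */
      pixels[row * w + col] = call_path_at(p, t, depth);     /* procedure id -> color */
    }
  }
}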

SLIDE 36

Will Sampling Miss Something Important?


  • Sampling may miss the precise cause of an anomaly...

— but, important anomalies will have (local/non-local) effects

  • Sampling exposes effects of the important anomalies

Using sampling for both measurement and presentation clearly exposed the problem.

In an unusual execution, 8184 processes took 190 s to complete MPI_Init! (FLASH, JaguarPF, Cray XT5)
[Figure annotation: a lagging process...]

SLIDE 37
Outline

  • Motivation
  • Call Path Profiling
  • Pinpointing Scaling Bottlenecks
  • Blame Shifting
    — parallel idleness and overhead in work stealing
    — lock contention
    — load imbalance
  • Call Path Tracing
  • Conclusions

SLIDE 38

Conclusions

  • Obtain insight, accuracy & precision by combining sampling, call path profiling, binary analysis, and blame-shifting
  • Show surprisingly effective measurement and source-level attribution for fully optimized code (1-3% overhead)
    — statements in their full static and dynamic context
    — project low-level measurements to much higher levels
  • Sampling-based measurements can deliver insight into
    — scalability bottlenecks
    — parallel inefficiency in work-stealing
    — causes of lock contention
    — load imbalance
    — temporal patterns
    — problematic data structures (work by Xu Liu)

  • Obtain insight on very different applications and architectures


hpctoolkit.org

SLIDE 39

HPCToolkit Capabilities at a Glance

  • Attribute Costs to Code
  • Analyze Behavior over Time
  • Assess Imbalance and Variability
  • Associate Costs with Data
  • Shift Blame from Symptoms to Causes
  • Pinpoint & Quantify Scaling Bottlenecks

hpctoolkit.org
