SLIDE 1

Using Sampling to Understand Parallel Program Performance

hpctoolkit.org

Parallel Tools Workshop • TU Dresden • Sept 26-27, 2011

Nathan Tallent
John Mellor-Crummey

  • M. Krentel, L. Adhianto, M. Fagan, X. Liu
  • Dept. of Computer Science, Rice University

SLIDE 2

Advertisement: WHIST 2012

  • International Workshop on High-performance Infrastructure for Scalable Tools

— in conjunction with PPoPP 2012
— February 25-29, 2012
— New Orleans, LA, USA

  • All things tools:

— special emphasis on performance and scalability
— special emphasis on infrastructure

  • how do we build tools?
  • often not welcome in other venues
  • No more CScADS Tools Workshop; come to WHIST instead!


whist-workshop.org

SLIDE 3

For Measurement, Sampling is Obvious...

  • Instrumentation (synchronous)

— monitors every instance of something
— must be used with care

  • instrumenting every procedure ➟ large overheads

– may preclude measurement of production runs
– encourages selective instrumentation ➟ blind spots

  • instrumentation disproportionately affects small procedures (systematic error)

  • cannot measure at the statement level

— note: specialized techniques may reduce overheads

  • sample instrumentation (switch between heavyweight & lightweight instrumentation)
  • dynamically insert & remove binary instrumentation
  • Sampling (asynchronous)

— provides representative and detailed measurements

  • assumes sufficient samples; no correlations (usually easy to satisfy)


Instrumentation types

  • source code
  • compiler-inserted
  • static binary
  • dynamic binary

Use sampling when possible; use instrumentation when necessary.

SLIDE 4

… but: Sampling is Also Questionable

  • How to attribute metrics to loops?

— attribution to outer loops trivial with source-code instrumentation

  • How to attribute metrics to data objects?
  • How to collect call paths at arbitrary sample points?

— stack unwinding can be hard: try using libunwind or GLIBC backtrace() in a SEGV handler for optimized code

  • How to collect call paths for languages that don’t use stacks?

— perfect stack unwinding not sufficient

  • Will sampling miss something important?

— it is possible to miss the first cause in a chain of events

  • How to pinpoint lock contention?

— with instrumentation, it is easy to track lock acquire/release

  • Can sampling pinpoint the causes of lock contention?

SLIDE 5
Outline

  • Motivation
  • Call Path Profiling
  • Pinpointing Scaling Bottlenecks
  • Blame Shifting
    — parallel idleness and overhead in work stealing
    — lock contention
    — load imbalance
  • Call Path Tracing
  • Conclusions

SLIDE 6
Sampling-based Call Path Profiling is Easy...*

  • Use timer or hardware counter as a sampling source
  • Gather calling context using stack unwinding
  • Measure & attribute costs in calling context

[Figure: a call path sample (the instruction pointer plus a chain of return addresses up to “main”) is inserted into a Calling Context Tree (CCT)]

Overhead proportional to sampling frequency... ...not call frequency
CCT size scales well
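To make the mechanism concrete, here is a minimal sketch of timer-driven call path sampling; it uses glibc's backtrace() as a stand-in for a real unwinder and is illustrative only, not HPCToolkit's implementation:

#include <execinfo.h>
#include <signal.h>
#include <sys/time.h>

#define MAX_FRAMES 64

/* Asynchronous sample handler: capture the current calling context.    */
/* (backtrace() is not async-signal-safe; robust tools use their own    */
/*  unwinder, which is exactly the difficulty the next slides discuss.) */
static void sample_handler(int sig) {
  void *frames[MAX_FRAMES];
  int depth = backtrace(frames, MAX_FRAMES);
  /* a real profiler would insert frames[0..depth) into its CCT here */
  (void) sig; (void) depth;
}

int main(void) {
  struct sigaction sa = { .sa_handler = sample_handler };
  sigaction(SIGPROF, &sa, 0);

  /* deliver SIGPROF every 5 ms of CPU time: roughly 200 samples/second */
  struct itimerval it = { .it_interval = { 0, 5000 }, .it_value = { 0, 5000 } };
  setitimer(ITIMER_PROF, &it, 0);

  for (volatile long i = 0; i < 500000000L; i++) ;   /* work to be sampled */
  return 0;
}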

SLIDE 7

… but: Hard to Unwind From Async Sample

  • Asynchronous sampling:

— ✓ overhead proportional to sampling frequency

  • low, controllable
  • no sample => irrelevant to performance

— ✓ overhead effects are not systematic
— ✗ must be able to unwind from every point in an executable

  • (use call stack unwinding to gather calling context)
  • Why is unwinding difficult? Requires either:

— frame pointers (linking every frame on the stack)

  • omitted in fully optimized code

— complete (and correct) unwind information

  • omitted for epilogues
  • erroneous (optimizers fail to maintain info while applying xforms)
  • missing for hand-coded assembly or partially stripped libraries


Dynamically analyze binary to compute unwind information

  • often spend significant time in such routines!

SLIDE 8

Unwinding Fully Optimized Parallel Code

  • Identify procedure bounds

— for dynamically linked code, do at runtime
— for statically linked code, do at compile time

  • Compute (on demand) unwind recipes for a procedure:

— scan the procedure’s object code, tracking the locations of

  • caller’s program counter
  • caller’s frame and stack pointer

— create unwind recipes between pairs of frame-relevant instructions

  • Processors: x86-64, x86, Power/BGP, MIPS (SiCortex ☹)
  • Results

— accurate call path profiles
— overheads of < 2% for sampling frequencies of 200/s


PLDI 2009. Distinguished Paper Award.

SLIDE 9

Attributing Measurements to Source

  • Compilers create semantic gap between binary and source

— call paths differ from user-level

  • inlining, tail calls

— loops differ from user-level

  • software pipelining, unroll & jam, blocking, etc.
  • Must bridge this semantic gap

— to be useful, a tool should

  • attribute binary-level measurements to source-level abstractions
  • How?

— cannot use instrumentation to measure user-level loops


Statically analyze binary to compute an object-to-source-code mapping

SLIDE 10

Attribution to Static & Dynamic Context

1-2% overhead

Costs for:

  • inlined procedures
  • loops
  • function calls in full context

PLDI 2009. Distinguished Paper Award.

SLIDE 11
Outline

  • Motivation
  • Call Path Profiling
  • Pinpointing Scaling Bottlenecks
  • Blame Shifting
    — parallel idleness and overhead in work stealing
    — lock contention
    — load imbalance
  • Call Path Tracing
  • Conclusions

SLIDE 12
The Challenges of Scale: O(100K) cores

  • What if my app doesn’t scale? Where is the bottleneck?
  • First step: Performance tools must scale!
    — measure at scale
      • sample processes (SPMD apps)
        – record data on a process with probability p (see the sketch below)
        – simplification of Gamblin et al., IPDPS ’08
    — analyze measurements
      • analyze measurements in parallel
      • data structure has space requirements of O(1 call path tree)
    — present performance data (without requiring an Altix)
      • use above data structure
  • Next step: Deliver insight
    — identify scaling bottlenecks in context…
      • use detailed sampling-based call path profile

simple and effective
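A minimal sketch of the process-sampling step described above (the function name and the seeding scheme are illustrative assumptions, not HPCToolkit's code):

#include <mpi.h>
#include <stdlib.h>

/* Decide once at startup whether this rank records profile data.     */
/* With probability p per rank, about p * N of N ranks are measured.  */
static int profiling_enabled(double p) {
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  srand48(1000003L * rank + 17);   /* decorrelate the ranks' coin flips */
  return drand48() < p;
}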

SLIDE 13

Pinpointing & Quantifying Scalability Bottlenecks

[Figure: scaling loss is computed by differencing the average CCT of a Q-core run against the average CCT of a P-core run; weak scaling uses no coefficients, strong scaling weights each CCT by its core count (the red coefficients)]

Coarfa, Mellor-Crummey, Froyd, Dotsenko. ICS’07 SC 2009
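A sketch of the excess-work formulas behind the figure, in notation chosen here (C_P(n) is the average inclusive cost of CCT node n on P cores, T_Q the total time on Q cores); the formulas follow the spirit of the cited ICS'07 analysis, but the exact normalization is an assumption:

% weak scaling (no coefficients): ideally C_Q(n) = C_P(n)
\[ L_{\mathrm{weak}}(n) = \frac{C_Q(n) - C_P(n)}{T_Q} \]
% strong scaling (coefficients P and Q): ideally Q * C_Q(n) = P * C_P(n)
\[ L_{\mathrm{strong}}(n) = \frac{Q\,C_Q(n) - P\,C_P(n)}{Q\,T_Q} \]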

SLIDE 14

FLASH: Top-down View of Scaling Losses

Weak scaling on BG/P, 256 to 8192 cores

21% of scaling loss is due to looping over all MPI ranks to build the adaptive mesh

SLIDE 15

Improved Flash Scaling of AMR Setup

Graph courtesy of Anshu Dubey, U Chicago. Note: lower is better.

SLIDE 16
Outline

  • Motivation
  • Call Path Profiling
  • Pinpointing Scaling Bottlenecks
  • Blame Shifting
    — parallel idleness and overhead in work stealing
    — lock contention
    — load imbalance
  • Call Path Tracing
  • Conclusions

SLIDE 17

Blame Shifting

  • Problem: sampling often measures symptoms of performance losses rather than causes
    — worker threads waiting for work
    — threads waiting for a lock
    — MPI process waiting for peers in a collective communication
  • Approach: shift blame from victims to perpetrators
  • Flavors
    — active measurement
    — post-mortem analysis only


SLIDE 18

Cilk: An Influential Multithreaded Language

cilk int fib(n) {
  if (n < 2) return n;
  else {
    int x, y;
    x = spawn fib(n-1);
    y = spawn fib(n-2);
    sync;
    return (x + y);
  }
}

[Figure: the tree of logical tasks created by recursive spawns of fib]

spawn: asynchronous call; creates a logical task that only blocks at a sync

spawn + recursion (fib): quickly creates significant logical parallelism

To map logical tasks to compute cores: lazy thread creation + work-stealing scheduler

SLIDE 19

Call Path Profiling of Cilk: Stack ≠ Path

[Figure: the fib spawn tree mapped onto three worker threads; each thread’s call stack holds only a fragment of the source-level calling context]

Problem: Work stealing separates source-level calling contexts in space and time

Logical call path profiling: Recover the full relationship between physical and source-level execution

Consider thread 3:
  • physical call path
  • logical call path

SLIDE 20
What If My Cilk Program Is Slow?

  • Possible problems:
    — parallel idleness: parallelism is too coarse-grained
    — parallel overhead: parallelism is too fine-grained
    — idleness and overhead: wrong algorithm
  • Try logical call path profiling:
    — measuring idleness
      • samples accumulate in the scheduler!
      • no source-level insight!
    — measuring overhead
      • how?!
      • work and overhead are both machine instructions

while (noWork) {
  t = random thread;
  try stealing from t;
}
save-live-vars();
offer-task();
user-work();
retract-task();

You spent time here!

Goal: Insightfully attribute idleness and overhead to logical contexts

PPoPP 2009 • IEEE Computer 12/09

SLIDE 21

Pinpointing Idleness via Blame Shifting

  • Effort = work + idleness

— parallel idleness: when a thread (core) is idle or blocked

  • Blame shifting: blame idleness on its cause

— before: blame idleness on self (victim)
— now: blame idleness on working threads (perpetrators)

  • blame workers because they do not create parallel tasks
  • Blame shifting requires ‘third party’ info (are others working?)

— maintain a small amount of global state: # of workers, # idlers

  • (add atomic increment/decrement to scheduler)
  • Shifting blame:

— on sampling a working thread t:

  • t.work += 1
  • t.idleness += # idlers / # workers

— on sampling an idle thread t:

  • do nothing! (others will claim the idleness)

apportion blame for current idleness
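A minimal sketch of this per-sample accounting (illustrative, not HPCToolkit's code; it assumes the scheduler maintains two shared atomic counters and a per-thread working flag):

#include <stdatomic.h>

/* Global scheduler state (updated with atomic inc/dec on state changes). */
static atomic_int n_working, n_idle;

/* Per-thread metrics; a real tool attributes them to the calling context. */
static _Thread_local double my_work, my_idleness;
static _Thread_local int i_am_working;     /* maintained by the scheduler */

/* Asynchronous sample handler for thread t. */
static void on_sample(void) {
  if (i_am_working) {
    int idlers  = atomic_load(&n_idle);
    int workers = atomic_load(&n_working);
    my_work += 1;
    if (workers > 0)
      my_idleness += (double) idlers / workers;  /* claim a share of current idleness */
  }
  /* idle thread: do nothing -- the working threads claim its idleness */
}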

SLIDE 22

Pinpointing Overhead

  • Work = useful-work + overhead

— overhead: when a thread works on non-user code

  • How to distinguish useful-work from overhead?

— instrument to distinguish between useful-work & overhead

  • too costly!
  • Insight: tag instructions contributing to overhead

— post-mortem analysis:

  • partition samples into useful-work and overhead

no measurement overhead!
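A minimal post-mortem sketch of that partitioning, assuming the address ranges of tagged overhead code (e.g., spawn and steal machinery) were recorded earlier; all names are illustrative:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uintptr_t lo, hi; } range_t;    /* tagged overhead code */

static bool is_overhead(uintptr_t ip, const range_t *tags, size_t n) {
  for (size_t i = 0; i < n; i++)
    if (ip >= tags[i].lo && ip < tags[i].hi) return true;
  return false;
}

/* Post-mortem: split one thread's samples into useful work and overhead. */
static void partition(const uintptr_t *ips, size_t n_samples,
                      const range_t *tags, size_t n_tags,
                      long *useful, long *overhead) {
  *useful = *overhead = 0;
  for (size_t i = 0; i < n_samples; i++) {
    if (is_overhead(ips[i], tags, n_tags)) (*overhead)++;
    else (*useful)++;
  }
}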

SLIDE 23

Bottom-up Idleness for Cilk ‘Cholesky’


Pinpoints serial initialization/finalization routines.

We can pinpoint and quantify the effect of serialization.


SLIDE 24
Outline

  • Motivation
  • Call Path Profiling
  • Pinpointing Scaling Bottlenecks
  • Blame Shifting
    — parallel idleness and overhead in work stealing
    — lock contention
    — load imbalance
  • Call Path Tracing
  • Conclusions

SLIDE 25

Lock Contention is a Problem!

  • Locks are widely used

— fine-grain locking is the gold-standard for performance
— explicit threading (e.g., Pthreads)
— implementing higher-level models

  • critical sections in OpenMP
  • software transactional memory
  • Key performance problem: Lock contention

— when a thread (and therefore core) is idle, waiting for a lock

  • Lock contention => parallel idleness

— chief form of parallel idleness in many programs


SLIDE 26

Pinpoint Lock Contention via Blame Shifting

  • Goal: precisely blame contention on its cause

— But: tool must not itself become a bottleneck!

  • Could adapt work-stealing strategy, but...

— imprecise: blames all workers (rather than one) for idleness

  • Insight: lock = precise communication channel (shared state)
  • Shifting blame to perpetrators:

— on sampling a working thread t:

  • t.work += 1

— on sampling an idle thread t waiting for lock L:

  • L.idleness += 1 (atomic add)

— when working thread t releases lock L:

  • t.idleness += atomic-swap(L.idleness, 0)
  • unwind the call stack to locate lock contention in calling context

Associate idleness with lock, not thread.
The swapped value exactly represents contention while t held lock L.
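A minimal sketch of this protocol (illustrative, not HPCToolkit's code; it assumes a per-lock idleness counter stored alongside the lock):

#include <pthread.h>
#include <stdatomic.h>

typedef struct {
  pthread_mutex_t mtx;
  atomic_long idleness;     /* samples accumulated by waiters of this lock */
} blamed_lock_t;

static _Thread_local long my_work, my_idleness;
static _Thread_local blamed_lock_t *waiting_on;   /* lock we are blocked on, if any */

/* Asynchronous sample handler for thread t. */
static void on_sample(void) {
  if (waiting_on)
    atomic_fetch_add(&waiting_on->idleness, 1);   /* charge the lock, not ourselves */
  else
    my_work += 1;
}

static void blamed_lock(blamed_lock_t *l) {
  waiting_on = l;
  pthread_mutex_lock(&l->mtx);
  waiting_on = NULL;
}

static void blamed_unlock(blamed_lock_t *l) {
  /* claim, in our own calling context, all idleness accrued while we held l */
  my_idleness += atomic_exchange(&l->idleness, 0);
  pthread_mutex_unlock(&l->mtx);
}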

SLIDE 27

Lock Contention in MADNESS

Lock contention accounts for 23.5% of execution time.

Adding futures to shared global work queue.
Quantum chemistry; MPI + pthreads
16 cores; 1 thread/core (4 x Barcelona)
[screenshot metric units: µs]

  • 65M distinct locks
  • max. of 340K live locks
  • 30K lock acquisitions/sec/thread

1-5% overhead

PPoPP 2010

SLIDE 28
Outline

  • Motivation
  • Call Path Profiling
  • Pinpointing Scaling Bottlenecks
  • Blame Shifting
    — parallel idleness and overhead in work stealing
    — lock contention
    — load imbalance
  • Call Path Tracing
  • Conclusions

SLIDE 29

What’s The Root Cause of Load Imbalance?


  • Most prior root-cause analyses use trace-based measurement

— example: SCALASCA

  • Traces are wonderful, but difficult to scale
  • Question: Can we find root causes of load imbalance in large-scale executions using call path profiling?

— ✓ sample-based call path profiles scale and are very detailed
— ✗ but they collapse potential causal chains

SLIDE 30

Blame-Shift Idleness in Call Path Profiles

  • 1. Identify idleness (exposed waiting): All imbalance is manifested in idleness (e.g., MPI_Cray_Progress_Wait)
  • 2. Identify balance points (procedures or loops): A balance point cannot contribute to imbalance
  • 3. Blame-shift idleness (effect) on closest ancestor balance point (nearer to cause)
  • 4. Context-sensitive scatter plots...

Precise! Post mortem! 0 measurement overhead!

SLIDE 31

PFLOTRAN

  • 1. Drill down ‘hot path’ to a balance point.
  • 2. Notice top two call sites...
  • 3. Plot per-rank context-sensitive cycle values: Early finishers... ...become early arrivers at Allreduce

Imbalance: blame-shifted idleness.
8K cores, Cray XT5
SC 2010

SLIDE 32
Outline

  • Motivation
  • Call Path Profiling
  • Pinpointing Scaling Bottlenecks
  • Blame Shifting
    — parallel idleness and overhead in work stealing
    — lock contention
    — load imbalance
  • Call Path Tracing
  • Conclusions

SLIDE 33

Understanding Temporal Behavior

  • Time-dependent behavior is often invisible in profiles

— but tracing is difficult to scale to long or large executions

  • What can we do? Trace call path samples:

— on each sample, record call path of each thread
— organize the samples for each thread along a time line
— view how the execution hierarchically evolves

  • assign each procedure a color; view a depth slice of an execution

— use sampling to scalably render large-scale traces

[Figure: trace view with Processes and Time axes, plus a call-stack depth slice down to “main”]

ICS 2011

SLIDE 34

Exposing Temporal Call Path Patterns


PFLOTRAN, 8184 processes, Cray XT5

[Figure: left, process-time view at selected depth (MPI rank × time); right, depth-time view for selected rank (call path depth × time)]

SLIDE 35

Presenting Large Traces on Small Displays

  • How to render an arbitrary portion of an arbitrarily large trace?

— we have a display window of dimensions h × w
— typically many more processes (or threads) than h
— typically many more samples (trace records) than w

  • Solution: sample the samples! (see the sketch below)

[Figure: a trace with n processes (p1 … pi … pn) over time is down-sampled to an h × w process-time image; each sample of samples defines a pixel]

On a laptop!
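A minimal sketch of the "sample the samples" rendering step; call_path_at() is an assumed accessor for the trace, not a real API:

#include <stddef.h>
#include <stdint.h>

/* Assumed accessor: identity of the procedure at the given depth of the
   call path that process 'p' was executing at time 't' (illustrative).  */
extern uint32_t call_path_at(size_t p, double t, int depth);

/* Render an h x w view of [t0,t1) x [p0,p1) by sampling the samples:
   one trace lookup per pixel, regardless of trace size.                 */
static void render(uint32_t *pixels, int h, int w,
                   size_t p0, size_t p1, double t0, double t1, int depth) {
  for (int row = 0; row < h; row++) {
    size_t p = p0 + (size_t)((double)row * (p1 - p0) / h);   /* sample processes */
    for (int col = 0; col < w; col++) {
      double t = t0 + (col + 0.5) * (t1 - t0) / w;           /* sample time */
      pixels[row * w + col] = call_path_at(p, t, depth);     /* procedure id -> color */
    }
  }
}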

SLIDE 36

Will Sampling Miss Something Important?


  • Sampling may miss the precise cause of an anomaly...

— but, important anomalies will have (local/non-local) effects

  • Sampling exposes effects of the important anomalies

Using sampling for both measurement and presentation clearly exposed the problem.

In an unusual execution, 8184 processes took 190 s to complete MPI_Init! (FLASH, JaguarPF, Cray XT5)
[Figure annotation: a lagging process...]

SLIDE 37
Outline

  • Motivation
  • Call Path Profiling
  • Pinpointing Scaling Bottlenecks
  • Blame Shifting
    — parallel idleness and overhead in work stealing
    — lock contention
    — load imbalance
  • Call Path Tracing
  • Conclusions

SLIDE 38

Conclusions

  • Obtain insight, accuracy & precision by combining sampling, call path profiling, binary analysis, and blame-shifting
  • Show surprisingly effective measurement and source-level attribution for fully optimized code (1-3% overhead)
    — statements in their full static and dynamic context
    — project low-level measurements to much higher levels
  • Sampling-based measurements can deliver insight into
    — scalability bottlenecks
    — parallel inefficiency in work-stealing
    — causes of lock contention
    — load imbalance
    — temporal patterns
    — problematic data structures (work by Xu Liu)

  • Obtain insight on very different applications and architectures


hpctoolkit.org

SLIDE 39

HPCToolkit Capabilities at a Glance

  • Attribute Costs to Code
  • Analyze Behavior over Time
  • Assess Imbalance and Variability
  • Associate Costs with Data
  • Shift Blame from Symptoms to Causes
  • Pinpoint & Quantify Scaling Bottlenecks

hpctoolkit.org
