A Case for Parallelism Profilers and Advisers with What-If Analyses


SLIDE 1

A Case for Parallelism Profilers and Advisers with What-If Analyses

Santosh Nagarakatte

Rutgers University, USA @ Workshop on Dependable and Secure Software Systems @ ETH Zurich, October 2019

SLIDE 2

Is Parallel Programming Hard, And, If So, What Can You Do About It?

“Parallel programming has earned a reputation as one of the most difficult areas a hacker can tackle. Papers and textbooks warn of the perils of deadlock, livelock, race conditions, non-determinism, Amdahl’s-Law limits to scaling, and excessive realtime latencies. And these perils are quite real; we authors have accumulated uncounted years of experience dealing with them, and all of the emotional scars, grey hairs, and hair loss that go with such experiences.” [McKenney: arXiv 2017]

Main reasons: use of the wrong abstraction, and a lack of performance analysis and debugging tools.

SLIDE 3

Illustrative Example

A student in my class is asked to write a parallel program:
  • Given a range of integers (0 to n)
  • Find all the prime numbers in the range
  • Perform a computation on the primes
  • Output the result

SLIDE 4

#pragma omp parallel for
for (int i = 0; i < n; ++i)
  compute(i);

Incremental parallelization

Feature rich: Work-Sharing, Tasking, SIMD, Offload

Illustrative Example – Writing a Parallel Program

SLIDE 5

Illustrative Example

A student in my class:
  • Divides the range (1, 2, 3, 4, … n) into 4 parts and performs the computation
  • Identifies the number of processors on the machine (4)
  • Run: ./primes. Speedup: 1.8X over serial execution

Load imbalance. Why?

SLIDE 6

Need to write Performance Portable Code - Advocacy for Task Parallelism

Express all the parallelism as tasks. A runtime dynamically balances load by assigning tasks to idle threads.

[Figure: the range 1 … n is split into many tasks T1, T2, T3, …, Tm; the runtime assigns tasks to processors P1, P2, …, Pk]

SLIDE 7

Illustrative Example

A student in my class expresses the parallel work in terms of tasks T1, T2, T3, …, Tm over the range 1 … n.

Run: ./primes_tasks. Speedup: 3.8X over serial execution on 4 cores.

Is it performance portable?

SLIDE 8

Performance Debugging Tools

GProf, Coz, OProfile, Intel VTune, Arm MAP, NVProf, Intel Advisor

  • Most of them provide information on frequently executed regions.
  • Critical path information is useful.
  • Coz [SOSP 2015]: identifies whether a line of code matters in increasing speedup on a given machine.

SLIDE 9

Our Parallelism Profilers and Advisers: TaskProf & OMP-WHIP [FSE 2017, SC 2018, PLDI 2019]

  • Making a case for measuring logical parallelism
    Series-parallel relations + fine-grained measurements as a performance model
  • Where should the programmer focus? (Profiler)
    Regions with low parallelism imply serialization. Critical path!
  • Does it matter? (Adviser)
    Automatically identify regions whose optimization increases parallelism to a threshold
    What-if analyses: mimic the effect of parallelization
    Differential analyses to identify regions with secondary effects

General across multiple parallelism models. This talk focuses on OpenMP.

SLIDE 10

Performance Model for Logical Parallelism and What-If Analyses


SLIDE 11

Performance Model for Computing Parallelism


  • Profile on a machine with a low core count and identify scalability bottlenecks
  • OSPG: logical series-parallel relations between parts of an OpenMP program
  • Inspired by prior work: DPST [PLDI 2012], SP parse tree [SPAA 2015]

Performance model = OSPG + fine-grained measurements

SLIDE 12

OpenMP Series Parallel Graph (OSPG)

  • A data structure to capture series-parallel relations
  • Inspired by the Dynamic Program Structure Tree (DPST) [PLDI 2012]
  • The OSPG is an ordered tree in the absence of task dependencies in OpenMP
  • Handles the combination of work-sharing (fork-join programs with threads) and tasking
  • Precisely captures the semantics of OpenMP
  • Three kinds of nodes: W, S, and P nodes, analogous to the step, finish, and async nodes in the DPST

SLIDE 13

Code Fragments in OpenMP Programs


…
a();
#pragma omp parallel
b();
c();
…

OpenMP code snippet

[Figure: execution structure, a, then parallel instances of b on each thread, then c]

A code fragment is the longest sequence of instructions in the dynamic execution before encountering an OpenMP construct.

SLIDE 14

Capturing Series-Parallel Relation with the OSPG


P-nodes capture the parallel relation: nodes in the sub-tree of a P-node logically execute in parallel with the right siblings of the P-node.

S-nodes capture the series relation: nodes in the sub-tree of an S-node logically execute in series with the right siblings of the S-node.

W-nodes capture computation: a maximal sequence of dynamic instructions between two OpenMP directives.

[Figure: OSPG for the snippet, nodes W1, W2, W3, W4 under S1, S2, P1, P2, with code fragments a1, b2, b3, c4]

SLIDE 15


Capturing Series-Parallel Relation with the OSPG

Determine the series-parallel relation between any pair of W-nodes with an LCA query.

S2 = LCA(W2, W3); P1 = Left-Child(S2, W2, W3)

Check the type of the LCA's child on the path to the left W-node. If it is a P-node, the two nodes execute in parallel; otherwise, they execute in series. Here that child is P1, so W2 and W3 execute in parallel.

[Figure: the OSPG with nodes W1, W2, W3, W4, S1, S2, P1, P2 and fragments a1, b2, b3, c4]

SLIDE 16


Capturing Series-Parallel Relation with the OSPG

Determine the series-parallel relation between any pair of W-nodes with an LCA query.

S1 = LCA(W2, W4); S2 = Left-Child(S1, W2, W4)

Check the type of the LCA's child on the path to the left W-node. If it is a P-node, the two nodes execute in parallel; otherwise, they execute in series. Here that child is S2, so W2 and W4 execute in series.

[Figure: the same OSPG]

SLIDE 17

Profiling an OpenMP Merge Sort Program

  • Merge sort program parallelized with OpenMP tasks

void main() {
  int* arr = init(&n);
  #pragma omp parallel
  #pragma omp single
  mergeSort(arr, 0, n);
}

void mergeSort(int* arr, int s, int e) {
  if (e - s <= CUT_OFF) {
    serialSort(arr, s, e);
    return;
  }
  int mid = s + (e - s) / 2;
  #pragma omp task
  mergeSort(arr, s, mid);
  #pragma omp task
  mergeSort(arr, mid + 1, e);
  #pragma omp taskwait
  merge(arr, s, e);
}


SLIDE 18

OSPG Construction

void main() {
  int* arr = init(&n);
  #pragma omp parallel
  #pragma omp single
  mergeSort(arr, 0, n);
}

[Figure: partial OSPG with nodes W0, W1, S0, S1, P0, P1]

SLIDE 19

OSPG Construction

void mergeSort(int* arr, int s, int e) {
  if (e - s <= CUT_OFF) {
    serialSort(arr, s, e);
    return;
  }
  int mid = s + (e - s) / 2;
  #pragma omp task
  mergeSort(arr, s, mid);
  #pragma omp task
  mergeSort(arr, mid + 1, e);
  #pragma omp taskwait
  merge(arr, s, e);
}

[Figure: OSPG after one level of recursion, with nodes W0–W5, S0, S1, S2, P0, P1, P2, P3]


SLIDE 20

Parallelism Computation Using OSPG


SLIDE 21

Compute Parallelism

Measure the work in each W-node with fine-grained measurements, then compute the work for each internal node bottom-up.

[Figure: OSPG annotated with work, leaf W-nodes carry measured work (2, 52, 6, 100, 100) and internal nodes aggregate it (100, 100, 200, 254, 254), with total work 260 at the root]

SLIDE 22

Compute Serial Work

[Figure: the work-annotated OSPG]

Identify the serial work on the critical path.

SLIDE 23

Compute Serial Work

Compute the serial work for each internal node.

[Figure: OSPG annotated with work as before and with serial work (SW) at internal nodes: SW 100, SW 100, SW 100, SW 154, SW 154, and SW 160 at the root]

SLIDE 24

Source Code Attribution

Aggregate parallelism at OpenMP constructs:
  • omp task at L11
  • omp task at L13
  • omp parallel at L3
  • main at L1

[Figure: OSPG subtrees attributed to these constructs, annotated with work (W 100, W 254, W 260) and serial work (SW 100, SW 154, SW 160)]

SLIDE 25

Parallelism Profile

Line Number      Work  Serial Work  Parallelism  Critical Path Work %
program:1        260   160          1.625        3.75
omp parallel:3   254   154          1.65         33.75
omp task:11      100   100          1.00         62.5
omp task:13      100   100          1.00         0

[Figure: the annotated OSPG]

SLIDE 26


Identify what parts of the code matter in increasing parallelism

SLIDE 27

Adviser mode with What-If Analyses

Identify code regions that must be optimized to increase parallelism.

Select a region to optimize. Which region should be selected? Select the step node performing the highest work on the critical path.

[Figure: OSPG with step-node works 6, 2, 100, 100, 52; the highest-work step node on the critical path is selected]

SLIDE 28

Adviser mode with What-If Analyses

Identify code regions that must be optimized to increase parallelism: select the highest step node on the critical path, identify all W-nodes corresponding to the region, perform the what-if analysis, and repeat until the threshold parallelism is reached.

What-If Profile

Line  Work  Cwork  Parallelism  CP
1     260   85     3.05         7.05%
3     254   79     1.65         63.5%
11    100   25     4.00         29.45%
13    100   25     4.00         0%

[Figure: OSPG in which the selected nodes' work drops from 100 to 25 each]

SLIDE 29

Tasking and Scheduling Overhead

Creating finer-grained tasks increases parallelism, but it also increases tasking and scheduling overhead in the runtime; the resulting speedup depends on both.

[Figure: the running OSPG example]

SLIDE 30

Adviser mode with What-If Analyses

Identify code regions that must be optimized to increase parallelism: select the highest step node on the critical path and repeat until the threshold parallelism is reached, or until the work of the highest step node < K × the average tasking overhead.

What-If Profile

Line  Work  Cwork  Parallelism  CP
1     260   85     3.05         7.05%
3     254   79     1.65         63.5%
11    100   25     4.00         29.45%
13    100   25     4.00         0%

[Figure: OSPG with step-node works 6, 2, 100, 100, 52 and what-if works 25, 25]

SLIDE 31

Recap

An OpenMP program is run under the performance model (logical series-parallel relations + work measurements), which yields:

Parallelism Profile
Line  Work  Cwork  Parallelism
12    160   130    1.23
…     …     …      …

What-if Regions
Region  Parallelization
12-14   4X
…       …

What-if Profile
Line  Work  Cwork  Parallelism
12    160   130    16.12
…     …     …      …

SLIDE 32

Differential Analysis to Identify Secondary Effects

SLIDE 33

Beyond Parallelism - Secondary Effects

  • A program can have high parallelism but low speedup
  • Secondary effects of parallel execution on hardware:
    • Contention for a system resource
    • Cache: false sharing
    • Memory: high remote memory accesses
    • LLC misses: reduced locality
    • Processor-to-data affinity
SLIDE 34

Differential Analysis

Compare an oracle performance model with the parallel execution's performance model: work inflation appears in regions with secondary effects.

[Figure: two OSPGs with step-node works 6, 2, 100, 100, 52; in the parallel execution's model, one node's work inflates from 100 to 185]

SLIDE 35

Inflation over Multiple Metrics

Differential counters: cycles, HITM, remote DRAM accesses, …

Differential Profile
Regions  Cycles  HITM   RemDRAM
main     4.19X   13.2X  1.1X
2-4      5.34X   17.8X  1X
14-15    1.02X   1.03X  1X
15-16    1.03X   1.1X   1.01X

SLIDE 36

Prototypes for OpenMP and Task Parallelism

OMP-WHIP for OpenMP programs: https://github.com/rutgers-apl/omp-whip/
TaskProf for Intel TBB programs: https://github.com/rutgers-apl/TaskProf2

Workflow: compile the input OpenMP program and link it with the OMP-WHIP library (built on OMPT callbacks), then run the resulting binary on inputs to obtain the parallelism profile, the what-if regions, the what-if profile, and the differential profile.

SLIDE 37

Optimizing MILCmk

Initial Parallelism Profile
File:Line   Parallelism  Cpath
main        44.21        28.3
vmeq.c:23   30.29        23.3
veq.c:28    32.83        19.55
vpeq.c:28   33.55        9.35
…           …            …

What-if Regions: funcs.c:81-91, funcs.c:60-67, funcs.c:47-54

What-if Profile
File:Line   Parallelism  Cpath
main        89.89        21.3
vmeq.c:23   30.29        25.2
veq.c:28    32.83        21.5
vpeq.c:28   33.55        11.5
…           …            …

SLIDE 38

Optimizing MILCmk

Replaced serial for loop with parallel_reduce

SLIDE 39

Optimizing MILCmk

Initial Differential Profile
File:Line     Cycles  rem HITM  rem DRAM
main          3.0X    100.4X    84.8X
veq.c:28-35   3.8X    55X       78X
vmeq.c:20-22  3.7X    102X      61X
vpeq.c:20-27  3.6X    91X       68X
…             …       …         …

  • Inflation in cycles and remote DRAM accesses in 5 parallel_for regions
  • The parallel_for loops were repeated multiple times
  • Lack of affinity
  • Optimized by replacing the default partitioner with the affinity partitioner

Increased the speedup of MILCmk from 2.2X to 6X.

SLIDE 40

Is it Useful?

We found it to be effective with numerous applications. Open source at https://github.com/rutgers-apl/TaskProf2 and https://github.com/rutgers-apl/omp-whip/. Currently in talks for tech transfer with the Intel VTune team.

SLIDE 41

Conclusion

  • Make a case for measuring logical parallelism
  • Series-parallel relations + fine-grained measurements yield a useful performance model for identifying scalability bottlenecks
  • What-if analyses can help you identify the regions that matter
  • Differential analyses identify regions having secondary effects
  • Applicable to a wide variety of programming models with appropriate series-parallel graphs

SLIDE 42

Other software prototypes from the Rutgers Architecture & Programming Languages Group: https://github.com/rutgers-apl/

Alive-NJ: https://github.com/rutgers-apl/alive-nj/
TaskProf2: https://github.com/rutgers-apl/TaskProf2
OMP-WHIP: https://github.com/rutgers-apl/omp-whip/
CASM-Verify: https://github.com/rutgers-apl/CASM-Verify/

Develop Abstractions for Performance & Correctness