SLIDE 1

Parallel Programming and Heterogeneous Computing

A4 – Workloads & Foster’s Methodology

Max Plauth, Sven Köhler, Felix Eberhardt, Lukas Wenzel, and Andreas Polze Operating Systems and Middleware Group

SLIDE 2

Computing the n-th Fibonacci number:

F(n) = F(n-1) + F(n-2), with F(0) = 0, F(1) = 1

Sven Köhler ParProg20 A4 Foster’s Methodology Chart 2

Example: Can You Easily Parallelize … ?

Cannot be obviously parallelized, due to a data dependency: the result of one step depends on an earlier step having produced its result.
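The dependency chain is easy to see in code. A minimal Python sketch (the function name `fib` is ours): each iteration consumes the results of the two previous ones, so the steps cannot simply run in parallel.

```python
# Iterative Fibonacci: the loop carries a strict data dependency --
# step k needs F(k) and F(k-1) from the steps before it.
def fib(n):
    a, b = 0, 1          # F(0), F(1)
    for _ in range(n):
        a, b = b, a + b  # next value depends on the two previous ones
    return a
```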

SLIDE 3

Searching an unsorted, discrete problem space for a specific value.


Example: Can You Easily Parallelize … ?

Model the space as a tree and parallelize the search by walking sub-trees. Might require communication (“I keep left”, “don’t go there”, “Found! Stop all!”).
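The “stop all” communication can be sketched in Python (the name `parallel_search` and the threading layout are our assumptions, not code from the slides): workers scan disjoint parts of the space, and a shared flag broadcasts that the value was found.

```python
import threading

# Hypothetical sketch: workers scan disjoint chunks of an unsorted space;
# a shared Event serves as the "Found! Stop all!" broadcast.
def parallel_search(values, target, workers=4):
    found = threading.Event()
    result = []

    def worker(chunk):
        for index, value in chunk:
            if found.is_set():       # another worker already succeeded
                return
            if value == target:
                result.append(index)
                found.set()          # tell all other workers to stop

    indexed = list(enumerate(values))
    chunks = [indexed[k::workers] for k in range(workers)]
    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result[0] if result else None
```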

SLIDE 4

Approximating π using a Monte Carlo method? Pick random points 0 ≤ x, y ≤ 1. A point is in the circle if x² + y² ≤ 1. P(X): how likely a point ends up in X.

π = 4 * P(circle) / P(square)

≈ 4 * #ptsInCircle / #ptsTotal


Example: Can You Easily Parallelize … ?

The parallel action for each point is completely independent; no communication required (embarrassingly parallel).
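A minimal sketch of the estimator (function names are ours; the per-chunk calls are written sequentially here, but each is independent and could run in a separate worker with no coordination):

```python
import random

# Count how many of n random points land inside the quarter circle.
# Each point is independent of all others.
def count_in_circle(n, seed):
    rng = random.Random(seed)
    return sum(1 for _ in range(n)
               if rng.random() ** 2 + rng.random() ** 2 <= 1.0)

# Merge the per-worker counts: pi ~ 4 * hits / total.
def estimate_pi(total, workers=4):
    per_worker = total // workers
    hits = sum(count_in_circle(per_worker, seed) for seed in range(workers))
    return 4.0 * hits / (per_worker * workers)
```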

SLIDE 5

The Landscape of Parallel Computing Research: A View from Berkeley

Krste Asanovic, Ras Bodik, Bryan Christopher Catanzaro, Joseph James Gebis, Parry Husbands, Kurt Keutzer, David A. Patterson, William Lester Plishker, John Shalf, Samuel Webb Williams, Katherine A. Yelick. Electrical Engineering and Computer Sciences, University of California at Berkeley

Technical Report No. UCB/EECS-2006-183 http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html

December 18, 2006

The last two slides showed typical examples of different classes of parallel algorithms. We will revisit them at the end of this semester, but you can already read up on them.


Sidenote: Berkeley Dwarfs [Berkeley2006]


SLIDE 6


Workloads

“task-level parallelism”

Different tasks being performed at the same time

Might originate from the same or different programs

“data-level parallelism”

Parallel execution of the same task on disjoint data sets
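The distinction can be illustrated with a short Python sketch (`square_all` and the thread-pool layout are our choices): data-level parallelism runs the same task, here squaring, on disjoint slices of the data.

```python
from concurrent.futures import ThreadPoolExecutor

# Data-level parallelism: the same task (squaring) runs on disjoint
# slices of the input. Task-level parallelism would instead run
# different tasks side by side.
def square_all(data, workers=4):
    slices = [data[i::workers] for i in range(workers)]   # disjoint data sets
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = pool.map(lambda s: [x * x for x in s], slices)
    return sorted(x for part in partial for x in part)
```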

SLIDE 7

Task / data size can be coarse-grained or fine-grained. The size decision depends on the algorithm design or the configuration of the execution unit.

Sometimes “flow parallelism” is also added

Overlapping work on data stream

Examples: Pipelines, assembly line model

Sometimes “functional parallelism” is also added

Distinct functional units of your algorithm, exchanging data in a cyclic communication graph. For these four terms there is no clear distinction in the literature.


Workloads

[Figure: timeline of task1, task2, task3 overlapping over time]

SLIDE 8


Execution Environment Mapping

SIMD: Single Instruction stream, Multiple Data streams

MIMD: Multiple Instruction streams, Multiple Data streams

Data parallelism maps well to SIMD; task parallelism maps well to MIMD.

[Figure: instruction (I) and data (D) stream diagrams for SIMD and MIMD]

SLIDE 9

Execution environments are optimized for one kind of workload, even though they can also be used for other ones.


Execution Environment Mapping

                 Shared Memory (SM)        Shared Nothing / Distributed Memory (DM)
Data Parallel    SM-SIMD systems:          DM-SIMD systems:
                 SSE, AltiVec, CUDA        Hadoop, systolic arrays
Task Parallel    SM-MIMD systems:          DM-MIMD systems:
                 ManyCore/SMP systems      Clusters, MPP systems

SLIDE 10


The Parallel Programming Problem

[Figure: does the parallel application (type) match the execution environment (configuration)?]

SLIDE 11

Map the workload problem onto an execution environment

Concurrency for speedup

Data locality for speedup

Scalability

Best parallel solution typically differs massively from the sequential version of an algorithm

Foster defines four distinct stages of a methodological approach

We will use this as a guide in the upcoming discussions

Note: Foster talks about communication, we use the term synchronization instead


Designing Parallel Algorithms [Foster]

SLIDE 12

Reduce a set of elements into one, given an operation, e.g. summation: f(a, b) = a + b


Example: Parallel Reduction

[Figure: pairwise reduction tree summing 1 … 7 to 28]
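The tree from the figure can be written down directly (a sketch; `tree_reduce` is our name): neighbours are combined level by level, and all combines within one level are independent, so they could run in parallel.

```python
# Pairwise (tree) reduction: combine neighbours level by level.
# Every combine within a level is independent of the others.
def tree_reduce(op, values):
    level = list(values)
    while len(level) > 1:
        nxt = [op(level[i], level[i + 1])
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:        # odd element carries over to the next level
            nxt.append(level[-1])
        level = nxt
    return level[0]
```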

SLIDE 13

A) Search for concurrency and scalability

Partitioning Decompose computation and data into the smallest possible tasks

Communication Define necessary coordination of task execution

B) Search for locality and other performance-related issues

Agglomeration Consider performance and implementation costs

Mapping Maximize execution unit utilization, minimize communication

Might require backtracking or parallel investigation of steps


Designing Parallel Algorithms [Foster]

SLIDE 14

Expose opportunities for parallel execution – fine-grained decomposition

Good partition keeps computation and data together

Data partitioning leads to data parallelism

Computation partitioning leads to task parallelism

Complementary approaches, can lead to different algorithms

Reveal hidden structures of the algorithm that have potential

Investigate complementary views on the problem

Avoid replication of either computation or data, can be revised later to reduce communication overhead

Step results in multiple candidate solutions


Partitioning Step [Foster]

[Figure: the reduction example decomposed into small tasks, each computing a + b]

SLIDE 15

Domain Decomposition

Define small data fragments

Specify computation for them

Different phases of computation on the same data are handled separately

Rule of thumb: First focus on large or frequently used data structures

Functional Decomposition

Split up computation into disjoint tasks, ignore the data accessed for the moment

With significant data overlap, domain decomposition is more appropriate


Partitioning - Decomposition Types

SLIDE 16

Checklist for resulting partitioning scheme

Order of magnitude more tasks than processors?

→ Keeps flexibility for the next steps

Avoidance of redundant computation and storage requirements?

→ Scalability for large problem sizes

Tasks of comparable size?

→ Goal is to allocate equal work to processors

Does the number of tasks scale with the problem size?

→ Algorithm should be able to solve larger problems with more processors

Resolve bad partitioning by estimating performance behavior and, if necessary, reformulating the problem


Partitioning - Checklist

SLIDE 17

Specify links between data consumers and data producers

Specify kind and number of messages on these links

Domain decomposition problems might have tricky communication infrastructures, due to data dependencies

Communication in functional decomposition problems can easily be modeled from the data flow between the tasks

Categorization of communication patterns

Local communication (few neighbors) vs. global communication

Structured communication (e.g. tree) vs. unstructured communication

Static vs. dynamic communication structure

Synchronous vs. asynchronous communication


Communication Step [Foster]

SLIDE 18

Distribute computation and communication, don’t centralize the algorithm

Bad example: Central manager for parallel summation

Divide-and-conquer helps as mental model to identify concurrency

Unstructured communication is hard to agglomerate, better avoid it

Checklist for communication design

Do all tasks perform the same amount of communication?

→ Distribute or replicate communication hot spots

Does each task perform only local communication?

Can communication happen concurrently?

Can computation happen concurrently?


Communication - Hints

SLIDE 19

The algorithm so far is correct, but not specialized for a particular execution environment

Check again partitioning and communication decisions

Agglomerate tasks for efficient execution on some machine

Replicate data and / or computation for efficiency reasons

Resulting number of tasks can still be greater than the number of processors

Three conflicting guiding decisions

Reduce communication costs by coarser granularity of computation and communication

Preserve flexibility with respect to later mapping decisions

Reduce software engineering costs (serial -> parallel version)


Agglomeration Step [Foster]

[Figure: agglomerated reduction using a four-way add, addh4 a,b,c,d]

SLIDE 20


Agglomeration [Foster]

SLIDE 21

Reduce communication costs by coarser granularity

Sending less data

Sending fewer messages (per-message initialization costs)

Agglomerate, especially if tasks cannot run concurrently

Also reduces task creation costs

Replicate computation to avoid communication (helps also with reliability)

Preserve flexibility

A flexible, large number of tasks is still a prerequisite for scalability

Define granularity as compile-time or run-time parameter
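A run-time granularity parameter can be as simple as a chunk size (a sketch; `chunked` is our name): the same partitioning code then serves both coarse and fine decompositions.

```python
# Granularity as a run-time parameter: chunk_size controls how much
# work each task receives before any communication happens.
def chunked(work_items, chunk_size):
    return [work_items[i:i + chunk_size]
            for i in range(0, len(work_items), chunk_size)]
```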


Agglomeration – Granularity vs. Flexibility

SLIDE 22

Are communication costs reduced by increasing locality?

Does replicated computation outweigh its costs in all cases?

Does data replication restrict the range of problem sizes / processor counts?

Do the larger tasks still have similar computation / communication costs?

Do the larger tasks still act with sufficient concurrency?

Does the number of tasks still scale with the problem size?

How much can the task count decrease without disturbing load balancing, scalability, or engineering costs?

Is the transition to parallel code worth the engineering costs?


Agglomeration - Checklist

SLIDE 23

Only relevant for shared-nothing systems, since shared-memory systems typically perform automatic task scheduling

Minimize execution time by

Place concurrent tasks on different nodes

Place tasks with heavy communication on the same node

Conflicting strategies, additionally restricted by resource limits

In general, an NP-complete bin packing problem

Set of sophisticated (dynamic) heuristics for load balancing

Preference for local algorithms that do not need global scheduling state
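One of the classic heuristics hinted at here is longest-processing-time-first (a sketch; `lpt_mapping` is our name, and real schedulers also weigh communication costs): always place the next-largest task on the currently least-loaded node.

```python
import heapq

# Greedy LPT heuristic for the NP-complete mapping problem:
# place the next-largest task on the least-loaded node.
def lpt_mapping(task_costs, nodes):
    heap = [(0, node) for node in range(nodes)]   # (load, node) pairs
    heapq.heapify(heap)
    assignment = {}
    for task, cost in sorted(task_costs.items(), key=lambda kv: -kv[1]):
        load, node = heapq.heappop(heap)          # least-loaded node
        assignment[task] = node
        heapq.heappush(heap, (load + cost, node))
    return assignment
```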


Mapping Step [Foster]

SLIDE 24

A) Search for concurrency and scalability

Partitioning Decompose computation and data into the smallest possible tasks

Communication Define necessary coordination of task execution

B) Search for locality and other performance-related issues

Agglomeration Consider performance and implementation costs

Mapping Maximize execution unit utilization, minimize communication

Might require backtracking or parallel investigation of steps


Designing Parallel Algorithms [Foster]

SLIDE 25


Surface-To-Volume Effect [Foster, Breshears]

[nicerweb.com]

Visualize the data to be processed (in parallel) as sliced 3D cube

SLIDE 26

Synchronization requirements of a task

Proportional to the surface of the data slice it operates upon

Visualized by the amount of ‘borders’ of the slice

Computation work of a task

Proportional to the volume of the data slice it operates upon

Represents the granularity of decomposition

Ratio of synchronization and computation

High synchronization, low computation, high ratio → bad

Low synchronization, high computation, low ratio → good

Ratio decreases for increasing data size per task

Coarse granularity by agglomerating tasks in all dimensions

For a given volume, the surface then goes down → good
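For a cubic block of n³ cells the effect can be computed directly (a sketch; `surface_to_volume` is our name): the surface is 6n², the volume n³, so the ratio 6/n shrinks as blocks get coarser.

```python
# Surface-to-volume ratio of an n x n x n block of cells:
# synchronization ~ surface (6 n^2), computation ~ volume (n^3),
# so the ratio 6/n falls as granularity gets coarser.
def surface_to_volume(n):
    return (6 * n * n) / (n ** 3)
```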


Surface-To-Volume Effect [Foster, Breshears]

SLIDE 27

[Berkeley2006] Asanovic, Krste, et al.: “The Landscape of Parallel Computing Research: A View from Berkeley.” Technical Report No. UCB/EECS-2006-183, 2006.

[Foster1995] Foster, Ian: “Designing and Building Parallel Programs.” Addison-Wesley, 1995.

[Breshears2009] Breshears, Clay: “The Art of Concurrency.” O’Reilly Media Inc., 2009.


Literature

SLIDE 28

And now for a break and a cup of espresso*.

*or beverage of your choice