SLIDE 1

Joachim Nitschke

PARALLEL PROGRAMMING

Project Seminar “Parallel Programming”, Summer Semester 2011

SLIDE 2

• Introduction
• Parallel program design
• Patterns for parallel programming

  • A: Algorithm structure
  • B: Supporting structures

CONTENT

SLIDE 3

Context around parallel programming

INTRODUCTION

SLIDE 4

• Many different models, reflecting the various parallel hardware architectures
• Two (or rather three) most common models:

  • Shared memory
  • Distributed memory
  • Hybrid models (combining shared and distributed memory)

PARALLEL PROGRAMMING MODELS

SLIDE 5

[Diagrams: shared memory architecture vs. distributed memory architecture]

PARALLEL PROGRAMMING MODELS

SLIDE 6

Shared memory

• Synchronize memory access
• Locking vs. potential race conditions

Distributed memory

• Communication bandwidth and resulting latency
• Manage message passing
• Synchronous vs. asynchronous communication

PROGRAMMING CHALLENGES

SLIDE 7

• Two common standards as examples for the two parallel programming models:

  • Open Multi-Processing (OpenMP)
  • Message Passing Interface (MPI)

PARALLEL PROGRAMMING STANDARDS

SLIDE 8

• Collection of libraries and compiler directives for parallel programming on shared memory computers
• Programmers have to explicitly designate blocks that are to run in parallel by adding directives like the one sketched below
• OpenMP then creates a number of threads executing the designated code block
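A minimal illustrative sketch of such directives in C (an assumption; this is not the slide's original example):

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        /* The block after the directive is executed by a team of threads. */
        #pragma omp parallel
        {
            printf("Hello from thread %d of %d\n",
                   omp_get_thread_num(), omp_get_num_threads());
        }

        /* Work-sharing variant: loop iterations are divided among the threads. */
        long sum = 0;
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < 100; i++)
            sum += i;
        printf("sum = %ld\n", sum);
        return 0;
    }

Compiled with an OpenMP-capable compiler (e.g. gcc -fopenmp), the runtime forks a team of threads at the directive and joins them at the end of the block.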

OpenMP

SLIDE 9

• Library with routines to manage message passing for programming on distributed memory computers
• Messages are sent from one process to another
• Routines for synchronization, broadcasts, blocking and non-blocking communication

MPI

SLIDE 10

• Example operations: MPI.Scatter and MPI.Gather
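A minimal C sketch of the scatter/gather idea (illustrative; the chunk size N and the doubling step are assumptions, not the slide's original code). The root distributes equal chunks, every process works on its chunk, and the root collects the results:

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define N 4   /* elements per process, chosen for illustration */

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int *data = NULL;
        if (rank == 0) {
            /* The root prepares one chunk of N values per process. */
            data = malloc(size * N * sizeof(int));
            for (int i = 0; i < size * N; i++) data[i] = i;
        }

        int chunk[N];
        /* Distribute one chunk to every process. */
        MPI_Scatter(data, N, MPI_INT, chunk, N, MPI_INT, 0, MPI_COMM_WORLD);

        for (int i = 0; i < N; i++) chunk[i] *= 2;   /* local work */

        /* Collect the processed chunks back on the root. */
        MPI_Gather(chunk, N, MPI_INT, data, N, MPI_INT, 0, MPI_COMM_WORLD);

        if (rank == 0) {
            for (int i = 0; i < size * N; i++) printf("%d ", data[i]);
            printf("\n");
            free(data);
        }
        MPI_Finalize();
        return 0;
    }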

MPI EXAMPLE

SLIDE 11

General strategies for finding concurrency

PARALLEL PROGRAM DESIGN

SLIDE 12

• General approach: Analyze a problem to identify exploitable concurrency
• Main concept is decomposition: Divide a computation into smaller parts, all or some of which can run concurrently

FINDING CONCURRENCY

SLIDE 13

• Tasks: Programmer-defined units into which the main computation is decomposed
• Unit of execution (UE): Generalization of processes and threads

SOME TERMINOLOGY

SLIDE 14

• Decompose a problem into tasks that can run concurrently
• Few large tasks vs. many small tasks
• Minimize dependencies among tasks

TASK DECOMPOSITION

SLIDE 15

• Group tasks to simplify managing their dependencies
• Tasks within a group run at the same time
• Based on decomposition: Group tasks that belong to the same high-level operations
• Based on constraints: Group tasks with the same constraints

GROUP TASKS

SLIDE 16

• Order task groups to satisfy constraints among them
• The order must be:

  • Restrictive enough to satisfy the constraints
  • Not so restrictive that it hurts flexibility and hence efficiency

• Identify dependencies, e.g.:

  • Group A requires data from group B

• Important: Also identify the independent groups
• Identify potential deadlocks

ORDER TASKS

SLIDE 17

• Decompose a problem's data into units that can be operated on relatively independently
• Look at the problem's central data structures
• Decomposition already implied by, or the basis for, the task decomposition
• Again: Few large chunks vs. many small chunks

  • Improve flexibility: Configurable granularity

DATA DECOMPOSITION

SLIDE 18

• Share decomposed data among tasks
• Identify task-local and shared data
• Classify shared data: read/write or read-only?
• Identify potential race conditions
• Note: Sometimes data sharing implies communication
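As an illustration (not from the slides), OpenMP makes exactly this classification explicit through data-sharing clauses. A small histogram sketch, assuming input values in [0, 1):

    #include <omp.h>

    /* hist is shared read/write data, x is shared read-only input,
       i and bin are task-local. */
    void histogram(const double *x, int n, int *hist, int nbins) {
        #pragma omp parallel for default(none) shared(x, hist) firstprivate(n, nbins)
        for (int i = 0; i < n; i++) {
            int bin = (int)(x[i] * nbins);   /* task-local intermediate */
            #pragma omp atomic               /* protect the shared read/write data */
            hist[bin]++;
        }
    }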

DATA SHARING

SLIDE 19

Typical parallel program structures

PATTERNS FOR PARALLEL PROGRAMMING

SLIDE 20

• How can the identified concurrency be used to build a program?
• Three examples of typical parallel algorithm structures:

  • Organize by tasks: Divide & conquer
  • Organize by data decomposition: Geometric/domain decomposition
  • Organize by data flow: Pipeline

A: ALGORITHM STRUCTURE

SLIDE 21

• Principle: Split a problem recursively into smaller, solvable sub-problems and merge their results
• Potential concurrency: Sub-problems can be solved simultaneously

DIVIDE & CONQUER

SLIDE 22

• Precondition: Sub-problems can be solved independently
• Efficiency constraint: Splitting and merging should be trivial compared to solving the sub-problems
• Challenge: The standard base case can lead to too many, too small tasks

  • End recursion earlier? (see the sketch below)
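A sketch of this idea with OpenMP tasks (an assumption; the slides do not fix an implementation). The recursion falls back to a sequential base case once a sub-problem drops below a cutoff, so tasks do not become too small:

    #include <omp.h>

    #define CUTOFF 1024   /* assumed threshold, tune for the machine */

    /* Recursive sum over a[lo..hi) following the divide & conquer pattern. */
    static long sum(const int *a, int lo, int hi) {
        if (hi - lo <= CUTOFF) {              /* enlarged base case */
            long s = 0;
            for (int i = lo; i < hi; i++) s += a[i];
            return s;
        }
        int mid = lo + (hi - lo) / 2;
        long left, right;
        #pragma omp task shared(left)         /* solve one half concurrently */
        left = sum(a, lo, mid);
        right = sum(a, mid, hi);              /* current task solves the other half */
        #pragma omp taskwait                  /* wait before merging the results */
        return left + right;
    }

    long parallel_sum(const int *a, int n) {
        long result;
        #pragma omp parallel
        #pragma omp single                    /* one thread starts the recursion */
        result = sum(a, 0, n);
        return result;
    }

Choosing CUTOFF trades task-management overhead against load balance.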

DIVIDE & CONQUER

SLIDE 23

• Principle: Organize an algorithm around a linear data structure that has been decomposed into concurrently updatable chunks
• Potential concurrency: Chunks can be updated simultaneously

GEOMETRIC/DOMAIN DECOMPOSITION

SLIDE 24

• Example: Simple blur filter where every pixel is set to the average value of its surrounding pixels

  • The image can be split into squares
  • Each square is updated by a task
  • To update a square's border, information from other squares is required
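A minimal C/OpenMP sketch of the example (hypothetical; the slide shows it as a picture, and for brevity the image is decomposed into strips of rows rather than squares). Updating pixels on a strip's border reads rows owned by the neighbouring strip:

    /* 3x3 box blur: each output pixel becomes the average of its neighbourhood. */
    void blur(const float *in, float *out, int w, int h) {
        #pragma omp parallel for schedule(static)   /* one contiguous strip of rows per thread */
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                float s = 0.0f;
                int n = 0;
                for (int dy = -1; dy <= 1; dy++) {
                    for (int dx = -1; dx <= 1; dx++) {
                        int yy = y + dy, xx = x + dx;
                        if (yy >= 0 && yy < h && xx >= 0 && xx < w) {
                            s += in[yy * w + xx];   /* may read a row owned by a neighbour strip */
                            n++;
                        }
                    }
                }
                out[y * w + x] = s / n;
            }
        }
    }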

GEOMETRIC/DOMAIN DECOMPOSITION

SLIDE 25

• Again: Granularity of the decomposition?
• Choose square/cubic chunks to minimize surface and thus nonlocal data
• Replicating nonlocal data can reduce communication → “ghost boundaries”
• Optimization: Overlap the update and the exchange of nonlocal data
• Number of tasks > number of UEs for better load balance

GEOMETRIC/DOMAIN DECOMPOSITION

SLIDE 26

• Principle, based on the assembly-line analogy: Data flowing through a set of stages
• Potential concurrency: Operations can be performed simultaneously on different data items

PIPELINE


[Diagram: items C1–C6 flowing through pipeline stages 1–3 over time, with the stages working on different items simultaneously]
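One possible realization of the diagram, sketched here with MPI (an assumption; the slides do not prescribe an implementation). Each rank acts as one stage, so while a later stage processes item i, the first stage can already work on the next item:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int n_items = 6;                     /* C1 .. C6 from the diagram */
        for (int i = 0; i < n_items; i++) {
            int item;
            if (rank == 0)
                item = i;                          /* first stage produces the item */
            else
                MPI_Recv(&item, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);

            item = item * 10 + rank;               /* stage-specific processing (made up) */

            if (rank < size - 1)
                MPI_Send(&item, 1, MPI_INT, rank + 1, 0, MPI_COMM_WORLD);
            else
                printf("item %d left the pipeline as %d\n", i, item);
        }
        MPI_Finalize();
        return 0;
    }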

SLIDE 27

• Example: Instruction pipeline in CPUs

  • Fetch (instruction)
  • Decode
  • Execute
  • ...

PIPELINE

SLIDE 28

• Precondition: Dependencies among tasks allow an appropriate ordering
• Efficiency constraint: Number of stages << number of processed items
• The pipeline can also be nonlinear

PIPELINE

SLIDE 29

• Intermediate stage between the problem-oriented algorithm structure patterns and their realization in a programming environment
• Structures that “support” the realization of parallel algorithms
• Four examples:

  • Single program, multiple data (SPMD)
  • Task farming/Master & Worker
  • Fork & Join
  • Shared data

B: SUPPORTING STRUCTURES

SLIDE 30

• Principle: The same code runs on every UE, processing different data
• Most common technique for writing parallel programs!

SINGLE PROGRAM, MULTIPLE DATA

SLIDE 31

• Program stages:

  1. Initialize and obtain a unique ID for each UE
  2. Run the same program on every UE: Differences in the instructions are driven by the ID
  3. Distribute data by decomposing or sharing/copying global data

• Risk: Complex branching and data decomposition can make the code awful to understand and maintain
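A minimal MPI sketch of these stages (illustrative; the problem size and the final reduction are assumptions). The rank obtained in step 1 decides which part of the data each copy of the program works on:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);                    /* 1. initialize, obtain an ID */
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int n = 1000;                        /* 3. decompose the global data */
        int lo = rank * n / size;
        int hi = (rank + 1) * n / size;

        long local = 0;
        for (int i = lo; i < hi; i++) local += i;  /* 2. same code, different data */

        long total;
        MPI_Reduce(&local, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)                             /* branch driven by the ID */
            printf("total = %ld\n", total);

        MPI_Finalize();
        return 0;
    }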

SINGLE PROGRAM, MULTIPLE DATA

SLIDE 32

• Principle: A master task (“farmer”) dispatches tasks to many worker UEs and collects (“farms”) the results

TASK FARMING/MASTER & WORKER

SLIDE 33

TASK FARMING/MASTER & WORKER

SLIDE 34

• Precondition: Tasks are relatively independent
• Master:

  • Initiates the computation
  • Creates a bag of tasks and stores them, e.g. in a shared queue
  • Launches the worker tasks and waits
  • Collects the results and shuts down the computation

• Workers:

  • While the bag of tasks is not empty: pop a task and solve it

• Flexible through indirect scheduling
• Optimization: The master can become a worker too
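A compact sketch of the pattern (an assumption: it uses MPI messages to emulate the bag of tasks rather than a literal shared queue; the tags, the task count and the squaring "work" are made up for illustration):

    #include <mpi.h>
    #include <stdio.h>

    #define N_TASKS     20
    #define TAG_REQUEST 0   /* worker asks for work, no result attached   */
    #define TAG_RESULT  1   /* worker asks for work and delivers a result */
    #define TAG_TASK    2   /* master hands out a task index              */
    #define TAG_STOP    3   /* master tells a worker to shut down         */

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {                        /* master ("farmer") */
            int next = 0, stopped = 0, sum = 0, msg;
            MPI_Status st;
            while (stopped < size - 1) {
                MPI_Recv(&msg, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                         MPI_COMM_WORLD, &st);
                if (st.MPI_TAG == TAG_RESULT)   /* collect a finished result */
                    sum += msg;
                if (next < N_TASKS) {           /* bag of tasks not empty yet */
                    MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_TASK,
                             MPI_COMM_WORLD);
                    next++;
                } else {                        /* bag empty: stop this worker */
                    MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP,
                             MPI_COMM_WORLD);
                    stopped++;
                }
            }
            printf("sum of all results: %d\n", sum);
        } else {                                /* worker */
            int msg = 0, tag = TAG_REQUEST;
            while (1) {
                MPI_Send(&msg, 1, MPI_INT, 0, tag, MPI_COMM_WORLD);
                int task;
                MPI_Status st;
                MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
                if (st.MPI_TAG == TAG_STOP) break;
                msg = task * task;              /* "solve" the task */
                tag = TAG_RESULT;               /* deliver it with the next request */
            }
        }
        MPI_Finalize();
        return 0;
    }

Because workers ask for new tasks only when they are idle, the scheduling is indirect and the load balances itself.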

TASK FARMING/MASTER & WORKER

SLIDE 35

• Principle: Tasks dynamically create (“fork”) other tasks and wait for their termination (“join”)
• Example: An algorithm designed after the Divide & Conquer pattern

FORK & JOIN

SLIDE 36

• Mapping the tasks to UEs can be done directly or indirectly
• Direct: Each subtask is mapped to a new UE (see the sketch below)

  • Disadvantage: UE creation and destruction is expensive
  • Standard programming model in OpenMP

• Indirect: Subtasks are stored in a shared queue and handled by a static number of UEs

  • Concept behind OpenMP
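A sketch of the direct variant with POSIX threads (illustrative, not from the slides): each subtask is forked as a fresh thread and joined before the results are merged:

    #include <pthread.h>
    #include <stdio.h>

    typedef struct { const int *a; int lo, hi; long sum; } Subtask;

    static void *run(void *arg) {
        Subtask *t = (Subtask *)arg;
        t->sum = 0;
        for (int i = t->lo; i < t->hi; i++) t->sum += t->a[i];
        return NULL;
    }

    int main(void) {
        int a[1000];
        for (int i = 0; i < 1000; i++) a[i] = i;

        Subtask left  = { a, 0, 500, 0 };
        Subtask right = { a, 500, 1000, 0 };
        pthread_t tid;

        pthread_create(&tid, NULL, run, &left);   /* fork a new UE for one subtask */
        run(&right);                              /* the parent handles the other half */
        pthread_join(tid, NULL);                  /* join before merging the results */

        printf("sum = %ld\n", left.sum + right.sum);
        return 0;
    }

Forking a thread per subtask illustrates the cost argument: for fine-grained tasks the creation and destruction overhead quickly dominates, which is why the indirect, queue-based variant is usually preferred.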

FORK & JOIN

SLIDE 37

• Problem: Manage access to shared data
• Principle: Define an access protocol that ensures that the results of a computation are correct for any ordering of the operations on the data

SHARED DATA

SLIDE 38

• Model shared data as an (abstract) data type with a fixed set of operations
• Operations can be seen as transactions (→ ACID properties)
• Start with a simple solution and improve performance step by step:

  • Only one operation can be executed at any point in time
  • Improve performance by separating operations into noninterfering sets
  • Separate operations into read and write operations
  • Many different lock strategies…
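As an illustrative sketch of the read/write separation step (not from the slides): a small shared table whose fixed set of operations is protected by a readers-writer lock, so concurrent reads do not block each other while writes stay exclusive:

    #include <pthread.h>

    /* Shared data modelled as an abstract type with a fixed set of operations. */
    typedef struct {
        double values[64];
        pthread_rwlock_t lock;
    } SharedTable;

    void table_init(SharedTable *t) {
        pthread_rwlock_init(&t->lock, NULL);
        for (int i = 0; i < 64; i++) t->values[i] = 0.0;
    }

    double table_read(SharedTable *t, int i) {           /* read-only operation */
        pthread_rwlock_rdlock(&t->lock);                 /* many readers at once */
        double v = t->values[i];
        pthread_rwlock_unlock(&t->lock);
        return v;
    }

    void table_write(SharedTable *t, int i, double v) {  /* write operation */
        pthread_rwlock_wrlock(&t->lock);                 /* writers are exclusive */
        t->values[i] = v;
        pthread_rwlock_unlock(&t->lock);
    }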

SHARED DATA

SLIDE 39

QUESTIONS?

SLIDE 40

• T. Mattson, B. Sanders and B. Massingill. Patterns for Parallel Programming. Addison-Wesley, 2004.
• A. Grama, A. Gupta, G. Karypis and V. Kumar. Introduction to Parallel Computing. Addison-Wesley, 2nd edition, 2003.
• P. S. Pacheco. An Introduction to Parallel Programming. Morgan Kaufmann, 2011.

• Images from Mattson et al. 2004

REFERENCES
