SLIDE 1

Joachim Nitschke

PARALLEL PROGRAMMING

Project Seminar “Parallel Programming”, Summer Semester 2011

SLIDE 2

• Introduction
• Parallel program design
• Patterns for parallel programming

  • A: Algorithm structure
  • B: Supporting structures

CONTENT

SLIDE 3

Context around parallel programming

INTRODUCTION

SLIDE 4

• Many different models, reflecting the various parallel hardware architectures
• Two (or rather three) most common models:

  • Shared memory
  • Distributed memory
  • Hybrid models (combining shared and distributed memory)

PARALLEL PROGRAMMING MODELS

SLIDE 5

[Diagrams: shared memory architecture vs. distributed memory architecture]

PARALLEL PROGRAMMING MODELS

SLIDE 6

Shared memory

• Synchronize memory access
• Locking vs. potential race conditions

Distributed memory

• Communication bandwidth and resulting latency
• Manage message passing
• Synchronous vs. asynchronous communication

PROGRAMMING CHALLENGES

SLIDE 7

• Two common standards as examples for the two parallel programming models:

  • Open Multi-Processing (OpenMP)
  • Message Passing Interface (MPI)

PARALLEL PROGRAMMING STANDARDS

SLIDE 8

• Collection of libraries and compiler directives for parallel programming on shared memory computers
• Programmers have to explicitly designate blocks that are to run in parallel by adding directives like the one sketched below
• OpenMP then creates a number of threads executing the designated code block
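A minimal illustrative sketch of such directives in C (an assumption; this is not the slide's original example):

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        /* The block after the directive is executed by a team of threads. */
        #pragma omp parallel
        {
            printf("Hello from thread %d of %d\n",
                   omp_get_thread_num(), omp_get_num_threads());
        }

        /* Work-sharing variant: loop iterations are divided among the threads. */
        long sum = 0;
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < 100; i++)
            sum += i;
        printf("sum = %ld\n", sum);
        return 0;
    }

Compiled with an OpenMP-capable compiler (e.g. gcc -fopenmp), the runtime forks a team of threads at the directive and joins them at the end of the block.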

OpenMP

SLIDE 9

• Library with routines to manage message passing for programming on distributed memory computers
• Messages are sent from one process to another
• Routines for synchronization, broadcasts, blocking and non-blocking communication

MPI

SLIDE 10

• Example operations: MPI.Scatter and MPI.Gather
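A minimal C sketch of the scatter/gather idea (illustrative; the chunk size N and the doubling step are assumptions, not the slide's original code). The root distributes equal chunks, every process works on its chunk, and the root collects the results:

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define N 4   /* elements per process, chosen for illustration */

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int *data = NULL;
        if (rank == 0) {
            /* The root prepares one chunk of N values per process. */
            data = malloc(size * N * sizeof(int));
            for (int i = 0; i < size * N; i++) data[i] = i;
        }

        int chunk[N];
        /* Distribute one chunk to every process. */
        MPI_Scatter(data, N, MPI_INT, chunk, N, MPI_INT, 0, MPI_COMM_WORLD);

        for (int i = 0; i < N; i++) chunk[i] *= 2;   /* local work */

        /* Collect the processed chunks back on the root. */
        MPI_Gather(chunk, N, MPI_INT, data, N, MPI_INT, 0, MPI_COMM_WORLD);

        if (rank == 0) {
            for (int i = 0; i < size * N; i++) printf("%d ", data[i]);
            printf("\n");
            free(data);
        }
        MPI_Finalize();
        return 0;
    }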

MPI EXAMPLE

SLIDE 11

General strategies for finding concurrency

PARALLEL PROGRAM DESIGN

SLIDE 12

• General approach: Analyze a problem to identify exploitable concurrency
• Main concept is decomposition: Divide a computation into smaller parts, all or some of which can run concurrently

FINDING CONCURRENCY

SLIDE 13

• Tasks: Programmer-defined units into which the main computation is decomposed
• Unit of execution (UE): Generalization of processes and threads

SOME TERMINOLOGY

SLIDE 14

• Decompose a problem into tasks that can run concurrently
• Few large tasks vs. many small tasks
• Minimize dependencies among tasks

TASK DECOMPOSITION

SLIDE 15

• Group tasks to simplify managing their dependencies
• Tasks within a group run at the same time
• Based on decomposition: Group tasks that belong to the same high-level operations
• Based on constraints: Group tasks with the same constraints

GROUP TASKS

SLIDE 16

• Order task groups to satisfy constraints among them
• The order must be:

  • Restrictive enough to satisfy the constraints
  • Not so restrictive that it hurts flexibility and hence efficiency

• Identify dependencies, e.g.:

  • Group A requires data from group B

• Important: Also identify the independent groups
• Identify potential deadlocks

ORDER TASKS

SLIDE 17

• Decompose a problem's data into units that can be operated on relatively independently
• Look at the problem's central data structures
• Decomposition already implied by, or the basis for, the task decomposition
• Again: Few large chunks vs. many small chunks

  • Improve flexibility: Configurable granularity

DATA DECOMPOSITION

SLIDE 18

• Share decomposed data among tasks
• Identify task-local and shared data
• Classify shared data: read/write or read-only?
• Identify potential race conditions
• Note: Sometimes data sharing implies communication
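As an illustration (not from the slides), OpenMP makes exactly this classification explicit through data-sharing clauses. A small histogram sketch, assuming input values in [0, 1):

    #include <omp.h>

    /* hist is shared read/write data, x is shared read-only input,
       i and bin are task-local. */
    void histogram(const double *x, int n, int *hist, int nbins) {
        #pragma omp parallel for default(none) shared(x, hist) firstprivate(n, nbins)
        for (int i = 0; i < n; i++) {
            int bin = (int)(x[i] * nbins);   /* task-local intermediate */
            #pragma omp atomic               /* protect the shared read/write data */
            hist[bin]++;
        }
    }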

DATA SHARING

SLIDE 19

Typical parallel program structures

PATTERNS FOR PARALLEL PROGRAMMING

SLIDE 20

• How can the identified concurrency be used to build a program?
• Three examples of typical parallel algorithm structures:

  • Organize by tasks: Divide & conquer
  • Organize by data decomposition: Geometric/domain decomposition
  • Organize by data flow: Pipeline

A: ALGORITHM STRUCTURE

SLIDE 21

• Principle: Split a problem recursively into smaller, solvable sub-problems and merge their results
• Potential concurrency: Sub-problems can be solved simultaneously

DIVIDE & CONQUER

SLIDE 22

• Precondition: Sub-problems can be solved independently
• Efficiency constraint: Splitting and merging should be trivial compared to solving the sub-problems
• Challenge: The standard base case can lead to too many, too small tasks

  • End recursion earlier? (see the sketch below)
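A sketch of this idea with OpenMP tasks (an assumption; the slides do not fix an implementation). The recursion falls back to a sequential base case once a sub-problem drops below a cutoff, so tasks do not become too small:

    #include <omp.h>

    #define CUTOFF 1024   /* assumed threshold, tune for the machine */

    /* Recursive sum over a[lo..hi) following the divide & conquer pattern. */
    static long sum(const int *a, int lo, int hi) {
        if (hi - lo <= CUTOFF) {              /* enlarged base case */
            long s = 0;
            for (int i = lo; i < hi; i++) s += a[i];
            return s;
        }
        int mid = lo + (hi - lo) / 2;
        long left, right;
        #pragma omp task shared(left)         /* solve one half concurrently */
        left = sum(a, lo, mid);
        right = sum(a, mid, hi);              /* current task solves the other half */
        #pragma omp taskwait                  /* wait before merging the results */
        return left + right;
    }

    long parallel_sum(const int *a, int n) {
        long result;
        #pragma omp parallel
        #pragma omp single                    /* one thread starts the recursion */
        result = sum(a, 0, n);
        return result;
    }

Choosing CUTOFF trades task-management overhead against load balance.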

DIVIDE & CONQUER

SLIDE 23

• Principle: Organize an algorithm around a linear data structure that has been decomposed into concurrently updatable chunks
• Potential concurrency: Chunks can be updated simultaneously

GEOMETRIC/DOMAIN DECOMPOSITION

SLIDE 24

• Example: Simple blur filter where every pixel is set to the average value of its surrounding pixels

  • The image can be split into squares
  • Each square is updated by a task
  • To update a square's border, information from other squares is required
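A minimal C/OpenMP sketch of the example (hypothetical; the slide shows it as a picture, and for brevity the image is decomposed into strips of rows rather than squares). Updating pixels on a strip's border reads rows owned by the neighbouring strip:

    /* 3x3 box blur: each output pixel becomes the average of its neighbourhood. */
    void blur(const float *in, float *out, int w, int h) {
        #pragma omp parallel for schedule(static)   /* one contiguous strip of rows per thread */
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                float s = 0.0f;
                int n = 0;
                for (int dy = -1; dy <= 1; dy++) {
                    for (int dx = -1; dx <= 1; dx++) {
                        int yy = y + dy, xx = x + dx;
                        if (yy >= 0 && yy < h && xx >= 0 && xx < w) {
                            s += in[yy * w + xx];   /* may read a row owned by a neighbour strip */
                            n++;
                        }
                    }
                }
                out[y * w + x] = s / n;
            }
        }
    }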

GEOMETRIC/DOMAIN DECOMPOSITION

SLIDE 25

• Again: Granularity of the decomposition?
• Choose square/cubic chunks to minimize surface and thus nonlocal data
• Replicating nonlocal data can reduce communication → “ghost boundaries”
• Optimization: Overlap the update and the exchange of nonlocal data
• Number of tasks > number of UEs for better load balance

GEOMETRIC/DOMAIN DECOMPOSITION

SLIDE 26

• Principle, based on the assembly-line analogy: Data flowing through a set of stages
• Potential concurrency: Operations can be performed simultaneously on different data items

PIPELINE


[Diagram: items C1–C6 flowing through pipeline stages 1–3 over time, with the stages working on different items simultaneously]
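One possible realization of the diagram, sketched here with MPI (an assumption; the slides do not prescribe an implementation). Each rank acts as one stage, so while a later stage processes item i, the first stage can already work on the next item:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int n_items = 6;                     /* C1 .. C6 from the diagram */
        for (int i = 0; i < n_items; i++) {
            int item;
            if (rank == 0)
                item = i;                          /* first stage produces the item */
            else
                MPI_Recv(&item, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);

            item = item * 10 + rank;               /* stage-specific processing (made up) */

            if (rank < size - 1)
                MPI_Send(&item, 1, MPI_INT, rank + 1, 0, MPI_COMM_WORLD);
            else
                printf("item %d left the pipeline as %d\n", i, item);
        }
        MPI_Finalize();
        return 0;
    }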

SLIDE 27

• Example: Instruction pipeline in CPUs

  • Fetch (instruction)
  • Decode
  • Execute
  • ...

PIPELINE

SLIDE 28

• Precondition: Dependencies among tasks allow an appropriate ordering
• Efficiency constraint: Number of stages << number of processed items
• The pipeline can also be nonlinear

PIPELINE

SLIDE 29

• Intermediate stage between the problem-oriented algorithm structure patterns and their realization in a programming environment
• Structures that “support” the realization of parallel algorithms
• Four examples:

  • Single program, multiple data (SPMD)
  • Task farming/Master & Worker
  • Fork & Join
  • Shared data

B: SUPPORTING STRUCTURES

SLIDE 30

• Principle: The same code runs on every UE, processing different data
• Most common technique for writing parallel programs!

SINGLE PROGRAM, MULTIPLE DATA

SLIDE 31

• Program stages:

  1. Initialize and obtain a unique ID for each UE
  2. Run the same program on every UE: Differences in the instructions are driven by the ID
  3. Distribute data by decomposing or sharing/copying global data

• Risk: Complex branching and data decomposition can make the code awful to understand and maintain
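A minimal MPI sketch of these stages (illustrative; the problem size and the final reduction are assumptions). The rank obtained in step 1 decides which part of the data each copy of the program works on:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);                    /* 1. initialize, obtain an ID */
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int n = 1000;                        /* 3. decompose the global data */
        int lo = rank * n / size;
        int hi = (rank + 1) * n / size;

        long local = 0;
        for (int i = lo; i < hi; i++) local += i;  /* 2. same code, different data */

        long total;
        MPI_Reduce(&local, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)                             /* branch driven by the ID */
            printf("total = %ld\n", total);

        MPI_Finalize();
        return 0;
    }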

SINGLE PROGRAM, MULTIPLE DATA

SLIDE 32

• Principle: A master task (“farmer”) dispatches tasks to many worker UEs and collects (“farms”) the results

TASK FARMING/MASTER & WORKER

SLIDE 33

TASK FARMING/MASTER & WORKER

SLIDE 34

• Precondition: Tasks are relatively independent
• Master:

  • Initiates the computation
  • Creates a bag of tasks and stores them, e.g. in a shared queue
  • Launches the worker tasks and waits
  • Collects the results and shuts down the computation

• Workers:

  • While the bag of tasks is not empty: pop a task and solve it

• Flexible through indirect scheduling
• Optimization: The master can become a worker too
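A compact sketch of the pattern (an assumption: it uses MPI messages to emulate the bag of tasks rather than a literal shared queue; the tags, the task count and the squaring "work" are made up for illustration):

    #include <mpi.h>
    #include <stdio.h>

    #define N_TASKS     20
    #define TAG_REQUEST 0   /* worker asks for work, no result attached   */
    #define TAG_RESULT  1   /* worker asks for work and delivers a result */
    #define TAG_TASK    2   /* master hands out a task index              */
    #define TAG_STOP    3   /* master tells a worker to shut down         */

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {                        /* master ("farmer") */
            int next = 0, stopped = 0, sum = 0, msg;
            MPI_Status st;
            while (stopped < size - 1) {
                MPI_Recv(&msg, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                         MPI_COMM_WORLD, &st);
                if (st.MPI_TAG == TAG_RESULT)   /* collect a finished result */
                    sum += msg;
                if (next < N_TASKS) {           /* bag of tasks not empty yet */
                    MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_TASK,
                             MPI_COMM_WORLD);
                    next++;
                } else {                        /* bag empty: stop this worker */
                    MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP,
                             MPI_COMM_WORLD);
                    stopped++;
                }
            }
            printf("sum of all results: %d\n", sum);
        } else {                                /* worker */
            int msg = 0, tag = TAG_REQUEST;
            while (1) {
                MPI_Send(&msg, 1, MPI_INT, 0, tag, MPI_COMM_WORLD);
                int task;
                MPI_Status st;
                MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
                if (st.MPI_TAG == TAG_STOP) break;
                msg = task * task;              /* "solve" the task */
                tag = TAG_RESULT;               /* deliver it with the next request */
            }
        }
        MPI_Finalize();
        return 0;
    }

Because workers ask for new tasks only when they are idle, the scheduling is indirect and the load balances itself.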

TASK FARMING/MASTER & WORKER

SLIDE 35

• Principle: Tasks dynamically create (“fork”) other tasks and wait for their termination (“join”)
• Example: An algorithm designed after the Divide & Conquer pattern

FORK & JOIN

SLIDE 36

• Mapping the tasks to UEs can be done directly or indirectly
• Direct: Each subtask is mapped to a new UE (see the sketch below)

  • Disadvantage: UE creation and destruction is expensive
  • Standard programming model in OpenMP

• Indirect: Subtasks are stored in a shared queue and handled by a static number of UEs

  • Concept behind OpenMP
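A sketch of the direct variant with POSIX threads (illustrative, not from the slides): each subtask is forked as a fresh thread and joined before the results are merged:

    #include <pthread.h>
    #include <stdio.h>

    typedef struct { const int *a; int lo, hi; long sum; } Subtask;

    static void *run(void *arg) {
        Subtask *t = (Subtask *)arg;
        t->sum = 0;
        for (int i = t->lo; i < t->hi; i++) t->sum += t->a[i];
        return NULL;
    }

    int main(void) {
        int a[1000];
        for (int i = 0; i < 1000; i++) a[i] = i;

        Subtask left  = { a, 0, 500, 0 };
        Subtask right = { a, 500, 1000, 0 };
        pthread_t tid;

        pthread_create(&tid, NULL, run, &left);   /* fork a new UE for one subtask */
        run(&right);                              /* the parent handles the other half */
        pthread_join(tid, NULL);                  /* join before merging the results */

        printf("sum = %ld\n", left.sum + right.sum);
        return 0;
    }

Forking a thread per subtask illustrates the cost argument: for fine-grained tasks the creation and destruction overhead quickly dominates, which is why the indirect, queue-based variant is usually preferred.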

FORK & JOIN

SLIDE 37

• Problem: Manage access to shared data
• Principle: Define an access protocol that ensures that the results of a computation are correct for any ordering of the operations on the data

SHARED DATA

SLIDE 38

• Model shared data as an (abstract) data type with a fixed set of operations
• Operations can be seen as transactions (→ ACID properties)
• Start with a simple solution and improve performance step by step:

  • Only one operation can be executed at any point in time
  • Improve performance by separating operations into noninterfering sets
  • Separate operations into read and write operations
  • Many different lock strategies…
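As an illustrative sketch of the read/write separation step (not from the slides): a small shared table whose fixed set of operations is protected by a readers-writer lock, so concurrent reads do not block each other while writes stay exclusive:

    #include <pthread.h>

    /* Shared data modelled as an abstract type with a fixed set of operations. */
    typedef struct {
        double values[64];
        pthread_rwlock_t lock;
    } SharedTable;

    void table_init(SharedTable *t) {
        pthread_rwlock_init(&t->lock, NULL);
        for (int i = 0; i < 64; i++) t->values[i] = 0.0;
    }

    double table_read(SharedTable *t, int i) {           /* read-only operation */
        pthread_rwlock_rdlock(&t->lock);                 /* many readers at once */
        double v = t->values[i];
        pthread_rwlock_unlock(&t->lock);
        return v;
    }

    void table_write(SharedTable *t, int i, double v) {  /* write operation */
        pthread_rwlock_wrlock(&t->lock);                 /* writers are exclusive */
        t->values[i] = v;
        pthread_rwlock_unlock(&t->lock);
    }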

SHARED DATA

SLIDE 39

QUESTIONS?

SLIDE 40

• T. Mattson, B. Sanders and B. Massingill. Patterns for Parallel Programming. Addison-Wesley, 2004.
• A. Grama, A. Gupta, G. Karypis and V. Kumar. Introduction to Parallel Computing. Addison-Wesley, 2nd edition, 2003.
• P. S. Pacheco. An Introduction to Parallel Programming. Morgan Kaufmann, 2011.

• Images from Mattson et al. 2004

REFERENCES
