SLIDE 1

Parallel Programming Patterns: Data Parallelism

Ralph Johnson, University of Illinois at Urbana-Champaign, rjohnson@illinois.edu

SLIDE 3

Pattern language

  • Set of patterns that an expert (or a community) uses
  • Patterns are related (high level to low level)

SLIDE 6

Making a pattern language for parallelism is hard

  • Parallel programming
    – comes in many styles
    – changes algorithms
    – is about performance

SLIDE 7

Our Pattern Language

  • Universal Parallel Computing Research Center
  • Making client applications (desktop, laptop, handheld) faster by using multicores

  • Kurt Keutzer - Berkeley
  • Tim Mattson - Intel
  • http://parlab.eecs.berkeley.edu/wiki/patterns
  • Comments to rjohnson@illinois.edu
SLIDE 8

The problem

  • Multicores (free ride is over)
  • GPUs
  • Caches
  • Vector processing
SLIDE 9

Our Pattern Language

  • Structural (Architectural)
  • Computational (Algorithms)
  • Algorithm Strategies
  • Implementation Strategies
  • Parallel Execution

SLIDE 10

Algorithm Strategies

  • Task parallelism
  • Geometric decomposition
  • Recursive splitting
  • Pipelining
SLIDE 11

Task Parallelism

  • Communication? As little as possible.
  • Task size? Not too big, not too small.
    – Overdecomposition: create more tasks than there are cores

  • Scheduling? Keep neighbors on same core.
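
A minimal TBB sketch of these knobs (Item and process are hypothetical stand-ins for the application's work; the grain size of 64 is only a guess to be tuned by measurement):

#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
#include <cstddef>

struct Item;                      // hypothetical work item
void process(Item&);              // hypothetical per-item work

void process_all(Item *items, size_t n) {
    tbb::parallel_for(
        tbb::blocked_range<size_t>(0, n, /*grainsize=*/64),
        [=](const tbb::blocked_range<size_t>& r) {
            // each task handles one chunk of at most ~64 items
            for (size_t i = r.begin(); i != r.end(); ++i)
                process(items[i]);
        },
        tbb::simple_partitioner());
}

With simple_partitioner the grain size is honored exactly, so a large range yields many more tasks than cores, which is the overdecomposition the slide calls for.
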
SLIDE 12

Geometric Decomposition

  • Stencil
SLIDE 13

Geometric Decomposition

  • Ghost cells
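
A sketch of a ghost-cell sweep, with names invented here rather than taken from the slides: each worker owns the block [lo, hi) of a 1D grid and keeps read-only copies of its neighbors' edge values.

#include <vector>
#include <cstddef>

// One worker's sweep over its block [lo, hi).
// u[lo-1] and u[hi] are ghost cells: copies of the neighboring
// blocks' boundary values, refreshed between sweeps, so the
// 3-point stencil never reads another worker's live data.
void sweep(const std::vector<double>& u, std::vector<double>& next,
           size_t lo, size_t hi) {
    for (size_t i = lo; i < hi; ++i)
        next[i] = 0.5 * (u[i - 1] + u[i + 1]);
}
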
SLIDE 14

Recursive Splitting

  • How small to split?
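
One common answer, sketched with std::async (the cutoff of 4096 is an assumption; a fork-join library such as TBB would reuse worker threads instead of spawning one per split):

#include <future>
#include <numeric>
#include <cstddef>

// Divide-and-conquer sum: recurse on halves in parallel until a
// piece falls below the cutoff, then add sequentially. The cutoff
// answers "how small to split?": too small and task overhead
// dominates, too large and cores go idle.
double split_sum(const double *a, size_t n) {
    const size_t CUTOFF = 4096;                  // tuning parameter
    if (n <= CUTOFF)
        return std::accumulate(a, a + n, 0.0);   // sequential leaf
    auto right = std::async(std::launch::async,
                            split_sum, a + n / 2, n - n / 2);
    double left = split_sum(a, n / 2);           // this thread takes the left half
    return left + right.get();                   // join
}
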
SLIDE 15

Pipelining

  • Bottleneck
  • Throughput vs. response time
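
A sketch with tbb::parallel_pipeline, using the oneTBB spelling tbb::filter_mode (older TBB wrote tbb::filter::serial_in_order): the serial end stages bound throughput, and the token limit caps how much work is in flight, trading throughput against response time.

#include "tbb/parallel_pipeline.h"
#include <cstdio>

// Three stages: serial read, parallel transform, serial in-order write.
// Throughput is limited by the slowest (bottleneck) stage.
void square_stream(FILE *in, FILE *out) {
    tbb::parallel_pipeline(/*max_live_tokens=*/8,
        tbb::make_filter<void, long>(tbb::filter_mode::serial_in_order,
            [&](tbb::flow_control& fc) -> long {
                long x;
                if (fscanf(in, "%ld", &x) != 1) { fc.stop(); return 0; }
                return x;
            })
      & tbb::make_filter<long, long>(tbb::filter_mode::parallel,
            [](long x) { return x * x; })            // the heavy stage
      & tbb::make_filter<long, void>(tbb::filter_mode::serial_in_order,
            [&](long x) { fprintf(out, "%ld\n", x); }));
}
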
SLIDE 16

Styles of parallel programming

  • Threads and locks
  • Asynchronous messaging – no sharing (actors)

  • Transactional memory
  • Deterministic shared memory
  • Fork-join tasks
  • Data parallelism
SLIDE 17

Fork-join Tasks

  • Tasks are objects with an "execute" behavior
  • Each thread has a queue of tasks
  • Tasks run to completion unless they wait for others to complete
  • No I/O. No locks.
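
A minimal fork-join sketch with tbb::task_group (Image, Scene, and the render_* helpers are hypothetical):

#include "tbb/task_group.h"

struct Image;                           // hypothetical types and helpers,
struct Scene;                           // not from the slides
void render_top_half(Image&, Scene*);
void render_bottom_half(Image&, Scene*);

// run() forks a child task; wait() is the join point. Per the slide,
// the tasks do no I/O and take no locks, so the parent can safely do
// half the work itself while the child runs.
void render(Image& img, Scene *world) {
    tbb::task_group g;
    g.run([&] { render_top_half(img, world); });   // fork
    render_bottom_half(img, world);                // parent's share
    g.wait();                                      // join
}
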
SLIDE 18

void tracerays(Scene *world) {
    for (size_t i = 0; i < WIDTH; i++) {
        for (size_t j = 0; j < HEIGHT; j++) {
            image[i][j] = traceray(i, j, world);
        }
    }
}

SLIDE 19

#include "tbb/parallel_for.h"
#include "tbb/blocked_range2d.h"
using namespace tbb;

class TraceRays {
    Scene *my_world;
public:
    void operator() (const blocked_range2d<size_t>& r) const { … }
    TraceRays(Scene *world) { my_world = world; }
};

SLIDE 20

void operator() (const blocked_range2d<size_t>& r) const {
    for (size_t i = r.rows().begin(); i != r.rows().end(); i++) {
        for (size_t j = r.cols().begin(); j != r.cols().end(); j++) {
            image[i][j] = traceray(i, j, my_world);
        }
    }
}

void tracerays(Scene *world) {
    parallel_for(blocked_range2d<size_t>(0, WIDTH, 8, 0, HEIGHT, 8),
                 TraceRays(world));
}

SLIDE 21

  • Parallel reduction
  • Lock-free atomic types
  • Locks (sigh!)
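
For the first bullet, a parallel reduction sketch with tbb::parallel_reduce; because + is associative, partial sums can be combined in any grouping with no locks or atomics:

#include "tbb/parallel_reduce.h"
#include "tbb/blocked_range.h"
#include <functional>
#include <cstddef>

double parallel_sum(const double *a, size_t n) {
    return tbb::parallel_reduce(
        tbb::blocked_range<size_t>(0, n),
        0.0,                                          // identity for +
        [=](const tbb::blocked_range<size_t>& r, double acc) {
            for (size_t i = r.begin(); i != r.end(); ++i)
                acc += a[i];                          // partial sum per subrange
            return acc;
        },
        std::plus<double>());                         // combine partials
}
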
SLIDE 22

  • TBB: http://threadedbuildingblocks.org
  • Java concurrency: http://g.oswego.edu/
  • Microsoft TPL and PPL: http://msdn.microsoft.com/concurrency

SLIDE 23

http://parallelpatterns.codeplex.com/

SLIDE 24

Common Strategy

  • Measure performance
  • Parallelize expensive loops
  • Add synchronization to fix data races
  • Eliminate bottlenecks by
    – Privatizing variables
    – Using lock-free data structures
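
A sketch of privatization with plain std::thread (count_positive and its threading scheme are invented for illustration): each thread accumulates into a private local instead of contending on one shared counter.

#include <thread>
#include <vector>
#include <cstddef>

long count_positive(const std::vector<int>& data, unsigned nthreads) {
    std::vector<long> partial(nthreads, 0);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < nthreads; ++t)
        workers.emplace_back([&, t] {
            long local = 0;                           // private variable
            for (size_t i = t; i < data.size(); i += nthreads)
                if (data[i] > 0) ++local;             // no sharing, no locks
            partial[t] = local;                       // one write per thread
        });
    for (auto& w : workers) w.join();
    long total = 0;
    for (long p : partial) total += p;                // sequential combine
    return total;
}
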


SLIDE 25

Data Parallelism

  • Single thread of control – the program looks sequential and is deterministic
  • Operates on collections (arrays, sets, …)
  • Instead of looping over a collection, perform a "single operation" on it
  • No side effects
  • APL, Lisp, Smalltalk did something similar for ease of use, not parallelism.

SLIDE 26

Data Parallelism

  • Easy to understand
  • Simple performance model
  • Doesn’t fit all problems
SLIDE 27

Operations

  • Map – apply a function to each element of a collection, producing a new collection
  • Map – apply a function with N arguments to N collections, producing a new collection
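
Both forms of map exist in the C++ standard library as std::transform; a small illustration (assuming out is already sized to match xs):

#include <algorithm>
#include <vector>

void demo_map(const std::vector<double>& xs, const std::vector<double>& ys,
              std::vector<double>& out) {
    // unary map: out[i] = 2 * xs[i]
    std::transform(xs.begin(), xs.end(), out.begin(),
                   [](double x) { return 2 * x; });
    // the two-range overload is the N = 2 case: out[i] = xs[i] + ys[i]
    std::transform(xs.begin(), xs.end(), ys.begin(), out.begin(),
                   [](double x, double y) { return x + y; });
}
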

SLIDE 28

Operations

  • Reduce – apply a binary, associative function to each element in succession, producing a single element
  • Select – apply a predicate to each element of a collection, returning the collection of elements for which the predicate is true
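
These correspond to std::reduce and std::copy_if; a brief illustration (std::reduce may reassociate the operations, which is exactly why the slide requires associativity):

#include <algorithm>
#include <numeric>
#include <vector>

double demo_reduce(const std::vector<double>& xs) {
    return std::reduce(xs.begin(), xs.end(), 0.0);   // fold with +
}

std::vector<double> demo_select(const std::vector<double>& xs) {
    std::vector<double> kept;
    std::copy_if(xs.begin(), xs.end(), std::back_inserter(kept),
                 [](double x) { return x > 0; });    // the predicate
    return kept;
}
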

SLIDE 29

Operations

  • Gather – given a collection of indices and an indexable collection, produce the collection of values at those indices
  • Scatter – given two collections, the i'th element of the result is the element of the second collection whose matching element in the first has value "i"
  • Divide – divide a collection into pieces
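
A sketch of gather and scatter as plain loops (function names invented here). Note that a parallel scatter is only safe when the indices are all distinct:

#include <vector>
#include <cstddef>

// Gather: out[k] = values[idx[k]] - read through an index collection.
std::vector<double> gather(const std::vector<size_t>& idx,
                           const std::vector<double>& values) {
    std::vector<double> out(idx.size());
    for (size_t k = 0; k < idx.size(); ++k)
        out[k] = values[idx[k]];
    return out;
}

// Scatter: out[idx[k]] = values[k] - write through an index collection.
void scatter(const std::vector<size_t>& idx,
             const std::vector<double>& values,
             std::vector<double>& out) {
    for (size_t k = 0; k < idx.size(); ++k)
        out[idx[k]] = values[k];
}
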


SLIDE 30

N-body

Body has variables: position, velocity, force, mass

for time = 1, 1000000 {
    for b = 1, numberOfBodies {
        bodies[b].computeForces(bodies);
        bodies[b].move();
    }
}

SLIDE 31

computeForces(Body *bodies) {
    force = 0;
    for i = 1, numberOfBodies {
        force += forceFrom(bodies[i])
    }
}

SLIDE 32

forceFromBody(Body body) {
    return mass * body.mass * G / distance(location, body.location) ^ 2
}

SLIDE 33

move() {
    velocity += timeIncrement * force / mass
    position += timeIncrement * velocity
}

SLIDE 34

Data Parallel

computeForces:
    map forceFrom to produce a collection of forces
    reduce with + to produce the sum

SLIDE 35

Data parallel

N-body:
    map computeForces to produce forces
    map velocity + timeIncrement * force / mass to produce velocities
    map position + timeIncrement * velocity to produce positions
    scatter velocities into body.velocity
    scatter positions into body.position
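
A sketch of this recipe in TBB, collapsed to 1D for brevity (Body's layout, computeForce, and dt are stand-ins for the pseudocode's names). The maps write into fresh arrays, so the step has no side effects until the final scatter back into the bodies:

#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
#include <vector>
#include <cstddef>

struct Body { double position, velocity, mass; };
double computeForce(const Body& b, const std::vector<Body>& all);  // assumed

void step(std::vector<Body>& bodies, double dt) {
    size_t n = bodies.size();
    std::vector<double> velocity(n), position(n);
    tbb::parallel_for(tbb::blocked_range<size_t>(0, n),
        [&](const tbb::blocked_range<size_t>& r) {
            for (size_t b = r.begin(); b != r.end(); ++b) {
                double force = computeForce(bodies[b], bodies);     // map
                velocity[b] = bodies[b].velocity + dt * force / bodies[b].mass;
                position[b] = bodies[b].position + dt * velocity[b];
            }
        });
    for (size_t b = 0; b < n; ++b) {                 // scatter results back
        bodies[b].velocity = velocity[b];
        bodies[b].position = position[b];
    }
}
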

SLIDE 36

TBB/java.util.concurrent/TPL

  • Each map becomes a parallel loop
  • In C++ without closures, each parallel loop requires a class to define operator()
  • In Java there is a large library of operators; otherwise you have to define a class

SLIDE 37

Messy, why bother?

  • Data parallelism really is easier
  • The compiler can vectorize more easily
  • It maps better to GPUs
  • There is better support in other languages
  • Better C++ support is coming in the near future
    – Intel Array Building Blocks

SLIDE 38

Parallel Programming Style

  • Data parallelism
    – Deterministic semantics, easy, efficient, no I/O
  • Fork-join tasking - shared memory
    – Hopefully deterministic semantics, no I/O
  • Actors - asynchronous message passing, no shared memory
    – Nondeterministic, good for I/O