SLIDE 1

Loops and parallelism

CO444H

Ben Livshits

SLIDE 2

Why Parallelism?

  • One way to speed up a computation is to use parallelism.
  • Unfortunately, it is not easy to develop software that can take advantage of parallel machines.
  • Dividing the computation into units that can execute on different processors in parallel is already hard enough; yet that by itself does not guarantee a speedup.
  • We must also minimize inter-processor communication, because communication overhead can easily make the parallel code run even slower than the sequential execution!

SLIDE 3

Maximizing Data Locality

  • Minimizing communication can be thought of as a special case of improving a program's data locality. In general, we say that a program has good data locality if a processor often accesses the same data it has used recently.
  • Surely if a processor on a parallel machine has good locality, it does not need to communicate with other processors frequently. Thus, parallelism and data locality need to be considered hand-in-hand.
  • Data locality, by itself, is also important for the performance of individual processors. Why?
  • Modern processors have one or more levels of caches in the memory hierarchy; a memory access can take tens of machine cycles, whereas a cache hit would only take a few cycles. If a program does not have good data locality and misses in the cache often, its performance will suffer.

SLIDE 4

Agenda

  • Introduction
  • Single Loop
  • Nested Loops
  • Data Dependence Analysis


Based on slides taken from Wei Li

SLIDE 5

Motivation: Better Parallelism

  • DOALL loops: loops whose iterations can execute in parallel
  • A new abstraction is needed
  • The abstraction used in data-flow analysis is inadequate: it combines information from all instances of a statement across the different index values


for i = 11, 20
  a[i] = a[i] + 3
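Below is a minimal sketch (not from the slides; the array, bounds, and helper names are invented) of why the DOALL property matters: because each iteration of a[i] = a[i] + 3 touches a different element, the iterations can be handed to a pool of workers in any order.

    # doall_sketch.py -- hypothetical illustration of running a DOALL loop in parallel
    from concurrent.futures import ThreadPoolExecutor

    a = list(range(30))  # some array; elements 11..20 will be updated

    def body(i):
        # loop body: iteration i reads and writes only a[i],
        # so no two iterations touch the same element
        a[i] = a[i] + 3

    # sequential order:  for i = 11, 20: a[i] = a[i] + 3
    # DOALL: iterations may run concurrently and in any order
    with ThreadPoolExecutor() as pool:
        list(pool.map(body, range(11, 21)))

    print(a[10:21])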

SLIDE 6

Focus on Affine Array Accesses

  • For example, if i and j are the index variables of surrounding loops, then Z[i][j] and Z[i][i + j] are affine accesses.
  • A function of one or more variables i1, i2, ..., in is affine if it can be expressed as a sum of a constant plus constant multiples of the variables,
  • i.e., c0 + c1*i1 + c2*i2 + ... + cn*in, where c0, c1, ..., cn are constants.
  • Affine functions are usually known as linear functions, although strictly speaking, linear functions do not have the c0 term.
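As a concrete illustration (my own sketch, not course code), an affine index expression can be stored as the constant c0 plus one coefficient per loop index; the two accesses Z[i][j] and Z[i][i + j] then become small tuples of such functions.

    # affine_access.py -- hypothetical encoding of affine index expressions
    def affine(c0, c1, c2):
        """Return the function (i, j) -> c0 + c1*i + c2*j."""
        return lambda i, j: c0 + c1 * i + c2 * j

    # Z[i][j]     -> subscript functions (i, j)
    # Z[i][i + j] -> subscript functions (i, i + j)
    z_ij   = (affine(0, 1, 0), affine(0, 0, 1))
    z_i_ij = (affine(0, 1, 0), affine(0, 1, 1))

    i, j = 2, 3
    print([f(i, j) for f in z_ij])    # [2, 3] -> element Z[2][3]
    print([f(i, j) for f in z_i_ij])  # [2, 5] -> element Z[2][5]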

SLIDE 7

Try to Minimize Inter-processor Communication

  • Processors on a symmetric multiprocessor share the same address space. To communicate, a processor can simply write to a memory location, which is then read by any other processor.
  • Symmetric multiprocessors use a coherent cache protocol to hide the presence of caches from the programmer.
  • When a processor wishes to write to a cache line, copies of that line in all other caches are removed. When a processor requests data not found in its cache, the request goes out on the shared bus, and the data will be fetched either from memory or from the cache of another processor.

SLIDE 8

Memory Access Costs

  • The time taken for one processor to communicate with another is about twice the cost of a memory access.
  • The data, in units of cache lines, must first be written from the first processor's cache to memory, and then fetched from memory into the cache of the second processor.
  • You may think that inter-processor communication is relatively cheap, since it is only about twice as slow as a memory access.
  • However, memory accesses are very expensive when compared to cache hits: they can be a hundred times slower.
  • This analysis brings home the similarity between efficient parallelization and locality analysis.
  • For a processor to perform well, either on its own or in the context of a multiprocessor, it must find most of the data it operates on in its cache.

SLIDE 9

Application-level Parallelism

  • We use two high-level metrics to estimate how well a parallel application will perform:
  • parallelism coverage, which is the percentage of the computation that runs in parallel, and
  • granularity of parallelism, which is the amount of computation that each processor can execute without synchronizing or communicating with others
  • Loops are a great target to parallelize
  • What properties of loops are we looking for in terms of better parallelism?

SLIDE 10

Other Examples of Coarse-Grained Parallelism

  • How about map-reduce computations?
  • Analyzing astronomical data from telescopes, etc.?
  • What about a web server with a back-end database?
  • Running simulations with multiple parameters?

SLIDE 11

TPL: Task Parallel Library

SLIDE 12

More Elaborate TPL Example


https://msdn.microsoft.com/en-us/library/dd997393(v=vs.110).aspx

  • When a ForEach<TSource> loop executes, it divides its source collection into multiple partitions
  • Each partition will get its own copy of the "thread-local" variable
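The C# code for this slide lives on the MSDN page above and is not reproduced here. As a rough analogue in Python (all names invented), the sketch below partitions the input, gives each worker its own local accumulator, and merges the partial results at the end, which mirrors how Parallel.ForEach uses its per-partition thread-local state.

    # partitioned_sum.py -- hypothetical analogue of a partitioned ForEach with thread-local state
    from concurrent.futures import ThreadPoolExecutor

    def process_partition(chunk):
        # each partition gets its own local accumulator,
        # so the loop body needs no locking
        local_total = 0
        for x in chunk:
            local_total += x * x
        return local_total

    data = list(range(1, 1001))
    num_partitions = 4
    size = (len(data) + num_partitions - 1) // num_partitions
    partitions = [data[k:k + size] for k in range(0, len(data), size)]

    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partials = list(pool.map(process_partition, partitions))

    print(sum(partials))  # merge step, analogous to the final per-thread merge in TPL
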
SLIDE 13

Automatic Parallelism

  • With TPL, we saw some examples of developer-controlled parallelism
  • The developer has to plan ahead and parallelize their code carefully
  • They can make mistakes; some can lead to incorrect results and others to crashes
  • Some of these bugs may only exhibit themselves on machines with a large number of processors, because execution schedules will be more complex

SLIDE 14

Matrices: Layout Considerations

  • Suppose Z is stored in row-major order
  • We can do it column-by-column
  • Or row-by-row – this will match the layout
  • Or we can parallelize the outer loop here
  • b is the partition to give to every processor
  • M processors
  • This is the code for the p-th processor (a sketch follows below)
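The loop code on this slide is an image that did not survive extraction. A minimal sketch of the idea, with invented names: give each of M processors a contiguous block of b rows; processor p executes only its own block, and the inner loop walks along each row so accesses follow the row-major layout.

    # spmd_block.py -- hypothetical block-partitioned outer loop (processor p's share)
    def process_block(Z, p, M):
        n_rows = len(Z)
        n_cols = len(Z[0])
        b = (n_rows + M - 1) // M          # block size per processor (ceiling)
        lo = p * b
        hi = min(lo + b, n_rows)
        for i in range(lo, hi):            # outer loop: rows owned by processor p
            for j in range(n_cols):        # inner loop: row-major friendly traversal
                Z[i][j] = Z[i][j] + 3

    # usage: run all M "processors" (sequentially here, just for illustration)
    Z = [[0] * 4 for _ in range(10)]
    M = 3
    for p in range(M):
        process_block(Z, p, M)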

SLIDE 15

Examples

for i = 11, 20
  a[i] = a[i] + 3
Parallel?

for i = 11, 20
  a[i] = a[i-1] + 3
Parallel?

SLIDE 16

Examples

for i = 11, 20
  a[i] = a[i] + 3
Parallel

for i = 11, 20
  a[i] = a[i-1] + 3
Not parallel

for i = 11, 20
  a[i] = a[i-10] + 3
Parallel?

SLIDE 17

Single Loops

SLIDE 18

Data Dependence of Scalar Variables

  • True dependence:
      a =
        = a
  • Anti-dependence:
        = a
      a =
  • Output dependence:
      a =
      a =
  • Input dependence:
        = a
        = a

SLIDE 19

Array Accesses in a Loop

for i = 2, 5
  a[i] = a[i] + 3

  iteration   read   write
  i = 2       a[2]   a[2]
  i = 3       a[3]   a[3]
  i = 4       a[4]   a[4]
  i = 5       a[5]   a[5]

SLIDE 20

Array True-dependence

for i = 2, 5
  a[i] = a[i-2] + 3

  iteration   read   write
  i = 2       a[0]   a[2]
  i = 3       a[1]   a[3]
  i = 4       a[2]   a[4]
  i = 5       a[3]   a[5]

SLIDE 21

Array Anti-dependence

for i = 2, 5
  a[i-2] = a[i] + 3

  iteration   read   write
  i = 2       a[2]   a[0]
  i = 3       a[3]   a[1]
  i = 4       a[4]   a[2]
  i = 5       a[5]   a[3]

SLIDE 22


Dynamic Data Dependence

  • Let o and o’ be two (dynamic) operations
  • Data dependence exists from o to o’, iff
  • either o or o’ is a write operation
  • o and o’ may refer to the same location
  • o executes before o’
SLIDE 23

Static Data Dependence

  • Let a and a’ be two static array accesses (not necessarily distinct)
  • Data dependence exists from a to a’, iff
  • either a or a’ is a write operation
  • There exists a dynamic instance of a (call it o) and a dynamic instance of a’ (call it o’) such that
  • o and o’ may refer to the same location
  • o executes before o’

SLIDE 24

Recognizing DOALL Loops

  • Find data dependences in the loop
  • Definition: a dependence is loop-carried if it crosses an iteration boundary
  • If there are no loop-carried dependences, then the loop is parallelizable (see the sketch below)
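For concrete loop bounds, this condition can be checked by brute force (a sketch for illustration only; a compiler must reason symbolically instead of enumerating): for every pair of distinct iterations, test whether one writes an element the other reads or writes. The names below are invented.

    # doall_check.py -- hypothetical brute-force DOALL test for a single loop
    def has_loop_carried_dependence(lo, hi, read_index, write_index):
        """read_index / write_index map iteration i to the element it reads / writes."""
        for i1 in range(lo, hi + 1):
            for i2 in range(lo, hi + 1):
                if i1 == i2:
                    continue  # same iteration: not loop-carried
                if write_index(i1) == read_index(i2):   # true or anti dependence
                    return True
                if write_index(i1) == write_index(i2):  # output dependence
                    return True
        return False

    # a[i] = a[i] + 3    -> no loop-carried dependence: DOALL
    print(has_loop_carried_dependence(11, 20, lambda i: i, lambda i: i))        # False
    # a[i] = a[i-1] + 3  -> loop-carried true dependence: not DOALL
    print(has_loop_carried_dependence(11, 20, lambda i: i - 1, lambda i: i))    # True
    # a[i] = a[i-10] + 3 -> reads a[1..10], writes a[11..20]: DOALL
    print(has_loop_carried_dependence(11, 20, lambda i: i - 10, lambda i: i))   # False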

SLIDE 25

Compute Dependences

  • There is a dependence between a[i] and a[i-2] if
  • there exist two iterations ir and iw within the loop bounds such that iterations ir and iw read and write the same array element, respectively
  • i.e., there exist ir, iw with 2 ≤ ir, iw ≤ 5 and ir = iw - 2

for i = 2, 5
  a[i-2] = a[i] + 3

SLIDE 26

Compute Dependences

  • There is a dependence between a[i-2] and a[i-2] if
  • there exist two iterations iv and iw within the loop bounds such that iterations iv and iw both write the same array element
  • i.e., there exist iv, iw with 2 ≤ iv, iw ≤ 5 and iv - 2 = iw - 2

for i = 2, 5
  a[i-2] = a[i] + 3

SLIDE 27

Parallelization

  • Is there a loop-carried dependence between a[i] and a[i-2]?
  • Is there a loop-carried dependence between a[i-2] and a[i-2]?

for i = 2, 5
  a[i-2] = a[i] + 3

SLIDE 28

Nested Loops

SLIDE 29

Iteration Spaces

  • The iteration space is the set of the dynamic execution instances in a computation, that is, the set of combinations of values taken on by the loop indexes.
  • The data space is the set of array elements accessed.
  • The processor space is the set of processors in the system. Normally, these processors are assigned integer numbers or vectors of integers to distinguish among them.

SLIDE 30

Iteration Spaces Illustrated

SLIDE 31

Nested Loops

  • Which loop(s) are parallel?

for i1 = 0, 5
  for i2 = 0, 3
    a[i1,i2] = a[i1-2,i2-1] + 3

SLIDE 32

Iteration Space

  • An abstraction for loops
  • Each iteration is represented as coordinates in the iteration space.

for i1 = 0, 5
  for i2 = 0, 3
    a[i1,i2] = 3

SLIDE 33

Execution Order

  • Sequential execution order of iterations: lexicographic order [0,0], [0,1], … [0,3], [1,0], [1,1], … [1,3], [2,0] …
  • Let I = (i1, i2, …, in). I is lexicographically less than I’, written I < I’, iff there exists k such that (i1, …, ik-1) = (i’1, …, i’k-1) and ik < i’k
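For concrete bounds, the lexicographic order is exactly the order in which Python compares and sorts tuples, so the sequential execution order of the iteration space above can be sketched as follows (illustrative code, not from the course):

    # iteration_order.py -- hypothetical illustration of lexicographic iteration order
    from itertools import product

    # iteration space of: for i1 = 0, 5 / for i2 = 0, 3
    points = list(product(range(0, 6), range(0, 4)))

    # tuples compare lexicographically in Python, so sorting gives the
    # sequential execution order [0,0], [0,1], ..., [0,3], [1,0], ...
    assert points == sorted(points)

    def lex_less(I, J):
        """I < J in lexicographic order (same as Python's tuple comparison)."""
        return I < J

    print(points[:6])                # [(0, 0), (0, 1), (0, 2), (0, 3), (1, 0), (1, 1)]
    print(lex_less((1, 3), (2, 0)))  # True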

SLIDE 34

Parallelism for Nested Loops

  • Is there a data dependence between a[i1,i2] and a[i1-2,i2-1]?
  • There exist i1r, i2r, i1w, i2w such that
  • 0 ≤ i1r, i1w ≤ 5,
  • 0 ≤ i2r, i2w ≤ 3,
  • i1r - 2 = i1w,
  • i2r - 1 = i2w

SLIDE 35

Loop-carried Dependence

  • If there are no loop-carried dependences, then the loop is parallelizable
  • Outer: dependence carried by the outer loop:
  • i1r ≠ i1w
  • Inner: dependence carried by the inner loop:
  • i1r = i1w
  • i2r ≠ i2w
  • This naturally extends to a dependence carried by loop level k

SLIDE 36

Nested Loops

  • Which loop carries the dependence? (see the sketch below)

for i1 = 0, 5
  for i2 = 0, 3
    a[i1,i2] = a[i1-2,i2-1] + 3
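A brute-force sketch (illustrative only, invented names) that enumerates this iteration space and applies the conditions from the previous slide to see which loop carries the dependence:

    # carried_level.py -- hypothetical check of which loop carries the dependence
    from itertools import product

    def carried_levels():
        outer = inner = False
        for (i1w, i2w), (i1r, i2r) in product(product(range(6), range(4)), repeat=2):
            # iteration (i1w, i2w) writes a[i1w, i2w];
            # iteration (i1r, i2r) reads  a[i1r-2, i2r-1]
            if (i1w, i2w) == (i1r - 2, i2r - 1) and (i1w, i2w) != (i1r, i2r):
                if i1w != i1r:
                    outer = True      # carried by the outer loop (i1 differs)
                elif i2w != i2r:
                    inner = True      # carried by the inner loop (i1 equal, i2 differs)
        return outer, inner

    print(carried_levels())  # (carried by outer?, carried by inner?) for a[i1,i2] = a[i1-2,i2-1] + 3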

SLIDE 37

Data Dependence Analysis

SLIDE 38

Optimizing Compilers and Parallelism

  • Ideally, a parallelizing compiler automatically translates ordinary sequential programs into efficient parallel programs and optimizes the locality of these programs.
  • Unfortunately, compilers without high-level knowledge about the application can only preserve the semantics of the original algorithm, which may not be amenable to parallelization.
  • Furthermore, programmers may have made arbitrary choices that limit the program's parallelism.

SLIDE 39

Solving Data Dependence Problems

  • Memory disambiguation is undecidable at compile-time.

read(n)
for i = 0, 3
  a[i] = a[n] + 3

SLIDE 40

Domain of Data Dependence Analysis

  • Only use loop bounds and array indices which are integer linear functions of variables.

for i1 = 1, n
  for i2 = 2*i1, 100
    a[i1+2*i2+3][4*i1+2*i2][i1*i1] = …
    … = a[1][2*i1+1][i2] + 3

SLIDE 41

Equations

  • There is a data dependence if
  • there exist i1r, i2r, i1w, i2w such that
  • 1 ≤ i1r, i1w ≤ n, 2*i1r ≤ i2r ≤ 100, 2*i1w ≤ i2w ≤ 100,
  • i1w + 2*i2w + 3 = 1, 4*i1w + 2*i2w = 2*i1r + 1
  • Note: the non-affine subscript i1*i1 is ignored

for i1 = 1, n
  for i2 = 2*i1, 100
    a[i1+2*i2+3][4*i1+2*i2][i1*i1] = …
    … = a[1][2*i1+1][i2] + 3

SLIDE 42

Solutions

  • There is a data dependence if
  • there exist i1r, i2r, i1w, i2w such that
  • 1 ≤ i1r, i1w ≤ n, 2*i1r ≤ i2r ≤ 100, 2*i1w ≤ i2w ≤ 100,
  • i1w + 2*i2w + 3 = 1, 4*i1w + 2*i2w = 2*i1r + 1
  • No solution → no data dependence
  • Solution → there may be a dependence

SLIDE 43

Form of Data Dependence Analysis

  • Data dependence problems originally contain both equalities and inequalities
  • Eliminate inequalities (≠) in the problem statement:
  • Replace a ≠ b with two sub-problems: a > b or a < b
  • We get

∃ integer vector ī such that A1 ī = b1, A2 ī ≤ b2

SLIDE 44

Form of Data Dependence Analysis

  • Eliminate equalities in the problem statement:
  • Replace a = b with two sub-problems: a ≤ b and b ≤ a
  • Integer programming is NP-complete, i.e., expensive

∃ integer vector ī such that A ī ≤ b

SLIDE 45

Techniques: Inexact Tests

  • Examples: GCD test, Banerjee’s test
  • Two outcomes:
  • No → no dependence
  • Don’t know → assume there is a solution → dependence
  • May add extra data dependence constraints
  • Sacrifice parallelism for compiler efficiency

SLIDE 46

GCD Test

  • Is there any dependence?
  • Solve a linear Diophantine equation
  • 2*iw = 2*ir + 1

for i = 1, 100
  a[2*i] = …
  … = a[2*i+1] + 3

SLIDE 47

GCD

  • The greatest common divisor (GCD) of integers a1, a2, …, an, denoted gcd(a1, a2, …, an), is the largest integer that evenly divides all these integers.
  • Theorem: the linear Diophantine equation

      a1*x1 + a2*x2 + ... + an*xn = c

    has an integer solution x1, x2, …, xn iff gcd(a1, a2, …, an) divides c
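A compact sketch of the GCD test (names invented), applied to 2*iw - 2*ir = 1 from the previous slide: gcd(2, -2) = 2 does not divide 1, so the equation has no integer solution and there is no dependence.

    # gcd_test.py -- hypothetical GCD test for a linear Diophantine equation
    from functools import reduce
    from math import gcd

    def gcd_test(coeffs, c):
        """a1*x1 + ... + an*xn = c has an integer solution
        iff gcd(a1, ..., an) divides c."""
        g = reduce(gcd, (abs(a) for a in coeffs))
        return c % g == 0

    # a[2*i] = ...; ... = a[2*i+1]: a dependence needs 2*iw - 2*ir = 1
    print(gcd_test([2, -2], 1))        # False -> no dependence
    # an equation with solutions: 24*x + 36*y + 54*z = 30, gcd = 6 divides 30
    print(gcd_test([24, 36, 54], 30))  # True  -> there may be a dependence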

SLIDE 48

Examples

  • Example 1: 2*x1 - 2*x2 = 1. gcd(2, -2) = 2 does not divide 1: no solutions
  • Example 2: 24*x + 36*y + 54*z = 30. gcd(24, 36, 54) = 6 divides 30: many solutions

SLIDE 49

Multiple Equalities

  • Equation 1: x - 2y + z = 2. gcd(1, -2, 1) = 1: many solutions
  • Equation 2: 3x + 2y + z = 5. gcd(3, 2, 1) = 1: many solutions
  • Is there any solution satisfying both equations?

SLIDE 50

The Euclidean Algorithm

  • Assume a and b are positive integers, and a > b.
  • Let c be the remainder of a/b.
  • If c = 0, then gcd(a,b) = b.
  • Otherwise, gcd(a,b) = gcd(b,c).
  • gcd(a1, a2, …, an) = gcd(gcd(a1, a2), a3, …, an)
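These rules transcribe directly into code (a sketch; Python's math.gcd already implements this):

    # euclid.py -- the Euclidean algorithm as stated above
    from functools import reduce

    def gcd2(a, b):
        # assumes a and b are positive integers
        c = a % b            # remainder of a / b
        if c == 0:
            return b
        return gcd2(b, c)    # gcd(a, b) = gcd(b, c)

    def gcd_n(*args):
        # gcd(a1, a2, ..., an) = gcd(gcd(a1, a2), a3, ..., an)
        return reduce(gcd2, args)

    print(gcd2(54, 24))       # 6
    print(gcd_n(24, 36, 54))  # 6
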
SLIDE 51

Exact Analysis

  • Most memory disambiguations are simple integer programs.
  • Approach:
  • Solve exactly – either there is a solution or there is no solution
  • Solve exactly with Fourier-Motzkin elimination + branch and bound
  • Omega package from the University of Maryland

SLIDE 52

Incremental Analysis

  • Use a series of simple tests to solve simple programs (based on properties of inequalities rather than array access patterns)
  • Solve the remaining cases exactly with Fourier-Motzkin elimination + branch and bound
  • Memoization (see the sketch below)
  • Many identical integer programs are solved for each program
  • Save the results so they need not be recomputed
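A minimal sketch of the memoization idea (illustrative; the real implementation caches integer-programming problems, not this toy test): cache the result of each dependence query so identical queries are solved only once.

    # memoized_test.py -- hypothetical memoized dependence query
    from functools import lru_cache, reduce
    from math import gcd

    @lru_cache(maxsize=None)
    def gcd_test(coeffs, c):
        # coeffs must be a tuple so the arguments are hashable and cacheable
        g = reduce(gcd, (abs(a) for a in coeffs))
        return c % g == 0

    print(gcd_test((2, -2), 1))   # solved once...
    print(gcd_test((2, -2), 1))   # ...answered from the cache the second time
    print(gcd_test.cache_info())  # hits=1, misses=1
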
SLIDE 53


State of the Art

  • Multiprocessors need large outer parallel loops
  • Many inter-procedural optimizations are needed
  • Interprocedural scalar optimizations
  • Dependence
  • Privatization
  • Reduction recognition
  • Interprocedural array analysis
  • Array section analysis
SLIDE 54


Summary

  • DOALL loops
  • Iteration Space
  • Data dependence analysis