Loops and parallelism
CO444H
Ben Livshits
Why Parallelism?
One way to speed up a computation is to use parallelism. Unfortunately, it is not easy to develop software that can take advantage of parallel machines. Dividing the work among different processors in parallel is already hard enough; yet that by itself does not guarantee a speedup, because communication overhead can easily make the parallel code run even slower than the sequential execution!
Minimizing communication can be thought of as improving a program's data locality. In general, we say that a program has good data locality if a processor often accesses the same data it has used recently. A parallel program with poor locality may need to communicate with other processors frequently; thus, parallelism and data locality need to be considered hand-in-hand. Data locality, by itself, is also important for the performance of individual processors: modern machines have a memory hierarchy, and a memory access can take tens of machine cycles whereas a cache hit takes only a few. If a program does not have good data locality and misses in the cache often, its performance will suffer.
Based on slides taken from Wei Li
We look for loops whose iterations can run in parallel; an iteration over multiple indices is combined into a single index vector.
for i = 11, 20
  a[i] = a[i] + 3
If i and j are loop index variables, then Z[i][j] and Z[i][i + j] are affine accesses. An access is affine if each index expression can be expressed as a sum of a constant, plus constant multiples of the variables:

c0 + c1x1 + … + cnxn, where c0, c1, …, cn are constants.

Affine functions are often called linear, although strictly speaking, linear functions do not have the c0 term.
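To make the affine form concrete, here is a minimal sketch (the helper name `affine` is mine, not from the slides) that evaluates an index expression as a constant plus constant multiples of the loop indices:

```python
def affine(c0, coeffs, indices):
    # Evaluate c0 + c1*x1 + ... + cn*xn.
    return c0 + sum(c * x for c, x in zip(coeffs, indices))

# With loop indices i = 2, j = 3:
i, j = 2, 3
first  = affine(0, [1, 0], (i, j))   # subscript i of Z[i][j]     -> 2
second = affine(0, [0, 1], (i, j))   # subscript j of Z[i][j]     -> 3
third  = affine(0, [1, 1], (i, j))   # subscript i+j of Z[i][i+j] -> 5
```

A non-affine subscript such as i*i cannot be written in this form, since it multiplies a variable by a variable.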
Processors in a shared-memory multiprocessor share the same address space. To communicate, a processor can simply write to a memory location, which is then read by any other processor. The hardware uses a coherent cache protocol to hide the presence of caches from the programmer: when a processor writes to a cache line, copies from all other caches are removed. When a processor requests data not found in its cache, the request goes out on the shared bus, and the data will be fetched either from memory or from the cache of another processor.
On such a machine, the cost for one processor to communicate with another is about twice the cost of a memory access: the data must first be written from the first processor's cache to memory, and then fetched from memory into the cache of the second processor. Thus communication is relatively cheap, since it is only about twice as slow as a memory access. Memory accesses themselves, however, are expensive when compared to cache hits: they can be a hundred times slower.
There is a similarity between efficient parallelization and locality analysis: for a processor to run fast, either on its own or in the context of a multiprocessor, it must find most of the data it operates on in its cache.
Two metrics estimate how well a parallel application will perform: parallelism coverage, the percentage of the computation that runs in parallel, and granularity of parallelism, the amount of computation that each processor can execute without synchronizing or communicating with others.
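The percentage of the computation that runs in parallel bounds the achievable speedup; the classical statement of this bound is Amdahl's law (the formula below is standard, not taken from the slide):

```python
def amdahl_speedup(p, n):
    # Amdahl's law: if a fraction p of the work runs in parallel on
    # n processors, the best overall speedup is 1 / ((1 - p) + p / n).
    return 1.0 / ((1.0 - p) + p / n)

# 90% coverage caps the speedup below 10x, no matter how many processors:
four  = amdahl_speedup(0.9, 4)       # ~3.08
many  = amdahl_speedup(0.9, 10**6)   # just under 10.0
```

This is why coverage matters: the serial fraction, not the processor count, dominates once n is large.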
What to parallelize? What kinds of loops are we looking for in terms of better parallelism?
Map/reduce computations?
Processing astronomical data from telescopes, etc.?
A web server with a back-end database?
Simulations with multiple parameters?
TPL: Task Parallel Library
https://msdn.microsoft.com/en-us/library/dd997393(v=vs.110).aspx
Parallel.For partitions the data into multiple partitions and processes them with controlled parallelism. Developers must write their code carefully: otherwise, some runs produce correct results and others crash. Such bugs show up more readily on machines with a large number of processors, because execution schedules will be more complex.
Arrays are stored in row-major order, so traversing down a column strides through memory. Rearrange the loops so that the traversal matches the layout; the outer loop here can then be distributed so that every processor works on contiguous data.
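Row-major layout determines which traversal order is cache-friendly. A small sketch of the address arithmetic (the helper name is mine):

```python
def row_major_offset(i, j, ncols):
    # Element [i][j] of a row-major array lives at linear offset i*ncols + j.
    return i * ncols + j

# Traversing a row of a 6x4 array touches consecutive offsets (good locality):
row = [row_major_offset(2, j, 4) for j in range(4)]   # [8, 9, 10, 11]
# Traversing a column strides by ncols (poor locality):
col = [row_major_offset(i, 1, 4) for i in range(4)]   # [1, 5, 9, 13]
```

A loop that walks `row`-style offsets stays within a cache line far longer than one that walks `col`-style offsets.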
for i = 11, 20
  a[i] = a[i] + 3      Parallel?

for i = 11, 20
  a[i] = a[i-1] + 3    Parallel?
for i = 11, 20
  a[i] = a[i] + 3      Parallel

for i = 11, 20
  a[i] = a[i-1] + 3    Not parallel

for i = 11, 20
  a[i] = a[i-10] + 3   Parallel?
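The answers can be checked mechanically: if a loop has no cross-iteration dependence, running its iterations in a different order must give the same result. A small experiment (the harness is my own, not from the slides):

```python
def run(body, order):
    # Execute the loop body once per index, in the given iteration order.
    a = list(range(30))
    for i in order:
        body(a, i)
    return a

def same(a, i): a[i] = a[i] + 3       # a[i] = a[i] + 3
def prev(a, i): a[i] = a[i - 1] + 3   # a[i] = a[i-1] + 3
def far(a, i):  a[i] = a[i - 10] + 3  # a[i] = a[i-10] + 3

fwd = list(range(11, 21))             # i = 11, ..., 20
rev = fwd[::-1]

assert run(same, fwd) == run(same, rev)   # independent iterations
assert run(far,  fwd) == run(far,  rev)   # reads a[1..10], never written
assert run(prev, fwd) != run(prev, rev)   # loop-carried dependence: order matters
```

This confirms the third loop is parallel: its reads fall entirely outside the range of elements it writes.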
True (flow) dependence:  a = …   followed by   … = a
Anti dependence:         … = a   followed by   a = …
Output dependence:       a = …   followed by   a = …
Input dependence:        … = a   followed by   … = a
for i = 2, 5
  a[i] = a[i] + 3

iteration   read   write
i = 2       a[2]   a[2]
i = 3       a[3]   a[3]
i = 4       a[4]   a[4]
i = 5       a[5]   a[5]
for i = 2, 5
  a[i] = a[i-2] + 3

iteration   read   write
i = 2       a[0]   a[2]
i = 3       a[1]   a[3]
i = 4       a[2]   a[4]
i = 5       a[3]   a[5]
for i = 2, 5
  a[i-2] = a[i] + 3

iteration   read   write
i = 2       a[2]   a[0]
i = 3       a[3]   a[1]
i = 4       a[4]   a[2]
i = 5       a[5]   a[3]
Static Data Dependence
There is a data dependence between statements a and a' (not necessarily distinct) if there is a dynamic instance of a (o) and an instance of a' (o') such that both access the same memory location, at least one of the accesses is a write, and o executes before o'.
A dependence is loop-carried if it crosses an iteration boundary. If a loop has no loop-carried dependences, then the loop is parallelizable.
There is a dependence between the read and the write if there exist iterations ir and iw within the loop bounds such that iterations ir and iw read and write the same array element, respectively:

for i = 2, 5
  a[i-2] = a[i] + 3
There is an output dependence between two instances of the write a[i-2] if there exist iterations iv and iw within the loop bounds such that iterations iv and iw write the same array element, respectively:

for i = 2, 5
  a[i-2] = a[i] + 3
Is there a dependence between a[i] and a[i-2]? Is there a dependence between a[i-2] and a[i-2]?

for i = 2, 5
  a[i-2] = a[i] + 3
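For a loop this small, the questions can be answered by brute force: enumerate all ordered iteration pairs and compare the array elements they access (the helper names are mine):

```python
from itertools import product

def dependences(lo, hi, write_idx, read_idx):
    # Enumerate cross-iteration dependences for a one-statement loop,
    # where write_idx/read_idx map the counter to the element written/read.
    deps = []
    for i1, i2 in product(range(lo, hi + 1), repeat=2):
        if i1 >= i2:
            continue                          # i1 is the earlier iteration
        if write_idx(i1) == read_idx(i2):
            deps.append(("flow", i1, i2))     # earlier write, later read
        if read_idx(i1) == write_idx(i2):
            deps.append(("anti", i1, i2))     # earlier read, later write
        if write_idx(i1) == write_idx(i2):
            deps.append(("output", i1, i2))   # two writes to one element
    return deps

# for i = 2, 5:  a[i-2] = a[i] + 3
deps = dependences(2, 5, lambda i: i - 2, lambda i: i)
# -> [('anti', 2, 4), ('anti', 3, 5)]
```

So there are anti dependences between a[i] and a[i-2] (iterations 2→4 and 3→5), and no output dependences between instances of a[i-2].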
Iteration space: the set of dynamic execution instances in a computation, that is, the set of combinations of values taken on by the loop indexes.
Data space: the set of array elements accessed.
Processor space: the set of processors in the system; we assign integer numbers or vectors of integers to distinguish among them.
for i1 = 0, 5
  for i2 = 0, 3
    a[i1,i2] = a[i1-2,i2-1] + 3
Iterations can be viewed as coordinates in iteration space:

for i1 = 0, 5
  for i2 = 0, 3
    a[i1,i2] = 3
Sequential execution visits iterations in lexicographic order: [0,0], …, [0,3], [1,0], [1,1], …, [1,3], [2,0], … A vector I is lexicographically less than I', written I < I', iff there exists k such that (i1, …, ik-1) = (i'1, …, i'k-1) and ik < i'k.
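The lexicographic order of the iteration space coincides with ordinary tuple comparison, which makes it easy to check (a small sketch of my own):

```python
from itertools import product

# Iteration space of:  for i1 = 0, 5: for i2 = 0, 3
space = list(product(range(6), range(4)))

# Sequential execution order is exactly lexicographic (tuple) order:
assert space == sorted(space)
assert space[:6] == [(0, 0), (0, 1), (0, 2), (0, 3), (1, 0), (1, 1)]

# Definition check: equal prefix (i1 = 1), then i2 = 3 < ... no; here
# the prefixes differ at k = 1, and 1 < 2:
assert (1, 3) < (2, 0)
```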
Is there a dependence between a[i1,i2] and a[i1-2,i2-1]?
A dependence is carried by loop level k if it crosses an iteration boundary of the loop at level k. If no dependence is carried by a loop, that loop is parallelizable.
What is the dependence?

for i1 = 0, 5
  for i2 = 0, 3
    a[i1,i2] = a[i1-2,i2-1] + 3
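Each iteration (i1, i2) writes a[i1][i2] and reads a[i1-2][i2-1], the element written two iterations back in i1 and one back in i2. Enumerating the iteration space confirms a single dependence distance (my own sketch):

```python
from itertools import product

# for i1 = 0, 5: for i2 = 0, 3:  a[i1,i2] = a[i1-2,i2-1] + 3
iters = list(product(range(6), range(4)))
writer = {it: it for it in iters}     # iteration (i1, i2) writes element (i1, i2)

distances = set()
for (i1, i2) in iters:
    read = (i1 - 2, i2 - 1)           # element read in this iteration
    if read in writer:                # written by an earlier iteration?
        w = writer[read]
        distances.add((i1 - w[0], i2 - w[1]))

assert distances == {(2, 1)}          # the dependence distance vector
```

The distance vector (2, 1) is carried by the outer loop (first nonzero component), so the outer loop is not parallelizable as written, while iterations that differ only in i2 are independent.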
Optimizing Compilers and Parallelism
An optimizing compiler automatically parallelizes sequential programs and optimizes the locality of these programs. The compiler, knowing nothing about the application, can only preserve the semantics of the original algorithm, which may not be amenable to parallelization. Programmers may also make implementation choices that limit the program's parallelism.
Solving Data Dependence Problems
Not all data dependence problems can be resolved at compile-time:

read(n)
for i = 0, 3
  a[i] = a[n] + 3

Whether the read of a[n] conflicts with a write of a[i] depends on the value of n, which is known only at run time.
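Here is the run-time behaviour for two values of n (the harness is my own): with n = 2 the loop reads an element it has already overwritten, while with n = 7 it does not.

```python
def run(n):
    a = list(range(8))        # a[i] = i initially
    for i in range(0, 4):     # for i = 0, 3
        a[i] = a[n] + 3
    return a

# n = 7: a[7] is never written by the loop, so every iteration reads 7.
assert run(7) == [10, 10, 10, 10, 4, 5, 6, 7]
# n = 2: iteration i = 2 overwrites a[2] with 5, so iteration i = 3
# reads the new value and stores 8 -- a flow dependence.
assert run(2) == [5, 5, 5, 8, 4, 5, 6, 7]
```

No static analysis can distinguish these two cases without knowing n.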
Domain of Data Dependence Analysis
Dependence analysis applies to array accesses whose index expressions are integer linear functions of the loop variables:

for i1 = 1, n
  for i2 = 2*i1, 100
    a[i1+2*i2+3][4*i1+2*i2][i1*i1] = …
    … = a[1][2*i1+1][i2] + 3

Here the first two subscripts of the write are affine, but i1*i1 is not, so that dimension falls outside the domain of the analysis.
Form of Data Dependence Analysis
The problem is cast as proving or disproving the existence of integer solutions to systems of linear equalities and inequalities of the form

a1x1 + a2x2 + … + anxn = c
for i = 1, 100
  a[2*i] = …
  … = a[2*i+1] + 3
The greatest common divisor of integers a1, a2, …, an, denoted gcd(a1, a2, …, an), is the largest integer that evenly divides all these integers. The equation

a1x1 + a2x2 + … + anxn = c

has an integer solution x1, x2, …, xn iff gcd(a1, a2, …, an) divides c.
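The GCD test is a few lines of code (the function name is mine). Applied to the loop above, a conflict between a[2*i] and a[2*i+1] would require 2*i = 2*i' + 1:

```python
from functools import reduce
from math import gcd

def gcd_test(coeffs, c):
    # a1*x1 + ... + an*xn = c has an integer solution
    # iff gcd(a1, ..., an) divides c.
    return c % reduce(gcd, coeffs) == 0

# 2*i - 2*i' = 1: gcd(2, -2) = 2 does not divide 1, so no dependence.
assert not gcd_test([2, -2], 1)
# By contrast, 2*i - 2*i' = 4 would be solvable (e.g. i = i' + 2).
assert gcd_test([2, -2], 4)
```

Note the test only decides integer solvability; it ignores the loop bounds, so it can prove independence but a "solvable" answer does not by itself prove a dependence exists.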
In the example above, a conflict between a[2*i] and a[2*i+1] requires 2x1 - 2x2 = 1; since gcd(2, 2) = 2 does not divide 1, there is no dependence.
How do we decide whether such a system of equations has an integer solution?
Integer linear programming is NP-complete in general, but the systems that arise in practice are simple and can be solved efficiently for real programs.
A simpler test (based on properties of inequalities rather than array access patterns) compares the lower and upper bounds of the index expressions: if the ranges of the two accesses cannot overlap, there is no dependence.