Parallel Algorithms and Implementations
CS260 – Algorithmic Engineering Yihan Sun
Algorithmic engineering: make your code faster.

Ways to make code faster: we cannot rely on the improvement of hardware anymore. Use multicores!
(And avoid any contention between them)
Memory leak: memory that is no longer needed is not released.
Deadlock: a state in which each member of a group is waiting for another member, including itself, to take action, such as releasing a lock.
Data race: two or more processors access the same memory location, and at least one of them is writing.
Zombie process: a process that has completed execution but still has an entry in the process table.
Our goal: make full use of all the processors available.
Fork-join parallelism:
Fork: create a number of tasks and let some parallel threads execute them.
Join: the forked tasks are synchronized by a join operation.
A loop whose iterations are independent can be executed in parallel (parallel_for).
Fork-join is supported by many parallel programming languages (e.g., Cilk).
cilk_spawn forks a function call off so that it may run on another thread.
cilk_sync joins: the current thread waits until all spawned calls have finished.
cilk_for runs the iterations of a loop in parallel.
Nested spawning doubles the number of parallel tasks each round: two, four, then eight, … in O(log n) rounds.
#include <cilk/cilk.h>
#include <cilk/cilk_api.h>

cilk_spawn do_thing_1;   // fork
do_thing_2;
cilk_sync;               // join

cilk_for (int i = 0; i < n; i++) { do_something; }
As long as you can design a parallel algorithm in the fork-join model, implementing it requires very little work on top of your sequential C++ code.
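For instance, here is a minimal self-contained sketch of fork-join with the Cilk keywords (the array size and the sum_range helper are made-up illustrations, not from the slides; with OpenCilk it can be compiled with clang's -fopencilk flag):

#include <cilk/cilk.h>
#include <cstdio>

// Hypothetical helper: sequentially sum A[lo..hi).
long long sum_range(const long long* A, long long lo, long long hi) {
  long long s = 0;
  for (long long i = lo; i < hi; i++) s += A[i];
  return s;
}

int main() {
  const long long n = 1000000;
  long long* A = new long long[n];
  cilk_for (long long i = 0; i < n; i++) A[i] = i;   // parallel loop
  long long left, right;
  left = cilk_spawn sum_range(A, 0, n / 2);          // fork: may run on another thread
  right = sum_range(A, n / 2, n);                    // runs on the current thread
  cilk_sync;                                         // join: wait for the spawned call
  std::printf("total = %lld\n", left + right);
  delete[] A;
  return 0;
}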
#include "pbbslib/utilities.h"

par_do([&] () { do_thing_1; },
       [&] () { do_thing_2; });

parallel_for(0, 100, [&] (int i) { do_something; });
The parallel tasks are passed as lambda expressions (they must be function calls). You can also compile your code with Cilk or OpenMP instead.
Work: the total number of operations in the algorithm, i.e., the running time when the algorithm runs on one processor (T_1).
(Example computation DAG: work = 17, span = 8.)
Work-efficient: the total work is (asymptotically) no more than that of the best (optimal) sequential algorithm, so the parallel algorithm is still efficient when only a small number of processors are available.
Span (depth): the length of the longest dependency chain, i.e., the running time when there is an unlimited number of processors (T_∞). A small span means the algorithm gets faster and faster when more and more processors are available - scalability.
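These definitions can be summarized in the standard T_P notation (T_P = running time on P processors; the notation is an assumed convention here, not taken from the slides):

\[
W = T_1, \qquad S = T_\infty, \qquad
\text{speedup} = \frac{T_1}{T_P} \;\le\; \min\!\left(P,\ \frac{T_1}{T_\infty}\right).
\]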
Work/P: if all P processors are fully loaded, we need at least this amount of time.
Span: even with an unlimited number of processors, we still need this amount of time.
Scheduling theorem: any computation with work W and span S can be scheduled on P processors in time O(W/P + S) (w.h.p. for some randomized schedulers), i.e., T_P = O(T_1/P + T_∞).
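As a rough illustration with made-up numbers (not from the slides): with work W = 10^8, span S ≈ 27 (e.g., an O(log n)-span algorithm on n = 10^8 elements), and P = 24 threads,

\[
T_P = O\!\left(\frac{W}{P} + S\right) \approx \frac{10^8}{24} + 27 \approx 4.2 \times 10^6,
\]

so the W/P term dominates, and near-perfect speedup is possible in principle.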
(Example: reduce on the array 1..8 as a balanced binary tree of additions; pairwise sums 3, 7, 11, 15, then 10, 26, then the total 36.)
reduce(A, n) {
  if (n == 1) return A[0];
  In parallel:
    L = reduce(A, n/2);
    R = reduce(A + n/2, n - n/2);
  return L + R;
}
int reduce(int* A, int n) {
  if (n == 1) return A[0];
  int L, R;
  L = cilk_spawn reduce(A, n/2);
  R = reduce(A + n/2, n - n/2);
  cilk_sync;
  return L + R;
}
It is still valid when running sequentially, i.e., on one processor.
#include "pbbslib/utilities.h"

void reduce(int* A, int n, int& ret) {
  if (n == 1) ret = A[0];
  else {
    int L, R;
    par_do([&] () { reduce(A, n/2, L); },
           [&] () { reduce(A + n/2, n - n/2, R); });
    ret = L + R;
  }
}

parallel_for(0, 100, [&] (int i) { A[i] = i; });
Code was run on the course server (12 cores with 24 hyperthreads):

Algorithm                      Time
Sequential running time        0.61s
Parallel code on 24 threads    4.51s
Parallel code on 4 threads     17.14s
Parallel code on 1 thread      59.95s
Why is the parallel code so slow? The recursion creates a huge number of tiny parallel tasks, and the overhead of spawning and scheduling them outweighs the gain from parallelism.
Coarsened version of reduce (the recursion stops at a threshold and switches to a sequential loop):

int reduce(int* A, int n) {
  if (n < threshold) {               // base case: small enough, just run sequentially
    int ans = 0;
    for (int i = 0; i < n; i++) ans += A[i];
    return ans;
  }
  int L, R;
  L = cilk_spawn reduce(A, n/2);
  R = reduce(A + n/2, n - n/2);
  cilk_sync;
  return L + R;
}
Algorithm                      Threshold     Time
Sequential running time        -             0.61s
Parallel code on 24 threads    100           0.27s
Parallel code on 24 threads    10000         0.19s
Parallel code on 24 threads    1000000       0.19s
Parallel code on 24 threads    10000000      0.22s
Best threshold depends on the machine parameters and the problem
In the best case using 24 threads improves the performance by about 3 times.
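The speedup follows from the timings in the table (simple arithmetic on the numbers above):

\[
\text{speedup} = \frac{0.61\,\text{s}}{0.19\,\text{s}} \approx 3.2
\quad\text{on 24 threads, i.e., a parallel efficiency of only about } \frac{3.2}{24} \approx 13\%.
\]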
We do not want the tasks executed in parallel to be too small.
(Blocked reduce: divide the array into blocks of size about the threshold; compute each block's sum Sum[0], Sum[1], …, Sum[5] in parallel, sequentially within each block; then add up the block sums.)
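A possible C++ sketch of this blocked reduce, using the pbbslib-style parallel_for shown earlier (the function name, the num_blocks parameter, and the long long accumulators are my assumptions, not from the slides):

#include "pbbslib/utilities.h"
#include <algorithm>

// Blocked reduce: one parallel task per block, sequential work inside each block.
long long blocked_reduce(const int* A, size_t n, size_t num_blocks) {
  size_t block_size = (n + num_blocks - 1) / num_blocks;
  long long* Sum = new long long[num_blocks];            // Sum[b] = total of block b
  parallel_for(0, num_blocks, [&] (size_t b) {
    size_t lo = b * block_size;
    size_t hi = std::min(n, lo + block_size);
    long long s = 0;
    for (size_t i = lo; i < hi; i++) s += A[i];          // sequential within the block
    Sum[b] = s;
  });
  long long total = 0;                                   // combine the few block sums
  for (size_t b = 0; b < num_blocks; b++) total += Sum[b];
  delete[] Sum;
  return total;
}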
Algorithm                      #Blocks       Time
Sequential running time        -             0.61s
Parallel code on 24 threads    100           0.26s
Parallel code on 24 threads    1000          0.19s
Parallel code on 24 threads    100000        0.19s
Parallel code on 24 threads    10000000      0.21s
For more complicated algorithms, the best #blocks can be different
In theory, we can design the algorithm to be work-efficient with O(log n) depth; in practice, we need coarsening to avoid small parallel tasks.
Prefix sum: can we also use the blocking idea?
The blocked prefix sum algorithm:
1. Divide the array into blocks, each of size about c.
2. Compute the sum of each block into an array B, in parallel (sequentially within each block).
3. Compute the prefix sum of B sequentially, and write its values to the c-th, 2c-th, 3c-th, … slots of the output.
4. Fill in the rest of each block in parallel: run a sequential prefix sum within each block, with an offset decided by the prefix sum at the end of the previous block.
(Worked example on the input 1, 2, …, 12: block sums 10, 26, 42; final prefix-sum output 1, 3, 6, 10, 15, 21, 28, 36, 45, 55, 66, 78.)
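Here is a possible C++ sketch of the four steps, again with the pbbslib-style parallel_for (names are illustrative; step 3's writes into the output are folded into the per-block pass of step 4):

#include "pbbslib/utilities.h"
#include <algorithm>

// Blocked prefix sum: Out[i] = In[0] + ... + In[i], with blocks of size c.
void blocked_scan(const int* In, long long* Out, size_t n, size_t c) {
  size_t num_blocks = (n + c - 1) / c;
  long long* B = new long long[num_blocks];
  // Steps 1-2: sum each block, in parallel (sequentially within a block).
  parallel_for(0, num_blocks, [&] (size_t b) {
    size_t lo = b * c, hi = std::min(n, lo + c);
    long long s = 0;
    for (size_t i = lo; i < hi; i++) s += In[i];
    B[b] = s;
  });
  // Step 3: sequential prefix sum over the (few) block sums.
  for (size_t b = 1; b < num_blocks; b++) B[b] += B[b - 1];
  // Step 4: fill in each block in parallel, offset by the previous block's total.
  parallel_for(0, num_blocks, [&] (size_t b) {
    size_t lo = b * c, hi = std::min(n, lo + c);
    long long offset = (b == 0) ? 0 : B[b - 1];
    for (size_t i = lo; i < hi; i++) {
      offset += In[i];
      Out[i] = offset;                                   // inclusive prefix sum
    }
  });
  delete[] B;
}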
Reduce and prefix sum (scan) work for any associative operation, not just addition.
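For example, the same divide-and-conquer reduce works with max in place of + (a hypothetical variant, not from the slides):

#include <cilk/cilk.h>
#include <algorithm>

// Same fork-join skeleton as reduce, but the associative operation is max.
int reduce_max(int* A, int n) {
  if (n == 1) return A[0];
  int L, R;
  L = cilk_spawn reduce_max(A, n/2);
  R = reduce_max(A + n/2, n - n/2);
  cilk_sync;
  return std::max(L, R);
}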