Parallel Algorithms and Implementations
CS260 - Algorithmic Engineering
Yihan Sun
* Some of the slides are from MIT 6.172, 6.886 and CMU 15-853.
Last Lecture

- Scheduler: helps you map parallel tasks onto processors, run them in parallel, and synchronize them
- Work-span model: the work is the sequential complexity; the span is the longest path in the dependence graph
A computation with work W and span S can be scheduled on P processors in time T_P ≤ W/P + S.
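As a quick sanity check of this bound (the numbers are mine, purely for illustration): a computation with work W = 10^9 and span S = 10^3 on P = 100 processors can be scheduled in time

    T_P ≤ 10^9 / 100 + 10^3 ≈ 10^7,

essentially a 100x speedup over the sequential time W. The bound only degrades when P approaches the parallelism W/S.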
Write C++ code in parallel

Pseudocode:

    reduce(A, n) {
      if (n == 1) return A[0];
      In parallel:
        L = reduce(A, n/2);
        R = reduce(A + n/2, n - n/2);
      return L + R;
    }

Cilk implementation:

    int reduce(int* A, int n) {
      if (n == 1) return A[0];
      int L, R;
      L = cilk_spawn reduce(A, n/2);   // left half runs in parallel
      R = reduce(A + n/2, n - n/2);    // right half runs in the current strand
      cilk_sync;                       // wait for the spawned call
      return L + R;
    }
- Reduce/scan algorithms
- Coarsening
Concurrency

Concurrency arises when multiple threads access the same memory location at the same time. Whenever it is possible for two threads to access the same memory location, we need to consider concurrency in this course.
If threads only read shared data, there is no problem (only concurrent reads are needed).

The most important principle in dealing with concurrency is correctness. The second thing to consider is performance.

Correctness is the first consideration! Sometimes concurrency is inevitable.
A joke to help you understand this:

Alice: I can compute multiplication very fast.
Bob: Really? What is 843342 × 3424?
Alice: 20.
Bob: What? That's not correct!
Alice: Wasn't that fast?
Atomic primitives

- Compare-and-swap (CAS): bool CAS(T* p, T v_old, T v_new): compare the value stored at pointer p with v_old; if they are equal, change p's value to v_new and return true. Otherwise do nothing and return false.
- Test-and-set (TAS): check whether a boolean value is false; if so, set it to true and return true. Otherwise, return false.
- Fetch-and-add (FAA): integer FAA(integer* p, integer x): add x to p's value and return the old value.
- Priority-write: write value v to p only if v is smaller than the current value in p.
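These primitives all map onto C++11's std::atomic. The sketch below is my own illustration (not the course's code) of one way to express each of them; the priority-write is built as a CAS retry loop, since hardware does not provide it directly:

    #include <atomic>

    std::atomic<int> x(0);
    std::atomic<bool> flag(false);

    // CAS: succeeds only if x still holds `expected`
    bool cas_example(int expected, int desired) {
      return x.compare_exchange_strong(expected, desired);
    }

    // TAS: exchange returns the old value, so the call
    // "wins" exactly when the flag was previously false
    bool tas() { return !flag.exchange(true); }

    // FAA: fetch_add atomically adds and returns the old value
    int faa(int delta) { return x.fetch_add(delta); }

    // Priority-write: retry CAS until we install v or observe
    // a value that is already smaller
    void priority_write(std::atomic<int>& p, int v) {
      int cur = p.load();
      while (v < cur && !p.compare_exchange_weak(cur, v)) {
        // on failure, compare_exchange_weak reloads cur; just retry
      }
    }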
Fetch-and-add (FAA): integer FAA(integer* p, integer x): add x to integer p's value, and return the old value.

Typical uses: adding a value to a shared variable, or handing out globally sequentialized ids.

    Shared variable sum:     void Add(x) { FAA(&sum, x); }
    Shared variable count:   int get_id() { return FAA(&count, 1); }

Without FAA, the naive code has a race:

    void Add(x) { sum = sum + x; }
    // which really means:
    void Add(x) { temp = sum; sum = temp + x; }

Example: sum = 5, and P1 runs add(3) while P2 runs add(4). Both read temp = 5, then one writes sum = 8 and the other writes sum = 9; the final value is 8 or 9, but it should be 12.
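A small runnable demonstration of the difference (my own sketch, using std::thread and std::atomic rather than the course's primitives):

    #include <atomic>
    #include <iostream>
    #include <thread>
    #include <vector>

    int plain_sum = 0;               // unprotected shared variable
    std::atomic<int> atomic_sum(0);  // updated with FAA

    int main() {
      std::vector<std::thread> threads;
      for (int t = 0; t < 8; t++)
        threads.emplace_back([] {
          for (int i = 0; i < 100000; i++) {
            plain_sum = plain_sum + 1;  // racy read-modify-write: loses updates
            atomic_sum.fetch_add(1);    // atomic FAA: never loses updates
          }
        });
      for (auto& th : threads) th.join();
      // plain_sum typically prints less than 800000; atomic_sum is exactly 800000
      std::cout << plain_sum << " " << atomic_sum << std::endl;
    }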
Example: insert nodes X1 and X2 at the head of a linked list.

    struct node { value_type value; node* next; };
    shared variable node* head;

The sequential version is incorrect under concurrency: if two threads insert X1 and X2 at the same time, both may read the same head, and one insertion is lost.

    void insert(node* x) {
      x->next = head;
      head = x;
    }

A correct version uses CAS: retry until the head we read is still the head when we swing it to x.

    void insert(node* x) {
      node* old_head = head;
      x->next = old_head;
      while (!CAS(&head, old_head, x)) {
        old_head = head;     // re-read the current head
        x->next = old_head;  // re-link and try again
      }
    }
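The same structure compiles directly with std::atomic; this is my own rendering (essentially a Treiber-stack push), not code from the slides:

    #include <atomic>

    struct node {
      int value;
      node* next;
    };

    std::atomic<node*> head(nullptr);

    void insert(node* x) {
      node* old_head = head.load();
      do {
        x->next = old_head;  // link x in front of the head we last saw
        // on failure, compare_exchange_weak reloads old_head automatically
      } while (!head.compare_exchange_weak(old_head, x));
    }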
Filter

Given an array A of n elements and a predicate function f, output an array B with the elements in A that satisfy f.

Example, with f(x) = true if x is odd and false if x is even:

    A = 4 2 9 3 6 5 7 11 10 8
    B = 9 3 5 7 11

How can we know the length of B in parallel?
Compute a flag for each element:

    A    = 4 2 9 3 6 5 7 11 10 8
    flag = 0 0 1 1 0 1 1  1  0  0

The length of B is the sum of the flags (a reduce). But where should each element, e.g., 9, go?
Take the prefix sum (scan) of the flags; for each flagged element, it gives the position in B:

    index          1 2 3 4 5 6 7  8  9 10
    A            = 4 2 9 3 6 5 7 11 10  8
    flags of A     0 0 1 1 0 1 1  1  0  0
    prefix sum     0 0 1 2 2 3 4  5  5  5
    B            = 9 3 5 7 11
    Filter(A, n, B, f) {
      new array flag[n], ps[n];
      parallel_for (i = 1 to n) { flag[i] = f(A[i]); }
      ps = scan(flag, n);
      parallel_for (i = 1 to n) {
        if (ps[i] != ps[i-1]) B[ps[i]] = A[i];  // flagged iff the scan increased at i
      }
    }
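For reference, here is a self-contained C++ version of this filter (my own sketch; the scan is written sequentially for clarity, and the loops marked below are the ones that would be parallel_for in a real implementation):

    #include <iostream>
    #include <vector>

    template <typename T, typename F>
    std::vector<T> filter(const std::vector<T>& A, F f) {
      size_t n = A.size();
      std::vector<int> flag(n), ps(n + 1, 0);
      for (size_t i = 0; i < n; i++)        // parallel_for
        flag[i] = f(A[i]) ? 1 : 0;
      for (size_t i = 0; i < n; i++)        // a parallel scan in the real version
        ps[i + 1] = ps[i] + flag[i];
      std::vector<T> B(ps[n]);              // ps[n] = number of kept elements
      for (size_t i = 0; i < n; i++)        // parallel_for
        if (flag[i]) B[ps[i]] = A[i];       // the scan gives each element its slot
      return B;
    }

    int main() {
      std::vector<int> A = {4, 2, 9, 3, 6, 5, 7, 11, 10, 8};
      for (int x : filter(A, [](int v) { return v % 2 == 1; }))
        std::cout << x << " ";              // prints: 9 3 5 7 11
    }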
Partition

Given an array A, move the elements in A smaller than k to the left and those larger than k to the right. The dividing criterion can in general be any predicate.

Example, partitioning by 6:

    A               = 6 2 9 4 1 3 5 8 7
    possible output = 2 4 1 3 5 6 9 8 7
Using 6 as a pivot, pack the left part with the flag + prefix-sum approach:

    A          = 6 2 9 4 1 3 5 8 7
    flag       = 0 1 0 1 1 1 1 0 0      (flag[i] = (A[i] < 6))
    prefix sum = 0 1 1 2 3 4 5 5 5
    pack       = 2 4 1 3 5

    Partition(A, n, k, B) {
      new array flag[n], ps[n];
      parallel_for (i = 1 to n) { flag[i] = (A[i] < k); }
      ps = scan(flag, n);
      parallel_for (i = 1 to n) {
        if (ps[i] != ps[i-1]) B[ps[i]] = A[i];
      }
    }

(The right part is packed the same way with the complementary flag.)
Can we avoid using too much extra space? (Looking at the left part as an example.)

Key idea: the input sequence is a function, so it does not need to be stored.
Example: reduce over the sequence 0, 1, ..., n-1.

Array version (the input is materialized first):

    int reduce(int* A, int n) {
      if (n == 1) return A[0];
      int L, R;
      L = cilk_spawn reduce(A, n/2);
      R = reduce(A + n/2, n - n/2);
      cilk_sync;
      return L + R;
    }

    int main() {
      cin >> n;
      parallel_for (int i = 0; i < n; i++) A[i] = i;
      cout << reduce(A, n) << endl;
    }

    // Running time: about 0.19s for n = 10^9, with coarsening

Function version (the input is computed on the fly, never stored):

    inline int get_val(int i) { return i; }

    int reduce(int start, int n, function f) {
      if (n == 1) return f(start);
      int L, R;
      L = cilk_spawn reduce(start, n/2, f);
      R = reduce(start + n/2, n - n/2, f);
      cilk_sync;
      return L + R;
    }

    int main() {
      cin >> n;
      cout << reduce(0, n, get_val) << endl;
    }

    // Running time: about 0.16s for n = 10^9, with coarsening
The same idea applied to Partition:

Old version:

    Partition(A, n, k, B) {
      new array flag[n], ps[n];
      parallel_for (i = 1 to n) { flag[i] = (A[i] < k); }
      ps = scan(flag, n);
      parallel_for (i = 1 to n) {
        if (ps[i] != ps[i-1]) B[ps[i]] = A[i];
      }
    }

New version:

    Partition(A, n, k, B) {
      new array ps[n];
      ps = scan(0, n, [&](int i) { return (A[i] < k); });
      parallel_for (i = 1 to n) {
        if (ps[i] != ps[i-1]) B[ps[i]] = A[i];
      }
    }

Passing the lambda is equivalent to having an array with flag[i] = (A[i] < k), but without explicitly storing it. (We can also get rid of the ps[] array, but it makes the program a bit more complicated.)
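To make the scan-with-a-function interface concrete, here is a minimal sequential sketch of mine (not the course library); a real implementation would use the two-pass parallel scan, but the interface is the point:

    #include <functional>
    #include <vector>

    // Inclusive scan over the implicit sequence f(0), f(1), ..., f(n-1).
    // The input sequence itself is never materialized.
    std::vector<int> scan(int n, const std::function<int(int)>& f) {
      std::vector<int> ps(n);
      int acc = 0;
      for (int i = 0; i < n; i++) {
        acc += f(i);
        ps[i] = acc;
      }
      return ps;
    }

    // Usage matching the new Partition:
    //   auto ps = scan(n, [&](int i) { return A[i] < k ? 1 : 0; });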
Usually you only need to parallelize the outer loop; this keeps each parallel task large enough.
Example: square matrix multiplication.

Parallel i loop (running time: 3.18s):

    cilk_for (int i = 0; i < n; ++i)
      for (int k = 0; k < n; ++k)
        for (int j = 0; j < n; ++j)
          C[i][j] += A[i][k] * B[k][j];

Parallel i and j loops (running time: 10.64s):

    cilk_for (int i = 0; i < n; ++i)
      for (int k = 0; k < n; ++k)
        cilk_for (int j = 0; j < n; ++j)
          C[i][j] += A[i][k] * B[k][j];

Parallel j loop (running time: 531.71s):

    for (int i = 0; i < n; ++i)
      for (int k = 0; k < n; ++k)
        cilk_for (int j = 0; j < n; ++j)
          C[i][j] += A[i][k] * B[k][j];
Rule of thumb: parallelize outer loops rather than inner loops.
granular_for: if a condition holds (e.g., the size of the for-loop is larger than a threshold), run the for-loop in parallel; otherwise, run it sequentially. E.g., for a for-loop with size smaller than 2000, run it sequentially; otherwise run it in parallel.

    #define granular_for(_i, _start, _end, _cond, _body) {    \
      if (_cond) {                                             \
        {parallel_for (size_t _i = _start; _i < _end; _i++) {  \
          _body                                                \
        }}                                                     \
      } else {                                                 \
        {for (size_t _i = _start; _i < _end; _i++) {           \
          _body                                                \
        }}                                                     \
      }                                                        \
    }

    granular_for(i, 0, n, (n > 2000), {A[i] = i;});
blocked_for: given a for-loop, combine every _bsize iterations into one task, and run the tasks in parallel. This also avoids the case where each task is too small. (Cilk does this in some sense, but it does not know much about your loop body.) E.g., put every 500 loop bodies into one task.

    #define nblocks(_n, _bsize) (1 + ((_n)-1)/(_bsize))       // number of blocks

    #define blocked_for(_i, _s, _e, _bsize, _body) {          \
      intT _ss = _s;                                          \
      intT _ee = _e;                                          \
      intT _n = _ee - _ss;                                    \
      intT _l = nblocks(_n, _bsize);                          \
      parallel_for (intT _b = 0; _b < _l; _b++) {             \
        intT _bs = _ss + _b * (_bsize);                       \
        intT _be = min(_bs + (_bsize), _ee);                  \
        /* from the start of the block to its end */          \
        for (intT _i = _bs; _i < _be; _i++) {                 \
          _body                                               \
        }                                                     \
      }                                                       \
    }

    blocked_for(i, 0, n, 500, {A[i] = i;});
Memory allocation

Allocate enough memory in advance. This means that using std::vector *can* slow down your parallel code (if you are not careful enough), because growing it allocates memory during the parallel run.
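A minimal illustration of this advice (my own example, with the loop that would be a parallel_for marked):

    #include <vector>

    void fill(int n) {
      // Good: one allocation up front; the loop only writes into
      // preallocated slots, so it parallelizes cleanly.
      std::vector<int> v(n);
      for (int i = 0; i < n; i++)   // parallel_for in the real code
        v[i] = i;

      // Bad: calling v.push_back(i) inside a parallel loop would both
      // race on the vector's internal state and repeatedly reallocate
      // and copy the buffer.
    }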
Random number generation

Do not use the default random number generator: it keeps shared sequential state, which will slow down parallel performance. Use a hash function instead. Do not worry too much about the quality of your hash function; it is a pseudo-random number generator anyway.

Generate n random integers in parallel:

    // a 32-bit hash function
    inline uint32_t random_hash(uint32_t a) {
      a = (a + 0x7ed55d16) + (a << 12);
      a = (a ^ 0xc761c23c) ^ (a >> 19);
      a = (a + 0x165667b1) + (a << 5);
      a = (a + 0xd3a2646c) ^ (a << 9);
      a = (a + 0xfd7046c5) + (a << 3);
      a = (a ^ 0xb55a4f09) ^ (a >> 16);
      return a;
    }

    parallel_for (i = 0 to n)
      random[i] = random_hash(i);
Merging two sorted arrays

Given two sorted arrays, merge them into one sorted array. Sequentially, use two moving pointers:

    A = 4 7 8
    B = 1 2 3 5 6 9
    C = 1 2 3 4 5 6 7 8 9
In parallel (code below):

- Find the median m of one array
- Binary search for it in the other array
- Put it in its correct slot in the output
- Recursively, in parallel:
  - merge the left halves into the left half of the output
  - merge the right halves into the right half of the output

Example: merge A = 2 3 4 6 9 with B = 0 1 5 7 8. The median of A is 4; a binary search places it in the output and splits the work into subproblem 1 (merge 2, 3 with 0, 1) and subproblem 2 (merge 6, 9 with 5, 7, 8).
    // merge array A of length n1 and array B of length n2 into array C
    Merge(A, n1, B, n2, C) {
      if (A is empty or B is empty) base_case;
      m = n1 / 2;
      m2 = binary_search(B, A[m]);   // number of elements in B smaller than A[m]
      C[m + m2] = A[m];              // the pivot's final position
      in parallel:
        Merge(A, m, B, m2, C);
        Merge(A + m + 1, n1 - m - 1, B + m2, n2 - m2, C + m + m2 + 1);
      return C;
    }
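A compilable C++ rendering of this merge (my own sketch; the two recursive calls are where cilk_spawn / cilk_sync would go, and a practical version would fall back to the sequential two-pointer merge below a size threshold):

    #include <algorithm>

    // Merge sorted A[0..n1) and sorted B[0..n2) into C[0..n1+n2).
    void merge_dc(const int* A, int n1, const int* B, int n2, int* C) {
      if (n1 == 0) { std::copy(B, B + n2, C); return; }
      if (n2 == 0) { std::copy(A, A + n1, C); return; }
      int m  = n1 / 2;
      // number of elements in B strictly smaller than the pivot A[m]
      int m2 = std::lower_bound(B, B + n2, A[m]) - B;
      C[m + m2] = A[m];
      // these two subproblems are independent; run them in parallel
      merge_dc(A, m, B, m2, C);
      merge_dc(A + m + 1, n1 - m - 1, B + m2, n2 - m2, C + m + m2 + 1);
    }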