
Parallel Algorithms and Implementations – CS260 Algorithmic Engineering - PowerPoint PPT Presentation



  1. Parallel Algorithms and Implementations. CS260 – Algorithmic Engineering, Yihan Sun. * Some of the slides are from MIT 6.712, 6.886 and CMU 15-853.

  2. Last Lecture • Scheduler: helps you map your parallel tasks to processors • Fork-join • Fork: create several tasks that will be run in parallel • Join: after all forked threads finish, synchronize them • Work-span • Work W: total number of operations (the sequential complexity) • Span (depth) S: the longest chain in the dependence graph • A computation with work W and span S can be scheduled in time O(W/p + S) on p processors
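As a rough worked example of this bound (the numbers here are illustrative and not from the lecture): summing n = 10^9 elements with the divide-and-conquer reduce has work W ≈ 10^9 and span S = O(log n) ≈ 30. On p = 64 processors the bound gives roughly

    W/p + S ≈ 10^9 / 64 + 30 ≈ 1.6 × 10^7 steps,

so the span term is negligible and the achievable speedup is close to 64.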

  3. Last Lecture • Write C++ code in parallel using Cilk

  Pseudocode:
    reduce(A, n) {
      if (n == 1) return A[0];
      In parallel:
        L = reduce(A, n/2);
        R = reduce(A + n/2, n - n/2);
      return L + R;
    }

  Code using Cilk:
    int reduce(int* A, int n) {
      if (n == 1) return A[0];
      int L, R;
      L = cilk_spawn reduce(A, n/2);
      R = reduce(A + n/2, n - n/2);
      cilk_sync;
      return L + R;
    }

  4. Last Lecture • Reduce/scan algorithms • Divide-and-conquer or blocking • Coarsening • Avoid the overhead of fork-join • Make each subtask large enough
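The coarsening idea can be written directly into the Cilk reduce from the previous slide. The sketch below assumes an OpenCilk-style toolchain (cilk_spawn/cilk_sync as above); the cutoff value is a tunable assumption of mine, not a number from the course.

    #include <cilk/cilk.h>

    // Coarsened parallel reduce: below the cutoff, fall back to a plain
    // sequential loop so each spawned task does enough work to amortize
    // the fork/join overhead.
    const int CUTOFF = 10000;   // tunable; the right value depends on the machine

    int reduce(int* A, int n) {
      if (n <= CUTOFF) {                     // base case: sequential sum
        int s = 0;
        for (int i = 0; i < n; i++) s += A[i];
        return s;
      }
      int L = cilk_spawn reduce(A, n / 2);   // left half runs in parallel
      int R = reduce(A + n / 2, n - n / 2);  // right half runs in this thread
      cilk_sync;
      return L + R;
    }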

  5. Concurrency & Atomic primitives

  6. Concurrency • When two threads access one memory location at the same time • When it is possible for two threads to access the same memory location, we need to consider concurrency • Usually we only care when at least one of the accesses is a write • Race – will be introduced later in the course • Parallelism ≠ concurrency • For the reduce/scan algorithm we just saw, no concurrency occurs (not even concurrent reads are needed)

  7. Concurrency • The most important principle in dealing with concurrency is correctness • Does the program still give the expected output even when concurrency occurs? • The second consideration is performance • Concurrency usually leads to a slowdown for your algorithm • The system needs to guarantee some correctness, which results in much overhead

  8. Concurrency • Correctness is the first consideration! A joke for you to understand this: Alice: I can compute multiplication very fast. Bob: Really? What is 843342 × 3424? Alice: 20. Bob: What? That's not correct! Alice: Wasn't that fast? • Sometimes concurrency is inevitable • Solution 1: Locks – usually safe, but slow • Solution 2: Atomic primitives • Supported by most systems • Need careful design

  9. Atomic primitives • Compare-and-swap (CAS) • bool CAS(value* p, value vold, value vnew): compare the value stored at pointer p with vold; if they are equal, change p's value to vnew and return true. Otherwise do nothing and return false. • Test-and-set (TAS) • bool TAS(bool* p): determine whether the Boolean value stored at p is false; if so, set it to true and return true. Otherwise return false. • Fetch-and-add (FAA) • integer FAA(integer* p, integer x): add x to the integer stored at p, and return the old value. • Priority-write • integer PW(integer* p, integer x): write x to p if and only if x is smaller than the current value stored at p.
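One possible mapping of these primitives onto standard C++ is sketched below, assuming C++11 std::atomic; the wrapper names CAS/TAS/FAA/PW simply mirror the slide and are not a standard API (PW returns void here for brevity).

    #include <atomic>

    bool CAS(std::atomic<int>* p, int vold, int vnew) {
      // writes vnew only if *p still equals vold; returns whether it did
      return p->compare_exchange_strong(vold, vnew);
    }

    bool TAS(std::atomic<bool>* p) {
      // exchange returns the previous value; we "win" only if it was false
      return !p->exchange(true);
    }

    int FAA(std::atomic<int>* p, int x) {
      // fetch_add returns the old value, as in the slide's definition
      return p->fetch_add(x);
    }

    void PW(std::atomic<int>* p, int x) {
      // priority-write: install x only while it is smaller than the current value
      int cur = p->load();
      while (x < cur && !p->compare_exchange_weak(cur, x)) {
        // on failure, compare_exchange_weak reloads the current value into cur
      }
    }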

  10. Use Atomic Primitives
  • Fetch-and-add (FAA): integer FAA(integer* p, integer x) adds x to the integer at p and returns the old value
  • Use case 1: multiple threads want to add a value to a shared variable sum
    Racy version (sum starts at 5; P1 runs Add(3), P2 runs Add(4)):
      void Add(x) { temp = sum; sum = temp + x; }
    Both threads may read temp = 5, so sum ends up 8 or 9, but it should be 12. Writing it as sum = sum + x has the same race.
    Version using FAA:
      void Add(x) { FAA(&sum, x); }
  • Use case 2: multiple threads want to get a globally sequentialized order (shared variable count):
      int get_id() { return FAA(&count, 1); }
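To make the counter example concrete, here is a small self-contained program (my own illustration, not course code) in which two threads add 3 and 4 to a shared sum that starts at 5; std::atomic::fetch_add plays the role of FAA, so the result is always 12.

    #include <atomic>
    #include <iostream>
    #include <thread>

    std::atomic<long> sum{5};   // shared variable, initialized as in the slide

    int main() {
      std::thread p1([] { sum.fetch_add(3); });   // P1: Add(3)
      std::thread p2([] { sum.fetch_add(4); });   // P2: Add(4)
      p1.join();
      p2.join();
      std::cout << sum.load() << std::endl;       // always prints 12
      return 0;
    }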

  11. Use Atomic Primitives
  struct node { value_type value; node* next; };
  shared variable node* head;
  • Compare-and-swap: multiple threads want to add to the head of a linked list (diagram: two new nodes X1 and X2 being inserted at head concurrently)
  • Racy version – if both threads read head before either writes it, one insertion is lost:
      void insert(node* x) {
        x->next = head;
        head = x;
      }
  • Version using CAS – retry until the head swap succeeds:
      void insert(node* x) {
        node* old_head = head;
        x->next = old_head;
        while (!CAS(&head, old_head, x)) {
          old_head = head;       // re-read head and re-link before retrying
          x->next = old_head;
        }
      }

  12. Use Atomic Primitives • The same two-thread insertion, with the diagram now labeling each thread's old_head pointer: if another thread has changed head since old_head was read, the CAS fails, and the loop re-reads head, re-links x->next, and retries until it succeeds.
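The same retry loop is easy to express with std::atomic; the sketch below is my own rendering of the slide's CAS-based insert (compare_exchange_weak conveniently reloads old_head on failure), not the course's implementation.

    #include <atomic>

    struct node {
      int value;
      node* next;
    };

    std::atomic<node*> head{nullptr};   // shared list head

    void insert(node* x) {
      node* old_head = head.load();
      do {
        x->next = old_head;             // link x in front of the current head
        // if head still equals old_head, swing it to x; otherwise old_head
        // is refreshed with the current head and we retry
      } while (!head.compare_exchange_weak(old_head, x));
    }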

  13. Concurrency – rule of thumb • Avoid concurrency algorithmically • If you have to use it (while guaranteeing correctness): • Do not use concurrent writes • If you have to (while guaranteeing correctness): • Do not use locks; use atomic primitives (still, while guaranteeing correctness)

  14. Filtering/packing

  15. Parallel filtering / packing • Given an array A of elements and a predicate function f, output an array B with the elements of A that satisfy f • Example predicate: f(x) = true if x is odd, false if x is even • A = [4 2 9 3 6 5 7 11 10 8] → B = [9 3 5 7 11]

  16. Parallel filtering / packing • How can we know the length of B in parallel? • Count the number of elements that satisfy f (the red ones on the slide) – a parallel reduce over the 0/1 flags • O(n) work and O(log n) depth
    A     = [4 2 9 3 6 5 7 11 10 8]
    flags = [0 0 1 1 0 1 1 1 0 0]

  17. Parallel filtering / packing • How can we know where 9 should go? • 9 is the first red element, 3 is the second, ... – a prefix sum (scan) over the flags gives each flagged element its output index

  Filter(A, n, B, f) {
    new array flag[n], ps[n];
    parallel_for (i = 1 to n) { flag[i] = f(A[i]); }
    ps = scan(flag, n);
    parallel_for (i = 1 to n) {
      if (ps[i] != ps[i-1]) B[ps[i]] = A[i];
    }
  }

    A                   = [4 2 9 3 6 5 7 11 10 8]
    Flags of A          = [0 0 1 1 0 1 1 1 0 0]
    Prefix sum of flags = [0 0 1 2 2 3 4 5 5 5]
    Output index (for flagged elements) = 1 2 3 4 5
    B                   = [9 3 5 7 11]
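The same flag / prefix-sum / scatter structure can be written compactly in C++. The sketch below is sequential for clarity (the two loops and the scan are exactly what the pseudocode runs as parallel_for and parallel scan); it is my own illustration, not the course's library code.

    #include <vector>

    // Keep the elements of A that satisfy pred, preserving their order.
    template <class T, class Pred>
    std::vector<T> filter(const std::vector<T>& A, Pred pred) {
      std::size_t n = A.size();
      std::vector<int> flag(n), ps(n + 1, 0);
      for (std::size_t i = 0; i < n; i++) flag[i] = pred(A[i]) ? 1 : 0;  // parallel_for
      for (std::size_t i = 0; i < n; i++) ps[i + 1] = ps[i] + flag[i];   // scan (prefix sum)
      std::vector<T> B(ps[n]);               // output length = number of kept elements
      for (std::size_t i = 0; i < n; i++)                                // parallel_for
        if (flag[i]) B[ps[i]] = A[i];        // ps[i] = output slot of element i
      return B;
    }

For the slide's example, filter(A, [](int x) { return x % 2 == 1; }) on A = {4, 2, 9, 3, 6, 5, 7, 11, 10, 8} returns {9, 3, 5, 7, 11}.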

  18. Application of filter: partition in quicksort • For an array A, move the elements of A smaller than k to the left and those larger than k to the right • A = [6 2 9 4 1 3 5 8 7 0], partition by 6 → possible output: [2 4 1 3 5 0 6 9 8 7] • The dividing criterion can in general be any predicate

  19. Using filter for partition (looking at the left part as an example, with 6 as the pivot k)

  Partition(A, n, k, B) {
    new array flag[n], ps[n];
    parallel_for (i = 1 to n) { flag[i] = (A[i] < k); }
    ps = scan(flag, n);
    parallel_for (i = 1 to n) {
      if (ps[i] != ps[i-1]) B[ps[i]] = A[i];
    }
  }

    A                  = [6 2 9 4 1 3 5 8 7 0]
    flag (A[i] < 6)    = [0 1 0 1 1 1 1 0 0 1]
    prefix sum of flag = [0 1 1 2 3 4 5 5 5 6]
    packed left part   = [2 4 1 3 5 0]

  • Can we avoid using too much extra space?
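Written out for the whole array, the partition is just two packs sharing one prefix sum: elements with A[i] < k go to the front and everything else goes right after them. The sequential sketch below (my own illustration, again with the scan left sequential) reproduces the slide's example output.

    #include <vector>

    std::vector<int> partition_by(const std::vector<int>& A, int k) {
      std::size_t n = A.size();
      std::vector<int> ps(n + 1, 0);
      for (std::size_t i = 0; i < n; i++)             // scan over the flags (A[i] < k)
        ps[i + 1] = ps[i] + (A[i] < k ? 1 : 0);
      std::size_t left = ps[n];                       // how many elements are smaller than k
      std::vector<int> B(n);
      for (std::size_t i = 0; i < n; i++) {           // parallel_for in the real version
        if (A[i] < k) B[ps[i]] = A[i];                // pack the "< k" elements to the front
        else          B[left + (i - ps[i])] = A[i];   // pack the rest right after them
      }
      return B;
    }

With A = {6, 2, 9, 4, 1, 3, 5, 8, 7, 0} and k = 6 this returns {2, 4, 1, 3, 5, 0, 6, 9, 8, 7}, matching the possible output on the previous slide.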

  20. Implementation trick: delayed sequence

  21. Delayed sequence • A sequence is a function, so it does not need to be stored • It maps an index (subscript) to a value • This saves space!
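A minimal way to realize this in C++ (C++14 or later) is a tiny wrapper that stores the length and the index-to-value function and computes elements on demand; the names delayed_seq and make_delayed_seq below are my own, not the course library's.

    #include <cstddef>

    // A "sequence" represented by a function: element i is computed by f(i),
    // so no n-element array is ever allocated.
    template <class F>
    struct delayed_seq {
      std::size_t n;   // length of the sequence
      F f;             // maps an index to a value
      auto operator[](std::size_t i) const { return f(i); }
      std::size_t size() const { return n; }
    };

    template <class F>
    delayed_seq<F> make_delayed_seq(std::size_t n, F f) { return {n, f}; }

    // Example: the identity sequence 0, 1, ..., n-1 without storing it.
    // auto ids = make_delayed_seq(1000000000, [](std::size_t i) { return (long long)i; });
    // ids[42] == 42, yet no 10^9-element array ever exists in memory.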

  22. Delayed sequence • A sequence is a function, so it does not need to be stored • Saves space

  Array version:
    int reduce(int* A, int n) {
      if (n == 1) return A[0];
      int L, R;
      L = cilk_spawn reduce(A, n/2);
      R = reduce(A + n/2, n - n/2);
      cilk_sync;
      return L + R;
    }
    int main() {
      cin >> n;
      parallel_for (int i = 0; i < n; i++) A[i] = i;
      cout << reduce(A, n) << endl;
    }

  Delayed-sequence version:
    inline int get_val(int i) { return i; }
    int reduce(int start, int n, function f) {
      if (n == 1) return f(start);
      int L, R;
      L = cilk_spawn reduce(start, n/2, f);
      R = reduce(start + n/2, n - n/2, f);
      cilk_sync;
      return L + R;
    }
    int main() {
      cin >> n;
      cout << reduce(0, n, get_val) << endl;
    }

  • Running times for n = 10^9 with coarsening: about 0.16s and about 0.19s for the two versions; the delayed version also never allocates or initializes the array A.

  23. Partition without the flag array

  Old version:
    Partition(A, n, k, B) {
      new array flag[n], ps[n];
      parallel_for (i = 1 to n) { flag[i] = (A[i] < k); }
      ps = scan(flag, n);
      parallel_for (i = 1 to n) {
        if (ps[i] != ps[i-1]) B[ps[i]] = A[i];
      }
    }

  New version:
    Partition(A, n, k, B) {
      new array ps[n];
      ps = scan(0, n, [&](int i) { return (A[i] < k); });
      parallel_for (i = 1 to n) {
        if (ps[i] != ps[i-1]) B[ps[i]] = A[i];
      }
    }

  • The delayed sequence is equivalent to having an array with flag[i] = (A[i] < k), but without explicitly storing it
  • (We can also get rid of the ps[] array, but that makes the program a bit more complicated)

  24. Implementation trick: nested/granular/blocked parallel for-loops
