Parallel Algorithms and Implementations, CS260 Algorithmic Engineering - PowerPoint PPT Presentation

SLIDE 1

Parallel Algorithms and Implementations

CS260 Algorithmic Engineering, Yihan Sun

* Some of the slides are from MIT 6.172, 6.886 and CMU 15-853.

SLIDE 2

Last Lecture

  • Scheduler:
    • Helps you map your parallel tasks to processors
  • Fork-join
    • Fork: create several tasks that will be run in parallel
    • Join: after all forked threads finish, synchronize them
  • Work-span
    • Work: total number of operations, the sequential complexity
    • Span (depth): the longest chain in the dependence graph

Can be scheduled in time O(W/p + S), for work W and span S on p processors.

SLIDE 3

Last Lecture

  • Write C++ code in parallel

Pseudocode:

    reduce(A, n) {
      if (n == 1) return A[0];
      In parallel:
        L = reduce(A, n/2);
        R = reduce(A + n/2, n - n/2);
      return L + R;
    }

Code using Cilk:

    int reduce(int* A, int n) {
      if (n == 1) return A[0];
      int L, R;
      L = cilk_spawn reduce(A, n/2);   // run the left half in a new task
      R = reduce(A + n/2, n - n/2);    // run the right half in the current task
      cilk_sync;                       // wait for the spawned task to finish
      return L + R;
    }

SLIDE 4

Last Lecture

  • Reduce/scan algorithms
    • Divide-and-conquer or blocking
  • Coarsening (see the sketch below)
    • Avoid the overhead of fork-join
    • Make each subtask large enough
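To make the coarsening idea concrete, here is a sketch of mine (not from the slides) of the reduce from the last lecture with a sequential cutoff; the threshold 2000 is an arbitrary illustrative value, not a tuned one:

    // Coarsened parallel reduce (illustrative sketch; needs an OpenCilk compiler).
    #include <cilk/cilk.h>

    const int CUTOFF = 2000;  // illustrative threshold, not tuned

    int reduce(int* A, int n) {
      if (n <= CUTOFF) {               // small subtask: plain sequential loop,
        int s = 0;                     // no fork-join overhead is paid here
        for (int i = 0; i < n; i++) s += A[i];
        return s;
      }
      int L = cilk_spawn reduce(A, n / 2);    // left half in a new task
      int R = reduce(A + n / 2, n - n / 2);   // right half in this task
      cilk_sync;                              // wait for the spawned task
      return L + R;
    }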

SLIDE 5

Concurrency & Atomic primitives

SLIDE 6

Concurrency

  • Concurrency: when two threads access one memory location at the same time
  • When it is possible for two threads to access the same memory location, we need to consider concurrency
    • Usually we only care when at least one of them is a write
    • Race – will be introduced later in the course

  • Parallelism ≠ concurrency
  • For the reduce/scan algorithms we just saw, no concurrency occurs (not even concurrent reads are needed)

SLIDE 7

Concurrency

  • The most important principle in dealing with concurrency is correctness
    • Does it still give the expected output even when concurrency occurs?
  • The second consideration is performance
    • Concurrency usually leads to a slowdown for your algorithm
    • The system needs to guarantee some correctness, which results in much overhead

SLIDE 8

Concurrency

  • Correctness is the first consideration!
  • Sometimes concurrency is inevitable
    • Solution 1: Locks – usually safe, but slow
    • Solution 2: Some atomic primitives
      • Supported by most systems
      • Need careful design

A joke for you to understand this:

Alice: I can compute multiplication very fast.
Bob: Really? What is 843342 × 3424?
Alice: 20.
Bob: What? That's not correct!
Alice: Wasn't that fast?

SLIDE 9

Atomic primitives

  • Compare-and-swap (CAS)
    • bool CAS(value* p, value vold, value vnew): compare the value stored at pointer p with vold; if they are equal, change p's value to vnew and return true. Otherwise do nothing and return false.
  • Test-and-set (TAS)
    • bool TAS(bool* p): determine if the Boolean value stored at p is false; if so, set it to true and return true. Otherwise, return false.
  • Fetch-and-add (FAA)
    • integer FAA(integer* p, integer x): add x to the integer at p, and return the old value.
  • Priority-write
    • integer PW(integer* p, integer x): write x to p if and only if x is smaller than the current value at p.
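These four primitives map naturally onto C++11 std::atomic. The wrappers below are a sketch of mine (the slides do not prescribe an implementation), mirroring the signatures above for int values:

    #include <atomic>

    bool CAS(std::atomic<int>* p, int vold, int vnew) {
      // writes vnew only if *p == vold; returns whether it succeeded
      return p->compare_exchange_strong(vold, vnew);
    }

    bool TAS(std::atomic<bool>* p) {
      // exchange returns the previous value; TAS succeeds if it was false
      return !p->exchange(true);
    }

    int FAA(std::atomic<int>* p, int x) {
      // atomically adds x and returns the old value
      return p->fetch_add(x);
    }

    int PW(std::atomic<int>* p, int x) {
      // priority-write: install x only while it is smaller than the current value
      int cur = p->load();
      while (x < cur && !p->compare_exchange_weak(cur, x)) {
        // on failure, compare_exchange_weak refreshed cur; loop and retry
      }
      return cur;   // the value observed before the (possible) write
    }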

SLIDE 10

Use Atomic Primitives

  • Fetch-and-add (FAA): integer FAA(integer* p, integer x): add x to the integer at p, and return the old value
  • Multiple threads want to add a value to a shared variable
  • Multiple threads want to get a globally sequentialized order

    // Shared variable sum
    void Add(x) { FAA(&sum, x); }

    // Shared variable count
    int get_id() { return FAA(&count, 1); }

Without the atomic primitive, Add races:

    // Shared variable sum
    void Add(x) { sum = sum + x; }
    // i.e.: temp = sum; sum = temp + x;

    // Example: sum = 5, and P1 runs Add(3) while P2 runs Add(4).
    // Both read temp = 5; P2 writes 9, then P1 writes 8.
    // Final value: sum = 8 (but should be 12).
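As a compilable illustration of the two FAA use cases (my own sketch, using C++11 threads and std::atomic rather than the course's Cilk setup):

    #include <atomic>
    #include <iostream>
    #include <thread>
    #include <vector>

    std::atomic<int> sum{0};
    std::atomic<int> count{0};

    void Add(int x) { sum.fetch_add(x); }          // FAA(&sum, x)
    int get_id() { return count.fetch_add(1); }    // FAA(&count, 1)

    int main() {
      std::vector<std::thread> threads;
      for (int i = 0; i < 8; i++)
        threads.emplace_back([] {
          Add(3);                  // concurrent adds are never lost
          get_id();                // each thread receives a distinct id
        });
      for (auto& t : threads) t.join();
      std::cout << sum << std::endl;   // always prints 24
      return 0;
    }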

SLIDE 11

Use Atomic Primitives

  • Compare-and-swap:
  • Multiple threads want to add to the head of a linked list

    struct node {
      value_type value;
      node* next;
    };
    shared variable node* head;

(Diagram: two threads insert new nodes X1 and X2 at the head at the same time.)

The naive version races: both threads may read the same head, and one insertion is lost:

    void insert(node* x) {
      x->next = head;
      head = x;
    }

The CAS version retries until the head is swung atomically:

    void insert(node* x) {
      node* old_head = head;
      x->next = old_head;
      while (!CAS(&head, old_head, x)) {
        old_head = head;      // re-read the current head and retry
        x->next = old_head;
      }
    }
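The same retry loop, as a sketch of mine in C++11: std::atomic's compare_exchange_weak reloads old_head with the current head on failure, which makes the retry explicit in the loop condition:

    #include <atomic>

    struct node {
      int value;
      node* next;
    };

    std::atomic<node*> head{nullptr};   // shared list head

    void insert(node* x) {
      node* old_head = head.load();
      do {
        x->next = old_head;
        // on failure, compare_exchange_weak reloads old_head with the
        // current head, so the loop simply retries with fresh state
      } while (!head.compare_exchange_weak(old_head, x));
    }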

SLIDE 12

Use Atomic Primitives

(Same code as the previous slide; the diagram additionally labels each thread's old_head pointer.)
SLIDE 13

Concurrency – rule of thumb

  • Do not use concurrency, algorithmically
  • If you have to (with the guarantee of correctness):
    • Do not use concurrent writes
  • If you have to (with the guarantee of correctness):
    • Do not use locks; use atomic primitives (still, with the guarantee of correctness)

SLIDE 14

Filtering/packing

SLIDE 15

Parallel filtering / packing

  • Given an array A of elements and a predicate function f, output an array B with the elements in A that satisfy f

    A = 4 2 9 3 6 5 7 11 10 8
    B = 9 3 5 7 11

    f(x) = true  if x is odd
           false if x is even

SLIDE 16

Parallel filtering / packing

  • How can we know the length of B in parallel?
    • Count the number of red (flagged) elements – a parallel reduce
    • O(n) work and O(log n) depth

    A     = 4 2 9 3 6 5 7 11 10 8
    flags = 0 0 1 1 0 1 1 1  0  0

SLIDE 17

Parallel filtering / packing

  • How can we know where 9 should go?
    • 9 is the first red element, 3 is the second, …

    A                   = 4 2 9 3 6 5 7 11 10 8
    Flags of A          = 0 0 1 1 0 1 1 1  0  0
    Prefix sum of flags = 0 0 1 2 2 3 4 5  5  5
    B                   = 9 3 5 7 11
    index               = 1 2 3 4 5

    Filter(A, n, B, f) {
      new array flag[n], ps[n];
      parallel_for (i = 1 to n) {
        flag[i] = f(A[i]);          // 1 if A[i] satisfies the predicate
      }
      ps = scan(flag, n);           // inclusive prefix sum of the flags
      parallel_for (i = 1 to n) {
        if (ps[i] != ps[i-1])       // A[i]'s flag was 1
          B[ps[i]] = A[i];          // ps[i] is A[i]'s slot in B
      }
    }
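For reference, a self-contained sequential version of this Filter (my own sketch; each loop below would be a parallel_for and the scan a parallel scan, with the same result):

    #include <vector>

    // Sequential reference for Filter: flag, prefix-sum, then write.
    template <typename Pred>
    std::vector<int> filter(const std::vector<int>& A, Pred f) {
      int n = (int)A.size();
      std::vector<int> ps(n);
      for (int i = 0; i < n; i++)                        // parallel_for in the real version
        ps[i] = (i ? ps[i-1] : 0) + (f(A[i]) ? 1 : 0);   // inclusive scan of the flags
      std::vector<int> B(n ? ps[n-1] : 0);
      for (int i = 0; i < n; i++)                        // parallel_for in the real version
        if (ps[i] != (i ? ps[i-1] : 0))                  // A[i]'s flag was 1
          B[ps[i] - 1] = A[i];                           // ps[i] is the 1-based slot in B
      return B;
    }

Calling filter(A, [](int x) { return x % 2 == 1; }) on the array above returns {9, 3, 5, 7, 11}.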

SLIDE 18

Application of filter: partition in quicksort

  • For an array A, move the elements in A smaller than k to the left, and those larger than k to the right
  • The dividing criterion can generally be any predicate

    A = 6 2 9 4 1 3 5 8 7

    Possible output (partition by 6):

    A = 2 4 1 3 5 6 9 8 7

SLIDE 19

Using filter for partition

    Partition(A, n, k, B) {
      new array flag[n], ps[n];
      parallel_for (i = 1 to n) {
        flag[i] = (A[i] < k);
      }
      ps = scan(flag, n);
      parallel_for (i = 1 to n) {
        if (ps[i] != ps[i-1])
          B[ps[i]] = A[i];
      }
    }

Using 6 as a pivot (looking at the left part as an example):

    A                  = 6 2 9 4 1 3 5 8 7
    flag (A[i] < 6)    = 0 1 0 1 1 1 1 0 0
    prefix sum of flag = 0 1 1 2 3 4 5 5 5
    left part of A     = X 2 X 4 1 3 5 X X
    pack               = 2 4 1 3 5

Can we avoid using too much extra space?

SLIDE 20

Implementation trick: delayed sequence

SLIDE 21

Delayed sequence

  • A sequence is a function, so it does not need to be stored
    • It maps an index (subscript) to a value
    • Saves some space!

SLIDE 22

Delayed sequence

  • A sequence is a function, so it does not need to be stored
  • Save some space

Array version (running time: about 0.19s for n = 10^9, with coarsening):

    int reduce(int* A, int n) {
      if (n == 1) return A[0];
      int L, R;
      L = cilk_spawn reduce(A, n/2);
      R = reduce(A + n/2, n - n/2);
      cilk_sync;
      return L + R;
    }

    int main() {
      cin >> n;
      parallel_for (int i = 0; i < n; i++) A[i] = i;
      cout << reduce(A, n) << endl;
    }

Delayed-sequence version (running time: about 0.16s for n = 10^9, with coarsening):

    inline int get_val(int i) { return i; }

    int reduce(int start, int n, function f) {
      if (n == 1) return f(start);
      int L, R;
      L = cilk_spawn reduce(start, n/2, f);
      R = reduce(start + n/2, n - n/2, f);
      cilk_sync;
      return L + R;
    }

    int main() {
      cin >> n;
      cout << reduce(0, n, get_val) << endl;
    }
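In modern C++ the function argument would typically be a template parameter. Below is a self-contained sketch of mine of a delayed sequence as a reusable type (C++14; a sequential loop stands in for the spawned recursion):

    #include <cstddef>
    #include <iostream>

    // A delayed sequence is a size plus a function; operator[] computes
    // each element on demand, so no n-element array is ever allocated.
    template <typename F>
    struct delayed_seq {
      size_t n;
      F f;
      auto operator[](size_t i) const { return f(i); }
    };

    template <typename F>
    delayed_seq<F> make_delayed(size_t n, F f) { return {n, f}; }

    // Divide-and-conquer reduce over any indexable sequence.  In the Cilk
    // version the left half would be cilk_spawn'ed; it is sequential here
    // only to keep the sketch self-contained.
    template <typename Seq>
    long long reduce(const Seq& s, size_t start, size_t n) {
      if (n == 1) return s[start];
      long long L = reduce(s, start, n / 2);
      long long R = reduce(s, start + n / 2, n - n / 2);
      return L + R;
    }

    int main() {
      auto s = make_delayed(1000, [](size_t i) { return (long long)i; });
      std::cout << reduce(s, 0, s.n) << std::endl;  // 0+1+...+999 = 499500
      return 0;
    }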

SLIDE 23

Partition without the flag array

Old version:

    Partition(A, n, k, B) {
      new array flag[n], ps[n];
      parallel_for (i = 1 to n) {
        flag[i] = (A[i] < k);
      }
      ps = scan(flag, n);
      parallel_for (i = 1 to n) {
        if (ps[i] != ps[i-1]) B[ps[i]] = A[i];
      }
    }

New version:

    Partition(A, n, k, B) {
      new array ps[n];
      // Equivalent to having an array flag[i] = (A[i] < k),
      // but without explicitly storing it
      ps = scan(0, n, [&](int i) { return (A[i] < k); });
      parallel_for (i = 1 to n) {
        if (ps[i] != ps[i-1]) B[ps[i]] = A[i];
      }
    }

(We can also get rid of the ps[] array, but it makes the program a bit more complicated.)

SLIDE 24

Implementation trick: nested/granular/blocked parallel for-loops

SLIDE 25

Nested parallel for-loops

  • Usually you only need to parallelize the outermost one
  • Make each parallel task large enough

    // Parallel i loop – running time: 3.18s
    cilk_for (int i = 0; i < n; ++i)
      for (int k = 0; k < n; ++k)
        for (int j = 0; j < n; ++j)
          C[i][j] += A[i][k] * B[k][j];

    // Parallel i and j loops – running time: 10.64s
    cilk_for (int i = 0; i < n; ++i)
      for (int k = 0; k < n; ++k)
        cilk_for (int j = 0; j < n; ++j)
          C[i][j] += A[i][k] * B[k][j];

    // Parallel j loop – running time: 531.71s
    for (int i = 0; i < n; ++i)
      for (int k = 0; k < n; ++k)
        cilk_for (int j = 0; j < n; ++j)
          C[i][j] += A[i][k] * B[k][j];

Rule of Thumb: Parallelize outer loops rather than inner loops.

SLIDE 26

Granular-for

  • If some condition holds, run the for-loop in parallel
    • Usually the condition is whether the size of the for-loop is larger than a threshold
  • Otherwise, run it sequentially
  • E.g., for a for-loop with size smaller than 2000, run it sequentially; otherwise run it in parallel

    #define granular_for(_i, _start, _end, _cond, _body) {     \
      if (_cond) {                                              \
        {parallel_for(size_t _i=_start; _i < _end; _i++) {      \
          _body                                                 \
        }}                                                      \
      } else {                                                  \
        {for (size_t _i=_start; _i < _end; _i++) {              \
          _body                                                 \
        }}                                                      \
      }                                                         \
    }

    granular_for(i, 0, n, (n > 2000), { A[i] = i; });

SLIDE 27

Blocked-for

  • For a for-loop, combine each _bsize iterations into one task, and run the tasks in parallel
  • Also avoids the case when each task is too small
    • Your scheduler can help do this in some sense, but it doesn't know much about your loop body
  • E.g., put every 500 loop bodies into one task

    #define nblocks(_n, _bsize) (1 + ((_n)-1)/(_bsize))

    #define blocked_for(_i, _s, _e, _bsize, _body) {           \
      intT _ss = _s;                                            \
      intT _ee = _e;                                            \
      intT _n = _ee - _ss;                                      \
      intT _l = nblocks(_n, _bsize);      /* # of blocks */     \
      parallel_for (intT _b = 0; _b < _l; _b++) {               \
        intT _bs = _ss + _b * (_bsize);   /* start of block */  \
        intT _be = min(_bs + (_bsize), _ee);  /* end of block */\
        for (intT _i = _bs; _i < _be; _i++) {                   \
          _body                                                 \
        }                                                       \
      }                                                         \
    }

    blocked_for(i, 0, n, 500, { A[i] = i; });

SLIDE 28

Implementation trick: dos and don’ts

SLIDE 29

Allocate large memory

  • Don’t (frequently, dynamically) allocate memory in parallel
    • This has to go through the OS
    • New space cannot be allocated in parallel with other threads running
  • Allocate enough memory in advance
    • When needed, distribute the memory to the threads
  • This means using std::vector *can* slow down your parallel code (if you are not careful enough)
    • When resizing, it needs to allocate new space and delete the old one
    • If you want to use std::vector, reserve enough space before starting parallel execution (see the sketch below)
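A minimal sketch (mine, with a plain loop standing in for a parallel_for) of the reserve-in-advance pattern: size the vector once before the parallel phase, so no reallocation can happen while threads are writing:

    #include <iostream>
    #include <vector>

    int main() {
      size_t n = 1000000;
      std::vector<int> A;
      A.resize(n);                  // one allocation, done sequentially in advance

      // parallel_for in the real code; calling push_back here instead could
      // trigger a reallocation while other threads hold pointers into A
      for (size_t i = 0; i < n; i++)
        A[i] = (int)i;

      std::cout << A[n - 1] << std::endl;   // prints 999999
      return 0;
    }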

SLIDE 30

Generating random numbers

  • Do not use the default random number generator
    • Uses system time
    • Involves synchronization – slows down parallel performance
  • Use a hash function instead
    • Just write some random things as your hash function – it’s a pseudo-random number generator anyway

Generate n random integers in parallel:

    // a 32-bit hash function
    inline uint32_t random_hash(uint32_t a) {
      a = (a + 0x7ed55d16) + (a << 12);
      a = (a ^ 0xc761c23c) ^ (a >> 19);
      a = (a + 0x165667b1) + (a << 5);
      a = (a + 0xd3a2646c) ^ (a << 9);
      a = (a + 0xfd7046c5) + (a << 3);
      a = (a ^ 0xb55a4f09) ^ (a >> 16);
      return a;
    }

    parallel_for (i = 0 to n)
      random[i] = random_hash(i);

SLIDE 31

Parallel merging

SLIDE 32

Parallel merging

  • Given two sorted arrays, merge them into one sorted array
  • Sequentially, use two moving pointers

    A      = 4 7 8
    B      = 1 2 3 5 6 9
    merged = 1 2 3 4 5 6 7 8 9
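For reference, the sequential two-pointer merge the slide alludes to (a minimal sketch of mine): at each step, copy the smaller of the two front elements and advance that pointer. This is O(n1 + n2) work but inherently sequential, which motivates the parallel algorithm on the next slide.

    #include <vector>

    // Sequential two-pointer merge of two sorted arrays.
    std::vector<int> merge(const std::vector<int>& A, const std::vector<int>& B) {
      std::vector<int> C;
      C.reserve(A.size() + B.size());
      size_t i = 0, j = 0;
      while (i < A.size() && j < B.size())
        C.push_back(A[i] <= B[j] ? A[i++] : B[j++]);  // take the smaller front
      while (i < A.size()) C.push_back(A[i++]);       // copy any leftovers
      while (j < B.size()) C.push_back(B[j++]);
      return C;
    }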

SLIDE 33

A parallel merge algorithm

  • Find the median m of one array
  • Binary search for it in the other array
  • Put m in the correct slot of the output
  • Recursively, in parallel, do:
    • Merge the left two sub-arrays into the left half of the output
    • Merge the right two sub-arrays into the right half of the output

Example: A = 2 3 4 6 9 and B = 0 1 5 7 8. The median of A is 4; a binary search places it after 0 and 1 in B.
  Subproblem 1: merge 2,3 with 0,1
  Subproblem 2: merge 6,9 with 5,7,8

SLIDE 34

A parallel merge algorithm

(Same example as the previous slide.)

    // Merge array A' of length n1 and array B' of length n2 into array C.
    Merge(A', n1, B', n2, C) {
      if (A' is empty or B' is empty) base_case;
      m  = n1 / 2;                     // index of the median of A' (0-indexed)
      m2 = binary_search(B', A'[m]);   // # of elements in B' smaller than A'[m]
      C[m + m2] = A'[m];               // the median's final slot in the output
      in parallel:
        Merge(A', m, B', m2, C);                                          // left parts
        Merge(A' + m + 1, n1 - m - 1, B' + m2, n2 - m2, C + m + m2 + 1);  // right parts
      return C;
    }
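A compilable sketch of this algorithm (my own; std::lower_bound plays the role of the binary search, and plain recursion stands where the slide says "in parallel" – with OpenCilk the two recursive calls would be cilk_spawn'ed like the earlier reduce):

    #include <algorithm>

    // A and B are sorted; C has room for n1 + n2 elements.
    void merge(const int* A, int n1, const int* B, int n2, int* C) {
      if (n1 == 0) { std::copy(B, B + n2, C); return; }
      if (n2 == 0) { std::copy(A, A + n1, C); return; }
      if (n1 < n2) { merge(B, n2, A, n1, C); return; }  // recurse on the larger array
      int m  = n1 / 2;                                  // median index of A
      int m2 = (int)(std::lower_bound(B, B + n2, A[m]) - B);  // elements of B below A[m]
      C[m + m2] = A[m];                                 // median's final slot
      merge(A, m, B, m2, C);                            // left halves (parallel call)
      merge(A + m + 1, n1 - m - 1, B + m2, n2 - m2, C + m + m2 + 1);  // right halves
    }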