SLIDE 1

PARALLEL PROGRAMMING ON MULTICORES Threading Subroutines

Bo Zhang zhangb@cs.duke.edu Nikos P. Pitsianis Xiaobai Sun

DEPARTMENT OF COMPUTER SCIENCE DUKE UNIVERSITY

09/29/2010

SLIDE 2

MY EXPERIENCE

BACKGROUND

Ph.D. in Mathematics; basic knowledge of C

WHAT I HAVE GAINED

◮ Learned Fortran, Pthreads, and OpenMP
◮ Parallel Fast Multipole Method (FMM)

  1. First parallel version on multicore desktop & laptop
  2. Initial code has 12k lines; current code has 18k lines
  3. In both Pthreads and OpenMP
  4. Problem size: 20M (laptop); 100M (workstation)
  5. Speedup: 8X on an oct-core workstation

SLIDE 3

USING EXISTING SUBROUTINES

You have a working sequential code, but the parallel version produces

  1. Segmentation faults
  2. Bus errors
  3. Heap corruption
  4. Inconsistent results

What happened?

SLIDE 4

USE OF EXISTING SUBROUTINES

Potential problems: the subroutines are not THREAD-SAFE

  • Concurrent updates to global variables
  • Concurrent use of local scratch space
  • Stack overflow

SLIDE 7

USE OF EXISTING SUBROUTINES

Concurrent updates to global variables

Original code (not thread-safe):

    double balance;

    void deposit (double s) { balance += s; }
    void withdraw (double s) { balance -= s; }

    int main ()
    {
      balance = 0.0;
      deposit (s1);
      deposit (s2);
      withdraw (s3);
      . . .
      printf("%f\n", balance);
      return 0;
    }

Fix: use a mutex variable

    #include <pthread.h>

    double balance;
    pthread_mutex_t balance_mutex = PTHREAD_MUTEX_INITIALIZER;

    void deposit (double s)
    {
      /* only one thread at a time may update balance */
      pthread_mutex_lock (&balance_mutex);
      balance += s;
      pthread_mutex_unlock (&balance_mutex);
    }

    void withdraw (double s)
    {
      pthread_mutex_lock (&balance_mutex);
      balance -= s;
      pthread_mutex_unlock (&balance_mutex);
    }

    int main ()
    {
      balance = 0.0;
      . . .
      return 0;
    }
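The mutex fix can be exercised with a small driver. The following is a minimal sketch, not the author's code: the names depositor, run_deposits, and MAX_THREADS are hypothetical and exist only for illustration. Several threads call deposit() concurrently, and because every update of balance happens while balance_mutex is held, no deposit is lost.

```c
#include <pthread.h>
#include <stddef.h>

#define MAX_THREADS 64   /* hypothetical upper bound for this sketch */

/* Shared account state, protected by balance_mutex as on the slide. */
static double balance;
static pthread_mutex_t balance_mutex = PTHREAD_MUTEX_INITIALIZER;

static void deposit(double s)
{
    pthread_mutex_lock(&balance_mutex);
    balance += s;                      /* critical section */
    pthread_mutex_unlock(&balance_mutex);
}

/* Hypothetical worker: performs *arg unit deposits. */
static void *depositor(void *arg)
{
    long count = *(const long *)arg;
    for (long k = 0; k < count; k++)
        deposit(1.0);
    return NULL;
}

/* Hypothetical driver: starts nthreads depositors, waits for them,
 * and returns the final balance, or -1.0 if a thread could not start. */
double run_deposits(int nthreads, long count)
{
    pthread_t threads[MAX_THREADS];
    int started = 0;

    if (nthreads < 1 || nthreads > MAX_THREADS)
        return -1.0;

    balance = 0.0;
    for (int t = 0; t < nthreads; t++) {
        if (pthread_create(&threads[t], NULL, depositor, &count) != 0)
            break;
        started++;
    }
    for (int t = 0; t < started; t++)
        pthread_join(threads[t], NULL);

    return (started == nthreads) ? balance : -1.0;
}
```

Calling run_deposits(4, 10000) should yield 40000 on every run. If the lock and unlock calls are removed from deposit(), the result may vary from run to run, which is the "inconsistent results" symptom.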

SLIDE 10

USE OF EXISTING SUBROUTINES

Concurrent use of local scratch space

Original code (task1 and task2 share one scratch array):

    int main ()
    {
      int s1, s2, s, work[100];
      s1 = task1(work);
      s2 = task2(work);
      s = s1+s2;
      printf("%d\n", s);
      return 0;
    }

Fix: allocate scratch space for each thread; trim it down when necessary

    int main ()
    {
      int s1, s2, s, work1[100], work2[100];
      s1 = task1(work1);
      s2 = task2(work2);
      s = s1+s2;
      printf("%d\n", s);
      return 0;
    }
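The slide code calls the two tasks one after the other; the sharing problem only appears once they run concurrently. The sketch below shows one way that might look with Pthreads. The task body, struct job, run_job, and run_two_tasks are assumptions made for illustration. Each thread receives a pointer to its own scratch array, so neither task can overwrite the other's intermediate data.

```c
#include <pthread.h>
#include <stddef.h>

#define WORK_LEN 100

/* Hypothetical task: fills its scratch array with id, then sums it.
 * With a shared array, another thread could overwrite entries
 * between the two loops. */
static int task(int *work, int id)
{
    int sum = 0;
    for (int k = 0; k < WORK_LEN; k++)
        work[k] = id;
    for (int k = 0; k < WORK_LEN; k++)
        sum += work[k];
    return sum;
}

/* Per-thread arguments: task id, private scratch space, result slot. */
struct job {
    int id;
    int *work;
    int result;
};

static void *run_job(void *arg)
{
    struct job *j = arg;
    j->result = task(j->work, j->id);
    return NULL;
}

/* Runs two tasks concurrently, each with its own scratch array.
 * Returns s1 + s2, or -1 if a thread could not be created. */
int run_two_tasks(void)
{
    int work1[WORK_LEN], work2[WORK_LEN];
    struct job jobs[2] = { { 1, work1, 0 }, { 2, work2, 0 } };
    pthread_t threads[2];

    if (pthread_create(&threads[0], NULL, run_job, &jobs[0]) != 0)
        return -1;
    if (pthread_create(&threads[1], NULL, run_job, &jobs[1]) != 0) {
        pthread_join(threads[0], NULL);
        return -1;
    }
    pthread_join(threads[0], NULL);
    pthread_join(threads[1], NULL);
    return jobs[0].result + jobs[1].result;
}
```

The scratch arrays live in the caller's frame and stay valid because the caller joins both threads before returning.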

SLIDE 15

USE OF EXISTING SUBROUTINES

Stack overflow

Caused by recursive call of subroutine

    void foo (int i, int *s)
    {
      int ia[100];            /* a new ia[100] is pushed at every level */
      if ( i > 0)
        foo(i-1, s);
      *s += compute(ia, i);
    }

    int main ()
    {
      int s = 0, i = 2;
      foo(i, &s);
      return 0;
    }

Stack (one frame is added per nested call):

    Main     int i, s
    foo(2)   int ia[100]
    foo(1)   int ia[100]
    foo(0)   int ia[100]

SLIDE 16

USE OF EXISTING SUBROUTINES

Stack overflow

Caused by recursive call of subroutine

Fix: unfold recursion

Recursive version:

    void foo (int i, int *s)
    {
      int ia[100];
      if ( i > 0)
        foo(i-1, s);
      *s += compute(ia, i);
    }

    int main ()
    {
      int s = 0, i = 2;
      foo(i, &s);
      return 0;
    }

Unfolded version (the recursion evaluates compute for i = 0, 1, 2):

    int main ()
    {
      int i, s = 0, ia[100];
      for ( i = 0; i <= 2; i++ )
        s += compute(ia,i);
      return 0;
    }

Stack:

    Main     int i, s, ia[100]
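To check that unfolding preserves the result, the two versions can be compared directly. This is a sketch under assumptions: the slides do not define compute(), so the body below is a hypothetical stand-in, and sum_recursive and sum_unfolded are illustrative wrappers. The recursion started at n evaluates compute for i = 0, 1, ..., n, so the loop runs over the same range while keeping a single scratch array on the stack.

```c
/* Hypothetical stand-in for compute(): uses ia as scratch space
 * and returns a value that depends on i. */
static int compute(int *ia, int i)
{
    for (int k = 0; k < 100; k++)
        ia[k] = i + k;
    return ia[0] + ia[99];          /* equals 2*i + 99 */
}

/* Recursive form from the slide: one ia[100] per nesting level. */
static void foo(int i, int *s)
{
    int ia[100];
    if (i > 0)
        foo(i - 1, s);
    *s += compute(ia, i);
}

int sum_recursive(int n)
{
    int s = 0;
    foo(n, &s);
    return s;
}

/* Unfolded form: a single ia[100], reused in every iteration. */
int sum_unfolded(int n)
{
    int s = 0, ia[100];
    for (int i = 0; i <= n; i++)
        s += compute(ia, i);
    return s;
}
```

The stack use of sum_unfolded is constant in n, whereas sum_recursive needs n + 1 frames, each holding its own copy of ia.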

SLIDE 17

USE OF EXISTING SUBROUTINES

Stack overflow

Caused by memory allocation within subroutine

    #include <pthread.h>

    int ps[3];

    void *foo (void *threadid)
    {
      long tid = (long)threadid;
      int ia[100];                 /* allocated on this thread's stack */
      ps[tid] = compute(ia, tid);
      pthread_exit(NULL);
    }

    int main ()
    {
      long i;
      int s;
      pthread_t threads[3];
      for ( i = 0; i < 3; i++ )
        pthread_create(&threads[i], NULL, foo, (void *)i);
      for ( i = 0; i < 3; i++ )
        pthread_join (threads[i], NULL);
      s = ps[0]+ps[1]+ps[2];
      return 0;
    }

SLIDE 18

USE OF EXISTING SUBROUTINES

Stack overflow

Caused by memory allocation within subroutine

Remedies:

  • Dynamic allocation with malloc()
  • Use the pthread library: pthread_attr_setstacksize()
  • Allocate within main(), pass pointer to thread
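The second remedy can be sketched as follows. This is an illustration only: big_frame, run_with_stack, and N_INTS are hypothetical names, and the sizes are arbitrary. The thread function keeps a large local array (about 4 MiB when int is 4 bytes), which may exceed a small default thread stack, so the caller requests a larger stack through a thread attribute object before creating the thread.

```c
#include <pthread.h>
#include <stddef.h>

#define N_INTS (1024L * 1024L)   /* about 4 MiB of int when sizeof(int) == 4 */

/* Hypothetical thread function with a large stack-allocated array. */
static void *big_frame(void *arg)
{
    int ia[N_INTS];
    long sum = 0;
    for (long k = 0; k < N_INTS; k++)
        ia[k] = 1;
    for (long k = 0; k < N_INTS; k++)
        sum += ia[k];
    *(long *)arg = sum;              /* report the result to the creator */
    return NULL;
}

/* Creates one thread with the requested stack size and returns the
 * sum it computed, or -1 if any Pthreads call failed. */
long run_with_stack(size_t stack_bytes)
{
    pthread_attr_t attr;
    pthread_t thread;
    long sum = -1;

    if (pthread_attr_init(&attr) != 0)
        return -1;
    if (pthread_attr_setstacksize(&attr, stack_bytes) != 0 ||
        pthread_create(&thread, &attr, big_frame, &sum) != 0) {
        pthread_attr_destroy(&attr);
        return -1;
    }
    pthread_join(thread, NULL);
    pthread_attr_destroy(&attr);
    return sum;
}
```

A call such as run_with_stack((size_t)16 * 1024 * 1024) asks for a 16 MiB stack, which leaves room for the array. The requested size has to be at least PTHREAD_STACK_MIN, and a system may impose its own upper limit, so the return value of pthread_attr_setstacksize() is worth checking.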

SLIDE 19

USE OF EXISTING SUBROUTINES

Stack overflow

Caused by memory allocation within subroutine

Fix: allocate within main(), pass pointer to thread

    #include <pthread.h>
    #include <stdlib.h>

    int ps[3];
    int *gia;                      /* scratch space for all threads */

    void *foo (void *threadid)
    {
      long tid = (long)threadid;
      int *ia = &gia[tid*100];     /* this thread's 100-int slice */
      ps[tid] = compute(ia, tid);
      pthread_exit(NULL);
    }

    int main ()
    {
      long i;
      int s;
      pthread_t threads[3];
      gia = (int *)malloc(sizeof(int)*3*100);
      for ( i = 0; i < 3; i++ )
        pthread_create(&threads[i], NULL, foo, (void *)i);
      for ( i = 0; i < 3; i++ )
        pthread_join (threads[i], NULL);
      s = ps[0]+ps[1]+ps[2];
      free(gia);
      return 0;
    }

SLIDE 20

ALGORITHMIC REDUCTION OF CRITICAL SECTIONS

READING & WRITING ARE NOT SYMMETRIC

  1. Concurrent reading is fast & safe
  2. Concurrent writing must be AVOIDED
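The asymmetry can be illustrated with a parallel sum. This is a minimal sketch under assumptions: struct slice, sum_slice, parallel_sum, and NTHREADS are hypothetical names. All threads read the same input array concurrently, which needs no synchronization, and each thread writes only to its own partial-sum slot, so no two threads ever write to the same location and no mutex is required.

```c
#include <pthread.h>
#include <stddef.h>

#define NTHREADS 4   /* hypothetical thread count for this sketch */

/* One contiguous piece of the shared, read-only input. */
struct slice {
    const double *data;
    long begin, end;
    double partial;      /* written by exactly one thread */
};

static void *sum_slice(void *arg)
{
    struct slice *sl = arg;
    double acc = 0.0;
    for (long k = sl->begin; k < sl->end; k++)
        acc += sl->data[k];          /* concurrent reads are safe */
    sl->partial = acc;               /* private write: no lock needed */
    return NULL;
}

/* Sums data[0..n-1] using NTHREADS threads and no locks. A slice
 * whose thread cannot be created is summed by the caller instead. */
double parallel_sum(const double *data, long n)
{
    struct slice slices[NTHREADS];
    pthread_t threads[NTHREADS];
    int created[NTHREADS];
    double total = 0.0;

    for (int t = 0; t < NTHREADS; t++) {
        slices[t].data = data;
        slices[t].begin = n * t / NTHREADS;
        slices[t].end = n * (t + 1) / NTHREADS;
        slices[t].partial = 0.0;
        created[t] =
            (pthread_create(&threads[t], NULL, sum_slice, &slices[t]) == 0);
        if (!created[t])
            sum_slice(&slices[t]);
    }
    for (int t = 0; t < NTHREADS; t++) {
        if (created[t])
            pthread_join(threads[t], NULL);
        total += slices[t].partial;  /* combined after the join */
    }
    return total;
}
```

The only combination step happens in the calling thread after each join, so it is sequential by construction.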

SLIDE 26

ALGORITHMIC REDUCTION OF CRITICAL SECTIONS

Example: TML of FMM

[Figure: boxes B1, B2, B3]

Need O(N) locks!

SLIDE 27

ALGORITHMIC REDUCTION OF CRITICAL SECTIONS

Example: TML of FMM

[Figure: boxes B1, B2, B3, B4]

Need O(3^d − 2^d) locks

SLIDE 28

ALGORITHMIC REDUCTION OF CRITICAL SECTIONS

Example: TML of FMM

[Figure: boxes B1, B2, B3]

Need ZERO locks
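The figures for this example are not recoverable from the text, so the following is only a guess at the kind of reorganization that removes all locks, not the author's FMM code. It uses a hypothetical one-dimensional stand-in in which every box receives a contribution from its left and right neighbours. If threads loop over source boxes and add into the neighbours (scatter), two threads can write the same target and locks are needed. If threads loop over target boxes and read from the neighbours (gather), every target is written by exactly one thread. The names struct range, gather, translate_all, and NTHREADS are illustrative.

```c
#include <pthread.h>
#include <stddef.h>

#define NTHREADS 4   /* hypothetical thread count for this sketch */

/* A contiguous block of target boxes owned by one thread. */
struct range {
    const double *src;   /* source data: only read */
    double *dst;         /* target data: each entry has one writer */
    int nbox;
    int begin, end;
};

/* Gather: for each owned target box, read the neighbouring sources. */
static void *gather(void *arg)
{
    const struct range *r = arg;
    for (int b = r->begin; b < r->end; b++) {
        double acc = 0.0;
        if (b > 0)
            acc += r->src[b - 1];
        if (b < r->nbox - 1)
            acc += r->src[b + 1];
        r->dst[b] = acc;             /* no other thread writes dst[b] */
    }
    return NULL;
}

/* Fills dst[0..nbox-1] from src using NTHREADS threads and no locks.
 * A block whose thread cannot be created is processed by the caller. */
void translate_all(const double *src, double *dst, int nbox)
{
    struct range ranges[NTHREADS];
    pthread_t threads[NTHREADS];
    int created[NTHREADS];

    for (int t = 0; t < NTHREADS; t++) {
        ranges[t].src = src;
        ranges[t].dst = dst;
        ranges[t].nbox = nbox;
        ranges[t].begin = nbox * t / NTHREADS;
        ranges[t].end = nbox * (t + 1) / NTHREADS;
        created[t] =
            (pthread_create(&threads[t], NULL, gather, &ranges[t]) == 0);
        if (!created[t])
            gather(&ranges[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        if (created[t])
            pthread_join(threads[t], NULL);
}
```

The trade-off in this sketch is that each thread needs read access to source data owned by its neighbours, which is cheap because concurrent reads are safe.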

SLIDE 29

DEVELOPING TIPS

  • Save a copy of your working code
  • Do sequential version first
  • Work on one subroutine at a time
  • Pick the most time-critical subroutine first

SLIDE 30

DEVELOPING TIPS

Study software structures

[Figure: the operations below arranged by tree level l (1 to lmax, "All l's", "All Particles") along a TIME STEP axis]

OPERATIONS

  • Upward Pass: TSM, TMM
  • List 2: TME, TEE, TEL
  • List 3: TMT or TST
  • List 5 & Evaluate Local: TLL & TLT
  • List 4: TSL or TST
  • List 1: TST
  • Sum Local & Direct

SLIDE 32

CODING TIPS

  • Do NOT use unless necessary
  • No goto statement in Fortran
  • Loops in Fortran

    Structured form:

        DO ......
          .......
        ENDDO

    Labeled form:

        DO 10 ......
          .......
     10 CONTINUE

  • No break statement

UTILIZE COMPILER INSTEAD OF BLOCKING IT

SLIDE 33

CODING TIPS

  • Do NOT use unless necessary
  • No goto statement in Fortran
  • Loops in Fortran
  • No break statement

    Loop controlled by its condition:

        while (cond) {
          .......
          cond = foo();
        }

    Loop exited with break:

        while (1) {
          ......
          break;
        }

UTILIZE COMPILER INSTEAD OF BLOCKING IT

SLIDE 35

TESTING & TUNING TIPS

  • Makefile
  • Split Fortran subroutines into individual files
      √ fsplit: http://developers.sun.com/sunstudio/documentation/ss12/mr/man1/fsplit.1.html
      √ Use individual compilation flags for each routine
  • Debugging & profiling tools
  • Shell script for batch test

SLIDE 36

TESTING & TUNING TIPS

  • Makefile
  • Split Fortran subroutines into individual files
  • Debugging & profiling tools
      √ gprof: http://www.cs.utah.edu/dept/old/texinfo/as/gprof_toc.html
      √ DDD: http://www.gnu.org/software/ddd
      √ valgrind: http://valgrind.org/
  • Shell script for batch test

SLIDE 37

TESTING & TUNING TIPS

  • Makefile
  • Split Fortran subroutines into individual files
  • Debugging & profiling tools
  • Shell script for batch test

    #! /bin/bash
    for flag in -O -O2 -O4
    do
      make clean
      make CFLAGS=$flag fFFLAGS=$flag sFFLAGS=-O
      for (( t=2; t<=128; t=t*2 ))
      do
        for (( s=50; s<=90; s=s+5 ))
        do
          for (( count=1; count<=10; count++ ))
          do
            ./fmm -n 10000 -t $t -s $s -d 2
          done
        done
      done
    done

Run the script in the background:

    > nohup ./script &

SLIDE 38

SUMMARY

◮ Threading subroutines
◮ Algorithmic reduction of critical sections
◮ Programming development tips

Thank you for your attention.
Please send your comments to zhangb@cs.duke.edu