Threading Subroutines

Bo Zhang (zhangb@cs.duke.edu), Nikos P. Pitsianis, Xiaobai Sun


  1. Parallel Programming on Multicores: Threading Subroutines. Bo Zhang (zhangb@cs.duke.edu), Nikos P. Pitsianis, Xiaobai Sun. Department of Computer Science, Duke University. 09/29/2010.

  2. My Experience
     Background: Ph.D. in Mathematics; basic knowledge of C.
     What I have gained:
     - Learned Fortran, Pthreads, and OpenMP
     - Parallel Fast Multipole Method (FMM)
       1. First parallel version on multicore desktop and laptop
       2. Initial code has 12k lines; current code has 18k lines
       3. In both Pthreads and OpenMP
       4. Problem size: 20M (laptop); 100M (workstation)
       5. Speedup: 8X on an oct-core workstation

  3. Using Existing Subroutines
     You have a working sequential code, but the parallel version suffers from:
     1. Segmentation fault
     2. Bus error
     3. Heap collapse
     4. Inconsistent results
     What happened?

  4. Use of Existing Subroutines
     Potential problems when subroutines are not thread-safe:
     - Concurrent updates to global variables
     - Concurrent use of local scratch space
     - Stack overflow

  5. Use of Existing Subroutines: concurrent updates to global variables

         double balance;
         void deposit (double s) {
           balance += s;
         }
         void withdraw (double s) {
           balance -= s;
         }
         int main () {
           balance = 0.0;
           deposit (s1);
           deposit (s2);
           withdraw (s3);
           . . .
           printf("%f\n", balance);
           return 0;
         }

  6. Use of Existing Subroutines: concurrent updates to global variables
     Fix: use a mutex variable.

  7. Use of Existing Subroutines: concurrent updates to global variables
     The same code with a mutex variable guarding every update to balance:

         double balance;
         pthread_mutex_t balance_mutex = PTHREAD_MUTEX_INITIALIZER;
         void deposit (double s) {
           pthread_mutex_lock (&balance_mutex);
           balance += s;
           pthread_mutex_unlock (&balance_mutex);
         }
         void withdraw (double s) {
           pthread_mutex_lock (&balance_mutex);
           balance -= s;
           pthread_mutex_unlock (&balance_mutex);
         }
         int main () {
           balance = 0.0;
           . . .
           return 0;
         }
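The mutex version of deposit() and withdraw() can be exercised from several threads at once. The following is a minimal sketch, not the authors' code: the `teller` worker, the `run_tellers` driver, and the limit of 16 threads are assumptions made for illustration.

```c
#include <pthread.h>

static double balance = 0.0;
static pthread_mutex_t balance_mutex = PTHREAD_MUTEX_INITIALIZER;

static void deposit(double s) {
    pthread_mutex_lock(&balance_mutex);
    balance += s;                      /* critical section */
    pthread_mutex_unlock(&balance_mutex);
}

static void withdraw(double s) {
    pthread_mutex_lock(&balance_mutex);
    balance -= s;                      /* critical section */
    pthread_mutex_unlock(&balance_mutex);
}

/* Hypothetical worker: deposits 2.0 and withdraws 1.0, reps times. */
static void *teller(void *arg) {
    long reps = *(long *)arg;
    for (long k = 0; k < reps; k++) {
        deposit(2.0);
        withdraw(1.0);
    }
    return NULL;
}

/* Hypothetical driver: runs up to 16 tellers concurrently and
   returns the final balance after all of them have been joined. */
double run_tellers(int nthreads, long reps) {
    pthread_t threads[16];
    if (nthreads > 16) nthreads = 16;
    balance = 0.0;
    for (int t = 0; t < nthreads; t++)
        pthread_create(&threads[t], NULL, teller, &reps);
    for (int t = 0; t < nthreads; t++)
        pthread_join(threads[t], NULL);
    return balance;
}
```

Each teller adds a net 1.0 per iteration, so `run_tellers(4, 10000)` should return 40000.0. Without the lock and unlock calls the result would typically vary from run to run.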

  8. Use of Existing Subroutines: concurrent use of local scratch space

         int main () {
           int s1, s2, s, work[100];
           s1 = task1(work);
           s2 = task2(work);
           s = s1+s2;
           printf("%d\n", s);
           return 0;
         }

  9. Use of Existing Subroutines: concurrent use of local scratch space
     Fix: allocate scratch space for each thread; trim it down when necessary.

  10. Use of Existing Subroutines: concurrent use of local scratch space
      The same code with separate scratch space for each task:

          int main () {
            int s1, s2, s, work1[100], work2[100];
            s1 = task1(work1);
            s2 = task2(work2);
            s = s1+s2;
            printf("%d\n", s);
            return 0;
          }
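To show the per-thread scratch space in a threaded setting, here is a small sketch under stated assumptions: the `task_arg` struct, the `task` body (which just fills and sums its buffer), and `run_tasks` are hypothetical stand-ins for the slides' task1() and task2().

```c
#include <pthread.h>

#define NTASKS 2
#define WORK_LEN 100

/* Hypothetical argument block: each thread owns one of these. */
typedef struct {
    int id;
    int work[WORK_LEN];   /* private scratch space, never shared */
    int result;
} task_arg;

/* Hypothetical task: fills its own scratch buffer, then sums it. */
static void *task(void *p) {
    task_arg *a = p;
    int sum = 0;
    for (int k = 0; k < WORK_LEN; k++)
        a->work[k] = a->id + k;
    for (int k = 0; k < WORK_LEN; k++)
        sum += a->work[k];
    a->result = sum;
    return NULL;
}

/* Runs both tasks concurrently and adds their results after joining. */
int run_tasks(void) {
    pthread_t threads[NTASKS];
    task_arg args[NTASKS];
    int s = 0;
    for (int t = 0; t < NTASKS; t++) {
        args[t].id = t + 1;
        pthread_create(&threads[t], NULL, task, &args[t]);
    }
    for (int t = 0; t < NTASKS; t++) {
        pthread_join(threads[t], NULL);
        s += args[t].result;
    }
    return s;
}
```

Because each thread touches only its own `work` array, no lock is needed. If both threads shared one buffer, each could overwrite values the other is still summing.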

  11. Use of Existing Subroutines: stack overflow caused by recursive calls of a subroutine

          void foo (int i, int *s) {
            int ia[100];
            if ( i > 0)
              foo(i-1, s);
            *s += compute(ia, i);
          }
          int main () {
            int s = 0, i = 2;
            foo(i, &s);
            return 0;
          }

  12. Stack at the start of main(): the Main frame holds int i, s.

  13. Stack after main() calls foo(2): a foo(2) frame holding int ia[100] sits on top of the Main frame.

  14. Stack after foo(2) calls foo(1): a foo(1) frame with its own int ia[100] sits on top of the foo(2) frame.

  15. Stack after foo(1) calls foo(0): three foo frames, each with its own int ia[100], sit on top of the Main frame.

  16. Fix: unfold the recursion.

          int main () {
            int i, s = 0, ia[100];
            for ( i = 0; i <= 2; i++ )
              s += compute(ia,i);
            return 0;
          }

      The stack now holds a single Main frame with int i, s and ia[100].
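A quick way to check an unfolded loop against the recursion it replaces is to run both and compare. The sketch below assumes a toy `compute()` (the slides do not define it) and the hypothetical names `foo_recursive` and `foo_unfolded`.

```c
/* Hypothetical stand-in for the slides' compute(): fills the scratch
   array and returns 100 * i. */
static int compute(int *ia, int i) {
    for (int k = 0; k < 100; k++)
        ia[k] = i * k;
    return ia[99] + i;
}

/* Recursive form: every active call keeps its own ia[100] on the stack. */
void foo_recursive(int i, int *s) {
    int ia[100];
    if (i > 0)
        foo_recursive(i - 1, s);
    *s += compute(ia, i);
}

/* Unfolded form: one ia[100] is reused for every value of i,
   so stack use no longer grows with n. */
int foo_unfolded(int n) {
    int ia[100];
    int s = 0;
    for (int i = 0; i <= n; i++)
        s += compute(ia, i);
    return s;
}
```

Both forms call compute() for i = 0, 1, ..., n in the same order, so with n = 2 each yields 0 + 100 + 200 = 300.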

  17. Use of Existing Subroutines: stack overflow caused by memory allocation within a subroutine

          int ps[3];
          void *foo (void *threadid) {
            long tid = (long)threadid;
            int ia[100];
            ps[tid] = compute(ia, tid);
            pthread_exit(NULL);
          }
          int main () {
            long i;
            int s;
            pthread_t threads[3];
            for ( i = 0; i < 3; i++ )
              pthread_create(&threads[i], NULL, foo, (void *)i);
            for ( i = 0; i < 3; i++ )
              pthread_join (threads[i], NULL);
            s = ps[0]+ps[1]+ps[2];
            return 0;
          }

  18. Use of Existing Subroutines: stack overflow caused by memory allocation within a subroutine
      Possible fixes:
      - Dynamic allocation with malloc()
      - Use the pthread library: pthread_attr_setstacksize()
      - Allocate within main(), pass pointer to thread

  19. Use of Existing Subroutines: stack overflow caused by memory allocation within a subroutine
      Allocate within main(), pass pointer to thread:

          int ps[3];
          int *gia;
          void *foo (void *threadid) {
            long tid = (long)threadid;
            int *ia = &gia[tid*100];
            ps[tid] = compute(ia, tid);
            pthread_exit(NULL);
          }
          int main () {
            long i;
            int s;
            pthread_t threads[3];
            gia = (int *)malloc(sizeof(int)*3*100);
            for ( i = 0; i < 3; i++ )
              pthread_create(&threads[i], NULL, foo, (void *)i);
            for ( i = 0; i < 3; i++ )
              pthread_join (threads[i], NULL);
            s = ps[0]+ps[1]+ps[2];
            free(gia);
            return 0;
          }
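The slides name pthread_attr_setstacksize() as an alternative but do not show it in use. A minimal sketch follows, assuming a worker that needs about 1 MiB of automatic storage; the array size, the 1 MiB of headroom, and the names `worker`, `results`, and `run_with_big_stacks` are illustrative assumptions.

```c
#include <pthread.h>
#include <stddef.h>

#define NTHREADS 3
#define N_LOCAL (256 * 1024)   /* 256k ints: 1 MiB when int is 4 bytes */

static long results[NTHREADS];

/* Hypothetical worker with a large automatic array on its stack. */
static void *worker(void *arg) {
    long tid = (long)arg;
    int ia[N_LOCAL];
    long sum = 0;
    for (long k = 0; k < N_LOCAL; k++)
        ia[k] = 1;
    for (long k = 0; k < N_LOCAL; k++)
        sum += ia[k];
    results[tid] = sum + tid;
    return NULL;
}

/* Creates the threads with an explicitly sized stack.
   Returns 0 on success, -1 if any pthread call fails. */
int run_with_big_stacks(void) {
    pthread_attr_t attr;
    pthread_t threads[NTHREADS];
    /* Room for the local array plus 1 MiB of headroom (assumed margin). */
    size_t stack_bytes = sizeof(int) * N_LOCAL + ((size_t)1 << 20);
    long i;

    if (pthread_attr_init(&attr) != 0)
        return -1;
    if (pthread_attr_setstacksize(&attr, stack_bytes) != 0)
        return -1;
    for (i = 0; i < NTHREADS; i++)
        if (pthread_create(&threads[i], &attr, worker, (void *)i) != 0)
            return -1;
    pthread_attr_destroy(&attr);   /* safe once the threads exist */
    for (i = 0; i < NTHREADS; i++)
        pthread_join(threads[i], NULL);
    return 0;
}
```

The default thread stack size is system-dependent, so a subroutine that runs fine on the main thread can still overflow a worker thread's stack. Setting the size explicitly removes that dependence, at the cost of reserving the memory for every thread.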

  20. Algorithmic Reduction of Critical Sections
      Reading and writing are not symmetric:
      1. Concurrent reading is fast and safe
      2. Concurrent writing must be AVOIDED
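The asymmetry between reading and writing can be seen in a simple parallel sum. In this sketch (the array size, thread count, and the names `sum_slice` and `parallel_sum` are assumptions), all threads read the same input array, but each writes only to its own slot, so no critical section is needed.

```c
#include <pthread.h>

#define NTHREADS 4
#define N 1000

static int data[N];             /* shared, read-only while threads run */
static long partial[NTHREADS];  /* one slot per thread: no shared writes */

/* Each thread sums its own slice of data and stores the result
   in the slot it owns. */
static void *sum_slice(void *arg) {
    long tid = (long)arg;
    long lo = tid * N / NTHREADS;
    long hi = (tid + 1) * N / NTHREADS;
    long s = 0;
    for (long k = lo; k < hi; k++)
        s += data[k];
    partial[tid] = s;
    return NULL;
}

/* Fills data with 1..N, sums it in parallel, and combines the
   partial sums sequentially after all threads have been joined. */
long parallel_sum(void) {
    pthread_t threads[NTHREADS];
    long total = 0;
    for (int k = 0; k < N; k++)
        data[k] = k + 1;
    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&threads[t], NULL, sum_slice, (void *)t);
    for (long t = 0; t < NTHREADS; t++) {
        pthread_join(threads[t], NULL);
        total += partial[t];
    }
    return total;
}
```

The alternative, having every thread add into one shared total, would require a lock around each addition.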

  21. Algorithmic Reduction of Critical Sections. Example: T_ML of FMM. [Figure: set of points]

  22. Algorithmic Reduction of Critical Sections. Example: T_ML of FMM. [Figure]

  23. Algorithmic Reduction of Critical Sections. Example: T_ML of FMM. [Figure]

  24. Algorithmic Reduction of Critical Sections. Example: T_ML of FMM. [Figure: boxes B1 and B2]

  25. Algorithmic Reduction of Critical Sections. Example: T_ML of FMM. [Figure: boxes B1 and B2, with a question mark]

  26. Algorithmic Reduction of Critical Sections. Example: T_ML of FMM. [Figure: boxes B1, B2, B3] Need O(N) locks!

  27. Algorithmic Reduction of Critical Sections. Example: T_ML of FMM. [Figure: boxes B1, B2, B3, B4] Need O(3^d - 2^d) locks.

  28. Algorithmic Reduction of Critical Sections. Example: T_ML of FMM. [Figure: boxes B1, B2, B3] Need ZERO locks.
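The figures for this example did not survive, so the following is only one plausible reading of how a lock count can drop to zero: organize the work so that each thread writes only to boxes it owns and merely reads from neighboring boxes. The sketch uses a one-dimensional row of boxes with nearest-neighbor interactions as a stand-in for the FMM interaction lists; the layout, the sizes, and the names `src`, `dst`, `gather`, and `run_gather` are all assumptions, not the authors' algorithm.

```c
#include <pthread.h>

#define NBOX 64
#define NTHREADS 4

static double src[NBOX];  /* read-only inputs (stand-in for multipole data) */
static double dst[NBOX];  /* per-box accumulators (stand-in for local data) */

/* Each thread owns a contiguous range of target boxes. For every box it
   owns, it reads the neighboring sources and writes only its own dst[b]. */
static void *gather(void *arg) {
    long tid = (long)arg;
    long lo = tid * NBOX / NTHREADS;
    long hi = (tid + 1) * NBOX / NTHREADS;
    for (long b = lo; b < hi; b++) {
        double acc = 0.0;
        for (long nb = b - 1; nb <= b + 1; nb++)
            if (nb >= 0 && nb < NBOX && nb != b)
                acc += src[nb];        /* concurrent reads are safe */
        dst[b] = acc;                  /* only the owner writes dst[b] */
    }
    return NULL;
}

/* Sets src[b] = b, then runs the gather on NTHREADS threads. */
void run_gather(void) {
    pthread_t threads[NTHREADS];
    for (int b = 0; b < NBOX; b++)
        src[b] = (double)b;
    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&threads[t], NULL, gather, (void *)t);
    for (long t = 0; t < NTHREADS; t++)
        pthread_join(threads[t], NULL);
}
```

Looping over sources and adding into every neighbor's accumulator instead would make several threads write to the same `dst` entry, which is the situation that calls for locks.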

  29. Developing Tips
      - Save a copy of your working code
      - Do the sequential version first
      - Work on one subroutine at a time
      - Pick the most time-critical subroutine first

  30. Developing Tips
      Study software structures.
      [Figure: diagram of FMM operations against time step, with axes labeled Operations and Time Step and tree levels running from 0 to l_max. Stage labels: List 5 & Evaluate Local: T_LL & T_LT; Upward Pass: T_SM, T_MM; List 2: T_ME, T_EE, T_EL; Sum Local & Direct; List 3: T_MT or T_ST; List 4: T_SL or T_ST; List 1: T_ST; All Particles.]

  31. Coding Tips
      Do NOT use unless necessary:
      - No goto statement in Fortran
      - Loops in Fortran
      - No break statement
      Utilize the compiler instead of blocking it.

  32. Coding Tips: loops in Fortran
      Structured form:

          DO ......
            .......
          ENDDO

      Labeled form:

          DO 10 ......
            .......
          10 CONTINUE

  33. Coding Tips: no break statement
      Loop controlled by its condition:

          while (cond) {
            .......
            cond = foo();
          }

      Loop exited with break:

          while (1) {
            ......
            break;
          }
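As a concrete instance of a loop whose exit is expressed in its condition rather than through break, here is a small search routine. It is a sketch only; the function name `first_negative` and its contract are made up for illustration.

```c
/* Returns the index of the first negative element of a[0..n-1],
   or n if there is none. The loop has a single exit, controlled
   entirely by its condition. */
int first_negative(const int *a, int n) {
    int i = 0;
    int found = 0;
    while (i < n && !found) {
        if (a[i] < 0)
            found = 1;   /* condition becomes false; i stays at the hit */
        else
            i++;
    }
    return i;
}
```

For the array {3, 1, -4, 1, -5} the function returns 2, and for an array with no negative entries it returns the array length.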

  34. Testing & Tuning Tips
      - Makefile
      - Split Fortran subroutines into individual files
      - Debugging and profiling tools
      - Shell script for batch tests
