Parallel Computing Portable Software & Cost-Effective Hardware
2001.05.28 - 2001.06.01
http://www.ifi.uio.no/~xingca/MPI-COURSE/
Day 1-3: Parallel programming using MPI. Lecturer : Xing Cai
Day 4: Beowulf Linux cluster. Lecturer :


  1. Getting Started With MPI Process and processor We refer to a process as a logical unit which executes its own code, in an MIMD style. A processor is a physical device on which one or several processes are executed. The MPI standard uses the concept process consistently throughout its documentation. However, we only consider situations where one processor is responsible for one process and therefore use the two terms interchangeably. 25

  2. Getting Started With MPI Bindings to MPI routines MPI comprises a message-passing library where every routine has both a C binding (of the form MPI_Command_name) and a Fortran binding, in which the routine name is written in uppercase (MPI_COMMAND_NAME). (We focus on the C bindings in the lecture notes.) 26

  3. Getting Started With MPI Communicator A group of MPI processes with a name (context). Any process is identified by its rank. The rank is only meaningful within a particular communicator. By default, the communicator MPI_COMM_WORLD contains all the MPI processes. A communicator is a mechanism for identifying a subset of processes and promotes modular design of parallel libraries. 27
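As a small illustration of the subset idea (not from the lecture notes), the sketch below splits MPI_COMM_WORLD into two sub-communicators by rank parity using MPI_Comm_split; the variable names are our own.

#include <stdio.h>
#include <mpi.h>

/* Minimal sketch: split MPI_COMM_WORLD into two communicators,
   one containing the even-ranked and one the odd-ranked processes. */
int main (int nargs, char** args)
{
  int world_rank, sub_rank, sub_size;
  MPI_Comm sub_comm;                 /* hypothetical variable name */
  MPI_Init (&nargs, &args);
  MPI_Comm_rank (MPI_COMM_WORLD, &world_rank);
  /* color selects the sub-communicator, key orders the ranks inside it */
  MPI_Comm_split (MPI_COMM_WORLD, world_rank%2, world_rank, &sub_comm);
  MPI_Comm_rank (sub_comm, &sub_rank);
  MPI_Comm_size (sub_comm, &sub_size);
  printf("World rank %d has rank %d out of %d in its sub-communicator.\n",
         world_rank, sub_rank, sub_size);
  MPI_Comm_free (&sub_comm);
  MPI_Finalize ();
  return 0;
}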

  4. Getting Started With MPI The 6 most important MPI routines
MPI_Init - initiate an MPI computation
MPI_Finalize - terminate the MPI computation and clean up
MPI_Comm_size - how many processes participate in a given MPI communicator?
MPI_Comm_rank - which one am I? (A number between 0 and size-1.)
MPI_Send - send a message to a particular process within an MPI communicator
MPI_Recv - receive a message from a particular process within an MPI communicator 28

  5. Getting Started With MPI The first MPI program Let every process write “Hello world” on the standard output.
#include <stdio.h>
#include <mpi.h>
int main (int nargs, char** args)
{
  int size, my_rank;
  MPI_Init (&nargs, &args);
  MPI_Comm_size (MPI_COMM_WORLD, &size);
  MPI_Comm_rank (MPI_COMM_WORLD, &my_rank);
  printf("Hello world, I’ve rank %d out of %d procs.\n", my_rank, size);
  MPI_Finalize ();
  return 0;
} 29

  6. Getting Started With MPI The Fortran program
program hello
include "mpif.h"
integer size, my_rank, ierr
!
call MPI_INIT(ierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierr)
call MPI_COMM_RANK(MPI_COMM_WORLD, my_rank, ierr)
!
print *,"Hello world, I’ve rank ",my_rank," out of ",size
!
call MPI_FINALIZE(ierr)
end 30

  7. Getting Started With MPI Compilation by e.g.,
> mpicc -c hello.c
> mpicc -o hello hello.o
Execution
> mpirun -np 4 ./hello
Example running result (non-deterministic)
Hello world, I’ve rank 2 out of 4 procs.
Hello world, I’ve rank 1 out of 4 procs.
Hello world, I’ve rank 3 out of 4 procs.
Hello world, I’ve rank 0 out of 4 procs. 31

  8. Getting Started With MPI Parallel execution Process 0, Process 1, ..., Process P-1 [Slide figure: the “Hello world” program of the previous slide, replicated on every process; the code is not legible in this transcript.] No inter-processor communication, no synchronization. 32

  9. Getting Started With MPI Synchronization Many parallel algorithms require that no process proceeds before all the processes have reached the same state at certain points of a program. Explicit synchronization: int MPI_Barrier (MPI_Comm comm) Implicit synchronization through the use of e.g. pairs of MPI_Send and MPI_Recv. Ask yourself the following question: “If Process 1 progresses 100 times faster than Process 2, will the final result still be correct?” 33

  10. Getting Started With MPI Example: ordered output Explicit synchronization
#include <stdio.h>
#include <mpi.h>
int main (int nargs, char** args)
{
  int size, my_rank, i;
  MPI_Init (&nargs, &args);
  MPI_Comm_size (MPI_COMM_WORLD, &size);
  MPI_Comm_rank (MPI_COMM_WORLD, &my_rank);
  for (i=0; i<size; i++) {
    MPI_Barrier (MPI_COMM_WORLD);
    if (i==my_rank) {
      printf("Hello world, I’ve rank %d out of %d procs.\n", my_rank, size);
      fflush (stdout);
    }
  }
  MPI_Finalize ();
  return 0;
} 34

  11. Getting Started With MPI [Slide figure: the ordered-output program of the previous slide, replicated on Process 0, Process 1, ..., Process P-1; the code is not legible in this transcript.] The processes synchronize between themselves P times. Parallel execution result:
Hello world, I’ve rank 0 out of 4 procs.
Hello world, I’ve rank 1 out of 4 procs.
Hello world, I’ve rank 2 out of 4 procs.
Hello world, I’ve rank 3 out of 4 procs. 35

  12. Getting Started With MPI Point-to-point communication An MPI message is an array of elements of a particular MPI datatype: predefined standard types or derived types. MPI message = “data inside an envelope”. Data : start address of the message buffer, number of elements in the buffer, data type. Envelope : source/destination process, message tag, communicator. 36

  13. Getting Started With MPI To send a message int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm); This blocking send function returns when the data has been delivered to the system and the buffer can be reused. The message may not have been received by the destination process. 37

  14. Getting Started With MPI To receive a message int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status); This blocking receive function waits until a matching message is received from the system, so that the buffer contains the incoming message. Matching is on data type, source process (or MPI_ANY_SOURCE), and message tag (or MPI_ANY_TAG). Receiving fewer datatype elements than count is ok, but receiving more is an error. 38

  15. Getting Started With MPI MPI_Status The source or tag of a received message may not be known if wildcard values were used in the receive function. In C, MPI_Status is a structure that contains further information. It can be queried as follows: status.MPI_SOURCE status.MPI_TAG MPI_Get_count (MPI_Status *status, MPI_Datatype datatype, int *count); 39
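As a small illustration (not from the lecture notes), the sketch below receives with both wildcards and then inspects the status object; the buffer size and tag value are arbitrary assumptions, and at least two processes are assumed.

#include <stdio.h>
#include <mpi.h>

/* Sketch: receive from any source with any tag, then query the status
   object for the actual source, tag and element count.                */
int main (int nargs, char** args)
{
  int my_rank, count;
  int buf[10];                                /* arbitrary buffer size */
  MPI_Status status;
  MPI_Init (&nargs, &args);
  MPI_Comm_rank (MPI_COMM_WORLD, &my_rank);
  if (my_rank==0) {
    MPI_Recv (buf, 10, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
              MPI_COMM_WORLD, &status);
    MPI_Get_count (&status, MPI_INT, &count); /* how many ints arrived */
    printf("Got %d ints from rank %d with tag %d.\n",
           count, status.MPI_SOURCE, status.MPI_TAG);
  }
  else if (my_rank==1) {
    int msg[3] = {1, 2, 3};
    MPI_Send (msg, 3, MPI_INT, 0, 42, MPI_COMM_WORLD);
  }
  MPI_Finalize ();
  return 0;
}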

  16. Getting Started With MPI Another solution to “ordered output” Passing a semaphore from process to process, in a ring
#include <stdio.h>
#include <mpi.h>
int main (int nargs, char** args)
{
  int size, my_rank, flag;
  MPI_Status status;
  MPI_Init (&nargs, &args);
  MPI_Comm_size (MPI_COMM_WORLD, &size);
  MPI_Comm_rank (MPI_COMM_WORLD, &my_rank);
  if (my_rank>0)
    MPI_Recv (&flag, 1, MPI_INT, my_rank-1, 100, MPI_COMM_WORLD, &status);
  printf("Hello world, I’ve rank %d out of %d procs.\n", my_rank, size);
  if (my_rank<size-1)
    MPI_Send (&my_rank, 1, MPI_INT, my_rank+1, 100, MPI_COMM_WORLD);
  MPI_Finalize ();
  return 0;
} 40

  17. Getting Started With MPI [Slide figure: the ring-based program of the previous slide, replicated on Process 0, Process 1, ..., Process P-1; the code is not legible in this transcript.] Process 0 sends a message to Process 1, which forwards it further to Process 2, and so on. 41

  18. Getting Started With MPI MPI timer double MPI_Wtime(void) This function returns the number of wall-clock seconds elapsed since some time in the past. Example usage:
double starttime, endtime;
starttime = MPI_Wtime();
/* .... work to be timed ... */
endtime = MPI_Wtime();
printf("That took %f seconds\n",endtime-starttime); 42

  19. Getting Started With MPI Timing point-to-point communication (ping-pong) #include <stdio.h> #include <stdlib.h> #include "mpi.h" #define NUMBER_OF_TESTS 10 int main( int argc, char **argv ) { double *buf; int rank; int n; double t1, t2, tmin; int i, j, k, nloop; MPI_Status status; MPI_Init( &argc, &argv ); MPI_Comm_rank( MPI_COMM_WORLD, &rank ); if (rank == 0) printf( "Kind\t\tn\ttime (sec)\tRate (MB/sec)\n" ); for (n=1; n<1100000; n*=2) { nloop = 1000; buf = (double *) malloc( n * sizeof(double) ); 43

  20. Getting Started With MPI if (!buf) { fprintf( stderr, "Could not allocate send/recv buffer of size %d\n", n ); MPI_Abort( MPI_COMM_WORLD, 1 ); } tmin = 1000; for (k=0; k<NUMBER_OF_TESTS; k++) { if (rank == 0) { /* Make sure both processes are ready */ MPI_Sendrecv( MPI_BOTTOM, 0, MPI_INT, 1, 14, MPI_BOTTOM, 0, MPI_INT, 1, 14, MPI_COMM_WORLD, &status ); t1 = MPI_Wtime(); for (j=0; j<nloop; j++) { MPI_Send( buf, n, MPI_DOUBLE, 1, k, MPI_COMM_WORLD ); MPI_Recv( buf, n, MPI_DOUBLE, 1, k, MPI_COMM_WORLD, &status ); } t2 = (MPI_Wtime() - t1) / nloop; if (t2 < tmin) tmin = t2; } else if (rank == 1) { /* Make sure both processes are ready */ MPI_Sendrecv( MPI_BOTTOM, 0, MPI_INT, 0, 14, MPI_BOTTOM, 0, MPI_INT, 0, 14, MPI_COMM_WORLD, &status ); for (j=0; j<nloop; j++) { MPI_Recv( buf, n, MPI_DOUBLE, 0, k, MPI_COMM_WORLD, &status ); 44

  21. Getting Started With MPI MPI_Send( buf, n, MPI_DOUBLE, 0, k, MPI_COMM_WORLD ); } } } /* Convert to half the round-trip time */ tmin = tmin / 2.0; if (rank == 0) { double rate; if (tmin > 0) rate = n * sizeof(double) * 1.0e-6 /tmin; else rate = 0.0; printf( "Send/Recv\t%d\t%f\t%f\n", n, tmin, rate ); } free( buf ); } MPI_Finalize( ); return 0; } 45

  22. Getting Started With MPI Measurements of pingpong.c on a Linux cluster
Kind        n        time (sec)  Rate (MB/sec)
Send/Recv   1        0.000154    0.051985
Send/Recv   2        0.000155    0.103559
Send/Recv   4        0.000158    0.202938
Send/Recv   8        0.000162    0.394915
Send/Recv   16       0.000173    0.739092
Send/Recv   32       0.000193    1.323439
Send/Recv   64       0.000244    2.097787
Send/Recv   128      0.000339    3.018741
Send/Recv   256      0.000473    4.329810
Send/Recv   512      0.000671    6.104322
Send/Recv   1024     0.001056    7.757576
Send/Recv   2048     0.001797    9.114882
Send/Recv   4096     0.003232    10.137046
Send/Recv   8192     0.006121    10.706747
Send/Recv   16384    0.012293    10.662762
Send/Recv   32768    0.024315    10.781164
Send/Recv   65536    0.048755    10.753412
Send/Recv   131072   0.097074    10.801766
Send/Recv   262144   0.194003    10.809867
Send/Recv   524288   0.386721    10.845800
Send/Recv   1048576  0.771487    10.873298 46

  23. Getting Started With MPI Exercises for Day 1 Exercise One : Write a new “Hello World” program, where all the processes first generate a text message using sprintf and then send it to Process 0 (you may use strlen(message)+1 to find out the length of the message). Afterwards, Process 0 is responsible for writing out all the messages on the standard output. Exercise Two : Modify the “ping-pong” program to involve more than 2 processes in the measurement. That is, Process 0 sends a message to Process 1, which forwards the message further to Process 2, and so on, until Process P-1 returns the message back to Process 0. Run the test for some different choices of P. 47

  24. Getting Started With MPI Exercise Three : Write a parallel program for calculating π , using the formula
$$\pi = \int_0^1 \frac{4}{1+x^2}\, dx .$$
(Hint: use numerical integration and divide the interval [0, 1] into n intervals.)
How to use the PBS scheduling system?
1. mpicc program.c
2. qsub -o res.txt pbs.sh , where the script pbs.sh may be:
#! /bin/sh
#PBS -e stderr.txt
#PBS -l walltime=0:05:00,ncpus=4
cd $PBS_O_WORKDIR
mpirun -np $NCPUS ./a.out
exit $? 48

  25. Basic MPI Programming Example of send & receive: sum of random numbers Write an MPI program in which each process generates a random number and then Process 0 calculates the sum of these numbers. #include <stdio.h> #include <mpi.h> int main (int nargs, char** args) { int size, my_rank, i, a, sum; MPI_Init (&nargs, &args); MPI_Comm_size (MPI_COMM_WORLD, &size); MPI_Comm_rank (MPI_COMM_WORLD, &my_rank); srand (7654321*(my_rank+1)); a = rand()%100; printf("<%02d> a=%d\n",my_rank,a); fflush (stdout); if (my_rank==0) { MPI_Status status; sum = a; for (i=1; i<size; i++) { MPI_Recv (&a, 1, MPI_INT, i, 500, MPI_COMM_WORLD, &status); 49

  26. Basic MPI Programming sum += a; } printf("<%02d> sum=%d\n",my_rank,sum); } else MPI_Send (&a, 1, MPI_INT, 0, 500, MPI_COMM_WORLD); MPI_Finalize (); return 0; } 50

  27. Basic MPI Programming Rules for point-to-point communication Message order preservation – If Process A sends two messages to Process B, which posts two matching receive calls, then the two messages are guaranteed to be received in the order they were sent. Progress – It is not possible for a matching send and receive pair to remain permanently outstanding. That is, if one process posts a send and a second process posts a matching receive, then either the send or the receive will eventually complete. 51

  28. Basic MPI Programming Probing in MPI It is possible in MPI to read only the envelope of a message before choosing whether or not to read the actual message. int MPI_Probe(int source, int tag, MPI_Comm comm, MPI_Status *status) The function blocks until a message with the given source and/or tag is available. The result of probing is returned in an MPI_Status data structure. 52

  29. Basic MPI Programming Sum of random numbers: Alternative 1 Use of MPI Probe #include <stdio.h> #include <mpi.h> int main (int nargs, char** args) { int size, my_rank, i, a, sum; MPI_Init (&nargs, &args); MPI_Comm_size (MPI_COMM_WORLD, &size); MPI_Comm_rank (MPI_COMM_WORLD, &my_rank); srand (7654321*(my_rank+1)); a = rand()%100; printf("<%02d> a=%d\n",my_rank,a); fflush (stdout); if (my_rank==0) { MPI_Status status; sum = a; for (i=1; i<size; i++) { MPI_Probe (MPI_ANY_SOURCE,500,MPI_COMM_WORLD,&status); MPI_Recv (&a, 1, MPI_INT, status.MPI_SOURCE,500,MPI_COMM_WORLD,&status); 53

  30. Basic MPI Programming sum += a; } printf("<%02d> sum=%d\n",my_rank,sum); } else MPI_Send (&a, 1, MPI_INT, 0, 500, MPI_COMM_WORLD); MPI_Finalize (); return 0; } 54

  31. Basic MPI Programming Collective communication Communication carried out by all processes in a communicator. [Slide figure, data versus processes:] one-to-all broadcast (MPI_BCAST): A0 at the root becomes A0 on every process; all-to-one gather (MPI_GATHER): A0, A1, A2, A3, one element from each process, are collected at the root; one-to-all scatter (MPI_SCATTER): A0, A1, A2, A3 at the root are distributed, one element to each process. 55
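To make the scatter and gather patterns concrete, here is a small hedged sketch (not from the lecture notes): the root scatters one integer to every process, each process doubles its value, and the root gathers the results back; the array names are our own.

#include <stdio.h>
#include <malloc.h>
#include <mpi.h>

/* Sketch: MPI_Scatter distributes one int per process from the root,
   MPI_Gather collects one int per process back at the root.          */
int main (int nargs, char** args)
{
  int size, my_rank, i, a, *sendbuf = NULL, *recvbuf = NULL;
  MPI_Init (&nargs, &args);
  MPI_Comm_size (MPI_COMM_WORLD, &size);
  MPI_Comm_rank (MPI_COMM_WORLD, &my_rank);
  if (my_rank==0) {
    sendbuf = (int*) malloc (size*sizeof(int));
    recvbuf = (int*) malloc (size*sizeof(int));
    for (i=0; i<size; i++) sendbuf[i] = i+1;   /* data to distribute */
  }
  MPI_Scatter (sendbuf, 1, MPI_INT, &a, 1, MPI_INT, 0, MPI_COMM_WORLD);
  a *= 2;                                      /* local work */
  MPI_Gather (&a, 1, MPI_INT, recvbuf, 1, MPI_INT, 0, MPI_COMM_WORLD);
  if (my_rank==0) {
    for (i=0; i<size; i++) printf("recvbuf[%d]=%d\n", i, recvbuf[i]);
    free (sendbuf); free (recvbuf);
  }
  MPI_Finalize ();
  return 0;
}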

  32. Basic MPI Programming Collective communication (Cont’d)
Processes:                            0      1      2      3
Initial data:                         2 4    5 7    0 3    6 2
MPI_REDUCE with MPI_MIN, root = 0:    0 2    - -    - -    - -
MPI_ALLREDUCE with MPI_MIN:           0 2    0 2    0 2    0 2
MPI_REDUCE with MPI_SUM, root = 1:    - -    13 16  - -    - - 56

  33. Basic MPI Programming Sum of random numbers: Alternative 2 Use of MPI Gather #include <stdio.h> #include <mpi.h> #include <malloc.h> int main (int nargs, char** args) { int size, my_rank, i, a, sum, *array; MPI_Init (&nargs, &args); MPI_Comm_size (MPI_COMM_WORLD, &size); MPI_Comm_rank (MPI_COMM_WORLD, &my_rank); srand (7654321*(my_rank+1)); a = rand()%100; printf("<%02d> a=%d\n",my_rank,a); fflush (stdout); if (my_rank==0) array = (int*) malloc(size*sizeof(int)); MPI_Gather (&a, 1, MPI_INT, array, 1, MPI_INT, 0, MPI_COMM_WORLD); if (my_rank==0) { 57

  34. Basic MPI Programming sum = a; for (i=1; i<size; i++) sum += array[i]; printf("<%02d> sum=%d\n",my_rank,sum); free (array); } MPI_Finalize (); return 0; } 58

  35. Basic MPI Programming Sum of random numbers: Alternative 3 Use of MPI Reduce ( MPI Allreduce ) #include <stdio.h> #include <mpi.h> int main (int nargs, char** args) { int my_rank, a, sum; MPI_Init (&nargs, &args); MPI_Comm_rank (MPI_COMM_WORLD, &my_rank); srand (7654321*(my_rank+1)); a = rand()%100; printf("<%02d> a=%d\n",my_rank,a); fflush (stdout); MPI_Reduce (&a,&sum,1,MPI_INT,MPI_SUM,0,MPI_COMM_WORLD); if (my_rank==0) printf("<%02d> sum=%d\n",my_rank,sum); MPI_Finalize (); return 0; } 59
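The slide title also mentions MPI_Allreduce; a hedged sketch of that variant follows, in which every process (not just rank 0) receives and prints the sum, since the call takes no root argument.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

/* Sketch: same computation as Alternative 3, but with MPI_Allreduce
   so that the sum is returned on all processes.                      */
int main (int nargs, char** args)
{
  int my_rank, a, sum;
  MPI_Init (&nargs, &args);
  MPI_Comm_rank (MPI_COMM_WORLD, &my_rank);
  srand (7654321*(my_rank+1));
  a = rand()%100;
  printf("<%02d> a=%d\n", my_rank, a);
  fflush (stdout);
  /* no root argument: every process receives the reduced value */
  MPI_Allreduce (&a, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
  printf("<%02d> sum=%d\n", my_rank, sum);
  MPI_Finalize ();
  return 0;
}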

  36. Basic MPI Programming Example: inner-product Write an MPI program that calculates the inner product between two vectors $\vec{u} = (u_1, u_2, \ldots, u_M)$ and $\vec{v} = (v_1, v_2, \ldots, v_M)$, where
$$x_k = \frac{k}{M}, \quad u_k = x_k, \quad v_k = x_k^2, \quad 1 \le k \le M .$$
Partition $\vec{u}$ and $\vec{v}$ into P segments (load balancing) with sub-lengths $m_i$:
$$m_i = \begin{cases} \lfloor M/P \rfloor + 1 & 0 \le i < \mathrm{mod}(M,P) , \\ \lfloor M/P \rfloor & \mathrm{mod}(M,P) \le i < P . \end{cases}$$
First find the local result. Then compute the global result. 60

  37. Basic MPI Programming #include <stdio.h> #include <mpi.h> #include <malloc.h> int main (int nargs, char** args) { int size, my_rank, i, m = 1000, m_i, lower_bound, res; double l_sum, g_sum, time, x, dx, *u, *v; MPI_Init (&nargs, &args); MPI_Comm_size (MPI_COMM_WORLD, &size); MPI_Comm_rank (MPI_COMM_WORLD, &my_rank); if (my_rank==0 && nargs>1) m = atoi(args[1]); MPI_Bcast (&m, 1, MPI_INT, 0, MPI_COMM_WORLD); dx = 1.0/m; time = MPI_Wtime(); /* data partition & load balancing */ m_i = m/size; res = m%size; lower_bound = my_rank*m_i; 61

  38. Basic MPI Programming lower_bound += (my_rank<=res) ? my_rank : res; if (my_rank+1<=res) ++m_i; /* allocation of data storage */ u = (double*) malloc (m_i*sizeof(double)); v = (double*) malloc (m_i*sizeof(double)); /* fill out the u and v vectors */ x = lower_bound*dx; for (i=0; i<m_i; i++) { x += dx; u[i] = x; v[i] = x*x; } /* calculate the local result */ l_sum = 0.; for (i=0; i<m_i; i++) l_sum += u[i]*v[i]; MPI_Allreduce (&l_sum,&g_sum,1,MPI_DOUBLE,MPI_SUM,MPI_COMM_WORLD); time = MPI_Wtime()-time; 62

  39. Basic MPI Programming /* output the global result */ printf("<%d>m=%d, lower_bound=%d, m_i=%d, u*v=%g, time=%g\n", my_rank, m, lower_bound, m_i, g_sum*dx, time); free (u); free (v); MPI_Finalize (); return 0; } 63

  40. Basic MPI Programming Parallelizing explicit finite element schemes Consider the following 1D wave equation:
$$\frac{\partial^2 u}{\partial t^2} = \gamma^2 \frac{\partial^2 u}{\partial x^2}, \quad x \in (0,1), \quad t > 0 ,$$
$$u(0,t) = U_L, \quad u(1,t) = U_R, \quad u(x,0) = f(x), \quad \frac{\partial}{\partial t} u(x,0) = 0 .$$ 64

  41. Basic MPI Programming Discretization Let $u_i^k$ denote the numerical approximation to $u(x,t)$ at the spatial grid point $x_i$ and the temporal grid point $t_k$, where $\Delta x = \frac{1}{n+1}$, $x_i = i\Delta x$ and $t_k = k\Delta t$. Defining $C = \gamma \Delta t / \Delta x$,
$$u_i^0 = f(x_i), \quad i = 0, \ldots, n+1 ,$$
$$u_i^{k+1} = 2 u_i^k - u_i^{k-1} + C^2 ( u_{i+1}^k - 2 u_i^k + u_{i-1}^k ), \quad i = 1, \ldots, n, \quad k \ge 0 ,$$
$$u_0^{k+1} = U_L, \quad k \ge 0 ,$$
$$u_{n+1}^{k+1} = U_R, \quad k \ge 0 ,$$
$$u_i^{-1} = u_i^0 + \tfrac{1}{2} C^2 ( u_{i+1}^0 - 2 u_i^0 + u_{i-1}^0 ), \quad i = 1, \ldots, n .$$ 65

  42. Basic MPI Programming Domain partition Each subdomain has a number of computational points, plus 2 “ghost points” that are used to contain values from neighboring subdomains. It is only on those computational points that $u_i^{k+1}$ is updated, based on $u_{i-1}^k$, $u_i^k$, $u_{i+1}^k$, and $u_i^{k-1}$. After local computation at each time level, values on the leftmost and rightmost computational points are sent to the left and right neighbors, respectively. Values from neighbors are received into the left and right ghost points. 66

  43. Basic MPI Programming #include <stdio.h> #include <malloc.h> #include <mpi.h> int main (int nargs, char** args) { int size, my_rank, i, n = 999, n_i; double h, x, t, dt, gamma = 1.0, C = 0.9, tstop = 1.0; double *up, *u, *um, *tmp, umax = 0.05, UL = 0., UR = 0., time; MPI_Status status; MPI_Init (&nargs, &args); MPI_Comm_size (MPI_COMM_WORLD, &size); MPI_Comm_rank (MPI_COMM_WORLD, &my_rank); if (my_rank==0 && nargs>1) /* total number of points in (0,1) */ n = atoi(args[1]); if (my_rank==0 && nargs>2) /* read in Courant number */ C = atof(args[2]); if (my_rank==0 && nargs>3) /* length of simulation */ tstop = atof(args[3]); MPI_Bcast (&n, 1, MPI_INT, 0, MPI_COMM_WORLD); 67

  44. Basic MPI Programming MPI_Bcast (&C, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD); MPI_Bcast (&tstop, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD); h = 1.0/(n+1); /* distance between two points */ time = MPI_Wtime(); /* find number of subdomain computational points, assume mod(n,size)=0 */ n_i = n/size; up = (double*) malloc ((n_i+2)*sizeof(double)); u = (double*) malloc ((n_i+2)*sizeof(double)); um = (double*) malloc ((n_i+2)*sizeof(double)); /* set initial conditions */ x = my_rank*n_i*h; for (i=0; i<=n_i+1; i++) { u[i] = (x<0.7) ? umax/0.7*x : umax/0.3*(1.0-x); x += h; } if (my_rank==0) u[0] = UL; if (my_rank==size-1) u[n_i+1] = UR; for (i=1; i<=n_i; i++) um[i] = u[i]+0.5*C*C*(u[i+1]-2*u[i]+u[i-1]); /* artificial condition */ 68

  45. Basic MPI Programming dt = C/gamma*h; t = 0.; while (t < tstop) { /* time stepping loop */ t += dt; for (i=1; i<=n_i; i++) up[i] = 2*u[i]-um[i]+C*C*(u[i+1]-2*u[i]+u[i-1]); if (my_rank>0) { /* receive from left neighbor */ MPI_Recv (&(up[0]),1,MPI_DOUBLE,my_rank-1,501,MPI_COMM_WORLD,&status); /* send left neighbor */ MPI_Send (&(up[1]),1,MPI_DOUBLE,my_rank-1,502,MPI_COMM_WORLD); } else up[0] = UL; if (my_rank<size-1) { /* send to right neighbor */ MPI_Send (&(up[n_i]),1,MPI_DOUBLE,my_rank+1,501,MPI_COMM_WORLD); /* receive from right neighbor */ MPI_Recv (&(up[n_i+1]),1,MPI_DOUBLE,my_rank+1,502,MPI_COMM_WORLD,&status); } else up[n_i+1] = UR; 69

  46. Basic MPI Programming /* prepare for next time step */ tmp = um; um = u; u = up; up = tmp; } time = MPI_Wtime()-time; printf("<%d> time=%g\n", my_rank, time); free (um); free (u); free (up); MPI_Finalize (); return 0; } 70

  47. Basic MPI Programming Two-point boundary value problem
$$-u''(x) = f(x), \quad 0 < x < 1, \quad u(0) = u(1) = 0 .$$
Uniform 1D grid: $x_0, x_1, \ldots, x_{n+1}$, $\Delta x = \frac{1}{n+1}$. Finite difference discretization:
$$\frac{-u_{i-1} + 2 u_i - u_{i+1}}{\Delta x^2} = f_i, \quad 1 \le i \le n ,$$
$$\frac{1}{\Delta x^2}
\begin{pmatrix}
 2 & -1 &        &    \\
-1 &  2 & \ddots &    \\
   & \ddots & \ddots & -1 \\
   &        & -1     &  2
\end{pmatrix}
\begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{pmatrix}
=
\begin{pmatrix} f_1 \\ f_2 \\ \vdots \\ f_n \end{pmatrix},
\qquad A u = f .$$ 71

  48. Basic MPI Programming A simple iterative method for $Au = f$ Start with an initial guess $u^0$. Jacobi iteration:
$$u_i^k = \left( \Delta x^2 f_i + u_{i-1}^{k-1} + u_{i+1}^{k-1} \right) / 2 .$$
Calculate the residual $r = f - A u^k$. Repeat Jacobi iterations until the norm of the residual is small enough. 72

  49. Basic MPI Programming Partition of work load Divide the interior points $x_1, x_2, \ldots, x_n$ among the P processors, so that processor i holds $x_{i,1}, x_{i,2}, \ldots, x_{i,n_i}$. We need two “ghost” boundary nodes $x_{i,0}$ and $x_{i,n_i+1}$. Those two ghost boundary nodes need to be updated after each Jacobi iteration by receiving data from the neighbors. 73

  50. Basic MPI Programming The main program #include <stdio.h> #include <mpi.h> #include <math.h> extern void jacobi_iteration (int my_rank, int size, int n_i, double* rhs, double* x_k1, double* x_k); extern void calc_residual (int my_rank, int size, int n_i, double* rhs, double* x, double* res); extern double norm (int n_i, double* x); int main (int nargs, char** args) { int size, my_rank, i, j, n = 99999, n_i, lower_bound; double time, base, r_norm, x, dx, *rhs, *res, *x_k, *x_k1; MPI_Init (&nargs, &args); MPI_Comm_size (MPI_COMM_WORLD, &size); MPI_Comm_rank (MPI_COMM_WORLD, &my_rank); if (my_rank==0 && nargs>1) n = atoi(args[1]); MPI_Bcast (&n, 1, MPI_INT, 0, MPI_COMM_WORLD); dx = 1.0/(n+1); 74

  51. Basic MPI Programming time = MPI_Wtime(); /* data partition & load balancing */ n_i = n/size; i = n%size; lower_bound = my_rank*n_i; lower_bound += (my_rank<=i) ? my_rank : i; if (my_rank+1<=i) ++n_i; /* allocation of data storage, plus 2 (ghost) boundary points */ x_k = (double*) malloc ((n_i+2)*sizeof(double)); x_k1= (double*) malloc ((n_i+2)*sizeof(double)); rhs = (double*) malloc ((n_i+2)*sizeof(double)); res = (double*) malloc ((n_i+2)*sizeof(double)); /* fill out the rhs and x_k vectors */ x = lower_bound*dx; for (i=1; i<=n_i; i++) { x += dx; rhs[i] = dx*dx*M_PI*M_PI*sin(M_PI*x); x_k[i] = x_k1[i] = 0.; } x_k[0]=x_k[n_i+1]=x_k1[0]=x_k1[n_i+1]=0.; 75

  52. Basic MPI Programming base = norm(n_i,rhs); r_norm = 1.0; i = 0; while (i<1000 && (r_norm/base)>1.0e-4) { ++i; jacobi_iteration(my_rank,size,n_i,rhs,x_k1,x_k); calc_residual(my_rank,size,n_i,rhs,x_k,res); r_norm = norm(n_i,res); for (j=0; j<n_i+2; j++) x_k1[j] = x_k[j]; } time = MPI_Wtime()-time; printf("<%d> n+1=%d, iters=%d, time=%g\n",my_rank,n+1,i,time); free (x_k); free (x_k1); free (rhs); free (res); MPI_Finalize (); return 0; } 76

  53. Basic MPI Programming The routine for Jacobi iteration #include <mpi.h> void jacobi_iteration (int my_rank, int size, int n_i, double* rhs, double* x_k1, double* x_k) { MPI_Status status; for (int i=1; i<=n_i; i++) x_k[i] = 0.5*(rhs[i]+x_k1[i-1]+x_k1[i+1]); if (my_rank>0) { /* receive from left neighbor */ MPI_Recv (&(x_k[0]),1,MPI_DOUBLE,my_rank-1,501,MPI_COMM_WORLD,&status); /* send left neighbor */ MPI_Send (&(x_k[1]),1,MPI_DOUBLE,my_rank-1,502,MPI_COMM_WORLD); } if (my_rank<size-1) { /* send to right neighbor */ MPI_Send (&(x_k[n_i]),1,MPI_DOUBLE,my_rank+1,501,MPI_COMM_WORLD); /* receive from right neighbor */ MPI_Recv (&(x_k[n_i+1]),1,MPI_DOUBLE,my_rank+1,502,MPI_COMM_WORLD,&status); } } 77

  54. Basic MPI Programming The routine for calculating the residual #include <mpi.h> void calc_residual (int my_rank, int size, int n_i, double* rhs, double* x, double* res) { /* no communication is necessary */ int i; for (i=1; i<=n_i; i++) res[i] = rhs[i]-2.0*x[i]+x[i-1]+x[i+1]; } Note that no communication is necessary, because the latest solution vector is already correctly duplicated between neighboring processes after jacobi_iteration. 78

  55. Basic MPI Programming The routine for calculating the norm of a vector #include <mpi.h> #include <math.h> double norm (int n_i, double* x) { double l_sum=0., g_sum; for (int i=1; i<=n_i; i++) l_sum += x[i]*x[i]; MPI_Allreduce (&l_sum,&g_sum,1,MPI_DOUBLE,MPI_SUM,MPI_COMM_WORLD); return sqrt(g_sum); } 79

  56. Basic MPI Programming Source of deadlocks Send a large message from one process to another - if there is insufficient storage at the destination, the send must wait for the user to provide the memory space (through a receive).
Process 0:  Send(1); Recv(1);
Process 1:  Send(0); Recv(0);
Unsafe because it depends on the availability of system buffers. 80

  57. Basic MPI Programming Solutions to deadlocks - Order the operations more carefully - Use MPI_Sendrecv - Use MPI_Bsend - Use non-blocking operations Performance may be improved on many systems by overlapping communication and computation. This is especially true on systems where communication can be executed autonomously by an intelligent communication controller. The use of non-blocking and completion routines allows computation and communication to be overlapped. (Not guaranteed, though.) 81
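As a hedged sketch of the first two remedies (not from the lecture notes): ranks 0 and 1 exchange N doubles, first by ordering the calls so that one side receives first, then with MPI_Sendrecv, which is safe regardless of system buffering; the message length and tags are arbitrary.

#include <stdio.h>
#include <mpi.h>

#define N 1000                          /* arbitrary message length */

/* Sketch: two deadlock-free ways for ranks 0 and 1 to exchange data. */
int main (int nargs, char** args)
{
  int my_rank, i, other;
  double sendbuf[N], recvbuf[N];
  MPI_Status status;
  MPI_Init (&nargs, &args);
  MPI_Comm_rank (MPI_COMM_WORLD, &my_rank);
  other = 1 - my_rank;                  /* partner rank (ranks 0 and 1 only) */
  for (i=0; i<N; i++) sendbuf[i] = my_rank;
  if (my_rank < 2) {
    /* (a) careful ordering: rank 0 sends first, rank 1 receives first */
    if (my_rank == 0) {
      MPI_Send (sendbuf, N, MPI_DOUBLE, other, 10, MPI_COMM_WORLD);
      MPI_Recv (recvbuf, N, MPI_DOUBLE, other, 10, MPI_COMM_WORLD, &status);
    } else {
      MPI_Recv (recvbuf, N, MPI_DOUBLE, other, 10, MPI_COMM_WORLD, &status);
      MPI_Send (sendbuf, N, MPI_DOUBLE, other, 10, MPI_COMM_WORLD);
    }
    /* (b) combined send and receive in a single call */
    MPI_Sendrecv (sendbuf, N, MPI_DOUBLE, other, 20,
                  recvbuf, N, MPI_DOUBLE, other, 20,
                  MPI_COMM_WORLD, &status);
  }
  MPI_Finalize ();
  return 0;
}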

  58. Basic MPI Programming Non-blocking send and receive Non-blocking send : returns “immediately”; message buffer should not be written to after return; must check for local completion. int MPI_Isend(void* buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request) Non-blocking receive : returns “immediately”; message buffer should not be read from after return; must check for local completion. int MPI_Irecv(void* buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request) The use of non-blocking receives may also avoid system buffering and memory-to-memory copying, as information is provided early on the location of the receive buffer. 82

  59. Basic MPI Programming MPI Request A request object identifies various properties of a communication operation. In addition, this object stores information about the status of the pending communication operation. 83

  60. Basic MPI Programming Local completion Two basic ways of checking on non-blocking sends and receives: - MPI_Wait blocks until the communication is complete. MPI_Wait(MPI_Request *request, MPI_Status *status) - MPI_Test returns “immediately”, and sets flag to true if the communication is complete. MPI_Test(MPI_Request *request, int *flag, MPI_Status *status) 84
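A hedged sketch of the MPI_Test alternative (not in the lecture notes): rank 0 polls a non-blocking receive while it could be doing other work; the tag and the work placeholder are our own, and at least two processes are assumed.

#include <stdio.h>
#include <mpi.h>

/* Sketch: rank 0 posts a non-blocking receive and polls it with
   MPI_Test, doing other local work until the message has arrived. */
int main (int nargs, char** args)
{
  int my_rank, data = 0, flag = 0;
  MPI_Request request;
  MPI_Status status;
  MPI_Init (&nargs, &args);
  MPI_Comm_rank (MPI_COMM_WORLD, &my_rank);
  if (my_rank == 0) {
    MPI_Irecv (&data, 1, MPI_INT, 1, 99, MPI_COMM_WORLD, &request);
    while (!flag) {
      /* ... do useful local work here ... */
      MPI_Test (&request, &flag, &status);
    }
    printf("Received %d from rank %d.\n", data, status.MPI_SOURCE);
  }
  else if (my_rank == 1) {
    data = 42;
    MPI_Send (&data, 1, MPI_INT, 0, 99, MPI_COMM_WORLD);
  }
  MPI_Finalize ();
  return 0;
}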

  61. Basic MPI Programming Using non-blocking send & receive #include <mpi.h> extern MPI_Request* req; extern MPI_Status* status; void jacobi_iteration_nb (int my_rank, int size, int n_i, double* rhs, double* x_k1, double* x_k) { int i; /* calculate values on the end-points */ x_k[1] = 0.5*(rhs[1]+x_k1[0]+x_k1[2]); x_k[n_i] = 0.5*(rhs[n_i]+x_k1[n_i-1]+x_k1[n_i+1]); if (my_rank>0) { /* receive from left neighbor */ MPI_Irecv (&(x_k[0]),1,MPI_DOUBLE, my_rank-1,501,MPI_COMM_WORLD,&req[0]); /* send left neighbor */ MPI_Isend (&(x_k[1]),1,MPI_DOUBLE, my_rank-1,502,MPI_COMM_WORLD,&req[1]); } if (my_rank<size-1) { 85

  62. Basic MPI Programming /* send to right neighbor */ MPI_Isend (&(x_k[n_i]),1,MPI_DOUBLE, my_rank+1,501,MPI_COMM_WORLD,&req[2]); /* receive from right neighbor */ MPI_Irecv (&(x_k[n_i+1]),1,MPI_DOUBLE, my_rank+1,502,MPI_COMM_WORLD,&req[3]); } /* calculate values on the inner-points */ for (i=2; i<n_i; i++) x_k[i] = 0.5*(rhs[i]+x_k1[i-1]+x_k1[i+1]); if (my_rank>0) { MPI_Wait (&req[0],&status[0]); MPI_Wait (&req[1],&status[1]); } if (my_rank<size-1) { MPI_Wait (&req[2],&status[2]); MPI_Wait (&req[3],&status[3]); } } 86

  63. Basic MPI Programming Exercises for Day 2 Exercise One : Introduce non-blocking send/receive MPI functions into the “1D wave” program. Do you get any performance improvement? Also improve the program so that load balance is maintained for any choice of n. Can you think of other improvements? Exercise Two : Data storage preparation for doing unstructured finite element computation in parallel. [Slide figure: a small unstructured finite element grid; the labels are not legible in this transcript.] 87

  64. Basic MPI Programming When a non-overlapping partition of a finite element grid is carried out, i.e., each element belongs to only one subdomain, grid points lying on the internal boundaries will be shared between neighboring subdomains. Assume grid points in every subdomain remember their original global number, contained in files: global-ids.00 , global-ids.01 , and so on. Write an MPI program that can a) find out for each subdomain who the neighboring subdomains are, b) find out how many points are shared between each pair of neighboring subdomains, and c) calculate the Euclidean norm of a global vector that is distributed as sub-vectors contained in files: sub-vec.00 , sub-vec.01 and so on. 88

  65. Basic MPI Programming The data files are contained in mpi-kurs.tar , which can be found under directory /site/vitsim/ on the machine pico.uio.no . The Euclidean norm should be 17.3594 if you have programmed everything correctly. 89

  66. Advanced MPI Programming More about point-to-point communication When a standard mode blocking send call returns, the message data and envelope have been “safely stored away”. The message might be copied directly into the matching receive buffer, or it might be copied into a temporary system buffer. MPI decides whether outgoing messages will be buffered. If MPI buffers outgoing messages, the send call may complete before a matching receive is invoked. On the other hand, buffer space may be unavailable, or MPI may choose not to buffer outgoing messages, for performance reasons. Then the send call will not complete until a matching receive has been posted, and the data has been moved to the receiver. 90

  67. Advanced MPI Programming Four communication modes for sending messages
standard mode - a send may be initiated even if a matching receive has not been initiated.
buffered mode - similar to standard mode, but completion is always independent of a matching receive, and the message may be buffered to ensure this.
synchronous mode - a send may be initiated whether or not a matching receive has been initiated, but it completes only after the matching receive has started.
ready mode - a send may be initiated only if a matching receive has already been initiated.
All 4 modes have blocking and non-blocking versions for send. 91

  68. Advanced MPI Programming The buffered mode ( MPI Bsend ) A buffered mode send operation can be started whether or not a matching receive has been posted. It may complete before a matching receive is posted. If a send is executed and no matching receive is posted, then MPI must buffer the outgoing message. An error will occur if there is insufficient buffer space. The amount of available buffer space is controlled by the user. Buffer allocation by the user may be required for the buffered mode to be effective. 92

  69. Advanced MPI Programming Buffer allocation for buffered send A user specifies a buffer to be used for buffering messages sent in buffered mode. To provide a buffer in the user’s memory to be used for buffering outgoing messages: int MPI_Buffer_attach( void* buffer, int size) To detach the buffer currently associated with MPI: int MPI_Buffer_detach( void* buffer_addr, int* size) 93
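A hedged sketch (not from the lecture notes) of the attach/send/detach life cycle; the buffer size below is only an assumption of what a sufficient size for one small message might look like, and at least two processes are assumed.

#include <stdio.h>
#include <malloc.h>
#include <mpi.h>

/* Sketch: attach a user-provided buffer, send one int with MPI_Bsend,
   then detach the buffer again before finalizing.                     */
int main (int nargs, char** args)
{
  int my_rank, a = 17, bufsize;
  char *buffer, *detached;
  MPI_Status status;
  MPI_Init (&nargs, &args);
  MPI_Comm_rank (MPI_COMM_WORLD, &my_rank);
  /* room for one int plus the per-message overhead MPI requires */
  bufsize = sizeof(int) + MPI_BSEND_OVERHEAD;
  buffer = (char*) malloc (bufsize);
  MPI_Buffer_attach (buffer, bufsize);
  if (my_rank == 0)
    MPI_Bsend (&a, 1, MPI_INT, 1, 30, MPI_COMM_WORLD);
  else if (my_rank == 1) {
    MPI_Recv (&a, 1, MPI_INT, 0, 30, MPI_COMM_WORLD, &status);
    printf("Rank 1 received %d.\n", a);
  }
  /* detach blocks until all buffered messages have been transmitted */
  MPI_Buffer_detach (&detached, &bufsize);
  free (detached);
  MPI_Finalize ();
  return 0;
}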

  70. Advanced MPI Programming The synchronous mode ( MPI Ssend ) A synchronous send can be started whether or not a matching receive was posted. However, the send will complete successfully only if a matching receive is posted, and the receive operation has started to receive the message. The completion of a synchronous send not only indicates that the send buffer can be reused, but also indicates that the receiver has reached a certain point in its execution, namely that it has started executing the matching receive. 94
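Because MPI_Ssend takes the same argument list as MPI_Send, a common (hedged) use of the synchronous mode is as a portability check: a program that still runs after temporarily substituting MPI_Ssend for MPI_Send does not depend on system buffering. A minimal sketch, assuming two processes:

#include <stdio.h>
#include <mpi.h>

/* Sketch: same argument list as MPI_Send, but the call only completes
   once the matching receive has started on the other side.            */
int main (int nargs, char** args)
{
  int my_rank, a;
  MPI_Status status;
  MPI_Init (&nargs, &args);
  MPI_Comm_rank (MPI_COMM_WORLD, &my_rank);
  if (my_rank == 0) {
    a = 7;
    MPI_Ssend (&a, 1, MPI_INT, 1, 40, MPI_COMM_WORLD);
  } else if (my_rank == 1) {
    MPI_Recv (&a, 1, MPI_INT, 0, 40, MPI_COMM_WORLD, &status);
    printf("Rank 1 got %d via a synchronous send.\n", a);
  }
  MPI_Finalize ();
  return 0;
}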

  71. Advanced MPI Programming The ready mode ( MPI_Rsend ) A send using this mode may be started only if the matching receive is already posted. Otherwise, the operation is erroneous and its outcome is undefined. The completion of the send operation does not depend on the status of a matching receive, and merely indicates that the send buffer can be reused. A send using the ready mode provides additional information to the system (namely that a matching receive is already posted), which can save some overhead. In a correct program, a ready send could be replaced by a standard send with no effect on the behavior of the program other than performance. 95

  72. Advanced MPI Programming Persistent communication requests Often a communication with the same argument list is repeatedly executed. MPI can bind the list of communication arguments to a persistent communication request once and then repeatedly use the request to initiate and complete messages. This reduces the overhead of communication between the process and the communication controller. A persistent communication request can e.g. be created as follows (no communication yet): int MPI_Send_init(void* buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request) 96

  73. Advanced MPI Programming Persistent communication requests (cont’d) A communication that uses a persistent request is initiated by int MPI_Start(MPI_Request *request) A communication started with a call to MPI_Start can be completed by a call to MPI_Wait or MPI_Test . The request becomes inactive after successful completion. The request is not deallocated and it can be activated anew by an MPI_Start call. A persistent request is deallocated by a call to MPI_Request_free . 97
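A hedged sketch (not from the lecture notes) of the full life cycle: create the request once with MPI_Send_init or MPI_Recv_init, reuse it with MPI_Start and MPI_Wait, and finally free it; the loop count and tag are arbitrary, and at least two processes are assumed.

#include <stdio.h>
#include <mpi.h>

/* Sketch: set up persistent send/receive requests once, then reuse
   them in every iteration with MPI_Start + MPI_Wait, and free them. */
int main (int nargs, char** args)
{
  int my_rank, i, a = 0, b = 0;
  MPI_Request req;
  MPI_Status status;
  MPI_Init (&nargs, &args);
  MPI_Comm_rank (MPI_COMM_WORLD, &my_rank);
  if (my_rank == 0)
    MPI_Send_init (&a, 1, MPI_INT, 1, 60, MPI_COMM_WORLD, &req);
  else if (my_rank == 1)
    MPI_Recv_init (&b, 1, MPI_INT, 0, 60, MPI_COMM_WORLD, &req);
  if (my_rank < 2) {
    for (i = 0; i < 10; i++) {        /* arbitrary number of repetitions */
      if (my_rank == 0) a = i;        /* refill the send buffer */
      MPI_Start (&req);               /* initiate the communication */
      MPI_Wait (&req, &status);       /* complete it; the request stays alive */
      if (my_rank == 1) printf("Iteration %d: received %d\n", i, b);
    }
    MPI_Request_free (&req);          /* deallocate the persistent request */
  }
  MPI_Finalize ();
  return 0;
}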

  74. Advanced MPI Programming Send-receive operations MPI send-receive operations combine in one call the sending of a message to one destination and the receiving of another message from another process. A send-receive operation is very useful for executing a shift operation across a chain of processes. int MPI_Sendrecv(void *sendbuf, int sendcount, MPI_Datatype sendtype, int dest, int sendtag, void *recvbuf, int recvcount, MPI_Datatype recvtype, int source, int recvtag, MPI_Comm comm, MPI_Status *status) 98
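A hedged sketch of the shift mentioned above (not from the lecture notes): each process sends its rank to the right neighbor and receives from the left neighbor in a single MPI_Sendrecv call, here wrapped around as a ring; the names and tag are our own.

#include <stdio.h>
#include <mpi.h>

/* Sketch: circular shift. Every process sends its rank to the right
   neighbor and receives the left neighbor's rank in one call.        */
int main (int nargs, char** args)
{
  int size, my_rank, left, right, recvd;
  MPI_Status status;
  MPI_Init (&nargs, &args);
  MPI_Comm_size (MPI_COMM_WORLD, &size);
  MPI_Comm_rank (MPI_COMM_WORLD, &my_rank);
  right = (my_rank + 1) % size;
  left  = (my_rank - 1 + size) % size;
  MPI_Sendrecv (&my_rank, 1, MPI_INT, right, 70,
                &recvd,   1, MPI_INT, left,  70,
                MPI_COMM_WORLD, &status);
  printf("Rank %d received %d from its left neighbor.\n", my_rank, recvd);
  MPI_Finalize ();
  return 0;
}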
