Lessons Learned from Porting the MiniAero Application to Charm++. David S. Hollman, Janine Bennett (PI), Jeremiah Wilke (Chief Architect), Ken Franko, Hemanth Kolla, Paul Lin, Greg Sjaardema, Nicole Slattengren, Keita Teranishi, Nikhil Jain, Eric Mikida. PowerPoint PPT Presentation.


slide-1
SLIDE 1

Lessons Learned from

Porting the MiniAero Application to Charm++

David S. Hollman, Janine Bennett (PI), Jeremiah Wilke (Chief Architect), Ken Franko, Hemanth Kolla, Paul Lin, Greg Sjaardema, Nicole Slattengren, Keita Teranishi, Nikhil Jain, Eric Mikida

May 7, 2015

Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000. SAND NO. 2015-3671 C

slide-2
SLIDE 2

Outline

Introduction
The Process: Porting an Explicit Aerodynamics Miniapp to the Chare Model
What was easy? What was harder?
Preliminary Results and Performance
Next Steps

May 7, 2015 2


slide-7
SLIDE 7

The DHARMA Project

DHARMA: Distributed asyncHronous Adaptive Resilient Models for Applications
Project mission: assess and address fundamental challenges imposed by the need for performant, portable, scalable, fault-tolerant programming models at extreme scale
Research in programmability, dynamic load balancing, and fault tolerance of AMT runtimes
Comparative analysis portion of the project: assess various asynchronous many-task (AMT) runtimes by implementing mini-apps of interest to Sandia using existing runtimes

First three runtimes to assess: Charm++, Legion, Uintah

First mini-app for assessment: MiniAero

May 7, 2015 4


slide-16
SLIDE 16

What is MiniAero?

MiniAero is a proxy app illustrating the common computation and communication patterns in unstructured mesh codes of interest to Sandia: a 3D, unstructured, finite volume computational fluid dynamics code.

Uses Runge-Kutta fourth-order time marching
Has options for first- and second-order spatial discretization
Includes inviscid Roe and viscous Newtonian flux options
Baseline application is about 3800 lines of C++ code using MPI and Kokkos

Very little task parallelism; mostly a data-parallel problem
Communication: ghost exchanges on an unstructured mesh

May 7, 2015 5


slide-21
SLIDE 21

Background and Status

Porting MiniAero to Charm++ began with a “bootcamp”:

March 9-12, 2015
Led by Nikhil Jain and Eric Mikida
About 10 Sandia scientists in attendance

Since the workshop, we've had one scientist working 50% time on the port and a couple of others working 10-20% time.

Current state of the code:

running and passes the test suite
most immediately apparent optimizations done
SMP version does not work (Kokkos incompatibility)

May 7, 2015 6


slide-31
SLIDE 31

Outline

Introduction
The Process: Porting an Explicit Aerodynamics Miniapp to the Chare Model
What was easy? What was harder?
Preliminary Results and Performance
Next Steps

May 7, 2015 7

slide-32
SLIDE 32

What was easy?

Load balancing (synchronous):

    // at the end of the timestep...
    if (doLoadBalance && timestepCounter % loadBalanceInterval == 0) {
      serial {
        // Do the actual load rebalancing
        this->AtSync();
      }
      // Called when load balancing is completed (required)
      when ResumeFromSync() { }
    }

Checkpointing:

To disk: CkStartCheckpoint(...)
In memory (to partner node): CkStartMemCheckpoint(...)
Must be done synchronously

Both of these were key features we wanted to test in AMT runtimes, and both were working essentially on the first day of coding. Both require only serialization on the user side.
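That user-side serialization goes through Charm++'s PUP framework: the same pup routine serves checkpointing and chare migration during load balancing. A minimal sketch for a chare holding MiniAero-like state (the class and member names here are hypothetical, not MiniAero's actual fields; this requires the Charm++ runtime to build):

```cpp
// Inside the chare class definition.
void pup(PUP::er &p) {
    CBase_MeshBlock::pup(p);   // always pup the generated superclass first
    p | timestepCounter;       // plain members pup with operator|
    p | nCells;
    if (p.isUnpacking()) solution.resize(nCells);  // allocate before reading
    PUParray(p, solution.data(), nCells);          // contiguous array data
}
```

The same routine is driven in packing, unpacking, and sizing modes by the runtime, which is why one function suffices for both resilience features above.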

May 7, 2015 8


slide-44
SLIDE 44

MPI ⇒ Charm++ is relatively straightforward

“Quick start” implementation: map one chare to one MPI process

Just like in MPI, data dependencies are expressed in terms of messages:

sends become message-like function calls on proxies of array members
receives become when clauses

MPI version:

    class OldStuffDoer { /* ... */
      void do_stuff() {
        generate_data(); /* ... */
        MPI_Irecv(other_data, n_recv, MPI_DOUBLE, partner, /*...*/);
        MPI_Send(data, n_send, MPI_DOUBLE, partner, /*...*/);
        use_other_data();
      }
    };

Charm++ version (.ci file):

    array [1D] NewStuffDoer {
      entry void receive_data(int src, int ndata, double data[ndata]);
      entry void do_stuff_1() {
        serial {
          generate_data(); /* ... */
          thisProxy[partner].receive_data(thisIndex, n_send, data);
        }
      };
      entry void do_stuff_2() {
        when receive_data(int src, int n_recv, double* other_data) serial {
          memcpy(other_data_, other_data, n_recv * sizeof(double));
          use_other_data();
        }
      };
    };

May 7, 2015 9


slide-51
SLIDE 51

MPI ⇒ Charm++ is relatjvely straightgorward

“Quick start” implementatjon: map one chare to one MPI process

Just like in MPI, data dependencies are expressed in terms of messages:

sends become message-like functjon calls on proxies of array members receives become when clauses class OldStuffDoer { /* ... */ void do_stuff() { generate data(); /* ... */ MPI_Irecv(data, n_send, MPI_DOUBLE, partner, /*...*/); MPI_Send(other_data, n_recv, MPI_DOUBLE, partner, /*...*/); use_other_data(); } }; array [1D] NewStuffDoer { entry void receive_data(int src, int ndata, double data[ndata]); entry void do_stuff_1() { generate data(); /* ... */ thisProxy[partner].receive_data( n_send, data ); }; entry void do_stuff_2() { when receive_data(int partner, int n_send, double* other_data) serial { memcpy(other_data_, data, n_send * sizeof(double)); use_other_data(); } }; };

“Gotchas”: statjc variables conditjonal communicatjon a lot of size and metadata communicatjon setup can be skipped

May 7, 2015 9

slide-55
SLIDE 55

MPI ⇒ Charm++ is relatively straightforward …at first!

Is this the best approach for our workloads, or does it lead to unnecessary synchronizations left over from the MPI version? Two approaches:

“Bottom up”: map sends and receives to function calls and whens. “Top down”: think about the task structure and dependencies of the code, and write this into the .ci file.

Clearly, the “top down” approach will lead to better, more efficient code in most cases, but… for production code, a complete “top down” overhaul is completely impractical. Is there a good middle ground?

The “bottom up”-ness vs. “top down”-ness of the approach should be assessed before writing too much code (in any porting project)

May 7, 2015 10

slide-63
SLIDE 63

What was harder: Kokkos integration and templated code

Kokkos is a performance portability layer aimed primarily at on-node parallelism

handles memory layout and loop structure to produce optimized kernels on multiple devices

The application developer implements generic code; the Kokkos library implements device-specific specializations. MiniAero was originally written in “MPI+Kokkos”. What happens when you need to write templated code that uses Kokkos?

Explicitly listing all specializations can get out of hand quickly. For instance…

template <typename Device>
struct ddot {
  // Views are lightweight handles; store them by value.
  const Kokkos::View<double*, Device> A, B;

  ddot(
    const Kokkos::View<double*, Device>& A_in,
    const Kokkos::View<double*, Device>& B_in
  ) : A(A_in), B(B_in)
  { }

  inline void operator()(int i, double& result) const {
    result += A(i) * B(i);
  }
};

void do_stuff() {
  /* ... */
  double result = 0.0;
  Kokkos::parallel_reduce(
    num_items,
    ddot<Kokkos::Cuda>(v1, v2),  // device must be named explicitly
    result
  );
}

May 7, 2015 11

slide-71
SLIDE 71

Template specialization explosion

The MiniAero solver has five different ghost exchanges. Each communicates a different Kokkos::View type, so we want an entry method prototype that looks something like this:

template <typename ViewType>
entry [local] void receive_ghost_data(ViewType& v);

The solver chare is already parameterized on the Kokkos device type:

/* solver.ci */
template <typename Device>
array [1D] RK4Solver {
  /* ... */
};

/* solver.h */
template <typename Device>
class RK4Solver
  : public CBase_RK4Solver<Device>
{
  Kokkos::View<double*[5], Device> m_data1;
  Kokkos::View<double*[5][3], Device> m_data2;
  Kokkos::View<int*, Device> m_data3;
  /* etc... */
};

The devices we'd like to test include Kokkos::Serial, Kokkos::Threads, Kokkos::Cuda, and Kokkos::OpenMP. That already leads to 20 different explicit signatures for receive_ghost_data().

May 7, 2015 12

slide-80
SLIDE 80

Template specialization: our workaround

Pattern: templated setup, non-templated entry method, templated cleanup

/* comm_stuff.h */
template <typename Device>
class CommStuffDoer : public CBase_CommStuffDoer<Device> {
  Kokkos::View<double*[3], Device> my_data_1_;
  Kokkos::View<int*[3][5], Device> my_data_2_;
  /* ... */
  std::vector<double*> recv_buffers_;

public:
  template <typename ViewT>
  void send_it(int dst, const ViewT& data) {
    size_t size = get_size(data, dst);
    double* buf = extract_data(data, dst);
    this->thisProxy[dst].recv_it(this->thisIndex, size, buf);
  }

  template <typename ViewT>
  void setup_recv(int src, ViewT& data) {
    recv_buffers_[src] = get_buffer(data, src);
  }

  template <typename ViewT>
  void finish_recv(int src, ViewT& data) {
    insert_data(data, recv_buffers_[src], src);
    delete[] recv_buffers_[src];
  }
};

/* comm_stuff.ci */
template <typename Device>
array [1D] CommStuffDoer {
  entry void recv_it(int src, int size, double data[size]);
  entry void do_recv_done(int src);
  entry [local] void do_recv(int src) {
    when recv_it[src](int s, int size, double data[size]) serial {
      memcpy(recv_buffers_[src], data, size * sizeof(double));
      do_recv_done(src);
    }
  };
  entry void do_stuff() {
    /* ... */
    serial {
      int src = /*...*/, dest = /*...*/;
      send_it(dest, my_data_1_);
      setup_recv(src, my_data_1_);
      do_recv(src);
    }
    when do_recv_done(int src) serial {
      finish_recv(src, my_data_1_);
    }
  };
};

May 7, 2015 13

slide-81
SLIDE 81

Template specializatjon: our workaround

Patuern: templated setup, non-templated entry method, templated cleanup

/* comm_stuff.h */ template <typename Device> class CommStuffDoer : public CBase_CommStuffDoer<Device> { Kokkos::View<Device,double*,3> my_data_1_; Kokkos::View<Device,int*,3,5> my_data_2_; /* ... */ std::vector<double*> recv_buffers_; template <typename ViewT> void send_it(int dst, const ViewT& data) { size_t size = get_size(data, dst); double* data = extract_data(data, dst); this->thisProxy[dst].recv_it( this->thisIndex, size, data); } template <typename ViewT> void setup_recv(int src, ViewT& data) { recv_buffers_[src] = get_buffer(data, src); } template <typename ViewT> void finish_recv(int src, ViewT& data) { insert_data(data, recv_buffers_[src], src); delete recv_buffers_[src]; } }; /* comm_stuff.ci */ template <typename Device> array [1D] CommStuffDoer { entry void recv_it(int src, int size, double data[size]); entry void do_recv_done(); entry [local] void do_recv(int src) { when recv_it[src](int s, int size, double data[size]) serial { memcpy(recv_buffers[src], data, size*sizeof(double)); do_recv_done(); } }; entry void do_stuff() { /* ... */ serial { int src = /*...*/, dest = /*...*/; send_it(dest, my_data_1_); setup_recv(src, my_data_1_); do_recv(src); } when do_recv_done() serial { finish_recv(src, my_data_1_); } }; };

May 7, 2015 13

slide-82
SLIDE 82

Template specializatjon: our workaround

Patuern: templated setup, non-templated entry method, templated cleanup

/* comm_stuff.h */ template <typename Device> class CommStuffDoer : public CBase_CommStuffDoer<Device> { Kokkos::View<Device,double*,3> my_data_1_; Kokkos::View<Device,int*,3,5> my_data_2_; /* ... */ std::vector<double*> recv_buffers_; template <typename ViewT> void send_it(int dst, const ViewT& data) { size_t size = get_size(data, dst); double* data = extract_data(data, dst); this->thisProxy[dst].recv_it( this->thisIndex, size, data); } template <typename ViewT> void setup_recv(int src, ViewT& data) { recv_buffers_[src] = get_buffer(data, src); } template <typename ViewT> void finish_recv(int src, ViewT& data) { insert_data(data, recv_buffers_[src], src); delete recv_buffers_[src]; } }; /* comm_stuff.ci */ template <typename Device> array [1D] CommStuffDoer { entry void recv_it(int src, int size, double data[size]); entry void do_recv_done(); entry [local] void do_recv(int src) { when recv_it[src](int s, int size, double data[size]) serial { memcpy(recv_buffers[src], data, size*sizeof(double)); do_recv_done(); } }; entry void do_stuff() { /* ... */ serial { int src = /*...*/, dest = /*...*/; send_it(dest, my_data_1_); setup_recv(src, my_data_1_); do_recv(src); } when do_recv_done() serial { finish_recv(src, my_data_1_); } }; };

May 7, 2015 13

slide-83
SLIDE 83

Template specializatjon: our workaround

Patuern: templated setup, non-templated entry method, templated cleanup

/* comm_stuff.h */ template <typename Device> class CommStuffDoer : public CBase_CommStuffDoer<Device> { Kokkos::View<Device,double*,3> my_data_1_; Kokkos::View<Device,int*,3,5> my_data_2_; /* ... */ std::vector<double*> recv_buffers_; template <typename ViewT> void send_it(int dst, const ViewT& data) { size_t size = get_size(data, dst); double* data = extract_data(data, dst); this->thisProxy[dst].recv_it( this->thisIndex, size, data); } template <typename ViewT> void setup_recv(int src, ViewT& data) { recv_buffers_[src] = get_buffer(data, src); } template <typename ViewT> void finish_recv(int src, ViewT& data) { insert_data(data, recv_buffers_[src], src); delete recv_buffers_[src]; } }; /* comm_stuff.ci */ template <typename Device> array [1D] CommStuffDoer { entry void recv_it(int src, int size, double data[size]); entry void do_recv_done(); entry [local] void do_recv(int src) { when recv_it[src](int s, int size, double data[size]) serial { memcpy(recv_buffers[src], data, size*sizeof(double)); do_recv_done(); } }; entry void do_stuff() { /* ... */ serial { int src = /*...*/, dest = /*...*/; send_it(dest, my_data_1_); setup_recv(src, my_data_1_); do_recv(src); } when do_recv_done() serial { finish_recv(src, my_data_1_); } }; };

May 7, 2015 13

slide-84
SLIDE 84

Template specializatjon: our workaround

Patuern: templated setup, non-templated entry method, templated cleanup

/* comm_stuff.h */ template <typename Device> class CommStuffDoer : public CBase_CommStuffDoer<Device> { Kokkos::View<Device,double*,3> my_data_1_; Kokkos::View<Device,int*,3,5> my_data_2_; /* ... */ std::vector<double*> recv_buffers_; template <typename ViewT> void send_it(int dst, const ViewT& data) { size_t size = get_size(data, dst); double* data = extract_data(data, dst); this->thisProxy[dst].recv_it( this->thisIndex, size, data); } template <typename ViewT> void setup_recv(int src, ViewT& data) { recv_buffers_[src] = get_buffer(data, src); } template <typename ViewT> void finish_recv(int src, ViewT& data) { insert_data(data, recv_buffers_[src], src); delete recv_buffers_[src]; } }; /* comm_stuff.ci */ template <typename Device> array [1D] CommStuffDoer { entry void recv_it(int src, int size, double data[size]); entry void do_recv_done(); entry [local] void do_recv(int src) { when recv_it[src](int s, int size, double data[size]) serial { memcpy(recv_buffers[src], data, size*sizeof(double)); do_recv_done(); } }; entry void do_stuff() { /* ... */ serial { int src = /*...*/, dest = /*...*/; send_it(dest, my_data_1_); setup_recv(src, my_data_1_); do_recv(src); } when do_recv_done() serial { finish_recv(src, my_data_1_); } }; };

May 7, 2015 13

slide-93
SLIDE 93

Template specialization: our workaround

Pattern: templated setup, non-templated entry method, templated cleanup (code above)

Is this ideal? Obviously not.

Is this typical of the effort required to make templated code work with an asynchronous many-task runtime system (AMT RTS)? Maybe.

Does Charm++ even support templated entry methods inside templated chares? (We couldn't figure out how to do it.)

May 7, 2015 13

slide-96
SLIDE 96

Distinguishing Entry Methods from Regular Method Calls

Suppose all of the do_stuff_*() methods are ordinary, non-entry methods. What happens first?

Now suppose do_stuff_1() is an entry method and do_stuff_2() is a normal method. Now what happens first?

How does the programmer who didn't write do_stuff_1() know this? Perhaps using naming conventions? (e.g., EM_*())

entry void do_stuff() {
  serial {
    do_stuff_1();
    do_stuff_2();
  }
};

entry void EM_do_stuff() {
  serial {
    EM_do_stuff_1();
    do_stuff_2();
  }
};

In short, mixing entry method calls and regular method calls without using naming conventions makes it difficult to write self-documenting code.

May 7, 2015 14


slide-105
SLIDE 105

Distinguishing Entry Methods from Regular Method Calls

Further complication: non-blocking calls from a blocking context. In fact, do_stuff_2() may only be blocking most of the time, but occasionally contain non-blocking calls. In this case, how does the programmer make the control flow of the program apparent to future programmers?

Avoid writing code like this? Avoid naming conventions? (That makes the programmer "get used to" the idea that any method invocation in a .ci file could be non-blocking.) Just use comments?

/* stuff_doer.ci */
chare StuffDoer {
  entry void EM_do_stuff() {
    serial {
      EM_do_stuff_1();
      do_stuff_2();
      do_stuff_3();
    }
  };
};

/* stuff_doer.h */
class StuffDoer : public CBase_StuffDoer {
  /*...*/
  void do_stuff_2() {
    // uh-oh
    thisProxy.EM_do_stuff_4();
  }
};

/* stuff_doer.h */
class StuffDoer : public CBase_StuffDoer {
  /*...*/
  void do_stuff_2() {
    if(some_rare_condition) {
      thisProxy.EM_do_stuff_4();
    }
    /* ... */
  }
};

May 7, 2015 15


slide-113
SLIDE 113

Outline

Introduction
The Process: Porting an Explicit Aerodynamics Miniapp to the Chare Model
  What was easy?
  What was harder?
Preliminary Results and Performance
Next Steps

May 7, 2015 16

slide-114
SLIDE 114

Performance vs. MPI Version: Weak Scaling

May 7, 2015 17


slide-116
SLIDE 116

Performance: Overdecomposition and Runtime Overhead

[Figure: timeline profiles for 256 Chares on 128 PEs and 1024 Chares on 128 PEs, each annotated "2% faster"]

Application code in green, runtime overhead in red, idle time in white.

(Insets are enlargements of y-axes)

May 7, 2015 18

slide-117
SLIDE 117

Outline

Introduction
The Process: Porting an Explicit Aerodynamics Miniapp to the Chare Model
  What was easy?
  What was harder?
Preliminary Results and Performance
Next Steps

May 7, 2015 19

slide-118
SLIDE 118

Next Steps

More performance analysis (PAPI?)
More Kokkos devices (CUDA?)
More miniapps

May 7, 2015 20


slide-123
SLIDE 123

Questions?

May 7, 2015 21

slide-124
SLIDE 124

Extra Slides

May 7, 2015 22

slide-125
SLIDE 125

Overdecomposition and Zero-Copy Semantics

The "Charm + X" model:
  Charm++: dynamic, inter-node parallelism
  "X": static, on-node parallelism; vectorization

Zero-copy semantics and some shared data model or data warehouse are critical to mitigating the AMT runtime overhead from overdecomposition.

Charm++ allows zero-copy transfer of data between chares on-node using PackedMessages.

But these are an "advanced feature," much more difficult than PUP'ing.

The concept of a shared data block is missing. For instance, PackedMessages have no access privileges (e.g., read-only, shared read/write, exclusive read/write).

Dynamic, on-node parallelism arising from overdecomposition

May 7, 2015 23

slide-133
SLIDE 133

Other minor issues

The Charm++ compiler:

.ci file compiler issues (e.g., a } inside a C++-style comment is not ignored?)

It would be nice if it could run like a preprocessor to generate code, so that a regular compiler could be used after that.

SMP version: no way to initialize libraries on the main thread?
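The kind of construct the slide flags can be shown in a minimal interface-file sketch (module, chare, and entry names here are illustrative, not from MiniAero); the reported issue was that a stray `}` inside a `//` comment could be taken by the .ci parser as a closing brace:

```
// hello.ci -- illustrative Charm++ interface file
mainmodule hello {
  mainchare Main {
    entry Main(CkArgMsg* msg);
    // a comment containing a brace like } reportedly confused the .ci parser
    entry void done();
  };
};
```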

May 7, 2015 24

slide-138
SLIDE 138

Conclusions

How does the programming experience in Charm++ compare to other runtimes?

Inconclusive so far: the Charm++ MiniAero was a port, while the others were complete rewrites.

How does the performance of Charm++ compare to other runtimes?

Inconclusive so far: other MiniAero versions are in various states of completeness. But… our current implementation is already comparable to MPI.

May 7, 2015 25
