Clock Delta Compression for Scalable Order-Replay of Non-Deterministic Parallel Applications



  1. Clock Delta Compression for Scalable Order-Replay of Non-Deterministic Parallel Applications. SC15. Kento Sato, Dong H. Ahn, Ignacio Laguna, Gregory L. Lee, Martin Schulz. November 19th, 2015. LLNL-PRES-679294. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC.

  2. Debugging large-scale applications is becoming problematic. "On average, software developers spend 50% of their programming time finding and fixing bugs." [1] With the trend towards asynchronous communication patterns in MPI applications, MPI non-determinism will significantly increase debugging cost. [1] Source: http://www.prweb.com/releases/2013/1/prweb10298185.htm, Cambridge, UK (PRWEB), January 8, 2013.

  3. What is MPI non-determinism (ND)?
- Message receive orders can differ across executions (internal ND), due to unpredictable system noise (e.g., network, system daemons, OS jitter)
- Arithmetic orders can also change across executions (external ND)
[Figure: processes P0, P1, P2 exchange messages a, b, c; Execution A receives them in an order that computes (a+b)+c, Execution B in an order that computes a+(b+c)]

  4. MPI non-determinism significantly increases debugging cost
- Control flow of an application can change across runs: with the same input, a non-deterministic application may complete successfully, seg-fault, or hang (non-deterministic control flow), whereas a deterministic application always produces the same result
- Numerical results can also change: floating-point arithmetic is not necessarily associative, so (a+b)+c ≠ a+(b+c) (non-deterministic numerical results); a minimal example follows this slide
- Developers need to repeat debug runs until the same bug is reproduced
- Is the run behaving as intended? An application bug? Silent data corruption?
In ND applications it is hard to reproduce bugs and incorrect results, and it costs excessive amounts of time to "reproduce", find, and fix bugs.
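The non-associativity point is easy to demonstrate. Below is a minimal stand-alone C example (mine, not from the slides): summing the same three values in two different orders yields two different results.

    /* Floating-point addition is not associative: the large values cancel
     * before c is added in one order, but absorb c in the other. */
    #include <stdio.h>

    int main(void) {
        double a = 1e16, b = -1e16, c = 1.0;
        printf("(a+b)+c = %.17g\n", (a + b) + c);  /* prints 1 */
        printf("a+(b+c) = %.17g\n", a + (b + c));  /* prints 0 */
        return 0;
    }

If message arrival order determines the order of a reduction, this is exactly how two executions of the same program can produce different numerical results.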

  5. Case study: "Monte Carlo Simulation Benchmark" (MCB)
- CORAL proxy application
- MPI non-determinism: final numerical results differ between the 1st and 2nd run

    $ diff result_run1.out result_run2.out
    < IMC E_RR_total -3.3140234409e-05 -8.302693774e-08 2.9153322360e-08 -4.8198506756e-06 2.3113821822e-06
    > IMC E_RR_total -3.3140234410e-05 -8.302693776e-08 2.9153322360e-08 -4.8198506757e-06 2.3113821821e-06

  6. Why does MPI non-determinism occur? In such non-deterministic applications, each process doesn't know in advance which rank will send it a message (e.g., particle simulation). Messages can arrive in any order from the neighbors (north, east, south, west in MCB), producing inconsistent message arrivals. Typical MPI non-deterministic code:

    MPI_Irecv(..., MPI_ANY_SOURCE, ...);
    while (1) {
      MPI_Test(flag);
      if (flag) {
        <computation>
        MPI_Irecv(..., MPI_ANY_SOURCE, ...);
      }
    }

The source of MPI non-determinism is the matching functions:
- Wait family: MPI_Wait (single), MPI_Waitany (any), MPI_Waitsome (some), MPI_Waitall (all)
- Test family: MPI_Test (single), MPI_Testany (any), MPI_Testsome (some), MPI_Testall (all)
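The slide's pseudocode can be fleshed out into a complete program. The following is a sketch (my assumption: one master rank receiving one message from every other rank) showing how MPI_ANY_SOURCE plus MPI_Test makes the match order depend on arrival timing, which varies from run to run:

    /* Run with at least 2 ranks: workers send to rank 0; the order in which
     * rank 0 matches the messages is non-deterministic. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank != 0) {
            /* Each worker sends one message; arrival order at rank 0 is
             * unpredictable. */
            MPI_Send(&rank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        } else {
            int buf, flag, received = 0;
            MPI_Request req;
            MPI_Irecv(&buf, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &req);
            while (received < size - 1) {
                MPI_Test(&req, &flag, MPI_STATUS_IGNORE);
                if (flag) {
                    printf("matched message from rank %d\n", buf);  /* ND order */
                    if (++received < size - 1)
                        MPI_Irecv(&buf, 1, MPI_INT, MPI_ANY_SOURCE, 0,
                                  MPI_COMM_WORLD, &req);
                }
                /* <computation> would go here in a real application */
            }
        }
        MPI_Finalize();
        return 0;
    }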

  7. State-of-the-art approach: record-and-replay
- Traces and records message receive orders in one run, and replays those orders in successive runs for debugging
- Record-and-replay can reproduce a target control flow
- Developers can focus on debugging one particular control flow in replay
[Figure: a recorded receive order (rank 2, rank 0, rank 2, rank 3, rank 1, ...) is replayed so that the run producing a particular output, hang, or seg-fault is reproduced exactly]

  8. Record-and-replay won't work at scale
- Record-and-replay produces a large amount of record data: over 10 GB/node for 24 hours in MCB
- For scalable record-and-replay with low overhead, the record data must fit into local memory, but memory capacity is limited; storing the records in a shared/parallel file system is not a scalable approach
Challenge: record size reduction for scalable record-and-replay

  9. Proposal: Clock Delta Compression (CDC)
- Put a logical clock (Lamport clock) into each MPI message
- Actual message receive orders (i.e., wall-clock orders) are very similar to logical-clock orders in each MPI rank: MPI messages are received in almost monotonically increasing logical-clock order
- CDC records only the order differences between the wall-clock order and the logical-clock order, instead of recording the entire message order; a sketch of the idea follows this slide
[Figure: messages received by one rank, plotted in wall-clock order against their logical clocks (0-800); the sequence is nearly monotonic, with only a few out-of-order arrivals]
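To make the idea concrete, here is a minimal sketch (my illustration, not the paper's implementation): sort a copy of the observed receive sequence by logical clock, then store only the positions where the observed order deviates from the sorted order. When the two orders are nearly identical, the stored delta is small.

    #include <stdio.h>
    #include <stdlib.h>

    typedef struct { int src_rank; long lclock; } recv_rec;

    static int by_clock(const void *a, const void *b) {
        long ca = ((const recv_rec *)a)->lclock, cb = ((const recv_rec *)b)->lclock;
        return (ca > cb) - (ca < cb);
    }

    int main(void) {
        /* Hypothetical observed receive order (wall-clock order), with one
         * late message (logical clock 8 arriving after 13). */
        recv_rec observed[] = { {0,1},{0,2},{0,4},{0,7},{0,10},{2,13},{1,8},{0,15} };
        int n = sizeof observed / sizeof observed[0];

        recv_rec sorted[8];
        for (int i = 0; i < n; i++) sorted[i] = observed[i];
        qsort(sorted, n, sizeof(recv_rec), by_clock);  /* logical-clock order */

        /* Record only the positions where the two orders disagree. */
        int deltas = 0;
        for (int i = 0; i < n; i++) {
            if (observed[i].lclock != sorted[i].lclock) {
                printf("delta at %d: observed clock %ld, expected %ld\n",
                       i, observed[i].lclock, sorted[i].lclock);
                deltas++;
            }
        }
        printf("%d of %d entries differ\n", deltas, n);
        return 0;
    }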

  10. Result in MCB: the CDC record is 40 times smaller than the record without compression. [Figure: record size comparison in MCB; the original record is 40x the size of the CDC record]

  11. Outline
- Background
- General record-and-replay
- CDC: clock delta compression
- Implementation
- Evaluation
- Conclusion

  12. How to record-and-replay MPI applications? The source of MPI non-determinism is the matching functions, so replaying the matching functions' behavior replays the MPI application's behavior.
Matching functions in MPI:
- Wait family: MPI_Wait (single), MPI_Waitany (any), MPI_Waitsome (some), MPI_Waitall (all)
- Test family: MPI_Test (single), MPI_Testany (any), MPI_Testsome (some), MPI_Testall (all)
Question: what information needs to be recorded to replay these matching functions?

  13. Necessary values to be recorded for correct replay: example. [Figure: an example exchange in which rank 0, rank 1, and rank 2 send messages to rank x; used in the following slides]

  14. Necessary values for correct replay
- rank: who sent the message?
- count & flag: for the MPI_Test family (flag: matched or unmatched?; count: how many times unmatched?)
- id: for application-level out-of-order matching
- with_next: for matching the some/all functions
[Table: the (count, flag, rank) values recorded at rank x for each matching call in the example]


  16. Application-level out-of-order
- MPI guarantees that any two communications between a pair of processes are ordered: if rank 1 sends msg A then msg B, they are matched at rank 0 in that order (Send: A → B, Recv: A → B)
- However, the timing of the matching-function calls depends on the application, so the message order observed by the application is not necessarily equal to the message send order
- For example, if rank 0 posts MPI_Irecv(req[0]) and MPI_Irecv(req[1]) and then happens to complete MPI_Test(req[1]) before MPI_Test(req[0]), "msg: B" is observed to match earlier than "msg: A"
- Recording only "rank" cannot distinguish between A → B and B → A; a sketch follows this slide
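A sketch of this situation in C (my illustrative code, not from the slides; run with at least 2 ranks): rank 0 posts two receives, then tests req[1] first, so it observes the completions as B-then-A even though rank 1 sent A-then-B.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 1) {
            int a = 'A', b = 'B';
            MPI_Send(&a, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);  /* msg: A */
            MPI_Send(&b, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);  /* msg: B */
        } else if (rank == 0) {
            int bufs[2], flag = 0;
            MPI_Request req[2];
            MPI_Irecv(&bufs[0], 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &req[0]);
            MPI_Irecv(&bufs[1], 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &req[1]);

            /* Testing req[1] first: the application observes msg B's
             * completion before msg A's, so a record holding only the
             * sender rank cannot reconstruct which Irecv matched which
             * message. */
            while (!flag) MPI_Test(&req[1], &flag, MPI_STATUS_IGNORE);
            printf("observed completion: %c\n", bufs[1]);
            flag = 0;
            while (!flag) MPI_Test(&req[0], &flag, MPI_STATUS_IGNORE);
            printf("observed completion: %c\n", bufs[0]);
        }
        MPI_Finalize();
        return 0;
    }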

  17. Each rank needs to assign an "id" number to each message. [Figure: senders number their messages per destination; rank 0's messages to rank x carry ids 0, 1, 2, 3, 4 and rank 1's carry ids 0, 1, so the receiver can record exactly which message each matching call matched]
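One plausible way to implement this (a hypothetical wrapper, not the paper's actual mechanism): each rank keeps a counter per destination and prepends the current id to the payload, so the receiver can recover which message it matched.

    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    #define MAX_RANKS 1024
    static int next_id[MAX_RANKS];  /* per-destination sequence counters */

    /* Hypothetical helper: send payload with a piggybacked per-destination
     * id (0, 1, 2, ... for each destination rank). */
    int send_with_id(const char *payload, int len, int dest, int tag,
                     MPI_Comm comm) {
        char *buf = malloc(sizeof(int) + len);
        int id = next_id[dest]++;
        memcpy(buf, &id, sizeof(int));            /* prepend the id */
        memcpy(buf + sizeof(int), payload, len);  /* then the payload */
        int rc = MPI_Send(buf, sizeof(int) + (int)len, MPI_BYTE, dest, tag, comm);
        free(buf);
        return rc;
    }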

  18. Necessary values for correct replay (with ids added)
- rank: who sent the message?
- count & flag: for the MPI_Test family (flag: matched or unmatched?; count: how many times unmatched?)
- id: for application-level out-of-order matching
- with_next: for matching the some/all functions
[Table: the (count, flag, rank, id) values recorded at rank x for each matching call in the example]
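Putting the listed values together, one record entry might look like the following struct (my sketch of a plausible layout, not the paper's actual record format):

    /* One replay-record entry per matching call, holding the values the
     * slides identify as necessary for correct replay. */
    typedef struct {
        int rank;       /* sender of the matched message */
        int id;         /* per-sender sequence id (see slide 17) */
        int count;      /* how many times the Test returned unmatched */
        int flag;       /* whether this call matched a message */
        int with_next;  /* grouped with the next entry (some/all functions) */
    } replay_record;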

