Scalable Tools for Debugging Non-Deterministic MPI Applications
ReMPI: MPI Record-and-Replay Tool


  1. Scalable Tools for Debugging Non-Deterministic MPI Applications. ReMPI: MPI Record-and-Replay tool. Kento Sato, Dong H. Ahn, Ignacio Laguna, Gregory L. Lee, Martin Schulz and Chris Chambreau. Scalable Tools Workshop, August 2nd, 2016. LLNL-PRES-698040. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC.

  2. Debugging large-scale applications is already challenging. "On average, software developers spend 50% of their programming time finding and fixing bugs." [1] With trends towards asynchronous communication patterns in MPI applications, MPI non-determinism will significantly increase debugging cost. [1] Source: http://www.prweb.com/releases/2013/1/prweb10298185.htm, CAMBRIDGE, UK (PRWEB), JANUARY 08, 2013.

  3. What is MPI non-determinism? Message receive orders can be different across executions because of unpredictable system noise (e.g., network, system daemon and OS jitter). Floating-point arithmetic orders can also change across executions: with processes P0, P1, P2 contributing values a, b, c, Execution A may compute (a+b)+c while Execution B computes a+(b+c) (see the sketch below).
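Floating-point non-associativity is easy to reproduce even without MPI. The following minimal C sketch (values chosen only to make the rounding visible) mirrors the slide's (a+b)+c vs. a+(b+c) example:

    #include <stdio.h>

    /* Floating-point addition is not associative: the order in which
     * partial sums are combined changes the rounding, and thus the result. */
    int main(void) {
        float a = 1.0e8f, b = -1.0e8f, c = 1.0f;
        float run_A = (a + b) + c;   /* contributions combined as (a+b)+c */
        float run_B = a + (b + c);   /* contributions combined as a+(b+c) */
        printf("Execution A: %g\n", run_A);  /* prints 1 */
        printf("Execution B: %g\n", run_B);  /* prints 0 */
        return 0;
    }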

  4. Non-determinism also increases debugging cost. Control flows of an application can change across different runs. Non-deterministic control flow: the same input may lead to a successful run, a seg-fault, or a hang. Non-deterministic numerical results: floating-point arithmetic is non-associative, (a+b)+c ≠ a+(b+c), so the same input may produce Result A or Result B. Developers therefore need to repeat debug runs until the target bug manifests. In non-deterministic applications it is hard to reproduce bugs and incorrect results, and it costs excessive amounts of time to "reproduce" target bugs.

  5. Non-deterministic bugs, case study: Pf3d and Diablo/Hypre 2.10.1. Debugging non-deterministic hangs often costs computational scientists substantial time and effort. Diablo hung only once every 30 runs, after a few hours; the scientists spent 2 months (within a period of 18 months) and gave up debugging it. Pf3d hung only when scaling to half a million MPI processes; the scientists refused to debug it for 6 months. Hypre is an MPI-based library for solving large, sparse linear systems of equations on massively parallel computers.

  6. Non-deterministic numerical result, case study: Monte Carlo Simulation (MCB). MCB (Monte Carlo Benchmark) is a CORAL proxy application with MPI non-determinism. Final numerical results differ between the 1st and 2nd run:
  $ diff result_run1.out result_run2.out
  result_run1.out:< IMC E_RR_total -3.3140234409e-05 -8.302693774e-08 2.9153322360e-08 -4.8198506756e-06 2.3113821822e-06
  result_run2.out:> IMC E_RR_total -3.3140234410e-05 -8.302693776e-08 2.9153322360e-08 -4.8198506757e-06 2.3113821821e-06
  Table 1: Catalyst specification. Nodes: 304 batch nodes. CPU: 2.4 GHz Intel Xeon E5-2695 v2 (24 cores in total). Memory: 128 GB. Interconnect: InfiniBand QDR (QLogic). Local storage: Intel SSD 910 Series (PCIe 2.0, MLC).
  * The source was modified by the scientist to demonstrate the issue in the field.

  7. Why does MPI non-determinism occur? It is typically due to communication with MPI_ANY_SOURCE. In non-deterministic applications each process does not know which rank will send the next message, and messages from the neighbors (north, west, east, south in MCB) can arrive in any order, leading to inconsistent message arrivals. MCB's wildcard-receive loop looks roughly like:
  MPI_Irecv(…, MPI_ANY_SOURCE, …);
  while(1) {
    MPI_Test(flag);
    if (flag) {
      <computation>
      MPI_Irecv(…, MPI_ANY_SOURCE, …);
    }
  }
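To make the wildcard-receive pattern above concrete, here is a small self-contained MPI program; it is a hedged sketch rather than MCB's actual code. Rank 0 posts receives with MPI_ANY_SOURCE, and the order in which the other ranks' messages are matched can differ from run to run:

    #include <mpi.h>
    #include <stdio.h>

    /* Rank 0 receives one message from every other rank using MPI_ANY_SOURCE;
     * the matching order is not deterministic and may change between runs. */
    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {
            for (int i = 1; i < size; i++) {
                int payload;
                MPI_Status st;
                MPI_Recv(&payload, 1, MPI_INT, MPI_ANY_SOURCE, 0,
                         MPI_COMM_WORLD, &st);
                printf("matched message from rank %d\n", st.MPI_SOURCE);
            }
        } else {
            MPI_Send(&rank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        }
        MPI_Finalize();
        return 0;
    }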

  8. ReMPI can reproduce message matching. ReMPI reproduces message matching with a record-and-replay technique: it traces and records the message receive orders of a run, and replays those orders in successive runs for debugging. Record-and-replay can reproduce a target control flow, so developers can focus on debugging a particular control flow in replay (the figure on this slide contrasts the input non-deterministically leading to a hang, a seg-fault, Output A or Output B with one recorded receive order across ranks 0-3 being replayed). A sketch of the record side is shown below.
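As one way to picture the record side, the sketch below interposes on MPI_Recv through the standard PMPI profiling interface and logs which sender was actually matched. The record_file handle and the blocking-receive-only scope are illustrative assumptions, not ReMPI's actual implementation (a real tool also has to cover nonblocking receives and matching via Test/Wait):

    #include <mpi.h>
    #include <stdio.h>

    /* Hypothetical per-rank trace file, assumed to be opened at MPI_Init time. */
    extern FILE *record_file;

    /* Record mode: wrap MPI_Recv via PMPI and log which sender was matched,
     * so a later replay run can enforce the same matching order. */
    int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source,
                 int tag, MPI_Comm comm, MPI_Status *status) {
        MPI_Status local;
        MPI_Status *st = (status == MPI_STATUS_IGNORE) ? &local : status;
        int rc = PMPI_Recv(buf, count, datatype, source, tag, comm, st);
        if (source == MPI_ANY_SOURCE && rc == MPI_SUCCESS) {
            /* the matched sender is known only after the receive completes */
            fprintf(record_file, "%d\n", st->MPI_SOURCE);
        }
        return rc;
    }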

  9. Record overhead to performance. Performance metric: how many particles are tracked per second. [Chart: performance (tracks/sec) versus number of processes (48, 96, 192, 384, 768, 1536, 3072) for MCB without recording, MCB with gzip recording to local storage, and ReMPI.] ReMPI becomes scalable by recording to local memory/storage: each rank independently writes its record, so there is no communication across MPI ranks (figure: ranks 0-7 mapped onto nodes 0-3, each writing locally). A minimal sketch follows below.
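A minimal sketch of the per-rank local recording idea on this slide, using zlib's gzip stream API; the node-local path and the record layout are assumptions made up for illustration:

    #include <mpi.h>
    #include <zlib.h>
    #include <stdio.h>

    /* Each rank writes its own gzip-compressed record to node-local storage,
     * so recording involves no communication between MPI ranks. */
    static gzFile open_local_record(void) {
        int rank;
        char path[256];
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        /* "/l/ssd" stands in for a node-local SSD mount point */
        snprintf(path, sizeof(path), "/l/ssd/record.%d.gz", rank);
        return gzopen(path, "wb");
    }

    /* Append the matched source ranks observed so far; the handle should be
     * closed with gzclose() at MPI_Finalize time. */
    static void append_matched_sources(gzFile f, const int *sources, int n) {
        gzwrite(f, sources, (unsigned)(n * sizeof(int)));
    }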

  10. Record-and-replay won't work at scale. Record-and-replay produces a large amount of recording data: over 10 GB/node per day in MCB and over 24 GB/node per day in Diablo. For scalable record-and-replay with low overhead, the record data must fit into local memory, but capacity is limited; storing records in a shared/parallel file system is not a scalable approach, and some systems may not have fast local storage. Challenge: record size reduction for scalable record-and-replay.

  11. Clock Delta Compression (CDC). [Figure: three senders deliver messages to a receiver; the received order (ordered by wall clock) is approximately equal to the logical order (ordered by logical clock).]

  12. Logical clock vs. wall clock. "The global order of messages exchanged among MPI processes is very similar to a logical-clock order (e.g., Lamport clock)." [Plot: Lamport clock values of received messages for particle exchanges in MCB (MPI rank = 0), shown in received order; the received order closely tracks the logical-clock order.] Each process frequently exchanges messages with its neighbors.
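One common way to obtain such a logical order is to piggyback a Lamport clock on every message. The sketch below packs the clock next to the payload purely for illustration; a real tool would piggyback it more transparently (for example, inside the PMPI layer):

    #include <mpi.h>

    /* Per-process Lamport clock: tick on every send, and on every receive
     * take the maximum of the local clock and the piggybacked clock. */
    static long lamport_clock = 0;

    static void send_with_clock(int value, int dest, MPI_Comm comm) {
        long msg[2];
        lamport_clock++;                  /* tick for the send event */
        msg[0] = (long)value;
        msg[1] = lamport_clock;           /* piggyback the clock */
        MPI_Send(msg, 2, MPI_LONG, dest, 0, comm);
    }

    static int recv_with_clock(int source, MPI_Comm comm, long *msg_clock) {
        long msg[2];
        MPI_Recv(msg, 2, MPI_LONG, source, 0, comm, MPI_STATUS_IGNORE);
        if (msg[1] > lamport_clock)       /* merge clocks */
            lamport_clock = msg[1];
        lamport_clock++;                  /* tick for the receive event */
        *msg_clock = msg[1];              /* defines the message's logical order */
        return (int)msg[0];
    }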

  13. Clock Delta Compression (CDC). Our approach, clock delta compression, records only the difference between the received order and the logical order instead of recording the entire received order. [Figure: received order (by wall clock) vs. logical order (by logical clock); only the permutation difference (diff) between the two is recorded.]
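The permutation difference can be pictured as, for each received message, the offset between its position in received (wall-clock) order and its position in logical-clock order; because the two orders usually agree, most offsets are zero and compress well. The sketch below is a deliberate simplification of the CDC encoding described in the SC15 paper cited on slide 14:

    #include <stdio.h>

    /* For message i (in received order), record only how far its position in
     * logical-clock order deviates from i; identical orders give all zeros. */
    static void clock_delta(const int *logical_pos, int n, int *delta) {
        for (int i = 0; i < n; i++)
            delta[i] = logical_pos[i] - i;
    }

    int main(void) {
        /* an illustrative permutation of 6 received messages */
        int logical_pos[6] = {0, 2, 1, 3, 5, 4};
        int delta[6];
        clock_delta(logical_pos, 6, delta);
        for (int i = 0; i < 6; i++)
            printf("msg %d: delta %d\n", i + 1, delta[i]);
        return 0;
    }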

  14. Logical clock order is reproducible [1]. The logical-clock order is always reproducible, so CDC only records the permutation difference. [Figure 12: send and receive event sets on processes P0, P1, P2, used to illustrate the induction below.]
  Theorem 1. CDC can correctly replay message events, that is, E = Ê, where E and Ê are the ordered sets of events for a record and a replay run.
  Proof (mathematical induction). (i) Basis: show that the first send events are replayable, i.e., for all x such that E_1^x is a send event set, E_1^x is replayable. As defined in Definition 7.(i), E_1^x is deterministic, that is, E_1^x is always replayed (in Figure 12, E_1^1 is deterministic and is always replayed). (ii) Inductive step for send events: show that a send event set E is replayable if all preceding message events are replayed. As defined in Definition 7.(ii), E is then deterministic, that is, E is always replayed. (iii) Inductive step for receive events: show that a receive event set E is replayable if all preceding message events are replayed. As proved in Proposition 1, all message receives in E can be replayed by CDC. Therefore all events are replayed, i.e., E = Ê. (The induction steps are shown graphically in Figure 12.)
  Theorem 2. CDC can replay piggyback clocks.
  Proof. As proved in Theorem 1, CDC can replay all message events, so send events and clock ticking are replayed. Thus, CDC can replay piggyback clock sends.
  [1] Kento Sato, Dong H. Ahn, Ignacio Laguna, Gregory L. Lee and Martin Schulz, "Clock Delta Compression for Scalable Order-Replay of Non-Deterministic Parallel Applications", in Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis 2015 (SC15), Austin, USA, November 2015.

  15. Clock Delta Compression (CDC). Our approach, clock delta compression, records only the difference between the received order and the logical order instead of recording the entire received order. Because this logical order is reproducible, replaying the logical order and applying the recorded permutation difference reconstructs the original received order. [Figure: received order (by wall clock) = reproducible logical order (by logical clock) + permutation difference.]
