Scalable Tools for Debugging Non-Deterministic MPI Applications
ReMPI: MPI Record-and-Replay Tool
Kento Sato, Dong H. Ahn, Ignacio Laguna, Gregory L. Lee, Martin Schulz and Chris Chambreau
Scalable Tools Workshop


Slide 1

LLNL-PRES-698040

This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC

Scalable Tools Workshop

Kento Sato, Dong H. Ahn, Ignacio Laguna, Gregory L. Lee, Martin Schulz and Chris Chambreau

August 2nd, 2016

Scalable Tools for Debugging Non-Deterministic MPI Applications

ReMPI: MPI Record-and-Replay tool

Slide 2

Debugging large-scale applications is already challenging

“On average, software developers spend 50% of their programming time finding and fixing bugs.”[1]

[1] Source: http://www.prweb.com/releases/2013/1/prweb10298185.htm, CAMBRIDGE, UK (PRWEB) JANUARY 08, 2013

With trends towards asynchronous communication patterns in MPI applications, MPI non-determinism will significantly increase debugging cost

Slide 3

What is MPI non-determinism?

§ Message receive orders can be different across executions
  — Unpredictable system noise (e.g., network, system daemon & OS jitter)
§ Floating-point arithmetic orders can also change across executions

[Figure: messages a, b, c from P0, P1, P2 arrive in different orders in two runs, so Execution A reduces as (a+b)+c while Execution B reduces as a+(b+c).]
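As a minimal illustration of the point above (not from the slides; the values are chosen arbitrarily to expose rounding differences), summing the same three doubles in a different order can give different results:

#include <stdio.h>

int main(void) {
    /* Arbitrary values picked so that rounding differs with order. */
    double a = 1.0e16, b = -1.0e16, c = 1.0;

    double run_a = (a + b) + c;  /* reduction order of "Execution A" */
    double run_b = a + (b + c);  /* reduction order of "Execution B" */

    printf("(a+b)+c = %.17g\n", run_a);  /* typically prints 1 */
    printf("a+(b+c) = %.17g\n", run_b);  /* typically prints 0 */
    return 0;
}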

Slide 4

Non-determinism also increases debugging cost

§ Non-deterministic control flow
  — Successful run, seg-fault, or hang
§ Non-deterministic numerical results
  — Floating-point arithmetic is non-associative: (a+b)+c ≠ a+(b+c)
§ Control flows of an application can change across different runs

[Figure: from the same input, different runs may end in Result A, Result B, a seg-fault, or a hang.]

In non-deterministic applications, it is hard to reproduce bugs and incorrect results; "reproducing" the target bug costs excessive amounts of time.

⇒ Developers need to repeat debug runs until the target bug manifests

Slide 5

Non-deterministic bugs -- Case study: Pf3d and Diablo/Hypre 2.10.1

§ Diablo: hung only once every 30 runs, after a few hours
  — The scientists spent 2 months over a period of 18 months and gave up debugging it
§ Pf3d: hung only when scaling to half a million MPI processes
  — The scientists refused to debug for 6 months …

Hypre is an MPI-based library for solving large, sparse linear systems of equations on massively parallel computers.

§ Debugging non-deterministic hangs often costs computational scientists substantial time and effort

Slide 6

Non-deterministic numerical result -- Case study: "Monte Carlo Simulation" (MCB)

§ CORAL proxy application
§ MPI non-determinism

MCB: Monte Carlo Benchmark

$ diff result_run1.out result_run2.out
< IMC E_RR_total -3.3140234409e-05 -8.302693774e-08 2.9153322360e-08 -4.8198506756e-06 2.3113821822e-06
> IMC E_RR_total -3.3140234410e-05 -8.302693776e-08 2.9153322360e-08 -4.8198506757e-06 2.3113821821e-06

The final numerical results differ between the 1st and 2nd runs; only the last digits change (…09 vs …10, …74 vs …76, …56 vs …57, …22 vs …21).

Table 1: Catalyst Specification
  Nodes:         304 batch nodes
  CPU:           2.4 GHz Intel Xeon E5-2695 v2 (24 cores in total)
  Memory:        128 GB
  Interconnect:  InfiniBand QDR (QLogic)
  Local Storage: Intel SSD 910 Series (PCIe 2.0, MLC)

* The source was modified by the scientist to demonstrate the issue in the field

Slide 7

MPI_ANY_SOURCE communication

Why does MPI non-determinism occur?

§ It is typically due to communication with MPI_ANY_SOURCE
§ In non-deterministic applications, each process doesn't know which rank will send the next message
§ Messages can arrive in any order from neighbors ⇒ inconsistent message arrivals

MPI_Irecv(…, MPI_ANY_SOURCE, …);
while (1) {
  MPI_Test(flag);
  if (flag) {
    <computation>
    MPI_Irecv(…, MPI_ANY_SOURCE, …);
  }
}

[Figure: in MCB (Monte Carlo Benchmark), each process communicates with its north, south, west, and east neighbors.]
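A minimal, runnable variant of this receive loop (not part of the slides; the payload and process layout are arbitrary) shows how MPI_ANY_SOURCE makes the matching order run-dependent:

/* Run with e.g. "mpirun -np 4 ./any_source"; rank 0 may print the
 * senders in a different order on each run, because matching with
 * MPI_ANY_SOURCE follows message arrival, i.e. system noise. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        for (int i = 1; i < size; i++) {
            int payload;
            MPI_Status st;
            MPI_Recv(&payload, 1, MPI_INT, MPI_ANY_SOURCE, 0,
                     MPI_COMM_WORLD, &st);
            printf("received %d from rank %d\n", payload, st.MPI_SOURCE);
        }
    } else {
        MPI_Send(&rank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}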

Slide 8

ReMPI can reproduce message matching

§ Traces and records message receive orders in a run, and replays the orders in successive runs for debugging
§ ReMPI can reproduce message matching by using the record-and-replay technique

Record-and-replay can reproduce a target control flow, so developers can focus on debugging a particular control flow in replay.

[Figure: from the same input, a run may end in Output A, Output B, a seg-fault, or a hang; ReMPI records the per-rank message receive orders (e.g., rank 0 receiving from ranks 2, 1, 3, …) and re-enforces them during replay.]
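As a rough sketch of the record-and-replay idea (not ReMPI's actual implementation or trace format; rempi_record, rempi_replay, and recv_one are hypothetical helpers), the record phase could log the matched source of every wildcard receive, and the replay phase could re-post each receive with the recorded source so that matching repeats exactly:

/* Sketch only: record mode logs each matched source to a per-rank
 * trace file; replay mode turns the wildcard receive into a directed
 * receive using the next recorded source. The trace file is assumed
 * to be opened per rank elsewhere. */
#include <mpi.h>
#include <stdio.h>

static FILE *trace;   /* per-rank trace, opened in record or replay mode */

static void rempi_record(int matched_source) {
    fprintf(trace, "%d\n", matched_source);   /* one entry per receive */
}

static int rempi_replay(void) {
    int source = MPI_ANY_SOURCE;
    fscanf(trace, "%d", &source);             /* next recorded source */
    return source;
}

/* Receive one int; in replay mode the recorded source forces the
 * same matching as in the recorded run. */
static void recv_one(int replay_mode, int *buf) {
    MPI_Status st;
    int src = replay_mode ? rempi_replay() : MPI_ANY_SOURCE;
    MPI_Recv(buf, 1, MPI_INT, src, 0, MPI_COMM_WORLD, &st);
    if (!replay_mode)
        rempi_record(st.MPI_SOURCE);
}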

Slide 9

Recording overhead on performance

§ Performance metric: how many particles are tracked per second

[Figure: MCB performance (tracks/sec) vs. number of processes (48 to 3,072), comparing MCB without recording and MCB with gzip recording to local storage.]

§ ReMPI becomes scalable by recording to local memory/storage

— Each rank independently writes its record ⇒ no communication across MPI ranks

[Figure: MCB ranks spread across nodes 0–3, with a ReMPI instance on each node recording locally.]
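As an illustration of per-rank local recording (a sketch; the /tmp path, file naming, and direct use of zlib are my assumptions, not ReMPI's actual layout), each rank can write its own gzip-compressed record without any cross-rank communication:

#include <stdio.h>
#include <zlib.h>   /* link with -lz */

/* Each rank opens its own compressed record file on node-local
 * storage and appends one entry per matched receive. */
static gzFile record_file;

void record_open(int rank) {
    char path[64];
    snprintf(path, sizeof(path), "/tmp/rempi_record.%d.gz", rank);
    record_file = gzopen(path, "wb");
}

void record_source(int matched_source) {
    gzwrite(record_file, &matched_source, sizeof(matched_source));
}

void record_close(void) {
    gzclose(record_file);
}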

Slide 10

Record-and-replay won't work at scale

§ Record-and-replay produces a large amount of recording data
  — Over 10 GB/node per day in MCB
  — Over 24 GB/node per day in Diablo

§ For scalable record-and-replay with low overhead, the record data must fit into local memory, but capacity is limited
  — Storing in a shared/parallel file system is not a scalable approach
  — Some systems may not have fast local storage

Challenge: record size reduction for scalable record-replay

[Figure: per-rank record-and-replay traces grow to 10 GB/node for MCB and 24 GB/node for Diablo.]

Slide 11

Clock Delta Compression (CDC)

[Figure: three senders and one receiver; the receiver sees the six messages in wall-clock (received) order 1 2 3 4 5 6, while the logical-clock order is 1 2 4 5 3 6.]

Slide 12

Logical clock vs. wall clock

"The global order of messages exchanged among MPI processes is very similar to a logical-clock order (e.g., Lamport clock)"

Each process frequently exchanges messages with neighbors

[Figure: Lamport clock values of received messages, plotted in received order, for particle exchanges in MCB (MPI rank = 0).]
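For context (not from the slides; the function names are illustrative), the Lamport clock referred to above follows a simple rule: tick on every send, and on every receive take the maximum of the local clock and the piggybacked clock, then tick:

#include <stdint.h>

static uint64_t local_clock = 0;

/* Called just before a send: tick and return the value to piggyback. */
uint64_t lamport_on_send(void) {
    return ++local_clock;
}

/* Called when a message with a piggybacked clock is matched. */
void lamport_on_recv(uint64_t piggybacked_clock) {
    if (piggybacked_clock > local_clock)
        local_clock = piggybacked_clock;
    ++local_clock;
}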

Slide 13

Clock Delta Compression (CDC)

§ Our approach, clock delta compression, only records the difference between the received order and the logical order, instead of recording the entire received order

[Figure: the received order (by wall clock) is diffed against the logical order (by logical clock); only the permutation difference is recorded.]
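To make the "diff" concrete (a sketch under a simplified encoding of my own, not ReMPI's record format), one can store only the positions at which the received order deviates from the reproducible logical order:

#include <stdio.h>

typedef struct { int position; int msg_id; } delta_t;

/* Record only entries whose received position differs from the
 * logical position; returns the number of delta entries written. */
int clock_delta_compress(const int *received, const int *logical,
                         int n, delta_t *out) {
    int k = 0;
    for (int i = 0; i < n; i++) {
        if (received[i] != logical[i]) {
            out[k].position = i;            /* where the orders diverge */
            out[k].msg_id   = received[i];  /* which message actually arrived */
            k++;
        }
    }
    return k;   /* small whenever received order ~ logical order */
}

int main(void) {
    /* Loosely following the slide figure: only messages 3, 4, 5 land
       outside their logical positions. */
    int logical[6]  = {1, 2, 3, 4, 5, 6};
    int received[6] = {1, 2, 4, 5, 3, 6};
    delta_t d[6];
    int k = clock_delta_compress(received, logical, 6, d);
    for (int i = 0; i < k; i++)
        printf("pos %d -> msg %d\n", d[i].position, d[i].msg_id);
    return 0;
}

In this sketch, replay would walk the reproducible logical order and apply the recorded deltas to reconstruct the original received order.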

Slide 14

Logical clock order is reproducible [1]

[Figure 12: message events e0–e6 across processes P0, P1, P2, with each process's send and receive events E^x_1, E^x_2, E^x_3; the basis and the two inductive steps of Theorem 1 ((i), (ii), (iii)) are marked on the corresponding events.]

Theorem 1. CDC can correctly replay message events, that is, E = Ê, where E and Ê are the ordered sets of events for a record and a replay run.

Proof (mathematical induction).
(i) Basis: show that the first send events are replayable, i.e., for all x such that E^x_1 is a send event, E^x_1 is replayable. As defined in Definition 7.(i), E^x_1 is deterministic, that is, E^x_1 is always replayed. In Figure 12, E^1_1 is deterministic, that is, it is always replayed.
(ii) Inductive step for send events: show that send events are replayable if all previous message events are replayed, i.e., "for all E' → E such that E' is replayed and E is a send event set" ⇒ "E is replayable". As defined in Definition 7.(ii), E is deterministic, that is, E is always replayed.
(iii) Inductive step for receive events: show that receive events are replayable if all previous message events are replayed, i.e., "for all E' → E such that E' is replayed and E is a receive event set" ⇒ "E is replayable". As proved in Proposition 1, all message receives in E can be replayed by CDC.
Therefore, all of the events can be replayed, i.e., E = Ê. (The induction steps are shown graphically in Figure 12.)

Theorem 2. CDC can replay piggyback clocks.
Proof. As proved in Theorem 1, CDC can replay all message events, so send events and clock ticking are replayed. Thus, CDC can replay piggyback clock sends.
§ Logical-clock order is always reproducible, so CDC only records the permutation difference

[Figure: the logical order of the six messages (ordered by logical clock) is reproducible.]

[1] Kento Sato, Dong H. Ahn, Ignacio Laguna, Gregory L. Lee and Martin Schulz, "Clock Delta Compression for Scalable Order-Replay of Non-Deterministic Parallel Applications", In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis 2015 (SC15), Austin, USA, Nov. 2015.
Slide 15

Clock Delta Compression (CDC)

§ Our approach, clock delta compression, only records the difference between the received order and the logical order, instead of recording the entire received order

[Figure: the received order equals the reproducible logical order plus the recorded permutation difference, so only the permutation difference needs to be stored.]

Slide 16

Implementation

§ We use a PMPI wrapper
  — Tracing message receive order
  — Clock piggybacking
§ Clock piggybacking [1]
  — When sending an MPI message, the PMPI wrapper defines a new MPI_Datatype that combines the message payload & clock

[Figure: the user program's MPI_Isend/MPI_Test calls are intercepted by the PMPI wrapper library and forwarded as PMPI_Isend/PMPI_Test to the MPI library; on each send, the wrapper builds a new MPI_Datatype that packs the piggybacked clock next to the message payload.]

[1] M. Schulz, G. Bronevetsky, and B. R. de Supinski. On the Performance of Transparent MPI Piggyback Messages. In Proceedings of the 15th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface, pages 194–201, Berlin, Heidelberg, 2008. Springer-Verlag.
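A rough illustration of how a wrapper might pack a clock next to the payload with a derived datatype (a sketch only; piggyback_send is a hypothetical helper and ReMPI's real wrapper may lay the type out differently):

#include <mpi.h>
#include <stdint.h>

/* Send "count" ints plus a piggybacked 64-bit clock as one message by
 * building a struct datatype on the fly. */
int piggyback_send(const int *payload, int count, uint64_t clock,
                   int dest, int tag, MPI_Comm comm) {
    MPI_Datatype combined;
    int          blocklens[2] = { count, 1 };
    MPI_Aint     displs[2];
    MPI_Datatype types[2]     = { MPI_INT, MPI_UINT64_T };

    /* Absolute addresses let the payload and the clock live in
     * unrelated buffers. */
    MPI_Get_address(payload, &displs[0]);
    MPI_Get_address(&clock,  &displs[1]);

    MPI_Type_create_struct(2, blocklens, displs, types, &combined);
    MPI_Type_commit(&combined);

    /* With absolute displacements, the send buffer is MPI_BOTTOM. */
    int rc = MPI_Send(MPI_BOTTOM, 1, combined, dest, tag, comm);

    MPI_Type_free(&combined);
    return rc;
}

The matching receiver would build an analogous struct type so it can unpack the payload and the clock from the same message.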

Slide 17

Compression improvement in MCB

The compressed size becomes 40x smaller than the original size: gzip alone reduces the original format by 8x, and CDC gives a further 5x reduction on top of that.

[Figure: total record sizes on MCB at 3,072 processes (12.3 sec run): 196 MB without compression, 25 MB with gzip, 5 MB with CDC.]

§ For example, with 1 GB of memory per node available for record-and-replay …
  — w/o compression: 2 hours
  — gzip: 19 hours
  — CDC: 4 days
(These durations are consistent with the roughly 10 GB/node per day uncompressed record rate shown earlier: 1 GB lasts a bit over 2 hours uncompressed, 8x longer with gzip, and 40x longer with CDC.)

Slide 18

Summary

§ Non-determinism is a common issue in debugging MPI applications
§ ReMPI can help to reproduce buggy MPI behaviors with minimum record size