 
Debugging Large Scale Parallel Applications
Filippo Gioachin
Parallel Programming Laboratory
Department of Computer Science
University of Illinois at Urbana-Champaign
Outline
● Introduction
  – Motivations
● Debugging on Large Machines
  – Scalability
● Using Fewer Resources
  – Virtualized Debugging
  – Processor Extraction
● Summary
29 April 2010 Filippo Gioachin - UIUC
Motivations
● Debugging is a fundamental part of software development
● Parallel programs have all the sequential bugs:
  – Memory corruption
  – Incorrect results
  – ....
Motivations (2)
● Parallel programs have other bugs:
  – Data races / multicore (heavily studied in literature)
  – Communication mistakes
  – Synchronization mistakes / Message races
● To complicate things more:
  – Non-determinism
  – Problems may show up only at large scale
Problems at Large Scale
● Problems may not appear at small scale
  – Races between messages
    ● Latencies in the underlying hardware
  – Incorrect messaging
  – Data decomposition
Problems at Large Scale (2)
● Infeasible
  – Debugger needs to handle many processors
  – Human can be overwhelmed by information
  – Long waiting time in queue
  – Machine not available
● Expensive
  – Large machine allocations consume a lot of computational resources
CharmDebug Overview
[Diagram labels: CharmDebug, GDB, Login Node, Parallel Machine; links: launch over SSH, SSH tunnel, single CCS connection]
Converse Client-Server Scalability
● CCS connects to the application as a whole
  – Forward requests for single processors
  – Gather information from the whole application
● Uses the same communication infrastructure as the application
Debugging on Large Systems
● Attaching to running application
  – 48-processor cluster
    ● 28 ms with 48 point-to-point queries
    ● 2 ms with a single global query
● Example: Memory statistics collection
  – 12 to 20 ms up to 4k processors
  – Measured on the client debugger
* F. Gioachin, C.W. Lee, L.V. Kalé: "Scalable Interaction with Parallel Applications", in Proceedings of TeraGrid'09
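The gap between 48 point-to-point queries and one global query can be illustrated with a toy model. This is only a sketch under assumptions: the function names, the spanning-tree shape, and the fanout are hypothetical and do not describe the actual CCS implementation. It shows that the client exchanges one message instead of one per processor when the application gathers the replies itself.

```python
def point_to_point_query(values):
    """Client asks every processor separately: one round trip per processor."""
    client_round_trips = 0
    gathered = {}
    for pe, value in enumerate(values):
        client_round_trips += 1      # one request/reply pair for this processor
        gathered[pe] = value
    return gathered, client_round_trips


def global_query(values, fanout=2):
    """Client sends a single request; replies are merged up a spanning tree.

    The tree rooted at processor 0 with a fixed fanout is an assumption
    made for illustration only.
    """
    n = len(values)

    def gather(pe):
        # Each processor merges its own value with its children's results.
        result = {pe: values[pe]}
        for k in range(1, fanout + 1):
            child = pe * fanout + k
            if child < n:
                result.update(gather(child))
        return result

    return gather(0), 1              # the client sees a single round trip


values = [pe * 10 for pe in range(48)]   # stand-in per-processor data
p2p_data, p2p_trips = point_to_point_query(values)
glob_data, glob_trips = global_query(values)
```

Both calls return the same per-processor data; only the number of client round trips differs (48 versus 1 in this model).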
Autoinspection
● The programmer should not manually handle all the processors
  – Unsupervised execution
  – Notification to the user from interesting processors
    ● Breakpoints
    ● Abort / signals
    ● Memory corruption
    ● Assertion failure
Python Scripting
● Upload a script to perform checking on the correctness of data structures when needed

    length = charm.getValue(self, MyArray, len)
    arr = charm.getValue(self, MyArray, data)
    for i in range(0, length):
        value = charm.getArray(arr, double, i)
        if (value > 10 or value < -10):
            print "Error: value ", i, " = ", value
            return i

● Select on which entry points the script should run
● Access program's data (circumvent the lack of reflection)
● Suspend execution if a value is returned
* F. Gioachin, L.V. Kalé: "Dynamic High-Level Scripting in Parallel Applications". In Proceedings of the 23rd IEEE International Parallel and Distributed Processing Symposium (IPDPS 2009)
Can you debug on a big machine?
● Feasibility
  – How long do you have to wait before your job starts?
    ● Are you available when your job starts?
  – Is the machine even available?
● Cost
  – How many allocation units are you using to do your debugging?
Virtualized Emulation
● Use emulation techniques to provide virtual processors to display to the user
  – Different scenario from performance analysis
    ● Cannot assume correctness of program
  – Debugger needs to communicate with application
  – Single address space
* F. Gioachin, G. Zheng, L.V. Kalé: "Debugging Large Scale Applications in a Virtualized Environment". PPL Technical Report, April 2010
Virtualized Charm++
● Converse on top of BigSim
  – Processors become virtual processors
  – Two Converse layers
    ● Virtualized
    ● Original
[Diagram: software stack, top to bottom: AMPI; Charm++; BigSim Converse; BigSim Emulator; Converse; MPI, Infiniband, Myrinet, UDP/TCP, LAPI, etc.]
BigSim Emulator
[Diagram labels: two Virtual Processors, each with a Worker Thread and a Communication Thread; Message Queue; Converse Main Thread]
Converse Client-Server under Emulated Environment
[Diagram labels: two Virtual Processors (VP 87 and VP 513), each with a Worker Thread and a Communication Thread; Real PE 12; Message Queue; Converse Main Thread; CCS Host]
Usage: Starting
Usage: Debugging
Performance: Jacobi (on NCSA's BluePrint)
● User thinks for one minute about what to do:
  – 8 processors
    ● 86 sec.
    ● ~0.2 SU
  – 1024 procs
    ● 60.5 sec.
    ● ~17 SU
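The SU figures on this slide are consistent with charging one service unit per processor-hour. That charging rule is an assumption here (the slide does not define SU), and `service_units` is a hypothetical helper used only to check the arithmetic.

```python
def service_units(processors, seconds):
    """Cost of an allocation, assuming 1 SU = 1 processor-hour."""
    return processors * seconds / 3600.0


small = service_units(8, 86)        # 8 processors held for 86 seconds
large = service_units(1024, 60.5)   # 1024 processors held for 60.5 seconds

print(f"8 processors:    {small:.2f} SU")
print(f"1024 processors: {large:.2f} SU")
```

Under that assumption the two sessions cost about 0.19 SU and about 17.2 SU, matching the rounded values on the slide: roughly a 90-fold difference for the same minute of human thinking time.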
Restrictions
● Small memory footprint
  – Many processors need to fit into a single physical processor
● Session should be constrained by human speed
  – Allocation idle most of the time waiting for user input
  – Bad for computation intensive applications
Separation of Virtual Entities
● Single address space shared by different entities
  – One entity can write in memory of another entity
● Protect memory such that spurious writes can be detected
● Exploit the scheduler in message driven systems
[Flowchart steps: pick message; reset memory protection; user code: process message; check memory corruption; decision: has corruption occurred? (Yes / No)]
* F. Gioachin, L.V. Kalé: "Memory Tagging in Charm++" in Proceedings of the 6th Workshop on Parallel and Distributed Systems: Testing, Analysis, and Debugging (PADTAD '08)
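The scheduler-driven check can be pictured with a small simulation. This is a sketch under assumptions, not the Charm++ mechanism: a Python list stands in for the shared address space, an `owner_of` table stands in for memory tags, and corruption is found by comparing a snapshot of foreign cells taken before the handler runs. The class and handler names are hypothetical.

```python
class CheckingScheduler:
    """Toy message-driven scheduler that detects writes to foreign memory."""

    def __init__(self, memory, owner_of):
        self.memory = memory        # flat list standing in for the address space
        self.owner_of = owner_of    # owner_of[i] is the entity that owns cell i

    def deliver(self, entity, handler, payload):
        # Before user code runs: remember every cell the entity does NOT own.
        foreign = {i: v for i, v in enumerate(self.memory)
                   if self.owner_of[i] != entity}
        # User code: process the message (it may write anywhere in memory).
        handler(self.memory, payload)
        # After user code: any changed foreign cell is a spurious write.
        return [i for i, v in foreign.items() if self.memory[i] != v]


def good_handler(memory, payload):
    memory[1] = payload             # cell 1 belongs to entity "A"


def buggy_handler(memory, payload):
    memory[5] = payload             # cell 5 belongs to entity "B"


sched = CheckingScheduler([0] * 8, ["A"] * 4 + ["B"] * 4)
clean = sched.deliver("A", good_handler, 7)
corrupted = sched.deliver("A", buggy_handler, 7)
```

Because control returns to the scheduler after every message, the check runs at a natural boundary and the offending entry point is known as soon as a foreign cell changes.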
Do we need all the processors?
● The problem manifests itself on a single processor
  – If more than one, they are equivalent
● The cause can span multiple processors (causally related)
  – The subset is generally much smaller than the whole system
● Select the interesting processors and ignore the others
Fighting non-determinism
● Record all data processed by each processor
  – Huge volume of data stored
  – High interference with application
    ● Likely the bug will not appear...
  – Need to run non-optimized code
● Record only message ordering
  – Based on piecewise deterministic assumption
  – Must re-execute using the same machine
Three-step Procedure for Processor Extraction
● Use message ordering to guarantee determinism
● Can execute in the virtualized environment
[Flowchart:
  Step 1: Execute program recording message ordering (minimize perturbation: few bytes per message). Has bug appeared? If no, repeat.
  Step 2: User selects processors to record; replay application with detailed recording enabled.
  Step 3: Replay selected processors as stand-alone. Is problem solved? If yes, done; if no, iterate for incremental extraction.]
* F. Gioachin, G. Zheng, L.V. Kalé: "Robust Record-Replay with Processor Extraction" in Proceedings of the Workshop on Parallel and Distributed Systems: Testing, Analysis, and Debugging (PADTAD – VIII), 2010
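Recording only message ordering and enforcing it on replay can be sketched as follows. This is a minimal illustration under assumptions: messages are modeled as `(sender, sequence number, payload)` tuples, and the `record` and `replay` helpers are hypothetical names, not the Charm++ record-replay interface.

```python
from collections import deque


def record(arrivals):
    """Log only the identity of each delivered message, in delivery order."""
    return [(sender, seq) for sender, seq, _payload in arrivals]


def replay(arrivals, log):
    """Deliver messages in the logged order, buffering early arrivals."""
    expected = deque(log)
    pending = {}                    # messages that arrived before their turn
    delivered = []
    for sender, seq, payload in arrivals:
        pending[(sender, seq)] = payload
        # Deliver as many messages as the log allows at this point.
        while expected and expected[0] in pending:
            key = expected.popleft()
            delivered.append((key, pending.pop(key)))
    return delivered


first_run = [(1, 0, "a"), (2, 0, "b"), (1, 1, "c")]
log = record(first_run)

# In the second run the network delivers the same messages in another order.
second_run = [(2, 0, "b"), (1, 1, "c"), (1, 0, "a")]
replayed = replay(second_run, log)
```

The log holds a few bytes per message rather than the payloads, which is what keeps the perturbation of the first run small; the payloads are regenerated by the application itself during replay.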
What if the piecewise deterministic assumption is not met?
● Make sure to detect it, and notify the user
● If all messages are identical, then we can assume the non-determinism was captured
● Methods to detect failure:
  – Message size and destination
  – Checksum of the whole message (XOR, CRC32)
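The detection methods listed above can be combined into a per-message fingerprint that is stored during recording and compared during replay. This is a sketch under assumptions: the `fingerprint` layout and helper names are hypothetical; `zlib.crc32` is the standard-library CRC32.

```python
import zlib
from functools import reduce


def xor_checksum(data: bytes) -> int:
    """XOR of all bytes: cheap, but blind to reordered bytes."""
    return reduce(lambda acc, b: acc ^ b, data, 0)


def fingerprint(message: bytes, destination: int):
    """Size, destination, and both checksums of a message."""
    return (len(message), destination,
            xor_checksum(message), zlib.crc32(message))


def matches_recording(recorded, message: bytes, destination: int) -> bool:
    """True if the replayed message is indistinguishable from the recorded one."""
    return recorded == fingerprint(message, destination)


recorded = fingerprint(b"hello", 3)
same = matches_recording(recorded, b"hello", 3)         # identical message
wrong_dest = matches_recording(recorded, b"hello", 4)   # destination differs
wrong_data = matches_recording(recorded, b"hellp", 3)   # one byte differs
```

XOR alone would accept `b"ab"` in place of `b"ba"`, since swapping bytes does not change the XOR; CRC32 distinguishes the two, at a higher cost per message.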
Computing Checksums
● Checksum considers memory as raw data, ignores what it contains
  – Pointers
  – Garbage
    ● Uninitialized fields
    ● Compiler padding
● Use Charm++ memory allocator
  – Intercept calls to malloc and pre-fill memory
[Figure: memory layout of a structure with int, short, and double fields]
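Why pre-filling matters can be shown with `ctypes`, which exposes the raw bytes of a C-style structure. This is a sketch under assumptions: the `Particle` layout is hypothetical, and a `bytearray` filled with arbitrary bytes stands in for memory returned by an allocator that does not clear it. The `scratch` field is deliberately left unset to play the role of an uninitialized field.

```python
import ctypes
import zlib


class Particle(ctypes.Structure):
    # Hypothetical message layout; padding between fields is platform-dependent.
    _fields_ = [("id", ctypes.c_int),
                ("mass", ctypes.c_double),
                ("scratch", ctypes.c_int)]


SIZE = ctypes.sizeof(Particle)


def build_message(raw: bytearray, prefill: bool) -> bytes:
    """Fill in a Particle inside `raw` and return the raw bytes to be sent."""
    if prefill:
        raw[:] = bytes(len(raw))    # what a pre-filling allocator would do
    particle = Particle.from_buffer(raw)
    particle.id = 7
    particle.mass = 1.5
    # particle.scratch is intentionally left uninitialized
    return bytes(raw)


# Two runs receive memory with different leftover contents.
run1 = build_message(bytearray(b"\xAA" * SIZE), prefill=False)
run2 = build_message(bytearray(b"\x55" * SIZE), prefill=False)

# With pre-filled memory the same logical message has the same raw bytes.
run1_filled = build_message(bytearray(b"\xAA" * SIZE), prefill=True)
run2_filled = build_message(bytearray(b"\x55" * SIZE), prefill=True)
```

Without pre-filling, `run1` and `run2` differ in the unset field (and in any padding), so a raw-byte checksum would report a mismatch even though the program behaved identically; with pre-filling, the bytes and therefore `zlib.crc32` of the two messages agree.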
Message Order Recording Performance (on NCSA's Abe)
kNeighbor
ChaNGa (dwf1.2048 on NCSA's BluePrint)
Replaying the Application
Replaying under BigSim Emulation: NAMD