SLIDE 1

Debugging Large Scale Parallel Applications

Filippo Gioachin

Parallel Programming Laboratory, Department of Computer Science, University of Illinois at Urbana-Champaign

SLIDE 2

29 April 2010 Filippo Gioachin - UIUC 2

Outline

  • Introduction

– Motivations

  • Debugging on Large Machines

– Scalability

  • Using Fewer Resources

– Virtualized Debugging
– Processor Extraction

  • Summary
SLIDE 3

Motivations

  • Debugging is a fundamental part of software development

  • Parallel programs have all the sequential bugs:

– Memory corruption
– Incorrect results
– ...

SLIDE 4

Motivations (2)

  • Parallel programs have other bugs:

– Data races / multicore (heavily studied in the literature)
– Communication mistakes
– Synchronization mistakes / message races

  • To complicate things further:

– Non-determinism
– Problems may show up only at large scale

SLIDE 5

Problems at Large Scale

  • Problems may not appear at small scale:

– Races between messages

  • Latencies in the underlying hardware

– Incorrect messaging
– Data decomposition
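The message races named above can be made concrete in a few lines: the same two messages, delivered in the two possible orders, leave the receiver in two different states. This is a toy illustration of the bug class, not CharmDebug code.

```python
import itertools

def deliver(messages):
    """Apply messages, in the given order, to a fresh receiver state."""
    state = 0
    for op, value in messages:
        if op == "set":
            state = value
        elif op == "add":
            state += value
    return state

msgs = [("set", 10), ("add", 5)]

# Unless the program enforces an order, every network ordering is legal:
results = {deliver(list(p)) for p in itertools.permutations(msgs)}
print(results)  # two distinct final states -> a message race
```

At small scale the "lucky" ordering may occur on every run; higher latencies at large scale expose the other one.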

SLIDE 6

Problems at Large Scale (2)

  • Infeasible:

– Debugger needs to handle many processors
– Human can be overwhelmed by information
– Long waiting time in the queue
– Machine not available

  • Expensive:

– Large machine allocations consume a lot of computational resources

SLIDE 7

CharmDebug Overview

[Architecture diagram: the CharmDebug client connects through an SSH tunnel to a login node, which launches the job on the parallel machine over SSH; a single CCS connection carries the debugger traffic, with GDB attached on the parallel machine.]

SLIDE 8

Converse Client-Server Scalability

  • CCS connects to the application as a whole

– Forwards requests to single processors
– Gathers information from the whole application

  • Uses the same communication infrastructure as the application

SLIDE 9

Debugging on Large Systems

  • Attaching to a running application

– 48-processor cluster

  • 28 ms with 48 point-to-point queries
  • 2 ms with a single global query
  • Example: memory statistics collection

– 12 to 20 ms up to 4k processors
– Counted on the client debugger

* F. Gioachin, C.W. Lee, L.V. Kalé: "Scalable Interaction with Parallel Applications", in Proceedings of TeraGrid'09
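The gap between 48 point-to-point queries and a single global query comes from reusing the application's own spanning tree: the client pays one round trip while the reduction completes in logarithmically many hops. A back-of-the-envelope cost model (the fanout and the per-hop latency here are illustrative assumptions, not measurements from the talk):

```python
import math

def star_cost(n_procs, rtt_ms):
    """Client queries each processor individually: one RTT per processor."""
    return n_procs * rtt_ms

def tree_cost(n_procs, hop_ms, fanout=4):
    """One query enters at the root and replies are combined up a
    spanning tree of the given fanout: roughly 2 * depth hops."""
    depth = math.ceil(math.log(n_procs, fanout)) if n_procs > 1 else 0
    return 2 * depth * hop_ms

# Hypothetical 0.5 ms per message on 48 processors:
print(star_cost(48, 0.5))  # 24.0 ms -- grows linearly
print(tree_cost(48, 0.5))  # 3.0 ms  -- grows logarithmically
```

The model also explains why the global query keeps working at 4k processors: only the tree depth grows.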

SLIDE 10

Autoinspection

  • The programmer should not manually handle all the processors

– Unsupervised execution
– Notification to the user from interesting processors

  • Breakpoints
  • Abort / signals
  • Memory corruption
  • Assertion failure
SLIDE 11

Python Scripting

length = charm.getValue(self, MyArray, len)
arr = charm.getValue(self, MyArray, data)
for i in range(0, length):
    value = charm.getArray(arr, double, i)
    if (value > 10 or value < -10):
        print "Error: value ", i, " = ", value
        return i

Annotations on the slide: select the entry points on which the script should run; suspend execution if a value is returned; access the program's data (circumventing the lack of reflection).

  • Upload a script to perform checks on the correctness of data structures when needed

* F. Gioachin, L.V. Kalé: "Dynamic High-Level Scripting in Parallel Applications". In Proceedings of the 23rd IEEE International Parallel and Distributed Processing Symposium (IPDPS 2009)

SLIDE 12

Can you debug on a big machine?

  • Feasibility

– How long do you have to wait before your job starts?

  • Are you available when your job starts?

– Is the machine even available?

  • Cost

– How many allocation units are you using to do your debugging?

SLIDE 13

Virtualized Emulation

  • Use emulation techniques to provide virtual processors to display to the user

– Different scenario from performance analysis

  • Cannot assume correctness of the program

– Debugger needs to communicate with the application
– Single address space

* F. Gioachin, G. Zheng, L.V. Kalé: "Debugging Large Scale Applications in a Virtualized Environment". PPL Technical Report, April 2010

SLIDE 14

Virtualized Charm++

  • Converse on top of BigSim

– Processors become virtual processors
– Two Converse layers:

  • Virtualized
  • Original

[Stack diagram: AMPI and Charm++ on top of Converse; underneath, either the native layers (MPI, Infiniband, Myrinet, UDP/TCP, LAPI, etc.) or the BigSim Emulator with its own BigSim Converse layer.]

SLIDE 15

[Diagram: the BigSim Emulator hosts multiple virtual processors, each with a worker thread and a communication thread; a Converse main thread serves the shared message queue.]

SLIDE 16

Converse Client-Server under Emulated Environment

[Diagram: the CCS host connects to a real PE (e.g. PE 12), which hosts virtual processors (e.g. VP 513, VP 87); each virtual processor has worker and communication threads fed from the Converse main thread's message queue.]

SLIDE 17

Usage: Starting

SLIDE 18

Usage: Debugging

SLIDE 19

Performance: Jacobi (on NCSA's BluePrint)

  • User thinks for one minute about what to do:

– 8 processors: 86 sec., ~0.2 SU
– 1024 processors: 60.5 sec., ~17 SU
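The SU figures follow directly from allocation size times wall-clock time. A quick check of the slide's numbers, assuming the usual convention that 1 SU is one processor-hour:

```python
def service_units(procs, seconds):
    """Service units consumed: processor-hours held by the allocation."""
    return procs * seconds / 3600.0

print(round(service_units(8, 86), 2))       # ~0.19 SU
print(round(service_units(1024, 60.5), 1))  # ~17.2 SU
```

One idle minute of thinking on 1024 processors thus costs roughly a hundred times more than the same minute on 8 virtualized ones.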
SLIDE 20

Restrictions

  • Small memory footprint

– Many virtual processors need to fit into a single physical processor

  • Session should be constrained by human speed

– Allocation is idle most of the time, waiting for user input
– Bad for computation-intensive applications

SLIDE 21

Separation of Virtual Entities

  • Single address space shared by different entities

– One entity can write into the memory of another entity

  • Protect memory so that spurious writes can be detected
  • Exploit the scheduler in message-driven systems

[Flowchart: pick message → reset memory protection → user code: process message → check for memory corruption → has corruption occurred? (yes / no) → pick next message]

* F. Gioachin, L.V. Kalé: "Memory Tagging in Charm++", in Proceedings of the 6th Workshop on Parallel and Distributed Systems: Testing, Analysis, and Debugging (PADTAD '08)
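The scheduler loop above can be mimicked outside Charm++: before delivering each message, snapshot a checksum of every other entity's memory; after the handler returns, any changed checksum flags a cross-entity write. A toy sketch, with entity memory modeled as bytearrays (the real system uses allocator tagging and memory protection, per the cited paper; this only mimics the scheduler-driven check):

```python
import zlib

class Scheduler:
    def __init__(self, memories):
        self.mem = memories  # entity id -> bytearray

    def deliver(self, entity, handler):
        # Snapshot checksums of all *other* entities' memory.
        before = {e: zlib.crc32(bytes(m))
                  for e, m in self.mem.items() if e != entity}
        handler(self.mem[entity])  # user code: process message
        # Report every foreign region whose contents changed.
        return [e for e, m in self.mem.items()
                if e != entity and zlib.crc32(bytes(m)) != before[e]]

sched = Scheduler({"A": bytearray(8), "B": bytearray(8)})

def good(mem):
    mem[0] = 1               # writes only its own memory

def bad(mem):
    sched.mem["B"][0] = 99   # spurious write into entity B's memory

print(sched.deliver("A", good))  # [] -> no corruption
print(sched.deliver("A", bad))   # ['B'] -> corruption detected
```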

SLIDE 22

Do we need all the processors?

  • The problem manifests itself on a single processor

– If more than one, they are equivalent

  • The cause can span multiple processors (causally related)

– The subset is generally much smaller than the whole system

  • Select the interesting processors and ignore the others

SLIDE 23

Fighting non-determinism

  • Record all data processed by each processor

– Huge volume of data stored
– High interference with the application

  • Likely the bug will not appear...

– Need to run non-optimized code

  • Record only message ordering

– Based on the piecewise deterministic assumption
– Must re-execute using the same machine
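Recording only message ordering can be sketched as a two-phase queue: the first run logs each delivered message's (source, sequence-number) tag, a few bytes per message; the replay run holds messages back until the logged tag comes up. A toy sketch under the piecewise deterministic assumption (the function names and record format are illustrative, not the CharmDebug implementation):

```python
def record_run(arrivals):
    """First execution: deliver in arrival order, log only the tags."""
    return [(src, seq) for src, seq, _ in arrivals]

def replay_run(arrivals, log):
    """Re-execution: arrivals may come in a different order; deliver
    each message only when its tag matches the next logged entry."""
    pending = {(src, seq): payload for src, seq, payload in arrivals}
    return [pending.pop(tag) for tag in log]

first = [(0, 1, "a"), (1, 1, "b"), (0, 2, "c")]
log = record_run(first)

# Same messages, different network order on the second run:
second = [(1, 1, "b"), (0, 2, "c"), (0, 1, "a")]
print(replay_run(second, log))  # ['a', 'b', 'c'] -- original order restored
```

Because only tags are written, the perturbation of the first run stays small, which is exactly why the bug is still likely to appear.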

SLIDE 24

Three-step Procedure for Processor Extraction

[Flowchart:
Step 1 – Execute the program recording message ordering; once the bug has appeared, select the processors to record.
Step 2 – Replay the application with detailed recording enabled on the selected processors.
Step 3 – Replay the selected processors stand-alone; if the problem is not solved, iterate with a new selection; otherwise, done.]

  • Minimize perturbation (a few bytes per message)
  • Iterate for incremental extraction
  • Use message ordering to guarantee determinism
  • Can execute in the virtualized environment

* F. Gioachin, G. Zheng, L.V. Kalé: "Robust Record-Replay with Processor Extraction", in Proceedings of the Workshop on Parallel and Distributed Systems: Testing, Analysis, and Debugging (PADTAD – VIII), 2010

SLIDE 25

What if the piecewise deterministic assumption is not met?

  • Make sure to detect it, and notify the user

– If all messages are identical, then we can assume the non-determinism was captured

  • Methods to detect failure:

– Message size and destination
– Checksum of the whole message (XOR, CRC32)
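Detecting a violated assumption reduces to comparing each replayed message against its recorded fingerprint. A sketch of the two checks named above, size/destination plus a content checksum (CRC32 via zlib; the record format is invented for illustration):

```python
import zlib

def fingerprint(dest, payload):
    """Cheap per-message record: destination, size, content checksum."""
    return (dest, len(payload), zlib.crc32(payload))

def replay_matches(recorded, dest, payload):
    """True if the replayed message is identical to the recorded one;
    False signals that the piecewise deterministic assumption broke."""
    return recorded == fingerprint(dest, payload)

rec = fingerprint(7, b"particle data v1")
print(replay_matches(rec, 7, b"particle data v1"))  # True
print(replay_matches(rec, 7, b"particle data v2"))  # False: replay diverged
```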

SLIDE 26

Computing Checksums

  • The checksum considers memory as raw data, ignoring what it contains:

– Pointers
– Garbage

  • Uninitialized fields
  • Compiler padding
  • Use the Charm++ memory allocator

– Intercept calls to malloc and pre-fill memory

[Diagram: example message layout — double, int, short, double, int, double, double — with compiler padding between fields]
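Compiler padding is easy to observe from Python with ctypes: a struct mixing doubles, ints, and shorts typically occupies more bytes than its fields sum to, and those pad bytes are what the allocator must pre-fill so that two runs checksum identically. (The field mix echoes the slide's layout sketch; exact sizes are platform-dependent.)

```python
import ctypes

class Msg(ctypes.Structure):
    # double, int, short: a slice of the slide's layout sketch
    _fields_ = [("a", ctypes.c_double),
                ("b", ctypes.c_int),
                ("c", ctypes.c_short)]

payload = (ctypes.sizeof(ctypes.c_double)
           + ctypes.sizeof(ctypes.c_int)
           + ctypes.sizeof(ctypes.c_short))

# On most platforms the struct is larger than its fields due to alignment:
print(payload, ctypes.sizeof(Msg))

# Pre-filling the raw memory (as an intercepted malloc would) makes
# the pad bytes deterministic before any checksum is taken:
raw = bytearray(ctypes.sizeof(Msg))  # zero-filled buffer
assert all(b == 0 for b in raw)
```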

SLIDE 27

Message Order Recording Performance (on NCSA's Abe)

SLIDE 28

kNeighbor

SLIDE 29

ChaNGa

(dwf1.2048 on NCSA's BluePrint)

SLIDE 30

Replaying the Application

SLIDE 31

Replaying under BigSim Emulation: NAMD

SLIDE 32

Amount of Data Saved

Number of Processors              128    256    512   1024
Record, per-processor            0.87   0.67   0.54   0.44
Record, total                     112    173    279    453
Record+checksum, per-processor   1.49   1.14   0.92   0.75
Record+checksum, total            190    292    473    765
Detailed record, per-processor    111     79     59     47

ChaNGa dwf1.2048, numbers in MB

SLIDE 33

Debugging Case Study

  • Message race during particle exchange

– Fixed with tedious print statements (while trying to avoid hiding the bug...)

../charmdebug +p16 ../ChaNGa cube300.param +record +recplay-crc
../charmdebug +p16 ../ChaNGa cube300.param +replay +recplay-crc +record-detail 7
gdb ../ChaNGa
>> run cube300.param +replay-detail 7/16

SLIDE 34

Summary

  • Important for the debugging system to scale to large configurations

  • Resources are expensive and should not be wasted:

– Virtualized Debugging to debug large-scale applications on small clusters
– Processor Extraction to capture the non-determinism of parallel applications

  • Must not interfere too much with the application timing
SLIDE 35

Future Extensions

  • Shared memory compliance
  • Race detector

– Automated testing of message delivery to discover message races

  • Replay in isolation of single virtual entities

– Conditions of validity