SLIDE 1

Debugging Large Scale Parallel Applications

Filippo Gioachin

Parallel Programming Laboratory, Department of Computer Science, University of Illinois at Urbana-Champaign

SLIDE 2

29 April 2010 Filippo Gioachin - UIUC 2

Outline

  • Introduction

– Motivations

  • Debugging on Large Machines

– Scalability

  • Using Fewer Resources

– Virtualized Debugging
– Processor Extraction

  • Summary
SLIDE 3

Motivations

  • Debugging is a fundamental part of software development

  • Parallel programs have all the sequential bugs:

– Memory corruption
– Incorrect results
– ...

SLIDE 4

Motivations (2)

  • Parallel programs have other bugs:

– Data races / multicore (heavily studied in the literature)
– Communication mistakes
– Synchronization mistakes / message races

  • To complicate things further:

– Non-determinism
– Problems may show up only at large scale

SLIDE 5

Problems at Large Scale

  • Problems may not appear at small scale:

– Races between messages

  • Latencies in the underlying hardware

– Incorrect messaging
– Data decomposition
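The message races named above can be made concrete in a few lines: the same two messages, delivered in the two possible orders, leave the receiver in two different states. This is a toy illustration of the bug class, not CharmDebug code.

```python
import itertools

def deliver(messages):
    """Apply messages, in the given order, to a fresh receiver state."""
    state = 0
    for op, value in messages:
        if op == "set":
            state = value
        elif op == "add":
            state += value
    return state

msgs = [("set", 10), ("add", 5)]

# Unless the program enforces an order, every network ordering is legal:
results = {deliver(list(p)) for p in itertools.permutations(msgs)}
print(results)  # two distinct final states -> a message race
```

At small scale the "lucky" ordering may occur on every run; higher latencies at large scale expose the other one.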

SLIDE 6

Problems at Large Scale (2)

  • Infeasible:

– Debugger needs to handle many processors
– Human can be overwhelmed by information
– Long waiting time in the queue
– Machine not available

  • Expensive:

– Large machine allocations consume a lot of computational resources

SLIDE 7

CharmDebug Overview

[Architecture diagram: the CharmDebug client connects through an SSH tunnel to a login node, which launches the job on the parallel machine over SSH; a single CCS connection carries the debugger traffic, with GDB attached on the parallel machine.]

SLIDE 8

Converse Client-Server Scalability

  • CCS connects to the application as a whole

– Forwards requests to single processors
– Gathers information from the whole application

  • Uses the same communication infrastructure as the application

SLIDE 9

Debugging on Large Systems

  • Attaching to a running application

– 48-processor cluster

  • 28 ms with 48 point-to-point queries
  • 2 ms with a single global query
  • Example: memory statistics collection

– 12 to 20 ms up to 4k processors
– Counted on the client debugger

* F. Gioachin, C.W. Lee, L.V. Kalé: "Scalable Interaction with Parallel Applications", in Proceedings of TeraGrid'09
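The gap between 48 point-to-point queries and a single global query comes from reusing the application's own spanning tree: the client pays one round trip while the reduction completes in logarithmically many hops. A back-of-the-envelope cost model (the fanout and the per-hop latency here are illustrative assumptions, not measurements from the talk):

```python
import math

def star_cost(n_procs, rtt_ms):
    """Client queries each processor individually: one RTT per processor."""
    return n_procs * rtt_ms

def tree_cost(n_procs, hop_ms, fanout=4):
    """One query enters at the root and replies are combined up a
    spanning tree of the given fanout: roughly 2 * depth hops."""
    depth = math.ceil(math.log(n_procs, fanout)) if n_procs > 1 else 0
    return 2 * depth * hop_ms

# Hypothetical 0.5 ms per message on 48 processors:
print(star_cost(48, 0.5))  # 24.0 ms -- grows linearly
print(tree_cost(48, 0.5))  # 3.0 ms  -- grows logarithmically
```

The model also explains why the global query keeps working at 4k processors: only the tree depth grows.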

SLIDE 10

Autoinspection

  • The programmer should not manually handle all the processors

– Unsupervised execution
– Notification to the user from interesting processors

  • Breakpoints
  • Abort / signals
  • Memory corruption
  • Assertion failure
SLIDE 11

Python Scripting

length = charm.getValue(self, MyArray, len)
arr = charm.getValue(self, MyArray, data)
for i in range(0, length):
    value = charm.getArray(arr, double, i)
    if (value > 10 or value < -10):
        print "Error: value ", i, " = ", value
        return i

Annotations on the slide: select the entry points on which the script should run; suspend execution if a value is returned; access the program's data (circumventing the lack of reflection).

  • Upload a script to perform checks on the correctness of data structures when needed

* F. Gioachin, L.V. Kalé: "Dynamic High-Level Scripting in Parallel Applications". In Proceedings of the 23rd IEEE International Parallel and Distributed Processing Symposium (IPDPS 2009)

SLIDE 12

Can you debug on a big machine?

  • Feasibility

– How long do you have to wait before your job starts?

  • Are you available when your job starts?

– Is the machine even available?

  • Cost

– How many allocation units are you using to do your debugging?

SLIDE 13

Virtualized Emulation

  • Use emulation techniques to provide virtual processors to display to the user

– Different scenario from performance analysis

  • Cannot assume correctness of the program

– Debugger needs to communicate with the application
– Single address space

* F. Gioachin, G. Zheng, L.V. Kalé: "Debugging Large Scale Applications in a Virtualized Environment". PPL Technical Report, April 2010

SLIDE 14

Virtualized Charm++

  • Converse on top of BigSim

– Processors become virtual processors
– Two Converse layers:

  • Virtualized
  • Original

[Stack diagram: AMPI and Charm++ on top of Converse; underneath, either the native layers (MPI, Infiniband, Myrinet, UDP/TCP, LAPI, etc.) or the BigSim Emulator with its own BigSim Converse layer.]

SLIDE 15

[Diagram: the BigSim Emulator hosts multiple virtual processors, each with a worker thread and a communication thread; a Converse main thread serves the shared message queue.]

SLIDE 16

Converse Client-Server under Emulated Environment

[Diagram: the CCS host connects to a real PE (e.g. PE 12), which hosts virtual processors (e.g. VP 513, VP 87); each virtual processor has worker and communication threads fed from the Converse main thread's message queue.]

SLIDE 17

Usage: Starting

SLIDE 18

Usage: Debugging

SLIDE 19

Performance: Jacobi (on NCSA's BluePrint)

  • User thinks for one minute about what to do:

– 8 processors: 86 sec., ~0.2 SU
– 1024 processors: 60.5 sec., ~17 SU
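The SU figures follow directly from allocation size times wall-clock time. A quick check of the slide's numbers, assuming the usual convention that 1 SU is one processor-hour:

```python
def service_units(procs, seconds):
    """Service units consumed: processor-hours held by the allocation."""
    return procs * seconds / 3600.0

print(round(service_units(8, 86), 2))       # ~0.19 SU
print(round(service_units(1024, 60.5), 1))  # ~17.2 SU
```

One idle minute of thinking on 1024 processors thus costs roughly a hundred times more than the same minute on 8 virtualized ones.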
SLIDE 20

Restrictions

  • Small memory footprint

– Many virtual processors need to fit into a single physical processor

  • Session should be constrained by human speed

– Allocation is idle most of the time, waiting for user input
– Bad for computation-intensive applications

SLIDE 21

Separation of Virtual Entities

  • Single address space shared by different entities

– One entity can write into the memory of another entity

  • Protect memory so that spurious writes can be detected
  • Exploit the scheduler in message-driven systems

[Flowchart: pick message → reset memory protection → user code: process message → check for memory corruption → has corruption occurred? (yes / no) → pick next message]

* F. Gioachin, L.V. Kalé: "Memory Tagging in Charm++", in Proceedings of the 6th Workshop on Parallel and Distributed Systems: Testing, Analysis, and Debugging (PADTAD '08)
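The scheduler loop above can be mimicked outside Charm++: before delivering each message, snapshot a checksum of every other entity's memory; after the handler returns, any changed checksum flags a cross-entity write. A toy sketch, with entity memory modeled as bytearrays (the real system uses allocator tagging and memory protection, per the cited paper; this only mimics the scheduler-driven check):

```python
import zlib

class Scheduler:
    def __init__(self, memories):
        self.mem = memories  # entity id -> bytearray

    def deliver(self, entity, handler):
        # Snapshot checksums of all *other* entities' memory.
        before = {e: zlib.crc32(bytes(m))
                  for e, m in self.mem.items() if e != entity}
        handler(self.mem[entity])  # user code: process message
        # Report every foreign region whose contents changed.
        return [e for e, m in self.mem.items()
                if e != entity and zlib.crc32(bytes(m)) != before[e]]

sched = Scheduler({"A": bytearray(8), "B": bytearray(8)})

def good(mem):
    mem[0] = 1               # writes only its own memory

def bad(mem):
    sched.mem["B"][0] = 99   # spurious write into entity B's memory

print(sched.deliver("A", good))  # [] -> no corruption
print(sched.deliver("A", bad))   # ['B'] -> corruption detected
```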

SLIDE 22

Do we need all the processors?

  • The problem manifests itself on a single processor

– If more than one, they are equivalent

  • The cause can span multiple processors (causally related)

– The subset is generally much smaller than the whole system

  • Select the interesting processors and ignore the others

SLIDE 23

Fighting non-determinism

  • Record all data processed by each processor

– Huge volume of data stored
– High interference with the application

  • Likely the bug will not appear...

– Need to run non-optimized code

  • Record only message ordering

– Based on the piecewise deterministic assumption
– Must re-execute using the same machine
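Recording only message ordering can be sketched as a two-phase queue: the first run logs each delivered message's (source, sequence-number) tag, a few bytes per message; the replay run holds messages back until the logged tag comes up. A toy sketch under the piecewise deterministic assumption (the function names and record format are illustrative, not the CharmDebug implementation):

```python
def record_run(arrivals):
    """First execution: deliver in arrival order, log only the tags."""
    return [(src, seq) for src, seq, _ in arrivals]

def replay_run(arrivals, log):
    """Re-execution: arrivals may come in a different order; deliver
    each message only when its tag matches the next logged entry."""
    pending = {(src, seq): payload for src, seq, payload in arrivals}
    return [pending.pop(tag) for tag in log]

first = [(0, 1, "a"), (1, 1, "b"), (0, 2, "c")]
log = record_run(first)

# Same messages, different network order on the second run:
second = [(1, 1, "b"), (0, 2, "c"), (0, 1, "a")]
print(replay_run(second, log))  # ['a', 'b', 'c'] -- original order restored
```

Because only tags are written, the perturbation of the first run stays small, which is exactly why the bug is still likely to appear.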

SLIDE 24

Three-step Procedure for Processor Extraction

[Flowchart:
Step 1 – Execute the program recording message ordering; once the bug has appeared, select the processors to record.
Step 2 – Replay the application with detailed recording enabled on the selected processors.
Step 3 – Replay the selected processors stand-alone; if the problem is not solved, iterate with a new selection; otherwise, done.]

  • Minimize perturbation (a few bytes per message)
  • Iterate for incremental extraction
  • Use message ordering to guarantee determinism
  • Can execute in the virtualized environment

* F. Gioachin, G. Zheng, L.V. Kalé: "Robust Record-Replay with Processor Extraction", in Proceedings of the Workshop on Parallel and Distributed Systems: Testing, Analysis, and Debugging (PADTAD – VIII), 2010

SLIDE 25

What if the piecewise deterministic assumption is not met?

  • Make sure to detect it, and notify the user

– If all messages are identical, then we can assume the non-determinism was captured

  • Methods to detect failure:

– Message size and destination
– Checksum of the whole message (XOR, CRC32)
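Detecting a violated assumption reduces to comparing each replayed message against its recorded fingerprint. A sketch of the two checks named above, size/destination plus a content checksum (CRC32 via zlib; the record format is invented for illustration):

```python
import zlib

def fingerprint(dest, payload):
    """Cheap per-message record: destination, size, content checksum."""
    return (dest, len(payload), zlib.crc32(payload))

def replay_matches(recorded, dest, payload):
    """True if the replayed message is identical to the recorded one;
    False signals that the piecewise deterministic assumption broke."""
    return recorded == fingerprint(dest, payload)

rec = fingerprint(7, b"particle data v1")
print(replay_matches(rec, 7, b"particle data v1"))  # True
print(replay_matches(rec, 7, b"particle data v2"))  # False: replay diverged
```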

SLIDE 26

Computing Checksums

  • The checksum considers memory as raw data, ignoring what it contains:

– Pointers
– Garbage

  • Uninitialized fields
  • Compiler padding
  • Use the Charm++ memory allocator

– Intercept calls to malloc and pre-fill memory

[Diagram: example message layout — double, int, short, double, int, double, double — with compiler padding between fields]
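Compiler padding is easy to observe from Python with ctypes: a struct mixing doubles, ints, and shorts typically occupies more bytes than its fields sum to, and those pad bytes are what the allocator must pre-fill so that two runs checksum identically. (The field mix echoes the slide's layout sketch; exact sizes are platform-dependent.)

```python
import ctypes

class Msg(ctypes.Structure):
    # double, int, short: a slice of the slide's layout sketch
    _fields_ = [("a", ctypes.c_double),
                ("b", ctypes.c_int),
                ("c", ctypes.c_short)]

payload = (ctypes.sizeof(ctypes.c_double)
           + ctypes.sizeof(ctypes.c_int)
           + ctypes.sizeof(ctypes.c_short))

# On most platforms the struct is larger than its fields due to alignment:
print(payload, ctypes.sizeof(Msg))

# Pre-filling the raw memory (as an intercepted malloc would) makes
# the pad bytes deterministic before any checksum is taken:
raw = bytearray(ctypes.sizeof(Msg))  # zero-filled buffer
assert all(b == 0 for b in raw)
```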

SLIDE 27

Message Order Recording Performance (on NCSA's Abe)

SLIDE 28

kNeighbor

SLIDE 29

ChaNGa

(dwf1.2048 on NCSA's BluePrint)

SLIDE 30

Replaying the Application

SLIDE 31

Replaying under BigSim Emulation: NAMD

SLIDE 32

Amount of Data Saved

Number of Processors              128    256    512   1024
Record, per-processor            0.87   0.67   0.54   0.44
Record, total                     112    173    279    453
Record+checksum, per-processor   1.49   1.14   0.92   0.75
Record+checksum, total            190    292    473    765
Detailed record, per-processor    111     79     59     47

ChaNGa dwf1.2048, numbers in MB

SLIDE 33

Debugging Case Study

  • Message race during particle exchange

– Fixed with tedious print statements (while trying to avoid hiding the bug...)

../charmdebug +p16 ../ChaNGa cube300.param +record +recplay-crc
../charmdebug +p16 ../ChaNGa cube300.param +replay +recplay-crc +record-detail 7
gdb ../ChaNGa
>> run cube300.param +replay-detail 7/16

SLIDE 34

Summary

  • Important for the debugging system to scale to large configurations

  • Resources are expensive and should not be wasted:

– Virtualized Debugging to debug large-scale applications on small clusters
– Processor Extraction to capture the non-determinism of parallel applications

  • Must not interfere too much with the application timing
SLIDE 35

Future Extensions

  • Shared memory compliance
  • Race detector

– Automated testing of message delivery to discover message races

  • Replay in isolation of single virtual entities

– Conditions of validity